Goal: Calculate the gene correlation matrix for a large subset of the GSE86922 data set without running out of time or memory
Update my python import path to include the stats.py module from the course website
import sys
sys.path.append("/home/mvoorhie/Projects/Courses/PracticalBioinformatics/python/webroot/htdocs/code/")
Load the stats.py module with my example pearson function
import stats
Load some other useful tools
import gzip
from csv import reader, excel_tab
Parse the GSE86922 data set
fp = reader(gzip.open("GSE86922_Brodsky_GEO_processed.txt.gz"), dialect = excel_tab)
header = fp.next()
data = []
annotations = []
for row in fp:
annotations.append(row[:4])
data.append([float(i) for i in row[4:]])
Naive function for calculating the correlation matrix. Note that this doesn't take advantage of the symmetry of the matrix
def dmat(rows):
d = []
for (n,i) in enumerate(rows):
c = []
for j in rows:
c.append(stats.pearson(i,j))
d.append(c)
return d
Test runs with increasing $N$ to get a sense of how the problem scales.
We expect $O(N^2)$ for both time and space.
%%time
d = dmat(data[:10])
%%time
d = dmat(data[:100])
%%time
d = dmat(data[:500])
%%time
d = dmat(data[:1000])
%%time
d = dmat(data[:2000])
Filter for detected transcripts
Here I'm using a list comprehension instead of an explicit for loop. It's just a convenient short hand.
from math import log
f = [i for i in data if(sum(1 for j in i if(j >= log(10)/log(2))) >= 2)]
len(f)
%%time
d = dmat(f[:10])
%%time
d = dmat(f[:100])
%%time
d = dmat(f[:1000])
%%time
d = dmat(f[:2000])
%%time
d = dmat(f)
Note that the full matrix run time is as predicted by quadratic scaling from the first 2k rows
26*25./60
So, we can afford to calculate the matrix for the ~10k expressed transcripts, but this is a bit large for plotting.
Let's do a more aggressive filter for differentially expressed genes.
Filter for 4x change between two samples
f2 = [i for i in f if(max(i)-min(i) >= 2)]
len(f2)
%%time
d = dmat(f2)
This is small enough for an interactive plot, so let's take a look at it
%matplotlib notebook
import matplotlib.pyplot as plt
fig = plt.figure()
plt.imshow(d, interpolation = "none", aspect = "auto")
plt.colorbar()