How can we figure out how to get at the data in our ExpressionProfile class?
Create an instance of the class:
import cdt
data = cdt.ExpressionProfile("supp2data.cdt")
IPython's ? tells us:
data?
?? tells us more, including the source code for the class, which is enough to remind us of the names of the member variables:
data??
So, e.g., we can get at the gene list like this:
data.geneName[:10]
We can use dir to get a list of an object's attributes:
dir(data)
dir() with no arguments lists all top-level objects. For IPython in --pylab mode, this is a very long list. More useful is the IPython magic %who, which lists all top-level objects that we explicitly created.
%who
We can also use help to view an object's docstrings (again, we haven't defined any for ExpressionProfile, so its help output is terse).
help(data)
After adding a docstring to the class in cdt.py, we reload the module and re-create the instance to pick up the change:
reload(cdt)
data = cdt.ExpressionProfile("supp2data.cdt")
help(data)
Next, we redefine ExpressionProfile.num with a @property decorator so that we can attach a docstring to it, then reload and check help again:
reload(cdt)
data = cdt.ExpressionProfile("supp2data.cdt")
help(data)
data.num[5][5:10]
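For reference, the @property pattern looks roughly like the sketch below. The toy class, the _num attribute, and the sample values are stand-ins for illustration, not the actual internals of cdt.py.

class Example(object):
    """Toy stand-in for ExpressionProfile, just to show the @property pattern."""
    def __init__(self):
        # stand-in for the expression matrix parsed from the .cdt file
        self._num = [[0.48, -1.02, 0.17], [0.93, -0.11, 0.26]]

    @property
    def num(self):
        """Expression values, as a list of per-gene lists of floats."""
        return self._num

Because num is now a property, attribute access like data.num[5][5:10] still works unchanged, but help() picks up the docstring.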
At this point, we rolled back the @property change to cdt.py to avoid confusion (too late?).
Load the pearson function from the example on the website:
import stats
reload(stats)
from stats import pearson
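We don't reproduce the website's source here, but a common convention for a Pearson-based distance is one minus the Pearson correlation coefficient, so identical profiles get distance 0 and perfectly anti-correlated profiles get distance 2. A sketch under that assumption (the name pearson_sketch is ours; it assumes non-constant, equal-length vectors):

from math import sqrt

def pearson_sketch(x, y):
    """1 - Pearson correlation of two equal-length vectors (assumed convention)."""
    n = len(x)
    mx = sum(x) / float(n)
    my = sum(y) / float(n)
    # the 1/n normalizations cancel between numerator and denominator
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)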
Our first try at a function to calculate the pairwise correlations.
Rather than defining it explicitly in terms of pearson, we ask for a distance parameter to be passed to the function. We assume that distance is a mapping from two equal-length vectors to a scalar, and that it is symmetric ($D(x,y) = D(y,x)$).
We take advantage of the symmetry to calculate only the upper triangle of the matrix, filling in the lower triangle by copying previously calculated values.
def distmatrix(data, distance):
    """Return the matrix of pairwise distances between all gene profiles."""
    D = []
    for i in xrange(len(data.geneName)):
        row = []
        for j in xrange(len(data.geneName)):
            if j >= i:
                # diagonal and upper triangle: compute the distance
                row.append(distance(data.num[i], data.num[j]))
            else:
                # lower triangle: copy the symmetric entry computed earlier
                row.append(D[j][i])
        D.append(row)
    return D
As defined, we have to remember to call distmatrix with both an ExpressionProfile and a distance function; forgetting the second argument raises a TypeError:
dists = distmatrix(data)
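The fix is to pass the distance function explicitly (though, as timed below, the full calculation takes a couple of minutes):

dists = distmatrix(data, pearson)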
Here, we redefine our function with pearson as the default distance function (to be used if this parameter is not supplied).
We also define a second optional parameter, N. If given, the calculation is run for only the first N genes -- this is useful for quick debugging of our function, and for figuring out its run time as a function of problem size (i.e., its computational complexity).
def distmatrix(data, distance=pearson, N=None):
    """Pairwise distances for the first N genes (all genes if N is None)."""
    D = []
    if N is None:
        N = len(data.geneName)
    for i in xrange(N):
        row = []
        for j in xrange(N):
            if j >= i:
                # diagonal and upper triangle: compute the distance
                row.append(distance(data.num[i], data.num[j]))
            else:
                # lower triangle: copy the symmetric entry computed earlier
                row.append(D[j][i])
        D.append(row)
    return D
Timing our function for the first 10 genes:
%time dist = distmatrix(data, N = 10)
If we supply two parameters without naming the second, Python binds it to the first optional parameter (distance), so here 10 is used as the distance function and the call fails:
dist = distmatrix(data, 10)
Supplying all three parameters works (they're in the right order, so we don't need to be explicit about which is which):
%time dist = distmatrix(data, pearson, 10)
Scaling up to the first 100 genes:
%time dist = distmatrix(data, N = 100)
A quick redefinition of our function to handle N larger than the number of genes:
def distmatrix(data, distance=pearson, N=None):
    """Pairwise distances for the first N genes (all genes if N is None)."""
    D = []
    if N is None:
        N = len(data.geneName)
    else:
        # don't let N exceed the number of genes we actually have
        N = min(N, len(data.geneName))
    for i in xrange(N):
        row = []
        for j in xrange(N):
            if j >= i:
                # diagonal and upper triangle: compute the distance
                row.append(distance(data.num[i], data.num[j]))
            else:
                # lower triangle: copy the symmetric entry computed earlier
                row.append(D[j][i])
        D.append(row)
    return D
Scaling up to 1000 genes:
%time dist = distmatrix(data, N = 1000)
And the full calculation:
%time dist = distmatrix(data)
--> run time of ~2 minutes on a single core of a hyper-threaded 2.5 GHz dual-core i5-3210M processor running in 64-bit mode.
For a similar architecture (e.g., another i5), single-core speed should depend linearly on the clock speed (e.g., a 5 GHz processor would run twice as fast).
(Without explicitly setting up our problem for parallel processing, additional processor cores give us no speed-up.)
For other architectures, run times are problem dependent (in this case, a floating point benchmark would be a good predictor of the relative run time).
See if you can figure out the time scaling for your computer.
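For example, a quick empirical check of the scaling might look like the sketch below (it assumes data and distmatrix are defined as above, and uses wall-clock time, so expect some noise at small N):

import time
from math import log10

times = {}
for N in (10, 100, 1000):
    t0 = time.time()
    distmatrix(data, N=N)
    times[N] = time.time() - t0
    print N, times[N]

# With ~N^2/2 distance calls dominating, each 10x increase in N should
# multiply the run time by ~100, i.e., an estimated exponent near 2:
print "estimated exponent:", log10(times[1000] / times[100])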