Goal: Use numpy to calculate the correlation matrix for several filtered versions of the example data set, then use Biopython to cluster the final filtered matrix.
%matplotlib notebook
import matplotlib.pyplot as plt
import gzip
from csv import reader, excel_tab
import numpy as np
fp = reader(gzip.open("GSE86922_Brodsky_GEO_processed.txt.gz", "rt"), dialect = excel_tab)
header = next(fp)
data = []
annotations = []
for row in fp:
    annotations.append(row[:4])
    data.append([float(i) for i in row[4:]])
# This is new -- dropping our last reference to the reader lets Python close the underlying file
del fp
Convert our list of lists to a numpy array. This gives us access to fast numpy matrix functions.
d = np.array(data)
d.shape
We'll plot our matrix as we transform and filter it.
Here is the original matrix:
fig = plt.figure()
plt.imshow(d, interpolation = "none", aspect = "auto")
plt.colorbar()
In the heatmap above, note that most of the transcripts are blue because they are not expressed. Also note that the columns look mostly the same: most genes have low differential expression relative to their basal expression level, so mean abundance is much easier to see than the differences between samples.
Mean center the rows. Since each row is a gene, this puts samples with "typical" expression at 0, samples where the gene is upregulated at positive values, and samples where the gene is downregulated at negative values.
We use axis=1 in np.mean to indicate that it should iterate over columns (dimension 1), giving row means. axis=0 would iterate over rows (dimension 0), giving column means.
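To see the convention on a toy example (the values are arbitrary):
toy = np.array([[1., 2., 3.],
                [4., 5., 6.]])
np.mean(toy, axis = 1)  # row means: array([ 2.,  5.])
np.mean(toy, axis = 0)  # column means: array([ 2.5,  3.5,  4.5])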
mean = np.mean(d, axis = 1)
mean.shape
The result of mean is a 1-dimensional array, which is not compatible with our 2-dimensional matrix. Therefore, we reshape it to 2 dimensions (a column vector).
mean = mean.reshape(mean.shape[0],1)
mean.shape
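As an aside, np.mean can keep the reduced axis directly with keepdims, collapsing the two steps above into one:
mean = np.mean(d, axis = 1, keepdims = True)  # shape (n, 1) instead of (n,)
mean.shape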
Now we have compatible arrays for subtracting the row means:
centered = d - mean
fig = plt.figure()
plt.imshow(centered, interpolation = "none", aspect = "auto")
plt.colorbar()
In the previous heatmap, mean abundance was the dominant signal. This is specifically what we removed by subtraction, so our new heatmap emphasizes differential expression among samples. Most genes are not differentially expressed, so this looks flat.
Now we scale each gene row to a unit vector by dividing by its length.
norm = np.linalg.norm(centered, axis = 1)
norm = norm.reshape(norm.shape[0], 1)
scaled = centered/norm
fig = plt.figure()
plt.imshow(scaled, interpolation = "none", aspect = "auto")
plt.colorbar()
This increases the contrast, as all rows are now on the same scale (-1 to 1). Because most rows are noise for transcripts that aren't actually expressed, this is not very informative.
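One caveat with this normalization: a transcript with identical values in every sample centers to all zeros, and dividing by its zero norm produces NaNs. We can count how many rows are affected:
np.isnan(scaled).any(axis = 1).sum()  # number of zero-variance rows turned into NaNs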
Since the rows of our matrix are now centered unit vectors, multiplying the matrix by its transpose (equivalent to taking the dot product of all pairs of row vectors) yields the correlation matrix:
corr = np.dot(scaled, scaled.T)
Because numpy tries to allocate enough memory for the full result before starting, we find out right away that this calculation is too large to do in core memory.
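A quick back-of-envelope calculation shows why -- the result is an n-by-n float64 matrix, so it needs 8*n**2 bytes:
n = d.shape[0]
print(n, "rows ->", 8. * n**2 / 1e9, "GB for the full correlation matrix")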
Let's filter for actually detected transcripts:
We will create a text matrix of annotations parallel to the data matrix. By carrying both matrices through our filtering steps, we can track which transcripts correspond to which rows in our final filtered data matrix.
anno = np.array([[i[0],i[3]] for i in annotations])
anno[0]
Threshold at 10 counts per million (CPM), as in the paper. The data are log2(CPM), so the cutoff is log2(10):
thresh = np.log(10)/np.log(2)
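Equivalently, numpy has a base-2 logarithm built in:
thresh = np.log2(10)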
numpy allows a very R-like compact syntax for filtering. The logic here is: d >= thresh gives a boolean matrix, summing it along axis 1 counts the samples at or above threshold for each transcript, and the outer comparison keeps transcripts detected in at least two samples.
x = (np.sum(d >= thresh, axis = 1) >= 2)
f = d[x,:]
fa = anno[x,:]
f.shape, fa.shape
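To make the mask mechanics concrete, here is the same idiom on a toy array (the values are arbitrary):
toy = np.array([[5., 6., 1.],
                [1., 2., 1.]])
mask = np.sum(toy >= 4., axis = 1) >= 2  # rows with at least two values >= 4
toy[mask, :]  # keeps only the first row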
fig = plt.figure()
plt.imshow(f, interpolation = "none", aspect = "auto")
plt.colorbar()
Having cut down to only the observed transcripts (i.e., the "real" signal), we see that the levels of most transcripts are independent of genotype; they have a consistent level within the unstimulated (first 6) and LPS-treated (second 6) conditions.
We will mean center and scale the rows as before:
mean = np.mean(f, axis = 1)
mean = mean.reshape(mean.shape[0],1)
centered = f - mean
fig = plt.figure()
plt.imshow(centered, interpolation = "none", aspect = "auto")
plt.colorbar()
Having removed the mean abundance signal, we see a bias toward "up in unstimulated relative to LPS".
norm = np.linalg.norm(centered, axis = 1)
norm = norm.reshape(norm.shape[0], 1)
scaled = centered/norm
fig = plt.figure()
plt.imshow(scaled, interpolation = "none", aspect = "auto")
plt.colorbar()
Now we have a small enough data matrix to fit the correlation matrix in memory:
%%time
corr = np.dot(scaled, scaled.T)
Compare to the previous run time of 10 minutes 39 seconds for the naive Python implementation:
(10.*60+39.)/.359
Yay -- with numpy, we get a ~1800x speedup compared to the naive Python implementation.
This is still too big to plot conveniently, so let's aggressively filter for the most differential genes.
Syntax is similar to the previous filter. Here, we compare the largest and smallest values in each row to get the maximum fold change among the samples, and keep transcripts with at least a 4x difference; since the data are log2, that is a difference of at least 2 log units (a 1.5x-2x cutoff is more typical):
x = (np.max(f, axis = 1) - np.min(f, axis = 1) >= 2)
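Equivalently, numpy's ptp ("peak to peak") computes max minus min in a single call:
x = np.ptp(f, axis = 1) >= 2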
f2 = f[x,:]
fa2 = fa[x,:]
f2.shape, fa2.shape
fig = plt.figure()
plt.imshow(f2, interpolation = "none", aspect = "auto")
plt.colorbar()
We requested genes with a large fold change, so now column differences are apparent even in the untransformed heatmap. Note that we selected primarily for differences between unstimulated and LPS-treated samples, as expected from our previous plots.
mean = np.mean(f2, axis = 1)
mean = mean.reshape(mean.shape[0],1)
centered = f2 - mean
fig = plt.figure()
plt.imshow(centered, interpolation = "none", aspect = "auto")
plt.colorbar()
Again, mean centering makes it easier to see the differential expression. Note that we can now pick out a few rows with the reverse trend -- "up in LPS relative to unstimulated".
norm = np.linalg.norm(centered, axis = 1)
norm = norm.reshape(norm.shape[0], 1)
scaled = centered/norm
fig = plt.figure()
plt.imshow(scaled, interpolation = "none", aspect = "auto")
plt.colorbar()
Again, doing all pairwise dot products of the centered and scaled rows to get the correlation matrix:
%%time
corr = np.dot(scaled, scaled.T)
corr.shape
And this matrix is small enough to plot conveniently:
fig = plt.figure()
plt.imshow(corr, interpolation = "none", aspect = "auto")
plt.colorbar()
As expected, all correlation values are between -1 and 1.
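We can confirm the range directly (up to floating point rounding):
corr.min(), corr.max()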
The challenge is now to permute the rows and columns to gather the strong correlation values into blocks along the diagonal -- this is the goal of clustering.
We will use Michael de Hoon's Python bindings for his Cluster3 program. Conveniently, these are included in Biopython:
import Bio.Cluster as Pycluster
Calculate the centered Pearson distance matrix using the Cluster3 code. This is equal to 1 minus the correlation matrix (so, perfectly correlated transcripts are at distance 0 from each other, uncorrelated genes are at distance 1 and perfectly anticorrelated genes are at distance 2).
Setting dist to a value other than "c" would choose a different distance method; e.g., "u" for uncentered Pearson distances (appropriate if the data is already transformed to a meaningful 0) or "e" for Euclidean distances.
dist = Pycluster.distancematrix(f2, dist = "c")
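distancematrix stores the lower triangle as a ragged list of rows (row i holds the i distances to earlier rows). As a quick spot check, the "c" distance between two transcripts should equal 1 minus our hand-computed correlation:
np.allclose(dist[1][0], 1. - corr[1, 0])  # distance(1, 0) == 1 - correlation(1, 0)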
Perform hierarchical clustering with maximum ("m") linkage clustering. This sets the distances for internal nodes to the maximum of the distances to their two children. The result is a "chunky" dendrogram with many short branches and a few long ones, giving relatively few easily distinguishable clusters. In contrast, average ("a") and centroid ("c") linkage clustering set distances to averages of the child distances, which leads to more uniform branch lengths.
tree = Pycluster.treecluster(distancematrix = dist, method = "m")
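If we want a flat partition rather than a dendrogram, the tree can be cut into a chosen number of clusters (six here is an arbitrary choice for illustration):
clusterid = tree.cut(6)  # assign each transcript to one of 6 clusters
np.bincount(clusterid)  # cluster sizes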
Save the clustering result as a pair of tab delimited files: CDT for the row-permuted data matrix and GTR for the dendrogram (tree).
record = Pycluster.Record()
# Fill in the record; unused fields are explicitly set to None
record.data = f2
record.geneid = fa2[:,0]
record.genename = fa2[:,1]
record.gweight = None
record.gorder = None
record.expid = header[4:]
record.eweight = [1. for i in range(len(header) - 4)]
record.eorder = None
record.uniqid = "UNIQID"
record.save("clustered1_cm", geneclusters = tree)
Same thing, but use the centered and scaled data for the heatmap.
This is not so good, because the values are all very small relative to JavaTreeView's contrast range. We could fix this by multiplying by a constant factor, but this would still have the disadvantage of obscuring the magnitude of differential expression.
record = Pycluster.Record()
# Fill in the record; unused fields are explicitly set to None
record.data = scaled
record.geneid = fa2[:,0]
record.genename = fa2[:,1]
record.gweight = None
record.gorder = None
record.expid = header[4:]
record.eweight = [1. for i in range(len(header) - 4)]
record.eorder = None
record.uniqid = "UNIQID"
record.save("clustered1_cm_scaled", geneclusters = tree)
Same thing, but use the centered data for the heatmap -- this gives the best view in JavaTreeView (even better would be picking an appropriate transform specific to this experiment)
record = Pycluster.Record()
# Fill in the record; unused fields are explicitly set to None
record.data = centered
record.geneid = fa2[:,0]
record.genename = fa2[:,1]
record.gweight = None
record.gorder = None
record.expid = header[4:]
record.eweight = [1. for i in range(len(header) - 4)]
record.eorder = None
record.uniqid = "UNIQID"
record.save("clustered1_cm_centered", geneclusters = tree)
List the files that we just created:
%ls clustered1_cm*.*