Calculating the Pearson distance matrix in pure Python

Pearson distance metric, handling missing values as in Cluster3

  • Note that the rmsd ($\sqrt{\sum_{i=1}^{N}(x_i - \bar x)^2}$) is the numerator of the standard deviation.
  • Note that we re-use the same means in both rmsd and pearson, so we avoid redundant calculation by passing the mean as a parameter to rmsd (see the formula below).
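
In this notation (with terms involving missing values skipped, exactly as in the code below), the Pearson correlation is

$r(x, y) = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\mathrm{rmsd}(x, \bar x)\,\mathrm{rmsd}(y, \bar y)}$

so within each call the means are computed once and shared between the numerator and the two rmsd factors.
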
In [2]:
from math import sqrt

def mean(x):
    # Missing values (None) are skipped in the sum but still counted in the
    # denominator, i.e., they contribute to the mean as if they were 0.
    return sum(i for i in x if i is not None)/float(len(x))

def rmsd(x, mx):
    # Root of the summed squared deviations from the pre-computed mean mx;
    # missing values (None) are skipped.
    return sqrt(sum((i-mx)**2 for i in x if i is not None))

def pearson(x, y):
    assert len(x) == len(y)
    mx = mean(x)
    sx = rmsd(x, mx)
    # a constant vector has zero rmsd; treat its correlation as 0 (distance 1)
    if sx == 0.:
        return 0.
    my = mean(y)
    sy = rmsd(y, my)
    if sy == 0.:
        return 0.
    # pairs with a missing value in either vector are skipped in the numerator
    return sum((i-mx)*(j-my) for (i, j) in zip(x, y) if i is not None and j is not None)/sx/sy

def pearson_dist(x, y):
    return 1. - pearson(x, y)

Function to calculate a distance matrix, based on an arbitrary (symmetric) distance function

In [3]:
def dmat(data, dist=pearson_dist):
    D = []
    for i in xrange(len(data)):
        D.append([])
        for j in xrange(len(data)):
            if i > j:
                # the distance function is symmetric, so re-use the value from row j
                D[-1].append(D[j][i])
            else:
                D[-1].append(dist(data[i], data[j]))
    return D
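
As a quick sanity check (not part of the original notebook; the toy profiles here are made up), a profile should be at distance ~0 from itself and ~2 from its mirror image, and dmat should return a symmetric matrix:

# toy profiles, invented just to exercise the functions above
x = [1., 2., 3.]
y = [3., 2., 1.]
print(pearson_dist(x, x))      # ~0.0: a profile is perfectly correlated with itself
print(pearson_dist(x, y))      # ~2.0: x and y are perfectly anti-correlated
toy = dmat([x, y])
print(toy[0][1] == toy[1][0])  # True: the matrix is symmetric by construction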

Load the sample data set, using the default mode (missing values = None) of the CDT parser that we wrote in class

In [1]:
import stats
In [4]:
%%time
(header, genes, annotations, logratios) = stats.parse_cdt3(open("supp2data.cdt"))
CPU times: user 83.9 ms, sys: 3.82 ms, total: 87.7 ms
Wall time: 86.5 ms

Calculate the correlation matrix and the Pearson distance matrix, handling missing values as in Cluster3

In [5]:
%%time
correlation_matrix = dmat(logratios,pearson)
CPU times: user 2min 47s, sys: 410 ms, total: 2min 47s
Wall time: 2min 46s

In [6]:
%%time
D_pearson = dmat(logratios)
CPU times: user 3min 14s, sys: 742 ms, total: 3min 14s
Wall time: 3min 13s

Calculating the correlation matrix in numpy

On day 4, I mentioned that we can calculate the correlation matrix as a simple matrix product in numpy -- here's how.

First, so that we can check our work against the pure python implementation, we re-calculate the correlation matrix with missing values set to 0., since the numpy method requires this masking.

Parse the sample data again, this time setting missing values to 0., and re-calculate the correlation matrix

In [7]:
%%time
(header, genes, annotations, maskedratios) = stats.parse_cdt3(open("supp2data.cdt"),0.)
CPU times: user 370 ms, sys: 4 ms, total: 374 ms
Wall time: 370 ms

In [8]:
%%time
masked_correlation_matrix = dmat(maskedratios,pearson)
CPU times: user 3min 28s, sys: 847 ms, total: 3min 29s
Wall time: 3min 28s

In [9]:
import numpy
from numpy.linalg import norm

Convert our data matrix to a numpy array

In [10]:
M = numpy.array(maskedratios)
M.shape
Out[10]:
(2467, 79)

Calculate $D D^T$ where $D$ is $M$ with row centering and scaling (the normalization implied by Pearson correlation).
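
Why this works: writing $m_i$ for row $i$ of $M$ and $d_i = (m_i - \bar m_i)/\lVert m_i - \bar m_i \rVert$ for the corresponding row of $D$, the Pearson correlation between genes $i$ and $j$ reduces to a dot product of unit vectors,

$r_{ij} = d_i \cdot d_j$

so the whole correlation matrix is exactly the product $D D^T$.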

To apply these normalizations row-wise in numpy, it is easiest to transpose the matrix, do column-wise normalization, and then transpose the normalized matrix.

In [21]:
%%time
# transpose
T = M.T
# center on column means (row means of the original matrix)
T_centered = T - numpy.mean(T,axis=0)
# scale on column norms (row norms of the original matrix)
T_centered_scaled = T_centered/norm(T_centered,axis=0)
# transpose back to the original space
row_normed_matrix = T_centered_scaled.T
# calculate the matrix product
row_normed_correlation = numpy.dot(row_normed_matrix,row_normed_matrix.T)
CPU times: user 487 ms, sys: 0 ns, total: 487 ms
Wall time: 486 ms
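
As an aside, the same row-wise normalization can be written without the transposes by broadcasting against explicit column vectors. This is a sketch, not part of the original notebook (the variable names here are new); it should produce the same matrix as row_normed_correlation above:

# row means and row norms, reshaped to column vectors so they broadcast row-wise
row_means = numpy.mean(M, axis=1)[:, numpy.newaxis]
centered = M - row_means
row_norms = norm(centered, axis=1)[:, numpy.newaxis]
row_normed = centered / row_norms
correlation_direct = numpy.dot(row_normed, row_normed.T)
# numpy.allclose(correlation_direct, row_normed_correlation) should be True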

Note that the numpy method is over 400 times faster than the pure python method (~3.5 minutes vs. ~0.5 seconds).

Let's confirm that the results are the same within rounding error:

In [22]:
d = row_normed_correlation - numpy.array(masked_correlation_matrix)
absdiff = [abs(i) for i in d.ravel()]
min(absdiff), max(absdiff)
Out[22]:
(0.0, 1.5543122344752192e-15)
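
Equivalently, numpy.allclose collapses the same comparison into a single boolean using its default tolerances (this check is not in the original notebook, but should return True given the differences above):

numpy.allclose(row_normed_correlation, numpy.array(masked_correlation_matrix))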

Principal Components Analysis (PCA)

$M M^T$ and its normalized variants are useful starting points for thinking about the natural dimensions of $M$. Here, we will walk through a simple principal components analysis of the data based on the singular value decomposition (SVD). Note that the main difference among different PCA approaches is the choice of normalization prior to calculating the SVD.

First, we repeat the mean centering from above, but apply it column-wise (i.e., assuming that the mean differential expression for any condition is 0.)

In [11]:
centered_matrix = M - numpy.mean(M,axis=0)

Multiplying the centered matrix by its transpose gives a covariance matrix. This is a good starting point for PCA for cases where each column is on a comparable scale.

In [12]:
covariance_matrix = numpy.dot(centered_matrix, centered_matrix.T)
covariance_matrix.shape
(2467, 2467)
CPU times: user 552 ms, sys: 0 ns, total: 552 ms
Wall time: 551 ms

Scaling on column norms and multiplying gives a correlation matrix. This is a good starting point for PCA for cases where the columns represent very different measurements, or span very different dynamic ranges. This avoids loss of dimension at the cost of putting the resulting components on a harder-to-interpret scale.

In [14]:
scaled_matrix = centered_matrix/norm(centered_matrix,axis=0)
In [15]:
numpy_correlation_matrix = numpy.dot(scaled_matrix, scaled_matrix.T)
numpy_correlation_matrix.shape
(2467, 2467)
CPU times: user 547 ms, sys: 4 ms, total: 551 ms
Wall time: 550 ms

Calculate the SVD of the centered matrix (equivalent to an eigenvalue decomposition of the covariance matrix). We use the transpose of the centered matrix to ask where the genes lie relative to coordinate vectors that are linear combinations of the conditions.

In [23]:
from numpy.linalg import svd
In [55]:
%%time
(u1,s1,v1) = svd(centered_matrix.T, full_matrices = False)
CPU times: user 69.3 ms, sys: 15 µs, total: 69.3 ms
Wall time: 69 ms
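
To make the equivalence with the eigenvalue decomposition concrete, the squared singular values should match the eigenvalues of the small 79x79 matrix centered_matrix.T · centered_matrix. This check is not part of the original notebook; it uses numpy.linalg.eigvalsh, which returns eigenvalues in ascending order:

from numpy.linalg import eigvalsh

# eigenvalues of the small condition-by-condition matrix, sorted descending
evals = numpy.sort(eigvalsh(numpy.dot(centered_matrix.T, centered_matrix)))[::-1]
numpy.allclose(evals, s1**2)   # should be True: squared singular values = eigenvalues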

Check the number of dimensions spanned by our data (the rank of the data matrix) by looking for the first singular value that is numerically zero.

In [26]:
%matplotlib
import matplotlib.pyplot as plt
Using matplotlib backend: TkAgg

In [27]:
from IPython.core.display import display
In [56]:
fig = plt.figure()
plt.plot(s1,marker="o")
display(fig)

This matrix is close to full rank, although some components clearly have more variance than others (i.e., the data is more ellipsoidal than spherical).
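
For a numerical cross-check (not part of the original notebook), numpy.linalg.matrix_rank counts the singular values above an automatically chosen tolerance; here it is expected to report the full 79, consistent with the plot:

from numpy.linalg import matrix_rank

matrix_rank(centered_matrix)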

We can convert the singular values to relative variance on each component by a simple transform, $v_k = s_k^2 / \sum_j s_j^2$ (the squared singular values, normalized to sum to 1):

In [57]:
def relative_variance(s):
    v = s**2
    return v/sum(v)

rv = relative_variance(s1)

fig = plt.figure()
plt.plot(rv,marker="o")
display(fig)
In [32]:
sum(rv[:4]), sum(rv[:3])
Out[32]:
(0.53891478784108382, 0.48230022672680584)
  • About half of the variance is accounted for by the first three components
  • The 5th and 6th components are the first example of a near-degenerate pair

Let's have a look at these first three components.

First, we'll project our data onto the full space of principal components

In [33]:
projected1 = numpy.dot(centered_matrix, u1)
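
Since centered_matrix.T = u1 · diag(s1) · v1 and u1 is orthogonal, this projection is just the rows of v1 scaled by the singular values. A quick check (not part of the original notebook):

# projecting onto u1 recovers v1 (transposed) scaled column-wise by the singular values
numpy.allclose(projected1, v1.T * s1)   # should be True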

The first three columns of projected1 give the subspace spanned by the first three components -- let's plot them

(Note that I'm using numpy's special slicing syntax: x[:,c] extracts column c for all rows of x)

In [34]:
from mpl_toolkits.mplot3d import Axes3D
In [58]:
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot(projected1[:,0],
        projected1[:,1],
        projected1[:,2],
        "k,")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
# Give axes equal scaling
L = 0
for i in (ax.get_xlim3d, ax.get_ylim3d, ax.get_zlim3d):
    (a,b) = i()
    L = max(L, -a, b)
ax.set_xlim3d(-L,L)
ax.set_ylim3d(-L,L)
ax.set_zlim3d(-L,L)
display(fig)

We can also take just two components at a time:

The first two components:

In [36]:
fig = plt.figure()
plt.plot(projected1[:,0],projected1[:,1],"bo")
display(fig)

The first and third components:

In [37]:
fig = plt.figure()
plt.plot(projected1[:,0],projected1[:,2],"bo")
display(fig)

The second and third components:

In [38]:
fig = plt.figure()
plt.plot(projected1[:,1],projected1[:,2],"bo")
display(fig)

(Note that in this last projection, we are starting to see a more spherical cross-section relative to the ellipse of the full data)

Let's take a look at which conditions are contributing to which components.

We can plot everything as column vectors:

In [41]:
from matplotlib.colors import LinearSegmentedColormap

# Simple green->red non-overlapping gradient
cdict = {"red":((0.,0.,0.),(.5,0.,0.),(1.,1.,1.)),
         "green":((0.,1.,1.),(.5,0.,0.),(1.,0.,0.)),
         "blue":((0.,0.,0.),(.5,0.,0.),(1.,0.,0.))}
# map gradient to 256 actual RGBA values
cmap_rg = LinearSegmentedColormap("rg",cdict,256)

# Simple blue->yellow non-overlapping gradient
cdict = {"red":((0.,0.,0.),(.5,0.,0.),(1.,1.,1.)),
         "green":((0.,0.,0.),(.5,0.,0.),(1.,1.,1.)),
         "blue":((0.,1.,1.),(.5,0.,0.),(1.,0.,0.))}
# map gradient to 256 actual RGBA values
cmap_yb = LinearSegmentedColormap("yb",cdict,256)
In [43]:
fig = plt.figure()
plt.imshow(u1, interpolation="none", aspect="auto", vmin=-1, vmax=1, cmap=cmap_yb)
display(fig)

Or, just the first three components as simple line-plots

In [59]:
fig = plt.figure()
plt.plot(u1[:,0],color="red",marker="o")
plt.plot(u1[:,1],color="green",marker="o")
plt.plot(u1[:,2],color="blue",marker="o")
display(fig)

Same analysis for the correlation matrix

In [46]:
%%time
(u2,s2,v2) = svd(scaled_matrix.T, full_matrices = False)
CPU times: user 67.3 ms, sys: 0 ns, total: 67.3 ms
Wall time: 66.9 ms

In [47]:
rv = relative_variance(s2)

fig = plt.figure()
plt.plot(rv,marker="o")
display(fig)
In [50]:
sum(rv[:6]), sum(rv[:4]), sum(rv[:3])
Out[50]:
(0.4813533951143843, 0.37386222140030922, 0.31087517367721645)

Note that the scaled data is more spherical, with a larger number of components contributing significant variance.

In [51]:
projected2 = numpy.dot(scaled_matrix, u2)
In [52]:
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot(projected2[:,0],
        projected2[:,1],
        projected2[:,2],
        "k,")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
# Give axes equal scaling
L = 0
for i in (ax.get_xlim3d, ax.get_ylim3d, ax.get_zlim3d):
    (a,b) = i()
    L = max(L, -a, b)
ax.set_xlim3d(-L,L)
ax.set_ylim3d(-L,L)
ax.set_zlim3d(-L,L)
display(fig)

We can see the more spherical shape in the projection

In [53]:
fig = plt.figure()
plt.imshow(u2, interpolation="none", aspect="auto", vmin=-1, vmax=1, cmap=cmap_yb)
display(fig)
In [54]:
fig = plt.figure()
plt.plot(u2[:,0],color="red",marker="o")
plt.plot(u2[:,1],color="green",marker="o")
plt.plot(u2[:,2],color="blue",marker="o")
display(fig)

After normalization, a larger number of conditions contribute to the first few coordinates.

Finally, boxplots of the per-condition distributions show the effect of the scaling: first the centered (unscaled) matrix, then the column-scaled matrix.

In [60]:
fig = plt.figure()
b = plt.boxplot(centered_matrix)
display(fig)
In [61]:
fig = plt.figure()
b = plt.boxplot(scaled_matrix)
display(fig)