Goal: Fit contrasts and generate differential gene lists for comparison to tables S1 and S2 of Sci. Rep. 7:42225.
%cd ../data
import numpy as np
def clip(s):
"""Coerce kallisto est_count floating point value to nearest integer."""
# Note that the max function shouldn't be necessary, as all est_counts should be positive
return max(0,int(float(s)+.5))
Parse table of per-sample estimated counts from kallisto. The table is in CDT format and has been filtered for genes with $TPM \geq 10$ in at least one sample, exactly as in the TPM file that we were previously working with.
from csv import reader, excel_tab
orfs = []
names = []
data = []
fin = reader(open("GSE88801_kallisto_est_counts_thresh10.cdt"),dialect=excel_tab)
header = next(fin)[4:]
eweights = next(fin)[4:]
for row in fin:
orfs.append(row[1])
names.append(row[2])
data.append([clip(i) for i in row[4:]])
Here, I convert the count matrix to a numpy array. This is mainly to allow passing the matrix directly to R via ryp2. Since we will be passing data to R via flat files, we could have skipped this step and worked with the data list of lists instead.
C = np.array(data)
C.shape
Load the table of sample descriptions and generate an easy to parse "state" column for R. In order to keep our R code simple, we will make the row order of the sample table match the column order of the count matrix. Therefore, we use a dictionary, name2sample, to index the original sample table by sample name.
from csv import reader
fp = reader(open("sample_table_v2.csv"))
sample_header = next(fp)
samples = []
name2sample = {}
for i in fp:
samples.append(i)
name2sample[i[0]] = i
sample_header
out = open("sample_table_v2.tdt","w")
out.write("\t".join(sample_header+["state"])+"\n")
# Loop over the count matrix header to match its ordering of the samples
for i in header:
# Use our dictionary to look up the corresponding row of the sample table
s = name2sample[i]
# Generate unique state names by concatenating the three parameters that distinguish the states
# (strain, infection, and time)
out.write("\t".join(s + ["%s.%s.%s" % (s[2],s[1],s[3])])+"\n")
out.close()
Write the count matrix in an easily parsed tab delimited format (relative to the original CDT file, we drop the eweight row and all of the extra annotation columns)
out = open("GSE88801_kallisto_est_counts_thresh10.txt","w")
out.write("\t".join(["gene"]+header)+"\n")
for (name,row) in zip(names,C):
out.write("\t".join([name]+[str(i) for i in row])+"\n")
out.close()
Use rpy2 to connect our Jupyter notebook to an R interpreter (in addition to the python3 interpreter that we're already connected to). Subsequent cells that start with %%R will be executed by the R interpreter. In this case, we are using ryp2 so that we can document our R and Python steps in one place. We could just as well use separate Jupyter and RStudio notebooks, as long as we were careful to execute the commands in the right order. In other contexts, we might choose to use rpy2 pass data back and forth between R and Python -- this is very convenient, but we need to be careful that nothing falls through the cracks during the between-language data conversions.
%load_ext rpy2.ipython
Load the R modules necessary for limma analysis (analogous to import in Python)
%%R
library(limma)
library(edgeR)
Read the sample table into R. Note that R automatically converts the text columns into factors (essentially, qualitative labels) and the integer columns into integers.
%%R
samples <- read.delim("sample_table_v2.tdt")
print(summary(samples))
Specify the model to be fit. Here, we just want to fit the means over the three replicates for each distinct state (equivalent to assuming that there are enough context-dependent effects that none of the parameters are additive).
%%R
state <- samples$state
# Specify the model: the 0 removes the default intercept parameter, so we just fit a separate mean for each state
# (If you'd like to try an alternative model, bind the appropriate columns of the samples data.frame above
# and try: d <- model.matrix(~infection+strain+time) for an independently additive model)
d <- model.matrix(~0+state)
# This step just cleans up R's default parameter names
colnames(d) <- gsub("state","",colnames(d))
print(colnames(d))
Read and normalize the count matrix
%%R
# Read the count matrix, using the gene column as row names
C <- read.delim("GSE88801_kallisto_est_counts_thresh10.txt",row.names=1)
# Covert the matrix to limma's preferred format, implicitly log2 transforming and depth normalizing to CPM values
dge <- DGEList(counts=C)
# Apply between-sample TMM normalization
dge <- calcNormFactors(dge)
# Estimate the mean-variance trend via locally-linear regression and use this trend
# to assign weights to the observations (counts)
v <- voom(dge, d, plot = TRUE)
Fit the model (i.e., calculate the state means for each gene), convert to contrasts of interest, and re-estimate the variances based on the global variance distribution.
%%R
# Fit the model (classic linear regression)
fit <- lmFit(v, d)
# Generate the contrast matrix
contrast.matrix <- makeContrasts(# S1 contrast
BMDM.uninfected.24-BMDM.uninfected.4,
# S2 contrasts
BMDM.Live.4-BMDM.uninfected.4,BMDM.Dead.4-BMDM.uninfected.4,
BMDM.Live.24-BMDM.uninfected.24,BMDM.Dead.24-BMDM.uninfected.24,
J774.Live.4-J774.uninfected.4,J774.Dead.4-J774.uninfected.4,
J774.Live.24-J774.uninfected.24,J774.Dead.24-J774.uninfected.24,
levels=d)
# Apply the contrast matrix
fit2 <- contrasts.fit(fit, contrast.matrix)
# Apply Empirical Bayes "shrinkage"
fit2 <- eBayes(fit2)
# Simple summary of significantly differential genes with no fold change filter
print(summary(decideTests(fit2)))
Same summary as above, but with a 2x fold-change filter (1 after $\log_2$ transform)
%%R
print(summary(decideTests(fit2,lfc=1)))
To see the details of the differential genes (rather than just summary counts) we'll use topTable. By default, it reports the 10 most significant genes based on F-test across all parameters.
%%R
print(topTable(fit2))
%%R
print(colnames(fit2$coefficients))
Same thing, but considering only a single contrast and filtering for at least a 2x fold change
%%R
print(topTable(fit2, coef="BMDM.uninfected.24 - BMDM.uninfected.4", lfc=1, p.value = .05))
Extract the topTable results from R by writing them to CSV formatted text files (we choose comma, rather than tab, delimiters because write.csv has slightly better documentation). We set n=40000 to make sure we get all genes passing the filter (for t0) or all genes (for t1).
%%R
for(tc in colnames(fit2$coefficients)){
print(tc)
# Extract all genes significantly differential on this contrast for a 2x fold change cutoff and 5% FDR
# Use write.csv rather than write.table for clean compatibility with python's csv.reader
write.csv(topTable(fit2, coef=tc, n = 40000, lfc=1, p.value = .05),
paste("limma1.",gsub(" ","",tc),".t0.csv",sep=""))
# Extract the adjusted p-values for this contrast for all genes, independent of significance
write.csv(topTable(fit2, coef=tc, n = 40000),
paste("limma1.",gsub(" ","",tc),".t1.csv",sep=""))
}
!ls limma1*.csv
from csv import reader
Take a look at one of the differential gene sets
diff = []
fp = reader(open("limma1.J774.Live.24-J774.uninfected.24.t0.csv"))
header = next(fp)
header
for row in fp:
diff.append(row)
len(diff)
diff[0]
diff[-1]
list(zip(header,diff[-1]))
for (i,j) in zip(header,diff[-1]):
print("%10s: %20s" % (i,j))
Parse the gene names from this list into a python set
s = set()
fp = reader(open("limma1.J774.Live.24-J774.uninfected.24.t0.csv"))
next(fp)
for row in fp:
s.add(row[0])
len(s)
Let's do the same parse for all of the differential gene lists. We'll use glob to get a list of the appropriate file names and use a python dict to map list names to sets of gene names.
from glob import glob
glob("limma1*.csv")
diff_genes = {}
for i in glob("limma1*.t0.csv"):
s = set()
fp = reader(open(i))
next(fp)
for row in fp:
s.add(row[0])
diff_genes[i.replace(".t0.csv","")] = s
print(i,len(s))
diff_genes.keys()
Use set.intersection to find the common genes between two sets.
len(diff_genes["limma1.J774.Dead.4-J774.uninfected.4"].intersection(
diff_genes["limma1.BMDM.Dead.24-BMDM.uninfected.24"]))
s = diff_genes["limma1.J774.Dead.4-J774.uninfected.4"].intersection(
diff_genes["limma1.BMDM.Dead.24-BMDM.uninfected.24"])
len(s), sorted(s)[:10]
Same parse as before, but cutting for just the genes that are up in each contrast (exercise for the reader -- try cutting for down regulated genes instead)
up_genes = {}
for i in glob("limma1*.t0.csv"):
s = set()
fp = reader(open(i))
next(fp)
for row in fp:
if(float(row[1]) > 0):
s.add(row[0])
up_genes[i.replace(".t0.csv","")] = s
print(i,len(s))
Filtering example -- filter the CDT file from our PCA analysis for rows matching one of the differential gene sets
!ls -lrth *.cdt
This first cell is a scouting run to check which column to filter on
fp = open("PCA_clustering_example1.um.annotated.cdt")
out = open("filter1.cdt","w")
out.write(fp.readline())
out.write(fp.readline())
for line in fp:
fields = line.split("\t")
print(fields)
break
Here's the actual filtering cell
filter_set = diff_genes["limma1.J774.Dead.4-J774.uninfected.4"]
fp = open("PCA_clustering_example1.um.annotated.cdt")
out = open("filter1.cdt","w")
# Note that we want the two header lines no matter what
out.write(fp.readline())
out.write(fp.readline())
for line in fp:
fields = line.split("\t")
# Here's where we write only rows matching our gene list
if(fields[1] in filter_set):
out.write(line)
out.close()