Goal: Estimate transcript abundances from GSE86922 with kallisto and merge the results for limma.
In PLoS Pathogens 12:e1005910, transcript abundances are estimated by aligning the paired-end reads to the mouse genome with Rsubread and then counting read coverage on each annotated transcript (the protocol recommended in limma's voom documentation). Here, we will instead try the alignment-free approach of kallisto.
N.B.: The non-python portions of this notebook were invoked from bash on Debian and Ubuntu Linux servers.
Note that newer versions of the SRA Toolkit can extract FASTQ files directly from the SRA. Here, I am using an older version of the toolkit, so I first download .sra files from the SRA and then extract FASTQ files locally with fastq-dump. The full dataset is ~71 GB, so I delete each .sra file after extraction to save space:
export SRA='ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP089/SRP089914/'
for i in $(curl "${SRA}" | awk '{print $9;}'); do
    curl "${SRA}${i}/${i}.sra" > "${i}.sra" && \
        fastq-dump --split-files --gzip "${i}.sra" && rm "${i}.sra";
done;
This results in two files (one for each sequencing primer) for each sample, except for SRR4244246, which is listed in the SRA as single-end, probably due to an upload error.
This requires slightly more than 4 GB of RAM and runs fine on an 8 GB laptop.
Here, I'm using the copy of the mouse transcriptome from the kallisto website. We could instead use the same version used in the paper for a more accurate methods comparison.
cd /home/mvoorhie/pdf/papers/PLoSPathogens_12_e1005910/
curl -O http://bio.math.berkeley.edu/kallisto/transcriptomes/Mus_musculus.GRCm38.rel79.cdna.all.fa.gz
gunzip Mus_musculus.GRCm38.rel79.cdna.all.fa.gz
kallisto index -i GRCm38_all_mRNA.idx Mus_musculus.GRCm38.rel79.cdna.all.fa
for i in SRR*_2.fastq.gz; do
    export SAMPLE="${i%_2.fastq.gz}"
    kallisto quant -i GRCm38_all_mRNA.idx -t 4 --rf-stranded \
        -o "${SAMPLE}".rf.kallisto "${SAMPLE}"_{1,2}.fastq.gz
done;
export SAMPLE="SRR4244246"
kallisto quant -i GRCm38_all_mRNA.idx -t 4 --single --rf-stranded \
    -l 150 -s 50 \
    -o "${SAMPLE}".rf.kallisto "${SAMPLE}"_1.fastq.gz
N.B.: The following protocol is how I do this sort of merge with my existing tools. A simpler approach, using just functions we've used in class, would also work in this case (see the sketch below).
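For reference, here is a minimal sketch of that simpler approach, using only the standard library. It assumes that every abundance.tsv lists the same transcripts in the same order (a slightly stronger assumption than the set-equality check performed below); the sample list and output file name are placeholders.

import csv
import os

datadir = "/home/mvoorhie/pdf/papers/PLoSPathogens_12_e1005910/"
samples = ["SRR4244246"]  # placeholder: in practice, the full list of sample names

# kallisto's abundance.tsv columns are: target_id, length, eff_length, est_counts, tpm
target_ids = None
counts = {}
for sample in samples:
    path = os.path.join(datadir, sample + ".rf.kallisto", "abundance.tsv")
    with open(path) as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    if target_ids is None:
        target_ids = [row["target_id"] for row in rows]
    counts[sample] = [row["est_counts"] for row in rows]

# Write a single tab-delimited matrix of estimated counts
with open("est_counts_simple.tsv", "w") as out:
    out.write("\t".join(["target_id"] + samples) + "\n")
    for (n, tid) in enumerate(target_ids):
        out.write("\t".join([tid] + [counts[s][n] for s in samples]) + "\n")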
Standard library tools
# General operating system functions. We'll use os.path to build proper absolute file names.
import os
A few useful classes from my local toolbox (these are on the course website)
# Convenience class for working with tabular text data.
# You could use your own parsing functions or something like pandas instead.
from MsvUtil import Table
# Convenience classes for building and writing CDT (clustered data table) files.
from CdtFile import CdtFile, CdtRow
datadir = "/home/mvoorhie/pdf/papers/PLoSPathogens_12_e1005910/"
I created a small table of sample annotations based on the descriptions in the SRA and saved it as a CSV file. Here, we're just using it to get a list of sample names, but we'll use it downstream when building the design matrix for limma.
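The exact annotation columns don't matter for the merge; the code below only relies on a "name" column holding the SRR run accessions that match the *.rf.kallisto directory names. A hypothetical layout (the "condition" column and its values are made-up placeholders) might look like:

name,condition
SRR4244246,groupA
...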
Load this table:
sample_table = Table.fromCsv(open(os.path.join(datadir,"sample_table.csv")))
sample_table[0]
Parse the kallisto results for each sample and index the resulting tables by sample name
kallisto = dict((i, Table.fromTdt(open(os.path.join(
datadir,i+".rf.kallisto","abundance.tsv"))))
for i in sample_table["name"])
Index rows by transcript_id
name2row = dict((sample, dict((i["target_id"],i) for i in table))
for (sample,table) in kallisto.items())
Confirm that all kallisto outputs are relative to the same transcript list
Since they are, we could use a simpler merge scheme than the row-indexed one I'm using here, but this method generalizes to trickier merges
s = list(name2row)
S = set(name2row[s[0]])
for i in s[1:]:
assert(S == set(name2row[i]))
len(S)
Merge kallisto results to a single CDT file with estimated counts in the "heatmap" portion and TPMs in the "annotation" portion.
snames = sample_table["name"]
genes = sorted(S)
est_counts = CdtFile(
    # One CdtRow per transcript: estimated counts go in the "heatmap" (ratios)
    # columns and TPMs in the extra annotation columns.
    probes = [CdtRow(gid = "GENE%05d" % n,
                     uniqid = i,
                     name = i,
                     # Round kallisto's estimated counts to the nearest integer
                     ratios = [int(float(name2row[j][i]["est_counts"])+.5)
                               for j in snames],
                     extra = [name2row[j][i]["tpm"] for j in snames])
              for (n,i) in enumerate(genes)],
    fieldnames = snames[:],
    extranames = [j+"_TPM" for j in snames])
est_counts.write(open("est_counts.cdt","w"))
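As a quick sanity check on the merged file (a sketch; it assumes only that the CDT is tab-delimited text, which is standard for the format), we can compare its dimensions against the number of samples and transcripts:

# Expect one column per sample in both the count and TPM sections,
# plus the CDT annotation columns, and one data row per transcript
# (plus any extra CDT header rows such as EWEIGHT).
with open("est_counts.cdt") as handle:
    header = handle.readline().rstrip("\n").split("\t")
    n_rows = sum(1 for line in handle)
print(len(header), "columns;", n_rows, "rows after the header")
print(len(snames), "samples;", len(genes), "transcripts")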