Goal: Estimate transcript abundances from GSE86922 with kallisto and merge the results for limma.

In PLoS Pathogens 12:e1005910, transcript abundances are estimated by aligning the paired end reads to the mouse genome with RSubread and then counting read coverage on each annotated transcript (the protocol recommended in limma's voom documentation). Here, we will instead try the alignment free approach of kallisto.

N.B.: The non-python portions of this notebook were invoked from bash on Debian and Ubuntu Linux servers.

Downloading paired-end reads

Note that newer versions of the sra toolkit can extract FASTQ files directly from the SRA. Here, I am using an older version of the toolkit, so I first download sra files from the SRA and then extract FASTQ files locally with fastq-dump. The full data is ~71 GB, so I delete the sra files after extraction to save space:

export SRA='ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP089/SRP089914/'
for i in $(curl "${SRA}" | awk '{print $9;}'); do
  curl "${SRA}${i}/${i}.sra" > "${i}.sra" && \
  fastq-dump --split-files --gzip "${i}.sra" && rm "${i}.sra";
done;

This results in two files (one for each sequencing primer) for each sample, except for SRR4244246 which is listed in the SRA as single-end, probably due to an upload error.

Quantify transcript abundances with Kallisto

This requires slightly more than 4GB of RAM. Runs fine on an 8GB laptop.

Download and index mouse transcriptome

Here, I'm using the copy of the mouse transcriptome from the Kallisto website. We could instead use the same version as used in the paper for a more accurate methods comparison.

cd /home/mvoorhie/pdf/papers/PLoSPathogens_12_e1005910/

curl http://bio.math.berkeley.edu/kallisto/transcriptomes/Mus_musculus.GRCm38.rel79.cdna.all.fa.gz
gunzip Mus_musculus.GRCm38.rel79.cdna.all.fa.gz
kallisto index -i GRCm38_all_mRNA.idx Mus_musculus.GRCm38.rel79.cdna.all.fa

Quantify paired end samples

  • -t 4 allocates 4 processor cores to kallisto. Going much higher than this has diminishing returns.
  • --rf-stranded indicates the relative orientations of the first and second read to the 5'->3' orientation of the transcript.
for i in SRR*_2.fastq.gz; do
  export SAMPLE="${i%_2.fastq.gz}"
  kallisto quant -i GRCm38_all_mRNA.idx -t 4 --rf-stranded \
    -o "${SAMPLE}".rf.kallisto "${SAMPLE}"_{1,2}.fastq.gz
done;

Quantify the "single-end" sample

  • --single indicates that this is a single-end read set (default is paired-end)
  • -l 150 -s 50 gives the mean and standard deviation of the fragment length distribution (the sizes of the transcript fragments inserted into the library backbone). For paired-end samples, kallisto estimates this directly from the observed spacing of pseudoaligned left/right read pairs. For single-end, you can get this experimentally from, e.g., a Bioanalyzer run. Here I'm using the length estimate from the paired-end runs and making a slightly high guess at the standard deviation.
export SAMPLE="SRR4244246"
kallisto quant -i GRCm38_all_mRNA.idx -t 4 --single --rf-stranded \
   -l 150 -s 50 \
   -o "${SAMPLE}".rf.kallisto "${SAMPLE}"_1.fastq.gz

Merge per-sample kallisto results

N.B.: The following protocol is how I do this sort of merge with my existing tools. A simpler approach, using just functions we've used in class, would work in this case.

Standard library tools

In [1]:
# General operating system functions.  We'll use os.path to build proper absolute file names.
import os

A few useful classes from my local toolbox (these are on the course website)

In [2]:
# Convenience class for working with tabular text data.
# You could use your own parsing functions or something like pandas instead.
from MsvUtil import Table
# Convenience class for
from CdtFile import CdtFile, CdtRow
In [3]:
datadir = "/home/mvoorhie/pdf/papers/PLoSPathogens_12_e1005910/"

I created a small table of sample annotations based on the descriptions in the SRA and saved it as a CSV file. Here, we're just using it to get a list of sample names, but we'll use it downstream when building the design matrix for limma.

Load this table:

In [4]:
sample_table = Table.fromCsv(open(os.path.join(datadir,"sample_table.csv")))
sample_table[0]
Out[4]:
name:		SRR4244242
genotype:		WT
LPS:		0
rep:		1

Parse the kallisto results for each sample and index the resulting tables by sample name

In [5]:
kallisto = dict((i, Table.fromTdt(open(os.path.join(
    datadir,i+".rf.kallisto","abundance.tsv"))))
                      for i in sample_table["name"])

Index rows by transcript_id

In [6]:
name2row = dict((sample, dict((i["target_id"],i) for i in table))
                for (sample,table) in kallisto.items())

Confirm that all kallisto outputs are relative to the same transcript lists

Since they are, we could use a simpler merge scheme than the row indexed scheme that I'm using here, but this method generalizes to trickier merges

In [7]:
s = list(name2row)
S = set(name2row[s[0]])
for i in s[1:]:
    assert(S == set(name2row[i]))
len(S)
Out[7]:
88198

Merge kallisto results to a single CDT file with estimated counts in the "heatmap" portion and TPMs in the "annotation" portion.

In [8]:
snames = sample_table["name"]
genes = sorted(S)

est_counts = CdtFile(
    probes = [CdtRow(gid = "GENE%05d" % n,
                     uniqid = i,
                     name = i,
                     ratios = [int(float(name2row[j][i]["est_counts"])+.5)
                               for j in snames],
                     extra = [name2row[j][i]["tpm"] for j in snames])
              for (n,i) in enumerate(genes)],
    fieldnames = snames[:],
    extranames = [j+"_TPM" for j in snames])

est_counts.write(open("est_counts.cdt","w"))
In [ ]: