Goal: Look at the mapped reads relative to sequencing depth for Mucor and Mouse, as a guide for picking sample files and their respective depths.

In [1]:
%cd ~/pdf/papers/mBio_10_e02765/kallisto/
/home/mvoorhie/pdf/papers/mBio_10_e02765/kallisto

In [2]:
%matplotlib
import matplotlib.pyplot as plt
Using matplotlib backend: TkAgg

In [3]:
import re
In [4]:
from MsvUtil import Table
In [5]:
table = Table.fromCsv("../sample_table.csv")
print(len(table))
print(table[0])
24
Run:		SRR7541435
desc:		Mucor atf1 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate b
condition:		mac
mucor:		atf1
batch:		2
rep:		b

In [6]:
set(table["mucor"])
Out[6]:
{'NRRL3631', 'R7B', 'atf1', 'atf2', 'un'}
In [7]:
set(table["condition"])
Out[7]:
{'cm', 'mac', 'mmc'}
In [8]:
log_re = re.compile("processed (?P<total>[\d,]+) reads, (?P<aligned>[\d,]+) reads pseudoaligned")
In [9]:
depths = []
mucor_mapped = []
for i in table:
    for line in open("{run}_rf.err".format(run=i["Run"])):
        p = log_re.search(line)
        if(p is not None):
            depths.append(int(p.group("total").replace(",","")))
            mucor_mapped.append(int(p.group("aligned").replace(",","")))
            break
    else:
        raise ValueError, "Couldn't find depth line in kallisto quant log"
In [10]:
mouse_mapped = []
for (i,d) in zip(table,depths):
    for line in open("{run}_rf_GRCm38.err".format(run=i["Run"])):
        p = log_re.search(line)
        if(p is not None):
            assert(d == int(p.group("total").replace(",","")))
            mouse_mapped.append(int(p.group("aligned").replace(",","")))
            break
    else:
        raise ValueError, "Couldn't find depth line in kallisto quant log"
In [11]:
def color(sample):
    if(sample["mucor"] == "un"):
        return "red"
    elif(sample["condition"] == "mac"):
        if(sample["batch"] == "1"):
            return "blue"
        else:
            return "cyan"
    else:
        return "green"
In [12]:
fig = plt.figure()
plt.scatter(depths,mucor_mapped,c = [color(i) for i in table])
plt.xlabel("Total reads")
plt.ylabel("Reads pseudo-aligned to transcriptome")
plt.title("Mucor kallisto")
fig
Out[12]:
  • For Mucor alone, pseudoalignment is linear with depth
  • For no Mucor, there is only a background level of pseudoalignment (a few thousand reads)
  • For infected, there is a weak correlation of pseudoalignment with depth, which might be improved if we label the infections by type
    • In any event, the amount of pseuoalignment (yield) is quite a bit lower than what we would expect for mucor alone
In [13]:
fig = plt.figure()
plt.scatter(depths,mouse_mapped,c = [color(i) for i in table])
plt.xlabel("Total reads")
plt.ylabel("Reads pseudo-aligned to transcriptome")
plt.title("Mouse kallisto")
fig
Out[13]:
  • For mucor alone, there are very few mouse pseudoaligned reads
  • For mouse alone, pseudoaligned reads are linear with depth
  • For infections, pseudoalignment is still pretty linear, with counts pretty close to what we'd expect for mouse alone
In [14]:
def color(sample):
    if(sample["mucor"] == "un"):
        return "red"
    elif(sample["condition"] == "mac"):
        if(sample["mucor"] == "R7B"):
            return "blue"
        elif(sample["mucor"] == "NRRL3631"):
            return "cyan"
        elif(sample["mucor"] == "atf1"):
            return "purple"
        else:
            return "magenta"
    else:
        return "green"

def edge(sample):
    if(sample["batch"] == "1"):
        return "grey"
    else:
        return "black"
In [15]:
fig = plt.figure()
plt.scatter(depths,mucor_mapped,c = [color(i) for i in table], edgecolors = [edge(i) for i in table])
plt.xlabel("Total reads")
plt.ylabel("Reads pseudo-aligned to transcriptome")
plt.title("Mucor kallisto")
fig
Out[15]:
  • Best Mucor yields in the infections are for the more virulent (R7B) strain in the first infection
  • Outside of that, there is really no pattern
In [16]:
fig = plt.figure()
plt.scatter(depths,mouse_mapped,c = [color(i) for i in table], edgecolors = [edge(i) for i in table])
plt.xlabel("Total reads")
plt.ylabel("Reads pseudo-aligned to transcriptome")
plt.title("Mouse kallisto")
fig
Out[16]:
  • Mouse yields are relatively low for the two good Mucor yields, as expected.
  • The worst mouse yield is also a low mucor yield, so that may say more about, e.g., rRNA levels in that sample.
In [17]:
fig = plt.figure()
plt.scatter(mucor_mapped,mouse_mapped,c = [color(i) for i in table], edgecolors = [edge(i) for i in table])
plt.xlabel("Reads pseudo-aligned to mucor")
plt.ylabel("Reads pseudo-aligned to mouse")
plt.title("Mouse/Mucor kallisto")
fig
Out[17]:
  • This is the most informative sort of the relative mapping
  • For the first experiment, there are relatively more mapped reads for the more virulent vs. less virulent strain
    • This experiment also had more mapped reads overall (mostly, or entirely, due to higher sequencing depth)
  • For the second experiment, the more virulent (WT) strain had relatively fewer mapped reads compared to the two aft mutants (which had attenuated virulence in the immunosuppressed mouse model). They don't really give CFUs for these infections (fig 4a shows poorer growth period for spores recovered from macrophages, but it's not clear if these are the same samples that were sequenced)
  • So, best teaching example is the lower depth more virulent first infection: SRR7541452
    • All ~29 million reads = 1.3 GB, of which ~5 million map to mucor
    • So, 20% should be on the high side of reasonable file size and will give enough for a reasonable profile
      • Meaning, we should sample 6 million reads
        • Can then subsample this as exercise as well as for first teaching sample
mvoorhie@ayanganna:~/pdf/papers/mBio_10_e02765/SRP154454$ time zcat SRR7541452_1.fastq.gz | \
  stream_sampler.py -s 42 -f fastq - 6000000 | gzip > ~/SRR7541452.6M.fastq.gz

real    2m31.357s
user    3m13.668s
sys     0m5.820s

That gives 313M, which is big, but let's start with it anyway.

In [17]: