Goal: Look at the mapped reads relative to sequencing depth for Mucor and Mouse, as a guide for picking sample files and their respective depths.

%cd ~/pdf/papers/mBio_10_e02765/kallisto/

/home/mvoorhie/pdf/papers/mBio_10_e02765/kallisto

%matplotlib
import matplotlib.pyplot as plt

Using matplotlib backend: TkAgg

import re

from MsvUtil import Table

table = Table.fromCsv("../sample_table.csv")
print(len(table))
print(table[0])

24
Run:		SRR7541435
desc:		Mucor atf1 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate b
condition:		mac
mucor:		atf1
batch:		2
rep:		b

set(table["mucor"])

{'NRRL3631', 'R7B', 'atf1', 'atf2', 'un'}

set(table["condition"])

{'cm', 'mac', 'mmc'}

log_re = re.compile("processed (?P<total>[\d,]+) reads, (?P<aligned>[\d,]+) reads pseudoaligned")

depths = []
mucor_mapped = []
for i in table:
    for line in open("{run}_rf.err".format(run=i["Run"])):
        p = log_re.search(line)
        if(p is not None):
            depths.append(int(p.group("total").replace(",","")))
            mucor_mapped.append(int(p.group("aligned").replace(",","")))
            break
    else:
        raise ValueError, "Couldn't find depth line in kallisto quant log"

mouse_mapped = []
for (i,d) in zip(table,depths):
    for line in open("{run}_rf_GRCm38.err".format(run=i["Run"])):
        p = log_re.search(line)
        if(p is not None):
            assert(d == int(p.group("total").replace(",","")))
            mouse_mapped.append(int(p.group("aligned").replace(",","")))
            break
    else:
        raise ValueError, "Couldn't find depth line in kallisto quant log"

def color(sample):
    if(sample["mucor"] == "un"):
        return "red"
    elif(sample["condition"] == "mac"):
        if(sample["batch"] == "1"):
            return "blue"
        else:
            return "cyan"
    else:
        return "green"

fig = plt.figure()
plt.scatter(depths,mucor_mapped,c = [color(i) for i in table])
plt.xlabel("Total reads")
plt.ylabel("Reads pseudo-aligned to transcriptome")
plt.title("Mucor kallisto")
fig

For Mucor alone, pseudoalignment is linear with depth
For no Mucor, there is only a background level of pseudoalignment (a few thousand reads)
For infected, there is a weak correlation of pseudoalignment with depth, which might be improved if we label the infections by type
- In any event, the amount of pseuoalignment (yield) is quite a bit lower than what we would expect for mucor alone

fig = plt.figure()
plt.scatter(depths,mouse_mapped,c = [color(i) for i in table])
plt.xlabel("Total reads")
plt.ylabel("Reads pseudo-aligned to transcriptome")
plt.title("Mouse kallisto")
fig

For mucor alone, there are very few mouse pseudoaligned reads
For mouse alone, pseudoaligned reads are linear with depth
For infections, pseudoalignment is still pretty linear, with counts pretty close to what we'd expect for mouse alone

def color(sample):
    if(sample["mucor"] == "un"):
        return "red"
    elif(sample["condition"] == "mac"):
        if(sample["mucor"] == "R7B"):
            return "blue"
        elif(sample["mucor"] == "NRRL3631"):
            return "cyan"
        elif(sample["mucor"] == "atf1"):
            return "purple"
        else:
            return "magenta"
    else:
        return "green"

def edge(sample):
    if(sample["batch"] == "1"):
        return "grey"
    else:
        return "black"

fig = plt.figure()
plt.scatter(depths,mucor_mapped,c = [color(i) for i in table], edgecolors = [edge(i) for i in table])
plt.xlabel("Total reads")
plt.ylabel("Reads pseudo-aligned to transcriptome")
plt.title("Mucor kallisto")
fig

Best Mucor yields in the infections are for the more virulent (R7B) strain in the first infection
Outside of that, there is really no pattern

fig = plt.figure()
plt.scatter(depths,mouse_mapped,c = [color(i) for i in table], edgecolors = [edge(i) for i in table])
plt.xlabel("Total reads")
plt.ylabel("Reads pseudo-aligned to transcriptome")
plt.title("Mouse kallisto")
fig

Mouse yields are relatively low for the two good Mucor yields, as expected.
The worst mouse yield is also a low mucor yield, so that may say more about, e.g., rRNA levels in that sample.

fig = plt.figure()
plt.scatter(mucor_mapped,mouse_mapped,c = [color(i) for i in table], edgecolors = [edge(i) for i in table])
plt.xlabel("Reads pseudo-aligned to mucor")
plt.ylabel("Reads pseudo-aligned to mouse")
plt.title("Mouse/Mucor kallisto")
fig

This is the most informative sort of the relative mapping
For the first experiment, there are relatively more mapped reads for the more virulent vs. less virulent strain
- This experiment also had more mapped reads overall (mostly, or entirely, due to higher sequencing depth)
For the second experiment, the more virulent (WT) strain had relatively fewer mapped reads compared to the two aft mutants (which had attenuated virulence in the immunosuppressed mouse model). They don't really give CFUs for these infections (fig 4a shows poorer growth period for spores recovered from macrophages, but it's not clear if these are the same samples that were sequenced)

So, best teaching example is the lower depth more virulent first infection: SRR7541452
- All ~29 million reads = 1.3 GB, of which ~5 million map to mucor
- So, 20% should be on the high side of reasonable file size and will give enough for a reasonable profile
  - Meaning, we should sample 6 million reads
    - Can then subsample this as exercise as well as for first teaching sample

mvoorhie@ayanganna:~/pdf/papers/mBio_10_e02765/SRP154454$ time zcat SRR7541452_1.fastq.gz | \
  stream_sampler.py -s 42 -f fastq - 6000000 | gzip > ~/SRR7541452.6M.fastq.gz

real    2m31.357s
user    3m13.668s
sys     0m5.820s

That gives 313M, which is big, but let's start with it anyway.