IPython line magics

In [1]:
%pwd
Out[1]:
'/home/explorer/BMS270/Notebooks'
In [2]:
%cd ~/BMS270/data/
/home/explorer/BMS270/data
In [3]:
%ls
sample_table.csv     sample_table.windows
sample_table.oldmac  SRR7541452.6M.fastq.gz

Bash commands

In [4]:
!ls
sample_table.csv     sample_table.windows
sample_table.oldmac  SRR7541452.6M.fastq.gz
In [5]:
!head -c 64 sample_table.csv | hexdump -C
00000000  52 75 6e 2c 64 65 73 63  2c 63 6f 6e 64 69 74 69  |Run,desc,conditi|
00000010  6f 6e 2c 6d 75 63 6f 72  2c 62 61 74 63 68 2c 72  |on,mucor,batch,r|
00000020  65 70 0a 53 52 52 37 35  34 31 34 33 35 2c 22 4d  |ep.SRR7541435,"M|
00000030  75 63 6f 72 20 61 74 66  31 20 6d 75 74 61 6e 74  |ucor atf1 mutant|
00000040

Coverting among hexadecimal, decimal, and strings in python

In [6]:
0x52
Out[6]:
82
In [7]:
chr(0x52)
Out[7]:
'R'
In [8]:
"\x52"
Out[8]:
'R'
In [9]:
ord("R")
Out[9]:
82
In [10]:
"%x" % ord("R")
Out[10]:
'52'

Reading a text file in python

In [11]:
fname = "sample_table.csv"
In [12]:
fname
Out[12]:
'sample_table.csv'
In [13]:
type(fname)
Out[13]:
str
In [14]:
fp = open(fname)
In [15]:
fp
Out[15]:
<_io.TextIOWrapper name='sample_table.csv' mode='r' encoding='UTF-8'>
In [16]:
fp.readline()
Out[16]:
'Run,desc,condition,mucor,batch,rep\n'
In [17]:
fp.readline()
Out[17]:
'SRR7541435,"Mucor atf1 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate b",mac,atf1,2,b\n'
In [18]:
fp.readline()
Out[18]:
'SRR7541439,"Mucor atf2 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate b",mac,atf2,2,b\n'
In [19]:
fp.readline().rstrip()
Out[19]:
'SRR7541438,"Mucor atf2 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate a",mac,atf2,2,a'
In [20]:
fp.readline().rstrip().split(",")
Out[20]:
['SRR7541434',
 '"Mucor atf1 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours',
 ' replicate a"',
 'mac',
 'atf1',
 '2',
 'a']
In [21]:
from csv import reader
In [22]:
r = reader(fp)
In [23]:
next(r)
Out[23]:
['SRR7541453',
 'Mucor R7B virulent strain cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate b',
 'mac',
 'R7B',
 '1',
 'b']
In [24]:
row = next(r)
In [25]:
row
Out[25]:
['SRR7541452',
 'Mucor R7B virulent strain cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate a',
 'mac',
 'R7B',
 '1',
 'a']

Other newline conventions

  • The old MacOS convention was \r
  • The DOS/Windows convention is \r\n
In [26]:
!head -c 64 sample_table.oldmac | hexdump -C
00000000  52 75 6e 2c 64 65 73 63  2c 63 6f 6e 64 69 74 69  |Run,desc,conditi|
00000010  6f 6e 2c 6d 75 63 6f 72  2c 62 61 74 63 68 2c 72  |on,mucor,batch,r|
00000020  65 70 0d 53 52 52 37 35  34 31 34 33 35 2c 22 4d  |ep.SRR7541435,"M|
00000030  75 63 6f 72 20 61 74 66  31 20 6d 75 74 61 6e 74  |ucor atf1 mutant|
00000040
In [27]:
fname = "sample_table.oldmac"
fp = open(fname)
fp.readline()
Out[27]:
'Run,desc,condition,mucor,batch,rep\n'
In [28]:
fp
Out[28]:
<_io.TextIOWrapper name='sample_table.oldmac' mode='r' encoding='UTF-8'>
In [29]:
fp = open(fname,newline="\n")
In [30]:
fp.readline()
Out[30]:
'Run,desc,condition,mucor,batch,rep\rSRR7541435,"Mucor atf1 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate b",mac,atf1,2,b\rSRR7541439,"Mucor atf2 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate b",mac,atf2,2,b\rSRR7541438,"Mucor atf2 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate a",mac,atf2,2,a\rSRR7541434,"Mucor atf1 mutants cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate a",mac,atf1,2,a\rSRR7541453,"Mucor R7B virulent strain cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate b",mac,R7B,1,b\rSRR7541452,"Mucor R7B virulent strain cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate a",mac,R7B,1,a\rSRR7541437,"Mucor atf1 mutants singlecultured in MMC medium during 24 hours, replicate b",mmc,atf1,2,b\rSRR7541450,"Mucor R7B virulent strain singlecultured in cell media for 5 hours, replicate a",cm,R7B,1,a\rSRR7541457,"Mucor R7B virulent strain singlecultured in MMC medium during 24 hrs, used as a control for both atf mutants, rep b",mmc,R7B,2,b\rSRR7541451,"Mucor R7B virulent strain singlecultured in cell media for 5 hours, replicate b",cm,R7B,1,b\rSRR7541455,"Mucor R7B virulent cocultured w/ macrophages (cell line J774A.1) for 5 hrs used as a control for both atf mutants, rep b",mac,R7B,2,b\rSRR7541454,"Mucor R7B virulent cocultured w/ macrophages (cell line J774A.1) for 5 hrs used as a control for both atf mutants, rep a",mac,R7B,2,a\rSRR7541436,"Mucor atf1 mutants singlecultured in MMC medium during 24 hours, replicate a",mmc,atf1,2,a\rSRR7541456,"Mucor R7B virulent strain singlecultured in MMC medium during 24 hrs, used as a control for both atf mutants, rep a",mmc,R7B,2,a\rSRR7541444,"Mouse macrophages (cell line J774A.1) singlecultured in cell media for 5 hours, replicate c",mac,un,1,c\rSRR7541445,"Mouse macrophages (cell line J774A.1) singlecultured in cell media for 5 hours, replicate d",mac,un,1,d\rSRR7541446,"Mucor NRRL3631 avirulent strain singlecultured in cell media for 5 hours, replicate a",cm,NRRL3631,1,a\rSRR7541447,"Mucor NRRL3631 avirulent strain singlecultured in cell media for 5 hours, replicate b",cm,NRRL3631,1,b\rSRR7541440,"Mucor atf2 mutants singlecultured in MMC medium during 24 hours, replicate a",mmc,atf2,2,a\rSRR7541441,"Mucor atf2 mutants singlecultured in MMC medium during 24 hours, replicate b",mmc,atf2,2,b\rSRR7541442,"Mouse macrophages (cell line J774A.1) singlecultured in cell media for 5 hours, replicate a",mac,un,1,a\rSRR7541443,"Mouse macrophages (cell line J774A.1) singlecultured in cell media for 5 hours, replicate b",mac,un,1,b\rSRR7541448,"Mucor NRRL3631 avirulent strain cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate a",mac,NRRL3631,1,a\rSRR7541449,"Mucor NRRL3631 avirulent strain cocultured with mouse macrophages (cell line J774A.1) for 5 hours, replicate b",mac,NRRL3631,1,b\r'
In [31]:
!head -c 64 sample_table.windows | hexdump -C
00000000  52 75 6e 2c 64 65 73 63  2c 63 6f 6e 64 69 74 69  |Run,desc,conditi|
00000010  6f 6e 2c 6d 75 63 6f 72  2c 62 61 74 63 68 2c 72  |on,mucor,batch,r|
00000020  65 70 0d 0a 53 52 52 37  35 34 31 34 33 35 2c 22  |ep..SRR7541435,"|
00000030  4d 75 63 6f 72 20 61 74  66 31 20 6d 75 74 61 6e  |Mucor atf1 mutan|
00000040
In [32]:
fname = "sample_table.windows"
fp = open(fname)
fp.readline()
Out[32]:
'Run,desc,condition,mucor,batch,rep\n'
In [33]:
fp = open(fname,newline="\n")
fp.readline()
Out[33]:
'Run,desc,condition,mucor,batch,rep\r\n'

Reading a binary file in python

In [34]:
!head -c 64 SRR7541452.6M.fastq.gz | hexdump -C
00000000  1f 8b 08 00 c9 cc aa 5c  00 03 ac fd db ae ec b8  |.......\........|
00000010  12 2d 88 bd 9f af 68 e0  3c 36 ec 12 75 57 22 37  |.-....h.<6..uW"7|
00000020  52 9c 94 44 25 0c ec 87  b6 60 f7 0f 34 ec 07 c3  |R..D%....`..4...|
00000030  4f fe 7f 98 11 23 a8 4c  86 72 4a b3 92 3d ab 6a  |O....#.L.rJ..=.j|
00000040
In [35]:
fname = "SRR7541452.6M.fastq.gz"
In [36]:
fp = open(fname)
In [38]:
fp.readline()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-38-b93e1db84839> in <module>()
----> 1 fp.readline()

/usr/lib/python3.5/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte
In [39]:
fp = open("SRR7541452.6M.fastq.gz","rb")
In [40]:
fp
Out[40]:
<_io.BufferedReader name='SRR7541452.6M.fastq.gz'>
In [41]:
fp.read(10)
Out[41]:
b'\x1f\x8b\x08\x00\xc9\xcc\xaa\\\x00\x03'
  • 1f 8b is the gzip magic
  • 08 specifies DEFLATE compression
  • 00 means no flags set
  • c9 cc aa 5c is MTIME
  • 00 means no extra flags set
  • 03 means generated on Unix (Linux)
In [42]:
fp.seek(0)
s = fp.read(10)
In [43]:
s
Out[43]:
b'\x1f\x8b\x08\x00\xc9\xcc\xaa\\\x00\x03'
In [44]:
s[4:8]
Out[44]:
b'\xc9\xcc\xaa\\'
In [45]:
import time

t = 0
for i in reversed(s[4:8]):
    t = t << 8
    t += i
t

time.strftime("%Y-%m-%d %H:%M:%s", time.localtime(t))
Out[45]:
'2019-04-07 21:23:1554697417'
In [46]:
fp.close()
In [47]:
import gzip
In [48]:
fp = gzip.open(fname)
In [49]:
fp
Out[49]:
<gzip _io.BufferedReader name='SRR7541452.6M.fastq.gz' 0x7fb4d4056160>
In [50]:
fp.readline()
Out[50]:
b'@SRR7541452.22752346 HWI-D00148:479:HVK5NBCXX:2:2105:2570:87783 length=51\n'
In [51]:
fp.readline()
Out[51]:
b'GTTTCTCCAGCTGCTCCTGCGTCGCGACAGGGAGAGACTTAATCTTATTCA\n'
In [52]:
fp.readline()
Out[52]:
b'+SRR7541452.22752346 HWI-D00148:479:HVK5NBCXX:2:2105:2570:87783 length=51\n'
In [53]:
fp.readline()
Out[53]:
b'DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n'
In [54]:
fp = gzip.open(fname,"rt")
In [55]:
fp
Out[55]:
<_io.TextIOWrapper name='SRR7541452.6M.fastq.gz' encoding='UTF-8'>
In [56]:
fp.readline()
Out[56]:
'@SRR7541452.22752346 HWI-D00148:479:HVK5NBCXX:2:2105:2570:87783 length=51\n'
In [57]:
for i in range(10):
    print(fp.readline())
GTTTCTCCAGCTGCTCCTGCGTCGCGACAGGGAGAGACTTAATCTTATTCA

+SRR7541452.22752346 HWI-D00148:479:HVK5NBCXX:2:2105:2570:87783 length=51

DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

@SRR7541452.26058354 HWI-D00148:479:HVK5NBCXX:2:2204:10192:56738 length=51

GGCTGATAGGAGGTTTGTAATAACTGTGGCACCTCAGAATGATATTTGTCC

+SRR7541452.26058354 HWI-D00148:479:HVK5NBCXX:2:2204:10192:56738 length=51

DDDDDIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

@SRR7541452.12133545 HWI-D00148:479:HVK5NBCXX:1:2206:7770:75707 length=51

GGCCGCTCGGCACTCCTTTGTGGCTTGCTCATAGATTAGCTGTTCTATCAG

+SRR7541452.12133545 HWI-D00148:479:HVK5NBCXX:1:2206:7770:75707 length=51

Parsing Illumina base qualities

Every 4th line of the FASTQ file gives the per-base quality scores (probability of a miscall) encoded as single characters.

$P = 10^{\frac{-Q}{10}}$

$Q = ord(C)-q_0$

$q_0 = 33$ 64 for older Illumina files

In [58]:
q0 = 33
C = "D"
Q = ord(C)-q0
P = 10**(-Q/10)
P
Out[58]:
0.00031622776601683794
In [59]:
C = "I"
Q = ord(C)-q0
P = 10**(-Q/10)
P
Out[59]:
0.0001
In [60]:
fp.seek(0)
header = fp.readline().rstrip()
print(header)
seq = fp.readline().rstrip()
print(seq)
header2 = fp.readline().rstrip()
print(header2)
quals = fp.readline().rstrip()
print(quals)
@SRR7541452.22752346 HWI-D00148:479:HVK5NBCXX:2:2105:2570:87783 length=51
GTTTCTCCAGCTGCTCCTGCGTCGCGACAGGGAGAGACTTAATCTTATTCA
+SRR7541452.22752346 HWI-D00148:479:HVK5NBCXX:2:2105:2570:87783 length=51
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
In [61]:
Qs = []
for C in quals:
    Q = ord(C)-q0
    Qs.append(Q)
print(Qs)
[35, 35, 35, 35, 35, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40]
In [62]:
len(seq)
Out[62]:
51
In [63]:
%matplotlib nbagg
import matplotlib.pyplot as plt
In [64]:
fig = plt.figure()
plt.plot(Qs)
Out[64]:
[<matplotlib.lines.Line2D at 0x7fb4b69a8a58>]
In [65]:
for i in range(20):
    header = fp.readline().rstrip()
    seq = fp.readline().rstrip()
    header2 = fp.readline().rstrip()
    quals = fp.readline().rstrip()
    Qs = []
    for C in quals:
        Q = ord(C)-q0
        Qs.append(Q)
    plt.plot(Qs)