Normalizing strings to upper case

String comparisons in Python are case-sensitive. Most of the time when we are dealing with sequence data, we do not want case to be meaningful, so it's useful to normalize all of our sequences to uppercase.

This is related to our discussion of coercing all input data to floats in the pearson function. In both cases, we are sanitizing our input to a known state that is consistent with the assumptions of the code that follows (e.g., that all start codons are equivalent, even if we read them from a file that contains both "atg" and "ATG").
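For example, here is a minimal sketch of the idea (the cells below demonstrate the method itself):

codon_from_file = "atg"                              # read from a lower-case file ...
reference_codon = "ATG"                              # ... compared against an upper-case reference
print(codon_from_file.upper() == reference_codon)    # True once both are in a known case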

Here I use the string upper method to convert a string to uppercase

In [1]:
dna = "ATGcaGTa".upper()
In [2]:
dna
Out[2]:
'ATGCAGTA'

There is also a lower method. It does what you would expect.

In [3]:
dna.lower()
Out[3]:
'atgcagta'

upper and lower return modified copies of the string, but do not change the original string (they are not "in-place" modifications).

In [4]:
dna
Out[4]:
'ATGCAGTA'

Strings are iterable, so we can use them in for loops.

In [5]:
for i in dna:
    print i
    print "hello"
A
hello
T
hello
G
hello
C
hello
A
hello
G
hello
T
hello
A
hello

Python references

What happens when we re-use the name of the built-in sum function?

In [6]:
x = [1,5,7.2,3]
In [7]:
sum = 0
for i in x:
    sum += i
m = sum/float(len(x))
In [8]:
m
Out[8]:
4.05

sum is now 16.2, not a function, so we can't call it

In [9]:
m = sum(x)/float(len(x))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-dd574a952194> in <module>()
----> 1 m = sum(x)/float(len(x))

TypeError: 'float' object is not callable

Fortunately, we still have a reference to the original sum function, so we can restore the original name

In [10]:
sum = __builtin__.sum
In [11]:
m = sum(x)/float(len(x))
In [12]:
m
Out[12]:
4.05
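Another way to recover the built-in (instead of re-binding it from __builtin__) is simply to delete our shadowing name; name lookup then falls through to the built-ins again. A quick sketch:

del sum              # remove our variable; the name resolves to the built-in again
sum([1, 5, 7.2, 3])  # 16.2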

What happens when we have multiple references to the same list?

In [13]:
x
Out[13]:
[1, 5, 7.2, 3]

Make x and y point to the same list

In [14]:
y = x
In [15]:
y
Out[15]:
[1, 5, 7.2, 3]

Since x and y point to the same list, they see the same modifications

In [16]:
y.append("b")
In [17]:
y
Out[17]:
[1, 5, 7.2, 3, 'b']
In [18]:
x
Out[18]:
[1, 5, 7.2, 3, 'b']

Make z a distinct copy of y

In [19]:
z = y[:]

z can now be modified independently

In [20]:
z
Out[20]:
[1, 5, 7.2, 3, 'b']
In [21]:
z.append("c")
In [22]:
z
Out[22]:
[1, 5, 7.2, 3, 'b', 'c']
In [23]:
y
Out[23]:
[1, 5, 7.2, 3, 'b']
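A quick way to check which names refer to the same object is the is operator (the id function, which returns an object's identity, works too):

x is y    # True  -- two names for the same list
z is y    # False -- z is an independent copy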

A similar exercise for elements of a list

This assignment makes a copy of the value of y[2], specifically because y[2] is a float. If it were a list, then a would be a reference to that list, not a copy (see the short sketch a few cells below).

In [24]:
a = y[2]
In [25]:
a
Out[25]:
7.2
In [26]:
a += 1
In [27]:
a
Out[27]:
8.2

So here, changing a didn't change y

In [28]:
y
Out[28]:
[1, 5, 7.2, 3, 'b']
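Here is the list case promised above, as a small sketch with a made-up nested list (not part of this session): indexing out an element that is itself a list gives another reference to that list, not a copy.

rows = [[1, 2], [3, 4]]
first = rows[0]        # first is another name for rows[0], not a copy
first.append(99)
print(rows)            # [[1, 2, 99], [3, 4]] -- the change is visible through rows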

Assignment to an indexed element does modify the list

In [29]:
y[2] = 3
In [30]:
y
Out[30]:
[1, 5, 3, 3, 'b']
In [31]:
y[2] += 1
In [32]:
y
Out[32]:
[1, 5, 4, 3, 'b']

Generators and generator expressions

Feel free to skip this digression

In [33]:
for i in range(10):
    print i**2
0
1
4
9
16
25
36
49
64
81

In [34]:
range(10)
Out[34]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [35]:
for i in xrange(10):
    print i**2
0
1
4
9
16
25
36
49
64
81

In [36]:
xrange(10)
Out[36]:
xrange(10)
In [37]:
help(xrange)
Help on class xrange in module __builtin__:

class xrange(object)
 |  xrange(stop) -> xrange object
 |  xrange(start, stop[, step]) -> xrange object
 |  
 |  Like range(), but instead of returning a list, returns an object that
 |  generates the numbers in the range on demand.  For looping, this is 
 |  slightly faster than range() and more memory efficient.
 |  
 |  Methods defined here:
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __iter__(...)
 |      x.__iter__() <==> iter(x)
 |  
 |  __len__(...)
 |      x.__len__() <==> len(x)
 |  
 |  __reduce__(...)
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __reversed__(...)
 |      Returns a reverse iterator.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T


In [38]:
x = xrange(10)
In [39]:
it = iter(x)
In [40]:
help(it)
Help on rangeiterator object:

class rangeiterator(object)
 |  Methods defined here:
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __iter__(...)
 |      x.__iter__() <==> iter(x)
 |  
 |  __length_hint__(...)
 |      Private method returning an estimate of len(list(it)).
 |  
 |  next(...)
 |      x.next() -> the next value, or raise StopIteration


In [41]:
it.next()
Out[41]:
0
In [42]:
it.next()
Out[42]:
1
In [43]:
[i**2 for i in xrange(10)]
Out[43]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In [44]:
g = (i**2 for i in xrange(10))
In [45]:
g
Out[45]:
<generator object <genexpr> at 0x7f05ad03c460>
In [46]:
it = iter(g)
In [47]:
it.next()
Out[47]:
0
In [48]:
g = (i**2 for i in xrange(10))
In [49]:
for i in g:
    print i
0
1
4
9
16
25
36
49
64
81
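Note that In [48] re-created g rather than reusing the generator from In [44]: a generator can only be walked through once. A quick sketch of that behavior:

g = (i**2 for i in xrange(3))
list(g)    # [0, 1, 4]
list(g)    # []  -- the generator is exhausted after the first pass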

end of generator digression

Aside: what kind of error do we get when we forget the parentheses on a function call?

In [50]:
sum/len(x)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-4ba57528d2d3> in <module>()
----> 1 sum/len(x)

TypeError: unsupported operand type(s) for /: 'builtin_function_or_method' and 'int'
In [51]:
sum
Out[51]:
<function sum>
In [52]:
sum(x)
Out[52]:
45

Parsing files

Using IPython magics to find the files we downloaded

Get our current working directory

In [53]:
%pwd
Out[53]:
u'/home/mvoorhie/Projects/Courses/PracticalBioinformatics/python/MikeNotebooks'

List the files in the working directory (in class, I used the less portable "!ls", which doesn't work on Windows)

In [54]:
%ls
Day1.html        Day1prerun.ipynb  Day2_warmups.ipynb  iNotifyTest.ipynb~  SB1405F_ryp1.bam  SB1405G_ryp1.sam  SB1405_ryp1.zip  Warm-up.ipynb
Day1.ipynb       Day2.html         iNotifyTest.html    SB1405E_ryp1.bam    SB1405F_ryp1.sam  SB1405H_ryp1.bam  supp2data.cdt
Day1prerun.html  Day2.ipynb        iNotifyTest.ipynb   SB1405E_ryp1.sam    SB1405G_ryp1.bam  SB1405H_ryp1.sam  Warm-up.html

Confirm that we can open the downloaded file

In [55]:
fp = open("supp2data.cdt")

Read the first 100 bytes

In [56]:
fp.read(100)
Out[56]:
'ORF\tNAME\talpha 0\talpha 7\talpha 14\talpha 21\talpha 28\talpha 35\talpha 42\talpha 49\talpha 56\talpha 63\talp'

Read the next 100 bytes (note that read has the side-effect of advancing the file pointer, so we get a different result each time rather than always getting the first 100 bytes)

In [57]:
fp.read(100)
Out[57]:
'ha 70\talpha 77\talpha 84\talpha 91\talpha 98\talpha 105\talpha 112\talpha 119\tElu 0\tElu 30\tElu 60\tElu 90\tE'
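One way to watch the file pointer advance is the file object's tell method, which reports the current byte offset (a small sketch, not part of the original session):

fp.tell()     # 200 -- we have read two 100-byte chunks so far
fp.read(50)   # returns the next 50 bytes
fp.tell()     # 250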

Example 1: BAM (a binary read alignment format)

Open the file

In [58]:
fp = open("SB1405E_ryp1.bam")

Read the first 100 bytes

  • Note that most of these bytes don't correspond to ASCII characters, so Python prints them in hexadecimal format (e.g., "\xff")
  • Where the bytes do correspond to an ASCII character (e.g., "#"), it is coincidental.
In [59]:
fp.read(100)
Out[59]:
"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC\x02\x00\xb6\x01\x9d\x91\xcdJ\xdcP\x14\xc7\x8f\x15)\x11\x04\xddTh)\x84Y\x89\x90x?s\xef\xcd\xc2\xda\x19E\x05\x9d~\xa4\x16t#\xa9\x13mp\xccd2\x89\x1fO\xe1K\xb8r\xe3[\xf8\x06>\x85;\xa1\xabn\xcc\\)\x15{\xb2\xe9\xe2n~\xe7w\xcf=\xe7\x7f\xdb\x1f\xb7'n"

Here's how to find the numeric value of an ASCII character (which Python prints as a decimal number)

In [60]:
ord("=")
Out[60]:
61

Here's how to convert the numeric value back to an ASCII character

In [61]:
chr(61)
Out[61]:
'='

Figuring out the hexadecimal representation of decimal 61 ($61 = 3 \times 16 + 13$)

In [62]:
61/16
Out[62]:
3
In [63]:
61%16
Out[63]:
13

Confirming that hexadecimal 3D is decimal 61

In [64]:
0x3d
Out[64]:
61
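Python can also do these conversions for us, with the built-in hex function and the base argument to int:

hex(61)          # '0x3d'
int("3d", 16)    # 61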

General conclusion -- binary formats are a headache to debug by hand and often have portability issues. Where possible, we will prefer text formats (or an interconvertible text/binary pair, such as SAM/BAM).

Example 2: SAM (a text alignment format)

SB1405E_ryp1.sam is isomorphic to SB1405E_ryp1.bam -- we can convert between the two with no loss of information (i.e., they have the same information content, provided we already have full knowledge of both file formats).

In [65]:
fp = open("SB1405E_ryp1.sam")

Note that all of the bytes of the SAM file correspond to ASCII characters.

There are two special characters (which would normally be printed as whitespace):

  • \t = tab (which in this context delimits fields within a line of the file)
  • \n = newline (which delimits lines in a UNIX text file)
In [66]:
fp.read(100)
Out[66]:
'@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:Supercontig_1.1\tLN:5934958\n@SQ\tSN:Supercontig_1.2\tLN:4086229\n@SQ\tSN:Su'
In [67]:
fp.read(100)
Out[67]:
'percontig_1.3\tLN:3712251\n@SQ\tSN:Supercontig_1.4\tLN:3655338\n@SQ\tSN:Supercontig_1.5\tLN:3618900\n@SQ\tSN:'
In [68]:
fp.read(500)
Out[68]:
'Supercontig_1.6\tLN:1928946\n@SQ\tSN:Supercontig_1.7\tLN:1790487\n@SQ\tSN:Supercontig_1.8\tLN:1570074\n@SQ\tSN:Supercontig_1.9\tLN:474902\n@SQ\tSN:Supercontig_1.10\tLN:349507\n@SQ\tSN:Supercontig_1.11\tLN:256943\n@SQ\tSN:Supercontig_1.12\tLN:75818\n@SQ\tSN:Supercontig_1.13\tLN:12082\n@PG\tID:Bowtie\tVN:0.12.7\tCL:"bowtie -tS -p 4 -k 1 --un SB1405E.stripped.20.decoy.unaligned.CpSilveira.k1.unaligned.fastq --max SB1405E.stripped.20.decoy.unaligned.CpSilveira.k1.maxed.fastq --phred33-quals CpSilveira_bowtie -"\nHWI-ST640:854'

We rewind the file by re-opening it.

For the case of a random-access file stream, like this one, we could also use fp.seek(0), but this does not generalize to all file streams (e.g., data being decompressed, or streamed from the web or from a piece of equipment).
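For comparison, here is what the seek version would look like (a sketch, not part of the session):

fp.seek(0)    # rewind to the beginning of the file
fp.read(4)    # '@HD\t' -- back at the start of the header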

In [69]:
fp = open("SB1405E_ryp1.sam")

Because this is a text file, we can use readline to read one newline-delimited line at a time

In [70]:
fp.readline()
Out[70]:
'@HD\tVN:1.0\tSO:unsorted\n'
In [71]:
fp.readline()
Out[71]:
'@SQ\tSN:Supercontig_1.1\tLN:5934958\n'
In [72]:
line = fp.readline()
line
Out[72]:
'@SQ\tSN:Supercontig_1.2\tLN:4086229\n'

Note that the first few lines are all special header records, which start with an at sign (@). We can burn through all of these with a while loop

In [73]:
while(line[0] == "@"):
    line = fp.readline()
    print line
@SQ	SN:Supercontig_1.3	LN:3712251

@SQ	SN:Supercontig_1.4	LN:3655338

@SQ	SN:Supercontig_1.5	LN:3618900

@SQ	SN:Supercontig_1.6	LN:1928946

@SQ	SN:Supercontig_1.7	LN:1790487

@SQ	SN:Supercontig_1.8	LN:1570074

@SQ	SN:Supercontig_1.9	LN:474902

@SQ	SN:Supercontig_1.10	LN:349507

@SQ	SN:Supercontig_1.11	LN:256943

@SQ	SN:Supercontig_1.12	LN:75818

@SQ	SN:Supercontig_1.13	LN:12082

@PG	ID:Bowtie	VN:0.12.7	CL:"bowtie -tS -p 4 -k 1 --un SB1405E.stripped.20.decoy.unaligned.CpSilveira.k1.unaligned.fastq --max SB1405E.stripped.20.decoy.unaligned.CpSilveira.k1.maxed.fastq --phred33-quals CpSilveira_bowtie -"

HWI-ST640:854:C4Y5NACXX:7:2210:11769:63697	0	Supercontig_1.3	1125178	255	51M	*	0	0	GCAAGATGTCCGAATAAAGGAAAATCGAAAGAATGGCATGCGTCGGCGGTC	CCCFFFFFHHHHHIIJJJJJIJJJJJJJJJIJIJJJIJJJGJIJJIJJDBB	XA:i:0	MD:Z:51	NM:i:0


Parse the remaining lines of the file into a list of lists (i.e., a two-dimensional array) by splitting the tab-delimited fields with the string split method. We could have done this with a for loop, but I used a list comprehension instead.

Note that data does not contain the first non-header record (the final state of line in the previous cell). How could I change my code to include this line?

In [74]:
data = [i.split("\t") for i in fp]

Data has 8787 rows (each corresponding to a read alignment)

In [75]:
len(data)
Out[75]:
8787
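One possible answer to the question above (a sketch of a replacement for In [74], not run in this session) is to seed the list with the line left over from the while loop before reading the rest of the file; len(data) would then be 8788:

data = [line.split("\t")] + [i.split("\t") for i in fp]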

Here are the elements of the first row of data. You can take a look at the SAM documentation to see what each field means.

In [76]:
data[0]
Out[76]:
['HWI-ST640:854:C4Y5NACXX:7:2109:4863:75595',
 '0',
 'Supercontig_1.3',
 '1125185',
 '255',
 '51M',
 '*',
 '0',
 '0',
 'GTCCGAATAAAGGAAAATCGAAAGAATGGCATGCGTCGGCGGTCGCAAAAG',
 'CBCFFFFFHHHHHJJJJJJJJJJJJJIIJGJJJIIGIIIJJHBCDDDDDDD',
 'XA:i:0',
 'MD:Z:51',
 'NM:i:0\n']

Example 3: CDT (a text format for expression profiling data)

The extended CDT format is described in the JavaTreeView User's manual. Note that it is a simple table of floating point numbers (with missing values) plus some annotation information.

In [77]:
fp = open("supp2data.cdt")

The csv module is very handy for handling all of the special cases in Excel-generated comma-delimited (CSV) and tab-delimited text files -- in fact, it's fancier than what we need here, because Cluster3 and JavaTreeView don't support, e.g., fields with newline characters. For the specific case of reading supp2data.cdt, the csv module does the right thing.

In [78]:
from csv import reader, excel_tab

Wrap our filestream in a csv.reader, which will take care of splitting up each line into separate fields for us.

In [79]:
r = reader(fp, dialect = excel_tab)

Since the reader takes care of splitting on tabs, each row is now a list of strings (rather than a single string)

The first row of a simple CDT file has two special columns (ORF and NAME) plus the name of each experimental condition (in general, this refers to a two-color microarray hybridization of an experimental sample and a reference sample).

In [80]:
print r.next()
['ORF', 'NAME', 'alpha 0', 'alpha 7', 'alpha 14', 'alpha 21', 'alpha 28', 'alpha 35', 'alpha 42', 'alpha 49', 'alpha 56', 'alpha 63', 'alpha 70', 'alpha 77', 'alpha 84', 'alpha 91', 'alpha 98', 'alpha 105', 'alpha 112', 'alpha 119', 'Elu 0', 'Elu 30', 'Elu 60', 'Elu 90', 'Elu 120', 'Elu 150', 'Elu 180', 'Elu 210', 'Elu 240', 'Elu 270', 'Elu 300', 'Elu 330', 'Elu 360', 'Elu 390', 'cdc15 10', 'cdc15 30', 'cdc15 50', 'cdc15 70', 'cdc15 90', 'cdc15 110', 'cdc15 130', 'cdc15 150', 'cdc15 170', 'cdc15 190', 'cdc15 210', 'cdc15 230', 'cdc15 250', 'cdc15 270', 'cdc15 290', 'spo 0', 'spo 2', 'spo 5', 'spo 7', 'spo 9', 'spo 11', 'spo5 2', 'spo5 7', 'spo5 11', 'spo- early', 'spo- mid', 'heat 0', 'heat 10', 'heat 20', 'heat 40', 'heat 80', 'heat 160', 'dtt 15', 'dtt 30', 'dtt 60', 'dtt 120', 'cold 0', 'cold 20', 'cold 40', 'cold 160', 'diau a', 'diau b', 'diau c', 'diau d', 'diau e', 'diau f', 'diau g']

Each remaining line (for the case of a simple CDT file) gives the expression profile of a single gene. The first field is the gene's systematic name, followed by an annotation, followed by $\log_2$ ratios for each condition (or empty strings for missing values)

In [81]:
print r.next()
['YBR166C', 'TYR1   TYROSINE BIOSYNTHESIS    PREPHENATE DEHYDROGENASE (NADP+', '0.33', '-0.17', '0.04', '-0.07', '-0.09', '-0.12', '-0.03', '-0.2', '-0.06', '-0.06', '-0.14', '-0.18', '-0.06', '-0.25', '0.06', '-0.12', '0.25', '0.43', '0.21', '-0.04', '-0.15', '-0.04', '0.21', '-0.14', '-0.03', '-0.07', '-0.36', '-0.14', '-0.42', '-0.34', '-0.23', '-0.17', '0.23', '0.3', '0.41', '-0.07', '-0.23', '-0.12', '0.16', '0.74', '0.14', '-0.49', '-0.32', '0.19', '0.23', '0.24', '0.28', '1.13', '-0.12', '0.1', '', '0.66', '0.62', '0.08', '0.62', '0.43', '0.5', '-0.25', '-0.51', '-0.67', '0.21', '-0.74', '-0.36', '-0.01', '0.38', '0.15', '-0.22', '-0.09', '0.33', '0.08', '0.39', '-0.17', '0.23', '0.2', '0.2', '-0.17', '-0.69', '0.14', '-0.27']

In [82]:
print r.next()
['YOR357C', 'GRD19  SECRETION        GOLGI PROTEIN RETENTION', '-0.64', '-0.38', '-0.32', '-0.29', '-0.22', '-0.01', '-0.32', '-0.27', '-0.51', '-0.67', '-0.62', '-0.58', '-0.38', '-0.94', '-0.34', '-0.92', '-0.15', '0.03', '0.16', '-0.34', '-0.32', '-0.34', '-0.12', '-0.34', '-0.27', '-0.15', '-0.15', '-0.51', '-0.3', '-0.25', '-0.12', '-0.4', '0.98', '0.99', '0.25', '0.15', '0.08', '0.23', '0.18', '-0.29', '-0.45', '0.01', '-0.34', '-1.12', '-0.54', '-0.94', '-1.09', '-0.45', '-0.23', '-0.36', '0.08', '0.28', '0.18', '-0.12', '', '0.25', '-0.22', '-0.04', '-0.25', '0.04', '-0.03', '-0.07', '-0.04', '0.73', '-0.06', '0.54', '-0.09', '-0.29', '-0.1', '0.36', '-0.2', '-0.34', '-0.14', '-0.09', '0.06', '-0.17', '0.04', '-0.97', '-1.79']

To do the homework (writing a function to factor supp2data into four components), we need one more trick: representing a two-dimensional array (or matrix) as a list of lists

Here's a list of lists with 4 rows and 3 columns

(note that a list of lists doesn't have to be rectangular -- we could make each row a different length -- but the particular data that we're working with is rectangular).

In [83]:
x = [[1,2,3],
     [3,4,5],
     [6,7,9],
     [11,2,1]]

Indexing a row of a list of lists

In [84]:
x[0]
Out[84]:
[1, 2, 3]

Indexing an element of a list of lists (first row, third column)

In [85]:
x[0][2]
Out[85]:
3

Generating a list of lists with a pair of nested for loops (this may be useful for the homework)

In [86]:
y = []
for i in range(10):
    y.append([])
    for j in range(3):
        y[-1].append(i+j)
        
In [87]:
print y
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7], [6, 7, 8], [7, 8, 9], [8, 9, 10], [9, 10, 11]]

In [88]:
y
Out[88]:
[[0, 1, 2],
 [1, 2, 3],
 [2, 3, 4],
 [3, 4, 5],
 [4, 5, 6],
 [5, 6, 7],
 [6, 7, 8],
 [7, 8, 9],
 [8, 9, 10],
 [9, 10, 11]]
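The same table can be built with a nested list comprehension, equivalent to the loops above:

y = [[i + j for j in range(3)] for i in range(10)]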

Any data read from a text file arrives as strings. If we want to do math on it, we need to coerce it to a numerical type like float -- here's how

In [89]:
float("3.3")
Out[89]:
3.3

Because supp2data.cdt contains missing values, not all log ratio fields can be coerced to floats. Here's one possible way of dealing with the missing data:

In [90]:
if(i == ""):
    val = None
else:
    val = float(i)

Here's another

What are the trade-offs between these two methods? Can you think of a third alternative?

In [91]:
if(i == ""):
    val = 0.
else:
    val = float(i)
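One more possibility (a sketch of a third alternative; there are others) is to use a floating point NaN ("not a number") for missing values. Unlike 0., a NaN propagates through later arithmetic, so a forgotten missing value is harder to mistake for a real measurement:

if(i == ""):
    val = float("nan")
else:
    val = float(i)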