Practical Bioinformatics -- Day 3

Using "." to "look inside" an object

First, let's make a sequence (multiplication on a string gives tandem repeats)

In [1]:
seq = "ATGC"*50
In [2]:
seq
Out[2]:
'ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC'

Here I:

In [4]:
seq.rstrip("C").split("TG")[:10]
Out[4]:
['A', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA']

Functions vs. Classes

As an example, we will try two different ways of implementing reverse complementation for DNA

As a function

In [5]:
def complement(dna):
    retval = ""
    for i in dna:
        if(i == "A"):
            c = "T"
        elif(i == "T"):
            c = "A"
        elif(i == "G"):
            c = "C"
        else:
            c = "G"
        retval = c + retval
    return retval
In [8]:
s = "GATACA"
complement(s)
Out[8]:
'TGTATC'

As a class method

In [9]:
class DNA:
    def __init__(self, seq):
        # A class constructor is a good place to sanitize/normalize
        # our data.  Here, we normalize DNA sequences to uppercase
        # to simplify comparison
        self.seq = seq.upper()
        
    def complement(self):
        retval = ""
        for i in self.seq:
            if(i == "A"):
                c = "T"
            elif(i == "T"):
                c = "A"
            elif(i == "G"):
                c = "C"
            else:
                c = "G"
            retval = c + retval
        return retval
    

Constructing a DNA sequence:

In [10]:
dna = DNA("GATACA")

Calling a class method:

In [11]:
dna.complement()
Out[11]:
'TGTATC'

Example of using python's built-in help. (I am commenting this out to keep my transcript small). Experiment with this!

In [76]:
#help(str)

Examples of pre-allocating vectors and matrices

Multiplication on lists is similar to strings -- here I'm creating a list of ten zeros

In [16]:
l1 = [0]*10
In [17]:
l1
Out[17]:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Here's a 5x10 matrix of zeros as a list of lists (I use a for loop to avoid creating 10 linked copies of a single row)

In [18]:
l2 = []
for i in xrange(10):
    l2.append([0]*5)
In [19]:
l2
Out[19]:
[[0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0]]

The same matrix as a numpy array:

In [20]:
a2 = zeros((10,5))
In [21]:
a2
Out[21]:
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

Explicitly creating an array of 32-bit signed integers, rather than the default floating-point dtype

This is an example of passing an optional, named parameter to a function. You can give your own functions this behaviour by supplying default parameter values in the first line of the function definition.

In [22]:
a3 = zeros((10,5), dtype = "int32")
In [23]:
a3
Out[23]:
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]], dtype=int32)

Using indexing to change the values in a matrix

In [24]:
l2
Out[24]:
[[0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0]]
In [25]:
l2[3][2]
Out[25]:
0
In [26]:
l2[3][2] = "ham"
In [27]:
l2
Out[27]:
[[0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 'ham', 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0]]
In [28]:
a2
Out[28]:
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
In [29]:
a2[3][2] = 5
In [30]:
a2
Out[30]:
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  5.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

Note that numpy arrays can only contain one type of data:

In [31]:
a2[3][3] = "ham"
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-4b7f8e330901> in <module>()
----> 1 a2[3][3] = "ham"

ValueError: could not convert string to float: ham

Homework problem: parsing a CDT file

Bryne doing a quick scouting run on the file, to see how the data is structured:

In [32]:
data = open("supp2data.cdt").readlines()
data[0]
Out[32]:
'ORF\tNAME\talpha 0\talpha 7\talpha 14\talpha 21\talpha 28\talpha 35\talpha 42\talpha 49\talpha 56\talpha 63\talpha 70\talpha 77\talpha 84\talpha 91\talpha 98\talpha 105\talpha 112\talpha 119\tElu 0\tElu 30\tElu 60\tElu 90\tElu 120\tElu 150\tElu 180\tElu 210\tElu 240\tElu 270\tElu 300\tElu 330\tElu 360\tElu 390\tcdc15 10\tcdc15 30\tcdc15 50\tcdc15 70\tcdc15 90\tcdc15 110\tcdc15 130\tcdc15 150\tcdc15 170\tcdc15 190\tcdc15 210\tcdc15 230\tcdc15 250\tcdc15 270\tcdc15 290\tspo 0\tspo 2\tspo 5\tspo 7\tspo 9\tspo 11\tspo5 2\tspo5 7\tspo5 11\tspo- early\tspo- mid\theat 0\theat 10\theat 20\theat 40\theat 80\theat 160\tdtt 15\tdtt 30\tdtt 60\tdtt 120\tcold 0\tcold 20\tcold 40\tcold 160\tdiau a\tdiau b\tdiau c\tdiau d\tdiau e\tdiau f\tdiau g\r\n'

Bryne's homework solution as a python function

In [33]:
data[1]
Out[33]:
'YBR166C\tTYR1   TYROSINE BIOSYNTHESIS    PREPHENATE DEHYDROGENASE (NADP+\t0.33\t-0.17\t0.04\t-0.07\t-0.09\t-0.12\t-0.03\t-0.2\t-0.06\t-0.06\t-0.14\t-0.18\t-0.06\t-0.25\t0.06\t-0.12\t0.25\t0.43\t0.21\t-0.04\t-0.15\t-0.04\t0.21\t-0.14\t-0.03\t-0.07\t-0.36\t-0.14\t-0.42\t-0.34\t-0.23\t-0.17\t0.23\t0.3\t0.41\t-0.07\t-0.23\t-0.12\t0.16\t0.74\t0.14\t-0.49\t-0.32\t0.19\t0.23\t0.24\t0.28\t1.13\t-0.12\t0.1\t\t0.66\t0.62\t0.08\t0.62\t0.43\t0.5\t-0.25\t-0.51\t-0.67\t0.21\t-0.74\t-0.36\t-0.01\t0.38\t0.15\t-0.22\t-0.09\t0.33\t0.08\t0.39\t-0.17\t0.23\t0.2\t0.2\t-0.17\t-0.69\t0.14\t-0.27\r\n'
In [34]:
def sepData(fn):
    geneName=[]
    geneAnn=[]
    num= []
    
    dat = open(fn).readlines()
    
    expCond = [i.strip() for i in dat[0].split("\t")[2:]]
    
    for i in dat[1:]:
        w= i.split("\t")
        geneName.append(w[0])
        geneAnn.append(w[1])
        row = []
        for j in w[2:]:
            try:
                row.append(float(j))
            except ValueError:
                row.append(0.)
        num.append(row)

    return geneName,geneAnn,expCond,num
In [35]:
a = sepData("supp2data.cdt")

Four return values, as expected:

In [36]:
len(a)
Out[36]:
4

Checking that the parsed data has the expected shape (2467 genes by 79 conditions)

In [37]:
len(a[0])
Out[37]:
2467
In [38]:
len(a[2])
Out[38]:
79

Assigning a tuple to a tuple -- this is a good trick for unpacking a return value:

In [39]:
geneName,geneAnn,expCond,num = sepData("supp2data.cdt")

Digression -- lists vs. generators

The important point is that lists represent a sequence by actually allocating it in memory, whereas a generator simply emits each element of a sequence one at a time, without allocating the full sequence.

The parsing example resumes at In [64]

In [40]:
range(5)
Out[40]:
[0, 1, 2, 3, 4]
In [41]:
for i in range(5):
    print i **2
0
1
4
9
16
In [42]:
for i in xrange(5):
    print i**2
0
1
4
9
16
In [43]:
x = range(5)
y = xrange(5)
In [44]:
x
Out[44]:
[0, 1, 2, 3, 4]
In [45]:
y
Out[45]:
xrange(5)
In [46]:
xi = iter(x)
yi = iter(y)
In [47]:
xi.next()
Out[47]:
0
In [58]:
xi.next()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-58-ef459a0034b7> in <module>()
----> 1 xi.next()

StopIteration: 
In [49]:
yi.next()
Out[49]:
0
In [54]:
yi.next()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-54-b381cd96e979> in <module>()
----> 1 yi.next()

StopIteration: 
In [59]:
x
Out[59]:
[0, 1, 2, 3, 4]
In [60]:
y
Out[60]:
xrange(5)
In [61]:
xi = iter(x)
xi.next()
Out[61]:
0
In [62]:
yi = iter(y)
yi.next()
Out[62]:
0

Parsing example continued

We converted Bryne's parsing function to a class (in cdt.py on the website) to which we added an output write function

Importing the module:

In [64]:
import cdt

Parsing the cdt file by creating an instance of our class, ExpressionProfile, which results in a call to the __init__ function that we wrote:

In [65]:
data1 = cdt.ExpressionProfile("supp2data.cdt")

Using dir to look inside of our class instance:

In [66]:
dir(data1)
Out[66]:
['__doc__',
 '__init__',
 '__module__',
 'expCond',
 'geneAnn',
 'geneName',
 'num',
 'write']

Trying out the write method:

In [67]:
data1.write("test1.cdt")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-67-5ce7d47be52d> in <module>()
----> 1 data1.write("test1.cdt")

/home/mvoorhie/Projects/Courses/PracticalBioinformatics/python/May2013/cdt.py in write(self, fname)
     29         out = open(fname,"w")
     30         out.write("\t".join(["ORF","NAME"]+self.expCond)+"\n")
---> 31         for (name, anno, data) in (self.geneName, self.geneAnn, self.num):
     32             out.write("\t".join([name,anno]+[coerce_num(i) for i in data])+"\n")
     33         out.close()

ValueError: too many values to unpack

Oops! I forgot to use zip in my for loop $\rightarrow$ fixed the bug and saved the file

In [68]:
reload(cdt)
Out[68]:
<module 'cdt' from 'cdt.py'>

Still buggy $\rightarrow$ data1 is still bound to the old version of the class

In [69]:
data1.write("test1.cdt")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-5ce7d47be52d> in <module>()
----> 1 data1.write("test1.cdt")

/home/mvoorhie/Projects/Courses/PracticalBioinformatics/python/May2013/cdt.py in write(self, fname)
     29         out = open(fname,"w")
     30         out.write("\t".join(["ORF","NAME"]+self.expCond)+"\n")
---> 31         for (name, anno, data) in zip(self.geneName, self.geneAnn, self.num):
     32             out.write("\t".join([name,anno]+[coerce_num(i) for i in data])+"\n")
     33         out.close()

ValueError: too many values to unpack

If I create a new class instance, it is bound to the new (reloaded) version

In [70]:
data2 = cdt.ExpressionProfile("supp2data.cdt")
In [71]:
data2.write("test2.cdt")

I can dynamically rebind data1 to the new version of the class by directly reassigning its __class__ attribute (note that this does not result in a call to __init__)

In [72]:
data1.__class__
Out[72]:
cdt.ExpressionProfile
In [73]:
data1.__class__ = cdt.ExpressionProfile
In [74]:
data1.write("test1.cdt")

A quick check that all rows of the log ratio matrix are the same length.

To do this I am using a set, which is an unordered set of unique elements.

I initialize the set with a generator comprehension for the sequence of lengths of the rows in the log ratio matrix.

Because all 2467 values emitted by the generator are identical (they are all the integer 79), set's __init__ method collapses them to a single element.

Use help(set) to read about set's nifty support for set theory operators like union, intersection, and difference.

In [75]:
set(len(i) for i in data1.num)
Out[75]:
{79}
In [77]:
%logstart -o BMS270b.2013.03.log
Activating auto-logging. Current session state plus future input saved.
Filename       : BMS270b.2013.03.log
Mode           : backup
Output logging : True
Raw input log  : False
Timestamping   : False
State          : active