First, let's make a sequence (multiplication on a string gives tandem repeats)
seq = "ATGC"*50
seq
Here I strip the trailing "C", split the result on "TG", and show the first ten pieces:
seq.rstrip("C").split("TG")[:10]
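A small sketch of what each step in that chain does, written so it runs under Python 3 (the session itself is Python 2). The intermediate names `trimmed` and `pieces` are mine, added only to make the steps visible.

```python
seq = "ATGC" * 50               # 200 characters: "ATGCATGC..."
trimmed = seq.rstrip("C")       # removes only the trailing run of "C" (one character here)
pieces = trimmed.split("TG")    # cuts at every "TG"; the separator itself is discarded

print(len(seq), len(trimmed))
print(pieces[:3], repr(pieces[-1]))
```

Because the trimmed string ends in "TG", the last piece is an empty string.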
As an example, we will try two different ways of implementing reverse complementation for DNA: first as a plain function, then as a method of a class.
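def complement(dna):
    # Despite the name, this returns the REVERSE complement: each
    # complemented base is prepended, so the output runs backwards.
    retval = ""
    for i in dna:
        if i == "A":
            c = "T"
        elif i == "T":
            c = "A"
        elif i == "G":
            c = "C"
        else:
            c = "G"
        retval = c + retval
    return retval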
s = "GATACA"
complement(s)
class DNA:
    def __init__(self, seq):
        # A class constructor is a good place to sanitize/normalize
        # our data. Here, we normalize DNA sequences to uppercase
        # to simplify comparison
        self.seq = seq.upper()

    def complement(self):
        retval = ""
        for i in self.seq:
            if i == "A":
                c = "T"
            elif i == "T":
                c = "A"
            elif i == "G":
                c = "C"
            else:
                c = "G"
            retval = c + retval
        return retval
Constructing a DNA sequence:
dna = DNA("GATACA")
Calling a method on the instance:
dna.complement()
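Both versions above repeat the same if/elif chain. As an alternative sketch (not something we did in the session), a lookup table can do the same mapping. `PAIRS` and `reverse_complement` are names I made up, and unlike the code above this version raises a KeyError on anything other than A, C, G, or T rather than silently treating it as C. It is written to run under Python 3.

```python
# Hypothetical lookup table: each base maps to its pairing partner.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna):
    # Walk the (uppercased) sequence backwards and look up each partner.
    return "".join(PAIRS[base] for base in reversed(dna.upper()))

result = reverse_complement("gataca")
print(result)
```

For "GATACA" this gives the same answer as the loop-based versions, "TGTATC".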
Example of using Python's built-in help. (I've commented it out to keep the transcript short.) Experiment with this!
#help(str)
Multiplication on lists is similar to strings -- here I'm creating a list of ten zeros
l1 = [0]*10
l1
Here's a 10x5 matrix of zeros (10 rows of 5 columns) as a list of lists. I use a for loop to avoid creating 10 linked copies of a single row.
l2 = []
for i in xrange(10):
    l2.append([0]*5)
l2
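To see why the loop matters, here is a sketch of the "linked copies" problem it avoids, written to run under Python 3. The names `linked` and `separate` are mine.

```python
linked = [[0] * 5] * 10              # ten references to ONE row object
linked[3][2] = "ham"
print(linked[0][2], linked[9][2])    # the write shows up in every row

separate = [[0] * 5 for _ in range(10)]   # a fresh row on each pass
separate[3][2] = "ham"
print(separate[0][2], separate[3][2])     # only row 3 changed
```

Multiplying the outer list copies references, not rows, so every "row" of `linked` is the same list.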
The same matrix as a numpy array:
from numpy import zeros  # already in the namespace if IPython was started with --pylab
a2 = zeros((10,5))
a2
Explicitly creating an array of 32-bit signed integers, rather than the default floating-point dtype
This is an example of passing an optional, named parameter to a function. You can give your own functions this behaviour by supplying default parameter values in the first line of the function definition.
a3 = zeros((10,5), dtype = "int32")
a3
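A minimal sketch of giving your own function an optional, named parameter. `make_grid` and its parameters are hypothetical, and it builds a plain list of lists rather than a numpy array; it is written to run under Python 3.

```python
def make_grid(rows, cols, fill=0.0):
    # `fill` has a default, so callers may leave it out.
    return [[fill] * cols for _ in range(rows)]

floats = make_grid(2, 3)           # uses the default fill of 0.0
ints = make_grid(2, 3, fill=0)     # overrides the default by name
print(floats[0], ints[0])
```

The default goes in the `def` line; passing `fill=...` by name at the call site mirrors the `dtype = "int32"` call above.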
l2
l2[3][2]
l2[3][2] = "ham"
l2
a2
a2[3][2] = 5
a2
Note that numpy arrays can only contain one type of data:
a2[3][3] = "ham"
Bryne doing a quick scouting run on the file, to see how the data is structured:
data = open("supp2data.cdt").readlines()
data[0]
data[1]
Bryne's homework solution as a Python function:
def sepData(fn):
    geneName = []
    geneAnn = []
    num = []
    dat = open(fn).readlines()
    # Header row: the first two columns are labels, the rest are conditions
    expCond = [i.strip() for i in dat[0].split("\t")[2:]]
    for i in dat[1:]:
        w = i.split("\t")
        geneName.append(w[0])
        geneAnn.append(w[1])
        row = []
        for j in w[2:]:
            try:
                row.append(float(j))
            except ValueError:
                # Cells that don't parse as a number (e.g. empty) become 0
                row.append(0.)
        num.append(row)
    return geneName,geneAnn,expCond,num
a = sepData("supp2data.cdt")
Four return values, as expected:
len(a)
Checking that the parsed data has the expected shape (2467 genes by 79 conditions)
len(a[0])
len(a[2])
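To try the parsing logic without the data file, here is a sketch that applies the same steps to a few in-memory lines. `parse_lines` is a hypothetical variant of sepData that takes a list of lines instead of a file name, and the three toy lines are invented for illustration, not taken from supp2data.cdt. It is written to run under Python 3.

```python
def parse_lines(lines):
    # Same steps as sepData, applied to a list of tab-delimited lines.
    conditions = [c.strip() for c in lines[0].split("\t")[2:]]
    names, annotations, values = [], [], []
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        names.append(fields[0])
        annotations.append(fields[1])
        row = []
        for cell in fields[2:]:
            try:
                row.append(float(cell))
            except ValueError:       # empty cell: treat as 0.0
                row.append(0.0)
        values.append(row)
    return names, annotations, conditions, values

# Invented toy table: two genes, two conditions, one empty cell per gene.
toy = ["ID\tNAME\tcond1\tcond2\n",
       "g1\tfirst gene\t0.5\t\n",
       "g2\tsecond gene\t\t-1.2\n"]

parsed = parse_lines(toy)
print(len(parsed), parsed[2], parsed[3])
```

The same length checks used above apply here: `len(parsed[0])` is the number of genes and `len(parsed[2])` the number of conditions.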
Assigning a tuple to a tuple -- this is a good trick for unpacking a return value:
geneName,geneAnn,expCond,num = sepData("supp2data.cdt")
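A smaller sketch of the same unpacking trick, with a hypothetical helper `min_max` so it runs without the data file (Python 3):

```python
def min_max(values):
    # Returning two values separated by a comma returns one tuple.
    return min(values), max(values)

result = min_max([3, 1, 4])     # keep the tuple whole...
low, high = min_max([3, 1, 4])  # ...or unpack it into two names
low, high = high, low           # the same mechanism swaps two variables
print(result, low, high)
```

The number of names on the left must match the length of the tuple; otherwise Python raises a ValueError.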
The important point is that lists represent a sequence by actually allocating it in memory, whereas a generator simply emits each element of a sequence one at a time, without allocating the full sequence.
The parsing example resumes after this aside on range and xrange.
range(5)
for i in range(5):
    print i**2
for i in xrange(5):
    print i**2
x = range(5)
y = xrange(5)
x
y
xi = iter(x)
yi = iter(y)
xi.next()
xi.next()
yi.next()
yi.next()
x
y
xi = iter(x)
xi.next()
yi = iter(y)
yi.next()
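The list-versus-generator distinction can also be sketched in Python 3, where `range` already behaves like Python 2's `xrange` and `next(it)` replaces `it.next()`. The names below are mine, and the exact byte counts from `sys.getsizeof` vary by interpreter, so I only compare them.

```python
import sys

squares_list = [i ** 2 for i in range(100000)]   # every element allocated up front
squares_gen = (i ** 2 for i in range(100000))    # nothing computed yet

# The list holds 100000 references; the generator is a small fixed-size object.
list_is_bigger = sys.getsizeof(squares_list) > sys.getsizeof(squares_gen)
print(list_is_bigger)

first = next(squares_gen)      # elements are produced one at a time, on demand
second = next(squares_gen)
rest = sum(squares_gen)        # consumes everything that is left
leftover = list(squares_gen)   # a generator can only be walked once
print(first, second, leftover)
```

Unlike a list, which can be iterated from the start as often as you like, an exhausted generator yields nothing more.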
We converted Bryne's parsing function to a class (in cdt.py on the website) and added a write method for output.
Importing the module:
import cdt
Parsing the cdt file by creating an instance of our class, ExpressionProfile, which results in a call to the __init__ function that we wrote:
data1 = cdt.ExpressionProfile("supp2data.cdt")
Using dir to look inside of our class instance:
dir(data1)
Trying out the write method:
data1.write("test1.cdt")
Oops! I forgot to use zip in my for loop $\rightarrow$ fixed the bug and saved the file
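cdt.py is not reproduced in this transcript, so the following is only a guess at the shape of such a loop, not the actual write method. It assumes three parallel lists (names, annotations, rows of numbers) and uses invented toy values; it writes to an in-memory buffer and runs under Python 3.

```python
import io

# Invented toy data standing in for the parsed columns.
names = ["g1", "g2"]
annotations = ["first gene", "second gene"]
values = [[0.5, 0.0], [0.0, -1.2]]

out = io.StringIO()
# zip walks the three parallel lists in step, one gene per pass.
for name, annotation, row in zip(names, annotations, values):
    cells = [name, annotation] + [str(v) for v in row]
    out.write("\t".join(cells) + "\n")

text = out.getvalue()
print(text)
```

Without `zip`, a loop like `for name, annotation, row in names, annotations, values` would iterate over the three lists themselves instead of over one gene at a time.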
reload(cdt)
Still buggy $\rightarrow$ data1 is still bound to the old version of the class
data1.write("test1.cdt")
If I create a new class instance, it is bound to the new (reloaded) version
data2 = cdt.ExpressionProfile("supp2data.cdt")
data2.write("test2.cdt")
I can dynamically rebind data1 to the new version of the class by directly reassigning its __class__ attribute (note that this does not result in a call to __init__)
data1.__class__
data1.__class__ = cdt.ExpressionProfile
data1.write("test1.cdt")
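The same binding behaviour can be reproduced without reload, by redefining a class in place. This is a sketch under that substitution: `Profile`, `OldProfile`, and `describe` are hypothetical names, and the code runs under Python 3.

```python
class Profile:                  # "old" version of the class
    def describe(self):
        return "old"

p = Profile()
OldProfile = Profile            # keep a handle on the old class object

class Profile:                  # redefinition rebinds the NAME, like reload does
    def describe(self):
        return "new"

before = p.describe()           # p still points at the old class
p.__class__ = Profile           # rebind the instance; __init__ is not called
after = p.describe()
print(before, after)
```

Instances remember the class object they were created from, not the name it was bound to, which is why rebinding the name alone does not update them.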
A quick check that all rows of the log ratio matrix are the same length.
To do this I am using a set, which is an unordered collection of unique elements.
I initialize the set with a generator expression that emits the length of each row in the log ratio matrix.
Because all 2467 values emitted by the generator are identical (they are all the integer 79), set's __init__ method collapses them to a single element.
Use help(set) to read about set's nifty support for set theory operators like union, intersection, and difference.
set(len(i) for i in data1.num)
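A sketch of the same check on a toy matrix where one row is deliberately short, followed by the set operators mentioned above. The data and names are invented for illustration; the code runs under Python 3.

```python
rows = [[0.1, 0.2, 0.3],
        [1.0, 2.0, 3.0],
        [5.0, 6.0]]                  # last row is short on purpose

lengths = set(len(r) for r in rows)  # duplicates collapse to one element
ragged = len(lengths) > 1            # more than one length means uneven rows
print(lengths, ragged)

a = {1, 2, 3}
b = {3, 4}
union, common, only_a = a | b, a & b, a - b
print(union, common, only_a)
```

A result with exactly one element means every row has the same length; anything larger points to a malformed row.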
%logstart -o BMS270b.2013.03.log