String comparisons in python are case sensitive. Most of the time that we are dealing with sequence data, we do not want case to be meaningful, so it's useful to normalize all of our sequences to uppercase.
This is related to our discussion of coercing all input data to floats in the pearson function. In both cases, we are sanitizing our input to a known state that is consistent with the assumptions of the code that follows (e.g. that all start codons are equivalent, even if we read them from a file that has both "atg" and "ATG")
Here I use the string upper method to convert a string to uppercase
dna = "ATGcaGTa".upper()
dna
There is also a lower method. It does what you would expect.
dna.lower()
upper and lower return modified strings, but do not change the original string (they are not "in-place" modifications).
dna
Strings are iterable, so we can use them in for loops.
for i in dna:
print i
print "hello"
x = [1,5,7.2,3]
sum = 0
for i in x:
sum += i
m = sum/float(len(x))
m
sum is now 16.2, not a function, so we can't call it
m = sum(x)/float(len(x))
Fortunately, we still have a reference to the original sum function, so we can restore the original name
sum = __builtin__.sum
m = sum(x)/float(len(x))
m
x
Make x and y point to the same list
y = x
y
Since x and y point to the same list, they see the same modifications
y.append("b")
y
x
Make z a distinct copy of y
z = y[:]
z can now be modified independently
z
z.append("c")
z
y
This assignment makes a copy of the value of y[2] (specifically, because y[2] is a float. If it were a list, then a would be a reference to that list, not a copy).
a = y[2]
a
a += 1
a
So here, changing a didn't change y
y
Assignment to an indexed element does modify the list
y[2] = 3
y
y[2] += 1
y
Feel free to skip this digression
for i in range(10):
print i**2
range(10)
for i in xrange(10):
print i**2
xrange(10)
help(xrange)
x = xrange(10)
it = iter(x)
help(it)
it.next()
it.next()
[i**2 for i in xrange(10)]
g = (i**2 for i in xrange(10))
g
it = iter(g)
it.next()
g = (i**2 for i in xrange(10))
for i in g:
print i
sum/len(x)
sum
sum(x)
Get our current working directory
%pwd
List the files in the working directory (in class, I used the less portable "!ls", which doesn't work on Windows)
%ls
Confirm that we can open the downloaded file
fp = open("supp2data.cdt")
Read the first 100 bytes
fp.read(100)
Read the next 100 bytes (note that read has the side-effect of advancing the file pointer, so we get a different result each time rather than always getting the first 100 bytes)
fp.read(100)
Open the file
fp = open("SB1405E_ryp1.bam")
Read the first 100 bytes
fp.read(100)
Here's how to find the numeric value of an ascii character (which python prints as a decimal)
ord("=")
Here's how to convert the numerical value back to an ascii character
chr(61)
Figuring out the hexidecimal representation of decimal 61 ($61 = 3*16 + 13$)
61/16
61%16
Confirming that hexidecimal 3D is decimal 61
0x3d
General conclusion -- binary formats are a headache to debug by hand and often have portability issues. Where possible, we will prefer text formats (or an interconvertable text/binary pair, such as SAM/BAM).
SB1405E_ryp1.sam is isomorphic to SB1405E_ryp1.bam -- we can convert between the two with no loss of information (i.e., they have the same information content, provided we already have full knowledge of both file formats).
fp = open("SB1405E_ryp1.sam")
Note that all of the bytes of the SAM file correspond to ascii characters.
There are two special characters (which would normally be printed as whitespace):
fp.read(100)
fp.read(100)
fp.read(500)
We rewind the file by re-opening it.
For the case of a random access filestream, like this one, we could also use fp.seek(0), but this does not generalize to all filestreams (e.g., data being decompressed or downloaded from the web or a piece of equipment).
fp = open("SB1405E_ryp1.sam")
Because this is a text file, we can use readline to read one newline delimited line at a time
fp.readline()
fp.readline()
line = fp.readline()
line
Note that the first few lines are all special header records, which start with an at sign (@). We can burn through all of these with a while loop
while(line[0] == "@"):
line = fp.readline()
print line
Parse the remaining lines of the file into a list of lists (i.e., two dimensional array) by splitting the tab delimited fields with the string split method. We could have done this with a for loop, but I used a list comprehension instead.
Note that data does not contain the first non-header record (the final state of line in the previous cell). How could I change my code to include this line?
data = [i.split("\t") for i in fp]
Data has 8787 rows (each corresponding to a read alignment)
len(data)
Here are the elements of the first row of the file. You can take a look at the SAM documentation to see what each field means.
data[0]
The extended CDT format is described in the JavaTreeView User's manual. Note that it is a simple table of floating point numbers (with missing values) plus some annotation information.
fp = open("supp2data.cdt")
The csv module is very handy for handling all of the special cases in Excel-generated comma delimited (CSV) and tab delimited text files -- in fact, it's fancier than what we need here, because Cluster3 and JavaTreeView don't support, e.g., fields with newline characters. For the specific case of reading supp2data.cdt, the csv module does the right thing.
from csv import reader, excel_tab
Wrap our filestream in a csv.reader, which will take care of splitting up each line into separate fields for us.
r = reader(fp, dialect = excel_tab)
Since the reader takes care of splitting on tabs, each row is now a list of strings (rather than a single string)
The first row of a simple CDT file has two special columns (ORF and NAME) plus the name of each experimental condition (in general, this refers to a two color microarray hybridization of an experimental sample and a reference sample).
print r.next()
Each remaining line (for the case of a simple CDT file) gives the expression profile of a single gene. The first field is the gene's systematic name, followed by an annotation, followed by $\log_2$ ratios for each condition (or empty strings for missing values)
print r.next()
print r.next()
To do the homework (writing a function to factor supp2data into four components), we need one more trick: representing a two dimensional array (or matrix) as a list of lists
Here's a list of lists with 4 rows and 3 columns
(note that a list of lists doesn't have to be rectangular -- we could make each row a different length -- but the particular data that we're working with is rectangular).
x = [[1,2,3],
[3,4,5],
[6,7,9],
[11,2,1]]
Indexing a row of a list of lists
x[0]
Indexing an element of a list of lists (first row, third column)
x[0][2]
Generating a list of lists with a pair of nested four loops (this may be useful for the homework)
y = []
for i in range(10):
y.append([])
for j in range(3):
y[-1].append(i+j)
print y
y
Any data read from a text file arrives as strings. If we want to do math on it, we need to coerce it to a numerical type like float -- here's how
float("3.3")
Because supp2data.cdt contains missing values, not all log ratio fields can be coerced to floats. Here's one possible way of dealing with the missing data:
if(i == ""):
val = None
else:
val = float(i)
Here's another
What are the trade-offs between these two methods? Can you think of a third alternative?
if(i == ""):
val = 0.
else:
val = float(i)