A function without a return statement returns the default return value: None
def f(x):
    a = x+3
y = f(4)
y
print y
Returning multiple values from a function via a tuple
def f(x):
    a = x**2
    b = x**3
    c = x**4
    return a,b,c
f(4)
a = f(4)
a
a[0]
a[2]
Assigning to a tuple from a tuple
(x,y,z) = f(4)
x
y
z
Note that we need to have the same number of elements on both sides of the assignment operator (=)
(x,y,z,w) = f(4)
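The mismatch raises a ValueError at runtime, which we can catch like any other exception. A minimal self-contained sketch (redefining f so it stands alone):

```python
def f(x):
    return x**2, x**3, x**4

try:
    (x, y, z, w) = f(4)   # three returned values cannot fill four names
except ValueError:
    print("unpacking failed: wrong number of values")
```

Note that none of the four names get bound when the unpacking fails.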
Note that the file pointer moves as we read data!
fp = open("supp2data.cdt")
Read the first 100 bytes of the homework file
fp.read(100)
Now, when we call readline, we are already 100 bytes in:
fp.readline()
If we type fp with no method call, python just prints a summary of the fp object. This tells us that it is a file object, currently open in read mode ("r") for the file "supp2data.cdt", and that this particular file object lives at the block of RAM starting at hexadecimal address 0x7f4d6b2bbb70 (if you re-run the code, the file object is likely to be allocated at a different location in memory)
fp
fp = open("supp2data.cdt")
from csv import reader, excel_tab
Re-opening the file rewinds it to the beginning
fp.readline()
We will attach two different csv.reader objects to our file object, one in the default "comma separated value" mode and one in tab-delimited mode:
reader1 = reader(fp)
reader2 = reader(fp,excel_tab)
Read one line in tab-delim mode. Since we already read the first line of the file directly, this gives us the second line (the first gene)
reader2.next()
Read another line in tab-delim mode (this time showing it with print for more compact formatting)
print reader2.next()
Read the next line in CSV mode. Note that even though this is the first time we have used reader1, it still starts from the point in the file where reader2 left off (because both readers are tied to the position of the underlying file object, fp). Also note that CSV parsing doesn't work well for tab-delimited data
print reader1.next()
We can rewind the file without re-opening it via the seek method. This is faster than re-opening the file (but both are so fast that you usually won't care).
fp.seek(0)
print reader2.next()
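The file-pointer behavior above can be demonstrated end to end with a small sketch. This uses a throwaway file (seek_demo.txt is a name made up for this example, not part of the homework data) so it doesn't disturb supp2data.cdt:

```python
# Create a tiny two-line file to experiment on
fp = open("seek_demo.txt", "w")
fp.write("line one\nline two\n")
fp.close()

fp = open("seek_demo.txt")
first = fp.readline()    # the pointer is now past the first line
second = fp.readline()   # so this returns the second line
fp.seek(0)               # rewind without re-opening
again = fp.readline()    # back to the first line
fp.close()
print(first == again)    # True
```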
When figuring out how to parse a file, it is often useful to have the whole thing in memory.
Here we read the whole file into a list of lists (2D array)
data = []
for i in reader(open("supp2data.cdt"),excel_tab):
    data.append(i)
Get the dimensions of the array (2468 rows by 81 columns)
len(data),len(data[0])
Use a set to confirm that all rows have 81 columns
s = set()
for i in data:
    s.add(len(i))
s
Two one-liner equivalents to our list-of-lists parse:
data = [i for i in reader(open("supp2data.cdt"),excel_tab)]
data = list(reader(open("supp2data.cdt"),excel_tab))
The list trick above is also useful for converting a set into something that we can index (i.e., it imposes a well-defined ordering on the set)
s[0]
S = list(s)
S
S[0]
Here's a one-liner version of the set-building loop above
s = set(len(i) for i in data)
s
The + operator is defined as concatenation for both strings and lists
"this"+"that"
["this"]+["that"]
x = ["x","y\r\n"]
We can index from the back of a string, list, or tuple by using negative indices, starting from -1
x[-1]
If we know our lines end with "\r\n", we can trim them via string slicing:
x[-1][:-2]
If we're less sure of which line terminator we're dealing with (e.g., "\n" on UNIX vs. "\r\n" on DOS/Windows), it is safer to use rstrip (which trims any number of instances of the given characters from the right of the string)
x[-1].rstrip("\n\r")
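A quick sketch showing why rstrip is the safer choice: the same call handles UNIX lines, DOS lines, and lines with no terminator at all (the gene names here are made up):

```python
lines = ["gene1\n", "gene2\r\n", "gene3"]
# rstrip removes any run of the listed characters from the right,
# so all three terminator styles come out the same
cleaned = [line.rstrip("\n\r") for line in lines]
print(cleaned)   # ['gene1', 'gene2', 'gene3']
```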
def parse_cdt1(fname):
    orf = []
    name = []
    logratio = []
    r = reader(open(fname),excel_tab)
    header = r.next()[2:]
    for row in r:
        orf.append(row[0])
        name.append(row[1])
        logratio.append(row[2:])
    return header,orf,name,logratio
Let's check that it works
(Note that I should also have checked that the parsed data was of the correct dimensions)
parsed1 = parse_cdt1("supp2data.cdt")
print parsed1[0][:10]
print parsed1[1][:10]
print parsed1[2][0]
print parsed1[3][0][:10]
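The dimension check I mentioned above could look like the following sketch. It is shown here on a tiny made-up parse result (two invented genes, three invented conditions) so that it stands alone; in the notebook you would run the same assertions on the four components returned by parse_cdt1:

```python
# Hypothetical stand-in for parse_cdt1("supp2data.cdt") output
header = ["cond1", "cond2", "cond3"]
orf = ["YAL001C", "YAL002W"]
name = ["TFC3", "VPS8"]
logratio = [["0.1", "-0.2", ""], ["1.3", "0.0", "-0.4"]]

# One entry per gene in each of the per-gene lists...
assert len(orf) == len(name) == len(logratio)
# ...and every log-ratio row exactly as wide as the header
assert set(len(row) for row in logratio) == set([len(header)])
print("dimensions OK")
```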
Here's a test case:
a = "-3.2"
b = ""
The method that I gave at the end of day 2: explicitly check for empty strings.
Here, I introduced an optional parameter (missing) by declaring a default value (None). This lets the user supply a different default value (e.g., if we want to ensure that the entire output is floats).
def make_float(x, missing = None):
    if x == "":
        return missing
    else:
        return float(x)
Testing the function:
print [make_float(i) for i in (a,b)]
print [make_float(i,0.) for i in (a,b)]
print [make_float(i) for i in parsed1[3][0][:10]]
Adolfo's method, using exceptions. We try to coerce to a float, then catch exceptional cases and give them special treatment
def make_float(x, missing = None):
    try:
        return float(x)
    except:
        return missing
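One caveat with the bare except above: it swallows every error, not just the empty-string case. A variant (make_float_strict is my name for it, not from the lecture) that catches only ValueError lets genuine bugs, like passing a non-string, surface:

```python
def make_float_strict(x, missing=None):
    # Catch only the "couldn't parse this string" case;
    # anything else propagates to the caller
    try:
        return float(x)
    except ValueError:
        return missing

print(make_float_strict("-3.2"))   # parses fine
print(make_float_strict(""))       # missing value -> None
try:
    make_float_strict(None)        # a real bug: not a string at all
except TypeError:
    print("TypeError surfaced instead of being silently swallowed")
```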
Some examples of how exceptions work in this case:
float("")
try:
    a = float("")
except:
    a = 2
print a
try:
    a = float("")
except TypeError:
    a = 3
except ValueError:
    a = 2
except:
    a = 4
print a
def f(x):
    return float(x)

try:
    for i in range(3):
        for j in range(4):
            z = f(2)
except:
    print "didn't work"
def f(x):
    return float(x)

try:
    for i in range(3):
        for j in range(4):
            z = f("two")
except:
    print "didn't work"
def f(x):
    return float(x)

try:
    for i in range(3):
        for j in range(4):
            z = f("2")
except:
    print "didn't work"
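In the three examples above, one bad value aborts the entire nested loop. If instead we want to skip bad values and keep going, the try belongs inside the loop. A sketch (the values list is made up):

```python
def f(x):
    return float(x)

values = ["1", "two", "3"]
converted = []
for v in values:
    try:
        converted.append(f(v))
    except ValueError:
        print("skipping " + repr(v))   # note the bad value and move on
print(converted)                       # [1.0, 3.0]
```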
Adolfo's homework solution, using exceptions
def parse_cdt2(fname, missing = None):
    orf = []
    name = []
    logratio = []
    r = reader(open(fname),excel_tab)
    header = r.next()[2:]
    for row in r:
        orf.append(row[0])
        name.append(row[1])
        logline = []
        for field in row[2:]:
            try:
                logline.append(float(field))
            except ValueError:
                logline.append(missing)
        logratio.append(logline)
    return header,orf,name,logratio
Let's test it with both ways of handling missing values
parsed2 = parse_cdt2("supp2data.cdt",0.)
parsed2b = parse_cdt2("supp2data.cdt")
Confirm that we get the right first dimensions
[len(i) for i in parsed2]
Confirm that we get the right second dimension for the log ratios
len(parsed2[-1][0])
Confirm that the log ratio matrix is rectangular (all rows the same length)
set(len(i) for i in parsed2[-1])
Call with an open file object rather than a file name
def parse_cdt3(fp, missing = None):
    orf = []
    name = []
    logratio = []
    r = reader(fp,excel_tab)
    header = r.next()[2:]
    for row in r:
        orf.append(row[0])
        name.append(row[1])
        logline = []
        for field in row[2:]:
            try:
                logline.append(float(field))
            except ValueError:
                logline.append(missing)
        logratio.append(logline)
    return header,orf,name,logratio
parsed3 = parse_cdt3(open("supp2data.cdt"))
With this added layer of indirection, we can now parse things other than vanilla file objects (e.g., files on the web, or compressed files)
from urllib import urlopen
parsed3b = parse_cdt3(urlopen("http://histo.ucsf.edu/BMS270/BMS270_2015/data/supp2data.cdt"))
parsed3[0] == parsed3b[0]
!gzip supp2data.cdt
import gzip
parsed3c = parse_cdt3(gzip.open("supp2data.cdt.gz"))
!gunzip supp2data.cdt.gz
We finish up by doing a full parse with both methods of handling missing values.
We use the tuple trick from the top of this notebook to give each of the four returned components a name (so that we don't have to refer to them as parsed3[0], etc.)
(header,genes,annotations,matrix1) = parse_cdt3(open("supp2data.cdt"))
(header,genes,annotations,matrix2) = parse_cdt3(open("supp2data.cdt"),0.)
Unlike day 1, I don't start matplotlib in inline mode. Now, the plots show up as separate interactive windows.
%matplotlib
import matplotlib.pyplot as plt
import numpy as np
plt.plot(matrix2[0])
I can still embed the plots in the notebook by explicitly calling IPython's display function
from IPython.core.display import display
Expression profile for the first gene vs. all conditions
fig = plt.figure()
plt.plot(matrix2[0])
display(fig)
Manually zoomed detail (note that if I re-run the notebook I will lose this zoom)
display(fig)
Scatter plot comparing the expression profiles of the first two genes for all conditions
fig = plt.figure()
plt.plot(matrix2[0],matrix2[1],"bo")
display(fig)
All of the data as a heatmap
fig = plt.figure()
plt.imshow(matrix2, interpolation="none", aspect="auto")
plt.colorbar()
display(fig)
Without interpolation="none", we get undesired anti-aliasing
fig = plt.figure()
plt.imshow(matrix2, aspect = "auto")
plt.colorbar()
display(fig)
Without aspect = "auto" we get matched x and y dimensions (not suitable for our very skinny data matrix)
fig = plt.figure()
plt.imshow(matrix2, interpolation="none")
plt.colorbar()
display(fig)
The main homework problem for tonight is calculating the square matrix of Pearson correlations of expression profiles among the 2467 yeast genes in this data set. It is useful to start from smaller subsets of genes and work up to the full data set.
Expression profiles for the first two genes:
m = matrix2[:2]
Expression profiles for the first ten genes:
m = matrix2[:10]
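As a starting point for the homework, here is one way to compute the Pearson correlation of two expression profiles in plain Python (pearson is my own sketch, not the lecture's solution; it assumes the 0.-filled matrix, i.e., no None entries):

```python
def pearson(x, y):
    # Pearson correlation of two equal-length profiles
    n = float(len(x))
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Tiny sanity checks on made-up profiles
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # 1.0: perfectly correlated
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))   # -1.0: anti-correlated
```

The full homework answer is then the matrix of pearson(m[i], m[j]) over all gene pairs i, j.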
If you'd like to write your calculated matrix to disk, you'll need to know how to write files
Open a file in write mode (rather than the default read mode):
fp = open("example.txt","w")
Write a line of text to the file
(note that newlines are not added automatically -- we have to give them explicitly)
fp.write("Hello, world\n")
One way to write comma- or tab-delimited text to a file is to use the string join method (the reciprocal of the split method)
s = "\t".join(["these","are","some","words and phrases"])
print s
fp.write(s+"\n")
s = ",".join(["these","are","some","words and phrases"])
print s
fp.write(s+"\n")
We can also use csv.writer -- note that this adds Excel-style quoting/escaping for special cases like commas inside of a comma-delimited field
from csv import writer
out = writer(fp)
out.writerow(["these","are","some","words and phrases"])
out.writerow(['This is a "trickier" set of fields','"What," you might ask, "is difficult about formatting them?"'])
When we're done writing to the file, we close it with its close method.
fp.close()
Let's see what we got:
open("example.txt").read()
Or, with proper formatting of the tabs and newlines:
(note the Excel-style quoting of the tricky fields in the last line)
print open("example.txt").read()