A function without a return statement returns the default return value: None
def f(x):
    a = x+3
y = f(4)
y
print y
Returning multiple values from a function via a tuple
def f(x):
    a = x**2
    b = x**3
    c = x**4
    return a,b,c
f(4)
a = f(4)
a
a[0]
a[2]
Assigning to a tuple from a tuple
(x,y,z) = f(4)
x
y
z
Note that we need to have the same number of elements on both sides of the assignment operator (=)
(x,y,z,w) = f(4)
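The mismatch raises a ValueError at runtime, which we can catch like any other exception. A minimal self-contained sketch (redefining f so it stands alone):

```python
def f(x):
    return x**2, x**3, x**4

try:
    (x, y, z, w) = f(4)   # three returned values cannot fill four names
except ValueError:
    print("unpacking failed: wrong number of values")
```

Note that none of the four names get bound when the unpacking fails.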
Note that the file pointer moves as we read data!
fp = open("supp2data.cdt")
Read the first 100 bytes of the homework file
fp.read(100)
Now, when we call readline, we are already 100 bytes in:
fp.readline()
If we type fp with no method call, python just prints a summary of the fp object. This tells us that it is a file object, currently open in read mode ("r") for the file "supp2data.cdt", and that this particular file object lives at the block of RAM starting at hexadecimal address 0x7f4d6b2bbb70 (if you re-run the code, the file object is likely to be allocated at a different location in memory)
fp
fp = open("supp2data.cdt")
from csv import reader, excel_tab
Re-opening the file rewinds it to the beginning
fp.readline()
We will attach two different csv.reader objects to our file object, one in the default "comma separated value" mode and one in tab-delimited mode:
reader1 = reader(fp)
reader2 = reader(fp,excel_tab)
Read one line in tab-delim mode. Since we already read the first line of the file directly, this gives us the second line (the first gene)
reader2.next()
Read another line in tab-delim mode (this time showing it with print for more compact formatting)
print reader2.next()
Read the next line in CSV mode. Note that even though this is the first time we have used reader1, it still starts from the point in the file where reader2 left off (because both readers are tied to the position of the underlying file object, fp). Also note that CSV parsing doesn't work well for tab-delimited data
print reader1.next()
We can rewind the file without re-opening it via the seek method. This is faster than re-opening the file (but both are so fast that you usually won't care).
fp.seek(0)
print reader2.next()
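The file-pointer behavior above can be demonstrated end to end with a small sketch. This uses a throwaway file (seek_demo.txt is a name made up for this example, not part of the homework data) so it doesn't disturb supp2data.cdt:

```python
# Create a tiny two-line file to experiment on
fp = open("seek_demo.txt", "w")
fp.write("line one\nline two\n")
fp.close()

fp = open("seek_demo.txt")
first = fp.readline()    # the pointer is now past the first line
second = fp.readline()   # so this returns the second line
fp.seek(0)               # rewind without re-opening
again = fp.readline()    # back to the first line
fp.close()
print(first == again)    # True
```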
When figuring out how to parse a file, it is often useful to have the whole thing in memory.
Here we read the whole file into a list of lists (2D array)
data = []
for i in reader(open("supp2data.cdt"),excel_tab):
    data.append(i)
Get the dimensions of the array (2468 rows by 81 columns)
len(data),len(data[0])
Use a set to confirm that all rows have 81 columns
s = set()
for i in data:
    s.add(len(i))
s
Two one-liner equivalents to our list-of-lists parse:
data = [i for i in reader(open("supp2data.cdt"),excel_tab)]
data = list(reader(open("supp2data.cdt"),excel_tab))
The list trick above is also useful for converting a set into something that we can index (i.e., it imposes a well-defined ordering on the set)
s[0]
S = list(s)
S
S[0]
Here's a one-liner version of the set-building loop above
s = set(len(i) for i in data)
s
The + operator is defined as concatenation for both strings and lists
"this"+"that"
["this"]+["that"]
x = ["x","y\r\n"]
We can index from the back of a string, list, or tuple by using negative indices, starting from -1
x[-1]
If we know our lines end with "\r\n", we can trim them via string slicing:
x[-1][:-2]
If we're less sure of which line terminator we're dealing with (e.g., "\n" on UNIX vs. "\r\n" on DOS/Windows), it is safer to use rstrip (which trims any number of instances of the given characters from the right of the string)
x[-1].rstrip("\n\r")
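A quick sketch showing why rstrip is the safer choice: the same call handles UNIX lines, DOS lines, and lines with no terminator at all (the gene names here are made up):

```python
lines = ["gene1\n", "gene2\r\n", "gene3"]
# rstrip removes any run of the listed characters from the right,
# so all three terminator styles come out the same
cleaned = [line.rstrip("\n\r") for line in lines]
print(cleaned)   # ['gene1', 'gene2', 'gene3']
```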
def parse_cdt1(fname):
    orf = []
    name = []
    logratio = []
    r = reader(open(fname),excel_tab)
    header = r.next()[2:]
    for row in r:
        orf.append(row[0])
        name.append(row[1])
        logratio.append(row[2:])
    return header,orf,name,logratio
Let's check that it works
(Note that I should also have checked that the parsed data was of the correct dimensions)
parsed1 = parse_cdt1("supp2data.cdt")
print parsed1[0][:10]
print parsed1[1][:10]
print parsed1[2][0]
print parsed1[3][0][:10]
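The dimension check I mentioned above could look like the following sketch. It is shown here on a tiny made-up parse result (two invented genes, three invented conditions) so that it stands alone; in the notebook you would run the same assertions on the four components returned by parse_cdt1:

```python
# Hypothetical stand-in for parse_cdt1("supp2data.cdt") output
header = ["cond1", "cond2", "cond3"]
orf = ["YAL001C", "YAL002W"]
name = ["TFC3", "VPS8"]
logratio = [["0.1", "-0.2", ""], ["1.3", "0.0", "-0.4"]]

# One entry per gene in each of the per-gene lists...
assert len(orf) == len(name) == len(logratio)
# ...and every log-ratio row exactly as wide as the header
assert set(len(row) for row in logratio) == set([len(header)])
print("dimensions OK")
```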
Here's a test case:
a = "-3.2"
b = ""
The method that I gave at the end of day 2: explicitly check for empty strings.
Here, I introduced an optional parameter (missing) by declaring a default value (None). This lets the user supply a different default value (e.g., if we want to ensure that the entire output is floats).
def make_float(x, missing = None):
    if x == "":
        return missing
    else:
        return float(x)
Testing the function:
print [make_float(i) for i in (a,b)]
print [make_float(i,0.) for i in (a,b)]
print [make_float(i) for i in parsed1[3][0][:10]]
Adolfo's method, using exceptions. We try to coerce to a float, then catch exceptional cases and give them special treatment
def make_float(x, missing = None):
    try:
        return float(x)
    except:
        return missing
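One caveat with the bare except above: it swallows every error, not just the empty-string case. A variant (make_float_strict is my name for it, not from the lecture) that catches only ValueError lets genuine bugs, like passing a non-string, surface:

```python
def make_float_strict(x, missing=None):
    # Catch only the "couldn't parse this string" case;
    # anything else propagates to the caller
    try:
        return float(x)
    except ValueError:
        return missing

print(make_float_strict("-3.2"))   # parses fine
print(make_float_strict(""))       # missing value -> None
try:
    make_float_strict(None)        # a real bug: not a string at all
except TypeError:
    print("TypeError surfaced instead of being silently swallowed")
```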
Some examples of how exceptions work in this case:
float("")
try:
    a = float("")
except:
    a = 2
print a
try:
    a = float("")
except TypeError:
    a = 3
except ValueError:
    a = 2
except:
    a = 4
print a
def f(x):
    return float(x)

try:
    for i in range(3):
        for j in range(4):
            z = f(2)
except:
    print "didn't work"
def f(x):
    return float(x)

try:
    for i in range(3):
        for j in range(4):
            z = f("two")
except:
    print "didn't work"
def f(x):
    return float(x)

try:
    for i in range(3):
        for j in range(4):
            z = f("2")
except:
    print "didn't work"
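In the three examples above, one bad value aborts the entire nested loop. If instead we want to skip bad values and keep going, the try belongs inside the loop. A sketch (the values list is made up):

```python
def f(x):
    return float(x)

values = ["1", "two", "3"]
converted = []
for v in values:
    try:
        converted.append(f(v))
    except ValueError:
        print("skipping " + repr(v))   # note the bad value and move on
print(converted)                       # [1.0, 3.0]
```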
Adolfo's homework solution, using exceptions
def parse_cdt2(fname, missing = None):
    orf = []
    name = []
    logratio = []
    r = reader(open(fname),excel_tab)
    header = r.next()[2:]
    for row in r:
        orf.append(row[0])
        name.append(row[1])
        logline = []
        for field in row[2:]:
            try:
                logline.append(float(field))
            except ValueError:
                logline.append(missing)
        logratio.append(logline)
    return header,orf,name,logratio
Let's test it with both ways of handling missing values
parsed2 = parse_cdt2("supp2data.cdt",0.)
parsed2b = parse_cdt2("supp2data.cdt")
Confirm that we get the right first dimensions
[len(i) for i in parsed2]
Confirm that we get the right second dimension for the log ratios
len(parsed2[-1][0])
Confirm that the log ratio matrix is rectangular (all rows the same length)
set(len(i) for i in parsed2[-1])
Call with an open file object rather than a file name
def parse_cdt3(fp, missing = None):
    orf = []
    name = []
    logratio = []
    r = reader(fp,excel_tab)
    header = r.next()[2:]
    for row in r:
        orf.append(row[0])
        name.append(row[1])
        logline = []
        for field in row[2:]:
            try:
                logline.append(float(field))
            except ValueError:
                logline.append(missing)
        logratio.append(logline)
    return header,orf,name,logratio
parsed3 = parse_cdt3(open("supp2data.cdt"))
With this added layer of indirection, we can now parse things other than vanilla file objects (e.g., files on the web, or compressed files)
from urllib import urlopen
parsed3b = parse_cdt3(urlopen("http://histo.ucsf.edu/BMS270/BMS270_2015/data/supp2data.cdt"))
parsed3[0] == parsed3b[0]
!gzip supp2data.cdt
import gzip
parsed3c = parse_cdt3(gzip.open("supp2data.cdt.gz"))
!gunzip supp2data.cdt.gz
We finish up by doing a full parse with both methods of handling missing values.
We use the tuple trick from the top of this notebook to give each of the four returned components a name (so that we don't have to refer to them as parsed3[0], etc.)
(header,genes,annotations,matrix1) = parse_cdt3(open("supp2data.cdt"))
(header,genes,annotations,matrix2) = parse_cdt3(open("supp2data.cdt"),0.)
Unlike day 1, I don't start matplotlib in inline mode. Now, the plots show up as separate interactive windows.
%matplotlib
import matplotlib.pyplot as plt
import numpy as np
plt.plot(matrix2[0])
I can still embed the plots in the notebook by explicitly calling IPython's display function
from IPython.core.display import display
Expression profile for the first gene vs. all conditions
fig = plt.figure()
plt.plot(matrix2[0])
display(fig)
Manually zoomed detail (note that if I re-run the notebook I will lose this zoom)
display(fig)
Scatter plot comparing the expression profiles of the first two genes for all conditions
fig = plt.figure()
plt.plot(matrix2[0],matrix2[1],"bo")
display(fig)
All of the data as a heatmap
fig = plt.figure()
plt.imshow(matrix2, interpolation="none", aspect="auto")
plt.colorbar()
display(fig)
Without interpolation="none", we get undesired anti-aliasing
fig = plt.figure()
plt.imshow(matrix2, aspect = "auto")
plt.colorbar()
display(fig)
Without aspect = "auto" we get matched x and y dimensions (not suitable for our very skinny data matrix)
fig = plt.figure()
plt.imshow(matrix2, interpolation="none")
plt.colorbar()
display(fig)
The main homework problem for tonight is calculating the square matrix of Pearson correlations of expression profiles among the 2467 yeast genes in this data set. It is useful to start from smaller subsets of genes and work up to the full data set.
Expression profiles for the first two genes:
m = matrix2[:2]
Expression profiles for the first ten genes:
m = matrix2[:10]
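As a starting point for the homework, here is one way to compute the Pearson correlation of two expression profiles in plain Python (pearson is my own sketch, not the lecture's solution; it assumes the 0.-filled matrix, i.e., no None entries):

```python
def pearson(x, y):
    # Pearson correlation of two equal-length profiles
    n = float(len(x))
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Tiny sanity checks on made-up profiles
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # 1.0: perfectly correlated
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))   # -1.0: anti-correlated
```

The full homework answer is then the matrix of pearson(m[i], m[j]) over all gene pairs i, j.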
If you'd like to write your calculated matrix to disk, you'll need to know how to write files
Open a file in write mode (rather than the default read mode):
fp = open("example.txt","w")
Write a line of text to the file
(note that newlines are not added automatically -- we have to give them explicitly)
fp.write("Hello, world\n")
One way to write comma- or tab-delimited text to a file is to use the string join method (the reciprocal of the split method)
s = "\t".join(["these","are","some","words and phrases"])
print s
fp.write(s+"\n")
s = ",".join(["these","are","some","words and phrases"])
print s
fp.write(s+"\n")
We can also use csv.writer -- note that this adds Excel-style quoting/escaping for special cases like commas inside of a comma-delimited field
from csv import writer
out = writer(fp)
out.writerow(["these","are","some","words and phrases"])
out.writerow(['This is a "trickier" set of fields','"What," you might ask, "is difficult about formatting them?"'])
When we're done writing to the file, we close it with its close method.
fp.close()
Let's see what we got:
open("example.txt").read()
Or, with proper formatting of the tabs and newlines:
(note the Excel-style quoting of the tricky fields in the last line)
print open("example.txt").read()