mvoorhie@virgil:~/Projects/Courses/PracticalBioinformatics/python/examples$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

# A few experiments to address Nicole's question:
# "Why does Tali need to initialize her lists as, e.g.: d1 = []"

# If d1 is not defined yet, Python doesn't know what to do with it
>>> d1.append(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'd1' is not defined

# Initialize d1 as an empty list
>>> d1 = []

# Now Python looks inside of d1 and finds the appropriate append
# method for a list
>>> d1.append(3)
>>> d1
[3]

# What if we had initialized d1 as an empty tuple instead?
>>> d1 = tuple()
>>> d1
()

# Tuples are immutable (read-only) lists, so they don't define an
# append method.
>>> d1.append(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'

# Now some experiments addressing Nicole's question:
# "Do variables in Python start out with undefined types?"
>>> type(d1)
<type 'tuple'>
>>> d1 = 5
>>> type(d1)
<type 'int'>

# Here, the += operator adds a float to an int; the int is promoted
# to a float, so the result is a float
>>> d1 += 2.5
>>> d1
7.5
>>> type(d1)
<type 'float'>

# --> answer: No. Python is a strongly typed language: a variable
#     always has exactly one well-defined type -- the type of the
#     value currently bound to it. That type is usually set
#     implicitly on assignment.

# Now, I want to move my Python session to the directory that my data
# is in (instead, I could have stayed in the current directory and
# used a longer string to refer to the file, e.g.
# open("data/supp2data.csv"))
>>>
mvoorhie@virgil:~/Projects/Courses/PracticalBioinformatics/python/examples$ cd ../data/
mvoorhie@virgil:~/Projects/Courses/PracticalBioinformatics/python/data$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

# Open "supp2data.csv" and read the whole file. Each line of the file
# is returned as a separate element of the list f. The file is
# assumed to be a UNIX text file, so lines are split on "\n".
# This is the buffering that Tali did at the beginning of her homework
# solution.
>>> f = open("supp2data.csv").readlines()

# Now, we will try parsing the same file with a default csv.reader
# object. This object does more sophisticated parsing than the simple
# "split on commas" rule that we were using, so it can handle the
# tricky case that Eve found -- gene annotations with commas in them
# (which should not be split into separate elements).

# First, we make a file object wrapping the file (in read-only mode).
>>> fp = open("supp2data.csv")

# Now, we make a csv.reader object wrapping the file object.
>>> import csv
>>> r = csv.reader(fp)
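
# (An aside, not from the original session: csv.reader accepts any
#  iterable of strings, so we can check the quoting rules on a tiny
#  made-up line before trusting them with real data. A comma inside
#  a quoted field stays inside its field instead of splitting it.)
>>> list(csv.reader(['a,"b,c",d']))
[['a', 'b,c', 'd']]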

# When we ask the reader (r) for its next element, it first calls
# fp.next() (getting a string, which is the next line in the file),
# then breaks that line up according to the CSV parsing rules,
# returning a list of strings.
# This is the first line of the file, so each string is a column
# header.
>>> r.next()
['ORF', 'NAME', 'alpha 0', 'alpha 7', 'alpha 14', 'alpha 21', 'alpha 28', 'alpha 35', 'alpha 42', 'alpha 49', 'alpha 56', 'alpha 63', 'alpha 70', 'alpha 77', 'alpha 84', 'alpha 91', 'alpha 98', 'alpha 105', 'alpha 112', 'alpha 119', 'Elu 0', 'Elu 30', 'Elu 60', 'Elu 90', 'Elu 120', 'Elu 150', 'Elu 180', 'Elu 210', 'Elu 240', 'Elu 270', 'Elu 300', 'Elu 330', 'Elu 360', 'Elu 390', 'cdc15 10', 'cdc15 30', 'cdc15 50', 'cdc15 70', 'cdc15 90', 'cdc15 110', 'cdc15 130', 'cdc15 150', 'cdc15 170', 'cdc15 190', 'cdc15 210', 'cdc15 230', 'cdc15 250', 'cdc15 270', 'cdc15 290', 'spo 0', 'spo 2', 'spo 5', 'spo 7', 'spo 9', 'spo 11', 'spo5 2', 'spo5 7', 'spo5 11', 'spo- early', 'spo- mid', 'heat 0', 'heat 10', 'heat 20', 'heat 40', 'heat 80', 'heat 160', 'dtt 15', 'dtt 30', 'dtt 60', 'dtt 120', 'cold 0', 'cold 20', 'cold 40', 'cold 160', 'diau a', 'diau b', 'diau c', 'diau d', 'diau e', 'diau f', 'diau g']

# Here, I decided that I wanted to catch the returned list in a
# variable for further hacking, rather than just dumping it on the
# screen.
# First, to go back to the beginning, I re-open the file and make a
# new reader object for it (in cases where speed matters, it is
# faster to rewind the file than to ask the operating system to open
# it again; most of the time, it is better to simply re-open the
# file).
>>> fp = open("supp2data.csv")
>>> r = csv.reader(fp)

# This time, when I ask the reader to parse the next line, I assign
# the result to the variable a so that I can play with it.
>>> a = r.next()

# Inspecting the number of elements and the first, second, third, and
# last elements:
>>> len(a)
81
>>> a[0]
'ORF'
>>> a[1]
'NAME'
>>> a[2]
'alpha 0'
>>> a[-1]
'diau g'

# Now I parse the remaining (non-header) lines of the file.
# The for loop inside of the list comprehension keeps calling
# r.next() and appending the result to the list, which is then
# assigned to data. The result is a list of lists that can
# be indexed as: data[row][column].
>>> data = [i for i in r]

# First row, first column (yeast systematic gene name):
>>> data[0][0]
'YBR166C'

# First row, second column (gene annotation):
>>> data[0][1]
'TYR1 TYROSINE BIOSYNTHESIS PREPHENATE DEHYDROGENASE (NADP+'

# First row, last column (log2(ratio) -- note that this is still a
# string):
>>> data[0][-1]
'-0.27'

# We decided to use csv.reader in place of our original parsing
# method in order to avoid splitting annotations with commas into
# additional fields. If we were successful, every row should have
# exactly 81 columns (name, annotation, and 79 log ratios).
# We test this by looping over all of the data and checking our
# assertion:
>>> for i in data:
...     assert(len(i) == 81)
...

# Yay! assert never raised an exception, so our parsing was
# successful.

# In order to check that our test can actually detect a bad parse, I
# corrupted data by adding an extra field to the 53rd line (52,
# counting from 0):
>>> data[52].append("some extra data")

# Now assert _does_ raise an exception:
>>> for i in data:
...     assert(len(i) == 81)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
AssertionError
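
# (An aside, not from the original session: assert is a statement,
#  not a function, and it takes an optional message after a comma --
#  handy for making failures self-describing. Note that
#  "assert(test, message)" would NOT work: that asserts a two-element
#  tuple, which is always true.)
>>> for i in data:
...     assert len(i) == 81, "bad row starts with %s" % i[0]
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
AssertionError: bad row starts with YER042W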

# For debugging, we would like to know more about the bad line.
# We could just print its contents:
>>> for i in data:
...     if(len(i) != 81):
...         print i
...
['YER042W', 'NONE OXIDATIVE STRESS RESPONS PEPTIDE-METHIONINE SULFOXIDE REDUCTASE', '-0.29', '1.1', '-0.2', '-0.06', '-0.42', '0.44', '0.32', '0.7', '0.42', '0.06', '0.03', '-0.14', '-0.18', '-0.3', '-0.29', '0.1', '0.18', '0.29', '-0.25', '0.61', '0.1', '-0.43', '0.23', '0.14', '0.32', '-0.34', '-0.34', '-0.12', '0.31', '-0.25', '-0.58', '-0.3', '-0.49', '-0.12', '0.4', '0.81', '0.51', '-0.17', '-0.1', '0.3', '0.71', '0.19', '0.12', '0.11', '0.19', '0.07', '0.23', '0.29', '0.69', '0.11', '0.9', '0.88', '1.07', '0.45', '0.49', '0.58', '0.75', '0.54', '-0.58', '-0.69', '-0.34', '-0.56', '-0.36', '-0.12', '-0.62', '-0.56', '-0.89', '-0.54', '-0.38', '0.15', '-0.03', '-0.23', '-0.04', '-0.03', '0.18', '-0.42', '-0.51', '0.1', '-0.74', 'some extra data']

# Or, we could keep track of what line we're on and print its index
# (plus the first few columns, just for fun):
>>> for i in xrange(len(data)):
...     if(len(data[i]) != 81):
...         print i, data[i][:5]
...
52 ['YER042W', 'NONE OXIDATIVE STRESS RESPONS PEPTIDE-METHIONINE SULFOXIDE REDUCTASE', '-0.29', '1.1', '-0.2']

# Since we reported the index (52 above), we can refer directly to
# that line for further inspection:
>>> a = data[52]
>>> a
['YER042W', 'NONE OXIDATIVE STRESS RESPONS PEPTIDE-METHIONINE SULFOXIDE REDUCTASE', '-0.29', '1.1', '-0.2', '-0.06', '-0.42', '0.44', '0.32', '0.7', '0.42', '0.06', '0.03', '-0.14', '-0.18', '-0.3', '-0.29', '0.1', '0.18', '0.29', '-0.25', '0.61', '0.1', '-0.43', '0.23', '0.14', '0.32', '-0.34', '-0.34', '-0.12', '0.31', '-0.25', '-0.58', '-0.3', '-0.49', '-0.12', '0.4', '0.81', '0.51', '-0.17', '-0.1', '0.3', '0.71', '0.19', '0.12', '0.11', '0.19', '0.07', '0.23', '0.29', '0.69', '0.11', '0.9', '0.88', '1.07', '0.45', '0.49', '0.58', '0.75', '0.54', '-0.58', '-0.69', '-0.34', '-0.56', '-0.36', '-0.12', '-0.62', '-0.56', '-0.89', '-0.54', '-0.38', '0.15', '-0.03', '-0.23', '-0.04', '-0.03', '0.18', '-0.42', '-0.51', '0.1', '-0.74', 'some extra data']

# Oops, typo! Here, i still holds its most recent value -- the index
# of the last line of the file, which has the correct length:
>>> len(data[i])
81

# Here's what I meant to type -- the index of the bad line, which is
# one element too long:
>>> len(data[52])
82

# Here's how to read the documentation for the csv module:
>>> help(csv)

# Now, we would like to make a new copy of the file using tab
# characters (\t) rather than commas.
# First, we open a new file for writing (if the file does not exist,
# it is created; if it already exists, it is overwritten).
>>> outfile = open("formatted.tdt","w")

# Then, we wrap the file object in a writer object that will handle
# all of the formatting details for us. Rather than using the default
# CSV dialect as before, we specify the excel_tab dialect. This is an
# example of using a keyword argument (dialect = ...) to indicate
# which optional parameter we are providing.
>>> out = csv.writer(outfile, dialect = csv.excel_tab)

# writerow takes a list as input and generates a properly formatted
# string to be written as a line of text in the file by the file
# object. In this case, each row that we pass is a list of strings
# (because we didn't bother to coerce them to floats), but writerow
# can handle any type of list element that can be converted to a
# string (try experimenting with this -- see the aside below).
>>> for i in data:
...     out.writerow(i)
...
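
# (Trying that experiment, outside the original session: writerow
#  calls str() on non-string elements, and None is written as an
#  empty field. Writing to sys.stdout here so we don't disturb our
#  output file:)
>>> import sys
>>> csv.writer(sys.stdout, dialect = csv.excel_tab).writerow(["YBR166C", 1, 2.5, None])
YBR166C	1	2.5	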

# To close the file and flush any buffered output to disk, we destroy
# the writer object. Because we have an explicit reference to the
# wrapped file object, we destroy that as well (if we had started with
#   out = csv.writer(open("formatted.tdt","w"), dialect = csv.excel_tab)
# then we could get away with just 'del out' at this point).
>>> del out
>>> del outfile

# This is where class ended (it is okay to stop reading here)

# Alex noticed that with our original parsing function (based on
# split rather than csv.reader) randomly chosen rows were different
# lengths (and, therefore, crashing the correlation function). We
# wanted to know how many rows were the wrong length, and which ones
# they were (note that this uses some tricks that we haven't covered
# in class):

# First, I parse supp2data.csv by naively splitting on commas, as
# before:
>>> data = [i.split(",") for i in open("supp2data.csv")]

# This is the number of lines in the file (2467 genes + header):
>>> len(data)
2468

# A set is an unordered collection of unique elements. If you
# initialize it with a redundant list of elements, only the unique
# elements will be retained, so this is a good way to count unique
# things.
# In this case, I am initializing the set with the length of each row
# in data, so I get the unique row lengths. Instead of using a
# list comprehension, as we have done in class, I am dropping the
# square brackets ([]), which results in a generator expression.
# A generator expression is to a list comprehension as xrange is to
# range, i.e., I get the same sequence of elements, but no list is
# actually generated (so, I save some time and some memory).
>>> set(len(i) for i in data)
set([81, 82, 83, 84])

# So, there _are_ some corrupt lines that are too long, but no lines
# that are too short (as expected). Now, which lines are too long,
# and how many of them are there? To answer this, I will use another
# new type, a dictionary:
>>> h = {}

# A dictionary is a collection of (key, value) pairs. Each value can
# be retrieved with the corresponding key like this:
#   value = h[key]
# and set (or changed) like this:
#   h[key] = value
# If you try to look up a key that is not in the dictionary, it will
# raise a KeyError.
# Here, I let the keys be the different row lengths and the values
# be lists of rows with the corresponding lengths (i.e., h indexes
# data on row length).
>>> for i in data:
...     try:
...         h[len(i)].append(i)
...     except KeyError:
...         h[len(i)] = [i]
...

# Having generated the index, I can now count the number of rows of
# each length. h.items() returns a list of all of the (key, value)
# pairs in the dictionary.
>>> for (key,val) in h.items():
...     print key, len(val)
...
81 2172
82 279
83 14
84 3

# So, over 10% of the rows have a bad length when parsed without
# csv.reader; therefore, we are not surprised that randomly chosen
# pairs were crashing the correlation function.

# Since we've indexed all of the rows, it's easy to inspect the bad
# ones. Here, I am looking up the systematic name of the first gene
# in the first row of 82 elements:
>>> h[82][0][0]
'YIL118W'
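
# (One last aside, not from the original session: dict.setdefault
#  does the same "create the list the first time we see a key" trick
#  without the try/except -- it returns h2[key] if the key exists,
#  and otherwise inserts and returns the supplied default:)
>>> h2 = {}
>>> for i in data:
...     h2.setdefault(len(i), []).append(i)
...
>>> for (key,val) in h2.items():
...     print key, len(val)
...
81 2172
82 279
83 14
84 3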