Review and Questions

Indentation matters

Creating a simple 4x5 matrix (the right way)

In [1]:
mat = []
for i in range(4):
    row = []
    for j in range(5):
        row.append(i*j)
    mat.append(row)

mat
Out[1]:
[[0, 0, 0, 0, 0], [0, 1, 2, 3, 4], [0, 2, 4, 6, 8], [0, 3, 6, 9, 12]]
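The same matrix can also be built with a nested list comprehension. This is only an equivalent, more compact sketch of the loop above, not a replacement for understanding the indentation:

```python
# Outer comprehension walks the rows (i), inner one walks the columns (j).
mat = [[i * j for j in range(5)] for i in range(4)]

for row in mat:
    print(row)
```

Because there is no separate append step, there is no append line to mis-indent.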

Same code with an indentation error

(Note that row is now appended once per inner-loop pass, so mat ends up with 4 groups of 5 entries, and the 5 entries in each group all refer to the same list)

In [2]:
mat = []
for i in range(4):
    row = []
    for j in range(5):
        row.append(i*j)
        mat.append(row)

mat
Out[2]:
[[0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 2, 4, 6, 8],
 [0, 2, 4, 6, 8],
 [0, 2, 4, 6, 8],
 [0, 2, 4, 6, 8],
 [0, 2, 4, 6, 8],
 [0, 3, 6, 9, 12],
 [0, 3, 6, 9, 12],
 [0, 3, 6, 9, 12],
 [0, 3, 6, 9, 12],
 [0, 3, 6, 9, 12]]
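A quick way to confirm that the repeated entries are one shared list rather than five copies is to compare them with the is operator and then change one of them. The variable names same_object and different_group are just for illustration:

```python
mat = []
for i in range(4):
    row = []
    for j in range(5):
        row.append(i * j)
        mat.append(row)  # mis-indented: runs on every inner-loop pass

# Entries 0..4 were appended during the same outer-loop pass.
same_object = mat[0] is mat[4]
# Entry 5 comes from the next outer-loop pass, so it is a different list.
different_group = mat[4] is mat[5]

# Changing one entry changes every entry that shares its list.
mat[0].append(99)
print(same_object, different_group, mat[4])
```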

Scope

  • Variables defined inside of a function are not available outside of that function (unless we return their values)
  • The first line of the function defines the local (inside of the function) names for the arguments passed to the function, independent of the external names of those variables
    • e.g., external A and B below are referenced as x and y inside of f
  • The caller's assignment of the return value defines the external name for that value, independent of its name inside of the function
    • e.g., the value j returned by f is assigned to C
In [3]:
def f(x,y):
    j = x*y
    return j
In [4]:
A = 2
B = 4
C = f(A,B)
In [5]:
C
Out[5]:
8
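To see the first bullet in action, here is a small sketch that uses a local name (product, chosen only for this example) that has never been assigned at the top level, and then checks whether it is visible after the call:

```python
def f(x, y):
    product = x * y  # 'product' exists only while f is running
    return product

A = 2
B = 4
C = f(A, B)

# Looking up the local name from outside the function fails.
try:
    product
    leaked = True
except NameError:
    leaked = False

print(C, leaked)
```

In the notebook session above, j happens to exist at the top level too, but only because the earlier matrix loops assigned it there.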

Question: "How do you append to the beginning of a list?"

Answer: use the + operator to concatenate (this also works for strings)

In [6]:
x = [1,2,3]
x = [4]+x
In [7]:
x
Out[7]:
[4, 1, 2, 3]

We discussed that adding to the left is more expensive than adding to the right: [4]+x builds a new list and copies every element of x into it, whereas appending on the right can usually reuse the existing storage.

In the case where we know the final size of the list and where all of our data is going to go in it, we can get around this problem by pre-allocating the list:

In [13]:
x = [None]*10

Start with a list of 10 null values (the expected final size of the list)

In [14]:
x
Out[14]:
[None, None, None, None, None, None, None, None, None, None]

Fill in 9 numbers, leaving room for an annotation on the left

In [15]:
for i in xrange(9):
    x[i+1] = i
x
Out[15]:
[None, 0, 1, 2, 3, 4, 5, 6, 7, 8]

Fill in the annotation

In [16]:
x[0] = "hi"
In [17]:
x
Out[17]:
['hi', 0, 1, 2, 3, 4, 5, 6, 7, 8]
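When the final size is not known ahead of time, pre-allocation is not an option. One alternative from the standard library, not covered in class, is collections.deque, which supports cheap additions at both ends. A minimal sketch:

```python
from collections import deque

d = deque([1, 2, 3])
d.appendleft(4)  # add on the left without copying the existing elements

# Convert back to a plain list once we are done building.
x = list(d)
print(x)
```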

Question: How do you keep track of the names you've assigned?

(Equivalently -- how do you keep track of all of the data that you've put into core memory?)

In "vanilla" python, you can use the dir function. This works in ipython, but is cluttered up with a lot of extra names that ipython generates:

In [18]:
dir()
Out[18]:
['A',
 'B',
 'C',
 'In',
 'Out',
 '_',
 '_1',
 '_10',
 '_12',
 '_14',
 '_15',
 '_17',
 '_2',
 '_5',
 '_7',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__name__',
 '_dh',
 '_i',
 '_i1',
 '_i10',
 '_i11',
 '_i12',
 '_i13',
 '_i14',
 '_i15',
 '_i16',
 '_i17',
 '_i18',
 '_i2',
 '_i3',
 '_i4',
 '_i5',
 '_i6',
 '_i7',
 '_i8',
 '_i9',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 '_sh',
 'exit',
 'f',
 'get_ipython',
 'i',
 'j',
 'mat',
 'quit',
 'row',
 'x']

To deal with this clutter problem, ipython provides the %who magic, which lists just the names that you've explicitly defined.

In [19]:
%who
A	 B	 C	 f	 i	 j	 mat	 row	 x	 
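Outside of ipython there is no %who, but something similar can be approximated by filtering a namespace. The helper below is a hypothetical sketch (user_names is not part of Python or ipython); it drops underscore-prefixed names and the handful of ipython-injected names seen in the dir() listing above:

```python
def user_names(namespace):
    """Return the sorted names in a namespace that look user-defined."""
    # Names that ipython adds to the session (taken from the listing above).
    injected = {"In", "Out", "exit", "quit", "get_ipython"}
    return sorted(
        name for name in namespace
        if not name.startswith("_") and name not in injected
    )

# Typical use: pass the current global namespace.
print(user_names(globals()))
```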


How do you get rid of a name you've assigned?

E.g., to free up memory, or for debugging purposes

Answer: use the del statement

In [20]:
A
Out[20]:
2
In [21]:
del A
In [22]:
A
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-22-bf072e911907> in <module>()
----> 1 A

NameError: name 'A' is not defined
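One caveat when using del to free memory: del removes a name, not necessarily the object behind it. If another name still refers to the same object, the object stays alive. A small sketch (a and b are just example names):

```python
a = [1, 2, 3]
b = a  # second name for the same list
del a  # removes only the name 'a'

try:
    a
    a_defined = True
except NameError:
    a_defined = False

# The list itself is still reachable through 'b'.
print(a_defined, b)
```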

Defining our own modules

As we develop reusable code, it is useful to factor it into our own modules, so that we don't have to keep copy-pasting all of the time. (Modules also make it easier to document and version-control our code, and are a useful way of sharing our code with others).

Here, we create a "stats" module with functions from earlier in the week which will be useful for solving last night's homework problem.

We start by saving stats.py (from a 2013 session of this course) to the directory from which we launched the IPython notebook.

In [23]:
import stats

A module's __file__ attribute tells us the location of the module on disk. Here I'm confirming that I imported the stats.py in my working directory:

In [24]:
stats.__file__
Out[24]:
'stats.py'

Other modules come from elsewhere

(based on the module search path, which Python builds from the current directory, the PYTHONPATH environment variable, and installation-dependent defaults, and which you can inspect and manipulate via):

import sys
sys.path
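Since sys.path is an ordinary list, manipulating it means inserting or appending directory names. In this sketch the directory my_modules is a made-up example; substitute wherever your own modules live:

```python
import os
import sys

# Hypothetical directory holding our own modules.
module_dir = os.path.abspath("my_modules")

# Put it at the front so it is searched before other locations.
if module_dir not in sys.path:
    sys.path.insert(0, module_dir)

print(sys.path[:3])
```

The change lasts only for the current session; setting PYTHONPATH is the usual way to make it persistent.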

E.g., because I'm running ipython in my normal system context, numpy is imported from my system copy of numpy

(if I were running my Canopy copy of ipython from the Canopy virtual environment, the normal context for this course, then numpy would be imported from the version in my Canopy directory)

In [25]:
import numpy

The loaded file has a .pyc extension, indicating that it is a byte-code version, automatically generated by the python interpreter from the text-format .py version.

This is useful to know for two reasons:

  • If you want to read the source, you should look at the .py version
  • The python interpreter is supposed to regenerate the .pyc version any time it is older than the .py version (to ensure that it is the latest version of your code). Very rarely, this will fail, and you will keep seeing the old version of your module every time you import. The fix is to delete the .pyc version, forcing python to regenerate it from the new code.
In [26]:
numpy.__file__
Out[26]:
'/usr/lib/python2.7/dist-packages/numpy/__init__.pyc'
In [27]:
!ls /usr/lib/python2.7/dist-packages/numpy/__init__.py
/usr/lib/python2.7/dist-packages/numpy/__init__.py

In [28]:
numpy??

Okay, let's see what's in the stats module we just loaded -- the three stats functions from our day 1 homework.

In [29]:
dir(stats)
Out[29]:
['__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 'mean',
 'pearson',
 'stdev']

Some examples of viewing the docstrings from the first lines of the function definitions.

In [31]:
help(stats.mean)
Help on function mean in module stats:

mean(x)
    Return the mean (average) of a list
    of values.


In [32]:
stats.mean?
In [33]:
help(mean)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-33-cbb4293c3000> in <module>()
----> 1 help(mean)

NameError: name 'mean' is not defined
In [34]:
from stats import mean
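The NameError above happens because import stats binds only the name stats; the functions stay inside the module until we pull them out with from ... import. The same behavior can be seen with the standard-library math module, used here only because stats.py is specific to this course:

```python
import math

# 'import math' does not create a bare 'sqrt' name.
try:
    sqrt(16.0)
    bare_name_works = True
except NameError:
    bare_name_works = False

# After 'from math import sqrt', the bare name is available.
from math import sqrt
root = sqrt(16.0)

print(bare_name_works, root, math.sqrt(16.0))
```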

Here I added the CDT parsing function from last night's homework to my copy of stats.py using my favorite text editor.

*The editor that comes with Canopy is a good choice for editing python code. Other popular text editors are:

  • emacs (a very powerful text-mode editor with a steep learning curve)
  • vi (also a very powerful text-mode editor with a steep learning curve; basis for some of the IPython Notebook interface)
  • eclipse (a very full featured graphical IDE)*

Having edited stats.py, I reload the module to get access to the new code

In [43]:
reload(stats)
Out[43]:
<module 'stats' from 'stats.py'>
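The reload call above is a builtin in Python 2. In Python 3 it lives in importlib instead, so a version-tolerant sketch looks like the following (json stands in for stats here so that the example runs anywhere):

```python
try:
    from importlib import reload  # Python 3.4 and later
except ImportError:
    pass  # Python 2: reload is already a builtin

import json

# reload re-executes the module's source and returns the module object.
json = reload(json)
parsed = json.loads("[1, 2]")
print(parsed)
```

Names that were copied out earlier with from stats import mean are not refreshed by reload; they keep pointing at the old function until the from ... import line is run again.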

Now I can see the new function

In [37]:
dir(stats)
Out[37]:
['__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 'mean',
 'parse_cdt3',
 'pearson',
 'stdev']

Note that we can crash a naive implementation of Pearson with a divide-by-zero error: when either input has zero variance (here, all zeros), the denominator is zero

In [41]:
x = [0.,0.,0.]
stats.pearson(x,x)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-41-e02b436b85d2> in <module>()
      1 x = [0.,0.,0.]
----> 2 stats.pearson(x,x)

/home/mvoorhie/Projects/Courses/PracticalBioinformatics/python/MikeNotebooks/stats.py in pearson(x, y)
     61     #    faster to calculate)
     62     from math import sqrt
---> 63     return sxy/sqrt(ssx*ssy)
     64 
     65 def parse_cdt3(fp, missing = None):

ZeroDivisionError: float division by zero
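One way to guard against this is to check the denominator before dividing. The function below is a hypothetical sketch, not the stats.pearson from the course file; returning None for the undefined case is an assumption, and raising a more descriptive error would be an equally reasonable choice:

```python
from math import sqrt

def safe_pearson(x, y):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = float(len(x))
    mx = sum(x) / n
    my = sum(y) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    denom = sqrt(ssx * ssy)
    if denom == 0:
        return None  # zero variance in x or y: correlation is undefined
    return sxy / denom

print(safe_pearson([0., 0., 0.], [0., 0., 0.]))
print(safe_pearson([1., 2., 3.], [2., 4., 6.]))
```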

Loading the example data via our new module

In [52]:
(cols, genes, annotations, matrix) = stats.parse_cdt3(open("supp2data.cdt"),0.)

Showing off some of the tricks available via numpy.arrays

In [46]:
import numpy

Make an array from a list of lists. All of the elements of the lists are floats, so we get the default floating-point type (64-bit floats on my laptop)

In [53]:
M = numpy.array(matrix)
In [59]:
M.dtype
Out[59]:
dtype('float64')

As expected, the array has 2467 rows (genes) and 79 columns (conditions)

In [54]:
M.shape
Out[54]:
(2467, 79)

Transposing a numpy array is essentially free:

In [55]:
M.T.shape
Out[55]:
(79, 2467)
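The transpose is cheap because numpy returns a view onto the same data rather than a copy. A small sketch with a made-up 3x4 array shows that writing through the transpose changes the original:

```python
import numpy

M = numpy.zeros((3, 4))
T = M.T  # a view: no data is copied

# Element [0, 2] of the transpose is element [2, 0] of the original.
T[0, 2] = 7.0

print(T.shape, M[2, 0], numpy.shares_memory(M, T))
```

If an independent transposed array is needed, M.T.copy() makes one.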

Using numpy to calculate the dot product of two vectors (the expression profiles of the first two genes)

In [56]:
d = numpy.dot(M[0],M[1])
In [57]:
d.shape
Out[57]:
()
In [58]:
d
Out[58]:
0.98920000000000008
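For two 1-D vectors, the dot product is the sum of the elementwise products, which is why the result is a single number with an empty shape. A plain-Python sketch of the same calculation, with made-up input vectors:

```python
def dot(u, v):
    """Sum of elementwise products of two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

# 1*4 + 2*5 + 3*6 = 32
d = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(d)
```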