Practical Bioinformatics -- Day 2

Definining functions

Defining a function for the mean calculation:

In [2]:
def mean(x):
    """Return the mean of a list of numbers."""
    s = 0.
    for i in x:
        s += i
    return s/len(x)
In [3]:
L1 = [.5,10,17,3]
In [4]:
mean(L1)
Out[4]:
7.625

Should we report our result with return or print?

In [5]:
def mean2(x):
    """Print the mean of a list of numbers."""
    s = 0.
    for i in x:
        s += i
    print s/len(x)
In [6]:
x = mean(L1)
y = mean2(L1)
7.625

mean uses return, so the result is stored in x.

In [7]:
x
Out[7]:
7.625

mean2 uses print with no return statement, so it returns the default return value, which is None.

In [8]:
y
In [9]:
z = None
In [10]:
z

Normal test for equality: z and y both point to None, so their values are equal.

In [11]:
z == y
Out[11]:
True

Test for identity: z and y point to the same location in memory (because the None value is unique and lives in a single location)

In [12]:
z is y
Out[12]:
True

Advanced exercise: The "%p" format string returns an object's address in memory. Can you use this to figure out how python handles memory management for int and float values. Is there a difference?


Should we document our function with a string or a comment?

In [13]:
def mean3(x):
    #Return the mean of a list of numbers.
    s = 0.
    for i in x:
        s += i
    return s/len(x)

Because mean is documented with a docstring, its documentation is visible to IPython via mean? or tab-completion on mean(.

mean3 uses a comment, which is invisible to the built in help.

Functions for creating integer sequences:

In [14]:
x = range(10000)

Using IPython's %time magic to time our mean function

In [15]:
%time mean(x)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.49 ms
Out[15]:
4999.5

For more accurate measurements, use %timeit, which averages over multiple runs (c.f. %timeit? for details)

In [16]:
%timeit mean(x)
1000 loops, best of 3: 570 µs per loop
In [17]:
%time?

Demonstrating that IPython magics are converted to python function calls when logging:

In [18]:
%logstart -o day2.log
Activating auto-logging. Current session state plus future input saved.
Filename       : day2.log
Mode           : backup
Output logging : True
Raw input log  : False
Timestamping   : False
State          : active
In [19]:
%less day2.log

Using introspection to find the source of the sqrt function in the pylab environment:

In [21]:
sqrt.__class__
Out[21]:
numpy.ufunc

Ah, it's the version from numpy (not part of standard Python. Standard Python includes a slightly different sqrt implementation in the math module)

In [22]:
sqrt.__class__.__module__
Out[22]:
'numpy'

? is an IPython-specific way to look up the documentation and source for an object:

In [25]:
sqrt?

Importing functions from external modules

Advanced: There are several ways to import modules from directories other than the current directory. The most common is to modify the PYTHONPATH environment variable. You can inspect and modify the current import path via sys.path.

In [26]:
import stats
In [27]:
stats?

dir lists the contents of an object:

In [28]:
dir(stats)
Out[28]:
['__builtins__', '__doc__', '__file__', '__name__', '__package__', 'mean']
In [29]:
stats.mean?
In [30]:
stats.mean(L1)
Out[30]:
7.625
In [31]:
from stats import mean
In [32]:
mean(L1)
Out[32]:
7.625

Here, we edited the definition of mean in stats.py and experimented with how to make the change visible to Python:

In [33]:
mean(L1)
Out[33]:
7.625
In [34]:
stats.mean(L1)
Out[34]:
7.625
In [35]:
import stats
In [36]:
stats.mean(L1)
Out[36]:
7.625
In [37]:
reload(stats)
Out[37]:
<module 'stats' from 'stats.py'>
In [38]:
stats.mean(L1)
Out[38]:
'Oops!'
In [39]:
mean(L1)
Out[39]:
7.625
In [40]:
from stats import mean
In [41]:
mean(L1)
Out[41]:
'Oops!'

File I/O

Opening a file in read-only mode:

In [42]:
fp = open("stats.py")

Read the first 10 bytes:

Python scripts and modules are simple text files, so Python translates the bytes to text using its default text encoding

In [43]:
fp.read(10)
Out[43]:
'"""Some de'

Read the next 10 bytes (note that read has a side effect: the file pointer moves through the file as it is read, so sequential calls to read produce different results)

In [44]:
fp.read(10)
Out[44]:
'scriptive '

Read the rest of the file:

In [45]:
fp.read()
Out[45]:
'statistics"""\n\ndef mean(x):\n    """Return the mean of a list of numbers."""\n    s = 0.\n    for i in x:\n        s += i\n    #return s/len(x)\n    return "Oops!"\n\n'
s /div>

Rewind by reopening the file

In [46]:
fp = open("stats.py")

Because stats.py is a text file, we can use readline in place of read to read a whole line of text (including the end-of-line character)

In [47]:
fp.readline()
Out[47]:
'"""Some descriptive statistics"""\n'
In [48]:
fp.readline()
Out[48]:
'\n'
iv class="prompt input_prompt">In [49]:
fp.readline()
Out[49]:
'def mean(x):\n'

For file objects, next has the same behavior as readline.

In [50]:
fp.next()
Out[50]:
'    """Return the mean of a list of numbers."""\n'

Objects with a next method are iterators, which can be used by a for loop

To be more precise: for loops operate on iterables, which are objects with an __iter__ method, which is responsible for generating or pointing to the appropriate iterator object

In [51]:
for i in open("stats.py"):
    print i
"""Some descriptive statistics"""



def mean(x):

    """Return the mean of a list of numbers."""

    s = 0.

    for i in x:

        s += i

    #return s/len(x)

    return "Oops!"



The example above is double spaced, because print adds its own newline (\n) character. We can use the rstrip string method to remove the redundant newline (and any other trailing whitespace) ftem each line of the file:

In [52]:
for i in open("stats.py"):
    print i.rstrip()
"""Some descriptive statistics"""

def mean(x):
    """Return the mean of a list of numbers."""
    s = 0.
    for i in x:
        s += i
    #return s/len(x)
    return "Oops!"

To write data to disk, we can open a file in write mode:

In [53]:
out = open("example.txt","w")
In [54]:
out.write("This is some text")
out.write("this is some other text")
out.close()
In [55]:
for line in open("example.txt"):
    print line.rstrip()
This is some textthis is some other text

If we want to write output to separate lines, we need to add explicit newlines:

In [57]:
out = open("example.txt","w")
out.write("This is some text\n")
out.write("this is some other text\n")
out.close()
In [58]:
for line in open("example.txt"):
    print line.rstrip()
This is some text
this is some other text

We have been using relative paths, which look for files relative to the current directory. We can specify absolute paths instead, by starting with a forward slash (/).

Windows uses backslash () as its default path separator, but Python will do the correct conversion of forward slash on Windows, so you don't need to worry about this

In [60]:
for line in open("/home/mvoorhie/Projects/Courses/PracticalBioinformatics/python/May2013/example.txt"):
    print line.rstrip()
This is some text
this is some other text

Inspecting the example tab-delimited expression data from the PNAS paper

readlines returns all lines of a text file as a list of strings

In [61]:
data = open("supp2data.cdt").readlines()
In [62]:
len(data)
Out[62]:
2468
In [63]:
data[0]
Out[63]:
'ORF\tNAME\talpha 0\talpha 7\talpha 14\talpha 21\talpha 28\talpha 35\talpha 42\talpha 49\talpha 56\talpha 63\talpha 70\talpha 77\talpha 84\talpha 91\talpha 98\talpha 105\talpha 112\talpha 119\tElu 0\tElu 30\tElu 60\tElu 90\tElu 120\tElu 150\tElu 180\tElu 210\tElu 240\tElu 270\tElu 300\tElu 330\tElu 360\tElu 390\tcdc15 10\tcdc15 30\tcdc15 50\tcdc15 70\tcdc15 90\tcdc15 110\tcdc15 130\tcdc15 150\tcdc15 170\tcdc15 190\tcdc15 210\tcdc15 230\tcdc15 250\tcdc15 270\tcdc15 290\tspo 0\tspo 2\tspo 5\tspo 7\tspo 9\tspo 11\tspo5 2\tspo5 7\tspo5 11\tspo- early\tspo- mid\theat 0\theat 10\theat 20\theat 40\theat 80\theat 160\tdtt 15\tdtt 30\tdtt 60\tdtt 120\tcold 0\tcold 20\tcold 40\tcold 160\tdiau a\tdiau b\tdiau c\tdiau d\tdiau e\tdiau f\tdiau g\r\n'
In [64]:
print data[0]
ORF	NAME	alpha 0	alpha 7	alpha 14	alpha 21	alpha 28	alpha 35	alpha 42	alpha 49	alpha 56	alpha 63	alpha 70	alpha 77	alpha 84	alpha 91	alpha 98	alpha 105	alpha 112	alpha 119	Elu 0	Elu 30	Elu 60	Elu 90	Elu 120	Elu 150	Elu 180	Elu 210	Elu 240	Elu 270	Elu 300	Elu 330	Elu 360	Elu 390	cdc15 10	cdc15 30	cdc15 50	cdc15 70	cdc15 90	cdc15 110	cdc15 130	cdc15 150	cdc15 170	cdc15 190	cdc15 210	cdc15 230	cdc15 250	cdc15 270	cdc15 290	spo 0	spo 2	spo 5	spo 7	spo 9	spo 11	spo5 2	spo5 7	spo5 11	spo- early	spo- mid	heat 0	heat 10	heat 20	heat 40	heat 80	heat 160	dtt 15	dtt 30	dtt 60	dtt 120	cold 0	cold 20	cold 40	cold 160	diau a	diau b	diau c	diau d	diau e	diau f	diau g

In [65]:
data[1]
Out[65]:
'YBR166C\tTYR1   TYROSINE BIOSYNTHESIS    PREPHENATE DEHYDROGENASE (NADP+\t0.33\t-0.17\t0.04\t-0.07\t-0.09\t-0.12\t-0.03\t-0.2\t-0.06\t-0.06\t-0.14\t-0.18\t-0.06\t-0.25\t0.06\t-0.12\t0.25\t0.43\t0.21\t-0.04\t-0.15\t-0.04\t0.21\t-0.14\t-0.03\t-0.07\t-0.36\t-0.14\t-0.42\t-0.34\t-0.23\t-0.17\t0.23\t0.3\t0.41\t-0.07\t-0.23\t-0.12\t0.16\t0.74\t0.14\t-0.49\t-0.32\t0.19\t0.23\t0.24\t0.28\t1.13\t-0.12\t0.1\t\t0.66\t0.62\t0.08\t0.62\t0.43\t0.5\t-0.25\t-0.51\t-0.67\t0.21\t-0.74\t-0.36\t-0.01\t0.38\t0.15\t-0.22\t-0.09\t0.33\t0.08\t0.39\t-0.17\t0.23\t0.2\t0.2\t-0.17\t-0.69\t0.14\t-0.27\r\n'
In [66]:
data[-1]
Out[66]:
'YLR160C\tASP3   ASPARAGINE UTILIZATION   L-ASPARAGINASE II\t0.07\t-0.04\t0.12\t-0.1\t0.06\t-0.32\t0.21\t-0.25\t-0.17\t-0.29\t-0.42\t-0.36\t0.23\t-0.32\t0.16\t-0.54\t0.38\t-0.22\t0.49\t-0.18\t-0.09\t-0.49\t-0.15\t0.08\t0.11\t0.28\t0.34\t0.23\t-0.01\t0.15\t0.04\t0.14\t-0.18\t-0.47\t-0.1\t0.37\t0.62\t0.34\t0.32\t0.46\t0.04\t0.4\t0.48\t-1.25\t0.18\t0.11\t0.21\t1.11\t-0.6\t-0.92\t-0.58\t-0.42\t-0.81\t0.06\t0.03\t0.08\t0.21\t-0.42\t0.14\t0.3\t-0.23\t-0.42\t-0.06\t0.78\t0.56\t-0.3\t0.06\t0.33\t-0.38\t-0.17\t-0.25\t-0.38\t-0.27\t-0.25\t0.28\t0.1\t0.28\t0.41\t-0.1\r\n'

The string split method splits a line on whitespace (or on an arbitrary string, if supplied)

In [67]:
data[1].split()
Out[67]:
['YBR166C',
 'TYR1',
 'TYROSINE',
 'BIOSYNTHESIS',
 'PREPHENATE',
 'DEHYDROGENASE',
 '(NADP+',
 '0.33',
 '-0.17',
 '0.04',
 '-0.07',
 '-0.09',
 '-0.12',
 '-0.03',
 '-0.2',
 '-0.06',
 '-0.06',
 '-0.14',
 '-0.18',
 '-0.06',
 '-0.25',
 '0.06',
 '-0.12',
 '0.25',
 '0.43',
 '0.21',
 '-0.04',
 '-0.15',
 '-0.04',
 '0.21',
 '-0.14',
 '-0.03',
 '-0.07',
 '-0.36',
 '-0.14',
 '-0.42',
 '-0.34',
 '-0.23',
 '-0.17',
 '0.23',
 '0.3',
 '0.41',
 '-0.07',
 '-0.23',
 '-0.12',
 '0.16',
 '0.74',
 '0.14',
 '-0.49',
 '-0.32',
 '0.19',
 '0.23',
 '0.24',
 '0.28',
 '1.13',
 '-0.12',
 '0.1',
 '0.66',
 '0.62',
 '0.08',
 '0.62',
 '0.43',
 '0.5',
 '-0.25',
 '-0.51',
 '-0.67',
 '0.21',
 '-0.74',
 '-0.36',
 '-0.01',
 '0.38',
 '0.15',
 '-0.22',
 '-0.09',
 '0.33',
 '0.08',
 '0.39',
 '-0.17',
 '0.23',
 '0.2',
 '0.2',
 '-0.17',
 '-0.69',
 '0.14',
 '-0.27']
<;,Here's how to split on tabs:

In [69]:
data[1].split("\t")[:10]
Out[69]:
['YBR166C',
 'TYR1   TYROSINE BIOSYNTHESIS    PREPHENATE DEHYDROGENASE (NADP+',
 '0.33',
 '-0.17',
 '0.04',
 '-0.07',
 '-0.09',
 '-0.12',
 '-0.03',
 '-0.2']
In [70]:
data[1].split("\t")[2:10]
Out[70]:
['0.33', '-0.17', '0.04', '-0.07', '-0.09', '-0.12', '-0.03', '-0.2']

Coercing strings to floats:

In [72]:
float(data[1].split("\t")[2])
Out[72]:
0.33

Handling missing values with a try..except block:

In [81]:
r = []
for i in data[1].split("\t")[2:]:
    try:
        r.append(float(i))
    except ValueError:
        print i

In [75]:
r = []
for i in data[1].split("\t")[2:]:
    try:
        r.append(float(i))
    except ValueError:
        r.append(None)
In [77]:
print r
[0.33, -0.17, 0.04, -0.07, -0.09, -0.12, -0.03, -0.2, -0.06, -0.06, -0.14, -0.18, -0.06, -0.25, 0.06, -0.12, 0.25, 0.43, 0.21, -0.04, -0.15, -0.04, 0.21, -0.14, -0.03, -0.07, -0.36, -0.14, -0.42, -0.34, -0.23, -0.17, 0.23, 0.3, 0.41, -0.07, -0.23, -0.12, 0.16, 0.74, 0.14, -0.49, -0.32, 0.19, 0.23, 0.24, 0.28, 1.13, -0.12, 0.1, None, 0.66, 0.62, 0.08, 0.62, 0.43, 0.5, -0.25, -0.51, -0.67, 0.21, -0.74, -0.36, -0.01, 0.38, 0.15, -0.22, -0.09, 0.33, 0.08, 0.39, -0.17, 0.23, 0.2, 0.2, -0.17, -0.69, 0.14, -0.27]
In [78]:
r = []
for i in data[1].split("\t")[2:]:
    try:
        r.append(float(i))
    except ValueError:
        r.append(0.)
In [79]:
print r
[0.33, -0.17, 0.04, -0.07, -0.09, -0.12, -0.03, -0.2, -0.06, -0.06, -0.14, -0.18, -0.06, -0.25, 0.06, -0.12, 0.25, 0.43, 0.21, -0.04, -0.15, -0.04, 0.21, -0.14, -0.03, -0.07, -0.36, -0.14, -0.42, -0.34, -0.23, -0.17, 0.23, 0.3, 0.41, -0.07, -0.23, -0.12, 0.16, 0.74, 0.14, -0.49, -0.32, 0.19, 0.23, 0.24, 0.28, 1.13, -0.12, 0.1, 0.0, 0.66, 0.62, 0.08, 0.62, 0.43, 0.5, -0.25, -0.51, -0.67, 0.21, -0.74, -0.36, -0.01, 0.38, 0.15, -0.22, -0.09, 0.33, 0.08, 0.39, -0.17, 0.23, 0.2, 0.2, -0.17, -0.69, 0.14, -0.27]

Oops! Better log my session!

In [84]:
%logstart -o BMS270b.2013.02.log
Activating auto-logging. Current session state plus future input saved.
Filename       : BMS270b.2013.02.log
Mode           : backup
Output logging : True
Raw input log  : False
Timestamping   : False
State          : active