Python Basics

Basic data types ("nouns")

If the last value in an IPython notebook cell is a stand-alone object, IPython will print a text representation of it.

We started class by taking a look at a few common data types.

A string (text):

"Hello, world"

'Hello, world'

An integer:

5

5

A floating-point value:

4.2

4.2

A string enclosed in double quotes (") can't contain a double quote:

(So, the python interpretter throws an Exception (in this case, a SyntaxError), halts the program, and prints a message showing where things went wrong)

"I say "Hello""

  File "<ipython-input-4-60f4981fcfd7>", line 1
    "I say "Hello""
                ^
SyntaxError: invalid syntax

We can get around this problem by enclosing our string in single quotes ('):

'I say "Hello"'

'I say "Hello"'

For trickier strings, we can use three double quotes ("""). This also allows us to include line breaks in our string:

"""This is
a longer string
with "asdfa" asdf's
"""

'This is\na longer string\nwith "asdfa" asdf\'s\n'

In the cell above, line breaks are indicated by the newline character ("\n").

If we pass the string through the print function, the newlines will be rendered as line breaks:

Functions

print("""This is
a longer string
with "asdfa" asdf's
""")

This is
a longer string
with "asdfa" asdf's

print is a function. It takes zero or more comma-separated objects (called "arguments" or "parameters") enclosed in parenthesis, does something, and (optionally) returns an object. In the case of print, "does something" is displaying a text representation of its parameters, and it does not specify a return value (so it returns the default return value, a special object called None).

Functions like print that "do something" detectable, other than returning a value, are said to have "side effects".

print(5)

5

In older versions of python, print, unlike other functions, didn't use parenthesis. Either syntax works in Python 2.7 (which we are using), but only the new syntax works in Python 3.

print 5

5

Operators

Some math operators (see today's slides for more):

3+3

6

3*2

6

Integer division return an integer (always rounding down):

3/2

1

If the numerator or denominator is a float, then an unrounded float is returned:

3./2.

1.5

We can assign an object to name (called a "variable" or "reference") using a single equals sign (=):

x = 5

x

5

x/2

2

We can coerce x to a float using the float function:

float(x)

5.0

float(x)/2

2.5

y = 3

x+y

8

Here I'm using a comma as shorthand for making the tuple (x,y), so that I can see both values (tuples are introduced later in this notebook).

x,y

(5, 3)

x = x + y

x

8

x = x + y
x

11

Shorthand for incrementing x by y. There is a similar update syntax for other operators (see today's slides):

x += y

x

14

DNA example

Create a string named dna and use its find method to find the first start codon

A method is a function associated with a particular class of object (in this case, the string class). We can think of it as a way to delegate our work. We are saying "dna, find your start codon for me".

We discussed a few ways to discover available methods:

help(dna) prints the documentation for dna's class (in this case, string)
dna? shows the documentation in an IPython pager
Pressing shift-tab after typing dna. pops up a menu of available completions (which includes the class methods).

dna = "GCGGCATGCATGGATAACAGATCTAAGAT"

dna

'GCGGCATGCATGGATAACAGATCTAAGAT'

Find the first instance of "ATG" -- the result is the index of the first character, counting from zero on the left.

dna.find("ATG")

5

If the string is not found, find returns -1

dna.find("M")

-1

Optionally, we can supply the starting point of the search (again, as an index counting from zero)

dna.find("ATG",2)

5

dna.find("ATG",5)

5

So, if we start from the right of the first hit, we can find a second ATG:

dna.find("ATG",6)

9

Here is the indexing syntax for retrieving a single character or substring (slice). See today's slides for more indexing syntax.

dna[5]

'A'

dna[5:5+3]

'ATG'

start = dna.find("ATG",6)
dna[start:start+3]

'ATG'

At this point, we'd like to confirm that we've found all of the start codons. To do so, we need two new tools.

First tool: conditional expressions.

A double equals sign (==) returns True for equal values, False otherwise.

(See today's slides for other logical operators)

x == 3

False

x

14

x == 14

True

Second tool: while loops

A while loop executes the code in the following indented block as long as its condition is True.

(Note that the condition is only checked at the beginning of each iteration through the loop)

start = 0
offset = 0
while(start != -1):
    start = dna.find("ATG",offset)
    print(start)
    offset = start+1

5
9
-1

We can also report the reading frame for each start position via the modulo operator (%).

x % y returns the remainder after integer division of x by y

start = 0
offset = 0
while(start != -1):
    start = dna.find("ATG",offset)
    print(start, start % 3)
    offset = start+1

(5, 2)
(9, 0)
(-1, 2)

We can supress the final (non-biological) value by using break to exit the loop when find fails.

We use an if statement to detect this case.

start = 0
offset = 0
while(start != -1):
    start = dna.find("ATG",offset)
    if(start == -1):
        break
    print(start, start % 3)
    offset = start+1

(5, 2)
(9, 0)

A bit more if syntax: we can add an arbitrary number of elif tests as well as an else to handle any remaining cases. Note that as soon as an if or elif expression evaluates True, none of the other conditions are checked.

start = 0
offset = 0
while(start != -1):
    start = dna.find("ATG",offset)
    if(start == -1):
        break
    elif(start == 5):
        print("it's 5")
    else:
        print("nope")
    print(start, start % 3)
    offset = start+1

it's 5
(5, 2)
nope
(9, 0)

We can do arbitrary nesting of any of the block statements (we've see while and if/elif/else so far; we'll see more later).

Here, we see that break exits only the inner-most loop:

a = 3
while(a < 5):
    a += 1
    b = 2
    while(b < 30):
        b += 1
        if(b > 4):
            break
        print a,b

4 3
4 4
5 3
5 4

One way of doing a "long jump" (breaking out of more than just the inner-most loop) -- using a flag variable that we set to False when we are ready to exit. (A more elegant method is to put the nested loop into a function and use return to exit).

We use three new tricks in the following code:

Setting a variable to a boolean value (in this case, True. False works).
Combining two expressions with the logical operator and (or, xor, and not also work)
- I didn't mention this in class, but python logical operators are evaluated left to right, and evaluation stops as soon as the value of the full expression can be determined. This is relevant if parts of the expression have side effects.
Adding a comment to our code (which is ignored by the Python interpretter) using the pound sign (#)

a = 3
# flag is True while we want to be in our nested loop
# set it to False to exit
flag = True
while((a < 5) and flag):
    a += 1
    b = 2
    while(b < 30):
        b += 1
        if(b > 4):
            flag = False
            break
        print a,b

4 3
4 4

An example of adding comments to an IPython notebook using a markdown cell (which you can choose via cell -> cell type from the IPython menu)

This is a markdown cell.

html

$\sqrt{5}$

Pseudo-random number generator example

Here we explore a neat trick with the modulus operator (from the reading frame example above)

Initialize some constants:

a = 6
m = 13
z = 2

A very simple linear congruent random number generator:

z = (a*z) % m
print(z)

12

z = (a*z) % m
print(z)

7

z = (a*z) % m
print(z)

3

Given the modulus of 13, there are 13 possible values for z (0 to 12) -- does it attain all of them?

We can use the range function to make a list of the possible values (for this example, the important detail is that the list has 13 elements):

range(13)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

Some other examples of using the range function (starting at a value other than 0 and using a step other than 1)

range(3,13)

[3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

range(3,13,3)

[3, 6, 9, 12]

range returns a list, which is a mutable sequence of objects (which may be of mixed types)

x = ["hi", 5, 3.2, "up"]

We can index and slice a list just like a string:

x[0]

'hi'

x[3]

'up'

x[1:3]

[5, 3.2]

Back to our question -- does z attain all 13 possible values?

With range in hand, we can use a for loop. This loop steps i through each of the 13 elements in the list returned by range(13). At each iteration, the indendent block is executed (just like our while loop above). In this particular case, we do not use i directly in the body of the loop -- we're just using it as a counter.

z = 2
for i in range(13):
    z = (a*z) % m
    print(z)

12
7
3
5
4
11
1
6
10
8
9
2
12

Instead of simply printing the values of z, we can save them in a list, named seq. One of the reasons that lists are mutable is that we can add new values to the end, using the append method (we'll see later that there are other ways to change the contents of a list).

z = 2
# Initialize seq as an empty list
seq = []
for i in range(13):
    seq.append(z)
    z = (a*z) % m
print seq

[2, 12, 7, 3, 5, 4, 11, 1, 6, 10, 8, 9, 2]

An easy way to check the unique values in seq is to coerce it to a set (a collection of unique, hashable elements)

set(seq)

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

Since, fortuitously, the values in the set are printed in ascending order, it is easy to see that z takes on all possible values except for 0. (Because our generating function is deterministic, and the value 2 occurs twice, we can infer that 0 will never occur, even with more iterations).

What happens if we use 0 as the seed value?

z = 0
seq = []
for i in range(13):
    seq.append(z)
    z = (a*z) % m
print seq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

set(seq)

{0}

Showing that sets contain only unique values:

set((0,0,0,2,2,3,3,3))

{0, 2, 3}

An easy way to check the length of most python collections (of the data types we've seen so far, this works for sets, lists, and strings; in general, len works for any object with a __len__ method)

len(seq)

13

len(set(seq))

1

set(["hi", 4, 3])

{3, 4, 'hi'}

Because a list is mutable, it is not hashable, so we can't use a list as an element of a set.

(One way to think about this is, in order to guarantee that every element in a set is unique, we must be able to ensure that the elements can't change after we've checked for uniqueness)

set(["hi",[1,2,3],5])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-66-0be6c630b2be> in <module>()
----> 1 set(["hi",[1,2,3],5])

TypeError: unhashable type: 'list'

If we specify a sequence with parenthesis instead of square brackets, then we get a tuple, the unmutable equivalent of a list, and we can use it as an element in a set:

set(["hi",(1,2,3),5])

{5, 'hi', (1, 2, 3)}

We can also deliberately coerce any sequence to a tuple (to be precise: any iterable; i.e., any object that we can iterate over, as in a for or while loop)

tuple(seq)

(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Likewise, any iterable can be coerced to a set, so we can form a set from a string:

set("ATGCAGATAGAGA")

{'A', 'C', 'G', 'T'}

The set methods contain useful operators from set theory, such as unions and intersections (check the set? documentation for a full list)

s = set((3,5,7))
t = set((5,7,9))

s.union(t)

{3, 5, 7, 9}

s.intersection(t)

{5, 7}

An example of checking the documentation with the help function:

help(s.intersection)

Help on built-in function intersection:

intersection(...)
    Return the intersection of two or more sets as a new set.
    
    (i.e. elements that are common to all of the sets.)

Experimenting with values of a and m that do not give full range random number generators. Here, z is restricted to a particular group depending on its initial value:

z = 2
a = 6
m = 7
seq = []
for i in range(13):
    seq.append(z)
    z = (a*z) % m
print seq

[2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2]

z = 3
a = 6
m = 7
seq = []
for i in range(13):
    seq.append(z)
    z = (a*z) % m
print seq

[3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3]

A really good choice for a and m. C.f. Comm. ACM 31:1192 for a review.

Here, we divide z by the modulus (m), taking care to use floating-point division, to transform our integer sequence into integers between 0 and 1 (can you see why the elements of u are always greater than 0 and less than 1?)

a = 16807
m = 2**31-1
z = 42
seq = []
u = []
for i in range(13):
    seq.append(z)
    u.append(z/float(m))
    z = (a*z) % m
print u

[1.9557774076032347e-08, 0.00032870750889587566, 0.5245871020129822, 0.7354235321913956, 0.26330554078487006, 0.37622397131110724, 0.19628582577979464, 0.9758738810084173, 0.512318108469396, 0.5304490451377114, 0.2571016295147602, 0.10708725457409735, 0.8154876268540917]

Same thing, generating 1000 elements, but not printing them

a = 16807
m = 2**31-1
z = 42
seq = []
u = []
for i in range(1000):
    seq.append(z)
    u.append(z/float(m))
    z = (a*z) % m

A simple example of importing a module (more on this tomorrow)

import matplotlib

A more complicated example. First we use an IPython "magic" function to set up communication between the IPython notebook and matplotlib. Then, we import just the pyplot submodule and rename it to plt (for convenience, and to follow idiomatic usage of this particular module in the matplotlib community)

%matplotlib inline
import matplotlib.pyplot as plt

Plotting our sequence of 1000 floats from above

plt.plot(u)

[<matplotlib.lines.Line2D at 0x7f040b01c410>]

Histogram of just the first 20 values -- note that the histogram details are returned as a tuple of several objects

plt.hist(u[:20])

(array([ 2.,  3.,  5.,  2.,  1.,  3.,  0.,  1.,  1.,  2.]),
 array([  1.95577741e-08,   9.75874057e-02,   1.95174792e-01,
          2.92762178e-01,   3.90349564e-01,   4.87936950e-01,
          5.85524336e-01,   6.83111723e-01,   7.80699109e-01,
          8.78286495e-01,   9.75873881e-01]),
 <a list of 10 Patch objects>)

Histogram of all 1000 elements. Here, we hide the returned value by assigning it to h

h = plt.hist(u)

Using an optional, named parameter to specify a non-default number of bins:

h = plt.hist(u, bins=100)

Why is the distribution choppy? Are we simply undersampling for this bin size?

a = 16807
m = 2**31-1
z = 42
seq = []
u = []
for i in range(10000):
    seq.append(z)
    u.append(z/float(m))
    z = (a*z) % m

Yes:

h = plt.hist(u, bins=100)

A parameter-free alternative to a histogram: plotting the sorted values:

plt.plot(sorted(u))

[<matplotlib.lines.Line2D at 0x7f040ab92f10>]

Same thing, plotting a blue circle for every element:

plt.plot(sorted(u),"bo")

[<matplotlib.lines.Line2D at 0x7f040aae2410>]

Same thing, for fewer elements (so that we can see the circles)

plt.plot(sorted(u[:20]),"bo")

[<matplotlib.lines.Line2D at 0x7f040ab45b90>]

(This example didn't work in class and won't work in a default Canopy installation -- it requires R and the python rpy2 module)

Another histogram alternative: convolving the data with a smoothing kernel to get a kernel density estimate (KDE). In effect, we are still choosing a bin size (the bandwidth of the kernel) but are not specifying where each bin starts (instead, we use all possible starting positions). It is possible to generate KDEs using the modules in Canopy, but we can do it more succinctly in R, with better default values (and I wanted to show off IPython's R magic). Note that the KDE distorts the boundaries of the distribution:

%load_ext rmagic

%%R -i u
print(summary(u))
plot(density(u))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.2449  0.4927  0.4960  0.7472  1.0000

A sequence of 1000 elements, where each element is the average of 100 uniform deviates (generated by our previous method)

a = 16807
m = 2**31-1
z = 42
u = []
for i in range(1000):
    s = 0.
    for j in range(100):
        z = (a*z) % m
        s += z/float(m)
    u.append(s/100.)

By the central limit theorem, this should give us a normal distribution:

(can you do better than this example by playing with the number of iterations of the i and j loops above?)

h= plt.hist(u)

h= plt.hist(u, bins = 100)

plt.plot(sorted(u))

[<matplotlib.lines.Line2D at 0x7f040ac9a850>]

We can split our sequence into two vectors and look at their correlation in two dimensions:

plt.plot(u[:500],u[500:],"bo")

[<matplotlib.lines.Line2D at 0x7f03ff7a32d0>]

How does this pair of independent normal distributions differ from a pair of uniform distributions?

a = 16807
m = 2**31-1
z = 42
seq = []
u = []
for i in range(1000):
    seq.append(z)
    u.append(z/float(m))
    z = (a*z) % m

plt.plot(u[:500],u[500:],"ro")

[<matplotlib.lines.Line2D at 0x7f03ff7401d0>]

For tonight's homework, we need two more tools

Tool 1: the square root function from the math module

from math import sqrt

sqrt(3)

1.7320508075688772

Tool 2: User defined functions

Here, we define a function to calculate the median of a sequence of floats or ints.

The names in parenthesis on the first line define the parameters to be passed to the function (in this case, a single parameter for the sequence of numbers).

The special return statement tells python to exit the function, returning the given value

def median(x):
    S = sorted(x)
    # Case 1: an odd number of elements in the list
    if(len(x) % 2 == 1):
        # Return the middle value (here, integer division is useful for indexing)
        return S[len(x)/2]
    # Case 2: an even number of elements in the list
    else:
        # Return the mean of the two middle elements
        return (S[len(x)/2]+S[len(x)/2-1])/2.

Testing both branches of the function for inputs with known outputs

median((1,2,3))

2

median((2,4,5,7))

4.5

Note that sorting the list is sufficient, but not necessary, for finding the median. We can find the median more quickly with a partial sort.