If the last value in an IPython notebook cell is a stand-alone object, IPython will print a text representation of it.
We started class by taking a look at a few common data types.
A string (text):
"Hello, world"
An integer:
5
A floating-point value:
4.2
A string enclosed in double quotes (") can't contain a double quote:
(So, the python interpretter throws an Exception (in this case, a SyntaxError), halts the program, and prints a message showing where things went wrong)
"I say "Hello""
We can get around this problem by enclosing our string in single quotes ('):
'I say "Hello"'
For trickier strings, we can use three double quotes ("""). This also allows us to include line breaks in our string:
"""This is
a longer string
with "asdfa" asdf's
"""
In the cell above, line breaks are indicated by the newline character ("\n").
If we pass the string through the print function, the newlines will be rendered as line breaks:
print("""This is
a longer string
with "asdfa" asdf's
""")
print is a function. It takes zero or more comma-separated objects (called "arguments" or "parameters") enclosed in parenthesis, does something, and (optionally) returns an object. In the case of print, "does something" is displaying a text representation of its parameters, and it does not specify a return value (so it returns the default return value, a special object called None).
Functions like print that "do something" detectable, other than returning a value, are said to have "side effects".
print(5)
In older versions of python, print, unlike other functions, didn't use parenthesis. Either syntax works in Python 2.7 (which we are using), but only the new syntax works in Python 3.
print 5
Some math operators (see today's slides for more):
3+3
3*2
Integer division return an integer (always rounding down):
3/2
If the numerator or denominator is a float, then an unrounded float is returned:
3./2.
We can assign an object to name (called a "variable" or "reference") using a single equals sign (=):
x = 5
x
x/2
We can coerce x to a float using the float function:
float(x)
float(x)/2
y = 3
x+y
Here I'm using a comma as shorthand for making the tuple (x,y), so that I can see both values (tuples are introduced later in this notebook).
x,y
x = x + y
x
x = x + y
x
Shorthand for incrementing x by y. There is a similar update syntax for other operators (see today's slides):
x += y
x
Create a string named dna and use its find method to find the first start codon
A method is a function associated with a particular class of object (in this case, the string class). We can think of it as a way to delegate our work. We are saying "dna, find your start codon for me".
We discussed a few ways to discover available methods:
dna = "GCGGCATGCATGGATAACAGATCTAAGAT"
dna
Find the first instance of "ATG" -- the result is the index of the first character, counting from zero on the left.
dna.find("ATG")
If the string is not found, find returns -1
dna.find("M")
Optionally, we can supply the starting point of the search (again, as an index counting from zero)
dna.find("ATG",2)
dna.find("ATG",5)
So, if we start from the right of the first hit, we can find a second ATG:
dna.find("ATG",6)
Here is the indexing syntax for retrieving a single character or substring (slice). See today's slides for more indexing syntax.
dna[5]
dna[5:5+3]
start = dna.find("ATG",6)
dna[start:start+3]
At this point, we'd like to confirm that we've found all of the start codons. To do so, we need two new tools.
A double equals sign (==) returns True for equal values, False otherwise.
(See today's slides for other logical operators)
x == 3
x
x == 14
A while loop executes the code in the following indented block as long as its condition is True.
(Note that the condition is only checked at the beginning of each iteration through the loop)
start = 0
offset = 0
while(start != -1):
start = dna.find("ATG",offset)
print(start)
offset = start+1
We can also report the reading frame for each start position via the modulo operator (%).
x % y returns the remainder after integer division of x by y
start = 0
offset = 0
while(start != -1):
start = dna.find("ATG",offset)
print(start, start % 3)
offset = start+1
We can supress the final (non-biological) value by using break to exit the loop when find fails.
We use an if statement to detect this case.
start = 0
offset = 0
while(start != -1):
start = dna.find("ATG",offset)
if(start == -1):
break
print(start, start % 3)
offset = start+1
A bit more if syntax: we can add an arbitrary number of elif tests as well as an else to handle any remaining cases. Note that as soon as an if or elif expression evaluates True, none of the other conditions are checked.
start = 0
offset = 0
while(start != -1):
start = dna.find("ATG",offset)
if(start == -1):
break
elif(start == 5):
print("it's 5")
else:
print("nope")
print(start, start % 3)
offset = start+1
We can do arbitrary nesting of any of the block statements (we've see while and if/elif/else so far; we'll see more later).
Here, we see that break exits only the inner-most loop:
a = 3
while(a < 5):
a += 1
b = 2
while(b < 30):
b += 1
if(b > 4):
break
print a,b
One way of doing a "long jump" (breaking out of more than just the inner-most loop) -- using a flag variable that we set to False when we are ready to exit. (A more elegant method is to put the nested loop into a function and use return to exit).
We use three new tricks in the following code:
a = 3
# flag is True while we want to be in our nested loop
# set it to False to exit
flag = True
while((a < 5) and flag):
a += 1
b = 2
while(b < 30):
b += 1
if(b > 4):
flag = False
break
print a,b
An example of adding comments to an IPython notebook using a markdown cell (which you can choose via cell -> cell type from the IPython menu)
This is a markdown cell.
html$\sqrt{5}$
Initialize some constants:
a = 6
m = 13
z = 2
A very simple linear congruent random number generator:
z = (a*z) % m
print(z)
z = (a*z) % m
print(z)
z = (a*z) % m
print(z)
Given the modulus of 13, there are 13 possible values for z (0 to 12) -- does it attain all of them?
We can use the range function to make a list of the possible values (for this example, the important detail is that the list has 13 elements):
range(13)
Some other examples of using the range function (starting at a value other than 0 and using a step other than 1)
range(3,13)
range(3,13,3)
range returns a list, which is a mutable sequence of objects (which may be of mixed types)
x = ["hi", 5, 3.2, "up"]
We can index and slice a list just like a string:
x[0]
x[3]
x[1:3]
Back to our question -- does z attain all 13 possible values?
With range in hand, we can use a for loop. This loop steps i through each of the 13 elements in the list returned by range(13). At each iteration, the indendent block is executed (just like our while loop above). In this particular case, we do not use i directly in the body of the loop -- we're just using it as a counter.
z = 2
for i in range(13):
z = (a*z) % m
print(z)
Instead of simply printing the values of z, we can save them in a list, named seq. One of the reasons that lists are mutable is that we can add new values to the end, using the append method (we'll see later that there are other ways to change the contents of a list).
z = 2
# Initialize seq as an empty list
seq = []
for i in range(13):
seq.append(z)
z = (a*z) % m
print seq
An easy way to check the unique values in seq is to coerce it to a set (a collection of unique, hashable elements)
set(seq)
Since, fortuitously, the values in the set are printed in ascending order, it is easy to see that z takes on all possible values except for 0. (Because our generating function is deterministic, and the value 2 occurs twice, we can infer that 0 will never occur, even with more iterations).
What happens if we use 0 as the seed value?
z = 0
seq = []
for i in range(13):
seq.append(z)
z = (a*z) % m
print seq
set(seq)
Showing that sets contain only unique values:
set((0,0,0,2,2,3,3,3))
An easy way to check the length of most python collections (of the data types we've seen so far, this works for sets, lists, and strings; in general, len works for any object with a __len__ method)
len(seq)
len(set(seq))
set(["hi", 4, 3])
Because a list is mutable, it is not hashable, so we can't use a list as an element of a set.
(One way to think about this is, in order to guarantee that every element in a set is unique, we must be able to ensure that the elements can't change after we've checked for uniqueness)
set(["hi",[1,2,3],5])
If we specify a sequence with parenthesis instead of square brackets, then we get a tuple, the unmutable equivalent of a list, and we can use it as an element in a set:
set(["hi",(1,2,3),5])
We can also deliberately coerce any sequence to a tuple (to be precise: any iterable; i.e., any object that we can iterate over, as in a for or while loop)
tuple(seq)
Likewise, any iterable can be coerced to a set, so we can form a set from a string:
set("ATGCAGATAGAGA")
The set methods contain useful operators from set theory, such as unions and intersections (check the set? documentation for a full list)
s = set((3,5,7))
t = set((5,7,9))
s.union(t)
s.intersection(t)
An example of checking the documentation with the help function:
help(s.intersection)
Experimenting with values of a and m that do not give full range random number generators. Here, z is restricted to a particular group depending on its initial value:
z = 2
a = 6
m = 7
seq = []
for i in range(13):
seq.append(z)
z = (a*z) % m
print seq
z = 3
a = 6
m = 7
seq = []
for i in range(13):
seq.append(z)
z = (a*z) % m
print seq
A really good choice for a and m. C.f. Comm. ACM 31:1192 for a review.
Here, we divide z by the modulus (m), taking care to use floating-point division, to transform our integer sequence into integers between 0 and 1 (can you see why the elements of u are always greater than 0 and less than 1?)
a = 16807
m = 2**31-1
z = 42
seq = []
u = []
for i in range(13):
seq.append(z)
u.append(z/float(m))
z = (a*z) % m
print u
Same thing, generating 1000 elements, but not printing them
a = 16807
m = 2**31-1
z = 42
seq = []
u = []
for i in range(1000):
seq.append(z)
u.append(z/float(m))
z = (a*z) % m
A simple example of importing a module (more on this tomorrow)
import matplotlib
A more complicated example. First we use an IPython "magic" function to set up communication between the IPython notebook and matplotlib. Then, we import just the pyplot submodule and rename it to plt (for convenience, and to follow idiomatic usage of this particular module in the matplotlib community)
%matplotlib inline
import matplotlib.pyplot as plt
Plotting our sequence of 1000 floats from above
plt.plot(u)
Histogram of just the first 20 values -- note that the histogram details are returned as a tuple of several objects
plt.hist(u[:20])
Histogram of all 1000 elements. Here, we hide the returned value by assigning it to h
h = plt.hist(u)
Using an optional, named parameter to specify a non-default number of bins:
h = plt.hist(u, bins=100)
Why is the distribution choppy? Are we simply undersampling for this bin size?
a = 16807
m = 2**31-1
z = 42
seq = []
u = []
for i in range(10000):
seq.append(z)
u.append(z/float(m))
z = (a*z) % m
Yes:
h = plt.hist(u, bins=100)
A parameter-free alternative to a histogram: plotting the sorted values:
plt.plot(sorted(u))
Same thing, plotting a blue circle for every element:
plt.plot(sorted(u),"bo")
Same thing, for fewer elements (so that we can see the circles)
plt.plot(sorted(u[:20]),"bo")
(This example didn't work in class and won't work in a default Canopy installation -- it requires R and the python rpy2 module)
Another histogram alternative: convolving the data with a smoothing kernel to get a kernel density estimate (KDE). In effect, we are still choosing a bin size (the bandwidth of the kernel) but are not specifying where each bin starts (instead, we use all possible starting positions). It is possible to generate KDEs using the modules in Canopy, but we can do it more succinctly in R, with better default values (and I wanted to show off IPython's R magic). Note that the KDE distorts the boundaries of the distribution:
%load_ext rmagic
%%R -i u
print(summary(u))
plot(density(u))
A sequence of 1000 elements, where each element is the average of 100 uniform deviates (generated by our previous method)
a = 16807
m = 2**31-1
z = 42
u = []
for i in range(1000):
s = 0.
for j in range(100):
z = (a*z) % m
s += z/float(m)
u.append(s/100.)
By the central limit theorem, this should give us a normal distribution:
(can you do better than this example by playing with the number of iterations of the i and j loops above?)
h= plt.hist(u)
h= plt.hist(u, bins = 100)
plt.plot(sorted(u))
We can split our sequence into two vectors and look at their correlation in two dimensions:
plt.plot(u[:500],u[500:],"bo")
How does this pair of independent normal distributions differ from a pair of uniform distributions?
a = 16807
m = 2**31-1
z = 42
seq = []
u = []
for i in range(1000):
seq.append(z)
u.append(z/float(m))
z = (a*z) % m
plt.plot(u[:500],u[500:],"ro")
For tonight's homework, we need two more tools
from math import sqrt
sqrt(3)
Here, we define a function to calculate the median of a sequence of floats or ints.
The names in parenthesis on the first line define the parameters to be passed to the function (in this case, a single parameter for the sequence of numbers).
The special return statement tells python to exit the function, returning the given value
def median(x):
S = sorted(x)
# Case 1: an odd number of elements in the list
if(len(x) % 2 == 1):
# Return the middle value (here, integer division is useful for indexing)
return S[len(x)/2]
# Case 2: an even number of elements in the list
else:
# Return the mean of the two middle elements
return (S[len(x)/2]+S[len(x)/2-1])/2.
Testing both branches of the function for inputs with known outputs
median((1,2,3))
median((2,4,5,7))
Note that sorting the list is sufficient, but not necessary, for finding the median. We can find the median more quickly with a partial sort.