Load the array data from last week
import cdt
data = cdt.ExpressionProfile("supp2data.cdt")
If we know the row we are interested in, we can do lookups by simple list indexing:
data.geneName[10]
data.geneAnn[10]
data.num[10][:10]
A python dictionary is a type of associative array.
Define python dictionaries using curly braces enclosing a list of colon-delimited key-value pairs:
d = {"spam":6,3:"onion","four":"five"}
d2 = {("eggs","four"):5}
The number of keys in a dictionary is its length:
len(d)
Indexing a dictionary is like indexing a list, but with keys in place of integer indices:
d["spam"]
d[3]
Trying to look up a non-existant key raises a KeyError
d["eggs"]
We can also use indexing to assign a key to a new value or to add a new key,value pair:
d["spam"] = 7
d["spam"]
d["eggs"] = "fried"
d["eggs"]
Use del to remove a key,value pair (it is rare that you'll need this)
del d["spam"]
d["spam"]
Application: Index the microarray data by systematic gene name.
Our mapping will be name to row index, so that we can use the same dictionary to do lookups on both annotations (data.geneAnn) and expression profiles (data.num).
gene_index = {}
n = 0
for i in data.geneName:
gene_index[i] = n
n += 1
anpre>
GAL4, a transcriptional regulator of galactose catabolism in Saccharomyces, has the systematic gene name "YPL248C".
Use the dictionary to lookup the row index for GAL4:
gene_index["YPL248C"]
Use the row index to confirm our name-based lookup:
data.geneName[gene_index["YPL248C"]]
Look up the corresponding annotation:
data.geneAnn[gene_index["YPL248C"]]
And first 10 columns of the corresponding expression profile:
data.num[gene_index["YPL248C"]][:10]
Create the dictionary from In [22], using enumerate to simplify the code:
gene_index = {}
for (n,i) in enumerate(data.geneName):
gene_index[i] = n
Generate a many-to-many mapping of annotation keywords to row indices.
In order to support multiple rows for a single keyword, we will have our dictionary map from strings (keys) to lists of integers (values).
keywords = {}
for (n,i) in enumerate(data.geneAnn):
# Split annotation on whitespace -> list of words
for keyword in i.split():
try:
# If we've seen this keyword before -> append the current row index to its list
keywords[keyword].append(n)
except KeyError:
# The first time we've seen this keyword -> make a new list for it
keywords[keyword] = [n]
The keys in a dictionary are unique. Assigning multiple values to a single key results in a single copy of that key mapped to the last value:
d = {"A":3,"A":4,"A":5}
d
The unique key property of dictionaries is very useful for creating sets of unique elements.
Python includes a special data type, set, expressly for this purpose. A set is like a dictionary with empty values (i.e., just keys).
Here, we for a set on each keyword list so that we can iterate over just the unique keywords in each annotation:
keywords = {}
for (n,i) in enumerate(data.geneAnn):
for keyword in set(i.split span class="p">()):
try:
keywords[keyword].append(n)
except KeyError:
keywords[keyword] = [n]
Demonstrating the uniqueness of set elements:
s = set("AAAAABBBCCDCCDAD")
s
s2 = set("AAABBDDDDEEE")
s2
sets define many useful operations from set theory -- very useful for "Venn diagram"-type calculations (c.f. help(set) for a full list)
s2.difference(s)
s.difference(s2)
s.intersection(s2)
Number of unique keywords in the annotation column:
len(keywords)
Number of genes with "GALACTOSE" in their annotation:
len(keywords["GALACTOSE"])
Information for all four of those genes:
for i in keywords["GALACTOSE"]:
print data.geneName[i], data.geneAnn[i], data.num[i][:10]
keys(): a list of the dictionary's keys in undefined but deterministic order:
keywords.keys()[:10]
values(): the values, in the same order as keys()
keywords.values()[:10]
items(): (key,value) pairs, in the same order as keys() -- this is usually what you want when writing a for loop over a dictionary:
keywords.items()[:10]
A dictionary mapping the Watson-Crick base-pairing rules:
nuc = {"A":"T","T":"A","G":"C","C":"G","N":"N"}
A test sequence to play with:
seq = "GATACA"*20
print seq
First try: returns the correct sequence, but as a list:
anti = []
for i in seq:
anti.append(nuc[i])
print anti
We can use join to put the list together into a string (we call join from an empty string in order to join the elements with no separator)
anti = []
for i in seq:
anti.append(nuc[i])
print "".join(anti)
Alternatively, we can append to a string directly, rather than using a list:
anti = ""
for i in seq:
anti += nuc[i]
print anti
Here's the second solution as a function:
def antisense(seq):
anti = ""
for i in seq:
anti += nuc[i]
return anti
Printing the input and antisense sequences on adjacent lines is a good way to check the output. We'll use this trick a lot when working with sequence alignments (note that IPython, most terminals, and most programmer-oriented text editors use a fixed-width font, to aid in this sort of comparison).
print seq
print antisense(seq)
We came up with a lot of ways to walk backwards over the sequence:
Via explicit indexing math:
anti = ""
for i in xrange(len(seq)):
anti += nuc[seq[len(seq)-i-1]]
print anti
By including start, stop, and step parameters when calling xrange:
anti = ""
for i in xrange(len(seq)-1,-1,-1):
anti += nuc[seq[i]]
print anti
By using the reversed generator:
anti = ""
for i in reversed(seq):
anti += nuc[i]
print anti
By using special splice syntax directly on the string (this works for lists as well):
anti = ""
for i in seq[::-1]:
anti += nuc[i]
print anti
Alternatively, we could walk forward over the sequence but prepend the new characters to the output string:
anti = ""
for i in seq:
anti = nuc[i] + anti
print anti
All of the above approaches are reasonable solutions.
Here, I choose In [60] because:
def revcomp(seq):
anti = ""
for i in seq[::-1]:
anti += nuc[i]
return anti
revcomp(seq)
Our dictionary isn't defined for lower case sequences:
revcomp("atgc")
Rather than augmenting the dictionary, it is better to normalize all inputs to upper-case (this way, we only have to deal with the ambiguity at one place in our program):
revcomp("atgc".upper())
s = "atgc"
s = s.upper()
revcomp(s)
Generating a simple scoring matrix:
scores = {}
for i in "ATGC":
scores[i] = {}
for j in "ATGC":
if(i == j):
scores[i][j] = 1
else:
scores[i][j] = -1
scores
Some example sequence pairs for the homework problems:
s1 = "ATCGAAA"
s2 = "ATGCAAA"
s3 = "AT-GCAA-"
s4 = "ATCGCAAA"
%logstart -o BMS270b.2013.06.log