Practical Bioinformatics -- Day 6

Using dictionaries for indexing

Load the array data from last week

In [1]:
import cdt
In [2]:
data = cdt.ExpressionProfile("supp2data.cdt")

If we know the row we are interested in, we can do lookups by simple list indexing:

In [3]:
data.geneName[10]
Out[3]:
'YNL272C'
In [4]:
data.geneAnn[10]
Out[4]:
'SEC2   SECRETION        GDP/GTP EXCHANGE FACTOR FOR SEC4P'
In [7]:
data.num[10][:10]
Out[7]:
[0.31, 0.12, 0.34, 0.61, 0.18, 0.28, 0.14, 0.0, -0.07, -0.01]

A python dictionary is a type of associative array.

Define python dictionaries using curly braces enclosing a list of colon-delimited key-value pairs:

In [8]:
d = {"spam":6,3:"onion","four":"five"}
In [9]:
d2 = {("eggs","four"):5}

The number of keys in a dictionary is its length:

In [10]:
len(d)
Out[10]:
3

Indexing a dictionary is like indexing a list, but with keys in place of integer indices:

In [11]:
d["spam"]
Out[11]:
6
In [12]:
d[3]
Out[12]:
'onion'

Trying to look up a non-existant key raises a KeyError

In [14]:
d["eggs"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-1835664bf5b8> in <module>()
----> 1 d["eggs"]

KeyError: 'eggs'

We can also use indexing to assign a key to a new value or to add a new key,value pair:

In [15]:
d["spam"] = 7
In [16]:
d["spam"]
Out[16]:
7
In [17]:
d["eggs"] = "fried"
In [18]:
d["eggs"]
Out[18]:
'fried'

Use del to remove a key,value pair (it is rare that you'll need this)

In [19]:
del d["spam"]
In [20]:
d["spam"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-20-eb605e19a614> in <module>()
----> 1 d["spam"]

KeyError: 'spam'

Application: Index the microarray data by systematic gene name.

Our mapping will be name to row index, so that we can use the same dictionary to do lookups on both annotations (data.geneAnn) and expression profiles (data.num).

In [22]:
gene_index = {}
n = 0
for i in data.geneName:
    gene_index[i] = n
    n += 1
anpre>

GAL4, a transcriptional regulator of galactose catabolism in Saccharomyces, has the systematic gene name "YPL248C".

Use the dictionary to lookup the row index for GAL4:

In [23]:
gene_index["YPL248C"]
Out[23]:
2047

Use the row index to confirm our name-based lookup:

In [24]:
data.geneName[gene_index["YPL248C"]]
Out[24]:
'YPL248C'

Look up the corresponding annotation:

In [25]:
data.geneAnn[gene_index["YPL248C"]]
Out[25]:
'GAL4   GALACTOSE REGULATION     TRANSCRIPTIONAL ACTIVATOR'

And first 10 columns of the corresponding expression profile:

In [26]:
data.num[gene_index["YPL248C"]][:10]
Out[26]:
[-0.06, -0.29, -0.29, 0.01, -0.18, -0.29, -0.15, -0.3, -0.18, 0.03]

Create the dictionary from In [22], using enumerate to simplify the code:

In [27]:
gene_index = {}
for (n,i) in enumerate(data.geneName):
    gene_index[i] = n

Generate a many-to-many mapping of annotation keywords to row indices.

In order to support multiple rows for a single keyword, we will have our dictionary map from strings (keys) to lists of integers (values).

In [71]:
keywords = {}
for (n,i) in enumerate(data.geneAnn):
    # Split annotation on whitespace -> list of words
    for keyword in i.split():
        try:
            # If we've seen this keyword before -> append the current row index to its list
            keywords[keyword].append(n)
        except KeyError:
            # The first time we've seen this keyword -> make a new list for it
            keywords[keyword] = [n]

The keys in a dictionary are unique. Assigning multiple values to a single key results in a single copy of that key mapped to the last value:

In [29]:
d = {"A":3,"A":4,"A":5}
In [30]:
d
Out[30]:
{'A': 5}

The unique key property of dictionaries is very useful for creating sets of unique elements.

Python includes a special data type, set, expressly for this purpose. A set is like a dictionary with empty values (i.e., just keys).

Here, we for a set on each keyword list so that we can iterate over just the unique keywords in each annotation:

In [31]:
keywords = {}
for (n,i) in enumerate(data.geneAnn):
    for keyword in set(i.split span class="p">()):
        try:
            keywords[keyword].append(n)
        except KeyError:
            keywords[keyword] = [n]

Demonstrating the uniqueness of set elements:

In [32]:
s = set("AAAAABBBCCDCCDAD")
In [33]:
s
Out[33]:
{'A', 'C', 'B', 'D'}
In [34]:
s2 = set("AAABBDDDDEEE")
In [35]:
s2
Out[35]:
{'A', 'B', 'E', 'D'}

sets define many useful operations from set theory -- very useful for "Venn diagram"-type calculations (c.f. help(set) for a full list)

In [36]:
s2.difference(s)
Out[36]:
{'E'}
In [37]:
s.difference(s2)
Out[37]:
{'C'}
In [38]:
s.intersection(s2)
Out[38]:
{'A', 'B', 'D'}

Number of unique keywords in the annotation column:

In [39]:
len(keywords)
Out[39]:
4312

Number of genes with "GALACTOSE" in their annotation:

In [40]:
len(keywords["GALACTOSE"])
Out[40]:
4

Information for all four of those genes:

In [41]:
for i in keywords["GALACTOSE"]:
    print data.geneName[i], data.geneAnn[i], data.num[i][:10]
YML051W GAL80  GALACTOSE REGULATION     TRANSCRIPTIONAL REPRESSOR [0.0, -0.17, -0.22, 0.06, -0.1, 0.0, 0.08, -0.14, -0.1, -0.36]
YJL168C SET2   GALACTOSE REGULATION     TRANSCRIPTIONAL REPRESSOR OF GAL4 [-0.34, -0.03, -0.07, -0.1, -0.54, 0.1, -0.2, -0.17, -0.1, 0.46]
YPL248C GAL4   GALACTOSE REGULATION     TRANSCRIPTIONAL ACTIVATOR [-0.06, -0.29, -0.29, 0.01, -0.18, -0.29, -0.15, -0.3, -0.18, 0.03]
YLR081W GAL2   TRANSPORT        GLUCOSE AND GALACTOSE PERMEASE [-0.03, -0.03, 0.03, -0.27, 0.07, -0.27, -0.07, -0.38, -0.4, -0.4]

Some useful dictionary methods

keys(): a list of the dictionary's keys in undefined but deterministic order:

In [42]:
keywords.keys()[:10]
Out[42]:
['FUSION;',
 'L42B',
 'HDF1',
 'SILENCED',
 'L22A',
 'TELOMERE',
 'PRP6',
 'PRP4',
 'L42A',
 'PRP2']

values(): the values, in the same order as keys()

In [43]:
keywords.values()[:10]
Out[43]:
[[151, 178],
 [1595],
 [1052],
 [2119, 2213],
 [1599],
 [78, 96, 150, 299, 1011, 1969, 2024],
 [1405],
 [414],
 [1603],
 [1952]]

items(): (key,value) pairs, in the same order as keys() -- this is usually what you want when writing a for loop over a dictionary:

In [44]:
keywords.items()[:10]
Out[44]:
[('FUSION;', [151, 178]),
 ('L42B', [1595]),
 ('HDF1', [1052]),
 ('SILENCED', [2119, 2213]),
 ('L22A', [1599]),
 ('TELOMERE', [78, 96, 150, 299, 1011, 1969, 2024]),
 ('PRP6', [1405]),
 ('PRP4', [414]),
 ('L42A', [1603]),
 ('PRP2', [1952])]

Using dictionaries to transform nucleotide sequences

A dictionary mapping the Watson-Crick base-pairing rules:

In [45]:
nuc = {"A":"T","T":"A","G":"C","C":"G","N":"N"}

A test sequence to play with:

In [46]:
seq = "GATACA"*20
In [47]:
print seq
GATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACA

Exercise: Given a $5'\rightarrow3'$ sequence, return its $3'\rightarrow5'$ antisense

First try: returns the correct sequence, but as a list:

In [48]:
anti = []
for i in seq:
    anti.append(nuc[i])
print anti
['C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T', 'C', 'T', 'A', 'T', 'G', 'T']

We can use join to put the list together into a string (we call join from an empty string in order to join the elements with no separator)

In [51]:
anti = []
for i in seq:
    anti.append(nuc[i])
print "".join(anti)
CTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGT

Alternatively, we can append to a string directly, rather than using a list:

In [50]:
anti = ""
for i in seq:
    anti += nuc[i]
print anti
CTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGT

Here's the second solution as a function:

In [52]:
def antisense(seq):
    anti = ""
    for i in seq:
        anti += nuc[i]
    return anti

Printing the input and antisense sequences on adjacent lines is a good way to check the output. We'll use this trick a lot when working with sequence alignments (note that IPython, most terminals, and most programmer-oriented text editors use a fixed-width font, to aid in this sort of comparison).

In [53]:
print seq
print antisense(seq)
GATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACAGATACA
CTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGTCTATGT

Exercise: Given a $5'\rightarrow3'$ sequence, return its $5'\rightarrow3'$ reverse complement

We came up with a lot of ways to walk backwards over the sequence:

Via explicit indexing math:

In [55]:
anti = ""
for i in xrange(len(seq)):
    anti += nuc[seq[len(seq)-i-1]]
print anti
TGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATC

By including start, stop, and step parameters when calling xrange:

In [56]:
anti = ""
for i in xrange(len(seq)-1,-1,-1):
    anti += nuc[seq[i]]
print anti
TGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATC

By using the reversed generator:

In [58]:
anti = ""
for i in reversed(seq):
    anti += nuc[i]
print anti
TGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATC

By using special splice syntax directly on the string (this works for lists as well):

In [60]:
anti = ""
for i in seq[::-1]:
    anti += nuc[i]
print anti
TGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATC

Alternatively, we could walk forward over the sequence but prepend the new characters to the output string:

In [61]:
anti = ""
for i in seq:
    anti = nuc[i] + anti
print anti
TGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATC

All of the above approaches are reasonable solutions.

Here, I choose In [60] because:

In [62]:
def revcomp(seq):
    anti = ""
    for i in seq[::-1]:
        anti += nuc[i]
    return anti
In [63]:
revcomp(seq)
Out[63]:
'TGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATCTGTATC'

Our dictionary isn't defined for lower case sequences:

In [64]:
revcomp("atgc")
< van class="ansired">---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-64-9fc2deb7f315> in <module>()
----> 1 revcomp("atgc")

<ipython-input-62-c7690cdbafe7> in revcomp(seq)
      2     anti = ""
      3     for i in seq[::-1]:
----> 4         anti += nuc[i]
      5     return anti

KeyError: 'c'

Rather than augmenting the dictionary, it is better to normalize all inputs to upper-case (this way, we only have to deal with the ambiguity at one place in our program):

In [65]:
revcomp("atgc".upper())
Out[65]:
'GCAT'
In [66]:
s = "atgc"
s = s.upper()
revcomp(s)
Out[66]:
'GCAT'

Generating a simple scoring matrix:

In [67]:
scores = {}
for i in "ATGC":
    scores[i] = {}
    for j in "ATGC":
        if(i == j):
            scores[i][j] = 1
        else:
            scores[i][j] = -1
In [68]:
scores
Out[68]:
{'A': {'A': 1, 'C': -1, 'G': -1, 'T': -1},
 'C': {'A': -1, 'C': 1, 'G': -1, 'T': -1},
 'G': {'A': -1, 'C': -1, 'G': 1, 'T': -1},
 'T': {'A': -1, 'C': -1, 'G': -1, 'T': 1}}

Some example sequence pairs for the homework problems:

In [69]:
s1 = "ATCGAAA"
s2 = "ATGCAAA"
In [70]:
s3 = "AT-GCAA-"
s4 = "ATCGCAAA"
In [72]:
%logstart -o BMS270b.2013.06.log
Activating auto-logging. Current session state plus future input saved.
Filename       : BMS270b.2013.06.log
Mode           : backup
Output logging : True
Raw input log  : False
Timestamping   : False
State          : active