This page contains slides and supporting material for the Spring 2011 version of BMS270: Practical Bioinformatics with Programming. Please see the site updates page for information on tracking changes.

General Resources

Python
Homepage for the Python programming language (download, documentation, etc.)
Learning Python
Excellent introduction to Python. The 4th edition covers Python3 (for biology, it's probably best to stick with Python2 until modules like numpy complete the transition to Python3 -- expect this to happen in the near future). (Older edition of Learning Python)
Safari Bookshelf
The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include: Learning Python, Programming Python, and BLAST.
Dive Into Python
Programmer-oriented introduction to Python. Faster paced than Learning Python, but the entire book can be downloaded as a free PDF file. (Note that as of October 2011, this book is no longer hosted at its original website.)

Day 1: Python

Slides
Mark's slides for day 1 (fixed typo for "Learning Python" in the homework section -- thanks Marie)
Transcript
Mark's Python session for day 1

Day 2: File Formats

Slides
Mark's slides for day 2 (fixed typo in stdev calculation -- thanks to whoever caught this in class)
supp2data.png
Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) rendered as a heatmap
supp2data.csv
Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) converted to comma-separated value (CSV) text format.
Transcript
Mark's Python session for day 2 (concatenates my IDLE and shell Python sessions; for the annotated version of my stats.py module see below).
stats1.py
Annotated solutions for day 1 homework
Exception class documentation
The Python documentation on the different classes of exceptions that can be raised (the hierarchy is given at the bottom of the documentation). Also links to documentation on defining new exception classes. Note that StopIteration is the exception raised by iterators (including generators) to signal that they have no more items, which is how a for loop knows when to stop.
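As a minimal sketch of the StopIteration note above (not taken from the class transcript), the following shows roughly what a for loop does behind the scenes. The names countdown and manual_for are hypothetical, chosen only for this illustration.

```python
def countdown(n):
    """Generator that yields n, n-1, ..., 1 and then finishes."""
    while n > 0:
        yield n
        n -= 1


def manual_for(iterable):
    """Collect items by calling next() directly, as a for loop does."""
    iterator = iter(iterable)
    items = []
    while True:
        try:
            items.append(next(iterator))
        except StopIteration:
            # The iterator is exhausted; a for loop would end here.
            break
    return items
```

Calling manual_for(countdown(3)) should give the same items as list(countdown(3)); the only difference is that the StopIteration handling is written out explicitly.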

Day 3: Distance Metrics

PNAS 95:14863 (Eisen et al.)
Paper introducing cluster analysis for microarrays.
Slides
Mark's slides for day 3
Transcript
Annotated version of Mark's Python session for day 3
PairCorrelation.py
Commented version of the module that we wrote in class, with some additional variations.
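For readers who want a starting point before opening PairCorrelation.py, here is a minimal sketch of one common correlation-based distance, assuming the Pearson correlation coefficient is the metric of interest. It is not the in-class module; the function names pearson and correlation_distance are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


def correlation_distance(x, y):
    """Turn a correlation (1 = identical shape) into a distance (0 = identical)."""
    return 1.0 - pearson(x, y)
```

With this convention, perfectly correlated profiles are at distance 0 and perfectly anti-correlated profiles are at distance 2.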

Day 4: Distance Matrices

Slides
Mark's slides for day 4
Transcript
Mark's Python session for day 4
Cluster3
Michael de Hoon's port of Michael Eisen's CLUSTER program.
cluster.c
Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code and license details).
JavaTreeView
Alok Saldanha's port of Michael Eisen's TREEVIEW program.
MeV
Cluster/TreeView alternative from JCVI (TIGR), similar to Acuity.

Day 5: Clustering

Slides
Mark's slides for day 5
SimpleCdt.py
Simple write function for generating cross-platform Excel/Cluster3/JavaTreeView compatible CDT files.
test1.cdt
Example output from SimpleCdt.writecdt, so that we can confirm that the code actually does work across platforms. (Note that you may need to rename this file "test1.txt" in order for it to be recognized by Excel)
fig2clusterdata.txt
Data for figure 1 of the PNAS paper, taken from the primary reference: Science 283:83.

Day 6: Sequences

Slides
Mark's slides for day 6
distmatrix.py
The distance matrix script that we wrote in class today. Note that I've added a few lines so that it can be run from the command line. E.g., on OS X or Linux:
chmod "a+x" distmatrix.py
time ./distmatrix.py
script.py
Example command-line script
BatchCluster.py
Larger example -- run cluster with all distance metrics and all linkage methods (Warning: as is, this script will take a long time to run -- try commenting out some of the methods the first time you use it)
getwhiteboard.py
Larger example -- insert images from Mark's camera into his presentation (Linux-specific, but see if you can turn it into something useful on your computer)
git
Distributed version control system. Very fast, but optimized for Linux.
mercurial
Distributed version control system. Very similar to git, but better cross-platform support. Also, it's written in Python =)
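The "few lines" that let a module such as distmatrix.py run from the command line typically look like the sketch below. This is an illustration under assumptions, not the in-class script: the functions distance_matrix, euclidean, and main, and the toy data, are hypothetical.

```python
#!/usr/bin/env python
"""Toy distance-matrix module that can also be run as a script."""


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def distance_matrix(rows, dist):
    """Return a symmetric list-of-lists matrix of pairwise distances."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(rows[i], rows[j])
            matrix[i][j] = d
            matrix[j][i] = d  # distances are symmetric
    return matrix


def main():
    """Print the distance matrix for a small hard-coded data set."""
    rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
    for row in distance_matrix(rows, euclidean):
        print("\t".join("%.2f" % d for d in row))


# This guard is true only when the file is executed directly
# (e.g. ./distmatrix.py), not when it is imported as a module.
if __name__ == "__main__":
    main()
```

The first line tells OS X or Linux which interpreter to use once the file has been made executable with chmod, and the final guard keeps the module importable without side effects.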

Day 7: Alignment

Slides
Mark's slides for day 7
Transcript
Mark's Python session for day 7.
geneticCode.py
Python translation table
The BLAST book
Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols for common BLAST tasks in chapter 9. Note: In order to get full access to the book (i.e., to avoid preview mode) you must first log in to the Safari Bookshelf through the UCSF library site.
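One compact way to build a translation table like the one in geneticCode.py is sketched below. It assumes the standard genetic code with codons enumerated in T, C, A, G order; the names BASES, AMINO_ACIDS, GENETIC_CODE, and translate are hypothetical and may not match the course file.

```python
from itertools import product

# Standard genetic code, listed for codons TTT, TTC, TTA, TTG, TCT, ...
# ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

# Map each three-letter DNA codon to its one-letter amino acid.
GENETIC_CODE = dict(
    ("".join(codon), aa)
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
)


def translate(dna):
    """Translate a DNA string in frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        protein.append(GENETIC_CODE.get(dna[i:i + 3], "X"))
    return "".join(protein)
```

For example, translate("ATGGCCTAA") walks the codons ATG, GCC, TAA. Mapping ambiguous codons to 'X' and dropping a trailing partial codon are design choices of this sketch, not requirements.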

Day 8: Dynamic Programming

Slides
Mark's slides for day 8 (updated with "beads on a string" slides for calculating number of possible pairwise alignments).
dp.py
Needleman-Wunsch implementation with no gap opening penalty.
Biological Sequence Analysis
In addition to excellent coverage of hidden Markov models (and the related dynamic programming algorithms for generating and searching with them) this book gives good general coverage of sequence alignment and statistics. (The UCSF library has copies at both Parnassus and Mission Bay)
Needleman-Wunsch
Primary reference for global alignment by dynamic programming.
Smith-Waterman
Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).
Gotoh
Gotoh's speed-up of Smith-Waterman from O(m^2 n) to O(mn) time.
Myers and Miller
Myers and Miller's update to Gotoh's algorithm, improving the space requirement from O(mn) to O(n).
BLAST
The primary reference for BLAST, giving a fast heuristic method for approximating the local alignment methods and a statistical framework for interpreting the results.
Design Patterns: Elements of Reusable Object-Oriented Software
This is the design patterns textbook that I mentioned in class (the UCSF library doesn't have it, but it's available by interlibrary loan from most other UC libraries).
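To make the dp.py entry above more concrete, here is a minimal sketch of the Needleman-Wunsch fill step with no gap opening penalty, i.e., every gap position costs the same amount. It only computes the optimal global score (no traceback) and is not the course's dp.py; the function name nw_score and the default scores are assumptions for illustration.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing is all gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

For example, aligning "ACGT" with "AGT" under the default scores gives three matches and one gap. The full matrix is kept here (O(mn) space) because a traceback would need it.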

Day 9: Multiple Alignment

Slides
Mark's slides for day 9
Transcript
Mark's bash/Python session for day 9
CLUSTALX
Multiple sequence alignment program based on neighbor-joining trees.
JALVIEW
Program for viewing and annotating multiple alignments. (The "Install Anywhere" version is slightly easier to install than the "Java Web Start" version.)
Hsp82aa.fasta
Hsp82 protein sequences, including some over-predicted N-termini.
hmmclust.py, ClustalTools.py
Sample script (and supporting module) for combining HMMer alignment with CLUSTALW tree building.
BinuclearZn.full.cdt, BinuclearZn.full.gtr
HMM-based alignment of fungal binuclear Zn cluster transcription factors formatted for viewing in JavaTreeView.
GFF file format
Specification for the commonly used version 2 of GFF (e.g., this is what JALVIEW uses for sequence annotations).
GFFv3 file format
Recent update of GFF by Lincoln Stein, with stricter definitions and better support for relationships among features, as used by newer programs such as Gbrowse2.
HMMer
Sean Eddy's profile-HMM implementation
HMMer web interface
New web interface to PHMMer, HMMscan, and HMMsearch
InterProScan
Search interface for InterPro, a meta-database of 15 motif databases (mostly HMM based)
PLoS Comp. Biol. 4:e1000069
Reference for HMMer 3 heuristic
Cygwin
A UNIX-like environment for Windows (install this if you'd like today's OS X/Linux shell examples to work on Windows). Also includes a repository of useful open source programs like Emacs and GCC.
Fink
A repository of useful open source programs for OS X, based on the Debian GNU/Linux APT package manager.
MacPorts
A repository of useful open source programs for OS X, based on the FreeBSD Ports system.

Cygwin, Fink, and MacPorts are software repositories grafted on top of existing operating systems (Windows or OS X). An alternative strategy is for a software repository to embody a complete operating system. This is the approach of the Linux distributions, which combine the Linux kernel with other system components (e.g., command shells and graphical desktops) as well as many useful programming and analysis tools. Here are two such distributions that are easy to get started with:

Ubuntu
A user-friendly Linux distribution based on Debian. This is the distribution that I use on my laptop. The installation CD can be booted as a "Live CD", allowing you to try Linux with no change to your computer.
Knoppix
Knoppix is a "Live CD" version of Debian. Knoppix may be a bit less user-friendly than Ubuntu, but it may boot faster on some computers. More information about Knoppix can be found on this unofficial site.

Day 10: Dynamic Programming, part 2

Slides
Mark's slides for day 10
Transcript
Mark's Python session for day 10
dp2.py
Extensions of the dp.py dynamic-programming code to cover non-zero gap opening penalties (nwg) and local alignments (sw). Note that the two new functions use distinct fill algorithms (nwg_fill and sw_fill) but a common traceback function.
day10.py
In-class version.
blosum62.py
BLOSUM62 scoring matrix as a Python dictionary
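The local-alignment fill described for dp2.py differs from the global one mainly in that cell scores are never allowed to drop below zero and the best score can occur anywhere in the matrix. The sketch below illustrates that idea only. It uses a simple match/mismatch score and a linear gap cost rather than BLOSUM62 and a gap opening penalty, it omits the traceback, and the name sw_score is hypothetical (it is not the sw function from dp2.py).

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap cost)."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + pair,  # extend along the diagonal
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Because only the score is returned, two rows of the matrix are enough. Recovering the alignment itself would require keeping the full matrix (or traceback pointers), as a traceback function does.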