This page contains slides and supporting material for the 2015 version of BMS270: Practical Bioinformatics with Programming. Please see the site updates page for information on tracking changes.

General Resources

Python
Homepage for the Python programming language (download, documentation, etc.)
Learning Python
Excellent introduction to Python. The 4th edition covers Python3 (for biology, it's probably best to stick with Python2 until modules like numpy complete the transition to Python3 -- expect this to happen in the near future). (Older edition of Learning Python)
Safari Bookshelf
The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include: Learning Python, Programming Python, and BLAST. (Bioinformatics Programming Using Python is not recommended).
Enthought Canopy
"Scientific" python distribution -- free for academic use. This is an easy way to install libraries like numpy, scipy, and matplotlib on OS X and Windows (on Linux, it is easiest to install these libraries directly from your distribution's repository). Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython
The interactive shell/notebook that we are using for this course. See also the gallery of example notebooks
Matplotlib
MATLAB-like plotting for python. See also the gallery of example plots
Numpy
Numerical library that serves as the foundation for matplotlib, scipy, etc.
Python Library Documentation
Detailed documentation for all of the standard python modules. See also http://docs.python.org/ for the full list of on-line documentation.
Dive Into Python
Programmer-oriented introduction to Python. Faster paced than Learning Python, but the entire book can be downloaded as a free pdf file. (Note that as of October 2011, this book is no longer hosted at its original website)
Dive Into Python 3
Introduction to Python 3, from Mark Pilgrim, the author of the original Dive Into Python. Since the "scientific python stack" is still somewhat dependent on Python 2, most of this book won't be relevant to this course. Chapter 12, however, has a very good discussion of using ElementTree and lxml for parsing XML documents (e.g., MINiML files from GEO, SVG vector graphics, NCBI's XML format for BLAST output, ...) (Note that as of October 2011, this book is no longer hosted at its original website)
Numerical Recipes
Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
MacPorts
A repository of useful open source programs for OS X, based on the FreeBSD Ports system.
Cygwin
UNIX-like environment for Windows with package manager for popular UNIX/Linux programs.
Ubuntu
A user-friendly Linux distribution based on Debian. This is the distribution that I use on my laptop. The installation CD can be booted as a "Live CD", allowing you to try Linux with no change to your computer.
Knoppix
Knoppix is a "Live CD" version of Debian. Knoppix may be a bit less user friendly than Ubuntu, but it may boot faster on some computers. More information about Knoppix can be found on this unofficial site.
git
Distributed version control system. Very fast, but optimized for Linux.
mercurial
Distributed version control system. Very similar to git, but better cross-platform support. Also, it's written in Python =)

Day 1: Python

Slides
Mark's slides for day 1
Day1.ipynb
Mark's annotated IPython notebook for day 1
Day1.html
HTML export of the notebook

Day 2: File Formats

Slides
Mark's slides for day 2
Day2.ipynb
Mark's annotated IPython notebook for day 2
Day2.html
HTML export of the notebook
Example Files
Zip archive of example text and binary files
SAM specification
Documentation for the SAM and BAM file formats
PNAS 95:14863 (Eisen et al.)
Paper introducing cluster analysis for microarrays.
supp2data.xls
Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) in the original Excel format
supp2data.cdt
Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited text format (CDT)
CDT file format
Documentation for the extended CDT file format from the JavaTreeView manual

Day 3: Distance Metrics

Slides
Mark's slides for day 3
Day3.ipynb
Mark's annotated IPython notebook for day 3
Day3.html
HTML export of the notebook
Note G
Ada Lovelace's code for the Bernoulli numbers -- often cited as the first published computer program (here is the full document)
Cluster3 distance metrics
Documentation on the distance metrics available in Cluster3 -- useful for the last (optional) homework problem in today's slides.

Day 4: Hierarchical Clustering

Slides
Mark's slides for day 4
Day4.ipynb
Mark's annotated IPython notebook for day 4
Day4.html
HTML export of the notebook
stats.py
Final version of the stats.py module from today's class
Cluster3
Michael de Hoon's port of Michael Eisen's CLUSTER program.
cluster.c
Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code)
JavaTreeView
Alok Saldanha's port of Michael Eisen's TREEVIEW program.
MeV
Cluster/TreeView alternative from JCVI (TIGR), similar to Acuity.
Malcolm Gladwell TED talk
An interesting perspective on clustering
Bell Labs Trellis interview
Another useful take on multivariate data analysis and visualization
SGD
Stanford Saccharomyces genome database

Day 5: PCA + practical hierarchical clustering

Principal Components Analysis (PCA)

I mentioned Numerical Recipes quite a bit during our discussion of PCA. Relevant sections (relative to Numerical Recipes in C) are: section 2.6 (singular value decomposition), chapter 11 sections 0 to 3 (symmetric eigensystems, plus some of the methods commonly used for implementing SVD), and chapter 15 sections 0 to 4 (least squares regression, for which SVD provides a stable implementation). I.T. Jolliffe's book on PCA gives well-written and thorough coverage from a practicing statistician's point of view.
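As a minimal sketch of the connection between SVD and PCA mentioned above, numpy's SVD routine can be applied to mean-centered data. This is not taken from the course notebook: the function name pca_svd, its return values, and the toy data are assumptions for illustration.

```python
import numpy as np

def pca_svd(data):
    """PCA of a (samples x features) array via SVD of the mean-centered data.

    Hypothetical helper for illustration; returns (components, variances, scores).
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)          # center each feature (column)
    # full_matrices=False gives the "economy" decomposition: centered = u * diag(s) * vt
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (data.shape[0] - 1)     # eigenvalues of the sample covariance matrix
    scores = u * s                               # samples projected onto the components
    return vt, variances, scores

# Toy data: four points lying close to the line y = x
toy = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
components, variances, scores = pca_svd(toy)
```

The rows of components are the principal axes, ordered by decreasing variance. Working from the SVD avoids forming the covariance matrix explicitly, which is the stability argument made in the Numerical Recipes sections listed above.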

PCA_example.ipynb
Mark's annotated IPython notebook for the PCA example
PCA_example.html
HTML export of the PCA notebook
What is Principal Component Analysis?
Nice essay on PCA and factor analysis by "biological" mathematician Lior Pachter
WIREs Comp. Stat. 2:433
Very good PCA tutorial (associated MATLAB code is here). Working through the examples in this paper in python is a great way to get a feel for the logistics of PCA.
ICA vs. PCA
Tutorial on performing PCA and ICA (independent component analysis) using scikit-learn (a python-based package for machine learning, which also includes hierarchical clustering, among many other methods)

Hierarchical Clustering

Day5.ipynb
Mark's annotated IPython notebook for day 5
Day5.html
HTML export of the notebook
test.cdt
Heatmap
test.gtr
Dendrogram

Day 6: Sequence Analysis

Slides
Mark's slides for day 6
Day6.ipynb
Mark's annotated IPython notebook for day 6
Day6.html
HTML export of the notebook
geneticCode.py
The (standard) genetic code as a Python dictionary
sequences1.zip
Example protein, DNA, and RNA sequences in FASTA format.
blosum62.py
The BLOSUM62 scoring matrix as a Python dictionary of dictionaries.
pydotter_canopy.py
Python clone of Eric Sonnhammer's DOTTER program for windowed dotplots.
qtdotter.zip
Port of pydotter_canopy.py, replacing the Tk backend with a Canopy-compatible Qt4 backend and adding matplotlib integration. (The first line of this file has been updated to point at the system copy of python in order to avoid Canopy's Tk incompatibility)
The BLAST book
Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols in chapter 9. The above link is for the Safari Bookshelf copy (preferred). The UCSF library also has a limited-user e-book version (acquired during a brief lapse in the Safari Bookshelf subscription).
Biological Sequence Analysis
In addition to excellent coverage of hidden Markov models (and the related dynamic programming algorithms for generating and searching with them) this book gives good general coverage of sequence alignment and statistics. (The UCSF library has copies at both Parnassus and Mission Bay)
Introduction to Protein Structure, 2nd Edition (Branden and Tooze)
Great book for orienting yourself on natural protein sequences.
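The dictionary representations used by geneticCode.py and blosum62.py can be sketched as follows. The variable names (codon_table, toy_matrix), the function translate, the truncated four-codon table, and the score values are assumptions for illustration, not the contents of the course files.

```python
# Truncated codon table (DNA codons -> one-letter amino acids); '*' marks a stop codon.
codon_table = {"ATG": "M", "AAA": "K", "TGG": "W", "TAA": "*"}

def translate(dna):
    """Translate a DNA string codon by codon, using 'X' for codons missing from the table."""
    usable = len(dna) - len(dna) % 3          # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(codon_table.get(codon, "X") for codon in codons)

# Dictionary of dictionaries: toy_matrix[a][b] is the score for aligning a with b.
# Illustrative values only; a real scoring matrix covers all 20 amino acids.
toy_matrix = {
    "M": {"M": 5, "K": -1},
    "K": {"M": -1, "K": 5},
}

protein = translate("ATGAAATGGTAA")
score = toy_matrix["M"]["K"]
```

The nested form makes a pairwise lookup read like matrix indexing, at the cost of storing both matrix[a][b] and matrix[b][a] for a symmetric matrix.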

Day 7: Sequence Alignment

Slides
Mark's slides for day 7
Needleman-Wunsch
Primary reference for global alignment by dynamic programming.
Smith-Waterman
Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).
Gotoh
Gotoh's speed-up of Smith-Waterman from O(m^2 n) to O(mn) time.
Myers and Miller
Myers and Miller's update to Gotoh's algorithm, improving the space requirement from O(mn) to O(n).
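For orientation, global alignment by dynamic programming can be sketched as a score-only fill with a linear gap penalty. This is a simplified illustration, not the algorithm as published in any of the papers above: the function name nw_score, the match/mismatch/gap values, and the two-row storage are assumptions.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # Row 0: an empty prefix of a aligned against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                      # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                # a[i-1] aligned to a gap
            left = curr[j - 1] + gap          # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr                           # only the previous row is ever needed
    return prev[len(b)]
```

Keeping two rows is enough to get the optimal score in O(n) space. Recovering the alignment itself in linear space needs the divide-and-conquer strategy described by Myers and Miller; a full traceback otherwise requires the whole O(mn) matrix.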

Day 8: Dynamic Programming

Slides
Mark's slides for day 8
NeedlemanWunschExamples.html
Some test sequences for your Needleman-Wunsch code, with optimal scores, example optimal alignments, and dynamic programming matrices
blosum45.txt
BLOSUM45 scoring matrix
blosum62.txt
BLOSUM62 scoring matrix
blosum80.txt
BLOSUM80 scoring matrix
ScoringMatrices.html
IPython notebook for clustering BLOSUM scoring matrices
UCSC Genome Browser
Human genome
NCBI GQuery
Keyword search across Pubmed/GenBank/GEO/etc.
NCBI BLAST
NCBI's BLAST portal
JMB 215:403
The primary reference for BLAST, giving a fast heuristic method for approximating the local alignment methods and a statistical framework for interpreting the results.
IRE Transactions on Information Theory 2:113
Noam Chomsky's original description of regular expressions and context-free grammars
HMMer
Sean Eddy's profile-HMM implementation
HMMer web interface
New web interface to PHMMer, HMMscan, and HMMsearch
InterProScan
Search interface for InterPro, a meta-database of 15 motif databases (mostly HMM based)
PLoS Comp. Biol. 4:e1000069
Statistical argument for HMMer 3 speed-ups
PLoS Comput Biol. 7:e1002195
Algorithmic details for HMMer 3 speed-ups
RNA 18:193
Recent general-purpose SCFG framework for statistical models of RNA secondary structure.
Regular Expressions
Documentation for Python's re module

Day 9: Multiple Alignment

Slides
Mark's slides for day 9
Calmodulin_example.zip
Calmodulin-related sequences
GATA.zip
GATA transcription factors
GFF file format
Specification for the commonly used version 2 of GFF (e.g., this is what JALVIEW uses for sequence annotations).
GFFv3 file format
Recent update of GFF by Lincoln Stein, with stricter definitions and better support for relationships among features, as used by newer programs such as Gbrowse2.
FastaFile.py
Simple FASTA-file parser
Sequence.py
Fancier nucleotide/protein classes supporting translation, ORF-finding, etc.
ClustalTools.py
Utilities for some multiple alignment and phylogenetic tree formats
aln2cdt.py
Conversion script to map CLUSTAL aln/phb files to JavaTreeView cdt/gtr format. Depends on FastaFile.py and ClustalTools.py
Lucien.py
A larger alignment pipeline example (depends on databases and code not included on this site).
GATA.hack.cdt
In-class GATA alignment from Lucien.py pipeline, formatted as CDT for JavaTreeView (from hmmalign Stockholm-format output).
GATA.hack.gtr
In-class GATA phylogeny from Lucien.py pipeline, formatted as GTR for JavaTreeView (from fasttree2 New Hampshire-format output).
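A simple FASTA parser of the kind listed above can be sketched in a few lines. This is not the code in FastaFile.py: the function name read_fasta and its behavior (yielding header and sequence pairs) are assumptions for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                          # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks) # emit the record that just ended
            header, chunks = line[1:], []     # start a new record
        elif header is not None:
            chunks.append(line)               # a sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)         # emit the final record

example = [">seq1 toy record", "ACGT", "ACGT", "", ">seq2", "MKW"]
records = list(read_fasta(example))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, and large files are processed one record at a time.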

Day 10: Wrap Up

Slides
Mark's slides for day 10
Day10.txt
Mark's bash session. Most of this is specific to Linux/OS X (but will work on Windows with Cygwin)
cli.py
Final state of the example command-line script that we wrote today
dp2.py
Example dynamic-programming implementations: Note that nwg, nwp, and sw use distinct fill algorithms (nwg_fill, nwp_fill, and sw_fill) but a common traceback function (sw_traceback). This module also includes three utility functions: makeIdent, for generating simple scoring matrices, gapped_score, for independent confirmation of alignment scores, and nw_dump, for diagnostic dumps of the dynamic programming data structures. sw_backtrack is an alternative to sw_traceback that exhaustively enumerates all optimal alignments.
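The idea of independently confirming an alignment score can be sketched as follows. This is not the gapped_score or makeIdent code from dp2.py, whose signatures are not shown on this page: the name score_alignment, the linear gap penalty, and the toy identity matrix are assumptions for illustration.

```python
def score_alignment(top, bottom, matrix, gap=-2, gap_char="-"):
    """Score two equal-length gapped strings column by column (linear gap penalty)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == gap_char and y == gap_char:
            raise ValueError("column with gaps in both sequences")
        if x == gap_char or y == gap_char:
            total += gap                      # one penalty per gapped column
        else:
            total += matrix[x][y]             # dictionary-of-dictionaries lookup
    return total

# Toy identity-style matrix: +1 for a match, -1 for a mismatch
alphabet = "ACGT"
identity = dict((a, dict((b, 1 if a == b else -1) for b in alphabet)) for a in alphabet)
total = score_alignment("ACGT", "A-GT", identity)
```

Rescoring the traceback output this way is a cheap check that the fill and traceback steps agree: the value computed here should equal the optimal score reported by the dynamic programming matrix under the same scoring scheme.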