This page contains slides and supporting material for the 2015 version of BMS270: Practical Bioinformatics with Programming. Please see the site updates page for information on tracking changes.

General Resources

Python: Homepage for the Python programming language (download, documentation, etc.)
Learning Python: Excellent introduction to Python. The 4th edition covers Python3 (for biology, it's probably best to stick with Python2 until modules like numpy complete the transition to Python3 -- expect this to happen in the near future). (Older edition of Learning Python)
Safari Bookshelf: The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include: Learning Python, Programming Python, and BLAST. (Bioinformatics Programming Using Python is not recommended).
Enthought Canopy: "Scientific" python distribution -- free for academic use. This is an easy way to install libraries like numpy, scipy, and matplotlib on OS X and Windows (on Linux, it is easiest to install these libraries directly from your distribution's repository). Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython: The interactive shell/notebook that we are using for this course. See also the gallery of example notebooks
Matplotlib: MATLAB-like plotting for python. See also the gallery of example plots
Numpy: Numerical library that serves as the foundation for matplotlib, scipy, etc.
Python Library Documentation: Detailed documentation for all of the standard python modules. See also http://docs.python.org/ for the full list of on-line documentation.
Dive Into Python: Programmer-oriented introduction to Python. Faster paced than Learning Python, but the entire book can be downloaded as a free pdf file. (Note that as of October 2011, this book is no longer hosted at its original website)
Dive Into Python 3: Introduction to Python 3, from Mark Pilgrim, the author of the original Dive Into Python. Since the "scientific python stack" is still somewhat dependent on Python 2, most of this book won't be relevant to this course. Chapter 12, however, has a very good discussion of using ElementTree and lxml for parsing XML documents (e.g., MINiML files from GEO, SVG vector graphics, NCBI's XML format for BLAST output, ...) (Note that as of October 2011, this book is no longer hosted at its original website)
Numerical Recipes: Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
MacPorts: A repository of useful open source programs for OS X, based on the FreeBSD Ports system.
Cygwin: UNIX-like environment for Windows with package manager for popular UNIX/Linux programs.
Ubuntu: A user-friendly Linux distribution based on Debian. This is the distribution that I use on my laptop. The installation CD can be booted as a "Live CD", allowing you to try Linux with no change to your computer.
Knoppix: Knoppix is a "Live CD" version of Debian. Knoppix may be a bit less user friendly than Ubuntu, but it may boot faster on some computers. More information about Knoppix can be found on this unofficial site.
git: Distributed version control system. Very fast, but optimized for Linux.
mercurial: Distributed version control system. Very similar to git, but with better cross-platform support. Also, it's written in Python =)

Day 1: Python

Slides: Mark's slides for day 1
Day1.ipynb: Mark's annotated IPython notebook for day 1
Day1.html: HTML export of the notebook

Day 2: File Formats

Slides: Mark's slides for day 2
Day2.ipynb: Mark's annotated IPython notebook for day 2
Day2.html: HTML export of the notebook
Example Files: Zip archive of example text and binary files
SAM specification: Documentation for the SAM and BAM file formats
PNAS 95:14863 (Eisen et al.): Paper introducing cluster analysis for microarrays.
supp2data.xls: Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) original Excel format
supp2data.cdt: Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited text format (CDT)
CDT file format: Documentation for the extended CDT file format from the JavaTreeView manual

Day 3: Distance Metrics

Slides: Mark's slides for day 3
Day3.ipynb: Mark's annotated IPython notebook for day 3
Day3.html: HTML export of the notebook
Note G: Ada Lovelace's code for the Bernoulli numbers -- often cited as the first published computer program (here is the full document)
Cluster3 distance metrics: Documentation on the distance metrics available in Cluster3 -- useful for the last (optional) homework problem in today's slides.
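
As a rough illustration of a correlation-based distance metric, here is a minimal sketch of the common "1 - r" Pearson distance between two expression profiles. The function name pearson_distance is hypothetical, missing values are not handled, and the exact definitions used by Cluster3 should be checked against the linked documentation.

```python
import math

def pearson_distance(x, y):
    """Return 1 - r, where r is the centered Pearson correlation of x and y.

    Assumes two equal-length sequences of numbers with no missing values.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Center each profile on its own mean
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / norm
```

With this definition, perfectly correlated profiles are at distance 0 and perfectly anti-correlated profiles are at distance 2, regardless of their absolute scale.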

Day 4: Hierarchical Clustering

Slides: Mark's slides for day 4
Day4.ipynb: Mark's annotated IPython notebook for day 4
Day4.html: HTML export of the notebook
stats.py: Final version of the stats.py module from today's class
Cluster3: Michael de Hoon's port of Michael Eisen's CLUSTER program.
cluster.c: Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code)
JavaTreeView: Alok Saldanha's port of Michael Eisen's TREEVIEW program.
MeV: Cluster/TreeView alternative from JCVI (TIGR), similar to Acuity.
Malcolm Gladwell TED talk: An interesting perspective on clustering
Bell Labs Trellis interview: Another useful take on multivariate data analysis and visualization
SGD: Stanford Saccharomyces genome database

Day 5: PCA + practical hierarchical clustering

Principal Components Analysis (PCA)

I mentioned Numerical Recipes quite a bit during our discussion of PCA. Relevant sections (relative to Numerical Recipes in C) are: section 2.6 (singular value decomposition), chapter 11 sections 0 to 3 (symmetric eigensystems, plus some of the methods commonly used for implementing SVD), and chapter 15 sections 0 to 4 (least squares regression, for which SVD provides a stable implementation). I.T. Jolliffe's book on PCA provides well-written and thorough coverage from a practicing statistician's point of view.
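
For orientation, the standard relationship between SVD and PCA can be summarized as follows. This is a sketch under stated assumptions, not taken from the course notebooks: X is an n-by-p data matrix whose columns have already been mean-centered, and the symbols C (covariance matrix) and T (scores) are notation introduced here.

```latex
X = U \Sigma V^{\mathsf{T}}, \qquad
C = \frac{1}{n-1} X^{\mathsf{T}} X = V \, \frac{\Sigma^{2}}{n-1} \, V^{\mathsf{T}}, \qquad
T = X V = U \Sigma
```

Under these assumptions, the columns of V are the principal axes (eigenvectors of C), the squared singular values divided by n-1 are the corresponding variances, and T gives the coordinates of each row of X along those axes.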

PCA_example.ipynb: Mark's annotated IPython notebook for the PCA example
PCA_example.html: HTML export of the PCA notebook
What is Principal Component Analysis?: Nice essay on PCA and factor analysis by "biological" mathematician Lior Pachter
WIREs Comp. Stat. 2:433: Very good PCA tutorial (associated MATLAB code is here). Working through the examples in this paper in python is a great way to get a feel for the logistics of PCA.
ICA vs. PCA: Tutorial on performing PCA and ICA (independent component analysis) using scikit-learn (a python-based package for machine learning, which also includes hierarchical clustering, among many other methods)

Hierarchical Clustering

Day5.ipynb: Mark's annotated IPython notebook for day 5
Day5.html: HTML export of the notebook
test.cdt: Heatmap
test.gtr: Dendrogram

Day 6: Sequence Analysis

Slides: Mark's slides for day 6
Day6.ipynb: Mark's annotated IPython notebook for day 6
Day6.html: HTML export of the notebook
geneticCode.py: The (standard) genetic code as a Python dictionary
sequences1.zip: Example protein, DNA, and RNA sequences in FASTA format.
blosum62.py: The BLOSUM62 scoring matrix as a Python dictionary of dictionaries.
pydotter_canopy.py: Python clone of Eric Sonnhammer's DOTTER program for windowed dotplots.
qtdotter.zip: Port of pydotter_canopy.py, replacing the Tk backend with a Canopy-compatible Qt4 backend and adding matplotlib integration. (The first line of this file has been updated to point at the system copy of python in order to avoid Canopy's Tk incompatibility)
The BLAST book: Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols in chapter 9. The above link is for the Safari Bookshelf copy (preferred). The UCSF library also has a limited-user e-book version (acquired during a brief lapse in the Safari Bookshelf subscription).
Biological Sequence Analysis: In addition to excellent coverage of hidden Markov models (and the related dynamic programming algorithms for generating and searching with them) this book gives good general coverage of sequence alignment and statistics. (The UCSF library has copies at both Parnassus and Mission Bay)
Introduction to Protein Structure, 2nd Edition (Branden and Tooze): Great book for orienting yourself on natural protein sequences.
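
To illustrate the idea of storing the genetic code as a Python dictionary, here is a minimal sketch. It is not the contents of geneticCode.py: the names GENETIC_CODE and translate are hypothetical, and the table is built from the standard code listed in T, C, A, G base order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ("*" marks a stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each codon (e.g. "ATG") to its one-letter amino acid code
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string in frame 0, ignoring any trailing partial codon.

    Codons containing characters other than A, C, G, T are reported as "X".
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(GENETIC_CODE.get(dna[i:i + 3], "X"))
    return "".join(protein)
```

For example, translate("ATGGCCTAA") returns "MA*". A scoring matrix stored as a dictionary of dictionaries is used the same way, with two successive lookups (one per residue).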

Day 7: Sequence Alignment

Slides: Mark's slides for day 7
Needleman-Wunsch: Primary reference for global alignment by dynamic programming.
Smith-Waterman: Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).
Gotoh: Gotoh's speed-up of Smith-Waterman from O(m²n) to O(mn) time.
Myers and Miller: Myers and Miller's update to Gotoh's algorithm, improving the space requirement from O(mn) to O(n).
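
As background for the O(mn) result, the affine-gap recurrence is often written in the following three-matrix textbook form. This is a sketch in notation introduced here, not copied from the papers above: s is the substitution score, d the gap-opening penalty, e the gap-extension penalty, and a gap in one sequence is assumed not to be followed directly by a gap in the other.

```latex
\begin{aligned}
M(i,j)   &= s(x_i, y_j) + \max\{\, M(i-1,j-1),\; I_x(i-1,j-1),\; I_y(i-1,j-1) \,\} \\
I_x(i,j) &= \max\{\, M(i-1,j) - d,\; I_x(i-1,j) - e \,\} \\
I_y(i,j) &= \max\{\, M(i,j-1) - d,\; I_y(i,j-1) - e \,\}
\end{aligned}
```

Each cell depends on a constant number of neighbors, which is where the O(mn) running time comes from. For local alignment, M(i,j) additionally takes a maximum with 0.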

Day 8: Dynamic Programming

Slides: Mark's slides for day 8
NeedlemanWunschExamples.html: Some test sequences for your Needleman-Wunsch code, with optimal scores, example optimal alignments, and dynamic programming matrices
blosum45.txt: BLOSUM45 scoring matrix
blosum62.txt: BLOSUM62 scoring matrix
blosum80.txt: BLOSUM80 scoring matrix
ScoringMatrices.html: IPython notebook for clustering BLOSUM scoring matrices
UCSC Genome Browser: Human genome
NCBI GQuery: Keyword search across Pubmed/GenBank/GEO/etc.
NCBI BLAST: NCBI's BLAST portal
JMB 215:403: The primary reference for BLAST, giving a fast heuristic method for approximating the local alignment methods and a statistical framework for interpreting the results.
IRE Transactions on Information Theory 2:113: Noam Chomsky's original description of regular expressions and context free grammars
HMMer: Sean Eddy's profile-HMM implementation
HMMer web interface: New web interface to PHMMer, HMMscan, and HMMsearch
InterProScan: Search interface for InterPro, a meta-database of 15 motif databases (mostly HMM based)
PLoS Comp. Biol. 4:e1000069: Statistical argument for HMMer 3 speed-ups
PLoS Comput Biol. 7:e1002195: Algorithmic details for HMMer 3 speed-ups
RNA 18:193: Recent general-purpose SCFG framework for statistical models of RNA secondary structure.
Regular Expressions: Documentation for Python's re module
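
As a small example of the re module applied to sequences, the sketch below scans a protein for the N-glycosylation sequon N-{P}-[ST]-{P}. The function name find_motif and the test sequence are made up for illustration.

```python
import re

# N, then anything but P, then S or T, then anything but P
MOTIF = re.compile(r"N[^P][ST][^P]")

def find_motif(protein):
    """Return (start, matched_text) for each non-overlapping motif match."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]

hits = find_motif("MKNGTALNPSQNKSA")
```

Here hits is [(2, "NGTA"), (11, "NKSA")]; the N at position 7 is skipped because it is followed by P. Note that finditer reports non-overlapping matches only, so overlapping sites would need a lookahead pattern.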

Day 9: Multiple Alignment

Slides: Mark's slides for day 9
Calmodulin_example.zip: Calmodulin-related sequences
GATA.zip: GATA transcription factors
GFF file format: Specification for the commonly used version 2 of GFF (e.g., this is what JALVIEW uses for sequence annotations).
GFFv3 file format: Recent update of GFF by Lincoln Stein, with stricter definitions and better support for relationships among features, as used by newer programs such as Gbrowse2.
FastaFile.py: Simple FASTA-file parser
Sequence.py: Fancier nucleotide/protein classes supporting translation, ORF-finding, etc.
ClustalTools.py: Utilities for some multiple alignment and phylogenetic tree formats
aln2cdt.py: Conversion script to map CLUSTAL aln/phb files to JavaTreeView cdt/gtr format. Depends on FastaFile.py and ClustalTools.py
Lucien.py: A larger alignment pipeline example (depends on databases and code not included on this site).
GATA.hack.cdt: In-class GATA alignment from Lucien.py pipeline, formatted as CDT for JavaTreeView (from hmmalign Stockholm-format output).
GATA.hack.gtr: In-class GATA phylogeny from Lucien.py pipeline, formatted as GTR for JavaTreeView (from fasttree2 New Hampshire-format output).
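
For readers who want to see what a simple FASTA parser involves, here is a minimal sketch. It is not the code in FastaFile.py; the generator name read_fasta is hypothetical, and it assumes plain FASTA input with ">" header lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-format lines.

    `lines` can be an open file or any iterable of strings. Blank lines are
    skipped, and text before the first ">" header is ignored.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record
    if header is not None:
        yield header, "".join(chunks)
```

Because this is a generator, records are produced one at a time, so a large file does not need to be held in memory. Typical usage would be a loop such as "for name, seq in read_fasta(open(path))", where path is whatever file you want to read.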

Day 10: Wrap Up

Slides: Mark's slides for day 10
Day10.txt: Mark's bash session. Most of this is specific to Linux/OS X (but will work on Windows with Cygwin)
cli.py: Final state of the example command-line script that we wrote today
dp2.py: Example dynamic-programming implementations:

Global alignment with zero gap opening penalties (nw)
Global alignment (nwg)
Local alignment (sw)
Global alignment with optional edge constraints (nwp)

Note that nwg, nwp, and sw use distinct fill algorithms (nwg_fill, nwp_fill, and sw_fill) but a common traceback function (sw_traceback). This module also includes three utility functions: makeIdent, for generating simple scoring matrices, gapped_score, for independent confirmation of alignment scores, and nw_dump, for diagnostic dumps of the dynamic programming data structures. sw_backtrack is an alternative to sw_traceback that exhaustively enumerates all optimal alignments.
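
The simplest of these cases (global alignment with a linear gap penalty, i.e., no separate gap-opening cost) can be sketched as follows. This is not the code in dp2.py: the names identity_matrix, nw_align, and score_alignment are hypothetical stand-ins, and the traceback returns just one of the possibly many optimal alignments.

```python
def identity_matrix(alphabet, match=1, mismatch=-1):
    """Build a simple scoring matrix as a dictionary of dictionaries."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}

def nw_align(x, y, score, gap=-1):
    """Global alignment with a linear gap penalty (integer scores assumed).

    Returns (optimal_score, aligned_x, aligned_y).
    """
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    # First column and row: prefixes aligned entirely against gaps
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    # Fill: best of match/mismatch, gap in y, gap in x
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + score[x[i - 1]][y[j - 1]],
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Traceback from the bottom-right corner, preferring diagonal moves
    ax, ay = [], []
    i, j = len(x), len(y)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                F[i][j] == F[i - 1][j - 1] + score[x[i - 1]][y[j - 1]]):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[len(x)][len(y)], "".join(reversed(ax)), "".join(reversed(ay))

def score_alignment(ax, ay, score, gap=-1):
    """Independently re-score a gapped alignment, column by column."""
    total = 0
    for a, b in zip(ax, ay):
        total += gap if "-" in (a, b) else score[a][b]
    return total
```

Re-scoring the returned alignment with a separate function is a cheap consistency check: the two scores should always agree, and a mismatch points to a bug in either the fill or the traceback.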