This page contains slides and supporting material for the 2013, module 3 version of BMS270: Practical Bioinformatics with Programming. Please see the site updates page for information on tracking changes.

General Resources

Python
Homepage for the Python programming language (download, documentation, etc.)
Enthought Canopy
"Scientific" python distribution -- free for academic use. This is an easy way to install libraries like numpy, scipy, and matplotlib on OS X and Windows (on Linux, it is easiest to install these libraries directly from your distribution's repository). Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython
The interactive shell/notebook that we are using for this course. See also the gallery of example notebooks.
Matplotlib
MATLAB-like plotting for python. See also the gallery of example plots.
Numpy
Numerical library that serves as the foundation for matplotlib, scipy, etc.
Python Library Documentation
Detailed documentation for all of the standard python modules. See also http://docs.python.org/ for the full list of on-line documentation.
Dive Into Python
Programmer-oriented introduction to Python. Faster paced than Learning Python, but the entire book can be downloaded as a free pdf file. (Note that as of October 2011, this book is no longer hosted at its original website)
Dive Into Python 3
Introduction to Python 3, from Mark Pilgrim, the author of the original Dive Into Python. Since the "scientific python stack" is still somewhat dependent on Python 2, most of this book won't be relevant to this course. Chapter 12, however, has a very good discussion of using ElementTree and lxml for parsing XML documents (e.g., MINiML files from GEO, SVG vector graphics, NCBI's XML format for BLAST output, ...) (Note that as of October 2011, this book is no longer hosted at its original website)
Numerical Recipes
Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
MacPorts
A repository of useful open source programs for OS X, based on the FreeBSD Ports system.
Cygwin
UNIX-like environment for Windows with package manager for popular UNIX/Linux programs.
Ubuntu
A user-friendly Linux distribution based on Debian. This is the distribution that I use on my laptop. The installation CD can be booted as a "Live CD", allowing you to try Linux with no change to your computer.
Knoppix
Knoppix is a "Live CD" version of Debian. Knoppix may be a bit less user friendly than Ubuntu, but it may boot faster on some computers. More information about Knoppix can be found on this unofficial site.
git
Distributed version control system. Very fast, but optimized for Linux.
mercurial
Distributed version control system. Very similar to git, but better cross-platform support. Also, it's written in Python =)

Day 1: Python

Slides
Mark's slides for day 1
Transcript
Mark's Python session for day 1

Day 2: File Formats

Slides
Mark's slides for day 2
Transcript
Mark's Python session for day 2
stats.py
Commented version of yesterday's homework problems
PNAS 95:14863 (Eisen et al.)
Paper introducing cluster analysis for microarrays.
supp2data.xls
Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) original Excel format
supp2data.cdt
Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited text format (CDT)
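A tab-delimited table such as supp2data.cdt can be read with Python's standard csv module. The following is a minimal sketch, not the course's own parser. It assumes (this page does not confirm the layout) that the first column holds the row identifier, that optional NAME and GWEIGHT columns hold annotation, that an optional EWEIGHT row holds column weights, and that empty cells mark missing values. The function name read_cdt and the example rows are made up for illustration.

```python
import csv
import io

# Assumed names of annotation (non-data) columns; adjust to the real file.
ANNOTATION_COLUMNS = {"NAME", "GWEIGHT"}


def read_cdt(handle):
    """Return (column_labels, {row_id: [float or None, ...]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Data columns: everything after the ID column that is not annotation.
    data_idx = [i for i, label in enumerate(header)
                if i > 0 and label not in ANNOTATION_COLUMNS]
    labels = [header[i] for i in data_idx]
    rows = {}
    for fields in reader:
        if not fields or fields[0] == "EWEIGHT":
            continue  # skip blank lines and the assumed column-weight row
        values = []
        for i in data_idx:
            cell = fields[i].strip() if i < len(fields) else ""
            values.append(float(cell) if cell else None)  # None = missing
        rows[fields[0]] = values
    return labels, rows


# Made-up example data in the assumed layout.
example = io.StringIO(
    "ORF\tNAME\tGWEIGHT\tt0\tt1\n"
    "EWEIGHT\t\t\t1\t1\n"
    "GENE_A\tfirst made-up gene\t1\t0.5\t-1.2\n"
    "GENE_B\tsecond made-up gene\t1\t\t0.3\n"
)
labels, rows = read_cdt(example)
print(labels)
print(rows["GENE_B"])
```

Keeping missing values as None, rather than silently substituting zero, leaves the choice of how to handle them to whatever distance function is applied later.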

Day 3: Distance Metrics

Slides
Mark's slides for day 3
Transcript
Mark's Python session for day 3
Cluster3
Michael de Hoon's port of Michael Eisen's CLUSTER program.
cluster.c
Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code and license details).
JavaTreeView
Alok Saldanha's port of Michael Eisen's TREEVIEW program.
MeV
Cluster/TreeView alternative from JCVI (TIGR), similar to Acuity.
Malcolm Gladwell TED talk
An interesting perspective on clustering
Bell Labs Trellis interview
Another useful take on multivariate data analysis and visualization

Day 4: Hierarchical Clustering

Slides
Mark's slides for day 4
Transcript
Mark's Python session for day 4
SGD
The Saccharomyces Genome Database, hosted at Stanford
Mega Yeast
Larger version of the yeast expression set from Audrey Gasch

Day 5: Hierarchical Clustering

Slides
Mark's slides for day 5
Transcript
Mark's Python session for day 5
supp2data.cdt
supp2data.csv, reformatted for Cluster3/JavaTreeView
rowshuffle.cdt
supp2data.cdt with rows shuffled
rowuncouple.cdt
rowshuffle.cdt with independent within-row shuffling
supp2data.um.cdt
Clustering result for supp2data.cdt -- heatmap
supp2data.um.gtr
Clustering result for supp2data.cdt -- dendrogram
rowshuffle.um.cdt
Clustering result for rowshuffle.cdt -- heatmap
rowshuffle.um.gtr
Clustering result for rowshuffle.cdt -- dendrogram
rowuncouple.um.cdt
Clustering result for rowuncouple.cdt -- heatmap
rowuncouple.um.gtr
Clustering result for rowuncouple.cdt -- dendrogram
supp2data_dists.cdt
Uncentered Pearson distance matrix for supp2data.cdt
rowshuffle_dists.cdt
Uncentered Pearson distance matrix for rowshuffle.cdt
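The distance matrices listed above use the uncentered Pearson distance. As a rough sketch, assuming the usual definition (one minus the correlation computed without subtracting the means, which is the cosine of the angle between the two profiles), it can be written in plain Python as follows. The function names are hypothetical, and missing values are not handled here.

```python
import math


def uncentered_pearson_distance(x, y):
    """Return 1 - (x . y) / (|x| |y|), with no mean subtraction."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0.0:
        raise ValueError("distance is undefined for an all-zero profile")
    return 1.0 - dot / norm


def distance_matrix(profiles):
    """Return the full symmetric matrix of pairwise distances."""
    n = len(profiles)
    return [[uncentered_pearson_distance(profiles[i], profiles[j])
             for j in range(n)]
            for i in range(n)]


# Made-up profiles: parallel, orthogonal, and anti-parallel cases.
print(uncentered_pearson_distance([1, 2, 3], [2, 4, 6]))
print(uncentered_pearson_distance([1, 0], [0, 1]))
print(uncentered_pearson_distance([1, 2], [-1, -2]))
```

The distance ranges from 0 (profiles pointing the same way) to 2 (profiles pointing in opposite directions), and unlike the centered version it is sensitive to a constant offset added to a profile.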

Day 6: Sequence analysis

Slides
Mark's slides for day 6
Transcript
Mark's Python session for day 6
geneticCode.py
The (standard) genetic code as a Python dictionary
sequences1.zip
Example protein, DNA, and RNA sequences in FASTA format.
blosum62.py
The BLOSUM62 scoring matrix as a Python dictionary of dictionaries.
pydotter_canopy.py
Python clone of Erik Sonnhammer's DOTTER program for windowed dotplots. (The first line of this file has been updated to point at the system copy of python in order to avoid Canopy's Tk incompatibility)
The BLAST book
Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols for common BLAST tasks in chapter 9. UCSF's access to this book is currently limited to one user at a time.
Biological Sequence Analysis
In addition to excellent coverage of hidden Markov models (and the related dynamic programming algorithms for generating and searching with them) this book gives good general coverage of sequence alignment and statistics. (The UCSF library has copies at both Parnassus and Mission Bay)
Introduction to Protein Structure, 2nd Edition (Branden and Tooze)
Great book for orienting yourself on natural protein sequences.
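To illustrate the idea of holding the standard genetic code in a Python dictionary, here is a small sketch that builds such a dictionary and uses it to translate DNA. The layout of geneticCode.py may well differ; the names BASES, AMINO_ACIDS, GENETIC_CODE, and translate are assumptions made for this example, and the choice of "X" for unrecognized codons is arbitrary.

```python
from itertools import product

# Codons enumerated in T, C, A, G order (third base varying fastest),
# paired with the standard genetic code; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(sequence):
    """Translate DNA (or RNA) in frame 0; unknown codons become 'X'."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for i in range(0, len(dna) - 2, 3):
        protein.append(GENETIC_CODE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

A dictionary lookup keeps translation to one line per codon, and a scoring matrix stored as a dictionary of dictionaries works the same way with two keys (one per residue).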

Day 7: Alignment

Slides
Mark's slides for day 7
Transcript
Mark's Python session for day 7
Sequence.py
An example "real world" sequence implementation. This is the module that I use for most of my day-to-day sequence manipulation.
Needleman-Wunsch
Primary reference for global alignment by dynamic programming.
Smith-Waterman
Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).
Gotoh
Gotoh's speed-up of Smith-Waterman from O(m^2 n) to O(mn) time.
Myers and Miller
Myers and Miller's update to Gotoh's algorithm, improving the space requirement from O(mn) to O(n).
Design Patterns: Elements of Reusable Object-Oriented Software
This is the design patterns textbook that I mentioned in class (the UCSF library doesn't have it, but it's available by interlibrary loan from most other UC libraries).
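The O(n) space figure quoted above for Myers and Miller rests on the observation that each row of the dynamic programming matrix depends only on the row before it. The sketch below shows just that observation: it computes a global alignment score while keeping two rows in memory. It is not the Myers and Miller algorithm itself, which also recovers the alignment (by divide and conquer) and handles gap-opening penalties. The function name and the scoring values (match, mismatch, and a per-residue gap penalty) are arbitrary choices for illustration.

```python
def global_score_two_rows(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score using O(len(b)) memory (score only)."""
    # Row 0: aligning the empty prefix of a against prefixes of b.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # column 0: prefix of a against nothing
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = previous[j - 1] + pair   # align a[i-1] with b[j-1]
            up = previous[j] + gap              # gap in b
            left = current[j - 1] + gap         # gap in a
            current.append(max(diagonal, up, left))
        previous = current  # the older row is no longer needed
    return previous[-1]


print(global_score_two_rows("GATTACA", "GATTACA"))
print(global_score_two_rows("ACGT", "AGT"))
```

Because the earlier rows are discarded, the traceback information is lost, which is exactly the problem the divide-and-conquer step in the full method addresses.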

Day 8: Heuristic Alignment and Search

Slides
Mark's slides for day 8
Transcript
Mark's Python session for day 8
blosum45.txt
BLOSUM45 scoring matrix
blosum62.txt
BLOSUM62 scoring matrix
blosum80.txt
BLOSUM80 scoring matrix
BLOSUM clustering example
UCSC Genome Browser
Human genome
NCBI GQuery
Keyword search across Pubmed/GenBank/GEO/etc.
NCBI BLAST
NCBI's BLAST portal
JMB 215:403
The primary reference for BLAST, giving a fast heuristic method for approximating the local alignment methods and a statistical framework for interpreting the results.
HMMer
Sean Eddy's profile-HMM implementation
HMMer web interface
New web interface to PHMMer, HMMscan, and HMMsearch
InterProScan
Search interface for InterPro, a meta-database of 15 motif databases (mostly HMM based)
PLoS Comp. Biol. 4:e1000069
Statistical argument for HMMer 3 speed-ups
PLoS Comput Biol. 7:e1002195
Algorithmic details for HMMer 3 speed-ups
RNA 18:193
Recent general-purpose SCFG (stochastic context-free grammar) framework for statistical models of RNA secondary structure.
Regular Expressions
Documentation for Python's re module
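As a small illustration of Python's re module applied to sequences, the sketch below scans a protein sequence for the N-glycosylation motif, written in PROSITE style as N-{P}-[ST]-{P}. The function name find_motif and the example sequence are made up. The pattern is wrapped in a lookahead so that overlapping occurrences are also reported, which is a design choice for this example rather than a requirement.

```python
import re

# N, then anything but P, then S or T, then anything but P.
# The (?=...) lookahead matches without consuming characters, so
# finditer can report hits that overlap one another.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence):
    """Return (zero-based start, matched text) for every motif hit."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(sequence)]


for start, hit in find_motif("MNGSANPSTNKTNASL"):
    print(start, hit)
```

Without the lookahead, finditer resumes scanning after the end of each match, so a hit that starts inside a previous hit would be skipped.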

Day 9: Multiple Alignment

CLUSTALX
Multiple sequence alignment program based on neighbor-joining trees.
JALVIEW
Program for viewing and annotating multiple alignments. (The "Install Anywhere" version is slightly easier to install than the "Java Web Start" version).
MESQUITE
Multiple alignment workbench with good tree viewer
Slides
Mark's slides for day 9
Calmodulin_example.zip
Calmodulin-related sequences
GATA.zip
GATA transcription factors
GFF file format
Specification for the commonly used version 2 of GFF (e.g., this is what JALVIEW uses for sequence annotations).
GFFv3 file format
Recent update of GFF by Lincoln Stein, with stricter definitions and better support for relationships among features, as used by newer programs such as Gbrowse2.
FastaFile.py
Simple FASTA-file parser
ClustalTools.py
Utilities for some multiple alignment and phylogenetic tree formats
aln2cdt.py
Conversion script to map CLUSTAL aln/phb files to JavaTreeView cdt/gtr format. Depends on FastaFile.py and ClustalTools.py
Lucien.py
A larger alignment pipeline example (depends on databases and code not included on this site).
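For readers who have not seen one, a simple FASTA-file parser can be quite short. The sketch below is not FastaFile.py; it is a minimal stand-in that assumes each record starts with a ">" header line followed by one or more sequence lines. The function name read_fasta and the example records are made up.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)  # finish the last record


# Made-up example records.
example = io.StringIO(">seq1 first example\nACGT\nACGT\n\n>seq2\nGGCC\n")
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```

Writing the parser as a generator means large files can be processed one record at a time instead of being loaded into memory all at once.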

Day 10: Singular Value Decomposition

Slides
Mark's slides for day 10
Transcript
Mark's SVD example for day 10
PNAS 97:10101
An early application of SVD to microarray data, including the sporulation data that was also used in the Eisen clustering paper.
dp2.py
Example dynamic-programming implementation for Needleman-Wunsch with zero gap opening penalties (nw), an extension to cover non-zero gap opening penalties (nwg) and local alignments (sw). Note that nwg and sw use distinct fill algorithms (nwg_fill and sw_fill) but a common traceback function.
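The simplest case described above, global alignment with no gap-opening penalty (so every gapped position costs the same), can be sketched as a fill step followed by a traceback. This is an independent illustration, not the contents of dp2.py; the function names fill_matrix, trace_back, and align, and the scoring values, are assumptions made for this example.

```python
def fill_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the global alignment score matrix (per-residue gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score


def trace_back(a, b, score, match=1, mismatch=-1, gap=-1):
    """Recover one optimal alignment by walking back from the corner."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])    # residue of a against a gap
            bottom.append("-")
            i -= 1
        else:
            top.append("-")         # residue of b against a gap
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def align(a, b):
    """Return (score, aligned a, aligned b) with the default scoring."""
    score = fill_matrix(a, b)
    top, bottom = trace_back(a, b, score)
    return score[-1][-1], top, bottom


print(align("ACGT", "AGT"))
```

Separating the fill from the traceback is what makes it possible to swap in a different fill (for example, one with gap-opening penalties or a local-alignment floor at zero) while reusing the same walk back through the matrix.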