This page contains slides and supporting material for the 2013, module 3
version of BMS270: Practical Bioinformatics with Programming.
Please see the site updates page for
information on tracking changes.
General Resources
Python
Homepage for the Python programming language (download,
documentation, etc.)
Enthought Canopy
"Scientific" python distribution -- free for academic use.
This is an easy way to
install libraries like numpy, scipy, and matplotlib on OS X and
Windows (on Linux, it is easiest to install these libraries directly
from your distribution's repository).
Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython
The interactive shell/notebook that we are using for this course.
See also the gallery of
example notebooks
Matplotlib
MATLAB-like plotting for python.
See also the gallery of
example plots
Numpy
Numerical library that serves as the foundation for matplotlib,
scipy, etc.
Python Library Documentation
Detailed documentation for all of the standard python modules.
See also http://docs.python.org/
for the full list of on-line documentation.
Dive Into Python
Programmer-oriented introduction to Python. Faster paced than
Learning Python, but the entire book can be downloaded as a free pdf
file.
(Note that as of October 2011, this book is no longer hosted at its
original website)
Dive Into Python 3
Introduction to Python 3, from Mark Pilgrim, the author of the
original Dive Into Python. Since the "scientific python stack" is
still somewhat dependent on Python 2, most of this book won't be
relevant to this course. Chapter 12, however, has a very good discussion
of using ElementTree and lxml for parsing XML documents (e.g.,
MINiML files from GEO, SVG vector graphics, NCBI's XML format for BLAST
output, ...)
(Note that as of October 2011, this book is no longer hosted at its
original website)
Numerical Recipes
Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
MacPorts
A repository of useful open source programs for OS X, based on
the FreeBSD Ports system.
Cygwin
UNIX-like environment for Windows with package manager for popular
UNIX/Linux programs.
Ubuntu
A user-friendly Linux distribution based on
Debian. This is
the distribution that I use on my laptop. The installation CD can be
booted as a "Live CD", allowing you to try Linux with no change to your
computer.
Knoppix
Knoppix is a "Live CD" version of Debian.
Knoppix may be a bit less user friendly than Ubuntu,
but it may boot faster on some computers. More information about
Knoppix can be found on this
unofficial site.
git
Distributed version control system. Very fast, but optimized for Linux.
mercurial
Distributed version control system. Very similar to git, but with better cross-platform support. Also, it's written in Python =)
Day 1: Python
Slides
Mark's slides for day 1
Transcript
Mark's Python session for day 1
Day 2: File Formats
Slides
Mark's slides for day 2
Transcript
Mark's Python session for day 2
stats.py
Commented version of yesterday's homework problems
PNAS 95:14863 (Eisen et al.)
Paper introducing cluster analysis for microarrays.
supp2data.xls
Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles) original Excel format
supp2data.cdt
Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited
text format (CDT)
Day 3: Distance Metrics
Slides
Mark's slides for day 3
Transcript
Mark's Python session for day 3
Cluster3
Michael de Hoon's port of Michael Eisen's CLUSTER program.
cluster.c
Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code and license details).
JavaTreeView
Alok Saldanha's port of Michael Eisen's TREEVIEW program.
MeV
Cluster/TreeView alternative from JCVI (TIGR), similar to Acuity.
Malcolm Gladwell TED talk
An interesting perspective on clustering
Bell Labs Trellis interview
Another useful take on multivariate data analysis and visualization
Day 4: Hierarchical Clustering
Slides
Mark's slides for day 4
Transcript
Mark's Python session for day 4
SGD
Stanford Saccharomyces genome database
Mega Yeast
Larger version of the yeast expression set from Audrey Gasch
Day 5: Hierarchical Clustering
Slides
Mark's slides for day 5
Transcript
Mark's Python session for day 5
supp2data.cdt
supp2data.csv, reformatted for Cluster3/JavaTreeView
rowshuffle.cdt
supp2data.cdt with rows shuffled
rowuncouple.cdt
rowshuffle.cdt with independent within-row shuffling
supp2data.um.cdt
Clustering result for supp2data.cdt -- heatmap
supp2data.um.gtr
Clustering result for supp2data.cdt -- dendrogram
rowshuffle.um.cdt
Clustering result for rowshuffle.cdt -- heatmap
rowshuffle.um.gtr
Clustering result for rowshuffle.cdt -- dendrogram
rowuncouple.um.cdt
Clustering result for rowuncouple.cdt -- heatmap
rowuncouple.um.gtr
Clustering result for rowuncouple.cdt -- dendrogram
supp2data_dists.cdt
Uncentered Pearson distance matrix for supp2data.cdt
rowshuffle_dists.cdt
Uncentered Pearson distance matrix for rowshuffle.cdt
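The shuffled data sets and distance matrices listed above can be approximated with a few lines of Python. The following is a minimal sketch, not the code used to generate the files on this page: the function names are hypothetical, rows are assumed to be plain lists of floats with no missing values, and no row is assumed to be all zeros. The uncentered Pearson distance is taken here to be 1 - (x . y) / (|x| |y|), i.e. a correlation computed without subtracting the row means.

```python
import math
import random


def shuffle_rows(rows, rng):
    """Return the rows in random order; the values in each row stay together."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


def uncouple_rows(rows, rng):
    """Shuffle the values within each row, independently for every row."""
    return [rng.sample(row, len(row)) for row in rows]


def uncentered_pearson_distance(x, y):
    """Return 1 - (x . y) / (|x| |y|); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def distance_matrix(rows):
    """All-against-all uncentered Pearson distances between rows."""
    return [[uncentered_pearson_distance(r, s) for s in rows] for r in rows]


# Made-up example data (not taken from supp2data.cdt).
rng = random.Random(0)
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
row_shuffled = shuffle_rows(profiles, rng)
row_uncoupled = uncouple_rows(profiles, rng)
dists = distance_matrix(profiles)
```

Row shuffling changes only the order of the genes, so the set of pairwise distances is unchanged. Within-row shuffling breaks the correspondence between columns, which is what removes the correlations between rows.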
Day 6: Sequence analysis
Slides
Mark's slides for day 6
Transcript
Mark's Python session for day 6
geneticCode.py
The (standard) genetic code as a Python dictionary
sequences1.zip
Example protein, DNA, and RNA sequences in FASTA format.
blosum62.py
The BLOSUM62 scoring matrix as a Python dictionary of dictionaries.
pydotter_canopy.py
Python clone of Erik Sonnhammer's
DOTTER program
for windowed dotplots.
(The first line of this file has been updated to point at the system copy of python in order to avoid Canopy's Tk incompatibility)
The BLAST book
Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols for common BLAST tasks in chapter 9. UCSF's access to this book is currently limited to one user at a time.
Biological Sequence Analysis
In addition to excellent coverage of hidden Markov models
(and the related dynamic programming algorithms for generating
and searching with them) this book gives good general coverage
of sequence alignment and statistics. (The UCSF library has copies at
both Parnassus and Mission Bay)
Introduction to Protein Structure, 2nd Edition (Branden and Tooze)
Great book for orienting yourself on natural protein sequences.
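geneticCode.py and blosum62.py above are described as a dictionary and a dictionary of dictionaries. The sketch below shows how such structures might be used; it does not reproduce either file. The names GENETIC_CODE, translate, TOY_SCORES, and ungapped_score are hypothetical, the codon keys are assumed to be upper-case DNA triplets, and the scoring values are made-up toy numbers rather than BLOSUM62 entries.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 0; a trailing partial codon is ignored."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(GENETIC_CODE[codon] for codon in codons)


# Toy two-letter scoring matrix (NOT real BLOSUM62 values).
TOY_SCORES = {
    "A": {"A": 2, "C": -1},
    "C": {"A": -1, "C": 3},
}


def ungapped_score(seq1, seq2, matrix):
    """Sum matrix[a][b] over aligned positions of two equal-length sequences."""
    return sum(matrix[a][b] for a, b in zip(seq1, seq2))


protein = translate("ATGGCCTAA")
score = ungapped_score("ACCA", "AACA", TOY_SCORES)
```

A nested dictionary makes the lookup read as matrix[a][b], which mirrors the usual s(a, b) notation for substitution scores.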
Day 7: Alignment
Slides
Mark's slides for day 7
Transcript
Mark's Python session for day 7
Sequence.py
An example "real world" sequence implementation. This is the module
that I use for most of my day-to-day sequence manipulation.
Needleman-Wunsch
Primary reference for global alignment by dynamic programming.
Smith-Waterman
Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).
Gotoh
Gotoh's speed-up of Smith-Waterman from O(m²n) to O(mn) time.
Myers and Miller
Myers and Miller's update to Gotoh's algorithm, improving the space requirement
from O(mn) to O(n).
Design Patterns: Elements of Reusable Object-Oriented Software
This is the design patterns textbook that I mentioned in class
(the UCSF library doesn't have it, but it's available by interlibrary
loan from most other UC libraries)
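The time and space bounds quoted for the Gotoh and Myers and Miller papers above can be made concrete with a small example. The sketch below computes a global alignment score under affine gap penalties using three-state recurrences of the kind usually attributed to Gotoh, keeping only the previous row of each table so that memory is O(n). It is an illustration under stated assumptions, not a transcription of either paper: the function name, the default scores, and the convention that a gap of length k costs gap_open + (k - 1) * gap_extend are all choices made here. It returns only the score; recovering the alignment itself in linear space is the harder problem that Myers and Miller address.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Best global alignment score in O(len(a) * len(b)) time, O(len(b)) space.

    m: best score ending with a[i-1] aligned to b[j-1]
    x: best score ending with a[i-1] aligned to a gap
    y: best score ending with b[j-1] aligned to a gap
    """
    n = len(b)
    # Row 0: an empty prefix of a against b[:j] is a single gap of length j.
    m_prev = [0] + [NEG_INF] * n
    x_prev = [NEG_INF] * (n + 1)
    y_prev = [NEG_INF] + [-(gap_open + (j - 1) * gap_extend)
                          for j in range(1, n + 1)]
    for i in range(1, len(a) + 1):
        m_cur = [NEG_INF] * (n + 1)
        x_cur = [NEG_INF] * (n + 1)
        y_cur = [NEG_INF] * (n + 1)
        # Column 0: a[:i] against an empty prefix of b is one gap of length i.
        x_cur[0] = -(gap_open + (i - 1) * gap_extend)
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            m_cur[j] = s + max(m_prev[j - 1], x_prev[j - 1], y_prev[j - 1])
            x_cur[j] = max(m_prev[j] - gap_open,
                           y_prev[j] - gap_open,
                           x_prev[j] - gap_extend)
            y_cur[j] = max(m_cur[j - 1] - gap_open,
                           x_cur[j - 1] - gap_open,
                           y_cur[j - 1] - gap_extend)
        m_prev, x_prev, y_prev = m_cur, x_cur, y_cur
    return max(m_prev[n], x_prev[n], y_prev[n])


affine = affine_global_score("ACGT", "AT")
linear = affine_global_score("ACGT", "AT", gap_open=2, gap_extend=2)
```

Setting gap_open equal to gap_extend reduces the scheme to a per-position (linear) gap penalty, which is a convenient sanity check.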
Day 8: Heuristic Alignment and Search
Slides
Mark's slides for day 8
Transcript
Mark's Python session for day 8
blosum45.txt
BLOSUM45 scoring matrix
blosum62.txt
BLOSUM62 scoring matrix
blosum80.txt
BLOSUM80 scoring matrix
BLOSUM clustering example
UCSC Genome Browser
Human genome
NCBI GQuery
Keyword search across Pubmed/GenBank/GEO/etc.
NCBI BLAST
NCBI's BLAST portal
JMB 215:403
The primary reference for BLAST, giving a fast heuristic method for approximating the local alignment methods and a statistical framework for interpreting the results.
HMMer
Sean Eddy's profile-HMM implementation
HMMer web interface
New web interface to PHMMer, HMMscan, and HMMsearch
InterProScan
Search interface for InterPro, a meta-database of 15 motif databases
(mostly HMM based)
PLoS Comp. Biol. 4:e1000069
Statistical argument for HMMer 3 speed-ups
PLoS Comput Biol. 7:e1002195
Algorithmic details for HMMer 3 speed-ups
RNA 18:193
Recent general-purpose SCFG framework for statistical models of
RNA secondary structure.
Regular Expressions
Documentation for Python's re module
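As a small illustration of the re module in a sequence-search setting, the sketch below scans a protein sequence for the N-glycosylation motif N-{P}-[ST]-{P}, written as a regular expression. The sequence is a made-up example and find_motif is a hypothetical helper name, not part of any course file.

```python
import re

# N, then any residue except P, then S or T, then any residue except P.
MOTIF = re.compile(r"N[^P][ST][^P]")


def find_motif(sequence, pattern=MOTIF):
    """Return (0-based start, matched text) for each non-overlapping hit."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Hypothetical sequence chosen to contain two hits and one near miss (NPS).
hits = find_motif("MKNGSAANPSLNQTV")
```

Note that finditer reports non-overlapping matches only, so overlapping occurrences of a motif would need a lookahead pattern or a manual scan.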
Day 9: Multiple Alignment
CLUSTALX
Multiple sequence alignment program based on neighbor-joining trees.
JALVIEW
Program for viewing and annotating multiple alignments. (The
"Install Anywhere" version is slightly easier to install compared
to the "Java Web Start" version).
MESQUITE
Multiple alignment workbench with good tree viewer
Slides
Mark's slides for day 9
Calmodulin_example.zip
Calmodulin-related sequences
GATA.zip
GATA transcription factors
GFF file format
Specification for the commonly used version 2 of GFF (e.g.,
this is what JALVIEW uses for sequence annotations).
GFFv3 file format
Recent update of GFF by Lincoln Stein, with stricter definitions
and better support for relationships among features, as used by newer
programs such as Gbrowse2.
FastaFile.py
Simple FASTA-file parser
ClustalTools.py
Utilities for some multiple alignment and phylogenetic tree formats
aln2cdt.py
Conversion script to map CLUSTAL aln/phb files to JavaTreeView cdt/gtr format. Depends on FastaFile.py and ClustalTools.py
Lucien.py
A larger alignment pipeline example (depends on databases and
code not included on this site).
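For readers who have not seen a FASTA parser before, the following is a minimal sketch of the general idea. It is not the contents of FastaFile.py: the function name read_fasta is hypothetical, and the input is assumed to be an open text stream in which each record starts with a ">" header line followed by one or more sequence lines.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA-format text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record example held in memory instead of a file on disk.
example = io.StringIO(">seq1 hypothetical\nACGT\nTTGA\n\n>seq2\nMKV\n")
records = list(read_fasta(example))
```

Writing the parser as a generator means large files can be processed one record at a time without loading everything into memory.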
Day 10: Singular Value Decomposition
Slides
Mark's slides for day 10
Transcript
Mark's SVD example for day 10
PNAS 97:10101
An early application of SVD to microarray data, including the sporulation data that was also used in the Eisen clustering paper.
dp2.py
Example dynamic-programming implementation for Needleman-Wunsch
with zero gap opening penalties (nw), an extension to cover
non-zero gap opening penalties (nwg) and local alignments
(sw). Note that nwg and sw use distinct fill algorithms
(nwg_fill and sw_fill) but a common traceback function.
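The split between a fill step and a traceback step can be sketched as follows for the simplest case, Needleman-Wunsch with a per-position gap penalty and no separate gap-opening cost. This is an illustrative sketch and not the code in dp2.py: the names nw_fill and nw_traceback, the default scores, and the simple match/mismatch scoring are assumptions made for this example.

```python
def nw_fill(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the global alignment score table for sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    return score


def nw_traceback(a, b, score, match=1, mismatch=-1, gap=-1):
    """Recover one optimal alignment by walking back from the last cell."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


table = nw_fill("GATTACA", "GATACA")
aligned_a, aligned_b = nw_traceback("GATTACA", "GATACA", table)
best_score = table[-1][-1]
```

The traceback must be called with the same scoring parameters that were used for the fill, since it identifies each step by re-deriving which neighbor produced the cell's score.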