This page contains slides and supporting material for the 2017
version of BMS270: Practical Bioinformatics with Programming.
Please see the site updates page for
information on tracking changes.
General Resources
Python
Homepage for the Python programming language (download,
documentation, etc.)
Safari Bookshelf
The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include: Learning Python, Programming Python, and BLAST. (Bioinformatics Programming Using Python is not recommended). Note that you may need to go through the Safari Bookshelf link above for full access to some of the books linked below.
Learning Python
Excellent introduction to Python. The 4th edition covers Python 3. (Our class is based on Python 2 and will switch to Python 3 next year, when Canopy updates; Python 2.7 is supported through 2020.)
(Older edition of Learning Python)
Enthought Canopy
"Scientific" python distribution -- free for academic use.
This is an easy way to
install libraries like numpy, scipy, and matplotlib on OS X and
Windows (on Linux, it is easiest to install these libraries directly
from your distribution's repository).
Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython/Jupyter
The interactive shell/notebook that we are using for this course.
See also the gallery of
example notebooks
Matplotlib
MATLAB-like plotting for python.
See also the gallery of
example plots
Numpy
Numerical library that serves as the foundation for matplotlib,
scipy, etc.
Python Library Documentation
Detailed documentation for all of the standard python modules.
See also http://docs.python.org/
for the full list of on-line documentation.
Dive Into Python
Programmer-oriented introduction to Python. Faster paced than
Learning Python, but the entire book can be downloaded as a free pdf
file.
(Note that as of October 2011, this book is no longer hosted at its
original website)
Dive Into Python 3
Introduction to Python 3, from Mark Pilgrim, the author of the
original Dive Into Python. Since the "scientific python stack" is
still somewhat dependent on Python 2, most of this book won't be
relevant to this course. Chapter 12, however, has a very good discussion
of using ElementTree and lxml for parsing XML documents (e.g.,
MINiML files from GEO, SVG vector graphics, NCBI's XML format for BLAST
output, ...)
(Note that as of October 2011, this book is no longer hosted at its
original website)
Numerical Recipes
Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
MacPorts
A repository of useful open source programs for OS X, based on
the FreeBSD Ports system.
Cygwin
UNIX-like environment for Windows with package manager for popular
UNIX/Linux programs.
Ubuntu
A user-friendly Linux distribution based on
Debian. This is
the distribution that I use on my laptop. The installation CD can be
booted as a "Live CD", allowing you to try Linux with no change to your
computer.
Knoppix
Knoppix is a "Live CD" version of Debian.
Knoppix may be a bit less user friendly than Ubuntu,
but it may boot faster on some computers. More information about
Knoppix can be found on this
unofficial site.
git
Distributed version control system. Very fast, but optimized for Linux.
mercurial
Distributed version control system. Very similar to git, but better cross-platform support. Also, it's written in Python =)
Day 1: Python
Slides
Mark's slides for day 1
Day1.ipynb
Mark's annotated Jupyter notebook for day 1
Day1.html
HTML export of the notebook
Python Primer
A first draft "intro to python" comic
Comm ACM 31:1192
Good early review of random number generators
Day 2: File Formats
Slides
Mark's slides for day 2
Day2.ipynb, Day2b.ipynb, Day2c.ipynb
Mark's annotated IPython notebooks for day 2
Day2.html, Day2b.html, Day2c.html
HTML exports of the notebooks
stats.py
Example statistical functions
example1
Example data file #1
example2
Example data file #2
GSE86922_Brodsky_GEO_processed
RnaSeq expression profile from GSE86922
PNAS 95:14863 (Eisen et al.)
Paper introducing cluster analysis for microarrays.
supp2data.xls
Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles), in the original Excel format
supp2data.cdt
Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles), converted to Cluster3/JavaTreeView-compatible tab-delimited
text format (CDT)
JavaTreeView
Alok Saldanha's port of Michael Eisen's TREEVIEW program.
CDT file format
Documentation for the extended CDT file format from the JavaTreeView manual
Jupyter markdown docs
Documentation on fancy formatting in markdown cells
Day 3: Distance Metrics
Slides
Mark's slides for day 3
Day3.ipynb
Mark's annotated IPython notebook for day 3
Day3.html
HTML export of the notebook
Cluster3
Michael de Hoon's port of Michael Eisen's CLUSTER program.
cluster.c
Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code)
Cluster3 distance metrics
Documentation on the distance metrics available in
Cluster3 -- useful for the last (optional) homework problem in today's slides.
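As a rough illustration of two correlation-based metrics, the sketch below computes a centered (Pearson) and an uncentered correlation distance, each taken as 1 minus the correlation. The function names are hypothetical and are not part of Cluster3 or the course code; check the Cluster3 documentation for the exact definitions it uses.

```python
import math

def pearson_distance(x, y):
    """Return 1 - r, where r is the (centered) Pearson correlation."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def uncentered_distance(x, y):
    """Return 1 - r_u, where r_u is the correlation computed
    without subtracting the means (the cosine of the angle
    between the two vectors)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)
```

The two metrics disagree when profiles have the same shape but different offsets: the centered distance ignores the offset, while the uncentered distance does not.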
Day 4: Hierarchical Clustering
Slides
Mark's slides for day 4
Day3_dmat.html (ipynb)
Correlation matrix homework problem with timings
Day3_scaling.html (ipynb)
Projecting run times for large correlation matrices
Day3_dmat2.html (ipynb)
Today's approach to the correlation matrix problem
clustered1_cm_centered cdt gtr
Data matrix (CDT) and tree (GTR) for the final output of Day3_dmat2.ipynb.
Malcolm Gladwell TED talk
An interesting perspective on clustering
Bell Labs Trellis interview
Another useful take on multivariate data analysis and visualization
Day 5: PCA + practical hierarchical clustering
Slides
Mark's slides for day 5
Day5.ipynb
Mark's annotated IPython notebook for day 5
Day5.html
HTML export of the notebook
Hierarchical Clustering
GSE86922_Kallisto ipynb html
Protocol for estimating transcript levels in the GSE86922 data set with kallisto. The merge step depends on MsvUtil.py and CdtFile.py below.
sample_table.csv
Sample table used by GSE86922_Kallisto.ipynb
est_counts.cdt
Output of GSE86922_Kallisto.ipynb
Mus_musculus.GRCm38.79.gtf.gz
Release 79 of the ENSEMBL transcriptome annotation of the GRCm38 mouse genome. (This is the version currently mirrored on the Kallisto website)
SafeMath.py
Example module for math on data with missing values
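One common way to handle missing values is to represent them as None and have each operation skip or propagate them. The sketch below assumes that convention; the function names are hypothetical and are not taken from SafeMath.py.

```python
def safe_mean(values):
    """Mean of the non-missing entries, or None if every entry is missing."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present) / float(len(present))

def safe_subtract(a, b):
    """Difference a - b, or None if either operand is missing."""
    if a is None or b is None:
        return None
    return a - b

def center_row(row):
    """Subtract the row mean from every non-missing entry."""
    mean = safe_mean(row)
    return [safe_subtract(v, mean) for v in row]
```

For example, center_row([1.0, None, 3.0]) leaves the missing entry as None and centers the other two values around their mean of 2.0.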
MsvUtil.py
Example utility module. Includes the Table class for tabular text
CdtFile.py
Example module for parsing and manipulating CDT and GTR files. Depends on SafeMath.py and MsvUtil.py
Principal Components Analysis (PCA)
Numerical Recipes
Is very useful for thinking about PCA and linear modeling. Relevant sections (relative to Numerical Recipes in C) are: section 2.6 (singular value decomposition), chapter 11 sections 0 to 3 (symmetric eigensystems, plus some of the methods commonly used for implementing SVD), and chapter 15 sections 0 to 4 (least squares regression, for which SVD provides a stable implementation). I.T. Jolliffe's book on PCA is a well-written and thorough treatment from a practicing statistician's point of view.
PCA.py
Example PCA implementation. Depends on CdtFile.py
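To get a feel for what PCA computes, the sketch below estimates only the first principal axis of a small data set by power iteration on the covariance matrix. This is a swapped-in teaching technique in pure Python, not the method used in PCA.py; for real work an SVD-based routine is the more stable choice. The function name and parameters are hypothetical.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Estimate the leading principal axis and its variance.

    rows is a list of equal-length observations. Returns (axis, variance),
    where axis is a unit vector and variance is the matching eigenvalue
    of the sample covariance matrix.
    """
    n = float(len(rows))
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d), normalized by n - 1.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1.0)
            for j in range(d)] for i in range(d)]
    # Asymmetric start vector; power iteration fails if the start is
    # exactly orthogonal to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data has no variance along the search direction")
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for unit v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    return v, variance
```

For points lying exactly on the line y = x, the returned axis has equal components and the variance equals the total variance of the data.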
What is Principal Component Analysis?
Nice essay on PCA and factor analysis by "biological" mathematician Lior Pachter
WIREs Comp. Stat. 2:433
Very good PCA tutorial (associated MATLAB code is here). Working through the examples in this paper in python is a great way to get a feel for the logistics of PCA.
ICA vs. PCA
Tutorial on performing PCA and ICA (independent component analysis) using scikit-learn (a python-based package for machine learning, which also
includes hierarchical clustering, among many other methods)
Day 6: Sequences
Slides
Mark's slides for day 6
Day6.ipynb
Mark's annotated IPython notebook for day 6
Day6.html
HTML export of the notebook
geneticCode.py
The (standard) genetic code as a Python dictionary
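A codon-to-amino-acid dictionary can be built compactly from the standard table written in TCAG order. The sketch below is one way to do it and is not the contents of geneticCode.py; the names GENETIC_CODE and translate are hypothetical, and unknown codons are rendered as X by assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated
# in TCAG order (first base varies slowest). '*' marks stop codons.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a DNA string in frame 0; incomplete trailing codons
    are dropped and unrecognized codons become 'X'."""
    dna = dna.upper()
    return "".join(GENETIC_CODE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))
```

For example, translate("ATGGCCTAA") gives "MA*".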
sequences1.zip
Example protein, DNA, and RNA sequences in FASTA format.
blosum62.py
The BLOSUM62 scoring matrix as a Python dictionary of dictionaries.
The BLAST book
Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols in chapter 9. The above link is for the Safari Bookshelf copy (preferred). The UCSF library also has a limited-user e-book version (acquired during a brief lapse in the Safari Bookshelf subscription).
Biological Sequence Analysis
In addition to excellent coverage of hidden Markov models
(and the related dynamic programming algorithms for generating
and searching with them) this book gives good general coverage
of sequence alignment and statistics. (The UCSF library has copies at
both Parnassus and Mission Bay). See also Sean Eddy's notes from his
new Python-based Biological Data Analysis course at Harvard.
Introduction to Protein Structure, 2nd Edition (Branden and Tooze)
Great book for orienting yourself on natural protein sequences.
Day 7: Sequence Alignment
Slides
Mark's slides for day 7
Day7.ipynb
Mark's annotated IPython notebook for day 7
Day7.html
HTML export of the notebook
UCSF library Data Science Initiative
Note: Programming and Pizza on May 18th (5-6:30pm, 1407 Mission Hall) and all-day workshops May 20-21
Needleman-Wunsch
Primary reference for global alignment by dynamic programming.
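The dynamic programming recurrence for global alignment can be sketched as follows. This is the common textbook form with a linear gap penalty and a simple match/mismatch score, which differs in detail from the formulation in the original paper; the function name and default scores are assumptions for illustration.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[-1][-1]
```

With the default scores, nw_score("GATTACA", "GATACA") is 5: six matches and one gap.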
Smith-Waterman
Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).
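Local alignment uses the same kind of matrix fill as global alignment, with two changes: cell scores are floored at zero, and the answer is the best cell anywhere in the matrix. A minimal sketch, assuming a linear gap penalty and simple match/mismatch scores (the function name and defaults are hypothetical):

```python
def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal local alignment score of strings a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + s,    # extend diagonally
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best
```

Because of the zero floor, unrelated flanking sequence does not reduce the score of a shared core region.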
Gotoh
Gotoh's speed-up of Smith-Waterman from O(m^2 n) to O(mn) time.
Myers and Miller
Myers and Miller's update to Gotoh's algorithm, improving the space requirement
from O(mn) to O(n).
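The simplest piece of the space saving is that the optimal score only ever needs the previous row of the matrix. The sketch below computes a global alignment score in O(n) space under a linear gap penalty; it does not recover the alignment itself, which is the harder part that the divide-and-conquer approach in the paper addresses. The function name and default scores are assumptions.

```python
def nw_score_linear_space(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using two rows instead of a full matrix."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,   # diagonal
                            prev[j] + gap,     # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```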
pydotter
Python clone of Erik Sonnhammer's
DOTTER
program for windowed dotplots. Includes:
- Dotplot.py graphics-independent dotplot calculation
- pydotter.py Tk-based stand-alone program
- DotplotQt.py Canopy-compatible module for plotting from a Jupyter notebook (see DotplotQt_Canopy.ipynb for an example).
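The idea behind a windowed dotplot can be sketched in a few lines: slide a fixed-length window along both sequences and record a dot wherever enough positions agree. This sketch counts identities only, whereas real dotplot programs typically sum scores from a substitution matrix; the function name and parameters are hypothetical and are not the Dotplot.py interface.

```python
def windowed_dotplot(a, b, window=3, threshold=3):
    """Return (i, j) pairs where the windows starting at a[i] and b[j]
    share at least `threshold` identical positions."""
    hits = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(1 for k in range(window) if a[i + k] == b[j + k])
            if matches >= threshold:
                hits.append((i, j))
    return hits
```

Comparing a sequence with itself yields the main diagonal; repeats show up as additional off-diagonal runs.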
Note G
Ada Lovelace's code for the Bernoulli numbers -- often cited as the first published computer program (here is the full document)
Day 8: Heuristic Approaches
Slides
Mark's slides for day 8
Day8.ipynb
Mark's annotated IPython notebook for day 8
Day8.html
HTML export of the notebook
gapped.py
Example scoring function for alignments with homogeneous or affine gap penalties
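Scoring an existing alignment is a useful independent check on dynamic programming code. The sketch below scores two gapped strings of equal length under one common affine convention: the first position of a gap run costs gap_open and each additional position costs gap_extend, so setting the two equal gives a homogeneous penalty. The function name, defaults, and convention are assumptions and may differ from gapped.py.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score two aligned strings that use '-' for gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap = None  # which sequence had a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("column with gaps in both sequences")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap in the same sequence is an extension;
            # anything else opens a new gap.
            score += gap_extend if side == prev_gap else gap_open
            prev_gap = side
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score
```

For example, score_alignment("GATTACA", "GA--ACA") is 2 with the defaults: five matches, one gap open, and one gap extension.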
monte_aligner.py
Example implementation for solving the pairwise alignment problem by simulated annealing. Note that the current move set is likely to be biased. See section 10.9 of Numerical Recipes for a description of simulated annealing.
NeedlemanWunschExamples.html
Some test sequences for your Needleman-Wunsch code, with optimal scores, example optimal alignments, and dynamic programming matrices
blosum45.txt
BLOSUM45 scoring matrix
blosum62.txt
BLOSUM62 scoring matrix
blosum80.txt
BLOSUM80 scoring matrix
ScoringMatrices.html
IPython notebook for clustering BLOSUM scoring matrices
NCBI GQuery
Keyword search across Pubmed/GenBank/GEO/etc.
NCBI BLAST
NCBI's BLAST portal
JMB 215:403
The primary reference for BLAST, giving a fast heuristic method for approximating the local alignment methods and a statistical framework for interpreting the results.
HMMer
Sean Eddy's profile-HMM implementation
HMMer web interface
New web interface to PHMMer, HMMscan, and HMMsearch
PLoS Comp. Biol. 4:e1000069
Statistical argument for HMMer 3 speed-ups
PLoS Comput Biol. 7:e1002195
Algorithmic details for HMMer 3 speed-ups
RNA 18:193
Recent general-purpose SCFG framework for statistical models of
RNA secondary structure.
Day 9: PCA
Day9.ipynb
Mark's annotated IPython notebook for day 9
Day9.html
HTML export of the notebook
Day 10: Wrap Up
Slides
Mark's slides for day 10
dp2.py
Example dynamic-programming implementations:
Global alignment with zero gap opening penalties (nw)
Global alignment (nwg)
Local alignment (sw)
Global alignment with optional edge constraints (nwp)
Note that nwg, nwp, and sw use distinct fill algorithms
(nwg_fill, nwp_fill, and sw_fill) but a common traceback
function (sw_traceback). This module also includes three
utility functions: makeIdent, for generating simple scoring
matrices, gapped_score, for independent confirmation of
alignment scores, and nw_dump, for diagnostic dumps of the
dynamic programming data structures. sw_backtrack is an
alternative to sw_traceback that exhaustively enumerates all
optimal alignments.
FastaFile.py
Simple FASTA-file parser
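A FASTA parser only needs to track the current header and accumulate sequence lines until the next header appears. The sketch below works on any iterable of lines (such as an open file); the function name and return format are hypothetical and are not the FastaFile.py interface.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines."""
    records = []
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Sequence lines that appear before the first header are ignored in this sketch.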
Sequence.py
Fancier nucleotide/protein classes supporting translation, ORF-finding, etc.
ClustalTools.py
Utilities for some multiple alignment and phylogenetic tree formats
fasta2cdt.py
Conversion script to map sequence alignment (FASTA) tree (PHB) pairs to JavaTreeView cdt/gtr format. Depends on FastaFile.py and ClustalTools.py
Probcons
Good tool for multiple alignment of a small number of proteins
Clustal Omega
Good tool for multiple alignment of a moderate number of proteins
FastTree
Good tool for generating phylogenetic trees from deep multiple alignments.
JALVIEW
Multiple alignment viewer sharing many design decisions with JavaTreeView
MSB2.phmmer.fasta
Hits to H. capsulatum MSB2 from phmmer search vs. UniProt reference proteomes
MSB2.phmmer.clustalo.fasta
Multiple alignment generated by Clustal Omega
MSB2.phmmer.clustalo.phb
Tree generated by FastTree
MSB2.phmmer.clustalo.cdt
Alignment formatted for JavaTreeView
MSB2.phmmer.clustalo.gtr
Tree formatted for JavaTreeView