This page contains slides and supporting material for the 2018
version of BMS270: Practical Bioinformatics with Programming.
Please see the site updates page for
information on tracking changes.
General Resources
Python
Homepage for the Python programming language (download,
documentation, etc.)
Safari Bookshelf
The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include: Learning Python, Programming Python, and BLAST. (Bioinformatics Programming Using Python is not recommended.) Note that you may need to go through the Safari Bookshelf link above for full access to some of the books linked below.
Learning Python
Excellent introduction to Python.
Dive Into Python 3
Programmer-oriented introduction to Python. Faster paced than
Learning Python, but the entire book can be downloaded as a free pdf
file. Chapter 12 on parsing XML with lxml is particularly good. There is also a substantially different earlier edition based on Python 2.
(Note that as of October 2011, these books are no longer hosted at their
original website)
Enthought Canopy
"Scientific" python distribution -- free for academic use.
This is an easy way to
install libraries like numpy, scipy, and matplotlib on OS X and
Windows (on Linux, it is easiest to install these libraries directly
from your distribution's repository).
Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython/Jupyter
The interactive shell/notebook that we are using for this course.
See also the gallery of
example notebooks
Matplotlib
MATLAB-like plotting for python.
See also the gallery of
example plots
Numpy
Numerical library that serves as the foundation for matplotlib,
scipy, etc.
Python Library Documentation
Detailed documentation for all of the standard python modules.
See also http://docs.python.org/
for the full list of on-line documentation.
Numerical Recipes
Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
MacPorts
A repository of useful open source programs for OS X, based on
the FreeBSD Ports system.
Cygwin
UNIX-like environment for Windows with package manager for popular
UNIX/Linux programs.
Ubuntu
A user-friendly Linux distribution based on
Debian. This is
the distribution that I use on my laptop. The installation CD can be
booted as a "Live CD", allowing you to try Linux with no change to your
computer.
Knoppix
Knoppix is a "Live CD" version of Debian.
Knoppix may be a bit less user friendly than Ubuntu,
but it may boot faster on some computers. More information about
Knoppix can be found on this
unofficial site.
git
Distributed version control system. Very fast, but optimized for Linux.
mercurial
Distributed version control system. Very similar to git, but better cross-platform support. Also, it's written in Python =)
Day 1: Python
Slides
Mark's slides for day 1
Day1.ipynb
Mark's annotated Jupyter notebook for day 1
Day1.html
HTML export of the notebook
Python Primer
A first draft "intro to python" comic
Comm ACM 31:1192
Good early review of random number generators
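For a concrete feel for the kind of generator discussed in this review, here is a minimal sketch of the multiplicative congruential "minimal standard" generator (multiplier 16807, modulus 2^31 - 1). The function names are hypothetical, and this is for illustration only; use Python's random module for real work.

```python
from itertools import islice

MODULUS = 2**31 - 1   # a Mersenne prime
MULTIPLIER = 16807    # 7**5

def minimal_standard(seed):
    """Yield successive values of x -> MULTIPLIER * x mod MODULUS."""
    if not 0 < seed < MODULUS:
        raise ValueError("seed must be in 1 .. MODULUS - 1")
    x = seed
    while True:
        x = (MULTIPLIER * x) % MODULUS
        yield x

def nth_value(seed, n):
    """Return the n-th value (1-based) generated from the given seed."""
    return next(islice(minimal_standard(seed), n - 1, None))
```

A standard sanity check for this generator is that, starting from seed 1, the 10000th value is 1043618065.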
Day 2: File Formats
Slides
Mark's slides for day 2
Day2.ipynb
Mark's annotated Jupyter notebook for day 2
Day2.html
HTML export of the notebook
stats.py
Example statistical functions (day 1 homework solutions)
example1
Example data file #1
example2
Example data file #2
Day2_example2.ipynb
Jupyter notebook decoding example2 (as an example of parsing a binary file)
Day2_example2.html
HTML export of the example2 notebook
PNAS 95:14863 (Eisen et al.)
Paper introducing cluster analysis for microarrays.
supp2data.xls
Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles) original Excel format
supp2data.cdt
Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited
text format (CDT)
supp2data_samples.htm
Table documenting the samples in supp2data.cdt
JavaTreeView
Alok Saldanha's port of Michael Eisen's TREEVIEW program.
CDT file format
Documentation for the extended CDT file format from the JavaTreeView manual
JavaTreeView.336e7a13.zip
Alternate build of JavaTreeView, in case you're getting plug-in errors from the official build. To install on OS X:
1) Save the attached zip to your Desktop
2) In a terminal window:
mkdir JavaTreeView
cd JavaTreeView
unzip ../Desktop/JavaTreeView.336e7a13.zip
cd dist
java -jar TreeView.jar
Day 3: Distance Metrics
Slides
Mark's slides for day 3
Day3.ipynb
Mark's annotated IPython notebook for day 3
Day3.html
HTML export of the notebook
cdt_reader.py
Example CDT parser (day 2 homework solution)
Cluster3
Michael de Hoon's port of Michael Eisen's CLUSTER program.
cluster.c
Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code)
Cluster3 distance metrics
Documentation on the distance metrics available in
Cluster3 -- useful for the last (optional) homework problem in today's slides.
Day 4: Hierarchical Clustering
Slides
Mark's slides for day 4
Day4.ipynb
Mark's annotated IPython notebook for day 4
Day4.html
HTML export of the notebook
Malcolm Gladwell TED talk
An interesting perspective on clustering
Bell Labs Trellis interview
Another useful take on multivariate data analysis and visualization
Day 5: Negative Controls and Aggregating Data
Slides
Mark's slides for day 5
Day5.ipynb
Mark's annotated IPython notebook for day 5
Day5.html
HTML export of the notebook
geneticCode.py
The (standard) genetic code as a Python dictionary
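As a rough illustration of the idea (not the contents of geneticCode.py), the standard genetic code can be built as a dictionary from a 64-letter string in TCAG codon order. The names GENETIC_CODE and translate are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...);
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    return "".join(GENETIC_CODE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))
```

For example, translate("ATGGCCTAA") looks up ATG, GCC, and TAA in turn.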
correlation.cdt
Correlation matrix in CDT format. This is the matrix to which I applied clustered ordering at the bottom of the day 5 notebook.
Day 6: Sequences
Slides
Mark's slides for day 6
Day6.ipynb
Mark's annotated IPython notebook for day 6
Day6.html
HTML export of the notebook
pydotter_2017_05_04.zip
Python 2 clone of Eric Sonnhammer's
DOTTER program
for windowed dotplots. pydotter should run on OS X and Linux.
The original DOTTER program should run on Windows and Linux.
sequences1.zip
Example protein, DNA, and RNA sequences in FASTA format.
blosum62.py
The BLOSUM62 scoring matrix as a Python dictionary of dictionaries.
The BLAST book
Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols in chapter 9. The above link is for the Safari Bookshelf copy (preferred). The UCSF library also has a limited-user e-book version (acquired during a brief lapse in the Safari Bookshelf subscription).
Biological Sequence Analysis
In addition to excellent coverage of hidden Markov models
(and the related dynamic programming algorithms for generating
and searching with them) this book gives good general coverage
of sequence alignment and statistics. (The UCSF library has copies at
both Parnassus and Mission Bay). See also Sean Eddy's notes from his
new Python-based Biological Data Analysis course at Harvard.
Introduction to Protein Structure, 2nd Edition (Branden and Tooze)
Great book for orienting yourself on natural protein sequences.
Day 7: Sequence Alignment
Slides
Mark's slides for day 7
No notebook for today
All relevant code is in the slides and the two linked python files below
ungapped.py
Example implementation of ungapped alignment
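For orientation, here is a minimal sketch of one way to do ungapped alignment: slide one sequence along the other and score the overlap at each offset. This is not the code in ungapped.py. The function names and the +1/-1 match/mismatch scores are assumptions made for the example.

```python
def ungapped_score(a, b, offset, match=1, mismatch=-1):
    """Score the overlap when b[0] is placed under a[offset] (offset may be negative)."""
    score = 0
    for j, cb in enumerate(b):
        i = j + offset
        if 0 <= i < len(a):
            score += match if a[i] == cb else mismatch
    return score

def best_ungapped(a, b, match=1, mismatch=-1):
    """Return (score, offset) for the best-scoring ungapped placement of b on a."""
    # Every offset that leaves at least one overlapping position.
    offsets = range(-(len(b) - 1), len(a))
    return max((ungapped_score(a, b, k, match, mismatch), k) for k in offsets)
```

Note that this scores the entire overlap at each offset; finding the best-scoring segment within a diagonal would be a separate step.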
gapped.py
Example scoring function for alignments with homogeneous or affine gap penalties
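As a sketch of what such a scoring function might look like (not the code in gapped.py), the function below scores an existing alignment given as two equal-length gapped strings. The name score_alignment, the default penalties, and the convention that a gap of length k costs gap_open + k * gap_extend are all assumptions. Setting gap_open=0 gives the homogeneous case.

```python
def score_alignment(row_a, row_b, score, gap_open=-10, gap_extend=-1, gap="-"):
    """Score two aligned rows; score(x, y) gives the substitution score."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    prev_gap = None  # which row ("a" or "b") held a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == gap and y == gap:
            raise ValueError("column with gaps in both rows")
        if x == gap or y == gap:
            which = "a" if x == gap else "b"
            if which != prev_gap:
                total += gap_open  # charged once per run of gaps
            total += gap_extend    # charged for every gap column
            prev_gap = which
        else:
            total += score(x, y)
            prev_gap = None
    return total

def identity(x, y):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1
```

With the toy identity scores, score_alignment("GATTACA", "GA--ACA", identity) charges one gap opening and two extensions against five matches.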
UCSF library Data Science Initiative
Note: Programming and Pizza today, 4-6, in CL221-222
Needleman-Wunsch
Primary reference for global alignment by dynamic programming.
Smith-Waterman
Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).
Gotoh
Gotoh's speed-up of Smith-Waterman from O(m²n) to O(mn) time.
Myers and Miller
Myers and Miller's update to Gotoh's algorithm, improving the space requirement
from O(mn) to O(n).
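The space saving is easiest to see for the score alone: each row of the dynamic programming matrix depends only on the row before it, so two rows suffice. The sketch below shows just that piece, under simplifying assumptions (global alignment, a homogeneous per-position gap penalty, hypothetical function names). It is not the Myers and Miller algorithm, which additionally recovers the alignment itself in linear space by divide and conquer.

```python
def nw_score_two_rows(a, b, score, gap=-2):
    """Global alignment score of a and b using O(len(b)) memory."""
    # prev[j] holds the best score for aligning the current prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            curr.append(max(
                prev[j - 1] + score(ca, cb),  # align ca with cb
                prev[j] + gap,                # ca against a gap
                curr[j - 1] + gap,            # cb against a gap
            ))
        prev = curr
    return prev[-1]

def identity(x, y):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1
```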
Day 8: Principal Components Analysis (PCA)
Numerical Recipes
Is very useful for thinking about PCA and linear modeling. Relevant sections (relative to Numerical Recipes in C) are: section 2.6 (singular value decomposition), chapter 11 sections 0 to 3 (symmetric eigensystems, plus some of the methods commonly used for implementing SVD), and chapter 15 sections 0 to 4 (least squares regression, for which SVD provides a stable implementation). I.T. Jolliffe's book on PCA is a well-written and thorough treatment from a practicing statistician's point of view.
Slides
Mark's slides for day 8
Day8.ipynb
Mark's annotated IPython notebook for day 8
Day8.html
HTML export of the notebook
TPM CDT
Log2(TPM) estimates for the most abundant transcripts
observed in Sci Rep 7:42225, based on running the reads from
GSE88801 through kallisto and filtering for genes with TPM ≥ 10 in at least one sample.
What is Principal Component Analysis?
Nice essay on PCA and factor analysis by "biological" mathematician Lior Pachter
WIREs Comp. Stat. 2:433
Very good PCA tutorial (associated MATLAB code is here). Working through the examples in this paper in python is a great way to get a feel for the logistics of PCA.
ICA vs. PCA
Tutorial on performing PCA and ICA (independent component analysis) using scikit-learn (a python-based package for machine learning, which also
includes hierarchical clustering, among many other methods)
Cell Syst. 2:239
Cool application of PCA for "shallow-seq" analysis of bulk and single-cell experiments.
Day 9: Differential Expression
Slides
Mark's slides for day 9
S1.csv
CSV export of Sci Rep 7:42225 table S1, giving the DESeq2 results for the 24h/4h BMDM uninfected comparison.
GSE88801_limma2.html
Example limma analysis based on the pairwise sample comparisons from S1 and S2 of Sci Rep 7:42225. Yields fit contrasts for all genes (cdt) and clustered contrasts for significantly differential genes (cdt gtr).
GSE88801_limma3.html
Example limma analysis fitting independent factors for time, strain, and infection type. Yields fit contrasts for all genes (cdt) and clustered contrasts for genes significantly differential in the live/uninfected contrast (cdt gtr).
Genome Biol. 14:R95
Comparison of differential expression methods for RnaSeq, including limma and DESeq. From 2013, so doesn't include DESeq2. (erratum)
limma user's guide
Documentation for limma
NAR 43:e47
Primary reference for limma
Genome Biol. 15:R29
Primary reference for voom (limma's RnaSeq weighting method)
Elements of Statistical Learning
Great textbook on machine learning from a statistics point of view with very good coverage of regression methods.
Introduction to Statistical Learning
"Easier" version of the above, with examples in R.
QH506.M74
Statistical Modeling and Machine Learning for Molecular Biology -- A gentle introduction to statistics for expression profiling. Includes good and thorough explanations of Bayesian statistics.
stream_sampler.py
Example implementation for a reservoir sampler for FASTA, FASTQ, and SAM files. (Note that SAM files should be unsorted alignments of single-end reads. Equivalent BAM files can be sampled by piping through samtools view). For more about reservoir sampling, see this essay from Sean Eddy.
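The core idea of reservoir sampling fits in a few lines. The sketch below is not stream_sampler.py: it is a minimal version of the classic one-pass algorithm, paired with a simple FASTA record reader, and both function names are hypothetical. Each record ends up in the sample with equal probability, without knowing the number of records in advance.

```python
import random

def reservoir_sample(records, k, rng=None):
    """Return k records drawn uniformly at random from an iterable, in one pass."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, record in enumerate(records):
        if n < k:
            reservoir.append(record)      # fill the reservoir first
        else:
            j = rng.randrange(n + 1)      # uniform over the n + 1 records seen so far
            if j < k:
                reservoir[j] = record     # keep the new record with probability k / (n + 1)
    return reservoir

def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)
```

Since both functions work on iterables, an open file handle can be passed straight through, e.g. reservoir_sample(fasta_records(open(path)), 1000), where path is a placeholder.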
Day 10: Wrap Up
Slides
Mark's slides for day 10. See also last year's slides on
sequence search and multiple alignment
(or, even better, read the Biological Sequence Analysis book linked above)
dp2.py
Example dynamic-programming implementations:
Global alignment with zero gap opening penalties (nw)
Global alignment (nwg)
Local alignment (sw)
Global alignment with optional edge constraints (nwp)
Note that nwg, nwp, and sw use distinct fill algorithms
(nwg_fill, nwp_fill, and sw_fill) but a common traceback
function (sw_traceback). This module also includes three
utility functions: makeIdent, for generating simple scoring
matrices, gapped_score, for independent confirmation of
alignment scores, and nw_dump, for diagnostic dumps of the
dynamic programming data structures. sw_backtrack is an
alternative to sw_traceback that exhaustively enumerates all
optimal alignments.
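To show the fill-then-traceback structure described above in miniature, here is a self-contained sketch of global alignment with a zero gap opening penalty (a fixed cost per gap position). It is not the code in dp2.py. The names nw_sketch and identity, the default gap cost, and the tie-breaking order in the traceback are assumptions for illustration, and only one optimal alignment is returned.

```python
def nw_sketch(a, b, score, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # Fill: F[i][j] is the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
    # Traceback: walk from the bottom-right corner back to the origin,
    # preferring the diagonal move when several moves are optimal.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

def identity(x, y):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1
```

A local alignment version would differ mainly in clamping each cell at zero during the fill and starting the traceback from the highest-scoring cell.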
dp2_np.py
Re-implementation of global alignment with zero gap opening penalties (nw)
using numpy.array rather than python lists. Note that this implementation
sacrifices portability without gaining much in clarity or performance.
Is there an alternate implementation that takes advantage of numpy's
vector operations?
monte_aligner.py
Example implementation for solving the pairwise alignment problem by simulated annealing. Note that the current move set is likely to be biased. See 10.9 of Numerical Recipes for a description of this algorithm.