
This page contains slides and supporting material for the 2018 version of BMS270: Practical Bioinformatics with Programming. Please see the site updates page for information on tracking changes.

General Resources

Python: Homepage for the Python programming language (download, documentation, etc.)
Safari Bookshelf: The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include: Learning Python, Programming Python, and BLAST. (Bioinformatics Programming Using Python is not recommended). Note that you may need to go through the Safari Bookshelf link above for full access to some of the books linked below.
Learning Python: Excellent introduction to Python.
Dive Into Python 3: Programmer-oriented introduction to Python. Faster paced than Learning Python, but the entire book can be downloaded as a free pdf file. Chapter 12 on parsing XML with lxml is particularly good. There is also a substantially different earlier edition based on Python 2. (Note that as of October 2011, these books are no longer hosted at their original website.)
Enthought Canopy: "Scientific" python distribution -- free for academic use. This is an easy way to install libraries like numpy, scipy, and matplotlib on OS X and Windows (on Linux, it is easiest to install these libraries directly from your distribution's repository). Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython/Jupyter: The interactive shell/notebook that we are using for this course. See also the gallery of example notebooks.
Matplotlib: MATLAB-like plotting for python. See also the gallery of example plots.
Numpy: Numerical library that serves as the foundation for matplotlib, scipy, etc.
Python Library Documentation: Detailed documentation for all of the standard python modules. See also http://docs.python.org/ for the full list of on-line documentation.
Numerical Recipes: Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
MacPorts: A repository of useful open source programs for OS X, based on the FreeBSD Ports system.
Cygwin: UNIX-like environment for Windows with package manager for popular UNIX/Linux programs.
Ubuntu: A user-friendly Linux distribution based on Debian. This is the distribution that I use on my laptop. The installation CD can be booted as a "Live CD", allowing you to try Linux with no change to your computer.
Knoppix: Knoppix is a "Live CD" version of Debian. Knoppix may be a bit less user friendly than Ubuntu, but it may boot faster on some computers. More information about Knoppix can be found on this unofficial site.
git: Distributed version control system. Very fast, but optimized for Linux.
mercurial: Distributed version control system. Very similar to git, but better cross-platform support. Also, it's written in Python =)

Day 1: Python

Slides: Mark's slides for day 1
Day1.ipynb: Mark's annotated Jupyter notebook for day 1
Day1.html: HTML export of the notebook
Python Primer: A first draft "intro to python" comic
Comm ACM 31:1192: Good early review of random number generators

Day 2: File Formats

Slides

Mark's slides for day 2

Day2.ipynb

Mark's annotated Jupyter notebook for day 2

Day2.html

HTML export of the notebook

stats.py

Example statistical functions (day 1 homework solutions)

example1

Example data file #1

example2

Example data file #2

Day2_example2.ipynb

Jupyter notebook decoding example2 (as an example of parsing a binary file)

Day2_example2.html

HTML export of the example2 notebook

PNAS 95:14863 (Eisen et al.)

Paper introducing cluster analysis for microarrays.

supp2data.xls

Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) original Excel format

supp2data.cdt

Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited text format (CDT)

supp2data_samples.htm

Table documenting the samples in supp2data.cdt

JavaTreeView

Alok Saldanha's port of Michael Eisen's TREEVIEW program.

CDT file format

Documentation for the extended CDT file format from the JavaTreeView manual

JavaTreeView.336e7a13.zip

Alternate build of JavaTreeView, in case you're getting plug-in errors from the official build. To install on OS X:

1) Save the attached zip to your Desktop
2) In a terminal window:

mkdir JavaTreeView
cd JavaTreeView
unzip ../Desktop/JavaTreeView.336e7a13.zip
cd dist
java -jar TreeView.jar

Day 3: Distance Metrics

Slides: Mark's slides for day 3
Day3.ipynb: Mark's annotated IPython notebook for day 3
Day3.html: HTML export of the notebook
cdt_reader.py: Example CDT parser (day 2 homework solution)
Cluster3: Michael de Hoon's port of Michael Eisen's CLUSTER program.
cluster.c: Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code)
Cluster3 distance metrics: Documentation on the distance metrics available in Cluster3 -- useful for the last (optional) homework problem in today's slides.
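
As a rough illustration of the kind of metric documented above, the sketch below computes a Pearson correlation distance (1 - r) between two expression profiles in plain Python. The function name pearson_distance is hypothetical, and the sketch omits the missing-value masks and weights that Cluster3 supports; it is not taken from cluster.c.

```python
import math


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of profiles x and y.

    Illustrative only: no handling of missing values or per-column weights.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - sxy / math.sqrt(sxx * syy)
```

The result ranges from 0 for perfectly correlated profiles to 2 for perfectly anti-correlated ones.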

Day 4: Hierarchical Clustering

Slides: Mark's slides for day 4
Day4.ipynb: Mark's annotated IPython notebook for day 4
Day4.html: HTML export of the notebook
Malcolm Gladwell TED talk: An interesting perspective on clustering
Bell Labs Trellis interview: Another useful take on multivariate data analysis and visualization

Day 5: Negative Controls and Aggregating Data

Slides: Mark's slides for day 5
Day5.ipynb: Mark's annotated IPython notebook for day 5
Day5.html: HTML export of the notebook
geneticCode.py: The (standard) genetic code as a Python dictionary
correlation.cdt: Correlation matrix in CDT format. This is the matrix that I applied clustered ordering to at the bottom of the day 5 notebook.
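
For readers who want to see what "the genetic code as a Python dictionary" can look like, here is a minimal sketch that builds the standard codon table from the conventional TCAG ordering. The names GENETIC_CODE and translate are hypothetical, and the actual layout of geneticCode.py may differ.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each DNA codon (e.g. "ATG") to a one-letter amino acid; "*" marks a stop.
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0; unknown codons become 'X'.

    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(GENETIC_CODE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))
```

For example, translate("ATGGCCTAA") yields "MA*".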

Day 6: Sequences

Slides: Mark's slides for day 6
Day6.ipynb: Mark's annotated IPython notebook for day 6
Day6.html: HTML export of the notebook
pydotter_2017_05_04.zip: Python 2 clone of Eric Sonnhammer's DOTTER program for windowed dotplots. pydotter should run on OS X and Linux. The original DOTTER program should run on Windows and Linux.
sequences1.zip: Example protein, DNA, and RNA sequences in FASTA format.
blosum62.py: The BLOSUM62 scoring matrix as a Python dictionary of dictionaries.
The BLAST book: Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols in chapter 9. The above link is for the Safari Bookshelf copy (preferred). The UCSF library also has a limited-user e-book version (acquired during a brief lapse in the Safari Bookshelf subscription).
Biological Sequence Analysis: In addition to excellent coverage of hidden Markov models (and the related dynamic programming algorithms for generating and searching with them) this book gives good general coverage of sequence alignment and statistics. (The UCSF library has copies at both Parnassus and Mission Bay). See also Sean Eddy's notes from his new Python-based Biological Data Analysis course at Harvard.
Introduction to Protein Structure, 2nd Edition (Branden and Tooze): Great book for orienting yourself on natural protein sequences.

Day 7: Sequence Alignment

Slides: Mark's slides for day 7
No notebook for today: All relevant code is in the slides and the two linked python files below
ungapped.py: Example implementation of ungapped alignment
gapped.py: Example scoring function for alignments with homogeneous or affine gap penalties
UCSF library Data Science Initiative: Note Programming and Pizza today 4-6 in CL221-222
Needleman-Wunsch: Primary reference for global alignment by dynamic programming.
Smith-Waterman: Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).
Gotoh: Gotoh's speed-up of Smith-Waterman from O(m²n) to O(mn) time.
Myers and Miller: Myers and Miller's update to Gotoh's algorithm, improving the space requirement from O(mn) to O(n).
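
To make the distinction between homogeneous and affine gap penalties concrete, here is a minimal sketch of a scoring function for an already-aligned pair of sequences. It is not the code from gapped.py; the names score_alignment and identity_score are hypothetical, and the convention that penalties are positive numbers subtracted from the score is an assumption.

```python
def identity_score(a, b):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def score_alignment(row_a, row_b, score, gap_open, gap_extend, gap_char="-"):
    """Score two aligned rows of equal length.

    A run of k gap characters in one row costs gap_open + k * gap_extend.
    Setting gap_open to 0 gives a homogeneous (linear) gap penalty.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    previous_gap_row = None  # "a", "b", or None if the last column had no gap
    for a, b in zip(row_a, row_b):
        if a == gap_char and b == gap_char:
            raise ValueError("a column cannot be a gap in both rows")
        if a == gap_char or b == gap_char:
            gap_row = "a" if a == gap_char else "b"
            if gap_row != previous_gap_row:
                total -= gap_open  # a new gap starts in this column
            total -= gap_extend
            previous_gap_row = gap_row
        else:
            total += score(a, b)
            previous_gap_row = None
    return total
```

With gap_open=2 and gap_extend=1, aligning "AC--GT" against "ACTTGT" scores 4 matches minus one gap of length 2, for a total of 0; with gap_open=0 the same alignment scores 2.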

Day 8: Principal Components Analysis (PCA)

Numerical Recipes is very useful for thinking about PCA and linear modeling. Relevant sections (relative to Numerical Recipes in C) are: section 2.6 (singular value decomposition), chapter 11 sections 0 to 3 (symmetric eigensystems, plus some of the methods commonly used for implementing SVD), and chapter 15 sections 0 to 4 (least squares regression, for which SVD provides a stable implementation). I.T. Jolliffe's book on PCA is a well-written and thorough treatment from a practicing statistician's point of view.
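
The connection between SVD and PCA mentioned above can be sketched in a few lines of numpy. This is a minimal illustration, assuming rows are observations and columns are variables; the function name pca_svd is hypothetical and is not taken from the course notebooks.

```python
import numpy as np


def pca_svd(data):
    """PCA by SVD of the column-centered data matrix.

    Returns (scores, components, variances):
      scores     -- observations projected onto the principal axes
      components -- principal axes as rows (unit vectors)
      variances  -- variance along each axis, largest first
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s  # equivalent to centered @ vt.T
    variances = s ** 2 / (data.shape[0] - 1)
    return scores, vt, variances
```

The variances returned here are the eigenvalues of the sample covariance matrix, so the same result could be obtained from a symmetric eigensolver applied to that matrix.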

Slides: Mark's slides for day 8
Day8.ipynb: Mark's annotated IPython notebook for day 8
Day8.html: HTML export of the notebook
TPM CDT: Log₂(TPM) estimates for the most abundant transcripts observed in Sci Rep 7:42225, based on running the reads from GSE88801 through kallisto and filtering for genes with TPM ≥ 10 in at least one sample.
What is Principal Component Analysis?: Nice essay on PCA and factor analysis by "biological" mathematician Lior Pachter
WIREs Comp. Stat. 2:433: Very good PCA tutorial (associated MATLAB code is here). Working through the examples in this paper in python is a great way to get a feel for the logistics of PCA.
ICA vs. PCA: Tutorial on performing PCA and ICA (independent component analysis) using scikit-learn (a python-based package for machine learning, which also includes hierarchical clustering, among many other methods).
Cell Syst. 2:239: Cool application of PCA for "shallow-seq" analysis of bulk and single-cell experiments.

Day 9: Differential Expression

Slides: Mark's slides for day 9
S1.csv: CSV export of Sci Rep 7:42225 table S1, giving the DESeq2 results for the 24h/4h BMDM uninfected comparison.
GSE88801_limma2.html: Example limma analysis based on the pairwise sample comparisons from S1 and S2 of Sci Rep 7:42225. Yields fit contrasts for all genes (cdt) and clustered contrasts for significantly differential genes (cdt gtr).
GSE88801_limma3.html: Example limma analysis fitting independent factors for time, strain, and infection type. Yields fit contrasts for all genes (cdt) and clustered contrasts for genes significantly differential in the live/uninfected contrast (cdt gtr).
Genome Biol. 14:R95: Comparison of differential expression methods for RnaSeq, including limma and DESeq. From 2013, so doesn't include DESeq2. (erratum)
limma user's guide: Documentation for limma
NAR 43:e47: Primary reference for limma
Genome Biol. 15:R29: Primary reference for voom (limma's RnaSeq weighting method)
Elements of Statistical Learning: Great textbook on machine learning from a statistics point of view with very good coverage of regression methods.
Introduction to Statistical Learning: "Easier" version of the above, with examples in R.
QH506.M74: Statistical Modeling and Machine Learning for Molecular Biology -- A gentle introduction to statistics for expression profiling. Includes good and thorough explanations of Bayesian statistics.
stream_sampler.py: Example implementation for a reservoir sampler for FASTA, FASTQ, and SAM files. (Note that SAM files should be unsorted alignments of single-end reads. Equivalent BAM files can be sampled by piping through samtools view). For more about reservoir sampling, see this essay from Sean Eddy.
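
The core idea of reservoir sampling can be sketched independently of any file format. The version below is a minimal illustration of the classic single-pass scheme (often called Algorithm R) over an arbitrary iterable of records; it is not the code from stream_sampler.py, and the name reservoir_sample is hypothetical.

```python
import random


def reservoir_sample(records, k, rng=None):
    """Return up to k records chosen uniformly at random from an iterable.

    The input is read once and only k records are held in memory, so the
    total number of records does not need to be known in advance.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record  # replace with probability k / (i + 1)
    return reservoir
```

For sequence files, records would be whole entries (for example, one FASTA header plus its sequence lines) rather than individual lines.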

Day 10: Wrap Up

Slides

Mark's slides for day 10. See also last year's slides on sequence search and multiple alignment (or, even better, read the Biological Sequence Analysis book linked above).

dp2.py

Example dynamic-programming implementations:

Global alignment with zero gap opening penalties (nw)
Global alignment (nwg)
Local alignment (sw)
Global alignment with optional edge constraints (nwp)

Note that nwg, nwp, and sw use distinct fill algorithms (nwg_fill, nwp_fill, and sw_fill) but a common traceback function (sw_traceback). This module also includes three utility functions: makeIdent, for generating simple scoring matrices, gapped_score, for independent confirmation of alignment scores, and nw_dump, for diagnostic dumps of the dynamic programming data structures. sw_backtrack is an alternative to sw_traceback that exhaustively enumerates all optimal alignments.
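
The fill-then-traceback structure described above can be sketched for the simplest case, global alignment with a linear gap penalty (no gap opening cost). This is an independent illustration, not the code from dp2.py; the names nw_linear and identity_score are hypothetical, and integer scores are assumed so that the equality tests in the traceback are exact.

```python
def identity_score(a, b):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def nw_linear(seq_a, seq_b, score, gap):
    """Global alignment where every gap position costs `gap` (a positive int).

    Returns (best_score, aligned_a, aligned_b).
    """
    n, m = len(seq_a), len(seq_b)
    # Fill: table[i][j] is the best score for seq_a[:i] versus seq_b[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = -i * gap
    for j in range(1, m + 1):
        table[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                table[i - 1][j] - gap,  # gap in seq_b
                table[i][j - 1] - gap,  # gap in seq_a
            )
    # Traceback: walk from the bottom-right corner back to the origin,
    # preferring the diagonal move when several moves are optimal.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                table[i][j] == table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] - gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return table[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

This traceback reports a single optimal alignment; enumerating all of them requires following every tied move rather than the first one that matches.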

dp2_np.py

Re-implementation of global alignment with zero gap opening penalties (nw) using numpy.array rather than python lists. Note that this implementation sacrifices portability without gaining much in clarity or performance. Is there an alternate implementation that takes advantage of numpy's vector operations?

monte_aligner.py

Example implementation for solving the pairwise alignment problem by simulated annealing. Note that the current move set is likely to be biased. See 10.9 of Numerical Recipes for a description of this algorithm.
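
For orientation, the general shape of a simulated annealing loop can be sketched on a toy problem. This is a generic illustration using the Metropolis acceptance rule, not the code from monte_aligner.py; the names metropolis_accept and anneal, the geometric cooling schedule, and the default parameters are all assumptions. Note that the plain Metropolis rule assumes the proposal moves are symmetric, which is the kind of issue a biased move set raises.

```python
import math
import random


def metropolis_accept(delta, temperature, rng):
    """Always accept an improvement; accept a loss with probability exp(delta / T)."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)


def anneal(start, propose, score, t_start=2.0, cooling=0.995, steps=2000, rng=None):
    """Maximize score(state) by simulated annealing with geometric cooling.

    propose(state, rng) must return a new candidate state.
    Returns the best (state, score) pair seen during the run.
    """
    rng = rng or random.Random()
    current, current_score = start, score(start)
    best, best_score = current, current_score
    temperature = t_start
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        if metropolis_accept(candidate_score - current_score, temperature, rng):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling
    return best, best_score
```

As a toy usage, anneal(0, lambda x, rng: x + rng.choice((-1, 1)), lambda x: -(x - 3) ** 2) searches the integers for the maximum of a parabola centered at 3; for alignment, the state would be an alignment and the moves would shift or resize gaps.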