
This page contains slides and supporting material for the 2017 version of BMS270: Practical Bioinformatics with Programming. Please see the site updates page for information on tracking changes.

General Resources

Python: Homepage for the Python programming language (download, documentation, etc.)
Safari Bookshelf: The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include: Learning Python, Programming Python, and BLAST. (Bioinformatics Programming Using Python is not recommended). Note that you may need to go through the Safari Bookshelf link above for full access to some of the books linked below.
Learning Python: Excellent introduction to Python. The 4th edition covers Python 3 (our class is based on Python 2 and will switch to Python 3 next year, when Canopy updates; Python 2.7 is supported through 2020). (Older edition of Learning Python)
Enthought Canopy: "Scientific" python distribution -- free for academic use. This is an easy way to install libraries like numpy, scipy, and matplotlib on OS X and Windows (on Linux, it is easiest to install these libraries directly from your distribution's repository). Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython/Jupyter: The interactive shell/notebook that we are using for this course. See also the gallery of example notebooks
Matplotlib: MATLAB-like plotting for python. See also the gallery of example plots
Numpy: Numerical library that serves as the foundation for matplotlib, scipy, etc.
Python Library Documentation: Detailed documentation for all of the standard python modules. See also http://docs.python.org/ for the full list of on-line documentation.
Dive Into Python: Programmer-oriented introduction to Python. Faster paced than Learning Python, but the entire book can be downloaded as a free pdf file. (Note that as of October 2011, this book is no longer hosted at its original website)
Dive Into Python 3: Introduction to Python 3, from Mark Pilgrim, the author of the original Dive Into Python. Since the "scientific python stack" is still somewhat dependent on Python 2, most of this book won't be relevant to this course. Chapter 12, however, has a very good discussion of using ElementTree and lxml for parsing XML documents (e.g., MINiML files from GEO, SVG vector graphics, NCBI's XML format for BLAST output, ...) (Note that as of October 2011, this book is no longer hosted at its original website)
Numerical Recipes: Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
MacPorts: A repository of useful open source programs for OS X, based on the FreeBSD Ports system.
Cygwin: UNIX-like environment for Windows with package manager for popular UNIX/Linux programs.
Ubuntu: A user-friendly Linux distribution based on Debian. This is the distribution that I use on my laptop. The installation CD can be booted as a "Live CD", allowing you to try Linux with no change to your computer.
Knoppix: Knoppix is a "Live CD" version of Debian. Knoppix may be a bit less user friendly than Ubuntu, but it may boot faster on some computers. More information about Knoppix can be found on this unofficial site.
git: Distributed version control system. Very fast, but optimized for Linux.
mercurial: Distributed version control system. Very similar to git, but better cross-platform support. Also, it's written in Python =)

Day 1: Python

Slides: Mark's slides for day 1
Day1.ipynb: Mark's annotated Jupyter notebook for day 1
Day1.html: HTML export of the notebook
Python Primer: A first draft "intro to python" comic
Comm ACM 31:1192: Good early review of random number generators

Day 2: File Formats

Slides: Mark's slides for day 2
Day2.ipynb, Day2b.ipynb, Day2c.ipynb: Mark's annotated IPython notebooks for day 2
Day2.html, Day2b.html, Day2c.html: HTML exports of the notebooks
stats.py: Example statistical functions
example1: Example data file #1
example2: Example data file #2
GSE86922_Brodsky_GEO_processed: RnaSeq expression profile from GSE86922
PNAS 95:14863 (Eisen et al.): Paper introducing cluster analysis for microarrays.
supp2data.xls: Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) original Excel format
supp2data.cdt: Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited text format (CDT)
JavaTreeView: Alok Saldanha's port of Michael Eisen's TREEVIEW program.
CDT file format: Documentation for the extended CDT file format from the JavaTreeView manual
Jupyter markdown docs: Documentation on fancy formatting in markdown cells

Day 3: Distance Metrics

Slides: Mark's slides for day 3
Day3.ipynb: Mark's annotated IPython notebook for day 3
Day3.html: HTML export of the notebook
Cluster3: Michael de Hoon's port of Michael Eisen's CLUSTER program.
cluster.c: Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code)
Cluster3 distance metrics: Documentation on the distance metrics available in Cluster3 -- useful for the last (optional) homework problem in today's slides.
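
As a rough illustration of the kind of metric covered today, the sketch below computes a Pearson correlation distance (1 - r) between two expression profiles in plain Python. The function names are hypothetical; this is not the code from stats.py or cluster.c, and it assumes equal-length profiles with no missing values and nonzero variance.

```python
from math import sqrt

def pearson(x, y):
    """Centered Pearson correlation of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = float(len(x))
    mx = sum(x) / n
    my = sum(y) / n
    # Sums of products of deviations from the means
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pearson_distance(x, y):
    """Distance in [0, 2]: 0 for perfectly correlated profiles."""
    return 1.0 - pearson(x, y)
```

Perfectly correlated profiles give a distance of 0 and perfectly anti-correlated profiles give 2.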

Day 4: Hierarchical Clustering

Slides: Mark's slides for day 4
Day3_dmat.html (ipynb): Correlation matrix homework problem with timings
Day3_scaling.html (ipynb): Projecting run times for large correlation matrices
Day3_dmat2.html (ipynb): Today's approach to the correlation matrix problem
clustered1_cm_centered cdt gtr: Data matrix (CDT) and tree (GTR) for the final output of Day3_dmat2.ipynb.
Malcolm Gladwell TED talk: An interesting perspective on clustering
Bell Labs Trellis interview: Another useful take on multivariate data analysis and visualization

Day 5: PCA + practical hierarchical clustering

Slides: Mark's slides for day 5
Day5.ipynb: Mark's annotated IPython notebook for day 5
Day5.html: HTML export of the notebook

Hierarchical Clustering

GSE86922_Kallisto ipynb html: Protocol for estimating transcript levels in the GSE86922 data set with kallisto. The merge step depends on MsvUtil.py and CdtFile.py below.
sample_table.csv: Sample table used by GSE86922_Kallisto.ipynb
est_counts.cdt: Output of GSE86922_Kallisto.ipynb
Mus_musculus.GRCm38.79.gtf.gz: Release 79 of the ENSEMBL transcriptome annotation of the GRCm38 mouse genome. *(This is the version currently mirrored on the Kallisto website)*
SafeMath.py: Example module for math on data with missing values
MsvUtil.py: Example utility module. Includes the Table class for tabular text
CdtFile.py: Example module for parsing and manipulating CDT and GTR files. Depends on SafeMath.py and MsvUtil.py

Principal Components Analysis (PCA)

Numerical Recipes is very useful for thinking about PCA and linear modeling. Relevant sections (relative to Numerical Recipes in C) are: section 2.6 (singular value decomposition), chapter 11 sections 0 to 3 (symmetric eigensystems, plus some of the methods commonly used for implementing SVD), and chapter 15 sections 0 to 4 (least squares regression, for which SVD provides a stable implementation). I.T. Jolliffe's book on PCA is a well-written and thorough treatment from a practicing statistician's point of view.

PCA.py: Example PCA implementation. Depends on CdtFile.py
What is Principal Component Analysis?: Nice essay on PCA and factor analysis by "biological" mathematician Lior Pachter
WIREs Comp. Stat. 2:433: Very good PCA tutorial (associated MATLAB code is here). Working through the examples in this paper in python is a great way to get a feel for the logistics of PCA.
ICA vs. PCA: Tutorial on performing PCA and ICA (independent component analysis) using scikit-learn (a python-based package for machine learning, which also includes hierarchical clustering, among many other methods)

Day 6: Sequences

Slides: Mark's slides for day 6
Day6.ipynb: Mark's annotated IPython notebook for day 6
Day6.html: HTML export of the notebook
geneticCode.py: The (standard) genetic code as a Python dictionary
sequences1.zip: Example protein, DNA, and RNA sequences in FASTA format.
blosum62.py: The BLOSUM62 scoring matrix as a Python dictionary of dictionaries.
The BLAST book: Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols in chapter 9. The above link is for the Safari Bookshelf copy (preferred). The UCSF library also has a limited-user e-book version (acquired during a brief lapse in the Safari Bookshelf subscription).
Biological Sequence Analysis: In addition to excellent coverage of hidden Markov models (and the related dynamic programming algorithms for generating and searching with them) this book gives good general coverage of sequence alignment and statistics. (The UCSF library has copies at both Parnassus and Mission Bay). See also Sean Eddy's notes from his new Python-based Biological Data Analysis course at Harvard.
Introduction to Protein Structure, 2nd Edition (Branden and Tooze): Great book for orienting yourself on natural protein sequences.
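
To make the "genetic code as a Python dictionary" idea concrete, here is a minimal sketch that builds the standard code from the conventional TCAG codon ordering and uses it to translate a reading frame. The names GENETIC_CODE and translate are assumptions for illustration and may not match what geneticCode.py actually defines.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the standard code, in TCAG order (TTT, TTC, TTA, TTG, TCT, ...);
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

GENETIC_CODE = dict(
    ("".join(codon), aa)
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
)

def translate(dna):
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    dna = dna.upper()
    return "".join(GENETIC_CODE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))
```

Codons containing ambiguous bases such as N raise a KeyError in this sketch; a fuller module would need to decide how to handle them.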

Day 7: Sequence Alignment

Slides

Mark's slides for day 7

Day7.ipynb

Mark's annotated IPython notebook for day 7

Day7.html

HTML export of the notebook

UCSF library Data Science Initiative

Note: Programming and Pizza on May 18th (5-6:30pm, 1407 Mission Hall) and all-day workshops on May 20-21

Needleman-Wunsch

Primary reference for global alignment by dynamic programming.

Smith-Waterman

Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).

Gotoh

Gotoh's speed-up of Smith-Waterman from O(m²n) to O(mn) time.

Myers and Miller

Myers and Miller's update to Gotoh's algorithm, improving the space requirement from O(mn) to O(n).

pydotter

Python clone of Eric Sonnhammer's DOTTER program for windowed dotplots. Includes:

Dotplot.py graphics-independent dotplot calculation
pydotter.py Tk-based stand-alone program
DotplotQt.py Canopy-compatible module for plotting from a Jupyter notebook (see DotplotQt_Canopy.ipynb for an example).

Note G

Ada Lovelace's code for the Bernoulli numbers -- often cited as the first published computer program (here is the full document)
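
For orientation on the global alignment references above, the following is a minimal sketch of a Needleman-Wunsch style score calculation with a linear (per-position) gap penalty. It is not taken from the day 7 notebook: the function name and the default scores are assumptions, and it returns only the optimal score, not the alignment.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                            prev[j] + gap,     # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```

Because only two rows of the matrix are kept, the score needs memory proportional to the length of b; recovering the alignment itself requires either the full matrix with a traceback or a divide-and-conquer scheme.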

Day 8: Heuristic Approaches

Slides: Mark's slides for day 8
Day8.ipynb: Mark's annotated IPython notebook for day 8
Day8.html: HTML export of the notebook
gapped.py: Example scoring function for alignments with homogeneous or affine gap penalties
monte_aligner.py: Example implementation for solving the pairwise alignment problem by simulated annealing. Note that the current move set is likely to be biased. See section 10.9 of Numerical Recipes for a description of this algorithm.
NeedlemanWunschExamples.html: Some test sequences for your Needleman-Wunsch code, with optimal scores, example optimal alignments, and dynamic programming matrices
blosum45.txt: BLOSUM45 scoring matrix
blosum62.txt: BLOSUM62 scoring matrix
blosum80.txt: BLOSUM80 scoring matrix
ScoringMatrices.html: IPython notebook for clustering BLOSUM scoring matrices
NCBI GQuery: Keyword search across Pubmed/GenBank/GEO/etc.
NCBI BLAST: NCBI's BLAST portal
JMB 215:403: The primary reference for BLAST, giving a fast heuristic method for approximating the local alignment methods and a statistical framework for interpreting the results.
HMMer: Sean Eddy's profile-HMM implementation
HMMer web interface: New web interface to PHMMer, HMMscan, and HMMsearch
PLoS Comp. Biol. 4:e1000069: Statistical argument for HMMer 3 speed-ups
PLoS Comput Biol. 7:e1002195: Algorithmic details for HMMer 3 speed-ups
RNA 18:193: Recent general-purpose SCFG framework for statistical models of RNA secondary structure.
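
As a sketch of what a scoring function for an already-gapped alignment might look like (in the spirit of gapped.py, but not its actual code), the example below scores two aligned strings under an affine gap model. The function name, the default scores, and the convention that a gap of length k costs gap_open + (k - 1) * gap_extend are all assumptions; setting gap_open equal to gap_extend gives a homogeneous penalty.

```python
def score_alignment(top, bottom, match=1, mismatch=-1,
                    gap_open=-2, gap_extend=-1):
    """Score two equal-length gapped strings ('-' marks a gap position)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    prev_gap = None  # row holding the gap in the previous column, if any
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("column with gaps in both rows")
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # First column of a gap pays gap_open; later columns pay gap_extend.
            score += gap_extend if row == prev_gap else gap_open
            prev_gap = row
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score
```

A function like this is handy as an independent check on the score reported by a dynamic programming implementation.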

Day 9: PCA

Day9.ipynb: Mark's annotated IPython notebook for day 9
Day9.html: HTML export of the notebook

Day 10: Wrap Up

Slides

Mark's slides for day 10

dp2.py

Example dynamic-programming implementations:

Global alignment with zero gap opening penalties (nw)
Global alignment (nwg)
Local alignment (sw)
Global alignment with optional edge constraints (nwp)

Note that nwg, nwp, and sw use distinct fill algorithms (nwg_fill, nwp_fill, and sw_fill) but a common traceback function (sw_traceback). This module also includes three utility functions: makeIdent, for generating simple scoring matrices, gapped_score, for independent confirmation of alignment scores, and nw_dump, for diagnostic dumps of the dynamic programming data structures. sw_backtrack is an alternative to sw_traceback that exhaustively enumerates all optimal alignments.

FastaFile.py

Simple FASTA-file parser

Sequence.py

Fancier nucleotide/protein classes supporting translation, ORF-finding, etc.

ClustalTools.py

Utilities for some multiple alignment and phylogenetic tree formats

fasta2cdt.py

Conversion script to map pairs of sequence alignment (FASTA) and tree (PHB) files to JavaTreeView CDT/GTR format. Depends on FastaFile.py and ClustalTools.py

Probcons

Good tool for multiple alignment of a small number of proteins

Clustal Omega

Good tool for multiple alignment of a moderate number of proteins
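
For readers who want to see the shape of a simple FASTA parser like the one listed above, here is a minimal generator-based sketch. It is not the code in FastaFile.py; the name read_fasta is hypothetical, and it assumes each record starts with a ">" header line followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    # Emit the final record.
    if name is not None:
        yield name, "".join(chunks)
```

Since the function accepts any iterable of lines, it can be passed an open file object directly, or a list of strings for testing.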