
This page contains slides and supporting material for the 2013, module 3 version of BMS270: Practical Bioinformatics with Programming. Please see the site updates page for information on tracking changes.

General Resources

Python: Homepage for the Python programming language (download, documentation, etc.)
Enthought Canopy: "Scientific" python distribution -- free for academic use. This is an easy way to install libraries like numpy, scipy, and matplotlib on OS X and Windows (on Linux, it is easiest to install these libraries directly from your distribution's repository). Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython: The interactive shell/notebook that we are using for this course. See also the gallery of example notebooks
Matplotlib: MATLAB-like plotting for python. See also the gallery of example plots
Numpy: Numerical library that serves as the foundation for matplotlib, scipy, etc.
Python Library Documentation: Detailed documentation for all of the standard python modules. See also http://docs.python.org/ for the full list of on-line documentation.
Dive Into Python: Programmer-oriented introduction to Python. Faster paced than Learning Python, but the entire book can be downloaded as a free pdf file. (Note that as of October 2011, this book is no longer hosted at its original website)
Dive Into Python 3: Introduction to Python 3, from Mark Pilgrim, the author of the original Dive Into Python. Since the "scientific python stack" is still somewhat dependent on Python 2, most of this book won't be relevant to this course. Chapter 12, however, has a very good discussion of using ElementTree and lxml for parsing XML documents (e.g., MINiML files from GEO, SVG vector graphics, and NCBI's XML format for BLAST output). (Note that as of October 2011, this book is no longer hosted at its original website.)
Numerical Recipes: Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
MacPorts: A repository of useful open source programs for OS X, based on the FreeBSD Ports system.
Cygwin: UNIX-like environment for Windows with package manager for popular UNIX/Linux programs.
Ubuntu: A user-friendly Linux distribution based on Debian. This is the distribution that I use on my laptop. The installation CD can be booted as a "Live CD", allowing you to try Linux with no change to your computer.
Knoppix: Knoppix is a "Live CD" version of Debian. Knoppix may be a bit less user friendly than Ubuntu, but it may boot faster on some computers. More information about Knoppix can be found on this unofficial site.
git: Distributed version control system. Very fast, but optimized for Linux.
mercurial: Distributed version control system. Very similar to git, but better cross-platform support. Also, it's written in Python =)

Day 1: Python

Slides

Mark's slides for day 1

Transcript

Mark's Python session for day 1

Day 2: File Formats

Slides

Mark's slides for day 2

Transcript

Mark's Python session for day 2

stats.py

Commented version of yesterday's homework problems

PNAS 95:14863 (Eisen et al.)

Paper introducing cluster analysis for microarrays.

supp2data.xls

Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) original Excel format

supp2data.cdt

Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited text format (CDT)

Day 3: Distance Metrics

Slides

Mark's slides for day 3

Transcript

Mark's Python session for day 3

Cluster3

Michael de Hoon's port of Michael Eisen's CLUSTER program.

cluster.c

Source code (version 1.49) for Michael de Hoon's Cluster3 implementation. Redistribution is governed by the Python License (See the Cluster3 website for full source code and license details).

JavaTreeView

Alok Saldanha's port of Michael Eisen's TREEVIEW program.

MeV

Cluster/TreeView alternative from JCVI (TIGR), similar to Acuity.

Malcolm Gladwell TED talk

An interesting perspective on clustering

Bell Labs Trellis interview

Another useful take on multivariate data analysis and visualization

Day 4: Hierarchical Clustering

Slides

Mark's slides for day 4

Transcript

Mark's Python session for day 4

SGD

Stanford Saccharomyces genome database

Mega Yeast

Larger version of the yeast expression set from Audrey Gasch

Day 5: Hierarchical Clustering

Slides

Mark's slides for day 5

Transcript

Mark's Python session for day 5

supp2data.cdt

supp2data.csv, reformatted for Cluster3/JavaTreeView

rowshuffle.cdt

supp2data.cdt with rows shuffled

rowuncouple.cdt

rowshuffle.cdt with independent within-row shuffling
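
The two shuffled files differ in what they preserve. Shuffling rows keeps each expression profile intact and only changes the row order, while independent within-row shuffling permutes the values inside each row separately, so the columns of different genes no longer correspond. A minimal sketch, assuming the data is already loaded as a list of numeric rows (the function names and the toy matrix are hypothetical, not taken from the course scripts):

```python
import random

def shuffle_rows(rows, rng):
    """Return a copy of rows in random order; each row is left intact."""
    shuffled = [list(row) for row in rows]
    rng.shuffle(shuffled)
    return shuffled

def uncouple_rows(rows, rng):
    """Return a copy of rows with the values in each row permuted independently."""
    uncoupled = []
    for row in rows:
        values = list(row)
        rng.shuffle(values)
        uncoupled.append(values)
    return uncoupled

rng = random.Random(0)  # fixed seed so the example is repeatable
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]  # toy matrix, not real data
row_shuffled = shuffle_rows(data, rng)
row_uncoupled = uncouple_rows(row_shuffled, rng)
```

Both functions copy their input, so the original matrix is unchanged and the three versions can be clustered side by side.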

supp2data.um.cdt

Clustering result for supp2data.cdt -- heatmap

supp2data.um.gtr

Clustering result for supp2data.cdt -- dendrogram

rowshuffle.um.cdt

Clustering result for rowshuffle.cdt -- heatmap

rowshuffle.um.gtr

Clustering result for rowshuffle.cdt -- dendrogram

rowuncouple.um.cdt

Clustering result for rowuncouple.cdt -- heatmap

rowuncouple.um.gtr

Clustering result for rowuncouple.cdt -- dendrogram

supp2data_dists.cdt

Uncentered Pearson distance matrix for supp2data.cdt

rowshuffle_dists.cdt

Uncentered Pearson distance matrix for rowshuffle.cdt
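
For reference, the uncentered Pearson distance between two profiles x and y is commonly defined as 1 - (x . y) / (|x| |y|), i.e. the usual correlation formula without subtracting the means. The sketch below assumes that definition, rows with no missing values, and at least one non-zero entry per row; the function names are hypothetical and this is not the code that produced the files above.

```python
import math

def uncentered_pearson_distance(x, y):
    """1 - (x . y) / (|x| |y|), computed without subtracting the means."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def distance_matrix(rows):
    """All-against-all distances as a list of lists."""
    return [[uncentered_pearson_distance(a, b) for b in rows] for a in rows]

profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]  # toy data
dists = distance_matrix(profiles)
```

Under this definition, profiles that are positive multiples of each other are at distance 0, orthogonal profiles at distance 1, and profiles pointing in opposite directions at distance 2.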

Day 6: Sequence analysis

Slides

Mark's slides for day 6

Transcript

Mark's Python session for day 6

geneticCode.py

The (standard) genetic code as a Python dictionary
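
As an illustration of how such a dictionary can be used, here is a minimal translation sketch. The layout of geneticCode.py is not shown on this page, so the dictionary below is built independently from the standard code written in TCAG order; the names genetic_code and translate are assumptions made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each three-letter DNA codon to a one-letter amino acid ("*" marks a stop codon)
genetic_code = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string in frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):  # a trailing partial codon is ignored
        protein.append(genetic_code.get(dna[i:i + 3], "X"))
    return "".join(protein)

peptide = translate("ATGGCCTAA")
```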

sequences1.zip

Example protein, DNA, and RNA sequences in FASTA format.

blosum62.py

The BLOSUM62 scoring matrix as a Python dictionary of dictionaries.
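
A dictionary of dictionaries lets a substitution score be looked up as matrix[a][b]. The sketch below assumes that access pattern (the actual layout of blosum62.py may differ) and uses only a three-residue excerpt of BLOSUM62; the names blosum62_excerpt and ungapped_score are hypothetical.

```python
# Excerpt of BLOSUM62 covering only A, R, and W; the full matrix has 20+ residues.
blosum62_excerpt = {
    "A": {"A": 4, "R": -1, "W": -3},
    "R": {"A": -1, "R": 5, "W": -3},
    "W": {"A": -3, "R": -3, "W": 11},
}

def ungapped_score(seq1, seq2, matrix):
    """Sum the substitution scores of two equal-length sequences, position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[a][b] for a, b in zip(seq1, seq2))

identical = ungapped_score("AW", "AW", blosum62_excerpt)
reversed_pair = ungapped_score("ARW", "WRA", blosum62_excerpt)
```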

pydotter_canopy.py

Python clone of Erik Sonnhammer's DOTTER program for windowed dotplots. (The first line of this file has been updated to point at the system copy of python in order to avoid Canopy's Tk incompatibility)
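
The idea behind a windowed dotplot can be sketched in a few lines: slide a fixed-length window along both sequences and mark a dot wherever enough positions in the two windows agree. This is a simplified illustration that counts identities only; it is not the algorithm used by pydotter_canopy.py or DOTTER, and the function name and parameters are hypothetical.

```python
def windowed_dotplot(seq1, seq2, window=3, threshold=3):
    """Return the set of (i, j) window starts with at least `threshold` identities."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(1 for k in range(window) if seq1[i + k] == seq2[j + k])
            if matches >= threshold:
                dots.add((i, j))
    return dots

self_plot = windowed_dotplot("ACGT", "ACGT")
```

Comparing a sequence against itself gives dots along the main diagonal; lowering the threshold relative to the window length admits noisier, more divergent matches.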

The BLAST book

Excellent coverage of sequence alignment methods with an emphasis on BLAST. Of particular interest are the coverage of dynamic programming in chapter 3 and the protocols for common BLAST tasks in chapter 9. UCSF's access to this book is currently limited to one user at a time.

Biological Sequence Analysis

In addition to excellent coverage of hidden Markov models (and the related dynamic programming algorithms for generating and searching with them) this book gives good general coverage of sequence alignment and statistics. (The UCSF library has copies at both Parnassus and Mission Bay)

Introduction to Protein Structure, 2nd Edition (Branden and Tooze)

Great book for orienting yourself on natural protein sequences.

Day 7: Alignment

Slides

Mark's slides for day 7

Transcript

Mark's Python session for day 7

Sequence.py

An example "real world" sequence implementation. This is the module that I use for most of my day-to-day sequence manipulation.

Needleman-Wunsch

Primary reference for global alignment by dynamic programming.

Smith-Waterman

Primary reference for local alignment by dynamic programming. (There is an earlier paper that gives the scoring method, but this paper gives the algorithm).
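
A minimal score-only sketch of local alignment by dynamic programming follows. It assumes a simple match/mismatch score and a linear gap penalty (all parameter values are arbitrary choices for illustration) and omits the traceback, so it is a teaching sketch rather than the implementation from the paper or from the course code.

```python
def smith_waterman_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # H[i][j] is the best score of any local alignment ending at seq1[i-1], seq2[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # extend with a match or mismatch
                H[i - 1][j] + gap,    # gap in seq2
                H[i][j - 1] + gap,    # gap in seq1
            )
            best = max(best, H[i][j])
    return best

score = smith_waterman_score("TTACGTT", "GGACGGG")
```

The zero in the recurrence is what makes the alignment local: a poorly scoring prefix is dropped instead of being carried forward.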

Gotoh

Gotoh's speed-up of Smith-Waterman from O(m²n) to O(mn) time.

Myers and Miller

Myers and Miller's update to Gotoh's algorithm, improving the space requirement from O(mn) to O(n).

Design Patterns: Elements of Reusable Object-Oriented Software

This is the design patterns textbook that I mentioned in class (the UCSF library doesn't have it, but it's available by interlibrary loan from most other UC libraries)

Day 8: Heuristic Alignment and Search

Slides

Mark's slides for day 8

Transcript

Mark's Python session for day 8

blosum45.txt

BLOSUM45 scoring matrix

blosum62.txt

BLOSUM62 scoring matrix

blosum80.txt

BLOSUM80 scoring matrix

BLOSUM clustering example

UCSC Genome Browser

Human genome

NCBI GQuery

Keyword search across Pubmed/GenBank/GEO/etc.

NCBI BLAST

NCBI's BLAST portal

JMB 215:403

The primary reference for BLAST, giving a fast heuristic method for approximating the local alignment methods and a statistical framework for interpreting the results.

HMMer

Sean Eddy's profile-HMM implementation

HMMer web interface

New web interface to PHMMer, HMMscan, and HMMsearch

InterProScan

Search interface for InterPro, a meta-database of 15 motif databases (mostly HMM based)

PLoS Comp. Biol. 4:e1000069

Statistical argument for HMMer 3 speed-ups

PLoS Comput Biol. 7:e1002195

Algorithmic details for HMMer 3 speed-ups

RNA 18:193

Recent general-purpose SCFG framework for statistical models of RNA secondary structure.

Regular Expressions

Documentation for Python's re module
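
As a small illustration of the re module applied to sequence data, the sketch below searches a protein sequence for the N-glycosylation consensus pattern N-{P}-[ST]-{P} (N, then anything but P, then S or T, then anything but P). The example sequence is made up, and find_motif is a hypothetical helper name.

```python
import re

# N, then any residue except P, then S or T, then any residue except P
N_GLYC = re.compile(r"N[^P][ST][^P]")

def find_motif(sequence, pattern=N_GLYC):
    """Return (start, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]

hits = find_motif("MKNGSAANPSANATL")
```

Note that finditer reports non-overlapping matches only, so overlapping occurrences of a motif need a different approach (for example, a lookahead pattern).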

Day 9: Multiple Alignment

CLUSTALX: Multiple sequence alignment program based on neighbor-joining trees.
JALVIEW: Program for viewing and annotating multiple alignments. (The "Install Anywhere" version is slightly easier to install compared to the "Java Web Start" version).
MESQUITE: Multiple alignment workbench with good tree viewer
Slides: Mark's slides for day 9
Calmodulin_example.zip: Calmodulin-related sequences
GATA.zip: GATA transcription factors
GFF file format: Specification for the commonly used version 2 of GFF (e.g., this is what JALVIEW uses for sequence annotations).
GFFv3 file format: Recent update of GFF by Lincoln Stein, with stricter definitions and better support for relationships among features, as used by newer programs such as Gbrowse2.
FastaFile.py: Simple FASTA-file parser
ClustalTools.py: Utilities for some multiple alignment and phylogenetic tree formats
aln2cdt.py: Conversion script to map CLUSTAL aln/phb files to JavaTreeView cdt/gtr format. Depends on FastaFile.py and ClustalTools.py
Lucien.py: A larger alignment pipeline example (depends on databases and code not included on this site).
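
To give a sense of what a simple FASTA parser involves, here is a minimal sketch that works on any iterable of lines (an open file or a list of strings). It is not the code from FastaFile.py; the function name read_fasta and the toy records are assumptions made for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequence data may span several lines
    if name is not None:
        yield name, "".join(chunks)

example = [
    ">seq1 toy record",
    "ACGT",
    "ACGT",
    "",
    ">seq2",
    "TTGA",
]
records = list(read_fasta(example))
```

Because the function is a generator, large files can be processed one record at a time, e.g. `for name, seq in read_fasta(open(path))` with a path of your choosing.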

Day 10: Singular Value Decomposition

Slides

Mark's slides for day 10

Transcript

Mark's SVD example for day 10

PNAS 97:10101

An early application of SVD to microarray data, including the sporulation data that was also used in the Eisen clustering paper.

dp2.py

Example dynamic-programming implementation for Needleman-Wunsch with zero gap opening penalties (nw), an extension to cover non-zero gap opening penalties (nwg) and local alignments (sw). Note that nwg and sw use distinct fill algorithms (nwg_fill and sw_fill) but a common traceback function.
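
For orientation, the zero-gap-opening case can be sketched as a fill step followed by a traceback. The code below is an independent illustration, not the contents of dp2.py: the function name, the match/mismatch scoring, and the penalty values are all assumptions, and ties in the traceback are broken in favor of the diagonal move.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty; returns (score, aligned1, aligned2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # Fill: F[i][j] is the best score for aligning seq1[:i] with seq2[:j]
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback: walk from the bottom-right corner back to the origin
    out1, out2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-")
            i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(out1)), "".join(reversed(out2))

score, top, bottom = needleman_wunsch("ACGT", "AGT")
```

With a non-zero gap opening penalty, a single matrix is no longer sufficient, because the cost of a gap then depends on whether the previous column was already a gap.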