This page contains slides and supporting material for the 2019 version of BMS270: Practical Bioinformatics with Programming. Please see the site updates page for information on tracking changes.

General Resources

Python: Homepage for the Python programming language (download, documentation, etc.)
Safari Bookshelf: The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include Learning Python and Programming Python. (Bioinformatics Programming Using Python is not recommended.) Note that you may need to go through the Safari Bookshelf link above for full access to some of the books linked below.
Learning Python: Excellent introduction to Python.
Dive Into Python 3: Programmer-oriented introduction to Python. Faster paced than Learning Python, but the entire book can be downloaded as a free pdf file. Chapter 12 on parsing XML with lxml is particularly good. There is also a substantially different earlier edition based on Python 2. (Note that as of October 2011, these books are no longer hosted at their original website.)
Enthought Canopy: "Scientific" python distribution -- free for academic use. This is an easy way to install libraries like numpy, scipy, and matplotlib on OS X and Windows (on Linux, it is easiest to install these libraries directly from your distribution's repository). Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython/Jupyter: The interactive shell/notebook that we are using for this course. See also the gallery of example notebooks
Matplotlib: MATLAB-like plotting for python. See also the gallery of example plots
Numpy: Numerical library that serves as the foundation for matplotlib, scipy, etc.
Python Library Documentation: Detailed documentation for all of the standard python modules. See also http://docs.python.org/ for the full list of on-line documentation.
Numerical Recipes: Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
VirtualBox: The VM software that we are using for this course. For a more flexible open source alternative (which is a bit trickier to install on OS X), see QEMU. UCSF also has a license for VMWare.
XQuartz: An implementation of X11 for OS X -- useful for running graphical programs over an ssh connection.
Debian: The Linux distribution that we are using for this course. Good choice for stability and ease of administration.
Ubuntu: A user-friendly Linux distribution based on Debian. This is the distribution that I use on my laptop. The installation CD can be booted as a "Live CD", allowing you to try Linux with no change to your computer.
git: This is the distributed version control system that we will use for this course.

Day 1: Python

Slides: Mark's slides for day 1
Day1.ipynb: Mark's Jupyter notebook for day 1
Day1.html: HTML export of the notebook
Day1_extended.ipynb: An extended and annotated version of today's notebook, showing some additional features of the example files.
Day1_extended.html: HTML export of the extended notebook
Python Primer: A first draft "intro to python" comic

Day 2: File Formats

Slides: Mark's slides for day 2
TPM CDT: Log₂(TPM) estimates for the most abundant transcripts observed in Sci Rep 7:42225, based on running the reads from GSE88801 through kallisto and filtering for genes with TPM ≥ 10 in at least one sample.
Day2.ipynb: Mark's annotated Jupyter notebook for day 2
Day2.html: HTML export of the notebook
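
The thresholding described for the TPM CDT above (keep genes with TPM ≥ 10 in at least one sample, then take log₂) can be sketched in plain Python. This is a minimal illustration, not the pipeline used to build the file: the function name, the toy table, and the pseudocount of 1 (added so that zero TPMs have a defined logarithm) are all assumptions.

```python
import math

def filter_and_log2(tpm_by_gene, threshold=10.0, pseudocount=1.0):
    """Keep genes whose TPM reaches `threshold` in at least one sample,
    then return log2(TPM + pseudocount) for the surviving genes.

    The pseudocount is an assumption of this sketch, not of the source data.
    """
    result = {}
    for gene, tpms in tpm_by_gene.items():
        if max(tpms) >= threshold:
            result[gene] = [math.log2(x + pseudocount) for x in tpms]
    return result

# Hypothetical toy table: gene name -> TPM in each of three samples
toy = {
    "geneA": [0.0, 3.0, 15.0],     # passes: one sample is >= 10
    "geneB": [2.0, 4.0, 9.9],      # fails: never reaches 10
    "geneC": [31.0, 63.0, 127.0],  # passes in every sample
}
filtered = filter_and_log2(toy)
for gene, values in filtered.items():
    print(gene, values)
```

Testing `max(tpms)` against the threshold is equivalent to asking whether any single sample passes, which is the "in at least one sample" condition.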

Day 3: Distance Metrics

Slides: Mark's slides for day 3
Day3.ipynb: Mark's annotated Jupyter notebook for day 3
Day3.html: HTML export of the notebook
JavaTreeView: Alok Saldanha's port of Michael Eisen's TREEVIEW program.
stats.py: Example statistical functions (day 2 homework solutions)
CDT file format: Documentation for the extended CDT file format from the JavaTreeView manual
JavaTreeView.336e7a13.zip: Alternate build of JavaTreeView, in case you're getting plug-in errors from the official build. To install on OS X:

1) Save the attached zip to your Desktop
2) In a terminal window, run:

mkdir JavaTreeView
cd JavaTreeView
unzip ../Desktop/JavaTreeView.336e7a13.zip
cd dist
java -jar TreeView.jar

PNAS 95:14863 (Eisen et al.): Paper introducing cluster analysis for microarrays.
supp2data.xls: Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) in the original Excel format
supp2data.cdt: Supplementary data for PNAS 95:14863 figure 2 (yeast expression profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited text format (CDT)
supp2data_samples.htm: Table documenting the samples in supp2data.cdt

Day 4: Hierarchical Clustering

Slides: Mark's slides for day 4
Day4.ipynb: Mark's annotated IPython notebook for day 4
Day4.html: HTML export of the notebook
Malcolm Gladwell TED talk: An interesting perspective on clustering
Bell Labs Trellis interview: Another useful take on multivariate data analysis and visualization

Day 5: Hierarchical Clustering, part 2

Slides: Mark's slides for day 5
Day5.ipynb: Mark's IPython notebook for day 5
Day5.html: HTML export of the notebook
Day5_prep.ipynb: Notebook fitting the scaling behavior of treecluster
Day5_prep.html: HTML export of the notebook
ENSEMBL_names.txt: Table mapping ENSEMBL mouse gene names to common names.
geneticCode.py: The (standard) genetic code as a Python dictionary
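
As an illustration of how a genetic code dictionary can be built and used, here is a minimal sketch. It is not the contents of geneticCode.py: the names GENETIC_CODE and translate are hypothetical, and the dictionary is generated from the compact amino acid string used for the standard table (codons ordered with T, C, A, G at each position).

```python
from itertools import product

# Standard genetic code in compact form: codons are enumerated with the
# bases in the order T, C, A, G at each position ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Hypothetical dictionary layout: codon string -> one-letter amino acid
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string in frame 0, stopping at the first stop codon.

    Codons containing anything other than T, C, A, G become "X".
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = GENETIC_CODE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCTAA"))
```

Generating the table from two short strings avoids typing 64 entries by hand, at the cost of being less readable than an explicit dictionary literal.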

Day 6: Principal Components Analysis (PCA)

Numerical Recipes is very useful for thinking about PCA and linear modeling. Relevant sections (relative to Numerical Recipes in C) are: section 2.6 (singular value decomposition), chapter 11 sections 0 to 3 (symmetric eigensystems, plus some of the methods commonly used for implementing SVD), and chapter 15 sections 0 to 4 (least squares regression, for which SVD provides a stable implementation). I.T. Jolliffe's book on PCA is a well-written and thorough treatment from a practicing statistician's point of view.
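
The connection between SVD and PCA mentioned above can be sketched with numpy. This is a minimal illustration rather than the approach taken in the course notebooks: the function name pca_svd and the random test matrix are assumptions, and numpy is assumed to be installed.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA of X (rows = samples, columns = features) via SVD.

    Returns the principal axes, the projected samples (scores), and the
    variance along each retained axis.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]           # rows are principal axes
    scores = U[:, :n_components] * s[:n_components]
    variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    return components, scores, variance

# Hypothetical data: 50 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
components, scores, variance = pca_svd(X)

# The same variances come out of the symmetric eigenproblem for the
# covariance matrix, which is the link to the eigensystem chapter.
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(variance, eigenvalues[:2])
```

Working from the SVD of the centered data avoids forming the covariance matrix explicitly, which is the usual argument for its numerical stability.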

Slides: Mark's slides for day 6
Day6_template.ipynb: Jupyter notebook template for day 6
Day6.ipynb: Mark's Jupyter notebook for day 6
Day6.html: HTML export of the notebook
What is Principal Component Analysis?: Nice essay on PCA and factor analysis by "biological" mathematician Lior Pachter
WIREs Comp. Stat. 2:433: Very good PCA tutorial (associated MATLAB code is here). Working through the examples in this paper in python is a great way to get a feel for the logistics of PCA.
ICA vs. PCA: Tutorial on performing PCA and ICA (independent component analysis) using scikit-learn (a python-based package for machine learning, which also includes hierarchical clustering, among many other methods)
Cell Syst. 2:239: Cool application of PCA for "shallow-seq" analysis of bulk and single-cell experiments.

Day 7: Data Aggregation

Slides: Mark's slides for day 7
Day7.ipynb: Mark's Jupyter notebook for day 7
Day7.html: HTML export of the notebook
example1.cdt: Final clustered heatmap from today
example1.gtr: Corresponding tree file
S1.csv: CSV export of Sci Rep 7:42225 table S1, giving the DESeq2 results for the 24h/4h BMDM uninfected comparison.

Day 8: Differential Expression

Slides: Mark's slides for day 8
Day8a.ipynb: Mark's Jupyter notebook for day 8 (A)
Day8a.html: HTML export of the notebook
Day8b.ipynb: Mark's Jupyter notebook for day 8 (B)
Day8b.html: HTML export of the notebook
sample_table_v2.csv: Sample table
limma1.J774.Live.24-J774.uninfected.24.t0.csv: Differential genes
est_counts CDT (TPM ≥ 10): Estimated counts for the most abundant transcripts observed in Sci Rep 7:42225, based on running the reads from GSE88801 through kallisto and filtering for genes with TPM ≥ 10 in at least one sample. (I.e., same pipeline as GSE88801_kallisto_TPMs_thresh10.cdt above.)
est_counts CDT (TPM ≥ 1): As above, but cutting for TPM ≥ 1 for comparison with the supplemental tables of the paper.
S2.xls: Sci Rep 7:42225 table S2, giving the DESeq2 results for the 8 infected vs. uninfected comparisons.
Genome Biol. 14:R95: Comparison of differential expression methods for RNA-Seq, including limma and DESeq. From 2013, so doesn't include DESeq2. (erratum)
limma user's guide: Documentation for limma
NAR 43:e47: Primary reference for limma
Genome Biol. 15:R29: Primary reference for voom (limma's RNA-Seq weighting method)
Elements of Statistical Learning: Great textbook on machine learning from a statistics point of view with very good coverage of regression methods.
Introduction to Statistical Learning: "Easier" version of the above, with examples in R.

Day 9: Differential Expression, part 2

Slides: Mark's slides for day 9
Day9_template1.ipynb: Example notebook for limma analysis of the GSE88801 data, starting from transcripts with TPM ≥ 10 in at least one sample and filtering for significantly differential transcripts with a fold change of at least 2x
Day9_template1.html: HTML export of the notebook
Day9_template1.results.tar.gz: Tarball of fit parameters and differential gene lists from this analysis.
Day9_template2.ipynb: Example notebook for limma analysis of the GSE88801 data, starting from transcripts with TPM ≥ 1 in at least one sample.
Day9_template2.html: HTML export of the notebook
Day9a.ipynb: Mark's annotated Jupyter notebook for day 9 (A)
Day9a.html: HTML export of the notebook
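
The final filtering step described for Day9_template1.ipynb (significant transcripts with at least a 2x fold change) amounts to a cut on the absolute log₂ fold change plus a cut on the adjusted p-value. The sketch below is an assumption-laden illustration, not code from the notebook: the function name, the 0.05 significance level, and the toy rows are all hypothetical.

```python
import math

def is_differential(log2_fc, adj_p, min_fold=2.0, alpha=0.05):
    """True when |fold change| >= min_fold and adjusted p-value < alpha.

    alpha=0.05 is an assumed cutoff; the source does not state one.
    """
    return abs(log2_fc) >= math.log2(min_fold) and adj_p < alpha

# Hypothetical rows: (transcript, log2 fold change, adjusted p-value)
toy_rows = [
    ("tx1", 1.5, 0.001),   # up more than 2x, significant
    ("tx2", -1.0, 0.04),   # down exactly 2x, significant
    ("tx3", 0.4, 0.0001),  # significant but below the fold-change cut
    ("tx4", 3.0, 0.20),    # large change but not significant
]
hits = [name for name, lfc, p in toy_rows if is_differential(lfc, p)]
print(hits)
```

Taking the absolute value of the log₂ fold change treats 2x increases and 2x decreases symmetrically.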

Day 10: Abundance Estimation

Slides: Mark's slides for day 10
Closing Slides: Mark's closing slides
Day10.ipynb: Mark's notebook for day 10
Day10.html: HTML export of the notebook
Mucci2.transcriptome.fasta.gz: Mucor circinelloides transcriptome, for analysis of reads from mBio 10:e02765
Mucor_Yields.html: Notebook exploring the yields for the full mBio 10:e02765 data set.
stream_sampler.py: Example Python 2 implementation of a reservoir sampler for FASTA, FASTQ, and SAM files. (Note that SAM files should be unsorted alignments of single-end reads. Equivalent BAM files can be sampled by piping through samtools view). For more about reservoir sampling, see this essay from Sean Eddy.
GSE88801_kallisto.tar.gz: Unmerged kallisto results for the GSE88801 data set.
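
The idea behind reservoir sampling can be sketched in a few lines of Python 3. This is a generic illustration of the technique (often called Algorithm R), not the code in stream_sampler.py: the function name is hypothetical, and records are treated as items from any iterable rather than parsed FASTA, FASTQ, or SAM entries.

```python
import random

def reservoir_sample(records, k, rng=None):
    """Return k records chosen uniformly at random from an iterable of
    unknown length, holding at most k records in memory at a time.

    If the stream has fewer than k records, all of them are returned.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Record i (0-based) replaces a random slot with
            # probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

# Hypothetical usage: sample 10 "reads" from a stream of 1000
sample = reservoir_sample(range(1000), 10, random.Random(42))
print(sample)
```

Because each record is seen exactly once and only k are kept, the same function works on a file that is too large to load or on data arriving through a pipe.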