This page contains slides and supporting material for the 2019
version of BMS270: Practical Bioinformatics with Programming.
Please see the site updates page for
information on tracking changes.
General Resources
Python
Homepage for the Python programming language (download,
documentation, etc.)
Safari Bookshelf
The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include: Learning Python and Programming Python. (Bioinformatics Programming Using Python is not recommended.) Note that you may need to go through the Safari Bookshelf link above for full access to some of the books linked below.
Learning Python
Excellent introduction to Python.
Dive Into Python 3
Programmer-oriented introduction to Python. Faster paced than
Learning Python, but the entire book can be downloaded as a free pdf
file. Chapter 12 on parsing XML with lxml is particularly good. There is also a substantially different earlier edition based on Python 2.
(Note that as of October 2011, these books are no longer hosted at their
original website)
Enthought Canopy
"Scientific" python distribution -- free for academic use.
This is an easy way to
install libraries like numpy, scipy, and matplotlib on OS X and
Windows (on Linux, it is easiest to install these libraries directly
from your distribution's repository).
Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
IPython/Jupyter
The interactive shell/notebook that we are using for this course.
See also the gallery of
example notebooks
Matplotlib
MATLAB-like plotting for python.
See also the gallery of
example plots
Numpy
Numerical library that serves as the foundation for matplotlib,
scipy, etc.
Python Library Documentation
Detailed documentation for all of the standard python modules.
See also http://docs.python.org/
for the full list of on-line documentation.
Numerical Recipes
Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
VirtualBox
The VM software that we are using for this course. For a more flexible open source alternative (which is a bit trickier to install on OS X), see QEMU. UCSF also has a license for VMware.
XQuartz
An implementation of X11 for OS X -- useful for running graphical programs over an ssh connection.
Debian
The Linux distribution that we are using for this course. Good
choice for stability and ease of administration.
Ubuntu
A user-friendly Linux distribution based on
Debian. This is
the distribution that I use on my laptop. The installation CD can be
booted as a "Live CD", allowing you to try Linux with no change to your
computer.
git
This is the distributed version control system that we will use for this course.
Day 1: Python
Slides
Mark's slides for day 1
Day1.ipynb
Mark's Jupyter notebook for day 1
Day1.html
HTML export of the notebook
Day1_extended.ipynb
An extended and annotated version of today's notebook, showing some additional features of the example files.
Day1_extended.html
HTML export of the extended notebook
Python Primer
A first draft "intro to python" comic
Day 2: File Formats
Slides
Mark's slides for day 2
TPM CDT
Log2(TPM) estimates for the most abundant transcripts
observed in Sci Rep 7:42225, based on running the reads from
GSE88801 through kallisto and filtering for genes with TPM ≥ 10 in at least one sample.
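The thresholding described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used to build the file: the function name `filter_and_log2`, the dictionary layout, and the pseudocount of 1 (added so that zero TPM values have a defined logarithm) are all assumptions.

```python
import math


def filter_and_log2(tpm_by_gene, threshold=10.0, pseudocount=1.0):
    """Keep genes with TPM >= threshold in at least one sample.

    tpm_by_gene maps a gene name to a list of TPM values, one per sample.
    Returns a dict of log2(TPM + pseudocount) values for the retained genes.
    The pseudocount is an assumption, not necessarily what the course file uses.
    """
    kept = {}
    for gene, values in tpm_by_gene.items():
        if any(v >= threshold for v in values):
            kept[gene] = [math.log2(v + pseudocount) for v in values]
    return kept


# Toy data (made up for illustration): three genes, three samples.
example = {
    "geneA": [0.0, 15.0, 3.0],   # passes: 15 >= 10 in one sample
    "geneB": [1.0, 2.0, 9.9],    # fails: never reaches 10
    "geneC": [10.0, 0.0, 0.0],   # passes: exactly at the threshold
}
result = filter_and_log2(example)
```

Note that a gene is kept if any single sample clears the threshold, so low values in the other samples survive the filter.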
Day2.ipynb
Mark's annotated Jupyter notebook for day 2
Day2.html
HTML export of the notebook
Day 3: Distance Metrics
Slides
Mark's slides for day 3
Day3.ipynb
Mark's annotated Jupyter notebook for day 3
Day3.html
HTML export of the notebook
JavaTreeView
Alok Saldanha's port of Michael Eisen's TREEVIEW program.
stats.py
Example statistical functions (day 2 homework solutions)
CDT file format
Documentation for the extended CDT file format from the JavaTreeView manual
JavaTreeView.336e7a13.zip
Alternate build of JavaTreeView, in case you're getting plug-in errors from the official build. To install on OS X:
1) Save the attached zip to your Desktop
2) In a terminal window:
mkdir JavaTreeView
cd JavaTreeView
unzip ../Desktop/JavaTreeView.336e7a13.zip
cd dist
java -jar TreeView.jar
PNAS 95:14863 (Eisen et al.)
Paper introducing cluster analysis for microarrays.
supp2data.xls
Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles) in the original Excel format
supp2data.cdt
Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited
text format (CDT)
supp2data_samples.htm
Table documenting the samples in supp2data.cdt
Day 4: Hierarchical Clustering
Slides
Mark's slides for day 4
Day4.ipynb
Mark's annotated IPython notebook for day 4
Day4.html
HTML export of the notebook
Malcolm Gladwell TED talk
An interesting perspective on clustering
Bell Labs Trellis interview
Another useful take on multivariate data analysis and visualization
Day 5: Hierarchical Clustering, part 2
Slides
Mark's slides for day 5
Day5.ipynb
Mark's IPython notebook for day 5
Day5.html
HTML export of the notebook
Day5_prep.ipynb
Notebook fitting the scaling behavior of treecluster
Day5_prep.html
HTML export of the notebook
ENSEMBL_names.txt
Table mapping ENSEMBL mouse gene names to common names.
geneticCode.py
The (standard) genetic code as a Python dictionary
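As a rough sketch of what such a dictionary can look like (geneticCode.py itself is not reproduced here), the standard code can be built from the conventional TCAG codon ordering. The names `BASES`, `AMINO_ACIDS`, `GENETIC_CODE`, and `translate` are hypothetical and chosen only for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(GENETIC_CODE[dna[i:i + 3]] for i in range(0, usable, 3))


protein = translate("ATGGCCTAA")
```

Building the table from two short strings keeps the 64 entries easy to check against a printed codon table.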
Day 6: Principal Components Analysis (PCA)
Numerical Recipes
Is very useful for thinking about PCA and linear modeling. Relevant sections (relative to Numerical Recipes in C) are: section 2.6 (singular value decomposition), chapter 11 sections 0 to 3 (symmetric eigensystems, plus some of the methods commonly used for implementing SVD), and chapter 15 sections 0 to 4 (least squares regression, for which SVD provides a stable implementation). I.T. Jolliffe's book on PCA is a well-written and thorough treatment from a practicing statistician's point of view.
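To make the connection between SVD and PCA concrete, here is a minimal numpy sketch. It is an illustration under stated assumptions rather than the course's own implementation: the function name `pca_svd` is hypothetical, rows are taken to be observations and columns variables, and the toy data are randomly generated.

```python
import numpy as np


def pca_svd(data):
    """PCA of a (samples x variables) array via SVD of the centered data.

    Returns the principal axes (rows of vt), the projections of each
    sample onto those axes, and the variance along each axis.
    """
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                                   # same as centered @ vt.T
    explained_variance = s ** 2 / (data.shape[0] - 1)
    return vt, scores, explained_variance


# Toy data: two strongly correlated variables (made up for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

components, scores, variance = pca_svd(data)
```

The variances returned here match the eigenvalues of the sample covariance matrix, which is why SVD of the centered data and diagonalization of the covariance matrix give the same principal components.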
Slides
Mark's slides for day 6
Day6_template.ipynb
Jupyter notebook template for day 6
Day6.ipynb
Mark's Jupyter notebook for day 6
Day6.html
HTML export of the notebook
What is Principal Component Analysis?
Nice essay on PCA and factor analysis by "biological" mathematician Lior Pachter
WIREs Comp. Stat. 2:433
Very good PCA tutorial (associated MATLAB code is here). Working through the examples in this paper in python is a great way to get a feel for the logistics of PCA.
ICA vs. PCA
Tutorial on performing PCA and ICA (independent component analysis) using scikit-learn (a python-based package for machine learning, which also
includes hierarchical clustering, among many other methods)
Cell Syst. 2:239
Cool application of PCA for "shallow-seq" analysis of bulk and single-cell experiments.
Day 7: Data Aggregation
Slides
Mark's slides for day 7
Day7.ipynb
Mark's Jupyter notebook for day 7
Day7.html
HTML export of the notebook
example1.cdt
Final clustered heatmap from today
example1.gtr
Corresponding tree file
S1.csv
CSV export of Sci Rep 7:42225 table S1, giving the DESeq2 results for the 24h/4h BMDM uninfected comparison.
Day 8: Differential Expression
Slides
Mark's slides for day 8
Day8a.ipynb
Mark's Jupyter notebook for day 8 (A)
Day8a.html
HTML export of the notebook
Day8b.ipynb
Mark's Jupyter notebook for day 8 (B)
Day8b.html
HTML export of the notebook
sample_table_v2.csv
Sample table
limma1.J774.Live.24-J774.uninfected.24.t0.csv
Differential genes
est_counts CDT (TPM ≥ 10)
Estimated counts for the most abundant transcripts
observed in Sci Rep 7:42225, based on running the reads from
GSE88801 through kallisto and filtering for genes with TPM ≥ 10 in at least one sample. *(I.e., same pipeline as GSE88801_kallisto_TPMs_thresh10.cdt above)*
est_counts CDT (TPM ≥ 1)
As above, but cutting for TPM ≥ 1 for comparison with the supplemental tables of the paper.
S2.xls
Sci Rep 7:42225 table S2, giving the DESeq2 results for the 8 infected vs. uninfected comparisons.
Genome Biol. 14:R95
Comparison of differential expression methods for RNA-Seq, including limma and DESeq. From 2013, so doesn't include DESeq2. (erratum)
limma user's guide
Documentation for limma
NAR 43:e47
Primary reference for limma
Genome Biol. 15:R29
Primary reference for voom (limma's RNA-Seq weighting method)
Elements of Statistical Learning
Great textbook on machine learning from a statistics point of view with very good coverage of regression methods.
Introduction to Statistical Learning
"Easier" version of the above, with examples in R.
Day 9: Differential Expression, part 2
Slides
Mark's slides for day 9
Day9_template1.ipynb
Example notebook for limma analysis of the GSE88801 data, starting from transcripts with TPM ≥ 10 in at least one sample and filtering for significantly differential transcripts with a fold change of at least 2x
Day9_template1.html
HTML export of the notebook
Day9_template1.results.tar.gz
Tarball of fit parameters and differential gene lists from this analysis.
Day9_template2.ipynb
Example notebook for limma analysis of the GSE88801 data, starting from transcripts with TPM ≥ 1 in at least one sample.
Day9_template2.html
HTML export of the notebook
Day9a.ipynb
Mark's annotated Jupyter notebook for day 9 (A)
Day9a.html
HTML export of the notebook
Day 10: Abundance Estimation
Slides
Mark's slides for day 10
Closing Slides
Mark's closing slides
Day10.ipynb
Mark's notebook for day 10
Day10.html
HTML export of the notebook
Mucci2.transcriptome.fasta.gz
Mucor circinelloides transcriptome, for analysis of reads from mBio 10:e02765
Mucor_Yields.html
Notebook exploring the yields for the full mBio 10:e02765 data set.
stream_sampler.py
Example Python 2 implementation of a reservoir sampler for FASTA, FASTQ, and SAM files. (Note that SAM files should be unsorted alignments of single-end reads. Equivalent BAM files can be sampled by piping through samtools view). For more about reservoir sampling, see this essay from Sean Eddy.
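For readers unfamiliar with the technique, the core of reservoir sampling fits in a few lines. The sketch below is a generic Python 3 version of the classic single-pass algorithm (often called Algorithm R). It is not the code in stream_sampler.py, it does no FASTA/FASTQ/SAM parsing, and the function name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(records, k, rng=None):
    """Return k records chosen uniformly at random from an iterable.

    The iterable is read once and never held in memory as a whole,
    so its length does not need to be known in advance.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep record i with probability k / (i + 1),
            # replacing a uniformly chosen reservoir slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


# Sample 10 items from a stream of 1000 without materializing the stream.
sample = reservoir_sample(iter(range(1000)), 10, random.Random(1))
```

For sequence files, each "record" would be one parsed entry (for example, the four lines of a FASTQ read) rather than a single line.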
GSE88801_kallisto.tar.gz
Unmerged kallisto results for the GSE88801 data set.