This page contains slides and supporting material for the 2019
version of BMS270: Practical Bioinformatics with Programming.
Please see the site updates page for
information on tracking changes.
General Resources
- Python
- Homepage for the Python programming language (download,
documentation, etc.)
- Safari Bookshelf
- The Safari bookshelf provides on-line access to a large selection of programming books, primarily from O'Reilly. Useful references for this course include: Learning Python and Programming Python. (Bioinformatics Programming Using Python is not recommended). Note that you may need to go through the Safari Bookshelf link above for full access to some of the books linked below.
- Learning Python
- Excellent introduction to Python.
- Dive Into Python 3
- Programmer-oriented introduction to Python. Faster paced than
Learning Python, but the entire book can be downloaded as a free pdf
file. Chapter 12 on parsing XML with lxml is particularly good. There is also a substantially different earlier edition based on Python 2.
(Note that as of October 2011, these books are no longer hosted at their
original website)
- Enthought Canopy
- "Scientific" python distribution -- free for academic use.
This is an easy way to
install libraries like numpy, scipy, and matplotlib on OS X and
Windows (on Linux, it is easiest to install these libraries directly
from your distribution's repository).
Fernando Perez's py4sci starter kit is another good list of useful scientific packages.
- IPython/Jupyter
- The interactive shell/notebook that we are using for this course.
See also the gallery of
example notebooks
- Matplotlib
- MATLAB-like plotting for python.
See also the gallery of
example plots
- Numpy
- Numerical library that serves as the foundation for matplotlib,
scipy, etc.
- Python Library Documentation
- Detailed documentation for all of the standard python modules.
See also http://docs.python.org/
for the full list of on-line documentation.
- Numerical Recipes
- Excellent reference for numerical methods. The older editions are available on-line at this link. The new (3rd) edition adds a chapter covering clustering and HMMs. See chapter 7.1 (and Comm. ACM 31:1192) for background on random number generators and chapters 14 and 15 for statistics and model fitting.
- VirtualBox
- The VM software that we are using for this course. For a more flexible open source alternative (which is a bit trickier to install on OS X), see QEMU. UCSF also has a license for VMWare.
- XQuartz
- An implementation of X11 for OS X -- useful for running graphical programs over an ssh connection.
- Debian
- The Linux distribution that we are using for this course. Good
choice for stability and ease of administration.
- Ubuntu
- A user-friendly Linux distribution based on
Debian. This is
the distribution that I use on my laptop. The installation CD can be
booted as a "Live CD", allowing you to try Linux with no change to your
computer.
- git
- This is the distributed version control system that we will use for this course.
Day 1: Python
- Slides
- Mark's slides for day 1
- Day1.ipynb
- Mark's Jupyter notebook for day 1
- Day1.html
- HTML export of the notebook
- Day1_extended.ipynb
- An extended and annotated version of today's notebook, showing some additional features of the example files.
- Day1_extended.html
- HTML export of the extended notebook
- Python Primer
- A first draft "intro to python" comic
Day 2: File Formats
- Slides
- Mark's slides for day 2
- TPM CDT
- Log2(TPM) estimates for the most abundant transcripts
observed in Sci Rep 7:42225, based on running the reads from
GSE88801 through kallisto and filtering for genes with TPM ≥ 10 in at least one sample.
- Day2.ipynb
- Mark's annotated Jupyter notebook for day 2
- Day2.html
- HTML export of the notebook
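The thresholding described for the TPM CDT entry above (keep genes with TPM ≥ 10 in at least one sample, then report log2 values) can be sketched in plain Python. This is an illustration only, not the pipeline used to build the course file: the function name `filter_log2_tpm`, the toy gene names and values, and the pseudocount of 1 (added so that log2 is defined for zero TPM) are all assumptions.

```python
import math

def filter_log2_tpm(tpm_by_gene, threshold=10.0, pseudocount=1.0):
    """Keep genes with TPM >= threshold in at least one sample.

    tpm_by_gene maps a gene name to a list of per-sample TPM values.
    Returns log2(TPM + pseudocount) for the genes that pass.
    The pseudocount is an assumption of this sketch, not a course setting.
    """
    kept = {}
    for gene, tpms in tpm_by_gene.items():
        if any(t >= threshold for t in tpms):
            kept[gene] = [math.log2(t + pseudocount) for t in tpms]
    return kept

# Toy data (hypothetical gene names and TPM values)
toy = {
    "geneA": [0.0, 3.0, 15.0],   # passes: one sample at 15
    "geneB": [2.0, 4.0, 9.9],    # fails: never reaches 10
    "geneC": [10.0, 0.0, 0.0],   # passes: exactly at the threshold
}
result = filter_log2_tpm(toy)
print(sorted(result))
```

Filtering on "at least one sample" rather than on the mean keeps genes that are strongly expressed in only one condition, which are often the most interesting ones for clustering.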
Day 3: Distance Metrics
- Slides
- Mark's slides for day 3
- Day3.ipynb
- Mark's annotated Jupyter notebook for day 3
- Day3.html
- HTML export of the notebook
- JavaTreeView
- Alok Saldanha's port of Michael Eisen's TREEVIEW program.
- stats.py
- Example statistical functions (day 2 homework solutions)
- CDT file format
- Documentation for the extended CDT file format from the JavaTreeView manual
- JavaTreeView.336e7a13.zip
- Alternate build of JavaTreeView, in case you're getting plug-in errors from the official build. To install on OS X:
1) Save the attached zip to your Desktop
2) In a terminal window:
mkdir JavaTreeView
cd JavaTreeView
unzip ../Desktop/JavaTreeView.336e7a13.zip
cd dist
java -jar TreeView.jar
- PNAS 95:14863 (Eisen et al.)
- Paper introducing cluster analysis for microarrays.
- supp2data.xls
- Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles) original Excel format
- supp2data.cdt
- Supplementary data for PNAS 95:14863 figure 2 (yeast expression
profiles) converted to Cluster3/JavaTreeView-compatible tab-delimited
text format (CDT)
- supp2data_samples.htm
- Table documenting the samples in supp2data.cdt
Day 4: Hierarchical Clustering
- Slides
- Mark's slides for day 4
- Day4.ipynb
- Mark's annotated IPython notebook for day 4
- Day4.html
- HTML export of the notebook
- Malcolm Gladwell TED talk
- An interesting perspective on clustering
- Bell Labs Trellis interview
- Another useful take on multivariate data analysis and visualization
Day 5: Hierarchical Clustering, part 2
- Slides
- Mark's slides for day 5
- Day5.ipynb
- Mark's IPython notebook for day 5
- Day5.html
- HTML export of the notebook
- Day5_prep.ipynb
- Notebook fitting the scaling behavior of treecluster
- Day5_prep.html
- HTML export of the notebook
- ENSEMBL_names.txt
- Table mapping ENSEMBL mouse gene names to common names.
- geneticCode.py
- The (standard) genetic code as a Python dictionary
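For readers who want to see what a genetic code dictionary looks like before opening geneticCode.py, here is a minimal sketch. The layout of the course file is not shown on this page, so the names `GENETIC_CODE` and `translate`, and the choice of "X" for unrecognized codons, are assumptions. The table itself is the standard code written in the conventional TCAG codon order.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...);
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each codon string to its one-letter amino acid.
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string in frame 0, stopping at the first stop codon.

    Codons that are not in the table (e.g. containing N) become "X";
    that convention is an assumption of this sketch.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = GENETIC_CODE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCAAGTAA"))  # prints MAK
```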
Day 6: Principal Components Analysis (PCA)
Numerical Recipes is very useful for thinking about PCA and linear modeling. Relevant sections (relative to Numerical Recipes in C) are: section 2.6 (singular value decomposition), chapter 11 sections 0 to 3 (symmetric eigensystems, plus some of the methods commonly used for implementing SVD), and chapter 15 sections 0 to 4 (least squares regression, for which SVD provides a stable implementation). I.T. Jolliffe's book on PCA is a well-written and thorough treatment from a practicing statistician's point of view.
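To make the connection between PCA and symmetric eigensystems concrete, the sketch below finds the first principal component of a small data set by power iteration on the covariance matrix. It is a teaching sketch in pure Python, not the method used in the course notebooks, and the function names and toy data are hypothetical. In practice one would normally call an SVD routine such as `numpy.linalg.svd` on the mean-centered data instead.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of rows (one observation per row)."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]

def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def first_principal_component(rows, iterations=200):
    """Leading eigenvector and eigenvalue of the covariance matrix.

    Uses plain power iteration from a uniform start vector, which would
    fail if that vector happened to be orthogonal to the leading
    eigenvector; acceptable for a sketch, not for production code.
    """
    cov = covariance_matrix(rows)
    v = [1.0 / math.sqrt(len(cov))] * len(cov)
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance along the component.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

# Toy data: two strongly correlated measurements (hypothetical values)
data = [(2.0, 1.9), (0.5, 0.7), (3.1, 3.0), (1.0, 1.3), (4.0, 3.6)]
pc, variance = first_principal_component(data)
print(pc, variance)
```

The eigenvalue returned here is the variance of the data along the first component; dividing it by the trace of the covariance matrix gives the fraction of variance explained.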
- Slides
- Mark's slides for day 6
- Day6_template.ipynb
- Jupyter notebook template for day 6
- Day6.ipynb
- Mark's Jupyter notebook for day 6
- Day6.html
- HTML export of the notebook
- What is Principal Component Analysis?
- Nice essay on PCA and factor analysis by "biological" mathematician Lior Pachter
- WIREs Comp. Stat. 2:433
- Very good PCA tutorial (associated MATLAB code is here). Working through the examples in this paper in python is a great way to get a feel for the logistics of PCA.
- ICA vs. PCA
- Tutorial on performing PCA and ICA (independent component analysis) using scikit-learn (a python-based package for machine learning, which also
includes hierarchical clustering, among many other methods)
- Cell Syst. 2:239
- Cool application of PCA for "shallow-seq" analysis of bulk and single-cell experiments.
Day 7: Data Aggregation
- Slides
- Mark's slides for day 7
- Day7.ipynb
- Mark's Jupyter notebook for day 7
- Day7.html
- HTML export of the notebook
- example1.cdt
- Final clustered heatmap from today
- example1.gtr
- Corresponding tree file
- S1.csv
- CSV export of Sci Rep 7:42225 table S1, giving the DESeq2 results for the 24h/4h BMDM uninfected comparison.
Day 8: Differential Expression
- Slides
- Mark's slides for day 8
- Day8a.ipynb
- Mark's Jupyter notebook for day 8 (A)
- Day8a.html
- HTML export of the notebook
- Day8b.ipynb
- Mark's Jupyter notebook for day 8 (B)
- Day8b.html
- HTML export of the notebook
- sample_table_v2.csv
- Sample table
- limma1.J774.Live.24-J774.uninfected.24.t0.csv
- Differential genes
- est_counts CDT (TPM ≥ 10)
- Estimated counts for the most abundant transcripts
observed in Sci Rep 7:42225, based on running the reads from
GSE88801 through kallisto and filtering for genes with TPM ≥ 10 in at least one sample. (I.e., same pipeline as GSE88801_kallisto_TPMs_thresh10.cdt above)
- est_counts CDT (TPM ≥ 1)
- As above, but cutting for TPM ≥ 1 for comparison with the supplemental tables of the paper.
- S2.xls
- Sci Rep 7:42225 table S2, giving the DESeq2 results for the 8 infected vs. uninfected comparisons.
- Genome Biol. 14:R95
- Comparison of differential expression methods for RnaSeq, including limma and DESeq. From 2013, so doesn't include DESeq2. (erratum)
- limma user's guide
- Documentation for limma
- NAR 43:e47
- Primary reference for limma
- Genome Biol. 15:R29
- Primary reference for voom (limma's RnaSeq weighting method)
- Elements of Statistical Learning
- Great textbook on machine learning from a statistics point of view with very good coverage of regression methods.
- Introduction to Statistical Learning
- "Easier" version of the above, with examples in R.
Day 9: Differential Expression, part 2
- Slides
- Mark's slides for day 9
- Day9_template1.ipynb
- Example notebook for limma analysis of the GSE88801 data, starting from transcripts with TPM ≥ 10 in at least one sample and filtering for significantly differential transcripts with a fold change of at least 2x
- Day9_template1.html
- HTML export of the notebook
- Day9_template1.results.tar.gz
- Tarball of fit parameters and differential gene lists from this analysis.
- Day9_template2.ipynb
- Example notebook for limma analysis of the GSE88801 data, starting from transcripts with TPM ≥ 1 in at least one sample.
- Day9_template2.html
- HTML export of the notebook
- Day9a.ipynb
- Mark's annotated Jupyter notebook for day 9 (A)
- Day9a.html
- HTML export of the notebook
Day 10: Abundance Estimation
- Slides
- Mark's slides for day 10
- Closing Slides
- Mark's closing slides
- Day10.ipynb
- Mark's notebook for day 10
- Day10.html
- HTML export of the notebook
- Mucci2.transcriptome.fasta.gz
- Mucor circinelloides transcriptome, for analysis of reads from mBio 10:e02765
- Mucor_Yields.html
- Notebook exploring the yields for the full mBio 10:e02765 data set.
- stream_sampler.py
- Example Python 2 implementation of a reservoir sampler for FASTA, FASTQ, and SAM files. (Note that SAM files should be unsorted alignments of single-end reads. Equivalent BAM files can be sampled by piping through samtools view). For more about reservoir sampling, see this essay from Sean Eddy.
- GSE88801_kallisto.tar.gz
- Unmerged kallisto results for the GSE88801 data set.
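The reservoir sampling idea behind stream_sampler.py above can be sketched in a few lines. This is not the course script (which is Python 2 and parses FASTA, FASTQ, and SAM records); it is a generic Python 3 version of the classic single-pass algorithm, with a hypothetical function name and toy input.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable.

    Works in a single pass without knowing the stream length in advance.
    If the stream has fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Toy stream of 1000 hypothetical read names
reads = (f"read{i}" for i in range(1000))
sample = reservoir_sample(reads, 5, random.Random(0))
print(sample)
```

For sequence files, the items in the stream would be whole records (for example, four-line FASTQ entries) rather than individual lines, so that each sampled read stays intact.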