BISC-576 Practical Statistics and Bioinformatics ... - the Rohs Lab

BISC-576 Practical Statistics and Bioinformatics

Instructors:
Ting Chen, RRI-408H, Ph: 213-740-2415; [email protected]
Remo Rohs, RRI-404C, Ph: 213-740-0552; [email protected]

Units: 2

Description: This course provides basic training and practical experience in statistics and bioinformatics. Students will learn basic statistical and bioinformatics methods and apply them to state-of-the-art biological applications.

Goals:
• To develop basic analytical skills in statistics and bioinformatics.
• To gain familiarity and competency in statistics and bioinformatics software packages applicable to molecular biology, genomics analysis, and structural bioinformatics, and in their underlying principles.

Textbooks:
The Practice of Statistics in the Life Sciences (second edition) by Brigitte Baldi and David S. Moore (W.H. Freeman 2010).
Bioinformatics: Sequence and Genome Analysis (second edition) by David W. Mount (Cold Spring Harbor Lab 2004).
Introduction to Proteins by A. Kessel and N. Ben-Tal (Chapman & Hall/CRC Press, 1st Edition, 2011).

Course Contents: This course will cover three major areas of bioinformatics: statistics for biological sequence analysis, computer algorithms for sequence alignment, and molecular structural analysis. More specifically, it includes the following topics: discrete and continuous random variables, parametric and nonparametric statistics, NCBI resources, pairwise sequence alignment, multiple sequence alignment, BLAST searching, phylogenetic trees, the UCSC genome browser, clustering, analysis of high-throughput sequencing data, and molecular structure analysis and prediction.

Homework: Eight sets of homework will be assigned by the instructors. Students should hand in each homework by the specified due date. Points will be subtracted for homework submitted after the due date.

Grade: The course grade will be based upon 160 points, with 20 points for each of the eight homework assignments.

Tentative Course Schedule:

Class WK   Topic
1          Introduction to probability: discrete random variables and distributions (Chap 9 & 10, The Practice of Statistics in the Life Sciences)
2          Continuous random variables and normal distributions; confidence intervals (Chap 11 & 12, The Practice of Statistics in the Life Sciences)
3          The Central Limit Theorem (Chap 13, The Practice of Statistics in the Life Sciences)
4          Parametric hypothesis testing: t-test, F-test, chi-square test (Chap 17 & 21, The Practice of Statistics in the Life Sciences)
5          Pairwise sequence alignment (Chap 3, Bioinformatics: Sequence and Genome Analysis)
6          BLAST and Statistics (Chap 4, Bioinformatics: Sequence and Genome Analysis)
7          Multiple sequence alignment (Chap 5, Bioinformatics: Sequence and Genome Analysis)
8          Motif finding (Chap 5, Bioinformatics: Sequence and Genome Analysis)
9          Hierarchical Clustering (Chap 7, Bioinformatics: Sequence and Genome Analysis)
10         Analysis of High-throughput sequencing data (Lecture notes)
11         Secondary structure elements, structural alignment, and fold classification (Chap 1, Introduction to Proteins)
12         Homology modeling and molecular simulations (Chap 3, Introduction to Proteins)
13         Protein function annotation and prediction (Chap 2, Introduction to Proteins)
14         RNA folding and sequence-dependent DNA shape (Lecture notes)
15         Classification of protein-nucleic acid readout modes (Lecture notes)

Introduction to Probability: discrete random variables and distributions. We will introduce the basic concepts of probability in the context of DNA and protein sequences, along with the binomial distribution and the Poisson distribution. (Chap 9 & 10, The Practice of Statistics in the Life Sciences)

Introduction to Probability: continuous random variables and distributions. We will introduce the basic concepts of probability for continuous random variables, and the normal distributions. (Chap 11 & 12, The Practice of Statistics in the Life Sciences)

The Central Limit Theorem: We will introduce the central limit theorem, which is fundamental to data analysis. (Chap 13, The Practice of Statistics in the Life Sciences)

Parametric hypothesis testing: We will introduce the concept of hypothesis testing, the t-test, the F-test, and the chi-square test. (Chap 17 & 21, The Practice of Statistics in the Life Sciences)

Pairwise sequence alignment: We will introduce the dynamic programming algorithm for pairwise sequence alignment and its applications in DNA and protein sequence alignments. (Chap 3, Bioinformatics: Sequence and Genome Analysis)

BLAST and Statistics: We will introduce the basic hashing used in BLAST and the statistics of BLAST scores. (Chap 4, Bioinformatics: Sequence and Genome Analysis)

Motif Finding: In this section, we will introduce the concept of DNA motifs for protein-DNA binding sites and the representations of DNA motifs. We will discuss several algorithms for finding DNA motifs: word-count statistics, the maximum likelihood method, and the Bayesian method. (Chap 5, Bioinformatics: Sequence and Genome Analysis)

Multiple Sequence Alignments: We will introduce the neighbor-joining method and its application to multiple sequence alignments. (Chap 5, Bioinformatics: Sequence and Genome Analysis)

Hierarchical Clustering: Hierarchical clustering has many applications in biology. We will introduce the basic algorithms and three basic clustering strategies: single-linkage, average-linkage, and complete-linkage. (Chap 7, Bioinformatics: Sequence and Genome Analysis)

Next Generation Sequence Analysis: The analysis of next generation sequencing data is critical in many biological applications. We will introduce the basic algorithms for read mapping and the challenges in the analysis. We will also discuss the identification of genome variants and two basic statistical models: the likelihood ratio test and Bayesian methods. (Lecture notes)

Secondary structure elements, structural alignment, and fold classifications: This lecture will introduce alpha-helices, beta-sheets, and the Ramachandran plot as a means of identifying secondary structure elements of proteins. In addition, the basic principles of structural alignment methods will be discussed and various proteins will be aligned. We will classify protein folds according to their structural topology. (Chap 1, Introduction to Proteins)

Homology modeling and molecular simulations: Computational prediction methods are applied if an experimentally solved structure is unavailable. We will compare knowledge-based prediction methods such as homology modeling with physics-based molecular simulation approaches, including molecular dynamics and Monte Carlo methods. (Chap 3, Introduction to Proteins)

Protein function annotation and prediction: Revealing the unknown function of a protein is a primary goal of structural bioinformatics analyses. We will demonstrate how the function of a protein can be annotated or predicted based on structural homology, evolutionary conservation, electrostatic potential, and other properties. (Chap 2, Introduction to Proteins)

RNA folding and sequence-dependent DNA shape: While RNA and DNA have very similar chemical properties, they have very different biological functions. We will explain this difference by analyzing fold characteristics of RNA in comparison with nuances in the double helix as a function of its base sequence. (Lecture notes)

Classification of protein-nucleic acid readout modes: Non-specific binding in nucleosomes deforms DNA, while many transcription factors recognize DNA without major deformations but with high binding specificity. We will identify base readout (hydrogen bonding between protein side chains and base pairs) and shape readout (recognition of sequence-dependent electrostatic potential) as origins of binding specificity. (Lecture notes)

Statement for Students with Disabilities

Any student requesting academic accommodations based on a disability is required to register with Disability Services and Programs (DSP) each semester. A letter of verification for approved accommodations can be obtained from DSP. Please be sure the letter is delivered to me (or to the TA) as early in the semester as possible. DSP is located in STU 301 and is open 8:30 a.m. to 5:00 p.m., Monday through Friday. The phone number for DSP is (213) 740-0776.

Statement on Academic Integrity

USC seeks to maintain an optimal learning environment. General principles of academic honesty include the concept of respect for the intellectual property of others, the expectation that individual work will be submitted unless otherwise allowed by an instructor, and the obligations both to protect one’s own academic work from misuse by others as well as to avoid using another’s work as one’s own. All students are expected to understand and abide by these principles. Scampus, the Student Guidebook, contains the Student Conduct Code in Section 11.00, while the recommended sanctions are located in Appendix A. Students will be referred to the Office of Student Judicial Affairs and Community Standards for further review, should there be any suspicion of academic dishonesty. The review process can be found at: http://www.usc.edu/student-affairs/SJACS/.