BIOL 4800/4801 Syllabus Fa. 2014
Tools in Microbial Computational Biology
BIOL 4800/4801, Fall 2014

When: 1230-1320 W, 1230-1520 F
Where: TBD
Instructor: J. Cameron Thrash, Ph.D.
Email: [email protected]
Twitter handle: @DrJCThrash
My office: A112 Life Sciences Annex
Prerequisite: General Microbiology BIOL2051

Required readings/websites: The course will require the following books:
Practical Computing for Biologists, Haddock & Dunn
Phylogenomics, DeSalle & Rosenfeld
Course website(s): Moodle

Figure: Circular representation of multiple bacterial genomes (Grote et al. 2012 mBio)

eCommunication Policy: The best way to contact me is through email and/or Twitter. I will respond to email or Twitter messages within 6 hours, except on weekends or between 10pm and 7am. I may respond much quicker, because like you I am glued to my devices, but I do have a life outside of teaching and research (when I’m lucky). If you want an individual physical/video call appointment, email me with a short description of your issue and the desired time and duration of the meeting. This will be subject to my availability. I accept and encourage Twitter follows, but I do not accept any other social media friend requests (until graduation).

Course description. In modern biology, the need for competence in computational tools is becoming as ubiquitous as that for traditional techniques like PCR. This course will provide basic training in navigating the command-line environment, utilizing common tools for genomics and ecology, submitting jobs to High Performance Computing clusters, and managing input and output files. It is NOT a programming class. Prior computational experience is helpful, but not required, as the goal of this course is to bring neophytes to a basic level of competence with common computational biology methods. While the focus will be applying these to microbiological research, many tools are system/organism independent. Classes will take place in a computer lab (TBD) and will have access to the LSU High Performance Computing (HPC) infrastructure.
Each week will consist of two hours of theory/practical lecture along with two hours of computer laboratory exercises (plus take-home exercises). The 4800/4801 course can be taken for credit by upper-level undergraduates and graduate students equally (3 credit hours).

Course learning outcomes. By the end of this course, you should be able to:
• Understand an HPC infrastructure
• Remotely access an HPC cluster using the command line
• Navigate and manipulate the file structure within a Linux environment
• Complete basic file manipulation tasks using Linux commands
• Write basic shell scripts for parsing input and output files and sending jobs to the compute nodes
• Download genomic information from public databases directly to an HPC cluster
• Execute parallel (threaded) analyses using BLAST, HMMER, and multiple alignment tools
• Understand the modern sequencing platform methodologies, capabilities and limitations
• Execute threaded RAxML and MrBayes phylogenetic inferences from multiple alignments
• Curate genome sequencing data prior to manipulation
• Perform basic automated microbial genome assembly and annotation with the A5 pipeline
• Construct orthologous groups from a group of closely related genome-sequenced microorganisms
• Assess the core and pan-genome of a group of closely related microorganisms
How we’re going to get there (Course Philosophy and Format)

Philosophy. This course is designed to get you to a basic working knowledge of many of the common tools used in modern bioinformatics, particularly as applied to microbial genomics. This is a combined lecture/laboratory course, with the laboratory portion spent utilizing computers instead of a typical wet lab. While there will be some lecture component during the Wednesday class, as much as possible this period will have active learning exercises instead of me simply standing around talking. Extensive research on education and the neuroscience of learning has shown there are much more effective ways for us to learn than by sitting and listening to a person stand in the front of the room and talk. You don’t have to come to class to learn that way anyway, for there are endless lectures and resources available online, many from the most eminent scientists in their fields. Some of these will be part of your pre-class assignments. Therefore, I endeavor to make class time as effective as possible for stimulating your investment in the material and activating all modes of thinking. The added benefit of being able to do this work yourselves in the lab portion of the class will help complete the process.

Classroom mechanics. The one-hour Wednesday class will be chiefly concerned with theory and applications behind the various tools we will be learning to use. The three-hour Friday class will have a practical lecture for instructional purposes and then two hours of lab time to proceed through exercises designed to help you learn the tools themselves. There will be some assigned readings/podcasts/web-videos/lecture slides you will be responsible for at each class period, listed as “Readings, etc.” in the class schedule, below. I will also introduce these pre-lecture assignments each week by email. There will be a short online Moodle quiz on the material for Wednesdays that closes one hour before class.

Computational Requirements. Our classes will be conducted in a computer lab, TBD. Prior to the first class, you need to have requested an account with LSU HPC for access to the supercomputer SuperMike II. You will use this account for completing classroom exercises and major assignments. For in-class exercises, you will be using the lab computers and logging on through a terminal. For your major assignments, you will need another computer with terminal login capabilities so that you may access SuperMike II remotely. All exercises involving significant computational effort will require the use of a class computing allocation. Details for operating in the HPC environment will be presented during the first two weeks of the course.

Major Assignments. In addition to your outputs from classroom exercises, you will be responsible for a series of major assignments that will be either continuations of the exercises in the Friday lab period, more difficult/comprehensive versions of these exercises, or small team projects that will make use of one or more of the tools we have learned. All major assignments will have a written component that accompanies the results of your computer work and will be graded according to a specific rubric. These may include additional reading.

You will be graded on the following. There will be 1000 total points, graded accordingly:

Quizzes              10%        A  900-1000
In-class exercises   30%        B  800-899
Major Assignments    60%        C  700-799
                                D  600-699
                                F  < 600

Late assignments. Assignments will sacrifice 10% of their points per day they are late.

Other important information

Absences/Code of Student Conduct. You are expected to have read, to understand, and to adhere to the LSU Absence Policy (http://saa.lsu.edu/important-lsu-policies) and the Code of Student Conduct (http://saa.lsu.edu/code-student-conduct). Our goal should be to learn, not simply to get grades. In science, as in life, your integrity is one of, if not the, most valuable asset you have. Preserve it, protect it, cultivate it.
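The grading scale and late policy described above reduce to simple arithmetic. The following is a minimal Python sketch, not an official grade calculator. It assumes the 10% daily penalty is taken from the assignment's possible points (the policy could also be read as 10% of the points earned) and that a score cannot drop below zero. The function names and the example student are hypothetical.

```python
def late_adjusted(points_earned, points_possible, days_late):
    """Apply the late policy: lose 10% of the assignment's points per day late.

    Assumption: the penalty is 10% of the points possible per day,
    and the adjusted score is never negative.
    """
    penalty = points_possible * days_late / 10
    return max(0.0, points_earned - penalty)


def letter_grade(total_points):
    """Map a course total (out of 1000 points) to a letter grade."""
    for cutoff, letter in ((900, "A"), (800, "B"), (700, "C"), (600, "D")):
        if total_points >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    # Hypothetical example: a 60-point assignment scored 54, two days late.
    print(late_adjusted(54, 60, 2))
    # Hypothetical course total of 850 points.
    print(letter_grade(850))
```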
Students with Disabilities. If anyone has a disability that may require accommodation, you should immediately contact the office of Services for Students with Disabilities to officially document the needed accommodation. The instructor must be presented with this documentation during the first week of class.

Time requirements. It is expected that you will have read or viewed the assigned material prior to class for the background necessary to properly participate in the activities and think critically about the concepts addressed. As a general policy, for each hour you are in class, you (the student) should plan to spend at least two hours preparing for the next class. Since this course is for three credit hours, you should expect to spend around six hours outside of class each week reading or working on assignments for the class.

Class schedule

The schedule is preliminary and subject to change depending on how quickly we are moving through the material. Details on your pre-class readings, etc., are supplied below.

Class  Date                Subject                                          Readings, etc.    Major Assignment
1      August 27th (W)     The command line environment                     1, 2, 3, 4, 5, 6
2      August 29th (F)     HPC tutorial                                                       Complete lab
3      September 3rd (W)   Basic Linux commands                             7, 8, 9, 10, 11
4      September 5th (F)   Basic Linux commands, shell text editors                           Complete lab
5      September 10th (W)  Database access                                  12
6      September 12th (F)  Collecting and manipulating fasta/Genbank files                    MA 1
7      September 17th (W)  Local alignment and dynamic programming          13
8      September 19th (F)  BLAST                                                              MA 2
9      September 24th (W)  Multiple sequence alignment                      14
10     September 26th (F)  clustalW, MUSCLE, T-Coffee                                         MA 3
11     October 1st (W)     Genome sequencing                                15, 16
       October 3rd (F)     Fall Break
12     October 8th (W)     Curating sequencing output                       17
13     October 10th (F)    Quality scores, trimming, dereplicating                            MA 4
14     October 15th (W)    Assembly and annotation methods                  18, 19
15     October 17th (F)    A5 pipeline                                                        MA 5
16     October 22nd (W)    Hidden Markov Models                             20, 21
17     October 24th (F)    HMMER                                                              MA 6
18     October 29th (W)    Applying HMMs - pFam, SFams                      22, 23
19     October 31st (F)    SFam database searches                                             MA 7
20     November 5th (W)    Single gene phylogeny                            24, 25
21     November 7th (F)    RAxML, MrBayes                                                     MA 8
22     November 12th (W)   Orthology determination                          26
23     November 14th (F)   All vs. all BLASTP + OrthoMCL                                      MA 9
24     November 19th (W)   Assessing core and pan genomes                   27, 28, 29
25     November 21st (F)   ITEP                                                               MA 10
       November 26th (W)   Thanksgiving
       November 28th (F)   Thanksgiving
26     December 3rd (W)    Presenting MA10
27     December 5th (F)    Presenting MA10
Readings, etc. to be completed before class

1. Read the Syllabus
2. Apply for an HPC account (https://accounts.hpc.lsu.edu/login_request.php)
3. Software Carpentry Unix shell tutorials (http://software-carpentry.org/v4/shell/index.html): Introduction, Files and Directories, Creating and Deleting
4. Review the HPC@LSU website (http://www.hpc.lsu.edu), familiarizing yourself with Accounts and Allocations, the LSU HPC Usage Policy, the User Guide for SuperMike II, and the Computational Biology tools available.
5. Haddock and Dunn, Chp. 4
6. Haddock and Dunn, Chp. 20
7. Haddock and Dunn, Chp. 2
8. Haddock and Dunn, Chp. 3
9. Haddock and Dunn, Chp. 5
10. Software Carpentry Unix shell tutorials (http://software-carpentry.org/v4/shell/index.html): Pipes and Filters through remaining tutorials.
11. Software Carpentry regex tutorials (http://software-carpentry.org/v4/regexp/index.html): all.
12. DeSalle and Rosenfeld, Chp. 4
13. DeSalle and Rosenfeld, Chp. 5
14. DeSalle and Rosenfeld, Chp. 6
15. Mardis, E. R. (2008). Next-Generation DNA Sequencing Methods. Annual Review of Genomics and Human Genetics, 9(1), 387–402.
16. Metzker, M. L. (2010). Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1), 31–46.
17. TBD
18. Tritt, A., Eisen, J. A., Facciotti, M. T., & Darling, A. E. (2012). An Integrated Pipeline for de Novo Assembly of Microbial Genomes. PLoS ONE, 7(9), e42304.
19. TBD
20. Eddy, S. R. (2004). What is a hidden Markov model? Nature Biotechnology, 22(10), 1315–1316.
21. Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLoS Computational Biology, 7(10), e1002195.
22. TBD
23. Sharpton, T. J., Jospin, G., Wu, D., Langille, M. G., Pollard, K. S., & Eisen, J. A. (2012). Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource. BMC Bioinformatics, 13, 264.
24. DeSalle and Rosenfeld, Chp. 8
25. TBD
26. TBD
27. Benedict, M. N., Henriksen, J. R., Metcalf, W. W., Whitaker, R. J., & Price, N. D. (2014). ITEP: An integrated toolkit for exploration of microbial pan-genomes. BMC Genomics, 15(1), 8.
28. Grote, J., Thrash, J. C., Huggett, M. J., Landry, Z. C., Carini, P., Giovannoni, S. J., & Rappé, M. S. (2012). Streamlining and Core Genome Conservation among Highly Divergent Members of the SAR11 Clade. mBio, 3(5), e00252–12.
29. TBD

Major Assignments (MA)

1. Download the genome sequences for a set of microorganisms from one of the major databases (e.g., GenBank, IMG) using the command line, including protein fastas, nucleotide fastas, scaffold fastas, and genbank files. Practice manipulating fasta files with basic Linux commands. (50 pts)
2. Execute a threaded BLAST search of a set of proteins with similar annotations against the nr or IMG v4 databases. Curate the results by best hit. (50 pts)
3. Produce threaded multiple-sequence alignments using different programs, visualize the alignments with graphical software, and compare them by eye. (50 pts)
4. Download the raw sequencing data for a microbial genome of choice in GenBank. Dereplicate if necessary, quality trim, and prepare for assembly. (50 pts)
5. Complete an assembly of a microbial genome using A5. (60 pts)
6. Convert your multiple sequence alignments from MA3 into HMMs, and search these against a database like nr or IMG using a threaded HMMER call. (60 pts)
7. Search the ORFs from your A5 assembly against the SFam database and curate at the level of a protein family. Compare with the A5 annotations. (60 pts)
8. Execute threaded phylogenetic inferences using RAxML and MrBayes for 10 of the genes from your assembly. (60 pts)
9. Run all vs. all BLAST searches and OrthoMCL for your assembled genome with others that are closely related. (60 pts)
10. Complete core and pan-genome analysis of your genome with others that are closely related using ITEP, reporting the various additional outputs. Format for final presentation to the class. (100 pts)
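To illustrate what "curate the results by best hit" in MA 2 can look like in practice, here is a minimal Python sketch, offered only as one possible approach and not as the required method. It assumes the search results were saved in BLAST+ tabular format (`-outfmt 6` with the default twelve columns) and that "best hit" means the highest bit score per query; ties keep the first hit encountered. The function name `best_hits` and the sample rows are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return {query_id: hit} keeping the highest-bitscore hit per query.

    Assumption: "best hit" is defined by bit score; the first hit wins ties.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


if __name__ == "__main__":
    # Hypothetical results for two query proteins (made-up identifiers and scores).
    sample = (
        "q1\ts1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180\n"
        "q1\ts2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t200\n"
        "q2\ts3\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60\n"
    )
    for query, hit in best_hits(io.StringIO(sample)).items():
        print(query, hit["sseqid"], hit["bitscore"])
```

For a real results file, the same function can be given an open file handle instead of the in-memory sample.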