Chapter 2 - Metagenomic Protocols and Strategies


Celia Méndez-García*, Rafael Bargiela†, Mónica Martínez-Martínez†, Manuel Ferrer†

*Carl R. Woese Institute for Genomic Biology, Urbana, IL, United States
†Institute of Catalysis, Consejo Superior de Investigaciones Científicas, Madrid, Spain

2.1  Introduction

Microbial composition has traditionally been determined through culturing, but we now know that a large proportion of the microorganisms in any ecosystem cannot be cultured with traditional tools, and their detection is only possible through DNA sequencing of their genetic fingerprint, the so-called metagenome [1]. Characterization of the metagenome can be performed through at least three types of approaches: (i) determining the composition of all microorganisms present (including those that are dead, dormant or inactive), through massive sequencing of the 16S rRNA gene amplified from DNA; (ii) identifying the potentially active bacteria that are dividing, through massive sequencing of cDNA produced from the 16S rRNA molecule; and (iii) analyzing the genomic content through massive sequencing of DNA or cDNA, or through functional screens of clone libraries. The aim of this chapter is to provide the basis and standard methods for these approaches. Before detailing the strategies for each approach, it is important to note that several methods and techniques can be used for the analysis of both the composition and the functionality of microbial communities, and that some prior circumstances can introduce important biases that influence the final result. These circumstances include the selection, preservation and transport of the sample and the method used for nucleic acid extraction, to cite just a few.

2.2  Wet-Lab Techniques for Sample Collection, Preservation, and Transport

The recommendation is to collect the samples and freeze them immediately at −80°C, although higher temperatures (up to −20°C) are also acceptable. In some cases, storage under strict anaerobic conditions is recommended to ensure the viability and representation of microbes and microbial products [2]. Freezing likely prevents changes in microbial communities until nucleic acid extraction can be performed. Quick freezing is especially important if RNA will be extracted, because RNA is easily degraded at room


temperature. Samples can be preserved for up to 2 h at room temperature in a stabilizing buffer without affecting the microbial composition. Several factors can modify the final results, such as contamination during collection, the time that elapses until the samples are frozen, freezing at insufficiently low temperatures, or thawing during transport to the laboratory. The sample's composition can change when the sample is kept at room temperature, so immediate freezing is recommended. For example, changes in the Firmicutes and Bacteroidetes phyla have been described after freezing fecal microbiota at −20°C. Frozen samples can undergo up to four freeze-thaw cycles without significantly affecting the microbial composition [3]. Based on the type of sample and the period for which it will be preserved until processing, each laboratory should choose the most suitable method according to different criteria, including price, availability, ease of use, hands-on time, and compatibility with other methods to be applied to the same sample.

2.3  Wet-Lab Techniques for Sample Pretreatment and Nucleic Acid Extraction (DNA)

Gold standard methods for nucleic acid isolation are already available. Indeed, well-established methods and commercially available kits exist to isolate DNA and cDNA, for example, from human microbiota [4–8]. For other types of samples, nucleic acid extraction can be performed directly from cooled samples using the PowerSoil DNA Isolation Kit (MoBio, USA), Meta-G-Nome DNA Isolation Kit, UltraClean MegaPrep (MoBio Laboratories, Inc.), or GNOME DNA Extraction Kit (MP Biomedicals) according to the supplier's recommendations. Detailed standard protocols to isolate high-quality and high-molecular-weight DNA have been reviewed elsewhere [9]. Several studies have shown that the size of DNA fragments extracted from environmental samples varies between less than 10 kb and more than 400 kb, depending on the sample and on the mechanical, chemical or enzymatic protocols used for DNA extraction [9]. Some samples need an enrichment approach combined with cell separation and gradient centrifugation to isolate high-molecular-weight DNA. In many cases the extraction of inhibitor-free metagenomic DNA (e.g., from polluted sediments) is also required; in this context, elimination of humic compounds and ion-exchange treatments are strongly recommended [9]. The DNA extraction procedure (together with its further size sorting) is a critical step and differs considerably between large-insert libraries (i.e., fosmid, cosmid or pBAC libraries constructed for archiving and gene-probing screens) and small-insert expression libraries (e.g., those in lambda phage and plasmid vectors, especially aimed at activity screening). The method is also critical for the effectiveness of direct DNA sequencing, as contaminants may bias the sequencing steps.

First of all, samples can be divided depending on the source, that is, liquid (e.g., marine environments) or solid samples (e.g., soil, marine or river sediments). In the following examples, we describe key preenrichment strategies to separate microbial cells from environmental samples and to isolate high-quality DNA from preenriched cells or from raw samples containing microbes.

2.3.1  Sample Handling for Surface Sea- and Freshwater Samples

Immediately after collection, water samples (approximately 2–20 L) are filtered onto a 500-kDa NMWL ultrafiltration disc (Biomax polyethersulfone, Millipore). After this, the filters are cut into strips (1 cm × 2 mm) and used directly for DNA extraction (see the protocols described in the following paragraphs). Alternatively, for large sample volumes (e.g., 100–200 L), a tangential flow filtration device such as the Pellicon TFF 0.1 μm (Millipore) can be used to separate solid particles and pico-eukaryotic organisms (2–4 μm). Part of the concentrated product (retentate solution, e.g., 10 mL) can be used for enrichment of cells with a desired supplement (for example, minimal medium supplemented with petroleum, to enrich cells able to metabolize it), or this retentate can be filtered onto a 500-kDa NMWL membrane with the Amicon system (Millipore), and the filter cut into strips and used directly for DNA extraction (see the following).

2.3.2  Sample Handling for Solid and Sediment Samples

The cloning efficiency of metagenomic fragments greatly depends on the methodology used to purify DNA fragments. Many existing methods to isolate such fragments directly from environmental samples, especially soils and sediments, are hampered by mechanical shearing due to physical forces (e.g., bead-beating) or by poor DNA quality (e.g., copurification with humic acids). Some samples can be used directly for DNA extraction, while others need more than a centrifugation step for the isolation of microorganisms; in the latter case, a Nycodenz extraction technique is suggested, as follows. Soil or sediment samples (1–5 g) are resuspended in 5–40 mL (depending on the sample properties) of TE (Tris-EDTA) buffer pH 8.0 and mixed by inverting the tube 10–15 times. The samples are then mixed with moderate shaking to release cells from the solid matrix and centrifuged at low speed (approx. 200–400×g for 1–5 min) to eliminate bulky soil particles. Place the supernatant in a new tube and centrifuge it at 6000×g for 15–30 min at 4°C. After that, discard the supernatant and keep the pellet for subsequent DNA extraction. To avoid DNA damage during purification from environmental samples, the Nycodenz extraction technique is highly recommended. During the physical separation of the bacterial fraction on a Nycodenz cushion, a whitish band of microbial biomass is obtained at the interface between the Nycodenz and the aqueous layer after high-speed density gradient centrifugation. This method has been used successfully to isolate DNA from freshwater, compost,

rhizosphere-associated soils, and pristine and contaminated sediments. The procedure is outlined as follows.

1. Prepare the sample suspension: to 15 g of sample add disruption buffer (35 mL total volume: 0.2 M NaCl, 50 mM Tris-HCl pH 8.0) and mix (preferably overnight with orbital shaking).
2. Centrifuge at low speed (approx. 200–400×g for 1–5 min) to eliminate large soil particles and then use the supernatant for biomass separation via Nycodenz.
3. Transfer 25 mL of the soil homogenate to an ultracentrifuge tube and carefully pipette 9–11 mL of Nycodenz (0.8–1.3 g mL−1) to form a layer below the homogenate.
4. Centrifuge at 10,000×g for 20–40 min at 4°C.
5. A faint whitish band containing bacterial cells resolves at the interface between the Nycodenz and the aqueous layer. Transfer this band into a sterile tube. Note that some soils contain many small particles that are not separable: they cover the Nycodenz surface, forming a solid layer mixed with the microbial biomass (this problem is typical for clay soils).
6. Add approx. 35 mL of phosphate-buffered saline (PBS) and pellet the cells by centrifugation at 10,000×g for 20 min. The cell pellet, resuspended in 0.5–2.0 mL TE buffer pH 8.0, is then ready for lysis and DNA extraction.

2.3.3  Isolation of High-Quality DNA by the Phenol:Chloroform Method Followed by DNA Cleaning

The following procedure is recommended when a high quantity of humic acid is present in the soil samples, and for Nycodenz-separated biomass; it minimizes the volume of solvents, as direct extraction of nucleic acids from soil samples requires large amounts of solvents. The protocol is as follows.

1. To the solution obtained as described previously (cell pellet in TE buffer) add 25 μL of lysozyme solution (10% w/v, prepared immediately prior to use), and incubate 2 h at 37°C (1500 rpm).
2. Add 6 μL of DNase-free RNase solution (1% w/v), and incubate approx. 30 min at 37°C.
3. Add 8 μL of Proteinase K solution (1% w/v) and 60 μL of SDS solution (10% w/v), and incubate 30 min at 50°C, until the solution becomes clear and viscous.
4. Add 100 μL of 5 M NaCl and mix gently by inverting the tube 4–6 times.
5. Add 80 μL of a cetyltrimethylammonium bromide (CTAB) solution prewarmed at 65°C (10% w/v in 0.7 M NaCl), mix gently by inverting the tube 4–6 times and incubate 10 min at 65°C.
6. The final volume should be around 748 μL (see the volume cross-check after this protocol).

7. Add 750 μL CHCl3:isoamyl alcohol, mix gently by inverting the tube 4–6 times, then centrifuge 3 min at 14,000 rpm, and quickly transfer the supernatant to a new 2-mL Eppendorf tube.
8. Add 350 μL CHCl3:isoamyl alcohol and 350 μL phenol, mix gently by inverting the tube 4–6 times, then centrifuge 3 min at 14,000 rpm, and quickly transfer the supernatant to a new 2-mL Eppendorf tube.
9. Add 650 μL CHCl3:isoamyl alcohol, mix gently by inverting the tube 4–6 times, then centrifuge 3 min at 14,000 rpm, and quickly transfer the supernatant to a new 1.5-mL Eppendorf tube.
10. Finally, add 0.6 vol. of isopropanol, mix gently by inverting the tube 4–6 times, place the suspension on ice for 10 min, centrifuge 30 min at 14,000 rpm, and discard the supernatant.
11. Add 500 μL ethanol (70% v/v) to the DNA pellet, mix gently by inverting the tube 4–6 times, centrifuge 30 min at 14,000 rpm, and discard the supernatant.
12. Quickly transfer the tube to a speed-vac and dry for about 5 min without heating.
13. Resuspend the DNA pellet in 500 μL TE buffer, pH 8.0.
14. If the quality of the DNA from the previous step is good enough, one can proceed directly to the digestion and cloning step. If not, an ultrapure DNA protocol should be applied: the DNA Clean & Concentrator from Zymo Research Corp. has been shown to be effective for purification of quality DNA. This product purifies DNA using a single-buffer system that allows efficient and selective DNA adsorption onto a matrix. Here, it is important to use at least 200 μg DNA, since the recovery of DNA is, for the majority of samples, lower than 40%.
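As a cross-check of the volume expected in step 6 (a back-calculation added here for orientation, not part of the original protocol): the additions in steps 1–5 sum to 25 + 6 + 8 + 60 + 100 + 80 = 279 μL, so a final volume of ~748 μL implies a starting cell suspension of ~469 μL of TE buffer, that is, approximately the lower end (0.5 mL) of the resuspension range given in Section 2.3.2.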

2.3.4  Isolation of High-Quality DNA by the Hexadecyltrimethylammonium Bromide Method

Total DNA can also be extracted by the hexadecyltrimethylammonium bromide (CTAB) method with some modifications. Briefly, harvested cells, separated from environmental samples as described previously, are resuspended in 750 μL lysozyme-CTAB extraction solution (8 mg mL−1 lysozyme, 2% CTAB, 1.4 M NaCl, 20 mM ethylenediaminetetraacetic acid (EDTA), 100 mM Tris-HCl, pH 8, 50 mg L−1 RNase, 0.3 M sucrose). After incubation for 2 h at 37°C to improve cell lysis, 250 μL of sodium dodecyl sulfate (SDS) 2% (w/v) is added and the solution vortexed for 1 min; then 2 μL of beta-mercaptoethanol is added and the mixture incubated 30 min at 60°C. To purify the DNA, 1 volume of chloroform:isoamyl alcohol (24:1) is added; the solution is mixed and centrifuged (12,000 rpm, 15 min). After separation of the aqueous phase, 1 volume of 2-propanol is added and the solution incubated at −20°C for 1 h to facilitate DNA precipitation. The precipitated DNA is washed with 1 volume of 70% (v/v) ethanol and dried. Finally, the DNA is resuspended in 50 μL of sterile distilled water. The purity of the extracted DNA is assessed by measuring the 260/280 and 260/230 absorbance ratios using a spectrophotometer; ratios of approximately 1.8 and 2.0–2.2, respectively, are generally taken to indicate DNA free of protein and humic contamination.


2.3.5  Isolation of High-Quality DNA by Commercial Kits

Commercial kits such as the UltraClean MegaPrep (MoBio Laboratories, Inc.), G'NOME DNA Extraction Kit (BIO101) and PowerSoil DNA Isolation Kit (MoBio, USA) may be used to isolate metagenomic DNA from eukaryotic or prokaryotic cells and tissues in less than 2 h with no organic extractions. In all cases one may follow the supplier's recommendations. Preliminary separation of the cellular biomass from the soil homogenate via a Nycodenz gradient is recommended to achieve maximal DNA recovery per gram of soil (see the preceding section); this is particularly advisable for soils highly contaminated with humic acids and other pollutants. However, samples can also be used directly. DNA purification kits from other manufacturers presumably work equally well, but they have not been tested in our laboratory. The G'NOME DNA kit uses an RNase Mix to eliminate RNA immediately after lysis and a Protease Mix to rapidly degrade cellular proteins. This is followed by a proprietary "salting out" technique that precludes the need for phenol, chloroform or other organic extractions. Preparation of metagenomic DNA using this kit is described in the following steps.

1. Immediately after collection, samples are either stored in 95% EtOH at 4°C or shock-frozen in liquid nitrogen, followed by storage at −80°C. Alternatively, 10 g of soil is directly homogenized with 40 mL of a 0.2 M NaCl, 50 mM Tris-HCl pH 8.0 buffer, mixed (overnight, 4°C, orbital shaking) and then centrifuged at low speed (approx. 200–500×g for 1–5 min) to eliminate large soil particles. Cells (plus small particles) are separated by centrifugation at high speed (9000×g for 15 min). Freshwater samples (>600 mL) are subjected to cell separation either by centrifugation (9000×g for 15 min) or filtration through a 0.20-μm filter (the filter is removed and used directly). DNA is isolated from these samples following steps 2–10. To each 0.1 g of separated cells ground in liquid nitrogen, add 1.85 mL of Cell Suspension Solution (use a 15-mL clear plastic tube for efficient mixing). Mix until the solution appears homogeneous.
2. Add 50 μL of RNase Mix, and mix thoroughly. Add 100 μL of Cell Lysis/Denaturing Solution, and mix well.
3. Incubate at 55°C for 15 min.
4. Add 25 μL of Protease Mix, and mix thoroughly.
5. Incubate at 55°C for 30 to 120 min (a longer time will result in minimal protein carry-over and will also substantially reduce residual protease activity).
6. Add 500 μL of "Salt-Out" Mixture, and mix gently yet thoroughly. Divide the sample into 1.5-mL tubes. Refrigerate at 4°C for 10 min.
7. Spin for 10 min at maximum speed in a microcentrifuge (at least 10,000×g). Carefully collect the supernatant, avoiding the pellet. If a precipitate remains in the supernatant, spin again until it is clear. Pool the supernatants in a 15-mL (or larger) clear plastic tube.
8. To this supernatant, add 2 mL of TE buffer and mix. Then add 8 mL of 100% ethanol. If spooling the DNA, add the ethanol slowly and spool the DNA at the interphase with a

clean glass rod. If centrifuging the DNA, add the ethanol and gently mix the solution by inverting the tube.
9. Spin for 15 min at 1000–1500×g. Eliminate the excess ethanol by blotting or air-drying the DNA.
10. Dissolve the genomic DNA in TE (10 mM Tris pH 7.5, 1 mM EDTA).

2.3.6  Handling Instructions for the Isolation of High-Quality Genomic DNA From Acidic Samples

Some environmental samples, such as hot spring or acid mine drainage biofilms, are extremely acidic. The extracellular polymeric substance (EPS) generated by the biofilm-forming microorganisms acts as an attachment system to energy sources in acidophilic chemolithotrophs, and as a shield against environmental low pH in neutrophilic heterotrophs. Chemically, the EPS is essentially an ion exchanger that tightly binds H+ and other cationic metallic species present in low-pH solutions. Minimal disruption of this substance produces a drastic drop in the pH of the genomic DNA extraction solution; sample storage at −80°C has the same effect. If the extraction solutions are not properly buffered, the procedure proceeds at low pH, resulting in highly sheared, and therefore low-quality, genomic DNA. Prior to storage, both acidity and metallic species should be removed from the biofilm samples. A practical way to remove metallic species is to wash the biofilm samples with a sulfuric acid solution of lower pH than the environmental sample. Subsequent buffering of the sample can be done with a Tris-based buffer containing 150 mM NaCl, 100 mM EDTA, and 1 M Tris. Buffering with phosphate-containing buffers is not advised, since their buffering capacity is lost during the first washing steps due to precipitation of phosphates at low pH. Nevertheless, some compositional changes could occur during the washing steps, and minimally processed samples are desirable for shotgun metagenomics. Commercial DNA extraction kits include buffer solutions with different buffering capabilities. A comparison between traditional phenol:chloroform-based and commercial kit-based DNA extractions showed that, for pH 2 acidic biofilms, the PowerSoil DNA Isolation Kit (MoBio, USA) provided the best quality of extracted DNA from fresh samples (processed within 2 h after collection), and it is therefore highly recommended for this type of material.

2.3.7  Isolation of High-Quality DNA From Fecal Samples

Although the previous methods can be used for any type of sample, for samples derived from fecal material we recommend the following protocol. Fecal samples are stored in RNAlater (Life Technologies) at −80°C until use. The samples are defrosted overnight at 4°C and total DNA is then extracted as follows. Briefly, samples (at least 0.4–1.0 g) are

diluted in phosphate-buffered saline (PBS) at a ratio of 1 g of sample per 3 mL of PBS. Then, they are centrifuged at 2000 rpm at 4°C for 2 min to remove fecal debris. Total DNA is extracted from the pelleted cells with the QIAamp DNA Stool Kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions.

2.3.8  Rapid Amplification of DNA Using Phi29 DNA Polymerase

For samples yielding very low amounts of DNA, multiple displacement amplification (MDA) of the whole genomic DNA (also called whole genome amplification, WGA) is possible using the DNA polymerase of the phage phi29 (Φ29). This enzyme, the replicative polymerase of the Bacillus subtilis phage phi29, combines exceptional strand displacement and processive synthesis properties with an inherent 3′ → 5′ proofreading exonuclease activity, giving it extreme processivity and high fidelity. To perform MDA, the Illustra GenomiPhi DNA Amplification Kit (GE Healthcare) constitutes an easy and broadly used solution. The protocol includes the following steps:

1. Mix sample buffer (9 μL) with template DNA (1 μL containing 10 ng of DNA template).
2. Heat to 95°C for 3 min to denature the DNA. Cool to 4°C on ice. Denaturation can also be performed chemically, by mixing 9 μL of a denaturation solution containing 400 mM KOH and 10 mM EDTA with 1 μL of DNA template for 3 min, then neutralizing with 1 μL of a solution containing 400 mM HCl and 600 mM Tris-HCl (pH 7.5). Store on ice.
3. For each amplification reaction, combine on ice 9 μL of the reaction buffer with 1 μL of enzyme mix and transfer the master mix to the cooled sample.
4. Incubate the samples at 30°C for 1.5 h.
5. Inactivate the Phi29 DNA polymerase by heating the samples to 65°C for 10 min; then cool to 4°C.
6. The amplified material will then be ready for downstream procedures and can be stored at −20°C.

Yields from a GenomiPhi DNA Amplification Kit reaction are 4–7 μg per 20 μL reaction when starting with 10 ng of purified DNA. Increased reaction times (2 h) may be helpful for samples such as blood or oral swabs. According to the manufacturer's protocol, the average product length is greater than 10 kb, and DNA replication is extremely accurate due to the proofreading 3′ → 5′ exonuclease activity of the phage Φ29 DNA polymerase.

2.4  Wet-Lab Techniques for RNA Isolation

Prior to RNA extraction, the samples need to be stabilized with RNAprotect Bacteria Reagent (Qiagen GmbH, Hilden, Germany). The samples are mixed with this reagent, incubated for

5–15 min at room temperature and centrifuged for 15 min at 4000×g and 4°C. The supernatant is discarded, and the pellet frozen at −70°C. In a deep freezer, the RNA is expected to remain intact for up to one month. Prior to RNA extraction, the samples are thawed on ice. RNA isolation is performed using the RNeasy Mini Kit (Qiagen GmbH) according to the manufacturer's instructions. The concentration of the RNA is assessed using a Nanodrop system (NanoDrop, Wilmington, DE), and the integrity is estimated from the 28S/18S ratio using an Agilent 2100 Bioanalyzer (Agilent, Palo Alto, CA). Alternatively, total RNA is extracted from pelleted cells, previously separated as described before, using the RiboPure-Bacteria Kit (Ambion) and then treated with DNase I. The RNA extractions are verified by standard agarose gel electrophoresis and quantified with a Nanodrop-1000 spectrophotometer (Thermo Scientific). Once the DNA and RNA materials representing the total and active metagenome, respectively, are ready, they can be analyzed by the multiple techniques detailed here. The DNA and RNA concentration can be measured using a Quant-iT dsDNA Assay Kit (Invitrogen, Paisley, UK), Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen) or a Nanodrop-1000 spectrophotometer (Thermo Scientific) prior to amplification and sequencing. At least 1–2 μg of material is needed.

2.5  Meta-Taxonomy Through the 16S rRNA Gene: From Wet-Lab Techniques to Bioinformatics

Meta-taxonomy is the most commonly used strategy for characterizing the composition and relative abundances of microbial communities and their evolution as a function of time or other environmental variables. Samples typically contain a large number of microorganisms, primarily bacteria but also fungi, viruses and archaea. Most bacteria from these ecosystems are difficult to culture and, even when culturing is possible, the study of the microbiota by culture can be very laborious. Therefore, different techniques based on high-throughput sequencing have been developed that allow the identification of microorganisms from their genomic sequences. The most common approach to studying these microorganisms is based on DNA extraction from all the microorganisms of a given sample (using the methods described previously), amplification of the 16S ribosomal RNA gene (16S rRNA), sequencing of the amplified products, clustering of the obtained sequences according to their similarities, and taxonomic classification to identify the relative abundance of each bacterium. Gold standard methods are available in the literature [10–12]. These analyses can be performed at the phylum, class, order, family, or genus level. Depending on the methodology in use and the length of the sequenced fragment, the analysis can be performed with accuracy at the genus level or only at the family level. The 16S rRNA gene is considered an evolutionary chronometer, because it is highly conserved in all organisms and is used universally for

taxonomic identification. The secondary structure of the 16S rRNA gene is highly conserved throughout evolution, presenting variable spacer regions where changes accumulate slowly. These regions have enough variability to differentiate bacteria that are closely related phylogenetically. The full-length bacterial gene is large enough (approx. 1500 bp), and there are large databases of 16S rRNA sequences. To identify active microorganisms, we need to extract the RNA, transform it into cDNA by reverse transcription, and then sequence the cDNA; thus, we only identify microorganisms that are dividing. The techniques and technologies are the same as those for DNA-based meta-taxonomy, except that the starting material is the cDNA rather than the DNA molecule. Gold standard methods are available in the literature [4–8]. In summary, the most common approach is the extraction of total DNA or RNA from a sample, amplification of the 16S rRNA gene with universal primers and then massive sequencing of the amplicons. The appropriate taxonomic group is assigned to each sequence through searches in public databases such as the Ribosomal Database Project (RDP), and the results are then analyzed using bioinformatic tools, first validating the quality (identity ≥98%) and length (≥200 bp) of the sequences. The usual order of the bioinformatic analysis is as follows: (1) quality control of the sequences; (2) removal of chimeric sequences; (3) clustering of the sequences by similarity and overlapping features; (4) taxonomic assignment; and (5) statistical analysis to determine significant differences. For the sake of simplicity, only the analysis of 16S rRNA sequences from DNA samples will be described.

2.5.1  Wet-Lab Techniques for Amplification and Sequencing of 16S rRNA

2.5.1.1  Wet-lab techniques for amplification and sequencing of 16S rRNA by the cultivation approach

For taxonomic analysis based on the 16S rRNA gene, different regions of the gene can be used, although the most commonly used are the hypervariable V1/V3, V3/V4 or V5/V6 regions. An evaluation of DNA extraction kits and different hypervariable regions can be found in Burbach et al. [13]. The following primer pairs can be used at a final concentration of 0.4 μM: 27f/1492r for Bacteria [14, 15]; 20f/958r [16, 17] for the Archaea domain; and ARM979F/ARM1365R [18] for "ARMAN"-like microorganisms. An annealing temperature of 55°C for 30 cycles has delivered good results for all three primer sets. The PCR products should be cleaned, for example using Illustra GFX PCR DNA and Gel Band Purification Columns (GE Healthcare, UK), and quantified, optimally with Qubit (Thermo Fisher). The 16S rRNA gene sequences are cloned using, for example, the pGEM-T Easy Vector System I (Promega, USA) according to the manufacturer's directions. After plating, positive transformants are screened via PCR amplification of the inserts with flanking vector primers (M13F 5′-GTTTTCCCAGTCACGAC-3′; M13R 5′-GAAACAGCTATGACCATG-3′).

Amplicons from clone libraries are subjected to Sanger sequencing using the BigDye Terminator v3.1 sequencing kit (Applied Biosystems), resolving the sequences by capillary electrophoresis on an ABI PRISM 3130xl Genetic Analyzer (Applied Biosystems).

2.5.1.2  Wet-lab techniques for small subunit (SSU) rRNA hypervariable tag sequencing

The pros of the technology described in Section 2.5.1.1 include the possibility of obtaining full-length 16S rRNA gene sequences, allowing the phylogenetic analysis to be deepened even to the species level. Cons are the high reagent and time costs and the limited diversity coverage, as one is restricted to the number of clones selected for sequencing. For this reason, in samples where high diversity is expected, it is highly recommended to perform so-called SSU rRNA hypervariable tag analysis. Briefly, the 16S rRNA gene is amplified from DNA material and the amplification product is directly subjected to sequencing, avoiding the tedious cloning step otherwise needed by the standard protocol described in Section 2.5.1.1. The protocol, which is essentially based on the same primer pairs described previously, is described in the following paragraphs. For taxonomic analysis based on the 16S rRNA gene, different hypervariable regions of the gene can be used, although the most commonly used are the V1/V3 or V3/V5 regions. An evaluation of DNA extraction kits and different hypervariable regions can be found in Burbach et al. [13]. The most recently updated primer pair information for amplification of the 16S rRNA gene in each taxonomic group is summarized in Table 2.1. Targeted DNA next-generation sequencing (NGS) library preparation, compatible with multiple sequencing platforms, can nowadays be performed in an automated way using amplicon-based library preparation (up to 48 samples) with the Access Array system from Fluidigm (https://www.fluidigm.com/products/access-array). Fluidigm includes integrated fluidic circuits (IFCs) and controllers to create amplicon libraries (up to 384 if using barcoding) using the Access Array platform. The primers listed in Table 2.1 can be used to manually amplify the indicated regions in the laboratory. Annealing temperatures are included in Table 2.1, although it is advisable to test a gradient of temperatures under each specific laboratory condition. Both annealing and elongation times should be adjusted according to the polymerase specifications and the length of the PCR product. The preceding primers can be coupled to 8–10-nucleotide multiplexing tags so that samples can be pooled [19]. As a general method, amplification of the selected 16S rRNA region can be performed in 25 μL reactions using, for example, the AmpliTaq Gold 360 Master Mix (Applied Biosystems, USA) following the instructions provided

Table 2.1: List of primers used for 16S rRNA gene profiling

Target | Primer Names (F/R) | Primer Sequences (5′ to 3′, F/R) | Expected Product Length (nt) | Annealing Temperature (°C)
16S V1-V3 | V1-V3 F28 / V1-V3 R519 | GAGTTTGATCNTGGCTCAG / GTNTTACNGCGGCKGCTG | 643 | 58
16S V3-V5 | V3-V5 F357 / V3-V5 R926 | CCTACGGGAGGCAGCAG / CCGTCAATTCMTTTRAGT | 694 | 52
16S V4 | V4 515F / V4 806R | GTGCCAGCMGCCGCGGTAA / GGACTACHVGGGTWTCTAAT | 252 | 50
16S V4 (new) | V4 515F (new) / V4 806R (new) | GTGYCAGCMGCCGCGGTAA / GGACTACNVGGGTWTCTAAT | 252 | 50
Archaea | Arch349F / Arch806R | GYGCASCAGKCGMGAAW / GGACTACVSGGGTATCTAAT | 528 | 55
Eukaryotic 18S | F566Euk / R1200Euk | CAGCAGCCGCGGTAATTCC / CCCGTGTTGAGTCAAATTAAGC | 765+ | 64
Eukaryotic 18S | Euk_1391F / EukBr-7R | GTACACACCGCCCGTC / TGATCCTTCTGCAGGTTCACCTAC | 200–280 | 61
Fungal ITS1-ITS4 | ITS1 / ITS4R | TCCGTAGGTGAACCTGCGG / TCCTCCGCTTATTGATATGC | 580+ | 61
Fungal ITS3-ITS4 | ITS3F / ITS4R | GCATCGATGAAGAACGCAGC / TCCTCCGCTTATTGATATGC | 462+ | 61

Expected product lengths after amplification at the indicated annealing temperature are given.

by the manufacturer, with primers at a final concentration of 0.4 μM. Consensus working PCR conditions include a denaturation step of 95°C for 10 min, followed by 30 cycles of 95°C for 45 s, 55°C or 60°C (the annealing temperatures for 27f/338r and 44f/519r, respectively) for 45 s and 72°C for 1 min, with a final extension cycle of 72°C for 10 min. The products obtained from separate amplification reactions (conditions as given previously) can be pooled and purified using Gel Band Purification Columns (GE Healthcare, UK), and the DNA concentration determined using the Quant-iT PicoGreen dsDNA Assay Kit (Applied Biosystems, USA). The uniquely tagged, pooled DNA samples can then be sequenced. The following protocol is especially suited to DNA from marine samples, although it can also be used for the analysis of DNA from other environmental origins. It is described for the analysis of eight different samples, as a case study, although it can be extended to a higher or lower number of samples by simply adapting the number of tags used to mark each of them. Assays are performed using universal bacterial primers targeting the variable regions V1-V3 of the 16S rRNA gene (27F mod 5′–AGRGTTTGATCMTGGCTCAG–3′; 519R mod bio 5′–GTNTTACNGCGGCKGCTG–3′), amplifying a fragment of approximately 400 bp.

The amplified 16S rRNA regions contain sufficient nucleotide variability to enable the identification of bacterial species. Multiplex identifiers (MIDs) specific to each sample can be used: TCCAGTAC for Sample 1, TCCAGGTG for Sample 2, TCATCTCC for Sample 3, TCATGGTT for Sample 4, TCATTGTT for Sample 5, TCCACGTG for Sample 6, TCAGTAAG for Sample 7, and TAGGATGA for Sample 8. PCR reactions are performed in a final volume of 50 μL with 40 ng of sample DNA, 0.3 μmol/L of each primer, 1× PCR Buffer with 1.5 mmol/L MgCl2, 0.2 mmol/L of each dNTP and 1 U of HotStarTaq Plus Master Mix Kit (Qiagen, Valencia, CA, USA). The PCR cycling procedure is as follows: 94°C for 3 min, followed by 28 cycles at 94°C for 30 s, 53°C for 40 s and 72°C for 1 min; a final elongation at 72°C for 5 min is performed. After PCR, all amplicons can be purified using Agencourt AMPure beads (Agencourt Bioscience Corporation, MA, USA) and equal amounts sequenced. The following protocol is recommended for the analysis of 16S rRNA genes from fecal material. For each sample, a region of the 16S rDNA (SSU) gene is amplified by PCR using the universal primers E8F (5′–AGAGTTTGATCMTGGCTCAG–3′) and 530R (5′–CCGCGGCKGCTGGCAC–3′). For species-level analysis, the primers 27F (5′–AGAGTTTGATCCTGGCTCAG–3′) and 338R (5′–TGCTGCCTCCCGTAGGAGT–3′) are used. The amplified region comprises the hypervariable regions V1, V2 and V3 for the first primer pair and V1 and V2 for the second. In both cases a sample-specific multiplex identifier (MID) is added for pyrosequencing. PCR is performed under the following conditions: 95°C for 2 min, followed by 25 cycles of 95°C for 30 s, 52°C for 1 min and 72°C for 1 min, and a final extension step at 72°C for 10 min. Amplification is verified by electrophoresis in an agarose gel (1.4%). PCR products are purified using the QIAquick Gel Extraction Kit (QIAGEN) or the NucleoFast 96 PCR Clean-Up Kit (Macherey-Nagel). Different sequencing platforms may be used; an extensive description will be given in Section 2.6.1.

2.5.1.3  Wet-lab techniques for mRNA amplification

Total RNA can be directly retrotranscribed to obtain the 16S rRNA (as a measure of active bacteria) as well as the rest of the RNA, particularly the mRNA. Prior to mRNA amplification, to remove the maximum amount of rRNA, the rRNA is first depleted with the MICROBExpress Kit (Ambion), which captures and removes rRNA (16S rRNA, 23S rRNA) by hybridization. Second, one can use the mRNA-ONLY Prokaryotic mRNA Isolation Kit (Epicentre), which uses a terminator 5′-phosphate-dependent exonuclease that specifically digests rRNAs due to the presence of 5′ monophosphate groups. Finally, mRNA is linearly amplified using the MessageAmp II-Bacteria Kit (Ambion), which adds poly(A) tails to the

mRNAs. To retrotranscribe the total RNA and the amplified mRNA into single-stranded cDNA, the High-Capacity cDNA Reverse Transcription Kit (Ambion) can be used. To synthesize double-stranded cDNA (ds-cDNA), standard procedures can be used. Details of the mRNA amplification method are reported by Pérez-Cobas et al. [5]. Once isolated, the cDNA material is ready for amplification and sequencing of the 16S rRNA by the cultivation approach or by small subunit (SSU) rRNA hypervariable tag sequencing, following the protocols described previously.

2.5.2  Bioinformatic and Statistical Analysis of 16S rRNA Amplicons

For bioinformatic analysis, specific knowledge is required. Although many of the programs can be run on Windows and Unix systems, Unix is becoming the standard in bioinformatics. All the programs described in this chapter can be downloaded for free from the web. The bioinformatic analysis is based on the following steps:

•	Quality control of the sequences.
•	Elimination of chimeric sequences.
•	Grouping of the sequences on the basis of similarity (clustering).
•	Taxonomic assignation.

Generally, the datasets provided by sequencing centers have already passed a quality control phase. If not, the user should carry out a filtering and trimming step to eliminate short or low-quality sequences until the average quality is high enough to guarantee a good taxonomic assignation. In the case of sequences from Illumina MiSeq/NextSeq 500 systems, the R1 and R2 reads (forward/reverse) need to be joined to obtain a full-length amplicon. This step is done after quality control in order to ensure proper annealing of the sequences, maintaining a high level of quality in the overlap region. One of the most widely used analysis platforms is QIIME (http://qiime.org), which allows performing most of the analyses needed for the investigation of microbial communities. Once installed, QIIME carries with it all the programs described in this chapter, making the command lines accessible and uniform. For more information, please review the manual on the QIIME webpage. In this section we explain the commands needed in each step of the taxonomic analysis. All steps are integrated and normalized in the QIIME platform, allowing users to apply the preferred parameters or modify them. The native output format of IonTorrent-based systems (similar to that of the 454 technology) is easily translated to FASTQ. In some cases, the sequences still contain contamination with adaptor or primer sequences, which can be eliminated using the program "cutadapt" (https://cutadapt.readthedocs.io/en/stable).
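For example, a 5′ primer can be removed as follows (a minimal sketch; the primer shown is the V1-V3 forward primer from Table 2.1 and is only illustrative, and options should be checked against the manual of the installed cutadapt version):

•	cutadapt -g GAGTTTGATCNTGGCTCAG -o reads_trimmed.fastq reads.fastq

Here -g specifies the 5′ primer to trim and -o the output file.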

2.5.2.1  Detection of amplification products and amplicon purification

The overall quality of the sequence files can be visualized using the "fastQC" program (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
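A minimal usage sketch (file names are illustrative; the output directory must already exist, and fastQC writes an HTML report per input file):

•	fastqc infile.fastq -o qc_reports/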





For quality control, "prinseq-lite" can be used; among other functions, it allows elimination of short sequences, quality trimming, statistical summaries, etc. (http://prinseq.sourceforge.net/). This program is a Perl script that can be installed on any operating system and runs on the command line. The following example is proposed:

•	prinseq-lite.pl -fastq infile.fastq -out_good infile_cleaned -out_format 3 -min_len 200 -min_qual_mean 20 -out_bad null -trim_qual_right 30 -trim_qual_window 20 -trim_qual_type mean

This line runs the script prinseq-lite.pl on the input file infile.fastq and generates new files with the base name infile_cleaned (-out_good infile_cleaned) in FASTQ format (-out_format 3). The minimum length of each sequence is 200 nucleotides (-min_len 200) and the minimum mean quality is 20 (-min_qual_mean 20), discarding all sequences that do not meet the required parameters (-out_bad null). Quality trimming is performed per base from the right end wherever the quality falls below 30 (-trim_qual_right 30), in a window of 20 nucleotides (-trim_qual_window 20), using the mean (-trim_qual_type mean). We recommend checking the manual on the web or running the following command:

•	prinseq-lite.pl -h

In the case of paired reads, such as those produced by the Illumina MiSeq technology, the program "fastq-join" from "ea-utils" (https://github.com/ExpressionAnalysis/ea-utils) can be used. An example of the command line is as follows:

•	fastq-join readsFile_1.fq readsFile_2.fq -o reads.%.fq

This line executes fastq-join, joining from their overlapping ends the reads of the files readsFile_1.fq and readsFile_2.fq, and generates as output a file with the sequences that could be joined (reads.join.fq) and two files with those that could not (reads.un1.fq and reads.un2.fq). The file reads.join.fq is the one to be used in the following steps. Some sequences will remain in reads.un1.fq and reads.un2.fq but, if the joining process went well, their number should be low.

In some cases, it is necessary to transform a file in "fastq" format into "fasta" format. For that, the command fastq_to_fasta from the FASTX-Toolkit suite, which contains a number of tools to manage sequence data (http://hannonlab.cshl.edu/fastx_toolkit/), can be used. For example, to convert "fastq" to "fasta" one can use the following command line:

•	fastq_to_fasta -i reads.join.fq -o reads.joined.fasta

2.5.2.2  Checking presence of chimeras

The next step is the elimination of chimeric sequences that can be generated during the amplification process. These sequences can be detected when their two ends differ significantly in their phylogenetic assignations. One of the programs that can be used in this

process is "usearch", which is designed for searching and clustering of massive sequencing data (http://drive5.com/usearch). An example of the command line for the chimeric sequence elimination step is as follows:

•	usearch -uchime_ref reads.join.fasta -db 16s_ref.udb -uchimeout reads_noChimera.fasta -strand plus

With this line, "usearch" finds chimeric sequences in the dataset reads.join.fasta using the reference database 16s_ref.udb, formatted for "usearch" (see below), and generates an output file. The command finishes with the parameter -strand plus which, according to the manual, is still in an experimental phase but is used to define the orientation of the sequences. In this case, probably the most important decision is the choice of the reference database. On the "usearch" website, the authors suggest using the reference databases of other programs: CS_Gold from Chimera Slayer of the Microbiome Utilities Portal of the Broad Institute (http://microbiomeutil.sourceforge.net/) or RDP_Gold from the Ribosomal Database Project (https://rdp.cme.msu.edu). Both databases are ready to be downloaded from the "usearch" web:

•	CS_Gold: http://drive5.com/uchime/gold.fa
•	RDP_Gold: http://drive5.com/uchime/rdp_gold.fa
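The downloaded FASTA reference must first be converted into "usearch" database format; a minimal sketch (the -makeudb_usearch command is available in recent usearch versions and should be checked against the manual of the installed version):

•	usearch -makeudb_usearch rdp_gold.fa -output 16s_ref.udb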

2.5.2.3  Grouping of sequences based on similarity and clustering

Several programs can be used for the definition of sequence clusters as OTUs (operational taxonomic units); in the following, the functioning of "usearch" and "CROP" is explained. The first uses the uparse algorithm, integrated in the "usearch" program under the command -cluster_otus, widely used because of its speed on large volumes of data (see the sketch below for the OTU-generation step). Once the OTU representatives (otus.fasta) have been generated, the reads are mapped back to them to build an OTU table:

•	usearch -usearch_global reads_noChimera.fasta -db otus.fasta -strand plus -id 0.97 -otutabout otu_table.txt

In this case, the reads in the file reads_noChimera.fasta are compared against the OTU representatives in otus.fasta, assigning to each OTU those sequences that share a minimum identity of 97% with it. A table with the names of the OTUs and the sequence counts contained in each of them (otu_table.txt) is also generated.
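A minimal sketch of the OTU-generation step itself (command names follow usearch v9/v10 conventions and should be checked against the installed version; -fastx_uniques dereplicates the reads and -minsize 2 discards singletons):

•	usearch -fastx_uniques reads_noChimera.fasta -fastaout uniques.fasta -sizeout
•	usearch -cluster_otus uniques.fasta -otus otus.fasta -minsize 2 -relabel Otu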

The other program for clustering is CROP, in which there is no need to specify a fixed similarity value; CROP uses an unsupervised Bayesian probabilistic system with a flexible clustering threshold based on Gaussian distance distributions, which reduces the effects of PCR and sequencing errors (https://github.com/tingchenlab/CROP). Using the program CROP, the minimum command line is:

•	CROPLinux -i reads_noChimera.fasta -o otus.fasta -s

where the -s option clusters with a Gaussian distribution of distances between alignments centered at 97% similarity (i.e., approximately species level). CROP also generates a table of clusters and a list of all OTUs with the sequences contained in each of them.

2.5.2.4  Taxonomic assignations

For taxonomic assignation, several programs can be used, each based on a different approach. Among these, the most widely used are the rdp_classifier, the utax algorithm from "usearch", BLAST and SINA. To assign sequences using "usearch" and the -utax command, an appropriately formatted database is needed, which can be obtained with the following command:



•	usearch -makeudb_utax refdb.fa -output refdb.udb -taxconfsin 500.tc

where refdb.fa is our reference database. This allows using different databases, such as that of the rdp_classifier. When an assignation at the species level is needed, we advise using a classification from the greengenes databases, available at http://greengenes.secondgenome.com/downloads, where the latest versions of the 16S ribosomal sequence databases can be found. Following this, one can continue with the taxonomic assignation using the command:

•	usearch -utax otus.fasta -db tax.udb -utaxout utax.txt -strand both

which classifies the sequences from the file otus.fasta using the previously generated database tax.udb and produces an annotation table utax.txt, trying to classify the sequences in both directions (-strand both).

The other program for taxonomic assignation is the rdp_classifier from the previously mentioned Ribosomal Database Project (https://sourceforge.net/projects/rdp-classifier/), which is a Bayesian-type classifier. As downloaded, the program comes with all database versions preinstalled and updated. Through its databases, the rdp_classifier allows classifying, down to the genus level, bacterial and archaeal 16S rRNA sequences as well as fungal internal transcribed spacer (ITS) and large subunit (LSU) rRNA sequences. For example:

•	rdp_classifier classify -q otus.fasta -o taxonomy.rdp -c 0.8 -f fixrank

In the above line we defined the input file (-q otus.fasta), the output file (-o taxonomy.rdp), the minimum required assignation confidence (-c 0.8), and the taxonomic ranks to be reported in the resulting table which, in this case, will be limited to domain, phylum, class, order, family and genus (-f fixrank). If needed, the classifier can be trained on other databases. The QIIME platform contains a simplified method to obtain a classification using the rdp_classifier program or other databases (e.g., greengenes).
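As a sketch of the equivalent QIIME (version 1.x) wrapper (script name and options as in the QIIME 1.x documentation; file names are illustrative):

•	assign_taxonomy.py -i otus.fasta -m rdp -c 0.8 -o taxonomy/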

2.5.3  Expression of Results

Through the bioinformatic methods described previously, one obtains the taxonomic analysis of the selected sequences. Results can be expressed in the form of relative abundances of the different phyla, classes, orders, families, genera and, when possible, species in each of the samples, so that the values can be compared. The most common data allow resolution up to

genus level and, rarely, species level. To compare the results between samples and between the different taxonomic groups, and to evaluate richness (the number of species) and diversity, one can use alpha and beta diversity indices. Alpha diversity indices describe a single community: richness estimators (such as Chao1) estimate the number of species present, while heterogeneity indices (such as Shannon) reflect both the number of species and the evenness of their abundances. Beta diversity metrics (such as Bray-Curtis dissimilarity or UniFrac distances) compare the community composition between samples. Results can be obtained quantitatively (weighted metrics, which take into account the abundance of the observed microorganisms) or qualitatively (unweighted metrics, which consider only the presence or absence of microorganisms).
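A sketch using QIIME 1.x scripts (assuming an OTU table in BIOM format and, for UniFrac, a phylogenetic tree of the OTU representatives; file names are illustrative):

•	alpha_diversity.py -i otu_table.biom -m chao1,shannon -o alpha_div.txt
•	beta_diversity.py -i otu_table.biom -m bray_curtis,weighted_unifrac -t rep_set.tre -o beta_div/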

2.6  Metagenomics: Sequencing Methods and Bioinformatics

Metagenomic studies based on massive sequencing of DNA have been applied to study how differences in microbial composition translate into differences in gene content. This method, developed in the 1980s and 1990s, is known as "shotgun" sequencing and allows the reading of DNA or RNA fragment sequences without previous amplification. It requires a large amount of computation, which usually makes the process more expensive. The set of all these fragments is considered representative of the set of bacterial genomes present in the original community. At present, metagenomics is a basic discipline for the study of microbial communities from a physiological and functional point of view. Thanks to massive sequencing and bioinformatic analysis, it is possible to elucidate the metabolic routes and functions of the bacteria through comparative analysis between the sequenced DNA fragments and the databases. In addition to the functional information, with the growth of the databases and the improvement of the collected information, metagenomics also provides taxonomic data that are often more reliable than those from other sources. With this strategy, genetic information on potential new biocatalysts and enzymes, genomic associations between different organisms and their phylogenetic and evolutionary profiles, etc., can be uncovered. Metagenomic sequencing avoids the important bias introduced during the PCR process, as the resulting fragments are randomly drawn from all the genomes existing in the original sample. Even so, the results may still be influenced by the number of 16S rRNA gene copies, which varies between species. Nevertheless, this approach provides a more reliable view of the actual taxonomic distribution of the microorganisms. A review by Thomas et al. [20] is suggested for an overview of the protocols applied to metagenomics. The basic bioinformatic steps to analyze metagenomic samples can be summarized as follows: (1) quality control; (2) assembly of the resulting reads; (3) annotation without assembly (steps 1 and 3); (4) annotation of reads corresponding to identified contiguous regions of the genome (known as assembled contigs) (steps 1, 2 and 4); and, finally, (5) taxonomic characterization of the reads or contigs.
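As an illustration of steps (1), (2) and (4), a minimal command-line sketch (the chapter does not prescribe specific tools; the assembler MEGAHIT and the gene caller Prodigal are used here only as widely available examples, and all file names are illustrative):

•	megahit -1 reads_R1.fastq -2 reads_R2.fastq -o assembly/
•	prodigal -i assembly/final.contigs.fa -p meta -a predicted_proteins.faa -o gene_coords.gbk

The predicted proteins can then be annotated by comparison against reference databases.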


2.6.1  Sequencing Methods for Community DNA and cDNA Analysis

2.6.1.1  Advances in high-throughput DNA sequencing

The efforts of the National Human Genome Research Institute to reduce the cost of genome sequencing to US $1000 within 10 years of the sequencing of the first human genome, more than a decade ago, led to the development of next-generation sequencing (NGS) technologies, now broadly used in the field of microbial ecology. NGS can be categorized into short-read and long-read sequencing. Short-read sequencing includes the now obsolete 454 pyrosequencing method, the Solexa/Illumina platform, the Sequencing by Oligo Ligation Detection (SOLiD) system, and the Ion Torrent technology. The most relevant long-read sequencers include the MinION system from Oxford Nanopore Technologies and the Pacific Biosciences (PacBio) RS II sequencer.

Overview of short-read sequencing platforms

454 pyrosequencing. 454 Life Sciences (acquired by Roche in 2007) released the first NGS technology (pyrosequencing) in 2005 at a much lower price than the automated Sanger sequencer. Pyrosequencing falls into the sequencing-by-synthesis category and relies on single-nucleotide addition into an elongating strand bound to beads in a PicoTiterPlate, along with beads containing an enzyme cocktail. dNTPs are added iteratively and, if incorporation into the strand occurs, an enzymatic cascade takes place and results in a bioluminescence signal, which is captured by a charge-coupled device camera. The pros of this technology include the generation of long reads (up to 1 kb) and relatively fast run times. Cons are the high reagent cost, high error rates in homopolymer repeats, and, most importantly, the shutdown of 454 and of its platform support in mid-2016. As a result, 454 pyrosequencing became an obsolete NGS technology in just over a decade.

Solexa/Illumina. In 2006, the Solexa/Illumina sequencing platform was launched (Illumina acquired Solexa in 2007). The technology falls into the sequencing-by-synthesis category and relies on cyclic reversible termination, similar to Sanger sequencing. To begin the process, a small sequence, complementary to an adapter region, is linked to the DNA target so that amplification of double-stranded DNA starts in this region. During the elongation process, a cocktail of 3′-blocked dNTPs is added, so that incorporation of the fluorophore can be recorded every cycle. Illumina offers the highest throughput of all platforms on the market and the lowest per-base cost, and read lengths of up to 300 bp can now be generated with this system. The cons are technically challenging sample loading and the requirement of sequence complexity in the sample due to the high throughput.

SOLiD. Sequencing by Oligo Ligation Detection (SOLiD) by Applied Biosystems (currently Life Technologies) was launched in 2007. This technology falls into the sequencing-by-ligation category, and involves hybridization and ligation of a labeled probe and anchor

sequences to a DNA strand. SOLiD utilizes two-base-encoded probes and universal bases that bind between the probe and the template. The anchor fragment encodes a known sequence that is complementary to an adapter sequence where ligation is initiated. After ligation, the template is imaged and the known bases in the probe are recognized. A new cycle begins after complete removal of the anchor-probe complex and regeneration of the ligation site. The pros of this technology include its high throughput and lower error rates. Cons are the generation of the shortest reads of all platforms and long run times.

Ion Torrent. In 2010, Ion Torrent (now Life Technologies) released the Personal Genome Machine (PGM), the first NGS platform without optical sensing. The Ion Torrent platform detects the H+ ions that are released as iteratively delivered dNTPs are incorporated. The resulting pH shift is proportional to the number of nucleotides incorporated. The pros of this technology are that optical sensing and fluorescent nucleotides are not required, as well as the fast run times. Cons include high error rates in the detection of homopolymers.

Overview of long-read sequencing platforms

Short-read sequencing datasets can be complemented with continuous long-read sequencing to compensate for the limitations of short reads. There are two main types of long-read sequencing technologies: single-molecule real-time sequencing (the PacBio RS II platform and the MinION system) and synthetic approaches that rely on existing short-read technologies to generate longer reads in silico (Illumina synthetic long-read sequencing and the 10× Genomics emulsion-based platform).

PacBio RS II. Currently, the most widely used long-read approach is the single-molecule real-time (SMRT) methodology of Pacific Biosciences (PacBio). The instrument utilizes a flow cell with thousands of individual picoliter wells with an optical waveguide at the bottom that guides light energy into a small volume (the so-called zero-mode waveguides, ZMWs). PacBio fixes the polymerase at the bottom of the well and allows the DNA strand to progress through the ZMW. The color and duration of the light emitted as the labeled nucleotide pauses briefly during incorporation at the bottom of the ZMW are visualized and recorded continuously with a laser and camera sensor system. The polymerase cleaves the fluorophore bound to the dNTP during incorporation, and the fluorophore diffuses away before DNA synthesis continues. The PacBio RS II is currently the most widely used instrument and is capable of generating average read lengths of 10–15 kb. However, a major limitation of the system is the high error rate (up to 15%), with indel errors dominating. Nevertheless, sufficiently high coverage can overcome the high error rate. The limited throughput and high costs of the PacBio RS II, in addition to the need for high coverage, place this instrument out of reach of many small laboratories. To overcome this limitation, PacBio has recently launched the more affordable Sequel system, which delivers ~7× more reads with 1 million ZMWs per SMRT cell.

ONT MinION system. MinION from Oxford Nanopore Technologies (ONT) was the first prototype of a nanopore sequencer (in 2014). Sequencing of an ssDNA molecule is carried out when the DNA template is processed by a primary and a secondary protein, the actions of which cause a measurable, k-mer-specific voltage shift. The current MK1 MinION flow cell structure contains a chip with 512 individual channels that are capable of sequencing at ~70 bp/s, with an expected increase to 500 bp/s in 2016. The new PromethION instrument is an ultrahigh-throughput platform reported to include 48 individual flow cells, each with 3000 pores running at 500 bp per second. It yields ~2–4 Tb for a 2-day run on a fully loaded device, placing it in potential competition with the Illumina HiSeq systems. The ONT MinION is a small (~3 × 10 cm for the MK1) USB-based device that runs off a personal computer. This affords the MinION superior portability, underlining its advantage for rapid clinical responses and remote field locations. Nevertheless, adjunct equipment is still required for library preparation (e.g., a thermocycler); dispensing with these essentials could further reduce the space required for a fully functional sequencing system in the near future. Theoretically, a DNA molecule of any size can be sequenced on the machine, but in practice there are some limitations in the sequencing of very long DNA fragments. The ONT MinION has a high error rate (up to 30%), with indels dominating. Effective homopolymer sequencing also remains a challenge for this system. Modified bases constitute a challenge to the device as well, as a modified base will alter the typical voltage shift for a given k-mer. Nevertheless, recent improvements in the chemistry and the base-calling algorithms are improving the precision of this technology.
There are currently two systems available for generating synthetic long reads: the Illumina synthetic long-read sequencing platform and the 10× Genomics emulsion-based system. Synthetic long reads are generated by partitioning large DNA fragments into either microtiter wells or an emulsion in such a way that very few molecules exist in each division. The template fragments are then sheared and barcoded within each partition. This approach allows for sequencing on existing short-read sequencers. The resulting data are demultiplexed by barcode and reassembled with the knowledge that fragments sharing a barcode derive from the same original large fragment. Thus, synthetic barcoded reads provide an association among small fragments derived from a larger one. By segregating the fragments, repetitive or complicated regions can be isolated and assembled locally. The Illumina system (formerly Moleculo) partitions DNA into a microtiter plate and does not require specialized instrumentation. The 10× Genomics GemCode and Chromium instruments use an emulsion to partition DNA and require a microfluidic device to perform presequencing reactions.
Illumina synthetic long-reads. The Illumina synthetic long-read procedure uses the existing Illumina infrastructure. Accordingly, the throughput and error profile are the same as those of current Illumina devices. To access this technology, an additional kit for long-read sequencing

needs to be purchased. Nevertheless, one disadvantage is that, as a consequence of how the DNA is partitioned, the system requires more coverage than a typical short-read project, thus increasing the associated costs.
10× Genomics emulsion-based platform. The 10× Genomics emulsion-based platform uses an existing short-read infrastructure for sequencing. The microfluidic device is a one-time additional equipment cost, and the emulsion technique needs only 1 ng of starting material. This is advantageous for samples in which low amounts of DNA are recovered, such as biopsies. At present, data output from the GemCode instrument is limited by the number of barcodes that can be used and by a DNA partitioning scheme that leaves room for improvement. In February 2016, 10× Genomics introduced the new Chromium System, a major upgrade of the existing GemCode platform. Although the chemistry has remained the same, the number of possible micelle partitions has increased 10-fold and the number of barcodes 5-fold.
2.6.1.2  Single-end, paired-end, and mate pair sequencing in Illumina platforms
Single-end or single-read sequencing involves sequencing of the DNA strand from only one end, and is the simplest way to utilize Illumina sequencing. Paired-end sequencing allows users to sequence both ends of a DNA fragment. These reads (forward and reverse reads from the same fragment) may or may not overlap. Paired-end sequencing enables detection of genomic rearrangements and repetitive sequence elements, as well as gene fusions and novel transcripts. Mate pair sequencing involves the generation of long-insert paired-end DNA libraries, which are useful for a number of sequencing applications, including de novo sequencing, genome finishing, structural variant detection, or identification of complex genomic rearrangements. Combining data generated from mate pair library sequencing with that from short-insert paired-end reads provides a powerful combination of read lengths for maximal sequencing coverage across a genome or a metagenome.
2.6.1.3  Choice of sequencing technology for metagenomics studies
Since the Illumina platform is the most widely used system for high-throughput sequencing of environmental genomic DNA, we will consider this system in all the following descriptions. The highest throughput is achieved using the HiSeq platform. Technology-wise, a good sequencing strategy to obtain quality raw data for metagenomic analysis would aim to acquire the longest reads possible. For this reason, paired-end or mate pair sequencing options are normally preferred. In 2013, the Illumina TruSeq Synthetic Long-Reads technology, enabling the generation of multi-Kb synthetic reads, was introduced. A combination of short reads and long synthetic reads has been demonstrated to be highly effective in resolving complex populations and detecting rare microorganisms in environmental samples.


2.6.2  Bioinformatic Analysis of Shotgun Sequencing Data From Metagenomic Samples
2.6.2.1  Generating quality shotgun sequencing data: Preliminary advice for library preparation
One of the important considerations in preparing environmental samples for shotgun metagenomic sequencing is amplification. Whether it is needed depends on the nature of the sample (e.g., DNA yields from samples such as water, swabs, or biopsies are low, which necessitates amplification). Isolation of a sufficient amount of DNA (around 250–500 ng), allowing an amplification-free library preparation, is highly preferred. Shotgun metagenomics requires quality DNA that will be sheared into fragments of sizes that can be converted into library fragments sequenceable at full length. The Illumina TruSeq PCR-Free Library Preparation Kit is frequently used for metagenomics library preparation. Briefly, the steps to follow include fragmentation of the DNA, repair of fragment ends, library size selection, adenylation of 3′ ends, adaptor ligation, and normalization and pooling of libraries.
2.6.2.2  How to proceed with raw sequencing data generated by Illumina instruments
The FASTQ format
The format in which Illumina (and also Ion Torrent) sequencing data is delivered is FASTQ. FASTQ files include DNA sequence information with quality metadata. The structure of the data contained in a FASTQ file includes a header line that starts with @, followed by an ID and an optional description, a row containing the sequence, a plus sign, and a row containing as many quality score symbols as nucleotides sequenced. Example:
@unique_sequence_ID optional_description
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAATTTATGATAAAAGAATAC
+
=-(DD--DDD/DD5:*1B3&)-B6+8@+1(DDB:DD07/DB&3((+:?=8*D+DDD+B)*)B.8CDBDD4DDD@@D

FASTQ files can hold hundreds of millions of records and are normally very large. Each base call is associated with a quality score (Q). Raw Illumina data include a Phred score for each base sequenced, where Q = −10 × log10(P), with P being the probability that the base call is erroneous. For example, a Q score of 20 denotes a 1:100 chance that the base is called incorrectly, a Q score of 30 a 1:1000 chance, and so on. Illumina Q scores are generally regarded as accurate. The conversion of quality scores to symbols is based on the ASCII character code (http://www.ascii-code.com/).
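As a quick illustration (a minimal sketch, assuming the standard Phred+33 encoding used by current Illumina instruments), the Q score behind any quality symbol can be recovered directly in the shell:
$ printf '%d\n' "'D"   # ASCII code of the quality symbol 'D' (prints 68)
$ echo $((68 - 33))    # subtract the Phred+33 offset: Q = 35
A Q score of 35 corresponds to an error probability of 10^(−3.5), i.e., roughly 3 erroneous calls in 10,000.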

Quality check and trimming of sequence data: FastQC, trimmomatic, and fastx-toolkit
The first step after having received the raw data from the sequencing facility in FASTQ format is to perform a quality check and to trim low-quality bases from sequence read ends. Usually, the sequencing center will provide all information needed to proceed with the analysis. Raw data are normally stored on FTP (File Transfer Protocol) servers, which may require login data to access the information. An FTP client is a computer program that allows the transfer of data

between computers. Widely used free FTP clients include FileZilla and Cyberduck, with the latter being a specific solution for Mac OS X users. Raw data files are normally compressed due to their large size. The compression methods normally utilized are “tar” and “gzip,” used individually or simultaneously: [file_name].fastq.gz or [file_name].fastq.tgz.

To uncompress in Unix systems:
$ gunzip [file_name].fastq.gz
$ tar -xzvf [file_name].fastq.tgz
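As a quick sanity check after decompression, the number of reads can be counted; since each FASTQ record spans exactly four lines, the read count is the line count divided by four (the file name is a placeholder):
$ zcat [file_name].fastq.gz | wc -l   # divide the result by 4 to obtain the read count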

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) allows quality control checks of the generated sequences. The software outputs a distribution plot of the qualities per position across all reads. The graphical output is very useful for immediately detecting low-quality regions. In addition to the quality of each sequenced base, it will give an idea of the presence and abundance of contaminating sequences, the average read length, and the GC content. The quality of sequence reads delivered by Illumina often decreases toward the end of the read. Poor quality at the read ends can be resolved by using quality trimmers such as trimmomatic (http://www.usadellab.org/) or fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), to name two of the most broadly used. Left-over adapter sequences in the reads can be removed by using adapter trimmers, such as trimmomatic. As a practical feature, trimmomatic can perform quality and adapter trimming at the same time. Once the trimming has been performed, it is advisable to rerun the data through FastQC to ensure that the quality of the reads is optimal to proceed with downstream analyses. To use FastQC in Unix:
$ fastqc ../raw-data/[file_name].fastq.gz
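As an illustration of the trimming step, a typical trimmomatic run for paired-end data might look as follows (a sketch only: the jar version, adapter file, and quality thresholds are illustrative and should be adapted to each dataset):
$ java -jar trimmomatic-0.36.jar PE [file_name].R1.fastq.gz [file_name].R2.fastq.gz [file_name].R1.trimmed.paired.fastq.gz [file_name].R1.trimmed.unpaired.fastq.gz [file_name].R2.trimmed.paired.fastq.gz [file_name].R2.trimmed.unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
This clips TruSeq adapters, removes leading/trailing bases below Q3, applies a 4-base sliding-window cut at an average of Q15, and discards reads shorter than 36 bp; the *.trimmed.paired.fastq files are the ones used in the assembly commands below.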

Both quality and adapter trimming software output results in the .fastq format. Quality-checked and trimmed reads are then ready for further analyses.
2.6.2.3  Metagenome assembly
Biologists who need to assemble shotgun metagenomic sequence data are faced with the challenge of choosing, from a myriad of options, the assembly strategy and software best suited for their experiment. This choice is made even harder by the fast development of new assembly tools, which is driven by advances in sequencing technologies. All assembly approaches rely on the simplistic premise that highly similar DNA fragments originate from the same coordinates within a genome. The similarity between regions of DNA sequences is then used to join the individual fragments into larger contiguous sequences, or contigs, thereby recovering the information lost during the sequencing process.

The assembly job is complicated by the fact that, often, this underlying assumption is incorrect. A clear example of how this assumption fails is provided by genomic repeats, which yield fragments with highly similar sequences that originate from different positions in a genome. Comparably, in metagenomic samples, nearly identical sequences may originate from different genomes within the sample. The impact of such artifacts on the assembly depends on the length of the sequence reads. Repetitive regions shorter than a sequence read can be automatically resolved. Moreover, segments of the genome that diverge by less than the error rate of the sequencing instrument cannot easily be distinguished by an assembler. Possible solutions to overcome these problems involve different assembly practices, which are discussed below.
Strategies used for metagenome assembly

The strategies used by sequence assemblers can be organized into three major methodologies: greedy, overlap-layout-consensus (OLC), and de Bruijn graph.
Greedy assembly. Many early assemblers developed in the 1990s, such as phrap (http://www.phrap.org/) or the J. Craig Venter Institute (JCVI) TIGR Assembler, relied on the greedy paradigm. This assembly method always joins the reads that overlap best, as long as they do not contradict the already constructed assembly.
Overlap-layout-consensus assembly. OLC assemblers first identify every pair of reads that overlap sufficiently well, and then organize this information into a graph containing a node for every read and an edge between any pair of overlapping reads. This graph structure allows the development of complex assembly algorithms that can take into account the global relationship between the reads. A variant of this approach called string graph simplifies the global overlap graph by removing “transitive edges” containing redundant information.
De Bruijn graph-based assemblies. De Bruijn graph assemblers model the relationship between exact substrings of length k, or k-mers, extracted from the sequencing reads. As in the OLC approach, the data are organized into a graph; here, the nodes represent k-mers, and the edges indicate that the adjacent k-mers overlap by exactly k − 1 letters (for example, the 4-mers ACTA and CTAG overlap in the 3-mer CTA, so the read ACTAGC is represented by the path ACTA → CTAG → TAGC). Whereas the reads themselves are not directly modeled in this paradigm, they are implicitly represented as paths through the de Bruijn graph. Most de Bruijn graph assemblers use the read information to refine the graph structure and to remove graph patterns that are not consistent with the reads. Because each sequencing error introduces up to k spurious k-mers, error correction approaches used both before and during assembly are crucial for obtaining high-quality assemblies. The de Bruijn approach dominates the design of modern (meta)genome assemblers that use short-read sequencing data as input, such as Meta-IDBA, MetaVelvet, SOAPdenovo, or metaSPAdes. De Bruijn graph assemblers are, however, hampered by sequencing errors and are not as effective with longer and more inaccurate sequence reads.

De Bruijn graph-based approaches have thus been most successful in assembling highly accurate short reads, whereas longer reads (>200 bp, such as Roche 454 and Sanger sequencing data) are better handled by OLC-based approaches.
Broadly used de Bruijn graph assemblers

Meta-IDBA (http://i.cs.hku.hk/~alse/hkubrg/projects/metaidba/). Meta-IDBA is an iterative de Bruijn graph de novo short-read assembler, specially designed for de novo metagenomic assembly. One of the most difficult problems in the assembly of shotgun metagenomic data is that similar strains of the same species group together and yield an almost intractable de Bruijn graph. Meta-IDBA handles this problem by clustering analogous regions of similar strains through partitioning of the graph into subcomponents based on topology. Each component represents a similar region among strains from the same species or from different species. After the components are separated, all contigs within the same cluster are aligned to produce a consensus. To run Meta-IDBA:
1. Convert .fastq data to FASTA format (.fa) using the fq2fa tool:
$ bin/fq2fa fq_file fa_file

2. Assembly:
$ metaidba --read read_file [--output out] [options]

The option --connect is used to connect paired-end read components: $ metaidba --read [reads_file].fa --output out --connect
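For paired-end data, the two FASTQ files can first be merged into a single interleaved FASTA file with the fq2fa tool as distributed with the IDBA package (file names are placeholders):
$ bin/fq2fa --merge [file_name].R1.fastq [file_name].R2.fastq [reads_file].fa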

MetaVelvet (http://metavelvet.dna.bio.keio.ac.jp/). MetaVelvet (and its extension MetaVelvet-SL, which utilizes supervised learning) performs de novo metagenome assembly. MetaVelvet decomposes a de Bruijn graph constructed from mixed short reads into individual subgraphs and builds scaffolds based on each decomposed de Bruijn subgraph, treating it as the genome of an isolated species. MetaVelvet makes use of two features inherent to preassembly data for the decomposition of the de Bruijn graph: coverage (abundance) difference and graph connectivity. For simulated datasets, MetaVelvet succeeds in generating significantly higher N50 scores, defined as the minimum contig length needed to cover 50% of the (meta)genome, than any single-genome assembler. MetaVelvet also reconstructs relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial read data, MetaVelvet produced longer scaffolds and increased the number of predicted genes. To run MetaVelvet, three commands are needed:
1. velveth to import read sequences and construct a k-mer hash table; the k-mer size has an important effect on assembly results: larger k values bias assembly results towards more

abundant species, while lower k values recover lower-abundance species. If the short reads are longer than 65 bp, the developers advise setting the k-mer length above 51. To run in Unix:
$ velveth [out_dir] [k-mer_length] -fastq -shortPaired [file_name].R1.trimmed.paired.fastq [file_name].R2.trimmed.paired.fastq

2. velvetg to construct an initial de Bruijn graph:
$ velvetg [out_dir] -exp_cov auto -ins_length 500

or $ velvetg [out_dir] -exp_cov [manually_determined_exp_cov--see_below] -ins_length 500

3. meta-velvetg to generate scaffolds: $ meta-velvetg [out_dir] -ins_length 500

Contrary to single-genome assemblers (e.g., Velvet and SOAPdenovo), MetaVelvet assumes metagenomic settings. Accordingly, the k-mer coverage histogram might be multimodal rather than unimodal. Coverage peak parameters can largely affect MetaVelvet assembly results. Although simple and automatic peak detection algorithms are implemented in meta-velvetg, manual inspection of k-mer coverage peaks and manual setting of the coverage peak parameters are strongly recommended. To manually set the coverage peak parameters:
1. Execute velveth, velvetg, and meta-velvetg as indicated previously and check that “out_dir/meta-velvetg.Graph2-stats.txt” is created.
2. Draw a k-mer coverage histogram and manually determine the k-mer coverage peaks by inspecting the resulting graph. In R:
$ R
(R) > install.packages("plotrix")
(R) > library(plotrix)
(R) > data = read.table("out_dir/meta-velvetg.Graph2-stats.txt", header = TRUE)
(R) > weighted.hist(data$short1_cov, data$lgth, breaks = seq(0, 200, by = 1))

3. Run meta-velvetg as indicated before with manual setting of coverage peaks.
SOAPdenovo (http://soap.genomics.org.cn/soapdenovo.html). SOAPdenovo is a short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program was specially designed to assemble Illumina Genome Analyzer short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way. The new version, SOAPdenovo2, is now available. SOAPdenovo2 implements a new algorithm that reduces memory consumption in graph construction, resolves more repeated regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes.

To run SOAPdenovo for metagenome assembly (step 1 runs the whole pipeline with a single command; alternatively, steps 2–5 run the individual stages):
1. $ SOAPdenovo-63mer all -s config_file -K 63 -R -o graph_prefix 1>ass.log 2>ass.err
2. $ SOAPdenovo-63mer pregraph -s config_file -K 63 -R -o graph_prefix 1>pregraph.log 2>pregraph.err
  OR
  $ SOAPdenovo-63mer sparse_pregraph -s config_file -K 63 -z 5000000000 -R -o graph_prefix 1>pregraph.log 2>pregraph.err
3. $ SOAPdenovo-63mer contig -g graph_prefix -R 1>contig.log 2>contig.err
4. $ SOAPdenovo-63mer map -s config_file -g graph_prefix 1>map.log 2>map.err
5. $ SOAPdenovo-63mer scaff -g graph_prefix -F 1>scaff.log 2>scaff.err
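The config_file referenced above describes the input libraries. A minimal sketch for a single paired-end library is shown below (the read length, insert size, and file names are placeholders to adapt to each dataset):
#maximal read length
max_rd_len=150
[LIB]
#average insert size of the library
avg_ins=500
#0 = forward-reverse read orientation (standard paired-end)
reverse_seq=0
#use reads for both contig and scaffold assembly
asm_flags=3
#paired-end FASTQ files
q1=[file_name].R1.trimmed.paired.fastq
q2=[file_name].R2.trimmed.paired.fastq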

MetaSPAdes (http://bioinf.spbau.ru/en/spades). MetaSPAdes addresses various challenges of shotgun metagenomic sequence data assembly by exploiting computational ideas that have proven useful in single-cell genomics and in the assembly of highly polymorphic diploid genomes. MetaSPAdes first constructs the de Bruijn graph of all reads using SPAdes, transforms it into the assembly graph using graph simplification procedures, and reconstructs paths in the assembly graph that correspond to long genomic fragments within a metagenome. MetaSPAdes works across a wide range of coverage depths and attempts to maintain a trade-off between the accuracy and the contiguity of assemblies. To face the microdiversity challenge, metaSPAdes focuses on reconstructing a consensus backbone of a strain mixture, thus ignoring some strain-specific features corresponding to rare strains. MetaSPAdes addresses a number of challenges in metagenomic assembly and implements several novel features, such as: (i) efficient assembly graph processing to address the microdiversity challenge; (ii) a new repeat resolution approach that utilizes rare strain variants to improve the consensus assembly of strain mixtures; and (iii) fast algorithms for constructing assembly graphs and error-correcting reads. The field of metagenomic assembly also faces technological challenges caused by innovations in sequencing and library preparation techniques. As described previously, the creation of mate pair libraries and the emergent TruSeq Synthetic Long Reads (TSLR) technology have the potential to significantly improve assembly quality. However, metagenomic assemblers have not yet caught up with these technological innovations, and metaSPAdes now faces the challenge of incorporating them into its assembly pipeline. Because metaSPAdes expects a paired-end library, the most generic way to run it is:
$ spades.py --meta -1 [reads_1].fastq -2 [reads_2].fastq -o out_dir

Quality of the assembly

Determining whether an assembly is correct and comparing the quality of different assemblies of the same data set are difficult, given that the correct answer is usually unknown. Commonly used quality measures are: (i) total size, (ii) number of contigs generated, and (iii) the weighted median contig size (N50). Even if correctly used, N50 values rarely correlate with the actual quality of the assembly, and are meaningless in situations in which the goal is to reconstruct multiple sequences that are present in the sample at varying levels of

abundance, as occurs in metagenomics. For metagenomic assemblies, a crucial metric is the percentage of reads that map back to the assembly. A percentage of matching reads of 95% or higher would be indicative of good assembly practices. Many different mappers are available for mapping reads back to assemblies. They usually output a SAM or BAM file (http://genome.sph.umich.edu/wiki/SAM), formats that contain the alignment information; BAM is the binary version of the plain-text SAM format. In this example pipeline we include instructions for bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml):
1. Create an index:
$ bowtie2-build assembly_contigs.fa assembly_contigs_index

2. Align reads: # Single-end reads: $ bowtie2 -x assembly_contigs_index -U reads.fq -S alignment_SE.sam

# Paired-end reads: $ bowtie2 -x assembly_contigs_index -1 reads_1.fq -2 reads_2.fq -S alignment_PE.sam

3. Use samtools view to convert the SAM file into a BAM file. $ samtools view -bS alignment.sam > alignment.bam
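The percentage of reads mapping back to the assembly can then be read directly from the alignment file with samtools flagstat, whose output includes, among other statistics, the number and percentage of mapped reads:
$ samtools flagstat alignment.bam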

MetaQUAST (http://bioinf.spbau.ru/metaquast) aligns metagenome assemblies against reference genomes. To address metagenome assemblies, MetaQUAST adds several new features over QUAST: (i) the ability to use an unlimited number of reference genomes, (ii) automated species content detection, (iii) detection of chimeric contigs (interspecies misassemblies), and (iv) significantly redesigned visualizations. To run MetaQUAST:
1. Contigs obtained after assembly are partitioned into groups, each of which contains sequences mapped to a particular reference genome based on previously generated alignments. Contigs mapped to several genomes are included in every matching group, and unaligned contigs are combined into one extra group.
$ python metaquast.py [contigs_1].fasta [contigs_2].fasta … -R reference_1,reference_2,…
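When no suitable reference genomes are at hand, MetaQUAST can also be run without the -R flag; in that case it attempts to detect the species content automatically and to retrieve the corresponding references (feature (ii) above). A minimal sketch, with the output directory as a placeholder:
$ python metaquast.py [contigs_1].fasta [contigs_2].fasta -o metaquast_out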

2.6.2.4  Metagenome binning
Finding the microbial origin of metagenome assemblies is increasingly important for understanding the contribution of individual microbes to the functioning of the entire community. This process, known as binning, can be performed by analyzing the compositional similarity between the metagenomic assemblies and the sequences of known bacterial genomes and plasmids. This can be done by analyzing the frequencies of tetranucleotides. In this process,

the differential abundance of contigs across samples can also be analyzed to estimate the abundance of each organism. Computational resources for metagenome binning based on nucleotide composition include: (i) the R packages qgraph, igraph, and pvclust; (ii) tetramerFreqs; (iii) Databionic ESOM tools; and (iv) 2T-binning. Tools based on differential abundance are: (i) Multi-metagenome, (ii) the MGS Canopy algorithm, and (iii) Databionic ESOM tools. Software based on both nucleotide composition and differential abundance comprises: (i) MetaBAT, (ii) CONCOCT, (iii) MaxBin, (iv) GroopM, and (v) Databionic ESOM tools. Owing to their broad use and to the theoretical superiority of methods combining nucleotide composition with differential abundance, we include a brief pipeline for the use of MetaBAT and the Databionic ESOM tools in the command line of Unix systems. An excellent comparison of the methodological features of different metagenome binning tools based on nucleotide composition combined with differential abundance can be found in a recent publication by Sangwan et al. [21].
MetaBAT (https://bitbucket.org/berkeleylab/metabat)

MetaBAT integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. It is advised to use several diverse samples to achieve better binning results, and to have a quality metagenome assembly to perform the binning. To use MetaBAT, SAMtools (http://samtools.sourceforge.net/), BWA (http://bio-bwa.sourceforge.net/), and BamM (http://ecogenomics.github.io/BamM/) are required. SAMtools offers various functions to edit alignments in the SAM (sequence alignment/map) format, including indexing, merging, sorting, and generating alignments in a per-position format. The three BWA algorithms (backtrack, SW, and MEM) align sequences ranging from short reads (up to ~100 bp; backtrack) to longer sequences of 70 bp to 1 Mb (SW and MEM) against large reference genomes, such as the human genome. BamM produces indexed and sorted BAM files, which contain only the reads that mapped. To run MetaBAT in its most generic way:
1. Map reads to the assembly:
$ bamm make -d [file_name].assembly.contigs.fa -c [file_name].R1.trimmed.paired.fastq.gz [file_name].R2.trimmed.paired.fastq.gz -o out_dir

2. Run MetaBAT:
$ runMetaBat.sh [file_name].assembly.contigs.fa /out_dir/[file_name].assembly.contigs.trimmed.paired.bam
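The runMetaBat.sh wrapper combines two underlying programs that can also be invoked separately when finer control over parameters is needed (a sketch, with placeholder file names):
$ jgi_summarize_bam_contig_depths --outputDepth depth.txt out_dir/*.bam
$ metabat -i [file_name].assembly.contigs.fa -a depth.txt -o bins/bin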

Databionic ESOM tools (http://databionic-esom.sourceforge.net):

Databionic ESOM Tools are a suite of programs to perform data mining tasks like clustering, visualization, and classification with emergent self-organizing maps (ESOMs). By borrowing from informaticians working on artificial neural networks, bioinformaticians came up with an algorithmically guided procedure that “reduces” the higher-dimensional vectorial space to a two-dimensional ESOM. In an ESOM, each point represents a fragment of an assembled scaffold. Dark lines between clusters show definitive separation of genome bins, and colors designate the genome bin of each scaffold fragment. To use the ESOM tools:
1. Prepare the metagenome assembly for ESOM using esomWrapper.pl (in https://github.com/tetramerFreqs/Binning/blob/master/esomWrapper.pl):
$ perl esomWrapper.pl -path fasta_folder -ext fa

The esomWrapper.pl program will generate six files: a .lrn file (learn file = table of tetranucleotide frequencies), a .cls file (class file = list of the class of each contig, which can be used to color data points, etc.), and a .names file (list of the names of each contig), as well as a .suffix file, a .ann file, and a .log file. 2. Start ESOM (X11 must be enabled): $ ./esomana

3. Load .lrn, .names, and .cls files 4. Normalize the data (optional, but recommended) by using, for example, RobustZT. 5. Train the data. # Using the graphical user interface (GUI): • Tools > Training. Use default parameters with the following exceptions: ◦ Training algorithm: K-batch ◦ Number of rows/columns in map: ~five to six times more neurons than there are data points. E.g., for 12,000 data points (window, NOT contigs), use 200 rows × 328 columns (~65,600 neurons). ◦ Start value for radius = 50 (increase/decrease for smaller/larger maps). • No more than 20 epochs • Hit “START”—training will take 10 min to many hours depending on the size of the data set and parameters used. # From the terminal: • At this point, you may also choose to add additional data (like coverage) to your contigs. You may do so using the addInfo2lrn.pl script or by simply using the flag -info in esomTrain.pl.

• The esomTrain.pl script may be used to train the data without launching the GUI. This script will add the additional information to the .lrn file (using -info), normalize it, and train the ESOM.
• To view the results of the training, simply launch ESOM as in step 2.
6. Analyze the output:
• The best view (see the VIEW tab) is achieved with the UMatrix background, tiled display. Use Zoom, Color, and Bestmatch size to get the desired view. Viewing without data points drawn (uncheck “Draw bestmatches”) helps to see the underlying data structure.
• Use the CLASSES tab to rename and recolor classes.
• To select a region of the map, go to the DATA tab, draw a shape with the mouse (holding left click), and close it with a right click. Data points will be selected and displayed in the DATA tab.
• To assign data points to bins, use the CLASS tab and, using your pointer, draw a boundary around the region of interest (e.g., using the data structure as a guide; see also the “contours” box in the VIEW tab, which might help to delineate bins). This will assign each data point to a class (bin). The new .cls file can be saved (File > Save .cls) for further analysis.
Checking the quality of bins: CheckM

Owing to improvements in sequencing quality and speed, many genomes are being resequenced to avoid errors. This can be done either by using paired-end sequencing or by alignment with closely related genomes. When two highly similar genomes (from similar species) are considered, these options are not the most appropriate. Rather, one should first identify the regions where a sequencing bias is expected, and then analyze these regions by comparing tetranucleotide frequencies. Genome sequencing quality can also be assessed by comparing any of the 104 genes that are highly conserved within the domains Bacteria and Archaea. This can be done by using reciprocal BLAST-based homology searches in the Pfam and TIGRFAM databases or by the CheckM (http://ecogenomics.github.io/CheckM/) method, which allows the identification of marker genes originating from a secondary genome. The recommended workflow for assessing the completeness and contamination of genome bins is to use lineage-specific marker sets. The workflow consists of four mandatory (M) steps and one recommended (R) step (directory and file names below are placeholders):
1. (M) $ checkm tree [bin_dir] [checkm_out]
2. (R) $ checkm tree_qa [checkm_out]
3. (M) $ checkm lineage_set [checkm_out] [marker_file]
4. (M) $ checkm analyze [marker_file] [bin_dir] [checkm_out]
5. (M) $ checkm qa [marker_file] [checkm_out]
6. bin_qa_plot provides a visual representation of the completeness, contamination, and strain heterogeneity within each genome bin:
$ checkm bin_qa_plot -x fa [checkm_out] [bin_dir] [plot_dir]
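For convenience, CheckM also bundles the mandatory steps of this workflow into a single command, lineage_wf (directory names are placeholders; -x gives the file extension of the bins):
$ checkm lineage_wf -x fa [bin_dir] [checkm_out]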

2.6.2.5  Metagenome annotation
Once the data have been assembled into contigs, the next natural step is annotation of the data, i.e., gene finding and functional assignment of genes through comparison with protein databases. For gene finding, a range of programs is available (MetaGeneAnnotator, MetaGeneMark, Orphelia, FragGeneScan). A broadly used resource is Prodigal (Prokaryotic Dynamic Programming Gene-finding Algorithm, http://prodigal.ornl.gov/). A version for metagenomic datasets, MetaProdigal, was released in 2012. MetaProdigal can identify genes in short, anonymous coding sequences with a high degree of accuracy. The novel value of the method lies in enhanced translation initiation site identification, the ability to identify sequences that use alternative genetic codes, and confidence values for each gene call. To use Prodigal in its most standard mode:
$ prodigal -i metagenome.fna -o coords.gbk -a proteins.faa -p anon

By default, Prodigal produces one output file, which consists of gene coordinates and some metadata associated with each gene. However, the program can produce four more output files at the user's request: protein translations (with the -a option), nucleotide sequences (with the -d option), a complete listing of all start/stop pairs along with score information (with the -s option), and a summary of statistical information about the genome or metagenome (with the -w option). The next step is the functional annotation of the genes/proteins. A variety of ways exist to proceed with the annotation. One option is to use webMGA to do rpsBLAST searches against the COG database. COGs are clusters of orthologous genes, i.e., evolutionary counterparts in different species, usually with the same function (http://www.ncbi.nlm.nih.gov/COG/). Many COGs have known functions, and the COGs are also grouped at a higher level into functional classes. To get COG classifications of the proteins:
1. Access webMGA at http://weizhong-lab.ucsd.edu/metagenomic-analysis/ and select Server/Function annotation/COG.
2. Upload the protein file proteins.faa from the Prodigal output and use the default e-value cutoff. rpsBLAST is used, which is a BLAST based on position-specific scoring matrices (PSSMs). For each COG, one such PSSM has been constructed, and these are compiled into a database of profiles that is searched against. rpsBLAST is more sensitive than a normal BLAST, which is important if genomes in the metagenome are distant from existing sequences in databases. It is also faster than searching against all proteins in the database.
3. The file output.2 includes detailed information on the classifications for every protein with a hit below the e-value cutoff. These can be viewed with:
$ less README.txt
$ less -S output.2
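As a command-line alternative to the web server, rpsBLAST from the BLAST+ suite can be run locally against a preformatted COG profile database downloaded from NCBI's Conserved Domain Database (the database path is a placeholder; the e-value cutoff is illustrative):
$ rpsblast -query proteins.faa -db [path_to_cdd]/Cog -evalue 1e-5 -outfmt 6 -out proteins_vs_COG.tab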


2.6.3  Metagenome Databases
A number of public databases have been created to store metagenome sequence datasets. Indeed, all new sequences obtained by DNA sequencing should be released to public databases.
1. GenBank-NCBI (https://www.ncbi.nlm.nih.gov/genbank/metagenome/): Metagenome sequences can be uploaded to GenBank by registering the sequences under a new BioProject and BioSample. Unassembled sequences should be submitted to the Sequence Read Archive (SRA), while assembled metagenomic contigs must be submitted as a Whole Genome Shotgun (WGS) project. Other types of metagenomic sequences (16S ribosomal RNA, fosmids, etc.) can be submitted directly to GenBank through the submission tools Sequin or tbl2asn. Finally, metagenomic transcriptomes can also be submitted to GenBank, in this case associated with a TSA (Transcriptome Shotgun Assembly) project.
2. EBI metagenomics (https://www.ebi.ac.uk/metagenomics/): The European Bioinformatics Institute (EBI) offers its own storage and analysis service for metagenomic data. This database is composed of more than 13,000 metagenomes and more than 1100 metatranscriptomes, classified according to their environment of origin. Submission of raw sequences requires prior registration on the server and is performed through the ENA Webin tool; the data are archived in the European Nucleotide Archive (ENA), which allows the submission of sequence reads, genome assemblies, and targeted assembled and annotated sequences, as well as the registration of studies (projects) and samples. Moreover, this database also allows analysis of the submitted data, performing taxonomic analysis of the 16S rRNA gene using QIIME or functional analysis using the InterPro sequence analysis resource [22].
3. IMG/M (https://img.jgi.doe.gov/m/): The Integrated Microbial Genomes with Microbiome samples (IMG/M) system includes archaea, bacteria, eukarya, plasmids, viruses, and genome fragments, as well as metagenome and metatranscriptome datasets. GenBank is the major source of genome data; all new sequences are downloaded through the Genomes OnLine Database (GOLD v.5) [23]. GOLD is a data management system for cataloging and continuous monitoring of sequencing projects worldwide. GOLD supports the IMG family of data management systems as a filter of projects and metadata, requiring at least minimal metadata to annotate the projects. Submission to the IMG systems requires an entry in GOLD as a prerequisite for submitting a project annotation [23]. IMG contains 11,004 metagenome datasets from 544 metagenome studies. Since 2012, IMG has also included metatranscriptomic data, with 5156 different datasets from a very rich variety of microbial community samples (marine, estuarine, freshwater, forest soil, rhizosphere, bioreactors, etc.). IMG also allows visualization and comparative analysis of the data through the IMG User Interface (UI) [24].

4. MG-RAST (http://metagenomics.anl.gov): An open-submission data portal for processing, analyzing, sharing, and disseminating metagenomic datasets. There are over 200,000 datasets, of which around 30,000 are available for public download. Nearly 8% of the overall data is from metatranscriptomic experiments. To submit data or use the datasets, account registration is required. Users submitting data are encouraged to provide metadata, which increases the value of the data, and to declare the intention to publish the data via MG-RAST, receiving priority access to analysis resources. MG-RAST provides an API (Application Programming Interface), which lets users access the datasets more easily. It is accessible through different programming languages, with examples in Perl and Python via github (http://github.com/MG-RAST/MG-RAST-Tools); a minimal example query is sketched after this list. In contrast to IMG/M and EBI metagenomics, MG-RAST has always preferred raw sequence data, without any prior filtering or assembly [25].
5. FOAM (http://cbb.pnnl.gov/portal/software/FOAM.html): Functional Ontology Assignments for Metagenomes (FOAM) is a functional gene database developed to screen environmental metagenomic sequence datasets. Analysis of current environmental metagenomes is often challenging due to the high diversity and large proportion of uncharacterized microbial taxa in most environmental habitats, and the FOAM database represents a useful tool for analyzing such data. It is based on hidden Markov models (HMMs) built from sets of aligned protein sequences (profiles) that were tailored to a large group of KEGG Orthologs (KOs). FOAM allows the user to select the target search space before the HMM-based comparison steps and to easily organize the results into different functional categories [26].
6. iMicrobe (http://imicrobe.us): The iMicrobe project uses an existing computational infrastructure called CyVerse to create a data commons for microbial data sets taken from diverse environments. Its collection includes not only metagenomic data but also other microbial sequencing data such as transcriptomics and genomics. At present, it contains 128 projects comprising 3338 different environmental omics samples. Metagenomic analysis is supported through its website, which also provides a REST-based API for easier access by bioinformaticians [27].
7. VIROME (http://virome.dbi.udel.edu/): The Viral Informatics Resource for Metagenome Exploration (VIROME) is a database designed to classify all putative open reading frames (ORFs) from viral metagenome shotgun libraries. It allows the exploration of metagenome sequence data collected from viral assemblages occurring within a number of different environmental contexts, currently containing data derived from 466 libraries. VIROME includes a bioinformatic pipeline for data analysis by the user [27,28].
8. MeganDB (http://www.megan-db.org/): The MetaGenome Analyzer DataBase (MeganDB) is a database of metagenomic datasets precomputed using the popular tool MEGAN (MEta Genome Analyzer) [29], which is used for the taxonomic and functional annotation of metagenomic data. It currently comprises 235 metagenomes [27].

9. NIH Human Microbiome Project (http://hmpdacc.org): The Human Microbiome Project Consortium dedicates its work to studying the influence of the human microbiome on human health. All the metagenomic studies performed in the context of the Human Microbiome Project are stored in the Human Microbiome Project database, with a current catalog of 1265 metagenomes from different environments located in the human body [29].
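To illustrate the MG-RAST API mentioned above (item 4), a single metagenome record can be retrieved with a plain HTTP request following the pattern documented for the API; the accession is a placeholder, and the verbosity parameter controls how much detail is returned:
$ curl "http://api.metagenomics.anl.gov/1/metagenome/[mgm_accession]?verbosity=stats"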

2.7  Metagenomics: Wet-Lab Techniques for Clone Library Construction
In some cases, DNA sequencing may not be our goal; instead, the DNA material may be used to create so-called clone libraries that can later be screened for activities of interest. There are two distinct strategies used in metagenomics, according to the primary goal. First, large-insert libraries (cosmids, fosmids, or bacterial artificial chromosomes (pBACs)) are constructed for archiving and sequence homology screening purposes: to capture the largest possible fraction of the genetic resources available in the sample and archive it for further studies/interrogation. Second, small-insert expression libraries, especially those made in lambda phage vectors, are constructed for activity screening. The small size of the cloned fragments means that most genes present in the appropriate orientation will be under the influence of the extremely strong vector expression signals, and thus have a good chance of being expressed and detected by activity screens [9]. Though these two strategies may differ in some technical aspects, both are complementary when used for naïve screens. For the sake of simplicity, we describe here the methods used to construct large-insert libraries based on the utilization of the pCC1FOS vector.
The CopyControl Fosmid Library Production Kit (EPICENTRE) utilizes a strategy based on the cloning of randomly sheared and end-repaired DNA with an average insert size of 40 kbp. Frequently, genomic DNA is sufficiently sheared as a result of the purification process, so additional shearing is not necessary. Test the extent of shearing of the DNA by first running a small amount of it (around 100 ng). Run the sample on a 20-cm-long 1% agarose gel at 30–35 V overnight at 4°C and stain with ethidium bromide. The end-repair protocol can only be started when at least 10% of the sheared DNA runs at the size of the fosmid control DNA provided with the kit (36 Kb). If the genomic DNA migrates more slowly (higher MW) than the 36 Kb fragment, then the DNA needs to be sheared. Shear the DNA (2.5 μg) by passing it through a 200-μL small-bore pipette tip. Aspirate and expel the DNA from the pipette tip 50–100 times. If the genomic DNA migrates faster than the 36 Kb fragment (lower MW), then it has been sheared too much and should be reisolated. For the end-repair protocol, take into account these suggestions:
1. Thaw and thoroughly mix all of the reagents listed here before dispensing; place on ice. Combine the following on ice:

8 μL 10× End-Repair Buffer, 8 μL 2.5 mM dNTP Mix, 8 μL 10 mM ATP, 32 μL sheared insert DNA (approximately 4.3 μg)*, 20 μL sterile water, 4 μL End-Repair Enzyme Mix; 80 μL total reaction volume. *The end-repair reaction can be scaled up or down as dictated by the amount of DNA available.
2. Incubate at room temperature for 45 min.
3. Add gel loading buffer and incubate at 70°C for 10 min to inactivate the End-Repair Enzyme Mix.
4. Select the size of the end-repaired DNA by low-melting-point (LMP) agarose gel electrophoresis. Run the sample on a 20-cm-long 1% agarose gel at 30–35 V overnight at 4°C. Do not stain the DNA with EtBr and do not expose it to UV. Use EtBr-stained DNA marker lanes as a ruler to cut out the agarose region containing the 25–60 Kb DNA and trim excess agarose. Then proceed to the agarose gel-digesting assay using the “GELase (EPICENTRE) Agarose Gel-Digesting protocol” described in the following steps 5–11.
5. Thoroughly melt the gel slice by incubating at 70°C for 3 min for each 200 mg of gel.
6. Transfer the molten agarose immediately to 45°C and equilibrate for 2 min for each 200 mg of gel.
7. Digest the agarose with 1 U of GELase for 30 min at 45°C.
8. Centrifuge the tubes in a microcentrifuge at maximum speed (15,000×g) for 15 min at 4°C to pellet any insoluble oligosaccharides.
9. Carefully transfer the upper 90%–95% of the supernatant, which contains the DNA, to a sterile 1.5 mL tube, taking care to avoid the gelatinous pellet. Then precipitate the DNA by adding 0.1 volume of 3 M sodium acetate, pH 7.0, and 2.5 volumes of ethanol, and mix gently.
10. Wash the pellet with 70% ethanol.
11. Gently resuspend the DNA pellet in TE buffer (around 200 μL). Concentrate the DNA on a Microcon-100 (Millipore) concentrator membrane (100 kDa cut-off) at 4°C to a final volume of 20–50 μL. The DNA concentration should be around 75 ng/μL (in 50 μL, a total of 3.75 μg). This concentrated DNA is the insert to be ligated into the pCC1FOS vector.

The next step is the ligation of the DNA fragment into the pCC1FOS fosmid vector. A single ligation reaction will produce 10³–10⁶ clones, depending on the quality of the insert DNA. Based on this information, calculate the number of ligation reactions that you will need to perform. The ligation reaction can be scaled up as needed. A 10:1 molar ratio of pCC1FOS vector to insert DNA is optimal; since the pCC1FOS vector is ~8.1 Kb, a 10-fold molar excess over a 100 Kb insert corresponds to roughly equal masses (10 × 8.1/100 ≈ 0.8), so if we use 0.5 μg of 100 Kb insert DNA we need around 0.5 μg of vector. Combine the following reagents in the order listed and mix thoroughly after each addition: 1 μL 10× Fast-Link Ligation Buffer, 1 μL pCC1FOS (0.5 μg/μL), 1 μL 10 mM ATP, 6.8 μL concentrated insert DNA (75 ng/μL), 0.2 μL MilliQ water, 1 μL Fast-Link DNA Ligase; 10 μL total reaction volume.

Incubate at room temperature for 2 h and then transfer the reaction to 70°C for 10 min to inactivate the Fast-Link DNA Ligase, after which the construct is packaged as follows:
1. Thaw, on ice, 1 tube of the MaxPlax Lambda Packaging Extracts for every ligation reaction performed in the preceding step.
2. Transfer 25 μL of the extract to a second 1.5 mL microfuge tube and place on ice.
3. Add 10 μL of the ligation reaction to each 25 μL of the thawed extracts being held on ice.
4. Mix gently by pipetting the solutions and then centrifuge briefly.
5. After the 90-min packaging reaction is completed at 30°C, add the remaining 25 μL of MaxPlax Lambda Packaging Extract to each tube.
6. Incubate the reactions for an additional 90 min at 30°C.
7. At the end of the second 90-min incubation, add Phage Dilution buffer (PD buffer: 10 mM Tris-HCl pH 8.3, 100 mM NaCl, 10 mM MgCl2) to 1 mL final volume in each tube and mix gently. Add 25 μL of chloroform to each. Mix gently and store at 4°C (up to a month). A viscous precipitate may form after addition of the chloroform. This precipitate will not interfere with library production.
Determine the titer of the phage particles (packaged fosmid clones) and then plate the fosmid library. On the day of the packaging reactions, inoculate 50 mL of LB broth + 10 mM MgSO4 with 5 mL of the EPI300-T1R overnight culture. Shake at 37°C to an OD600 nm of 0.8–1.0. Store the cells at 4°C until needed (titering). The cells may be stored for up to 72 h at 4°C if necessary. Before plating the library, we recommend that the titer of the packaged fosmid clones be determined. This will aid in determining the number of plates and dilutions needed to obtain a library that meets the needs of the user.
1. Prepare 1:10¹, 1:10², 1:10⁴, and 1:10⁵ dilutions in PD buffer.
2. Add 10 μL of each dilution to 100 μL of ready-to-use E. coli EPI300-T1R and incubate at 37°C for 20 min.
3. Plate the resulting cells on LB agar plates containing 12.5 μg/mL chloramphenicol.
4. After overnight incubation at 37°C, the colonies are quantified as follows: if there were 200 colonies on the plate streaked with the 1:10⁴ dilution, then the titer in cfu/mL (where cfu represents colony-forming units) of this reaction would be:
(# of colonies) (dilution factor) (1000 μL/mL)/(volume of phage plated [μL])
That is: (200 cfu) (10⁴) (1000 μL/mL)/(10 μL) = 2 × 10⁸ cfu/mL
Based on the titer of the phage particles determined previously, dilute the phage particles with PD buffer to obtain the desired number of clones and clone density on the plates, which are prepared as described above. Subsequently, these clones are plated, with the help of a colony-picker robot, in 384-well plates (LB, 12.5 μg/mL chloramphenicol, and 15% glycerol). Plates are incubated overnight without shaking at 37°C. The colony-picker robot is again used to produce copies of the 384-well plates.

Once the clone library is prepared, the clones are ready for activity screening. Extensive descriptions of assay methods have been reviewed elsewhere [30–32].

Acknowledgments This project has received funding from the European Union’s Horizon 2020 research and innovation program [Blue Growth: Unlocking the potential of Seas and Oceans] under grant agreement No. [634486] (project acronym INMARE). This research was also supported by the grants PCIN-2014-107 (within the ERA-NET IB2 program) and BIO2014-54494-R from the Spanish Ministry of Economy, Industry and Competitiveness.

References
[1] Yarza P, Yilmaz P, Pruesse E, Glöckner FO, Ludwig W, Schleifer KH, et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat Rev Microbiol 2014;12:635–45.
[2] Cardona S, Eck A, Cassellas M, Gallart M, Alastrue C, Dore J, et al. Storage conditions of intestinal microbiota matter in metagenomic analysis. BMC Microbiol 2012;12:158.
[3] Gorzelak MA, Gill SK, Tasnim N, Ahmadi-Vand Z, Jay M, Gibson DL. Methods for improving human gut microbiome data by reducing variability through sample processing and storage of stool. PLoS One 2015;10:e0134802.
[4] Hampton-Marcell JT, Moormann SM, Owens SM, Gilbert JA. Preparation and metatranscriptomic analyses of host-microbe systems. Methods Enzymol 2013;531:169–85.
[5] Pérez-Cobas AE, Gosalbes MJ, Friedrichs A, Knecht H, Artacho A, Eismann K, et al. Gut microbiota disturbance during antibiotic therapy: a multi-omic approach. Gut 2013;62:1591–601.
[6] Reck M, Tomasch J, Deng Z, Jarek M, Husemann P, Wagner-Döbler I, et al. Stool metatranscriptomics: a technical guideline for mRNA stabilisation and isolation. BMC Genomics 2015;16:494.
[7] Bashiardes S, Zilberman-Schapira G, Elinav E. Use of metatranscriptomics in microbiome research. Bioinform Biol Insights 2016;10:19–25.
[8] Moen AE, Tannæs TM, Vatn S, Ricanek P, Vatn MH, Jahnsen J, et al. Simultaneous purification of DNA and RNA from microbiota in a single colonic mucosal biopsy. BMC Res Notes 2016;9:328.
[9] Guazzaroni ME, Golyshin PN, Ferrer M. Analysis of complex microbial communities through metagenomic survey. In: Marco D, editor. Metagenomics: theory, methods and applications. Caister Academic Press; 2010. p. 55–67.
[10] Logares R, Sunagawa S, Salazar G, Cornejo-Castillo FM, Ferrera I, Sarmento H, et al. Metagenomic 16S rDNA Illumina tags are a powerful alternative to amplicon sequencing to explore diversity and structure of microbial communities. Environ Microbiol 2014;9:2659–71.
[11] Takahashi S, Tomita J, Nishioka K, Hisada T, Nishijima M. Development of a prokaryotic universal primer for simultaneous analysis of Bacteria and Archaea using next-generation sequencing. PLoS One 2014;8:e105592.
[12] Jovel J, Patterson J, Wang W, Hotte N, O'Keefe S, Mitchel T, et al. Characterization of the gut microbiome using 16S or shotgun metagenomics. Front Microbiol 2016;7:459.
[13] Burbach K, Seifert J, Pieper DH, Camarinha-Silva A. Evaluation of DNA extraction kits and phylogenetic diversity of the porcine gastrointestinal tract based on Illumina sequencing of two hypervariable regions. Microbiologyopen 2016;5:70–82.
[14] Hongoh Y, Yuzawa H, Ohkuma M, Kudo T. Evaluation of primers and PCR conditions for the analysis of 16S rRNA genes from a natural environment. FEMS Microbiol Lett 2003;221:299–304.
[15] Weisburg WG, Barns SM, Pelletier DA, Lane DJ. 16S ribosomal DNA amplification for phylogenetic study. J Bacteriol 1991;173:697–703.
[16] Massana R, Murray AE, Preston CM, DeLong EF. Vertical distribution and phylogenetic characterization of marine planktonic Archaea in the Santa Barbara Channel. Appl Environ Microbiol 1997;63:50–6.

[17] DeLong EF. Archaea in coastal marine environments. Proc Natl Acad Sci U S A 1992;89:5685–9.
[18] Baker BJ, Tyson GW, Webb RI, Flanagan J, Hugenholtz P, Allen EE, et al. Lineages of acidophilic archaea revealed by community genomic analysis. Science 2006;314:1933–5.
[19] Bargiela R, Mapelli F, Rojo D, Chouaia B, Tornés J, Borin S, et al. Bacterial population and biodegradation potential in chronically crude oil-contaminated marine sediments are strongly linked to temperature. Sci Rep 2015;5:11651.
[20] Thomas T, Gilbert J, Meyer F. Metagenomics—a guide from sampling to data analysis. Microb Inform Exp 2012;2:3.
[21] Sangwan N, Xia F, Gilbert JA. Recovering complete and draft population genomes from metagenome datasets. Microbiome 2016;4:8.
[22] Mitchell A, Bucchini F, Cochrane G, Denise H, ten Hoopen P, Fraser M, et al. EBI metagenomics in 2016—an expanding and evolving resource for the analysis and archiving of metagenomic data. Nucleic Acids Res 2016;44:D595–603.
[23] Reddy TB, Thomas AD, Stamatis D, Bertsch J, Isbandi M, Jansson J, et al. The Genomes OnLine Database (GOLD) v. 5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res 2015;43:D1099–106.
[24] Chen IA, Markowitz VM, Chu K, Palaniappan K, Szeto E, Pillay M, et al. IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res 2017;45:D507–16.
[25] Wilke A, Bischof J, Gerlach W, Glass E, Harrison T, Keegan KP, et al. The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res 2016;44:D590–4.
[26] Prestat E, David MM, Hultman J, Taş N, Lamendella R, Dvornik J, et al. FOAM (Functional Ontology Assignments for Metagenomes): a Hidden Markov Model (HMM) database with environmental focus. Nucleic Acids Res 2014;42:e145.
[27] Mineta K, Gojobori T. Databases of the marine metagenomics. Gene 2016;576:724–8.
[28] Wommack KE, Bhavsar J, Polson SW, Chen J, Dumas M, Srinivasiah S, et al. VIROME: a standard operating procedure for analysis of viral metagenome sequences. Stand Genomic Sci 2012;6:427–39.
[29] Human Microbiome Project Consortium. A framework for human microbiome research. Nature 2012;486:215–21.
[30] Reyes-Duarte D, Ferrer M, García-Arellano H. Functional-based screening methods for lipases, esterases, and phospholipases in metagenomic libraries. Methods Mol Biol 2012;861:101–13.
[31] Popovic A, Tchigvintsev A, Tran H, Chernikova TN, Golyshina OV, Yakimov MM, et al. Metagenomics as a tool for enzyme discovery: hydrolytic enzymes from marine-related metagenomes. Adv Exp Med Biol 2015;883:1–20.
[32] Distaso MA, Tran H, Ferrer M, Golyshin PN. Metagenomic mining of enzyme diversity. In: Lee SY, editor. Handbook of hydrocarbon and lipid microbiology series. Consequences of microbial interactions with hydrocarbons, oils and lipids: production of fuels and chemicals. Springer; 2016. p. 1–25.

Further Reading [1] Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 2015;25:1043–55.