supplementary information - Nature

SUPPLEMENTARY INFORMATION

doi:10.1038/nature09944

Correction notice

Enterotypes of the human gut microbiome

Manimozhiyan Arumugam, Jeroen Raes, Eric Pelletier, Denis Le Paslier, Takuji Yamada, Daniel R. Mende, Gabriel R. Fernandes, Julien Tap, Thomas Bruls, Jean-Michel Batto, Marcelo Bertalan, Natalia Borruel, Francesc Casellas, Leyden Fernandez, Laurent Gautier, Torben Hansen, Masahira Hattori, Tetsuya Hayashi, Michiel Kleerebezem, Ken Kurokawa, Marion Leclerc, Florence Levenez, Chaysavanh Manichanh, H. Bjørn Nielsen, Trine Nielsen, Nicolas Pons, Julie Poulain, Junjie Qin, Thomas Sicheritz-Ponten, Sebastian Tims, David Torrents, Edgardo Ugarte, Erwin G. Zoetendal, Jun Wang, Francisco Guarner, Oluf Pedersen, Willem M. de Vos, Søren Brunak, Joel Doré, MetaHIT Consortium, Jean Weissenbach, S. Dusko Ehrlich & Peer Bork

Nature 473, 174–180 (2011)

In the version of the Supplementary Information originally posted online, there was a minor error in section 2.2, in which the mean abundance for Actinobacteria was incorrectly listed as 82% instead of 8.2%. This has been corrected in the new version of the Supplementary Information; see Supplementary Information Table of Contents for details.


Table of Contents

I. Supplementary Methods
1 Sample preparation
1.1 Sample collection
1.1.1 Danish individuals
1.1.2 Spanish individuals
1.1.3 Italian individuals
1.1.4 French individuals
1.2 DNA extraction
1.2.1 Danish, French, Italian and Spanish individuals
2 Sequence processing
2.1 European samples
2.2 American samples
2.3 Japanese samples
2.4 Further trimming
2.5 Removal of potential human reads
3 Assembly and gene prediction
4 Phylogenetic annotation
4.1 Reference genome set
4.2 Reference genome mapping
4.3 Estimating sequence similarity barriers across phylogenetic ranks
4.4 HITChip analysis
5 Phylogenetic analysis of external datasets
5.1 Analyzing 16S rRNA sequences from 154 American individuals
5.2 Analyzing Illumina-based metagenome sequences from 85 Danish individuals
6 Functional annotation
6.1 Estimating abundance of a gene/protein
6.2 Assigning proteins to eggNOG orthologous groups
6.3 Assigning proteins to KEGG orthologous groups, modules and pathways
7 Clustering
7.1 Clustering algorithm
7.2 Distance metric
7.3 Optimal number of clusters
7.4 Cluster validation
7.5 Cluster similarity measurement
8 Simulating phylogenetic/functional compositions
9 Supervised learning
10 Between-class analysis
11 Network correlation analysis
12 Statistical treatment of over-/under-representation
13 Correlations with host properties

II. Supplementary Notes
1 Feasibility of comparative gut metagenomics
1.1 Functional repertoire of samples compared to bacterial genomes
1.2 Comparing different sequencing technologies
1.3 Functional rarefaction
2 Global phylogenetic and functional variation of intestinal metagenomes
2.1 Non bacterial DNA content
2.1.1 Eukaryotic contamination
2.1.2 Prophage sequences
2.2 Phylogenetic groups identified in gut metagenomes
2.3 Functions identified in gut metagenomes
3 Highly abundant functions from low-abundance microbes
4 Robust clustering of samples across nations: Identification of enterotypes
4.1 Deriving enterotypes
4.2 Drivers of enterotypes in three different datasets
4.3 Robustness of enterotype clusters
4.4 Generalization and predictive power of enterotype clusters
4.5 Independent experimental verification of enterotypes using HITChip
5 Phylogenetic and functional variation between enterotypes
5.1 Phylogenetic composition of enterotypes
5.1.1 Enterotype 1
5.1.2 Enterotype 2
5.1.3 Enterotype 3
6 Phylogenetic and functional biomarkers for host properties
6.1 Age bias in the dataset
6.2 Identification of an unknown Clostridiales genus correlating with host-age
6.3 Functions correlating with host-nationality
6.4 Verifying host-phenotypic classifications based on hydrogenotrophic microorganisms
6.5 Effect of genome size in functional over-representation in enterotypes

I. Supplementary Methods

1 Sample preparation

1.1 Sample collection

Sample collection for all European studies was approved by an ethics committee (different committees for different studies).

1.1.1 Danish individuals

Danish individuals were examined at Steno Diabetes Center, Gentofte. The participants were asked to provide a frozen, crude fecal sample. Samples were collected at home and immediately frozen in the participants' home freezers. The samples were delivered to Steno Diabetes Center in insulating polystyrene foam containers and stored at -80°C until analysis.

1.1.2 Spanish individuals

Patients with ulcerative colitis (UC) or Crohn’s disease (CD) attending the outpatient clinic of Hospital Vall d’Hebron were asked to give written consent to take part in this study. Eligible patients were aged 18 to 75 years and had UC or CD previously diagnosed by endoscopy and histological examination of intestinal mucosal biopsies. They had been in clinical remission for at least 3 months and were on stable maintenance therapy with mesalazine or azathioprine. Exclusion criteria included pregnancy or breast feeding; severe concomitant disease involving the liver, heart, lungs or kidneys; treatment with steroids, cyclosporine, anti-TNF drugs or topical anti-inflammatory preparations during the previous 3 months; and treatment with antibiotics during the previous 4 weeks. Healthy controls were recruited among family relatives of the UC and CD patients; controls were required to have been free of antibiotic treatment for at least 4 weeks before fecal sample collection. The protocol was approved by the Ethics Committee of our institution (CEIC, Hospital Vall d’Hebron).

Patients and healthy controls were asked to provide a frozen stool sample. Fresh stool samples were obtained at home and immediately frozen in the participants' home freezers. Frozen samples were delivered to the hospital in insulating polystyrene foam containers and then stored at -80°C until analysis. None of the patients or controls underwent bowel cleansing or endoscopic procedures before fecal sampling.

1.1.3 Italian individuals

Fecal samples were obtained from 6 elderly individuals living in Camerino, Italy. The elderly volunteers consumed an unrestricted Western-type diet. They had taken neither antibiotics nor any drug known to influence the fecal microbiota composition for at least three months prior to sampling and were free of known metabolic or gastrointestinal diseases. Whole stools were collected in clean boxes and stored at 4°C under anaerobic conditions using an Anaerocult® A (Merck, Nogent sur Marne, France) until sampling as 200 mg aliquots in 2 ml sterile screw-cap tubes, which were frozen at -20°C for further analysis.


1.1.4 French individuals

Fecal samples were obtained from 4 obese and 4 healthy individuals, frozen immediately, and delivered to INRA.

1.2 DNA extraction

1.2.1 Danish, French, Italian and Spanish individuals

A frozen aliquot (200 mg) of each fecal sample was suspended in 250 µl of 4 M guanidine thiocyanate–0.1 M Tris (pH 7.5) and 40 µl of 10% N-lauroyl sarcosine. DNA extraction was then conducted using a bead-beating method as previously described (ref. 37). The DNA concentration and its molecular size were estimated by NanoDrop (Thermo Scientific) and agarose gel electrophoresis.

2 Sequence processing

2.1 European samples

Sanger sequencing was performed using standard protocols. Shotgun libraries of randomly sheared DNA were constructed using a low-copy plasmid (pCNS, 3 kb insert). Terminal clone end sequences were determined using BigDye terminator chemistry and capillary DNA sequencers (3730XL, Applied Biosystems) according to standard protocols established at Genoscope. Cloning vector and sequencing primer were removed from raw reads after aligning reads to the vector/primer sequences using BLASTN. Reads were quality trimmed by removing bases at either end with phred quality under 15. Lastly, reads shorter than 300 bp were removed.
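The end-trimming and length filter described above can be illustrated with a minimal Python sketch. It assumes that "removing bases at either end" means stripping terminal bases until the first base with phred quality of at least 15 is reached, and that reads of exactly 300 bp are kept; the function name `trim_read` and its parameters are hypothetical and are not part of the Genoscope protocol.

```python
def trim_read(quals, min_q=15, min_len=300):
    """Return (start, end) of the retained region, or None if the read is dropped.

    quals: list of per-base phred scores for one read.
    Assumption: terminal bases are removed until a base with quality >= min_q
    is met at each end; reads shorter than min_len after trimming are discarded.
    """
    start, end = 0, len(quals)
    # Strip low-quality bases from the 5' end.
    while start < end and quals[start] < min_q:
        start += 1
    # Strip low-quality bases from the 3' end.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    if end - start < min_len:
        return None
    return (start, end)
```

The returned coordinates are half-open, so `seq[start:end]` gives the trimmed read.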

2.2 American samples

Sanger reads for two American adult human gut metagenomes (ref. 4) were downloaded from NCBI Trace Archive. The vector and sequence trimming coordinates from the trace information were used to remove the cloning vector and sequencing primer. Titanium reads for two American female obese individuals (ref. 5) were downloaded from the NCBI Short Read Archive. These reads were not processed further than the trimming provided by the authors.

2.3 Japanese samples

We identified the following unclipped vector/linker sequences in the Japanese samples:

1. 5’-GAGAGCTCCTGCAGGCTAGCTTGCGCAAGGATCCTAGGCCTGAAGCTTGTC-3’
2. 5’-GCATGGTACCACGCGTACGTAAGCAAGATCTTCCCGGGTGAATTCGTC-3’

These sequences from the pTS1 cloning vector (K. K. and T. H., personal communication) were clipped from the 13 Japanese samples using the makeClip program from the Forge assembler (ref. 41).


2.4 Further trimming

All Sanger reads were finally trimmed for low-quality regions at the ends using makeClip.

2.5 Removal of potential human reads

Sequence reads were aligned against human genome assembly hg18 obtained from the UCSC Genome Browser (ref. 42) using BLAT (ref. 43) (gfClient v 31, default parameters). Possible human DNA sequences were identified with a very low alignment threshold (‘pslFilter -minMatch=50’ from the BLAT package) to maximize true positives and minimize false negatives, and were removed.
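As a rough illustration of this filtering step, the following Python sketch collects the names of reads whose BLAT alignment to the human genome reaches 50 matching bases, reading standard PSL output (column 1 is `matches`, column 3 is `repMatches`, column 10 is the query name). It is a simplified stand-in and not the pslFilter implementation: it assumes the criterion applies to matches plus repeat matches, treats the cutoff as inclusive, and ignores the other default filters that pslFilter applies. The function name `human_read_ids` is hypothetical.

```python
def human_read_ids(psl_lines, min_match=50):
    """Return the set of query (read) names with a human hit passing the cutoff.

    psl_lines: iterable of lines from a PSL file (header lines are skipped).
    Assumption: a hit passes if matches + repMatches >= min_match.
    """
    ids = set()
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # A PSL alignment row has 21 tab-separated columns starting with an integer.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matching_bases = int(fields[0]) + int(fields[2])
        if matching_bases >= min_match:
            ids.add(fields[9])  # qName
    return ids
```

Reads whose names appear in the returned set would then be dropped from the sample before assembly.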

3 Assembly and gene prediction

Assembly and gene prediction were performed using the SMASH comparative metagenomics pipeline (ref. 38). To obtain contigs and scaffolds from the reads, we employed SMASH’s iterative assembly procedure using Arachne software (refs 44, 45). This procedure iteratively assembles unassembled reads (singletons) from the previous iteration until no more assembly is possible. Protein coding genes were predicted using GeneMark (ref. 46) (v 2.6p) by the SMASH pipeline. SMASH uses the GC-content based heuristic models (provided with GeneMark software) to predict genes on scaffolds shorter than 200 kb and on unassembled reads, and a self-trained hybrid model using both GC-content and sequence content on scaffolds longer than 200 kb.
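The control flow of the iterative assembly described above can be sketched in Python as follows. This is only a schematic loop and not SMASH or Arachne code: `assemble` is a hypothetical callable standing in for one assembler run, assumed to return the new contigs together with the reads left unassembled.

```python
def iterative_assembly(reads, assemble):
    """Repeatedly assemble leftover singletons until no new contigs appear.

    reads: list of reads.
    assemble: hypothetical function taking a list of reads and returning
              (new_contigs, singletons) for one assembly round.
    Returns (all_contigs, remaining_singletons).
    """
    contigs = []
    remaining = list(reads)
    while remaining:
        new_contigs, singletons = assemble(remaining)
        if not new_contigs:
            break  # no more assembly is possible
        contigs.extend(new_contigs)
        remaining = singletons
    return contigs, remaining
```

The loop terminates as long as each productive round consumes at least one read, which is the behaviour assumed of the assembler here.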

4 Phylogenetic annotation

Phylogenetic annotation of each metagenome sample was performed using the SMASH pipeline (ref. 38).

4.1 Reference genome set

We obtained a set of 1511 reference microbial genomes from the National Center for Biotechnology Information (NCBI), the Human Microbiome Project (ref. 10) and the MetaHIT Consortium (ref. 11). We identified 16S rRNA gene sequences from each of these genomes using an HMM-based algorithm (ref. 47) and assigned a taxonomic rank to the genome based on the classification of the 16S rRNA gene using the RDP Classifier (ref. 39). We used the taxonomic tree provided with the RDP Classifier, which is based on the bacterial taxonomy proposed by the Taxonomic Outline of Bacteria and Archaea (ref. 48), with further rearrangements proposed for Firmicutes and Cyanobacteria by Bergey’s Manual of Systematic Bacteriology (refs 49, 50).

4.2 Reference genome mapping

Sequence reads were aligned to the reference microbial genomes (listed in Supplementary Table 3) using BLASTN (WU-BLAST 2.0, default parameters except E=1e-20 Z=4000000000 B=5). Each read was assigned the taxonomy of the highest scoring hit(s) above the similarity threshold for the taxonomic rank (>65% for phylum and >85% for genus, established by parameter exploration; see Supplementary Methods Section 4.3). Alignments were also required to span over 75 bp and cover >80% of the read length. Since paired-end reads come from the two ends of a cloned DNA fragment, two reads from such a fragment represent only one physical DNA fragment. Hence taxonomy assignments of reads were transferred to the corresponding fragments. The numbers of fragments assigned to each reference genome were counted (a fragment assigned to N different reference genomes contributes 1/N to each genome). These counts were normalized by the sizes of these genomes to obtain the quantitative relative abundance (relative number of individuals) of each genome in the sample. The number of unassigned fragments was normalized by the average genome size in the reference set (3.54 Mb) to calculate the approximate abundance of unknown genomes. Abundances at various phylogenetic ranks (species, genus, phylum, etc.) were calculated by adding the abundances of genomes under that rank.
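The alignment filter and the abundance calculation described in this section can be illustrated with a short Python sketch. It is not the SMASH implementation: the function names, the input layout (one list of best-hit genomes per fragment) and the "unknown" label are hypothetical, and it assumes that the unknown fraction is included in the total when converting to relative abundances.

```python
from collections import defaultdict

def passes_filter(identity_pct, aln_len, read_len, rank="genus"):
    """Alignment criteria from the text: identity above the rank threshold,
    alignment longer than 75 bp and covering more than 80% of the read."""
    min_identity = {"phylum": 65.0, "genus": 85.0}[rank]
    return (identity_pct > min_identity
            and aln_len > 75
            and aln_len > 0.8 * read_len)

def genome_abundances(fragment_hits, genome_sizes, n_unassigned, mean_genome_size):
    """Relative abundance of each genome from per-fragment assignments.

    fragment_hits: one non-empty list of genome names per assigned fragment.
    genome_sizes: genome name -> size in bp.
    """
    counts = defaultdict(float)
    for genomes in fragment_hits:
        # A fragment assigned to N genomes contributes 1/N to each.
        for g in genomes:
            counts[g] += 1.0 / len(genomes)
    # Normalize counts by genome size (fragments per base).
    density = {g: c / genome_sizes[g] for g, c in counts.items()}
    # Unassigned fragments are scaled by the average reference genome size.
    density["unknown"] = n_unassigned / mean_genome_size
    total = sum(density.values())
    return {g: d / total for g, d in density.items()}
```

Abundances at higher ranks would then follow by summing the values of all genomes that share a genus or phylum.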

4.3 Estimating sequence similarity barriers across phylogenetic ranks

Since there are no established sequence similarity barriers to differentiate genomes from different phylogenetic ranks, we estimated the sequence similarity cutoffs needed to safely assign a sequence to either a genus or a phylum. For this purpose, we retrieved 40 single copy marker genes (ref. 51) from a subset of 835 genomes (after removing some redundancy at species level) and generated 40 sets of pairwise alignments using BLASTN. These marker genes are highly representative of the reference genome set, and hence of at least the sequenced microbial species, since 801 of the 853 genomes (94.6%) contained at least 38 out of the 40 genes. Supplementary Figure 1a shows the distribution of sequence similarity levels between genomes from the same phylum (green) and different phyla (red). Supplementary Figure 1b shows the same distribution at genus level. We estimated the false positive rates at different similarity thresholds at both phylum and genus levels (Supplementary Figure 15). At the phylum level, a 65% threshold had a 0.77% false positive rate; at the genus level, an 85% threshold had a 1.84% false positive rate. Thus we chose 65% and 85% as the thresholds for the phylum- and genus-level assignments, respectively. These are rather conservative cutoffs, since the marker genes are among the genes under the highest levels of selective constraint (ref. 51).
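One way to read the threshold evaluation above is sketched below in Python. The text does not spell out how the false positive rate was computed, so this sketch assumes it is the fraction of marker-gene pairs from different groups (different phyla or different genera) whose similarity still exceeds the threshold; the function name and the input list are hypothetical.

```python
def false_positive_rate(between_group_similarities, threshold):
    """Fraction of between-group pairs with similarity above the threshold.

    between_group_similarities: percent identities of aligned marker-gene pairs
    taken from genomes in different phyla (or different genera).
    Assumption: such pairs would be wrongly assigned to the same group.
    """
    if not between_group_similarities:
        raise ValueError("no between-group pairs supplied")
    n_above = sum(1 for s in between_group_similarities if s > threshold)
    return n_above / len(between_group_similarities)
```

Scanning a range of thresholds with this function would give a curve from which a cutoff with an acceptably low error rate can be picked.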

4.4 HITChip analysis

10 ng of the fecal DNA extract was used to amplify the 16S rRNA genes with the T7prom-Bact-27-for and Uni-1492-rev primers. Subsequently, in vitro transcription and labeling with Cy3 and Cy5 dyes were performed. Fragmentation of Cy3/Cy5 labeled target mixes was followed by hybridization on the arrays at 62.5˚C for 16 h in a rotation oven (Agilent Technologies, Amstelveen, The Netherlands). The slides were washed and dried before scanning. Signal intensity data were obtained from the microarray images using the Agilent Feature Extraction software, version 9.1 (http://www.agilent.com). Microarray data normalization and further analysis were performed using a set of R-based scripts (http://rproject.org) in combination with a custom designed relational database (ref. 20), which operates under the MySQL database management system (http://www.mysql.com).

5 Phylogenetic analysis of external datasets

5.1 Analyzing 16S rRNA sequences from 154 American individuals

Published 16S rRNA sequence data (ref. 5) derived from fecal samples of 154 individuals, including female monozygotic and dizygotic twin pairs and their mothers, were downloaded from http://gordonlab.wustl.edu/NatureTwins_2008/V2.fasta.gz. This dataset, containing 1,119,519 sequence reads from the V2 region of the 16S rRNA gene, was processed using the SMASH pipeline (ref. 38) to classify the reads using the RDP Classifier (ref. 39) (using a minimum read length of 200 and a minimum confidence score of 0.5). For each sample, the numbers of reads assigned to different genera by the RDP Classifier were normalized by the average 16S gene copy number in genomes belonging to each genus, obtained from rrnDB (ref. 52). The resulting relative abundance profile at the genus level was used for further analysis.
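The copy-number correction described above amounts to dividing each genus read count by that genus's average 16S copy number and rescaling to proportions. A minimal Python sketch follows; the function name is hypothetical, and any copy numbers passed in are supplied by the caller rather than taken from rrnDB here.

```python
def genus_relative_abundance(read_counts, copy_numbers):
    """Copy-number-corrected relative abundance per genus for one sample.

    read_counts: genus -> number of classified 16S reads.
    copy_numbers: genus -> average 16S rRNA gene copies per genome.
    """
    corrected = {g: n / copy_numbers[g] for g, n in read_counts.items()}
    total = sum(corrected.values())
    return {g: v / total for g, v in corrected.items()}
```

For instance, with purely illustrative numbers, 600 reads from a genus carrying 6 copies and 200 reads from a genus carrying 4 copies correspond to 100 and 50 genome equivalents, so the corrected proportions are two thirds and one third rather than three quarters and one quarter.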

5.2 Analyzing Illumina-based metagenome sequences from 85 Danish individuals

Illumina reads from 85 Danish individuals from a previously published dataset (ref. 8) were quality trimmed and filtered using a customized pipeline based on the FASTX toolkit (ref. 53). We focused on the Danish individuals because most of the Spanish samples came from patients with Crohn’s disease with dysbiosis and a known reduction of species complexity, which introduces biases and caused technical difficulties in assignment and analysis. Briefly, (i) bases were trimmed from the beginning of reads unless the number of base calls for any base (A, T, G, C) was within the average across all cycles plus/minus two standard deviations, (ii) bases were trimmed from the end of reads if the quality score was 85% similarity) to these three genomes can be considered as a single genus that we call “unknown Clostridiales genus” in the manuscript. Abundance of this genus has a significant negative correlation with host-age (p