Divergences in Gene Repertoires of Microbiome

Divergences in Gene Repertoires of Microbiome Components Derived from Distinct Body-sites of Human

Thesis Submitted to AcSIR for the Award of the Degree of DOCTOR OF PHILOSOPHY In Biological Sciences

By Vinod Kumar Gupta
Registration Number: 10BB13J17011
Under the guidance of Dr. Chitra Dutta

Structural Biology & Bioinformatics Division, CSIR - Indian Institute of Chemical Biology, Kolkata

Dedicated to HARIJAN COMMUNITY

CONTENTS

Acknowledgement .... i
List of Tables .... iii
List of Figures .... vi
List of Abbreviations .... xii

Chapter 1: General Introduction .... 1-20
1.1. Preamble .... 1
1.2. The Early Days of Human Microbiome Studies .... 2
1.3. Human Microbiome Project (HMP) .... 3
1.4. Body Niche and Human Microbiota .... 5
1.5. Worldwide Diversity in Microbiota Composition .... 6
1.6. Approaches for Microbiome Characterization and Analyses .... 8
1.7. Therapeutic Application of Human Microbiome .... 9
1.8. Variations in microbiota between different body sites of the human .... 10
1.9. Pan-genome Analysis .... 11
  1.9.1. Software for Pan-genome Characterization .... 14
1.10. Metagenomics .... 15
  1.10.1. Taxonomic profiling .... 18
  1.10.2. Functional Profiling .... 18
1.11. Aims and Scope of the Present Dissertation .... 19

Chapter 2: Divergences in Gene Repertoire among the Reference Prevotella Genomes Derived from Distinct Body-sites of Human .... 21-68
2.1. Introduction .... 21-23
  2.1.1. Why Prevotella? .... 22
2.2. Objectives of the Study .... 24
2.3. Method and Materials .... 25-31
  2.3.1. Genome sequences .... 25
  2.3.2. Orthologous gene family construction .... 26
  2.3.3. Pan-genome and core genome size calculation .... 27
  2.3.4. Evolutionary analysis .... 29
    2.3.4.1. 16S rRNA tree .... 29
    2.3.4.2. Core-genome tree .... 29
    2.3.4.3. Pan-genome tree .... 30
  2.3.5. Functional mapping of Pan-genome and niche specific gene Families .... 30
  2.3.6. Trends in codon usage in core and unique genes .... 30
  2.3.7. Test for positive selection between two P. denticola strains .... 31
2.4. Results .... 32-63
  2.4.1. Variation in size of homologous gene clusters .... 32
  2.4.2. Orthologous Gene Families - distribution into the core, flexible, and singleton genes .... 32
  2.4.3. Trends in expansion of pan genome and contraction of core genome in Prevotella genus .... 36
  2.4.4. Niche specific variation in expansion/contraction of pan/core genome .... 39
  2.4.5. Exclusive presence or absence of gene families in genomes derived from specific body sites of human hosts .... 41
  2.4.6. Trees based on the pan-genome and sequence variations in core genome – niche-specific features .... 50
  2.4.7. Trends in codon usage in core and unique genes .... 53
  2.4.8. Metabolic functional profile of the Prevotella pan-genome .... 55
  2.4.9. Functional categorization of singletons of each Prevotella strain .... 57
  2.4.10. Variation in functional profile of the niche specific genes .... 58
  2.4.11. Comparative analysis between P. denticola CRIS 18C-A (UGT isolate) and P. denticola F0289 (oral cavity isolate) .... 59
  2.4.12. Trends in GC content of core, accessory and unique genes among Prevotella strains .... 61
2.5. Discussion .... 64

Chapter 3: Functional Metabolic Driving Forces for Niche Specificity in Reference Prevotella Genomes Adapted at Distinct Body Niches in Human .... 69-106
3.1. Introduction .... 69-73
  3.1.1. Microsuccession .... 69
  3.1.2. Ecological determinants at diverse body sites in human .... 70
  3.1.3. Functional profile of microbiota .... 71
  3.1.4. Host – Microbiome interactions in human health and disease .... 72
  3.1.5. Background .... 73
3.2. Objectives of the Study .... 74
3.3. Method and Materials .... 75-78
  3.3.1. Sequence retrieval .... 75
  3.3.2. KEGG annotation .... 76
  3.3.3. Niche specific functional features .... 78
  3.3.4. Metabolic reconstruction of disease associated Prevotella genomes .... 78
3.4. Results .... 79-101
  3.4.1. KEGG pathways profile among niche specific Prevotella genomes .... 79
  3.4.2. Core metabolic pathways in human associated Prevotella strains .... 85
  3.4.3. Niche Specificity among Prevotella genomes .... 91
    3.4.3.1. Niche specific metabolic features among Prevotella genomes .... 91
    3.4.3.2. Niche specific variations in the metabolic pathway profiles of Prevotella genus .... 94
  3.4.4. Clustering of Prevotella genomes based on major KEGG pathways distribution .... 95
  3.4.5. Metabolic assignment of niche specific variable genome .... 97
  3.4.6. KEGG Pathway profile of the disease-associated Prevotella species .... 98
    3.4.6.1. Periodontitis associated Prevotella .... 98
    3.4.6.2. KEGG pathways profile of P. copri – the species associated with Rheumatoid Arthritis .... 100
3.5. Discussion .... 102

Chapter 4: Reconstruction of Functional Diversity in Distinct Body Niches Derived from Human Microbiota for Exploring the Niche Specialization .... 107-177
4.1. Introduction .... 107-109
  4.1.1. Background .... 109
4.2. Objectives of the Study .... 110
4.3. Method and materials .... 111-116
  4.3.1. Sample selection and dataset of metagenomic samples .... 111
  4.3.2. Reference genomes from the Human Microbiome Project (HMP) .... 111
  4.3.3. Estimation of taxonomic and functional diversity .... 112
  4.3.4. Multivariate analysis for clustering the metagenomic samples derived from different body niches .... 113
  4.3.5. Characterization of functional profile of the human microbiome .... 113
  4.3.6. Estimation of functional mutualism among human microbiota components .... 114
  4.3.7. Mapping the evolutionary relationship among niche specific dominant microbial species .... 115
  4.3.8. Cluster analysis of healthy individuals based on microbial driver at distinct body niches in human .... 116
4.4. Results .... 117-169
  4.4.1. Divergences in microbiota composition between distinct body niches of the Human .... 117
  4.4.2. Core microbiota and microbial niche specificity .... 124
  4.4.3. Evolutionary relationship among dominant species derived from a specific body niche .... 126
  4.4.4. Functional divergence in human microbiome .... 128
    4.4.4.1. Principal coordinate analysis based on presence/absence of KEGG pathway enzymes .... 128
    4.4.4.2. Principal coordinate analysis based on relative abundances KEGG pathway enzymes .... 130
    4.4.4.3. Niche specific KEGG pathway enzymes derived from PCA factor analysis .... 139
  4.4.5. Core KEGG pathway enzymes among all 486 metagenomic samples .... 144
  4.4.6. Niche specific functional characteristics derived from niche specific microbiome .... 147
    4.4.6.1. Niche specific functional features conserved among all healthy individuals .... 147
    4.4.6.2. Niche specific functional diversity derived from niche specific microbiota .... 152
    4.4.6.3. Niche specific exclusiveness for KEGG pathway enzymes .... 154
  4.4.7. Microbial sources of KEGG pathway enzymes responsible for niche specificity .... 157
  4.4.8. Niche specific microbial and functional enterotypes/variants .... 160
  4.4.9. Functional cooperation between niche specific microbial communities .... 166
4.5. Discussion .... 170

Chapter 5: Geography, Ethnicity or Subsistence Specific Variations in Human Microbiome Composition & Diversity .... 178-206
5.1. Introduction .... 178-179
5.2. Method and Materials .... 180-182
  5.2.1. Dataset .... 180
  5.2.2. Identification of the niche specific core microbiota among various populations .... 181
  5.2.3. Clustering of populations based on microbial composition at various body niches .... 181
  5.2.4. Functional characterization of representative metagenomic samples from different body niches .... 181
  5.2.5. Survey of microbiome studies for healthy individuals from different geographical countries or populations .... 182
  5.2.6. Diversity in microbiome composition between different populations .... 182
5.3. Results .... 183-198
  5.3.1. Microbiota conserved among healthy individuals from different geographical countries or populations .... 183
  5.3.2. Exclusiveness for microbiota components in a specific geographical population .... 185
  5.3.3. Diversity in microbiome of healthy populations from different countries .... 187
  5.3.4. Microbial characterization of niche specific human microbiome in healthy individuals from different countries/populations .... 189
  5.3.5. Niche specific microbial richness and biodiversity among healthy subjects from different countries/populations .... 191
  5.3.6. Functional characterization of human microbiome from different populations .... 194
5.4. Discussion .... 199

Concluding Remarks .... 207-210
Bibliography .... 211-229
Appendix 3.1: Core KEGG pathway enzymes for 36 Prevotella strains used in chapter 3 (cited in Table 3.7) .... 230-231
Appendix 4.1: Details of KEGG modules listed in Table 4.25 .... 232-234
List of Publications .... 235

ACKNOWLEDGEMENT

Undertaking this PhD has been one of the dreams of my life, and it would not have been possible to complete it without the visible and invisible support and motivation of many, human and otherwise. First of all, I am extremely grateful to my guide, Dr. Chitra Dutta, for her continuous support, supervision, guidance and motivation since my first day in her lab. I feel I am the luckiest of PhD students to have received her guidance and blessings. She has been more than a PhD guide to me: she has influenced my life greatly, and her countless teachings will guide me throughout my life, in each and every situation. I learned from her how to become a good human being and a good leader. I am very grateful to her for giving me full freedom to work; this quality makes her unique. I am very thankful to my co-guide, Dr. Sucheta Tripathy, for her help.

I am greatly inspired and motivated by Lord Ram, Acharya Chanakya, Adi Sankaracharya, Swami Vivekananda, Mahatma Gandhi and Dr. Abdul Kalam. I feel blessed to have had the chance to stay in the land of Swami Vivekananda and Subhash Chandra Bose. I gratefully acknowledge all my teachers, who have taught me everything since my childhood; from them I learned writing, speaking, and even how to learn. Completing a PhD is not an individual's task; many people contributed directly or indirectly to this thesis, and I am thankful to all of them. I am very grateful to the Indian Army, which works continuously, day and night and in very adverse conditions, to keep the country safe and secure and so provides us with a peaceful workplace. I am very thankful to our farmers, who produce the food that anyone needs first in order to do research.

I am very thankful to the Harijan community of our country, which works continuously to keep the country clean without receiving any respect from the people. Research is not possible in a dirty place, so I feel this community played a big role in the completion of my PhD thesis. I would like to express my respect for our first Prime Minister, Pt. Jawaharlal Nehru, the Builder of Modern Science and Promoter of Scientific Temper in India. I am greatly inspired by his words: “It is science alone that can solve the problems of hunger and poverty, of insanitation and illiteracy, of superstition and deadening custom and tradition, of vast resources running to waste, or a rich country inhabited by starving people... Who indeed could afford to ignore science today? At every turn we have to seek its aid... The future belongs to science and those who make friends with science.” I am very thankful to the security staff of IICB, who guarded us against any danger.


LIST OF TABLES

Table No. Title .... Page No.

Chapter 1
1.1. National and International Initiatives for Human Microbiome study by different countries .... 7
1.2. Software or pipelines for pangenome analysis .... 15

Chapter 2
2.1. Details of Prevotella strains used for the analysis .... 26
2.2. Strain wise pan-genome composition .... 35
2.3. Niche specific Prevotella pan-genome .... 43
2.4. Details of genes exclusively absent from each Prevotella strain .... 44-49
2.5. Core genes under positive selection among both P. denticola strains .... 61

Chapter 3
3.1. Disease associated Prevotella strains .... 73
3.2. Details of 36 Prevotella genomes used in the present analysis .... 75-76
3.3. KEGG mapping process and reference organisms used for mapping .... 77
3.4. KEGG profile of Prevotella strains derived from Human Microbiome .... 80
3.5. KEGG pathways exclusively present in Human associated Prevotella strains .... 81-84
3.6. KEGG pathways exclusively absent in Human associated Prevotella strains .... 84
3.7. KEGG pathways present in all 36 Prevotella strains under the study .... 86-90
3.8. Niche specific presence and absence of KEGG pathways .... 92-93

Chapter 4
4.1. Details of metagenomic samples included in this current study .... 111
4.2. Niche wise most five dominant microbial species derived from healthy human individuals .... 121
4.3. Diversity in microbiome composition (relative abundance) among healthy subjects by body niches .... 123
4.4. Niche specific core microbial species between all included healthy samples .... 125
4.5. Pairwise ANOSIM test of significance for niche wise segregation (Presence/absence) .... 130
4.6. Pairwise ANOSIM test of significance for niche wise segregation (Relative abundance) .... 132
4.7. Number of human microbiota derived KEGG pathway enzymes strongly associated with specific body niches .... 133
4.8. KEGG pathway enzymes strongly associated with GUT microbiota derived from PCA factor analysis based on relative abundance of KEGG pathway enzymes .... 134-135
4.9. KEGG enzymes strongly associated with UGT microbiota derived from PCA factor analysis based on relative abundance of KEGG pathway enzymes .... 135-136
4.10. KEGG enzymes strongly associated with both Skin & Airways microbiota derived from PCA factor analysis based on relative abundance of KEGG pathway enzymes .... 137
4.11. KEGG enzymes strongly associated with Oral cavity microbiota derived from PCA factor analysis based on relative abundance of KEGG pathway enzymes .... 137-138
4.12. Niche specific dominant (>Mean + SD) metabolic functions derived from respective microbiota .... 139
4.13. KEGG pathways associated with GUT and UGT .... 140-142
4.14. KEGG pathways associated with oral cavity .... 142-143
4.15. KEGG pathways associated with Skin and Airways .... 143
4.16. Distribution of Core KEGG pathway enzymes in major metabolic pathways .... 144
4.17. Number of core KEGG pathway enzymes derived from niche specific metagenome .... 148
4.18. Number of core KEGG pathway enzymes derived from niche specific reference bacterial strains (HMP-DACC) .... 148
4.19. Number of complete KEGG modules pathways derived from minimal niche metagenome (MNM) for each body niche or sub-sites .... 149-150
4.20. Niche specific exclusively present KEGG modules derived from Minimal Niche Metagenome (MNM) .... 151
4.21. KEGG modules exclusively absent from body niche derived from Minimal Niche Metagenome (MNM) .... 152
4.22. Niche specific exclusively present KEGG pathway enzymes .... 154-156
4.23. Dominant or driving species in niche specific metagenomic samples .... 161
4.24. P & R value of significance (ANOSIM statistical test) for segregation of variants in each niche based on microbial and functional composition .... 165
4.25. All KEGG functional modules completed by low abundant bacterial species in specific niche .... 167-168

Chapter 5
5.1. Details of population/country specific samples used in the study .... 180
5.2. Number of shared microbial communities (genera and phyla) between different populations .... 183
5.3. Exclusively present or absent microbial communities (genera) in niche specific microbiota of respective population .... 186
5.4. Niche specific Beta diversity for microbiota composition between populations from different countries .... 194

LIST OF FIGURES

Figure No. Title .... Page No.

Chapter 1
1.1. Hits from a PubMed search for the term “Microbiome” .... 5
1.2. Nutritional and physico-chemical determinants of microbiota at major body sites of human .... 6
1.3. Construction of Pan-genome [Dis. = Dispensable Genes; Uni. = Unique Genes] .... 12
1.4. Types of pan-genome: open (sky color) and closed (violet color) .... 14
1.5. General workflow for metagenomics .... 17

Chapter 2
2.1. Gene distribution pattern among homologous clusters. Cluster size is the number of genes (orthologs and paralogs) present in a cluster. .... 32
2.2. The gene family frequency spectrum for 28 Prevotella genomes. Bars represent the number of orthologous gene families belonging to singletons (17166), flexible genome (7263) and core genome (456). .... 34
2.3. Pan and core genome analysis curve for 28 Prevotella genomes. The number of shared genes is plotted (sky color) as a function of the number of Prevotella genomes sequentially considered. The size of the Prevotella pan-genome is plotted (violet) as a function of the number of Prevotella genomes sequentially considered. Dots represent the random combinations of genomes used to calculate pan-genome and core genome size. .... 37
2.4. New gene family distribution among the Prevotella pan-genome. Addition of new gene families to the pan-genome with sequential addition of Prevotella genomes to the analysis. .... 38
2.5. Pan and core genome progress with addition of niche specific Prevotella genomes. The plot shows the progression of the core and pan-genome after sequential addition of Prevotella genomes to the analysis according to their body niche. The color bars represent the number of new gene families added to the Prevotella pan-genome. The species names are colored according to their niches. .... 39
2.6. Trends of core and pan-genome curves with niche wise variation in order of genomes. Variation in the shape of the pan and core genome curves when Prevotella genomes are added to the analysis in different orders based on body niche. Letters indicate niches [G: GIT (3), O: Oral Cavity (17), S: SKIN (1) and U: UGT (7)]. .... 40

2.7. Distribution of dispensable genes among 28 Prevotella strains. Colored cells indicate presence of genes in the respective Prevotella strain and orthologous gene family, while uncolored cells indicate absence of genes. The species names and cells are colored according to their niches (green: GIT, red: ORAL Cavity, purple: SKIN and blue: UGT). Dark cell colors represent the flexible genome and light cell colors represent singletons. .... 42
2.8. Relative evolutionary divergence of Prevotella. Neighbor Joining (NJ) tree based on 16S rRNA sequences of 28 Prevotella strains and E. coli 83972 (reference), constructed using MEGA 5 with 1000 bootstrap replications. The species names are colored according to their niches (brown: GIT, green: ORAL Cavity, purple: SKIN and blue: UGT). .... 50
2.9. Relative evolutionary divergence of Prevotella. (A) NJ tree based on the binary gene presence/absence matrix of orthologous gene families of 28 Prevotella and 3 Bacteroides strains and E. coli 83972 (reference) and (B) NJ tree based on the core genome using 100 bootstrap replications. The bootstrap values are marked at the root of each branch of the trees. The species names are colored according to their niches (brown: GIT, green: ORAL Cavity, purple: SKIN and blue: UGT). .... 52
2.10. Codon usage distance. Heatmap of codon usage distances between core genomes of Prevotella strains. .... 54
2.11. Codon usage distance. Heatmap of codon usage distances between core and unique genes for all 28 Prevotella genomes. .... 54
2.12. Percentage relative abundance and distribution of major COG categories between core genome (innermost layer), accessory genome (middle layer) and singletons (outer layer) of Prevotella. .... 55
2.13. Relative abundance and distribution of all functional COG categories between core genome (blue bars), accessory genome (red bars) and singletons (green bars) of Prevotella. .... 56
2.14. Relative abundance and distribution pattern of COG categories within singletons. .... 57
2.15. COG distribution patterns of the niche-specific orthologous gene families. .... 58
2.16. Comparative functional analysis of gene repertoire of P. denticola F0289 (Oral cavity) and P. denticola CRIS 18C-A (UGT) strains. .... 60
2.17. Trends in GC content of core, accessory and unique genes among 28 Prevotella strains. The colored numbers on each histogram correspond to the dataset. .... 62-63

Chapter 3
3.1. Metabolic divergence among niche specific Prevotella genomes derived from body niches .... 91
3.2. PCA plot based on (A) Presence/absence and (B) Relative abundance of KEGG pathway enzymes among 36 Prevotella genomes derived from five major body niches in human (each dot represents a Prevotella genome and dot color represents the respective body niche) .... 95
3.3. KEGG pathway enzymes frequency heatmap. All KEGG enzyme frequencies were hierarchically clustered. The horizontal axis shows the percentage frequency of genes involved in respective pathways, while the strains are located on the vertical axis. .... 96
3.4. Percentage frequency distribution of KEGG pathway enzymes among healthy and periodontitis associated Prevotella genomes .... 98
3.5. Principal component analysis plot showing variation in KEGG functional profile among Prevotella genomes associated with healthy human individuals and periodontitis patients .... 99
3.6. Percentage frequency distribution of KEGG pathway enzymes among healthy and rheumatoid arthritis associated Prevotella genomes. * indicates significant variation between healthy and rheumatoid arthritis associated Prevotella genomes for a respective KEGG pathway category .... 100

Chapter 4
4.1. Principal component analysis based on microbiota composition in human at different body sites .... 117
4.2. Niche specific microbial enrichment. The vertical axis shows the total number of species derived from the respective body sites of the human .... 118
4.3. Divergence in microbiota composition within the respective body site .... 119
4.4. Bray-Curtis beta diversity in microbiome composition (presence/absence) among healthy subjects by body niches .... 122
4.5. Phylogenetic tree based on 16S rRNA sequences from niche specific dominant microbial species .... 126
4.6. Phylogenetic tree based on core genes sequences among niche specific dominant microbial species .... 127
4.7. Principal component analysis based on composition (presence/absence) of KEGG pathway enzymes derived from niche specific microbiota .... 129
4.8. Principal component analysis based on composition (relative abundances) of KEGG pathway enzymes derived from niche specific microbiota .... 131
4.9. Principal component analysis based on composition (relative abundances) of KEGG pathway enzymes derived from niche specific microbiota .... 133
4.10. Demonstration of KEGG enzymes oppositely distributed between GUT and UGT (A) K00262 and (B) K03217. .... 139
4.11. Functional distribution pattern of 796 core KEGG enzymes .... 145
4.12A. Principal Coordinate Analysis (PC1 vs. PC2) based on niche wise relative abundances of 796 core KEGG pathway enzymes .... 146
4.12B. Principal Coordinate Analysis (PC1 vs. PC3) based on niche wise relative abundances of 796 core KEGG pathway enzymes .... 146
4.13. Alpha diversity (Shannon Index) for major body niches based on average abundance of KEGG pathway enzymes derived from niche specific microbiome .... 153
4.14. Beta diversity (Bray Curtis) for major body niches based on composition of KEGG pathway enzymes between all metagenomic samples derived from specific major body niches of healthy individuals .... 153
4.15. Contribution of airways microbiota in KEGG pathway enzymes required for niche specialization in airways; size of bubble indicates the abundance of the KEGG pathway enzyme; values in parentheses on the vertical axis are average relative abundances of the respective bacterial species .... 157
4.16. Contribution of GUT microbiota in KEGG pathway enzymes required for niche specialization in GUT; size of bubble indicates the abundance of the KEGG pathway enzyme; values on the vertical axis (right hand side) are average relative abundances of the respective bacterial species .... 158
4.17. Contribution of Oral cavity microbiota in KEGG pathway enzymes required for niche specialization in Oral cavity; size of bubble indicates the abundance of the KEGG pathway enzyme; values in parentheses on the vertical axis are average relative abundances of the respective bacterial species .... 158
4.18. Contribution of skin microbiota in KEGG pathway enzymes required for niche specialization on skin; size of bubble indicates the abundance of the KEGG pathway enzyme; values in parentheses on the vertical axis are average relative abundances of the respective bacterial species .... 159
4.19. Contribution of UGT microbiota in KEGG pathway enzymes required for niche specialization in UGT; size of bubble indicates the abundance of the KEGG pathway enzyme; values in parentheses on the vertical axis are average relative abundances of the respective bacterial species .... 159
4.20A. Variants based on relative abundances of microbial communities (left panel) and based on relative abundances of KEGG pathway enzymes (right panel): (A) Airways (B) Buccal Mucosa (C) GUT .... 162
4.20B. Variants based on relative abundances of microbial communities (left panel) and based on relative abundances of KEGG pathway enzymes (right panel): (D) Posterior fornix; (E) Supra gingival plaque; (F) Tongue Dorsum .... 163
4.21. Gradually increasing number of completed KEGG modules in each body niche through cooperation of progressively lower-abundance bacterial species of the niche specific microbiota. Bars represent the number of KEGG modules completed by the respective bacterial species derived from the respective body niche; values in brackets on the horizontal axis are the percentage relative abundance of the respective bacterial species .... 169

Chapter 5
5.1. Multidimensional scaling of population specific microbiota composition based on relative abundances of microbial communities (A) Gut microbiome, (B) Vaginal Microbiome, (C) Oral Microbiome, (D) Skin Microbiome .... 188
5.2. Percentage frequency distribution pattern of phyla among different country specific populations (A) Gut microbiota (B) Oral cavity microbiota (C) Vaginal Microbiota .... 190
5.3. Percentage frequency distribution pattern of genera among different country specific populations (A) Gut microbiota (B) Oral cavity microbiota (C) Skin Microbiota (D) Vaginal Microbiota .... 191
5.4. Alpha diversity within subjects as measured using the relative inverse Simpson index of phylum-level relative abundances (A) GUT microbiota (B) Oral cavity microbiota (C) Skin microbiota; S.I. (Simpson Index) .... 192
5.5. Alpha diversity within subjects as measured using the relative inverse Simpson index of genus-level relative abundances (A) GUT microbiota (B) Oral cavity microbiota (C) Skin microbiota (D) Vaginal Microbiota; S.I. (Simpson Index) .... 193
5.6. Functional KEGG profile derived from GUT microbiome among three populations (Malawi, South Korea & USA) .... 194
5.7. Functional KEGG profile derived from oral cavity microbiome among two populations (Canada and Thailand) .... 195
5.8. Functional KEGG profile derived from skin microbiome among four populations (Benin, Brazil, Netherlands & USA) .... 196
5.9. Functional KEGG profile derived from vaginal microbiome in UK population .... 196
5.10. Principal component plot showing variation in KEGG functional profile among representative populations derived from niche specific microbiota. .... 197
5.11. Enriched microbial communities at various niches of the human body in diverse populations around the world. Box color: body niche; color in map: percentage urbanization of countries (http://www.unicef.org/); up arrow: dominant abundance of phyla/genus compared to the respective population; down arrow: low abundance of phyla/genus/family compared to the respective population; * and #: comparisons between specific countries; number in respective boxes: citations as superscripted in the text paragraph just above this figure. .... 203
5.12. Gradual transition of the gut microbiota composition with changes in the host subsistence strategies. .... 204

LIST OF ABBREVIATIONS

BLAST: Basic Local Alignment Search Tool
BPGA: Bacterial Pan Genome Analysis pipeline
CA: Cluster Analysis
CDS: Coding Sequences
CM: Core Metagenome
CNM: Core Niche Metagenome
COG: Clusters of Orthologous Groups
COPD: Chronic obstructive pulmonary disease
dN: Non Synonymous Substitution
DNA: Deoxyribonucleic acid
dS: Synonymous Substitution
EBI: European Bioinformatics Institute
ENA: European Nucleotide Archive
FC: Flexible Core
FMT: Fecal Microbiota Transplantation
GIT: Gastro Intestinal Tract
GM: Gut Microbiome/Microbiota
HGP: Human Genome Project
HGT: Horizontal Gene Transfer
HMP: Human Microbiome Project
HMP-DACC: HMP Data Analysis and Coordination Center
HUMAnN: The HMP Unified Metabolic Analysis Network
IBD: Inflammatory Bowel Disease
iHMP: Integrative Human Microbiome Project
ITEP: Integrated Toolkit for Exploration of Pan-genomes
KAAS: KEGG Automatic Annotation Server
KEGG: Kyoto Encyclopedia of Genes and Genomes
MetaHIT: Metagenomics of the Human Intestinal Tract
MNM: Minimal Niche Metagenome
NCG: Niche Core Genome
NGS: Next Generation Sequencing
NIH: National Institutes of Health
NJ: Neighbor Joining
NKEP: Number of KEGG enzymes in respective pathways
NMDS: Non-metric multidimensional scaling
OTU: Operational Taxonomic Unit
PCA: Principal Component/Coordinate Analysis
PD: Periodontitis
PDGHM: Prevotella Draft Genome from Human Microbiome
PGAP: Pan Genome Analysis Pipeline
PGAT: Prokaryotic Genome Analysis Tool
PICRUSt: Phylogenetic Investigation of Communities by Reconstruction of Unobserved STates
PTS: Phosphotransferase System
RA: Rheumatoid arthritis
RDP: Ribosomal Database Project
RNA: Ribonucleic acid
RT: Respiratory Tract
SC: Stringent Core
SI: Simpson Index
SSD: Summation of Standard Deviation
T2D: Type 2 Diabetes
UGT: Uro-Genital Tract
WEIRD: Western, Educated, Industrialized, Rich and Democratic countries
WMS: Whole Metagenome Shotgun

Chapter 1

GENERAL INTRODUCTION

1.1. Preamble .... 1
1.2. The Early Days of Human Microbiome Studies .... 2
1.3. Human Microbiome Project (HMP) .... 3
1.4. Body Niche and Microbiota .... 5
1.5. Worldwide Diversity in Microbiota Composition .... 6
1.6. Approaches for Microbiome Characterization and Analyses .... 8
1.7. Therapeutic Application of Human Microbiome .... 9
1.8. Variations in Microbiota Between Different Body Sites of the Human .... 10
1.9. Pan-genome Analysis .... 11
1.10. Metagenomics .... 15
1.11. Aims and Scope of the Present Dissertation .... 19

Chapter 1: General Introduction

1.1. Preamble

We share our body space with around 100 trillion microbial symbionts, which play fundamental roles in our health and disease. Collectively known as the "human microbiota" [Consortium 2012; Turnbaugh et al. 2007], this resident microflora is made up of communities of symbiotic, commensal and pathogenic bacteria (along with fungi and viruses) that have co-evolved with the human race. Together with the "host" human cells, they form a complex ecosystem that, as a whole, interactively performs a multitude of functions pertaining to our nutrition, development, metabolism, immunity and many other fundamental processes. They help to digest our food, govern our intake of fat, produce certain vitamins, regulate our immune system, protect us against pathogenic microbes, and influence our disease susceptibility and our likelihood of developing allergies [Johnson et al. 2012; D'Argenio et al. 2015; Lloyd-Price et al. 2016]. The components of the microbiota are usually beneficial to us, but many recent experimental studies have associated dysbiosis (an alteration in microbiota composition) with various diseases, such as periodontitis, rheumatoid arthritis, inflammatory bowel disease (IBD), chronic obstructive pulmonary disease (COPD), multiple sclerosis, type-1 and type-2 diabetes, allergies, asthma, autism, bacterial vaginosis, obesity and cancer [Schwarzberg et al. 2014; Wang et al. 2013; Scher et al. 2013; Scher et al. 2011; Maeda et al. 2016; Bernard et al. 2014; Fukui et al. 1999; Stingu et al. 2013; Clarke et al. 2012; Zozaya-Hinchliffe et al. 2010; Hilty et al. 2010; Brook et al. 2002; Finegold et al. 1996; Brook et al. 1995]. The intricate relationship between these microbes and our health is the focus of a growing number of research initiatives, and a perception is just beginning to emerge that studying the human genome alone will not be sufficient to comprehend the biology of the human being.
Before the Human Genome Project (HGP), some researchers estimated that the human genome contained about 100,000 genes, but the completed HGP reported only ~20,000 protein-coding genes. This surprised many scientists, because the number is not significantly larger than that of less complex organisms such as the fruit fly and the roundworm [Hood et al. 2003; Green et al. 2015; Engel et al. 1993]. With so few protein-coding genes, the human genome alone cannot perform all metabolic tasks. Today we know that the microorganisms living in or on the human body provide traits that humans did not need to evolve in their own genome [Sender et al. 2016; Human Microbiome Project Consortium et al. 2012]. The realization that our genetic landscape is the sum of the genes embedded in our own genome and in the genomes of our microbiota (the microbiome), and that our metabolic features are a coalescence of human and microbial traits, has led to the launch of numerous microbiome projects worldwide. Just as the question “what is it to be human?” has fascinated humans since the beginning of recorded history, the question “what is the human microbiome?” is now obsessing researchers around the world. Large-scale endeavors such as the Human Microbiome Project (HMP) have been initiated to characterize the healthy human microbiome [Turnbaugh et al. 2007]. Studies are being conducted to explore the plausible disease links of the microbiome, and efforts are being made to understand how the microbiome varies with host lifestyle, genetics, age, nutrition, medication and environment.

1.2. The Early Days of Human Microbiome Studies

In 1683 Antony van Leeuwenhoek, using a microscope of his own making, examined the plaque between his own teeth (“a little white matter”) and wrote: “I then most always saw, with great wonder, that in the said matter there were many very little living animalcules, very prettily a-moving” [Leeuwenhoek et al. 1683]. These observations are the earliest recorded roots of microbiology and the first recorded study of the human microbiome. Leeuwenhoek also noted striking differences in microbial content between oral and faecal samples from individuals in states of health and disease [Dobell et al. 1920]. Studies on microbial diversity at different body sites, and between health and disease, are therefore as old as microbiology itself. What is new today is the ability to use powerful molecular techniques to characterize the compositional details of such diversity and to gain insight into their plausible functional implications. Recent advances in culture-independent, high-throughput next-generation sequencing technologies have enhanced our ability to characterize the human microbiome in various states of health and disease. The concurrent development of molecular phylogenetic approaches, which study natural microbial ecosystems without the traditional requirement for cultivation, has provided a fundamental breakthrough by enabling researchers to compare microbial communities across environments within a unified phylogenetic context [Consortium 2012; Turnbaugh et al. 2007; Pace et al. 1997].


The term “microbiome” was coined in 2001 by Joshua Lederberg to signify the ecological community of microorganisms, including eukaryotes, archaea, bacteria and viruses, that inhabit a shared ecological niche such as the oceans, soil or the human body [Lederberg et al. 2001]. Defining the “microbiome” precisely has been complicated by confusion about terminology. To date, many scientific articles use “microbiota” (the microbial taxa associated with a specific ecological niche within the human body) and “microbiome” (the catalog of these microbes and their genes) interchangeably. Many fundamental questions about the human microbiome remain difficult to address. Questions like “What is the typical composition of a healthy human microbiome?”, “How many species live in a given body site?” or “How stable is the microbiome within an individual over time?” are still hard to answer, not only because of problems with the definition of bacterial species and with the rate of sequencing error, but also because of the intrinsic variation in the human microbiome across individuals. Nevertheless, attempts are being made around the world to characterize the microbiome of different body habitats in people of diverse genetic, demographic, ethnic and geographical backgrounds in various states of disease and health.

1.3. Human Microbiome Project (HMP)

With the goal of identifying and characterizing the human microbiome and its role in human health and disease, the National Institutes of Health (NIH) initiated the Human Microbiome Project (HMP) in 2008 [Turnbaugh et al. 2007]. Most of the HMP funding comes from the NIH Common Fund, which is designed to catalyze new and emerging areas of science. The HMP is a conceptual and experimental extension of the Human Genome Project. Using metagenomic sequencing, the HMP planned to identify the microorganisms associated with the human body at five major body sites (18 sub-sites): the gastrointestinal tract (GIT), oral cavity, respiratory tract (RT), skin and urogenital tract (UGT). Its aims included developing a reference set of 3,000 isolate microbial genome sequences from healthy individuals, developing new tools and technologies for computational analysis, establishing a data analysis and coordinating center (DACC), and creating resource repositories and clinical protocols for sampling the human microbiome [www.hmpdacc.org]. In addition to the healthy cohort project, the HMP supported fifteen demonstration projects aimed at determining the relationship between disease and changes in the human microbiome: three on skin diseases (acne, psoriasis, and atopic dermatitis & immunodeficiency), six on GIT diseases (Crohn's disease, esophageal adenocarcinoma, necrotizing enterocolitis, pediatric inflammatory bowel disease [IBD], ulcerative colitis and obesity), four on UGT conditions (bacterial vaginosis, circumcision, reproductive history, and sexual history) and one on febrile illnesses [Peterson et al. 2009; Proctor et al. 2014; Proctor et al. 2011]. The HMP also supported studies of the ethical, legal and social implications (ELSI) of metagenomic analysis of the human microbiota and its applications. The HMP has the potential to change the current landscape of medical and pharmaceutical science. It is expected that the HMP will not only reveal new ways to identify susceptibility to diseases but also define how probiotics and prebiotics can be used to manipulate the human microbiota to cure diseases in the context of an individual's physiology [Sanders et al. 2015; Doron et al. 2015; Ozen et al. 2015; Rodriguez et al. 2015; Butel et al. 2014; Corzo et al. 2015; Vandenplas et al. 2014; Bindels et al. 2015; Rossen et al. 2015; Cammarota et al. 2014]. Much has been explored about the composition and diversity of the human microbiota, but we still know little about the mechanisms by which the microbiota interacts with the human body [Stilling et al. 2014]. The second stage of the HMP, the Integrative Human Microbiome Project (iHMP), was launched by the NIH in 2014 to study the mechanisms of interaction between the microbiota and human host activities in disease-specific cohorts [Proctor et al. 2014]. These data sets will serve as experimental test beds to evaluate new models, methods and analyses of host-microbiome interactions. The iHMP will create integrated longitudinal datasets from both the microbiome and the host in three cohort studies of microbiome-associated conditions (preterm birth, inflammatory bowel disease and type 2 diabetes) using multiple “omics” technologies [Proctor et al. 2014].
The number of publications on the microbiome has grown exponentially over the last decade (Fig. 1.1), which shows that the scientific community is currently focusing on this emerging field in order to understand the role of the microbiome in human health.

Figure 1.1: Hits from a PubMed search for the term “Microbiome” (x-axis: Year; y-axis: Number of Publications)

1.4. Body Niche and Human Microbiota

To characterize the human microbiome, the HMP aimed to collect samples from five major body sites, comprising 18 sub-sites, as shown in Fig. 1.2. The human fetus develops in a sterile environment in the womb, but during and after birth neonates acquire a wide range of microbes from various sources such as the birth canal, the mother's skin, respiratory tract, gastrointestinal tract and oral cavity, breast milk, relatives, pets, soil and water. The composition of the microbiota at a specific body niche is influenced or determined by various factors such as host genetics, age, gender, diet, lifestyle, climate, hygiene practices and occupation [Blaser et al. 2008; Blekhman et al. 2015; Castellarin et al. 2011; Chen et al. 2007; Gao et al. 2008; Garrett et al. 2010; Goodrich et al. 2014; Islami et al. 2008; Kostic et al. 2012; Ley et al. 2005; O'Keefe et al. 2015; Peek et al. 2002; Tana et al. 2010; Turnbaugh et al. 2006; Wang et al. 2011]. Colonization and adaptation of microbes at a specific body site are governed by forces that differ from one body site to another, for instance host defense, microbial competition, and nutritional and physico-chemical conditions (Fig. 1.2).

Figure 1.2: Nutritional and physico-chemical determinants of microbiota at major body sites of human

1.5. Worldwide Diversity in Microbiota Composition

One of the fundamental issues in microbiome research is the characterization of the healthy human microbiota. Recent studies have revealed substantial divergences in microbiome structure among healthy individuals from different race and ethnicity categories [Leung et al. 2015; Li et al. 2014; Mason et al. 2013; Nam et al. 2011; Nasidze et al. 2011; Schnorr et al. 2014; Van Treuren et al. 2015; Yap et al. 2011; Yatsunenko et al. 2012]. However, most studies of the human microbiome have so far been carried out on US or western populations, and little attention has been paid to characterizing microbiome composition and its impact on human health in non-western populations [Morton et al. 2015]. From a global perspective, human microbiome studies are therefore mostly partial, representing only people from the US, Europe and other so-called WEIRD countries (i.e., Western, Educated, Industrialized, Rich and Democratic countries), who are mostly urban [Morton et al. 2015]. Only recently have some national and international initiatives been taken to characterize the human microbiome in diverse ethnic populations (Table 1.1), and there is a fast-growing collection of data describing microbiome structures in various non-US or non-Western populations [Moossavi et al. 2014]. These studies have shown significant variations in microbiome composition in healthy individuals from different race and ethnicity categories [Leung et al. 2015; Li et al. 2014; Mason et al. 2013; Nam et al. 2011; Nasidze et al. 2011; Schnorr et al. 2014; Van Treuren et al. 2015; Yap et al. 2011; Yatsunenko et al. 2012]. Between-group differences in susceptibility to many health conditions, from preterm birth to type 2 diabetes, obesity and even cancer, are now being linked to microbiome diversity [Blaser et al. 2008; Blekhman et al. 2015; Castellarin et al. 2011; Chen et al. 2007; Gao et al. 2008; Garrett et al. 2010; Goodrich et al. 2014; Islami et al. 2008; Kostic et al. 2012; Ley et al. 2005; O'Keefe et al. 2015; Peek et al. 2002; Tana et al. 2010; Turnbaugh et al. 2006; Wang et al. 2011].

Table 1.1: National and International Initiatives for Human Microbiome study by different countries

S. No. | Initiative | Year | Country
1 | Human Microbiome Project | 2007 | USA
2 | International Human Microbiome Consortium | 2008 | International
3 | MetaGenoPolis | 2013 | France
4 | Metagenomics of Human Intestinal Tract | 2008 | Europe
5 | International Human Microbiome Standards - IHMS | 2011 | Europe
6 | Korean Microbiome Diversity Using Korean Twin Cohort Project | 2010 | Korea
7 | The Australian Jumpstart Human Microbiome Project | 2009 | Australia
8 | Canadian Human Microbiome Initiative | 2009 | Canada
9 | MicroObes, Human Intestinal Microbiome in Obesity & Nutritional Transition | 2008 | France
10 | Human Gut Microbiome and Infections | NA | China
11 | DAFF/HRB elderly gut metagenomics project ELDERMET | 2007 | Ireland
12 | Human Metagenome Consortium | 2005 | Japan


Various patterns of micro-ecology have been observed at distinct body habitats across countries and populations around the world. Many studies demonstrate geography- or ethnicity-specific divergences in gut microbiota (GM) composition. For instance, the American community, the Japanese and Korean communities, and the Chinese community showed higher abundances of Firmicutes, Actinobacteria and Bacteroidetes, respectively, in their GM [Nam et al. 2011]. At the genus level, Japanese subjects showed relatively higher abundance of Bifidobacterium and Clostridium, Chinese subjects of Bacteroides, and Korean subjects of Prevotella and Faecalibacterium. Nishijima et al. showed dominance of Prevotella in Malawi, Venezuela and Peru; Bacteroides in the USA, China, Denmark, Spain and France; Eubacterium in Russia; Clostridium in Sweden; and Blautia in Austria [Nam et al. 2011; Nishijima et al. 2016]. The genus Bacteroides dominated in American and Jamaican populations, while the genus Prevotella dominated in the Indian population [Kao et al. 2016]. Ruminococcus, Roseburia and Veillonellaceae dominated the gut microbiome of healthy individuals from the Netherlands [Bonder et al. 2016].

1.6. Approaches for Microbiome Characterization and Analyses

A variety of approaches have been used to characterize the human microbiome, in order to explore the role of the microbiota and how the microbiome changes with time, age, sex, geography and an individual's genetics. Microbial communities present in our body can be identified by culture-based techniques, but these are complex, laborious and expensive. They also have inherent problems: a non-selective culture medium, nutrient supply and environmental conditions would have to be appropriate for the growth of all microbial species, which is practically impossible because microbes vary in growth rate, nutritional requirements and optimal environmental conditions [Douterelo et al. 2014]. Over the past few years, molecular techniques have been widely used to analyze the microbiome and avoid the problems encountered with culture-dependent methods [Douterelo et al. 2014]. One such technique is based on sequencing 16S rRNA genes. The sequenced genes are then compared with a reference database of 16S rRNA genes to identify the species present in a target sample.
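The comparison of sequenced 16S rRNA genes against a reference database can be illustrated with a small sketch. It is only a toy under stated assumptions: the function name `assign_taxon`, the two reference entries and the reads are hypothetical, and the similarity score from Python's `difflib.SequenceMatcher` stands in for the alignment-based identity that real pipelines compute.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def assign_taxon(read: str, references: Dict[str, str],
                 min_similarity: float) -> Tuple[Optional[str], float]:
    """Return the best-matching reference name and its similarity score.

    The name is None when no reference reaches min_similarity.
    """
    best_name: Optional[str] = None
    best_score = 0.0
    for name, ref_seq in references.items():
        # ratio() is 2*M/T (M = matched characters, T = total length of both
        # strings); here it is an assumed stand-in for alignment identity.
        score = SequenceMatcher(None, read, ref_seq, autojunk=False).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_similarity:
        return None, best_score
    return best_name, best_score


# Hypothetical toy reference "database" and reads (not real 16S sequences).
toy_references = {
    "taxon_X": "ACGTACGTGGCCTTAA",
    "taxon_Y": "TTTTGGGGCCCCAAAA",
}
hit, hit_score = assign_taxon("ACGTACGAGGCCTTAA", toy_references, min_similarity=0.9)
miss, miss_score = assign_taxon("GGGGGGGGGGGGGGGG", toy_references, min_similarity=0.9)
```

In this toy run the first read differs from `taxon_X` by a single base and is assigned to it, while the second read falls below the (arbitrarily chosen) cut-off and is left unassigned.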


The drastic reduction in sequencing costs over recent years has made it possible to identify microbial taxa in a microbiome that are difficult or impossible to culture. In shotgun metagenomics, the total DNA of a sample from a specific niche is sequenced without targeting any specific gene, giving a view of the complete genetic repertoire of all the microbial organisms present [Sharpton et al. 2014]. Metagenomic sequencing makes it feasible to assemble complete genomes from a microbial community and provides insights into community function, metabolism and interactions between the organisms of the community. Computational analysis of the huge amount of data generated by metagenomic sequencing is a major challenge for bioinformaticians, because the databases lack the reference genome sequences needed to identify microbial organisms and their metabolic functions. To address this problem, the HMP aims to sequence 3000 genomes from the diverse body sites of the human [Turnbaugh et al. 2007]. By June 2016 the HMP had sequenced the genomes of 1171 phylogenetically diverse isolates from different body sites of the human: 457 from the GIT, 249 from the oral cavity, 125 from skin, 50 from the airways and 148 from the UGT. These reference genomes have greatly improved the ability to analyze metagenomic data.

1.7. Therapeutic Application of Human Microbiome

The 100 trillion microbes that reside at distinct sites of the human body form a microbial environment that benefits human health in many ways [Turnbaugh et al. 2007]. In a healthy state there is homeostasis between the human host and the microbiota. Any change in the microbiota may result in dysbiosis, a disruption of this homeostasis that may be implicated in various diseases such as periodontitis, obesity, type-1 and type-2 diabetes, inflammatory bowel disease (IBD), rheumatoid arthritis, allergies, asthma, autism and even cancer [Schwarzberg et al. 2014; Wang et al. 2013; Scher et al. 2013; Scher et al. 2011; Maeda et al. 2016; Bernard et al. 2014; Fukui et al. 1999; Stingu et al. 2013; Clarke et al. 2012; Zozaya-Hinchliffe et al. 2010; Hilty et al. 2010; Brook et al. 2002; Finegold et al. 1996; Brook et al. 1995]. Scientists are currently focusing on curing microbiota-associated diseases by restoring the microbiome composition to its healthy state [Rossen et al. 2015]. However, there are major challenges in developing microbiota-based therapies. The mechanism of adaptation of the microbiome at distinct body sites, the nature of the imbalance in microbiome composition in various diseases and the role of the microbiome in the pathology of a specific disease must be established, so that the proper combination of microbes can be determined to treat diverse diseases. In the last few years, certain diseases such as C. difficile gut infections or inflammatory bowel disease were effectively cured, with a few exceptions, by fecal microbiota transplantation, which inspired a large section of the scientific community and pharmaceutical companies to develop microbiome-derived drugs or therapies [Rossen et al. 2015]. Several routes to microbiota-based therapies have been identified, such as probiotics, prebiotics, whole microbiota transplantation and antibacterial conjugate vaccines. Specific microbial strains that are well known for specific beneficial functions can be used as probiotics to maintain a healthy microbiome composition, while prebiotics are non-digestible food ingredients that beneficially affect the host by selectively stimulating the growth and/or activity of one or a limited number of bacteria at a specific body niche [Sanders et al. 2015; Doron et al. 2015; Ozen et al. 2015; Rodriguez et al. 2015; Butel et al. 2014; Corzo et al. 2015; Vandenplas et al. 2014; Bindels et al. 2015; Rossen et al. 2015; Cammarota et al. 2014]. Prebiotics are usually oligosaccharides or more complex saccharides, which must resist host digestion, absorption and adsorption before fermentation by ≥1 species of the indigenous microbiota. A combination of probiotics and prebiotics might provide a synergistic effect to reverse the dysbiosis of the microbiota in specific diseases.
The human microbiome could be manipulated by such “smart” strategies to prevent and treat acute gastroenteritis, antibiotic-associated diarrhea and colitis, inflammatory bowel disease, irritable bowel syndrome, necrotizing enterocolitis, and a variety of other disorders.

1.8. Variations in microbiota between different body sites of the human

Determining the role of the human microbiota in disease predisposition and pathogenesis will depend critically upon first defining normal healthy states [Turnbaugh et al. 2007]. Prior studies of healthy individuals have revealed wide variations in the community composition of the microbiome both within and between people. Variations in factors like diet, environment, lifestyle or ethnicity may cause substantial divergences in microbiome composition at a specific body site between individuals. Drastic variations also exist across different body niches within an individual, as each body habitat offers a unique microenvironment to the resident microbiota. No taxon has yet been found to be universally present at all body sites of the human host, although several taxa have been found to dominate a specific body niche [Human Microbiome Project Consortium et al. 2012]. Each body site was characterized by one or a few signature taxa but, notably, less dominant taxa were also highly personalized, both among individuals and among body sites. In the oral cavity, for example, most habitats are dominated by Streptococcus, but these are followed in abundance by Haemophilus in the buccal mucosa, Actinomyces in the supragingival plaque and Prevotella in the immediately adjacent (but low-oxygen) subgingival plaque [Abubucker et al. 2012]. Elucidating the biogeography of microbial communities around the human body is therefore critical for establishing healthy baselines from which to measure deviations associated with diseases. A number of studies have been conducted on cross-habitat divergences in microbiome composition [Walter et al. 2011; Benson et al. 2010; Costello et al. 2009; Oh et al. 2010], and it has been reported that human microbiome components often undergo in situ adaptive evolution owing to environmental filtering and competitive exclusion or symbiosis between microbes, the nature of which differs from one body niche to another. It has also been reported that the adaptive strategies of microbiome components at distinct niches are genomically encoded [Walter et al. 2011]. However, prior to the present study, no attempt had been made to characterize the niche-specific modulations, if any, in the genomic architectures of the microbial species colonizing different body habitats of the human.

1.9. Pan-genome Analysis

An efficient way to study genomic alterations in microbial organisms is pan-genome analysis, an approach for analyzing data generated by Next Generation Sequencing (NGS) and for determining the genomic diversity of a specific taxonomic clade such as a species, genus or phylum [Tettelin et al. 2008; Medini et al. 2005]. Recent advances in ultra-high-throughput sequencing technology and metagenomics have led to a paradigm shift in microbial genomics, from comparisons of a few genomes to large-scale pan-genome studies at different scales of phylogenetic resolution. The pan-genome, a concept introduced by Tettelin and co-workers, refers to the complete inventory of genes in a specific phylogenetic clade [Tettelin et al. 2008; Medini et al. 2005]. In a study of seven Streptococcus agalactiae genomes, Tettelin et al. [Tettelin et al. 2005] demonstrated that strains of a bacterial species may differ extensively in their gene content and that the total gene pool of a species may be orders of magnitude larger than the gene content of any single strain. It is therefore rational to describe a bacterial species by its pan-genome, which comprises a conserved core genome (genes present in all genomes of a given dataset) and a variable (or dispensable) genome, which is further subdivided into the accessory and unique genomes. The accessory genome includes genes present in more than one genome but not in all, whereas unique genes (also known as singletons) are specific to a single strain (Fig. 1.3). Core genes are mostly housekeeping genes that are essential for growth, whereas the dispensable genome contributes to species diversity: it may encode supplementary metabolic pathways and functions that are not essential for growth but confer selective advantages, such as adaptation to diverse niches or antibiotic resistance.

Figure 1.3: Construction of Pan-genome from three genomes (A, B and C): Pangenome = Core Genes + Dispensable Genes + Unique Genes [Dis. = Dispensable Genes; Uni. = Unique Genes]
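The partition shown in Fig. 1.3 can be written down as a short sketch. It assumes that orthologous gene families have already been assigned to each genome; the function name `partition_pan_genome`, the genome labels and the family identifiers `f1` to `f6` are hypothetical and serve only as an illustration, not as the procedure used by any particular pan-genome tool.

```python
from typing import Dict, Set


def partition_pan_genome(genomes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split gene families into core, accessory (dispensable) and unique sets.

    Assumes at least two genomes; with a single genome every family would be
    both core and unique.
    """
    n_genomes = len(genomes)
    # Count in how many genomes each gene family occurs.
    counts: Dict[str, int] = {}
    for families in genomes.values():
        for family in families:
            counts[family] = counts.get(family, 0) + 1
    return {
        "pan": set(counts),                                        # every family seen
        "core": {f for f, c in counts.items() if c == n_genomes},  # in all genomes
        "accessory": {f for f, c in counts.items() if 1 < c < n_genomes},
        "unique": {f for f, c in counts.items() if c == 1},        # singletons
    }


# Hypothetical toy data: three genomes described by gene-family identifiers.
toy_genomes = {
    "A": {"f1", "f2", "f3", "f5"},
    "B": {"f1", "f2", "f4"},
    "C": {"f1", "f3", "f6"},
}
parts = partition_pan_genome(toy_genomes)
```

For the toy data the three partitions are disjoint and together make up the pan-genome, matching the relation given in the figure.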

During the last decade, pan-genome studies have been conducted on nearly fifty bacterial species, including model organisms like Escherichia coli, members of the normal human flora like Lactobacillus paracasei, and a number of pathogens like Streptococcus pneumoniae, Haemophilus influenzae, Yersinia pestis and Coxiella burnetii [Rasko et al. 2008; Smokvina et al. 2013; Hogg et al. 2007; Snipen et al. 2009; Hiller et al. 2007; Eppinger et al. 2010]. Traditionally defined at the species level, the pan-genomic approach has since been applied at various levels of phylogenetic resolution, ranging from genus to phylum, kingdom and beyond [Bezuidt et al. 2016; Gao et al. 2014; Huang et al. 2015]. Genus-level studies were carried out on Streptococcus, Salmonella, Prochlorococcus etc., while a phylum-level study was conducted on Chlamydiae [Collingro et al. 2011]. Lapierre and Gogarten extended the concept of the pan-genome to the entire bacterial domain [Lapierre et al. 2009; Kettler et al. 2007; Jacobsen et al. 2011; Donati et al. 2010; Lefebure et al. 2007]. The concept has also been extrapolated to viral, plant and fungal genome studies [Aherfi et al. 2013; Cao et al. 2011; Read et al. 2013; Li et al. 2014; Dunn et al. 2012]. Pan-genome analyses have provided valuable insight into genome dynamics, species evolution, population structure, pathogenesis or symbiosis, drug resistance and many other features of the microbial world [Tettelin et al. 2005; Medini et al. 2005; Vernikos et al. 2015; Mira et al. 2010; Xiao et al. 2015; Serruto et al. 2006]. The approach has also been used for the development of vaccines against bacteria [Maione et al. 2005]. By extrapolating the pan-genome, one can predict how many additional genomes need to be sequenced to characterize the genomic diversity of a specific taxonomic clade. Pan-genomes are classified into two types, “open” and “closed” (Fig. 1.4). A pan-genome whose size keeps increasing as new genomes are sequentially added to the dataset is called open, whereas in a closed pan-genome additional genomes contribute no novel genes. An open pan-genome indicates that the species, genus or phylum is still evolving by gene acquisition and diversification, whereas a closed pan-genome indicates the opposite.

Figure 1.4: Type of pan-genome: Open (sky color) and Closed (violet color). Plot title: Open/Closed Pan-Genome; x-axis: Number of Genomes; y-axis: Pan-genome Size; fitted curves: y_pan_open = 652.99 x^0.30 and y_pan_closed = 552.25 x^0.06
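The curves in Fig. 1.4 have the power-law form y = k x^γ, where x is the number of genomes and y the pan-genome size. A minimal sketch of how such a curve could be fitted and extrapolated is given below, assuming an ordinary least-squares fit in log-log space. The function names are hypothetical, the data points are synthetic values generated from the two curves in the figure rather than measurements, and the text does not state an exponent cut-off separating open from closed pan-genomes, so none is assumed here.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(n_genomes: Sequence[float],
                  pan_sizes: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = k * x**gamma in log-log space; returns (k, gamma)."""
    xs = [math.log(x) for x in n_genomes]
    ys = [math.log(y) for y in pan_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                         # slope of the log-log line
    k = math.exp(mean_y - gamma * mean_x)     # intercept, back-transformed
    return k, gamma


def expected_new_genes(k: float, gamma: float, n_sequenced: int) -> float:
    """Growth of the fitted curve when one more genome is added."""
    return k * ((n_sequenced + 1) ** gamma - n_sequenced ** gamma)


# Synthetic points generated from the two curves reported in Fig. 1.4.
genomes = list(range(1, 9))
open_sizes = [652.99 * g ** 0.30 for g in genomes]
closed_sizes = [552.25 * g ** 0.06 for g in genomes]

k_open, gamma_open = fit_power_law(genomes, open_sizes)
k_closed, gamma_closed = fit_power_law(genomes, closed_sizes)
new_open = expected_new_genes(k_open, gamma_open, 8)
new_closed = expected_new_genes(k_closed, gamma_closed, 8)
```

With these synthetic points the curve with the larger exponent predicts many more additional gene families from a ninth genome than the flatter curve, which is the qualitative difference between the two pan-genome types.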

1.9.1. Software for Pan-genome Characterization

With recent advances in next-generation sequencing techniques and metagenomics, bacterial genomics has moved from single-genome studies to simultaneous comparison of hundreds to thousands of genomes at different scales of taxonomic resolution. Many software tools and pipelines are available in the public domain for characterizing the pan-genome of a species, genus or phylum, as described in Table 1.2 [Chaudhari et al. 2016; Laing et al. 2010; Brittnacher et al. 2011; Zhao et al. 2012; Benedict et al. 2014; Page et al. 2015; Wozniak et al. 2011; Treangen et al. 2014; Ernst et al. 2013; Zhao et al. 2014; Santos et al. 2013]. These tools can perform several types of pan-genomic analyses.


Table 1.2: Software or pipelines for pangenome analysis

Software Name | Web Address | Platforms Required
BPGA | http://www.iicb.res.in/bpga/index.html | L/W/M
Panseq | https://lfz.corefacility.ca/panseq/ | L/W/O
PGAT | http://nwrce.org/pgat | O
PGAP | http://pgap.sourceforge.net/ | L
ITEP | https://price.systemsbiology.net/itep | L
Roary | http://sanger-pathogens.github.io/Roary | L
CAMBer | http://bioputer.mimuw.edu.pl/camber/index.html | L/W
Harvest | https://github.com/marbl/harvest | L/M
PanCake | https://bitbucket.org/CorinnaErnst/pancake/wiki/Home | L/W
PanGP | http://PanGP.big.ac.cn | L/W
PANNOTATOR | http://bnet.egr.vcu.edu/pannotator/index.html | O

Major features a, b, c, d a a, d a, b, c, d a, c, d a, b a, c a b a

Note: (a) Clustering homologous genes, creates pan-matrix; (b) Pan-genome profile analysis; (c) Phylogenetic analysis; (d) Functional analysis; (L) Linux; and (W) Windows; (M) Mac; (O) Online.

1.10. Metagenomics

Historically, microbial communities were characterized using culture-dependent approaches, which require a specific culture medium for growing the organisms in the laboratory. This approach is limited to organisms that can grow on culture media. An enormous number of microbial species have never been grown in the laboratory, and options for analyzing and quantifying uncultured organisms were severely restricted until the development of DNA-based culture-independent approaches in the 1980s, now called metagenomics [Pace et al. 1986]. The term "metagenomics" was coined by Jo Handelsman, Jon Clardy, Robert M. Goodman, Sean F. Brady and others to capture the notion of analyzing a collection of gene sequences from the environment in a way analogous to the study of a single genome [Handelsman et al. 2004]. More recently, Kevin Chen and Lior Pachter defined metagenomics as "the application of modern genomics technique without the need of isolation and lab cultivation of individual species" [Chen et al. 2005]. Culture-dependent studies also cannot estimate the relative abundance of species in a specific niche, because cultivation is biased by differences in growth rate on artificial culture media [Morgan et al. 2012]. Furthermore, it is not feasible to quantify and analyze whole microbial communities in detail in a laboratory setting [Morgan et al. 2012]. Metagenomic methods have the potential to overcome these limitations by analyzing whole microbial communities together. Such microbial communities can be referred to as the microbiota, while the full genetic potential derived from these communities is called the microbiome [Morgan et al. 2012]. Culture-independent techniques, which analyze DNA extracted directly from a sample rather than from individually cultured microbes, allow us to investigate several aspects of microbial communities, such as taxonomic diversity, the functional role of the community, adaptation strategies, competition, and cooperative behavior or mutualism. Metagenomics has transformed the field of microbial ecology because it facilitates the study of entire microbial communities directly from a sample, rather than through cultivation of single organisms and amplification of their DNA [Handelsman et al. 2004]. Microbes, which include archaea, bacteria, fungi and viruses, are the most common form of life on earth and have a massive impact on the ecosystems they inhabit. The term metagenomics covers both (A) targeted metagenomics, which is based on amplification and sequencing of a phylogenetic marker, for example the 16S ribosomal RNA gene for bacteria (Fig. 1.5) and the 18S gene for micro-eukaryotes [Morgan et al. 2012], and (B) shotgun metagenomics, in which the whole DNA from the sample is fragmented and analyzed without amplification (Fig. 1.5) [Morgan et al. 2012].

Whole shotgun metagenomics is divided into two main approaches: (1) function-based, which focuses on investigating the biological functions of gene products derived from the DNA of the sample, and (2) sequence-based, which relies on sequencing the DNA extracted from a sample [Morgan et al. 2012]. The function and taxonomy of the sequenced DNA can, to some extent, be determined using reference databases of known genes and microbial genomes. To investigate microbial communities efficiently at scale, almost all current studies employ high-throughput DNA sequencing, increasingly in combination with other genome-scale platforms such as proteomics or metabolomics. Although DNA sequencing had been expensive since its introduction in the 1970s, the advent of low-cost, next-generation high-throughput sequencing has made it economically feasible for most scientists to sequence the DNA of entire environmental samples in metagenomic studies [Morgan et al. 2012; Handelsman et al. 2004].


Figure 1.5: General workflow for metagenomics


1.10.1. Taxonomic profiling
The 16S ribosomal RNA (rRNA) gene is generally used as a bacterial phylogenetic marker to determine the composition of the microbiome in a sample (derived, for example, from human, soil or marine environments) [Kim et al. 2015]. Two approaches are common for grouping 16S rRNA sequences. The first is a homology-based approach, in which 16S rRNA reads are searched against a 16S rRNA sequence database such as GreenGenes, the Ribosomal Database Project (RDP) or Silva [Cole et al. 2009; Quast et al. 2013; DeSantis et al. 2006]. Using sequence alignment algorithms, sequencing reads from diverse bacterial 16S rRNAs are assigned to the closest species available in the respective database of known bacterial species. The second approach clusters the reads by sequence identity (known as OTU clustering) using tools such as CD-HIT, UCLUST and DNACLUST; the representative (longest) sequence of each cluster is then searched against 16S rRNA databases for taxonomy assignment [Li et al. 2006; Edgar et al. 2010; Lozupone et al. 2006]. Several clustering methods (e.g. UCLUST and DNACLUST) use k-mer profiling to obtain rough estimates of the similarity between pairs of 16S rRNA sequences as a pre-processing step that reduces the time spent on sequence alignment [Edgar et al. 2010]. Clustering 16S rRNA for OTU Prediction (CROP) splits the sequences into blocks, applies independent Bayesian clustering to each block, and merges all blocks to complete the clustering into OTUs [Hao et al. 2011]. This method does not require a sequence identity threshold to cluster sequences and is more robust against sequencing errors. QIIME, a pipeline for taxonomic profiling, supports both approaches [Caporaso et al. 2010]. Reads that are not mapped to any reference sequence are discarded. Once taxonomy has been assigned, the factors underlying variation among microbial communities can be investigated.
Taxonomy-based approaches using 16S rRNA genes are not sufficient to identify the biological functions of a specific microbial community. The functional profile of a microbial community can instead be estimated by mapping predicted genes and shotgun sequencing reads onto protein reference databases (Fig. 1.5).
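The OTU clustering idea described above can be illustrated with a minimal sketch. It is not the algorithm of CD-HIT, UCLUST or DNACLUST: here a k-mer Jaccard similarity stands in for the alignment-based identity that those tools compute after their k-mer screen, and the function names, the value of k and the threshold are all assumptions chosen for illustration.

```python
def kmer_set(seq, k=4):
    """Return the set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=4):
    """Jaccard similarity of two k-mer sets (a rough proxy for identity)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def greedy_otu_clustering(reads, threshold=0.97, k=4):
    """Greedy centroid clustering.

    Reads are visited longest first, so the centroid (representative)
    of each cluster is its longest member. A read joins the first
    cluster whose centroid is similar enough; otherwise it founds a
    new cluster.
    """
    clusters = []  # list of (centroid, members)
    for read in sorted(reads, key=len, reverse=True):
        for centroid, members in clusters:
            if kmer_similarity(read, centroid, k) >= threshold:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters


# Hypothetical toy reads: two identical sequences and one unrelated sequence.
toy_reads = ["ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"]
toy_clusters = greedy_otu_clustering(toy_reads)
```

In a real workflow only the centroid of each cluster would then be searched against a 16S rRNA reference database for taxonomy assignment.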

1.10.2. Functional Profiling
Functional profiling of microbial communities based on Whole Metagenome Shotgun (WMS) sequencing data attempts to catalog the genes present in a sample. A catalog of the protein-coding functions of the microbial community can be created either by matching all reads to curated reference functional databases, e.g. the KEGG database, or by assembling the reads and annotating the resulting chromosomal segments. Conventional methods such as BLAST are robust but computationally intensive, and techniques for rapid mapping of DNA reads to annotated reference genes fail when the references in the curated databases deviate from the DNA sequences of homologous genes in the metagenome sample (Fig. 1.5). To resolve these challenges, researchers often turn to metagenome assembly and subsequent annotation, which has profound limitations of its own, such as a strong bias toward abundant organisms and chimeric assembly of closely related sequences [Westbrook et al. 2017; Kim et al. 2015].

1.11. Aims and Scope of the Present Dissertation
As already stated, the distinct body sites within a human body create unique microenvironments, where the resident microbial communities undergo in situ evolution under selective pressures. These pressures are likely to vary across body niches, since the host cell environment and the species composition of the microbiome both differ drastically from one body site to another. Most human microbiome studies have focused on habitat-specific variations in microbiome composition at the taxonomic level only, and not at the genetic level of the microbiome components, though there are reasons to believe that the adaptive strategies of these microbes at distinct niches might be genomically encoded. The availability of reference microbial genome sequences derived from distinct body sites has provided an opportunity to explore niche-specific variations in the genome architectures of the resident microflora. In Chapter 2, as a case study, we have analyzed the habitat-driven changes in the gene complements and metabolic pathways of Prevotella genomes derived from different body sites of human. With a view to exploring the metabolic pathway composition of disease-associated microbial species of the human microbiome, we have carried out a KEGG pathway enrichment analysis on Prevotella strains implicated in diseases such as periodontitis and rheumatoid arthritis (Chapter 3). An endeavor has also been made to determine the principal mechanism of niche specialization at the level of metabolic pathway configuration. In Chapter 4, a further analysis identifies the dominant metabolic pathways and their source organisms in each body site, based on the relative abundances of KEGG pathway enzymes in 486 metagenomic samples derived from 18 distinct body sites of human. The outcomes of these analyses are expected to provide insight into how functional capability may be selectively modulated by altering the microbiota through different approaches in order to improve human health. In Chapter 5, we present a study on geography-, ethnicity- and subsistence-specific variations in human microbiome composition and diversity. On the basis of published data and publicly available metagenomic and 16S rRNA amplicon samples from healthy individuals, an effort has been made to characterize the core microbiome, i.e., the set of genera commonly found in a specific body site of all healthy individuals under analysis, irrespective of their geographic location, ethnic background or mode of subsistence. We have analyzed the impact of cross-population variations in microbiome composition at the metabolic pathway and functional levels in the five major human body habitats: Gastrointestinal Tract (Gut), Oral cavity, Respiratory Tract, Urogenital Tract (UGT) and Skin. Special emphasis has been given to the functional roles played by microbes with relatively low abundances at specific body sites. Finally, through a review of existing literature, an attempt has been made to delineate the trends in variation of the gross compositional structure, and the decrease in diversity, of the human microbiome, especially the gut microbiota, as human populations passed through three stages of subsistence: foraging, rural farming, and industrialized urban life.


Chapter 2
Divergences in Gene Repertoire among the Reference Prevotella Genomes Derived from Distinct Body-sites of Human

2.1. Introduction 21-23
2.2. Objectives of the Study 24
2.3. Method and Materials 25-31
2.4. Results 32-63
2.5. Discussion 64

Chapter 2: Divergence in gene repertoire of Prevotella genomes

2.1. Introduction
The genetic script of any microorganism usually portrays a complex interplay between its taxonomic legacy and ecological constraints. The legacy of the ancestral gene repertoire normally does not vary within a specific lineage, but adaptation to distinct ecological niches may cause wide differentiation among closely related genomes through selection of distinct genetic traits. Microbes under adaptive evolution often undergo a process of genomic homeostasis: some old genes are shed and new genes are acquired through horizontal transfer [Ochman et al. 2001; Dutta et al. 2012; Toft et al. 2010; Dobrindt et al. 2004; Polz et al. 2013]. Other evolutionary processes may also occur, such as gene duplication, recombination or positive selection in specific genes, which, together with neutral mutation and genetic drift, may lead to substantial genomic diversity between two species of the same genus, or even between two strains of the same species [Ochman et al. 2000; Didelot et al. 2010; Lefebure et al. 2009; Kuo et al. 2009; Paul et al. 2010; Cohan et al. 2008]. In a human body, the distinct body habitats create unique niches for the resident microbiota, which experience selective evolutionary pressure from the host as well as from other co-resident microbes [Ley et al. 2006]. The nature of this pressure is expected to vary between habitats, since the host cell environment and the community composition of the microbiome both differ drastically from one body site to another. It is known that local environmental filtering can have a great impact on the in situ evolution of the microbial flora at distinct body niches [Walter et al. 2011]. During the last few years, there has been an increasing amount of literature on adaptive evolution of the microbiome at different body habitats of human, especially the gastrointestinal tract [Walter et al. 2011; Benson et al. 2010; Costello et al. 2009; Oh et al. 2010].
However, such studies usually focus on habitat-specific variations in microbiome composition at the phylum, genus or species level, and no information is available on niche-specific variations, if any, at the genome or sub-genome level of the resident microbes, though there are reasons to believe that the adaptive strategies of the microbiome components at distinct niches might be genomically encoded [Benson et al. 2010]. The release in 2010 of an initial catalog of 178 reference genome sequences from the microbial flora of diverse anatomical niches provided an opportunity to explore niche-specific variations in the genome architectures of the human microbiota. Here, as a case study, we present the pan-genomic analysis of twenty-eight Prevotella genomes derived from different body sites of human and reported in this catalog [Nelson et al. 2010].
2.1.1. Why Prevotella?
Prevotella is one of the important genera of the bacterial phylum Bacteroidetes. It is a genus of Gram-negative bacteria, composed mainly of obligately anaerobic bacilli. Based on biochemical differences in phenotypic characteristics such as saccharolytic potential and bile sensitivity, and on 16S rRNA gene phylogeny, some species of Bacteroides were reclassified into the new genus Prevotella by Shah & Collins in 1990 [Shah et al. 1997]. Prevotella mainly exist as commensal bacteria inside the bodies of mammals, including humans; only one species, Prevotella paludivivens sp. nov., has been isolated and cultured from the soil environment [Atsuko et al. 2007]. The rationale for selecting the genus Prevotella as the case study lies in its importance as a component of the natural human flora, distributed across distinct body sites such as the GIT, oral cavity, UGT, airways and skin. It has been identified as the most dominant genus of one of the three distinct enterotypes of the human gut microbiome, the other two enterotypes being dominated by Bacteroides and Ruminococcus [Wu et al. 2011]. According to a later study on a larger cohort, the boundary between the Bacteroides-led and Ruminococcus-led enterotypes may not be as discrete as reported earlier, but the Prevotella-driven enterotype is quite distinct from both of these groups [De Filippo et al. 2010]. A study by Wu et al. showed a strong association between the relative occurrence of the gut enterotypes and the long-term diets of their hosts, the Bacteroides and Prevotella enterotypes being associated with protein and animal fat, and with carbohydrates, respectively [Wu et al. 2011]. De Filippo et al. reported the exclusive presence of Prevotella and two other genera in rural African children with fiber-rich diets, while Bacteroides were absent [De Filippo et al. 2010]. Another study, on elderly people, showed a substantial decrease in the abundances of Prevotella and Ruminococcus in the gut microbiota upon transition from healthy community-dwelling subjects to frail long-term care residents, an observation that indicates a plausible role for diet-driven alterations in the relative abundances of Prevotella and other gut microbiota in the varying rates of health decline upon aging [Claesson et al. 2012].


Apart from being natural inhabitants of the human body, Prevotella spp. may be implicated in diverse anaerobic infections arising from the respiratory, urogenital and gastrointestinal tracts [Finegold et al. 1996; Brook et al. 1995]. Polymicrobial infections involving Prevotella spp. include respiratory infections such as aspiration pneumonia, chronic otitis media, lung abscess, pulmonary empyema and sinusitis [Finegold et al. 1995]. Prevotella spp. have been isolated from brain abscesses, burns and abscesses near the oral cavity, osteomyelitis and urinary tract infections [Brook et al. 2002]. Changes in Prevotella abundance and diversity have been found in several dysbiosis-associated diseases, including bacterial vaginosis, asthma, chronic obstructive pulmonary disease (COPD) and rheumatoid arthritis [Zozaya-Hinchliffe et al. 2010; Hilty et al. 2010; Scher et al. 2013]. Studies have indicated that some Prevotella taxa might be associated with opportunistic infections, and there are also reasons to consider Prevotella spp. a reservoir of resistance genes, such as the β-lactamase genes cfxA and cfxA2 [Nord et al. 1978; Giraud-Morin et al. 2003; Handal et al. 2005; Iwahara et al. 2006]. A few earlier studies examined some Prevotella strains [Purushe et al. 2010], but an extensive study taking into account all available whole-genome sequences of the genus had never been performed. Also, very few computational studies have been carried out to understand the pan-genome structure of all members of any bacterial genus isolated and sequenced from the human microbiome. The significance of Prevotella as a component of the human microbiota is, therefore, beyond doubt. Yet little is known about the genetic basis of Prevotella diversity at different body habitats of humans and its symbiotic or pathogenic implications.


2.2. Objectives of the Study
Prevotella strains are distributed across diverse body niches of human. This diversified distribution encouraged us to probe the genetic factors responsible for adaptation to, or colonization of, different body niches by strains that belong to the same taxonomic genus or even species. Such studies may provide better insight into the adaptation strategies of microbiome components at diverse body niches. In this chapter, our specific objectives are:
• To conduct a pan-genome analysis of 28 Prevotella genomes derived from distinct body sites, namely the oral cavity, gastrointestinal tract (GIT), urogenital tract (UGT) and skin.
• To delineate the core gene complement as well as the tissue-specific variations, if any, in the accessory or dispensable genome composition of these 28 Prevotella genomes. The core genome might be essential for the association of Prevotella strains with the human body irrespective of their colonization site. The core genes are mostly housekeeping genes that are essential for the growth and survival of the organisms, whereas the dispensable genome contributes to species diversity; it might encode supplementary metabolic pathways and functions that are not essential for growth but confer selective advantages, such as adaptation to diverse niches or antibiotic resistance.
• To map the core, accessory and unique genes to COG categories with a view to identifying the major functions of these genes.
• To compare the pan-genome and core genome structures of Prevotella and Bacteroides strains of similar niche adaptation, to assess the relative importance of taxonomic traits and ecological prerequisites.


2.3. Method and Materials
2.3.1. Genome sequences
All twenty-eight genome sequences of Prevotella strains used in this study are listed in Table 2.1. These high-quality draft genome sequences were retrieved (January 2013) from HMP-DACC (the Data Analysis and Coordination Center for the Human Microbiome Project, supported by NIH). The Prevotella genomes under study were designated as Finished, High Quality Draft or Improved High Quality Draft in the HMP project catalog [http://www.hmpdacc.org]. HMP sets standards for a High Quality Draft genome (N50 > 5 kb for contigs or > 20 kb for scaffolds, and > 90% of the bacterial core gene set present) to ensure completeness of the genomes. N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value [http://www.hmpdacc.org; Tatusova et al. 2013]. Because draft genomes may lack some genomic information, we also checked the completeness of the draft assembly of each Prevotella genome using its N50 score (Table 2.1), computed with the QUAST server, and the percentage occurrence of orthologs of a set of 200 universal bacterial core genes (Table 2.1) [Gurevich et al. 2013]. Based on these calculations, we found all draft assemblies to be of very high quality (near-complete genomes). The 28 Prevotella strains used in the present analysis belong to 25 different species: 22 species are represented by one strain each, while three species, namely P. buccae, P. denticola and P. melaninogenica, are represented by two strains each. Among the 28 genomes of the PDGHM dataset, 17 were derived from the oral cavity, 7 from the urogenital tract, 3 from the gut and 1 from the skin of the human body. All annotated protein sequences of the twenty-eight members of PDGHM were also downloaded from HMP-DACC. After removal of truncated protein sequences, 73864 protein sequences remained and formed the working dataset of PDGHM proteins.
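The N50 statistic defined above can be computed directly from a list of contig or scaffold lengths. The following is a minimal sketch; the function name `n50` and the example lengths are illustrative assumptions, and this is not the QUAST implementation.

```python
def n50(lengths):
    """Return the N50 of an assembly.

    Contigs are visited from longest to shortest; N50 is the length of
    the contig at which the running total first reaches at least half
    of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (in kb) for a toy assembly of 290 kb.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)
```

For the toy assembly, the two longest contigs (80 + 70 = 150 kb) already cover more than half of the 290 kb total, so the N50 is 70 kb.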


Table 2.1 - Details of Prevotella strains used for the analysis

| Sr. No. | Name of organism | Body Niche | Bioproject Accession | Size (Mb) | G+C (%) | CDS Count |
|---|---|---|---|---|---|---|
| 1 | P. copri DSM 18205 | GIT | PRJNA30025 | 3.51 | 46.3 | 3337 |
| 2 | P. salivae DSM 15606 | GIT | PRJNA53199 | 3.27 | 42.4 | 2939 |
| 3 | P. stercorea DSM 18206 | GIT | PRJNA65131 | 3.10 | 50.5 | 3017 |
| 4 | P. buccae ATCC 33574 | Oral | PRJNA51491 | 3.28 | 52.2 | 2896 |
| 5 | P. buccae D17 | Oral | PRJNA38737 | 3.36 | 52.4 | 2617 |
| 6 | P. denticola F0289 | Oral | PRJNA49293 | 2.94 | 51.9 | 2386 |
| 7 | P. marshii DSM 16973 | Oral | PRJNA50531 | 2.56 | 48.5 | 2335 |
| 8 | P. melaninogenica ATCC 25845 | Oral | PRJNA31383 | 3.17 | 42.8 | 2296 |
| 9 | P. melaninogenica D18 | Oral | PRJNA40045 | 3.29 | 42.7 | 2461 |
| 10 | P. multiformis DSM 16608 | Oral | PRJNA53055 | 3.06 | 52.9 | 2884 |
| 11 | P. nigrescens ATCC 33563 | Oral | PRJNA64737 | 2.67 | 44 | 2421 |
| 12 | P. oris F0302 | Oral | PRJNA38329 | 3.25 | 45 | 3316 |
| 13 | P. oulorum F0390 | Oral | PRJNA43117 | 2.81 | 47.8 | 2488 |
| 14 | P. pallens ATCC 700821 | Oral | PRJNA64739 | 3.13 | 38.8 | 2860 |
| 15 | P. sp. oral taxon 299 str. F0039 | Oral | PRJNA40047 | 2.45 | 38.7 | 1935 |
| 16 | P. sp. oral taxon 306 str. F0472 | Oral | PRJNA75153 | 2.95 | 42.7 | 2442 |
| 17 | P. sp. oral taxon 317 str. F0108 | Oral | PRJNA38521 | 4.10 | 49.1 | 2926 |
| 18 | P. sp. oral taxon 472 str. F0295 | Oral | PRJNA38731 | 3.64 | 48.2 | 3092 |
| 19 | P. tannerae ATCC 51259 | Oral | PRJNA33153 | 2.59 | 47.5 | 2811 |
| 20 | P. veroralis F0319 | Oral | PRJNA38331 | 2.99 | 43 | 3048 |
| 21 | P. bergensis DSM 17361 | Skin | PRJNA34637 | 3.27 | 48.9 | 2825 |
| 22 | P. amnii CRIS 21A-A | UGT | PRJNA40671 | 2.42 | 37.7 | 2023 |
| 23 | P. bivia JCVIHMP010 | UGT | PRJNA31377 | 2.42 | 41.1 | 2041 |
| 24 | P. buccalis ATCC 35310 | UGT | PRJNA40669 | 3.03 | 46.8 | 2456 |
| 25 | P. denticola CRIS 18C-A | UGT | PRJNA61825 | 3.18 | 51.4 | 2701 |
| 26 | P. disiens FB035-09AN | UGT | PRJNA51065 | 3.00 | 41.5 | 2621 |
| 27 | P. oralis ATCC 33269 | UGT | PRJNA51495 | 2.84 | 45.6 | 2488 |
| 28 | P. timonensis CRIS 5C-B1 | UGT | PRJNA40667 | 2.80 | 43.6 | 2202 |

GIT: Gastrointestinal Tract, Oral: Oral Cavity, UGT: Urogenital tract, Size (Mb): Genome size in mega base pairs, CDS Count: Number of protein coding sequences.

2.3.2. Orthologous gene family construction
The most fundamental way to identify relationships between genes or proteins is to characterize the homology among them. Two proteins or genes are said to be homologous if they have originated from a common ancestor [Koonin et al. 2005]. There are two types of homologs: (1) paralogs, which are homologs present within the same organism or species but performing different biochemical functions, and (2) orthologs, which are homologs present in different organisms or species and having identical or very similar biochemical functions [Koonin et al. 2005].
Homology search is a bioinformatics tool for exploring the biochemical function and evolutionary history of a newly sequenced gene or protein. Significant similarity between two protein or nucleic acid sequences implies a common evolutionary origin and similar biochemical functions. Although both nucleic acid and protein sequences can be compared to identify homology, comparison of protein sequences is much more effective because of the degeneracy of the genetic code [Pertsemlidis et al. 2001]. Orthologous proteins were estimated using all-against-all protein BLAST. Hits were filtered using the CD-HIT web server, with criteria of 50% sequence similarity over at least 50% of the protein sequence length, to construct the homologous protein clusters. All 73864 protein sequences of PDGHM were clustered into 24885 distinct homologous protein families [Huang et al. 2010]. The homologous clusters were transformed into orthologous clusters by removing the paralogs. The clusters were then processed using BPGA (Bacterial Pan-Genome Analysis pipeline) to create a binary (1/0) matrix (pan-matrix), in which rows represent clusters and columns represent genomes, with '1' denoting the presence and '0' the absence of a gene from the respective genome in each cluster [Chaudhari et al. 2016].
2.3.3. Pan-genome and core genome size calculation
The total number of orthologous clusters was calculated with the sequential addition of each new Prevotella genome and plotted against the number of genomes added. To reduce bias in the pan-genome and core genome calculation, we used all possible combinations and permutations of the genomes at each step of the estimation.
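The binary pan-matrix and the sequential-addition procedure described above can be sketched as follows. This is an illustration only and not the BPGA implementation: the function names and the toy gene families are hypothetical, and random genome orders are sampled as a stand-in for enumerating all permutations, which is not feasible for 28 genomes.

```python
import random


def build_pan_matrix(clusters, genomes):
    """Return {family: [1/0 per genome]} from {family: genomes with a member}."""
    matrix = {}
    for family, members in clusters.items():
        present = set(members)
        matrix[family] = [1 if g in present else 0 for g in genomes]
    return matrix


def accumulation_curves(matrix, n_genomes, n_orders=200, seed=0):
    """Mean pan- and core-genome sizes after adding 1..n genomes.

    For each sampled genome order, the pan-genome counts families
    present in at least one of the genomes added so far, and the core
    genome counts families present in all of them.
    """
    rng = random.Random(seed)
    rows = list(matrix.values())
    pan_totals = [0] * n_genomes
    core_totals = [0] * n_genomes
    for _ in range(n_orders):
        order = list(range(n_genomes))
        rng.shuffle(order)
        for step in range(1, n_genomes + 1):
            chosen = order[:step]
            pan_totals[step - 1] += sum(
                1 for r in rows if any(r[j] for j in chosen))
            core_totals[step - 1] += sum(
                1 for r in rows if all(r[j] for j in chosen))
    pan = [t / n_orders for t in pan_totals]
    core = [t / n_orders for t in core_totals]
    return pan, core


# Hypothetical toy data: four gene families across three genomes.
toy_genomes = ["A", "B", "C"]
toy_clusters = {"f1": ["A", "B", "C"], "f2": ["A", "B"], "f3": ["C"], "f4": ["B"]}
toy_matrix = build_pan_matrix(toy_clusters, toy_genomes)
toy_pan, toy_core = accumulation_curves(toy_matrix, len(toy_genomes))
```

Averaging over many orders removes the dependence of the curves on the arbitrary order in which genomes happen to be added; the final points of both curves are the same for every order.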
We calculated the size of the complete genomic repertoire (pan-genome) of the Prevotella genus using Equation (i) and applied the power-law regression (Heaps' law) model of Equation (ii), which allows extrapolation to an infinite number of genomes and thus predicts whether the Prevotella pan-genome is open or closed [Tettelin et al. 2005; Medini et al. 2005; Tettelin et al. 2008; Zhang et al. 2014; Mira et al. 2010; Lukjancenko et al. 2010].

$$N_{pan} = \sum_{i=1}^{C} f_{pan}(G_i, O_p) \qquad \text{(i)}$$

$O_p = 0$, if no orthologous gene is shared by any strain in gene family $G_i$
$O_p = 1$, if at least one orthologous gene is shared by at least one strain in gene family $G_i$
where $C$ is the total number of orthologous gene families.

$$y_{pan} = A_{pan}\, x^{B_{pan}} + C_{pan} \qquad \text{(ii)}$$

where $y_{pan}$ is the pan-genome size, $x$ is the number of genomes, and $A_{pan}$, $B_{pan}$ and $C_{pan}$ are fitting parameters.
If $0 < B_{pan} < 1$, the pan-genome is open; if $B_{pan} = 0$ or near zero, the pan-genome is closed.

The number of core genes after the addition of each new genome was plotted as a function of the number of genomes added sequentially, in the same manner as the pan-genome plot. The size of the core genome was calculated using Equation (iii), and the exponential curve-fit model of Equation (iv) was used to fit the core genome.

$$N_{core} = \sum_{i=1}^{C} f_{core}(G_i, O_c) \qquad \text{(iii)}$$

$O_c = 0$, if the orthologous gene is not shared by all strains in gene family $G_i$
$O_c = 1$, if the orthologous gene is shared by all strains in gene family $G_i$
where $C$ is the total number of orthologous gene families.

$$y_{core} = A_{core}\, e^{B_{core}\, x} + C_{core} \qquad \text{(iv)}$$

Here, $y_{core}$ is the core genome size, $x$ is the number of genomes, and $A_{core}$, $B_{core}$ and $C_{core}$ are fitting parameters.
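The open/closed criterion on the power-law exponent can be illustrated with a simplified fit. The sketch below assumes a pure power law y = A·x^B with no additive constant, which is a simplification of the three-parameter model with fitting parameters A, B and C; under that assumption the exponent can be estimated by ordinary least squares on log-transformed data. The function names and the synthetic data are hypothetical.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = A * x**B on log-log axes; returns (A, B)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx                        # estimate of the exponent B
    intercept = mean_y - slope * mean_x      # estimate of log(A)
    return math.exp(intercept), slope


def is_open_pan_genome(exponent):
    """Apply the criterion 0 < B < 1 for an open pan-genome."""
    return 0 < exponent < 1


# Synthetic pan-genome sizes generated from y = 100 * x**0.5 (not real data).
genome_counts = list(range(1, 11))
pan_sizes = [100 * g ** 0.5 for g in genome_counts]
a_fit, b_fit = fit_power_law(genome_counts, pan_sizes)
```

Fitting the full three-parameter form would require non-linear least squares, which is outside the scope of this sketch.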


2.3.4. Evolutionary analysis
Three phylogenetic trees were constructed, based on 16S rRNA genes, core genes and the binary pan-matrix, respectively.
2.3.4.1. 16S rRNA tree
16S rRNA-based phylogeny is the traditional way to reconstruct the evolutionary relationships between organisms. The 16S rRNA gene is a convenient choice for phylogenetic tree reconstruction because it is present in all prokaryotes and is well conserved in sequence and structure. To reconstruct the 16S rRNA-based phylogenetic tree, we extracted the 16S rRNA gene sequences of all 28 Prevotella strains using the given annotation and predictions from the RNAmmer 1.2 Server, based on sequence alignment against a 16S rRNA database [Lagesen et al. 2007]. A multiple alignment of the 16S rRNA genes was performed using MEGA5, and a Neighbor-Joining phylogenetic tree was constructed with 100 bootstrap replicates [Tamura et al. 2011]. TreeGraph 2.0 was used to visualize the final 16S rRNA tree [Stover et al. 2010]. The E. coli 83972 strain was used as the out-group.
2.3.4.2. Core-genome tree
To identify and extract the core genes, we performed an analysis using BPGA, which yielded the core gene clusters; these were then subjected to multiple alignment using MUSCLE [Chaudhari et al. 2016; Edgar et al. 2004]. The aligned core genes of each orthologous cluster from all 28 Prevotella genomes were concatenated. A tree was created from the concatenated core gene alignments with 100 bootstrap replicates. TreeGraph 2.0 was used to format the tree [Stover et al. 2010].


2.3.4.3. Pan-genome tree
The pan-genome tree was constructed on the basis of the distances between the pan-genome profiles of the Prevotella strains under study. The E. coli 83972 strain was used as the out-group. We calculated the distance between pan-genome profiles using the Jaccard distance formula, thereby transforming the pan-matrix (presence/absence) into a distance matrix.

$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} \qquad \text{(v)}$$

Here, $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $A$ and $B$ are the sets of gene families present in the two genomes being compared.
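The transformation of the presence/absence pan-matrix into a Jaccard distance matrix can be sketched as follows. The function names and the toy profiles are hypothetical, and this is not the PAST3 implementation; each profile is the 1/0 column of one strain across all gene families.

```python
def jaccard_distance(profile_a, profile_b):
    """Jaccard distance between two binary presence/absence profiles.

    shared = families present in both strains,
    union  = families present in at least one of the two strains.
    """
    shared = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    union = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    if union == 0:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - shared / union


def distance_matrix(profiles):
    """Return {(strain_p, strain_q): distance} for all ordered strain pairs."""
    names = list(profiles)
    return {
        (p, q): jaccard_distance(profiles[p], profiles[q])
        for p in names
        for q in names
    }


# Hypothetical pan-genome profiles of three strains over four gene families.
toy_profiles = {
    "s1": [1, 1, 0, 1],
    "s2": [1, 0, 0, 1],
    "s3": [0, 0, 1, 0],
}
toy_distances = distance_matrix(toy_profiles)
```

Strains s1 and s2 share two of the three families present in either of them, giving a distance of 1/3, while s1 and s3 share none and are at the maximum distance of 1. A matrix of this kind is what a Neighbor-Joining or hierarchical clustering step takes as input.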

A Neighbor-Joining phylogenetic tree was created from the distance matrix using hierarchical clustering in the PAST3 statistical software [Hammer et al. 2001]. Bootstrapping with 100 replicates was used to assess the stability of the branching in the pan-genome tree.
2.3.5. Functional mapping of Pan-genome and niche specific gene families
The pan-genome includes the core, accessory and unique genes. We extracted from each Prevotella strain the genes identified as core, accessory and singleton, and mapped them against the COG and KEGG databases using the Functional Analysis (COG/KEGG) module of BPGA to assign COG/KEGG functional IDs to them. We also identified the gene families that were present in Prevotella strains derived from a specific body niche and assigned functional COG/KEGG IDs to the representatives of these niche-specific gene families [Chaudhari et al. 2016].
2.3.6. Trends in codon usage in core and unique genes
Codon usage of the core and unique genes was calculated for all 28 Prevotella strains. We calculated the frequency of each triplet codon, excluding stop codons, for the core and unique genes of every Prevotella strain in the dataset. The codon usage distances between core/core and core/unique gene sets were also calculated according to

$$d_{i,j} = \sum_{t \,\in\, \text{all coding triplets}}^{n} \left(f_{t,i} - f_{t,j}\right)^2 \qquad \text{(vi)}$$

where $d_{i,j}$ is the codon usage distance between the core/unique gene sets of the $i$-th and $j$-th genomes, and $f_{t,i}$ and $f_{t,j}$ are the relative frequencies of codon $t$ in the respective genomes [Beck et al. 2012]. To quantify the codon usage differences between the core and unique gene sets of each strain, we used the codon usage ratio as follows

………………………. (vii)
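The relative codon frequencies and the summation over coding triplets of squared frequency differences used for the codon usage distance can be sketched as below. The names are hypothetical, the standard genetic code's three stop codons are assumed, and the distance is implemented as the plain sum of squared differences; whether the original formula additionally takes a square root (which would give the Euclidean distance) is not assumed here.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
CODING_TRIPLETS = [
    "".join(t) for t in product("ACGT", repeat=3)
    if "".join(t) not in STOP_CODONS
]
_CODING_SET = set(CODING_TRIPLETS)


def codon_frequencies(genes):
    """Relative frequency of each coding triplet over a set of in-frame CDSs."""
    counts = Counter()
    for seq in genes:
        seq = seq.upper()
        # Step through the sequence codon by codon; trailing bases are ignored.
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in _CODING_SET:
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no coding triplets found")
    return {t: counts[t] / total for t in CODING_TRIPLETS}


def codon_usage_distance(freq_i, freq_j):
    """Sum over all coding triplets of squared frequency differences."""
    return sum((freq_i[t] - freq_j[t]) ** 2 for t in CODING_TRIPLETS)


# Hypothetical two-codon gene sets standing in for core and unique genes.
toy_core_freq = codon_frequencies(["ATGAAA"])
toy_unique_freq = codon_frequencies(["ATGAAG"])
toy_distance = codon_usage_distance(toy_core_freq, toy_unique_freq)
```

In the toy example the two gene sets share ATG but differ in their lysine codon (AAA versus AAG), so two triplets each contribute a squared difference of 0.25.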

2.3.7. Test for positive selection between two P. denticola strains
Positive selection of a gene, protein or function in a species is one of the major driving forces behind adaptive evolution. A positively selected gene or trait must be beneficial and heritable. At the molecular level, selection occurs when a particular DNA variant becomes more common because of its beneficial effect on the organisms that carry it. Charles Darwin and Alfred Wallace (1858) proposed that positive selection could explain the many remarkable adaptations that suit organisms to their environments and lifestyles, and this simple process remains the central explanation for a major share of evolutionary adaptations today [Lefebure et al. 2009]. To estimate the dN (non-synonymous substitution) and dS (synonymous substitution) values for the core genes of P. denticola CRIS 18C-A and P. denticola F0289, which are adapted to different body niches (UGT and oral cavity, respectively), we performed pairwise nucleotide and amino acid sequence alignments between the core genes of the two strains. After estimation of dN and dS, the ratio dN/dS was calculated for each pair of core genes. Pairs with dN/dS > 1 (an excess of non-synonymous over synonymous substitutions) were considered to be under positive selection [Yang et al. 1998].
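The final filtering step, flagging gene pairs with dN/dS > 1, can be sketched as follows. The sketch assumes that dN and dS have already been estimated by a dedicated tool; the function name, the gene identifiers and the values are hypothetical, and skipping pairs with dS = 0 (where the ratio is undefined) is a choice made here for illustration.

```python
def positively_selected(estimates, threshold=1.0):
    """Return {gene: dN/dS} for gene pairs whose ratio exceeds the threshold.

    `estimates` maps a gene identifier to a (dN, dS) tuple. Pairs with
    dS <= 0 are skipped because the ratio is undefined for them.
    """
    flagged = {}
    for gene, (dn, ds) in estimates.items():
        if ds <= 0:
            continue
        ratio = dn / ds
        if ratio > threshold:
            flagged[gene] = ratio
    return flagged


# Hypothetical (dN, dS) estimates for three core gene pairs.
toy_estimates = {
    "core_gene_1": (0.02, 0.01),   # ratio 2.0, flagged
    "core_gene_2": (0.01, 0.05),   # ratio 0.2, not flagged
    "core_gene_3": (0.03, 0.0),    # dS = 0, ratio undefined, skipped
}
toy_flagged = positively_selected(toy_estimates)
```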


2.4. Results
2.4.1. Variation in size of homologous gene clusters
All protein-coding genes were clustered using BLASTP in the CD-HIT suite, based on 50% sequence identity and 50% coverage length. The 73864 protein-coding genes from the 28 Prevotella strains were distributed into 24885 homologous gene clusters. The clusters varied in size from 1 to 84 genes, counting orthologs as well as paralogs. Most of the clusters (16488) contain only one gene, from a single Prevotella strain. Very few clusters (60) contain more than 28 genes (Fig. 2.1).

[Figure: Gene Distribution in Clusters; x-axis: Size of the clusters; y-axis: Number of Clusters]
Figure 2.1: Gene distribution pattern among homologous clusters. The size of a cluster is the number of genes (orthologs and paralogs) present in it.

2.4.2. Orthologous Gene Families - distribution into the core, flexible, and singleton genes All 28 Prevotella strains belonged to 25 Prevotella species derived from different body sites such as oral cavity, GIT, UGT and skin as a component of human microbiome are listed in table 2.1. Only one strain of each of the twenty two Prevotella species are included in our dataset, while three species namely P. buccae, P. denticola and P. Page | 32

Chapter 2: Divergence in gene repertoire of Prevotella genomes melaninogenica, have two strains each. Different strains of the same species were isolated from same body niche except strains of P. denticola. One of the P. denticola strains colonized at UGT and other at oral cavity in the human. Each strain of Prevotella contains the both protein coding and RNA genes, but we considered only protein coding genes for pan-genome analysis. Total genome sizes and the number of predicted protein coding sequences in dataset vary between species or strains, ranged from 2.42 to 4.1 Mb and 1935 to 3337, respectively (Table 2.1). Niche specific variations in genome size and total number of protein coding genes were observed. Interestingly, the genome sizes and the number of CDSs of seven Prevotella strains derived from the urogenital tracts (UGT) are, in general, lower than those of three GIT isolates, while the genomes derived from the oral cavity vary substantially in their genome sizes as well as in the number of predicted CDSs. The average genomic G+C content also varies widely across the dataset – from 38.7% in Prevotella sp. oral taxon 299 str. F0039 to 52.2% in another oral cavity derivative P. buccae ATCC 33574 (Noel et al. 2010). There are considerable intra-species variations in genome sizes (3.28/3.36, 2.94/3.18 & 3.17/3.29) and number of CDSs (2896/2617, 2386/2701 & 2296/2461) across two strains of P. buccae, P. denticola and P. melaninogenica respectively (Table 2.1). Amino acid sequences of total 73864 annotated complete CDSs were extracted from all 28 Prevotella genomes. We used protein sequences instead nucleotide sequences for homologous clustering to reduce the false negative results due to degeneracy in codon triplet. CD-HIT algorithm clustered the all 73864 amino acid sequences using a 50% cutoff for protein similarity and 50% sequence coverage, into 24885 distinct clusters of orthologous genes (gene families). 
These 24885 orthologous gene families were divided into two categories based on the source strains in the dataset.
a. Core genes: gene families that contain genes from all 28 Prevotella genomes under study. These genes are considered essential for the survival of Prevotella strains and for their association with the human host.
b. Variable genes: also known as dispensable or flexible genes, these are present in some but not all of the Prevotella genomes under study. Variable genes are further classified into two categories: (i) "singleton" or "unique" genes and (ii) "accessory" genes.

Among the 24885 gene families, 456 (~1.8%) contain genes from all twenty-eight strains and hence represent the core gene complement of the genus Prevotella. These core gene families are extensively conserved across the Prevotella genomes and are presumably needed to perform essential functions for survival at distinct sites of the human body, where diverse metabolic and physiological factors are encountered (Fig. 2.2). The number of predicted protein-coding genes varies across the individual Prevotella genomes of PGDHM (2638 ± 700), and on average the conserved core genes account for 17% of the genes in each genome. The accessory genes, found in more than one but not all Prevotella genomes of the dataset, comprise 7263 gene families (~29% of the pan-genome) (Fig. 2.2).
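The partition into core, accessory and singleton families can be illustrated with a short sketch. It assumes the clustering output has already been reduced to a mapping from gene-family id to the set of genomes carrying that family; the function name, the toy genomes (g1 to g4) and the family ids are hypothetical and are not taken from the thesis pipeline.

```python
from collections import Counter

def classify_families(family_genomes, n_genomes):
    """Label each gene family as core, accessory, or singleton.

    family_genomes maps a family id to the set of genome names that carry it.
    """
    labels = {}
    for family, genomes in family_genomes.items():
        k = len(genomes)
        if k == n_genomes:
            labels[family] = "core"        # present in every genome
        elif k == 1:
            labels[family] = "singleton"   # present in exactly one genome
        else:
            labels[family] = "accessory"   # present in some, but not all
    return labels

# Toy input (hypothetical): 4 genomes, 5 gene families.
toy = {
    "fam1": {"g1", "g2", "g3", "g4"},
    "fam2": {"g1", "g2"},
    "fam3": {"g3"},
    "fam4": {"g2", "g3", "g4"},
    "fam5": {"g4"},
}
labels = classify_families(toy, n_genomes=4)
counts = Counter(labels.values())
print(counts)
```

Applied to the real pan-matrix with n_genomes=28, the three counts would correspond to the 456, 7263 and 17166 families reported above.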

[Figure 2.2: bar chart. x-axis: Number of Genomes; y-axis: No. of orthologous gene families.]

Figure 2.2 - The gene family frequency spectrum for 28 Prevotella genomes. Bars represent the number of orthologous gene families belonging to singletons (17166), flexible genome (7263) and core genome (456).

Table 2.2: Strain wise pan-genome composition

| Sr. No. | Name of organism | Body Niche | Core Genes (%) | Accessory Genes (%) | Unique Genes (%) |
|---|---|---|---|---|---|
| 1 | P. copri DSM 18205 | GIT | 14 | 41 | 45 |
| 2 | P. salivae DSM 15606 | GIT | 16 | 62 | 22 |
| 3 | P. stercorea DSM 18206 | GIT | 15 | 39 | 46 |
| 4 | P. buccae ATCC 33574 | Oral | 16 | 66 | 18 |
| 5 | P. buccae D17 | Oral | 17 | 73 | 10 |
| 6 | P. denticola F0289 | Oral | 19 | 73 | 8 |
| 7 | P. marshii DSM 16973 | Oral | 20 | 52 | 28 |
| 8 | P. melaninogenica ATCC 25845 | Oral | 20 | 73 | 7 |
| 9 | P. melaninogenica D18 | Oral | 19 | 72 | 9 |
| 10 | P. multiformis DSM 16608 | Oral | 16 | 61 | 23 |
| 11 | P. nigrescens ATCC 33563 | Oral | 19 | 63 | 18 |
| 12 | P. oris F0302 | Oral | 14 | 61 | 25 |
| 13 | P. oulorum F0390 | Oral | 18 | 57 | 25 |
| 14 | P. pallens ATCC 700821 | Oral | 16 | 61 | 23 |
| 15 | P. sp. oral taxon 299 str. F0039 | Oral | 24 | 49 | 27 |
| 16 | P. sp. oral taxon 306 str. F0472 | Oral | 19 | 72 | 9 |
| 17 | P. sp. oral taxon 317 str. F0108 | Oral | 16 | 64 | 20 |
| 18 | P. sp. oral taxon 472 str. F0295 | Oral | 15 | 58 | 27 |
| 19 | P. tannerae ATCC 51259 | Oral | 16 | 27 | 57 |
| 20 | P. veroralis F0319 | Oral | 15 | 63 | 22 |
| 21 | P. bergensis DSM 17361 | Skin | 16 | 51 | 33 |
| 22 | P. amnii CRIS 21A-A | UGT | 23 | 62 | 15 |
| 23 | P. bivia JCVIHMP010 | UGT | 22 | 63 | 15 |
| 24 | P. buccalis ATCC 35310 | UGT | 19 | 61 | 20 |
| 25 | P. denticola CRIS 18C-A | UGT | 17 | 74 | 9 |
| 26 | P. disiens FB035-09AN | UGT | 17 | 63 | 20 |
| 27 | P. oralis ATCC 33269 | UGT | 18 | 53 | 29 |
| 28 | P. timonensis CRIS 5C-B1 | UGT | 21 | 61 | 18 |

A very large proportion of the total gene repertoire in the pan-genome of the genus Prevotella (~69%, 17166 gene families) is present in only one genome (Fig. 2.2). The proportion of these unique genes, or singletons, varies considerably across Prevotella strains (Table 2.2). In the oral cavity isolate P. tannerae, 57% of the annotated CDSs are singletons with no orthologs in the other strains of the dataset under the 50% similarity and 50% coverage criteria. Two GIT isolates, P. copri and P. stercorea, also show a high percentage of predicted CDSs in the unique gene category (45% and 46%, respectively), while in the sole skin isolate, P. bergensis, 33% of the CDSs are unique genes. The percentage of unique genes is low in P. buccae, P. denticola and P. melaninogenica, because the two strains of each species share their species-specific genes. Unique genes are also relatively few (at most 20%) in all UGT isolates except P. oralis ATCC 33269 (Table 2.2). Among the three species with two strains each, P. denticola and P. melaninogenica contain the lowest proportions of unique genes (7-9%), whereas the two P. buccae strains have more (10% and 18%). The two P. denticola strains, which colonize distinct body sites (oral cavity and UGT), possess a proportion of unique genes similar to that of the two P. melaninogenica strains, both of which colonize the oral cavity.

2.4.3. Trends in expansion of pan genome and contraction of core genome in Prevotella genus

To explore the expansion of the Prevotella pan-genome with sequential addition of genomes to the dataset, we calculated the pan-genome size at each step of genome addition (228). The calculated pan-genome sizes were plotted against the number of genomes included in the dataset. Similarly, we calculated the number of shared gene families at each step in order to portray the reduction in core genome size as more genomes are added to the study. To avoid any bias from the order of genome addition, we considered 50 random combinations and permutations at each step of genome addition when calculating the pan-genome and core genome of Prevotella. To plot the pan-genome and core genome curves, the median pan-genome or core genome size was taken at each step.

A power-law regression model was applied to extrapolate the Prevotella pan-genome. The pan-genome curve continues to increase after addition of the 28th Prevotella genome and has yet to reach a plateau (Fig. 2.3).
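The resampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: each genome is represented as a set of gene-family ids, 50 random orderings are drawn with a fixed seed, and the median pan-genome and core genome sizes are taken at every step. The function name and the toy genomes are hypothetical.

```python
import random
from statistics import median

def accumulation_curves(genome_families, n_orders=50, seed=0):
    """Median pan- and core-genome sizes after adding 1..N genomes.

    genome_families maps a genome name to the set of gene-family ids it carries.
    """
    rng = random.Random(seed)
    names = sorted(genome_families)
    n = len(names)
    pan_sizes = [[] for _ in range(n)]
    core_sizes = [[] for _ in range(n)]
    for _ in range(n_orders):
        order = names[:]
        rng.shuffle(order)                 # one random order of genome addition
        pan, core = set(), None
        for step, name in enumerate(order):
            fams = genome_families[name]
            pan |= fams                    # union grows: pan-genome
            core = set(fams) if core is None else core & fams  # intersection shrinks
            pan_sizes[step].append(len(pan))
            core_sizes[step].append(len(core))
    return [median(s) for s in pan_sizes], [median(s) for s in core_sizes]

# Toy input (hypothetical): three genomes with overlapping gene families.
toy = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "d"},
    "g3": {"a", "e"},
}
pan_curve, core_curve = accumulation_curves(toy, n_orders=50)
print(pan_curve, core_curve)
```

The final point of each curve is independent of the ordering, since the union and intersection over all genomes do not depend on the order in which they are taken; only the intermediate points vary between permutations.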

[Figure 2.3: line plot. x-axis: Number of Genomes; y-axis: Number of orthologous gene families; series: Pangenome, Core genome.]

Figure 2.3: Pan and core genome analysis curve for 28 Prevotella genomes. The number of shared genes is plotted (sky blue) as a function of the number of Prevotella genomes sequentially considered. The size of the Prevotella pan-genome is plotted (violet) as a function of the number of Prevotella genomes sequentially considered. The dots represent the random combinations of genomes used to calculate the pan-genome and core genome sizes.

[Equation (viii): power-law regression fitted to the pan-genome curve. The expression did not survive text extraction.]

We also calculated the number of new genes added to the Prevotella pan-genome with each sequential genome addition, by subtracting the pan-genome size after N genomes from that after N+1 genomes (Fig. 2.4). On average, each additional Prevotella genome added 850 new genes to the pan-genome. In accordance with this observation, the power-law regression (eq. viii) shows that the Prevotella pan-genome is indeed "open", with a γ-parameter of 0.7 (here, Bpan). An open pan-genome implies that many more Prevotella strains need to be sequenced to characterize the Prevotella pan-genome fully.
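As an illustration of how an exponent such as Bpan can be estimated, the sketch below fits P(N) = A * N^B by ordinary least squares on log-transformed values and computes the per-genome gain in gene families. Both the log-space fit and the form of the model are assumptions made for this example; the regression procedure actually used in the study is described in the Methods and may differ. The input curve is synthetic and is not the thesis data.

```python
import math

def fit_power_law(sizes):
    """Least-squares fit of log(P) = log(A) + B * log(N); returns (A, B).

    sizes[i] is the pan-genome size after adding i + 1 genomes.
    """
    xs = [math.log(n) for n in range(1, len(sizes) + 1)]
    ys = [math.log(p) for p in sizes]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                  # slope in log-log space = exponent B
    a = math.exp(my - b * mx)      # intercept back-transformed = prefactor A
    return a, b

def new_families(sizes):
    """New gene families contributed by each genome after the first."""
    return [later - earlier for earlier, later in zip(sizes, sizes[1:])]

# Synthetic curve generated from P(N) = 2600 * N**0.7 (illustrative values only).
synthetic = [2600 * n ** 0.7 for n in range(1, 29)]
a_pan, b_pan = fit_power_law(synthetic)
gains = new_families(synthetic)
print(round(a_pan), round(b_pan, 2))
```

An exponent between 0 and 1 means the curve keeps rising while each new genome contributes fewer families than the last, which is the usual reading of an open pan-genome.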

[Figure 2.4: plot. x-axis: Number of Genomes; y-axis: Number of New Gene Families.]

Figure 2.4: New gene family distribution among Prevotella pangenome. Addition of new gene families into the pangenome with sequential addition of Prevotella genomes into the analysis.

We also estimated the number of gene families shared by the genomes included at each step of sequential addition, for each combination and permutation (Fig. 2.3). An exponential curve-fit model was applied to the median core genome sizes to extrapolate the core genome curve (eq. ix).

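[Equation (ix): exponential curve fitted to the core genome curve. The expression did not survive text extraction.]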

As expected, the number of gene families shared among the genomes under study at a particular step of genome addition gradually decreases with the inclusion of each new genome. The core genome curve also had not reached a plateau after addition of all 28 Prevotella strains, which indicates continuing contraction of the Prevotella core genome. The availability of new Prevotella sequences may therefore result in further contraction of the Prevotella core genome. On average, 54 shared gene families were lost from the core genome with each additional Prevotella genome (Fig. 2.4).

2.4.4. Niche specific variation in expansion/contraction of pan/core genome

Our dataset comprises 17 Prevotella genomes isolated from the oral cavity, 7 from the urogenital tract, 3 from the gut and 1 from the skin of the human body, all as components of the human microbiome (Table 2.1). To examine whether the trends in pan-genome expansion and/or core genome contraction differ across Prevotella genomes isolated from distinct body sites, we did not consider permutations or combinations of the genomes in this analysis, as that would obscure the progression across niche-specific subgroups of the genus. Instead, genomes isolated from a specific body niche were added consecutively (Fig. 2.5).

Figure 2.5: Pan and core genome progress with addition of niche specific Prevotella genomes. The plot shows progression of core and pan-genome after sequential addition of Prevotella genomes into the analysis as per their body niche. The color bars represent the number of new gene families added into the Prevotella pan-genome. The species names are colored according to their niches.

First, we included the Prevotella strains isolated from the GIT, followed by those from the oral cavity, skin and UGT, in the niche-specific pan and core genome characterization (Fig. 2.5). The GIT-derived strains added a substantial number of new genes to the pan-genome, but subsequent addition of genomes derived from other body sites did not cause any drastic change in the trends of pan-genome expansion or core genome reduction. The exception was the oral isolate P. tannerae, inclusion of which led to a sharp decrease in the number of shared genes and added an appreciable number of new genes to the Prevotella gene pool (Fig. 2.5). These observations agree with the finding that P. tannerae contains the highest percentage of unique genes, followed by the two GIT isolates P. copri and P. stercorea (Table 2.2).

[Figure 2.6: line plot. x-axis: Number of Genomes; y-axis: Pan-genome Size; series: PAN_OSUG, PAN_GUSO, PAN_SOUG, PAN_SGUO, PAN_USGO, CORE_OSUG, CORE_GUSO, CORE_SOUG, CORE_SGUO, CORE_USGO.]

Figure 2.6: Trends of core and pan-genome curves with niche wise variation in order of genomes. Variation in the shape of the pan and core genome curves when Prevotella genomes are added to the analysis in different orders based on body niche. Letters indicate niches [G: GIT (3), O: ORAL Cavity (17), S: SKIN (1) and U: UGT (7)].

The shape of the pan and core genome curves depends on the order in which genomes are added to the dataset: it differs for different orderings of genomes or niches, although the final pan-genome and core genome sizes remain the same for any order. The trends observed when GIT-derived genomes were added first, followed by oral cavity, skin and UGT-derived Prevotella strains (Fig. 2.5), persisted under inter-niche and intra-niche variations in the ordering of Prevotella strains (Fig. 2.6). For example, in all cases the size of the pan-genome increased noticeably with inclusion of the GIT isolates (Fig. 2.6). A considerable increase in pan-genome size and decrease in core gene numbers upon addition of P. tannerae is also apparent in all random permutations of genome ordering (Fig. 2.6).

2.4.5. Exclusive presence or absence of gene families in genomes derived from specific body sites of human hosts

We performed an analysis to identify the gene families exclusively present in, or exclusively absent from, the Prevotella strains colonizing a specific body niche. A niche-specific binary (1/0) presence/absence matrix was extracted from the pan-matrix of all 28 Prevotella genomes. In total, 19753 orthologous gene families are associated with Prevotella strains isolated from one specific body niche and absent from strains adapted to other body sites. These niche-specific gene families belong to the accessory and unique genomes of the Prevotella pan-genome, except the skin-specific gene families, which derive from the unique genome only because the dataset contains a single skin isolate. Figure 2.7 shows that 11798, 3673, 3348 and 934 orthologous gene families are exclusively present in the Prevotella strains derived from the oral cavity, GIT, UGT and skin, respectively (Table 2.3, Figure 2.7). It is not surprising to find the largest number of habitat-specific gene families (11798) in oral isolates, since 17 of the 28 genomes in our study came from the oral cavity. However, the numbers of GIT-specific and UGT-specific gene families do not differ appreciably, even though there are more than twice as many UGT isolates (7) as GIT isolates (3). More interesting still, 998 gene families are exclusively absent from genomes derived from one specific body site but present in all isolates from the other habitats (Figure 2.7).
This observation indicates that the microbiome components might have undergone not only niche-specific gain of functions but also niche-specific loss of functions, which might have facilitated adaptation of the respective strains to specific body sites. Alternatively, cooperation between the microbial communities colonizing a niche might drive the loss of some functions (gene families) from strains adapted to that niche.
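The two definitions used in this section can be made concrete with a small sketch over a presence/absence mapping. It assumes that a family counts as "exclusively present" in a niche when all of its carriers belong to that niche, and as "exclusively absent" when it is carried by every genome outside the niche and by none inside it. The function name, genome names and family ids are hypothetical.

```python
def niche_exclusive_families(family_genomes, genome_niche):
    """Find gene families exclusively present in, or exclusively absent from, one niche.

    family_genomes maps a family id to the set of genomes carrying it;
    genome_niche maps each genome name to its body niche.
    """
    niches = {}
    for genome, niche in genome_niche.items():
        niches.setdefault(niche, set()).add(genome)
    all_genomes = set(genome_niche)
    present_only = {niche: [] for niche in niches}
    absent_only = {niche: [] for niche in niches}
    for family, carriers in family_genomes.items():
        for niche, members in niches.items():
            if carriers <= members:
                # Carried only by genomes of this niche (one or more of them).
                present_only[niche].append(family)
            elif carriers == all_genomes - members:
                # Carried by every genome outside this niche and none inside it.
                absent_only[niche].append(family)
    return present_only, absent_only

# Toy input (hypothetical genomes and families).
genome_niche = {"gut1": "GIT", "gut2": "GIT", "oral1": "ORAL",
                "oral2": "ORAL", "skin1": "SKIN"}
family_genomes = {
    "f1": {"gut1"},
    "f2": {"gut1", "gut2"},
    "f3": {"oral1", "oral2", "skin1"},
    "f4": {"gut1", "gut2", "oral1", "oral2", "skin1"},
    "f5": {"gut1", "oral1"},
    "f6": {"skin1"},
    "f7": {"gut1", "gut2", "oral1", "oral2"},
}
present_only, absent_only = niche_exclusive_families(family_genomes, genome_niche)
print(present_only, absent_only)
```

With a single-genome niche such as skin, every family exclusive to that niche is necessarily a singleton, which matches the remark above that the skin-specific families come from the unique genome only.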


Figure 2.7: Distribution of dispensable genes among 28 Prevotella strains. Colored cells indicate presence of genes in the respective Prevotella strain and orthologous gene family, while uncolored cells indicate absence of genes. The species names and cells are colored according to their niches - green: GIT, red: ORAL Cavity, purple: SKIN and blue: UGT. Dark cell colors represent flexible genome and light cell colors represent singletons.

The zoomed portion of Figure 2.7 shows 221 gene families that are present in all strains derived from the oral cavity, skin and UGT but not in those derived from the GIT. Similarly, 17, 115 and 645 gene families are exclusively absent from the oral cavity, UGT and skin isolates of the genus Prevotella, respectively. This observation suggests that adaptation to any specific niche within the human body might require both the gain and the loss of specific genes or functions in the microbiota.

We also identified the gene families shared by all genomes isolated from a specific body site. The numbers of such core gene families in the GIT, oral cavity and UGT derived Prevotella strains are 927, 513 and 808, respectively, including the 456 gene families shared by all 28 Prevotella genomes of the dataset. The total numbers of gene families (i.e. pan-genomes) in the GIT, oral cavity and UGT subsets of the dataset are 6431, 16461 and 7203, respectively (Table 2.3).

Table 2.3: Niche specific Prevotella pan-genome

| Niche | No. of genomes | Orthologous clusters (Pan genome size) | Core clusters | Niche specific clusters | Exclusively absent clusters |
|---|---|---|---|---|---|
| GIT | 3 | 6431 | 927 | 3673 | 221 |
| ORAL | 17 | 16461 | 513 | 11798 | 17 |
| SKIN | 1 | - | - | 934 | 645 |
| UGT | 7 | 7203 | 808 | 3348 | 115 |

GIT: Gastrointestinal Tract, Oral: Oral Cavity, UGT: Urogenital tract.

We also identified the gene families exclusively absent from a single genome but present in all the remaining 27 Prevotella strains. In total, 187 such genes were identified: 10, 2, 11 and 164 in the GIT, skin, UGT and oral cavity isolates of Prevotella, respectively (Table 2.4). Of these 187 genes, 27 are hypothetical.
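The per-strain version of this query can be sketched in the same way: a family is reported for a genome when that genome is the only one in the dataset that lacks it. This is an illustrative sketch over a hypothetical presence/absence mapping, not the code used to produce Table 2.4.

```python
def missing_from_one_genome(family_genomes, all_genomes):
    """Map each genome to the families found in every other genome but not in it."""
    all_genomes = set(all_genomes)
    result = {genome: [] for genome in sorted(all_genomes)}
    for family, carriers in family_genomes.items():
        missing = all_genomes - carriers
        if len(missing) == 1:
            # Exactly one genome lacks this family.
            result[next(iter(missing))].append(family)
    return result

# Toy input (hypothetical): four genomes, four families.
genomes = ["g1", "g2", "g3", "g4"]
families = {
    "fA": {"g1", "g2", "g3"},        # missing only from g4
    "fB": {"g1", "g2", "g3", "g4"},  # core, missing from none
    "fC": {"g2", "g3", "g4"},        # missing only from g1
    "fD": {"g1", "g2"},              # missing from two genomes, not reported
}
absent = missing_from_one_genome(families, genomes)
print(absent)
```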

Table 2.4: Details of genes exclusively absent from each Prevotella strain

Niche GIT GIT GIT GIT GIT GIT GIT GIT GIT SKIN SKIN UGT

Organism name P.copri DSM 18205 P.stercorea DSM 18206 P.stercorea DSM 18206 P.stercorea DSM 18206 P.stercorea DSM 18206 P.stercorea DSM 18206 P.stercorea DSM 18206 P.stercorea DSM 18206 P.stercorea DSM 18206 P.bergensis DSM 17361 P.bergensis DSM 17361 P.amnii CRIS 21A-A

Protein Id EFC74560.1 EFM00833.1

Gene Description phosphopyruvate hydratase chaperone protein ClpB

COG Id COG0148 G COG0542 O

EEX52238.1

arabinose 5-phosphate isomerase

COG0794 M

EFZ37143.1

undecaprenyl-diphosphatase UppP

COG1968 V

EFB93670.1

rubredoxin

COG1773 C

EGC86300.1

3-deoxy-8-phosphooctulonate synthase

COG2877 M

AEA20996.1

GDP-mannose 4,6-dehydratase

COG1089 M

EGC86225.1

3-deoxy-D-manno-octulosonate cytidylyltransferase GDP-L-fucose synthase

COG1212 M

P.amnii CRIS 21A-A P.buccalis ATCC 35310 P.buccalis ATCC 35310 P.buccalis ATCC 35310 P.disiens FB03509AN P.oralis ATCC 33269

EID33434.1 EFB33325.1

EFB32658.1 EHJ42052.1 EFB92986.1

COG0034 F COG1970 M

EFB93929.1

large conductance mechanosensitive channel protein IgA Peptidase M64 3-deoxy-manno-octulosonate-8phosphatase ribosomal protein L30

EFB32494.1

ribosomal protein L27

COG0211 J

EEX53829.1

glycine-tRNA ligase

COG0423 J

EFN92164.1

firmicute fructose-1,6-bisphosphatase

COG3855 G

EFB34908.1

ORAL

P.buccae D17

EFA98590.1

ORAL

P.buccae D17

EFU30246.1

ORAL ORAL ORAL ORAL ORAL

P.buccae D17 P.buccae D17 P.buccae D17 P.buccae D17 P.buccae D17

ADK96346.1 EGQ17448.1 EFB93293.1 EFB33331.1 EGC85472.1

2-oxoglutarate oxidoreductase, beta subunit 2-oxoglutarate ferredoxin oxidoreductase subunit gamma branched-chain-amino-acid transaminase 2-oxoglutarate ferredoxin oxidoreductase subunit alpha cytochrome D ubiquinol oxidase, subunit II NADH:ubiquinone oxidoreductase, Na(+)-translocating, A subunit DNA polymerase III, gamma/tau subunit DnaX adenylate kinase ATP-dependent protease LonB ribosomal protein L20 GTP-binding protein LepA phosphotransferase enzyme family

COG1013 C

ORAL

P.timonensis CRIS 5C-B1 P.timonensis CRIS 5C-B1 P.timonensis CRIS 5C-B1 P.timonensis CRIS 5C-B1 P.buccae D17

ORAL

P.buccae D17

EGC86374.1

citrate transporter

COG1055 P

UGT UGT UGT UGT UGT UGT UGT UGT UGT UGT

EFU29670.1

pyridine nucleotide-disulfide oxidoreductase family protein class II glutamine amidotransferase

COG0451 MG COG2509 R

EFU31583.1 EFB33144.1 EFU31581.1 EFC69904.1

COG1778 R COG1841 J

COG1014 C COG0115 EH COG0674 C COG1294 C COG1726 C COG2812 L COG0563 F COG0466 O COG0292 J COG0481 M COG1660 R


Niche ORAL ORAL ORAL ORAL

Organism name P.marshii DSM 16973

P.oris F0302

Protein Id EGQ11584.1 EFN90746.1 EFB32999.1 EFZ36112.1

Gene Description rubrerythrin CTP synthase DNA polymerase III, alpha subunit excision endonuclease subunit UvrA

COG Id COG1592 C COG0504 F COG0587 L COG0178 L

ORAL ORAL

P.oulorum F0390 P.oulorum F0390

EFA97784.1 EFB34185.1

peptidase C1-like family protein D-phosphoglycerate dehydrogenase

COG3579 E COG0111 HE

ORAL ORAL

P.oulorum F0390 P.oulorum F0390

EFM02180.1 EFB92161.1

conserved hypothetical protein putative phosphoserine transaminase

COG4198 S COG1932 HE

ORAL ORAL

P.oulorum F0390 P.sp. oral taxon 317 str. F0108 P.sp. oral taxon 317 str. F0108 P.sp. oral taxon 472 str. F0295 P.sp. oral taxon 472 str. F0295 P.sp. oral taxon 472 str. F0295 P.sp. oral taxon 472 str. F0295 P.sp. oral taxon 299 str. F0039 P.sp. oral taxon 299 str. F0039 P.sp. oral taxon 299 str. F0039 P.sp. oral taxon 299 str. F0039 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259

EFC74778.1 EFU31314.1

COG0334 E COG0304 IQ

EFB33724.1

glutamate dehydrogenase 3-oxoacyl-(acyl-carrier-protein) synthase II maltodextrin phosphorylase

EFA97313.1

ABC transporter, ATP-binding protein

COG1137 R

EFA91775.1

carboxynorspermidine decarboxylase

COG0019 E

EFZ38106.1

fumarate reductase subunit B

COG0479 C

EFC74986.1

alpha-glucosidase

-

EFA97265.1

ribosomal protein L17

COG0203 J

AEA21872.1

S-ribosylhomocysteinase LuxS

COG1854 T

EGQ18946.1

50S ribosomal protein L31

COG0254 J

EFC67472.1

conserved hypothetical protein

-

EFA98549.1

COG1575 H

EFC75848.1

1,4-dihydroxy-2-naphthoate octaprenyltransferase 3'-5' exonuclease domain protein

EFU30634.1

5'/3'-nucleotidase SurE

COG0496 R

EGQ14933.1

50S ribosomal protein L29

-

EFM02034.1

50S ribosomal protein L9

COG0359 J

EFC76794.1

acetyltransferase

COG0456 R

EGV30218.1

adenylate cyclase

COG2954 S

EFV05134.1

aminodeoxychorismate lyase

COG1559 R

EFA98345.1

ApbE family protein

COG1477 H

EGQ23346.1

arginine repressor

COG1438 K

EFU29992.1

ATP-dependent Clp protease ATPbinding subunit ClpC band 7/Mec-2 family protein

COG0542 O

ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL

P.melaninogenica D18 P.melaninogenica D18

EFU30366.1

COG0058 G

COG0349 J

COG0330 O


Niche ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL

Organism name P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259

Protein Id EFL47097.1

Gene Description biotin-requiring enzyme

COG Id COG4770 I

EFC74124.1

carboxyl- protease

COG0793 M

EEX52666.1

cardiolipin synthase

COG1502 I

EGQ16917.1

cardiolipin synthetase

COG1502 I

EFC74084.1

cell division protein FtsA

COG0849 D

EFV05759.1

COG0664 T

EGQ21498.1

cyclic nucleotide-binding domain protein cytidine deaminase

EFA97739.1

deoxyribose-phosphate aldolase

COG0274 F

EGQ17357.1

dihydrodipicolinate synthase

COG0329 EM

EID33252.1

DNA polymerase III, delta subunit

COG0470 L

EFA92488.1

DNA polymerase III, delta subunit

COG1466 L

EGC20754.1

DNA primase

COG0358 L

EFU31466.1

DNA processing protein DprA

COG0758 LU

EFB36474.1

DNA repair protein RecN

COG0497 L

EEX51824.1

COG0592 L

EFV05431.1

DNA-directed DNA polymerase III beta subunit efflux ABC transporter, permease protein efflux transporter, RND family, MFP subunit endonuclease/exonuclease/phosphatase family protein endonuclease/exonuclease/phosphatase family protein exodeoxyribonuclease VII, large subunit Fe-S oxidoreductase

AEA22275.1

FHA domain protein

-

EFA97916.1

gliding motility-associated protein GldE glutamine amidotransferase subunit PdxT glutamine cyclotransferase-related protein glutathione peroxidase

COG1253 R

EFA98000.1 EGC86910.1 EGC85489.1 EGQ18182.1 EFA91710.1

EFC74972.1 EFB35041.1 EFC76889.1

COG0295 F

COG2177 D COG1566 V COG2374 R COG1570 L COG0621 J

COG0311 H COG2234 R COG0386 O


Niche ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL

Organism name P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259

Protein Id EFB33431.1

Gene Description glycosyl hydrolase, family 25

COG Id COG3757 M

EFB32745.1

glycosyl transferase, group 1 family

COG0438 M

EEX54059.1

group 2 glycosyl transferase

COG0463 M

EGQ17815.1

GSCFA family protein

-

EFA91757.1

HD domain protein

COG1418 R

EFB31567.1

HDIG domain protein

COG1480 R

EFC73587.1

HimA protein

COG0776 L

EFL45536.1

hydrolase, TatD family

COG0084 L

EFB33339.1

inner membrane protein OxaA

COG0706 U

EFC75811.1

COG4139 H

EFA91929.1

iron compound ABC transporter, permease protein lipid kinase, YegS/BmrU family

EFC73130.1

lipoate-protein ligase B

COG1235 R

EFC72943.1

lipoprotein

COG1196 D

EFC75823.1

lipoprotein

COG3147 S

EFB30907.1

COG4948 MR

EFC72248.1

mandelate racemase/muconate lactonizing enzyme family protein mannosyl-glycoprotein endo-beta-Nacetylglucosaminidase membrane protein

EFC75286.1

membrane protein

COG4591 M

EFC75604.1

membrane protein

COG0628 R

EFC76492.1

metallo-beta-lactamase family protein

COG0491 R

EFZ37246.1

MOP/MATE family multidrugresistance efflux pump MORN repeat protein

COG0534 V

MotA/TolQ/ExbB proton channel family protein multidrug resistance protein, FusA/NodT family Na(+)-translocating NADH-quinone reductase subunit C Na/Pi cotransporter family protein

COG0811 U

EFN92151.1

EID33320.1 EFC74704.1 EFB33091.1 EFZ35886.1 EEX18935.1

COG1597 IR

COG1705 NU COG0392 S

COG4642 S

COG1538 MU COG2869 C COG1283 P

Page | 47

Chapter 2: Divergence in gene repertoire of Prevotella genomes Niche ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL

Organism name P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259

Protein Id EFB31922.1 EGQ12105.1

Gene Description NAD dependent epimerase/reductaserelated protein NADH dehydrogenase subunit J

COG Id COG0451 MG COG0839 C

EGQ16369.1

naphthoate synthase

COG0447 H

EFB34324.1

octaprenyl-diphosphate synthase

COG0142 H

EFC74994.1

O-methyltransferase family protein

COG4122 R

EFB92717.1

COG4105 R

EFB93480.1

outer membrane assembly lipoprotein YfiO outer membrane protein, OMP85 family PAP2 family protein

EFU30040.1

patatin family phospholipase

COG4667 R

EFA92212.1

COG0768 M

EFC76200.1

penicillin-binding protein, transpeptidase domain protein peptidase, M16 family

EFB93678.1

peptidase, M23 family

COG0739 M

EFC75739.1

peptidyl-prolyl cis-trans isomerase

-

AEA20066.1

permease, YjgP/YjgQ family

COG0795 R

EFC74938.1

COG2050 Q

EFB93212.1

phenylacetic acid degradation-related protein polysaccharide export protein, BexD/CtrA/VexA family potassium uptake protein, TrkH family

EGQ15145.1

preprotein translocase

COG1862 U

EGQ15221.1

primosome assembly protein PriA

COG1198 L

EFV04290.1

COG0682 M

EFA97696.1

prolipoprotein diacylglyceryl transferase protein-export membrane protein SecD

EFB31615.1

putative cell division protein FtsQ

-

AEA20944.1

putative membrane protein

-

EFB92356.1

COG1663 M

EFC76991.1

putative tetraacyldisaccharide-1-P 4'kinase putative TPR domain protein

EGQ21579.1

pyridoxal biosynthesis lyase PdxS

COG0214 H

EFB32498.1

EFB32504.1

COG4775 M -

COG0612 R

COG1596 M COG0168 P

COG0342 U

-


Niche ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL ORAL

Organism name P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259 P.tannerae ATCC 51259

Protein Id EFB32542.1

Gene Description RecF protein

COG Id COG1195 L

EID33506.1

rare lipoprotein B family protein

-

EFB93251.1

RNA methyltransferase, RsmE family

COG1385 S

EFA96944.1

RNA polymerase sigma-54 factor

COG1508 K

EFB92316.1

rod shape-determining protein MreC

COG1792 M

EFU29995.1

septum formation protein Maf

COG0424 D

ADK95980.1

Ser/Thr phosphatase family protein

COG0737 F

EFC75794.1

shikimate dehydrogenase

COG0169 E

EFZ38209.1

shikimate kinase

COG0703 E

ADK95333.1

sigma-54 interaction domain protein

COG2204 T

EFC67405.1

sigma-70, region 4 family

COG1595 K

EEX18471.1

COG0616 OU

EFC75605.1

signal peptide peptidase SppA, 67K type sodium:solute symporter family protein thymidine kinase

AEA20380.1

translocator protein, LysE family

COG1280 E

EGC86277.1

COG0848 U

EFA92087.1

transport energizing protein, ExbD/TolR family transporter, CPA2 family

EGQ17955.1

trigger factor

COG0544 O

EFC76508.1

tRNA pseudouridine synthase A

COG0101 J

EGQ17140.1

UDP-N-acetylglucosamine 2epimerase YitT family protein

COG0381 M

EEX51805.1

EGQ13350.1

COG0591 ER COG1435 F

COG0475 P

COG1284 S

GIT: Gastrointestinal Tract, Oral: Oral Cavity, UGT: Urogenital tract * Hypothetical genes are not shown in table

2.4.6. Trees based on the pan-genome and sequence variations in core genome - niche-specific features

We constructed three types of Neighbor Joining (NJ) trees, using different approaches, to elucidate the relative importance of lineage-specific divergences and niche-specific selection in shaping the gene architectures of the Prevotella strains. The first is the traditional phylogenetic tree based on 16S rRNA sequences (Fig. 2.8), the second is a tree based on the binary (1/0) gene presence/absence matrix (pan-genome), and the third was constructed from concatenated alignments of the core genes of all 28 Prevotella genomes in the dataset. E. coli was taken as the outgroup species, and three GIT-derived Bacteroides strains, namely B. eggerthii DSM 20697, B. clarus YIT 12056 and B. stercoris ATCC 43183, were included to see whether the GIT-derived genomes of Prevotella and Bacteroides cluster together.
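A tree built from a binary presence/absence matrix needs a pairwise distance between gene repertoires as input to the NJ algorithm. The sketch below uses the Jaccard distance, which is an assumption made purely for illustration; the distance measure actually used for the pan-genome tree is the one given in the Methods. The genome names and family ids are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets of gene-family ids."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def distance_matrix(genome_families):
    """Symmetric matrix of pairwise distances, rows and columns in sorted name order."""
    names = sorted(genome_families)
    matrix = [
        [jaccard_distance(genome_families[x], genome_families[y]) for y in names]
        for x in names
    ]
    return names, matrix

# Toy input (hypothetical): three genomes described by their gene families.
toy = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "d"},
    "g3": {"e"},
}
names, dist = distance_matrix(toy)
for name, row in zip(names, dist):
    print(name, row)
```

The resulting matrix could then be passed to any NJ implementation. Genomes that share most of their accessory families end up close together regardless of their 16S rRNA similarity, which is why such a tree can differ from the 16S rRNA tree.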

Figure 2.8: Relative evolutionary divergence of Prevotella. Neighbor Joining (NJ) tree based on the 16S rRNA sequences of 28 Prevotella strains and E. coli 83972 (reference), constructed using MEGA 5 with 1000 bootstrap replications. The species names are colored according to their niches (brown: GIT, green: ORAL Cavity, purple: SKIN and blue: UGT).

These three GIT-derived Bacteroides strains were included as controls because they resemble the GIT-derived Prevotella strains in features such as GC content and genome size and, additionally, because the genera Bacteroides and Prevotella share a common ancestry. The human microbiome reference database restricted us from including Bacteroides strains from other body sites such as the oral cavity, UGT and skin. Figures 2.8 and 2.9 show that the three Bacteroides strains cluster together under a distinct node, completely separated from the Prevotella genomes of the dataset. This observation suggests that, as far as the genetic architectures of the microbiome components are concerned, taxonomic legacy outweighs niche-based selection, if any. Although, for the sake of resolution, only three GIT-derived Bacteroides genomes are included in Figures 2.8 and 2.9 as representative examples, we checked that the lineage-specific segregation of Bacteroides from Prevotella did not depend on the choice of representative Bacteroides genomes. The conspicuous position of P. tannerae, either next to or between Bacteroides and E. coli in all three phylogenetic trees, is consistent with the recent reassignment of P. tannerae to a novel genus, Alloprevotella gen. nov. (Downes et al. 1994). The trees based on the pan-genome and the core genome are highly similar in how they segregate the Prevotella strains into sub-groups, but they differ noticeably in the relative positions of these sub-groups (Fig. 2.9). Both trees show a number of similarities to, as well as divergences from, the 16S rRNA phylogenetic tree. In all three trees, the oral isolates P. oulorum and P. oris and the GIT isolate P. salivae appear either under a common node (Figures 2.9A and 2.9B) or adjacent to each other (Figure 2.8). Two UGT isolates, P. amnii and P. bivia, group under a common node (Fig. 2.8 and 2.9). Two other UGT isolates, P. buccalis and P. timonensis, also co-segregate in all trees (Fig. 2.8 and 2.9).


Figure 2.9: Relative evolutionary divergence of Prevotella. (A) NJ tree based on the binary gene presence/absence matrix of orthologous gene families of 28 Prevotella strains, 3 Bacteroides strains and E. coli 83972 (reference) and (B) NJ tree based on the core genome using 100 bootstrap replications. The bootstrap values are marked at the root of each branch. The species names are colored according to their niches (brown: GIT, green: oral cavity, purple: skin and blue: UGT).


The considerable similarity among all three phylogenetic trees indicates that the complete gene repertoires of the individual genomes (Fig. 2.9A) as well as the core genome (Fig. 2.9B) of these Prevotella strains are in agreement with their 16S rRNA phylogeny (Fig. 2.8). In an attempt to elucidate niche specificity in the Prevotella strains under consideration, we compared all three trees and found many niche-specific divergences. The co-segregation of the GIT isolates P. copri DSM 18205 and P. stercorea DSM 17361 in the trees based on the pan-genome and the core genome, together with their substantial distance in the 16S rRNA-based tree, suggests that these two “not-so-closely-related” GIT isolates might have adopted similar accessory genome structures in order to colonize and adapt to a similar habitat within the human body. Intriguingly, these two GIT-derived Prevotella genomes appeared in a node adjacent to the node of the three GIT-derived Bacteroides in the pan-genome based tree (Figure 2.9A), which suggests that there might be certain inter-genus similarities in the gene repertoires of these GIT-derived microbial species. However, as mentioned above, such habitat-specific similarities could not override the lineage-specific separation of Prevotella and Bacteroides genomes.

2.4.7. Trends in codon usage in core and unique genes

Codon usage patterns in the core gene and unique gene sets of each Prevotella genome under study, and codon usage distances between the core gene sets for all possible pairs of genomes (Fig. 2.10), were calculated as described in the Materials & Methods section. We also estimated the codon usage distance between the core genes and unique genes of each Prevotella strain (Fig. 2.11). These distances are represented as a color heat map and histograms (Fig. 2.10 & 2.11).

The heat map of codon usage distances between core genes of Prevotella strains reflected the average genomic G+C bias of the respective genomes (Fig. 2.10). Strains with similar G+C content showed lower codon usage distances between them, irrespective of their site of adaptation in the human body. Adaptation of a strain to a specific anatomical site might have influenced (or have been influenced by) the gene repertoire of the microbiome components under study, but such adaptation does not appear to have imparted any general niche-specific selection pressure at the codon level.
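The notion of a codon usage distance between two gene sets can be illustrated with a minimal sketch. The distance measure actually used in this study is the one defined in the Materials & Methods section, which is not reproduced here; the sketch below assumes, purely for illustration, that each gene set is summarized by its relative codon frequencies and that the distance is half the sum of absolute frequency differences. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so that frequency vectors are comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_frequencies(sequences):
    """Relative frequency of each codon over a set of in-frame coding sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):         # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]

def codon_usage_distance(set_a, set_b):
    """Assumed measure: half the L1 distance between codon frequency vectors (0 to 1)."""
    fa = codon_frequencies(set_a)
    fb = codon_frequencies(set_b)
    return sum(abs(x - y) for x, y in zip(fa, fb)) / 2
```

Under this assumed measure, two gene sets with identical codon frequencies have a distance of 0 and two sets that share no codon have a distance of 1; applying it to every pair of core gene sets would give a symmetric matrix of the kind summarized in a heat map.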


Figure 2.10: Codon usage distance. Heatmap of codon usage distances between core genomes of Prevotella strains. Numbers are genome ids according to dataset table 2.1.

The genome of each strain contains both core and unique genes, and the two types of genes have different patterns of codon usage. We calculated the codon frequencies of the core and unique genes in each Prevotella strain of the dataset and plotted them as histograms to demonstrate the variation in codon usage between core and unique genes of the same strain (Fig. 2.11).

Figure 2.11: Codon usage distance. Heatmap of codon usage distances between core and unique genes for all 28 Prevotella genomes


2.4.8. Metabolic functional profile of the Prevotella pan-genome

The realization that lineage-specific selection and niche-specific constraints might both have played important roles in shaping the genomic architectures of the microbiome components under study encouraged us to examine the distribution of COG categories in the 28 Prevotella strains. The Prevotella pan-genome comprises a total of 24885 orthologous gene families, which belong to the core, accessory and unique genomes. All genes within an orthologous gene family are expected to have similar metabolic functions. We therefore extracted a representative gene from each gene family and assigned functional COG ids to it in order to understand the functional profile of the genus Prevotella. COG ids were assigned to core, accessory and unique genes separately and plotted as histograms to explore the functional distribution among the core and variable genomes. The distribution of the major COG categories in core, accessory and unique genes is shown in Figure 2.12. The majority of the core genes belong to the Information storage and processing (34%) and Metabolism (36%) categories. On the other hand, a major portion of the unique genes belong to the Cellular processes and signaling COG category.

[Figure 2.12 legend: Cellular processes & signaling; Information storage & processing; Metabolism; Poorly characterized]

Figure 2.12: Percentage relative abundance and distribution of major COG categories between core genome (Inner most layer), accessory genome (middle layer) and singletons (outer layer) of Prevotella.
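The grouping of individual COG letters into the four major categories shown in Figure 2.12 can be sketched as follows. The letter-to-group mapping follows the standard COG classification; the function name and the input format (one COG letter per representative gene) are assumptions made for illustration and do not describe the scripts used in this study.

```python
from collections import Counter

# Standard COG functional letters grouped into the four major categories.
MAJOR_GROUPS = {
    "Information storage & processing": "JAKLB",
    "Cellular processes & signaling": "DYVTMNZWUO",
    "Metabolism": "CGEFHIPQ",
    "Poorly characterized": "RS",
}
LETTER_TO_GROUP = {
    letter: group for group, letters in MAJOR_GROUPS.items() for letter in letters
}

def major_category_percentages(cog_letters):
    """Percentage of genes per major COG group; letters not in the mapping are ignored."""
    counts = Counter(
        LETTER_TO_GROUP[letter] for letter in cog_letters if letter in LETTER_TO_GROUP
    )
    total = sum(counts.values())
    return {
        group: round(100 * counts[group] / total, 1) if total else 0.0
        for group in MAJOR_GROUPS
    }
```

Calling the function separately on the COG letters of core, accessory and unique gene families would yield the three layers of percentages compared in the figure.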


Expansion of these major categories into detailed COG categories revealed that the largest share of the core genes (22%) belongs to the Translation, ribosomal structure & biogenesis (J) COG category. Members of the Nucleotide transport & metabolism (F) and Energy production & conversion (C) categories are also present in much higher percentages among the core genes than among the accessory or unique genes. A large proportion (~25%) of the unique genes falls under the Function unknown (S) and General function prediction only (R) COG categories, while the majority of the remaining singletons appear to be involved in Cell wall/membrane/envelope biogenesis (M), Replication, recombination & repair (L) and Transcription (K) processes (Fig. 2.13).

Figure 2.13: Relative abundance and distribution of all functional COG categories between core genome (blue bars), accessory genome (red bars) and singletons (green bars) of Prevotella. P.T.: Post-translational modification; C.C.: Cell cycle; C.D.: Cell division; I.T.: Intracellular trafficking; S.M.: Secondary Metabolite


2.4.9. Functional categorization of singletons of each Prevotella strain

Genes unique to a specific genome are presumably acquired to improve survival in a specific habitat by providing a special metabolic function to the microbe. To understand the role of a habitat or body niche in the acquisition of genes uniquely possessed by a microbial genome, we categorized the COG functions of the singletons from each Prevotella strain of our dataset. The COG distribution patterns of singletons varied extensively across the genomes isolated from diverse body niches and showed no readily identifiable niche-specific features. In most of the genomes, a large fraction of singletons fell under the categories General function prediction only (R) and Function unknown (S) (Fig. 2.14).

Figure 2.14: Relative abundance and distribution pattern of COG categories within singletons. COG categories shown on vertical axis at right side; abbreviations are according to standard COG categories


In the majority of the Prevotella strains, including the sole skin isolate P. bergensis and three GIT isolates, a substantial fraction of unique genes falls in the categories E, F or G. Some oral isolates, such as P. melaninogenica ATCC 25845, P. nigrescens and P. multiformis, among others, and the UGT isolates P. amnii, P. denticola and P. disiens carry a considerable percentage of unique genes in the Defense mechanisms category (Fig. 2.14). The COG category T (Signal transduction mechanisms) is also well represented in some of the Prevotella genomes, such as P. denticola and P. copri.

2.4.10. Variation in functional profile of the niche specific genes

Figure 2.15 shows that certain COG categories, namely Transcription (K), Replication, recombination and repair (L), Cell wall/membrane/envelope biogenesis (M), General function prediction only (R) and Function unknown (S), are more abundant among niche-specific genes than all other COG categories.

[Figure 2.15 plot: y-axis, niche-specific COG (%), 0.0 to 18.0; series: GIT, Skin, Oral cavity, UGT]

Figure 2.15: COG distribution patterns of the niche-specific orthologous gene families.
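The idea of a niche-specific orthologous gene family can be made concrete with a short sketch. It assumes, as a simplification that may differ from the exact criterion applied in this study, that a family is niche-specific when every genome carrying it was isolated from the same body niche. The function name and the input structures are hypothetical.

```python
def niche_specific_families(presence, niche_of):
    """Group gene families by the single niche they are confined to.

    presence: dict mapping family id -> set of genome ids that carry the family
    niche_of: dict mapping genome id -> body niche of isolation
    Families found in genomes from more than one niche are left out.
    """
    by_niche = {}
    for family, genomes in presence.items():
        niches = {niche_of[g] for g in genomes}
        if len(niches) == 1:                      # confined to exactly one niche
            by_niche.setdefault(niches.pop(), []).append(family)
    return by_niche
```

The COG categories of the families returned for each niche could then be tallied to obtain a per-niche distribution of the kind plotted in Figure 2.15.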


Among the gene families found exclusively in the skin isolate, genes involved in Signal transduction mechanisms (T), Carbohydrate transport and metabolism (G) and Inorganic ion transport and metabolism (P) occur at relatively high frequencies. GIT-specific gene families are enriched in genes involved in Signal transduction mechanisms (T) and Transcription (K), while genes of category M (Cell wall/membrane/envelope biogenesis) are more frequent among oral isolates of Prevotella (Fig. 2.15). Remarkably, gene families associated with Defense mechanisms are significantly overrepresented in GIT, oral cavity and UGT Prevotella isolates as compared to the skin isolate (Fig. 2.15).

2.4.11. Comparative analysis between P. denticola CRIS 18C-A (UGT isolate) and P. denticola F0289 (oral cavity isolate)

Table 2.1 shows that Prevotella denticola is the only species in our dataset whose strains are adapted to different body sites in human: one to the UGT and the other to the oral cavity. As indicated by the niche-wise variations in functional composition and gene repertoire, the genetic architecture of the Prevotella members of the human microbiome might have been influenced appreciably by adaptation to a specific habitat within the human body. To better understand the niche-specific divergences in the genetic architectures of the microbiome components, we carried out a comparative analysis of the gene contents of two strains of P. denticola: P. denticola CRIS 18C-A, isolated from the urogenital tract, and P. denticola F0289, isolated from the oral cavity. This analysis highlights the resemblances and differences between the two strains that may play an important role in adaptation to distinct body sites, even though both belong to the same species.

The two strains differ in their genomic properties. The UGT isolate has a larger genome and a higher number of CDS than the oral cavity isolate (Table 2.1). Based on the gene families obtained from protein BLAST using the criteria of 50% identity and 50% coverage length, the two P. denticola strains share 1968 genes (Fig. 2.16). Functional assignment of the genes of these two strains revealed that the majority of strain-specific genes belong to the Replication, recombination and repair (L) and Transcription (K) categories. Unexpectedly, these two categories (L & K) account for a higher proportion (25-35%) of strain-specific genes than of genes shared by both strains (12%) (Fig. 2.16). An appreciable number (8-13%) of strain-specific genes also fell into COG category M (Cell wall/membrane/envelope biogenesis). The UGT isolate of P. denticola contains 4% of its strain-specific genes in the Signal transduction mechanisms category, in contrast to 1% in the oral isolate of the same species (Fig. 2.16). On the other hand, the percentage of genes associated with Carbohydrate transport and metabolism is much higher in the oral strain (7%) than in the UGT strain (3%) (Fig. 2.16).

Figure 2.16: Comparative functional analysis of gene repertoire of P. denticola F0289 (Oral cavity) and P. denticola CRIS 18C-A (UGT) strains.
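The 50% identity and 50% coverage criteria used to separate shared from strain-specific genes can be sketched as a simple filter over pairwise protein BLAST hits. The sketch assumes that coverage is the alignment length relative to the longer of the two proteins, which is one common convention and may differ from the definition applied in this study; the dictionary keys and function names are hypothetical.

```python
def passes_threshold(pident, aln_len, qlen, slen, min_identity=50.0, min_coverage=50.0):
    """True if a hit meets both cut-offs; coverage is relative to the longer protein (assumed)."""
    coverage = 100.0 * aln_len / max(qlen, slen)
    return pident >= min_identity and coverage >= min_coverage

def partition_genes(genes_a, genes_b, hits):
    """Split genes of two strains into shared pairs and strain-specific sets.

    hits: iterable of dicts with keys query, subject, pident, length, qlen, slen,
          where query genes come from strain A and subject genes from strain B.
    """
    shared = set()
    for h in hits:
        if passes_threshold(h["pident"], h["length"], h["qlen"], h["slen"]):
            shared.add((h["query"], h["subject"]))
    matched_a = {q for q, _ in shared}
    matched_b = {s for _, s in shared}
    return shared, set(genes_a) - matched_a, set(genes_b) - matched_b
```

Genes without any qualifying hit in the other strain are reported as strain-specific, which corresponds to the two outer sets compared in Figure 2.16.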

All 1968 genes shared by the two strains of P. denticola were tested for positive selection in order to identify functions that may have been positively selected in P. denticola for adaptation to, or colonization of, the UGT and the oral cavity. Of the 1968 shared genes, only six were found to be under positive selection, as listed in Table 2.5.


Table 2.5: Core genes under positive selection among both P. denticola strains

| GI-ids | dN/dS | COG Hit | Gene Description | Description of COG class |
|---|---|---|---|---|
| 325483398 & 326944103 | 1 | COG0772 | Bacterial cell division membrane protein | Cell cycle control, cell division, chromosome partitioning |
| 325483381 & 326945632 | 1.08 | COG0082 | Chorismate synthase | Amino acid transport and metabolism |
| 325484345 & 326946281 | 1.12 | COG1208 | Nucleoside diphosphate sugar pyrophosphorylase. Involved in lipopolysaccharide biosynthesis. | Cell wall/membrane/envelope biogenesis & Translation, ribosomal structure and biogenesis |
| 325483802 & 326944751 | 1.13 | COG0543 | 2-polyprenylphenol hydroxylase and related flavodoxin oxidoreductases | Coenzyme transport and metabolism & Energy production and conversion |
| 325483451 & 326944091 | 1.5 | COG0036 | Pentose-5-phosphate-3-epimerase | Carbohydrate transport and metabolism |
| 325483576 & 326944463 | 1.5 | COG1694 | Predicted pyrophosphatase | General function prediction only |
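The dN/dS ratio reported in Table 2.5 is conventionally read as follows: values above 1 point to positive selection, values near 1 to neutral evolution and values below 1 to purifying selection. The sketch below only illustrates this conventional reading; the function name and the tolerance are assumptions, and it does not reproduce the selection test used in this study, which also lists genes with a ratio of exactly 1.

```python
def selection_regime(dn, ds, tol=1e-9):
    """Classify a gene pair by its dN/dS ratio (conventional interpretation)."""
    if ds <= 0:
        return "undefined"        # the ratio cannot be computed without synonymous changes
    omega = dn / ds
    if omega > 1 + tol:
        return "positive"
    if omega < 1 - tol:
        return "purifying"
    return "neutral"
```

Applied to all shared gene pairs, such a classification would retain only a small fraction of genes as candidates for positive selection when most genes evolve under purifying selection.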

2.4.12. Trends in GC content of core, accessory and unique genes among Prevotella strains

We carried out an analysis to identify the variation of GC content pattern in core, accessory and unique genes. A uniform pattern was observed in most of the strains of the Prevotella genus under study (Fig. 2.17). Higher number of protein coding genes located on low GC region

(Mean + SD) metabolic functions derived from respective microbiota
GUT: Energy metabolism; Carbohydrate metabolism
UGT: Membrane transport; Carbohydrate metabolism
Oral Cavity: Membrane transport
Skin/Airways: Membrane transport; Metabolism of cofactors and vitamins; Folding, sorting and degradation

Chapter 4: Niche specific functional diversity

To explore the metabolic nature of the niche-specific microbiota, KEGG metabolic functions were assigned to all KEGG enzymes found in substantially higher abundance in specific body niches (Table 4.12).

4.4.4.3. Niche specific KEGG pathway enzymes derived from PCA factor analysis

PCA based on the relative abundances of KEGG pathway enzymes showed that all major body niches are significantly segregated from each other, as explained by the variance of the first three principal coordinates (Fig. 4.8 & 4.9). KEGG pathway enzymes highly correlated (r > 0.7) with specific body niches were identified using PCA factor analysis (Table 4.7, 4.13, 4.14, 4.15). These KEGG pathway enzymes, which might contribute to niche specificity, were mapped to KEGG pathways using the KEGG mapping server (Table 4.8, 4.9, 4.10, 4.11). The pattern shown in the PCA plot (Fig. 4.8), in which GUT and UGT are situated on opposite ends of the first principal coordinate, is also reflected in the KEGG pathways assigned to the GUT- and UGT-associated KEGG enzymes (Table 4.13).
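The selection of enzymes correlated with an ordination axis at r > 0.7 can be illustrated with a minimal sketch. It assumes that the sample scores along one principal axis are already available and uses a plain Pearson correlation between each enzyme's relative abundance and those scores; the factor analysis actually applied in this study may differ in detail, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0                # a constant vector carries no correlation signal
    return cov / (sx * sy)

def enzymes_correlated_with_axis(abundance, axis_scores, threshold=0.7):
    """Enzymes whose abundance across samples correlates with the axis scores above threshold.

    abundance:   dict mapping enzyme id -> list of relative abundances, one per sample
    axis_scores: list of sample scores on one principal axis, in the same sample order
    """
    selected = {}
    for enzyme, values in abundance.items():
        r = pearson(values, axis_scores)
        if r > threshold:
            selected[enzyme] = r
    return selected
```

Enzymes associated with the niche at the opposite end of the axis could be obtained in the same way by negating the scores before the call.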

Table 4.13: KEGG pathways associated with GUT and UGT Major Pathway category

Pathway category Cell growth & death

Cellular Processes

Environmental Information Processing

Cell motility Transport & catabolism Membrane transport

Signal transduction

Genetic Information Processing

Folding, sorting & degradation Transcription Translation Cancers

Human Diseases

Drug resistance Infectious diseases

Amino acid metabolism Metabolism

Biosynthesis of other secondary metabolites

KEGG pathway Cell cycle - Caulobacter Meiosis - yeast Bacterial chemotaxis Peroxisome ABC transporters Phosphotransferase system (PTS) Bacterial secretion system Two-component system PI3K-Akt signaling pathway Protein export Sulfur relay system Protein processing in endoplasmic reticulum RNA polymerase Aminoacyl-tRNA biosynthesis Pathways in cancer Prostate cancer Choline metabolism in cancer beta-Lactam resistance Cationic antimicrobial peptide (CAMP) resistance Salmonella infection Staphylococcus aureus infection Alanine, aspartate and glutamate metabolism Glycine, serine and threonine metabolism Cysteine and methionine metabolism Valine, leucine and isoleucine degradation Lysine biosynthesis Arginine and proline metabolism Histidine metabolism Phenylalanine, tyrosine and tryptophan biosynthesis

KEGG pathway enzymes involved GUT UGT 1 0 1 0 0 1 0

1

0

6

0

6

1 5 1 1 0

1 2 0 1 1

1

0

0 0 1 1 0 2

1 1 0 0 0 1

2

2

1 0

0 2

3

1

1

1

2

1

0

2

0 2 1

1 0 0

1

0

1

1

Monobactam biosynthesis


Major Pathway category

Metabolism

Pathway category

KEGG pathway

Glycolysis / Gluconeogenesis Citrate cycle (TCA cycle) Pentose phosphate pathway Pentose and glucuronate interconversions Fructose and mannose metabolism Galactose metabolism Carbohydrate Ascorbate and aldarate metabolism metabolism Starch and sucrose metabolism Amino sugar and nucleotide sugar metabolism Pyruvate metabolism Glyoxylate and dicarboxylate metabolism Butanoate metabolism Oxidative phosphorylation Methane metabolism Carbon fixation in photosynthetic organisms Energy metabolism Carbon fixation pathways in prokaryotes Nitrogen metabolism Sulfur metabolism N-Glycan biosynthesis Glycan biosynthesis Lipopolysaccharide biosynthesis & metabolism Peptidoglycan biosynthesis Fatty acid degradation Synthesis and degradation of Lipid metabolism ketone bodies Glycerolipid metabolism One carbon pool by folate Thiamine metabolism Riboflavin metabolism Metabolism of Vitamin B6 metabolism cofactors & vitamins Nicotinate and nicotinamide metabolism Pantothenate and CoA biosynthesis

KEGG pathway enzymes involved GUT UGT 1 1 4 0 1 0 1

1

5

2

0

1

1

0

2

4

5

4

4

0

1

0

0 0 2

2 0 1

3

0

4

0

1 1 1 1 0 0

0 0 0 0 1 1

0

1

0 1 1 0 1

1 0 1 1 0

1

1

0

2


Major Pathway category

Pathway category

Metabolism of other amino acids

Metabolism

Metabolism of terpenoids & polyketides Nucleotide metabolism Xenobiotics biodegradation & metabolism Digestive system Endocrine system

Organismal Systems

Environmental adaptation Immune system

KEGG pathway Phosphonate and phosphinate metabolism Selenocompound metabolism D-Alanine metabolism Glutathione metabolism

KEGG pathway enzymes involved GUT UGT 1

0

2 0 0

0 1 0

1

3

3 0

1 3

Aminobenzoate degradation

0

1

Protein digestion and absorption Progesterone-mediated oocyte maturation Estrogen signaling pathway

1

0

1

0

1

0

1

0

1

0

1

0

Terpenoid backbone biosynthesis Purine metabolism Pyrimidine metabolism

Plant-pathogen interaction Antigen processing and presentation NOD-like receptor signaling pathway

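Table 4.14: KEGG pathways associated with oral cavity

| Major Pathway category | Pathway category | KEGG pathway | KEGG pathway enzymes involved |
|---|---|---|---|
| Environmental Information Processing | Membrane transport | ABC transporters | 11 |
| Environmental Information Processing | Signal transduction | Two-component system | 1 |
| Environmental Information Processing | Signal transduction | Phosphatidylinositol signaling system | 1 |
| Human Diseases | Cancers | Choline metabolism in cancer | 1 |
| Human Diseases | Drug resistance | Vancomycin resistance | 1 |
| Metabolism | Amino acid metabolism | Alanine, aspartate & glutamate metabolism | 1 |
| Metabolism | Amino acid metabolism | Glycine, serine & threonine metabolism | 1 |
| Metabolism | Amino acid metabolism | Cysteine & methionine metabolism | 1 |
| Metabolism | Amino acid metabolism | Valine, leucine & isoleucine biosynthesis | 1 |
| Metabolism | Amino acid metabolism | Arginine & proline metabolism | 1 |
| Metabolism | Amino acid metabolism | Phenylalanine, tyrosine & tryptophan biosynthesis | 1 |
| Metabolism | Carbohydrate metabolism | Citrate cycle (TCA cycle) | 1 |
| Metabolism | Carbohydrate metabolism | Butanoate metabolism | 1 |
| Metabolism | Energy metabolism | Oxidative phosphorylation | 1 |
| Metabolism | Energy metabolism | Carbon fixation pathways in prokaryotes | 1 |
| Metabolism | Energy metabolism | Nitrogen metabolism | 1 |
| Metabolism | Glycan biosynthesis & metabolism | Peptidoglycan biosynthesis | 1 |
| Metabolism | Lipid metabolism | Glycerolipid metabolism | 1 |
| Metabolism | Lipid metabolism | Glycerophospholipid metabolism | 1 |
| Metabolism | Metabolism of cofactors & vitamins | Nicotinate & nicotinamide metabolism | 1 |
| Metabolism | Metabolism of cofactors & vitamins | Pantothenate & CoA biosynthesis | 1 |
| Metabolism | Metabolism of cofactors & vitamins | Folate biosynthesis | 2 |
| Metabolism | Metabolism of other amino acids | beta-Alanine metabolism | 1 |
| Metabolism | Metabolism of other amino acids | Taurine & hypotaurine metabolism | 1 |
| Metabolism | Metabolism of other amino acids | Phosphonate & phosphinate metabolism | 1 |
| Metabolism | Metabolism of other amino acids | Glutathione metabolism | 1 |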

Table 4.15: KEGG pathways associated with Skin and Airways Major Pathway category Environmental Information Processing Genetic Information Processing

Pathway category Membrane transport Signal transduction Folding, sorting & degradation Translation Amino acid metabolism Carbohydrate metabolism Energy metabolism Glycan biosynthesis & metabolism

Metabolism

Lipid metabolism Metabolism of cofactors & vitamins Metabolism of other amino acids Nucleotide metabolism

KEGG pathway ABC transporters Bacterial secretion system Two-component system RNA degradation Protein export Sulfur relay system Ribosome biogenesis in eukaryotes Lysine degradation Tryptophan metabolism Citrate cycle (TCA cycle)

KEGG pathway enzymes involved 2 2 3 1 2 2 1 1 1 1

Nitrogen metabolism Peptidoglycan biosynthesis

3 1

Biosynthesis of unsaturated fatty acids Folate biosynthesis Porphyrin & chlorophyll metabolism Glutathione metabolism

1

Purine metabolism

1

2 2 1


4.4.5. Core KEGG pathway enzymes among all 486 metagenomic samples

A survey was carried out among all 486 metagenomic samples to identify the core KEGG pathway enzymes, i.e. those conserved in all metagenomic samples under the present study. A total of 796 KEGG pathway enzymes are shared by all 486 metagenomic samples. KEGG metabolic functions were mapped to these 796 core enzymes, as shown in Figure 4.11 and Table 4.16. The 796 core enzymes were mapped to a total of 908 KEGG metabolic pathways, most of which belong to Metabolism (597), followed by Genetic Information Processing (174) and Environmental Information Processing (123).

Table 4.16: Distribution of Core KEGG pathway enzymes in major metabolic pathways

| KEGG pathways | Total No. of Core enzymes belong to KEGG pathway | % of Core enzymes |
|---|---|---|
| Amino acid metabolism | 61 | 6.72 |
| Biosynthesis of secondary metabolites | 9 | 0.99 |
| Cancers | 12 | 1.32 |
| Carbohydrate metabolism | 178 | 19.60 |
| Cell growth and death | 15 | 1.65 |
| Cell motility | 4 | 0.44 |
| Cellular community | 2 | 0.22 |
| Digestive system | 2 | 0.22 |
| Drug resistance | 14 | 1.54 |
| Endocrine and metabolic diseases | 2 | 0.22 |
| Endocrine system | 10 | 1.10 |
| Energy metabolism | 54 | 5.95 |
| Environmental adaptation | 1 | 0.11 |
| Folding sorting and degradation | 29 | 3.19 |
| Glycan biosynthesis and metabolism | 23 | 2.53 |
| Immune diseases | 1 | 0.11 |
| Immune system | 4 | 0.44 |
| Infectious diseases | 19 | 2.09 |
| Lipid metabolism | 26 | 2.86 |
| Membrane transport | 87 | 9.58 |
| Metabolism of cofactors and vitamins | 37 | 4.07 |
| Metabolism of other amino acids | 19 | 2.09 |
| Metabolism of terpenoids & polyketides | 16 | 1.76 |
| Nervous system | 2 | 0.22 |
| Neurodegenerative diseases | 2 | 0.22 |
| Nucleotide metabolism | 69 | 7.60 |
| Replication and repair | 65 | 7.16 |
| Signal transduction | 36 | 3.96 |
| Transcription | 5 | 0.55 |
| Translation | 75 | 8.26 |
| Transport and catabolism | 3 | 0.33 |
| Xenobiotics biodegradation & metabolism | 26 | 2.86 |
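The two steps behind Table 4.16, finding the enzymes present in every sample and expressing pathway counts as percentages of all pathway assignments, can be sketched as follows. The input structures and function names are hypothetical and serve only to illustrate the calculation; they do not describe the pipeline used in this study.

```python
def core_enzymes(samples):
    """Enzymes present in every sample.

    samples: dict mapping sample id -> iterable of KEGG enzyme ids detected in it
    """
    enzyme_sets = [set(enzymes) for enzymes in samples.values()]
    return set.intersection(*enzyme_sets) if enzyme_sets else set()

def percent_share(pathway_counts):
    """Percentage of all pathway assignments that falls into each pathway category."""
    total = sum(pathway_counts.values())
    return {
        pathway: round(100 * count / total, 2) if total else 0.0
        for pathway, count in pathway_counts.items()
    }
```

With the 908 pathway assignments of Table 4.16 as the total, the 61 assignments to Amino acid metabolism correspond to 6.72%, in agreement with the tabulated value.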


[Figure 4.11 chart segments: Cellular Processes, 24; Environmental Information Processing, 123; Genetic Information Processing, 174; Human Diseases, 50; Metabolism, 597; Organismal Systems, 19]

Figure 4.11: Functional distribution pattern of 796 core KEGG enzymes

As mentioned above, 796 core KEGG enzymes were found in all 486 metagenomic samples; these represent the universal functional characteristics of the human microbiome, irrespective of body niche and individual. The relative abundances of the 796 core KEGG enzymes in all 486 metagenomic samples were analysed by principal coordinate analysis to examine niche specificity, if any, in the core metabolic features. Figures 4.12A and 4.12B present the PC1 versus PC2 and PC1 versus PC3 plots, respectively, of the analysis based on the relative abundances of the core KEGG enzymes.


[Figure 4.12A plot: x-axis PC1 (29.4%), y-axis PC2 (22.6%); groups: Oral, GUT, UGT, Skin, Airways]

Figure 4.12A: Principal Coordinate Analysis (PC1 vs. PC2) based on niche wise relative abundances of 796 core KEGG pathway enzymes

[Figure 4.12B plot: x-axis PC1 (29.4%), y-axis PC3 (14.2%), both axes ranging from -0.02 to 0.02; groups: UGT, GUT, Airways, Oral Cavity, Skin]

Figure 4.12B: Principal Coordinate Analysis (PC1 vs. PC3) based on niche wise relative abundances of 796 core KEGG pathway enzymes


As observed in Figures 4.12A and 4.12B, all body niches are again significantly (P=0.0001, R=0.877) segregated from each other, except the skin- and airways-derived samples (P=0.16, R=0.15). This observation suggests that many functional enzymes are contributed across all body niches by niche-specific microbiota, but that the abundances of these common enzymes vary from one body niche to another, which creates a unique functional metabolic environment at each body niche. GUT and UGT samples are again situated on opposite ends of Axis 1 (PC1), and oral cavity samples are placed between UGT and GUT on the plots (Fig. 4.12A & 4.12B) based on the relative abundances of the 796 core KEGG enzymes. PC2 (Fig. 4.12A) explains the variation in abundances of core KEGG enzymes derived from airways, while the variation among metagenomic samples derived from skin and oral cavity is explained by the third principal coordinate (Fig. 4.12B).

4.4.6. Niche specific functional characteristics derived from niche specific microbiome

4.4.6.1. Niche specific functional features conserved among all healthy individuals

In the current study, we also investigated the body-site specific metabolic functionality derived from niche-specific microbiota that might help maintain healthy human physiology. Construction of a binary (1/0) presence/absence matrix of KEGG pathway enzymes for each body niche, using all 486 metagenomic samples under analysis, facilitated identification of the niche-specific core enzymes, i.e. those present in all metagenomic samples (individuals) derived from a specific body niche or sub-site (core niche metagenome; CNM) (Table 4.17). However, this core component of the functional microbiome also includes:

A. Core enzymes present in all (more than 90%) reference bacterial genomes derived from a specific body niche of the human (Table 4.18) (niche core genome; NCG, or essential genome of the microbial community for niche specificity). The NCG might be essential for the niche-specific microbial community to colonize a specific body niche in healthy humans. Most of these genes or pathways are housekeeping genes implicated in the maintenance of basic cellular functions.

B. Core enzymes present in all 486 metagenomic samples (core metagenome; CM), which are associated with the entire human body irrespective of body niche and individual.

Table 4.17: Number of core KEGG pathway enzymes derived from niche specific metagenome

| Body niche or Sub-site (Total metagenomic samples count) | Core Niche Metagenome (CNM) | Minimal Niche Metagenome (MNM = CNM-CM-NCG) | Complete KEGG Pathway Modules |
|---|---|---|---|
| Airways (68) | 2115 | 1076 | 34 |
| Buccal Mucosa (77) | 4688 | 3701 | 125 |
| Keratinized Gingiva (3) | 6934 | 5947 | 178 |
| Palatine Tonsil (4) | 6196 | 5209 | 173 |
| Saliva (2) | 7545 | 6558 | 200 |
| Gingival Plaque (88) | 8376 | 7389 | 214 |
| Throat (4) | 5651 | 4664 | 150 |
| Tongue Dorsum (86) | 7826 | 6839 | 217 |
| UGT (40) | 1481 | 484 | 15 |
| Skin (23) | 5876 | 4745 | 150 |
| GUT (91) | 7450 | 6508 | 191 |

Table 4.18: Number of core KEGG pathway enzymes derived from niche specific reference bacterial strains (HMP-DACC)

| Major Body Niches | Total Bacterial Reference Genomes (HMP) | Niche Core Genome (NCG) |
|---|---|---|
| Airways | 50 | 243 |
| GUT | 452 | 146 |
| Oral cavity | 244 | 191 |
| Skin | 123 | 335 |
| UGT | 146 | 201 |
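The relation MNM = CNM - CM - NCG stated in Table 4.17 can be checked with a few lines of arithmetic. The numbers below are copied from Tables 4.17 and 4.18 and from the core metagenome size of 796; the assignment of the oral sub-sites and the throat to the oral cavity NCG is an inference made here because it reproduces the tabulated MNM values, and the variable and function names are hypothetical.

```python
CM = 796  # core metagenome: enzymes shared by all 486 samples

# Niche core genome sizes (Table 4.18)
NCG = {"Airways": 243, "GUT": 146, "Oral cavity": 191, "Skin": 335, "UGT": 201}

# Sub-site -> (major niche used for the NCG, core niche metagenome size from Table 4.17)
CNM = {
    "Airways": ("Airways", 2115),
    "Buccal Mucosa": ("Oral cavity", 4688),
    "Keratinized Gingiva": ("Oral cavity", 6934),
    "Palatine Tonsil": ("Oral cavity", 6196),
    "Saliva": ("Oral cavity", 7545),
    "Gingival Plaque": ("Oral cavity", 8376),
    "Throat": ("Oral cavity", 5651),
    "Tongue Dorsum": ("Oral cavity", 7826),
    "UGT": ("UGT", 1481),
    "Skin": ("Skin", 5876),
    "GUT": ("GUT", 7450),
}

def minimal_niche_metagenome(cnm, ncg, cm=CM):
    """MNM as a simple difference of counts, following the formula in Table 4.17."""
    return cnm - cm - ncg

MNM = {site: minimal_niche_metagenome(size, NCG[niche]) for site, (niche, size) in CNM.items()}
```

For example, the airways value is 2115 - 796 - 243 = 1076 and the UGT value is 1481 - 796 - 201 = 484. Subtracting counts in this way implicitly treats the CM and the NCG as non-overlapping parts of the CNM.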


The Minimal Niche Metagenome (MNM) derived from niche-specific microbiota (Table 4.17), which may be essential for the normal or healthy physiology of a specific body niche or sub-site, was calculated, and the complete KEGG modules of the MNM of each niche were mapped to pathways (Table 4.19) to explore the metabolic functions that play an essential role at the specific body niche or sub-site in maintaining human health or physiology.

Table 4.19: Number of complete KEGG modules pathways derived from minimal niche metagenome (MNM) for each body niche or sub-sites

| KEGG Pathways | An | Bm | Kg | Pt | Sa | Sk | Gt | Gp | Td | Th | Ug |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ABC-2 type & other transport systems | 1 | 0 | 1 | 3 | 3 | 3 | 2 | 4 | 4 | 3 | 0 |
| Arginine & proline metabolism | 1 | 3 | 3 | 3 | 3 | 1 | 3 | 3 | 3 | 3 | 0 |
| Aromatic amino acid metabolism | 2 | 5 | 6 | 5 | 6 | 5 | 6 | 7 | 6 | 5 | 0 |
| Aromatics degradation | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |
| ATP synthesis | 4 | 5 | 7 | 7 | 7 | 7 | 7 | 8 | 8 | 6 | 1 |
| Bacterial secretion system | 1 | 3 | 4 | 4 | 7 | 3 | 6 | 8 | 8 | 4 | 0 |
| Biosynthesis of secondary metabolites | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| Branched-chain amino acid metabolism | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 0 |
| Carbon fixation | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Cell signaling | 0 | 0 | 7 | 6 | 8 | 7 | 8 | 9 | 10 | 1 | 0 |
| Central carbohydrate metabolism | 2 | 4 | 5 | 5 | 5 | 3 | 5 | 5 | 5 | 5 | 0 |
| Cofactor & vitamin biosynthesis | 1 | 7 | 10 | 10 | 11 | 7 | 11 | 11 | 11 | 9 | 0 |
| Cysteine & methionine metabolism | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 0 |
| DNA polymerase | 0 | 0 | 2 | 0 | 1 | 3 | 1 | 3 | 2 | 0 | 0 |
| Drug efflux transporter/pump | 0 | 1 | 3 | 2 | 3 | 2 | 1 | 2 | 4 | 2 | 0 |
| Drug resistance | 2 | 3 | 3 | 3 | 4 | 2 | 4 | 3 | 4 | 3 | 0 |
| Fatty acid metabolism | 1 | 4 | 4 | 4 | 4 | 6 | 4 | 6 | 6 | 4 | 1 |
| Glycan metabolism | 0 | 0 | 1 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 0 |
| Glycosaminoglycan metabolism | 0 | 1 | 0 | 1 | 1 | 0 | 2 | 3 | 1 | 1 | 0 |
| Histidine metabolism | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
| Lipid metabolism | 0 | 2 | 4 | 5 | 6 | 6 | 6 | 6 | 6 | 1 | 0 |
| Lipopolysaccharide metabolism | 0 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 0 |
| Lysine metabolism | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| Metabolic capacity | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Metallic cation, iron-siderophore & vitamin B12 transport system | 0 | 4 | 4 | 4 | 4 | 1 | 1 | 4 | 4 | 4 | 1 |
| Mineral & organic ion transport system | 0 | 0 | 2 | 1 | 2 | 0 | 2 | 4 | 3 | 1 | 0 |
| Nitrogen metabolism | 0 | 2 | 1 | 2 | 2 | 2 | 3 | 2 | 2 | 1 | 0 |
| Other amino acid metabolism | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Other carbohydrate metabolism | 2 | 2 | 3 | 4 | 6 | 3 | 6 | 4 | 6 | 3 | 0 |
| Pathogenicity | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Peptide & nickel transport system | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Phosphate & amino acid transport system | 0 | 7 | 11 | 8 | 10 | 0 | 11 | 11 | 11 | 9 | 2 |
| Phosphotransferase system | 5 | 9 | 12 | 11 | 11 | 8 | 9 | 12 | 11 | 9 | 3 |
| Polyamine biosynthesis | 1 | 3 | 3 | 3 | 4 | 3 | 3 | 3 | 4 | 3 | 0 |
| Proteasome | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Protein processing | 0 | 0 | 1 | 0 | 1 | 3 | 1 | 2 | 1 | 1 | 0 |
| Pyrimidine metabolism | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
| Repair system | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| Replication system | 0 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 0 |
| RNA polymerase | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| RNA processing | 0 | 0 | 4 | 2 | 4 | 7 | 3 | 7 | 6 | 2 | 0 |
| Saccharide, polyol, & lipid transport system | 1 | 4 | 7 | 9 | 11 | 3 | 10 | 11 | 11 | 9 | 1 |
| Serine & threonine metabolism | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| Spliceosome | 0 | 0 | 0 | 1 | 0 | 6 | 1 | 1 | 1 | 1 | 0 |
| Sulfur metabolism | 0 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 2 | 1 | 0 |
| Two-component regulatory system | 6 | 40 | 53 | 51 | 52 | 32 | 58 | 53 | 58 | 43 | 7 |
| Ubiquitin system | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |

An: Anterior Nares; Bm: Buccal Mucosa; Kg: Keratinized Gingiva; Pt: Palatine Tonsil; Sa: Saliva; Sk: Skin; Gt: GUT; Gp: Gingival Plaque; Td: Tongue Dorsum; Th: Throat; Ug: UGT

The complete KEGG modules derived from the minimal niche metagenome include six modules, namely M00086 (Fatty acid metabolism), M00153 (ATP synthesis), M00201 (Saccharide, polyol, and lipid transport system), M00275 (Phosphotransferase system), M00450 (Two-component regulatory system) and M00454 (Two-component regulatory system), which are part of the minimal metagenome of every niche, while 44 KEGG modules are exclusively present (Table 4.20) and 31 exclusively absent (Table 4.21) in a specific body niche or sub-site.


Table 4.20: Niche specific exclusively present KEGG modules derived from Minimal Niche Metagenome (MNM)

Name of KEGG Module: Aromatic amino acid metabolism; Aromatics degradation; ATP synthesis; Cell signaling; Central carbohydrate metabolism; DNA polymerase; Drug resistance; Glycan metabolism; Glycosaminoglycan metabolism; Lipid metabolism; Metallic cation, iron-siderophore & vitamin B12 transport system; Mineral and organic ion transport system; Nitrogen metabolism; Pathogenicity; Phosphate & amino acid transport system; Proteasome; Protein processing; Repair system; RNA polymerase; RNA processing; Spliceosome; Two-component regulatory system; Ubiquitin system

| KEGG Module ID | Body Niche |
|---|---|
| M00038 | Stool |
| M00350 | KG |
| M00540 | Gingival Plaque |
| M00551 | Gingival Plaque |
| M00569 | Saliva |
| M00142 | Skin |
| M00143 | Skin |
| M00416 | Skin |
| M00684 | Stool |
| M00690 | Skin |
| M00308 | Stool |
| M00309 | Stool |
| M00262 | Skin |
| M00625 | Stool |
| M00068 | Saliva |
| M00059 | Gingival Plaque |
| M00094 | Skin |
| M00241 | Skin |
| M00321 | Gingival Plaque |
| M00175 | Stool |
| M00575 | Saliva |
| M00322 | TD |
| M00340 | Skin |
| M00341 | Skin |
| M00342 | Skin |
| M00343 | Skin |
| M00410 | Skin |
| M00290 | Skin |
| M00181 | Skin |
| M00425 | Skin |
| M00352 | Skin |
| M00353 | Skin |
| M00354 | Skin |
| M00397 | Skin |
| M00398 | Skin |
| M00484 | Gingival Plaque |
| M00474 | Stool |
| M00510 | Stool |
| M00516 | Skin |
| M00379 | Skin |
| M00380 | Skin |
| M00384 | Skin |
| M00387 | Skin |
| M00407 | Skin |


Table 4.21: KEGG modules exclusively absent from body niche derived from Minimal Niche Metagenome (MNM)

Name of KEGG Module: Arginine & proline metabolism; Aromatic amino acid metabolism; ATP synthesis; Bacterial secretion system; Branched-chain amino acid metabolism; Central carbohydrate metabolism; Cofactor and vitamin biosynthesis; Cysteine and methionine metabolism; Drug resistance; Histidine metabolism; Other carbohydrate metabolism; Peptide & nickel transport system; Phosphotransferase system (PTS); Polyamine biosynthesis; Two-component regulatory system

| KEGG Module ID | Body Niche |
|---|---|
| M00028 | UGT |
| M00023 | UGT |
| M00144 | UGT |
| M00151 | UGT |
| M00155 | UGT |
| M00336 | UGT |
| M00432 | UGT |
| M00010 | UGT |
| M00307 | UGT |
| M00127 | UGT |
| M00338 | UGT |
| M00743 | UGT |
| M00743 | UGT |
| M00026 | UGT |
| M00045 | UGT |
| M00012 | UGT |
| M00061 | UGT |
| M00324 | UGT |
| M00274 | UGT |
| M00279 | UGT |
| M00283 | UGT |
| M00281 | Airways |
| M00134 | UGT |
| M00445 | UGT |
| M00471 | UGT |
| M00479 | UGT |
| M00492 | UGT |
| M00453 | Airways |
| M00457 | Airways |
| M00465 | Airways |
| M00465 | Airways |

4.4.6.2. Niche specific functional diversity derived from niche specific microbiota

As discussed above, the composition of the microbiota of a specific body niche is highly diverse both within an individual and between individuals. In the present study, therefore, alpha diversity (based on relative abundance) and beta diversity (based on presence/absence) were calculated from the KEGG pathway enzyme composition.

Figure 4.13 axes: Alpha Diversity (Shannon), 6 to 8; categories: Airways, Skin, UGT, GUT, Oral

Figure 4.13: Alpha diversity (Shannon Index) for major body niches based on average abundance of KEGG pathway enzymes derived from niche specific microbiome

The alpha diversity index of each body niche reflects the variation in the average abundances of KEGG pathway enzymes within that niche. Figure 4.13 shows that this variation is lowest in the metagenomic samples from UGT (6.71) and highest in those from the oral cavity (7.52). Similarly, the histogram for beta diversity (Figure 4.14) shows that, with respect to the composition (presence/absence) of KEGG pathway enzymes, the samples from airways (0.76) and UGT (0.72) are the most diverse between individuals, while the gut metagenomic samples (0.40) are the least diverse.

Figure 4.14 axes: Beta Diversity, 0 to 0.9; categories: Anterior Nares, Oral, GUT, UGT, Skin

Figure 4.14: Beta diversity (Bray Curtis) for major body niches based on composition of KEGG pathway enzymes between all metagenomic samples derived from specific major body niches of healthy individuals
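The two indices used above can be illustrated with a minimal sketch. It assumes the natural logarithm for the Shannon index and takes the niche-level beta diversity as the mean pairwise Bray-Curtis dissimilarity on presence/absence data; neither choice is stated in the text, and the function names are hypothetical.

```python
import math
from itertools import combinations


def shannon_index(abundances):
    """Shannon index H = -sum(p_i * ln p_i) over features with non-zero abundance.

    Assumption: natural logarithm; the log base is not stated in the text.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


def bray_curtis_presence_absence(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two samples given as sets of KEGG IDs.

    On binary data it reduces to 1 - 2 * shared / (n_a + n_b).
    """
    set_a, set_b = set(sample_a), set(sample_b)
    if not set_a and not set_b:
        return 0.0
    shared = len(set_a & set_b)
    return 1.0 - 2.0 * shared / (len(set_a) + len(set_b))


def mean_pairwise_beta(samples):
    """Average dissimilarity over all sample pairs of one niche (assumed summary)."""
    pairs = list(combinations(samples, 2))
    return sum(bray_curtis_presence_absence(a, b) for a, b in pairs) / len(pairs)
```

Here each sample is simply the set of KEGG enzymes detected in it, and the abundance vector passed to `shannon_index` would be the average enzyme abundances of a niche.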


4.4.6.3. Niche specific exclusiveness for KEGG pathway enzymes

A wide range of studies has revealed that the major anatomical body niches of humans are unique in the composition of their microbiota [Human Microbiome Project Consortium et al. 2012], and the current study also showed that all body niches are significantly segregated from each other in the PCA on microbiota composition (Fig. 4.1). It therefore appears that the microbial species that colonize the human body have preferences for particular body niches, which might depend on the selection pressure that each niche generates through many internal and external factors such as host genetics, life-style, diet and geography. This observation motivated us to investigate the niche specific microbial functional architectures derived from the niche specific microbiota. An exclusive KEGG pathway profile analysis was performed to identify the KEGG pathway enzymes that are present in more than 90% of the metagenomic samples from a particular body niche or sub-site but completely absent from all remaining body niches included in this study (Table 4.22).

Table 4.22: Niche specific exclusively present KEGG pathway enzymes

Body Niche: Tongue Dorsum; Sub- & Supra gingival plaque; GUT; Keratinized Gingiva; Skin

| KEGG ID | KEGG enzyme |
|---|---|
| K05487 | interleukin 1 family, member 9 |
| K11505 | centromere protein M |
| K13585 | holdfast attachment protein HfaA |
| K02373 | FAS-associated death domain protein |
| K04692 | signal transducer and activator of transcription 3 |
| K05277 | leucoanthocyanidin dioxygenase |
| K07992 | TYRO protein tyrosine kinase binding protein |
| K08233 | polyneuridine-aldehyde esterase |
| K12244 | Skp1-protein-hydroxyproline N-acetylglucosaminyltransferase |
| K13066 | caffeic acid 3-O-methyltransferase |
| K13472 | sulfotransferase |
| K00203 | formylmethanofuran dehydrogenase subunit D |
| K02442 | membrane protein GlpM |
| K04056 | type III secretion protein O |
| K04383 | interleukin 1 alpha |
| K04687 | interferon gamma |
| K04940 | opine dehydrogenase |
| K05384 | bilin biosynthesis protein |
| K05518 | phosphoserine phosphatase RsbX |
| K05990 | NA |
| K06285 | transcription attenuation protein (tryptophan RNA-binding attenuator protein) |
| K04848 | voltage-gated sodium channel type IV beta |
| K05272 | NA |
| K10991 | DNA repair protein Swi5/Sae3 |
| K06337 | spore coat-associated protein S |
| K06420 | small acid-soluble spore protein C (minor alpha/beta-type SASP) |
| K06421 | SASP D (minor alpha/beta-type SASP) |
| K06423 | SASP F (minor alpha/beta-type SASP) |
| K07144 | 5-(aminomethyl)-3-furanmethanol phosphate kinase |
| K13604 | bacteriochlorophyllide d C-20 methyltransferase |
| K10936 | accessory colonization factor AcfA |
| K11819 | S-alkyl-thiohydroximate lyase SUR1 |
| K14020 | Wolfamin |
| K10785 | T-cell receptor beta chain V region |
| K02359 | egghead protein (zeste-white 4 protein) |
| K02579 | heparan sulfate N-deacetylase/N-sulfotransferase NDST4 |
| K04535 | guanine nucleotide-binding protein G(z) subunit alpha |
| K04553 | ubiquitin-conjugating enzyme E2 L6 |
| K04635 | guanine nucleotide-binding protein subunit alpha-11 |
| K06108 | Ras-related protein Rab-3B |
| K07880 | Ras-related protein Rab-4B |
| K07882 | Ras-related protein Rab-3A |
| K07885 | Ras-related protein Rab-27A |
| K07895 | Ras-related protein Rab-6C |
| K07908 | Ras-related protein Rab-15 |
| K07960 | ADP-ribosylation factor-like protein 12 |
| K08637 | carboxypeptidase A4 |
| K08769 | solute carrier family 25 (mitochondrial uncoupling protein), member 7 |
| K10435 | microtubule-associated protein 1 light chain |
| K10721 | ecdysteroid 22-hydroxylase |
| K12759 | NA |
| K13299 | glutathione S-transferase kappa 1 |

KEGG Pathways: Cell cycle - Caulobacter; Energy metabolism; Membrane transport; Cell growth and death; Development; Endocrine and metabolic diseases; Immune diseases; Immune system; Infectious diseases; Neurodegenerative diseases; Signal transduction; Signaling molecules & interaction; Development; Endocrine & metabolic diseases; Folding, sorting and degradation; Immune diseases; Immune system; Infectious diseases; Signal transduction; Signaling molecules and interaction; Transport and catabolism; Energy metabolism; Adrenergic signaling in cardiomyocytes; Metabolism of cofactors and vitamins; Infectious diseases; Amino acid metabolism; Biosynthesis of secondary metabolites; Folding, sorting & degradation; Cancers; Cardiovascular diseases; Endocrine and metabolic diseases; Immune diseases; Immune system; Infectious diseases; Signal transduction; Signaling molecules & interaction

Table 4.22 shows that only five body sub-sites (gut, skin, tongue dorsum, keratinized gingiva and gingival plaque) harbour such enzymes: together they contain 52 KEGG enzymes that are exclusively present in the respective body niche but absent from all other body niches or sub-sites (where they might be present in less than 10% of the samples).
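The exclusiveness criterion described above can be sketched as a prevalence filter. The sketch assumes that each sample is available as a set of detected KEGG IDs and reads "absent" as a prevalence below 10% in every other niche; the function name, input layout and threshold parameters are illustrative assumptions, not the pipeline actually used.

```python
def niche_exclusive_kos(presence, present_cutoff=0.9, absent_cutoff=0.1):
    """Return, per niche, the KEGG IDs that pass the exclusiveness filter.

    presence: {niche: [set_of_kegg_ids_per_sample, ...]}  (hypothetical layout)
    A KEGG ID is kept for a niche when its prevalence there is above
    present_cutoff and its prevalence in every other niche is below absent_cutoff.
    """
    # Fraction of samples in each niche that carry each KEGG ID
    prevalence = {}
    for niche, samples in presence.items():
        counts = {}
        for kos in samples:
            for ko in kos:
                counts[ko] = counts.get(ko, 0) + 1
        prevalence[niche] = {ko: c / len(samples) for ko, c in counts.items()}

    exclusive = {}
    for niche, prev in prevalence.items():
        others = [n for n in prevalence if n != niche]
        exclusive[niche] = sorted(
            ko
            for ko, frac in prev.items()
            if frac > present_cutoff
            and all(prevalence[o].get(ko, 0.0) < absent_cutoff for o in others)
        )
    return exclusive
```

Setting `absent_cutoff` very close to zero would correspond to the stricter reading in which the enzyme is completely absent from all other niches.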


4.4.7. Microbial sources of KEGG pathway enzymes responsible for niche specificity

We were also interested in identifying the microbial sources of the KEGG pathway enzymes that are strongly associated with a specific body niche (Table 4.7). KEGG IDs were therefore assigned to all coding genes of all reference genomes available at HMP-DACC for each major body niche, and a matrix was constructed containing the copy number of each KEGG enzyme in the respective microbial genomes. From these copy numbers, the weighted abundance of each KEGG pathway enzyme contributed by a microbial genome to a niche specific metagenome was calculated, in order to explore the microbial sources of the most abundant and strategic KEGG pathway enzymes that are probably responsible for functional niche specificity. Bubble plots (Fig. 4.15-4.19) show the contributions of more and less abundant microbial species to the niche associated representative KEGG pathway enzymes identified by factor analysis of the PCA based on relative abundances (Fig. 4.8 & 4.9 and Table 4.7). Figures 4.15-4.19 show that the most dominant or strategic KEGG pathway enzymes are not always contributed by the most dominant microbial species of a niche. Less abundant species should therefore not be excluded from microbiome composition analyses, because many functions are contributed by them rather than by the most dominant microbial communities.
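The weighting step can be illustrated with a small sketch. It assumes that the weighted abundance of an enzyme from a genome is its copy number multiplied by the average relative abundance of that species in the niche; the exact weighting formula is not given in the text, and the function names and species labels below are hypothetical.

```python
def weighted_contributions(copy_number, rel_abundance):
    """Share of each KEGG enzyme's weighted abundance contributed by each species.

    copy_number:   {species: {kegg_id: copies in the reference genome}}
    rel_abundance: {species: average relative abundance in the niche}
    Assumption: weighted abundance = copies * relative abundance.
    """
    weighted = {}
    for species, kos in copy_number.items():
        weight = rel_abundance.get(species, 0.0)
        for ko, copies in kos.items():
            weighted.setdefault(ko, {})[species] = copies * weight

    shares = {}
    for ko, per_species in weighted.items():
        total = sum(per_species.values())
        if total > 0:  # skip enzymes that no present species encodes
            shares[ko] = {s: v / total for s, v in per_species.items()}
    return shares


def top_contributor(shares, ko):
    """Species with the largest share of the weighted abundance of one enzyme."""
    return max(shares[ko], key=shares[ko].get)


# Toy example with made-up labels: the dominant species lacks KO_2 entirely,
# so the less abundant species is its only source.
copies = {"dominant": {"KO_1": 1, "KO_2": 0}, "minor": {"KO_1": 1, "KO_2": 3}}
abundance = {"dominant": 0.75, "minor": 0.25}
shares = weighted_contributions(copies, abundance)
```

In this toy case `top_contributor(shares, "KO_2")` returns the minor species, which mirrors the pattern seen in the bubble plots.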

Figure 4.15: Contribution of airways microbiota to KEGG pathway enzymes required for niche specialization in airways; size of bubble indicates the abundance of the KEGG pathway enzyme; values in parentheses on the vertical axis are the average relative abundances of the respective bacterial species

Figure 4.16: Contribution of GUT microbiota to KEGG pathway enzymes required for niche specialization in GUT; size of bubble indicates the abundance of the KEGG pathway enzyme; values on the vertical axis (right hand side) are the average relative abundances of the respective bacterial species

Figure 4.17: Contribution of oral cavity microbiota to KEGG pathway enzymes required for niche specialization in the oral cavity; size of bubble indicates the abundance of the KEGG pathway enzyme; values in parentheses on the vertical axis are the average relative abundances of the respective bacterial species

Figure 4.18: Contribution of skin microbiota to KEGG pathway enzymes required for niche specialization on skin; size of bubble indicates the abundance of the KEGG pathway enzyme; values in parentheses on the vertical axis are the average relative abundances of the respective bacterial species

Figure 4.19: Contribution of UGT microbiota to KEGG pathway enzymes required for niche specialization in UGT; size of bubble indicates the abundance of the KEGG pathway enzyme; values in parentheses on the vertical axis are the average relative abundances of the respective bacterial species

The niche specific microbial sources of KEGG pathway enzymes were identified using factor analysis of the relative abundance based PCA plots (Fig. 4.8 & 4.9) and are shown in five bubble plots (Fig. 4.15-4.19). The bubble plots reveal that many KEGG pathway enzymes that are abundant in a specific body niche and crucial for niche specialization can be contributed by comparatively less abundant species rather than by the most dominant species. For instance, in airways and skin, the two KEGG enzymes K00164 and K00374 are contributed by the relatively less abundant Staphylococcus epidermidis (airways 13.6%, skin 13%) and not by the most abundant Propionibacterium acnes (airways 42.8%, skin 75%) or Corynebacterium accolens (20.9%). In GUT, both K01768 and K01960 mainly come from Parabacteroides merdae (4.1%), whereas the most dominant species in gut are Alistipes putredinis (9.2%) and Bacteroides vulgatus (8.5%). Similarly, in the oral cavity, K02051 is contributed by the under-represented species Actinomyces viscosus (1%) rather than by relatively more abundant species such as Streptococcus mitis (16%). In UGT, the most dominant bacterial species, Lactobacillus crispatus (49%), contributes most of the abundant KEGG enzymes except K02006, which comes from the less abundant Lactobacillus iners (16%).

4.4.8. Niche specific microbial and functional enterotypes/variants

To get an overview of the variation in species and molecular functional profiles of the microbiota of specific body niches in humans, we used the taxonomic and KEGG pathway enzyme compositions of each niche derived from HMP-DACC. Body niches with fewer than 20 metagenomic samples were excluded from further analysis to reduce bias. Body niches were also excluded if a considerable number of samples was not available for more than one dominant microbial species (Table 4.23). Multidimensional cluster analysis (ANOSIM) and principal component/coordinate analysis (PCA) were performed on the relative abundances of microbial species and of KEGG pathway enzymes for each body niche to reveal distinct clusters of individuals or metagenomic samples (known as “Variants” or, for gut, “Enterotypes”) (Fig. 4.20A & 4.20B).
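For readers unfamiliar with ANOSIM, the R statistic and its permutation P value can be sketched as follows. This is a generic illustration of the standard rank-based statistic, not the software used in this study (which is not named here); the function names, the number of permutations and the input layout (a symmetric dissimilarity matrix plus one group label per sample) are assumptions.

```python
import random
from itertools import combinations


def _average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def anosim_r(dist, labels):
    """R = (mean between-group rank - mean within-group rank) / (N(N-1)/4).

    Requires at least one within-group pair and one between-group pair.
    """
    pairs = list(combinations(range(len(labels)), 2))
    ranks = _average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2.0)


def anosim(dist, labels, permutations=999, seed=0):
    """Observed R and a permutation P value obtained by shuffling group labels."""
    rng = random.Random(seed)
    observed = anosim_r(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

R close to 1 means that samples within a variant are more similar to each other than to samples of other variants, while R close to 0 indicates no separation.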


Table 4.23: Dominant or driving species in niche specific metagenomic samples

| Body Niche (total sample) | Dominant Species | Number of Metagenomic samples | Microbial and Functional Variants |
|---|---|---|---|
| Airways (68) | Corynebacterium accolens | 15 | Corynebacterium accolens |
| Airways (68) | Propionibacterium acnes | 43 | Propionibacterium acnes |
| Buccal Mucosa (77) | Streptococcus mitis | 67 | Streptococcus mitis |
| Buccal Mucosa (77) | Haemophilus parainfluenzae | 7 | Haemophilus parainfluenzae |
| Supra gingival Plaque (83) | Rothia dentocariosa | 18 | Rothia dentocariosa |
| Supra gingival Plaque (83) | Corynebacterium matruchotii | 42 | Corynebacterium matruchotii |
| Supra gingival Plaque (83) | Haemophilus parainfluenzae | 7 | Haemophilus parainfluenzae |
| Supra gingival Plaque (83) | Neisseria mucosa | 7 | Neisseria genus |
| Tongue Dorsum (86) | Haemophilus parainfluenzae | 41 | Haemophilus parainfluenzae |
| Tongue Dorsum (86) | Streptococcus parasanguinis | 16 | Streptococcus parasanguinis |
| Tongue Dorsum (86) | Prevotella melaninogenica | 13 | Prevotella melaninogenica |
| Tongue Dorsum (86) | Neisseria flavescens | 10 | Neisseria flavescens |
| GUT (91) | Alistipes putredinis | 11 | Alistipes |
| GUT (91) | Prevotella copri | 9 | Prevotella copri |
| GUT (91) | Bacteroides eggerthii | 1 | Bacteroides |
| GUT (91) | Bacteroides unclassified | 31 | Bacteroides |
| GUT (91) | Ruminococcus bromii | 1 | Ruminococcus |
| GUT (91) | Bacteroides caccae | 3 | Bacteroides |
| GUT (91) | Bacteroides vulgatus | 8 | Bacteroides |
| GUT (91) | Eubacterium siraeum | 2 | Eubacterium |
| GUT (91) | Eubacterium rectale | 5 | Eubacterium |
| GUT (91) | Bacteroides coprocola | 1 | Bacteroides |
| GUT (91) | Bacteroides stercoris | 2 | Bacteroides |
| GUT (91) | Ruminococcus torques | 1 | Ruminococcus |
| GUT (91) | Bacteroides plebeius | 2 | Bacteroides |
| GUT (91) | Bacteroides ovatus | 3 | Bacteroides |
| GUT (91) | Bacteroides finegoldii | 2 | Bacteroides |
| GUT (91) | Bacteroides cellulosilyticus | 1 | Bacteroides |
| Posterior Fornix (37) | Lactobacillus jensenii | 5 | Lactobacillus jensenii |
| Posterior Fornix (37) | Lactobacillus crispatus | 18 | Lactobacillus crispatus |
| Posterior Fornix (37) | Lactobacillus iners | 7 | Lactobacillus iners |
| Posterior Fornix (37) | Lactobacillus gasseri | 5 | Lactobacillus gasseri |

Note: Bold variants are based only on microbiota composition and are not detected on the basis of functional composition


Figure 4.20A: Variants based on relative abundances of microbial communities (Left panel) and based on relative abundances of KEGG pathway enzymes (Right panel): (A) Airways (B) Buccal Mucosa (C) GUT


Figure 4.20B: Variants based on relative abundances of microbial communities (Left panel) and based on relative abundances of KEGG pathway enzymes (Right panel): (D) Posterior fornix; (E) Supra gingival plaque; (F) Tongue Dorsum


Though knowledge of the microbial and functional composition of the human microbiome is increasing rapidly, most studies have focused on the gut microbiome. It has been reported that there are three major enterotypes in the gut microbiota, dominated by Bacteroides, Prevotella and Ruminococcus [Arumugam et al. 2011]. However, not much information is available about the variants, if any, of the microbiome of other body niches such as skin, oral cavity and UGT. In the present study, we carried out an analysis to reveal the variants or enterotypes, if any, in the microbiome composition and in the functional profiles of specific body niches, including the gut. The fundamental question is whether the two types of variants (microbial and functional) are associated with each other and show similar patterns. Most of the body niches or sub-sites included in this analysis, except supra-gingival plaque, show a similar pattern for both types of variants, but the segregation of the microbial community based variants is more significant than that of the functional variants (Fig. 4.20A & 4.20B). As shown in Figure 4.20A for the airways microbiome, healthy individuals cluster in two distinct groups on the basis of microbiota composition (AL) as well as functional categories (AR). Group names are assigned according to the major dominant species or genera in the respective samples (Table 4.23). In airways, one group comprises the C. accolens dominant individuals and the other the P. acnes dominant individuals, but the variants based on microbial composition are segregated more significantly (P=0.0001, R=0.739) than those based on functional composition (P=0.0878, R=0.119). Similarly, variants are detected in other body niches such as buccal mucosa, gut, posterior fornix, supra-gingival plaque and tongue dorsum, as shown in the respective figures (Fig. 4.20A & 4.20B), but the significance (P value) of the distinctness between variants or enterotypes varies from one body niche to another (Table 4.24).

Table 4.24: P & R value of significance (ANOSIM statistical test) for segregation of variants in each niche based on microbial and functional composition

Airways

C. accolens

Buccal Mucosa

H. parainfluenzae

Alistipes

P. acnes 0.0001 (0.739) 0.0878 (0.119) S. mitis 0.0001 (0.70) 0.0001 (0.70) Bacteroides 1 (0.0233) 0.627 (0.12)

Bacteroides GUT

Eubacterium 0.002 (0.521) 0.005 (0.55) 0.812 (0.19) 0.002 (0.62)

Eubacterium P. copri L. gasseri 0.0012 (1) 0.0006 (0.97)

P. copri 0.002 (0.91) 0.004 (0.55) 0.001 (0.67) 0.001 (0.81) 0.001 (0.85) 0.001 (0.78)

Rumino 0.11 (0.85) 0.60 (0.48) 1 (0.22) 0.10 (0.60) 1 (0.16) 1 (0.11) 0.19 (0.93) 0.20 (0.71)

L. iners 0.0006 (1) 0.0006 (0.99) 0.009 (1) 0.0096 (0.91)

L. jensenii 0.0006 (1) L. crispatus 0.0006 (0.99) Posterior 0.0474 (1) L. gasseri fornix 0.0534 (0.82) 0.0078 (1) L. iners 0.006 (0.82) H. parainfluenzae Neisseria genus R. dentocariosa 0.0006 (0.90) 0.0006 (0.82) 0.0006 (0.84) C. matruchotii 0.0006 (0.61) 0.0006 (0.67) 0.0006 (0.50) Supra 0.0012 (0.72) 0.0006 (0.62) gingival H. parainfluenzae 0.0042 (0.45) 0.0036 (0.42) plaque 0.0006 (0.55) Neisseria genus 0.0006 (0.45) N. flavescens P. melaninogenica S. parasanguinis 0.0006 (0.46) 0.0006 (0.52) 0.0006 (0.64) H. parainfluenzae 0.0018 (0.34) 0.0006 (0.38) 0.0006 (0.51) Tongue 0.0006 (0.74) 0.0006 (0.79) N. flavescens dorsum 0.0012 (0.51) 0.0006 (0.68) 0.0006 (0.50) P. melaninogenica 0.0006 (0.32)

Note: Rumino: Ruminococcus genera; highlighted values are for microbiota composition (left panel PCA plots, Figures 4.20A & 4.20B) and non-highlighted values are for functional composition (right panel PCA plots, Figures 4.20A & 4.20B); values in brackets are R values; P values of significance are outside the brackets

4.4.9. Functional cooperation between niche specific microbial communities

Microbial characterization of the human metagenome samples of a specific body niche showed very high variation in the abundances of different microbial species: some are highly abundant, while others are present at low frequencies or even in trace amounts. Most studies on microbiome characterization focus only on the abundant or dominant microbial species or genera. In the present analysis, we were interested in exploring the functional role of the microbial communities that are present in very low abundance at a specific body niche of healthy individuals. We first mapped the complete KEGG modules in the most abundant species (seed species) and then calculated the number of KEGG modules completed by species of gradually decreasing abundance colonizing the same body niche. All enzymes of a KEGG module must be present for the corresponding molecular function to be performed. So, if a species present in relatively low abundance completes a KEGG module and thereby helps to perform a specific molecular function, we may refer to it as functional cooperation in favor of human health or healthy microbial homeostasis. All KEGG modules completed by bacterial species of low abundance are shown in Table 4.25.
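The counting procedure described above can be sketched as follows. The sketch assumes that a module counts as complete in a species when all of its enzymes occur in that species' genome, and that a module is credited to the most abundant species completing it. Real KEGG module definitions allow alternative enzymes, which this simplified set-based check ignores; all names are hypothetical.

```python
def newly_completed_modules(module_defs, species_kos, abundance):
    """Walk species from most to least abundant and record modules each one adds.

    module_defs: {module_id: set of required KEGG enzyme IDs}  (simplified)
    species_kos: {species: set of KEGG enzyme IDs in its genome}
    abundance:   {species: average relative abundance in the niche}
    Returns a list of (species, newly completed modules, cumulative count).
    """
    order = sorted(species_kos, key=lambda s: abundance[s], reverse=True)
    already_completed = set()
    steps = []
    for species in order:
        complete = {
            module
            for module, required in module_defs.items()
            if required <= species_kos[species]  # every enzyme is present
        }
        new = complete - already_completed
        already_completed |= new
        steps.append((species, sorted(new), len(already_completed)))
    return steps
```

The cumulative count in each step corresponds to the gradually increasing number of completed modules as less abundant species are added to the seed species.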

Table 4.25: All KEGG functional modules completed by low abundant bacterial species in specific niche (KEGG module details given in appendix 4.1)

Body Niches

Bacterial Species

Airways

C. accolens

S. epidermidis

S. aureus

B. vulgatus

Gastrointestinal Tract

P. copri E. rectale B. stercoris B. ovatus P. merdae R. bromii D. invisus F. prausnitzii R. torques S. wadsworthensis

Oral Cavity

B. plebeius

H. parainfluenzae

Modules completed by respective bacterial species M00342, M00242, M00127, M00126, M00119, M00053, M00045, M00023, M00003, M00023, M00045, M00053, M00119, M00126, M00127, M00242, M00342, M00577 M00530, M00478, M00459, M00440, M00434, M00359, M00348, M00345, M00318, M00234, M00135, M00119, M00087, M00083, M00082, M00026, M00577, M00082, M00083, M00087, M00119, M00135, M00234, M00318, M00345M00348, M00359, M00434, M00440, M00459, M00478, M00530, M00582, M00324, M00255, M00134, M00087, M00582, M00134, M00255, M00324, M00491 M00632, M00579, M00577, M00572, M00570, M00527, M00526, M00361, M00133, M00127, M00123, M00122, M00119, M00053, M00052, M00051, M00028, M00024, M00023, M00022, M00021, M00019, M00017, M00016, M00015, M00009, M00004, M00491, M00009, M00015, M00016, M00017, M00019, M00021, M00022, M00023, M00024, M00028, M00051, M00052, M00053, M00119, M00122, M00123, M00127, M00133, M00361, M00526, M00527, M00570, M00572, M00577, M00579, M00632, M00793 M00061, M00077, M00127 M00627, M00572, M00554, M00362, M00361, M00335, M00133, M00028, M00022, M00127, M00028, M00133, M00335, M00361, M00362, M00554, M00572, M00627, M00632 M00064, M00632, M00478 M00009, M00478, M00011 M00459, M00077, M00053, M00011, M00077, M00459, M00499 M00491, M00454, M00362, M00361, M00242, M00134, M00121, M00082, M00017, M00016, M00499, M00017, M00082, M00121, M00134, M00242, M00361, M00362, M00454, M00491, M00549 M00238, M00234, M00183, M00133, M00064, M00018, M00793, M00064, M00133, M00183, M00234, M00238, M00577 M00458, M00275, M00577, M00458, M00603 M00017 M00489, M00475, M00472, M00471, M00436, M00348, M00333, M00228, M00198, M00191, M00190, M00189, M00121, M00025, M00024, M00017, M00025, M00121, M00189, M00190, M00191, M00198, M00228, M00333, M00348, M00436, M00471, M00472, M00475, M00489, M00500 M00077 M00727, M00577, M00572, M00527, M00525, M00394, M00359, M00335, M00307, M00239, M00190, M00168, M00150, M00127, M00123, M00121, M00115, M00096, M00093, M00053, M00020, M00017, M00016, 
M00011, M00007, M00004, M00003, M00549, M00004, M00007, M00011, M00016, M00017, M00020, M00053, M00093, M00096, M00115, M00121, M00123, M00127, M00150, M00168, M00190, M00239, M00307, M00335, M00359, M00394, M00525, M00527, M00572, M00577, M00727, M00728

Body Niches

Bacterial Species

C. matruchotii R. dentocariosa

Oral Cavity

P. melaninogenica S. parasanguinis N. flavescens S. sanguinis V. atypica N. mucosa V. parvula C. gingivalis A. viscosus

S. epidermidis

Skin

F. magna S. capitis S. mitis

Vagina

C. accolens C. pseudogenitalium C. tuberculostearicum L. iners L. jensenii

Modules completed by respective bacterial species M00479, M00461, M00365, M00359, M00348, M00209, M00176, M00119, M00116, M00035, M00009, M00482, M00035, M00116, M00119, M00176, M00209, M00348, M00359, M00365, M00461, M00479, M00581 M00525, M00436, M00149, M00119, M00119, M00581, M00119, M00149, M00436, M00525, M00530 M00169, M00133, M00127, M00122, M00119, M00119, M00115, M00051, M00530, M00115, M00119, M00119, M00122, M00127, M00133, M00169, M00526 M00246, M00245, M00133, M00133, M00526, M00133, M00245, M00246, M00275 M00349, M00300, M00144, M00133, M00133, M00025, M00024, M00016, M00008, M00275, M00016, M00024, M00025, M00133, M00133, M00144, M00300, M00349, M00471 M00045, M00471, M00627 M00577, M00527, M00728, M00577, M00627 M00249, M00627, M00333 M00126 M00035, M00126, M00338 M00208, M00204, M00338, M00208, M00482 M00582, M00577, M00530, M00525, M00478, M00359, M00348, M00345, M00318, M00242, M00234, M00135, M00126, M00119, M00116, M00083, M00082, M00026, M00023, M00003, M00627, M00023, M00026, M00082, M00083, M00116, M00119, M00126, M00135, M00234, M00242, M00318, M00345, M00348, M00359, M00478, M00525, M00530, M00577, M00582, M00632 M00550, M00324, M00283, M00255, M00127, M00051, M00632, M00127, M00255, M00283, M00324, M00550, M00793 M00134, M00793, M00223 M00632, M00603, M00554, M00550, M00525, M00491, M00362, M00275, M00168, M00017, M00223, M00168, M00275, M00362, M00491, M00525, M00550, M00554, M00603, M00632, M00793 M00053, M00045, M00793, M00053, M00342 M00198 M00484, M00473, M00471, M00450, M00210, M00149, M00116, M00198, M00149, M00210, M00450, M00471, M00473, M00484, M00526 M00120, M00526, M00242 M00223, M00052, M00242, M00223, M00287

Note: All niche specific microbial species are in decreasing order of their relative abundance from up to down of the table; average relative abundance of respective bacterial species is shown in figure 4.21.


Figure 4.21: Gradually increasing number of completed KEGG modules in each body niche by cooperation of gradually low abundant bacterial species of niche specific microbiota. Bar representing the number of KEGG modules completed by respective bacterial species derived from respective body niche; values in brackets on horizontal axis are percentage relative abundance of respective bacterial species

Figure 4.21 shows a gradual increase in the number of complete KEGG pathway modules as bacterial species of gradually decreasing abundance are added sequentially to the seed (most dominant) bacterial species. Interestingly, in gut, Sutterella wadsworthensis is present at a very low average relative abundance (0.95%) in the 91 gut metagenomic samples, yet it contributes 31 KEGG modules that are not completed by highly abundant species such as Alistipes putredinis (9.76%) and Bacteroides vulgatus (9.64%). Most of these 31 modules belong to the “two-component regulatory system”. Similarly, in the oral cavity, Neisseria flavescens is present at a relatively low abundance (2.48%) but completes 19 KEGG modules. These modules belong to different molecular pathways such as the Microcin C transport system, Capsaicin biosynthesis and Nucleotide sugar biosynthesis.


4.5. Discussion

Humans have co-evolved with the trillions of microbes (the human microbiome), including bacteria, viruses and fungi, that reside at different body sites or niches such as the gut, oral cavity, skin, UGT and airways, and that create complex, body-niche specific, adaptive ecosystems finely attuned to the persistently fluctuating environment of the body habitat [Lloyd-Price et al. 2016]. The Human Genome Project revealed that humans are almost identical in their genomic makeup, and the minor differences in our genomes cannot always account for the tremendous phenotypic diversity across humans from different populations [Hood et al. 2003; Green et al. 2015; Engel et al. 1993]. It is quite rational to believe that a substantial part of the inter-individual phenotypic differences may arise from variations in the human microbiome across healthy individuals. Understanding this variability in the “healthy microbiome” has thus been a foremost task in human microbiome research. Many previous studies have shown the association of dysbiosis in the microbiota with several diseases, including rheumatoid arthritis, inflammatory bowel disease, periodontitis, multiple sclerosis, types 1 & 2 diabetes, allergies, asthma, ulcerative colitis, autism, and cancer [Turnbaugh et al. 2007; Li et al. 2014; Qin et al. 2010; Yoshimoto et al. 2013; Cani et al. 2012; Musso et al. 2010; Schwabe et al. 2013; Thomas et al. 2015; Rajagopala et al. 2017; Toh et al. 2015; Bhargava et al. 2014; Mielcarz et al. 2015; Jangi et al. 2016; Fujimura et al. 2015; Panzer et al. 2015; Riiser et al. 2015; Devaraj et al. 2013]. A major challenge in distinguishing the healthy microbiome from the unhealthy one, however, is that a large degree of inter-individual variation occurs in microbiota composition even in the absence of any disease [Lloyd-Price et al. 2016]. This large variation in microbiota composition among healthy individuals complicates the identification of simple microbial imbalances that either cause disease or reflect a diseased condition. A proper understanding of the structure and properties of the healthy microbiome, especially the characterization of the diverse microbial communities encountered in healthy individuals, is therefore an obligatory first step towards understanding the role of the microbiome in human health and disease. The existence of variations in microbiome composition across healthy individuals suggests that different microbial communities play similar functional roles in different healthy individuals, so characterizing the microbiome at the molecular functional level would be more informative than characterizing the compositional divergence of the microbiota [Lloyd-Price et al. 2016].

One of the major objectives of the present study was to characterize the niche specific functional configuration and microbiota composition, using the relative abundances of KEGG pathway enzymes and bacterial species in 486 metagenomic samples derived from distinct body niches or body-sites of 102 healthy individuals. An earlier study on 242 healthy individuals reported that the diversity and abundance of each habitat's dominant microbes vary widely among healthy individuals with strong niche specialization, but that the metagenomic carriage of metabolic pathways is stable among individuals, irrespective of variations in the structure of microbial communities [Human Microbiome Project Consortium et al. 2012]. The present study, however, reveals functional divergence as well as microbial divergence, not only between body niches but also between healthy individuals. Metagenomic reconstruction of the 486 samples in this study also showed that buccal mucosa, gut and gingival plaque are the most enriched body sites in terms of the number of different bacterial species present (Fig. 4.2). The maximum variation within an individual (alpha diversity) was observed in UGT, but gut is more diverse (beta diversity) when compared across healthy individuals from the same niche. However, the beta diversity index is not an appropriate parameter to delineate the divergence of microbiota among individuals, as it does not consider the abundance information of the microbial species. We therefore used a different parameter, the “Summation of Standard Deviation (S.S.D.)” over each microbial species present in the respective samples, which considers the relative abundance instead of the presence/absence of the respective microbial species, to explain the inter-individual variation in the microbiota.
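One possible reading of the S.S.D. parameter is sketched below: for each species, take the standard deviation of its relative abundance across all samples of a niche, then sum these values over species. The exact definition (for example population versus sample standard deviation, or the treatment of species missing from a sample) is not spelled out here, so the choices in the code are assumptions and the function name is hypothetical.

```python
import statistics


def summed_standard_deviation(samples):
    """Sum over species of the standard deviation of relative abundance.

    samples: list of {species: relative abundance} dictionaries for one niche.
    Assumptions: a species missing from a sample counts as 0, and the
    population standard deviation is used.
    """
    species = set().union(*samples)  # every species seen in any sample
    total = 0.0
    for name in species:
        values = [sample.get(name, 0.0) for sample in samples]
        total += statistics.pstdev(values)
    return total
```

Under this reading, a niche whose samples all share the same abundance profile has an S.S.D. of zero, and the value grows as the species abundances differ more strongly between individuals.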
SSD showed that the gut and posterior fornix are characterized by the highest inter-individual diversity in microbiota composition (245.80 and 198.45, respectively; Table 4.3). In the PCA based on microbial composition (Fig. 4.1), the metagenomic samples from the five major body niches do not overlap, except for those from airways and skin, which indicates the microbial uniqueness of the body niches. We also identified a core set of microbiota universally present in a specific body niche in all, or at least 90%, of the individuals included in the present study. However, only a few taxa were found to be universally present in most individuals, because of the large variation in microbiota composition (Table 4.4). Despite

substantial niche-specific renovations in the genetic makeup and evolutionary relationships of the microbiota components, their taxonomic identities have not been perturbed, as suggested by the clear separation of niche-specific dominant microbial species in the core-genome-based tree (Fig. 4.6) and the 16s rRNA phylogenetic tree (Fig. 4.5). Had niche specialization dominated over taxonomic trends, the genomes derived from a specific body niche would have co-segregated in both trees, but this did not happen. These observations indicate that the human microbiota at different body habitats tend to develop their molecular adaptive strategies in such a way that the in situ niche-specific modulations in their gene repertoires do not have any global impact on their taxonomic characteristics. As discussed above, a large degree of variation is encountered in the microbiota within and among healthy individuals, so a single definition of a “Healthy Microbiota” is no longer worthwhile. A more practical approach to investigating the niche-specific features of the microbiome is to analyze the functional profile, or configuration, of the microbiota of different body niches. As seen in the PCA based on the presence/absence of KEGG pathway enzymes, the body niches are also unique in their metabolic arrangement (Fig. 4.7). In Fig. 4.7, samples from a specific body niche cluster together but remain scattered away from one another, which points towards large variations among individuals. Three body niches, gut, gingival plaque and tongue dorsum, cluster together, indicating similarity in the functional profiles derived from these microbial communities.
However, the smaller variation in the functional profiles of the microbiome of different body niches, as compared to the respective microbial composition, points to a healthy “functional core”: a set of metabolic and other molecular functions present in each body niche and in all individuals. In our dataset of 486 metagenomic samples, 796 KEGG enzymes were observed in all samples irrespective of body niche and individual. Most (597) of these 796 KEGG enzymes belonged to pathways involved in Metabolism [for instance, carbohydrate metabolism (178), membrane transport (87) etc.]; 174 in Genetic Information Processing; and 123 in Environmental Information Processing (Fig. 4.11 & Table 4.16). The functional core must, of course, include the functions obligatory for human-associated microbial life. It is tempting to speculate that these core functions play important roles in keeping the host healthy. Such a core set presumably represents functions that are essential for normal human physiology but cannot be encoded by the human genome itself.
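The notion of a core set, whether of enzymes observed in all samples or of taxa present in at least 90% of individuals, can be sketched as a prevalence filter over per-sample presence sets. This is only an illustration under stated assumptions: the function name, the `min_prevalence` parameter and the toy enzyme labels are hypothetical and do not reproduce the pipeline used in this study.

```python
def core_features(samples, min_prevalence=1.0):
    """Return the features detected in at least `min_prevalence` of samples.

    samples: list of sets, each holding the features (e.g. KEGG enzyme IDs
    or taxa) detected in one metagenomic sample.
    min_prevalence: fraction of samples required (1.0 means every sample).
    """
    counts = {}
    for detected in samples:
        for feature in detected:
            counts[feature] = counts.get(feature, 0) + 1
    needed = min_prevalence * len(samples)
    return {feature for feature, n in counts.items() if n >= needed}


# Hypothetical presence sets for four samples.
samples = [
    {"enzyme_1", "enzyme_2", "enzyme_3"},
    {"enzyme_1", "enzyme_2"},
    {"enzyme_1", "enzyme_3"},
    {"enzyme_1", "enzyme_2", "enzyme_4"},
]

strict_core = core_features(samples)                       # in every sample
relaxed_core = core_features(samples, min_prevalence=0.75)  # in >= 3 of 4
print(sorted(strict_core), sorted(relaxed_core))
```

Lowering the prevalence threshold enlarges the core, which is why the threshold has to be reported alongside any core set.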

In the PCA based on relative abundances of KEGG pathway enzymes (Fig. 4.8), the metagenomic samples of any particular body habitat cluster together and are significantly (P=0.001, R>0.78) segregated from samples of other body niches, the samples of airways and skin being the exceptions (P=0.032, R=0.198). This observation indicates that the functional divergence of the microbiota between body niches is attributable more to the differential abundances of the enzymes than to their presence/absence. The same set of KEGG pathway enzymes might be present (or absent) in any two or more body niches, but their abundances might vary substantially across these body niches. To test the hypothesis that each niche has a unique functional environment derived from its microbiota, mostly in terms of relative abundances of functional features, we performed a principal coordinate analysis using relative abundances of the 796 core KEGG pathway enzymes for each major body niche, and observed that all body niches were again significantly (P=0.0001, R=0.877) segregated from each other (Fig. 4.12A & 4.12B). Interestingly, UGT and GUT metagenomic samples clustered at opposite ends of the PC 1 axis in both types of PCA plots (presence/absence & relative abundances). This indicates that the UGT and GUT microbiomes are of opposite character with respect to presence/absence and relative abundances of different KEGG pathway enzymes; for instance, K00262 (glutamate dehydrogenase) shows high abundance in GUT but relatively low abundance in UGT, while for the enzyme K03217 the reverse is true (Fig. 4.10).
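The R values quoted above are ANOSIM statistics. A minimal sketch of how R is commonly defined is given below: the mean rank of between-group distances minus the mean rank of within-group distances, divided by N(N-1)/4 for N samples. The sketch omits the permutation test that yields the P value, and the function names and the toy distance matrix are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def anosim_r(distance, groups):
    """ANOSIM R for a square distance matrix and one group label per sample."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([distance[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (n * (n - 1) / 4)


# Hypothetical distances between four samples, two from each of two niches.
distance = [
    [0.0, 0.1, 0.7, 0.8],
    [0.1, 0.0, 0.9, 1.0],
    [0.7, 0.9, 0.0, 0.2],
    [0.8, 1.0, 0.2, 0.0],
]
r_separated = anosim_r(distance, ["gut", "gut", "ugt", "ugt"])
r_mixed = anosim_r(distance, ["gut", "ugt", "gut", "ugt"])
print(r_separated, r_mixed)
```

R approaches 1 when every between-group distance exceeds every within-group distance, and falls towards zero or below when the grouping does not explain the distances.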
Factor analysis of the principal coordinates derived from relative abundances of KEGG pathway enzymes revealed 68, 62, 24 & 49 KEGG pathway enzymes that are highly associated (correlated) with the GUT, UGT, skin/airways and oral cavity body niches, respectively (Table 4.7), and are responsible for the separation of the body niches from each other. The significantly (>Mean + S.D.) most abundant of these niche-associated KEGG pathway enzymes are involved in various metabolic pathways, such as energy metabolism & carbohydrate metabolism in gut; membrane transport & carbohydrate metabolism in UGT; membrane transport in oral cavity; and membrane transport, metabolism of cofactors and vitamins & folding, sorting and degradation in the skin/airways microbiome (Table 4.12).


One intriguing observation of the present study is the potential contribution of relatively low-abundance species to functional metabolism. Abundant or niche-specific molecular functions of the microbiome are not necessarily provided by the abundant species; species present at low frequencies may also contribute appreciably to the functional profile of the microbiota. In the present analysis we investigated the microbial sources of the KEGG pathway enzymes highly associated with each niche, using the reference bacterial genomes provided by the HMP for each major body niche. As shown in the bubble plots for the five major body niches (Fig. 4.15 - 4.19), a higher copy number (abundance) of many niche-associated KEGG pathway enzymes may be provided by low-abundance species rather than by highly abundant species. For example, in airways and skin, the two KEGG enzymes K00164 & K00374 are provided by the less abundant Staphylococcus epidermidis (airways-13.6% & skin-13%) rather than by the most abundant species, Propionibacterium acnes (airways-42.8% & skin-75%) and Corynebacterium accolens (20.9%); in GUT, both KEGG enzymes K01768 and K01960 mainly come from Parabacteroides merdae (4.1%), while the most dominant species in gut are Alistipes putredinis (9.2%) and Bacteroides vulgatus (8.5%); similarly, in the oral cavity K02051 is contributed by the much less abundant Actinomyces viscosus (1%) rather than by more abundant species like Streptococcus mitis (16%); in UGT the most dominant bacterial species, Lactobacillus crispatus (49%), contributes most of the abundant KEGG enzymes except K02006, which comes from the less abundant Lactobacillus iners (16%). It is worth mentioning at this point that the body niches are physiologically very different from one another, so we should focus on the niche-specific core functions that characterize the microbiome of different body habitats. We identified the niche-specific core of KEGG pathway enzymes for each body site.
The niche-specific core of KEGG enzymes includes three types of enzymes (Table 4.17): enzymes essential for microbial life at a particular body niche in the human body (Table 4.18); enzymes essential for normal healthy human physiology irrespective of body niche, which are not encoded by the human genome (Table 4.16); and enzymes essential for niche-specific healthy human physiology (Table 4.17). The niche core genome (NCG) represents the essential functional features required for a niche-specific microbial community to colonize a specific body niche. Most of these genes or pathways are housekeeping genes implicated in the maintenance of basic cellular functions. The complete KEGG modules or pathways derived from the minimal niche metagenome comprise six modules, i.e. M00086 (Fatty acid metabolism); M00153

(ATP synthesis); M00201 (Saccharide, polyol, and lipid transport system); M00275 (Phosphotransferase system); M00450 (Two-component regulatory system); and M00454 (Two-component regulatory system), which are part of every niche-specific minimal metagenome. 44 KEGG modules are exclusively present (Table 4.20) and 31 exclusively absent (Table 4.21) in a specific body niche or sub-site. This means that niche specificity is achieved not only by the exclusive presence of certain KEGG modules or enzymes but also by the exclusive absence of KEGG modules in a particular body niche. The UGT microbiome shows the minimum alpha diversity for KEGG pathway enzymes as compared to other body niches, which reflects minimal variation in the abundance of KEGG pathway enzymes within a metagenomic sample (Fig. 4.13). In contrast, the oral cavity shows the highest variation among the five major body niches. The inter-personal diversity in terms of both presence/absence (beta diversity index) and relative abundances (PCA on relative abundances) of KEGG pathway enzymes was highest in airways, followed by UGT, oral cavity, skin and GUT (Fig. 4.14). Certain studies have sub-divided healthy individuals into three enterotypes based on variation in the composition of the gut microbiota, each enriched in one of three genera: Bacteroides (enterotype 1), Prevotella (enterotype 2) and Ruminococcus (enterotype 3) [Arumugam et al. 2011]. However, little is known about the variation in functional architectures in other niches. To our knowledge, the current analysis is the first to focus on inter-personal variation, i.e. on different “variants” (parallel to enterotypes), based on microbiota composition as well as functional configuration in other body niches.
Multidimensional cluster analysis (ANOSIM) and principal coordinate analysis (PCA) were performed on the abundances of microbial species and KEGG pathway enzymes for each body niche to reveal distinct clusters (known as “Variants”, or “Enterotypes” for gut) of individuals or metagenomic samples driven by an enriched microbial species or genus. Microbiomes derived from airways, buccal mucosa, posterior fornix, supra-gingival plaque and tongue dorsum also contain different variants, similar to the gut enterotypes revealed by previous studies; for instance, Airways: C. accolens (variant 1) and P. acnes (variant 2); Buccal mucosa: S. mitis (variant 1) and H. parainfluenzae (variant 2); Supra-gingival plaque: Rothia dentocariosa (variant 1), Corynebacterium matruchotii (variant 2), Haemophilus parainfluenzae (variant 3) and Neisseria genus (variant 4); Tongue dorsum: Haemophilus parainfluenzae (variant 1), Streptococcus parasanguinis (variant 2), Prevotella melaninogenica (variant 3) and

Neisseria flavescens (variant 4); Posterior fornix: Lactobacillus jensenii (variant 1), Lactobacillus crispatus (variant 2), Lactobacillus iners (variant 3) and Lactobacillus gasseri (variant 4). In this analysis, we observed two more gut enterotypes in addition to Prevotella, Bacteroides and Ruminococcus: Eubacterium (enterotype 4) and Alistipes (enterotype 5). Variant clustering based on functional composition follows a pattern similar to that observed for microbiota composition in all body niches except supra-gingival plaque (Fig. 4.20A & 4.20B). Two terms, “cooperation” (or “co-occurrence”) and “competition”, are widely used to describe the assembly and structure of microbial communities during colonization of, and adaptation to, a specific ecological or anatomical niche. Competition always occurs among different microbial communities adapting to a specific niche, whereas cooperation may serve two purposes: the adaptation of the microbial communities themselves, and the provision of special functionality to the human host to maintain normal physiology. In the present study, we investigated functional cooperation among the microbiota components of a specific body niche, using the molecular functional architecture of reference genomes derived from the HMP. We found that many KEGG modules that might be important for human physiology are not completed by the dominant microbiota components (bacterial species) of a specific body niche; instead, a low-abundance species completes the module by providing the enzymes that the dominant species lacks. This type of functional cooperation should serve only to maintain functional homeostasis in favor of human health, because every microbe colonizing a specific body niche should itself be capable of encoding all functions essential to it.
In each body niche, many KEGG modules are completed by low-abundance species (Fig. 4.21 & Table 4.25). For instance, in gut, Sutterella wadsworthensis is present at a very low relative abundance (0.95%) in 91 gut metagenomic samples but contributes 31 KEGG modules (Fig. 4.21) that are not contributed by the most abundant species, such as Alistipes putredinis (9.76%) and Bacteroides vulgatus (9.64%). Most of these 31 modules belong to the “two-component regulatory system”. Similarly, in the oral cavity, Neisseria flavescens is present at a relatively low abundance (2.48%) but is involved in completing 19 KEGG modules (Fig. 4.21). These modules belong to different molecular pathways, such as Microcin C transport system, Capsaicin biosynthesis, Nucleotide sugar biosynthesis etc.


In a nutshell, the human microbiome at different body niches may differ substantially in community composition as well as in functional architecture. Inter-niche and inter-personal functional divergences in the microbiome may be attributed not only to the presence/absence of KEGG pathway enzymes, but also to their relative abundances. Species of relatively low abundance, even those present in trace amounts, confer functional stability on the host by completing KEGG pathway modules that the dominant microbes lack.


Chapter 5

Geography, Ethnicity or Subsistence Specific Variations in Human Microbiome Composition & Diversity

5.1. Introduction
5.2. Method and Materials
5.3. Results
5.4. Discussion

Chapter 5: Population specific diversity in microbiome composition

5.1. Introduction

One of the major issues in the development of microbiome-based therapies for microbiome-associated diseases is the considerable variation in microbiota composition between healthy individuals, both within the same population and across different races and ethnicities. Recent advances in culture-independent, high-throughput next-generation sequencing technologies have enhanced our ability to characterize the human microbiome in various states of health and disease [Human Microbiome Project Consortium et al. 2012; Turnbaugh et al. 2007]. Large-scale endeavors such as the Human Microbiome Project (HMP) have been initiated to characterize the healthy human microbiome [Turnbaugh et al. 2007]. Studies are being conducted to explore the plausible disease links of the microbiome, and efforts are being made to understand how the microbiome varies with host lifestyle, genetics, age, nutrition, medication, and environment [Blaser et al. 2008; Blekhman et al. 2015; Castellarin et al. 2011; Chen et al. 2007; Gao et al. 2008; Garrett et al. 2010; Goodrich et al. 2014; Islami et al. 2008; Kostic et al. 2012; Ley et al. 2005; O'Keefe et al. 2015; Peek et al. 2002; Tana et al. 2010; Turnbaugh et al. 2006; Wang et al. 2011]. Over the last decade, since the announcement of the Human Microbiome Project (HMP) in 2007, a large number of human microbiome studies have been conducted on western populations, whose lifestyle differs from that of non-western populations [Morton et al. 2015]. The few microbiome studies on non-western populations have shown a large degree of variation in microbiome composition at different body niches of healthy individuals relative to western populations [Leung et al. 2015; Li et al. 2014; Mason et al. 2013; Nam et al. 2011; Nasidze et al. 2011; Schnorr et al. 2014; Van Treuren et al. 2015; Yap et al. 2011; Yatsunenko et al. 2012].
Many studies have also shown links between microbiome composition and various diseases such as type 2 diabetes, obesity, periodontitis, rheumatoid arthritis and cancer [Blaser et al. 2008; Blekhman et al. 2015; Castellarin et al. 2011; Chen et al. 2007; Gao et al. 2008; Garrett et al. 2010; Goodrich et al. 2014; Islami et al. 2008; Kostic et al. 2012; Ley et al. 2005; O'Keefe et al. 2015; Peek et al. 2002; Tana et al. 2010; Turnbaugh et al. 2006; Wang et al. 2011].


Previous studies of healthy subjects have shown a general trend in the evolution of the human microbiome: a gradual transition in gross compositional structure, along with a continual decrease in the diversity of the microbiome, especially of the gut microflora, as human populations passed through three stages of subsistence, namely foraging, rural farming and industrialized urban western life [Gomez et al. 2016; Morton et al. 2015]. Gut microbiomes of hunter-gatherer populations usually show higher abundances of Prevotella, Proteobacteria (especially Succinivibrio), Spirochaetes (specifically Treponema), Clostridiales, Ruminobacter etc., while those of urban communities are often enriched in Bacteroides, Bifidobacteria and some Firmicutes like Ruminococcus, Blautia etc. [Schnorr et al. 2014]. In recent times, a number of studies have been published summarizing microbiome research from various perspectives, but a comprehensive account of the observations made on geography-, ethnicity- or lifestyle-specific variations in healthy microbiome composition is long overdue. In the current study, we attempt to address this issue. On the basis of published data and publicly available metagenomic and 16s rRNA amplicon samples from healthy individuals, an attempt has been made to characterize the core microbiome, i.e. the set of genera commonly found in a specific body site of healthy individuals, and the core molecular functions derived from the microbiota among populations, irrespective of their geographic locations, ethnic background or mode of subsistence. We also discuss the major observations on cross-population variations in microbiome composition of various biogeographic spaces, considering the five major human body habitats: Oral cavity, Respiratory Tract, Gut, Urogenital Tract (UGT) and Skin.
Geography represents an ensemble of genetic, environmental and cultural factors, and the degree to which the microbiome is shaped by each of these factors remains debated. It is not yet clear which factor plays the dominant role in shaping the microbiome: nature or nurture, host genetics or the host's environment, traditions and lifestyle. Identification of the factors responsible for geography-based alterations in the microbial communities of healthy individuals is also an aim of the present study.


5.2. Method and Materials

5.2.1. Dataset

Taxonomic abundances and functional enrichment data for microbiomes derived from the major human body sites (gut, skin, oral cavity & vagina) were retrieved from previous publications and from the European Nucleotide Archive (ENA) [www.ebi.ac.uk/metagenomics] to investigate the core microbiota and the functional resemblance or variation of body-niche-specific microbiota among different geographical and ethnic populations (Table 5.1). In total, 1204 samples were included in this study to investigate population-based variation in microbiome composition (Table 5.1).

Table 5.1: Details of population/country specific samples used in the study (the number of healthy individuals from different countries used in this study)

Metagenomic reads downloaded from EBI metagenomics database and characterized microbiota composition

Sample isolation body-sites: Gut, Skin, Oral, Vagina
Countries: Brazil, China, Malawi, South Korea, Canada, Thailand, USA, Benin, Netherlands, UK
Number of healthy individuals: 6, 20, 20, 20; 20; 20; 20, 20, 20, 3, 2, 20

Taxonomic composition data from previous published studies
Sample isolation body-sites: Gut, Oral, Vagina
Countries: North America, Belgium, Philippines, Argentina, Bolivia, California, Louisiana, Germany, Poland, Georgia, Turkey, Congo, South Africa, China, Austria, Denmark, France, Japan, Malawi, Peru, Russia, Spain, Sweden, United States, Venezuela
Number of healthy individuals: 13, 19; 187, 41, 121, 55, 104, 5, 31, 83, 54, 36, 126, 18; 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10


5.2.2. Identification of the niche specific core microbiota among various populations

We first constructed a binary presence/absence matrix (1/0 matrix) by comparing the taxonomic presence/absence of microbial communities in metagenomic samples derived from different body sites of healthy individuals from the different populations included in the present study. From this binary matrix, we identified the microbial communities (phyla and genera) that are present in all healthy individuals. These communities, shared by all healthy individuals, may be referred to as the core microbiota, which reflects the universal features of the human microbiome irrespective of the host's genetics and geographical origin.

5.2.3. Clustering of populations based on microbial composition at various body niches

A non-metric multidimensional scaling (NMDS) plot was constructed using the Euclidean distance between the phylum-level microbial compositions of the 191 individuals from 10 different countries. Hierarchical clustering was performed using the Ward method based on Bray–Curtis distances. The ANOSIM statistical test was performed using the Past3 software to evaluate the significance of divergences between different geographical populations [Hammer et al. 2001].

5.2.4. Functional characterization of representative metagenomic samples from different body niches

To investigate the variations in the functional profiles of body-niche-specific microbiomes from different geographical populations, we used representative metagenomic samples derived from distinct body sites of populations from different countries. Functional composition was predicted for all representative microbial community metagenomes using the PICRUSt pipeline [Langille et al. 2013]. The PICRUSt pipeline predicts the functions of a metagenome in two steps:

1. Estimation of genus level OTUs based on the 16s rRNA profile, and
2. Assignment of KEGG orthologs using reference bacterial genomes
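The Bray–Curtis distance named in Section 5.2.3 can be sketched for two composition profiles as the sum of absolute abundance differences divided by the sum of all abundances. The sketch below only illustrates that standard definition; it is not the Past3 implementation, and the function name and the toy phylum-level percentages are hypothetical.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Each profile maps a taxon name to its abundance; taxa missing from a
    profile are treated as having abundance 0.
    """
    taxa = set(profile_a) | set(profile_b)
    difference = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    total = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    return difference / total


# Hypothetical phylum-level relative abundances (%) for three populations.
population_1 = {"Bacteroidetes": 50.0, "Firmicutes": 50.0}
population_2 = {"Bacteroidetes": 10.0, "Firmicutes": 90.0}
population_3 = {"Actinobacteria": 100.0}

d_12 = bray_curtis(population_1, population_2)
d_13 = bray_curtis(population_1, population_3)
print(d_12, d_13)
```

The value ranges from 0 for identical profiles to 1 for profiles that share no taxa, and, unlike a presence/absence measure, it responds to shifts in abundance between shared taxa.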


5.2.5. Survey of microbiome studies for healthy individuals from different countries or populations

To explore the variations in the microbiome composition of a specific body habitat among populations from different countries, we surveyed all published studies characterizing the healthy microbiome in western and non-western countries. The survey was carried out to identify cross-population variations in microbiome composition of various biogeographic spaces, considering the five major human body habitats: Oral cavity, Respiratory Tract, Gut, Urogenital Tract (UGT) and Skin.

5.2.6. Diversity in microbiome composition between different populations

a. Alpha diversity: alpha diversity indicates the richness and evenness of microbial species or pathways in an individual or a population. Alpha diversity was primarily calculated by the Simpson Index (eq. 5.1) using relative abundances of the respective determinants [Morgan et al. 2012].

$$D = 1 - \left[\sum_{i=1}^{S} p_i^{2}\right] \qquad (5.1)$$

Here $p_i$ is the fraction of total species or pathways comprised by species or pathway $i$ (relative abundance), and $S$ is the total number of species or pathways in a sample.
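The Simpson index is reported in the literature in more than one form, so the following sketch computes the sum of squared relative abundances together with the two forms commonly derived from it (one minus the sum, and its reciprocal). It is an illustration only; the function name and the toy abundance vectors are hypothetical, and the sketch makes no claim about which form a given software package reports.

```python
def simpson_indices(abundances):
    """Simpson-type diversity measures from raw or relative abundances.

    Zero entries are ignored; the remaining values are normalised so that
    the proportions p_i sum to 1 before the index is computed.
    """
    total = sum(abundances)
    proportions = [value / total for value in abundances if value > 0]
    dominance = sum(p * p for p in proportions)  # sum of p_i squared
    return {
        "dominance": dominance,
        "one_minus": 1 - dominance,
        "inverse": 1 / dominance,
    }


# Hypothetical samples: four equally abundant species versus one species.
even_sample = simpson_indices([25.0, 25.0, 25.0, 25.0])
single_species = simpson_indices([100.0, 0.0, 0.0, 0.0])
print(even_sample, single_species)
```

For S equally abundant species the sum of squares equals 1/S, so an even community of four species gives 0.25, 0.75 and 4 for the three forms, while a single-species sample gives 1, 0 and 1.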

b. Beta diversity: beta diversity represents the sharing or overlap of microbial species or pathways between different individuals or populations [Morgan et al. 2012]. Beta diversity (Whittaker) was calculated using the Past3 software [Hammer et al. 2001].
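As an illustration of a presence/absence based beta diversity, the sketch below assumes the global Whittaker measure: the total number of taxa across all samples divided by the mean number of taxa per sample, minus one. The exact routine and settings used in Past3 are not stated here beyond the name “Whittaker”, so this formula, the function name and the toy samples are assumptions.

```python
def whittaker_beta(samples):
    """Whittaker beta diversity for a list of per-sample taxon sets.

    Assumed form: (total taxa across samples / mean taxa per sample) - 1.
    """
    all_taxa = set().union(*samples)
    mean_richness = sum(len(sample) for sample in samples) / len(samples)
    return len(all_taxa) / mean_richness - 1


# Hypothetical taxon sets.
identical = [{"taxon_a", "taxon_b"}, {"taxon_a", "taxon_b"}]
disjoint = [{"taxon_a", "taxon_b"}, {"taxon_c", "taxon_d"}]

beta_identical = whittaker_beta(identical)
beta_disjoint = whittaker_beta(disjoint)
print(beta_identical, beta_disjoint)
```

Because only presence/absence enters the calculation, two samples with the same taxa in very different proportions score zero, which is the limitation that motivates abundance-based measures.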


5.3. Results

5.3.1. Microbiota conserved among healthy individuals from different countries or populations

To determine the geography-, ethnicity- or local environment (rural/urban)-specific variations in microbiota composition in healthy individuals, the present study characterizes the core microbiota at different human body habitats (gut, oral cavity, skin, vagina etc.) among populations from different countries. The core microbiota of a specific body site refers to the set of genera and phyla that are commonly found in that body site in all populations included in the present study, irrespective of their geographic locations, ethnic background or places of dwelling.

Table 5.2: Number of shared microbial communities (genera and phyla) between different populations

GUT
Countries included: 13
Core Phyla/Genera: 2/3
Core Microbiota [Phylum]: Bacteroidetes, Firmicutes
Core Microbiota [Genus]: Bacteroides, Ruminococcus, Blautia, Eubacterium*, Clostridium*, Faecalibacterium*, Bifidobacterium*, Coprococcus*

Oral Cavity
Countries included: 14
Core Phyla/Genera: 5/9
Core Microbiota [Phylum]: Actinobacteria, Bacteroidetes, Firmicutes, Fusobacteria, Proteobacteria
Core Microbiota [Genus]: Haemophilus, Veillonella, Porphyromonas, Neisseria, Rothia, Prevotella, Granulicatella, Fusobacterium, Streptococcus, Leptotrichia*, Actinomyces*

Skin
Countries included: 4
Core Phyla/Genera: 4/1
Core Microbiota [Phylum]: Cyanobacteria, Firmicutes, Proteobacteria, Actinobacteria
Core Microbiota [Genus]: Propionibacterium, Corynebacterium*

Vagina
Countries included: 6
Core Phyla/Genera: 0/2
Core Microbiota [Phylum]: Bacteroidetes*, Actinobacteria*
Core Microbiota [Genus]: Lactobacillus, Atopobium

Note: * Found in all countries except one country included in this study for respective body niches


The core microbiota indicates the common feature of the human microbiome that is associated with the human body regardless of an individual's lifestyle, living environment, geographical variants etc. As shown in Table 5.2, in the gut-derived microbiota two phyla, Bacteroidetes & Firmicutes, and three genera, Bacteroides, Ruminococcus & Blautia, were commonly found in all populations from the 13 different countries, though the abundances of these core phyla and genera vary substantially across populations. For instance, the phylum Bacteroidetes is dominant in people from Canada (54%), USA (52%) and China (51%), as compared to those from Brazil (9%), South Korea (10%) and Malawi (11%), while Firmicutes are enriched (>40%) in most of the countries under study except China (26%), Malawi (33%), USA (34%), Canada (35%) and South Korea (38%). The oral microbiome was characterized in 14 geographically diverse countries, in which a total of 46 genera and 8 phyla were found (those with relative abundances of more than 0.5% on average among all healthy individuals from a particular country), but of these only 5 phyla and 8 genera were shared by all 14 countries/populations. The maximum number of genera (26) is present in the oral microbiome of people from China, followed by Georgia & Turkey (24), while people from Bolivia have the fewest (17). Outside the USA, few metagenomic samples have been reported for skin and vagina from different countries (4 countries for skin & 6 for vagina); among these, 4 (skin) and 2 (vagina) phyla are commonly present in all populations (Table 5.2). The skin microbiome is dominated by Acinetobacter (17%) in people from Benin, Propionibacterium in Brazil (64%) & Netherlands (21%) and Ralstonia in USA (11%), while the vaginal microbiome is dominated by Lactobacillus (50% to 80%) in all populations except the Belgian population (16%). The vaginal microbiome of the Belgian population is dominated by Bacteroides (34%), whereas populations from North America show an extremely low (0.02% - 0.2%) abundance of Bacteroides.


5.3.2. Exclusiveness for microbiota components in a specific geographical population

In addition to providing an account of geography-, ethnicity- or local environment (rural/urban)-specific microbiota, we identified exclusively present/absent genera and phyla in an effort to explore the effect of lifestyle, environment, geographical variants or any other country-associated factor on the composition of the microbiota that have colonized and adapted to different human body sites. In the gut microbiota, some phyla or genera are exclusively present in people from a specific country and entirely absent from the other countries or populations included in the present study. For example, the phylum Tenericutes is present only in the population from South Korea, and Fusobacteria only in China & Malawi. Similarly, some phyla or genera are exclusively absent from the people of a specific country but present in all other nationals; for instance, Chinese people lack Actinobacteria and Spanish people lack Proteobacteria in their guts. The phylum Spirochaetes is present in the oral cavity microbiome of people from only 4 countries (Georgia, Turkey, China & Philippines) out of the 14 populations included in the present study. People from Benin show exclusive presence of the phyla Fusobacteria, Chloroflexi, Acidobacteria & Planctomycetes in their skin microbiota, and the phylum Bacteroidetes is exclusively absent from the skin microbiota of the Brazilian population. The vaginal microbiomes of Chinese and North American white people lack Bacteroidetes and Actinobacteria, respectively. Fusobacteria (black) and Tenericutes (black & white) are found exclusively in the vaginal microbiome of North American people, except in the Asian ethnic population (Table 5.3).


Table 5.3: Exclusively present or absent microbial communities (genera) in niche specific microbiota of respective population

Exclusively present genera Fusobacterium, Succinivibrio & Megasphaera GUT Microbiome Turicibacter, SMB53 & Lactococcus

Lautropia & Staphylococcus

Country Malawi

South Korea

Canada

Dialister Oral Cavity Microbiome Abiotrophia Burkholderia, Cedecea & Morganella Cloacibacterium Ruminococcus & Syntrophococcus Granulicatella & Finegoldia

Thailand

Nelumbo & Anaerococcus Prevotella, Bacteroides, Micrococcus, Acinetobacter, Sphingomonas, Delftia, Skin Dialister, Peptostreptococcus, Mycobacterium, Microbiome Luteimonas, Porphyromonas, Alcaligenes, Fusobacterium, Gemmata, Dietzia, Pedobacter & Alicyclobacillus Methylobacterium, Janthinobacterium, Hyphomicrobium, Sediminibacterium & Ralstonia Dialister, Streptococcus, Anaerobranca, Gemella, Micromonas & Peptostreptococcus Mobiluncus, WAL_1855D & Anaerococcus

Brazil Benin

Clostridium, Pseudomonas, Bacteroides, Vaginal Pelomonas, Betaproteobacteria, Microbiome Escherichia/Shigella, Chitinophagaceae, Aeromonas, Ralstonia, Acidovorax, Propionibacterium, Herbaspirillum, Parabacteroides, Undibacterium, Enterobacteriaceae & Sediminibacterium

Belgium

California Congo China Turkey Netherlands

Exclusively absent genera Eubacterium & Clostridium Malawi Faecalibacterium & Bifidobacterium – South Korea LeptotrichiaCongo ActinomycesBolivia

Corynebacterium USA

USA

North America UK

PeptoniphilusBelgium Aerococcus – Belgium MegasphaeraBelgium

The exclusive presence or absence of a specific microbiota component (genus or phylum) in a country's population indicates robust variation in microbiome composition between populations. This variation may depend on factors specific to each country's population, such as geography, lifestyle, diet, genetics and occupation (indoor or outdoor). As shown in Table 5.3, some countries included in our study harbour a large number of exclusive genera in the microbiome compared to the USA population; for instance, the Benin population includes 17 exclusive genera in its skin microbiome. Similarly, 16 genera were found exclusively in the vaginal microbiome of the Belgian population (Table 5.3).

5.3.3. Diversity in microbiome of healthy populations from different countries

We collected taxonomic data from previously published microbiome characterization studies of various populations, and metagenomic data from the EBI metagenomics database. To evaluate the microbiome composition, we downloaded the processed reads with annotations from the EBI metagenomics database [www.ebi.ac.uk/metagenomics]. The relative abundances of phyla and genera were estimated for each sample included in the present study. Next, we combined the microbiome composition data for the different body sites of human hosts, obtained from the publicly available metagenomic reads and from previously published studies, to investigate population specific variations in microbiome composition. The independent cohort data were combined for each body habitat per country to construct body-site specific microbiota composition datasets. These country specific datasets of 1204 healthy individuals included 927 samples from the gut, 45 from the skin, 160 from the oral cavity and 72 from the vagina. In the gut, oral cavity & vagina, the most dominant phylum was Firmicutes, with mean relative abundances of 49.13 ± 13.2%, 36.72 ± 6.8% & 87% respectively, while in the skin, Actinobacteria (27.35 ± 23.18%) & Proteobacteria (26.93 ± 23.74%) were dominant. The phylum-level microbial compositions of 191 individuals from 10 countries were subjected to non-metric multidimensional scaling (NMDS) analysis to examine the clustering of microbiome composition for each body site. The remaining samples were excluded from the NMDS clustering analysis because relative abundance data were not available from the previous studies.
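The quantities reported above can be illustrated with a short sketch covering per-sample relative abundance, its mean and spread across samples, and a pairwise dissimilarity of the kind that an NMDS ordination takes as input. Several assumptions are made here: the "±" value is taken to be the sample standard deviation, Bray-Curtis is used only as a common choice of dissimilarity (the dissimilarity measure actually used is not stated in the text), and all function names and read counts are hypothetical.

```python
from statistics import mean, stdev


def relative_abundance(counts):
    """Convert raw read counts per taxon into percentages of the sample total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no assigned reads")
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


def summarize(samples, taxon):
    """Mean and sample standard deviation (%) of one taxon across samples."""
    values = [s.get(taxon, 0.0) for s in samples]
    return mean(values), stdev(values)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical)."""
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in taxa)
    denominator = sum(a.get(t, 0.0) + b.get(t, 0.0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Hypothetical read counts for two samples; not data from the study.
sample_1 = relative_abundance({"Firmicutes": 60, "Bacteroidetes": 30, "Proteobacteria": 10})
sample_2 = relative_abundance({"Firmicutes": 40, "Bacteroidetes": 40, "Actinobacteria": 20})

firmicutes_mean, firmicutes_sd = summarize([sample_1, sample_2], "Firmicutes")
distance = bray_curtis(sample_1, sample_2)
print(f"Firmicutes: {firmicutes_mean:.2f} ± {firmicutes_sd:.2f}%, Bray-Curtis: {distance:.2f}")
```

A full matrix of such pairwise dissimilarities, computed over all individuals of a body site, is what an NMDS routine would then embed in two dimensions.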


Figure 5.1: Multidimensional scaling of population specific microbiota composition based on relative abundances of microbial communities A) Gut microbiome, B) Vaginal Microbiome, C) Oral Microbiome, D) Skin Microbiome

The MDS plots of the phylum based microbial composition of 191 healthy individuals for each body niche showed that individuals tended to cluster together according to their nationalities (Fig. 5.1). This country wise clustering indicates that the microbiome composition is more similar within a country specific population than between populations from different countries. Factor analysis of the non-metric multidimensional scaling (NMDS) showed that uncharacterized components (r = 0.79), Actinobacteria (r = -0.89), Bacteroidetes (r = 0.89) & Firmicutes (r = -0.65) are the main microbial communities determining the variation in gut microbiome composition for the South Korean, Malawian, Chinese & Brazilian populations respectively (Fig. 5.1A). Relative abundances of vaginal microbiota were available for the populations of only two countries (China & UK). The vaginal microbiome of the UK population is characterized by the dominance of Firmicutes (r = 0.98), while the major proportion (average 92%) of the vaginal microbiome in the Chinese population is not characterized (Fig. 5.1B). In the Canadian population, Bacteroidetes (r = 0.86), Synergistetes (r = 0.50), Spirochaetes (r = 0.51) & unassigned microbial communities (r = 0.75), and in the Thai population, Actinobacteria (r = -0.61) & Proteobacteria (r = -0.58), play a major role in configuring the oral cavity microbiome (Fig. 5.1C). As shown in Figure 5.1D, the Brazilian and USA populations cluster separately at opposite ends of the MDS 1 axis, which reflects the opposite patterns of skin microbiome composition in the populations of these countries. The correlation coefficients showed that the skin microbiome of the USA population is enriched in Proteobacteria (r = 0.95) and Bacteroidetes (r = 0.59), while the skin microbiome of the Brazilian population has a higher representation of Actinobacteria (r = -0.97). We excluded the Netherlands and Benin populations from further analysis because very few individuals were available from these countries (Fig. 5.1D).

5.3.4. Microbial characterization of niche specific human microbiome in healthy individuals from different countries/populations

We identified 113 genera and 15 bacterial phyla in the microbiome derived from 4 body sites (gut, oral cavity, skin & vagina) among 1204 healthy individuals from 35 geographically diverse countries. Only the phyla and genera with an average relative abundance of ≥0.5% across all metagenomic samples from a specific body habitat were included in this analysis. We also examined the variation in relative abundances of the core microbial communities (those present in all populations involved in the current study) among populations from different countries, using a comparative analysis of relative abundances for each body site.
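The two selection criteria described above, the ≥0.5% average relative abundance cut-off and the definition of core communities, can be sketched as follows. This is an illustrative sketch only: the function names, the input layout (relative abundance profiles in percent, grouped by country for one body site) and the placeholder values are assumptions, and treating "present" as a relative abundance above zero is an assumption as well.

```python
def abundant_taxa(samples, threshold=0.5):
    """Taxa whose mean relative abundance (%) across all samples meets the threshold."""
    taxa = set().union(*(s.keys() for s in samples))
    n = len(samples)
    return {t for t in taxa if sum(s.get(t, 0.0) for s in samples) / n >= threshold}


def core_taxa(by_country):
    """Taxa detected (abundance > 0, an assumed criterion) in every country."""
    per_country = []
    for samples in by_country.values():
        detected = {t for s in samples for t, v in s.items() if v > 0}
        per_country.append(detected)
    return set.intersection(*per_country) if per_country else set()


# Hypothetical profiles for one body site; placeholder names, not study data.
by_country = {
    "country_1": [{"phylum_A": 70.0, "phylum_B": 29.8, "phylum_C": 0.2}],
    "country_2": [
        {"phylum_A": 55.0, "phylum_B": 44.6, "phylum_C": 0.4},
        {"phylum_A": 60.0, "phylum_D": 40.0},
    ],
}
all_samples = [s for samples in by_country.values() for s in samples]

abundant = abundant_taxa(all_samples)  # passes the >= 0.5% filter
core = core_taxa(by_country)           # present in every country
print(sorted(abundant & core))
```

In this toy input, `phylum_C` is detected in both countries but falls below the abundance cut-off, whereas `phylum_D` passes the cut-off but is missing from `country_1`, so only `phylum_A` and `phylum_B` satisfy both criteria.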
The Malawian gut microbiome showed the highest abundance of Actinobacteria, while people from most of the other countries (except France, Austria & Japan) hardly harbour Actinobacteria in the gut microbiome (Fig. 5.2A). In the oral microbiome, three phyla (Firmicutes, Proteobacteria and Bacteroidetes) constitute the major portion of the total microbiota across all countries, and the core microbiota showed a similar pattern among all countries at the phylum level (Fig. 5.2B). The vaginal microbiome showed a large degree of variation among countries: the Brazilian and Netherlands populations are enriched in Actinobacteria, whereas Proteobacteria is the dominant phylum in the vaginal microbiome of the USA and Benin populations (Fig. 5.2C).

Figure 5.2: Percentage frequency distribution pattern of phyla among different country specific populations (A) Gut microbiota (B) Oral cavity microbiota (C) Vaginal Microbiota

However, when we compared the microbiome of each body site between different countries at the genus level, we observed that an appreciable portion (20 - 90%) of the gut microbiome is not properly characterized, but the minimally characterized (