Molecular Systems Biology Peer Review Process File
Prediction and identification of sequences coding for orphan enzymes using genomic and metagenomic neighbours
Takuji Yamada, Alison S. Waller, Jeroen Raes, Aleksej Zelezniak, Nadia Perchat, Alain Perret, Marcel Salanoubat, Kiran R. Patil, Jean Weissenbach, Peer Bork
Corresponding author: Peer Bork, EMBL
Review timeline:
Submission date: 20 December 2011
Editorial Decision: 01 February 2012
Revision received: 15 March 2012
Accepted: 24 March 2012
Transaction Report: (Note: With the exception of the correction of typographical or spelling errors that could be a source of ambiguity, letters and reports are not edited. The original formatting of letters and referee reports may not be reflected in this compilation.)
1st Editorial Decision
01 February 2012
Thank you again for submitting your work to Molecular Systems Biology. We have now heard back from the three referees whom we asked to evaluate your manuscript. As you will see from the reports below, the referees find the topic of your study of potential interest. However, they raise several concerns, which should be convincingly addressed in a revision of this work. Key issues are the following:
- The impact of the new gene-enzyme associations on genome-scale metabolic models should be demonstrated by re-validating (some of) the updated SEED models.
- The methods used to perform the analysis should be described with more rigor and in sufficient detail so that other researchers with the suitable expertise can easily reproduce the results. The recommendations provided by the three reviewers are important and very clear.
- Along the same line, it would be helpful to include, as datasets, in supplementary information the results of the NBH, COR (including the clustered phylogenetic profiles), DOM (including the clustered domain list) and PNE scoring, respectively. In addition, for the purpose of long-term archiving, we would also kindly ask you to include as 'dataset' in supplementary information the predictions and sequences currently hosted on your website.
- The exact source, nature and essential characteristics of the metagenomic data used in this study should be clearly specified. Reviewer #2 and #3 express specific questions in this regard.
Given the request of the reviewers for more discussion and the number of figures included in the
© European Molecular Biology Organization
study, we would prefer to consider this manuscript as an Article and recommend rearranging the manuscript with separate Results and Discussion sections.
*** PLEASE NOTE *** As part of the EMBO Publications transparent editorial process initiative (see our Editorial at http://www.nature.com/msb/journal/v6/n1/full/msb201072.html), Molecular Systems Biology will publish online a Review Process File to accompany accepted manuscripts. When preparing your letter of response, please be aware that in the event of acceptance, your cover letter/point-by-point document will be included as part of this File, which will be available to the scientific community. More information about this initiative is available in our Instructions to Authors. If you have any questions about this initiative, please contact the editorial office [email protected].
If you feel you can satisfactorily deal with these points and those listed by the referees, you may wish to submit a revised version of your manuscript. Please attach a covering letter giving details of the way in which you have handled each of the points raised by the referees. A revised manuscript will be once again subject to review and you probably understand that we can give you no guarantee at this stage that the eventual outcome will be favourable.
Best wishes,
Editor
Molecular Systems Biology
http://www.nature.com/msb
--------------------------------------------------------------------------
Referee reports:
Reviewer #1 (Remarks to the Author):
Review of "Prediction and identification of sequences coding for orphan enzymes using (meta)genomic neighbours", by Yamada et al.
Overview: A large portion of currently characterized enzymes are considered 'orphans,' ie, they are not associated with any genetic sequence. Most of these enzymes were categorized before the rapid sequencing capabilities of the last decade, and they present a significant hurdle to large-scale metabolic modeling efforts, most notably those using genome-scale metabolic models. Genome-scale models are being produced manually and automatically (notably through SEED); in both cases the basis of enzyme assignments are in the linkage of genomic to functional information, and orphans represent black holes in this process. The paper presents an attempt to link orphan enzymes to likely genetic sequences, a task that could have a large immediate impact on the usability of much previously collected enzymatic knowledge (ie, knowledge about orphan enzymes), which is of only limited utility for current large-scale modeling methods as it stands.
To address this clearly important issue, the authors perform an analysis of all orphan EC numbers in the KEGG pathway database (composing 1772 orphan enzymes), and identify 555 enzymes that have pathway neighbors (ie, other enzymes that are adjacent in a pathway). Based on the observation that enzymes involved in a pathway typically occur close to each other on a chromosome, this adjacency data was used as a first-pass filter for determining orphan genes for which the gene sequence might be ascertained. A set of heuristics, based on observed frequency relations in genomic and metagenomic data (co-occurrence of same pathway genes in genomic/metagenomic data, closeness of within-pathway genes on a chromosome, presence of signature domains relating generally to the EC# function in a candidate DNA sequence, and the number of pathway neighbors present together) was then used to predict DNA sequences for orphan genes. The metrics used were benchmarked using enzymes of known function, and a number of orphan genes were assigned putative DNA sequences. This was used to then augment SEED genome-scale metabolic models with new reactions (with a flux coupling analysis to show that these genes affected function), and six candidate sequences (predicted to code orphan genes) were
experimentally tested, with two of the genes turning out to have the predicted function.
The study is well conceived, appears to have been well executed, is highly relevant and important, and will have an immediate impact on the field of genome-scale modeling, as well as contributing to the genomic databases (KEGG and others) as a whole. The validation of two orphan gene functions gives strong support to the validity of the predictions. I recommend publication, with the following revisions:
1. After adding newly-gene associated orphan enzymes to genome-scale metabolic models from SEED, I feel a natural next step would be to re-validate the models (or a subset of them, e.g., focusing on E. coli) using either biolog growth data or gene knockout lethality data (this is available for multiple organisms, and accuracies of predictions could be easily determined before and after adding the orphan genes). This re-validation would significantly strengthen the paper, as it would give more confirmation that the orphan genes indeed improve the quality of the models.
2. Regarding the 6 enzyme experiments which yielded validations of gene-enzyme associations for two enzymes; there is no specific discussion of why the 4 other enzymes were not confirmed. I suggest that the authors give more detail about what was found for these enzymes, and expand on the possible reasons for failure.
3. It is unclear how the points in figure 2 were generated (eg, random sampling of the parameters, or some directed sampling?) - please explain. Further, it is unclear which point on the plot is used for the final analysis. This should be very clearly explained (or if, somehow, multiple points were used, this process should be very explicitly described).
4. As far as I can tell, the actual DNA sequences (either an example or a consensus sequence) associated with each orphan gene (ie, the result of the work) is not given. This could, eg, be put in Supplementary table 1.
Reviewer #2 (Remarks to the Author):
The authors assigned putative gene sequences to currently orphaned enzymatic activities using four scores: 1) the STRING neighborhood score between genes, NBH; 2) the STRING phylogenetic profiling co-occurrence score, COR; 3) domains unique to EC sub-subclasses, DOM; and 4) the count of adjacent pathway neighbors, PNE. These criteria were combined to assign putative sequences to 105 orphaned enzymatic functions, two of which (out of seven tested) were validated experimentally. This is an interesting study, for which the fundamental results seem sound; it is based on a good idea and builds on previous work performing similar tasks using phylogenetic profiling and other data integration methods. The manuscript seems somewhat incomplete, though, with the computational methods in particular quite sparsely explained and their validation apparently performed in a manner likely to strongly overstate accuracy (although this is difficult to determine due to lack of well-described data and algorithms). These concerns are not so much about the study results as about its reproducibility and understandability for readers, however, and are detailed below.
Moderate
---
* The manuscript's wording and descriptions of the problem being solved are unclear throughout. This is particularly critical in the abstract and introduction, where phrases such as "inference of candidate sequences to orphan enzymes" or "metagenomic neighbourhood information" are difficult to understand. "Pathway context" is used without definition, likewise "projection" to genomic data and flux coupling analysis, and the distinction between enzymes and enzyme activities is extremely blurred. Consider rewording throughout for clarity.
* The methods for integration of the four scores, and thus for generating Figure 2, do not appear to be described in detail in either the main or supplemental methods.
No supporting data regarding them are provided, and it would thus be extremely difficult to replicate the study. If I understand Figure 2A in particular, it appears that a (potentially exhaustive?) search over parameter settings for
score thresholds has been performed, which would suggest a strong possibility of overfitting to training data during cross-validation and thus overestimating the 70% (and later 90%) "accuracy" figures.
* Particularly given that 2 out of 6 validations is indeed less than 90%, I'd suggest either performing a more formal machine learning evaluation of the integration method or avoiding the "accuracy" terminology altogether. The former could include a held-out validation set or a more sophisticated machine learner rather than thresholding for integration, for example. The latter could benefit from cross-validation expected performance in held-out data, but separating it more clearly from the later application to real, unlabeled orphan enzymes.
* The Introduction and Conclusion are both somewhat cursory and provide little background on the research area or discussion of potential follow-up, respectively. Computational enzyme and protein function prediction is a well-studied and highly related area that deserves at least a paragraph in the introduction, STRING itself being an easy example. The Hanson et al 2011 review in the Biochemical Journal seems relevant, as does Chen and Vitkup's phylogenetic profiling method in Genome Biology 2006.
* In the conclusion, I'd like to see a bit more substance about why so few of such highly refined predictions validated - "many variables" isn't terribly convincing. The seven tested sequences and their results are mentioned in the supplement, but a more sophisticated discussion of the data driving their predictions and more concrete potential reasons for failure would be of interest. Is overexpression of isolated sequences in E. coli a concern? I'd also wonder how many more predictions could be expected from more data - would doubling the number of genomes or metagenomes help, or is this "most" of what's attainable? Would assembled metagenomes be of use, rather than relying on long reads?
* The "Arctic" metagenomic data used in this study is listed as "unpublished," with no resource provided for obtaining the data or replicating the study? No accession numbers or direct information are provided for any of the metagenomes, which are presumably not stored in as central a location as STRING.
Minor
---
* As above, in Figure 1 and accompanying text, the statement that "sequences were predicted with >70% accuracy" is a bit misleading, since A) subsets of sequences were predicted with >70% precision, not accuracy, and B) this was evaluated (perforce) in held-out cross-validation data, not in the orphan enzymes being described in Fig. 1.
* The statement that "functionally associate genes are usually clustered in conserved operon structures" is only true in some (albeit many) organisms, even among bacteria.
* It's unclear to me what data backs the accuracy value in "candidate genes were predicted (with >80% accuracy) for Phenylpyruvate decarboxylase".
* "unculturable bacteria" is not a particularly accurate description of metagenomic data.
* In Figure 3C, the title should be "Novelty"
Reviewer #3 (Remarks to the Author):
In this manuscript, the authors propose a methodology for associating orphan enzyme functions with gene sequences. This is a very important objective, with the potential to impact many areas of biological research via the substantial improvement of annotations. I found the article interesting, and the work presented will most certainly have an impact on my own work. I suspect many other readers will feel the same. Modelers will use this data to improve models, bioinformaticists will use the simple approach to associate new gene sequences with function, and wetlab biologists may attempt to experimentally validate more of the targets indicated.
The ideas behind the research (exploiting pathway co-localization on the genome for annotation) are not novel. The large-scale application of these ideas to link orphaned enzymes to possible gene sequences is novel. Bringing metagenomic data into the analysis is also interesting. The fact that the authors experimentally validated some of their orphan assignments certainly adds impact to the work. The statements about metabolic modeling were well put, and they demonstrate the importance of this kind of work. The work appears to have sufficient impact for MSB because it is the first wide-scale systematic study of this nature. It would be nice to see this technology integrated into an annotation framework like the SEED so it can be applied repeatedly as the number of sequenced genomes and metagenomes increases in time. The work is well written and presented in the manuscript. There are few grammatical errors, and the figures are visually compelling and clear. I do have a number of concerns that I feel must be addressed before this manuscript is ready for publication:
1.) The authors say the following on page 5 of their manuscript: "Some of these 26 share sequence homology with our predicted candidates and the others that do not, may represent alternative orthologous groups catalyzing the same reaction, as about 70% of the EC numbers in KEGG are encoded by more than one orthologous group." Why not indicate the specific number of candidates that had homology in the manuscript? Why force the reader to dig for it in the supplementary material? I looked up the table in the supplement, and it appears as though 17/26 had homology, 6 did not have homology, and three were "promiscuous". It would make sense to report these numbers in the manuscript. It would also be helpful if the authors offered an explanation as to what they mean by "promiscuous" in this table.
2.) Did the authors attempt to "assemble" metagenomes prior to analysis? Assembly would produce much longer contigs that would be far more amenable to this type of analysis compared with the raw reads themselves. Assembly would also enable the analysis of Illumina metagenomes where sequencing depth and potential sequence diversity would be much higher. Overall, it is surprising to see so few hits arising from metagenomic data. Consider that the authors only included ~300 genomes in the "single genome" analysis. A single metagenome could easily represent more than 300 genomes worth of sequence. I would suggest a few comments in the manuscript along these lines:
- please indicate if any assembly was attempted prior to metagenome analysis
- authors should comment on the "nature" of the rank abundance curve for the metagenomes used in analysis (how diverse was your metagenome?)
- can authors indicate the average read length and net amount of sequence data represented in the metagenomes used for analysis?
3.) This approach has been applied for a long time in order to assign functions to genes (albeit, until this paper, this effort was done manually and not systematically). In fact, many of the orphan enzymes that remain, remain specifically because their associated genes are not co-localized with their pathways on the genome. This is the first place biologists would have looked when attempting to identify the gene that corresponds with the orphan function. As a result, it's not clear that the authors can extend the success rate of this approach in correctly identifying the gene sequence for non-orphan enzymes to be the predicted success rate for analysis of orphan enzymes. I would expect the success rate to be far below 70%, and frankly, it would still be a resounding success if only 20% of the predictions were correct. Yet the authors repeatedly indicate that their method is expected to be 70% accurate, while their experimental success rate was only 33%. I would suggest the authors include a more thorough and forthright discussion of expected accuracy, and include some indication of why success may be lower than that indicated by the benchmarking study. Simply stating that "the benchmark accuracy was 70%, and the experimental validation success rate was 33%, and the true accuracy of their algorithm's predictions is probably between these two" would be a more justifiable statement of expected accuracy.
Overall, with some revisions responding to the above comments, I would recommend this manuscript for publication.
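Reviewer #3's contrast between the 70% benchmark figure and the 2-of-6 experimental outcome can be made concrete with a short calculation. The sketch below is not part of the review or the manuscript. It assumes, purely for illustration, that the six tested candidates behave as independent trials sharing one success probability; the function name `prob_at_most` is hypothetical.

```python
from math import comb


def prob_at_most(k, n, p):
    """Binomial probability of observing k or fewer successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Figures quoted in the referee report: 2 of 6 candidates confirmed,
# against a benchmark accuracy of 70%.
observed_rate = 2 / 6

# Assumption (not from the source): independent trials, common success rate.
p_two_or_fewer = prob_at_most(2, 6, 0.7)

print(f"observed success rate: {observed_rate:.2f}")
print(f"P(<= 2 of 6 confirmed | true rate 0.7): {p_two_or_fewer:.4f}")
```

Under these assumptions, a true 70% success rate would give two or fewer confirmations in roughly 7% of such six-candidate experiments. A failed assay may also reflect experimental limitations rather than a wrong prediction, so this is only a rough guide to how far apart the two figures are.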
1st Revision - authors' response
15 March 2012
We have re-formatted the manuscript into an article to accommodate discussion of some of the points raised by the reviewers and we have included new analyses revealing that addition of the Orphan Enzymes makes the genome-scale metabolic models more accurate. We hope to have addressed the following key issues:
1. The impact of the new gene-enzyme associations on genome-scale metabolic models should be demonstrated by re-validating (some of) the updated SEED models.
We have now assessed the impact of adding the novel Orphan Enzymes on the ability of the SEED models to correctly predict gene essentiality. First, we evaluated a total of 72 SEED models to determine how many genes changed from essential to non-essential when the Orphan Enzyme reactions were added. While for some models there was no change in gene essentiality, for 11 genome-scale models more than 10 genes changed from essential to non-essential. Then, for the 4 organisms for which there is genome-scale experimental evidence of gene essentiality, we determined that the model with the added Orphan Enzyme reactions correctly predicted 15/15 of the genes whose essentiality prediction changed. This shows that adding the Orphan Enzymes makes the models more correct.
2. The methods used to perform the analysis should be described with more rigor and in sufficient detail so that other researchers with the suitable expertise can easily reproduce the results. The recommendations provided by the three reviewers are important and very clear.
We have edited the results and methods in the manuscript to describe the analysis in more detail. We specifically tried to explain the methods and results related to the benchmarking in more detail.
3. Along the same line, it would be helpful to include, as datasets, in supplementary information the results of the NBH, COR (including the clustered phylogenetic profiles), DOM (including the clustered domain list) and PNE scoring, respectively. In addition, for the purpose of long-term archiving, we would also kindly ask you to include as 'dataset' in supplementary information the predictions and sequences currently hosted on your website.
We have provided the requested data as supplemental data sets.
4. The exact source, nature and essential characteristics of the metagenomic data used in this study should be clearly specified. Reviewer #2 and #3 express specific questions in this regard.
We have expanded the methods section and paid specific attention to detailing the source of the metagenomic datasets as well as the methods used to assemble them. In addition, we have added some supplemental figures regarding the contig length distributions and the number of genes on each contig.
Most concerns of the reviewers are addressed in the detailed response below. Given these additional efforts, we hope that our explanations satisfy the concerns of the reviewers and will permit the publication of our manuscript in Molecular Systems Biology.
Detailed responses to comments from the individual reviewers
Referee #1 (Remarks to the Author):
A large portion of currently characterized enzymes are considered 'orphans,' ie, they are not associated with any genetic sequence. Most of these enzymes were categorized before the rapid sequencing capabilities of the last decade, and they present a significant hurdle to large-scale metabolic modeling efforts, most notably those using genome-scale metabolic models. Genome-scale models are being produced manually and automatically (notably through SEED); in both cases the basis of enzyme assignments are in the linkage of genomic to functional information, and orphans represent black holes in this process. The paper presents an attempt to link orphan enzymes to likely genetic sequences, a task that could have a large immediate impact on the usability of much previously collected enzymatic knowledge (ie, knowledge about orphan enzymes), which is of only limited utility for current large-scale modeling methods as it stands.
To address this clearly important issue, the authors perform an analysis of all orphan EC numbers in the KEGG pathway database (composing 1772 orphan enzymes), and identify 555 enzymes that have pathway neighbors (ie, other enzymes that are adjacent in a pathway). Based on the observation that enzymes involved in a pathway typically occur close to each other on a chromosome, this adjacency data was used as a first-pass filter for determining orphan genes for which the gene sequence might be ascertained. A set of heuristics, based on observed frequency relations in genomic and metagenomic data (co-occurrence of same pathway genes in genomic/metagenomic data, closeness of within-pathway genes on a chromosome, presence of signature domains relating generally to the EC# function in a candidate DNA sequence, and the number of pathway neighbors present together) was then used to predict DNA sequences for orphan genes. The metrics used were benchmarked using enzymes of known function, and a number of orphan genes were assigned putative DNA sequences. This was used to then augment SEED genome-scale metabolic models with new reactions (with a flux coupling analysis to show that these genes affected function), and six candidate sequences (predicted to code orphan genes) were experimentally tested, with two of the genes turning out to have the predicted function.
The study is well conceived, appears to have been well executed, is highly relevant and important, and will have an immediate impact on the field of genome-scale modeling, as well as contributing to the genomic databases (KEGG and others) as a whole. The validation of two orphan gene functions gives strong support to the validity of the predictions. I recommend publication, with the following revisions:
1. After adding newly-gene associated orphan enzymes to genome-scale metabolic models from SEED, I feel a natural next step would be to re-validate the models (or a subset of them, e.g., focusing on E. coli) using either biolog growth data or gene knockout lethality data (this is available for multiple organisms, and accuracies of predictions could be easily determined before and after adding the orphan genes). This re-validation would significantly strengthen the paper, as it would give more confirmation that the orphan genes indeed improve the quality of the models.
We thank the reviewer for this suggestion and have evaluated the ability of the SEED models to correctly predict gene essentiality from gene knockouts before and after the addition of the candidate genes from Orphan Enzymes.
ACTION TAKEN: First, we evaluated a total of 72 SEED models to determine how many reactions changed from essential to non-essential when the Orphan Enzyme reactions were added. While for some models there was no change in gene essentiality, for 11 genome-scale models more than 10 genes changed from essential to non-essential. Then, for the 4 organisms for which there is genome-scale experimental evidence of gene essentiality, we determined that the model with the added Orphan Enzyme reactions correctly predicted 15/15 of the genes whose essentiality prediction changed. Results are included on Page 9 line 272 – Page 10 line 292, and presented in Figure 6.
2. Regarding the 6 enzyme experiments which yielded validations of gene-enzyme associations for two enzymes; there is no specific discussion of why the 4 other enzymes were not confirmed. I suggest that the authors give more detail about what was found for these enzymes, and expand on the possible reasons for failure.
Thank you for the suggestion. Concerning the candidate proteins for EC 2.1.1.19, 2.1.1.68, 2.3.1.32 and 2.7.1.28, neither product formation nor substrate consumption was detected in enzymatic assays through LC/MS. For EC 2.7.1.28, a peak of very slight intensity with an m/z consistent with that of the product D-glyceraldehyde-3-phosphate could be detected. Nevertheless, the LC/MS analyses did not allow us to confirm the predicted activity, as the substrate D-glyceraldehyde could never be detected, and neither ATP consumption nor ADP formation could be established. We have also expanded our discussion related to possible reasons for experimental failure. Firstly, an enzyme can be purified in a soluble form but become inactive during the purification process
due to improper handling or exposure to unfavourable conditions such as oxygen. In addition, the proteins purified in this study were tagged with histidine (his-tagged), as many heterologously expressed proteins are. The addition of a terminal his-tag can dramatically decrease the activity of a protein (Kadas et al. 2008) or render it totally inactive (Albermann et al. 2000, Halliwell et al. 2001). We also now provide an example of how adjusting one of the enzyme assay or analytical method conditions can affect the outcome. For example, in assay optimization trials for EC 2.6.1.38, when the mobile phase for the LC/MS was changed from 10 mM ammonium acetate to water, the peak area of the product glutamate increased more than 11 times. However, there is a practical limit to how many permutations of experimental conditions can be attempted, and further optimization is feasible only if the initial screening assay is close to the optimal conditions.
ACTION TAKEN: We have moved the information concerning the 4 experiments that did not confirm the predicted activity from the supplemental material to the main manuscript (Results: Page 7 lines 171- 185) and have expanded the discussion related to difficulties in performing enzymatic assays (Discussion: Page 10, line 312 – Page 11, line 334).
3. It is unclear how the points in figure 2 were generated (eg, random sampling of the parameters, or some directed sampling?) - please explain. Further, it is unclear which point on the plot is used for the final analysis. This should be very clearly explained (or if, somehow, multiple points were used, this process should be very explicitly described).
The reviewer is correct that we were not clear in describing how figure 2 was generated. Each point represents a set of predictions from a specific combination of the 4 parameters. We indeed used multiple points for the final analysis, as our final set of predictions comes from the union of all points (predictions from parameter sets) with accuracy >70%.
Specifically, for the benchmarking, 350 non-orphan enzymes were extracted from the KEGG pathway database v. 57. These enzymes were then treated as Orphan Enzymes and candidate sequences were generated using the computational pipeline described above, and each prediction was assigned a set of 4 scores: NBH (>0.4, >0.5, >0.6, >0.7, >0.8, >0.9), COR (>0.1, >0.2, >0.3, >0.4, >0.5, >0.6), DOM (0 or 1), PNE (1, 2 or >=3). Then, for each combination of scoring parameters, the number of correct and incorrect EC number assignments was calculated in order to determine the accuracy of each parameter combination. In total, 100 randomized data sets were generated to benchmark the prediction pipeline. Finally, to obtain a high confidence set of candidate sequences, we took the union from all of the parameter combinations that yielded an accuracy of > 70 %.
ACTION TAKEN: We added detail and more clearly explained the benchmarking procedure in the results and methods section (Methods: Page 14, lines 432-456). We similarly edited the figure legend for Figure 2.
4. As far as I can tell, the actual DNA sequences (either an example or a consensus sequence) associated with each orphan gene (ie, the result of the work) is not given. This could, eg, be put in Supplementary table 1.
We have now generated a fasta file for protein sequences and will provide it as a supplemental text file and host it on our web server.
ACTION TAKEN: All of the candidate protein sequences from the high confidence predictions are provided as a supplemental text file.
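The benchmarking procedure described in the response to point 3 above (a grid of score cutoffs, an accuracy per parameter combination, and a union over the combinations above 70%) can be sketched roughly as follows. This is an illustrative reading of the description, not the authors' pipeline: the record fields, the function names, the treatment of DOM as an exact match and of PNE as bins of 1, 2 and 3-or-more, and the toy data are all assumptions, and the 100 randomized data sets are omitted.

```python
from itertools import product

# Cutoffs as listed in the response; PNE bin 3 stands for "3 or more".
NBH_CUTOFFS = (0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
COR_CUTOFFS = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
DOM_VALUES = (0, 1)
PNE_BINS = (1, 2, 3)

PARAMETER_GRID = list(product(NBH_CUTOFFS, COR_CUTOFFS, DOM_VALUES, PNE_BINS))


def matches(pred, combo):
    """True if a prediction falls under one parameter combination (assumed rule)."""
    nbh, cor, dom, pne = combo
    pne_ok = pred["pne"] >= 3 if pne == 3 else pred["pne"] == pne
    return pred["nbh"] > nbh and pred["cor"] > cor and pred["dom"] == dom and pne_ok


def combo_accuracy(benchmark, combo):
    """Fraction of correct EC assignments among benchmark predictions in a combo."""
    selected = [p for p in benchmark if matches(p, combo)]
    if not selected:
        return None  # no benchmark predictions to judge this combination
    return sum(p["correct"] for p in selected) / len(selected)


def predictive_combos(benchmark, min_accuracy=0.7):
    """Parameter combinations whose benchmark accuracy exceeds the threshold."""
    kept = []
    for combo in PARAMETER_GRID:
        acc = combo_accuracy(benchmark, combo)
        if acc is not None and acc > min_accuracy:
            kept.append(combo)
    return kept


def high_confidence(predictions, combos):
    """Union of predictions selected by any of the retained combinations."""
    return {p["id"] for p in predictions if any(matches(p, c) for c in combos)}


# Toy data, invented for illustration only.
toy_benchmark = [
    {"id": "b1", "nbh": 0.95, "cor": 0.65, "dom": 1, "pne": 3, "correct": True},
    {"id": "b2", "nbh": 0.45, "cor": 0.15, "dom": 0, "pne": 1, "correct": False},
]
toy_orphans = [
    {"id": "o1", "nbh": 0.85, "cor": 0.55, "dom": 1, "pne": 4},
    {"id": "o2", "nbh": 0.85, "cor": 0.55, "dom": 0, "pne": 1},
]

combos = predictive_combos(toy_benchmark)
selected = high_confidence(toy_orphans, combos)
print(len(PARAMETER_GRID), len(combos), sorted(selected))
```

With the cutoffs as listed, the grid holds 6 x 6 x 2 x 3 = 216 combinations, which gives a sense of the size of the parameter space being searched.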
Referee #2 (Remarks to the Author): The authors assigned putative gene sequences to currently orphaned enzymatic activities using four scores: 1) the STRING neighborhood score between genes, NBH; 2) the STRING phylogenetic profiling co-occurrence score, COR; 3) domains unique to EC sub-subclasses, DOM; and 4) the count of adjacent pathway neighbors, PNE. These criteria were combined to assign putative sequences to 105 orphaned enzymatic functions, two of which (out of seven tested) were validated experimentally. This is an interesting study, for which the fundamental results seem sound; it is
based on a good idea and builds on previous work performing similar tasks using phylogenetic profiling and other data integration methods. The manuscript seems somewhat incomplete, though, with the computational methods in particular quite sparsely explained and their validation apparently performed in a manner likely to strongly overstate accuracy (although this is difficult to determine due to lack of well-described data and algorithms). These concerns are not so much about the study results as about its reproducibility and understandability for readers, however, and are detailed below. We thank you for these constructive remarks. Most of the concerns are addressed in the detailed response below. Moderate ---* The manuscript's wording and descriptions of the problem being solved are unclear throughout. This is particularly critical in the abstract and introduction, where phrases such as "inference of candidate sequences to orphan enzymes" or "metagenomic neighbourhood information" are difficult to understand. "Pathway context" is used without definition, likewise "projection" to genomic data and flux coupling analysis, and the distinction between enzymes and enzyme activities is extremely blurred. Consider rewording throughout for clarity. We apologize for the lack of clarity. Throughout the text, we have improved the wording and added definitions where needed. For example, we changed “inference of candidate sequences” changed to “prediction of candidate sequences”, and “projected onto genomic data” to “mapped to genes in the STRING genomes using orthology”, “pathway context” was changed to “Orphan Enzymes that operate in metabolic pathways (i.e. connected to at least one other enzyme by a common compound)”. ACTION TAKEN: The entire manuscript was re-written with an eye on clarifying vague terminology. 
* The methods for integration of the four scores, and thus for generating Figure 2, do not appear to be described in detail in either the main or supplemental methods. No supporting data regarding them are provided, and it would thus be extremely difficult to replicate the study. If I understand Figure 2A in particular, it appears that a (potentially exhaustive?) search over parameter settings for score thresholds has been performed, which would suggest a strong possibility of overfitting to training data during cross-validation and thus overestimating the 70% (and later 90%) "accuracy" figures.

We apologize for this lack of detail. In this revised manuscript we have clarified that we only estimated the accuracy for each set of parameter combinations (a specific combination of the 4 scores) as a means to identify predictive parameter sets. We then united all of the predictions from the parameter combinations with estimated accuracies over 70% to form our final ‘high confidence’ set of predictions. We did not attempt to optimize the parameters, and we did not integrate the four scores to produce one ‘overall’ score. As such, we do not think that overfitting is an issue; in addition, the number of validation examples is much larger than the limited parameter space.

In detail, for the benchmarking, 350 non-orphan enzymes were extracted from the KEGG pathway database, chosen so that the distribution of node degree (network structure) was the same as for the Orphan Enzymes. These enzymes were then treated as Orphan Enzymes, candidate sequences were generated using the computational pipeline described above, and each prediction was assigned a set of 4 scores (NBH, COR, DOM, PNE). The predictions were classified according to their 4 scores: NBH (>0.4, >0.5, >0.6, >0.7, >0.8, >0.9), COR (>0.1, >0.2, >0.3, >0.4, >0.5, >0.6), DOM (0 or 1), PNE (1, 2 or more). Then, for each combination of scoring parameters, the number of correct and incorrect EC number assignments was calculated in order to determine the accuracy of each parameter combination. In total, 100 randomized data sets were generated to benchmark the prediction pipeline. Finally, to obtain a high confidence set of candidate sequences, we took the union from all of the parameter combinations that yielded an accuracy of >70%.

ACTION TAKEN: We have now described the method for the benchmarking analysis in greater detail (See Methods: Page 14, lines 432-456 and Figure 2 Caption). In addition, throughout the
manuscript we no longer refer to the overall accuracy of our final set of predictions (previously >70% accuracy). Instead, we refer to each combination of parameters having a certain accuracy, and then the union of the sets with high accuracy is our ‘high confidence’ set of predictions.

* Particularly given that 2 out of 6 validations is indeed less than 90%, I'd suggest either performing a more formal machine learning evaluation of the integration method or avoiding the "accuracy" terminology altogether. The former could include a held-out validation set or a more sophisticated machine learner rather than thresholding for integration, for example. The latter could benefit from cross-validation expected performance in held-out data, but separating it more clearly from the later application to real, unlabeled orphan enzymes.

The lower success rate for the experimental validations is indeed a concern; however, we think that it does not invalidate the results from the benchmarking. As explained above, we determined an accuracy for each combination of parameter scores; we did not attempt to optimize the parameters. We have re-worded the manuscript to indicate that this accuracy cannot be extended to the union of all predictions, and we will just consider them our set of high confidence predictions. However, to estimate the accuracy of our final set of high confidence predictions, we calculated the accuracy (TP/(TP+FP)) of the union of all predictions from parameter combinations yielding >70% accuracy. For genomic data the mean accuracy of 100 randomizations is 87%, while for metagenomic data the mean accuracy is 73% (See boxplots below). This shows that we did not overestimate the accuracy.
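For readers wishing to reproduce the benchmarking logic described in this response, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. It scores each threshold combination by the fraction of correct EC assignments, keeps the combinations above 70%, and takes the union of their predictions. The record fields (`id`, `nbh`, `cor`, `dom`, `pne`, `correct`), the function names, and the reading of the DOM and PNE classes as exact classes (DOM equal to 0 or 1; PNE equal to 1, or 2 and above) are all assumptions made for illustration.

```python
from itertools import product

# Thresholds quoted in the response. Everything else here is hypothetical.
NBH_CUTOFFS = (0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
COR_CUTOFFS = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
DOM_CLASSES = (0, 1)
PNE_CLASSES = (1, 2)  # 2 stands for "2 or more" (assumed reading)


def passes(pred, combo):
    """Return True if a prediction falls into one parameter combination."""
    nbh_cut, cor_cut, dom_class, pne_class = combo
    return (
        pred["nbh"] > nbh_cut
        and pred["cor"] > cor_cut
        and pred["dom"] == dom_class
        and min(pred["pne"], 2) == pne_class
    )


def precision(preds):
    """TP / (TP + FP) over a list of predictions; None if the list is empty."""
    if not preds:
        return None
    true_positives = sum(1 for p in preds if p["correct"])
    return true_positives / len(preds)


def high_confidence_set(predictions, min_accuracy=0.7):
    """Union of predictions from every combination scoring above min_accuracy."""
    selected = {}
    combos = product(NBH_CUTOFFS, COR_CUTOFFS, DOM_CLASSES, PNE_CLASSES)
    for combo in combos:
        hits = [p for p in predictions if passes(p, combo)]
        acc = precision(hits)
        if acc is not None and acc > min_accuracy:
            for p in hits:
                selected[p["id"]] = p  # keyed by id so the union has no duplicates
    return list(selected.values())
```

Applying `precision` to the list returned by `high_confidence_set` gives TP/(TP+FP) for the union, which is the quantity the response reports per randomization.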
[Boxplots: accuracy (axis range 0.6 to 0.9) of the high-confidence prediction set for metagenomic data and genomic data.]

ACTION TAKEN: We now refer to the predictions from the union of all parameter sets with >70% accuracy as our high-confidence set of predictions. In addition, we estimated the accuracy for this high confidence set by counting the TP and FP predictions for the union (boxplot above; this boxplot is also included as supplementary figure 2).

* The Introduction and Conclusion are both somewhat cursory and provide little background on the research area or discussion of potential follow-up, respectively. Computational enzyme and protein function prediction is a well-studied and highly related area that deserves at least a paragraph in the introduction, STRING itself being an easy example. The Hanson et al 2011 review in the Biochemical Journal seems relevant, as does Chen and Vitkup's phylogenetic profiling method in Genome Biology 2006.

We agree and thank the reviewer for this suggestion; we now discuss computational protein function prediction in the introduction (with the specific references added), and have expanded the discussion to include potential areas for follow-up research.

ACTION TAKEN: We have added a paragraph in the introduction discussing computational protein function prediction and linking these methods to our use of them for the reverse problem (i.e.
linking sequence to function) (Introduction: Page 3 lines 52-63). We have also added a section in the Discussion regarding possible extensions of this pipeline to eukaryotes or to Orphan Enzymes that are not in metabolic pathways (Discussion: Page 11 line 351- Page 12 line 361).

In the conclusion, I'd like to see a bit more substance about why so few of such highly refined predictions validated - "many variables" isn't terribly convincing. The seven tested sequences and their results are mentioned in the supplement, but a more sophisticated discussion of the data driving their predictions and more concrete potential reasons for failure would be of interest. Is overexpression of isolated sequences in E. coli a concern?

We thank the reviewer for his/her suggestions and have expanded our discussion on possible reasons for experimental failure. Yes, over-expression of sequences in E. coli is a concern, as an enzyme can be purified in a soluble form but may become inactive during the purification process due to improper handling or exposure to unfavourable conditions such as oxygen. In addition, the proteins purified in this study were tagged with a histidine (his-tagged), and the addition of a terminal his-tag can dramatically decrease the activity of a protein (Kadas et al. 2008) or render it totally inactive (Albermann et al. 2000, Halliwell et al. 2001). In addition, we have specified the ‘many variables’ that can influence success or failure of enzymatic assays. Specifically, a given assay might become successful only after adjusting the buffer type, buffer pH, cofactors, time of incubation, temperature of incubation, or the analytical methods used. For example, in assay optimization trials for EC 2.6.1.38, when the mobile phase for the LC/MS was changed from 10mM ammonium acetate to water, the peak area of the product glutamate was increased more than 11 times. However, there is a practical limit to how many permutations of experimental conditions can be attempted, and further optimization is feasible only if the initial screening assay is close to the optimal conditions.

ACTION TAKEN: As we have separated out the discussion section, the discussion regarding the experimental ‘failures’ should hopefully be more apparent (Discussion: Page 10, line 312 to Page 11, line 334). In addition, we added a few sentences regarding difficulties in obtaining active enzymes from heterologous expression in E. coli. Also, to illustrate how complex the enzyme assay optimization procedure could be, we provided a concrete example of how changing one ‘variable’ (i.e. the eluent used for the LC/MS) increased detection of the product 11-fold.
I'd also wonder how many more predictions could be expected from more data - would doubling the number of genomes or metagenomes help, or is this "most" of what's attainable? Would assembled metagenomes be of use, rather than relying on long reads?

[Figure: rarefaction curves showing the number of predicted enzymes (y-axis, 0 to 150) against the number of genomes (1 to 300; genomic data) and the number of samples (1 to 50; metagenomic data).]
To investigate how many more predictions we could expect from more data, we performed a downsampling analysis. We randomly sampled subsets of the genomes and metagenomes, and counted how many ECs had candidate gene predictions from that genome in the ‘high confidence’ predictions (from parameter sets with >70% accuracy). From the shape of this rarefaction curve we can estimate the maximum number of high confidence predictions that we could expect. From the
metagenomic data it appears that we cannot obtain any more. However, we expect the shape of this curve to change with an increase in contig length and thus an increase in the number of genomic neighbours. Also, these curves represent lower bounds for the total number of predictable enzymes, as for this analysis we used the NBH and COR scores that were calculated from the full set of 338 genomes.

ACTION TAKEN: We performed down-sampling to estimate if we can predict more candidate sequences given more genome or metagenome sequences.

* The "Arctic" metagenomic data used in this study is listed as "unpublished," with no resource provided for obtaining the data or replicating the study? No accession numbers or direct information are provided for any of the metagenomes, which are presumably not stored in as central a location as STRING.

In the methods section we have now listed references for all of the metagenomes provided. These references indicate accession numbers that can be used to download the reads from NCBI. In addition, the assembly pipeline that we used is publicly available for those wishing to duplicate the analyses.

ACTION TAKEN: Details related to the metagenomes were added in the methods section (Page 12 lines 371-387) and an additional supplementary table 1.

Minor
---
* As above, in Figure 1 and accompanying text, the statement that "sequences were predicted with >70% accuracy" is a bit misleading, since A) subsets of sequences were predicted with >70% precision, not accuracy, and B) this was evaluated (perforce) in held-out cross-validation data, not in the orphan enzymes being described in Fig. 1.

Throughout the text we have modified the terminology related to the estimated accuracy.

ACTION TAKEN: We have edited the main text as well as Figure 1 and the corresponding legend to indicate that we benchmarked the accuracy for certain combinations of the 4 scores, and refer to the final set as the ‘high confidence’ set of predictions.
* The statement that "functionally associate genes are usually clustered in conserved operon structures" is only true in some (albeit many) organisms, even among bacteria.

We have toned down this sentence. It now reads, “as neighbouring prokaryotic genes are often involved in the same metabolic pathway”.

ACTION TAKEN: Sentence re-worded.

* It's unclear to me what data backs the accuracy value in "candidate genes were predicted (with >80% accuracy) for Phenylpyruvate decarboxylase".

As mentioned above, we hope to have clarified the statements regarding accuracy. The sentence is now worded as follows: “Furthermore, candidate genes were predicted for Phenylpyruvate decarboxylase (EC 4.1.1.43), using a parameter combination with 80 % accuracy, that converts Phenylpyruvate to phenylacetaldehyde, which is the first and crucial step in the synthesis of branched-chain higher alcohols as biofuels (Atsumi et al, 2008).”

ACTION TAKEN: Sentence re-worded.

* "unculturable bacteria" is not a particularly accurate description of metagenomic data.
We thank the reviewer for his/her comment. The sentence has now been re-written as: “However, for 13 Orphan Enzymes we found candidate sequences only in metagenomic data, exemplifying the ability of this pipeline to detect sequences from bacteria in environmental samples”.

ACTION TAKEN: Sentence re-worded.

* In Figure 3C, the title should be "Novelty"

We thank the reviewer for this comment.

ACTION TAKEN: We have corrected Figure 3.
Referee #3 (Remarks to the Author): In this manuscript, the authors propose a methodology for associating orphan enzyme functions with gene sequences. This is a very important objective, with the potential to impact many areas of biological research via the substantial improvement of annotations. I found the article interesting, and the work presented will most certainly have an impact on my own work. I suspect many other readers will feel the same. Modelers will use this data to improve models, bioinformaticists will use the simple approach to associate new gene sequences with function, and wetlab biologists may attempt to experimentally validate more of the targets indicated. The ideas behind the research (exploiting pathway co-localization on the genome for annotation) are not novel. The large-scale application of these ideas to link orphaned enzymes to possible gene sequences is novel. Bringing metagenomic data into the analysis is also interesting. The fact that the authors experimentally validated some of their orphan assignments certainly adds impact to the work. The statements about metabolic modeling were well put, and they demonstrate the importance of this kind of work. The work appears to have sufficient impact for MSB because it is the first wide-scale systematic study of this nature. It would be nice to see this technology integrated into an annotation framework like the SEED so it can be applied repeatedly as the number of sequenced genomes and metagenomes increases in time. The work is well written and presented in the manuscript. There are few grammatical errors, and the figures are visually compelling and clear. I do have a number of concerns that I feel must be addressed before this manuscript is ready for publication:

We thank the reviewer for his/her constructive feedback. Most of the concerns are addressed in the detailed response below.

1.) The authors say the following on page 5 of their manuscript: "Some of these 26 share sequence homology with our predicted candidates and the others that do not, may represent alternative orthologous groups catalyzing the same reaction, as about 70% of the EC numbers in KEGG are encoded by more than one orthologous group." Why not indicate the specific number of candidates that had homology in the manuscript? Why force the reader to dig for it in the supplementary material? I looked up the table in the supplement, and it appears as though 17/26 had homology, 6 did not have homology, and three were "promiscuous". It would make sense to report these numbers in the manuscript. It would also be helpful if the authors offered an explanation as to what they mean by "promiscuous" in this table.

We thank the reviewer for this suggestion. We have added these numbers to the main manuscript (Page 6 lines 138-152): “For 17 of the 26 (65%) database sequences there was homology to sequences from EC numbers that agreed up to at least the 1st digit.” In addition, we edited the supplemental table: we removed the term promiscuous and now indicate the exact number of digits on which the 2 EC numbers agree.

ACTION TAKEN: We added these numbers to the main manuscript, and edited the supplemental table.
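The count of agreeing EC digits mentioned in this response can be illustrated with a short Python sketch. The function name and the treatment of an unassigned level (`-`) as the end of the agreement are assumptions made for illustration; they are not taken from the manuscript or its supplemental table.

```python
def ec_digits_in_common(ec_a, ec_b):
    """Count the leading EC levels (0 to 4) on which two EC numbers agree.

    EC numbers are dot-separated strings such as "4.1.1.43". Comparison
    stops at the first level that differs or is unassigned ("-").
    """
    count = 0
    for level_a, level_b in zip(ec_a.split("."), ec_b.split(".")):
        if level_a != level_b or level_a == "-":
            break
        count += 1
    return count
```

Under this reading, two EC numbers "agree up to at least the 1st digit" when the function returns 1 or more.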
2.) Did the authors attempt to "assemble" metagenomes prior to analysis? Assembly would produce much longer contigs that would be far more amenable to this type of analysis compared with the raw reads themselves. Assembly would also enable the analysis of Illumina metagenomes where sequencing depth and potential sequence diversity would be much higher. Overall, it is surprising to see so few hits arising from metagenomic data. Consider that the authors only included ~300 genomes in the "single genome" analysis. A single metagenome could easily represent more than 300 genomes worth of sequence. I would suggest a few comments in the manuscript along these lines:

-please indicate if any assembly was attempted prior to metagenome analysis

The reviewer is correct that assembly will produce much longer contigs. We did actually assemble the metagenomic reads to produce contigs, but perhaps a sentence mentioning long reads suggested that we did not. We have now added detail concerning the assembly in the methods. In the discussion we also detail why there were fewer candidate sequences from the metagenomes. The main reason is that most contigs only contain 2 or fewer neighbouring genes, thereby limiting the number of neighbour genes with enzymatic functions on a metagenomic contig. In addition, we have edited Figure 3 to indicate the significant number of candidate sequences that were obtained from both metagenomic and genomic data.

ACTION TAKEN: We now discuss details concerning assembly of the metagenomes in the methods (Pg 12 lines 371-387). We have also added discussion that the low number of neighbouring genes on the metagenomic contigs limits the number of genes with pathway information (Discussion: Page 10 lines 294-311). In the supplement we have also provided details on the contig length distribution (supplementary figure 8), and number of genes on each contig (supplementary figure 7). In figure 3 we have also added a pie chart to illustrate the significant number of candidate sequences that were obtained from both metagenomic and genomic data.

-authors should comment on the "nature" of the rank abundance curve for the metagenomes used in analysis (how diverse was your metagenome?)

We thank the reviewer for the suggestion to expand the discussion of the nature of the metagenomic samples. There is quite a large amount of inter-metagenome diversity between each of the gut and ocean metagenomes. However, we can make a few general statements about the diversity of the microbial communities. For the 37 gut metagenomes (which each represent an individual) the communities are quite diverse. The two most abundant genera are on average 10 to 5 % of the population, accompanied by a long tail of less abundant organisms (Arumugam et al 2010). If we look at the species level, the gut metagenomes are estimated to contain 1000 species with at least 160 species in each individual (Qin et al 2010). So looking at the genome level there will of course be many more entities, thus explaining why such a small number of sequences were assembled into long contigs. For the Ocean metagenomes, the 18 samples from the Global Ocean Sampling Expedition are probably even more diverse, with very few organisms present at 1% of the data. Most of the taxonomic classification of this data is given at the level of order or phylum (even higher than genus). But most samples are dominated by a few orders which make up 30%.

ACTION TAKEN: We have added the above into the discussion (see Discussion Page 10 lines 294-311).

-can authors indicate the average read length and net amount of sequence data represented in the metagenomes used for analysis?

Thank you for the suggestion.

ACTION TAKEN: We have provided an additional supplemental table that details the amount of sequence data for each metagenome. In addition, we prepared 2 additional figures, a histogram of the contig lengths, and a plot of the number of genes per contig (supplemental figures 7 and 8).
3.) This approach has been applied for a long time in order to assign functions to genes (albeit, until this paper, this effort was done manually and not systematically). In fact, many of the orphan enzymes that remain, remain specifically because their associated genes are not co-localized with their pathways on the genome. This is the first place biologists would have looked when attempting to identify the gene that corresponds with the orphan function. As a result, it's not clear that the authors can extend the success rate of this approach in correctly identifying the gene sequence for non-orphan enzymes to be the predicted success rate for analysis of orphan enzymes. I would expect the success rate to be far below 70%, and frankly, it would still be a resounding success if only 20% of the predictions were correct. Yet the authors repeatedly indicate that their method is expected to be 70% accurate, while their experimental success rate was only 33% accurate. I would suggest the authors include a more thorough and forthright discussion of expected accuracy, and include some indication of why success may be lower than that indicated by the benchmarking study. Simply stating that "the benchmark accuracy was 70%, and the experimental validation success rate was 33%, and the true accuracy of their algorithm's predictions is probably between these two" would be a more justifiable statement of expected accuracy.

The reviewer’s comment that many orphans remain because their associated genes are not co-localized with their pathways on the genome is a valid point. However, our prediction method is only designed to find Orphan Enzymes that have pathway and genomic neighbours. Therefore, our accuracy estimations are only valid for enzymes that are co-localized with their pathway neighbours. At the moment our method cannot be extended to detect candidate sequences for Orphan Enzymes that do not have pathway neighbours. However, we now discuss possible ways to modify the method to capture these non-pathway associated Orphans. Other types of gene-association relationships might be used to identify genes that might be genomic neighbours of the Orphan Enzyme genes. In addition, in light of all of the previous comments, we have clarified that our benchmarking does not allow us to determine the overall accuracy of our method; rather, the benchmarking estimated the accuracy for certain combinations of parameters, allowing us to define a set of high confidence predictions from these sets of parameter combinations.

ACTION TAKEN: We have added a section in the discussion speculating how we can extend the method to predict candidate sequences for non-pathway associated Orphans (Discussion: Page 11 line 351- Page 12 line 361).

Overall, with some revisions responding to the above comments, I would recommend this manuscript for publication.