SPECIAL ARTICLE
Guidelines for Reporting and Using Prediction Tools for Genetic Variation Analysis
Mauno Vihinen∗ Department of Experimental Medical Science, Lund University, Lund, Sweden
Communicated by Barend Mons
Received 7 June 2012; accepted revised manuscript 17 October 2012.
Published online 21 November 2012 in Wiley Online Library (www.wiley.com/humanmutation). DOI: 10.1002/humu.22253
ABSTRACT: Computational prediction methods are widely used for the analysis of human genome sequence variants and their effects on gene/protein function, splice site aberration, pathogenicity, and disease risk. New methods are frequently developed. We believe that guidelines are essential for those writing articles about new prediction methods, as well as for those applying these tools in their research, so that the necessary details are reported. This will enable readers to gain a full picture of technical information, performance, and interpretation of results, and will facilitate comparisons of related methods. Here, we provide instructions on how to describe new methods, report datasets, and assess the performance of predictive tools. We also discuss which details of predictor implementation are essential to report. Similarly, the guidelines for the use of predictors provide instructions on what needs to be delineated in the text, as well as how researchers can avoid unwarranted conclusions. They are applicable to most prediction methods currently in use. By applying these guidelines, authors will help reviewers, editors, and readers to more fully comprehend prediction methods and their use.
© 2012 Wiley Periodicals, Inc. Hum Mutat 34:275–277, 2013.

KEY WORDS: pathogenicity; mutation; variation; genome; informatics; prediction; benchmark

Introduction

Numerous computational methods are available and widely used in practically all areas of biology and biomedicine, including genome variation analysis. Such methods are constantly being developed because they are in great demand: experimental methods, including next-generation sequencing, produce datasets so large that their analysis is often feasible only through computational approaches. Predictors can be very useful, provided that details related to their development and use are carefully documented to facilitate the reproducibility of predictions. It is essential for readers to understand how the predictions were performed and on which data they were based, in order to appreciate what was done and how the tools were utilized, what the results mean, and how reliable they are. Thus, methods and their use need to be properly documented in publications.

Human Mutation and other related journals continually receive manuscripts that report novel prediction methods as well as present applications of such methods to data analysis. Many novel methods have been published under the Informatics section, and new submissions are welcome. Because descriptions of predictive methods are of varying quality and detail, it has become necessary to provide guidelines for authors (see Boxes 1 and 2). Adherence to these standards will help reviewers, editors, and readers to gain a more complete picture of technical details, performance, and the interpretation of results, and will facilitate comparison with related methods. It is not possible to provide detailed guidelines for all kinds of prediction methods, but these recommendations apply to most such publications, including prediction of variant effects and pathogenicity, disease risk, splice site aberrations, and protein function. The goal of these guidelines is to provide a framework for adequate and sufficient reporting without restricting or preventing the use and development of predictors. When these guidelines are followed, the Methods and Results sections of manuscripts will not be significantly longer, but their value will be greatly improved.

Guidelines for Authors of Articles that Report Prediction Methods

General Requirements

These instructions are for those who report novel prediction tools or improvements to existing ones (Box 1). We feel that submissions describing new types of prediction methods are most valuable when they support variation-related studies. The described method should have a performance level high enough for practical application in a biological and/or medical setting. If authors are reporting an update to a previously described method, its performance needs to have a significant advantage over earlier methods or versions of the same method. Articles describing the use of novel parameters and approaches are also valuable to the research community. Methods should be reliable, of high quality, reproducible, appropriate to the task, and properly documented.
∗Correspondence to: Mauno Vihinen, Department of Experimental Medical Science, Lund University, BMC D10, Lund SE-22 184, Sweden. E-mail: [email protected]

Method Description
Authors presenting new methods have to pay special attention to describing their approach in such detail that it would be possible to repeat the work. This is a general requirement in journals; however, it is often not fully enforced. Full details are necessary, for example, for others in the field to be able to compare their methods with the new one. Detailed information is required for all steps of method development, including, but not limited to, the rationale of the approach, the description of the computational algorithm and approach, the datasets used, feature selection, and the optimization of the program.
Box 1. Publication Guidelines for Prediction Method Developers

(a) General Requirements
- The method should be novel and have reasonable performance,
- or have clearly improved performance compared with the previous version and/or related methods,
- or describe novel useful parameters or applications and have reasonable performance.

(b) Method Description
- The method has to be described in sufficient detail;
- depending on the approach, the following information should be provided: description of training, optimization, input parameters and their selection, details of applied computational tools, and hardware requirements; and
- estimation of time complexity.

(c) Datasets
- Description of training and test datasets;
- benchmark datasets should be used, if available;
- use the largest available benchmark dataset;
- description of how data were collected, quality controls, details of sources;
- distribution of datasets is recommended, preferably via a benchmark database; and
- report sizes of datasets.

(d) Method Performance Assessment
- Report all appropriate performance measures;
- in the case of a binary classifier, use all six measures characterizing the contingency matrix; if possible, perform receiver operating characteristics (ROC) analysis and provide the AUC value;
- perform statistically valid evaluation with sufficiently large and unbiased dataset(s) containing both positive and negative cases;
- use cross-validation or other methods to separate training and test sets;
- training and test sets have to be disjoint;
- if the dataset is imbalanced, make sure to use appropriate methods to mitigate this; and
- perform comparison with related methods.

(e) Implementation
- Detailed description of the method, including user guidelines and examples (can be provided on a website);
- availability of the program, either for download or as a web service;
- batch submission possibility highly recommended; and
- the program should preferably be open source.
Box 2. Publication Guidelines for Prediction Method Users

(a) Users
- The method has to be appropriate to the task.
- The applied method should have proven performance.
- It may be beneficial to use several methods (if available).
- Report method details: citation, URL, version, used parameters, and program options.
- Additional information used in the analysis, for example, a user-generated multiple sequence alignment, should be published, for example, in supplementary material.
- Include P values, confidence intervals, and all other reliability measures provided by the tool.
- Be careful with data interpretation.
- Before using prediction results, understand the principle of the method, its use, limitations, and applications.
Datasets

Special requirements apply to machine learning methods, including a detailed description of the method and its parameterization as well as of the datasets used for training and testing, which have to be disjoint. Benchmarks are standard datasets that can be used to compare different methods. If established benchmark datasets are available in the application domain, they should be used at least for reporting the method performance and possibly also for training. The largest dataset available should be used to avoid random effects with small sets. Some benchmarks are available in bioinformatics, including, for example, multiple sequence alignments and variation effects in VariBench [http://structure.bmc.lu.se/VariBench; Nair and Vihinen, 2012; Raghava et al., 2003; Thompson et al., 1999; Van Walle et al., 2005]. For a more comprehensive list, see http://structure.bmc.lu.se/benchmark_datasets. If new datasets are generated for method training/testing, it is recommended that they be made available, preferably via benchmark services or as a supplement to the article. Dataset distribution exclusively on the author's web page is not recommended because of the lack of guarantees for long-term availability. The selection criteria and methods used for collecting datasets need to be provided in detail, including database versions, criteria, and the numbers of cases remaining after each step of selection.
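For illustration, dataset collection can be documented as a simple audit trail that records the source database and version, each selection criterion, and the number of cases remaining after each step. The following minimal Python sketch uses hypothetical filtering steps; the database name, criteria, and case counts are placeholders only.

# Hypothetical audit trail of dataset selection; the database, version,
# criteria, and case counts below are illustrative placeholders.
selection_steps = [
    ("Variants retrieved from benchmark database, version 2012_10", 25000),
    ("Removed entries without experimental validation", 18500),
    ("Removed duplicates at the protein level", 17200),
    ("Split into disjoint training (70%) and test (30%) sets", 17200),
]
for description, n_cases in selection_steps:
    print(f"{description}: {n_cases} cases remaining")

Reporting the counts at each step allows readers to reproduce the filtering and to judge whether the final dataset is large and unbiased enough for the intended evaluation.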
Method Performance Assessment

The method performance needs to be delineated in full. It is not sufficient to report only those measures that show good performance. An extreme case would be a classifier predicting all cases into one class, which would have perfect sensitivity but zero specificity. In the case of binary classifiers, altogether six measures describing the resulting contingency matrix (also called a confusion table or matrix) of observed results should be reported [discussed in detail in Vihinen, 2012] (see the sketch following this paragraph). If possible, that is, when there is a sufficient number of cases, ROC analysis should be performed. If the method has more than two outputs, further measures need to be reported [for instructions, see Baldi et al., 2000; Vihinen, 2012]. The performance has to be reported in relation to other methods in the field, and it must be based on a large enough dataset for the results to be statistically valid. As space is often limited, the full performance details may be more appropriately submitted as supplementary files.
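For illustration, the six contingency matrix measures discussed by Vihinen (2012), namely sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and the Matthews correlation coefficient (MCC), can be computed directly from the numbers of true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP). The Python sketch below uses invented counts purely as an example; if continuous prediction scores are available, ROC analysis and the AUC can be obtained with standard libraries (e.g., roc_auc_score in scikit-learn).

import math

def binary_performance(tp, fn, tn, fp):
    # Six measures characterizing a binary contingency (confusion) matrix.
    sensitivity = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    ppv = tp / (tp + fp)                         # positive predictive value
    npv = tn / (tn + fn)                         # negative predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "PPV": ppv, "NPV": npv, "accuracy": accuracy, "MCC": mcc}

# Invented counts for a hypothetical test set of 100 cases.
print(binary_performance(tp=40, fn=10, tn=35, fp=15))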
When testing the method performance, the same cases cannot be used for method training (if applicable) and testing. A common approach is to partition the dataset and repeat training and testing for different partition combinations, an approach called cross-validation. Other options are, for example, random sampling and the leave-one-out test. The numbers of positive and negative cases should be balanced for certain performance measures to be meaningful [Vihinen, 2012]. Several approaches are available to handle imbalanced datasets [see, e.g., Chawla, 2010].
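A minimal sketch of such an evaluation, assuming Python with scikit-learn and synthetic data in place of real training material, is shown below. Stratified k-fold cross-validation keeps training and test folds disjoint while preserving class proportions, and class weighting is one simple way to reduce the effect of imbalance; resampling strategies are discussed by Chawla (2010).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: X holds feature vectors, y holds class labels
# (0 = neutral, 1 = pathogenic); real features would come from the predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 2, size=200)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    # Training and test folds are disjoint by construction.
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(matthews_corrcoef(y[test_idx], model.predict(X[test_idx])))

print("MCC per fold:", [round(s, 2) for s in scores])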
Implementation

If the method is computationally intensive, estimates of time requirements for real-life examples must be provided (see the sketch below). The reported time should reflect the computational time if the program source code or executable files are provided, or the average response time of the service. If the service implementation allows batch submission, this should be mentioned. The developed method must be available to users as a web application and/or be downloadable for local installation. The program code should preferably be open source. The service should have an easy-to-use interface so that even those who access it only sporadically can obtain predictions easily. The method and the program have to be properly documented so that end users can benefit from them, that is, with user guidelines that include descriptions of, for example, the effects, use, and theoretical basis of adjustable program parameters.
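As one possible way to measure and report running times, the Python sketch below times a hypothetical predict() function over a batch of illustrative variant identifiers; the function and identifiers are placeholders, not a real tool.

import time

def predict(variant):
    # Placeholder for the actual prediction call of a hypothetical tool.
    time.sleep(0.001)
    return "neutral"

variants = [f"NM_000000.0:c.{i}A>G" for i in range(1, 1001)]  # illustrative batch

start = time.perf_counter()
results = [predict(v) for v in variants]
elapsed = time.perf_counter() - start
print(f"{len(variants)} variants in {elapsed:.1f} s, "
      f"{elapsed / len(variants) * 1000:.2f} ms per variant on average")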
Guidelines for Authors of Articles that Use Previously Published Prediction Methods

It is equally important for predictor users to document the use of the tools (Box 2) so that readers can evaluate the presented results as well as compare their own results with those in publications. Users of prediction methods should understand the principles and concepts of the applied method, and not just use it as a black box. Statements of blind belief in predictor results can be seen in manuscripts and even in the published literature. Predictions have to be critically evaluated and discussed, not taken as plain truth, so users need to appreciate the limitations of the method as well. Sufficient details of the methods used are essential for the reader to be able to interpret the meaning of the observed results and to estimate their significance.

Authors should pay attention to predictor selection. In many areas, systematic comparisons of prediction method performance have been published, and these can be used as guidelines for method selection. Sometimes concurrent use of several methods may be beneficial. The chosen method should suit the analysis. When reporting results, details about the parameters, features, and settings of the program, as well as details of additional datasets, including version numbers, must be provided (one possible format for recording these details is sketched below). It is not sufficient to state that default parameters were used, as defaults may differ between program versions. In addition, the relevant features of the results need to be reported, if not in the main text, then in supplementary material. Conclusions should be supported by the results, and should neither be speculative nor extrapolate beyond the tested range of validity of the method. Predictions can only be proved by experimental tests.
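One possible way to meet these reporting requirements is to capture the exact settings of each prediction run in a machine-readable record and deposit it as supplementary material. The Python sketch below is illustrative only; the tool name, version, URL, and parameter values are hypothetical placeholders.

import json

# Hypothetical record of a prediction run; all values are placeholders.
run_record = {
    "tool": "ExamplePredictor",
    "version": "2.1.3",
    "url": "http://example.org/predictor",
    "accessed": "2012-10-15",
    "parameters": {"substitution_matrix": "BLOSUM62", "score_cutoff": 0.05},
    "input_alignment": "supplementary_file_S1.fasta",
}
print(json.dumps(run_record, indent=2))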
References

Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H. 2000. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16:412–424.

Chawla NV. 2010. Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L, editors. Data Mining and Knowledge Discovery Handbook. 2nd ed. New York: Springer; pp. 875–886.

Nair PS, Vihinen M. 2012. VariBench: a benchmark database for variations. Hum Mutat. [Epub ahead of print.]

Raghava GP, Searle SM, Audley PC, Barber JD, Barton GJ. 2003. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 4:47.

Thompson JD, Plewniak F, Poch O. 1999. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15:87–88.

Van Walle I, Lasters I, Wyns L. 2005. SABmark—a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 21:1267–1268.

Vihinen M. 2012. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics 13(Suppl 4):S2.