HaploGrep: a fast and reliable algorithm for ... - Semantic Scholar

8 downloads 4738 Views 504KB Size Report
The Client is implemented in JavaScript as HTML5-based rich internet ... a polymorphism as ''local private mutation,'' when it is not associated with the current ... The classification of haplogroups and the storage of several results per sample is ...
INFORMATICS

Human Mutation OFFICIAL JOURNAL

HaploGrep: A Fast and Reliable Algorithm for Automatic Classification of Mitochondrial DNA Haplogroups

www.hgvs.org

Anita Kloss-Brandsta¨tter,1y Dominic Pacher,2y Sebastian Scho¨nherr,1,2y Hansi Weissensteiner,1y Robert Binna,2 Gu¨nther Specht,2 and Florian Kronenberg1 1

Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck,

Austria; 2Department of Database and Information Systems, Institute of Computer Science, University of Innsbruck, Innsbruck, Austria

Communicated by Pui-Yan Kwok Received 5 July 2010; accepted revised manuscript 16 September 2010. Published online in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/humu.21382

ABSTRACT: An ongoing source of controversy in mitochondrial DNA (mtDNA) research is based on the detection of numerous errors in mtDNA profiles that led to erroneous conclusions and false disease associations. Most of these controversies could be avoided if the samples’ haplogroup status would be taken into consideration. Knowing the mtDNA haplogroup affiliation is a critical prerequisite for studying mechanisms of human evolution and discovering genes involved in complex diseases, and validating phylogenetic consistency using haplogroup classification is an important step in quality control. However, despite the availability of Phylotree, a regularly updated classification tree of global mtDNA variation, the process of haplogroup classification is still time-consuming and error-prone, as researchers have to manually compare the polymorphisms found in a population sample to those summarized in Phylotree, polymorphism by polymorphism, sample by sample. We present HaploGrep, a fast, reliable and straight-forward algorithm implemented in a Web application to determine the haplogroup affiliation of thousands of mtDNA profiles genotyped for the entire mtDNA or any part of it. HaploGrep uses the latest version of Phylotree and offers an all-in-one solution for quality assessment of mtDNA profiles in clinical genetics, population genetics and forensics. HaploGrep can be accessed freely at http:// haplogrep.uibk.ac.at. Hum Mutat 31:1–8, 2010. & 2010 Wiley-Liss, Inc. KEY WORDS: mitochondrial DNA; haplogroup; Phylotree; quality assurance

Introduction Human mitochondrial DNA (mtDNA) is routinely analyzed in various disciplines, such as medical genetics (searching for Additional Supporting Information may be found in the online version of this article. y

These authors contributed equally to this work.

Correspondence to: Anita Kloss-Brandsta¨tter, Innsbruck Medical University

Division of Genetic Epidemiology, Scho¨pfstr. 41, Innsbruck, 6020 Austria. E-mail: [email protected] Contract grant sponsor: The O¨sterreichische Nationalbank; Contract grant number: 13059 (to A.K.B.); Contract grant sponsor: The ‘‘Genomics of Lipid-Associated Disorders-GOLD’’ of the Austrian Genome Research Programme GEN-AU (to F.K.).

pathogenic mutations) [Durham et al., 2007; Elliott et al., 2008], population genetics (studying evolutionary patterns in human populations) [Kivisild et al., 2004; Pereira et al., 2009; QuintanaMurci et al., 2010], and forensic genetics (targeting degraded remains) [Biesecker et al., 2005; Ivanov et al., 1996]. The strict maternal inheritance of mtDNA results in a natural grouping of sequence haplotypes into monophyletic clusters, referred to as haplogroups. The members of a haplogroup carry a specific sequence motif as a consequence of sharing a common ancestor. All human mtDNA haplotypes can be traced back to a common matrilineal ancestor [Behar et al., 2008; Macaulay et al., 2005], and by assuming a sequential accumulation of mutations along maternally inherited lineages, all mtDNA haplogroups can be represented in a single phylogenetic tree. Currently, the most comprehensive overview of the human mtDNA tree is reflected in Phylotree (http:// www.phylotree.org), a regularly updated classification tree of global mtDNA variation considering all currently available whole mtDNA genome sequences [van Oven, 2010; van Oven and Kayser, 2008]. For population genetic inferences, the identification of haplogroup affiliation of members of a population sample is a crucial prerequisite for further investigations and conclusions on the demographic impact on human migrations. For clinical genetics, knowledge of the mitochondrial haplogroup can aid in finding disease-associated mutations, especially in case–control study settings in order to control for confounding due to mitochondrial population stratification [Biffi et al., 2010], and can help to distinguish them from either phylogenetically informative sites or sequencing artifacts. It has been found that sample mixup, contamination, biochemical problems, and transcription errors can lead to ‘‘phantom mutations’’ [Brandsta¨tter et al., 2005], which in turn resulted in false associations and erroneous conclusions [Bandelt et al., 2004, 2007a]. As a consequence, guidelines concerning mtDNA genotyping were issued [Salas et al., 2005a]. These especially recommend double evaluation of mtDNA sequences, checking for phylogenetic consistency using haplogroup classification and the construction of quasi-median networks for the detection of phantom mutations [Bandelt et al., 2009; Brandsta¨tter et al., 2006a]. However, despite the online availability of Phylotree, a researcher who wants to determine the haplogroup affiliation of certain samples still has to manually compare the polymorphisms found in a population sample to those summarized in Phylotree, polymorphism by polymorphism, sample by sample. This process is in most cases time-consuming, long-winded, and error-prone. A software tool for inferring the haplogroup of a (set of) sample(s) in an easy, straight-forward and automated way would therefore be of immense practical importance.

& 2010 WILEY-LISS, INC.

We therefore developed HaploGrep, a Web application designed to determine the most probable haplogroup(s) of mtDNA haplotypes from the entire mitochondrial genome or any part of it. Using this application will significantly reduce the error rate and the amount of time needed for the preparation of those data for downstream quasi-median network analyses. Ultimately, HaploGrep yields an all-in-one solution for quality assessment of mtDNA profiles in clinical genetics, population genetics, and forensics.

Methods Incorporation of Phylotree Phylotree [van Oven and Kayser, 2008] is a clearly structured phylogenetic representation of the human mtDNA tree, with the potential to be re-coded into a computer-processable format like XML (Extensible Markup Language; http://www.w3.org/XML). Phylotree is stored the binary XLS format and provided for users as a HTML file. We designed a three-step model to generate the XML-version of Phylotree: (1) in a preprocessing step the provided XLS file is checked by VBA scripts to assure data integrity, because Phylotree is maintained in a manual fashion; (2) haplogroups, corresponding profiles, accession numbers, as well as references are extracted from the Excel file and encoded into an XML file, whereby the tree structure of Phylotree is preserved. As a general rule, mtDNA haplotypes are reported as a list of nucleotide positions that deviate from the revised Cambridge Reference Sequence (rCRS) [Andrews et al., 1999]. The rCRS is set as the base for the XML tree and the haplogroups are appended recursively; (3) this XML file is used to determine the phylogenetic weight of each polymorphism (see below). This procedure is performed as a service by our group each time a new version of Phylotree is published, which on average occurs every 6 months, to ensure that HaploGrep works with the most up-todate knowledge of the human mtDNA phylogeny.

Determination of Phylogenetic Weights HaploGrep’s approach to determine the most likely haplogroup includes the calculation of a phylogenetic weight for every mutation. As suggested by Soares et al. [2009], a site-specific mutation rate is needed to determine correct and reliable haplogroups. In particular, this rate is necessary if two haplogroups differ by only one single mutation. The weight is computed by traversing our XML representation of Phylotree to obtain the absolute number of occurrences per mutation in the entire tree. These numbers are used for calculating the weights of mutations needed by the HaploGrep algorithm by executing the following steps: (1) the mutations are sorted descendingly by their number of occurrences in the tree (e.g., m.152T4C106; m.16311T4C—69; m.16189T4C—67; etc.); (2) the highest value plus 1 is taken as a reference value (e.g., 10611 5 107); (3) the weight is obtained by subtracting the number of occurrences from the reference value (e.g., the weight of m.152T4C was 107106 5 1; weight of m.16311T4C: 10769 5 38, m.16189T4C: 10767 5 40, etc.). As a final result, hotspot mutations with a high number of occurrences in the tree have a low phylogenetic weight, while stable mutations with only a few occurrences in the tree have a high phylogenetic weight. Super hot spots were not taken into consideration for the HaploGrep algorithm by assigning weight 0: m.523_524delAC, m.16182A4C, m.16183A4C, m.16519T4C, and C-insertions as m.309, m.315,

2

HUMAN MUTATION, Vol. 31, No. 0, 1–8, 2010

and m.16193. Supp. Table S1 summarizes the occurrences of mutations in Phylotree and their corresponding weights in HaploGrep.

Main Algorithm HaploGrep calculates a ranked list of relevant haplogroups for each given sample. To create this list, HaploGrep goes through all polymorphisms within a given sequence range from the human mtDNA haplogroup tree starting at the rCRS. The sequence range can include the entire mtDNA genome or any part of it, even single SNPs, in any combination of sequenced fragments and genotyped SNPs (e.g., 16024–1636–5; 70–340; 2706; 7028; 14766). For every haplogroup in Phylotree the adequate rank rhg is then calculated according to: Pk Pk 1 wi 1 wi Pi¼1 rhg ¼  Pi¼1 1 ;  m n 2 2 i¼1 wi i¼1 wi where wi is the phylogenetic weight of the ith polymorphism, k is the number of polymorphisms that were observed in the sample from the currently tested haplogroup, m is the total number of polymorphisms expected in the analyzed range of the currently tested haplogroup, and n is the total number of polymorphisms in the sample. The formula shows that the rank calculation consists of two main components. The first part calculates the weighted ratio of haplogroup-associated polymorphisms that were found in the sample to the total number of haplogroup-associated polymorphisms in the currently tested haplogroup within the analyzed range. The second part reflects the ratio of polymorphisms in the test sample that are associated with the haplogroup under investigation to the total number of polymorphisms found in the test sample. This second component assures that a result will be ranked higher if it uses as many polymorphisms in the sample as possible. The two components are weighted equally and form the final overall rank rhg. The resulting list is sorted according to these ranks to present the user the best results first. For demonstration, we want to give an example: assume that a sample was sequenced for the entire control region (nucleotide positions 16024–16569; 1–576) and genotyped for 45 coding region SNPs as described by Brandsta¨tter et al. [2006b]. It showed the following profile (with corresponding weights): m.263G (105), m.309_310insC (0), m.315_316insC (0), m.750A4G (106), m.951G4A (103), m.8860A4G (106), m.15326A4G (106), m.16354C4T (101), and m.16519T4C (0). The polymorphisms m.309_310insC, m.315_316insC, and m.16519T4C were hot spots and therefore not taken into consideration (weighted with 0). For every haplogroup the rank was calculated according to HaploGrep’s algorithm: haplogroup H2a1 was rated best because all ‘‘diagnostic’’ polymorphisms for the haplogroup were found in the sample [(10511061103110611061101)/(105110611031 10611061101) 5 1] and none of the sample’s polymorphisms remained unexplained or ‘‘private’’ [(10511061103110611061 101)/(10511061103110611061101101010) 5 1]. Therefore, the overall rank for H2a1 was rH2a1 5 0.51.010.51.0 5 1.0. For example, the third best hit was haplogroup H2a, where the value for the first part of equation was still 1.0 but because two polymorphisms (m.951G4A, m.16354C4T) remained unexplained in the sample’s profile, the second part of the equation was (105110611061106)/(10511061103110611061 101101010) 5 0.6746, thus leading to a final rank of rH2a 5 0.5  1.010.5  0.6746 5 0.837.

Architecture

Workflow and Graphical User Interface

HaploGrep was designed as a Web application based on a clientserver architecture. This allows the separation of the core application from the user interface and assures that the user always works with the latest version of HaploGrep. The core of HaploGrep is implemented as a java library. REST was used for the client-server communication, implemented within the restlet framework [Fielding, 2000], thus putting all computational tasks on server side. The results of these tasks are provided as REST-resources using JSON (http://www.json.org) and XML as standard exchange formats. A server side session object is created for every new request and identified through an encrypted session key (please see ‘‘Security management’’); all data are stored in an in-memory hash table, thus increasing the speed of HaploGrep significantly, because all operations are exclusively executed in main memory. Furthermore, this architecture utilizes current multicore architectures and allows the handling of concurrent requests efficiently. The Client is implemented in JavaScript as HTML5-based rich internet application, communicating through asynchronous HTTP requests (AJAX) with the REST-resources. Using JSON as a transport medium within the JavaScript client allows a lightweight architecture and reduces the overhead. The user interface was implemented with the EXT-JS JavaScript framework (http://www.extjs.com). The JavaScript InfoVis Toolkit (http:// thejit.org) was used for the visualization of Phylotree.

The import file for HaploGrep is a simple text file (with the file extension .hsd) consisting of several columns: the first column contains the sample identifier (ID), the second column specifies the targeted genetic region (any sequenced range and any genotyped SNP, in any combination), the third column can include the assumed haplogroup status (or left blank if not specified), followed by the polymorphisms separated by tab stops (an example can be downloaded from HaploGrep’s homepage). This format can be created with spreadsheet applications such as Microsoft Excel or Open Office Calc and saved as a tab delimited text file. Alternatively, this format is generated automatically with the freely available epidemiological data management system eCOMPAGT [Scho¨nherr et al., 2009; Weissensteiner et al., 2010]. After importing this file into HaploGrep, the content is displayed in the upper part of the Web interface (Fig. 1) and the classification of haplogroups can be started; all samples are processed in batch mode. HaploGrep presents the top 10 results for every sample in the lower part of the Web interface; the top result is further used as recommended haplogroup. For each result, the computed final overall rank (see above) and the expected and unused polymorphisms are listed and can be surveyed in detail. The user can select any other haplogroup from the list of potential haplogroups as haplogroup status of the sample under investigation. Depending on the rank rhg of the highest scoring haplogroup, the sample is highlighted in different

Figure 1. HaploGrep’s Web interface. Notes: (1) Toolbar of HaploGrep; (2) data and associated haplogroups are highlighted in different colors depending on the quality of the assignment; (3) 10 top hits for every sample are displayed with details on diagnostic and remaining polymorphisms. The remaining polymorphisms are further categorized into ‘‘hotspot,’’ ‘‘local private mutation,’’ ‘‘global private mutation,’’ and ‘‘ polymorphism out of range;’’ (4) location of the sample within the human phylogenetic tree; this visualization enables a view on polymorphisms that could additionally be analyzed for further refinement of the haplogroup status. HUMAN MUTATION, Vol. 31, No. 0, 1–8, 2010

3

colors according to a traffic light scheme: samples with a suggested haplogroup that is ranked with a score above 0.9 are highlighted in green; scores between 0.8 and 0.9 are highlighted in yellow, and scores below 0.8 are emphasized in red color. In case a haplogroup was already defined in the input file, HaploGrep’s recommendation is added in brackets and the user needs to accept the changes manually. This conservative approach avoids an unreflected acceptance of results, as the user sometimes disposes of extra information on the samples’ ancestry. To improve the functionality of HaploGrep, an interactive visualization of Phylotree was created using the JavaScript InfoVis Toolkit. When selecting a haplogroup from the list of results, the entire phylogenetic branch is built and the polymorphisms are highlighted in different colors depending on whether they were typed in the sample or not and whether they have been found in the sample or not and even backmutations are considered. The three tables in the lower left corner (see Fig. 1) show the results ranked by their weights, list the expected polymorphisms for the currently selected haplogroup, and indicate the status of the remaining polymorphisms that are not associated with the haplogroup. In the latter case we distinguish between ‘‘hotspot,’’ ‘‘local private mutation,’’ ‘‘global private mutation,’’ and ‘‘polymorphism out of range.’’ We define a polymorphism as ‘‘local private mutation,’’ when it is not associated with the current haplogroup, but has been observed in Phylotree at least once. On the contrary, a ‘‘global private mutation’’ refers to a polymorphism that is neither associated with the current haplogroup nor has it ever been seen in Phylotree. This information is very useful for data quality assurance. After editing and selecting the appropriate haplogroup for each sample, the sample data can be saved as a tab-delimited text file or as an .rdf file for the phylogenetic software Network.exe [Bandelt et al., 1999] (http://www.fluxus-engineering.com).

Security Management HaploGrep neither saves nor stores user data or uploaded files. It runs on our server’s main memory where only a logging tool collects data for controlling for inappropriate software behavior and monitoring the status of HaploGrep. This supports the traceability of errors and improves the reliability of our application. HaploGrep’s session management is accomplished using HTTP’s cookie feature. Once a client has imported a file, the Web server assigns an encrypted session key (‘‘source identifier’’) to the client’s Web browser. Each time the client starts HaploGrep, the server checks if a session key is already available on client side. If the cookies were not deleted and the server session is still valid, the client has immediate access to his previously uploaded data. The session key is further encrypted by a cryptographic hash function (MD5); therefore, unauthorized clients cannot access foreign data. The session is deleted after 12 hr.

Results Computation Time The classification of haplogroups and the storage of several results per sample is a memory consuming task, which was evaluated on a multicore 64 bit Linux Server with 8 gigabyte RAM. HaploGrep scales very well in time and memory space, taking about 30 to 40 msec per sample to determine the haplogroup. If parallel requests are performed, HaploGrep makes use of the multi core architecture and distributes parallel requests to different cores, which leads to an even faster computation time

4

HUMAN MUTATION, Vol. 31, No. 0, 1–8, 2010

Figure 2. Computation time and memory requirements depending on the number of submitted samples. Note: Both computation time and required memory show a linear dependency to the number of samples. (20 msec for each sample) compared to sequential requests. The memory usage scales in a linear way proportional to the number of imported samples (Fig. 2). With the current server configuration (8 GB of main memory) over 50,000 samples can be handled each day.

Validation Using Predetermined Data A set of 100 previously published mtDNA profiles from the entire control region (16024–16569; 1–576) (40 Europeans [Alshamali et al., 2008], 30 Asians [Alshamali et al., 2008; Brandsta¨tter et al., 2007a; Zimmermann et al., 2009], and 30 Africans [Alshamali et al., 2008; Brandsta¨tter et al., 2004]), which were randomly chosen from the supplementary material sections of the corresponding articles, were imported into HaploGrep (Supp. Table S2). Twenty-eight samples were assigned with the previously published haplogroup. Due to recent updates in Phylotree and the use of weights for each polymorphism, 59 samples were relabeled with a more precise haplogroup, and 13 samples were assigned a new haplogroup. However, it should be pointed out that at the time of publication of the selected test samples, the assigned haplogroup status reflected the state of knowledge at that time. Finally, we compiled a dataset of 120 sequences with varying ranges (control region only [Brandsta¨tter et al., 2004, 2006a, 2007a,b; Irwin et al., 2010; Lee et al., 2006], entire mitochondrial genome [Achilli et al., 2008; Behar et al., 2006, 2008; Pala et al., 2009; Palanichamy et al., 2004], and control region plus a variety of selected SNPs [Brandsta¨tter et al., 2008; Zimmermann et al., 2009]) that is already available on HaploGrep’ Web application and that can be used as test dataset for understanding the functionality of HaploGrep without the need to upload own data (Supp. Table S3).

Comparison with Alternative Software Solutions Beside HaploGrep, other software solutions exist to determine haplogroups: mtDNAmanager provides a Web-based haplogroup classification for single and multiple mtDNA samples. However, this solution allows only the classification of haplogroups from the

mtDNA control region (16024–16569; 1–576) [Lee et al., 2008]. HaploGrep differs from mtDNAmanager in several points: unlike mtDNAmanager the complete range of the mitochondrial genome is analyzed and the classification process is based on Phylotree. This makes the classification far more reliable for users and provides a standard classification scheme. HaploGrep’s results can be further verified either by analyzing the detailed results for every rank or by traversing the provided tree structure that makes the classification process fully transparent. A further important advantage is the increased accuracy and speed of HaploGrep compared to mtDNAmanager (Fig. 3). The test set described above was imported into HaploGrep and mtDNAmanager: only 43 samples were labeled identical by both systems; HaploGrep reassigned 46 samples with a more accurate haplogroup than mtDNAmanager (due to the use of Phylotree); 11 samples were assigned an inappropriate haplogroup (given the genotyped range) by mtDNAmanager. HaploGrep is capable of determining about 20 samples per second and is therefore up to 28 times faster than mtDNAmanager. MitoVariome is another database application that provides limited query functionalities to determine haplogroups using mutation motifs [Lee et al., 2009]. Unfortunately, it is only possible to classify one sample at a time and the classification process takes about 30 sec for a single sample. In addition, only the control region is examined and only the backbone of the human phylogenetic tree is used for guiding the classification process, which leads to a coarse assignment to only major haplogroup clusters.

HaploGrep as a Tool for Quality Assurance The discovery of a wide range of errors in mtDNA disease studies [Bandelt et al., 2002, 2004, 2005, 2007a,b; Bandelt and Salas, 2009; Brandsta¨tter et al., 2005; Kong et al., 2008; Salas et al., 2005b; Yao et al., 2003] has led to formulation of guidelines, which especially recommend checking for phylogenetic consistency using

Figure 3.

Comparison of computation times between HaploGrep and mtDNAmanager. Note: HaploGrep determines haplogroups up to 28 times faster than mtDNAmanager.

haplogroup classification and the construction of quasi-median networks for the detection of phantom mutations [Bandelt et al., 2001, 2009; Salas et al., 2005a]. A detailed instruction on how this can be undertaken is described in Bandelt et al. [2009]. Briefly, the first step is to allocate mtDNA profiles to major haplogroups; the second step is to ascertain whether all haplogroup-specific mutations in the new profile were actually observed, following the evolutionary pathway connecting the targeted sequence with the rCRS. HaploGrep can help to achieve this task especially through the visualization of the phylogenetic position of the sample in question. When, for example, the data of Uusimaa et al. [2004], which were discussed to be problematic in Bandelt et al. [2009], were imported into HaploGrep, it could be observed immediately that only a small part of samples had a score above 0.9 (Fig. 4). This would have been highly alarming, as the West Eurasian phylogeny is well known [Soares et al., 2010; Torroni et al., 2001] and especially entire mtDNA genomes should be easily located on a terminal node of the human mtDNA tree. A ‘‘phylogenetic bookkeeping of mutations’’ [Bandelt et al., 2009], that is, a systematic mutation by mutation comparison along known evolutionary pathways, as seen in Figure 4, would have helped to identify missed mutations and to check for phantom mutations on the list of ‘‘remaining’’ mutations, which were marked as ‘‘global private mutation.’’

Discussion Accurate haplogroup assignment of mitochondrial haplotypes is an important but time-consuming task in human evolution studies and clinical genetics. Especially in the light of recent discoveries of a wide range of errors in mtDNA profiling, which resulted in false associations and erroneous conclusions [Bandelt and Salas, 2009; Bandelt et al., 2007b], it is necessary to check for phylogenetic consistency using haplogroup classification and to construct networks for the detection of phantom mutations [Bandelt and Du¨r, 2007]. However, the conversion of annotated mtDNA population data to Network.exe format [Bandelt et al., 1999] is a difficult and error-prone task. An automatic conversion can either be accomplished with HaploGrep or the recently described mtDNA data management software eCOMPAGT [Weissensteiner et al., 2010]. In addition, although Phylotree provides a regularly updated and comprehensive phylogeny of global human mtDNA variation [van Oven, 2010; van Oven and Kayser, 2008], it is still necessary to have an insight into the mtDNA phylogeny to make use of this comprehensive information. Furthermore, a manual search for the presence or absence of diagnostic polymorphisms is a long-winded, complicated, and sometimes confusing task. Studies of human populations worldwide are still far from being complete, and as a consequence, it is anticipated that new haplogroups will be discovered and that the already known haplogroups will be resolved further and characterized more precisely. These developments may lead to more precise results when performing haplogroup identifications of mtDNA profiles. However, haplogroup assignment will also become a more complicated and timeconsuming process as more and more haplogroups need to be taken into consideration and this circumstance calls for automation. Additionally, with the increasing knowledge of variation in mitochondrial haplogroups, the need to reanalyze already typed samples will arise naturally. This task can only be achieved with reasonable time exposure when applying IT-based solutions. HaploGrep offers a fast, straightforward and reliable solution to this problem. The standardized and straightforward procedure of haplogroup HUMAN MUTATION, Vol. 31, No. 0, 1–8, 2010

5

Figure 4.

Screenshot of HaploGrep analysis of 17 samples from Uusimaa et al. [Uusimaa et al., 2004]. Note: As an example, the results for patient no. 7 are depicted. It can be seen that several key mutations on the evolutionary path to U5b2a1a2 were missed (red color). It can further be seen that one remaining polymorphisms is a ‘‘global private mutation;’’ this means that the polymorphism from the sample was never seen in Phylotree, hinting to the possibility of a genotyping error.

assignment with HaploGrep has the potential to overcome the issue of an intuitive assignment of samples to haplogroups. With each update of the classification tree Phylotree, the haplogroup classification procedure can quickly be repeated according to the actual state of known variation in the human mitochondrial genome. HaploGrep is a freely available software solution with several strengths: (1) HaploGrep is implemented as Web application, and therefore no further installation procedures are required; (2) HaploGrep’s recommendations are based on Phylotree, a periodically updated classification tree estimated from data worldwide; (3) any given range of the mitochondrial genome can be used for haplogroup classification, which is based on the phylogenetic stability of mtDNA polymorphisms; (4) for every input sample the top 10 results and the phylogenetic position of the respective haplogroups are displayed, thus providing a detailed explanation of how and why a haplogroup was ranked best; (5) all results are visualized in a tree structure, showing the current position in Phylotree with possible hints for further refinement of the actual haplogroup status; (6) the classified data can be exported as a tab delimited text file or as a .rdf file for Network.exe. HaploGrep has limitations as well: its recommendations depend on the data accuracy of Phylotree, but because Phylotree is updated at least every 6 months, it is the most accurate presentation of

6

HUMAN MUTATION, Vol. 31, No. 0, 1–8, 2010

haplogroups and therefore an excellent data source for HaploGrep. In addition, the accuracy of HaploGrep’s haplogroup classification depends on the analyzed sample range, clearly yielding more accurate results when using larger sequence ranges. In agreement with the ongoing discussion on error prevention, we provide a constructive solution to check for phylogenetic consistency as recommended by Bandelt et al. [2009]. HaploGrep can be used without a login and imported samples are exclusively visible to the appropriate users due to HaploGrep’s session management. The export possibilities as a standard tab delimited file and as a .rdf file for the phylogenetic software Network.exe render HaploGrep currently the most comprehensive solution for human mtDNA haplogroup classification.

Outlook Starting from now (Phylotree version 10), we will continuously update HaploGrep to work with the latest version of Phylotree. To strengthen the security of HaploGrep and to allow a secure data transfer, an encrypted channel will be available. This will be achieved by using HTTPS, a combination of HTTP with the SSL/ TLS protocol, which guarantees protection from eavesdroppers.

Acknowledgments Sebastian Scho¨nherr was supported by a scholarship from the University of Innsbruck (Doktoratsstipendium aus der Nachwuchsfo¨rderung, MIP10/2009/3). Hansi Weissensteiner was supported by a scholarship from the Autonomous Province of Bozen/Bolzano—South Tyrol. We thank Dr. Sarah J. Bourlat (Museum of Natural History, Stockholm, Sweden) for language revision of the article. The authors appreciate the constructive comments made by three anonymous reviewers, which considerably improved the application.

References Achilli A, Perego UA, Bravi CM, Coble MD, Kong QP, Woodward SR, Salas A, Torroni A, Bandelt H-J. 2008. The phylogeny of the four pan-American MtDNA haplogroups: implications for evolutionary and disease studies. PLoS ONE 3:e1764. Alshamali F, Brandsta¨tter A, Zimmermann B, Parson W. 2008. Mitochondrial DNA control region variation in Dubai, United Arab Emirates. FSI: Genetics 2:e9–e10. Andrews RM, Kubacka I, Chinnery PF, Lightowlers RN, Turnbull DM, Howell N. 1999. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet 23:147. Bandelt H-J, Achilli A, Kong Q-P, Salas A, Lutz-Bonengel S, Sun C, Zhang YP, Torroni A, Yao YG. 2005. Low ‘‘penetrance’’ of phylogenetic knowledge in mitochondrial disease studies. Biochem Biophys Res Commun 333:122–130. Bandelt H-J, Du¨r A. 2007. Translating DNA data tables into quasi-median networks for parsimony analysis and error detection. Mol Phylogenet Evol 42:256–271. Bandelt H-J, Forster P, Ru¨hl A. 1999. Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol 16:37–48. Bandelt H-J, Lahermo P, Richards M, Macaulay V. 2001. Detecting errors in mtDNA data by phylogenetic analysis. Int J Legal Med 115:64–69. Bandelt H-J, Olivieri A, Bravi C, Yao YG, Torroni A, Salas A. 2007a. ‘‘Distorted’’ mitochondrial DNA sequences in schizophrenic patients. Eur J Hum Genet 15:400–402. Bandelt H-J, Quintana-Murci L, Salas A, Macaulay V. 2002. The fingerprint of phantom mutations in mitochondrial DNA data. Am J Hum Genet 71:1150–1160. Bandelt H-J, Salas A. 2009. Contamination and sample mix-up can best explain some patterns of mtDNA instabilities in buccal cells and oral squamous cell carcinoma. BMC Cancer 9:113. Bandelt H-J, Salas A, Bravi C. 2004. Problems in FBI mtDNA database. Science 305:1402–1404. Bandelt H-J, Yao YG, Bravi CM, Salas A, Kivisild T. 2009. Median network analysis of defectively sequenced entire mitochondrial genomes from early and contemporary disease studies. J Hum Genet 54:174–181. Bandelt H-J, Yao YG, Salas A, Kivisild T, Bravi CM. 2007b. High penetrance of sequencing errors and interpretative shortcomings in mtDNA sequence analysis of LHON patients. Biochem Biophys Res Commun 352:283–291. Behar DM, Metspalu E, Kivisild T, Achilli A, Hadid Y, Tzur S, Pereira L, Amorim A, Quintana-Murci L, Majamaa K, Herrnstadt C, Howell N, Balanovsky O, Kutuev I, Pshenichnov A, Gurwitz D, Bonne-Tamir B, Torroni A, Villems R, Skorecki K. 2006. The matrilineal ancestry of ashkenazi jewry: portrait of a recent founder event. Am J Hum Genet 78:487–497. Behar DM, Villems R, Soodyall H, Blue-Smith J, Pereira L, Metspalu E, Scozzari R, Makkan H, Tzur S, Comas D, Bertranpetit J, Quintana-Murci L, Tyler-Smith C, Wells RS, Rosset S. 2008. The dawn of human matrilineal diversity. Am J Hum Genet 82:1130–1140. Biesecker LG, Bailey-Wilson JE, Ballantyne J, Baum H, Bieber FR, Brenner C, Budowle B, Butler JM, Carmody G, Conneally PM, Duceman B, Eisenberg A, Forman L, Kidd KK, Leclair B, Niezgoda S, Parsons TJ, Pugh E, Shaler R, Sherry ST, Sozer A, Walsh A. 2005. Epidemiology. DNA identifications after the 9/11 World Trade Center attack. Science 310:1122–1123. Biffi A, Anderson CD, Nalls MA, Rahman R, Sonni A, Cortellini L, Rost NS, Matarin M, Hernandez DG, Plourde A, de Bakker PI, Ross OA, Greenberg SM, Furie KL, Meschia JF, Singleton AB, Saxena R, Rosand J. 2010. Principal-component analysis for assessment of population stratification in mitochondrial medical genetics. Am J Hum Genet 86:904–917. Brandsta¨tter A, Egyed B, Zimmermann B, Duftner N, Padar Z, Parson W. 2007a. Migration rates and genetic structure of two Hungarian ethnic groups in Transylvania, Romania. Ann Hum Genet 71:791–803. Brandsta¨tter A, Klein R, Duftner N, Wiegand P, Parson W. 2006a. Application of a quasi-median network analysis for the visualization of character conflicts to a population sample of mitochondrial DNA control region sequences from southern Germany (Ulm). Int J Legal Med 120:310–314.

Brandsta¨tter A, Niedersta¨tter H, Pavlic M, Grubwieser P, Parson W. 2007b. Generating population data for the EMPOP database—an overview of the mtDNA sequencing and data evaluation processes considering 273 Austrian control region sequences as example. Forensic Sci Int 166:164–175. Brandsta¨tter A, Peterson CT, Irwin JA, Mpoke S, Koech DK, Parson W, Parsons TJ. 2004. Mitochondrial DNA control region sequences from Nairobi (Kenya): inferring phylogenetic parameters for the establishment of a forensic database. Int J Legal Med 118:294–306. Brandsta¨tter A, Salas A, Niedersta¨tter H, Gassner C, Carracedo A, Parson W. 2006b. Dissection of mitochondrial superhaplogroup H using coding region SNPs. Electrophoresis 27:2541–2550. Brandsta¨tter A, Sa¨nger T, Lutz-Bonengel S, Parson W, Be´raud-Colomb E, Wen B, Kong Q-P, Bravi CM, Bandelt H-J. 2005. Phantom mutation hotspots in human mitochondrial DNA. Electrophoresis 26:3414–3429. Brandsta¨tter A, Zimmermann B, Wagner J, Go¨bel T, Ro¨ck AW, Salas A, Carracedo A, Parson W. 2008. Timing and deciphering mitochondrial DNA macrohaplogroup R0 variability in Central Europe and Middle East. BMC Evol Biol 8:191. Durham SE, Samuels DC, Cree LM, Chinnery PF. 2007. Normal levels of wild-type mitochondrial DNA maintain cytochrome c oxidase activity for two pathogenic mitochondrial DNA mutations but not for m.3243A-G. Am J Hum Genet 81:189–195. Elliott HR, Samuels DC, Eden JA, Relton CL, Chinnery PF. 2008. Pathogenic mitochondrial DNA mutations are common in the general population. Am J Hum Genet 83:254–260. Fielding RT. 2000. Architectural styles and the design of network-based software architecture (PhD thesis). University of California. Irwin JA, Ikramov A, Saunier J, Bodner M, Amory S, Ro¨ck AW, O’Callaghan J, Nuritdinov A, Atakhodjaev S, Mukhamedov R, Parson W, Parsons TJ. 2010. The mtDNA composition of Uzbekistan: a microcosm of Central Asian patterns. Int J Legal Med 124:195–204. Ivanov PL, Wadhams MJ, Roby RK, Holland MM, Weedn VW, Parsons TJ. 1996. Mitochondrial DNA sequence heteroplasmy in the Grand Duke of Russia Georgij Romanov establishes the authenticity of the remains of Tsar Nicholas II. Nat Genet 12:417–420. Kivisild T, Reidla M, Metspalu E, Rosa A, Brehm A, Pennarun E, Parik J, Geberhiwot T, Usanga E, Villems R. 2004. Ethiopian mitochondrial DNA heritage: tracking gene flow across and around the gate of tears. Am J Hum Genet 75:752–770. Kong QP, Salas A, Sun C, Fuku N, Tanaka M, Zhong L, Wang CY, Yao YG, Bandelt H-J. 2008. Distilling artificial recombinants from large sets of complete mtDNA genomes. PLoS ONE 3:e3016. Lee HY, Song I, Ha E, Cho SB, Yang WI, Shin KJ. 2008. mtDNAmanager: a Webbased tool for the management and quality analysis of mitochondrial DNA control-region sequences. BMC Bioinformatics 9:483. Lee HY, Yoo J-E, Park MJ, Chung U, Shin K-J. 2006. Mitochondrial DNA control region sequences in Koreans: identification of useful variable sites and phylogenetic analysis for mtDNA data quality control. Int J Legal Med 120:5–14. Lee YS, Kim WY, Ji M, Kim JH, Bhak J. 2009. MitoVariome: a variome database of human mitochondrial DNA. BMC Genomics 10(Suppl 3):S12. Macaulay V, Hill C, Achilli A, Rengo C, Clarke D, Meehan W, Blackburn J, Semino O, Scozzari R, Cruciani F, Taha A, Shaari NK, Raja JM, Ismail P, Zainuddin Z, Goodwin W, Bulbeck D, Bandelt H-J, Oppenheimer S, Torroni A, Richards M. 2005. Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes. Science 308:1034–1036. Pala M, Achilli A, Olivieri A, Kashani BH, Perego UA, Sanna D, Metspalu E, Tambets K, Tamm E, Accetturo M, Carossa V, Lancioni H, Panara F, Zimmermann B, Huber G, Al-Zahery N, Brisighelli F, Woodward SR, Francalacci P, Parson W, Salas A, Behar DM, Villems R, Semino O, Bandelt H-J, Torroni A. 2009. Mitochondrial haplogroup U5b3: a distant echo of the epipaleolithic in Italy and the legacy of the early Sardinians. Am J Hum Genet 84:814–821. Palanichamy MG, Sun C, Agrawal S, Bandelt H-J, Kong Q-P, Khan F, Wang C-Y, Chaudhuri TK, Palla V, Zhang Y-P. 2004. Phylogeny of mitochondrial DNA macrohaplogroup N in India, based on complete sequencing: implications for the peopling of South Asia. Am J Hum Genet 75:966–978. Pereira L, Freitas F, Fernandes V, Pereira JB, Costa MD, Costa S, Maximo V, Macaulay V, Rocha R, Samuels DC. 2009. The diversity present in 5140 human mitochondrial genomes. Am J Hum Genet 84:628–640. Quintana-Murci L, Harmant C, Quach H, Balanovsky O, Zaporozhchenko V, Bormans C, van Helden PD, Hoal EG, Behar DM. 2010. Strong maternal Khoisan contribution to the South African coloured population: a case of gender-biased admixture. Am J Hum Genet 86:611–620. Salas A, Carracedo A, Macaulay V, Richards M, Bandelt H-J. 2005a. A practical guide to mitochondrial DNA error prevention in clinical, forensic, and population genetics. Biochem Biophys Res Commun 335:891–899. Salas A, Yao YG, Macaulay V, Vega A, Carracedo A, Bandelt H-J. 2005b. A critical reassessment of the role of mitochondria in tumorigenesis. PLoS Med 2:e296. HUMAN MUTATION, Vol. 31, No. 0, 1–8, 2010

7

Scho¨nherr S, Weissensteiner H, Coassin S, Specht G, Kronenberg F, Brandsta¨tter A. 2009. eCOMPAGT—efficient combination and management of phenotypes and genotypes for genetic epidemiology. BMC Bioinformatics 10:139. Soares P, Achilli A, Semino O, Davies W, Macaulay V, Bandelt H-J, Torroni A, Richards MB. 2010. The archaeogenetics of Europe. Curr Biol 20:R174–R183. Soares P, Ermini L, Thomson N, Mormina M, Rito T, Ro¨hl A, Salas A, Oppenheimer S, Macaulay V, Richards MB. 2009. Correcting for purifying selection: an improved human mitochondrial molecular clock. Am J Hum Genet 84:740–759. Torroni A, Bandelt H-J, Macaulay V, Richards M, Cruciani F, Rengo C, MartinezCabrera V, Villems R, Kivisild T, Metspalu E, Parik J, Tolk HV, Tambets K, Forster P, Karger B, Francalacci P, Rudan P, Janicijevic B, Rickards O, Savontaus ML, Huoponen K, Laitinen V, Koivumaki S, Sykes B, Hickey E, Novelletto A, Moral P, Sellitto D, Coppa A, Al Zaheri N, Santachiara-Benerecetti AS, Semino O, Scozzari R. 2001. A signal, from human mtDNA, of postglacial recolonization in Europe. Am J Hum Genet 69:844–852. Uusimaa J, Finnila S, Remes AM, Rantala H, Vainionpaa L, Hassinen IE, Majamaa K. 2004. Molecular epidemiology of childhood mitochondrial encephalomyopathies

8

HUMAN MUTATION, Vol. 31, No. 0, 1–8, 2010

in a Finnish population: sequence analysis of entire mtDNA of 17 children reveals heteroplasmic mutations in tRNAArg, tRNAGlu, and tRNALeu(UUR) genes. Pediatrics 114:443–450. van Oven M. 2010. Revision of the mtDNA tree and corresponding haplogroup nomenclature. Proc Natl Acad Sci USA 107:E38–E39. van Oven M, Kayser M. 2008. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat 30:E386–E394. Weissensteiner H, Scho¨nherr S, Specht G, Kronenberg F, Brandsta¨tter A. 2010. eCOMPAGT integrates mtDNA: import, validation and export of mitochondrial DNA profiles for population genetics, tumour dynamics and genotype– phenotype association studies. BMC Bioinformatics 11:122. Yao Y-G, Macauley V, Kivisild T, Zhang Y-P, Bandelt H-J. 2003. To trust or not to trust an idiosyncratic mitochondrial data set. Am J Hum Genet 72:1341–1346. Zimmermann B, Bodner M, Amory S, Fendt L, Ro¨ck AW, Horst D, Horst B, Sanguansermsri T, Parson W, Brandsta¨tter A. 2009. Forensic and phylogeographic characterization of mtDNA lineages from northern Thailand (Chiang Mai). Int J Legal Med 123:495–501.