DATABASES

Human Mutation OFFICIAL JOURNAL

Locus-Specific Database Domain and Data Content Analysis: Evolution and Content Maturation Toward Clinical Use

www.hgvs.org

Christina Mitropoulou,1 Adam J. Webb,2 Konstantinos Mitropoulos,1 Anthony J. Brookes,2 and George P. Patrinos1,3

1 Erasmus MC, Faculty of Medicine and Health Sciences, MGC-Department of Cell Biology and Genetics, Rotterdam, The Netherlands; 2 Department of Genetics, University of Leicester, Leicester, United Kingdom; 3 University of Patras, School of Health Sciences, Department of Pharmacy, Patras, Greece

Communicated by Mauno Vihinen Received 30 April 2010; accepted revised manuscript 12 July 2010. Published online 29 July 2010 in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/humu.21332

ABSTRACT: Genetic variation databases have become indispensable in many areas of health care. In addition, more and more experts are depositing published and unpublished disease-causing variants of particular genes into locus-specific databases (LSDBs). Some of these databases contain such extensive information that they have become known as knowledge bases. Here, we analyzed 1,188 LSDBs and their content for the presence or absence of 44 content criteria related to database features (general presentation, locus-specific information, database structure) and data content (data collection, summary table of variants, database querying). Our analyses revealed that several elements have helped to advance the field and reduce data heterogeneity, such as the development of specialized database management systems and the creation of data querying tools. We also identified a number of deficiencies, namely, the lack of detailed disease and phenotypic descriptions for each genetic variant and links to relevant patient organizations, which, if addressed, would allow LSDBs to better serve the clinical genetics community. We propose a structure, based on LSDBs and closely related repositories (namely, clinical genetics databases), which would contribute to a federated genetic variation browser and also allow the maintenance of variation data. Hum Mutat 31:1109–1116, 2010. © 2010 Wiley-Liss, Inc.

KEY WORDS: locus-specific databases (LSDBs); genetic variation; polymorphisms; genetic diseases; diagnostic laboratories; domain analysis

Correspondence to: George P. Patrinos, University of Patras, School of Health Sciences, Department of Pharmacy, University Campus, Rion, GR-26504, Patras, Greece. E-mail: [email protected]

Introduction

Recent years have witnessed a remarkable increase in the identification of underlying genetic defects leading to both common and rare inherited disorders. This increase may be attributed to the rapid development and improving economies of high-throughput genomic variation detection and sequencing technologies, supported by growing numbers and complexities of gene variation databases. Such databases are repositories in which allelic variations are catalogued and described within specific genes. Existing databases can generally be assigned to one of three broad categories: general (core or centralized) databases; locus-specific databases (LSDBs), recording variation data for specific genes; and National/Ethnic mutation databases (NEMDBs) [Patrinos and Brookes, 2005].

LSDBs benefit from rigorous expert curation, often coordinated by consortia of collaborating researchers with scientific expertise in particular genes, such as the HbVar database of hemoglobin variants and thalassemia mutations [Hardison et al., 2002; see also Claustres et al., 2002]. These resources contain up to 50% unpublished variations with, usually, thorough phenotypic descriptions. Consequently, LSDBs are extremely useful tools, contributing toward the identification of causative mutations, providing information about phenotypic patterns associated with a specific mutation, and enabling researchers to define an optimal strategy for mutation detection [Patrinos and Brookes, 2005]. The latter reason, in particular, explains the recent rapid growth in the number of LSDBs, with such databases now available for hundreds to thousands of human genes (sometimes with more than one database per gene). These repositories aim to promote data submission and form well-maintained, accurate, and up-to-date data sources.

Currently, however, the LSDB field suffers from a number of serious deficiencies. First of all, the majority of LSDBs presently available have been constructed from a plethora of different technologies and designs. Most of the resulting databases are both unsophisticated in implementation and small in scale.
Several database management systems (DBMSs), such as the Leiden Open Variation Database (LOVD: http://www.lovd.nl) [Fokkema et al., 2005], the Universal Mutation Database (UMD: http://www.umd.be) [Béroud et al., 2005], and MUTbase [Riikonen and Vihinen, 1999], attempt to harmonize the current data content heterogeneity by providing off-the-shelf software for LSDB development and curation. These three systems, however, have very different designs. The LSDB field is therefore attempting to grow and mature in response to the urgent needs that exist, while struggling to overcome deficiencies such as data content heterogeneity.

Less than a decade ago, Claustres and coworkers [2002] examined the comparatively few LSDBs that existed on the Web at that time in an attempt to define optimal content scope and suggest ways for data curation. However, the rapid growth of LSDBs and the vast heterogeneity that presently characterizes the field dictate the need for a thorough data-content and LSDB-domain analysis. This will allow comprehensive mapping of standards (i.e., the data models and ontology options on which the existing LSDBs are based) and provide insight into ways the field should further develop. Here, we report our findings from a thorough domain analysis of 1,188 existing LSDBs currently available through the Internet. These data are compared with previous findings and, based on these results, recommendations are made toward implementation of LSDBs for use in a clinical and genetic laboratory setting.

Methods

Between September 2008 and September 2009 we examined 1,728 LSDBs based on a list amalgamated from the following sources: (1) the Human Genome Variation Society Website (HGVS; http://www.hgvs.org/dblist/glsdb.html; 727 LSDBs), which contains an extensive and routinely updated list of LSDBs with their URLs; (2) the Leiden Open Variation Database Website (LOVD; http://www.lovd.nl/2.0/index_list.php; 853 LSDBs), which contains LSDBs developed using the LOVD database management software (DBMS) [Fokkema et al., 2005]; (3) the Universal Mutation Database Website (UMD; http://www.umd.be; 26 LSDBs), which contains LSDBs developed using the UMD DBMS [Béroud et al., 2005]; and (4) the Website of the Bioinformatics group of the University of Tampere, Institute of Biomedical Technology (http://bioinf.uta.fi/base_root/mutation_databases_list.php; 122 LSDBs), which contains LSDBs developed using the MUTbase DBMS [Riikonen and Vihinen, 1999]. We found considerable overlap between these lists, with 640 LSDBs represented on more than one of the above Websites. After accounting for this redundancy, a total of 1,188 LSDBs remained for our downstream analysis.

These LSDBs were examined for the presence or absence of 44 content criteria pertaining to: (1) Database analysis, namely, (a) general presentation, (b) locus-specific information, and (c) database structure, and (2) Data content, namely, (a) data collection, (b) variant information table, and (c) database querying. These criteria (summarized in Table 1) were selected to ensure an objective evaluation of the various LSDBs and their content. LSDBs were scored for these criteria using a binary scoring mode: "0," absence/no; and "1," presence/yes. Criteria requiring subjective judgments by the evaluator (e.g., ease of use, LSDB design, etc.) were excluded from our study. We also excluded checking for "hit counters" from our study, as these do not provide a reliable estimate of Website traffic. Hit counters do not necessarily distinguish between unique and returning visitors, or between genuine visitors and automated visitors such as search engine robots. Implementation of such hit counters may also vary between LSDBs, making direct comparison inappropriate.

We have made our extensive listing of LSDBs, complete with all criteria described in this article, available via the GEN2PHEN Knowledge Centre (http://www.gen2phen.org). The list can be browsed, filtered, and searched at http://www.gen2phen.org/data/lsdbs. We will continue to maintain and extend this listing, and encourage visitors to contribute by submitting additional LSDBs. The listing can also be accessed in plain text and Microsoft Excel formats. We also offer a number of machine-readable formats such as Atom, allowing gene-specific listings to be retrieved and utilized by third-party Websites.
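The amalgamation and binary scoring steps described above can be illustrated with a minimal Python sketch. The database names, URLs, scores, and the three criteria shown are hypothetical placeholders, and matching duplicates by normalized URL is an assumption made only for illustration; the text does not state how overlapping entries were identified.

```python
from collections import Counter

# Hypothetical source lists of (database name, URL) pairs. Real lists would
# come from the HGVS, LOVD, UMD, and MUTbase Websites named in the text.
sources = {
    "HGVS": [("GENE1 db", "http://example.org/gene1/"),
             ("GENE2 db", "http://example.org/gene2")],
    "LOVD": [("GENE1 LOVD", "http://EXAMPLE.org/gene1"),
             ("GENE3 LOVD", "http://example.org/gene3")],
}


def normalize(url: str) -> str:
    """Assumed matching rule: case-insensitive URL without a trailing slash."""
    return url.strip().lower().rstrip("/")


def amalgamate(lists):
    """Merge all lists, keeping the first entry seen for each normalized URL."""
    merged = {}
    for entries in lists.values():
        for name, url in entries:
            merged.setdefault(normalize(url), name)
    return merged


def prevalence(scores, criteria):
    """Percentage of databases scored 1 (presence) for each criterion."""
    totals = Counter()
    for row in scores.values():
        for criterion in criteria:
            totals[criterion] += row[criterion]
    return {c: round(100 * totals[c] / len(scores), 1) for c in criteria}


unique = amalgamate(sources)

# Binary scoring: 1 = presence/yes, 0 = absence/no (illustrative values only).
criteria = ["reference_sequence", "links_to_omim", "querying_tool"]
scores = {
    "http://example.org/gene1": {"reference_sequence": 1, "links_to_omim": 1, "querying_tool": 0},
    "http://example.org/gene2": {"reference_sequence": 1, "links_to_omim": 0, "querying_tool": 0},
    "http://example.org/gene3": {"reference_sequence": 0, "links_to_omim": 1, "querying_tool": 1},
}
print(len(unique), prevalence(scores, criteria))
```

Keeping the scores as plain 0/1 integers means that the prevalence of a criterion is simply its column sum divided by the number of databases, which is how the percentages reported in the Results can be read.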


HUMAN MUTATION, Vol. 31, No. 10, 1109–1116, 2010

Table 1. Database Contents and Data Content Criteria Upon Which the Entire LSDB Domain Analysis Has Been Performed

1. Database features
  1a. General presentation
      Explanation of content and aim
      Useful links
      Links to OMIM
      Links to HGMD
      Links to other LSDBs
      Database description published
      Copyright
      Disclaimer
      Language other than English
  1b. Locus-specific information
      Reference sequence
      Information about disease
      List of associations
      Chromosome location
      Information on gene
      Protein function
      Nomenclature followed
  1c. Database structure
      Summary table listing all variations
      Downloadable variation table
      Static HTML pages (no search option)
      Flat-file database
      Relational database
      Variation visualization tool
      Use of a DBMS

2. Data content
  2a. Data collection
      Collection via literature
      Contact curators
      Online submission
      Year of the last LSDB update
  2b. Variation table
      Complete reference list
      Summary phenotypic description
      Detailed phenotypic description
      Links to references
      Cross-reference with other databases
      Restriction enzyme change reported
      Ethnic group
      Mutation frequency
      Detection method
  2c. Database querying
      Querying tool(s)
      Field: Variation name
      Field: Gene region
      Field: Codon number
      Field: Author's name (a)
      Field: Phenotype
      Field: Ethnic group
      Field: Geographic location
      Other fields for querying

(a) Author from published report or personal communication.

Results

The 1,188 LSDBs that were included in our analysis covered 967 different genes, with 103 LSDBs being redundant for the genes they represented. Specifically, 93 genes were represented in two different LSDBs, of which 11 involved the same DBMS (e.g., LOVD) but different installations (e.g., LSDBs for both the ASS1 and GPM6B genes are available in both the Leiden and Penn State University installations). In these cases, the number of variant alleles was different. Also, for nine genes (AP3B1, BEST1, CDH23, FOXL2, HPS1, L1CAM, LMNA, LYST, and MYO7A), there were three different LSDBs, whereas for the TP53 gene there were four different LSDBs available on the Web.

In addition to the 1,188 LSDBs included in our analysis, we also identified 33 gene-specific variation resources that did not qualify as LSDBs (i.e., nondatabase resources such as simple downloadable PDF files). An additional 21 LSDBs unavailable to the general public, that is, either password-protected or located in members-only areas (e.g., http://www.euroglycanet.org), were also excluded from our analysis. A further nine databases were no longer available. By way of comparison, the Human Gene Mutation Database (HGMD; http://www.hgmd.cf.ac.uk) [Stenson et al., 2003] lists 2,689 genes in its public version (and 3,739 in its Professional 2010.1 Release) containing at least one variation.

Criteria Examined

The presence or absence of 44 content criteria was examined in the 1,188 LSDBs, pertaining to: (1) database analysis (general presentation, locus-specific information, and database structure), and (2) data content (data collection, variant information table, and database querying).

Database analysis

General presentation: The majority of LSDBs (86.5%) had a home page explaining the database contents and aim. A user guide and a minimal set of relevant external links were generally available for the user to access additional information (Fig. 1A). Links were provided to HGMD (55.6%), to OMIM (Online Mendelian Inheritance in Man) [Amberger et al., 2009] (76.5%), and to other useful resources (91.2%; GenBank/EMBL, HGVS, etc.), but only 0.3% of LSDBs included links to other LSDBs. A copyright notice and a guide to citation were displayed in 81.1% of LSDBs, whereas 78.9% also displayed a disclaimer notice. In approximately 25% of the cases studied, the database description was published in a peer-reviewed scientific journal. Finally, only 0.5% of the LSDBs were presented in a language other than English, namely, French in all cases.

Locus-specific information: A number of databases provide supplementary information such as a reference sequence (87.3%), chromosomal location (74.5%), or other information (82.8%) on the gene of interest, making these registries readily accessible to physicians and scientists from many fields. Only 21.1% provided information on the relevant disease(s) and even fewer (2.4%) provided information on protein function. Proper variation nomenclature [den Dunnen and Antonarakis, 2001] was followed in 76.4% of LSDBs. Finally, only 19.4% of LSDBs displayed links to patient associations or to Websites with clinical content (Fig. 1B).

Figure 1. Overview of the database features criteria, namely, LSDB general presentation (A), locus-specific information (B), and database structure (C). [Bar chart, scale 0 to 1,200; for each criterion, presence and absence are shown for Claustres et al. (2002) and for this study.]

Database structure: Contrary to previous observations, few LSDBs (11.2%) were structured as flat files containing a number of fields for each entry. The majority of LSDBs were relational (Structured Query Language [SQL]-based; 82.6%). However, a substantial number (195, 16.4%) are still based on static HTML pages (rather than pages dynamically generated on request using SQL database queries). This absolute number is similar to that of the previous study (191, 73%) [Claustres et al., 2002] despite the considerable number of new LSDBs included in this study. This suggests that these static HTML LSDBs represent older databases that have not been upgraded, whereas newer databases have opted for more sophisticated solutions. Furthermore, we found that over half (51%) of these static HTML databases have not been updated since 2002.

A complete table listing all variations was available in 87.3% of LSDBs, of which only a small portion (10.5%) offered downloadable formats, and these were sometimes difficult to download. Few LSDBs (13.5%) showed mutation maps or added graphical displays, including dynamic graphing tools, depicting the location of variations throughout the gene (or protein) sequence. The SERPINA1 LSDB [Zaimidou et al., 2009], for example, employed the VariVis tool [Smith and Cotton, 2008], whereas MUTbase-based LSDBs employed a built-in visualization tool [Riikonen and Vihinen, 1999].

As there are no uniformly accepted object model standards for LSDB design yet, a substantial number of LSDBs are based on ad hoc custom-built platforms, resulting in data content heterogeneity. It is encouraging, though, that 66.6% of LSDBs are based on an established DBMS geared for locus-specific variation data, namely, LOVD (853 LSDBs), MUTbase (122 LSDBs), and UMD (26 LSDBs). General use of such platforms contributes significantly to data uniformity (Fig. 1C).

A large number of LSDBs are updated frequently. In particular, 691 (58%) of the LSDBs analyzed were updated in 2009, whereas 900 (76%) were updated during the last 2 years (Fig. 2). Fortunately, only 59 LSDBs (5.4%) were truly outdated, that is, updated in 2000 or earlier, whereas few others were updated between 2001 and 2007.
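As a worked check of how the counts in this subsection relate to the reported percentages, the short Python sketch below recomputes three of them. It assumes that the denominator is the full set of 1,188 LSDBs, which the text implies but does not state for every figure.

```python
TOTAL_LSDBS = 1188  # LSDBs retained for analysis, as stated in the text

# Counts reported in the text for selected criteria.
reported_counts = {
    "static HTML pages": 195,
    "updated in 2009": 691,
    "updated during the last 2 years": 900,
}


def share(count: int, total: int = TOTAL_LSDBS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, count in reported_counts.items():
    print(f"{label}: {count} of {TOTAL_LSDBS} = {share(count)}%")
```

Under that assumption the three values agree with the rounded percentages given above (16.4%, 58%, and 76%).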

Figure 2. Year of the last LSDB update. [Bar chart; count of LSDBs (scale 0 to 800) by year of last update: undefined, 1998-2000, 2001-2003, 2004-2006, 2007, 2008, 2009.]

Data content

Data collection and submission: Individual LSDB entries usually correspond to a variation in a single patient or to data that are collectively reported from a group of patients bearing the same variation. This information, generally manually curated, is based on data derived from both published literature (99.9% of LSDBs) and direct submissions of unpublished variations by researchers to the LSDB curators (98.1% of LSDBs). Such contributions are made either by e-mail or through prestructured data submission forms. Such forms were generally made available upon request or could be downloaded from the corresponding LSDB. Direct online submission, using dedicated online submission forms in password-protected interfaces, was available in 75% of the cases (Fig. 3A).

Variation table: The majority of LSDBs provide information on the gene of interest, protein function, protein structure, and/or protein sequence alignment. A complete reference list was included in 87.7% of LSDBs, of which 94.8% linked directly to the PubMed literature database. Summary phenotypic descriptions (i.e., pathogenicity: Yes/No, etc.) were available in 74.5% of LSDBs, whereas detailed phenotypic descriptions were available in only 8% of the LSDBs analyzed. The depth of the detailed information depends on the nature of the LSDB and of the resulting phenotype (e.g., cancer, hematological disorders, syndromes, etc.). Only 1.5% of LSDBs linked to corresponding entries in other LSDBs. In addition to the variation listing, many databases provided data fields for associated information, such as variation detection methodology (57.6%) and restriction enzyme changes that are indicative of the presence or absence of a gene variation (65.2%). Although the ethnic or geographic origin of patients was indicated in 69.9% of LSDBs, variation/allele frequency data were only available in 1.6% of LSDBs. This likely reflects the fact that, for many genes, this information is simply not known.
These data are summarized in Figure 3B.

Database querying: A search engine that allows tailored database querying is one of the main features that distinguish an LSDB from the conventional search functionality of generic genome browsers. In 76.9% of LSDBs, mostly relational databases or those utilizing a DBMS, a search engine was available. These allowed the user to query for variation name (67.1%), gene region (63%), codon number (62.2%), author name (68.9%), phenotype (69.1%), ethnic group (59.8%), or geographic location (58.6%) (Fig. 3C). In 70.9% of LSDBs, additional fields for querying were available, such as cancer type and classification, DNA source, protein domain, etc.

Figure 3. Overview of the data collection criteria, namely, data collection (A), variation table (B), and database querying (C). [Bar chart, scale 0 to 1,200; for each criterion, presence and absence are shown for Claustres et al. (2002) and for this study.]

Data quality: As far as data quality is concerned, we have attempted to get some indication of how sparse the data documented in the LSDBs are over some of the key content criteria. We have determined the number of LSDBs that combine information on three main parameters, namely, pathogenicity, reference sequence, and adoption of the official HGVS nomenclature. This would be yet another quality indicator for LSDBs, because these parameters are among the most relevant ones for LSDB users. We found that 154 LSDBs (13%) fulfill all three content criteria. On the other hand, some inconsistencies were found regarding the gene names that are used in a number of LSDBs. In particular, we found 18 LSDBs that were not using HUGO Gene Nomenclature Committee (HGNC) symbols, and we made a note of these LSDB listings on the GEN2PHEN Knowledge Centre. All these gene names have been converted to their proper HGNC names and hyperlinked to the corresponding LSDBs. The full list of the changed gene names, with aliases or withdrawn names in brackets, is as follows: BEST1 (VMD2), BRCA2 (FANCD1), BRIP1 (FANCJ), CDK16 (PCTK1), CLRN1 (USH3A), DCAF12L1 (WDR40B), DCAF12L2 (WDR40C), DCAF8L1 (WDR42B), ELANE (ELA2), GIGYF2 (PARK11), HMGN5 (NSBP1), KDM5C (JARID1C), KDM6A (UTX), NOS2 (NOS2A), PALB2 (FANCN), PEX2 (PXMP3), PNP (NP), TAB3 (MAP3K7IP3).
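The conversion of alias or withdrawn gene names to HGNC symbols listed above amounts to a simple lookup. The Python sketch below encodes exactly the 18 pairs given in the text; the function name `to_hgnc` and the pass-through behavior for unlisted symbols are illustrative assumptions, not a description of how the GEN2PHEN listing was actually updated.

```python
# Withdrawn or alias symbols mapped to the HGNC symbols listed in the text.
ALIAS_TO_HGNC = {
    "VMD2": "BEST1", "FANCD1": "BRCA2", "FANCJ": "BRIP1", "PCTK1": "CDK16",
    "USH3A": "CLRN1", "WDR40B": "DCAF12L1", "WDR40C": "DCAF12L2",
    "WDR42B": "DCAF8L1", "ELA2": "ELANE", "PARK11": "GIGYF2",
    "NSBP1": "HMGN5", "JARID1C": "KDM5C", "UTX": "KDM6A", "NOS2A": "NOS2",
    "FANCN": "PALB2", "PXMP3": "PEX2", "NP": "PNP", "MAP3K7IP3": "TAB3",
}


def to_hgnc(symbol: str) -> str:
    """Return the HGNC symbol for a listed alias; other symbols pass through."""
    cleaned = symbol.strip()
    return ALIAS_TO_HGNC.get(cleaned, cleaned)


# Example: normalize the gene names attached to a few LSDB listings.
listed_names = ["VMD2", "TP53", " FANCN "]
print([to_hgnc(name) for name in listed_names])
```

A fixed table of this kind only covers the names found in this study; a maintained listing would more plausibly check symbols against the current HGNC records.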

Discussion

The elucidation of the human genome sequence has revolutionized our ability to explore how genes cause disease and other phenotypes. Sadly, the resulting flood of primary genomic information is not yet being managed or utilized as effectively as it should be, due, not least, to the lack of a sufficiently organized and mature database infrastructure. Two international consortia have emerged to assist in addressing these issues, namely, the Human Variome Project (HVP; http://www.humanvariomeproject.org) [Horaitis et al., 2007; Kaput et al., 2009] and the GEN2PHEN project (http://www.gen2phen.org), an integrated project funded by the European Commission.

Here, we have reported our results from a thorough domain analysis of the LSDB field, motivated by the goals of the GEN2PHEN project. This effort constitutes a formal "requirements analysis" that would: (1) contribute guidelines upon which the LSDB field can be further evolved, (2) formalize the data models and the nomenclature systems being utilized by the entire LSDB community, and (3) bring maximum synergy with groups involved in the LSDB field. Actual data models in current use were fully documented and compared with previous data from the time when the field was just starting to grow [Claustres et al., 2002]. Our overall goal was to provide important supporting material for defining the basis on which the LSDB field would not only evolve toward unifying genetic variation databases in one or more research-oriented central genome variation browsers, but also mature toward implementation in the clinical environment, such that it would be beneficial to medical practitioners, bioscientists, and, ultimately, patients.

Evolution of the LSDB Domain: Strengths and Pitfalls

We documented, in detail, the content of 1,188 unique LSDBs in an effort to represent the type and depth of information that is currently provided in such databases, mostly aiming to assess the way the LSDB field is currently evolving. Our approach differs from the one adopted several years ago [Claustres et al., 2002] in the sense that not only has our study included significantly more LSDBs from a multitude of sources but we have also excluded criteria that are more subjective in nature.

A key observation that derives from our analysis is that more and more LSDBs are generated using an LSDB management system, often downloadable, to allow collection, curation, and storage of both published and unpublished data. In particular, 66.6% of the LSDBs analyzed are built using the LOVD, MUTbase, or UMD DBMS, versus only 27% in the previous study (Fig. 1C) [Claustres et al., 2002]. Therefore, today there is less data-content heterogeneity than existed 8 years ago. On top of this observation, there is a trend toward interested database curators building their LSDBs by selecting from the existing DBMSs rather than designing their LSDB from scratch. The use of DBMSs, based on relational databases, enriches the existing LSDBs, which are characterized by extensive querying capacity. In our analysis, we found that 76.6% contain extensive querying capacity compared to 35% as was documented previously (Fig. 3C). It is notable that in the previous study 73% of LSDBs consisted of static HTML pages. This number has decreased significantly, to just 16.4% today (Fig. 1C).

The use of a DBMS also allows other scientists working in the same field or diagnostic laboratories to directly submit a novel variant to the LSDB curators either by contacting them directly (98.1 vs. 29%) or via a dedicated online submission tool (75 vs. 68%) (Fig. 3A). In the latter case, although the percentages are comparable, the significant increase in the total number of LSDBs upon which this study was performed (almost 5-fold more) shows that an online direct submission tool is incorporated in significantly more LSDBs, mostly those based on the LOVD DBMS. Also, the use of a DBMS has enabled database curators to keep their LSDBs up to date more easily. We found that 58% of existing LSDBs were updated in 2009, and 76% were updated in the last 2 years. On the other hand, only 13% of LSDBs were last updated in 2006 or earlier (Fig. 2).

Documentation of ethnic differences of the various mutant alleles and their corresponding variation frequencies in existing LSDBs shows a striking discrepancy. Although the number of LSDBs documenting the ethnic background of the mutant allele carrier or patient has improved (69.9%, up from 26%), documentation of variation frequencies has sharply decreased (1.6%, down from 18%) (Fig. 3B). This observation can be explained either by the increasing number of National/Ethnic mutation frequency databases (NEMDBs) [Patrinos, 2006; van Baal et al., 2007] or by the lack of relatively recent studies on this topic. Today, although these data are of clear value and help stratification of national variation screening efforts, journals tend to discourage authors from publishing them. The existence of dedicated journals that encourage submission of this kind of data [Patrinos and Petricoin, 2009] would tackle this problem.
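The argument that comparable percentages hide a large difference in absolute numbers can be made concrete with a short calculation. The Python sketch below is illustrative only: the size of the earlier study is not given directly in this section, so it is inferred here from the statement that 191 static HTML databases made up 73% of that study, and the resulting counts are approximations.

```python
CURRENT_TOTAL = 1188  # LSDBs analyzed in this study

# Assumption: the earlier study's total is inferred from the statement that
# 191 static HTML databases made up 73% of it; it is not given directly here.
previous_total = round(191 / 0.73)


def estimated_count(percent: float, total: int) -> int:
    """Approximate number of databases behind a reported percentage."""
    return round(percent / 100 * total)


# Online submission tools: 75% in this study vs. 68% in the previous one.
online_now = estimated_count(75, CURRENT_TOTAL)
online_before = estimated_count(68, previous_total)
fold_increase = CURRENT_TOTAL / previous_total

print(previous_total, online_now, online_before, round(fold_increase, 1))
```

Under this inference, similar percentages correspond to several hundred more LSDBs offering online submission than before, which is the point made in the text.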


On the other hand, there are a number of shortcomings that have been observed in existing LSDBs. First of all, there have been a few national efforts to document mutant alleles observed in specific populations in DBMSs designed for LSDBs rather than NEMDBs, such as the Chinese and the Australian Human Variome Project nodes (http://china-hvp.org/LOVD; https://australianhumanvariomedatabase.arcs.org.au, respectively), both based on the LOVD platform. As a result, there will be several genes that appear in more than one LOVD installation, such as the BRCA1/2 genes, which appear in the Australian and Chinese Human Variome Project databases and in Leiden's BRCA-specific LSDB (http://chromium.liacs.nl/LOVD2/cancer), all based on the LOVD platform. This constitutes a major problem because, if such an approach is adopted by others, there will be no single LSDB documenting all mutant alleles of a particular gene, which would confuse the end user as to which LSDB is the most comprehensive. In particular, in the case of the BRCA1/BRCA2 genes, although Leiden's BRCA-specific LSDB seems to be the most comprehensive (deduced from the number of unique variants reported therein), there are several variants that are not documented in this LSDB yet are reported in the Chinese Human Variome Project (e.g., c.66dupA, c.43A>G, etc.; Table 2). The creation of NEMDBs that only document the prevalence of mutant alleles in specific populations and ethnic groups, and their corresponding mutation frequencies, where applicable, could be a solution to this problem [Patrinos, 2006].
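The redundancy problem described above, in which no single installation holds every variant of a gene, can be sketched as a set comparison. In the Python example below, c.66dupA and c.43A>G are the two variants cited in the text as reported by the Chinese node but not by the Leiden LSDB; all other entries are hypothetical placeholders, and representing each installation as a set of variant names is an assumption made for illustration.

```python
# Variant names per installation. c.66dupA and c.43A>G are cited in the text
# as present in the Chinese node but absent from the Leiden LSDB; every other
# entry is a made-up placeholder for illustration.
chinese_node = {"c.66dupA", "c.43A>G", "placeholder_variant_1"}
leiden_lsdb = {"placeholder_variant_1", "placeholder_variant_2",
               "placeholder_variant_3"}

# Variants documented only in the smaller installation.
missing_from_leiden = chinese_node - leiden_lsdb

# A merged, deduplicated view across both installations.
combined = chinese_node | leiden_lsdb

print(sorted(missing_from_leiden))
print(len(combined))
```

Even when one installation is much larger, the set difference can be non-empty, so a user consulting only the most comprehensive LSDB would still miss variants.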
Documentation of the genetic basis of several populations has already started, namely, Hellenic [Patrinos et al., 2005], Cypriot, Iranian [Kleanthous et al., 2006], and Israeli [Zlotogora et al., 2007], and such efforts have been encouraged by the existence of a dedicated DBMS for NEMDB development and curation (ETHNOS) [van Baal et al., 2010], and funded by the European Commission (FP6-INCO "MEDGENET" [031968], FP6-INFRA "ITHANET" [026539]).

A similar shortcoming lies in the fact that there are several redundant LSDBs installed in different locations, such as the LSDBs for the ATR-X gene (the Mental Retardation database [http://grenada.lumc.nl/LOVD2/MR/home.php?select_db=ATRX] and Penn State's LOVD copy of HbVar databases [http://lovd.bx.psu.edu/home.php?select_db=ATRX]) or the FLCN gene (the Folliculin Mutation Database [http://skingenedatabase.com/home.php?select_db=FLCN] and the European BHD Consortium [EBC] database [https://grenada.lumc.nl/LOVD2/shared1/home.php?select_db=FLCN]). Disappointingly, there are several LSDBs that, although installed, have no content. From the total of 502 LSDBs comprising the LOVD installation of the Mental Retardation database (data derived from Tarpey et al. [2009]), as many as 20% (100 LSDBs) had no variants documented (websites assessed November 2009). In this case, it would be highly recommended that curators do not make a new LSDB available unless a critical mass of genomic variation data, pertaining to that particular gene, is assembled.

Table 2. Comparison Among Different LOVD Installations Hosting Redundant LSDBs for the BRCA1 and BRCA2 Genes

Installation          LSDB    Total number of unique DNA variants reported    Total number of variants reported
Chinese HVP node      BRCA1   134                                             234
Chinese HVP node      BRCA2   62                                              113
Australian HVP node   BRCA1   0                                               0
Australian HVP node   BRCA2   0                                               0
BRCA                  BRCA1   502                                             1,455
BRCA                  BRCA2   485                                             924

LSDB not installed (URLs: Chinese HVP node: http://china-hvp.org/LOVD/; Australian HVP node: https://australianhumanvariomedatabase.arcs.org.au/; BRCA: http://chromium.liacs.nl/LOVD2/cancer/home.php).

A common problem in the LSDB field is password-protected LSDB access. There are more than 20 LSDBs that require users to have previously registered to access their content. Our analysis also revealed that a number of resources in several installations do not qualify as LSDBs: the International Immunogenetics Information System (http://imgt.cines.fr), which provides no information on DNA variation; the Aldehyde Dehydrogenase Gene Superfamily databases (http://www.aldh.org), which just provide a graphical distribution of the genetic variation in the ALDH genes, without an actual querying interface; and the SNCA LSDB, which only documents SNCA gene variation in portable document format (PDF) (http://www.med.upatras.gr/athanassiadou/snca_lsdb.pdf).

Last, but not least, the almost exclusive use of English in all LSDBs is another finding that was obvious from our analysis. In particular, there were only six LSDBs that documented their contents in a language other than English, namely, French. This proportion has decreased significantly compared to the previous study [Claustres et al., 2002], namely, 0.5% versus 11% (Fig. 1A). This feature may pose difficulties for users with poor literacy in English, particularly patients, hence making LSDBs useful only to English speakers.

Maturation of LSDB Content Toward Clinical Use: The Clinical Genetics Database Concept

Globally, DNA diagnostics laboratories undertake extensive genetic screening in patients affected by a plethora of disease states, but currently little of this extremely valuable primary information finds its way into the public domain, in anonymized format, for wider exploitation. Apart from the lack of suitable databases and support software, other problems relate to ethical and legal restrictions, financial structures, and limited manpower. Close partnering between LSDB stakeholders and clinical genetic diagnostics laboratories would therefore be highly desirable, to identify their needs and build key bioinformatics solutions, not only to assist in data gathering and querying, but also to routinely move diagnostic laboratory gene variation data into the public domain and report on this accordingly.

Our LSDB domain analysis suggests that current LSDBs are designed more for the research community than for the clinical/diagnostic laboratory. Of the few LSDBs (8%) where detailed phenotypic descriptions are available, even fewer combine scientific and diagnostic data on variations with associated information useful for clinicians or students (e.g., population distribution of alleles, haplotype associations), for patients and their families (e.g., treatment, diagnosis, dedicated organizations, or parent associations), and for diagnosticians (technical support in the form of primer sequences and variation detection protocols). The poor documentation on disease and protein function supports this claim. Current LSDBs do not support data retrieval and transmission between scientific personnel and clinicians working in diagnostic laboratories [Cotton et al., 2009].
To satisfy this requirement, a novel type of registry, that is, a clinical genetics database (CGDB), could be developed for use in clinical diagnostic laboratories; it would accommodate complete and reliable genotype and phenotype records on each patient for each novel submitted variation. Such registries would incorporate at least the following main features: a unique identifier for each instance of a genetic variant; detailed genotypic and phenotypic descriptions; and information regarding the person or group that has contributed the records. For the latter feature, a unique ID would define the data contributor, based on existing schemes (e.g., ResearcherID from Thomson Reuters; http://www.researcherid.com). Genotypic data entry should be automated directly from accredited genotyping platforms or a Laboratory Information Management System (LIMS).

For reasons of consistency and data content homogeneity, CGDBs will need to operate under the same, or very similar, DBMSs as the LSDBs, sharing a large number of fields. Prefabricated LSDB and CGDB database software will ideally be available for simple download and installation (i.e., “LS/CGDB-in-a-box” solutions), designed to operate in a manner that enables every LSDB and CGDB group to retain full control and ownership of their database content. Every resulting database can then be integrated with centralized search capabilities, and all search results will provide links back to the original source databases. Such a “federated” LS/CGDB system could also be structured as a virtual NEMDB for a particular population, subsequently contributing data to a central genetic variation genome browser (Fig. 4). The latter browser would then consist of a collection of not only genetic variations but also frequencies of variations in different ethnic populations, which is of particular importance for clinical genetic testing in developing countries [Patrinos, 2006].

Figure 4. A proposed structure for a “federated” LS/CGDB system. LSDBs and CGDBs are developed using an off-the-shelf DBMS for each database type that would share the majority of fields. The LSDBs and CGDBs would form a virtual NEMDB for each population that will subsequently contribute data to a central genetic variation browser (see also text for details). [Diagram: multiple LSDBs and CGDBs, grouped into NEMDBs, feeding a Central Genome Variation Browser.]
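The record features and the federated search described above can be sketched in code. This is a minimal illustration under stated assumptions, not an implementation of LOVD, ETHNOS, or any existing DBMS: the class names (`VariantRecord`, `SourceDatabase`, `CentralBrowser`), the field names, the example URLs, and the gene and variant strings are all hypothetical. It shows the three required record features (a unique identifier per variant instance, genotypic and phenotypic descriptions, and a contributor ID) and a central search layer that stores nothing itself and returns links back to the source databases.

```python
from dataclasses import dataclass, field
import uuid


@dataclass(frozen=True)
class VariantRecord:
    """One instance of a genetic variant (hypothetical field layout)."""
    gene: str
    genotype: str        # detailed genotypic description
    phenotype: str       # detailed phenotypic description
    contributor_id: str  # unique ID of the contributing person or group
    source_url: str      # link back to the database that owns the record
    # Unique identifier generated for each submitted variant instance.
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class SourceDatabase:
    """An LSDB or CGDB that retains ownership of its own records."""

    def __init__(self, name, kind, population, base_url):
        self.name = name
        self.kind = kind              # "LSDB" or "CGDB"
        self.population = population  # population served by the virtual NEMDB
        self.base_url = base_url
        self._records = []

    def submit(self, gene, genotype, phenotype, contributor_id):
        record = VariantRecord(
            gene, genotype, phenotype, contributor_id,
            source_url=f"{self.base_url}?gene={gene}",
        )
        self._records.append(record)
        return record

    def search(self, gene):
        return [r for r in self._records if r.gene == gene]


class CentralBrowser:
    """Central search layer: queries every source and keeps no copy of the data."""

    def __init__(self, sources):
        self.sources = list(sources)

    def search(self, gene, population=None):
        hits = []
        for source in self.sources:
            if population is not None and source.population != population:
                continue
            hits.extend((source.name, record) for record in source.search(gene))
        return hits


# Hypothetical usage: two databases forming a virtual NEMDB for one population.
lsdb = SourceDatabase("example-lsdb", "LSDB", "Hellenic", "https://lsdb.example.org/variants")
cgdb = SourceDatabase("example-cgdb", "CGDB", "Hellenic", "https://cgdb.example.org/variants")
lsdb.submit("GENE1", "c.100A>T", "placeholder phenotype A", "contributor-001")
cgdb.submit("GENE1", "c.100A>T", "placeholder phenotype B", "contributor-002")

browser = CentralBrowser([lsdb, cgdb])
hits = browser.search("GENE1", population="Hellenic")
for source_name, record in hits:
    print(source_name, record.record_id, record.source_url)
```

Keeping the records inside each `SourceDatabase` and having the browser hold only references to the sources mirrors the stated goal that every LSDB and CGDB group retains control of its content, while a shared record layout stands in for the common fields the two database types would share.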

Conclusions and Future Perspectives

Our detailed LSDB domain analysis revealed many elements that have recently helped to move the field forward and reduce overall data heterogeneity, such as the development of specialized DBMSs and the existence of data querying tools. Our analysis also pinpointed a number of deficiencies, such as the lack of detailed disease and phenotypic descriptions for each genetic variant, that, if addressed, would allow LSDBs to better serve the clinical genetics community, patients, their families, and related associations, and not just researchers.

LSDBs available today predominantly document genetic variants pertinent to monogenic disorders. Our expectation is that as more data on genotype/phenotype correlations become available from genome-wide association studies, particularly for complex multifactorial disorders, it will become necessary to reorient the way pathogenic sequence variants are documented in many LSDBs. This may involve not only alterations in variant annotation, but could also necessitate a restructuring of existing DBMSs or even the design of novel database schemas. Overall, the development of the proposed federated network of LSDBs and CGDBs would bridge the divide between gene-centric and genome-wide approaches to databasing variation. This would organize knowledge in a way that benefits both the research and clinical genetics communities and ultimately improves public health.

Acknowledgments

We thank Raymond Dalgleish for constructive comments on this manuscript and all GEN2PHEN partners for their feedback. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 200754 (the GEN2PHEN project).

References

Amberger J, Bocchini CA, Scott AF, Hamosh A. 2009. McKusick’s Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res 37:D793–D796.
Béroud C, Hamroun D, Collod-Béroud G, Boileau C, Soussi T, Claustres M. 2005. UMD (Universal Mutation Database): 2005 update. Hum Mutat 26:184–191.
Claustres M, Horaitis O, Vanevski M, Cotton RG. 2002. Time for a unified system of mutation description and reporting: a review of locus-specific mutation databases. Genome Res 12:680–688.
Cotton RG, Al Aqeel AI, Al-Mulla F, Carrera P, Claustres M, Ekong R, Hyland VJ, Marafie MJ, Paalman MH, Patrinos GP, Qi M, Ramesar RS, Scott RJ, Sijmons RH, Sobrido MJ, Vihinen M. 2009. Capturing all disease-causing mutation for clinical and research use: towards an effortless system for the human variome project. Genet Med 11:843–849.
den Dunnen JT, Antonarakis SE. 2001. Nomenclature for the description of human sequence variations. Hum Genet 109:121–124.
Fokkema IF, den Dunnen JT, Taschner PE. 2005. LOVD: easy creation of a locus-specific sequence variation database using an “LSDB-in-a-box” approach. Hum Mutat 26:63–68.
Hardison RC, Chui DH, Giardine B, Riemer C, Patrinos GP, Anagnou N, Miller W, Wajcman H. 2002. HbVar: a relational database of human hemoglobin variants and thalassemia mutations at the globin gene server. Hum Mutat 19:225–233.
Horaitis O, Talbot Jr CC, Phommarinh M, Phillips KM, Cotton RG. 2007. A database of locus-specific databases. Nat Genet 39:425.
Kaput J, Cotton RG, Hardman L, Watson M, Al Aqeel AI, Al-Aama JY, Al-Mulla F, Alonso S, Aretz S, Auerbach AD, Bapat B, Bernstein IT, Bhak J, Bleoo SL, Blöcker H, Brenner SE, Burn J, Bustamante M, Calzone R, Cambon-Thomsen A, Cargill M, Carrera P, Cavedon L, Cho YS, Chung YJ, Claustres M, Cutting G, Dalgleish R, den Dunnen JT, Díaz C, Dobrowolski S, dos Santos MR, Ekong R, Flanagan SB, Flicek P, Furukawa Y, Genuardi M, Ghang H, Golubenko MV, Greenblatt MS, Hamosh A, Hancock JM, Hardison R, Harrison TM, Hoffmann R, Horaitis R, Howard HJ, Barash CI, Izagirre N, Jung J, Kojima T, Laradi S, Lee YS, Lee JY, Gil-da-Silva-Lopes VL, Macrae FA, Maglott D, Marafie MJ, Marsh SG, Matsubara Y, Messiaen LM, Möslein G, Netea MG, Norton ML, Oefner PJ, Oetting WS, O’Leary JC, de Ramirez AM, Paalman MH, Parboosingh J, Patrinos GP, Perozzi G, Phillips IR, Povey S, Prasad S, Qi M, Quin DJ, Ramesar RS, Richards CS, Savige J, Scheible DG, Scott RJ, Seminara D, Shephard EA, Sijmons RH, Smith TD, Sobrido MJ, Tanaka T, Tavtigian SV, Taylor GR, Teague J, Töpel T, Ullman-Cullere M, Utsunomiya J, van Kranen HJ, Vihinen M, Webb E, Weber TK, Yeager M, Yeom YI, Yim SH, Yoo HS, Contributors to the Human Variome Project Planning Meeting. 2009. Planning the human variome project: the Spain report. Hum Mutat 30:496–510.
Kleanthous M, Patsalis PC, Drousiotou A, Motazacker M, Christodoulou K, Cariolou M, Baysal E, Khrizi K, Moghimi B, Pourfarzad F, van Baal S, Deltas C, Najmabadi H, Patrinos GP. 2006. The Cypriot and Iranian national mutation frequency databases. Hum Mutat 27:598–599.
Patrinos GP. 2006. National and Ethnic mutation databases: recording populations’ genography. Hum Mutat 27:879–887.
Patrinos GP, Brookes AJ. 2005. DNA, disease and databases: disastrously deficient. Trends Genet 21:333–338.
Patrinos GP, Petricoin EF. 2009. A new scientific journal linked to a genetic database: towards a novel publication modality. Hum Genomics Proteomics 1:e597478.
Patrinos GP, van Baal S, Petersen MB, Papadakis MN. 2005. The Hellenic national mutation database: a prototype database for mutations leading to inherited disorders in the hellenic population. Hum Mutat 25:327–333.
Riikonen P, Vihinen M. 1999. MUTbase: maintenance and analysis of distributed mutation databases. Bioinformatics 15:852–859.
Smith TD, Cotton RG. 2008. VariVis: a visualisation toolkit for variation databases. BMC Bioinformatics 9:206.
Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NS, Abeysinghe S, Krawczak M, Cooper DN. 2003. Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat 21:577–581.
Tarpey PS, Smith R, Pleasance E, Whibley A, Edkins S, Hardy C, O’Meara S, Latimer C, Dicks E, Menzies A, Stephens P, Blow M, Greenman C, Xue Y, Tyler-Smith C, Thompson D, Gray K, Andrews J, Barthorpe S, Buck G, Cole J, Dunmore R, Jones D, Maddison M, Mironenko T, Turner R, Turrell K, Varian J, West S, Widaa S, Wray P, Teague J, Butler A, Jenkinson A, Jia M, Richardson D, Shepherd R, Wooster R, Tejada MI, Martinez F, Carvill G, Goliath R, de Brouwer AP, van Bokhoven H, Van Esch H, Chelly J, Raynaud M, Ropers HH, Abidi FE, Srivastava AK, Cox J, Luo Y, Mallya U, Moon J, Parnau J, Mohammed S, Tolmie JL, Shoubridge C, Corbett M, Gardner A, Haan E, Rujirabanjerd S, Shaw M, Vandeleur L, Fullston T, Easton DF, Boyle J, Partington M, Hackett A, Field M, Skinner C, Stevenson RE, Bobrow M, Turner G, Schwartz CE, Gecz J, Raymond FL, Futreal PA, Stratton MR. 2009. A systematic, large-scale resequencing screen of X-chromosome coding exons in mental retardation. Nat Genet 41:535–543.
van Baal S, Kaimakis P, Phommarinh M, Koumbi D, Cuppens H, Riccardino F, Macek Jr M, Scriver CR, Patrinos GP. 2007. FINDbase: a relational database recording frequencies of genetic defects leading to inherited disorders worldwide. Nucleic Acids Res 35:D690–D695.
van Baal S, Zlotogora J, Lagoumintzis G, Gkantouna V, Tzimas I, Poulas K, Tsakalidis A, Romeo G, Patrinos GP. 2010. ETHNOS: a versatile electronic tool for the development and curation of National Genetic databases. Hum Genomics 4:361–368.
Zaimidou S, van Baal S, Smith TD, Mitropoulos K, Ljujic M, Radojkovic D, Cotton RG, Patrinos GP. 2009. A1ATVar: a relational database of human SERPINA1 gene variants leading to alpha1-antitrypsin deficiency. Hum Mutat 30:308–313.
Zlotogora J, van Baal S, Patrinos GP. 2007. Documentation of inherited disorders and mutation frequencies in the different religious communities in Israel in the Israeli National Genetic Database. Hum Mutat 28:944–949.
