BIOINFORMATICS
EDITORIAL
Vol. 24 no. 19 2008, pages 2127–2128 doi:10.1093/bioinformatics/btn464
Databases, data tombs and dust in the wind
Jonathan D. Wren1,∗ and Alex Bateman2
1 Arthritis and Immunology Research Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, OK 73104-5005, USA and 2 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
Contact:
[email protected]
∗ To whom correspondence should be addressed.
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Downloaded from http://bioinformatics.oxfordjournals.org/ at OMRF Library on April 3, 2013

As biomedical data accumulates, the need to store, share and organize it grows. Consequently, the number of Internet-accessible databases has been growing rapidly each year. Bioinformatics regularly publishes descriptions of biomedically relevant databases, Nucleic Acids Research has published an annual database issue since 1996, and a new open-access journal, Database: The Journal of Biological Databases and Curation, will be launched by Oxford University Press in 2009 (http://www.oxfordjournals.org/our_journals/databa/).

Since databases can be made publicly available on the Internet without publication, it is worth considering what factors should prioritize publication of database descriptions in a peer-reviewed journal. In general, publication of a database description in a journal advertises it as a valuable resource for scientific research. Implicitly, it is assumed that this resource is publicly available (most likely for free) and will be maintained. However, therein lies the problem: database papers are simply not of the same nature as regular research articles. Over time, some databases simply become inaccessible, some are created but not maintained or updated, and some are never used (Galperin, 2006). Thus, database creators, reviewers and journal editors have several additional considerations to weigh when judging, prior to publication, how valuable a new database is likely to be.

1 DATABASE PERMANENCE AND CHANGE
Many Internet-based resources have become inaccessible since their publication. Of all URLs published within any given year, ∼6% of that total will disappear each year thereafter. Approximately 20% of URLs published in MEDLINE abstracts are now inaccessible, about 20% of which are links to databases (Wren, 2008). In some cases, though, the database has simply been relocated, and many of the more popular databases are quite stable (Galperin, 2006). Some of this loss probably reflects an obsolescence factor, one that serves the community by removing outdated or unused databases. But most of the lost databases, even if infrequently accessed, are probably a loss to the scientific community. The continued existence or even probable shelf life of a database is difficult to predict up front. But like other scientific resources, such as cell lines, databases can be backed up by keeping copies offsite and/or permitting the data to be downloaded. Preservation efforts and downloadable data add to a database’s value.

On a related note, some databases are never updated and some are poorly maintained (e.g. bugs not fixed). Static data, even if organized into a queryable format, is more or less equivalent to supplementary information, and detracts from a database’s potential value. Updates and maintenance, unfortunately, can be no more guaranteed than permanence. Like the database itself, they are dependent upon the database creators.

Fortunately, both these issues have been recognized and at least one website is taking steps to address them. Pathguide (http://www.pathguide.org) provides data on popularity as well as uptime for some database resources (Bader et al., 2006). Hopefully, a centralized user satisfaction website will evolve from this or another effort, one that records each published database, permits users to comment on its utility (and disappearance) and permits authors to post updates regarding relocations or ongoing difficulties. This would help quantify database utility (e.g. to justify requests for funding support) and track updates. It would also give reviewers a basis, in the record of existing databases from the same authors, for judging how likely a new database is to be maintained and updated.

2 DATA TOMBS
Among existing online databases, some are so rarely accessed that they would better be characterized as ‘data tombs’. Although these are somewhat analogous to uncited papers, the difference is that it should be theoretically easier to identify a potential user base for a database resource than to predict who might build upon or cite a body of research. It is difficult to objectively estimate, a priori, how frequently a new database might be accessed. A narrow scope might mean few users, but it could be of great value to those few. A broad scope might suggest a wide user base, but the data within might be useless. But there are some factors that might help identify potential data tombs.

First, if a database has been cited (whether by its URL or name) by a group other than its creators, that is suggestive evidence of its potential utility. In cases where the need for a new database is in doubt, it might not be unreasonable to set this as a prerequisite for moving forward with the peer-review process. Such a reference could, of course, be fabricated, but doing so is non-trivial. In the absence of citations, signed letters of support from users could similarly be used during the review process to persuade reviewers of its utility. Data tombs, in large part, seem to have resulted from a ‘build it and they will come’ philosophy, which is OK as a means of justifying database creation, but not publication.

Second, it is worth requesting a mix of reviewers: some with expertise in biological database design, and some anticipated to be among the users of the database. It is also essential that the reviewers be asked to report on the usage and testing of the database as well as reviewing the manuscript.
Third, it would be useful to have a subsection in every database paper in which authors explicitly describe, and make convincing arguments for, their anticipated user base if none exists already. These arguments should rest not on generic proclamations (e.g. ‘this database can be of value to researchers who access genomic data’), but on specific examples (e.g. ‘so-and-so et al. were limited to the analysis of 431 genes of category X in their recent paper, but if they had been able to access our database, they could have more than quadrupled that number’).
3 DATABASE UTILITY
Putting aside issues of permanence and anticipated access frequency, it is also important to weigh potential utility. Bear in mind that the first to publish a Database of Purpose X raises the requirements for anyone who wants to publish a similar database in the future. Merely creating a better quality database will probably not be enough to justify publication if the content is largely the same, yet it seems a shame that an inferior version would be part of the publication record rather than a superior one. However, such events cannot be reasonably anticipated, and so requesting low-cost and/or high-benefit improvements is best done prior to publication, when authors are most motivated to make such changes. Specifics vary by database, and design considerations for good databases have been outlined (Bateman, 2007). Here are some of the more important strategic considerations.

First, more can be less. Busy web pages can confuse and distract. Developers often equate the power of their software with the number of options, but users usually equate the number of options with the number of barriers between them and their results. Default parameters should already be set and labeled as such. It is good practice to have a link or help feature explaining why one would want to choose one option over another if it is not already self-explanatory. If a database is hard to navigate, query or use, users will prefer an alternative that is less sophisticated but easier to use.

Second, the more a database can be integrated into other programs, the greater its potential value. The ability to access database entries by URL parameters (e.g. as one can do to link to PubMed articles by their PubMed ID) or by a programmatic interface (e.g. APIs, SOAP) also adds to its value and utility, since it can become a resource not just to people, but to programs.

Third, all databases should be evaluated in terms of the value provided to their intended user base. For example, agglomerating data from different sources into a ‘metabase’ can be potentially useful, but if the users of the database would mostly be programmers who could just as easily combine the different data sources themselves, then there is not much value.

Finally, it is worth noting that the most valuable databases are either central repositories of data or add value by curation of biological information. Databases that merely parse a larger database (e.g. PDB) into a smaller, focused subset are rarely of any practical value. When judging a database, an important question to be asked is whether the sum of biological knowledge has been increased by the database. If the answer is no, then what purpose does it fulfill?

4 SUMMARY
Publishing database descriptions in peer-reviewed journals implies that they have value to the scientific community, value at least equal to the regular research papers published in the same journal. This may vary by journal, but no journal would want to publish databases if they knew they were destined to become data tombs or dust in the wind. Estimating the potential value of a new database needs to focus on the nature of the resource being described and take into account that, unlike regular research articles, a database’s continued existence, utility and maintenance can only be estimated probabilistically. Thus, factors that increase this probability increase its potential value.

Conflict of Interest: none declared.

REFERENCES
Bader,G.D. et al. (2006) Pathguide: a pathway resource list. Nucleic Acids Res., 34, D504–D506.
Bateman,A. (2007) Editorial. Nucleic Acids Res., 35, D1–D2.
Galperin,M. (2006) The molecular biology database collection: 2006 update. Nucleic Acids Res., 34, D3–D5.
Wren,J.D. (2008) URL decay in MEDLINE—a 4-year follow-up study. Bioinformatics, 24, 1381–1385.
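As an illustration of the point made under Database Utility about access by URL parameters, the following is a minimal sketch of what predictable, identifier-based links might look like from the point of view of a client program. It is not drawn from any particular database: the base URL `https://db.example.org`, the `/entry/` and `/search` paths, and the function names are all hypothetical, chosen only to show the idea that a program can construct a link to an entry from nothing more than its identifier.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL, used only for illustration; no real database is implied.
BASE_URL = "https://db.example.org"


def entry_url(entry_id: str, base_url: str = BASE_URL) -> str:
    """Return a stable link to a single entry, addressed by its identifier."""
    # quote() with safe='' escapes every character that is unsafe in a path segment.
    return f"{base_url}/entry/{quote(entry_id, safe='')}"


def query_url(base_url: str = BASE_URL, **params: str) -> str:
    """Return a search link whose parameters are carried in the query string."""
    # Sorting keeps the generated URL deterministic for a given set of parameters.
    return f"{base_url}/search?{urlencode(sorted(params.items()))}"


if __name__ == "__main__":
    print(entry_url("P12345"))
    print(query_url(organism="Homo sapiens", format="json"))
```

Under this assumed scheme, any script, spreadsheet or other database can link to an entry without first navigating a web form, which is the sense in which such a resource serves programs as well as people.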