Nucleic Acids Research, 1999, Vol. 27, No. 1, 10–11
© 1999 Oxford University Press
DBcat: a catalog of biological databases

Claude Discala1,2, Marion Ninnin1, Frédéric Achard1, Emmanuel Barillot1,3,* and Guy Vaysseix1,3

1GIS Infobiogen, 7 rue Guy Môquet BP 8, 94801 Villejuif cedex, France, 2CNS, 2 rue Gaston Crémieux BP 191, 91006 Évry cedex, France and 3Généthon, 1 rue de l’Internationale BP 60, 91002 Évry cedex, France

Received August 31, 1998; Accepted October 13, 1998
ABSTRACT

The DBcat (http://www.infobiogen.fr/services/dbcat) is a comprehensive catalog of biological databases, maintained and curated on a daily basis at GIS Infobiogen. It contains more than 400 databases classified by application domain. The DBcat is a structured flat file library that can be searched by means of an SRS server or a dedicated Web interface. The files are available for downloading from the Infobiogen anonymous ftp server.

INTRODUCTION

As a result of scientific, historical and political factors, biological data are scattered across dozens of independent databases. Moreover, these databases have heterogeneous formats and contents. To assist researchers in identifying the resources pertinent to their information needs, we developed the DBcat, a comprehensive catalog of biological databases. The DBcat is also used on a daily basis by GIS Infobiogen to fulfill its mission of offering resources for French research in biology. It provides material for (i) maintaining a technology watch and systematically analyzing the sources of biological data, (ii) providing efficient access to the data and (iii) performing daily mirroring of the most important databases.
Figure 1. A DBcat entry from the DBcat database.
ORGANIZATION

The DBcat is structured as a flat file library. Each entry describes one database, with separate fields for the various types of data: the name (NAME), the database domain, drawn from a controlled vocabulary (DOMAIN), the description (DESCRIPTION), the names of the authors and Email contacts (AUTHOR, CONTACT), bibliographic references (RA, RT, RL), the site where the database is produced (ORIGINAL-SITE), an Email address or URL for submitting an entry to the database (SUBMIT), and a list of anonymous ftp or Web sites from which the database can be retrieved or queried (URL-FTP, URL-WWW). As an illustration, the entry corresponding to the DBcat database itself is reproduced in Figure 1.

DATA ACQUISITION

The DBcat is an on-going effort that originated in 1994, when Généthon built, for in-house use, a list of programs and data sources available via the Internet. This list gave birth to the BioCatalog (a software directory of general interest in molecular biology and genetics, maintained at the EBI, http://www.ebi.ac.uk/biocat/) (1) and to the DBcat. The latter is now produced at GIS Infobiogen. New databases are sought on the Web, by means of either general-purpose Web search engines or biology-oriented Web sites. Journals, such as the Nucleic Acids Research Database Issue, are also consulted. The producers of each database are asked by Email to complete a form. There is also a Web form for submitting a new database entry to the DBcat (http://www.infobiogen.fr/services/dbcat/file/dbcat_form.html). If the author has
*To whom correspondence should be addressed at: GIS Infobiogen, 7 rue Guy Môquet BP 8, 94801 Villejuif cedex, France. Tel: +33 1 49 58 36 82; Fax: +33 1 45 59 52 50; Email: [email protected]
validated the entry corresponding to their database, it is marked as CHECKED.

ACCESS

The DBcat contains more than 400 database entries, available in one flat file. To reflect the areas of interest of the users, the database entries are also grouped into eight application domains: DNA, RNA, Protein, Genomics, Mapping, Protein structure, Literature and Miscellaneous. The number of databases listed in each domain is given in Table 1.

Table 1. DBcat grouped by application domains

Domain               No. of records
DNA                  60
RNA                  21
Protein              74
Genomics             57
Mapping              28
Protein structure    18
Literature           37
Miscellaneous        113
Total                408
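The per-domain grouping of Table 1 can in principle be recomputed from the flat file by tallying the DOMAIN field of each entry. The following Python fragment is a minimal sketch of that idea, not the DBcat software itself. The field tags (NAME, DOMAIN, URL-WWW) come from the entry description, but the line layout is an assumption: one "TAG value" pair per line, with a hypothetical "//" line closing each entry. The three sample entries are placeholders invented for illustration, and the function names are ours.

```python
from collections import Counter

# Assumption: entries end with a "//" line; the article does not
# specify the actual separator used in the DBcat flat file.
ENTRY_SEPARATOR = "//"

# Counts as published in Table 1.
TABLE_1 = {
    "DNA": 60, "RNA": 21, "Protein": 74, "Genomics": 57, "Mapping": 28,
    "Protein structure": 18, "Literature": 37, "Miscellaneous": 113,
}


def parse_entries(text):
    """Split flat-file text into a list of {tag: [values]} dicts.

    Assumes one 'TAG value' pair per line (hypothetical layout).
    Repeated tags, e.g. several URL-WWW lines, accumulate in a list.
    """
    entries, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == ENTRY_SEPARATOR:
            if current:
                entries.append(current)
            current = {}
            continue
        parts = line.split(None, 1)  # tag, then the rest of the line
        value = parts[1] if len(parts) > 1 else ""
        current.setdefault(parts[0], []).append(value)
    if current:  # tolerate a missing final separator
        entries.append(current)
    return entries


def count_by_domain(entries):
    """Tally entries by their first DOMAIN value ('UNKNOWN' if absent)."""
    return Counter(e.get("DOMAIN", ["UNKNOWN"])[0] for e in entries)


# Placeholder entries, invented for illustration only.
SAMPLE = """\
NAME ExampleDB-A
DOMAIN DNA
URL-WWW http://example.org/a
//
NAME ExampleDB-B
DOMAIN Protein
//
NAME ExampleDB-C
DOMAIN DNA
//
"""

entries = parse_entries(SAMPLE)
counts = count_by_domain(entries)
print(counts.most_common())
print(sum(TABLE_1.values()))  # total of the per-domain counts in Table 1
```

Storing each field as a list keeps repeatable fields such as URL-WWW or the bibliographic lines intact. The real separator and continuation-line rules would have to be checked against the downloaded flat file.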
The DBcat offers users several modes of access:
(i) download of the flat files (ftp://ftp.infobiogen.fr/pub/db/dbcat);
(ii) a Web homepage with a simple query-by-name interface (http://www.infobiogen.fr/services/dbcat/);
(iii) SRS servers: a number of SRS sites index the DBcat catalog, e.g., GIS Infobiogen (http://www.infobiogen.fr/srs/) in France or Seqnet (http://www.seqnet.dl.ac.uk:80/srs5/) in the UK.

CONCLUSION

We plan to introduce two fields for better characterization of a DBcat entry: (i) DR, a list of databases having cross-references to the database described by the entry; (ii) Access, a list of the interfaces available for accessing the data, e.g., SQL, ORB, SRS. The idea is to facilitate the work of researchers who need to interact with a number of heterogeneous databases simultaneously, typically bioinformaticians working on data integration and interoperation (2–4). We strongly encourage the submission of new data to the DBcat, as well as updates. Feedback is crucial to the success of the DBcat.

ACKNOWLEDGEMENTS

The authors wish to thank Patricia Rodriguez-Tomé and the GREG for supporting the first version of the DBcat. We also wish to thank the authors/curators of the 408 databases listed in this catalog for providing their work and expertise to the community.

REFERENCES

1 Rodriguez-Tome,P. (1998) Bioinformatics, 14, 469–470.
2 Achard,F., Cussat-Blanc,C., Viara,E. and Barillot,E. (1998) Bioinformatics, 14, 342–348.
3 Karp,P.D. (1995) J. Computat. Biol., 2, 573–586.
4 Fasman,K.H. (1994) J. Computat. Biol., 1, 165–171.