Mathl. Comput. Modelling Vol. 16, No. 6/7, pp. 93-101, 1992
Printed in Great Britain
0895-7177/92 $5.00 + 0.00 Pergamon Press Ltd

ACCESS TO MOLECULAR BIOLOGY DATABASES

G. M. KEEN, G. W. REDGRAVE, J. R. LAWTON, M. J. CINKOSKY
Theoretical Biology and Biophysics Group, T-10, MS K710
Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

S. K. MISHRA
Genetics Department, Box 8232
Washington University School of Medicine
4566 Scott Ave, St Louis, MO 63110, U.S.A.

J. W. FICKETT AND C. BURKS
Center for Human Genome Studies
Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.
Abstract-The LiMB database was created to catalogue and begin coordinating access to the rapidly proliferating databases relevant to molecular biology. LiMB contains information about these databases and is currently implemented in a relational database management system that allows for complex, multiconditional queries. We present an overview of LiMB, a description of its current implementation, and a discussion of the role of LiMB in developing strategies for automated access to distributed, heterogeneous databases.
1. INTRODUCTION
The process of generating knowledge in molecular biology can be represented as following the cycle shown in Figure 1. Each of the stages depicted imposes characteristics on the medium required for expressing the associated information. For example, raw experimental data (e.g., an autoradiograph of an electrophoretic gel, or a plot of activity in a chromatographic separation) are very often graphics-intensive, requiring for adequate representation a photograph on paper or graphics capabilities on a computer. The variety of data types and structures required for adequately representing molecular biology is also enriched by the number of domains of biology (see Figure 2) spanned in presenting the interpretations of data and relating them to the functional and physical contexts of the particular system being studied. For example, a DNA sequence (an interpretation of experimental data) is neither as interesting nor nearly as useful unless it is tied into and associated with a node in a taxonomic tree, a node in a representation of cell substructure, and one or more nodes in a representation of cellular/macromolecular function.
The sheer volume of molecular biological data in the past two decades has already precipitated a move to computer media for collection, maintenance, and analysis of data. However, recent marked improvements, both in database management systems (DBMS) and in the flexibility and richness of user interfaces (as well as the power of the systems supporting them), are leading to another, potentially larger wave of migration of molecular biological data onto computer media.
Molecular biology is thus becoming increasingly dependent on computers to collect, maintain, and analyze the tremendous amounts of raw and interpreted data it generates. Consequently, a large number of databases designed to manage these data have appeared. However, a lack of widespread standards for organizing and accessing these databases, and sparse implementation of those standards that do exist, have left virtually every existing molecular biology database with its own access protocols and query platform. Thus, the molecular biologist wanting to access molecular biological data has to: (i) find the relevant databases among a highly decentralized (geographically and institutionally) set; (ii) learn a number of access protocols and query languages for the relevant, identified databases; and (iii) post-process (most often with a high degree of manual intervention) the query results on individual databases to answer the exact query that initiated the search.

Figure 1. Information flow in molecular biology. [Diagram labels: source materials; experimental protocols; raw data; biological entities and their interrelations; metadatabases.]

We are grateful to R. Sutherland for technical assistance in implementing LiMB in the RDBMS. C.B., G.K., G.R., and J.L. were supported in part by institutional funding from Los Alamos National Laboratory, and J.F. and M.C. were supported in part by funding from DOE/OHER (ERW-FllG). This work was done under the auspices of the U.S. Dept. of Energy.
Figure 2. Biological domains. Many of the databases listed in LiMB are primarily focused on macromolecules and their chemical structures. However, related annotation and links that provide information about the physical and functional context of these primary data extend into other “domains” of biological information.
The LiMB (Listing of Molecular Biology Databases) database was created to help facilitate and simplify this process [1]. It is a database of databases, or a database directory. LiMB provides a systematic and coordinated approach to identifying, linking, and accessing heterogeneous, distributed databases relevant to molecular biology; in short, it provides an overview of these databases and the data sets they cover. As such, it helps molecular biologists to access and contribute data relevant to their research. LiMB also contains information about the computer and software systems on which the different databases are maintained, which is useful for database managers and designers in building software links for related data sets. This also helps users find software suitable to their applications as an alternative to developing it themselves. LiMB provides these data to the scientific community in a flat file distribution, free of charge, on hardcopy, on PC (MS DOS) or Macintosh formatted floppy, and via electronic network (see below).

2. METHODS

Systems
The LiMB database is currently managed with the Sybase (Emeryville, CA) relational database management system (RDBMS). (For a discussion of the relational model, see [2].) It is maintained on Sun Microsystems (Mountain View, CA) workstations running under UNIX (Sun OS 4.1.1). Front end tools for data transmission (e.g., to generate the distributed flat file version of LiMB) into and out of the RDBMS were developed with UNIX shell tools and the C programming language [3].

Release 2.0 of LiMB
The most recent release (2.0) of LiMB was initially distributed in August 1990 and contains information on 94 databases. Release 1.0 (February 1988) contained 53 entries, and Release 1.2 (January 1990) contained 76 entries. Release 3.0 is planned for summer 1992.

Access to LiMB
Since its inception in October 1986, LiMB has been directly distributed to more than 600 people and institutions in over 18 countries. Many researchers have also accessed it indirectly through several network servers (e.g., [4]) that have mounted copies of LiMB for the communities they serve. LiMB is also available for anonymous FTP from GenBank at the address “genbank.bio.net” (host # 134.172.1.160). It is found in the directory pub/db/LiMB. Inquiries regarding access to LiMB should be directed to: LiMB Database; Theoretical Biology and Biophysics Group; T-10, MS K710; Los Alamos National Laboratory; Los Alamos, NM 87545; U.S.A. (e-mail, [email protected]; fax, 505-665-3493). Information about retrieval of LiMB through an automatic network server can be obtained by sending an electronic mail message, with a message text that contains only the word “limb-info”, to “bioserve@genome.lanl.gov”.

3. DATABASE OVERVIEW

Contents of LiMB
An entry in the flat file distributed version of LiMB currently consists of 53 fields. The information contained in these fields covers: names and addresses of database staff, details about data access and submissions, keywords describing primary and secondary biological data items covered, database size, hardware and software for database management, source(s) of data, funding, literature citations, etc. Table 1 describes these fields in greater detail. Whenever possible, the information in LiMB is based on questionnaires completed by database managers. Otherwise, secondary sources such as journal articles are used. LiMB also includes an index of all the biological keywords found in the descriptions of individual databases, pointing to the names of the database entries in which they are found. An abbreviated listing of the databases included in LiMB has been given previously [5]; the most current view is available by acquiring a copy of LiMB (see information provided above).
Table 1. Field definitions for flat file format.

Field      Definition
entry      Name of entry
number     Database number
history    History of entry
status     Administrative status of entry information
res.nam    Name of respondent
res.add    Address of respondent
res.tel    Telephone number of respondent (as dialed from U.S.A.)
res.net    Network address of respondent
gen.nam    Name for general inquiry
gen.add    Address for general inquiry
gen.tel    Telephone number for general inquiry (as dialed from U.S.A.)
gen.net    Network address for general inquiry
con.nam    Name for contributing data
con.add    Address for contributing data
con.tel    Telephone number for contributing data (as dialed from U.S.A.)
con.net    Network address for contributing data
acc.nam    Name for acquiring data
acc.add    Address for acquiring data
acc.tel    Telephone number for acquiring data (as dialed from U.S.A.)
acc.net    Network address for acquiring data
name.now   Current, formal name of database
name.alt   Alternative and/or informal names of database
name.obs   Obsolete/incorrect names of database
source     Source of data in database
funding    Funding base for data bank
citation   Recent publications describing database and/or data bank effort
charter    Formal or informal charter of database
cross.nam  Other databases to which this database is cross-referenced
data.pri   Primary data items in the database
data.sec   Secondary data items in the database
hardware   Hardware the data are maintained on
op.sys     Operating system hardware is run under
dbms       Software system used for maintaining data
language   Programming language used for software system
format     Format used for distributed flat text files
software   Software that is distributed with the database
access     Limitations on access to data
update     Frequency of database updates
con.onl    Can contributions be made to database on-line?
con.mag    Can contributions be made to database on magnetic tape?
con.flp    Can contributions be made to database on floppy disk?
con.elm    Can contributions be made to database by electronic mail?
con.pap    Can contributions be made to database on hardcopy?
acc.onl    Is database distributed on-line?
acc.mag    Is database distributed on magnetic tape?
acc.flp    Is database distributed on floppy disk?
acc.elm    Is database distributed on electronic mail?
acc.pap    Is database distributed on hardcopy?
byt.all    Number of bytes contained in the database
byt.pri    Number of bytes contained in primary data items in the database
ent.pri    Number of entries (individual primary data items with associated secondary data items) in the database
comment    Any additional useful information about the database
///        End of entry
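As an illustration of how a flat file entry and the keyword index described above might be processed, the following is a minimal sketch in Python. The field tags are taken from Table 1, but the "tag, whitespace, value" line layout, the semicolon-separated keywords, and all sample values are assumptions made for this example; the actual layout of the distributed file is not specified here.

```python
from collections import defaultdict

# Hypothetical entries: tag names follow Table 1, but the line layout
# ("tag<whitespace>value") and every value below are assumptions.
SAMPLE = """\
entry     Example Sequence Bank
number    1
data.pri  nucleotide sequences
data.sec  citations; taxonomy
acc.flp   yes
///
entry     Example Structure Bank
number    2
data.pri  protein structures
data.sec  citations
acc.flp   no
///
"""

def parse_entries(text):
    """Split flat-file text into a list of {tag: value} dictionaries."""
    entries, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.strip() == "///":      # end-of-entry marker (Table 1)
            entries.append(current)
            current = {}
            continue
        tag, _, value = line.partition(" ")
        current[tag] = value.strip()
    return entries

def keyword_index(entries, tags=("data.pri", "data.sec")):
    """Map each keyword to the sorted names of the entries that use it."""
    index = defaultdict(set)
    for entry in entries:
        for tag in tags:
            for keyword in entry.get(tag, "").split(";"):
                if keyword.strip():
                    index[keyword.strip()].add(entry["entry"])
    return {keyword: sorted(names) for keyword, names in index.items()}

entries = parse_entries(SAMPLE)
index = keyword_index(entries)
print(index["citations"])
```

With the two hypothetical entries above, looking up "citations" returns both entry names, while "taxonomy" returns only the first. Such an index answers single-keyword lookups, but any multi-condition question still requires code written against the specific file layout.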
Maintenance of LiMB in an RDBMS

LiMB was initially maintained in flat text file form identical to the distributed version; this was convenient and expedient for the early development of the database, when relevant fields and data values were still being formulated and changing rapidly. However, we have recently ported the LiMB database over to an RDBMS (the schema for the relational version of LiMB is presented in Figure 3), which offers a number of advantages over our previous strategy. The relational model has been implemented by a number of commercial vendors. This makes extensive, off-the-shelf software available for individual applications and, because the relational model is well defined, applications in one commercial system can be ported relatively straightforwardly to other commercial systems.
[Figure 3 diagram: legible labels include the tables PUBLICATION and DBASE; DBASE fields such as db_id, db_abriv, db_name, db_updates, db_access, db_bytall, db_bytpri, db_charter, and db_dialup; and relationship labels such as “is cross-referenced with”, “runs on what software”, “contains what data”, “is reference for”, “is contact for”, and “is located at”.]

Figure 3. LiMB relational schema. Relational tables are composed of columns (categories of data, or “fields”) and rows (specific instances of data in each field, or “values”). The LiMB tables and the fields they contain are shown. For example, the table dbase (short for “database”) contains the field db_name (short for “current formal name of the database”). For the most part, the table names and field names are self-explanatory; a detailed schema and field definition is available upon request.
In addition, the implementation of the relational model has been accompanied by the development of a Structured Query Language (SQL), which allows a great variety of queries to be constructed from a relatively similar syntax and, as above, allows one to port queries from an application installed in one RDBMS to the same application installed in another RDBMS with relatively little, if any, modification.

Redundancy of the individual data values is greatly reduced, which decreases both the chance of errors arising in association with maintaining multiple copies of the same data value and the requirement for disk space in storing these copies. For example, if an individual managed three different databases and coauthored six different articles cited in LiMB, his or her name would appear in the flat file version of LiMB a total of nine times (once for each managed database and once for each article). But, in the relational model, the name would only have to be entered once.
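The nine-versus-one arithmetic in the example above can be checked with a small sketch. It uses Python's built-in sqlite3 module rather than Sybase, and the table names, column names, and the person's name are hypothetical; they are not taken from the LiMB schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person  (p_id INTEGER PRIMARY KEY, p_name TEXT);
CREATE TABLE manages (p_id INTEGER, db_name TEXT);   -- person manages database
CREATE TABLE authors (p_id INTEGER, article TEXT);   -- person coauthored article
""")

# The name is entered exactly once ...
conn.execute("INSERT INTO person VALUES (1, 'A. Researcher')")
# ... and referenced by key from three databases and six articles.
conn.executemany("INSERT INTO manages VALUES (1, ?)",
                 [(f"Database {i}",) for i in range(1, 4)])
conn.executemany("INSERT INTO authors VALUES (1, ?)",
                 [(f"Article {i}",) for i in range(1, 7)])

# A flat-file view repeats the name once per managed database and article.
flat_rows = conn.execute("""
SELECT p_name FROM person JOIN manages USING (p_id)
UNION ALL
SELECT p_name FROM person JOIN authors USING (p_id)
""").fetchall()
stored_names = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(len(flat_rows), stored_names)  # 9 1

# A correction is made in one place and is seen by every reference.
conn.execute("UPDATE person SET p_name = 'A. B. Researcher' WHERE p_id = 1")
names = {row[0] for row in conn.execute(
    "SELECT p_name FROM person JOIN authors USING (p_id)")}
```

The single UPDATE at the end is the practical payoff: in the flat file, the same correction would have to be repeated in nine places, with nine chances to introduce an inconsistency.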
The greatest advantage to the relational scheme is the power it gives the user to make complex ad hoc and previously unidentified queries without the overhead of development of additional software. For example, imagine a query of the flat file version of LiMB for the names of all the databases located in the United States or West Germany that both (i) contain information on nucleotide sequences and (ii) distribute their data on floppy disk.
Answering this query on a flat file representation of LiMB can certainly be done, either with a series of simple pattern searches accompanied by manual filtering of the output, or by writing software specific to the flat file format for the relevant fields and specific to the boolean logic implied. However, assuming adherence to the relational model and atomicity of the relevant fields, the query can be simply formulated (and answered) in SQL with no manual massaging of output or additional software development. This is illustrated in Figure 4.

    gwr [1] isql -Ulimb
    Password:
    1> SELECT distinct db_name, ad_country
    2> FROM dbase_new, address, dbdata, data, media, dbmed
    3> WHERE dbase_new.db_ad_id = address.ad_id
    4> AND dbase_new.db_id = dbdata.dd_db_id
    5> AND dbdata.dd_dt_id = data.dt_id
    6> AND dbase_new.db_id = dbmed.dm_db_id
    7> AND dbmed.dm_md_id = media.md_id
    8> AND data.dt_name like "%nucleotide sequences%"
    9> AND (address.ad_country = "U.S.A." or address.ad_country = "WEST GERMANY")
    10> AND media.md_type = "floppy disk"
    11> AND dbmed.dm_isdis = 1
    12> go

    db_name                                      ad_country
    -------------------------------------------  ------------
    Signal Scan                                  U.S.A.
    tRNA Compilation                             WEST GERMANY
    Berlin RNA Data Bank                         WEST GERMANY
    Transcription Factor Database                U.S.A.
    Cloning Vector Sequence Database             U.S.A.
    Compilation of Small RNA Sequences           U.S.A.
    The GenBank Genetic Sequence Data Bank       U.S.A.
    HIV Sequence & Sequence Analysis Database    U.S.A.

    (8 rows affected)

Figure 4. Sample DBMS query. Line 1 indicates what action is being performed (“SELECT”), and the object entity of that action (“db_name”, “ad_country”). Line 2 lists all the tables being used in the query. Lines 3-7 establish joins between the tables, and lines 8-11 specify the required conditions. Line 12 indicates that the query is finished and that the SQL server should execute the query.
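For readers without access to a Sybase server, the structure of the query in Figure 4 can be tried against a toy data set. The following is a sketch only: it uses Python's built-in sqlite3 module, the table and column names are those legible in Figure 4, and every row of data, together with the column types, is invented for illustration. String literals are written with single quotes, which is the portable SQL form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address   (ad_id INTEGER PRIMARY KEY, ad_country TEXT);
CREATE TABLE dbase_new (db_id INTEGER PRIMARY KEY, db_name TEXT, db_ad_id INTEGER);
CREATE TABLE data      (dt_id INTEGER PRIMARY KEY, dt_name TEXT);
CREATE TABLE dbdata    (dd_db_id INTEGER, dd_dt_id INTEGER);
CREATE TABLE media     (md_id INTEGER PRIMARY KEY, md_type TEXT);
CREATE TABLE dbmed     (dm_db_id INTEGER, dm_md_id INTEGER, dm_isdis INTEGER);

-- Invented sample rows (not LiMB data).
INSERT INTO address VALUES (1, 'U.S.A.'), (2, 'WEST GERMANY'), (3, 'JAPAN');
INSERT INTO dbase_new VALUES
    (1, 'Toy Sequence Bank', 1), (2, 'Toy RNA Bank', 2),
    (3, 'Toy Structure Bank', 1), (4, 'Toy Tape Bank', 1),
    (5, 'Toy Overseas Bank', 3);
INSERT INTO data VALUES (1, 'nucleotide sequences'), (2, 'protein structures');
INSERT INTO dbdata VALUES (1, 1), (2, 1), (3, 2), (4, 1), (5, 1);
INSERT INTO media VALUES (1, 'floppy disk'), (2, 'magnetic tape');
INSERT INTO dbmed VALUES (1, 1, 1), (2, 1, 1), (3, 1, 1), (4, 2, 1), (5, 1, 1);
""")

QUERY = """
SELECT DISTINCT db_name, ad_country
FROM dbase_new, address, dbdata, data, media, dbmed
WHERE dbase_new.db_ad_id = address.ad_id          -- joins
  AND dbase_new.db_id = dbdata.dd_db_id
  AND dbdata.dd_dt_id = data.dt_id
  AND dbase_new.db_id = dbmed.dm_db_id
  AND dbmed.dm_md_id = media.md_id
  AND data.dt_name LIKE '%nucleotide sequences%'  -- conditions
  AND (address.ad_country = 'U.S.A.' OR address.ad_country = 'WEST GERMANY')
  AND media.md_type = 'floppy disk'
  AND dbmed.dm_isdis = 1
ORDER BY db_name
"""

rows = conn.execute(QUERY).fetchall()
for db_name, country in rows:
    print(f"{db_name:<20} {country}")
```

Of the five toy databases, only two satisfy every condition: the structure bank fails the data-type test, the tape bank fails the medium test, and the overseas bank fails the country test. Changing any one condition means editing one line of the query rather than writing new software.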
It is worth noting that several of the databases described in LiMB have also begun or completed converting to an RDBMS; it is likely that this trend will continue, and as such, could provide an important advantage in developing systems supporting queries that span multiple databases (see the Discussion below).

The relational schema of LiMB is currently being revised to allow for a greater range of information. The new design for the database is now completed (Figure 3); once the transition to the new schema is complete, LiMB will, upon request, be distributed in its relational form via Sybase-compatible table dumps on magnetic tape.
4. DISCUSSION

Span of Biological Data Covered in LiMB
The databases listed in LiMB already sample many of the domains implied in Figure 2, and sample several not schematized in Figure 2. In the initial phases of collecting data for LiMB, very little structure was imposed on the biological keys used to attach the individual database entries to nodes in the domains in Figure 2. However, we plan to begin imposing more structure on the biological “keywords” in LiMB in coordination with developing more systematic schematizations of biological information (see [6] for an example).

Role of LiMB in Providing Automated Access to Databases
There is a waxing interest [7] in characterizing and building a matrix of biological knowledge and at the same time a growing recognition [8,9,10], especially in the context of the proposed effort to completely map and sequence the human genome, of the need for a systematic and coordinated approach to designing, developing, and maintaining molecular biological databases. Clearly, a “database of databases” supports developing and maintaining such an overview.

There is a rapidly increasing number of computer databases serving the molecular biology community. Prior to the existence of LiMB (and other directories similar to it), a scientist had no way of determining which databases were relevant to a particular query without accessing the individual databases (if they could be identified in the first place). This was a very hit-and-miss, time-consuming process (Figure 5a). Consequently, there has been an outstanding need for a central repository of information about these databases (one sign of this is the large number of requests for data and descriptive articles about LiMB that we received in the year prior to Release 1.0). LiMB is providing information useful to potential users of and contributors to the individual databases, but perhaps more importantly, will also serve the needs of the growing community of scientists who are designing and maintaining these databases and the connections between them.

With LiMB in its current state of development, an individual can pre-screen individual databases on the basis of information content, geographical convenience, medium of data distribution, etc. (Figure 5b). However, pursuit of any cross-database query will eventually involve accessing two or more databases and, in most cases, manually post-processing the results of the individual queries to effect the logical construct of a cross-database query.
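To make the post-processing step concrete, the sketch below combines the results of two separate queries on a shared key, which is the kind of work that is currently done by hand. The two result sets, their field names, and the identifiers in them are entirely hypothetical and do not correspond to any database listed in LiMB.

```python
# Hypothetical result sets, as if returned by two unrelated databases
# after each was queried separately through its own interface.
sequence_hits = [
    {"organism": "Escherichia coli", "accession": "SEQ-001"},
    {"organism": "Homo sapiens", "accession": "SEQ-002"},
]
structure_hits = [
    {"organism": "Homo sapiens", "structure": "STR-104"},
    {"organism": "Mus musculus", "structure": "STR-221"},
]

def cross_database_join(left, right, key):
    """Pair up records from two result sets that agree on `key`."""
    by_key = {}
    for record in right:
        by_key.setdefault(record[key], []).append(record)
    # Merge each left record with every right record sharing its key value.
    return [{**l, **r} for l in left for r in by_key.get(l[key], [])]

combined = cross_database_join(sequence_hits, structure_hits, "organism")
print(combined)
```

Only the organism present in both result sets survives the join. The sketch assumes the two databases spell the shared key identically, which in practice is often the hardest part of the manual step.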
Ideally, the scientist should only have to interact with a single interface; that interface would parse the scientist’s query, query LiMB to determine which databases are relevant to that query and how to access them, and then automatically access the individual databases and pass the relevant information back to the scientist. LiMB will eventually be used as a tool to help facilitate a front end for friendly, automatic access to a number of distinct databases, as implied in Figure 5c.

We are currently evaluating the several strategies being developed or implemented by other groups. At one end of the spectrum are approaches based on collecting flat file versions of several databases onto one computer and allowing text-searching across all of them. One example is the GeneInfo system at NLM [11]. This approach has the benefit of eliminating multiple access systems and therefore reducing the amount of effort required for the “meta” front end. It has several serious disadvantages: not all databases are available in flat file form; in order for the system to be current, the local support staff have to be constantly updating their local versions of the various databases; and, perhaps most importantly, flat file versions of databases support much less sophisticated queries than versions embedded in DBMSs.

At the other end of the spectrum are approaches based on accessing databases in their native environments, most often on geographically distributed systems. The C-SIN system developed at BBN Laboratories [12] is an example. No responsibility is assumed for importing and updating databases and, in principle, the full richness of the various DBMSs can be exploited. The difficulty in this approach is making transparent (with a sophisticated “meta” front end) the various logon and query protocols for the various databases so that the user can submit all queries in a single language. Transparency can only be achieved with highly detailed and well-represented “descriptions” of the native query languages and data structures, and the complexity of the “meta” front end is rather open-ended, depending on the complexity of the anticipated queries. This approach also assumes that all databases of interest can be accessed on-line, which has not been the case.

[Figure 5 diagram: three panels labeled (a) past, (b) present, and (c) future; node labels include LiMB and database.]

Figure 5. Access to heterogeneous, distributed databases. Arrows indicate information flow, with stippled arrows indicating exploratory exchanges and solid arrows indicating substantive exchange. (a) Absence of any up-to-date and readily available directory meant that a scientist had to access a database just to determine whether or not the database was relevant to his query. Each database accessed requires a new set of access protocols and query language. (b) Currently, directories such as LiMB allow abridged pre-screening of databases of potential interest using a single access and query protocol. Though this greatly improves the efficiency of finding databases of interest, it does nothing to ease the burden of learning access protocols and query languages for each of those databases. Neither does it eliminate post-processing to integrate retrieved data. (c) Ideally, a single interface would provide not only the pre-screening of databases through LiMB, but also the consequent accessing, querying, and virtual integration of the databases found to be relevant to the scientist’s initial query.

We currently favor the latter approach as a starting point for planning. Two trends are reducing the drawbacks associated with this approach until now: increasing numbers of databases are becoming available on-line; and the DBMSs being used to maintain them are less often developed in-house (and therefore, both unique and idiosyncratic) and more often vendor-supplied (and therefore, increasingly standardized).

LiMB is now, for the most part, a simple directory; in the future, it will expand its role as a centralized source of information. It is our goal that LiMB will eventually serve as an integrated, dynamic point of reference for a front end accessing the data contained in the databases now simply listed in LiMB.

REFERENCES

1. C. Burks, J.R. Lawton and G.I. Bell, The LiMB database, Science 241, 888 (1988).
2. C.J. Date, An Introduction to Database Systems, (Fourth Edition), Addison-Wesley Publishing Company, Reading, MA, (1986).
3. B. Kernighan and D. Ritchie, The C Programming Language, (Second Edition), Prentice Hall, Englewood Cliffs, NJ, (1988).
4. D. Davidson and J. Chappelear, The Genbank-server at the University of Houston, Nucl. Acids Res. 18, 1571-1572 (1990).
5. J.R. Lawton, F. Martinez and C. Burks, Overview of the LiMB database, Nucl. Acids Res. 17 (15), 5885-5899 (1989).
6. C. Overton, K. Koile and J. Pastor, GeneSys: A knowledge management system for molecular biology, In Computers and DNA, (Edited by G.I. Bell and T. Marr), pp. 213-239, Addison-Wesley, Reading, MA, (1990).
7. H.J. Morowitz and T.F. Smith, Report of the Matrix of Biological Knowledge Workshop, Santa Fe Institute, Santa Fe, NM, (1987).
8. B.M. Alberts, D. Botstein, S. Brenner, C.R. Cantor, R.F. Doolittle, L. Hood, V.A. McKusick, D. Nathans, M.V. Olson, S. Orkin, L.E. Rosenberg, F.H. Ruddle, S. Tilghman, J. Tooze and J.D. Watson, Mapping and Sequencing the Human Genome, National Academy Press, Washington, D.C., (1988).
9. L. Philipson, The DNA data libraries, Nature 332, 676 (1988).
10. Office of Technology Assessment, Mapping Our Genes. The Genome Projects: How Big, How Fast?, U.S. Government Printing Office, (1988).
11. D. Harman, D. Benson, L. Fitzpatrick, R. Huntzinger and C. Goldstein, IRX: Information retrieval system for experimentation and user applications, In Proceedings of RIAO 88 Conference: User-oriented Content-based Text and Image Handling, pp. 840-848, American Society for Information Sciences, Cambridge, MA, (1988).
12. C. Hollister and J. Page-Castell, The Chemical Substances Information Network: User services office, evaluation and feedback, J. Chem. Info. Comp. Sci. 24, 259 (1985).