Mathl. Comput. Modelling Vol. 16, No. 6/7, pp. 93-101, 1992
Printed in Great Britain
0895-7177/92 $5.00 + 0.00
Pergamon Press Ltd

ACCESS TO MOLECULAR BIOLOGY DATABASES

G. M. KEEN, G. W. REDGRAVE, J. R. LAWTON, M. J. CINKOSKY
Theoretical Biology and Biophysics Group, T-10, MS K710
Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

S. K. MISHRA
Genetics Department, Box 8232
Washington University School of Medicine
4566 Scott Ave., St. Louis, MO 63110, U.S.A.

J. W. FICKETT AND C. BURKS
Center for Human Genome Studies
Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

Abstract: The LiMB database was created to catalogue and begin coordinating access to the rapidly proliferating databases relevant to molecular biology. LiMB contains information about these databases and is currently implemented in a relational database management system that allows for complex, multiconditional queries. We present an overview of LiMB, a description of its current implementation, and a discussion of the role of LiMB in developing strategies for automated access to distributed, heterogeneous databases.

1. INTRODUCTION

The process of generating knowledge in molecular biology can be represented as following the cycle shown in Figure 1. Each of the stages depicted imposes characteristics required on the medium for expressing the associated information. For example, raw experimental data (e.g., an autoradiograph of an electrophoretic gel, or a plot of activity in a chromatographic separation) are very often graphics-intensive, requiring for adequate representation a photograph on paper or graphics capabilities on a computer.

The variety of data types and structures required for adequately representing molecular biology is also enriched by the number of domains of biology (see Figure 2) spanned in presenting the interpretations of data and relating them to the functional and physical contexts of the particular system being studied. For example, a DNA sequence (an interpretation of experimental data) is neither as interesting nor nearly as useful unless tied into a node in a taxonomic tree, a node in a representation of cell substructure, and associated with one or more nodes in a representation of cellular/macromolecular function.

The sheer volume of molecular biological data in the past two decades has already precipitated a move to computer media for collection, maintenance, and analysis of data. However, recent marked improvements, both in database management systems (DBMS) and in the flexibility and richness of user interfaces (as well as the power of the systems supporting them), are leading to another, potentially larger wave of migration of molecular biological data onto computer media.

Molecular biology is thus becoming increasingly dependent on computers to collect, maintain, and analyze the tremendous amounts of raw and interpreted data it generates. Consequently, a large number of databases designed to manage these data have appeared. However, a lack of widespread standards for organizing and accessing these databases, and sparse implementation of those standards that do exist, has led to virtually every molecular biology database that currently exists being relatively unique in terms of, for example, access protocols and query platforms. Thus, the molecular biologist wanting to access molecular biological data has to: (i) find the relevant databases among a highly decentralized (geographically and institutionally) set; (ii) learn a number of access protocols and query languages for the relevant, identified databases; and (iii) post-process, most often with a high degree of manual intervention, the query results on individual databases to answer the exact query that initiated the search.

We are grateful to R. Sutherland for technical assistance in implementing LiMB in the RDBMS. C.B., G.K., G.R., and J.L. were supported in part by institutional funding from Los Alamos National Laboratory, and J.F. and M.C. were supported in part by funding from DOE/OHER (ERW-F116). This work was done under the auspices of the U.S. Dept. of Energy.

Figure 1. Information flow in molecular biology. [Diagram labels: source materials; experimental protocols; raw data; biological entities and their interrelations; metadatabases.]

Figure 2. Biological domains. Many of the databases listed in LiMB are primarily focused on macromolecules and their chemical structures. However, related annotation and links that provide information about the physical and functional context of these primary data extend into other “domains” of biological information.


The LiMB (Listing of Molecular Biology Databases) database was created to help facilitate and simplify this process [1]. It is a database of databases, or a database directory. LiMB provides a systematic and coordinated approach to identifying, linking, and accessing heterogeneous, distributed databases relevant to molecular biology; in short, it provides an overview of these databases and the data sets they cover. As such, it helps molecular biologists to access and contribute data relevant to their research.

LiMB also contains information about the computer and software systems on which the different databases are maintained, which is useful for database managers and designers in building software links for related data sets. This also helps users find software suitable to their applications as an alternative to developing it themselves. LiMB provides these data to the scientific community in a flat file distribution, free of charge, on hardcopy, on PC (MS DOS) or Macintosh formatted floppy, and via electronic network (see below).

2. METHODS

Systems

The LiMB database is currently managed with the Sybase (Emeryville, CA) relational database management system (RDBMS). (For a discussion of the relational model, see [2].) It is maintained on Sun Microsystems (Mountain View, CA) workstations running under UNIX (Sun OS 4.1.1). Front end tools for data transmission (e.g., to generate the distributed flat file version of LiMB) into and out of the RDBMS were developed with UNIX shell tools and the C programming language [3].

Release 2.0 of LiMB

The most recent release (2.0) of LiMB was initially distributed in August 1990 and contains information on 94 databases. Release 1.0 (February 1988) contained 53 entries, and Release 1.2 (January 1990) contained 76 entries. Release 3.0 is planned for summer 1992.

Access to LiMB

Since its inception in October 1986, LiMB has been directly distributed to more than 600 people and institutions in over 18 countries. Many researchers have also accessed it indirectly through several network servers (e.g., [4]) that have mounted copies of LiMB for the communities they serve. LiMB is also available for anonymous FTP from GenBank at the address “genbank.bio.net” (host # 134.172.1.160). It is found in the directory pub/db/LiMB.

Inquiries regarding access to LiMB should be directed to: LiMB Database; Theoretical Biology and Biophysics Group; T-10, MS K710; Los Alamos National Laboratory; Los Alamos, NM 87545; U.S.A. (e-mail, [email protected]; fax, 505-665-3493). Information about retrieval of LiMB through an automatic network server can be obtained by sending an electronic mail message, with a message text that contains only the word “limb-info”, to “bioserve@genome.lanl.gov”.

3. DATABASE OVERVIEW

Contents of LiMB

An entry in the flat file distributed version of LiMB currently consists of 53 fields. The information contained in these fields covers: names and addresses of database staff, details about data access and submissions, keywords describing primary and secondary biological data items covered, database size, hardware and software for database management, source(s) of data, funding, literature citations, etc. Table 1 describes these fields in greater detail. Whenever possible, the information in LiMB is based on questionnaires completed by database managers. Otherwise, secondary sources such as journal articles are used. LiMB also includes an index of all the biological keywords found in the descriptions of individual databases, pointing to the names of the database entries in which they are found. An abbreviated listing of the databases included in LiMB has been given previously [5]; the most current view is available by acquiring a copy of LiMB (see information provided above).


Table 1. Field definitions for flat file format.

Field       Definition
----------  ------------------------------------------------------------
entry       Name of entry
number      Database number
history     History of entry
status      Administrative status of entry information
res.nam     Name of respondent
res.add     Address of respondent
res.tel     Telephone number of respondent (as dialed from U.S.A.)
res.net     Network address of respondent
gen.nam     Name for general inquiry
gen.add     Address for general inquiry
gen.tel     Telephone number for general inquiry (as dialed from U.S.A.)
gen.net     Network address for general inquiry
con.nam     Name for contributing data
con.add     Address for contributing data
con.tel     Telephone number for contributing data (as dialed from U.S.A.)
con.net     Network address for contributing data
acc.nam     Name for acquiring data
acc.add     Address for acquiring data
acc.tel     Telephone number for acquiring data (as dialed from U.S.A.)
acc.net     Network address for acquiring data
name.now    Current, formal name of database
name.alt    Alternative and/or informal names of database
name.obs    Obsolete/incorrect names of database
source      Source of data in database
funding     Funding base for data bank
citation    Recent publications describing database
charter     Formal or informal charter of database and/or data bank effort
cross.nam   Other databases to which this database is cross-referenced
data.pri    Primary data items in the database
data.sec    Secondary data items in the database
hardware    Hardware the data are maintained on
op.sys      Operating system hardware is run under
dbms        Software system used for maintaining data
language    Programming language used for software system
format      Format used for distributed flat text files
software    Software that is distributed with the database
access      Limitations on access to data
update      Frequency of database updates
con.onl     Can contributions be made to database on-line?
con.mag     Can contributions be made to database on magnetic tape?
con.flp     Can contributions be made to database on floppy disk?
con.elm     Can contributions be made to database by electronic mail?
con.pap     Can contributions be made to database on hardcopy?
acc.onl     Is database distributed on-line?
acc.mag     Is database distributed on magnetic tape?
acc.flp     Is database distributed on floppy disk?
acc.elm     Is database distributed on electronic mail?
acc.pap     Is database distributed on hardcopy?
byt.all     Number of bytes contained in the database
byt.pri     Number of bytes contained in primary data items in the database
ent.pri     Number of entries (individual primary data items with associated
            secondary data items) in the database
comment     Any additional useful information about the database
///         End of entry
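To make the flat file structure concrete, the following Python sketch parses entries tagged with the field names of Table 1 and builds a keyword index of the kind described above. It is only an illustration under stated assumptions: the text does not specify the physical line layout, so the sketch assumes one "tag value" pair per line, indented continuation lines, and semicolon-separated keywords. The sample entries are invented, and the function names parse_entries and keyword_index are hypothetical.

```python
from collections import defaultdict

# Invented sample data; the tag-then-value line layout is an assumption.
SAMPLE = """\
entry     EXAMPLE1
name.now  Example Sequence Collection
data.pri  nucleotide sequences; restriction maps
acc.flp   yes
///
entry     EXAMPLE2
name.now  Example Structure Archive
data.pri  protein structures
data.sec  nucleotide sequences
acc.flp   no
///
"""


def parse_entries(text):
    """Return a list of entries, each a dict mapping a field tag to its values."""
    entries, current, last_tag = [], defaultdict(list), None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.strip() == "///":  # end-of-entry marker listed in Table 1
            entries.append(dict(current))
            current, last_tag = defaultdict(list), None
        elif line[0].isspace() and last_tag is not None:
            # Assumed convention: an indented line continues the previous value.
            current[last_tag][-1] += " " + line.strip()
        else:
            tag, _, value = line.partition(" ")
            current[tag].append(value.strip())
            last_tag = tag
    if current:  # tolerate a final entry with no closing marker
        entries.append(dict(current))
    return entries


def keyword_index(entries):
    """Map each biological keyword to the names of the databases that list it."""
    index = defaultdict(set)
    for entry in entries:
        name = entry.get("name.now", ["(unnamed)"])[0]
        for tag in ("data.pri", "data.sec"):
            for value in entry.get(tag, []):
                for keyword in value.split(";"):  # assumed separator
                    if keyword.strip():
                        index[keyword.strip().lower()].add(name)
    return index


if __name__ == "__main__":
    index = keyword_index(parse_entries(SAMPLE))
    for keyword in sorted(index):
        print(keyword, "->", sorted(index[keyword]))
```

Any multi-condition question asked of such a file has to be coded by hand against these dictionaries, which is the overhead the relational version discussed next is meant to remove.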


Maintenance of LiMB in an RDBMS

LiMB was initially maintained in flat text file form identical to the distributed version; this was convenient and expedient for the early development of the database when relevant fields and data values were still being formulated and changing rapidly. However, we have recently ported the LiMB database over to an RDBMS (the schema for the relational version of LiMB is presented in Figure 3), which offers a number of advantages over our previous strategy.

The relational model has been implemented by a number of commercial vendors. This provides not only the advantage of being able to acquire extensive, off-the-shelf software for individual applications, but because the relational model is well-defined, applications in one commercial system are relatively straightforwardly ported to other commercial systems.

Figure 3. LiMB relational schema. Relational tables are composed of columns (categories of data, or “fields”) and rows (specific instances of data in each field, or “values”). The LiMB tables and the fields they contain are shown. For example, the table dbase (short for “database”) contains the field db_name (short for “current formal name of the database”). For the most part, the table names and field names are self-explanatory; a detailed schema and field definition is available upon request. [Schema diagram not reproduced. Legible labels include the tables PUBLICATION and DBASE; fields such as db_id, db_abriv, db_name, db_updates, db_access, db_bytall, db_bytpri, db_charter, and db_dialup; and relationships such as “is cross-referenced with”, “runs on what software”, “contains what data”, “is reference for”, “is contact for”, and “is located at”.]

In addition, the implementation of the relational model has been accompanied by the development of a Structured Query Language (SQL), which allows a great variety of queries to be constructed from a relatively similar syntax and, as above, allows one to port queries from an application installed in one RDBMS to the same application installed in another RDBMS with relatively little, if any, modification.

Redundancy of the individual data values is greatly reduced, which decreases both the chance of errors arising in association with maintaining multiple copies of the same data value and the requirement for disk space in storing these copies. For example, if an individual managed three different databases and coauthored six different articles cited in LiMB, his or her name would appear in the flat file version of LiMB a total of nine times (once for each managed database and once for each article). But, in the relational model, the name would only have to be entered once.
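The redundancy argument can be checked with a small sketch. The Python example below uses the standard sqlite3 module rather than Sybase, and the tables person, manages, and authored (with their column names) are hypothetical stand-ins, not the LiMB schema. It stores one name once, links it to three databases and six articles, and shows that a single update is seen through all nine references.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person   (ps_id INTEGER PRIMARY KEY, ps_name TEXT NOT NULL);
CREATE TABLE manages  (ps_id INTEGER REFERENCES person(ps_id), db_name TEXT);
CREATE TABLE authored (ps_id INTEGER REFERENCES person(ps_id), pub_title TEXT);
""")

# The name is entered exactly once ...
conn.execute("INSERT INTO person VALUES (1, 'A. Researcher')")
# ... and referenced by key from three databases and six articles.
conn.executemany("INSERT INTO manages VALUES (1, ?)",
                 [(f"Database {i}",) for i in range(1, 4)])
conn.executemany("INSERT INTO authored VALUES (1, ?)",
                 [(f"Article {i}",) for i in range(1, 7)])

stored_copies = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
references = conn.execute("""
    SELECT (SELECT COUNT(*) FROM manages  WHERE ps_id = 1)
         + (SELECT COUNT(*) FROM authored WHERE ps_id = 1)
""").fetchone()[0]

# Correcting the name touches one row; every join sees the new value.
conn.execute("UPDATE person SET ps_name = 'A. B. Researcher' WHERE ps_id = 1")
names_after_update = {row[0] for row in conn.execute("""
    SELECT p.ps_name FROM manages m  JOIN person p ON p.ps_id = m.ps_id
    UNION ALL
    SELECT p.ps_name FROM authored a JOIN person p ON p.ps_id = a.ps_id
""")}

print(stored_copies, references, names_after_update)
```

In the flat file form, the same correction would have to be made in nine places, each a chance for the copies to drift apart.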


The greatest advantage to the relational scheme is the power it gives the user to make complex ad hoc and previously unidentified queries without the overhead of development of additional software. For example, imagine a query of the flat file version of LiMB for the names of all the databases located in the United States or West Germany that both (i) contain information on nucleotide sequences and (ii) distribute their data on floppy disk.

Answering this query on a flat file representation of LiMB can certainly be done, either with a series of simple pattern searches accompanied by manual filtering of the output, or by writing software specific to the flat file format for the relevant fields and specific to the boolean logic implied. However, assuming adherence to the relational model and atomicity of the relevant fields, the query can be simply formulated (and answered) in SQL with no manual massaging of output or additional software development. This is illustrated in Figure 4.

    gwr [1] isql -Ulimb
    Password:
    1> SELECT distinct db_name, ad_country
    2> FROM dbase_new, address, dbdata, data, media, dbmed
    3> WHERE dbase_new.db_ad_id = address.ad_id
    4> AND dbase_new.db_id = dbdata.dd_db_id
    5> AND dbdata.dd_dt_id = data.dt_id
    6> AND dbase_new.db_id = dbmed.dm_db_id
    7> AND dbmed.dm_md_id = media.md_id
    8> AND data.dt_name like "%nucleotide sequences%"
    9> AND (address.ad_country = "U.S.A." or address.ad_country = "WEST GERMANY")
    10> AND media.md_type = "floppy disk"
    11> AND dbmed.dm_isdis = 1
    12> go

    db_name                                     ad_country
    Signal Scan                                 U.S.A.
    tRNA Compilation                            WEST GERMANY
    Berlin RNA Data Bank                        WEST GERMANY
    Transcription Factor Database               U.S.A.
    Cloning Vector Sequence Database            U.S.A.
    Compilation of Small RNA Sequences          U.S.A.
    The GenBank Genetic Sequence Data Bank      U.S.A.
    HIV Sequence & Sequence Analysis Database   U.S.A.

    (8 rows affected)

Figure 4. Sample DBMS query. Line 1 indicates what action is being performed (“SELECT”), and the object entity of that action (“db_name”, “ad_country”). Line 2 lists all the tables being used in the query. Lines 3-7 establish joins between the tables, and lines 8-11 specify the required conditions. Line 12 indicates that the query is finished and that the SQL server should execute the query.
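For readers without access to a Sybase server, the query of Figure 4 can be tried on toy data. The sketch below is an approximation, not the LiMB implementation: it uses Python's sqlite3 module, the table and column definitions are minimal guesses based only on the names that appear in the query, and every row of data is invented. String literals use single quotes and an ORDER BY clause is added so that the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Minimal guessed table definitions; only columns used by the query are included.
CREATE TABLE address   (ad_id INTEGER PRIMARY KEY, ad_country TEXT);
CREATE TABLE dbase_new (db_id INTEGER PRIMARY KEY, db_name TEXT, db_ad_id INTEGER);
CREATE TABLE data      (dt_id INTEGER PRIMARY KEY, dt_name TEXT);
CREATE TABLE dbdata    (dd_db_id INTEGER, dd_dt_id INTEGER);
CREATE TABLE media     (md_id INTEGER PRIMARY KEY, md_type TEXT);
CREATE TABLE dbmed     (dm_db_id INTEGER, dm_md_id INTEGER, dm_isdis INTEGER);

-- Invented toy rows.
INSERT INTO address   VALUES (1, 'U.S.A.'), (2, 'WEST GERMANY'), (3, 'FRANCE');
INSERT INTO dbase_new VALUES (1, 'Toy Sequence Bank', 1), (2, 'Toy RNA Bank', 2),
                             (3, 'Toy Structure Bank', 3), (4, 'Toy Tape-only Bank', 1);
INSERT INTO data      VALUES (1, 'nucleotide sequences'), (2, 'protein structures');
INSERT INTO dbdata    VALUES (1, 1), (2, 1), (3, 1), (3, 2), (4, 1);
INSERT INTO media     VALUES (1, 'floppy disk'), (2, 'magnetic tape');
INSERT INTO dbmed     VALUES (1, 1, 1), (2, 1, 1), (3, 1, 1), (4, 1, 0), (4, 2, 1);
""")

QUERY = """
SELECT DISTINCT db_name, ad_country
FROM dbase_new, address, dbdata, data, media, dbmed
WHERE dbase_new.db_ad_id = address.ad_id      -- joins (lines 3-7 of Figure 4)
  AND dbase_new.db_id = dbdata.dd_db_id
  AND dbdata.dd_dt_id = data.dt_id
  AND dbase_new.db_id = dbmed.dm_db_id
  AND dbmed.dm_md_id = media.md_id
  AND data.dt_name LIKE '%nucleotide sequences%'   -- conditions (lines 8-11)
  AND (address.ad_country = 'U.S.A.' OR address.ad_country = 'WEST GERMANY')
  AND media.md_type = 'floppy disk'
  AND dbmed.dm_isdis = 1
ORDER BY db_name
"""

rows = conn.execute(QUERY).fetchall()
for db_name, ad_country in rows:
    print(db_name, ad_country)
```

With these toy rows, the French database is excluded by the country condition and the tape-only database by the distribution conditions, leaving the two floppy-distributing nucleotide sequence databases.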

It is worth noting that several of the databases described in LiMB have also begun or completed converting to an RDBMS; it is likely that this trend will continue, and as such, could provide an important advantage in developing systems supporting queries that span multiple databases (see the Discussion below).

The relational schema of LiMB is currently being revised to allow for a greater range of information. The new design for the database is now completed (Figure 3); once the transition to the new schema is complete, LiMB will, upon request, be distributed in its relational form via Sybase-compatible table dumps on magnetic tape.

4. DISCUSSION

Span of Biological Data Covered in LiMB

The databases listed in LiMB already sample many of the domains implied in Figure 2, and sample several not schematized in Figure 2. In the initial phases of collecting data for LiMB, very little structure was imposed on the biological keys used to attach the individual database entries to nodes in the domains in Figure 2. However, we plan to begin imposing more structure on the biological “keywords” in LiMB in coordination with developing more systematic schematizations of biological information (see [6] for an example).

Role of LiMB in Providing Automated Access to Databases

There is a waxing interest [7] in characterizing and building a matrix of biological knowledge and at the same time a growing recognition [8,9,10] (especially in the context of the proposed effort to completely map and sequence the human genome) of the need for a systematic and coordinated approach to designing, developing, and maintaining molecular biological databases. Clearly, a “database of databases” supports developing and maintaining such an overview.

There is a rapidly increasing number of computer databases serving the molecular biology community. Prior to the existence of LiMB (and other directories similar to it), a scientist had no way of determining which databases were relevant to a particular query without accessing the individual databases (if they could be identified in the first place). This was a very hit-and-miss, time-consuming process (Figure 5a). Consequently, there has been an outstanding need for a central repository of information about these databases (one sign of this is the large number of requests for data and descriptive articles about LiMB that we received in the year prior to Release 1.0).

LiMB is providing information useful to potential users of and contributors to the individual databases, but perhaps more importantly, will also serve the needs of the growing community of scientists who are designing and maintaining these databases and the connections between them.

With LiMB in its current state of development, an individual can pre-screen individual databases on the basis of information content, geographical convenience, medium of data distribution, etc. (Figure 5b). However, pursuit of any cross-database query will eventually involve accessing two or more databases and, in most cases, manually post-processing the results of the individual queries to effect the logical construct of a cross-database query.

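
The pre-screening step described above can be pictured with a short Python sketch. Everything in it is hypothetical: the DirectoryRecord type, its attributes, the prescreen function, and the three toy records are invented for illustration and do not correspond to actual LiMB entries or software. The point is that a directory narrows the list of candidate databases but leaves the user with one access method per hit still to learn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryRecord:
    """One hypothetical directory entry describing a database."""
    name: str
    country: str
    keywords: frozenset
    media: frozenset
    query_language: str


# Invented toy records.
RECORDS = [
    DirectoryRecord("Toy Sequence Bank", "U.S.A.",
                    frozenset({"nucleotide sequences"}),
                    frozenset({"floppy disk", "on-line"}), "SQL"),
    DirectoryRecord("Toy RNA Bank", "WEST GERMANY",
                    frozenset({"nucleotide sequences", "RNA"}),
                    frozenset({"floppy disk"}), "custom menu system"),
    DirectoryRecord("Toy Structure Bank", "U.S.A.",
                    frozenset({"protein structures"}),
                    frozenset({"magnetic tape"}), "flat-file text search"),
]


def prescreen(records, keyword, countries, medium):
    """Keep records matching information content, location, and medium."""
    return [r for r in records
            if keyword in r.keywords
            and r.country in countries
            and medium in r.media]


hits = prescreen(RECORDS, "nucleotide sequences",
                 {"U.S.A.", "WEST GERMANY"}, "floppy disk")
# Pre-screening does not remove the need to learn each hit's query method.
languages_to_learn = sorted({r.query_language for r in hits})

print([r.name for r in hits])
print(languages_to_learn)
```

In this toy case two databases survive the screen, and the user still faces two different query methods and must merge the two result sets by hand.
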
Ideally, the scientist should only have to interact with a single interface; that interface would parse the scientist’s query, query LiMB for determination of which databases are relevant to the scientist’s query and how to access the individual databases, and then automatically access the individual databases and pass the relevant information back to the scientist. LiMB will eventually be used as a tool to help facilitate a front end for friendly, automatic access to a number of distinct databases, as implied in Figure 5c.

Figure 5. Access to heterogeneous, distributed databases. [Panels: a. past; b. present; c. future.] Arrows indicate information flow, with stippled arrows indicating exploratory exchanges and solid arrows indicating substantive exchange. (a) Absence of any up-to-date and readily available directory meant that a scientist had to access a database just to determine whether or not the database was relevant to his query. Each database accessed requires a new set of access protocols and query language. (b) Currently, directories such as LiMB allow abridged, pre-screening of databases of potential interest using a single access and query protocol. Though this greatly improves the efficiency of finding databases of interest, it does nothing to ease the burden of learning access protocols and query languages for each of those databases. Neither does it eliminate post-processing to integrate retrieved data. (c) Ideally, a single interface would provide not only the pre-screening of databases through LiMB, but also the consequent accessing, querying, and virtual integration of the databases found to be relevant to the scientist’s initial query.

We are currently evaluating the several strategies being developed or implemented by other groups. At one end of the spectrum are approaches based on collecting flat file versions of several databases onto one computer and allowing text-searching across all of them. One example is the GeneInfo system at NLM [11]. This approach has the benefit of eliminating multiple access systems and therefore reducing the amount of effort required for the “meta” front end. It has several extreme disadvantages: not all databases are available in flat file form; in order for the system to be current, the local support staff have to be constantly updating their local versions of the various databases; and, perhaps most importantly, flat file versions of databases support much less sophisticated queries than versions embedded in DBMSs.

At the other end of the spectrum are approaches based on accessing databases in their native environments, most often on geographically distributed systems. The C-SIN system developed at BBN Laboratories [12] is an example. No responsibility is assumed for importing and updating databases and, in principle, the full richness of the various DBMSs can be exploited. The difficulty in this approach is making transparent (with a sophisticated “meta” front end) the various logon and query protocols for the various databases so that the user can submit all queries in a single language. Transparency can only be achieved with highly detailed and well-represented “descriptions” of the native query languages and data structures, and the complexity of the “meta” front end is rather open-ended, depending on the complexity of the anticipated queries. This approach also assumes that all databases of interest can be accessed on-line, which has not been the case.

We currently favor the latter approach as a starting point for planning. Two trends favor the reduction of drawbacks associated with this approach until now: increasing numbers of databases are becoming available on-line; and the DBMS being used to maintain them are less often developed in-house (and therefore, both unique and idiosyncratic) and more often vendor-supplied (and therefore, increasingly standardized).

LiMB is now, for the most part, a simple directory; in the future, it will expand its role as a centralized source of information. It is our goal that LiMB will eventually serve as an integrated, dynamic point of reference for a front end accessing the data contained in the databases now simply listed in LiMB.

REFERENCES

1. C. Burks, J.R. Lawton and G.I. Bell, The LiMB database, Science 241, 888 (1988).
2. C.J. Date, An Introduction to Database Systems, (Fourth Edition), Addison-Wesley Publishing Company, Reading, MA, (1986).
3. B. Kernighan and D. Ritchie, The C Programming Language, (Second Edition), Prentice Hall, Englewood Cliffs, NJ, (1988).
4. D. Davidson and J. Chappelear, The Genbank-server at the University of Houston, Nucl. Acids Res. 18, 1571-1572 (1990).
5. J.R. Lawton, F. Martinez and C. Burks, Overview of the LiMB database, Nucl. Acids Res. 17 (15), 5885-5899 (1989).

6. C. Overton, K. Koile and J. Pastor, GeneSys: A knowledge management system for molecular biology, In Computers and DNA, (Edited by G.I. Bell and T. Marr), pp. 213-239, Addison-Wesley, Reading, MA, (1990).
7. H.J. Morowitz and T.F. Smith, Report of the Matrix of Biological Knowledge Workshop, Santa Fe Institute, Santa Fe, NM, (1987).
8. B.M. Alberts, D. Botstein, S. Brenner, C.R. Cantor, R.F. Doolittle, L. Hood, V.A. McKusick, D. Nathans, M.V. Olson, S. Orkin, L.E. Rosenberg, F.H. Ruddle, S. Tilghman, J. Tooze and J.D. Watson, Mapping and Sequencing the Human Genome, National Academy Press, Washington, D.C., (1988).
9. L. Philipson, The DNA data libraries, Nature 332, 676 (1988).
10. Office of Technology Assessment, Mapping Our Genes. The Genome Projects: How Big, How Fast?, U.S. Government Printing Office, (1988).
11. D. Harmon, D. Benson, L. Fitzpatrick, R. Huntzinger and C. Goldstein, IRX: Information retrieval system for experimentation and user applications, In Proceedings of RIAO 88 Conference: User-oriented Content-based Text and Image Handling, pp. 840-848, American Society for Information Sciences, Cambridge, MA, (1988).
12. C. Hollister and J. Page-Castell, The Chemical Substances Information Network: User services office, evaluation and feedback, J. Chem. Info. Comp. Sci. 24, 259 (1985).