Neuronal Database Integration: The Senselab EAV Data Model

Luis Marenco, MD¹, Prakash Nadkarni, MD¹, Emmanouil Skoufos, PhD¹, Gordon Shepherd, MD, D.Phil² and Perry Miller, MD, PhD¹
¹Center for Medical Informatics and ²Department of Neurobiology, Yale University School of Medicine, New Haven, CT

different physical databases are both programmatically cumbersome and extremely inefficient, often being limited by network bandwidth when the two databases involved reside on separate machines.

While planning the merging of the databases, we realized that the combined schema had become extremely complex, and that its maintenance would be quite difficult as many more kinds of neuronal data were incorporated into the system. This complexity would also be reflected in increased coding to maintain the user interface. It was clearly desirable to build a framework that allowed new types of data to be added without schema revisions. We therefore considered implementing an Entity-Attribute-Value (EAV) schema for the combined database.

The EAV physical schema is widely used in electronic patient record systems (EPRSs) [3, 4]. Here, rather than having multiple tables with numerous columns that hard-code particular concepts (e.g., the names of lab parameters), one conceptually has a single table with three columns. These are an Entity (the object being described), an Attribute (an aspect of the object being described), and the Value for that attribute. In EPRSs, the entity is typically a patient event, identified by a Patient ID and one or more timestamps. The EAV table has one row for each fact (such as a clinical finding or a lab value) that is recorded for a particular patient event. EAV has the advantage of being able to store sparse data efficiently. (While thousands of possible types of findings could potentially apply to a given patient across all clinical specialties, one stores only the few dozen that are actually applicable to a given patient.)
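The three-column layout described above can be sketched in a few lines of Python using the standard `sqlite3` module. This is only an illustrative sketch, not the Senselab or EPRS implementation: the table name `eav`, the helper names, and the sample patient facts are all made up for the example, and every value is kept as a string, as in a basic EAV table.

```python
import sqlite3


def build_eav_demo():
    """Create an in-memory EAV table and load a few hypothetical facts."""
    conn = sqlite3.connect(":memory:")
    # One conceptual table with three columns: entity, attribute, value.
    conn.execute(
        "CREATE TABLE eav ("
        " entity TEXT NOT NULL,"
        " attribute TEXT NOT NULL,"
        " value TEXT,"
        " PRIMARY KEY (entity, attribute))"
    )
    # Hypothetical patient events: an ID plus a timestamp identifies each one.
    facts = [
        ("patient42@1999-01-05", "serum_potassium", "4.1"),
        ("patient42@1999-01-05", "heart_rate", "72"),
        ("patient77@1999-02-11", "serum_potassium", "3.8"),
    ]
    conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", facts)
    # A new kind of fact is just another row; no ALTER TABLE is needed.
    conn.execute(
        "INSERT INTO eav VALUES (?, ?, ?)",
        ("patient77@1999-02-11", "troponin_i", "0.02"),
    )
    return conn


def facts_for(conn, entity):
    """Return the recorded facts for one entity as an attribute -> value dict."""
    rows = conn.execute(
        "SELECT attribute, value FROM eav WHERE entity = ? ORDER BY attribute",
        (entity,),
    )
    return dict(rows.fetchall())
```

Only the facts actually recorded for an entity occupy rows, which is how the layout accommodates sparse data; the cost is that the logical record has to be reassembled from several rows at query time.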
In addition, attributes are treated as metadata (i.e., data describing the rest of the data within a database). EAV therefore allows new attributes to be incorporated into the record as medical knowledge progresses, without requiring a schema redesign. Neuronal data, in addition to being highly complex, also has the property of sparsity, and appears to provide a good fit for the EAV model. EAV storage results in a varying performance penalty compared with
We discuss an approach towards integrating heterogeneous nervous system data using an augmented Entity-Attribute-Value (EAV) schema design. This approach, widely used in implementing electronic patient record systems (EPRSs), allows the physical schema of the database to be relatively immune to changes in domain knowledge. This is because new kinds of facts are added as data (or as metadata) rather than hard-coded as the names of newly created tables or columns. Because the domain knowledge is stored as metadata, a framework developed in one scientific domain can be ported to another with only modest revision. We describe our progress in creating a code framework that handles browsing and hyperlinking of the different kinds of data.

INTRODUCTION

The Senselab project [1], an initiative supported by the Human Brain Project [2], was started in 1993 for integrating various forms of neuronal data. Our group originally created separate physical databases for each type of data, providing a unified front end through a Web-based interface. This partly reflected the way development was organized: different individuals were responsible for each database. In addition, we were in a research/learning mode, trying out various database engines to determine whether special features of a particular engine provided any advantages in schema design and development. Thus, for example, we experimented with the Object-Relational database engine Illustra (now part of Informix Universal Server) for modeling neuro-anatomical data, and contrasted the development process with traditional DBMSs like Sybase SQL Server.

As the databases matured, it became clear that it would eventually become necessary to merge them into a single physical database. While Web-based integration using hyperlinks is satisfactory when one is searching a database one object at a time, it is not really intended for operating on sets of objects. Specifically, cross-database "joins" between tables residing in
1091-8280/99/$5.00 © 1999 AMIA, Inc.
conventional (i.e., non-EAV) systems. Neuronal research data, while structurally more complex, are significantly less voluminous than EPRS data. Hence performance issues are less critical.

Existing commercial frameworks for creating Web-based database front-ends provide little or no support for EAV data structures, which are rarely seen in the typical business applications that comprise the major part of the database market. Therefore, we created our own framework to automatically generate forms that make full use of the Web metaphor (for example, generation of appropriate hyperlinks). These forms support data display as well as data editing and creation (for authorized users).

TYPES OF DATA CAPTURED IN THE SYSTEM

We summarize the types of data in the four previously created Web-accessible neuroscience databases, which needed physical integration.

NeuronDB [5] stores data on various neuronal cells: receptors, neuronal currents, neurotransmitters, and inter-neuron connectivity. Within a single cell, data is organized by specific anatomical compartment (canonical compartments), in an effort to unify multiple neuronal types.

ModelDB [6] stores computational models of neuronal function. Currently stored models are compliant with the Neuron [7] and Genesis [8] simulation programs, both of which are widely used by neuroscientists.

ORDB (Olfactory Receptor Database) [9] holds amino acid and nucleotide sequences, researcher and laboratory information, and hyperlinks to related data on the Web. It will store other diverse information related to these molecules as it becomes available.

OdorDB (Odors Database) stores chemical, biological and experimental data on odor molecules. This data records neurotransmitters, second messenger molecules, electrophysiological behavior, and cell types studied, among others.

All the databases contain bibliographic citations (in most cases, Web hyperlinks to NCBI's PubMed).
The necessity for physical integration becomes clear when one realizes that there are multiple logical links between the entities across different databases. For example, there is a many-to-many relationship between the neuronal models of ModelDB and the neuronal cells (and compartments) of NeuronDB. Many classes of
objects in OdorDB have counterparts in both NeuronDB and ORDB.

EVOLUTION OF THE SYSTEM

The first database that was ported to an EAV architecture was OdorDB. This, the most recently created of the four databases, existed in conventional (i.e., non-EAV) form only in a prototype phase. After this port was tested and debugged, the metadata for the other databases was specified. Then the contents of the other databases were converted to EAV format (based on the metadata) and imported into the new system. Conversion was helped by reusing code and data structures from two production databases previously created by our group. These were DNA Workbench [10], a package to manage physical mapping of a chromosomal region, and ACT/DB [11], an EPRS-like EAV system to manage clinical studies data.

In adapting EAV to handle neuronal data, we had to make fundamental enhancements to the basic EAV data model that are not seen in EPRSs. We call the augmented model EAV/CR (EAV with Classes and Relationships). For space reasons, we avoid a full exposition of the EAV/CR model here, and focus only on its interesting features. (The full details, including the client and server code libraries that manage creation of a generic user interface, are available for downloading from the URL http://ycmi.med.yale.edu/senselab/info/design/EAVCR.zip. Details specific to Senselab are found in the file Senselab.zip at the same site.) In the discussion below, we emphasize the differences between EAV/CR and EAV as implemented in EPRSs, where appropriate.

FEATURES OF THE EAV/CR PHYSICAL DATA MODEL

Metadata Overview: Like all EAV systems, the schema of an EAV/CR database involves creating a fixed number of conventional tables to hold the EAV data. In any EAV database, the physical schema (the way the data is physically stored into tables) is radically different from the logical schema (i.e., as perceived by the database's users).
The description of the logical schema must therefore be stored in the metadata. All relational database engines also store metadata in their "data dictionary". This metadata is used by the DBMS in various circumstances, such as when checking a SQL query for semantic correctness before executing
it. Referential integrity checks and constraint checks are also stored in the DBMS metadata. However, in an EAV system, the system data dictionary is of limited utility, and therefore EAV/CR maintains its metadata independently.
to achieve total independence, because an application may need to invoke domain-specific algorithms. For example, one feature of ORDB is the ability to perform sequence comparison using the well-known BLAST algorithm [12].) EAV/CR achieves some degree of modularity in the domain-specific situation by permitting attributes to be methods (i.e., functions that perform computations and return values) rather than properties (i.e., values supplied by the
The EAV/CR metadata is orthogonal to the DBMS metadata, going well beyond the latter in many respects. This is because a large part of it deals with user interface issues, such as how the details of an object are to be presented. (Traditionally, client-server RDBMSs are unconcerned with user interface issues because presentation is presumed to be a responsibility of the client application.) Like the metadata in an RDBMS, the EAV/CR metadata is active in that it is constantly consulted by generic code that handles the user interface. Because we are committed to providing a Web-based interface, much of the EAV/CR metadata is necessarily Web-specific. For example, certain attributes hold hyperlinks to external databases that are actually invocations of CGI scripts with parameters (e.g., links to MEDLINE via NCBI's PubMed). For such links, our metadata needs to record the "template" (i.e., the unvarying part) of such URLs. These templates undergo macro substitution with data values at runtime so that the correct hyperlink is generated. Another metadata attribute applies to images: the designer can specify whether, when an object is displayed, an associated image should be made accessible through a hyperlink, or displayed inline using the HTML
attribute.

Classes and Attributes: A major difference between EAV/CR and the EAV-based EPRSs is that in the latter, the only "Entities" are Patient Events, while in EAV/CR, several classes of data can be entities. In Senselab, examples of such classes are neurons, neurotransmitters, channels, and so forth. In addition, unlike in the EPRS, where all the attributes in the system (i.e., history, clinical findings, lab tests, etc.) apply to a patient, the permissible attributes that can apply to a given neuronal data class are necessarily restricted. Thus, receptor molecules have associated nucleotide/amino-acid sequences, but neuronal models do not. Therefore, EAV/CR metadata must store information describing each class of data present in the system, and the applicable attributes of each class.

Because the metadata contents can be replaced with the description of classes of an entirely new scientific domain, one can achieve some degree of domain independence. (It may not be possible
user).

Object Dictionary: A basic description of every instance of a class (i.e., an object) is recorded in an Object Dictionary table. Class-specific details of an object are stored in the EAV tables described below. The advantage of storing common information on all objects across all classes in a single place is that a supporting table of synonyms/keywords can be used to enable searching of objects. The object dictionary approach is not unique to EAV/CR; it has been used previously by numerous bioinformaticians, and it is currently regarded as an essential part of any complex scientific database.

Strong Typing: In most EPRSs, a single EAV table is used to store the data, and all values, whether text or numbers, are stored in string form. Since the primary purpose of these EPRSs is to retrieve data rapidly for a single given patient, the overhead involved in converting strings back into numbers is not significant. For scientific data, however, many queries involve criteria based on sets of objects (rather than a single object). Further, a large part of the data is "binary" as far as the database engine is concerned, meaning that the database only stores the data and does not try to interpret it, or operate on parts of it. (Examples of binary data are voltage signal tracings or neuronal model files written using a simulation language.) For these reasons, EAV/CR segregates EAV data into separate tables based on datatype: integers, floats, dates, short strings, long strings, and binary. In addition, an object can itself be an attribute of another object, so we also represent Object IDs as values. Therefore, every attribute within the system must have a defined datatype.

Relationships: As stated earlier, the only "Entity" in an EPRS is the patient event: an EPRS simulates a single giant Patient-Events table with a potentially infinite number of columns (attributes) per row.
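The Strong Typing arrangement described above, with one EAV table per datatype and a declared datatype for every attribute, might be sketched as follows. This is a minimal sketch under stated assumptions and not the EAV/CR code itself: the table names (`eav_int`, `eav_float`, and so on), the `attribute_meta` table, the helper functions, and the attribute used in the usage note are all hypothetical.

```python
import sqlite3

# Hypothetical mapping from each datatype named in the text to a
# (table name, SQLite column type) pair.
DATATYPE_TABLES = {
    "integer": ("eav_int", "INTEGER"),
    "float": ("eav_float", "REAL"),
    "date": ("eav_date", "TEXT"),
    "short_string": ("eav_str", "TEXT"),
    "long_string": ("eav_text", "TEXT"),
    "binary": ("eav_binary", "BLOB"),
    "object": ("eav_object", "INTEGER"),  # value is another object's ID
}


def create_schema(conn):
    """Create the attribute metadata table and one EAV table per datatype."""
    conn.execute(
        "CREATE TABLE attribute_meta ("
        " attribute TEXT PRIMARY KEY, datatype TEXT NOT NULL)"
    )
    for table, sql_type in DATATYPE_TABLES.values():
        conn.execute(
            f"CREATE TABLE {table} ("
            f" object_id INTEGER NOT NULL,"
            f" attribute TEXT NOT NULL,"
            f" value {sql_type})"
        )


def define_attribute(conn, attribute, datatype):
    """Register an attribute; every attribute must declare a known datatype."""
    if datatype not in DATATYPE_TABLES:
        raise ValueError(f"unknown datatype: {datatype}")
    conn.execute("INSERT INTO attribute_meta VALUES (?, ?)", (attribute, datatype))


def table_for(conn, attribute):
    """Look up, via the metadata, which typed table holds an attribute."""
    row = conn.execute(
        "SELECT datatype FROM attribute_meta WHERE attribute = ?", (attribute,)
    ).fetchone()
    if row is None:
        raise KeyError(attribute)
    return DATATYPE_TABLES[row[0]][0]


def store(conn, object_id, attribute, value):
    """Insert a value into the table that matches the attribute's datatype."""
    table = table_for(conn, attribute)
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (object_id, attribute, value))


def objects_where_float_above(conn, attribute, threshold):
    """Set-based query: a numeric comparison with no string-to-number cast."""
    rows = conn.execute(
        "SELECT object_id FROM eav_float"
        " WHERE attribute = ? AND value > ? ORDER BY object_id",
        (attribute, threshold),
    )
    return [r[0] for r in rows]
```

As a usage example, after `create_schema(conn)` and `define_attribute(conn, "resting_potential_mv", "float")` (an invented attribute), values stored for several objects can be filtered numerically with `objects_where_float_above(conn, "resting_potential_mv", -66.0)`, which is the kind of criterion over sets of objects that string-only storage makes awkward.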
store array data. In the hierarchy context, NF2 means that the Parent ID need not be repeatedly stored as part of a parent-child pair. Instead, the system simply stores a list of all children for a parent.

Coexistence with Conventionally Stored Data: All EAV databases contain non-EAV (i.e., conventionally structured) data components when such storage is appropriate. (For example, Senselab has a conventional table for references that are not yet in Medline, because most scientific databases need a reusable bibliography component.) The metadata must therefore track, for each class, whether it is represented in EAV or conventional form, so as to generate the appropriate user interface code.

IMPLEMENTATION ISSUES

The Web forms are generated as Active Server Pages (ASP) scripts. Our code therefore relies on Microsoft Internet Information Server 4.0 and invokes some ActiveX components. The generated forms also contain client-side JavaScript. The database engines used are Microsoft Access 97 (for prototyping) and Oracle 7.3 (for production). Our server code framework (as opposed to the metadata framework) is specific to Windows NT. While it would have been possible to create the framework using CGI scripts (an older technology that works across all Web server platforms), we were more concerned with the problem of browser independence. Another advantage of ASP files is that most of them are simply standard HTML files with embedded scripting code. Therefore, they can be partly edited through visual Web page editor programs. This allows cosmetic refinement of machine-generated forms if necessary.

PRESENT STATUS

The data described earlier has been fully ported over to EAV/CR form. The unified database is publicly accessible through the Web via the URL http://ycmi.med.yale.edu/senselab/. We will presently be incorporating two additional kinds of data: information about neural circuits, and data (including 3-D images) on functional Magnetic Resonance Imaging of the olfactory cortex.
The 3-D data in particular will provide a serious test of our system, as well as a quantification of the domain-independence of our code framework.

CONCLUSIONS

The use of the EAV/CR model has allowed us to integrate the highly heterogeneous data in the
In scientific databases, however, there are different classes of entities related to each other. For example, in OdorDB, an experiment records links to several classes: the species on which the experiment was performed, the odor molecules used, the neurotransmitter and/or the second messenger studied, the neuronal cell type under investigation, and so on. One can query the set of data on any of these criteria. In conventional databases, many-to-many relationships between classes are implemented through "bridge" tables, which have fields that point to the appropriate class tables. EAV/CR represents a relationship as a special type of class that has other classes as members. (This is another reason why it is necessary to support Object IDs as values.) In a conventional database, there can be many kinds of relationships; in EAV/CR, one creates a class for each kind.

Management of Hierarchies: A hierarchy is a special kind of relationship where a "parent" object has one or more "children" objects. Hierarchies are very common in neuronal data, e.g., with neurons belonging to a nucleus, which in turn is part of an anatomical structure. Neuronal data also exhibits classification hierarchies. Thus, "M1 receptor" is a child of "Muscarinic Receptor", which in turn is a child of "Cholinergic Receptor". For neuronal data, queries specified at a coarser level of granularity (e.g., "list sites within the thalamus where dopamine is released") must also retrieve data stored at finer granularity.

In conventional databases, a hierarchy is represented by a table with two columns, "parent" and "child". (A third column, "serial number", is used if ordering of the children of a particular parent is required.) For a given parent object, there are as many rows as there are children. Standard recursive algorithms are used to navigate this table to retrieve all descendants (or all ancestors). The issue of hierarchy traversal is well known as the "Bill of Materials problem" [13].
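The conventional two-column representation and its recursive traversal can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the rows use the receptor example from the text plus one extra sibling ("Nicotinic Receptor") added for the example, and `ancestors` assumes a strict hierarchy in which each child has a single parent.

```python
from collections import defaultdict

# Each row is one (parent, child) pair, as in the two-column hierarchy table.
HIERARCHY_ROWS = [
    ("Cholinergic Receptor", "Muscarinic Receptor"),
    ("Cholinergic Receptor", "Nicotinic Receptor"),
    ("Muscarinic Receptor", "M1 receptor"),
]


def children_index(rows):
    """Group the parent-child rows into a parent -> [children] lookup."""
    index = defaultdict(list)
    for parent, child in rows:
        index[parent].append(child)
    return index


def descendants(rows, root):
    """Recursively collect every descendant of root, in pre-order."""
    index = children_index(rows)
    found = []

    def visit(node):
        for child in index.get(node, []):
            found.append(child)
            visit(child)

    visit(root)
    return found


def ancestors(rows, node):
    """Walk upward from node to the top (assumes one parent per child)."""
    parent_of = {child: parent for parent, child in rows}
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain
```

A query posed at a coarse level, such as one on "Cholinergic Receptor", would first expand the term with `descendants` and then match data recorded against any of the returned, finer-grained objects.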
When the data is strictly hierarchical (i.e., where a child object cannot have multiple parents) EAV/CR allows representation of hierarchies through a simulation of non-first-normal form (NF2) [14] to speed up retrieval. NF2 is a feature of object-oriented and object-relational databases, and is so called because it departs from the first rule of relational database normalization (all elements of a table shall be atomic, and a table shall not have repeating values). Among other things, NF2 is used to
Senselab project. As with all generic frameworks, a large amount of effort must be invested up front in creating the code libraries to display and manipulate the data. Once they are created, however, this effort is amortized in the ability to generate standard Web forms rapidly.

We emphasize that the "standard" user interface is applicable to most but not all circumstances. Therefore our code framework has been evolving to handle alternative metaphors of data presentation that might be used frequently. For NeuronDB, for example, it was decided that, when displaying data on neurons by region, as well as when listing data on receptors, all levels of the existing data hierarchy should be pre-expanded, and a tree structure displayed. (The reason for this is that the total number of objects involved in each case is relatively modest, and therefore it is ergonomically beneficial to present all the leaves of the hierarchy as multiple hyperlinks on a single Web page.) Thus, one of the branches of the tree has the leaf "CA1 pyramidal neuron" with the ancestor nodes "hippocampus", "archicortex" and "forebrain". We have therefore created parameterized code that goes down a particular designer-specified hierarchy and displays all leaves of the tree with all intervening nodes.

We invite collaborations from bioinformatics researchers who would like to test the EAV/CR framework for their own complex datasets.

Acknowledgements: This work was supported by NIH Grants R01 DC02307 and R01 DC03972 to Dr. Gordon Shepherd
Medical Care. Washington, D.C. IEEE Computer Press, Los Alamitos, CA. 1994.
[4]. Friedman C, Hripcsak G, Johnson S, Cimino J, and Clayton P. A Generalized Relational Schema for an Integrated Clinical Patient Database. in Proc. 14th Symposium on Computer Applications in Medical Care. Washington, D.C. IEEE Computer Press, Los Alamitos, CA. 1990.
[5]. Mirsky JS, Nadkarni PM, Hines M, Healy MD, Miller PL, and Shepherd GM, A framework for informatics support of computer-based neuronal modeling: imposing order in a complex domain. (in preparation).
[6]. Peterson B, Healy M, Nadkarni P, Miller P, and Shepherd GM, ModelDB: An environment for running and storing computational models and their results applied to neuroscience. J. Amer. Informatics Assoc 3(6): p. 389-398, 1996.
[7]. Hines ML and Carnevale NT, The NEURON simulation environment. Neural Computation 9(6): p. 1179-1209, 1997.
[8]. Bower JM and Beeman D. The Book of Genesis. Springer-Verlag, New York, 1995.
[9]. Healy MD, Smith JE, Singer MS, Nadkarni PM, Skoufos E, Miller PL, and Shepherd GM, Olfactory Receptor Database (ORDB): A resource for sharing and analyzing published and unpublished data. Chemical Senses 22: p. 321-326, 1997.
[10]. Nadkarni PM, Cheung K-H, Castiglione C, Miller PL, and Kidd KK, DNA Workbench: A Database Package to Manage Regional Mapping. Journal of Computational Biology 3(2): p. 319-329, 1996.
[11]. Nadkarni PM, Brandt C, Frawley S, Sayward F, Einbinder R, Zelterman D, Schacter L, and Miller PL, Managing attribute-value clinical trials data using the ACT/DB client-server database system. Journal of the American Medical Informatics Association 5(2): p. 139-151, 1998.
[12]. Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ, Basic Local Alignment Search Tool. J. Mol. Biol. 215: p. 403-410, 1990.
[13]. Goodman N, Bill of Materials in Relational Database. InfoDB 5(1): p. 2-13, 1990.
[14]. Stonebraker M, Object-Relational Database Systems. White paper available from Illustra Technologies Inc.: Oakland, CA, 1993.
REFERENCES

[1]. Shepherd GM, Healy MD, Singer MS, Peterson BE, Mirsky JS, Wright L, Smith JE, Nadkarni PM, and Miller PL, Senselab: a project in multidisciplinary, multilevel sensory integration, in Neuroinformatics: An Overview of the Human Brain Project. S.H. Koslow and M.F. Huerta, Editors. Lawrence Erlbaum Associates, Inc.: Mahwah, NJ. p. 21-56, 1997.
[2]. Shepherd G, Mirsky JS, Healy MD, Singer MS, Skoufos E, Hines MS, Nadkarni PM, and Miller PL, The Human Brain Project: Neuroinformatics tools for integrating, searching and modeling multidisciplinary neuroscience data. Trends in Neurosciences, (in press).
[3]. Huff SM, Haug DJ, Stevens LE, Dupont CC, and Pryor TA. HELP the next generation: a new client-server architecture. in Proc. 18th Symposium on Computer Applications in