The BioMediator System as a Tool for Integrating Biologic Databases on the Web

Ron Shaker1, Peter Mork2,4, J. Scott Brockenbrough2,3, Loren Donelson2, Peter Tarczy-Hornoch1,2,4

Dept. of 1Pediatrics, 2Medical Education & Biomedical Informatics, 3Biological Structure, 4Computer Science & Engineering, University of Washington, Seattle, WA, USA

email: {rshaker, pmork, jsbrock, donelson, pth}@u.washington.edu

Abstract

BioMediator is a data integration system tailored to the domain of molecular biology. Based on our collaborations with biologic researchers, we have identified several challenges in building a data integration system that addresses their needs. BioMediator provides a common interface to several Web-accessible data sources, using a novel source knowledge base to organize metadata about the sources. This approach allows BioMediator to answer poorly specified queries, to re-use wrappers, and to support multiple mediated schemas, which can easily be modified. We describe the system architecture, focusing on query processing and data access, and conclude by comparing our approach to the more classic federated approach.

1. Introduction

BioMediator is a data integration system that provides a common interface to Web-accessible sources of biologic information. Standard data integration techniques can be used to provide access to these sources, but these techniques are not always adequate for biologic researchers (e.g., as new experiments are devised, the mediated schema needs to evolve). BioMediator includes several features (e.g., an easily modified mediated schema) that address the challenges introduced by biologic researchers' needs.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

In the post-genomic era, biologic research can benefit from access to the large amounts of data and knowledge stored in public repositories. Each of these sources was developed to address specific needs and is organized around a unifying concept and/or organism (e.g., the Mouse Genome Database [1]). However, the new "systems" approach to biology [2] requires analyzing experimental results in a more general context. Thus, this approach requires integrating information from a distributed set of highly heterogeneous sources, including both public and private data sources. More generally, inductive research (i.e., research that generates rather than tests hypotheses) depends on integrating immense data sets.

The biologists performing these experiments have two other needs not addressed by standard data integration approaches. First, the system must support both poorly specified queries (What is known about the genetic disease narcolepsy?) and very specific queries (A mutation of which gene(s) results in dysprothrombinemia, a form of haemophilia caused by an inactive protein?). Second, the mediated schema must be easily customized for different user groups whose needs evolve over time.

BioMediator addresses the first challenge by providing support for flexible query answering [3]. A user begins by issuing a declarative query, which establishes the basic topics of interest. The user can then browse these results and issue new queries to explore related topics. For example, a query for narcolepsy returns a number of related genes. The user could retrieve more information about one, some or all of these related topics.

The second challenge requires several innovations. In biologic data integration systems the mediated schema is often hard-coded [4]. However, the heterogeneous and evolving needs of biologists mandate supporting multiple mediated schemas that can easily be changed.
As a consequence, BioMediator is driven by information stored in a Protégé [5] knowledge base that is modified using a graphical interface. The contents of this knowledge base can be easily extended to support new user groups; as a result, researchers can create custom mediated schemas. A typical timeline is as follows: a new user selects an existing mediated schema and begins using BioMediator. Eventually, limitations of this schema become evident, and the user makes changes to a copy of the original schema (by copying the Protégé knowledge base). After running some experiments, the new schema is modified to reflect the new data needs.

To support user-driven schema evolution, the wrappers are as general as possible. All of the available data fields are exposed, whether or not they appear in a given mediated schema. When changes are made to a mediated schema, previously invisible data fields can be mapped to the new schema. This can often be done with no additional programming, using a plugin for the Protégé environment (described in Section 5.3). These features make BioMediator an excellent tool for data integration in the biologic sciences.
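The interplay just described, where general wrappers expose every field and the mediated schema maps only the subset it currently needs, can be caricatured in a few lines. This is an illustrative sketch, not BioMediator code; all field and schema names are invented:

```python
# Sketch of why general wrappers ease schema evolution: the wrapper
# exposes every available data field, while the mediated schema maps
# only the fields it currently needs. All names are illustrative.

WRAPPER_OUTPUT = {  # everything the source publishes
    "symbol": "HCRT", "locus": "17q21", "refseq": "NM_001524",
}

schema_v1 = {"GeneSymbol": "symbol"}  # mediated field -> wrapper field

def remap(record, mapping):
    """Project a wrapper record onto a mediated schema."""
    return {med: record[native] for med, native in mapping.items()}

v1 = remap(WRAPPER_OUTPUT, schema_v1)

# Later the schema evolves: a previously invisible field is mapped
# without any change to the wrapper itself.
schema_v2 = dict(schema_v1, Locus="locus")
v2 = remap(WRAPPER_OUTPUT, schema_v2)
```

Because the wrapper already emitted `locus`, evolving from `schema_v1` to `schema_v2` requires only a new mapping entry, which mirrors the "no additional programming" claim above.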

2. Application

To guarantee that BioMediator presents real-world solutions to actual research problems, the BioMediator working group collaborates with four biomedical research laboratories that advise on their biologic research needs and resultant queries. A series of structured interviews, as well as direct observation of lab data management practices, leads to the development of queries designed specifically for each lab, but which often apply more generally.

One example is positional cloning [6], a method of identifying genes responsible for inherited disorders that is used in hundreds of labs internationally. In the case of Hereditary Neuralgic Amyotrophy (HNA), researchers in the U.S. have collected data on the affected and non-affected members of 30 families stricken by the disease, while collaborators in Europe have collected similar data on families there. Researchers search the DNA of affected and non-affected family members for sequence variations (markers) that co-occur with the expression of the disease. The better the correlation between the presence of a marker and the disease (phenotype), the more likely it is that the marker lies close to the responsible gene.

The U.S. lab maintains private data sources with genotypic and phenotypic information, and regularly searches public genetic data repositories to compare its markers to those established for the general population. European collaborators perform this same process, in parallel, several times per week. BioMediator provides a means of integrating the privately collected U.S. and European data with that of the public genetic databases. Once these data sets have been integrated, the queries to compare markers can be written (and re-used) to generate the most up-to-date result sets possible.

A second real-world example of the practical utility of BioMediator comes from a collaborating research biologist studying the cellular signals in teratogenesis (the development of birth defects), who is collecting samples of embryonic tissues at critical time points from susceptible and non-susceptible mice. In each tissue, high-throughput technologies are used to determine the expression levels of individual genes (using Affymetrix gene expression microarray technology [7]) as well as those of proteins derived from the expression of those genes (using Isotope Coded Affinity Tagging, or ICAT [8]). Test results are stored in two separate private data repositories within the lab, both as electronic spreadsheets. These private repositories need to be integrated with searches of public data sources, as in the first example.

Because gene expression transcripts can be edited in a variety of ways, each gene can potentially lead to the formation of more than one protein. It is a formidable task to unambiguously identify each of several hundred to a thousand proteins in each sample as the product of a given gene. The protein must first be identified among those described in public data repositories and, when possible, annotated as to its function, biologic significance and the exact chromosomal locus of the gene which codes for it. Next, unique gene identifiers are checked against those whose expression has been measured (several thousand per sample) with the microarray. A successful experiment generates hypotheses about which cellular proteins and their respective genes play important roles in teratogenesis. An abbreviated list of the questions the lab would like to answer follows:

• Which genes are differentially expressed between the susceptible vs. non-susceptible mice?
• What does the pattern of expression in the different embryonic tissues tell us about how the defect (which comes from a variety of tissues) is caused?
• What is the time course of these changes in gene expression, and how does it relate to the time course of the expression of the corresponding proteins in the two groups?
• Do our candidate proteins of interest belong to known cellular or metabolic pathways?

BioMediator has been able to successfully query across private and public data to annotate proteins and help identify the corresponding genes that were expressed during the experiment. This has saved the investigator weeks of laborious analysis and annotation by hand.

3. Architecture

BioMediator has a componentized architecture, designed to perform data integration over multiple structured and semi-structured biologic data sources in a flexible and reconfigurable way. Following is an overview of the

[Figure 1: BioMediator System Architecture. The figure shows the Query Processor (B), with its API, event stream and browser engine (B2); the Source Knowledge Base (A), represented in Protégé; the Metawrapper (C); the Plugin (D); a bank of Wrappers (E1-EN); and the Data Sources (F1-FN), including GeneTests, Entrez, PDB and other public sources, communicating via URL queries and XML results.]
system, which is driven by a knowledge base and schema mapping rules at both the query and translation levels.

The system relies heavily on the source knowledge base (SKB), which is represented in Protégé [5] and accessed via the Protégé API. The SKB (Fig. 1A) contains a) the mediated schema, b) annotations that describe how database cross-references are established and maintained, and c) a catalogue of all possible data sources and the mediated-schema elements contained by those sources. The mediated schema includes a class hierarchy describing concepts in the biologic domain and a property hierarchy describing relationships among the classes. Sample classes include Genes and Proteins; sample properties include codes-for and associated-with.

The query processor (Fig. 1B) provides an API for launching and managing queries posed using the mediated schema. The metawrapper (Fig. 1C) translates these mediated-schema queries into source-specific queries using forward mapping rules. The wrappers (Fig. 1E) pass the remapped queries through to the data sources (Fig. 1F). Data sources return results in native format, which the wrappers translate to XML syntax with native semantics. The metawrapper applies reverse mapping rules to translate the XML result streams from native semantics to mediated-schema semantics. Both the forward and reverse mapping rules employed by the metawrapper are generated by the plugin (Fig. 1D). The query processor then retrieves data from the metawrapper, organizes that data and generates events which can be used to synthesize a navigable representation of the result set. Once a view has been constructed, it may be repeatedly queried, expanded or grown using the query processor's API.
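The SKB contents just described can be caricatured as a few lookup tables. A minimal sketch, with all names illustrative (the real SKB is a Protégé knowledge base, not Python dictionaries):

```python
# Sketch of the three kinds of SKB metadata: a class hierarchy, a
# property hierarchy, and a catalogue mapping sources to the
# mediated-schema elements they provide. All names are illustrative.

CLASS_HIERARCHY = {
    "Entity": ["Gene", "Protein", "Phenotype"],
}

PROPERTY_HIERARCHY = {
    "related-to": ["codes-for", "associated-with"],
}

# Catalogue: which mediated-schema classes each source can supply,
# plus curation annotations used when describing properties.
SOURCE_CATALOGUE = {
    "OMIM": {"classes": ["Phenotype"], "curated": True},
    "LocusLink": {"classes": ["Gene"], "curated": True},
}

def sources_for(cls):
    """Return the sources that can answer a query about class `cls`."""
    return [src for src, meta in SOURCE_CATALOGUE.items()
            if cls in meta["classes"]]
```

A query processor consulting such a catalogue would route a Gene query only to sources listing Gene, which is the role the SKB's source catalogue plays in query planning.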

4. Query processor

In the BioMediator architecture, the query processor (Fig. 1B) exposes the external API. Through interactions like querying and browsing, data are retrieved and organized in a result graph. These user interactions are posed in terms of the mediated schema. The nodes of the result graph correspond to object instances, each owned by a specific database. Associated with each node are attribute/value pairs that describe that instance. Similarly, each edge in the result graph is an instance of a member of the property hierarchy representing relationships.

For example, when a user queries for 'narcolepsy', one of the data instances (nodes in the result graph) identified is OMIM [9] record 161400. The value for the 'Name' attribute of this instance is 'narcolepsy'. In turn, via an instance of the property-hierarchy member associated-with, the data instance OMIM record 161400 is related to instances from LocusLink, including record 3060 for the gene HCRT, a mutation of which is partially responsible for narcolepsy.

The leaves of the property hierarchy are the relationships published by the underlying data sources. These leaves are further annotated with metadata describing how the data sources are maintained, including information about who validates cross-references, how frequently the source is updated, etc. We capture this information for each property because some sources have different curation policies for different relationships. For example, LocusLink distinguishes between reference sequences (RefSeq [10]), which are validated by experts, and more generic relationships, which are not validated.

The property hierarchy and associated annotations allow us to support a variety of join conditions. The most specific option is to name the property of interest. If a leaf property is named, information from a single source is retrieved. When a non-leaf property is named, the system retrieves information for all of the leaves under that property. This is exactly analogous to query reformulation [11] performed in a federated database. One of the key innovations introduced by BioMediator is the ability to describe properties, rather than name them. The user provides attribute/value criteria that the query processor compares against the annotation metadata stored in the SKB. Properties that match the indicated criteria are provided automatically while browsing.
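The result graph described above (source-owned instances carrying attribute/value pairs, connected by property instances) can be sketched with the paper's own narcolepsy example. The data structures below are an illustrative reconstruction, not BioMediator's internal representation:

```python
# Sketch of the result graph: nodes are data instances owned by a
# specific source, each carrying attribute/value pairs; edges are
# instances of properties from the property hierarchy. Record
# contents follow the paper's narcolepsy example; structure is
# illustrative only.

class Instance:
    def __init__(self, source, record_id, **attrs):
        self.source = source        # owning database, e.g. "OMIM"
        self.record_id = record_id  # source-specific identifier
        self.attrs = attrs          # attribute/value pairs

class ResultGraph:
    def __init__(self):
        self.nodes, self.edges = [], []

    def add_node(self, node):
        self.nodes.append(node)
        return node

    def add_edge(self, src, prop, dst):
        # prop is a member of the property hierarchy, e.g. associated-with
        self.edges.append((src, prop, dst))

g = ResultGraph()
omim = g.add_node(Instance("OMIM", "161400", Name="narcolepsy"))
gene = g.add_node(Instance("LocusLink", "3060", Symbol="HCRT"))
g.add_edge(omim, "associated-with", gene)
```

Browsing then amounts to walking this graph: starting from the OMIM node, following the associated-with edge reaches the HCRT gene record.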
4.1 Querying and browsing

In BioMediator we distinguish between querying for data instances and browsing properties. A query specifies a class, an attribute, and a value via a query statement. Based on the contents of the SKB, the query processor uses the metawrapper to retrieve all instances of the indicated class with which the attribute/value pair is associated. In other words, the set of attribute/value pairs for the instance must contain the attribute/value pair indicated in the query. As a convenience, it is also possible to query for a specific instance using a source-specific identifier.

Once at least one query has been issued, there are two methods for adding edges to the result graph. First, a collection of instances can be expanded along some property (in many cases this collection is the set of all nodes currently in the result graph), effectively extending each node in the collection by one more edge. As indicated in the previous section, the property being expanded might be a named property (such as codes-for) or a property described using annotations. Second, the graph can be grown. This expands every node iteratively: when an edge connects a new node to the graph, that node is also expanded, n-deep, until either a terminal node is reached or a cycle is encountered. In general, these expansions are guided using described properties (not named properties). As a heuristic, the paths generated while growing do not contain two instances of a given leaf property (i.e., cycles are broken at the level of the mediated schema, not at the result level).

The ability to describe properties is particularly powerful when combined with the grow operation. Given the complexity of the domain, users do not always know the path linking two instances that they want to explore. For example, a user might want to find the genes related to some disease. One option is to retrieve the genes directly associated-with that disease. An alternative is to retrieve the proteins that cause the disease and then retrieve the genes that code-for these proteins. This sort of exploration is not possible in a relational database: the FROM clause must exhaustively enumerate the relations to be joined. BioMediator is more general in that the join conditions can be either enumerated or described.
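The grow operation and its cycle-breaking heuristic can be sketched as a recursive traversal that refuses to repeat a leaf property on any path. The edge table below stands in for live source access; node and property names are illustrative:

```python
# Sketch of `grow`: expand every reachable node, breaking cycles at
# the mediated-schema level by never traversing the same leaf
# property twice on one path. EDGES is a stand-in for live sources.

EDGES = {  # node -> [(leaf_property, neighbour)]
    "disease": [("associated-with", "protein")],
    "protein": [("codes-for", "gene")],
    "gene":    [("associated-with", "disease")],  # would close a cycle
}

def grow(node, used_props=frozenset(), path=()):
    """Yield every path from `node` that repeats no leaf property."""
    path = path + (node,)
    extended = False
    for prop, nbr in EDGES.get(node, []):
        if prop in used_props:      # heuristic: break schema-level cycles
            continue
        extended = True
        yield from grow(nbr, used_props | {prop}, path)
    if not extended:                # terminal node: emit finished path
        yield path
```

Starting from "disease", the traversal follows associated-with to the protein and codes-for to the gene, then stops because following associated-with again would repeat a leaf property, exactly the disease-to-protein-to-gene exploration described above.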
Regardless of the function used to add information to the graph (query, expand, grow), results are returned using callback functions. We have developed callbacks for writing results to an XML file or an RDF database, and for updating a real-time GUI. In this interface, results are streamed to the user as they are produced by the data access layer.
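The callback style just described can be sketched as follows. Class and method names are hypothetical; the point is only the control flow, in which results stream to registered handlers rather than being returned in one batch:

```python
# Sketch of callback-based result delivery: the caller registers
# handlers (an XML writer, an RDF store, a GUI updater) and each
# result node is pushed to them as it arrives. Names are hypothetical.

class QueryProcessor:
    def __init__(self):
        self._callbacks = []

    def register(self, fn):
        self._callbacks.append(fn)

    def _emit(self, node):
        for fn in self._callbacks:
            fn(node)

    def query(self, cls, attr, value):
        # A real implementation would consult the metawrapper; here a
        # canned result is emitted to show the streaming control flow.
        self._emit({"class": cls, attr: value})

seen = []
qp = QueryProcessor()
qp.register(seen.append)  # e.g. a GUI updater or XML writer
qp.query("Phenotype", "Name", "narcolepsy")
```

Because each handler is invoked per node, a GUI callback can render partial results while slower sources are still responding.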

5. Data access

The data access portion of the system (Fig. 1C-1F) provides transparent access to a collection of disparate, remote data sources. Access is achieved via a two-tier model in which source data is first converted to a common syntax by the wrapper, and then converted to a common semantics by the metawrapper.

Much of the data of interest to biologists resides in public databases on the Internet (e.g., Entrez [12], GeneTests [13], PDB [14]). Access to these databases is typically provided via HTML forms and CGI; however, some data sets are also available for download with the intention of being reconstituted as locally managed relational databases.

5.1 Wrapper architecture

Each wrapper provides an interface to a particular remote data source (including private data sources local to the biologist but remote to the BioMediator application). Wrappers are implemented as HTTP servlets, accepting queries in the form of URLs and returning XML-formatted data. When queried, the wrapper establishes a real-time connection with the data source in order to guarantee data freshness.

It is the wrapper's job to syntactically translate incoming queries and outgoing result sets. Queries to the wrapper are converted from a native-namespace URL (e.g., http://wrapper?key=value) to a form that is understood by the native query interface of the corresponding data source. Once the wrapper retrieves a result set from the data source, the results are converted from the source's native format (e.g., ASN.1, HTML, text) into valid XML that reflects the semantics of the data source as closely as possible.

5.2 Schema mapping

In order to query over multiple heterogeneous data sources, the data access layer must first relate the schemas of the underlying sources to the mediated schema. This relationship is established using mapping rules that are interpreted and applied by the metawrapper [15] (or semantic translation engine). In the forward direction, rules are applied to rewrite incoming query URLs from the mediated namespace (e.g., http://metawrapper?entity=Gene&attribute=Symbol[value=BRCA1]) to the wrapper's native namespace. In the reverse direction, wrapper output is streamed through a SAX parser, which uses a collection of triggers to generate XML that conforms to the mediated schema.

Mapping rules can become quite involved in order to support complex operations. Therefore, simple directives have been developed to handle the majority of common cases. Directives are short, simple, objective-oriented commands with a relatively easy-to-learn syntax. This facilitates reconfiguration of the mappings by non-programmers, and rapid development of new models.

5.3 Plugin

The plugin links the data access portion of BioMediator to the rest of the system. It provides an API for creating metawrapper (or mediated-namespace) query URLs, as well as a means of checking which queries are supported and which classes of entities are generated by each data source.

Development continues on a drag-and-drop interface for the plugin that will guide end users in the creation of mapping directives. User choices will be limited to what is present in the mediated schema and source schemas. This will further facilitate system reconfiguration, as well as limit (or remove) the possibility of generating incorrect or unsupported directives. A command-line version of this plugin is currently in use.
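The two translation directions described in Section 5.2 can be sketched roughly as follows. This is an illustrative reconstruction, not BioMediator code: the rule tables, wrapper URL, and XML element names are hypothetical, and in the real system such rules are generated by the plugin from directives.

```python
# Illustrative sketch of the metawrapper's two directions.
# Forward: rewrite a mediated-namespace query URL to a wrapper URL.
# Reverse: stream wrapper XML through a SAX handler whose
# path-triggered rules emit mediated-schema terms without buffering
# the whole document. All names are hypothetical.
import xml.sax
from urllib.parse import urlparse, parse_qs

FORWARD_RULES = {  # (mediated class, attribute) -> wrapper URL template
    ("Gene", "Symbol"): "http://gene-wrapper?sym={value}",
}
REVERSE_RULES = {"record/locus/symbol": "GeneSymbol"}  # path -> mediated tag

def forward_map(mediated_url):
    """Rewrite a mediated-namespace URL into a wrapper's namespace."""
    q = parse_qs(urlparse(mediated_url).query)
    entity = q["entity"][0]
    attr, value = q["attribute"][0].split("[value=")
    return FORWARD_RULES[(entity, attr)].format(value=value.rstrip("]"))

class ReverseMap(xml.sax.ContentHandler):
    """Fire a mapping rule whenever the current element path matches."""
    def __init__(self):
        self.path, self.out = [], []
    def startElement(self, name, attrs):
        self.path.append(name)
    def endElement(self, name):
        self.path.pop()
    def characters(self, text):
        tag = REVERSE_RULES.get("/".join(self.path))
        if tag and text.strip():
            self.out.append((tag, text.strip()))

url = forward_map(
    "http://metawrapper?entity=Gene&attribute=Symbol[value=BRCA1]")
handler = ReverseMap()
xml.sax.parseString(
    b"<record><locus><symbol>HCRT</symbol></locus></record>", handler)
```

Because the reverse direction is event-driven, mediated-schema output can begin before the wrapper has finished emitting its document, the streaming property discussed later in Section 6.3.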

6. Discussion

BioMediator shares many similarities with previous projects that integrate public data sources. BioKleisli [16] pioneered the use of federated database technology in the biologic domain. Like BioKleisli, BioMediator uses a nested data model for data exchange (XML in our case), which is generated by an array of wrappers [17]. The TAMBIS [18] project extended BioKleisli by introducing a mediated schema (expressed using a description logic) that organizes the source data into a class hierarchy. This approach to integration emphasizes class-oriented queries. In BioMediator we place more emphasis on the cross-references used to interrelate records across the public data sources. These links introduce new challenges in optimization [19], estimation [20] and modelling [21]. BioMediator organizes these cross-references into an explicit property hierarchy stored in the SKB.

6.1 Source knowledge base

Schema evolution in the older generation of federated database projects is frequently a challenge, since mediated schemas are rarely stored in a way that lets biologists easily interact with them. Maintenance of the federation is also often a problem, because most projects use hard-coded wrappers to translate data from the source schema into the mediated schema. These wrappers typically need updating when the mediated schema changes, and complex schema changes can necessitate updating the wrappers for a large number of sources. Current federated database projects include those developed for a specific use (e.g., PharmGKB [22] in pharmacogenomics, GeneX [23] in expression arrays) and those that can be customized (e.g., Kleisli [16], P/FDM [24], ACEDB [25], OPM [21]), all with slightly different flavours of the same basic concept.

6.2 Query processor

We originally built BioMediator [26] using a more traditional query processor: a declarative query [27] was posed against the mediated schema. This query was then rewritten as an XQuery posed against the underlying data sources, which was executed by a query engine (first Tukwila [28], and later Qexo [29]). We replaced the query engine because our discussions with users suggested that they often do not know exactly what they are looking for. Instead, they have some idea of where to start (i.e., they have an initial query). From there, they want to browse in a constrained fashion. In fact, we believe that this exploratory behaviour is common despite the limited support provided by many data management systems. We also replaced the declarative query interface in favour of a method-oriented API; neither users nor application developers were interested in learning a query language. Moreover, a declarative query interface aligns poorly with an interactive system. One important feature from our original query interface that we have preserved is the ability to describe properties of interest.

6.3 Data access

The semantic translation engine was originally implemented using SAX, allowing conversion of wrapper XML to proceed without first seeing all of the wrapper's output. This technique reduces the memory footprint of the metawrapper and, since the BioMediator system as a whole is designed to stream output and process inputs at their earliest availability, it also helps provide the client with more immediate and continuous feedback.

In terms of usability, the initial intention was to create a simple syntax for mapping between source and mediated schemas. However, this required much of the translation logic to be embedded in the metawrapper, making the application complex, hard to verify and hard to extend. The second approach removed the mapping logic from the translation engine and added a plugin layer designed to generate a path-triggered set of mapping rules that can be interpreted and applied by the translation engine. The rules generated by the plugin are constructed from mapping directives defined by the biologist. The main features of the new model are that it allows for independent development and testing of system components, and that it separates intention (directives) from method (rules). The directives have also brought new features such as conditional copying, combining and filtering of data from the wrapper's output stream. Wrapper complexity has in turn decreased in many cases, and output no longer has to be as carefully structured. Remote sources that provide raw XML result sets (e.g., Entrez [12]) can now be returned to the metawrapper with little reformatting by the wrapper application.

6.4 Future evolution

The usefulness of BioMediator for the integration and querying of biologic data has already been demonstrated through its successful use by some of the group's biomedical collaborators. In order to make the system accessible to a larger audience, past work will need to be incrementally extended as well as taken in entirely new directions. Incrementally, the system will require new wrappers and access to additional public sources, such as pathway and function databases. Exploring new tools for better generalizing access to private data stores and experimental data is also crucial to our users, and the inclusion of more private data necessitates developing a security model to allow for the sharing of experimental results between researchers.

In terms of new directions, one of our goals is to extend the model to include analytic tools that operate across distributed biologic data, for example proteomic and expression array data. Some preliminary work in this area has already been done in conjunction with the creation of an expression array annotation package for BioConductor [30]. By combining data integration with analytics, our collaborators can perform their in silico experiments in a single unified framework.

Acknowledgements

Funding was provided by NHGRI (R01HG02288, PI: Tarczy-Hornoch), an NLM training grant (T15LM07442, trainees: Mork & Donelson) and a BISTI planning grant (P20LM007714, PI: Brinkley, co-PI: Tarczy-Hornoch).

References

[1] J. A. Blake, J. E. Richardson, C. J. Bult, J. A. Kadin, and J. T. Eppig, "MGD: the Mouse Genome Database," Nucleic Acids Res, vol. 31, pp. 193-5, 2003.
[2] T. Ideker, V. Thorsson, J. A. Ranish, R. Christmas, J. Buhler, J. K. Eng, R. Bumgarner, D. R. Goodlett, R. Aebersold, and L. Hood, "Integrated genomic and proteomic analyses of a systematically perturbed metabolic network," Science, vol. 292, pp. 929-34, 2001.
[3] T. Andreasen, A. Motro, H. Christiansen, and H. L. Larsen (eds.), Flexible Query Answering Systems: 5th International Conference, FQAS 2002, Copenhagen, Denmark, 2002, http://www.fqas2002.org/.
[4] Z. Lacroix and T. Critchlow, Bioinformatics: Managing Scientific Data. San Francisco, CA: Morgan Kaufmann Publishers, 2003.
[5] M. Musen, M. Crubézy, R. Fergerson, N. F. Noy, S. Tu, and J. Vendetti, Protégé-2000. Stanford, CA: Stanford Medical Informatics, http://protege.stanford.edu/.
[6] N. Maniatis, A. Collins, J. Gibson, W. Zhang, W. Tapper, and N. E. Morton, "Positional cloning by linkage disequilibrium," Am J Hum Genet, vol. 74, pp. 846-55, 2004.
[7] R. J. Lipshutz, S. P. Fodor, T. R. Gingeras, and D. J. Lockhart, "High density synthetic oligonucleotide arrays," Nat Genet, vol. 21, pp. 20-4, 1999.
[8] S. Gygi, B. Rist, S. Gerber, F. Turecek, M. Gelb, and R. Aebersold, "Quantitative analysis of complex protein mixtures using isotope-coded affinity tags," Nature Biotechnol, vol. 17, pp. 994-9, 1999.
[9] Online Mendelian Inheritance in Man, OMIM™. Bethesda, MD: McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine, 2000, http://www.ncbi.nlm.nih.gov/omim/.
[10] K. D. Pruitt and D. R. Maglott, "RefSeq and LocusLink: NCBI gene-centered resources," Nucleic Acids Research, vol. 29, pp. 137-140, 2001.
[11] J. D. Ullman, "Information Integration Using Logical Views," presented at the 6th International Conference on Database Theory, Delphi, Greece, 1997.
[12] Entrez search and retrieval system. Bethesda, MD: National Center for Biotechnology Information, National Library of Medicine, http://www.ncbi.nlm.nih.gov/Entrez/.
[13] GeneTests. Seattle, WA: University of Washington, 1993, http://www.genetests.org/.
[14] J. L. Sussman, D. Lin, J. Jiang, N. O. Manning, J. Prilusky, O. Ritter, and E. E. Abola, "Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules," Acta Crystallogr D Biol Crystallogr, vol. 54, pp. 1078-84, 1998.
[15] R. Shaker, P. Mork, M. Barclay, and P. Tarczy-Hornoch, "A Rule Driven Bi-Directional Translation System Remapping Queries and Result Sets Between a Mediated Schema and Heterogeneous Data Sources," Proc AMIA Symp, San Antonio, TX, 2002.
[16] S. Y. Chung and L. Wong, "Kleisli: a new tool for data integration in biology," Trends in Biotechnology, vol. 17, pp. 351-355, 1999.
[17] G. Wiederhold, "Intelligent Integration of Information," presented at the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, 1993.
[18] P. G. Baker, A. Brass, S. Bechhofer, C. A. Goble, N. Paton, and R. Stevens, "TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources," presented at the Sixth International Conference on Intelligent Systems for Molecular Biology, Montréal, Québec, Canada, 1998.
[19] Z. Lacroix, L. Raschid, and M.-E. Vidal, "Efficient Techniques to Explore and Rank Paths in Life Science Data Sources," presented at the International Workshop on Data Integration in the Life Sciences, Leipzig, Germany, 2004.
[20] Z. Lacroix, H. Murthy, F. Naumann, and L. Raschid, "Links and Paths through Life Science Data Sources," presented at the International Workshop on Data Integration in the Life Sciences, Leipzig, Germany, 2004.
[21] I.-M. A. Chen and V. Markowitz, "An Overview of the Object Protocol Model (OPM) and the OPM Data Management Tools," Information Systems, vol. 20, pp. 393-418, 1995.
[22] D. L. Rubin, M. Hewett, D. E. Oliver, T. E. Klein, and R. B. Altman, "Automating data acquisition into ontologies from pharmacogenetics relational data sources using declarative object definitions and XML," Pac Symp Biocomput, pp. 88-99, 2002.
[23] GeneX™ Gene Expression. Santa Fe, NM: National Center for Genome Resources, http://www.ncgr.org/genex/.
[24] G. J. Kemp, N. Angelopoulos, and P. M. Gray, "Architecture of a mediator for a bioinformatics database federation," IEEE Trans Inf Technol Biomed, vol. 6, pp. 116-22, 2002.
[25] L. D. Stein and J. Thierry-Mieg, "Scriptable access to the Caenorhabditis elegans genome sequence and other ACEDB databases," Genome Res, vol. 8, pp. 1308-15, 1998.
[26] P. Mork, A. Y. Halevy, and P. Tarczy-Hornoch, "A Model for Data Integration Systems of Biomedical Data Applied to Online Genetic Databases," Proc AMIA Symp, Washington, DC, 2001.
[27] P. Mork, R. Shaker, A. Y. Halevy, and P. Tarczy-Hornoch, "PQL: A Declarative Query Language over Dynamic Biological Schemata," Proc AMIA Symp, San Antonio, TX, 2002.
[28] Z. G. Ives, M. Friedman, D. Florescu, A. Y. Levy, and D. S. Weld, "An Adaptive Query Execution System for Data Integration," presented at the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999.
[29] P. Bothner, Qexo, 2004, http://www.qexo.org.
[30] H. Mei, P. Tarczy-Hornoch, P. Mork, A. J. Rossini, R. Shaker, and L. Donelson, "Expression array annotation using the BioMediator biologic data integration system and the Bioconductor analytic platform," Proc AMIA Symp, Washington, DC, 2003.