Vol. 17 no. 7 2001 Pages 587–601
BIOINFORMATICS
Extending traditional query-based integration approaches for functional characterization of post-genomic data
Barbara A. Eckman 1,*, Anthony S. Kosky 2 and Leonardo A. Laroco, Jr 1
1 Department of Bioinformatics, GlaxoSmithKline, King of Prussia, PA, USA and 2 Data Management Systems, Gene Logic Inc., Berkeley, CA, USA
Received on December 23, 2000; revised on February 28, 2001; accepted on March 6, 2001
ABSTRACT
Motivation: To identify and characterize regions of functional interest in genomic sequence requires full, flexible query access to an integrated, up-to-date view of all related information, irrespective of where it is stored (within an organization or across the Internet) and its format (traditional database, flat file, web site, results of runtime analysis). Wide-ranging multi-source queries often return unmanageably large result sets, requiring non-traditional approaches to exclude extraneous data.
Results: Target Informatics Net (TINet) is a readily extensible data integration system developed at GlaxoSmithKline (GSK), based on the Object-Protocol Model (OPM) multidatabase middleware system of Gene Logic Inc. Data sources currently integrated include: the Mouse Genome Database (MGD) and Gene Expression Database (GXD), GenBank, SwissProt, PubMed, GeneCards, the results of runtime BLAST and PROSITE searches, and GSK proprietary relational databases. Special-purpose class methods used to filter and augment query results include regular expression pattern-matching over BLAST HSP alignments and retrieving partial sequences derived from primary structure annotations. All data sources and methods are accessible through an SQL-like query language or a GUI, so that when new investigations arise no additional programming beyond query specification is required.
The power and flexibility of this approach are illustrated in such integrated queries as: (1) ‘find homologs in genomic sequence to all novel genes cloned and reported in the scientific literature within the past three months that are linked to the MeSH term ‘neoplasms”; (2) ‘using a neuropeptide precursor query sequence, return only HSPs where the target genomic sequences conserve the G[KR][KR] motif at the appropriate points in the HSP alignment’; and (3) ‘of the human genomic sequences annotated with exon boundaries in GenBank, return only those with valid putative donor/acceptor sites and start/stop codons’.
Availability: Freely available to non-profit educational and research institutions. Usage by commercial entities requires a license agreement.
Contact: barbara [email protected]
*To whom correspondence should be addressed.
© Oxford University Press 2001
INTRODUCTION
The case for database integration
As we enter the post-genomic era, the sheer volume of data and the number of techniques available for use in the identification and characterization of regions of functional interest in genomic sequence are increasing too fast to be managed by traditional methods. Investigators must deal with the enormous influx of genomic sequence data from human and other organisms. The results of analysis applications such as BLAST (Altschul et al., 1990), PROSITE (Hofmann et al., 1999), and GeneWise (Birney and Durbin, 2000) must be integrated with a large variety of sequence annotations found in data sources such as GenBank (Benson et al., 2000), SwissProt (Bairoch and Apweiler, 2000), PubMed (Wheeler et al., 2000) and GeneCards (Rebhan et al., 1998). Public and private repositories of experimental results such as The Jackson Laboratory’s Gene Expression Database (GXD) (Ringwald et al., 2000) must also be integrated. To derive the greatest advantage from these data requires full query-based access to all the most up-to-date information available, with the flexibility to customize queries easily to meet the needs of a variety of individual investigators and gene families. Wide-ranging multi-source queries, particularly those containing joins on the results of BLAST searches, often return unmanageably large result sets, requiring non-traditional methods to identify and exclude extraneous data. Such filtering often requires more sophisticated conditions than the SQL query language can express, for example, regular expression pattern matching.
Obstacles to integration
A commonly noted obstacle for integration efforts is that relevant information is widely distributed, both across the Internet and within individual organizations, and is found in a variety of storage formats, both traditional relational databases and non-traditional data sources (e.g. Internet web sites and the results of gene-finding applications or homology searches). To add to the difficulty, many of the most interesting data sources are not easily queried, nor are their results easily parsed: for example, annotations in flat-file sequence databases such as GenBank and SwissProt, web sites such as GeneCards, and alignments resulting from BLAST searches. Many data sources do not represent biological objects optimally for the kinds of queries that investigators typically want to pose. For example, GenBank is sequence-centric, not gene-centric, and SwissProt is sequence-centric, not domain-centric. Further, biological data sources often differ in their representation of key concepts: for GenBank a gene is an annotation on a sequence, while for MGD a gene is a locus which confers phenotype. Finally, integrated queries must operate on the most up-to-date versions of the data sources in order to avoid being ‘scooped’ on important discoveries such as characterizing a novel G-protein coupled receptor in genomic sequence. But keeping local copies of all relevant data sources current on a daily basis is a monumental task.
The TINet system
The Target Informatics Net (TINet) system represents our response to the compelling need for database integration in the face of these obstacles. Its goal is not to replace human interpreters, but rather to make data interpretation easier and faster for human experts through an integrated, queryable view of genomic sequence and associated data.
Strategies for database integration.
Our integration approach follows primarily the federated model (Heimbigner and McLeod, 1985; Alonso et al., 1987; Sheth and Larson, 1990), supplemented by limited, judicious use of the data warehouse or ‘instantiated’ approach (Davidson et al., 1995). Currently we focus primarily on overcoming heterogeneity of form (syntactic heterogeneity) among data sources rather than heterogeneity of meaning (semantic heterogeneity). In the Discussion we describe TAMBIS, a noteworthy example of an integration approach focusing primarily on semantic heterogeneity. In our use of the instantiated approach, local copies of data sources of interest are maintained and updated nightly, primarily to enable us to add scientific value by reorganizing the data to better fit anticipated queries or by precomputing summary statistics. Examples are precomputing GenBank feature counts and reorganizing SwissProt domains to be first-class objects with their own
Fig. 1. Using a middleware approach to database integration, data from a variety of sources, remote and local, in different storage formats, is presented to the user as the result of a single query. (Sources depicted include MGD, GXD, GenBank, Swiss-Prot, OMIM, PubMed, GeneCards, and the GSK Expression, Chemokine and GPCR databases, stored as relational databases, flat files, web sites, and applications such as BLAST and PROSITE searches.)
features, to enable fast execution of queries like ‘return all mammalian sequences with more than 20 exons’ or ‘return all G-protein coupled receptors whose N-terminal segment is between 275 and 400 amino acids long and contains at least two glycosylation sites (CARBOHYD features)’. In addition, this approach may be the preferred strategy where very fast access at runtime is required. A disadvantage of a purely relational approach to data warehousing, however, is that SQL has limited expressive power for filtering out extraneous results, as noted above. In our use of the federated strategy, data sources are not transformed or loaded into a single storage format, nor necessarily even mirrored internally at GSK, but rather are accessed in their native formats as required, thereby ensuring that only the most current view of the data is provided. This approach was chosen as our primary strategy because it is cost-effective, scaleable and flexible. If remote data sources are not mirrored locally, the cost of maintaining each remote database view is significantly reduced, many such views may be simultaneously maintained, and replacing one data source with another is simply a matter of plugging in a different view. In TINet we use a middleware layer to provide access to all data sources of interest. In this context middleware may be defined as software that allows multiple heterogeneous data sources to be queried as if they were components of a single large database (Figure 1). In the terms of Davidson et al. (1995), our method is a ‘tightly integrated/view’ approach, which is acknowledged to be the most flexible methodology within the space of integration approaches.
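The payoff of the instantiated approach can be sketched in miniature. The table, column names, and data below are invented for illustration, and SQLite stands in for the actual Sybase/Oracle warehouse; the point is only that a precomputed summary statistic turns a query like ‘more than 20 exons’ into a cheap column comparison:

```python
import sqlite3

# Hypothetical warehouse schema: each sequence entry carries a
# precomputed exon_count column, so queries need not count raw
# feature annotations at runtime.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sequence_entry (
                    accession  TEXT PRIMARY KEY,
                    organism   TEXT,
                    exon_count INTEGER)""")
conn.executemany(
    "INSERT INTO sequence_entry VALUES (?, ?, ?)",
    [("AB000001", "Homo sapiens", 27),
     ("AB000002", "Mus musculus", 24),
     ("AB000003", "Homo sapiens", 4)])

# 'Return all sequences with more than 20 exons' becomes a simple
# comparison against the precomputed statistic.
rows = conn.execute("SELECT accession FROM sequence_entry "
                    "WHERE exon_count > 20 ORDER BY accession").fetchall()
```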
Another common integration strategy in the bioinformatics domain is the use of hard-coded Perl ‘glue’ scripts to implement cross-data-source joins, comparisons and aggregations (count, average, maximum, minimum, etc.) over subsets of data. Even if existing code is reused, this approach requires substantial programming every time a new question arises, to retrieve the data, implement the desired comparisons among data sources, and compute the values of the aggregate functions. Using our approach, on the other hand, only a new declarative query specification is required. Nevertheless, hard-coding data acquisition and filtering using special-purpose Perl scripts does permit extensive tuning for specific applications, potentially yielding very fast response times; this approach may be the preferred strategy when only a very limited set of queries involving very large datasets is required.
Beyond browsing to one-step querying
We have claimed that full query-based access to data sources is required in order to derive the greatest advantage from the deluge of genomic sequence and related information. To make this case it is necessary to consider the difference between querying and browsing. Currently the field of bioinformatics abounds with applications that permit browsing over multiple data sources. Examples are Entrez (Wheeler et al., 2000), DoubleTwist.com (DoubleTwist Inc., 2000), the ExPASy Molecular Biology Server (Swiss Institute of Bioinformatics, 2000), and many in-house applications, e.g. the Merck Gene Index Browser (Eckman et al., 1998). On the other hand, only a very few applications besides ours currently permit querying over multiple heterogeneous data sources: SRS (Carter et al., 1999), Kleisli (Davidson et al., 1999; Wong, 2000), TAMBIS (Baker et al., 1998), and Discovery Link (Haas et al., 2000a). Comparisons between these applications and our approach will be presented in the Discussion. The relationship between browsing and querying is similar to the relationship in library research between browsing the stacks and online search. Both are valid approaches with distinct advantages. Browsing, like freely wandering in the stacks, permits relatively undirected exploration. It involves a great deal of ‘leg work’, but it is the method of choice when one wants to explore the domain of interest to help sharpen one’s focus. It is also well suited to retrieval of a web page by its identifier, or a book by its call number. On the other hand, querying, like online search, permits the formulation of a complex search request as a single statement, and thus makes very efficient use of human time. A query specifies ‘what I want’, as opposed to ‘how to get it’, and its results are returned as a single collated set. Querying is the method of choice when one’s interests are already focused, especially if aggregations over subsets of data are involved.
Motivating example: select kinase cDNAs for expression studies
As a motivating example, consider an investigator who wants to select a non-redundant set of kinase cDNA sequences for differential expression studies. The investigator plans to use the GeneCards web site to find a non-redundant set of kinase genes, and GenBank to find the longest cDNA sequence associated with each gene. In English, the researcher wants the following: ‘return the HUGO name, length, GenBank accession, and sequence of the longest full-length cDNA sequence related to each GeneCards entry that has been annotated as a kinase’.
The browsing approach. In a browsing approach to this question ‘kinase’ is entered into the GeneCards text search form, which returns a summary page listing approximately 670 GeneCards. The user then visits each of the 670 GeneCards pages (Figure 2) and follows the GenBank hyperlink for each of the related nucleotide sequences, filtering out non-mRNA entries, comparing lengths, and finally noting the identifier, length and sequence of the longest mRNA entry. Assuming an average of five nucleotide sequences for each of the 670 GeneCards, the browsing approach requires 4020 web page visits. Visiting each of these pages and comparing and manually collating the results is very tedious and prone to error, a poor use of human time and effort.
The querying approach. With TINet’s query capability, on the other hand, a short SQL-like query is written, typically in 10–15 min, and sent off for execution. About an hour later the result is returned, containing 1 row for each of the 670 kinase genes (Figure 3). The primary reason that TINet is able to extract relevant data items from the GeneCards output to join with GenBank in this query is that a structured object view has previously been built over the GeneCards web page (Figure 4).
The OPM data model in which the view is represented and how the OPM system enables queries across multiple data sources will be summarized in the next section.
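The core of the querying approach is a per-gene aggregation: filter to mRNA entries, then keep the longest per gene. A plain-Python sketch of that logic follows; the entry list is abbreviated and its lengths partly invented for illustration:

```python
# Sketch of the aggregation the kinase query performs: for each gene,
# keep only the longest associated mRNA sequence. Entries below are
# abbreviated/invented illustration data, not real query output.
sequences = [  # (hugo_name, accession, topology, seq_length)
    ("JAK3", "U09607", "mRNA", 4064),
    ("JAK3", "U31317", "mRNA", 3702),
    ("JAK3", "AC007201", "genomic", 146000),  # filtered out: not mRNA
    ("JAK1", "M64174", "mRNA", 3541),
]

longest = {}
for gene, acc, topology, length in sequences:
    if topology != "mRNA":          # keep only mRNA entries
        continue
    if gene not in longest or length > longest[gene][1]:
        longest[gene] = (acc, length)
```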
SYSTEM AND METHODS
Data sources integrated
The TINet system integrates a variety of data sources in a variety of formats, both local and remote, traditional and non-traditional. Relational databases include The Jackson Laboratory’s Mouse Genome Database (MGD) and Gene Expression Database (GXD) in Sybase, as well as GlaxoSmithKline proprietary databases in Sybase and Oracle. Flat-file data sources currently include the GenBank nucleotide database and the SwissProt protein database. The results of runtime sequence analysis applications are also viewed as data sources and are fully integrated into the system: Washington University’s WU-BLAST2 (Gish, 2000), NCBI’s BLAST version 2 (Altschul et al., 1997), and the GCG motifs program (Genetics Computer Group (GCG), 1998) for performing searches against the PROSITE database of protein sequence motifs. Support for NCBI’s PSI-BLAST (Altschul et al., 1997) is currently in development. Finally, web resources are also viewed as data sources; currently PubMed and GeneCards have been integrated.
Fig. 2. A portion of a GeneCards entry, showing related nucleotide sequences. (The JAK3 GeneCard shown includes SWISS-PROT protein annotations and the UniGene cluster Hs.99877, whose gene/cDNA accessions include AC007201, U08340, U09607, U31317, U31601, U57096 and U70065.)
The SmithKline Beecham–Gene Logic collaboration
Underlying TINet are the Object-Protocol Model (OPM) data model, query translation tools, and Multidatabase Query System (MQS) developed by Victor Markowitz’s group, originally at the Lawrence Berkeley Laboratory and later at Gene Logic Inc. (Markowitz et al., 1999). At the start, the Multidatabase Query System included Sybase, Oracle and SRS 5.0 connectivity. In the fall of 1998 SmithKline Beecham (SB, now GlaxoSmithKline)
select hugo_name = gc.hugo_name,
       seq_length = gs2.seq_length,
       longest_acc = gs2.primary_accession
from   gc in GENECARDS:Genecard,
       gc_accs = gc.unigene_nucleic_acids.cluster.cdna_accessions,
       maxlen = max (select gs.seq_length
                     from gs in genbank:Seq
                     where gs.topology = "mRNA"
                       and gs.accessions in gc_accs),
       gs2 in genbank:Seq
where  gc.text_search match "kinase"
  and  gs2.accessions in gc_accs
  and  gs2.topology = "mRNA"
  and  gs2.seq_length = maxlen;

HUGO Name  Sequence Length  GenBank Accession  GenBank Sequence
JAK1       3541             M64174             tccagtttgcttcttgg...
JAK3       4064             U09607             ccctctgaccaggac...
MAP3K4     5445             AF002715           aagatggccgcggc...
MAP4K2     2906             U07349             gctccggcccgccc...
MAP4K5     3000             U77129             ggcgccgacccatg...
etc.

Fig. 3. Selecting kinase cDNAs for expression studies: query and result.
and Gene Logic (GL) embarked on a 1.5-year collaboration to develop and extend the capabilities of MQS to meet the integration needs of SB bioinformatics and to add value for scientific investigation. The deliverables of the collaboration included four new CORBA services: (1) a BLAST server with accompanying BLAST-cache server, (2) a generic web site server with parsers for PubMed and GeneCards; (3) a generic flat-file server with parsers for GenBank and SwissProt and generic schemadriven loaders for major releases and nightly updates; and (4) Application-Specific Data Type (ASDT) servers which define and execute OPM class methods. In the course of the collaboration these methods were pushed much further than was originally envisioned (Topaloglou et al., 1999), to perform sophisticated filtering and analysis of results within declarative queries that go far beyond what can be accomplished with ordinary SQL predicates. Examples will be given in the Results.
Technical summary of the OPM Multidatabase Query System (MQS)
OPM (Chen and Markowitz, 1995) is an object-oriented data model similar to the ODMG standard for object-oriented databases (Cattell, 1996), with extensions for modeling scientific processes and experiments. OPM provides a uniform model for representing the structure of complex, heterogeneous data sources and for expressing queries against them.
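As one concrete (and purely illustrative) picture of such a model, the sketch below uses hypothetical Python dataclasses, not OPM's actual syntax, to show an object class with an oid, a tuple-valued attribute, set- and list-valued attributes, and a derived attribute defined by an aggregate rule:

```python
from dataclasses import dataclass, field

@dataclass
class Name:            # a tuple attribute, e.g. name: (first, last)
    first: str
    last: str

@dataclass
class SeqObject:
    oid: int                                          # object identifier
    name: Name                                        # tuple-valued
    accessions: set = field(default_factory=set)      # set-valued
    exon_lengths: list = field(default_factory=list)  # list-valued

    @property
    def total_exon_length(self) -> int:
        # a derived attribute defined by an aggregate derivation rule
        return sum(self.exon_lengths)

s = SeqObject(oid=1, name=Name("JAK3", "HUMAN"),
              accessions={"U09607", "U31317"},
              exon_lengths=[120, 85, 240])
```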
Data in OPM are represented by objects. Objects are uniquely identified by object identifiers (oids), qualified by attributes, and classified into classes. Classes are organized in subclass–superclass hierarchies. Attributes may be simple or consist of a tuple of simple attributes, such as name: (first, last). Attributes can be single-valued, set-valued or list-valued and can be required to have non-null values. An attribute’s value class may be a system-provided data type or a controlled-value class of enumerated values or ranges, or an attribute may take values from an object class or a union of object classes. In addition OPM supports the specification of derived attributes using derivation rules involving arithmetic and aggregate operations or compositions of attributes and inverse attributes.
OPM is supported by a suite of Data Management tools, fully described in Markowitz et al. (1999). Database Query tools support querying and exploring databases via OPM views using a high-level query language or Web-based graphical interfaces. Retrofitting tools provide facilities for constructing OPM views for relational and flat-file databases. OPM has also been extended with Application Specific Data Types (ASDTs) which were designed to model complex, multimedia data types such as DNA sequences, protein structures, genetic maps and gel images. ASDTs allow methods to be associated with data represented either using a primitive data type or external files.
Fig. 4. A portion of the GeneCards OPM schema.
Methods are implemented by means of ASDT servers, which conform to a simple CORBA interface. ASDT servers may act as wrappers for external programs or scripts, or may implement functions themselves. The OPM MQS allows for querying and exploring multiple heterogeneous data sources that have OPM views. Queries against the MQS are expressed in the OPM multidatabase query language, OPM-MQL. Queries can be submitted to MQS using either a command-line interface, a C++ API, or Web-based graphical query tools. MQS uses a client–server architecture that supports multiple Database Management Systems (DBMSs) and data sources, and allows a multidatabase system to be dynamically reconfigured with new databases, DBMSs and ASDTs. Servers for each DBMS in the system handle database-specific query optimizations and evaluate single-database queries.
The main components of an OPM MQS are shown in Figure 5.
• A central Multidatabase Query Processor takes OPM-MQL queries and forms and evaluates a query plan. This plan involves splitting the query into single-database sub-queries together with expressions for combining the sub-query results. The Multidatabase Query Processor can perform post-processing of sub-query results in order to compensate for differences in the query facilities supported by different data sources and DBMSs.
• OPM Database Servers provide database-specific functions used in the translation and optimization of OPM-MQL queries, evaluate single-database queries, and return query results as OPM data structures.
Fig. 5. The OPM multidatabase system architecture underlying TINet.
• OPM ASDT Servers perform methods for OPM ASDTs and return method results as OPM data structures.
• The Multidatabase Directory stores information about the data sources, DBMS servers and ASDTs involved in a federation and the inter-database links representing known connections between data sources.
As part of the SmithKline Beecham–Gene Logic collaboration, four types of database server were provided:
• Relational database servers, which access Sybase or Oracle databases through OPM views.
• Flat-file/XML servers, which allow access to flat-file databases parsed into XML, using a relational database for indexing purposes.
• Web servers, allowing remote databases to be accessed through web-based interfaces. MQS query conditions are mapped into CGI call parameters, and the resulting web pages are translated into OPM data using custom Perl scripts.
• BLAST servers, allowing access to WU-BLAST or similar applications. Query conditions are mapped to BLAST-call parameters and the results of BLAST searches are parsed to form complex nested OPM objects. Because BLAST searches are costly, their results are cached for a user-specified amount of time, so that they can be re-used in subsequent queries. A separate BLAST-cache server keeps track of BLAST calls and associated output files. This approach allows multiple BLAST servers to be run concurrently and to share the same cache. The BLAST-cache server is also responsible for deleting BLAST output files once they become outdated.
In addition servers for two ASDTs were supplied:
• Blast-alignment ASDT, which provided a number of specialized functions on BLAST alignments such as defining and extracting sub-alignments and fixed-length pattern matching on the query and target sequences in the alignment and the string encoding perfect matches, similar residues, and mismatches.
• GenBank Feature ASDT, which contained a stub for a method, to be developed by SB, to compute the nucleic acid sequence corresponding to a GenBank Feature location. This server was implemented as a C++ wrapper
for a Perl script. Since GenBank Feature locations may be compound structures involving references to other GenBank entries, this script may involve further GenBank accesses when necessary.
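The core idea of the GenBank Feature ASDT method can be sketched as follows. This is an illustrative simplification, not the SB implementation: it handles only plain ranges and join(...) locations, whereas real Feature Table locations may also use complement(), fuzzy bounds, and references to other entries:

```python
import re

def parse_location(loc: str):
    """Return the 1-based (start, end) intervals in a location string.

    Handles only plain ranges ('10..20') and join(...) of ranges;
    complement(), fuzzy bounds and cross-entry references -- all legal
    in the GenBank Feature Table -- are out of scope for this sketch.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", loc)]

def splice(seq: str, loc: str) -> str:
    """Concatenate the subsequences named by a location string,
    e.g. splicing exon ranges into a putative coding sequence."""
    return "".join(seq[start - 1:end] for start, end in parse_location(loc))
```

For example, `splice("aaaTTTcccGGGaaa", "join(4..6,10..12)")` joins the two exon ranges into "TTTGGG".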
Further development at GlaxoSmithKline
One of OPM’s greatest strengths as a development environment is the ease with which new data sources and class methods may be added dynamically to the system by multiple developers working in parallel. Using the OPM Software Developer’s Kit (SDK), GSK has engaged in its own development effort to enhance the TINet system while Gene Logic continues to provide enhancements to the OPM core system. Working within the OPM paradigm of using a standard query language over new data sources and database servers, we have enabled our users to take advantage of new and highly flexible additions to the system, without needing to learn a new user interface or command language for each round of enhancements. The pattern matching capacity of the BLAST-alignment ASDT server was supplemented by the incorporation of the regular expression C library to provide full regular expression-based motif finding methods. Currently implemented methods count the number of occurrences of a motif within a sequence, return a list of the occurrences, and provide an easy-to-read output indicating the position of the occurrences. The sample query in the Results for identifying potential novel neuropeptides uses these methods to formulate its selection criteria. Building on the location attribute in the GenBank Feature Table (DNA Data Bank of Japan et al., 1999), which is defined by basepair offsets into an entry’s nucleotide sequence, we have implemented biologically useful operations on these sequences via the GenBank Feature ASDT methods: splicing exons into putative coding sequences, extracting the collection of putative 2 bp donor and 2 bp acceptor splice sites based on exon boundary annotations, and extracting sequence windows around the 5′ and 3′ ends of the coding sequence, such as start and stop codons.
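The three motif-finding methods just described (count, list, and position display) can be sketched with Python's regular expression library; the function names here are hypothetical, not the ASDT server's actual interface, and the precursor sequence is invented:

```python
import re

def motif_count(seq: str, pattern: str) -> int:
    """Number of non-overlapping occurrences of a motif."""
    return len(re.findall(pattern, seq))

def motif_occurrences(seq: str, pattern: str):
    """List of (0-based position, matched text) pairs."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, seq)]

def motif_display(seq: str, pattern: str) -> str:
    """Easy-to-read output: the sequence over a '^' marker line."""
    marks = [" "] * len(seq)
    for m in re.finditer(pattern, seq):
        for i in range(m.start(), m.end()):
            marks[i] = "^"
    return seq + "\n" + "".join(marks).rstrip()

# invented precursor sequence; G[KR][KR] is the dibasic cleavage
# motif from the neuropeptide example in the Results
precursor = "MDSKGRRAQLWGKKSPAFGKR"
```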
The sample query in the Results on putative primary structure in genomic sequence uses these methods and regular expression methods to formulate its selection criteria. ASDT methods provide an easy means of adding procedural processing over existing OPM data sources to the OPM-MQL declarative query language. Building new OPM servers using Gene Logic’s SDK enables new data source types to be added to the system. Currently in beta release is a server for the GCG motifs program for performing searches against the PROSITE database of protein sequence motifs. We plan to extend this server to include the option to perform searches against Pfam (Sonnhammer et al., 1997) and PRINTS (Attwood et al., 1994) as well. Like the BLAST server, such new
application servers will allow standardized query access to the search results as well as joins between the searches and other disparate data sources.
Hardware and software
For the performance results reported in the next section, the core TINet system was deployed on a dedicated machine: 2 CPU UltraSPARC-II at 296 MHz, 1.5 GB RAM, Sun Enterprise 450/SunOS release 5.6. This configuration supports: all the OPM core services (CORBA servers, MQS clients); user interfaces to the MQS client (generic OPM GUI query interface, a Perl/shell CGI user interface); an Apache v1.3.1 HTTP (web) server (Apache Software Foundation, 1995); the Visigenic VisiBroker ORB (Inprise Corporation, 1996); and the Sybase SQL server (Sybase Inc., 1997). OPM development for the SunOS environment utilizes the Sun WorkShop Compiler C++ 5.0, the Visigenic VisiBroker C++ 3.3 libraries, the Sybase Adaptive Server Enterprise 11.5.1 libraries, and Perl 5.005 (Wall et al., 1996). Based on a CORBA distributed architecture (Mowbray and Zahavi, 1995), the TINet system may be flexibly configured to run over multiple machines to accommodate increased user demands in the future. Back-end TINet BLAST searches are run on a dedicated BLAST server that supports all user applications deployed by our large Bioinformatics department: a 14 CPU alpha EV6 at 524 MHz, 4 GB RAM, Digital UNIX Alpha/OSF1 v4.0 running the WU-BLAST 2.0 and NCBI BLAST 2.0 engines.
RESULTS
A database integration system is only worthwhile in a bioinformatics context if it can support scientifically meaningful and useful queries. Since the delivery of the extended OPM software in early 2000, the TINet system has been used for a variety of applications and continues to be expanded and enhanced based on our customers’ needs. In this section we present a sampling of queries which represents the range of needs that TINet currently meets, and serves to illustrate the power and flexibility of the OPM software to answer biologically meaningful questions. All of the queries are short (1/2 to 1 page with many newlines) and most took 10–15 min to write.
TINet queries vary widely in the number of data sources accessed, the complexity of the filtering performed, and the volume of results returned. Typical query execution times on the specified hardware platform range from 1 s to 16 h, with multiple BLAST searches, particularly tblastn searches, representing the rate-limiting step for the longer running queries. All sample queries summarized here were benchmarked on the hardware and software configurations described in the System and methods. All timings are presented as the mean (± the standard deviation) over 10 iterations.
Performance was strikingly consistent, with a maximum standard deviation of 12% of the mean across all queries.
Basic sample queries
Find ESTs that may correspond to interesting neurological targets. ‘Return accession numbers and definitions of human EST sequences that are similar (60% identical over 50 AA) to calcium channel sequences in SwissProt that have references published since 1995 with ‘brain’ in the abstract’. This query accesses SwissProt, PubMed, and BLAST (tblastn vs gbest), and returned 539 HSPs from 538 unique ESTs in 11 min, 58 s (±46.5 s), plus BLAST execution time (about 15.5 h). Twenty tblastn searches, one for each unique SwissProt sequence, were performed serially. This is an excellent example of a query whose performance would benefit from the addition of parallelization, and we are currently determining how best to accomplish this in the context of the system as a whole.
Mine recent literature for serotonin-related expression data linked to GenBank sequences. ‘Return the source, abstract and GenBank accession number for all articles published in 2000 that have a MeSH term ‘gene expression’, mention ‘serotonin’, and are linked to nucleotide sequences in GenBank’. This query returned five articles related to six sequences in 10 s (±0.7 s). Although it accesses only PubMed, this query is notable because it cannot be performed using the NCBI web browser interface. To the best of our knowledge, that tool does not currently permit the user to limit the results only to articles that are linked to GenBank sequences. This filtering is possible in the TINet system due to the OPM query processor’s ability to compensate through post-processing for limitations in the native data source’s query capability. Queries like this are useful in automated, periodic ‘alerts’ of literature of interest to investigators.
Find full-length coding sequences in GenBank. ‘Return the accession number and definition of sequences annotated as channels for which the CDS start and stop locations are specified unambiguously’.
Like the preceding example, this query performs filtering that is currently impossible using the NCBI Entrez interface. The OPM-MQL MATCH operator (similar to SQL LIKE) is used to exclude entries whose location specifications contain ‘>’, ‘
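The filtering idea behind this query can be sketched outside OPM-MQL. GenBank uses ‘<’ and ‘>’ in location strings to flag CDS ends that extend beyond the sequenced region; a CDS is ‘fully specified’ only when neither marker is present. The entry data below are invented for illustration:

```python
def cds_is_unambiguous(location: str) -> bool:
    """True if a CDS location carries no fuzzy-boundary markers.

    '<' marks a 5' end and '>' a 3' end that lie outside the
    sequenced region, i.e. an incompletely specified start/stop.
    """
    return "<" not in location and ">" not in location

# invented (accession, CDS location) pairs
entries = [("U09607", "1..3375"),    # start and stop fully specified
           ("X00001", "<1..1200"),   # 5' end not determined
           ("X00002", "55..>2100")]  # 3' end not determined

full_length = [acc for acc, loc in entries if cds_is_unambiguous(loc)]
```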