Similar_Join: Extending DBMS with a Bio-specific Operator

Jake Yue Chen
Myriad Proteomics, Inc. 2150 West Dauntless Ave. Salt Lake City, UT 841116, USA Tel: (01) 801-303-1722
[email protected]

John V. Carlis
Department of Computer Science and Engineering University of Minnesota Minneapolis, MN 55455, USA Tel: (01) 612-625-6092
[email protected]

ABSTRACT
Existing sequence comparison software applications lack adequate automation, abstraction, performance, and flexibility. Users need a new way of studying and applying sequence comparisons in the post-genomics era. We invented and developed a new bio-specific Database Management System (DBMS) operator, Similar_Join, to abstract the labor-intensive batch sequence similarity search task into a syntactically concise database operation. We implemented the Similar_Join operator as part of a relational operator package. This implementation enabled us to write simple PL/SQL scripts within the DBMS to accomplish routine sequence similarity searches conveniently, for example, a “batch BLAST” that compares 7,000 human genes against 500,000 human Expressed Sequence Tags (EST) in a few hours. We also implemented a simple version of Similar_Join as a database operator in the extended data cartridge of Oracle 8i object-relational DBMS. We demonstrated that this operator, when fully integrated into SQL language extensions, could enable biology users to pose interesting complex biological queries that were previously impossible inside the DBMS.

Keywords
Database Management System (DBMS), Genomic DBMS Extension, Similarity Search, Relational Operator, Similar_Join Operator

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. The Eighteenth Annual ACM Symposium on Applied Computing ’03, March 9-12, 2003, Melbourne, Florida. Copyright 2002 ACM 1-58113-000-0/00/0000…$5.00.

1. INTRODUCTION
Sequence comparison has become a routine and essential bioinformatics task. There are three basic and widely used sequence comparison algorithms, Smith-Waterman [26], BLAST [1], and FASTA [22]. Computer programs employing these algorithms allow biologists to start with one or more nucleotide or protein sequences (called “query sequences”), search through a huge database of sequences (called “target sequences”), fish out a set of highly related (called “matched” or “hit”) sequence pairs (called “similarity search results”), and, often additionally, obtain detailed pair-wise sequence alignment reports (called “sequence alignment results”). Sequence comparison techniques have enabled biologists to detect sequence homology relationships between newly discovered and already stored sequences [21], annotate ESTs [23], construct homologous sequence libraries [28], predict protein functions [11], identify protein fold regions [18], discover tandem repeats [27], and even retrieve text from the biological literature [19].

In the past decade, sequence comparison research evolved along two development themes, algorithm development and application software development. The first development theme, sequence comparison algorithm development, concentrated on improving the sensitive detection of remote sequence homology relationships. For example, Bucher and Hofmann improved the sensitivity of the native Smith-Waterman algorithm by an order of magnitude by assessing sequence relationships using log likelihood ratios [4]. In SAM-T98, Karplus et al. improved the detection of remote homologous sequences by iteratively building a Hidden Markov Model (HMM) for each query sequence and its hit sequences [16]. Zhu et al. matched distantly related sequences consistently over large protein families using a Bayesian search method [32]. More significantly, the development of Position-Specific Iterated BLAST (PSI-BLAST) enabled sensitive detection of remote evolutionary sequence relationships by constructing a position-specific scoring matrix and searching the target sequences using this matrix [2]. The latest trend in this development theme is to incorporate significant evolutionary information into the sequence comparison method. For example, the Blocks server incorporated conserved protein regions into alignment block construction during protein homology detection [13]; Kelley et al. also reported using a 3-dimensional position-specific scoring matrix to combine the power of multiple sequence profiles with knowledge of protein structure during enhanced functional assignment of newly sequenced genomes [17].
The second development theme, sequence comparison application software development, concentrated on improving the throughput of sequence comparison tasks and the user exploration of sequence comparison results. For example, Zhang and Madden developed PowerBLAST to automate BLAST tasks and extended it with such features as automated preprocessing of low quality sequence regions, restricting BLAST search sequences by keywords, submitting very long sequences, and post-processing of BLAST alignment results [31]. Visual-BLAST/FASTA [10] and AlignmentViewer [9] enabled biology users to filter and interpret large amounts of similarity search results through information visualization. BEAUTY-X enhanced BLAST alignment output for subsequent functional sequence annotations [30]. Sequence server Samurai from Informax uses a relational database management system at the backend to support network-server-based sequence similarity searches [25]. BSU enables users to schedule BLAST execution and receive email notifications of new match results found [3]. EASY enables users to cross-validate similarity search results by combining results from parallel execution of many protein sequence and pattern database searches [24]. However, no reported application software truly combines automated computer processing with interactive user exploration for genome-wide sequence comparison tasks.

In this paper, we establish a third and novel development theme, sequence comparison database system development, to steer future research of this fundamental bioinformatics topic into the genomic DBMS domain. We invented a new bio-specific operator, Similar_Join, which operates on two database relations containing sequence data, to abstract the batch sequence similarity search task into a syntactically concise database operation. We designed and implemented the Similar_Join operator in the extended data cartridge of Oracle 8i object-relational DBMS, which integrates sequence similarity search tasks and their results tightly with other database querying activities and managed data. We also demonstrated that the Similar_Join operator could be built into the SQL language extensions in Oracle 8i, enabling future biological data analysts to pose interesting biological queries that were previously impossible inside the DBMS.
Overall, our research, which we compare to the initial work in the similar field of Geographical Information Systems [29], should inspire biological algorithm developers to study how to transform existing sequence comparison techniques into new database indexing techniques for biological sequences in a genomic DBMS, and should shield application software developers from concerns about sequence comparison execution details. In the remainder of this paper, we organize the description of the new Similar_Join operator into four sections: operator design, implementation, application, and discussion.

2. OPERATOR DESIGN
2.1 Motivation
To most biology users today, performing routine high-quality sequence similarity searches still poses significant challenges. Take, for example, a common task “Run BLAST of a batch of query sequences against a large set of target sequences” (often called “batch BLAST”). The entire process, if performed with a BLAST server, requires biology users to prepare BLAST sequences, iteratively execute the BLAST program, parse all BLAST results, and combine batch BLAST results into a distilled report. During the course of batch BLAST, biology users also need to deal with potentially large volumes of raw sequences and textual BLAST files. As we know firsthand, these computational burdens wear out biology users, leaving them too exhausted to further experiment with alternative search parameters and alternative search programs, especially when the task involves large data sets.

Existing sequence comparison development themes do not address these user challenges adequately. First, many software applications such as the widely-used NCBI BLAST [14] do not offer sufficient automation or data management to relieve users of computational burdens during batch sequence comparison tasks. Second, even top sequence comparison application software suffers from various performance problems. For example, query sequences are often not indexed during batch sequence comparison tasks; round-trip movement of large volumes of data from the DBMS server to the application software is required for current database support. Third, it is difficult to replace an existing similarity search algorithm with another one from within a developed software application. There is no standard sequence comparison specification that algorithm developers can comply with in order to shield application software developers from reacting to algorithmic details. Users need a new way of studying and applying sequence comparisons in the post-genome era.

2.2 Concept
Starting from an entirely new database system development theme, we invented a Similar_Join bio-specific database operator defined as follows:

Definition: A relational operator that takes at least two relations containing query and target sequence strings, and outputs one or more relations containing all similarly matched sequence string pairs filtered from the product of the two input relations.

By relational operators, we refer to a particular type of relational algebra function that takes existing relations as input and produces new relations as output [6]. In relational algebra, basic relational operators include filter, project, group, times, union, intersect, minus, join, and divide. Relational operators operate on whole relations, unlike some database functions that operate within a row or a cell. Also, they operate on relations, and not on un-identified tables.

Our definition of Similar_Join does not require the inclusion of pairwise sequence alignment results, which usually follow the similarity search results found in most sequence comparison software. This is because we want to introduce similarity search and pairwise sequence alignment as two separate operations orthogonal to each other. While we introduce similarity search into the DBMS as a relational operator, we regard pairwise alignment merely as a database row function, which takes the two sequence strings of a matched sequence pair as the input and produces an alignment text string as the output. Whether to include the pairwise alignment function in the final result of the Similar_Join operator or not is purely an implementation choice.

The new Similar_Join operator is both “complex” and “content-specific”, different from basic content-neutral relational operators such as join, filter, project, and group. Similar_Join is a complex operator, similar to MATCH [12] or HAS [5], which requires complex relational algebra operations.
Similar_Join is also a content-specific, or, more precisely, “bio-specific”, operator, applicable only to relations that contain string operands (varchar2 data type for Oracle) representing DNA or protein sequences.

The new Similar_Join operator concept establishes a new framework for future sequence comparison bioinformatics development (as shown in Figure 1). This operator provides a natural abstraction for all sequence comparison tasks, including one-to-one, one-to-many, and many-to-many sequence similarity searches and alignments, making it easy for both users and application software developers to use. Since the operator works on relational data inside the DBMS, it can also have strong data management support, including integration with query optimizers and SQL. More importantly, since the definition of the operator does not include details on how to call a “similar match”, developers of the operator can implement many different similarity search methods and match thresholds for users to specify.

[Figure 1 diagram, three layers. Application Software Development Theme: Genome Annotation Software, Sequence Analysis Software, ... Other Software Applications. Database Access. Database System Development Theme: The Similar_Join Bio-specific Database Operator, ... Other Bio-specific Operators; Implementation of Indexing, Search, and Query Optimization Methods. Algorithm Development Theme: BLAST Methods, FASTA Methods, Smith-Waterman Algorithm, ... Other Algorithms.]

Figure 1. The expected role of the Similar_Join bio-specific database operator in the sequence comparison development framework.
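The definition in Section 2.2 can be read as a filter over the product of two relations with a pluggable notion of a “similar match”. The following is a minimal Python sketch of that reading, not the authors' implementation: the function name `similar_join`, the `Row` tuple layout, and the sample sequences are all assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple

# Hypothetical row layout: (Sequence_ID, Seq_string).
Row = Tuple[int, str]


def similar_join(query_rel: Iterable[Row],
                 target_rel: Iterable[Row],
                 is_similar: Callable[[str, str], bool]) -> List[Tuple[int, int]]:
    """Return the ID pairs, taken from the product of the two input
    relations, whose sequence strings the supplied predicate accepts."""
    return [(qid, tid)
            for (qid, qseq), (tid, tseq) in product(query_rel, target_rel)
            if is_similar(qseq, tseq)]


# Illustrative data only.
queries = [(1, "AGGGC"), (2, "GCA")]
targets = [(10, "GCA"), (11, "TGCA"), (12, "AGGGC")]

# Any match rule can be plugged in; exact string equality is used here.
exact_pairs = similar_join(queries, targets, lambda a, b: a == b)
print(exact_pairs)
```

Because the match rule is passed in as a parameter, the same call shape covers one-to-one, one-to-many, and many-to-many comparisons; a real operator inside a DBMS would rely on indexing rather than enumerate the full product.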

2.3 Design Structure

Figure 2 shows the design structure of the Similar_Join operator using a query structure model notation [7]. The operator is drawn in the shape of a pentagon in the center with data flow arrows pointing to and away from the vertices/edges. The operator takes two standard input relations, “Query Sequence” and “Target Sequence”, and an additional input parameter relation, “Similar_Join Parameter”, to give users some flexibility in controlling the underlying search algorithm. Although we still distinguish the query sequences and target sequences by properly naming the input relations and labeling the input arms, users of the Similar_Join operator no longer have to make this distinction, nor do they need to distinguish whether a task requires “comparing 2 sequences”, “comparing one query sequence against a target sequence set”, or “comparing a batch of query sequences against a target sequence set”. The operator outputs more than one relation, as drawn: a “Similar_Join Result” relation and two detailed result relations (shown in a cloud to suggest an extensible schema), “Joined Sequence Pair” and “Joined Seq Pair Match Detail”. The “Similar_Join Result” relation has a primary key “Result ID”, which an application or a user can use to retrieve the complex similarity search results kept in the detailed result relations. The “Joined Sequence Pair” relation keeps information about matched sequence pairs and the degree of similarity measured by attributes “Match_Hit_Strength”, “E_value_neg_log”, and “E_value_fraction”. The entity “Joined Seq Pair Match Detail” in the schema fragment captures information about each matched region of each matched sequence pair and the details of the similarity measures.

[Figure 2 diagram: the SIMILAR_JOIN operator (A.[Sequence ID] = B.[Sequence ID]) with input arms A and B fed by “Query Sequence” (Sequence_ID, Seq_string) and “Target Sequence” (Sequence_ID, Seq_string), plus the parameter input “Similar_Join Parameter” (SJoin_Type, Parameter_ID, Parameter_Value, Use_Default_Flag). Output, the SIMILAR_JOIN Result Schema Fragment: “Similar_Join Result” (Result ID); “Joined Sequence Pair” (Result ID, Matchpair_Seq_A_ID, Matchpair_Seq_B_ID, Match_Hit_Strength, E_value_neg_log, E_value_fraction, Relationship_Descriptor); “Joined Seq Pair Match Detail” (Region_ID, Span_A_Percentage, Span_B_Percentage, A_Similarity_Percentage, B_Similarity_Percentage, Region_Length, Region_on_A_Start, Region_on_B_Start, A_strand_flag, B_strand_flag).]

Figure 2. The design structure of the Similar_Join operator, shown using a query structure model notation.

[Figure 3 diagram, entities and attributes: “Similar_Join Type” (SJoin Name, SJoin Operator Name); “Similar_Join Parameter Set” (SJoin Name (FK), SJoin PS ID, Default Flag); “Similar_Join Parameter” (SJoin PS ID (FK), SJoin Name (FK), Parameter ID, Parameter Value, Default Value); “Sequence Set” (Set ID, Schema Name, Relation Name, Attribute Name, Time Created), with relationships “Be In Set A” and “Be In Set B”; “Similar Join Result” (Result ID, SJoin Name (FK), SJoin PS ID (FK), Set A ID (FK), Set B ID (FK), Time Executed, Description); “Joined Sequence Pair” (Matchpair Seq A ID, Matchpair Seq B ID, Result ID (FK), Match Hit Strength, E Value Neg Log, E Value Fraction, Relationship Descriptor); “Joined Seq Pair Match Detail” (Region ID, Result ID (FK), Matchpair Seq A ID (FK), Matchpair Seq B ID (FK), Span A Percentage, Span B Percentage, A Similarity Percentage, B Similarity Percentage, Region Length, Region on A Start, Region on B Start, A Strand Flag, B Strand Flag).]

Figure 3. A system-level schema supporting the Similar_Join operator.

Figure 3 above shows the complete system-level schema used by the Similar_Join operator using a variation of the LDS notation discussed in [8]. This schema keeps track of all similarity results in entities previously described, “Similar Join Result”, “Joined Sequence Pair” and “Joined Seq Pair Match Detail”. It also keeps track of essential operator parameters such as the input sequence sets (from entity “Sequence Set”), operator execution parameter set (from entity “Similar_Join Parameter Set”), and choice of Similar_Join algorithms (in entity “Similar_Join Type”, connected through entity “Similar_Join Parameter Set”). This schema structure allows us to design and implement the Similar_Join operator flexibly. For example, if users subsequently demand that detailed alignment results should be remembered and integrated with the Similar_Join similarity search results, we could easily extend this system-level schema with a “sequence alignment” schema element found in [8], which contains “Result ID” as a partial key. As another example, we could choose one of the three similarity search algorithms to implement Similar_Join in different versions (remembered by “SJoin_Type” in the “Similar_Join Parameter” entity):

• Exact_Join: Only two identical sequence strings from relations A and B show up as joined sequence pairs.

• Subseq_Join: Only two sequence strings that have a substring relationship to each other from relations A and B show up as joined sequence pairs.

• Blast_Join: Only two sequence strings that satisfy a minimal P value for measuring sequence similarity from relations A and B show up as joined sequence pairs.
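The first two join types listed above reduce to simple string predicates, which the following Python sketch makes concrete. It is an illustration under stated assumptions, not the authors' code: the names `SJOIN_TYPES` and `joined_pairs` are hypothetical, identical strings are assumed to also satisfy the substring relationship, and Blast_Join is left out because it depends on an external BLAST run and a P value threshold.

```python
from typing import Callable, Dict, List, Tuple


def exact_join(a: str, b: str) -> bool:
    """Exact_Join: the two sequence strings are identical."""
    return a == b


def subseq_join(a: str, b: str) -> bool:
    """Subseq_Join: one string occurs inside the other
    (assumption: identical strings also qualify)."""
    return a in b or b in a


# Hypothetical lookup keyed by the SJoin_Type value.
SJOIN_TYPES: Dict[str, Callable[[str, str], bool]] = {
    "Exact_Join": exact_join,
    "Subseq_Join": subseq_join,
}


def joined_pairs(rel_a: List[Tuple[int, str]],
                 rel_b: List[Tuple[int, str]],
                 sjoin_type: str) -> List[Tuple[int, int]]:
    """Return (A id, B id) pairs accepted by the chosen join type."""
    predicate = SJOIN_TYPES[sjoin_type]
    return [(ida, idb)
            for ida, seq_a in rel_a
            for idb, seq_b in rel_b
            if predicate(seq_a, seq_b)]


# Illustrative data only.
rel_a = [(1, "GCATT"), (2, "GCA")]
rel_b = [(3, "GCA"), (4, "AGGGC")]
print(joined_pairs(rel_a, rel_b, "Exact_Join"))
print(joined_pairs(rel_a, rel_b, "Subseq_Join"))
```

Selecting the predicate by name mirrors how the schema records the algorithm choice as data, so a new join type can be added without changing the calling code.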

3. IMPLEMENTATION
3.1 As a Relational Operator Package
We have implemented the Similar_Join operator as part of a relational operator package. The interface specifications of the Similar_Join operator package are:

    Rel_op.Similar_Join_Prepare (
        Input_RelationA_Seq_Attribute_Name IN,
        Input_RelationB_Seq_Attribute_Name IN,
        SJoin_Name IN,
        Sjoin_ParameterSet_Name IN,
        Result_ID OUT );

    Rel_op.Similar_Join_Finish (
        SJoin_Name IN,
        Sjoin_ParameterSet_Name IN,
        Result_ID IN );

To perform a Similar_Join similarity search using the “Exact_Join” method on two sets of sequences, for example, we could execute the two PL/SQL statements in one PL/SQL block:

    SQLPLUS> DECLARE
        v_Result_ID NUMBER;
    BEGIN
        Rel_op.Similar_Join_Prepare ('Query_Sequence_Set.Sequence', 'Target_Sequence_Set.Sequence', 'Exact_Join', 'Default', v_Result_ID);
        Rel_op.Similar_Join_Finish ('Exact_Join', 'Default', v_Result_ID);
        -- v_Result_ID passes on the identifier to access hit results.
        -- Write additional PL/SQL scripts here to query the hit results from
        -- "Joined Sequence Pair" and "Joined Seq Pair Match Detail"
    END;
    /

To perform a Similar_Join similarity search using the “Blast_Join” method on two sets of sequences, however, three execution steps are necessary, due to the technical difficulties in calling an interpreted Perl program directly from an Oracle SQL block:

    SQLPLUS> EXEC Rel_op.Similar_Join_Prepare ('Query_Sequence_Set.Sequence', 'Target_Sequence_Set.Sequence', 'Blast_Join', 'Default')
    /
    Result_ID = 45
    SQLPLUS> HOST 'Batch_BLAST_4in1 Result_ID = &PROMPT'
    SQLPLUS> EXEC Rel_op.Similar_Join_Finish ('Blast_Join', 'Default', &&PROMPT)

Even with the relatively “verbose” implementation, a biology user does not need to leave the database management system environment in order to perform an otherwise daunting batch BLAST task. The biology user simply starts by executing the stored procedure Rel_op.Similar_Join_Prepare to prepare data for the system schema. Then, at the SQLPLUS prompt, he/she enters the result_ID (‘45’ in this example) output from the first execution step. (The user only needs to enter the result_ID value once to define the values of both &PROMPT and &&PROMPT.) Next, the operating system (host) program ‘Batch_BLAST_4in1’ automatically exports both sequence sets (where result ID = 45) from the database, executes parallel BLAST jobs, parses BLAST results in parallel, and imports similarity results back into the database system schema tables. Finally, the stored procedure Rel_op.Similar_Join_Finish processes the detailed BLAST results and sets flags for filtered hits (not shown). In the next section, we describe how to further integrate this relational operator closely into the DBMS.
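The prepare/finish split described in Section 3.1 can be mimicked outside Oracle to show how a Result_ID ties the two steps together. The sketch below uses Python's standard `sqlite3` module with an in-memory database. Everything in it is an assumption made for illustration: the table and column names are simplified from the schema figures, the two functions are hypothetical stand-ins for the `Rel_op` procedures, and only an exact-match join is performed in the finish step.

```python
import sqlite3


def similar_join_prepare(conn, sjoin_name, parameter_set="Default"):
    """Register a pending run and return its Result_ID (hypothetical stand-in)."""
    cur = conn.execute(
        "INSERT INTO similar_join_result (sjoin_name, sjoin_ps) VALUES (?, ?)",
        (sjoin_name, parameter_set))
    return cur.lastrowid


def similar_join_finish(conn, result_id):
    """Store matched pairs under the given Result_ID.
    Only exact string equality is sketched here."""
    conn.execute(
        "INSERT INTO joined_sequence_pair (result_id, seq_a_id, seq_b_id) "
        "SELECT ?, q.sequence_id, t.sequence_id "
        "FROM query_sequence q "
        "JOIN target_sequence t ON q.seq_string = t.seq_string",
        (result_id,))


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE query_sequence (sequence_id INTEGER PRIMARY KEY, seq_string TEXT);
CREATE TABLE target_sequence (sequence_id INTEGER PRIMARY KEY, seq_string TEXT);
CREATE TABLE similar_join_result (
    result_id INTEGER PRIMARY KEY, sjoin_name TEXT, sjoin_ps TEXT);
CREATE TABLE joined_sequence_pair (
    result_id INTEGER, seq_a_id INTEGER, seq_b_id INTEGER);
INSERT INTO query_sequence VALUES (1, 'AGGGC'), (2, 'GCA');
INSERT INTO target_sequence VALUES (10, 'GCA'), (11, 'TGCA');
""")

# Step 1 hands back an identifier; step 2 fills in the hits under that identifier.
result_id = similar_join_prepare(conn, "Exact_Join")
similar_join_finish(conn, result_id)
hits = conn.execute(
    "SELECT seq_a_id, seq_b_id FROM joined_sequence_pair WHERE result_id = ?",
    (result_id,)).fetchall()
print(result_id, hits)
```

Keeping the identifier as the only value passed between the steps is what lets an external program, such as the batch BLAST host script in the paper, run between them.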

3.2 As an Oracle8i Data Cartridge
Our relational operator package implementation of Similar_Join provides the abstraction, automation, and querying capability of similarity searches inside the DBMS. However, experienced database users prefer writing queries in SQL to writing queries in a procedural language such as PL/SQL. Algorithm developers also need a better framework to transform their existing sequence indexing and search techniques into the query processing and optimizing engine of a DBMS. Traditionally, this would have meant the development of a full-featured genomic-specific DBMS. Or, is there a feasible alternative?

The answer is “yes” with the careful choice of an extensible DBMS to implement the operator. We chose Oracle8i object-relational DBMS, because its Extended Data Cartridge support [20] makes it possible to implement domain-specific operators tightly integrated with SQL and the query engine. The Oracle “database operator” concept differs slightly from the “relational operator”: relational operators operate on relations, whereas Oracle database operators operate on table columns. Since we enforce primary keys for each table, and a table column is essentially a vertically partitioned table, the conceptual difference can be ignored. Oracle database operators also differ from database functions, which operate on individual table cell values rather than on whole table columns. Our Similar_Join Oracle8i database operator has the following interface specification:

    Similar_Join (
        Input_TableA_Seq_Attribute_Name,
        Input_TableB_Seq_Attribute_Name,
        SJoin_Algorithm_Name,
        Sjoin_ParameterSet_Name DEFAULT 'Default' )
    Returns Result_OID

This interface specification of Similar_Join differs from the previously described operator’s interface specification for the relational operator package implementation in two details. First, the relational operator Similar_Join returns a relation identifier Result_ID, whereas the database operator Similar_Join returns an object identifier Result_OID. Each Result_OID references a unique Similar_Join hit result as performed by the Oracle8i Extended Data Cartridge. Second, the original relational table “Joined Seq Pair Match Detail” in the relational operator Similar_Join implementation is now replaced by a nested table with the same name inside the object table “Joined Sequence Pair”. The table nesting ensures all relevant results about a matched sequence pair can be accessed through the Result_OID.
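The nesting described above, where match details live inside the matched pair rather than in a separate table joined by foreign keys, can be pictured with a small Python sketch. The class and field names below are hypothetical simplifications of the schema attributes, and the sample values are invented for illustration; this is not Oracle object-type syntax.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MatchDetail:
    """One matched region (a subset of the attributes in the schema)."""
    region_id: int
    region_length: int
    region_on_a_start: int
    region_on_b_start: int


@dataclass
class JoinedSequencePair:
    """A matched pair that carries its match details as a nested collection."""
    seq_a_id: int
    seq_b_id: int
    match_details: List[MatchDetail] = field(default_factory=list)


# Illustrative values only: one 3-base region starting at position 1 on both sequences.
pair = JoinedSequencePair(seq_a_id=2, seq_b_id=4)
pair.match_details.append(
    MatchDetail(region_id=1, region_length=3,
                region_on_a_start=1, region_on_b_start=1))

# Everything about the pair is reachable from the one object.
total_len = sum(d.region_length for d in pair.match_details)
print(total_len)
```

With this shape a single reference to the pair is enough to reach every region, which is the access pattern the nested table is meant to provide.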

To perform a Similar_Join similarity search of all the sequences against themselves in table “sequence_table” (shown below) using the “Subseq_Join” method, a biological data analyst could write SQL like this:

Sequence_Table:

    SeqID  SeqString
    1      AGGGC
    2      GCATT
    3      GCAAATT
    4      GCA
    5      TGCA

    SELECT a.seqstring, b.seqstring,
           Similar_Join (a.seqstring, b.seqstring) AS ResultSymbol
    FROM sequence_table a, sequence_table b
    WHERE Similar_Join (a.seqstring, b.seqstring) IS NOT NULL
      AND a.seqid <> b.seqid;

Note that in this version of implementation, we further simplified the database operator to return a character symbol (“>” for super sequence, “<” for sub sequence).

[Query result table lost in extraction; surviving cells: GCA, TGCA.]

... million rows of indexed matched sequence regions. To explore the relationship of any pair of sequences, we were able to execute a simple SQL statement and obtain desired answers within seconds.

The Similar_Join operator empowers biological data analysts to attempt complex database queries previously impossible to achieve. Examine the following potentially important real-world query, which belongs to the same category of “complex bioinformatics queries” [15]:
