A Component-based Implementation of Multiple Sequence Alignment

Umit Catalyurek†, Mike Gray†, Tahsin Kurc†, Joel Saltz†, Eric Stahlberg+, Renato Ferreira‡

† Dept. of Biomedical Informatics, The Ohio State University, Columbus, OH 43210
catalyurek.1, gray.105, kurc.1, [email protected]

+ Ohio Supercomputer Center, 1224 Kinnear Road, Columbus, OH 43210
[email protected]

‡ Dept. of Computer Science, UFMG, Belo Horizonte, Brazil
[email protected]

ABSTRACT

This paper addresses the efficient execution of a Multiple Sequence Alignment (MSA) method, in particular the progressive alignment-based CLUSTAL W algorithm, on a cluster of workstations. We describe a scalable component-based implementation of the CLUSTAL W program targeting distributed-memory machines and multiple query workloads. We look at the effect of data caching on the performance of the data server. We present a distributed, persistent cache approach for caching intermediate results for reuse in subsequent or concurrent queries. Our initial results show that the cache-enabled CLUSTAL W program scales well on a cluster of workstations.

Keywords

Multiple sequence alignment, distributed computing, component-based architecture, persistent data caching

1. INTRODUCTION

When a new sequence is discovered, it is important to search sequence databases to find homologous sequences [1], and to analyze structural and functional similarities and differences among multiple sequences [8, 11, 12]. Many of the large gene databases, including GenBank¹ and PDB, are publicly available for research and development, and provide a rich source of information for molecular studies. As a result, data servers for sequence databases should be designed to handle a large number of queries (e.g., multiple sequence alignment, BLAST) submitted by many clients.

Multiple sequence alignment (MSA) analysis is a powerful mechanism for analyzing structural and functional similarities and differences, and for finding historical and evolutionary relationships. MSA involves computing the alignment of three or more sequences and is a computationally expensive method. Several research projects have focused on the development of heuristics [8, 11, 12] and on parallel implementations of MSA algorithms [7, 13].

This paper addresses the efficient execution of progressive alignment-based strategies, in particular the CLUSTAL W algorithm, using a component-based framework [3] when multiple queries are submitted to the data server. Component-based frameworks, where an application is developed from a set of interacting components, are gaining popularity as a viable approach for application development in distributed environments [3, 6, 9]. They offer a flexible model that allows for placement of components on different platforms to minimize computation and communication overheads. Implementations of CLUSTAL W on a shared-memory machine have been presented in [4, 7]. Our work differs from those projects in that we target distributed systems. We also examine how data caching affects the performance of a data server running on a disk-based storage/compute cluster.

This research was supported by the National Science Foundation under Grants #ACI-9619020 (UC Subcontract #10152408), #EIA-0121177, #EIA-0203846, #ACI-0130437, #ACI-9982087, and by Lawrence Livermore National Laboratory under Grants #B500288 and #B517095 (UC Subcontract #10184497).

¹ As of March 2002, GenBank of the National Center for Biotechnology Information (NCBI) contained about 14.9 million sequences with about 15.8 billion bases. The size of the data has tripled in the last two years.

2. CLUSTAL W

Progressive alignment-based strategies are commonly used methods for multiple sequence alignment problems because of their efficient execution times. The CLUSTAL W [12] algorithm is one of the most widely used methods among several alternatives.

The basic idea behind progressive alignment is to generate an initial phylogenetic tree via a series of pairwise alignments and then, following the order in the phylogenetic tree, to incrementally build up the multiple sequence alignment. A progressive alignment-based method consists of three main stages: 1) Pairwise Alignment: A distance matrix is computed based on the alignments of all pairs of sequences. For q sequences, this first step requires the computation of q(q-1)/2 pairwise alignments. As a result, the execution time complexity of this step is O(q^2 l^2) for q sequences, each of which has an average length of l. 2) Computation of the Guide Tree: A phylogenetic tree is calculated from the distance matrix using the neighbor-joining method. The computational complexity of this step is O(q^3). 3) Progressive Alignment: A series of pairwise alignments are computed using full dynamic programming to align larger and larger groups of sequences. The branching order in the guide tree determines the ordering of the sequence alignments. The computational complexity of this step is O(q l^2).
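To make the cost of the pairwise stage concrete, the following sketch (not taken from CLUSTAL W; pairwiseDistance is a deliberately simplified edit-distance stand-in for the real pairwise scoring) enumerates the q(q-1)/2 pairs and fills the symmetric distance matrix, which is where the O(q^2 l^2) term comes from.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified O(l^2) dynamic-programming score for one pair of sequences:
// plain edit distance normalized by the longer length. CLUSTAL W uses a
// more elaborate scoring scheme; this only illustrates the cost shape.
double pairwiseDistance(const std::string& a, const std::string& b) {
    const std::size_t n = a.size(), m = b.size();
    std::vector<std::size_t> prev(m + 1), cur(m + 1);
    for (std::size_t j = 0; j <= m; ++j) prev[j] = j;
    for (std::size_t i = 1; i <= n; ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= m; ++j) {
            std::size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        std::swap(prev, cur);
    }
    return static_cast<double>(prev[m]) /
           static_cast<double>(std::max<std::size_t>(1, std::max(n, m)));
}

// Stage 1: fill the symmetric q x q distance matrix from all q(q-1)/2
// pairwise alignments, which yields the O(q^2 l^2) total cost given above.
std::vector<std::vector<double>>
computeDistanceMatrix(const std::vector<std::string>& seqs) {
    const std::size_t q = seqs.size();
    std::vector<std::vector<double>> dist(q, std::vector<double>(q, 0.0));
    for (std::size_t i = 0; i < q; ++i)
        for (std::size_t j = i + 1; j < q; ++j)      // q(q-1)/2 pairs
            dist[i][j] = dist[j][i] = pairwiseDistance(seqs[i], seqs[j]);
    return dist;
}
```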

3. THE RUNTIME ENVIRONMENT

In this paper we employ a component-based framework, called DataCutter [3], for the implementation of a parallel, cache-enabled CLUSTAL W. DataCutter provides programming and runtime support for developing data-intensive applications that execute in distributed and heterogeneous environments. In this framework, the application processing structure is implemented as a set of components, referred to as filters. Data exchange between filters is performed through a stream abstraction. A filter defines an object that performs application-specific processing on data. Currently, filter codes can be developed using a C++ language binding by sub-classing a filter base class provided by the framework. The base class exports three virtual functions, init, process, and finalize, that should be implemented by the application developer. A stream is an abstraction that defines how the filters of an application are connected to each other. It also provides a means for uni-directional data exchange between two filters, from the source filter to the destination filter. Bi-directional data exchange can be established by two streams in opposite directions. All transfers to and from streams are done through data buffers, each of which represents a contiguous memory region containing useful data. A filter group denotes a group of filters that implement the application processing structure and collectively carry out execution of an application-defined unit-of-work (UOW). An example of a UOW would be an MSA request submitted to the data server. A work cycle begins when the runtime system calls the init function for each filter in the filter group; this is where any required resources such as memory can be pre-allocated. Next, the process function is called to read from any input streams, work on the data buffers received, and write to any output streams. A special marker is sent by the runtime system after the last buffer to mark the end of the current UOW. The finalize function is called after all processing is finished for the current UOW, to allow release of allocated resources such as scratch space. The interface functions may be called again to process another UOW.

The runtime system provides a multi-threaded, distributed execution environment. Multiple instances of a filter group can be instantiated and executed concurrently, and work can be assigned to any group. Within each filter group instance, multiple copies of individual filters can also be created. Decomposition of application processing into filters and filter copies effectively provides a combination of functional- and data-parallelism. Filter groups and filters can be placed on a distributed collection of machines to minimize computation and communication overheads -- filters that exchange large volumes of data can be placed on the same machine, while compute-intensive filters can be executed on more powerful machines or less loaded hosts. Filters co-located on the same machine are executed as separate threads by the runtime system. Data exchange between two co-located filters is carried out by simple pointer copy operations, thereby minimizing the communication overhead. Communication between two filters on different hosts is done through TCP/IP sockets.

Figure 1: A layout for the component-based implementation of MSA. The filter group consists of a Pairwise Alignment filter and a Guide Tree & Prog. Align filter.
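As an illustration of the filter life cycle described above, the sketch below shows what a filter subclass might look like. The type names (DataBuffer, StreamReader, StreamWriter, FilterBase) are hypothetical stand-ins, since the actual DataCutter declarations are not reproduced in this paper; only the init/process/finalize structure is taken from the text.

```cpp
#include <vector>

// Stand-ins for the framework types; the real DataCutter binding supplies
// its own buffer/stream classes and a filter base class whose init, process,
// and finalize virtual functions are overridden as shown below.
struct DataBuffer { std::vector<unsigned char> bytes; };

struct StreamReader {
    // Returns false once the runtime delivers the end-of-work marker.
    bool read(DataBuffer&) { return false; }   // placeholder body
};
struct StreamWriter {
    void write(const DataBuffer&) {}           // placeholder body
};

class FilterBase {
public:
    virtual ~FilterBase() = default;
    virtual int init() = 0;       // pre-allocate resources for a work cycle
    virtual int process(StreamReader& in, StreamWriter& out) = 0;  // one UOW
    virtual int finalize() = 0;   // release resources after the UOW
};

// Skeleton of an application filter following the work cycle in the text.
class PairwiseAlignmentFilter : public FilterBase {
public:
    int init() override {
        scratch_.reserve(1 << 20);      // e.g. pre-allocate scratch space
        return 0;
    }
    int process(StreamReader& in, StreamWriter& out) override {
        DataBuffer buf;
        while (in.read(buf)) {          // until the end-of-work marker
            // ... align the sequence pair carried in buf ...
            out.write(buf);             // forward the result downstream
        }
        return 0;
    }
    int finalize() override {
        scratch_.clear();               // release per-UOW scratch space
        return 0;
    }
private:
    std::vector<unsigned char> scratch_;
};
```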

4. CACHE-ENABLED, COMPONENT-BASED IMPLEMENTATION OF MSA

In this section, we describe an approach for improving the execution of MSA analysis by caching previously computed results. We first present a component-based implementation of MSA without data caching. We continue with a description of how data caching can be enabled in MSA. Finally, we describe a cache-enabled implementation of MSA.

4.1 Decomposition into Components

The first task in a component-based implementation is to partition the application into a set of processing units, in other words into a group of filters. For CLUSTAL W, a natural decomposition is to have three filters, corresponding to the three stages of the algorithm. However, both the complexity analysis (see Section 2) and the empirical studies in [4, 7] show that the pairwise alignment step is by far the most time-consuming step of the algorithm². Based on this observation, we decompose the CLUSTAL W algorithm into two filters, as displayed in Figure 1. If multiple copies of the Pairwise Alignment filter are created, the Guide Tree & Prog. Align (GT&PA) filter can control the assignment of pairwise alignments to those copies through a demand-driven work assignment mechanism in order to achieve better computational load balance. The GT&PA filter starts by assigning pairwise alignments to the Pairwise Alignment filter copies; whenever one of the copies finishes its computation, a new pair is assigned to that copy.

² Our experiments in an earlier work show that 78-95% of the total execution time is incurred in the pairwise alignment step.
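The demand-driven assignment policy can be sketched as follows. The sendPair and receiveResult callbacks are hypothetical stand-ins for the stream traffic between the GT&PA filter and the Pairwise Alignment copies; the point of the sketch is the policy of seeding every copy with one pair and handing the next pending pair to whichever copy reports completion first.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// One completed pairwise alignment reported back by a filter copy.
struct PairResult { std::size_t copy, i, j; double distance; };

// Demand-driven assignment loop run by the GT&PA filter (a sketch; the real
// implementation exchanges data buffers over streams). sendPair and
// receiveResult are supplied by the caller and stand in for stream I/O.
void schedulePairs(std::size_t q, std::size_t numCopies,
                   std::vector<std::vector<double>>& dist,
                   const std::function<void(std::size_t, std::size_t, std::size_t)>& sendPair,
                   const std::function<PairResult()>& receiveResult) {
    std::queue<std::pair<std::size_t, std::size_t>> pending;
    for (std::size_t i = 0; i < q; ++i)
        for (std::size_t j = i + 1; j < q; ++j)
            pending.push({i, j});                    // all q(q-1)/2 pairs

    std::size_t outstanding = 0;
    for (std::size_t c = 0; c < numCopies && !pending.empty(); ++c) {
        auto [i, j] = pending.front(); pending.pop();
        sendPair(c, i, j);                           // seed every copy with one pair
        ++outstanding;
    }
    while (outstanding > 0) {
        PairResult r = receiveResult();              // wait for any copy to finish
        dist[r.i][r.j] = dist[r.j][r.i] = r.distance;
        --outstanding;
        if (!pending.empty()) {                      // the idle copy gets the next pair
            auto [i, j] = pending.front(); pending.pop();
            sendPair(r.copy, i, j);
            ++outstanding;
        }
    }
}
```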

4.2 Data Caching

Data caching is a common optimization technique that takes advantage of commonalities among queries for more efficient use of system resources [2, 5, 10]. Queries submitted against a database of sequences for MSA analysis usually involve comparison of known sequences (stored in the database) to one or more new sequences entered by the client. Even if a user submits a query involving a new sequence, the query generally contains multiple known sequences. Moreover, in a multiple-client environment, several clients may submit to the server requests that contain overlapping subsets of sequences. In these cases, caching results from previous queries can speed up the execution of new queries.
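As a minimal sketch of the idea, the cache below memoizes pairwise results under an order-independent key built from the two sequence identifiers. The distributed and persistent aspects of the cache described in this paper are omitted, and all type and function names here are illustrative only, not the paper's implementation.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Sketch of a pairwise-result cache keyed by an order-independent pair of
// sequence identifiers; a real deployment would persist entries on disk
// and share them across filter copies, which is not shown here.
class PairwiseCache {
public:
    std::optional<double> lookup(const std::string& a, const std::string& b) const {
        auto it = table_.find(key(a, b));
        return it == table_.end() ? std::nullopt
                                  : std::optional<double>(it->second);
    }
    void insert(const std::string& a, const std::string& b, double distance) {
        table_[key(a, b)] = distance;
    }
private:
    // Canonical key: the same pair in either order maps to the same entry.
    static std::string key(const std::string& a, const std::string& b) {
        return a < b ? a + "|" + b : b + "|" + a;
    }
    std::unordered_map<std::string, double> table_;
};

// Usage: consult the cache before running the O(l^2) alignment.
//   if (auto d = cache.lookup(idA, idB)) { /* reuse *d */ }
//   else { double d = pairwiseDistance(seqA, seqB); cache.insert(idA, idB, d); }
```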

[Figure: cache-enabled layout for MSA. Labels recoverable from the figure: Hash Table; c bits = 2^c elements; Hash(UniqueID); i; j; Sequence; Guide Tree & Prog. Align; Pairwise Align & Cache (Cache, Compute); Unique ID = (i ...]