Journal of VLSI Signal Processing 2007

© 2007 Springer Science + Business Media, LLC. Manufactured in The United States. DOI: 10.1007/s11265-007-0062-9

MPI-HMMER-Boost: Distributed FPGA Acceleration

JOHN PAUL WALTERS, Institute for Scientific Computing, Wayne State University, Detroit, MI 48202, USA
XIANDONG MENG, Electrical and Computer Engineering Department, Wayne State University, Detroit, MI 48202, USA
VIPIN CHAUDHARY, Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY 14260, USA
TIM OLIVER, LEOW YUAN YEOW AND DARRAN NATHAN, Progeniq Pte Ltd., 8 Prince George's Park, 118407, Singapore
BERTIL SCHMIDT, UNSW Asia, 1 Kay Siang Road, 248922, Queenstown, Singapore
JOSEPH LANDMAN, Scalable Informatics LLC, 2433 Woodmont, Canton, MI 48188, USA

Received: 15 December 2006; Revised: 5 March 2007; Accepted: 17 March 2007

Abstract. HMMER, based on the profile Hidden Markov Model (HMM), is one of the most widely used sequence database searching tools, allowing researchers to compare HMMs to sequence databases or sequences to HMM databases. Such searches often take many hours and consume a great number of CPU cycles on modern computers. We present a cluster-enabled hardware/software-accelerated implementation of the HMMER search tool hmmsearch. Our results show that combining the parallel efficiency of a cluster with one or more high-speed hardware accelerators (FPGAs) can significantly improve performance for even the most time-consuming searches, often reducing search times from several hours to minutes.

Keywords: HMMER, database searching, FPGA, VLSI, MPI, profile hidden Markov models

John Paul Walters: This research was supported in part by NSF IGERT grant 9987598 and the Institute for Scientific Computing at Wayne State University. Vipin Chaudhary: This research was supported in part by NSF IGERT grant 9987598, the Institute for Scientific Computing at Wayne State University, MEDC/Michigan Life Science Corridor, and NYSTAR.

1. Introduction

Protein sequence analysis tools to predict the homology, structure, and function of particular peptide sequences exist in abundance. One of the most commonly used tools is the profile hidden Markov model algorithm developed by Eddy [9] and colleagues [7]. These tools allow scientists to construct mathematical models (Hidden Markov Models, or HMMs) of a set of aligned protein sequences with known similar function and homology; such a model can then be applied to a large database of proteins. The tools generate a log-odds score indicating whether the protein is more likely to belong to the same family as the proteins that generated the HMM or to a set of random, unrelated sequences.

HMMER [8] is a widely used implementation of profile HMM database searching. It is composed of two primary search procedures, hmmsearch and hmmpfam. Both of these search tools rely on the Viterbi algorithm for probabilistically comparing a query sequence to an HMM. Due to the complexity of the calculation, and the possibility of applying many HMMs to a single sequence (Pfam search [4, 23]), these calculations require significant numbers of processing cycles.

Efforts to accelerate these searches have resulted in several platform- and hardware-specific variants, including an Altivec port by Lindahl [17], a GPU port of hmmsearch by Horn et al. [13] of Stanford, as well as several optimizations performed by the authors of this paper [16, 22, 30]. These optimizations span a range from minimal source code changes with some impact upon performance, to recasting the core algorithms in terms of a different computing technology and thus fundamentally altering the calculation. Each approach has specific benefits and costs.

In this paper we explore a combination of two acceleration techniques. The first is the BioBoost hardware accelerator [24] (FPGA), while the second is an MPI cluster implementation, MPI-HMMER. We merge these two techniques in the hope of benefiting from both technologies, using MPI to scale beyond a single hardware accelerator. Our implementation makes use of any processors that are available, whether FPGAs or traditional CPUs, and efficiently accelerates the database search beyond either technique's individual capability.

The remainder of this paper is organized as follows. In Section 2 we describe similar work in the area of HMMER acceleration. In Section 3 we describe both the necessary background information and our multi-level FPGA/MPI acceleration strategies. Finally, in Sections 4 and 5 we present our performance results and conclusions.

2. Related Work

There have been a variety of techniques used to both implement and accelerate HMMER searches. They range from typical high-performance computing (HPC) strategies such as clustering, to web services, and even to extending the core HMMER algorithm to novel processing architectures. In this section we discuss the various strategies used in both implementing and accelerating HMMER.

Walters et al. [30] describe an SSE2 [15] implementation of the Viterbi [28] algorithm. Using the SIMD instructions of the Pentium 4 and above, multiple iterations of the Viterbi algorithm were processed in parallel. However, the data dependencies that we describe in Section 3.3 prevented deeper parallelism. Similarly, Erik Lindahl [17] implemented an Altivec-optimized Viterbi algorithm using SIMD Altivec instructions. The advantage of the Altivec implementation over the SSE2 implementation is the larger set of Altivec instructions coupled with additional SIMD registers. With these extra instructions (32-bit vector max), most steps of the Viterbi algorithm can be processed within the Altivec registers without requiring additional packing or memory operations.

ClawHMMER [13] implements hmmsearch using highly parallel graphics processors. Unlike traditional general-purpose processors, graphics hardware has been optimized to perform the same operation over large streams of input data. This is similar to the SIMD operations of general-purpose CPUs, but with greater width and speed. In the case of ClawHMMER, multiple sequences are computed in parallel, rather than parallelizing the Viterbi algorithm itself.

JackHMMer [31] uses network processors in place of a general-purpose processor in order to accelerate the core Viterbi algorithm. Specifically, JackHMMer uses the Intel IXP 2850 network processor, a heterogeneous multicore chip consisting of an XScale CPU paired with sixteen 32-bit microengines. The XScale CPU distributes work to the microengines and acts as a coordinator for the worker microengines.

TimeLogic provides an FPGA HMM protein characterization solution named DeCypherHMM [27]. The DeCypher engine is deployed as a standard PCI card in an existing machine. Multiple DeCypher engines can be installed in a single machine, which, according to TimeLogic, results in near-linear speedup. Other FPGA implementations exist as well [18, 22]. The disadvantage of these implementations is that they eliminate portions of the Plan7 model, which can lead to inaccuracies. However, they also exhibit extremely high speedup at the cost of such potential inaccuracy.

SledgeHMMER [6] is a web service designed to allow researchers to perform Pfam database searches without having to install HMMER locally. To use the service, a user submits a batch job to the SledgeHMMER website. Upon completion of the job, the results are simply emailed back to the user. SledgeHMMER is a clustered Pfam implementation using a Unix file-locking strategy rather than MPI or PVM.

Other cluster-like implementations exist as well. Zhu et al. [32] discuss a Pfam search implementation for the EARTH multi-threaded architecture [14]. Their implementation achieves near-linear speedup with 128 dual-core processors. HMMER itself supports a PVM [26] implementation of the major database search tools, including hmmsearch and hmmpfam.

3. Multi-Grained HMMER Acceleration

In this section we describe the three primary enhancements that we have made toward accelerating HMMER. Our optimizations focus on the hmmsearch tool due to its more compute-intensive nature; however, all of the HMMER acceleration techniques are equally applicable to hmmpfam. We present both a hardware/FPGA-accelerated hmmsearch, called the BioBoost accelerator, as well as the hmmsearch portion of the software-accelerated MPI-HMMER tool.

Figure 1. Plan7 HMM model architecture.

Using MPI-HMMER and the BioBoost accelerator, we combine both strategies to implement a heterogeneous software/hardware-accelerated hmmsearch, which we call MPI-HMMER-Boost. In order to best communicate the acceleration techniques used in MPI-HMMER-Boost, we begin with an overview of both the Plan7 model and the Viterbi algorithm. Both hmmsearch and hmmpfam fundamentally rely on the Plan7 model and the Viterbi algorithm for their basic operations. These details are particularly important when accelerating HMMER within hardware.

3.1. Overview of the Plan7 Model

HMMER operations rely upon accurate construction of an HMM representation of a multiple sequence alignment (MSA) of homologous protein sequences. This HMM may then be applied to a database of other protein sequences for homology determination, or grouped together to form part of a protein-family set of HMMs that are used to test whether a particular protein sequence is related to the consensus model. It can also annotate potential functions within the query protein from what is known about the function of the aligned sequences in the HMM (homology transfer and functional inference) [29]. These functions in HMMER are based upon the profile HMM [9] architecture. The profile HMM architecture is constructed using the Plan7 model, as depicted in Fig. 1.

One of the advantages of profile HMMs is that they capture position-specific alignment information within their model. Traditional alignment strategies, such as the Smith–Waterman algorithm [25], BLAST [3], or the Needleman–Wunsch algorithm [21], use position-independent scoring. This results in fixed penalties for every position in the sequence, though more recent algorithms such as PSI-BLAST [20] seek to overcome such limitations.

The Plan7 model consists of a linear sequence of nodes. Each node, k, consists of a match (Mk), insert (Ik), and delete (Dk) state. Between the states are arrows called state transitions, with associated probabilities. Each M-state emits a single residue, with a probability score that is determined by the frequency with which that residue has been observed in the corresponding column of the multiple sequence alignment. Each M-state (and also each I-state) therefore carries a vector of 20 probabilities, one for each amino acid. Insertions and deletions are modeled using I-states and D-states, respectively. Transitions are arranged so that at each node, either the M-state or the D-state is used. I-states have a self-transition, allowing one or more inserted residues to occur between consensus columns. The model begins in state B and ends in state E. Furthermore, there are five special states: S, N, C, T, and J. They control alignment-specific features of the model, e.g., how likely the model is to generate various sorts of global, local, or even multi-hit alignments.

3.2. The Viterbi Algorithm

One of the major bioinformatics applications of profile HMMs is database searching. HMMER includes two searching options. A researcher may search a database of profile HMMs against an input query sequence, or alternatively, may search a sequence database for matches to an input profile HMM. In HMMER these searches are known as hmmpfam and hmmsearch, respectively. In either case, the similarity score sim(H, S) of a profile HMM H and a protein sequence S is used to rank all sequences/HMMs in the queried database. The highest-ranked sequences/HMMs, with their corresponding alignments, are then displayed to the user as the top hits identified by the database search. The key to effective database searching is the accuracy of the similarity score sim(H, S). In HMMER it is computed using the well-known Viterbi [28] algorithm. Computing the similarity score can therefore be recast as finding the Viterbi score of the profile HMM H and the protein sequence S. The Viterbi score is defined as the score of the most probable path through H that generates a sequence equal to S. The algorithm is presented in Algorithm 1.

Algorithm 1 The Plan7 Viterbi Algorithm
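For reference, the recurrences computed by Algorithm 1 can be summarized in the standard Plan7 formulation (log-odds scores). This is a restatement based on the matrix definitions that follow, not the authors' verbatim listing:

```latex
% Standard Plan7 Viterbi recurrences in log-odds space (requires amsmath).
% A restatement of the well-known formulation, not the authors' verbatim
% listing; e(.,.) are emission scores, tr(.,.) transition scores, and S_i
% is the i-th residue of the sequence S.
\begin{align*}
M(i,j) &= e(M_j, S_i) + \max\!\left\{
  \begin{array}{l}
    M(i-1,j-1) + tr(M_{j-1}, M_j) \\
    I(i-1,j-1) + tr(I_{j-1}, M_j) \\
    D(i-1,j-1) + tr(D_{j-1}, M_j) \\
    XB(i-1)    + tr(B, M_j)
  \end{array}
\right. \\
I(i,j) &= e(I_j, S_i) + \max\bigl( M(i-1,j) + tr(M_j, I_j),\; I(i-1,j) + tr(I_j, I_j) \bigr) \\
D(i,j) &= \max\bigl( M(i,j-1) + tr(M_{j-1}, D_j),\; D(i,j-1) + tr(D_{j-1}, D_j) \bigr) \\
XE(i)  &= \max_{1 \le j \le k} \bigl( M(i,j) + tr(M_j, E) \bigr) \\
XN(i)  &= XN(i-1) + tr(N, N) \\
XJ(i)  &= \max\bigl( XJ(i-1) + tr(J, J),\; XE(i) + tr(E, J) \bigr) \\
XB(i)  &= \max\bigl( XN(i) + tr(N, B),\; XJ(i) + tr(J, B) \bigr) \\
XC(i)  &= \max\bigl( XC(i-1) + tr(C, C),\; XE(i) + tr(E, C) \bigr)
\end{align*}
```

The final Viterbi score is XC(n) + tr(C, T), matching the description of the matrices below.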

In Algorithm 1, there are three two-dimensional matrices: M, I, and D. M(i, j) denotes the score of the best path emitting the subsequence S[1...i] of S and ending with S[i] being emitted in state Mj. Similarly, I(i, j) is the score of the best path ending with S[i] being emitted by state Ij, and D(i, j) is the score of the best path ending in state Dj. In addition, there are five one-dimensional matrices: XN, XE, XJ, XB, and XC. XN(i), XJ(i), and XC(i) denote the score of the best path emitting S[1...i] and ending with S[i] being emitted in special state N, J, and C, respectively. XE(i) and XB(i) denote the score of the best path emitting S[1...i] and ending in E and B, respectively. Finally, the score of the best path emitting the complete sequence S is given by XC(n) + tr(C, T). These matrices and their corresponding dependencies (see Section 3.3) are used for the FPGA implementation.

3.3. BioBoost HMMER-FPGA Acceleration

In order to develop a parallel architecture for Algorithm 1, its dependencies must first be accurately described. Figure 2 shows the data dependencies for computing the cell (i, j) in matrices M, I, and D. It can be seen that the computation of this cell requires the left, upper, and upper-left neighbors, as well as XB(i-1). The value of XB(i-1) indirectly depends on XJ(i-1), which further depends on XE(i-1), which, finally, depends on every element in row i-1 of matrix M. These indirect dependencies necessarily serialize the computation of matrices M, I, and D. Thus, computing several matrix cells in parallel is not possible for a full Plan7 Viterbi score calculation, due to the J-state dependencies.

Eliminating the J-state is a typical strategy used when implementing the Viterbi algorithm in hardware [18, 22]. In so doing, highly efficient parallelism can be achieved within an FPGA, at the cost of the implementation's inability to find multi-hit alignments such as repeat matches of subsequences of S to subsections of H. This can in turn result in a severe loss of sensitivity for database searching within HMMER. The BioBoost implementation implements the full Plan7 architecture.

Figure 3 shows the design of each processing element (PE) of the BioBoost accelerator. It contains registers to store the following temporary matrix values: M(i-1, j-1), I(i-1, j-1), D(i-1, j-1), M(i, j-1), I(i, j-1), and D(i, j-1). The matrix values M(i, j), I(i, j), and D(i, j) are not stored explicitly, but rather are used as inputs to the M(i, j-1), I(i, j-1), and D(i, j-1) registers. The PE receives the emission probabilities e(Mj, si) and e(Ij, si) and the transition probabilities tr(Mj-1, Mj), tr(Ij-1, Mj), tr(Dj-1, Mj), tr(Mj, Ij), tr(Ij, Ij), tr(Mj-1, Dj), tr(Dj-1, Dj), and tr(Mj, E) from the internal FPGA RAM (Block RAM). The transition probabilities tr(B, Mj), tr(N, N), tr(E, J), tr(J, J), tr(J, B), tr(N, B), tr(C, C), and tr(C, T) are each stored within the PE's registers.

Each PE maintains a four-stage pipeline: Fetch, Compute1, Compute2, and Store. In the Fetch stage, transition, emission, and intermediate matrix values are read from the Block RAM. All necessary computations are performed within the two compute stages. Finally, results are written back to the Block RAM in the Store stage. The computation of the special state matrices uses intermediate values for XE(i), which are computed as XE(i, j) = max(XE(i, j-1), M(i, j) + tr(Mj, E)). The updating of the one-dimensional matrices XN, XJ, XB, and XC is only performed at the end of the matrix row, when j = k.

As mentioned above, the Plan7 Viterbi algorithm does not allow computing several cells in parallel. Instead of computing the Viterbi algorithm on one database subject at a time, different query/subject pairs are aligned independently in separate PEs.

Figure 2. Data dependencies for computing the values M(i, j), I(i, j), and D(i, j) in Algorithm 1. Solid lines are used for direct dependencies and dashed lines for indirect dependencies.


Figure 3. HMM Processing Element (PE) design.

The system design with 4 PEs is shown in Fig. 4. Each PE has its own intermediate value storage (IVS). The IVS needs to store one row of previously computed results of the matrices M, I, and D. The PEs are connected to a shared emission and transition storage unit, which requires that all PEs process the same HMM state in every clock cycle; therefore, the bandwidth required to access the transition storage is reduced to a single state. The score collect and score buffer units are used in cases where multiple PEs produce results in the same clock cycle. The HMM loader transfers emission and transition values into their respective storage. The sequence loader fetches sequence elements from external memory and forwards them to the emission selection multiplexers. The system is connected to the BioBoost HMMER software running on the host system via a USB interface.

The host software performs two functions: it loads/starts the FPGA and post-processes relevant hits. The host portion of hmmsearch is described in Algorithm 2. As can be seen, the FPGA itself is used as a first-pass filter. Large database chunks are quickly scanned and narrowed down to a few interesting sequences. These sequences are then processed in software, according to the high-level algorithm presented in Algorithm 2. The e-value described in Algorithm 2 is a statistical measure of the number of false hits expected within the entire database whose scores are less than or equal to the current sequence's score. The software HMMER implementation uses these e-values to filter out results whose scores indicate a high chance of a false positive. Similarly, the T and E input parameters are user-defined threshold (T) and cutoff (E) values.


Algorithm 2 BioBoost FPGA Processing Algorithm
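As a rough illustration of the host-side role described above, the following C sketch shows the kind of first-pass filtering loop that Algorithm 2 implies. The types and helper functions (fpga_hit_t, software_viterbi, score_to_evalue, report_hit) are hypothetical placeholders, not the BioBoost host API:

```c
/* Hypothetical types and helpers standing in for the real BioBoost host code. */
typedef struct { int seq_id; float score; } fpga_hit_t;

extern float  software_viterbi(int seq_id);              /* full Plan7 rescore   */
extern double score_to_evalue(float score, long db_size);
extern void   report_hit(int seq_id, float score, double evalue);

/* First-pass filtering: the FPGA returns one score per database sequence;
 * only sequences that pass the user thresholds T (score) and E (e-value)
 * are rescored in software and reported. */
void postprocess(const fpga_hit_t *hits, long nhits, long db_size,
                 float T, double E)
{
    for (long i = 0; i < nhits; i++) {
        if (hits[i].score < T)
            continue;                                     /* below score threshold */

        double evalue = score_to_evalue(hits[i].score, db_size);
        if (evalue > E)
            continue;                                     /* too likely a false hit */

        /* Rescore the few surviving sequences with the full software
         * Viterbi to obtain the final score and alignment. */
        float final_score = software_viterbi(hits[i].seq_id);
        report_hit(hits[i].seq_id, final_score,
                   score_to_evalue(final_score, db_size));
    }
}
```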

3.4. MPI-HMMER: Software Acceleration with MPI

MPI [10] is a library standard proposed by the MPI Forum. Its purpose is to specify a standards-based message-passing library. MPI itself was designed to support both workstation clusters and massively parallel systems. In practice, however, its use tends toward the more common commodity clusters.

Figure 4. Architecture of the HMM system implemented on FPGA.

MPI bindings are available for the most popular scientific programming languages, including C, C++, and FORTRAN. There are many implementations of MPI, including optimized implementations from vendors who often provide their own MPI libraries tailored to specific communications hardware or parallel computer architectures. Typical MPI implementations include MPICH [11, 12], LAM [5], and MPICH-GM [19] (targeting Myrinet hardware). We elected to use MPI to implement our clustered MPI-HMMER solution because of both its pervasiveness in the scientific community and its high performance. While HMMER already provides a PVM-based cluster implementation, we found its use too restrictive. Furthermore, PVM is not capable of performing truly asynchronous sends, a useful strategy for the double-buffering implementation we describe below. Because MPI is already able to communicate with heterogeneous workstations, without the overhead of a virtual-machine-like interface, we found no compelling reason to use PVM.

In order to gain performance through an MPI implementation, it is necessary to identify the regions of code that are the most expensive and execute those code paths on the worker nodes. In doing so, the majority of the computation can be offloaded to the workers, while freeing the master node to distribute sequences and post-process results. After profiling hmmsearch, we found that the P7Viterbi() function (shown in Algorithm 2) is responsible for over 90% of the total execution time. Therefore, in order to effectively utilize multiple worker nodes, it is crucial that P7Viterbi() be executed on the worker nodes.

In [30] we describe our original MPI-HMMER implementation. We build on the basic strategy used within HMMER's built-in PVM [26] implementation in that we allocate one node to function as a dedicated master, while all other nodes function as workers (Fig. 5). In an effort to optimize the clustering strategy, we have added two features to our MPI implementation. The first is a database chunking technique used to minimize the number of calls into the MPI communications library. This allows nodes to receive and work on batches of sequences before sending their results back to the master. It also reduces the number of sends and receives in order to minimize MPI library calls. Since communication time is typically the greatest bottleneck in a distributed computation, it is crucial that we minimize the time spent sending and receiving data. By batching sequences together we can keep worker nodes busier and minimize the number of sends/receives.

To further minimize the amount of time workers spend blocking for data, we also implemented a double-buffering scheme. With double-buffering, the worker node can compute on one buffer while receiving the next batch of data into a second buffer. This has great advantages in situations such as MPI-HMMER, where data transfer can cause a severe bottleneck. By employing double-buffering we overlap communication and computation as much as possible and in doing so mask the communication latency. A key requirement for double-buffering is the ability to perform non-blocking communication. MPI provides such non-blocking communication, and the overlap between communication and computation is highly effective in minimizing the time spent waiting for additional data. This proves especially true with increasing node counts: as the size of the cluster increases, so too does the number of nodes trying to send their data back to the master node. Because most MPI implementations are not yet thread-safe, multi-threading cannot be used to safely receive data from multiple workers simultaneously. Being able to quickly post a non-blocking send and continue with the next buffer helps to minimize the latency of larger clusters.
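The double-buffered worker loop can be sketched in C/MPI roughly as follows. This is a simplified illustration rather than the MPI-HMMER source; the chunk size, message tags, stop convention, and the process_chunk/send_results_to_master helpers are placeholders:

```c
#include <mpi.h>

#define CHUNK_BYTES (1 << 20)   /* placeholder batch size */
#define TAG_WORK    1

extern int  process_chunk(const char *buf, int len);  /* runs P7Viterbi on each sequence */
extern void send_results_to_master(void);

/* Double-buffered worker loop: while one buffer is being processed,
 * the next batch of sequences is already being received into the other. */
void worker_loop(void)
{
    static char buf[2][CHUNK_BYTES];
    MPI_Request req[2];
    MPI_Status  st;
    int cur = 0;

    /* Prime both buffers with outstanding non-blocking receives. */
    for (int i = 0; i < 2; i++)
        MPI_Irecv(buf[i], CHUNK_BYTES, MPI_CHAR, 0, TAG_WORK,
                  MPI_COMM_WORLD, &req[i]);

    for (;;) {
        int len;
        MPI_Wait(&req[cur], &st);                 /* wait only for the current buffer */
        MPI_Get_count(&st, MPI_CHAR, &len);
        if (len == 0)                             /* empty chunk used as a stop signal */
            break;

        process_chunk(buf[cur], len);             /* compute while the other buffer fills */
        send_results_to_master();

        /* Repost the receive for this buffer and switch to the other one. */
        MPI_Irecv(buf[cur], CHUNK_BYTES, MPI_CHAR, 0, TAG_WORK,
                  MPI_COMM_WORLD, &req[cur]);
        cur ^= 1;
    }

    /* Cancel the still-outstanding receive on the other buffer before exiting. */
    MPI_Cancel(&req[cur ^ 1]);
    MPI_Request_free(&req[cur ^ 1]);
}
```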


Figure 5. MPI-HMMER design.

3.5. MPI Parallelism with FPGA Acceleration

Algorithm 2 gives a high-level overview of the strategy used to perform database searches with the BioBoost accelerator. Because the FPGA does not process data in the same way that MPI-HMMER does, the partitioning of computation between the master node and the FPGA-enabled worker node must be altered to take this into account. From Algorithm 2 we can see that the majority of the computation can be run on the FPGA-enabled node, while only a small part (post-processing) must be run on the host node. However, from our analysis, the amount of data required to perform the software Viterbi algorithm outweighed the benefit of computing the software Viterbi on the FPGA-enabled node. Rather than force the FPGA-enabled node to handle both the FPGA computation and the software Viterbi computation, we rely on the FPGA node to simply perform the hardware computation and return the scores to the master node for post-processing. While this adds a minor amount of additional strain to the master node, it alleviates a great deal of data packing and communication. After including double-buffering support, we even saw a slight improvement in performance due to the overlapping exchange of data between the master node and the FPGA-enabled node (see Section 4).

In Algorithms 3 and 4 we show precisely the division of computation between the master node and the FPGA-enabled node. For clarity, we omit from Algorithms 3 and 4 the double-buffering that we use in order to maintain performance.

Algorithm 3 BioBoost Master MPI Algorithm
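A minimal C/MPI sketch of the master's side of this division, with hypothetical helper names and the double-buffering omitted as in Algorithms 3 and 4, might look as follows:

```c
#include <mpi.h>

#define TAG_WORK   1
#define TAG_SCORES 2

extern long next_db_chunk(char *buf, long max);          /* 0 when the database is exhausted */
extern void postprocess_scores(const float *sc, int n);  /* e-value filter + software rescoring */

/* Master loop for one FPGA-enabled worker (rank fpga_rank): stream database
 * chunks out, collect raw Viterbi scores back, and post-process them locally. */
void master_loop(int fpga_rank, char *chunk, long max_chunk,
                 float *scores, int max_scores)
{
    for (;;) {
        long len = next_db_chunk(chunk, max_chunk);
        MPI_Send(chunk, (int)len, MPI_CHAR, fpga_rank, TAG_WORK, MPI_COMM_WORLD);
        if (len == 0)                       /* empty chunk tells the worker to stop */
            break;

        MPI_Status st;
        int nscores;
        MPI_Recv(scores, max_scores, MPI_FLOAT, fpga_rank, TAG_SCORES,
                 MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_FLOAT, &nscores);

        /* The FPGA returns only raw scores; all e-value filtering and
         * rescoring of interesting sequences happens on the master. */
        postprocess_scores(scores, nscores);
    }
}
```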


Algorithm 4 BioBoost FPGA MPI Algorithm
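The corresponding FPGA-enabled worker can be sketched similarly; the fpga_load_hmm and fpga_score_chunk calls stand in for the BioBoost USB interface and are hypothetical:

```c
#include <mpi.h>

#define TAG_WORK   1
#define TAG_SCORES 2

/* Hypothetical wrappers around the BioBoost USB interface. */
extern void fpga_load_hmm(void);
extern int  fpga_score_chunk(const char *seqs, int len, float *scores);

/* FPGA worker loop: receive a batch of sequences, score it in hardware,
 * and return the raw scores to the master for post-processing. */
void fpga_worker_loop(char *chunk, int max_chunk, float *scores)
{
    fpga_load_hmm();                               /* load the profile HMM once */

    for (;;) {
        MPI_Status st;
        int len;
        MPI_Recv(chunk, max_chunk, MPI_CHAR, 0, TAG_WORK, MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_CHAR, &len);
        if (len == 0)                              /* empty chunk: no more work */
            break;

        int n = fpga_score_chunk(chunk, len, scores);
        MPI_Send(scores, n, MPI_FLOAT, 0, TAG_SCORES, MPI_COMM_WORLD);
    }
}
```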

3.6. MPI-HMMER-Boost

In order to further improve the performance of hmmsearch database searches, we extended our MPI-HMMER implementation to include support for hardware acceleration, specifically the BioBoost-accelerated FPGA described in Section 3.3. Building from the implementations described in Sections 3.4 and 3.5, we again faced two challenges: (1) the results returned by the FPGA are not of the same form as those returned by the software nodes, and (2) the FPGA must be kept busy in order to observe any measurable performance gains.

Figure 6. MPI-HMMER with FPGA design.

As we described in Section 3.3, the results returned by the FPGA to the master node must still be post-processed by the master node. In addition to the standard post-processing performed on data returned from the software worker nodes, a small number of results from the FPGA must also be rescored with the software Viterbi. This difference in data presented a challenge in efficiently processing data from the workers and FPGAs, and led us to the multi-threaded approach shown in Fig. 6.

Figure 6 shows a schematic of our combined MPI-HMMER-Boost implementation. We separate our master node into two server threads, one responsible for collecting and processing the data from the software nodes, while the other is responsible for collecting and post-processing the FPGA results. In order to simplify communication between the two groups, two MPI communicators are formed upon initializing the MPI library. These two communicators represent nodes with hardware acceleration and those without. Each worker node attempts to initialize an FPGA; should the hardware be found, it will be used for accelerated computation. After each node has tried to initialize the BioBoost accelerator, an MPI_Allreduce is used to inform each node in the cluster of all FPGAs. Using this information, each node initializes two communication groups: COMM-BOOST and COMM-OTHER. Any FPGA-enabled node is added to the COMM-BOOST communicator, while all other nodes are added to the COMM-OTHER communicator. With these communicators established, the server threads can listen for messages from their respective groups and act accordingly. We note that the master must join both communicators (one for each thread). One possible caveat is that the master is not allowed to perform most computations itself, in order to keep resources available for the worker nodes. This means that an FPGA attached to the master node will not be utilized unless multiple MPI processes are started on the master node. Ultimately, this compromise allows for greater MPI scalability due to greater resource availability on the master.
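The communicator setup described above might be expressed roughly as follows. This is a sketch under stated assumptions (rank 0 as master, one split per group); the actual MPI-HMMER-Boost code may differ in its details:

```c
#include <mpi.h>

extern int try_init_fpga(void);   /* returns 1 if a BioBoost board was initialized */

/* Build the two worker groups described in the text: FPGA-enabled nodes join
 * COMM_BOOST, the remaining workers join COMM_OTHER, and the master (rank 0)
 * joins both so that each of its two server threads can listen to one group. */
void build_communicators(MPI_Comm *comm_boost, MPI_Comm *comm_other)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int have_fpga = (rank != 0) && try_init_fpga();

    /* Let every node learn how many FPGAs are present in the cluster. */
    int total_fpgas = 0;
    MPI_Allreduce(&have_fpga, &total_fpgas, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    (void)total_fpgas;   /* the real implementation uses this for work partitioning */

    /* The master belongs to both groups; every worker belongs to exactly one. */
    int in_boost = (rank == 0) || have_fpga;
    int in_other = (rank == 0) || !have_fpga;

    MPI_Comm_split(MPI_COMM_WORLD, in_boost ? 1 : MPI_UNDEFINED, rank, comm_boost);
    MPI_Comm_split(MPI_COMM_WORLD, in_other ? 1 : MPI_UNDEFINED, rank, comm_other);
}
```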

4. Performance Results

4.1. Hardware Platform

The hardware platform used for the software nodes is based on a SunFire X2100 cluster running the Red Hat Enterprise Linux 4 operating system. The cluster consists of one master node, with a dual-core AMD Opteron 275 processor, and ten worker nodes, with gigabit Ethernet connections between nodes. Each worker node has a dual-core AMD Opteron 175 processor with 2 GB of RAM. LAM/MPI [5] version 7.1.2 was used for MPI communication.

The FPGA's PE design is written in Verilog and uses Xilinx ISE 8.2i for synthesis, mapping, placement, and routing. The acceleration board used for this study was a low-cost Xilinx Spartan-3 XC3S1500 board with 64 MB SDRAM, Ethernet, USB 2.0, and PS/2 interfaces. The XC3S1500 board can be plugged into any host computer system via a USB interface. Two of the worker nodes were used as host machines for these FPGA accelerators.

4.2. Experimental Evaluation

In this section we compare the performance of the optimized HMMER application running on the cluster to the FPGA hmmsearch running on the Spartan-3 XC3S1500 board. The size of one PE is 451 logic slices. The amount of memory required is 50 RAM entries per HMM state, comprising 42 emissions and 8 transitions. Furthermore, there are three entries per HMM state for each PE's IVS (intermediate value storage). Thus, both the size of the HMM and the number of PEs that we are able to support are limited by the amount of Block RAM on the target FPGA. Using all 32 Block RAMs on the XC3S1500, it is possible to fit 10 PEs and support a profile HMM of up to 256 states.

The theoretical peak performance of the FPGA implementation can therefore be estimated as Performance_board = N_pe × f_board, where N_pe is the maximum number of PEs and f_board is the maximum board clock speed. The maximum hardware clock speed achieved for this implementation is 70 MHz, and the maximum number of PEs is 10 on the Spartan-3 XC3S1500 board; therefore Performance_board = 10 × 70 MHz = 700 mega cell updates per second (MCUPS). MCUPS is a commonly used performance measure in computational biology, where one cell update is a complete computation of a single entry of each of the M, D, and I matrices. This figure represents the peak performance that the hardware accelerator can achieve and does not account for data communication or initialization time; it should therefore be considered an upper bound.

4.3. Examining Speedup as a Function of the HMM Size

For hmmsearch we compared two models against two protein databases of different size. Model 1 (rrm.hmm) includes 77 states, and model 2 (PF00244.hmm) includes 236 states. The swissprot [1] database is a FASTA-format protein database containing 217,875 sequences and 82,030,653 characters. Uniref90.fasta [2] is the other protein database, containing 2,521,679 sequences and 899,379,867 characters. We performed a comprehensive HMMER performance evaluation based on the serial implementation, the optimized MPI-HMMER implementation (one master and one worker with double-buffering), the original (serial) FPGA implementation, the MPI+FPGA implementation without double-buffering, and the optimized MPI-HMMER-Boost implementation (including double-buffering). In Fig. 7 we compare the speedups of hmmsearch for these five implementations.

Figure 7. Comparative speedups of various implementations for various configurations (serial HMMER on one AMD processor, MPI-HMMER with one AMD worker, the original FPGA HMMER, unoptimized MPI+FPGA with one FPGA worker, and MPI-HMMER-Boost with one FPGA worker; RRM and PF00244 models vs. the Swissprot and Uniref90 databases).

All tests were performed using one master processor with one worker or FPGA processor. With the larger PF00244 model searched against the larger Uniref90 database, the optimized MPI-HMMER-Boost (with double-buffering) achieves the best performance among all implementations. As Fig. 7 shows, the MPI-HMMER-Boost hmmsearch outperforms the other implementations due to its use of data prefetching; in fact, a performance increase of approximately 10% comes from the implementation of double-buffering alone. This indicates that the double-buffered MPI+FPGA hmmsearch is a more efficient architecture for high-throughput Viterbi computations than the others presented in Fig. 7.

To ensure a fair comparison between the MPI-HMMER and MPI-HMMER-Boost results, all performance comparisons in this section are made against this optimized baseline implementation. In order to better examine the overall performance, we show a raw timing comparison of the optimized MPI-HMMER (one master and one worker) and MPI-HMMER-Boost (one master with one and two FPGAs) implementations in Fig. 8. In Fig. 9 we translate those results into speedups. As we can see, a single FPGA achieves an 8x to 15x speedup over MPI-HMMER alone. Adding a second FPGA to the computation further increases the speedup to over 30x. This is to be expected, as we are still limiting our MPI-HMMER timings to a single worker node for comparison.

Figure 8. Timing comparison (in seconds) of the optimized MPI and MPI+FPGA implementations for two processors (one master, one worker): MPI with one AMD worker, MPI-HMMER-Boost with one FPGA worker, and MPI-HMMER-Boost with two FPGA workers, for each HMM model/database configuration.


Figure 9. Speedup comparison of MPI and MPI+FPGA (MPI with one AMD worker, MPI-HMMER-Boost with one FPGA worker, and MPI-HMMER-Boost with two FPGA workers) for each HMM model/database configuration.

We next fully test our implementations by comparing our two HMM models to our two sequence databases for various numbers of processors, as shown in Fig. 10. As we increase the number of processors from 2 to 16 (without an FPGA), the MPI speedup increases. Furthermore, the MPI-HMMER-Boost implementation sees speedups of 8x to 15x while using a single accelerator board and varying numbers of processors. This demonstrates both the FPGA's ability to exploit the inherent parallelism available across HMMER sequences and the advantage of also utilizing the available software worker nodes. Further, we show that with sufficient data, multiple FPGAs can be used simultaneously to achieve even greater speedup. This is clearly evident in Fig. 10c, where near-linear speedup is observed simply with the addition of a second FPGA. However, as would be expected, the efficiency of using the software worker processes diminishes with the addition of the second FPGA. In Figs. 10a, b, and d we see a steady decline in speedup with the use of two FPGAs and multiple software worker nodes. This is to be expected, as the overhead of coordinating the slower worker nodes overshadows their computational benefit. However, in Fig. 10c we see that with sufficient work, multiple FPGAs can still be augmented with software workers for a modest speedup. Eventually, though, even the most compute-intensive datasets are unable to keep multiple software workers busy, as evidenced by the decline in speedup after eight software workers (Fig. 10c).

Examining Fig. 10 further, we can also see the effect of the number of states within an HMM on the FPGA implementation as compared to the software implementation. For example, in the MPI-HMMER case we achieve approximately a 12x speedup regardless of the number of states within the HMM. This is to be expected, as the software implementation of the Viterbi algorithm does not improve in efficiency as the number of states increases. In the case of the BioBoost FPGA, however, the larger PF00244 model (236 states) clearly outperforms the smaller rrm model (77 states), regardless of the number of software worker nodes available within the cluster. This is due to the hardware implementation of the BioBoost accelerator: the best results are achieved with large HMMs composed of many states.

4.4. Results Summary and Discussion

Our results show that the MPI-HMMER-Boost implementation utilizing multiple software worker nodes achieves excellent performance over that of the MPI-only and FPGA-only implementations. All tests show speedup regardless of the model and database used, though the best speedups are achieved with large HMMs. Furthermore, the low cost of the Spartan-3 devices makes it possible to put several FPGA accelerators in a cluster for even greater performance, as we show in Fig. 10.


Figure 10. Comparisons of MPI performance to MPI+FPGA performance for various numbers of processors: (a) RRM vs. Swissprot, (b) RRM vs. Uniref90, (c) PF00244 vs. Uniref90, (d) PF00244 vs. Swissprot.

However, our results also show that, particularly in the case of multiple FPGAs, researchers must be cognizant of the impact that software workers will have on the MPI-HMMER-Boost system. We have shown that with a sufficient amount of work, multiple FPGAs and software workers can be utilized together for improved system performance. But careful consideration should be given to the resources used for the computation: one should not expect an automatic speedup with the addition of software workers.

We believe that additional performance could still be achieved in the MPI portion of our implementation by using a thread-safe MPI library. This would allow us to remove the locks that protect the MPI library on the master node, allowing multiple threads to enter the library simultaneously. However, it would be incompatible with most, if not all, non-MPI-2 implementations.

5. Conclusions

In this paper we have shown the efficacy of combining hardware acceleration and software acceleration through MPI-enabled clusters to achieve excellent performance and cluster utilization for HMMER's hmmsearch functionality. We have shown that multiple FPGAs can be used efficiently in a distributed manner, in some cases achieving near-linear speedup. In the future we hope to test even greater numbers of FPGAs on larger sequence databases in order to truly evaluate the scalability of our solution. To continue to support an even greater number of FPGAs and software workers (in the range of hundreds or thousands), the work of the master node will necessarily have to be split among multiple masters in a hierarchical architecture. We leave this for future work.

References

1. Swissprot protein sequence database, http://www.ebi.ac.uk/swissprot/, 2006.
2. UniRef sequence database, http://www.ebi.ac.uk/uniref/, 2006.
3. S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman, "Basic Local Alignment Search Tool," J. Mol. Biol., vol. 215, no. 3, October 1990, pp. 403-410.
4. A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats and S.R. Eddy, "The Pfam Protein Families Database," Nucleic Acids Res., vol. 32 (Database issue), 2004, pp. D138-D141.
5. G. Burns, R. Daoud and J. Vaigl, "LAM: An Open Cluster Environment for MPI," in Proc. of the Supercomputing Symposium, 1994, pp. 379-386.
6. G. Chukkapalli, C. Guda and S. Subramaniam, "SledgeHMMER: A Web Server for Batch Searching the Pfam Database," Nucleic Acids Res., vol. 32 (Web Server issue), 2004.
7. R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
8. S. Eddy, "HMMER: Profile HMMs for Protein Sequence Analysis," http://hmmer.wustl.edu, 2006.
9. S.R. Eddy, "Profile Hidden Markov Models," Bioinformatics, vol. 14, no. 9, 1998.
10. The MPI Forum, "MPI: A Message Passing Interface," in Proc. of the Supercomputing Conference, 1993, pp. 878-883.
11. W. Gropp, E. Lusk, N. Doss and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Parallel Comput., vol. 22, no. 6, September 1996, pp. 789-828.
12. W.D. Gropp and E. Lusk, "User's Guide for mpich, a Portable Implementation of MPI," Mathematics and Computer Science Division, Argonne National Laboratory, 1996. ANL-96/6.
13. D.R. Horn, M. Houston and P. Hanrahan, "ClawHMMER: A Streaming HMMer-Search Implementation," in SC '05: The International Conference on High Performance Computing, Networking and Storage, 2005.
14. H.H.J. Hum, O. Maquelin, K.B. Theobald, X. Tian, G.R. Gao and L.J. Hendren, "A Study of the EARTH-MANNA Multithreaded System," Int. J. Parallel Program., vol. 24, no. 4, 1996, pp. 319-348.
15. Intel Corporation, "SSE2: Streaming SIMD (Single Instruction Multiple Data) Second Extensions," http://www.intel.com, 2006.
16. J. Landman, J. Ray and J.P. Walters, "Accelerating HMMer Searches on Opteron Processors with Minimally Invasive Recoding," in AINA '06: Proc. of the 20th International Conference on Advanced Information Networking and Applications, Volume 2, IEEE Computer Society, Washington, DC, USA, 2006, pp. 628-636.
17. E. Lindahl, "Altivec-Accelerated HMM Algorithms," http://lindahl.sbc.su.se/, 2005.
18. R.P. Maddimsetty, J. Buhler, R. Chamberlain, M. Franklin and B. Harris, "Accelerator Design for Protein Sequence HMM Search," in Proc. of the 20th ACM International Conference on Supercomputing (ICS06), ACM, 2006, pp. 287-296.
19. Myricom, "MPICH-GM Software," http://www.myri.com/scs/download-mpichgm.html.
20. NCBI, "Position-Specific Iterated BLAST," http://www.ncbi.nlm.nih.gov/BLAST/.
21. S. Needleman and C. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," J. Mol. Biol., vol. 48, no. 3, 1970.
22. T.F. Oliver, B. Schmidt, J. Yanto and D.L. Maskell, "Accelerating the Viterbi Algorithm for Profile Hidden Markov Models Using Reconfigurable Hardware," Lect. Notes Comput. Sci., vol. 3991, 2006, pp. 522-529.
23. Pfam, "The PFAM HMM Library: A Large Collection of Multiple Sequence Alignments and Hidden Markov Models Covering Many Common Protein Families," http://pfam.wustl.edu, 2006.
24. Progeniq, "BioBoost Accelerator Platform," http://www.progeniq.com/, 2006.
25. T.F. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," J. Mol. Biol., vol. 147, 1981.
26. V.S. Sunderam, "PVM: A Framework for Parallel Distributed Computing," Concurrency: Pract. Exper., vol. 2, no. 4, 1990, pp. 315-339.
27. TimeLogic BioComputing Solutions, "DeCypherHMM," http://www.timelogic.com/, 2006.
28. A.J. Viterbi, "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm," IEEE Trans. Inf. Theory, vol. IT-13, 1967, pp. 260-269.
29. J.P. Walters, J. Landman and V. Chaudhary, "Optimized Cluster-Enabled HMMer Searches," in Grids for Bioinformatics and Computational Biology, E.-G. Talbi and A. Zomaya (Eds.), Wiley & Sons, 2007 (to appear).
30. J.P. Walters, B. Qudah and V. Chaudhary, "Accelerating the HMMer Sequence Analysis Suite Using Conventional Processors," in AINA '06: Proc. of the 20th International Conference on Advanced Information Networking and Applications, Volume 1, IEEE Computer Society, Washington, DC, USA, 2006, pp. 289-294.
31. B. Wun, J. Buhler and P. Crowley, "Exploiting Coarse-Grained Parallelism to Accelerate Protein Motif Finding with a Network Processor," in PACT '05: Proc. of the 2005 International Conference on Parallel Architectures and Compilation Techniques, 2005.
32. W. Zhu, Y. Niu, J. Lu and G.R. Gao, "Implementing Parallel HMM-Pfam on the EARTH Multithreaded Architecture," in Proc. of the 2nd IEEE Computer Society Bioinformatics Conference, 2003.


John Paul Walters is currently a postdoctoral researcher at the University at Buffalo. He received his Ph.D. from Wayne State University in Detroit, Michigan in 2007 and his bachelor's degree from Albion College in Albion, Michigan in 2002. His current research focuses on checkpointing, with an emphasis on handling large checkpoint data. He is also actively engaged in the MPI-HMMER project, an open-source MPI acceleration of the HMMER sequence analysis suite.

Xiandong Meng received the B.S. degree in Electronic Engineering from Tianjin University of Technology, China, in 1996. He received his M.S. and Ph.D. degrees in Computer Engineering from Wayne State University in 2002 and 2007, respectively. He joined the Texas A&M University Supercomputing Facility in 2007. His primary research interests are in the fields of cluster computing, high-throughput bioinformatics computing, and reconfigurable computer architecture.

Vipin Chaudhary is an Associate Professor of Computer Science and Engineering and of the New York State Center of Excellence in Bioinformatics and Life Sciences at the University at Buffalo, SUNY. Earlier, he spent several years at Wayne State University, where he was the Director of the Institute for Scientific Computing and Associate Professor of Computer Science, Electrical and Computer Engineering, and Biomedical Engineering. He currently maintains an adjunct position at Wayne State University in the Department of Neurological Surgery. In 2005, he founded Micass LLC, which is focused on Computer Assisted Diagnosis and Interventions. Earlier, he was the Senior Director of Advanced Development at Cradle Technologies, Inc., where he was responsible for advanced programming tools and architecture for multi-processor chips. From January 2000 to February 2001, he was with Corio, Inc., where he held various technical and senior management positions, finally as Chief Architect. In addition, he is on the advisory boards of several startup companies. Vipin received the B.Tech. (Hons.) degree in Computer Science and Engineering from the Indian Institute of Technology, Kharagpur, in 1986, and the M.S. degree in Computer Science and the Ph.D. degree in Electrical and Computer Engineering from The University of Texas at Austin in 1989 and 1992, respectively. His current research interests are in the areas of Computer Assisted Diagnosis and Interventions; Medical Image Processing; Grid and High Performance Computing and its applications in Computational Biology and Medicine; Embedded Systems Architecture and Applications; and Sensor Networks and Pervasive Computing. Vipin has been the principal or co-principal investigator on some $10 million in research projects from government agencies and industry and has published over 100 papers. He was awarded the prestigious President of India Gold Medal in 1986 for securing the first rank among graduating students at IIT.

Tim Oliver received a Master's in Engineering from Sheffield Hallam University in England in 1998. During that time, he developed an interest in reconfigurable computing, and for his Master's project he worked on improving the performance of image processing algorithms using FPGA devices. From 1999 to 2001, he was a Design Engineer at Jennic Limited in Sheffield, developing high-speed interfaces, data scrambling, and error detection and correction systems for gigabit communications platforms. He has worked with Progeniq in Singapore developing accelerated bioinformatics solutions. At the same time, he is pursuing a Ph.D. degree at the Centre for High Performance Embedded Systems (CHiPES) of Nanyang Technological University in Singapore. His current field of research is design encapsulation techniques for improving FPGA design productivity and facilitating dynamic reconfiguration.

Leow Yuan Yeow graduated from the National University of Singapore with a Bachelor of Computing in Computer Engineering. He developed an interest in FPGAs during his final-year project, designing an OpenMP-to-VHDL translator. Since graduation, he has been working at Progeniq Pte Ltd, designing and implementing accelerated solutions for bioinformatics applications.

Bertil Schmidt is an Associate Professor of Computer Science and Engineering at the University of New South Wales. Prior to that, he was a Faculty Member at Nanyang Technological University (NTU) in Singapore. At NTU he also held appointments as Programme Director of M.Sc. in Bioinformatics and as Deputy Director of the Biomedical Engineering Research Centre. A/P Schmidt has been involved in the design and implementation of parallel algorithms and architectures for over a decade. He has published in journals such as IEEE Transactions on Parallel and Distributed Systems, Journal of Parallel and Distributed Computing, and Bioinformatics.

Darran Nathan is co-founder and CEO of Progeniq Pte Ltd, the company behind the BioBoost accelerator, which speeds up computationally intensive applications in the life sciences by 10x-50x compared to typical processors. He is also the Secretariat of the Association for Medical and BioInformatics Singapore (AMBIS), and a Project Manager with the Asia Pacific Bioinformatics Network (APBioNet). He was awarded the prestigious Singapore Computer Society IT Leader Youth Award in 2002 for his continued efforts in the commercialization of R&D in Singapore. He has also received numerous academic awards, such as the Lee Kuan Yew Award for Mathematics and Science in 2000.

Dr. Joseph Landman is Founder and CEO of Scalable Informatics LLC and has been working with high performance computing systems for more than 18 years, in varying end-user, product-developer, and vendor roles. He holds a B.S. in Physics from Stony Brook, an M.S. in Physics from Michigan State University, and a Ph.D. in Physics from Wayne State University. His research specialization was computational simulation/experimentation with semiconductors. Prior to commencing graduate studies, he spent a year performing measurements of the electrical properties of the then newly discovered high-temperature superconductor materials at the IBM T.J. Watson Research Center. Dr. Landman has been involved in many aspects of the application of computational tools and techniques to problems across a wide spectrum of disciplines. At Silicon Graphics, he developed an automated installation tool for Irix, as well as one of the first commercially available cluster-parallel implementations of NCBI BLAST. At MSC.Software, he built a more powerful system to enable seamless cluster execution of informatics applications. He founded Scalable Informatics LLC after 6 years at SGI and 1.5 years at MSC.Software, to build high performance computing solutions. Dr. Landman is working with research teams at SUNY Buffalo on optimizing the performance of HMMer and related programs.
