Fast Matching of CBG Patterns using FPGAs

Scott Hazelhurst and Gloria Aikhorin
School of Computer Science, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa
Phone: +27 11 717-6181; Fax: +27 11 717-6199

Abstract

Large data sets make pattern matching in computational biology a challenge. We present a technique for matching an important class of protein patterns. We show how a protein pattern can be represented as a logical expression, from which a circuit can be automatically synthesised and implemented on field programmable gate arrays (FPGAs), which leads to highly parallelisable solutions. The method was tested on all CBG patterns in the Prosite database, and all the patterns could be dealt with very efficiently, leading to throughput rates in most cases in excess of 10^8 symbols per second.

Keywords: Pattern matching, CBGs, FPGAs

1 Introduction

The problem of pattern matching in biological data such as protein databases becomes more serious as the databases grow in size. Patterns can be specified in different ways. Regular expressions are one method, but these are expensive. Classes of Characters and Bounded-Sized Gaps (CBGs) offer a good trade-off between expressiveness and computational efficiency [3]. A CBG is a sequence of elements, where an element is either: (1) a class of amino acids, or (2) a bounded gap. An example CBG is [RK] x(2,4) [DE] x(2,3) Y. This matches any protein that contains an R or K followed by 2 to 4 arbitrary amino acids, followed by a D or E, followed by 2 or 3 amino acids, followed by a Y.

This paper proposes a new solution to CBG matching. For performance reasons, a hardware solution is proposed using field programmable gate arrays (FPGAs). FPGAs are programmable logic devices that offer the possibility of fast implementation compared to software, with much greater flexibility than ASICs, and highly parallelised solutions for very high performance. A key characteristic that we make use of is that FPGAs can be quickly reprogrammed. FPGAs have expanded in functionality and size, and the tools for programming them have improved in ease of use and sophistication [1].

For each CBG pattern, we build (automatically) a specialised matching circuit optimised for the particular problem at hand, rather than building a general circuit for doing general matching. These circuits do not require external memory (for example, for a table). The general methodology is: (1) take the CBG representing the pattern and represent it as a boolean expression; (2) convert the boolean expression into a VHDL program; and (3) automatically synthesise the VHDL program into a circuit using standard FPGA design tools.
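To make the meaning of a CBG concrete, the following is a minimal software sketch that translates a CBG written in the notation of the example above into an ordinary Python regular expression. It only illustrates what a CBG matches; it is not the hardware method proposed in this paper. The function name `cbg_to_regex` and the whitespace-separated input format are assumptions made for illustration.

```python
import re

# One whitespace-separated CBG element: a class such as [RK] or a single
# residue such as Y (each optionally repeated, e.g. [R](4)), or a gap x(lo,hi).
ELEMENT = re.compile(
    r"\[(?P<cls>[A-Z]+)\](?:\((?P<crep>\d+)\))?"
    r"|x\((?P<lo>\d+),(?P<hi>\d+)\)"
    r"|(?P<one>[A-Z])(?:\((?P<orep>\d+)\))?"
)


def cbg_to_regex(cbg: str) -> str:
    """Translate a CBG string into an equivalent Python regular expression."""
    parts = []
    for token in cbg.split():
        m = ELEMENT.fullmatch(token)
        if m is None:
            raise ValueError(f"unrecognised CBG element: {token!r}")
        if m["lo"] is not None:
            # Bounded gap: between lo and hi arbitrary symbols.
            parts.append(f".{{{m['lo']},{m['hi']}}}")
        else:
            # Character class or single residue, with optional repeat count.
            cls = f"[{m['cls']}]" if m["cls"] else m["one"]
            rep = m["crep"] or m["orep"]
            parts.append(cls + (f"{{{rep}}}" if rep else ""))
    return "".join(parts)


pattern = cbg_to_regex("[RK] x(2,4) [DE] x(2,3) Y")
print(pattern)                                   # [RK].{2,4}[DE].{2,3}Y
print(bool(re.search(pattern, "AARGGDGGYA")))    # True: R, 2 gaps, D, 2 gaps, Y
print(bool(re.search(pattern, "ARGDGGY")))       # False: first gap is too short
```

A software regular-expression engine processes the text sequentially; the circuits described in the following sections instead evaluate the whole pattern in one clock cycle.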


2 Naive System Architecture

Figure 1 shows a simple version of our proposed architecture. The external interface has: a 5-bit data input; a clock input; a reset input; and one output, which indicates whether there is a match. Amino acids are encoded in 5 bits. In each clock cycle, one amino acid is fed in. The Match output goes high exactly if the buffers in the circuit contain a sequence that matches the pattern.

Figure 1: General Architecture of System (block diagram: inputs In[4:0], Clk and Reset; 5-bit wide shift registers; Matcher Circuit; output Match)
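The behaviour of this architecture can be imitated in software. The sketch below is only a simulation under stated assumptions, not the synthesised circuit: the 5-bit codes (alphabetical index of the 20 standard amino acids), the convention that a match is reported on the clock cycle in which its last symbol enters, and the names `NaiveMatcher` and `expansions` are all hypothetical. The matcher is modelled as an OR over every choice of gap lengths of an AND over class tests on fixed register positions, which is one plausible reading of the boolean expression described in the text.

```python
from itertools import product

AMINO = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i for i, aa in enumerate(AMINO)}   # assumed 5-bit codes 0..19

# CBG [RK] x(2,4) [DE] x(2,3) Y: a set is a class, a (lo, hi) tuple is a gap.
PATTERN = [set("RK"), (2, 4), set("DE"), (2, 3), set("Y")]


def expansions(pattern):
    """Yield every fixed-length form of the pattern, oldest symbol first.

    Each position is a set of allowed codes, or None for a gap position.
    """
    choices = []
    for element in pattern:
        if isinstance(element, tuple):
            lo, hi = element
            choices.append([[None] * k for k in range(lo, hi + 1)])
        else:
            choices.append([[{CODE[a] for a in element}]])
    for combo in product(*choices):
        yield [position for part in combo for position in part]


class NaiveMatcher:
    def __init__(self, pattern):
        self.forms = list(expansions(pattern))
        self.n = max(len(form) for form in self.forms)   # number of registers
        self.reset()

    def reset(self):
        self.regs = [None] * self.n     # regs[0] holds the newest symbol

    def clock(self, symbol):
        """Shift one symbol in and return the Match output for this cycle."""
        self.regs = [CODE[symbol]] + self.regs[:-1]
        return any(self._form_matches(form) for form in self.forms)

    def _form_matches(self, form):
        # Align the last element of the form with the newest register.
        for offset, allowed in enumerate(reversed(form)):
            code = self.regs[offset]
            if code is None:                    # register not yet filled
                return False
            if allowed is not None and code not in allowed:
                return False
        return True


matcher = NaiveMatcher(PATTERN)
outputs = [matcher.clock(symbol) for symbol in "AARGGDGGYA"]
print(matcher.n)                # 10 registers for the longest form
print(outputs.index(True))      # 8: Match goes high when the Y is shifted in
```

In this sketch the number of fixed-length forms is the product of the gap widths, which gives an informal picture of why patterns with several large gaps lead to large boolean expressions.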

A series of shift registers, each 5 bits wide, stores a segment of the sequence. On each clock cycle a new piece of data enters from the left, and the data moves right one step through the shift registers. If the longest minimal sequence that matches the CBG is of length n, there will be n registers. The other main part of the circuit is a piece of combinational logic that does the actual matching. It uses the data stored in the shift registers to decide whether there is a match. Although the general architecture is the same, for each CBG we synthesise a new combinational circuit specialised for matching exactly that pattern. The overall system performance is determined by the maximum speed at which the system clock can be set; the limiting factor is the longest combinational delay in the Matcher part of the circuit.

Synthesis of matcher circuit: Details of the synthesis of the combinational circuit are omitted for space reasons (see [2] for a complete description). The basic idea is that we can convert a CBG into a semantically equivalent boolean expression, which can be converted using standard techniques into a circuit for programming the FPGA.

Experimental Results: We tested these ideas on the Prosite database [4]. Of the 1568 patterns in the database, 1332 (85%) are represented by CBGs. We first measured the size of the boolean expressions required to represent all the CBGs. Size was measured by the size of the data structures used for the boolean expressions, binary decision diagrams (BDDs), a compact data structure for boolean expressions. The smallest CBG required 20 BDD nodes and the largest required 34736, the average being 264. Approximately 75% of the patterns require fewer than 100 BDD nodes, 93% fewer than 200 BDD nodes, 98% fewer than 1000 nodes, and only 0.3% require more than 10000 nodes. See [2] for more detail. FPGA utilisation and performance correlate with the size of the BDDs.
We took representative CBGs to determine the performance of a range of circuits. In outline, on a high-end Xilinx XCV2vp50 FPGA, matching patterns represented with fewer than 200 nodes can be done in approximately 9ns using 0.2% of the FPGA capacity, and matching patterns represented with fewer than 2000 nodes can be done in less than 19ns using 3% of the FPGA capacity. The most complex patterns can be done in 27ns using about 18% of the FPGA capacity. Thus, we expect to be able to match 93% of the patterns at a throughput in excess of 100 million amino acids per second, and the worst case throughput should be more than 30 million amino acids per second. Synthesis for the larger patterns was expensive. To some extent this is not a problem, as in many applications the patterns can be precomputed. However, there are cases where we shall need to do better. Section 3 proposes a better architecture.

The best comparative work is that of Navarro and Raffinot [3], who designed a sophisticated bit-parallel algorithm in software. On a 500MHz Pentium III, they were able to match patterns in the Prosite database at a rate of between 5Mb/s and 20Mb/s, but were unable to cope with 11% of the cases. For the same patterns we would match at over 100Mb/s. However, comparison is difficult because of differences in technology and the age of technology. What we do claim is that our basic approach is at least as competitive as theirs, not limited by the size of the pattern in the same way, and, as discussed in the next sections, easier to parallelise further.
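The throughput figures quoted above follow from the reported delays if one amino acid is consumed per clock cycle and the clock period equals the reported matching delay; that equality is an assumption of this small worked calculation, and the function name is hypothetical.

```python
def throughput_per_second(delay_ns: float) -> float:
    """Symbols per second when one symbol is consumed per clock of delay_ns."""
    return 1e9 / delay_ns


# Delays reported in the text for the three classes of pattern.
for label, delay_ns in [("< 200 nodes", 9), ("< 2000 nodes", 19), ("most complex", 27)]:
    rate = throughput_per_second(delay_ns) / 1e6
    print(f"{label:>13}: {rate:.0f} million symbols/s")
```

Under that assumption the three cases give roughly 111, 53 and 37 million symbols per second, consistent with the bounds of more than 100 million and more than 30 million stated above.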

3 A Better Architecture

For about 5% of the CBG patterns, the performance of our method, while still feasible and yielding good results, deteriorated, and synthesis time became a problem. The problem lies with patterns that allow several relatively large gaps, which made our boolean expressions grow quickly. We propose complementing the circuit with ‘memory’. We break up a CBG into segments, each corresponding to a maximal sub-sequence of the CBG with no gaps. We synthesise a combinational matcher for each segment. For each gap, a shift register records how recently the segment to its left was matched. Suppose a gap is x(r,s); then the shift register is of size s. When a segment to the left of the gap matches, we load 1’s into positions r through s of the register. On each clock cycle, the contents of the register shift down by 1. When a 1 appears in the lowest position of the register, it means that the segment to the left of the gap matched at the appropriate time, and provided that everything to the right of the gap now matches the input data, we have an overall match.

Figure 2: A more sophisticated architecture (block diagram: gap memory registers G1 and G2; combinational matchers M1, M2 and M3; 5-bit wide input data shift registers)

Figure 2 shows what the circuit for the CBG [RK] [K] [AS] x(6,8) [R] [K] x(2,3) [R](4) would look like. If the combinational matcher M3 matches and the lowest bits of registers G1 and G2 are high, a match has taken place. (Since data flows from left to right, the rightmost register contains the earliest amino acid currently stored in the circuit.) Our preliminary experimental results showed that for the largest CBG the size of the largest individual matcher was small (a few hundred nodes), and hence we expect to be able to reduce synthesis time and FPGA utilisation substantially, as well as to obtain a modest improvement in processing time.
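The gap-register idea can be illustrated with a short software simulation of the example CBG. This is a sketch of one possible interpretation, not the circuit of Figure 2: here each gap register is only consulted by the segment immediately to its right, every segment matcher inspects the most recent input symbols, and the window of 1's is therefore loaded at positions offset by the length of the following segment. These timing choices, and the names `SegmentedMatcher` and `match_ends`, are assumptions made for illustration.

```python
# The example CBG [RK] [K] [AS] x(6,8) [R] [K] x(2,3) [R](4), split into
# gap-free segments (one string of allowed residues per position) and gaps.
SEGMENTS = [["RK", "K", "AS"], ["R", "K"], ["R", "R", "R", "R"]]
GAPS = [(6, 8), (2, 3)]


class SegmentedMatcher:
    def __init__(self, segments, gaps):
        self.segments = segments
        self.gaps = gaps
        self.depth = max(len(seg) for seg in segments)
        self.window = ""                   # stands in for the input shift registers
        self.gap_regs = [0] * len(gaps)    # one bit vector per gap, bit 0 is lowest

    def _segment_matches(self, seg):
        tail = self.window[-len(seg):]
        return len(tail) == len(seg) and all(c in cls for c, cls in zip(tail, seg))

    def clock(self, symbol):
        """Consume one symbol; return True if the whole CBG matches ending here."""
        self.gap_regs = [g >> 1 for g in self.gap_regs]     # shift down by 1
        self.window = (self.window + symbol)[-self.depth:]
        ok = self._segment_matches(self.segments[0])
        for i, (lo, hi) in enumerate(self.gaps):
            if ok:
                # Everything left of gap i has just matched: load a window of
                # 1's, offset by the length of the segment that follows.
                offset = lo + len(self.segments[i + 1])
                ones = (1 << (hi - lo + 1)) - 1
                self.gap_regs[i] |= ones << offset
            # The next segment counts only if the lowest bit of the gap
            # register shows the left part matched at an admissible distance.
            ok = bool(self.gap_regs[i] & 1) and self._segment_matches(self.segments[i + 1])
        return ok


def match_ends(text):
    """Positions in text at which a match of the example CBG ends."""
    matcher = SegmentedMatcher(SEGMENTS, GAPS)
    return [i for i, symbol in enumerate(text) if matcher.clock(symbol)]


print(match_ends("KKA" + "G" * 6 + "RK" + "GG" + "RRRR"))   # [16]
print(match_ends("KKA" + "G" * 5 + "RK" + "GG" + "RRRR"))   # []: first gap too short
```

In this sketch each segment matcher only tests a handful of positions, and the cost of a gap is a register of a few bits rather than a multiplication of the boolean expression, which is the effect the proposed architecture aims for.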


4 Conclusion

The FPGA-based solutions to the problem of matching CBG patterns achieve good throughput and utilisation results. Given these positive results, there are a number of areas that we would like to take forward. The immediate work to be done is the implementation of the newly proposed architecture together with proper experimentation. There is immense opportunity for improving parallelism. It would be possible either to parallelise one search even further or, more likely, to do parallel pattern matching (searching for many patterns at the same time). Initial results indicate that it would be possible to search for hundreds (and maybe all) of the CBG patterns in the Prosite database at the same time. The challenge is likely to be outputting results rather than the matching itself. There is also room for improving how VHDL synthesis is done, improving parallelism and, critically, reducing synthesis costs. Cleverer encoding of amino acids that takes their properties into account may mean that a redundant encoding leads to more efficient matching. Finally, given the good performance obtained on CBGs, it seems worth exploring what other types of patterns the method can deal with effectively.

References

[1] S. Hauck. The Roles of FPGAs in Reprogrammable Systems. Proceedings of the IEEE, 86(4):615–638, April 1998.
[2] S. Hazelhurst and G. Aikhorin. Fast matching of CBG patterns with applications to protein matching. Technical Report TR-Wits-CS-2002, School of Computer Science, University of the Witwatersrand, 2002. ftp://ftp.cs.wits.ac.za/pub/research/reports/TR-Wits-CS-2002-5.ps.gz/.
[3] G. Navarro and M. Raffinot. Fast and simple character classes and bounded gaps pattern matching, with application to protein searching. In Proceedings of the Fifth Annual International Conference on Computational Biology, pages 231–240. ACM Press, 2001.
[4] Swiss Institute of Bioinformatics. PROSITE database of protein families and domains. http://tw.expasy.org/prosite. Last accessed: 10 October 2002.
