Cluster of Reconfigurable Nodes for Scanning Large Genomic Banks

S. Guyetant a, M. Giraud a, L. L'Hours a, S. Derrien a, S. Rubini b, D. Lavenier a, F. Raimbault c

a IRISA / CNRS / Université de Rennes 1 – 35042 Rennes Cedex, France
b Université de Bretagne Occidentale – BP 809, 29285 Brest Cedex, France
c Université de Bretagne Sud – BP 573, 56017 Vannes Cedex, France

Abstract

Genomic data are growing exponentially and are searched daily by thousands of biologists. To reduce the search time, efficient parallelism can be exploited by dispatching the data among a cluster of processing units able to scan their own data locally and independently. While PC clusters are well suited to this type of parallelism, we propose to replace the PCs with reconfigurable hardware closely connected to a hard disk. We show that low-cost FPGA nodes interconnected through a standard Ethernet network may advantageously compete against high performance clusters. A prototype of 48 reconfigurable processing nodes has been tested on content-based similarity and pattern search.

Key words: Cluster, Genomic Banks, Reconfigurable Architecture, Similarity Search, Pattern Search

1 Introduction and Background

Biologists scan genomic banks daily to find DNA or protein sequence similarities. The rationale behind this fundamental operation mainly comes from the theory of evolution, which states that genes are derived from common ancestors. Extracting some kind of similarity between an unknown gene and a well characterized gene may hence bring important clues to direct further investigation. The distance between genes can be measured as the minimal number of mutations, insertions or deletions of nucleotides needed to transform one gene into the other, and can reflect their evolutionary relatedness. From a more pragmatic point of view, scanning a genomic bank simply consists in pinpointing regions of common origin, or domains, which may in turn coincide with regions of similar structure or similar function [Vin01].

One step further is to characterize these domains with specific patterns, such as the PROSITE patterns [NCV+ 04], and to exhibit the DNA or protein sequences housing these patterns. Whatever the type of search performed on the genomic banks – similarity, pattern search, or other – one has to face their exponential growth. With the systematic sequencing of complete genomes (more than 900 ongoing projects in April 2004 [GOL]) and the numerous EST projects, which generate huge banks of annotated short sequences, the volume of genomic data is almost doubling every year. This growth is to be compared to Moore's law, which claims that computing power (at fixed cost) only doubles every 18 months [Tuo02]. This observation has an important consequence: improvements in microprocessor performance alone are not sufficient to keep the time required to perform a complete scan of the genomic banks constant. Hence, the only way to cope with the exponential growth of the banks is to use parallel and/or application-specific high-performance computing systems.

Today, the most popular software for scanning genomic banks is blast [SGM+ 90,AMS+ 97]. blast takes as input a query sequence and a genomic bank, and outputs a list of alignments. These alignments represent regions, in both sequences, sharing statistically significant similarity. Basically, alignments are built from the raw text of the DNA or protein sequences, which are stored in banks implemented as low-structured flat files. The search mainly consists in reading strings of characters sequentially and performing a pairwise comparison with the query sequence.

1.1 Complete scans and heuristics

Obviously, content-based approaches require scanning the whole bank. In order to speed up the search, heuristics based on genomic data properties have been proposed. They aim to decrease the time spent in the pairwise comparison compared to the well-known dynamic programming algorithm [SW81], which has a quadratic complexity. Basically, heuristics act in two steps: (i) detection of hits likely to generate alignments; (ii) extension of these hits into significant alignments (see figure 1). A hit, as in blast, can be a common word (the seed) of W characters between the two sequences or, as in PatternHunter [MTL02], a more elaborate seed. Hash table techniques are generally used to quickly detect hits. The extension phase starts from the seed and tries to widen the match in both directions using mismatch and deletion/insertion edit operations.

The overall execution time greatly depends on the sensitivity – roughly, the ability to pick up pertinent data – of the first step. A high sensitivity generally leads to a high number of hits to extend: the computation time is high, but results of high quality (many significant alignments detected) are obtained. On the other hand, a low sensitivity restricts the number of hits to extend, and the probability of missing good hits is high: the computation time is shorter, but the quality of the results is poorer. In this scheme, high sensitivity and short execution time remain two antagonistic criteria.

Figure 1. The search of alignments proceeds in two steps: (i) identification of common words (hit detection) between the query sequence and all the sequences of the bank; (ii) generation of significant alignments (hit extension) by exploring the hit neighborhood.
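To make the hit-detection step concrete, here is a minimal software sketch of the hash-table technique mentioned above (an illustration only, not the RDISK implementation): all W-mers of the query are indexed in a direct-addressed table, and the bank is then scanned linearly, every shared word being reported as a hit to be extended. W = 8 is used here to keep the table small; blast's default for DNA is 11.

```c
#include <stdint.h>

#define W 8                                   /* seed length (blast DNA default is 11) */
#define TABLE_SIZE (1u << (2 * W))            /* 4^W possible words of length W */

static unsigned code(char c)                  /* 2-bit nucleotide code */
{
    switch (c) { case 'A': return 0; case 'C': return 1;
                 case 'G': return 2; default:  return 3; /* T */ }
}

/* qpos[w] = one position of word w in the query (0 = absent); a simplified
   stand-in for a full hash table keeping every position. */
static uint32_t qpos[TABLE_SIZE];

void index_query(const char *q, int n)
{
    uint32_t w = 0, mask = TABLE_SIZE - 1;
    for (int i = 0; i < n; i++) {
        w = ((w << 2) | code(q[i])) & mask;   /* rolling 2W-bit word */
        if (i >= W - 1) qpos[w] = i - W + 2;  /* 1-based start of the word */
    }
}

/* Scans the bank and calls hit() for every word shared with the query. */
long scan_bank(const char *bank, long n, void (*hit)(long qp, long bp))
{
    long hits = 0;
    uint32_t w = 0, mask = TABLE_SIZE - 1;
    for (long i = 0; i < n; i++) {
        w = ((w << 2) | code(bank[i])) & mask;
        if (i >= W - 1 && qpos[w]) { hit(qpos[w], i - W + 2); hits++; }
    }
    return hits;
}
```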

1.2 Scanning genomic banks on a PC cluster

When sensitivity is required, the high computational power provided by clusters of PCs is an efficient way to shorten the execution time. The bank can be dispatched among the cluster, each node storing a fraction of this bank on its local disk. A search operation then consists in broadcasting a query sequence over the cluster, followed by a parallel search, each node working independently on its own bank subset. The results are then forwarded to a single node and merged into a list of alignments. Such a parallelization scheme allows a very efficient implementation, since tasks are independent and do not require inter-node communication, except for the initialization step (broadcasting the query sequence) and the final step (merging the results). Besides, the traffic involved by these two steps remains very low, and the communication overhead is therefore negligible. Furthermore, if the bank is equally dispatched among the cluster, the scanning time is roughly the same for all the nodes.

While this approach remains very attractive, we believe that using application-specific cluster nodes tailored to genomic – or other – content-based search applications can further improve execution time. Having a closer look at our application, we can make the following remarks:

• Hit detection is a quick operation: the query is encoded into a hash table, allowing fast detection of words shared with the sequences of the bank. Furthermore, the hash table entirely fits into the primary memory.
• For a sensitive search, scanning a bank is a compute-bound problem: many hits are generated, leading to many hit extensions. This is a much more time-consuming task than the previous one, since it often involves dynamic programming algorithms. Speeding up the process while preserving a high quality means generating fewer hits with a higher probability of having them housed in alignments. In other words, the selectivity of the hit detection must be enhanced. Unfortunately, computing better hits also takes more time: the time saved in the extension phase is lost in detecting better hits.

1.3 The RDISK project

Our proposal is to move a significant amount of computational power directly near the data source, in order to increase the selectivity – and, of course, to preserve the sensitivity. The idea is to filter genomic data at the disk output rate and to report only relevant hits for further processing when needed. Sensitivity and selectivity are provided by mapping a highly selective filter onto reconfigurable hardware able to process data on-the-fly. Thus, instead of having a cluster of standard PCs, we propose a cluster of application-specific processing nodes based on hard disks coupled with FPGAs, silicon chips able to re-wire their inner logic almost instantaneously [RGSV93]. In this scheme, parallelism comes both from:

(1) concurrent data access: as on a cluster of PCs, the bank is dispatched over the local disks and is thus accessed concurrently;
(2) hit detection: mapping the filter onto reconfigurable hardware makes it possible to exploit hardware specialization and low-level parallelism.

A 48-board system called RDISK-48 has been assembled and tested on various genomic search applications. Data are read and filtered on-the-fly, allowing for example the human genome to be entirely scanned in less than 2 seconds.

1.4 Related work

Conventional biocomputing accelerators

Apart from classical clusters, for which parallelized algorithms have been developed – the life sciences departments of IBM, Sun, HP, SGI, Dell or Apple sell clusters made of rackable nodes with two general-purpose processors, such as Opterons or PowerPCs, and 1 GB or more of memory, for around $4000 – major biocomputing centers and companies are equipped with specialized hardware to accelerate searches; three main commercial products share the market.

• Timelogic [TML] proposes the DeCypher solutions, a wide range of servers, from a Sun Blade to a Sun Fire, including 1 to 8 DeCypher accelerator arrays. The accelerators are based on reconfigurable technology, which gives great flexibility.
• GeneMatcher2, from Celera Genomics [PAR], is a Linux cluster with 8 up to a thousand general-purpose processors. Its specialized nodes are not reconfigurable: 144 application-specific circuits are used as accelerators for dynamic programming algorithms. For executing blast they are useless, and the "massively parallel supercomputer" is no more than a cluster.
• The last one is BioXL/H, from Biocceleration [BCL]. BioXL/H is a network device that connects 4 to 32 accelerator boards and one hard drive around a PCI bus. Although it is based on reconfigurable technology, it is also specialized in dynamic programming algorithms only.

All these products suffer from limited data bandwidth, and their impressive benchmarks are obtained in conditions where data are stored in main memory and intensively re-used to minimize hard drive accesses.

Intelligent disks

On the other hand, the idea of adding computing power to mass storage devices is not new. The seminal work in this research area dates back to the late seventies, when the first "database machines" were proposed [Bab79]. They were highly specific systems, with special-purpose hardware such as custom drives and hardwired algorithms attached to each disk arm. Their high development cost and their lack of flexibility, compared to general-purpose processors, led to their demise, and in the late eighties this research field was considered to be dead. As summarized by Michael Stonebraker in his 1988 book [Sto88]: "The history of DBMS [1] research is littered with innumerable proposals to construct hardware database machines to provide high performance operations. In general these have been proposed by hardware types with a clever solution in search of a problem on which it might work." However, with the widespread use of powerful embedded hard-disk controllers and the higher level of transistor integration, this statement deserved to be reevaluated. There was hence a rebirth of this concept in the late nineties, when three similar projects – IDisk [KPH98], Active Disk [RFGN01] and Smart Disk [MKC01] – studied how to move computation into hard drive controllers, and how the performance of such systems would scale with the number of disks.

[1] DBMS: Database Management System

Their justification was that a lot of processing power is wasted in hard disk controllers, because even low-cost 16-bit micro-controllers are more powerful than their task requires. Besides, experiments with off-the-shelf workstations networked in a shared-nothing fashion have shown that communications between nodes diminish performance. If accessing the software that runs in the disk controllers were possible, this would of course be the most cost-effective solution; but to our knowledge, no commodity hard disk drive vendor allows it. Today, the capacity of hard drives continues to grow faster than their throughput, so that data-intensive applications are still I/O-limited. A project recently started at Washington University [WCIZ03], and currently under development in the start-up company Data Search Systems [DSS], targets the processing of unstructured data. Data are sniffed on the IDE bus at the output of a disk drive so that they can be filtered, compressed or encrypted on-the-fly in an FPGA component. This project is the closest to ours, except that their target application is data mining on business or intelligence data.

The rest of the paper is organized as follows. Section 2 gives an overview of the RDISK-48 hardware. Section 3 deals with the system environment. Section 4 is devoted to the implementation of two genomic search applications: a blast-like similarity search and a pattern search. Section 5 concludes this paper.

2 Underlying Hardware

2.1 RDISK-48 cluster overview

Figure 2 shows an overview of the RDISK-48 cluster. It is composed of 48 identical nodes, each of them housing a reconfigurable processing engine (RPE) tightly coupled to an IDE hard drive. The interconnection system is based on low-cost 100 Mb Ethernet technology and connects each node to a front-end workstation. The RPE (see section 2.2) is a dedicated hardware architecture used to perform on-the-fly processing of a data set as it is read from the disk. It then forwards the relevant data (or some identifiers) to the host machine for a post-processing phase. Practically speaking, the RPE acts as a filter whose purpose is to send to the host the fraction of the data set which is the most likely to be relevant for the application at hand. Obviously, the performance (and hence the viability) of our approach is very dependent on the amount of communication between the host and the nodes.


Figure 2. The RDISK-48 cluster: 48 reconfigurable processing units are interconnected through a 100 Mb Ethernet switch. For proper operation, the filtering efficiency must be greater than 1/150.

It is well known that for cluster machines the network is very often a performance bottleneck. In our case, to prevent network saturation, the aggregated bandwidth of the cluster has to remain far below the maximum capacity of our 100 Mb Ethernet network. In practice, since all communications involve the host machine, this puts a constraint on the aggregated output bandwidth of the RDISK nodes. Given that the host processor Ethernet link is able to absorb up to 5 MB/s (half of the peak 100 Mb/s), this constraint translates, for a 48-board cluster, into a per-node maximum output bandwidth in the range of 100 KB/s. In other words, since on each node data is read from the disk at an average rate of 15 MB/s, the filter implemented on the RPE must reduce the data volume by a factor of at least 150 (an output/input ratio of at most 1/150).
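In summary, the per-node output budget and the required filtering ratio follow directly from the figures above:

\[
\frac{5\ \mathrm{MB/s}}{48\ \text{nodes}} \approx 100\ \mathrm{KB/s\ per\ node},
\qquad
\frac{100\ \mathrm{KB/s}}{15\ \mathrm{MB/s}} \approx \frac{1}{150}.
\]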

2.2 RDISK node

The RDISK node is built around an FPGA device in which a Reconfigurable Processing Engine (RPE) is implemented. This RPE is the main component of our system and is responsible for handling the network traffic, accessing data on the hard drive and performing the filtering step. The main idea behind this architecture is to provide a low-cost system compared to a standard PC cluster node. We therefore decided to base our system on a low-end FPGA device (a Spartan-II 200, under $25) used as the core component of the system. This device provides the equivalent of 200,000 system gates in the form of 5000 logic cells and 7 kB of embedded memory blocks. The FPGA hosts the hardware filter design and is therefore directly connected to the hard drive IDE bus.

Figure 3. A top level view of a RDISK node. The PCB form factor was chosen to allow the rack-mounting of 8 nodes on a single rack line.

Figure 4. Front and rear view of the 48 node RDISK cluster

In addition to this FPGA, a node contains several other devices:

• The node Ethernet connectivity is handled by a low-cost (under $7) 100 Mb Ethernet controller, which is connected to the core FPGA through a simple 16-bit wide ISA bus interface.
• Because of the limited FPGA on-chip memory capacity, we use an additional 32 MB SDRAM chip (under $7), whose purpose is to store the System on Chip firmware and to buffer the Ethernet packets.
• Finally, the dynamic reconfiguration is handled by a small MCU, which shares the IDE bus with the FPGA and is in charge of reconfiguring the FPGA upon request from the host, using one of the bitstreams stored on the disk.

The cost of a single RDISK node was estimated at roughly $200 (including the PCB and assuming a mid-range hard drive), which is almost 5 times cheaper than a typical PC cluster node.

2.3 Reconfigurable System on Chip

The core of the RDISK system is a reconfigurable System on a Chip implemented on the FPGA, as depicted in figure 5. The deliberate choice of a low-price FPGA forced us to a small-footprint SoC implementation. We decided to base our system on the XR16 processor designed by Jan Gray [Gra00]. This simple 16-bit processor uses less than 300 logic cells and runs at 40 MHz. To improve its original performance, we added an instruction cache memory and a memory controller to enable access to the external SDRAM chip. Among other tasks, this CPU is in charge of:

• handling the network protocol (we use a custom protocol built on top of UDP);
• initiating the hard drive data transfers;
• controlling and monitoring the hardware filter;
• forwarding reconfiguration commands issued by the host to the configuration manager.

The current version of the SoC fits in less than half of the available FPGA resources, leaving the rest (roughly 3000 LUTs and 10 BlockRAMs) for the filter implementation. Compared to existing commercial soft-core solutions, our implementation remains very competitive, especially in terms of resource usage (as an example, a similar design using the Xilinx MicroBlaze soft-core CPU would hardly fit in our FPGA).


Figure 5. The RDISK System on Chip internal architecture



Figure 6. The design template for the hardware filter. The filter module is directly connected to the IDE controller through a dedicated channel which allows streaming read operation at the disk maximum bandwidth (in PIO mode 4)

As we need to be able to take advantage of the maximum hard-drive throughput, the hard-drive controller is implemented as a complex autonomous state machine that handles most of the IDE protocol and does not need external buffer memory. The controller can be used in two modes:

• In the filtering mode, the hard-disk data bus is directly connected to the filter input data port through a simple on-chip FIFO channel. This mode allows very fast and continuous read operations (up to 15 MB/s in continuous read for PIO mode 4) without any CPU intervention.
• In the control mode, the hard disk is controlled by the CPU. This mode is currently used for updating the database content. It offers much lower performance, since both the hard-drive accesses and the network commands are handled by the CPU.

2.4 Hardware filter design

As shown in figure 6, the hardware module implementing the filter follows a predefined interface: input data is read from the hard disk through a dedicated channel, while filtered data is stored in a small 256x16 FIFO to be read later by the XR16 and forwarded to the host. Two dedicated ports allow the CPU to control and monitor the filter execution. These ports are mapped in the CPU address space and are made accessible to the front end computer through our network interface protocol. The host computer can, at any time, change some of the filter execution parameters (for example to increase or decrease the filter selectivity).
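A hypothetical sketch of how such memory-mapped control and monitor ports could look from the CPU side is given below; the register layout, field names and base address are illustrative assumptions, not the actual RDISK design (the text only states that two dedicated ports, mapped in the CPU address space, let the XR16 and, indirectly, the host control the filter and adjust parameters such as its selectivity).

```c
#include <stdint.h>

/* Assumed register map of the hardware filter as seen by the XR16. */
struct filter_regs {
    volatile uint16_t control;    /* start / stop / reset the filter          */
    volatile uint16_t status;     /* running flag, result-FIFO fill level     */
    volatile uint16_t threshold;  /* score threshold, i.e. filter selectivity */
    volatile uint16_t fifo_data;  /* read side of the 256x16 result FIFO      */
};

#define FILTER_BASE ((struct filter_regs *)0x8000u)   /* hypothetical address */

/* Relayed from the front end over the network protocol: changing the
   threshold increases or decreases the filter selectivity at run time. */
static inline void filter_set_threshold(uint16_t t)
{
    FILTER_BASE->threshold = t;
}
```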

3 Underlying System

3.1 Programming environment

From a user point of view, an application running on the RDISK-48 system is divided into two parts: a hardware filter replicated on each node of the cluster, and a post-processing application running on the front end machine (figure 7). The firmware of the node controls all the data transfers between the disk, the filter and the host through the Ethernet network. Along with the SoC, it acts as an application framework which is reused in every new RDISK application.

Figure 7. Stream model of the RDISK cluster.

Hence, to program a new application, only two tasks need to be specified: the filtering task and the post-processing task. The first one is described using a hardware description language such as VHDL; the second one is simply written in C. Figure 8 summarizes the development process. The filter specification (gray box) is merged with the predefined SoC specification (XR16, IDE and Ethernet drivers) through standard FPGA CAD tools; the resulting bitstream is thus application dependent. The C specification describes how results coming from the nodes are merged, post-processed and displayed. The program is compiled and linked with a specific library including routines for reconfiguring the cluster, initializing the nodes, launching the applications, performing communications, etc.
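As an illustration of the post-processing side, here is a hypothetical C skeleton of such a host program; none of the lib-rdisk function names below come from the source, they merely stand in for the routines the library is said to provide (reconfiguring the cluster, initializing the nodes, launching the application and performing communications).

```c
#include <stddef.h>

struct rdisk_cluster;                              /* opaque handle, assumed  */
struct rdisk_result { int node; long offset; };    /* hit identifier, assumed */

/* Assumed lib-rdisk entry points (declarations only). */
struct rdisk_cluster *rdisk_open(const char *config);
int  rdisk_reconfigure(struct rdisk_cluster *, const char *bitstream);
int  rdisk_start(struct rdisk_cluster *, const char *bank,
                 const void *params, size_t len);
int  rdisk_next_result(struct rdisk_cluster *, struct rdisk_result *out); /* 0 when all nodes are done */
void rdisk_close(struct rdisk_cluster *);

void post_process(const struct rdisk_result *r);   /* application specific */

int run_query(const char *bitstream, const char *bank,
              const void *params, size_t len)
{
    struct rdisk_cluster *cl = rdisk_open("rdisk48.conf");
    if (!cl) return -1;
    rdisk_reconfigure(cl, bitstream);              /* cached on each node's disk */
    rdisk_start(cl, bank, params, len);            /* broadcast query parameters */
    struct rdisk_result r;
    while (rdisk_next_result(cl, &r))              /* hits processed as they arrive */
        post_process(&r);
    rdisk_close(cl);
    return 0;
}
```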


Figure 8. Development flow on the RDISK cluster. The programmer of a new application must only write a hardware description of its filter (filter.vhd) and a program to post-process the hits (post-filter.c).

3.2 Initialization, Configuration and Execution

When the cluster is powered on, each node starts the same sequence of operations. The micro-controller reads a bootstrap configuration bitstream from the hard drive and configures the FPGA. Once the FPGA is ready to run with its new configuration, the external micro-controller releases control of the hard drive bus and moves to a sleep mode, waiting for a new reconfiguration request. The bootstrap hardware configuration contains a minimalist XR16-based System on Chip whose firmware fits within the FPGA on-chip memory. This configuration provides only two types of service to the host:

(1) It gives the host read and write access to the hard drive, so the RDISK node can handle all the file system user commands.
(2) The host can ask a node to reconfigure its RPE using one of the configuration files stored on its attached disk, whose identifier is passed along with the command.

This design serves as a safe configuration and is stored at a specific hard-drive location which cannot be written by the user. At any time, upon request of the host or after a timeout, the RDISK node must be able to return to this safe configuration.

Executing a query requires three pieces of information: the bitstream containing the hardware filter to be used, the file containing the database, and some additional parameters (for example a query sequence). The file system (see section 3.3) first checks that a local copy of the bitstream is present on each RDISK node's hard disk. If not, the configuration is sent through the network and copied to the drive; the configuration partition on the drive hence works like a configuration cache. The time to configure a node from a local bitstream is about 800 ms, compared to the few seconds needed by a network broadcast. Once all nodes have a local copy of the current configuration bitstream, the host sends a reconfiguration command. The programmable filters are then initialized and started, waiting for data. The file system sends the physical location and size of the bank to be scanned to the hard drive controller, which starts feeding the filter. Result data from the filter are stored in a FIFO memory and sent to the host. The FIFO is regularly flushed to ensure that the host has a steady post-processing task. The host receives two kinds of data: results from the filtering, which are processed as soon as they arrive, and control data indicating which nodes have finished their jobs. An application is considered done when all nodes have sent such a message.

3.3 File System

The file system manages three distinct types of data: (1) bitstream files containing FPGA configurations, (2) binary files containing XR16 firmware code, and (3) data files containing genomic banks. Although using an existing general-purpose file system would have been possible, we designed our own file system to take advantage of both the targeted applications and the specificities of the RDISK architecture:

• I/O access: as a sequential scan over several GB of data is the standard operation, the file system is optimized for this type of access. Data are stored on contiguous drive sectors. Such a mapping reduces the drive head moving overhead (during sequential access to a file) and simplifies control operations when supplying a stream of data directly to a hardware filter. Obviously, contiguous storage creates external fragmentation, i.e. unusable sectors between files, but the very low modification rate of the file location map keeps the fragmentation at a fair level.
• Node simplicity: to minimize the file management cost on the embedded XR16 processors, meta-data for the complete RDISK system reside on the front end computer as standard Linux files. The XR16 processor only deals with low-level operations, such as read or write accesses to physically addressed sectors, while the file attributes are updated by the front end processor. A simple network protocol, inspired by the TFTP protocol, allows the host to communicate with the nodes by sending low-level commands through the Ethernet network.

Users initiate operations on the RDISK files using a set of commands which are functional copies of the commands ls, cp, rm, chmod and newfs. In this file system, each RDISK node is seen as a distinct storage device. Hence, the designation scheme of a file includes the RDISK board number where the file is stored, followed by the file type (bitstream, XR16 program, or database) and the name.
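For illustration only, a possible way to build such file designations is sketched below; the exact syntax is an assumption (the text only specifies that a designation includes the board number, the file type and the name).

```c
#include <stdio.h>

/* Hypothetical designation scheme: board number, then file type, then name. */
enum rdisk_ftype { FT_BITSTREAM, FT_XR16_PROG, FT_DATABASE };

static const char *ftype_str[] = { "bit", "xr16", "db" };

static void rdisk_path(char *buf, size_t n, int board, enum rdisk_ftype t,
                       const char *name)
{
    snprintf(buf, n, "rdisk%02d:%s/%s", board, ftype_str[t], name);
}

int main(void)
{
    char p[64];
    rdisk_path(p, sizeof p, 7, FT_DATABASE, "genbank_primates");
    puts(p);   /* prints: rdisk07:db/genbank_primates */
    return 0;
}
```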

4 Experimentation

We now present two applications that are well suited to the RDISK prototype. In the similarity search application, the goal is to find in a bank the sequences closest to a short query sequence. This application uses pre-defined filters that are initialized with the query sequence. In the pattern search application, we want to retrieve all the sequences of the bank that match a pattern, possibly with some errors. Here the filters have to be synthesized for each pattern.

4.1 Similarity search

Introduction

Given a query sequence, the similarity search aims to return the sequences of a genomic bank sharing one or more nearly identical regions with it. The result is a list of scored alignments reflecting the closeness to the query sequence. The alignment scores depend on the costs associated with matches, mismatches and gaps. As an example, figure 9 depicts an alignment having a score of 22.

ACTTTCTATTACCTAGTAGAGCAACGACTAACTGGATTTCCCTTTCTAACGGACC--TAGCC
ACTTT--ATTACCGAGTATAGCAACTACTAATTGGATTACCCTTACTAATGGACCGGTAGCC

Figure 9. Alignment: with a match/mismatch scoring of +1/−3 and a gap penalty of −2, the score is equal to 22 (51 matches, 7 mismatches, and 4 gaps).

As shown by Karlin and Altschul [KA90], the score of an alignment is related to the probability of finding it in a genomic bank: the higher the score, the lower the probability of observing such an alignment by chance.

e-value     10      1       10^-2   10^-4   10^-6
100 Mnt     16.55   18.23   21.60   24.96   28.32
1 Gnt       18.23   19.91   23.28   26.64   30.00
10 Gnt      19.91   21.60   24.96   28.32   31.68
100 Gnt     21.60   23.28   26.64   30.00   33.36
1 Tnt       23.28   24.96   28.32   31.68   35.04

Figure 10. This table shows the threshold scores for different e-values and bank sizes.

This is expressed as the expected value, or e-value, e = Kmn e^(−λS). This equation states that the number of alignments expected by chance (e) during a bank search is a function of the size of the search space (m × n = size of the bank × size of the query), the normalized score (λS) and a constant (K). The two constants λ and K are computed according to the match/mismatch scoring (see [AMS+ 97], chapter 4, for more details). Thus, knowing the size of the genomic bank and the size of the query sequence, and setting both the match/mismatch scoring and the desired expected value, one can deduce the threshold score from this equation. On the RDISK system, the idea is to hardwire a filter computing ungapped alignment scores and to report only the significant ones, that is to say scores above a given threshold value. To optimize the hardware, a filter is designed for each threshold value. Actually, the number of different filters to design is quite low: the table in figure 10 displays the threshold values for different e-values and for different sizes of genomic banks, with a query length of 1000 nucleotides (nt). The match/mismatch scoring is +1/−3, which corresponds to the default blast values (λ = 1.37 and K = 0.711). Hence, for scanning genomic banks from 100 Mnt to 1 Tnt with an e-value ranging from 10 to 10^−6, twenty-five different filters are required. Note, however, that a filter designed for a given expected value can be used for a smaller expected value: it will only generate more false positive hits.
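As a worked check of the threshold values of figure 10, the equation can be inverted: S = ln(Kmn/e)/λ. The short C snippet below (an illustration, not part of the RDISK tool chain) reproduces the first entry of the table.

```c
#include <math.h>
#include <stdio.h>

/* Threshold score S such that the expected number of chance alignments
   equals the target e-value:  e = K*m*n*exp(-lambda*S), hence
   S = ln(K*m*n/e)/lambda.  lambda = 1.37 and K = 0.711 are the default
   blast values quoted in the text. */
static double threshold(double query_len, double bank_len, double evalue)
{
    const double lambda = 1.37, K = 0.711;
    return log(K * query_len * bank_len / evalue) / lambda;
}

int main(void)
{
    /* Query of 1000 nt against a 100 Mnt bank, e-value 10: prints about
       16.56, i.e. the first entry of figure 10 (16.55) up to rounding. */
    printf("%.2f\n", threshold(1000.0, 1e8, 10.0));
    return 0;
}
```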

Filter architecture

The parallel architecture of the filter, depicted in figure 11, is made of small processing elements (PEs) performing a dedicated pattern matching computation. The number of PEs ideally equals the length of the query sequence. Before starting a search, the query is shifted into the PEs (one nucleotide per PE). Then, on each clock cycle, two nucleotides read directly from the disk are broadcast to the PEs. The comparators generate a hit whenever a score exceeds the threshold.
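The following C function is a minimal software model of what the PE array computes (an illustration under simplifying assumptions, ignoring the two-nucleotides-per-cycle parallelism and the hardware simplifications): along each diagonal of the comparison, a running ungapped local score is maintained with the +1/−3 scoring, and a hit is raised whenever it reaches the threshold.

```c
#define MATCH     1
#define MISMATCH (-3)

/* Scans one diagonal: bank position (diag + i) is compared with query
   position i.  The running local score is clamped at 0, as in an ungapped
   local alignment; the hardware would emit the bank offset on each hit. */
int scan_diagonal(const char *query, int qlen, const char *bank, int blen,
                  int diag, int threshold)
{
    int score = 0, hits = 0;
    for (int i = 0; i < qlen && diag + i < blen; i++) {
        score += (query[i] == bank[diag + i]) ? MATCH : MISMATCH;
        if (score < 0) score = 0;
        if (score >= threshold) hits++;
    }
    return hits;
}
```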


Figure 11. A query sequence CAG...T is searched against a bank whose current sequence is GCGTTA.... When a local score exceeds the threshold, the sequence location is stored directly as the bank file offset for quick retrieval.

Threshold              programmable    16     18     20     22     30
Maximum query size     137             180    198    160    159    157

Figure 12. Maximum query size for some filters; the filters are designed to run at 40 MHz with a 78 Mnt/s input bandwidth.

The bank is stored in a format that optimizes the disk bandwidth. A sequence is split into small data frames of 16 bytes in which nucleotides are two-bit encoded. Every 64 frames, an additional frame indicates the offset position of the sequence in the bank. For a sustained disk bandwidth of 15.5 MB/s, we get a 60.4 Mnt/s input bandwidth. When a hit is found, this frame is directly forwarded to the front end processor through the Ethernet network, as described in section 3.2. It contains all the information needed to rapidly locate the sequence and apply further processing. This mechanism is simple and does not slow down the filter activity: all the frames are processed on-the-fly, at the disk output rate. In the worst case, the quantity of information per hit is equal to 16 bytes. Actually, when consecutive hits occur, we generate only one offset frame, which limits the network traffic. Several filters have been pre-designed for threshold values ranging from 16 to 32: as the PEs are highly optimized, the hardware resources needed to implement one PE vary with the comparator complexity. Consequently, the number of PEs which can fit into the FPGA component – and hence the maximum size of the query sequence – directly depends on the chosen threshold value. The table in figure 12 illustrates the characteristics of some filters. Note that the filter input bandwidth (78 Mnt/s) is higher than the disk output bandwidth (60.4 Mnt/s): the disks are thus used at their maximum capacity.
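A sketch of this on-disk frame format is given below (the 2-bit code assignment, the bit ordering within a byte and the offset-frame layout are assumptions; the text only specifies 16-byte frames of two-bit encoded nucleotides with an offset frame every 64 data frames).

```c
#include <stdint.h>
#include <string.h>

enum { FRAME_BYTES = 16, NT_PER_FRAME = 64, FRAMES_PER_GROUP = 64 };

static uint8_t encode_nt(char c)              /* 2-bit nucleotide code */
{
    switch (c) { case 'A': return 0; case 'C': return 1;
                 case 'G': return 2; default:  return 3; /* T */ }
}

/* Packs 64 nucleotides into one 16-byte data frame, 4 nucleotides per byte. */
static void pack_frame(const char nt[NT_PER_FRAME], uint8_t frame[FRAME_BYTES])
{
    memset(frame, 0, FRAME_BYTES);
    for (int i = 0; i < NT_PER_FRAME; i++)
        frame[i / 4] |= encode_nt(nt[i]) << (2 * (i % 4));
}

/* Every FRAMES_PER_GROUP data frames, an extra frame carries the offset of
   the current sequence in the bank file, so that a hit can be reported to
   the host as a simple file offset. */
static void make_offset_frame(uint64_t bank_offset, uint8_t frame[FRAME_BYTES])
{
    memset(frame, 0, FRAME_BYTES);
    memcpy(frame, &bank_offset, sizeof bank_offset);
}
```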

Results

We have tested our similarity search implementation on the primate division of GenBank (release 139, 4.3 Gnt) with various query sequences of length approximately equal to 300. The sequences have been chosen to test various situations, from reporting no significant alignment to several hundred. The front end processor first requests the RDISK boards to load a filter, before broadcasting the query sequence. It then receives hits from the RDISK nodes and runs a Smith & Waterman computation to generate significant alignments. This process starts as soon as the first hit is received, and is thus performed concurrently with the scan of the bank. The result is a sorted list of alignments. The table in figure 13 reports experiments with a threshold set to 20. As the query size is limited to 160, the search is performed in two passes, without reconfiguration of the RDISK nodes before the second pass.

Query                          Execution time (s)            Ethernet user
sequence    hits    align      Tc     Ts     TSW     T       bandwidth
Line        5449    5324       0.8    3.0    38.2    39      29 KB/s
Globin       329     329       0.8    3.0     1.6     3.8    1.7 KB/s
Prion         66      66       0.8    3.0     0.3     3.8    0.3 KB/s
Random         2       2       0.8    3.0     0.1     3.8    0.0 KB/s

Figure 13. Results for the similarity search application. The hits column is the number of hits received by the front end processor. The align column is the number of significant alignments computed using the SW algorithm. The next three columns respectively report the time to configure the RDISK cluster (Tc), the time for scanning the GenBank primate division twice (Ts), and the time for computing the alignments (TSW). Since hit detection and hit extension are overlapped, the overall execution time is T = Tc + max(Ts, TSW). The last column shows the mean user throughput over the network.

The network traffic corresponds to the result data frames generated each time a new hit is found (16 bytes). Even with a permissive filter, the network is far from saturated. The other bandwidth not to be exceeded is the disk output on the host computer: the incoming result frames are pointers to data chunks of 3 KB. In our worst example, this represents a bandwidth of 5.4 MB/s, which is far below what a SCSI drive can sustain, even in random access. The high selectivity of the filter must also be pointed out: most of the detected hits lead to significant alignments. This is not surprising, since the filter performs a nearly exact alignment score computation. However, due to some hardware simplifications, false positive hits are generated in some cases.

If we now consider the quality of the results, a comparison with other software has to be made. Our filters follow the blast philosophy of exhibiting statistically significant alignments. However, we differ in the way hits are selected: blast hits are simple identical words of W characters present in both sequences. In DNA searches, the default word length is 11, but it can be set to a lower value (down to 7) to increase sensitivity. As an example, the word length would have to be set to 6 for blast to detect the alignment depicted in figure 9, whereas an ungapped filter with a threshold value of 20 detects it: from index 8 to 57, there are 41 matches and 7 mismatches (score = 41 − 7 × 3 = 20). On the other hand, with the same filter, no hit can be found in the alignment of figure 14 (because of the gap at position 40), whereas blast finds a hit.

ACTTCCTATTACCTAGTAGAGCAACTACTAACTGGATT-CCCTTTCTAACGGACC--TAGCC
ACTTC--ATTACCGAGTATAGCAACTACTAATTGGATTACCCTTACTAATGGACCGGTAGCC

Figure 14. This alignment will be found by blast because more than 11 consecutive matches occur between positions 20 and 31, but our heuristic will not detect it because its maximal score without gaps is 19 (gap at position 40).

Finally, we ran blast on a 16-processor SunFire 6800 machine to perform a sensitive search (word length set to its minimal value, i.e. 7). The processors are 750 MHz UltraSparc III with 1 GB of memory each. Like the Xilinx Spartan-II FPGA, these processors shipped in 2001/2002, so the technologies are comparable. Under optimal conditions, we measured an average processing rate of 20 Mnt/s per processor for a query of length 160, that is, only a third of the capability of a single RDISK node.

4.2 Pattern searching

Introduction

When common regions of similarity have been found in a whole family of genes, one can characterize these domains with patterns. A pattern can be an exact word or a finite dictionary, as in the PROSITE pattern database [NCV+ 04], in which families of proteins are grouped. An example of this syntax is the pattern D-[ILV]-x(1,3)-A, where [ILV] is a choice between several amino acids (I, L, and V) and x(1,3) is a gap of one to three arbitrary amino acids. As those patterns are regular expressions, they can be represented as automata (figure 15). Once patterns have been selected in databases or designed from scratch, one can search for them in new protein or nucleic sequence banks, for example when new sequences are produced by sequencing. The result is then a list of scored matches between the pattern and the bank.

Figure 15. An automaton describing the PROSITE pattern D-[ILV]-x(1,3)-A.

Figure 16. In a WFA, each transition has a weight. By replacing the D transition (left), one can count errors (middle) or use arbitrary substitution matrices (right).
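Since PROSITE patterns like the one in figure 15 are plain regular expressions, their matching semantics can be made concrete with a standard regex engine; the tiny C program below (an illustration only, exact matching with no error counting) checks the pattern D-[ILV]-x(1,3)-A against a toy protein sequence.

```c
#include <regex.h>
#include <stdio.h>

int main(void)
{
    /* D-[ILV]-x(1,3)-A written as a POSIX extended regular expression. */
    const char *pattern = "D[ILV].{1,3}A";
    regex_t re;
    regcomp(&re, pattern, REG_EXTENDED);
    /* "DLSTA" inside the sequence matches: D, L, a gap "ST" of length 2, A. */
    printf("match: %d\n", regexec(&re, "MKDLSTAQ", 0, NULL, 0) == 0);
    regfree(&re);
    return 0;
}
```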

Several scoring schemes can be used (figure 16):

• in an exact matching system, a word is accepted or not;
• a generalization of exact matching is to count the number of errors and to compare it to a threshold;
• a further step is to replace the substitution penalties of −1 by weights taken from a substitution matrix such as the BLOSUM62 matrix [HH92].

The first two scoring methods are only particular cases of the third, which can be resolved by weighted finite automata (WFA), i.e. finite-state machines with a weight on every transition. They are represented in hardware by a direct mapping: each state is materialized as a register, and each transition is hardwired as a weight generator, an adder and an optional maximum operator (linear encoding scheme, figure 17). This kind of mapping is slightly different from a pure systolic implementation and allows more complex patterns to be represented, for instance patterns with backward transitions.
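A minimal software model of this evaluation is sketched below (illustrative assumptions: every transition consumes exactly one input symbol, state 0 is the initial state, the last state is accepting, and a match may start at any position); it mirrors the linear encoding in that each state keeps one score that is updated from its predecessors on every symbol.

```c
#include <limits.h>
#include <string.h>

#define NSTATES 6
#define NONE    (INT_MIN / 2)     /* marker for a non-existing transition */

/* w(i, j, c): weight of the transition from state i to state j on symbol c,
   or NONE if there is no such transition (e.g. 0 for a match, -1 per error). */
typedef int (*weight_fn)(int i, int j, char c);

/* Returns the best score of a subsequence of seq accepted by the WFA. */
int wfa_best_score(const char *seq, weight_fn w)
{
    int score[NSTATES], next[NSTATES], best = NONE;
    for (int i = 0; i < NSTATES; i++) score[i] = NONE;
    score[0] = 0;                              /* initial state */
    for (; *seq; seq++) {
        for (int j = 0; j < NSTATES; j++) next[j] = NONE;
        next[0] = 0;                           /* a match may start anywhere */
        for (int i = 0; i < NSTATES; i++) {
            if (score[i] == NONE) continue;
            for (int j = 0; j < NSTATES; j++) {
                int wt = w(i, j, *seq);
                if (wt != NONE && score[i] + wt > next[j])
                    next[j] = score[i] + wt;
            }
        }
        memcpy(score, next, sizeof score);
        if (score[NSTATES - 1] > best)         /* accepting state */
            best = score[NSTATES - 1];
    }
    return best;
}
```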


Figure 17. Linear encoding scheme for the WFA depicted in figure 15. This architecture can be viewed as a shift register in which a p-bit weight is aggregated. A final comparator detects the matching subsequences. The p bits of the weight consist of p − 1 bits representing a two's complement integer and one bit used for initialization, overflow, and non-existing transitions.

The score reported by the WFA is compared to a threshold: we want to assign an exact score to each sequence and to keep only the relevant data above the threshold. On the RDISK system, either the full score is computed in the FPGA (and no false hit is reported), or the score computation is limited to fewer bits and finalized on the host.

Figure 18. Architecture of the ×3 WFA pattern filter: the three forward reading frames are read in parallel.

As in the first application, a compromise must be found to ensure that the selectivity remains high. In the following, we study only the first case: as the score is fully computed on the nodes, the post-processing phase reduces to formatting the results and displaying them to the user.

Filter architecture

When parsing nucleic banks for protein patterns, each group of three nucleotides (a codon) must be translated into one amino acid. There are six reading frames (three in the forward direction, three in reverse). If three different automata are materialized (one for each forward reading frame), the input filter bandwidth is 120 Mnt/s (figure 18). In this case, the disks are still used at their maximum capacity of 60.4 Mnt/s. If a larger automaton is needed, only a single automaton is implemented; the filter input bandwidth is then reduced to 40 Mnt/s and becomes the bottleneck. In this approach, the filter resource usage directly depends on the WFA size and on the weight bit width. The table in figure 19 reports the maximum number of transitions for different values of p.

Results

Whereas the similarity search filter is a programmable architecture, this one must be synthesized every time a new WFA topology is chosen. Our tool directly generates VHDL from an abstract description of the pattern or of the WFA. Real experiments were conducted on the EST Canine Database (34 Gnt, NCBI) with 5 olfactory receptor patterns (OR). The WFA describing those patterns have between 11 and 25 transitions, and we implemented them using a ×2 filter. The table in figure 20 reports the processing time and the Ethernet traffic for each pattern.
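To make the translation step of the filter architecture above concrete, here is a small C sketch of the codon-to-amino-acid translation over the three forward reading frames (the standard genetic code is assumed; the actual hardware encoding and translation logic are not described in the text).

```c
#include <stdio.h>
#include <string.h>

static int code(char c)                       /* T, C, A, G -> 0..3 */
{
    switch (c) { case 'T': return 0; case 'C': return 1;
                 case 'A': return 2; default:  return 3; /* G */ }
}

static char translate_codon(const char *nt)
{
    static const char aa[] =                  /* standard codon table, TCAG order */
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
    return aa[16 * code(nt[0]) + 4 * code(nt[1]) + code(nt[2])];
}

int main(void)
{
    /* Frame 0 translates to DIAA, which happens to match D-[ILV]-x(1,3)-A. */
    const char *dna = "GATATTGCAGCT";
    size_t len = strlen(dna);
    for (size_t frame = 0; frame < 3; frame++) {   /* forward frames only */
        printf("frame %zu: ", frame);
        for (size_t i = frame; i + 2 < len; i += 3)
            putchar(translate_codon(dna + i));
        putchar('\n');
    }
    return 0;
}
```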

Automaton      LUTs per      Input bandwidth           Maximum automaton size (transitions)
replication    transition    per cycle   at 40 MHz     p=6    p=8    p=10   p=12
×1             7p            1 nt        40 Mnt/s      71     53     42     35
×2             14p           1 nt        40 Mnt/s      35     26     21     17
×3             15p           3 nt        120 Mnt/s     33     25     20     16
×6             30p           3 nt        120 Mnt/s     16     12     10     8

Figure 19. Maximum filter size for different bit widths p. The LUTs column shows the maximum number of LUTs needed per transition of the automaton. As the ×1 and ×3 filters parse only the forward direction, two passes are needed to parse the six reading frames. With p = 8 and the ×1 filter, more than 98% of the patterns of the PROSITE bank can be implemented.

Query       threshold   Nb of     Execution time (s)              Ethernet user
pattern                 hits      Tb      Tc     Ts      T        bandwidth
OR1         3           20985     311     0.8    17.7    330      9.48 KB/s
OR2         2           13499     220     0.8    17.7    239      6.10 KB/s
OR3         1           17469     241     0.8    17.7    260      7.89 KB/s
OR4         1           24178     235     0.8    17.7    249      10.93 KB/s
OR5         0           21692     287     0.8    17.7    306      9.80 KB/s

Figure 20. Results for the pattern searching application with the ×2 filter, which has an input bandwidth of 40 Mnt/s. The definitions of Tc and Ts are the same as in figure 13. Here an extra time Tb is needed to compile, prepare and broadcast the tailored bitstream to all the nodes. Since this application does not need heavy post-processing, the total execution time is T = Tb + Tc + Ts.

As in the similarity search, only the offset of the relevant data is sent to the host; for this application, 8 bytes are generated on every hit. As a comparison, we ran the same filtering on a standard 2 GHz PC with 728 MB of RAM. Two hours are needed to parse the whole bank for one pattern through the six reading frames. One RDISK node is thus more than four times faster than this machine: the RDISK-48 cluster is equivalent to a cluster of 192 PCs.

5 Conclusion

A cluster of reconfigurable nodes for scanning large genomic banks has been presented. Nodes are made of a low-cost Xilinx Spartan-II FPGA tightly coupled with a hard disk drive, and are interconnected through a 100 Mb Ethernet network. Content-based similarity and pattern search applications have been tested on the RDISK-48 prototype. The association, in a minimal system node, of two main components – reconfigurable hardware and a hard drive – is the key point of this project. Compared to a cluster of PCs, it has the following advantages:

• Complex filters can be efficiently parallelized and mapped into hardware. Dedicated operators tailored to genomic processing bring high computing power, allowing high filter selectivity.
• Genomic banks are scanned at the maximum disk bandwidth: the filtering task does not incur any slowdown.
• Due to the high selectivity of the filters, the amount of information forwarded to the front end processor remains quite low and can be supported by a basic Ethernet network.
• Since the volume of pertinent data extracted from the disks is small, costly processing, such as the Smith & Waterman algorithm, can easily be performed by the front end processor (a standard PC) without any extra hardware.

Together, these points lead to the design of a powerful low-cost reconfigurable system dedicated to the search of genomic banks. We estimate the cost of one RDISK node at $200, including the hard drive. The complete RDISK-48 system (48 nodes, 2.4 TB of disk storage, Ethernet switches, power supplies, rack cabinet...) can be estimated at $15,000. This is to be compared with the 192 PCs needed to obtain identical performance on the pattern search application (section 4.2).

The great advantage of reconfigurable hardware, compared to VLSI accelerators, is that new hardware can always be devised without any physical modification. Currently, two search applications have been implemented, but many others could benefit from the RDISK architecture as long as they require systematic scans of the data. As examples, the identification of proteins by peptide mass fingerprinting, or an approximate SRS-like search on genomic database annotations, would probably fit the RDISK architecture very well.

We now discuss the evolution of the RDISK system compared to a cluster of PCs. Moore's law states that the number of transistors per square inch on integrated circuits approximately doubles every 18 months; over the same period, the clock frequency is only multiplied by 1.5. Thus, processors will continue to increase their clock speed, and the time for searching a genomic bank will remain tightly correlated to the clock frequency.

Other microprocessor improvements (such as larger cache memories or more floating point units) will not speed up genomic searches, which are mainly based on integer stream processing. On the other hand, the clock frequency of FPGA components is growing, and their hardware resources follow the exponential Moore growth as well, as illustrated by the introduction of the low-cost Xilinx Spartan-II (XC2S200, 200 Kgates) in 2001 and of the Spartan-3 (XC3S800, 800 Kgates) in 2004: in 3 years, four times more hardware resources are available for the same cost. In addition, as reported in the ITRS reports [AEJ+ 02,Com03], from 2001 to 2004 the on-chip clock frequency has been multiplied by 2.47 (1.57 every 18 months). The computing power, defined by [Vui94] as the product of the frequency and the gate density, has thus increased by a factor of 12 over the last 3 years for this FPGA family. Contrary to microprocessors, FPGA components therefore fully benefit from the raw increase in silicon capacity. Tomorrow, more and more complex architectures will fit onto these components, widening the computing power gap between microprocessors and reconfigurable custom hardware. Focusing on the RDISK system, the next generation of low-cost FPGA components, such as the Xilinx Spartan family, will probably host a hard-core CPU, high-speed links, more embedded memory, and reconfigurable resources equivalent to a few million gates. In other words, the RDISK prototype boards we designed will be available as off-the-shelf components with much more processing capability.

If FPGA is a promising technology to provide highly efficient computing power, it still remains complex to program. The reconfigurable part of the RDISK nodes is specified with a hardware description language (HDL), which requires specific knowledge in architecture design. The programmability of such a system is thus restricted, and only people able to manage both HDL and C can claim to program it. However, this constraint can be relaxed by using high-level languages targeted at generating hardware architectures. In our case, the Streams-C language [GSAK00], based on communicating streams and targeting FPGA technology, is a perfect candidate to improve the RDISK programming environment.

References

[AEJ+ 02] A. Allan, D. Edenfeld, W.H. Joyner, A.B. Kahng, M. Rodgers, and Y. Zorian. 2001 technology roadmap for semiconductors. Computer, January 2002.

[AMS+ 97] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, pages 3389–3402, 1997.

[Bab79] Edward Babb. Implementing a relational database by means of specialized hardware. ACM Transactions on Database Systems, 4(1):1–29, 1979.

[BCL] Biocceleration. http://www.biocceleration.com/BioXLH-technical.html.

[Com03] ITRS Roadmap Committee. International technology roadmap for semiconductors. Technical report, 2003.

[DSS] Data Search Systems. http://www.datasearchsystems.com.

[GOL] Genomes OnLine. http://www.genomesonline.org.

[Gra00] Jan Gray. Hands-on computer architecture – teaching processor and integrated systems design with FPGAs. In Workshop on Computer Architecture Education, June 2000.

[GSAK00] M. Gokhale, J.M. Stone, J. Arnold, and M. Kalinowski. Stream-oriented FPGA computing in the Streams-C high level language. In IEEE Symposium on FPGAs for Custom Computing Machines, 2000.

[HH92] J.G. Henikoff and S. Henikoff. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89:10915–10919, November 1992.

[KA90] S. Karlin and S.F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences, 87, 1990.

[KPH98] Kimberly Keeton, David A. Patterson, and Joseph M. Hellerstein. A case for intelligent disks (IDISKs). SIGMOD Record, 27(3):42–52, 1998.

[MKC01] Gokhan Memik, Mahmut T. Kandemir, and Alok Choudhary. Design and evaluation of smart disk cluster for DSS commercial workloads. Journal of Parallel and Distributed Computing, 61(11):1633–1664, 2001.

[MTL02] B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3), 2002.

[NCV+ 04] N. Hulo, C.J.A. Sigrist, V. Le Saux, P.S. Langendijk-Genevaux, L. Bordoli, A. Gattiker, E. De Castro, P. Bucher, and A. Bairoch. Recent improvements to the PROSITE database. Nucleic Acids Research, 32, 2004.

[PAR] Paracel. http://www.paracel.com/gm/ov.htm.

[RFGN01] Erik Riedel, Christos Faloutsos, Garth A. Gibson, and David Nagle. Active disks for large-scale data processing. IEEE Computer, June 2001.

[RGSV93] J. Rose, A. El Gamal, and A. Sangiovanni-Vincentelli. Architecture of field-programmable gate arrays. Proceedings of the IEEE, 81(7), 1993.

[SGM+ 90] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.

[Sto88] Michael Stonebraker. Readings in Database Systems. Morgan Kaufmann Publishers Inc., 1988.

[SW81] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[TML] Timelogic. http://www.timelogic.com.

[Tuo02] Ilkka Tuomi. The lives and death of Moore's Law. First Monday, 7, November 2002.

[Vin01] Martin Vingron. Sequence Analysis, chapter 2, pages 27–57. Wiley-VCH, 2001.

[Vui94] J. Vuillemin. On computing power. LNCS, Programming Languages and System Architectures, 782, 1994.

[WCIZ03] Benjamin West, Roger D. Chamberlain, Ronald S. Indeck, and Qiong Zhang. An FPGA-based search engine for unstructured database. In Workshop on Application Specific Processors, December 2003.
