A Dynamically Reconfigurable Systolic Array


BioSCAN: A Dynamically Reconfigurable Systolic Array for Biosequence Analysis

R. K. Singh, W. D. Dettloff†, V. L. Chi, D. L. Hoffman, S. G. Tell, C. T. White, S. F. Altschul‡ and B. W. Erickson

Department of Computer Science & Department of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599; †MCNC, Inc., Research Triangle Park, NC 27709; ‡National Center for Biotechnology Information, NLM, NIH, Bethesda, MD 20894

Abstract

We describe the design, implementation, and deployment via the Internet of BioSCAN, an application-specific computer system for the rapid determination of statistically significant alignments of biopolymer (DNA, RNA, protein) sequences. BioSCAN continues to outperform other systems designed to perform this basic task of molecular biology, which continues to grow in magnitude and importance. The BioSCAN system is hosted by a general-purpose workstation containing a special-purpose hardware engine that accelerates the core algorithm for comparing two biosequences. Careful partitioning of the computational tasks between hardware and software provides not only high performance but also programmability. The BioSCAN system can compare a sequence of up to 12,992 characters with an arbitrarily large database containing arbitrarily long sequences at a rate of 2 million database characters per second. This rate is nearly 1,000 times greater than the rate achieved by a state-of-the-art workstation using software alone. This network-sharable computational resource is accessible interactively via the World Wide Web using Mosaic, Netscape or other client software.

1 MOTIVATION

Searching sequence databases is a basic task of not only text and biopolymer comparison but also speech and hand-writing recognition. A sequence is an ordered set of characters from a small alphabet. The general task is to compare a given query sequence with each entry sequence of a given database. Each sequence within a biosequence database is searched in order to detect significant relationships between gene (DNA, RNA) sequences or gene product (protein) sequences of living organisms. Detection of similar segments between two sequences can provide powerful insights into the biological roles of these molecules.

Ongoing world-wide initiatives to sequence the complete genomes of a variety of organisms are resulting in the acquisition and accumulation of biosequence data at ever increasing rates. For example, the GenBank Release 94.0 database contains five hundred million characters in over seven hundred thousand DNA and RNA sequences. This database is doubling in size every 15 to 18 months. This growth rate is rapidly outpacing the speed-up of general-purpose computers. Software tools that ran acceptably fast a few years ago now require days to scan the latest releases of this database.

Computational tools have been implemented to address this problem using a general-purpose uniprocessor, general-purpose multiprocessor, or special-purpose multiprocessor system. Two popular heuristic-based search tools that run on a general-purpose uniprocessor are FASTA (Pearson and Lipman, 1988) and BLAST (Altschul et al., 1990). Parallel processing programs developed to run on a general-purpose multiprocessor system include dFLASH (Rigoutsos and Califano, 1993) on the IBM SP/1 and Blaze (Brutlag et al., 1993) on the MasPar MP-1. An extreme solution has been the development of a class of parallel processors designed expressly to compare biopolymer sequences. Examples of these special-purpose multiprocessor systems are PNAC (Lopresti, 1987), BISP (Chow et al., 1991), SPLASH (Gokhale et al., 1990), BSYS (Hughey and Lopresti, 1991), bioccelerator-4 (Compugen, 1995), and BioSCAN (White et al., 1991; Singh et al., 1993).


In this paper we describe BioSCAN (Biological Sequence Comparative Analysis Node), a network-sharable computational resource for searching biosequence databases that was designed and developed at the University of North Carolina at Chapel Hill in collaboration with MCNC, Inc.

2 BIOSEQUENCE SIMILARITY

For computational purposes, a biosequence is an ordered set of characters from an alphabet. Each character stands for one type of monomer appearing in a biopolymer chain. The four DNA monomers (nucleotides) are represented by characters from the alphabet {A,C,G,T}. DNA sequences typically contain a few thousand characters. Similarly, the twenty protein monomers (amino acids) are represented by characters from the alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}. Protein sequences usually contain a few hundred characters.

Comparing a query sequence with a database of entry sequences requires a measure of the similarity of two sequences. A popular measure originally developed for comparing text sequences is the edit distance, which is based on a set of editing operations (substitutions, insertions, deletions) required to transform one sequence into another. Each editing operation is assigned a score. The edit distance between two sequences is the minimum total score of transforming the one sequence into the other, which can be calculated using a dynamic programming algorithm.

In comparing two biosequences, one is generally interested in discovering pairs of sequences that are sufficiently similar that they might have diverged from a common ancestor or have converged to a common function. Biological evolution proceeds by the accumulation of genetic mutations, where substitutions are the most common type of mutation. For example, the DNA substitutions of A by T, C by G, G by C, and T by A are favored in nature. Mutations due to insertion, deletion, duplication, or other rearrangements are observed much less frequently. Thus the simplest measure of biosequence similarity is one based only on substitutions. This measure of biosequence similarity inherently detects locally optimal alignments.
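The edit distance described in this section can be illustrated with a minimal dynamic-programming sketch. The unit costs for substitution, insertion and deletion are assumptions chosen for illustration; the text only states that each editing operation is assigned a score. The function name is hypothetical.

```python
def edit_distance(source, target, sub_cost=1, indel_cost=1):
    """Minimum total cost of the substitutions, insertions and deletions
    that transform `source` into `target` (costs are assumed values)."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = [j * indel_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * indel_cost]
        for j, t_char in enumerate(target, start=1):
            substitute = previous[j - 1] + (0 if s_char == t_char else sub_cost)
            delete = previous[j] + indel_cost
            insert = current[j - 1] + indel_cost
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGGT"))
```

The table is filled row by row, so the running time is proportional to the product of the two sequence lengths, which is the cost that motivates a substitution-only measure for hardware.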

3 BioSCAN APPROACH

The BioSCAN system compares a given query sequence with each entry sequence in a specified database. The system flags locations in an entry sequence that are sufficiently similar to the query sequence to satisfy a programmable goodness-of-fit measure specified by a substitution score table (Altschul, 1991). The resulting information may be used in various ways depending on the desired application. Generally the hardware reduces the amount of data to be subsequently processed by the software by several orders of magnitude. The BioSCAN system serves as a prefilter for a broad range of existing and future applications.

A segment of a sequence is a contiguous subsequence of any length. Each query segment is compared with each entry segment that has the same length as the query segment. The algorithm traverses a search space that is proportional to the product of the length of the query sequence and the total length of the database. For each comparison of entry sequence A (A[0],A[1],...,A[i],...,A[La-1]) of length La with query sequence B (B[0],B[1],...,B[j],...,B[Lb-1]) of length Lb, the search space can be visualized as the interior of the rectangle in Fig. 1. Statistically significant (high-scoring) segment pairs, one from sequence A and one from sequence B, can be represented as diagonal line segments of various lengths. Segment pairs represented on the same diagonal correspond to the same alignment of sequences A and B. For any particular alignment of A and B, there is a constant difference between the indices of each pair of aligned characters. For example, the alignment of A[i] and B[j] lies on the diagonal (i-j) in Fig. 1. This provides a convenient way to number the various alignments of A and B and the corresponding diagonals.

[Figure 1: Diagonal Graph of Aligned Pairs of Segments. Horizontal axis: Sequence A, the database sequence (time), from A[0] to A[La-1]. Vertical axis: Sequence B, from B[0] to B[Lb-1], one PE per character. Marked: point (i,j), diagonal (i-j), diagonal (i-Lb), and an alignment or hit.]

Searching for high-scoring segment pairs in software requires a time proportional to the product of the lengths of the two sequences. Given sufficient processing elements (PEs) (one per query sequence character), the query sequence dimension of this space (vertical axis) can be processed in parallel. Hardware completes the search in a time proportional to the length of the database. This is visualized in Fig. 1, where entry sequence character A[i] is completely processed in the ith cycle. The information state of the system after the ith cycle is represented by the vertical bar below A[i]. The jth PE computes the partial sums on the jth row. The sum computed by PE[j] at step i is passed to PE[j+1] to use in step i+1. At the end of each cycle, a one-bit result is generated. This bit specifies whether or not the diagonal just completed (e.g. diagonal (i-Lb) at step i in Fig. 1) contains at least one high-scoring segment pair. During postprocessing in software, the location and score of each high-scoring segment pair can easily be computed from the diagonal index. This system partitioning has led to a hardware architecture that is exceedingly efficient at filtering the initial search space, and that defers application-dependent processing to software executed on general-purpose machines.

The BioSCAN system comprises hardware-based prefiltering and software-based postprocessing. A custom VLSI chip implements the BioSCAN algorithm described below. It identifies every alignment containing one or more high-scoring segment pairs. Each such alignment is reanalyzed in software to determine the score, length, and end points of each locally optimal (non-overlapping) segment pair. These results are reported in a format similar to that generated by the BLAST program. The overall significance of the similarity between query and entry sequences is estimated by the sum statistics of Karlin and Altschul (Karlin and Altschul, 1993).
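As an illustration of the software postprocessing step, the following sketch rescans a single diagonal, identified by its index d = i - j, and recovers the best-scoring segment pair on it. It is not the BioSCAN postprocessing code: the function names, the +5/-4 scoring function and the return format are assumptions made for this example, and only the single best segment is reported.

```python
def match_score(a_char, b_char):
    """Hypothetical substitution score: +5 for a match, -4 otherwise."""
    return 5 if a_char == b_char else -4


def best_segment_on_diagonal(entry, query, d, score=match_score):
    """Return (score, entry_start, query_start, length) of the best
    segment pair on diagonal d, where entry[i] aligns with query[i - d].
    Returns (0, None, None, 0) if no positively scoring segment exists."""
    j_lo = max(0, -d)
    j_hi = min(len(query), len(entry) - d)
    best = (0, None, None, 0)
    run, run_start = 0, j_lo
    for j in range(j_lo, j_hi):
        if run <= 0:
            # A non-positive running score cannot help a later segment.
            run, run_start = 0, j
        run += score(entry[j + d], query[j])
        if run > best[0]:
            best = (run, run_start + d, run_start, j - run_start + 1)
    return best


if __name__ == "__main__":
    print(best_segment_on_diagonal("GGACGTAC", "ACGT", 2))
```

Because the hardware reports only the diagonal index, the software needs one linear pass per flagged diagonal, which is small compared with scanning the full rectangle of Fig. 1.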

4 ALGORITHM

Algorithms for detecting and measuring similarities between biosequences have continuously evolved over the last three decades. One of the earliest exhaustive sequence comparison methods is that of Needleman and Wunsch (Needleman and Wunsch, 1970). Subsequently, many algorithmic refinements have been offered, most notably by Smith and Waterman (Smith and Waterman, 1981) and by Altschul and others (Altschul et al., 1990). Generally, given two sequences of lengths m and n, each of the $O(m^2)$ contiguous segments of one sequence (of all possible lengths, starting at each position) is compared with all $O(n^2)$ segments of the same length of the other sequence. Using algorithmic optimization, a global similarity score can be computed between two sequences by a dynamic programming algorithm in $O(nm)$ time. It has been shown that dynamic programming algorithms are amenable to parallelization and map well onto systolic arrays where all the diagonals from Fig. 1 can be evaluated in parallel. However, the complexity of the computations at the core of dynamic programming algorithms results in large PE size and hence reduced computational density when implemented in VLSI technology.

The BioSCAN system directly implements a simple linear similarity algorithm. This generates alignments without insertion and deletion, very similar to the BLAST system (Altschul et al., 1990). The score of each segment pair is the sum of the scores of the aligned characters. A two-dimensional lookup table provides the score for each pair of aligned characters. Several high-scoring segment pairs may be combined to estimate the overall similarity of the two sequences. The linear similarity score is defined as:

$$S^{L}_{x,y} = \sum_{k=0}^{L-1} T(A_{x-k}, B_{y-k}) \qquad (1)$$

$S^{L}_{x,y}$ is the score of the segment pair of length $L$ with rightmost elements $A_x$ and $B_y$ in alignment. The general recurrence relation for the linear similarity algorithm is expressed as:

$$S_{x,y} = S_{x-1,y-1} + T(A_{x}, B_{y}) \qquad (2)$$

The parallel implementation of this algorithm is based on a greedy heuristic where each alignment (or diagonal) is evaluated concurrently on a separate processor. Each processor performs the following core computation in parallel:

$$S_{x,y} = \begin{cases} T(A_x, B_y) & \text{if } S_{x-1,y-1} < 0 \\ S_{x-1,y-1} & \text{else if } S_{x-1,y-1} \ge S \\ S_{x-1,y-1} + T(A_x, B_y) & \text{otherwise} \end{cases} \qquad (3)$$

where $S_{x,y}$ is $\max_{L}(S^{L}_{x,y})$ while $S_{x,y}$ is less than $S$ (a user-specified score threshold), and $T$ is any substitution score table. The simplicity of the core computation in the BioSCAN algorithm lends itself to an efficient implementation in VLSI and thus can achieve high computational density compared to other reported systems for biosequence comparison.
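The core computation of Eq. 3 can be sketched in software by walking each diagonal in turn; the hardware evaluates the diagonals concurrently, one PE per query character. The function names, the +5/-4 score table and the starting score of zero on each diagonal are assumptions for this illustration.

```python
def match_score(a_char, b_char):
    """Hypothetical substitution score table T: +5 match, -4 mismatch."""
    return 5 if a_char == b_char else -4


def flagged_diagonals(entry, query, table, threshold):
    """Return the diagonals d = x - y whose score reaches `threshold`
    under the recurrence of Eq. 3 (sequential sketch)."""
    flagged = []
    for d in range(-(len(query) - 1), len(entry)):
        s = 0  # assumed initial score at the start of a diagonal
        y_lo = max(0, -d)
        y_hi = min(len(query), len(entry) - d)
        for y in range(y_lo, y_hi):
            t = table(entry[y + d], query[y])
            if s < 0:
                s = t            # negative score: restart the segment
            elif s >= threshold:
                pass             # threshold reached: hold the score
            else:
                s = s + t        # extend the current segment
        if s >= threshold:
            flagged.append(d)
    return flagged


if __name__ == "__main__":
    print(flagged_diagonals("GGACGTAC", "ACGT", match_score, 15))
```

Holding the score once it reaches the threshold is what lets a single bit per diagonal report that at least one high-scoring segment pair was seen.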

5 ASIC ARCHITECTURE & IMPLEMENTATION

The BioSCAN ASIC implements the linear similarity algorithm of Eq. 3. The primary ASIC architecture and implementation were driven by minimizing off-chip communication to enable easy cascading of multiple ASICs. The architecture features a scalable linear systolic array of PEs and does not limit the size of the array. Since the package pinout is independent of the array size, the system offers scalability without requiring a redesign of the circuit board. Since the high-speed path between adjacent ASICs is limited to only one pin, strategic placement of the input and the output pins provides easy scalability by cascading multiple ASICs to form a large array. The internal data path of the ASIC is shown in Fig. 2.

The ASIC design is modular and consists of four major blocks: clocks, frame, memory, and PEs. The primary clock input to the ASIC is a square waveform from a single input pin. A 2-phase nonoverlapping clock is derived on-chip. The true and complement phases of the two clocks are globally routed using an electrically balanced tree and interlocking circuitry. The final driving stages of the four clocking signals are placed in the corners of the die to minimize crosstalk interference with other circuits and power supply lines. The frame consists of all the I/O pins, global routing, and the control logic. The controller synchronizes the operation of the frame, memory, and PEs. External accesses trigger the controller to operate.

[Figure 2: BioSCAN ASIC Internal Datapath. Blocks: 28-bit data I/O buffer, RAM address register (28 bits), bank select, RAM (448 x 28 bits), buffer/driver, and PE 0 to PE p-1, each with an 8-bit selector register, a 2-of-8 selector and a 16-bit accumulator. Signals: CONTROL, CE, DATA, AEn, AIn, AOut, Oval, Ovfl.]

Since the computation is dominated by high-bandwidth memory accesses, an on-chip SRAM stores the substitution score table. Logically the SRAM represents a square table of 28x28 16-bit 2's complement integer values. Physically, the memory is organized as 28 banks of separate data storage. Each bank is organized as 16 rows and 28 columns and contains 28 16-bit substitution scores for one character of the alphabet (Fig. 3). Since all 28 16-bit values are broadcast to the entire PE array, 28 lines run from the memory to all the PEs. The bit lines are common among all banks whereas separate word lines determine which bank is selected. Accessing a bank first requires writing to the RAM address register. The memory access (read and write) is bit serial (LSB first) and takes 16 clock cycles to stream 28 16-bit values to and from a bank.

[Figure 3: Memory organization. Banks BANK 0 to BANK 27, each 16 bits deep from LSB to MSB and 28 columns wide; RAM address register (28 bits) for database select; I/O buffer with a 16-bit shift register; 28 outputs to query sequence select.]

The BioSCAN PE is essentially composed of four separate components: the select register, the 28-to-1 MUX, the accumulator, and the adder. Each is visible in the layout, shown in Fig. 4, and can be logically separated from the whole.

[Figure 4: BioSCAN Processing Element Architecture. Labels: 8-bit selector register (from previous PE), 2-of-8 bus selector over the 28 lines B0 to B27 from the RAM buffer, 16-bit accumulator (A15 to A0, from previous PE), overflow exception, negation exception, carry and sum to next PE.]

The select register in each PE stores one character of the query sequence encoded in a 2-of-8 encoding scheme and uses it to select one of the 28 broadcast lines as its input for the duration of the database scan. The 2-of-8 encoding was chosen over binary encoding because 2-of-8 decoding logic, replicated in each PE, is much smaller than binary decoding logic. This encoding scheme limits the alphabet size to 28 (8*7/2). However, nothing precludes a user from writing a value into a PE with more than 2 of 8 bits set. In such a case, the output of the 28-to-1 multiplexer (Fig. 5) is a logical-OR of more than one data stream. A special value in which the 3 LSBs are set is used to dynamically bypass a processing element using software. Any PE with this special selector register value will have its AIn connected to its AOut via a multiplexer. This feature was implemented as a rudimentary form of field-repairability.
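The 2-of-8 selector encoding can be modeled in a few lines. The mapping of bit pairs to broadcast line numbers (lexicographic order) and the model of each line being gated by the AND of its two selector bits are assumptions for this sketch; the text only fixes the count of 28 codes, the OR behavior when extra bits are set, and the bypass value with the 3 LSBs set.

```python
from itertools import combinations

# Assumed numbering: broadcast line n is gated by the n-th pair of
# selector bits, taken in lexicographic order.
PAIRS = list(combinations(range(8), 2))
CODES = [(1 << lo) | (1 << hi) for lo, hi in PAIRS]
BYPASS = 0b00000111  # special value: 3 LSBs set


def selected_lines(selector):
    """Broadcast lines whose two gating bits are both set in `selector`."""
    return [n for n, code in enumerate(CODES) if (selector & code) == code]


def mux_output(selector, broadcast_bits):
    """One bit of multiplexer output: the OR of all selected lines."""
    return int(any(broadcast_bits[n] for n in selected_lines(selector)))


if __name__ == "__main__":
    print(len(CODES))                    # number of valid 2-of-8 codes
    print(selected_lines(CODES[5]))      # a valid code selects one line
    print(selected_lines(0b00010011))    # 3 bits set: several lines ORed
```

A valid code needs only a two-input gate per line in this model, which is consistent with the stated motivation that 2-of-8 decoding is much smaller than binary decoding when replicated in every PE.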

[Figure 5: 2 of 8 Selector for 28-to-1 Multiplexer. Two groups of 14 broadcast data lines, each gated by bits from the 8-bit selector register.]

During the scan cycle, each PE bit-serially adds the selected value to the similarity score in its accumulator, which was received from the previous PE in the previous scan cycle. Each bit of the calculated score is immediately passed to the next PE in the array. Simple logic in each PE handles two exceptions: (1) A negative score input to a PE is zeroed before accumulation. (2) A score of 16,384 or greater is propagated through the remainder of the array unchanged.

Prior to a database scan, each ASIC is initialized with search parameters. The SRAM (Fig. 3) is programmed with a substitution score table, and the selector registers of the PEs (Fig. 4) are loaded with a query sequence, one character per PE. Since the threshold value in each PE is hard-wired to 16,384, the score values are scaled before initialization. Each database character received by the ASIC selects 1 of the 28 banks. During each of the next 16 clock cycles, 28 column bits from successive rows of the selected bank are sensed, amplified, and broadcast in parallel within the ASIC.

The device pad-frame was laid out with due consideration to the design, manufacturing and assembly of the circuit boards. The bit-serial accumulator output was placed on the opposite side of the IC package from the accumulator input, which helps the placement and routing of high-speed signals on the circuit board.

The device implementation offers testability and fault tolerance. The SRAM is independently readable and writable. The pattern sequences held by the PEs can be scanned out to verify their contents. The PEs can be controlled and observed individually to obtain 100% stuck-at-fault coverage. Both for device testing during fabrication and for accommodating PE failures in the field, individual PEs can be shunted through by storing a special code in the selector register to activate logic not shown in Fig. 4. Several adjacent PEs can be shunted through to improve the fabrication and system yield. During tests, up to 15 PEs have been successfully shunted through without impacting the system performance. PE or ASIC failure can be diagnosed in the field and the defective PE or ASIC can be removed from the array using software.
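The PE scan cycle, with its bit-serial addition and two exceptions, can be sketched as follows. This is a behavioral model rather than a description of the circuit: the helper names are hypothetical, and the 16-bit two's complement width and the 16,384 threshold are taken from the text.

```python
WIDTH = 16
THRESHOLD = 1 << 14  # 16,384, hard-wired in each PE


def to_bits(value):
    """16-bit two's complement representation, LSB first."""
    return [(value >> k) & 1 for k in range(WIDTH)]


def from_bits(bits):
    """Inverse of to_bits: interpret bit 15 as the sign bit."""
    raw = sum(bit << k for k, bit in enumerate(bits))
    return raw - (1 << WIDTH) if bits[-1] else raw


def bit_serial_add(x_bits, y_bits):
    """Add two LSB-first bit streams one bit per clock, keeping a carry."""
    carry, out = 0, []
    for x_bit, y_bit in zip(x_bits, y_bits):
        total = x_bit + y_bit + carry
        out.append(total & 1)
        carry = total >> 1
    return out


def pe_cycle(incoming, table_score):
    """One scan cycle of a PE, applying the two exceptions in the text."""
    if incoming >= THRESHOLD:
        return incoming      # exception 2: propagate unchanged
    if incoming < 0:
        incoming = 0         # exception 1: zero a negative input
    return from_bits(bit_serial_add(to_bits(incoming), to_bits(table_score)))


if __name__ == "__main__":
    print(pe_cycle(100, -30))
```

Since each output bit depends only on the current input bits and one carry bit, a result bit can be handed to the next PE as soon as it is produced.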
The BioSCAN ASIC was designed using 1.2 µm scalable CMOS design rules from MOSIS and fabricated by Hewlett-Packard. Each 7.75 mm x 9.05 mm die contains 812 PEs. Table 1 shows the design and process parameters for the chip.

Die Size               7750 µm x 9050 µm
Transistors            537,675 (89,736 in RAM)
Processors             812 (536 transistors/PE)
Pins/Package           84 pin PGA (42 Vdd/GND)
Clock                  32 MHz @ 50 °C (ambient)
Voltage/Power          4.75-5.25 V (3.0 W max)
Interface              TTL-compatible
Process                1.2 µm N-well CMOS
Gate length/tox        1.2 µm / 20 nm
Poly/Metal1/Metal2     2.4/3.6/4.2 µm pitch

Table 1: BioSCAN ASIC Device Summary

6 CIRCUIT BOARD ARCHITECTURE & IMPLEMENTATION

The BioSCAN circuit board architecture was developed with the goals of modularity and reusability of subsystems, testability during manufacturing and field installation, ease of installation and configuration, and scalability and parallelism using multiple ASICs. The circuit board design is hierarchical and modular. The block diagram of the circuit board is shown in Fig. 6. It consists of three major subsystems: a host interface, a reconfigurable processor array, and an array interface consisting of input/output buffers and control logic. The modularity of design facilitates easy migration to other machines.

[Figure 6: BioSCAN Circuit Board. Host interface: VME-bus data transceivers, address decode, HIF control, interrupt control logic, board control, CSR. Array interface: data FIFO (4K x 32), data formatter, 29-bit counter, index FIFO (1K x 29), identity FIFO (1K x 16), array control logic. Array: Chip0 to Chip15, each with AIn, AOut, AEn, CEn, Ovfl and Oval.]

The host interface (HIF) subsystem is designed to the specifications of the IEEE standard VMEbus protocol. The HIF module presents a simplified interface to other board subsystems. The functional decomposition and modularity of design enable the system design to be easily and efficiently adapted to a different host-bus standard by making changes only to the HIF module. The device driver software provides access to the BioSCAN circuit board through a set of 9 registers. Some registers access the ASICs directly for testing and initialization. Other registers are used to transfer data and operating parameters to the circuit board. The input FIFO register is replicated to form an 8 KByte page of simulated memory that may be useful for direct memory access.

The processor array consists of 16 ASICs daisy-chained in a ring fashion. This ring can be broken at any point to form a linear array of processors. In case of critical failure of one ASIC, the ring organization enables the remaining ASICs to be used as a contiguous array of 15 ASICs. The array is dynamically reconfigurable into smaller subarrays of contiguous ASICs. This is useful when the query sequence can be contained in a few ASICs and allows multiple searches to be performed in parallel. The logic function that allows extending or breaking the processor chain is contained within each ASIC. This eliminates any slowing down associated with placing the control logic on the circuit board and allows the processor array to operate at full clock speed. The circuit board contains 16 sockets that can be fully or partially populated with BioSCAN ASICs. These ASICs are accessed from the host machine through a static memory interface of three registers. The ASICs may be enabled individually for testing and initialization, and simultaneously for searching with a long query sequence.

The systolic processor array is data hungry. It often places high demands on the host-bus transfer rate.
The imbalance is amplified as the processor array is clocked at higher rates. Operating at 32 MHz, the BioSCAN array requires a stream of 28-bit values at a rate of 2 MHz. The array input interface (AIF) provides data management and control functions. To use the host VMEbus efficiently and to reduce disk storage requirements, databases are compressed and the circuit board implements a simple hardware decompression scheme. The decoding logic broadcasts the data to all 16 ASICs. The sequence databases are transferred over the VMEbus and buffered in a 32-bit FIFO. Each FIFO word contains multiple encoded database characters packed into 2, 3, 4, or 5-bit fields. The character size is selected based on the size of the alphabet required by a particular database.

The output from the systolic array consists of overflow indications from each individual ASIC. An overflow, defined as an accumulated score exceeding the threshold, indicates a significant alignment. In order to be meaningful, the overflow needs to be related back to the location of the alignment in the database. When an overflow occurs, a word is written in the output FIFO. The output word consists of two fields: a 29-bit index field and a 16-bit identity field. The index field is the value of a counter which is incremented whenever a database word is written to the array, thus counting diagonals. The identity value consists of one bit per ASIC indicating that the corresponding ASIC registered an overflow at the current diagonal.

The relative sizes of the input and output FIFOs were selected with the expectation that the similarity table and threshold would be designed to identify less than one percent of the diagonals as having significant alignments. Smaller input FIFOs result in reduced throughput on low-performance VMEbus hosts like a SUN 4/280. On the other hand, the size of the input FIFOs is less of a factor with higher performance hosts such as a SUN SparcSERVER 690MP.

The AIF control subsystem supervises the input and output FIFO status and changes operating modes accordingly. Database characters are written to the array whenever the board has been enabled by software and the input and output FIFOs are not empty and not full, respectively. The host is interrupted whenever either FIFO requires servicing.

The circuit board design offers a high degree of test coverage through the use of a small amount of extra control logic and no additional data paths. An input subsystem diagnostic mode allows the input FIFO and decoder to be tested by reading back each decoded word as it is presented to the array. Output subsystem diagnostic modes allow overflow indications to be written to the FIFO, independent of the array. Together, these diagnostic modes allow the board logic to be tested before installing any BioSCAN ASICs.
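The input and output data formats described in this section can be illustrated with a small sketch. The bit layouts are assumptions: the text gives the field widths (2 to 5 bits per character in a 32-bit word, a 29-bit index and a 16-bit identity) but not the packing order, so here characters are assumed to be packed from the least significant bit upward and identity bit k is assumed to correspond to ASIC k. The function names are hypothetical.

```python
def unpack_characters(word, field_bits):
    """Split one 32-bit FIFO word into encoded database characters.

    Assumed layout: fields are packed from the least significant bit
    upward; any leftover high-order bits are unused."""
    if field_bits not in (2, 3, 4, 5):
        raise ValueError("field size must be 2, 3, 4 or 5 bits")
    mask = (1 << field_bits) - 1
    count = 32 // field_bits
    return [(word >> (k * field_bits)) & mask for k in range(count)]


def decode_overflow(index, identity):
    """Return (diagonal counter value, list of ASICs that overflowed).

    Assumed layout: bit k of the 16-bit identity field flags ASIC k."""
    index &= (1 << 29) - 1
    chips = [chip for chip in range(16) if (identity >> chip) & 1]
    return index, chips


if __name__ == "__main__":
    print(unpack_characters(0b11100100, 2)[:4])
    print(decode_overflow(1000, 0b1000000000000101))
```

With 2-bit fields a word carries 16 characters, and with 5-bit fields it carries 6, which shows why choosing the smallest field that fits the alphabet reduces the load on the host bus.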
The BioSCAN board consists of over 250 equivalent ICs including 23 PLDs. These devices are connected by over 1700 signals on a circuit board with 4 signal layers and 2 power planes. The system was extensively simulated using the Verilog-XL simulator and worked the very first time it was powered up. Each BioSCAN circuit board with 16 ASICs supports searches of queries up to a maximum length of 12,992 elements. Four such circuit boards are installed and operational in two Sun-4 hosts. These relentless servers continue to serve the biological research community uninterrupted.
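The maximum query length follows from 16 ASICs of 812 PEs each, and the same arithmetic shows how many ASICs a shorter query occupies when the array is split into contiguous subarrays. The allocation policy below (first come, first served, starting at ASIC 0) is an assumption for illustration and is not taken from the BioSCAN scheduler.

```python
PES_PER_ASIC = 812
ASICS_PER_BOARD = 16


def asics_needed(query_length):
    """Number of ASICs required to hold one query character per PE."""
    return -(-query_length // PES_PER_ASIC)  # ceiling division


def allocate_subarrays(query_lengths):
    """Assign each query a contiguous subarray as (first ASIC, count).

    Hypothetical policy: queries are placed in order, left to right."""
    allocation, next_free = [], 0
    for length in query_lengths:
        need = asics_needed(length)
        if next_free + need > ASICS_PER_BOARD:
            raise ValueError("queries do not fit on one board")
        allocation.append((next_free, need))
        next_free += need
    return allocation


if __name__ == "__main__":
    print(PES_PER_ASIC * ASICS_PER_BOARD)
    print(allocate_subarrays([300, 2000, 812]))
```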

7 SOFTWARE ARCHITECTURE & IMPLEMENTATION

Two major objectives of the project were to provide a unique and useful resource for molecular biologists and to offer a platform for systems research. The functional components of the software architecture, shown in Fig. 7, reflect the hierarchical and layered nature of its design.

[Figure 7: BioSCAN System Architecture. Layers, top to bottom: Web user, command-line user and e-mail user; WWW server, Unix shell and mail server; BSCAN application; BioSCAN server; databases and BioSCAN PCB hardware.]


A networked client-server approach facilitated efficient utilization of the processing bandwidth of the BioSCAN system while meeting the needs of three distinct groups of users of the resource, namely, the end user, the application developer, and the system programmer. The primary design goal for the BioSCAN software was to provide a stable and consistent interface to the BioSCAN hardware for the first two types of users while preserving flexibility in the underlying implementation to accommodate the third.

The end users access the BioSCAN system through application programs and use the system as a research tool for sequence analysis. No specific knowledge of the hardware and software configuration is required. Any researcher with a network connection and appropriate client application software can access the BioSCAN system. The BioSCAN system supports user accesses over the Internet through an automated e-mail server and a World Wide Web (WWW) server, as illustrated in Fig. 8. Access to BioSCAN from the Internet using Web clients is described in more detail in the Internet Interface section below. End users with local access to the hardware can access the BioSCAN system using a Unix command line interface.

[Figure 8: BioSCAN User Interfaces. Clients (Mail, Netscape) outside the Internet boundary; daemons SENDMAIL, PROCMAIL (C program, .procmailrc) and APACHE HTTP (C program); CGI-BIN (HTML program, C program) and SERVE (Perl program); rsh, fork/exec and pipe connections to the BSCAN application (BioSCAN E-Mail Server and BioSCAN Web Server); reply results returned to the client.]

More advanced users can develop their own applications using the Application Programming Interface (API) provided by an Application Subroutine Library (ASL) to interface existing sequence analysis software to the BioSCAN system. The ASL translates API calls into messages to the BioSCAN server system. The system supports a sequence alignment application as shown in Fig. 9.

[Figure 9: BioSCAN Core Application. Components: BSCAN (#!/bin/sh script invoking zscan and allseg via fork/exec and pipe); ALLSEG (C program); ZSCAN (C program); DB flat sequence file (.dat); DB index file (.sdi); DB index file (.bsx); BioSCAN Server.]


For wider accessibility, the BioSCAN system was designed to operate as a server node on the Internet. This gives client applications shared access to the resource. The BioSCAN server is composed of three types of processes: server processes, scheduler processes, and run processes. Server processes accept transactions from clients and pass job requests to the scheduler process. The scheduler process organizes user jobs into runs that can be accommodated by the BioSCAN hardware. The BioSCAN server architecture is shown in Fig. 10.

[Figure 10: BioSCAN Core Server. Daemon BIOSCAND (C++ program, listener) receives socket connection requests from the BSCAN application and forks a BIOSCAND child (C++ program: queue job, exit); BIOSCHEDD (C++ program, jobs queue) forks a BIOSCHEDD child that calls back over a socket with results. User software space: scheduler, abstract operations, logical operations, register operations, DB encoded files (.bse: GenBank, GenPept, SwissProt, PIR). Kernel software space: device driver. Hardware: VME interface, processor array, BioSCAN PCB.]

After a run is constructed, a run process is started to handle interaction with the hardware. The run is passed to the first of several objects that form a hierarchy of abstract BioSCAN machines. This layered approach allows changes to be made in the BioSCAN hardware and low-level software without impacting the scheduling and the interface software. It is even possible to incorporate software simulations of unbuilt hardware devices with existing server software to assess the impact of a new design on the overall system function. The server-level software is implemented in C++ while the ASL is written in C.

8 INTERNET INTERFACE

Queries submitted via e-mail are processed on a first-come first-served basis. This method is most suited for users who do not require an interactive response. The submission format is based on that accepted by the BLAST server from NCBI. To obtain the current set of instructions for using the BioSCAN e-mail server, send a message to [email protected] with the word HELP or help on a line by itself in the body of the e-mail message. A more detailed tutorial on the system can be obtained in one of the following two ways: (1) For users with an FTP (File Transfer Protocol) client, the tutorial text can be retrieved via anonymous FTP from ftp.cs.unc.edu. The file tutor.txt resides in the directory /pub/projects/bioscan. (2) All other users can request the tutorial and any other help by sending a regular message to [email protected].

The BioSCAN WWW server is provided for interactive queries. Using a series of web pages, the WWW server provides a user-friendly, graphical interface to the BioSCAN network server. To access the BioSCAN WWW server, a WWW client program such as NCSA Mosaic or Netscape is necessary. The WWW client software must be capable of handling HTML+ forms. This requires NCSA Mosaic for X windows version 2.0 (or later) or Netscape. The Uniform Resource Locator for the BioSCAN home page is . Each of the highlighted text strings is an anchor which is a link to additional web pages on the server. The BioSCAN Online! page contains another set of anchors that point to the forms used to submit queries. To simplify query specification, the query pages are split into separate forms for nucleic acids and protein sequences. Only those options that are appropriate for the selected type of database search are displayed. If the sequence to be searched is already present in one of the online databases, one can select either Match a Named Nucleic Acid Sequence or Match a Named Protein Sequence, as appropriate. To provide a new sequence, one can use either Match a Nucleic Acid Sequence or Match a Protein Sequence. A typical user-specified protein sequence query is shown in Fig. 11a.

Figure 11: A protein sequence query and results: (a) query, (b) result.

The result of a search is presented as a hypertext document containing summary data and a table of database sequences that match the query sequence within the specified parameters. The summary information includes the length of the query sequence, the name of the database searched, the length of the database, the statistically expected number of alignments, and the number of alignments actually found. A detailed account of how to use the BioSCAN web server, the available options and parameters for database searches, and the interpretation of the search results is given in (Singh et al., 1996). A typical partial result is shown in Fig. 11b.


9 SYSTEM PERFORMANCE & COMPARISON

Many other special-purpose hardware systems (Lopresti, 1987; Gokhale et al., 1990; Chow et al., 1991; Hughey and Lopresti, 1991; Compugen, 1995) and software packages (Altschul et al., 1990; Pearson and Lipman, 1988) for the analysis of biosequences have been reported in the literature. The algorithmic differences, degree of programmability, and diversity of implementation technology make it very difficult to perform an objective comparison (Singh et al., 1993). In comparing the BioSCAN system with other special-purpose parallel computing systems, we note that these systems use different core algorithms, all variations of the original dynamic programming algorithm (Needleman and Wunsch, 1970), and different VLSI technologies. Since the basic architecture of these systems is systolic in nature, we have devised a performance measure based on the number of million cell operations per second (MCOPS), where a cell is defined as a PE in a hardware implementation or the equivalent block of instructions (basic inner loop) in a software implementation. Based on this measure, Table 2 shows the performance of several general-purpose and special-purpose systems.

The BioSCAN system, operating at a clock rate of 32 MHz, performs a cell operation in each PE every 16 clock cycles, for a theoretical processing rate of 2 MHz, that is, 2 million cell operations per second per PE. Each 812-PE chip therefore delivers a peak theoretical performance of 1,624 MCOPS, and the current BioSCAN system, with 16 chips on a single board, can deliver a peak performance of 25,984 MCOPS. In practice, the observed processing rate of the BioSCAN system is 1.95 million characters per second. Thus the 812 PEs on a single chip run at an effective rate of 1,583 MCOPS, and the current 16-chip BioSCAN system runs at 25,334 MCOPS. The performance data for the Sun-4/690MP system were obtained by running an optimized software implementation of the BioSCAN algorithm from Eq. 3.
For the software implementation of the BioSCAN algorithm, a cell corresponds to the operations needed to perform a single character-to-character comparison.

System            Date   Technology     Performance (MCOPS)
Sun-4/690 MP      1991   1 CPU                0.56
Convex C-240      1989   1 CPU                2.67
P-NAC             1987   4.0 μm NMOS          1.1
SPLASH            1990   1.0 μm FPGA         50
bioccelerator-4   1994   1.0 μm FPGA       1000
B-SYS             1991   2.0 μm CMOS       1551
BISP              1991   1.0 μm CMOS       9600
BioSCAN           1992   1.2 μm CMOS      25334

Table 2: Performance of Special-purpose and General-purpose Systems
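The BioSCAN figures in this section follow from a few multiplications, reproduced in the short Python sketch below. The constants are the values quoted in the text. The function names are ours, and `software_mcops` encodes our reading of the software cell definition (one cell per query-character and database-character pair), which is an assumption and not a formula given in the paper.

```python
PES_PER_CHIP = 812            # processing elements per chip (from the text)
CHIPS_PER_BOARD = 16          # chips on the current board (from the text)
CLOCK_MHZ = 32.0              # system clock rate in MHz (from the text)
CYCLES_PER_CELL_OP = 16       # clock cycles per cell operation (from the text)
OBSERVED_MCHARS_PER_S = 1.95  # observed rate, million characters per second


def peak_mcops(pes: int) -> float:
    """Peak rate: each PE completes CLOCK_MHZ / CYCLES_PER_CELL_OP million ops/s."""
    return pes * CLOCK_MHZ / CYCLES_PER_CELL_OP


def effective_mcops(pes: int) -> float:
    """Effective rate: every PE handles each database character as it streams by."""
    return pes * OBSERVED_MCHARS_PER_S


def software_mcops(query_length: int, database_chars: int, seconds: float) -> float:
    """Assumed software measure: one cell per query/database character pair."""
    return query_length * database_chars / seconds / 1e6


total_pes = PES_PER_CHIP * CHIPS_PER_BOARD

print(f"PEs per board:   {total_pes}")                             # 12992
print(f"chip peak:       {peak_mcops(PES_PER_CHIP):.0f} MCOPS")     # 1624
print(f"board peak:      {peak_mcops(total_pes):.0f} MCOPS")        # 25984
print(f"chip effective:  {effective_mcops(PES_PER_CHIP):.1f} MCOPS")  # 1583.4
print(f"board effective: {effective_mcops(total_pes):.1f} MCOPS")     # 25334.4
```

The reported 1,583 and 25,334 MCOPS correspond to these effective values with the fraction dropped. The total of 12,992 PEs per board also equals the maximum query length quoted for the system.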

10 SUMMARY & FUTURE WORK

We have described a massively parallel, fault-tolerant, scalable systolic multiprocessor system called BioSCAN (Biological Sequence Comparative Analysis Node) for rapid, rigorous, and sensitive searches of biosequence (DNA, RNA, protein) databases. A combination of hardware and software offers both the desired performance and the necessary flexibility. The BioSCAN system can perform a single scan of a sequence of up to 12,992 elements past an arbitrarily large database. Scanning a database of biosequences at a sustained rate of 2 million elements per second, independent of query sequence length, the BioSCAN hardware runs an exhaustive, non-gapped sequence comparison algorithm up to 1,000 times faster than a software implementation on a state-of-the-art workstation. The result of a BioSCAN search may be useful by itself, but it generally serves as substantially pre-filtered input for subsequent software analyses such as reporting detailed gapped alignments. Searching a protein sequence against the protein database (SWISS-PROT 32 with 17 million residues) takes just 10 seconds, whereas a nucleic acid sequence can be searched against the DNA/RNA database (GenBank Release 94.0 with ~499 million bases) in 4 minutes and 23 seconds.

The BioSCAN resource can be accessed over the Internet using World Wide Web clients such as Mosaic and Netscape. The Uniform Resource Locator address for the system is . The automated server can also be reached by sending a request in a specially formatted e-mail message to the address [email protected]. Statistics kept on BioSCAN usage via the Internet indicate widespread international user acceptance. New work continues in the direction of analyzing three-dimensional structures of proteins using the BioSCAN systolic array, which is enabled by a novel method of mapping the 3D protein structure to a 1D sequence.
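As a rough check on the search times quoted above, the pure scan time can be estimated by dividing the database size by the observed scan rate of 1.95 million characters per second. The Python sketch below does this. The database sizes are the approximate figures from the text and the helper names are ours. The reported times come out slightly longer than the estimates, which would be consistent with work outside the scan itself, although the text does not break the times down.

```python
OBSERVED_RATE = 1.95e6  # characters per second, observed rate from the text


def scan_seconds(database_chars: int, rate: float = OBSERVED_RATE) -> float:
    """Time to stream the whole database past the array at a fixed rate."""
    return database_chars / rate


def implied_rate(database_chars: int, seconds: float) -> float:
    """Average characters per second implied by a reported search time."""
    return database_chars / seconds


def format_duration(seconds: float) -> str:
    """Render a duration as minutes and whole seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs:02d} s"


SWISS_PROT_RESIDUES = 17_000_000  # approximate size quoted in the text
GENBANK_BASES = 499_000_000       # approximate size quoted in the text

protein_estimate = scan_seconds(SWISS_PROT_RESIDUES)  # about 8.7 s (reported: 10 s)
dna_estimate = scan_seconds(GENBANK_BASES)            # about 256 s (reported: 263 s)

print(f"protein scan estimate: {protein_estimate:.1f} s")
print(f"DNA scan estimate:     {format_duration(dna_estimate)}")
print(f"rate implied by 4 min 23 s: {implied_rate(GENBANK_BASES, 263) / 1e6:.2f} M chars/s")
```

Because the scan rate is independent of query length, these estimates depend only on the database size.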

Acknowledgements

This research was supported in part by NSF grant MIP-902445. Significant support was also provided by MCNC under the Design Initiative Research program. The BioSCAN web site uses web server software from the Apache HTTP Server Project. The authors would like to thank Cadence Design Systems, Inc. for providing the Verilog-XL simulator.

References

Altschul, S. F. (1991). Amino acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology, 219:555-565.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215:403-410.

Brutlag, D., Dautrucourt, J., Fier, J., Moxon, B., and Stamm, R. (1993). Blaze. Computational Chemistry, 17:203.

Chow, E. T., Hunkapiller, T., Peterson, J. C., Zimmerman, B. A., and Waterman, M. S. (1991). A systolic array processor for biological information signal processing. Proceedings of the International Conference on Supercomputing, pages 216-223.

Compugen (1995). Bioccelerator User Guide. Compugen Ltd., Hamacabim St. 17, Petah-Tikva, 49220 Israel.

Gokhale, M., Holmes, B., Kopser, A., Kunze, D., Lopresti, D., Lucas, S., Minnich, R., and Olsen, P. (1990). Splash: A reconfigurable linear logic array. Proceedings of the International Conference on Parallel Processing, pages 526-532.

Hughey, R. P. and Lopresti, D. P. (1991). B-sys: A 470-processor programmable systolic array. Proceedings of the International Conference on Parallel Processing, pages 580-586.

Karlin, S. and Altschul, S. F. (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences, 90:5873-5877.

Lopresti, D. (1987). P-NAC: A systolic array for comparing nucleic acid sequences. Computer, 20(7):98-99.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-453.

Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85:2444-2448.

Rigoutsos, I. and Califano, A. (1993). dFLASH: A distributed fast look-up algorithm for string homology. Technical report, IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598.

Singh, R. K., Hoffman, D. L., Tell, S. G., and White, C. T. (1996). Bioscan: A network sharable computational resource for searching biosequence databases. Computer Applications in the Biosciences, 12(3). (in press).

Singh, R. K., Tell, S. G., White, C. T., Hoffman, D. L., Chi, V. L., and Erickson, B. W. (1993). A scalable systolic multiprocessor system for analysis of biological sequences. Proceedings of the Symposium on Integrated Systems, pages 167-182.

Smith, T. F. and Waterman, M. S. (1981). Comparison of biosequences. Adv. Appl. Math., 2:482-489.

White, C. T., Singh, R. K., Reintjes, P. B., Lampe, J., Erickson, B. W., Dettloff, W. D., Chi, V. L., and Altschul, S. F. (1991). BioSCAN: A VLSI-based system for biosequence analysis. Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors, pages 504-509.
