
BioSCAN: A VLSI-Based System for Biosequence Analysis

C. Thomas White†, Molecular Biology & Biotech., University of North Carolina
Raj K. Singh, Dept. of Computer Science, University of North Carolina
Peter B. Reintjes, Quintus Computer Systems, Mountain View, CA
Bruce W. Erickson, Dept. of Chemistry, University of North Carolina
Wayne D. Dettloff, MCNC, Research Triangle Park, NC
Vernon L. Chi, Dept. of Computer Science, University of North Carolina
Stephen F. Altschul, NCBI, National Library of Medicine, Bethesda, MD

Jordan Lampe, Dept. of CS and Engineering, University of Washington

†Corresponding author: C. Thomas White, CB# 7100, University of North Carolina, Chapel Hill, NC 27599-7100. E-mail: tom@med.unc.edu

CH3040-3/91/0000/0504$01.00 © 1991 IEEE

Abstract

A special-purpose computer system has been designed to accelerate scanning large databases of DNA and protein sequences (biosequences) for patterns of interest. The system consists of a custom-designed circuit board installed in a host workstation and associated software. The board features a variable number of identical full-custom ASICs. Each BioSCAN ASIC, in turn, features a large one-dimensional systolic array of identical processing elements (PEs). The BioSCAN system scans approximately two million database elements per second. For typical problems this results in a 1000-fold speedup over current workstations.

1 Introduction

The identification and characterization of biological sequences are fundamental tasks of modern molecular biology. Presently the DNA sequence database contains over 50 million characters and the protein sequence database contains over 7 million characters[5,12]. These databases continue to grow by approximately 50% per year and are expected to be 10 to 50 times larger by the year 2000[3]. One of the goals of the federally-supported Human Genome Initiative is to determine and analyze approximately 3 billion characters of human DNA in the next 15 years[7]. Faster methods of biosequence database analysis are urgently needed. The Biological Sequence Comparative Analysis Node (BioSCAN) project[14,4] provides one cost-effective solution to this problem.

For computational purposes, a biosequence is a string of characters and a segment is a substring of contiguous characters from one sequence. Sequences are typically several hundred to several thousand characters in length, though the largest contain over one hundred thousand characters. The basic task in database searching is to identify those sequences in the database which contain at least one segment sufficiently similar to some segment of a query sequence. While there are many approaches to this problem, the fundamental computational complexity of this task is proportional to the product of the length of the query sequence and the total number of characters in the sequence database.

The query sequence is typically a DNA or protein fragment whose sequence has just been determined in a laboratory. Alternatively, the query may be a pattern which represents a set of sequences, such that at each position in the pattern some combination of characters is required, tolerated, or prohibited.

In general, segment pairs (one from a database sequence and the other from the query sequence) may be considered similar though few, if any, elements within the segments match identically. This characteristic distinguishes biological sequence analysis from other pattern matching problems where one typically looks for short identical segment pairs.

For some comparison techniques, a segment from one sequence will only be matched to segments of the same length from the other sequence. In this case, the ith elements of each segment are aligned with one another. A table is used to assign a similarity score to each possible alignment pair. The score for the segment pair is typically the sum of the scores associated with each element pair. Algorithms which compare segments of unequal length insert null elements at various places in each segment, so that elements of one segment are aligned with either a real element of the other segment or a null. Segment scores are reduced in various ways to penalize for the number and length of null insertions. While it may seem that allowing nulls increases the order of the computation, the most standard version, known as "affine gap cost", increases the computation only by about a factor of 10. While there are strong advocates for each method, for various reasons beyond the scope of this presentation, the BioSCAN architecture directly implements the comparison of segment pairs without the insertion of nulls.

solution. Generalized database searching techniques do not fit this problem. Finally, an algorithm considered for VLSI hardware must map well to this technology. BioSCAN features a one-dimensional systolic array of a very large number of very small concurrently-operating PEs. The architecture does not constrain the size of the array. More PEs per ASIC are always useful. Thus, incremental improvements in layout or technology immediately translate to higher system functionality. Further, the ASIC pinout is independent of the array size, so future implementations need not require modifications to the circuit board. The architecture also allows any number of BioSCAN ASICs to reside on a board, requiring only a single-wire communication path between adjacent devices. Thus a minimal system could be designed for a microcomputer using one or two BioSCAN ASICs.

2 Project Justification

Application-specific VLSI projects such as BioSCAN are worthwhile only if certain conditions are met. First, the problem to be solved must be important. Currently the length of time required to scan biosequence databases for patterns of interest on general purpose computers may inhibit researchers from performing a thorough analysis. With a newly determined sequence, many scans might be worthwhile, using different similarity tables and reporting thresholds[1]. More attention might be given to discovering higher-order patterns or motifs were it possible to quickly perform systematic searches integrating multiple database scans. As database searching becomes a major activity on any given computing platform, it quickly becomes more cost effective to perform this work on specialized hardware. Table 1 compares the estimated cost/performance ratio of the BioSCAN system to that of four other computing systems.

Table 1: Cost/Performance Ratios of Computer Systems Executing the Linear Similarity Algorithm
[Columns: computer system (class), cost (in dollars), relative performance, cost/performance ratio. Systems: BioSCAN system (Sun4 with 10K PE linear array), cost 50K; MasPar (VAXstation with 16K PE rectangular array); VAX4600 or IBM4300 (stand-alone mainframe); Sun (stand-alone workstation), cost 40K; Convex C2 (vector mini-super), cost 1,000K. The remaining entries are illegible in the source.]

3 System Overview

The mathematical function implemented by BioSCAN operates on two strings of characters: A from alphabet α and B from alphabet β. Given an arbitrary function of similarity values S(a,b), where a is in α and b is in β, every segment of A is compared with every segment of B of the same length. The alignment score for each segment pair of length L and leftmost characters A[i] and B[j] is simply the sum of the similarity values of the corresponding aligned elements:

$$\sum_{k=0}^{L-1} S(A[i+k],\, B[j+k])$$

Figure 1 shows the diagonal path graph for a search pattern B and database sequence A. Each segment pair in which A[i] is aligned with B[j] and whose alignment score equals or exceeds a threshold T appears as a bold diagonal line segment lying on the diagonal with index (i-j). The BioSCAN system reports the index of each diagonal that contains at least one segment pair whose score is at least T.

Figure 2 expresses the algorithm as a "C" program fragment. The BioSCAN hardware performs the inner loop in parallel. A test for negative sums allows the highest-scoring segment pair on a given diagonal to be computed in a single pass. The threshold test in effect tags high-scoring diagonals, ensuring that a threshold-exceeding segment score is not lost in trying to extend a high-scoring segment further down the diagonal. Only the diagonal index is reported; neither the location nor score of the contained segments is kept.
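The single-pass rule described in the System Overview can be illustrated with a small software sketch. This is not the program of Figure 2 and not the hardware data path; it simply walks one diagonal at a time. The function names `sim` and `diag_hit` and the toy +1/-1 similarity function are assumptions made for the example.

```c
#include <assert.h>

/* Toy similarity function (an assumption for illustration):
   +1 for identical characters, -1 otherwise. */
int sim(char a, char b) { return a == b ? 1 : -1; }

/* Returns 1 if diagonal d = i - j of the comparison of A (length la)
   with B (length lb) contains a segment pair scoring at least t. */
int diag_hit(const char *A, int la, const char *B, int lb, int d, int t)
{
    assert(t > 0);
    int i = d > 0 ? d : 0;       /* first cell on the diagonal */
    int j = d > 0 ? 0 : -d;
    int sum = 0;                 /* best score of a segment ending here */
    for (; i < la && j < lb; ++i, ++j) {
        if (sum < 0) sum = 0;    /* negative test: drop a losing prefix */
        sum += sim(A[i], B[j]);
        if (sum >= t) return 1;  /* threshold test: tag the diagonal */
    }
    return 0;
}
```

Calling `diag_hit` for every d from -(lb-1) to la-1 visits each cell of the path graph once, which matches the stated cost proportional to the product of the two sequence lengths. Because the running sum is reset whenever it goes negative, it always holds the best score of any segment ending at the current cell, so a single pass suffices.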

Special-purpose VLSI also requires a stable algorithm. The one implemented by BioSCAN is a variation of one which has remained at the core of biosequence analysis for many years[13]. Various heuristic prefilters currently used to speed up database scans[11,2] are not as fast or rigorous as a hardware-based

Figure 1: Diagonal Path Graph (axes: Sequence A and Sequence B)

The information state of the system at any time may be visualized as a vertical column in Figure 1. During the ith step, the ith database sequence character A[i] is input to the system and the system information state shifts from column i-1 to column i. The jth PE is responsible for computing partial sums on the jth row. The partial sum computed in PEj at step i is passed forward to PEj+1 to use in step i+1.

    /* static data */
    int  T,          /* threshold */
         S[M][N],    /* similarity table */
         La, Lb;     /* length of A & B */
    char *A,         /* sequence A[La] */
         *B;         /* sequence B[Lb] */
    /* variables */
    int  *R,         /* similarity scores */
         i, j, k;    /* index A and B */

    /* BioSCAN algorithm */
    /* initialize scores */
    for (j = 0; j < Lb; ++j) R[j] = 0;
    /* each element of A */
    for (i = 0; i < La; ++i) {
        /* each element of B, bottom up */
        for (j = Lb-1; j >= 0; --j)
            if (R[j] >= T)                        /* threshold reached: pass unchanged */
                R[j+1] = R[j];
            else if (R[j] < 0)                    /* negative sum: restart from zero */
                R[j+1] = S[A[i]][B[j]];
            else
                R[j+1] = S[A[i]][B[j]] + R[j];    /* normal accumulation */
        if (R[Lb] >= T)
            report(i);
    }

Figure 2: BioSCAN Algorithm

4 System Architecture

The BioSCAN system consists of a 9U VME circuit board in a Sun4 workstation. The circuit board will contain 10 to 20 sockets that can be fully or partially populated by BioSCAN ASICs. The ASICs may be tested individually in a running system. Bad PEs can be mapped out of the system dynamically. Software configures the system either to perform simultaneous scans of many relatively short patterns or to engage the entire array to scan one large pattern sequence in a single pass.

Figure 3 shows the configuration of the host workstation and the five major functional blocks of the BioSCAN circuit board. The status, control and data I/O registers are memory mapped. FIFOs in input and output subsystems are used to sustain system throughput.

Prior to a database scan, the ASICs are programmed with both the pattern sequence and the similarity table to be used. A threshold detector circuit is hardwired in each PE to detect values greater than or equal to 16384 (high bits 0 and 1, respectively). The similarity table entries are prescaled at load time to this threshold. More precisely, similarity values originally scaled to a threshold T are multiplied by (16384/T).

The database scanned generally consists of at least several thousand discrete sequences. Each sequence contains characters drawn from a 28-character alphabet. Sequences are separated from one another by a designated delimiter character, allowing the

database to appear as one long sequence. The delimiters prevent subthreshold scores of one sequence being added to scores computed for the next sequence.

To minimize the I/O bandwidth during the scan, the subsystem receives a compressed form of each sequence through the VME bus. Fields of up to 5 bits encode characters packed into 32-bit words. The circuit board decodes each field and broadcasts the result to all ASICs.

A programmable mask register on the BioSCAN board selects which ASICs will report threshold-exceeding alignments. When so signaled, the circuit board buffers the value of an on-board counter, which represents the number of characters broadcast from the beginning of the scan. It also records which ASICs reported a hit at that time. From this information, software routines can determine which alignments have met the search criteria.

Figure 3: Hardware System Configuration
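Two of the host-side steps described in this section, prescaling the similarity table to the fixed threshold of 16384 and packing characters into 5-bit fields, can be sketched in C as follows. This is only an illustration under stated assumptions: the function names `prescale` and `unpack`, the truncating rounding, and the field layout (least significant bits first, no field straddling a word) are not specified in the text.

```c
#include <assert.h>
#include <stdint.h>

#define HW_THRESHOLD 16384   /* fixed threshold detected in each PE */

/* Rescale one similarity value from a user threshold t to the hardware
   threshold, i.e. multiply by 16384/t. Truncation toward zero is an
   assumption; the text does not state a rounding rule. */
int16_t prescale(int value, int t)
{
    assert(t > 0);
    long scaled = (long)value * HW_THRESHOLD / t;
    assert(scaled >= INT16_MIN && scaled <= INT16_MAX);
    return (int16_t)scaled;
}

/* Unpack n characters of `bits` bits each (bits <= 5) from 32-bit words.
   Assumed layout: fields fill each word from the least significant bit
   upward and never straddle a word boundary. */
void unpack(const uint32_t *words, int bits, int n, uint8_t *out)
{
    assert(bits >= 1 && bits <= 5);
    int per_word = 32 / bits;               /* 6 fields when bits == 5 */
    uint32_t mask = (1u << bits) - 1u;
    for (int k = 0; k < n; ++k) {
        uint32_t w = words[k / per_word];
        out[k] = (uint8_t)((w >> (bits * (k % per_word))) & mask);
    }
}
```

The rounding rule matters in practice. With truncation and T = 50, a value of 5 becomes 1638, so ten such values sum to 16380 rather than 16384 and would fall just short of the hardware threshold.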

5 VLSI Implementation

A table of similarity values is stored in each ASIC in an array of standard static RAM cells of 28 banks, each with 16 rows and 28 columns (Figure 4). Logically the RAM may be considered a table of 28 rows and 28 columns, each entry being a 16-bit 2's complement-encoded integer. At run time, each database character received by the ASIC selects one of the 28 banks, which corresponds to one entire row of the logical table (28 16-bit integers). For each of the next 16 clock cycles, the 28 column bits in successive rows of the selected bank are sensed, amplified, and broadcast within the ASIC. By the end of 16 clock cycles, 28 different 16-bit integers have been broadcast to the PEs.

Figure 4: Memory Organization

The row of data broadcast from the similarity table is seen by a linear systolic array of PEs occupying over 80% of the active area of the ASIC (Figure 5). Each PE has been programmed to select one of the 28 broadcast lines as its input for the duration of a database scan. This programming represents one character of the pattern sequence, and resides in 8 bits of static RAM. Data on the selected line is always from the column of the similarity table assigned to the pattern sequence element. While 5 bits would be sufficient to select one of 28 lines, the decoder is much simplified by partially pre-decoding the pattern character in software to a 2-of-8 format.

Figure 5: BioSCAN ASIC Internal Data Path

During the 16 clock cycles in which successive bits of one similarity value are received, they are added bit serially to a similarity score calculated in the previous 16 cycles by the previous PE in the array. A very small amount of additional logic within each PE handles two exceptions: (1) negative scores received from the previous PE are zeroed before the current accumulation and (2) scores received greater than or equal to 16,384 are passed unchanged to the next PE. The score computed by the last PE on an ASIC is driven off-chip to be received by the next ASIC in the chain and given to the first PE on that device. In addition, threshold-exceeding scores emerging from the final PEs cause an alignment hit to be signaled to the on-board circuitry.

The original ASIC layout, BioSCAN 1.0, contained over 1.5 million transistors and 2196 PEs in a 7.8 mm by 9.2 mm frame. It was created at MCNC (Research Triangle Park, NC) for an experimental in-house 0.8 µm CMOS process. The BioSCAN 2.0 ASIC is currently being designed at MCNC for a more modest (and cheaper) MOSIS technology: 1.2 µm scalable CMOS. It will have about half the number of transistors and approximately 800 processors in the same 7.8 mm by 9.2 mm area.
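The per-PE behavior described in this section can be modeled in software as a rough sketch: a 16-cycle bit-serial addition with the two exceptions, plus one possible 2-of-8 encoding of the 28 pattern characters. The function names `pe_step` and `two_of_eight`, the least-significant-bit-first order, and the particular enumeration of 2-of-8 codes are assumptions; the text gives none of these details.

```c
#include <assert.h>
#include <stdint.h>

#define HW_THRESHOLD 16384  /* bit 15 clear, bit 14 set */

/* One PE step: combine the score received from the previous PE with the
   selected similarity value, one bit per clock, least significant bit
   first (the bit order is an assumption). */
int16_t pe_step(int16_t prev, int16_t sim)
{
    if (prev >= HW_THRESHOLD) return prev;  /* exception 2: pass unchanged */
    if (prev < 0) prev = 0;                 /* exception 1: zero negatives */

    uint16_t a = (uint16_t)prev, b = (uint16_t)sim, sum = 0;
    unsigned carry = 0;
    for (int clk = 0; clk < 16; ++clk) {    /* 16 clocks per character */
        unsigned x = (a >> clk) & 1u, y = (b >> clk) & 1u;
        sum |= (uint16_t)((x ^ y ^ carry) << clk);   /* full-adder sum bit */
        carry = (x & y) | (carry & (x ^ y));         /* full-adder carry */
    }
    /* Reinterpret the 16-bit pattern as a two's complement value. */
    return (int16_t)(sum >= 0x8000u ? (int)sum - 0x10000 : (int)sum);
}

/* Map a pattern character index 0..27 to an 8-bit code with exactly two
   bits set. There are C(8,2) = 28 such codes, one per character; the
   enumeration order used here is an assumption. */
uint8_t two_of_eight(int c)
{
    assert(c >= 0 && c < 28);
    for (int hi = 1; hi < 8; ++hi)
        for (int lo = 0; lo < hi; ++lo)
            if (c-- == 0) return (uint8_t)((1u << hi) | (1u << lo));
    return 0;  /* not reached for valid input */
}
```

The count of two-hot 8-bit codes is exactly 28, which matches the 28-character alphabet and the 8 bits of static RAM per PE. Feeding the output of one `pe_step` call into the next models the score moving along the array.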

6 Comparison with Related Projects

We are aware of four other hardware projects intended specifically, at least in part, for biosequence analysis. In 1985 Lipton and Lopresti proposed the Princeton chip for Nucleic Acid Comparison, or P-NAC[9]. Implemented in 4 µm nMOS with heavy use of PLAs, the P-NAC was designed to calculate a global distance metric on DNA sequences. It employs substitution costs fixed at twice that of individual insertions/deletions.

SPLASH is a more recent project of Lopresti which employs a field-programmable logic array[10]. While intended to use the same algorithm as P-NAC, SPLASH can also be programmed for other purposes.

Most recently, Lopresti and Hughey have created B-SYS[6], a 2 µm scalable CMOS implementation of a novel SIMD systolic shared register architecture. B-SYS can be programmed for a wide range of biosequence analysis tasks. While a comparable BioSCAN system is much cheaper and at least 50 times faster when performing the linear similarity algorithm, it is not so generally programmable.

Finally, BISP is a 400K, 1 µm CMOS systolic ASIC which can perform local alignments directly implementing the affine gap cost algorithm[8]. There are 16 PEs per ASIC, each with a 128-element local data table. The onboard controller is an Intel i860. The database scan rate is similar to BioSCAN. BISP has several operating modes, making it more versatile than BioSCAN. This generality does lead to a substantially higher system cost.

7 Conclusions

A general-purpose workstation enhanced by special-purpose hardware can compete favorably with the fastest computers now available for performing these biosequence comparison algorithms. Moreover, a VLSI-enhanced workstation is far less costly and thus much more accessible. The BioSCAN project addresses the need for both a production machine for scanning the existing databases and a research engine for developing novel methods of biosequence analysis. It demonstrates the spectacular performance gains possible in applying VLSI technology to specialized applications.

Acknowledgements

The BioSCAN 1.0 ASIC layout was supported by the MCNC Design Initiative Program.

References

[1] S. F. Altschul, "Amino Acid Substitution Matrices from an Information Theoretic Perspective," J. Mol. Biol. 219, pp. 555-565, 1991.

[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, & D. J. Lipman, "Basic Local Alignment Search Tool," J. Mol. Biol. 215, pp. 403-410, 1990.

[3] C. Burks, "How much sequence data the databanks will be processing in the near future," Biomolecular Data, R. R. Colwell, ed., pp. 17-26, Oxford University Press, NY, 1989.

[4] W. D. Dettloff, R. K. Singh, C. T. White, & B. W. Erickson, "A 50 MHz 1.5M Transistor ASIC for Biosequence Analysis," ISSCC Digest of Technical Papers, p. 40, Feb. 1991.

[5] GenBank(R) Release 67.0, IntelliGenetics Inc., Mountain View, CA, March 15, 1991.

[6] R. P. Hughey, "Programmable Systolic Arrays," PhD dissertation, Brown University, Providence, RI, 1991.

[7] Human Genome: 1989-90 Program Report, U.S. Department of Energy, Washington, D.C., March 1990.

[8] T. Hunkapiller, M. Waterman, R. Jones, J. Eggert, E. Chow, J. Peterson, & L. Hood, "Special Purpose VLSI-Based System for the Analysis of Genetic Sequences," Human Genome: 1989-90 Program Report, p. 101, U.S. Department of Energy, Washington, D.C., March 1990.

[9] R. J. Lipton & D. Lopresti, "A systolic array for rapid string comparison," in 1985 Chapel Hill Conference on VLSI, pp. 363-376, University of North Carolina at Chapel Hill, Chapel Hill, NC, 1985.

[10] D. P. Lopresti, "Rapid Implementation of a Genetic Sequence Comparator Using Field-Programmable Logic Arrays," Advanced Research in VLSI 1991, UC Santa Cruz.

[11] W. R. Pearson & D. J. Lipman, "Improved tools for biological sequence comparison," Proc. Natl. Acad. Sci. USA 85, pp. 2444-2448, 1988.

[12] Protein Identification Resource Release 27.0, National Biomedical Research Foundation, December 31, 1990.

[13] T. F. Smith & M. S. Waterman, "Identification of Common Molecular Subsequences," J. Mol. Biol. 147, pp. 195-197, 1981.

[14] C. T. White & W. D. Dettloff, "The BioSCAN Project: An Interdisciplinary Approach to Biosequence Analysis," MCNC Technical Bulletin, Vol. 1, No. 1, pp. 8-9, Sept./Oct. 1989.