The Journal of Supercomputing, 32, 51–69, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.
New Processor Array Architectures for the Longest Common Subsequence Problem

PANAGIOTIS D. MICHAILIDIS    [email protected]; http://macedonia.uom.gr/˜panosm
KONSTANTINOS G. MARGARITIS   [email protected]; http://macedonia.uom.gr/˜kmarg

Department of Applied Informatics, Parallel and Distributed Processing Laboratory, University of Macedonia, 156 Egnatia str., P.O. Box 1591, 54006 Thessaloniki, Greece
Abstract. A longest common subsequence (LCS) of two strings is a common subsequence of the two strings of maximal length. The LCS problem is to find an LCS of two given strings and the length of the LCS (LLCS). In this paper, we present a new linear processor array for solving the LCS problem. The array is based on a parallelization of a recent LCS algorithm which consists of two phases, i.e. preprocessing and computation. The computation phase is based on a bit-level dynamic programming approach. Implementations of the preprocessing and computation phases on the same processor array architecture for the LCS problem are discussed. Further, we propose a block processor array architecture which reduces the overall communication and time requirements. Finally, we develop a performance model for estimating the performance of the processor array architecture on Pentium processors.

Keywords: longest common subsequence problem, linear processor arrays, parallel architectures, parallel algorithms, VLSI
1. Introduction
Given a string x over an alphabet Σ, a subsequence of x is any string w that can be obtained from x by deleting zero or more (not necessarily consecutive) symbols. The longest common subsequence (LCS) problem for two input strings x = x_1...x_m and y = y_1...y_n (m ≤ n) consists of finding a third string w = w_1...w_l such that w is a subsequence of both x and y of maximum possible length. The LCS problem is to find an LCS of two given strings, and the LLCS problem is to find the length of the LCS; the LLCS of x and y is l. The LLCS and LCS problems have been the subject of much research because they can be applied in many areas, such as data compression, word processing, syntactic pattern recognition and molecular biology [6, 7, 27, 28]. The lower bounds for the LCS problem are Ω(n) and Ω(n log m) time, according to whether the size of the alphabet is bounded or unbounded [8]. For unbounded alphabets, any algorithm using only “equal-unequal” comparisons must take Ω(mn) time in the worst case [1]. The fastest general solution for the LCS problem is the corresponding solution to the string editing problem by Masek and Paterson [22], taking O(n² log log n / log n) time for unbounded alphabet size and O(n²/log n) time for bounded alphabet size. Many algorithms have been developed that, although not improving the general O(mn) time bound of the dynamic programming approach, exhibit a much better performance by specializing on certain classes of pairs of sequences. A recent survey of well known sequential algorithms for the LCS problem can be
found in [3]. To obtain even faster solutions to this problem, theoretical parallel algorithms have been proposed [2, 20]. Recent advances in Very Large Scale Integration (VLSI) technology have made possible the development of application specific processor arrays for complex and computationally intensive problems. The characteristics of parallelism, concurrency, pipelining, modularity and regularity have become standard in VLSI designs. An application specific processor array architecture is practical for the LCS problem when m and n are very large. Similar linear processor arrays for the LLCS and LCS problems have been proposed by several researchers [11–19, 26, 30]. Surveys on processor arrays for the LCS problem and related problems can be found in [9, 25]. Yang and Lee [30] proposed two linear systolic arrays for the LLCS problem. One of them allows the two strings x and y to flow from opposite sides of the array of cells. This array computes the length of an LCS in 2n + m − 1 time steps and the number of cells required is 2n. However, its main disadvantage is low processor utilization. Further, they proposed an improved systolic array that follows a unidirectional data flow, where one string is loaded into the array and the other string flows from left to right through the array. The number of time steps required by this improved array is n + 2m − 1 and the number of cells required is reduced to m. However, these two linear approaches require many registers in each PE (Processing Element) and the PE operations are quite complex. Robert and Tchuente [26] proposed a linear unidirectional systolic array of m cells that computes the length of an LCS in n + 2m time steps. However, this design does not enable the recovery of an LCS. In order to recover an LCS, the authors extended the previous array to a 2D array of m PEs, each one equipped with a systolic stack to store the matches (i.e. x_i = y_j) which occurred while computing l. The array then operates a second phase in reverse order to trace back a solution, which is output from the last cell after 3m time steps. Therefore, this array finds an LCS in 3m − 2 more time steps, but the stack architecture is not suitable for wafer-scale integration (WSI) technologies. Lin [13] presented a linear systolic array of m cells that requires fewer I/O ports than Robert and Tchuente’s array [26] to compute the LLCS in n + 2m − 1 time steps. To retrieve an LCS, Lin’s systolic design is the same as the one presented in [26], but it uses a content-addressable memory (CAM) in each PE instead of the systolic stacks. This architecture takes 2m − 1 additional time steps, fewer than required by the design in [26]. Lecroq et al. [11, 12] presented a unidirectional systolic array architecture of m cells that solves the LLCS and LCS problems concurrently in n + 2m time steps; to achieve this time improvement, each PE is equipped with m registers for storing an LCS. These registers serve as a substitute for the systolic stacks or the content-addressable memories used in [26] and [13] respectively for recovering an LCS. This is the first architecture that computes the length of an LCS and recovers the LCS in only one phase or pass, instead of two as in [13, 26]. Further, Lin and Chen [14, 15] followed the systolic solution of [11] for computing the LLCS and LCS problems in one pass, but they applied the systolic array of [13] to compute the length of an LCS.
Luce and Myoupo [19] proposed a unidirectional and modular semi-mesh architecture of m(m + 1)/2 cells to recover an LCS of two strings in n + 3m + l time
steps. This architecture uses fewer cells and runs faster compared with the 2D array in [26]. Lin and Yeh [17, 18] introduced a systolic array that uses ⌈m/2⌉ cells and takes n + 2⌈m/2⌉ − 1 time steps, in which two characters of string x are input at a time. Finally, Lin and Yeh [16] presented a similar systolic array to that in [17, 18], but it uses r cells, to each of which q = ⌈m/r⌉ characters of string x can be input at a time. This array can solve the LLCS and LCS problems in n + 2r − 1 time steps. All the above-mentioned systolic designs are based on parallelization of the classical dynamic programming algorithm [7]. In this paper, the processor array implementation of a new LCS algorithm [4, 5] is discussed in the context of the systolic paradigm for VLSI computation [10, 23, 24]. The architecture is a bit-parallel realization of the dynamic programming algorithm. Further, we obtain a new block processor array architecture that uses the fewest cells and time steps to solve the LLCS problem. Finally, we analyze the performance of the proposed processor array implementations on Pentium processors. The rest of this paper is organised as follows: Section 2 presents the recent bit-level LCS algorithm, which consists of two phases, i.e. preprocessing and computation. In Section 3, we derive the new linear processor array for the computation phase. In Section 4, we give the implementation of the preprocessing phase on the same processor array. In Section 5, we present the block processor array architecture for the LLCS problem. In Section 6, we compare the proposed processor arrays with other previous designs. In Section 7, we present the performance analysis and the performance estimates of the processor array implementations. Section 8 concludes this paper.
2. A bit-level algorithm for longest common subsequence
A bit-level sequential algorithm for the LCS problem, derived from [4, 5], is now presented. The algorithm has two phases: the preprocessing phase and the LCS computation phase.
2.1. Preprocessing phase
During the preprocessing phase the string x is encoded in a bit-level memory map M of (m × |Σ|) bits, where |Σ| is the size of the alphabet. The memory map can be seen as a two-dimensional bit-level array where each row corresponds to a character of string x and each column to a character of the alphabet. Therefore, column M^T_j, for 1 ≤ j ≤ |Σ|, holds the information of the j-th character of the alphabet, which will be denoted as σ_j. The column M^T_j can be seen as a bit-level vector of m bits where the i-th bit, for 1 ≤ i ≤ m, i.e. M_{i,j}, holds information concerning the j-th character σ_j and the i-th position of string x. The basic information that can be recorded is whether the j-th character of the alphabet is the i-th character of string x, that is, whether x_i = σ_j. We also define the bit-level memory map MN as the negation of M. The preprocessing phase constructs the memory maps M and MN, and the algorithm can be expressed in terms of the following piece of pseudocode.
for j = 1 to |Σ| do
    for i = 1 to m do
        M_{i,j} ← 0
        MN_{i,j} ← 1
        if x_i = σ_j then M_{i,j} ← 1 and MN_{i,j} ← NOT M_{i,j}

Taking as reference alphabet the UNICODE character set, it is straightforward to calculate the memory requirements of the bit map M. Thus |Σ| = 64K and therefore the memory map of an m character string x requires 16m Kbytes. The overall space requirements are m|Σ|/16 bytes, assuming that a character is encoded in two bytes. Further, the preprocessing phase can be performed in approximately m|Σ| steps. The algorithm performance can be improved if specialised addressing is introduced such that a character is mapped directly to the appropriate column of the memory map. Such an addressing can take the form of a mapping function of a character ch of an alphabet Σ, map(ch, Σ), returning the column number of the memory map [21]. The simplest, but rather inefficient, form of a mapping function is a linear search:

index ← map(ch, Σ) ::= {
    index ← 0
    for j = 1 to |Σ| do
        if ch = σ_j then index ← j
}

However, it is evident that much more efficient mapping functions can be achieved. Returning to the reference alphabet of the UNICODE character set, it is clear that such a mapping function can be easily produced in programmable hardware (an encoder), since the binary codes of UNICODE characters can be used as a simple ‘base-plus-offset’ mechanism for memory addressing. A similar solution can be achieved for an arbitrary alphabet, given its binary encoding. Generally it can be argued that the time complexity of the mapping operation is a function of Σ, both in terms of its size |Σ| and in terms of its binary encoding. Assuming a UNICODE-type binary encoding, it can be argued that the size in bits of a single character of an alphabet is log|Σ| and the area and time complexity of the mapping operation is that of an encoder with log|Σ| input and log|Σ| output bits.
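For illustration, a minimal C sketch of the two forms of the mapping function is given below (our code, not part of the original design; it assumes a byte-encoded alphabet, so that |Σ| ≤ 256 and the binary code of a character can itself serve as the column address):

#include <stddef.h>

/* Linear-search mapping, as in the pseudocode above: O(|Σ|) per character.
   sigma[] holds σ_1..σ_|Σ| (a hypothetical in-memory representation). */
int map_linear(char ch, const char *sigma, size_t sigma_size)
{
    int index = 0;
    for (size_t j = 0; j < sigma_size; j++)
        if (ch == sigma[j])
            index = (int)j + 1;   /* 1-based column number, 0 = not found */
    return index;
}

/* 'Base-plus-offset' mapping for a byte-encoded alphabet: the character
   code is itself the column address, the software analogue of the
   hardware encoder discussed above. */
static inline int map_direct(unsigned char ch) { return (int)ch; }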
2.2. LLCS and LCS computation phase
A simple classical solution to the LLCS computation phase, which computes the length of an LCS of strings x and y, is the basic dynamic programming algorithm. Let L_{0..m,0..n} be the dynamic programming matrix, where L_{i,j} represents the length of the LCS of x_1...x_i and y_1...y_j, 1 ≤ i ≤ m, 1 ≤ j ≤ n. Matrix L can be computed with the following piece of pseudocode [7]:

for i = 0 to m do L_{i,0} ← 0
for j = 1 to n do
    L_{0,j} ← 0
    for i = 1 to m do
        if x_i = y_j then L_{i,j} ← L_{i−1,j−1} + 1
        else L_{i,j} ← max(L_{i,j−1}, L_{i−1,j})

In order to recover an LCS of strings x and y we use the dynamic programming matrix L. Let LCS_{0..m,0..n} be the dynamic programming matrix, where LCS_{i,j} represents the LCS of the two strings x_1...x_i and y_1...y_j, 1 ≤ i ≤ m, 1 ≤ j ≤ n. Matrix LCS can be computed with the following pseudocode:

for i = 0 to m do LCS_{i,0} ← ε
for j = 1 to n do
    LCS_{0,j} ← ε
    for i = 1 to m do
        if L_{i,j} = L_{i,j−1} then LCS_{i,j} ← LCS_{i,j−1}
        else if x_i = y_j and L_{i,j} = 1 + L_{i−1,j} then LCS_{i,j} ← LCS_{i−1,j} x_i
        else LCS_{i,j} ← LCS_{i−1,j}

where ε denotes the empty sequence. The pseudocode which computes the matrix L can be combined with the pseudocode that computes the matrix LCS in only one phase.

Crochemore et al. [4, 5] presented a new way to compute the length of the LCS of two input strings by using bit-vector operations, which is really fast in practice. The monotonicity property of the L matrix allows us to store each column of L using bit-vectors. We denote by ∆L and c the relative encoding of the table L as follows: ∆L_{i,j} = L_{i,j} − L_{i−1,j}, c_{i,j} = L_{i,j} − L_{i,j−1} and 0 ≤ ∆L_{i,j}, c_{i,j} ≤ 1 for 1 ≤ i ≤ m and 1 ≤ j ≤ n. We also define ∆LN as the negation of ∆L (∆LN = NOT ∆L). The LLCS computation phase then takes the form of a series of three bit-wise operations (AND, OR and AND) and an addition that depends on the previous column ∆LN_{i,j−1} and on the bit-level memory maps (M and MN) for the match points in the current column. For the LCS computation phase we rewrite the pseudocode which computes the matrix LCS using the definitions of ∆L, c and M. We also define LCS_{i,j} as the binary representation of m bits for the LCS of the two strings x_1...x_i and y_1...y_j (1 ≤ i ≤ m and 1 ≤ j ≤ n) as follows: bit position p of LCS_{i,j} takes the value 1 if the p-th character of string x is a member of the LCS and the value 0 otherwise. Using the same pseudocode notation, the LLCS and LCS computation phase is given below:

for i = 1 to m do
    ∆LN_{i,0} ← 1
    LCS_{i,0} ← 0^m
for j = 1 to n do
    k ← map(y_j, Σ)
    c_{0,j} ← 0
    LCS_{0,j} ← 0^m
    for i = 1 to m do
        ∆LN_{i,j} ← ((c_{i−1,j} + ∆LN_{i,j−1} + (∆LN_{i,j−1} AND M_{i,k})) mod 2) OR (∆LN_{i,j−1} AND MN_{i,k})
        c_{i,j} ← (c_{i−1,j} + ∆LN_{i,j−1} + (∆LN_{i,j−1} AND M_{i,k})) div 2
        if c_{i,j} = 0 then LCS_{i,j} ← LCS_{i,j−1}
        else if M_{i,k} = 1 and NOT ∆LN_{i,j} = 1 then LCS_{i,j} ← LCS_{i−1,j} OR 0^{i−1} M_{i,k} 0^{m−i}
        else LCS_{i,j} ← LCS_{i−1,j}

The last row of c, i.e. c_{m,j} for 1 ≤ j ≤ n, gives the length of the LCS: the LLCS is the number of 1's found in that row of c. The element LCS_{m,n} gives the LCS of the two input strings. The space requirements are mn/2 bytes because two bits are required for ∆LN and c respectively. Finally, this computation is performed in approximately O(nm) steps (for m ≤ n).
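When m is at most the word size, the inner loop above collapses into a few word-wide operations, because the carry column c_{0..m,j} is exactly the carry propagation of an ordinary binary addition. The following C sketch of the LLCS computation (our illustration of the recurrence, assuming m ≤ 63, a byte alphabet and the GCC/Clang popcount builtin) makes this concrete:

#include <stdint.h>

/* Word-parallel LLCS following the bit-vector recurrence above.
   Bit i-1 of a word holds the value for row i, so adding (V & X) to V
   propagates the column of carries c in a single machine addition. */
unsigned llcs_bitvector(const char *x, unsigned m, const char *y, unsigned n)
{
    uint64_t M[256] = {0};                 /* preprocessing: match bitmaps */
    for (unsigned i = 0; i < m; i++)
        M[(unsigned char)x[i]] |= 1ULL << i;

    uint64_t mask = (1ULL << m) - 1;
    uint64_t V = mask;                     /* ∆LN column, initially all 1s */
    for (unsigned j = 0; j < n; j++) {
        uint64_t X = M[(unsigned char)y[j]];
        /* (sum mod 2) OR (∆LN AND MN), for all m rows at once */
        V = ((V + (V & X)) | (V & ~X)) & mask;
    }
    /* ∆L = NOT ∆LN holds exactly LLCS one-bits in the final column. */
    return m - (unsigned)__builtin_popcountll(V);
}

On the example of the next subsection, llcs_bitvector("abccb", 5, "bcabcb", 6) returns 4.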
2.3. Example
An example of the bit-vector algorithm is now presented. The example alphabet is Σ = {a, b, c} with |Σ| = 3. The example string x is the string “abccb” with m = 5 and y is the string “bcabcb” with n = 6. The bit-level memory maps M and MN are given in Table 1. Table 2 shows the bit-level matrices ∆LN and c. The last row of the result vector c has 6 elements in total; four of them are 1, so the length of the LCS of the two given strings x and y is 4. Table 3 shows the bit-level matrix LCS for recovering an LCS of the two strings x and y. An LCS is obtained from the element LCS_{5,6} of the matrix LCS, which is 11101 and corresponds to the LCS string “abcb”.

Table 1. Bit-level maps M and MN for x = abccb

      M             MN
      a  b  c       a  b  c
  a   1  0  0   a   0  1  1
  b   0  1  0   b   1  0  1
  c   0  0  1   c   1  1  0
  c   0  0  1   c   1  1  0
  b   0  1  0   b   1  0  1

Table 2. LLCS computation phase

  ∆LN  ε  b  c  a  b  c  b
   a   1  1  1  0  0  0  0
   b   1  0  0  1  0  0  0
   c   1  1  0  0  1  0  0
   c   1  1  1  1  1  1  1
   b   1  1  1  1  0  1  0

   c      b  c  a  b  c  b
   0      0  0  0  0  0  0
   a      0  0  1  0  0  0
   b      1  0  0  1  0  0
   c      1  1  0  0  1  0
   c      1  1  0  0  1  0
   b      1  1  0  1  0  1

Table 3. LCS computation phase

  LCS  ε      b      c      a      b      c      b
   ε   00000  00000  00000  00000  00000  00000  00000
   a   00000  00000  00000  10000  10000  10000  10000
   b   00000  01000  01000  01000  11000  11000  11000
   c   00000  01000  01100  01100  01100  11100  11100
   c   00000  01000  01100  01100  01100  11100  11100
   b   00000  01000  01100  01100  01101  01101  11101
3. A linear processor array for the LLCS and LCS computation phase
Initially, the implementation of the computation phase of the LCS algorithm is discussed, since it imposes the main computation load (usually m ≪ n). Figure 1 shows the data-dependence graph and the parallel timing diagram for the LCS algorithm. Each node (i, j) (1 ≤ i ≤ m, 1 ≤ j ≤ n) of the graph stores the character x_i of string x and the character y_j of the input string y. Also, the entire row i of the bit-level memory maps M and MN for the character x_i is assigned to the same node. To calculate an element of the bit-vector c_{i,j} (1 ≤ i ≤ m, 1 ≤ j ≤ n) and the elements of the dynamic programming matrices ∆LN_{i,j} and LCS_{i,j} (1 ≤ i ≤ m, 1 ≤ j ≤ n), we need to receive as inputs the four previously calculated values c_{i−1,j}, ∆LN_{i,j−1}, LCS_{i−1,j} and LCS_{i,j−1}, as indicated in Figure 1(b). Moreover, we suppose that each processor is responsible for computing one row of the dynamic programming matrices ∆LN, c and LCS of the graph. The same graph of Figure 1(a) also presents the parallel timing diagram: the nodes which lie on the same diagonal (from bottom-left to top-right) can be computed concurrently, as shown by the dotted lines of Figure 1(a).

Figure 1. (a) Dependence graph and parallel timing diagram for the LCS and (b) Computations.

We transform the original dependence graph of Figure 1(a) into the local dependence graph shown in Figure 2(a), so that the characters of string y flow via local edges, while the vertical and horizontal axes of the graph represent the time steps and the characters of string x respectively. The area complexity of the algorithm is defined as the number of elementary cells (or Processing Elements, PEs) that are active at any time step. This information is given by the number of columns of the dependence graph, whereas the number of time steps required is given by the number of rows of the graph. The linear processor array is derived by projecting along the time axis of the dependence graph of Figure 2(a); it is shown in Figure 2(b) for a generic problem with m = 4 and for arbitrary n and |Σ|. It is a linear array consisting of m cells connected to each other via three communication channels: one transfers the binary representation of the characters of string y and the other two transfer the bit-level results c and LCS. Each row M_i, 1 ≤ i ≤ m, of the bit-level memory map is preloaded into a PE of the one-way linear array, so that applying the specialised addressing function map(ch, Σ) produces a single bit per PE. For the implementation of the map(ch, Σ) function we introduce programmable hardware (a decoder) in each cell. Further, each cell is allocated three registers LN, aux and LCS, which respectively store the current value ∆LN_{i,j}, the previous value ∆LN_{i,j−1} and the previous binary string LCS_{i,j−1} after the processing of y_j. Each PE performs a full step of the LCS computation, as shown on the right of Figure 2(b). Therefore, each PE calculates the new values ∆LN_{i,j} and LCS_{i,j}, which are stored in the registers LN and LCS respectively, while the result c_{i,j} is updated and sent towards the next cell. The value of the register LCS is also sent towards the next cell. The characters of string y are transmitted from left to right in the same manner as the bit-level results, without intermediate delay between cells. The overall computation time is n + m − 1 time steps. Given the fact that usually m ≪ n, it can be argued that the computation time is approximately n steps. The area required is m PEs. Taking the example of the UNICODE character set, the local memory requirements are 16 Kbytes per cell.

Figure 2. (a) Graph transformation and (b) Processor array.
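To make the cell operations concrete, the following C model of one PE of Figure 2(b) is given (a software sketch under the assumptions of a byte alphabet and m ≤ 64, with names of our choosing; it is not a hardware description):

#include <stdint.h>

/* Software model of one cell. Mrow[] plays the role of the preloaded row
   M_i of the memory map (MN_i is its negation); LN, aux and LCS are the
   three registers described above. Bit p-1 of an LCS word is the flag
   for character x_p. */
typedef struct {
    uint8_t  Mrow[256]; /* row M_i, indexed by the character code (decoder) */
    uint8_t  LN;        /* current ∆LN_{i,j}                                */
    uint8_t  aux;       /* previous ∆LN_{i,j-1}; initialise to 1            */
    uint64_t LCS;       /* previous LCS_{i,j-1}; initialise to 0            */
    int      id;        /* cell id i (1-based)                              */
} Cell;

/* One time step of cell i: consume y_j, c_{i-1,j} and LCS_{i-1,j} from the
   left, update the registers, emit c_{i,j}; *lcs_out carries LCS_{i,j}. */
uint8_t cell_step(Cell *pe, unsigned char yj, uint8_t c_in, uint64_t lcs_in,
                  uint64_t *lcs_out)
{
    uint8_t Mik  = pe->Mrow[yj];           /* map(y_j, Σ) selects one bit  */
    uint8_t MNik = (uint8_t)!Mik;
    uint8_t sum  = (uint8_t)(c_in + pe->aux + (pe->aux & Mik));
    pe->LN = (uint8_t)((sum & 1u) | (pe->aux & MNik));
    uint8_t c_out = (uint8_t)(sum >> 1);

    uint64_t lcs;
    if (c_out == 0)
        lcs = pe->LCS;                             /* LCS_{i,j-1}          */
    else if (Mik && !pe->LN)
        lcs = lcs_in | (1ULL << (pe->id - 1));     /* LCS_{i-1,j} plus x_i */
    else
        lcs = lcs_in;                              /* LCS_{i-1,j}          */

    pe->aux = pe->LN;   /* current ∆LN becomes the previous one            */
    pe->LCS = lcs;
    *lcs_out = lcs;
    return c_out;
}

Feeding c_{0,j} = 0 and LCS_{0,j} = 0 into the first cell and chaining m such cells reproduces the column order of the pseudocode in Section 2.2.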
4. An implementation of the preprocessing phase
In this section the implementation of the preprocessing phase, i.e. the construction of the bit-level memory map M, is discussed. The aim is to use the same processor array architecture as in the previous section in order to produce the bit-level memory map with its elements allocated to the appropriate PEs. The operations that should be performed by the preprocessing phase are essentially writing operations to the appropriate row M_i, 1 ≤ i ≤ m, of the bit-level memory map M that is allocated to the i-th PE. Since each memory location
is a single bit, the writing operation consists of setting this bit either to 0 or to 1. Therefore, we use one control bit for setting a memory map bit to 1. In order to implement the preprocessing phase on the linear processor array of Figure 2(b), the following assumptions are made. First, each augmented character x_i of string x consists of one bit string of 16 bits which corresponds to the binary code of the character belonging to Σ. Second, it is assumed that the channel y is used for transferring the character codes at a rate of a single character per transfer step, while the result channel is wide enough to carry the control bit. Third, it is assumed that the alphabet Σ, its encoding and its size |Σ| are preloaded into the PEs. Fourth, the maximum length of string x is equal to the processor array size m and this is known to the system. Finally, the i-th PE should be aware of its position in the array, that is, it should have a cell id equal to i, as numbered in Figure 2(b). The preprocessing phase starts with a reset signal through the control input, and then each cell counts the loading steps, that is, j steps for the j-th PE. Therefore, each cell executes the following piece of pseudocode for the construction of its row of the bit-level memory map M:

bit ← 1
lo ← map(σ, Σ), hi ← map(σ, Σ)
for i = map(0, Σ) to lo − 1 do M_i ← NOT bit
for i = lo to hi do M_i ← bit
for i = hi + 1 to map(|Σ|, Σ) do M_i ← NOT bit

There are |Σ| writing operations to consecutive single-bit locations in the row of the bit-level memory map M. The lowest address of the local memory is denoted by map(0, Σ), whereas the highest address is denoted by map(|Σ|, Σ). All these memory locations are accessed once during the preprocessing phase, by means of three consecutive FOR loops. This keeps the computation time uniform for any type of preprocessing operation, that is, 2m loading plus |Σ| writing steps. The completion of the preprocessing phase enables the commencement of the LCS computation phase. From the above description, the preprocessing phase can be implemented in the same PE that performs the LCS computation phase with the addition of limited programmable hardware. Therefore, the whole systolic algorithm can be performed on a special purpose processor array.
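A C rendering of this per-cell pseudocode might look as follows (a sketch reusing the hypothetical Cell model of Section 3 and a byte alphabet, so that the direct mapping gives lo = hi = map(σ, Σ)):

/* Construct row M_i of the memory map inside the i-th cell; sigma is the
   character x_i that was loaded into this cell during the loading steps.
   The three consecutive loops mirror the pseudocode above and keep the
   number of writes fixed at |Σ| = 256 for every cell. */
void cell_preprocess(Cell *pe, unsigned char sigma)
{
    uint8_t bit = 1;
    int lo = (int)sigma, hi = (int)sigma;  /* map(σ, Σ), direct addressing */
    for (int k = 0; k < lo; k++)       pe->Mrow[k] = (uint8_t)!bit;
    for (int k = lo; k <= hi; k++)     pe->Mrow[k] = bit;
    for (int k = hi + 1; k < 256; k++) pe->Mrow[k] = (uint8_t)!bit;
}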
5. A block processor array architecture
In this section, we propose a block processor array architecture for solving the LLCS problem. The block architecture is derived by the following approach: we partition the area axis of the graph of Figure 2(a), i.e. the string x, into blocks of size w. Since the area of the graph is m, we partition it into b = m/w blocks so that each block has w nodes. Suppose for simplicity that w divides m evenly. All nodes inside the same block are sequentially executed by the same physical processor and there are no communication requirements.
The only communication overhead is imposed by inter-block message passing, since all blocks are executed in parallel. Of course, there are now larger local memory requirements, but the size of memory in distributed architectures is large enough and accesses to local memory within the same block are fast. Therefore, a linear block processor array is derived by projecting along the time axis of the graph of Figure 2(a) with blocks of size w. The processor array consists of b processors connected to each other via two communication channels, as in the previous architecture, with the difference that the third channel, which carries an LCS, is not needed. A block of w rows of the bit-level memory map M is allocated to each processor. Further, each processor can perform the w nodes of a block in O(1)-time computation using the bit-parallelism technique. In this case, each processor uses word registers of w bits and updates the w bit-level values in a single operation. The overall computation time is n + b − 1 time steps. The area required is b PEs.
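As a concrete illustration of the O(1) block computation, the following C sketch (ours; it assumes w = 64 and the GCC/Clang unsigned __int128 type) updates one block of w rows of the ∆LN column in a single operation, with the carry bit c entering from the previous block and leaving towards the next:

#include <stdint.h>

/* Update one block of w = 64 rows of the ∆LN column. V is this processor's
   word of ∆LN_{.,j-1}, X its word of the memory map column M_{.,k};
   *carry is c flowing in from the block below and out to the block above
   (the inter-block channel of the array). */
static inline uint64_t block_step(uint64_t V, uint64_t X, unsigned *carry)
{
    unsigned __int128 sum = (unsigned __int128)V + (V & X) + *carry;
    *carry = (unsigned)(sum >> 64);        /* carry into the next block   */
    return ((uint64_t)sum) | (V & ~X);     /* new ∆LN word for this block */
}

Since each of the b processors applies one such step per character of y, a whole column costs O(1) time per processor, consistent with the overall n + b − 1 step count.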
6. Comparison with other processor array designs
In this section, we compare the proposed processor array designs for the LCS problem with other processor arrays previously presented. The important differences of the first architecture in contrast to the other array implementations are as follows. First, in the basic dynamic programming algorithm described in [7], the matrix elements can assume large values when long strings are compared. Most applications of the longest common subsequence require fairly long strings to be compared; for instance, DNA sequences are typically several million bases long and protein molecules contain thousands of aminoacids. Many processor arrays [11–15, 19, 26, 30] that have been proposed in the literature are based on a parallel implementation of the dynamic programming algorithm [7] and require each cell to add and compare relatively large data values, on the order of log(n) bits for strings of length n. Further, the widths of the communication channels required to exchange data between adjacent cells are very large. In contrast, our architecture, based on the bit-level algorithm, minimizes the number of bits required to represent any element in the matrix and performs longest common subsequence computations for arbitrarily long strings. It also minimizes the data flow between adjacent cells. Our architecture requires each cell to perform simple bit-level arithmetic and logical operations, as opposed to the comparison operations used in [11–15, 19, 26, 30]. The advantage of the bit-level arithmetic and logical operations is that they execute faster than the comparison operations. However, our architecture uses in total two comparison operations, which are used for recovering an LCS. Another important difference of the first processor array presented herein, as opposed to the systolic designs already proposed, is the VLSI implementation of the encoding scheme, i.e. the introduction of the encoder and the bit-level memory modules. This encoding scheme enables the efficient implementation of the new bit-level LCS algorithm. Furthermore, the first processor array design needs two registers in each cell and two communication channels for computing the length of the LCS, as opposed to the three registers and two channels used in [14, 15] and the two registers and three channels used in
[11]. Also, our processor design requires one register in each cell and one communication channel that are wide enough to store and carry the m bits of the LCS respectively, instead of m registers and m channels as in [11, 12, 14, 15]. Therefore, our architecture uses in total three registers and channels, compared to the previous arrays that need in total m + 2 registers and m + 3 channels as in [11], and m + 3 registers and m + 2 channels as in [14, 15]. Requiring fewer communication channels to transfer fewer data is advantageous when our architecture is implemented on programmable processor arrays, i.e. DSP or transputer based computing systems. As the physical links between cells are limited in number, our architecture consumes less time in communicating fewer data. The second processor array architecture requires the fewest cells and time steps among all known processor designs for solving the LLCS problem. Further, it can handle very long strings without additional partitioning or folding techniques. Moreover, our second architecture achieves similar time and area complexity measures as the architectures of [16–18] without additional complexity. In other words, we simply introduce two registers of w bits in each cell in order to compute a block of w bit-level values, corresponding to w characters of string x, in a simple operation. In contrast, [16–18] introduce many registers in each cell (i.e. m⌈m/r⌉ + 2⌈m/r⌉ + 3, where r is the number of cells [16]) and the cell operations for computing a block of characters are complex. Finally, the block-based architecture is better suited to implementation on multicomputers, i.e. cluster based computing systems, because these systems provide full processors and memories that can perform the bit-level code of the block computation fast and without additional overhead. Also, the second architecture may run faster on a multicomputer since it uses fewer time steps.
7. Performance analysis
We analyze the performance of the proposed processor array implementations introduced in Sections 3 and 5 on Pentium processors. Two performance models are introduced: the first model analyzes the performance of the processor array designs with one processor and the second model analyzes the performance on a linear array of processors. The following notations are used in this section:

• n_prep, the number of operations required by the preprocessing phase,
• n_lcscomp, the number of operations required by the LCS computation phase,
• n_llcscomp, the number of operations required by the LLCS computation phase,
• n_comm, the number of data items transferred,
• t_prep, the average time to perform one preprocessing step,
• t_lcscomp, the average time to perform one LCS computation step,
• t_llcscomp, the average time to perform one LLCS computation step,
• t_comm, the average time for data transfers, and
• w, the number of bits in a computer word of the processor.
The parameters n_prep, n_lcscomp, n_llcscomp and n_comm depend on the algorithm and on the mapping method, whereas the parameters t_prep, t_lcscomp, t_llcscomp, t_comm and w depend on the capabilities of the application-specific array processors.
The basic idea of the mapping of the bit-level processor array architecture onto a real parallel system is similar to the idea introduced in Section 5. In other words, the idea is to perform with the bit-parallelism technique all the bit-level computations required by the LLCS and LCS computation phases. This technique takes advantage of the intrinsic parallelism of the bit operations inside a word of the processor. That is, we pack w bit-level values in a single word and update them all in a single operation.
7.1. Performance analysis on a single processor
The execution time for computing the LCS and LLCS algorithms on a single processor can be broken up into two terms:

• T_prep includes the time spent executing the preprocessing phase. The total number of operations required to execute the preprocessing phase is approximately m|Σ| steps when the size of the string x fits in a computer word of the processor. In practice, this rarely happens; when string x is longer, ⌈m/w⌉ passes of w bits each are required. In general, the preprocessing time is given by:

    T_prep = n_prep t_prep = ⌈m/w⌉ w|Σ| t_prep    (1)

• T_lcscomp includes the computation time to run the LLCS and LCS algorithms concurrently on a single processor. The total number of operations required to execute the LLCS and LCS algorithms is mn steps if the size of string x is less than w. In general, the computation time is given by:

    T_lcscomp = n_lcscomp t_lcscomp = n ⌈m/w⌉ w t_lcscomp    (2)

When we run only the LLCS algorithm on a processor, the number of operations required is n steps if m ≤ w. Therefore, the LLCS computation time, T_llcscomp, is given by:

    T_llcscomp = n_llcscomp t_llcscomp = n ⌈m/w⌉ t_llcscomp    (3)

The sequential execution time for solving the LLCS and LCS algorithms concurrently is the summation of the two terms and is given by:

    T_seq = T_prep + T_lcscomp    (4)
          = n_prep t_prep + n_lcscomp t_lcscomp    (5)
          = ⌈m/w⌉ w|Σ| t_prep + n ⌈m/w⌉ w t_lcscomp    (6)

The execution time for solving the LLCS problem is given by:

    T_seq = T_prep + T_llcscomp    (7)
          = n_prep t_prep + n_llcscomp t_llcscomp    (8)
          = ⌈m/w⌉ w|Σ| t_prep + n ⌈m/w⌉ t_llcscomp    (9)
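For concreteness, Equations (6) and (9) can be evaluated directly; the following C helpers (a convenience sketch of ours, with parameter names taken from the notation list above and sigma_size standing for |Σ|) compute the sequential estimates used later in Section 7.3:

#include <math.h>

/* Estimated sequential time of Equation (6): LLCS and LCS concurrently. */
double t_seq_lcs(double n, double m, double w, double sigma_size,
                 double t_prep, double t_lcscomp)
{
    double k = ceil(m / w);                                  /* ⌈m/w⌉ */
    return k * w * sigma_size * t_prep + n * k * w * t_lcscomp;
}

/* Estimated sequential time of Equation (9): LLCS only. */
double t_seq_llcs(double n, double m, double w, double sigma_size,
                  double t_prep, double t_llcscomp)
{
    double k = ceil(m / w);
    return k * w * sigma_size * t_prep + n * k * t_llcscomp;
}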
7.2. Performance analysis on a linear array with ⌈m/w⌉ processors
The overall execution time of the proposed processor array implementation for the LCS and LLCS algorithms can be broken up into three terms:

• T_prep is the preprocessing time to load the string x and to construct the bit-level memory map M. The total number of operations required by the preprocessing phase is 2⌈m/w⌉ + w|Σ| steps. Then, the preprocessing time on an array of processors is given by:

    T_prep = n_prep t_prep = (2⌈m/w⌉ + w|Σ|) t_prep    (10)

• T_lcscomp is the computation time to execute the LLCS and LCS algorithms concurrently on an array of processors. The total number of operations required by the computation phase is (n + ⌈m/w⌉ − 1)w steps. Therefore, the computation time is given by:

    T_lcscomp = n_lcscomp t_lcscomp = (n + ⌈m/w⌉ − 1) w t_lcscomp    (11)

When we execute the LLCS algorithm on an array of processors, the number of operations required is (n + ⌈m/w⌉ − 1) steps. Therefore, the LLCS computation time, T_llcscomp, is given by:

    T_llcscomp = n_llcscomp t_llcscomp = (n + ⌈m/w⌉ − 1) t_llcscomp    (12)

• T_comm includes the communication time to transfer the characters and the bit-level results between processors. The total number of data transfers which is not overlapped with the computation phase is n + ⌈m/w⌉ − 1 steps. Then, the communication time is given by:

    T_comm = n_comm t_comm = (n + ⌈m/w⌉ − 1) t_comm    (13)

The time, t_comm, of one data transfer is dictated by the time of six communications: three communications from the left and three communications to the right. The size of each data transfer is 1 byte. Therefore, the average time for one data transfer requires at least:

    t_comm = 6(α + β)    (14)

The term α is the startup time, which is assumed to be independent of the message length, and β is the data transmission time per word. The execution time for executing the LLCS and LCS processor array implementation concurrently is the summation of the three terms and is given by:

    T_p-array = T_prep + T_lcscomp + T_comm    (15)
              = n_prep t_prep + n_lcscomp t_lcscomp + n_comm t_comm    (16)
              = (2⌈m/w⌉ + w|Σ|) t_prep + (n + ⌈m/w⌉ − 1) w t_lcscomp + (n + ⌈m/w⌉ − 1) t_comm    (17)
              = (2⌈m/w⌉ + w|Σ|) t_prep + (n + ⌈m/w⌉ − 1)(w t_lcscomp + t_comm)    (18)

The execution time for solving the LLCS problem is given by:

    T_p-array = T_prep + T_llcscomp + T_comm    (19)
              = n_prep t_prep + n_llcscomp t_llcscomp + n_comm t_comm    (20)
              = (2⌈m/w⌉ + w|Σ|) t_prep + (n + ⌈m/w⌉ − 1) t_llcscomp + (n + ⌈m/w⌉ − 1) t_comm    (21)
              = (2⌈m/w⌉ + w|Σ|) t_prep + (n + ⌈m/w⌉ − 1)(t_llcscomp + t_comm)    (22)
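The array-model estimates of Equations (18) and (22) can be coded in the same style (a sketch continuing the helpers of Section 7.1, with t_comm = 6(α + β) from Equation (14)):

/* Estimated processor-array time of Equation (18): LLCS and LCS together. */
double t_array_lcs(double n, double m, double w, double sigma_size,
                   double t_prep, double t_lcscomp, double t_comm)
{
    double b = ceil(m / w);                 /* number of processors ⌈m/w⌉ */
    return (2.0 * b + w * sigma_size) * t_prep
         + (n + b - 1.0) * (w * t_lcscomp + t_comm);
}

/* Estimated processor-array time of Equation (22): LLCS only. */
double t_array_llcs(double n, double m, double w, double sigma_size,
                    double t_prep, double t_llcscomp, double t_comm)
{
    double b = ceil(m / w);
    return (2.0 * b + w * sigma_size) * t_prep
         + (n + b - 1.0) * (t_llcscomp + t_comm);
}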
7.3. Validation of the performance model for a Pentium processor
In this subsection, we validate the performance model of Equations (6) and (9) on a single Pentium processor for solving the LCS and LLCS problems. In order to obtain the estimated results, we must determine the values of the parameters t_prep, t_lcscomp and t_llcscomp of the Pentium 4 1.5 GHz processor. The parameter t_prep is found by measuring the time of a sequential program which performs ⌈m/w⌉ w|Σ| operations for the preprocessing phase. Since the preprocessing time T_prep is equal to Equation (1), t_prep can be obtained easily in the following way:

    t_prep = T_prep / (⌈m/w⌉ w|Σ|)    (23)
In a similar way, the parameters t_lcscomp and t_llcscomp are found by measuring the sequential time for the LCS computation and LLCS computation phases, respectively. These values can be obtained easily as follows:

    t_lcscomp = T_lcscomp / (n ⌈m/w⌉ w)    (24)

    t_llcscomp = T_llcscomp / (n ⌈m/w⌉)    (25)
Consequently, the parameters t_prep, t_lcscomp and t_llcscomp which are used in our performance analysis for a single processor are 1.692E-08, 2.8279E-07 and 2.5474E-07 s, respectively. Tables 4 and 5 report execution times for various combinations of n and m on a Pentium 4 1.5 GHz for solving the LCS and LLCS problems, respectively. Tables 6 and 7 show, for various combinations of n and m on a Pentium 4 1.5 GHz, the estimated times obtained by Equations (6) and (9) for solving the LCS and LLCS problems, respectively. It must be noted that we made measurements for various values of n and m in order to examine the accuracy of the performance model against the experiments. We observe that the estimated results for solving the LCS and LLCS problems on a Pentium processor agree well with the experimental results. Therefore, our performance model for a single processor captures the computational behavior of the experimental measurements well. Finally, we observe that the maximal difference between estimated and experimental values is less than 3%.

Table 4. Execution times (in seconds) for solving the LCS problem on a Pentium 4 1.5 GHz

  n/m      256        512        1024       2048
  15 MB    1057.049   2119.054   4277.097   8585.065
  30 MB    2114.087   4238.086   8554.149   17170.042
  150 MB   10570.389  21190.344  42770.572  85849.852
  300 MB   21140.768  42380.666  85541.099  171699.614

Table 5. Execution times (in seconds) for solving the LLCS problem on a Pentium 4 1.5 GHz

  n/m      256        512        1024       2048
  15 MB    995.575    1987.293   4027.120   8081.493
  30 MB    1991.139   3974.563   8054.195   16162.896
  150 MB   9955.648   19872.726  40270.794  80814.123
  300 MB   19911.284  39745.430  80541.543  161628.15

Table 6. Estimated times (in seconds) for solving the LCS problem on a Pentium 4 1.5 GHz

  n/m      256        512        1024       2048
  15 MB    1086.197   2172.395   4344.790   8689.580
  30 MB    2172.111   4344.222   8688.444   17376.889
  150 MB   10859.420  21718.839  43437.679  86875.359
  300 MB   21718.556  43437.112  86874.223  173748.447

Table 7. Estimated times (in seconds) for solving the LLCS problem on a Pentium 4 1.5 GHz

  n/m      256        512        1024       2048
  15 MB    978.485    1956.971   3913.942   7827.884
  30 MB    1956.687   3913.374   7826.748   15653.497
  150 MB   9782.300   19564.600  39129.199  78258.399
  300 MB   19564.316  39128.632  78257.263  156514.527
7.4. Validation of the performance model for a linear array of Pentium processors
In this subsection, we compare the experimental results with the estimated results of the processor array implementations on a linear array of Pentium processors for solving the LCS and LLCS problems. All experiments were performed on a cluster of eight Pentium 200 MHz processors networked via a switched 100 Mb/s Fast Ethernet, using MPI [29] version 1.2 as the message passing system. The values of the parameters α and β which are used to determine the communication time t_comm of Equation (14) are 7.74E-04 and 1.06E-07 s, respectively. These values were measured using the ping-pong test and were collected over some number of iterations (we have used 100) and various message sizes. Therefore, the communication time t_comm is 4.647E-03 s. When implemented on a multicomputer, such as a cluster of processors, the block processor array algorithm can be modified by using the broadcast operation as follows: in the first step, string x is broadcast to all processors and each processor uses w characters of string x. In the same first step, each processor executes the preprocessing phase to construct the bit-level memory map M. After that, a character of y is sent to the array at each following step and each processor performs the operations specified on the right of Figure 2(b) for the computation of a block of w characters. The broadcast operation was implemented by calling the MPI_Bcast routine. Further, the receive and send operations are accomplished by calling the MPI_Recv and MPI_Send routines, respectively. Table 8 reports execution times for solving the LCS problem for a string of size 300,000 characters on an array of Pentium 200 MHz processors. Table 9 reports the estimated times obtained by Equation (18) for solving the LCS problem. We observe that the estimated results for solving the LCS problem on a linear array of Pentium processors are quite close to the experimental results. The slight difference between the two is due to the fact that the estimated performance Equation (18) is dominated by the communication term when the size of the string is small enough. Therefore, our performance analysis for a linear array of processors should be correct. We cannot extend the experiments for an array of processors to larger string sizes and larger numbers of processors for two reasons. One reason is that the implementation of the processor array on a cluster of processors with a Fast Ethernet network incurs a high communication cost; in other words, the execution time of the processor array implementation is dominated by the communication time. The second reason is that our goal is the validation of the performance model on a workstation cluster, so that this model can be used to predict the execution time of the LCS and LLCS processor array implementations accurately on programmable processors using a fast communication network, such as DSP or transputer systems.

Table 8. Execution times (in seconds) for solving the LCS problem on an array of Pentium 200 MHz processors

  n/m           64         256
  300 KB        1376.261   996.123
  No. of procs  2          8

Table 9. Estimated times (in seconds) for solving the LCS problem on an array of Pentium 200 MHz processors

  n/m           64         256
  300 KB        1396.819   1396.847
  No. of procs  2          8
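For reference, this communication pattern can be sketched in MPI C as follows (our simplified rendering, not the exact program used in the experiments; it assumes w ≤ 64, that the number of processes divides m evenly, and that block_step is the per-block update sketched in Section 5; the final reduction of the result is omitted):

#include <mpi.h>
#include <stdint.h>

extern uint64_t block_step(uint64_t V, uint64_t X, unsigned *carry);

/* Skeleton of the block processor array on a cluster: broadcast x, build
   the local slice of the memory map, then pipeline the characters of y
   through the chain of processes. Only process 0 reads y. */
void lcs_block_array(char *x, int m, const char *y, int n)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* First step: broadcast x and run the preprocessing phase locally. */
    MPI_Bcast(x, m, MPI_CHAR, 0, MPI_COMM_WORLD);
    int w = m / nprocs;                     /* w characters per process  */
    uint64_t M[256] = {0};
    for (int i = 0; i < w; i++)
        M[(unsigned char)x[rank * w + i]] |= 1ULL << i;

    /* Following steps: receive (y_j, carry) from the left neighbour,
       update the local ∆LN word, forward to the right neighbour. */
    uint64_t V = ~0ULL >> (64 - w);         /* w ones                    */
    for (int j = 0; j < n; j++) {
        unsigned buf[2];
        if (rank == 0) { buf[0] = (unsigned char)y[j]; buf[1] = 0; }
        else
            MPI_Recv(buf, 2, MPI_UNSIGNED, rank - 1, j, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        unsigned carry = buf[1];
        V = block_step(V, M[buf[0]], &carry);
        if (rank < nprocs - 1) {
            buf[1] = carry;
            MPI_Send(buf, 2, MPI_UNSIGNED, rank + 1, j, MPI_COMM_WORLD);
        }
    }
    /* The LLCS is m minus the total number of 1-bits across all local V
       words, collected with a reduction (omitted here). */
}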
8. Conclusions
Two new linear processor array architectures that improve on the previous arrays for solving the LCS problem have been presented in this paper. The first array is based on the bit-parallel realization of the dynamic programming matrices c and LCS, whereas the second array is based on the implementation of the bit-level dynamic programming algorithm in blocks. The first array requires m cells and takes n + m − 1 time steps, whereas the second array requires b cells and takes only n + b − 1 time steps to solve the LCS problem. In other words, our architectures achieve similar time and area complexity measures as the recent architectures of [14, 16–18] without additional hardware complexity. However, the proposed architectures improve on the previous processor arrays because they use m fewer registers and communication channels in each cell than what is required in [11, 14, 16–18]. In each step, every cell of the proposed architectures requires only three registers and communication channels. Moreover, our architectures require each cell to perform simple and fast bit-level operations in a time step. Therefore, the two processor array designs are better suited to VLSI and multicomputer implementation because they provide many advantages compared to the previous processor array architectures. Finally, we have introduced two performance models to analyze the performance of the processor array implementations on a single processor and on a linear array of processors, respectively.

Acknowledgments

The authors would like to thank the anonymous reviewers for many helpful comments and suggestions, which have greatly improved the presentation of this paper.

References

1. A. V. Aho, D. S. Hirschberg, and J. D. Ullman. Bounds on the complexity of the longest common subsequence problem. Journal of the Association for Computing Machinery, 23:1–12, 1976.
2. A. Apostolico, M. Atallah, L. Larmore, and S. Mcfaddin. Efficient parallel algorithms for string editing and related problems. SIAM Journal on Computing, 19:968–988, 1990.
3. L. Bergroth, H. Hakonen, and T. Raita. A survey of longest common subsequence algorithms. In 7th International Symposium on String Processing and Information Retrieval, pp. 39–48, 2000.
4. M. Crochemore, C. S. Iliopoulos, Y. J. Pinzon, and J. R. Reid. A fast and practical bit-vector algorithm for the longest common subsequence problem. In 11th Australasian Workshop on Combinatorial Algorithms, pp. 75–86, 2000.
5. M. Crochemore, C. S. Iliopoulos, Y. J. Pinzon, and J. R. Reid. A fast and practical bit-vector algorithm for the longest common subsequence problem. Information Processing Letters, 80:279–285, 2001.
6. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, England, 1997.
7. D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18:341–343, 1975.
8. D. S. Hirschberg. An information theoretic lower bound for the longest common subsequence problem. Information Processing Letters, 7:40–41, 1978.
9. R. Hughey. Parallel hardware for sequence comparison and alignment. CABIOS, 12:473–479, 1996.
10. S. Y. Kung. VLSI Array Processors. Prentice-Hall, 1988.
11. T. Lecroq, G. Luce, and J. F. Myoupo. A faster linear systolic algorithm for recovering a longest common subsequence. Information Processing Letters, 61:129–136, 1997.
12. T. Lecroq, J. F. Myoupo, and D. Seme. A one-phase parallel algorithm for the sequence alignment problem. Parallel Processing Letters, 8:515–526, 1998.
13. Y.-C. Lin. New systolic arrays for the longest common subsequence problem. Parallel Computing, 20:1323–1334, 1994.
14. Y.-C. Lin and J.-C. Chen. An efficient systolic algorithm for the longest common subsequence problem. The Journal of Supercomputing, 12:373–385, 1998.
15. Y.-C. Lin and J.-C. Chen. Another efficient systolic algorithm for the longest common subsequence problem. Journal of the Chinese Institute of Engineers, 23:607–613, 2000.
16. Y.-C. Lin and J.-W. Yeh. A scalable and efficient systolic algorithm for the longest common subsequence problem. Journal of Information Science and Engineering, 18:519–532, 2002.
17. Y.-C. Lin and J.-W. Yeh. Deriving a systolic algorithm for the LCS problem. In International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 1890–1897, 1998.
18. Y.-C. Lin and J.-W. Yeh. Deriving a fast systolic algorithm for the longest common subsequence problem. Parallel Algorithms and Applications, 17:1–18, 2002.
19. G. Luce and J. F. Myoupo. Systolic-based parallel architecture for the longest common subsequences problem. Integration, the VLSI Journal, 25:53–70, 1998.
20. M. Lu and H. Lin. Parallel algorithms for the longest common subsequence problem. IEEE Transactions on Parallel and Distributed Systems, 5:835–848, 1994.
21. K. G. Margaritis and D. J. Evans. A VLSI processor array for flexible string matching. Parallel Algorithms and Applications, 11:45–60, 1997.
22. W. J. Masek and M. S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20:18–31, 1980.
23. J. H. Moreno and T. Lang. Matrix Computations on Systolic-Type Arrays. Kluwer Academic Publishers, 1992.
24. N. Petkov. Systolic Parallel Processing. Elsevier Science Publishers, 1993.
25. N. Ranganathan and R. Sastry. VLSI architectures for pattern matching. International Journal of Pattern Recognition and Artificial Intelligence, 8:815–843, 1994.
26. Y. Robert and M. Tchuente. A systolic array for the longest common subsequence problem. Information Processing Letters, 21:191–198, 1985.
27. D. Sankoff and J. B. Kruskal. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.
28. J. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS, Boston, MA, 1997.
29. M. Snir, S. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra. MPI: The Complete Reference. The MIT Press, Cambridge, Massachusetts, 1996.
30. C.-B. Yang and R. C. T. Lee. Systolic algorithms for the longest common subsequence problem. Journal of the Chinese Institute of Engineers, 10:691–699, 1987.