High-Performance Systolic Arrays for Band Matrix Multiplication

Yun Yang†, Wenqing Zhao‡, and Yasuaki Inoue†

† Waseda University, Graduate School of Information, Production and Systems, Kitakyushu, Japan
[email protected], [email protected]
‡ Fudan University, ASIC & System State-Key-Lab, Microelectronics Department, Shanghai, P.R. China
[email protected]

Abstract— Band matrix multiplication is widely used in DSP systems. However, the traditional Kung-Leiserson systolic array for band matrix multiplication cannot be realized with high cell-efficiency. In this paper, three high-performance band matrix multiplication systolic arrays (BMMSA) are presented based on the ideas of "Matrix Compression" and "Super Pipelined". These new systolic arrays are realized by skillfully compressing the data matrices and carefully adjusting the operation sequence. The results show that the best of the three arrays keeps almost 100% of its processing elements (PE) busy in each step. These modifications also increase the operation speed: at best, only 1/3 of the original processing time is needed to complete the multiplication operation.

Index Terms— systolic array, band matrix multiplication, operation speed, cell-efficiency, parallel operation.
0-7803-8834-8/05/$20.00 ©2005 IEEE.

I. INTRODUCTION

Many digital signal-processing (DSP) systems use the multiplication operation, and the speed of multiplication usually limits the operation rate of the DSP and confines the application of the whole system. Therefore many researchers have studied multipliers and proposed algorithms, architectures, and technologies to improve them. The modified Booth multiplier is one important improvement [1], and pipelined architectures can also increase the speed and efficiency of the multiplication operation [2][3]. In addition, systolic arrays for band matrix multiplication have been introduced to meet the requirement of concurrent operation, in which many computations proceed synchronously [4][5]. However, the traditional Kung-Leiserson systolic array for band matrix multiplication can only use about one third of its cells in each step, so its cell-efficiency (the average number of cells working in each step) is not high. Consequently, many new systolic arrays have been proposed to increase the cell-efficiency, for example by reducing the multiplication period over the whole data matrix (throughput latency) and by increasing the percentage of processing elements active in each cycle (average utilization rate) [6][7]. In this paper, three high-performance band matrix multiplication systolic arrays (BMMSA) are presented based on the ideas of "Matrix Compression" and "Super Pipelined". We compress the data matrices and compute the results together. Because there are many more processing threads than in a common pipelined situation, we borrow the term from software science and call it "Super Pipelined". The basic organizations
of the three BMMSAs come from the traditional Kung-Leiserson BMMSA, and the new designs can be used for wide-banded matrices with high cell-efficiency [6]. The new systolic arrays presented in this paper enable us to construct a more compact input matrix by skillfully arranging and compressing the numbers in the data matrix. Further improvements, such as adding PEs and readjusting the operation sequence, are introduced to increase the cell-efficiency. These changes make the new BMMSAs more useful than the Kung-Leiserson BMMSA for the design of concurrent systems in the case of wide-banded matrices.

II. PROBLEM FORMULATION

The basic element of a BMMSA is a processing element (PE) with three data registers RA, RB and RC, shown in Fig. 1. In Fig. 1, A is the multiplicator, B is the multiplicand, and C is the accumulated product of A and B. The input-output relation of the multiplication cell is given by

\[ A = A, \qquad B = B, \qquad C = A \times B + C. \tag{1} \]

Fig. 1. Processing element.
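As a concrete illustration of Eq. (1), the following is a minimal Python sketch (not the authors' implementation) of the processing element of Fig. 1: A and B pass through unchanged, while C accumulates the product A × B.

```python
# Sketch of the Fig. 1 processing element per Eq. (1):
# A and B pass through; C accumulates A * B.
class PE:
    def __init__(self):
        self.ra = 0  # register RA (multiplicator)
        self.rb = 0  # register RB (multiplicand)
        self.rc = 0  # register RC (accumulated product)

    def step(self, a_in, b_in, c_in):
        """One cycle: latch inputs and emit A, B, and C + A*B."""
        self.ra, self.rb, self.rc = a_in, b_in, c_in + a_in * b_in
        return self.ra, self.rb, self.rc

pe = PE()
print(pe.step(3, 4, 5))  # (3, 4, 17): C_out = 5 + 3*4
```

Real PEs would latch on a clock edge; this sketch only models the arithmetic relation of Eq. (1).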
To construct the BMMSA, many processing elements are linked to compose a processor array; the PEs may be linearly, orthogonally, or hexagonally connected, as shown in Fig. 2. Through these connections we can realize vector and matrix computation [8].
Fig. 2. Connection methods: (a) linearly, (b) orthogonally, (c) hexagonally connected.
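To make the idea of computing on a connected array concrete, here is a hedged sketch (an illustration, not from the paper) of the simplest case, a linearly connected array as in Fig. 2(a) evaluating the matrix-vector product y = A·x: PE j holds x[j], and each partial sum of y[i] flows through the chain, gathering a[i][j]·x[j] at PE j. Only the logical data flow is modeled, not cycle-accurate timing.

```python
# Linearly connected array (Fig. 2a) computing y = A x.
# PE j stores x[j]; a partial sum flows through the chain of PEs.
def linear_array_matvec(A, x):
    n = len(x)
    y = []
    for row in A:             # one matrix row streams through per pass
        acc = 0               # partial sum entering the first PE
        for j in range(n):    # PE j adds its contribution row[j]*x[j]
            acc += row[j] * x[j]
        y.append(acc)         # completed sum leaves the last PE
    return y

print(linear_array_matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```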
Consider the multiplication of matrix A and matrix B. We can expand the matrix elements and compute the corresponding results as follows:

\[ [A] \cdot [B] = [C], \tag{2} \]

\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots &        & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
\cdot
\begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1n}\\
b_{21} & b_{22} & \cdots & b_{2n}\\
\vdots & \vdots &        & \vdots\\
b_{n1} & b_{n2} & \cdots & b_{nn}
\end{pmatrix}
=
\begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1n}\\
c_{21} & c_{22} & \cdots & c_{2n}\\
\vdots & \vdots &        & \vdots\\
c_{n1} & c_{n2} & \cdots & c_{nn}
\end{pmatrix}. \tag{3}
\]

Then each element of the product C is

\[ c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}. \tag{4} \]

In large-scale scientific computation, sparse matrices are widely used, and many sparse matrices can be transformed into band matrices. Let matrix A and matrix B be band matrices with bandwidths W_A and W_B. Then the product matrix C is also a band matrix, whose bandwidth W_C is given by [8][9]

\[ W_C = W_A + W_B - 1. \tag{5} \]
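Eq. (5) can be checked numerically. The following sketch (an illustration under the paper's definition of bandwidth as the number of occupied diagonals) multiplies two tridiagonal matrices (W = 3) and confirms that the product has bandwidth 3 + 3 − 1 = 5.

```python
# Verify Eq. (5): W_C = W_A + W_B - 1 for band matrices.
# bandwidth() counts the span of non-zero diagonals of a square matrix.
def bandwidth(M):
    n = len(M)
    diags = {j - i for i in range(n) for j in range(n) if M[i][j] != 0}
    return max(diags) - min(diags) + 1

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Tridiagonal example: W_A = W_B = 3, so W_C should be 3 + 3 - 1 = 5.
n = 6
A = [[1 if abs(i - j) <= 1 else 0 for j in range(n)] for i in range(n)]
C = matmul(A, A)
print(bandwidth(A), bandwidth(C))  # 3 5
```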
Suppose that matrix A and matrix B both have bandwidth 4; then the bandwidth of the output matrix C is 4 + 4 − 1 = 7. The traditional Kung-Leiserson BMMSA arranges the processing elements regularly in a hexagonally connected architecture. The input data are fed in step by step, and the output values emerge from the top of the hexagonally connected array, as shown in Fig. 3. For other bandwidths, the size of the hexagonally connected array can easily be changed in the same way to meet the requirement of the multiplier.

Fig. 3. Pipelined configuration of the Kung-Leiserson BMMSA.

The problem of the traditional Kung-Leiserson BMMSA lies in the utilization of its processing elements. While the multiplier operates, only 5 or 6 processing elements are at work, so the average utilization rate is only about 1/3. Furthermore, the throughput latency of the BMMSA with an n-rank input data matrix is 3n + min(W_A, W_B), where W_A and W_B are the bandwidths of the two operand matrices and min(W_A, W_B) is the cost before the BMMSA becomes stable. If n ≫ W_A, W_B, the number of time steps is approximately 3n. Thus the cell-efficiency of the Kung-Leiserson BMMSA is not high enough, and many PEs are idle during processing. We should therefore find better configurations that realize the computation with higher cell-efficiency and faster operation speed.

III. DESIGN METHODOLOGY

In this section, new design methods are proposed that make the input matrix more compact. From Fig. 3 we see that the input array is sparse and easy to compress. Obviously, if the data matrix can be condensed, the cell-efficiency will increase greatly. Moreover, the operations of different steps can then proceed at the same time, even faster than in a common pipelined thread. These are the central ideas of "Matrix Compression" and "Super Pipelined".

After matrix C is condensed, matrix A and matrix B become more compact. The cells' operation sequence must also be adjusted; otherwise some multiplicators and multiplicands lag behind. In this situation the laggard numbers must be moved up to perform the correct multiplication operation (Fig. 4).

Fig. 4. Moving up the laggard numbers.

If rearranging the cells cannot resolve such mismatches, we add some additional cells to hold the data temporarily, as shown in Fig. 5.

Fig. 5. Additional cells to preserve the data: (a) move right, (b) move left.

When neither moving up the laggard numbers nor adding additional cells realizes the correct multiplication function after the data matrices are compressed, we change the operation sequence so as to satisfy equations (2)–(4). Using this method, we can realize the correct multiplication function (Fig. 6).

Fig. 6. Changing the operation sequence.
Based on the ideas of "Matrix Compression" and "Super Pipelined", the methods shown in Figs. 4–6 are used to construct three different BMMSAs. All these new designs achieve higher cell-efficiency and faster processing speed than the traditional Kung-Leiserson BMMSA in the case of wide-banded matrices.
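The speed gains can be quantified from the throughput-latency formulas reported for each design in the next section. The following sketch (an illustration; the formulas are the paper's, the sample numbers are arbitrary) evaluates them for one wide-band case, showing the speedup approaching 3× for large n.

```python
# Throughput-latency formulas of the four BMMSAs (Section IV),
# evaluated for a sample n-rank input with bandwidths w1, w2.
n, w1, w2 = 1000, 4, 4
latency = {
    "Kung-Leiserson":     3 * n + min(w1, w2),
    "Two-Space-Parallel": 2 * n + min(w1, w2),
    "One-Space-Parallel": 9 * n // 8 + min(w1, w2),
    "One-Space-Jumping":  1 * n + min(w1, w2),
}
base = latency["Kung-Leiserson"]
for name, t in latency.items():
    # For n >> w1, w2 the One-Space-Jumping speedup approaches 3x.
    print(f"{name:20s} {t:5d} cycles  speedup {base / t:.2f}x")
```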
IV. THE DESIGN OF BAND MATRIX MULTIPLICATION SYSTOLIC ARRAYS

Through compressing the input matrices and readjusting the multiplication sequence, we propose three BMMSA designs.

A. The Kung-Leiserson BMMSA

This BMMSA is the traditional design, and its characteristics are as follows [7] (see also Fig. 3):
1) The number of PEs involved is w1w2 (w1 and w2 are the bandwidths of the two operand matrices).
2) Each pipeline pumps data into the array every three cycles.
3) Each PE is active every three cycles. Thus, once the operation is in full steam, the average utilization rate is about 33 percent.
4) The throughput latency is 3n + min(w1, w2).
5) The maximum input bandwidth is about 2(w1 + w2)/3 per cycle.

B. The Proposed Design of "Two-Space-Parallel BMMSA"

In this BMMSA the distance between neighboring cnn entries is two spaces, and the laggard numbers are moved up to meet the requirement of multiplication. Its characteristics are summarized as follows (Fig. 7):
1) The organization of the data is similar to that in the Kung-Leiserson design.
2) The number of PEs involved is w1w2.
3) Each neighboring cnn's distance is two spaces, and the data cnn are pumped into the array every two cycles (cnn are the numbers in the input data matrix C).
4) Each PE is used every two cycles. Thus, when the multiplication operation is in full steam, the average utilization rate is about 50 percent.
5) The throughput latency is 2n + min(w1, w2).
6) The maximum input bandwidth is about 3(w1 + w2)/2 per cycle.

Fig. 7. Pipelined configuration of the "Two-Space-Parallel" BMMSA and its first eight cycles.

C. The Proposed Design of "One-Space-Parallel BMMSA"

In this BMMSA the input matrices are compressed further, and the distance between neighboring cnn entries is one space. To realize the multiplication, additional cells must be added. Its characteristics are as follows (Fig. 8):
1) The organization of the data associated with the input and output matrices is similar to that in the Kung-Leiserson design.
2) The number of PEs involved is w1w2 + 2.
3) Each neighboring cnn's distance is one space, and the data cnn are pumped into the array every cycle.
4) Almost all PEs are active when the flows of data are in full steam. Thus, the average utilization rate is about 90 percent.
5) The throughput latency is 9n/8 + min(w1, w2).
6) The maximum input bandwidth is about 2(w1 + w2) per cycle.

Fig. 8. Pipelined configuration of the "One-Space-Parallel" BMMSA and its first six cycles.

D. The Proposed Design of "One-Space-Jumping BMMSA"

In this BMMSA the input matrices are also compressed to one space. To avoid the additional cells, we change the operation sequence, as if the data were "jumping" from one PE to another (Fig. 6). In this way the PE array is kept stable and the data matrices can be made more compact. This design has the following characteristics (Fig. 9):
1) The organization of the data associated with the input and output matrices is similar to that in the Kung-Leiserson design.
2) The number of PEs involved is w1w2.
3) Each neighboring cnn's distance is one space, and the data cnn are pumped into the array every cycle.
4) Almost all PEs are active when the flows of data are in full steam. Thus, the average utilization rate is about 100 percent.
5) The throughput latency is n + min(w1, w2).
6) The maximum input bandwidth is about 2(w1 + w2) per cycle.

Fig. 9. Pipelined configuration of the "One-Space-Jumping" BMMSA and its first six cycles.

V. RESULTS

By analysing the PE utilization rate step by step, we can obtain the utilization of each BMMSA in every cycle. Fig. 10 shows that the proposed BMMSAs improve characteristics such as data bandwidth, average utilization rate, and throughput latency. Compared with the traditional Kung-Leiserson BMMSA, the best design uses all the PEs and spends only 1/3 of the time to complete the operation. The system also becomes stable faster, although its configuration is more complex and it must drive more PEs at the same time. The corresponding attributes are summarized in Table I.

Fig. 10. PE utilization rate versus cycle for the four BMMSAs.

TABLE I
COMPARISON OF FOUR BMMSAS

                           Kung-             Two-Space-        One-Space-        One-Space-
                           Leiserson         Parallel          Parallel          Jumping
# of PEs                   w1w2              w1w2              w1w2 + 2          w1w2
Input bandwidth            2(w1+w2)/3        3(w1+w2)/2        2(w1+w2)          2(w1+w2)
Throughput latency         3n+min(w1, w2)    2n+min(w1, w2)    9n/8+min(w1, w2)  n+min(w1, w2)
Average utilization rate   33%               50%               90%               100%

Thus the proposed BMMSAs are better than the traditional Kung-Leiserson BMMSA for wide-banded matrices, and among the new designs the One-Space-Jumping BMMSA has the best characteristics.

VI. CONCLUSIONS

Based on the Kung-Leiserson BMMSA configuration, we have proposed three new BMMSAs. Through the ideas of "Matrix Compression" and "Super Pipelined", these new systolic arrays reach a higher average utilization rate and a lower throughput latency by adjusting their operation sequences. Additional PEs are added in the One-Space-Parallel BMMSA. Furthermore, the One-Space-Jumping BMMSA realizes the maximum average utilization rate (100%) and the minimum throughput latency (n + min(w1, w2)). These new designs also spend less processing time than the traditional Kung-Leiserson BMMSA and become stable rapidly before the systems come into the full-steam state.

REFERENCES

[1] L. P. Rubinfield, "A proof of the modified Booth's algorithm for multiplication," IEEE Trans. Computers, vol. 24, no. 10, pp. 1014–1015, Oct. 1975.
[2] M. Nagamatsu, S. Tanaka, J. Mori, K. Hirano, T. Noguchi, and K. Hatanaka, "A 15-ns 32×32-b CMOS multiplier with an improved parallel structure," IEEE J. Solid-State Circuits, vol. 25, no. 2, pp. 494–497, Apr. 1990.
[3] F. Lu and H. Samueli, "A 200-MHz CMOS pipelined multiplier-accumulator using a quasi-domino dynamic full-adder cell design," IEEE J. Solid-State Circuits, vol. 28, no. 2, pp. 123–132, Feb. 1993.
[4] H. T. Kung and C. E. Leiserson, Introduction to VLSI Systems, Addison-Wesley, 1980.
[5] H. T. Kung and M. S. Lam, "Wafer-scale integration and two-level pipelined implementation of systolic arrays," J. Parallel and Distributed Computing, vol. 1, no. 1, pp. 32–63, Aug. 1984.
[6] K. H. Huang and J. A. Abraham, "Efficient parallel algorithms for processor arrays," in Proc. IEEE ICPP, pp. 271–279, Aug. 1982.
[7] S.-W. Chan and C.-L. Wey, "The design of concurrent error diagnosable systolic arrays for band matrix multiplications," IEEE Trans. Computer-Aided Design, vol. 7, no. 1, pp. 21–37, Jan. 1988.
[8] P. S. Tang, Integrated Circuit Design Methodologies, Fudan University Press, 2002.
[9] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice-Hall International, 1999.