High-Performance Systolic Arrays for Band Matrix Multiplication

Yun Yang†, Wenqing Zhao‡, and Yasuaki Inoue†

† Waseda University, Graduate School of Information, Production and Systems, Kitakyushu, Japan
[email protected], [email protected]
‡ Fudan University, ASIC & System State-Key-Lab, Microelectronics Department, Shanghai, P.R. China
[email protected]

Abstract— Band matrix multiplication is widely used in DSP systems. However, the traditional Kung-Leiserson systolic array for band matrix multiplication cannot be realized with high cell-efficiency. In this paper, three high-performance band matrix multiplication systolic arrays (BMMSA) are presented based on the ideas of "Matrix Compression" and "Super Pipelined". These new systolic arrays are realized by skillfully compressing the data matrices and carefully adjusting the operation sequence. The results show that the best of the three arrays keeps almost 100% of its processing elements (PE) busy in each step. These modifications also increase the operation speed: at best, only 1/3 of the original processing time is needed to complete the multiplication operation.

Index Terms— systolic array, band matrix multiplication, operation speed, cell-efficiency, parallel operation.
0-7803-8834-8/05/$20.00 ©2005 IEEE.

I. INTRODUCTION

Many digital signal-processing (DSP) systems use the multiplication operation, and the speed of multiplication usually limits the operation rate of the DSP and confines the application of the whole system. Therefore many researchers have studied multipliers and proposed algorithms, architectures, and technologies to improve them. The modified Booth multiplier is one important improvement [1], and pipelined architectures can also increase the speed and efficiency of the multiplication operation [2][3]. In addition, systolic arrays for band matrix multiplication have been introduced to meet the requirement of concurrent operation, in which many computations proceed synchronously [4][5]. However, the traditional Kung-Leiserson systolic array for band matrix multiplication can only use about one third of its cells in each step, so its cell-efficiency (the average number of cells working in each step) is not high. Consequently, many new systolic arrays have been proposed to increase the cell-efficiency, for example by reducing the multiplication period over the whole data matrix (throughput latency) and by increasing the percentage of processing elements active in each cycle (average utilization rate) [6][7]. In this paper, three high-performance band matrix multiplication systolic arrays (BMMSA) are presented based on the ideas of "Matrix Compression" and "Super Pipelined". We compress the data matrices and compute the results together. Because there are many more processing threads than in a common pipelined situation, we borrow the term from software science and call it "Super Pipelined". The basic organizations
of the three BMMSAs come from the traditional Kung-Leiserson BMMSA, and the new designs can be used for wide-banded matrices with high cell-efficiency [6]. The new systolic arrays presented in this paper enable us to construct a more compact input matrix by skillfully arranging and compressing the numbers in the data matrix. Further improvements, such as adding PEs and readjusting the operation sequence, are introduced to increase the cell-efficiency. These changes make the new BMMSAs more useful than the Kung-Leiserson BMMSA for the design of concurrent systems in the case of wide-banded matrices.

II. PROBLEM FORMULATION

The basic element of a BMMSA is a processing element (PE) with three data registers RA, RB and RC, shown in Fig. 1. In Fig. 1, A is the multiplicator, B is the multiplicand, and C is the accumulated product of A and B. The input-output relation of the multiplication cell is given by

\[ A = A, \qquad B = B, \qquad C = A \times B + C. \tag{1} \]

Fig. 1. Processing element.
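As a concrete illustration of Eq. (1), the following is a minimal Python sketch (not the authors' implementation) of the processing element of Fig. 1: A and B pass through unchanged, while C accumulates the product A × B.

```python
# Sketch of the Fig. 1 processing element per Eq. (1):
# A and B pass through; C accumulates A * B.
class PE:
    def __init__(self):
        self.ra = 0  # register RA (multiplicator)
        self.rb = 0  # register RB (multiplicand)
        self.rc = 0  # register RC (accumulated product)

    def step(self, a_in, b_in, c_in):
        """One cycle: latch inputs and emit A, B, and C + A*B."""
        self.ra, self.rb, self.rc = a_in, b_in, c_in + a_in * b_in
        return self.ra, self.rb, self.rc

pe = PE()
print(pe.step(3, 4, 5))  # (3, 4, 17): C_out = 5 + 3*4
```

Real PEs would latch on a clock edge; this sketch only models the arithmetic relation of Eq. (1).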
To construct the BMMSA, many processing elements are linked to compose a processor array; the PEs may be linearly, orthogonally, or hexagonally connected, as shown in Fig. 2. Through these connections we can realize vector and matrix computation [8].
Fig. 2. Connection methods: (a) linearly, (b) orthogonally, (c) hexagonally connected.
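To make the idea of computing on a connected array concrete, here is a hedged sketch (an illustration, not from the paper) of the simplest case, a linearly connected array as in Fig. 2(a) evaluating the matrix-vector product y = A·x: PE j holds x[j], and each partial sum of y[i] flows through the chain, gathering a[i][j]·x[j] at PE j. Only the logical data flow is modeled, not cycle-accurate timing.

```python
# Linearly connected array (Fig. 2a) computing y = A x.
# PE j stores x[j]; a partial sum flows through the chain of PEs.
def linear_array_matvec(A, x):
    n = len(x)
    y = []
    for row in A:             # one matrix row streams through per pass
        acc = 0               # partial sum entering the first PE
        for j in range(n):    # PE j adds its contribution row[j]*x[j]
            acc += row[j] * x[j]
        y.append(acc)         # completed sum leaves the last PE
    return y

print(linear_array_matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```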
Consider the multiplication of matrix A and matrix B. We can expand the matrix elements and compute the corresponding results as follows:

\[ [A] \cdot [B] = [C], \tag{2} \]

\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots &        & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
\cdot
\begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1n}\\
b_{21} & b_{22} & \cdots & b_{2n}\\
\vdots & \vdots &        & \vdots\\
b_{n1} & b_{n2} & \cdots & b_{nn}
\end{pmatrix}
=
\begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1n}\\
c_{21} & c_{22} & \cdots & c_{2n}\\
\vdots & \vdots &        & \vdots\\
c_{n1} & c_{n2} & \cdots & c_{nn}
\end{pmatrix}. \tag{3}
\]

Then each element of the product C is

\[ c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}. \tag{4} \]

In large-scale scientific computation, sparse matrices are widely used, and many sparse matrices can be transformed into band matrices. Let matrix A and matrix B be band matrices with bandwidths W_A and W_B. Then the product matrix C is also a band matrix, whose bandwidth W_C is given by [8][9]

\[ W_C = W_A + W_B - 1. \tag{5} \]
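Eq. (5) can be checked numerically. The following sketch (an illustration under the paper's definition of bandwidth as the number of occupied diagonals) multiplies two tridiagonal matrices (W = 3) and confirms that the product has bandwidth 3 + 3 − 1 = 5.

```python
# Verify Eq. (5): W_C = W_A + W_B - 1 for band matrices.
# bandwidth() counts the span of non-zero diagonals of a square matrix.
def bandwidth(M):
    n = len(M)
    diags = {j - i for i in range(n) for j in range(n) if M[i][j] != 0}
    return max(diags) - min(diags) + 1

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Tridiagonal example: W_A = W_B = 3, so W_C should be 3 + 3 - 1 = 5.
n = 6
A = [[1 if abs(i - j) <= 1 else 0 for j in range(n)] for i in range(n)]
C = matmul(A, A)
print(bandwidth(A), bandwidth(C))  # 3 5
```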
Suppose that matrix A and matrix B both have bandwidth 4; then the bandwidth of the output matrix C is 4 + 4 − 1 = 7. The traditional Kung-Leiserson BMMSA arranges the processing elements regularly in a hexagonally connected architecture. The input data are fed in step by step, and the output values emerge from the top of the hexagonally connected array, as shown in Fig. 3. For other bandwidths, the size of the hexagonally connected array can easily be changed in the same way to meet the requirement of the multiplier.

Fig. 3. Pipelined configuration of the Kung-Leiserson BMMSA.

The problem of the traditional Kung-Leiserson BMMSA lies in the utilization of its processing elements. While the multiplier operates, only 5 or 6 processing elements are at work, so the average utilization rate is only about 1/3. Furthermore, the throughput latency of the BMMSA with an n-rank input data matrix is 3n + min(W_A, W_B), where W_A and W_B are the bandwidths of the two operand matrices and min(W_A, W_B) is the cost before the BMMSA becomes stable. If n ≫ W_A, W_B, the number of time steps is approximately 3n. Thus the cell-efficiency of the Kung-Leiserson BMMSA is not high enough, and many PEs are idle during processing. We should therefore find better configurations that realize the computation with higher cell-efficiency and faster operation speed.

III. DESIGN METHODOLOGY

In this section, new design methods are proposed that make the input matrix more compact. From Fig. 3 we see that the input array is sparse and easy to compress. Obviously, if the data matrix can be condensed, the cell-efficiency will increase greatly. Moreover, the operations of different steps can then proceed at the same time, even faster than in a common pipelined thread. These are the central ideas of "Matrix Compression" and "Super Pipelined".

After matrix C is condensed, matrix A and matrix B become more compact. The cells' operation sequence must also be adjusted; otherwise some multiplicators and multiplicands lag behind. In this situation the laggard numbers must be moved up to perform the correct multiplication operation (Fig. 4).

Fig. 4. Moving up the laggard numbers.

If rearranging the cells cannot resolve such mismatches, we add some additional cells to hold the data temporarily, as shown in Fig. 5.

Fig. 5. Additional cells to preserve the data: (a) move right, (b) move left.

When neither moving up the laggard numbers nor adding additional cells realizes the correct multiplication function after the data matrices are compressed, we change the operation sequence so as to satisfy equations (2)–(4). Using this method, we can realize the correct multiplication function (Fig. 6).

Fig. 6. Changing the operation sequence.
Based on the ideas of "Matrix Compression" and "Super Pipelined", the methods shown in Figs. 4–6 are used to construct three different BMMSAs. All these new designs achieve higher cell-efficiency and faster processing speed than the traditional Kung-Leiserson BMMSA in the case of wide-banded matrices.
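The speed gains can be quantified from the throughput-latency formulas reported for each design in the next section. The following sketch (an illustration; the formulas are the paper's, the sample numbers are arbitrary) evaluates them for one wide-band case, showing the speedup approaching 3× for large n.

```python
# Throughput-latency formulas of the four BMMSAs (Section IV),
# evaluated for a sample n-rank input with bandwidths w1, w2.
n, w1, w2 = 1000, 4, 4
latency = {
    "Kung-Leiserson":     3 * n + min(w1, w2),
    "Two-Space-Parallel": 2 * n + min(w1, w2),
    "One-Space-Parallel": 9 * n // 8 + min(w1, w2),
    "One-Space-Jumping":  1 * n + min(w1, w2),
}
base = latency["Kung-Leiserson"]
for name, t in latency.items():
    # For n >> w1, w2 the One-Space-Jumping speedup approaches 3x.
    print(f"{name:20s} {t:5d} cycles  speedup {base / t:.2f}x")
```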
IV. THE DESIGN OF BAND MATRIX MULTIPLICATION SYSTOLIC ARRAYS

Through compressing the input matrices and readjusting the multiplication sequence, we propose three BMMSA designs.

A. The Kung-Leiserson BMMSA

This BMMSA is the traditional design, and its characteristics are as follows [7] (see also Fig. 3):
1) The number of PEs involved is w1w2 (w1 and w2 are the bandwidths of the two operand matrices).
2) Each pipeline pumps data into the array every three cycles.
3) Each PE is active every three cycles. Thus, once the operation is in full steam, the average utilization rate is about 33 percent.
4) The throughput latency is 3n + min(w1, w2).
5) The maximum input bandwidth is about 2(w1 + w2)/3 per cycle.

B. The Proposed Design of "Two-Space-Parallel BMMSA"

In this BMMSA the distance between neighboring cnn entries is two spaces, and the laggard numbers are moved up to meet the requirement of multiplication. Its characteristics are summarized as follows (Fig. 7):
1) The organization of the data is similar to that in the Kung-Leiserson design.
2) The number of PEs involved is w1w2.
3) Each neighboring cnn's distance is two spaces, and the data cnn are pumped into the array every two cycles (cnn are the numbers in the input data matrix C).
4) Each PE is used every two cycles. Thus, when the multiplication operation is in full steam, the average utilization rate is about 50 percent.
5) The throughput latency is 2n + min(w1, w2).
6) The maximum input bandwidth is about 3(w1 + w2)/2 per cycle.

Fig. 7. Pipelined configuration of the "Two-Space-Parallel" BMMSA and its first eight cycles.

C. The Proposed Design of "One-Space-Parallel BMMSA"

In this BMMSA the input matrices are compressed further, and the distance between neighboring cnn entries is one space. To realize the multiplication, additional cells must be added. Its characteristics are as follows (Fig. 8):
1) The organization of the data associated with the input and output matrices is similar to that in the Kung-Leiserson design.
2) The number of PEs involved is w1w2 + 2.
3) Each neighboring cnn's distance is one space, and the data cnn are pumped into the array every cycle.
4) Almost all PEs are active when the flows of data are in full steam. Thus, the average utilization rate is about 90 percent.
5) The throughput latency is 9n/8 + min(w1, w2).
6) The maximum input bandwidth is about 2(w1 + w2) per cycle.

Fig. 8. Pipelined configuration of the "One-Space-Parallel" BMMSA and its first six cycles.

D. The Proposed Design of "One-Space-Jumping BMMSA"

In this BMMSA the input matrices are also compressed to one space. To avoid the additional cells, we change the operation sequence, as if the data were "jumping" from one PE to another (Fig. 6). In this way the PE array is kept stable and the data matrices can be made more compact. This design has the following characteristics (Fig. 9):
1) The organization of the data associated with the input and output matrices is similar to that in the Kung-Leiserson design.
2) The number of PEs involved is w1w2.
3) Each neighboring cnn's distance is one space, and the data cnn are pumped into the array every cycle.
4) Almost all PEs are active when the flows of data are in full steam. Thus, the average utilization rate is about 100 percent.
5) The throughput latency is n + min(w1, w2).
6) The maximum input bandwidth is about 2(w1 + w2) per cycle.

Fig. 9. Pipelined configuration of the "One-Space-Jumping" BMMSA and its first six cycles.

V. RESULTS

By analysing the PE utilization rate step by step, we can obtain the utilization of each BMMSA in every cycle. Fig. 10 shows that the proposed BMMSAs improve characteristics such as data bandwidth, average utilization rate, and throughput latency. Compared with the traditional Kung-Leiserson BMMSA, the best design uses all the PEs and spends only 1/3 of the time to complete the operation. The system also becomes stable faster, although its configuration is more complex and it must drive more PEs at the same time. The corresponding attributes are summarized in Table I.

Fig. 10. PE utilization rate versus cycle for the four BMMSAs.

TABLE I
COMPARISON OF FOUR BMMSAS

                           Kung-             Two-Space-        One-Space-        One-Space-
                           Leiserson         Parallel          Parallel          Jumping
# of PEs                   w1w2              w1w2              w1w2 + 2          w1w2
Input bandwidth            2(w1+w2)/3        3(w1+w2)/2        2(w1+w2)          2(w1+w2)
Throughput latency         3n+min(w1, w2)    2n+min(w1, w2)    9n/8+min(w1, w2)  n+min(w1, w2)
Average utilization rate   33%               50%               90%               100%

Thus the proposed BMMSAs are better than the traditional Kung-Leiserson BMMSA for wide-banded matrices, and among the new designs the One-Space-Jumping BMMSA has the best characteristics.

VI. CONCLUSIONS

Based on the Kung-Leiserson BMMSA configuration, we have proposed three new BMMSAs. Through the ideas of "Matrix Compression" and "Super Pipelined", these new systolic arrays reach a higher average utilization rate and a lower throughput latency by adjusting their operation sequences. Additional PEs are added in the One-Space-Parallel BMMSA. Furthermore, the One-Space-Jumping BMMSA realizes the maximum average utilization rate (100%) and the minimum throughput latency (n + min(w1, w2)). These new designs also spend less processing time than the traditional Kung-Leiserson BMMSA and become stable rapidly before the systems come into the full-steam state.

REFERENCES

[1] L. P. Rubinfield, "A proof of the modified Booth's algorithm for multiplication," IEEE Trans. Computers, vol. 24, no. 10, pp. 1014–1015, Oct. 1975.
[2] M. Nagamatsu, S. Tanaka, J. Mori, K. Hirano, T. Noguchi, and K. Hatanaka, "A 15-ns 32×32-b CMOS multiplier with an improved parallel structure," IEEE J. Solid-State Circuits, vol. 25, no. 2, pp. 494–497, Apr. 1990.
[3] F. Lu and H. Samueli, "A 200-MHz CMOS pipelined multiplier-accumulator using a quasi-domino dynamic full-adder cell design," IEEE J. Solid-State Circuits, vol. 28, no. 2, pp. 123–132, Feb. 1993.
[4] H. T. Kung and C. E. Leiserson, Introduction to VLSI Systems, Addison-Wesley, 1980.
[5] H. T. Kung and M. S. Lam, "Wafer-scale integration and two-level pipelined implementation of systolic arrays," J. Parallel and Distributed Computing, vol. 1, no. 1, pp. 32–63, Aug. 1984.
[6] K. H. Huang and J. A. Abraham, "Efficient parallel algorithms for processor arrays," in Proc. IEEE ICPP, pp. 271–279, Aug. 1982.
[7] S.-W. Chan and C.-L. Wey, "The design of concurrent error diagnosable systolic arrays for band matrix multiplications," IEEE Trans. Computer-Aided Design, vol. 7, no. 1, pp. 21–37, Jan. 1988.
[8] P. S. Tang, Integrated Circuit Design Methodologies, Fudan University Press, 2002.
[9] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice-Hall International, 1999.