Sparse Matrix-Vector Multiplication Design on FPGAs
Junqing Sun (University of Tennessee, Knoxville), Gregory Peterson (University of Tennessee, Knoxville), Olaf Storaasli (Oak Ridge National Laboratory)
[jsun5, gdp]@utk.edu, [email protected]

Abstract
Creating a high-throughput sparse matrix-vector multiplication (SpMxV) implementation depends on a balanced system design. In this paper, we introduce an innovative SpMxV Solver designed for FPGAs (SSF). Besides high computational throughput, system performance is optimized by reducing initialization time and overheads, minimizing and overlapping I/O operations, and increasing scalability. SSF accepts any matrix size and can easily be adapted to different data formats. SSF minimizes the control logic by taking advantage of the data flow via an innovative accumulation circuit that uses pipelined floating point adders. Compared to optimized software codes on a Pentium 4 microprocessor, our design achieves up to 20x speedup.

Keywords: FPGA, Sparse matrix multiplication, Performance
1 INTRODUCTION
Sparse matrix-vector multiplication (SpMxV), y = Ax, is widely used in scientific computing applications such as linear system iterative solvers, block LU solvers, and eigenvalue solvers. Unfortunately, traditional von Neumann architectures usually perform poorly on this important computational kernel because of its low computation-to-communication ratio. Numerous efforts have addressed SpMxV performance on microprocessors [1]; however, these algorithms rely heavily on matrix sparsity structures and face degraded performance on irregular matrices. FPGAs have shown great potential in reconfigurable computing because of their intrinsic parallelism and flexible architecture. With their rapid increase in gate capacity and frequency, FPGAs can outperform microprocessors on both integer and floating point operations [2]. Using FPGAs for high performance SpMxV has been reported by several researchers. Zhuo and Prasanna designed an adder-tree-based SpMxV implementation for double precision floating point that accepts matrices of any size in CRS format; their design uses a reduction circuit that must be configured according to different matrix parameters [3]. ElGindy
and Shue proposed an FPGA-based SpMxV design for fixed point data [4]. DeLorimier and DeHon arranged the PEs in a bidirectional ring to compute y = A^i x, where A is a square matrix and i is an integer. Their design greatly reduces the I/O bandwidth requirement by sharing results between PEs; however, because local memories store the matrix and intermediate results, the matrix size is limited by the on-chip memory [7]. El-kurdi et al. proposed a stream-through architecture for finite element method matrices [8]. In this paper, we introduce an innovative SpMxV Solver design for FPGAs (SSF). Besides the high throughput of the SSF kernel, we increase system performance by reducing communication time and various overheads. The hardware does not need to change for different matrices, so the initialization time is minimized and system integration is simplified. The rest of the paper is organized as follows: Section 2 introduces a new matrix storage format for SSF; Section 3 describes our basic architecture for integer matrices; the complete design for floating point matrices is discussed in Section 4; Section 5 compares the performance of our design to microprocessors running optimized software codes. Finally, we draw conclusions and suggest future work.
2 SpMxV on FPGAs
The storage format plays an important role in SpMxV and affects the performance of optimization algorithms. We use the common Compressed Row Storage (CRS) format for our FPGA design. The multiplicand vector "x" needs to be stored in the FPGA local memory, so large matrices and vectors are divided into sub-blocks. In contrast to the traditional Block Compressed Row Storage (BCRS) format, our matrix format is shown in Figure 1. The matrix is divided into stripes along the rows, and each stripe is then divided into submatrices (shown in dashed lines); only submatrices containing non-zeros are stored and computed. We refer to this format as Row Blocked CRS (RBCRS).
Figure 1: Row Blocked Compressed Row Storage
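For reference, the short C sketch below shows a conventional CRS representation and a plain y = Ax software kernel; the array names (val, col, row_ptr) follow the usual CRS convention, and the 4x4 example matrix is purely illustrative rather than taken from the paper's hardware interface.

#include <stdio.h>

#define N 4   /* matrix dimension (example only) */

typedef struct {
    int     n;         /* number of rows                */
    int     nnz;       /* number of stored non-zeros    */
    double *val;       /* non-zero values, row by row   */
    int    *col;       /* column index of each value    */
    int    *row_ptr;   /* start of each row in val/col  */
} crs_matrix;

/* y = A * x for a CRS matrix */
void spmxv_crs(const crs_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double dot = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            dot += A->val[k] * x[A->col[k]];
        y[i] = dot;
    }
}

int main(void)
{
    /* 4x4 example with 6 non-zeros */
    double val[]     = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
    int    col[]     = {0, 2, 1, 0, 3, 2};
    int    row_ptr[] = {0, 2, 3, 5, 6};
    crs_matrix A = {N, 6, val, col, row_ptr};

    double x[N] = {1.0, 1.0, 1.0, 1.0}, y[N];
    spmxv_crs(&A, x, y);
    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}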
3 BASIC DESIGN
As shown in Figure 2, SSF uses multiple independent PEs as its main computational engines. Each PE is a deep pipeline consisting of a multiplier, an accumulation circuit (ACC), and a FIFO. When rows of the submatrices in Figure 1 are fed into the PEs, the dot product results are computed and stored in FIFO1. The results of completed submatrices are stored in the result BRAM. The summation circuit fetches data from FIFO1 and the result BRAM to add up the results of submatrices belonging to the same stripe.
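The C sketch below is a software analogue of this data flow (our interpretation, not the hardware itself): each non-zero CRS submatrix of a row stripe produces partial row sums, playing the role of a PE writing into FIFO1, and those partial sums are added into the stripe's slice of y, playing the role of the summation circuit and the result BRAM. The block layout and names are illustrative.

/* Software analogue of the SSF data flow (an interpretation, not the
 * hardware): a row stripe is a list of non-zero CRS submatrices; each
 * submatrix contributes partial row sums (the PE / FIFO1 role), which
 * are accumulated into the stripe's slice of y (the summation circuit
 * / result BRAM role).  The caller is assumed to zero y_stripe first. */

typedef struct {
    int     rows;        /* rows in this stripe                      */
    int     col_offset;  /* first global column covered by the block */
    double *val;         /* CRS storage of the submatrix             */
    int    *col;         /* column indices local to the block        */
    int    *row_ptr;
} crs_block;

/* y_stripe += block * x, one non-zero submatrix at a time */
void accumulate_block(const crs_block *b, const double *x,
                      double *y_stripe)
{
    for (int i = 0; i < b->rows; i++) {
        double partial = 0.0;                        /* PE dot product  */
        for (int k = b->row_ptr[i]; k < b->row_ptr[i + 1]; k++)
            partial += b->val[k] * x[b->col_offset + b->col[k]];
        y_stripe[i] += partial;                      /* summation stage */
    }
}

/* process one stripe: every stored (non-zero) submatrix updates y */
void spmxv_stripe(const crs_block *blocks, int nblocks,
                  const double *x, double *y_stripe)
{
    for (int j = 0; j < nblocks; j++)
        accumulate_block(&blocks[j], x, y_stripe);
}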
Figure 2: Data Path and Framework of SpMxV Design

The data and control flow signals are shown in Figure 3. When the "valid" signal is high, "val" and "col" values are streamed into the PEs row by row. All rows are identified by their "Row ID" values, which are highlighted in Figure 3. The "valid" signal is also used to synchronize the components with the data flow. When FIFO1 is full, "stall" is asserted, which causes zeroes to be inserted into the input data and keeps rows in the adder pipelines from being summed. If FIFO1 is not empty, a result is read out and added to the corresponding value in the result BRAM.

Figure 3: Signals for Processing Elements (PEs)

4 COMPLETE DESIGN

Floating point adders are usually deeply pipelined for higher frequency. Because of read-after-write (RAW) data hazards in the pipelined architecture, floating point accumulators cannot be built simply from adders. We propose an accumulation circuit (ACC) in which the data stream is partially accumulated, stored in FIFO1, and fed to a summation circuit. Suppose the floating point adder has a latency of 4 clock cycles; then we need to add 4 values from the FIFO and 1 value from the result BRAM. For these 5 inputs, we use an adder tree with 3 levels and 4 adders, as shown in Figure 4.

Figure 4: Adder Tree Used for Pipelined Adders

For our design with double precision data, 12 outputs from the FIFO and 1 value from the result BRAM need to be added. We use 13 floating point adders to build the adder tree. The data flow of the adder tree used in our design is shown in Figure 5. The rectangles represent data, and the numbers in the rectangles are the clock cycles at which the data become available. The dashed line is a FIFO with a latency of 24 clock cycles. The final result comes out 48 clock cycles after the inputs are available. Because data are captured at the appropriate clock cycles, no complicated control logic is required. We capture the row ID and write enable signal into two shift registers of depth 48 at clock 0, so they emerge from the shift registers at clock 48. Because the result from the adder tree, the row ID, and the write enable signal are available in the same clock cycle, they can be connected directly to the result BRAM to store the results.

Figure 5: Data Flow for Adder Tree
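To make the hazard-avoidance idea concrete in software terms, the sketch below keeps one partial sum per adder pipeline slot (4 slots for the latency-4 example above) so that consecutive additions never depend on each other, and then reduces the 4 partial sums together with the previously stored row result, mirroring the 5-input, 3-level, 4-adder tree of Figure 4. This is a behavioural model under our assumptions, not the circuit itself.

/* Behavioural model (assumption-level sketch) of the ACC idea for a
 * pipelined adder with latency LAT: keep LAT independent partial sums,
 * one per pipeline slot, so back-to-back additions never create a
 * read-after-write hazard; reduce them at the end together with the
 * row's previously stored result. */
#define LAT 4   /* adder latency used in the text's example */

double accumulate_row(const double *products, int n, double stored_result)
{
    double partial[LAT] = {0.0, 0.0, 0.0, 0.0};

    /* round-robin over the slots: the sum started at cycle k is not
     * reused until cycle k + LAT, when the adder result is ready     */
    for (int k = 0; k < n; k++)
        partial[k % LAT] += products[k];

    /* final 5-input reduction: 4 partial sums + stored result,
     * realised in hardware by the 3-level, 4-adder tree of Figure 4 */
    double s0 = partial[0] + partial[1];
    double s1 = partial[2] + partial[3];
    return (s0 + s1) + stored_result;
}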
Without using shifters, an L-level adder tree requires 2^L − 1 adders; a four-level tree, for example, would need 15. To reduce this expensive resource cost, we propose to reduce the number of adders by inserting buffers. For the double precision design, only 4 adders are required. Because of space limits, the technical details are not discussed here.
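The buffered reduction itself is not detailed in the paper; purely to illustrate the resource/latency trade-off being described, the sketch below reduces the buffered values pass by pass, so a small, fixed number of physical adders can be reused instead of instantiating a full 2^L − 1 adder tree. The scheduling shown is our assumption, not necessarily the authors' circuit.

/* Illustrative model of adder reuse through buffering (the actual
 * circuit is not described in the paper): each pass pairs up the
 * buffered values, and in hardware the additions of a pass would be
 * time-multiplexed onto a few pipelined adders rather than mapped to
 * a dedicated 2^L - 1 adder tree. */
double buffered_reduce(double *buf, int n)
{
    while (n > 1) {
        int m = 0;
        for (int i = 0; i + 1 < n; i += 2)
            buf[m++] = buf[i] + buf[i + 1];   /* one adder operation      */
        if (n & 1)
            buf[m++] = buf[n - 1];            /* odd value passes through */
        n = m;                                /* buffer holds next level  */
    }
    return buf[0];
}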
5 IMPLEMENTATION RESULTS
We implemented our design using Xilinx math IP cores, which follow the IEEE 754 standard but can also be customized. The implementations on the Xilinx XC2VP70-7 FPGA are summarized in Table 1. The resource cost and frequency of our design are dominated by the floating point IP cores and Block RAMs. Due to the concise architecture, control circuits take less than 5% of the total resource cost. When higher performance floating point IP cores become available, our design can adopt them easily for higher performance.

Table 1: Characteristics of SpMxV on XC2VP70-7

  Design                 64-bit Int      Single FP       Double FP
  Achievable Frequency   175 MHz         200 MHz         165 MHz
  Slices                 8282 (25%)      10528 (31%)     24129 (72%)
  BRAMs                  36 (10%)        50 (15%)        92 (28%)
  MULT18x18              128 (39%)       32 (9%)         128 (39%)
We compare our design with 8 PEs against software on a microprocessor. For the software performance we use OSKI, which has achieved significant speedups using techniques such as register and cache blocking [5]. The machine is a dual 2.8 GHz Intel Pentium 4 with 16 KB L1 cache, 512 KB L2 cache, and 1 GB memory. The GFLOPS speedup of our design over the Pentium 4 is shown in Figure 6. All test matrices come from the University of Florida Sparse Matrix Collection [6] and are roughly ordered by increasing irregularity.

Figure 6: Speedup over 2.8GHz Pentium 4 (test matrices include Crystk02, ex11, rim, goodwin, dbic1, and rail4284)

Our design performs better than the Pentium 4 on all the test matrices, and the speedup increases with matrix irregularity. This indicates that the performance of our design depends less on the matrix sparsity structure.

6 CONCLUSIONS

We present an innovative SpMxV design for FPGAs. Compared to microprocessors, our design achieves significant speedup and depends less on the matrix structure. Our future work includes implementing the design on the Cray XD1 supercomputer for scientific applications and performance analysis.

7 ACKNOWLEDGEMENTS

This project is supported by the University of Tennessee Science Alliance and the ORNL Laboratory Director's Research and Development program. We also would like to thank Richard Barrett of ORNL for useful discussions on sparse matrix computations.

8 REFERENCES

[1] O. Storaasli. "Performance of NASA Equation Solvers on Computational Mechanics Applications", 34th AIAA Structures, Structural Dynamics and Materials Conf., Apr. 1996.
[2] K. D. Underwood. "FPGAs vs. CPUs: Trends in Peak Floating-Point Performance", FPGA, Feb. 2004.
[3] L. Zhuo and V. K. Prasanna. "Sparse Matrix-Vector Multiplication on FPGAs", FPGA, Feb. 2005.
[4] H. A. ElGindy and Y. L. Shue. "On Sparse Matrix-Vector Multiplication with FPGA-based System", FCCM, Apr. 2002.
[5] R. Vuduc, J. Demmel, and K. Yelick. "OSKI: A Library of Automatically Tuned Sparse Matrix Kernels", SciDAC 2005, Journal of Physics: Conference Series, June 2005.
[6] T. Davis. University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices, NA Digest, 92(42), October 16, 1994; NA Digest, 96(28), July 23, 1996; and NA Digest, 97(23), June 7, 1997.
[7] M. deLorimier and A. DeHon. "Floating-Point Sparse Matrix-Vector Multiply for FPGAs", International Symposium on Field Programmable Gate Arrays, Feb. 2005.
[8] Y. El-kurdi, W. J. Gross, and D. Giannacopoulos. "Sparse Matrix-Vector Multiplication for Finite Element Method Matrices on FPGAs", IEEE Symposium on Field-Programmable Custom Computing Machines, Apr. 2006.