2009 International Conference on Reconfigurable Computing and FPGAs

Matrix Multiplication Based on Scalable Macro-Pipelined FPGA Accelerator Architecture

Jiang Jiang1, Vincent Mirian2, Kam Pui Tang2, Paul Chow2, Zuocheng Xing1

1 School of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China, 410073
2 Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada
[email protected], {mirianvi, tangkamp, pc}@eecg.utoronto.ca, [email protected]

Abstract— In this paper, we introduce a scalable macro-pipelined architecture (SMPA) to perform floating-point matrix multiplication, which aims to exploit temporal parallelism and architectural scalability. We demonstrate the functionality of the hardware design with 16 processing elements (PEs) on a Xilinx ML507 development board containing a Virtex-5 XC5VFX70T. A 32-PE design for matrix sizes ranging from 32×32 to 1024×1024 is also simulated. Our experiments show that we achieve 12.18 GFLOPS with 32 PEs, or about 1.90 GFLOPS per PE per GHz, which corresponds to over 95% PE utilization. Moreover, the proposed SMPA has the capability to scale up to tens or hundreds of GFLOPS using multiple FPGA devices and a high-speed interconnect.

Keywords— matrix multiplication; temporal parallelism; macro-pipeline; FPGA accelerator

I. INTRODUCTION

The fundamental way to improve the performance of a computer system is to exploit parallelism. There are two kinds of parallelism: spatial parallelism and temporal parallelism. Spatial parallelism uses duplicated function units (FUs), multiple cores, or even multiple processors operating at the same time on different data sets. Current research mainly focuses on this approach, with examples such as superscalar processors, multicore processors, graphics processing units (GPUs) and multiprocessor systems. Temporal parallelism uses a multi-stage pipeline or macro-pipeline to partition an application into multiple phases and data sets that execute simultaneously. Clearly, temporal parallelism and spatial parallelism can yield the same potential speedup.

Matrix multiplication is a typical routine/kernel in scientific applications. The LINPACK Benchmark has been used for many years to evaluate computer systems, and the Basic Linear Algebra Subprograms (BLAS) on which it builds are high-quality building-block routines performing basic vector and matrix operations. The popular Level 3 BLAS mostly target matrix-matrix operations of order O(n³) [1]. There have been many attempts to improve the efficiency of matrix multiplication. Coprocessors and accelerators have been implemented using GPUs, field programmable gate arrays (FPGAs), digital signal processors (DSPs), and application specific integrated circuits (ASICs). Most implementations take the spatial parallelism approach. Efficient algorithms have also been proposed, and libraries optimized for specific computer architectures have been designed.

The proposed scalable macro-pipelined architecture (SMPA) exploits temporal parallelism, which is not found in [1]. The implementation uses DSP blocks to build the multiplier-accumulator (MAC) PEs as described in [2] and [3], and uses distributed memory, the memory organization identified as optimal in [3]. The algorithm exploits parallelism along two axes: the computation axis within each PE and the communication axis among multiple PEs. SMPA is a generic platform, not optimized for a particular software library as in [4] and [5]. The algorithm does not use multiple FPGAs as in [6], but it is scalable to multiple FPGAs. The utilization of registers is kept minimal to reduce energy consumption, as argued in [7], [8] and [9]. Unlike [2] and [11], SMPA uses a ring topology to connect the PEs, and unlike [10], a front-end control unit is designed to control PE operations.

The remainder of this paper is organized as follows. Section II presents an overview of the proposed SMPA. The matrix multiplication algorithm based on SMPA is described in Section III, and its performance is analyzed in Section IV. Section V describes the FPGA hardware implementation of the proposed SMPA, and Section VI presents experimental results. Section VII concludes our discussion and introduces future work.
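As a point of reference for the kernel the accelerator targets, the Level 3 BLAS matrix-matrix product (GEMM with unit alpha and zero beta) reduces to the familiar O(n³) triple loop shown below. This is only a plain software baseline for clarity, not the SMPA algorithm; the function name, the row-major layout and the use of square matrices are illustrative assumptions.

/* Naive single-precision matrix multiplication C = A * B.
 * A, B and C are n x n matrices stored in row-major order.
 * This O(n^3) kernel is the operation the SMPA accelerator offloads. */
void matmul_naive(int n, const float *A, const float *B, float *C)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}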

II. SCALABLE MACRO-PIPELINED ARCHITECTURE

Figure 1. SMPA Accelerator. (Block diagram: the host connects through the host-accelerator interface (INF) to the FPGA accelerator, which contains the control unit (CU), the PE ring PE0, PE1, PE2, ..., PEp-1, the local memory ring of per-PE memory pairs LMA0/LMB0 through LMAp-1/LMBp-1, and the memory controller (MC) attached to off-chip memory. BWINF, BWto-PER, BWto-LMR and BWMEM label the bandwidths of the corresponding links.)

As illustrated in Fig. 1, the accelerator system comprises two parts: the host and the FPGA accelerator. The host offloads the multiplication task to the accelerator. Its responsibilities include reordering the data of the multiplicand matrices, sending the reordered data to the accelerator, receiving the product data from the accelerator, and reordering the product data. The host is also responsible for writing and reading control registers in the accelerator to configure execution modes and to retrieve performance statistics, respectively.

The FPGA accelerator performs the matrix operations. It consists of five components: the PE ring (PER), the local memory ring (LMR), the control unit (CU), the host-accelerator interface (INF) and the memory controller (MC).

The PER includes multiple PEs, which are connected in a ring structure and operate in a pipelined fashion. Each stage of the pipeline (i.e., each PE) can be a multiplier-accumulator (MAC) or a more powerful microprocessor. Intermediate results are passed from one stage to the next; a small software model of one such stage is sketched after the list of advantages below. There are two reasons to use a ring connection. First, it supports different matrix sizes: when the number of columns of matrix A (or the number of rows of matrix B) is larger than the number of PEs, multiple iterations are needed. Second, it is scalable: the ring topology is easy to scale up or down.

Each PE has two local memories (LMs). One stores data from matrix A, and the other stores data from matrix B. The LMs are also organized in a ring structure. All of the data pumped into the LMR is consumed by the LMs.

The INF is the communication interface between the host and the accelerator. It is meant to operate at high speed so as not to create a bottleneck for the accelerator.

The CU controls the source and result matrix data flows. If the communication bandwidth of the INF is not high enough to meet the bandwidth requirement of the LMs, the source data from the INF is diverted to off-chip memory. The CU then reads all of the data out of the off-chip memory and feeds it to the LMR. Similarly, result data is written back to the off-chip memory before being sent back to the host. Of course, if the bandwidth of the INF is high enough, there is no need to use off-chip memory.

The advantages of the proposed SMPA are:
• Communication and computation are effectively overlapped; computation starts as early as possible.
• No extra memory is needed to store intermediate results; all of them are kept in the PEs' internal pipeline stages.
• The on-chip local memory requirement is small; the data set for each PE is regular and can be reused.
• The interconnect topology is simple; a unidirectional ring supports various problem sizes.
• It adopts a data-flow style of operation, so the control logic is very simple.
• Once the pipeline is filled, it can yield performance comparable to spatial parallelism.
• It is scalable: a pipeline/ring is easy to scale up and down.
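To make the pipeline stage behavior concrete, the following minimal C model (the names pe_t and pe_stage are assumed for illustration and do not come from the paper) shows what one MAC PE does per step: it multiplies one operand from its matrix A local memory by one operand from its matrix B local memory, adds the partial sum received from the previous stage, and forwards the result to the next stage of the ring.

/* Hypothetical model of a single SMPA pipeline stage (MAC PE). */
typedef struct {
    const float *lma;   /* this PE's local memory holding elements of matrix A */
    const float *lmb;   /* this PE's local memory holding elements of matrix B */
} pe_t;

/* One macro-pipeline step: multiply one pair of local operands and add the
 * partial sum arriving from the previous stage; the return value is the
 * partial sum forwarded to the next PE in the ring. */
static inline float pe_stage(const pe_t *pe, int idx, float partial_in)
{
    return partial_in + pe->lma[idx] * pe->lmb[idx];
}

Because each such stage performs one multiply and one add (2 FLOPs) per cycle once the ring is full, a PE peaks at 2 GFLOPS per GHz; the 1.90 GFLOPS per PE per GHz reported earlier therefore corresponds to roughly 95% utilization.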

III. MATRIX MULTIPLICATION ALGORITHM

Each PE multiplies one element of matrix A and one element of matrix B taken from its local memories LMA and LMB, respectively, and then accumulates the intermediate result from the previous stage. The matrix multiplication algorithm based on the SMPA architecture can be described with the pseudo-code shown in Fig. 2.

Figure 2. Pseudo-code of the matrix multiplication algorithm based on SMPA (the listing is truncated after "for (c=0; c").
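As a hedged illustration of the algorithm described above (a software model, not the authors' Fig. 2 listing), the sketch below shows how the shared dimension of the product can be distributed over a ring of p MAC PEs, with multiple trips around the ring whenever that dimension exceeds the PE count. NUM_PE, the function name and the exact loop structure are assumptions.

#define NUM_PE 32   /* assumed number of PEs in the ring (p) */

/* Software model of the SMPA decomposition for C = A * B (all n x n,
 * row-major). Within one trip around the ring, PE 'pe' contributes the
 * product of one element of A and one element of B to the running partial
 * sum; when the shared dimension n exceeds NUM_PE, extra trips are needed. */
void smpa_matmul_model(int n, const float *A, const float *B, float *C)
{
    int p = NUM_PE;
    int iters = (n + p - 1) / p;   /* number of trips around the ring */

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float partial = 0.0f;
            for (int it = 0; it < iters; it++) {
                for (int pe = 0; pe < p; pe++) {
                    int k = it * p + pe;
                    if (k < n)
                        partial += A[i * n + k] * B[k * n + j];
                }
            }
            C[i * n + j] = partial;   /* completed element of the product */
        }
    }
}

In the actual hardware these per-PE accumulations overlap in time across the pipeline stages rather than executing sequentially as in this serial model, which is how the architecture overlaps communication and computation.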
