FPGA Implementation and Evaluation of a Simple Processor for Multi-scalar/Vector/Matrix Instructions

Mostafa I. Soliman (1), Elsayed A. Elsayed (2)

(1) Computer Science and Information Department, Community College, Taibah University, Al-Madinah Al-Munawwarah, Saudi Arabia; mossol@ieee.org / mossol@yahoo.com
(2) Computer and System Section, Electrical Engineering Dept., Faculty of Engineering, Aswan University, Aswan 81542, Egypt; sayabdellah@yahoo.com
Abstract-This paper presents the FPGA implementation of a simple processor architecture for accelerating data-parallel applications. Our proposed processor, called SuperSMP, can execute multi-scalar, vector, and matrix instructions on parallel execution datapaths. 4x32-bit instructions are fetched from the instruction cache. The fetched instructions are decoded and their dependencies are checked. Up to four independent scalar instructions can be issued in-order to the parallel execution units, whereas vector/matrix instructions iterate the issuing of four vector/matrix operations without checking. On four parallel execution units, SuperSMP can perform addition, subtraction, multiplication, division, and shifting on scalar/vector/matrix data. 4x32-bit contiguous vector/matrix elements can be loaded/stored per clock cycle from/to the L2 cache to/from the matrix register file. Finally, up to 4x32-bit results or loaded data can be written into the scalar/matrix register files. The FPGA implementation of our proposed SuperSMP requires 14,032 slices on a Xilinx Virtex-5 XC5VLX110-3FF1153. The number of LUT flip-flop pairs is 49,398, where 17,166, 10,267, and 21,965 are the numbers of unused flip-flop, unused LUT, and fully used LUT flip-flop pairs, respectively. The complexity of SuperSMP is about 3.5 times that of the baseline scalar processor, while its performance ranges from 4.3 to 18.2 times higher than the baseline scalar processor.

Keywords-FPGA, data-parallel applications, vector/matrix processing, performance evaluation
I. INTRODUCTION
A field-programmable gate array (FPGA) is a semiconductor device based around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects (see [1-3] for more details). A single CLB comprises two slices, each containing four 6-input look-up tables (LUTs) and four flip-flops (FFs), for a total of eight 6-input LUTs and eight FFs per CLB. For example, the Virtex-5 XC5VLX110 has a 160x54 array of configurable logic blocks (8,640 CLBs or 17,280 slices: 69,120 6-input LUTs and 69,120 FFs). By properly loading LUTs, FPGAs can be reprogrammed to meet desired application or functionality requirements after manufacturing. This is the main feature that distinguishes FPGAs from application-specific integrated circuits (ASICs), which are custom manufactured for specific design tasks. While FPGAs used to be selected for lower
speed/complexity/volume designs in the past, today's FPGAs easily push the 500 MHz performance barrier. With unprecedented logic density increases and a host of other features, such as embedded processors, DSP blocks, clocking, and high-speed serial at ever lower price points, FPGAs are a compelling proposition for almost any type of design [4]. On FPGA, this paper proposes a simple processor architecture called SuperSMP (super simple matrix processor) for accelerating data-parallel applications. Data-parallel applications, which include scientific and engineering, DSP, multimedia, network, security, and similar workloads, are growing in importance and demanding increased performance from hardware. Since the fundamental data structures for a wide variety of data-parallel applications are scalar, vector, and matrix, our proposed SuperSMP processor has a three-level instruction set architecture (ISA) executed on zero-, one-, and two-dimensional arrays of data. These instruction sets are used to express a great amount of fine-grain parallelism (up to three-dimensional data-level parallelism, or 3D DLP) directly to the processor, instead of having it extracted dynamically by complicated logic or statically by sophisticated compilers. This reduces design complexity and provides a high-level programming interface to the hardware, since the semantic content of the SuperSMP vector/matrix instructions already includes the notion of parallel operations.
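To make the three ISA levels concrete, the C loops below show the work that a single SuperSMP instruction at each level expresses; this is an illustrative sketch only, and the function and array names are invented, not taken from the paper.

```c
#include <stddef.h>

/* 0D DLP: a single scalar-scalar instruction computes one result. */
void scalar_add(int *c, int a, int b) { *c = a + b; }

/* 1D DLP: a single vector-vector instruction expresses this whole loop. */
void vector_add(int *z, const int *x, const int *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

/* 2D DLP: a single matrix-matrix instruction expresses this nested loop. */
void matrix_add(size_t rows, size_t cols,
                int z[rows][cols],
                const int x[rows][cols], const int y[rows][cols]) {
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            z[i][j] = x[i][j] + y[i][j];
}
```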
Vector instruction sets have many fundamental advantages and deserve serious consideration for implementation on microprocessors [5]. A vector ISA packages multiple homogeneous, independent operations into a single short instruction, which results in compact, expressive, and scalable code [6]. The SuperSMP architecture extends the advantages of the vector ISA by adding a matrix instruction set. The parallelism found in data-parallel applications can be explicitly expressed to SuperSMP in the following sets of instructions: scalar-scalar instructions (0D DLP), vector-scalar instructions (1D DLP), vector-vector instructions (1D DLP), matrix-scalar instructions (2D DLP), matrix-vector instructions (2D DLP), and matrix-matrix instructions (2D or 3D DLP). In contrast to vector processors, which extend scalar cores with vector coprocessors, SuperSMP uses unified execution units for processing multiple scalar/vector/matrix operations. Like simple in-order superscalar processors, SuperSMP fetches multiple instructions, decodes and checks dependencies among
fetched instructions, executes multiple scalar/vector/matrix operations on parallel execution units, and writes back multiple scalar/vector/matrix results into the scalar/matrix register files. Thus, SuperSMP exploits both instruction-level parallelism (ILP) and DLP: ILP is exploited by processing multiple scalar instructions in parallel, whereas DLP is exploited using the vector/matrix ISAs. Many designs in the literature have been developed to accelerate data-parallel applications and are related to our work. Asanovic [6] proposed Torrent-0, a single-chip fixed-point vector microprocessor designed for multimedia, human-interface, neural network, and other digital signal processing tasks. Quintana et al. [7] proposed adding a vector unit to a superscalar core as a way to scale superscalar processors. Espasa et al. [8] proposed Tarantula, an aggressive floating-point machine targeted at technical, scientific, and bioinformatics workloads; Tarantula adds to the Alpha EV8 core a vector unit capable of 32 double-precision FLOPs per cycle. Kozyrakis [9] proposed a scalable processor called VIRAM based on vector architecture and IRAM technology. VIRAM has four basic components: a MIPS scalar core, a vector coprocessor, embedded DRAM main memory, and an external I/O interface. Burger et al. [10] proposed TRIPS to produce a scalable architecture that can accelerate industrial, consumer, embedded, and scientific workloads, reaching trillions of calculations per second on a single chip. Krashinsky [11] proposed the vector-thread (VT) architecture as a performance-efficient solution for all-purpose computing, which unifies the vector and multithreaded compute models. Batten [12] explored a new approach to building data-parallel accelerators based on simplifying the instruction set, microarchitecture, and programming methodology for a VT architecture. Soliman [13] proposed a low-complexity vector core called LcVc for executing both scalar and vector instructions on the same execution datapath, where a unified register file is used for storing both scalar operands and vector elements. For matrix processing, Corbal et al. [14] proposed MOM, a matrix-oriented ISA paradigm for multimedia applications based on fusing conventional vector ISAs with SIMD ISAs. Beaumont-Smith [15] proposed MatRISC, a RISC multiprocessor for matrix applications; MatRISC is a highly integrated system-on-a-chip matrix-based parallel processor, which can be used as a coprocessor when integrated into the on-chip cache memory of a microprocessor in a workstation environment. Mei et al. [16] proposed ADRES, where a VLIW processor and a reconfigurable matrix are tightly coupled in a single architecture to increase performance, simplify the programming model, reduce communication costs, and enable substantial resource sharing. Soliman and Sedukhin [17] proposed Trident, a technology-scalable matrix architecture for data-parallel applications, which consists of a set of parallel lanes (each lane contains a set of vector pipelines and a slice of a register file) combined with a fast scalar core. Corsonello et al. [18] proposed a matrix coprocessor specialized for computing matrix products and integrated with a RISC processor. Soliman and Elsayed [19, 20] proposed a simple matrix processor called SMP for executing scalar/vector/matrix instructions and
evaluated its performance. Instead of using accelerators to improve the performance of data-parallel applications, SMP uses a multi-level ISA to express parallelism to common hardware. Soliman and Al-Junaid [21] proposed extending a multi-core processor with a common matrix unit to maximize on-chip resource utilization and to leverage the advantages of the current multi-core revolution to improve the performance of data-parallel applications. This paper is organized as follows. Section II describes the architecture of the proposed SuperSMP processor. Section III presents the FPGA implementation of SuperSMP. The performance of our proposed SuperSMP processor is evaluated on vector/matrix kernels in Section IV. Finally, Section V concludes this paper and gives directions for future work.
II. THE ARCHITECTURE OF OUR PROPOSED SUPERSMP
Figure 1 shows the block diagram of our proposed SuperSMP processor. It consists of a five-stage pipeline for (1) fetching multiple instructions, (2) decoding and checking dependencies among fetched instructions, (3) executing multiple scalar/vector/matrix operations, (4) accessing memory for loading/storing scalar/vector/matrix data, and (5) writing back scalar/vector/matrix results or loaded data into the scalar/matrix register files. Four scalar/vector/matrix instructions of an application are fetched in parallel from the instruction cache in the fetch stage and stored in the fetch/decode pipeline register. The decode stage has two types of register files (scalar and matrix) as well as the main control unit. The fetched scalar/vector/matrix instructions are decoded and checked. When the four fetched instructions are scalars, the dependencies among them are checked, and up to four independent scalar instructions are issued in-order. Since a single vector/matrix instruction groups multiple independent operations, four vector/matrix operations from a single vector/matrix instruction are issued without checking. Thus, for a mixture of scalar/vector/matrix instructions beginning with a scalar instruction, the hardware checks the dependences among scalar instructions until reaching a vector/matrix instruction. Otherwise, a single vector/matrix instruction is decoded and up to four vector/matrix operations are issued, depending on the vector length or matrix size.
Figure 1. The block diagram of our proposed SuperSMP processor.
Scalar operands are read from the scalar register file, whereas vector/matrix operands are read from the matrix register file. In addition, the control signals for controlling the execution of the fetched instructions are generated after checking. Multiple scalar instructions are executed in parallel on the shared functional units. However, a vector/matrix instruction iterates the process of reading operands from the matrix register file and executing vector/matrix operations on the same parallel functional units. The number of iterations depends on the contents of the control registers Strps, Wstrp, and Dim. On SuperSMP, processing matrices needs the Strps and Wstrp control registers, which hold the number of strips and the number of elements per strip, respectively. Strps x Wstrp elements of blocks are processed using a single vector/matrix instruction. For element-wise vector/matrix instructions, such as addition, subtraction, and multiplication, Strps and Wstrp are read by the control unit to generate the proper control signals to process Strps x Wstrp blocks of matrices or Strps x Wstrp strips of vectors. Other instructions, such as matrix-matrix multiplication, need three parameters for processing blocks of data; the control register Dim stores this third parameter. Depending on the opcode of the instruction being executed, the control unit generates the control signals after reading the Strps/Wstrp or Strps/Wstrp/Dim control registers. Loading/storing scalar/vector/matrix data means moving scalar (0D), vector (1D), and matrix (2D) data between the scalar/matrix register files and memory. Scalar memory accesses go through the first-level (L1) data cache, which holds only scalar data. However, vector and matrix accesses go directly to the second-level (L2) cache memory. The simplest form of loading/storing a block of data is the unit-stride vector load/store, which transfers a set of elements (a 1D array) between contiguous memory locations and the matrix register file. The base address of these 1D contiguous elements is usually specified by the contents of a scalar register. The address generator in the decode stage generates a series of memory addresses (one address per clock cycle); each address moves four elements from/to memory. For example, to load a 16-element unit-stride vector of 32-bit elements, the address generator sends the following sequence of addresses to the L2 cache: x, x + 16, x + 32, and x + 48, where x is the base address and the L2 bandwidth is 16 bytes per clock cycle. After the memory latency, the matrix register file receives four 32-bit elements per clock cycle.
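The address generation just described can be modeled behaviorally in C as follows; this is a sketch of the mechanism, not the actual hardware description, and the function name is invented.

```c
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

/* Model of the decode-stage address generator for a unit-stride vector
 * load: one address is issued per clock cycle, and each address moves
 * four 32-bit elements (16 bytes) between the L2 cache and the matrix
 * register file. */
void unit_stride_addresses(uint32_t base, unsigned vlen)
{
    unsigned cycles = (vlen + 3) / 4;        /* four elements per address */
    for (unsigned c = 0; c < cycles; c++)
        printf("cycle %u: 16-byte access at 0x%08" PRIx32 "\n",
               c, base + c * 16);
}
```

For a 16-element vector at base address x, this prints the sequence x, x + 16, x + 32, x + 48, matching the example in the text.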
Finally, up to four scalar/vector/matrix results or loaded data are written back into the scalar/matrix register files. Depending on the opcode, the register destination is specified by either the destination or the target field.

Since matrix data (a 2D array) is stored in memory as a 1D array, loading/storing a unit-stride matrix can be done similarly to loading/storing a unit-stride vector. For example, assuming an m x n matrix is stored in memory row-by-row, the unit-stride loading of a 4x4 block (LMW Md, rs, rt) can be done by generating a series of memory addresses separated by n elements. Note that rs and rt are scalar registers holding the starting address and the number of bytes between two consecutive rows (n elements), respectively, and Md is the destination matrix register. Each of these addresses loads four elements. Thus, the control registers Wstrp, which holds the number of elements per strip, and Strps, which holds the number of strips, are both set to four.
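The same mechanism applied to the LMW instruction can be sketched as below; again this is a behavioral model under the description above, not the VHDL implementation.

```c
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

/* Model of address generation for LMW Md, rs, rt: load a 4x4 block of
 * 32-bit elements from an m x n matrix stored row-by-row. rs holds the
 * starting address and rt the distance in bytes between two consecutive
 * rows; each generated address loads one 4-element strip (Strps and
 * Wstrp are both four, as in the text). */
void lmw_addresses(uint32_t rs, uint32_t rt)
{
    for (unsigned strip = 0; strip < 4; strip++)     /* Strps = 4 */
        printf("strip %u: load 4 elements at 0x%08" PRIx32 "\n",
               strip, rs + strip * rt);
}
```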
III. FPGA IMPLEMENTATION OF SUPERSMP
The design of our proposed SuperSMP processor is implemented in VHDL targeting the Xilinx Virtex-5 XC5VLX110-3FF1153 device with speed grade -3. Synthesis and place & route are both performed using Xilinx ISE 14.5. Figure 2 shows the top-level RTL schematic diagram of our proposed SuperSMP. Moreover, Table 1 presents the FPGA implementation of each component of SuperSMP, including the area consumed in slices and the maximum frequency in MHz. Note that, in Xilinx terms, each Virtex-5 FPGA slice contains four 6-input look-up tables and four D flip-flops.

A. Fetch Stage

The fetch stage of the SuperSMP pipeline accepts the branch/jump address and control signals for updating the PC from the decode stage and generates the 32-bit NPC (next PC) and 4x32-bit instructions. The outputs of the fetch stage are sent to the decode stage through the IF/ID pipeline register. Internally, the 32-bit program counter (PC) is sent to the input of the instruction memory to fetch 4x32-bit instructions when the read enable signal is asserted. Depending on the control signals generated from the control unit, the input address of the PC is connected to the next sequential address (PC + 16) or to the branch/jump address. 3,163 slices are needed for implementing the fetch stage of SuperSMP (see Table 1). The number of LUT flip-flop pairs used for the fetch stage of SuperSMP is 9,846, where 5,677, 611, and 3,558 are the numbers of unused flip-flop, unused LUT, and fully used LUT flip-flop pairs, respectively. The complexity of the fetch stage of SuperSMP is about 2.24 times that of the baseline scalar processor, as shown in Figure 3.
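The PC-update behavior described above can be summarized in a small C model; this is a sketch with invented names, standing in for the VHDL described in this section.

```c
#include <stdint.h>

/* Behavioral model of the fetch-stage PC mux: four 32-bit instructions
 * are fetched per cycle, so the sequential next PC is PC + 16; a control
 * signal from the decode stage can redirect it to a branch/jump target. */
uint32_t next_pc(uint32_t pc, int take_branch, uint32_t target)
{
    return take_branch ? target : pc + 16;
}

/* Fetch 4 x 32-bit instructions from a word-addressed instruction
 * memory model when the read-enable signal is asserted. */
void fetch4(const uint32_t *imem, uint32_t pc, int rd_en, uint32_t out[4])
{
    if (!rd_en)
        return;
    for (int i = 0; i < 4; i++)
        out[i] = imem[pc / 4 + i];
}
```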
Figure 2. Top-level RTL schematic diagram of our proposed SuperSMP.
TABLE I. FPGA STATISTICS OF SYNTHESIZING SUPERSMP

Stage      | Frequency (MHz)   | Occupied Slices   | Slice FFs         | Slice LUTs
           | SuperSMP   Scalar | SuperSMP   Scalar | SuperSMP   Scalar | SuperSMP   Scalar
Fetch      |   424       445   |   3,163     1,415 |   4,169     3,705 |   9,235     4,075
Decode     |   158       222   |   5,896       448 |   9,482       992 |  14,688       942
Execute    |   180       427   |   5,589     1,293 |  12,928     2,544 |  10,599     3,200
Memory     |   332       332   |   1,135     1,122 |   4,064     4,064 |   1,444     1,444
Writeback  |    --        --   |      31        25 |       0         0 |      32        32
Overall    |    67        92   |  14,032     3,983 |  32,232     6,534 |  39,131    12,764

Stage      | LUT-FF pairs used | Unused FF         | Unused LUT        | Fully used LUT-FF pairs
           | SuperSMP   Scalar | SuperSMP   Scalar | SuperSMP   Scalar | SuperSMP   Scalar
Fetch      |   9,846     5,097 |   5,677     1,392 |     611     1,022 |   3,558     2,683
Decode     |  16,989     1,365 |   7,507       373 |   2,301       423 |   7,181       569
Execute    |  18,131     4,505 |   5,203     1,961 |   7,532     1,305 |   5,396     1,239
Memory     |   4,278     4,270 |     214       206 |   2,834     2,826 |   1,230     1,238
Writeback  |      32        32 |      32        32 |       0         0 |       0         0
Overall    |  49,398    14,748 |  17,166     8,214 |  10,267     1,984 |  21,965     4,550
B. Decode Stage
The decode stage of the SuperSMP pipeline accepts the 32-bit NPC and 4x32-bit instructions from the IF/ID pipeline register and generates all control signals needed for controlling the pipeline stages along with the data to be processed. It has a scalar register file with sixteen 32-bit registers (R0, R1, ..., R15) and a matrix register file with sixteen matrix registers of 4x4 32-bit elements each (M0, M1, ..., M15). 4x32-bit instructions are decoded and their dependencies are checked. Up to 2x4x32-bit operands are read from either the scalar or the matrix register file, depending on the instructions in the decode stage. In addition, up to 4x32-bit results are written into either the scalar or the matrix register file, depending on the instructions in the writeback stage. From Table 1, 5,896 slices are needed for implementing the decode stage of SuperSMP. The number of LUT flip-flop pairs used for the decode stage of SuperSMP is 16,989, where 7,507, 2,301, and 7,181 are the numbers of unused flip-flop, unused LUT, and fully used LUT flip-flop pairs, respectively. The complexity of the decode stage of SuperSMP is about 13.16 times that of the baseline scalar processor, as shown in Figure 3. Note that the decode stage of SuperSMP has an extra register file (matrix) of size 16x4x4x32-bit (1024 bytes) and a more complex control unit than the baseline scalar processor.
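The in-order issue check over a group of four fetched scalar instructions can be modeled as below; the decoded-instruction structure is a hypothetical simplification (the real SuperSMP encodings are not given here), and only read-after-write and write-after-write hazards within the group are modeled.

```c
/* Hypothetical decoded scalar instruction: one destination and two
 * source register numbers (0..15, matching R0..R15). */
typedef struct { int dst, src1, src2; } Instr;

/* Return how many of the four fetched scalar instructions can be issued
 * together in-order: instruction i is held back if it reads or writes a
 * register written by an earlier instruction in the same group. */
int issuable(const Instr in[4])
{
    int count = 1;                  /* the first instruction always issues */
    for (int i = 1; i < 4; i++) {
        for (int j = 0; j < i; j++) {
            if (in[i].src1 == in[j].dst || in[i].src2 == in[j].dst ||
                in[i].dst  == in[j].dst)
                return count;       /* dependency found: stop issuing here */
        }
        count++;
    }
    return count;                   /* all four are independent */
}
```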
C. Execute Stage
Figure 3. Hardware complexity of the SuperSMP pipeline stages over the baseline scalar processor.

The execute stage of SuperSMP accepts the control signals and operands from the decode stage through the ID/EX pipeline register. Four parallel execution units operate on four pairs of
operands prepared in the decode stage. They can perform addition, subtraction, multiplication, MAC (multiply-accumulate), AND, OR, NOR, XOR, shift left/right logical, and shift right arithmetic. For a load/store operation, an execution unit generates the memory address by adding its source operand and its immediate value. When the i-th operation is register-register, the i-th execution unit performs the operation specified by the control unit on the i-th pair of operands fed from the scalar/matrix register file through the ID/EX pipeline register, where 1 ≤ i ≤ 4. For a register-immediate operation, the i-th execution unit performs the required operation on the i-th source operand and the i-th immediate value. In all cases, the results of the execution units are placed in the execute/memory (EX/MEM) pipeline register. 5,589 slices are needed for implementing the execute stage of SuperSMP (see Table 1). The number of LUT flip-flop pairs used for the execute stage of SuperSMP is 18,131, where 5,203, 7,532, and 5,396 are the numbers of unused flip-flop, unused LUT, and fully used LUT flip-flop pairs, respectively. The complexity of the execute stage of SuperSMP is 4.32 times that of the baseline scalar processor (see Figure 3).
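One of the four identical execution units can be modeled behaviorally as follows; the opcode enumeration is an invented subset for illustration (MAC, which needs an accumulator input, is omitted for brevity), and the names do not come from the actual design.

```c
#include <stdint.h>

/* Hypothetical opcode subset for one execution unit. */
typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_AND, OP_OR, OP_NOR, OP_XOR,
               OP_SLL, OP_SRL, OP_SRA, OP_MEM_ADDR } Op;

/* One of the four units: register-register ops take both operands from
 * the scalar/matrix register file; register-immediate ops and load/store
 * address generation use the immediate as the second operand instead. */
int32_t exec_unit(Op op, int32_t a, int32_t b, int32_t imm, int use_imm)
{
    int32_t rhs = use_imm ? imm : b;
    switch (op) {
        case OP_ADD:      return a + rhs;
        case OP_SUB:      return a - rhs;
        case OP_MUL:      return a * rhs;
        case OP_AND:      return a & rhs;
        case OP_OR:       return a | rhs;
        case OP_NOR:      return ~(a | rhs);
        case OP_XOR:      return a ^ rhs;
        case OP_SLL:      return a << (rhs & 31);
        case OP_SRL:      return (int32_t)((uint32_t)a >> (rhs & 31));
        case OP_SRA:      return a >> (rhs & 31);  /* arithmetic shift assumed */
        case OP_MEM_ADDR: return a + imm;   /* effective address for load/store */
    }
    return 0;
}
```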
D. Memory Access Stage

Not all instructions need to access the memory stage. If the instruction is a scalar load, 32-bit data is loaded from the data cache and placed in the MEM/WB pipeline register; if it is a scalar store, then the data from the EX/MEM pipeline register is written into the data cache. In either case, the effective address used is the one computed during the prior cycle and stored in the EX/MEM pipeline register. However, vector/matrix load/store instructions load/store 4x32-bit contiguous data from/to the L2 cache to/from the matrix register file. Note that only one of the four instructions in the EX/MEM pipeline register can be a load/store instruction. Thus, only one load/store instruction can be in the memory stage at a time. 1,135 slices are needed for implementing the memory access stage of SuperSMP (see Table 1). The number of LUT flip-flop pairs used for the memory access stage of SuperSMP is 4,278, where 214, 2,834, and 1,230 are the numbers of unused flip-flop, unused LUT, and fully used LUT flip-flop pairs, respectively. The complexity of the memory access stage of SuperSMP is 1.01 times that of the baseline scalar processor, as shown in Figure 3.

E. Writeback Stage

The last stage is the writeback stage, which writes the scalar/vector/matrix results into the SuperSMP register files. Only 31 slices are needed for implementing the writeback stage of SuperSMP, as shown in Table 1. The number of LUT flip-flop pairs used for the writeback stage of SuperSMP is 32, where 32, 0, and 0 are the numbers of unused flip-flop, unused LUT, and fully used LUT flip-flop pairs, respectively. The complexity of the writeback stage of SuperSMP is 1.24 times that of the baseline scalar processor (see Figure 3).
F. SuperSMP Pipeline
The last row in Table 1 summarizes the complexity of the SuperSMP pipeline shown in Figure 1. 14,032 slices are needed for implementing all stages of the SuperSMP pipeline. The number of LUT flip-flop pairs used for the SuperSMP pipeline is 49,398, where 17,166, 10,267, and 21,965 are the numbers of unused flip-flop, unused LUT, and fully used LUT flip-flop pairs, respectively. The complexity of the SuperSMP pipeline is 3.52 times that of the baseline scalar processor, as shown in Figure 3.
IV. PERFORMANCE EVALUATION OF SUPERSMP
In this section, the performance of our proposed SuperSMP is evaluated on the vector/matrix kernels shown in Table 2. The ratio of arithmetic to memory operations in these kernels ranges from O(1) to O(n). Figure 4 shows the speedup of our proposed SuperSMP on scalar-vector, vector-vector, matrix-vector, and matrix-matrix kernels. On SuperSMP with a single execution unit and a memory unit that can load/store one element per
TABLE II. VECTOR/MATRIX KERNELS FOR EVALUATING SUPERSMP

Vector/Matrix Kernel     | Definition
Vector addition (Add)    | z_i = x_i + y_i, 1 ≤ i ≤ n
Vector scaling (SVmul)   | z_i = a * x_i, 1 ≤ i ≤ n