An FPGA Based SIMD Processor With A Vector Memory Unit

Junho Cho, Hoseok Chang and Wonyong Sung
School of Electrical Engineering, Seoul National University, Kwanak-gu, Seoul 151-744, Korea
E-mail: {juno, chs, wysung}@dsp.snu.ac.kr

Abstract—A SIMD processor that contains a 16-way partitioned data-path is designed for efficient multimedia data processing. In order to automatically align the data needed for SIMD processing, the architecture adopts a vector memory unit that consists of 17 memory banks. The vector memory unit also contains address generation and rearrangement units for eliminating bank conflicts. MicroBlaze, an FPGA-based RISC soft processor, is used for program control and scalar data processing. The architecture has been implemented on a Xilinx FPGA, and its performance for several multimedia kernels is reported.
I. INTRODUCTION

The SIMD architecture with a very long partitioned data-path is found in various CPUs nowadays. The architecture is very efficient for multimedia data processing because it can process multiple samples or pixels with one vector instruction. However, it frequently suffers from the overhead of aligning data. If the required data are not stored in order, reordering operations such as pack, unpack, rotate and shuffle are needed, which obviously reduce the performance gain. As the functional unit becomes wider, this alignment overhead becomes more critical [1]. In order to overcome the alignment problem, vector memory structures that use multiple banks of memory connected in an interleaved fashion have been developed [2][3]. However, the conventional multi-bank memory architecture is not free from the bank-conflict problem. The SIMD architecture presented here is equipped with a vector memory unit designed to eliminate most of the bank conflicts.
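To make the bank-conflict issue concrete, the short C sketch below (illustrative only, not code from the paper) counts how many of the 16 elements of a strided vector access fall into the same memory bank. With 16 banks, any stride that shares a factor with 16 piles several elements onto one bank; with 17 banks, a prime number, any stride that is not a multiple of 17 spreads the 16 elements over 16 distinct banks.

    /* Illustration of the bank-conflict problem (not from the paper):
     * an element at address a is stored in bank (a % NUM_BANKS).
     * With 16 banks, a stride such as 4 maps several of the 16 accessed
     * elements to the same bank; with 17 banks (prime), any stride that
     * is not a multiple of 17 hits 16 distinct banks.
     */
    #include <stdio.h>

    static int max_bank_load(int num_banks, int start, int stride)
    {
        int load[32] = {0};
        int max = 0;
        for (int k = 0; k < 16; k++) {          /* one 16-way SIMD access */
            int bank = (start + k * stride) % num_banks;
            if (++load[bank] > max)
                max = load[bank];
        }
        return max;                              /* 1 means conflict-free */
    }

    int main(void)
    {
        printf("16 banks, stride 4: worst bank load = %d\n",
               max_bank_load(16, 0, 4));         /* prints 4: conflicts     */
        printf("17 banks, stride 4: worst bank load = %d\n",
               max_bank_load(17, 0, 4));         /* prints 1: conflict-free */
        return 0;
    }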
In this paper, the architecture, the application development methods, and the achieved performance improvement are explained. The rest of this paper is organized as follows. Section II presents the architecture of the developed system, and Section III explains the supported instruction set. The vectorizing procedure and the performance improvement are shown in Section IV. Finally, concluding remarks are made in Section V.
II. ARCHITECTURE

The system consists of a host processor, a 16-way partitioned data-path SIMD coprocessor, a 17-bank vector memory unit for parallel data memory accesses, and an instruction memory, as illustrated in Fig. 1. The host processor takes the role of scalar operations and program control, while the coprocessor is in charge of vector operations. When an instruction is fetched, the instruction distributor classifies it as either a scalar or a vector type and supplies it to the corresponding unit. If the fetched instruction is a scalar type, it is transmitted directly to the host processor, while a no-operation (NOP) instruction is supplied to the SIMD coprocessor, which stays idle in that cycle. On the contrary, if the instruction is a vector type, it is sent to the coprocessor. The programming model is thus similar to that of conventional vector processors: the scalar portion of the program is executed by the host processor, while the array-related operations in a loop are done by the coprocessor.
Figure 1. Architecture of the overall system
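The instruction-distribution behavior described above can be modelled in a few lines of C. The encoding below is purely hypothetical (the paper does not specify how scalar and vector instructions are distinguished, and the MicroBlaze/FSL interface is not modelled); the sketch only mirrors the described behavior: scalar instructions go to the host, vector instructions go to the coprocessor, and the other unit receives a NOP for that cycle.

    #include <stdint.h>

    #define NOP 0x00000000u   /* hypothetical NOP encoding */

    /* Hypothetical classifier: assume bit 31 marks vector instructions.
     * This is only a behavioral model of the instruction distributor
     * described in the text, not the actual instruction format.
     */
    static int is_vector(uint32_t insn) { return (insn >> 31) & 1u; }

    void distribute(uint32_t insn, uint32_t *to_host, uint32_t *to_simd)
    {
        if (is_vector(insn)) {
            *to_host = NOP;      /* host idles this cycle            */
            *to_simd = insn;     /* 16-way SIMD coprocessor executes */
        } else {
            *to_host = insn;     /* MicroBlaze host executes         */
            *to_simd = NOP;      /* coprocessor idles                */
        }
    }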
[Figure: functional units built from Xilinx DSP48 blocks (DSP48_1 - DSP48_4); OPMODE-controlled add/subtract (±) and multiply (×) paths implement the MAC1/MAC2 (ABD1/ABD2) and ADD/SUB/ACC/MUL operations]

TABLE I. ADDRESS / DATA GENERATION AND REARRANGEMENT

Unit   Function
AGU    Addr for (ALU#K) = (start address + K*stride) / 17
ARU    Bank for (ALU#K) = (start address + K*stride) % 17 = R
DRU    ALU for (Bank#R) = K
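Reading Table I: element K of a strided vector access resides in bank (start address + K*stride) mod 17, at word address (start address + K*stride) div 17 inside that bank, and the data rearrangement unit (DRU) routes the word coming out of bank R back to ALU K. The following is a minimal C model of this mapping, assuming exactly the formulas of Table I (the function and variable names are ours, not from the paper):

    #include <stdint.h>

    #define NUM_ALUS  16
    #define NUM_BANKS 17

    /* Software model of the vector memory unit addressing (Table I).
     * AGU: local address inside the bank = (start + K*stride) / 17
     * ARU: bank holding element K        = (start + K*stride) % 17
     * DRU: inverse map, bank R -> ALU K, used to route read data
     *      back into lane order for the 16 ALUs.
     */
    void vector_access_map(uint32_t start, uint32_t stride,
                           uint32_t addr[NUM_ALUS],      /* AGU output  */
                           uint32_t bank[NUM_ALUS],      /* ARU output  */
                           int alu_for_bank[NUM_BANKS])  /* DRU routing */
    {
        for (int r = 0; r < NUM_BANKS; r++)
            alu_for_bank[r] = -1;           /* bank unused in this access */

        for (int k = 0; k < NUM_ALUS; k++) {
            uint32_t a = start + (uint32_t)k * stride;
            addr[k] = a / NUM_BANKS;        /* AGU */
            bank[k] = a % NUM_BANKS;        /* ARU */
            alu_for_bank[bank[k]] = k;      /* DRU: bank R feeds ALU K */
        }
    }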