An FPGA Based SIMD Processor With A Vector Memory Unit

Junho Cho, Hoseok Chang and Wonyong Sung
School of Electrical Engineering, Seoul National University, Kwanak-gu, Seoul 151-744, Korea
E-mail: {juno, chs, wysung}@dsp.snu.ac.kr

Abstract—A SIMD processor with a 16-way partitioned data-path is designed for efficient multimedia data processing. In order to automatically align the data needed for SIMD processing, the architecture adopts a vector memory unit consisting of 17 memory banks. The vector memory unit also contains address generation and data rearrangement units for eliminating bank conflicts. The MicroBlaze FPGA-based RISC processor is used for program control and scalar data processing. The architecture has been implemented on a Xilinx FPGA, and its performance is evaluated on several multimedia kernels.


I. INTRODUCTION

The SIMD architecture with a very long partitioned data-path is found in various CPUs nowadays. The architecture is very efficient for multimedia data processing because it can process multiple samples or pixels with one vector instruction. However, it frequently suffers from data-alignment overhead. If the required data are not stored in order, reordering operations such as pack, unpack, rotate and shuffle are needed, which obviously reduces the performance gain. As the functional unit becomes wider, the alignment overhead becomes more critical [1]. In order to overcome this alignment problem, vector memory structures that use multiple banks of memory connected in an interleaved fashion have been developed [2][3]. However, the conventional multi-bank memory architecture is not free from the bank-conflict problem. The SIMD architecture presented here is equipped with a vector memory unit designed to eliminate most of the bank conflicts.
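The relationship between the bank count and bank conflicts can be illustrated with a small software model. The following is a minimal sketch written for this discussion rather than code from the paper; the 16 lanes match the data-path width of the proposed processor, while the function name, the strides and the bank counts being compared are illustrative assumptions.

```c
/* Illustrative model (not from the paper): map 16 strided element accesses
 * onto an interleaved multi-bank memory and report the worst-case number of
 * accesses that fall into a single bank (1 means conflict-free). */
#include <stdio.h>

#define NUM_LANES 16

static int worst_bank_load(int num_banks, int start, int stride)
{
    int used[64] = {0};                     /* large enough for the bank counts below */
    int worst = 0;
    for (int k = 0; k < NUM_LANES; k++) {
        int bank = (start + k * stride) % num_banks;
        if (++used[bank] > worst)
            worst = used[bank];
    }
    return worst;
}

int main(void)
{
    /* 16 banks, stride 2: the 16 accesses land in only 8 banks (2 per bank). */
    printf("16 banks, stride 2: worst load = %d\n", worst_bank_load(16, 0, 2));
    /* 17 banks, stride 2: all 16 accesses land in distinct banks. */
    printf("17 banks, stride 2: worst load = %d\n", worst_bank_load(17, 0, 2));
    return 0;
}
```

Because 17 is prime, any stride that is not a multiple of 17 is coprime to the bank count, so 16 consecutive strided addresses always map to 16 distinct banks; this is the property the vector memory unit exploits.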


In this paper, the architecture, the application development method, and the resulting performance improvement are explained. The rest of this paper is organized as follows. Section II presents the architecture of the developed system. Section III explains the supported instruction set. Section IV shows the vectorizing procedure and the performance improvement. Finally, concluding remarks are made in Section V.


II. ARCHITECTURE

The system consists of a host processor, a 16-way partitioned data-path SIMD coprocessor, a 17-way vector memory unit for parallel data memory accesses, and an instruction memory, as illustrated in Fig. 1. The host processor is responsible for scalar operations and program control, while the coprocessor is in charge of vector operations. When an instruction is fetched, the instruction distributor classifies it as either a scalar or a vector type and supplies it to the corresponding unit. If the fetched instruction is a scalar type, it is transmitted directly to the host processor, while a no-operation (NOP) instruction is supplied to the SIMD coprocessor, which remains idle in this case. On the contrary, if the instruction is a vector type, it is sent to the coprocessor. The programming model is thus similar to that of conventional vector processors: the scalar portion of the program is executed by the host processor, while the array-related operations in a loop are done by the coprocessor.
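As a reading aid, the issue behavior of the instruction distributor can be expressed as a short behavioral model. This is a sketch under our own assumptions; the NOP encoding and the opcode-bit test that distinguishes vector from scalar instructions are hypothetical, not taken from the paper.

```c
/* Behavioral sketch of the instruction distributor: every fetched instruction
 * is routed to exactly one unit, and the other unit receives a NOP so that it
 * idles for that cycle.  Encodings below are placeholders. */
#include <stdint.h>

#define NOP_ENCODING 0x00000000u             /* hypothetical NOP encoding */

typedef struct {
    uint32_t to_host;                        /* instruction issued to the host processor  */
    uint32_t to_simd;                        /* instruction issued to the SIMD coprocessor */
} issue_slot_t;

static int is_vector_instruction(uint32_t insn)
{
    return (insn >> 31) & 0x1u;              /* hypothetical: top bit marks vector type */
}

static issue_slot_t distribute(uint32_t fetched)
{
    issue_slot_t slot;
    if (is_vector_instruction(fetched)) {    /* vector op: coprocessor executes, host idles */
        slot.to_host = NOP_ENCODING;
        slot.to_simd = fetched;
    } else {                                 /* scalar op: host executes, coprocessor idles */
        slot.to_host = fetched;
        slot.to_simd = NOP_ENCODING;
    }
    return slot;
}
```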

Figure 1. Architecture of the overall system (host processor with register file, ALU, I-Mem and D-Mem; instruction distributor; 17 memory banks with AGU and DRU; FSL link; and the SIMD coprocessor with vector register files VReg File 0–15 and ALU 0–15)


[Figure: DSP48 blocks DSP48_1–DSP48_4, each with an OPMODE control field, a multiplier (×) and an add/subtract (±) stage, configured as the MAC1/ABD1, MAC2/ABD2, ADD1/SUB1/ACC1/MUL1 and ADD2/SUB2/ACC2/MUL2 units]

TABLE I. ADDRESS / DATA GENERATION AND REARRANGEMENT

Unit   Function
AGU    Addr for (ALU#K) = (start address + K*stride) / 17
ARU    Bank for (ALU#K) = (start address + K*stride) % 17 = R
DRU    ALU for (Bank#R) = K
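The three rows of Table I can be combined into one mapping step per vector access. The sketch below is a host-side C model written for illustration (the data types, function and structure names are ours, not the paper's); it assumes the stride is not a multiple of 17, in which case the 16 lane accesses hit 16 distinct banks and the DRU inverse mapping is well defined.

```c
/* C model of the Table I mappings: for each lane K, the AGU computes the word
 * address inside a bank, the ARU selects the bank R, and the DRU records which
 * lane each bank feeds so that the read data can be rearranged. */
#include <stdint.h>
#include <stdio.h>

#define NUM_LANES 16
#define NUM_BANKS 17

typedef struct {
    uint32_t in_bank_addr[NUM_LANES];   /* AGU: (start + K*stride) / 17           */
    uint32_t bank_of_lane[NUM_LANES];   /* ARU: (start + K*stride) % 17 = R       */
    int      lane_of_bank[NUM_BANKS];   /* DRU: bank R feeds lane K (-1 = unused) */
} vector_access_t;

static vector_access_t map_vector_access(uint32_t start_addr, uint32_t stride)
{
    vector_access_t m;
    for (int r = 0; r < NUM_BANKS; r++)
        m.lane_of_bank[r] = -1;

    for (int k = 0; k < NUM_LANES; k++) {
        uint32_t element_addr = start_addr + (uint32_t)k * stride;
        m.in_bank_addr[k] = element_addr / NUM_BANKS;   /* AGU row of Table I */
        m.bank_of_lane[k] = element_addr % NUM_BANKS;   /* ARU row of Table I */
        m.lane_of_bank[m.bank_of_lane[k]] = k;          /* DRU row of Table I */
    }
    return m;
}

int main(void)
{
    vector_access_t m = map_vector_access(100, 3);      /* example: stride-3 access */
    for (int k = 0; k < NUM_LANES; k++)
        printf("lane %2d -> bank %2u, addr %u\n",
               k, (unsigned)m.bank_of_lane[k], (unsigned)m.in_bank_addr[k]);
    return 0;
}
```

Since a 16-element access uses at most 16 of the 17 banks, one lane_of_bank entry always stays -1 in this model; that bank simply carries no data for the cycle.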
