A GFLOPS Vector-DSP for Broadband Wireless Applications

E. Matúš, H. Seidel, T. Limberg, P. Robelly and G. Fettweis
Dresden University of Technology / Vodafone Chair Mobile Communication Systems, D-01062 Dresden, Germany
Email: {matus,seidel,limberg,robelly,fettweis}@ifn.et.tu-dresden.de

Abstract—In this paper a low-power, high-performance floating-point vector DSP (SAMIRA) is presented, intended primarily for baseband signal processing applications. The SAMIRA DSP is built in 0.13 µm UMC technology and runs at a maximum clock frequency of 212 MHz. The processor combines SIMD and VLIW parallelism and represents the first silicon prototype based on the synchronous transfer architecture (STA). The implementation results demonstrate the quantitative performance of the processor.

I. INTRODUCTION

Baseband signal processing architectures require enormous processing bandwidth in order to support computationally intensive algorithms. The conventional approach to designing a broadband digital receiver uses dedicated hardware to achieve the required real-time signal processing. However, if flexibility is also required, software-programmable or reconfigurable processors become attractive. Various software-programmable processors for baseband signal processing have been proposed recently. The common feature of all these architectures is the exploitation of some form of parallelism, e.g. data-level parallelism (SIMD), instruction-level parallelism (superscalar, VLIW), or task-level parallelism (hyper-/super-threading). Parallelism not only boosts the performance of a processor architecture, it can also reduce power consumption, for instance by allowing lower clock frequencies and by reducing control overhead. Among these approaches, vector processors have been proposed as an enabling technology for software-defined radios, because vector operations are natural for many signal processing algorithms, e.g. matrix operations, linear transforms such as the FFT, DCT, WT [6] and RT [5], digital filters [6], etc. Various vector DSP architectures as well as applications have been presented recently, e.g. in [7,8]. Vector processing architectures are also the focus of this paper.

A. Motivation

The goal of the presented work was the development of a concept for highly flexible signal processing acceleration engines targeting the domain of broadband wireless communications. In particular, the motivation was the definition and development of:
• an automated, integrated design flow based on a unified architecture description (architecture description language) and a unified description of algorithms (MATLAB),
• a compiler framework as well as a MATLAB programming interface based on an algebraic model exploiting SIMD parallelism.

As a case study of the presented framework, a vector DSP with the codename "SAMIRA" targeting the domain of baseband processing for broadband wireless applications was developed and fabricated in silicon [1]. In this paper we focus primarily on the SAMIRA vector DSP architecture and the implementation results. More information concerning the automated design flow, the DSP implementation and DSP programming can be found in [1,2,3].

This paper is organized as follows: In Section II the DSP platform concept is presented. In Section III the Samira DSP is introduced and in Section IV the implementation results are demonstrated.

II. DSP PLATFORM

A. Overview

The DSP platform applied in this work represents a simplification of the TTA architecture [4]. Compared to TTA, the motivation was to spend as few resources as possible on programmability. A possible way to achieve this goal is to simplify the instruction processing portion as much as possible without affecting the performance of the data processing portion. In order to maximize processor efficiency, as much of the control logic as possible is moved off-chip into the software development flow. Thus, static instruction scheduling is performed in the software development phase, and the execution of instructions is triggered explicitly by control signals supplied in the instruction word.

B. Architecture abstraction levels

In order to reduce the design effort (i.e. specification, development, implementation, optimization, verification, programming, etc.), a template-based hierarchical architecture abstraction is used. Three architecture abstraction levels are specified (bottom-up):
• Module level - at this level the architecture is seen as a network of modules, each directly controlled by a sub-instruction word of the VLIW. In this context, the concept of the synchronous transfer architecture was adapted.
• Core level - the modules are (virtually) arranged (clustered) in order to form processor blocks, e.g. the data paths, address generation units, memory interfaces, dedicated interconnections, etc.
• Top level - this level describes the system interfacing of the DSP core and is characterized by a common bus and control interface.

This work was supported by the German Science Foundation (DFG) and the German Federal Ministry of Education and Research (BMBF).

Figure 1. Elementary building blocks of STA: modules and multiplexers (sources, input selection, operation/function, result register).

C. STA Micro-architecture

The Samira vector DSP is based on the STA micro-architecture [2]. The principle of STA is illustrated in Fig. 1. In order to perform an operation on some source nodes, an input selection followed by a function has to be performed, and the result is stored in a register. Using this encapsulation, two canonical building blocks are defined that enable the construction of arbitrarily complex architectures: basic modules and multiplexers. In general, basic modules have an arbitrary number of input and output ports. Basic modules implement some functionality, which may be a function (e.g. arithmetic, logic, transfer) or a storage element (e.g. register files or memories). In the STA hardware concept, a system is built up from basic modules: the output ports of one module are connected to the input ports of others through an interconnection network formed by multiplexers, as sketched in Fig. 2. Both the functionality of the basic modules and the input multiplexers are explicitly controlled by processor instructions. At each cycle the instruction configures the multiplexing network and the functionality of the basic modules. Thus, the whole system forms a synchronous network which at each clock cycle consumes and produces data; the produced data will in turn be consumed by other basic modules in the next cycle. Due to this synchronous transfer of data between basic modules we have named the architecture STA [2].

Figure 2. STA Architecture
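To make the execution model just described concrete, the following sketch gives a purely behavioral model of one STA clock cycle in Python. It is not the SAMIRA RTL or tool flow; module names, instruction fields and functions are invented for illustration. Each sub-instruction selects, per module, which output registers feed its inputs (the input multiplexers) and which function the module performs; the result is latched into the module's output register and becomes visible to other modules in the next cycle.

```python
# Illustrative behavioral model of one STA clock cycle (not the SAMIRA RTL).
# Each module has an output register; each cycle the instruction word selects,
# per module, which output registers feed its inputs and which function it runs.

FUNCS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "nop": lambda a, b: a,          # pass-through / hold
}

def sta_cycle(out_regs, sub_instructions):
    """out_regs: dict module_name -> current output-register value.
    sub_instructions: dict module_name -> (function, src_a, src_b),
    where src_a/src_b name the modules whose output registers are read."""
    next_regs = dict(out_regs)                      # registers not written keep their value
    for module, (func, src_a, src_b) in sub_instructions.items():
        a = out_regs[src_a]                         # input multiplexer: select source register
        b = out_regs[src_b]
        next_regs[module] = FUNCS[func](a, b)       # combinational function, latched at clock edge
    return next_regs

# Example: a tiny "datapath" with two ALU-like modules and a register-file port.
regs = {"rf_port": 3.0, "alu0": 2.0, "alu1": 0.0}
instr = {"alu0": ("mul", "rf_port", "alu0"),        # alu0 <= rf_port * alu0
         "alu1": ("add", "alu0", "rf_port")}        # alu1 <= alu0 (old value) + rf_port
print(sta_cycle(regs, instr))                       # {'rf_port': 3.0, 'alu0': 6.0, 'alu1': 5.0}
```

Note that in the example alu1 consumes the previously registered value of alu0 without any register-file write in between; the output register itself acts as the forwarding buffer, which is the basis of the energy argument made below.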

A further advantage of buffering the module outputs is the decoupling of computational stages. In algorithms with high data locality, data from the output registers are passed directly to the input of the next module without first being stored in a register file or memory. This saves a significant amount of energy and increases computational throughput. In addition, the output registers are clock gated, which further reduces power consumption. The simplicity and modularity of the STA concept enables the automatic generation of RTL and simulation models of processor cores from a machine description. This allows the generation of processor cores with different characteristics, e.g. register file size, memory capacity, interconnection network, functional units, data types and amount of SIMD vector parallelism.

D. DSP Core

In general, any clustering of STA basic modules is possible at the core abstraction level. Basically, the STA architecture supports data and instruction level parallelism. Instruction level parallelism is supported since, at each cycle, a wide instruction controls each basic module and the multiplexing network in VLIW fashion. This poses two problems. On the one hand, for large STA systems the multiplexing interconnection network becomes a critical part of the design. On the other hand, a wide instruction memory is needed. The complexity of the interconnection network can be alleviated by reducing the number of connections between ports. An obvious strategy for this is application-specific optimization of the interconnections. This is a viable approach, since the applications are known at design time of the processor and thus the multiplexing interconnection network can be customized in order to fulfill die size and power consumption requirements. The design of the instruction decoder requires special attention. Due to the large number and variety of modules and multiplexers in the designs (Fig. 2), the width of the decoded instruction word is unacceptable in terms of both memory requirements and power consumption. To reduce the instruction memory footprint we apply code compression techniques based on the reduction of statistical redundancy in the uncompressed VLIW.
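The paper does not detail the compression scheme. As one plausible illustration of removing statistical redundancy from wide instruction words, the hypothetical sketch below builds a dictionary of the most frequent sub-instruction words and replaces them with short indices; a hardware decoder would expand the indices back at fetch time. It is an illustrative assumption, not the actual SAMIRA decoder.

```python
# Hypothetical dictionary-based VLIW compression sketch (illustrative only).
# Frequent sub-instruction words are stored once in a table; the compressed
# stream keeps only short indices, which the decoder expands each cycle.
from collections import Counter

def build_dictionary(vliw_words, max_entries=256):
    counts = Counter(sub for word in vliw_words for sub in word)
    return [sub for sub, _ in counts.most_common(max_entries)]

def compress(vliw_words, dictionary):
    index = {sub: i for i, sub in enumerate(dictionary)}
    return [[index[sub] for sub in word] for word in vliw_words]

def decompress(compressed, dictionary):
    return [[dictionary[i] for i in word] for word in compressed]

# Example: many NOP sub-instructions make the uncompressed stream highly redundant.
program = [("vfpu_mac", "nop", "ld_vec"), ("nop", "nop", "st_vec"), ("nop", "nop", "nop")]
d = build_dictionary(program)
c = compress(program, d)
assert decompress(c, d) == [list(w) for w in program]
print(d, c)
```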

III. SAMIRA VECTOR DSP

Based on the concept described above, a high-performance and low-power floating-point vector DSP (SAMIRA) was built in 0.13 µm UMC silicon technology, targeting the domain of wireless baseband processing. Although Samira was primarily specified for floating-point computations, it also supports full dual-mode floating-point/fixed-point processing.

A. Vector DSP Architecture

The block diagram of the Samira DSP core is depicted in Fig. 3. Samira comprises separate scalar and vector data paths and a program control unit consisting of an instruction decoder and a program sequencer (SEQ). The architecture is completed with on-chip scalar, vector and program memories. The computational power of the SAMIRA DSP is mainly delivered by one scalar floating-point unit (SFPU) and two vector floating-point units (VFPU1, VFPU2).

Figure 3. A simplified block diagram of the Samira vector-DSP core (program, scalar and vector memories; decoder and sequencer SEQ; scalar units SREG, SFPU, SMUL, SSHIFT, SALU1-4, SIF, SLOGIC; vector units VREG, VFPU1, VFPU2, VSHIFT, VALU, VIF; interconnection unit ICU).

Vector processing is based on the SIMD computational model. Each vector floating-point unit is able to compute eight single-precision floating-point operations (multiplication/addition/conversion) per cycle. In addition to the floating-point units, a vector fixed-point ALU (VALU), a shifter (VSHIFT) and a local vector register file (VREG) are embedded in the vector data path. The frequently used conditional operand selection operation is carried out in the VIF unit (vector IF). The scalar data path is dedicated to scalar operations, e.g. floating-point operations, address generation, flag evaluation, etc. The scalar part comprises four ALUs (SALU1...SALU4), a multiplier (SMUL), a shifter (SSHIFT), a scalar selection unit (SIF), a logic unit (SLOGIC) and a scalar floating-point unit (SFPU). The interconnection unit (ICU) supports a variety of data transfers, e.g. intra-vector permutations and the interchange of data between scalar and vector units. In contrast to other vector DSP architectures that incorporate crossbar switches in their ICUs, only a subset of data permutations is supported in SAMIRA. This approach saves chip area while still enabling the efficient computation of standard signal processing algorithms. In order to enable high bandwidth as well as to minimize power consumption, the program and data memories are embedded on chip. The data vector memory (VMEM) consists of 8k words of 256 bits, giving a total VMEM capacity of 256 Kbyte. The scalar memory (SMEM) can store 8k words of 32 bits, i.e. 32 Kbyte in total. The instruction memory (IMEM) can store programs of up to 64 Kbyte (note that the width of a compressed instruction is 128 bits).
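As a behavioral illustration of the SIMD model described above, the following plain-Python sketch shows an 8-lane vector multiply-accumulate and a VIF-style conditional operand selection. The lane count and the idea of the operations are taken from the description; the function names and data are invented for the example, and this is not the SAMIRA instruction set.

```python
# Behavioral sketch of 8-lane SIMD operations (illustrative, not the SAMIRA ISA).
LANES = 8  # each VFPU processes 8 single-precision values per cycle

def vmac(acc, a, b):
    """Element-wise multiply-accumulate over one 8-element vector word."""
    return [acc[i] + a[i] * b[i] for i in range(LANES)]

def vif(cond, a, b):
    """VIF-style conditional operand selection: per lane, pick a[i] or b[i]."""
    return [a[i] if cond[i] else b[i] for i in range(LANES)]

a    = [1.0] * LANES
b    = [float(i) for i in range(LANES)]
acc  = [0.0] * LANES
mask = [i % 2 == 0 for i in range(LANES)]

print(vmac(acc, a, b))          # [0.0, 1.0, 2.0, ..., 7.0]
print(vif(mask, a, b))          # a in even lanes, b in odd lanes
```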

IV. DSP IMPLEMENTATION AND RESULTS

The processor has been designed with the design flow and tool chain developed at TU Dresden in combination with the Synopsys tool chain. In this flow, degrees of freedom for registers, memories, functional units, operations and parallelism provide design flexibility and enable truly tailored designs. The compiler-friendly architecture allows for the automatic generation of code, core, tools, testbenches and test patterns. The technical features of the Samira DSP are summarized below (a numerical consistency check follows the list):
• Performance: 3.6 GFLOP/s, 1.7 GMAC/s, 3.7 GOP/s @ 212 MHz
• Memory bandwidth: 3.4 Gbit/s ext-int, 67 Gbit/s int-int
• 17 IEEE single-precision floating-point units
• ~480k NAND gates
• 2.4 mm² logic area on UMC 0.13 µm
• 2.9 Mbit on-chip SRAM
• Multiple memory banks
• Compressed VLIW decoder
• Memory built-in self test (MBIST)
• 9 parallel scan chains, JTAG interface
• Clock gating
• ~8 µs for a floating-point complex 256-point FFT @ 212 MHz
• Power consumption: core ~0.38 mW/MHz, core & memories ~1.7 mW/MHz
• 120-pin PQFP package
• Fabricated by IMEC, July 2005
• Application domain: mobile high-performance signal processing
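The headline figures above are mutually consistent. The short check below, using only numbers quoted in this paper, shows that two 8-lane VFPUs plus one SFPU give 17 floating-point units and roughly 3.6 GFLOP/s at 212 MHz, and that the stated memory word widths reproduce the quoted capacities.

```python
# Sanity check of the quoted figures using only numbers stated in the paper.
f_clk = 212e6                                  # maximum clock frequency [Hz]
fpus  = 2 * 8 + 1                              # two 8-lane VFPUs + one SFPU = 17
print(fpus, fpus * f_clk / 1e9)                # 17 units -> ~3.6 GFLOP/s peak

vmem_bytes = 8 * 1024 * 256 // 8               # 8k words x 256 bit = 256 Kbyte
smem_bytes = 8 * 1024 * 32 // 8                # 8k words x 32 bit  =  32 Kbyte
imem_instr = 64 * 1024 * 8 // 128              # 64 Kbyte / 128-bit words = 4096 instructions
print(vmem_bytes // 1024, smem_bytes // 1024, imem_instr)
```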

Based on observations made while running benchmark algorithms on Samira, the following units within the architecture were identified as the most significant in terms of power and area consumption: SREG, SFPU, SMUL, SSHIFT, SALU1-4, SEQ, VREG, VFPU1-2, VSHIFT, VALU, ICU and the DECODER.

A. Power consumption

The power consumption of the SAMIRA DSP was measured on a reference test board. As the memories and the core logic of SAMIRA are supplied through the same pin, it was not possible to measure the power consumption of these entities independently. Whereas the complete logic is clock gated, the memories receive the clock directly without any gating. The data sheet of the memories specifies an almost constant power consumption as long as the clock is toggling; hence, the toggle rate of the memories' address and data buses increases the overall power consumption only marginally. The benchmark algorithm used for the power tests was a manually optimized 256-point complex FFT (1882 clock cycles of simulation time). The measured average power consumption of the whole Samira DSP running at a clock frequency of 104 MHz and a core supply voltage of 1.15 V is 187 mW, corresponding to 1.8 mW/MHz. The power consumption of the Samira core without memories is 40 mW, corresponding to 0.38 mW/MHz. The energy consumption of the DSP core per FFT with and without memories is 3.4 µJ and 0.73 µJ, respectively. Detailed information concerning the power consumption is given in Table II. Notably, 61% of the core energy is consumed by the computational resources. Competing DSPs consume at least twice the energy of SAMIRA.

B. Area consumption

The area consumption of the entities is shown in Table I. The DSP core itself occupies an equivalent area of about 460 kGates. It is important to notice that the achieved chip area efficiency, defined as the ratio of the area occupied by the computational resources (FPUs) to the total core area, is above 40%. For better area utilization, the multiplexer networks in the data paths have to be subject to further optimization.
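The reported energy figures follow directly from the measured power and the benchmark length. The short check below, using only values quoted in this section, reproduces the per-FFT energies and the mW/MHz figures, and relates the cycle count to the ~8 µs FFT time quoted earlier (the computed value is slightly higher, about 8.9 µs).

```python
# Reproducing the quoted energy-per-FFT figures from the measured values.
cycles  = 1882                     # manually optimized 256-point complex FFT
f_meas  = 104e6                    # clock frequency during power measurement [Hz]
t_fft   = cycles / f_meas          # ~18.1 us per FFT at 104 MHz

p_total = 187e-3                   # measured power, core + memories [W]
p_core  = 40e-3                    # measured power, core only [W]
print(p_total * t_fft * 1e6)       # ~3.4 uJ per FFT (core + memories)
print(p_core  * t_fft * 1e6)       # ~0.7 uJ per FFT (core only)
print(p_total / 104 * 1e3,
      p_core  / 104 * 1e3)         # ~1.8 and ~0.38 mW/MHz

print(cycles / 212e6 * 1e6)        # ~8.9 us per FFT at the 212 MHz maximum clock
```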

TABLE II. RELATIVE AND ESTIMATED POWER CONSUMPTION OF THE SAMIRA DSP CORE WITHOUT MEMORIES. THE ESTIMATES ARE BASED ON THE MEASUREMENTS AND ON A POWER ANALYSIS OF AN EVENT-BASED SIMULATION OF THE BACK-ANNOTATED NETLIST.

Entity Name    Relative Power Consumption   Estimated Power Consumption [uW/MHz]
DECODER        6.3%                         24.1
ICU            0.9%                         3.3
MUX-NETWORK    22.9%                        87.2
VALU           2.4%                         9.3
VFPU1          30.7%                        116.7
VFPU2          30.7%                        116.7
VSHIFT         1.3%                         5.0
VREG           2.7%                         10.1
SREG           0.8%                         3.2
SFPU           0.3%                         1.0
SMUL           0.2%                         0.7
SEQ            0.2%                         0.6
SALU1          0.1%                         0.5
SALU2          0.2%                         0.8
SALU3          0.1%                         0.3
SALU4          0.1%                         0.3
SHIFT          0.0%                         0.1
SUM            100%                         380.0

Estimated core power efficiency: 61.7% (FPU power consumption / core power consumption)

TABLE I. AREA AND RELATIVE AREA CONSUMPTION OF THE SAMIRA DSP CORE WITHOUT MEMORIES.

Entity Name    Area (kgates)   Relative Area
DECODER        24.34           5.4%
ICU            12.21           2.7%
MUX-NETWORK    131.16          28.9%
VALU           11.78           2.6%
VFPU1          88.71           19.5%
VFPU2          88.71           19.5%
VSHIFT         9.28            2.0%
VREG           41.01           9.0%
SREG           23.37           5.1%
SFPU           11.34           2.5%
SMUL           4.13            0.9%
SEQ            0.59            0.1%
SALU1          1.52            0.3%
SALU2          1.52            0.3%
SALU3          1.52            0.3%
SALU4          1.52            0.3%
SHIFT          1.22            0.3%
SUM            453.94          100.0%

Core area efficiency: 41.6% (FPU area / total core area)

V. CONCLUSION

In this paper we have presented the first vector DSP silicon prototype based on the STA micro-architecture. We briefly explained the architectural features of STA and described the design concept used. Based on this, the architecture of the SAMIRA vector DSP was presented. The results demonstrate the effectiveness of the proposed architectural concept. The layout of the fabricated SAMIRA vector DSP is shown in Fig. 4.

Figure 4. Samira DSP layout (size 5x5 mm); visible blocks: scalar unit, vector unit, scalar memory, program memory and vector memories.

REFERENCES

[1] H. Seidel, G. Cichon, E. Matúš, T. Limberg, P. Robelly and G. Fettweis, "Development and Implementation of a 3.6 GFLOP/s SIMD-DSP using the Synopsys Toolchain," in Proceedings of the 14th Synopsys Users Group Conference (SNUG'05), Munich, Germany, May 2005.
[2] G. Cichon, P. Robelly, H. Seidel, E. Matúš, M. Bronzel and G. Fettweis, "Synchronous transfer architecture (STA)," in Proceedings of the 4th International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS'04), Samos, Greece, July 2004, pp. 126-130.
[3] J. P. Robelly, G. Cichon, H. Seidel and G. Fettweis, "A HW/SW design methodology for embedded SIMD vector signal processors," International Journal of Embedded Systems (IJES), January 2005.
[4] H. Corporaal, Microprocessor Architectures: from VLIW to TTA, John Wiley & Sons, New York, 1997.
[5] J. Gamec and J. Turan, "Motion Analysis Based on Invertible Rapid Transform," Radioengineering, vol. 8, no. 2, June 1999, pp. 12-19.
[6] L. R. Rabiner and C. M. Rader, eds., Digital Signal Processing, selected reprints, IEEE Press, New York, 1972.
[7] T. Richter, W. Drescher, F. Engel, S. Kobayashi, V. Nikolajevic, M. Weiss and G. Fettweis, "A platform-based highly parallel digital signal processor," in Proceedings of the IEEE Custom Integrated Circuits Conference (CICC'01), May 2001, pp. 305-308.
[8] M. Horst, K. Berkel, J. Lukkien and R. Mak, "Recursive Filtering on a Vector DSP with Linear Speedup," in Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP'05), July 2005.