Rapid Prototyping for Configurable System-on-a-Chip Platforms: A Simulation Based Approach
Jens Bieger, Sorin A. Huss, Michael Jung, Stephan Klaus and Thomas Steininger
Integrated Circuits and Systems Laboratory, Department of Computer Science, Darmstadt University of Technology
Alexanderstr. 10, 64283 Darmstadt, Germany
E-mail: {bieger|huss|mjung|klaus|steininger}@iss.tu-darmstadt.de
Abstract
The design of any application on a configurable System-on-a-Chip (SoC) such as Atmel’s FPSLIC is subject to many constraints stemming from the requirements of the application and the limitations of the architecture. In a top-down approach, a real-time MPEG 1 Layer 3 (MP3) decoder is designed on this SoC, which integrates FPGA resources and an AVR microcontroller core within a single chip. An intensive design space exploration based on simulations at different levels of abstraction is fundamental for a real-time implementation on this limited architecture. After determining a suitable functional partitioning, a special DSP is implemented on the FPGA, for which an instruction set simulator is built that allows concurrent HW/SW development.
1. Introduction
As systems-on-a-chip (SoC) are a major revolution taking place in the design of integrated circuits, the development of applications on these new architectures is a challenging task as well. The complexity of such systems is rising, while the time-to-market window is shrinking. This leads to the need for a fast design flow starting with high level models.
This work describes the design of a real-time MPEG 1 Layer 3 (MP3) decoder on Atmel’s AT94K family of FPSLIC devices. This architecture integrates FPGA resources, an AVR microcontroller core, and SRAM within a single chip. Besides the real-time requirements, the design has to cope with the limitations of the architecture: the small computational power of the AVR, especially for floating point arithmetic, the scarce memory resources and the size of the FPGA. Typical stand-alone MP3 players are based on commercially available DSP cores (see for example the Micronas MAS3509, the VLSI Solutions VS1001 or the STMicroelectronics STA013). Since our intention is to evaluate the methodologies of concurrent HW/SW design, we chose the FPSLIC, because it provides more flexibility for the design process.
Starting with a pure functional model, an implementation is to be found in a top-down approach. An intensive design space exploration is mandatory in order to find a valid implementation. Therefore a functional partitioning and a mapping of these functional units to the available resources must be derived. Finding a suitable partitioning is not sufficient, however: code and algorithmic optimizations also have to be performed to fulfill the real-time requirements. Another possible design space exploration for an MPEG-2 decoder based on the SPADE methodology is given in [1].
Because of decreasing design times, there is a need for parallel HW/SW development, which implies the application of simulators, since the real hardware platform is not available until late in the design process. All design decisions on the different levels of abstraction are taken based on intensive simulations. A simulation based approach implementing an energy-efficient MP3 decoder is also advocated in [2].
[Figure 1 content: simulation levels ordered by increasing model accuracy and simulation time: Functional Simulation, Transaction Level Simulation, Instruction Set Simulation, Cycle Accurate Simulation]
Figure 1. Different Simulation Levels
Figure 1 illustrates the different simulation levels, from a pure functional simulation down to a cycle accurate simulation. Increasing model accuracy comes at the price of longer simulation times. Therefore, at each level of abstraction the appropriate model should be used for design decisions in order to save implementation and validation time.
2. Design Considerations
The FPSLIC is a low complexity and low cost device, which makes it especially attractive for our purpose for several reasons. Whereas the incorporation of reconfigurable hardware provides a lot of flexibility for design decisions, the severe resource constraints cut down the design space. This work presents how HW/SW-Codesign methodologies help to tackle these real life conditions.
This section introduces the necessary basic principles of the MP3 standard and the FPSLIC platform.
2.1. MP3
MPEG 1 Layer 3, as specified in [3], is a lossy audio coding technology, which is capable of achieving a compression ratio of 1:12 in comparison to CD quality PCM coded audio without introducing a quality degradation perceivable by the human ear. The first step in MP3 coding is the transformation of ‘amplitude level per time’ to ‘amplitude level per frequency’ samples, which is applied repeatedly to short time frames of 576 samples (see [3] for a detailed description). A mathematical model of the human sense of hearing, called the psychoacoustic model, computes the assignment of bits used to code each of the frequency samples of a frame, with the objective of minimum perceivable signal degradation. The samples are then requantized to match their respective assignment of bits, which is the lossy step in MP3 encoding. In order to be able to rescale the requantized samples during decoding, scalefactors are encoded into the MP3 bitstream. The requantized samples are Huffman coded in a final step, for which the ISO standard defines 32 fixed Huffman code tables.
Apart from the fact that no psychoacoustic model is needed, the decoding process is basically the inverse of the coding process. This is illustrated in the top slice of Figure 2. The transformation of frequency to time samples is a composition of the inverse modified discrete cosine transform (IMDCT), the frequency inversion and the polyphase filter steps.
A few constraints are placed on the quality parameters of the MP3 decoding process in order to achieve a fully working implementation on the envisaged architecture (no stereo decoding, 16-bit fixed-point computational accuracy and 22.05 kHz sampling frequency at 64 kBit/s). Since our main goal was to evaluate SoC design methodologies, we do not consider this a problem.
2.2. Hardware Platform
The AT94K FPSLIC product family (see [4]) from Atmel, Inc. integrates up to 40K gates of FPGA resources, an AVR 8-bit RISC microcontroller core and up to 36K Bytes SRAM within a single chip. The AVR provides 129 instructions, most of which execute within a single clock cycle. This results in a 20+ MIPS throughput at 25 MHz clock rate. The FPGA resources within the AT94K devices are based on Atmel’s AT40K FPGA architecture. The FPGA part is connected to the AVR over an 8-bit data bus. Both the AVR microcontroller core and the FPGA part are connected to the embedded memory separately. The up to 36K Bytes SRAM are organized as 20K Bytes program memory, 4K Bytes data memory and 12K Bytes that can be dynamically allocated as data or program memory.
2.3. Design Flow and Tools
We first give a short overview of the various tools that were applied in our design flow. The transaction level MP3 decoder model is implemented with the SystemC libraries version 1.2. C source code for the AVR microcontroller is compiled with the GNU Compiler Collection (GCC) version 3.2. The FPGA configuration bitstream is synthesized from VHDL with Exemplar Logic Leonardo Spectrum v20001b. Atmel IDS 7.2 is applied for placement and routing. Atmel System Designer v2.1 provides the functionality to integrate FPGA and AVR machine code into a single configuration bitstream. For cycle accurate HW/SW cosimulation the VHDL simulator Model Technology ModelSim v5.5a and the AVR debugger AVR Studio v2.0 are applied.
3. Design Space Exploration
An intensive design space exploration on different levels of abstraction is necessary to fulfill all constraints stemming from the requirements of the application and the limitations of the architecture. This section motivates the decisions that led to a high level system partitioning with four concurrently executing, communicating units (see Figure 2).
Certain boundaries of the design space were already given through the selection of hardware components. The MultiMediaCard is accessed via the SPI (serial peripheral interface) protocol and the selected DAC (digital analog converter) features an I2S interface. Neither of these interfaces is supported by the FPSLIC directly. Since it is clear without extensive analysis that a pure software implementation is not reasonable here, a certain amount of FPGA resources has to be reserved for these tasks.
ISO provides an unoptimized but well documented and easy to understand software reference implementation of the MP3 decoding algorithm [5]. We adopted the functional decomposition of this implementation (see topmost layer of Figure 2). Timing analysis of the reference implementation executed on a workstation provides the numbers shown in Figure 3 for the various functional modules. As can be seen from this profile, the IMDCT and polyphase filter bank steps have the highest computational demands. Analyzing the source code reveals that these functions are very data flow intensive, whereas all the preceding steps are more control flow intensive. As a typical microcontroller the AVR is better suited for control flow algorithms. Due to the AVR’s 8-bit architecture and slow multiplier unit it is not reasonable to implement the data flow intensive IMDCT and polyphase filter bank steps in software. Though the demand on computational resources for the IMDCT and filter bank steps can be alleviated considerably by applying the efficient DCT algorithm according to Lee [6], those steps are still the most demanding in the application. As a first partitioning step it was thus decided to implement the control flow intensive steps in software, whereas the computationally demanding steps are implemented on the FPGA (see Figure 2, lowest layer).
Together with MMC and DAC control this already constitutes the final coarse grained partitioning into four units. A port of the ISO reference implementation to the hardware specification language SystemC [7] was modified and analyzed to figure out the demands for buffering between those four units. The responsibilities and allocated resources for those four modules are detailed in the remainder of this section.
[Figure 2 content: four layers labelled Functional Simulation, Transaction Level Simulation, Instruction Set Simulation and Cycle Accurate HW/SW-Cosimulation (Implementation). Block labels: sideinfo decoding, scalefactor decoding, huffman decoding, descaling, reordering, anti-aliasing, IMDCT, frequency inversion, polyphase filter; Bitstream Decoding Unit (control flow intensive algorithms), Sample Transformation Unit (data flow intensive algorithms); MMC Flash-Memory Card, AVR 8Bit RISC-MCU, AVR/FPGA Port, UART (RS232), DSP Simulator, Dual Ported SRAM, AT40K FPGA, Ring Buffer, DAC & Amplifier. Legend: Input/Output Resource, Computational Resource, Communication Resource, Allocation]
Figure 2. Four Layers of Abstraction
[Figure 3 content: profile with the segments Huffman, Dequantizing, Reordering, Stereo, Antialiasing, IMDCT, Polyphase]
Figure 3. Profile of the ISO Model
3.1. MMC Controller
The MMC Controller implements the SPI protocol used to communicate with the MultiMediaCard. It provides a continuous stream of input data to the MP3 decoding logic. A small amount of FPGA resources is allocated to the lowest layer of the SPI interface. An interrupt handler on the AVR implements a finite state machine, which handles the MMC protocol. The direct port between the AVR and the FPGA is exclusively occupied by the MMC Controller.
3.2. Bitstream Decoding Unit (BDU)
The control flow intensive first half of the MP3 decoding algorithm is executed by the Bitstream Decoding Unit. This comprises sideinfo, scalefactor and Huffman decoding as well as descaling, reordering and anti-aliasing of the frequency samples. Further implementation and optimization decisions for this unit are detailed in section 4. Most of the AVR processing resources are allocated to this unit. A considerable share of the available data RAM is allocated to various buffers and tables of constants. Most of the available code RAM is occupied by the different decoding functions.
3.3. Sample Transformation Unit (STU)
The mapping from frequency samples to samples over time, which is the data flow intensive part of MP3 decoding, is performed in this unit. This comprises the inverse modified discrete cosine transform (IMDCT), the frequency inversion and the polyphase filter bank. Some decisions on the design of this unit are detailed further in section 5. A major part of the FPGA resources is assigned to this task. The demand for data RAM can be kept low by storing all intermediate results in the buffers that decouple the BDU from this unit. A certain fraction of data RAM is assigned to tables of constants.
3.4. DAC Controller
The DAC Controller fetches samples from a ring buffer and forwards them to the digital analog converter via the I2S protocol. A clock prescaler is applied to achieve the required frequency of 22.05 kHz. This unit can be implemented with a small amount of FPGA resources.
4. Software Optimization (BDU)
A naive port of the BDU relevant parts of the ISO reference implementation to the AVR proved to be too demanding in both space and time. In order to speed up the execution of functions and to minimize memory demand, local optimizations of the algorithms are mandatory. It is important to take the specific architectural properties of the FPSLIC/AVR into account. The transaction level simulation and runtime analysis of the AVR software show Huffman decoding to be the most demanding task the AVR is burdened with. Storing the 32 Huffman tables as binary trees, which is the most memory efficient way, takes up half of the available data RAM of 12kB. The program memory portion of the RAM offers a 16 bit data bus with single cycle access time, while the data RAM is accessed via an 8 bit data bus taking two cycles per transfer. Therefore, the tables are moved from the data to the code section. We have written a software tool capable of reading the textual representation of the Huffman tables and converting them to C code automatically. The tables are sorted lexically by Huffman code word and afterwards a tree of linked node objects is built recursively that holds the decoded values within its leaves. By traversing this tree the C code representation of the tree is generated (see Figure 4).
5.1. STU Design Space Exploration
There are several possibilities of realizing such dedicated hardware, and we have evaluated three different approaches: an application specific controller, a small DSP core and a pipelined DSP. The application specific controller uses the same algorithm for the IMDCT and the matrix operation of the polyphase filterbank, exploiting the great similarities between both transformations. A recursive algorithm is applied for this purpose, which would reduce the number of coefficients needed (and thus the required memory) to about a fifteenth of the non-recursive version. Unfortunately, the required logic took up about 80% of the available FPGA area and the very limited routing resources restricted the maximum clock frequency to less than 2 MHz. As a second approach we examined a small DSP core specifically adapted to the transformations used, which would fetch its instructions from RAM. By using an optimized reduced instruction set we achieve less control logic overhead, resulting in a smaller and faster design and a very flexible programmable architecture. This time the combinatorial logic depth of our naive design restricted the maximum clock frequency to about 5 MHz. The third architecture implements a pipeline with six stages, based on the previous DSP design. This architecture took 65% of the FPGA resources and raises the speed to 12 MHz. According to the analyses only 11 MHz are needed for a real-time implementation, so this is clearly a suitable generic architecture.
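[Figure 4 content: a Huffman table, the corresponding Huffman tree and the generated C code]
Huffman Table:
x     0  0    1   1
y     0  1    0   1
hlen  1  3    2   3
hcod  1  001  01  000
Huffman Tree: binary tree with edges labelled 0 and 1 and the leaves 00, 10, 01 and 11
C Code:
if (get1bit()) code = 0x0;
else if (get1bit()) code = 0x10;
else if (get1bit()) code = 0x1;
else code = 0x11;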
5.2. DSP Design
We used numerous well known processor design concepts to implement the DSP. Our design is a pipelined RISC architecture with six pipeline stages [8]. As two clock cycles are required to load an instruction, three instructions can be processed simultaneously. The pipeline has to be stalled repeatedly for various reasons (e.g. only one bus to the SRAM for instructions and data, or a full ring buffer to the D/A unit).
Figure 4. Converting Huffman Tables into C Code
The synthesized C routine achieves a performance gain of 40% compared to the implementation using decoding tables stored in the data RAM. The code size after compilation of the C code almost equals the former table size. To minimize the memory demand for variables and tables, various binary formats are used for different kinds of data. For instance, the size of the floating point constants used as factors for multiplications is reduced by using an 8 bit mantissa instead of a 16 bit mantissa. As the multiplication results are truncated to 16 bits, there is very little loss of accuracy.
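The decision-tree style of Figure 4 can be tried out in isolation with a minimal C sketch. The bit reader below (the struct `bitreader` and a `get1bit` that takes a reader argument) is a hypothetical stand-in for the decoder's real bitstream access, and `decode_pair` hard-codes only the four-entry table shown in Figure 4. Packing x into the high nibble and y into the low nibble follows the constants in the figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader over a byte buffer (an assumption,
 * not the decoder's actual bitstream interface). */
struct bitreader {
    const uint8_t *data;
    size_t         nbits;  /* total number of valid bits */
    size_t         pos;    /* index of the next bit to read */
};

/* Returns the next bit, or 0 once the buffer is exhausted. */
int get1bit(struct bitreader *br)
{
    assert(br != NULL && br->data != NULL);
    if (br->pos >= br->nbits)
        return 0;
    int bit = (br->data[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return bit;
}

/* Decodes one (x, y) pair of the table in Figure 4 as nested branches.
 * The result holds x in the high nibble and y in the low nibble. */
uint8_t decode_pair(struct bitreader *br)
{
    if (get1bit(br))
        return 0x00;        /* hcod 1   -> x = 0, y = 0 */
    if (get1bit(br))
        return 0x10;        /* hcod 01  -> x = 1, y = 0 */
    if (get1bit(br))
        return 0x01;        /* hcod 001 -> x = 0, y = 1 */
    return 0x11;            /* hcod 000 -> x = 1, y = 1 */
}
```

Each code word costs at most one branch per bit and needs no table lookup in data RAM, which is the property the generated code relies on.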
5. Hardware Implementation (STU)
The functional simulation of the ISO decoder led to the decision to move the IMDCT and polyphase algorithms to the FPGA, because a dedicated hardware multiplier is needed that is faster than the one provided by the AVR core.
Figure 5. DSP Block Diagram
and hardware. However, these tools perform cycle accurate simulation, including the evaluation of timing behavioral models for software and hardware, and therefore need rather long simulation runtimes. Because of short time to market demands, this is not adequate for verifying the correctness of complex algorithms such as the IMDCT used by the decoder. The hardware specification may change during hardware development, therefore the simulator has to be flexible and easy to adapt to changes. Using a higher abstraction level makes it easier to accomplish this by reducing the complexity of the simulator thus yielding a fast prototyping system.
The DSP-like architectural concepts provided by the proposed processor are: a MAC unit capable of computing a 16 bit fixed-point multiply-add operation within two cycles, a dedicated saturation logic, a separate 24 bit accumulator for higher precision summing calculations, three dedicated register banks each containing 32 registers with one cycle access time, and three registers with built-in increment logic containing the program counter, the data memory pointer and the ring buffer pointer, respectively (see Figure 5). The design differs from common approaches in that we did not implement any forwarding logic, in order to reduce the chip area needed for control logic. Data hazards have to be avoided by changing the instruction order or by inserting a NOP instruction. An example is given in case 1 of Figure 6: there is a read-after-write hazard between the first two instructions, which can be avoided by swapping instructions two and three.
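The interplay of the 16 bit multiply-add, the 24 bit accumulator and the saturation logic can be illustrated with a small C model. It is only a sketch under assumptions that the text does not state: operands are taken to be Q15 fixed-point values, the product is scaled back by 2^15 (truncating toward zero) before it is added, and the accumulator saturates at the limits of a signed 24 bit number. The names `sat24`, `mac24` and `acc_to_q15` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ACC24_MAX  ((int32_t)0x007FFFFF)   /* largest signed 24 bit value  */
#define ACC24_MIN  (-ACC24_MAX - 1)        /* smallest signed 24 bit value */

/* Clamps a wider intermediate result to the signed 24 bit range. */
int32_t sat24(int32_t v)
{
    if (v > ACC24_MAX) return ACC24_MAX;
    if (v < ACC24_MIN) return ACC24_MIN;
    return v;
}

/* One multiply-add step: acc + a * b, with a and b read as Q15 values
 * (assumed format). The accumulator must already be a valid 24 bit value. */
int32_t mac24(int32_t acc, int16_t a, int16_t b)
{
    assert(acc >= ACC24_MIN && acc <= ACC24_MAX);
    int32_t product = ((int32_t)a * (int32_t)b) / 32768;  /* back to Q15 */
    return sat24(acc + product);
}

/* Saturates the accumulator to a 16 bit result for storing. */
int16_t acc_to_q15(int32_t acc)
{
    if (acc > INT16_MAX) return INT16_MAX;
    if (acc < INT16_MIN) return INT16_MIN;
    return (int16_t)acc;
}
```

In this model the eight extra accumulator bits act as headroom: a sum of several products may temporarily exceed the 16 bit range and is clipped only when the result is written back.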
[Figure 6 content: DSP assembler listing with two cases]
Case 1 (RAW hazard between ldi and adda y0):
                        ; access array element a[9]
    ldi  y0 #9          ; load 9 into y0
    adda y0             ; add y0 to address reg.
    adda x15            ; add x15 to address reg.
    ...
Case 2:
    st   z6             ; store results
    st   z7
    st   z8
    rjmp @Overlap_Add   ; jump to next processing step
    nop                 ; no op in branch delay slot
Figure 7. DSP Simulator GUI
Therefore, we designed a DSP simulator capable of instruction level simulation of the DSP’s functional behavior. It includes a disassembler and mimics program execution according to the functional specification of the DSP hardware. Simulation results can be inspected and manipulated directly through the graphical user interface (see Figure 7) or saved to a file for further analysis with other tools. As we work at a higher abstraction level, simulation times are shortened dramatically and reach about 10% of real time performance, which is four orders of magnitude faster than the cycle accurate simulation usually employed for cosimulation. As illustrated in Figure 2, the results of the AVR software calculations are transmitted via the RS232 serial interface to the PC and are then further processed by the DSP simulator software. The final calculation results are stored in a file.
Figure 6. DSP Assembler
Jump instructions are available, and since we use a pipelined RISC architecture, the pipeline would have to be stalled and flushed whenever a branch is taken. To avoid this, we took the same approach as implemented in MIPS processors: we included a branch delay slot, so that the instruction following a branch is always executed, eliminating the need to restart the pipeline [9]. In Figure 6 (case 2) this could be exploited by moving the st z8 instruction behind the relative jump rjmp. Some of the problems and methodologies involved in creating software for architectures that do not resolve data dependencies in hardware are shared with the VLIW processor concept described in [10]. Likewise, state of the art processor designs rely on the compiler to reduce the hardware complexity and to improve parallelism and throughput of the processor. A recent example is the Intel Itanium processor, where instruction level parallelism must be handled by the compiler. The ALU implementation, the instruction set and the size of the register bank are derived from the processor’s main purpose: multiply-add calculations [11],[12].
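The effect of the branch delay slot on program order can be made concrete with a toy instruction-level interpreter in C. The four-opcode instruction set (`OP_LDI`, `OP_ADD`, `OP_JMP`, `OP_HALT`), the register file size and the function `run` are invented for this illustration and do not correspond to the instruction set of the DSP described here. Only the rule that the instruction after a jump is always executed is taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum opcode { OP_LDI, OP_ADD, OP_JMP, OP_HALT };  /* hypothetical ISA */

struct insn {
    enum opcode op;
    uint8_t     rd;    /* destination register (LDI, ADD) */
    int16_t     imm;   /* immediate (LDI, ADD) or absolute target (JMP) */
};

#define NUM_REGS 4

/* Executes prog until OP_HALT, the end of the program or max_steps.
 * A jump takes effect only after the following instruction (the delay
 * slot) has been executed. Returns the number of executed instructions. */
size_t run(const struct insn *prog, size_t len, int16_t regs[NUM_REGS],
           size_t max_steps)
{
    assert(prog != NULL && regs != NULL);
    size_t pc = 0;      /* instruction to execute now */
    size_t next = 1;    /* instruction to execute afterwards */
    size_t steps = 0;

    while (pc < len && steps < max_steps) {
        const struct insn *i = &prog[pc];
        size_t after_next = next + 1;   /* default: sequential flow */

        steps++;
        switch (i->op) {
        case OP_LDI:
            regs[i->rd % NUM_REGS] = i->imm;
            break;
        case OP_ADD:
            regs[i->rd % NUM_REGS] =
                (int16_t)(regs[i->rd % NUM_REGS] + i->imm);
            break;
        case OP_JMP:
            after_next = (size_t)i->imm;  /* redirect after the delay slot */
            break;
        case OP_HALT:
            return steps;
        }
        pc = next;
        next = after_next;
    }
    return steps;
}
```

Keeping two program counters (`pc` and `next`) is one simple way to model the slot: a jump only changes what follows the instruction that is already queued, so useful work such as a pending store can be placed there instead of a NOP.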
6. Communication and Synchronization
As detailed in section 3, the architecture of the complete system is composed of four units that execute concurrently. In order to achieve a maximum degree of parallelism these units have to be decoupled with buffers and synchronization mechanisms (see Figure 2, lowest layer). Since the STU does not compute samples at a static rate, but rather in bursts, it has to be decoupled from the DAC with a ring buffer of sufficient size. Whereas simulations on functional and transaction level only allowed us to estimate the necessary minimum buffer size, the simulation and analysis on instruction set level provided enough information to fix this value to 79 samples. Synchronization is achieved by stalling the pipeline of the STU’s DSP during write access to a filled up ring buffer.
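The decoupling of the STU from the DAC can be sketched as a fixed-size ring buffer in C. The capacity of 79 samples is the value given above; everything else (the type `ring`, the functions `ring_put` and `ring_get`, the separate fill counter) is an assumption for illustration. A rejected write corresponds to the situation in which the DSP pipeline would be stalled, and a rejected read to the DAC finding no sample. The sketch is sequential and does not model concurrent hardware access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 79   /* buffer size in samples, as fixed in the text */

struct ring {
    int16_t samples[RING_CAPACITY];
    size_t  head;    /* next slot to write */
    size_t  tail;    /* next slot to read  */
    size_t  count;   /* number of stored samples */
};

/* Producer side (STU). Returns false when the buffer is full, which is
 * the case in which the producer has to wait. */
bool ring_put(struct ring *r, int16_t sample)
{
    assert(r != NULL);
    if (r->count == RING_CAPACITY)
        return false;
    r->samples[r->head] = sample;
    r->head = (r->head + 1) % RING_CAPACITY;
    r->count++;
    return true;
}

/* Consumer side (DAC). Returns false when no sample is available. */
bool ring_get(struct ring *r, int16_t *sample)
{
    assert(r != NULL && sample != NULL);
    if (r->count == 0)
        return false;
    *sample = r->samples[r->tail];
    r->tail = (r->tail + 1) % RING_CAPACITY;
    r->count--;
    return true;
}
```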
5.3. DSP Simulator
The DSP simulator is intended to facilitate the development of efficient code for the IMDCT and the polyphase filterbank. There are commercial coverification tools available to test correctness and integrity of both software
Communication between the BDU and the STU is implemented via buffers. These are located in the FPSLIC’s data RAM, which features two separate buses and thus allows concurrent access by both the FPGA and the AVR. The MP3 frame size of 576 frequency samples determines the necessary size of such a buffer. The two stages are decoupled by providing two buffers: at each point in time, one is written to by the BDU while the samples in the other one are consumed by the STU.
In order to decouple the BDU from the MMC Controller two buffers are used in the same way as described in the previous paragraph.
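The two-buffer scheme between the BDU and the STU can be sketched in C as follows. The buffer size of 576 samples follows the frame size named above; the type `pingpong`, the `ready` flag and the functions `pp_write_buf`, `pp_publish`, `pp_acquire` and `pp_release` are hypothetical names. The sketch models only the hand-over of buffer ownership, not the concurrent bus access of the real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_SAMPLES 576   /* frequency samples per frame */

struct pingpong {
    int16_t buf[2][FRAME_SAMPLES];
    int     write_idx;   /* buffer currently owned by the producer (BDU) */
    bool    ready;       /* other buffer holds a frame not yet released */
};

/* Buffer the producer may fill right now. */
int16_t *pp_write_buf(struct pingpong *p)
{
    assert(p != NULL);
    return p->buf[p->write_idx];
}

/* Producer finished a frame. Fails (the producer has to wait) while the
 * consumer has not released the previously published frame. */
bool pp_publish(struct pingpong *p)
{
    assert(p != NULL);
    if (p->ready)
        return false;
    p->write_idx ^= 1;   /* swap the roles of the two buffers */
    p->ready = true;
    return true;
}

/* Consumer (STU) gets the published frame, or NULL if none is pending. */
const int16_t *pp_acquire(const struct pingpong *p)
{
    assert(p != NULL);
    return p->ready ? p->buf[p->write_idx ^ 1] : NULL;
}

/* Consumer is done with the frame; its buffer may be refilled. */
void pp_release(struct pingpong *p)
{
    assert(p != NULL);
    p->ready = false;
}
```

Separating acquire from release keeps the producer from swapping back into a buffer that is still being read.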
7. Results
Table 1 presents the simulation times for decoding an MP3 file (mono, 64kBit/s) of one minute length on an AMD Athlon 600MHz PC. The basic design decisions were made at the higher abstraction levels. Thus the design space was already narrowed when the lower abstraction levels were handled. We are convinced that our implementation could not have been achieved in a reasonable amount of time without the development and application of various simulators.
Table 1. Simulation Model vs. Time
Simulation Model               Simulation Time (sec)
ISO model                      17
TL SystemC model               41
Instruction level simulation   901
Cycle accurate simulation      4.32 * 10^7 (est.)
Figure 8 illustrates the final profile of the optimized decoder. It presents the clock cycles needed by the concurrent AVR and DSP routines for the calculation of one frame. Compared to the software only implementation of the ISO model running on the AVR, the final optimized implementation is 40 times faster.
Figure 8. Optimized Decoder Profile
The final implementation is able to decode an MP3 file in real time. As Table 2 shows, all resources are almost fully utilized. Due to occasional peaks in computational demand it is impossible to achieve 100 percent utilization.
Table 2. Resource Usage of the FPSLIC
Resource      Mean Usage in %
SRAM          97
AVR (MIPS)    73
DSP (MIPS)    82
FPGA (CLBs)   74
8. Conclusion
A top-down design flow is presented for implementing a real-time MP3 decoder on the FPSLIC SoC. As can be seen from Table 2, an intensive design space exploration and algorithmic optimizations are mandatory to obtain a valid implementation on such limited resources. Choosing the suitable level of abstraction for each design decision saves design time and leads to better implementations. In particular, the development and use of a dedicated DSP simulator enables concurrent HW/SW development and faster validation of the design and the algorithm.
References
[1] P. van der Wolf, P. Lieverse, M. Goel, D. La Hei, and K. Vissers. An MPEG-2 decoder case study as a driver for a system level design methodology, Proceedings of the seventh international workshop on Hardware/software codesign, pp. 33-37, Rome, Italy, 1999.
[2] T. Simunic, L. Benini, and G. D. Micheli. Energy-efficient design of battery-powered embedded systems, IEEE Trans. VLSI Systems, vol. 9, pp. 15–28, Feb. 2000.
[3] ISO/IEC 11172-3. Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5Mbit/s - Part 3: Audio, 1993.
[4] Atmel, Inc. Configurable Logic Data Book, 2001.
[5] D. Pan et al. ISO reference decoder source code, 1993.
[6] B. G. Lee. FCT - A Fast Cosine Transform, IEEE ICASSP ‘84, San Diego, pp. 28A.3.1-4, March 1984.
[7] AnsLab Co., Ltd. SystemC Model of an MP3 Player, www.anslab.co.kr/ip products/Ans MP3 SC.zip, 2001.
[8] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, CA, 1990.
[9] J. Hennessy and D. Patterson. Computer Organisation & Design, Morgan Kaufmann Publishers, San Mateo, CA, 1994.
[10] A. Abnous and N. Bagherzadeh. Architectural Design and Analysis of a VLIW Processor, 1991.
[11] E. A. Lee and D. G. Messerschmitt. Pipeline Interleaved Programmable DSP’s, IEEE ASSP Magazine, vol. 35, no. 9, pp. 1320-1330, Sep. 1987.
[12] E. A. Lee. Programmable DSP Architectures, Part I, IEEE ASSP Magazine, vol. 5, no. 4, pp. 4-19, Oct. 1988.