Rapid Development of Optimized DSP Code From a High Level Description Through Software Estimations

Alain Pegatoquet, Emmanuel Gresset
VLSI Technology
505, Route des Lucioles, 06560 Valbonne, FRANCE
33 4 92 96 11 00
[email protected]

Michel Auguin, Luc Bianco
Université de Nice, Laboratoire I3S
41, Bld Napoléon III, 06041 Nice, FRANCE
33 4 97 25 82 54
[email protected]
ABSTRACT
Generating optimized DSP code from a high level language such as C is very time consuming, since current DSP compilers are generally unable to produce efficient code. We present a software estimation methodology, starting from a C description, that supports rapid development of DSP applications. Our tool VESTIM provides both a performance evaluation of the assembly code generated by the compiler and an estimation of an optimized assembly code. Blocks of the G.721 and G.728 applications have been evaluated using VESTIM. Results show that the estimations are very accurate and allow software development time to be significantly reduced.

Keywords
DSP, Code generation, Performance Estimation.
1. INTRODUCTION
Programming DSP applications in a high level language such as C is becoming more prevalent as applications become increasingly complex. However, current DSP C compilers are generally unable to exploit the DSP specific architectural features to produce efficient assembly code [2]. Therefore, in order to respect tight real-time constraints, programmers commonly write DSP code by hand. However, programming in assembly language becomes increasingly difficult since DSP applications are becoming larger and more complex. Furthermore, writing efficient assembly code for new DSP architectures such as VLIW processors is a very challenging task. In order to write as little assembly code by hand as possible, we propose a methodology to make more efficient use of DSP C compilers. By respecting C coding rules such as using pointers instead of indexed arrays, the performance of the generated assembly code may be significantly improved. However, the optimization work must be focused on critical parts, which are typically inner nested loops. Without any specific tool this is a very involved task, which may be as time consuming as writing assembly code by hand. In order to assist programmers we propose a specific tool, VESTIM, that provides two kinds of estimation: a performance evaluation of the assembly code generated by the C compiler, obtained without any target DSP simulation, and an estimation of an optimized (i.e. hand-written) assembly code from the C code. The latter metric represents a lower bound on execution times. Comparing this asymptotic value with the performance of the generated assembly code makes it easier to determine whether the C compiler has produced efficient code. Thus software estimations guide programmers in optimizing time-critical routines. In this paper we first describe the main causes of inefficiency in RISC based DSP C compilers. Then we show that, using C coding rules and software estimations, the performance of the generated assembly code may be significantly improved. In a third part we describe our software estimation methodology and finally, results with significant applications are presented.

Permission to make digital/hardcopy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 99, New Orleans, Louisiana (c) 1999 ACM 1-58113-109-7/99/06..$5.00
2. DSP CODE GENERATION
2.1 Inefficiencies of DSP C compilers
DSPs have some specific architectural characteristics [1] which are difficult or impossible for a compiler to exploit when generating optimized assembly code that respects tight real-time constraints. It is shown in [2] that the classical DSP compiler design flow is also a cause of inefficiency: since the compiler is most often designed once the architecture is fixed, it must be adapted to the processor, generally leading to sub-optimal results [2]. Moreover, commercial DSP C compilers are often derived from the GNU-C compiler, which targets general purpose processors (RISC, CISC). This approach leads to some inefficiencies for DSPs. For example, the register allocation phases of GNU-C consider homogeneous register sets, whereas DSPs commonly have a reduced and heterogeneous register set which is very difficult for a compiler to match without overhead. Benchmarks with various signal processing algorithms show that memory spilling (saving temporary registers in memory) as well as specific addressing modes (modulo, bit reversal) or instructions (MAC, conditional branch,...) represent the major overhead. Memory spilling is a problem mainly due to the reduced register set of DSPs. Considerable research has been conducted recently in order to alleviate this problem [13]. However, the proposed methods are often specific to a restricted DSP architecture model [3]. As a consequence, applications are often implemented partially or totally by hand on a DSP. This increases time-to-market. Indeed, recent applications are more and more complex and often require several months of work to implement optimized code by hand, which limits portability and reuse on other DSPs.
2.2 Improving the generated code
Our goal is to help the programmer use the C compiler as much as possible (and write as little assembler as possible). It is indeed possible to improve the quality of the generated assembly code by tuning the original C source code according to the target compiler (i.e. target DSP) [12]. For example, using pointers instead of indexed arrays and limiting variable life ranges improve the performance of the code produced by the C compiler [4]. Furthermore, in order to assist the compiler in its optimization process, C coding rules can be derived for a target DSP [4][5]. But without any tools, profiling the generated assembly code is a tedious task that needs time consuming target processor simulations. In order to guide the programmer from the C description of the application, we propose a tool that helps to:
• locate computation intensive parts of the code,
• provide a metric of the quality of the produced assembly code.
This tool allows target blocks for optimization to be identified, as well as parts of the C code that are efficiently compiled or have no significant influence on global performance.
3. A GENERIC ESTIMATION MODEL
In order to perform estimations we first collect dynamic information. The dynamic information represents the number of executions of each C statement (or basic block). Dynamic information may be obtained using different approaches. In [6], the authors propose the calculation of the worst case computational requirements using a static method (i.e. information is collected at or before compile time). However, this method requires the programmer to provide information such as loop upper bounds. This approach can be inappropriate for complex applications with deeply interdependent or nested control structures. Therefore we use a “dynamic” statistical approach.

3.1 Collect Dynamic Information
We adapt a method [7] based on the execution of the C code with a test sequence. The C code is then annotated with this dynamic information. The test sequence coverage must correspond to a good approximation of the worst case execution time. Our tool VESTIM provides:
• Performance of the generated assembly code (LST).
• Estimations of a hand-written assembly code from an intermediate description (RTL) of the application.
It is important to note that these metrics are obtained without any target DSP simulation, which avoids the need to conduct time consuming profiling simulations [14]. Performance of the generated assembly code is obtained by multiplying the number of executions of each basic block by the number of instructions of the generated assembly code (LST) per C statement.

3.2 Estimation of an optimized assembly code
In order to evaluate the quality of the LST code, we compute, from the RTL (Register Transfer Language) intermediate representation of the application provided by the GNU based OakC Compiler (OCC) [5], a lower performance bound which represents an estimation of an optimized hand written assembly code. Figure 1 shows the estimation flow. First, a control flow graph (CFG) representing the structure of the application is built. Each basic block has its own dynamic information. The estimation process determines the number of cycles of each basic block for a target DSP processor. In order to support a class of modified Harvard type DSP architectures we adopt an internal generic processor model. The target DSP is described using an external processor description detailed in section 3.2.4. This approach corresponds to the model used in [8]. The VLSI Technology VVF3500 DSP, which stands for the DSP Group OakDSPCore™, is the first DSP included in our estimator.

3.2.1 A DSP oriented computational scheme
During our experiments we noted that we could not perform accurate estimation using the RTL representation since it is RISC oriented in the GNU C compiler. RISC architectures have a load-store model: each memory access or operation is performed through registers. However, for many DSPs this is a restrictive model leading to unnecessary operations.

[Figure 1: Estimation from an RTL description. Flow: RTL Description, CFG Construction, Internal Representation for each Basic Block (DAG); RTL (RISC) rewritten into the DSP Intermediate Representation, DIR (DSP), using rules from programmer's experience or experimentation; task annotation by priority level; Scheduling (List Scheduling) driven by a Processor Description (OAK, PINE, PALM); Estimations.]

Moreover, the RTL representation contains operations not optimized by the compiler, i.e. operations that an experienced DSP programmer rarely uses. We therefore defined a set of rewriting rules that is applied to the RTL representation in order to match as closely as possible a DSP execution scheme. This DSP Intermediate Representation (DIR) is a Directed Acyclic Graph (DAG) of the application suitable for estimating a class of modified Harvard DSP architectures. Each node of the DAG is an operation while edges represent data dependencies between these operations. Figure 2 shows the DIR for a FFT butterfly (real part only) obtained from a C code. This figure shows that load/store operations have been removed, leading to a representation closer to the structure of a DSP data path.
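The effect of removing load operations can be illustrated with a small sketch. The actual rewriting rules of VESTIM are not listed in this paper, so the rule below (folding a register load into the memory operand of its consumer) is only an assumed example of the idea; the names `Op` and `fold_loads` and the operand strings are hypothetical, and stores are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str     # operation name, e.g. "load", "mpy", "add"
    dst: str      # destination register
    srcs: tuple   # source registers or memory operands


def fold_loads(rtl_ops):
    """Assumed rewriting rule: drop explicit loads and let the consuming
    operation read the memory operand directly, as a DSP data path would."""
    loaded = {}    # register -> memory operand it currently mirrors
    dir_ops = []
    for op in rtl_ops:
        if op.name == "load":
            loaded[op.dst] = op.srcs[0]   # remember the operand, emit nothing
            continue
        srcs = tuple(loaded.get(s, s) for s in op.srcs)
        loaded.pop(op.dst, None)          # dst no longer mirrors a memory cell
        dir_ops.append(Op(op.name, op.dst, srcs))
    return dir_ops


# RISC-like sequence: two loads feeding a multiply-accumulate
rtl = [
    Op("load", "r1", ("XRAM[a]",)),
    Op("load", "r2", ("YRAM[b]",)),
    Op("mpy", "P", ("r1", "r2")),
    Op("add", "ACC", ("ACC", "P")),
]
dir_ops = fold_loads(rtl)
for op in dir_ops:
    print(op)
```

In this toy case the four RISC-style operations reduce to a multiply reading both memories directly and an accumulate, which is the shape a scheduler can later merge into a single MAC instruction.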
3.2.2 DIR annotation by priority level
Before the scheduling phase, each node of the DAG is annotated with a priority level. This number is equal to the cycle-based distance between this node and the most distant leaf node of the DIR. This cycle-based distance takes into account the pipeline parallelism of the DSP: two sequential nodes in the DIR have the same priority level if they can be executed in a pipelined or parallel fashion in the DSP. This information is given in the target processor description file. The annotation gives priority to operations that allow more distant leaf nodes to be scheduled earlier, since nodes with high priority levels are scheduled first.

[Figure 2: DAG representation of a FFT butterfly. Nodes: VAL XRAM, VAL YRAM, LITT, MPY, ADD, SUB, SHIFT, ASSIGN XRAM and ASSIGN YRAM, with edges labelled by operand locations (RMi, ACC, P); each node is annotated with a priority level from 5 down to 1.]
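The priority annotation of section 3.2.2 can be sketched as a longest-path computation on the DAG. This is a minimal illustration under stated assumptions, not the VESTIM implementation: every operation is assumed to cost one cycle, leaf nodes get level 1, and the set `chained` (a hypothetical stand-in for the information read from the processor description file) lists the edges whose two nodes may share a cycle.

```python
def annotate_priorities(successors, chained=frozenset()):
    """Return {node: priority level}.

    successors: dict mapping a node to the list of nodes that consume it.
    chained:    set of (node, successor) edges assumed to execute in
                pipeline or in parallel, so both ends keep the same level.
    """
    nodes = set(successors)
    for succs in successors.values():
        nodes.update(succs)

    levels = {}

    def level(node):
        if node not in levels:
            succs = successors.get(node, [])
            if not succs:
                levels[node] = 1          # leaf node of the DIR
            else:
                levels[node] = max(
                    level(s) + (0 if (node, s) in chained else 1)
                    for s in succs
                )
        return levels[node]

    for node in nodes:
        level(node)
    return levels


# Toy DAG: two memory reads feed a multiply, then an add, then a store
dag = {
    "val_x": ["mpy"],
    "val_y": ["mpy"],
    "mpy": ["add"],
    "add": ["assign"],
}
priorities = annotate_priorities(dag, chained={("mpy", "add")})
print(priorities)
```

With the multiply and the add declared as chained, both receive level 2, the memory reads level 3 and the final assignment level 1.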
3.2.3 Scheduling phase
The objective of the scheduling phase is to aggregate operations of the DIR representation into a minimum number of cycles corresponding to instructions of the target DSP. We use a list scheduling heuristic [9] which works at operation level rather than instruction level, since a DSP instruction may be composed of several operations (e.g. MAC). Rather than a GNU-like instruction level pattern matching, we work at operation level, which is more suitable for matching DSPs with more parallelism (such as new dual MAC architectures). The list scheduling algorithm manipulates a list of data-ready operations ordered by their priority level. Data-ready operations are nodes of the DIR whose predecessor nodes have all been scheduled. Once an operation is scheduled, its successors that become data-ready are inserted in the first position among the operations of the same priority. This technique, combined with the DIR annotation described in 3.2.2, allows the variable life range to be reduced. This is an essential factor for programming DSPs since they generally have a restricted and specific register set.
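The handling of the data-ready list can be sketched as follows. This is a simplified illustration that only produces a scheduling order: it ignores resources and cycle packing, and the function name `list_schedule_order` and the toy graph are assumptions made for the example.

```python
def list_schedule_order(successors, priority):
    """Order operations with a priority-sorted data-ready list.

    A newly ready successor is inserted in front of the waiting
    operations that have the same priority level.
    """
    nodes = set(successors)
    for succs in successors.values():
        nodes.update(succs)

    # number of predecessors not yet scheduled, per node
    pending = {n: 0 for n in nodes}
    for succs in successors.values():
        for s in succs:
            pending[s] += 1

    # initial data-ready list, highest priority first (name breaks ties)
    ready = sorted((n for n in nodes if pending[n] == 0),
                   key=lambda n: (-priority[n], n))
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for s in successors.get(node, []):
            pending[s] -= 1
            if pending[s] == 0:
                pos = 0
                # skip only strictly higher priorities, so that s lands
                # first among the operations of equal priority
                while pos < len(ready) and priority[ready[pos]] > priority[s]:
                    pos += 1
                ready.insert(pos, s)
    return order


graph = {"a": ["c"], "b": ["d"], "c": ["e"], "d": ["e"], "e": []}
prio = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 1}
order = list_schedule_order(graph, prio)
print(order)  # ['a', 'b', 'd', 'c', 'e']
```

In the example, `d` becomes ready after `c` but is scheduled before it because it is the successor of the most recently scheduled operation; this keeps a freshly produced value close to its consumer, which is the life-range effect described above.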
3.2.4 Processor description
Different approaches are generally considered for modeling target processors. The structural processor description used in VESTIM is a list of possible implementations of DIR operations. As depicted in Figure 3, each implementation or “basic operation” is defined by a name such as "add", "sub" or "mpy" labeled with:
• memory and operating resources constraints
• register constraints
• number of cycles
In (1) each basic operation is characterized by a name, its input operands, an operating or a memory resource (ut) and its output operands. In (2) and (3) constraints are defined which are used to schedule and allocate the operation. UTfree (respectively REGfree) represents all the memory and operating (respectively registers) resources which must be free to schedule the operation. Note that UT and REG are processor dependent. For example (5), (6) and (7) describe resources of the Oak DSP core. In order to support multi-cycle instructions the number of cycles (4) to execute a basic operation is provided. Since we focus on performance
estimation rather than actual optimized code generation, VESTIM uses an abstract processor description. Thus operations are gathered into classes if they involve common operating resources in the processor. The use of classes significantly simplifies the processor description: our processor description file is about ten times smaller than the machine description file used in the Oak-C compiler.

Basic_Operation_name = {input} ; ut ; {output}    (1)
UTfree = {ut}    (2)
REGfree = {reg / reg ∈ REG}    (3)
Nb_Cycle = integer    (4)
where input ∈ (REG ∪ RM), output ∈ (REG ∪ RM) and ut ∈ UT
UT = {MEMX, MEMY, MULT, ALU, BS, CTRL, CG, VAL, ASSIGN, ADR, EXP}    (5)
REG = {X, Y, RR, ACC, P, RI, RJ}    (6)
RM = {RMi, RMix, RMsd, RMld}    (7)
where
RMi is a memory indirect addressing mode
RMix is a memory indexed addressing mode
RMsd is a short direct memory addressing mode
RMld is a long direct memory addressing mode

Figure 3: Processor description
3.2.5 Estimation process
The algorithm schedules an operation of the DIR at a cycle n if its output operand and required resources (ut, UTfree and REGfree) are available. We assume that an instruction is found when no more basic operations can be scheduled at cycle n. Using this approach, the internal parallelism of the data path of the processor is taken into account. But UTfree and REGfree can also be used to model restrictions resulting from the instruction set encoding that lead to exclusive use of resources. Let us take the hatched node “Mpy” of Figure 2 as an example. This operation performs a product between two values from data memories X and Y using an indirect addressing mode. Input edges of the “Mpy” node are annotated when its predecessor nodes "VAL xram" and "VAL yram" are scheduled. Then the following pattern is associated with the node “Mpy”:
< Name: Mpy ; Input: RMi, RMi >
This pattern is used to select a basic operation in the processor description. The scheduling of the basic operation is performed only if resources defined by ut, UTfree, REGfree and “output” are free.

(a) Mpy = {RMi, RMi} ; MULT ; {P}
    UTfree = {CG}
    REGfree = {X, Y}
    Nb_Cycle = 1
(b) Add = {P, ACC} ; ALU ; {ACC}
    UTfree = {None}
    REGfree = {None}
    Nb_Cycle = 1

Figure 4: Two basic operations of the Oak DSP core

For instance, resources MULT, CG, X, Y and P (Figure 4.a) must be free to schedule this “Mpy” basic operation. Thereafter the scheduling algorithm tries to schedule other basic operations of the DAG at the same cycle. This is the case with the hatched node “Add” of Figure 2 corresponding to the “Add ACC, P” basic operation (Figure 4.b), if the ALU and ACC resources are available at cycle n. As UTfree and REGfree are set to “None” there are no other resource constraints on scheduling this basic operation. With the two read memory accesses scheduled before the “Mpy” operation in the DAG, a MAC instruction is found since no more operations can be scheduled at cycle n. In conclusion the scheduling algorithm is able to deal with the parallelism in the DSP data path and restrictions on parallelism
due to the encoding of instructions (ISA DSP model). These restrictions are expressed through constraints on parallel use of classes of resources. This approach is well suited to recent DSP architectures that integrate multi-MAC units.
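The resource test applied when packing basic operations into one cycle can be sketched with the two entries of Figure 4. This is a simplified illustration under stated assumptions: single-cycle operations, data dependencies already resolved by the data-ready list, and hypothetical names (`BasicOp`, `required`, `pack_cycle`) that are not taken from VESTIM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicOp:
    name: str
    inputs: tuple
    ut: str                       # operating or memory resource used
    outputs: tuple
    ut_free: frozenset = frozenset()
    reg_free: frozenset = frozenset()
    nb_cycle: int = 1


# The two basic operations of Figure 4
MPY = BasicOp("Mpy", ("RMi", "RMi"), "MULT", ("P",),
              ut_free=frozenset({"CG"}), reg_free=frozenset({"X", "Y"}))
ADD = BasicOp("Add", ("P", "ACC"), "ALU", ("ACC",))


def required(op):
    """Resources that must be free to schedule op: ut, UTfree, REGfree, output."""
    return {op.ut} | op.ut_free | op.reg_free | set(op.outputs)


def pack_cycle(candidates):
    """Greedily pack data-ready operations into one cycle.

    An operation is accepted only if none of its required resources has
    been claimed by an operation already packed in this cycle.
    """
    busy = set()
    packed = []
    for op in candidates:
        need = required(op)
        if busy.isdisjoint(need):
            busy |= need
            packed.append(op.name)
    return packed, busy


packed, busy = pack_cycle([MPY, ADD])
print(packed)   # ['Mpy', 'Add']
```

Here `required(MPY)` is {MULT, CG, X, Y, P} and `required(ADD)` is {ALU, ACC}, so both fit in the same cycle, whereas a second `Mpy` offered in that cycle would be rejected because MULT is already claimed.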
4. RESULTS
To illustrate the efficiency of VESTIM, blocks of applications G.721 [10] and G.728 [11] have been evaluated. In Table 1, the LST column represents the performance evaluation provided by VESTIM of the assembly code generated by OCC. Values in the ASM column come from an optimized hand-written implementation, while the last column represents estimations of an optimized assembly code using our methodology. The comparison of these last two columns shows that software estimations from a C code are very close to an optimized hand-written assembly code (the average margin is about 10%). The performance figures provided in the LST column are accurate to within a 2% margin (on average) of the measured performance when the LST code is executed on the target processor simulator (with the same test sequences). The benefit of using VESTIM is first to provide accurate estimations without any time consuming target processor simulation. These estimations guide programmers during the optimization work. Results in the LST column allow computation intensive parts of the code to be quickly located, while the last column provides an accurate metric of the quality of the generated assembly code. This information helps programmers use C compilers more efficiently, thereby reducing the development time of optimized DSP code from a high level description of the application.

Application  Name     LST (MIPS)     ASM (MIPS)  RTL Estimation (MIPS)
G.728        Hwmcore  3.75 (+46%)    2.567       2.7 (+5.2%)
G.728        Bloc17   11.1 (+57%)    7.05        7.15 (+1.5%)
G.728        Bloc14   6.27 (200%)    3.14        3.18 (+1.2%)
G.728        Bloc50   33.86 (710%)   4.71        5.77 (+22%)
G.721        Global   23 (+93%)      11.9        13.93 (+17%)
G.721        Fmult    9.5 (219%)     4.4         4.34 (-1.4%)

Table 1: Results for blocks from G.728 and G.721

For example, from a valid C code of G.728, an optimized assembly code for the Oak DSP core was developed in 3 months. The performance of this encoder and decoder is 35 MIPS. Using the Oak C compiler and VESTIM, we obtained in 2 weeks an assembly code requiring about 52 MIPS. If about 10% of the C code is implemented in assembler (2 weeks of work) the required MIPS are down to 40. This experiment shows that the software development time is reduced by two thirds while the MIPS and code size are increased by 15% compared to a hand-coded implementation.

5. CONCLUSION AND FUTURE WORKS
In this paper we have presented a generic software estimation method which provides both the performance of the generated assembly code and an estimation of an optimized assembly code. These metrics help programmers use the target C compiler more efficiently and write as little assembly code as possible. This approach allows time-to-market to be significantly reduced. The main characteristics of this model include an internal generic processor model suitable for the estimation of performance on a class of modified Harvard type DSP architectures. The target DSP is described using an external processor description where data path constraints and restrictions on parallelism due to instruction encoding are expressed through constraints on parallel use of classes of resources.

The OakDSPCore is the first DSP included in our estimator, but the generic capability of the estimator is also well suited to describing DSPs with more parallelism such as new dual MAC architectures. Future work will thus be focused on a processor description for the PalmDSPCore from DSP Group and VLSI Technology. Finally, we think that VESTIM may be useful for performing coprocessor extraction by providing an "optimal" processor description. Parts of the application for which the optimal execution times are much lower than the real performance estimations for the original target architecture may allow the identification of potential coprocessors. We believe that our generic approach will also be helpful for designers wishing to explore new processor architectures aimed at meeting the demands of future high performance DSP applications.
6. REFERENCES
[1] Edward A. Lee. Programmable DSP Architectures: Part 1. IEEE ASSP Magazine, October 1988.
[2] Vojin Zivojnovic et al. DSP Processor/Compiler Co-Design: A Quantitative Approach. Proc. ICSPAT, pp. 679-683, Boston, MA, USA, October 7-10, 1996.
[3] Guido Araujo and Sharad Malik. Code Generation for Fixed-Point DSPs. ACM Transactions on Design Automation of Electronic Systems, Vol. 3, No. 3, July 1998.
[4] C. Liem, P. Paulin and A. Jerraya. Address Calculation for Retargetable Compilation and Exploration of Instruction-Set Architectures. 33rd DAC, Las Vegas, Nevada, June 3-7, 1996.
[5] VVF3500 C Compiler. Revision 1.0. Getting Started With the OakDSPCore C Compiler. VLSI Technology, 1996.
[6] S. Malik et al. Static Timing Analysis of Embedded Software. 34th DAC, pp. 147-152, Anaheim, CA, 1997.
[7] Marc Soler et al. An Embedded DSP Platform for multistandard ITU G.728, G.729 and G.723.1 audio compression. Proc. ICSPAT, Boston, MA, October 7-10, 1996.
[8] Jie Gong et al. Software Estimation from Executable Specifications. Technical Report ICS-93-5, March 8, 1993.
[9] Rizos Sakellariou et al. Efficient Implementation of the Row-Column 8x8 IDCT on VLIW Architectures. EUSIPCO, Vol. 2, pp. 869-872, Greece, Sept. 7-11, 1998.
[10] Recommendation G.721, 32 kbit/s Adaptive Differential Pulse Code Modulation. ITU (1984).
[11] Recommendation G.728, Coding of Speech at 16 kbit/s using Low-Delay Code Excited Linear Prediction. ITU (1994).
[12] C. Liem et al. Industrial Experience using Rule-Driven Retargetable Code Generation for Multimedia Applications. 8th Symposium on System Level Synthesis, September 1995.
[13] G. Goossens et al. Embedded Software in Real-Time Signal Processing Systems: Design Technologies. Proceedings of the IEEE, Vol. 85, No. 3, March 1997.
[14] J-H Yang et al. MetaCore: An Application Specific DSP Development System. 35th DAC, pp. 800-803, CA, 1998.