An Evaluation of Compiler-Processor Interaction for DSP Applications

Allan Frederiksen (1), Rasmus Christiansen (1), Jeff Bier (2), Peter Koch (1)
(1) Embedded Systems Group, Aalborg University, Denmark, {allan, junior, pk}@kom.auc.dk
(2) Berkeley Design Technology, Inc., Berkeley, CA 94704, U.S.A., [email protected]

Abstract

Recently, C compilers have started to be widely used in software development for embedded DSP applications. Hence, compiler performance is becoming very important for such applications. In this study, we propose a C compiler benchmarking methodology which is based on how C compilers are typically used for DSP software, and which is able to quantify the compiler-introduced overhead in terms of cycle count and code size. The methodology has been applied to three DSP processors representing three different types of architectures, and the results are discussed. We believe that the methodology outlined will prove useful in assessing the effectiveness of compilers for DSP applications, and in this way will aid DSP application developers in selecting processors, realistically estimating obtainable performance, and prudently using compilers.
1 Introduction
Recently, C compilers have started to be widely used in software development for DSP applications. However, hand-optimization in assembly language is still an inevitable part of DSP software development for many cost-sensitive applications, due to stringent processing time, power consumption, and memory use constraints, and to the uneven performance of compilers.

The objective of this paper is to evaluate the interaction between processors and their associated C compilers from a DSP perspective. In order to perform a quantitative analysis, we have designed a compiler benchmarking methodology which is able to quantify the overhead introduced by the compiler. The methodology is designed to yield results which are relevant, reliable, objective, comparable, and applicable. Relevance is ensured by the composition of the benchmark suite. Reliability and objectivity are achieved by using a realistic and generic programming style which does not constrain the compilers' code generation possibilities. Comparability is ensured by defining the benchmarks in a standardized and portable manner. Applicability refers to the fact that the methodology is applicable to any platform, e.g., fixed- and floating-point variants of general-purpose processors as well as DSP processors. (Platform denotes a processor-compiler combination.)

The benchmark is defined in ISO C, as this is the most widely used high-level language for DSP applications. In this initial study, we have chosen to work with three fixed-point platforms which represent three different trends in current DSP processor design [1]:
1) Enhanced conventional DSP processors
2) Superscalar DSP processors
3) VLIW DSP processors

The objectives are 1) to quantify how far the compiler-generated code is from being optimal, 2) to give, by thorough analysis of the results, a qualified assessment of the compilers' strengths and weaknesses, and 3) to investigate which architectural features may influence the compilers' performance.
2 Methodology
In this section, we present our compiler benchmarking methodology.
2.1 Types of benchmark programs
Applications typically follow the rule of thumb that approximately 90% of the execution time is spent in 10% of the code [2]. This is also the case for DSP applications due to the nature of the functionality common in such applications, where most of the time is spent in arithmetic-intensive loops with high locality, i.e., with a small program memory usage. This is shown in figure 1.

Figure 1 Typical composition of an application (cycle count is dominated by arithmetic-intensive kernel code; code size is dominated by non-kernel code).
To appear in "Proceedings of the 34th Asilomar Conference on Signals, Systems, and Computers"
2000 BDTI
Existing DSP benchmarks such as DSPstone [3] and the BDTI Benchmarks [4] tend to focus on the arithmetic-intensive parts of the code. We believe that there are two reasons for this: 1) execution time is often considered most important, and 2) the arithmetic-intensive kernels are what typically characterize DSP applications, and similar kernels occur in many different types of DSP applications. In a compiler benchmark, we believe that the non-kernel part of the code is very important, because this part of the code is most likely to be compiled directly into the released code. The arithmetic-intensive parts of the code are more likely to be hand-optimized or implemented with libraries.
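To make the kernel/non-kernel distinction concrete, the following sketch (our own illustration, not code from the benchmark suite) shows the kind of single-loop, arithmetic-intensive kernel that dominates DSP cycle counts: 16-bit operands with a 32-bit accumulator, as on typical fixed-point DSPs.

```c
#include <stdint.h>

/* Illustrative arithmetic-intensive kernel: a vector dot product.
   Each loop iteration should map to a single multiply-accumulate
   (MAC) instruction on most DSP processors. */
int32_t dot_product(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

Non-kernel (control-oriented) code, by contrast, has little arithmetic and poor locality, which is why it stresses different aspects of a compiler.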
2.2 Benchmark program size

Regarding the size of the benchmark programs, small programs have one property which makes them desirable: hand-optimized assembly code is obtainable and can be used as a reference. On the other hand, application-sized benchmarks are more likely to be realistic and to reflect the true performance of the compilers [5]. Thus, our compiler benchmarking methodology is based on three types of benchmarks:
1) Small time-constrained benchmarks (representing arithmetic-intensive code)
2) Small size-constrained benchmarks (representing non-kernel code)
3) Application benchmarks

2.3 Selection of benchmark programs

The selection of the time-constrained benchmarks is based on execution time profiling information from seven typical DSP applications and on comparison with kernels in existing DSP-relevant benchmarks. The size-constrained part of the code is typically what differentiates applications, and thus it is hard to find benchmarks that represent many applications. Based on a survey of DSP software, we have chosen three size-constrained benchmarks which we believe are representative of many applications. We have included two application benchmarks which represent modern DSP applications, and for which processor vendors or third parties often publish performance data based on hand-optimized assembly implementations, which allows for comparisons. We have also included a "pseudo application", which consists of the smaller benchmarks merged together; the objective is to investigate the sensitivity of the compiled code to the context in which it appears. The benchmark suite is presented in table 1.

Time-constrained benchmarks:    LMS*, Vector dot product*, Vector maximum*, FFT, Viterbi decoder
Size-constrained benchmarks:    Control, Data initialization, Header decoder
Application-sized benchmarks:   GSM EFR codec, JPEG codec, Pseudo application
Table 1 Benchmark suite.

An asterisk (*) in table 1 indicates that two versions of the benchmark are implemented:
1) Fixed: the filter order/vector length is a fixed parameter at compile time.
2) Generic: the filter order/vector length is passed as a parameter in a function call.
Initial studies showed that the two approaches produce significant differences in the cycle counts of the compiled code on two of the evaluated platforms, and we believe that both programming styles are realistic and thus relevant.

2.4 Programming style

The time- and size-constrained benchmarks are implemented in two programming styles:
1) ISO C
2) C with language extensions (CLE)
The objective of the CLE programming style is to investigate the improvement which can be achieved by using language extensions such as fractional data types, memory bank and circular buffer qualifiers, and pragmas, when such are provided by the compiler.

Each benchmark program is implemented as a function. This is a natural programming style for larger applications, and it encapsulates the code for which cycle count and memory usage are measured. The functions are declared static, which tells the compiler that the functions are not accessed from outside the file scope and gives it the possibility of optimizing each function for its actual parameters. However, each function is called twice with different parameters to ensure that the compiler implements the parameter passing, and the output is verified in order to prevent the compiler from performing inappropriate dead code elimination, and to support functional verification.

The choice of data types is specific to each platform. Therefore, benchmark variables are declared using abstract data types in order to allow the optimal data types to be applied on each platform; e.g., each multiplication in the software should be mappable onto a single multiplication in the hardware.
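The conventions above (abstract data types, static functions) can be sketched as follows. This is our own minimal illustration, assuming the abstract types resolve to 16-bit data and a 32-bit accumulator on a fixed-point platform; the type and function names are not the paper's actual ones.

```c
#include <stdint.h>

/* Abstract data types: each platform's port redefines these to its
   natural word sizes, so a multiply of two sample_t values maps to a
   single hardware multiplication. */
typedef int16_t sample_t;   /* native multiply operand width */
typedef int32_t acc_t;      /* accumulator width for products */

/* 'static' limits the function to file scope, allowing the compiler to
   specialize it for its actual call sites. In the benchmarks it is still
   called twice with different parameters, so the parameter passing
   cannot be optimized away. */
static sample_t vector_max(const sample_t *x, int n)
{
    sample_t m = x[0];
    int i;
    for (i = 1; i < n; i++)
        if (x[i] > m)
            m = x[i];
    return m;
}
```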
2.5 Benchmark scenario
The time- and size-constrained benchmarks are evaluated using the scenario depicted in figure 2. This approach was also used in the DSPstone project [3]; its advantage is that it allows for a quantification of the compiler-introduced overhead in the cycle count and memory usage metrics, as presented in table 2.

Figure 2 Benchmarking scenario for the time- and size-constrained benchmarks (the C programs are cross-compiled and profiled, the hand-written assembly programs are profiled, and the factors are computed from the two sets of cycle count, code size, and data usage measurements).

Metric     Definition
CC(ASM)    Cycle count for the hand-optimized code.
CC(C)      Cycle count for the compiled code.
CCF        Cycle count factor, CC(C) / CC(ASM).
CS(ASM)    Code size for the hand-optimized code.
CS(C)      Code size for the compiled code.
CSF        Code size factor, CS(C) / CS(ASM).
Table 2 Reported metrics.

Comparison with hand-optimized code is not practical for the application-sized benchmarks. Instead, these applications are divided into a time-constrained and a size-constrained part; cycle count [CC(C)] and code size [CS(C)] are reported individually for each part. Comparing these results with the small benchmark results and across the platforms provides valuable information about the usability of the compilers.
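As a concrete reading of table 2 (with invented numbers, not measured data), both factors are plain ratios of the compiled result to the hand-optimized reference, so a factor of 1.0 means the compiler matched hand code:

```c
/* CCF and CSF from table 2: ratio of compiled to hand-optimized result.
   A factor above 1.0 is compiler-introduced overhead. */
static double perf_factor(long compiled, long hand_optimized)
{
    return (double)compiled / (double)hand_optimized;
}
```

For example, compiled code taking 200 cycles against a 100-cycle hand-optimized reference yields CCF = 2.0; a CSF below 1.0 means the compiled code is actually smaller than the reference.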
3 Architectures and compilers

Table 3 shows the processors and the compilers used.
Processor                                           Compiler
Lucent Technologies DSP16210, released in 1997      Version 1.8.8.1, released in 1999
LSI Logic LSI402Z, released in 1998                 Version 2.0, released in 1999
Texas Instruments TMS320C6201B, released in 1997    Version 4.0, released in 2000
Table 3 Benchmarked processors and compilers.
The DSP16210 is an enhanced conventional DSP processor. Its data path is irregular, with a small, heterogeneous register file, and its instruction set is highly specialized and non-orthogonal. These properties traditionally constitute a difficult compiler target [6]. The compiler provided by the vendor is a ported GNU C compiler. The GNU compiler is designed for architectures with larger register files and more orthogonal instruction sets [7]; however, the compiler includes an additional post-optimizer which performs target-specific optimizations.

The LSI402Z is a superscalar DSP processor. Its data path is regular and has a homogeneous register file. Its instruction set is orthogonal, although it is large and provides some specialized instructions. Some limited scheduling of instructions is conducted at run-time in the hardware; however, instruction reordering does not occur, so compile-time scheduling is still necessary to obtain optimal code. The provided compiler is a ported GNU C compiler, but the compiler-processor mismatch seems less significant than in the case of the DSP16210. The LSI compiler does provide language extensions that support fractional arithmetic, but unfortunately these cannot be used without losing precision when accumulating, and they are therefore disallowed in the benchmark.

The C6201 is a VLIW DSP processor. Its data path is regular, with a large homogeneous register file, and its instruction set is orthogonal and generic. However, instruction scheduling, which is conducted at compile time, is complex due to significant latencies associated with some instructions. The processor vendor provides a compiler with advanced features such as program-level optimization and profile-based compilation.
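The fractional-arithmetic issue mentioned above can be illustrated in plain ISO C. This is our own sketch (not the paper's benchmark code), assuming Q15 operands and a Q31 accumulator: a fractional multiply-accumulate is a 16x16->32-bit integer multiply followed by a one-bit left shift, and keeping the accumulator at 32 bits preserves precision across the accumulation, which is exactly what the LSI extensions could not guarantee.

```c
#include <stdint.h>

/* Q15 multiply-accumulate in ISO C. Q15 * Q15 yields a Q30 product;
   shifting left by one gives Q31, which is accumulated in 32 bits.
   (Saturation on overflow is omitted for brevity.) */
static int32_t q15_mac(int32_t acc, int16_t x, int16_t y)
{
    return acc + (((int32_t)x * (int32_t)y) << 1);
}
```

For example, 0.5 * 0.5 in Q15 (16384 * 16384) accumulates 0.25 in Q31.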
4 Results
We have chosen to present the mean cycle count factors and the mean code size factors for the time-constrained and the size-constrained benchmarks, together with detailed results for the single-sample FIR benchmark, the Control benchmark, and the GSM EFR benchmark.

            Time-constrained    Size-constrained
            CCF      CSF        CCF      CSF
C6201       2.45     0.80       1.26     1.83
DSP16210    6.21     1.99       3.73     3.20
LSI402Z     17.11    1.95       1.45     2.39
Table 4 Mean CCF and CSF values for the time-constrained and the size-constrained benchmarks.

The mean CCF values for the time-constrained benchmarks show that the TI compiler is significantly better than the Lucent compiler, which in turn is significantly better than the LSI compiler. However, a mean CCF value of 2.45 is still far from optimal. All three targets have the most difficulty with the FFT and Viterbi decoder benchmarks, which contain nested loops, while their performance is better on the other time-constrained benchmarks, which contain only one loop. One of the main reasons for the LSI compiler's poor performance is its implementation of 16x16->32-bit multiplications: although these are directly supported in hardware, the compiler implements them as library calls, leading to significant overhead.

The CSF values for the time-constrained benchmarks are of secondary importance to the CCF values. The CSF value for the TI architecture illustrates that code size and execution time are often inversely proportional on this architecture. The mean CSF value for the size-constrained benchmarks shows that the TI compiler is better than the others when optimizing for code size, although the differences between the compilers are less significant in this respect. The CCF values for the size-constrained benchmarks are of secondary importance to the CSF values; the CCF value for the Lucent architecture, however, is notably high.

Figure 3 Cycle count factors for selected benchmarks (SS FIR generic, SS FIR fixed, and Control, for the DSP16210, LSI402Z, and C6201).

Examining the CCFs for the single-sample FIR benchmark illustrates what was found with the mean CCF values: the single-sample FIR is one of the benchmarks consisting of only one loop, and its CCF values are smaller than the mean CCF values, as described above. On the Lucent platform there is the expected performance difference between the generic and the fixed version of the single-sample FIR benchmark, indicating that the compiler is sensitive to the coding style. The LSI compiler actually performs worse on the fixed version, due to a poor loop implementation in which the fixed filter order is loaded into a register during each iteration. On the TI platform, the compiler's program-level analysis phase is able to identify the run-time properties of the filter length and specialize the generic version to the actual run-time parameters, so the generic version performs as well as the fixed version. However, on some critical loops in the GSM EFR benchmark, the extra information provided to the compiler by the program-level analysis actually leads to decreased performance, indicating room for compiler improvement.

Figure 4 Code size factors for selected benchmarks (SS FIR generic, SS FIR fixed, and Control, for the DSP16210, LSI402Z, and C6201).

The CSFs for the Control benchmark show that there are code size efficiency differences between the three platforms, although these are not as significant as the CCF differences for the time-constrained benchmarks. For the application benchmarks, only the absolute measures are available. These measures are shown for the GSM EFR benchmark in figure 5 and figure 6.

Figure 5 Total cycle count (millions of cycles) and cycle count for the size-constrained and time-constrained parts of the GSM EFR benchmark, for the DSP16210, LSI402Z, and C6201.

Figure 6 Total code size (kbytes) and code size for the size-constrained and time-constrained parts of the GSM EFR benchmark, for the DSP16210, LSI402Z, and C6201.

The total cycle count differences between the three platforms show a significant performance lead for the C6201 platform. This was to a certain degree expected, as the C6201 architecture provides more parallelism than the other two, but the performance differences are greater than what might be expected on the basis of the architecture differences alone, suggesting that the TI compiler is superior in this benchmark.
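The 16x16->32-bit multiplication issue behind the LSI compiler's poor time-constrained results comes from a very common C idiom. The sketch below (our illustration) shows the idiom as a compiler sees it: casting the 16-bit operands to 32 bits before the multiply requests the full 32-bit product, which DSP hardware computes in a single instruction, so emitting a library call here costs many cycles per multiply.

```c
#include <stdint.h>

/* The standard C idiom for a full 16x16->32-bit product. On all three
   benchmarked processors this maps to one hardware multiply; the LSI
   compiler nevertheless emits a library call for it. */
static int32_t mul16x16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}
```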
Measuring code size and cycle count for the time-constrained part and the size-constrained part individually leads to some interesting results. The cycle counts for the size-constrained part show that the LSI402Z platform obtains the same performance as the C6201 platform, whereas the DSP16210 platform uses significantly more cycles when executing the size-constrained part. On the other hand, the code size of the size-constrained part is largely in favor of the TI compiler when instruction word lengths are considered. The LSI402Z platform might be expected to perform best in terms of absolute memory usage, because it has 16-bit instructions, whereas the DSP16210 has a combination of 16- and 32-bit instructions and the C6201 has 32-bit instructions. Apparently, though, the TI compiler performs better than the LSI compiler when optimizing for code size, which reduces the expected difference between the two platforms.

The DSP16210 platform does not perform well on the GSM EFR benchmark; this is mainly due to the compiler's performance, as the architecture itself is optimized for this application. A bug in the compiler's optimizer limits the optimization level usable on some of the size-constrained files. Furthermore, the use of intrinsics precludes applying the maximum level of optimization to the time-constrained functions with this compiler.
5 Discussion
The results indicate that the Lucent compiler is far from producing optimal code. It performs better on the simple time- and size-constrained benchmarks than on the larger and more complex ones, where there is significant performance degradation.

The results indicate that the LSI compiler performs poorly on time-constrained code, basically due to a poor porting effort with the GNU compiler; e.g., 16x16->32-bit multiplications are executed as library calls. However, the LSI compiler performs better on size-constrained code, where multiplications typically are less significant.

The TI compiler performs well on most time- and size-constrained benchmarks. However, hand-optimization remains inevitable for severely time- and size-constrained applications.

Regarding compiler-processor interaction, our experiments suggest that the homogeneous register file and the orthogonal instruction set of the TI processor contribute significantly to the good compiler performance. In contrast, the benchmarking results indicate that the compiler-processor mismatch is noticeable for the enhanced conventional DSP processor. For the LSI402Z we cannot infer much regarding compiler-processor interaction, due to the poor compiler quality.

We have identified some key problem areas in the compiler-processor interaction:
1. Memory layout
2. Specialized instructions
3. Mode bits
For all three processors, achieving full data bandwidth depends on how data is placed in memory, but detailed analysis of the generated code indicates that none of the compilers distributes data optimally in memory. Incorporating data placement into the compilation process could therefore be advantageous.

The DSP16210 and the LSI402Z have specialized instructions, e.g., for Viterbi decoding, but none of the compilers exploited these instructions.

The DSP16210 and the LSI402Z use mode bits to control the data path, but the compilers assume that mode bits are constant throughout program execution, and thus they do not exploit all of the functionality of the data path. Mode bits significantly complicate the compilation process because instructions perform different operations depending on the mode bit settings.
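The memory layout and addressing problem can be seen in a small example (our own sketch, not benchmark code). A delay line written in ISO C must express wrap-around explicitly; a compiler has to recognize this idiom before it can replace the test-and-reset below with the hardware's free circular (modulo) addressing, which is one of the features the CLE circular buffer qualifiers expose directly.

```c
#include <stdint.h>

#define DELAY_LEN 64  /* illustrative buffer length */

/* Push one sample into a circular delay line and return the next
   write index. The explicit wrap is what circular addressing hardware
   does for free, if the compiler can recognize the idiom. */
static int delay_push(int16_t *buf, int idx, int16_t sample)
{
    buf[idx] = sample;
    idx++;
    if (idx >= DELAY_LEN)
        idx = 0;
    return idx;
}
```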
6 Conclusion
Compilers are becoming more important as DSP applications become larger and more complex, and more programmers take up DSP software development. Hence, the performance of compilers is also becoming more important. Compiler performance depends not only on the quality of the compiler but also on the characteristics of the processor architecture, and the interactions between the compiler and the architecture. We believe that our benchmarking methodology will prove useful in assessing the effectiveness of compilers for DSP applications, and in this way will aid DSP application developers in selecting processors, in realistically estimating obtainable performance, and in prudently using compilers.
References

[1] J. Eyre and J. Bier, "Evolution of DSP Processors", IEEE Signal Processing Magazine, 17(2): 43-51, March 2000.
[2] A.V. Aho, R. Sethi, and J.D. Ullman, "Compilers: Principles, Techniques, and Tools", Addison-Wesley, 1986, ISBN 0201100886.
[3] V. Zivojnovic, J. Martinez, C. Schläger, and H. Meyr, "DSPstone: A DSP-oriented Benchmarking Methodology", Proceedings of ICSPAT '94, Dallas, October 1994.
[4] Berkeley Design Technology, Inc., "Buyer's Guide to DSP Processors", 1999.
[5] M.A.R. Saghir, P. Chow, and C.G. Lee, "Application-Driven Design of DSP Architectures and Compilers", Proceedings of ICASSP '94, Vol. II, pp. 437-440.
[6] W.A. Wulf, "Compilers and Computer Architecture", IEEE Computer, 14(8): 41-47, 1981.
[7] R. Stallman, "Using and Porting the GNU Compiler Collection", http://www.gnu.org/onlinedocs/gcc_toc.html