TRENDS IN COMPILABLE DSP ARCHITECTURE John Glossner, Jaime Moreno, Mayan Moudgill, Jeff Derby, Erdem Hokenek, David Meltzer, Uzi Shvadron, and Malcolm Ware IBM Communications Research and Development Center Yorktown Heights, NY [email protected]

Abstract - We review the evolution of DSP architectures and compiler technology, and describe how compiler techniques are being used to optimize emerging DSP architectures. Such new architectures are characterized by the exploitation of data and instruction level parallelism while being an amenable target for a compiler, thereby reducing or eliminating the need to rely on assembly language programming and/or architecture-specific compiler intrinsics to achieve highly efficient code. We also summarize our research results on an ultra-low power compilable DSP architecture.

INTRODUCTION High-speed communications are proliferating [1], and digital signal processors (DSPs) are accelerating this trend [2]. In fact, DSPs have become a ubiquitous enabler for the integration of audio, video, and communications. Often, DSPs and CPUs are found in the same system; in future generations, DSPs will be enhanced with RISC-like control features while general-purpose processors will be enhanced with DSP features [3]. A large number of standards exist or have been proposed for the wireless and wired communication markets. Such a diversity of standards necessitates a programmable platform for their timely implementation. Another important point is performance. In wireless communications, GSM and IS-54 data rates were limited to less than 15 Kbps. Future third-generation (3G) systems may provide data rates more than 100 times the previous rates. Proposed standards specify up to 64 Kbps (mobile), 384 Kbps (pedestrian), and 2 Mbps (close stationary); note that the maximum rate is actually higher than wired T1 lines. Wired communications are experiencing a similar trend: previous-generation V.90 modems were limited to 56 Kbps, whereas new ADSL standards specify up to 8 Mbps and future VDSL standards may specify a tremendous 52 Mbps. All these higher communication rates are driving much higher DSP processing requirements. In addition to the high computing needs, complexity is driving the need to program applications in high-level languages. In the past, when only small kernels were required to execute on a DSP, it was acceptable to program in assembly language. Previous-generation DSP applications would only require about one thousand lines of C code, whereas recent

applications may require ten thousand lines of C code. Future generations of DSP applications are anticipated to require on the order of one hundred thousand lines of C code. In this paper we trace the historical underpinnings of DSP architectures, show how complex applications are challenging assembly-language programming assumptions, discuss compiler technology and the difficulties of compiling high-level-language programs for execution on DSP architectures, and predict how DSP architectures will evolve to support modern compilation techniques. Finally, we summarize our research on an embedded digital signal processor (the e-lite DSP) whose objective is the development of an ultra-low power DSP/embedded processor capable of sustaining billions of multiply-accumulate operations per second with industry-leading power dissipation while remaining an amenable compiler target.

[Figure 1 plots application performance requirements, in operations per second (OPS), against DSP generations, from sub-1V and 2-MAC DSPs through 4-MAC and multiprocessor DSPs up to multiprocessor/8-MAC DSPs. Representative workloads range from GSM_FR (2.5M), GSM_EFR (15M), V.90 (30M), and AC3/MUSICAM decode (20M), through ADSL transceivers (100M-500M), GSM base stations (180M-800M), DAB transceivers (800M), and MPEG-2 MP@ML decode at 30 fps (600M), up to single-chip set-top box (1.5G) and MPEG-2 MP@ML encode at 30 f/s (1.7G); symphonic synthesis, natural-language processing, and real-time speech recognition push requirements still higher.]

FIGURE 1. Application Performance Requirements

APPLICATION REQUIREMENTS DSPs have distinct requirements when compared with general-purpose processors [4]. The predominant algorithmic difference is that inner loops are easily described as vectors of moderate length. For example, a typical DSP kernel is an FIR filter, which can be described mathematically as y_k = ∑_n c_n x_(k-n). From this kernel it is apparent that multiple concurrent memory accesses are required to sustain performance. Generally, one instruction and two data values are required each cycle; the operations required include two address-pointer updates and a multiply-accumulate (MAC) operation. Often, the result must additionally be rounded or saturated. A key point is that the native datatype is fixed-point (e.g., fractional arithmetic). This is in distinct contrast to general-purpose processors (and most high-level languages), which operate on integer datatypes.
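The per-tap work described above (a multiply-accumulate plus pointer updates) can be sketched in plain C; this is our own illustrative kernel, using integer arithmetic rather than the rounding and saturating fractional arithmetic a real DSP kernel would apply:

```c
#include <stdint.h>

/* One output sample of an n_taps-tap FIR filter: y_k = sum_n c[n] * x[k-n].
   Illustrative sketch only: a production DSP kernel would use Q15
   fractional arithmetic with rounding/saturation, and the two array-index
   updates would map onto address units running in parallel with the MAC. */
int32_t fir_sample(const int16_t *c, const int16_t *x, int k, int n_taps)
{
    int32_t acc = 0;
    for (int n = 0; n < n_taps; n++) {
        acc += (int32_t)c[n] * x[k - n];  /* multiply-accumulate */
    }
    return acc;
}
```

Each loop iteration performs exactly the per-cycle work the text describes: two data fetches, two pointer updates, and one MAC.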

In addition to algorithmic differences, most DSPs are deployed in embedded environments where real-time constraints are prevalent. Real-time behavior has a dominant influence on the design of DSPs [5]. Whereas general-purpose applications can often manage with variable-latency response, DSP applications, in contrast, should be able to precisely guarantee the latencies within the system. Figure 1 shows the performance requirements for typical DSP algorithms. Applications in which DSPs are used include speech coding, speech encryption, speech recognition, speech synthesis, high-fidelity audio, audio equalization, sound synthesis, modems, noise cancellation, echo cancellation, image compression, image composition, beam forming, and spectral estimation. Emerging applications which rely on DSPs include VoIP, xDSL, 3G modems, and other broadband communications [6]. DSP applications are characterized by sampling rates that may vary by more than twelve orders of magnitude. Weather forecasting, for example, has a sampling rate of 1/1000 Hz, while radar applications require sampling rates over a gigahertz [4]. Previous-generation DSP applications were comparatively simple, and entire programs could often be specified with just a few hundred lines of assembly code. Today, DSP algorithms are often quite complex and embody both control and signal processing functions. What was previously only a few lines of C code has grown to more than ten thousand lines of C code in existing applications. This makes assembly-language programming of modern DSP applications untenable [7].

[Figure 2 plots performance (MMACs) against power (mW, log scale). Previous-generation DSPs (e.g. 56002, 21065L, C549, C203, 2181) cluster at low performance; recently announced parts include TI's C55x (400-800 MMAC at 10-80 mW announced), the SC140, Carmel, and FR500. Anticipated or projected entrants: TigerSHARC at 1.2 GMAC/s at 2-8 W, AltiVec at 4 GMAC/s at 5+ W, C62x at 400 MMAC/s at 1.8 W, and C64x (projected, 1.1 GHz) at 4.4 GMAC/s. The future sweet spot lies toward better performance per watt.]

FIGURE 2. DSP Performance vs. Power

DSP ARCHITECTURES Figure 2 shows the competitive positioning of DSPs. Previous generation “classical” DSPs offer low power but limited performance. Recently announced DSPs are striving to provide higher performance at lower power consumption. Future DSPs will be required to provide an

advantage either in reduced power consumption or peak performance capability. A summary of many DSPs can be found in [8] and [9]. In the next sections, we review salient features of some important DSPs.

Classical Fixed-Point DSP Architectures Execution predictability in DSP systems precludes the use of many general-purpose design techniques (e.g. speculation, branch prediction, data caches, etc.). Instead, DSPs have developed a unique set of performance-enhancing techniques that are optimized for their intended market [5]. These techniques are characterized by hardware that supports efficient filtering, such as the ability to sustain three memory accesses per cycle (one instruction, one coefficient, and one data access). Sophisticated addressing modes such as bit-reversed and modulo addressing may also be provided. Multiple address units operate in parallel with the datapath to sustain the execution of the inner kernel. TI's C54x: Texas Instruments' TMS320C54x is a good example of a classical DSP. This DSP contains a 17x17-bit multiplier (which allows support for 16-bit unsigned multiplication), two 40-bit accumulators (one dedicated to the multiply-accumulate (MAC) unit), special support for Viterbi decoding through a compare-select-store instruction, two addressing units with eight auxiliary registers that operate in parallel with the MAC unit, a block-repeat instruction, a 40-bit barrel shifter, and exponent manipulation capability. The key characteristic of this architecture is that it supports single-cycle MAC throughput; more importantly, this architecture has been used extensively in many DSP systems [10]. Lucent's 16xx: Designed primarily for the modem and wireless markets, the Lucent 16xx DSP consists of a 16x16-bit multiplier with a 36-bit ALU/shifter and 4 guard bits. The MAC unit is pipelined into two cycles to allow operation at higher frequency. Two 36-bit accumulators are provided. A bit-manipulation unit has a 36-bit barrel shifter. Internal memory is divided into two 64K memories (X and Y), with the X memory containing both program and coefficients. Inner-loop code is placed in a 15-instruction on-chip program buffer.
Both X and Y address generators can operate in parallel with the MAC unit. The processor has a shallow 3-stage pipeline with fetch, decode, and execute phases. Special instruction support is provided for hardware looping, exponent detection, and bit-field extraction. Interestingly, there is no support for bit-reversed addressing [11]. IBM's ISP-5.7 (Mwave): IBM originally began development of a programmable DSP architecture in the mid-1970s. Chips based on this architecture, as it evolved over the years, have been used extensively in modems and PC multimedia subsystems [12]. The current version of this architecture, called IBM Signal Processor (ISP) 5.7, defines a 16-bit fixed-point processor core consisting of a 16x16-bit multiplier, 32-bit ALU, 32-bit barrel shifter, data address generator, eight 16-bit general-purpose registers (GPRs), two of which are extended to 32 bits and two of which are designated as index registers, a 32-bit memory address space, a program control unit, and status and control registers. Parallelism in the core permits up to three operations (usually a load/store, multiply, and ALU operation) to execute in the same cycle [13]. The ISP 5.7 is a better compiler target and differs from other

"classical" DSP architectures in a number of respects. First, it is a load/store architecture which uses GPRs, as in the case of RISC processors. Second, there is neither explicit nor implicit partition of data memory into "X" and "Y" memories. However, the core can still sustain one tap per cycle FIR computation.
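The modulo (circular) addressing mode mentioned above can be modeled in C; the buffer length and helper below are our own illustration of the index update that a hardware address-generation unit computes for free each cycle, in parallel with the datapath:

```c
/* Circular-buffer index update: models the modulo addressing mode
   that DSP address units perform in hardware alongside the MAC.
   BUF_LEN is an illustrative buffer length, not from the paper. */
#define BUF_LEN 8

static inline int modulo_advance(int idx)
{
    return (idx + 1) % BUF_LEN;  /* wraps from BUF_LEN-1 back to 0 */
}
```

On a classical DSP this wrap costs nothing; in compiled C it is an extra divide or compare per access, which is one reason hardware address units matter for sustaining one tap per cycle.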

Transitional DSP Architectures Transitional DSP architectures have either attempted to extend existing architectures or to solve a specific programming problem. The Lucent 16000 architecture extends the 16xx architecture to a dual-MAC machine while maintaining the same pipeline and programming style [14]. Likewise, TI's C55x extends the C54x to a dual-MAC machine [15]. Although these processors maintain many of the irregularities and specialized hardware of their predecessors, they provide performance gains and extend the lifetime of popular DSP families. LSI ZSP: LSI's ZSP processor (originally from ZSP Corporation) tries to solve the programming burden by providing a 5-stage interlocked pipeline in which instruction dependencies are determined by hardware and instructions are grouped appropriately for parallel execution. Static branch prediction is also included. A C compiler with fixed-point datatype support is provided. This interesting approach makes the ZSP a unique cross between a general-purpose processor and a DSP. Peculiarities particular to classical DSPs are still evident in the ZSP's sparse address space and looping hardware [16]. Infineon Carmel: Infineon's Carmel DSP does not alleviate programming difficulties, but it provides a unique memory-to-memory architecture with application-specific accelerators called PowerPlugs that compete effectively with better-known DSPs. Carmel also offers users the ability to define custom 144-bit VLIW-like instructions (CLIWs) which contain up to 6 parallel operations. Carmel's 8-stage non-interlocked pipeline is also deeper than those of classical DSPs [17].

Media Processors A special class of DSP architecture was introduced with the media processor. Since media applications are dominated by pixel processing, an 8-bit datatype is often as important as a classical DSP's 16-bit datatype. These processors have had an influence on modern DSP architectures. Examples of media processors include IBM's Mfast [18], Philips' TriMedia [19], TI's C80 [20], and Chromatic's MPACT [21].

Multimedia Instruction Sets Another special class of processors with DSP functionality comprises general-purpose processors that include SIMD extensions. Examples include Intel MMX [22] and PowerPC AltiVec [23]. Retrofitting DSP capability into general-purpose processors has not been as successful as once envisioned. Although excellent performance can be achieved, system characteristics such as real-time constraints and power-dissipation sensitivity are harder to realize on general-purpose processors [24].

Modern DSP Architectures In classical DSP architectures, the execution pipelines were visible to the programmer and necessarily shallow to allow assembly-language optimization. This programming restriction encumbered implementations with tight timing constraints for both arithmetic execution and memory access. The key characteristic that separates modern DSP architectures from classical architectures is the focus on compilability. Once the decision was made to focus the DSP design on programmer productivity, other constraining decisions could be relaxed. As a result, significantly longer pipelines with multiple cycles to access memory and compute arithmetic could be utilized. This has resulted in higher clock frequencies and higher-performance DSPs. In an attempt to exploit the instruction level parallelism inherent in DSP applications, modern DSPs tend to use VLIW-like execution packets. This is partly driven by real-time requirements, which demand that the worst-case execution time be minimized; this is in contrast with general-purpose CPUs, which tend to minimize average execution times. With long pipelines and multiple instruction issue, the difficulties of attempting assembly-language programming become apparent. Controlling instruction dependencies among upwards of 100 in-flight instructions is a non-trivial task for a programmer. This is exactly the area where a compiler excels. StarCore SC140: A promising modern architecture is the StarCore SC140. This DSP is jointly designed by Lucent and Motorola in their Atlanta design center. A significant design point is the Variable Length Execution Set (VLES). Most SC140 instructions are 16 bits wide, and they can be grouped into VLES packets of up to 128 bits. This allows multiple instructions to be issued with reasonable instruction code density. The SC140 architecture was also optimized jointly with the compiler. The machine is capable of 1.2 billion MACs/s with a 300 MHz core.
The address space has also been extended to 32 bits [25]. ADI TigerSHARC: Analog Devices' TigerSHARC DSP is interesting because of its use of VLIW and SIMD techniques. TigerSHARC is also significant in that its base architecture contains floating point. This may put it at a power disadvantage compared to fixed-point-only solutions, but it does offer ease of implementation for DSP algorithm developers. TigerSHARC's two computational units each have independent SIMD capability. This provides a peak of up to eight 16-bit multiplies per cycle. Two independent address units can each transfer up to 128 bits of data per cycle, providing high internal bandwidth [26],[27]. Other Modern Architectures: Other important modern architectures include TI's C64x [15][28], BOPS Manarray [29], and Lucent's Daytona [30] processors. TI's C64x is an improvement to its successful C6x processor family. The C64x extends the arithmetic capabilities of the C6x by providing SIMD execution on both 16-bit and 8-bit quantities. The BOPS Manarray design is unique in that it architects the communication between multiple processing elements. Each processing element contains five execution units which are accessed through an indirect VLIW instruction. The multiple PEs execute in a SIMD fashion as controlled by a sequence processor. The Lucent Daytona project, while providing a very high-performance

DSP engine, also contributed the high-performance bus architecture to Lucent’s Starcore offering.

DSP COMPILATION Programmer productivity is one of the major concerns in complex DSP applications. Because most classical DSPs are programmed in assembly language, it takes a very large software effort to program an application. For modern speech coders [31], for example, it may take six months or more before the application performance is known. Then, an intensive period of design verification ensues. If efficient compilers for DSPs were available, significant advantages in software productivity could be achieved.

[Figure 3 depicts the DSP design process: application algorithms (GSM, ADSL, VoIP, 3G) drive joint trade-offs between compiler and architecture, and implementations are optimized for price, power, and performance.]

FIGURE 3. DSP Design Process

DSP Compilation Problem There are a number of issues that must be addressed in designing a DSP compiler. First, there is a fundamental mismatch between DSP datatypes and C language constructs. A basic datatype in DSPs is a saturating fractional fixed-point representation. C language constructs, however, define integer modulo arithmetic. This forces the programmer to explicitly program saturation operations. The compiler must then deconstruct these idioms to recognize the underlying fixed-point operations. A second problem for compilers is that previous DSP architectures were not designed with compilability as a goal. To maintain minimal code density, multiple operations were issued from the same compound instruction. Unfortunately, to reduce instruction storage, a common encoding was 16 bits for all instructions. Often, three operations could be issued from the same 16-bit instruction. While this is good for code density, orthogonality suffered. Many special-purpose registers were required, and severe restrictions on operation combinations were imposed. Early attempts to remove these restrictions used VLIW instruction set architectures with nearly full orthogonality. To issue four multiply-accumulates minimally requires four instructions (with additional load instructions to sustain throughput). This full generality was

required to give the compiler technology an opportunity to catch up with assembly language programmers.
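The datatype mismatch above can be made concrete with a fractional (Q15) multiply. In plain C the programmer must spell out the renormalizing shift and the single saturating case, and the compiler must deconstruct this idiom back into one fractional-multiply instruction. The function below is our own sketch of the idiom:

```c
#include <stdint.h>

/* Q15 fractional multiply written in plain C. The only overflow case
   is -1.0 * -1.0, whose raw product 0x40000000 would renormalize to
   +1.0, which is not representable in Q15, so it saturates to 0x7FFF. */
int16_t q15_mult(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    if (p == 0x40000000)
        return 0x7FFF;            /* saturate */
    return (int16_t)(p >> 15);    /* renormalize Q30 -> Q15 */
}
```

A classical DSP executes all of this in a single cycle; the compiler's job is to recognize the branch-and-shift pattern as that one operation.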

Current State-of-the-art DSP compiler technology is still in its infancy. Figure 3 shows the ideal DSP design process. A DSP compiler is designed jointly with the architecture based on the intended application domain. Trade-offs are made between the architecture and compiler subject to the application performance, power, and price constraints. High-Level Languages: Because DSP C compilers have difficulty generating efficient code, extensions have been made to high-level languages [32],[33]. Typical additions may include special type support for 16-bit data types (Q15 formats), saturation types, multiple memory spaces, and SIMD parallel-execution support. These additions often imply a special compiler, and the code may not be emulated easily on multiple platforms. As a result, special language constructs have not been successful. In addition to language extensions, other high-level languages have been used. BOPS [29] has produced a Matlab compiler, which offers exciting possibilities since Matlab is widely used in DSP algorithm design. Difficulties with this approach include Matlab's inherent 64-bit floating-point type not being supported on most DSPs; on DSPs which do support 32-bit floating point, precision analysis is still required. For algorithm design, tensor algebra has been used [34],[35], and attempts have been made to automate this into a compilation system [36]. The problem with this approach is that highly skilled algorithm designers are still required to describe the initial algorithm in tensor algebra. However, the approach holds promise because the communications and parallelism of the algorithm are captured by the tensor algebra description. Libraries: Due to the programming burden of traditional DSPs, large libraries are typically built up over time. Often more than 1000 functions are provided, including FIR filters, FFTs, convolutions, DCTs, and other computationally intensive kernels.
The software burden to generate libraries is high but they can be reused for many applications. Often, with this approach, control code can be programmed in C and the computationally intensive signal processing functions are called through these libraries. Intrinsics: Often, when programming in a high-level language such as C, a programmer would like to take advantage of a specific instruction available in an architecture but there is no mechanism for describing that instruction in C. For this case intrinsics were developed. In their rudimentary form, an intrinsic is an asm statement such as found in GCC [37]. An intrinsic function has the appearance of a function call in C source code, but is replaced during pre-processing by a programmer-specified sequence of lower-level instructions. The replacement specification is called the intrinsic substitution or simply the intrinsic. An

intrinsic function is defined if an intrinsic substitution specifies its replacement. The lower-level instructions resulting from the substitution are called intrinsic instructions [38]. Intrinsics are used to collapse what may be more than ten lines of C code into a single DSP instruction. A typical math operation from the ETSI GSM EFR speech coder [31], L_add, is given as:

/* GSM ETSI saturating add */
Word32 L_add( Word32 a, Word32 b )
{
    Word32 c;

    c = a + b;
    if( (( a ^ b ) & MIN_32) == 0 ) {
        if( ( c ^ a ) & MIN_32 ) {
            c = (a < 0) ? MIN_32 : MAX_32;
        }
    }
    return( c );
}

Early intrinsic efforts, like inlined asm statements, inhibited DSP compilers from optimizing code sequences [39]. A DSP C compiler could not know the semantics and side effects of the assembly-language constructs. Other solutions, which attempted to convey side-effect-free instructions, have been proposed [38]. These solutions all introduced architecture-dependent modifications to the original C source. Intrinsics which eliminated these barriers were explored in [40]. This technique represented the operation in the intermediate representation of the compiler. With the semantics of each intrinsic well known to the intermediate format, optimizations with the intrinsic functions were easily enabled. This provided speedups of more than 6x [40]. The main drawback of intrinsics is that they move the assembly-language programming burden to the compiler writers. More importantly, each new application may still need a new intrinsic library. This further constrains limited software resources. Optimizations: In addition to classic compiler optimizations [41], there are some advanced optimizations which have proven significant for DSP applications. Software pipelining [42] in combination with aggressive inlining has proven effective in extracting the parallelism inherent in DSP applications. Interestingly, some DSP applications are not data dependent. In these cases, profile-directed optimizations are very effective at improving performance [43]. These techniques, when used with VLIW scheduling [44], have proven effective in DSP compilation. However, they can still be more than two times less efficient than assembly-language programmers. Retargetable Estimation: Retargetable estimation is a technique that involves static code analysis and profiling of a DSP application described in a high-level language (currently C) to estimate the optimal cycle count that can be achieved by a particular programmable
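As an illustration of the intrinsic approach (the function name below is hypothetical, not any particular compiler's API), the ten-line L_add idiom shrinks to a single call whose semantics the compiler knows exactly; here we model the instruction's behavior in portable C:

```c
#include <stdint.h>

/* Hypothetical intrinsic: a DSP toolchain would replace this call with
   one saturating-add instruction. Its semantics are modeled here in C
   using a wider intermediate to avoid signed overflow. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;   /* saturate high */
    if (s < INT32_MIN) return INT32_MIN;   /* saturate low */
    return (int32_t)s;
}

/* Application code then reads as an ordinary function call: */
int32_t accumulate(int32_t sum, int32_t sample)
{
    return sat_add32(sum, sample);
}
```

Because the compiler knows the call is side-effect free and knows its exact semantics, it can schedule and optimize around it, unlike an opaque inlined asm statement.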

DSP [45]. This technique predicts the hand-optimized performance of a full application to within about 15%; typical estimates are accurate to about 5%. This is primarily achieved using the same loop-level optimizations that aggressive DSP compilers use.

Future Compiler Techniques It is well recognized that the best way to design a DSP compiler is to develop it in parallel with the DSP architecture [5]. Future compiler-architecture pairs will not be afforded the luxury of large numbers of intrinsic libraries. Just as modern RISC processors do not require assembly-language programming, neither will future DSP applications. Future DSP compilers will use a technique called semantic analysis. In semantic analysis, a sophisticated compiler must search for the meaning of a sequence of C language constructs. A programmer writes C code in an architecture-independent manner (such as for a PowerPC machine), focusing primarily on the function to be implemented. If DSP operations are required, the programmer implements them using standard modulo C arithmetic. The compiler then analyzes the C code, automatically extracts the DSP operations, and synthesizes optimized DSP code without the excess operations required to specify DSP arithmetic in C code. This technique has a significant software-productivity gain over intrinsic functions. Another challenge DSP compiler writers face is parallelism extraction. Early VLIW machines eased the compiler's burden by allowing full orthogonality of instruction selection. Unfortunately, this led to code bloat. General-purpose machines have recognized the importance of DSP operations and have provided specialized SIMD instruction set extensions. Unfortunately, compiler technology has not been effective in exploiting these instruction set extensions, and library functions are often the only efficient way to invoke them. Modern DSP architectures will make liberal use of these so-called multimedia instruction sets because DSP applications are amenable to them. However, a vectorizing compiler is required to extract this parallelism. Worse, outer loops must often also be vectorized to allow inner-loop vectorization. This is a notoriously difficult problem.
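The vectorization target described above can be shown with a minimal loop (our own example): the source is architecture-independent C, and it is the compiler's job to map it onto SIMD MAC instructions without intrinsics or pragmas:

```c
#include <stdint.h>

/* Architecture-independent C dot product: a vectorizing DSP compiler
   must recognize this inner loop and emit SIMD multiply-accumulate
   instructions for it; the source carries no target-specific hints. */
int32_t dot16(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

With, say, four-way SIMD, the compiler would strip-mine the loop into vector MACs plus a final reduction, exactly the transformation that library writers today perform by hand.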

[Figure 4 summarizes the IBM Research contributions feeding the e-lite DSP: low-power methodology, circuits, architecture, and process technology (SOI, SiGe); DSP compilation research (SIMD vectors, scheduling, type recognition); vector and processor-design research (Chameleon, PPC); and system experience in 3G wireless, wideband CDMA, 2.5G GSM, software radio, WLAN, and Bluetooth.]

FIGURE 4. IBM Research DSP Contributions

IBM'S E-LITE DSP We now summarize our research whose objective is the development of an ultra-low power DSP/embedded processor (the e-lite DSP) capable of sustaining billions of multiply-accumulate operations per second with industry-leading power dissipation while remaining an amenable compiler target. Figure 4 shows the contributions to the e-lite DSP architecture from previous research at IBM. Significant processor-design, low-power process, algorithms, and DSP compilation experience was applied to the design of e-lite. In particular, the IBM Research VLIW compiler contributed to the basis of the DSP compiler [46]. The e-lite DSP is optimized for communications systems, particularly 2.5G and 3G (WCDMA) wireless communications; its low power-consumption target makes it ideal for mobile handsets, and its performance target makes it attractive for basestation applications as well as high-speed access networks. A key aspect of this research is that DSP programmers are not required to write assembly language, as illustrated by an example from a speech-coding application. The unique research contributions of the e-lite DSP project include more than just the DSP. In particular, they include new circuit techniques for ultra-low power; important compiler optimizations for DSP operations; architecture and microarchitectures optimized for low power; and system-level design experience in GSM and 3G WCDMA physical layers. Some examples of the power-saving features being explored include: a 64-bit compound LIW instruction set architecture with a RISC execution model, pre-decoded instruction memories, non-interlocked pipelines (except for long loads), multiple-operation execution with SIMD vectors (in contrast to pure VLIW), minimal control paths, a streaming register file, fully visible hardware resources scheduled by the compiler, and ultra-low power implementation techniques.

Architecture High performance in the e-lite DSP architecture is achieved through the exploitation of data and instruction level parallelism. Since this architecture is especially targeted for ultra-low power implementations, it discourages the use of implementation techniques that are typically found in other processors, such as dynamic scheduling of operations, speculative execution, branch prediction, and register interlocks. Instead, it imposes constraints on the code executed by an implementation. The e-lite DSP architecture also includes special support for reducing power consumption during the execution of a program, such as the ability to shut down functional units and to enable hardware blocks on demand (by hardware or software, depending on the specific block). All the burden of scheduling instructions and managing resources has been moved into the compiler. As a consequence, a basic assumption in the e-lite DSP architecture is that it will execute programs that have been written in a high-level language (especially C) and translated into machine code by the associated optimizing compiler.

A program executed by an e-lite DSP processor consists of a sequence of long-instructions, or bundles, each of which corresponds to two or three operations (primitive instructions); a prefix field in the bundle is used to indicate dependencies among instructions in the bundle. A long-instruction is represented in storage as a 64-bit entity, wherein primitive instructions are encoded in either 20 or 30 bits. The 64-bit long-instruction is the minimum unit of long-instruction addressing. All parallel-executable operations contained in a long-instruction are dispatched simultaneously for execution.
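A decode of the three-by-20-bit bundle variant might look as follows. The paper does not specify the exact bit layout, so the 4-bit prefix width and the field ordering here are purely our assumption (4 + 3*20 = 64):

```c
#include <stdint.h>

/* Hypothetical unpacking of a 64-bit e-lite bundle holding three
   20-bit primitive instructions after a 4-bit prefix field. The real
   encoding is not given in the paper; this only illustrates how a
   fixed 64-bit bundle carries variable-width primitive instructions. */
typedef struct {
    uint32_t prefix;   /* dependency-indication field (assumed 4 bits) */
    uint32_t op[3];    /* 20-bit primitive instructions */
} bundle3x20_t;

bundle3x20_t decode_bundle(uint64_t bits)
{
    bundle3x20_t b;
    b.prefix = (uint32_t)(bits >> 60) & 0xF;
    b.op[0]  = (uint32_t)(bits >> 40) & 0xFFFFF;
    b.op[1]  = (uint32_t)(bits >> 20) & 0xFFFFF;
    b.op[2]  = (uint32_t)(bits      ) & 0xFFFFF;
    return b;
}
```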

[Figure 5 sketches the 64-bit bundle formats: a prefix field P followed by combinations of 20-bit and 30-bit primitive instructions, such as 20+30 or 20+20+20.]
FIGURE 5. e-lite Instruction Format

Operations that an e-lite DSP processor can execute fall into the following categories: branch operations, integer operations, storage access operations, vector/accumulator operations, and vector reduction operations. Integer instructions operate on 32-bit operands. Storage access instructions provide byte (8-bit), half-word (16-bit), word (32-bit), and vector (64-bit) transfers between storage and internal registers. 40-bit accumulation is also supported. Vector instructions and vector-reduction instructions operate on vector operands, which are represented in a 16-bit fixed-point format. Signed integers and fixed-point numbers are represented in two's complement form. With the exception of direct memory access (DMA), no primitive instructions other than stores modify storage. Many instructions can be predicated, and SIMD operations can be masked on an element-by-element basis. Notable features absent from the e-lite architecture include zero-overhead loops, small address spaces, and disjoint (X/Y) address spaces; though significant hardware features, these are a difficult target for a compiler.

E-lite Compiler

DSP developments at IBM have always attempted to alleviate the programming burden. Continuing this tradition, the e-lite DSP compiler translates C code directly to machine instructions. In addition to the language difficulties described in Section "DSP Compilation", the e-lite DSP architecture poses additional challenges for the compiler:

1) The e-lite DSP architecture is a statically scheduled, multiple-issue (LIW or EPIC-like) architecture. Furthermore, long pipelines throughout the processor and memory paths give rise to long latencies. To obtain good performance, the compiler must identify and schedule independent instructions for the different units, and must also cover the latencies by scheduling other operations. In particular, the compiler must perform software pipelining of inner loops.

2) The e-lite DSP architecture operates on SIMD (vector) data. To obtain maximum performance from the processor, the compiler has to recognize code sequences, particularly loops, that can be vectorized.

3) The compiler must efficiently deal with multiple register files, since each unit has its own register file. When an operation is assigned to a unit, its inputs and outputs are assigned to the register file of that unit. If an output is needed in some other unit, the transfer must be managed explicitly by the compiler.

4) Other issues the compiler must deal with include 40-bit accumulators, circular buffers, explicit bypasses, exposed pipeline latencies, delayed branches, and pipeline resource hazards.

Autocorrelation Example: We now give an example of the techniques incorporated into the e-lite DSP compiler for translating a program written in standard C but exhibiting properties such as saturating arithmetic. This example reflects some of the challenges encountered in the new generation of DSP compilation, as represented by the e-lite DSP compiler. We focus on compiling the inner loop of an autocorrelation, which can be written as follows:

    sum = 0;
    for( i = 0; i < N; i++ ) {
        m = x[i]*x[i];                      /* saturating multiply */
        if( m == 0x40000000 ) {             /* saturate? */
            m = 0x7fffffff;
        } else {
            m = m*2;
        }
        s = sum + m;                        /* saturating accumulate */
        if( ((sum^m) & 0x80000000) == 0 ) { /* saturate? */
            if( (s^sum) & 0x80000000 ) {
                s = sum < 0 ? 0x80000000 : 0x7fffffff;
            }
        }
        sum = s;
    }
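The two saturating operations in this loop can be captured as self-contained C helpers. The sketch below is a minimal reference model of the ETSI-style fractional multiply-and-double and the 32-bit saturating add shown above; the function names echo the compiler's recognized operations but are illustrative, not the compiler's actual internal representation:

```c
#include <assert.h>
#include <stdint.h>

/* Fractional 16x16 -> 32-bit multiply that doubles the product and
 * saturates the single overflow case (-32768 * -32768). */
static int32_t mul16_shift_saturate(int16_t a, int16_t b) {
    int32_t m = (int32_t)a * (int32_t)b;
    return (m == 0x40000000) ? INT32_MAX : m * 2;
}

/* 32-bit saturating add: saturate when both operands have the same
 * sign and the (wrapped) sum has the opposite sign. */
static int32_t add32_saturate(int32_t x, int32_t y) {
    int32_t s = (int32_t)((uint32_t)x + (uint32_t)y); /* wraparound sum */
    if (((x ^ y) & INT32_MIN) == 0 && ((s ^ x) & INT32_MIN))
        return x < 0 ? INT32_MIN : INT32_MAX;
    return s;
}
```

Recognizing exactly this pattern of conditionals is what lets the compiler collapse the eleven-line loop body into two machine operations.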

A series of transformations implemented in the e-lite DSP compiler recognize these statements as corresponding to a saturating multiply operation followed by a saturating add. The body of the loop gets converted into an intermediate form that is equivalent to the following pseudo-code:

          p=    add_address            x,-2
          ctr0= move_counter           N
          s=    load_immediate         0
    loop:
          t0,p= load_update            p,2
          t1=   mul16_shift_saturate   t0,t0
          s=    add32_saturate         s,t1
          ctr0= decr_and_branch_nz     ctr0,loop

A vectorizing analysis phase inspects the inner loop and recognizes that it consists of two parallel operations (load and multiply) followed by a reduction operation. Further, this combination of operations is supported by the e-lite DSP architecture, since it has a reduction operation that adds four values with saturation. Consequently, this loop is vectorizable. One approach that could be pursued would be to convert the loop to a vector form, and let later compilation phases deal with software pipelining and multi-operation scheduling. However, we believe it is simpler to perform vectorization, software pipelining and parallelization as one compilation phase, by the following steps:

1) Ensure that there is a unique non-vectorized instruction for each vector instruction in the final loop (this may involve adding copies).
2) Software-pipeline the non-vector operations using modulo scheduling, but with the resource template of the equivalent vector operations (this also uncovers parallelism).
3) Replace each operation with its vector equivalent.
4) Fix up the branch count.

For a given set of latencies, and ignoring delayed branch issues, the final loop resulting from this optimization is (wherein the vertical double-bar indicates primitive instructions that are executable in parallel):

          p=    add_address          x,-8
          ctr0= move_counter         N/4
          s=    load_immediate       0
    # begin prolog
          v0,p= vec_load_update      p,8
          v0,p= vec_load_update      p,8
          v0,p= vec_load_update      p,8   || ac1= vec_mul16_shift v0,v0
          v0,p= vec_load_update      p,8   || ac1= vec_mul16_shift v0,v0
    loop:
          v0,p= vec_load_update      p,8
       || ac1=  vec_mul16_shift      v0,v0
       || s=    vec_add32_sat        s,ac1
       || ctr0= decr_and_branch_nz   ctr0,loop

Note that while there is a prolog, there is no epilog in this software-pipelined loop. This form of pipelining is called “no-epilog software pipelining”. Basically, when all the operations in the loop that are not speculatable (e.g. stores, reduction) are scheduled in the last initiation interval of the loop, then there is no need for an epilog. Instead, the epilog is merged with the loop body; the loop will consist only of the prolog followed by N iterations of the kernel. To deal with the cases where the loop count is not provably a multiple of the vector size, an extra copy of the loop body is added as an epilog. This epilog will be computed under a vector mask.
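The handling of loop counts that are not provably a multiple of the vector size can be modeled in scalar C: the kernel consumes four elements per iteration, and one extra copy of the loop body then runs under an element mask. Saturation is omitted here for brevity; this sketch (with an assumed function name) shows only the chunking and masked-epilog structure described above:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the vectorized reduction: process the input in
 * 4-element chunks (the software-pipelined kernel), then run one extra
 * copy of the body under an element mask for the leftover N % 4
 * elements. */
static int32_t autocorr_chunked(const int16_t *x, int n) {
    int32_t sum = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)              /* kernel: full vectors */
        for (int k = 0; k < 4; k++)
            sum += 2 * (int32_t)x[i + k] * x[i + k];
    unsigned mask = (1u << (n - i)) - 1u;   /* epilog under vector mask */
    for (int k = 0; k < 4; k++)
        if (mask & (1u << k))               /* masked-off lanes untouched */
            sum += 2 * (int32_t)x[i + k] * x[i + k];
    return sum;
}
```

When n is a multiple of four, the mask is zero and the masked epilog contributes nothing, matching the no-epilog case in the text.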

FIGURE 6. Compiler Results (MHz required for the GSM EFR/AMR speech coders, 12/99 to 12/00: a typical DSP with and without intrinsics vs. e-lite on unmodified C code, against the EFR target)

E-lite Compiler Results

Figure 6 shows preliminary results from our DSP compiler on the e-lite architecture. In this experiment, we took the existing ETSI C code for the EFR and AMR speech coders [31] and compiled these programs, without modification, for our DSP architecture. Currently, we achieve 28 MHz operation for the full EFR codec. When outer-loop parallelization is implemented later this year, we anticipate sub-18 MHz performance. Moreover, with the final version of the architecture we expect performance to be comparable with highly optimized hand code. In contrast, similar experiments on another architecture require a 400 MHz processor unless intrinsic libraries are used.

CONCLUSIONS

DSP processor design has undergone a major paradigm shift. With the soaring costs of software development, modern DSP architectures have been optimized for compilation. The interactions between the compiler and the architecture, along with new compilation techniques, have made multiple-issue, highly parallel SIMD DSP architectures more prevalent. In the future, compiler technologies will play an even more important role in DSP systems as they evolve to take into account the memory hierarchy in addition to parallelism extraction [46].

References

[1] M. Gagnaire, "An Overview of Broad-Band Access Technologies", Proceedings of the IEEE, Vol. 85, No. 12, December 1997, pp. 1958-1972.
[2] T. Cantrell, "DSP Doings", Circuit Cellar Ink, No. 109, August 1999, pp. 76-82.
[3] J. Eyre and J. Bier, "DSP Processors Hit The Mainstream", IEEE Computer, August 1998, pp. 51-59.
[4] P. Lapsley et al., DSP Processor Fundamentals, IEEE Press, New York, 1997.
[5] M. Saghir, P. Chow, and C. G. Lee, "Towards Better DSP Architecture and Compilers", Proceedings of the International Conference on Signal Processing Applications and Technology, October 1994, pp. 658-664.
[6] J. Bier, "DSP is the Mainstream", 2000 Embedded Processor Forum, Micro Design Resources, San Jose, April 2000.
[7] B. Ackland and P. D'Arcy, "A New Generation of DSP Architectures", Proceedings of the Custom Integrated Circuits Conference, 1999, pp. 531-536.
[8] Electronic Design News, "DSP Architecture Directory", EDN, April 23, 1998, pp. 40-111.
[9] BDTI, "Buyer's Guide to DSP Processors: Performance, Architecture, Applications, and Benchmarks", Berkeley Design Technology, Inc., 1999.
[10] Texas Instruments, "TMS320C54x DSP Reference Set. Volume 1: CPU and Peripherals", TI Report number SPRU131E, June 1998.
[11] Lucent Technologies, "DSP 1611/17/18/27/28/29 Digital Signal Processor Information Manual", January 1998.
[12] G. Ungerboeck, D. Maiwald, H.P. Kaeser, P.R. Chevillat, and J.P. Beraud, "Architecture of a Digital Signal Processor", IBM Journal of Research and Development, Vol. 29, No. 2, March 1985.
[13] N.L. Bernbaum, B. Blaner, D.E. Carmon, J.K. D'Addio, F.E. Grieco, A.M. Jacoutot, M.A. Locker, B. Marshall, D.W. Milton, C.R. Ogilvie, P.M. Schanely, P.C. Stabler, and M. Turcotte, "The IBM Mwave 3780i DSP", Proceedings of the 1996 International Conference on Signal Processing Applications and Technology (ICSPAT '96), Boston, MA, October 1996, pp. 1287-1291.
[14] J. Bier, "DSP16xxx Targets Communications Apps", Microprocessor Report, Vol. 11, No. 12, September 1997.
[15] T. R. Halfhill, "TI Cores Accelerate DSP Arms Race", Microprocessor Report, March 6, 2000.
[16] LSI, "LSI402Z Digital Signal Processor", Document number R20012, LSI Corporation, 1999.
[17] J. Eyre and J. Bier, "Carmel Enables Customizable DSP", Microprocessor Report, Vol. 12, No. 17, December 1998.
[18] G.G. Pechanek, C.J. Glossner, W.F. Lawless, D.H. McCabe, C.H.L. Moller, and S.J. Walsh, "A Machine Organization and Architecture for Highly Parallel, Scalable, Single Chip DSPs", Proceedings of the 1995 DSPx Technical Program Conference & Exhibition, San Jose, CA, May 15-18, 1995, pp. 42-50.
[19] B. Case, "Philips Hopes to Displace DSPs with VLIW", Microprocessor Report, December 1997, pp. 12-15.
[20] C.P. Feigel, "TI Introduces Four-Processor DSP Chip", Microprocessor Report, Vol. 8, No. 4, March 28, 1994.
[21] D. Epstein, "Chromatic Raises the Multimedia Bar", Microprocessor Report, Vol. 9, No. 14, October 28, 1995.
[22] A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture", IEEE Micro, August 1996, pp. 42-50.
[23] H. Nguyen and L. K. John, "Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using the AltiVec Technology", Proceedings of the International Conference on Supercomputing, 1999, pp. 11-20.
[24] J. C. Bier, A. Shoham, H. Hakkarainen, O. Wolf, G. Blalock, and P. D. Lapsley, "DSP on General-Purpose Processors: Performance, Architecture, Pitfalls", Berkeley Design Technology, Inc., 1997.
[25] O. Wolf and J. Bier, "StarCore Launches First Architecture", Microprocessor Report, Vol. 12, No. 14, October 1998.
[26] O. Wolf and J. Bier, "TigerSHARC Sinks Teeth Into VLIW", Microprocessor Report, Vol. 12, No. 16, December 1998.
[27] J. Fridman and Z. Greenfield, "The TigerSHARC DSP Architecture", IEEE Micro, Vol. 20, No. 1, January 2000, pp. 66-76.
[28] J. Turley and H. Hakkarainen, "TI's New 'C6x DSP Screams at 1,600 MIPS", Microprocessor Report, Vol. 11, No. 2, February 1997.
[29] D. Strube, "High Performance DSP Technology Brings New Features to Digital Systems", Electronic Product Design, October 1999, pp. 23-26.
[30] B. Ackland et al., "A Single-Chip 1.6 Billion 16-b MAC/s Multiprocessor DSP", Proceedings of the Custom Integrated Circuits Conference, 1999, pp. 537-540.
[31] European Telecommunications Standards Institute, Digital Cellular Telecommunications System, ANSI-C Code for the GSM Enhanced Full Rate (EFR) Speech Codec (GSM 06.53), March 1997, ETS 300 724.
[32] K.W. Leary and W. Waddington, "DSP/C: A Standard High Level Language for DSP and Numeric Processing", Proceedings of the International Conference on Acoustics, Speech and Signal Processing, IEEE, 1990, pp. 1065-1068.
[33] B. Krepp, "DSP-Oriented Extensions to ANSI C", Proceedings of the International Conference on Signal Processing Applications and Technology (ICSPAT '97), DSP Associates, 1997, pp. 658-664.
[34] J. Granata, M. Conner, and R. Tolimieri, "The Tensor Product: A Mathematical Programming Language for FFTs and Other Fast DSP Operations", IEEE SP Magazine, January 1992, pp. 40-48.
[35] C.J. Glossner, G.G. Pechanek, S. Vassiliadis, and J. Landon, "High-Performance Parallel FFT Algorithms on M.f.a.s.t. Using Tensor Algebra", Proceedings of the Signal Processing Applications Conference at DSPx '96, San Jose, CA, March 11-14, 1996, pp. 529-536.
[36] N. P. Pitsianis, "A Kronecker Compiler for Fast Transform Algorithms", 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[37] R. Stallman, "Using and Porting GNU CC", Free Software Foundation, version 2.7.2.1, June 1996.
[38] D. Batten, S. Jinturkar, J. Glossner, M. Schulte, and P. D'Arcy, "A New Approach to DSP Intrinsic Functions", Proceedings of the Hawaii International Conference on System Sciences, Hawaii, January 2000.
[39] D. Chen, W. Zhao, and H. Ru, "Design and Implementation Issues of Intrinsic Functions for Embedded DSP Processors", Proceedings of the ACM SIGPLAN International Conference on Signal Processing Applications and Technology (ICSPAT '97), September 1997, pp. 505-509.
[40] D. Batten, S. Jinturkar, J. Glossner, M. Schulte, R. Peri, and P. D'Arcy, "Interaction Between Optimizations and a New Type of DSP Intrinsic Function", Proceedings of the International Conference on Signal Processing Applications and Technology (ICSPAT '99), Orlando, FL, November 1999.
[41] A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley Publishing Company, CA, 1986.
[42] M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines", Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, Atlanta, GA, June 1988.
[43] S. Jinturkar, J. Thilo, J. Glossner, P. D'Arcy, and S. Vassiliadis, "Profile Directed Compilation in DSP Applications", Proceedings of the International Conference on Signal Processing Applications and Technology (ICSPAT '98), September 1998.
[44] W. Hwu, "The Superblock: An Effective Technique for VLIW and Superscalar Compilation", Journal of Supercomputing, Vol. 7, pp. 229-248.
[45] N. Ghazal, R. Newton, and J. Rabaey, "Predicting Performance Potential of Modern DSPs", Proceedings of the 37th Design Automation Conference, Los Angeles, CA, 2000, pp. 332-335.
[46] M. Moudgill, J. H. Moreno, K. Ebcioglu, E.R. Altman, S.K. Chen, and A. Polyak, "Compiler/Architecture Interaction in a Tree-Based VLIW Processor", IEEE Technical Committee on Computer Architecture Newsletter, June 1997, pp. 10-12.
[47] S. Larin and T. Conte, "Compiler-Driven Cached Code Compression Schemes for Embedded ILP Processors", Proceedings of the Annual International Symposium on Microarchitecture, 1999, pp. 82-92.