The 2004 IEEE Asia-Pacific Conference on Circuits and Systems, December 6-9, 2004
DIGITAL SIGNAL PROCESSOR ARCHITECTURES AND PROGRAMMING

Sen M. Kuo (1) and Woon S. Gan (2)

(1) Dept. of Electrical Engineering, Northern Illinois University, DeKalb, IL 60115
(2) School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore
0-7803-8660-4/04/$20.00 ©2004 IEEE

ABSTRACT

This paper presents modern digital signal processor architectures, including the multiply-accumulate unit, shifter, pipelining and parallelism, buses, data address generators, and special addressing modes and instructions. In addition, the most effective mixed C-and-assembly programming technique is suggested for software development.

1. INTRODUCTION

Digital signal processing (DSP) gained popularity in the 1960s with the introduction of digital technology. It became the method of choice for processing signals as digital hardware increased in speed and became easier to use, less expensive, and more widely available. In 1979, Intel introduced the first DSP processor (the Intel 2920), which had an architecture and an instruction set specifically tailored for DSP applications. Today, general-purpose DSP processors are commercially available from Texas Instruments, Motorola, Analog Devices, Agere, DSP Group, and many other companies [1, 2]. As DSP processors have become less expensive and more powerful, real-world DSP applications such as high-speed modems and Internet access, wireless and cellular phones, audio and video players, and digital cameras have exploded onto the marketplace [3].

As illustrated in Figure 1, DSP systems use a DSP processor or other digital hardware, an analog-to-digital converter (ADC), and a digital-to-analog converter (DAC) to replace analog devices such as amplifiers, modulators, and filters. A DSP processor performs digital operations based on a specific signal-processing algorithm (or computational description) implemented in software to process the digital signals. DSP algorithms can be performed on a wide variety of digital hardware and in many computer languages such as C and C++. DSP hardware includes programmable and nonprogrammable logic, general-purpose microprocessors and microcontrollers, and general-purpose digital signal processors.

[Figure 1. A typical DSP system: analog signal -> ADC -> digital signal -> DSP processor -> digital signal -> DAC -> analog signal.]

The programmable processor can be programmed for a variety of tasks. It is used for systems that are too complicated to implement with nonprogrammable circuits, products that need shorter development time and lower development cost, or systems that must be upgraded frequently with new algorithms and standards. Most DSP processors employ a modified Harvard architecture, which provides a crossover path between program and data memory. Most processors are also optimized for performing repetitive multiply-accumulate (MAC) operations that sequentially access data stored in consecutive memory locations. As DSP hardware and algorithm capabilities have advanced, so have processing demands, resulting in higher-performance systems with more sophisticated algorithms for a new generation of applications.

In today's evolving DSP applications, flexibility and upgradeability of design are key factors in longer product cycles. Many industrial standards are in the early stages of development, and some of these standards must maintain compatibility with other standards. A good example is digital cellular phones, which have been upgraded from 2G to 2.5G, 3G, and 4G standards. Programmable DSP processors are especially suitable for designs that require multiple modes of operation and future upgradeability.

The research results of DSP are increasingly applied to the development of complete solutions that integrate algorithms, software, and hardware into a system. Because software development has become a larger expense than hardware development in major DSP systems, processor-independent design has the advantage of porting software to different processors and the
ability to migrate to more advanced processors in the future. In processor-independent design, a high-level language such as C or C++ is preferred and is available for most DSP processors. C programs are easier and faster to write, and they may be ported from one processor to another simply by recompiling the source code with the C compiler for the new processor. In applications where processing and memory resources are critical, the solution is a compromise that implements the critical sections in assembly language and uses C to code the rest. This mixed C-and-assembly programming provides a good balance between ease of coding and efficiency of implementation. The efficiency of C compilers has improved significantly, and many optimized assembly-coded DSP libraries allow the user to develop mixed code easily and efficiently.

Figure 2 illustrates how a DSP system is configured around the DSP processor. The major external blocks needed are memory and peripherals. DSP processors usually provide some on-chip cache, program read-only memory (ROM), data random-access memory (RAM), and peripherals. Peripherals such as the ADC and the DAC can connect either to the data bus using a dedicated address or to the serial interface if serial ports are available on chip.
[Figure 2. External interfaces for the DSP processor: the processor connects through its data and address buses to external memory and peripherals.]

2. DIGITAL SIGNAL PROCESSOR ARCHITECTURES

The task of developing an efficient DSP system depends on the DSP hardware and software architectures, including data flow, arithmetic capabilities, memory configurations, I/O structures, programmability, and the instruction set of the processor. The processor architecture and the corresponding DSP algorithm must be complementary. For some applications, the algorithm is given, and we have to select a suitable processor. For other applications, the processor is given, and the task is to develop efficient algorithms that satisfy the application requirements. Most DSP processors are designed to perform repetitive MAC operations such as finite-impulse response (FIR) filtering, expressed as

    y(n) = Σ_{i=0}^{L−1} b_i x(n−i),    (1)

where {b_0, b_1, …, b_{L−1}} are the filter coefficients, {x(n), x(n−1), …, x(n−L+1)} are the signal samples, and L is the length of the filter. The computation of the output y(n) requires the following steps:

1. Fetch two operands, b_i and x(n−i), from memory.
2. Multiply b_i and x(n−i) to obtain the product.
3. Add the product, b_i x(n−i), to the accumulator.
4. Repeat steps 1, 2, and 3 for i = 0, 1, 2, …, L−1.
5. Store the result, y(n), from the accumulator to memory.
6. Update the pointers for b_i and x(n−i) and repeat steps 1 through 5 for the next input sample.

The generic internal architecture of the DSP processor illustrated in Figure 3 is optimized for the FIR-filtering operations given in equation (1). Compared with general-purpose microprocessors, the most distinctive feature is the use of parallelism and pipelining to improve processing speed. DSP processors have a number of special processing units supported by multiple dedicated buses, most of which can operate independently and concurrently. As shown in Figure 3, the arithmetic and logic unit (ALU) performs addition, subtraction, and logical operations. The shifter is used for scaling data, and the hardware multiplier and accumulators are used to perform MAC operations. Data address generators (DAGENs) generate the addresses of the operands used by instructions. With these resources, the DSP processor achieves fast execution by performing operations within these units simultaneously.

[Figure 3. Generic DSP architecture of the data computation unit: DAGENs A, B, and C address memories A, B, and C, which feed the shifter, multiplier, ALU, and accumulator(s).]

Multiplication operations require several clock cycles on a microprocessor or microcontroller, where they are performed by repetitive shift-and-add operations. To achieve the speed required by multiplication-intensive DSP algorithms, such as the FIR filtering given in equation (1), DSP processors employ a fully parallel hardware multiplier, which can multiply two items of data within one clock cycle. At the same time, an adder immediately following the multiplier adds the product from the previous multiplication into a double-precision accumulator. Some processors, such as the TMS320C55x, provide two MAC units and four 40-bit accumulators. A number of dedicated instructions that can perform multiply, accumulate, data-move, and pointer-update operations in a single instruction are built into processors for filtering and correlation algorithms.

The basic arithmetic operations performed by DSP processors are addition and subtraction. The logical unit performs Boolean operations such as AND, OR, and NOT on the individual bits of a data word and executes logical shifts of the entire data word. Binary division is usually implemented with a software routine because it involves a repeated series of shift and conditional-subtraction operations. The shifter can be used for pre-scaling an operand in data memory or the accumulator before an ALU operation, or for post-scaling the accumulator value before storing it back into data memory.

The instructions that control the operations of the DSP processor require multiple steps to execute. First, the address of the instruction is generated, and the contents of program memory at that address are read and decoded. Based on the instruction, one or more operands are then fetched to provide the data required for executing that instruction. Finally, the results are stored, and the address of the next instruction is computed. Each instruction may take several clock cycles to execute the multiple steps of prefetch, decode, operand fetch, execute, and write result. These steps can be cascaded in assembly-line fashion by using pipelining. If each step requires one clock cycle, a sequence of seven-stage instructions (in the TMS320C55x) can be completed at one instruction per clock cycle once the pipeline is full.

The pipeline architecture takes advantage of the inherent decomposition of instructions into multiple serial operations. Figure 3 shows that cascading the multiplier and ALU allows the simultaneous operation of both. That is, while the multiplier is performing its work at time i to produce b_i x(n−i), the ALU adds the previous product, b_{i−1} x(n−i+1), into the accumulator. The parallel architecture takes advantage of the inherent parallelism of DSP algorithms and applications. As shown in Figure 3, all processing units in the parallel configuration can receive different data streams and execute different operations at each cycle. For example, the multiplier accesses two operands and multiplies them; at the same time, the DAGENs update the address pointers as specified, and the ALU adds the previous product into the accumulator with (or without) rounding.

Buses provide the communication paths among the units that make up a DSP processor. The execution speed of the filtering given in equation (1) can be improved further by using a separate data bus for each of the two multiplier inputs. Instead of requiring an operand fetch for b_i and another fetch on the same data bus for x(n−i), both operands can be fetched simultaneously on two separate data buses, as shown in Figure 3. The two data buses, with supporting address buses, connect to two separate memories [e.g., memory A for the coefficient b_i and memory B for the signal sample x(n−i)]. This configuration avoids the conflict of accessing two operands from the same memory at the same time. A third memory with associated address and data buses is used for storing the value in the accumulator back to memory C. In addition to these data memory blocks, there are program memory and dedicated program address and data buses for fetching instructions, which avoids delays in accessing data.

As shown in Figure 3, each memory has its own address bus, which originates from a DAGEN. Accessing the sequence of operands b_i and the corresponding x(n−i), i = 0, 1, …, L−1, is a regular sequential operation in which these operands are stored in consecutive memory locations. Each DAGEN simply increments (or decrements) the address pointer to point at the next data item within the same clock cycle in which the multiplier and ALU are performing arithmetic operations; as a result, no extra clock cycle is needed to update the address pointer. After accessing the last coefficient, b_{L−1}, the coefficient pointer has to wrap around to b_0 for the next iteration. This operation can be performed by arranging the coefficient buffer in a circular fashion. Therefore, further improvements to the DAGEN include modulo-L arithmetic for implementing circular buffers. In addition, the DAGEN supports bit-reversal addressing for computing FFT algorithms.

When processor speed is not a limiting factor, memory access time can be relaxed by using a slower clock rate. Another effective method is to let the processor run at full speed but allow a number of wait states for accessing memory. These processors usually have on-chip memory configurable as program memory, allowing small programs to be stored and executed at full speed. The initial loading of the program from slow external program memory can be accomplished using wait states.

DSP processors normally provide on-chip peripherals or peripheral interfaces to facilitate the integration of the DSP with external devices, such as an ADC, a DAC, or other DSP processors or microprocessors. In addition, some internal peripherals control and manage the clocking of the DSP processor, the data transfer mechanism, and power-management facilities. Most modern DSP processors provide both serial and parallel I/O capability. The serial port has the advantage of being separate from the data bus; thus, it is not constrained by stringent access-time and possible bus-conflict considerations. Modern processors have the following types of serial ports: standard serial ports, buffered serial ports (BSPs), time-division multiplexed serial ports, and multi-channel BSPs (McBSPs).
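The MAC loop and circular-buffer addressing described above can be sketched in portable C. This is an illustrative Q15 fixed-point implementation with a 32-bit ("double-precision") accumulator, not vendor code; on a real DSP the modulo pointer update and the multiply-accumulate would map to single-cycle DAGEN and MAC-unit operations rather than explicit C statements.

```c
#include <stdint.h>

#define L 4  /* filter length */

/* One output sample of the FIR filter y(n) = sum b_i * x(n-i),
 * computed with Q15 coefficients and a 32-bit accumulator.
 * state[] is a circular delay line; *pos indexes the newest sample. */
int16_t fir_q15(const int16_t b[L], int16_t state[L], int *pos,
                int16_t x_new)
{
    /* step 6 (input side): advance the circular pointer, store new sample */
    *pos = (*pos + 1) % L;
    state[*pos] = x_new;

    int32_t acc = 0;               /* double-precision accumulator */
    int idx = *pos;
    for (int i = 0; i < L; i++) {
        /* steps 1-3: fetch b_i and x(n-i), multiply, accumulate */
        acc += (int32_t)b[i] * (int32_t)state[idx];
        idx = (idx - 1 + L) % L;   /* modulo-L pointer update */
    }
    return (int16_t)(acc >> 15);   /* step 5: rescale Q30 -> Q15 and store */
}
```

With all four coefficients set to 0.25 (8192 in Q15) and a constant input of 0.25, the output settles to 0.25 once the delay line is full, which gives a quick sanity check of the scaling.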
The direct-memory-access (DMA) controller is used to control data transfers within the DSP processor's memory space, which includes on- and off-chip memory and peripherals. It operates independently of the processor. The data transfer is done in block format, and the DMA controller sends an interrupt to the DSP processor when the transfer is complete. Typically, the DMA can handle multiple channels (e.g., six channels in the C54x), and the user can assign the same or different priorities to the channels. With the increasing demand for running DSP-based products with less power and prolonged battery life, DSP processors incorporate power-management features in addition to the conventional low-voltage approach. Several methods are used in power management: (1) clock-frequency control, (2) power-down modes, and (3) disabling peripherals.
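Block-format DMA transfers are commonly paired with double ("ping-pong") buffering: while the DMA controller fills one buffer, the processor works on the other, and the completion interrupt triggers a swap. The following C sketch shows only the buffering pattern; the DMA transfer is simulated by a plain copy, since real controllers are programmed through vendor-specific registers.

```c
#include <stdint.h>
#include <string.h>

#define BLK 8  /* samples per DMA block */

/* Simulated DMA: in hardware this copy would run concurrently with
 * the CPU and raise an interrupt on completion. */
static void dma_transfer(int16_t *dst, const int16_t *src, int n)
{
    memcpy(dst, src, n * sizeof(int16_t));
}

/* Process nblocks blocks: work on one buffer while the "DMA" fills
 * the other. Returns the sum of all samples as a stand-in for real
 * signal processing. */
int32_t process_stream(const int16_t *input, int nblocks)
{
    int16_t ping[BLK], pong[BLK];
    int16_t *fill = ping, *work = pong;
    int32_t total = 0;

    dma_transfer(fill, input, BLK);        /* prime the first buffer */
    for (int b = 1; b <= nblocks; b++) {
        /* swap roles: the freshly filled buffer becomes the work buffer */
        int16_t *tmp = work; work = fill; fill = tmp;
        if (b < nblocks)                   /* start the next block transfer */
            dma_transfer(fill, input + b * BLK, BLK);
        for (int i = 0; i < BLK; i++)      /* process the current block */
            total += work[i];
    }
    return total;
}
```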
3. SOFTWARE DEVELOPMENTS

A programming language states the algorithm in a manner that precisely defines its operations inside the processor. In documenting the algorithm, it is sometimes helpful to clarify which inputs and outputs are involved by means of signal-flow diagrams. It is essential to document programs thoroughly with titles and comments, because doing so greatly simplifies troubleshooting and helps with program maintenance. For ease of understanding, it is also important to use meaningful mnemonics for variables, labels, subroutine names, and so on.

In general, most execution bottlenecks occur in a few sections of DSP code, usually in the loops (especially the inner loops) of a program. These loops may occupy only 10% of the code but may take 90% of the execution time. The best strategy is to code the entire algorithm in C first, identify the time-critical bottlenecks, and then rewrite only that small percentage of the code in assembly language. Because DSP C compilers generate intermediate assembly code for optimization, time-critical portions of code can be identified using profiling capabilities and replaced with handcrafted assembly code. Another method is to use a library of hand-optimized functions coded in assembly language by the engineers, or the run-time library provided by the manufacturer. These assembly routines may either be called as functions or coded in-line in the C program. Software libraries become important as DSP algorithms become more complicated and computationally demanding. DSP manufacturers usually provide a set of commonly used signal-processing operations in software libraries written optimally for a particular processor. Because of the improvement in C-compiler efficiency and the availability of user-friendly integrated software development tools, a mix of C and assembly routines is the most effective way of developing programs for DSP systems.

4. SYSTEM CONSIDERATIONS

A DSP processor family provides different devices to best match the given application. For example, the devices within the TMS320C54x family differ in the number of DSP cores, operating clock frequencies, voltages, on-chip ROM and RAM configurations, and the type and number of serial ports and host ports. DSP processors are following the path of microprocessors in terms of performance and on-chip integration. At the same time, power consumption has become an important issue for portable products. A DSP product design is constrained by the following key design goals:

1. Cost of the product
2. Cost of the design
3. Upgradeability
4. System integration
5. Power consumption

These design goals play key roles in selecting DSP processors. The selection of a DSP processor suited to a given application is a complicated task. Some of the factors that might influence the choice are cost, performance, future growth, and software and hardware development support. Using floating-point processors can increase the dynamic range of signals and coefficients. Floating-point processors are usually more expensive than fixed-point processors, but they are more suitable for high-level C programming; thus, they are easier to use and allow quicker time to market. The execution speed of a DSP algorithm is also an important issue when selecting a processor. When performance is the most important factor, the algorithm must be implemented with optimized code written for the candidate processors, and the execution times must be compared. The time to complete a particular algorithm coded in optimized assembly language is called a benchmark. A benchmark can be used to give a general measure of the performance of a specific algorithm on a particular processor.

Other related issues include memory size (on-chip and externally addressable) and the availability of on-chip peripheral devices such as serial and parallel interfaces, timers, and multiprocessing capabilities. In addition, space, weight, and power requirements must be minimized. A key system constraint is cost. As with general-purpose microprocessors, second sourcing, third-party support, and industry standards are other important considerations.
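The mixed C-and-assembly strategy discussed under software developments is usually organized by isolating the time-critical kernel behind a fixed C prototype: a portable C version is written first, and once profiling identifies the hotspot, a hand-optimized assembly routine exporting the same symbol can be linked in its place. A minimal sketch of the pattern follows; the function names are hypothetical, not from any vendor library.

```c
#include <stdint.h>

/* Time-critical kernel behind a fixed prototype. A hand-optimized
 * assembly routine with the same name and calling convention could
 * replace this portable C version at link time. */
int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;               /* 32-bit accumulator */
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];  /* MAC inner loop */
    return acc;
}

/* Non-critical control code stays in C and simply calls the kernel;
 * it is unaffected by whether the kernel is C or assembly. */
int32_t energy_q15(const int16_t *x, int n)
{
    return dot_q15(x, x, n);       /* signal energy as a dot product */
}
```

Because only the kernel's object file changes between the C and assembly builds, the rest of the program needs no modification and remains portable across processors.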
REFERENCES
[1] Sen M. Kuo and Woon S. Gan, Digital Signal Processors, Prentice Hall, Upper Saddle River, NJ, 2005.
[2] Sen M. Kuo and Bob H. Lee, Real-Time Digital Signal Processing, Wiley, Chichester, 2001.
[3] Phil Lapsley, Jeff Bier, Amit Shoham, and Edward A. Lee, DSP Processor Fundamentals, IEEE Press, Piscataway, NJ, 1995.