
Implementation of Texas Instruments TMS32010 DSP Processor on Altera FPGA

Chang Choo, Jeff Chung, James Fong, Shinghin Eddy Cheung
Department of Electrical Engineering
San Jose State University
San Jose, CA 95198-0084, USA
(408) 924-3980

ABSTRACT

As DSP designs on FPGAs become increasingly complex, an embedded programmable controller with not only conventional instructions but also DSP-specific instructions becomes desirable. Following this line of thought, we implemented the monumental TI TMS32010 DSP processor as a softcore (IP) on an Altera Stratix FPGA. All the TMS32010 instructions are fully functional. The synthesized design, which runs at 40 MHz, takes up approximately 1,800 LEs, 37 Kbits of RAM, and 2 DSP blocks. Using the smallest Stratix device, EP1S10, our DSP core utilizes about 15% of all on-chip LEs. We are currently working on extending the instruction set, including parallel MAC and distributed arithmetic instructions.

INTRODUCTION

Recently, FPGAs have come to the forefront in DSP technology [5]. Design implementation using FPGAs greatly reduces the time to market compared to ASICs or custom ICs. As DSP designs become increasingly complex, an embedded programmable CPU with not only conventional instructions [3,4] but also DSP-specific instructions becomes desirable [6]. Following this line of thought, this paper summarizes our recent progress in developing a Verilog HDL softcore of a DSP processor based on the Texas Instruments TMS320C10 [1,2], implemented on an Altera Stratix FPGA [7]. An advantage of implementing a full-featured DSP processor softcore based on the architecture of a seasoned, well-used TI DSP is the seamless integration of old software from the DSP processor into the new FPGA.

FPGA manufacturers such as Altera and Xilinx already have well-developed libraries of basic logic functions, including multipliers, shifters, and memories. Altera calls such library modules Mega-Functions. Another useful feature of the Altera Stratix architecture is the availability of DSP blocks, which consist of dedicated hardware that boosts the performance and area utilization of components such as multipliers, adders, and accumulators. We used these features to optimize our DSP design.

This paper is organized as follows. The next section briefly describes the TMS32010 architecture, including its addressing modes and instruction set, together with our implementation of that architecture. The following section describes the verification of our implementation, including a simple FIR filter program. After presenting synthesis results, we conclude the paper.

ARCHITECTURE OF TMS32010

The TMS32010 is the first generation of the TI DSP processors. It is a 16-bit processor with only 144 words of data memory. The separate program memory can hold up to 4K words of instructions and coefficient data. The instruction set consists of 60 instructions and supports both DSP-specific (numeric-intensive) and general-purpose operations. The TMS32010 supports three addressing modes: direct, register-indirect, and immediate. The internal datapath architecture is shown in Figure 1 at the end of this paper [1].

Our softcore implementation of the TMS32010 is split into 3 main modules: controller, dataBus, and programBus. The controller consists of a finite state machine with states for instruction fetch and decode based on the 16-bit instruction set. The controller outputs control signals to every submodule in the dataBus and programBus. The programBus module contains a 16-bit bus called programBus (in effect, the Instruction Register), which holds each instruction fetched from the Program ROM. The dataBus module contains a 16-bit bus called dataBus, which carries all data values. The dataBus module also contains an ALU, multiplier, accumulator, barrel shifter, and muxes.

A more detailed description of the DSP softcore is given in the following subsections. Complete details may be found in [8] and [9].

A. Controller

The controller contains a state machine, which provides all control signals to the programPath and dataPath. During startup and reset, an initialization state followed by two wait states sets up the CPU. This allows initialization values to propagate through the logic. The decode state determines which instruction is on the programBus. Interrupts and direct/indirect addressing are also detected within this state. All other instruction functionality branches off the decode state. When the execution of an instruction is complete, the state machine always returns to the fetch/decode state.

The number of states per instruction varies from four to six, depending on the complexity of the instruction. The controller outputs mux controls, ALU functions, register reads/writes, and other control signals to the DSP hardware depending on the current instruction. The program counter (PC) is enabled during the second-to-last instruction state. It takes two state cycles for the next instruction to stabilize on the programBus.

B. Program Path

The programPath contains the instruction ROM, PC, stack, and various muxes. The program bus also resides in the programPath. The program bus holds the 16-bit DSP instruction from the instruction ROM. The instruction ROM size is 1.5K×16 bits, but may be extended up to 4K×16 bits. The PC and stack both provide addressing functionality to the instruction ROM. The stack size is 4×12 bits.

C. Data Path

The dataPath houses all the arithmetic hardware: a 32-bit ALU, a 32-bit accumulator, and a 16×16-bit multiplier with a 32-bit product. It also contains a 144×16-bit data RAM. Two shifters are also used in the architecture. The first, a 0 to 16 bit barrel shifter, takes a 16-bit input from the data bus and outputs a 32-bit value to the ALU. The second shifter takes the output of the accumulator, shifts it by 0, 1, or 4 bits, and outputs 16 bits to the dataBus. The auxiliary register (AR) and auxiliary register pointer (ARP), used for indirect addressing, are also located within the dataPath. The 16-bit dataBus, which carries data to and from the data RAM and accumulator, exists only in this module.

D. Altera Mega-Functions

Altera MegaFunctions were used throughout the design to optimize the FPGA hardware. In the programPath module, a 1.5K×16-bit Altera altsyncram megafunction is used as the instruction ROM. All instructions are stored in a rom.mif file, which is read in during synthesis. For the PC, a 12-bit Altera lpm_counter megafunction is instantiated.

In all modules, Altera MegaFunctions are instantiated wherever possible, in order to optimize the design and performance.

In the dataPath, the multiplier uses an Altera lpm_mult megafunction with 16×16-bit inputs and a 32-bit output. Rounding and saturation logic was added to the multiplier-accumulator. The T register is used to load an operand input to the multiplier, and the P register stores the 32-bit product output. The ALU uses Altera’s lpm_compare, lpm_add_sub, lpm_and, lpm_or, and lpm_xor megafunctions for logical and arithmetic computations. The accumulator uses the Altera lpm_abs megafunction for absolute value computations. A 256×16-bit altsyncram is used for the Data RAM. For testing purposes, initial values are pre-loaded into the Data RAM by using a ram.mif file, which is read in during synthesis.

VERIFICATION

In order to verify the functionality of the Verilog code that was written (the list of all modules is shown in Table 1 at the end of this paper), a series of test cases was created to test each individual instruction.
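Per-instruction tests need expected values to compare the simulation against. One way to obtain them is a small software reference model of the multiply-accumulate path described above (T register, P register, 32-bit accumulator, and the output shifter with shifts of 0, 1, or 4 bits). The Python sketch below is such a model under stated assumptions: the class and method names are hypothetical and do not come from the Verilog sources, overflow simply wraps to 32 bits, and the rounding and saturation logic is not modeled.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


class MacModel:
    """Hypothetical reference model of the T/P/accumulator path (no saturation)."""

    def __init__(self):
        self.t = 0    # 16-bit multiplier operand (T register)
        self.p = 0    # 32-bit product (P register)
        self.acc = 0  # 32-bit accumulator

    def load_t(self, word):
        """Load a 16-bit data word into T (the role of the LT instruction)."""
        self.t = to_signed(word, 16)

    def multiply(self, word):
        """P = T * word, both treated as signed 16-bit (the role of MPY)."""
        self.p = to_signed(self.t * to_signed(word, 16), 32)

    def load_acc_from_p(self):
        """Accumulator = P (the role of PAC)."""
        self.acc = self.p

    def add_p_to_acc(self):
        """Accumulator += P, wrapping to 32 bits (the role of APAC)."""
        self.acc = to_signed(self.acc + self.p, 32)

    def store_high(self, shift=0):
        """Return the upper 16 bits of the shifted accumulator (the role of SACH)."""
        if shift not in (0, 1, 4):
            raise ValueError("output shifter supports shifts of 0, 1, or 4 only")
        return ((self.acc << shift) >> 16) & 0xFFFF


def sum_of_products(model, coeffs, samples):
    """Accumulate coeffs[i] * samples[i] using the model's register operations."""
    for i, (h, x) in enumerate(zip(coeffs, samples)):
        model.load_t(x)
        model.multiply(h)
        if i == 0:
            model.load_acc_from_p()
        else:
            model.add_p_to_acc()
    return model.acc


if __name__ == "__main__":
    m = MacModel()
    sum_of_products(m, [0x4000] * 4, [0x2000] * 4)  # four terms of 0.5 * 0.25 in Q15
    print(hex(m.store_high(shift=1)))
```

With Q15 operands, the shift of 1 in `store_high` removes the duplicated sign bit of the Q30 product, so the stored word is again in Q15 format.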

In each test case, the program memory was loaded with the appropriate opcode for that specific instruction using Altera’s Memory Initialization File (MIF) system. A MIF file is an ASCII file that specifies the initial content of each address in a memory block. For example, setting address 0 to a specific opcode ensures that the instruction is executed right after a system reset takes place. The instructions in the program memory are then executed sequentially, as the program counter is incremented by 1 after the previous instruction completes. For instructions that require a data memory access, a MIF file was also used to set the desired values in data memory. After a simulation is run, Quartus II allows the user to view the contents of memory blocks in table format. For instructions that modify the data memory, a screen shot of this table was taken and examined for each specific test case.
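Writing such initialization files by hand becomes tedious once there are many test cases, so they can be generated by a script. The following Python sketch emits MIF text for a list of 16-bit words and fills the remaining addresses with zeros. The function name `make_mif` is hypothetical, and the two words in the usage example are arbitrary placeholders, not real TMS32010 opcodes.

```python
def make_mif(words, depth, width=16):
    """Return Memory Initialization File text for `words` in a depth x width memory."""
    if len(words) > depth:
        raise ValueError("more words than memory locations")
    digits = (width + 3) // 4          # hex digits needed per data word
    mask = (1 << width) - 1
    lines = [
        f"WIDTH={width};",
        f"DEPTH={depth};",
        "ADDRESS_RADIX=HEX;",
        "DATA_RADIX=HEX;",
        "CONTENT BEGIN",
    ]
    # One "address : data;" entry per supplied word, starting at address 0.
    for address, word in enumerate(words):
        lines.append(f"    {address:03X} : {word & mask:0{digits}X};")
    # Fill all remaining addresses with zero using the MIF range syntax.
    if len(words) < depth:
        lines.append(f"    [{len(words):03X}..{depth - 1:03X}] : {0:0{digits}X};")
    lines.append("END;")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder words only; a real test would use the opcode under test.
    print(make_mif([0x1234, 0xABCD], depth=1536))
```

A depth of 1536 corresponds to the 1.5K-word instruction ROM; the same function with `depth=256` would cover the data RAM.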

Once every instruction was verified individually, we tested the execution of several application programs. Convolution, a very important routine in DSP, is one such example, and is expressed as follows:

$$Y(n) = \sum_{m=0}^{N} H(m)\, X(n-m),$$

where H and X are the filter coefficients and sampled data, respectively. On the TMS32010, a 4-tap FIR filter is performed by the following convolution program.

    LT    XN0
    MPY   H0
    PAC
    LT    XN1
    MPY   H1
    APAC
    LT    XN2
    MPY   H2
    APAC
    LT    XN3
    MPY   H3
    APAC
    SACH  YN,1

The above program was loaded into the program memory, and data memory was pre-loaded with both filter coefficients and data. Execution of this program was monitored cycle by cycle and verified to be working correctly.

SYNTHESIS RESULTS

We used Altera’s Quartus II Version 3.0 software for simulation, synthesis, and place & route. Our DSP design, fully synthesized, takes up approximately 1,800 LEs (logic elements), 37K RAM bits, and 2 DSP blocks. A snapshot of the Flow Summary in Quartus II of our design is shown in Figure 2. The three numbers circled in red show total logic elements, total memory bits, and number of DSP block 9-bit elements.

Figure 2: Quartus II Flow Summary Snapshot

Figure 3 shows a snapshot of the Quartus II Timing Analyzer Summary.

Figure 3: Quartus II Timing Analyzer Summary

Since the purpose of our design is to develop an embedded DSP CPU softcore, the area of our design relative to the total area of each Altera Stratix FPGA is important. Figure 4 shows the percentage of total logic elements used in each of the Stratix FPGAs, from the smallest, EP1S10, to the largest as of this writing, EP1S120. In the smallest FPGA, the EP1S10, our synthesized design took up approximately 18.7% of all on-chip logic elements. In the largest device, the EP1S120, our design used only 1.7% of the logic elements.
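As a rough reading of these percentages, the Python sketch below estimates how many copies of the core would fit in a device if logic elements were the only constraint. The function name is hypothetical, and the estimate is an upper bound only: it ignores RAM bits, DSP blocks, I/O, and routing, all of which would lower the count in practice.

```python
import math


def max_cores_by_le(le_percent_per_core):
    """Upper bound on core instances when only logic elements are considered."""
    if le_percent_per_core <= 0:
        raise ValueError("utilization must be positive")
    return math.floor(100.0 / le_percent_per_core)


if __name__ == "__main__":
    # LE utilization figures quoted in the text for the smallest and largest device.
    for device, percent in (("EP1S10", 18.7), ("EP1S120", 1.7)):
        print(device, max_cores_by_le(percent))
```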

Figure 4: Usage of LEs for Stratix Devices (bar chart of used LEs per device (%) against Stratix device; bar labels in device order EP1S10, EP1S20, EP1S25, EP1S30, EP1S40, EP1S60, EP1S80, EP1S120: 18.7%, 10.7%, 7.7%, 6.1%, 4.8%, 3.5%, 2.5%, 1.7%)

Our design is optimized and uses logic elements efficiently. The graph shows that there is plenty of space in the FPGA to insert additional logic, such as various application-specific hardware accelerators and I/O logic blocks. Also, one may insert multiple DSPs into one FPGA for parallel processing applications.

Figure 5 shows the percentage of total RAM bits used in each Altera Stratix FPGA. In the smallest FPGA, the EP1S10, our fully synthesized design uses only approximately 4% of all on-chip RAM bits. In the largest FPGA, the EP1S120, only 0.4% of all on-chip RAM bits are used. There are plenty of RAM bits available in all Stratix FPGAs for other applications. This is in part because the TI TMS320C10 architecture uses only a small amount of RAM and ROM resources.

Figure 5: Usage of RAM bits for Stratix Devices (bar chart of used RAM bits per device (%) against Stratix device; bar labels in device order EP1S10, EP1S20, EP1S25, EP1S30, EP1S40, EP1S60, EP1S80, EP1S120: 4.0%, 2.2%, 1.9%, 1.1%, 1.1%, 0.7%, 0.5%, 0.4%)

CONCLUSIONS

Although the TMS32010 DSP processor is more than 20 years old, it represents the legacy foundation of the digital signal processor and the fundamental design properties of today’s processor technology. We successfully implemented a working DSP processor based on the TMS320C10 in Verilog, which synthesized into an Altera Stratix FPGA.

Our fully synthesized design, which runs at 40 MHz, takes up approximately 1,800 LEs, 37 Kbits of RAM, and 2 DSP blocks. Using the smallest Stratix device, the EP1S10, our DSP utilizes only 15% of all on-chip LEs. When synthesized into the largest Stratix device, the EP1S80, our DSP utilizes just over 2% of all on-chip LEs.

In terms of future work, there are many possible areas for improvement and further development. To improve the DSP performance with respect to maximum frequency, the controller can be redesigned to include instruction look-ahead and prediction functionality. The main goal is to complete one instruction per clock cycle. In this case, up to 200 MIPS performance may be achieved. Extra pipelining to reduce long paths will also help speed up the DSP. The current controller has specific states for each and every instruction, but could be improved by creating generic multi-cycle stages for the execution of each instruction. This would make it easier to implement a faster, pipelined architecture.

Researching new and improved instructions to add to the baseline can also increase the usability of the DSP. For instance, a parallel multiplication instruction could prove useful for increasing performance. Another possible topic is to combine multiple DSP cores into one processor. Data and instruction management is an important issue in this case, but the added complexity brings many performance benefits. This type of parallel processing is common in today’s market. We are currently working on extending the instruction set, including parallel MAC and distributed arithmetic instructions.

ACKNOWLEDGMENTS

We thank the Altera University Program for generous support with EDA tools, hardware, and training. We also thank the Texas Instruments DSP University Program for its past support of the first author while he was with Worcester Polytechnic Institute and San Jose State University.

REFERENCES

[1] “TMS320C1x Digital Signal Processors”, http://focus.ti.com/lit/ds/symlink/smj320c15.pdf, pp. 8-10, 13, 1991.
[2] “Texas Instruments Enters Digital Signal Processor Chip Market with High-Speed TMS320”, Texas Instruments, Incorporated, http://www.ti.com/corp/docs/company/history/82dspnews.shtml, pp. 1-2, 2003.
[3] David A. Patterson and John L. Hennessy. Computer Architecture - A Quantitative Approach. San Francisco, CA: Morgan Kaufmann, 1990.
[4] David A. Patterson and John L. Hennessy. Computer Organization & Design – The Hardware/Software Interface. San Francisco, CA: Morgan Kaufmann, 1998.
[5] “FPGAs Provide Reconfigurable Digital Signal Processing Solutions”, Altera Corp., http://www.altera.com/literature/wp/wp_dsp_fpga.pdf, pp. 2-6, 2001.
[6] “The Evolution of DSP Processors”, Berkeley Design Technology Inc., http://www.bdti.com/articles/evolution.pdf, pp. 1-9, 2000.
[7] “Stratix Device Backgrounder”, Altera Corp., http://www.altera.com/literature/wp/wp_stx_backgrounder.pdf, 2002.
[8] Jeff Chung, “FPGA implementation of a Digital Signal Processor Based on the Texas Instruments TMS320C10 DSP Volume I: Architecture”, San Jose State University, 2004.
[9] James Fong and Eddy Cheung, “FPGA implementation of a Digital Signal Processor Based on the Texas Instruments TMS320C10 DSP Volume II: Verification”, San Jose State University, 2004.

Table 1: List of Verilog Modules

abs.v            accuminmux.v     accumshifter.v   accumulator.v    add_sub.v
adder.v          alu.v            aluinmux.v       alushifter.v     ar.v
arinmux.v        arp.v            compare.v        controller.v     counter.v
databusmux.v     datapath.v       dataram.v        dataraminmux.v   dp.v
dportmux.v       dsp.v            mac.v            mult.v           multinmux.v
multiplier.v     pc.v             pcinmux.v        programbusmux.v  programpath.v
reg_16.v         rom.v            shiterAluIn.v    stack.v

Figure 1: TMS32010 internal architecture (Courtesy of Texas Instruments, Inc.)