DEVELOPMENT AND VALIDATION OF A SCALABLE DSP CORE

Thomas Johansson, Patrik Thalin, Ulrik Lindblad and Lars Wanhammar
Dept. of Electrical Engineering, Linköping University
SE-581 83 Linköping, Sweden
E-mail: {thomasj, patrikt, ulrikl, larsw}@isy.liu.se

ABSTRACT

This paper describes the design flow used in an on-going implementation of a fixed-point DSP processor with variable word length. Following the top-down design approach described in [1], the flow is organized as a sequence of models and stepwise refinements, and the pros and cons of the different models are compared. For the purpose of validation, a flexible framework has been developed. It allows regression testing and dynamic selection of both the test data set and the model under test. Using the framework, model output is automatically compared with the expected result provided by the golden model.

1 INTRODUCTION

The design and implementation of the described processor were motivated by our need, in education and research, for a viable way of implementing non-essential functions in a system-on-chip. Using a commercial processor, the Motorola DSP56002, as a source of inspiration was intended to simplify the design and validation. The choice of target processor was motivated by previous knowledge of its architecture and computational capabilities. Also, tools and assembly code provided by Motorola could initially be reused, cutting development time even further. Several commercial versions of the processor were also investigated so that the development cost could be compared with the cost of purchasing a commercial core, to make sure that in-house development was justified. Since we target low-power implementations, a commercial core would have to include RTL code, so that it could be modified and optimized for different applications. This optimization would typically include scaling of the native data word width, introduction and/or removal of instructions from the instruction set, modifying the number of memories and their sizes, and introduction and/or removal of unwanted peripherals.

1.1 About the Motorola DSP56002

The Motorola DSP56002 is a general-purpose DSP processor with a triple-bus Harvard architecture that enables a high degree of parallelism. It uses fixed-point arithmetic and has three functional units: the data arithmetic and logic unit (Data ALU), the address generation unit (AGU) and the program control unit (PCU). The Data ALU has a multiplication and accumulation (MAC) unit, two accumulators and four input registers. The AGU has two arithmetic units and 24 registers used for address calculation. All memories have their own separate buses for data and addresses. This architecture makes it possible to perform parallel data movements during Data ALU operations. Fig. 1 shows an overview of the architecture; for more details, see [2, 3].


Figure 1. Overview of the Motorola DSP56002 architecture
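As an aside for readers unfamiliar with the operation, the following C sketch (ours, not code from Motorola or from the project) illustrates the kind of multiply-accumulate loop the MAC unit accelerates; the FIR-style function and the choice of container widths are illustrative only.

#include <stdint.h>

/* Illustrative fixed-point MAC loop. 24-bit operands are held in
 * 32-bit containers and products are accumulated in 64 bits, loosely
 * mirroring the DSP56002's 24-bit data paths and extended-precision
 * accumulators. */
int64_t fir_mac(const int32_t *x, const int32_t *h, int taps)
{
    int64_t acc = 0;                  /* clear the accumulator */
    for (int i = 0; i < taps; i++)
        acc += (int64_t)x[i] * h[i];  /* one MAC per coefficient */
    return acc;
}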

1.2 Organization of the Paper

The remainder of this paper is organized as follows. In Section 2, the new features of the DSP are covered. Section 3 defines the model concept and compares the currently implemented models. The adopted definition of validation can be found in Section 4, together with a description of the validation framework. Finally, an evaluation of the design flow is given in Section 5.

2 DESIGN GOALS

A number of design goals were set at an early stage of development. They aim at reducing power consumption by adapting the numerical properties through data width scaling, and at allowing the instruction set to be modified, including the incorporation of more complex instructions and corresponding hardware units. The introduced changes are described in the following sections.

2.1 Hardware Modifications

As power dissipation in system-level buses contributes significantly to the global power [4], we decided to modify the bus structure of the original hardware architecture. Aiming at low-power implementations, one common way of dealing with this problem is to use Gray-coded addressing. This approach exploits the instruction locality frequently seen in DSP application programs [5]. Since this locality results in sequential (program) memory accesses, the switching activity, and hence also the power dissipation, is significantly reduced by the use of Gray coding [6]. Whereas Gray coding is efficient for sequential addressing, it does not yield the same gains for bus transfers of non-sequential data. Power dissipation originating from the data buses can instead be reduced by using a variable native data word width. This method was adopted in our project. Originally, the Motorola DSP56002 has a 24-bit native data width, but in our implementation this width can be set to any value between 16 and 24 bits. As a consequence of this (among other things), the processor was divided into a calculation core together with external memories and memory-mapped hardware accelerators, as shown in Fig. 2.

Figure 2. The modified hardware architecture
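To make the argument concrete, the following C sketch (ours; the function names are invented and this is not code from the project) shows the standard binary-to-Gray conversion and counts how many bus lines toggle between consecutive addresses; with Gray coding exactly one line switches per sequential access.

#include <stdio.h>

/* Standard binary-to-Gray conversion. */
static unsigned to_gray(unsigned b) { return b ^ (b >> 1); }

/* Number of bus lines that toggle between two address values. */
static int toggles(unsigned a, unsigned b)
{
    int n = 0;
    for (unsigned d = a ^ b; d; d >>= 1)
        n += d & 1u;
    return n;
}

int main(void)
{
    /* Binary 7 -> 8 (0111 -> 1000) toggles four lines; the Gray
     * codes of any two consecutive addresses differ in one bit. */
    for (unsigned a = 6; a < 10; a++)
        printf("%u->%u: binary %d toggles, Gray %d toggle(s)\n",
               a, a + 1, toggles(a, a + 1),
               toggles(to_gray(a), to_gray(a + 1)));
    return 0;
}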

2.2 Toolset

Aiming at a complete set of development tools in the future, an assembler and a C compiler are currently under development. Unlike the tools provided by Motorola, these will support the introduced variable data word width. The variable width is already supported in our first model, which can be used as an instruction set simulator (ISS). Fig. 3 illustrates the ISS running together with Domain Technologies' BoxView™ environment [7].

Figure 3. The BoxView environment

3 THE MODEL CONCEPT

We here define a model as a step in a design flow of successive refinement. Each model consists of a number of sub-blocks, which are completed as sufficiently well-defined blocks are obtained. The goals for each model are stated in advance.

3.1 Top-down Design

In top-down design, the system is developed stepwise by synthesizing and validating each level. The design levels are successively partitioned into sub-blocks, and this decomposition is repeated until sufficiently simple blocks are obtained, as illustrated in Fig. 4. Decomposition with successive refinement also guarantees that the larger and more important issues are resolved before the detailed ones. By making only small modifications between successive models, managing and validating the models becomes simpler, and a correct design becomes more likely. Since only the structure of each level changes during the design process, the design can at all levels be validated against the top model with the same validation methods. This framework for testing is described later.

Figure 4. The top-down approach


3.2 Design Flow

The design process essentially follows a top-down approach with stepwise refinement of the subsystems. Implementation is done in small refinement steps, and a new model is declared for each major step. The decomposition is repeated at each level until sufficiently well-described blocks are obtained, and on each model level a new version is declared for each sub-block that is fully evolved.

The first model was purely behavioral and intended as a bit-true instruction simulator. It was written in C because of the requirement for high simulation speed and the vast range of available development tools. To ease the use of the simulator, it was designed with a separate, replaceable user interface, see Fig. 5. This provides an easy-to-use graphical user interface (GUI). Two GUIs have been adopted for use with the simulator: the most used is BoxView, while the second, still considered a beta version, was developed as a student project.

In the second model, we mainly ported the C code to VHDL. This was simplified by the use of a C-style package for VHDL [8]. This model was intended to introduce timing and pipeline issues. With the ability to use back-annotation and a more refined structure later, the decision was made to switch over to model M2 once we had verified the basic functionality.
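The replaceable user interface can be illustrated by a small C sketch (ours; the type and function names are invented, as the paper does not publish the actual interface): the simulator core talks to the outside world only through a table of callbacks, so BoxView, the student GUI or a plain batch front end can be attached without touching the core.

#include <stdio.h>

/* Hypothetical callback table decoupling the core from its UI. */
typedef struct {
    void (*show_regs)(const char *dump); /* display a register dump */
    void (*notify_halt)(unsigned pc);    /* report a stop/breakpoint */
} sim_ui_t;

/* A trivial batch front end implementing the same interface. */
static void batch_show(const char *dump) { puts(dump); }
static void batch_halt(unsigned pc)      { printf("halt at %u\n", pc); }

int main(void)
{
    sim_ui_t ui = { batch_show, batch_halt };
    ui.show_regs("A=000000 B=000000");   /* stand-in for a real dump */
    ui.notify_halt(0x40);
    return 0;
}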

Figure 5. Simulator logical views (user interface, command interpreter, instruction interpreter, program control, address generation, execution, registers, program and data memories, log)

The third model is where the most noticeable design steps have appeared. The model is completely written in behavioral VHDL and was based on a port of model two. In this model we completely rewrote the decoder and the move operations. Prior to the third model, all move operations were based on function calls; in this model, we have refined the design to work as a network of sub-components completely controlled by the program control unit (PCU). The design process has thus turned function calls into sub-components, with the interfaces defined by the former function-call parameters. During this refinement we also cleaned up and removed most of the old data structures. Model three is today a functional design emulating the bus structure as well as the separate functional units. We also emulate the use of a global clock to synchronize the blocks.
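The difference between the two modelling styles can be sketched in C (all names are ours and hypothetical; the actual models are written in C and VHDL respectively): in model two a parallel move is a direct function call, whereas in model three the PCU asserts control signals to which the bus and register network reacts.

/* Hypothetical, simplified processor state. */
typedef struct {
    int regs[8];
    int x_mem[256];
} dsp_state_t;

/* Model-two style: a move is a direct function call. */
void move_x_to_reg(dsp_state_t *s, unsigned addr, unsigned reg)
{
    s->regs[reg] = s->x_mem[addr];
}

/* Model-three style: the PCU decoder drives control signals and the
 * sub-components react to them on each clock tick. */
typedef struct {
    unsigned src_addr; /* address driven onto the X data bus */
    unsigned dst_reg;  /* destination register select */
    int      move_en;  /* move enable, asserted by the PCU */
} move_ctrl_t;

void bus_cycle(dsp_state_t *s, const move_ctrl_t *ctl)
{
    if (ctl->move_en)
        s->regs[ctl->dst_reg] = s->x_mem[ctl->src_addr];
}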

3.3 Future Work

In preparation is the fourth model, which aims at implementing the full bus structure, separated functional blocks and a complete decoder with full control signals. In this refinement we also prepare each block for synthesis. The validation interface will then have to change: when back-annotation is used to verify the synthesized behavior, it is no longer possible to take advantage of global variables in VHDL, so a simple way of accessing internal registers is needed. One such unit has been developed but is not yet fully tested. As a first step, we use a simple test controller that accesses the internal registers and puts them on the global data buses. Aiming at an FPGA implementation for a demonstrator, we are also developing a general and programmable test bench structure. It will include access to and control of the internal RAMs, and have a graphical user interface that functions as a general debugger for synthesized code downloaded to an FPGA.

4 VALIDATION

We define validation as the process of determining whether a model is an accurate representation of the system. A test is defined as a set of data that is fed to the system; if the output matches the expected result, the system is said to be valid for that test. The validation process consists of a set of tests. An important strategy in the validation process has been to reuse the validation framework between models. We have accomplished this by developing a tool, maketest, for running the validation process. It handles loading data into the models, comparing the output and presenting the results. The tool is written as a shell script that executes the models.
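maketest itself is a shell script, but its central step can be illustrated by a C sketch (ours; the function and file names are hypothetical): the model's log is compared line by line against the log produced by the golden reference, and any mismatch invalidates the test.

#include <stdio.h>
#include <string.h>

/* Compare a model log against the golden reference, line by line.
 * Returns the number of mismatching lines, or -1 on file error. */
int compare_logs(const char *ref_path, const char *dut_path)
{
    FILE *ref = fopen(ref_path, "r");
    FILE *dut = fopen(dut_path, "r");
    char a[256], b[256];
    int line = 0, errors = 0;

    if (!ref || !dut) {
        if (ref) fclose(ref);
        if (dut) fclose(dut);
        return -1;
    }
    while (fgets(a, sizeof a, ref) && fgets(b, sizeof b, dut)) {
        line++;
        if (strcmp(a, b) != 0) {       /* register contents differ */
            printf("mismatch at line %d:\n  ref: %s  dut: %s", line, a, b);
            errors++;
        }
    }
    fclose(ref);
    fclose(dut);
    return errors;                     /* 0: valid for this test */
}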


4.1 Test

The tests are written at a high level of abstraction; a test consists of an assembly program (program memory data) and memory data. Some tests are created for the purpose of validating a specific instruction, whereas others are actual reference programs such as FFTs and digital filters. The tests are not written to ensure a certain degree of fault coverage, but rather to cover a large number of corner cases and special cases. Using MATLAB, we also generated random data to obtain high validity within a reasonable test time. Adding a new test is as easy as creating a directory and writing an assembly program.

Figure 6. Validation flow (an assembly program and memory data are fed to both the reference and the model; the resulting logs and memory dumps are then compared and analyzed)

4.2 Validation Reference

The reference for the validation is a Motorola DSP56002 chip on an evaluation board. The output from the model and the reference are memory dumps and a log file containing the register contents after every executed instruction. The validation is performed by comparing the results from the reference and the model, see Fig. 6; the maketest tool handles this task. It can be configured to run a selected test or all of them. Running tests in parallel on multiple computers and CRC checking of result files are used to speed up the process. The result of each test is presented during the process, and a special file is also produced that allows for manual inspection of errors, see Fig. 7; errors are marked with inverted text. This is useful since validation is an ongoing process during development and not something to be attempted only after the model has been fully developed. Validation of earlier models (regression testing) is also possible. This is very useful when adding a new test, to ensure that earlier models are still valid.

Figure 7. Analysis of an error in the validation process

5 CONCLUSIONS

The design flow, with a sequence of models and small steps when migrating between them, has worked well. The first model took a long time to develop, mostly because the core's functionality had to be understood and modelled. The two following models had a much shorter development time, since much of the previous work was reused. We believe that this strategy will ensure a correct design within reasonable time. The design of model one with a replaceable user interface made it easy to integrate with BoxView; the needed interface module was designed and implemented in less than a week. The validation process has proven to work well, especially the reuse of the validation framework between models. As expected, the simulation time is much longer for the VHDL models than for the C model: the C model is about seven times faster than the first VHDL model. This is a major advantage, since the early models require validation more often. Once the first valid model existed, fewer validation runs were required due to knowledge of the expected behavior. We expect the execution time to grow as the design is refined, since more details are added. At this stage no validation has been carried out with data widths other than the original 24 bits, since we prioritized completing the development chain to a working core before considering modifications.

6 REFERENCES

[1] L. Wanhammar, DSP Integrated Circuits, Academic Press, pp. 13-16, 1999.
[2] DSP56000 24-Bit Digital Signal Processor Family Manual, Motorola, Austin, TX, 1994.
[3] DSP56002 Digital Signal Processor User's Manual, Motorola, Austin, TX, 1993.
[4] L. Benini, G. De Micheli, E. Macii, D. Sciuto and C. Silvano, "Address bus encoding techniques for system-level power optimization," in Proc. Design, Automation and Test in Europe, pp. 861-866, 1998.
[5] Ya-Lan Tsao, Ming Hsuan Tan, Jun-Xian Teng and Shyh-Jye Jou, "Parameterized and low power DSP core for embedded systems," in Proc. International Symposium on Circuits and Systems, vol. 5, pp. V-265-V-268, 2003.
[6] Ching-Long Su, Chi-Ying Tsui and A. M. Despain, "Low power architecture design and compilation techniques for high-performance processors," in Proc. IEEE Computer Conference, pp. 489-498, 1994.
[7] BoxView Debugger for DSP56xxx, Domain Technologies Inc., Plano, TX, June 1999. [Online]. Available: http://www.domaintec.com/images/pdf/boxview56k.pdf
[8] M. J. Knieser, F. G. Wolff and C. A. Papachristou, "C/UNIX Functions for VHDL Testbenches," presented at the Synopsys Users Group, San Jose, 2002.
