Rapid Prototyping of Digital Signal Processing Systems with GRAPE-II

Rudy Lauwereins, Marc Engels, Marleen Adé, J.A. Peperstraete
Katholieke Universiteit Leuven, ESAT Department
Kard. Mercierlaan 94, B-3001 Heverlee, Belgium
Tel. +32-16-22 09 31, Fax. +32-16-22 18 55
email: [email protected]

Abstract: The paper describes GRAPE-II (Graphical RApid Prototyping Environment), an advanced system level development environment for the specification, compilation, debugging, simulation and emulation of Digital Signal Processing (DSP) applications. GRAPE-II supports the real-time emulation of synchronous multirate applications on heterogeneous target platforms, consisting of multiple TMS320C40 processors and Xilinx FPGAs. Ports to the Motorola DSP5600x, the Texas Instruments TMS320C80 and the Analog Devices 21060 are foreseen in the near future. GRAPE-II also allows for the simulation and debugging of asynchronous as well as synchronous multi-rate applications on a single Unix workstation, on a Unix workstation cluster, on a Meiko Computing Surface and on a Parsytec Xplorer parallel machine. GRAPE-II is commercialised under the name Virtuoso Synchro by Intelligent Systems International.

Index Terms: Rapid prototyping, digital signal processing, parallel processing, data flow, programming environments.

I. INTRODUCTION

Digital signal processing is used for speech synthesis and recognition, telecommunications, image and video processing, and robotics, as well as for consumer products such as compact disc players, digital audio tape recorders, and digital radio receivers [Alar87]. The ever increasing complexity and data rates of these DSP applications demand application-specific integrated circuits (ASICs). Development costs for such ASICs are high, so algorithms should be thoroughly tested and optimized at all design stages before the final implementation.

To verify and optimize algorithms, they first have to be specified and their functionality tested completely (functional verification). Then, the optimum signal widths are determined, balancing cost against noise and disturbances (word-length verification). Finally, the optimized algorithm is transformed into an architecture for implementation. During this last step, the compatibility of the implementation with the original algorithm is verified (architectural verification).

Nowadays, most verifications are performed by analysis and simulation tools on workstations or superminicomputers. Application prototypes are worked out only during the last design stage. However, if we prototype in the earlier stages of the design, we can
• test more parameters in a shorter time;
• verify the application with real-life input signals and true interfaces to the environment. Westall and Ip [West92] indicate that 80% of the ICs do not work in their final system due to interfacing problems;
• optimize variables and evaluate an algorithm's subjective qualities under real-time conditions, e.g. by listening to the actual results of an audio algorithm or looking at the output of a video algorithm;
• convince people (management, marketing, potential customers) of the usefulness of a DSP system.

Despite these benefits, prototyping is generally not used at the earlier stages, because designing dedicated prototyping hardware requires much time and money and because the prototyping hardware is often not flexible enough to follow the many modifications applied in the early design stages.

We propose an alternative: a rapid-prototyping setup with general-purpose re-usable hardware to minimize development cost and a structured prototyping methodology to reduce programming effort. The general-purpose hardware consists of commercial DSP processors, bond-out versions of core processors and FPGAs linked to form a powerful, heterogeneous multiprocessor, such as the Paradigm RP, developed within the Retides (REal-TIme Dsp Emulation System) Esprit project and marketed by InCA/Zycad [Note94]. Our Graphical RApid Prototyping Environment GRAPE-II automates the prototyping methodology for these hardware systems by offering tools for resource estimation, re-timing, partitioning, assignment, routing, scheduling, code generation, parameter modification, animation and debugging.

This prototyping approach has already been used successfully for an audio processor for the consumer market, a sender, receiver and channel simulator for digital audio broadcasting (DAB) [Enge91], and a real-time video encoder for mobile applications [Enge94].

The next section describes the various tools of the GRAPE-II environment and the order in which they are called.

II. THE GRAPE-II TOOLS

GRAPE-II consists of a set of tools that interact with two central databases. The first contains all the information the programmer specified for the application; the second describes the features of the target platform on which the application is to be implemented. The tools make decisions based on this information and store their results as attributes back into the databases. Tools that come later in the script can hence make use of the results of their predecessors. Libraries are available to access and modify the objects of the database and to display the database graphically.

GRAPE-II is an open environment, since tools may be freely added to the environment and new attributes may be created at will. It is not open with respect to the programming model it supports: the hierarchical levels of the application that are to be handled by GRAPE-II tools have to adhere to an extended data flow model [Lauw92, Anon94, Lauw94]. This limitation, however, guarantees that the tools are able to automatically detect synchronous parts within asynchronous algorithms as well as asynchronous parts that externally behave synchronously. They can hence take actions optimally suited for each part of the application separately, without the programmer having to specify whether an application sub-task resides in a synchronous or dynamic data flow domain. In addition, GRAPE-II allows a single application specification to be compiled to a whole range of target platforms, including FPGA-only, DSP-only and heterogeneous systems, taking into account and making use of the peculiarities of each platform and without the need to re-specify the application from scratch.

The currently available GRAPE-II tools are explained via the script, i.e. the ordering of design tasks, of figure 1.

First, the application is specified, completely independent of the target hardware on which the application will run. This can be done in two ways:
• GRAPE-II contains a hierarchical graphical editor which allows the DSP designer to draw the data dependency graph directly;
• GRAPE-II comes with an interface to the DSP Station™ environment of EDC/Mentor Graphics [Note93]. This allows the designer to specify his/her application in the data flow language DFL, a derivative of Silage [Hilf85]. Alternatively, he/she could use the DSP Station schematic entry option or filter code generator.

Below the level visible to the GRAPE-II tools, the application may be specified in any textual or graphical language for which a compiler is available. Typically, this will be VHDL for hardware synthesis tools and C or assembly language for commercial DSP compilers. A shell in the chosen language is generated automatically by the graphical editor for each application sub-task, containing e.g. all input/output declarations; in this shell, the DSP expert can easily specify the functionality of the sub-task.

The target hardware is specified in a similar way, using the same hierarchical graphical editor as for the specification of the application. All hardware aspects that may be used by the GRAPE-II tools, like clock frequency, communication data rate, communication protocol, amount of memory or other resources, may be specified. Below the level visible to the GRAPE-II tools, a hardware component may be described in any textual or graphical language that third-party tools can make use of. Examples are architecture files describing the mapping of signals to physical pins, or timing diagrams specifying the timing requirements at the pin level.

The next tool automatically determines the amount of resources needed for the execution of each sub-task of the application. For microprocessor targets, this includes execution time, amount of program, data and buffer memory, and dedicated resources like analog-to-digital converters. For FPGA targets, resources can be the number of configurable logic blocks, pins, etc. When the execution time of a sub-task depends on the type of memory that is used, e.g. internal versus external data RAM, this dependency is also determined and taken into account by the other GRAPE-II tools.

Fig. 1. GRAPE-II tool script. Boxes with black background are optional; they are part of the DSP Station environment of EDC/Mentor Graphics.

The following tool re-times [Leis83] the data dependency graph. After the DSP designer specifies the allowable delay between input and output signals, the tool automatically determines the best insertion point for pipeline registers, in order to maximize signal throughput.
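The effect of register placement on throughput can be illustrated with a toy model. The sketch below is not the retiming algorithm of [Leis83] nor the GRAPE-II tool; it assumes a purely linear chain of sub-tasks with hypothetical execution times and exhaustively tries every placement of a fixed number of pipeline registers, keeping the one whose slowest stage is shortest (the slowest stage bounds the sampling rate).

```python
from itertools import combinations

def best_register_placement(exec_times, num_registers):
    """Return (slowest_stage_time, cut_positions) for a linear chain.

    A cut at position i places a pipeline register between sub-task i-1
    and sub-task i. All placements are tried exhaustively, which is only
    practical for short chains.
    """
    n = len(exec_times)
    best = None
    for cuts in combinations(range(1, n), num_registers):
        bounds = (0, *cuts, n)
        stage_times = [sum(exec_times[a:b]) for a, b in zip(bounds, bounds[1:])]
        candidate = (max(stage_times), cuts)
        if best is None or candidate < best:
            best = candidate
    return best

# Hypothetical execution times (in cycles) of five chained sub-tasks.
times = [4, 2, 6, 3, 5]
period, cuts = best_register_placement(times, num_registers=2)
print(period, cuts)  # 8 (1, 3)
```

With two registers the slowest stage drops from 20 cycles (no pipelining) to 8 cycles, at the price of two samples of additional input-to-output delay.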

Analogously, in a cyclic data dependency graph, pipeline registers have to be inserted by the programmer in each cycle to avoid a deadlock. The place where they are inserted in the cycle influences the schedule and hence also system throughput. For these pipeline registers too, the tool determines the best position.

It is clear that the decisions made by this tool influence the performance of the application after assignment, routing and scheduling. Therefore, this tool bases its decisions on a comparison of performance estimates for various alternatives. These estimates are obtained by calling the next tool in the script in estimate mode, in which a tool returns an estimate of the performance in an extremely short amount of time. This principle is used for all GRAPE-II tools. Since the search space for global performance optimization (including re-timing, assignment, routing, scheduling, etc.) is prohibitively large, it is reduced by splitting the optimization task into sub-tasks. Near-optimality is still guaranteed by using performance estimates.

The partitioning and assignment tool [Bils93a] (fig. 2) assigns the sub-tasks of the application to the processing devices of the heterogeneous target, be they processors or FPGAs. The tool bases its decisions not only on the computational workload of the sub-tasks, but also on the communication bandwidth needed between the sub-tasks. The currently released version supports both manual and automatic partitioning and assignment for synchronous multi-rate and cyclo-static applications.

The partitioning and assignment tool, as well as the routing and scheduling tools explained further on, are based on heuristic branch-and-bound algorithms [Bils93b]. At a given position in the optimization search space, a list of possible actions is constructed; for the assignment, e.g., a list is built of sub-tasks that are candidates for being mapped to a processing device. This action list is ordered according to heuristic rules in such a way that the action which most likely leads to the shortest input-to-output delay or the highest sampling frequency appears first in the list. All actions are deleted from the list which are guaranteed to lead to a solution that is worse than an already obtained solution, or for which we can prove that they will violate the user performance requirements. The search space is explored in depth-first order, with backtracking. Only at the start of the assignment task is the search space traversed in breadth-first order, since the knowledge on which the action list ordering is based is then still very scarce, leading to an imperfect ordering which would require too much backtracking. For a description of the heuristic rules used, we refer to the references indicated in the explanation of each tool.

All tools take into account resource limitations like the amount of memory or programmable logic blocks, the availability of A/D or D/A channels of the required sampling frequency and accuracy, etc. Resources are divided into different classes, according to their usage over time. Some resources are permanently required by one application sub-task, whether the sub-task is waiting for inputs or being executed; examples are program memory and programmable logic blocks. This type of resource mainly influences assignment. Other resources are only required during the execution of a sub-task (e.g. the computation engine of a processor, data memory for temporary variables) or during communication (e.g. input/output channels, communication links); because they are required by multiple sub-tasks, their use has to be scheduled.

After assignment, two paths may be followed:
• The first leads to a static schedule. It hence decreases run-time overhead to an absolute minimum and may guarantee real-time performance for the specified target hardware. We call this the emulation path.
• The second uses a run-time kernel. It offers enhanced flexibility for debugging at the cost of an increased run-time overhead. We call this the simulation path.
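The heuristic branch-and-bound search described above can be sketched on a simplified assignment problem. This is an illustrative assumption, not the GRAPE-II implementation: sub-tasks carry only a hypothetical computational cost, communication bandwidth is ignored, all processors are identical, and the objective is the load of the busiest processor. The ordering rules (heaviest sub-task first, least-loaded processor first) are stand-ins for the heuristics of [Bils93a, Bils93b].

```python
def assign(task_costs, num_procs):
    """Depth-first branch-and-bound with backtracking.

    Maps every sub-task to a processor so that the load of the busiest
    processor is minimal. Returns (best_load, {task_index: processor}).
    """
    # Heuristic ordering of sub-tasks: place the heaviest ones first.
    order = sorted(range(len(task_costs)), key=lambda t: -task_costs[t])
    best = {"load": float("inf"), "assignment": None}
    loads = [0] * num_procs
    current = {}

    def search(depth):
        if depth == len(order):
            if max(loads) < best["load"]:
                best["load"] = max(loads)
                best["assignment"] = dict(current)
            return
        task = order[depth]
        # Ordered action list: try the least-loaded processor first.
        for proc in sorted(range(num_procs), key=lambda p: loads[p]):
            # Bound: drop actions that cannot beat the best known solution.
            if loads[proc] + task_costs[task] >= best["load"]:
                continue
            loads[proc] += task_costs[task]
            current[task] = proc
            search(depth + 1)
            # Backtrack before trying the next action.
            loads[proc] -= task_costs[task]
            del current[task]

    search(0)
    return best["load"], best["assignment"]

# Hypothetical sub-task costs mapped onto two processors.
load, mapping = assign([5, 3, 8, 2, 4], num_procs=2)
print(load)  # 11
```

Because the first solution found by the ordered depth-first descent is usually good, the bound prunes most of the remaining search space early; this is the reason for ordering the action list before exploring it.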

Fig. 2. Screen dump of the assignment, routing and scheduling tools of GRAPE-II. The window marked with 'Target: LSI_C40_2' shows the graphical description of the target hardware: a PC plug-in card with two TMS320C40 processors, of which one has two analog input channels and two analog output channels; the two processors are connected by a shared bus. The window marked with 'Graphical_equalizer' shows the data dependency graph of the application; sub-tasks 1 and 2 are 5-band equalizers for the left and right audio signals respectively; sub-task 3 is a volume control task; all other sub-tasks are casting operations; the border color indicates the result of the automatic assignment; sub-tasks with a certain border color have been mapped on target processors with the corresponding fill color. The bottom window visualizes the schedule. Each row represents the schedule of a different processor. Boxes marked with 'In' or 'Out' are input and output communication primitives respectively. Boxes marked with 'S' are send and the ones marked with 'R' receive interprocessor communication primitives. All communication primitives have been generated automatically by the router. The last window, marked with 'performance', indicates the maximum sampling frequency for this application and details the utilization of each processor.

We will first describe the tools of the emulation path. When the target architecture possesses a reconfigurable network between the processing devices, the routing tool [Enge93] first determines the best switch settings to let the hardware graph resemble the software graph as closely as possible. It then computes the path a message has to follow from the sending to the receiving processor when no direct connection between them exists. This is not always the shortest path, since load balancing between the available communication channels is foreseen. In a third phase, the tool determines the best communication primitive for each communication. For example, Direct-Memory-Access-controlled communication may be faster than CPU-controlled communication for lengthy messages; conversely, for short messages, CPU-controlled communication may be more efficient. Finally, the tool adds these communication tasks to the application graph.

Next, the buffer length minimization tool [Ade94] is optionally called. Currently, it determines the minimum buffer length between application sub-tasks executing at different frequencies that is guaranteed to lead to a deadlock-free schedule. In the near future, it will be extended to determine the minimum buffer length that leads to a deadlock-free schedule of a given duration. Although buffer length minimization is useful to reduce the memory requirements for DSP targets, it is extremely important for sub-tasks assigned to FPGAs, since the implementation of registers is very area consuming.

Thereafter, the scheduler [Bils93a] determines the execution order of all sub-tasks assigned to processing devices and of all messages assigned to communication channels, in order to minimize the input-to-output delay of a sample or to maximize the sampling rate. The currently released tool supports the manual and automatic scheduling of synchronous multi-rate and cyclo-static applications. It also takes into account realistic communication protocols.
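The buffer length question raised above can be made concrete for the simplest multi-rate case. The sketch below is an assumption-laden illustration, not the method of [Ade94]: it considers one isolated connection on which a producer writes `produced` tokens per firing and a consumer reads `consumed` tokens per firing, lets the consumer fire as soon as it has enough input, and records the peak number of tokens stored during one period of the schedule. The function and parameter names are hypothetical.

```python
from math import gcd

def peak_tokens_single_edge(produced, consumed):
    """Peak buffer occupancy on one producer/consumer connection.

    The consumer fires whenever enough tokens are available; otherwise the
    producer fires. The simulation covers one full period, i.e. until
    lcm(produced, consumed) tokens have been consumed.
    """
    period_tokens = produced * consumed // gcd(produced, consumed)
    tokens = 0      # tokens currently stored on the connection
    peak = 0        # largest occupancy seen so far
    moved = 0       # tokens consumed so far
    while moved < period_tokens:
        if tokens >= consumed:
            tokens -= consumed
            moved += consumed
        else:
            tokens += produced
            peak = max(peak, tokens)
    return peak

# A producer writing 2 tokens per firing feeding a consumer reading 3.
print(peak_tokens_single_edge(2, 3))  # 4
```

For the rate pairs tried here the result agrees with the closed form produced + consumed - gcd(produced, consumed); a buffer shorter than this peak would block the producer before the consumer can fire. Whole application graphs, with cycles and several rates, need the graph-level analysis referred to in the text.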
Protocol requirements differ per communication primitive. For CPU-controlled communication without message buffering, for example, the sending and receiving tasks have to be scheduled at the same time instant as the message transfer on the communication channel itself; for DMA-controlled communication, on the other hand, the initialization of the DMA registers of the sending and the receiving processor has to precede the actual channel communication. The scheduler chooses the communication protocol that corresponds to the communication primitive selected by the router.

Currently, this scheduler is being extended to support dynamic data flow: run-time tests check for each application sub-task whether all required inputs are present. For a given application, the scheduler will determine at compile-time the best order for these tests. This approach drastically reduces the number of run-time tests compared to a general-purpose run-time kernel. The extended scheduler will also add a pre-emption mechanism to static schedules, to further reduce idle time at run-time. Pre-emption is only induced when the idle time between two sub-tasks is known to be larger than the task switching overhead, hence guaranteeing the same or better performance than the static schedule it starts from [Waut94].

Now that it is known which sub-tasks are assigned to each processor and their relative execution order has been determined, the code generator builds, for each processor, a main program which calls the sub-tasks and communication primitives in the correct order. For the latter, code is generated based on parametrisable library functions. For each processing device, the program is then compiled into downloadable code. Instead of using our own code generator, which inlines or calls the user-written sub-tasks in schedule order, one could alternatively employ the optimized code generator of the DSP Station, which optimizes register usage across sub-task boundaries.

For application parts which are assigned to FPGAs, the code generator builds hardware which starts the application sub-tasks according to the data dependencies. It generates VHDL code for each communication primitive selected by the router. Finally, the hardware description is synthesized into a netlist which is compiled into downloadable configuration bits.

At this point, the application is ready for execution. Often, however, it will contain parameters that have to be modified at run-time, either because we want to determine the optimum parameter settings in a real-life operation environment, or because they are controlled by the customer in a final product. For this, a virtual front panel creator has been built. It makes it possible to create dialog boxes containing an editable field or slider bar for each of the parameters we want to modify at run-time (figure 3). These virtual front panels, running on a host computer, are dynamically linked to the application running on the target machine. When the user changes the value of a parameter, the front panel sends an update request to the appropriate processor, which changes the memory location holding the parameter accordingly. Hence, no recompilation is required.
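The run-time parameter update can be sketched as follows. This is a minimal host-side simulation under stated assumptions, not the GRAPE-II front panel code: the class `Target`, the function `set_parameter` and the parameter name `volume` are hypothetical, a Python dictionary stands in for the memory locations holding the parameters, and a queue stands in for the link between host and target.

```python
from queue import Queue, Empty

class Target:
    """Stand-in for a processor running a statically scheduled application."""

    def __init__(self):
        self.params = {"volume": 1.0}  # memory locations holding parameters
        self.requests = Queue()        # update requests arriving from the host

    def apply_pending_updates(self):
        # Overwrite parameter locations with any values sent by the host.
        while True:
            try:
                name, value = self.requests.get_nowait()
            except Empty:
                return
            self.params[name] = value

    def process_block(self, samples):
        # The sub-task reads the parameter from memory on every invocation,
        # so a changed value takes effect without recompiling anything.
        self.apply_pending_updates()
        gain = self.params["volume"]
        return [gain * s for s in samples]

def set_parameter(target, name, value):
    """What a slider callback on the host might do: send an update request."""
    target.requests.put((name, value))

target = Target()
print(target.process_block([1.0, 2.0]))  # [1.0, 2.0]
set_parameter(target, "volume", 0.5)
print(target.process_block([1.0, 2.0]))  # [0.5, 1.0]
```

The design point illustrated is that the generated code refers to a parameter by its storage location rather than by a compiled-in constant, which is what makes the update possible while the application keeps running.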

Fig. 3. Virtual front panel for an audio application with separate 5-band equalizer for left and right channel and combined volume control.

Instead of heading towards the statically scheduled implementation explained above (the emulation path), one could choose a run-time kernel based approach (the simulation path), which is especially useful for debugging purposes. After assignment, code is then automatically generated which hooks the user-specified application sub-tasks to a run-time kernel running on a Unix workstation cluster or a transputer-based multiprocessor. Routing and scheduling are handled by the kernel at run-time.

Next, the application may be animated and debugged [Caer93]. This tool is based on a Record/Replay mechanism. In the recording phase, the application is executed in real-time and critical information about the program behavior is logged in internal memory, e.g. the relative order of communication events. Care is taken to minimize the amount of information that has to be logged, in order to reduce the influence of the debugger on the execution of the application. In the replay phase, the application is executed a second time with the same inputs. The logged information is used to ensure the same behavior as during the recording phase, even for nondeterministic programs. It is even possible to do the replay on a smaller machine, to reduce debugging cost: the short recording phase is executed on an expensive large multiprocessor, while the replay phase, which is lengthy due to user interaction, is done on a single desktop workstation.

In replay mode, program execution is animated on the graphical representation of the application. The user sets breakpoints by clicking with the mouse on connections. After the program halts, he/she can visualize the data on that connection, using a user-defined visualization routine: when the connection carries an image, for instance, the data is visualized as an image instead of a matrix of values (figure 4). The consecutive values that appear on a connection during the repeated execution of the application may be stored in a time series, which can then be played back via audio or video equipment at the real-time sampling speed, to inspect sound or image quality, respectively, at arbitrary places in the application (true multimedia debugging). After narrowing down the place of an error to a single sub-task, the user calls a conventional source level debugger and can inspect the code of the sub-task in single step mode.
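The Record/Replay idea can be sketched in a few lines. The example is an illustrative assumption rather than the debugger of [Caer93]: several senders deliver messages to one receiver, a random choice stands in for nondeterministic arrival order, and the only information logged during recording is the identity of the sender for each communication event. All names are hypothetical.

```python
import random

def run(senders, order_log=None, rng=None):
    """Deliver messages from several senders to one receiver.

    Recording (order_log is None): the next sender is picked at random,
    and only the sender names are logged.
    Replay (order_log given): messages are delivered in the logged order.
    Returns (received_messages, log_of_sender_names).
    """
    queues = {name: list(msgs) for name, msgs in senders.items()}
    log, received = [], []
    replay = iter(order_log) if order_log is not None else None
    while any(queues.values()):
        if replay is not None:
            name = next(replay)
        else:
            name = rng.choice([n for n, q in queues.items() if q])
        received.append(queues[name].pop(0))
        log.append(name)
    return received, log

senders = {"A": ["a1", "a2"], "B": ["b1", "b2", "b3"]}
recorded, log = run(senders, rng=random.Random())
replayed, _ = run(senders, order_log=log)
print(replayed == recorded)  # True
```

Logging only the relative order of events, and not the message contents, keeps the log small; the contents are regenerated during replay because the application runs again on the same inputs.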

Fig. 4. Example of animation and debugging in GRAPE-II for an image processing (edge detection) application. A green circle on a connection indicates a breakpoint; a red circle is the breakpoint at which the program is currently suspended. Note that the data are visualized as images instead of matrices of values, to facilitate interpretation.

V. CONCLUSIONS

The paper presented the GRAPE-II system level environment for the rapid prototyping of digital signal processing (DSP) applications on a multiprocessor. GRAPE-II covers the whole life cycle of the development of a prototype, starting from the specification of the application and the target hardware, via automatic assignment, routing, scheduling, code generation and compilation for a heterogeneous multiprocessor, to animation, debugging and virtual front panel creation.

GRAPE-II makes fast prototyping of complete DSP applications in real-time a reality.

AVAILABILITY OF GRAPE-II

GRAPE-II can be obtained under the commercial name Virtuoso Synchro from Intelligent Systems International Inc., Lindestraat 9, B-3210 Linden, Belgium, Tel. +32-16-62 15 85, Fax. +32-16-62 15 84, Email: [email protected].

ACKNOWLEDGMENTS

Rudy Lauwereins is a senior research associate and Marc Engels a senior research assistant of the Belgian National Fund for Scientific Research. The work presented in this paper is partly supported by the Belgian Interuniversity Pole of Attraction IUAP-50, by the Belgian Concerted Research Action on Applicable Neural Networks and by the European Community Esprit-III project 6800 Real-Time DSP Emulation System. K.U.Leuven-ESAT is a member of the DSP Valley Network.

REFERENCES

[Ade94] M. Adé, R. Lauwereins, J.A. Peperstraete, "Buffer memory requirements in DSP applications", 5th IEEE Int. Workshop on Rapid System Prototyping, Grenoble, France, June 21-23, 1994, pp. 108-123.
[Alar87] M. Alard and R. Lassalle, "Principles of Modulation and Channel Coding for Digital Broadcasting for Mobile Receivers", EBU Rev., tech. no. 224, Aug. 1987, pp. 47-69.
[Anon94] Anonymous, "GISLA manual", K.U.Leuven-ESAT/ACCA, Jan. 94.
[Bils93a] G. Bilsen, M. Engels, R. Lauwereins, J.A. Peperstraete, "Development of a Load Balancing Tool for the GRAPE Rapid Prototyping Environment", 4th IEEE Int. Workshop on Rapid System Prototyping, Research Triangle Park, North Carolina, June 28-30, 1993.
[Bils93b] Greet Bilsen, Marc Engels, Rudy Lauwereins, J.A. Peperstraete, "Survey of algorithms for static load balancing", Proc. of the IASTED International Conference on Modelling and Simulation, Pittsburgh, Pennsylvania, USA, May 10-12, 1993, pp. 418-421.
[Caer93] C. Caerts, R. Lauwereins, J.A. Peperstraete, "PDG: A Process-Level Debugger for Concurrent Programs in the GRAPE Rapid Prototyping Environment", 4th IEEE Int. Workshop on Rapid System Prototyping, Research Triangle Park, North Carolina, June 28-30, 1993.
[Enge91] Marc Engels, Rudy Lauwereins, J.A. Peperstraete, "Rapid Prototyping for DSP Systems with Multiprocessors", Special issue on Rapid Prototyping, IEEE Design & Test of Computers, Vol. 8, No. 2, June 1991, pp. 52-62.
[Enge93] M. Engels, R. Lauwereins, J.A. Peperstraete, A. van Roermund, "Design of a Processing Board for a Programmable Multi-VSP System", Journal of VLSI Signal Processing, 5, 171-184 (1993), pp. 59-72.
[Enge94] M. Engels, T. Meng, "Rapid Prototyping of a Real-Time Video Encoder", 5th IEEE Int. Workshop on Rapid System Prototyping, Grenoble, France, June 21-23, 1994, pp. 8-15.
[Hilf85] P. Hilfinger, "A high-level language and silicon compiler for digital signal processing", Proc. IEEE CICC Conf., May 1985, pp. 213-216.
[Lauw92] R. Lauwereins, M. Engels, J.A. Peperstraete, "GRAPE-II: A Tool for the Rapid Prototyping of Multi-Rate Asynchronous DSP Applications on Heterogeneous Multiprocessors", 3rd IEEE Int. Workshop on Rapid System Prototyping, June 23-25, 1992, Research Triangle Park, North Carolina, USA, pp. 24-37.
[Lauw94] R. Lauwereins, P. Wauters, M. Adé, J.A. Peperstraete, "Geometric Parallelism and Cyclo-Static Data Flow in GRAPE-II", 5th IEEE Int. Workshop on Rapid System Prototyping, Grenoble, France, June 21-23, 1994, pp. 90-107.
[Leis83] C.E. Leiserson, F.M. Rose, J.B. Saxe, "Optimizing synchronous circuitry by retiming", Proc. of the 3rd Caltech Conf. on VLSI, Pasadena, CA, March 1983, pp. 87-116.
[Note93] S. Note, P. Vandebroeck, P. Odent, D. Genin, M. Van Canneyt, "Top Down design of two industrial IC's with DSP Station", DSP Application, March 93.
[Note94] S. Note ea., "Paradigm RP, a System for the Rapid Prototyping of Real-Time DSP applications", DSP Applications, Jan. 1994, pp. 17-23.
[Waut94] Piet Wauters, Rudy Lauwereins, Jan Peperstraete, "Compile time analysis to minimise runtime overhead in pre-emptive scheduling on multi-processors", Proc. of the IASTED International Conference, Modelling and Simulation, Pittsburgh, Pennsylvania, USA, May 2-4, 1994.
[West92] Westall F.A., Ip S.F.A., "Digital signal processing in telecommunications", BT Technological Journal, Vol. 10, No. 1, January 1992, pp. 9-27.