Codesign Methodology for DSP Applications

ASAWAREE KALAVADE
EDWARD A. LEE
University of California, Berkeley

The authors describe a systematic, heterogeneous design methodology using the Ptolemy framework for simulation, prototyping, and software synthesis of systems containing a mixture of hardware and software components. They focus on signal-processing systems, where the hardware typically consists of custom data paths, FSMs, glue logic, and programmable processors, and the software is one or more embedded programs running on the programmable components.

0740-7475/93/0900-0016$03.00 © 1993 IEEE
IEEE DESIGN & TEST OF COMPUTERS

APPLICATION-SPECIFIC SYSTEMS are often the solution whenever general-purpose systems cannot meet portability, compactness, cost, or performance requirements. Important applications include communications, multimedia systems, consumer products, robotics, and control systems. Although the term application-specific often summons the companion term integrated circuit, ASIC design is no longer the research challenge it once was. Concerted effort in design automation has resulted in sophisticated and widely used tools for routinely designing large special-purpose chips. Although continued improvement of such tools is valuable, much research is refocusing on the system-level design problem.

Two opposing philosophies for system-level design are emerging. One is the unified approach, which seeks a consistent semantics for specification of the complete system. The other is the heterogeneous approach, which seeks to systematically combine disjoint semantics. Although the intellectual appeal of the unified approach is compelling, we have adopted the heterogeneous approach. We believe that the diversity of today's design styles precludes a unified solution in the foreseeable future. Combining hardware and software in a single system implementation is but one manifestation of this diversity.

Even without good design tools, application-specific systems routinely mix diverse design styles. The component subsystems commonly found in these systems include the following:

Software or firmware. An application-specific system with no software is rare. At the very least, a low-cost microprocessor or microcontroller manages the user interface. But it is also common to implement some of the core functions in software, often using somewhat specialized programmable processors such as programmable DSPs (digital signal processors). Occasionally, an application-specific system is implemented entirely in software. In that case, it is application-specific only if reprogramming by the user is not possible.

ASICs. ASIC design has been the focus of synthesis tools, even so-called high-level synthesis tools,[1] for over a decade. The tools have developed to the point that they can synthesize certain systems fairly quickly. However, this design approach is not always suitable. Complex, low-speed control functions are often better implemented in software. Moreover, many applications inherently require programmability (for example, to customize the user interface). ASICs also cannot accommodate late design changes, and iterations of the design are expensive. Thus, ASICs may not be suitable for implementation of immature applications. Increasingly, designers use one or more ASICs for an application's better-understood and more performance-intensive portions, combined with programmable processors to implement the rest.
Domain-specific programmable processors. Design reuse can drive down development time and system cost. For this reason, introducing enough programmability into a circuit to broaden its base of applications is often advisable. Suitable applications range from a half dozen different algorithms to an entire domain such as signal processing. One can design the processor itself by jointly optimizing the architecture, the instruction set, and the programs for the applications. A major drawback of this approach is that it often requires a support infrastructure in the form of software and development systems to make reuse feasible.

Core-based ASICs. This emerging design style combines programmable processor cores with custom data paths within a single die. Manufacturers of programmable processors are making the cores of their processors available as megacells that designers can use in such designs.[2] Alternatively, one can use the core of an in-house processor.[3] Core-based designs offer numerous advantages: performance improvement (because critical components are implemented in custom data paths and internal communication between hardware and software is faster), field and mask programmability (due to the programmable core), and area and power reduction (due to integration of hardware and software within a single core). These designs are especially attractive for portable applications, such as applications in digital cellular telephony.[4] Designing such systems requires partitioning the application into hardware and software and exploring trade-offs in different implementations. Design tools currently do not support this technique well.

SEPTEMBER 1993
Application-specific multiprocessors. Some intensive applications have high enough complexity and speed requirements to justify development of an application-specific multiprocessor system. In these systems, the interconnections can be customized, along with the software and the selection of processors. Examples of design approaches for such systems range from homogeneous interconnections of off-the-shelf programmable components to heterogeneous interconnections of arbitrary custom or commodity processors.[5]

Other possible components include analog circuits and field-programmable gate arrays (FPGAs). Often, components are mixed within a single system design, for example, multiple programmable processors along with custom ASICs. Furthermore, processors need not be of the same kind. The design issues for such systems include hardware-software partitioning of the algorithm, selection of the type and number of processors, selection of the interconnection network, software synthesis (partitioning, scheduling, and code generation), and custom-hardware synthesis. Tools that synthesize either complete software or complete hardware solutions are common, but tools that support a mixture are rare.
What is codesign? We refer to the simultaneous design of the hardware and software components of these multifarious systems as hardware-software codesign. In a traditional design strategy, designers make the hardware and software partitioning decisions at an early stage in the development cycle and develop the hardware and software designs independently from then on. There is little interaction between the two designs because of the lack of a unified representation, simulation, and synthesis framework. The new systems demand a more flexible design strategy, in which hardware and software designs proceed in parallel, with feedback and interaction between the two. The designer can then make the final hardware-software split after evaluating alternative structures with respect to performance, programmability, area, power, nonrecurring (development) costs, recurring (manufacturing) costs, reliability, maintenance, and design evolution. This strategy demands tools that support unified hardware-software representation, heterogeneous simulation at different levels of abstraction, and hardware-software synthesis.
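To make the idea of deferring the hardware-software split concrete, the following minimal Python sketch enumerates every assignment of a few blocks to hardware or software and keeps the smallest-area one that meets a timing budget. The block names, the per-block time and area figures, and the `evaluate` and `best_split` helpers are illustrative assumptions, not data or tools from this article; a real flow would obtain such estimates from synthesis and simulation.

```python
from itertools import product

# Hypothetical per-block estimates: (software time, hardware time, hardware area).
# All numbers are made up for illustration.
BLOCKS = {
    "equalizer": (40, 8, 30),
    "timing_recovery": (25, 5, 20),
    "decoder": (10, 4, 15),
}

def evaluate(split):
    """Return (total_time, total_area) for a {block: 'hw' | 'sw'} assignment."""
    time = 0
    area = 0
    for name, target in split.items():
        sw_time, hw_time, hw_area = BLOCKS[name]
        if target == "hw":
            time += hw_time
            area += hw_area
        else:
            time += sw_time
    return time, area

def best_split(deadline):
    """Exhaustively search all splits; keep the smallest-area one meeting the deadline."""
    best = None
    names = list(BLOCKS)
    for targets in product(("sw", "hw"), repeat=len(names)):
        split = dict(zip(names, targets))
        time, area = evaluate(split)
        if time > deadline:
            continue  # this split misses the timing budget
        if best is None or area < best[1]:
            best = (split, area, time)
    return best

if __name__ == "__main__":
    print(best_split(deadline=45))
```

Exhaustive search is only workable for a handful of blocks; it is used here because it keeps the trade-off between timing and area easy to see.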
DSP applications. We are developing a codesign methodology applicable to digital signal-processing and communications systems. DSP applications have the desirable feature of moderately simple algorithms, yet they demand high performance and throughput. Furthermore, exploring the cost and performance trade-offs between different implementations is critical for consumer products and portable applications, where DSP is being widely used. We are focusing on the design of the hardware and software for such systems, in which the hardware typically comprises custom data paths, FSMs, glue logic, and programmable signal processors, and the software is the program running on the programmable components. A variety of commercial DSP microprocessors are suitable for most of the sophisticated signal processing required in these applications; one can synthesize custom hardware for some of the computation-intensive components.

A generic codesign methodology

Figure 1 diagrams a methodology for designing heterogeneous hardware-software systems. It is a general codesign scheme that does not apply to any framework in particular. The numbers in parentheses in the following discussion correspond to the stages shown in Figure 1.

[Figure 1. A generic codesign methodology. Diagram labels: System specification; Hardware synthesis (1. analog vs. digital; 2. architecture selection: type, number of processors; 3. register word length selection; 4. custom hardware: FPGA, data path); Software synthesis (3. scheduler selection; 4. partitioning: number of processors, I/O, memory); Interface synthesis (1. IPC between DSPs; 2. communication between custom hardware and processors).]

The codesign task is to produce an optimal hardware-software design that meets the given specifications, within a set of design constraints (real-time requirements, performance, speed, area, code size, memory requirements, power consumption, and programmability).

Given a system specification, the designer develops an algorithm, using high-level functional simulations (1), without any assumptions about implementation (such as available instruction set or register precision). For instance, in the design of a modem, the designer would experiment with different algorithms for timing recovery at this stage.

The designer then partitions the algorithm into hardware and software (2), guided by speed, complexity, and flexibility requirements. Components that need field programmability or that are inherently better accomplished in software are assigned to software implementations. For instance, in the design of a transceiver, a software realization of the coder/decoder would allow changing the constellation easily, enabling the support of multiple modem standards. Operations with critical execution speed are allocated to hardware. Phase detectors, for example, can be implemented with the CORDIC (coordinate rotation digital computer) algorithm, which is suitable for compact VLSI designs. Of course, to explore the design space, the designer would iterate the partitioning process.

Partitioning is followed by hardware (3), software (4), and interface (5) synthesis. The three are closely linked; changing one has immediate effects on the others. Hardware design decisions include selection of the programmable processor (directly affecting selection of the code generator) and determination of the number of processors and their connectivity (influencing code partitioning and software-hardware interface synthesis). In custom-hardware synthesis, the choices range from generating custom data paths to generating masks for FPGAs. In designing custom data paths, the designer must choose the register word lengths. Some hardware structures (filter realizations, for instance) may meet performance requirements with smaller register widths than those estimated for other structures.

On the software front, in the case of fixed-point processors, some algorithmic modifications might be necessary to minimize the effects of finite precision (such as limit cycles and quantization errors). Software synthesis involves partitioning and scheduling the code on multiple processors and synthesizing the code for interprocessor communication. These decisions depend on the architecture selected. The designer partitions among different processors by optimizing cost functions such as communication cost, memory bandwidth, and local and global memory sizes.

Interface synthesis involves adding latches, FIFOs, or address decoders in hardware and inserting code for I/O operations and semaphore synchronization in software. The typical way of solving this cyclic problem is to start with a design and work on it iteratively to explore different options.

Once the hardware and software components are synthesized, the next step is a heterogeneous simulation (6). In particular, the simulated hardware must run the generated software. This involves interaction of a number of different simulators if various specification languages are used. The designer then uses the simulation results to verify (7) that the design meets the specifications. Having performed the hardware and software synthesis for a particular design choice, the designer can estimate area, power, critical path, component and bus utilization, and other factors. After using these estimates to evaluate the design (8), the designer may repartition the system to try out different options (9). Thus, the entire process is iterative.

The Ptolemy framework

The generic codesign methodology we have described requires a unified framework that allows the hardware and software components to be integrated from the specification through the synthesis, simulation, and evaluation phases. The Ptolemy design environment is such a framework. Ptolemy is a software environment for simulation and prototyping of heterogeneous systems. It uses object-oriented software technology to model each sub-
[Figure: class diagram. Block: initialize(), setup(), go(), wrapup(), clone(). Geodesic: initialize(), numInit(), setSourcePort(), setDestPort(). Porthole: initialize(), receiveData(), sendData(). Further methods: readyType(), print(), operator.]
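The block methods named in the figure (`setup()`, `go()`, `wrapup()`) suggest a simple lifecycle: prepare once, fire repeatedly, clean up once. The Python sketch below illustrates that pattern only. The `Block`, `Gain`, `Collector`, and `run` names are hypothetical stand-ins and do not reproduce Ptolemy's actual classes or schedulers.

```python
class Block:
    """Minimal stand-in for a dataflow block with a three-phase lifecycle."""

    def setup(self):
        """Called once before the run starts."""

    def go(self, sample):
        """Called once per firing; returns the output sample."""
        raise NotImplementedError

    def wrapup(self):
        """Called once after the run ends."""


class Gain(Block):
    """Multiplies every input sample by a constant (hypothetical example block)."""

    def __init__(self, factor):
        self.factor = factor

    def go(self, sample):
        return self.factor * sample


class Collector(Block):
    """Stores every sample it receives and passes it through unchanged."""

    def setup(self):
        self.samples = []

    def go(self, sample):
        self.samples.append(sample)
        return sample


def run(chain, inputs):
    """Drive a linear chain of blocks: setup all, fire per input, wrapup all."""
    for block in chain:
        block.setup()
    outputs = []
    for sample in inputs:
        for block in chain:
            sample = block.go(sample)
        outputs.append(sample)
    for block in chain:
        block.wrapup()
    return outputs


if __name__ == "__main__":
    sink = Collector()
    print(run([Gain(2), Gain(3), sink], [1, 2, 3]))
```

A linear chain is the simplest possible schedule; it is an assumption of this sketch, chosen so that the three phases stay visible.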
Motorola 56000 and 96000 respectively). Alternatively, one can use the C code generation domain to synthesize C code, which can be compiled to the desired target processor. We have also implemented multiprocessor code generation in Ptolemy: a suite of schedulers that use properties of the SDF computation model to partition the code onto multiple processors, schedule code execution, and insert code for interprocessor communication. For example, MultiSim56000 is a multiprocessor target that controls code generation for a multiprocessor system in a shared-memory configuration (Figure 7). The designer provides the number of processors and the shared-memory address to this target. The target then invokes the appropriate parallel scheduler, which partitions and schedules the code onto the processors, inserts semaphore synchronization code, and generates assembly code for each processor. An example we present later in the article illustrates this target in further detail. We are working on the design of heterogeneous multiprocessor targets, in which more than one type of programmable processor can be used. Sih has developed parallel schedulers that use different cost functions for partitioning and scheduling the code on heterogeneous programmable components.[15]

Code generation for hardware synthesis. We have developed a Silage[16] code generation domain for Ptolemy. Silage is a functional or applicative language; each operation can be thought of as applying a function on a set of inputs and generating a set of outputs. One can specify the numerical precision for the inputs and outputs of these functions, as well as the precision for internal computations. In addition, one can specify multirate computations such as downsampling and upsampling. These properties make Silage an attractive language for high-level specification of DSP applications. Furthermore, a number of high-level synthesis systems that use Silage for specification of their inputs are available. As a result, the Silage code generated by Ptolemy provides a link to these synthesis tools, thereby permitting custom-hardware synthesis. Thus, the function of the Silage domain is twofold: custom-hardware synthesis and bit-true modeling of synthesized custom hardware.

When a Silage galaxy is nested in an SDF universe, the blocks in the SDF domain send data to the Silage galaxy. On processing this data, the Silage galaxy generates outputs that can be further processed in the SDF domain. Thus, Silage galaxies in the SDF domain represent function application. Such an SDF-Silage system runs as follows:

1. The setup() phase generates Silage code. The designer then feeds this code to high-level synthesis tools such as Hyper to synthesize a custom data path. The designer obtains estimates of the critical path, power consumption, and area. A single Silage star (generated by compiling the Silage code and dynamically linking it into the running simulation) automatically replaces the Silage galaxy. The portholes of this Silage star are of type "fix." The designer can specify the precision of the data along these portholes, as well as the precision of intermediate results. This capability makes it possible to run bit-true simulations for experimenting with different word lengths.
2. The go() phase simulates the complete system. The new Silage star (corresponding to the Silage gal-

The Silage domain thus permits: 1) high-level simulation (for functional verification), 2) bit-true simulation (for analysis of finite-precision effects, fine-tuning of the algorithm for finite word lengths, and determination of optimal word lengths), and 3) synthesis of custom data paths for parts of the algorithm committed to hardware implementation. Figure 5 illustrates these capabilities.

[Figure 5. The top window shows the SDF universe with a Silage galaxy for a fifth-order filter. The filter is implemented in the Silage domain (second window from top) as a cascade of two biquad sections and a first-order filter. Each biquad (mid-left) is implemented with discrete components. The Silage code generated by Ptolemy appears on the right, the layout at the bottom left, and hardware estimates for the custom data path for the biquad section (generated by Hyper) at the bottom right. Legible Silage fragment in the figure: biquad1_out = biquad1(in); biquad2_out = biquad2(biquad1_out); out = firstOrder1(biquad2_out);]
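The word-length experiments that bit-true simulation enables can be illustrated with a small Python sketch. It quantizes the output of a first-order low-pass filter to a signed fixed-point format after every step and reports the worst-case deviation from a floating-point reference for several word lengths. The filter, its coefficient, the `quantize` helper, and the chosen formats are assumptions made for illustration; they are neither the fifth-order filter of Figure 5 nor the exact semantics of the Silage "fix" type.

```python
import math

def quantize(value, word_length, frac_bits):
    """Round to a signed fixed-point grid and saturate at the representable range."""
    scale = 1 << frac_bits
    lo = -(1 << (word_length - 1))
    hi = (1 << (word_length - 1)) - 1
    code = max(lo, min(hi, round(value * scale)))
    return code / scale

def lowpass(samples, coeff, quant=None):
    """First-order IIR y[n] = coeff*y[n-1] + (1-coeff)*x[n], optionally quantized."""
    y = 0.0
    out = []
    for x in samples:
        y = coeff * y + (1.0 - coeff) * x
        if quant is not None:
            y = quant(y)  # model the finite register width of the state
        out.append(y)
    return out

def max_error(word_length, frac_bits, samples, coeff=0.9):
    """Largest absolute gap between the quantized and floating-point outputs."""
    ref = lowpass(samples, coeff)
    fix = lowpass(samples, coeff, lambda v: quantize(v, word_length, frac_bits))
    return max(abs(a - b) for a, b in zip(ref, fix))

if __name__ == "__main__":
    signal = [math.sin(2 * math.pi * n / 50) for n in range(200)]
    for bits in (8, 12, 16):
        # Reserve a sign bit and one integer bit; the rest are fractional bits.
        print(bits, max_error(bits, bits - 2, signal))
```

Sweeping the word length in this way and picking the smallest width whose error is acceptable mirrors, in miniature, the "determination of optimal word lengths" described above.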
Ptolemy and hardware-software codesign

This section describes how we apply the components of Ptolemy to the generic hardware-software codesign methodology outlined earlier. Figure 1 is redrawn in Figure 6 to show how the Ptolemy domains support the phases of the codesign process.

[Figure 6. The codesign methodology under Ptolemy.]

We carry out high-level simulations and algorithm development (1) in the SDF domain. We perform hardware-software partitioning (2) manually, specifying whether a block is to be implemented in hardware or software. This information generates wormholes for each of the two types of implementation: The algorithm components to be implemented in custom hardware are grouped in a Silage wormhole; the parts to be implemented as software running on programmable processors are clustered into a CG wormhole corresponding to the target processor (CG56, CG96, and so on).

We then make a preliminary hardware design decision (3) regarding the number of processors to be used. (Further simulations may prove that this was not a good choice, and we can make changes iteratively.) In the case of multiprocessor systems, we then select the system configuration, thus determining the interface (5) between DSPs (such as a shared multiported memory, a shared bus, or serial communication). Next, we construct a Thor simulation model for this architecture, using Thor models for the DSP chips and glue logic. Functional models for analog components such as A/D converters and analog front ends are developed in SDF and added to the Thor model as wormholes. We carry out bit-true simulations for the components allocated to custom hardware in the Silage domain. These simulations give an estimate of the optimal word lengths. Finally, we feed the generated Silage code to synthesis tools to estimate the critical path, power, and area. The Silage blocks are added to the system simulation model to represent the custom hardware.

On the software (4) side, selection of the programmable processor determines the code generation domain. We then select the hardware-software interface (5). For instance, the DSP can communicate with external hardware either through one of its serial ports or through memory-mapped I/O. Based on the hardware-software interface and the interprocessor interface, we select an appropriate code generation target and generate the assembly code. We insert interface and synchronization code in the program, at the same time adding hardware components for the interface. For example, serial I/O requires a serial-to-parallel register and appropriate clocking circuitry. Similarly, for memory-mapped I/O, we insert address decoders and latches. We then analyze the code size to determine whether the program fits in the on-chip program memory. If not, an ex-

logic are simulated in the Thor domain, custom-synthesized hardware is mod-

and analog components are represented by their functional models in SDF.

quired to process one sample. If the requirements have not been met, we can

peating the procedure (9), making a final selection on the basis of the user's requirements.

At present, the infrastructure for these phases of the design process is available in Ptolemy. Work is under way toward automating some of these stages. We are developing a tool called the Design Assistant, which will automatically create the wormholes for the hardware and the software blocks and insert the interfaces. It will also assist in analysis and verification (currently done manually) and enable easy exploration of the design space. The Design Assistant will formalize and partially automate the design process, building upon the basic facili-

lected a demonstration application that requires scarcely more than one proces-

lustrate our methodology. Consider the design of a full-duplex

phone channel by introducing impairments such as linear distortion,

vides robust assurance of modem performance on most telephone lines in the public switched telephone network. The goal in our example is to design an implementation for this bidirectional channel simulator. Figure 7a shows the algorithm for one direction of the full-duplex channel simulator (in SDF). Similar processing would be performed on the signal coming in the opposite direction. To test the modems under different channel conditions, we must be able to change the degree of various impairments. Thus, we want to incorporate as much functionality in software as possible. As a first cut, we partition the algorithm so that it is implemented entirely

cessor, and the same code runs on the other processor. However, analysis of

sors, or partition the code onto the processors so that each implements a part

The second option seems more cost effective, so we explore it next. In this
sends the impaired signal to the receiving modem.

The next design decision is the selection of the interface between the two processors. Again, two options are available: communication over the serial port or use of a shared memory. Selecting the latter, we build a hardware model for the system. We present the algorithm to the code generator again, with the new target (MultiSim56000): a two-processor system, with the processors communicating via a dual-ported shared memory.

Figure 7 illustrates this design flow. We provide the algorithm description (Figure 7a) to the DSP56001 code generator (Figure 7b). MultiSim56000 is the code generation target. Figure 7b shows some of the parameters used by this target. The code generator partitions the algorithm and generates code for the two processors. In the resultant code, the first DSP performs linear distortion and phase shift operations, and the second adds Gaussian noise and second and third harmonic distortions.

We develop the hardware description of the system in the SDF and Thor domains. The top-level design, consisting of two processors and a shared memory, is developed in the Thor domain (Figure 7c). To accommodate signals from both directions, a multiplexer-demultiplexer combination is added at the input and output of the first and the second processor respectively. The analog components (A/D and D/A) are modeled in SDF (Figures 7d-7f) and added to the Thor universe as wormholes.

When we run the Thor universe (Figure 7c), the two DSPs run the code generated by the code generator (code0.asm and code1.asm). We observe the transmitted and received (impaired) signals at both ends of the channel and verify the design. We can repeat this process for different interprocessor communication mechanisms or different hardware-software partitions. Author Kalavade's detailed presentation of this case study describes and evaluates multiprocessor as well as system-level design options.[8]

[Figure 7. Telephone channel simulator: algorithm specification (a); code generator (b); digital (c) and analog (d-f) hardware components.]
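The two-processor partition reported for the channel simulator (linear distortion on the first DSP; Gaussian noise and harmonic distortion on the second) can be mimicked in a few lines of Python. This is only a sketch: the FIR taps, noise level, and harmonic coefficients are hypothetical, the phase-shift step is omitted for brevity, and the two functions merely stand in for the code generated for each processor.

```python
import math
import random

def stage_one(samples, taps=(0.9, 0.2, -0.1)):
    """Stand-in for the first processor: linear distortion as a short FIR filter.
    The taps are illustrative values, not taken from the article."""
    history = [0.0] * len(taps)
    out = []
    for x in samples:
        history = [x] + history[:-1]  # shift the newest sample in
        out.append(sum(t * h for t, h in zip(taps, history)))
    return out

def stage_two(samples, a2=0.01, a3=0.005, noise_std=0.0, rng=None):
    """Stand-in for the second processor: add second and third harmonic
    distortion (a2*x^2 + a3*x^3) and Gaussian noise."""
    rng = rng or random.Random(0)
    out = []
    for x in samples:
        distorted = x + a2 * x * x + a3 * x ** 3
        out.append(distorted + rng.gauss(0.0, noise_std))
    return out

def channel(samples, **stage_two_options):
    """One direction of the simulated channel: stage one feeds stage two."""
    return stage_two(stage_one(samples), **stage_two_options)

if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * n / 16) for n in range(32)]
    impaired = channel(tone, noise_std=0.01)
    print(impaired[:4])
```

Because each impairment has its own parameter, the degree of impairment can be changed without touching the structure, which is the flexibility the software-heavy partition is meant to provide.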
Programmable-DSP core-based ASIC design. Figure 8 shows the transmitter and receiver for a modem. We shall concentrate on the design of the receiver rather than the transmitter; the receiver is usually the more challenging of the two. Channel equalization, carrier recovery, timing recovery, and symbol decoding are the critical components of the receiver.

[Figure 8. Modem transmitter (a) and receiver (b). Legible labels: data symbols, raised-cosine Nyquist pulses, modulation, demodulation.]

In the design of a modem to be embedded in a portable multimedia terminal, size, speed, and power are important considerations. Also, some programmability is necessary to allow changes in the signal constellation and fine tuning of the algorithm. In addition, the DSP used in the modem might also be used for other tasks in the terminal, such as front-end audio processing, fax, or voice mail. Thus, the design calls for high performance as well as programmability. The programmable-DSP core A