[20] R. Williams, "fft.c11 - Fast Fourier Transform for the MC68HC11," http://www.mcu.motsps. com/freeweb/pub/mcu11/ffthc11.asm, March, 1994.
Subsetting Behavioral Intellectual Property for Low Power ASIP Design William E. Dougherty, David J. Pursley, Donald E. Thomas [wed,pursley,thomas]@ece.cmu.edu Department of Electrical and Computer Engineering Carnegie Mellon University, Pittsburgh, PA 15213 Abstract Power consumption is an increasingly important consideration in the design of mixed hardware/software systems. This work defines the notion of instruction subsetting and explores its use as a means of reducing power consumption from the system level of design. Instruction subsetting is defined as creating an application specific instruction set processor from a more general processor, such as a DSP. Although not as effective as an ASIC solution, instruction subsetting provides much of the power savings while maintaining some level of programmability. Beyond energy savings, instruction subsetting also offers the opportunity to reduce the design cycle through the reuse of existing processor intellectual property including behavioral and structural designs, hardware simulators, application code, and compilers. We synthesized 9 ASIPs through place and route and found that a poorly chosen instruction set may consume more than 4 times the energy of an ASIP with a proper instruction set choice. This finding will allow designers to consider another set of trade-offs in their hardware/software design space exploration.
implementations represent a trade-off of the level of customization and programmability of the implementation. Clearly, an ASIC implementation has fully customized datapaths and control logic, but it is not programmable. At the other end of the scale, a GP has little or no customized logic for the application at hand. For signal processing applications, a DSP is a more customized architecture than the general processor. An ASIP can be far more customized to an application or Behavioral Description
……………………….. ASIC
Most Customized
ASIP
DSP
GP
Least Customized
Figure 1: The Range of Implementation small set of applications but yet be programmable.
Introduction The performance of a fully customized ASIC design can be optimized in terms of a variety of parameters such as critical path, execution speed, area, or power dissipation. Programmable implementations may not be optimized for any of the performance parameters but can be reprogrammed for a variety of applications. This work defines the notion of instruction subsetting and uses it as a technique to trade-off performance and programmability. Consider the range of implementation styles shown in Figure1. Given a behavioral description, or a software program to be implemented, the system can be designed as software running on a general purpose processor (GP), as software on a DSP processor, as an application specific instruction set processor (ASIP), or as an application specific integrated circuit (ASIC). These alternate
This paper explores the design space available between the general-purpose processor and ASIC designs, mainly with the goal of power reduction. A behavioral description of a processor includes descriptions of all of its instructions. We could use an off-the-shelf version of the processor, or synthesize it using behavioral synthesis techniques. Alternately, we could determine a suitable subset of the instructions for the application at hand, and synthesize an ASIP that only implements that subset. This process, called instruction subsetting, typically reduces the area, critical path, and power dissipation of the implementation, while partially retaining its programmability. Instruction subsetting is an aspect of hardware software codesign in that it provides for performance trade-offs between the software application to be executed and the
1
underlying hardware architecture specifically designed to execute it. Key to this approach is the reuse of intellectual property (IP), specifically that of a general or DSP processor. Given such IP, new techniques for creating a range of ASIPs for custom low power system design can be developed. Not only will the power dissipation be reduced, but the ASIP will still be programmable (although with a more limited instruction set). Thus we can trade off the reduced power of an ASIC with the programmability of an ASIP. Unlike [6], we will begin with a defined instruction set and simply remove unneeded instructions rather than synthesizing a new instruction set based specifically on the needs of the target application. This paper describes the results of an experiment to characterize the design space and trade-offs available when using the instruction subsetting technique.
Approach Research in computer-aided design techniques for mixed hardware/software systems has begun to focus on reducing the power consumption of the systems being produced. Attention has been focused on reducing transition counts in the logic hardware using behavioral synthesis techniques [15][8] reducing memory accesses [2], code generation [18][16], and shutting down unused parts of a system [9]. Our goal is to examine power tradeoffs that can be made in the exploration of the design space through the use of application specific integrated processors (ASIPs). The traditional view of an ASIP is an ASIC that has been modified to implement a small degree of programmability [13]. As such, their design process begins with a thorough examination of the application set to determine the hardware instruction set that will best implement the algorithms. Based on this information, a new instruction set architecture and datapath are created to meet those needs. Since each design is unique to a set of applications, new tools such as compilers and hardware simulators must be created and verified [6]. The hardware design itself must go through a complete functional verification process. Although automation has been introduced to speed up the ASIP creation process, such tools often use simulated annealing, so care must be taken to ensure the design process is repeatable. Our technique for creating ASIPs takes the opposite approach, and seeks to customize an existing processor to the target application set. This top down approach begins with the selection of a processor to execute the software. The application is then compiled or assembled onto that machine, optimizing for any parameter such as execution time, memory usage, or code size. Once this has been
done for all of the target applications, the compiled code serves as a complete list of instructions and functionality required by the ASIP. This information, as well as the description of the complete processor, serves as the input to the instruction subsetting process. Our work specifically focuses on two subsetting approaches. The first uses behaviorally synthesizable descriptions of complete processors, to minimize the design time of the ASIPs and allow for the reuse of existing behavioral IP. While this behavioral approach allows for a larger number of designs to be explored in a short amount of time, questions as to subsettings applicability to designs hand-optimized at the RTL level would still remain. The second approach attempts to address this issue, by conducting a subsetting experiment on hand-optimized RTL descriptions of a DSP processor. Subsetting the RTL design can in some ways be considered the process of simply erasing gates from the design. Clearly, there will be energy savings proportional to the amount of gates removed, as well as the area saved from a layout that is made up of fewer components. When subsetting at the behavioral level however, it becomes much more than the process of simply erasing gates from Scheduled Behavioral Verilog
Hardware allocation
Dasys’ Rapidpath
Synopsys Design Compiler
Cascade Epoch
Cascade Epoch
Logic and Datapath Synthesis
Place and Route
Gate Level Verilog with Back Annotated Delay and Capacitance
Figure 2: High Level design Flow
the datapath. Once unneeded functionality has been removed from the original behavioral description, what remains is a behaviorally synthesizable ASIP description. Behavioral synthesis tools can then optimize this description in ways that might not have been possible in the original description.
2
As mentioned, reuse is an essential element in designing ASIPs in this manner. Beyond the reuse of the initial behavioral IP, the compilers and simulators initially designed for the complete processor can be reused for the ASIP simply by limiting the number of instructions seen as valid. In this way, future applications can be compiled to existing ASIPs by telling the compiler to ignore certain instructions. A lower functional verification effort should be required, as a chip created in this manner as the initial design was verified during the creation of the complete processor.
synthesizable description of the complete processor is examined to determine exactly which portions are required to support the application. Any functionality not required to implement the given subset is removed, and what remains is a new description ready to be synthesized into an ASIP. While this process is currently a manual one, future research is directed toward automated techniques. Compilers designed for the complete processor can be reused when writing new programs for the ASIP simply by limiting the choice of available instructions.
Experiment
All five designs were implemented with the design flow shown in Figure 2. Scheduled behavioral level Verilog is input to DASYS’ RapidPath [4], which allocates hardware for the design. The controller of the register-transfer level Verilog is passed through Synopsys’ Design Compiler [17] for logic synthesis, while the datapath portions are synthesized with Cascade Design Automation’s Epoch [1]. The designs were mapped to 0.5 micron CMOS technology. Epoch also places and routes the entire design. Power estimates were obtained from the gate level Verilog with timing and capacitance data back-annotated from the placed and routed design. An in-house simulation-based power estimation tool [10][14] was used on the logic portion of the designs, while memory power was determined using a regression model based method [3].
Consider the design of a 4-tap finite impulse response (FIR) filter. The most customized implementation of it is an ASIC that only executes this FIR filter. The least customized, as illustrated on the right of Figure 1, is with a general processor. For our experiments, the GP design is a partial implementation of the Motorola M68HC11 8-bit microcontroller [11] that will execute 77 instructions. In the middle of the design space, we have implemented four instruction subsets of the Motorola DSP56000 [10] 24-bit digital signal processor, consisting of 36, 19, 12, and 11 instructions. These designs can be thought of as a DSP and three increasingly customized ASIPs. A second version of the M68HC11 that contains only 24 instructions was also created as an ASIP version of a general-purpose processor. The smaller M68HC11 design implemented only the 24 instructions used in a hand designed FIR program, while the larger 77 instruction version implemented the instructions needed for an FFT program [20]. Care must be taken in comparing the raw results of these designs, since the datapaths are 8 bits wide while the DSP designs have 24 bit datapaths. The smallest DSP56000 design (11 instructions) corresponded to the instructions used in a hand-designed FIR program. The second design (12 instructions) was the minimal subset required to run a simplified FFT program.. The third design (19 instructions) implemented the instructions necessary for an FFT algorithm [12]. The final design (36 instructions) implemented the instructions needed to implement a portion of the SPHINX-III speech recognition front-end [19] as determined by compiling the original C code. The choice of the instruction set required by an application can be determined by a compiler, or as in our case, through hand-optimized assembly code designed to run efficiently on a given processor. Once the required instructions have been chosen, the behaviorally
Our FIR filter is a baseline design; the ASIC and the smallest DSP and GP ASIPs were designed specifically to execute this filter algorithm. Fuller implementations of the DSP and GP architectures, where we have implemented larger subsets of their instructions, allow us to measure the incremental changes in power dissipation when using larger ASIPs. Although several other applications were programmed on the processors, the FIR filter remains the only application comparable with the ASIC. For the hand-optimized designs, two versions of the Motorola DSP56000 were created in a standard cell design flow utilizing Synopsys’ DesignCompiler and Cascade’s Epoch. The first version of the hand designed DSP implemented 36 instructions, and was optimized for low power operation. Three man-years went into the design of this processor. The ASIP version of this design implemented 11 instructions and was built in 3 man weeks. While the 36 instruction version used guard latches in the arithmetic datapath, the 11 instruction version did not. Functionally, these designs are equivalent to the behaviorally synthesized 36 and 11 instruction DSP ASIPs.
3
Fir4 ASIC
HC24
HC77
Hand11
Hand36
DSP11
DSP12
DSP19
DSP36
Area (mm2)
15.17
1.6
4.25
28.38
34.76
31.36
60.75
75.69
85.00
# Transistors (1000s)
175K
17K
33K
179K
181K
179K
294K
356K
412K
KB of SRAM
6
0
0
9
9
9
9
9
9
Critical Path (ns)
38
26
28
38
40
52
80
93
102
Time/Datum (fir4)
152
6,732
7,219
266
280
364
560
651
714
# Instructions
N/A
24
77
12
36
11
12
19
36
% of Total Instruction Set
N/A
8%
25%
19%
58%
18%
19%
31%
58%
Programs Supported
fir4
fir4, fir64
fir4, fir64, fft
fir4, fir64, fft2, lms2
fir4, fir64, fft, fft2, lms, lms2, sphinx
fir4, fir64, lms2
fir4, fir64, fft2, lms2
fir4, fir64, fft, fft2, lms, lms2
fir4, fir64, fft, fft2, lms, lms2, sphinx
Table 1: Physical Design Characteristics
Physical Design Characteristics Table 1 presents the physical characteristics for each design used in our experiments. The design names appear in the first row of the table. The ASIC design is a custom design built to implement a 4-tap FIR filter. The "HC" designs are those based on the M68HC11. The "Hand" designs are the hand-optimized DSP56000 designs, and the "DSP" designs are the behaviorally synthesized DSP56000 based designs. The second row shows the size of the design, placed and routed in a 0.5 micron static CMOS process. The transistor counts presented in the third row do not include memory transistors. Like the 56K, all the DSP designs contain 9KB of memory, divided equally among an instruction memory and two data memories. The ASIC design contains only 6KB of memory, as it does not require an instruction memory. The critical path of the design, shown in the fifth row, represents the minimum clock period for the design as determined by Cascade’s TACTIC static timing analyzer. The sixth row, time per datum, shows the amount of time needed to complete one 4-tap FIR iteration. The ASIC computes an output in four cycles, while the DSP designs require 7 instructions (each 1 cycle) per iteration. The seventh and eighth rows show the number of instructions supported by each design and the percentage of the total instruction set this comprises, respectively. The final row shows the programs each design is capable of running.
Descriptions of each of the programs may be found in Table 2. While we would expect that the critical paths and transistor counts would increase at a steady rate as more instructions are added, we see a jump in the critical path for the DSP designs at the DSP12 and DSP19 ASIPs. This is accounted for by the significantly larger amount of control logic introduced as a result of the two additional addressing modes and additional move instructions required by the fft application. The majority of the changes made between the DSP19 and DSP36 designs are additional datapath items so the increases are not as pronounced. This is also reflected in the transistor count. As expected, implementing more instructions in a design increases physical characteristics like design and critical path.
Program fir4 fir64 fft fft2
Description 4-tap FIR filter 64-tap FIR filter 256 point FFT 256 point FFT written using only instructions from fir4
lms lms2
64-tap least mean squares filter 64-tap least mean squares filter written using only instructions from fir4 sphinx Speech processing front end (Kepstrum filter) Table 2: Descriptions of Programs
4
4.5 0E -09
4.50E-09 lm s
4.0 0E -09
4.00E-09
lms fir4
3.5 0E -09
3.50E-09 fir64 fir6 4
3.00E-09
ave rage lms2
2.5 0E -09
Ene rgy ( J)
En erg y (J)
3.0 0E -09
fir4 fft
2.0 0E -09 1.5 0E -09
averag e lm s2
2.50E-09 fft
2.00E-09 1.50E-09
F IR A SIC fft2
FIR ASIC
DSP 11
1.0 0E -09 5.0 0E -10
DSP 12
DS P19
5.00E-10
DSP36
0.0 0E +00
DSP11
0.00E+00 0%
10 %
fft2
1.00E-09
20%
30%
40%
50%
60%
70%
0%
10%
DSP 12 20%
% o f Instru ction Set
Aggregate Energy Consumption The cost of programmability can be measured by the average energy consumption across the design and application space for the behaviorally synthesized DSP ASIPs are presented in Figure 3. Results are presented on a per clock cycle basis so that all of the applications may be shown on a single graph. The percent of the complete 56k instruction set implemented in the ASIP appears on the X-axis while the Y-axis shows the average energy consumed per clock cycle in the program. Each of the points on the lines represents one of the ASIPs described earlier. The thick line shows the average over all applications. Moving from the 11 instruction ASIP to the largest ASIP shows a clear trend toward increasing energy consumption. A knee in the curve at the middle ASIP indicates that much of the energy consuming circuitry is present in the design by this point. Another point of interest is the average decrease in energy when moving from the DSP11 to the DSP12. Although there is more hardware present in the DSP12 design, the energy consumed is lower because of the organization of the controller and datapath. In the DSP11 the large datapaths are controlled by a few controllers, whereas the DSP12, several smaller controllers drive portions of the datapath. The resulting decrease in glitching activity accounts for the energy savings. On the fir4 line, the slope of the line between the ASIC and DSP11 designs show the cost of implementing the algorithm in software. Because the smaller ASIPs are able to execute faster than their larger counterparts, voltage scaling can be applied to reduce overall energy consumption. Figure 4 shows the average energy consumed per cycle when all of the clock Scaled Voltage (V) 1.4 Fir4 ASIC 2 DSP11 2.6 DSP12 2.9 DSP19 3 DSP36 Table 3: Scaled Operating Voltages
30%
DSP36 40%
50%
60%
70%
% of Instruction S et
Figure 4: Per Cycle Voltage Scaled Energy Consumption of Behaviorally Synthesized DSP ASIPs periods have been extended to that of the DSP36 (102ns) by lowering the operating voltage. The reduced Vdd values for each of the designs are shown in Table 3. The knee in the curve becomes slightly more pronounced here, and in the average case we see a savings of 65% between the largest and smallest ASIPs. Looking solely at energy numbers, there is no sense of time, and therefore no sense of the effect on performance. Here, we use the energy-delay product (EDP) [5] to examine the tradeoffs in the design space simultaneously considering performance and energy consumption. Figure 5 shows the energy delay product over our design and application space. Previous comparisons, based on energy alone, indicated that the modified applications (fft2 and lms2) consumed less energy per cycle, but the EDP shoes the effect of their increased execution times, which will be discussed in more detail later. Again there is a knee in the curve around the middle of the design space that indicates the presence of a "critical mass" in the energy consuming hardware present in a processor. The average EDP saving from the largest to the smallest ASIP is 93%. Figure 6 shows that the savings are not specific to the DSP based designs. Here, the results for the HC11 based 1.8 0E-12 fft2
1.6 0E-12
DSP36
1.4 0E-12 1.2 0E-12 Ene rgy-D elay
Figure 3: Per Cycle Energy Consumption of Behaviorally Synthesized DSP ASIPs
DSP 19
DS P19
1.0 0E-12 8.0 0E-13 DSP12 6.0 0E-13
fft D SP 11
4.0 0E-13
average
F IR ASIC lm s2
2.0 0E-13 0.00E +00 0%
10%
20%
30%
40%
50%
lm s fir64 fir4 60%
70%
% o f Instructio n S et
Figure 5: Per Cycle Energy-Delay Product of Behaviorally Synthesized DSP ASIPs
5
7.00E -10
4.00E-09 6.00E -10
Total
3.50E-09 5.00E -10
3.00E-09 HC77
4.00E -10
2.50E-09 Ene rgy (J)
Ene rgy ( J)
fir4
fir4 - voltage scaled
3.00E -10
2.00E-09
Datapath
1.50E-09
2.00E -10
Memory
H C 24
1.00E-09
1.00E -10
5.00E-10
Control
0.00E+00 0%
5%
10%
15%
20%
25%
30%
Breakdown of Energy Consumption Figure 7 shows a breakdown of the energy consumed by various hardware components. The plot shows the average energy per cycle for each of the major design elements averaged over the entire application space. The energy line for the memories remain fairly constant throughout the design space because all of the designs (except the ASIC) used identical memory configurations and had the same access patterns for each of the applications. The slight variation between the DSP11 design and the others results from fewer buffers leading into the memories. The datapath behaves as one might expect; as more functionality is added, the amount of energy consumed increases. The curve is less steep between the DSP19 and DSP36 designs for two reasons. First, the additional datapath elements required to move from the DSP19 to DSP36 are more logical than arithmetic, and second, the growth in the number and size of the muxes needed to move data increases by a proportionally lower amount. Over the design space, a savings of 76% is observed. While the results are not surprising, we now get a quantitative feel for the savings. The clock and control energies follow a less intuitive trend, increasing slightly from the lower to middle designs and becoming almost flat when moving to the largest ASIP. In the case of the clock, this is because all of the major datapath registers are present in the DSP11 design. Therefore, the only registers added in the larger ASIPs are small control registers. The saving for the control logic over the entire design space is 59%, but overall, control logic is not a significant contributor to the overall energy consumption.
0%
10%
20%
30%
40%
50%
60%
70%
% of Instruction Set
Figure 7: Average Per Cycle Energy Consumption of Major Design Components of Behaviorally Synthesized DSP ASIPs Figure 8 shows a more fine-grained breakdown of energy consumption in the designs. It clearly shows that the memories are the single largest contributors to the energy consumption. Muxes prove to be the most important contributors to the datapath energy consumption as well as the reason for the rapidly increasing trend when moving across the design space. This point to the interconnect as the major energy consumer in the larger designs. While not immediately apparent from the graph, the energy savings in the adders and registers are also over 80%. These trends may have been predicted from the components present in each of the designs, The number of 2 and 3 inputs muxes more than triples and the larger designs contain muxes with as many as 27 inputs. Datapath elements such as AND gates and equality checkers, whose signals feed back into controllers and mux select lines, see a six fold increase. The number of gates in the controllers grows by almost four times.
1.2 0E-09 memory 1.0 0E-09 muxes 8.0 0E-10
Ene rgy ( J)
Figure 6: Per Cycle Energy Consumption for Behaviorally Synthesized HC11 based designs designs running the 4-tap FIR filter program. The dashed line shows the voltage scaled energy consumption per clock cycle, while the solid line shows the unscaled results. Voltage scaling does not make a large difference in this case because the closeness in the critical paths prevented the operating voltage from being lowered very far, moving from 3V down to 2.9V. Even so, there is a 59% savings over the design space.
Clock
0.00E+00
% o f Instruction S et
6.0 0E-10
4.0 0E-10 a dders
registers c ontrol c lock
2.0 0E-10
m ultiplier othe r data path 0.00E +00 0%
10 %
2 0%
30%
40%
50 %
60%
70%
% o f Instructio n Set
Figure 8: Fine-Grained Breakdown of Average Per Cycle Energy Consumption of Behaviorally Synthesized DSP ASIPs
6
4.50E-14
7.00E -10
4.00E-14
fir 4
6.00E -10
fir64 lm s2
3.50E-14
Hand11
Hand36
5.00E -10 3.00E-14 Ene rgy D elay
Ene rgy ( J)
H an d11 Hand36 4.00E -10
lm s2
3.00E -10 Hand 11
2.50E-14 2.00E-14
H and11
1.50E-14
2.00E -10
fir64
1.00E-14
1.00E -10
5.00E-15 fir4 0.00E+00
0.00E +00
0%
0%
10 %
2 0%
30%
40%
50 %
6 0%
70%
10%
20%
30%
40%
50%
60%
70%
% of Instruction Set
% o f Instru ction S et
Figure 9: Voltage Scaled Energy per Cycle for Hand-Optimized DSP ASIPs
Hand-Optimized Designs There are definite savings to be had when implementing this ASIP design style using a behavioral synthesis design flow, but the question remains as to the effectiveness of the subsetting techniques on hand-optimized designs. Figures 9 and 10 show that although savings are not quite as dramatic as in the behaviorally synthesized designs, they are still close to 40% in terms of both voltage scaled energy consumption and energy delay product. This level of savings is present even though the 11 instruction ASIP does use special low power design techniques such as guard latching. In the behaviorally synthesized designs, most of the energy savings across the design space came from the datapath elements and the interconnections between them. This indicates that the synthesis process is able to better allocate the hardware in the new description. In the case of hand-optimized designs, the carefully tuned datapath of the larger design remains relatively in tact when creating ASIPs from it. Most of the energy savings in our handoptimized examples arise from the reduction in control, number of registers, and the size of the clock tree. Trying to make predictive assumptions about potential energy savings in hand-optimized designs is considerably more difficult than for synthesized designs. The quality of hand-optimized designs are subject to far more noise than synthesized ones, as they are completely dependent on the skill of the designer and the amount of time and effort put into them.
Evaluating Application Software Implementations When writing software for an ASIP, tradeoffs can be made in terms of the instructions used to implement some algorithm. It may be done in a few complex instructions, or several simpler ones. While each of the simpler instructions should consume less energy than a complex
Figure 10: Energy Delay Product per Cycle for Hand-Optimized Designs one, the increase in instruction count leads to longer execution times. This may be offset by simpler hardware however, which may have a higher clock rate due to its simplicity, The following section examines these tradeoffs. To quantify the effects of using complex versus simple instructions, we look at the effects of implementing the FFT and LMS filter in two different manners. The first seeks to minimize the code size of the application using any of the 56k’s instructions (fft and lms), while the second implements the same programs using only the 11 instructions found in the simpler FIR program. The fft program contained 14 instances of instructions not found in the limited instruction set. Emulating these 14 instructions, required 133 instructions from the restricted instruction set. Due to the placement of these instructions within the nested loops of the FFT, the run time increases from an average 991 cycles per group in fft to 8927 cycles per group in fft2. The lms program contains four instructions not found in the restricted set, which are replaced by 20 instructions from the restricted set. The overall impact on the time per datum is a rise from 203 cycles in lms to 339 cycles in lms2. Figures 11 and 12 show the effects of these transformations on average energy consumption and EDP, respectively. The average energy used per cycle is lower in the programs built from the restricted data set is lower because a proportionately larger amount of time is taken up in data movements. As noted previously, these programs take longer to compute their outputs, and this is reflected in the EDP. If we consider the DSP36 and DSP19 data points, the lower EDP comes about from using the programs using the unrestricted instruction sets, which contain many complex instructions. Using the DSP11 and DSP12 ASIPs, it is impossible to run the complex versions of the applications. In the case of the LMS filter, the increase in code size is offset by the faster hardware, and using the simplest program on the simplest hardware yields an
7
1.80E-12
4.50E-09
fft2 4.00E-09
1.60E-12
lms
1.40E-12
3.50E-09
1.20E-12
3.00E-09 Energy (J)
Energy (J)
lms2 2.50E-09 fft
2.00E-09
1.00E-12 8.00E-13 6.00E-13
1.50E-09 fft2
1.00E-09
fft
4.00E-13 lms2
2.00E-13
5.00E-10
lms 0.00E+00
0.00E+00 0%
10%
20%
30%
40%
50%
60%
70%
0%
10%
overall benefit in terms of the EDP. The FFT shows that this is not always the case. The DSP12 can only run the fft2 version of the application, but the hardware speed is not sufficient to compensate for the increased cycle count of the program.
Conclusions Our experiments have shown that using instruction subsetting to create ASIPs from existing behavioral IP is a viable approach to low power system design through rapid design space exploration. From the analysis of our 9 placed and routed designs, it is clear that the logic required to implement additional instructions is wasteful if these instructions are not likely to be highly utilized. In the average case, savings of 65% and 93% are achievable in terms of energy and EDP, respectively. The savings arise from the reduced interconnect and increased speed of the smaller designs. Even hand-optimized designs saw savings of almost 40% in terms of both energy and EDP. However, the choice of what instructions to implement in hardware or use in software needs to be made on a case by case basis. The final decision will be based on whether the speed of the hardware will be able to compensate for any increase in the instruction count of the programs. Beyond simple energy savings, this method of instruction subsetting offers a unique opportunity to reuse existing IP at both behavioral and RT levels. Not only does this decrease the physical design cycle, but also the amount of time required in the functional verification process for newly created ASIPs. By using this top down approach applications for ASIPs can be written in a familiar software development environment as determined by the selection of the processor to be subsetted. Existing compilers or assemblers are used to optimize and generate code that not only runs on the lower power ASIP version of the
30%
40%
50%
60%
70%
% of Instruction Set
% of Instruction Set
Figure 11: Per Cycle Energy Consumption Using Restricted and Unrestricted Software Application Implementations Running on Behaviorally Synthesized DSP ASIPs
20%
Figure 12: Per Cycle Energy-Delay Product Using Restricted and Unrestricted Software Application Implementations Running on Behaviorally Synthesized DSP ASIPs processor, but also serves as the required instruction set specification for the design. Future applications can be written for those ASIPs using the same compilers by simply indicating which instructions are not valid in the ASIP. Acknowledgements This work was supported by the Defense Advanced Research Projects Agency under Order No. A564, the National Science Foundation under Grant No. MIP9408457, and the Semiconductor Research Corporation under Task IDs 068-073 and 068-064. The United States Government has certain rights to this material. References [1] Cascade Design Automation Corporation, Epoch User’s Manual, February 1997. [2] F. Catthoor, L. Nachtergaele and S. Wuytack, "Optimizing Data Transfers and Memory for Low Power," ASIC & EDA Magazine, Spring 1997. [3] S. Coumeri and D. Thomas, “Memory Modeling for System Synthesis,” International Symposium on Low Power Electronics and Design, Aug. 1998, pp. 179-84. [4] Dasys, Inc., RapidPackage User's Manual, October 1997. [5] M. Horowitz et al., “Low Power Digital Design,” International Symposium on Low Power Electronics, Oct. 1994, pp. 8-11. [6] H. Ing-Jer and A.M. Despain, "Generating Instruction Sets and Microarchitectures from Applications," International Conference on ComputerAided Design, Nov. 1994, p391-6. [7] B. Klass et al., “Modeling Inter-Instruction Energy Effects in a Digital Signal Processor,” Power Driven Microarchitecture Workshop ISCA98, Jun. 1998, pp. 2530.
8
[8] R. Mehra and J. Rabaey, "Exploiting Regularity for Low-Power Design," International Conference on Computer-Aided Design, Nov. 1996, pp166-72. [9] J. Monteiro, S. Devadas, P. Ashar and A. Mauskar, "Scheduling Techniques to Enable Power Management," Design Automation Conference, June 1996, pp. 349-356. [10] Motorola, Inc. DSP56000/56001 Digital Signal Processor User’s Manual, 1990. [11] Motorola, Inc. M68HC11 Reference Manual, 1991. [12] Motorola, Inc. "Archive containing DSP56000related files," http://www.mot.com/pub/SPS/DSP/software/dr_bub/5600 0.zip, January 1992. [13] P. Paulin, et al., “DSP Design Tool Requirements for Embedded Systems: A Telecommunications Industrial Perspective,” Journal of VLSI Signal Processing, Jan. 1995, pp. 23-47. [14] D.J. Pursley, "A gate level simulator for power consumption analysis," M.S. Thesis, Carnegie Mellon University, May 1996. [15] A. Raghunathan and N.K. Jha, "Behavioral Synthesis for Low Power," International Conference on Computer-Aided Design, October 1994, pp. 318-322. [16] M. Srivastava and M. Potkonjak, "Power Optimization in Programmable Processors and ASIC Implementations of Linear Systems: Transformationbased Approach," Design Automation Conference, June 1996, pp. 434-348. [17] Synopsys, Inc., DesignWare User Guide, August 1997. [18] V. Tiwari, S. Malik and A. Wolfe, "Compilation Techniques for Low Energy: An Overview," International Symposium on Low-Power Electronics, October 1994, pp. 38-39. [19] N. Weinberg, private communication. [20] R. Williams, "fft.c11 - Fast Fourier Transform for the MC68HC11," http://www.mcu.motsps. com/freeweb/pub/mcu11/ffthc11.asm, March, 1994.
9