
Area/Time/Power Space Exploration in Module Selection for DSP High Level Synthesis

S. Gailhard +, O. Sentieys *, N. Julien +, E. Martin +

+ IUP-LESTER 2, rue le Coat Saint Haouen, 56100 Lorient, France E-mail: [email protected] http://lester.univ-ubs.fr:8080/

* ENSSAT-LASTI 6, rue de Kerampont, 22300 Lannion, France E-mail: [email protected] http://www.enssat.fr/RECHERCHE/ARCHI/

Abstract - Module selection is a basic architectural synthesis task that optimises the cost of dedicated circuits under a real-time constraint. Adding the power factor to the optimisation problem changes the working domain from two dimensions (Area/Time) to three dimensions (Area/Time/Power). However, finding the best supply voltage and the best operator set in a complex library remains an open problem. This paper presents an implementation of module selection integrated in the GAUT HLS tool and some results on a DWT algorithm.

Keywords - Module Selection, High Level Synthesis (HLS) for Low Power, Area/Time/Power (ATP) Space Exploration, Real Time applications, Pipeline Architectures, Discrete Wavelet Transform (DWT).

I. INTRODUCTION

Most VLSI research has focused on optimising circuit speed and area to perform complex signal processing. Indeed, the rapid advance of VLSI technology and the growing computational load of recent real-time DSP applications lead to high chip density and high clock frequencies. As a result, minimising power dissipation in modern circuits, such as those in mobile systems, is a very important problem. To minimise the power consumption of a system, VLSI researchers are working on HLS tools for low-power architectural synthesis. Power optimisation can be carried out at several levels [1]: behavioural, architectural, logical or physical. The expected power saving is most significant at the behavioural and architectural levels [2]. Architectural synthesis enables the re-use of operators already designed by a logic synthesis tool. This provides a large operator library with different ATP characteristics.

In this paper we study a new approach to module selection at the architectural level for DSP systems under low-power and time constraints. Section II presents power estimation at the behavioural level and HLS tools for low power. Section III describes the module selection task and Section IV sets up the cost functions. Section V deals with our ATP space exploration method, built on the formulation of the previous section. In Section VI, results on the DWT algorithm [3] illustrate our method, before a conclusion giving directions for further research in this area.

II. PREVIOUS WORKS

A. Power Estimation at behavioural level

There are three major sources of power dissipation in digital CMOS circuits, summarised in equation (1). The first term represents the switching component of the power, the second term is due to the direct-path short-circuit current and the last term is the power dissipation due to leakage current. However, for CMOS circuits, it is assumed that both short-circuit and leakage power dissipation are negligible. Therefore, the power dissipation is given by (1) [4]:

$$P = P_{switching} + P_{shortcircuit} + P_{leakage} \approx (C \cdot p) \cdot V_{dd}^2 \cdot f = C_{comp} \cdot V_{dd}^2 \cdot f \qquad (1)$$

where C is the physical capacitance of the CMOS circuit, p is the probability of a power consuming transition (0→1), Ccomp is the average capacitance switched to perform a computation, f the sampling frequency and Vdd the supply voltage.
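As a quick numerical illustration of equation (1), the sketch below evaluates the switching term for the 16-bit VHDL adder of Figure 3 (Ceff = 7570 fF) at 5.5 V and 1 MHz. The function name `switching_power` is only illustrative, and treating Ceff as the switched capacitance of one computation per cycle is an assumption of this example.

```python
def switching_power(c_eff_farads, vdd_volts, freq_hz):
    """Switching power P = C * Vdd^2 * f from equation (1), in watts."""
    return c_eff_farads * vdd_volts ** 2 * freq_hz


# Ceff of the 16-bit VHDL adder of Figure 3 (7570 fF), at 5.5 V and 1 MHz.
p_watts = switching_power(7570e-15, 5.5, 1e6)
print(f"{p_watts * 1e6:.0f} uW")  # prints "229 uW"
```

The result agrees with the 229 µW listed for the VHDL adder in Table I. Because of the quadratic term, lowering Vdd from 5.5 V to 3.8 V scales this figure by (3.8/5.5)², roughly 0.48.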

At a high level, it is assumed that signal values are uniformly distributed. Then, the capacitance Ccomp is described in the following equation (2) [4] :

$$C_{comp} = \sum_i N_i \cdot C_i + N_{reg} \cdot C_{reg} + C_{interconnect} + C_{control} \qquad (2)$$

where Ni is the number of operations of type i, Ci the capacitance of operator i, which includes the intra-block routing and gate capacitance, Nreg·Creg the effective register capacitance, Cinterconnect the estimated physical interconnect capacitance and Ccontrol the estimated controller capacitance. Low power dissipation is achieved by reducing Ci through technology considerations and Ccomp through the best selection of operators with different ATP characteristics. Moreover, voltage scaling has a great impact on the power dissipation because of its quadratic dependence on Vdd (1). Unfortunately, reducing Vdd significantly increases the operator latency T as the supply voltage approaches the device threshold voltage Vt (3) [5]:

$$T \propto \frac{V_{dd}}{(V_{dd} - V_t)^2} \qquad (3)$$
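The effect of relation (3) on a library operator can be sketched as follows. This is a minimal example under stated assumptions: the threshold voltage `vt = 0.8` V is a hypothetical value not given in the paper, the 5.5 V reference voltage is the one at which the library latencies are characterised, and the helper names are illustrative. The second helper rounds a latency up to a whole number of clock periods, as is needed for a synchronous design.

```python
import math


def scaled_latency_ns(latency_ref_ns, vdd, vdd_ref=5.5, vt=0.8):
    """Scale a latency characterised at vdd_ref to vdd, using T ~ Vdd / (Vdd - Vt)^2.

    vt = 0.8 V is an assumed threshold voltage, not a value from the paper.
    """
    if vdd <= vt:
        raise ValueError("Vdd must exceed the threshold voltage Vt")

    def delay(v):
        return v / (v - vt) ** 2

    return latency_ref_ns * delay(vdd) / delay(vdd_ref)


def clocked_latency_ns(latency_ns, clock_period_ns=10.0):
    """Round a latency up to a whole number of clock periods."""
    return math.ceil(latency_ns / clock_period_ns) * clock_period_ns


# 18.9 ns operator characterised at 5.5 V, evaluated at lower supply voltages.
for vdd in (5.5, 4.9, 3.9, 3.2):
    raw = scaled_latency_ns(18.9, vdd)
    print(f"Vdd = {vdd} V: {raw:.1f} ns -> {clocked_latency_ns(raw):.0f} ns")
```

The latency grows slowly near the nominal voltage and much faster as Vdd approaches Vt, which is why the voltage cannot be lowered freely under a fixed time constraint.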

The module selection presented in this paper uses both a complex operator library and voltage scaling in order to optimise the design.

B. HLS Tool for Low-Power

In the literature, most synthesis tools for real-time architectures use a specific hardware selection module to optimise the design area. These tools are classified either by the operator types and targeted architectures they support, or by the interaction between the selection task and the other synthesis tasks (scheduling, assignment). The techniques used are very important for synthesis efficiency. Module selection in the GAUT HLS tool [6] optimises the design area of real-time DSP applications on pipeline architectures, using multi-delay, pipelined and multi-functional operators. Few low-power HLS tools exist in the literature. For example, Hyper_LP [4] integrates an ATP space exploration that shows the impact of supply voltage scaling on the design. However, it is not used to explore design optimisation with both a variable supply voltage and a complex operator library with different ATP characteristics.

III. THE MODULE SELECTION TASK

Module selection is the first step in the GAUT HLS [7] design flow (Figure 1). Its goal is to determine the optimal supply voltage and the optimal operator set from a given library for a DSP application (an algorithm and a time constraint). Each designer has their own optimisation problem: circuits for mobile electronic systems require low power dissipation, whereas other designs need a minimum-area circuit in order to be cheaper. Therefore, cost functions must depend on both area and power, with respective weights depending on the designer's needs (Figure 2).

Figure 1 : GAUT design flow (blocks: Algorithm (VHDL), Time Constraint, DFG, Formal Library, Gaut Tool: Module Selection, Scheduling, Binding ..., VHDL RTL)

Figure 2 : Problem formulation (Time Constraint, Performance, Area, Power; Optimizing? Power, Area; Vdd? Components set with different ATP?)

Module selection takes two inputs: one from the application (the internal representation of the Data Flow Graph (DFG)) and one from the library. The DFG model contains NG operation nodes Gk (1 ≤ k ≤ NG), which perform No different types of operations. In the GAUT HLS tool, the library contains several parameters for each operator (Figure 3): area, latency time, number of pipeline stages, typical function and the effective capacitance Ceff.

COMPONENT add16b_vhdl
  GENERIC (
    area         : INTEGER := 53;
    Ceff         : INTEGER := 7570 fF;
    function     : MULTIF  := ADD;
    latency_time : INTEGER := 18900 ps;
    pipe_stage   : INTEGER := 1;
  );
  PORT (
    A : in  Std_Logic_Vector(nb_bit-1 downto 0);
    B : in  Std_Logic_Vector(nb_bit-1 downto 0);
    S : out Std_Logic_Vector(nb_bit-1 downto 0)
  );
END COMPONENT;

Figure 3 : VHDL Specification of Hardware Resources

Ceff only contains the effective capacitance of the gates, simulated with a uniformly distributed set of input values. It is the mean result of 5 simulations of 10,000 stimuli each (using Compass and Powercalc), with a 95% confidence interval smaller than the 1% tolerated error [8]. Table I shows examples for a 0.8 micron CMOS technology.

| 16 bits operators | Latency (ns) at 5.5 Volts | Area in 1/1000 mm2 | Power Dissipation (µW) at 1 MHz at 5.5 Volts |
|---|---|---|---|
| Adder (DataPath) | 14 | 73 | 412 |
| Fast adder (DataPath) | 8.6 | 122 | 711 |
| Adder (VHDL) | 18.9 | 53 | 229 |
| Fast adder (VHDL) | 9.9 | 117 | 626 |
| Multiplier (DataPath) | 32.3 | 1337 | 18827 |
| Pipeline Multiplier (DataPath) | 22 | 1337 | 18821 |
| Multiplier (VHDL) | 53.3 | 1023 | 11820 |
| Fast multiplier (VHDL) | 33.2 | 1270 | 16231 |

Table I : Library examples

The library contains operators with different ATP characteristics. For example, an addition in the DFG can be realised with 4 different operators. The purpose of module selection is to select the operator set that minimises the cost.

IV. COST FUNCTIONS

The cost functions used to optimise the design are based on both area and power dissipation. A factor α (0 ≤ α ≤ 1) sets the relative weight of power and area (4):

$$Cost_\alpha(S) = (1 - \alpha) \cdot \beta \cdot Area(S) + \alpha \cdot Power(S) \qquad (4)$$

where Area(S) is the area estimation (operators, registers, interconnections and buses [7]), Power(S) is the estimation of the power dissipation (5) and β is a normalisation coefficient. Power(S) is composed of the operator, register, memory and clock tree power dissipation (5) [8]:

$$Power(S) = V_{dd}^2 \left[ \frac{1}{T_r} \left( C_{Op}(S) + C_{Reg}(S) + C_{Dyn\_Mem}(S) \right) + \frac{1}{T_{Clk}} \cdot C_{Clk}(S) + Static_{Mem}(S) \right]$$

$$C_{Op}(S) = \sum_{i=1}^{N_o} N_i \cdot C_i, \qquad C_{Reg}(S) = N_{reg} \cdot C_{reg}, \qquad C_{Dyn\_Mem}(S) = N_{mem} \cdot C_{dyn}, \qquad C_{Clk}(S) = N_{reg} \cdot C_{reg} \qquad (5)$$

where COp(S) and CReg(S) represent the effective capacitance of the operators and registers. Each of them switches every Tr (the time constraint). The capacitance estimation of the clock tree CClk is based on the estimated number of registers Nreg(S) [9] and on the clock tree capacitance per register Creg (CClk switches every TClk, the internal clock period). The memory power dissipation is based on a static power dissipation StaticMem(S)·Vdd² (depending on the memory size) and on a dynamic effective capacitance CDyn_Mem(S). The cost functions (4) are evaluated at a high level and are quickly computed.

V. EXPLORING THE AREA/POWER SPACE

The module selection presented here explores all the solutions built from different operators (ATP characteristics, mono- or multi-function, pipelined operators) and different supply voltages. Our DFG processing is carried out in a single order (Figure 4). To estimate the cost function (4), the operator allocation has to be obtained from the number of pipeline stages. This requires knowing the operator latencies, and therefore the selected operators and the supply voltage (3).

To simplify the ATP space exploration, we use a lemma [6] that reduces the number of cost estimations: multi-functional units are necessary only if one of their functions is underused, that is, if there is at least one mono-functional operator in the operator set S with an efficiency of less than 100%. Therefore, a first step provides the best selection of mono-functional operators. If mono-functional operators are underused, an optimisation with multi-functional units is carried out to reduce the circuit area.

In conclusion, our module selection algorithm finds the best operator set (mono- and multi-function) for each supply voltage, and then the best supply voltage in the range from Vdd_min to Vdd_max (Figure 4).

for Supply voltage from Vdd_min to Vdd_max
    for all mono-functional operators set
        Latency time calculation of each selected operator with (3)
        Number of pipeline stages calculation
        Allocation estimation in each pipeline stage
        Cost calculation (4)
    End mono-functional operators set
    Multifunctional operators resolution
    Best operators set selection for each supply voltage
End for Supply voltage
Best selection (Supply voltage and operators set)

Figure 4 : Module Selection Algorithm
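The loop structure of Figure 4 can be sketched in a few lines of Python. This is a simplified illustration, not the GAUT implementation. It makes the following assumptions:

- The library values are illustrative.
- The threshold voltage `vt = 0.8` V and the coefficient `beta = 0.1` are assumed.
- The operation counts (14 additions, 9 multiplications) are hypothetical.
- Area and power count operators only, leaving out the register, memory and clock terms of (5).
- Allocation is a simple ceiling estimate rather than a per-pipeline-stage allocation.
- The multi-functional resolution step is omitted.

```python
import math
from itertools import product

# Illustrative library (assumed values): function -> list of
# (name, latency in ns at 5.5 V, area in 1/1000 mm2, Ceff in pF).
LIBRARY = {
    "add": [("adder_datapath", 14.0, 73, 13.6), ("adder_vhdl", 18.9, 53, 7.57)],
    "mul": [("mult_vhdl", 53.3, 1023, 390.7), ("fast_mult_vhdl", 33.2, 1270, 536.6)],
}


def latency_at(latency_ref_ns, vdd, vdd_ref=5.5, vt=0.8, t_clk_ns=10.0):
    """Latency at vdd from relation (3), rounded up to whole clock periods.

    vt is an assumed threshold voltage.
    """
    def delay(v):
        return v / (v - vt) ** 2

    scaled = latency_ref_ns * delay(vdd) / delay(vdd_ref)
    return math.ceil(scaled / t_clk_ns) * t_clk_ns


def evaluate(op_set, n_ops, vdd, t_r_ns, alpha, beta):
    """Cost (4) of one mono-functional operator set, counting operators only."""
    area, cap = 0.0, 0.0
    for func, (_, lat_ref, op_area, c_eff) in op_set.items():
        lat = latency_at(lat_ref, vdd)
        # Assumed allocation estimate: enough units to fit all operations in t_r.
        units = math.ceil(n_ops[func] * lat / t_r_ns)
        area += units * op_area
        cap += n_ops[func] * c_eff
    power = vdd ** 2 * cap / t_r_ns  # pF * V^2 / ns gives mW
    return (1 - alpha) * beta * area + alpha * power


def module_selection(n_ops, t_r_ns, voltages, alpha=0.5, beta=0.1):
    """Exhaustive search over supply voltages and mono-functional operator sets."""
    best = None
    funcs = sorted(n_ops)
    for vdd in voltages:
        for choice in product(*(LIBRARY[f] for f in funcs)):
            op_set = dict(zip(funcs, choice))
            cost = evaluate(op_set, n_ops, vdd, t_r_ns, alpha, beta)
            if best is None or cost < best[0]:
                best = (cost, vdd, tuple(op[0] for op in choice))
    return best


voltages = [v / 10 for v in range(25, 56)]  # 2.5 V to 5.5 V in 0.1 V steps
cost, vdd, names = module_selection({"add": 14, "mul": 9}, 300.0, voltages)
print(f"best cost {cost:.1f} at Vdd = {vdd} V with {names}")
```

With `alpha=1.0` the cost reduces to the power term, so this sketch always picks the lowest voltage and the operators with the smallest capacitance. Intermediate values of `alpha` trade this against the extra units required by slower operators.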

This ATP exploration makes it possible to reach the minimum of the cost function chosen by the designer.

VI. RESULTS

To illustrate our method, we use a DWT algorithm [3] for image compression as a typical example. The circuit has to implement the algorithm with three resolution levels on 10 images per second. It therefore requires a processor that computes two FIR filters (g, a 7th, and h, a 9th linear-phase FIR filter) (6) every 300 ns:

$$y(n) = h_0 \cdot x_n + \sum_{i=0}^{3} h_i \cdot (x_{2n-i} + x_{2n+i})$$

$$y'(n) = g_0 \cdot x_n + \sum_{i=0}^{2} g_i \cdot (x_{2n-i} + x_{2n+i}) \qquad (6)$$

Table I in Section III lists the part of the operator library used in our HLS tool. The clock period is 10 ns (operator latencies become multiples of 10 ns). The timing constraint is 300 ns. Figure 5 shows the results (Power(S) (5), Area(S) and Cost(S) (4)) of the best operator set for each supply voltage, with α equal to 0.5.

Figure 5 : Cost as a function of Vdd on a DWT algorithm (plot titled 'voltage scaling effects': Cost, Area and Power curves versus Vdd, axis ticks from 2.5 to 5; annotation: Best Cost at Vdd = 3.8)

Figure 5 shows that the power dissipation decreases with the supply voltage because of the quadratic dependence of the power dissipation on Vdd (1). However, when the supply voltage decreases, the operator latency (3) increases, which requires more operators to be allocated, so the design area increases. Table II shows the GAUT HLS tool results for each supply voltage: best operator set (module selection), estimated operator power dissipation, total area and number of operators in the circuit after architectural synthesis.

| Vdd (Volts) | Number of selected adders (latency time (ns)) | Number of selected multipliers (latency time (ns)) | Power (mW) | Area (mm2) |
|---|---|---|---|---|
| 4.9 | 1 Adder Datapath (20) | 3 Mult Vhdl (70) | 360 | 5.16 |
| 3.9 | 2 Adder Vhdl (40) | 3 Mult Vhdl (90) | 222 | 5.76 |
| 3.8 | 2 Add Vhdl (40) | 2 Fast Mult Vhdl (60) | 277 | 4.82 |
| 3.2 | 2 Add Datapath (40) | 3 Fast Mult Vhdl (80) | 199 | 6.14 |

Table II : Module selection results on a DWT algorithm
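The way operator counts depend on latency in Table II can be illustrated with a simple ceiling bound: the number of identical units must be large enough for all operations of one type to fit within the time constraint. The sketch below is an assumption-laden illustration, not the allocation performed by GAUT. The count of 9 multiplications per 300 ns iteration is hypothetical, and `min_units` is an illustrative helper name.

```python
import math


def min_units(n_operations, latency_ns, time_constraint_ns):
    """Lower bound on the number of identical operators needed so that
    n_operations, each taking latency_ns, fit within the time constraint."""
    return math.ceil(n_operations * latency_ns / time_constraint_ns)


# Hypothetical count of 9 multiplications per 300 ns iteration,
# evaluated for the multiplier latencies listed in Table II.
for latency in (60, 70, 80, 90):
    print(latency, "ns ->", min_units(9, latency, 300), "multipliers")
```

A 60 ns multiplier needs two units under this bound while 70 ns or more needs three, so a small change in supply voltage can change the allocation by a whole operator.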

The breaks in Figure 5 are explained by Table II. For a 3.9 Volts supply voltage, three multipliers and two adders are allocated. However, for a 3.8 Volts supply voltage, two adders and only two multipliers are allocated. This means that the area cost is smaller at 3.8 Volts whereas the estimated power dissipation is greater. The cost functions (4) depend on the term α, which represents the relative importance of the area and the power dissipation. Figure 6 shows the impact of this term on the cost function. The optimal supply voltage when 0.2