Energy-Aware Co-processor Selection for Embedded Processors on FPGAs

Amir Hossein Gholamipour, Elaheh Bozorgzadeh, Sudarshan Banerjee*
Donald Bren School of Information and Computer Science
University of California, Irvine
E-mail: {amirgh,eli,banerjee}@ics.uci.edu

* Sudarshan Banerjee is currently with Liga Systems, Sunnyvale, CA, USA.

1-4244-1258-7/07/$25.00 ©2007 IEEE

Abstract

In this paper, we present the co-processor selection problem for minimum energy consumption in HW/SW co-design on FPGAs with dual power mode. We provide a theoretical analysis of the problem under no constraint, resource constraint, and timing constraint. We prove that the problem is NP-Hard in each case and we provide a generalized ILP formulation. We compared the result of our approach in minimizing energy to the result of other approaches that do not consider both static and dynamic power during optimization, and we showed that we can reduce energy by 63% in some cases.

1. Introduction

FPGAs are strong underlying platforms for implementing Systems-on-Chip: they provide hard and soft embedded processors, which enables on-chip integration of processors and co-processors. Computationally intensive kernels of an application are extracted for hardware implementation (co-processors). The programmability of the underlying hardware enables integration of various co-processors and customization for each application. However, adding each co-processor may increase the total resources used. In nanoscale FPGA devices, static power is highly correlated with on-chip resources, given that the device cannot be shut down without losing its configuration. FPGA researchers have proposed dual power modes (Altera Stratix III) or shut-down options for non-configured regions [10]. The static power consumption in the active region plays an important role in the total energy consumption of the device.

This paper focuses on the selection of co-processors coupled with an embedded processor on an FPGA device to execute the target application. The objective is to minimize energy consumption considering both static and dynamic power. The target FPGA architecture is assumed to have dual power modes: active mode (execution takes place in the active region) and idle mode (either shut down or in a low power mode, and not used during execution). In dual power mode, the static power consumption depends strongly on the total resources in the active region. In dual power mode, increasing the area of the active region by adding more co-processors may not lead to minimum energy consumption.

HW/SW partitioning on FPGAs with embedded processors has been addressed in many research papers (e.g., [1, 2, 4, 9]). The gain in energy is due to the speedup in execution time. In most of the related work, the effect of static power is not considered. [2] and [9] address the trade-off between performance and area of the FPGA device. In the ASIC domain, some related work [3, 5, 6, 7] aims at HW/SW partitioning to minimize energy. In [3], the authors used an ILP to select hardware candidates with the smallest area overhead under performance and power constraints. In [7], the objective is to find the co-processor set that maximizes performance under an area constraint. In [11, 13], the authors showed power saving techniques in routing in a dual-voltage FPGA. In [10], the authors proposed a region-constrained placement technique for FPGAs with shut-down mode. Dual-voltage FPGAs have not been considered in co-processor selection.

In this paper, the problem of energy-aware co-processor selection in dual-power-mode FPGAs and its complexity are analyzed under different timing and area constraints. We prove that the problem is NP-Hard in various cases and we provide a unified integer linear programming (ILP) formulation to solve the problem under different constraints. In dual-power-mode co-processor selection, we observe interdependency between the energy gains of the co-processors, due to the steady static power consumption in the active region during execution. With clock gating, dynamic power consumption does not introduce such a dependency between co-processor candidates. The experiments show the impact of the theoretical analysis in this paper as well as the importance of considering both dynamic power and static power in exploiting co-processors. We compare the results to other approaches that only consider dynamic power or apply area minimization. Results show that the selection of co-processors based on those techniques may not lead to minimum energy consumption.

2. Energy-Aware Co-Processor Selection

The target system architecture consists of a microprocessor and a set of co-processors (Figure 1) on a reconfigurable fabric (such as an FPGA device). The co-processors are connected to the processor (soft core or hard core) through communication buses. The inputs to each co-processor are sent from the processor through the communication buses, and the outputs are sent back through the buses to the processor after execution.

[Figure 1. The architecture of the processor system. Blocks shown: Processor, Local Memory, Coproc 1, Coproc n, Coproc n+1, Coproc n+2, Coproc m.]

By clock gating, we reduce dynamic power consumption when co-processors are idle. If the underlying FPGA platform has dual voltage islands, we can use part of the chip for implementing the design and put the idle (non-configured) region of the device in low power mode [10, 13]. We assume that static power consumption in low power mode is negligible compared to the active region. Furthermore, the active region can go to low power mode after execution or after the timing deadline. The Altera Stratix device is an example of a commercial FPGA capable of providing dual voltage for designs. On the other hand, with single-power-mode FPGAs such as Xilinx Virtex devices we cannot exploit voltage islanding. In such devices, we have to use the whole device area for implementation, and the idle region of the device still consumes comparable static power. It should be noted that on FPGAs, shutting down the active section of the device is not an option, as the device will lose its configuration.

2.1. Problem Formulation

The total energy consumed in the system is due to both dynamic power and static power consumption:

$$E_{total} = P_{dynamic} \cdot T_{exec} + P_{static} \cdot T_{exec}$$

In this paper, the term Dynamic Energy (DE) refers to the energy consumption of the system due to dynamic power consumption.

We define the problem as follows: A set of co-processor candidates C = {c1, c2, …, cn} represents the critical kernels of the software. Each co-processor j is defined by an ordered triple (∆DEj, ∆Tj, Aj) (the variables are defined in Table 1; β is the static power per unit area). The total energy consumed for executing the software is DEsw + Tsw·SPproc (if the processor is a soft core, SPproc is equal to Aproc·β). If we add co-processors from a set S ⊆ C, the total energy consumption is:

$$E = \Big(DE_{sw} - \sum_{C_k \in S} \Delta DE_k\Big) + \Big(T_{sw} - \sum_{C_k \in S} \Delta T_k\Big)\Big(SP_{proc} + \beta \sum_{C_k \in S} A_k\Big) \tag{1}$$

The objective is to minimize Equation 1. If we have one co-processor candidate, adding the co-processor must reduce the energy consumption of the system. Hence:

$$(DE_{sw} - \Delta DE_{\text{co-proc}}) + (T_{sw} - \Delta T_{\text{co-proc}})(SP_{proc} + A_{\text{co-proc}}\,\beta) < DE_{sw} + T_{sw}\,SP_{proc} \tag{2}$$

Rewriting Inequality 2 (expanding the product and cancelling DEsw and Tsw·SPproc on both sides), we reach:

$$(T_{sw} - \Delta T_{\text{co-proc}})\,\beta\,A_{\text{co-proc}} < \Delta DE_{\text{co-proc}} + SP_{proc}\,\Delta T_{\text{co-proc}} \tag{3}$$

The left-hand side of Inequality 3 is the increase in static energy (loss in energy) caused by the area overhead of the added co-processor. The right-hand side is the gain in dynamic energy plus the gain in static energy due to the saving in execution time. This means that we can add a co-processor if the gain in energy consumption dominates the loss. While the loss comes from the area overhead of the co-processor, the gain can result from the gain in dynamic energy, the speedup, or both. If we assume that we select the optimal co-processor set by adding the co-processors one by one, we can generalize Inequality 3 by considering the "current situation" of the system, in which we have already added some co-processors from a set S ⊆ C. Hence, the dynamic energy consumption, execution time and area of the system are changed to:

$$DE_{curr} = DE_{sw} - \sum_{C_k \in S} \Delta DE_k, \qquad T_{curr} = T_{sw} - \sum_{C_k \in S} \Delta T_k, \qquad A_{curr} = \sum_{C_k \in S} A_k \tag{4}$$

Adding the co-processor is beneficial only if it reduces energy further. By generalizing Inequality 3 we reach Inequality 5:

$$(T_{curr} - \Delta T_{\text{co-proc}})\,\beta\,A_{\text{co-proc}} < \Delta DE_{\text{co-proc}} + (A_{curr}\,\beta + SP_{proc})\,\Delta T_{\text{co-proc}} \tag{5}$$

Lemma 1: Inequality 5 is a sufficient condition for adding a co-processor to the system to decrease energy consumption. However, if the energy gain of a co-processor is less than its energy loss, this does not necessarily prevent the co-processor from being added. Instead, a subset of the co-processors might be added to the system while none of the co-processors alone satisfies Inequality 5.

Assume that for a given application on an embedded soft processor we have (DEsw, Tsw, Aproc) = (40, 10, 2). There are two co-processor candidates C and D: (∆DEC, ∆TC, AC) = (10, 2, 2) and (∆DED, ∆TD, AD) = (10, 3, 3). Assuming that β = 1 (so SPproc = 2), neither of the co-processors C and D satisfies the condition in Inequality 5 (for C, 16 is not less than 14; for D, 21 is not less than 16). However, if we add both of them at the same time, the effect is the same as adding one co-processor with (∆DE, ∆T, A) = (∆DEC + ∆DED, ∆TC + ∆TD, AC + AD) = (20, 5, 5). Hence, selection of the two co-processors satisfies Inequality 5 (25 < 30). This observation gives an intuition for why the problem is NP-Hard: if individual co-processors are not capable of reducing total energy consumption, a subset of two or more of them might still be able to do so. In the worst case, every subset of the original set must be examined to observe a reduction in total energy consumption. Due to clock gating, this behavior is not observed under dynamic energy minimization.
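The two-candidate example above can be checked numerically. The following is a minimal sketch, assuming Acurr = 0 before any co-processor is added and SPproc = Aproc·β for the soft core; the function and constant names are illustrative only and are not part of the formulation.

```python
from itertools import combinations

BETA = 1.0                       # static power per unit area (value from the example)
DE_SW, T_SW, A_PROC = 40.0, 10.0, 2.0
SP_PROC = A_PROC * BETA          # soft-core processor: SPproc = Aproc * beta

# Candidates from the example: name -> (delta_DE, delta_T, area)
CANDIDATES = {"C": (10.0, 2.0, 2.0), "D": (10.0, 3.0, 3.0)}


def total_energy(selected):
    """Equation 1 evaluated for the chosen subset of candidate names."""
    d_de = sum(CANDIDATES[n][0] for n in selected)
    d_t = sum(CANDIDATES[n][1] for n in selected)
    area = sum(CANDIDATES[n][2] for n in selected)
    return (DE_SW - d_de) + (T_SW - d_t) * (SP_PROC + area * BETA)


def satisfies_inequality_5(d_de, d_t, area, t_curr=T_SW, a_curr=0.0):
    """Inequality 5 for a single addition, starting from the given current state."""
    loss = (t_curr - d_t) * BETA * area
    gain = d_de + (a_curr * BETA + SP_PROC) * d_t
    return loss < gain


def best_subset():
    """Evaluate every subset exhaustively (exponential; only for tiny candidate sets)."""
    names = sorted(CANDIDATES)
    subsets = [c for r in range(len(names) + 1) for c in combinations(names, r)]
    return min(subsets, key=total_energy)
```

With these values, software alone costs 60, adding only C costs 62, adding only D costs 65, and adding both costs 55, which matches the observation that the pair is beneficial although neither candidate is beneficial alone.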

It can be proven that the co-processor selection decision problem is NP-Complete by reducing the subset selection problem to it.

Co-processor selection decision problem: Given a set S of ordered triples (∆DEi, ∆Ti, Ai) and an integer m, we want to know whether there is a subset S' that satisfies the following equation:

$$\Big(DE_{sw} - \sum_{C_k \in S'} \Delta DE_k\Big) + \Big(T_{sw} - \sum_{C_k \in S'} \Delta T_k\Big)\Big(SP_{proc} + \beta \sum_{C_k \in S'} A_k\Big) = m \tag{6}$$

2.2. Area Constrained Co-processor Selection

The area constraint (the area assigned to implement the application) can be studied as two different problems, depending on the architecture of the FPGA. In Flexible Area Constraint (FAC), the flexible area refers to an area within which dual power mode is provided. Hence the idle region of Acons that is not used by the application (A') is set to low power mode, consuming negligible static power compared to the active region (A), where Acons = A + A'. The objective is to minimize Equation 1 under the area constraint Acons:

$$\sum_{k \in S} A_k \le A_{cons} \tag{7}$$

In Hard Area Constraint (HAC), we assume that the resources in the area Acons are all in active power mode (e.g., Acons can refer to the area of a voltage island). In this problem, area A is used for system implementation and the rest of the area (A') still consumes the same amount of static power during execution time. Then, under the constraint of Inequality 7, the objective function is as follows:

$$\Big(DE_{sw} - \sum_{k \in S} \Delta DE_k\Big) + \Big(T_{sw} - \sum_{k \in S} \Delta T_k\Big)\big(SP_{proc} + A_{cons}\,\beta\big) \tag{8}$$

It should be noted that the definition of the area constraint excludes the area of the processor if the processor is a soft core. Relaxed area constraint means:

$$A_{cons} \ge \sum_{k=1}^{n} A_k \tag{9}$$

The problem of co-processor selection under Hard Area Constraint is in fact the well-known Knapsack problem. Each co-processor takes some area and has a gain equal to ∆DEi + (SPproc + Acons·β)·∆Ti. We want to maximize the gain under constraint 7. Under relaxed area constraint the problem becomes polynomially solvable.

Although co-processor selection under Flexible Area Constraint is similar to HAC, it is not the same problem. In this problem the gain for adding co-processor ck is equal to:

$$\Delta DE_k + \Delta T_k\,\big(SP_{proc} + (A_{curr} + A_k)\,\beta\big) - T_{curr}\,\beta\,A_k \tag{10}$$

Based on Equation 10, the gain for adding a co-processor depends on the current state of the system, which makes it dependent on the co-processors that have been added to the processor up to this point. In HAC, by contrast, the gains of the co-processors are independent of each other. It can be shown that the Knapsack problem can be reduced to FAC. Thus this problem is NP-Hard. Under relaxed area constraint, the problem reduces to co-processor selection under no constraint, which is still NP-Hard.

2.3. Time Constrained Co-processor Selection

Timing constraints are also categorized into two types. Under Flexible Timing Constraint (FTC), the system goes to low power mode after execution. After that, the power consumption until the execution deadline is considered negligible. For this problem the objective is to minimize Equation 1 under the timing constraint Tcons:

$$T_{sw} - \sum_{k \in S} \Delta T_k \le T_{cons} \tag{11}$$

Under Hard Timing Constraint (HTC), the system is in active mode until the deadline. After execution and until the deadline, the system only consumes static power. The objective of this problem is to minimize:

$$\Big(DE_{sw} - \sum_{k \in S} \Delta DE_k\Big) + T_{cons}\,\Big(SP_{proc} + \beta \sum_{k \in S} A_k\Big) \tag{12}$$

under the constraint shown in Inequality 11. Throughout this paper, relaxed timing constraint means that for the timing constraint Tcons:

$$T_{cons} \ge T_{sw} \tag{13}$$
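Returning to the Hard Area Constraint case of Section 2.2: because each gain ∆DEi + (SPproc + Acons·β)·∆Ti is independent of the other selections, a textbook 0/1 knapsack dynamic program applies. The following is a minimal sketch, assuming integer area units of at least 1 per candidate; the function name hac_select and the sample values in the usage note are hypothetical.

```python
def hac_select(candidates, sp_proc, beta, a_cons):
    """0/1 knapsack by dynamic programming over integer area units.

    candidates: list of (delta_DE, delta_T, area) tuples with integer area >= 1.
    Returns (best_total_gain, sorted list of chosen candidate indices).
    """
    # Under HAC the whole region Acons stays active, so each gain is a constant.
    gains = [d_de + (sp_proc + a_cons * beta) * d_t for d_de, d_t, _ in candidates]

    # best[a] = (gain, chosen indices) achievable with at most `a` area units
    best = [(0.0, ())] * (a_cons + 1)
    for idx, (_, _, area) in enumerate(candidates):
        # Walk the area budget downwards so each candidate is used at most once.
        for a in range(a_cons, area - 1, -1):
            cand_gain = best[a - area][0] + gains[idx]
            if cand_gain > best[a][0]:
                best[a] = (cand_gain, best[a - area][1] + (idx,))
    gain, chosen = best[a_cons]
    return gain, sorted(chosen)
```

For example, with the hypothetical candidates [(10, 2, 2), (10, 3, 3), (4, 1, 4)], SPproc = 2 and β = 1, an area budget of 5 selects the first two candidates, while a budget of 4 selects only the second. The same recurrence does not carry over to FAC, where the gain in Equation 10 depends on what has already been selected.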


The theoretical analysis of co-processor selection under FTC and HTC is the same as for FAC and HAC. It should be noted that the problem of HATC (Hard Area and Timing Constraint) is equivalent to dynamic energy minimization.
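One way to see the HATC equivalence is the following short derivation. It is a sketch under the assumption that HATC combines the static power term of HAC (the whole area Acons stays active) with the active time of HTC (the system stays active until Tcons); the symbol E_HATC is introduced here only for illustration.

$$E_{HATC} = \Big(DE_{sw} - \sum_{k \in S} \Delta DE_k\Big) + T_{cons}\,\big(SP_{proc} + A_{cons}\,\beta\big)$$

Under this assumption the second term does not depend on the selected set S, so minimizing E_HATC amounts to maximizing the total dynamic energy improvement of the selected co-processors, subject to the area and timing constraints.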

2.4. ILP Formulation

We present an integer linear programming formulation for the energy-aware co-processor selection problem. The ILP formulation is capable of handling the on-chip resources (on-chip memory blocks and multipliers) to minimize energy. It also works for the case in which some co-processor candidates overlap in software implementation. For simplicity, we present the basic ILP for optimal co-processor selection. The ILP covers all problems discussed in Sections 2.1-2.3. For HAC and HTC the formulation for static power is different. To handle these cases we introduce Asys and Tsys, which are equal to Acons and Tcons under hard constraints. For flexible constraints, Asys and Tsys are determined using Equations 17 and 15 (Tsys is equal to TET). The problem with no constraint, which is the main problem discussed in Section 2.1, is the problem under FAC and FTC when the constraints are relaxed. In the ILP formulation we use the parameters in Table 1.

Table 1. Set of parameters used in ILP formulation

Parameter     Description
Aj            Area of co-processor j
∆DEj          Dynamic energy improvement of co-processor j
∆Tj           Execution time improvement of co-processor j
SPproc        Static power of the processor
rj            Number of times that co-processor j is called
ecj           Execution time of co-processor j
commj         Communication time of co-processor j
tj            Run-time of co-processor j in software
DPproc        Average dynamic power of the processor
DPcomm        Average power of communication
DPcoproc_j    Dynamic power consumption of co-processor j
TET           Total Execution Time
Tsw           Execution time of the application in software
Tcons         Timing constraint imposed on the system
Acons         Area constraint imposed on the system

1) ILP variables:

• xj = 1 if co-processor j is selected for hardware implementation.

2) Constraints:

• Processor Interface Constraint
Assuming that we have at most m communication buses:

$$\sum_{j=1}^{k} x_j \le m \tag{14}$$

• Execution Time Deadline
The Total Execution Time (TET) should not exceed the timing constraint of the system. TET = Tsw – (execution time of the software code of the co-processor – hardware execution time – communication time):

$$TET = T_{sw} - \sum_{j=1}^{k} x_j\,r_j\,\big(t_j - (comm_j + ec_j)\big) \le T_{cons} \tag{15}$$

• Area Constraint
The total area of the system should not exceed the area constraint imposed on the system:

$$\sum_{i=1}^{n} x_i\,A_i \le A_{cons} \tag{16}$$

3) Objective:

The objective is to minimize total energy, i.e., dynamic energy plus static energy. For FAC, Asys is equal to:

$$A_{sys} = \sum_{i=1}^{k} A_i\,x_i \tag{17}$$

Also, for FTC, Tsys is equal to TET (Equation 15).

Static Energy (SE) = System execution time * (Processor's Static Power + Area of the co-processors * Static Power per unit area)

$$SE = T_{sys}\,\big(SP_{proc} + A_{sys}\,\beta\big) \tag{18}$$

It should be noted that although the equation for static energy is non-linear, it can be linearized by Fortet's linearization method [12].

Dynamic Energy (DE) = Dynamic energy for the part of the application running as software + Dynamic energy for the communication + Dynamic energy for the co-processors

$$DE = DP_{proc}\Big(T_{sw} - \sum_{j=1}^{k} x_j\,r_j\,t_j\Big) + DP_{comm}\Big(\sum_{j=1}^{k} x_j\,r_j\,comm_j\Big) + \sum_{j=1}^{k} DP_{coproc\_j}\,x_j\,r_j\,ec_j \tag{19}$$

3. Experiments

We have targeted image processing benchmarks for our experiments. Three different image sharpening applications have been implemented both in hardware and software. These techniques are: (1) unsharp masking, (2) Sobel filter based image sharpening and (3) Laplacian filter based image sharpening. Hardware units for each application are as follows:

• Blur: blur_convolution, RGB, YCC
• Sobel: HSobel_filter, VSobel_filter, RGB_conv, YCC_conv
• Laplacian: laplacian_filter, RGB_conv, YCC_conv

The embedded processor that we are using is the Xilinx soft processor (MicroBlaze) with all features (including hardware multipliers and barrel shifter). Hardware units (co-processors) can be implemented using the reconfigurable fabric of the FPGA. Communication between the processor and the hardware units takes place through dedicated point-to-point buses called Fast Simplex Links (FSL). Communication words are transferred from MicroBlaze's on-chip memory (BRAMs connected to the processor through the LMB bus) to the co-processors. We use the Xilinx tool set to design, synthesize, and place and route our circuits. The information related to the number and hierarchy of function calls is extracted by profiling the C code. The execution time of the software and hardware is measured through simulation using ModelSim. During simulation we generate the VCD file to use XPower for power estimation. After place and route, the size and area of the hardware implementation are estimated using the Xilinx ISE tool set. We feed all the gathered information to our ILP, which we solve using the CPLEX solver. We estimate static energy at two temperatures (25°C and 85°C).

In addition, it is important to manage on-chip resources (hardware multipliers, BRAMs), as they are limited on FPGAs. This is particularly important for the BRAMs, as they have to be assigned to the processor's local memory. Since we are in the multi-core era, BRAMs become very scarce resources that should be utilized for performance. The advantage of FPGAs is that we can implement these resources using the reconfigurable fabric of the FPGA. In our experiments we have considered hardware implementations that use only hard on-chip resources (Hard implementation) as well as implementations that use fewer on-chip resources and implement the rest in the reconfigurable fabric of the FPGA (Hard/Soft implementation).
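To make the formulation of Section 2.4 concrete, the following is a minimal sketch that evaluates Constraints 14-16 and the objective (Equations 17-19) by exhaustive enumeration of the binary variables xj. It is not the CPLEX model used in the experiments: the function name solve_by_enumeration, the dictionary keys, and the numbers in P and COPS are hypothetical, and the search is exponential in the number of candidates. Because the product of Tsys and Asys is evaluated directly, no linearization is needed in this sketch.

```python
from itertools import product

# Hypothetical system parameters and two hypothetical co-processor candidates.
P = {"T_sw": 10, "SP_proc": 2, "DP_proc": 4, "DP_comm": 1, "beta": 1}
COPS = [
    {"A": 2, "r": 1, "t": 3, "ec": 0.5, "comm": 0.5, "DP": 2},
    {"A": 3, "r": 1, "t": 4, "ec": 0.5, "comm": 0.5, "DP": 2},
]


def solve_by_enumeration(cops, p, m_buses=None, a_cons=None, t_cons=None,
                         hard_area=False, hard_time=False):
    """Try every x in {0,1}^k and return (min_energy, best_x), or None if infeasible.

    hard_area / hard_time require a_cons / t_cons to be given.
    """
    best = None
    for x in product((0, 1), repeat=len(cops)):
        sel = [c for xi, c in zip(x, cops) if xi]
        # (14) processor interface: at most m selected co-processors
        if m_buses is not None and len(sel) > m_buses:
            continue
        # (16) area constraint
        area = sum(c["A"] for c in sel)
        if a_cons is not None and area > a_cons:
            continue
        # (15) total execution time and deadline
        tet = p["T_sw"] - sum(c["r"] * (c["t"] - (c["comm"] + c["ec"])) for c in sel)
        if t_cons is not None and tet > t_cons:
            continue
        # (19) dynamic energy: software part + communication + co-processors
        de = (p["DP_proc"] * (p["T_sw"] - sum(c["r"] * c["t"] for c in sel))
              + p["DP_comm"] * sum(c["r"] * c["comm"] for c in sel)
              + sum(c["DP"] * c["r"] * c["ec"] for c in sel))
        # (17)/(18) static energy; hard constraints pin A_sys / T_sys
        a_sys = a_cons if hard_area else area
        t_sys = t_cons if hard_time else tet
        se = t_sys * (p["SP_proc"] + a_sys * p["beta"])
        if best is None or de + se < best[0]:
            best = (de + se, x)
    return best
```

With the hypothetical data above, the unconstrained optimum selects both candidates, whereas limiting the system to a single bus (m_buses=1) makes the software-only solution the cheapest, since neither candidate pays for its own area.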

3.1. No Constraint Co-processor Selection

This problem is analyzed in detail in Section 2. We compare the results of the experiments with the case in which only dynamic energy is considered in the objective function, as studied in [4, 5, 6]. The result of the experiments is shown in Table 2.

Table 2. Selected co-processors under no constraint (Hard/Soft Implementation)

             Min Etotal         Min Edynamic
             25°C     85°C      25°C     85°C
BLUR         All      None      All      All
SOBEL        None     None      All      All
LAPLACIAN    All      None      All      All

While selecting all the co-processors results in minimum dynamic energy, it is noticeable that for the SOBEL application in the lowest energy mode, none of the co-processors gets selected. Experiments show that for the SOBEL application the lowest energy mode consumes 45% less energy than the lowest dynamic energy mode. It is interesting to observe that at 85°C none of the co-processors gets selected, which shows the dominant role of static power in total energy at this temperature.

[Figure 2. Min Etotal vs. Min Edynamic – No constraints. Bar chart of Energy (mJ) for Blur, Sobel and Laplacian at 25°C and 85°C; series: Lowest Energy, Lowest Dynamic Energy.]

3.2. Resource Constrained Co-processor Selection

We examine the effect of communication bus constraints as well as the effect of Hard Area Constraint.

3.2.1. Communication Constraints. We consider 1, 2 or 3 FSL buses in our experiments. Figure 2 shows energy consumption when we have two FSL buses. As shown, we can achieve energy saving of up to 63% compared to the case that minimizes dynamic energy.

The experiments show that for the LAPLACIAN application none of the co-processors with hard/soft implementation at 25°C gets selected when we have one or two FSL buses. As discussed in Section 2, this means that no single co-processor and no subset of two co-processors satisfies Inequality 5. However, as shown in Table 2, a set of three co-processors satisfies Inequality 5. Similar behavior is observed for the BLUR application at 25°C.

3.2.2. Hard Area Constraints (HAC). In Figure 3, for hard/soft implementation of the co-processors, the energy consumption of the system with the least energy consumption is compared to the energy consumption of the system with the least execution time. In this set of experiments, area sizes have been chosen according to Virtex device sizes. Therefore, energy-aware co-processor selection on single-power-mode FPGAs such as Xilinx Virtex can be formulated under HAC, where the area constraint is the device size. At the higher temperature, selecting the co-processors for fastest execution can increase the total energy more significantly.

[Figure 3. Min Etotal vs. Max Performance under HAC. Bar chart of Energy (mJ) for Blur, Sobel and Laplacian at 25°C and 85°C; series: Lowest Energy, Fastest.]

3.3. Hard Timing Constraint (HTC)

In this set of experiments, the software execution time is the largest timing deadline. Table 3 shows the set of co-processors selected under each timing constraint for LAPLACIAN. We compare our ILP selection to the selection based on the approach in [3], which minimizes area under a timing constraint. In Table 3, the rows labeled Smallest show the selection with the smallest area under which the timing constraint is met. As a result, all the co-processors are hard implemented.

Table 3. Selected co-processors under HTC (LAPLACIAN, Hard Implementation)

Deadline      Selection    25°C         85°C
T = 250ms     ILP          All          All
T = 250ms     Smallest     YCC, Lap.    YCC, Lap.
T = 350ms     ILP          All          Lap., YCC
T = 350ms     Smallest     Lap.         Lap.
T = 500ms     ILP          All          Lap.
T = 500ms     Smallest     Proc         Proc

The result is shown in Figure 4. In this set of experiments, for hard/soft implementation of the co-processors we almost always select the same set of co-processors as the set with the smallest area. The reason is that for this set of co-processors static energy is much higher than dynamic energy, so minimizing area is in fact equivalent to minimizing energy.

[Figure 4. Min Etotal vs. Min Area under HTC – LAPLACIAN. Bar chart of Energy (mJ) at 25°C and 85°C for T=250ms, T=350ms and T=500ms; series: Minimized Energy, Smallest Area.]

4. Conclusion

In this paper we studied the problem of energy minimization using co-processors. We analyzed the problem under no constraint, area constraints and timing constraints, and we proved that the problem is NP-Hard in each case. We provided a unified ILP formulation for the above-mentioned problems to find the optimum solution. We compared the result of our approach in minimizing energy to the result of other approaches that do not consider both static and dynamic energy, and we showed that we can reduce energy by 63% in some cases.

Acknowledgement

The authors would like to thank Juanjo Noguera for providing the benchmarks used in our experiments.

5. References

[1] R. Lysecky, F. Vahid, "A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning", DATE, 2005.
[2] C. Galuzzi, E. Panainte, Y.D. Yankova, K.L.M. Bertels, S. Vassiliadis, "Automatic Selection of Application-Specific Instruction-Set Extensions", CODES+ISSS, 2006.
[3] N. Bình, M. Imai, A. Shiomi, N. Hikichi, "A hardware/software partitioning algorithm for designing pipelined ASIPs with least gate counts", DAC, 1996.
[4] P. Biswas, S. Banerjee, N. Dutt, P. Ienne, L. Pozzi, "Performance and Energy Benefits of Instruction Set Extensions in an FPGA Soft Core", VLSI Design, 2006.
[5] J. Henkel, Y. Li, "Energy-conscious HW/SW-partitioning of embedded systems: A Case Study on an MPEG-2 Encoder", CODES, 1998.
[6] J. Henkel, "A Low Power Hardware/Software Partitioning Approach for Core-Based Embedded Systems", DAC, 1999.
[7] F. Sun, S. Ravi, A. Raghunathan, N. K. Jha, "Custom-Instruction Synthesis for Extensible-Processor Platforms", IEEE TCAD, Vol. 23, No. 2, 2004.
[8] K. Atasu, G. Dundar, C. Ozturan, "An Integer Linear Programming Approach for Identifying Instruction-Set Extensions", CODES+ISSS, 2005.
[9] G. Stitt, F. Vahid, "Hardware/Software Partitioning of Software Binaries", ICCAD, 2002.
[10] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, T. Tuan, "Reducing leakage energy in FPGAs using region-constrained placement", FPGA, 2004.
[11] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, T. Tuan, "A Dual-Vdd Low Power FPGA Architecture", FPGA, 2004.
[12] P. Hansen, B. Jaumard, V. Mathon, "Constrained Nonlinear 0-1 programming", ORSA Journal of Computing, Vol. 5, No. 2, 1993.
[13] Y. Hu, Y. Lin, L. He, T. Tuan, "Simultaneous time slack budgeting and retiming for dual-Vdd FPGA power reduction", DAC, 2006.