An Efficient Design Space Exploration Methodology for On-Chip Multiprocessors Subject to Application-Specific Constraints

Gianluca Palermo, Cristina Silvano, Vittorio Zaccaria
Politecnico di Milano, Dipartimento di Elettronica e Informazione
Via Ponzio 34/5, 20133 Milano, Italy
E-mail: {gpalermo, silvano, zaccaria}@elet.polimi.it

Abstract—Multi-Processor System-on-Chip (MPSoC) architectures represent an emerging paradigm for developing customized, application-specific solutions meeting time-to-market, performance and power consumption constraints. Application-specific MPSoCs are usually designed by using a platform-based approach, where a wide range of customizable parameters must be tuned to find the best trade-offs in terms of the selected figures of merit (such as energy, delay and area). This optimization phase is called Design Space Exploration (DSE) and it generally consists of a Multi-Objective Optimization (MOO) problem with multiple constraints. The design space for an application-specific MPSoC architecture consists of several parameters, mainly related to the microarchitecture, the memory hierarchy, and the interconnection network. The total number of possible architecture configurations is too large to be comprehensively evaluated. So far, several heuristic techniques have been proposed to address the DSE problem for MPSoCs, but they are not efficient in handling constraints and identifying the Pareto front. In this paper, an efficient DSE methodology for application-specific MPSoCs is proposed. The methodology combines Design of Experiments (DoE) and Response Surface Modeling (RSM) techniques with a new technique for handling system-level constraints. First, the DoE phase generates an initial plan of experiments used to create a coarse view of the target design space. Then, a set of RSM techniques is used to refine the exploration by exploiting application-specific constraints to identify the maximum number of feasible solutions.
To trade off accuracy and efficiency of the proposed techniques, a set of experimental results with actual workloads is reported in the paper.1
I. INTRODUCTION

Multi-Processor Systems-on-Chip (MPSoCs) and Chip Multi-Processors (CMPs) represent the de facto standard for embedded and general-purpose architectures. For these architectures, the platform-based design approach [1] is widely used to design application-specific architectures meeting time-to-market constraints. In this scenario, configurable simulation models must be set up to accurately tune the on-chip architectures and to meet the target application requirements in terms of performance, battery lifetime and area.

1 This work was supported in part by the EC under grant MULTICUBE FP7-216693.
The Design Space Exploration (DSE) phase is used to tune the configurable system parameters and it generally consists of a multi-objective optimization problem. The DSE problem consists of exploring a large design space made of several parameters at the system and micro-architectural levels. Although several heuristic techniques have been proposed to address this problem so far, they are all characterized by low efficiency in identifying the Pareto front of feasible solutions. Evolutionary or sensitivity-based algorithms are among the most notable state-of-the-art techniques. In this paper, an efficient Design Space Exploration methodology is proposed to tune MPSoCs and CMPs subject to application-specific constraints (such as total system area, energy consumption or throughput). The methodology is based on traditional Design of Experiments (DoE) and Response Surface Modeling (RSM) techniques. First, the DoE phase sets up an initial plan of experiments used to create a coarse-grained view of the target design space; second, a set of RSM techniques is used to refine the exploration and identify a set of feasible configurations. This process is iteratively repeated to derive a set of Pareto points. Then, an innovative technique is proposed in the paper to deal with application-specific constraints expressed at the system level. Although the proposed methodology is highly flexible in terms of the possible DoE and RSM techniques to be used, two guiding strategies (low-end and high-end) have been identified in the paper from the validation results carried out with actual workloads. On one side, the low-end strategy combines a randomly generated DoE with a linear regression model to obtain a less accurate but very efficient exploration. On the other side, the high-end strategy applies more sophisticated DoE and RSM techniques to obtain a good accuracy/efficiency trade-off.
The proposed techniques have then been applied to optimize an MPEG decoding application with energy and performance constraints while maximizing overall battery lifetime and minimizing area cost. To the best of our knowledge, while there have already been some applications of DoE and RSM techniques to the field of performance analysis and optimization, the work proposed in this paper represents the first in-depth, comprehensive
© 2008 IEEE 1-4244-2334-7/08/$20.00
application of DoE and RSM techniques to the field of multi-objective design space exploration for MPSoCs characterized by application-specific constraints. The paper is organized as follows. Section II discusses the state of the art related to design space exploration, while Section III introduces the design space exploration methodology proposed in this paper and validates it against a set of reference benchmarks. Section IV shows the application of the proposed methodology to the design of a customized MPEG decoder MPSoC.

II. STATE OF THE ART

Several methods have been recently proposed in the literature to reduce the design space exploration complexity by using traditional statistical techniques and advanced exploration algorithms. The proposed techniques can be partitioned mainly into two categories: heuristics for architectural exploration and methods for system performance estimation and optimization. Among the most recent heuristics for power/performance architectural exploration we can find [2]–[4]. In [2], the authors compare Pareto Simulated Annealing, Pareto Reactive Taboo Search and Random Search exploration to identify energy-performance trade-offs for a parametric superscalar architecture running a set of multimedia kernels. In [3], a combined Genetic-Fuzzy system approach is proposed. The technique is applied to a highly parametrized SoC platform based on a VLIW processor in order to optimize both power dissipation and execution time. The technique is based on a Strength Pareto Evolutionary Algorithm coupled with fuzzy system rules in order to speed up the evaluation of the system configurations. In [4], domain knowledge about the platform architecture has been used in the kernel of a design space exploration framework. The exploration problem is converted into a Markov Decision Process (MDP) problem whose solution corresponds to the sequence of optimal transformations to be applied to the platform.
The requisite of domain knowledge is the main difference with respect to the proposals in [2], [3]. State-of-the-art system performance optimization is presented in [5]–[8]. A common trend among those methods is the combined use of response surface modeling and design of experiments methodologies. In [5], a Radial Basis Function has been used to estimate the performance of a super-scalar architecture; the approach is critically coupled with an initial training sample set that must be representative of the whole design space in order to obtain a good estimation accuracy. The authors propose to use a variant of the Latin Hypercube method in order to derive an optimal initial set. In [6], [7], linear regression has been used for performance prediction and assessment. The authors analyze the main effects and the interaction effects among the processor architectural parameters. In both cases, random sampling has been used to derive an initial set of points to train the linear model. A different approach is proposed in [8], where the authors tackle performance prediction by using
an Artificial Neural Network paradigm to estimate the system performance of a Chip-Multiprocessor. In [9], [10], a Plackett-Burman design of experiments is applied to the system architecture to identify the key input parameters. The exploration is then reduced by exploiting this information. The approach shown in [9] is directed towards the optimization of an FPGA, soft-core-based design, while in [10] the target problem is more oriented to a single, superscalar processor micro-architecture. The present work represents a step forward from [11] towards the utilization of design of experiments, response surface methodologies and constraint handling for an efficient optimization of application-specific on-chip multiprocessor architectures.

III. AN APPLICATION-SPECIFIC DESIGN SPACE EXPLORATION METHODOLOGY
Platform-based design [1] has become a de facto standard methodology merging IP reuse and platform reconfigurability towards a unified view of automatic system synthesis. In this scenario, a virtual microprocessor-based architecture is customized at design time for a specific application, enabling a low-risk deployment while meeting time-to-market constraints. More specifically, pre-verified configurable IPs are instantiated and sized in order to meet application-specific constraints on the target problem domain. The design space exploration problem involves the optimization of multiple objectives (such as delay, power consumption, die occupation, etc.), so that optimality is not uniquely defined: a high-performance system can be the worst in terms of energy consumption and vice versa. In this paper, we propose an application-specific design space exploration strategy leveraging Design of Experiments (DoE) and Response Surface Modeling (RSM) techniques combined with an innovative constraint-handling technique. The term Design of Experiments (DoE) [12] is used to identify the planning of an information-gathering experimentation campaign where a set of variable parameters can be tuned. Very often the designer is interested in the effects of some parameter tuning on the system response. Design of experiments is a discipline that has very broad application across natural and social sciences and encompasses a set of techniques whose main goal is the screening and analysis of the system behavior with a small number of experiments. Each DoE plan differs in terms of the layout of the selected design points in the design space. On the other hand, response surface modeling techniques explore the functional dependence between several design parameters and one or more response variables. The working principle of RSM is to use a set of experiments generated by a DoE in order to obtain a response model.
A typical RSM flow involves a training phase, in which known data (or training set) is used to identify the RSM configuration, and a prediction phase in which the RSM is used to forecast unknown system response.
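As an illustration of how different DoE plans lay out points in a design space, the following sketch (our own illustrative Python, not part of the original tool flow; the parameter ranges are hypothetical) generates three of the plan types discussed later in Section III-C:

```python
# Illustrative sketch: generating simple DoE plans for a design space whose
# parameters are given as (min, max) ranges. Not the authors' implementation.
import itertools
import random

def full_factorial_2level(ranges):
    """2-level full factorial: all combinations of min/max per parameter."""
    return list(itertools.product(*[(lo, hi) for lo, hi in ranges]))

def random_doe(ranges, n, seed=0):
    """Random DoE with a uniform PDF over each parameter's range."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(lo, hi) for lo, hi in ranges) for _ in range(n)]

def central_composite(ranges):
    """Face-centered CCD: 2-level factorial + center point + axial points."""
    center = tuple((lo + hi) / 2 for lo, hi in ranges)
    plan = full_factorial_2level(ranges) + [center]
    for i, (lo, hi) in enumerate(ranges):
        for level in (lo, hi):  # axial runs vary one parameter at a time
            point = list(center)
            point[i] = level
            plan.append(tuple(point))
    return plan

ranges = [(2, 16), (1, 8)]   # e.g. number of processors, issue width
print(len(full_factorial_2level(ranges)))   # 2^2 corner points
print(len(central_composite(ranges)))
```

For p parameters, the 2-level full factorial plan grows as 2^p runs, which is why the screening plans above are only used to seed the exploration rather than to cover the space exhaustively.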
2008 Symposium on Application Specific Processors (SASP 2008)
In this paper, regression and interpolation methods are used to build our response surface models. Regression is a method that models the relationship between a dependent response function and some independent variables in order to describe, with the lowest possible error, both the known and the unknown data. Interpolation is the process of assigning values to unknown points by using values from a small set of known points. Unlike regression, interpolation does not produce any error on the known data.
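The distinction can be made concrete with a small sketch (synthetic data, our own illustrative code): a least-squares regression line generally leaves a residual on the training points, while an inverse-distance-weighting interpolant, of the kind used later in the paper, reproduces them exactly.

```python
# Illustrative sketch: regression approximates known data with some residual
# error, while interpolation (here, inverse distance weighting) reproduces
# the known points exactly. The data points are hypothetical.
import numpy as np

x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = np.array([1.0, 2.7, 7.4, 20.1])   # hypothetical responses

# Regression: least-squares line, nonzero error on (non-collinear) known data.
slope, intercept = np.polyfit(x_known, y_known, 1)
regression_err = np.abs(slope * x_known + intercept - y_known).max()

# Interpolation: exact at known points by construction.
def idw(x, power=2.0):
    d = np.abs(x - x_known)
    if np.any(d == 0):                 # exact hit on a known point
        return y_known[d == 0][0]
    w = 1.0 / d ** power
    return np.sum(w * y_known) / np.sum(w)

interp_err = max(abs(idw(x) - y) for x, y in zip(x_known, y_known))
print(regression_err > 0, interp_err == 0.0)   # True True
```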
A. The proposed constraint-handling technique

In this paper we propose an innovative technique for managing application-specific constraints during the exploration process. The technique is deeply related to the policy used by the exploration process to construct the Pareto front. In order to introduce the technique, let us define a design space S as a set of tuples s_i. Each tuple s_i = ⟨a_i, μ_i⟩ is composed of an instance of the architecture parameters a_i and the associated system metrics μ_i. Metrics μ_i can be either derived from actual measurements m_i or estimated measurements m̂_i. We indicate as actual measurement a measurement performed by experimenting with the actual system model, while with the term estimated measurement we indicate a measurement based on a response surface model. The symbol S indicates a design space composed of tuples corresponding to actual measurements, while Ŝ represents a design space of tuples corresponding to estimated measurements. We use the symbol P to represent a reduced design space corresponding to the Pareto front (P = Pareto(S)). We define the Pareto front P of the design space S as follows:

P = {s_i ∈ S : ¬∃ s_j ∈ P : μ_j ⊵ μ_i}    (1)

where '⊵' is a constrained dominance operator, introduced to handle application-specific constraints as shown in the following paragraph.

In our methodology, Pareto sets are built from feasible and promising, non-feasible solutions. Non-feasible configurations are solutions which violate at least one constraint. A promising, non-feasible solution is a solution which violates as few constraints as possible with minimum penalties; in theory, promising non-feasible solutions can lead to feasible solutions in the successive steps of the exploration process. In order to identify those solutions, we introduce the constrained dominance operator. We say that μ_j constraint-dominates μ_i when:

             ⎧ μ_j ≺ μ_i   when feasible(μ_j)
μ_j ⊵ μ_i = ⎨ false        when ¬feasible(μ_j) ∧ feasible(μ_i)    (2)
             ⎩ μ_j ⊡ μ_i   when ¬feasible(μ_j) ∧ ¬feasible(μ_i)

where '≺' is the dominance operator [13] while '⊡' is an operator used to compare non-feasible solutions. The '⊡' operator combines the rank, the penalty and the pure dominance of the system responses of two non-feasible solutions into a single extended-dominance value. The rank of a non-feasible solution is the number of constraints that the solution violates, while the penalty is an aggregate evaluation of the slack of the non-feasible solution with respect to the given constraints. Essentially, the '⊡' operator corresponds to the dominance operator applied to an extended set of metrics including rank, penalty and system responses of the non-feasible solutions. The final Pareto set of Equation 1 is thus composed of the traditional Pareto front plus a set of promising non-feasible configurations.

B. The proposed design flow

The proposed DSE methodology consists of the following steps:
1) Apply a DoE plan to pick the set of initial configurations S_0 corresponding to the plan of experiments to be run. This step provides an initial coarse view at iteration 0 of the target design space.
2) Perform the experiments to obtain the actual measurements associated with S_0.
3) Compute the Pareto front associated with the design space S_0: P_0 = Pareto(S_0).
4) Apply a response surface technique to the Pareto front P_0. The response surface model generates the design space Ŝ_1 composed of a set of estimated measurements. Possibly, Ŝ_1 could be as large as the entire design space; however, a sampling technique could be used if it is not practically feasible to manage such a large design space.
5) Compute the Pareto front P̂_1 = Pareto(Ŝ_1).
6) Run the experiments to derive the actual measurements on the architectural configurations contained in P̂_1. The result is the design space P_1.
7) If P_1 covers P_0 (C(P_1, P_0)) by a percentage greater than 0 and the stopping criterion is not met, restart from step 4, where now P_0 ← P_1. The stopping criterion is the maximum number of actual measurements to be done.

The described strategy is not coupled with a specific DoE: any DoE plan can be used, as will be shown in Section III-C. Concerning the response surface model, we target both interpolation and regression. The proposed method assumes that the chosen RSM technique is able to infer the trends of the Pareto configurations in the entire design space. We try to avoid the use of the entire configuration space S_0 to infer Ŝ_1, in order to skip overfitting problems which would deteriorate the estimate Ŝ_1. The Shepard interpolation RSM, described later, fits well in this category, since it can be used even when few points are known in the original design space.

We will show in Section III-D1 that a complex linear regression surface is needed to obtain a suitable coefficient of determination (R² ≥ 0.9) on the training set. This kind of regression surface needs many more training configurations than those available in the typical Pareto fronts associated with our problem. Thus we introduce an extension to the kernel of the proposed method to include regression methods. In step 4 of the above methodology we use a sub-sampling of the parameter space S_0 (instead of P_0) to train the initial configuration space Ŝ_1. As before, at steps 5 and 6, each configuration of the Pareto front P̂_1 is then actually measured
and the resulting P_1 is compared with the original P_0. To compute the next S_0, we merge the newly created Pareto front P_1 into the original S_0 configuration space. The remaining part of the algorithm is the same as shown above. In the following paragraphs, we show the DoE techniques as well as the RSM methods used in the proposed design space exploration methodology. In the experimental results, we show how the performance characteristics of each strategy can vary with respect to the selected DoE and RSM.

C. Design of experiments

Although several designs of experiments have been proposed in the literature so far, we present here the most traditional DoEs, which we leverage in the construction of our efficient design space exploration methodology:
• Random DoE. The configurations are picked randomly by following a Probability Density Function (PDF). In our methodology, we use a uniformly distributed PDF.
• Full factorial DoE. An experimentation plan whose design consists of two or more parameters, each with discrete possible values or "levels". The set of experimental runs considers all possible combinations of these levels across all the parameters. In this paper we consider a 2-level full factorial DoE, where the only two levels considered are the minimum and the maximum of each parameter.
• Central composite design DoE. The design consists of three distinct sets of experimental runs:
  – A 2-level full or fractional factorial design;
  – A set of center points, i.e., experimental runs whose values of each parameter are the medians of the values used in the factorial portion;
  – A set of axial points, i.e., experimental runs identical to the center points except for one parameter. In the general central composite design case, this parameter takes on values both below and above the median of the two levels.
• Box-Behnken DoE.
The Box-Behnken design is suitable for second-degree models, where parameter combinations are at the center of the edges of the process space, plus a design with all the parameters at the center. The primary advantage is that the parameter combinations avoid taking boundary values simultaneously (in contrast with the face-centered central composite design). This may be suitable to avoid singular points in the creation of the response surface.

D. Response surface methods

We present here the RSMs used for the construction of our efficient design space exploration methodology.

1) Linear regression: Linear regression is a regression method that models a linear relationship between a dependent response function f and some independent variables x_i, i = 1 ... p, plus a random term ε. In this work we apply regression by taking into account also the interactions between the parameters as well as the quadratic behavior with respect to
78
a single parameter. We thus consider the following general model:

f(x) = α_0 + Σ_{k=1}^{p} α_k x_k² + Σ_{i=1}^{p} Σ_{k=1, k≠i}^{p} β_{i,k} x_k x_i + Σ_{k=1}^{p} γ_k x_k + ε    (3)
Least squares analysis can be used to determine a suitable approximation for the parameters. The least squares technique determines the values of unknown quantities in a statistical model by minimizing the sum of the squared residuals (the differences between the approximated and observed values). A measure of the quality of fit of the resulting model is the coefficient of determination R², which is the fraction of the variance explained by the model over the total variance of the observations. As a rule of thumb, the higher R², the better the model fits the data. When R² is equal to 1.0, the regression line perfectly fits the data. To use linear regression as RSM, a correct configuration of the model has to be identified. First of all, a numerical value for each symbolic value of the parameters is chosen. We used 'plain' as well as 'exponential' coding of the design parameters in our tests. Moreover, both simple and second-degree forms of the linear regression model should be analyzed, including a test of the following configurations of the model:
• Heavy. All the parameters of the design space are included in the model.
• Medium. All the parameters except the most irrelevant ones are used.
• Light. Only the most relevant parameters are used.
To select which parameters are relevant and which are not, a main effect analysis [12] on a random subset of the design space configurations can be performed. We ran the previous analysis on a subset of configurations of the entire design space and we found that including interaction effects is very important for our RSM design. The quadratic, plain-encoding, heavy-parameter model is the one which presented the best coefficient of determination (0.95). We will use this model configuration for the experimental results. We then measured the effect of output transformations on the accuracy of linear models.
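As a hedged illustration of this fitting step (the data below are synthetic, not the paper's simulation results), the second-degree model of Equation 3 can be fit by least squares and scored with the coefficient of determination:

```python
# Minimal sketch (not the authors' code): fit the second-degree model with
# linear, quadratic and pairwise interaction terms by least squares, then
# compute R^2 on the training set. Data are synthetic.
import numpy as np

def design_matrix(X):
    """Columns: intercept, linear, quadratic, and pairwise interaction terms."""
    n, p = X.shape
    cols = [np.ones(n)]
    cols += [X[:, k] for k in range(p)]            # gamma_k * x_k
    cols += [X[:, k] ** 2 for k in range(p)]       # alpha_k * x_k^2
    for i in range(p):                             # beta_{i,k} * x_k * x_i
        for k in range(i + 1, p):
            cols.append(X[:, i] * X[:, k])
    return np.column_stack(cols)

def fit_and_r2(X, y):
    A = design_matrix(X)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coef, 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.uniform(1, 8, size=(50, 3))               # 3 toy design parameters
y = 2 + X[:, 0] ** 2 + 0.5 * X[:, 1] * X[:, 2] + rng.normal(0, 0.1, 50)
_, r2 = fit_and_r2(X, y)
print(round(r2, 3))
```

Since the synthetic response contains exactly the quadratic and interaction terms of the model, R² here is close to 1; on real simulation data the fit quality depends on the encoding and parameter-selection choices discussed above.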
The preprocessing function transforms the response values before they are fed to the linear model training, in order to minimize the error. We considered a family of transformations as potential candidates: {y, y^0.5, log(y), y^-1}. For each transformation, we selected a set of random configurations as input to the linear regression model and computed the maximum normalized error on the training set. Figure 1 shows that the best-behaving output transformation is the log(y) function. We will stick with this preprocessing function for the remaining part of the simulation results.

2) Shepard-based interpolation: Shepard's technique is a well-known method for multivariate interpolation. This technique is also called the inverse distance weighting (IDW) method because the value of the response function at unknown points is the sum of the values of the response function at known points, weighted with the inverse of the distance. In
Fig. 1. Maximum Normalized Error of the linear regression on the overall design space by varying the number of samples of the training set.
particular, the value of the response function f for an unknown design x is computed by using N known points x_k as follows:

f(x) = [ Σ_{k=1}^{N} w_k(x) f(x_k) ] / [ Σ_{k=1}^{N} w_k(x) ]    (4)

where:

w_k(x) = 1 / μ(x, x_k)^p    (5)
is the weighting function defined by Shepard, p is the power of the model and μ is the distance between the known point x_k and the unknown point x. As in the regression technique, an exponential or log pre-processing transformation for f(x) can also be applied. Figure 2 shows the maximum normalized error computed by varying the number of random configurations selected as training set (known points), for the most significant power coefficients and pre-processing transformations. The model with a power value of 5 and pre-processing function y^-0.5 provides the best performance in terms of mean error. We will use this model configuration in the remaining part of the paper.

E. Validation of the methodology

To validate the proposed methodology, we applied it to the customization of a symmetric shared-memory multiprocessor architecture for the execution of a set of standard benchmarks [14].

1) Experimental setup for the validation: The experimental setup used for the validation includes the identification of the target architecture and the related design space, the target system workloads, and the target objective functions to be optimized.

Target architecture and applications. The target architecture is a shared-memory multiprocessor with private L2 cache. We focused our analysis on the architectural parameters listed in Table I, where the minimum and maximum values have been
Fig. 2. Maximum Normalized Error of the Shepard interpolation on the overall design space by varying the number of samples of the training set and by varying the power coefficient.

TABLE I
DESIGN SPACE FOR THE SHARED-MEMORY MULTIPROCESSOR PLATFORM

Parameter                      Min.    Max.
# Processors                   2       16
Processor issue width          1       8
L1 instruction cache size      2K      16K
L1 data cache size             2K      16K
L2 private cache size          32K     256K
L1 instruction cache assoc.    1w      8w
L1 data cache assoc.           1w      8w
L2 private cache assoc.        1w      8w
I/D/L2 block size              16      32
reported. The resulting design space consists of 2^17 alternative configurations. The target applications we considered are derived from the SPLASH-2 [14] benchmark suite (U = {FFT, OCEAN, LU, RADIX}); for each application we considered 3 different input data-sets. To carry out the system metrics evaluation, we leveraged the Sesc [15] simulation tool, a fast simulator for chip-multiprocessor architectures with out-of-order processors that is able to provide energy and performance results for a given application. Within Sesc, the energy consumption computation for the memory hierarchy is supported by CACTI [16], while the energy consumption computation due to the core logic is based on the WATTCH models [17].

Objective Functions and Constraints. The actual system response consists of the average execution time and energy consumption of each application; we thus formulated our multi-objective problem as a 'robust' problem consisting of finding a system configuration which minimizes the geometric average of the system response over each single application scenario ξ_k ∈ U. In particular, we formalize our multi-objective problem as the minimization of the average execution time and mW/MIPS system response over the set of twelve different scenarios ξ_k ∈ U:

min_{x ∈ X} [ avg_time_μs(x) ; avg_mW_per_MIPS(x) ]    (6)
TABLE II
STRATEGY IMPLEMENTATIONS EVALUATED

Strategy implementation name    DoE                 RSM
full-factorial-plain            Full-factorial      —
full-factorial-shepard          Full-factorial      Shepard's method
full-factorial-linear           Full-factorial      Linear regression
random-plain                    Random              —
random-shepard                  Random              Shepard's method
random-linear                   Random              Linear regression
central-composite-plain         Face-centered CCD   —
central-composite-shepard       Face-centered CCD   Shepard's method
central-composite-linear        Face-centered CCD   Linear regression
box-behnken-plain               Box-Behnken         —
box-behnken-shepard             Box-Behnken         Shepard's method
box-behnken-linear              Box-Behnken         Linear regression
where:

avg_time_μs(x) = [ Π_{ξ_k ∈ U} time_μs(x, ξ_k) ]^{1/|U|}    (7)

avg_mW_per_MIPS(x) = [ Π_{ξ_k ∈ U} mW_per_MIPS(x, ξ_k) ]^{1/|U|}    (8)

subject to the following constraint:

total_cache_size(x) ≤ 1 MB    (9)
The total cache size of architecture x is defined as the sum of the L1 I/D cache sizes and the private L2 cache size for each processor, times the number of processors.

2) Evaluation of the proposed strategies: We validated both the interpolation- and regression-based strategies (presented in Section III-D) by combining them orthogonally with the design of experiments techniques presented in Section III-C. The complete set of implemented strategies is shown in Table II. In order to give a fair comparison with other methods proposed in the literature, we compare the proposed strategies with a Multi-Objective Simulated Annealing (MOSA) [18] algorithm. When considering MOSA runs, we use the label mosa_E_N_A to indicate E number of epochs, N actual measurements performed in each epoch, and A as temperature decrease coefficient. Each proposed strategy has been run by considering a maximum number of actual measurements equal to 600; the resulting Pareto front has been validated against the reference, actual Pareto front of the target architecture.2 For evaluating the quality of the approximated Pareto fronts we use the Average Distance from Reference Set (ADRS) and the Coverage function [19]–[21] with respect to the reference Pareto front. The lower the ADRS and the Coverage, the better the approximated Pareto front. Figure 3 gives a wide perspective on the performance of each strategy. Figure 3(a) presents the ADRS for each strategy plotted against the number of actual measurements associated with it, while Figure 3(b) plots the actual Pareto points miss-rate

2 The reference Pareto front has been computed with a full-search algorithm, thus it is the exact Pareto front.
Fig. 3. Average Distance from Reference Set (a) and actual Pareto points miss-rate (b) for each implemented strategy and associated number of actual measurements (shown in log10 scale).
and the number of actual measurements associated with each strategy. The actual Pareto points miss-rate is the percentage of points of the actual Pareto front not identified by the specific strategy. Considering the ADRS, the actual Pareto miss-rate and the number of actual measurements as objectives, the dominant solutions we can identify are the following: 'box-behnken-linear', 'random-linear', 'full-factorial-shepard', 'mosa_25_10_05' and 'mosa_20_100_05'. The strategies proposed in this paper show a very low number of actual measurements, confirming their efficiency. The 'mosa_25_10_05' also shows a low number of actual measurements; however, it is characterized by a high value of Pareto miss-rate. More in detail, Figure 3(a) suggests that strategies based on linear regression are characterized by a number of actual measurements lower than 350, while Shepard-based strategies are between 300 and 600. Finally, simulated annealing solutions show a slightly decreasing trend of ADRS at the expense of an increasing number of actual measurements. Figure 3(b) shows that linear-regression-based approaches have a higher miss-rate but present the lowest number of actual measurements. A central role is played by the 'full-factorial-shepard' strategy, with a good trade-off between the two performance measures. Finally, the 'mosa_20_100_05' shows a very good miss-rate combined with a high number of actual measurements.

As a final statement, we can conclude that the regression and interpolation strategies are two faces of a flexible optimization flow which allows the designer to trade off accuracy and efficiency, in terms of number of actual measurements. Among all the proposed strategies, we can suggest two guiding strategies which could be adopted depending on the specific designer's goal:
• Low-end strategy. The 'random-linear' strategy can be adopted when only a coarse-grained approximation of the Pareto front is needed, with very few actual measurements.
• High-end strategy. The 'full-factorial-shepard' strategy can be adopted by the designer when a very good trade-off between ADRS and miss-rate is needed, with a higher number of actual measurements (2X with respect to 'random-linear').

IV. THE MPEG DECODER CASE STUDY

In this section we apply the proposed methodology to the customization of an MPSoC architecture for the execution of an MPEG2 decoder application. As in the validation phase, we use the Sesc [15] simulator as the target MPSoC architecture model, with the same design space described in Table I, which is composed of 2^17 architectural configurations. We also introduce an area model (in 70nm technology) derived from [22] for the IBM Power5 architecture. For this particular application-specific customization, we formalize the multi-objective optimization problem as follows:

min_{x ∈ X} [ total_system_area(x) ; energy_per_frame(x) ; 1/frame_rate(x) ]    (10)

subject to the following constraints:

total_system_area(x) ≤ 85 mm²
(11)
frame rate(x) ≥ 25 f ps
(12)
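The constraint handling and Pareto filtering implied by Equations (10)–(12) can be sketched as follows. This is a minimal Python sketch, not the authors' implementation; the dictionary keys and the configuration records are illustrative assumptions:

```python
# Sketch: filter feasible configurations (Eq. 11-12) and extract the
# Pareto front for the three minimization objectives of Eq. (10).
# Configuration records and metric names are hypothetical.

def is_feasible(cfg):
    """Constraints (11) and (12): area budget and minimum frame rate."""
    return cfg["area_mm2"] <= 85.0 and cfg["frame_rate_fps"] >= 25.0

def objectives(cfg):
    """Objective vector of Eq. (10); all three terms are minimized."""
    return (cfg["area_mm2"],
            cfg["energy_per_frame_j"],
            1.0 / cfg["frame_rate_fps"])

def dominates(a, b):
    """a dominates b: no worse in every objective, better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(configs):
    feasible = [c for c in configs if is_feasible(c)]
    return [c for c in feasible
            if not any(dominates(objectives(o), objectives(c))
                       for o in feasible if o is not c)]

# Illustrative data (not from the paper):
configs = [
    {"area_mm2": 43.4, "energy_per_frame_j": 0.60, "frame_rate_fps": 25.7},
    {"area_mm2": 61.2, "energy_per_frame_j": 0.57, "frame_rate_fps": 31.9},
    {"area_mm2": 90.0, "energy_per_frame_j": 0.50, "frame_rate_fps": 50.0},  # violates (11)
    {"area_mm2": 40.0, "energy_per_frame_j": 0.70, "frame_rate_fps": 20.0},  # violates (12)
]
front = pareto_front(configs)
```

Note that the two constraint-violating configurations are discarded before dominance checking, so an infeasible point can never appear on the reported front.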
The minimization problem has 3 objective functions, which are the total system area, the energy consumption per frame, and the inverse of the frame rate. The area constraint in Equation 11 has been defined in order to give an overall upper bound to the cost of manufacturing and packaging. The minimum frame rate in Equation 12 is introduced as a Quality of Service (QoS) constraint, considering a standard 50 half-frames per second. In our case study, we used the ALPBench MPEG2 decoder [23]. The objective functions have been derived as a geometric average of the system metrics over a set of 5 input data-sets, each composed of 10 frames at a resolution of 640x480. Since each application run on a single data-set takes more than 45 minutes, we decided to adopt the low-end strategy for a fast, approximate determination of the target Pareto front. After the first DoE run, we found that only 40% of the visited configurations are feasible solutions. The RSM has been
particularly useful in the subsequent steps, since it enabled us to skip a large number of solutions estimated to be unfeasible. The final Pareto front found by applying the proposed methodology is composed of 71 configurations.

Fig. 4. Final Pareto set clustered in three sets (scatter plot of total_system_area [mm2] vs. frame_rate [fps] vs. energy_per_frame [J/f] for clusters mpeg_dec_cluster_0, mpeg_dec_cluster_1 and mpeg_dec_cluster_2).
To help the system architect select among the large number of feasible solutions of the Pareto front, we can envision an approach that first clusters the architectural configurations and then selects a champion solution for each cluster by using a decision-making mechanism. In this case, we applied a k-means clustering algorithm to the frame-rate metric to create 3 groups of architectural configurations, representing low, medium and high frame-rate solutions. Figure 4 shows the scatter plot of the clustered configurations, with cluster centroids centered around frame-rate values of 26.17, 32.89 and 47.28 [fps]. For each cluster, we identified the best configuration by using as decision-making mechanism the minimization of the product:

  total system area(x) × energy per frame(x)   (13)
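The clustering and champion-selection step can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' code: the 1-D k-means implementation, the dictionary keys and the configuration records are all illustrative:

```python
# Sketch: 1-D k-means on the frame-rate metric, then per-cluster champion
# selection by minimizing total_system_area(x) * energy_per_frame(x), i.e.
# Eq. (13). Configuration records are hypothetical.

def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's algorithm on scalars; returns a cluster label per value."""
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        for j in range(k):
            members = [v for v, l in zip(values, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels

def champions(configs, k=3):
    """Best configuration per frame-rate cluster under the Eq. (13) product."""
    labels = kmeans_1d([c["frame_rate_fps"] for c in configs], k)
    best = {}
    for c, label in zip(configs, labels):
        score = c["area_mm2"] * c["energy_per_frame_j"]  # Eq. (13)
        if label not in best or score < best[label][0]:
            best[label] = (score, c)
    return [c for _, c in best.values()]

# Illustrative data (not from the paper):
configs = [
    {"frame_rate_fps": 25.7, "area_mm2": 43.4, "energy_per_frame_j": 0.60},
    {"frame_rate_fps": 26.5, "area_mm2": 45.0, "energy_per_frame_j": 0.65},
    {"frame_rate_fps": 32.0, "area_mm2": 61.2, "energy_per_frame_j": 0.57},
    {"frame_rate_fps": 33.0, "area_mm2": 63.0, "energy_per_frame_j": 0.60},
    {"frame_rate_fps": 46.0, "area_mm2": 76.7, "energy_per_frame_j": 0.62},
    {"frame_rate_fps": 47.5, "area_mm2": 80.0, "energy_per_frame_j": 0.70},
]
picked = champions(configs)
```

Clustering on a single QoS metric keeps the decision-making step interpretable: each cluster maps to a service level, and Eq. (13) then picks the cheapest architecture delivering it.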
The three selected system configurations are shown in Table III. First of all, we note that the configurations found are very similar to the clustering centroids in terms of frame-rate, representing low-, medium- and high-performance architectures. We also note that the frame-rate is a monotonic function of the area, while the energy per frame is relatively constant. Considering the architectural configurations, we note that the exploration process identified solutions with a significant number of processors with a very low issue width (1 or 2). Considering the figure of merit in Eq. 13, those solutions are more profitable than solutions with a small number of processors and a higher issue width. Moreover, we note that the level-1 instruction cache size and the level-2 cache size seem to be penalized by the increase in the number of processors, given a roughly constant area budget dedicated to the memory hierarchy (≈300 KB). The proposed approach is orthogonal to the usage of application-specific instruction set extensions dedicated to the
TABLE III
FINAL CUSTOMIZED ARCHITECTURES FOR THE MPEG DECODER.

Parameter/metric            | Cluster 0 | Cluster 1 | Cluster 2
# Processors                | 4         | 8         | 8
Processor issue width       | 2         | 1         | 2
L1 instruction cache size   | 16K       | 4K        | 2K
L1 data cache size          | 8K        | 8K        | 8K
L2 private cache size       | 128K      | 64K       | 64K
L1 instruction cache assoc. | 1w        | 1w        | 1w
L1 data cache assoc.        | 8w        | 8w        | 8w
L2 private cache assoc.     | 4w        | 4w        | 4w
I/D/L2 block size           | 16B       | 16B       | 16B
total system area [mm2]     | 43.43     | 61.18     | 76.65
energy per frame [J/f]      | 0.603     | 0.572     | 0.622
frame rate [fps]            | 25.71     | 31.85     | 45.66
MPEG2 decoding, and it leverages the benefits of pre-verified off-the-shelf components. As noted in [23], the use of dedicated instructions could provide a 2X speedup factor over the nominal frame-rate values reported in Table III. Overall, the proposed methodology has been able to flexibly find a set of candidate implementations with a very low number of actual measurements of the system (≤ 300). Each implementation trades off area for increased QoS at a relatively constant price in terms of power consumption, while meeting the decoding constraints and manufacturing costs.

V. CONCLUSIONS

In this paper, we proposed a design space exploration methodology that leverages the traditional DoE paradigm and RSM techniques, combined with a powerful way of considering customized application constraints. The design of experiments phase generates an initial plan of experiments which is used to create a coarse view of the target design space; then, a set of response surface extraction techniques is used to identify non-feasible configurations and refine the optimal configurations. This process is repeated iteratively until a target criterion, e.g., a maximum number of simulations, is satisfied. The proposed methodology has been used to customize an MPSoC architecture for the execution of an MPEG application with QoS and manufacturing cost constraints. Overall, the proposed methodology has been able to flexibly find a set of candidate implementations with a very low number of actual measurements, by trading off area for increased QoS at a constant price in terms of power consumption.

REFERENCES

[1] K. Keutzer, S. Malik, A. R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli. System level design: Orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12):1523–1543, December 2000.
[2] Gianluca Palermo, Cristina Silvano, and Vittorio Zaccaria. Multi-objective design space exploration of embedded systems.
Journal of Embedded Computing, 1(3):305–316, 2006.
[3] Giuseppe Ascia, Vincenzo Catania, Alessandro G. Di Nuovo, Maurizio Palesi, and Davide Patti. Efficient design space exploration for application specific systems-on-a-chip. Journal of Systems Architecture, 53(10):733–750, 2007.
[4] Giovanni Beltrame, Dario Bruschi, Donatella Sciuto, and Cristina Silvano. Decision-theoretic exploration of multiprocessor platforms. In Proceedings of CODES+ISSS: International Conference on Hardware/Software Codesign and System Synthesis, pages 205–210, 2006.
[5] P. J. Joseph, Kapil Vaswani, and Matthew J. Thazhuthaveetil. A predictive performance model for superscalar processors. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 161–170, Washington, DC, USA, 2006. IEEE Computer Society.
[6] P. J. Joseph, Kapil Vaswani, and Matthew J. Thazhuthaveetil. Construction and use of linear regression models for processor performance analysis. In HPCA ’06: Proceedings of the 12th International Symposium on High Performance Computer Architecture, pages 99–108, Austin, Texas, USA, 2006. IEEE Computer Society.
[7] Benjamin C. Lee and David M. Brooks. Accurate and efficient regression modeling for microarchitectural performance and power prediction. SIGOPS Operating Systems Review, 40(5):185–194, 2006.
[8] Engin İpek, Sally A. McKee, Rich Caruana, Bronis R. de Supinski, and Martin Schulz. Efficiently exploring architectural design spaces via predictive modeling. SIGOPS Operating Systems Review, 40(5):195–206, 2006.
[9] David Sheldon, Frank Vahid, and Stefano Lonardi. Soft-core processor customization using the design of experiments paradigm. In DATE ’07: Proceedings of the Conference on Design, Automation and Test in Europe, pages 821–826, 2007.
[10] Joshua J. Yi, David J. Lilja, and Douglas M. Hawkins. A statistically rigorous approach for improving simulation methodology. In HPCA ’03: Proceedings of the 9th International Symposium on High-Performance Computer Architecture, page 281, Washington, DC, USA, 2003. IEEE Computer Society.
[11] Gianluca Palermo, Cristina Silvano, and Vittorio Zaccaria. An efficient design space exploration methodology for multiprocessor SoC architectures based on response surface methods. In IC-SAMOS: International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, 2008.
[12] T. J. Santner, B. Williams, and W. Notz. The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.
[13] C. L. Hwang and A. S. M. Masud.
Multiple Objective Decision Making – Methods and Applications: A State-of-the-Art Survey, volume 164 of Lecture Notes in Economics and Mathematical Systems. Springer-Verlag, Berlin–Heidelberg, 1979.
[14] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24–36, 1995.
[15] Jose Renau, Basilio Fraguela, James Tuck, Wei Liu, Milos Prvulovic, Luis Ceze, Smruti Sarangi, Paul Sack, Karin Strauss, and Pablo Montesinos. SESC simulator, January 2005. http://sesc.sourceforge.net.
[16] S. Wilton and N. Jouppi. CACTI: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 31(5):677–688, 1996.
[17] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Proceedings of ISCA 2000: International Symposium on Computer Architecture, pages 83–94, 2000.
[18] P. Czyżak and A. Jaszkiewicz. Pareto simulated annealing – a metaheuristic technique for multiple-objective combinatorial optimisation. Journal of Multi-Criteria Decision Analysis, 7:34–47, 1998.
[19] Tatsuya Okabe, Yaochu Jin, and Bernhard Sendhoff. A critical survey of performance indices for multi-objective optimization. In Proceedings of the IEEE Congress on Evolutionary Computation, pages 878–885, 2003.
[20] J. Knowles and D. Corne. On metrics for comparing non-dominated sets. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC 2002), pages 711–716, 2002.
[21] David A. Van Veldhuizen and Gary B. Lamont. On measuring multiobjective evolutionary algorithm performance. In Proceedings of the 2000 Congress on Evolutionary Computation (CEC 2000), pages 204–211, California, USA, 2000. IEEE Press.
[22] Yingmin Li, Benjamin Lee, David Brooks, Zhigang Hu, and Kevin Skadron. CMP design space exploration subject to physical constraints. In Proceedings of the Twelfth International Symposium on High-Performance Computer Architecture (HPCA), pages 15–26, 2006.
[23] Man-Lap Li, R. Sasanka, S. Adve, Yen-Kuang Chen, and E. Debes. The ALPBench benchmark suite for complex multimedia applications. In Proceedings of the IEEE International Workload Characterization Symposium, pages 34–45, 2005.