Real Time Power Estimation and Thread Scheduling via Performance Counters

Karan Singh    Major Bhadauria
Computer Systems Laboratory
Cornell University
Ithaca, NY, USA
{karan,major}@csl.cornell.edu

Sally A. McKee
Department of Computer Science and Engineering
Chalmers University of Technology
Göteborg, Sweden
[email protected]

Abstract

Estimating power consumption is critical for OS process scheduling and for software and hardware developers. However, obtaining processor and system power consumption is non-trivial, and doing so in simulation is time consuming and prone to error. We analytically derive functions for real-time estimation of processor and system power consumption using performance counter data on real hardware. We develop our model from data gathered with custom-written micro-benchmarks that capture possible application behavior. The model is independent of our test benchmarks and is therefore well suited for future applications. We target chip multi-processors, analyzing the effects of shared resources and temperature on power estimation. We leverage this information for a power-aware thread scheduler.

1  Introduction

Power and thermal constraints limit processor frequency today. In response, computer architects have shifted to chip multi-processors (CMPs) to retain performance improvements without increasing the power envelope. This is achieved by trading higher frequency and voltage for more cores. When optimizing software for hardware, or evaluating new architectures for existing software, energy efficiency is a critical part of performance analysis. If the operating system is aware of the power consumption of the various processes in the system, it can prioritize processes based on thermal constraints and remaining available power, or schedule processes to remain within a given power envelope. Unfortunately, it is very hard to expose the power consumption of the processor to the software developer, computer architect, or operating system at run-time. Power meters have been used to retrieve total system and CMP power consumption. Although existing hardware can be modified to monitor the current and power draw of the CPU socket, per-core power consumption is difficult to measure with present multi-core designs because the cores share the same power planes. Embedding measurement devices on chip is also not financially feasible, and the accuracy of a power meter can be an issue when current draw fluctuates rapidly with program phases.

We propose using Performance Monitoring Counters (PMCs) to estimate processor power consumption via analytical models. On-chip performance counters are accurate [15] and provide significant insight into processor performance at clock-cycle granularity. Performance counters are already incorporated within most modern architectures and are exposed to user space on commercial hardware. Estimating real-time power consumption enables the operating system to make better real-time scheduling decisions, administrators to accurately estimate the maximum usable threads for data centers, and simulators to accurately estimate power without actually simulating it. Additionally, a power meter is not required for each system: our analytical model can be queried on multiple systems regardless of the programs or inputs used. This is possible because the model is formed from micro-benchmark data that are independent of program behavior. These micro-benchmarks are custom written to gather data from PMCs that contribute to the power function, and we use these data to form the power model equations. We achieve power estimates for single- and multi-threaded benchmark suites, and make the following contributions:

• We achieve accurate per-core estimates of multi-threaded and multi-programmed workloads on a CMP with shared resources (an L3 cache, memory controller, memory channel, and communication busses).
• We observe and quantify the effect power and temperature have on each other.
• We achieve real-time power estimation without the need for off-line profiling of our benchmarks.
• We illustrate an application of our real-time predictions to make thread scheduling decisions at run-time.

FP Units (0.23):      DISPATCHED FPU:ALL; RETIRED MMX AND FP INSTRUCTIONS:ALL
Inst Retired (0.39):  RETIRED BRANCH INSTRUCTIONS:ALL; RETIRED MISPREDICTED BRANCH INSTRUCTIONS:ALL; RETIRED INSTRUCTIONS; RETIRED UOPS
Stalls (0.2):         DECODER EMPTY; DISPATCH STALLS
Memory (0.33):        DRAM ACCESSES PAGE:ALL; DATA CACHE MISSES; L3 CACHE MISSES:ALL; MEMORY CONTROLLER REQUESTS:ALL; L2 CACHE MISS:ALL

Table 1. PMCs Categorized By Architecture and Ordered (increasing) by Correlation (based on SPEC-OMP data)

We achieve median errors of 5.8%, 3.9%, and 7.2% for the NAS, SPEC-OMP, and SPEC 2006 benchmark suites, respectively. We leverage our real-time estimation to schedule workloads at run-time, achieving a reduced power envelope through suspension of appropriate threads.

Figure 1. AMD Phenom Annotated Die [2]

2  Methodology

We categorize the AMD Phenom performance counters into four buckets: FP Units, Memory, Stalls, and Instructions Retired. We gain insight into these buckets by examining the processor die in Figure 1. The shared L3 and private L2 caches take up significant area, as do the floating point units (FPUs) and front-end logic. Tracking the instructions retired gives a good perspective on overall processor performance, and categorizing them into FP or regular (integer) instructions indicates which units the instruction mix is using. Monitoring L2 cache misses allows us to track use of the L3 cache: L2 cache misses often result in L3 cache misses, which lead to off-chip memory accesses. Figure 2 graphs L3 cache misses normalized to total L3 cache accesses for the SPEC 2006 benchmarks with only one reference input. We find the L3 cache has a large miss rate, since it is essentially a non-inclusive victim cache. Another architectural feature responsible for power consumption in high performance processors is the out-of-order logic. While there is no single PMC that tracks out-of-order logic use, the number of cycles stalled due to resource limits (DISPATCH STALLS) indicates total stalls due to branches, full load/store queues, reorder buffers, and reservation stations. This PMC represents how often the processor stalls, which provides insight into stalled issue logic, indicating potentially less power consumption. However, if the source of these stalls is reservation station or reorder buffer stalls, then the processor is trying very hard to extract instruction level parallelism (ILP) from the code, indicating increased power consumption. For example, if all fetched instructions can be subsequently issued and dispatched, the reservation stations and re-order buffers will remain a constant size (as instructions are retired in order). Should fetched instructions stall, the out-of-order logic will attempt to execute other instructions, expending dynamic power as more reservation stations are examined to check whether each instruction's dependencies are satisfied.

2.1  Event Selection

We split the AMD Phenom PMCs into the four buckets chosen earlier: FP Units, Memory, Stalls, and Instructions Retired. We gather data for 13 counters from the entire SPEC-OMP benchmark suite using pfmon, and gather power numbers using a Watts Up Pro power meter. The specific categories and the PMCs that fall within them are shown in Table 1; the PMCs in each category are listed in increasing order of correlation with power. We choose Spearman's rank correlation for assessing the relationship between counter value and power: unlike ordinary (Pearson) correlation, Spearman's rank correlation does not require any assumptions about the frequency distribution of the variables. We perform Spearman's rank correlation on the data and choose the top counter from each bucket. We are restricted to four counters because the Phenom can only record that many simultaneously; since our power predictor is designed to work in real time, we are bound by this limit, although the exact limit varies by architecture. The performance counters we use are:

• e1: L2 CACHE MISS:ALL
• e2: RETIRED UOPS
• e3: RETIRED MMX AND FP INSTRUCTIONS:ALL
• e4: DISPATCH STALLS
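The counter-selection step can be illustrated with a small sketch of Spearman's rank correlation. The counter and power samples below are hypothetical, and this tie-free rank implementation is a simplification of the full statistic:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation for tie-free samples:
    Pearson correlation computed on the ranks of the data."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-second samples: the counter grows nonlinearly with
# power, yet the rank correlation is still a perfect 1.0 -- no assumption
# about the counter's frequency distribution is needed.
counter = [1e6, 5e6, 2e7, 8e7, 3e8]
power = [19.8, 21.2, 23.0, 24.9, 26.5]
rho = spearman_rho(counter, power)  # -> 1.0
```

Within each bucket, the counter with the highest rank correlation against measured power would then be kept.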


Figure 2. L3 Cache Miss Rates for SPEC 2006

We claim that data from these four counters capture enough information to predict core and system power; this claim is backed by the prediction results presented in Section 4. Next, we discuss the micro-benchmarks written to exercise the four counters above, and then form the model based on the collected data.

2.2  Micro-Benchmarks

Since we wish our model to be independent of the applications used, we write micro-benchmarks that stress the four counters selected and explore the space of their cross product. This large space spans values from zero to several billion, with large variation between benchmarks: integer benchmarks have almost no floating point operations, and some computationally intensive benchmarks have almost no L2 or L3 cache misses. We try to cover the extreme boundaries of the space and the common cases, but our micro-benchmarks do not comprehensively cover every region; a more thorough set of benchmarks could improve estimation but would also increase training time. We explicitly target only three dimensions of the space, since one of our performance counter values (DISPATCH STALLS) cannot be explicitly targeted with code. We do, however, include a mixture of code, some of which can be trivially executed out of order (e.g., write-after-write dependences). Our core code consists of a large for loop and a case statement that iterates through different code nests depending on the index. Since our synthetic benchmark consists mainly of a set of assign statements (moves) and arithmetic and floating point operations, we compile with no optimizations to ensure the compiler does not remove redundant lines of code. We use these codes to generate the data for creating our model.

We run four copies of the micro-benchmarks simultaneously, and isolate the data for a single (randomly chosen) CPU, since all cores exhibit the same PMC data. To ensure none of the separate threads is out of sync with the others, we run each program phase separately. We use the SPEC-OMP benchmark suite for selecting appropriate counters only. Our code is generic and does not use any code from the benchmarks we test on (SPEC 2006, SPEC-OMP, and NAS). The micro-benchmarks explore the entire space spanned by the four counters selected, which means that any future application's behavior must fall within that space. This allows us to claim that our approach works independent of any benchmark suite. We verify this claim by testing our model on the NAS and SPEC 2006 benchmark suites, applications not used at all during the model forming process. Since this is an empirical process, the verification lies in the quality of the predictions.
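As an illustrative analogue of the loop-plus-case-statement skeleton (the actual micro-benchmarks are unoptimized C, not Python; the nest mix and working-set size here are arbitrary placeholders):

```python
def microbench(iterations, num_nests=3):
    """Skeleton of a micro-benchmark: a large loop whose body dispatches
    among code nests by index.  Each nest stresses a different counter
    bucket (FP units, retired instructions, or cache/memory)."""
    acc = 0.0
    data = list(range(4096))  # working set; sized to force cache misses in C
    for i in range(iterations):
        nest = i % num_nests  # the case-statement dispatch
        if nest == 0:
            acc += 1.000001 * (i + 1) ** 0.5   # FP-heavy nest
        elif nest == 1:
            acc += (i * 7) & 255               # integer/move-heavy nest
        else:
            acc += data[(i * 64) % len(data)]  # strided memory nest
    return acc

checksum = microbench(30_000)  # returning acc keeps the work observable
```

In the C version, compiling at -O0 (as the paper does) prevents the compiler from eliminating these otherwise redundant operations.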

2.3  Forming the Model

For the model, we use data from our micro-benchmarks only. We normalize the collected performance counter values to the elapsed cycle counts, yielding an event rate, r_i, for each counter; the prediction model uses these rates as input. The performance counters selected exhibit the highest correlation with the target metric within their respective buckets. We model per-core power for each thread using a piece-wise model based on multiple linear regression. We collect hardware performance counter values every second during the run of the micro-benchmarks and generate the model from this data. Our modeling approach produces the following function (Equation 1), mapping observed event rates r_i to core power P_core:

P̂_core = F1(g1(r1), ..., gn(rn))   if condition
         F2(g1(r1), ..., gn(rn))   otherwise                    (1)

where r_i = e_i / (cycle count), and each piece F is a linear combination of transformed event rates:

F = p_0 + p_1 * g1(r1) + ... + p_n * gn(rn)                     (2)

The function consists of linear weights on transformations of the event rates (Equation 2). The transformations may be linear, inverse, logarithmic, exponential, or square root; they improve prediction accuracy by making the data more amenable to linear regression. This is also why we use Spearman's rank correlation, which operates independently of the variable transformation (frequency distribution). The function is piece-wise linear: we choose this form because we find the behavior to be significantly different for low counter values, which allows us to capture more detail about core power without sacrificing the simplicity and ease of linear regression. For example, if we form a model for the data in Figure 3, neither a linear nor an exponential transformation fits the data well. However, if we break the data into two parts, a piece-wise combination of the two fits much better (Figure 4).

Figure 3. An illustrative example of best-fit continuous approximation functions (linear and exponential)

Figure 4. An illustrative example of a better fitting piece-wise function

We use the least squares method to determine the weights for each of the function parameters. Each part of the piece-wise function can be written as a linear combination of the transformed event rates (Equation 2), and is easily solved using a least squares estimator like the one used in [8]. We obtain the piece-wise linear model shown in Figure 5. During our analysis, we find the function behavior is significantly different for very low values of the L2 cache miss counter compared to the rest of the space, so we break the function on this counter. Since the L3 is non-inclusive, most L2 cache misses are off-chip accesses and contribute to total power. We also observe that power increases with increasing retired uops, since the processor is doing more work. All counters have positive correlation with power except the retired FP/MMX instructions counter. This is expected because FP and MMX instructions have higher latency than normal instructions; this class of instructions therefore reduces the throughput of the system, resulting in lower power use. Lastly, the dispatch stalls counter shows a positive correlation with power. This is possibly due to reservation station or reorder buffer dispatch stalls: in this situation the processor attempts to extract a higher degree of instruction level parallelism (ILP) from the code, and dynamic power consumption increases due to this logic overhead.

2.4  Temperature Effects

We are interested in the effect of temperature on system power. Ideally, power consumption does not increase over time. However, since static power is a function of voltage, process technology, and temperature, increasing temperature leads to increasing leakage power, which adds to total power. We concurrently monitor the temperature and power of the CMP to observe their relationship. Figure 6 graphs temperature (in Celsius) and power consumption (in Watts) over time, with results normalized to their steady-state values. Benchmarks bt, lu, and namd are run across all four cores of the CMP, with results capped at 120 seconds. For namd, four instances are run concurrently, since it is single-threaded. Performance counters and program source code are examined to ensure the work performed is constant over time. We observe that the programs exhibit varying increases in power and temperature over time. Clearly, temperature and power affect each other, and not accounting for temperature could lead to increased error in power estimates. However, not all systems support temperature sensors on die or per core. We do not incorporate temperature readings into our model, since our hardware lacks support for per-core temperature sensors; we hope to do so in future power models.
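The least-squares fit of the weights in Equation 2 can be sketched as follows, with synthetic noiseless data standing in for the transformed micro-benchmark rates g_i(r_i) and hypothetical "true" weights used only to generate the samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 synthetic samples of four transformed event rates g_i(r_i)
G = rng.uniform(0.0, 1.0, size=(200, 4))

# hypothetical weights p0..p4, used only to generate the observations
true_w = np.array([12.0, 65.0, 11.5, -3.4, 17.8])
P = true_w[0] + G @ true_w[1:]          # noiseless power observations

X = np.hstack([np.ones((200, 1)), G])   # prepend an intercept column for p0
w, *_ = np.linalg.lstsq(X, P, rcond=None)
# on this noiseless data, w recovers true_w to numerical precision
```

In practice one such solve would be performed per piece of the piece-wise function, on the micro-benchmark samples falling within that piece.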

P̂_core = 1.15 + 65.88*r1 + 11.49*r2 - 3.356*r3 + 17.75*r4,            if r1 < 10^-5
         23.34 + 0.93*log(r1) + 10.39*r2 - 6.425*r3 + 6.43*log(r4),   if r1 >= 10^-5     (3)

where r_i = e_i / 2,200,000,000

Figure 5. Piece-wise linear function for core power derived from micro-benchmark data (1 second = 2.2 billion cycles)
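Equation 3 can be evaluated directly at run-time from one second of counter samples. A minimal sketch (assuming the logarithms are natural logs, which the paper does not state):

```python
import math

CYCLES_PER_SECOND = 2.2e9  # one second of samples at 2.2 GHz

def predict_core_power(e1, e2, e3, e4):
    """Evaluate the piece-wise model of Equation 3.
    e1..e4 are one-second counts of L2_CACHE_MISS, RETIRED_UOPS,
    RETIRED_MMX_AND_FP_INSTRUCTIONS, and DISPATCH_STALLS."""
    r1, r2, r3, r4 = (e / CYCLES_PER_SECOND for e in (e1, e2, e3, e4))
    if r1 < 1e-5:
        return 1.15 + 65.88 * r1 + 11.49 * r2 - 3.356 * r3 + 17.75 * r4
    # the log terms require r1, r4 > 0 on this branch
    return (23.34 + 0.93 * math.log(r1) + 10.39 * r2
            - 6.425 * r3 + 6.43 * math.log(r4))
```

With all counters at zero the model reduces to the 1.15W intercept of the low-miss-rate piece.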

(a) bt   (b) namd   (c) lu
Figure 6. Power vs. Temperature, 4-Core CMP

Table 2. CMP Machine Configuration Parameters

Frequency                  2.2 GHz (max), 1.1 GHz (min)
Process Technology         65nm SOI
Processor                  AMD Phenom 9500 CMP
Number of Cores            4
L1 (Instruction) Size      64 KB 2-Way Set Associative
L1 (Data) Size             64 KB 2-Way Set Associative
L2 Cache Size (Private)    512 KB/core 8-Way Set Associative
L3 Cache Size (Shared)     2 MB 32-Way Set Associative
Memory Controller          Integrated On-Chip
Memory Width               64 bits/channel
Memory Channels            2
Main Memory                4 GB PC2 6400 (DDR2-800)

3  Experimental Setup

We evaluate our work using the SPEC 2006 [14], SPEC-OMP [3], and NAS [5] benchmark suites. All are compiled for the AMD x86_64 64-bit architecture using gcc 4.2 with the -O3 optimization flag (and -fopenmp where applicable). We use the reference input sets for SPEC 2006 and SPEC-OMP, and the medium input set (B) for NAS. All benchmarks are run to completion. We use the pfmon utility from the perfmon2 library to access the hardware performance counters from user space. We run all benchmarks on Linux kernel version 2.6.25. Our system hardware configuration is detailed in Table 2. The baseline system power is measured with the processors idle and powered down to their lowest frequency and voltage. All power measurements are based on current drawn from the outlet by the power supply. We use a Watts Up Pro power meter for measuring power consumption, accurate to within 0.1W; it updates its power values every second. We find the baseline power for the system to be 85.1W running at 2.2 GHz, and 76.8W at 1.1 GHz. For the purpose of model formation only, we assume the base power without the processor to be 68.5W. This simplifying assumption aids faster model formation without the need for more complicated measuring techniques. We calculate per-core power by subtracting the base power and dividing by four. Some performance counters measure shared resources, giving statistics across the CMP rather than per core. Some performance counters could be further subdivided by type: for example, cache and DRAM accesses can be broken down into cache or page hits and misses, while dispatch stalls can be broken down by branch flushes or full queues (reservation stations, reorder buffers, floating point units). The AMD processor used in this work can only sample four counters simultaneously, which is why we narrow our design space to four performance counters. For power-aware scheduling of processes, we suspend processes determined to be exceeding the available power envelope; these suspended processes are later restored and continue executing when sufficient power is available.
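The per-core derivation in this section is simple arithmetic; sketched below, with the paper's 68.5W base-power assumption baked in as a default:

```python
def per_core_power(system_watts, base_watts=68.5, cores=4):
    """Per-core power as measured-minus-base, split evenly across cores
    (the paper's simplifying assumption for model formation)."""
    return (system_watts - base_watts) / cores

# e.g. a hypothetical 150.1W meter reading implies (150.1 - 68.5) / 4,
# i.e. about 20.4W per core
print(per_core_power(150.1))
```

The even split is only defensible here because all cores run identical micro-benchmark copies during model formation.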

4  Evaluation

We evaluate the accuracy of our power model using single- and multi-threaded benchmarks, using the entire CMP to test our results. We later leverage this real-time estimation to make power-aware scheduling decisions, suspending processes to ensure a given power envelope.

Figure 9. Cumulative Distribution Function (CDF) plot showing fraction of space predicted (y-axis) under a given error (x-axis)

Figure 10. Scheduler Setup and Use

4.1  Evaluating Power Estimation

We test our derived power model on the SPEC 2006, SPEC-OMP, and NAS benchmarks. We compare actual to predicted power in Figures 7(a), 7(b), and 7(c) for NAS, SPEC-OMP, and SPEC 2006, respectively. Each multi-threaded benchmark is run across the entire CMP, and multiple copies are spawned for single-threaded programs. The power consumption reported is per core of the CMP. Our estimation model tracks the power consumption of each benchmark fairly well. We find a large difference between estimated and actual power for some benchmarks, such as bt and ft; this can partially be attributed to the lack of dynamic temperature data in our estimation model. Actual power values range from 19.6W for ep to 26.6W for mgrid, a variation of over 35%. Figures 8(a), 8(b), and 8(c) show the percentage error for each benchmark suite. Most benchmarks show less than 10% median error. We attribute the error in power estimates to temperature, the limited PMCs monitored, and parts of the counter space possibly unexplored by our micro-benchmarks. Temperature was earlier found to increase power consumption by up to 10%. The NAS and SPEC-OMP benchmarks have average median errors of 5.8% and 3.9%, respectively; SPEC 2006 has a marginally higher average median error of 7.2%. Figure 9 shows the Cumulative Distribution Function (CDF) for all three benchmark suites taken together, giving a good idea of the coverage of our model: for example, 85% of predictions across all benchmarks have less than 10% error. The CDF helps illustrate the model fit and shows that most predictions have very small error.

To further test the robustness of our model, we examine system power estimates for a multi-programmed workload. Figure 11(d) shows system power consumption for ep (NAS), art (SPEC-OMP), and mcf and hmmer (SPEC 2006) run concurrently. We sum per-core power to estimate CMP-wide system power. We find our model conservatively overestimates power consumption. After ep finishes executing, the power prediction decreases along with the actual system power. System power further decreases after art completes, when only two cores are running. Even with the interaction of shared resources on the CMP, our model accurately tracks system power consumption.
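The CDF statistic reported here is straightforward to compute from per-sample prediction errors; a small sketch with made-up error values:

```python
def fraction_under(errors, threshold):
    """Empirical CDF evaluated at one point: the fraction of absolute
    percentage errors falling below the given threshold."""
    return sum(1 for e in errors if e < threshold) / len(errors)

# hypothetical per-prediction percentage errors
errors = [2.1, 3.5, 4.0, 6.2, 8.8, 9.1, 12.5, 15.0]
coverage = fraction_under(errors, 10.0)  # -> 0.75
```

Sweeping the threshold over the observed error range traces out the full CDF curve of Figure 9.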

4.2  Power-Aware Thread Scheduling

We use the power predicted for processes to schedule them within a multi-programmed workload on a CMP, suspending processes to remain below the system power envelope. We assume the system power envelope is degraded by 10, 20, or 30%. We present an application that uses our power predictor to schedule processes so that they run under a fixed power envelope. We write a user-space scheduler in C that spawns a process on each of the four cores of the AMD Phenom and monitors their behavior via pfmon; Figure 10 illustrates its setup and use. The program makes real-time predictions of per-core and system power from the collected performance counters, and suspends processes when the power envelope is breached. It suspends processes such that system power lands just below the power envelope. For example, assume that current system power is 190W and the power envelope is 180W, and that, for simplicity, we must choose between two processes consuming 20W and 25W respectively. The scheduler suspends the first process to bring system power down to 170W, rather than choosing the second process and ending further from the envelope (at 165W).

We present examples of running a randomly chosen workload at 180W, 160W, and 140W, which represent approximately 90%, 80%, and 70% of maximum power usage when no processes are suspended. For fairness, we run multi-threaded programs with one thread only: if we were to suspend a multi-threaded process, it would span multiple cores and its suspension would drop power far below the envelope. The benchmarks are chosen at random: ep from NAS, art from SPEC-OMP, and mcf and hmmer from SPEC 2006. Figure 11(d) shows the workload execution without any power envelope. In Figure 11(a), we see a sudden drop in system power during the initial execution of the workload. The total system power of 177.7W is far above the applied power envelope of 140W; at this point, ep, art, mcf, and hmmer are using 19.8W, 20.5W, 21.4W, and 22.4W respectively. The scheduler chooses ep and art to suspend because their sum of 40.3W brings system power just below the envelope of 140W. The drop in power shortly before the 500s mark is where the current process ends and the scheduler resumes another process to take its place. Similarly, in Figure 11(c), about 20 seconds into the execution of the workload, we see a drop in system power. The predicted total system power of 181.1W is just above the power envelope of 180W, and ep, art, mcf, and hmmer are using 19.7W, 24.7W, 20.7W, and 22.2W respectively. The scheduler suspends ep because doing so keeps total power closest to the power envelope. The process is resumed later and completes execution. As shown in Figures 11(a), 11(b), and 11(c), the actual and predicted power match up well. We are able to follow the power envelope strictly, entirely on the basis of a prediction-based scheduler. This even obviates the need for a power meter and would be an excellent tool for software-level control of total system power.

(a) NAS   (b) SPEC-OMP   (c) SPEC 2006
Figure 7. Actual vs. Predicted Power
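The suspension choice can be sketched as a brute-force search for the cheapest set of processes whose removal lands system power just below the envelope. This is an assumption about the selection rule (the paper does not spell out its algorithm at this level of detail), but it reproduces both examples in the text, and exhaustive search is cheap for four processes:

```python
from itertools import combinations

def choose_suspensions(system_watts, envelope_watts, proc_watts):
    """Return indices of processes to suspend so that system power drops
    below the envelope, suspending as little power as possible."""
    need = system_watts - envelope_watts
    if need <= 0:
        return []          # already under the envelope
    best = None
    for k in range(1, len(proc_watts) + 1):
        for combo in combinations(range(len(proc_watts)), k):
            shed = sum(proc_watts[i] for i in combo)
            if shed >= need and (best is None or shed < best[0]):
                best = (shed, combo)
    # if even suspending everything cannot meet the envelope, suspend all
    return list(best[1]) if best else list(range(len(proc_watts)))

# the 140W example: suspend ep and art (indices 0 and 1, 40.3W combined)
print(choose_suspensions(177.7, 140.0, [19.8, 20.5, 21.4, 22.4]))
```

With the 180W example (`choose_suspensions(181.1, 180.0, [19.7, 24.7, 20.7, 22.2])`), only ep's 19.7W need be shed, matching the text.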

5  Related Work

Significant work has been done correlating performance counters with power consumption and estimating power consumption from performance counters. Prior work falls under two schemes: the first estimates power consumption by monitoring usage of the functional units [11, 16]; the second derives mathematical correlations between performance counters and power consumed, independent of the underlying functional units [8]. Our work combines the two, using correlation and architectural knowledge to choose appropriate performance counters, and analytical techniques to draw a relationship between performance counters and power consumption. We briefly outline related work in this area, and recent work on power-aware scheduling.

Joseph et al. [11] estimate power using performance counters in a simulator (Wattch [7] integrated with SimpleScalar [4]) and on real hardware (Pentium Pro). Unfortunately, their power estimates must be derived offline after multiple benchmark runs, since they require more performance counters than the hardware supports. They use performance counters to estimate power consumption for different architectural resources, but are unable to estimate power for 24% of the chip and assume peak power for those structures. Wu et al. [16] use micro-benchmarks to estimate the power consumption of functional units on the Pentium 4 architecture. We do not use this method, since dynamic activity for all portions of the chip is not available, and even if it were, it would exceed the limit of four counters for run-time power prediction.

Economou et al. [1] use performance counters to predict the power of a blade server. They use custom micro-benchmarks to profile the system and estimate power for the different hardware components (CPU, memory, and hard drive), achieving estimates with 10% average error. Like ours, the kernel benchmarks used to derive their model are independent of the benchmarks used for testing.

Lee et al. [12] use statistical inference models to predict power consumption based on hardware design parameters. They build correlations from hardware parameters and use the most significant parameters to train a model to estimate power consumption. They randomly sample a large design space and estimate power consumption based on previous power values for the same design space, profiled a priori. Unfortunately, this methodology requires sampling the same applications whose power is being estimated, and depends on having already trained on the applications of interest. It is not feasible when executing programs outside the sampled design space: at run-time the schedule is not known a priori, and program behavior changes depending on interaction with other processes sharing the same cache resources.

Researchers have also examined predicting power consumption at different thread counts and frequencies based on profiling applications at the highest thread count or frequency. Curtis-Maury et al. [9] predict optimal concurrency levels using machine learning, estimating power consumption at various thread counts based on power consumption at the highest thread count for the NAS benchmark suite. Unfortunately, they require applications to have been profiled at one thread count before predicting power consumption at other thread counts, making their models application dependent.

Contreras et al. [8] use performance counters on the XScale platform to estimate power consumption at different frequencies offline. They gather statistics from multiple performance counters by running benchmarks several times, and join the sampled data from different runs to get better estimates. However, this methodology is not feasible at run-time, since it requires three times as many performance counters as can be sampled in real time; it therefore cannot be used by an operating system scheduler or by a developer executing multiple programs in parallel at run-time. Additionally, their parameterized linear model is not applicable here: their work targets an in-order single core and does not scale to our multi-core, out-of-order, power-efficient, high-performance platform.

Researchers have recently examined scheduling processes under thermal and power constraints. Merkel et al. [13] use performance counters to estimate power per processor in an 8-way symmetric multi-processing (SMP) system, and shuffle processes to reduce overheating of individual processors. Bautista et al. [6] schedule processes on a multi-core system for real-time applications; they assume static power consumption for all applications and leverage chip-wide DVFS to reduce energy when there is slack in the application. Isci et al. [10] analyze various power management policies for a given power budget, leveraging per-core and domain-wide DVFS to retain performance while remaining within the budget. Unlike their work, we do not require the presence of on-core current sensors, we use real hardware for evaluating our work, and we implement a scheduler. While their work involves core adaptation policies, we examine OS scheduling decisions with fixed cores; we hope to expand our study when per-core DVFS is feasible on CMPs.

We estimate power consumption for an aggressively power-efficient, high-performance platform. Our platform differs from those in previous literature in that resources are shared across cores, the performance counters do not individually exhibit high correlation with power, and power varies significantly across benchmarks depending on workload. Additionally, since we desire real-time power estimation, we are confined to the number of performance counters available at run-time in a single iteration. These differences prevent a direct comparison with previous work. Our framework can estimate power consumption for new programs it has not seen before. Our work evaluates results on a multi-core system, providing a feasible method of estimating power consumption per core. We leverage this to perform real-time power-aware per-core scheduling on a CMP, which was not previously possible.

(a) NAS   (b) SPEC-OMP   (c) SPEC 2006
Figure 8. Median errors for given benchmark suites

(a) 140W   (b) 160W   (c) 180W   (d) None
Figure 11. Scheduler with given power envelope

6

Conclusions

[3] V. Aslot and R. Eigenmann. Performance characteristics of the SPEC OMP2001 benchmarks. In Proceedings of the European Workshop on OpenMP, Sept. 2001.

We analytically derive a piece-wise linear power model that maps performance counters to power consumption. We accurately estimate power consumption, independent of program behavior for the SPEC 2006, SPEC-OMP and NAS benchmark suites. We do so using custom written micro-benchmarks independent of our test programs to generate data for creating our estimation model. The microbenchmarks stress four counters selected based on their correlation value. The low error rate empirically verifies our claims of the micro-benchmark application independence and sufficiency of the four counters selected. We leverage our power model to perform run-time power-aware thread scheduling. We suspend and resume processes based on power consumption, ensuring the power envelope is not exceeded. Our work is especially useful for consolidated data centers where virtualization leads to multiple servers on a single CMP. Using our estimation methodology, we can accurately estimate power consumption at the core granularity allowing for accurate billing of power usage and cooling costs. Estimating per-core power consumption is challenging since some resources are shared across cores, such as caches, the DRAM memory controller and off-chip memory accesses. Additionally, current platforms only allow us to monitor a few performance counters percore depending on architecture (Intel, Sun, AMD). For future work, we intend to expand this study for scheduling multi-threaded workloads. This adds a layer of complexity since all the threads of a specific process need to be suspended at once. If only one thread is suspended, it might be holding locks or other threads might be forced to wait at barrier points for it. Another strategy is to force the scheduler to wait for specific thread spawn points before changing number of threads. We find temperature to play a significant role in power consumption. When supported by the hardware, we intend to incorporate it into our estimation model and thread scheduling. 
We also wish to investigate the benefits of a power-aware scheduler that reduces the frequency of selected processor cores (DVFS) rather than suspending processes. As core counts and power demands grow, process scheduling will be critical for efficient computing.
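The DVFS alternative can be sketched as follows: greedily pick the highest-power cores to down-clock until the summed per-core estimate fits the envelope, then write the lowest available frequency through the Linux cpufreq sysfs interface (which requires root and the `userspace` governor). The envelope value and the assumed 40% savings per down-clocked core are illustrative assumptions, not measurements.

```python
import os

POWER_ENVELOPE_W = 80.0  # assumed chip-level budget (illustrative)
DVFS_SAVINGS = 0.4       # assumed fraction of core power saved at low freq

def pick_cores_to_slow(core_watts, envelope=POWER_ENVELOPE_W):
    """Greedily choose cores to down-clock, hungriest first, until the
    summed per-core power estimate fits under the envelope."""
    total = sum(core_watts.values())
    victims = []
    for core, watts in sorted(core_watts.items(), key=lambda kv: -kv[1]):
        if total <= envelope:
            break
        total -= watts * DVFS_SAVINGS  # assumed savings at the low P-state
        victims.append(core)
    return victims

def set_min_frequency(core):
    """Drop one core to its minimum frequency via the Linux cpufreq
    sysfs files (requires root and the 'userspace' governor)."""
    base = f"/sys/devices/system/cpu/cpu{core}/cpufreq"
    with open(os.path.join(base, "cpuinfo_min_freq")) as f:
        min_khz = f.read().strip()
    with open(os.path.join(base, "scaling_setspeed"), "w") as f:
        f.write(min_khz)
```

The same selection loop serves the suspend/resume policy by replacing the frequency write with `SIGSTOP`/`SIGCONT` on the victim processes.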

References

[1] Full-system power analysis and modeling for server environments. In Workshop on Modeling, Benchmarking and Simulation (MoBS) at ISCA, June 2006.
[2] AMD Phenom Quad-Core Processor Die. http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_572_57315044~117770,00.html, Jan. 2008.
[4] T. Austin. SimpleScalar 4.0 release note. http://www.simplescalar.com/.
[5] D. Bailey, T. Harris, W. Saphir, R. Van der Wijngaart, A. Woo, and M. Yarrow. The NAS parallel benchmarks 2.0. Report NAS-95-020, NASA Ames Research Center, Dec. 1995.
[6] D. Bautista, J. Sahuquillo, H. Hassan, S. Petit, and J. Duato. A simple power-aware scheduling for multicore systems when running real-time applications. In Proc. 22nd IEEE/ACM International Parallel and Distributed Processing Symposium, pages 1–7, Apr. 2008.
[7] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. 27th IEEE/ACM International Symposium on Computer Architecture, pages 83–94, June 2000.
[8] G. Contreras and M. Martonosi. Power prediction for Intel XScale processors using performance monitoring unit events. In Proc. IEEE/ACM International Symposium on Low Power Electronics and Design, pages 221–226, Aug. 2005.
[9] M. Curtis-Maury, K. Singh, S. McKee, F. Blagojevic, D. Nikolopoulos, B. de Supinski, and M. Schulz. Identifying energy-efficient concurrency levels using machine learning. In Proc. 1st International Workshop on Green Computing, Sept. 2007.
[10] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In Proc. IEEE/ACM 39th Annual International Symposium on Microarchitecture, pages 347–358, Dec. 2006.
[11] R. Joseph and M. Martonosi. Run-time power estimation in high-performance microprocessors. In Proc. IEEE/ACM International Symposium on Low Power Electronics and Design, pages 135–140, Aug. 2001.
[12] B. Lee and D. Brooks. Accurate and efficient regression modeling for microarchitectural performance and power prediction. In Proc. 12th ACM Symposium on Architectural Support for Programming Languages and Operating Systems, pages 185–194, Oct. 2006.
[13] A. Merkel and F. Bellosa. Balancing power consumption in multiprocessor systems. In Proc. ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys '06), pages 403–414, Apr. 2006.
[14] Standard Performance Evaluation Corporation. SPEC CPU benchmark suite. http://www.specbench.org/osg/cpu2006/, 2006.
[15] V. Weaver and S. McKee. Can hardware performance counters be trusted? In Proc. IEEE International Symposium on Workload Characterization, Sept. 2008.
[16] W. Wu, L. Jin, and J. Yang. A systematic method for functional unit power estimation in microprocessors. In Proc. 43rd ACM/IEEE Design Automation Conference, pages 554–557, July 2006.
