2009 International Conference on Computational Science and Engineering
Grouping-Based Dynamic Power Management for Multi-Threaded Programs in Chip-Multiprocessors

Mu-Kai Huang
Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
[email protected]

J. Morris Chang
Department of Electrical and Computer Engineering, Iowa State University, Iowa, USA
[email protected]

Wei-Mei Chen
Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
[email protected]
Abstract—In the embedded systems field, the research focus has shifted from performance alone to both performance and power consumption. Previous research has investigated methods to forecast the phase behavior of programs and has adopted the Dynamic Voltage and Frequency Scaling (DVFS) technique to adjust the processor frequency to the needs of the various phases of a program's threads. However, little research has paid attention to the overhead of DVFS. Generally, a DVFS transition renders the processor core unavailable for 10µs to 650µs, so adjusting the frequency for every thread may incur unanticipated overhead, especially for multi-threaded programs. The objective of this study is to take performance, power consumption and overhead into consideration, and to provide a low-overhead power management scheme that adjusts the processor frequency for every group of threads instead of every thread. The proposed approach consists of three components: phase behavior prediction, DVFS control and workload migration. To demonstrate the effect of our approach, we implemented these components on a real Linux system and compared our approach with a system without DVFS and a system with per-thread DVFS. The results show that our approach reduces power consumption by 15-40% with a 2-10% performance penalty. Moreover, it reduces processor core unavailable time by 94-97.5% compared with per-thread DVFS.

I. INTRODUCTION

As embedded systems have grown more sophisticated, power consumption has become a primary design constraint. One of the major power-demanding devices in an embedded system is the processor, owing to its high clock rate. For power saving, processors have been equipped with on-chip regulators that can adjust the system voltage and frequency [12]. In recent years, researchers have turned to processor power management based on the Dynamic Voltage and Frequency Scaling (DVFS) technique, which drives the on-chip regulator dynamically.

Due to advances in VLSI technology, Chip-Multiprocessor (CMP) technologies are becoming attractive and cost effective in the hardware design of embedded systems [3]. Although CMPs improve system performance, they pose great challenges in power consumption. When both power and performance are taken into consideration, adjusting the frequency of each processor to meet the needs of running programs becomes a challenge. Multi-threaded programming allows a program to be partitioned into multiple threads that can potentially run in parallel, so the throughput of multi-threaded applications can be improved greatly by running multiple threads in parallel on a CMP. As the cost of CMPs continues to drop, it becomes more and more attractive to run multi-threaded applications on CMP embedded systems.

In the context of multi-threaded programs, each thread exhibits various execution behaviors at run-time. These behaviors are generally categorized into a memory-intensive phase and a processor-intensive phase [13]. In the processor-intensive phase, typical instructions are arithmetic logic and branch instructions; these tend to operate on registers and do not depend on operands from memory. During the memory-intensive phase, memory-based instructions (e.g. loads and stores) are common. Owing to the speed gap between the processor and the memory, a number of stall cycles are generated while the processor runs in the memory-intensive phase. Slowing down the processor frequency during the memory-intensive phase can mask the memory latency without sacrificing performance. Moreover, it reduces the power consumption of the processor, which is critical in embedded systems.

Much research has demonstrated different approaches to trace the phase behavior of programs. Some researchers have explored approaches to predict future program behavior based on past behavior [10][11][22]. A number of studies have traced phase behavior based on program execution properties [2][18][20][21]. Several researchers have utilized hardware performance monitors (PMs) to track the phase behavior of programs [4][13][14][24][25]. The DVFS technique has been used widely to regulate processor power consumption. Some research adjusts the processor frequency downward under a power constraint [9][14]. Other approaches take both power consumption and response deadlines into consideration for weakly hard real-time systems [17][26][27]. A few studies predict the future behavior of threads and schedule threads to a processor whose frequency matches the needs of those threads

Mu-Kai Huang is a PhD student at National Taiwan University of Science and Technology. This paper was done while he was a visiting student at Iowa State University.
[4][23]. Several investigators have forecast the future behavior of every thread and adopted per-thread DVFS, which dynamically adjusts the processor frequency to the needs of each thread [1][10][15][25].

Although much published research addresses the power savings of per-thread DVFS, little attention has been paid to its overhead in systems with a considerable number of threads. Generally, driving the on-chip regulator renders the processor core unavailable for 10µs to 650µs [6][15]. During this time, the processor core cannot be used for any operation; this is referred to as the DVFS overhead. In this paper, we use the term "processor core unavailable time" as it appears in [6]. When a system executes a large number of threads and adjusts the processor frequency for each thread, system performance is constrained by the accumulated processor core unavailable time.

This paper considers the DVFS overhead, performance and power consumption in the multi-threaded environment of CMP systems, and proposes a per-group DVFS power management scheme. The per-group DVFS groups threads based on the Linux 2.6 scheduling mechanisms [16] and reduces processor core unavailable time by adjusting the processor frequency to meet the needs of every group instead of every thread. In addition, we present a phase-based workload migration, which brings threads with similar phase behavior into the same group to improve the accuracy of the per-group DVFS. Our experimental results show that the approach reduces power consumption by 15-40% with a 2-10% performance penalty, and reduces processor core unavailable time by 94-97.5% relative to per-thread DVFS.

The rest of the paper is organized as follows. Section 2 gives an overview of the Linux 2.6 scheduling mechanism. Section 3 presents our dynamic power management methodology, which includes the phase predictor, the per-group DVFS and the workload migration. Section 4 presents the implementation of our prototype system. Section 5 provides the experimental results, and Section 6 summarizes the conclusions.

II. THE LINUX 2.6 SCHEDULING MECHANISMS

The objective of this research is to develop a grouping-based dynamic power management scheme with less processor core unavailable time. It reduces processor core unavailable time by adjusting the processor frequency for every group of threads instead of every thread. In our approach, a group is defined as a subqueue used in the Linux scheduling mechanisms; Section 3 describes the details of our approach. First, however, we present the Linux 2.6 scheduling mechanisms.
The Linux 2.6 scheduler is priority-based: it assigns priorities to runnable threads according to their worth and need for processor time [16]. Each priority maps to a timeslice value that represents the allowed running time. When a running thread exhausts its timeslice, the system switches to another thread; this action is referred to as a context switch. Figure 1 gives the basic data structure of the Linux 2.6 scheduler, the runqueue. A runqueue is the set of runnable threads on a given processor, and each processor maintains its own runqueue. The runqueue is composed of two subqueues: the active queue and the expired queue. The active queue holds all threads that have timeslice remaining, while the expired queue holds the unfinished threads that have exhausted their timeslice. When an unfinished thread exhausts its timeslice, the scheduler recomputes its timeslice for the next execution term and moves it from the active queue to the expired queue. Once all threads in the active queue have used up their timeslices, i.e. the active queue becomes empty, the scheduler swaps the two subqueues, so that the expired queue becomes the active queue and vice versa. The threads in the "new" active queue are then executed again.

Fig. 1. The basic data structure of the runqueue.
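To make this structure concrete, the following is a minimal C sketch of the runqueue layout described above. The field names and types are illustrative simplifications, not the actual kernel definitions:

```c
/* Simplified model of the Linux 2.6 O(1) scheduler's runqueue.
 * Names and layout are illustrative only. */
struct task_struct;  /* opaque thread descriptor */

struct prio_array {
    int nr_active;                   /* threads remaining in this subqueue */
    struct task_struct *queue[140];  /* one slot per priority (simplified) */
};

struct runqueue {
    struct prio_array arrays[2];  /* storage for the two subqueues */
    struct prio_array *active;    /* threads with timeslice remaining */
    struct prio_array *expired;   /* threads that exhausted their timeslice */
};

/* When the active queue empties, swap the two subqueues:
 * the expired queue becomes the "new" active queue. */
static void swap_arrays(struct runqueue *rq)
{
    struct prio_array *tmp = rq->active;
    rq->active = rq->expired;
    rq->expired = tmp;
}
```

The swap is a pointer exchange, which is why emptying the active queue is a cheap, natural boundary at which to act on a whole group of threads at once.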
III. GROUPING-BASED DYNAMIC POWER MANAGEMENT

In this section, we present our dynamic power management, which is based on the Linux 2.6 scheduling mechanisms. Our approach consists of the Phase Behavior Predictor (PBP), the Per-Group DVFS Controller (PGDC) and the Phase-Based Workload Migration (PBWM). Figure 2 shows an overview of the grouping-based dynamic power management. The PBP utilizes the hardware performance monitors to predict the future phase behavior of the preempted thread on every context switch. When all the threads in a group have exhausted their timeslices or have completed, the system turns to execute the threads of the next group. Before the system engages the threads in the next group, the PGDC determines a suitable frequency for all threads in that group. The PBWM considers the phase behavior of threads and collects threads with the same phase behavior into a group. The following subsections discuss the details of each component.

A. The Phase Behavior Predictor

Generally, the behaviors of a thread are categorized into the memory-intensive phase and the processor-intensive phase [13]. Because threads in the memory-intensive phase are dominated by cache and memory accesses and cannot make use of all of the available frequency, using a lower processor frequency to execute such threads is an effective way to reduce power consumption. Due to common programming practice and the characteristics of compilers, programs exhibit similar phase behavior within a span (Section 4 demonstrates the phase behavior of typical programs). The Phase Behavior Predictor (PBP) is a statistical last-value predictor.
Fig. 2. Overview of the grouping-based dynamic power management.
In this predictor, the next phase behavior of a thread is assumed to be the same as its last phase behavior, i.e. phase[t+1] = phase[t]. The hardware performance monitors are a common means to measure the phase behavior of threads, e.g. the number of retired instructions, the number of memory accesses, the number of TLB misses, etc. [8]. Existing processors such as the IBM POWER4 and the AMD and Intel processor families are equipped with hardware performance monitors.

To perform the phase behavior prediction, we track two hardware events, Instr_Ret and IFU_Mem_Stall, where Instr_Ret counts the number of instructions retired and IFU_Mem_Stall counts the cycles stalled while waiting for data from memory. We then define a measure called Stall Cycles Per Instruction (SCPI) to differentiate the phase behavior of threads:

SCPI = IFU_Mem_Stall / Instr_Ret    (1)

Algorithm 1 The Phase Behavior Predictor
Require: An unfinished thread p
Ensure: The phase behavior of p
1. SCPI ← IFU_Mem_Stall / Instr_Ret
2. if SCPI ≤ threshold then
3.   p.phase[t] ← processor-intensive phase
4. else
5.   p.phase[t] ← memory-intensive phase
6. end if
7. p.phase[t+1] ← p.phase[t]
The PBP predicts the future phase behavior of an unfinished thread on every context switch. It predicts that a memory-intensive phase is forthcoming when SCPI is higher than the threshold; conversely, it forecasts a processor-intensive phase when SCPI is below the threshold. In this paper, we set the threshold to 0.2; the choice of threshold is discussed in Section 4. Algorithm 1 gives the procedure of the PBP.
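For illustration, here is a minimal C sketch of the PBP, assuming the two counter values have already been read from the performance monitors. The type and function names are ours, not from the paper:

```c
#include <stdint.h>

enum phase { PROCESSOR_INTENSIVE, MEMORY_INTENSIVE };

#define SCPI_THRESHOLD 0.2  /* threshold determined empirically in Section 4 */

/* Classify the phase just observed; the last-value predictor then
 * simply reuses this result as the prediction: phase[t+1] = phase[t]. */
static enum phase pbp_predict(uint64_t ifu_mem_stall, uint64_t instr_ret)
{
    /* guard: a thread that retired nothing is treated as memory-bound */
    if (instr_ret == 0)
        return MEMORY_INTENSIVE;
    double scpi = (double)ifu_mem_stall / (double)instr_ret;
    return scpi <= SCPI_THRESHOLD ? PROCESSOR_INTENSIVE : MEMORY_INTENSIVE;
}
```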
Fig. 3. The execution behavior of bzip at different frequencies.
B. The Per-Group DVFS Controller

Selecting an appropriate processor frequency for the memory-intensive phase reduces power consumption without degrading performance. For instance, Figure 3 presents the execution behavior of bzip (a compression application) at different frequencies. As shown in the figure, the execution time of each memory-intensive portion at the low frequency (see (a)) is the same as that of the corresponding portion at the high frequency (see (b)). Therefore, lowering the processor frequency to execute the memory-intensive phase saves power. Previous work focused on per-thread DVFS, which dynamically adjusts the processor frequency to each thread's phase behavior [1][10][15][25]. Each frequency adjustment, however, costs 10µs to 650µs of processor core unavailable time [6][15]. When a system executes a large number of threads and adjusts the frequency for each thread, system performance is constrained by the accumulated processor core unavailable time. Although per-thread DVFS is an intuitive scheme for saving power, it encounters a bottleneck in systems with a large number of threads.

One way to avoid this bottleneck is to adjust the frequency for each group of threads instead of each thread. This paper presents the Per-Group DVFS Controller (PGDC), which performs DVFS for every group of threads. The PGDC takes a subqueue of the runqueue as a group, i.e. the threads in the active queue are defined as one group and the threads in the expired queue as another. When the Linux scheduler exchanges these queues, the PGDC determines an adequate frequency for the threads that belong to the "new" active queue. To estimate the suitable frequency, we add three counters, memory_phase, processor_phase and unknown_phase, to the structure of each subqueue. Figure 4 gives this structure.
Fig. 4. The phase behavior counters of the runqueue.
The functions of these counters are as follows:
• memory_phase: the number of threads in this subqueue in the memory-intensive phase.
• processor_phase: the number of threads in this subqueue in the processor-intensive phase.
• unknown_phase: the number of threads whose phase behavior has not been predicted yet, i.e. threads that have been created but not yet executed.

When a thread with a given phase behavior is inserted into a subqueue, the corresponding counter is incremented; when a thread is removed from a subqueue, the corresponding counter is decremented.

Generally, a processor with DVFS technology offers several levels of frequency and voltage to select from. For instance, Table 1 shows that the Intel Pentium M 1.6 GHz processor supports six levels of frequency and voltage [6].

Table 1: Supported frequency and voltage for the Intel Pentium M 1.6 GHz processor [6].
level  frequency  voltage
1      1.6 GHz    1.484 V
2      1.4 GHz    1.420 V
3      1.2 GHz    1.276 V
4      1.0 GHz    1.164 V
5      800 MHz    1.036 V
6      600 MHz    0.956 V

For a group of threads, the processor frequency should be slowed down when most of the threads in the group are in the memory-intensive phase; conversely, a higher frequency is efficient when most of the threads are in the processor-intensive phase. We select the adequate frequency and voltage based on the ratio of the number of threads in the memory-intensive phase to the total number of threads. Assume the processor has m frequency levels (from level 1 to level m), where a lower level means a faster frequency. The PGDC selects the frequency level by Equations (2) and (3):

total_thread = memory_phase + processor_phase + unknown_phase    (2)

level = 1, if memory_phase = 0; ⌈(memory_phase / total_thread) × m⌉, otherwise    (3)

When the active queue and the expired queue are exchanged, the PGDC is driven to estimate and set an adequate frequency for the threads in the active queue. Algorithm 2 gives the procedure of the PGDC: it determines the adequate frequency level by Equation (3) and then adjusts the processor frequency to the corresponding value. For example, consider an Intel Pentium M 1.6 GHz system with six frequency levels at the moment the active and expired queues are exchanged. If the memory_phase counter of the "new" active queue is 0, the PGDC sets the processor to the highest frequency. On the other hand, if the "new" active queue has 25 threads in total, 12 of which are in the memory-intensive phase, the PGDC uses Equation (3) to determine the adequate frequency level, which is ⌈(12/25) × 6⌉ = 3, and then adjusts the processor frequency to 1.2 GHz (refer to Table 1).
Algorithm 2 The Per-Group DVFS Controller
Require: Active queue q and supported frequency levels m
Ensure: Adjust the frequency of the processor to f
1. memory ← q.memory_phase
2. total ← q.memory_phase + q.processor_phase + q.unknown_phase
3. if memory = 0 then
4.   level ← 1
5. else
6.   ratio ← memory / total
7.   level ← ⌈m × ratio⌉
8. end if
9. if current frequency ≠ corresponding frequency of level then
10.   f ← corresponding frequency of level
11. end if
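A minimal C rendering of Algorithm 2 might look as follows. The frequency table mirrors Table 1, while the counter struct and the set_frequency hook are illustrative assumptions, not the paper's actual implementation:

```c
#include <math.h>

/* Frequency table for the Intel Pentium M 1.6 GHz (Table 1), in kHz. */
static const unsigned freq_khz[] = { 1600000, 1400000, 1200000,
                                     1000000,  800000,  600000 };
#define NUM_LEVELS (sizeof(freq_khz) / sizeof(freq_khz[0]))

struct subqueue_counters {
    unsigned memory_phase;
    unsigned processor_phase;
    unsigned unknown_phase;
};

/* Equation (3): pick a level from the share of memory-intensive threads. */
static unsigned pgdc_select_level(const struct subqueue_counters *q)
{
    unsigned total = q->memory_phase + q->processor_phase + q->unknown_phase;
    if (q->memory_phase == 0 || total == 0)
        return 1;                                  /* fastest level */
    return (unsigned)ceil((double)q->memory_phase / total * NUM_LEVELS);
}

/* Called when the scheduler swaps the active and expired queues. */
static void pgdc_on_queue_swap(const struct subqueue_counters *new_active,
                               void (*set_frequency)(unsigned khz))
{
    unsigned level = pgdc_select_level(new_active);
    set_frequency(freq_khz[level - 1]);  /* one DVFS transition per group */
}
```

Because pgdc_on_queue_swap runs once per queue swap rather than once per context switch, the number of DVFS transitions scales with the number of groups rather than the number of threads.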
C. The Phase-Based Workload Migration

As mentioned before, each processor maintains its own runqueue and executes only the threads in that runqueue. One way to promote cooperation between processors is migration: a system activity that moves a thread from one processor's runqueue to another's. Migration not only promotes cooperation between processors but also improves the accuracy of the PGDC. For instance, the PGDC will select the lowest frequency for an active queue holding one processor-intensive thread and ten memory-intensive threads, so the processor-intensive thread would be executed at an unsuitable frequency. Migration can improve the accuracy of the PGDC by moving that processor-intensive thread to another processor running at a high frequency. Although migration has significant advantages, it also brings disadvantages. When a migration is applied, it blocks the runqueues of both the source and the destination processor; if migrations are triggered haphazardly, processor performance suffers. Moreover, some threads cannot be migrated:
• running threads: threads that are currently executing.
• exclusive threads: threads that may only run on a specific processor.
• cache-hot threads: threads that have been executing persistently and whose working set is most likely still in the processor's cache.

We present a migration mechanism called the Phase-Based Workload Migration (PBWM) and implement it using the standard Linux workload balancer. To reduce the overhead of migration, the PBWM moves threads only when the workload of a processor is imbalanced. The PBWM examines the workload of each processor's expired queue whenever an expired queue becomes empty, or every 200 ms while the system is busy; when the system is idle, the PBWM runs every 1 ms. When the PBWM detects a Starveling Expired Queue (SEQ), whose workload is lower than that of the other expired queues, it finds a Victim Expired Queue (VEQ) and moves threads from the VEQ to the SEQ. Two kinds of queue are qualified to be a VEQ: a queue with the maximum workload, and a queue in which most threads' phase behavior matches the major phase behavior of the SEQ. To select an appropriate VEQ, the PBWM first scores the expired queue EQ[i] of each processor i other than the SEQ by Equations (4)-(9), and then picks the expired queue with the maximum score as the VEQ. Equation (4) calculates the number of threads in each expired queue; Equation (5) counts each expired queue's threads that need a high frequency; and Equation (6) gives the major phase behavior, i.e. the phase behavior of most threads in the SEQ. We define s_ratio and t_ratio for selecting a proper VEQ. The s_ratio[i] in Equation (7) is the ratio of the number of threads with the major phase behavior in the expired queue of processor i to the total number of such threads over all expired queues. The t_ratio[i] in Equation (8) is the ratio of the number of threads in the expired queue of processor i to the total number of threads over all expired queues. The score of each expired queue is determined by Equation (9), where α and β weight the s_ratio and the t_ratio, respectively. In this paper, we chose α and β by heuristic experiment and set them to 0.3 and 0.7, respectively.

EQ[i].total_thread = EQ[i].memory_phase + EQ[i].processor_phase + EQ[i].unknown_phase    (4)

EQ[i].other_thread = EQ[i].processor_phase + EQ[i].unknown_phase    (5)

major = mem, if SEQ.memory_phase > (SEQ.processor_phase + SEQ.unknown_phase); processor, otherwise    (6)

s_ratio[i] = EQ[i].memory_phase / Σ_j EQ[j].memory_phase, if major = mem; EQ[i].other_thread / Σ_j EQ[j].other_thread, otherwise    (7)

t_ratio[i] = EQ[i].total_thread / Σ_j EQ[j].total_thread    (8)

score[i] = α × s_ratio[i] + β × t_ratio[i]    (9)

where each sum Σ_j runs over the expired queue of every core j.

Fig. 5. Example of VEQ selection.
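To make the selection concrete, here is a small C sketch of the VEQ scoring in Equations (4)-(9). The struct layout and array-based iteration over cores are our own simplifications:

```c
enum phase_major { MAJOR_MEM, MAJOR_PROCESSOR };

struct expired_queue {
    unsigned memory_phase, processor_phase, unknown_phase;
};

/* Equation (6): the dominant phase behavior of the starveling queue. */
static enum phase_major major_phase(const struct expired_queue *seq)
{
    return seq->memory_phase > seq->processor_phase + seq->unknown_phase
               ? MAJOR_MEM : MAJOR_PROCESSOR;
}

/* Equations (4)-(9): score candidate queue i among ncores expired queues. */
static double score_queue(const struct expired_queue eq[], int ncores,
                          int i, enum phase_major major)
{
    const double alpha = 0.3, beta = 0.7;   /* heuristic weights (Eq. 9) */
    double match = 0, match_sum = 0, total = 0, total_sum = 0;

    for (int j = 0; j < ncores; j++) {
        double m = (major == MAJOR_MEM)
                       ? eq[j].memory_phase                        /* Eq. 7 */
                       : eq[j].processor_phase + eq[j].unknown_phase;
        double t = eq[j].memory_phase + eq[j].processor_phase
                 + eq[j].unknown_phase;                            /* Eq. 4 */
        match_sum += m;
        total_sum += t;
        if (j == i) { match = m; total = t; }
    }
    double s_ratio = match_sum ? match / match_sum : 0;  /* Eq. (7) */
    double t_ratio = total_sum ? total / total_sum : 0;  /* Eq. (8) */
    return alpha * s_ratio + beta * t_ratio;             /* Eq. (9) */
}
```

With β > α, a heavily loaded queue tends to win even when its phase mix is a weaker match, which biases the balancer toward relieving overload first.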
Figure 5 gives an example of VEQ selection. The SEQ is the expired queue of processor 3, and the major phase behavior of the SEQ is mem. First, the PBWM scores the expired queues of processors 0, 1 and 2 by Equation (9). Then the PBWM selects the expired queue with the highest score (that of processor 0) as the VEQ. After selecting the VEQ, the PBWM moves appropriate threads from the VEQ to the SEQ. The migrated threads must not only have phase behavior similar to the major phase behavior of the SEQ, but must also be movable (i.e. not running, exclusive or cache-hot threads). The procedure used by the PBWM is presented in Algorithm 3. The PBWM first finds an SEQ whose workload is lower than the others'. Next, it determines the major phase behavior of the SEQ and scores the other expired queues by Equation (9). Then it selects the expired queue with the maximum score as the VEQ. Finally, the PBWM locks the runqueues of the SEQ and the VEQ and moves suitable threads from the VEQ to the SEQ until the workload of the SEQ is balanced or all candidate threads have been examined.

IV. IMPLEMENTATION

We implemented our scheme on a desktop computer with an Intel Core 2 Quad Q6600 processor, running Linux kernel 2.6.22.9. The behavior of executing threads is measured via hardware performance monitors (PMs), and the future phase behavior of threads is predicted according to the value of SCPI. To apply DVFS, we control the processor frequency via hardware Model-Specific Registers (MSRs). Moreover, we use a dynamic power model of CMOS circuits to gauge power consumption. The details of these aspects of our framework are discussed in the following subsections.
Algorithm 3 The Phase-Based Workload Migration
Require: A starveling expired queue SEQ
Ensure: Migrate threads to balance the workload of each processor
1. major ← the major phase behavior of SEQ
2. for each possible processor i do
3.   score[i] ← the score of EQ[i]
4. end for
5. j ← i such that score[i] is maximum
6. VEQ ← EQ[j]
7. lock the runqueues of SEQ and VEQ
8. T ← VEQ
9. C ← {t | t ∈ VEQ ∧ t.phase = major}
10. repeat
11.   if C = Ø then
12.     pick any t ∈ T; p ← t
13.     T ← T − {t}
14.   else
15.     pick any t ∈ C; p ← t
16.     C ← C − {t}; T ← T − {t}
17.   end if
18.   if p is movable then
19.     VEQ ← VEQ − {p}
20.     SEQ ← SEQ + {p}
21.   end if
22. until SEQ is balanced or T = Ø
23. unlock the runqueues of SEQ and VEQ
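The selection loop of Algorithm 3 can be sketched in C as follows, assuming an array of candidate-thread pointers from the VEQ and callbacks for the balance test and the actual move. All names here are ours, not the paper's:

```c
#include <stdbool.h>
#include <stddef.h>

struct thread {
    int phase;  /* predicted phase of this thread */
    bool running, exclusive, cache_hot;
};

/* A thread is movable unless it is running, exclusive or cache-hot. */
static bool movable(const struct thread *t)
{
    return !t->running && !t->exclusive && !t->cache_hot;
}

/* Two passes over the VEQ candidates: first migrate movable threads whose
 * phase matches the SEQ's major phase (set C in Algorithm 3), then fall
 * back to any movable thread (set T), stopping once the SEQ is balanced. */
static void pbwm_migrate(struct thread *cand[], size_t n, int major,
                         bool (*seq_balanced)(void),
                         void (*move_to_seq)(struct thread *))
{
    for (int pass = 0; pass < 2; pass++)
        for (size_t i = 0; i < n; i++) {
            if (seq_balanced())
                return;
            struct thread *t = cand[i];
            if (!t || !movable(t))
                continue;
            if (pass == 0 && t->phase != major)
                continue;
            move_to_seq(t);
            cand[i] = NULL;  /* don't consider this thread again */
        }
}
```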
A. Execution Behavior Monitoring

In our experiments, the behavior of executing threads is monitored at run-time via the PMs. We configured two available PMs in the Intel Core 2 Quad Q6600 processor to monitor the number of retired instructions and the cycles stalled while waiting for data from memory, using the Instr_Ret and IFU_Mem_Stall event configurations. To monitor the recent behavior of unfinished threads, we placed the PM access in the kernel routine context_switch(previous_thread, next_thread), i.e. the system collects the information from the PMs on every context switch. Once it is called, the system reads the behavior of the previous thread from the PMs and determines the phase behavior of the unfinished thread. Afterward, the PMs are reset to zero.
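As one possible userspace approximation of this measurement (the prototype described above hooks the kernel's context switch instead), the counters can be read through Linux's msr driver. The MSR addresses below are the architectural general-purpose counter registers IA32_PMC0/IA32_PMC1; programming the event-select MSRs with the two events is assumed to have been done elsewhere:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PMC0 0xC1   /* general-purpose counter 0 (e.g. Instr_Ret)     */
#define IA32_PMC1 0xC2   /* general-purpose counter 1 (e.g. IFU_Mem_Stall) */

/* Read one MSR of CPU `cpu` through Linux's msr driver
 * (requires the msr module and root privileges). */
static int read_msr(int cpu, uint32_t reg, uint64_t *val)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, val, sizeof(*val), reg);
    close(fd);
    return n == sizeof(*val) ? 0 : -1;
}

int main(void)
{
    uint64_t instr_ret, mem_stall;
    if (read_msr(0, IA32_PMC0, &instr_ret) == 0 &&
        read_msr(0, IA32_PMC1, &mem_stall) == 0 && instr_ret != 0)
        printf("SCPI = %f\n", (double)mem_stall / (double)instr_ret);
    return 0;
}
```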
Fig. 6. The execution behavior of programs.
B. The Threshold of the Phase Behavior Predictor

As mentioned before, the phase behavior of threads is distinguished by a threshold on SCPI. To determine the threshold, we observed the execution behavior of various programs and defined a suitable value. Figure 6 demonstrates the execution behavior of these programs: Figure 6 (a)-(d) show the behavior of SPEC CPU2006 benchmarks, and Figure 6 (e)-(g) show the behavior of real programs. In addition, we present a minibenchmark that alternates randomly between memory accesses and arithmetic logic instructions. The results show that programs have similar phase behavior within a span. According to the behavior of the minibenchmark, the SCPI is higher than 0.2 when the program performs memory accesses, and lower than 0.2 when it executes arithmetic logic instructions. As the figure shows, the two phases can be clearly distinguished by an SCPI value of 0.2, so our experiments use an SCPI threshold of 0.2 to differentiate the program phase behavior.

C. The Implementation of DVFS and Power Measurement

The Intel Core 2 Quad Q6600 processor supports two adjustable levels of frequency and voltage, as shown in Table 2. We adjusted the processor frequency by writing the p-state corresponding to each frequency into the IA32_PERF_CTL register in the MSRs [7]. When the p-state is written to IA32_PERF_CTL, the processor changes its frequency after the processor core unavailable time.
Table 2: Supported frequency and voltage for the Intel Core 2 Quad Q6600 processor.
level  frequency  voltage
1      2.4 GHz    1.4375 V
2      1.6 GHz    1.1125 V
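A userspace sketch of this adjustment, again via Linux's msr driver, could look as follows. IA32_PERF_CTL is the architectural MSR 0x199, but the p-state encodings below are placeholders; on a real machine they must be obtained, e.g. from IA32_PERF_STATUS:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PERF_CTL 0x199  /* performance-control MSR [7] */

/* Placeholder p-state encodings for the two levels in Table 2;
 * the real values are machine-specific. */
static const uint64_t pstate[] = { 0 /* 2.4 GHz */, 0 /* 1.6 GHz */ };

/* Request a frequency level (1 or 2) on CPU `cpu`; the core becomes
 * unavailable for tens to hundreds of microseconds while it transitions. */
static int set_level(int cpu, int level)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pwrite(fd, &pstate[level - 1], sizeof(uint64_t),
                       IA32_PERF_CTL);
    close(fd);
    return n == sizeof(uint64_t) ? 0 : -1;
}
```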
A generic dynamic power model for CMOS circuits is given in [6]. The dynamic power consumption P of a CMOS circuit can be expressed as

P = C × Vdd² × f    (10)

where C is the effective switching capacitance, Vdd is the supply voltage and f is the operating frequency. In our experiments, we measured the execution time at each frequency and evaluated the power consumption by Equation (10).
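Since the capacitance C is common to both frequency levels, Equation (10) yields relative power, and energy can be compared across levels as P × t. The following toy C program illustrates this with the Table 2 values; the execution times are assumed placeholders, not measured data:

```c
#include <stdio.h>

/* Relative dynamic power per Equation (10): P = C * Vdd^2 * f.
 * With C unknown but shared, ratios between levels are still meaningful. */
static double rel_power(double vdd, double f_ghz)
{
    return vdd * vdd * f_ghz;
}

int main(void)
{
    /* Table 2: level 1 = 2.4 GHz @ 1.4375 V, level 2 = 1.6 GHz @ 1.1125 V */
    double p1 = rel_power(1.4375, 2.4);
    double p2 = rel_power(1.1125, 1.6);
    /* hypothetical execution times of a memory-bound phase, in seconds */
    double t1 = 10.0, t2 = 10.4;
    printf("power ratio  P2/P1 = %.2f\n", p2 / p1);            /* ~0.40 */
    printf("energy ratio E2/E1 = %.2f\n", (p2 * t2) / (p1 * t1));
    return 0;
}
```

The ~0.40 power ratio illustrates why running memory-bound phases at the lower level pays off: the phase barely slows down while dynamic power drops by more than half.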
V. EXPERIMENT RESULTS

In this section, we compare the performance and power consumption of the system in three configurations: without power management, with the proposed per-group DVFS, and with per-thread DVFS. The system without power management executes all threads at the highest frequency, and the system with per-thread DVFS uses the PBP's predictions to adjust the frequency for every thread (both user threads and kernel threads). We evaluate the performance and power consumption of these schemes with two well-known benchmark suites, SPEC CPU2006 [5] and the Phoronix Test Suite [19].

A. Evaluation with the SPEC CPU2006

In our experiments, the system without power management is denoted FullFrequency, the system with per-thread DVFS is denoted PerThreadDVFS, and the system with our approach is denoted PerGroupDVFS. For comparison, the results of PerThreadDVFS and PerGroupDVFS are normalized to those of FullFrequency. The left part of Figure 7 depicts the comparison results for SPEC CPU2006. From top to bottom, the figure compares performance, power consumption, energy-delay product (EDP) and the number of DVFS operations. As shown in Figure 7 (a), PerThreadDVFS suffered 23% to 50% performance degradation due to the overhead of persistent DVFS operations, whereas PerGroupDVFS had at most 8% performance degradation. Figure 7 (b) shows that PerThreadDVFS reduced power consumption by 16% to 63% and PerGroupDVFS by 10% to 42%. The results in Figure 7 (c) show that the EDP of PerGroupDVFS is the best of the three: compared to the system without DVFS, the average EDP improvement of PerGroupDVFS is 21% with an average of 2% performance degradation. Figure 7 (d) demonstrates that PerGroupDVFS decreases both the number of DVFS operations and the processor core unavailable time by an average of 97.5%. Compared to the system without DVFS, applications with a similar proportion of memory-intensive to processor-intensive phases, such as bzip, hmmer, sjeng, gobmk, xalancbmk, astar and h264ref, save an average of 15% power consumption with little performance degradation. For applications dominated by memory-intensive phases, such as gcc, libquantum, omnetpp and mcf, our approach reduces power consumption by an average of 32% with almost no performance degradation.
Fig. 7. Comparison results of: (a) performance; (b) power consumption; (c) Energy-Delay Product (EDP); (d) number of DVFS operations.
B. Evaluation with Multi-threaded Applications

We also evaluated popular multi-threaded applications provided by the Phoronix Test Suite [19]: 7-Zip Compression, Java 2D, Sunflow Rendering System, MySQL, Apache Builder, ImageMagick Builder and PHP Builder. The comparison results for these multi-threaded applications are depicted in the right part of Figure 7. As shown in Figure 7 (a), PerThreadDVFS suffered 31% to 55% performance degradation, while PerGroupDVFS had at most 12% performance degradation. Figure 7 (b) shows that PerThreadDVFS reduced power consumption by 19% to 67% and PerGroupDVFS by 7% to 55%. Figure 7 (c) compares the EDP: the average EDP improvement of PerThreadDVFS is 13% with an average of 40% performance degradation, whereas the average EDP improvement of PerGroupDVFS is 24% with an average of 5% performance degradation. Figure 7 (d) demonstrates that PerGroupDVFS decreases the number of DVFS operations by an average of 94% and reduces processor core unavailable time by an average of 97%. These results show that although per-thread DVFS saved a great deal of power, it suffered appreciable performance degradation from frequent DVFS operations. The per-group DVFS not only incurred far less processor core unavailable time than per-thread DVFS, but also saved energy with only a small performance penalty.
VI. CONCLUSIONS

In this paper we introduced a grouping-based DVFS power management strategy that adjusts the processor frequency to meet the needs of every group of threads instead of every thread. The proposed scheme leads to much lower processor core unavailable time in CMP systems with a multi-threaded environment, lessening the DVFS overhead while reducing power consumption with low performance degradation. To achieve the power saving, it is important to slow down the processor frequency when executing threads with considerable memory access. The proposed approach consists of three components: phase behavior prediction, DVFS control and workload migration. The phase behavior predictor categorizes threads into the memory-intensive phase and the processor-intensive phase by a well-defined SCPI threshold. The DVFS controller adjusts the processor frequency to meet the needs of every group of threads with similar behaviors. The workload migration brings threads with similar phase behavior into the same group to improve the accuracy of the per-group DVFS. To demonstrate the performance of our approach, we implemented it on a real Linux system and compared it with a system without DVFS and a system with per-thread DVFS power management. According to the experimental results, our approach saves an average of 30% power consumption with an average of 3.5% performance loss. Moreover, it needs only negligible processor core unavailable time for power management. The results show that our scheme reduces energy consumption efficiently with low performance degradation.

ACKNOWLEDGMENT

This work is supported by the National Science Council under Grant NSC97-2221-E-011-097. We would like to thank Bashar M. Gharaibeh for his useful discussions and Chun-Yuan Chang for his help during the development of this work. We also appreciate the reviewers for their helpful suggestions.

REFERENCES

[1] G. Contreras and M. Martonosi, "Power Prediction for Intel XScale Processors Using Performance Monitoring Unit Events," International Symposium on Low Power Electronics and Design, 2005, pp. 221-226.
[2] A. Dhodapkar and J. Smith, "Managing Multi-Configurable Hardware via Dynamic Working Set Analysis," 29th Annual International Symposium on Computer Architecture, 2002, pp. 233-244.
[3] D. Geer, "Chip Makers Turn to Multicore Processors," Computer, 38(5), 2005, pp. 11-13.
[4] S. Ghiasi, T. Keller and F. Rawson, "Scheduling for Heterogeneous Processors in Server Systems," Proceedings of the 2nd Conference on Computing Frontiers, 2005, pp. 199-210.
[5] J. L. Henning, "SPEC CPU2006 Benchmark Descriptions," ACM SIGARCH Computer Architecture News, 34(4), September 2006, pp. 1-17.
[6] Intel Corporation, "Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor," http://www.intel.com/design/intarch/papers/301174.htm, March 2004.
[7] Intel Corporation, "Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1," http://www.intel.com/products/processor/manuals/, November 2008.
[8] Intel Corporation, "Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2," http://www.intel.com/products/processor/manuals/, November 2008.
[9] C. Isci, A. Buyuktosunoglu, C. Cher, P. Bose and M. Martonosi, "An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget," 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 347-358.
[10] C. Isci, G. Contreras and M. Martonosi, "Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management," 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 359-370.
[11] C. Isci, M. Martonosi and A. Buyuktosunoglu, "Long-term Workload Phases: Duration Predictions and Applications to DVFS," IEEE Micro: Special Issue on Energy Efficient Design, 25(5), September/October 2005, pp. 39-51.
[12] W. Y. Kim, M. S. Gupta, G. Y. Wei and D. Brooks, "System Level Analysis of Fast, Per-Core DVFS Using On-Chip Switching Regulators," High Performance Computer Architecture, 2008, pp. 123-134.
[13] R. Kotla, A. Devgan and S. Ghiasi, "Characterizing the Impact of Different Memory-Intensity Levels," International Workshop on Workload Characterization, 2004, pp. 3-10.
[14] R. Kotla, S. Ghiasi, T. Keller and F. Rawson, "Scheduling Processor Voltage and Frequency in Server and Cluster Systems," Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, April 2005, pp. 234-241.
[15] H. Kweon, Y. Do, J. Lee and B. Ahn, "An Efficient Power-Aware Scheduling Algorithm in Real Time System," Pacific Rim Conference on Communications, Computers and Signal Processing, 2007, pp. 350-353.
[16] R. Love, Linux Kernel Development, 2nd ed., Indianapolis, Ind.: Novell Press, 2005.
[17] L. Niu and G. Quan, "System Wide Dynamic Power Management for Weakly Hard Real-Time Systems," Journal of Low Power Electronics, 2(3), December 2006, pp. 342-355.
[18] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun and A. Karunanidhi, "Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation," Proceedings of the 37th International Symposium on Microarchitecture, 2004, pp. 81-92.
[19] Phoronix Media, Phoronix Test Suite Benchmark, http://www.phoronix-test-suite.com/.
[20] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, "Automatically Characterizing Large Scale Program Behavior," Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002, pp. 45-57.
[21] T. Sherwood, E. Perelman and B. Calder, "Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications," International Conference on Parallel Architectures and Compilation Techniques, September 2001, pp. 3-14.
[22] T. Sherwood, S. Sair and B. Calder, "Phase Tracking and Prediction," Proceedings of the 30th International Symposium on Computer Architecture, June 2003, pp. 336-349.
[23] T. Sondag, V. Krishnamurthy and H. Rajan, "Predictive Thread-to-Core Assignment on a Heterogeneous Multi-Core Processor," Proceedings of the 4th Workshop on Programming Languages and Operating Systems, October 2007.
[24] R. Teodorescu and J. Torrellas, "Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors," Proceedings of the 35th International Symposium on Computer Architecture, 2008, pp. 363-374.
[25] F. Xie, M. Martonosi and S. Malik, "Efficient Behavior-Driven Runtime Dynamic Voltage Scaling Policies," Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, September 2005, pp. 19-21.
[26] D. Zhu, R. Melhem and B. Childers, "Scheduling with Dynamic Voltage/Speed Adjustment Using Slack Reclamation in Multi-Processor Real-Time Systems," Proceedings of the 22nd IEEE Real-Time Systems Symposium, 2001, pp. 686-700.
[27] D. Zhu, N. AbouGhazaleh, D. Mosse and R. Melhem, "Power Aware Scheduling for AND/OR Graphs in Multi-Processor Real-Time Systems," Proceedings of the 2002 International Conference on Parallel Processing, 2002, pp. 849-864.