Thermal-aware Task Scheduling at the System Software Level

Jeonghwan Choi∗, Chen-Yong Cher, Hubertus Franke, Hendrik Hamann, Alan Weger, Pradip Bose
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

[email protected], {chenyong, frankeh, hendrikh, weger, pbose}@us.ibm.com

ABSTRACT

Power-related issues have become important considerations in current generation microprocessor design. One of these issues is elevated on-chip temperatures, which adversely affect cooling cost and, if not addressed suitably, chip reliability. In this paper we investigate the general trade-offs between temporal and spatial hot spot mitigation schemes, thermal time constants, workload variations and microprocessor power distributions. By leveraging spatial and temporal heat slack, our schemes lower on-chip unit temperatures by changing the workload in a timely manner, using the operating system (OS) and existing hardware support.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Design studies; Reliability, availability, and serviceability; D.4.1 [Operating Systems]: Process Management—Scheduling

General Terms
Reliability

Keywords
System Level Power Management, Thermal Management

1. INTRODUCTION

Technology advances in microprocessor design have resulted in high device density and performance. However, non-ideal scaling in the late CMOS era has led to severe power density and consequent temperature ("hot spot") issues in current generation chips. Higher peak and average temperatures lead to shorter lifetimes at the chip and system level. It has been shown [24] that the lifetime of electrical circuits can be cut in half when the operating temperature increases by 10-15 degrees Celsius. Also, the cooling and packaging cost for heat dissipation increases with the total power, as well as with the peak on-chip temperatures [5, 7]. In fact, the cost gradient is steeper at higher values of power and power density. Fundamentally, power- and temperature-related constraints have forced a drastic slowdown in core frequency growth, and have enabled a new generation of lower frequency, multi-core chip designs. However, localized hot spots within each core region continue to be a problem, as vendors strive to increase chip-level throughput while maintaining respectable single-thread performance.

Dynamic thermal management (DTM) techniques [1] have been employed in the past as a hardware solution to limit peak temperatures. These schemes throttle performance to lower power consumption when a preset temperature threshold is reached. A variety of actuating responses to effect such throttling have been proposed: e.g., global clock gating, clock throttling, voltage and/or frequency scaling, etc. However, these drastic hardware throttling measures can result in severe performance degradation for the class of applications that demands very high performance. As explained in [5], distributing the power consumption more evenly across a multi-core chip is the preferred approach to mitigating the thermal dissipation problem without compromising performance. In terms of static design choices, this could lead to relatively simpler cores (and more cores per die over time), supported by thermal-aware floorplanning. However, that alone is not sufficient: in order to deal with and exploit workload variability, improved methods for dynamic redistribution of power are needed to meet power and thermal envelopes without giving up performance.

In this paper, we investigate the trade-offs between (temporal and spatial) hot spot mitigation schemes, thermal time constants, workload variability and chip-level power distributions in the context of live experiments conducted on a 1.2 GHz POWER5 system. Two observations motivate our work: (a) on-chip temperature hot spots and power profiles are closely linked to unit-level utilization, in view of the advanced, fine-grain clock-gating technology prevalent in modern processors; (b) the rise and fall times of on-chip temperatures are typically in the hundreds-of-milliseconds range: at least an order of magnitude larger than OS scheduler ticks, which are in the range of milliseconds. We examine the effects of OS-level temperature-aware scheduling in mitigating on-chip thermal hot spot problems. The heuristics developed are implemented in a prototype thermal-aware Linux scheduler. With the help of actual temperature and execution time measurements on the Linux-operated POWER5 set-up, we are able to demonstrate two key advantages of such an approach: (a) no additional hardware cost is incurred; and (b) known, simple, high-level information (e.g., dynamic thread priority, time constraints for real-time tasks, etc.) can be exploited to achieve dynamic redistribution of on-chip power without noticeable performance loss from a user perspective.

∗ Jeonghwan Choi was an intern at IBM when this work was done. He received his doctoral degree in computer science in 2007 from the Korea Advanced Institute of Science and Technology (KAIST).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED’07, August 27–29, 2007, Portland, Oregon, USA. Copyright 2007 ACM 978-1-59593-709-4/07/0008 ...$5.00.


The rest of the paper is organized as follows. Section 2 presents an initial feasibility study for OS-level thermal mitigation. Section 3 describes our specific prototype Linux-based implementation of a thermal-aware scheduler, along with experimental results. Section 4 points out the related prior work in this topic area. Finally, in Section 5, we conclude with a brief discussion of future work.

2. POWER5 THERMAL MITIGATION STUDY

We perform our experiments on a live 1.2GHz POWER5 system running Bare Metal Linux (BML) [22]. To measure changes in on-chip temperatures, we modified BML to sample the 24 on-chip thermal sensors [3] at every scheduler tick, at a granularity of 4 milliseconds (ms) per sample. We calibrated the thermal sensors using the Spatially-resolved Imaging of Microprocessor Power (SIMP) methodology developed at IBM Yorktown by Hamann et al. [6, 7]. SIMP calibrates the sensors by capturing infra-red images of the POWER5 chip after it reaches stable temperatures, and then comparing the temperatures measured from the images against the readings of the digital temperature sensors [3]. To make infra-red imaging possible, we replaced the metal heat sink of the POWER5 system with a transparent liquid heat sink. Table 1 lists the configuration of the measured POWER5 system.

Table 1: POWER5 configuration
  Processors:   two 1.2GHz out-of-order cores
  H/W threads:  two threads per core
  Memory size:  2GB
  L1 Dcache:    32KB, 4-way
  L1 Icache:    64KB, 2-way
  L2:           1.44MB, no L3

Figure 1: Core hopping reduces on-chip temperatures with small performance impact. (Per-benchmark bars for apsi, fma3d, daxpy, lucas, swim, bzip2, twolf, vortex and vpr: maximum delta temperature (Celsius) from core hopping at 4ms, and %slowdown.)

2.1 Thermal time constant studies

To examine the thermal time constants on the POWER5 chip, we recorded the start and end times of a "hot" micro-benchmark (daxpy) pinned to core 1. In figure 2, the graph on the left shows the changes in temperature relative to the Linux idle loop, measured by each thermal sensor over time. The graphs in the middle and on the right show zoomed-in views of the rising and falling periods for the same benchmark. From the figure on the left, we observe that a change in workload can cause changes in on-chip temperatures of up to 16 ◦ Celsius. From the middle and right figures, we observe that the rise and fall times of temperatures are on the order of hundreds of milliseconds. Because these time constants are much larger than typical OS scheduler ticks of 10 ms or less, on-chip thermal management can be implemented in software (e.g., the OS or hypervisor) and still react in time, before the system reaches higher, more critical thermal conditions.
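As a sanity check on these numbers (our own illustration, not part of the paper's methodology), on-chip heating and cooling are commonly approximated by a first-order exponential model. The C sketch below uses an assumed 200 ms time constant and the 16 ◦C swing reported above to show how little the temperature can move within a single 4 ms tick:

#include <math.h>
#include <stdio.h>

/* First-order thermal model: T(t) = T_final - (T_final - T_0) * exp(-t/tau).
 * tau ~200 ms is an illustrative value consistent with the
 * hundreds-of-milliseconds rise/fall times measured in figure 2. */
static double temp_after(double t_ms, double t0, double t_final, double tau_ms)
{
    return t_final - (t_final - t0) * exp(-t_ms / tau_ms);
}

int main(void)
{
    const double tau  = 200.0;  /* ms, assumed thermal time constant   */
    const double t0   = 0.0;    /* starting delta-T off the idle loop  */
    const double tmax = 16.0;   /* max delta-T observed for daxpy      */

    /* After one 4 ms scheduler tick the core has heated only ~2% of the
     * way toward steady state, so the OS has many ticks in which to
     * react before temperature becomes critical. */
    for (double t = 4.0; t <= 1024.0; t *= 2.0)
        printf("t = %6.0f ms -> delta-T = %5.2f C\n",
               temp_after(t, t0, tmax, tau));
    return 0;
}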

2.2 Core hopping - potential for leveraging spatial heat slack

To demonstrate that an OS scheduler can mitigate on-chip hot spots by leveraging spatial heat slack, we altered single-threaded programs to hop between the two cores every four milliseconds on the dual-core POWER5. While one core is running the program, the other is either running the Linux idle loop or some other daemon tasks. Figure 1 shows the reduction in hot spots and the effect of hopping on execution times. From the figure, we observe that core hopping can reduce temperatures by up to 5.5 ◦ Celsius while causing less than 1% average slowdown. We actually see minor speedups in two of the programs shown in figure 1; however, these probably represent minor experimental noise. Excluding such noise, we observe a 1.08% average slowdown, so the performance impact is still small. Since the two cores share a common on-chip L2 cache, core hopping does not incur additional expensive L2 cache misses, and the 4ms period is large enough that the performance cost of warming up the L1 cache is insignificant. As a result, the performance degradation was measured to be less than 3%.

2.3 Task Scheduling - potential for leveraging temporal heat slack

To demonstrate that an OS scheduler can mitigate hot spots by leveraging temporal heat slack through intelligent workload scheduling, we mix SPEC2K programs that exhibit different thermal characteristics on the POWER5, and use the Linux command taskset to pin all the tasks to the same hardware thread of the same core. Figure 3 shows the change in temperatures over time when running two benchmarks that share the same hardware thread (i.e., the two tasks context switch every 100ms under OS control when their time slices are exhausted). The left graph of figure 3 shows two hot tasks (daxpy) running together, while the right graph shows one hot task (daxpy) paired with one cold task (bzip2). From figure 3, we observe that mixing a cold task with a hot one reduces on-chip temperatures by as much as 5 ◦ Celsius. Unlike core hopping, which requires OS-level intervention at four-millisecond intervals, carefully mixing hot tasks with cold tasks in non-SMT mode can reduce temperatures for long periods without any OS intervention.
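The pinning itself needs no kernel changes: taskset is a thin wrapper over the Linux sched_setaffinity(2) system call. The following user-space sketch (our illustration, not the authors' tooling) pins the calling process to one hardware thread; the choice of logical CPU 0 is arbitrary:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;

    /* Pin this process to logical CPU 0 (one POWER5 hardware thread).
     * Equivalent to: taskset -c 0 <command> */
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0 /* self */, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* exec the benchmark here, e.g. execlp("daxpy", "daxpy", NULL); */
    printf("pinned to CPU 0; pid=%d\n", (int)getpid());
    return EXIT_SUCCESS;
}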


2.4 SMT Scheduling - potential for leveraging temporal heat slack

To demonstrate that the OS scheduler can reduce temperatures by co-scheduling SMT-mode tasks intelligently, we again use the Linux command taskset, this time to pin each task to a different hardware thread of the same core. Figure 5 shows the change in relative temperatures over time when running two benchmarks that share the same core, each using a separate hardware thread in SMT mode. The left graph of figure 5 shows two hot tasks (lucas and swim) running together, while the right graph shows one hot task (lucas) paired with one cold task (vortex). From figure 5, we observe that mixing a cold task with a hot task on SMT reduces on-chip temperatures by up to 3 ◦ Celsius.

3. PROTOTYPING THERMAL-AWARE SCHEDULER IN LINUX

In section 2, our experiments on the real POWER5 system showed the feasibility of implementing chip-level hot spot mitigation techniques in the system hypervisor or OS, because the techniques operate at the granularity of scheduler ticks. To show that a modern OS such as Linux can implement these hot spot mitigation techniques without additional complexity, we first assessed the coding complexity and then developed a prototype thermal-aware scheduler extension to the Linux kernel (linux-2.6.17). Figure 4 shows the three schemes we prototyped in Linux: Heat-balancing, Deferred Execution, and reduced threading with cool-loop augmentation. The three techniques have different triggering conditions and response times. For example, the Heat-balancing technique has a slower response time but less overhead, so we configure the system to trigger Heat-balancing at lower temperatures; if Heat-balancing is not able to stop the elevation of on-chip temperatures, the next technique (Deferred Execution) is triggered. If all three schemes fail to stop the elevation of on-chip temperatures, we assume that the system will fall back on hardware-triggered temperature management such as fetch throttling or frequency/voltage scaling.

Figure 2: Thermal Time Constants are larger than 100ms OS time slices. (Left: thermal characteristic over time, single-thread daxpy on core 0, sampled at 2.5ms, over ~65 seconds; middle: heating characteristic, zoomed-in; right: cooling characteristic, zoomed-in, each over ~800 milliseconds. Y-axes: temperature relative to the Linux idle loop (Celsius); one trace per unit sensor: fpu, isu0, idu, ifu, isu1, fxu, lsu0, lsu1.)

Figure 3: Thermal-aware scheduling reduces on-chip hot spots. ((a) FP/FP (daxpy/daxpy); (b) FP/Integer (daxpy/bzip2). Axes: Celsius vs. time (sec).)

Figure 4: Thermal-aware scheduler prototyped in Bare Metal Linux. (Alternatives to existing POWER5 emergency throttling, ordered from slower response / less overhead to faster response / more overhead: Heat-Balancing; Deferred Execution of Hot Jobs; Reducing threading on SMT / Cool loop.)
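The following C sketch illustrates the tiered trigger logic described above, as it might be evaluated at each scheduler tick. The threshold values and helper names are hypothetical, chosen only to reflect the ordering in figure 4; they are not the prototype's actual identifiers:

/* Illustrative escalation policy checked at every scheduler tick.
 * Thresholds (in degrees C above the idle-loop baseline) are made up
 * for illustration; the slowest/cheapest scheme is tried first. */

enum mitigation { NONE, HEAT_BALANCE, DEFER_HOT_TASK, COOL_LOOP, HW_THROTTLE };

#define T_BALANCE  6.0   /* hypothetical: start heat-balancing early    */
#define T_DEFER    9.0   /* hypothetical: defer the hot task            */
#define T_COOL    12.0   /* hypothetical: drop to cool loop / non-SMT   */
#define T_CRIT    15.0   /* beyond this, hardware throttling takes over */

static enum mitigation pick_mitigation(double delta_t)
{
    if (delta_t >= T_CRIT)
        return HW_THROTTLE;   /* e.g. POWER5 fetch throttling    */
    if (delta_t >= T_COOL)
        return COOL_LOOP;     /* fastest response, most overhead */
    if (delta_t >= T_DEFER)
        return DEFER_HOT_TASK;
    if (delta_t >= T_BALANCE)
        return HEAT_BALANCE;  /* slowest response, least overhead */
    return NONE;
}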

3.1 Heat-Balancing

Modern operating systems employ dynamic load balancing to improve response times and prevent starvation when the system is overloaded (i.e., when the system has at least as many tasks as SMT hardware threads). In Linux, the load-balancing routine in the scheduler typically runs once every few hundred milliseconds (200ms by default). Load balancing uses task migration to minimize the differences in task queue length (i.e., number of tasks) across cores. To enable thermal-aware scheduling in Linux, we implement a preventive scheme by extending the load-balancing code so that it takes the thermal characteristics of each task and core into account when making task-migration decisions. When the system is overloaded, the Heat-balancing extension attempts to assign a mix of hot and cold tasks to each core in order to create opportunities for leveraging temporal heat slack. When the system has fewer tasks than cores, the original load-balancing routine does not perform any task migration, but the heat-balancing routine moves a hot task to a colder, idle core to create opportunities for leveraging spatial heat slack. Because we piggy-back the Heat-balancing scheme onto the original load balancing, the additional performance cost of Heat-balancing is minor. Figure 6 illustrates the Heat-balancing scheduler, where a rectangle represents a core, a solid circle represents a hot task and an empty circle represents a cold task. The dual-core system on the left uses the default Linux scheduler. Although the system is balanced in the sense that both cores have the same number of tasks, core A has three hot tasks while core B has only one; thus core A is more likely to heat up than core B. The system on the right uses the thermal-aware Heat-balancing scheduler extension, which balances the number of hot and cold tasks on each core while honoring the goal of the original load-balancing scheme.
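Figure 6's heat metric, Heat = T_all + w*T_hot, suggests a simple way to drive the migration decision. The sketch below is our reading of that metric under stated assumptions (the weight w and the task representation are illustrative, not the prototype's code):

#include <stddef.h>

/* Per-task thermal classification, e.g. derived from unit-level
 * sensor readings while the task runs. */
struct task { int is_hot; };

struct run_queue {
    struct task *tasks;
    size_t ntasks;
};

/* Heat score from figure 6: Heat = T_all + w * T_hot, where T_all is
 * the total task count, T_hot the hot-task count, and w an assumed
 * weight expressing how much hotter a hot task runs. */
static double queue_heat(const struct run_queue *rq, double w)
{
    size_t hot = 0;
    for (size_t i = 0; i < rq->ntasks; i++)
        hot += rq->tasks[i].is_hot ? 1 : 0;
    return (double)rq->ntasks + w * (double)hot;
}

/* Heat-balancing decision piggy-backed on load balancing: migrate a
 * hot task from the hotter queue when the heat scores are more than
 * one hot task apart, even though queue lengths are equal (cores A
 * and B in figure 6). */
static int should_migrate_hot_task(const struct run_queue *a,
                                   const struct run_queue *b, double w)
{
    return queue_heat(a, w) > queue_heat(b, w) + w;
}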


Figure 5: Thermal-aware task-pairing reduces on-chip hot spots for SMT. ((a) FP/FP pair (lucas-swim); (b) FP/Integer pair (lucas-vortex). Axes: Celsius vs. time (sec).)

Figure 6: Heat-Balancing thermal-aware scheduler. (Left, queue-length balanced, original scheme: core A has load 4 and heat 3, core B has load 4 and heat 1. Right, heat-balanced: each core has load 4 and heat 2. Heat = T_all + w*T_hot, where w is the heat load weight; solid circles denote hot tasks, empty circles cold tasks.)


3.2 Deferred Execution of Hot Tasks

Figure 7: Deferred Execution of Hot Task. (Timeline: Task A runs until overheating is detected; its execution is stopped and it is moved to the end of the queue; Tasks B and C run while the core cools; Task A resumes.)

Figure 8: Deferred Execution of Hot Task Results. (Top: default Linux scheduling of the non-SMT, four-thread workload {daxpy, bzip2, bzip2, bzip2}; bottom: deferred scheduling of the same workload. Y-axis: FPU delta temperature off the Linux idle loop (Celsius); X-axis: time, 0.0-3.5 seconds.)

In some cases, Heat-balancing may not be triggered in time to prevent temperatures from rising. To cover these cases, we implemented a reactive scheme, called Deferred Execution, in the Linux scheduler. When a core has multiple tasks and one of them consistently heats up the core, the scheduler temporarily suspends the time slice of the currently running hot task to allow other, colder tasks to run before the hot task further heats up the core. Figure 7 shows an example of Deferred Execution. In the top graph in figure 8, the system runs the default Linux scheduler. Because daxpy is a fast-heating thread, it heats up the core whenever it gets its full time slice. For clarity we show only the temperature of the FPU, which is the hottest spot for this particular workload. In the bottom graph in figure 8, the system runs our thermal-aware Linux scheduler with Deferred Execution of Hot Tasks enabled. The scheme moves the hot daxpy to the end of the OS runnable queue and runs the other, cooler threads (the bzip2s) to cool down the core. As the bottom graph in figure 8 shows, because of the fairness requirement in Linux, the scheduler is still forced to run the hot task once all other threads have finished their time slices, so the core still heats up occasionally. However, the peak temperature exceeds the 10 ◦ Celsius line less frequently.

Figure 10: Cooling loop for non-SMT. (Timeline: user tasks A, B and C share the core; when overheating is detected, the cool loop runs, temporarily suspending user tasks; execution resumes after cooling.)

3.3 Cool loop for SMT and single-thread

When a system is fully loaded or overloaded with hot tasks, there is not enough heat slack for either Heat-balancing or Deferred Execution to leverage in reducing on-chip hot spots. In such cases, the system needs to employ fast-responding, workload-reducing schemes that lower temperature at the expense of performance. Today's POWER5 systems employ hardware-enabled fetch throttling to reduce workload, and thus temperature, when a certain thermal threshold is reached. In POWER5, with a standard package/cooling system, this is intended merely as a damage-control measure, not something that routinely kicks in during normal operation. Most Intel microprocessors, on the other hand, use some form of dynamic thermal management in which a small percentage of hot workloads routinely invoke hardware throttling to mitigate thermal problems; this enables the use of less expensive cooling solutions. Ghiasi et al. show in [9] that such an approach can lead to starvation and drastic performance shortfalls. To lower temperatures while maintaining reasonable throughput, we implement a new kernel task called the Cool Loop to create opportunities for temporal heat slack. The Cool Loop can be thought of as the OS idle loop, but with a higher OS priority. It could consist of no-op instructions or power-managing instructions that lower the core temperature through fetch throttling, frequency/voltage scaling or power gating; in our case, we implement it as an empty loop that our measurements show to be cold enough. While the Cool Loop is running it performs no useful computation, but it has an OS priority higher than user tasks and lower than interrupt service routines and scheduler ticks. Allowing interrupt servicing does not impose additional heating risk because interrupt routines, including scheduler ticks, are typically short. The scheduler checks the core temperature at every tick and resumes user tasks when the core temperature drops to a preset, colder temperature. Figure 9 shows the Cool Loop running on an SMT core.

On an SMT core, the temperature can be lowered by reducing the number of active hardware threads. When the SMT core heats up, the Cool Loop runs on each hardware thread in an interleaved fashion, effectively disabling SMT. Because SMT increases the utilization of a core by overlapping cache misses and pipeline stalls, disabling SMT reduces core utilization and therefore the core's active power and temperature. Figure 10 shows the Cool Loop running in a non-SMT system, where the user task is temporarily suspended to allow for cooling. (A user-space sketch of the cool-loop idea appears after figure 9 below.)

Figure 9: Cool loop on SMT disables threading to reduce heating. (Timeline: the two hardware threads run Task A/Task D and Task B/Task E; when overheating is detected, the cool loop runs interleaved with the remaining tasks, leaving effectively one active thread; Tasks A and D resume after cooling.)
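A user-space approximation of the Cool Loop can convey the idea: a SCHED_FIFO thread outranks ordinary SCHED_OTHER user tasks on its CPU, and spinning in an empty loop stands in for the kernel's cold loop. Here read_core_delta_t() is a hypothetical stub for the prototype's per-tick sensor read, and both thresholds are assumptions:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <time.h>

/* Hypothetical stand-in for the prototype's sensor read: delta-T (C)
 * of this core over the idle loop. Real code would read the POWER5
 * on-chip thermal sensors. */
static double read_core_delta_t(void) { return 0.0; /* stub */ }

#define T_HOT   12.0  /* assumed trigger temperature */
#define T_COLD   8.0  /* assumed resume temperature  */

/* Cool-loop thread: at real-time priority, so while it spins, lower
 * priority user tasks on this CPU cannot run; the "work" is an empty
 * loop. Interrupts and scheduler ticks still preempt it. */
static void *cool_loop(void *arg)
{
    (void)arg;
    for (;;) {
        if (read_core_delta_t() >= T_HOT) {
            /* Hold the CPU, doing nothing, until the core cools. */
            while (read_core_delta_t() > T_COLD)
                ;                     /* empty "cold" loop */
        }
        /* Poll again at roughly scheduler-tick granularity. */
        struct timespec tick = { 0, 4 * 1000 * 1000 }; /* 4 ms */
        nanosleep(&tick, NULL);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 1 }; /* above SCHED_OTHER */

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    pthread_create(&t, &attr, cool_loop, NULL); /* needs CAP_SYS_NICE */
    pthread_join(t, NULL);
    return 0;
}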

4. RELATED WORK

Many researchers have addressed on-chip dynamic thermal management (DTM) at various levels in time and space, using both hardware and software. The hardware techniques in [1, 17, 8, 2, 15] demonstrate that distributing activity among local units helps reduce local hot spots, while [1] shows that the temperature of a single-core processor can be reduced through DVFS, fetch throttling, speculation control and other techniques. POWER5 [3] employs two-stage fetch throttling when its on-chip thermal sensors detect critical temperatures. (Again, for POWER5, this feature is a damage-control device, not intended for routine invocation under normal operation.) Such hardware techniques are important for protecting the chip and, in some cases, enable a cheaper package/cooling solution without affecting the performance of most applications. DTM support of this kind is featured in many commercial processors today, even though some hardware throttling techniques can cause significant performance impact in some workload scenarios, as demonstrated in [9].

Modern microprocessors that combine SMT with multiple cores on a chip present new opportunities for thermal management through task scheduling and migration in system-level software such as the OS and the system hypervisor. Because the OS already performs task scheduling at a time granularity much smaller than the thermal time constants, thermal mitigation techniques that leverage spatial and temporal heat slack can piggy-back on the scheduling routines without noticeable performance impact. CMP cores typically share caches for better hit ratios. These shared caches reduce the penalty of migrating a task from one core to another, making core hopping a viable thermal mitigation technique. Various core-hopping techniques appear in [20, 11, 16], while [4] proposes core hopping on heterogeneous CMP systems. Meanwhile, because the combination of tasks scheduled on a core has a significant impact on performance [18, 19, 21] and temperature [14], temperature-aware SMT co-scheduling is another viable thermal mitigation technique. The work in [10] shows that when thermal time constants are larger than OS time slices, task ordering can reduce on-chip temperature and improve performance by reducing the overheating that triggers hardware-based fetch throttling. Other work [23, 13, 12] demonstrates OS-level thermal mitigation that combines OS scheduling with hardware techniques such as DVFS. In this work, we operate on real temperature measurements from on-chip, unit-level digital sensors rather than the temperature estimation schemes used in [23, 13, 12]. Although [12] argues that CMP architectures aggravate thermal problems to the point that tasks must be migrated off-chip for cooling, we show that the OS can leverage spatial and temporal heat slack within a CMP chip to reduce temperatures significantly with little performance impact.

5. CONCLUSIONS

In this paper, we investigated the trade-offs between temporal and spatial hot spot mitigation schemes, thermal time constants, workload variations and microprocessor power distributions on a live 1.2GHz POWER5 system. We have shown that, owing to power management techniques in POWER5 such as advanced clock gating, changing the task assignments and sequences in a workload has a significant effect on on-chip temperature. In addition, we examined the effects of OS-level temperature-aware scheduling in mitigating these on-chip temperature issues. Two observations motivate our work: 1) because of advanced clock-gating technology, on-chip temperatures are closely related to unit utilization; 2) the rise and fall times of on-chip temperatures are typically in the hundreds of milliseconds, significantly larger than OS-level scheduler ticks, which are on the order of milliseconds. By leveraging spatial and temporal heat slack, our schemes lower on-chip unit temperatures by changing the workload in a timely manner with OS and existing hardware support. Our schemes reduce temperatures by up to 5.5 ◦ Celsius, and by 3.5 ◦ Celsius on average, with a 1.08% mean performance impact. We expect further temperature reductions in future high-performance, multi-/many-core systems, since workload variability and more aggressive power management techniques will increase the available spatial and temporal slack in such systems. Even though our experiments were done at the OS level to quantify real savings from temperature-aware scheduling, we believe that the system hypervisor can also implement these schemes: by interacting with and regulating the multiple OSs running on top of it, the hypervisor can achieve similar temperature savings. We will explore hypervisor-layer techniques in future work.

6. REFERENCES

[1] D. Brooks and M. Martonosi. Dynamic thermal management for high-performance microprocessors. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), January 2001.
[2] P. Chaparro, G. Magklis, J. Gonzalez, and A. Gonzalez. Distributing the frontend for temperature reduction. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), February 2005.
[3] J. Clabes, J. Friedrich, M. Sweet, J. DiLullo, S. Chu, D. Plass, J. Dawson, P. Muench, L. Powell, M. Floyd, B. Sinharoy, M. Lee, M. Goulet, J. Wagoner, N. Schwartz, S. Runyon, G. Gorman, P. Restle, R. Kalla, J. McGill, and S. Dodson. Design and implementation of the POWER5 microprocessor. In Proceedings of the Design Automation Conference (DAC), 2004.
[4] S. Ghiasi and D. Grunwald. Thermal management with asymmetric dual-core designs. Technical Report CU-CS-965-03, University of Colorado, 2004.
[5] S. Gunther, F. Binns, D. Carmean, and J. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, 5, February 2001.
[6] H. F. Hamann, J. Lacey, A. Weger, and J. Wakil. Spatially-resolved imaging of microprocessor power (SIMP). In Proceedings of the Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), May 2006.
[7] H. F. Hamann, A. Weger, J. Lacey, Z. Hu, P. Bose, E. Cohen, and J. Wakil. Hotspot-limited microprocessors: Direct temperature and power distribution measurements. IEEE Journal of Solid-State Circuits, 42:56-65, January 2007.
[8] S. Heo, K. Barr, and K. Asanovic. Reducing power density through activity migration. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), August 2003.
[9] R. Kotla, S. Ghiasi, T. Keller, and F. Rawson. Scheduling processor voltage and frequency in server and cluster systems. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), April 2005.
[10] E. Kursun, C.-Y. Cher, A. Buyuktosunoglu, and P. Bose. Investigating the effects of task scheduling on thermal behavior. In Proceedings of the Workshop on Temperature-Aware Computer Systems (TACS), June 2006.
[11] E. Kursun, G. Reinman, S. Sair, A. Shayesteh, and T. Sherwood. Low-overhead core swapping for thermal management. In Proceedings of the Power-Aware Computer Systems Workshop, December 2004.
[12] A. Merkel and F. Bellosa. Balancing power consumption in multiprocessor systems. In Proceedings of the ACM SIGOPS EuroSys Conference, April 2006.
[13] A. Merkel, F. Bellosa, and A. Weissel. Event-driven thermal management in SMP systems. In Proceedings of the Workshop on Temperature-Aware Computer Systems (TACS), June 2005.
[14] M. D. Powell, M. Gomaa, and T. N. Vijaykumar. Heat-and-run: Leveraging SMT and CMP to manage power density through the operating system. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XI), October 2004.
[15] M. D. Powell, E. Schuchman, and T. N. Vijaykumar. Balancing resource utilization to mitigate power density in processor pipelines. In Proceedings of the Annual International Symposium on Microarchitecture (MICRO), December 2005.
[16] A. Shayesteh, E. Kursun, T. Sherwood, S. Sair, and G. Reinman. Reducing the latency and area cost of core swapping through helper engines. In Proceedings of the International Conference on Computer Design (ICCD), 2005.
[17] K. Skadron, M. Stan, W. Huang, and S. Velusamy. Temperature-aware microarchitecture. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2003.
[18] A. Snavely and D. Tullsen. Symbiotic jobscheduling for a simultaneous multithreading processor. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), November 2000.
[19] A. Snavely, D. Tullsen, and G. Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2002.
[20] J. Srinivasan and S. V. Adve. Predictive dynamic thermal management for multimedia applications. In Proceedings of the International Conference on Supercomputing (ICS), June 2003.
[21] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 1995.
[22] T. Venton, M. Miller, R. Kalla, and A. Blanchard. A Linux-based tool for hardware bring-up, Linux development, and manufacturing. IBM Systems Journal, 44(2):319-329, 2005.
[23] A. Weissel and F. Bellosa. Dynamic thermal management for distributed systems. In Proceedings of the Workshop on Temperature-Aware Computer Systems (TACS), June 2004.
[24] L. Yeh and R. Chu. Thermal Management of Microelectronic Equipment. American Society of Mechanical Engineers, 2001.
