IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 6, JUNE 2011
Runtime Power Management of 3-D Multi-Core Architectures Under Peak Power and Temperature Constraints
Kyungsu Kang, Member, IEEE, Jungsoo Kim, Member, IEEE, Sungjoo Yoo, Member, IEEE, and Chong-Min Kyung, Fellow, IEEE
Abstract—3-D integration is a new technology that overcomes the limitations of 2-D integrated circuits, e.g., power and delay induced from long interconnect wires, by stacking multiple dies to increase logic integration density. However, chip-level power and peak temperature are the major performance limiters in 3-D multi-core architectures. In this paper, we propose a runtime power management method for both peak power and temperature-constrained 3-D multi-core systems in order to maximize the instruction throughput. The proposed method exploits dynamic temperature slack (defined as peak temperature constraint minus current temperature) and workload characteristics (e.g., instructions per cycle and memory-boundness) as well as thermal characteristics of 3-D stacking architectures. Compared with existing thermal-aware power management solutions for 3-D multi-core systems, our method yields up to 34.2% (average 18.5%) performance improvement in terms of instructions per second without significant additional energy consumption. Index Terms—3-D integration, chip-multiprocessor, dynamic voltage and frequency scaling (DVFS), power management, thermal management.
Manuscript received March 30, 2010; revised August 12, 2010; accepted November 27, 2010. Date of current version May 18, 2011. This work was supported by the National Research Foundation of Korea, by the Korean Government (MEST), under Grant 2010-0000823, and by Hynix Semiconductor, Inc., Icheon, Korea. This paper was recommended by Associate Editor Y. Xie.
K. Kang, J. Kim, and C.-M. Kyung are with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea (e-mail: [email protected]; [email protected]; [email protected]).
S. Yoo is with the Department of Electronics and Electrical Engineering, Pohang University of Science and Technology, Pohang 790-784, Korea (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2010.2101371
I. Introduction
MULTICORE systems are widely used across many application domains including general-purpose, embedded, network, digital signal processing, and graphics [1]–[7]. The number of cores on a chip is expected to grow as a result of device scaling (for 2-D dies) and 3-D integration technology. 3-D integration, by stacking multiple dies to achieve high core density, causes a drastic increase in the power density (i.e., power dissipation per unit volume) of the chip. The high power density incurs temperature-related problems in reliability (e.g., negative bias temperature instability, electromigration, time-dependent dielectric breakdown, and thermal cycling), power
consumption (i.e., temperature-induced leakage power), performance (i.e., increased circuit delay as temperature increases), and system cost (e.g., cooling and packaging cost). There are several dynamic thermal management (DTM) solutions for controlling chip temperature, such as stopping with power/clock gating, running at low frequencies and/or low voltages, and issuing fewer instructions/functions [8]–[10]. When the operating temperature approaches the thermal limit, these DTM solutions can effectively reduce the temperature by reducing chip power consumption, but they inevitably lead to performance degradation. These works, which focus on thermal problems in 2-D multi-core architectures, cannot be directly applied to 3-D multi-core architectures owing to the significant difference between the thermal characteristics of 3-D stacking and those of 2-D multi-core architectures.
Another key challenge in multi-core architectures is to meet the peak power consumption constraint that comes from the limited capabilities of chip cooling and power delivery [44]. However, total chip power consumption continues to increase as the number of cores increases and the supply voltage reduction stagnates. Various techniques to boost chip performance while meeting the peak power constraint have been proposed [11]–[13].
Previous studies on power and thermal management have two limitations. First, they control only the steady-state chip temperature, not the instantaneous one: chip power is adjusted such that the expected steady-state temperature does not exceed the thermal limit. Second, previous papers focus on either peak power [13], [25], [26] or peak temperature [22]–[24], even though both constraints are key factors to consider in performance improvement.
Our power management solution assumes that dynamic voltage and frequency scaling (DVFS) is applied in a per-core manner to fully exploit the heterogeneous states of each core, such as the operating temperature and the memory access behavior of the running software. Recently, W. Kim et al. [49] showed that per-core DVFS supported by on-chip voltage regulators, which are able to perform voltage changes in nanoseconds and consume little energy, provides precise voltage/frequency control to obtain either better performance or lower energy consumption. As the number of cores within a chip multiprocessor (CMP) increases, per-core DVFS is expected to become the dominant means of managing the heterogeneity among the cores.
0278-0070/$26.00 © 2011 IEEE
In this paper, we present a runtime power management solution to maximize the performance [in terms of instructions per second (IPS)] of 3-D multi-core systems while satisfying the peak constraints on both power consumption and temperature. The contribution of this paper is twofold. First, this paper explicitly considers dynamic thermal effects when making power budgeting decisions in order to maximize the system performance for 3-D multi-core architectures. Unlike previous DTM solutions for 3-D multi-core architectures, which are limited to steady-state temperature analysis, we deal with the instantaneous temperature to more aggressively exploit the temperature slack (defined as peak temperature constraint minus current temperature) for performance improvement while meeting the peak power constraint. Second, we exploit both the memory access behavior of the software program (i.e., central processing unit, CPU, stall time induced by external memory access) and the heterogeneous cooling efficiency of 3-D integrated circuits (ICs) [e.g., silicon dies (layers) closer to the heat sink have higher cooling efficiency] in order to improve the performance.
This paper is organized as follows. Section II reviews related works. Section III explains preliminaries. Section IV gives the motivation for our work. Section V presents the problem definition. Sections VI and VII explain the proposed method, i.e., global power budgeting and thread migration methods, respectively. Section VIII presents the experimental setup. Section IX reports experimental results. Section X concludes this paper.
II. Related Work
In 2-D multi-core architectures, performance maximization under peak power constraints has been studied in [13], [25], and [26]. Annavaram et al.
[13] showed that a program with a high degree of thread-level parallelism can yield a significant speed-up when running on a system consisting of four low-frequency cores rather than a single high-frequency core under a fixed global power budget. Isci et al. [25] proposed a global power management method which keeps chip-level power under a specified power budget without significant degradation of chip-level throughput by utilizing the CPU idle times owing to external memory access. Bergamaschi et al. [26] presented a simulation model for multi-core systems with an integrated power management module. Based on this simulation model, the authors proposed the MaxBIPS algorithm, which selects the frequency/voltage levels of cores that maximize IPS while not exceeding the power budget. Ebi et al. [47] proposed a proactive power distribution method based on a per-core agent, implemented in a mix of hardware and software, to evenly distribute power and to reduce the peak temperature while meeting the given deadline constraint. The per-core agent performs DVFS and/or power gating according to the measured temperature of the core.
Many works on thermal modeling and DTM for systems where thread scheduling is already done have been proposed [14]–[17], [23], [24], [46]. Skadron et al. [14] proposed a thermal modeling tool, which utilizes an equivalent circuit comprising the thermal characteristics of the package. Han et al. [15] proposed a time-invariant linear thermal system that uses an adaptive time
interval to linearize the temperature calculation and speed up the thermal analysis. Kumar et al. [16] proposed a regression-based thermal model that uses hardware performance counters available in the processor. In [16], the authors also proposed HybDTM, which controls chip temperature by clock gating or limiting task execution when the temperature reaches a thermal threshold. Atienza et al. [17] proposed a highly accurate FPGA-based thermal emulation framework which reduces simulation time for large multi-core systems. Murali et al. [23] proposed a design-time solution of DVFS scheduling based on convex optimization which optimizes performance while meeting the given power and temperature constraints. In [24], Zhang et al. presented a stochastic temperature-aware DVFS method to keep the expected latency within the designer-specified level while meeting the condition that the probability of the peak temperature exceeding a given threshold is sufficiently small. In [46], the authors solved the temperature-constrained DVFS problem with a convex optimization solver to maximize processing speed. The solution can be interpreted as nonlinear static control laws, which adjust the processor speeds based on the measured temperatures in the system.
Temperature-aware thread scheduling is widely used in 2-D multi-core systems to improve the chip performance or to reduce thermal hotspots by distributing heat generation more uniformly across the chip [18], [19]. Gomaa et al. [18] proposed the Heat-and-Run scheme, which is the first work to perform thermal-aware thread scheduling for multi-core multithreaded systems. Choi et al. [19] investigated the tradeoffs of temporal and spatial hotspot mitigation schemes, thermal time constants, workload variations, and microprocessor power distributions on a POWER5 system.
DTM combining DVFS with thread scheduling has been studied in [20]–[22]. Coskun et al.
[20] proposed a static instruction-level parallelism (ILP)-based design-time solution and an adaptive dynamic policy to minimize temperature hotspots and gradients of the multi-core system with per-core DVFS capability. Chantem et al. [21] proposed a thread scheduling technique that uses a mixed-integer linear programming solver to minimize the peak temperature of multi-core systems based on steady-state thermal analysis. They also proposed two heuristic approaches for thread scheduling based on steady-state and transient thermal analysis. Donald et al. [22] suggested a combination of sensor-based thread scheduling with distributed DVFS as the best solution among various thermal management techniques to improve system performance while meeting chip temperature constraints.
There are prior works on circuits for 3-D multi-core systems [27]–[31]. Zhao et al. [28] proposed a design-time approach based on polynomial programming that addresses the problem of processor voltage/frequency assignment in 3-D multi-core systems, such that the total system performance can be maximized while the temperature and power constraints are met. In [30] and [31], the authors proposed temperature-aware thread scheduling methods for 3-D multi-core systems to tune the thermal profile at runtime. Sun et al. [27] proposed a design-time thermal optimization algorithm that conducts thread scheduling and voltage/frequency scaling. In [27], thread-level slack distribution, which assigns time slack (time-to-deadline minus worst case remaining execution time) to each thread, is
used to perform voltage/frequency scaling, which permits peak temperature minimization. Zhu et al. [29] proposed an analytic framework for temperature and performance tradeoffs in 3-D multi-core systems. The algorithm in [29] dynamically determines the thread scheduling and the voltage/frequency of each core according to the switching activities of input workloads and the 3-D floorplan, such that the steady-state temperature of each core does not exceed the specified temperature limit.
DTM solutions in [16], [18], [19], and [29]–[31] can implicitly meet thermal constraints by using online adaptation informed by thermal sensing or models. DVFS and/or clock throttling, which are independent of the global thermal management techniques, can be used for the online adaptation that reacts to the transient temperature variation when the transient temperature approaches the thermal limits. These online adaptations trade performance for temperature reduction, often incurring a performance penalty. On the contrary, DTM solutions in [20], [21], [23], [24], [27], [28], and [46] explicitly meet thermal constraints by using internal predictive dynamic thermal models. However, all these solutions are design-time approaches.
In this paper, the proposed method of runtime power management aims at maximizing performance, i.e., IPS, in 3-D multi-core systems without violating the given power and temperature constraints. Both the power management method in [29] and our method are runtime solutions that combine thread scheduling and DVFS to maximize the system throughput. However, in [29], the peak power limit and the dynamic temperature slack are not taken into account. The dynamic temperature slack can be utilized for aggressively setting clock frequencies in a timely manner to maximize the system throughput while meeting the chip-level power budget and temperature constraints.
Our work is closely related to [30] on runtime thermal management using the heterogeneous thermal coupling (e.g., temperatures among vertically adjacent silicon dies are more correlated than those among horizontally adjacent silicon dies) and cooling efficiency of 3-D multi-core systems. Both [30] and our work perform core-set (i.e., the set of cores stacked at the same horizontal position) level thread scheduling to reduce the computational complexity. Instead of using online adaptation to meet dynamic thermal constraints, we propose a mathematical formulation to optimize power budgeting and voltage/frequency selection for 3-D multi-core systems. Our method also exploits, in thread scheduling, both the memory access behavior of the software program, i.e., the stall time of the CPU caused by external memory access, and heterogeneous cooling efficiency in order to increase the chip-level performance. Existing power management techniques for 2-D ICs, which consider the effect of memory access behavior, are not appropriate for 3-D multi-core systems with thermal heterogeneity [13], [25], [26].
III. Preliminaries
A. Thermal Characteristics of 3-D Multi-Core Systems
Fig. 1 illustrates a 3-D multi-core chip package. A multi-core chip contains multiple vertically stacked silicon layers.
Fig. 1. 3-D multi-core chip package structure.
Fig. 2. Simplified thermal model of a general 3-D multi-core system in which each core has uniform temperature distribution (adapted from [29]).
Each silicon layer contains processor cores and memory modules. One side of the multi-core chip is connected to a substrate through the printed circuit board (PCB). The other side of the chip is attached to the heat sink, where most heat is dissipated. Heat flow within a package can be modeled by the well-known analogy between heat transfer and electric circuit phenomena in a resistor–capacitor (RC) network [32]. Fig. 2 illustrates a coarse-grained thermal model of a 3-D multi-core system in which each core is represented with a thermal model element (i.e., a thermal resistance, a thermal capacitance, and a current source). In Fig. 2, the heat sink is located at the bottom of the chip stack. For a thermal model to be accurate, each block represented as a thermal model element must be small enough for the temperature to be assumed uniform within the block. Thus, we use a fine-grained grid thermal model using HotSpot [33]. However, for simplicity of explanation, we utilize the simplified thermal model of Fig. 2 in this section to explain the thermal characteristics of a 3-D multi-core system. In this section, we explain two thermal characteristics which are essential for understanding 3-D multi-core power management: 1) heterogeneous thermal coupling, and 2) heterogeneous cooling efficiency.
1) Heterogeneous Thermal Coupling: In the thermal RC network shown in Fig. 2, the power consumption of a core can influence the temperature of other cores as well as its own temperature. Particularly in 3-D multi-core systems, vertically adjacent cores have a much larger thermal influence on each other than horizontally adjacent cores because heat dissipates mostly in the vertical direction. In Fig. 1, the thickness of
a silicon layer is very small ( 2):

$$W = f_1 \cdot D + f_2 \cdot D = \beta\sqrt{P_1^D} \cdot D + \beta\sqrt{P_2^D} \cdot D. \tag{25}$$

Now we assume that core 1 exhausts its temperature slack earlier than core 2, as shown in Fig. 14(b). In this case, both cores run together until $t_1$ $(= D - t)$ while the power budgets of cores 1 and 2 are set to $P_1^D + \alpha$ and $P_2^D - \alpha$, respectively. At time $t_1$, the temperature slack of core 1 is exhausted and core 1 enters SLEEP mode. Then, the power consumption of core 2 is increased to $P_{\max}$. Thus, the executed workload $W$ until time $D$ is calculated as follows:

$$W = \beta\sqrt{P_1^D + \alpha} \cdot (D - t) + \beta\sqrt{P_2^D - \alpha} \cdot (D - t) + \beta\sqrt{P_{\max}} \cdot t. \tag{26}$$

The derivative of $W$ with respect to $t$ is as follows, where $P_2^D$ has been written as $P_{\max} - P_1^D$:

$$\frac{dW}{dt} = -\beta\sqrt{P_1^D + \alpha} - \beta\sqrt{P_{\max} - P_1^D - \alpha} + \beta\sqrt{P_{\max}}. \tag{27}$$

Since $\beta\sqrt{x}$ is a concave function that vanishes at $x = 0$, we have $\sqrt{a} + \sqrt{b} \geq \sqrt{a + b}$ for nonnegative $a$ and $b$. The radicands of the two negative terms in (27) sum to $P_{\max}$, so $dW/dt$ is always negative and $W$ in (26) is maximized when $t$ is zero. Thus, the performance is maximized when all cores exhaust their temperature slacks at the same time. To do that, we need to allocate the power budget of each core such that (8) is satisfied. (QED)
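The argument around (26) and (27) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the frequency model f = β√P suggested by the concavity remark, assumes the two power budgets sum to P_max as the second term of (27) implies, and uses hypothetical function names (`executed_workload`, `workload_slope`) and arbitrary illustrative values.

```python
import math


def executed_workload(t, p1, alpha, p_max, deadline, beta=1.0):
    """Workload W as in (26).

    Both cores run for (deadline - t); core 2 then runs alone at p_max for t.
    Assumption: the budget of core 2 is p_max - p1, as implied by (27).
    """
    p2 = p_max - p1
    both = beta * (math.sqrt(p1 + alpha) + math.sqrt(p2 - alpha)) * (deadline - t)
    solo = beta * math.sqrt(p_max) * t
    return both + solo


def workload_slope(p1, alpha, p_max, beta=1.0):
    """dW/dt as in (27); it does not depend on t."""
    return beta * (
        -math.sqrt(p1 + alpha)
        - math.sqrt(p_max - p1 - alpha)
        + math.sqrt(p_max)
    )


if __name__ == "__main__":
    # Arbitrary illustrative values (not taken from the paper).
    p1, alpha, p_max, deadline = 4.0, 1.0, 10.0, 1.0
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        w = executed_workload(t, p1, alpha, p_max, deadline)
        print(f"t = {t:.2f}  W = {w:.4f}")
    print(f"dW/dt = {workload_slope(p1, alpha, p_max):.4f}")
```

Under these assumptions the slope is negative for any split in which both cores receive a positive budget, so the printed workload shrinks as t grows, which matches the conclusion that W is largest at t = 0.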
References
[1] Intel Products [Online]. Available: http://www.intel.com/products/processor/index.htm
[2] AMD Product [Online]. Available: http://www.amd.com/US/PRODUCTS/Pages/products.aspx
[3] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn, “UltraSPARC T2: A highly-threaded, power-efficient, SPARC SoC,” in Proc. ASSCC, Nov. 2007, pp. 22–25.
[4] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O’Connell, D. Q. Nguyen, B. J. Ronchetti, W. M. Sauer, E. M. Schwarz, and M. T. Vaden, “IBM POWER6 microarchitecture,” IBM J. Res. Develop., vol. 51, no. 6, pp. 639–662, Nov. 2007.
[5] ARM Products [Online]. Available: http://www.arm.com/products/CPUs/index.html
[6] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla: A unified graphics and computing architecture,” IEEE Micro, vol. 28, no. 2, pp. 39–55, Mar. 2008.
[7] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar, “An 80-tile 1.28 TFLOPS networks-on-chip in 65 nm CMOS,” in Proc. ISSCC, Feb. 2007, pp. 98–100.
[8] A. Kumar, S. Li, P. Li-Shiuan, and N. K. Jha, “HybDTM: A coordinated hardware-software approach for dynamic thermal management,” in Proc. DAC, Sep. 2006, pp. 548–553.
[9] R. Rao, S. Vrudhula, C. Chakrabarti, and N. Chang, “An optimal analytical solution for processor speed control with thermal constraints,” in Proc. ISLPED, Oct. 2006, pp. 292–297.
[10] J. Yang, X. Zhou, M. Chrobak, Y. Zhang, and L. Jin, “Dynamic thermal management through task scheduling,” in Proc. Int. Symp. ISPASS, Apr. 2008, pp. 191–201.
[11] Intel Turbo Boost White Paper [Online]. Available: http://www.intel.com/technology/turboboost
[12] J. Lee and N. S. Kim, “Optimizing throughput of power and thermal-constrained multicore processors using DVFS and per-core power gating,” in Proc. DAC, Jul. 2009, pp. 47–50.
[13] M. Annavaram, E. Grochowski, and J. Shen, “Mitigating Amdahl’s law through EPI throttling,” in Proc. ISCA, Jun. 2005, pp. 298–309.
[14] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan, “Temperature-aware microarchitecture: Modeling and implementation,” ACM Trans. Architect. Code Optimiz., vol. 1, no. 1, pp. 94–125, Mar. 2004.
[15] Y. Han, I. Koren, and C. M. Krishna, “Temptor: A lightweight runtime temperature monitoring tool using performance counters,” in Proc. 3rd Workshop TACS, Held Conjunct. ISCA-33, 2006.
[16] A. Kumar, S. Li, P. Li-Shiuan, and N. K. Jha, “System-level dynamic thermal management for high-performance microprocessors,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 27, no. 1, pp. 96–108, Jan. 2008.
[17] D. Atienza, P. G. Del Valle, G. Paci, F. Poletti, L. Benini, G. De Micheli, and J. M. Mendias, “A fast HW/SW FPGA-based thermal emulation framework for multiprocessor system-on-chip,” in Proc. DAC, Sep. 2006, pp. 618–623.
[18] M. Gomaa, M. D. Powell, and T. N. Vijaykumar, “Heat-and-Run: Leveraging SMT and CMP to manage power density through the operating system,” in Proc. ASPLOS, Nov. 2004, pp. 260–270.
[19] J. Choi, C. Chen-Yong, H. Franke, H. Hamann, A. Weger, and P. Bose, “Thermal-aware task scheduling at the system software level,” in Proc. ISLPED, Aug. 2007, pp. 213–218.
[20] A. K. Coskun, T. T. Rosing, K. A. Whisnant, and K. C. Gross, “Static and dynamic temperature-aware scheduling for multiprocessor SoCs,” IEEE Trans. Very Large Scale Integr. Syst., vol. 16, no. 9, pp. 1127–1140, Sep. 2008.
[21] T. Chantem, R. P. Dick, and X. S. Hu, “Temperature-aware scheduling and assignment for hard real-time applications on MPSoCs,” in Proc. DATE, Mar. 2008, pp. 288–293.
[22] J. Donald and M. Martonosi, “Techniques for multicore thermal management: Classification and new exploration,” in Proc. ISCA, 2006, pp. 78–88.
[23] S. Murali, A. Mutapcic, D. Atienza, R. Gupta, S. Boyd, and G. De Micheli, “Temperature-aware processor frequency assignment for MPSoCs using convex optimization,” in Proc. CODES+ISSS, Sep. 2007, pp. 111–116.
[24] S. Zhang and K. S. Chatha, “System-level thermal aware design of applications with uncertain execution times,” in Proc. ICCAD, Nov. 2008, pp. 242–249.
[25] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi, “An analysis of efficient multicore global power management policies: Maximizing performance for a given power budget,” in Proc. Int. Symp. Microarchitect., Dec. 2006, pp. 347–358.
[26] R. Bergamaschi, H. Guoling, A. Buyuktosunoglu, H. Patel, I. Nair, G. Dittmann, G. Janssen, N. Dhanwada, H. Zhigang, P. Bose, and J. Darringer, “Exploring power management in multicore systems,” in Proc. ASPDAC, Mar. 2008, pp. 708–713.
[27] C. Sun, L. Shang, and R. P. Dick, “3-D multiprocessor system-on-chip thermal optimization,” in Proc. Int. Conf. Hardware/Software Codes. Syst. Synthesis, Oct. 2007, pp. 117–122.
[28] G. Zhao, H.-K. Kwan, C.-U. Lei, and N. Wong, “Processor frequency assignment in 3-D MPSoCs under thermal constraints by polynomial programming,” in Proc. APCCAS, Nov.–Dec. 2008, pp. 1668–1671.
[29] C. Zhu, Z. Gu, L. Shang, R. P. Dick, and R. Joseph, “3-D chip-multiprocessor runtime thermal management,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 27, no. 8, pp. 1479–1492, Aug. 2008.
[30] X. Zhou, J. Yang, Y. Xu, Y. Zhang, and J. Zhao, “Thermal-aware task scheduling for 3-D multicore processors,” IEEE Trans. Parallel Distributed Syst., vol. 21, no. 1, pp. 60–71, Jan. 2010.
[31] A. K. Coskun, J. L. Ayala, D. Atienza, T. S. Rosing, and Y. Leblebici, “Dynamic thermal management in 3-D multicore architectures,” in Proc. DATE, 2009, pp. 1410–1415.
[32] F. Kreith, Ed., The CRC Handbook of Thermal Engineering. Boca Raton, FL: CRC Press, 2000, pp. 2.1–2.92.
[33] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M. R. Stan, “HotSpot: A compact thermal modeling methodology for early-stage VLSI design,” IEEE Trans. Very Large Scale Integr. Syst., vol. 14, no. 5, pp. 501–513, May 2006.
[34] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, “Die stacking (3-D) microarchitecture,” in Proc. Int. Symp. Microarchitect., Dec. 2006, pp. 469–479.
[35] Y. Yang, C. Zhu, Z. Gu, L. Shang, and R. P. Dick, “Adaptive multidomain thermal modeling and analysis for integrated circuit synthesis and design,” in Proc. ICCAD, Nov. 2006, pp. 575–582.
[36] J. S. Lee, K. Skadron, and S. W. Chung, “Predictive temperature-aware DVFS,” IEEE Trans. Comput., vol. 59, no. 1, pp. 127–133, Jan. 2010.
[37] S. Eyerman and L. Eeckhout, “A memory-level parallelism aware fetch policy for SMT processors,” in Proc. HPCA, 2007, pp. 240–249.
[38] H. Everett, III, “Generalized Lagrange multiplier method for solving problems of optimum allocation of resources,” Oper. Res., vol. 11, no. 3, pp. 399–417, 1963.
[39] N. Sakran, M. Yuffe, M. Mehalel, J. Doweck, E. Knoll, and A. Kovacs, “The implementation of the 65 nm dual-core 64 b Merom processor,” in Proc. ISSCC, Feb. 2007, pp. 106–590.
[40] Intel, “Intel core2 duo processors and Intel core2 extreme processors for platforms based on mobile Intel 965 express chipset family,” Datasheet, 2008, pp. 23–40.
[41] W. Liao, L. He, and K. M. Lepak, “Temperature and supply voltage aware performance and power modeling at microarchitecture level,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 24, no. 7, pp. 1042–1053, Jul. 2005.
[42] Standard Performance Evaluation Corporation [Online]. Available: http://www.specbench.org
[43] Performance Application Programming Interface [Online]. Available: http://icl.cs.utk.edu/papi
[44] R. Kumar, D. M. Tullsen, N. P. Jouppi, and P. Ranganathan, “Heterogeneous chip multiprocessors,” IEEE Comput., vol. 38, no. 11, pp. 32–38, Nov. 2005.
[45] D. Bovet and M. Cesati, Understanding the Linux Kernel, 3rd ed. Cambridge, MA: O’Reilly Publishers, Nov. 2005.
[46] A. Mutapcic, S. Boyd, S. Murali, D. Atienza, G. De Micheli, and R. Gupta, “Processor speed control with thermal constraints,” IEEE Trans. Circuits Syst. Part I: Reg. Papers, vol. 56, no. 9, pp. 1994–2008, Sep. 2009.
[47] T. Ebi, M. A. A. Farugue, and J. Henkel, “TAPE: Thermal-aware agent-based power economy for multi/many-core architectures,” in Proc. ICCAD, Nov. 2009, pp. 302–309.
[48] K. Choi, R. Soma, and M. Pedram, “Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeoff based on the ratio of off-chip access to on-chip computation times,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 24, no. 1, pp. 18–28, Jan. 2005.
[49] W. Kim, M. S. Gupta, G.-Y. Wei, and D. Brooks, “System level analysis of fast, per-core DVFS using on-chip switching regulators,” in Proc. HPCA, 2008, pp. 123–134.
[50] D. Tarjan, S. Thoziyoor, and N. P. Jouppi, “CACTI 4.0,” HP Laboratories, Palo Alto, CA, Tech. Rep. HPL-2006-86, Jun. 2006.
[51] T. Li and L. K. John, “Run-time modeling and estimation of operating system power consumption,” in Proc. SIGMETRICS, 2003, pp. 160–171.
[52] J. W. Chen, M. Dubois, and P. Stenstrom, “Integrating complete-system and user-level performance/power simulators: The SimWattch approach,” in Proc. ISPASS, Mar. 2003, pp. 1–10.
[53] M.-L. Li, R. Sasanka, S. V. Adve, Y.-K. Chen, and E. Debes, “The ALPBench benchmark suite for complex multimedia applications,” in Proc. Int. Symp. Workload Characterization, Oct. 2005, pp. 34–35.
[54] Free Computational Fluid Dynamics [Online]. Available: http://www.freecfd.com
Kyungsu Kang (S’06–M’10) received the B.S. degree from the Department of Electrical and Electronic Engineering, Kyungpook National University, Daegu, Korea, in 2003, and the unified course of the M.S. and Ph.D. degrees from the Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2010. He has been a Post-Doctoral Fellow with KAIST since 2010. His current research interests include dynamic power and thermal management for 2-D/3-D chip multiprocessors.
Jungsoo Kim (S’06–M’10) received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2005, and the unified course of the M.S. and Ph.D. degrees from the Department of Electrical Engineering and Computer Science, KAIST, in 2010. He has been a Post-Doctoral Fellow with KAIST since 2010. His current research interests include dynamic power and thermal management, multiprocessor system-on-chip design, and low-power wireless surveillance system design.
Sungjoo Yoo (M’09) received the B.S., Masters, and Ph.D. degrees in electronics engineering from Seoul National University, Seoul, Korea, in 1992, 1995, and 2000, respectively. He was a Researcher with TIMA Laboratory, Grenoble, France, from 2000 to 2004. He was a Senior and Principal Engineer with Samsung Electronics, Daegu, Korea, from 2004 to 2008. He has been with the Pohang University of Science and Technology, Pohang, Korea, since 2008. His current research interests include low power design and memory/storage architecture for embedded systems.
Chong-Min Kyung (S’76–M’81–SM’99–F’08) received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, in 1975, and the M.S. and Ph.D. degrees in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1977 and 1981, respectively. From April 1981 to January 1983, he was a Post-Doctoral Fellow with Bell Telephone Laboratories, Murray Hill, NJ. He is currently the Hynix Chair Professor with KAIST. Since he joined KAIST in 1983, he has been working on system-on-a-chip design and verification methodology, and processor and graphics architectures for high-speed and/or low-power applications including mobile video codec.
Dr. Kyung received the Most Excellent Design Award and Special Feature Award in the University Design Contest in the Asia and South Pacific Design Automation Conference (ASP-DAC) 1997 and 1998, respectively. He received the Best Paper Awards in the 36th DAC, New Orleans, LA, the 10th International Conference on Signal Processing Application and Technology, Orlando, FL, in September 1999, and the 1999 International Conference on Computer Design, Austin, TX. He was the General Chair of the Asian Solid-State Circuits Conference 2007 and ASP-DAC 2008. In 2000, he received the National Medal from the Korean Government for his contribution to research and education in integrated circuit design. He is a member of the National Academy of Engineering Korea, Seoul, Korea, and the Korean Academy of Science and Technology, Gyeonggi-Do, Korea.