Improving Energy Efficiency of Multi-Threaded Applications using Heterogeneous CMOS-TFET Multicores

Karthik Swaminathan*, Emre Kultursay*, Vinay Saripalli*, Vijaykrishnan Narayanan*, Mahmut Kandemir* and Suman Datta†
* Dept. of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16801
† Dept. of Electrical Engineering, The Pennsylvania State University, University Park, PA 16801
Abstract— Energy-Delay-Product-aware DVFS is a widely used technique that improves energy efficiency by dynamically adjusting the frequencies of cores. Further, for multithreaded applications, barrier-aware DVFS can dynamically tune the frequencies of cores to reduce barrier stall times and achieve higher energy efficiency. In both forms of DVFS, the frequencies of cores are reduced from the maximum value to achieve better energy efficiency. TFET devices operate at energy efficiencies that cannot be achieved by CMOS devices. This advantage of TFET devices can be exploited in the context of multicore processors by replacing some of the CMOS cores with energy-efficient TFET alternatives. However, the energy benefits of TFET devices are observed at relatively low voltages, which degrades performance due to execution at lower frequencies. Although applications cannot always be limited to such low frequencies, it can be significantly beneficial from an energy-efficiency perspective to make use of energy-efficient TFET cores during the time applications spend at these frequencies. In this paper, we show that, due to EDP-aware DVFS and barrier-aware DVFS, multithreaded applications run for a significant portion of their execution time at frequencies at which TFET cores are more energy efficient. We further show that, at those frequencies, dynamically migrating threads to TFET cores can achieve average leakage and dynamic energy savings of 30% and 17%, respectively, with a performance degradation of less than 1%.
I. INTRODUCTION

It is known that applications do not show the same behavior throughout their execution; they typically exhibit dynamic fluctuation in instructions per cycle (IPC). A low IPC typically means that the application is in a memory-bound phase and spends relatively more cycles waiting for a response from the memory hierarchy. In this scenario, dynamically reducing the voltage/frequency of the core executing the application can result in a significant reduction in its leakage and dynamic energy consumption with a relatively small degradation in performance. Therefore, operating cores always at the maximum voltage/frequency is not necessary, and Energy-Delay-Product-aware (EDP-aware) Dynamic Voltage and Frequency Scaling (DVFS [10]) can achieve higher energy efficiency. Another optimization that has been shown to provide energy savings is barrier-aware DVFS [9], which addresses a problem specific to multithreaded applications. Multithreaded applications typically use barriers for synchronization, where a thread that arrives at a barrier must wait until all other threads also reach the barrier. Between any two barriers, the performance of the application is limited by the performance of the slowest thread in that region, leaving all other threads waiting idle at the barrier. Barrier wait times of these threads can be significant if the workload is not evenly distributed across threads. Observing the idle wait times at barriers due to an unequal workload distribution, barrier-aware DVFS can reduce the voltage/frequency of each core individually such that it reaches the barrier at around the same time as the slowest thread. This reduction in supply voltage directly translates into lower leakage and dynamic energy without any significant performance degradation, thereby improving energy efficiency.

Inter-Band Tunnel Field Effect Transistors (TFETs) [12], [13] are novel transistors that demonstrate superior sub-threshold characteristics compared to CMOS devices. TFETs have sub-60mV/decade sub-threshold slopes, whereas CMOS devices are theoretically limited to sub-threshold slopes above 60mV/decade. As shown in Fig. 1(a), TFETs can deliver higher on-currents, achieving higher performance at low Vcc (0.3V) compared to CMOS. Further, sub-60mV/decade operation also means that TFETs have very low leakage current when the transistor is turned off. Inter-band tunneling-based devices, offering unprecedented promise for optimizing power consumption and scaling supply voltages to 0.1V, are projected to be in production before 2020 [5], [6]. On the other hand, CMOS can deliver higher on-current when a large gate voltage overdrive (Fig. 1(b)) is applied, enabling higher maximum operating frequencies than TFETs.

978-1-61284-660-6/11/$26.00 © 2011 IEEE
It is expected that enhancements in the semiconductor materials used at the tunneling junction of TFETs and optimizations of the TFET device structure will result in significant improvements in TFET performance, enabling them to also reach high operating frequencies. This behavior of CMOS and TFET devices indicates that a crossover tradeoff occurs when deciding what type of core to use for an application or thread based on processor utilization. Exploiting this tradeoff can result in large power savings without significant degradation in performance, making heterogeneous systems comprising both CMOS and
TFET cores promising architectures that can achieve both high performance and high energy efficiency. Fig. 2 (from [13]) shows an example of this crossover between CMOS and TFET cores. From this figure, it can be observed that there is a crossover point, around 1.25 GHz, below which it is possible to operate a TFET core at the same frequency as a CMOS core, but at a lower voltage. While operating at the same frequency ensures that no performance degradation will be observed, the reduced supply voltage translates into lower leakage and dynamic energy, which makes the TFET core more energy efficient under these conditions. However, above the crossover frequency, the drive current of CMOS exceeds that of TFET, and CMOS becomes the more energy-efficient option.
The two widely used DVFS mechanisms, namely EDP-aware DVFS and barrier-aware DVFS, both dynamically reduce the voltage/frequency of cores to achieve energy savings. As a result, applying these optimizations causes cores to execute at frequencies lower than their maximum. Considering multithreaded applications running on a multicore, we observe that, due to EDP-aware DVFS and barrier-aware DVFS, a significant portion of application execution time is spent at frequencies at which TFET cores are more energy efficient. Based on this, we show that a heterogeneous CMOS-TFET multicore processor with an energy-aware mechanism that dynamically migrates threads between CMOS and TFET cores can achieve significant average energy savings. Specifically, we make the following contributions:
• We show that, as a result of applying EDP-aware DVFS and barrier-aware DVFS, a significant portion of execution time is spent at frequencies at which TFET cores are more energy efficient,
• We propose a thread migration scheme that moves threads across TFET and CMOS cores based on the DVFS levels of the cores,
• We evaluate our thread migration scheme using multithreaded applications from the SPLASH-2 benchmark suite [17] and show that it can achieve average leakage and dynamic energy savings of 30% and 17%, respectively, with a negligible performance degradation of less than 1%.
The rest of this paper is organized as follows. In Section II, we describe the related work in the two main areas that this paper focuses on, namely, TFET devices and DVFS. Section III explains the DVFS mechanisms we assume throughout this paper and gives the details of our thread migration scheme. We describe our experimental setup in Section IV, present our simulation results in Section V, and conclude with Section VI.

Fig. 1. Change in Ioff and Ion currents for CMOS and TFET devices at different supply voltages. (a) At sub-threshold levels (0.3V), TFETs have better Ioff as well as Ion currents. (b) At larger voltages (0.7V), although the Ioff current of TFET is still lower, CMOS has higher Ion, which enables it to operate at higher frequencies.

Fig. 2. Variation of operating voltage with frequency in CMOS and TFET devices (adapted from [13]). At low frequencies, TFET devices can operate at lower voltages, whereas at high frequencies, CMOS cores require a lower supply voltage.

II. RELATED WORK
In parallel applications using barriers for synchronization, a thread that reaches the barrier early waits until all other threads arrive at the barrier. As a result, overall execution time depends on the performance of the slowest, critical threads that reach the barrier last. In order to reduce the overheads associated with barrier synchronization, various strategies have been proposed. In [15], compiler optimizations that eliminate barriers or convert them into less costly point-to-point synchronization primitives are presented. However, this technique can only eliminate or replace barriers that are redundant in the first place. Barriers that ensure the correctness of applications cannot be eliminated, and these barriers can still incur large overheads. For those barriers, a promising method to reduce energy costs is to employ DVFS. In the context of sequential applications, DVFS [10] has been shown to be an effective technique for achieving higher energy efficiency. For parallel applications, barrier-aware DVFS methods have been proposed to avoid excessive barrier wait times. Liu et al. [9] show that the overheads of barriers are even larger for applications whose threads execute unequal loads. Using history information, future barrier stall times of threads are predicted, and threads with large expected stall times are executed at lower DVFS levels. Threads are assigned DVFS levels such that they reach the barrier at around the same time, improving energy efficiency. In contrast, in [8], when a thread reaches a barrier, its barrier stall time is predicted and the core running that thread is put into a sleep mode for that duration. As a result, instead of continuously checking whether all threads have reached the barrier, the core enters a low-power mode, reducing its leakage and dynamic energy.
When barrier-aware DVFS is considered, it should be noted that the efficiency of any dynamic approach to balancing the barrier wait times of threads depends on the accuracy with which the criticality of threads can be predicted. Complex techniques
to identify critical threads in multithreaded applications have been proposed in [1] and [2].

Prior work on TFETs is mostly limited to the analysis and design of TFET devices and small-scale circuits built from them. Early TFET devices [3], [16] made of Si suffered from low drive currents. In [12], the authors demonstrate 100nm channel length In0.53Ga0.47As TFET devices that can provide large drive currents, and their use in low-power logic design. In [13], the authors show that TFET cores can be used to significantly improve the energy efficiency of sequential applications. Our work can be distinguished from [13] in the following ways. First, considering parallel applications, we apply barrier-aware DVFS to avoid idle wait times at barriers and show that this optimization further increases the energy benefits of TFET cores. Second, unlike their approach, we do not assume a fixed thread-to-core mapping, but dynamically migrate threads across cores. As a result, our approach enables threads to achieve high performance during compute-intensive phases while achieving high energy efficiency during memory-bound phases by migrating to TFET cores.

III. IMPLEMENTATION DETAILS

A. EDP-Aware DVFS

DVFS is a technique that exploits the dynamic changes in the behavior of applications throughout their course of execution. When an application enters a memory-bound phase, its performance is mostly limited by the memory access latency, and therefore the frequency at which the corresponding core runs has little impact on performance. In this case, the voltage/frequency of the core can be reduced in order to reduce leakage power and dynamic energy. On the other hand, when the application enters a compute-bound phase, performance is mostly determined by the frequency of the core, so higher frequencies are desired. In order to exploit this dynamic variation in the optimum execution frequency, DVFS has been proposed [10].
DVFS can be applied to optimize several metrics, such as IPC or EDP. As our goal in this paper is to improve the energy efficiency of multicore processors, we use an EDP-aware DVFS. Specifically, we use a modified version of the greedy EDP-aware DVFS algorithm presented in [14]. This DVFS algorithm dynamically monitors EDP and performs DVFS actions at the end of every epoch. At the i-th epoch, a core can be in any one of N_L DVFS levels, which means that during that epoch the core executes at the voltage (V_i) and frequency (f_i) determined by that DVFS level. At the end of each epoch, one of the following DVFS decisions is carried out: (i) the core moves one DVFS level up (increased frequency and voltage) or (ii) the core moves one DVFS level down (decreased frequency and voltage). This decision is made based on the EDP values observed during the last two epochs and the latest DVFS action performed. At the end of epoch i, the EDP of the core during that epoch (EDP_i) is calculated and compared against the EDP of the previous epoch (EDP_{i-1}). If an improvement in EDP is observed, then the DVFS decision performed in the previous epoch is carried out again for this epoch. However, if the EDP is observed to degrade, then a DVFS decision is made in the direction opposite to the one taken in the previous
epoch. To prevent unnecessarily frequent DVFS level changes during steady phases of execution, we modified this algorithm to also support a third DVFS decision: staying at the same DVFS level when the change in EDP is less than a prescribed threshold. It should be noted that this threshold is set such that the algorithm still preserves its ability to accurately track the dynamic change in application behavior.

B. Barrier-Aware DVFS

Barriers are the most widely used synchronization primitives in multithreaded applications. A barrier is essentially a mechanism that prevents the progress of threads beyond the barrier until all threads reach it. A thread that reaches the barrier early must wait for all other threads to arrive. Barriers are typically implemented as shared counters that are incremented in a critical region whenever a thread arrives at the barrier. All threads waiting at a barrier continuously check the value of this counter and continue only when it becomes equal to the number of threads that must synchronize at the barrier. This requires threads to continuously read their local copies of the shared counter, doing no useful work while consuming leakage and dynamic energy. The energy overheads associated with barrier synchronization are proportional to the barrier wait times of threads. As observed in [1], an imbalance in workload distribution across threads can result in large barrier wait times. A solution to this problem is to use barrier-aware DVFS. This ensures that, at each barrier, the voltages and frequencies of cores are dynamically scaled such that faster threads do not arrive at the barrier early; instead, they arrive at the barrier at around the same time as the slowest thread. Using this technique, the wait times of threads at the barrier are minimized, and as a result, the redundant leakage and dynamic energy consumption spent at the barrier is avoided.
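As a minimal sketch of this scaling policy (assuming per-thread work estimates between barriers, in cycles at the maximum frequency, are available from a predictor; the function name and inputs are illustrative, not the paper's implementation):

```python
def barrier_aware_levels(work_cycles, levels_mhz):
    """Pick a DVFS frequency for each core so that every thread arrives
    at the barrier at roughly the same time as the slowest thread.

    work_cycles: per-thread work between two barriers, in cycles at f_max.
    levels_mhz: available DVFS frequencies, ascending.
    """
    f_max = max(levels_mhz)
    t_slowest = max(work_cycles) / f_max  # the slowest thread runs at f_max
    chosen = []
    for w in work_cycles:
        target = w / t_slowest  # lowest frequency that still meets the barrier on time
        # round up to the nearest available DVFS level
        chosen.append(min(f for f in levels_mhz if f >= target))
    return chosen

levels = [500 + 125 * i for i in range(9)]  # 500..1500 MHz, as in Section IV
print(barrier_aware_levels([15e6, 10e6, 5e6], levels))  # prints [1500, 1000, 500]
```

A real implementation would replace the oracle work estimates with one of the predictors in [1], [2], [8], [9]; the quantization to discrete levels rounds up so a thread is never made later than the slowest one.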
In order to obtain energy savings using barrier-aware DVFS, the discrepancy in thread execution times between two barriers, and in turn the expected barrier stall times when all cores run at the maximum frequency, must be accurately predicted. Accurate barrier stall time predictions can be made by applying any of the techniques in [1], [2], [8], [9]. As our goal in this paper is not to evaluate the accuracy of these techniques, but to show the potential benefits that can be obtained by using TFET cores under any barrier-aware DVFS technique, we use an oracle predictor that always finds the best voltage/frequency for each core so that the barrier wait times are minimized.

C. Thread Migration

Each of the DVFS schemes described above results in a scaling coefficient, k_edp and k_barrier, respectively. The target frequency for each core is calculated by multiplying the maximum frequency by both of these coefficients. As a result of applying EDP-aware DVFS and barrier-aware DVFS, a thread runs at continuously changing voltage/frequency levels during its course of execution. Considering Fig. 2, denote the crossover frequency below which TFET cores are more energy efficient than CMOS cores as f_c and the corresponding DVFS level as L_c. An energy-aware thread-to-core mapping mechanism should prefer to run threads on TFET cores if the frequency of the corresponding DVFS level is less than f_c and on CMOS cores
TABLE I
SYSTEM PARAMETERS
if it is greater than f_c. In order to achieve this, we propose a migration scheme that uses the DVFS levels of cores to migrate threads across CMOS and TFET cores. Our scheme migrates threads running on CMOS cores at DVFS levels lower than L_c to TFET cores and, similarly, threads running on TFET cores at DVFS levels higher than L_c to CMOS cores. Although, in theory, threads can be migrated across cores arbitrarily frequently, the cost associated with thread migration must be taken into account in practice. Therefore, in order to avoid excessive performance degradation due to frequent migrations, we perform CMOS-to-TFET migration at DVFS levels L_c - 1 and below, and TFET-to-CMOS migration at DVFS levels L_c + 1 and above.

IV. SIMULATION FRAMEWORK AND INFRASTRUCTURE

We carried out our experiments using the Simics [11] full system simulator. Our target system contains an 8-core processor consisting of 4 CMOS cores and 4 TFET cores, running the Linux operating system. To fairly evaluate the benefits of using TFET cores over CMOS cores, we assume that a maximum of 4 cores are in operation at any point in time and that the remaining cores are in power-down mode. Most modern embedded processors are equipped with DVFS capability to allow the processor to operate in extremely low power modes as well as high performance modes. For example, the Intel XScale embedded processor, implemented in 180nm technology, provides 50mV DVFS stepping with an operating voltage range of 0.7V-1.8V, resulting in an operating frequency range from 200 MHz to 1000 MHz [4]. Targeting an embedded processor implemented in the 22nm technology node, our supply voltages are much lower. In our model, we assume that CMOS cores can deliver 500 MHz at 0.54V and that there are 9 DVFS levels with frequencies ranging from 500 MHz to 1500 MHz in equal steps of 125 MHz.
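The migration rule just described can be sketched as follows, assuming the 9 DVFS levels of Section IV with the crossover of Fig. 2 at 1250 MHz (the function name and return convention are ours, for illustration only):

```python
LEVELS_MHZ = [500 + 125 * i for i in range(9)]  # 9 DVFS levels, 500..1500 MHz
L_C = LEVELS_MHZ.index(1250)                    # crossover level L_c (index 6)

def migration_decision(core_type, level):
    """Decide whether to migrate a thread, with one level of hysteresis
    around the crossover level to avoid ping-ponging between core types."""
    if core_type == "CMOS" and level <= L_C - 1:
        return "migrate_to_TFET"   # low frequency: the TFET core is more efficient
    if core_type == "TFET" and level >= L_C + 1:
        return "migrate_to_CMOS"   # high frequency: the CMOS core is more efficient
    return "stay"
```

At level L_c itself neither migration fires, which creates the dead band that keeps a thread hovering around the crossover from migrating every epoch.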
As observed in Fig. 2, the crossover DVFS level at which the most energy-efficient device type switches between CMOS and TFET is taken to correspond to a frequency of 1250 MHz. In other words, threads operating below 1250 MHz are more efficient when run on TFET cores, and vice versa. During execution, depending on the DVFS levels of cores, our thread migration algorithm moves threads across CMOS and TFET cores to achieve better energy efficiency. Our target multicore processor has 32KB private L1 instruction and data caches with a 1 cycle access latency and a 1MB shared, unified L2 cache that can be accessed in 10 cycles at the maximum frequency of 1500 MHz. Table I shows the details of the hardware configuration we used in our simulations. We evaluated our scheme using benchmarks from the SPLASH-2 [17] suite. Some of these workloads make extensive use of barriers for synchronization, which enables us to distinguish the benefits that arise from EDP-aware DVFS and barrier-aware DVFS. Table II lists our target benchmarks and the input sets we used in our experiments. In this work, we assume that per-core DVFS is achieved using on-chip voltage regulators. Moving from off-chip voltage regulators to per-core on-chip voltage regulators can provide DVFS transition times on the order of tens of nanoseconds [7]. This switching latency is much smaller than the epoch size that
Parameter                 Value
Number of Cores           4 CMOS + 4 TFET
L1 D/I-Cache              Private, 32KB each, 4-way set associative
L1 Access Latency         1 cycle
L2 Cache                  Shared, 1MB, 16-way set associative
L2 Access Latency         10 cycles (at 1500 MHz)
Memory Access Latency     120 cycles (at 1500 MHz)
Epoch Size                10ms
TABLE II
BENCHMARKS

Benchmark          Input
barnes             262,144 particles
fft                65,536 complex doubles
fmm                32,768 particles
lu                 2048x2048 matrix
ocean              1026x1026 grid
radiosity          default values for all inputs
radix              4,194,304 keys
volrend            car.env
water-nsquared     1331 molecules, 10 steps
water-spatial      1331 molecules, 10 steps
we use in our experiments. Therefore, we assume that voltage/frequency level switching takes place instantaneously. The leakage and dynamic power models for the cache and the processor datapath were developed in [13] using validated device models for TFETs and predictive BSIM models for 22nm CMOS. In this work, we use the same processor energy and leakage model. In our system-level simulations, the total leakage and dynamic energy for a given benchmark is calculated by summing the leakage and dynamic energies over all epochs. The leakage energy of a core at each epoch depends on the leakage power at the corresponding frequency and the epoch duration. Core dynamic energy calculations are based on the dynamic energy per instruction values obtained from simulations and the number of instructions executed in each epoch. As will be shown in Section V, we estimate the overhead due to thread migration to be around 20K cycles at 1500 MHz. In order to ensure that this overhead remains a negligible fraction (< 2%) of the period of our thread migration decisions, we fixed the epoch size at 10ms.

V. EXPERIMENTAL RESULTS

In our experiments, we assumed that the baseline system has two DVFS mechanisms, namely the EDP-aware DVFS (S1) and the barrier-aware DVFS (S2) discussed in Sections III-A and III-B, that can be applied to improve energy efficiency. The first scheme (S1) monitors the EDP of the application continuously and performs DVFS decisions accordingly, making use of the trade-off between performance and energy efficiency. The second scheme moves faster threads to lower DVFS levels to eliminate idle wait times at barriers. We also experimented with a combined scheme that employs both mechanisms (S1+S2).
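Scheme S1's greedy decision rule (Section III-A) can be sketched as follows; the class interface, starting level, and 2% hold threshold are illustrative assumptions, not values from [14]:

```python
class GreedyEdpDvfs:
    """Greedy EDP-aware DVFS: repeat the last direction while EDP improves,
    reverse it when EDP degrades, and hold when the change is negligible."""

    def __init__(self, n_levels=9, start_level=8, threshold=0.02):
        self.n_levels = n_levels
        self.level = start_level      # current DVFS level (0 = lowest frequency)
        self.last_dir = -1            # last action taken: +1 = up, -1 = down
        self.prev_edp = None
        self.threshold = threshold    # relative EDP change treated as "no change"

    def end_of_epoch(self, edp):
        """Called with the EDP measured over the epoch that just finished;
        returns the DVFS level to use for the next epoch."""
        if self.prev_edp is not None:
            change = (edp - self.prev_edp) / self.prev_edp
            if abs(change) < self.threshold:
                pass                          # third decision: stay at this level
            elif change < 0:                  # EDP improved: repeat last decision
                self.level += self.last_dir
            else:                             # EDP degraded: reverse direction
                self.last_dir = -self.last_dir
                self.level += self.last_dir
            self.level = max(0, min(self.n_levels - 1, self.level))
        self.prev_edp = edp
        return self.level
```

One controller instance per core suffices, since the algorithm only compares each core's own EDP across consecutive epochs.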
The results presented in this section are normalized with respect to the energy consumption and performance of our baseline system which executes threads only on CMOS cores (i.e., it does not utilize TFET cores), but still employs the same DVFS techniques to improve energy efficiency. Before experimenting with TFET cores, in order to verify the correctness of our DVFS implementations, we compared our baseline system with DVFS against a system with no DVFS where all cores
run at the maximum frequency, and obtained an average EDP improvement of 11%. All reported results in this section are improvements on top of this 11%.

A. Energy Improvements

Figs. 3 and 4 respectively show the leakage and dynamic energy savings under the S1, S2, and S1+S2 strategies. Using S1+S2, an average leakage energy improvement of 30% is obtained due to the very low leakage energy of TFET cores at low voltages. Dynamic energy is also improved by 17% on average, resulting in an average total energy saving of 19%. The improvement in dynamic energy is associated with the lower supply voltage requirement of TFET devices. For leakage energy, in addition to this reduction in supply voltage, the much lower off-currents of TFETs result in significant additional leakage energy savings.
Fig. 3. Leakage energy savings as a result of migration to TFET cores under S1, S2, and S1+S2 strategies.

Fig. 4. Dynamic energy savings as a result of migration to TFET cores under S1, S2, and S1+S2 strategies.

One observation from our energy results is that there is wide variation across benchmarks in the energy savings obtained from each DVFS strategy individually. Considering S1, large benefits are obtained with low-IPC applications such as lu, whereas for applications with already high IPC, such as radix, the gains are much lower. Similarly, not all benchmarks show high energy savings due to S2. For instance, in water-spa, up to 40% energy savings are obtained when barrier-aware DVFS is applied, whereas in radix this technique brings less than 1% energy savings. This behavior is expected, as water-spa has a large imbalance in its workload distribution, whereas the uniform distribution of work across the threads of radix results in almost no energy savings. It should be noted that the overheads associated with barrier synchronization increase with the number of processors that meet at the barrier [9]. As a result, an increased number of cores enhances the importance of barrier-aware DVFS, which can result in more cores running at lower frequencies. Therefore, we expect the energy gains from migrating threads to TFET cores to become more significant in the manycore processors of the future when barrier-aware DVFS is employed. Another observation we make is that the benefits of using TFET cores under S1+S2 exceed the sum of the individual benefits of the two techniques. This is due to the constructive interference between the two DVFS techniques when the energy improvement due to TFET cores is considered. As a result of applying the two DVFS techniques simultaneously, the time spent by threads at lower frequencies increases, thereby improving the leakage and dynamic energy savings from our thread migration scheme. The fraction of the total time spent by the threads on TFET cores for each benchmark is shown in Fig. 5. It can be observed that, as a result of applying our scheme, on average, 50% of execution time is spent running on TFET cores under S1+S2.

Fig. 5. Relative time threads spend executing on TFET cores under S1, S2, and S1+S2 strategies.
B. Performance Degradation

The energy savings that can be obtained from migrating threads across CMOS and TFET cores were given in the previous section. However, there are also overheads associated with thread migration across cores that must be quantified. In order to understand the worst-case overhead of a single migration operation, we analyze the underlying mechanism used for thread migration. In the Linux operating system, thread migration is performed using thread affinity. In this mechanism, the operating system scheduler stops the running thread, saves the register contents of the core running the thread, flushes the private caches of the core (L1 caches in our case), and puts the thread into the ready queue of the migration target core. During the private cache flush operation, only dirty lines are written back to the L2 cache; clean lines are discarded. Note that, in inclusive caches, copies of these clean lines are already guaranteed to exist in the L2 cache. If the target core of the migration is idle, it immediately retrieves the thread from the ready queue, recovers the register contents, and starts execution. In this case, our experiments indicate that the whole migration process takes on the order of 10K cycles. As a side effect of flushing the private L1 caches, during its warm-up period,
the migrated thread suffers a number of L1 cache misses. However, most of these L1 misses become L2 cache hits, reducing the performance degradation after migration. Considering our L1 data cache with 1K lines (32KB capacity and 32B line size), at most 1K misses can be observed, which requires 1024 L2 cache accesses that take about 10K cycles (at 1500 MHz). As a result, we calculate the theoretical worst-case migration cost to be approximately 20K cycles. This worst-case cost is well under 2% of the granularity of our thread migration decisions (i.e., the epoch size). Further, in our experiments, we observed the frequency of migrations to be less than 1%, with a worst-case migration frequency of 3% in fft, which indicates that our proposed thread migration scheme does not suffer from frequent migrations across cores. Overall, we observed the performance degradation due to thread migration to be less than 1%.
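The worst-case cost estimate above can be reproduced with a few lines of arithmetic (the ~10K-cycle OS-level migration cost is the experimentally measured figure from this section; the cache and latency numbers come from Table I):

```python
# Worst-case single-migration cost estimate (Section V-B).
cache_bytes = 32 * 1024          # 32KB private L1 data cache
line_bytes = 32                  # 32B cache line
l2_latency = 10                  # cycles per L2 access at 1500 MHz
os_migration_cycles = 10_000     # measured OS-level migration cost (~10K cycles)

l1_lines = cache_bytes // line_bytes   # 1024 lines can miss after the flush
warmup_cycles = l1_lines * l2_latency  # refilling them all from L2: ~10K cycles
worst_case = os_migration_cycles + warmup_cycles
print(worst_case)                      # prints 20240, i.e., ~20K cycles
```

Against a 10ms epoch at 1500 MHz (15M cycles), this worst case is a small fraction of the migration-decision period.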
C. Overall Impact on EDP

We analyze the combined effect of the leakage/dynamic energy improvements and the performance degradation results given in the previous two sections. The impact of our thread migration scheme on energy-delay product (EDP) for each benchmark is given in Fig. 6, which shows that the EDP improvements follow the same trend we observed for energy. The maximum improvements are obtained with the lu and water-spa benchmarks, which improve by 44% and 32%, respectively. However, the reasons for the EDP improvement in these two benchmarks are different. In lu, the benefits are mostly due to EDP-aware DVFS, whereas in water-spa, most of the improvement is a result of barrier-aware DVFS. On average, the EDP improvement achieved by our strategy is about 20%.

Fig. 6. EDP improvements as a result of migration to TFET cores under S1, S2, and S1+S2 strategies.

VI. CONCLUSION

We show that, due to EDP-aware DVFS and barrier-aware DVFS, multithreaded applications spend a significant amount of time at low frequencies. At these frequencies, TFET cores are shown to be more energy efficient than CMOS cores. However, CMOS cores are still more energy efficient at higher frequencies and can reach higher maximum frequency levels than TFETs. Therefore, applications cannot be limited to always run on TFET cores. By employing TFET cores in multicores together with CMOS cores and using EDP-aware and barrier-aware DVFS, we propose a thread migration scheme that dynamically moves threads across TFET and CMOS cores in order to improve energy efficiency. Simulating a multicore system with 4 CMOS and 4 TFET cores, we show that our thread migration scheme can achieve average leakage and dynamic energy savings of 30% and 17%, respectively. We analyze the overheads of thread migration and observe them to be less than 1% on average, resulting in an overall 20% improvement in EDP.

ACKNOWLEDGMENT

This work was supported in part by NSF awards 1028807, 0903432, 1017882, 0963839, 0720645, 0811687, 0702519, Microsoft Corporation, the Semiconductor Research Corporation's Nanoelectronics Research Initiative, and the National Institute of Standards and Technology through the Midwest Institute for Nanoelectronics Discovery (MIND).

REFERENCES

[1] A. Bhattacharjee and M. Martonosi. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. In Proceedings of the International Symposium on Computer Architecture, 2009.
[2] Q. Cai et al. Meeting points: using thread criticality to adapt multicore hardware to parallel regions. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2008.
[3] W. Y. Choi, B. Park, J. D. Lee, and T. K. Liu. Tunneling field-effect transistors (TFETs) with subthreshold swing (SS) less than 60 mV/dec. Electron Device Letters, 28(8), August 2007.
[4] L. T. Clark, F. Ricci, and W. E. Brown. Dynamic voltage scaling with the XScale embedded microprocessor. In Adaptive Techniques for Dynamic Processor Optimization. 2008.
[5] P. Clarke. http://www.eetimes.com/electronics-news/4213661/intel-sgargini-sees-tunnel-fet-as-transistor-option. In EE Times, 2011.
[6] C. Hu et al. Prospect of tunneling green transistor for 0.1V CMOS. In International Electron Devices Meeting, 2010.
[7] W. Kim, M. S. Gupta, G. Wei, and D. Brooks. System level analysis of fast, per-core DVFS using on-chip switching regulators. In Proceedings of the International Symposium on High Performance Computer Architecture, 2008.
[8] J. Li, J. F. Martinez, and M. C. Huang. The thrifty barrier: energy-aware synchronization in shared-memory multiprocessors. In Proceedings of the International Symposium on High Performance Computer Architecture, 2004.
[9] C. Liu, A. Sivasubramaniam, M. Kandemir, and M. J. Irwin. Exploiting barriers to optimize power consumption of CMPs. In Proceedings of the International Parallel and Distributed Processing Symposium, 2005.
[10] P. Macken, M. Degrauwe, M. Van Paemel, and H. Oguey. A voltage reduction technique for digital systems. In Proceedings of the International Solid-State Circuits Conference, 1990.
[11] P. S. Magnusson et al. Simics: A full system simulation platform. Computer, 35, February 2002.
[12] S. Mookerjea et al. Experimental demonstration of 100nm channel length In0.53Ga0.47As-based vertical inter-band tunnel field effect transistors (TFETs) for ultra low-power logic and SRAM applications. In Proceedings of the IEEE International Electron Devices Meeting, December 2009.
[13] V. Saripalli, A. Misra, S. Datta, and V. Narayanan. An energy-efficient heterogeneous CMP based on hybrid TFET-CMOS cores. In Proceedings of the Design Automation Conference, 2011.
[14] N. Soundararajan, V. Narayanan, and A. Sivasubramaniam. Impact of dynamic voltage and frequency scaling on the architectural vulnerability of GALS architectures. In Proceedings of the International Symposium on Low Power Electronics and Design, 2008.
[15] C.-W. Tseng. Compiler optimizations for eliminating barrier synchronization. In Proceedings of the Symposium on Principles and Practice of Parallel Programming, 1995.
[16] P. F. Wang et al. Complementary tunneling transistor for low power application. Solid-State Electronics, 48, 2004.
[17] S. C. Woo et al. The SPLASH-2 programs: characterization and methodological considerations. In Proceedings of the International Symposium on Computer Architecture, 1995.