Int. J. High Performance Computing and Networking, Vol. x, No. x, xxxx
Energy Efficiency of Heterogeneous Multicore System Based on the Enhanced Amdahl's Law

Songwen Pei*, Junge Zhang, Naixue Xiong
Shanghai Key Lab of Modern Optical Systems, University of Shanghai for Science and Technology, Shanghai 200093, China
E-mail:
[email protected] *Corresponding author
Myoung-Seo Kim, Jean-Luc Gaudiot
Parallel Systems and Computer Architecture Lab, University of California, Irvine, CA 92697, USA
E-mail: {myoungseo.kim, gaudiot}@uci.edu
Abstract: Beyond raw performance, energy efficiency is one of the greatest challenges in designing future heterogeneous multicore systems. We therefore propose an analytical energy-efficiency model for heterogeneous multicore systems based on the enhanced Amdahl's law. The model extends the traditional computing-centric model by considering the overhead of data preparation, which potentially includes the overhead of communication, data transfer, synchronization, etc. The analysis clearly shows that decreasing the overhead of data preparation is a promising approach to achieving higher computational performance and greater energy efficiency. Therefore, more informed tradeoffs should be made when designing a modern heterogeneous processor system within a limited energy budget.

Keywords: Energy Efficiency; Overhead of Data Preparation; Dataflow Computing Model; Performance Evaluation; Heterogeneous Multicore System.

Reference to this paper should be made as follows: Pei, S., Zhang, J., Xiong, N., Kim, M. and Gaudiot, J. (xxxx) 'Energy Efficiency of Heterogeneous Multicore System Based on the Enhanced Amdahl's Law', Int. J. High Performance Computing and Networking, Vol. x, No. x, pp.xxx-xxx.

Biographical notes: Songwen Pei received the B.S. from the School of Computer, National University of Defense Technology, Changsha, China in 2003, the M.S. from the School of Information Engineering, Guizhou University, Guiyang, China in 2006, and the Ph.D. from the School of Computer Science and Technology, Fudan University, Shanghai, China in 2009. He is currently an Associate Professor in the Computer Science and Engineering Department at the University of Shanghai for Science and Technology; he has also been a Guest Researcher at the Institute of Computing Technology, Chinese Academy of Sciences (2011-) and was a Research Scientist in Electrical Engineering and Computer Science at the University of California (2013-2015). His research interests include heterogeneous multicore processors, parallel and distributed computing, cloud computing, big data, and fault-tolerant computation. He is a member
of the IEEE, ACM and CCF in China, and he is also a board member of CCF-TCCET, CCF-TCARCH and CCF-YOCSEF Shanghai, respectively. He is a talent award-winner of the Shanghai Pujiang Program 2016.

Junge Zhang received the B.S. in computer science from Luoyang Normal University in 2012. She is currently a graduate student in computer architecture at the University of Shanghai for Science and Technology. Her current research interests include computer architecture, multicore heterogeneous systems, data prefetching and GPU power models.

Naixue Xiong received Ph.D. degrees in software engineering from Wuhan University, Wuhan, China, and in dependable networks from the Japan Advanced Institute of Science and Technology, Nomi, Japan, respectively. He was with Wentworth Technology Institution, Georgia State University, Atlanta, GA, USA, for several years. He is currently a Professor in the Computer Science and Engineering Department at the University of Shanghai for Science and Technology. His current research interests include cloud computing, security and dependability, parallel and distributed computing, networks, and optimization theory.

Myoung-Seo Kim received the B.S. degree in Computer Science, the B.S. degree in Electrical and Electronics Engineering, and the M.S. degree in Computer Science from Yonsei University, Seoul, Korea in 2003 and 2005, respectively. He worked for five years on the design and verification of portable multimedia systems-on-a-chip (SoCs) at Samsung Electronics, Yongin, Korea (2005-2008) and at Apple Inc., Cupertino, California (2008-2009). He received the Ph.D. degree in Computer Science from the University of California, Irvine in 2016. He is currently a principal researcher in the Parallel Systems and Computer Architecture Lab at the University of California, Irvine. He was the recipient of the 2011 Yonsei International Foundation Scholarship and the 2012 SK Hynix Study Abroad Scholarship, and he has published more than 15 papers in journals and international conferences. His research interests include multi/many-core systems-on-a-chip, computer architecture and design, low-power architecture, embedded systems, VLSI circuit design and analysis, and design automation. He is a member of the IEEE and the ACM.

Jean-Luc Gaudiot received the Diplôme d'Ingénieur from the École Supérieure d'Ingénieurs en Electrotechnique et Electronique, Paris, France in 1976 and the M.S. and Ph.D. degrees in Computer Science from the University of California, Los Angeles in 1977 and 1982, respectively. He is currently a Professor and Chair of the Electrical and Computer Engineering Department at the University of California, Irvine. Prior to joining UCI in January 2002, he was a Professor of Electrical Engineering at the University of Southern California from 1982, where he served as Director of the Computer Engineering Division for three years. He has also done microprocessor systems design at Teledyne Controls, Santa Monica, California (1979-1980) and research in innovative architectures at the TRW Technology Research Center, El Segundo, California (1980-1982). He consults for a number of companies involved in the design of high-performance computer architectures. His research interests include multithreaded architectures, fault-tolerant multiprocessors, and implementation of reconfigurable architectures. He has published over 200 journal and conference papers, and his research has been sponsored by NSF, DoE, and DARPA, as well as a number of industrial organizations.
In January 2006, he became the first Editor-in-Chief of IEEE Computer Architecture Letters, a new publication of the IEEE Computer Society, which he helped found with the goal of facilitating short, fast-turnaround publication of fundamental ideas in the computer architecture domain. He served on the IEEE Computer Society Board of Governors for two terms (2010-2015), was VP of Educational Activities (2013), VP of Publications (2014-2015) and is the 2017 IEEE Computer
Society President. In 1999, he became a Fellow of the IEEE, and he was elevated to the rank of AAAS Fellow in 2007.

This paper is a revised and expanded version of a paper entitled 'Performance-Energy Efficiency Model of Heterogeneous Parallel Multicore System' presented at the EEHTC4BD-2015 workshop held in conjunction with the 6th IGSC-2015 conference, Las Vegas, USA, 14-16 December 2015.
1 Introduction

As multicore processors have become mainstream, it has become crucial to identify performance bounds and performance scaling properties in exploiting the massive parallelism they may offer, as pointed out by Che and Nguyen (2014). Computer architecture has been transitioning from the homogeneous multicore era into the heterogeneous era (Rogers, 2013), which means that the memory wall (Sun and Chen, 2010) and communication issues will widen the gap between the performance of an ideal processor and that of a "practical" processor, because the overhead of data preparation becomes an unavoidable key parameter. Furthermore, energy efficiency is one of the most challenging issues, as the large-scale integration of CMOS devices has led to fused architectures combining superscalar central processing units (CPUs) with lightweight streaming processor units (e.g., graphics processing units (GPUs), FPGA accelerators and ARM cores). Multicore systems with heterogeneous processing elements are becoming the mainstream in future processor design; recent examples include Intel's MIC (Duran and Klemm, 2012), AMD's Kabini (Bouvier et al., 2014), and NVIDIA's Project Denver (Dally, 2011). In particular, given the rapid development of manufacturing processes and other promising technologies, it will likely become feasible in the near future to build a heterogeneous kilo-core system by integrating general-purpose (non-graphics or dataflow) computing units and a GPGPU on a single chip. Here, we classify the processors (cores) in a heterogeneous multicore system as big processors and little cores, respectively.

As the density of multicore chips increases, future heterogeneous multicore parallel systems will have to consider seriously how to achieve higher performance and manage hardware/software resources while keeping their power (energy) consumption within a limited budget. This challenge will stimulate multicore processor architects to develop new approaches that pursue better performance per watt rather than simply yielding higher performance.

In this paper we model the computer system from the viewpoint of the dataflow computing model, by adding the fraction of data preparation that the state-of-the-art Amdahl's law, built on a computing-centric view of the system, never takes into account. According to Dixit (1998), floating-point computation accounts for only 27.6% of 052.alvinn in the SPEC suite, while the remaining operations, about 72.3%, are related to data preparation. We investigated the overhead of data preparation by comparing it against the total execution time of applications (e.g., KM, NW, HS, BP, SRAD) from the Rodinia benchmark on a heterogeneous system with an NVIDIA GTX 750, which contains 512 streaming processor cores and 2 GB of device memory. According to the experimental results, the overhead of data preparation (e.g., the overhead of CPU-GPU communication and synchronization) for all applications under test is on average 61.37%, as shown in Pei,
Kim and Gaudiot (2016). Reducing the overhead of data preparation is thus a promising way to improve system performance and to decrease energy consumption as well. Our contributions are mainly: (1) extending the performance model by considering the overhead of data preparation, and deriving new equations to evaluate the performance of heterogeneous parallel multicore systems; (2) building a performance-energy efficiency model and using it to evaluate heterogeneous parallel multicore systems; and (3) comparing the results with those of the traditional Amdahl's law.
2 Related Work

There is a large body of theoretical research on harnessing power (energy) consumption. Woo and Lee (2008) extended Amdahl's law for energy-efficient many-core computing, classifying many-core design styles into three types: a symmetric superscalar processor, tagged P*; symmetric smaller power-efficient cores, tagged c*; and an asymmetric many-core processor with a superscalar processor and many smaller cores, tagged P + c*. Their results show that a heterogeneous architecture saves more power than a symmetric system. Similarly, Marowka (2012) extended Amdahl's law for heterogeneous computing and investigated how energy efficiency and scalability are affected by power constraints for three kinds of heterogeneous computer systems: symmetric, asymmetric and simultaneous asymmetric. The analysis clearly shows that gaining greater parallelism is the most important factor affecting power consumption. Pei et al. (2016) also extended Amdahl's law for single-threaded heterogeneous multicore systems by considering the overhead of data preparation, demonstrating that heterogeneous CPU-GPU multicore systems will be the mainstream alternative and that potential innovations in system architecture will play out in the future. Karanikolaou et al. (2014) evaluated energy consumption experimentally on distributed and many-core platforms, measuring processor power demand in both the idle and the fully utilized state; in proportion to the parallelized percentage each time, the estimations of the theoretical model were compared with the experimental results on the basis of the performance/power and performance/energy ratio metrics. Kim et al. (2015) focused on the energy efficiency of accelerating the sequential part of a program and on determining the optimal frequency-boosting ratio that maximizes energy efficiency. According to their results, the energy efficiency of the acceleration increases as the number of cores increases, so an optimal frequency-boosting ratio can be determined; accelerating the sequential part of a program is thus a promising approach to improving overall performance in parallel processors. Londoño et al. (2010) presented a study of the potential dynamic energy improvement achievable when hardware parallelization is used to increase the energy efficiency of the system rather than its performance; they modeled the potential dynamic energy improvement and the optimal frequency and voltage allocation of a multicore system by extending Amdahl's law. Ge and Cameron (2007) proposed a power-aware speedup model to predict the scaled execution time of power-aware clusters by isolating the performance effects of changing processor frequencies and the number of nodes. By decomposing the workload with DOP and on-/off-chip characteristics, their model takes into account the effects of both parallelism and power-aware techniques on speedup.
In the field of practical products and implementations, low-power techniques and algorithms for multicore systems, such as dynamic voltage/frequency scaling (DVFS) and heterogeneous microarchitectures, are recommended to reduce power (energy) by lowering the voltage and frequency or by migrating execution to a more efficient but smaller chip. Sawalha and Barnes (2012) demonstrated that a significant reduction in energy consumption could be achieved by dynamically adjusting the mapping as application behavior changes with new program phases; their experimental results showed significant energy reductions over random scheduling of programs within a heterogeneous multicore processor. Nowak (2014) employed the Convey HC-1, a heterogeneous system equipped with four user-programmable FPGAs, for his investigations into energy-efficient computing. He found that heterogeneous systems based on reconfigurable hardware, efficient data-exchange mechanisms, data-driven and component-based programming, and task-parallel execution can help achieve power-efficient exascale systems in the future. Lukefahr et al. (2014) developed an offline analysis tool to study the potential energy efficiency of fine-grained DVFS and heterogeneous microarchitectures, as well as a hybrid approach. Nishikawa et al. (2015) revealed that energy-aware RAID configuration plays an important role for data-intensive applications. Zhu et al. (2014) achieved high memory-access efficiency by splitting memory into many regions and allocating an application's memory to a preferred region via MSPA. Zain et al. (2012) gained substantial parallel performance by orchestrating heterogeneous computational algebra components. Our prior work focused on designing a fused cache with compacted cache directories and a framework for accessing the unified memory address space of a heterogeneous parallel multicore system (Pei et al., 2014; Pei et al., 2015). Wang and Ren (2010) coordinated inter-processor work distribution and per-processor frequency scaling to minimize energy consumption under a given scheduling-length constraint; evaluations on a real CPU-GPU system showed a 14% energy reduction compared with a static mapping strategy.
3 Performance Model of Integrated Heterogeneous Multicore System

Four decades ago, Amdahl (1967) focused mainly on the performance of computers in the special case of using multiple processors in parallel, when he argued for the validity of the single-processor approach to achieving large-scale computing capabilities. Here, we are more interested in the power efficiency or energy efficiency of future integrated heterogeneous multicore processor systems. Hence, we develop analytical power models of integrated heterogeneous multicores and formulate metrics to evaluate energy efficiency on the basis of performance, while considering the overhead of data preparation.
3.1 Reevaluating Amdahl's Law

According to Amdahl's law, the formula for computing the theoretical maximum speedup (or performance) achievable through parallelization is as follows:

SA(fc, c) = 1 / ((1 − fc) + fc/c)    (1)
where c is the number of cores or processors, and fc is the fraction of computation that programmers can parallelize (0 ≤ fc ≤ 1).
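As a concrete illustration (this sketch is ours, not part of the original paper; the function name is hypothetical), equation (1) can be evaluated directly:

```python
def speedup_amdahl(fc: float, c: int) -> float:
    """Classic Amdahl's law, equation (1): fc is the parallelizable
    fraction of computation and c is the number of cores."""
    return 1.0 / ((1.0 - fc) + fc / c)

# A 90%-parallelizable program on 16 cores speeds up by only ~6.4x,
# far below the ideal 16x, because of the 10% sequential fraction.
print(speedup_amdahl(0.9, 16))  # ~6.40
```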
Figure 1 Normalized task (equivalent time), split between computation (1 − fc, fc) and data preparation (1 − fh, αfh, (1 − α)fh).
The equation holds if three key assumptions are satisfied: (1) the programs to be executed are of fixed size, and the fraction of the programs that is parallelizable remains constant as well; (2) there exists an infinite memory capable of meeting the requirements as the number of cores increases; and (3) the overhead of preparing the data to be used by the computing units, such as accessing memory, communicating on- or off-chip, and synchronizing among cores, can be completely neglected.

Because of the "memory wall", the overhead of data preparation, including accessing memory, transmitting data on- and off-chip, transferring data between CPU and GPU memory spaces in a heterogeneous system, synchronizing processes, etc., has become so significant that it can no longer be ignored. However, Amdahl's law considers only the cost of computation, to the exclusion of the cost of preparing data for computation, which is especially limiting when modeling a heterogeneous multicore system. Let pc denote the computation portion and 1 − pc the data-preparation portion normalized to the computation portion: since the clock frequencies and ISAs of CPUs, GPUs, the off-chip bus and the memory arbitrator are likely to differ, we normalize the performance of data-preparation instructions to that of computing instructions. As before, fc is the parallelizable computation portion and 1 − fc is the sequential computation portion. We therefore explicitly separate computation from data preparation, as shown in Figure 1, and assume that the whole cost of executing a program can be split into two independent parts, from the viewpoint of the dataflow computing model: one part prepares data for execution, and the other runs instructions once the required data are ready. The Overhead of Data Preparation (ODP) therefore comprises the whole cost of preparing data for execution, to the exclusion of the actual execution. As shown in Figure 1, the ODP can be incorporated into a new speedup equation extending the legacy Amdahl's law, which we will call the "extended Amdahl's law", expressed in equation (2); the legacy Amdahl's law is the special case of the extended Amdahl's law where pc = 1.

SEA(fc, c, pc) = 1 / (((1 − fc) + fc/c) · pc + (1 − pc))    (2)
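The following sketch (again ours, for illustration only) implements equation (2) and confirms that pc = 1 recovers the legacy law:

```python
def speedup_extended(fc: float, c: int, pc: float) -> float:
    """Extended Amdahl's law, equation (2): pc is the computation
    portion of the program, 1 - pc the data-preparation portion."""
    return 1.0 / (((1.0 - fc) + fc / c) * pc + (1.0 - pc))

print(speedup_extended(0.9, 16, 1.0))  # ~6.40, identical to eq. (1)
print(speedup_extended(0.9, 16, 0.6))  # ~2.03: unreduced data
                                       # preparation caps the speedup
```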
3.2 Analytic Model Considering the Overhead of Data Preparation

In general, memory-access instructions can be executed simultaneously with independent computing instructions. However, it is not possible to execute all data-preparation instructions simultaneously with computing instructions; indeed, according to our observations, not all data-preparation operations can be executed alongside computing instructions. For example, a load instruction that is independent of the following computing instructions can be issued simultaneously with them, but it will not be issued if the queue for issuing load instructions is full.
Figure 2 Illustration of the extended Amdahl's law. (The computation portion pc splits into 1 − fc on one core and fc/c on c cores; the data-preparation portion 1 − pc splits into 1 − fh, αfh and (1 − α)fh, the last dividing into dc(1 − α)fh and (1 − dc)(1 − α)fh on c cores.)
We thus further divide the data-preparation portion into three sub-parts: 1 − fh, αfh and (1 − α)fh. Here 1 − fh denotes the portion of data preparation that is closely dependent on computing instructions, while fh denotes the data-preparation portion of the program that can be overlapped with "computing" instructions before introducing advanced techniques, where 0 ≤ fh ≤ 1. However, equation (2) does not make allowance for techniques that would decrease or eliminate the overhead of data preparation in the multicore era. Therefore, we introduce the parameter α to denote the percentage of data-preparation instructions that are actually executed simultaneously with computing instructions before advanced architectural techniques are applied (0 ≤ α ≤ 1). Thus, αfh denotes the portion of data-preparation instructions actually executed in parallel, and (1 − α)fh denotes the portion that cannot be overlapped with computing instructions without sophisticated architectural techniques. With the help of advanced techniques such as data prefetching, speculative execution, universal memory, no-copy data transfer, 3D NoC, etc., (1 − α)fh could be decreased significantly.
3.3 Quantitative Model of Heterogeneous Multicore System

As proposed by Hill and Marty (2008), we first assume that a multicore chip of a given area and manufacturing technology is composed of at most n Base Core Equivalents (BCEs), where a single BCE implements a baseline core. We then assume that the resources of r BCEs can be combined to create a powerful core with sequential performance perf(r), while the performance of a single-BCE core is 1. The function perf(r) is allowed to be arbitrary, where 1 ≤ perf(r) ≤ r. If perf(r) > r, there remains room to add more cores or cache resources to speed up both sequential and parallel execution; if perf(r) < r, we need to trade off sequential against parallel execution instead of simply increasing the number of cores. As more cores are added, the overhead of the data-preparation procedure (e.g., accessing different memory spaces, communication and synchronization among cores) also increases. Since an improvement in sequential performance achieved by microarchitectural techniques alone follows Pollack's rule (Pollack, 1999), perf(r) is roughly proportional to the square root of the increase in chip area or transistor count. For simplicity, we also assume that the overhead of data preparation is roughly proportional to perf(r). Furthermore, we introduce a variable dc to model the percentage of data preparation that cannot be overlapped on a c-core system even when advanced technologies are adopted, where 0 ≤ dc ≤ 1. After normalization to the computation, the fraction of data
preparation instructions that cannot be overlapped on a c-core system becomes fud = (1 − fh) + dc · (1 − α)fh. Conversely, the fraction of data-preparation instructions that can be overlapped on a c-core system becomes fpd = αfh + (1 − dc) · (1 − α)fh, where fud + fpd = 1 and, as before, 1 − fh denotes the fraction of data preparation that is closely dependent on computation. We can therefore extend Amdahl's law into a new equation, called the "enhanced Amdahl's law". The performance speedup is governed by:

S'EA(fc, c, pc, fud) = 1 / (((1 − fc) + fc/c) · pc + fud · (1 − pc))    (3)
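A short sketch (ours; variable names follow the paper's notation) makes the role of fud explicit:

```python
def unoverlapped_fraction(fh: float, alpha: float, dc: float) -> float:
    """fud = (1 - fh) + dc * (1 - alpha) * fh, the share of data
    preparation that cannot overlap with computation; the overlappable
    share is fpd = 1 - fud."""
    return (1.0 - fh) + dc * (1.0 - alpha) * fh

def speedup_enhanced(fc: float, c: int, pc: float, fud: float) -> float:
    """Enhanced Amdahl's law, equation (3)."""
    return 1.0 / (((1.0 - fc) + fc / c) * pc + fud * (1.0 - pc))

# With the evaluation settings used later in Section 5
# (fh = 0.8, alpha = 0.699, dc = 0.333), fud is about 0.28.
fud = unoverlapped_fraction(0.8, 0.699, 0.333)  # ~0.2802
print(speedup_enhanced(0.99, 256, 0.6, fud))    # ~8.31
```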
4 Energy Efficiency Model of Heterogeneous Multicore System

4.1 Performance Model of Heterogeneous Multicore System

Similar to the assumption in Hill and Marty (2008), we assume that the integrated heterogeneous parallel multicore system is built with one big processor consisting of r BCEs and c little cores, each consisting of a single BCE. Sequential computation and sequential data preparation are executed only by the big processor, while parallel computation and parallel data preparation are executed only by the c little cores. Moreover, the big processor is idle while the little cores are executing the parallel instructions, following the assumptions in Woo and Lee (2008). Executing the sequential fraction of a given program (including the cost of executing the sequential computing instructions and the cost of the sequential fraction of data preparation) on the single big processor of r BCEs takes (1 − fc) · pc + fud · (1 − pc), whereas executing the parallel fraction of the program simultaneously on the c little cores takes fc/(c·sc) · pc, where sc represents the performance of a little core normalized to that of the big processor (0 ≤ sc ≤ 1). Therefore, the performance speedup of the integrated heterogeneous parallel multicore system, considering the overhead of data preparation, is governed by:

SEA^HS(fc, c, pc, fud) = 1 / (((1 − fc) + fc/(c·sc)) · pc + fud · (1 − pc))    (4)
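Equation (4) differs from equation (3) only in scaling the parallel term by sc; a one-function sketch (ours) reads:

```python
def speedup_heterogeneous(fc: float, c: int, pc: float,
                          fud: float, sc: float) -> float:
    """Heterogeneous speedup, equation (4): sc is a little core's
    performance relative to the big processor (0 <= sc <= 1)."""
    return 1.0 / (((1.0 - fc) + fc / (c * sc)) * pc + fud * (1.0 - pc))

# With sc = 0.5, each little core is half as fast as the big
# processor, so c little cores contribute only c/2 big-processor
# equivalents of parallel throughput.
print(speedup_heterogeneous(0.9, 16, 0.6, 0.2802, 0.5))  # ~4.17
```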
4.2 Energy Efficiency Model

We also adopt the variables sc, wc, k and kc from Woo and Lee (2008), where sc represents a little core's performance normalized to that of the big processor (0 ≤ sc ≤ 1), wc represents an active little core's power consumption relative to that of the active big processor (0 ≤ wc ≤ 1), k represents the fraction of power that the big processor consumes in the idle state (0 ≤ k ≤ 1), and kc represents the fraction of a little core's idle power normalized to the same core's overall power consumption (0 ≤ kc ≤ 1). Assume that the big processor in the active state consumes a power of 1. During the sequential fraction of executing a given program, the big processor consumes 1 and the c idle little cores consume c · wc · kc. During the parallel fraction, the big processor consumes k and the c little cores consume c · wc.
Because the costs (in normalized execution time) of executing the sequential and parallel portions are pws = (1 − fc) · pc + fud · (1 − pc) and pwp = fc/(c·sc) · pc + fpd · (1 − pc) respectively, the average power is

W = [pws · (1 + c·wc·kc) + pwp · (k + c·wc)] / [((1 − fc) + fc/(c·sc)) · pc + fud · (1 − pc)]    (5)
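In code (our sketch, reusing the notation above), the average power of equation (5) is the energy-weighted cost divided by the total execution time:

```python
def average_power(fc: float, c: int, pc: float, fud: float,
                  sc: float, wc: float, k: float, kc: float) -> float:
    """Average power W, equation (5), normalized to the active big
    processor's power of 1."""
    fpd = 1.0 - fud
    pws = (1.0 - fc) * pc + fud * (1.0 - pc)      # sequential cost
    pwp = fc / (c * sc) * pc + fpd * (1.0 - pc)   # parallel cost
    energy = (pws * (1.0 + c * wc * kc)  # big core active, rest idle
              + pwp * (k + c * wc))      # little cores active
    time = ((1.0 - fc) + fc / (c * sc)) * pc + fud * (1.0 - pc)
    return energy / time
```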
Similar to Woo and Lee (2008), we can also model the performance (speedup) per watt (S/W), which represents the performance achievable at an average power W, in equation (6):

S/W = [1 / (((1 − fc) + fc/(c·sc)) · pc + fud · (1 − pc))] × [((1 − fc) + fc/(c·sc)) · pc + fud · (1 − pc)] / [pws · (1 + c·wc·kc) + pwp · (k + c·wc)]
    = 1 / (pws · (1 + c·wc·kc) + pwp · (k + c·wc))    (6)
Differentiating equation (6) with respect to c shows that S/W reaches its maximum if and only if c = sqrt((k · fc · pc) / (pws·kc·wc·sc + fpd·wc·(1 − pc)·sc)). This shows that the optimal number of little cores is determined by a tradeoff between the capability of executing parallel computation and the capability of executing parallel data preparation on the little cores. In addition to the metric of performance (speedup) per watt (S/W), we can also derive the corresponding metric of performance (speedup) per joule (S/J) in equation (7):

S/J = [1 / (pws + fc·pc/(c·sc))] × [1 / (pws · (1 + c·wc·kc) + pwp · (k + c·wc))] = [1 / (pws + fc·pc/(c·sc))] × S/W    (7)
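Putting the pieces together, the following sketch (ours; the printed numbers are illustrative and are not taken from the paper's figures) computes S/W, S/J and the closed-form optimum of c under the parameter settings used in Section 5:

```python
import math

def perf_per_watt(fc, c, pc, fud, sc, wc, k, kc):
    """S/W, equation (6): the inverse of the energy-weighted cost."""
    fpd = 1.0 - fud
    pws = (1.0 - fc) * pc + fud * (1.0 - pc)
    pwp = fc / (c * sc) * pc + fpd * (1.0 - pc)
    return 1.0 / (pws * (1.0 + c * wc * kc) + pwp * (k + c * wc))

def perf_per_joule(fc, c, pc, fud, sc, wc, k, kc):
    """S/J, equation (7)."""
    pws = (1.0 - fc) * pc + fud * (1.0 - pc)
    return (perf_per_watt(fc, c, pc, fud, sc, wc, k, kc)
            / (pws + fc * pc / (c * sc)))

def optimal_little_cores(fc, pc, fud, sc, wc, k, kc):
    """Closed-form maximizer of S/W over c (text after eq. (6))."""
    fpd = 1.0 - fud
    pws = (1.0 - fc) * pc + fud * (1.0 - pc)
    return math.sqrt(k * fc * pc
                     / (pws * kc * wc * sc
                        + fpd * wc * (1.0 - pc) * sc))

# Section 5 settings: sc=0.5, wc=0.25, k=0.3, kc=0.2, pc=0.6,
# fh=0.8, alpha=0.699, dc=0.333, hence fud ~ 0.2802.
fud = (1 - 0.8) + 0.333 * (1 - 0.699) * 0.8
for c in (1, 4, 8, 16, 64, 256):
    print(c, perf_per_watt(0.9, c, 0.6, fud, 0.5, 0.25, 0.3, 0.2))
print(optimal_little_cores(0.9, 0.6, fud, 0.5, 0.25, 0.3, 0.2))
```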
5 Evaluation and Analysis

Assume that the integrated heterogeneous multicore system consists of one big processor (e.g., a superscalar multicore CPU) and c little cores (e.g., GPU cores). In order to compare the S/W and S/J results with those in Woo and Lee (2008), we likewise set sc, wc, k and kc to 0.5, 0.25, 0.3 and 0.2, respectively, and we set the variables pc, fh, α and dc to 0.6, 0.8, 0.699 and 0.333, respectively.

Figures 3 and 4 compare the analytical performance results for an asymmetric multicore system. Figure 3 shows that the maximum relative performance of P + c* is 56.39, where fc = 0.99 and c = 256. Note that the variable fc is equivalent to the variable f in Woo and Lee (2008). As the five curves in Figure 3 show, the relative performance increases dramatically with the number of little cores when fc = 0.99, meaning that even a modestly higher percentage of parallel computation brings a huge performance gain: comparing the curves for fc = 0.99 and fc = 0.9, a mere 10% increase in parallel computation yields more than a 5x increase in performance at c = 256. However, each curve in Figure 4 rises much more gently than its counterpart in Figure 3, and the maximum relative performance is only 3.35, at fc = 0.99 and c = 256. The reason is that much more of the cost is consumed by data preparation, and the increase in relative performance is the result of balancing computation against data preparation. Figure 4 also shows that a program with a low percentage of parallel computation, such as fc = 0.3, cannot obtain a large increase in relative performance even when the number of little cores is continuously increased up to 256. Unfortunately, even when the percentage of parallel computation is higher than 90%, the increase is not as dramatic as in Figure 3.
Figure 3 Scalable Performance Distribution of Heterogeneous Asymmetric Multicore (HAM) in terms of Woo's state-of-the-art asymmetric multicore system P + c*, where sc = 0.5, wc = 0.25, k = 0.3, and kc = 0.2. (Plot: relative performance versus number of little cores, 1-256, for fc = 0.3, 0.5, 0.7, 0.9 and 0.99.)
Figure 4 Scalable Performance Distribution of Heterogeneous Asymmetric Multicore (HAM) with consideration of the overhead of data preparation, where pc = 0.6, fh = 0.8, α = 0.699, and dc = 0.333. (Plot: relative performance versus number of little cores, 1-256, for fc = 0.3, 0.5, 0.7, 0.9 and 0.99.)
Figure 5 Performance per Watt Distribution of Heterogeneous Asymmetric Multicore (HAM) in terms of Woo's state-of-the-art asymmetric multicore system P + c*, where sc = 0.5, wc = 0.25, k = 0.3, and kc = 0.2. (Plot: relative performance per watt versus number of little cores, 1-256.)
As shown in Figures 5 and 6, for each number of little cores, a higher fraction of parallelizable computation in a given program yields a higher relative performance per watt. For example, as shown in Figure 6, the values of relative performance per watt at c = 16 are respectively 0.5397, 0.5877, 0.6452, 0.7517 and 0.7517 for fc = 0.3, 0.5, 0.7, 0.9 and 0.99. This means that a program consisting of a higher fraction of parallel execution consumes more energy. We can easily observe that the trend of the curves in Figure 5 is similar to that in Figure 6: the overall behavior of the state-of-the-art asymmetric multicore system matches that of the integrated heterogeneous parallel multicore system. In Figure 5, the values of relative performance per watt decrease after c = 4 (for fc = 0.9 and fc = 0.99) or c = 8 (for fc = 0.3, 0.5 and 0.7). This means that, as the number of little cores grows beyond the critical value of c, the average performance achievable within a fixed energy budget becomes worse and worse. The reason is that, as the number of little cores increases for a fixed fraction of parallelizable computation, the performance of the little cores saturates; hence the performance per watt decreases beyond the critical number of little cores, and additional little cores consume more energy maintaining synchronization, communication, cache coherence, etc., rather than improving performance. However, for a program with a higher fraction of computing parallelism, the slope of the performance decrease is gentler; for example, the curve for fc = 0.99 lies above the curve for fc = 0.9, because a higher fraction of parallel computation provides enough computing tasks to keep more little cores busy in parallel.

There is a special case in Figure 6: the value of performance per watt is 1 where fc = 0.3 and c = 1, i.e., a heterogeneous multicore system built with only one big processor and one little core.
Figure 6 Scalable Performance per Watt Distribution of Heterogeneous Asymmetric Multicore (HAM), where sc = 0.5, wc = 0.25, k = 0.3, and kc = 0.2. (Plot: relative performance per watt versus number of little cores, 1-256.)
The value of performance per watt discloses that the sequential cost of computation and data preparation consumes the same amount of energy as the parallel computation and data preparation. Interestingly, this point occurs at fc = 0.3, which means that the energy of executing the 30% of parallel computation in the given program is approximately identical to the energy of carrying out the 70% of data preparation. Comparing Figure 5 with Figure 6, we observe that the five curves in Figure 6 lie closer to one another than those in Figure 5, converging toward a value of about 0.1. This is because some parts of the sequential data preparation are executed by the big processor and some parts of the parallelizable data preparation are executed by the little cores, so whether the fraction of parallel computation is high or not, a stable balance is struck between sequential and parallel tasks. To achieve the peak relative performance per watt, in terms of Figure 6, we suggest configuring the computer system as one CPU core plus four GPU cores; a large-scale computer system should be scaled in the same ratio of 1 CPU : 4 GPUs. The maximum values of relative performance per watt for a given case (e.g., four little cores) in Figure 5 and Figure 6 are 1.53 and 1.13 respectively. This means that, compared with the performance of the state-of-the-art asymmetric multicore system based on the traditional Amdahl's law, the integrated heterogeneous multicore system sacrifices about 26% of its computing performance to carry out data preparation within a fixed energy budget. Once the overhead of data preparation is taken into consideration, the relative performance per watt decreases much more sharply with the number of little cores than in Figure 5, because the energy consumed by data-preparation tasks is not considered in the legacy performance formula of Amdahl's law.
Figure 7 Performance per Joule Distribution of Heterogeneous Asymmetric Multicore (HAM) in terms of Woo's state-of-the-art asymmetric multicore system P + c*, where sc = 0.5, wc = 0.25, k = 0.3, and kc = 0.2. (Plot: relative performance per joule versus number of little cores, 1-256.)
In a practical computer system, as the number of little cores increases, the costs of executing data preparation, such as synchronization, communication and data transformation, also increase. However, the time consumed by data preparation gains no speedup under the traditional Amdahl's law; it merely wastes energy. As a result, Figure 6 reflects the practical relative performance per watt more faithfully than Figure 5.

Figure 7 shows the performance-per-joule distribution of P + c*. At a given energy budget, the relative performance per joule decreases after passing its maximum, except for the fc = 0.99 line. For instance, the extremum of the fc = 0.9 line reaches 8.88 at c = 64, after which the performance per joule goes down as the number of little cores increases. Figure 8 shows the S/J distribution of the heterogeneous asymmetric multicore (HAM). Taking the fc = 0.9 line as an example, its maximum is only 1.96 at c = 8, whereas the relative performance per joule at c = 8 is 4.68 in Figure 7. Comparing Figure 7 with Figure 8, we observe that the relative performance per joule of HAM is less than its equivalent in P + c*. The S/J metric faithfully reflects the performance of a practical multicore system at a given energy budget.
6 Conclusions

We have reevaluated the performance-energy efficiency of an integrated heterogeneous parallel multicore system based on a mathematical model of dataflow, which splits a program (task) into computation and data preparation, rather than computation only as in the state-of-the-art Amdahl's law.
Figure 8 Scalable Performance per Joule Distribution of Heterogeneous Asymmetric Multicore (HAM), where sc = 0.5, wc = 0.25, k = 0.3, and kc = 0.2. (Plot: relative performance per joule versus number of little cores, 1-256.)
By deriving new performance-energy equations from the analytic model of an integrated heterogeneous multicore system that accounts for the overhead of data preparation, we obtained new results and compared them with those of Woo's work. The results show that our model reflects practical heterogeneous multicore systems much more closely than the traditional Amdahl's model. According to the results, adding more little cores beyond the critical point (e.g., c = 4 or c = 8) brings no significant performance gain and only consumes more energy. Therefore, for a heterogeneous multicore system built with one big core and many little cores under a limited energy budget, the gain in performance per watt comes from improving the fraction of parallel tasks, including parallel computation and parallel data preparation, rather than from increasing the number of little cores. Nowadays, the fraction of parallel computation in a given program is hard to improve further through instruction-level parallelism (ILP) or thread-level parallelism (TLP). However, performance can also be accelerated by improving the fraction of parallel data preparation, through techniques such as data prefetching, no-copy data transfer, solid-state storage, computing in memory and other emerging technologies. We will continue to investigate the performance gain within a limited energy (or power) budget for supercomputers or clusters built with multiple big processors and many little cores, while considering the overhead of data preparation.
Acknowledgements

We thank the anonymous reviewers for their invaluable comments. This work was partially funded by the Shanghai Municipal Natural Science Foundation (15ZR1428600), the Shanghai Pujiang Program (16PJ1407600) and the National Science Foundation of the United States under Grant No. XPS-1439097. Any opinions, findings and conclusions expressed in this paper are those of the authors and do not necessarily reflect the views of the sponsors.
References

Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, pages 483-485. ACM, 1967.

Diego Bouvier, Benjamin Cohen, Walter Fry, Sreekanth Godey, and Michael Mantor. Kabini: An AMD accelerated processing unit system on a chip. IEEE Micro, 34(2):22-33, 2014.

Hao Che and Minh Nguyen. Amdahl's law for multithreaded multicore processors. Journal of Parallel and Distributed Computing, 74(10):3056-3069, 2014.

Bill Dally. Project Denver processor to usher in new era of computing. [Online] Available from http://blogs.nvidia.com/2011/01/project-denver-processor-to-usher-in-new-era-of-computing/ [Accessed 2nd August 2012], 2011.

Alejandro Duran and Michael Klemm. The Intel many integrated core architecture. In High Performance Computing and Simulation (HPCS), 2012 International Conference on, pages 365-366. IEEE, 2012.

Rong Ge and Kirk W. Cameron. Power-aware speedup. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1-10. IEEE, 2007.

Mark D. Hill and Michael R. Marty. Amdahl's law in the multicore era. Computer, 41(7):33-38, 2008.

E.M. Karanikolaou, E.I. Milovanović, I.Ž. Milovanović, and M.P. Bekakos. Performance scalability and energy consumption on distributed and many-core platforms. The Journal of Supercomputing, 70(1):349-364, 2014.

S.H. Kim, Dongkyu Kim, Chi-Kwan Lee, Won Seob Jeong, Won Woo Ro, and Jean-Luc Gaudiot. A performance-energy model to evaluate single thread execution acceleration. IEEE Computer Architecture Letters, 14(2):99-102, 2015.

Sebastián Moreno Londoño and José Pineda de Gyvez. Extending Amdahl's law for energy-efficiency. In Energy Aware Computing (ICEAC), 2010 International Conference on, pages 1-4. IEEE, 2010.

Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Ronald Dreslinski Jr., Thomas F. Wenisch, and Scott Mahlke. Heterogeneous microarchitectures trump voltage scaling for low-power cores. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, pages 237-250. ACM, 2014.
Ami Marowka. Extending Amdahl's law for heterogeneous computing. In Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on, pages 309-316. IEEE, 2012.

Fabian Nowak. Evaluating the energy efficiency of reconfigurable computing toward heterogeneous multi-core computing. In Architecture of Computing Systems (ARCS), 2014 27th International Conference on, pages 1-6. VDE, 2014.

Songwen Pei, Myoung-Seo Kim, Jean-Luc Gaudiot, and Naixue Xiong. Fusion coherence: Scalable cache coherence for heterogeneous kilo-core system. In Advanced Computer Architecture, pages 1-15. Springer, 2014.

Songwen Pei, Xiongdong Wu, and Zuoqi Tang. An approach to accessing unified memory address space of heterogeneous kilo-cores system. Journal of National University of Defense Technology, 37(1):28-33, 2015.

Songwen Pei, Myoung-Seo Kim, and Jean-Luc Gaudiot. Extending Amdahl's law for heterogeneous multicore processor with consideration of the overhead of data preparation. IEEE Embedded Systems Letters, 8(1):1-4, 2016.

Phil Rogers. Heterogeneous system architecture overview. In Hot Chips, volume 25, 2013.

Lina Sawalha and Ronald D. Barnes. Energy-efficient phase-aware scheduling for heterogeneous multicore processors. In Green Technologies Conference, 2012 IEEE, pages 1-6. IEEE, 2012.

Xian-He Sun and Yong Chen. Reevaluating Amdahl's law in the multicore era. Journal of Parallel and Distributed Computing, 70(2):183-188, 2010.

Guibin Wang and Xiaoguang Ren. Power-efficient work distribution method for CPU-GPU heterogeneous system. In Parallel and Distributed Processing with Applications (ISPA), 2010 International Symposium on, pages 122-129. IEEE, 2010.

Dong Hyuk Woo and Hsien-Hsin S. Lee. Extending Amdahl's law for energy-efficient computing in the many-core era. Computer, 41(12):24-31, 2008.

Fred J. Pollack. New microarchitecture challenges in the coming generations of CMOS process technologies. In Proceedings of the International Symposium on Microarchitecture (MICRO), page 2. IEEE, 1999.

Kaivalya M. Dixit. Overview of the SPEC benchmarks. In The Benchmark Handbook, Ch. 9. Morgan Kaufmann Publishers, 1998. http://research.microsoft.com/en-us/um/people/gray/benchmarkhandbook/chapter9.pdf

Songwen Pei, Myoung-Seo Kim, and Jean-Luc Gaudiot. Reevaluating the overhead of data preparation for asymmetric multicore system on graphics processing. KSII Transactions on Internet and Information Systems, 10(7):3231-3244, 2016.

Norifumi Nishikawa, Miyuki Nakano, and Masaru Kitsuregawa. Energy aware RAID configuration for data intensive applications in enterprise storages. Int. J. of Computational Science and Engineering (IJCSE), 11(3):227-238, 2015.
Zongwei Zhu, Xi Li, Chao Wang, and Xuehai Zhou. Memory power optimisation on low-bit multi-access cross memory address mapping schema. Int. J. of Embedded Systems (IJES), 6(2/3):240-249, 2014.

Abdallah Al Zain et al. Orchestrating computational algebra components into a high-performance parallel system. Int. J. of High Performance Computing and Networking (IJHPCN), 7(2):76-86, 2012.