Evaluating DVFS and Concurrency Throttling on IBM's Power8 ... - SC15

Evaluating DVFS and Concurrency Throttling on IBM’s Power8 Architecture Wei Wang

Edgar A. Le´on

University of Delaware Email: [email protected]

Lawrence Livermore National Laboratory Email: [email protected]

Abstract—Two of the world’s next-generation of supercomputers at the U.S. national laboratories will be based on IBM’s Power architecture. Early insights of how this architecture impacts applications are beneficial to attain the best performance. In this work, we investigate the effects of DVFS and thread concurrency throttling on the performance of applications for an IBM OpenPower, Power8 system. We apply these techniques dynamically on a per region basis to three HPC codes: miniFE, LULESH, and Graph500. Our empirical results offer the following insights. First, concurrency throttling provides significant performance improvements. Second, 4-way simultaneous multi-threading (SMT) performs as well as 8-way SMT with potential gains in power-efficiency. And, third, applying informed frequency and concurrency throttling combinations on memory-bound regions results in greater performance and energy efficiency.

I. I NTRODUCTION The goal of exascale computing continues to drive innovations in all aspects of high performance computing (HPC). IBM’s new Power8 processor is designed for big data, analytics, and cloud environments [1]. Previous evaluations of this processor [2] show its potential for scientific applications. However, more explorations and insights are needed to best leverage this system. IBM’s Power8 architecture features 8-way simultaneous multi-threading (SMT-8) and per-core DVFS control with 69 distinct frequencies. In this work, we characterize application performance using different DVFS and concurrency settings. A key aspect of our analysis is the application of these features at the granularity of loops rather than the entire application. Our contributions are as follows. We evaluated how DVFS affects application performance on the Power8 architecture and how to improve energy consumption without performance degradation; we also evaluated concurrency throttling showing significant performance improvements; and, finally, SMT-4 performs as well as SMT-8 for several applications. II. M ETHODOLOGY We measure application performance under various thread concurrencies and DVFS frequencies. We also compared application performance using SMT-8 and SMT-4 configurations with the same number of threads. We tested three OpenMP codes: Graph500, LULESH 2.0, and miniFE. The granularity Prepared by LLNL under Contract DE-AC52-07NA27344. LLNL-ABS676202.

of our measurements is per OpenMP parallel loop. Such finegrain analysis provides more insights than just measuring the entire application. By analyzing the results from running applications with different frequency and concurrency settings, we empirically identify memory-bound regions that could benefit from dynamic runtime frequency change. We invoke energy control API calls before and after such code regions to achieve the best performance and energy consumption. III. R ESULTS The experiments were performed on a TYAN GN70-BP010 Power8 system. It has 4 processor cores and supports up to 32 hardware threads on the SMT-8 configuration. A. OpenMP Thread Concurrency Throttling Figures 1a and 1b show the execution times when changing the number of OpenMP threads per core for Graph500 and LULESH. Applications do not always achieve the best performance with 8 threads per core. For example, LULESH (and miniFE; not shown due to space constraints) performs best with 4 threads per core. Graph500, on the other hand, is able to leverage all 8 threads per core. Comparing SMT8 and SMT-4, SMT-4 performans similarly to SMT-8. SMT-4 can potentially lead to better energy efficiency for applications that are memory-intense like LULESH and miniFE. B. DVFS Figures 2 and 3 show the execution times when changing the DVFS frequency for Graph500 and miniFE. A lower frequency leads to longer execution time for both benchmarks. However, applications slow down quite differently when reducing the frequency. Memory bound applications like miniFE provide energy saving opportunities for DVFS. Running at 2.061GHz, Graph500 slows down by 1.6× while miniFE only slows down by 1.2×. C. Combining DVFS and Concurrency Throttling miniFE contains a dominant loop that is memory-bound. The loop (Loop1 in Figures 4 and 5) is insensitive to reducing the frequency and the number of threads. The same two figures show that loops in miniFE behave differently when the frequency and the number of threads are reduced. Given this observation, we executed miniFE with 16 threads and mixed frequencies. Just before entering Loop1, the DVFS

60

SMT=8 SMT=4

2500

Execution Time

Execution Time

3000

2000 1500 1000 500 0

1

2

3

4

5

6

7

8

SMT=8 SMT=4

50 40 30 20 10 0

1

Number of OpenMP Threads Per Core

(a) Graph500 (best with 8 threads per core)

2

3

4

5

6

7

8


(b) LULESH (best with 4 threads per core)

90

Execution Time

90

32 OpenMP threads 28 OpenMP threads 24 OpenMP threads

85 80

85 80

75

75

70

70

65

65

60

60

55

55

50

50

45

45

Normalized Time (Lower is better)

Fig. 1. Comparing the application performance with varying number of threads per core and SMT settings

2.4

2

1.8

1.8

1.6

1.6

1.4

1.4

1.2

1.2

1

1

0.8

0.8 1

61 20

27 24

93 26

59 29

25 32

91 34

57 37

23 40

22 43

32 OpenMP threads 24 OpenMP threads 16 OpenMP threads

Execution Time

130 125

135 130 125

120

120

115

115

110

110

2.4 2.2

1.8

1.6

1.6

1.4

1.4

1.2

1.2

1

1

0.8

0.8 61

20

27

93

24

59

26

25

29

91

32

34

We evaluated application performance on IBM’s Power8 architecture leveraging DVFS and thread concurrency throttling. Although this architecture provides eight hardware threads per core (CPUs), the optimal number of software threads may be significantly lower particularly for memory intense regions. To leverage the best performance on this highlythreaded architecture, concurrency throttling is necessary. In addition, we found that SMT-4 performs as well as SMT-8 for applications that do not require all eight hardware threads. Our

8

2

57

IV. C ONCLUSIONS

7

1.8

23

frequency is set to the minimum possible. The frequency is reset to the maximum immediately after Loop1. We achieve slight performance improvement by combining concurrency throttling and DVFS.

6

2

37

Fig. 3. miniFE: performance slowdown of only 1.2X

5

Loop1 Loop2

2.2

22

61 20

27 24 93 26 59 29 25 32 91 34 57 37 23 40

22 43

Frequency (MHz)

4

2.4

40

105

3

Fig. 4. miniFE: Concurrency throttling affects the two loops differently

43

105

2


Normalized Time (Lower is better)

135

2.2

2

Frequency (MHz)

Fig. 2. Graph500: performance slowdown of more than 1.6X

2.4

Loop1 Loop2

2.2

Frequency (MHz)

Fig. 5. miniFE: DVFS affects the two loops differently

fine-granularity analysis shows that memory-bound regions are fairly insensitive to frequency changes resulting in energy savings. Finally, combining DVFS and concurrency throttling on the Power8 architecture can lead to improved performance and energy consumption. R EFERENCES [1] J. Friedrich, H. Le, W. Starke, J. Stuechli, B. Sinharoy, E. Fluhr, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, D. Hogenmiller, F. Malgioglio, R. Nett, R. Puri, P. Restle, D. Shan, Z. Deniz, D. Wendel, M. Ziegler, and D. Victor, “The POWER8TM processor: Designed for big data, analytics, and cloud environments,” in IEEE International Conference on IC Design Technology, 2014. [2] A. Adinetz, P. Baumeister, H. B¨ottiger, T. Hater, T. Maurer, D. Pleiter, W. Schenck, and S. Schifano, “Performance Evaluation of Scientific Applications on POWER8,” in International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, 2014.

Evaluating DVFS and Concurrency Throttling on IBM's Power8 ... - SC15

Evaluating DVFS and Concurrency Throttling on IBM's Power8 ... - SC15

Suggest Documents

Power8

Evaluating the Trade-off Between DVFS Energy-savings and ... - PUCRS

power8 - Microway

POWER8 Hardware Technical - IBM

Power Roadmap POWER8 - IBM

Evaluating Concurrency Options in Software ... - Semantic Scholar

IOTC-2012-SC15-INF05

POWER8 Hardware Technical - IBM

Concurrency Concurrency

Statewide Concurrency Literature Survey on Various Concurrency ...

A Survey and Measurement Study of GPU DVFS on Energy ...

IBMS-ECTS 2005 Abstracts

Inter-cluster Thread-to-core Mapping and DVFS on Heterogeneous ...

Concurrency

IBM POWER8 processor core microarchitecture

POWER8: The first OpenPOWER processor

Hydraulic Domestic Heating by Throttling

Facilitating Irregular Applications on Many-core Processors - SC15

On Negotiation as Concurrency Primitive

ALGEBRAIC TOPOLOGY AND CONCURRENCY

Concurrency Control and Recovery

CAPTCHA-Free Throttling - Semantic Scholar

Fine-Grained DVFS Using On-Chip Regulators - UGent-ELIS homepage

Coherent Accelerator Processor Proxy (CAPI) on POWER8 - Nallatech