Making Sense of Performance Counter Measurements on Supercomputing Applications

Jeff Diamond1, John D. McCalpin2, Martin Burtscher3, Byoung-Do Kim2, Stephen W. Keckler1,4, James C. Browne1

1 Department of Computer Sciences, The University of Texas at Austin, Email: {jdiamond, skeckler, browne}@cs.utexas.edu
2 Texas Advanced Computing Center, The University of Texas at Austin, Email: {mccalpin, bdkim}@tacc.utexas.edu
3 Institute for Computational Engineering and Sciences, The University of Texas at Austin, Email: [email protected]
4 NVIDIA Corporation

August 12, 2010

Department of Computer Sciences Technical Report TR-10-15
The University of Texas at Austin

Abstract

The computation nodes of modern supercomputers consist of multiple multicore chips. Many scientific and engineering application codes have been migrated to these systems with little or no optimization for multicore architectures, effectively using only a fraction of the cores on each chip or achieving suboptimal performance from the cores they do utilize. Performance optimization on these systems requires both different measurements and different optimization techniques than those for single-core chips. This paper describes the primary performance bottlenecks unique to multicore chips and sketches the roles that several commonly used measurement tools can most effectively play in performance optimization. The HOMME benchmark code from NCAR is used as a representative case study on several multicore-based supercomputers to formulate and interpret measurements and derive characterizations relevant to modern multicore performance bottlenecks. Finally, we describe common pitfalls in performance measurements on multicore chips and how they may be avoided, along with a novel high-level multicore optimization technique that increased performance by up to 35%.


[Figure 1: The motivation for this study: using all the cores on a chip may actually reduce performance. Homme execution time on Ranger using the PGI compiler. (Plot of intranode strong scaling: seconds to complete vs. active cores in a 16-core node.)]

1 Introduction

Performance analysis and optimization have been a critical component of computer science almost as long as computers have existed. However, roughly every decade, computer architecture and relative performance properties change enough to require changes to performance analysis and high-level code optimization techniques. In the past decade, widely available performance analysis tools at all levels have evolved to find and eliminate performance bottlenecks. However, these tools still tend to focus on the performance bottlenecks most common in 1990s-era architectures, such as branch misprediction, local cache miss rates, network traffic, or load imbalance. Tools have become increasingly heavyweight and reliant on sampling techniques to reduce their overhead. Very little attention has been given to the interaction of cores on a single chip, and even less to high-level code techniques to measure and ameliorate these effects. This is understandable, as chips with four or more cores have only become common in the last few years, and intra-chip scalability issues have only recently been recognized as a serious issue. This paper is an attempt to identify new optimization and measurement techniques necessary for the now ubiquitous multicore CMP architectures.

1.1 Motivation

The compute nodes of modern supercomputers are constructed with multicore chips, and many will have heterogeneous architectures. Many scientific and engineering application codes have been migrated to multicore-chip-based architectures with little or no optimization for the multicore execution environment. The result is that many applications can effectively use only a fraction of the cores on a multicore chip and may not obtain maximum performance from the cores they do utilize. Figure 1 shows an example where turning on more than half the cores of a node can actually reduce overall performance. Performance bottlenecks for multicore chips are both quantitatively and qualitatively different from those for single-core chips. Even though many applications on single-core

chips are memory bandwidth limited, optimization of performance bottlenecks for multicore chips requires both different measurements and different optimizations than those for single-core chips. Performance optimization for multicore chips is currently largely ad hoc. There are many tools, none of which by itself supplies the data and interpretation necessary for analysis and characterization of the performance bottlenecks on multicore chips, and effective performance analysis for multicore chips may require the mastery and use of several tools. Each tool must be used with caution and with knowledge of its strengths and weaknesses. There has been little systematic comparative analysis of current performance optimization tools for multicore chips, and effective utilization of the tools commonly requires detailed architectural knowledge. Many traditional source code optimizations, such as loop unrolling, focus on the optimization and measurement of computation speed. Less discussed are the source-code-level optimizations and measurement techniques that relate to the memory bottlenecks commonly occurring in multicore execution.

1.2 Goals

The goals of this paper are to advance the state of the art for analyzing those performance bottlenecks systemic to supercomputer applications running on multicore chips, illustrate issues that may arise with commonly available performance tools and approaches, and demonstrate simple optimization techniques that address the most important performance issues in modern processors.

1.3 Approach

The approach is case-study based. Studies of the HOMME benchmark code from NCAR are used to evaluate measurement tools, to derive characterizations of multicore performance bottlenecks, to identify and characterize common pitfalls in performance measurement for multicore chips, and to derive performance optimizations for multicore chips. Most of the measurements reported in this paper were done on Ranger (AMD Barcelona chips), but two other homogeneous node architectures using other popular chips were also used.

1.4 Contributions

This paper describes the important factors which cause performance bottlenecks on multicore chips. It compares several commonly used measurement tools and sketches the roles each can most effectively play in performance optimization for multicore chips. It sketches a process which can be applied to determine the locations and the nature of multicore-related performance bottlenecks. It describes pitfalls in measurements on multicore chips and how the pitfalls may be avoided. It applies several existing tools to case studies of the HOMME benchmark code from NCAR on several multicore based supercomputers to formulate and interpret multicore-relevant measurements and derive characterizations of multicore performance bottlenecks. Applying the method to the codes used as the basis for the case studies was found to produce performance increases of up to 35%. (See Figure 23).

1.5 Paper Summary

In this paper, we illustrate some of the performance issues and performance measurement issues unique to supercomputers based on multicore chips. As the number of cores per chip increases past four, and primary DRAM access becomes a more significant percentage of total memory access time, we expect these issues to grow in importance. Section 2 of this paper describes at a high level the architectural bottlenecks and measurement issues unique to CMPs, along with the notion of intrachip scalability. Section 3 describes the primary case study, Homme, and the primary supercomputers used, Ranger and Longhorn. Section 4 describes the measurement tools used along with their strengths and weaknesses. Section 5 explores in more detail the measurement and analysis issues unique to

multicore, emphasizing the way code optimization techniques and performance metrics have changed. The end of Section 5 gives a simple approach to performance measurements on CMPs that avoids the most common pitfalls. Finally, Section 6 illustrates some high-level code optimizations to address the primary issues of CMP scalability, with a detailed analysis of an optimization technique we call microfission.

2 Emerging HPC Architectures

Large scale HPC systems have historically had complex performance issues due to complexities in their architecture - distributed memory systems, complex interconnect topologies, NUMA architectures to name a few. Modern HPC systems are even more complex, adding multicore chips, new levels to the memory hierarchy, looser integration, and heterogeneous architectures, which leverage computational accelerators like FPGAs [1], Cell Processors [2], and GPUs [3]. Even homogeneous platforms are complicated by the use of multicore chips, and this paper focuses primarily on homogeneous multicore issues. As HPC systems continue to increase in complexity, software performance issues become increasingly complex. Although tools have been developed to help get information out of these systems, the information that comes out is difficult to interpret for many reasons. Programmers must be aware of emerging architectural features which can lead to new types of performance bottlenecks. They must reason about what metrics and hardware events might identify these issues. They must confirm the degree to which the actual performance data approximates these metrics, and ensure that they are sampling in a way that produces meaningful results. The goal of this paper is to highlight these challenges through a case study using Homme [4] as a representative HPC application on multiple supercomputing platforms to illustrate the type of data that comes out, how it can be misleading, and guidance as to how to better gather and interpret application characteristics to enable better optimization. This paper compares and contrasts the fidelity of information that one can obtain from contemporary analysis tools, while pointing out the pitfalls associated with using performance data blindly without understanding the implications of how it is collected or what it is actually showing. We illustrate these principles with a concrete application on concrete hardware. The rest of this section describes the key architectural features of multicore systems which lead to poor HPC performance, followed by an introduction to intracore scalability issues, optimization behavior in the multicore regime, and the unique issues involved in measuring common multicore performance issues.

2.1 Modern Multicore Issues

The first pitfall addressed by this paper is the assumption that applications and conventional optimization techniques behave the same on multicore processors as on uniprocessors. Many HPC applications are moved to multicore-based supercomputers unmodified, often resulting in limited performance benefits beyond the use of two to three cores per quad-core chip. Usability of 6- and 8-core chips may be even worse. Figure 1 shows the effect on execution time when attempting to use an increasing number of cores on a 16-core node. Multicore chips add a new level to the organizational hierarchy, with significant performance implications. This section highlights the new architectural features of multicore chips that are the primary sources of performance bottlenecks in HPC applications.

Modern supercomputers tend to have relatively small shared-memory nodes consisting of 2-4 multicore chips. Most current chips are quad core and integrate memory controllers for locally attached DRAM DIMMs. Each DRAM DIMM typically has two "ranks", each of which caches a local DRAM "row" called a DRAM "page". In a Ranger node, the DRAM system supports 32 different open pages of 32KB each. Because switching pages is slow, accessing open DRAM pages ("page hits") is important to performance.

Modern multicore chips greatly exacerbate the memory wall. Multiple cores now share the same L3 cache and off-chip memory controller, resulting in less cache and off-chip bandwidth per core. Memory traffic from on-chip cores and remote chips causes contention and thrashing in the memory controller and DRAM pages, reducing DRAM efficiency. Coherence snooping dramatically increases the latency to shared memory. Coherence traffic was observed at up to 80% of L2 traffic, despite running separate MPI processes that share no data. The result of all cores sharing a memory hierarchy is that performance efficiency can drop sharply once two cores per chip are exceeded, as demonstrated by Homme in Figures 3 and 5, and that performance degrades severely past 8 cores per node (2 cores per chip). This is despite the fact that Homme has nearly perfect classical scaling characteristics, as shown in Figures 2 and 4, and that on-chip communication should in principle be faster than communication between chips or nodes.
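To put the open-page capacity quoted above in perspective, a quick back-of-the-envelope calculation (our arithmetic, not a figure from the paper):

```latex
32\ \text{open pages} \times 32\,\mathrm{KB} = 1\,\mathrm{MB}\ \text{of open-row data per node}
\;\approx\; 64\,\mathrm{KB}\ \text{per core on a 16-core node}
```

With each core sustaining several concurrent access streams, this small budget of open rows is easily exhausted, which is one way to read the DRAM page thrashing described above.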

[Figure 2: Homme interchip strong scaling. As a classical parallel algorithm, Homme can scale up 1000x on a fixed problem size before efficiency per core halves; as the total core count increases, Elements Per Core (computation load) decreases to below 2. (Plot of efficiency and communication fraction vs. total cores, EPC = 1.5 to 384.)]

2.2 Intracore scalability and the multicore regime

There are two primary ways in which a parallel algorithm can be extended, or "scaled," across a large number of cores. In the classical notion, the same problem size is solved more quickly as it is spread across more cores. As the core count goes up, the computation per core drops. As the computation becomes spread quite thin, the overhead and communication costs dominate, and as a result actual performance plateaus and then falls as the core count increases. This is the type of parallel scaling embodied by Amdahl's Law [5], which is referred to in the HPC community as strong scaling. However, strong scaling is no longer the relevant type of scaling for most supercomputer applications. In these applications, multiple cores are applied to solve a larger problem in similar time, not to solve the same problem more quickly. This is the type of scaling embodied by Gustafson's Law [6] and is referred to in the HPC community as weak scaling. In weak scaling, we think of a problem in terms of a constant amount of work per core. To be so scalable, most modern HPC applications avoid all-to-all communication and even global barriers, which tend to grow in cost as the log of the number of cores. When we speak of the efficiency of parallel code, we are comparing performance per core to the single-core case, that is, the percentage of linear speedup obtained when scaling. While strong scaling typically only obtains a fraction of linear speedup, weak scaling is quite capable of linear speedup.

[Figure 3: Homme intrachip strong scaling. The same algorithm does not scale well across cores on a single chip, dropping below 50% efficiency at just 2 cores per chip and hitting just 15% efficiency at 4 cores per chip. (PGI compiler / Ranger; plot of efficiency vs. active cores in a 16-core node, EPC = 9 to 150.)]

[Figure 4: Homme classical (interchip) weak scaling at 54 elements/core is nearly perfect, and this is the type of scaling of particular relevance to HPC applications. (Plot of efficiency and communication vs. total nodes.)]


[Figure 5: Homme intrachip weak scaling is a completely local effect. Regardless of the number of cores used or the workload per core, all Homme runs exhibit nearly identical intrachip weak scaling efficiency, dropping off to about 60% at 4 cores per chip. This means we can study multicore scaling effects at any problem size. (PGI compiler / Ranger; plot of efficiency and communication time vs. density in cores per node, for runs of 16-240 cores and 9-801 elements per core.)]


Most HPC applications have internode scaling characteristics that are well understood and depend primarily on the nature of network communication between the nodes. By definition, these applications have been designed to scale well to tens of thousands of nodes. Our sample application, Homme, exhibits perfect weak scaling (where the amount of computation per core is constant) and excellent strong scaling (where the total computation of the system is constant), attaining a 900x increase in parallelism before efficiency drops by half (see Figures 2 and 4). However, this scalability across a communication network does not translate into on-chip scalability, which is more affected by data access patterns and limited access to memory.

We define the term intrachip or intranode scalability to refer to the way the efficiency of an application scales as more cores are used per chip, regardless of the number of nodes used. It is defined analogously as the performance of an algorithm at many cores per chip relative to the performance at 1 core per chip. We refer to the number of active cores per chip as the core density of the application. Because on-chip communication is faster, it would normally be expected that applications using a fixed number of total cores would run faster with higher core densities; in fact, performance drops drastically with core density. For example, if the efficiency of a program at 2 cores per chip is 100% and at 4 cores per chip is 50%, then there will be zero performance gain from doubling the total number of cores and using 4 cores per chip, and if the number of cores is kept constant then performance will fall by half. As seen in Figures 2 through 5, intrachip scalability (both strong and weak) can be three orders of magnitude less than traditional (internode) scalability. Figure 5 shows that a wide range of runs with extremes in both the total number of cores used and the workload per core (Elements Per Core) all show similar drops in efficiency when increasing core density from 1 to 16 cores per four-chip node on Ranger. This demonstrates not only that a program's intrachip scalability frequently bears no resemblance to its classical parallel scaling properties, but also that its intrachip scaling properties tend to be totally independent of its external topology and workload. The result is that intranode scaling properties can be studied at any convenient scale while remaining representative of a full-scale production run.

Prior optimizations tended to be CPU centric: reducing computation, reducing branch costs through unrolling, and storing previously computed values. Over time, the growing memory wall made memory issues dominate. Typical optimizations made better use of caching through blocking arrays and padding data structures. Other optimization techniques involved accessing data sequentially or with constant strides to aid prefetchers. Comparatively less work has looked at memory optimization techniques, and memory issues tend to be nondeterministic and non-local. Due to shared resources in the memory hierarchy, multicore applications tend to become limited by off-chip bandwidth. At first glance, this would make different computation optimization strategies moot, as all would run at the speed of memory. However, we found that when the memory system is in a saturated state, the order and pattern of data accesses becomes a first-order factor in performance. We call this condition of operation the multicore regime.
As Figure 6 shows, the same optimizations can perform very differently in this regime. Many multicore optimizations that increase performance in the multicore regime would actually slow down code running on a uniprocessor (become a pessimization).
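To make the scaling vocabulary above concrete, one convenient formalization (ours, stated here for reference; the paper defines these quantities in prose) is:

```latex
E_{\mathrm{strong}}(p) = \frac{T(1)}{p\,T(p)}, \qquad
E_{\mathrm{intra}}(d) = \frac{T(\text{1 core per chip})}{T(\text{$d$ cores per chip})}
\quad \text{(fixed total core count and workload)}
```

Under this reading, the example in the text (100% efficiency at 2 cores per chip, 50% at 4) says that raising the density from 2 to 4 at a fixed core count doubles the execution time, which is also why doubling the core count by packing four processes per chip yields no net speedup.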

2.3 Difficulties with detecting and measuring memory effects

Finally, it is important to note that these new scaling and optimization behaviors resulting from multicore architectural bottlenecks lead to performance measurement issues that are not merely different, but which can be extremely difficult to overcome. This section introduces the primary measurement issues, which are discussed in greater detail in the rest of the paper. HPC applications have always tended to have low arithmetic intensity, so memory performance has always been an issue. However, multicore scalability issues tend to arise from complex interactions of a hierarchical memory structure contested by many cores. This leads to the following fundamental difficulties when trying to trace a memory performance issue to its cause in software:


[Figure 6: A simple example of how the same code optimizations behave very differently in the multicore regime. When cores compete for the memory system, instruction order has a significant effect on performance, resulting in classical compiler optimizations having unintended effects. (Longhorn multicore performance at 1 vs. 4 cores per chip for O2, O3, O2+OptC, and O3+OptC builds.)]


• Memory effects are highly context sensitive. They depend on the state of all the caches and DRAM banks in the system. This means that the performance of a given function that always performs the same amount of computation will tend to have a much higher degree of variability and will depend much more on its calling context, since performance depends on the data the calling functions have brought on chip and the state of the memory system upon entering the function.

• Memory bottlenecks tend to be bursty. This results in extreme local variations in performance that might be hidden in larger averages.

• Memory interactions are nondeterministic. Just as debugging parallel programs can be hard due to nondeterminism, memory bottlenecks caused by the interactions of threads on the memory system also tend to be nondeterministic.

• Memory bottlenecks can be highly non-local. Because the latency of the memory system can be hundreds to thousands of cycles, and because memory access speed is affected by the activities of previous memory operations, an apparent memory bottleneck is often caused by earlier functions. Additionally, the natural time skew between the activities of different cores and chips tends to spread the range of memory effects far beyond the time of complete cache turnover, the time it takes for all data in the on-chip memory hierarchy to be replaced with new data.

• It is much easier to disturb memory behavior in the attempt to measure it, due to the difficulty of bracketing memory effects. The influence of a performance measurement will be felt much later, while the counter events recorded during the actual timing calls may have been caused by code in an earlier timing interval, or even be influenced by a different core running different code. Any sophisticated timing library will need to make multiple round trips to main memory to update counter totals, resulting in time dilations of up to 10,000 cycles, along with prefetching interference, cache effects, and other perturbations that can last for tens of billions of cycles. For this reason, we have found that developing or utilizing extremely lightweight timing libraries is required to see an optimization's effects on the running code and not simply the optimization's effects on the timing code.

As a result of these fundamental issues, the notion of a given function having "average performance characteristics" is less certain, and the performance impacts of the measurements themselves are much more difficult to isolate and remove. The rest of this paper discusses these issues in more detail and provides simple techniques to overcome them.
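As a concrete illustration of the kind of lightweight bracketing the last point implies, the sketch below reads the time-stamp counter directly and accumulates per-region totals in a tiny core-local table that stays cached, so the measurement itself makes no extra round trips to main memory. It is a minimal sketch in the spirit of the text, not the timing library actually used in this study; the region IDs and loop body are illustrative.

```cpp
// Minimal low-overhead region timer: one rdtsc pair per bracket, totals kept
// in a small local table so the measurement adds no main-memory round trips.
#include <x86intrin.h>   // __rdtsc()
#include <cstdint>
#include <cstdio>

enum { NUM_REGIONS = 16 };                        // illustrative
static uint64_t cyc_total[NUM_REGIONS];           // small enough to stay in L1
static uint64_t call_count[NUM_REGIONS];

static inline uint64_t region_begin() { return __rdtsc(); }
static inline void region_end(int id, uint64_t t0) {
    cyc_total[id]  += __rdtsc() - t0;
    call_count[id] += 1;
}

int main() {
    double x = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        uint64_t t0 = region_begin();
        x += 1.0 / (i + 1);                       // stand-in for the code under test
        region_end(0, t0);
    }
    printf("region 0: %.1f cycles/call over %llu calls (x=%g)\n",
           (double)cyc_total[0] / call_count[0],
           (unsigned long long)call_count[0], x);
    return 0;
}
```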

3 Methodology

In this paper, we explore the analysis of multicore issues using a case study. We use three different hardware platforms for our study: Ranger, a half-petaflop Barcelona-based supercomputer; Longhorn, a half-petaflop Nehalem-EP and nVidia Tesla 2 based heterogeneous supercomputer; and Discovery, a non-production test system at the Texas Advanced Computing Center (TACC) that houses a wide variety of different node types. The archetypal HPC application we chose for our case study is HOMME, the High Order Method Modeling Environment, developed by NCAR for their Climate Model 2 [4]. Finally, we apply the following performance tools: gprof, mpiP, pin, PAPI, perfctr, TAU, HPCToolkit, and PerfExpert, which are described in Section 4.


3.1 Ranger, Longhorn and Discovery

When this research began, almost all supercomputers had at most quad-core chips, and the chip with the best reputation for memory performance was AMD's Barcelona. This made the Ranger system an ideal vehicle for multicore research. However, because the Barcelona architecture was ineffectively scaled from 2 to 4 cores, it was not surprising that it could only effectively support 2 cores per chip. In 2010, a quad-core Nehalem system came online, representing the best on-chip memory system to date, which we used to contrast with Ranger. Longhorn confirmed that although the Nehalem memory system was indeed superior, supporting up to 3 cores per chip, all the multicore performance issues we found with Barcelona still existed on Nehalem, and all the optimizations carried over. Finally, we used the Discovery system to verify results on a wide range of CPU types. Because the Ranger system was the easiest to use for detailed analysis, featuring almost 500 hardware performance counters, we did the bulk of our analysis on Ranger and used Longhorn to validate our findings. We also compared the effects of three different Fortran 90/C compilers: Intel icc versions 10 and 11, and PGI version 7.5.

Ranger [7] consists of 3,936 16-way SMP compute nodes housing 15,744 AMD Opteron quad-core processors (62,976 cores) at 2.3 GHz, with a peak performance of 579 teraflops. Each node has 32GB of DRAM, with each 8GB controlled by a different quad-core chip. Nodes are connected by a 1 GB/s InfiniBand network. Within each node there is a NUMA effect, with 2 of the four chips having slightly slower access to both memory and the network. Longhorn [8] is a hybrid supercomputer consisting of 256 nodes, each containing 2 quad-core Nehalem-EP processors at 2.5 GHz, 48GB-144GB of DRAM, and two nVidia Quadro FX 5800 GPUs (Tesla 2 class, like the GTX 280). Nodes are connected via a QDR InfiniBand interconnect. For this study, only the 2,048 Nehalem cores were used, while the 122,880 Tesla 2 cores were left idle. Discovery is a non-production supercomputer at TACC designed specifically for experimenting with different kinds of compute nodes.

3.2 Homme

In choosing software to analyze, we were interested in full-scale, real-world supercomputer applications that people really care about. Supercomputing applications tend to have similar characteristics due to the requirement that they scale to at least tens of thousands of cores. Most tend to be SPMD (Single Program, Multiple Data) with minimal MPI communication, tending instead towards nearest-neighbor communication. (All-to-all communication is not scalable, and even global barriers are used sparingly.) Most also tend to be dominated by memory accesses, having extremely low arithmetic intensity. We found that the overwhelming majority of HPC applications fall into three general compute classes. "Regular" applications use, at some level of granularity, a fixed, regular grid to store data. This allows easy management of data locality, simple and predictable data access patterns, and uniform, predictable computation times, as well as arbitrarily high numeric accuracy. The most common examples are finite difference PDE solvers, dense linear algebra solvers, and stencil codes. The second most common paradigm in HPC is "irregular" applications, which use a dynamically changing mesh to handle adaptive levels of detail or time-varying geometry. The most common irregular HPC applications are finite element simulations, particle simulations, and sparse linear solvers. Data-dependent control flow and complex data access patterns increase the level of nondeterminism, and load balancing becomes an issue, especially between nodes, where computation is expensive. The third most common application type in HPC is integer search-and-compare applications, such as those found in biochemistry and intelligence. These applications must traverse huge irregular graphs and do integer compares at the nodes. This paper studies the class of regular HPC applications, and the software we chose for our case study was HOMME. Homme (High Order Method Modeling Environment) [4] is an atmospheric general circulation model (AGCM) that provides 3D atmospheric simulation similar to the Community Atmosphere Model (CAM).

The code consists of two components: a dynamical core with a hydrostatic equation solver, and a physical process module coupled with sub-grid-scale models. Homme is based on 2D spectral elements in curvilinear coordinates on a cubed sphere, combined with a second-order finite difference scheme for the vertical discretization and advection. It is written in Fortran 95. Homme is an ideal candidate, as weather simulation is one of the most established uses of supercomputers, and Homme is by far the most sophisticated regular HPC application we examined. Homme is only locally regular: 8x8x96 regular grid blocks are connected to nearest neighbors via a graph distributed across cores using the METIS library [9], which is normally used to distribute irregular computation graphs across distributed nodes. The code applies many layers of physics and numerical techniques, tracking over 40 properties at each grid point and updating quantities using multiple explicit and semi-implicit integration techniques along with spectral element techniques. Homme also applies best-in-class (pre-multicore) optimizations for both computation and memory access, and is known for excellent classical (interchip) scaling but poor intrachip scaling. Finally, Homme was chosen as one of the 5 HPC Challenge benchmarks [10] and is required for supercomputer acceptance testing. The benchmark version of Homme is slightly simplified from the original scientific simulation model. It solves a modified form of the hydrostatic primitive equations with pre-calculated initial conditions, in the form of a baroclinically unstable mid-latitude jet, for a period of twelve days. While the general version supports hybrid parallelism (MPI and OpenMP), the benchmark version uses MPI only. Its time integration scheme uses explicit finite difference computation on a static regular grid, simplified from the semi-implicit scheme of the general version. The benchmark version is used in this study because it is one of the NSF acceptance test applications for the Ranger supercomputer at TACC. As will be discussed in detail in Section 5, HOMME shares the key characteristic of HPC applications that leads to poor multicore scalability: very low arithmetic intensity. Computation is nearly irrelevant compared to load/store performance. As illustrated by Figure 7, functions which rearrange the structure of data are nearly all loads and stores, whereas classic stencil computations tend to do one to two operations for each value loaded from memory. In the most extreme cases in Homme, two 8-byte values needed to be loaded from main memory for every flop, as shown by the L1 bars in Figure 12. We found that almost 90% of Homme's performance was dictated entirely by the rate at which it could load values.

4 Performance Tools Comparison

This section provides a high-level overview of the performance tools used in this study, as well as an overview of how these tools actually work. When choosing tools to evaluate, we focused on freely available, widely used open source tools for Linux systems, because we believe these have the largest potential reach. We summarize these results in Table 1. To better understand how these tools work, this section begins with a brief primer on the primary mechanisms used by all of them.

4.1 How tools profile your application

Regardless of which tool you use to measure application performance, it is highly likely that you will be looking at hardware performance counters. Remember that the relationship between hardware counters and software performance metrics can be complex due to nuances of the hardware architecture. Performance tools generally express the average value of these counters for a given function or region of interest. There are two ways performance tools gather performance counter information pertaining to critical sections of your program. Explicit methods read the actual counter values at precise locations in your code, while sampling-based methods only check counters periodically during interrupts, relying on the stack to determine where in your code the interrupt occurred. As a result, sampling methods can only approximate the actual counter values at specific sections of code. The reason for using approximate techniques is to minimize the disturbance of the performance monitoring code.

[Figure 7: Memory accesses per instruction (LPI) for major functions in Homme (PGI compiler). Most functions have almost no reuse of values.]

[Figure 8: Density (weak) scaling of execution time at 4, 8, and 12 cores per node (Intel compiler). The Intel compiler exhibits better scalability than the PGI compiler, but efficiency still drops as density increases.]

[Figure 9: Density (weak) scaling, 4 vs. 1 cores-per-chip execution time and LPC by function. LPI tends to be inversely proportional to multicore scalability.]

[Figure 10: LPI vs. LPC vs. FPC at 1 core per chip. LPI tends to be inversely proportional to both raw performance and memory access rates (LPC). (Intel compiler)]

[Figure 11: L1 and L2 miss rates vs. percentage of ideal LPC by function. Actual memory performance (LPC) relies heavily on cache performance.]

[Figure 12: Bytes per flop (taper) at L1, L2, and L3 for major functions in Homme (Intel compiler). While the average off-chip taper is 1.5 bytes/flop, this is heavily biased by functions which do no computation; many important functions can make almost no use of registers.]

[Figure 13: L1 vs. L2 bandwidth requirements (MB/s) by function at 4 cores per chip. (Intel compiler)]

[Figure 14: Core performance metrics (IPC, FPC, LPC) by major function for Homme on Barcelona. (PGI compiler)]

There is a direct tradeoff between the complexity of the monitoring code and the accuracy of the results. During our research we directly observed this effect with all of these tools, so it was impossible to use a single tool for our entire program analysis. Instead, we used different tools for different jobs. We describe the tools and their strengths and weaknesses below.

4.2 Performance Counters - PAPI and PerfCtr

4.2.1 PerfCtr [11]

PerfCtr is a Linux-specific, low-level programming API that allows users to read the values of hardware counters on a machine. All hardware counters can be accessed, and their meanings are listed in publications by the CPU manufacturers. For Barcelona, the hardware counter descriptions can be found in AMD's BIOS and Kernel Developer's Guide (BKDG) [12] for that specific generation of processor. For Nehalem, see Intel's Software Developer's Manual, Volume 3B [13].

4.2.2 PAPI [14]

PAPI is a higher-level programming abstraction that attempts to unify hardware counters across multiple platforms. PAPI supports a subset of standard counters that are approximately mapped to each platform's native counters, while still allowing the user to query native counters by name. All the high-level performance tools we discuss are built on PAPI and express performance in the language of PAPI counters. Directly inserting program code to read counter values provides the highest accuracy and clarity, but it takes more development time than using a high-level tool. We found explicit PAPI use indispensable for profiling very short functions and for validating the meaning of hardware counters and the accuracy of the other tools. We found the PAPI call to read performance counters took 400 cycles, so for even more accurate timing of small leaf functions, we used PAPI to set up the performance counters and then used the direct rdpmc assembly instruction with local storage to read the counters, which took 9 cycles for 1 counter and 30 cycles for 4 counters on the Opteron. Additionally, by using a direct call to rdtsc, we could read the (user+system) time stamp counter in addition to the four hardware counters available in each timing set. Since it is often important to get the cycle count and instruction count with every timing set, this gave us much greater flexibility in our timing groups. This turned out to be critical in determining the true performance effects of our optimizations, rather than merely the effects our observations had on the timing code, when dealing with small and medium-sized functions.
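The following sketch shows the explicit style of measurement described above: PAPI sets up and reads a small group of counters around a region of interest, with a raw rdtsc bracket for cycles. It is a minimal illustration under the assumption that the listed PAPI preset events are available on the target CPU; it is not the instrumentation code used in this study.

```cpp
// Explicit counter bracketing with PAPI plus an rdtsc cycle bracket.
#include <papi.h>
#include <x86intrin.h>
#include <cstdio>

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;

    int evset = PAPI_NULL;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);     // instructions retired
    PAPI_add_event(evset, PAPI_FP_OPS);      // floating-point operations
    PAPI_add_event(evset, PAPI_L1_DCA);      // L1 data cache accesses

    long long counts[3];
    double sum = 0.0;

    PAPI_start(evset);                       // counters start from zero
    unsigned long long c0 = __rdtsc();
    for (int i = 0; i < 1000000; ++i)        // region of interest (placeholder)
        sum += (double)i * 0.5;
    unsigned long long c1 = __rdtsc();
    PAPI_stop(evset, counts);                // read final values and stop

    printf("cycles=%llu ins=%lld fp=%lld l1d=%lld (sum=%g)\n",
           c1 - c0, counts[0], counts[1], counts[2], sum);
    return 0;
}
```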

4.3 gProf and mpiP

4.3.1 gProf [15, 16]

gProf is an extremely simple utility that determines the most critical functions in a program (the ones taking up the most total execution time) along with the dynamic program call graph, so you can determine which functions call them and how often. gProf keeps an explicit (accurate) call count for each function, while statistically sampling the program counter to estimate the average time in each function, as well as sampling one level of the stack to estimate the callee count, which is used to estimate the dynamic call graph. gprof has the highest accuracy and lowest overhead of any tool we tried; it is so good that we were able to use gprof to gauge the accuracy of other performance tools. It has no equal in ease of use, taking just a single line to run and immediately giving you the most important information you need to start your analysis. The downside of gprof is that (like most tools) it is context free, i.e., it averages function times over all calling contexts and execution times. If a function's time actually varies, then that time will be misassigned to callee functions when estimating inclusive function times, and so the resulting estimates for inclusive function performance could be arbitrarily

wrong, leading to incorrect conclusions about performance bottlenecks. However, we found that in typical HPC applications, the temporal context of a function mattered far more than its calling context. gprof is nevertheless a sampling-based system that relies on an accurate assessment of the user stack to determine context, and we found that certain types of very fine-grained analysis did produce arbitrarily wrong answers. Despite this potential danger, we recommend gProf highly as the place to start any performance analysis, and as a tool to continually guide more detailed investigations.

4.3.2 mpiP [17]

mpiP is similar to gProf, except that it restricts its sampling to key MPI communication calls instead of user code. It is also easy to use and lightweight. Unique among the tools is the ease with which mpiP aggregates results from all the active cores in a run, immediately highlighting per-core variations and potential load imbalances. If your application has scaling issues due to network communication, this is a great tool for finding them.

4.4 TAU and HPCToolkit

TAU and HPCToolkit are feature-rich, high-level GUI performance analysis tools that collect performance data and display them for quick visual analysis. Like gProf and mpiP, both operate (mostly) on existing binaries. Both have substantial learning curves and, unlike some very high-level performance tools, neither explains the meaning of the counters it tracks nor recommends programming solutions.

4.4.1 TAU [18]

TAU is unique among the tools in using explicit performance monitoring (which is completely accurate) while maintaining extremely low overhead. By default, TAU gathers performance data at per-function granularity, but the user can augment this by instrumenting their own code to define sections to analyze. In fact, like PAPI, TAU is also unique in offering a complete programming API to its users. TAU offers a nearly endless amount of performance information on each function, but users tend to complain that the interface is hard to navigate, with the meaning of what is presented somewhat obtuse. TAU was designed for the power user and was the only tool to offer four different function timing contexts: call graph context, temporal context, spatial context (which core), and event context (behavior with respect to user-defined program events). TAU and PAPI are unique in allowing the user to get performance information on any section of code. We used TAU as a quick stand-in to get hardware counter values without the development time of using PAPI directly.

4.4.2 HPCToolkit [19, 20, 21]

HPCToolkit is a relatively new sampling tool designed to be more friendly to the casual user. It still comes with a steep learning curve, although less of one than the high-end performance tools it replaces. Two features make HPCToolkit unique: its default profiling granularity extends to loop nests as well as whole functions, and it is the only tool featuring complete context analysis of function performance, which the user can explore directly from a source code window. The downside of HPCToolkit is that full-context sampling is always a heavy operation, and we found that while HPCToolkit has less overhead than similarly sophisticated explicit performance tools, it nevertheless had the highest overhead of the group. As a result, its accuracy was too low to monitor smaller functions. We found HPCToolkit to be the best tool for automatically finding hotspots within functions, and it was responsible for discovering 3 of the 11 critical code sections in Homme analyzed in this paper. HPCToolkit is still in development, and the designers have been working with our team to provide easier external integration and access to HPCToolkit data.


tool         technique   granularity   overhead      ease of use
PAPI         explicit    arbitrary     low to high   medium
gProf        explicit    function      low           high
mpiP         sampling    function      low           high
TAU          explicit    arbitrary     low           low
HPCToolkit   sampling    loop nest     medium        low
pin          explicit    arbitrary     high          low

Table 1: Performance tool summary

4.4.3 PerfExpert [22]

PerfExpert is a tool in development that attempts to merge the positive qualities of HPCToolkit and gprof: a performance tool that is extremely lightweight and extremely easy to use, while adding features normally found in far more sophisticated systems, such as source-level optimization suggestions and comparisons between runs, which are essential to multicore analysis. PerfExpert currently gathers raw performance data from HPCToolkit, but is targeting two other major analysis packages in the future. We evaluated PerfExpert in an experimental capacity.

4.4.4 pin [23]

Pin is a somewhat different tool from the others, in that its overhead is so high that actual performance measurement is impossible. However, pin allows you to perform a complete execution trace of your code, observing all memory references as well as the dynamic instruction mix. We used pin to answer the most detailed questions related to performance bottlenecks, such as the layout of structures in memory, the locality of data access patterns, and the degree of array multiplicity in loop accesses.
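For reference, a Pin tool that captures the kind of memory trace described above can be very small. The sketch below follows the structure of the pinatrace example distributed with Pin, recording the effective address of every memory read; it is illustrative rather than the exact tool used here, and the output file name is arbitrary.

```cpp
// Minimal Pin tool: log the instruction pointer and effective address of each read.
#include <cstdio>
#include "pin.H"

static FILE *trace;

static VOID RecordRead(VOID *ip, VOID *addr) {
    fprintf(trace, "%p R %p\n", ip, addr);
}

static VOID Instruction(INS ins, VOID *v) {
    UINT32 ops = INS_MemoryOperandCount(ins);
    for (UINT32 i = 0; i < ops; i++) {
        if (INS_MemoryOperandIsRead(ins, i))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordRead,
                                     IARG_INST_PTR, IARG_MEMORYOP_EA, i, IARG_END);
    }
}

static VOID Fini(INT32 code, VOID *v) { fclose(trace); }

int main(int argc, char *argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    trace = fopen("memtrace.out", "w");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                      // never returns
    return 0;
}
```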

5 Performance measurement pitfalls and solutions

Although there are many available tools for performance measurement, many are difficult to use and require detailed knowledge to interpret. Seemingly simple applications of their results are actually fraught with pitfalls, and this situation is even more acute in the multicore regime. This section attempts to provide guidance to HPC software developers in using performance information to detect multicore scaling issues in their applications, while illustrating deeper insights and surprises in the form of pitfalls.

5.1 Finding the functions that matter

5.1.1 Devising a representative test case

Pitfall - Assuming a small test case can't represent a full-scale application. Most HPC programs have good weak scaling, and typical stencil applications are written as Single Program, Multiple Data (SPMD), so a few nodes can represent the performance of an entire system. Running internode weak and strong scaling tests while keeping the thread density (i.e., active processes per chip) constant will determine overall scalability. We found Homme to have perfect weak scaling (Figure 4) with a relatively constant 20% communication overhead. We also found that intrachip scaling properties (core density) were independent of the node topology of the run (Figure 5). As a result, for multicore issues, only 1-4 nodes were needed. Additionally, mpiP confirmed that load imbalance was less than 5% between the cores, so taking measurements from just one core is representative of performance. This is important, because we found that there are performance issues when multiple cores attempt to access the same

shared hardware counters. Finally, we made extensive use of density scaling, a form of weak scaling in which the number of threads/processes and the work are kept constant but the thread density is varied (using 1 process per chip and the maximum of 4 processes per chip suffices), as the simplest way to illustrate the intrachip scaling issues of the application. By running the same problem on the same core count but varying the number of cores per chip, we see how functions respond to having varying amounts of shared resources. Note that this does change the communication topology of the program, but it acts as a conservative estimate, as on-chip communication is typically faster than interchip communication. Additionally, most profiling tools only examine user functions, so communication performance can be analyzed completely orthogonally from computation. As is common for many fluid dynamics simulations, we found that Homme had almost no performance gains beyond two cores per chip on Barcelona and three cores per chip on Nehalem. Once a small, representative test is established, running gprof will highlight a representative group of important functions; the exact ranking varies by context. A look at the gprof results for Homme shows the tradeoff between the size of a function and the number of times it is called. Large functions in the millisecond range can be profiled well with higher-level tools. Tiny functions in the microsecond range require special care; at the very least, they will need to be profiled separately. We used lightweight PAPI calls to get detailed performance information on the three tiny Homme functions. The top two functions in Homme turned out to be too large for any feasible optimization attempt, so we used HPCToolkit to locate the most important loop nests within them.

5.1.2 Defining "good" performance

At the simplest level, we can categorize functions by three metrics: Flops Per Cycle (FPC), the goal of our code; Instructions Per Cycle (IPC), how hard the CPU is working; and Loads Per Cycle (LPC), how hard the memory system is working. If any of these three values is nearly as high as would be expected for the architecture, then performance is "good" and further improvement must be sought through algorithmic optimizations. However, if all three values are bad, then in most cases the performance bottleneck will be somewhere in the memory system.

Pitfall - Assuming basic performance metrics are fundamental to the algorithm. We found that different compilers, and even different compiler options, could drastically change things as fundamental as the number of flops and instructions in a function, the number of loads and stores, and even the importance ranking of functions, that is, which areas of the code are most important. Compiler options had a profound effect on the memory subsystem, and hence on the intrachip scalability of functions. Finally, each compiler and optimization flag has different strengths on different parts of the program, so it is important not to judge a compiler's merit based on an entire program run, but to look at performance on a per-function basis. (See Figures 15 and 16.)

Pitfall - Trying to reach manufacturer-advertised performance. Performance ratings may not be reachable at all, or reachable only in the most controlled circumstances using optimized SIMD assembly language. Reasonable high-level code may only reach 20-30% of advertised performance values. "Good" performance values can be determined with high-level microbenchmarks or known high-performing code. On Barcelona, we found "good" values to be near 2 IPC, 1 FPC and 0.5 LPC; on Nehalem these values were higher. However, typical performance was 10-20x worse, due to the presence of multicore bottlenecks in the function. Common HPC applications seldom break 20% efficiency; e.g., the acceptance benchmarks WRF, MILC, HPL and others were under 20% efficiency [24, 25], while more irregular [26] or sparse solvers rarely exceed 5% efficiency and can be far less [25, 27]. Using PAPI, TAU and HPCToolkit, we computed these performance metrics for the most important functions (Figure 17). By profiling the application at the minimum and maximum thread densities, we were also able to see which functions had the worst intrachip scalability. In these cases, performance bottlenecks must lie in the contested areas of the chip. Figure 18 shows the strong correlation between LPC and intrachip scalability, while comparing it with Figure 7 shows the inverse correlation between LPI (the amount of loads) and LPC (the rate of loads). Ultimately, only three of the top-performing Homme functions had "good" performance, meaning that at least one (and then typically all) of their metrics were near the best to be expected for the machine. These functions accessed only a single array, had higher arithmetic intensity, and made good use of loop unrolling and explicit registers to increase performance.

Pitfall - Confusing dynamic instruction mix with performance. Tools commonly report per-instruction metrics instead of per-cycle metrics, i.e., Flops Per Instruction (FPI) and Loads Per Instruction (LPI), and programmers sometimes assume a program with a high number of floating point instructions is compute bound, or a program with a high number of memory accesses is memory bound. If a computation has low FPI, there may just be too many overhead instructions, but this is an algorithmic issue; the performance of the code refers to how fast it is performing these instructions. With flops, the two are often correlated, in that more flops lead to higher flops per cycle due to increased ILP and reduced overhead, but in the more important memory system there is an inverse correlation: more loads commonly decrease memory performance (LPC). In general, performance metrics (per-cycle values) are more important, but you can easily derive the instruction mix from them, e.g., LPI = LPC/IPC and FPI = FPC/IPC. One important thing to watch for is functions which focus on data manipulation rather than computation; these may appear to have extremely low FPC or tapers (bytes per flop) simply because they have extremely low FPI, i.e., they are not computing. Four of the most important functions in Homme did no computation.
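Since tools report these quantities inconsistently, it can help to compute both views from the same raw counter totals. The small helper below (a sketch; the field names and example numbers are illustrative) derives the per-cycle metrics used above and the corresponding instruction-mix ratios via LPI = LPC/IPC and FPI = FPC/IPC.

```cpp
// Derive per-cycle performance metrics and per-instruction mix from raw totals.
#include <cstdio>

struct RawCounts { long long cycles, instructions, fp_ops, loads; };

struct Metrics { double ipc, fpc, lpc, fpi, lpi; };

static Metrics derive(const RawCounts &c) {
    Metrics m;
    m.ipc = (double)c.instructions / c.cycles;  // how hard the core is working
    m.fpc = (double)c.fp_ops / c.cycles;        // useful work per cycle
    m.lpc = (double)c.loads  / c.cycles;        // memory-system pressure
    m.fpi = m.fpc / m.ipc;                      // instruction mix: flops per instruction
    m.lpi = m.lpc / m.ipc;                      // instruction mix: loads per instruction
    return m;
}

int main() {
    RawCounts c = { 1000000000LL, 800000000LL, 200000000LL, 350000000LL };  // example numbers
    Metrics m = derive(c);
    printf("IPC=%.2f FPC=%.2f LPC=%.2f  FPI=%.2f LPI=%.2f\n",
           m.ipc, m.fpc, m.lpc, m.fpi, m.lpi);
    return 0;
}
```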

[Figure 15: Effect of compiler on cache miss rates (L1 and L2 miss rates and runtime share, Intel vs. PGI, by function).]


Figure 16: Although the Intel compiler created slightly faster executables overall, the PGI compiler performed better on several important functions.

Figure 17: Key performance metrics (IPC, FPC, and LPC) by major function for Homme on Barcelona.


Figure 18: Achieved memory performance (LPC) is directly correlated with intrachip scalability. The bars show density (weak) scaling: execution time at 4 cores per chip relative to 1 core per chip (PGI compiler), alongside LPC for each function. Larger bars represent poor intrachip scalability because they show execution time increasing with core density. The effect is even more dramatic when comparing memory performance with memory requirements in Figure 7.


5.2 Unicore Memory Issues

Classical memory issues can still be important, although in some cases changes to hardware have made them less relevant. However, even the analysis of typical unicore issues has changed now that memory systems have added more sophisticated features.

5.2.1 Translation Lookaside Buffer (TLB) misses

TLB misses are always bad, especially if the range of data accessed cannot be covered by the number of TLB entries in each core, causing TLB thrashing. However, the worst kind of TLB miss is demand page initialization, which occurs in Linux each time a new OS page is accessed for the first time. These misses took 15,000-20,000 cycles each on the systems we tested, roughly 100x the impact of a cache miss. Conflating them with ordinary memory accesses is a very common mistake, and one frequently responsible for overestimating the latency to main memory.

Pitfall - Running on systems without HUGE PAGES enabled. In a supercomputer environment where a single application dominates each node and very few applications touch virtual memory, there is very little reason not to run in HUGE PAGE mode, which virtually eliminates TLB misses from the system. Unfortunately, as a user, there is typically not much you can do about this unless provisions for the option have already been made available. If you notice a function has TLB misses in excess of a few hundredths of a percent, be aware that this can double your apparent memory access latency. In Homme, we found that TLB page initializations resulted in program initialization consuming a noticeable fraction of total run time.
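Although we did not apply it in Homme, one common way to keep demand page initialization out of timed regions is to touch every OS page of each large array once during setup; a minimal sketch, with an illustrative array name and a 4KB page size assumed:

! Sketch: first-touch "pre-warming" of a large array so that demand page
! initialization is paid during setup rather than inside timed regions.
! The array name is illustrative and 4KB OS pages are assumed.
subroutine prewarm_pages(field, n)
  implicit none
  integer, intent(in)         :: n
  real(kind=8), intent(inout) :: field(n)
  integer, parameter          :: doubles_per_page = 4096 / 8
  integer                     :: i

  ! Writing one value per page forces the OS to map and zero every page now.
  do i = 1, n, doubles_per_page
    field(i) = 0.0d0
  end do
  field(n) = 0.0d0
end subroutine prewarm_pages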

5.2.2 Level 1 (L1) cache miss rates

High L1 miss rates are the first sign of poor memory performance. Although modern programs can run just fine out of the L2 caches, many modern CPUs have L2 caches with lower bandwidth than L1 and/or transfer larger sequential chunks of memory, which can lead to wasted bandwidth and associativity issues. However, this is far less important than L2 cache misses.

Pitfall - Low L1 cache miss rates are no longer indicative of good memory performance. Multiple accesses to the same 64-byte cache line (8 double precision numbers) all count as a single miss, so for sequential memory access only the first of every 8 doubles counts as a miss, capping the L1 miss rate at 12.5%. More significantly, explicit loads can stall behind prefetch streams, with the result that each load will look like a hit even when all loads go to main memory. As a result, a modern application can have great L1 performance but horrible bottlenecks lower in the memory hierarchy. It is now fairly common, due to prefetching, to see an application with near 100% L1 cache hits actually getting every value from DRAM and running at DRAM throughput rates. Due to the bandwidth limitations of current CPUs, such direct streaming algorithms often suffer a 10x slowdown from expected performance rates.

Interestingly, Figure 15 shows that the PGI compiler generates code with low L1 miss rates while the Intel compiler generates code with high L1 miss rates on every important function in Homme. A compiler influences cache miss rates in two ways: in the way it groups items in memory, and in the patterns it uses to access memory. A simple example of this is automatic blocking of loops. Since cache misses are now such an integral part of performance, this is arguably the most important role of the compiler, although as we explore later, memory access patterns can have an even greater performance effect on the memory system than cache misses.

5.2.3 Level 2 (L2) cache miss rates

L2 cache miss rates can be more indicative of a memory problem. It is worthwhile computing the L2 bandwidth (misses per second) to determine whether a high miss rate is actually a performance issue, or whether the volume of L2 accesses is itself trivially low. However, actual L2 traffic can be significantly higher than the miss count suggests due to prefetching, load replays, and coherence snoops, which we found accounted for as much as 80% of L2 bandwidth despite the lack of any data sharing between processes.

Pitfall - Relying on L2 performance to infer multicore scalability issues. We initially tried to infer memory bottlenecks by looking at the way L2 miss rates varied with thread density. We assumed that if the L2 miss rate was low or did not vary much with density, then there probably were no significant L3 and DRAM scalability issues. This assumption turned out to be false for functions which showed poor intrachip scalability, but was corrected by looking at a few additional metrics such as L3 hit and miss rates and DRAM page hit rates.
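As a quick sanity check of whether an L2 miss rate matters, the miss count can be converted into a bandwidth figure and compared against what the platform can sustain; a small sketch with placeholder totals (a 64-byte line is assumed):

! Sketch: converting an L2 miss count into a bandwidth estimate.
! The totals are placeholders; a 64-byte line size is assumed.
program l2_miss_bandwidth
  implicit none
  integer(kind=8) :: l2_misses
  real(kind=8)    :: seconds, gb_per_s

  l2_misses = 500000000_8      ! L2 misses counted over the region (hypothetical)
  seconds   = 12.0d0           ! elapsed wall-clock time for the region

  gb_per_s = real(l2_misses, 8) * 64.0d0 / seconds / 1.0d9
  print '(a,f8.2,a)', 'L2 miss traffic: ', gb_per_s, &
        ' GB/s (excluding prefetch, replay, and snoop traffic)'
end program l2_miss_bandwidth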

5.3 Multicore Memory Issues

Even with a regular HPC application like Homme, we found that we simply could not extrapolate the causes of poor intrachip scaling from looking only at L1 and L2 miss rates. Distinguishing between an L3 capacity or off-chip bandwidth issue and a DRAM page miss issue is important, because the styles of optimization differ. Some of these performance counters were not in the "standard" set, so we had to use the native hardware counters as defined in the processor user guides, e.g., [12, 13]. The good news is that all tools we tested were able to pass native counter IDs through to the hardware.

5.3.1 Level 3 (L3) cache miss rates

If L3 miss rates significantly increase between the minimum and maximum thread densities, then it is likely that at least part of the performance issue is L3 capacity, since each core must now use only a fraction of the shared cache. Note that effective cache capacity can be further reduced by associativity issues and false sharing between threads. Since L3 cache misses directly translate into off-chip bandwidth, applications with capacity issues may also have bandwidth issues.

Pitfall - Storing intermediate values to reduce computation. When cache capacity is an issue, a good optimization is to reduce the memory footprint of a function by reducing or reusing temporary variables and redundant arrays. DRAM streaming speeds were 10-20 times slower than normal execution, leaving ample time to redundantly recompute values instead of storing them. An even easier solution is described later in this paper.

5.3.2 Off-chip bandwidth

HPC applications often do not see as much benefit from a cache hierarchy as other codes, because they tend to have so little reuse of data values. In computing the off-chip bandwidth for Homme, we found that one in three loads went off-chip. It can be instructive to look at a function's "taper" (bytes of off-chip traffic per flop), which relates to its arithmetic intensity. Many functions in Homme need significantly more than 1 byte/flop, which can be used to determine the suitability of a given supercomputer for a given HPC application. Be wary of averages: the performance-weighted average for Homme is just 0.8 bytes/flop off chip, but this includes a significant number of functions which have no flops or no off-chip traffic. Peak values are critical to maintaining average performance, so functions should be examined individually. For Homme, a more realistic value is about 3.2 bytes/flop off chip.

Pitfall - Judging a function's bandwidth needs at full thread density. The bandwidth needs of a function can only be judged when it is running unconstrained, that is, at one process per chip. The magnitude of the bandwidth issue can be gauged by the amount by which the function exceeds its share of off-chip bandwidth. If examined at full thread density, it may paradoxically appear that the function's bandwidth usage has decreased, yet performance is worse.
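A per-function taper can be estimated directly from L3 miss and flop totals; the counts in this sketch are placeholders chosen only to illustrate the arithmetic (64-byte lines assumed):

! Sketch: estimating a function's off-chip taper (bytes per flop) from its
! L3 miss and flop totals. The counts are placeholders; 64-byte lines assumed.
program taper_estimate
  implicit none
  integer(kind=8) :: l3_misses, fp_ops
  real(kind=8)    :: taper

  l3_misses = 2000000000_8     ! cache lines fetched from DRAM (hypothetical)
  fp_ops    = 40000000000_8    ! flops executed over the same interval

  taper = real(l3_misses, 8) * 64.0d0 / real(fp_ops, 8)
  print '(a,f6.2,a)', 'off-chip taper: ', taper, ' bytes/flop'
end program taper_estimate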

5.3.3 DRAM page misses

Not to be confused with the OS pages relevant to TLB misses, a DRAM page represents a "row" of data that has been read out and cached for fast access. DRAM pages can be large - 32KB in the case of Ranger - and there are typically two per DIMM. This means that a Ranger node shares 32 different 32KB pages among 16 cores on four chips.

Pitfall - Assuming DRAM contention does not significantly impact performance. When a DRAM request is outside an open page, the page must first be written back to DRAM (closed), then the next page read out (opened), which adds 30ns (69 cycles at 2.3 GHz) to the access and reduces DRAM bandwidth. While this represents only about 20% of access time on Barcelona, newer processors have been reducing the absolute latency to DRAM, so the relative importance of page misses will grow with time. As the number of cores per node grows, contention will increase, and it will be more difficult to make effective use of DRAM pages. As an optimization, the DRAM controller will close a row (write back its contents) after a certain amount of time so that the next access only needs to open a row; this saves 15ns.

Pitfall - Assuming that using high level code to reduce DRAM conflicts is hopeless. We cannot control where memory is physically allocated, or the way different cores on a node access different DRAM pages at different times. However, we can statistically reduce the number of conflicts and significantly increase performance. The final section of this paper shows the most significant multicore optimization we found: using loop fission to reduce DRAM page conflicts.

5.3.4 Importance of temporal context

Pitfall - Assuming that momentary fluctuations in performance will not affect averages or overall performance conclusions. As discussed in earlier sections, an important performance effect we have observed in the multicore regime is what we call "core skew". Because each core might be in a slightly different phase of execution, different functions may be executing on different cores at the same time, violating the SPMD assumption. This can be due to natural drift between synchronizations, such as from NUMA effects or nondeterministic memory contention; to an intense memory event having a somewhat serializing effect, pushing the cores further out of phase; or to a single core performing unique duties, like printing results or performance monitoring, that kick it further out of phase with the others. The effect is that an intense event gets "smeared out" over time, because it remains in operation from the time the first core starts the section until the last core finishes it. Figure 19 shows all of these effects in Homme. In particular, the extreme oscillations caused by initialization are, tens of billions of cycles later, still enough to completely throw off relative averages. In addition, because the resonant oscillation (on the order of a few billion cycles, or 3 timesteps on Ranger and 8 timesteps on Longhorn) is larger than the difference in average performance, average performance values do not correctly show which optimization is better. Traditionally, people already know to fast-forward over initialization and cache warm-up time, but in the multicore regime it is necessary to fast-forward tens of billions of cycles into regular code execution for the skew disturbances to die down. It is also critical to minimize core skew by having all cores do operations that are as similar as possible.
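A lightweight way to retain this temporal context is to log the duration of a key routine at every major timestep and dump the series at the end of the run, so that plots like Figure 19 can be produced and the early skew disturbances discarded. The sketch below uses only standard Fortran timers; the module and routine names are illustrative:

! Sketch: logging per-timestep durations of one routine so that core skew,
! initialization transients, and background oscillations can be seen.
! SYSTEM_CLOCK is standard Fortran; names here are illustrative only.
module timestep_log
  implicit none
  integer, parameter :: max_steps = 10000
  integer(kind=8)    :: t0, t1, rate
  real(kind=8)       :: step_time(max_steps)
  integer            :: nsteps = 0
contains
  subroutine log_begin()
    call system_clock(t0, rate)
  end subroutine log_begin

  subroutine log_end()
    call system_clock(t1)
    if (nsteps < max_steps) then
      nsteps = nsteps + 1
      step_time(nsteps) = real(t1 - t0, 8) / real(rate, 8)
    end if
  end subroutine log_end

  subroutine log_dump(filename)
    character(len=*), intent(in) :: filename
    integer :: i, unit
    open(newunit=unit, file=filename, status='replace')
    do i = 1, nsteps
      write(unit, '(i8,1x,es14.6)') i, step_time(i)
    end do
    close(unit)
  end subroutine log_dump
end module timestep_log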

5.4 A Step-by-Step Guide

This section spells out a very simple procedure for multicore program analysis which avoids many of the potential pitfalls described earlier.

5.4.1 Choose a test size for your program

In case your execution behavior varies with work per thread, try to choose the number of cores used and the data processed per core to be close to a standard run of your code. One caveat: in order to do multicore analysis on N-core processors, you need to size your test run to use no more than 1/N of the total cores on your supercomputer, so that you have sufficient nodes available to drop the core density down to 1 core per chip.

Figure 19: Example showing three kinds of core skew effects: perpetuation of disturbances for tens of billions of cycles (initialization), heterogeneous skew effects (a printf done by Core 0 every 30 timesteps), and natural core skew "resonance" leading to background oscillation over a few billion cycles. The plot shows cycles to complete preq_robert per major timestep for the Base, Opt1, and Opt2 versions; 1% of extreme values were removed from the measurements. No simple average can show which optimization is fastest, even though it is readily apparent.


5.4.2 Test intrachip scalability

Run your program twice with the same input and the chosen number of cores: once using just one core per chip, and once using N cores per chip on 1/N as many nodes. You can compare the speeds as simply as inserting the date command before and after the run. Did the core-dense version run as fast as the sparse run? If so, congratulations - you do not have multicore issues, and given current processors, you likely do not have performance issues either, since you are not demanding much of your memory system. Otherwise, press on.

5.4.3 Due diligence

While not strictly necessary, you may wish to observe the weak scaling characteristics of your program by repeating the prior experiment while varying the size of the input data and the number of cores. You can also use a tool like mpiP, or any multicore analysis tool like TAU, to see how much variation in performance there is between the cores. If intrachip scaling is the same at both large and small scales, then you can do the rest of the analysis on a single node. If there is little difference in performance between cores, then you can feel safe getting performance values from just a single core during a run.

5.4.4 Run gprof to determine the functions to analyze

This will immediately tell you the most important functions by execution time, and the basic call graph connecting them, so you can quickly learn your way through the code. Important functions will typically have an inverse relationship between cycles per call and calls per run.

Pitfall - Assuming tiny functions don't matter. For Homme, the breakdown was:

• 20% of all execution time is spent in functions which run for 2,000 cycles or less
• 10% of all execution time is spent in functions between 2,000 and 10,000 cycles
• 15% of execution time in functions between 10K and 200K cycles
• 15% of execution time in functions between 200K and 1 million cycles
• 35% of execution time in functions with 10 million or more cycles

Very significant functions, those taking up more than 10% of execution time, are represented at almost every scale, with the smallest such function in Homme taking up 11% of total execution time. For important functions under a few million cycles, you may have to do follow-up analysis with special low overhead measurements. Very large functions may need sub-function profiling to make optimization tractable, in which case you can employ HPCToolkit or source-based profiling to find important subsections.

5.4.5 Get key performance metrics

For these important functions, at densities of 1 and N cores per chip, record the total cycles, total instructions, total floating point operations, and L1 cache hits and misses (some variations are needed depending on the native counters available on your system). This will immediately give you IPC, FPC and LPC and let you classify the nature of each function, as well as identify the ones that clearly need improvement. Figure 17 categorizes the major functions of Homme: 3 good (CPU-bound) functions, 4 memory-bound functions which did no computation, and 6 memory-bound functions that were computational in nature. Comparing these metrics at thread densities of 1 and 4 cores per chip showed which functions had the worst multicore issues. Note that a low L1 miss rate combined with a low LPC may indicate a streaming application that is successfully utilizing the prefetchers.
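A minimal sketch of collecting these counts with PAPI's Fortran interface around a single routine is shown below. Which presets are available varies by processor, error handling is omitted, and the instrumented routine name is a placeholder; this is an illustration of the approach, not the exact instrumentation we used:

! Sketch: reading a small set of PAPI preset counters around one routine.
! Error handling is omitted; preset availability depends on the processor,
! and the instrumented routine name is hypothetical.
subroutine measure_kernel()
  implicit none
  include 'f90papi.h'
  integer :: event_set, retval
  integer(kind=8) :: values(4)

  call PAPIF_library_init(retval)
  event_set = PAPI_NULL
  call PAPIF_create_eventset(event_set, retval)
  call PAPIF_add_event(event_set, PAPI_TOT_CYC, retval)   ! total cycles
  call PAPIF_add_event(event_set, PAPI_TOT_INS, retval)   ! total instructions
  call PAPIF_add_event(event_set, PAPI_FP_OPS,  retval)   ! floating point operations
  call PAPIF_add_event(event_set, PAPI_L1_DCM,  retval)   ! L1 data cache misses

  call PAPIF_start(event_set, retval)
  call kernel_of_interest()            ! placeholder for the routine being measured
  call PAPIF_stop(event_set, values, retval)

  print '(a,4i16)', 'cyc, ins, fp_ops, l1_dcm = ', values
end subroutine measure_kernel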

5.4.6 Get multicore metrics

After finding the relevant native counters for your system, obtain the L3 hit and miss rates and, if possible, the DRAM access rates and DRAM page hit rates. The off-chip bandwidth computed at 1 core per chip will immediately determine the degree to which off-chip bandwidth is an issue, while increases in the L3 miss rate will show the degree to which shared L3 capacity is an issue. DRAM page miss rates will show the degree to which bank conflicts are an issue. You can also compute the taper of your application, which shows the total effectiveness of the on-chip cache hierarchy. Having identified the particular memory bottlenecks, you can optimize your functions in order of tractability and importance.
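When a needed counter is not a PAPI preset, the native event can typically be added to an event set by name; the sketch below is illustrative, and the event name shown is a placeholder - the real native event names come from the processor manuals [12, 13]:

! Sketch: adding a native (non-preset) hardware event to a PAPI event set
! by name. The event name string below is a placeholder, not a real event.
subroutine add_native_event(event_set)
  implicit none
  include 'f90papi.h'
  integer, intent(inout) :: event_set
  integer :: code, retval

  call PAPIF_event_name_to_code('SOME_NATIVE_DRAM_PAGE_EVENT', code, retval)
  if (retval == PAPI_OK) then
    call PAPIF_add_event(event_set, code, retval)
  end if
end subroutine add_native_event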

5.4.7 Fine tuning

Once you determine that critical functions require more accuracy or more context information in the profiling, you can go back and use more sensitive tools just for those functions. We needed to resort to ultra-low-overhead measurements and temporal context in order to analyze our smallest functions, under 2,000 cycles, and to assess the impact of code optimizations on mid-sized functions under a million cycles.

6 Optimization Results

The earlier sections of this paper have discussed the types of performance bottlenecks unique to the shared memory resources of multicore chips, and the new complications that arise in attempting to measure these bottlenecks. But the ultimate goal of any performance analysis is to use this information to increase code performance. There are two main categories of optimization: algorithmic optimizations, in which the actual algorithm or data structures are altered to be more friendly to the memory system, and local optimizations, in which the algorithm is unchanged but the way it is locally implemented is improved. The latter type of optimization is far more attractive, since it does not require intimate knowledge of the application code and can be automated by a compiler.

There are four principal multicore bottlenecks identified in this paper. L3 capacity issues arise when performance is affected by having L3 space reduced to 1/N in an N-way multicore. Bus contention issues arise when multiple bursts of memory requests from different cores saturate the data busses to memory. Off-chip bandwidth issues arise when an application needs more than 1/N of the off-chip bandwidth. And finally, of increasing importance in recent years, DRAM page misses and thrashing occur when requests from multiple cores and multiple chips try to simultaneously access different DRAM pages in the same DRAM bank.

There are algorithmic solutions to these issues of varying complexity. For example, for L3 capacity issues, algorithmic optimizations used on unicore processors can help: reducing the size of the working set with smaller blocking, converting to structs of arrays, and reducing conflict/associativity issues by padding arrays and utilizing sequential data addresses. Increasing the utilization of caches, as well as making an algorithm more computationally heavy, can help reduce off-chip bandwidth and bus contention. The most difficult is optimizing for DRAM page access, because this relies on physical instead of virtual memory addresses. Without huge pages, the application must rely on special OS calls to allocate arrays in sequential physical pages aligned with 32KB DRAM page boundaries. Knowledge of how physical addresses are distributed across DRAM banks and pages must then be applied to have all the cores on a node accessing data only on currently open pages. This solution may be beyond the level of control of many programmers, and is certainly not portable across different hardware. This is why we consider the most important optimizations to be local and portable, and we next describe a local multicore optimization we call microfission.


do k=1,nlev
  do j=1,nv
    do i=1,nv
      T(i,j,k,n0)   = T(i,j,k,n0)   + smooth*(T(i,j,k,nm1)   &
                      - 2.0D0*T(i,j,k,n0)   + T(i,j,k,np1))
      v(i,j,1,k,n0) = v(i,j,1,k,n0) + smooth*(v(i,j,1,k,nm1) &
                      - 2.0D0*v(i,j,1,k,n0) + v(i,j,1,k,np1))
      v(i,j,2,k,n0) = v(i,j,2,k,n0) + smooth*(v(i,j,2,k,nm1) &
                      - 2.0D0*v(i,j,2,k,n0) + v(i,j,2,k,np1))
      div(i,j,k,n0) = div(i,j,k,n0) + smooth*(div(i,j,k,nm1) &
                      - 2.0D0*div(i,j,k,n0) + div(i,j,k,np1))
    end do
  end do
end do

Figure 20: Key loop in preq_robert time differencing.

6.1 Microfission Multicore Memory Optimization

Few developers currently look at DRAM page miss rates as a performance issue, and with the lack of control over how physical memory is mapped to pages, the nondeterministic thrashing of pages between multiple processes, and the lack of control over when values are actually written back from the caches, it could easily seem out of a high level programmer's control. The most dramatic example of this can be seen in a simple Homme function that performs a second-order time differencing of flow properties and is responsible for 8% of total run time. This function streams through memory in sequential order, using each data value only once to compute a difference. As predicted earlier in this paper for functions with high LPI, analysis confirmed this was one of the functions with the worst intra-chip scaling. It had high L3 miss rates and even higher DRAM page miss rates. It required very high bandwidth but achieved very little actual bandwidth due to DRAM page misses and bus contention. From a traditional point of view, it would appear that nothing could be done to increase its speed, since code without data reuse cannot make use of caches.

However, we found a multicore high level optimization that has the benefit of being a completely local optimization; that is, no changes to local data storage were needed and a compiler could easily perform the optimization automatically. We call this technique microfission. The optimization procedure is to split up (loop fission) each large loop into many tiny loops such that no more than two independent arrays are accessed in each loop nest, and no more than one array is brought in from main memory per loop nest. We refer to this technique as "microfission" because it breaks loops into such fine chunks - typically every term in each equation will get its own loop. Then, wherever possible, loop bodies that operate on the same two arrays are combined (loop fusion). For example, we broke each line in Figure 20 into two loop nests with two arrays each, as shown in Figure 21.

From a uniprocessor perspective, this would appear to increase total bandwidth needs while reducing MLP. But the approach offers two critical benefits to code operating in the multicore regime. First, on a very ephemeral scale (less than 10,000 cycles or so) it dramatically reduces the needed working set to just one array instead of several. As short as this is, it significantly reduces L3 cache misses. Second, it reduces the total number of independent locations being requested from the memory system across all the cores in a node. While this approach seldom eliminates all page misses, it greatly reduces their number, while empowering the on-chip memory controllers to batch requests more intelligently. Finally, compilers are much better at optimizing very small loop bodies, and we found the amount of overhead code generated was greatly reduced. Note that this programming style is nearly the opposite of that used in Homme, in which loop nests were made as "fat" as possible. This was a classic pre-optimization in which loops were fused to try to increase ILP and MLP inside a limited out-of-order window.
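To make the transformation concrete, the sketch below shows one way microfission could be applied to the loop of Figure 20; it illustrates the technique as described above and is not a reproduction of the authors' Figure 21 code. Each original statement becomes two micro loops, each of which streams exactly one time level in from memory while updating the n0 time level:

! Sketch of microfission applied to the Figure 20 loop (illustrative only).
! Each micro loop touches at most two array slices: the n0 slice being
! updated (which stays resident on chip) and one slice streamed from DRAM.

! micro loop 1: fold the n0 and nm1 terms into T
do k=1,nlev
  do j=1,nv
    do i=1,nv
      T(i,j,k,n0) = (1.0D0 - 2.0D0*smooth)*T(i,j,k,n0) + smooth*T(i,j,k,nm1)
    end do
  end do
end do

! micro loop 2: fold in the np1 term for T
do k=1,nlev
  do j=1,nv
    do i=1,nv
      T(i,j,k,n0) = T(i,j,k,n0) + smooth*T(i,j,k,np1)
    end do
  end do
end do

! The v updates (both components) and the div update are split the same way,
! giving eight micro loops in place of the single fused loop nest.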


There was one complication: compilers are now smart enough to fuse small loops and undo the optimization. To defeat the compiler, each micro loop was broken into a separate function call. This introduced a tremendous CPU overhead that we would not have had if the compiler had supported such an option. Yet, even with this handicap, it was a performance win.

The actual performance effects of microfission as implemented in Figure 21 are illustrated in Figure 23, as measured on Ranger. The second set of bars shows that the L2 miss rate is essentially unaffected, increasing from 7.4% to 7.9%, despite the increased private cache bandwidth incurred by the optimization. This confirms that private cache bandwidth is ample to support even the lowest arithmetic intensity HPC codes, and that microfission is strictly a multicore optimization having minimal effect on individual cores. The next bars illustrate the reduction in contention for shared on-chip resources. L3 miss rates, typically above 50% for HPC applications, are cut in half from 66% to 33% due to the momentary drop in working set size. Off-chip bandwidth needs are reduced accordingly. The DRAM page hit rate, previously below 20%, is 2.5 times as great, breaking the 50% mark, while the page miss rate is down from 53% to 25%. DRAM conflict miss rates, representing the worst type of DRAM contention, were down by a more modest 11% from an already low rate of 27%. As a result of reducing resource contention, actual performance (the first set of bars) increased 33% on Ranger and 35% on Longhorn.

How did microfission affect intra-chip scalability? The optimized code's scalability from 1 to 4 cores per chip increased by 11% over the unoptimized version, showing that while contention effects drop off somewhat as the number of requests drops, there is still a roughly linear relationship between memory access rate and performance. However, comparing the optimized code at 4 cores per chip to the base code at 1 core per chip, we see a 33% increase in scalability: on Ranger/Barcelona, the optimized code reached 78% efficiency at 4 cores per chip - it actually managed to benefit from 3 cores per chip and match the Nehalem system. The results for the Longhorn/Nehalem system were surprising, because while Nehalem tends to scale better in general, on this particular function Longhorn/Nehalem had far worse scaling. Base efficiency going from 1 to 4 cores per chip was just 41%, meaning it could not even benefit fully from two cores per chip. The optimized version had 10% better intrachip scaling than the unoptimized version, a similar improvement to Barcelona. However, comparing the optimized version at 4 CPC to the unoptimized base at 1 CPC, the efficiency increased to 58%, a 42% increase in intrachip scalability.

One important assumption in increasing the local bandwidth is that two arrays will completely fit in one core's share of the on-chip caches. In Homme this was the case, as each array for a single blocked element took up just 48KB. If the arrays were too large to fit on chip, blocking across arrays would be mandatory, as shown in Figure 22. A secondary optimization on top of this is to block the entire set of micro loops so that they iterate only over sections of the arrays that fit in the L1 or L2 cache. This is primarily an advantage when a given array is used in more than one micro loop or touched more than once; a sketch of such a blocked variant follows below.
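A sketch of the blocked variant, in the spirit of Figure 22 (the block size kblk is a placeholder and would be tuned so that the active slices fit in the L1 or L2 cache):

! Sketch of blocking the micro loops, in the spirit of Figure 22. The block
! size kblk is a placeholder chosen so the active slices fit in L1/L2 cache.
do kb = 1, nlev, kblk
  ! micro loop 1 over one block of levels
  do k = kb, min(kb + kblk - 1, nlev)
    do j = 1, nv
      do i = 1, nv
        T(i,j,k,n0) = (1.0D0 - 2.0D0*smooth)*T(i,j,k,n0) + smooth*T(i,j,k,nm1)
      end do
    end do
  end do
  ! micro loop 2 over the same block; the n0 slice touched above is still
  ! resident in cache when it is updated again here
  do k = kb, min(kb + kblk - 1, nlev)
    do j = 1, nv
      do i = 1, nv
        T(i,j,k,n0) = T(i,j,k,n0) + smooth*T(i,j,k,np1)
      end do
    end do
  end do
end do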
While this did offer some modest additional improvements in performance, it was clear that most of the performance benefit came from the base optimization, and blocking was limited by the algorithm. The message is clear: simple high level programming techniques can improve memory performance. When we applied this technique to other functions in Homme, we saw more modest improvements of 5-20%, because these functions had additional performance issues such as TLB page initialization misses. While the tedium of manually applying this optimization prevented us from using it throughout Homme, we estimate it could be applied to the majority of the important loops in Homme, and it would be a good candidate for a compiler optimization flag.

6.2 Effect on Core Skew

In Section 5.3.4 and Figure 19 we discussed the temporal performance variations caused by core skew. An interesting effect of microfission is that, while it significantly improves average performance, it also drastically increases performance variation, increasing core skew effects on both platforms compared to the base version's background oscillations (see Figures 24-26).


Figure 21: Optimization of code in Figure 20. The basic idea of microfission is to break up loops until each loop is loading just one array from DRAM and holding no more than two arrays on-chip. Grouping together similar array accesses minimizes bandwidth. The entire operation can be blocked if needed to fit the arrays in cache. This greatly reduces L3 cache and DRAM page miss rates and general memory contention.


Figure 22: Blocking version of code in Figure 21. Best performance occurs when both arrays can fit into L1 and L2 caches. Blocking across all microloops makes this possible, and is mandatory if the arrays cannot fit on-chip. However, we found that for L2-sized arrays, blocking did not improve performance beyond the original optimization.



Figure 23: Loop fission more than halved L3 and DRAM page miss rates, increasing performance by up to 35%. The bars compare the Base and uFission versions on performance, L2 miss rate, L3 miss rate, and DRAM page hit, miss, and conflict rates.

This effect is likely of only secondary importance to performance, but it can have a significant impact on the accuracy and interpretation of performance monitoring. As a first step in understanding the source of the increased variability, we have isolated the source of the performance variation to be variation in L2 misses, as shown in Figure 27. This increases the likelihood that the effect is due to complex multicore interaction, as the L2 caches are the primary interface to global coherence. However, it is not yet clear why the simpler microfission traffic is so much more affected by core skew, or why dramatic oscillations occur on time scales longer than a major timestep.

6.3 Reducing the data footprint

Another optimization we explored in Homme was reducing the data footprint by cutting down on local temporaries and temporary arrays, both by reusing temporary storage (especially across individual blocked "elements") and by substituting redundant computation for stored values. In the multicore regime, we found the code typically had the capacity to perform 5-10 floating point computations in the time of one load, giving a huge incentive to change the high level source algorithm to use computation instead of storage whenever possible. This is even more effective when transposing the vertical structure of Homme to perform all computations on a single element in sequence, instead of performing a single computation over all elements before moving on to the next computation.
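As a toy illustration of this trade-off (the arrays and coefficients here are hypothetical, not Homme code), a coefficient that could be precomputed and stored in its own array is instead recomputed from scalars at its point of use, removing one full DRAM stream at the cost of a few extra flops per iteration:

! Sketch: recomputation in place of intermediate storage (illustrative names).

! Version A: precompute and store a coefficient array, then stream it back in.
do i = 1, n
  coef(i) = alpha + beta*(x0 + (i-1)*dx)**2
end do
do i = 1, n
  c(i) = c(i) + coef(i)*a(i)
end do

! Version B: recompute the coefficient from scalars at the point of use,
! removing the coef() stream from DRAM at the cost of a few extra flops.
do i = 1, n
  c(i) = c(i) + (alpha + beta*(x0 + (i-1)*dx)**2)*a(i)
end do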

Figure 24: Base performance at 4 cores per chip (cycles to complete preq_robert, per major timestep).

Figure 25: The microfission optimization increases average performance but also greatly increases performance variance, exposing periodic oscillations on this node (cycles to complete preq_robert, per major timestep).

Figure 26: The degree of variation is even more pronounced at one core per chip (cycles to complete preq_robert, per major timestep, for the uFission version).

Figure 27: Unsurprisingly, the driver of this variation appears to be the L2 cache miss rate, which varies by nearly 50%. The plot shows the per-timestep variability of the L1, L2, and L3 miss rates and the DRAM page hit rate for the uFission version. The L2 is the gateway between private and shared memory and tends to get bombarded by snoop requests from the rest of the node.


6.4 Global multicore optimizations

The overall structure of Homme illustrates the importance of taking the concept of blocking to a higher level to support multicore processors. Homme already does two things very well: it processes its data in 48KB chunks per grid value, and it stores every field metric in a separate array (a struct of arrays), allowing functions to touch only the values they specifically operate on. However, there are three areas in which Homme needs improvement for better multicore performance. The higher level organization of Homme iterates over all data elements for each layer of physics, artificially reducing arithmetic intensity and leading to a needless bandwidth explosion. Furthermore, Homme stores almost every computed value in a separate array, rather than attempting to minimize intermediate storage and temporary values. Finally, Homme frequently implements loop bodies which perform computation on many different arrays at once, which exacerbates DRAM conflicts and reduces effective bandwidth. All of these design decisions were made to aid software readability and extensibility, but they have severe negative performance impacts on multicore processors.

The general program structure by definition permeates all of the Homme code, and as such is intractable to alter. However, when creating such code from scratch, a different style is needed for multicore performance. While Homme correctly used a structure of arrays instead of an array of structures to minimize its data footprint, the other design elements must change. First, for each set of simulation quantities, all layered physics computations applicable to that quantity should be applied to each block in succession, not to all blocks in the list. Second, temporary arrays should be used very sparingly, and reused whenever possible to reduce the data footprint. Finally, blocking should be made even smaller to account for the smaller amount of cache per core on modern multicore processors.
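The restructuring suggested here can be sketched schematically (the routine and loop names are illustrative, and in practice inter-element dependences such as edge exchanges limit which stages can be grouped this way):

! Sketch of the suggested restructuring (illustrative names only).

! Current structure: each physics stage sweeps all elements, so every
! element's data is streamed from DRAM once per stage.
do stage = 1, nstages
  do ie = 1, nelem
    call apply_stage(stage, elem(ie))
  end do
end do

! Multicore-friendly structure: all applicable stages are applied to one
! element while its (roughly 48KB) working set is still resident on chip.
! Stages that need data from neighboring elements (e.g., edge exchanges)
! cannot be grouped this way and must remain element-sweep loops.
do ie = 1, nelem
  do stage = 1, nstages
    call apply_stage(stage, elem(ie))
  end do
end do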

7 Related Work

An early example in which the application of highly detailed architectural knowledge led to the development of highly optimized memory performance was the STREAM benchmark suite [28], hosted at the University of Virginia [29]. There have been many articles on optimizing specific scientific kernels for multicore CPUs, e.g., stencil computations [30], sparse matrix-vector multiplication [31], and even a Lattice-Boltzmann computation [32].

In the paper "We have it easy, but do we have it right?" [33], the authors analyze SPEC benchmarks using PAPI to read hardware counters and note that measurement error due to measurement disturbance, as well as OS context and compiler flags, can be significant and unpredictable, such that correct performance conclusions could not easily be drawn. Unlike their work, we studied a full scale application in the context of a modern multicore supercomputer and found almost an order of magnitude greater disturbance. Their solution was not code based but measurement based: they recommended running tests over a wide range of OS parameters and validating any performance conclusions by conducting detailed tests to confirm the hypothesis. While we did note the effects of running tests over many optimization levels and optimization flags, we did not track the effects of Unix contexts. The present paper also goes into far more depth about both the nature of measurement error and possible steps to correct it.

The paper "Investigating the impact of code generation on performance characteristics of integer programs" [34] investigates the different performance characteristics of three different compilers on SPEC benchmarks, focusing on PAPI hardware counters as primary metrics. The authors also observe that different compilers do better on different functions, and that judging a compiler only by total program performance is a mistake. They also note compiler differences in cache miss rates and fundamental instruction counts, and even recommend some optimizations. We make similar observations in a different context, focusing more on memory optimizations and less on CPU optimizations. We note the impact of compiler memory access patterns in the multicore regime and recommend fundamental but simple changes to the way compilers optimize code for multicore processors.

Jack Dongarra wrote a series of papers on the impact of multicore processors on various programming areas, one of

which focused on scientific applications [35]. In that paper, he emphasized, as we do, that multiple cores on a chip cannot be treated as a traditional SMP due to shared on-chip resources, and that as a result new scientific code would become much more complex because it would have to take increasingly varied architectures into account. He referred to this shift in programming paradigm as the "multi-core discontinuity." He predicted many effects that we verify in this paper, e.g., that the memory wall will grow until memory access dominates performance issues. He also identified that compiler strategies must change, focusing on the issue of cache associativity and pointing out that most compilers still use loop fusion, an approach inappropriate for CMPs. While his papers identify potential issues with upcoming architectures, they do not explore them in depth or offer actual solutions, and they do not focus on issues with performance measurement.

Perhaps the work that most directly addresses bottleneck analysis and optimization approaches specific to multicore chips is the Roofline model from Berkeley [36]. The model defines a number of key performance metrics and uses microbenchmarks to estimate realistic values for a given hardware platform. Comparing the actual performance of several HPC proxy kernels (the original seven Berkeley dwarfs) then provides insight into multicore bottlenecks and code optimization techniques on a half dozen single chip, multicore systems. While insightful and useful, that work analyzes at a much coarser level, examines a different scale of programs, and focuses on more traditional optimization techniques like blocking and CPU-centric optimization techniques such as balancing multiplies and adds or successful vectorization. It does not employ hardware counters and does not focus on the difficulties of making accurate measurements or interpreting them. Finally, it ignores the on-chip memory hierarchy and the complications of DRAM pages and access patterns, along with their performance impact on high level compiler optimizations. In contrast, this paper analyzes a full scale supercomputer application for which CPU-centric optimizations are largely irrelevant. We derive a highly detailed architectural picture from hardware counters, and examine classic pitfalls in their interpretation. From this deeper, more detailed study, we develop novel multicore analysis and optimization techniques, such as intra-chip scalability analysis and the microfission memory optimization.

8 Conclusion

This paper introduces the notion of intrachip scalability for HPC applications running on multicore processors and demonstrates that conventional internode parallel scaling properties are quite different from, and independent of, intrachip scaling properties. Using HOMME as an archetypal regular HPC application, we demonstrate almost no scalability beyond 2 cores per chip on Ranger, a Barcelona-based supercomputer, and 3 cores per chip on Longhorn, a Nehalem-EP based hybrid supercomputer. The four key multicore performance bottlenecks identified are shared L3 cache capacity, shared off-chip bandwidth, memory bus and DRAM bank contention, and DRAM page conflicts. This paper outlines a simple procedure for measuring the performance metrics most critical to multicore performance. Sufficiently accurate performance measurements of the memory system are significantly more challenging than conventional performance measurements due to the delayed effects of memory references and the nondeterministic interaction among all the cores in a node. We show that temporal context is critical, and that obtaining sufficient measurement accuracy may require using multiple tools with different strengths and weaknesses. Finally, we discuss both algorithmic and local optimizations to improve multicore scalability. The most general local optimization, microfission, reduced L3 cache miss rates by almost 50%, more than doubled DRAM page hit rates, reduced compiler overhead instructions by a third, and increased intrachip scalability by up to 42% and absolute performance by up to 35%.

In the near future, we hope to gain access to more hardware counters on the Longhorn supercomputer to make more detailed comparisons between platforms. We will also investigate the causes and characteristics of core skew effects and standing node oscillations. Finally, we hope to find two other tractable case studies to round out our representation of HPC applications as discussed in Section 2: an archetypical irregular application, such as an adaptive mesh simulation, and a search application such as those common in biochemistry, exploring both the

generality of our observations and highlighting issues which may be unique to other kinds of HPC applications. We believe the multicore bottlenecks, measurement issues, and optimization techniques explored in Homme will apply to less regular applications, but that additional measurement challenges and optimization opportunities may surface.

Acknowledgments

The authors would like to thank Lars Koesterke for his expert help with Fortran95 issues as well as his experience with HPC optimizations, and Victor Eijkhout for his experience with sparse and unstructured HPC applications. Finally, we would like to thank the reviewers for their helpful feedback. This work was supported in part by the National Science Foundation under award CCF-0916745 and OCI award 0622780.

References [1] “Cray xd1 supercomputer delivers three times more power to reconfigurable computing applications.” [Online]. Available: http://www.xilinx.com/prs rls/design win/0591cray05.htm [2] K. J. Barker, K. Davis, A. Hoisie, D. J. Kerbyson, M. Lang, S. Pakin, and J. C. Sancho, “Entering the petaflop era: the architecture and performance of roadrunner,” in SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–11. [3] “A look at the 4640-gpu nebulae supercomputer.” [Online]. Available: http://blog.zorinaq.com/?e=14 [4] J. J.Hack, B. A. Boville, B. P. Briegleb, J. T. Kiehl, P. J. Rasch, and D. L. Williamson, “Description of the ncar community climate model (ccm2).” NCAR, Boulder, Colorado, Tech. Rep., 1993, nCAR Tech. Note TN-382. [5] G. Amdahl, “Validity of the single-processor approach to achieving large scale computing capabilities.” in AFIPS Conference Proceedings, vol. 30. Reston, Virginia, USA: AFIPS Press, 1967, pp. 483–485. [6] J. L. Gustafson, “Reevaluating amdahl’s law,” Commun. ACM, vol. 31, no. 5, pp. 532–533, 1988. [7] “Ranger user’s guide.” [Online]. Available: http://www.tacc.utexas.edu/services/userguides/ranger [8] “Longhorn user’s guide.” [Online]. Available: http://services.tacc.utexas.edu/index.php/longhorn-user-guide [9] “Metis - serial graph partitioning and fill-reducing matrix ordering.” [Online]. Available: //glaros.dtc.umn.edu/gkhome/metis/metis/overview

http:

[10] “Nsf 0605: the high-performance computing challenge benchmarks, version 2.0.” [Online]. Available: http://www.nsf.gov/publications/pub summ.jsp?ods key=nsf0605
[11] “Linux performance counter kernel api.” [Online]. Available: http://user.it.uu.se/∼mikpe/linux/perfctr/
[12] BIOS and Kernel Developers Guide (BKDG) For AMD Family 10h Processors, Rev 3.48 ed., April 2010, publication 31116. [Online]. Available: http://developer.amd.com/documentation/guides/pages/default.aspx
[13] System Programming Guide, Part 2, Intel document 253669. [Online]. Available: www.intel.com/assets/pdf/manual/253669.pdf
[14] “Papi: Performance application programming interface.” [Online]. Available: http://icl.cs.utk.edu/papi

[15] S. L. Graham, P. B. Kessler, and M. K. Mckusick, “Gprof: A call graph execution profiler,” in SIGPLAN ’82: Proceedings of the 1982 SIGPLAN symposium on Compiler construction. New York, NY, USA: ACM, 1982, pp. 120–126. [16] S. L. Graham, P. B. Kessler, and M. K. McKusick, “gprof: a call graph execution profiler,” SIGPLAN Not., vol. 39, no. 4, pp. 49–57, 2004. [17] “Mpi profiler.” [Online]. Available: http://mpip.sourceforge.net/ [18] S. S. Shende and A. D. Malony, “The tau parallel performance system,” Int. J. High Perform. Comput. Appl., vol. 20, no. 2, pp. 287–311, 2006. [19] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, “Hpctoolkit: tools for performance analysis of optimized parallel programs http://hpctoolkit.org,” Concurr. Comput. : Pract. Exper., vol. 22, no. 6, pp. 685–701, 2010. [20] N. R. Tallent, J. M. Mellor-Crummey, L. Adhianto, M. W. Fagan, and M. Krentel, “Diagnosing performance bottlenecks in emerging petascale applications,” in SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 2009, pp. 1–11. [21] N. R. Tallent, J. M. Mellor-Crummey, and M. W. Fagan, “Binary analysis for measurement and attribution of program performance,” in PLDI ’09: Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation. New York, NY, USA: ACM, 2009, pp. 441–452. [22] M. Burtscher, B.-D. Kim, J. Diamond, J. McCalpin, L. Koesterke, and J. Browne, “Diagnosing performance bottlenecks in emerging petascale applications,” in SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 2009, pp. 1–11. [23] V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, “Pin: a binary instrumentation tool for computer architecture research and education,” in WCAE ’04: Proceedings of the 2004 workshop on Computer architecture education. New York, NY, USA: ACM, 2004, p. 22. [24] “Personal communications from b.d. kim.” [25] “Personal communications from lars koesterke.” [26] D. K. Kaushik, D. E. Keyes, and B. F. Smith, “On the interaction of architecture and algorithm in the domainbased parallelization of an unstructured grid incompressible flow code,” in In Proceedings of the Tenth International Conference on Domain Decomposition Methods. AMS, 1998, pp. 311–319. [27] “Personal communications from victor eijkhout.” [28] J. D. McCalpin, “Memory bandwidth and machine balance in current high performance computers,” IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25, Dec. 1995. [29] ——, “Stream: Sustainable memory bandwidth in high performance computers,” University of Virginia, Charlottesville, Virginia, Tech. Rep., 1991-2007, a continually updated technical report. http://www.cs.virginia.edu/stream/. [Online]. Available: http://www.cs.virginia.edu/stream/ [30] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick, “Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures,” in SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–12. 47

[31] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix-vector multiplication on emerging multicore platforms,” Parallel Comput., vol. 35, no. 3, pp. 178–194, 2009. [32] S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick, “Optimization of a lattice boltzmann computation on state-of-the-art multicore platforms,” J. Parallel Distrib. Comput., vol. 69, no. 9, pp. 762–777, 2009. [33] T. Mytkowicz, A. Diwan, M. Hauswirth, and P. Sweeney, “We have it easy, but do we have it right?” in IEEE International Symposium on Parallel and Distributed Processing, April 2008, pp. 1–7. [34] R. Jayaseelan, A. Bhowmik, and R. D. C. Ju, “Investigating the impact of code generation on performance characteristics of integer programs,” in INTERACT-14: Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture. New York, NY, USA: ACM, 2010, pp. 1–8. [35] J. Dongarra, D. Gannon, G. Fox, and K. Kennedy, “The impact of multicore on computational science software,” pp. 3–10, 2007. [36] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, 2009.
