2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops

The Heuristic Static Load-Balancing Algorithm Applied to the Community Earth System Model

Yuri Alexeev*, Sheri Mickelson†, Sven Leyffer†, Robert Jacob†, and Anthony Craig‡
* Leadership Computing Facility, Argonne National Laboratory, Argonne, IL 60439, USA. Email: [email protected]
† Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL 60439, USA. Email: {mickelso,leyffer,jacob}@mcs.anl.gov
‡ Climate and Global Dynamics Division, National Center for Atmospheric Research, Boulder, CO, USA. Email: [email protected]

Abstract—We propose to use the heuristic static load-balancing (HSLB) algorithm for solving load-balancing problems in the Community Earth System Model (CESM), a climate model, using fitted benchmark data as an alternative to the current manual approach. The problem of allocating the optimal number of CPU cores to CESM components is formulated as a mixed-integer nonlinear optimization (MINLP) problem, which is solved by a branch-and-bound solver implemented in the MINLP package MINOTAUR. The key feature of the branch-and-bound method is that it is guaranteed to provide an optimal solution or to show that none exists. Our algorithm was tested for 1° and 1/8° resolution simulations on 32,768 nodes (131,072 cores) of the IBM Blue Gene/P, where we consistently achieved well load-balanced results. This work is part of a broader effort to eliminate the need for manual tuning of the code for each platform and simulation type, to improve the performance and scalability of CESM, and to develop automated tools to achieve these goals.

Keywords: constrained optimization, global optimization, integer programming, nonlinear programming, static load balancing, heuristic algorithm, climate modeling

I. INTRODUCTION

Achieving an even load balance is a key issue in parallel computing. Moreover, the impact of load balancing on overall algorithm efficiency stands to increase dramatically as we enter the petascale supercomputing era. By Amdahl's law, the scalable component of the total wall time shrinks as the number of processors increases, while the load imbalance, together with the constant sequential component, leads to sublinear scaling. While parallelization of sequential code often requires rewriting the code, which can be very challenging, adopting an efficient load-balancing scheme can be a simple and effective way to boost the scalability and performance of a code. Load-balancing algorithms are broadly classified into two groups: dynamic load balancing (DLB) and static load balancing (SLB). SLB algorithms rely on previously obtained knowledge (e.g., benchmarking data) or consistent task sizes, while DLB assigns jobs to processors during code execution.

There exist many variations of SLB and DLB algorithms adapted for specific applications [1], [2], [3], using techniques such as random stealing, simulated annealing, recursive bisection, space-filling curve partitioning, and graph partitioning. SLB is generally easier to implement and has negligible overhead, making it suitable for fine-grained parallelism consisting of many small tasks. DLB is often preferred for larger tasks of diverse sizes, as is often the case in coarse-grained parallelism. However, in the special case of a few large tasks of diverse size, DLB algorithms are not appropriate because the number of tasks is much smaller than the number of processors. In that case, SLB is a better choice. Finding the optimal allocation is not trivial because, except for simple cases, it is an NP-hard problem. SLB algorithms typically consist of three steps: gathering benchmarking data, analyzing and fitting the data, and deciding on an optimal allocation by using a mathematical or other model. The decision-making step is of utmost importance; it is the step that defines the SLB algorithm. The Heuristic Static Load-Balancing (HSLB) algorithm was originally developed [4] for the fragment molecular orbital (FMO) method in quantum chemistry. For the decision-making step, the load-balancing problem was formulated as a mixed-integer nonlinear optimization (MINLP) problem. The MINLP approach provides great flexibility in modeling the allocation problem realistically. Using nonlinear functions, we can capture complex relationships between the runtime and the number of processors. At the same time, we can impose integer restrictions on certain variables (e.g., the number of processors or time requirements). To solve the MINLP for load balancing, we use MINOTAUR [5], a freely available MINLP toolkit. MINOTAUR offers several algorithms for solving general MINLPs and can be easily called from different interfaces such as AMPL scripts or C++ code.

In this paper, we apply HSLB to the climate modeling code called the Community Earth System Model (CESM) [6] on Intrepid, the Blue Gene/P supercomputer at Argonne National Laboratory (ANL), which has 40,960 quad-core processors [7]. CESM consists of a few distinct components, such as an atmosphere model and an ocean model. Each component has different scaling requirements and performance characteristics. This combination of a few components (tasks) of various sizes makes CESM a perfect candidate for the HSLB approach. HSLB for CESM was heavily modified compared with HSLB for FMO. The main difference is in constraining task layouts to reflect the sequencing and partitioning of components across processors. These constraints are science requirements in CESM and are defined by the code organization. As a result, the HSLB mathematical models for CESM are more sophisticated because of the use of multiple constraints. In the next section, CESM setups are explained in greater detail. In Section III, we explain how the HSLB algorithm was applied to CESM. Section IV covers the application of HSLB to CESM simulations at 1° and 1/8° resolution for the purpose of validating HSLB and demonstrating its performance. In Section V we conclude the paper by outlining the major results and discussing how CESM can benefit from the use of HSLB-based tools.

II. LOAD-BALANCING IN CESM

CESM is one of the most widely used global climate models in the world. Results from this model are a major part of the Intergovernmental Panel on Climate Change (IPCC) assessment reports. CESM1.1.1 consists of six different model components that communicate through a coupler: atmosphere, ocean, sea ice, land, river, and land ice models. The Community Atmosphere Model (CAM) is an atmospheric model whose development is based at the National Center for Atmospheric Research (NCAR) but has a large community of outside developers. The Parallel Ocean Program (POP) was developed at Los Alamos National Laboratory (LANL) and is used for global ocean modeling. The Community Ice Code (CICE) is a sea ice model that was also developed at LANL. The Community Land Model (CLM) is a land model that was developed at NCAR but likewise has a large community of outside developers. The River Transport Model (RTM) calculates the total runoff from the land surface model. The Community Ice Sheet Model (CISM) is used to predict ice sheet retreat. The coupler (CPL7) controls the communication exchanges of two-dimensional boundary data between the components. Each of the CESM model components has varying scalability patterns and performance characteristics. CESM can be run at many different model resolutions and with different combinations of model components. Some common atmosphere and land model resolutions include 2°, 1°, 1/2°, and 1/4° on a finite volume (FV) grid and 1°, 1/4°, and 1/8° on the spectral element cube sphere (HOMME-SE) grid.

Figure 1. Popular layouts of CESM components. Width of each component represents the number of nodes allocated to it while height represents the time to run it.

The FV and spectral element dynamical cores each use different methods for solving the equations, and each has different performance characteristics. The ice and ocean models are commonly run at a 1° resolution on a displaced pole grid and at 1/10° resolution on a tripole grid. In this paper, we focused on running CESM1.1.1 on the 1° FV grid for the atmosphere and land models and 1° ocean and ice grids. We also ran a pre-release version of CESM1.2 on the 1/8° HOMME-SE grid for the atmosphere, the 1/4° FV grid for the land model, and the 1/10° ocean and ice grid. The CESM architecture is flexible and allows the components to be run sequentially or concurrently across processors. Each component can be run with various MPI task and OpenMP thread counts, and components can be run sequentially on the same processor sets as other components. This allows for many different layout patterns and can make load balancing tricky. The old method for manual load balancing involves running the model at about five different core counts. Once these runs complete, the component timings are plotted, and users manually select optimal core counts based on the scaling curves.


This process may involve trial and error, especially for inexperienced users, and it can consume a significant amount of both person and computer time, especially at high resolutions. At the same time, it is very important to use the hardware resources efficiently to achieve the best model throughput at minimal cost, especially when the model can run for several hundreds or thousands of simulated years and use several million core-hours for an experiment. In CESM1.1.1, fully coupled runs are usually set up with mixed sequential/concurrent layouts (Figure 1). The typical setup runs the atmosphere and ocean on separate processors. The atmospheric model then shares processors with the land and ice models. This is done because the atmosphere model is constrained to run sequentially with the ice and land models for science reasons. The river model is typically run on the same processors as the CLM model, and the coupler is run on the same processors as the atmosphere. The coupler and the river models take less time to run than the other components, so these components were not included in our HSLB models, but they can be added later for fine-tuning the load balance. Other layouts include running the atmosphere, land, and ice models sequentially on one group of processors and running the ocean model on the remaining processors (see Figure 1, panel (2)). All of the models can also be run sequentially across all processors, as shown in Figure 1, panel (3). In this paper, we focus on load-balance optimization of the models in the hybrid layout of Figure 1, panel (1).
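To make the layout trade-offs concrete, the short sketch below encodes the total-time expressions implied by the three layouts of Figure 1 (and formalized later in Table I). It is only an illustration: the component times are rough placeholder values taken from the 1°, 128-node manual timings in Table III, not a general measurement.

```python
# Sketch: total wall-clock time implied by the three CESM component layouts of
# Figure 1, given per-component run times (illustrative placeholder values).

def layout1_time(t_ice, t_lnd, t_atm, t_ocn):
    # Layout (1): ice and lnd run concurrently, atm follows them on the same
    # processors, and ocn runs concurrently with that whole group.
    return max(max(t_ice, t_lnd) + t_atm, t_ocn)

def layout2_time(t_ice, t_lnd, t_atm, t_ocn):
    # Layout (2): ice, lnd, and atm run sequentially; ocn runs concurrently.
    return max(t_ice + t_lnd + t_atm, t_ocn)

def layout3_time(t_ice, t_lnd, t_atm, t_ocn):
    # Layout (3): all components run sequentially across all processors.
    return t_ice + t_lnd + t_atm + t_ocn

if __name__ == "__main__":
    # Roughly the manual 1°, 128-node component times from Table III (seconds).
    times = dict(t_ice=109.0, t_lnd=64.0, t_atm=307.0, t_ocn=363.0)
    for layout in (layout1_time, layout2_time, layout3_time):
        print(layout.__name__, round(layout(**times), 1))
```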

Table I
Mathematical models corresponding to component layouts (1)-(3) in Figure 1

Given:
  1   Z+ - the set of positive integers
  2   R+ - the set of positive reals
  3   C = {ice, lnd, atm, ocn} = {i, l, a, o} - set of components
  4   N ∈ Z+ - total number of nodes available for allocation
  5   O = {2, 4, ..., 480, 768} = {O1, ..., Om} - possible allocations for ocn
  6   A = {1, 2, ..., 1638, 1664} = {A1, ..., Am} - possible allocations for atm

Variables:
  7   T ∈ R+ - wall-clock time obtained by solving the allocation problem
  8   Ticelnd ∈ R+ - wall-clock time to balance components lnd and ice
  9   Tsync ∈ R+ - synchronization tolerance to balance components lnd and ice
  10  nj ∈ Z+ - number of nodes allocated to component j ∈ C
  11  Tj(nj) ∈ R+ - (fitted) performance function modeling the time taken to run component j on nj nodes
  12  zk ∈ {0, 1} - binary variables modeling the selection of the number of nodes no, na

Minimize:
  13  T

Subject to (constraints for layout (1): max(max(ice, lnd) + atm, ocn)):
  14  Ticelnd ≥ Ti(ni)
  15  Ticelnd ≥ Tl(nl)
  16  T ≥ Ticelnd + Ta(na)
  17  T ≥ To(no)
  18  Tl(nl) ≥ Ti(ni) - Tsync
  19  Tl(nl) ≤ Ti(ni) + Tsync
  20  na + no ≤ N
  21  ni + nl ≤ na

Subject to (constraints for layout (2): max(ice + lnd + atm, ocn)):
  22  T ≥ Ti(ni) + Tl(nl) + Ta(na)
  23  T ≥ To(no)
  24  nl ≤ N - no
  25  ni ≤ N - no
  26  na ≤ N - no

Subject to (constraints for layout (3): ice + lnd + atm + ocn):
  27  T ≥ Ti(ni) + Tl(nl) + Ta(na) + To(no)
  28  nl ≤ N, ni ≤ N, na ≤ N, no ≤ N

Subject to (constraints applied to all layouts):
  29  Σ_{k=1}^{m} zk = 1
  30  Σ_{k=1}^{m} zk Ok = no
  31  Σ_{k=1}^{m} zk Ak = na

III. THE HEURISTIC STATIC LOAD-BALANCING ALGORITHM

Our HSLB method consists of four steps. First, we collect benchmarking data for each component. Second, we solve for the optimal parameters aj, bj, cj, dj (see Table II) by a least-squares method based on our chosen scalability model. Third, we solve an integer optimization problem in order to obtain an optimal allocation of nodes. Fourth, we use the optimal node allocation obtained from the optimization to run CESM in static load-balancing mode. We outline each of these steps in greater detail in the following sections.


A. Mathematical Models of Layouts

Before we apply HSLB, we need to develop one or more mathematical models, which is arguably the most important step. A mathematical model should accurately represent how the components are executed, including constraints on how the components relate to one another. The popular CESM layouts of components on processors are shown in Figure 1, and the corresponding mathematical models are shown in Table I. In Table I, line 3, we define the components used in modeling the system. The real number of components in CESM is larger, but the runoff, land ice, and coupler are currently excluded because their contribution to the total time is small.

In this exercise, we are trying both to optimize the performance of each component and to optimize the performance of the entire model by finding well load-balanced layouts. Each component in CESM has various parameters, such as decomposition choices and block sizes, that impact performance, and some components in CESM are limited to particular processor counts or perform best at certain processor counts that we call “sweet” spots.


The “sweet” spots are usually found by extensive profiling of different decomposition and blocking schemes, and we leveraged prior results. The version of CESM we used had ocean model processor count constraints hard-coded into the implementation, and those were translated into the mathematical model (see Table I, line 5). The “sweet” spots for the atmosphere model are core counts that generally decompose the grid evenly. This constraint is modeled in AMPL as special sets (see Table I, lines 12 and 29-31). In the end, each component has slightly different performance scaling characteristics that must be captured. There are two types of constraints considered in these models: temporal constraints and node constraints. Temporal constraints (Table I, lines 14-19, 22-23, and 27) express the sequencing imposed by the science requirements, while node constraints map node allocations to the total number of available nodes. Needless to say, there are many ways one can build a mathematical model. In general, the model should be as flexible as possible to leave room for the optimization; this is manifested in the model by the use of less-than-or-equal and greater-than-or-equal constraints (see Table I). Finally, additional constraints, such as the Tsync tolerance (defined in Table I, line 9), may actually reduce the performance of the resulting allocation because they impose additional synchronization requirements on the solution.
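As a concrete illustration of the layout-(1) model, the sketch below checks the node and synchronization constraints of Table I for a candidate allocation and evaluates the resulting total time, using performance functions of the form defined in Table II. The parameter values and the helper names (make_perf, layout1_feasible, layout1_time) are hypothetical; the actual model is written in AMPL and solved with MINOTAUR.

```python
# Sketch of the layout-(1) model from Table I: check the node and
# synchronization constraints for a candidate allocation and evaluate the
# total time T = max(max(T_ice, T_lnd) + T_atm, T_ocn). The performance
# functions are hypothetical stand-ins for the fitted ones.

def make_perf(a, b, c, d):
    # Performance function of the form in Table II, line 1: a/n + b*n**c + d.
    return lambda n: a / n + b * n ** c + d

def layout1_feasible(n, N, t_sync, perf):
    # Node constraints (Table I, lines 20-21): atm and ocn fill the machine,
    # and ice plus lnd fit inside the atm partition.
    ok_nodes = n["atm"] + n["ocn"] <= N and n["ice"] + n["lnd"] <= n["atm"]
    # Temporal constraint (lines 18-19): lnd and ice finish within t_sync.
    ok_sync = abs(perf["lnd"](n["lnd"]) - perf["ice"](n["ice"])) <= t_sync
    return ok_nodes and ok_sync

def layout1_time(n, perf):
    t_icelnd = max(perf["ice"](n["ice"]), perf["lnd"](n["lnd"]))
    return max(t_icelnd + perf["atm"](n["atm"]), perf["ocn"](n["ocn"]))

if __name__ == "__main__":
    # Hypothetical fitted parameters (a, b, c, d) for each component.
    params = {"ice": (8.0e3, 1e-4, 1.0, 5.0), "lnd": (2.5e3, 1e-4, 1.0, 2.0),
              "atm": (3.2e4, 1e-4, 1.0, 20.0), "ocn": (9.0e3, 1e-4, 1.0, 10.0)}
    perf = {j: make_perf(*p) for j, p in params.items()}
    alloc = {"ice": 80, "lnd": 24, "atm": 104, "ocn": 24}
    if layout1_feasible(alloc, N=128, t_sync=50.0, perf=perf):
        print("T =", round(layout1_time(alloc, perf), 1), "seconds")
```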

Table II
Performance function and variable definitions, as well as the objective formulation with constraints

Function:
  1   Tj(nj) = Tj^sca(nj) + Tj^nln(nj) + Tj^ser = aj/nj + bj nj^cj + dj

Variables:
  2   Tj(nj) ∈ R+ - performance function modeling the time taken to run component j on nj nodes
  3   Tj^sca(nj) ∈ R+ - scalable contribution of the function Tj(nj)
  4   Tj^ser ∈ R+ - "serial" contribution of the function Tj(nj)
  5   Tj^nln(nj) ∈ R+ - contribution of the function Tj(nj) other than Tj^sca(nj) and Tj^ser
  6   Dj ∈ Z+ - total number of data points available for fitting the function for component j
  7   aj, bj, cj, dj ∈ R+ - fitting parameters associated with the performance function Tj(nj)
  8   nji - number of nodes allocated to component j in benchmark run i, i = 1, ..., Dj
  9   yji ∈ R+ - observed time taken to run component j using nji nodes, i = 1, ..., Dj

Minimize:
  10  min_{aj, bj, cj, dj} Σ_{i=1}^{Dj} (yji - aj/nji - bj nji^cj - dj)^2

Subject to:
  11  aj, bj, cj, dj ≥ 0

B. Performance Model

Choosing an appropriate performance model is a crucial step in designing a successful SLB algorithm. A performance model is usually defined as a mathematical model capable of predicting the execution time of a parallel program as a function of the problem size and the number of processors employed. Over the years, many performance models have been developed [4], [8], [9]. Many parallel performance models begin by identifying the sequential and parallel contributions to the execution time in accordance with Amdahl's law. The idea is to characterize these contributions in terms of the key parameters of the performance model. Performance models are often broadly defined and can be applied to any parallel program. We use the performance model from the previously published paper [4] because it describes well the scalability of all CESM components except sea ice. A machine-learning-based model was used for the sea ice model; it is described in another paper [10]. In this work, we use the nonlinear model outlined in Table II, line 1, with the variables defined in Table II, lines 2-9. We model Tj(nj), the wall-clock time to compute the j-th component, as a function of the number of nodes nj. The three contributions to Tj(nj) are described next. Tj^sca(nj) represents the contribution to the wall-clock time with perfect (or linear) scalability. It is a monotonically decreasing function that asymptotically approaches zero.

The contribution Tj^ser, on the other hand, represents the time spent in the non-parallelized part of the application. It is independent of the number of nodes and includes any purely serial part of the code. From the mathematical point of view, it is a constant that defines the minimum value of Tj(nj). As nj increases, it is expected that Tj^ser will dominate. Tj^nln(nj) represents the contribution to the wall-clock time that is not described by either Tj^sca(nj) or Tj^ser. It represents the time spent in code that is only partially parallelized or that depends on nj in a more complicated way; for example, it may include time spent in initialization, communication, or synchronization. The important implication is that, in general, Tj^nln(nj) could be an increasing or a decreasing function. On Intrepid, however, this term was increasing; hence, in our model Tj^nln(nj) is an increasing function, with the fitted parameters bj and cj almost equal to zero. The functional form of Tj(nj) makes sense both mathematically and from the viewpoint of Amdahl's law. For small nj, Tj(nj) is dominated by Tj^sca(nj). As the number of nodes increases, the roles are reversed and Tj^ser becomes the major contributor to the total time (if the Tj^nln(nj) contribution is negligible). Real examples of Tj^sca(nj), Tj^nln(nj), and Tj^ser fitted to CESM components are illustrated in Figure 2 (top right corner).
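The short sketch below evaluates the three contributions of the performance function in Table II, line 1, for a hypothetical set of parameters, showing how the scalable term dominates at small node counts and the serial term at large ones. The parameter values are illustrative only, not fitted CESM coefficients.

```python
# Sketch: the three contributions to the performance model of Table II, line 1,
#   T(n) = T_sca(n) + T_nln(n) + T_ser = a/n + b*n**c + d,
# for hypothetical parameters. The scalable term a/n dominates at small n;
# the serial constant d takes over at large n, in line with Amdahl's law.

def perf_terms(n, a, b, c, d):
    t_sca = a / n       # perfectly scalable part, approaches zero as n grows
    t_nln = b * n ** c  # partially parallel part (e.g., communication)
    t_ser = d           # purely serial part, independent of n
    return t_sca, t_nln, t_ser

if __name__ == "__main__":
    a, b, c, d = 4.0e4, 1.0e-3, 1.0, 25.0  # illustrative values, not CESM fits
    for n in (16, 64, 256, 1024, 4096):
        t_sca, t_nln, t_ser = perf_terms(n, a, b, c, d)
        total = t_sca + t_nln + t_ser
        print(f"n={n:5d}  T={total:8.2f}  sca={t_sca:8.2f}  "
              f"nln={t_nln:5.2f}  ser={t_ser:5.2f}")
```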


C. Fitting Data

We estimate the parameters aj, bj, cj, dj used in the performance function (Table II, line 1) by fitting the wall-clock times of each component from multiple 5-day model runs at different node counts. An important point is that the wall-clock times used for fitting do not include inter-component communication times (these are associated with the coupler), but they do include communication time inside each component. Also, any workload imbalance inside a component is included in the timers, which will be evident later in the analysis of the ice component scaling curves. Thus, the HSLB-reported time for the whole run may differ slightly from the one found in the CESM output files, although usually the difference between the two is small. Finally, nodes were used to represent the physical computing unit in our algorithm. On Intrepid, there are 4 cores per node, and CESM is run with 1 MPI task and 4 threads per task on each node. Other choices could have been cores or CPUs, or even software representations such as threads or MPI tasks; the best unit to use for fitting is a matter of debate, but regardless of the choice, all described variants are possible with a few changes to the AMPL input script. For each component, we obtain the best fit by solving the least-squares problem shown in Table II, line 10. This objective function is, in general, not convex, and there may be several locally optimal solutions. Since nonlinear optimization algorithms are iterative, selecting a different starting point may lead the solver to a different local solution. We experimented with different starting solutions and observed that, even though the parameter values may differ, the solution value of the problem did not vary significantly. More important was the observation that differences in the parameter values among locally optimal solutions led to node allocations of similar quality. We have constrained the variables in our fitting problem (Table II, line 10) to be positive, even though doing so is not necessarily the best choice mathematically. It makes sense for the parameters aj, bj, dj to be positive because they represent values of time (Table II, line 11). It is less obvious what the constraint on cj should be: in general, the corresponding term can be increasing or decreasing, but CESM is a highly scalable code, and we did not observe increasing wall-clock times with increasing node counts in any of our runs. Thus, we chose a positive value of cj. Examples of fitted aj, bj, cj, dj are shown in Figure 2 (top right corner). There are two important questions related to the generation of the benchmarking data used for fitting: which node counts to choose and how many trial data points are needed. We propose that CESM be run on the minimal number of nodes allowed by memory requirements and on the greatest number of nodes possible.

Figure 2. Scaling curves for each component in layout (1) for 1° resolution.

In addition, a few simulations should be done in between to capture the curvature of the scaling. These choices give a reasonable starting point for the fitting method and guarantee that performance-function predictions are interpolated rather than extrapolated, which is important for accuracy. In our experience, capturing the scaling of a component requires benchmarking runs at no fewer than four different node counts per component. There is an obvious trade-off between the time taken to obtain benchmarking data and the quality of the model. Each code is different, so it is hard to generalize, but for CESM, four points were enough to build well-fitted scaling curves. The number of points should increase with the level of noise in the application, with the number of parameters to be estimated, or if the fit is poor. The quality of the fit was judged by computing R2; in our tests, R2 was very close to 1 for each component.
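A minimal sketch of the fitting step is shown below. It assumes benchmark node counts spanning a memory-limited minimum and the largest available count, and it fits the four parameters of Table II, line 10, with the non-negativity bounds of line 11. The sketch uses Python and scipy on synthetic timings as stand-ins for real 5-day benchmark data; the paper itself performs this fit through an AMPL script and a nonlinear solver.

```python
# Sketch of the fitting step (Table II, lines 10-11): a bounded least-squares
# fit of T(n) = a/n + b*n**c + d to benchmark timings, here generated
# synthetically as stand-ins for 5-day CESM benchmark runs.
import numpy as np
from scipy.optimize import curve_fit

def perf_model(n, a, b, c, d):
    return a / n + b * n ** c + d

# Benchmark node counts: the memory-limited minimum, the largest available
# count, and a few points in between to capture the curvature of the scaling.
nodes = np.array([16.0, 64.0, 256.0, 1024.0, 4096.0])
true_params = (4.0e4, 1.0e-3, 1.0, 25.0)          # hypothetical "true" values
rng = np.random.default_rng(0)
times = perf_model(nodes, *true_params) * rng.normal(1.0, 0.02, nodes.size)

# bounds=(0, inf) enforces a, b, c, d >= 0 (Table II, line 11).
p0 = (times[0] * nodes[0], 1e-6, 1.0, times[-1])
popt, _ = curve_fit(perf_model, nodes, times, p0=p0, bounds=(0.0, np.inf))

pred = perf_model(nodes, *popt)
r2 = 1.0 - np.sum((times - pred) ** 2) / np.sum((times - times.mean()) ** 2)
print("fitted a, b, c, d:", np.round(popt, 4), " R^2 =", round(r2, 4))
```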

D. Formulating the Optimization Problem

Once we have identified an appropriate performance model and obtained values for all parameters, we formulate an optimization problem to find the optimal allocation of nodes to each component, given a predetermined total number of nodes for the overall system. The decision variables that we seek to optimize are the numbers of nodes nj to be allocated to each component j ∈ {ice, lnd, atm, ocn}, with the goal of minimizing the total wall-clock time (see the definitions of the variables in Table I). Other objectives are also possible, including:

1. the min-max function, which minimizes the maximum time spent in any component:

    min_{n} max_{j} Tj(nj)    (1)

2. the max-min function, which maximizes the minimum time spent in any component:

    max_{n} min_{j} Tj(nj)    (2)

3. the min function, which minimizes the sum of the times spent in each component:

    min_{n} Σ_{j=1}^{4} Tj(nj)    (3)

The third choice is out of consideration because CESM requires more complicated relationships between components than just a sum of the times spent in each component. Previous work [4] also showed that the third function performs much worse than functions (1) and (2). The min-max function performed slightly better than the max-min function in [4] and was the objective used in this work.

E. Algorithms and Software for Solving the MINLP Model

The problem outlined in Table I is a special case of an MINLP problem; MINLPs are NP-hard in general. Certain simple MINLPs, such as single-constraint resource-constrained MINLPs with non-increasing objectives, can be solved in polynomial time with customized solvers [11]. Here, we consider only general MINLP methods, which are typically based on branch-and-bound [12]. Branch-and-bound methods are guaranteed to provide an optimal solution or to show that none exists. The runtime of these algorithms depends on the number of variables and constraints, as well as on the type of functions used in the objective and constraints. For example, convex functions ensure that local solutions of the continuous relaxation are also global solutions, and some methods exploit this fact and other properties of convex functions [12], [13], [14]. On the other hand, if any problem function is not convex, then the continuous relaxation does not provide a bound, and we need to further relax the continuous problem by introducing convex underestimators that are modeled with additional variables and modified constraints.

Our MINLP optimization problem is written in AMPL, a modeling language that allows users to write optimization models using simple mathematical notation. AMPL also provides derivatives of nonlinear functions automatically, and it can be used with several different solvers. To solve the MINLP problem, we used our open-source solver toolkit MINOTAUR [5]. MINOTAUR implements different branch-and-bound solvers and includes advanced routines to reformulate MINLPs. MINOTAUR provides libraries that can be called from other C++ or FORTRAN codes and hence can be used directly without requiring AMPL. For solving our problem, we use the LP/NLP-based branch-and-bound solver [13] implemented in MINOTAUR. The positivity of the coefficients aj, bj, dj implies that the nonlinear functions are convex, which ensures that MINOTAUR finds a global solution of the MINLP.

The LP/NLP algorithm is initialized by first creating a mixed-integer linear programming (MILP) relaxation of the MINLP. Consider a nonlinear constraint of the form f(x) ≤ 0, where f is a continuously differentiable convex function. The constraint can be relaxed by linearizing the function around any point xk, namely

    ∇f(xk)^T (x - xk) + f(xk) ≤ 0    (4)

The MILP relaxation is closer to the MINLP if we linearize the functions about more points. However, additional linearization points increase the number of constraints in the MILP and can slow down the solver. To avoid this problem, linearization constraints derived from only a single point are added initially; this initial point is the solution of the continuous nonlinear programming (NLP) relaxation. We later add linearization constraints only for those nonlinear constraints that are violated significantly by the MILP solutions.

The LP/NLP-based algorithm starts by solving an initial linear programming (LP) relaxation and initializes the value of the incumbent solution of the MINLP to infinity. The algorithm then builds a search tree to solve increasingly tighter MILP relaxations. At every step of the algorithm, we remove an LP sub-problem from the list and solve it. If the solution value is greater than the incumbent, we discard this sub-problem because it cannot contain any solution better than the incumbent. If the LP solution x̂ takes a fractional value on an integer variable, then we branch on that variable to create two new sub-problems, which are added to the list of unsolved sub-problems. If x̂ satisfies all integer constraints, we check whether it also satisfies the nonlinear constraints. If it is feasible with respect to all nonlinear constraints, then we have a new incumbent solution. Otherwise, we linearize one or more violated constraints around x̂, as in (4), and continue. The algorithm terminates when the list is empty. MINOTAUR solves the LPs using the open-source solver CLP, and the NLP problems are solved with filterSQP. In the worst case, the algorithm may require the solution of an exponential (in the number of integer variables) number of LP and NLP problems. In practice, however, the method takes much less time; for example, the MINLP for 40,960 nodes took less than 60 seconds to solve on one core. In order to handle the large number of discrete choices for the atmospheric partition, we implemented these choices as a special-ordered set and forced the MINLP solver to branch on the special-ordered set rather than on individual binary variables, which improved the runtime of the MINLP solver by two orders of magnitude and ensured that our approach remains practical in this setting.
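For intuition about the structure of the allocation problem, the brute-force sketch below enumerates small candidate sets for the ocean and atmosphere allocations (standing in for the special sets of Table I) together with splits of the atmosphere nodes between ice and land, and evaluates the layout-(1) min-max objective directly. It is only a check for small instances under assumed performance functions; the approach used in the paper is the LP/NLP-based branch-and-bound solver in MINOTAUR described above.

```python
# Brute-force sketch of the layout-(1) allocation problem: enumerate candidate
# ocn and atm allocations (standing in for the special sets of Table I) and
# splits of the atm nodes between ice and lnd, and minimize
#   T = max(max(T_ice, T_lnd) + T_atm, T_ocn).
# This is an illustrative check only, not the LP/NLP branch-and-bound solver
# (MINOTAUR) used in the paper.

def perf(a, b, c, d):
    # Hypothetical fitted performance function, T(n) = a/n + b*n**c + d.
    return lambda n: a / n + b * n ** c + d

T = {"ice": perf(8.0e3, 1e-4, 1.0, 5.0), "lnd": perf(2.5e3, 1e-4, 1.0, 2.0),
     "atm": perf(3.2e4, 1e-4, 1.0, 20.0), "ocn": perf(9.0e3, 1e-4, 1.0, 10.0)}

def solve_layout1(N, ocn_set, atm_set, t_sync=1.0e9):
    # t_sync defaults to a huge value, effectively disabling the lnd/ice
    # synchronization tolerance (Table I, lines 18-19).
    best_time, best_alloc = float("inf"), None
    for n_o in ocn_set:
        for n_a in atm_set:
            if n_a + n_o > N:                      # Table I, line 20
                continue
            for n_i in range(1, n_a):              # ice and lnd share atm nodes
                n_l = n_a - n_i                    # line 21, taken at equality
                if abs(T["lnd"](n_l) - T["ice"](n_i)) > t_sync:
                    continue
                t = max(max(T["ice"](n_i), T["lnd"](n_l)) + T["atm"](n_a),
                        T["ocn"](n_o))
                if t < best_time:
                    best_time, best_alloc = t, {"ice": n_i, "lnd": n_l,
                                                "atm": n_a, "ocn": n_o}
    return best_time, best_alloc

if __name__ == "__main__":
    # Small illustrative candidate sets; the real sets appear in Table I.
    t, alloc = solve_layout1(N=128, ocn_set=(16, 24, 32), atm_set=(64, 96, 104))
    print(f"best T = {t:.1f} s with allocation {alloc}")
```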

F. Summary of HSLB Algorithm

Before presenting the results of our experiments, we summarize the four steps of the HSLB algorithm and discuss ways of further improving it.


1) Gather data: Perform a CESM simulation for the intended layout D times using varied numbers of nodes. Collect the run times yji for each component j.
2) Fit: Solve four separate least-squares problems (one per component to fit), as outlined in Table II, to determine the coefficients aj, bj, cj, and dj for each component j.
3) Solve: For a given total number of nodes, determine the best node allocation to each component by solving the MINLP outlined in Table I, obtaining the optimal node count nj for each component j.
4) Execute: Run a CESM simulation using the optimal node allocation for each component determined in step 3).

This algorithm, being of a general nature, can be improved in several ways for a given application. The data-gathering step 1) can be avoided altogether if reliable benchmarks are already available, for example, from previous experiments. Steps 2) and 3) could be performed by calling an MINLP solver directly from the application iteratively if CESM supported dynamic load balancing.

IV. RESULTS AND DISCUSSIONS

The weakest part of the HSLB algorithm, in our opinion, is obtaining the actual performance data for fitting. In subsection III-C, we provided some common-sense recommendations for choosing the node counts to sample. Inevitably there will be special circumstances in which this approach will fail. However, up to now, this approach, applied to both the quantum chemistry code GAMESS and CESM [6] over the last two years, has consistently produced successful optimizations. Nevertheless, validation is an important step, to which we paid extra attention when we tested HSLB for 1° and 1/8° resolution simulations. The goal of HSLB is to eliminate the human effort and cost in finding the optimal allocation of nodes to each component. The manual process has a first step similar to that of HSLB, namely generating scaling curves for each component. Thereafter, the manual tuning and load-balance testing is done by hand, sequentially, until a reasonable layout is obtained. This can take five to ten iterations, each of which involves building the model, submitting to a queue, and waiting. We compare the HSLB results with manual optimization to determine whether HSLB is doing an adequate job. We tested HSLB in two configurations. The moderate 1° resolution was optimized previously, and timing data existed for comparison. The high-resolution 1/8° configuration is a relatively new large-scale setup without previous knowledge of component scaling on Intrepid. The results of both configurations are discussed in the following sections.

A. 1° Resolution Scaling Results

For our first tests, we chose the 1° resolution setup. Each simulation takes a relatively small amount of time, and there was plenty of data to derive an optimal allocation manually. The results of the “human” and HSLB optimizations are shown in Table III. Column two lists the node allocations for each component made by an expert, and column three the corresponding times to run the components; the total time to execute the full CESM run is shown in the row labeled “Total time”. The HSLB node allocations for each component are shown in column four. The AMPL script prints the predicted time to run each component for comparison with the actual time; these are shown in columns five and six, respectively. We ran 1° resolution simulations targeting 128, 256, 512, 1024, and 2048 nodes. The results in Table III are shown only for the smallest and largest target node counts because they are usually the hardest to balance with HSLB. First of all, the manual, HSLB-predicted, and HSLB-actual total times are very close to one another, even when the node allocations to components are substantially different (see the node allocations for 2048 nodes in Table III). So our initial conclusion is that HSLB works. In Table III, the agreement of the timings for the ice component is slightly worse than for the other components. The ice component supports seven decomposition strategies with varying block sizes to allow flexibility in achieving optimal performance. The optimal decomposition for a given number of nodes is not known a priori. In our tests, we used the default decompositions for CICE, which resulted in the tests using varying decomposition types and block sizes. This increased the noise in the sea ice performance curve fit and impacted the timing estimates. As a result, a separate effort was begun to determine the optimal sea ice decompositions using machine learning [10].

B. 1/8° Resolution Scaling Results

As we mentioned before, the 1/8° resolution is a more interesting test of HSLB; it is the highest resolution currently supported in CESM. A reasonable number of attempts to find the optimal node allocation by “manual” optimization were carried out, and the timings from those runs were then used to find the optimal allocation with HSLB. The results of the “manual” and HSLB optimizations are shown in Table III; the format of the table is explained in subsection IV-A. At the higher resolution, the ocean model was initially limited to a handful of node counts, including 480, 512, 2356, 3136, 4564, 6124, and 19460, as a result of prior testing. This limited the ocean node optimization. However, even within that severe constraint, the HSLB predicted and actual times were reasonable and improved by as much as 10% compared with the manual approach, as shown for both 8192 and 32768 nodes in Table III. That ocean node constraint was somewhat arbitrary, so to extend the optimization, HSLB was run without the ocean node constraint at both 8192 and 32768 nodes. Those results are shown in the last two entries of Table III. At 8192 nodes, the optimization is relatively unchanged. However, at 32768 nodes, HSLB predicted an optimal ocean node count of 9812 and a run time of 1129 seconds, about 40% faster than the constrained predicted time of 1593 seconds. In practice, we then tested the node allocation shown in the fourth column of the table, which resulted in an actual time of 1256 seconds, an improvement of about 25% compared with the actual constrained load-balance time of 1612 seconds. That tuned node allocation in the last entry of Table III was chosen based on the HSLB-predicted nodes, with node counts adjusted toward known component sweet spots. The 1/8° manual, predicted, and actual timings are summarized in Figure 3. An important result is that the component models' processor counts should not be arbitrarily limited: by allowing users to pick relatively arbitrary processor counts when possible, better performance can be achieved. Another important point is that for the 1/8° resolution on Intrepid, no prior timing or scaling data was available. Based on the predicted and actual timing numbers for the ocean model at 9812 and 11880 nodes, the ocean scaling curve was not captured well during our fit step. This suggests that ocean timing data at additional node counts would improve the fit and likely the optimization.


Table III. Detailed timings for each component in layout (1) for 1° and 1/8° resolutions. "Manual" means the node allocation was chosen by human optimization, while "HSLB" means HSLB optimization was used to find the node allocations. "Predicted time" is the time estimated by HSLB, and "actual time" is the measured model time. The first two entries are for the 1° optimization at 128 and 2048 nodes; the next two entries are for the 1/8° optimization at 8192 and 32768 nodes; the final two entries are for an optimization without ocean node constraints at 1/8° at 8192 and 32768 nodes.

1° resolution, 128 nodes
                          Manual                           HSLB
Component        # nodes   Time, sec   Predicted # nodes   Predicted time, sec   Actual time, sec
lnd                   24      63.766                  15               100.951            100.202
ice                   80     109.054                  89               102.972            116.472
atm                  104     306.952                 104               307.651            308.699
ocn                   24     362.669                  24               365.649            365.853
Total time, sec              416.006                                   410.623            425.171

1° resolution, 2048 nodes
                          Manual                           HSLB
Component        # nodes   Time, sec   Predicted # nodes   Predicted time, sec   Actual time, sec
lnd                  384       5.777                  71                22.693             23.158
ice                 1280      17.912                1454                22.822             18.242
atm                 1664      61.987                1525                61.662             63.313
ocn                  384      61.987                 256                78.532             79.139
Total time, sec               79.899                                    84.484             86.471

1/8° resolution, 8192 nodes
                          Manual                           HSLB
Component        # nodes   Time, sec   Predicted # nodes   Predicted time, sec   Actual time, sec
lnd                  486     147.397                 138               487.853            457.052
ice                 5350     475.614                4918               511.596            499.691
atm                 5836     2533.76                5056              2878.798           2989.115
ocn                 2356    3785.333                3136              2919.052           2898.102
Total time, sec             3785.333                                  3390.394           3488.806

1/8° resolution, 32768 nodes
                          Manual                           HSLB
Component        # nodes   Time, sec   Predicted # nodes   Predicted time, sec   Actual time, sec
lnd                 2220      44.225                 302               232.158            223.284
ice                24424     214.203               13006               290.088            311.195
atm                26644     787.478               13308              1302.562           1301.136
ocn                 6124    1645.009               19460               712.525            700.373
Total time, sec             1645.009                                  1592.649           1612.331

1/8° resolution, 8192 nodes, unconstrained ocean nodes (HSLB)
Component        Predicted # nodes   Predicted time, sec   Actual # nodes   Actual time, sec
lnd                            137               487.853              146            417.162
ice                           5238               489.904             5287            475.249
atm                           5375              2727.934             5433           2702.651
ocn                           2817              3216.924             2759           3496.331
Total time, sec                                 3217.837                            3496.331

1/8° resolution, 32768 nodes, unconstrained ocean nodes (HSLB)
Component        Predicted # nodes   Predicted time, sec   Actual # nodes   Actual time, sec
lnd                            299               232.158              272             238.46
ice                          22657               232.735            20616            231.631
atm                          22956                896.67            20888            956.558
ocn                           9812              1129.335            11880           1255.593
Total time, sec                                 1129.405                            1255.593


Figure 3. 1/8° resolution scaling curves for layout (1). “Human guess” means the optimal allocation was guessed by an expert, “HSLB prediction” represents the timings predicted by the HSLB algorithm, and “HSLB actual” represents the actual timings for the HSLB-predicted allocation.


Figure 4. Scaling curves for layouts 1-3 (see Figure 1) for 1° resolution. All data are predicted except for layout (1), for which the experimental data are shown as layout (1exp). The R2 between the predicted and experimental data for layout (1) is equal to 1.0.

C. Prediction of Optimal Layout and Number of Nodes for a Job

It is possible to adapt the developed mathematical approach for other purposes. For example, HSLB can estimate the effect of constraints or “sweet” spots on the scaling and efficiency of CESM; which component layout is more or less scalable; how replacing one component with another will affect scaling; how decoupling components will affect scaling; or the optimal number of nodes on which to run CESM. Later, as the mathematical model becomes more sophisticated, it might even be possible to make more exotic and less reliable predictions, such as predicting CESM scaling on new hardware (e.g., exascale supercomputers) or predicting which parts of the model need to be rewritten to improve performance. Some of these HSLB applications are discussed below.

Since we built mathematical models for three component layouts (see Figure 1) but ran simulations only for the most common layout (Figure 1, panel (1)), we decided to predict the scaling of the other layouts at the 1° resolution based on the scaling curves shown in Figure 2. The results, shown in Figure 4, are largely as expected: layouts (1) and (2) perform similarly, while layout (3) performs the worst. This is consistent with prior results in CESM. Another important HSLB application may be the prediction of the optimal number of nodes on which to run a job. The definition of optimal depends on the goal; it could be a cost-efficiency goal, where nodes are increased until the scaling drops to a predefined limit, or it could be the shortest time to solution.

V. CONCLUSIONS

We have shown that the HSLB algorithm we developed is a viable alternative to the currently used “manual” optimization approach for load balancing CESM. By using an MINLP optimization technique, HSLB predicts an optimal allocation of nodes to the components, given accurate benchmarking data and a correct mathematical model. It is our intention to develop a “black box” from HSLB that would allow anyone, especially scientists without experience in “manual” optimization, to run CESM efficiently on supercomputers or clusters. We implemented HSLB as a part of the automated pipeline in the latest version of CESM. The AMPL code in HSLB is executed remotely, via a Python script, on the NEOS server hosted by ANL. This work, although initially targeted at the CESM community, can benefit other climate modeling codes as well.

By using HSLB, we improved the speed of CESM on 32,768 nodes for 1/8° resolution simulations by 25% compared with a baseline guess on Intrepid. HSLB predicted an optimal node allocation, while decomposition data for the ocean model was used to test some new ocean node counts. This is a good example of how HSLB predicted that using new node counts beyond those hard-coded in the model could significantly improve code performance. Other possible HSLB applications to CESM are outlined in subsection IV-C. For example, HSLB could be used to find the optimal, cost-effective number of nodes on which to run CESM for a particular task. The developed HSLB approach is just a first step toward building more sophisticated models in which we plan to load balance the work inside components. As a first step, we developed a machine-learning-based algorithm for finding optimal decompositions of the sea ice component, presented in a separate paper [10]. We are also considering adding some land block elimination processor counts to help automate the determination of “sweet” spots for the ocean component.

The presented HSLB algorithm is not limited to FMO, CESM, or other climate modeling codes. In fact, any coarse-grained application with large tasks of diverse size can benefit from the present approach. As the number of cores in modern supercomputers increases, the need to minimize synchronization time while retaining high efficiency will put load-balancing schemes to a highly stressful test. We believe that, for coarse-grained applications, our HSLB algorithm is a promising and general approach.

VI. ACKNOWLEDGEMENTS

We thank Dr. Raymond Loy and the ALCF team members for discussions and help related to the paper. We especially thank Jim Edwards and Mariana Vertenstein from NCAR for encouraging this work and for helpful discussions. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (Argonne), under Contracts No. DE-AC02-06CH11357 and DE-FG02-05ER25694 with the U.S. Department of Energy. The U.S. Government retains for itself, and others acting on its behalf, a paid-up, nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. NCAR is sponsored by the National Science Foundation. An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. This research used resources of the Argonne Leadership Computing Facility (ALCF) at Argonne.

REFERENCES

[1] C. Xu and F. Lau, Load Balancing in Parallel Computers: Theory and Practice. Norwell, MA: Kluwer Academic Publishers, 1997.
[2] K. Devine, E. Boman, R. Heaphy, B. Hendrickson, J. Teresco, J. Faik, J. Flaherty, and L. Gervasio, "New challenges in dynamic load balancing," Applied Numerical Mathematics, vol. 52, no. 2, pp. 133-152, 2005.
[3] M. Willebeek-LeMair and A. Reeves, "Strategies for dynamic load balancing on highly parallel computers," IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 9, pp. 979-993, 1993.
[4] Y. Alexeev, A. Mahajan, S. Leyffer, G. Fletcher, and D. G. Fedorov, "Heuristic static load-balancing algorithm applied to the fragment molecular orbital method," in High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for. IEEE, 2012, pp. 1-13.
[5] A. Mahajan, S. Leyffer, J. Linderoth, J. Luedtke, and T. Munson, "MINOTAUR: A toolkit for MINLP," http://wiki.mcs.anl.gov/minotaur/index.php/Main_Page.
[6] "The Community Earth System Model," http://www.cesm.ucar.edu/.
[7] Argonne Leadership Computing Facility, 2013, http://www.alcf.anl.gov/.
[8] Y. Alexeev, R. Kendall, and M. Gordon, "The distributed data SCF," Computer Physics Communications, vol. 143, no. 1, pp. 69-82, 2002.
[9] Y. Alexeev, M. Schmidt, T. Windus, and M. Gordon, "A parallel distributed data CPHF algorithm for analytic Hessians," Journal of Computational Chemistry, vol. 28, no. 10, pp. 1685-1694, 2007.
[10] P. Balaprakash, Y. Alexeev, S. Mickelson, S. Leyffer, R. Jacob, and A. P. Craig, "Machine learning based load-balancing for the CESM climate modeling package," 2013, submitted to VECPAR 2014.

[11] T. Ibaraki and N. Katoh, Resource Allocation Problems: Algorithmic Approaches. Cambridge, MA: The MIT Press, 1988.
[12] R. Dakin, "A tree-search algorithm for mixed integer programming problems," The Computer Journal, vol. 8, no. 3, pp. 250-255, 1965.
[13] R. Fletcher and S. Leyffer, "Solving mixed integer nonlinear programs by outer approximation," Mathematical Programming, vol. 66, no. 1, pp. 327-349, 1994.
[14] A. Mahajan, S. Leyffer, and C. Kirches, "Solving mixed-integer nonlinear programs by QP-diving," Argonne National Laboratory, Tech. Rep. ANL/MCS-P2071-0312, 2012.
