A Cost/Benefit Estimating Service for Mapping Parallel Applications on Heterogeneous Clusters

Dimitrios Katramatos
Department of Computer Science
University of Virginia
Charlottesville, Virginia 22903
[email protected]

Steve J. Chapin
Department of Electrical Engineering and Computer Science
Syracuse University
Syracuse, NY 13244
[email protected]

Abstract

Matching the resource requirements of a parallel application to the available resources of a large, heterogeneous cluster is a key requirement in effectively scheduling the application tasks on the nodes of the cluster. This paper describes the Cost/Benefit Estimating Service (CBES), a runtime scheduling system targeted at finding highly effective schedules (or mappings) of tasks on nodes. CBES relies on its own infrastructure to gather and maintain static and dynamic information profiles for the computing system and the applications of interest. At the core of CBES is a mapping evaluation module which evaluates candidate application mappings on the basis of shortest execution times. By default, CBES uses a simulated-annealing based scheduler to select mappings. The paper presents the design, initial implementation, and test results of CBES on the Centurion cluster at the University of Virginia and the Orange Grove cluster at Syracuse University. These tests demonstrated that exploiting internode communication speed differences due to network heterogeneity can yield speedups of over 10% even among nodes of the same architecture. The maximum observed speedup across architectures for the best vs. worst mapping scenarios of the same application was over 36%, while the corresponding average case speedup was approximately 30%.

1 Introduction

Large, heterogeneous clusters of Common Off-The-Shelf (COTS) components incorporate up to several thousands of nodes interconnected with fast network fabrics and offer a prime execution environment for demanding parallel applications, e.g. scientific simulations. Such systems, also known as federated clusters (figure 1), constitute a significant special case of grid computing.

Figure 1. Schematic of a federated cluster.

The performance of an application on a cluster depends strongly on the efficiency of the mapping (or scheduling) of that application's tasks on the nodes of the cluster. An efficient mapping must not only quantitatively satisfy the resource requirements, but also achieve the best feasible match between the computation and communication patterns of the application and the resource characteristics and availability of the cluster, so as to minimize application execution time. In the general case, the resources of a cluster are shared among multiple applications and thus present variations in availability. Also, an application's resource requirements and utilization patterns may change over time, in a known or, in the case of irregular applications, an unknown manner. Consequently, finding an efficient mapping is a complex problem. The complexity increases further if the initial mapping has to change during the course of computation, to accommodate changes in the application and/or system behavior and prevent adverse effects on performance.

Runtime systems used on federated clusters typically support automated mapping of application tasks. However,

• parallel runtime systems like PVM [1] and MPI [2], although capable of placing tasks automatically, do so in a naive way, i.e. they select nodes round-robin from the same node list they use for system booting, regardless of resource availability;

• workload management systems like Condor [3], PBS [4], and LSF [5] incorporate scheduling mechanisms to automatically map an application on resources allocated according to policies that maximize computing system throughput rather than application performance. In the case of PBS, users can select among several readily available schedulers, or even specify their own;

• grid computing environments like Legion [6] and Globus [7] provide support for large-scale internet-wide computing and encompass services to locate and reserve resources; authenticate users; create, monitor, and control remote processes; etc. These environments have the flexibility to utilize a variety of built-in, user-supplied, or even application-specific schedulers to allocate resources to applications.

In all three cases it is feasible to pursue maximization of application performance by augmenting the runtime system with a suitable scheduler.

This paper presents the design, prototype implementation, and experimental testing of the Cost/Benefit Estimating Service (CBES). CBES [8] [9] is a run-time scheduling system based on a dynamic application mapping evaluation operation. For any given application mapping, this operation predicts the time the application would take to execute when using that mapping. The prediction takes into account static and dynamic information from both the computing system and the application. Two slightly different CBES prototypes were implemented. The first prototype was implemented on the largely homogeneous 256-node Centurion cluster at the University of Virginia [10]. The second prototype was implemented on a 28-node, highly heterogeneous cluster created by rewiring nodes of the original Orange Grove cluster at Syracuse University. The Virginia prototype employs a modified version of NWS [11] as its system profiling and monitoring infrastructure, while the Syracuse prototype utilizes the simpler CBES infrastructure for the same purpose. Several experiments were conducted with the two prototypes to examine the validity and efficacy of the CBES design.

Section 2 presents an overview of the system design, while section 3 details the mapping evaluation operation. Sections 4, 5, and 6 present the CBES prototypes and the experimental work performed. Section 7 discusses related work, and finally section 8 gives concluding remarks and future work directions.

2 Design Overview

Our goal is to provide the means of scheduling a parallel application on demand, for the maximum benefit of that application. Defining the execution time of an application as its computation cost, scheduling on demand for maximum benefit means that the application task placement (or mapping) should be generated so as to minimize the application execution time (cost), given the system conditions at the time of the placement. If the system conditions relevant to a running application change, there should be the capability of generating a new mapping for that application that may yield an even shorter execution time (lower cost) for the remainder of the execution, taking into account the task remapping costs. Furthermore, the same remapping capability should be available if the application itself changes behavior during the course of computation so as to render the initial mapping inefficient. This is an application-centric scheduling scheme; however, it only utilizes resources made available to an application, a user, a domain, etc., according to the administrative policies in effect.

We augment the runtime system of a clustered system with CBES, an auxiliary scheduling service. At the core of this service is a mapping evaluation operation used for comparing mappings and selecting those that are most beneficial for an application given the current system resource conditions. The CBES infrastructure consists of a set of databases, profiling tools, and monitoring daemons. These components are divided into two categories: system-dedicated and application-dedicated. Prior to any invocation of the service, the system-dedicated infrastructure needs to be initialized. This is an off-line phase, necessary for calibrating the service for use with a specific computing system. The computing system must remain free of computational and communication load for the duration of the calibration. Although this can be a lengthy and expensive phase, it takes place only once.

The system monitoring daemons maintain a current picture of the availability of system resources. For a cluster of N nodes this can be a time-consuming problem, potentially O(N²). Although CPU availability is relatively easy to monitor, interconnecting network bandwidth varies not only with network topology, but also with message size and even with respect to the load of the communicating nodes. The CBES infrastructure uses a method that approximates a view of a cluster's resource availability in O(N) time even when the cluster nodes are under computational and/or

communication load. A key element of this method is the use of a network end-to-end latency model, generated during the calibration phase, to estimate on demand the internode latencies of the cluster by accounting for the effect of node CPU and NIC load on the no-load end-to-end latency values [12].

To use CBES, an application must also undergo profiling. An application profile is the result of analyzing an execution trace of the application and contains cumulative information on all major computation and communication events of each application process. In the case of a heterogeneous cluster, the profile also records the computation speeds the application code can achieve on the different architectures of the cluster. In essence, an application profile is a summary of an application's behavior.

The basic design of CBES consists of a core module utilizing two independent, autonomous subsystems (figure 2). Several database tables provide system and application information. The first subsystem is responsible for the computing-system side of the equation. During the initial system setup this subsystem provides the means for creating a profile of the computing system. During regular operation it continuously monitors the resource status of the cluster nodes, since load information plays a critical role in mapping evaluations. The second subsystem provides the means for generating computation and communication profiles for applications, and for supporting internal (triggered by the application) and external (triggered by system conditions) application remapping events. The core CBES module accepts mapping comparison requests from external clients (such as a scheduler). In response, the module obtains current information from the two subsystems and generates an execution time prediction corresponding to each given mapping.

Figure 2. The basic design of CBES.
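How the no-load latencies are adjusted for load is defined by the model of [12]; purely as an illustration of the idea, the following C fragment assumes a simple multiplicative penalty for endpoint CPU load and NIC load. The function and the form of the adjustment are illustrative assumptions, not the published model.

/* Illustrative only: one possible way to adjust a no-load end-to-end
 * latency for current endpoint load.  The actual CBES adjustment is the
 * calibrated model of [12]; the multiplicative form here is an assumption. */
double adjusted_latency(double noload_latency, /* from the latency model */
                        double cpu_avail_src,  /* CPU availability, 0..1 */
                        double cpu_avail_dst,  /* CPU availability, 0..1 */
                        double nic_util)       /* NIC utilization, 0..1  */
{
    /* callers are expected to clamp availabilities away from 0 and
     * utilization away from 1 to avoid division by zero */
    double cpu_penalty = 1.0 / (cpu_avail_src * cpu_avail_dst);
    double nic_penalty = 1.0 / (1.0 - nic_util);
    return noload_latency * cpu_penalty * nic_penalty;
}

With a calibrated model of this kind, monitoring only needs the per-node CPU and NIC load values, which is what keeps the on-demand view of the cluster at O(N) cost.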

3 Mapping comparisons

The core CBES module compares mappings on the basis of expected execution time. The system and application information subsystems of CBES provide on demand a snapshot of resource availability, system profile data, and application profile data. The core module combines this information with given mapping definitions and generates a prediction of the application's execution time. In this section we describe the formulation supporting this prediction operation and the experimental validation of the predictions.

3.1 Formulation

A mapping M is an assignment of application tasks to cluster nodes. It is expressed as a set of n_M pairs whose first member is the identity of a task (a process) and whose second member is the identity of a cluster node. If

    P = {p_i, i = 1..n_M}                                        (1)

is the set of processes and

    N = {n_j, j = 1..n_M}                                        (2)

the set of nodes, then

    M = {(x, y)_k, k = 1..n_M, x ∈ P, y ∈ N}                     (3)

For a given mapping M, the execution time S_M is estimated as:

    S_M = max_{i=1..n_M} (R_i + C_i)                             (4)

with R_i and C_i the contributions to execution time of process i from computation and communication respectively. Since S_M represents the execution time of the application, we have to consider the maximum R + C as the time corresponding to mapping M, in the same way that the longest-running process defines the execution time of an application. We denote by i_M the value of i for which S_M is obtained.

Term R_i represents the contribution of the computation performed by process i. The application profile contains for each process i the following quantities:

• X_i, the accumulated time process i was executing its own code,

• O_i, the accumulated time process i was executing message passing interface library code (overhead time), and

• B_i, the accumulated time process i was blocked waiting for messages it sent to other processes to reach their recipients and/or for messages sent by other processes to arrive.

Differences between the nominal (profile) node processing speed and the processing speed of node j assigned to process i through mapping M directly affect the execution time of process i. The processing speeds may differ due to hardware and software heterogeneity between the profile node and node j (e.g. architecture, clock frequency, memory size and speed, L2 cache size and speed, virtual memory layer operation) and/or due to already existing load on node j. Term R_i is calculated as follows:

    R_i = (X_i + O_i) · (Speed_profile_j / Speed_j) · (1 / ACPU_j)    (5)

Speed_j is the processing speed of node j and Speed_profile_j the processing speed of the node used during the profiling of the program. The ratio Speed_profile_j / Speed_j expresses the increase or decrease in process execution time due to differences between the profiling node and the assigned node.¹ ACPU_j is the current CPU availability of node j (0–100%), and 1/ACPU_j expresses the slowdown, i.e. the increase in execution time due to pre-existing load on node j causing timesharing of the CPU.

¹ The application profile also includes experimentally measured speed ratios for all cluster node architectures.

In equation 4, term C_i represents the contribution of the communication performed by process i to the execution time. The first step in obtaining this contribution is to sum the partial contributions of each message that process i sent and/or received during the course of its execution, as recorded in the program's profile. The sum represents the theoretical total time that the communication part takes, assuming blocking, standard-mode sends and receives. The end-to-end latency measurements use the same kind of primitives; however, these measurements take place under optimal conditions with benchmarks that make every effort to minimize overhead (e.g. by pre-posting receives). When executing under typical system conditions, there are no guarantees that programs can achieve the same minimal overhead. On the other hand, computation and communication in a process may overlap. Thus, it is possible to observe actual total times—also recorded in a program's profile—longer or shorter than the theoretical total communication time. To account for this difference, the theoretical communication time of the mapping under evaluation has to be multiplied by a correction factor λ_i.

Let SS_i be the set of processes that send messages to process i and SR_i the set of processes that receive messages from process i. For every process i, the program profile gives us an analysis of how many messages, and of what size, were received by that process from the other processes of the program. The messages of each sender process to process i define a set of same-size message groups mgS_i; similarly, the messages received by process i define the set mgR_i. Each group of these sets has a message count mc and a message size ms. Also, from the system profile, the latency model, and the given mapping M we can determine the current latency L_c for every message to be exchanged between the nodes assigned to each pair of processes, as the corresponding no-load latency adjusted for the effect of CPU and network load. The theoretical communication time of process i for mapping M is calculated with the following summation:

    Θ_i^M = Σ_{k∈SS_i} Σ_{j∈mgS_k} mc_j · L_c(k, i, ms_j) + Σ_{k∈SR_i} Σ_{j∈mgR_k} mc_j · L_c(i, k, ms_j)    (6)

The factor λ_i represents the expansion or reduction of the theoretical time based on the difference established for process i using profile information. It is calculated as the ratio of B_i, the communication time recorded in the profile for i, and Θ_i^profile, the theoretical time of the profile itself, calculated from equation 6 for the mapping used for profiling:

    λ_i = B_i / Θ_i^profile                                      (7)

The set of values Λ = (λ_i, i = 1..n_M) is constant and characteristic of each profile. The range of values for λ_i is 0.0 ≤ λ_i < 1.0 when communication overlaps with computation, and 1.0 ≤ λ_i when additional overhead expands the communication time. For λ_i = 1.0 the theoretical time is exactly equal to the measured time. Finally, term C_i is calculated as follows:

    C_i = Θ_i^M · λ_i                                            (8)
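The prediction operation of equations 4–8 can be summarized in code. The following C sketch is illustrative only: the per-process profile records, the per-process node-state values, and the current_latency() helper are hypothetical stand-ins for the CBES profile databases and the latency model.

/* Illustrative sketch of the CBES mapping evaluation (equations 4-8).
 * Data structures and helper names are hypothetical stand-ins for the
 * CBES profile databases and the latency model. */
#include <stddef.h>

struct msg_group {            /* one group of same-size messages */
    int    peer;              /* index of the sending/receiving process */
    long   count;             /* mc: number of messages in the group    */
    long   size;              /* ms: message size in bytes              */
};

struct proc_profile {
    double X, O, B;           /* own-code, MPI-overhead, blocked time (sec) */
    double theta_profile;     /* Theta computed for the profiling mapping   */
    struct msg_group *recv;   /* message groups received by this process    */
    size_t nrecv;
    struct msg_group *sent;   /* message groups sent by this process        */
    size_t nsent;
};

struct node_state {
    double speed_ratio;       /* Speed_profile / Speed_j for the assigned node */
    double acpu;              /* current CPU availability, 0.0 .. 1.0          */
};

/* Current (load-adjusted) latency of one message of 'size' bytes from the
 * node of process 'src' to the node of process 'dst' under mapping 'map';
 * assumed to be supplied by the latency model subsystem. */
double current_latency(const int *map, int src, int dst, long size);

/* Predicted execution time S_M of a mapping: map[i] = node of process i. */
double predict_time(const int *map, const struct proc_profile *prof,
                    const struct node_state *node_of_proc, int nprocs)
{
    double s_m = 0.0;
    for (int i = 0; i < nprocs; i++) {
        /* Equation 5: computation term, scaled by speed ratio and CPU share. */
        double r = (prof[i].X + prof[i].O) * node_of_proc[i].speed_ratio
                   / node_of_proc[i].acpu;

        /* Equation 6: theoretical communication time for this mapping. */
        double theta = 0.0;
        for (size_t g = 0; g < prof[i].nrecv; g++)
            theta += prof[i].recv[g].count *
                     current_latency(map, prof[i].recv[g].peer, i,
                                     prof[i].recv[g].size);
        for (size_t g = 0; g < prof[i].nsent; g++)
            theta += prof[i].sent[g].count *
                     current_latency(map, i, prof[i].sent[g].peer,
                                     prof[i].sent[g].size);

        /* Equations 7 and 8: correction factor lambda and C_i. */
        double lambda = prof[i].B / prof[i].theta_profile;
        double c = theta * lambda;

        /* Equation 4: the slowest process determines S_M. */
        if (r + c > s_m)
            s_m = r + c;
    }
    return s_m;
}

A scheduler can then compare candidate mappings simply by comparing the values returned by such a predict_time() function.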

4 Experimental implementations of CBES

Two basic CBES prototypes were implemented for proof-of-concept purposes and for further development and experimentation. At this stage, they support only legacy MPI programs, without modifications. Both prototypes were developed in C and operate in a LAM/MPI environment. The LAM/MPI package [13] was selected for two basic reasons:

• it is MPI-2 compliant (it supports dynamic process management);

• it utilizes daemons for forming, controlling, and monitoring the virtual machines that run applications; these daemons store detailed execution traces for an application. Using the XMPI tool [14] it is possible to examine application behavior, either "post mortem" in the form of a profile, or even while the application is still running. The standard library supports execution tracing, so no linking with special libraries is necessary.

Figure 3. The experimental configuration of the Centurion cluster.

Figure 4. The experimental Orange Grove cluster.

The application profiling subsystem of CBES is based mainly on the XMPI tool, an execution trace visualization tool compatible with LAM/MPI. Like other publicly or commercially available visualization tools (e.g. Jumpshot [15], Vampir [16], etc.), XMPI analyzes an application's execution trace and displays a visual representation of that execution. We added a profiling module to XMPI. This module takes advantage of the database built during the analysis of the execution trace and generates a profile as described above. Although this is a non-standard MPI feature, LAM/MPI provides statements that can be used to mark the beginning and end of an execution phase in the application code. These statements place markers in the application's execution trace and separate the trace into a number of segments. XMPI analyzes each such segment separately; the modified version of the tool we use in the profiling subsystem generates a basic profile for each segment.
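As a rough illustration of what the profiling module accumulates per segment—the event record layout below is hypothetical, since the real module works off the database XMPI builds while parsing the LAM/MPI trace—the cumulative X_i, O_i, and B_i quantities of section 3.1 can be obtained by summing event durations per process (message counts and sizes, which the profile also records, are omitted here):

/* Hypothetical trace-event layout; the real data comes from XMPI's
 * internal database built while parsing the LAM/MPI execution trace. */
#include <stddef.h>

enum ev_kind { EV_RUN, EV_MPI_OVERHEAD, EV_BLOCKED };

struct trace_event {
    int          proc;        /* process (rank) the event belongs to */
    enum ev_kind kind;
    double       duration;    /* seconds */
};

struct proc_totals {
    double X;                 /* time in the application's own code  */
    double O;                 /* time in MPI library code (overhead) */
    double B;                 /* time blocked on sends/receives      */
};

/* Accumulate the cumulative X, O, B quantities of one trace segment. */
void accumulate_segment(const struct trace_event *ev, size_t nev,
                        struct proc_totals *totals, int nprocs)
{
    for (int p = 0; p < nprocs; p++)
        totals[p].X = totals[p].O = totals[p].B = 0.0;

    for (size_t e = 0; e < nev; e++) {
        struct proc_totals *t = &totals[ev[e].proc];
        switch (ev[e].kind) {
        case EV_RUN:          t->X += ev[e].duration; break;
        case EV_MPI_OVERHEAD: t->O += ev[e].duration; break;
        case EV_BLOCKED:      t->B += ev[e].duration; break;
        }
    }
}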

4.1 Centurion Prototype

The first prototype system was implemented on the Centurion cluster at the University of Virginia. Centurion is a heterogeneous cluster including 128 nodes with single Alpha processors running Alpha Linux and 128 nodes with twin Intel Pentium IIs running x86 Linux. The interconnecting network is switched fast Ethernet. The configuration of Centurion used throughout the present work consists of a subset of the Centurion nodes: 32 Alpha 533MHz (A) and 96 dual Intel Pentium II 400MHz (I) nodes spread over eight identical 3Com 24-port 100Mbps switches connected to a 3Com 1.2Gbps switch, constituting a 128-node cluster (see figure 3). Besides these 128 primary nodes, 15 more nodes stand by in reserve to substitute for primary nodes in case of crashes, etc.

The Centurion prototype uses a modified version of NWS for off-line profiling of the cluster, for creating the cluster network latency model, and for periodically measuring the CPU and NIC availability of each cluster node. The simple modifications (additions) necessary for using NWS with CBES are the addition of an MPI end-to-end latency benchmark to a sensor's available benchmark options, the addition of a network connection availability sensor, and the addition of script-based clique control. The MPI sensor and the network connection availability sensor provide the necessary system monitoring information, while the clique control scripts make it possible to run multiple benchmarks in parallel. The latter drastically reduces the O(N²) initialization time while ensuring that the benchmarks do not interfere with and invalidate each other's results.
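The clique control scripts themselves are specific to NWS; purely to illustrate why running benchmarks concurrently collapses the calibration time, the C sketch below pairs nodes with the standard round-robin (circle) method, so that each round contains only pairs with disjoint endpoints and all pairs of a round can be benchmarked at the same time. This pairing is an assumed organization for illustration, not the exact clique layout used on Centurion.

#include <stdio.h>

/* Round-robin (circle method) pairing of n nodes: m-1 rounds, each round
 * pairing every node at most once, so all pairs within a round can run
 * their end-to-end benchmarks concurrently without sharing an endpoint. */
void print_benchmark_rounds(int n)
{
    int m = (n % 2 == 0) ? n : n + 1;          /* pad with a dummy node */
    for (int r = 0; r < m - 1; r++) {
        printf("round %d:", r);
        int a = m - 1, b = r % (m - 1);        /* fixed node vs. rotating node */
        if (a < n && b < n)
            printf("  (%d,%d)", a, b);
        for (int k = 1; k < m / 2; k++) {
            a = (r + k) % (m - 1);
            b = (r - k + (m - 1)) % (m - 1);
            if (a < n && b < n)                /* skip pairs with the dummy */
                printf("  (%d,%d)", a, b);
        }
        printf("\n");
    }
}

int main(void)
{
    print_benchmark_rounds(8);                 /* e.g. an 8-node group */
    return 0;
}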

4.2 Orange Grove Prototype

The second CBES prototype was implemented on a rearranged Orange Grove cluster at Syracuse University. This version of Orange Grove (see figure 4) is a highly heterogeneous cluster consisting of 28 nodes in total: 8 single-CPU 533MHz Alpha nodes (A) running Alpha Linux, 8 single-CPU 500MHz SPARC nodes (S) running Solaris, and 12 dual-CPU 400MHz Intel Pentium II nodes (I) running x86 Linux. The interconnecting network is also switched fast Ethernet and consists of 5 identical 3Com 24-port 100Mbps switches (2 of them stacked and functioning as one 48-port switch) and 2 DLink 8-port 100Mbps switches. The network topology of this cluster emulates the topology of a federation of two elementary clusters with a limited capacity link.


Apart from the differences in hardware support, the Orange Grove prototype also differs from the Centurion prototype in that it does not employ NWS for system profiling and monitoring. This prototype has basic profiling and monitoring support without the next-period forecasting capability of NWS, and considers the latest measured load values as valid for the next time period. The Orange Grove prototype also includes a simulated annealing-based scheduler.


5 Experimental Validation of Predictions


To evaluate the validity of the execution time prediction formulation, we experimentally explored the magnitude of the prediction error while considering the following major factors (per program phase): computation and communication overlap, communication granularity (CPU-bound vs. communication-bound programs), and duration of execution. Computation and communication overlap affects the contribution of the C_i term. Communication granularity affects the weight—the importance—of the R_i term vs. the C_i term. Finally, a long duration of execution exposes small errors that only become noticeable as they accumulate. We followed a three-phase experimental plan:

1. Experiments with a synthetic benchmark program, focusing on the behavior of individual formula terms under the effect of the three main factors listed above.

2. Experiments using the NASA parallel benchmarks (NPB 2.4) [17] and the HPL benchmark [18].

3. Experiments using selected programs from the second phase, but under varying background load conditions.

All experiments were conducted on the Centurion and the Orange Grove clusters, on the configurations described in sections 4.1 and 4.2. In each individual case, the actual execution time of the program under consideration was measured and compared against the time predicted by CBES. The difference was expressed as the error percentage with regard to the actual time.

The first phase of experiments was essentially a parameter sweep (over 16,000 cases, 5 runs per case) covering a wide value range for the three main factors and also covering the mapping space of the two cluster configurations. The program used in this phase was configurable in terms of computation and communication overlap, communication granularity, and (indirectly) execution duration, and was run on a set of mappings with varying numbers of nodes, hardware architecture mixes, and network connectivity mixes. Over 90% of the cases exhibited a prediction error of 4% or less. The overall average error was found to be approximately 2% ± 0.75% with 95% confidence intervals.
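For reference, the reported error figures follow directly from the definition above; a minimal C helper (hypothetical, but matching the description) is:

#include <math.h>

/* Percentage prediction error with respect to the measured (actual) time. */
double prediction_error_pct(double predicted, double measured)
{
    return 100.0 * fabs(measured - predicted) / measured;
}

/* Mean error over the repeated runs of one test case (5 runs per case). */
double mean_error_pct(const double *predicted, const double *measured, int runs)
{
    double sum = 0.0;
    for (int i = 0; i < runs; i++)
        sum += prediction_error_pct(predicted[i], measured[i]);
    return sum / runs;
}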


Figure 5. Prediction errors for the NPB 2.4 suite and HPL.

The second phase of experiments was aimed at obtaining several data points for the behavior of the prediction operation with regard to programs with much more complex computation and communication patterns than the synthetic tests of the first phase. Figure 5 presents the results for this phase. We used the IS, EP, SP, MG, CG, BT, and LU benchmarks from the NPB 2.4 benchmark suite for the A and B input classes, and also the HPL benchmark with a problem size of 10,000, on mappings of up to 128 nodes. Each point in the figure is the mean error of 5 runs of the corresponding test case, while the error bars indicate the 95% confidence intervals for the mean error. The observed mean error values for the NPB benchmarks and HPL are less than 3.5% (with the exception of a single case that exhibits an error of slightly less than 4%).

Finally, the third phase of experiments had as its primary goal to identify how tolerant a prediction is to background load changes. Here, we did not ignore the load effects in the prediction; however, we measured the actual execution time after changing the load conditions. Because all other inputs to the prediction operation are static, background load change is the dominant factor restricting the lifetime of a prediction. The performed measurements indicated that the predictions are highly sensitive to background load changes. The results of re-running the LU, SP, and BT cases of the second phase, but with the addition of an amount of load on one or more of the nodes included in each involved mapping, indicate an unacceptable increase of the prediction error. The load increase causes the error to exceed the previously observed levels of approximately 4% when even a single node of the mapping in use loses just 10% of its CPU availability. Only light loads (less than 10%) or instantaneous or short-term loads (short in comparison with the duration of execution of an already scheduled program), such as loads from routine operating system processes, were found to not invalidate the predictions.

6 Scheduling with CBES

The default CBES scheduler is based on a typical simulated annealing algorithm [19][20]. The CBES mapping evaluation formula (equation 4) plays the role of the energy function invoked by the algorithm. The energy level of a system configuration corresponds to an estimate of the execution time (cost) of a mapping; thus, the configuration with minimal energy—as found by the simulated annealing algorithm—corresponds to the estimated fastest mapping.

We conducted a series of scheduling tests on the Orange Grove cluster to study the performance of the CBES scheduler (CS). Given the size and connectivity of the cluster, we chose to use 8-node mappings. With mappings of this size it is possible to form—among others—mappings with nodes belonging exclusively to the same switch or nodes of only one hardware architecture. We compare the results of CS against those of the following two schedulers:

• a simple random scheduler (RS); RS picks mappings at random from a pool of nodes considered equivalent. As such, RS requires a negligible amount of time to find a mapping solution.

• a simulated annealing-based scheduler that takes into account program computation speeds and CPU loads but ignores communication latency effects (NCS). The cost function of NCS is the same one used with CS (equation 4) but without the communication term (equation 8). In this case, the cost function assigns an evaluation score to each mapping under consideration but cannot predict execution times.

The comparison of CS against RS is a point of reference for demonstrating the maximum feasible overall speedup. Comparing CS against NCS reveals the significance of matching the communication pattern of a program to the communication topology of a cluster, as any difference in benefit from scheduling will be due to the communication term alone. When the interconnecting network of a cluster exhibits non-negligible differences among internode latencies due to connectivity and heterogeneity, there is a potentially non-negligible gain (or loss) in speed that a running application can be subjected to. For the largely homogeneous Centurion cluster at UVa, these latency differences were found to be up to approximately 13%. For the strongly heterogeneous experimental Orange Grove at Syracuse, the differences were as high as 54%. An application is not guaranteed to be sensitive to the effects of cluster internode latency differences. However, the magnitude of the latency variations and the intensity of the communication orientation of the application play an important role in the magnitude of the potential gain (or loss) from a good (or bad) mapping.

Figure 6. LU on 8 Orange Grove nodes: measured execution time ranges.

The programs selected for the scheduling tests cover a wide region of the space of parallel scientific applications commonly run on clusters:

• the LU code from the NPB 2.4 benchmark suite, a simulated computational fluid dynamics application,

• the High Performance Linpack (HPL) code, a dense linear system solver,

• a selection of 5 programs from the ASCI purple benchmark suite [21]: sweep3d, a solver for the 3-D, time-independent, particle transport equation on an orthogonal mesh [22]; smg2000, a parallel semicoarsening multigrid solver for linear systems widely used in radiation diffusion and flow through porous media problems [23]; SAMRAI, an object-oriented C++ framework for the development of computational physics applications using structured adaptive mesh refinement technology [24]; Towhee, a Monte Carlo molecular simulation code designed for computing fluid phase equilibria using atom-based force fields [25]; and Aztec, a massively parallel iterative solver library for sparse linear systems, which grew out of the specific application of modeling reacting flows (the MPSalsa software at Sandia National Laboratories) [26].

The tests fall into the following two categories:

• Worst case vs. best case scenario; these tests investigate the potential maximum difference in performance between CS, RS, and NCS; they are based on the observation that RS can select any mapping with equal probability, while NCS behaves like RS when selecting from a set of nodes of equivalent computation speeds with regard to a program.

• Average case scenario; these tests investigate what level of performance can be expected when submitting a scheduling request in practice. CS is not always guaranteed to find the best mapping, nor does NCS (or even RS) always select the worst mapping.

We also distinguish between the tests with LU and the tests with the rest of the selected programs. The LU tests were run on node sets that are heterogeneous with regard to hardware architecture. All other cases were run on homogeneous sets, to focus the comparison on the effects of communication.

Table 1. LU: worst vs. best case scenario.

Test Case | Worst Time (measured, sec) ± (95% conf.) | Best Time (measured, sec) ± (95% conf.) | Speedup (%) | Approx. Scheduler Time (sec) | Comments
LU (1)    | 219.4 ± 1.3 | 207.8 ± 1.0 | 5.3 | 6 | High-speed group
LU (2)    | 260.4 ± 2.4 | 236.2 ± 0.2 | 9.3 | 6 | Medium-speed group
LU (3)    | 327.8 ± 2.1 | 308.2 ± 1.1 | 6.0 | 6 | Low-speed group

Table 2. LU: average case scenario.

Test Case | CS: Avg. Predicted Time (sec) ± (95% conf.) | CS: Hits (%) | CS: Measured Time (sec) | NCS: Avg. Predicted Time (sec) ± (95% conf.) | NCS: Hits (%) | NCS: Measured Time (sec) | Expected Speedup (%) | Measured Speedup (%) | Maximum Speedup (%)
LU (1) | 212.1 ± 1.4 | 92 | 207.8 | 217.6 ± 0.1 | 3 | 218.2 | 2.5 | 4.8 | 5.3
LU (2) | 235.6 ± 0.7 | 89 | 236.2 | 254.0 ± 0.8 | 2 | 258.7 | 7.2 | 8.7 | 9.3
LU (3) | 302.3 ± 0.5 | 90 | 308.2 | 318.9 ± 0.6 | 1 | 326.2 | 5.2 | 5.5 | 6.0

6.1 LU Tests

The LU tests constitute a sampling of execution times across Orange Grove's mapping space. To cover this mapping space we selected mappings with various proportions of node architectures and connectivity mixes as representatives of mapping groups with approximately similar properties. The selection process yielded approximately 100 representative mapping cases. The measurements reveal the existence of 3 distinct execution time zones, as shown in figure 6, each corresponding to a subset of nodes (high, medium, and low speed subsets). The major differences between zones are due to differences in node computation speeds; however, the range covered by each zone is mainly due to the effect of communications.

Figure 7. Predicted time distributions for the LU(3) case.

6.1.1 Worst vs. Best Case Results

Table 1 presents the results for three test sets with LU, one each for the high, medium, and low speed node groups. Each line of the table corresponds to two sets of scheduling tests; the first set uses NCS while the second uses CS. Within each zone, CS consistently selects mappings yielding the shortest execution times. On the other hand, NCS cannot distinguish between mappings within the same node group. The maximum potential speedup is between 5.3% and 9.3%, depending on node group. With regard to RS, the maximum potential speedup is 36.6%, as a random scheduler selects any mapping with equal probability. While these comparisons reveal the upper limit in potential gain, the average case scenario offers a more realistic comparison.

6.1.2 Average Case Results

Table 2 presents the average case results for the same three LU cases as presented in table 1. For each case, the table lists the average of the results of 100 CS and NCS runs. The average predicted time is calculated by the CBES mapping evaluation operation. For NCS the table lists the normalized prediction; because NCS ignores the contribution of communications to the execution time, the result of a mapping evaluation is not a time prediction. To obtain an estimate of the corresponding execution time, we processed each mapping selected by NCS with the full evaluation operation, as with CS. The hit percentage shows the frequency of successful selections by a scheduler (selections of mappings with minimum execution time). The measured time is the actual execution time of LU on the mapping selected by each scheduler.

Table 3. Other tests: worst vs. best case scenario.

The expected speedup is based on the predicted times and the measured speedup on the actual times. The maximum speedup is listed for comparison and is the same as in table 1. CS is approximately 90% successful, with the remaining 10% of selected mappings being slightly slower. NCS, however, is less than 3% successful, and even in those few cases the fastest mappings are slower than those found in the large majority of the CS runs. Figure 7 presents the distributions of the CS and NCS results for the LU(3) case. The distributions reveal why CS maintains its performance while NCS shows only a marginal improvement in the average case. For NCS, the number of mappings with minimum or nearly minimum time is essentially negligible compared to the overall number of selectable mappings. This results in a near zero probability of NCS selecting one of the faster mappings. While the CS results are strongly skewed towards the minimum-time mappings, the NCS results are strongly skewed towards the nearly worst-time mappings. As expected, RS performs worse than NCS. The overall average of the measured times of all 100 mapping cases examined is 296.5 seconds. The best achieved time of 207.8 seconds is approximately 30% shorter.

6.2 Other Tests

The tests with the remaining programs of the selection focus strictly on the effect of communications. Here we only compare the CS against the NCS results. To isolate the effect of communications, it is necessary to "level the field" by restricting the selections of CS and NCS to the same homogeneous subset of nodes per case (the subsets may differ from case to case).

Table 4. Other tests: average case scenario.

In the exact same manner as with the LU case, table 3 shows the results for the worst vs. best case scenario tests, while table 4 presents the average case results. The encountered maximum speedups range from 5.6% to 10.8%, while in the average case the speedups are only slightly lower, from 5.2% to 10.3%. That is, in the average case the maximum speedups are reduced by less than approximately 10%. Four of the worst vs. best case test cases (sweep3d, SAMRAI, Towhee, and HPL(1)) exhibited a questionable potential speedup. A closer examination of the program profiles revealed that sweep3d and SAMRAI have near all-to-all communication patterns. With such patterns it is virtually impossible to find a mapping where the benefits are not cancelled by the penalties. Towhee is an embarrassingly parallel program with insignificant communication between processes. Finally, in the HPL(1) case the short execution duration exaggerates the differences. For these four cases we did not perform the average case tests, since the corresponding programs were found to be unsuitable for CBES-supported scheduling.

To obtain an approximate indication of the efficiency of CBES-supported scheduling, we examine how much of the available speedup it was possible to achieve. In the presented tests the speedup is achieved solely through the exploitation of communication patterns. Also, we know an application's computation to communication ratio from that application's profile. It is therefore possible to express the speedup as a decrease of the communication time. The maximum decrease we encountered is 46.4%, for the LU(2) case, which has an 80%/20% computation to communication ratio. Given that the latency differences for Orange Grove were found to be up to 54%, under the most favorable conditions the theoretically available speedup would cause a 54% decrease of the communication time. Thus, CBES achieved up to 85% of the theoretically available speedup.

As table 3 shows, in some cases (e.g. smg2000) the scheduler time (overhead) exceeds the execution time of the program itself. One of the major factors affecting scheduler time is the complexity of an application’s communication pattern, as reflected in that application’s profile. The higher the complexity, the longer it takes to evaluate a mapping. Because the simulated annealing algorithm goes through large numbers of mapping evaluations, the time required for each evaluation is critical for the overall duration of scheduler execution. Clearly, the overhead can be prohibitive for short-duration applications. However, an application run may consist of a core segment repeated any number of times. In such a case, one would need to pay the overhead for finding a mapping for this core segment only once, then save a percentage of time out of each repetition. The same holds for short-lived applications that will be run many times while system conditions allow the re-use of the same mapping over and over again.
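For completeness, a typical simulated annealing loop of the kind the CS scheduler uses—with the mapping evaluation of equation 4 as the energy function—has roughly the following shape. This is a generic sketch: the neighbor move (reassigning one task to another allowed node), the cooling schedule, and the helper names are assumptions rather than the exact CBES implementation.

#include <stdlib.h>
#include <math.h>

/* Energy function: the CBES mapping evaluation (equation 4), predicting
 * the execution time of the application under mapping map[]. */
double predict_time(const int *map, int nprocs);

/* Generic simulated-annealing search over mappings.  map[i] is the node
 * assigned to process i; allowed[] lists the nodes available to the
 * application.  Move, schedule, and iteration counts are illustrative. */
void anneal_mapping(int *map, int nprocs, const int *allowed, int nallowed)
{
    double current = predict_time(map, nprocs);
    double temp = 0.1 * current;       /* start temperature in cost units */
    double temp_min = 1e-4 * current;
    double cooling = 0.95;
    int    iters_per_temp = 100;

    while (temp > temp_min) {
        for (int it = 0; it < iters_per_temp; it++) {
            int p = rand() % nprocs;                   /* pick a task...    */
            int old_node = map[p];
            map[p] = allowed[rand() % nallowed];       /* ...and move it    */

            double cost  = predict_time(map, nprocs);  /* one evaluation    */
            double delta = cost - current;
            if (delta <= 0.0 ||
                exp(-delta / temp) > (double)rand() / RAND_MAX)
                current = cost;                        /* accept the move   */
            else
                map[p] = old_node;                     /* reject, roll back */
        }
        temp *= cooling;                               /* cool down         */
    }
    /* a production scheduler would also keep a copy of the best mapping seen */
}

Each iteration performs one full mapping evaluation, which is why the per-evaluation cost discussed above dominates the overall scheduler time.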

7 Related Work

Several research efforts have similarities with and/or the same goals as CBES. Schopf [27] presents an approach to the problem of scheduling parallel applications on clusters based on structural modeling. Structural modeling defines distributed parallel application performance models parameterized with stochastic values in order to predict application performance in the dynamic environment of clusters. Furthermore, the approach uses a stochastic scheduling policy that makes use of the stochastic prediction to achieve efficient application execution. Closely related to Schopf's work is AppLeS—Application Level Scheduling [28]. This approach builds scheduling agents tightly coupled to the application. The agents create schedules specially tuned to the needs of the application and, with the help of dynamic information and adaptive scheduling techniques, maximize application performance.

Prophet [29] is an automated scheduler for parallel computations in a heterogeneous environment. The Prophet framework is integrated into the Mentat-Legion parallel processing system and uses runtime granularity information for selecting the best number of processors to apply to the application. Prophet uses a callback mechanism to obtain application-specific information, and utilizes dynamic system information about CPU and network usage at runtime. The system information is supplied by Prophet's own auxiliary system, the Network Resource Monitoring System (NRMS), with functionality similar to the Network Weather Service.

Subhlok et al. [30] present a framework for automatic node selection for high performance applications on shared networks. This framework uses an application specification interface, through which applications can specify their computation and communication requirements, and relies on ReMoS [31] for network information. The framework can invoke node selection algorithms that maximize computation capacity, maximize communication capacity, and obtain the maximum fraction of computation and communication capacity that can be achieved simultaneously.

Yarkhan and Dongarra [32] perform scheduling experiments with a ScaLAPACK LU numerical solver in a grid environment using simulated annealing [33]. To evaluate the schedules generated by the simulated annealing algorithm they use a Performance Model, a function specifically created to predict the execution time of the program. Generating such a Performance Model requires detailed analysis of the program to be scheduled.

Decker and Diekmann [34] describe CoPA, an environment for mapping coarse-grained applications onto workstation clusters. CoPA uses a special library on top of PVM to trace the execution of an application and analyze its behavior "post-game". Using heuristics or simulated annealing, CoPA searches for effective application mappings. However, CoPA only handles coarse-grained applications with infrequent communication and does not incorporate dynamic system information when searching for mappings.

Finally, TITAN [35] is a multi-tiered scheduling architecture that employs the PACE [36] [37] performance prediction system to improve resource usage efficiency. PACE uses a Performance Specification Language (PSL) to describe workloads for both sequential and parallel parts of an application. Also, PACE uses a Hardware Model Configuration Language (HMCL) to describe hardware characteristics. Furthermore, PACE uses a parametric evaluation engine and the workload and hardware descriptions to provide execution time estimates. TITAN uses these estimates for scheduling purposes, e.g. as fitness function values for a genetic algorithm-based scheduler.

8 Conclusions and Future Work

We have described the design, theoretical background, and prototype implementations of CBES, as well as a series of experimental evaluation tests. CBES is an auxiliary runtime system service for facilitating the effective mapping of parallel application tasks on heterogeneous cluster nodes. CBES utilizes static and dynamic information from both the computing system and the application to compare different mappings by predicting the performance an application can achieve with each such mapping. Due to the heterogeneity and network connectivity of a cluster, there are non-negligible differences in CPU speeds and internode communication latencies. CBES exploits these differences to the benefit of an application by matching them to the computation and communication patterns of that application. However, the level of exploitation greatly depends on the application itself.

In principle, not all scientific applications can benefit from the CBES scheduling mechanisms. The degree of cluster heterogeneity, the percentage of application execution time spent on communication, and the adaptability of an application's communication pattern to the network topology of a cluster are strong indicators of the size of the benefits from CBES-supported scheduling. The CBES-supported scheduler achieved speedups higher than 10% for several important scientific applications, by exploiting their communication patterns alone. The time cost of running the CBES-supported simulated annealing scheduler can be prohibitive for short-lived programs. However, in the case of long-duration applications there is a wide margin to offset the cost of running the scheduler against the total gain of an application run or the cumulative gain of several such runs.

In the future, we plan to expand the CBES infrastructure with application monitoring and remapping capabilities, as well as runtime scheduling support for MPI-2 programs that spawn processes dynamically. We further intend to investigate the suitability of other scheduling algorithms, e.g. genetic algorithms, for CBES-supported scheduling, and the resulting performance. Finally, we will conduct further testing using a larger variety of parallel applications, including applications with irregular computation and/or communication patterns.

References

[1] PVM: Parallel Virtual Machine. www.csm.ornl.gov/pvm/pvm_home.html
[2] Message Passing Interface Forum. www.mpi-forum.org
[3] Condor: High Throughput Computing. www.cs.wisc.edu/condor
[4] A. Bayucan, R. L. Henderson, C. Lesiak, N. Mann, T. Proett, and D. Tweten. Portable Batch System: External Reference Specification. MRJ Technology Solutions, Nov 1999.
[5] Load Sharing Facility. accl.grc.nasa.gov/lsf
[6] Legion: Worldwide Virtual Computer. legion.virginia.edu
[7] The Globus Alliance. www.globus.org
[8] D. Katramatos, D. Saxena, D. Mehta, and S. J. Chapin. A Cost/Benefit Model for Dynamic Resource Sharing. In Proceedings of the 9th Heterogeneous Computing Workshop, Cancun, Mexico, 2000.
[9] D. Katramatos, M. Humphrey, C. Hwang, and S. J. Chapin. Developing a Cost/Benefit Model for Dynamic Resource Sharing in Heterogeneous Clusters: Experience with SNL Clusters. In Proceedings of the 1st IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2001), pp. 355–362, Brisbane, Australia, May 15–18, 2001.
[10] The Centurion Cluster. legion.virginia.edu/centurion/Centurion.html
[11] The Network Weather Service. nws.cs.ucsb.edu
[12] D. Katramatos. Dynamic Resource Sharing Mechanisms for High-Performance Heterogeneous Clusters. Ph.D. dissertation, University of Virginia, Jan 2005.
[13] LAM/MPI Parallel Computing. www.lam-mpi.org
[14] XMPI—A Run/Debug GUI for MPI. www.lam-mpi.org/software/xmpi
[15] Performance Visualization for Parallel Programs. www-unix.mcs.anl.gov/perfvis/software/viewers
[16] Visualization and Analysis of MPI Programs. www.pallas.com/e/products/vampir/index.htm
[17] The NASA Parallel Benchmarks. www.nas.nasa.gov/Software/NPB
[18] HPL—A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers. www.netlib.org/benchmark/hpl
[19] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, vol. 21, pp. 1087–1092, 1953.
[20] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd Edition, 1992, ISBN 0-521-43108-5.
[21] The ASCI purple benchmarks. www.llnl.gov/asci/purple/benchmarks
[22] The ASCI sweep3d benchmark. www.llnl.gov/asci_benchmarks/asci/limited/sweep3d
[23] smg2000. www.llnl.gov/asci/purple/benchmarks/limited/smg
[24] SAMRAI. www.llnl.gov/CASC/SAMRAI/samrai_home.html
[25] MCCCS Towhee. www.cs.sandia.gov/projects/towhee
[26] Aztec. www.cs.sandia.gov/CRF/Aztec_RD100.html
[27] J. M. Schopf. Performance Prediction and Scheduling for Parallel Applications on Multi-User Clusters. Ph.D. dissertation, UCSD, 1998.
[28] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao. Application-Level Scheduling on Distributed Heterogeneous Networks. In Proceedings of SuperComputing '96, 1996.
[29] J. B. Weissman and X. Zhao. Scheduling Parallel Applications in Distributed Networks. Journal of Cluster Computing, 1998.
[30] J. Subhlok, P. Lieu, and B. Lowekamp. Automatic Node Selection for High Performance Applications on Networks. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'99), Atlanta, Georgia, pp. 163–172, ACM Press, May 1999.
[31] T. Dewitt, T. Gross, B. Lowekamp, N. Miller, P. Steenkiste, J. Subhlok, and D. Sutherland. ReMoS: A Resource Monitoring System for Network-Aware Applications. Technical Report CMU-CS-97-194, School of Computer Science, Carnegie Mellon University, Dec 1997.
[32] A. Yarkhan and J. Dongarra. Experiments with Scheduling Using Simulated Annealing in a Grid Environment. In M. Parashar, editor, Grid Computing—GRID 2002, Third International Workshop, Lecture Notes in Computer Science, Vol. 2536, pp. 232–242, Springer Verlag, Baltimore, MD, USA, Nov 2002.
[33] A. Petitet, S. Blackford, J. Dongarra, B. Ellis, G. Fagg, K. Roche, and S. Vadhiyar. Numerical Libraries and the Grid. The International Journal of High Performance Computing Applications, 15(4):359–374, Nov 2001.
[34] T. Decker and R. Diekmann. Mapping of Coarse-Grained Applications onto Workstation Clusters. In Proceedings of the 5th EUROMICRO Workshop on Parallel and Distributed Processing, PDP'97, 1997.
[35] D. P. Spooner, S. A. Jarvis, J. Cao, S. Saini, and G. R. Nudd. Local Grid Scheduling Techniques Using Performance Prediction. In IEEE Proc. Comp. Digit. Tech., 150(2):87–96, 2003.
[36] S. A. Jarvis, D. P. Spooner, H. N. Lim Choi Keung, J. R. D. Dyson, L. Zhao, and G. R. Nudd. Performance-based Middleware Services for Grid Computing. Autonomic Computing Workshop, Fifth Annual International Workshop on Active Middleware Services (AMS'03), Seattle, Washington, Jun 25, 2003.
[37] J. Cao, D. J. Kerbyson, E. Papaefstathiou, and G. R. Nudd. Modeling of ASCI High Performance Applications Using PACE. In Proceedings of the 15th Annual UK Performance Engineering Workshop, Bristol, UK, pp. 413–424, 1999.