IEEE TRANSACTIONS ON COMPUTERS, VOL. 37, NO. 9, SEPTEMBER 1988

Dynamic Remapping of Parallel Computations with Varying Resource Demands

DAVID M. NICOL AND JOEL H. SALTZ, MEMBER, IEEE

Abstract—A large class of computational problems is characterized by frequent synchronization and computational requirements which change as a function of time. When such a problem is solved on a message passing multiprocessor machine, the combination of these characteristics leads to system performance which deteriorates in time. Performance can be improved with periodic redistribution of computational load; however, redistribution can exact a sometimes large delay cost. We study the issue of deciding when to invoke a global load remapping mechanism. Such a decision policy must effectively weigh the costs of remapping against the performance benefits, and should be general enough to apply automatically to a wide range of computations. We propose here a general remapping decision heuristic, then study its effectiveness and its anticipated behavior on two very different models of load evolution. Assuming only that the remapping cost is known, this policy dynamically attempts to minimize system degradation (including the cost of remapping) per computation step. The policy is quite simple, choosing to remap when the first local minimum in the degradation function is detected. Simulations show that our policy provides significantly better performance than that achieved by never remapping. We also observe that the average interremapping frequency is quite close to the optimal fixed remapping frequency. This phenomenon is explained, in part, analytically. We show that when performance deteriorates monotonically (in expectation), the heuristic is well motivated. Finally, using a tractable model of load evolution, we compare our policy's performance to provably optimal performance.

Index Terms—Adaptive computation, dynamic load balancing, multiprocessors, optimal partitioning, parallel computation, remapping.

I. INTRODUCTION

MANY computational problems assume a discrete model of a physical system, and calculate a set of values for every domain point in the model. These values are often functions of time, so that it is intuitive to think of the computation as marching through time. When such a problem is mapped onto a message passing multiprocessor machine, or a shared memory machine with fast local memories, regions of the model domain are assigned to each processor. The running behavior of such a system is often characterized as a sequence of steps, or iterations.

Manuscript received July 17, 1986; revised January 22, 1987. This work was supported by the National Aeronautics and Space Administration under NASA Contracts NAS1-17070 and NAS1-18107, while the authors were in residence at ICASE, Langley Research Center, Hampton, VA.
D. M. Nicol is with the Department of Computer Science, College of William and Mary, Williamsburg, VA 23185.
J. H. Saltz is with the Department of Computer Science, Yale University, New Haven, CT.
IEEE Log Number 8718426.


During a step, a processor computes the appropriate values for its domain points. At the step's end, it communicates any newly computed results required by other processors. Finally, it waits for the other processors to complete their computation step and send it the data required for the next step's computation. The computational work associated with each portion of a problem's subdomain may change over the course of solving the problem. This can occur when the behavior of the modeled physical system changes with time; it may also occur in problems without explicit time dependence. For example, during the course of solving a problem, more work may be required to resolve features of the emerging solution. Since time stepping is often used as a means for obtaining a steady-state solution, there is considerable overlap between the above mentioned categories. We call these types of problems varying demand distribution problems. Because of the synchronization between steps, the system execution time during a step is effectively determined by the execution time of the slowest, or most heavily loaded, processor. We can then expect system performance to deteriorate in time, as the changing resource demand causes some processor to become proportionally overloaded. One way of dealing with this problem is to periodically redistribute, or remap, load among processors.
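The synchronized-step structure just described can be made concrete with a small sketch. The following Python fragment is our illustration (the paper itself contains no code): the system-wide step time is the maximum of the per-processor step times, and each processor's idle time is its gap to that maximum.

```python
# Illustrative sketch of one globally synchronized step (not from the paper).
# work[j] is the compute-plus-communicate time processor j needs this step.

def step_times(work):
    """Return (system step time, per-processor idle times)."""
    t_max = max(work)                  # global synchronization: wait for the slowest
    idle = [t_max - t for t in work]   # time each processor spends waiting
    return t_max, idle

t_max, idle = step_times([3.0, 5.0, 4.0, 5.5])
print(t_max, idle)  # 5.5 [2.5, 0.5, 1.5, 0.0]
```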


Changing distributions of computational work over a domain arise through the use of adaptive methods in the solution of hyperbolic partial differential equations. These solutions place extra grid points in some regions of the problem domain in order to resolve all features of the solution to the same accuracy [7], [8], [24], [37], [28], [51]. A number of studies have investigated methods for redistributing load in message passing multiprocessors for this type of problem [6], [25]-[27]. Changing distributions of computational work can also occur when vortex methods are applied to the numerical simulation of incompressible flow fields. In these methods, inviscid fluid dynamics is modeled by parcels of vorticity which induce motion in one another [1], [31]. The number of vortices corresponding to a given region in the domain varies during the course of the solution of a problem. Methods for dynamically redistributing work in this problem have been investigated [2]. In multirate methods for the solution of systems of ordinary differential equations [52], different variables in the system of equations are stepped forward with different time steps. The size of each time step in the system is generally equal to that of a globally defined largest time step divided by an integer. The sizes of the different time steps utilized may vary during the course of solving the problem, and hence the computational work associated with the integration of a given set of variables may change. Another class of problems which may have varying resource demands are adaptive methods for solving elliptic partial differential equations, where iterations on a sequence of adaptively defined meshes are carried out [4], [32], [10], [3], [53]. Generally, both the total amount of computational work required by each of the meshes and the distribution of work within the domain change as one moves from one mesh to the next.

There are also nonnumerical parallel computations which can exhibit varying computational requirements. In time-driven discrete event simulations [18], one simulates the interactions over time of a set of objects. Responsibility for a subset of objects is assigned to each processor. Over the course of the simulation, subsets of objects may differ in activity, and hence in their computation requirements. This problem may also arise in parallel simulations which proceed in a loosely synchronized manner, such as those described in [11], [13], [30], [36], [38], [34].

There are two fundamentally different approaches to remapping. The decentralized load balancing approach is usually studied in the context of a queueing network [17], [19], [33], [46], [47], [49], [50]. Load balancing questions are then focused on "transfer policies" and "location policies" [17]. A transfer policy governs whether a job arriving at a service center is kept or is routed elsewhere for processing. A location policy determines which service center receives a transferred job. Decentralized balancing is the natural approach when jobs are independent, and a global view of balancing would not yield substantially better load distributions. A large class of computations is not well characterized by a job arrival model, and it may be advantageous to take a global, or centralized, perspective when balancing. We will call a global balancing mechanism "mapping" to distinguish it from the localized connotations of the term load balancing. A centralized mapping mechanism can exploit full knowledge of the computation and its behavior. Furthermore, dependencies between different parts of a computation can be complex, making it difficult to dynamically move small pieces of the computation from processor to processor in a decentralized way. Global mapping is natural in a computational environment where other decisions are already made globally, e.g., convergence checking in an iterative numerical method. Yet the execution of a global mapping algorithm may be costly, as may be the subsequent implementation of the new workload distribution. A number of authors have considered global mapping policies under varying model assumptions; for example, see [12], [16], [27], [48], [5], [9]. A comparison between global and decentralized mapping strategies is reported in [29].

For the types of problems we describe, remapping the load with a global mechanism is tantamount to repartitioning the set of model domain points into regions, and assigning the newly defined regions to processors. Mapping algorithms of this sort are studied in [6], [20], and [21], and the performance of the mapping algorithm proposed in [6] in the context of vortex methods is investigated in [2]. Decision policies determining when a load should be remapped thus become quite important.


The overhead associated with remapping can be high, so it is important to balance the overhead cost of remapping against the expected performance gain achieved by remapping. While this is a generic problem, the details of load evolution, of the remapping mechanism, and of the various overhead costs are system and computation dependent. In order to study general properties of remapping decision policies, it is necessary to model the behavior of interest, and evaluate the performance of decision policies on those models. Remapping is treated this way in [35] under the assumption that the parallel computation has multiphase behavior.

In the problems we are considering here, we assume that the computation to be performed can be partitioned into a sequence of sets, each set composed of a large number of operations. These sets of computations are separated by global synchronization points, and the computations in each set may be performed in an arbitrary order. Each computation in a set requires information contained in data structures assigned to a particular local memory. The mapping of these data structures to processors determines the load balance obtained between any two synchronization points. The problem of load imbalance arises because the changing distribution of the computational workload is bound to the static distribution of the data structures in local memories. Remapping then requires a reassignment of these data structures to processors to better balance the workload.

Other general classes of computational patterns are also commonly encountered in scientific computation. In these cases as well, data dependencies are known to be of a given general form, but the structure of the data dependency graph may vary markedly with the problem and may often only be determined during the execution of the program. A variety of methods for run-time mapping and scheduling of these moderately irregular problems have been proposed; a partial list of references includes [23], [41], [43]. An overview by the present authors, introducing some of the ideas developed in this paper, is presented in [42].

A related body of literature treats determination of the optimal checkpoint interval in database systems (see [22] and its references). The general problem of deciding when to expend one cost to forestall another is similar to the mapping problem; however, the models of transaction processing systems are significantly different from our intuitive notions of parallel computations. In particular, our degradation costs (due to not remapping) are accumulated in a different fashion than the costs of database failures.

A global remapping decision policy must effectively weigh the costs of remapping against the performance benefits. In general, an optimal decision policy for this problem is intractable to calculate; furthermore, its optimality is quite dependent on the assumptions modeling the computation's behavior. While we do show that the optimal decision policy can be found for a simple model of problem behavior, we eschew proposing such an approach in a general setting. Like virtual memory management, scheduling, and other system functions, remapping in the face of unknown future behavior should be automatically performed by general purpose software [41]. We propose a simple remapping decision heuristic which assumes only that the processors periodically synchronize, that


the delay cost of remapping is known, and that run-time performance can be cheaply measured. This heuristic assumes no particular model of load evolution and consequently does not require estimation of any load evolution model parameters. The policy chooses to remap at the first detected local minimum of a dynamically measured system degradation function, and is consequently called the stop at rise, or SAR, policy. We study the effectiveness and anticipated behavior of SAR on two different stochastic models.

The main contribution of this paper is to show that SAR is effective despite its simplicity. Simulations on two very different models of program behavior show that SAR consistently improves performance over that of not remapping at all. We also observe that the average number of steps between two SAR remappings is extremely close to the optimal fixed remapping interval; we are able to explain this phenomenon, in part, analytically. Finally, we derive the optimum remapping policy under the more tractable of our two program behavior models, and compare its performance to SAR's. We find that SAR does capture much of the performance gain possible by remapping.

This paper is organized as follows. Section II describes the two models of computational variation, and shows how they capture the drifting computational load phenomenon. Section III introduces the SAR heuristic, and discusses its performance. Section IV analytically links SAR's observed behavior with the optimal fixed interval policy behavior, and Section V examines remapping on the MUM model in greater analytic detail. Section VI summarizes our results, and the appendixes treat technical issues in detail.

II. MODELS OF LOAD VARIATION

In order to demonstrate that our remapping heuristic is not tied intimately to a single model of load variation, we will study its performance under two different models. The multiple Markov chains (MUM) model assumes that a processor's load variation behaves as a stochastic birth-death process, independent of any other processor's. The MUM model also implicitly ties transitional behavior to the processor rather than to the domain. These assumptions may not be realistic, but they do provide a degree of tractability which we exploit in Section V. The second, more realistic model is called the load dependency (LD) model. This model's workload evolves by the direct simulation of migrating computational demands in a physical domain. By studying a single remapping policy on these two distinct models, we achieve a certain degree of confidence that the studied policy is applicable in a range of situations.

Memory management policies for multiprogrammed uniprocessor systems have been successfully evaluated using a number of stochastic models to reflect the memory requirements of typical programs [15], [45]. In these models, the principle of memory reference locality plays a central role. Evaluating remapping policies for message passing machines is somewhat similar in spirit to the evaluation of uniprocessor paging algorithms. The principle of locality that our models attempt to capture is the locality of resource demand. Because of this locality, a given processor's workload will often vary in a gradual fashion. We will evaluate remapping policies on two stochastic models which exhibit gradual load evolution. In both models, the computational work is performed during a sequence of steps where processors synchronize globally after every step. Both models employ Markovian assumptions; the first model we discuss is significantly simpler (and more analytically tractable) than the other.

A. Multiple Markov Chains Model

A general approach to scientific computations that involve computations on a domain is to assign each processor a region of the problem domain. At each step, the processor needs to perform a certain amount of computation related to that region; e.g., in common solution methods for partial differential equations (PDE's), a region will consist of a number of grid points at which numerical approximations are calculated. Completing a step, a processor may send messages reporting newly calculated results (e.g., values at grid points on the region's boundary). Before advancing to the next step, the processor synchronizes with the others. Depending on the algorithm and problem behavior, the computational requirements of a region may vary gradually from step to step. A good example of this phenomenon is adaptive regridding; a region may consist of a few thousand grid points, and the regridding of a small subregion may add or subtract a few hundred grid points.

The MUM model characterizes load variation by modeling the work demands on a processor using a Markovian birth-death process. The state s of the chain is a positive integer describing the execution and communication time expended by the processor at a step. For the sake of model tractability, we place a limit L on the amount of time a processor may expend on a step; s may be any integer between 1 and L.¹ We will always assume that L is odd, so that a "middle" state (L + 1)/2 is always defined. The transition probabilities out of a state reflect the principle of computational locality: all transitions out of s are either to s − 1, s, or s + 1. When s is between 2 and L − 1, the transition probability to s + 1 is taken to be p/2, the transition probability to s − 1 is also taken to be p/2, and the probability of not changing state is 1 − p. The value of p models the load evolution rate: a low value corresponds to a slowly evolving computation. For s = 1 or s = L, we suppose the state remains the same with probability 1 − p/2, and moves to the single neighboring state with probability p/2. The processors are modeled by a collection of independent, identically distributed Markov chains. Both of these statistical assumptions can be relaxed; we use them primarily for the analysis in Section V.

We let T_j(n) represent the state of processor j at step n, i.e., the time required by the jth processor to complete the nth step. Assuming N processors, the time required for the system as a whole to complete the nth step is determined by global synchronization, and is consequently given by

    T_max(n) = max_{1≤j≤N} T_j(n).

¹ None of our results critically depend on the lower limit being 1; any odd lower bound will suffice.

The average processor execution time during the nth step is

    T̄(n) = (1/N) Σ_{j=1}^{N} T_j(n),

and the average time that a processor is idle due to synchronization during the nth step is consequently T_max(n) − T̄(n). Finally, we assume that the states of every chain at step 0 are identical.

An intuitive feel for the behavior of the MUM model is gained by examining graphs of the performance it generates. Fig. 1(a) depicts the behavior of the MUM model for varying numbers of chains. The performance shown is the average processor utilization as a function of step, with the average taken over 500 simulated sample paths where p = 0.5 and each chain has 19 states. Performance declines more quickly and to a lower level as increasingly larger numbers of processors are simulated. For a given number of chains, the decline arises because the average T_max(n) tends to increase in n while the average T̄(n) remains relatively constant. Fig. 1(b) depicts the performance of single sample paths of the MUM model using varying numbers of chains, where as before p = 0.5 and each chain has 19 states. Note that the decline holds only in the sense of a long-term trend; each curve has many local maxima and minima. This point is particularly important, because any dynamic real-time remapping policy mechanism is concerned with the current sample path defined by the computation's execution.

When we remap a MUM workload, we assume that the load is redistributed by summing all state values, and assigning the average state value (actually, appropriate integer ceilings and floors of the average) to each processor. Such a redistribution is possible if a processor's state value of s means there are s independent pieces of work, each requiring 1 unit of execution time. We also assume that p is not affected by balancing the workload. This assumption is needed for the analysis in Section V, but is relaxed in our second model.
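A minimal simulation of the MUM model, matching the description above, might look as follows. This is our illustrative sketch, not code from the paper; the chain states, the transition rule, and the utilization statistic T̄(n)/T_max(n) follow the definitions in this section, and the parameter values mirror Fig. 1 (19 states, p = 0.5). The seed and step count are our choices.

```python
import random

def mum_step(state, p, L):
    """One birth-death transition of a single MUM chain with states 1..L."""
    if state == 1 or state == L:
        # boundary states: stay with probability 1 - p/2, else move inward
        if random.random() < p / 2:
            return state + 1 if state == 1 else state - 1
        return state
    r = random.random()
    if r < p / 2:
        return state - 1          # transition to s - 1 with probability p/2
    if r < p:
        return state + 1          # transition to s + 1 with probability p/2
    return state                  # stay with probability 1 - p

def simulate(N=8, L=19, p=0.5, steps=120):
    states = [(L + 1) // 2] * N   # every chain starts in the middle state
    utilization = []
    for _ in range(steps):
        states = [mum_step(s, p, L) for s in states]
        t_max = max(states)       # system step time under global synchronization
        t_bar = sum(states) / N   # average processor execution time
        utilization.append(t_bar / t_max)
    return utilization

random.seed(0)
print(simulate()[-1])   # utilization after 120 steps; declines as N grows
```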

B. Load Dependency Model

While the MUM model is analytically tractable, some of its assumptions may not be realized in practice. The load dependency (LD) model is more realistic, directly simulating the evolving spatial distribution of computational load in a domain. We consider a two-dimensional plane in which activity occurs, for example, a factory floor. To simulate this activity we impose a dense regular grid upon the plane; each square of the grid defines an activity point. We suppose that activity in the plane is discretized in simulation time, and model the behavior of activity as follows. Each time step, a certain amount of activity may occur at an activity point. This activity is simulated (for example, arrival of parts to a manufacturing assembly station), causing a certain amount of computation. By the next time step, some of that activity may have moved to neighboring activity points. This movement of activity simulates the movement of physical objects in a physical domain, and is modeled by the movement of work units. A work unit is always positioned at some activity point, and has a weight describing its computational demand at that activity


point. From one time step to the next, a work unit at activity point x either remains at x, or moves to the north, east, south, or west neighboring activity point, with probabilities p_C, p_N, p_E, p_S, and p_W, respectively. The LD model also allows for the injection of new work units into portions of the domain (using "creation" probability distributions at every step) and for activity sinks which probabilistically remove work units.

The LD model is balanced and remapped using binary dissection [6] to partition the activity points into N rectangular activity regions. An activity point's weight is taken to be the sum of its work unit weights at the time of partitioning. A processor's load at a time step is found by adding the weights of all work units resident on activity points assigned to that processor. In a wide variety of problems, data dependencies are quite local. Decomposition of a domain into contiguous regions with a relatively small perimeter to area ratio is thus generally desirable for reducing the quantity of information that must be exchanged between partitions. Furthermore, due to the local nature of the data dependencies, the communication required between partitions in a binary dissection will generally be greatest between partitions that are in physical proximity. The analysis in [6] shows that this type of partition is effective for static remapping, and is easily mapped onto various types of parallel architectures. Estimates are also obtained of the communication costs incurred when the resulting partitions are mapped onto a given architecture. The cost estimates obtained by such analysis are inevitably problem, mapping, and architecture dependent. Binary dissection is briefly described in Appendix A.

The transitional behavior of work units across activity region boundaries causes processor loads to vary. Work unit movement simulates workloads generated by a variety of problems. Unlike the MUM model, a processor's load evolution in the LD model is explicitly dependent on its own load, and on the loads of processors with neighboring activity regions. Because of global synchronization, the time required by the system to complete a time step is again the maximum computation time among all processors. As in the MUM model, any initial balance will disappear as time progresses, and average processor utilization will drop. The imbalance is particularly severe if the work unit movement probabilities are anisotropic, eventually forcing most of the work units onto a small number of processors.
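The work-unit migration that drives the LD model is easy to sketch. The fragment below is our illustration, not the authors' simulator: it moves one work unit per point on a 64 × 64 grid with the anisotropic probabilities used for Fig. 4; the stay probability p_C = 0.7 and the clamping of moves at the domain boundary are our assumptions.

```python
import random

MOVES = {"C": (0, 0), "N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def move(unit, probs, size):
    """Move one work unit (x, y); clamping at the boundary is our assumption."""
    r, acc = random.random(), 0.0
    for key, p in probs.items():
        acc += p
        if r < acc:
            dx, dy = MOVES[key]
            return (min(max(unit[0] + dx, 0), size - 1),
                    min(max(unit[1] + dy, 0), size - 1))
    return unit

random.seed(0)
size = 64
# Fig. 4 probabilities: up 0.1, right 0.1, down 0.05, left 0.05; stay otherwise.
probs = {"N": 0.1, "E": 0.1, "S": 0.05, "W": 0.05, "C": 0.7}
units = [(x, y) for x in range(size) for y in range(size)]  # one unit per point

for _ in range(100):                    # 100 time steps of activity movement
    units = [move(u, probs, size) for u in units]

west = sum(1 for (x, _) in units if x < size // 2)
print(west, len(units) - west)          # work drifts toward the east (and north)
```

A processor's load at any step is then the total weight of the work units inside its rectangular activity region (all weights are 1 in this sketch), with the regions produced by the binary dissection described above.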

III. STOP AT RISE DECISION POLICY

One possible approach to dynamic remapping is to construct an accurate model of a problem's run-time behavior, and attempt to determine the optimal remapping decision policy for that model. In fact, in Section V we show how to determine the optimal policy for the MUM model. Determining the optimal policy requires estimation of MUM model parameters; furthermore, the policy calculation is intractable. These problems can only be exacerbated in more detailed models; moreover, an optimal policy for a particular model may not benefit a computation which seriously violates the model assumptions. Because of the required model detail and run-time estimation of model parameters, we do not feel that


Fig. 1. MUM model. (a) Expected average processor utilization as a function of step, estimated from 500 sample paths. Each chain has 19 states, p = 0.5. (b) Average processor utilization as a function of step (single sample path). Each chain has 19 states, p = 0.5.


this approach can work in a general setting. Instead, dynamic remapping must ultimately be treated much as virtual memory management is now treated: at the operating system level, for general programs, and transparent to the user. While constructs such as MUM and LD are useful for evaluating remapping policies which are largely independent of the models, we should not make the mistake of designing a general purpose remapping policy which is closely coupled to model details.

In this section, we introduce SAR, a "greedy" remapping policy, which attempts to minimize the average long-term processor idle time since the last remapping. SAR assumes only that the processors synchronize globally at every step, that the delay cost of remapping is known, and that each processor's idle time can be cheaply measured. The average processor idle time computed by SAR includes the time C spent in one remapping operation. Supposing that the load was last remapped n steps ago (say, just before step 1), the average processor idle time per step achieved by remapping immediately is denoted

    W(n) = ( Σ_{j=1}^{n} [T_max(j) − T̄(j)] + C ) / n,

recalling that T_max(j) is the time required by the system to complete the jth step, and that T̄(j) is the average time required by a processor to complete the jth step.² W(n) explicitly embodies two of the costs a remapping policy must manage: the delay cost of remapping once, and the idle time cost of not remapping. However, under SAR's greedy philosophy, this latter cost is a past cost of not remapping. Optimal dynamic remapping policies must consider future costs of not remapping.

² These definitions were introduced in terms of MUM state values, but are extended here to general processing times.

It is instructive to examine the behavior of W(n) as a function of n. Fig. 2(a) plots W(n) for single sample paths of the MUM model when p = 0.5, C = 8, and the chains have 19 states. Paths from 2, 8, and 32 chain systems are shown. Fig. 2(b) plots the expected value E[W(n)] under these same parameter values. Similar plots are obtained using the LD model. Both graphs show W(n)'s marked propensity to drop, be minimized over some section of its domain, and then begin and continue to rise. This behavior can be explained in terms of W(n)'s definition. There is a tendency for W(n) to decrease in n, as increasing n amortizes the cost C over a larger number of computation steps. But there is also a tendency for W(n) to increase in n, as we may expect Σ_{j=1}^{n} T̄(j)/n to remain relatively constant, and Σ_{j=1}^{n} T_max(j)/n to increase in n. The tendency to decrease dominates initially, but is eventually superseded. Note again that this tendency is an expected tendency. The precise behavior of W(n) for a given sample path will vary with that sample path. Whereas it is reasonable to expect that E[W(n)] has a single local minimum (a topic we discuss in Section IV), the values of W(n) for a given sample path may exhibit multiple local minima. Clearly, we do not have the option of remapping at a time step n̂ with any

assurance that W(n̂) will minimize the statistic. Instead, we will remap after the first n̂ such that W(n̂) > W(n̂ − 1), i.e., when the first local minimum is detected. Note in Fig. 2(a) that the value of W(n̂) at this point is usually extremely close to the true minimum value of W(n). This policy of remapping when a local minimum of W(n) is first detected is called the SAR (stop at rise) policy.

By simulating the MUM and LD models, we compared SAR's performance to the policy which never remaps and to the "remap every m steps," or fixed interval, policy. A fixed interval policy is insensitive to statistical variations in a system's performance, and requires (potentially impractical) pre-run-time analysis to determine an effective periodicity. In Fig. 3, we use the MUM model to compare the performance obtained through the use of 1) SAR, 2) the fixed interval policy for a wide range of periods, and 3) never remapping. The performance of an eight-processor MUM model is depicted. Each problem consists of 400 steps, each data point is obtained through 200 simulations, and remapping costs of 2 and 8 are assumed. For the SAR policy, we plotted average utilization against the measured average number of steps between consecutive remappings. For the fixed interval policy, we plotted performance against the remapping periodicity. The single performance value of never remapping is plotted as a straight horizontal line. It is notable that SAR's performance was comparable to, and in fact slightly higher than, that of the optimal fixed interval policy. Furthermore, the average number of steps between SAR remappings corresponds closely to the optimal fixed periodicity. Similar results were obtained in many other cases using the MUM load model.

Fig. 4 presents similar data for the LD model, where the remapping cost is taken to be 50 and 100 work unit times, and a 64 × 64 mesh of activity points is used (initialized to one work unit per point). The transition probabilities are anisotropic (given in the figure legend), so that the work tends to drift to the upper right portion of the mesh over time. Not taken into account here is the cost of the interprocessor communication that occurs at the end of each step when partitions exchange newly computed results. As was observed in the MUM model, SAR's behavior corresponds closely to that of the optimal fixed interval policy. In Fig. 4, SAR's performance for a given cost is superior to that of the optimal fixed interval policy. In other simulations, SAR's performance was comparable to, but slightly below, that of the optimal fixed interval policy. Note that the performance obtained by SAR in Fig. 4 is much greater than that obtained when no remapping is performed.

The fact that so simple a policy can work well is quite encouraging. Since SAR adapts to statistical variations in the system's behavior, we would hope that it can outperform a nonadaptive policy. In fact, our data show that SAR often outperforms the optimal fixed interval policy. The real significance of this is that SAR does so dynamically, without explicit model assumptions or parameter estimation. If the underlying dynamics of load evolution were to change in time, we may reasonably expect SAR to adapt automatically to such changes. Note also that for both the MUM and LD models, SAR's performance is markedly superior to the performance obtained when no remapping is performed.
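The SAR rule as just defined reduces to a few lines of code. The sketch below is ours (not from the paper); it recomputes W(n) from the per-step degradations T_max(j) − T̄(j) recorded since the last remapping and reports a remap at the first rise.

```python
def sar_should_remap(degradations, C):
    """Stop-at-rise test on W(n) = (sum of degradations + C) / n.

    degradations[j-1] holds T_max(j) - T_bar(j) for the steps since the
    last remapping; C is the known delay cost of one remapping.  True is
    returned at the first detected local minimum, i.e., when W(n) > W(n-1).
    """
    n = len(degradations)
    if n < 2:
        return False
    w_prev = (sum(degradations[:-1]) + C) / (n - 1)
    w_curr = (sum(degradations) + C) / n
    return w_curr > w_prev

# Usage at the end of each step (remap() is a hypothetical hook):
#     history.append(t_max - t_bar)
#     if sar_should_remap(history, C=8.0):
#         remap()
#         history.clear()
```

In a long run one would maintain the running sum incrementally rather than re-summing; the quadratic cost here is purely for clarity.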

Fig. 2. MUM model. (a) Long-term average processor idle time per step W(n) from a single sample path. Each chain has 19 states, p = 0.5, load balancing cost = 8. (b) Expected long-term average processor idle time per step E[W(n)], estimated from 500 sample paths. Each chain has 19 states, p = 0.5, load balancing cost = 8.

Fig. 3. MUM model: performance of SAR compared to performance of periodic remapping. Eight chains, 400 steps, each chain has 19 states, p = 0.5, each data point calculated from 200 sample paths. (Horizontal axis: periodic policy, steps between remappings; SAR, average steps between remappings.)

Fig. 4. LD model: performance of SAR compared to performance of periodic remapping. 64 × 64 activity array initialized with one work unit per activity point. Work unit transition probabilities: up 0.1, right 0.1, down 0.05, left 0.05. Each data point calculated from 50 sample paths. (Vertical axis: relative work for problem / relative time for problem.)

IV. ANALYSIS OF E[W(n)]

The SAR policy attempts to minimize the statistic W(n) between successive remappings. An exact analysis of remapping performance and behavior is difficult for a number of reasons. First, some detailed model of load evolution must be assumed. We do this in Section V for the MUM model, but here we are interested in learning what we can about remapping in general settings. A second problem is that determining performance using SAR's remapping rule is quite difficult (it is related to, but harder than, the "first-crossing-time" problem studied for general stochastic processes [38]). Since empirical data suggest that the value of W(n) at an SAR remapping point is extremely close to the minimal value of W(n) (at least for the MUM and LD models), we may reasonably study the minimization of E[W(n)] and expect our conclusions to hold for SAR's performance. A third difficulty is that the system state achieved by remapping is sensitive to the system state at the point where the remapping rule is invoked; recall how balanced states were constructed for the MUM and LD models. Faced with these problems, we have adopted two simplifying assumptions. 1) We assume that whenever the system is remapped, it returns to the initial balanced state, which we will call S₀. 2) We assume that the computation's behavior is stationary, meaning that the mean average processor execution time at a step is a constant, E[T̄]. Note that if S₀ is defined for the MUM model to be the vector where every processor's state is (L + 1)/2, then assumption 2) is satisfied by symmetry of the chains' transition probabilities.

Applying these assumptions, we will first show that if the expected system step execution time increases as a function of n (the number of steps since the last remapping), then E[W(n)] has at most one local minimum as a function of n. We also show that if E[W(n)] has a minimum at n̂, then the fixed interval decision policy of remapping every n̂ steps minimizes (in a limiting sense) the expected run time over all fixed interval policies, including the policy which prohibits remapping. These results are independent of the MUM and LD models and show that under very general conditions, SAR is a well-motivated heuristic.

For the ith step since the last remapping, let E_i denote the expected difference between the maximum processor step execution time and the average step execution time: E_i = E[T_max(i)] − E[T̄]. It is reasonable to expect that the sequence {E[T_max(i)]} is monotone increasing in i; under this hypothesis, we have the following theorem.

Theorem 4.1: Suppose E[T_max(i)] is monotone increasing in i. Then E[W(n)] has at most one local minimum, and if this minimum exists at n̂, then n̂ is a monotone increasing function of the remapping cost C.

Proof: First note that the monotonicity of E[T_max(i)] implies that E_i is also monotone increasing. Then let δ(n) = E[W(n + 1)] − E[W(n)], and let

    K(n) = n·E_{n+1} − Σ_{i=1}^{n} E_i.

From the definitions of δ(n), E[W(n)], E_i, and K(n), it is straightforward to show that

    δ(n) < 0  iff  K(n) < C,    and    δ(n) > 0  iff  K(n) > C.

Using the observation that E_i is monotone increasing, we show that K is monotone increasing. This follows, since

    K(n + 1) − K(n) = (n + 1)(E_{n+2} − E_{n+1}) ≥ 0.

From this expression, it is apparent that E[W(n)] has at most one minimum (this fact is easily demonstrated using a contradiction argument). The relationship between K(n) and δ(n) shows that the position of this minimum is increasing in C. □

This theorem's hypothesis is quite general, and describes what we expect to be true of varying demand distribution problems: degrading performance as a function of time. Let n̂ be the step which minimizes E[W(n)]. It turns out that under our two idealizing assumptions, n̂ is the optimal (limiting) fixed interval remapping periodicity. To establish this result, we first observe that the expected execution time of the fixed interval policy with period n on a computation with Z steps is
From this expression, it is apparent that E[ W(n)] has at most one minimum (this fact is easily demonstrated using a conand 6(n) tradiction argument). The relationship between I shows that the position of this minimum is increasing in C. 0 This theorem’s hypothesis is quite general, and describes what we expect to be true with varying demand distribution problems: degrading performance as a function of time. Let ; be the step which minimizes E[ W(n)]. It turns out that under our two idealizing assumptions, s is the optimal (limiting) fixed interval remapping periodicity. To establish this result, we first observe that the expected execution time of the fixed interval policy with period n on a computation with Z steps is

LZ/nJ i (E[T,,,,,(i)] + C) + zm$ nE17,,xO~1 r=l

i=l

= LZ/nJ i GWmax(i)l+ 0 i= 1

= LZ/nJ i (E; + C) + Z * I@-1 i=l Zmod

-(Z

mod n) - E[Tl +

c

n

E[T,,d)l.

i=l

The expected average system execution time per step is found by dividing by Z:

i=l

n

+ E[?-] + 0(1/Z)

where O( l/Z) denotes terms which vanish as Z grows large. Letting Z + 03, the ii which minimizes this expression is therefore the optimal (limiting) fixed remapping periodicity. It is clear that h must minimize the first term of this expres-

1082

IEEE TRANSACTIONS

sion, which is simply E[ W(n)]. It is also easy to show that remapping with periodicity i outperforms the limiting performance achieved by never remapping. Let w(n) represent the average loss per step by not remapping and observe that

E[ W(n)] = E[w(n)] +;

.

The difference between E[ W(n)] and E[w(n)] becomes arbitrarily small for increasing n; since E[ W(n)] is increasing in n for n > i, it follows that

V.

OPTIMAL

MUM

MODEL

37, NO. 9. SEPTEMBER

19%

Tj(n + d,sj) = Tj(n + d) given that T,(n)=.sj and Tmax(n + d, SI,

,sN) =

max { Tj(n + d,sj)}. I +sN

Then we have the following theorem. Theorem 5.1: Assume that the Markov chains are homogeneous, and unbounded (no maximum nor minimum state), and that the transition probabilities are unaffected by remapping. Suppose C,“=, Tj(n, Sj) = K. For every i = 1,2, . . , N, define

a,= I

REMAPPING

As with may problems, it is fruitful to compare the performance of a remapping heuristic with the optimal remapping policy. The MUM model is simple enough to allow the formulation of the optimal MUM remapping policy. This section shows how the optimal MIJM policy can be computed, and then compares SAR’s performance to optimal performance.We find that SAR gives reasonable comparative performance. There are two basic issues in remapping, how to remap, and when to remap. We describe optimal policies for both of these issues. We treat the remapping mechanism by demonstrating that if the caps on MUM states are removed, then the obvious technique of assigning equal loads to each processor is optimal. However, we will note generalizations which can cause this technique to be suboptimal.We treat the remapping decision problem in the framework of a Markov decision process. We are then able to symbolically express an optimal remapping decision policy in terms of a solution to a set of equations. However, the complexity of solving these equations grows exponentially in the size of the problem, so that this technique is not useful for any but the smallest sized problems. While it is convenient to assume that remapping always returns the system to some single balanced state So, in reality the new balanced state depends very much on the state at the time of remapping. We balance a MUM system at step n by determining the average processor execution time during step n, and rearrange the load so that every processor now has the average load (subject to the obvious integer ceiling and floor operations). This mechanism is called average value remapping. While the optimality of this policy may seem trivially obvious, later reflection shows that this may not always be the case. For example, if one processor’s computational load tends to evolve faster than others’, we may want to assign

VOL.

it less than the average value load upon remapping. Called padding [44], such a policy accepts slightly decreased immediate performance in order to forestall the active processor from quickly degrading performance. If we assume that all processor loads evolve at the same rate, that the stochastic behavior of the evolution of the workload is not affected by remapping, and that the limitations on processor states are removed, we can show that the average value load policy is optimal. The proof of this claim is somewhat technical, and is provided in Appendix B. To formally state the claim, we define

E[ W(n)] I ,llfnmE[w(n)]. This discussion has proved the following theorem. Theorem 4.2: If E[ W(n)] is minimized at 2, then the limiting optimal fixed interval remapping policy (including the policy which never balances) is the policy which remaps ev0 ery A steps. While recognizing that the assumptions upon which Theorem 4.2 rests may not be met in practice, its statement supports the motivation of SAR: seeking to remap when W(n) is minimized. It also explains in part why SAR appears to remap with the same frequency as the optimal fixed interval policy.

ON COMPUTERS,

LWN_I+1 1 LK/N_l

ifi t} dt 2 Prob{Y > t}dt forallaz0. 5a .ru For our purposes, the following result, also from [38], is important. If Xi, , X, is a group of independent random variables, Y, , , Y, is a group of independent random variables. andX,z, Y;fori = 1,2,...,n, then g(XI,..~,Xn)~”

g(Y,;,..

Yr?)

(3)

for all increasing convex functions g. W e will first show that average value remapping is optimal in a system with two chains. W e observe that if i T, (n) T2(n)l I 1, then there is no benefit to be gained from remapping. W e consequently assume that if we remap at n, then 1T,(n) - Tz(n)i > 1. Our results stem principally from the following lemma. Lemma B-I: Let X and Y be independent, identically distributed, integer-valued random variables. If a > b - 1, then

max{a+X,b+Y}r, = gWx{a

max {a+x,

+ x, b + Y>) -g(max{a

(6)

Tmax(n+ d, a, 6) = max{a + X,(d), b + X*(d)} where X,(d) and X2(d) are independent. It follows immediately from Lemma B-l that T,,,&n + d, a, b) z. Tmax(n+ d, a - 1, b + 1). W e then use this result to show that average value remapping is optimal for an N chain system. Theorem 5.1: Assume that the Markov chains are homogeneous, and unbounded (no maximum nor minimum state), and that the transition probabilities are unaffected by remapping. Suppose Tl(n, s,) + Tz(rz,~2) + . + TN(~, SN) = K. For every i = 1,2, . , N, define

a; = r

LK/NJ + 1

if i b + y. In both cases, the increasing nature of g ensures that D,(x, y) 2 0. The only alternate case is where a + x I b + y , making expression (6) equal to - 1. But X and Y are independent and identically distributed; for every joint sample of X = x and Y = y which causes (6) to be - 1, the joint sample X = y and Y = x is equally likely, and causes expression (6) to be 1. Since y 2 x in this situation, we have max{a + y, b + x} 1 max{a + x, b + y}. It then follows from the increasing convexity of g that Dg( y, x) 1 lD,(x, y)l. Since D, is nonnegative in all other cases, it follows that E[D,(X, Y)] I 0, which proves the lemma. 0 To apply this lemma to our problem we note that if X is the Markov chain single-step random variable, and if X(d) denotes a d-fold convolution of X, then

max{a-l+X,b+l+Y}.

Proof: Let g be any increasing convex function, and let @(x,Y)

W e will argue that E[D,(X, Y)] L 0, which will prove the lemma. W e consider the value of

+

d,

SI ,

. , ~~11

2

E[T,,,,,(n

+

d, al,

, aN)].

Proof: W ithout loss of generality, assume that si 2 s2 2

1086

IEEE TRANSACTIONS

. . 2 sN. Note first that Trn,,(n + d,Sl,...,SN) = max{max{Tl(n

+ d,s~), Th0

+ ~,SAJ)},

* T2(n + d, sz), . . . , TN - l(n + d,sN - l)}. Now max{ Tl@ +d,s~),

Tdn+d,SN))

2, max{Tl(n+d,sl

- l), TN(n+d,.sN+

l)},

so that by (3), L&n

+ d, s1, . . . ,sN)l, T,,,ax(n+ d,s, - I,...,sN

- 1).

This argument applies to any set of SI , . . . , SN such that c:= 1 sj = K. We may therefore apply the argument repeatedly to find that Tn,ax(n + d,sl,...

,sN)&

Trnax(n +

0

REFERENCES “On vortex methods,” SIAM J. 1985. PI S. Baden, “Dynamic load balancing of a vortex calculation running on multiprocessors,” Univ. California, Berkeley, Comput. Sci. Tech. Rep., 1986. [31 R. E. Bank, “A multi-level iterative method for nonlinear elliptic equations,” in Elliptic Problem Solvers, M. Schultz, Ed. New York: Academic 198 1. r41 D. Bai and A. Brandt, “Local mesh refinement multilevel techniques,” Dep. Appl. Math., Weizmann Instit. Sci. Rep., 1984. PI J. A. Bannister and K. S. Trivedi, “Task allocation in fault-tolerant distributed systems,” Acta Informatica, vol. 20, pp. 261-281, 1983. WI M. I. Berger and S. Bokhari, “The partitioning of non-uniform problems,” ICASE Rep. 85-55, Nov. 1985. 171 M. J. Berger and A. Jameson, “Automatic adaptive grid refinement for the Euler equations,” AIAA J., vol. 23, pp. 561-586, 1985. PI M. J. Berger and J. Oliger, “Adaptive mesh refinement for hyperbolic partial differential equations,” J. camp. Phys., vol. 53, pp. 484512, 1984. t91 S. Bokhari, “Partitioning problems in parallel, pipelined, and distributed computing,” ICASE Rep. 85-54, Nov. 1985. UOI A. Brandt, “Multilevel adaptive solutions to boundary value problems,” Math. Computat., vol. 31, pp. 333-390, 1977. r111 K. M. Chandy and J. Misra, “Distributed simulation: A case study in design and verification of distributed programs,” IEEE Trans. Software Eng., vol. SE-5, pp. 440-452, Sept. 1979. r121 W. W. Chu, L. J. Holloway, M. Lan, and K. Efe, “Task allocation in distributed data processing,” Computer, vol. 13, pp. 57-69, Nov. 1980. [I31 A. I. Conception, “Distributed simulation on multi-processors: Specification, design, and architecture,” Ph.D. dissertation, Wayne State Univ., Jan. 1985. [I41 H. A. David, Order Statistics. New York: Wiley, 1981. 1151 P. J. Denning, “Working sets past and present,” IEEE Trans. Software Eng., vol. SE-6, pp. 6484, Jan. 1980. 1161 A. Dutta, G. Koehler, and A. Whinston, “On optimal allocation in a distributed processing environment,” Management Sci., vol. 28, pp. 839-853, Aug. 1982. [I71 D. L. Eager, E. D. Lazowska, and J. Zahorjan, “Adaptive load sharing in homogeneous distributed systems,” IEEE Trans. Software Eng., vol. SE-12, pp. 662-675, May 1986. [I81 G. S. Fishman, Principles of Discrete Event Simulation. New York: Wiley, 1978. r191 G. J. Foschini, “On heavy traffic diffusion analysis and dynamic routing in packet switched networks, ” in Computer Performance, K. M. Chandy and M. Reiser Eds. New York: North-Holland, 1977. [201 G. C. Fox and W. Furmanski, “Load balancing by a neural network,” Cal. Tech Rep. C3P 363, Sept. 1986. t211 G. C. Fox and S. W. Otto, “Concurrent computation and theory of C. Anderson and C. Greengard,

Numer. Anal., vol. 22, pp. 413-440,

VOL.

37. NO. 9. SEPTEMBER

1988

complex systems,” Cal Tech Rep. C3P 255, CALT-68-1343, Mar. 1986. 1221 E. Gelenbe, “On the optimum checkpoint interval,” J. ACM, vol. 26, pp. 259-270, Apr. 1979. 1231 A. George, M. T. Heath, J. Liu, and E. Ng, “Sparse Cholesky factorization on a local-memory multiprocessort,” Oak Ridge Tech. Rep. ORNLITM-9962. r241 W. D. Gropp, “Local uniform mesh refinement with moving grids,” Yale Tech. Rep. YALEUIDCSIRR-313. Aor. 1984. “Local uniform mesh refinement on loosely-coupled parallel r251 -, processors,” Yale Tech. Rep. YALEUIDCSIRR-352, Dec. 1984. “Dynamic grid manipulation for PDE’s on hypercube parallel 1261 -, processors,” Yale. Tech. Rep. YALEU1DCSRR-458, Mar. 1986. 1271 D. Gustield, “Parametric combinatorial computing and a problem of program module distribution,” J. ACM, vol. 30, pp. 551-563, July 1983. 1281 A. Harten and J. M. Hyman, “Self-adjusting grid methods for onedimensional hyperbolic conservation laws,” Los Alamos Rep. LA9105, 1981. 12% M. A. Iqbal, J. H. Saltz, and S. H. Bokhari, “Performance tradeoffs in static and dynamic load balancing strategies,” in Proc. 1986 Int.

Conf. Parrallel Processing. d>al,...,aN).

The theorem’s conclusion follows immediately.

111

ON COMPUTERS,

1301 D. R. Jefferson, “Virtual time,” A C M Trans. Programming Languages Syst., vol. 7, pp. 404415, July 1985. [311 A. Leonard, “Vortex methods for flow simulation.” J. Comp. Phys., vol. 37, pp. 289-335, 1980. 1321 S. McCormick and J. Thomas, “The fast adaptive composite grid method for elliptic equations,” Math. Computat., vol. 46, pp. 439456, 1986. [331 L. M. Ni, C. Xu, and T. B. Gendreau, “A distributed drafting algorithm for load balancing,” IEEE Trans. Software Eng., vol. SE-l 1, pp. 1153-1161, Oct. 1985. [341 D. M. Nicol and P. F. Reynolds, Jr., “The automated partitioning of simulations for parallel execution,” Univ. Virginia Dep. Comput. Sci. Tech. Rep. TR-85-15, Aug. 1985. “Dynamic remapping decisions in multi-phase parallel compuI351 tations,” ICASE Rep. 86-58, Sept. 1986. (361 J. K. Peacock, E. Manning, and J. W. Wong, “Synchronization of distributed simulation using broadcast algorithms,” Comput. Networks, vol. 4, pp. 3-10, 1980. 1371 M. M. Rai and T. L. Anderson, “The use of adaptive grid generation method for transonic airfoil flow calculations,” AIAA Paper 81. 1012, June 1981. 1381 P. F. Reynolds, Jr., “A shared resource algorithm for distributed simulation,” in Proc. Ninth Annu. Int. Comput. Architecture Conf., Austin, TX, Apr. 1982, pp. 259-266. 1391 S. Ross, Applied Probability Models with Optimization Applications. San Francisco, CA: Holden and Day, 1971. Stochastic Processes. New York: Wiley, 1983. 1401 -, 1411 J. H. Saltz and M. C. Chen, “Automated problem mapping: The crystal runtime system,” in Proc. Hypercube Microprocessor Conf., Knoxville, TN, Sept. 1986. 1421 J. H. Saltz and D. M. Nicol, “Statistical methodologies for the control of dynamic remapping,” ICASE Rep. 8646, June 1986; Proc.

Army Res. Workshop Parallel Processing Medium Scale Multiprocessors, Palo Alto, CA, Jan. 1986. 1431 I. H. Saltz, “Automated nization delay effects,” publication. R. Smith and J. Saltz, WI ing mesh control,” in

problem scheduling and reduction of synchroICASE Rep. 87-22, May 1987, submitted for “Performance

analysis of strategies for mov-

Proc. C M G X V Int. Conf. Management Perform. Eval. Comput. Syst., 1984, pp. 301-308. Program Behavior: Models and Measurements. New J. R. Spim, 1451 York: Elsevier North-Holland, 1977. J. A. Stankovic, “An application of Bayesian decision theory to decentralized control of job scheduling,” IEEE Trans. Cornput., vol. C- 34, pp. 117-130, Feb. 1985. 1471 J. A. Stankovic, K. Ramamritham, and S. Cheng. “Evaluation of a flexible task scheduling algorithm for distributed hard real-time systems,” IEEE Trans. Cornput., vol. C-34, pp. 1130-1143, Dec. 1985. 1481 H. S. Stone, “Critical load factors in distributed computer systems,” IEEE Trans. Software Eng., vol. SE-4, pp. 254-258, May 1978. 1491 D. Towsley, “Queueing network models with state-dependent routing,” J. ACM, vol. 271 pp. 323-337, Apr. 1980. [501 A. N. Tantawi and D. Towsley, “Optimal static load balancing,” J. ACM, vol. 32, pp. 445-465. Apr. 1985. [461

NICOI

AND SALTZ:

DYNAMIC

REMAPPING

OF PARALLEL

COMPUTATIONS

t511

W. Usab and E. M. Murman, “Embedded mesh solutions of the Euler equation using a multiple-grid method,” in Proc. AIAA Comput. Fluid Dynam. Conf, Danvers, MA, July 1983, paper 83-1946. [=I D. R. Wells, “Multirate linear multistep methods for the solution of systems of ordinary differential equations.” Univ. Illinois Rep. UIUCDCS-R- 82-1093, July 1982. I531 0. C. Zienkiewicz and A. W. Craig, “Adaptive mesh refinement and a posteriori error estimation for the p-version of the finite element method, in Adaptive Computational Methods for Partial Differential Equations, Ivo Babuska, Ed. Philadelphia, PA: SIAM, 1983.

David M. Nicol received the B.A. degree (magna cum laude, Phi Beta Kappa) in mathematics from Carleton College, Northfield, MN, in 1979, and the M.S. and Ph.D. degrees in computer science from the University of Virginia, Charlottesville, in 1983 and 1985. From 1979 to 1982, he was a Programmer/ Analyst with the Information Sciences Division of Control Data Corporation, Minneapolis, MN. There he designed and implemented real-time parallel image processing systems. From 1985 to 1987 he was a Staff Scientist at the Institute for Computer Applications in Science and Engineering (ICASE), where he presently consults. Currently he is an Assistant Professor in the Department of Computer Science at the College of William and Mary, Williamsburg, Va. His research interests are primarily in the design and analysis of static and dynamic methods for problem mapping onto parallel computers. and in the probabilistic analysis of searching algorithms.

1087 Dr. Nicol is a member of the IEEE Computer Society, the Association for Computing Machinery. and ORSA.

Joel H. Saltz (S-85-M’85) received the B.A. degree in mathematics and physics from the University of Michigan, Ann Arbor, in 1977 (magna cum laude, Phi Beta Kappa), the M.A. degree in mathematics from Michigan in 1978. and the joint M.D.-Ph.D. degrees in computer science from Duke University, Durham, NC, in 1985. He was a Researcher at Scientific Applications. Inc. (1978-1979) and worked on the simulation of acoustic propagation in rough seas. From 19851986 he was a Staff Scientist at the Institute for Computer Applications in Science and Engineering (ICASE) at NASA Langley Research Center. Since 1986 he has been Assistant Professor in the Department of Computer Science, Yale University, New Haven. CT. and is a frequent visitor to ICASE. His research interest include the development and performance evaluation of automated methods for problem mapping on parallel machines, and in parallel algorithms for scientific and medical computation. Dr. Saltz is a member of the Association for Computing Machinery, the American Medical Society for Medical Informatics, and Sigma Xi.
