Dynamic Resource Allocation of Computer Clusters With Probabilistic Workloads

Robert Sheahan, Marwan Sleiman, Lester Lipsky
Department of Computer Science and Engineering
University of Connecticut
Storrs, CT 06269-2155

Pierre Fiorini
Department of Computer Science
University of Southern Maine
Portland, Maine
Abstract

In many parallel processing systems, particularly real-time systems, it is desirable for jobs to finish as close to a target time as possible. This work examines a method of controlling the variance of job completion times by dynamically allocating resources to jobs that are behind schedule and taking resources from jobs that are ahead of schedule. Emphasis is placed on where variance enters the system and how well it can be controlled. Controlling the variance also helps meet deadlines, because the number of standard deviations between the target and the deadline determines the probability of missing the deadline.
1 Introduction

In a parallel computing environment, multiple processors (also called processing elements, or PEs) work on the tasks that make up a job. In many situations, particularly in many real-time applications, it is desirable to have a job complete at or near a specific time, called a target time. Target times are related to deadlines: the average completion time should fall at the target time, and the target time should lie some number of standard deviations before the deadline; that number of standard deviations determines the probability of missing the deadline. If the task times, or at least upper bounds on the task times [Manimaran98], are known exactly ahead of time, this becomes a scheduling problem for which, in principle, the optimum solution can be found. However, when the needs of the tasks are only known probabilistically, there is no hope for an optimal solution; one looks instead for schemes that will do well on average.
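To make the connection between the target, the deadline, and the miss probability concrete, the short Python sketch below estimates the probability of missing a deadline when the completion time is treated, purely for illustration, as normally distributed about the target. The normal approximation and the particular numbers are assumptions added here, not part of the model above.

    from math import erf, sqrt

    # Illustrative only: probability of finishing after the deadline when the
    # completion time is approximately N(target, std_dev^2).
    def miss_probability(target, std_dev, deadline):
        z = (deadline - target) / std_dev           # deadline, in standard deviations past the target
        return 0.5 * (1.0 - erf(z / sqrt(2.0)))     # upper-tail probability of the normal

    # Example: a deadline two standard deviations after the target is missed
    # with probability of roughly 0.023.
    print(miss_probability(target=100.0, std_dev=5.0, deadline=110.0))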
Previous approaches have tended to look at the average time of the tasks completed so far in the current run and use it as an estimate of the time future tasks will take. Such procedures may be effective when dealing with tasks that are executed sequentially, but are much less effective for tasks executed in parallel. For sequential tasks the average of k tasks tends to be unbiased; that is, the average is as likely to overestimate the mean as to underestimate it. However, if the tasks are executed in parallel, the average of the first k tasks to finish is very likely to be less than the long-term mean, with the difference decreasing as 1/k [Lipsky96]. The reason is that the first few tasks to finish are likely to be much shorter than the mean, and new tasks may actually finish before some of those that started earlier; one can only average over tasks that have already finished. The situation is even worse if the distribution of task times is heavy-tailed, a characteristic that is common in computer and telecommunications applications [Crovell97], because for a heavy-tailed distribution the average of the first k tasks is very likely to be less than the long-term mean even for sequential tasks. For this reason, we propose using more than just the information from the current run: observing repeated runs of a job can give a much better view of the task-time distribution.

With these conditions in mind we come to the system we are studying. We consider a job made up of N independent tasks whose demands for various resources are only known probabilistically. That is, the distribution of task demands (when averaged over many tasks) is known, but the demand of each task is not known until that task has finished. The goal is to select a configuration of hardware resources [Rosti98] (e.g., CPUs, local discs, communication channels, centralized discs for shared data) so as to meet a target as closely as possible, not too early and not too late. The tasks can run in parallel (except when sharing a resource), and when a task finishes, the system configuration can be modified if the job is ahead of or behind schedule.

Our methodology in studying this includes our analytic semi-Markov model of a cluster of processors and associated peripherals [Mohamed04], which can be used to calculate the mean and variance of the time it takes to process N iid (independent, identically distributed) tasks, taken P at a time. When one task finishes, it leaves the system and is replaced by another task in the queue, until all N tasks finish. From the analytic model we determine just how many resources would be needed (e.g., the number of processors, if each processor is used exclusively by one task) for the whole job to finish just on time. In conjunction with this we are preparing a discrete-event simulation of the identical system. At the time a task finishes, called an epoch, we compare the actual time from the simulation with that calculated by the analytic model, and if the job is not on schedule we change P (or some other resource parameter) so that the remaining tasks can still finish on time. Our goal is to see just how much control can be maintained in such an uncertain environment. In particular, what is the variation of finishing time from job to job? In a real environment the distributions may not be well known, and even the mean times per task may be uncertain. In such cases just how much control can one have?
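The bias described above is easy to reproduce numerically. The following sketch, a hypothetical illustration not taken from the paper, draws exponential task times with mean 1.0, runs them P at a time as described above, and compares the average service time of the first k tasks to finish with the true mean; the specific parameters (P = 32, N = 1000, seed) are arbitrary.

    # Hypothetical illustration (not from the paper): with P processors and
    # exponential task times (mean 1.0), the average service time of the first
    # k tasks to FINISH tends to underestimate the true mean.
    import heapq, random

    def finish_order_service_times(num_tasks, num_procs, rng):
        """Run num_tasks iid exponential(1) tasks, num_procs at a time; return the
        tasks' service times in the order in which the tasks finish."""
        heap = []                                   # entries: (finish_time, service_time)
        next_task = 0
        now = 0.0
        # Load the first num_procs tasks.
        while next_task < min(num_procs, num_tasks):
            s = rng.expovariate(1.0)
            heapq.heappush(heap, (now + s, s))
            next_task += 1
        finished = []
        while heap:
            now, s = heapq.heappop(heap)            # next task to finish (an "epoch")
            finished.append(s)
            if next_task < num_tasks:               # replace it from the queue
                s_new = rng.expovariate(1.0)
                heapq.heappush(heap, (now + s_new, s_new))
                next_task += 1
        return finished

    rng = random.Random(1)
    times = finish_order_service_times(num_tasks=1000, num_procs=32, rng=rng)
    for k in (10, 50, 200, 1000):
        avg = sum(times[:k]) / k
        print(f"average of first {k:4d} finished tasks: {avg:.3f} (true mean 1.0)")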
Even the best of AI algorithms cannot hope to do better than one could if the distributions were known. Therefore, our procedure should give a lower bound on what a real system could hope for. It could also provide some understanding as to where and how a heuristic scheduling/resource-allocation policy could be improved. To gain some insight into what kind of data, and how much of it, we should examine to achieve our goals, we have done an extensive study of the simplest system within our framework. Here we have P processors and no peripherals; there is an infinite pool of processors that can be allocated and deallocated instantaneously without cost; the number of allocated processors does not alter the time it takes any task to run [Jacob99]; and all the service times are exponentially distributed. Clearly this
system cannot exist [Bharadwaj00]; we use it only to determine best-case bounds. In the “Future Work” section we discuss more complicated systems, but the rest of this paper uses the simple best case just described. For this system, the mean time for N tasks to complete can be written down directly, without any detailed calculations. In fact, in a system where the number of processors is kept constant (see [Lipsky96]), the time to complete all N tasks is
T(P, N) = \frac{N - P}{P} + H(P),    (1)

where H(P) is the harmonic sum

H(P) = \sum_{n=1}^{P} \frac{1}{n},

and the mean time per task is 1.0. Furthermore, the variance is given by

\sigma^2(P, N) = \frac{N - P}{P^2} + H_2(P),    (2)

where

H_2(P) = \sum_{n=1}^{P} \frac{1}{n^2}.
We note here that although H(P) grows as log P, H_2(P) is always less than π²/6. Thus, for fixed P, the mean and variance both grow linearly with N, so the standard deviation of the completion time grows as √N; the probability of missing the target time by any fixed amount therefore grows with N, given that the target time itself grows linearly with N. A variable P does much better. Even in this simple system we have found some very interesting results, including the observation that the variance of the partial job time (the time it has taken to complete k < N tasks), after growing at a rate lower than that for a fixed P, actually decreases after k passes the half-way mark, until the so-called draining period, when the number of tasks remaining equals the number of processors and no further control is possible. We have also found that the variance of the job completion time is independent of N when the target time is given by Equation (1). The results are presented in the following sections.
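As a quick numerical check of Equations (1) and (2), the following short Python sketch (an illustration added here, not part of the original analysis) evaluates T(P, N) and σ²(P, N) for a few values of P, with the mean task time normalized to 1.0.

    # Illustrative evaluation of Equations (1) and (2): mean and variance of the
    # completion time of N exponential tasks (mean 1.0) on a fixed set of P processors.
    def harmonic(p, power=1):
        """H(P) for power=1, H2(P) for power=2."""
        return sum(1.0 / n**power for n in range(1, p + 1))

    def completion_mean(p, n):
        return (n - p) / p + harmonic(p, 1)            # Equation (1)

    def completion_variance(p, n):
        return (n - p) / p**2 + harmonic(p, 2)         # Equation (2)

    N = 1000
    for P in (1, 4, 16, 64):
        print(f"P={P:3d}  mean={completion_mean(P, N):8.2f}  "
              f"variance={completion_variance(P, N):7.3f}")
    # For fixed P both quantities grow linearly with N, while H(P) grows as log P
    # and H2(P) stays below pi^2/6, as noted above.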
2 Description of Simulation

To simulate this system we wrote a program with several components. One component, timeLeft, takes as input the number of processors in use and the number of tasks remaining, and returns an estimate of the time needed to finish. This component is used by the procNeeded component, which takes the current number of processors, the number of tasks remaining, and the time remaining before the target time, and returns the new number of processors. The logic of procNeeded is roughly as follows (the logic for limiting the number of processors to an upper bound, either the number of tasks remaining or a specified maximum, and for passing the distribution and system-configuration information, is not shown here for readability). Note that for this series of tests we limited the number of processors at each epoch to the same as, one more than, or one less than at the previous epoch; see “Future Work” for alternatives.
    int procNeeded(oldProcCount, tasksLeft, timeRemaining)
        // Start one below the current allocation and add processors until the
        // estimated time to finish fits within the time remaining before the
        // target, or until the one-more-than-before limit is reached.
        newProc = oldProcCount - 1
        DO WHILE (timeRemaining < timeLeft(newProc, tasksLeft) && newProc < oldProcCount + 1)
            newProc++
        LOOP
        RETURN newProc

The body of timeLeft, the heart of our analytic model, is very complex and is discussed in [Mohamed04]. For the simple case we focus on here, it reduces to Equation (1). Another component, taskList, is the random number generator. It produces individual task times drawn from the task-time distribution, or can be overridden to produce a specific sequence (and/or to log all values produced) for program validation. The logic of this function is not central to understanding the simulation, but it is important to note that it uses a 48-bit random number generator to avoid problems with the granularity of lower-resolution generators.

The simulator can be thought of as a heap structure that stores events such that the most imminent event is at the top of the heap. A variable timeNow keeps track of where we are in the run. As with procNeeded, oneRun is simplified here to enhance readability; the logic for collecting measurements and for passing additional information used in systems more complex than the one studied here has been removed.

    void oneRun(numProcs, numTasks)
        // Pick the target time for the job and initialize the clock.
        targetTime = timeLeft(numProcs, numTasks)
        newProc = oldProc = numProcs
        timeNow = 0
        // Load the first processors with tasks; the heap holds task finish times.
        FOR (Task = 0; Task < numProcs; Task++)
            push_heap(timeNow + taskList(Task))
        NEXT Task
        // Process epochs until all tasks are finished.
        DO WHILE (heap_size() > 0)
            timeNow = pop_heap()
            newProc = procNeeded(oldProc, numTasks - Task, targetTime - timeNow)
            FOR (i = oldProc; i < newProc; i++)
                ...

Conclusions

We have shown that, for a system in which the number of processors can be varied dynamically, the variance of the completion time about the target is much smaller than for a system where the number of processors is kept constant. Furthermore, we found the surprising result that this variance does not depend on the length of the job (N), even though the variance grows linearly with N for systems with a constant number of processors. It remains to be shown that these results hold for non-exponential distributions, for multi-processor systems with server contention, or for systems where the expected remaining time is itself an approximation. But the approach appears to be promising. Research is continuing in these directions.
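As a concrete illustration of the simulation loop described in Section 2 and of the variance behavior summarized above, the following self-contained Python sketch re-implements the idealized system (exponential task times with mean 1.0, instantaneous allocation, processor count changed by at most one per epoch) and reports the sample mean deviation from the target and the sample variance about the target for two job lengths. It is a hypothetical reconstruction under the stated assumptions, not the authors' simulator; names such as one_run, proc_needed, and time_left mirror the pseudocode above but are illustrative.

    # Illustrative reconstruction of the idealized simulation of Section 2.
    import heapq, random

    def harmonic(p, power=1):
        return sum(1.0 / n**power for n in range(1, p + 1))

    def time_left(procs, tasks):
        """Expected remaining time per Equation (1), mean task time 1.0."""
        if tasks == 0:
            return 0.0
        procs = min(procs, tasks)
        return (tasks - procs) / procs + harmonic(procs)

    def proc_needed(old_procs, tasks_left, time_remaining):
        """New processor count, limited to one below or one above the old count."""
        new_procs = max(1, old_procs - 1)
        while (time_left(new_procs, tasks_left) > time_remaining
               and new_procs < old_procs + 1 and new_procs < tasks_left):
            new_procs += 1
        return new_procs

    def one_run(num_procs, num_tasks, rng):
        """Simulate one job; return completion time minus target time."""
        target = time_left(num_procs, num_tasks)
        heap = []                                    # absolute finish times of running tasks
        now = 0.0
        started = finished = 0
        procs = num_procs
        while started < min(procs, num_tasks):       # load the first processors
            heapq.heappush(heap, now + rng.expovariate(1.0))
            started += 1
        while heap:
            now = heapq.heappop(heap)                # next epoch: a task finishes
            finished += 1
            procs = proc_needed(procs, num_tasks - finished, target - now)
            while started < num_tasks and len(heap) < procs:
                heapq.heappush(heap, now + rng.expovariate(1.0))
                started += 1
        return now - target

    rng = random.Random(7)
    for N in (200, 2000):
        devs = [one_run(num_procs=16, num_tasks=N, rng=rng) for _ in range(2000)]
        mean = sum(devs) / len(devs)
        var = sum((d - mean) ** 2 for d in devs) / (len(devs) - 1)
        print(f"N={N:5d}: mean deviation {mean:+.3f}, variance about target {var:.3f}")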
References

[Bharadwaj00] "On the influence of start-up costs in scheduling divisible loads on bus networks", V. Bharadwaj, X. Li, C. C. Ko, IEEE Transactions on Parallel and Distributed Systems, 2000.
[Crovell97] "Consequence of Ignoring Self-Similar Data Traffic In Communications Modeling", Mark Crovella, Lester Lipsky, Pierre Fiorini, Tenth International Conference on Parallel and Distributed Computing (PDCS-97), New Orleans, LA, October 1997.

[Jacob99] "Task spreading and shrinking on multiprocessor systems and networks of workstations", J. C. Jacob, S. Y. Lee, IEEE Transactions on Parallel and Distributed Systems, 1999.

[Lipsky96] "On the Performance of Parallel Computers: Order Statistics and Amdahl's Law", L. Lipsky, T. Zhang, and S. Kang, 22nd International Conference for the Resource Management and Performance Evaluation of Computing Systems (CMG96), December 1996.

[Manimaran98] "An efficient dynamic scheduling algorithm for multiprocessor real-time systems", G. Manimaran, C. S. R. Murthy, IEEE Transactions on Parallel and Distributed Systems, 1998.

[McCann94] "Processor allocation policies for message-passing parallel computers", C. McCann, J. Zahorjan, Proceedings of the ACM SIGMETRICS Conference, 1994.

[Mohamed04] "Modelling Parallel and Distributed Systems With Finite Workloads", Ahmed M. Mohamed, Lester Lipsky, Reda Ammar, Journal of Performance Evaluation, October 2004.

[Peris94] "Analysis of the impact of memory in distributed parallel processing systems", V. G. J. Peris, M. S. Squillante, V. K. Naik, ACM SIGMETRICS Performance Evaluation Review, Volume 22, Issue 1, May 1994.

[Rosti98] "The impact of I/O on program behavior and parallel scheduling", E. Rosti, G. Serazzi, E. Smirni, M. S. Squillante, Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 56-65, 1998.

[Sharma98] "Job scheduling in mesh multicomputers", D. D. Sharma, D. K. Pradhan, IEEE Transactions on Parallel and Distributed Systems, Volume 9, Issue 1, pp. 57-70, January 1998.

[Weeransinghe02] "A Generalized Analytic Performance Model Of Distributed Systems That Perform N Tasks Using P Fault-Prone Processors", Gehan Weeransinghe, Lester Lipsky, Imad Antonios, FTPDS-02, Fort Lauderdale, FL, April 2002.