Reinforcement learning for utility-based grid scheduling

Julien Perez, Balázs Kégl, Cécile Germain-Renaud
LRI/LAL, University of Paris-Sud, CNRS, 91898 Orsay, France
{perez,kegl,cecile.germain}@lri.fr

November 11, 2007
1 Introduction
In this work we propose to implement a reinforcement-learning-based (RL) scheduling approach for large grid computing systems. The goal of grid scheduling is to efficiently dispatch continuously arriving jobs onto the machines of the grid. Several properties of grid systems make them significantly different from local clusters. First, their size: our reference infrastructure is the EGEE grid [1], which features 41,000 CPUs distributed over 240 sites in 45 countries and maintains 100,000 concurrent jobs for a large variety of e-Science applications. Grid computing infrastructures are heterogeneous, dynamic, non-steady-state systems, with only partial perception of their environment. Decision-making (human or automatic) is distributed: each participating site configures, runs, and maintains a batch system containing its computational resources. The scheduling policy of each site is defined by the local site administrator, and the overall scheduling policy evolves implicitly as a "sum" of the local policies. A critical issue for widespread adoption of grids is to provide differentiated quality of service (QoS), covering the whole range from interactive usage, with turnaround time as the primary performance metric, to batch-oriented access to complex scientific applications with high job throughput [2]. Virtual Organizations (VOs) are a key concept in the grid exploitation model: they represent groups of users with similar access rights. Resource allocation should be based on the VOs: in the mid- to long-term, each VO is entitled to a pre-defined share of the resources, defined by agreements between the participating institutions.

To satisfy these requirements, we propose an RL-based scheduling approach. Our goal is to develop a scheduler for the local level, which is experimentally (at least in the EGEE case) the most difficult to adjust to the high-level requirements. The long-term expected utilities defined by the local RL-based schedulers can then be efficiently exploited by the matchmaking process which dispatches jobs to sites, possibly by an upper-level RL algorithm, in the spirit of the hierarchical method proposed in [6]. The flexibility of an RL-based system allows us to develop tools to model the state of the grid, the jobs to be scheduled, and the high-level objectives of the various actors on the grid. It can also adapt its decisions to changes in the distributions of inter-arrival time, QoS requirements, and resource availability. Moreover, it requires minimal prior knowledge about the target environment, including user requests and infrastructure.
2 Formalization
We consider grid scheduling as a continuous state-action space Markov decision process (MDP). We first formalize the main components of the system in the reinforcement learning framework.

State space: the grid model. A complete model of the grid would include a detailed description of each queue and of all the resources. This would be both inadequate for the MDP framework and unrealistic: the dimension of the state space would become very large. Instead, the state is represented by four real-valued variables inspired by [7]: 1) the expected time remaining until any of the currently running jobs is completed, 2) the number of currently idle machines, 3) the average utility (see below) expected to be received by the currently running jobs, and 4) the workload (the total execution time of the jobs waiting in the queues).

Action space: the job model. Each waiting job is a potential action to be chosen by the scheduler. Although this implies a discrete choice, the action space itself is continuous: each job is represented by a set of descriptors (extracted, for instance, from the EGEE logging and bookkeeping system). The exact set of variables is still under research; for the time being we use 1) the type of the job (batch/interactive), 2) the VO of the user who submitted the job, and 3) the expected execution time. The first two descriptors are directly available; the third one can be estimated from other descriptors.
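The sketch below (in Python) shows one possible encoding of the state and job descriptors as feature vectors; the class and field names are illustrative only and simply mirror the variables listed above.

from dataclasses import dataclass

@dataclass
class GridState:
    # The four real-valued state variables summarizing the grid.
    time_to_next_completion: float   # expected time until any running job completes
    idle_machines: float             # number of currently idle machines
    avg_expected_utility: float      # average utility expected by the running jobs
    queued_workload: float           # total execution time of the jobs waiting in the queues

    def as_vector(self):
        return [self.time_to_next_completion, self.idle_machines,
                self.avg_expected_utility, self.queued_workload]

@dataclass
class JobDescriptor:
    # Descriptors of one waiting job, i.e., one candidate action.
    is_batch: float           # 1.0 for a batch job, 0.0 for an interactive one
    vo_id: float              # numeric encoding of the submitting VO
    expected_runtime: float   # estimated execution time of the job

    def as_vector(self):
        return [self.is_batch, self.vo_id, self.expected_runtime]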
Reward: utility functions. The overall utility of the scheduler is a combination of the time-utility and the fairness. The time-utility function [3, 6, 7] is attached to each job and describes how "satisfied" the user will be if his/her job finishes after a certain time delay. It is typically a decreasing function of time, and it can vary with the job type. The fairness represents the difference between the actual resource allocation and the externally defined shares given to the VOs.

The method: SARSA, on-policy control learning. The policy learning framework is based on SARSA [5], a classical reinforcement learning algorithm. This on-policy learning approach allows the scheduler to maintain a unique policy that provides efficient scheduling decisions and, at the same time, adapts to potential changes of the environment.
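The following minimal sketch illustrates the resulting SARSA update over concatenated (state, job) features, together with a simple decreasing time-utility function. The linear approximator, the deadline, and the learning parameters are illustrative placeholders; the approximators actually used are neural networks and Gaussian processes, and the utility functions depend on the job type.

import numpy as np

def time_utility(delay, deadline=300.0, max_utility=1.0):
    # Decreasing time-utility: full utility up to the deadline, then linear decay to zero.
    if delay <= deadline:
        return max_utility
    return max(0.0, max_utility * (1.0 - (delay - deadline) / deadline))

class LinearQ:
    # Linear Q-function approximator over concatenated (state, job) features.
    def __init__(self, n_features, learning_rate=0.01):
        self.w = np.zeros(n_features)
        self.lr = learning_rate

    def value(self, features):
        return float(np.dot(self.w, features))

    def update(self, features, td_target):
        # Stochastic gradient step toward the TD target.
        td_error = td_target - self.value(features)
        self.w += self.lr * td_error * np.asarray(features, dtype=float)

def sarsa_step(q, state, job, reward, next_state, next_job, gamma=0.95):
    # One on-policy SARSA update:
    # Q(s,a) <- Q(s,a) + lr * [r + gamma * Q(s',a') - Q(s,a)]
    phi = np.concatenate([state, job])
    phi_next = np.concatenate([next_state, next_job])
    td_target = reward + gamma * q.value(phi_next)
    q.update(phi, td_target)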
3 Experiments
Figure 1: Average reward as a function of time for a scheduling task of 100 jobs on 20 machines, comparing RL with a neural-network approximator, RL with a Gaussian-process approximator, and FIFO.
In this section, we present our first experimental results. To test our scheduler and to compare it to classical algorithms, we developed a multi-queue/multi-machine grid scheduling simulator. Since our state-action space is continuous, the Q-function is learned using function approximators (neural networks and Gaussian processes [4]).
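As an illustration of the Gaussian-process variant, the sketch below regresses Q-values over concatenated (state, job) features using a generic GP regressor (here scikit-learn's, as an assumed stand-in); the kernel, noise level, and batch-refit strategy are illustrative rather than the exact configuration used in the experiments.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

class GaussianProcessQ:
    # Batch Gaussian-process regression of Q(state, job) from observed TD targets.
    def __init__(self):
        self.gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
        self.X, self.y = [], []
        self.fitted = False

    def value(self, state_vec, job_vec):
        if not self.fitted:
            return 0.0  # neutral value before any data has been observed
        x = np.concatenate([state_vec, job_vec]).reshape(1, -1)
        return float(self.gp.predict(x)[0])

    def add_target(self, state_vec, job_vec, td_target):
        # Store one training example (features, TD target) for the next refit.
        self.X.append(np.concatenate([state_vec, job_vec]))
        self.y.append(td_target)

    def refit(self):
        # Periodically refit the GP on all collected examples.
        self.gp.fit(np.asarray(self.X), np.asarray(self.y))
        self.fitted = True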
For the exploration behavior, we use classical mechanisms (ε-greedy and softmax), both sketched below. We compared the reinforcement learning approach to FIFO, a simple classical scheduling policy. The first results (Figure 1) indicate that the reinforcement learning system improves the performance of the scheduler significantly.
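The two exploration mechanisms are applied to the approximate Q-values of the currently waiting jobs; the epsilon and temperature values below are illustrative.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    # With probability epsilon pick a random waiting job, otherwise the highest-valued one.
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax(q_values, temperature=1.0, rng=None):
    # Sample a waiting job with probability proportional to exp(Q / temperature).
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    z = np.exp((q - q.max()) / temperature)  # subtract the max for numerical stability
    probs = z / z.sum()
    return int(rng.choice(len(q), p=probs))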
4 Perspectives
After the development and thorough testing of local schedulers, we plan to investigate the problem of distributed reinforcement learning. With the development of large networks with low communication bandwidth, distributed machine learning in general is rapidly becoming an important research subject. Although communication is relatively slow, local schedulers can share their knowledge both about the dynamical state of the grid and about the learned policies, and the goal is to find the most communication-efficient way to share this information. Another interesting aspect of this project is that the models developed for the states and the actions may actually be more interesting for human administrators than the scheduler itself. Whereas actually implementing the RL-based scheduler can be technically difficult, human administrators can easily exploit the state and action models to design simple and robust heuristics.

References

[1] F. Gagliardi et al. Building an infrastructure for scientific grid computing: status and goals of the EGEE project. Philosophical Transactions of the Royal Society A, 1833, 2005.

[2] C. Germain-Renaud, C. Loomis, J. T. Mościcki, and R. Texier. Scheduling for responsive grids. Journal of Grid Computing, 2007. Online, doi:10.1007/s10723-007-9086-4.

[3] E. Douglas Jensen, C. Douglas Locke, and Hideyuki Tokuda. A time-driven scheduling model for real-time operating systems. In IEEE Real-Time Systems Symposium, pages 112–122, 1985.

[4] Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[5] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[6] Gerald J. Tesauro and Jeffrey O. Kephart. Utility functions in autonomic systems. In Proceedings of the 1st International Conference on Autonomic Computing (ICAC'04), pages 70–77, 2004.

[7] David Vengerov. A Reinforcement Learning Framework for Utility-Based Scheduling in Resource-Constrained Systems. Technical Report TR-2005-141, Sun Labs, 2005. To appear in the Grid Computing Journal.