Learning heuristic policies – a reinforcement learning problem

Thomas Philip Runarsson
School of Engineering and Natural Sciences, University of Iceland
[email protected]
Abstract. This paper discusses how learning heuristic policies may be formulated as a reinforcement learning problem. Reinforcement learning algorithms are commonly centred around estimating value functions. Here a value function represents the average performance of the learned heuristic algorithm over a problem domain. Heuristics correspond to actions and states to solution instances. The problem of bin packing is used to illustrate the key concepts. Experimental studies show that the reinforcement learning approach is competitive with current techniques used for learning heuristics. The framework opens up further possibilities for learning heuristics by exploring the numerous techniques available in the reinforcement learning literature.
1 Introduction
The current state of the art in search techniques concentrates on problem specific systems. There are many examples of effective and innovative search methodologies which have been adapted for specific applications. Over the last few decades, there has been considerable scientific progress in building search methodologies and customizing them, usually through hybridization with problem specific techniques, for a broad scope of applications. This approach has resulted in effective methods for intricate real world problem solving environments and is commonly referred to as heuristic search. At the other extreme an exhaustive search could be applied without a great deal of proficiency. However, the search space for many real world problems is too large for an exhaustive search, making it too costly. Even when an effective exact method exists, for example mixed integer programming, it frequently does not scale to real world problems, see e.g. [6] for a compendium of so-called NP optimization problems. In such cases heuristics offer an alternative to complete search.

In optimization the goal is to search for instances x, from a set of instances X, which maximize a payoff function f(x) while satisfying a number of constraints. A typical search method starts from an initial set of instances. Then, iteratively, search operators are applied, locating new instances until instances with the highest payoff are reached. The key ingredient to any search methodology is thus the structure or representation of the instances x and the search operators that
manipulate them. Developing automated systems for designing and selecting search operators or heuristics is a challenging research objective. Even when a number of search heuristics have been designed for a particular problem domain, the task still remains of selecting those heuristics which are most likely to succeed in generating instances with higher payoff. Furthermore, the success of a heuristic will depend on the particular case in point and, when local search heuristics are applied, on the current instance. For this reason additional heuristics may be needed to guide and modify the search heuristics in order to produce instances that might otherwise not be created. These additional heuristics are so-called meta-heuristics. Hyper-heuristics are an even more general approach where the space of the heuristics themselves is searched [4]. A recent overview of methods for automating the heuristic design process is given in [2,5]. In general we can split the heuristic design process into two parts: the first being the actual heuristic h, or operator, used to modify or create an instance¹ x ∈ X; the second being the heuristic policy π(φ(x), h), the probability of selecting h, where φ(x) are features of instance x, in the simplest form φ(x) = x. Learning a heuristic h can be quite tricky for many applications. For example, for a designed heuristic space h ∈ H there may exist heuristics that create instances x ∉ X or instances for which the constraints are not satisfied. For this reason most of the literature on automating the heuristic design process is focused on learning heuristic policies [15,12,3], although this is sometimes not explicitly stated.

The main contribution of this paper is to show how learning heuristics can be put in a reinforcement learning framework. The approach is illustrated for the bin packing problem. The focus is on learning a heuristic policy, and the actual heuristics will be kept as simple and intuitive as possible. In reinforcement learning policies are found directly, or indirectly via a value function, using a scheme of reward and punishment. To date only a handful of examples [15,11,1,10] exist of applying reinforcement learning to learning heuristics. However, ant systems also have many similarities to reinforcement learning and can be thought of as learning a heuristic policy, see [7,8]. Commonly researchers apply reinforcement learning only to a particular problem instance, not to the entire problem domain as will be attempted here. The literature on reinforcement learning is rich in applications which can be posed as Markov decision processes, even partially observable ones. Reinforcement learning methods are also commonly referred to as approximate dynamic programming [13], since approximation techniques are commonly used to model policies. Posing the task of learning heuristics within this framework opens up a wealth of techniques for this research domain. It may also help better formalize open research questions, such as how much human expertise is required to design a satisfactory heuristic search method for a given problem domain f ∈ F?

The following section illustrates how learning heuristics may be formulated as a reinforcement learning problem. This is followed by a description of the bin-packing problem and a discussion of commonly used heuristics for this task. Section 4 illustrates how temporal difference learning can be applied to learning heuristic policies for bin packing, and the results are compared with classical heuristics as well as those learned using genetic programming in [12]. Both off-line and on-line bin packing are considered. The paper concludes with a summary of the main results.

¹ So-called construction heuristics versus local search heuristics.
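To fix ideas before the formal treatment in the next section, the sketch below (Python) illustrates the split between a set of heuristics h ∈ H and a heuristic policy π(φ(x), h) described above. It is only a hedged illustration: all names (features, select_heuristic, search) are hypothetical and not taken from any cited implementation.

```python
import random

def features(x):
    """phi(x): features of an instance; in the simplest form phi(x) = x."""
    return x

def select_heuristic(policy, heuristics, x):
    """Sample a heuristic h with probability proportional to pi(phi(x), h)."""
    phi = features(x)
    weights = [policy(phi, h) for h in heuristics]
    return random.choices(heuristics, weights=weights, k=1)[0]

def search(x, policy, heuristics, steps):
    """Apply policy-selected heuristics for a fixed number of iterations."""
    for _ in range(steps):
        h = select_heuristic(policy, heuristics, x)
        x = h(x)   # each heuristic maps an instance to a new instance
    return x
```

Here each heuristic is simply a function from instances to instances, and the policy only decides which of the designed heuristics to apply next.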
2 Learning heuristics – a reinforcement learning problem
In heuristic search the goal is to search for instances x which maximize some payoff or objective f(x) while satisfying a number of constraints set by the problem. A typical search method starts from an initial set of instances. Then, iteratively, heuristic operators h are applied, locating new instances until instances with the highest payoff are reached. The key ingredients to any heuristic search methodology are thus: the structure or representation of the instances x, the heuristics h ∈ H, the heuristic policy π, and the payoff f(x). Analogously, it is possible to conceptualise heuristic search in the reinforcement learning framework [14] as pictured below. Here the characteristic features of our instance, φ(x), are synonymous with a state in the reinforcement learning literature, and likewise the heuristic h with an action. Each iteration of the search heuristic is denoted by t. The payoff must then be written as a sum of rewards,

    f(x) = \sum_{t=0}^{T} c(x_t)    (1)
where T denotes the final iteration, determined by some termination criterion for the heuristic search. For many problems one would set c(x_t) = 0 for all t < T and then c(x_T) = f(x_T). For construction heuristics T would denote the iteration at which the instance has been completely constructed. For some problems the objective f(x) can be broken down into a sum as shown in (1). One such example is the bin packing problem: each time a new bin needs to be opened a reward of c(x) = −1 is given, else 0. It is the search agent's responsibility to update its heuristic policy based on the feedback from the particular problem instance f ∈ F being searched. Once the search has been terminated the environment is updated with a new problem instance sampled from F; this way a new learning episode is initiated. This makes the heuristic learning problem noisy. The resulting policy learned, however, is one that maximizes the average performance over the problem domain, that is

    \max_{\pi} \frac{1}{|F|} \sum_{f \in F} f\big(x_T^{(f)}\big)    (2)

where x_T^{(f)} is the solution found by the learned heuristic policy for problem f. The average performance over the problem domain corresponds to the so-called value function in reinforcement learning. Reinforcement learning algorithms are commonly centred around estimating value functions.
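To make the episodic structure of (1)–(2) concrete, the following hedged sketch accumulates the per-step rewards c(x_t) of one search episode and estimates the average performance of a policy over a sample of problem instances drawn from F. All names on the problem object (initial_instance, step_reward, is_terminal, features) are illustrative assumptions, not part of any stated implementation.

```python
def episode_payoff(problem, policy, max_iters=10_000):
    """Run one heuristic-search episode and return the accumulated reward, cf. (1)."""
    x = problem.initial_instance()          # starting solution instance
    total = 0.0
    for t in range(max_iters):
        total += problem.step_reward(x)     # c(x_t); often 0 for all t < T
        if problem.is_terminal(x):          # termination criterion defines T
            break
        h = policy(problem.features(x))     # pi(phi(x)) selects a heuristic h
        x = h(x)                            # apply the heuristic to obtain a new instance
    return total

def average_performance(problems, policy):
    """Estimate the objective (2): mean payoff of the learned policy over the domain F."""
    return sum(episode_payoff(f, policy) for f in problems) / len(problems)
```

Maximizing such an average over policies is exactly the objective in (2); in the experiments below this average is estimated implicitly by sampling a fresh instance f ∈ F at every learning episode.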
3 Bin Packing
Given a bin of size \bar{W} and a list of items of sizes w_1, w_2, \ldots, w_n, each item must be in exactly one bin,

    \sum_{j=1}^{m} z_{i,j} = 1, \quad i = 1, \ldots, n    (3)

where z_{i,j} is 1 if item i is in bin j. The bins should not overflow, that is

    \sum_{i=1}^{n} w_i z_{i,j} \le \bar{W} x_j, \quad j = 1, \ldots, m    (4)

where x_j is 1 if bin j is used and 0 otherwise. The same heuristic strategy for selecting a bin is used here. The temporal difference (TD) formulation is as follows:

    Q(s_t, w_t) = Q(s_t, w_t) + \alpha \big[ Q(s_{t+1}, w_{t+1}) + c_{t+1} - Q(s_t, w_t) \big]    (15)
where the state s_t = g_t(w_t) − o_t(w_t) and c_{t+1} is 1 if a new bin was opened, else 0. The value of a terminal state is zero as usual, i.e. Q(s_{n+1}, ·) = 0. The weight (item size) selected then follows the policy

    w_t = \arg\min_{w,\; o_t(w) > 0} Q(s_t, w)    (16)
The value function now tells us the expected number of bins that will be opened given the current state s_t and taking decision w_t at iteration t, while following the heuristic policy π. The number of bins used is, therefore, \sum_{i=1}^{n} c_i. However, for more general problems the cost of a solution is not known until the complete solution has been built. An alternative formulation is therefore to have no cost during the search (c_t = 0, t = 1, \ldots, n) and only at the final iteration give a terminal cost equal to the number of bins used, i.e. c_{n+1} = m. Figure 4 below shows the moving average number of bins used as a function of learning episodes³. The noise results from generating a new problem instance at each episode. When the performance of this value function is compared with the one in [12] on 100 test problems, no statistically significant difference in the mean number of bins used is observed, µ_GP = 34.27 and µ_TD = 34.53. These results improve on the off-line approach of first-fit above.
³ α = 0.01, slightly larger than before.
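The TD formulation (15)–(16) can be sketched as follows. The code below is a hedged illustration rather than the paper's implementation: it assumes, in the spirit of the histogram-matching approach of [12], that o_t(w) counts the items of size w still to be packed and g_t(w) counts the open bins whose remaining gap is exactly w; it reads c_{t+1} as the cost incurred by decision w_t; and the best-fit bin selection, bin capacity and item-size distribution are illustrative choices. Only α = 0.01 is taken from the footnote above.

```python
import random
from collections import Counter, defaultdict

ALPHA = 0.01      # learning rate, as in the footnote above
BIN_SIZE = 30     # illustrative bin capacity; not specified in this excerpt

def td_episode(items, Q):
    """One learning episode: greedy policy (16) with the TD update (15)."""
    o = Counter(items)   # assumed meaning: o_t(w) = number of unpacked items of size w
    g = Counter()        # assumed meaning: g_t(w) = number of open bins with remaining gap w
    prev = None          # (s_t, w_t, c_{t+1}) of the previous decision, awaiting its update
    bins_used = 0
    while sum(o.values()) > 0:
        # Policy (16): among item sizes still available, pick w minimising Q(s, w),
        # where the state is s = g_t(w) - o_t(w).
        w = min((u for u in o if o[u] > 0), key=lambda u: Q[(g[u] - o[u], u)])
        s = g[w] - o[w]
        if prev is not None:              # TD update (15) for the previous decision
            ps, pw, pc = prev
            Q[(ps, pw)] += ALPHA * (Q[(s, w)] + pc - Q[(ps, pw)])
        # Place the item: reuse the smallest fitting gap (a best-fit assumption),
        # otherwise open a new bin, which costs c_{t+1} = 1.
        fitting = [gap for gap, count in g.items() if count > 0 and gap >= w]
        if fitting:
            gap, cost = min(fitting), 0
            g[gap] -= 1
        else:
            gap, cost = BIN_SIZE, 1
            bins_used += 1
        if gap - w > 0:
            g[gap - w] += 1
        o[w] -= 1
        prev = (s, w, cost)               # cost is read as c_{t+1} for decision w_t
    if prev is not None:                  # terminal update with Q(s_{n+1}, .) = 0
        ps, pw, pc = prev
        Q[(ps, pw)] += ALPHA * (0.0 + pc - Q[(ps, pw)])
    return bins_used

Q = defaultdict(float)                    # tabular value function over (state, item size)
for episode in range(10_000):             # a fresh random instance each episode, hence the noise
    items = [random.randint(1, BIN_SIZE // 2) for _ in range(120)]
    td_episode(items, Q)
```

The alternative mentioned above, with no cost during the search and a single terminal cost c_{n+1} = m, would amount to using zero for the per-step cost and `bins_used` in the final update.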
[Plot for Fig. 4: vertical axis 0.95·#Bins_{e−1} + 0.05·#Bins_e (roughly 34–42 bins), horizontal axis learning episode e from 0 to 10000.]
Fig. 4. Moving average number of bins used as a function of learning episodes. Each new episode generates a new problem instance, hence the noise.
5 Summary and conclusions
The challenging task of heuristic learning was put forth as a reinforcement learning problem. The heuristics are assumed to be designed by a human designer and correspond to the actions in the usual reinforcement learning setting. The states represent the solution instances, and the heuristic policies learned decide when these heuristics should be used. The simpler the heuristics, the more challenging it becomes to learn a policy. At the other extreme a single powerful heuristic may be used, and then no heuristic policy need be learned (only one action is possible). The heuristic policy is found indirectly by approximating a value function, whose value is the expected performance of the algorithm over the problem domain as a function of specific features of a solution instance and the applied heuristics. It is clear that problem domain knowledge will be reflected in the careful design of features, such as we have seen in the histogram-matching approach to bin packing, and in the design of heuristics. The machine learning side is devoted to learning heuristic policies.

The exploratory noise needed to drive the reinforcement learning is introduced indirectly by generating a completely new problem instance at each learning episode. This is very different from the reinforcement learning approaches commonly seen in the literature for learning heuristic search, where usually only a single problem (benchmark) instance is considered. One immediate concern, which needs to be addressed, is the level of noise encountered during learning when a new problem instance is generated at each episode. Although the bin packing problems tackled in this paper could be solved, Figure 4 shows that convergence may still be an issue. One possible solution may be to correlate the instances generated.
References

1. R. Bai, E.K. Burke, M. Gendreau, G. Kendall, and B. McCollum. Memory length in hyper-heuristics: An empirical study. In IEEE Symposium on Computational Intelligence in Scheduling (SCIS '07), pages 173–178. IEEE, 2007.
2. E.K. Burke, M.R. Hyde, G. Kendall, G. Ochoa, E. Özcan, and J.R. Woodward. Exploring hyper-heuristic methodologies with genetic programming. Computational Intelligence, pages 177–201, 2009.
3. E.K. Burke, M.R. Hyde, G. Kendall, and J. Woodward. A genetic programming hyper-heuristic approach for evolving two dimensional strip packing heuristics. IEEE Transactions on Evolutionary Computation (to appear), 2010.
4. E.K. Burke and G. Kendall. Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques. Springer Verlag, 2005.
5. E.K. Burke, M. Hyde, G. Kendall, G. Ochoa, E. Özcan, and J.R. Woodward. A classification of hyper-heuristic approaches. In Handbook of Metaheuristics, pages 449–468, 2010.
6. P. Crescenzi and V. Kann. A compendium of NP optimization problems. http://www.nada.kth.se/~viggo/problemlist/compendium.html. Technical report, accessed September 2010.
7. M. Dorigo and L. Gambardella. A study of some properties of Ant-Q. In Parallel Problem Solving from Nature – PPSN IV, pages 656–665, 1996.
8. M. Dorigo and L.M. Gambardella. Ant colony system: A cooperative learning approach to the traveling salesman problem. IEEE Transactions on Evolutionary Computation, 1(1):53–66, 1997.
9. S. Floyd and R.M. Karp. FFD bin packing for item sizes with uniform distributions on [0, 1/2]. Algorithmica, 6(1):222–240, 1991.
10. D. Meignan, A. Koukam, and J.C. Créput. Coalition-based metaheuristic: a self-adaptive metaheuristic using reinforcement learning and mimetism. Journal of Heuristics, pages 1–21.
11. A. Nareyek. Choosing search heuristics by non-stationary reinforcement learning. Applied Optimization, 86:523–544, 2003.
12. R. Poli, J. Woodward, and E.K. Burke. A histogram-matching approach to the evolution of bin-packing strategies. In IEEE Congress on Evolutionary Computation (CEC 2007), pages 3500–3507, September 2007.
13. W.B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Interscience, 2007.
14. R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
15. W. Zhang and T.G. Dietterich. A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1114–1120, San Francisco, California, 1995. Morgan Kaufmann.