Reinforcement Learning using LCS in Continuous State Space
IWLCS-2004 Extended Abstract
David LeRoux, [email protected]
Michael Littman, [email protected]
Department of Computer Science, Rutgers University

Abstract

Reinforcement learning (RL) deals with a class of problems in which a policy is to be learned based solely on numerical reward signals that are often significantly delayed from the actions that produce them. LCS has been shown to be an effective method for RL, but most applications have been in discrete state spaces. The more commonly used approaches within RL, e.g., policy search, value iteration, and Q-learning, have also generally been applied to discrete state spaces, although a number of techniques have been proposed recently for extending traditional methods to continuous state space problems. This paper looks at how LCS might be used for such problems. A basic LCS technique is developed and applied to a commonly studied RL problem to demonstrate that an accurate value function and policy can be learned in time comparable to that of traditional RL approaches. The LCS method proposed is kept minimal, and an evaluation test-bed is suggested so that enhancements to the LCS methodology may be evaluated to determine which provide the greatest benefit. A number of suggestions are made for future study of such enhancements.

1. Introduction

There is a large literature on reinforcement learning, but most of it deals with discrete state spaces (Kaelbling, Littman and Moore, 1996). The most commonly used approaches when faced with a continuous state space are: 1) discretization, where the continuous state space is approximated by a grid of points within the space (Munos and Moore, 1999), or 2) value function approximation, where the value function is approximated using some function approximation technique (Baird, 1995). The main problem with discretization is the "curse of dimensionality" and the difficulty of not knowing in advance which areas of the state space require greater refinement. Value function approximation with RL has had a number of highly publicized successes, such as the use of a neural network to approximate the value function in a world-class backgammon-playing system (Tesauro, 1995). However, many problems have been encountered using function approximation, particularly non-convergence (Boyan and Moore, 1995).

It has been shown that certain classes of function approximators are guaranteed to converge to a reasonable approximation of the value function (Gordon, 1995). These function approximators are known as averagers because they all use some form of weighted averaging of experience data that, in turn, creates a contracting or non-expanding mapping, which guarantees convergence of standard RL techniques such as Q-learning. Smart and Kaelbling (2000) used an averager approach in suggesting how to use a form of Q-learning to solve practical dynamic control tasks in continuous state spaces. That work is used here as the "traditional" RL example with which to compare LCS approaches as an alternative method of dealing with such problems.

Similarly, the primary focus of LCS in the past has been on discrete state spaces. LCS has been extended to continuous domains using fuzzy classifier systems (Bonarini, 1996) and by generalizing XCS (Wilson, 1999). This paper seeks to build on these past approaches and to compare LCS to traditional RL techniques for solving delayed-reward problems in continuous state space. Rather than using a specific advanced LCS system, such as XCSR, the approach taken here is to start with a "bare bones" LCS, apply it to a commonly used RL problem, compare results to those obtained using traditional RL methods, and use this minimal LCS framework as a test-bed with which to evaluate enhancements to the LCS methodology.
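To make the averager idea concrete, the sketch below (an illustration only, not the specific algorithm of Smart and Kaelbling) estimates a value at a query state as a kernel-weighted average of stored experience; because the weights are non-negative and sum to one, the approximation step is a non-expansion in the sense of Gordon (1995). The Gaussian kernel and bandwidth are assumptions chosen purely for illustration.

```python
import numpy as np

def averager_value(query_state, states, values, bandwidth=0.1):
    """Estimate the value at query_state by a kernel-weighted average.

    states : (N, d) array of previously visited states
    values : (N,)   array of value estimates at those states
    The weights are non-negative and normalized to sum to 1, so the
    approximator is an 'averager' (a non-expansion) in Gordon's sense.
    """
    dists = np.linalg.norm(states - query_state, axis=1)
    weights = np.exp(-(dists / bandwidth) ** 2)   # Gaussian kernel weights
    weights /= weights.sum()                      # normalize: convex combination
    return float(weights @ values)
```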

2. Test Problem: Mountain-Car Task

A commonly used RL problem involving continuous state space is the mountain-car task, originally used by Moore (1991) and described in detail by Singh and Sutton (1996). The object is to drive a car to the top of a hill (forward direction) by applying actions of either forward or backward power at each time step.1 The state space of the problem is the continuous two-dimensional description of the car's current position and velocity. The car is not powerful enough to reach the top of the hill by simply applying forward power from many position and velocity states, and the driver must resort to going partially up the hill and reversing direction several times to gain enough momentum to overcome the effects of gravity. Reward is -1 at all points, and the trial stops in the absorbing states at the top of the hill, regardless of velocity. With an optimal policy, the value of any state is the negative of the minimum number of steps required to reach the top of the hill. For ease of presentation of the economic model described below, a terminal reward of +300 is given when the top of the hill is reached. Of course, this constant terminal reward does not change the optimal policy in this undiscounted version of the problem.

This problem poses several interesting challenges to reinforcement learning methodologies beyond the continuous state space. From many states, the optimal policy requires moving away from the goal before further progress can be made. There is often a long delay between actions and the goal state, with many positions requiring over 100 steps under an optimal policy to reach the top of the hill. There are areas in the state space where a small change in position or velocity causes a large change in the value of the state, such as states where the car is just able to make it to the top, but could not if the position were slightly lower on the hill or if the velocity were slightly less. Finally, the problem poses interesting exploration challenges, since from some starting positions it is quite difficult to reach the top of the hill using a random exploration strategy. For this reason, most applications of this problem start training trajectories from random starting points in the state space and end a trajectory if the goal has not been reached within a certain number of steps.

The mountain-car task was used as a test problem by Smart and Kaelbling (2000) to apply a value function approximation technique using locally weighted regression over the nearest neighbors in state space. A variation of that approach is used in this paper as the "traditional" RL method with which to compare LCS approaches. The goal is to learn a near-optimal policy from any starting point.
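For reference, a minimal sketch of the mountain-car dynamics in the form used here (forward/backward actions only, -1 step reward, +300 terminal reward) is given below. The update equations and bounds follow the standard formulation of Singh and Sutton (1996); the exact constants should be treated as assumptions.

```python
import math
import random

POS_MIN, POS_MAX = -1.2, 0.5      # position range; the goal is reached at POS_MAX
VEL_MIN, VEL_MAX = -0.07, 0.07    # velocity range

def step(position, velocity, action):
    """One mountain-car transition. action is -1 (backward) or +1 (forward).

    Returns (position, velocity, reward, done). The step reward is -1;
    reaching the hilltop ends the trial with an additional +300 terminal
    reward, as in the undiscounted version described in the text.
    """
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(VEL_MIN, min(VEL_MAX, velocity))
    position += velocity
    if position <= POS_MIN:                 # hit the left wall: stop the car
        position, velocity = POS_MIN, 0.0
    if position >= POS_MAX:                 # reached the top of the hill
        return position, velocity, -1 + 300, True
    return position, velocity, -1, False

def random_start():
    """Random starting point in state space, as used for training trajectories."""
    return random.uniform(POS_MIN, POS_MAX), random.uniform(VEL_MIN, VEL_MAX)
```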

3. LCS Approach

This paper is not seeking the optimal LCS approach to the problem, but rather a "minimal" LCS structure that succeeds in learning a near-optimal policy. The XCS approach (Wilson 1995, 1999) has been successful in discrete problems with delayed rewards, such as grid-like "woods" and maze environments. It has also been extended, in the form of XCSR, to immediate-reward continuous state space problems, e.g., the "real 6-multiplexer" (Wilson 1999). Other recent LCS research has argued for strict adherence to certain basic economic principles in applying LCS (Baum and Durdanovic, 1999). The Hayek machine in that research has shown good results in solving complex delayed-reward discrete RL problems such as Blocks World and Rubik's Cube. This paper seeks to combine the most fundamental elements from these two approaches. In general terms, we use the "accuracy-based" structure of XCS, with its Q-learning-related rule update methodology, but adhere to the basic economic rules advocated in the Hayek model. Many of the complex features of each of these models, which undoubtedly improve their performance, have been omitted in order to create a minimal starting point for evaluating further enhancements.

For concreteness, this description of the LCS system is given in terms of the specific mountain-car task. It should be clear how the approach could be generalized to other continuous state space problems. As in XCSR, the classifier rules use "interval predicates" as their condition tests. Each rule's predicate consists of minimum and maximum values for both the position and velocity of the car. A classifier "covers" a particular state if both the position and velocity are within the interval ranges of the rule's predicate. Each rule also has an action, which corresponds to either forward or backward in the mountain-car problem, and a "bid", which is the rule's estimate of the "value of the world" when the state is within the rule's predicate intervals.

1. Some versions of the Mountain-Car task also provide a "neutral" action and have a terminal reward that varies based on the speed of the car when the top of the hill is reached, with the maximum reward at zero velocity. Neither of these complications is included in the version of the problem used here.
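One way such a classifier might be represented in code is sketched below; the field names and layout are illustrative assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    pos_min: float   # interval predicate on position
    pos_max: float
    vel_min: float   # interval predicate on velocity
    vel_max: float
    action: int      # +1 forward, -1 backward
    bid: float       # estimate of the "value of the world" inside the predicate
    wealth: float    # accumulated wealth; the rule dies if this falls too low

    def covers(self, position, velocity):
        """True if the state lies inside both interval predicates."""
        return (self.pos_min <= position <= self.pos_max and
                self.vel_min <= velocity <= self.vel_max)
```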

An initial set of rules is generated randomly as follows. Each of the four interval endpoints has a 0.5 chance of being the true end point of the state space; e.g., the left point of the position interval has a 0.5 chance of being the absolute left point of the entire state space. This corresponds to a "wild card" in discrete LCS, since any point in the state space will pass this condition. Any condition interval endpoint that is not a wild card is selected uniformly from the state space interval, with the restriction that the minimum value must be less than the maximum value. The action of each rule is generated randomly, with equal chance of being forward or backward. Bids are initialized uniformly between 0 and 300, the terminal reward. Each rule also maintains a "wealth" value, which is a measure of how successful the rule has been in the past. The wealth values are all initially set to 300.

The system progresses in a series of auctions. At each auction, each rule bids only if the state is within the rule's condition interval ranges. If so, the rule bids its bid value, but no more than its remaining wealth. The single rule having the highest bid is selected, and its action is used to move the car to its next state. The wealth of the winning rule is reduced by the amount of its bid and by 1 for the step cost. The wealth of the rule is increased by any positive reward received, which only occurs when the top of the hill is reached. It is also increased by the bid of the winning bidder in the next auction. Thus a rule will increase its wealth if it bids less than 299 and its action takes the car to the top of the hill, where it receives a reward of 300, or if it bids x and its action takes the car to a position where another rule bids more than x+1.

After each auction, the rule that won the prior auction is evaluated. If the rule's wealth has fallen to less than half its starting value, the rule is eliminated and any remaining wealth is placed in a "world capital" pool to be used later to provide initial wealth to newly created rules. The rule's bid is adjusted using the update rule

bid_{t+1} = bid_t + α (reward - step cost + sale price - bid_t),

where α is the learning rate. If the rule has been successful, determined by whether its wealth is at least as large as its initial value, and if the world has sufficient capital, the rule is allowed to propagate. This is accomplished by creating a new rule, identical to the original, except that one of its condition intervals is made more restrictive than the original and its bid is adjusted by a small random amount selected uniformly in the interval (-0.5, +0.5). The thinking here is that a rule may be successful but too general: on average it sells the world for more than it paid, but sometimes it loses and sometimes it gains. By restricting the condition criteria, we may be able to find rules that bid more accurately on a subset of the original rule's conditions. The initial wealth for this new rule is paid for out of world capital. The limited amount of capital within the system is the control that limits the number of agents.

The series of auctions followed by evaluations continues until the terminal goal state is attained and the rule whose action led to the goal receives the reward and is evaluated. At this point a new training trajectory begins by randomly selecting a point in the state space. No rule owns the world at this point, so the bid paid by the winner of the first auction in each trajectory goes into the world capital. Exploration is accomplished using an ε-greedy approach. Under this method, with probability ε, the action taken at each step is determined randomly rather than by the action of the winning bidder. When this random exploration is used, the prior rule is still updated using the highest bid, but then another auction is held, restricted to those rules whose actions agree with the predetermined random action. A sketch of the core auction and settlement step appears below.
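The following sketch shows how the auction and the wealth/bid settlement might be coded, assuming the Rule class from the earlier sketch. Rule deletion, propagation, the world capital pool, and the ε-greedy restricted auction are omitted for brevity; the constants are illustrative (α is set to 0.1 in Section 4).

```python
ALPHA = 0.1       # learning rate for bid updates
STEP_COST = 1.0   # cost charged to the winning rule at each step

def run_auction(rules, position, velocity):
    """Select the covering rule with the highest effective bid (capped by wealth).

    Returns (winner, price), where price is the amount the winner will pay;
    (None, 0.0) if no rule covers the state.
    """
    eligible = [r for r in rules if r.covers(position, velocity)]
    if not eligible:
        return None, 0.0
    winner = max(eligible, key=lambda r: min(r.bid, r.wealth))
    return winner, min(winner.bid, winner.wealth)

def settle_step(winner, price, prev_winner, reward):
    """Transfer wealth after an auction and update the previous winner's bid.

    The current winner pays its bid plus the step cost; the previous winner
    (if any) receives the reward its action earned plus the sale price, and
    moves its bid toward (reward - step cost + sale price), i.e. the update
    rule given in the text.
    """
    winner.wealth -= price + STEP_COST
    if prev_winner is not None:
        prev_winner.wealth += reward + price
        prev_winner.bid += ALPHA * (reward - STEP_COST + price - prev_winner.bid)
```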



If a state is reached for which no rule in the population covers that state, rules are generated using the same procedure used to create the initial rule set, until one is found that covers the state. This covering logic is rarely needed, due to the large number of initial rules and their general coverage at inception. A sketch of the rule-generation and covering procedure follows.
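The rule-generation procedure, used both for the initial population and for covering, might look like the sketch below. It assumes the Rule class and the state space bounds from the earlier sketches, and it enforces endpoint ordering by swapping, a simplification of the restriction described above.

```python
import random

INITIAL_WEALTH = 300.0
TERMINAL_REWARD = 300.0

def random_interval(lo, hi):
    """Each endpoint has a 0.5 chance of being the true edge of state space
    (a wild card); otherwise it is drawn uniformly. Endpoints are ordered."""
    a = lo if random.random() < 0.5 else random.uniform(lo, hi)
    b = hi if random.random() < 0.5 else random.uniform(lo, hi)
    return (a, b) if a < b else (b, a)

def random_rule():
    """Generate one random classifier with a uniform bid and full starting wealth."""
    pos_min, pos_max = random_interval(POS_MIN, POS_MAX)
    vel_min, vel_max = random_interval(VEL_MIN, VEL_MAX)
    return Rule(pos_min, pos_max, vel_min, vel_max,
                action=random.choice([-1, +1]),
                bid=random.uniform(0.0, TERMINAL_REWARD),
                wealth=INITIAL_WEALTH)

def cover(rules, position, velocity):
    """Generate random rules until one covers the given state (rarely needed)."""
    while True:
        rule = random_rule()
        if rule.covers(position, velocity):
            rules.append(rule)
            return rule
```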

4. Results

The primary test of this extremely simple LCS approach is whether it is able to learn the complex value function of the mountain-car problem and, if so, how much training is required. To create a standard against which to measure this LCS approach, we first used a nearest-neighbor Q-learning approach similar to that of Smart and Kaelbling, starting from a training set of 1000 trajectories of 25 steps each with random starting points. The data was run repeatedly through the update process until the Q function converged at the 25,000 data points. The state space was then discretized using 100 intervals along each dimension. The value of the state at each of the 10,000 grid points was estimated using weighted nearest neighbors among the 25,000 training points. The resulting value function is shown graphically in Figure 1. The graph shows the highest-value positions in the upper right quadrant, corresponding to high positive speed and a position near the top of the hill. The value decreases as you move counterclockwise and reaches its lowest values near the center, which corresponds to positions near the bottom of the hill with velocity near zero. Random points from this value function were tested against the steps required to reach the goal, and the function was found to be in close agreement with the true minimum, as determined by trial and error with a human controlling the car.

The LCS model used for comparison is as described above, with an initial set of 2000 rules, α = 0.1, and ε = 0.1. The system was run for 1,000,000 auctions, and rules that had not won any of the last 100,000 auctions were eliminated after each goal was attained. The resulting set of rules was used to produce the same 100x100 discretization as with the nearest-neighbor approach. This is done by conducting an auction at each of the grid points and using the highest bid as the value at that point. The action of that bid is the action choice of the learned policy. The resulting value function is shown as Figure 2.
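The grid evaluation might be sketched as follows, reusing run_auction and the state space bounds from the earlier sketches: an auction is held at each grid point, the winning bid is recorded as the value estimate, and the winning rule's action as the policy.

```python
import numpy as np

def evaluate_on_grid(rules, n=100):
    """Build n-by-n value and policy grids by auctioning each grid point."""
    positions = np.linspace(POS_MIN, POS_MAX, n)
    velocities = np.linspace(VEL_MIN, VEL_MAX, n)
    value = np.full((n, n), np.nan)
    policy = np.zeros((n, n), dtype=int)
    for i, pos in enumerate(positions):
        for j, vel in enumerate(velocities):
            winner, price = run_auction(rules, pos, vel)
            if winner is not None:
                value[i, j] = price            # highest bid = value estimate here
                policy[i, j] = winner.action   # winning rule's action = learned policy
    return value, policy
```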

The value function learned by this simple LCS approach is quite close to the value function obtained from the more traditional RL approach. The LCS graph also shows that LCS concentrates its rule refinement activity in the areas where the value function changes rapidly, while keeping more general rules in the areas where the value function changes gradually. The final paper will include analysis of the two approaches from the following perspectives:
• Closeness of the value functions,
• Accuracy of the resulting policies,
• Number of training steps required to attain comparable policies, and
• Number of rules required by LCS to get an accurate estimate of the value function.

5. Discussion and Future Work

The final paper will discuss the following aspects of this work and potential future developments:
• Reasons why the LCS approach works and how it concentrates its learning effort in the most important areas.
• Why following the "greedy policy" in LCS can be an effective exploration strategy.
• Recycling of training points to speed LCS convergence and to minimize the number of actual observations required.
• Enhancements to the basic LCS approach that seem to offer the most potential benefit.

References

C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 1996.
L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning: Proceedings of the Twelfth International Conference, 1995.
E. B. Baum and I. Durdanovic. Toward a model of intelligence as an economy of agents. Machine Learning, 35:155-185, 1999.
A. Bonarini. Delayed reinforcement, fuzzy Q-learning and fuzzy logic controllers. In F. Herrera and J. L. Verdegay, editors, Genetic Algorithms and Soft Computing (Studies in Fuzziness, 8), pages 447-466, Berlin, 1996.
J. Boyan and A. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, MIT Press, 1995.
L. Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 183-188, AAAI Press, San Jose, California, 1992.
J. Forbes and D. Andre. Practical reinforcement learning in continuous domains. Computer Science Division, University of California, Berkeley, Tech. Rep. UCB/CSD-00-1109, 2000.
G. Gordon. Stable function approximation in dynamic programming. In Proceedings of ICML '95, 1995.
L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
R. Munos and A. Moore. Variable resolution discretization for high-accuracy solutions of optimal control problems. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, 1999.
S. P. Singh and R. S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.
W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In P. Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning, pages 903-910. Morgan Kaufmann, San Francisco, CA, 2000.
R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-67, 1995.
S. W. Wilson. Classifier fitness based on accuracy. Evolutionary Computation, 3(2):149-175, 1995.
S. W. Wilson. State of XCS classifier system research. Second International Workshop on Learning Classifier Systems (IWLCS-99), Orlando, FL, July 13, 1999.
S. W. Wilson. Get real! XCS with continuous-valued inputs. In L. Booker, S. Forrest, M. Mitchell, and R. Riolo, editors, Festschrift in Honor of John H. Holland, pages 111-121, May 15-18, 1999.