KATHOLIEKE UNIVERSITEIT LEUVEN FACULTEIT TOEGEPASTE WETENSCHAPPEN DEPARTEMENT COMPUTERWETENSCHAPPEN Celestijnenlaan 200A – 3001 Leuven (Heverlee)
RELATIONAL REINFORCEMENT LEARNING
Promotors: Prof. Dr. L. DE RAEDT, Prof. Dr. ir. M. BRUYNOOGHE
Dissertation presented to obtain the degree of Doctor in Applied Sciences by Kurt DRIESSENS
May 2004
Jury: Prof. Dr. ir. J. Berlamont, chair; Prof. Dr. L. De Raedt, promotor; Prof. Dr. ir. M. Bruynooghe, promotor; Prof. Dr. D. De Schreye; Prof. Dr. ir. E. Steegmans; Prof. Dr. S. Džeroski, Institut Jožef Stefan, Ljubljana, Slovenia; Prof. Dr. P. Tadepalli, Oregon State University, Corvallis, Oregon, USA
U.D.C. 681.3*I2
© Katholieke Universiteit Leuven - Faculteit Toegepaste Wetenschappen, Arenbergkasteel, B-3001 Heverlee (Belgium)
All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the publisher.
D/2004/7515/39
ISBN 90-5682-500-3
Relational Reinforcement Learning

Reinforcement learning is a subtopic of machine learning that is concerned with software systems that learn to behave through interaction with their environment and receive only feedback on the quality of their current behavior instead of a set of correct (and possibly incorrect) learning examples. Although reinforcement learning algorithms have been studied extensively in a propositional setting, their usefulness in complex problems is limited by their inability to incorporate relational information about the environment.

In this work, a first relational reinforcement learning (or RRL) system is presented. This RRL system combines Q-learning with the representational power of relational learning by using relational representations for states and actions and by employing a relational regression algorithm to approximate the Q-values generated through a standard Q-learning algorithm. The use of relational representations permits the use of structural information, such as the existence of objects and the relations between objects, in the description of the resulting policy (through the learned Q-function approximation).

Three incremental relational regression techniques are developed that can be used in the RRL system. These techniques consist of an incremental relational regression tree algorithm, a relational version of instance based regression with several example selection mechanisms and an algorithm based on Gaussian processes that uses graph kernels as a covariance function. The capabilities of the RRL approach and the performance of the three regression algorithms are evaluated empirically using the blocks world with a number of different goals and the computer games Digger and Tetris.

To further increase the applicability of relational reinforcement learning, two techniques are introduced that allow the integration of background knowledge into the RRL system. It is shown how guided exploration can improve performance in environments with sparse rewards, and a new hierarchical reinforcement learning method is presented that can be used for concurrent goals.
Preface

I’m showing off the Digger demo to my dad.
me: “So it walks around and remembers what it encounters.”
The demo pauses while tg goes through the learning examples.
me: “And then it thinks about what it has done.”
dad: “But computers can’t think.”
me: “Euhm ... it’s mathematics ... it’s basically the same ...”

When I started my academic career at the Catholic University of Leuven, I situated myself in between two research groups, “Machine Learning” and “Distributed Systems”. I started my work on a subject that shared interests with both groups, i.e., the RoboCup challenge. After two years of working (or playing) with this subject, and being invited to give all kinds of presentations on it together with my partner in crime Nico Jacobs1, it became clear that the RoboCup environment at that time demanded too much technical work to enable a team of 1 or 2 researchers to do actual machine learning research on it. At that time, I decided to abandon the RoboCup community and focus my research on relational reinforcement learning, at the same time abandoning my connection to the Distributed Systems research group. The appeal of this topic for me lay partially in its connection with psychology and human- and animal-learning. Reinforcement learning research on computers allows for the investigation of pure learning mechanisms, and excludes almost all possible side-effects such as prejudice. Unfortunately, it also excludes actual comprehension of the learning task. It is cheaper than research on humans, though. What did survive from my RoboCup days was the urge to see the system I was building actually learn something. This drove the development of the RRL system with its different regression algorithms and the many experiments that were performed. Instead of building a theoretical framework for relational reinforcement learning, I was driven to demonstrate its capabilities and tried to apply the RRL system to appealing applications. Imagine my surprise when it all turned out to be mathematics after all ...

1 We even made national TV!
“Come on Kurt, what is required to make RRL applicable in the industry?” (Luc De Raedt)

Most PhD students thank their promotors in their preface, but there are usually very good reasons for this. I need to thank Luc De Raedt for coming up with this great topic. I think that not every PhD student can benefit from being handed a topic that made such an impact in the research community. I must admit that I was a little worried when Luc told me that he would be leaving the department to go work in Freiburg, but in retrospect I have to thank him for making that decision as well. It has led to great working visits to a beautiful city and initiated contact with new people with whom it was a pleasure to work together. I will never forget the last three meetings Luc and I had on the text of this dissertation, which took place in a plane flying from Washington to Brussels, in a pub in the Black Forest in Germany (after midnight) and in his own living room, in a house on a street that exists on no available map. Maurice Bruynooghe, I need to thank especially for his support during the last few months. With Luc living and working at a safe distance, he had to put up with all my (at that time) urgent questions and requests, and always found time to squeeze the proofreading of yet another new version of a chapter into his busy schedule. He often succeeded in giving me the needed confidence boost to get me back to work again. What I have learned and will remember most from Maurice is that a lot can be said with very few words, and that there is always a bigger context to your research challenges.

“Isn’t it frustrating that nobody understands what you are working on?” (Wendy de Pree)

I would like to thank the members of my jury: Danny De Schreye, Eric Steegmans, Sašo Džeroski and Prasad Tadepalli for their valuable comments on this text during the last months. It is amazing how much better a text can become after it is already finished. I also thank Jean Berlamont. Of this jury I need to spend a little extra time on Sašo. Together with Luc, he laid the foundations for this research and with lots of enthusiasm followed up on and contributed to this research ever since. However, Sašo went above and beyond and became as good as a third promotor of this work, and a much appreciated personal friend. Sašo was however not the only person to contribute to the research in this dissertation. Jan Ramon’s expertise was a great help on a number of topics and Thomas Gärtner was kind enough to include me as an author of one of his award-winning publications. Kristian Kersting and Martijn van Otterlo were enthusiastic discussion partners. Honorable mentions go to Nico Jacobs and
Hendrik Blockeel, whose direct contributions to this work were limited, but who have provided me with lovely work-related memories. I must not forget Jan, Raymond, Stefan, Sofie, Anneleen, Celine, Joost, Tom and Daan who make up the rest of the rapidly growing machine learning research group in Leuven, and also Luc and Wim, previous members who have left the group and taken up real jobs.

“Every piece of paper in this house has “Relational Reinforcement Learning” on it!” (Ilse Goossens)

I need to thank my parents for letting me grow up in an environment which has led me here, and my sister for setting me a tough target to aim for. I want to thank my friends for letting me do my thing, for showing interest when I needed it (or at least faking it), for trying to comprehend what I was working on (or at least pretending to), but most of all for pulling me out of the workplace and making me forget about algorithms and statistics once in a while. And lastly, I would like to thank Ilse, together with whom I’ve battled the rest of the world from long before anyone had ever heard of relational reinforcement learning. She keeps me on my toes, she keeps me happy, she keeps me going, she keeps me sane ... she’s a keeper. I’ll end this preface with one last quote. It’s not from a movie, it’s not even from a human. It originates from an internet chat-bot, i.e., a computer program that talks back when you chat with it. I think it’s a pretty smart program ...

“I think that all learning is essentially reinforcement learning. Can you think of learning which has no motive behind it? This may sound disappointing, but that’s what it’s all about: pleasure and pain in different degrees and flavors.” Alan, a chat-bot2.
2 http://www.a-i.com/
Contents

Part I   Introduction

1 Introduction
   1.1 Intelligent Computer Programs
   1.2 Adaptive Computer Programs
   1.3 Learning from Reinforcements
   1.4 Relational Reinforcement Learning
   1.5 Contributions
   1.6 Organization of the Text
   1.7 Bibliographical Note

2 Reinforcement Learning
   2.1 Introduction
   2.2 The Reinforcement Learning Framework
      2.2.1 The Learning Task
      2.2.2 Value Functions
      2.2.3 Nondeterministic Environments and Policies
   2.3 Solution Methods
      2.3.1 Direct Policy Search
      2.3.2 Value Function Based Approaches
      2.3.3 Q-learning
         2.3.3.1 Q-value Function Generalization
         2.3.3.2 Exploration vs. Exploitation
   2.4 Conclusions

3 State and Action Representation
   3.1 Introduction
   3.2 A Very Simple Format
   3.3 Propositional Representations
   3.4 Deictic Representations
   3.5 Structural (or Relational) Representations
      3.5.1 Relational Interpretations
      3.5.2 Labelled Directed Graphs
      3.5.3 The Blocks World
   3.6 Conclusions

4 Relational Reinforcement Learning
   4.1 Introduction
   4.2 Relational Q-Learning
   4.3 The RRL System
      4.3.1 The Suggested Approach
      4.3.2 A General Algorithm
   4.4 Incremental Relational Regression
   4.5 A Proof of Concept
   4.6 Some Closely Related Approaches
      4.6.1 Translation to a Propositional Task
      4.6.2 Direct Policy Search
      4.6.3 Relational Markov Decision Processes
      4.6.4 Other Related Techniques
   4.7 Conclusions

Part II   On First Order Regression

5 RRL-tg
   5.1 Introduction
   5.2 Related Work
   5.3 The tg Algorithm
      5.3.1 Relational Trees
      5.3.2 Candidate Test Creation
      5.3.3 Candidate Test Selection
      5.3.4 RRL-tg
   5.4 Experiments
      5.4.1 The Experimental Setup
         5.4.1.1 Tasks in the Blocks World
         5.4.1.2 The Learning Graphs
      5.4.2 The Results
   5.5 Possible Extensions
   5.6 Conclusions

6 RRL-rib
   6.1 Introduction
   6.2 Nearest Neighbor Methods
   6.3 Relational Distances
   6.4 The rib Algorithm
      6.4.1 Limiting the Inflow
      6.4.2 Forgetting Stored Examples
         6.4.2.1 Error Contribution
         6.4.2.2 Error Proximity
      6.4.3 A Q-learning Specific Strategy: Maximum Variance
      6.4.4 The Algorithm
   6.5 Experiments
      6.5.1 A Simple Task
         6.5.1.1 Inflow Behavior
         6.5.1.2 Adding an Upper Limit
         6.5.1.3 The Effects of Maximum Variance
      6.5.2 The Blocks World
         6.5.2.1 The Influence of Different Data Base Sizes
         6.5.2.2 Comparing rib and tg
   6.6 Possible Extensions
   6.7 Conclusions

7 RRL-kbr
   7.1 Introduction
   7.2 Kernel Methods
   7.3 Gaussian Processes for Regression
   7.4 Graph Kernels
      7.4.1 Labeled Directed Graphs
      7.4.2 Graph Degree and Adjacency Matrix
      7.4.3 Product Graph Kernels
      7.4.4 Computing Graph Kernels
      7.4.5 Radial Basis Functions
   7.5 Blocks World Kernels
      7.5.1 State and Action Representation
      7.5.2 A Blocks World Kernel
   7.6 Experiments
      7.6.1 The Influence of the Series Parameter β
      7.6.2 The Influence of the Generalization Parameter ρ
      7.6.3 Comparing kbr, rib and tg
   7.7 Future Work
   7.8 Conclusions

Part III   On Larger Environments

8 Guided RRL
   8.1 Introduction
   8.2 Guidance and Reinforcement Learning
      8.2.1 The Need for Guidance
      8.2.2 Using “Reasonable” Policies for Guidance
      8.2.3 Different Strategies for Supplying Guidance
   8.3 Experiments
      8.3.1 Experimental Setup
      8.3.2 Guidance at the Start of Learning
      8.3.3 A Closer Look at RRL-tg
      8.3.4 Spreading the Guidance
      8.3.5 Active Guidance
      8.3.6 An “Idealized” Learning Environment
   8.4 Related Work
   8.5 Conclusions
   8.6 Further Work

9 Two Computer Games
   9.1 Introduction
   9.2 The Digger Game
      9.2.1 Learning Difficulties
      9.2.2 State Representation
      9.2.3 Two Concurrent Subgoals
   9.3 Hierarchical Reinforcement Learning
   9.4 Concurrent Goals and RRL-tg
   9.5 Experiments in the Digger Game
      9.5.1 Bootstrapping with Guidance
      9.5.2 Separating the Subtasks
   9.6 The Tetris Game
      9.6.1 Q-values in the Tetris Game
      9.6.2 Afterstates
      9.6.3 Experiments
      9.6.4 Discussion
   9.7 Conclusions

Part IV   Conclusions

10 Conclusions
   10.1 The RRL System
   10.2 Comparing the Regression Algorithms
   10.3 RRL on the Digger and Tetris Games
   10.4 The Leuven Methodology
   10.5 In General

11 Future Work
   11.1 Further Work on Regression Algorithms
   11.2 Integration of Domain Knowledge
   11.3 Integration of Planning, Model Building and RRL
   11.4 Policy Learning
   11.5 Applications
   11.6 Theoretical Framework for RRL

Part V   Appendices

A On Blocks World Representations
   A.1 The Blocks World as a Relational Interpretation
      A.1.1 Clausal Logic
      A.1.2 The Blocks World
   A.2 The Blocks World as a Graph
List of Figures

2.1 Reinforcement Learning Agent and its Environment
3.1 Example States in Tic-tac-toe
3.2 Example State in Final Fantasy X
3.3 Delivery Robot in its Environment
3.4 Example Delivery Robot State as a Relational Interpretation
3.5 Graph Representation of a Road Map
3.6 Example State in The Blocks World
3.7 Blocks World State as a Relational Interpretation
3.8 Blocks World State as a Graph
4.1 Example Epoch in the Blocks World
5.1 Relational Regression Tree
5.2 Non Reachable Goal States
5.3 tg on Stacking
5.4 tg on Unstacking
5.5 tg on On(A,B)
5.6 Learned Tree for On(A,B) Task
5.7 First Order Tree Restructuring
6.1 Renaming in the Blocks World
6.2 Different Regions for a Regression Algorithm
6.3 Maximum Variance Example Selection
6.4 The Corridor Application
6.5 Prediction Errors for Varying Inflow Limitations
6.6 Database Sizes for Varying Inflow Limitations
6.7 Effects of Selection by Error Contribution
6.8 Effects of Selection by Error Proximity
6.9 The Shape of a Q-function
6.10 Effects of Selection by Maximum Variance
6.11 rib on Stacking
6.12 rib on Unstacking
6.13 rib on On(A,B)
6.14 rib vs tg on Stacking
6.15 rib vs tg on Unstacking
6.16 rib vs tg on On(A,B)
7.1 The Covariance Matrix for Gaussian Processes
7.2 Examples of a Graph, Labeled Graph and Directed Graph
7.3 Examples of a Walk, a Path and a Cycle
7.4 Direct Graph Product
7.5 Weights of the Geometric Matrix Series
7.6 Weights of the Exponential Matrix Series
7.7 Graph Representation of a Blocks World State
7.8 Graph Representation of a Blocks World (state, action) Pair
7.9 Influence of the Exponential Parameter on Stacking
7.10 Influence of the Exponential Parameter on Unstacking
7.11 Influence of the Exponential Parameter on On(A,B)
7.12 Influence of the Generalization Parameter on Stacking
7.13 Influence of the Generalization Parameter on Unstacking
7.14 Influence of the Generalization Parameter on On(A,B)
7.15 kbr vs rib and tg on Stacking
7.16 kbr vs rib and tg on Unstacking
7.17 kbr vs rib and tg on On(A,B)
8.1 Blocks World Random Policy Success Rate
8.2 Blocks World Random Policy Noise Ratio
8.3 Guidance at Start for Stacking
8.4 Guidance at Start for Unstacking and On(A,B)
8.5 Half Optimal Guidance for tg
8.6 Spread Guidance for Stacking and Unstacking
8.7 Spread Guidance for On(A,B)
8.8 Active Guidance for Stacking and Unstacking
8.9 Active Guidance for On(A,B)
8.10 Stacking in an “Idealized” Environment
8.11 Unstacking in an “Idealized” Environment
8.12 On(A,B) in an “Idealized” Environment
9.1 The Digger Game
9.2 The Freedom of Movement in Digger
9.3 Concurrent Goals in Digger
9.4 Concurrent Goals with Competing Actions
9.5 Performance Results for the Digger Game
9.6 Hierarchical Learning for the Digger Game
9.7 A Tetris Snapshot
9.8 Greedy Action Problem in Tetris
9.9 Afterstates in Tetris
A.1 Example State in The Blocks World
A.2 Graph Representation of a Blocks World State
List of Algorithms

2.1 Q-learning Algorithm
2.2 Episodic Q-learning Algorithm
4.1 The Relational Reinforcement Learning Algorithm
4.2 A First RRL Algorithm
5.1 The tg Algorithm
6.1 The rib Data Selection Algorithm
List of Tables

5.1 Blocks World Sizes
5.2 Q-tree Sizes
6.1 Database Sizes for rib-mv
6.2 Execution Times for RRL-tg and RRL-rib
7.1 Execution Times for RRL-tg, RRL-rib and RRL-kbr
List of Symbols

The following table lists some symbols used throughout the text, together with a short description of their meaning.

Reinforcement Learning Framework

  a, at           action (at time-step t)
  A               set of actions
  A(s)            set of state dependent actions
  δ               transition function
  δ(s, a)         resulting state of taking action a in state s
  E(δ,r,π)(.)     expected value given δ, r and π
  γ               discount factor
  goal            goal function
  Pδ              transition probability function
  Pδ(s, a, s')    probability that taking action a in state s results in state s'
  Pπ              probabilistic policy function
  Pπ(s, a)        probability that action a is chosen in state s
  pre             precondition function
  π               policy
  π*              optimal policy
  π̂               policy belonging to an approximate Q-function
  π(s)            action chosen by policy π in state s
  Q               Q-value (or Quality value)
  Q̂, Q̂e           approximated Q-function (after learning epoch e)
  Q*              optimal Q-value
  Qπ(s, a)        Q-value of action a in state s following policy π
  r               reward or reward function
  rt              reward received at time-step t
  r(s, a)         reward for taking action a in state s
  s, st           state (at time-step t)
  S               set of states
  Vπ              value function according to policy π
  V*              value function according to an optimal policy
  Vπ(s)           utility of state s following policy π
  Vt(s)           utility approximation at time-step t
  visits(s, a)    number of times action a was performed from state s

Logic

  ,               conjunction
  ;               disjunction
  \=              not equal
  ←, :-           implication operator
  bi              negative literals
  hi              positive literals
  p/n             predicate p with arity n

Graphs

  δ+(v)           set of edges starting from vertex v
  δ−(v)           set of edges ending in vertex v
  |δ+(v)|         outdegree of vertex v
  |δ−(v)|         indegree of vertex v
  Δ+(G)           maximal outdegree of graph G
  Δ−(G)           maximal indegree of graph G
  e, ei           edge
  E               set of edges
  E               adjacency matrix
  G               graph
  ℓ, ℓi           label
  l, li           label variable
  L               set of labels
  label           label function
  Ψ               edge to vertices function
  ν, νi           vertex
  v, vi           vertex variable
  V               set of vertices

Learning/Regression Framework

  distij          distance between examples i and j
  e, ei           example
  E, Examples     set of learning examples
  F̂               function approximation
  n               number of examples
  qi              Q-value of example i
  σ               variance
  t, ti           target value (of example i)

Instance Based Regression

  errori          cumulative prediction error for example i
  errori−i        cumulative prediction error for example i without example i
  EC-scorei       error contribution score of example i
  EP-scorei       error proximity score of example i
  Fg, Fl          example filter parameters
  M               maximum variance

Kernel Based Regression

  ⟨x, x'⟩         inner product of x and x'
  β               exponential weights parameter
  CN              covariance matrix of the first N examples
  Cij, C(xi, xj)  covariance of examples xi and xj
  φ               feature transformation
  γ               geometric weight parameter
  H               feature space, Hilbert space
  k               kernel function
  kconv           convolution kernel
  k×              product graph kernel
  k*              blocks world kernel
  k(x, x')        kernel value of examples x and x'
  µ               mean target vector
  R               composition relation
  R−1             decomposition relation
  ρ               radial basis parameter
  t, ti           target value (of example i)
  t, ti           array of target values (up to example i)
  x, xi           example
  X               example space

Other

  argmaxa(.)      the a that maximizes the given expression
  maxa(.)         the maximum value of the expression for varying values of a
  exp(.)          exponential function
  P(A|B)          probability of A given B
  R               set of real numbers
  T               temperature (in Boltzmann exploration)
  x̄               average x-value
  N               set of positive integers
Part I
Introduction
Chapter 1
Introduction

“I assume I need no introduction.”
Interview with a Vampire

This work is intended to be a very small step in the quest to build intelligent computer programs. As is the case with most doctoral dissertations in computer science (and probably most other scientific fields as well), the research in this work on the topic of Relational Reinforcement Learning represents developments in a small niche of its research field. Situating the topic of this text therefore generates a long list of more general topics, starting from Reinforcement Learning (Sutton and Barto, 1998; Kaelbling et al., 1996) — or, more specifically, Q-learning (Watkins, 1989) — through the field of Machine Learning (Mitchell, 1997; Langley, 1994) and ending at the relatively broad topic of Artificial Intelligence (Russell and Norvig, 1995). To facilitate the introduction of this ranking of topics, they will be discussed more elaborately in reverse order.
1.1 Intelligent Computer Programs
One definition of artificial intelligence, as given by Russell Beale once, is the following:

“Artificial Intelligence can be defined as the attempt to get real machines to behave like the ones in the movies.”

Whereas a few decades ago the field of artificial intelligence was regarded as a stalling research field, today artificial intelligence is omnipresent in science fiction and widely regarded as the next big step in technological evolution. From the enchanting personality of Andrew Martin, the robot from Isaac Asimov’s story “Bicentennial Man” (Asimov, 1976), later portrayed by Robin Williams in the movie of the same name, to the more menacing HAL 9000 from the novel “2001, a Space Odyssey” of Arthur C. Clarke (Clarke, 1968), the artificial intelligence entities portrayed in science fiction stories display a large amount of human level intelligence (or better), often supplemented by an endearing lack of common sense (such as a sense of humor). The large increase of computer technology in the everyday life of human beings and the imaginative minds of science fiction authors have largely increased the public interest in the research field. Driven by the great performance increases in computational hardware and the availability of an ever growing number of research results, the field of artificial intelligence has reinforced this interest with a few impressive accomplishments such as the chess computer Deep Blue.

A more scientific view of the research field is the definition of artificial intelligence by John McCarthy, who coined the term in 1955. He defines it as:

“The science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.”

While originally the research realized in the field of artificial intelligence focused primarily on expert systems and problem solving algorithms, the field now includes a larger variety of topics ranging over knowledge representations, control methods, natural language processing, etc. One important subfield of artificial intelligence is that of “Machine Learning”, as some people have posed that real intelligence is unattainable without the ability to learn.
1.2 Adaptive Computer Programs
The most obvious shortcoming of artificial intelligence tends to be the predictability exhibited by supposedly intelligent computer programs, such as, for example, artificial adversaries in computer games. The human mind is very well suited to recognize situations in which a system displays repeated and easily predicted behavior, and once this behavior has been located it is often very easily exploited. Machine learning research is concerned with computer programs that learn from experience and possibly adapt their behavior when necessary. The subfield of machine learning is defined by Mitchell (1997) as:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

For example, a computer learning to play chess (task T) would be judged on the increase in the number of won games (performance measure P) per number of played games (experience E). A less theoretically correct definition, that is however more accessible to the general public that has never heard of the field before, is given by Rob Schapire in his lectures on the “Foundations of Machine Learning”:

“Machine Learning studies computer programs for learning to do stuff.”

Although the machine learning subfield of artificial intelligence is not as well known to the general public, the introduction of machine learning technology into everyday life has already started. Applications range from data mining engines used by most supermarket chains to discover trends in consumer behavior, to the spam-mail filters that are built into most up-to-date e-mail clients. Although many subtopics can be identified within the field of machine learning, one possible way of dividing the research within machine learning is by looking at the kind of task one is trying to solve. The first is the extraction of new knowledge out of available data. To this field belongs the work on user modelling (Kobsa, 2001), data-mining (Witten and Frank, 1999; Džeroski and Lavrac, 2001), etc. The second subfield is concerned with programs that learn to act appropriately. A nice overview of different learning settings in this field is given by Boutilier et al. (1999). It is in this field that reinforcement learning is situated.
1.3 Learning from Reinforcements
Currently, only a very small number of everyday computer programs are pro-active. Most computer applications (luckily) only perform tasks when the user pushes a button or activates some task. Pro-active software systems are usually referred to as software agents (Jennings and Woodridge, 1995). An example of a pro-active software system is a web-spider, as used by most internet search engines to generate a database of available web-pages. Such a web-spider searches the internet, trying to discover new web-pages by following links on other pages. When deciding which links to follow, the web-spider tries to restrict itself to links that lead to new information and thus to limit the amount of data it needs to download. Although it is possible (and current practice) to design and implement the link selection strategy of the web-spider by hand, it might be possible to generate better performing strategies through the use of appropriate machine learning techniques (Rennie and McCallum, 1999).
Reinforcement Learning problems are characterized by the type of information that is presented to the learning system. Whereas supervised learning techniques, such as behavioral cloning techniques (Bain and Sammut, 1995; Urbancic et al., 1996), are presented with an array of already classified learning examples, the only information that is supplied to a reinforcement learning system is a quantitative assessment of its current behavior. Instead of a number of examples that show the system how to behave, the reinforcement learner has to derive the appropriate behavior by linking the appropriate parts of its behavior with the received rewards or punishments. In the example of the web-spider, the reinforcement learning system could receive a small punishment for each followed link, but an appropriate reward for each newly discovered web-page. By attempting to limit the amount of punishment and maximize the received rewards, the reinforcement learning system might be able to generate a very good internet exploration strategy. Q-learning is one of many reinforcement learning algorithms. It uses the rewards or punishments it receives to calculate a Quality value for each possible action in each possible state. When the learning system then needs to decide what action to take, it selects the action with the highest “Q-value” in the current state.
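As a concrete illustration of this last idea, the small Python sketch below selects the action with the highest Q-value from a simple table; the table, the helper function and the action names are hypothetical stand-ins and are unrelated to the RRL system introduced later in this text.

    # Greedy action selection from a table of Quality values (illustrative only).
    from collections import defaultdict

    q_table = defaultdict(float)          # maps (state, action) pairs to Q-values, default 0.0

    def legal_actions(state):             # domain specific in practice; fixed here for illustration
        return ["left", "right", "wait"]

    def greedy_action(state):
        """Select the action with the highest Q-value in the given state."""
        return max(legal_actions(state), key=lambda action: q_table[(state, action)])

    q_table[("start", "right")] = 1.0
    print(greedy_action("start"))         # prints "right"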
1.4 Relational Reinforcement Learning
Relational Reinforcement Learning is concerned with reinforcement learning in domains that exhibit structural properties and in which different kinds of related objects exist. These kinds of domains are usually characterized by a very large and possibly unbounded number of different possible states and actions. In this kind of environment, most traditional reinforcement learning techniques break down. Even Q-learning techniques that use a propositional Q-value generalization can be hard to apply to tasks that hold an unknown number of objects or are largely defined by the relations between available objects. Relational reinforcement learning, as presented in this work, will employ a relational regression technique in cooperation with a Q-learning algorithm to build a relational, generalized Q-function. As such, it combines techniques from reinforcement learning with generalization techniques from inductive logic programming. Due to the use of a more expressive representation language to represent states, actions and Q-functions, the proposed relational reinforcement learning system can potentially be applied to a wider range of learning tasks than conventional reinforcement learning. It also enables the abstraction from specific goals or even from specific learning environments and allows for the exploitation of results from previous learning phases when addressing new (more complex), but related situations.
1.5 Contributions
The main contribution of this work is the development of a first applicable relational reinforcement learning system. The foundations for this system were laid by Sašo Džeroski, Luc De Raedt and Hendrik Blockeel in (Džeroski et al., 1998) and further investigated in (Džeroski et al., 2001), but it wasn’t until a fully incremental regression algorithm was developed that the system could be applied to non-toy examples.

A second contribution is the development of three incremental relational regression algorithms. A relational regression algorithm generalizes over learning examples with a continuous target value and makes predictions about the value of unseen examples, using a relational representation for both the learning examples and the resulting function. The tg algorithm that builds first order regression trees was developed together with Jan Ramon and Hendrik Blockeel and first published in (Driessens et al., 2001). It uses a number of incrementally updated statistics to build a relational regression tree. The instance based rib algorithm, first discussed in (Driessens and Ramon, 2003), was built using Jan Ramon’s expertise on relational distances. Using nearest-neighbor prediction, the rib algorithm selects which examples to store in its data-base by using example selection criteria based on local prediction errors or maximum Q-function variation. The third regression algorithm, based on Gaussian processes for regression and using graph kernels as a covariance function between examples, is called kbr and was developed with the help of Thomas Gärtner and Jan Ramon and first published in (Gärtner et al., 2003a). Although these regression algorithms were developed for use with the RRL system, they are more widely applicable to other relational learning problems with a continuous prediction class.

A third contribution is the development of two additions to the RRL system that increase its applicability to larger, more difficult tasks. The first is a methodology to supply guidance to the RRL system through the use of external, reasonable policies, which enables the RRL system to perform better on tasks with sparse and hard to reach rewards. This methodology was first published in (Driessens and Džeroski, 2002a). The second is a novel hierarchical reinforcement learning method that uses the expressive power of relational representations to supply information about learned Q-functions on partial problems to the learning algorithm. This new hierarchical method can be used to handle concurrent goals and was first presented in (Driessens and Blockeel, 2001).

A last contribution is the application of the RRL system to the non-trivial tasks of Digger and Tetris. Although both Digger and Tetris are still toy applications and have little to do with real world problems, they are relatively complex compared to the usual tasks handled with reinforcement learning.
1.6 Organization of the Text
This work is composed of a rather large number of relatively small chapters. To accentuate the structure of the text, an additional division into four parts has been made.

Part I discusses a number of introductory issues. While this chapter gives a very general introduction to the research area that this work is situated in, Chapter 2 introduces the paradigm of reinforcement learning. Chapter 3 discusses different representational formats that are available to describe the environment and possible actions of the reinforcement learning agent. A formal definition of the problem addressed in this work is given in Chapter 4, together with a general description of the relational reinforcement learning or RRL system that is the main accomplishment of this thesis.

Three different approaches to incremental relational regression are presented in Part II. Chapter 5 discusses the regression tree algorithm tg and Chapter 6 introduces relational instance based regression. A regression algorithm based on Gaussian processes and graph kernels is presented in Chapter 7. Each of these regression algorithms is thoroughly tested on a variety of problems in the blocks world.

Part III discusses some extra methodologies that can be added to the basic RRL system to increase its performance on large problems and illustrates these using larger environments in the blocks world and also in two computer games: Digger and Tetris. Chapter 8 illustrates that the performance of the RRL system can be improved by supplying a reasonable guidance policy. Chapter 9 introduces hierarchical reinforcement learning for concurrent goals using the Digger game as a testing environment and illustrates the behavior of the RRL system on the popular Tetris game.

In Part IV, Chapters 10 and 11 discuss some conclusions that can be drawn from this work and highlight a number of directions for possible further work.
1.7 Bibliographical Note
Most of this dissertation has already been published elsewhere. The following list contains the key articles:

On the RRL system
1. (Džeroski et al., 1998) and (Džeroski et al., 2001) introduced relational reinforcement learning through the use of Q-learning with a relational generalization technique.
2. (Driessens, 2001) presents an agent oriented discussion of relational reinforcement learning.
On incremental relational regression
1. (Driessens et al., 2001) introduced the first incremental RRL system through the development of the tg incremental regression tree algorithm.
2. (Driessens and Ramon, 2003) presented relational instance based regression for the RRL system.
3. (Gärtner et al., 2003a) developed a regression technique based on Gaussian processes and graph kernels for the RRL system.

On the use of domain knowledge
1. (Driessens and Džeroski, 2002a) and (Driessens and Džeroski, 2002b) discussed the use of guidance with reasonable policies to enhance the performance of the RRL system in environments with sparse rewards.
2. (Driessens and Blockeel, 2001) presented hierarchical reinforcement learning for concurrent goals applied to the Digger computer game.
Chapter 2
Reinforcement Learning

“No reward is worth this!”
A New Hope
2.1 Introduction
The most basic style of learning, i.e., the type of learning that is performed by all intelligent creatures, is governed by positive and negative rewards. A dog will learn to behave as desired by presenting it with reinforcements at appropriate times. A new-born child will learn the shortest path to the center of the attention of its parents quickly (often too quickly). Other types of learning, mainly supervised learning, require cognitive interaction between the learner and a third entity, a kind of teacher who provides a set of correct and incorrect examples. This cognitive interaction requires a higher level of intelligence than what is needed for reinforcement learning. Although the rewards used for reinforcement learning can also assume a cognitive interaction, they don’t need to and can address more instinctive needs. To illustrate the difference between the two, here are some examples of typical reinforcement learning and supervised learning:

• A dog is hugged and petted when it returns a stick and learns to repeat this behavior. (Reinforcement Learning)

• A math student watches the teacher perform exercises on the blackboard and imitates this behavior for his home-work. (Supervised Learning)

• A baby starts to scream because it is hungry and gets a bottle of milk presented. The baby now screams a lot more. (Reinforcement Learning)
• A reader overlooks four examples of reinforcement and supervised learning and learns to distinguish the two. (Supervised Learning)
The following section provides a formal definition of the reinforcement learning problem domain. Section 2.3 discusses an array of possible solution approaches, ranging from dynamic programming methods to temporal difference approaches. In particular, the focus is on Q-learning, a model-free reinforcement learning algorithm, which is discussed at length in Section 2.3.3.
2.2 The Reinforcement Learning Framework
In computer science, reinforcement learning is performed by a software agent that interacts with its environment and receives only feedback on the quality of its performance instead of a set of correct (and possibly incorrect) examples. The system that is trying to solve the reinforcement learning problem will be referred to as the agent. Such an agent will interact with its environment, sometimes also referred to as its world. This interaction consists of actions and perceptions, as depicted in Figure 2.1. The agent is supplied with an indication of the current state of the environment and chooses an action to execute, to which the environment reacts by presenting an updated state indication.
[Figure 2.1: The interaction between an agent and its environment in a reinforcement learning problem. The agent sends actions to the environment; the environment returns the resulting state and a reward.]

Also available in the perceptions presented to the agent is a reward, a numerical value that is given for each taken action (or for each reached world-state).
To solve the reinforcement learning problem, the agent will try to find a policy that maximizes the received rewards over time.
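The interaction loop of Figure 2.1 can be sketched in a few lines of Python; the environment object and its reset()/step() methods below are hypothetical stand-ins for illustration only and are not an interface used elsewhere in this text.

    # Skeleton of the agent-environment interaction loop of Figure 2.1.
    def run_episode(env, policy, max_steps=100):
        state = env.reset()                          # initial state indication
        total_reward = 0.0
        for _ in range(max_steps):
            action = policy(state)                   # the agent chooses an action
            state, reward, done = env.step(action)   # the environment reacts with a state and a reward
            total_reward += reward
            if done:
                break
        return total_reward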
2.2.1 The Learning Task
A more formal definition of a reinforcement learning problem is presented in this section. This formulation of reinforcement learning is comparable to the ones given by Mitchell (1997) and Kaelbling et al. (1996).

Definition 2.1 A reinforcement learning task is defined as follows:

Given
• a set of states S,
• a set of actions A,
• a (possibly unknown) transition function δ : S × A → S,
• an unknown real-valued reward function r : S × A → R.

Find a policy π∗ : S → A that maximizes a value function V π (st) for all states st ∈ S. The utility value V π (s) is based on the rewards received starting from state s and following policy π.

At each point in time t, the reinforcement learning agent is in state st, one of the states of S, and selects an action at = π(st) ∈ A to execute according to its policy π. Executing an action at in a state st will put the agent in a new state st+1 = δ(st, at). The agent also receives a reward rt = r(st, at). The value V π (s) indicates the value or utility of state s, often related to the cumulative reward an agent can expect starting in state s and following policy π. A few possible definitions of utility functions will be given in Section 2.2.2.

It is possible that not all actions can be executed in all world states. In environments where the set of available actions depends on the state of the environment, the possible actions in state s will be indicated as A(s). The task of learning is to find an optimal policy, i.e., a policy that will maximize the chosen value function. The optimal policy is denoted by π∗ and the corresponding value-function by V∗.

For an agent learning to play chess, the set of states S consists of all the legal chess states that can be reached during play, and the set of actions A(s) consists of all the legal moves in state s. The transition function δ includes both the result of the action chosen by the agent and the result of the counter move of its opponent. Unless the agent was playing a very predictable opponent, δ is unknown to the agent in this case. The reward function could easily be defined to present the agent with a reward of 1 when it wins the game, −1 when it loses and 0 for all other actions. The task of the learning agent is then to find a policy which maximizes the received reward and therefore maximizes the number of won games.
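To make the ingredients of Definition 2.1 concrete, the following minimal Python sketch encodes a toy task: a five-position corridor with a goal at one end, loosely in the spirit of the corridor application used later in this thesis. The task itself and all names are illustrative assumptions, not part of the RRL system.

    # A toy instance of Definition 2.1: states 0..4, a goal at position 4,
    # two actions, a deterministic transition function and a sparse reward.
    GOAL = 4

    def actions(state):                 # A(s): here the same two actions in every state
        return ["left", "right"]

    def delta(state, action):           # transition function delta(s, a)
        return min(GOAL, state + 1) if action == "right" else max(0, state - 1)

    def r(state, action):               # reward function r(s, a), unknown to the learner in practice
        return 1.0 if state != GOAL and delta(state, action) == GOAL else 0.0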
2.2.2 Value Functions
The most commonly used definition of state utility, and the one that will be used throughout this thesis, is the discounted cumulative future reward:

Definition 2.2 (Discounted Cumulative Reward)

    V^\pi(s_t) \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}        (2.1)
with 0 ≤ γ < 1.

This expression computes the discounted sum of all future rewards that the agent will receive starting from state st and executing actions according to policy π. The discount factor γ keeps the cumulative reward finite, but it is also used as a measure that indicates the relative importance of future rewards. Setting γ = 0 will make the agent try to optimize only immediate rewards, while setting γ close to 1 means that the agent will regard future rewards almost as important as immediate ones. Requiring that γ < 1 ensures that the state utilities remain finite.

Other Utility Functions

The definition of the utility function as given in Equation 2.1 is not the only possibility. A number of other possible definitions are:

Definition 2.3 (Finite Horizon)

    V^\pi(s_t) \equiv \sum_{i=t}^{h} r_i
where rewards are only considered up to a fixed time-step (h).

Definition 2.4 (Receding Horizon)

    V^\pi(s_t) \equiv \sum_{i=0}^{h} r_{t+i}
where rewards are only considered up to a fixed number of steps (h) starting from the current time-step.
Definition 2.5 (Average Reward)

    V^\pi(s_t) \equiv \lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}
which considers the long-run average reward. More about these alternative utility function definitions can be found in the overview paper by Kaelbling et al. (1996).
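As a small illustration of how these utility definitions differ, the sketch below evaluates each of them on a short, made-up reward sequence; the infinite discounted sum of Definition 2.2 is simply truncated to the given rewards, and all names are illustrative.

    # Illustrative utility computations on a finite reward sequence r_t, r_{t+1}, ...
    def discounted_return(rewards, gamma=0.9):     # Definition 2.2, truncated
        return sum(gamma ** i * reward for i, reward in enumerate(rewards))

    def receding_horizon_return(rewards, h):       # Definition 2.4
        return sum(rewards[:h + 1])

    def average_reward(rewards):                   # Definition 2.5, finite approximation
        return sum(rewards) / len(rewards)

    rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
    print(discounted_return(rewards))              # 0.81 + 0.6561 = 1.4661
    print(receding_horizon_return(rewards, 2))     # 1.0
    print(average_reward(rewards))                 # 0.4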
2.2.3 Nondeterministic Environments and Policies
The environment the agent interacts with is not required to be deterministic. When the execution of an action in a given state does not always result in the same state transition, the transition function δ can be replaced by a transition-probability function.

Definition 2.6 A transition-probability function Pδ : S × A × S → [0 : 1] where Pδ(st, at, st+1) indicates the probability that taking action at in state st results in state st+1. This implies that

    \forall s \in S, a \in A(s) : \sum_{s' \in S} P_\delta(s, a, s') = 1
The assumption that Pδ is only dependent on the current state st and action at is called the Markov property. It allows the agent to make decisions based only on the current state. More about the Markov property of environments can be found in the book of Puterman (1994). Not only can the state transitions be stochastic; the reward function can be nondeterministic as well. It must be noted, however, that while determinism is not required, usually the environment is assumed to be static, i.e., the probabilities of state transitions or rewards do not change over time.

Stochastic Policies

When using nondeterministic policies, a similar probability function can be defined instead of the deterministic policy π.

Definition 2.7 A probabilistic policy function Pπ : S × A → [0 : 1]. Pπ(s, a) indicates the probability of choosing action a in state s. It is required that

    \forall s \in S : \sum_{a \in A(s)} P_\pi(s, a) = 1
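The two probability functions of Definitions 2.6 and 2.7 can be sampled directly; the sketch below uses made-up probability tables for a single state and hypothetical action names, purely as an illustration.

    # Sampling a stochastic policy P_pi and a transition-probability function P_delta.
    import random

    P_pi = {"s0": {"go": 0.9, "wait": 0.1}}                   # P_pi(s, a)
    P_delta = {("s0", "go"): {"s0": 0.2, "s1": 0.8},          # P_delta(s, a, s')
               ("s0", "wait"): {"s0": 1.0}}

    def sample(distribution):
        """Draw one outcome from a dict mapping outcomes to probabilities (summing to 1)."""
        outcomes, weights = zip(*distribution.items())
        return random.choices(outcomes, weights=weights, k=1)[0]

    action = sample(P_pi["s0"])                    # choose an action in state s0
    next_state = sample(P_delta[("s0", action)])   # sample the resulting state
    print(action, next_state)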
Utility Functions for Nondeterministic Environments and Policies

When dealing with stochastic environments or policies, the definition of the utility function of Equation 2.1 is changed to represent the expected value of the discounted sum of future rewards:

    V^\pi(s_t) \equiv E_{(\delta, r, \pi)}\left( \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right)        (2.2)
Intermezzo: Planning as a Reinforcement Learning Task

There exist important similarities between reinforcement learning as described above and planning without complete knowledge. In planning, one is given a goal function:

    goal : S → {true, false}

that defines which states are target states. The aim of the planning task is: Given a starting state s1, find a sequence of actions a1, a2, . . . , an with ai ∈ A, such that:

    goal(δ(. . . δ(s1, a1) . . . , an)) = true

Usually, also a precondition function pre : S × A → {true, false} is given. This function specifies which actions can be executed in which states. This puts the following extra constraints on the action sequence:

    ∀ai : pre(δ(. . . δ(s1, a1) . . . , ai−1), ai) = true

In normal planning problems, the effect of all actions (i.e., the function δ) is known to the planning system. However, when this function is unknown, i.e., when planning without complete knowledge of the effect of actions, the setting is essentially that of reinforcement learning. In both settings, a policy π has to be learned and the transition function δ is unknown to the agent. To capture the idea that goal states are absorbing states, δ satisfies:

    ∀a : δ(st, a) = st   if goal(st) = true

This ensures that once the learning agent reaches a goal state, it stays in that goal state. The reinforcement learning reward can be defined as follows:

    r_t = r(s_t, a_t) = \begin{cases} 1 & \text{if } goal(s_t) = \text{false and } goal(\delta(s_t, a_t)) = \text{true} \\ 0 & \text{otherwise} \end{cases}

A reward is thus only given when a goal-state is reached for the first time. This reward function is unknown to the learning agent, as it depends on the unknown transition function δ.
As such, it is possible to cast a problem of learning to plan under incomplete knowledge as a reinforcement learning task. The optimal policy π ∗ can be used to compute the shortest action-sequence to reach a goal state, so this optimal policy, or even an approximation thereof, can be used to improve planning performance.
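To make the mapping concrete, the following Python sketch wraps a goal predicate and a transition function (unknown to the learning agent) into the absorbing-state dynamics and reward function described above. The function and argument names (make_rl_task, delta, goal, step) are hypothetical illustrations, not part of the thesis.

```python
def make_rl_task(delta, goal):
    """Turn a planning problem into the reinforcement learning dynamics above.

    delta(s, a) -> next state, goal(s) -> bool.
    Both names are hypothetical placeholders."""

    def step(s, a):
        # Goal states are absorbing: every action loops back, with zero reward.
        if goal(s):
            return s, 0.0
        s_next = delta(s, a)
        # Reward 1 only on the transition that first reaches a goal state.
        reward = 1.0 if goal(s_next) else 0.0
        return s_next, reward

    return step
```

An agent maximizing the discounted sum of these rewards then prefers the shortest action sequence to a goal state, which is exactly the planning objective.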
2.3 Solution Methods
There are two major directions that can be explored to solve reinforcement learning problems: direct policy search algorithms and statistical value-function based approaches.
2.3.1 Direct Policy Search
Since a solution to a reinforcement learning problem is a policy, one can try to define the policy space and search in this space for an optimal policy. In this approach, the policy of the reinforcement learning agent is parameterized and this parameter vector is tuned to optimize the average return of the related policy. This approach is used, for example, by genetic programming algorithms. (The utility function defined in Section 2.2.1 can be used as a fitness function to evaluate candidate solutions.) A full discussion of the use of genetic algorithms or genetic programming falls outside the scope of this thesis. Interested readers can consult the books of Mitchell (1996) or Goldberg (1989).
2.3.2 Value Function Based Approaches
Most research in reinforcement learning focuses on the computation of the optimal utility of states (i.e., the function V* or related values) to find the optimal policy. Once this function V* is known, it is easy to translate it into an optimal policy: the optimal action in a state is the action that leads to the state with the highest V*-value.

π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]    (2.3)
However, as can be seen in the equation, this translation requires a model of the environment through the use of the state transition function δ (and the reward function r). In many applications, building a model of the environment is at least as hard as finding an optimal policy, so learning V ∗ is not sufficient to solve the reinforcement learning problem. Therefore, instead of learning the utility of states, an agent can learn a different value, which quantifies the utility of an action in a given state when following a given policy π.
Many approaches exist that try to compute the state or (state, action) values. Sutton and Barto (1998) discuss methods ranging from dynamic programming over Monte-Carlo methods to temporal difference approaches.

Dynamic programming methods (Barto et al., 1995) try to compute V* using a full model of the environment. Policy evaluation, which computes the value V^π of any policy π, starts from a randomly initialized utility value for each state and updates these values using the Bellman equation:

V_{i+1}(s) = r(s, π(s)) + γ V_i(δ(s, π(s)))    (2.4)
until a fixed point for the update rule has been found. For nondeterministic environments or policies, the right-hand side of the equation is replaced with its expected value.

To find the optimal policy, policy iteration interleaves policy evaluation steps with policy improvement steps. After each policy evaluation, the policy π is adjusted so that it maximizes the use of the current utility estimates. This step is called policy improvement. Afterwards, the new (improved) policy is evaluated again, until a fixed point is found.

Value iteration is based on the same principles as policy iteration, but does not require a full policy evaluation before it adapts the policy. Instead of the update rule from Equation 2.4 it uses the following:

V_{i+1}(s) = max_a [r(s, a) + γ V_i(δ(s, a))]    (2.5)
By using a model of the environment and previous estimates of the utility values of “neighboring” states, value iteration computes better approximations of the utilities of all states by finding the neighboring state with the maximum value. This process is then iterated until no further change is needed. However, since the number of states can become quite large, even one iteration of computing better utility approximations can become problematic.

Monte-Carlo methods (Barto and Duff, 1994) base their value predictions on average rewards collected while exploring the environment. For every possible (state, action) pair, an array of rewards is collected through exploration of the environment and the prediction of the (state, action)-value is made by averaging the collected rewards. This means that no model of the environment is needed for Monte-Carlo methods.

Temporal difference (TD) algorithms combine the ideas of dynamic programming and Monte-Carlo methods. They use an update rule that expresses the relation between the utility values of adjacent states to incrementally update these values based on information collected during exploration of the environment. The two most important algorithms that use this approach to learn (state, action) values are Q-learning (Watkins, 1989) (see also Section 2.3.3) and SARSA (Sutton and Barto, 1998). While SARSA learns the (state, action) values of the exploration policy, Q-learning, which is exploration insensitive,
computes the optimal (state, action) values. Of these two, Q-learning will be discussed in more detail in the following section.

Temporal difference algorithms that use eligibility traces, such as TD(λ), Sarsa(λ) and Q(λ), use a more elaborate version of the update rule than Q-learning or SARSA: instead of taking only a single action and the resulting state into account, they blend estimates over multiple future steps, weighted by λ. More information on these techniques can be found in the book by Sutton and Barto (1998).

More recently, the reinforcement learning problem has also been translated to the support vector machine setting (Dietterich and Wang, 2002). Readers interested in more information on the discussed (and other) methods can find a great deal of knowledge in the work of Kaelbling et al. (1996) and Sutton and Barto (1998).
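For comparison with the model-free methods discussed next, the following Python sketch implements the value iteration update of Equation 2.5 for a deterministic model. The dictionaries delta and reward and the actions function are hypothetical stand-ins for a fully known environment model; this is an illustration, not code from the thesis.

```python
def value_iteration(states, actions, delta, reward, gamma=0.9, eps=1e-6):
    """Deterministic value iteration: V_{i+1}(s) = max_a [r(s, a) + gamma * V_i(delta(s, a))].

    delta[(s, a)] and reward[(s, a)] form the (hypothetical) full model of the
    environment; actions(s) is assumed to return a non-empty list for every state."""
    V = {s: 0.0 for s in states}
    while True:
        max_change = 0.0
        for s in states:
            best = max(reward[(s, a)] + gamma * V[delta[(s, a)]] for a in actions(s))
            max_change = max(max_change, abs(best - V[s]))
            V[s] = best
        if max_change < eps:   # stop once the values have (numerically) converged
            return V
```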
2.3.3 Q-learning
Q-learning (Watkins, 1989) is a temporal difference method that computes the Q-value for the optimal policy associated with each (state, action) pair. The Q-value of a (state, action) pair with respect to a policy π is closely related to the utility function defined in the previous section.

Definition 2.8 (Q-value) Q^π(s, a) ≡ r(s, a) + γ V^π(δ(s, a))

The Q-value for the optimal policy is denoted by Q*(s, a) = r(s, a) + γ V*(δ(s, a)). This is exactly the value that is maximized in Equation 2.3. Therefore, this value can be used to generate an optimal policy without the need for a model of the environment (i.e., the transition function δ). An action in a given state is optimal if it has the highest Q-value in that state. Of course, not knowing the δ or r functions means that this definition cannot be used to compute the Q-function directly. Fortunately, the Q-learning algorithm presents a way to learn the Q-values without explicit knowledge of these functions.

Q-learning is a simple algorithm that updates the Q-values incrementally while the reinforcement learning agent explores its environment. Algorithm 2.1 shows a high level description of the algorithm. This simple Q-learning algorithm converges towards the optimal Q-function in deterministic environments provided that each (state, action) pair is visited often enough (Watkins, 1989; Jaakkola et al., 1993). The update rule used in the described Q-learning algorithm,

Q(s, a) ← r + γ max_{a'} Q(s', a')    (2.6)

is known as a Bellman equation and is used in deterministic environments.
Algorithm 2.1 The basic Q-learning algorithm.
  for each s ∈ S, a ∈ A do
    initialize table entry Q(s, a)
  end for
  generate a starting state s
  repeat
    select an action a and execute it
    receive an immediate reward r = r(s, a)
    observe the new state s'
    update the table entry for Q(s, a) as follows: Q(s, a) ← r + γ max_{a'} Q(s', a')
    s ← s'
  until no more learning
When reinforcement learning is performed in a non-deterministic world, another Bellman equation, which makes use of the ideas of temporal difference learning, is used:

Q(s, a) ← (1 − α)Q(s, a) + α[r + γ max_{a'} Q(s', a')]    (2.7)

with

α = 1 / (1 + visits(s, a))
The Q-learning algorithm using this update rule converges in stochastic environments provided that every (state, action) pair is visited often enough and that the probabilities of state transitions and rewards are static (Watkins, 1989; Jaakkola et al., 1993). Other Q-value update equations exist, but the reader is referred to the books by Sutton and Barto (1998) and Mitchell (1997) for a more complete overview of learning strategies.

One improvement that is often used to lower the time to convergence is to store the (state, action) pairs together with their rewards until the agent reaches a goal state¹ and then compute the Q-values of the encountered states and actions backwards. This allows the Q-values to spread more rapidly throughout the state space. One series of (state, action) pairs is called an episode (Langley, 1994; Mitchell, 1997). This transforms the Q-learning algorithm into Algorithm 2.2.

¹ The reached state does not have to be a terminal state. The backpropagation of the encountered rewards can be started at any time. However, the resulting speedup will be larger when backpropagation is only performed when a substantial reward has been received.
Algorithm 2.2 The Q-learning algorithm with Bucket-Brigade updating.
  for each s ∈ S, a ∈ A do
    initialize table entry Q(s, a)
  end for
  repeat {for each episode}
    generate a starting state s0
    i ← 0
    repeat {for each step of episode}
      select an action ai and execute it
      receive an immediate reward ri = r(si, ai)
      observe the new state si+1
      i ← i + 1
    until si is terminal
    for j = i − 1 to 0 do
      Q(sj, aj) ← rj + γ max_{a'} Q(sj+1, a')
    end for
  until no more episodes
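A minimal Python rendering of Algorithm 2.2, using the stochastic update rule of Equation 2.7 with α = 1/(1 + visits(s, a)) and ε-greedy action selection, could look as follows. The environment interface (reset, actions, step, is_terminal) is a hypothetical assumption made for this sketch.

```python
import random
from collections import defaultdict

def q_learning(env, episodes, gamma=0.9, epsilon=0.1):
    """Table-based Q-learning with backward updates per episode (cf. Algorithm 2.2)."""
    Q = defaultdict(float)      # Q[(s, a)], initialized to 0
    visits = defaultdict(int)   # visit counts, used for the learning rate alpha

    for _ in range(episodes):
        s = env.reset()
        trace = []              # (s_j, a_j, r_j, s_{j+1}) tuples of this episode
        while not env.is_terminal(s):
            acts = env.actions(s)
            if random.random() < epsilon:                      # explore
                a = random.choice(acts)
            else:                                              # exploit
                a = max(acts, key=lambda act: Q[(s, act)])
            s_next, r = env.step(s, a)
            trace.append((s, a, r, s_next))
            s = s_next

        # Backward sweep: the stochastic update rule of Equation 2.7.
        for s_j, a_j, r_j, s_next in reversed(trace):
            alpha = 1.0 / (1 + visits[(s_j, a_j)])
            nxt = env.actions(s_next)
            target = r_j + gamma * (max(Q[(s_next, a)] for a in nxt) if nxt else 0.0)
            Q[(s_j, a_j)] = (1 - alpha) * Q[(s_j, a_j)] + alpha * target
            visits[(s_j, a_j)] += 1
    return Q
```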
2.3.3.1 Q-value Function Generalization
Up till now, the description of the Q-learning algorithm has assumed that a separate Q-value is stored for each possible (state, action) pair. This is known as table-based Q-learning and is limited in practice by the number of different Q-values that need to be memorized. The number of Q-values is equal to the number of different (state, action) pairs. Not only does table-based Q-learning require a memory location for each (state, action) pair, which quickly results in impractical amounts of required memory space, but the convergence of the Q-values to those of the optimal Q*-function also only occurs when every (state, action) pair is visited often enough, thus increasing the execution time of the Q-learning algorithm when the state-action space grows large. This makes Q-learning suffer from the “Curse of Dimensionality” (Bellman, 1961). Since the number of states and the number of actions grow exponentially with the number of attributes used, both the required amount of memory and the execution time of the Q-learning algorithm also grow exponentially.

To make Q-learning feasible in larger environments one often employs a kind of generalization function to represent the Q-function. This generalization is built by a regression algorithm which uses Q-value examples (consisting of a state, an action and a Q-value) encountered during exploration of the environment. The regression algorithm builds an approximate Q-function Q̂ that predicts approximate Q-values for all (state, action) pairs, even the ones that were never encountered during exploration. Because this regression task will
be crucial in the rest of this work, it will be defined more explicitly.

Definition 2.9 (Regression) A regression task can be defined as follows:
Given a set of learning examples ei ∈ E with a continuous target value ti,
Build a function F̂ : E → R that generalizes over the seen examples and predicts the target values for unseen examples, thereby minimizing or maximizing a given criterion.

The criterion used can vary from, for example, an accuracy measure such as the mean squared error to the performance of the associated policy when generalizing over Q-values. The use of regression for Q-learning not only reduces the amount of memory and computation time needed but also enables the learner to make predictions of the quality of unseen (state, action) pairs.

Q-value generalization is performed with a number of different regression methods such as neural networks, statistical curve fitting, regression trees, instance based methods and so on. Special care has to be taken when choosing a regression method for Q-learning, as not all regression methods are suited to deal with the specific requirements of the Q-learning setting. Learning the Q-function happens online while interacting with the environment. This requires regression algorithms that can deal with an almost continuous stream of incrementally arriving data, and that are able to handle a non-stationary target function, also known as moving target regression. Regression algorithms used in this context include neural networks (Tesauro, 1992; Mahadevan et al., 1997) and decision trees (Smart and Kaelbling, 2000).
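As a minimal illustration of Definition 2.9 in the Q-learning setting, the sketch below replaces the Q-table by a generic regression model. The featurize function and the fit/predict model interface are hypothetical placeholders, and the batch-style refitting is a simplification: as stressed above, a regression algorithm used inside Q-learning should really be incremental and robust against the moving target.

```python
class ApproximateQ:
    """Q-value generalization: a regression model predicts Q(s, a) from features.

    `model` is any regressor offering fit(X, y) and predict(X); `featurize(s, a)`
    maps a (state, action) pair to a fixed-length feature vector. Both are
    hypothetical placeholders for whatever representation is chosen."""

    def __init__(self, model, featurize):
        self.model = model
        self.featurize = featurize
        self.X, self.y = [], []
        self.trained = False

    def predict(self, s, a):
        if not self.trained:
            return 0.0                     # default Q-value before any training
        return float(self.model.predict([self.featurize(s, a)])[0])

    def update(self, examples):
        """examples: iterable of (state, action, q_value) triplets."""
        for s, a, q in examples:
            self.X.append(self.featurize(s, a))
            self.y.append(q)
        # Batch refit over everything seen so far; a truly incremental learner
        # would update its hypothesis in place instead of storing all examples.
        self.model.fit(self.X, self.y)
        self.trained = True
```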
2.3.3.2 Exploration vs. Exploitation
Q-learning is exploration insensitive. This means that the Q-values converge to the correct values no matter what policy is used to select actions, provided that each (state, action) pair is visited often enough. This allows the reinforcement learning agent to adapt its exploration strategy and try to collect as much new information as possible, as quickly as possible. In online learning problems, where the agent is not only supposed to be learning but also to reach some minimal performance on the task, it may have to trade in some exploration possibilities and exploit what it has learned so far.

Several mechanisms for selecting an action during the execution of the Q-learning algorithm exist. For a detailed discussion of exploration techniques the reader can consult the work by Wiering (1999) and Kaelbling et al. (1996). A few frequently used strategies are discussed here. Greedy strategies choose the
action with the highest Q-value to optimize the expected payoff. However, to ensure a sufficient amount of exploration to satisfy the convergence constraints of the Q-learning algorithm, one often has to fall back on an ε-greedy approach, where, with a small probability ε, a random action is chosen instead of the optimal one. Randomized selection techniques include, for example, Boltzmann exploration, where an action is chosen according to the following probability distribution:

P(a|s) = exp(Q(s, a)/T) / Σ_{a'} exp(Q(s, a')/T)    (2.8)
The temperature T can be lowered over time. Whereas a high temperature causes a lot of exploration, a low temperature places a high emphasis on the computed Q-values. Interval based exploration computes upper bounds for a confidence interval for the Q-values for each action in a state and then chooses the action with the highest upper bound. This ensures initial exploration of states but decreases the exploration when confidence in the computed Q-values grows and turns out to work well in practice. It does however require some form of confidence interval estimation.
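The ε-greedy and Boltzmann strategies described above can be sketched in a few lines of Python; Q is assumed to be a mapping from (state, action) pairs to the current Q-value estimates, and the function names are illustrative only.

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, T=1.0):
    """Sample an action according to Equation 2.8; high T explores, low T exploits."""
    m = max(Q[(s, a)] for a in actions)                       # for numerical stability
    weights = [math.exp((Q[(s, a)] - m) / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```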
2.4 Conclusions
This chapter introduced the reinforcement learning framework as a machine learning problem in which the learning system is trying to discover a beneficial policy with respect to the quantitative feedback it receives. The relation between reinforcement learning and planning was briefly discussed. A short overview of possible solution methods for reinforcement learning was presented, including a few model based approaches such as value iteration. Of the model free approaches such as SARSA, Q-learning was discussed in more detail as it will be used as the basis for the relational reinforcement learning system that will be developed in Chapter 4. Two subtasks of the Q-learning algorithm were discussed, i.e., Q-function generalization to be able to deal with large state-action spaces and the tradeoff between exploration and exploitation which will either let the agent profit from what it has already learned or allow it to discover new and possibly more beneficial areas of the state-action space.
Chapter 3
State and Action Representation

“I represent what everyone is afraid of.” (Bowling for Columbine)
3.1 Introduction
The previous chapter discussed the problem setting of reinforcement learning and introduced Q-learning as a possible solution algorithm, but paid no attention to the format used to represent the states and actions that the agent deals with. In this chapter, an overview is given of different representation possibilities, ordered according to their expressiveness.

The representation of a learning problem can have a great impact on the performance, both of the learning algorithm and of its results. Although most work in reinforcement learning is situated in the propositional or attribute-value setting, the interest in using more expressive representations has recently grown. An overview of higher-order representational issues in reinforcement learning is given by van Otterlo (2002).
3.2 A Very Simple Format
The simplest format is the use of numbered or enumerated states and actions. This leads to a very simple interaction between the agent and its environment and forces the agent to represent its Q-function as a table, as almost no possibility of state- or action-interpretation is given. This
representation can be used when little or no information about the learning task is available. An example interaction would be the following:

Environment: You are now in state 32. You have 3 possible actions.
Agent: I take action 2.
Environment: You receive a reward of -2. You are now in state 14. You have 5 possible actions.
Agent: I take action 5.
Environment: You receive a reward of +6. You are now in state 46. You have 3 possible actions.
Agent: ...

Although this format is limited in its use, there are applications where it is sufficient to represent the problem domain. One example is the game of Blackjack, a casino game where the object of the game is to obtain cards such that the sum of their numerical values is as high as possible, without exceeding 21. An Ace can count for either 11 or 1 and all face cards count as 10. The state description in this game is given by the sum of the card values in the agent's hand, possibly augmented by the fact that it is holding a usable Ace (i.e., an Ace that is counted as 11 and can still be downgraded to count as 1) and the value of the dealer's card. In each state of the game, there are 2 possible actions: hit (action 1) or stick (action 2). When the sum of the cards is below 12, the player should always hit, since there is no possibility of breaking 21 with the next card. For this reason, and because the game is over as soon as the player breaks 21, only states where the card values add up to a total between 12 and 21 need to be represented. These 10 different values, combined with the 10 different values of the dealer's card and the possible presence of a usable Ace, lead to a total of 200 states that need to be represented, each with 2 possible actions. This representation is sufficient to calculate an optimal strategy for this slightly simplified version of the Blackjack game (Sutton and Barto, 1998).
Intermezzo: Afterstates

Even with this very simple representation, a limited interpretation of the encountered (state, action) pairs is possible through the use of afterstates (Sutton and Barto, 1998). An afterstate is the state that is the partial result of taking a
certain action in a certain state. For example in a two player game, it is usually easy to predict the initial response of the environment to a selected move (referred to as the partial result of the action), but it might be impossible to predict the end-results of the chosen action as this includes the counter move of the opponent. If the initial or partial dynamics of the environment are simple enough (or if the environmental observations include this information) the afterstate of a (state, action) pair can be used instead of the (state, action) pair itself to predict the value of the given (state, action) pair. This leads to a generalization over (state, action) pairs that have the same afterstate and can reduce the number of values that need to be remembered by a factor as large as the average number of actions possible in a state. Of course, the use of afterstates is not possible in environments where the outcome of an action is completely stochastic such as the Blackjack game discussed above.
3.3 Propositional Representations
A more expressive format than state enumeration is to represent states and actions in a propositional format. This corresponds to describing each state (and possibly the action as well) as a feature vector with an attribute for each possible property of the agent's environment. The domain of each of the attributes can vary from Boolean over enumerable ranges to continuous.

The game of tic-tac-toe (Figure 3.1) is a reinforcement learning problem that translates naturally into a propositional representation. There are 9 squares that can be empty or have a cross or a circle in them, so a state can be represented as a feature vector of length 9, with each attribute ∈ {empty, cross, circle}.

This widely studied propositional representation format allows a large array of possible generalization techniques to be used to approximate the Q-function. Among these, the most commonly used are neural networks for numerical attribute vectors (Tesauro, 1992; Mahadevan et al., 1997) and decision trees for Boolean or enumerable feature vectors (Smart and Kaelbling, 2000). The reader is referred to the work of Sutton and Barto (1998) and Kaelbling et al. (1996) for a larger overview of value-function generalization techniques in a propositional setting.

In the tic-tac-toe example above, the agent could learn that not placing a cross in the bottom-middle square when the feature vector has circle for the second and fifth attribute and empty for the eighth attribute will result in losing the game and receiving the corresponding negative reward, whatever the values of the attributes at other positions are. Such a set of states could be represented as [?, circle, ?, ?, circle, ?, ?, empty, ?], where the value "?" denotes
that it doesn't matter what the value of the attribute is. This will generalize the Q-value of 3^6 (state, action) pairs with one entry.

Figure 3.1: A few example states and the corresponding feature vectors of the tic-tac-toe game:
[empty, empty, empty, empty, circle, empty, empty, empty, empty]
[empty, circle, cross, empty, circle, empty, empty, empty, empty]
[cross, circle, cross, empty, circle, empty, empty, circle, empty]

Propositional representations have problems dealing with states that are defined by the objects that are present, their properties and their relations to other objects. It is not possible to capture regularities between the different attributes to generalize over different states. In the representation employed above, it is for example not possible to state that not placing a cross next to any two adjacent circles will result in losing the game. More examples of environments which are hard to represent using a propositional representation will be given in Section 3.5.
3.4 Deictic Representations
Propositional representations are not suited to deal with environments that include sets of objects, i.e., environments where the number of objects varies or is unknown at the start of learning. Deictic representations deal with the varying number of objects that can be present in the environment by defining a focal point for the agent. The rest of the environment is then defined in relation to that focal point.

One example where a deictic representation seems the natural choice is giving directions to someone who is lost. When a person is in unknown territory, people automatically switch to a deictic representation. Instead of referring to specific streets by their names or to specific crossroads, one usually uses constructs such as:
• The street on your left.
• The second crossroad.
These constructs only make sense in relation to a focal point. While giving directions, this is usually the starting location or the location one should have reached by following the previous set of directions. The following are some deictic constructs that refer to objects:
• The last person you talked to.
• The sock on your left foot.
• The text you are reading.

In Q-learning one has to explore the entire state-action space. When using deictic representations, the relativity of the state representation to a focal point means that the different possible focal points in one environment state also have to be explored. When the location of the focal point is regarded as a part of the current state description, this can cause a substantial increase in the complexity of a learning problem. In early experiments with deictic representations for reinforcement learning, and Q-learning in particular, the extra complexities caused by manipulation of the focal point of the agent made deictic representations fail to improve on Q-learning using propositional representations (Finney et al., 2002).
3.5 Structural (or Relational) Representations
The real world is filled with objects: objects that display certain properties and objects that relate to each other. To deal with an environment filled with this kind of structure, a structural representation of both world states and actions that can represent objects and their properties and relations is necessary (Kaelbling et al., 2001; Džeroski et al., 2001).

One example where reinforcement learning could be used, but where a relational representation would be needed to describe states and actions, is that of role playing games such as Neverwinter Nights or Final Fantasy. In these computer games, a player controls a varying number of different characters. The goal of the game is not only to survive the presented scenario, but also to develop the available characters by improving their characteristics, making them learn new abilities and gathering helpful objects. This can be accomplished by fighting varying numbers of foes or completing certain quests. Figure 3.2 shows a screen shot of a battle scene from Final Fantasy X. Even when just looking at the battle part of the role playing game, representing such a state using a propositional array of state features is problematic if not impossible:

1. The number of characters in an adventuring party varies during the game. Also the number of opponents in a battle may be unknown.
2. Different characters usually exhibit different types and even a different number of abilities. A fighter character will usually physically attack enemies (although the character might develop a number of different attacks) while a wizard character usually has a range of different spells that can be used, varying from healing spells over protective spells to offensive, i.e., damage dealing, spells.

3. Available objects usually exhibit different properties when wielded by different characters. For example, a fighter character will usually make better use of a large weapon than a wizard character, but the differences are often much more subtle than this.

4. A feature often present in this kind of role playing games is the relative strength of characters and their foes. A certain character can be stronger against certain types of foes. A given foe can be more vulnerable against magic than against physical attacks or even different types of magic. The strength of a weapon can depend on the type of armor currently worn by the intended target.

Figure 3.2: A screen shot of a battle in the Final Fantasy X role playing game.

Also the actions available in a role playing game can be hard to represent in a propositional way, as the player usually needs to specify not only the preferred action, but also a number of options and the intended target. A magic spell or a special attack, for example, might have multiple targets. All of these features are hard to represent using a single feature vector. However, given a multi-relational database, it would be easy to construct a
lossless representation of such a game state. Different representation formats are available that can represent worlds with objects and relations. The most popular is the relational database representation. Two other formats will be presented in the following sections.
3.5.1 Relational Interpretations
A first representation that will be used is that of relational interpretations, as used in the “learning from interpretations” setting (De Raedt and Džeroski, 1994; Blockeel et al., 1999). In this notation, each (state, action) pair will be represented as a set of relational facts. This is called a relational interpretation.
Figure 3.3: The delivery robot in its environment.

Consider the package delivery robot of Figure 3.3. The robot's main task is to deliver the available packages to their intended destination as fast as possible. It can carry multiple packages at the same time, with limitations based on the cumulative size of the packages. The robot is equipped with some navigational abilities so that the actions available to the robot consist of moving to any adjacent room or picking up or dropping a package. Packages appear at random intervals and locations with random destinations.

As the number of objects in this environment, i.e., the packages, is variable, it is hard to represent all possible (state, action) pairs using a fixed length feature vector. Figure 3.4 shows the state of Figure 3.3 represented as a relational interpretation, or a set of facts. Additional objects appearing during interaction with the world will lead to additional facts becoming part of the set. The action can be represented by an additional fact such as “move(south)”
or “pickup(p4)”. The relational state description can be extended by providing background knowledge in the form of logical clauses. These clauses can derive more complex knowledge about the (state, action) pair, such as what packages can be carried together.

Figure 3.4: The state of the delivery robot represented as a relational interpretation:
location(r2). carrying(p2). maximumload(5).
package(p1). package(p2). package(p4).
destination(p1,r3). destination(p2,r4). destination(p4,r3).
size(p1,3). size(p2,1). size(p4,3).
location(p1,r4). location(p2,r2). location(p4,r2).
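For illustration, such a relational interpretation can be stored directly as a set of ground facts. The following Python sketch mirrors the state of Figure 3.4; the holds and delivered helpers are hypothetical examples of the kind of derived background knowledge mentioned above, not part of the thesis notation.

```python
# The (state, action) pair of Figures 3.3 and 3.4 as a set of ground facts.
state = {
    ("location", "r2"), ("carrying", "p2"), ("maximumload", 5),
    ("package", "p1"), ("package", "p2"), ("package", "p4"),
    ("destination", "p1", "r3"), ("destination", "p2", "r4"), ("destination", "p4", "r3"),
    ("size", "p1", 3), ("size", "p2", 1), ("size", "p4", 3),
    ("location", "p1", "r4"), ("location", "p2", "r2"), ("location", "p4", "r2"),
}
action = ("pickup", "p4")

def holds(interpretation, predicate, *args):
    """Check whether a ground fact is part of the relational interpretation."""
    return (predicate, *args) in interpretation

def delivered(interpretation, package):
    """Derived (background) knowledge: is the package already at its destination?"""
    rooms = [f[2] for f in interpretation
             if f[0] == "destination" and f[1] == package]
    return any(holds(interpretation, "location", package, room) for room in rooms)

print(delivered(state, "p2"))   # False: p2 is in room r2, but its destination is r4
```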
3.5.2 Labelled Directed Graphs
Another representation format that can be used to represent relational data is a graph (Diestel, 2000; Korte and Vygen, 2002). Graphs are particularly well suited for representing worlds with a lot of structure, where objects are defined by their relation to other objects more than by their own properties. An example of this can be found in navigational tasks, where the agent needs a representation of all possible paths available to it. Figure 3.5 shows part of a road map and the corresponding graph that can be used to represent the environment. The agent is located at the intersection labelled “current position” and wants to get to Paradise. In the case of one-way streets, the graph could be changed to a directed graph, with an edge for each possible travelling direction. An action, e.g. travelling between two adjacent intersections, can be represented by an additional labelled edge.

By only representing the structure of the environment and not, for example, the individual names of the highways and streets, the reinforcement learning agent will be forced to learn a general policy that can be applied to similar environments. By using labelled graphs, it remains possible however to supply extra information on travelling paths such as speed limits or congestion probabilities. In the case of the road map, the graph structure of the state representation
doesn't change, except for the “current position” label. This is not required, however, as the following example shows.

Figure 3.5: A part of a road map and its representation as a labelled graph; the vertex labels {curpos} and {goal} mark the current position and the goal.
3.5.3 The Blocks World
As a running example used throughout the following chapters, the blocks world — which is well known in artificial intelligence research (Nilsson, 1980; Langley, 1994; Slaney and Thiébaux, 2001) — can easily be represented either as a relational interpretation or as a graph. The blocks world, as used in this work, consists of a constant number of blocks and a floor large enough to hold all the blocks. Blocks can be on the floor or can be stacked on one another. Only states with neatly stacked blocks are considered, i.e., it is not possible for a block to be on top of two different blocks. The actions in the blocks world consist of moving a clear block, i.e., a block that has no other block on top of it, onto another clear block or onto the floor.

Figure 3.6 shows a possible blocks world (state, action) pair. The shown state is in a blocks world with 5 blocks. The blocks are given a number to establish their identity. The action of moving block 3 onto block 2 is represented by the dotted arrow. Although the blocks world is a good testing environment, a goal has to be set for the agent to make it a reinforcement learning task. In this work, both specific goals (e.g. “Put block 3 on top of block 1”) and more general goals (e.g. “Put all the blocks in one stack”) will be considered.

The blocks world can easily be represented in the two proposed formats. As a relational interpretation, the blocks world state is represented by the “clear”
and “on” predicates. The action is represented by the “move” predicate and the goal, if necessary, can be represented by a “goal” predicate. Figure 3.7 shows the relational representation of the blocks world (state, action) pair of Figure 3.6. Possible extensions of the (state, action) representation that can be added as background knowledge include, for example, “above(3,5)” and “numberOfStacks(3)”.

Figure 3.6: An example state of a blocks world with 5 blocks. The action is indicated by the dotted arrow.

Figure 3.7: The blocks world state represented as a relational interpretation:
on(1,5). on(2,floor). on(3,1). on(4,floor). on(5,floor).
clear(2). clear(3). clear(4).
move(3,2).

Figure 3.8 shows the graph representation of the same (state, action) pair. Each block, together with the sky and the table, is represented by a vertex, while the “clear” and “on” relations are represented by directed edges. The action is represented by an additional labelled edge and the extra labels a1 and a2. If needed, additional edges can be used to represent the intended goal. More detailed information about the blocks world representations used can be found in Appendix A.

Some problem domains will fit easily into a representation as a relational interpretation, while others may be easier to represent as a graph. In the next two chapters no distinction will be made between these representations. It will be assumed that a representation for states and actions is available in which it is possible to represent objects as well as their relations.
Figure 3.8: The blocks world (state, action) pair represented as a labelled directed graph, with vertices v0–v6 labelled {floor}, {block}, {block,a1}, {block,a2} and {clear}, and edges labelled {on} and {action}.
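Both representations of the blocks world (state, action) pair can, for illustration, be written down directly as Python data structures. The exact encodings used by the RRL system are given in Appendix A, so the structures below, in particular the vertex-to-block assignment and the edge set of the graph, are only a hypothetical approximation of Figures 3.7 and 3.8.

```python
# Relational interpretation of the (state, action) pair of Figure 3.7.
facts = {
    ("on", 1, 5), ("on", 2, "floor"), ("on", 3, 1),
    ("on", 4, "floor"), ("on", 5, "floor"),
    ("clear", 2), ("clear", 3), ("clear", 4),
}
action = ("move", 3, 2)

# The same pair as a labelled directed graph in the spirit of Figure 3.8:
# vertices carry label sets, edges carry a single label such as "on" or "action".
# Which block corresponds to which vertex, and the exact edge set, are assumptions.
vertices = {
    "v0": {"floor"},
    "v1": {"block"}, "v2": {"block", "a2"}, "v3": {"block", "a1"},
    "v4": {"block"}, "v5": {"block"}, "v6": {"clear"},
}
edges = [
    ("v3", "v1", "on"), ("v1", "v5", "on"), ("v5", "v0", "on"),   # the stack 3-1-5 on the floor
    ("v2", "v0", "on"), ("v4", "v0", "on"),                       # blocks 2 and 4 on the floor
    ("v6", "v2", "on"), ("v6", "v3", "on"), ("v6", "v4", "on"),   # the "clear" vertex marks clear blocks
    ("v3", "v2", "action"),                                       # the move(3,2) action as a labelled edge
]
```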
3.6 Conclusions
This chapter gave an overview of possible representation formats with increasing expressiveness ranging from simple state and action enumeration over attribute-value representations to deictic and relational representations. Two relational or structural representation possibilities were introduced: relational interpretations and labelled directed graphs.
Chapter 4
Reinforcement Learning In a Relational Environment

“It is our belief that the message contains instructions for building something, some kind of machine.” (Contact)
4.1 Introduction
As argued in the previous chapter, relational problem domains require their own representational format. This eliminates the use of fixed size arrays as state and action representations and thereby greatly limits the use of standard reinforcement learning or Q-learning techniques, as many of them depend on the fact that the state-action space can be regarded as a vector space. In a relational application, no limits are placed on the size or even on the dimensions of the state space. Problems in relational or structural domains tend to be very large as the size of the state space increases quickly with the number of objects that exist in the environment.

To deal with this problem, the goal of using relational reinforcement learning should be to abstract away from the specific identity of states, actions or even objects in the environment and instead identify states and actions only by referring to objects through their properties and the relations between the objects. For example, a household robot would be required to wash and iron any shirt the owner buys, instead of just the shirts that originally came with the purchase of the robot. By defining a shirt by its properties instead of its specific identity, any object made of cotton with two sleeves, a collar, buttons in front and the shape to fit one human would be handled correctly by the robot.
The same remarks can be made for the environment. Different environments should be generalized over by using their structure instead of learning to behave in their specific layout. Instead of having to buy a robot together with a house in which it can move around, the robot should be able to find its way around any house and be able to identify the kitchen by the objects located inside. This chapter introduces relational reinforcement learning and presents the RRL system, one possible approach to relational reinforcement learning that uses a kind of relational Q-learning. Section 4.2 defines the relational reinforcement learning task. In Section 4.3, a general algorithm for relational Q-learning is presented. This will introduce the need for incremental first order regression and provide a framework for the next chapters that will discuss various regression techniques that can be used in the RRL system. This chapter concludes by discussing the first, prototypical implementation of an RRL system and by presenting an overview of some closely related work.
4.2 Relational Q-Learning
The relational reinforcement learning task is very similar to the regular reinforcement learning task except that a relational representation is used for the states and actions.

Definition 4.1 (Relational Reinforcement Learning) The relational reinforcement learning task can be defined as follows:

Given:
1. A set of possible states S, represented in a relational format,
2. A set of possible actions A, also represented in a relational format,
3. An unknown transition function δ : S × A → S (this function can be nondeterministic),
4. A real-valued reward function r : S × A → R,
5. Background knowledge generally valid about the environment.

Find a policy for selecting actions π* : S → A that maximizes a value function V^π(st) for all st ∈ S.

The states of S need not be explicitly enumerated. They can be defined through the alphabet of the chosen representational format. For the actions, the learning system will only have to deal with the actions applicable to a given
state, denoted by A(s). The reward function r can be substituted by a “goal” function goal : S → {true, false}. The reward function will then be defined as discussed in the intermezzo on page 16. The value function V^π can be defined as discussed in Section 2.2.2. In this work, only the definition of Equation 2.1 (i.e., the discounted cumulative reward) is considered. The background knowledge at this point covers a large number of different kinds of information. Possibilities include information on the shape of the value function, partial knowledge about the effect of actions, similarity measures between different states and so on. Specific forms of background knowledge and how they are used to help learning will be discussed later. The background knowledge can for example include predicates that derive new relational facts about a given state when using relational interpretations.

For the blocks world example from Section 3.5.3, the set of all states S would be the set of all possible blocks world configurations. If the number of blocks is not limited, there exists an infinite number of states. The possible actions in the blocks world depend on the state. They consist of moving all blocks that are clear onto another block that is clear or to the floor. The transition function δ in the blocks world as used throughout this work is deterministic, i.e., all actions will have the expected consequences. The reward function r will depend on the goal defined in the learning experiment and will be computed as defined in the intermezzo on page 16. Background knowledge in the blocks world could for instance include which blocks belong to the same stack or even the number of stacks in a given state.
4.3 A Relational Reinforcement Learning System (or RRL system)
The RRL system was designed to solve the relational Q-learning problem as defined in the previous section. Some intuition behind the approach is given, followed by a general algorithm describing the RRL system’s approach.
4.3.1 The Suggested Approach
The RRL system follows the same reinforcement learning approach as most Q-learning algorithms that use Q-function generalization. Instead of representing the Q-values in a lookup table, the RRL system uses the information it collects about the Q-values of different (state, action) pairs to allow a regression algorithm to build a Q-function generalization. The difference between the RRL system and other Q-learning algorithms using function generalization lies in the fact that the RRL system employs a
relational representation of the environment and the available actions, and that a relational regression algorithm is used to build the Q-function. This relational regression algorithm refrains from using the specific identities of states, actions and objects in its Q-function and instead relies on the structure and relations present in the environment to define similarities between (state, action) pairs and to predict the appropriate Q-value.

Algorithm 4.1 The Relational Reinforcement Learning Algorithm
  initialize the Q-function hypothesis Q̂0
  e ← 0
  repeat {for each episode}
    Examples ← ∅
    generate a starting state s0
    i ← 0
    repeat {for each step of episode}
      choose ai for si using a policy derived from the current hypothesis Q̂e
      take action ai, observe ri and si+1
      i ← i + 1
    until si is terminal
    for j = i − 1 to 0 do
      generate example x = (sj, aj, q̂j) where q̂j ← rj + γ maxa Q̂e(sj+1, a)
      Examples ← Examples ∪ {x}
    end for
    update Q̂e using Examples and a relational regression algorithm to produce Q̂e+1
    e ← e + 1
  until no more episodes
4.3.2 A General Algorithm
Algorithm 4.1 presents an algorithm to solve the previously defined task. It uses a relational regression algorithm to approximate the Q-function. The exact nature of this algorithm is not yet specified; the development of such a regression algorithm will be the subject of the following chapters.

The algorithm starts by initializing the Q-function. Although this usually means that the regression engine returns the same default value for each (state, action) pair, it is also possible to include some initial information about the learning environment in the Q-function initialization. The algorithm then starts running learning episodes like any standard Q-learning algorithm (Sutton and Barto, 1998; Mitchell, 1997; Kaelbling et al., 1996). For the exploration strategy, the system translates the current Q-
function approximation into a policy using Boltzmann statistics (Kaelbling et al., 1996). During the learning episode, all the encountered states and the selected actions are stored, together with the rewards connected to each encountered (state, action) pair. At the end of each episode, when the system encounters a terminal state, it uses reward back-propagation and the current Q-function approximation to compute the appropriate Q-value approximation for each encountered (state, action) pair. The algorithm then presents the set of (state, action, qvalue) triplets to a relational regression engine, which will use this set of examples to update the current Q-function estimate, after which the algorithm continues with the next learning episode.

The algorithm described is the one that will be used in the rest of the text and will be referred to as the RRL algorithm. A few choices were made when designing the algorithm that do not have a direct influence on the usability of the relational reinforcement learning technique in general. For example, the RRL system uses Boltzmann based exploration, but any other exploration technique could also be used (see also Chapter 8). Also, in the described system, learning examples are generated and the regression algorithm is invoked at the end of an episode, i.e., when the policy takes the agent into an end-state. While the backpropagation of the reward will aid the system in generating more correct learning examples more quickly, this setup is not imperative for the proposed technique. Another approach would be to let the algorithm explore for a fixed number of steps before starting the regression algorithm, or even to send a newly generated learning example to the regression engine after each step. Although this could lead to a slower convergence of the regression algorithm to a usable Q-function, the general ideas behind the system would not be lost.
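A compact Python sketch of one episode of Algorithm 4.1 makes the role of the relational regression engine explicit. The env and regressor interfaces (reset, actions, step, is_terminal, predict, update) are hypothetical names chosen for this sketch and do not correspond to a concrete implementation described in the thesis.

```python
import math
import random

def rrl_episode(env, regressor, gamma=0.9, temperature=1.0):
    """One episode of the RRL algorithm: explore, then hand the generated
    (state, action, qvalue) examples to the relational regression engine."""
    s = env.reset()
    trace = []                                        # (s_i, a_i, r_i, s_{i+1})
    while not env.is_terminal(s):
        a = boltzmann_choice(regressor, s, env.actions(s), temperature)
        s_next, r = env.step(s, a)
        trace.append((s, a, r, s_next))
        s = s_next

    examples = []
    for s_j, a_j, r_j, s_next in reversed(trace):     # backward Q-value computation
        nxt = env.actions(s_next)
        q = r_j + gamma * (max(regressor.predict(s_next, a) for a in nxt) if nxt else 0.0)
        examples.append((s_j, a_j, q))
    regressor.update(examples)                        # one relational regression step

def boltzmann_choice(regressor, s, actions, T):
    """Boltzmann exploration on top of the current Q-function hypothesis."""
    m = max(regressor.predict(s, a) for a in actions)
    weights = [math.exp((regressor.predict(s, a) - m) / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```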
Figure 4.1: An example of a learning episode in the blocks world with 4 blocks using “on(2,3)” as a goal. The three successive (state, action) pairs receive Q-values 0.81, 0.9 and 1.0; only the final transition, which reaches the goal, yields a reward of 1.

Figure 4.1 shows a possible episode for a blocks world with 4 blocks with “on(2, 3)” as a goal. The episode consists of 3 (state, action) pairs that have
been executed from left to right. When the terminal state is reached, after the execution of the last action, the Q-values are computed from right to left to the shown values. At that point in time all three (state, action) pairs together with their Q-values are presented to the regression algorithm. The actual format of the examples will depend on the chosen regression algorithm.
4.4 Incremental Relational Regression
The RRL system described above requires a regression algorithm, i.e., an algorithm that can make real-valued predictions for unseen (state, action) pairs based on previously encountered examples. To work in the relational reinforcement learning system, the algorithm needs to be both incremental and able to deal with a relational representation of the presented examples.

Definition 4.2 (The RRL Regression Task) The regression task for the relational reinforcement learning system is defined as follows:

Given a continuous stream of (state, action, qvalue) triplets calculated by the RRL system as described above,

Build and update a function Q̂ : S × A → R that generalizes over the seen (state, action, qvalue) examples and predicts Q-values for unseen (state, action) pairs such that the policy defined as

∀s ∈ S : π̂(s) = argmax_{a∈A(s)} Q̂(s, a)

is an optimal policy with respect to the chosen value function V, i.e., the utility function of Equation 2.1.

Several requirements for the regression algorithm are implicit in this definition.

Incremental (1): The continuous stream of new (state, action, qvalue) triplets implies that the regression algorithm must be able to deal with incremental data. The RRL system will query the Q̂ function during the learning process, not only after all examples have been processed.

Incremental (2): The large number of examples presented to the learning algorithm that is inherent to the design of the RRL system inhibits the algorithm from storing all encountered learning examples. This requirement is strengthened by the fact that all states and actions are represented in a relational format, which usually results in larger storage requirements than other, less expressive, representational formats.
Moving target: Because the learning examples are incrementally computed by a Q-learning algorithm, the function that needs to be learned is not stable during learning. As shown in Equation 2.6, the computations that generate examples to learn the Q-function make use of current estimates of the same Q-function. This means that the examples of Q-values will only gradually converge to the correct values and that early examples will almost certainly be noisy. The regression algorithm should be able to deal with this kind of learning data.

No vector space: The learning examples (as well as the examples on which predictions have to be made) will be represented in a relational format. This means that the algorithm cannot treat the set of all examples as a vector space. Regression algorithms often rely on the fact that the dimension of the example space is finite and (more importantly) known beforehand. With relational representations, this is not the case. Although in all practical applications the dimension of the state space will be finite, it will not be known at the start of the learning experiment and may even vary during the experiment (e.g. when the number of objects in the agent's environment varies). This prevents the algorithm from using techniques such as local linear models, convex hull building and instance averaging, which rely on the use of a vector space.

To the best of the author's knowledge, no previous work existed in the field of incremental relational regression algorithms. Although a number of first-order algorithms that can be used for regression existed prior to this thesis (Karalič, 1995; Kramer and Widmer, 2000; Blockeel et al., 1998), none of these systems is incremental. The next three chapters will discuss three newly developed regression algorithms that take the requirements described above into account.
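These requirements essentially fix the interface that the regression engines developed in the next three chapters have to implement. A hypothetical Python sketch of that interface (the method names are this sketch's, not the thesis's) is given below.

```python
class IncrementalRelationalRegressor:
    """Interface expected by the RRL system (cf. Definition 4.2).

    Examples arrive in a continuous stream, states and actions are relational
    structures rather than fixed-length vectors, and the target Q-function
    drifts while learning, so old examples may have to be discounted or
    forgotten rather than stored forever."""

    def update(self, examples):
        """Process the (state, action, q_value) triplets of one episode;
        must not require storing all examples ever seen."""
        raise NotImplementedError

    def predict(self, state, action):
        """Return the current Q-value estimate for a (possibly unseen) pair."""
        raise NotImplementedError
```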
4.5 A Proof of Concept
A first prototype of a relational reinforcement learning system that uses a relational form of Q-value generalization was built by Džeroski et al. (1998). It used a relational interpretations representation of states and actions and an off-the-shelf regression tree algorithm, Tilde (Blockeel and De Raedt, 1998), to construct the Q-function. The pseudo code of the system is given in Algorithm 4.2. A more elaborate discussion can be found in (Džeroski et al., 2001).

The relational regression tree that represents the Q-function in the original RRL system is built starting from a knowledge base which holds examples of (state, action, Q-value) triplets. To generate the example set for tree induction, the system stores all encountered (state, action, qvalue) triplets into a knowledge base that is then used for the tree induction by Tilde.
Algorithm 4.2 The algorithm used by the proof of concept implementation.
  initialize Q̂0 to assign 0 to all (s, a) pairs
  Examples ← ∅
  e ← 0
  repeat
    generate an episode that consists of states s0 to si and actions a0 to ai−1 (where aj is the action taken in state sj) through the use of a standard Q-learning algorithm, using the current hypothesis for Q̂e
    for j = i − 1 to 0 do
      generate example x = (sj, aj, q̂j), where q̂j = rj + γ max_{a'} Q̂e(sj+1, a')
      if (sj, aj, q̂old) ∈ Examples then
        Examples ← Examples ∪ {x} \ {(sj, aj, q̂old)}
      else
        Examples ← Examples ∪ {x}
      end if
    end for
    use Tilde and Examples to produce Q̂e+1
    e ← e + 1
  until no more learning
Originally a classification algorithm, Tilde was later adapted for regression (Blockeel et al., 1998). Tilde was not an incremental algorithm, so all the encountered (state, action) pairs were remembered together with their Q-values. Therefore, after each learning episode, the newly encountered (state, action) pairs and their newly computed Q-values are added to the example set and a new regression tree is built from scratch. Note that old examples are kept in the knowledge base at all times and never deleted. The system avoids the presence of contradictory examples in the knowledge base by replacing the old example with a new one that holds the updated Q-value if a (state, action) pair is encountered more than once.
Problems with the original RRL

While the results of this prototype implementation demonstrated that it is indeed possible to learn a relational Q-function and that this function can be used to generalize not only over states and actions but also over related environments, the non-incremental nature of the system limits its usability. Four problems with the original RRL implementation can be identified that diminish its performance.

1. The original RRL system needed to keep track of an ever increasing
number of examples: for each different (state, action) pair ever encountered, a Q-value is kept. This causes a large amount of memory to be used.

2. When a (state, action) pair is encountered for the second time, the new Q-value needs to replace the old value. This means that each encountered (state, action) pair needs to be matched against the entire knowledge base, to check whether an old example needs to be replaced.

3. Trees are built from scratch after each episode. This step, as well as the example replacement procedure, takes increasingly more time as the set of examples grows.

4. A final point is related to the fact that in Q-learning, early estimations of Q-values are used to compute better estimates. In the original implementation, this leads to the existence of old and probably incorrect examples in the knowledge base. An existing (state, action, qvalue) example gets an updated Q-value at the moment when exactly the same (state, action) pair is encountered, but in structural domains, where there is usually a large number of states and actions, this doesn't occur very often. In the original implementation, no effort is made to forget old or incorrect learning examples.

Most of these problems stem from the fact that Tilde expects the full set of examples to be available when it starts. To solve these problems, a fully incremental relational regression algorithm is needed, as discussed in Section 4.4. Such an algorithm avoids the need to regenerate the Q-function when new learning examples become available.
4.6 Some Closely Related Approaches
Very recently, the interest in relational reinforcement learning problems has grown significantly and a number of different approaches have been suggested. This section highlights a few different routes that can be taken to handle relational reinforcement learning problems.
4.6.1 Translation to a Propositional Task
As shown in Chapter 3, a first step toward relational representations and object abstraction is the use of a deictic representation. The use of a focal point in the representation allows a fixed size array to be used to represent a world with a varying number of objects and allows for a limited representation of the structure of the environment. The deictic representation deals with varying
numbers of objects by limiting the represented objects to the surroundings of the focal point. The representation relies on the focal point to find the appropriate part of the world state.

Finney et al. (2002) used a deictic representation in a blocks world environment. The use of a focus pointer and state features that describe the block that is focussed on, as well as the blocks around it, allows representing a blocks world with an arbitrary number of blocks. The focus pointer is controlled by the learning agent as well and can be moved in four directions. As it turns out, the extra complexity caused by the movement of the focus pointer (i.e., extra actions and longer trajectories to reach the intended goal) and the fact that the problem is made partially observable by only representing parts of the entire state cause Q-learning with deictic representations to perform worse than expected.
4.6.2 Direct Policy Search
Fern et al. (2003) use an approximate variant of policy iteration to handle large state spaces. A policy language bias is used to enable the learning system to build a policy from a sampled set of Q-values. Just like in standard policy iteration (Sutton and Barto, 1998), approximate policy iteration interleaves policy evaluation and policy improvement steps. However, the policy evaluation step generates a set of (state, qvaluelist) tuples (where the qvaluelist includes the Q-values of all possible actions) as learning examples by sampling the state space instead of computing the utility values for the entire state space. In the policy improvement step, a new policy is learned from the learning examples according to the policy language bias. The Q-values of the learning examples are estimated using policy roll-out, i.e., generating a set of trajectories following the policy and computing the costs of these trajectories. This step requires a model of the environment. Initial results of this approach are promising, but the need for a world model limits its applicability.
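To make the roll-out step concrete, the following Python sketch estimates a Q-value by Monte-Carlo simulation of a given policy. It is a generic illustration of policy roll-out under stated assumptions, not the implementation of Fern et al.: the env.step(state, action) interface (returning next state, reward and a termination flag from a world model), the discount factor and the trajectory and horizon counts are all assumptions made for the example.

def rollout_q(env, state, action, policy, gamma=0.9, horizon=50, n_traj=10):
    # Hypothetical world-model interface: env.step(s, a) -> (next_state, reward, done).
    # Estimate Q(state, action) by taking the action once and then following
    # the given policy, averaging the discounted return over n_traj trajectories.
    total = 0.0
    for _ in range(n_traj):
        s, r, done = env.step(state, action)
        ret, discount = r, gamma
        for _ in range(horizon):
            if done:
                break
            s, r, done = env.step(s, policy(s))
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_traj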
4.6.3 Relational Markov Decision Processes
A lot of attention has gone to relational representations of Markov Decision Processes (MDPs). These approaches use an abstraction of the state and action space to reduce the size of the learning problem. A distinction can be made between approaches that use a predefined model of the environment and algorithms that induce the state space abstraction. Kersting and De Raedt (2003) introduce Logical Markov Decision Processes as a compact representation of relational MDPs. They define an abstract state as a conjunction of first order literals, i.e., a logical query. Each abstract state represents the set of states that are covered by the logical query. Between these abstract states, they define abstract transitions and abstract actions that represent sets of actions of the original problem. Abstract actions are defined in a STRIPS-like manner (Fikes and Nilsson, 1971), defining pre-conditions and post-conditions of the action. By this translation of the "ground" problem into a higher level representation, the number of possible (state, action) pairs, and thus the number of Q-values that need to be learned, is greatly reduced, and Kersting and De Raedt define a "Logical Q-learning" algorithm to accomplish this task. Independent of this, Morales (2003) introduced rQ-learning, i.e., Q-learning in R-Space. R-Space consists of r-states and r-actions, which are very comparable to the abstract states and abstract actions in the work of Kersting and De Raedt (2003). The rQ-learning algorithm tries to compute the Q-values of the (r-state, r-action) pairs. Van Otterlo (2004) defines the CARCASS representation, which consists of pairs of abstract states and the set of abstract actions that can be performed in the given abstract state. While this is again comparable to the two previously discussed approaches, van Otterlo not only defines a Q-learning algorithm for his representation but also suggests learning a model of the relational MDP defined by CARCASS to allow prioritized sweeping to be used as a solution method. A downside to all three techniques is that they require a hand-coded definition of the higher level states and actions. This can be cumbersome and greatly influences the performance of the approach. They also suffer from the fact that this translation makes the problem partially observable, so no convergence claims are made. Boutilier et al. (2001) suggest a value-iteration (and thus value-based) approach that uses the Situation Calculus as a representation format for the relational MDP. This creates a theoretical framework that can be used to derive a partitioning of the state space that distinguishes states and actions according to the utility function of the problem. However, the complexities imposed by the use of Situation Calculus and the need for sophisticated simplification techniques have so far prevented a full implementation of the technique. Very recently, Kersting et al. (2004) introduced a relational version of the Bellman update rule called ReBel. Using this update rule, they have devised an algorithm that uses value iteration to automatically construct a state-action space partitioning. In contrast to the Situation Calculus used by Boutilier et al., they use a constraint logic programming language to represent the relational MDP.
The Relation to the RRL System
The idea of abstract states and actions to represent sets of world states and actions seems both elegant and quite practical. However, instead of requiring
a user defined abstraction of the reinforcement learning problem, it would be preferable if the task of finding the correct level of abstraction could be left to the learning algorithm. To distinguish between similar subsets of states and actions, the learning algorithm can use the utility values of states as an indication of related states or Q-values to relate several (state, action) pairs. The RRL system uses Q-values as a similarity indication of states and actions. However, it does not learn an explicit model of abstract states or actions, but models the abstract state-action space implicitly by building a Q-function based on structural and relational information available in the (state, action) pair. This Q-function will also represent the policy learned by the system as required by the problem definition. It must be noted that the use of a Q-function to implicitly define the regions of related (state, action) pairs can result in a partition of the state-action space that can’t be directly translated into a set of abstract states and actions. This can occur when the Q-function generalizes over (state, action) pairs by using relational features that combine elements from both states and actions.
4.6.4 Other Related Techniques
Yoon et al. (2002) devise an approach to extend policies from small environments to larger environments using first order policy induction. Using a planning component or a policy that can solve problems in small worlds, a learning set is generated from which a rule set is learned that generalizes well to worlds with more objects and thus a larger state space. This is related to the P-learning approach suggested in (Džeroski et al., 2001), which uses a Q-function learned on small worlds to generate learning examples for learning a policy. Another value-based approach is presented by Guestrin et al. (2003). By assuming that the utility of a state can be represented as the sum of the contributions of various subparts of the state, Guestrin et al. are able to build a class-based value function. Such a value function computes the values for a set of classes in which each object is assumed to have the same contribution to the total state utility. Using these class values, a value can be computed for each possible state configuration. However, the assumption that each object belonging to the same class has the same value contribution can be very restrictive on the type of problems that can be handled by the technique. The authors themselves suspect problems in dealing with environments where the reward function depends on the state of a large number of objects, or where there is a strong interaction between many of the world's objects. A typical example of such an environment is the blocks world. A limitation of the suggested learning technique is that, so far, the relations between objects are assumed to be static, which is certainly not the case in, for example, the blocks world.
4.7 Conclusions
This chapter introduced the relational reinforcement learning task. The relational reinforcement learning system or RRL system uses a standard Q-learning approach that incorporates incremental relational regression to build a Q-function generalization. System specific requirements for the incremental relational regression engine were discussed. The chapters of Part II each discuss a new regression algorithm that can be used in the RRL system. A prototype of a relational Q-learning system was briefly discussed, and an overview of some closely related work was given; quite a few of these approaches use a relational abstraction of states and actions to reduce the number of utility values or Q-values that need to be learned. Compared to these approaches, the RRL system uses a Q-value driven search for related (state, action) pairs. The Q-function that is the result of the RRL algorithm implicitly defines a state-action partition of the learning task.
Part II
On First Order Regression
Chapter 5
Incremental First Order Regression Tree Induction
“You’re out of your tree.” “It’s not my tree.” (Benny & Joon)
5.1 Introduction
Decision trees are a successful and widely studied machine learning technique. The learning algorithms used to build classification and regression trees use a greedy learning technique called top-down induction of decision trees or TDIDT. TDIDT applies a divide-and-conquer strategy, which makes it a very efficient learning technique. After a discussion of some related work, this chapter introduces the first of three relational regression algorithms that have been developed in this thesis for use in the RRL system. The tg algorithm is an incremental first order regression tree algorithm. It accepts the same language bias as the Tilde system (Blockeel and De Raedt, 1998) and adopts some ideas from incremental attribute-value regression tree algorithms. Section 5.4 first introduces the experimental setting for the next three chapters. This experimental setting uses the blocks world as introduced in Section 3.5.3 with three different goals. The behavior of the tg algorithm is evaluated on these three tasks. This chapter concludes by discussing a number of possible improvements to the tg system. The tg algorithm was designed and implemented with the help of Jan Ramon and Hendrik Blockeel and was first introduced in (Driessens et al., 2001).
5.2 Related Work
For attribute-value representations, there exists an incremental regression tree induction algorithm that was designed for Q-learning, i.e., the G-algorithm by Chapman and Kaelbling (1991). This is a tree learning algorithm that updates its theory incrementally as examples are added. An important feature is that examples can be discarded after they are processed. This avoids using a huge amount of memory to store examples. At a high level, the G-algorithm stores the current decision tree, and for each leaf node it stores statistics for all tests that could be used to split that leaf further. Each time an example is inserted, it is sorted down the decision tree according to the tests in the internal nodes, and in the leaf the statistics of the tests are updated. When needed, i.e., when indicated by the statistics in a leaf, the algorithm splits the leaf under investigation according to the best available test, again indicated by the stored statistics, and creates two new empty leaves. Once a test is chosen for a certain node, this choice cannot be undone. This is potentially dangerous for Q-learning, because Q-learning requires the regression algorithm to perform moving-target regression: tests at the top of the tree are chosen based on possibly unreliable learning examples. The ITI algorithm of Utgoff et al. (1997) incorporates mechanisms for tree restructuring, i.e., changing the chosen test in a node when the statistics kept in that node indicate that a change is necessary. The algorithm uses two tree revision operators: tree transposition and slewing cut-points for numerical attributes. These revision techniques rely on the fact that each decision node stores statistics about all possible tests that could be used at that node. When dealing with an attribute-value representation of the data, this is feasible because the possible tests are limited to the possible values of the nominal attributes and the possible cut-points for numerical attributes. As will be discussed later, this is not possible when using a first order representation of the data. Another related approach is the U-tree algorithm of MacCallum (1999), which is specifically designed for Q-learning. U-trees rely on an attribute-value representation of states and actions, but allow the use of a (limited-term) history of a state as part of the decision criteria. For this, the U-tree algorithm stores the sequence of all (state, action) pairs. This history allows U-trees to be used with partially observable MDPs, while the feature selection used in the decision tree allows generalization over similar states. In the U-tree algorithm, the tree is not used in the way regression is proposed for the RRL system; instead, value iteration is performed with each step taken by the learning agent, using the leaves of the tree as states. Q-values are thus stored per (leaf, action) pair. The reward value needed for value iteration is computed as the average of the rewards of all stored (state, action) pairs that belong to that leaf and share the same action.
5.3 The tg Algorithm
The tg algorithm is a first order extension of the G algorithm of Chapman and Kaelbling (1991). Algorithm 5.1 shows a high level description of the regression algorithm.

Algorithm 5.1 The (T)G-algorithm.
  initialize by creating a tree with a single leaf with empty statistics
  for each learning example that becomes available do
    sort the example down the tree using the tests of the internal nodes until it reaches a leaf
    update the statistics in the leaf according to the new example
    if the statistics in the leaf indicate that a new split is needed then
      generate an internal node using the indicated test
      grow two new leaves with empty statistics
    end if
  end for

The tg algorithm uses a relational representation language for describing the examples and for the tests that can be used in the regression tree.
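As an illustration of this control flow, the sketch below implements the skeleton of Algorithm 5.1 in Python under simplifying assumptions: tests are opaque Boolean functions on examples, the refinement operator is a user-supplied function mapping the path of (test, outcome) pairs to a list of candidate tests, and the significance check of Section 5.3.3 is replaced by a crude mean-difference heuristic. It is a sketch of the incremental loop, not the thesis implementation.

class Leaf:
    """A leaf holding a running prediction and per-candidate-test statistics."""
    def __init__(self, candidates):
        self.n, self.total = 0, 0.0
        # per candidate test: [n_yes, sum_yes, n_no, sum_no] of the Q-values
        self.stats = {c: [0, 0.0, 0, 0.0] for c in candidates}

    def prediction(self):
        return self.total / self.n if self.n else 0.0


class Node:
    def __init__(self, test, yes, no):
        self.test, self.yes, self.no = test, yes, no


class TGTree:
    """Skeleton of the incremental (T)G loop of Algorithm 5.1."""
    def __init__(self, refine, min_samples=100):
        self.refine = refine          # path of (test, outcome) pairs -> candidate tests
        self.min_samples = min_samples
        self.root = Leaf(refine([]))

    def _descend(self, example):
        node, path, parent, went_yes = self.root, [], None, None
        while isinstance(node, Node):
            out = node.test(example)
            path.append((node.test, out))
            parent, went_yes = node, out
            node = node.yes if out else node.no
        return node, path, parent, went_yes

    def predict(self, example):
        return self._descend(example)[0].prediction()

    def learn(self, example, q):
        leaf, path, parent, went_yes = self._descend(example)
        leaf.n += 1
        leaf.total += q
        for test, s in leaf.stats.items():
            if test(example):
                s[0] += 1; s[1] += q
            else:
                s[2] += 1; s[3] += q
        test = self._significant_test(leaf)
        if test is None:
            return
        node = Node(test, Leaf(self.refine(path + [(test, True)])),
                    Leaf(self.refine(path + [(test, False)])))
        if parent is None:
            self.root = node
        elif went_yes:
            parent.yes = node
        else:
            parent.no = node

    def _significant_test(self, leaf):
        # Placeholder decision: wait for min_samples examples, then pick the test
        # whose two subsets have the most different mean Q-values.  The real
        # system uses an F-test on the variance reduction (Section 5.3.3).
        if leaf.n < self.min_samples:
            return None
        best, best_gap = None, 0.0
        for test, (ny, sy, nn, sn) in leaf.stats.items():
            if ny and nn:
                gap = abs(sy / ny - sn / nn)
                if gap > best_gap:
                    best, best_gap = test, gap
        return best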
5.3.1 Relational Trees
A relational (or logical) regression tree can be defined as follows (Blockeel and De Raedt, 1998):

Definition 5.1 (Relational Regression Tree) A relational regression tree is a binary tree in which
• every internal node contains a test which is a conjunction of first order literals, and
• every leaf (terminal node) of the tree contains a real-valued prediction.

An extra constraint placed on the first order literals that are used as tests in internal nodes is that a variable that is introduced in a node (i.e., it does not occur in higher nodes) does not occur in the right subtree of that node. Figure 5.1 gives an example of a first order regression tree. The test in a node should be read as the existentially quantified conjunction of all literals in the nodes on the path from the root of the tree to that node. In the left subtree of a node, the test of the node is added to the conjunction; for the right subtree, the negation of the test is added.
on(BlockA,BlockB)
  yes: clear(BlockA)
    yes: Qvalue = 0.4
    no:  on(BlockB,floor)
      yes: Qvalue = 0.9
      no:  Qvalue = 0.3
  no:  Qvalue = 0.1
Figure 5.1: A relational regression tree

The constraint on the use of variables stems from the fact that variables in the tests of internal nodes are existentially quantified. Suppose a node introduces a new variable X. While the left subtree of the node corresponds to the fact that a substitution for X has been found that makes the conjunction true, the right side corresponds to the situation where no substitution for X exists, i.e., there is no such X. Therefore, it makes no sense to refer to X in the right subtree. A relational (logical) regression tree can easily be translated into a Prolog decision list. For example, the tree of Figure 5.1 can be represented as the following Prolog program:

q_value(0.4) :- on(BlockA,BlockB), clear(BlockA), !.
q_value(0.9) :- on(BlockA,BlockB), on(BlockB,floor), !.
q_value(0.3) :- on(BlockA,BlockB), !.
q_value(0.1).

The use of the cut-operator "!" can be avoided by the introduction of extra definite clauses and the negation operator, but this leads to a larger and less efficient program (Blockeel and De Raedt, 1998).
5.3.2 Candidate Test Creation
The construction of new tests is done through a refinement operator. tg uses a user-defined refinement operator that originated in the Tilde system (Blockeel and De Raedt, 1998). This refinement operator uses a language bias to specify the predicates that can be used, together with their possible variable bindings. This language bias has to be defined by the user of the system. By specifying possible test extensions in the form of rmode-declarations, the user indicates which literals can be used as tests for internal nodes.
An rmode-declaration looks as follows:

rmode(N: conjunction_of_literals).

This declaration means that the conjunction of literals can be used as the test in an internal node, but at most N times in a path from the top of the tree to a leaf. When N is omitted, its value defaults to infinity. To allow for the unification of variables between the tests used in different nodes within one path of the tree, the conjunction of literals given in the rmode-declaration includes mode information for the used variables. Possible modes are ‘+’, ‘−’ and ‘+−’. A ‘+’ indicates that the variable should be used as an input variable, i.e., that the variable should occur in one of the tests on the path from the top of the tree to the leaf that will be extended. A ‘−’ stands for output, i.e., the associated variable should not yet occur. ‘+−’ means that both options are allowed, i.e., extensions can be generated both with an already occurring and with a completely new variable. To illustrate this, consider the leftmost leaf of the regression tree in Figure 5.1. With the following rmode-declarations:

rmode(5: clear(+-X)).
rmode(5: on(+X,-Y)).

this leaf could be replaced by an internal node using any of the following tests:

clear(BlockA).
clear(BlockB).
clear(BlockC).

as a result of the first rmode, or

on(BlockA,BlockC).
on(BlockB,BlockC).

resulting from the second. A more detailed description of this language bias part of the system, which includes the more advanced use of typed variables and lookahead search, can be found in (Blockeel, 1998). Because the possible test candidates depend on the previously chosen tests, it is not possible to restructure relational trees in the same way as done with the ITI algorithm in the work of Utgoff et al. (1997). In the propositional case, the set of candidate queries consists of the set of all features minus the features that are already tested higher in the tree. This makes it relatively easy to keep statistics about all possible tests at each internal node of the tree. With relational trees, if a test in an internal node of the tree is changed, the test candidates of all the nodes in the subtrees of that node change as well, so no statistics can be gathered for all possible tests in each internal node. Although not yet used in the tg algorithm, Section 5.5 will discuss some tree restructuring mechanisms that can be used for relational trees.
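The following toy sketch shows how the ‘+’, ‘−’ and ‘+−’ modes could drive candidate generation. The tuple encoding of the rmode-declarations, the variable-naming scheme and the string output are invented for the illustration and do not mirror the actual Tilde/tg machinery.

from itertools import product

# rmode(5: clear(+-X)) and rmode(5: on(+X,-Y)) encoded as (max_uses, predicate, modes)
RMODES = [(5, "clear", ["+-"]), (5, "on", ["+", "-"])]

def candidate_tests(rmodes, bound_vars, uses_so_far):
    """Enumerate the literals that may be used to split a leaf.
    bound_vars: variables already introduced on the path to the leaf.
    uses_so_far: how often each predicate has already been used on that path."""
    fresh = iter(["NewVar%d" % i for i in range(10)])   # names for new variables
    candidates = []
    for max_uses, pred, modes in rmodes:
        if uses_so_far.get(pred, 0) >= max_uses:
            continue
        slots = []
        for mode in modes:
            if mode == "+":                 # reuse an existing variable
                slots.append(list(bound_vars))
            elif mode == "-":               # introduce a new variable
                slots.append([next(fresh)])
            else:                           # '+-': both options are allowed
                slots.append(list(bound_vars) + [next(fresh)])
        for args in product(*slots):
            candidates.append("%s(%s)" % (pred, ", ".join(args)))
    return candidates

# Candidates for the leftmost leaf of Figure 5.1, where BlockA and BlockB are bound:
print(candidate_tests(RMODES, ["BlockA", "BlockB"], {}))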
Intermezzo: Candidate Test Storage

In contrast to the propositional case, keeping track of the candidate tests (the refinements of a query) is a non-trivial task. In the first order case, the set of candidate queries consists of all possible ways to extend a query. The longer a query is and the more variables it contains, the larger the number of possible ways to bind the variables becomes and the larger the set of candidate tests is. These dynamics in the set of possible test extensions cause problems when trying to extend tree-restructuring algorithms such as ITI (Utgoff et al., 1997) to first order representations. Since the set of possible tests for a node is not fixed when the ancestor nodes are not fixed, the statistics for each possible test in a node become useless when an ancestor node is changed by one of the tree restructuring mechanisms. Since a large number of such candidate tests exist, they must be stored as efficiently as possible. To this aim the query packs mechanism introduced by Blockeel et al. (2000) is used. A query pack is a set of similar queries structured into a tree; common parts of the queries are represented only once in such a structure. For instance, the set of conjunctions {(p(X), q(X)), (p(X), r(X))} can be represented as the single term p(X), (q(X); r(X)). This can yield a significant gain in practice. Assuming a constant branching factor $b$, the memory use for storing a pack of $n$ queries of length $l$ is proportional to the number of nodes in a tree with $l$ layers, i.e.,
\[ b + b^2 + \ldots + b^l = \frac{b(b^l - 1)}{b - 1} \]
(the root node of the tree is not counted, as not all queries are required to start with the same predicate). Since $n = b^l$ (the number of leaves in the tree), this is equal to $(n - 1)b/(b - 1)$. The amount of memory used to store $n$ queries of length $l$ without using a pack representation is proportional to $nl$. Also, executing a set of queries structured in a pack requires considerably less time than executing them all separately. Even when stored in packs, the queries still require a lot of memory. However, the packs in the leaf nodes are very similar, so a further optimization is to reuse them. When a node is split, the pack for the new right leaf node is the same as the original pack of the node. For the new left sub-node, the pack is currently only reused if a test is added which does not introduce new variables. In that case the query pack in the left leaf node will be equal to the pack in the original node except for the chosen test, which of course cannot be taken again. In further work, it is also
possible to reuse query packs in the more difficult case when a test is added which introduces new variables.
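The prefix-sharing idea behind query packs can be made concrete with a small sketch: queries are lists of literal strings and common prefixes are stored only once in a trie. The representation is deliberately simplified (no variable sharing or query execution); it only serves to make the memory argument above tangible.

def build_pack(queries):
    """Store a set of conjunctive queries (lists of literals) as a prefix tree."""
    root = {}
    for query in queries:
        node = root
        for literal in query:
            node = node.setdefault(literal, {})
    return root

def count_nodes(pack):
    return sum(1 + count_nodes(child) for child in pack.values())

queries = [["p(X)", "q(X)"], ["p(X)", "r(X)"], ["p(X)", "q(X)", "s(X)"]]
pack = build_pack(queries)
# the shared prefix p(X) is stored once: 4 trie nodes instead of 2 + 2 + 3 = 7 literals
print(count_nodes(pack), sum(len(q) for q in queries))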
5.3.3 Candidate Test Selection
The statistics for each leaf consist of the number of examples on which each possible test succeeds or fails, as well as the sum of the Q-values and the sum of the squared Q-values for each of the two subsets created by the test. These statistics can be calculated incrementally and are sufficient to compute whether some test is significant, i.e., whether the variance of the Q-values of the examples would be reduced sufficiently by splitting the node using that particular test. A standard F-test with a significance level of 0.001 is used to make this decision. (This may seem a very strict significance level, but since the Q-learning setting supplies the regression algorithm with lots of examples, the test needs to be this strict.) The F-test compares the variance of the Q-values of the examples collected in the leaf before and after splitting, i.e.,
\[ \frac{n_p}{n}\sigma_p^2 + \frac{n_n}{n}\sigma_n^2 \quad \text{vs.} \quad \sigma_{total}^2 \]
where $n_p$ and $n_n$ are the numbers of examples for which the test succeeds or fails and $\sigma_p$ and $\sigma_n$ are the variances of the Q-values of the examples for which the test succeeds and fails respectively. The variances of the subsets are added together, weighted according to the size of the subsets. Since
\[ \sigma^2 \equiv \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n} \]
with $\bar{x}$ the average of all $x_i$, the comparison above can be rewritten as
\[ \frac{n_p}{n} \cdot \frac{\sum_{i=1}^{n_p} q_i^2 - n_p\bar{q}_p^2}{n_p} + \frac{n_n}{n} \cdot \frac{\sum_{i=1}^{n_n} q_i^2 - n_n\bar{q}_n^2}{n_n} \quad \text{vs.} \quad \frac{\sum_{i=1}^{n} q_i^2 - n\bar{q}^2}{n} \]
or, after multiplying by $n$:
\[ \sum_{i=1}^{n_p} q_i^2 - n_p\bar{q}_p^2 + \sum_{i=1}^{n_n} q_i^2 - n_n\bar{q}_n^2 \quad \text{vs.} \quad \sum_{i=1}^{n} q_i^2 - n\bar{q}^2 \]
This comparison can easily be expressed using the 6 statistical values stored in a leaf for each test:
\[ \sum_{i=1}^{n_p} q_i^2 - \frac{1}{n_p}\Big(\sum_{i=1}^{n_p} q_i\Big)^2 + \sum_{i=1}^{n_n} q_i^2 - \frac{1}{n_n}\Big(\sum_{i=1}^{n_n} q_i\Big)^2 \quad \text{vs.} \quad \sum_{i=1}^{n} q_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} q_i\Big)^2 \]
with $n = n_p + n_n$. Each leaf also stores the Q-value that should be predicted for (state, action) pairs that are assigned to that leaf. This value is obtained from the statistics of the test used to split its parent node when the leaf was created. Later, this value is updated as new examples are sorted into the leaf. A node is split after some minimal number of examples has been collected and some test becomes significant with high confidence. This minimal number of examples or minimal sample size is a parameter of the system that can be tuned. Low values will cause tg to learn faster, but may cause problems by choosing the wrong test based on too little information.
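The bookkeeping described above can be sketched as follows. The six sums per candidate test and the variance comparison follow the formulas of this section, but the final accept/reject decision is reduced to a fixed variance-ratio threshold, which merely stands in for the F-test at the 0.001 level used by the actual system.

class TestStats:
    """The six values stored per candidate test: example counts, Q-value sums
    and squared-Q-value sums for the subsets where the test succeeds/fails."""
    __slots__ = ("n_p", "s_p", "ss_p", "n_n", "s_n", "ss_n")

    def __init__(self):
        self.n_p = self.n_n = 0
        self.s_p = self.ss_p = self.s_n = self.ss_n = 0.0

    def update(self, succeeds, q):
        if succeeds:
            self.n_p += 1; self.s_p += q; self.ss_p += q * q
        else:
            self.n_n += 1; self.s_n += q; self.ss_n += q * q

    def sse_total(self):
        n, s, ss = self.n_p + self.n_n, self.s_p + self.s_n, self.ss_p + self.ss_n
        return ss - s * s / n if n else 0.0

    def sse_split(self):
        sse = 0.0
        if self.n_p:
            sse += self.ss_p - self.s_p * self.s_p / self.n_p
        if self.n_n:
            sse += self.ss_n - self.s_n * self.s_n / self.n_n
        return sse


def choose_split(stats_per_test, min_sample_size, min_ratio=2.0):
    """Return the candidate test with the largest variance reduction, or None.
    The real system accepts a split only when an F-test on the variance before
    and after splitting is significant at the 0.001 level; the fixed ratio
    threshold used here merely stands in for that decision."""
    best, best_sse = None, float("inf")
    for test, st in stats_per_test.items():
        if st.n_p + st.n_n < min_sample_size or not st.n_p or not st.n_n:
            continue
        if st.sse_split() < best_sse:
            best, best_sse = test, st.sse_split()
    if best is None:
        return None
    total = stats_per_test[best].sse_total()
    return best if total >= min_ratio * best_sse else None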
5.3.4 RRL-tg
The introduction of the incremental tg algorithm into the RRL system solves the problems of the original RRL implementation while keeping most of the properties of the original system.
• The trees are no longer generated from scratch after each episode but are built incrementally. This enables the new system to process many more learning episodes with the same computational power.
• Because tg only stores statistics about the examples in the tree and only references these examples once (when they are inserted into the tree), the need for remembering, and therefore searching and replacing, examples has disappeared.
• Since tg begins each new leaf with completely empty statistics, examples have a limited life span and old (possibly noisy) Q-value examples will be deleted even if the exact same (state, action) pair is not encountered twice.
• Since the bias used by this incremental algorithm is the same as with Tilde, the same theories can be learned by tg. Both algorithms search the same hypothesis space, and although tg can be misled in the beginning due to its incremental nature, in the limit the quality of the approximations of the Q-values should be the same.
5.4 Experiments

5.4.1 The Experimental Setup
To test the tg algorithm, a number of experiments using the blocks world (see Chapter 3) have been performed. To test the ability of RRL and tg to generalize over varying environments, the number of blocks (and thus the number of objects in the world) is varied between 3 and 5. Although this limits the number of different world states to 587 and the number of different state-action combinations to approximately 2000, this number is large enough to illustrate the generalization characteristics of the regression algorithms. Per episode, RRL will collect about 2 to 2.5 examples (depending on the exact learning task), so after 1000 episodes RRL will have collected around 2000 to 2500 learning examples. However, one should also remember that not all of these learning examples will carry correct, if any, information and that Q-learning expects each (state, action) pair to be visited multiple times.

5.4.1.1 Tasks in the Blocks World
Three different goals were used, each with their own characteristics.

Stacking all blocks: In this task, the agent needs to build one large stack of blocks. The order of the blocks in the stack does not matter. The optimal policy is quite simple, as the agent just needs to stack blocks onto the highest stack. However, since the RRL algorithm is a Q-learning technique, tg will still need to predict a number of different Q-values. Stacking is the simplest task that will be considered.

Unstacking all blocks: The Unstacking task consists of putting all blocks onto the floor. Again (of course) there is no order in which the blocks should be put on the floor. The Unstacking task is closely related to the Stacking task, but there are some important differences. Although the optimal policy for Unstacking is even simpler than the one for Stacking, i.e., in each step put a block on the floor that is not already on it, the task is quite a bit harder to learn using Q-learning. Not only is there only one goal state, but the number of possible actions (related to the number of blocks with no other blocks on top of them) increases with each step closer to the goal. This makes it very hard for a Q-learning agent to reach the goal state using random exploration and confronts the regression algorithm with a lot of non-informative Q-values, as not reaching the goal state during an episode results in Q-values of 0.0 for all the (state, action) pairs encountered during that episode.

Stacking two specific blocks, On(A,B): The third task under consideration is stacking two specific blocks. This task is interesting for a number of reasons. First, the task is actually a combination of several sub-tasks. To stack two specific blocks, one first needs to clear the two blocks and then put the correct block on top of the other. This means that the optimal policy is harder to represent than for the two other tasks, and the same can be expected for the Q-function. Secondly, RRL will be trained to learn to stack any two specific blocks. In each training episode, the goal will be changed to refer to a different set of blocks. This way, RRL will have to describe the learned Q-function with regard to the properties of blocks and relations between blocks instead of referring to specific block identities. This will allow RRL to use its learned Q-function to stack “Block 1” onto “Block 2” and to stack “Block 5” onto “Block 3” without retraining.
Table 5.1: Number of states and number of reachable goal states (RGS) for the three goals and different numbers of blocks.

No. of blocks   No. of states   RGS stack   RGS on(a,b)   RGS unstack
 3                       13            6             2             1
 4                       73           24             7             1
 5                      501          120            34             1
 6                    4 051          720           209             1
 7                   37 633        5 040         1 546             1
 8                  394 353       40 320        13 327             1
 9                4 596 553      362 880       130 922             1
10               58 941 091    3 628 800     1 441 729             1

Table 5.1 shows the number of states in the Blocks World in relation to the number of blocks in the world. Although only environments with 3 to 5 blocks are used in the tests in this chapter, the world sizes for larger numbers of blocks are shown as well, as these worlds will be used in later chapters. The number of “reachable goal states” (RGS) is also shown for each of the three tasks. Since goal states are modelled as absorbing states and given that RRL always starts an episode in a non-goal state, there is a (large) number of goal states for the On(A,B) task that cannot be reached during an episode. These are states that have block A on top of block B, but also have other blocks on top of A. See Figure 5.2 for an example of a reachable and a non-reachable goal state.
Figure 5.2: Left: a reachable goal state; right: a non-reachable goal state for the on(1, 2) task in the Blocks World.
5.4.1.2 The Learning Graphs
Unless indicated otherwise, the learning graphs throughout this work combine the results of 10-fold experiments. The tested criterion is the average reward received by the RRL system starting from 200 different, randomly generated starting states. The rewards in the test are given according to the planning setting discussed in the intermezzo on page 16, with the added constraint that a reward is only presented to RRL if it reaches the goal in the minimal number of steps needed. If RRL does not reach the goal in the minimal number of steps, the episode is terminated and no reward is given. This means that the y-axis on the learning curves represents the percentage of cases in which RRL succeeds in reaching the goal state in the minimal number of steps, i.e., solving the problem optimally. Giving a reward of 1 only for optimal traces (i.e., reaching the goal in the minimal number of steps) will not result in learning from correct examples only. Because RRL calculates the Q-values of new learning examples using its current Q-function approximation, it will also generate non-zero Q-values for examples from non-optimal episodes. These may or may not be correct. The number of blocks is rotated between 3, 4 and 5 during the learning episodes, i.e., each episode with 3 blocks is followed by one with 4, which is in turn followed by one with 5, etc. Every 100 episodes, the Q-function built by the RRL system is translated into a deterministic policy and tested on 200 randomly generated problems with 3 to 5 blocks. Note that during these testing episodes, RRL uses the greedy policy indicated by the Q-function it has learned at that time. This removes the influence of the exploration strategy from the test results. This choice was made because RRL so far only uses exploration strategies to control the learning examples that are presented to the regression algorithm. Because the experiments are designed to test the performance of the regression algorithm as a Q-function generalization, the influence of the exploration strategy during testing would only make the results harder to interpret. The interested reader can check Appendix A for the representation of the blocks world that was used. The appendix also includes the language bias needed for the tg algorithm, which is the same as for the original implementation, which used Tilde for regression.
5.4.2 The Results
Aside from the language bias, the most important parameter of the tg algorithm is the minimal number of examples that have to be classified into a leaf before tg is allowed to split that leaf and choose the most appropriate test. This minimal example set size influences the number of splits that tg makes only in the short term, as tg counts on the statistics in the leaves to choose the appropriate tests whenever it becomes necessary to split a leaf. However, since the examples that are presented to the tg algorithm are noisy at the start of learning, and since splitting decisions cannot be undone, it is important to collect a significant amount of learning experience before committing to a decision. The minimal sample size can be tuned to make tg wait long enough before making a split in the Q-tree.
[Plot: average total reward versus the number of episodes on the Stacking task, one curve per minimal sample size (mss = 30, 50, 100, 200, 500).]
Figure 5.3: Comparing minimal sample size influences for the Stacking task

Different minimal sample sizes were tested on the different blocks world tasks. Figure 5.3 shows the results for the Stacking task. For the Stacking task in worlds with 3 to 5 blocks, the average length of an episode is only 2 steps, so 2 learning examples are generated per episode on average. The performance graph shows that higher minimal sample sizes result in a higher accuracy (e.g., ±98% accuracy for mss = 200 compared to ±94% for mss = 30), at the cost of slower convergence. The performance of tg with the sample size set to 30 jumps up quite fast, but is also the least stable. The performance for the size set to 500 rises slowly and with large jumps. Each jump represents the fact that tg has extended the Q-tree with one test. These performance jumps are typical for RRL using tg, as the performance will increase (or even decrease) only with the split of a leaf. These jumps are also present in the other performance curves, but because the changes in the Q-tree happen faster, they are less apparent. The size of the resulting Q-tree is also influenced by the minimal sample size. More informed by a larger example set, tg succeeds in choosing better tests. This yields equivalent or better performance with smaller trees, as badly chosen tests at the top of the tree force tg to make corrections, and thus build larger subtrees, lower in the Q-tree. Table 5.2 shows the size of the Q-tree after learning for 1000 episodes. Although tg only builds a tree with 3 leaves for a minimal sample size of 500, the performance of this small Q-tree
Table 5.2: The number of leaves in the Q-tree for different minimal sample sizes (mss)

mss          30     50     100    200    500    1000
Stacking     10.8   9.6    7.6    5.9    3.0    2.0
Unstacking   16.3   16.6   13.1   8.2    3.6    2.0
On(A,B)      33.1   24.4   16.3   9.7    4.0    3.0
when translated into a policy is already very good, as shown in Figure 5.3. The performance curve for the sample size of 500 clearly shows the performance increase related to each of the two splits. The table also shows a tree size for a minimal sample size of 1000, but tg is only able to build a tree with one split in that case and needs more learning episodes to build a useful tree with that setting. Therefore the performance of tg for that value is not shown in Figure 5.3.
[Plot: average total reward versus the number of episodes on the Unstacking task, one curve per minimal sample size (mss = 30, 50, 100, 200, 500).]
Figure 5.4: Comparing minimal sample size influences for the Unstacking task

Figure 5.4 shows the results for the Unstacking task. An episode for the Unstacking task in a world with 3 to 5 blocks generates on average 2.3 learning examples. Since the goal is harder to reach (see the discussion in the previous section), tg will be presented with more uninformative examples than in the Stacking task. This results in a slower performance increase. It also causes tg to build a larger tree (see Table 5.2). For this kind of task, it is better to raise
the minimal example set size and force tg to make decisions based on more experience.

[Plot: average total reward versus the number of episodes on the On(A,B) task, one curve per minimal sample size (mss = 30, 50, 100, 200, 500).]
Figure 5.5: Comparing minimal sample size influences for the On(A,B) task

The On(A,B) task does not suffer from the low number of goal states as much as the Unstacking task does, so it does not need the minimal example set size to be raised. Compared to the previous two tasks, the On(A,B) task is harder because it consists of two subtasks, as explained. This makes the Q-function more complex and will force tg to build a more complex (i.e., larger) Q-tree. An episode in worlds with 3 to 5 blocks generates about 2.5 learning examples on average. Figure 5.5 shows the performance graphs for the On(A,B) task. Because a larger Q-tree is needed, lower values for the minimal sample size are better for a fixed number of learning episodes. Figure 5.6 shows a tree that was learned after 1000 episodes with the minimal sample size set to 200. It can easily be seen why it does not solve the complete On(A,B) task yet. The topmost test in the tree is a standard initialization of the tree and is defined by the user. It allows tg to refer to the blocks that participate in either the goal or the action. It can also compare these blocks to each other, as it does for the leftmost leaf, which represents the (state, action) pairs that move the goal blocks on top of each other. Therefore the Q-value of 1.0 is correct for this leaf. However, the right-hand sibling of this node still overgeneralizes: it treats all (state, action) pairs that move a block other than block A onto the B block as identical. On the right side of the tree, tg has begun to build a subtree which calculates the distance to the goal. It makes a distinction between the states which have the B block on top of A and the ones which do not. In the rightmost branch, it starts to introduce new variables. This allows it to check whether there are still other blocks which have to be moved.
[Tree diagram: the root tests move(X,Y), goal_on(A,B), equal(Y,B); the internal nodes test equal(X,A), equal(X,B), on(B,A), above(X,B), above(K,A), above(L,B) and above(M,B); the leaves predict Q-values 1.0, 0.6400, 0.3222, 0.2891, 0.1935, 0.1194, 0.1068, 0.02386 and 0.0.]
Figure 5.6: A tree learned by tg after 1000 learning episodes for the On(A,B) task. The minimal sample size was set to 200.
To solve the complete problem, the tree would have to be expanded with additional tests until it represents at least all possible distances to the goal and, with them, all different possible Q-values. The execution times for the RRL-tg system are almost equal for the three tasks presented. It takes RRL-tg about 15 seconds to process the 1000 learning episodes and an additional 2 seconds to perform the 2000 tests of an entire learning experiment. The Q-tree learned by the tg algorithm is quite simple, and thus fast, to consult.
5.5 Possible Extensions
The tg algorithm as discussed above can be improved in a number of ways. The major drawback of the tg algorithm as it stands is its inability to undo mistakes made at the top of the tree. The minimal sample size as it is used now can be an effective way to reduce the probability of selecting bad tests at the top of the tree, but it also introduces a number of unwanted side effects. Large minimal sample sizes reduce the learning speed of the tg algorithm and thereby the learning speed of the RRL system, especially in later stages of learning, since a lot of data has to be collected in each leaf before tg is allowed to split. This can be circumvented by the use of a dynamic minimal sample size. One could, for instance, reduce the minimal sample size for each level of the tree. Another method of dealing with early mistakes would be to store statistics about the tests in each node (not just the leaf nodes of the tree) and allow tg to alter a chosen test when it becomes clear that a mistake has been made. Table 5.2 suggests the existence of very good tests that should surface, given enough training examples. As explained in Section 5.3.2, the use of first order trees makes it impossible to store statistics on each possible test for a node if the tests in the nodes above it are not fixed. This makes it impossible to simply propagate a change through the underlying subtrees. However, two possibilities present themselves. The simplest solution would be to delete the two subtrees and start building them from scratch, this time with the assurance of a better parent node (see the left side of Figure 5.7). This is probably the easiest solution; however, a lot of information is lost with the deletion of the two subtrees. Another possible approach is to reuse the original subtree in each of the two leaves (cf. the right side of Figure 5.7). If the statistics in the subtrees are reset, tg can start to collect information on new best tests without the immediate loss of all the previously learned information. This approach would also need a pruning strategy to delete subtrees when they become unnecessary.
[Diagram: a node whose old (bad) test is replaced by a new test, with the two rebuilding strategies: on the left, the subtrees below the new test are rebuilt from scratch; on the right, the old test and its subtree are reused in both branches of the new test.]
Figure 5.7: Two possible tree rebuilding strategies when a better test is discovered in a node. tg could use the statistics gathered in a leaf to make more informed Q-value estimates. It is possible for tg to use the information gathered on the different tests to select the Q-value prediction of the best test without committing to select that test to expand the tree. This could yield better estimations earlier on and as a consequence also improve the estimations of the new learning examples. However, caution should be used here as prediction errors might cause instability as well. Another extension that is worth investigating is the addition of aggregate functions into the language bias of the tg system. Since the Q-function often implicitly encodes a distance to the goal or to the next reward, this distance might be more easily represented by allowing tg to use aggregates such as summation. The inclusion of a history for each state such as used in the Utree algorithm of MacCallum (1999) would improve the behavior of RRL-tg is partially observable environments. The relational nature of the tests used by tg easily facilitate the use of a short term memory. The only needed change would be to store a the (state, action) pairs which make up the memory of the learning agent. The additional memory requirements would be limited, as only a fixed sized window of (state, action) pairs would have to be remembered. A larger step from the current tg algorithm would be to grow a set of trees instead of a single tree. In this setup, the second tree could be trained to represent the prediction error made by the first tree, the third tree to represent the error still made after the combination of the first and second tree, and so on. This would allow tg to build a first tree based on largest differences in the Q-function, but might help further tuning of this coarse function by only
70
CHAPTER 5. RRL-TG
having to build a error correcting tree once instead of having to expand each leaf of the first tree into a full subtree. The use of more than one tree might allow each tree to focus on a different aspect of the (state, action) pair and its influence on the Q-value.
5.6
Conclusions
This chapter introduced the first of three relational regression algorithms for relational reinforcement learning. The tg algorithm incrementally builds and updates a first order regression tree. The algorithm stores statistics in the leafs that contain information on the possible tests that can be used for splitting the leafs. This alleviates the need to store all encountered examples and greatly increases the applicability of relational reinforcement learning compared to the original RRL system. The tg system was tested on three tasks in the blocks world and the influence of its most important parameter, the minimal number of examples that needs to be classified to a leaf before that leaf is allowed to split, was investigated.
Chapter 6
Relational Instance Based Regression “You know, when you’re the middle child in a family of five million, you don’t get any attention.” Antz
6.1
Introduction
Instance based learning (classification as well as regression) is known for both its simplicity and performance. Instance based learning — also known as lazy learning or as nearest neighbor methods — simply stores the learning examples or a selection thereof and computes a classification or regression value based on a comparison of the new example with the stored examples. For this comparison, some kind of similarity measure, often a distance, has to be defined. The two major concerns for the use of instance based learning as a regression technique for RRL are the noise inherent to the Q-learning setting and the need for example selection as the instance based learning system will be presented with a continuous stream of new learning examples. This chapter describes a relational instance based regression technique that will be called rib algorithm. Several data base management techniques designed to limit the number of examples that are stored by the rib algorithm are presented. rib is then tested on two applications. A simple corridor application is used to study the influence of the different parameters of the system and the blocks world environment is used to compare the performance of the rib system to the tg regression algorithm. The chapter concludes by discussing some further work for the rib system. 71
72
CHAPTER 6. RRL-RIB
The rib system was designed and implemented with the help of Jan Ramon and was first presented in (Driessens and Ramon, 2003).
6.2
Nearest Neighbor Methods
This section discusses some previous work in instance based regression and relates it to the regression task in the RRL system. Aha et al. introduced the concept of instance based learning for classification (Aha et al., 1991) through the use of stored examples and nearest neighbor techniques. They suggested two techniques to filter out unwanted examples to both limit the number of examples that are stored in memory and improve the behavior of instance based learning when confronted with noisy data. To limit the inflow of new examples into the database, the IB2 system only stores examples that are classified wrong by the examples in memory so far. To be able to deal with noise, the IB3 system removes examples from the database who’s classification record (i.e., the ratio of correct and incorrect classification attempts) is significantly worse than that of other examples in the database. Although these filtering techniques are simple and effective for classification, they do not translate easily to regression. The idea of instance based prediction of continuous target attributes was introduced by Kibler et al. (1989). This work describes an approach that uses a form of local linear regression. Although Kibler et al. refer to instance based classification methods for reducing the amount of storage space needed by the instance based techniques, they have not used these techniques for continuous value prediction tasks. This idea of local linear regression is in greater detail explored by Atkeson et al. (1997), but again, no effort is made to limit the growth of the stored database. In follow-up work however (Schaal et al., 2000), they do describe a locally weighted learning algorithm that does not need to remember any data explicitly. Instead, the algorithm builds “locally linear models” which are updated with each new learning example. Each of these models is accompanied by a “receptive field” which represents the area in which this linear model can be used to make predictions. The algorithm also determines when to create a new receptive field and the associated linear model. Although this is a good idea, building local linear models in the relational setting (where data can not be represented as a finite length vector) does not seem feasible. An example where instance based regression is used in Q-learning is in the work of Smart and Kaelbling (2000) where they use locally weighted regression as a Q-function generalization technique for learning to control a real robot moving through a corridor. In this work, the authors do not look toward limiting the size of the example-set that is stored in memory. They focus on making safe predictions and accomplish this by constructing a convex hull
6.3. RELATIONAL DISTANCES
73
around their data. Before making a prediction, they check whether the new example is inside this convex hull. The computation of the convex hull again relies on the fact that the data can be represented as a vector, which is not the case in the relational setting. Another instance of Q-learning in which instance based learning is used is given by Forbes and Andre (2002) where Q-learning is used in the context of automated driving. In this work the authors do address the problem of large example-sets. They use two parameters that limit the inflow of examples into the database. First, a limit is placed on the density of stored examples. They overcome the necessity of forgetting old data in the Q-learning setting by updating the Q-value of stored examples according to the Q-value of similar new examples. Secondly, a limit is given on how accurately Q-values have to be predicted. If the Q-value of a new example is predicted within a given boundary, the new example is not stored. When the number of examples in the database reaches a specified number, the example contributing the least to the correct prediction of values is removed. Some of these ideas will be adopted and expanded on in the design of the new relational regression algorithm.
6.3
Relational Distances
Nearest neighbor methods need some kind of similarity measure or distance between the used examples. In the RRL context, this means a distance is needed between different (state, action) pairs. Because RRL uses a relational representation of both states and actions, a relational distance is needed. Since states and actions are represented using a set of ground facts, it might seem sufficient to use a similarity measure on sets, combined with a distance on ground facts. However, this approach would treat each fact as a separate entity while different facts can contain references to the same object in a state or action. Therefore, a distance between relational interpretations is needed. Several approaches to compute such a distance exist (Sebag, 1997; Ramon and Bruynooghe, 2001). Related to this, Emde and Wettschereck (1996) define a similarity measure on first order representations that doesn’t satisfy the triangle inequality. Although these general distances can be used in RRL , they incorporate no prior knowledge about the structure and dynamics of the environment RRL is learning in. Where in the RRL-tg system, the user could guide the search by specifying a language bias for the tg system, an application specific distance will allow the user to do the same for the RRL-rib system. The use of such a specific distance will not only allow the user to emphasize the important structures in the environment, but might also reduce the computational complexity of the distance. In the blocks world for example, an application specific distance can empha-
74
CHAPTER 6. RRL-RIB
size the role of stacks of blocks. For example, the distance between two blocks world states can be defined as the distance between two sets of stacks. For a distance between sets, the matching distance (Ramon and Bruynooghe, 2001) can be used. This distance first tries to find the optimal matching between the elements of the two sets and then computes a distance based on the distances between the matched elements. Also, by treating the blocks not mentioned in either the goal or as part of the performed action as identical, the use of object specific information can be avoided (comparable to the use of variables in the tg language bias.) Such a distance for the blocks world environment can be computed as follows: 1. Try to rename the blocks so that block-names that appear in the action (and possibly in the goal) match between the two (state, action) pairs. If this is not possible, add a penalty to your distance for each mismatch. Rename each block that does not appear in the goal or the action to the same name. 2. To compute the distance between the two states, regard each state (with renamed blocks) as a set of stacks and calculate the distance between these two sets using the matching-distance between sets based on the distance between the stacks of blocks (Ramon and Bruynooghe, 2001). 3. To compute the distance between two stacks of blocks, transform each stack into a string by reading the names of the blocks from the top of the stack to the bottom, and compute the edit distance (Wagner and Fischer, 1974) between the resulting strings. While this procedure defines a generic distance, it will adapt itself to deal with different goals as well as different numbers of blocks in the world. The renaming step (Step 1) ensures that the blocks that are manipulated and the blocks mentioned in the goal of the task are matched between the two (state, action) pairs. The other blocks are given a standard name to remove their specific identity. This allows instance-based RRL to generalize over actions that manipulate different specific blocks as well as to train on similar goals which refer to different specific blocks. This is comparable to RRL-tg which uses variables to represent blocks which appear in the action and goal description. Blocks which do not appear in the action or goal description are all regarded as generic blocks, i.e., without paying attention to the specific identity of these blocks. The renaming step is illustrated in Figure 6.1. In the first state, block 1 is moved onto another block while block 3 is moved in the second state. Therefore, both blocks are given the same name during renaming. Block 2 in the left state and block 1 in the right hand state are also given the same name because they are both mentioned as the first block in the goal. For
6.3. RELATIONAL DISTANCES
1 2
4 3
goal: on(2,4)
a c
b x
75
3 1 5
4 2
goal: on(1,2)
a c x
x b
Figure 6.1: The renaming step in the computation of the blocks world distance.
block 4 in the left state and block 2 in the right state, both the action and the goal references to the blocks match, so they can also be given the same name. Were this not the case, they would have received different names and a penalty would be added to the distance per mismatched block. In the experiments this penalty value was set to 2.0 All the other blocks are given a generic name, because their identity is neither referred to by the action nor by the goal. After this renaming step, the two bottom states of Figure 6.1 are translated into the two following sets of strings: {ac, bx} and {acx, b, x}. These sets are compared using the matching distance between the sets and the edit distance between the different strings. Computing the edit distances of the available strings results in the following table: d(ac,b) = 3.0 d(ac,acx) = 1.0 d(ac,x) = 3.0
d(bx,b) = 1.0 d(bx,acx) = 3.0 d(bx,x) = 1.0
An optimal matching in this case matches ‘ac’ to ‘acx’ and ‘bx’ to ‘b’ (or ‘x’). The unmatched element of the second set causes a penalty to be added to the distance. In the experiments, this penalty was set to 2.0, so that the distance for the two (state, action) pairs of Figure 6.1 is computed to be 4.0. The use of the matching-distance and the edit-distance enables RRL-rib to train on
76
CHAPTER 6. RRL-RIB
blocks worlds of different sizes.
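To make the procedure above concrete, the following sketch implements Steps 2 and 3 of the distance computation, assuming the renaming of Step 1 has already been performed. The reported edit-distance values suggest that substitutions count as a deletion plus an insertion; the sketch uses that convention, computes the optimal matching with a standard assignment solver (scipy's linear_sum_assignment) as a stand-in for the matching distance of Ramon and Bruynooghe, and charges unmatched stacks the penalty of 2.0 by padding with dummy elements. The function names and the use of scipy are illustrative choices, not the original RRL-rib implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edit_distance(a, b):
    # Wagner-Fischer dynamic programming; substitutions cost 2 (i.e.,
    # effectively only insertions and deletions), which reproduces the
    # values d(ac,b)=3.0, d(ac,acx)=1.0, ... quoted in the text.
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else 2.0
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[m][n]

def matching_distance(stacks1, stacks2, penalty=2.0):
    # Matching distance between two sets of stacks: pad the smaller set
    # with dummy elements that cost `penalty`, then solve the optimal
    # assignment problem over the pairwise edit distances.
    n = max(len(stacks1), len(stacks2))
    cost = np.full((n, n), penalty)
    for i, s1 in enumerate(stacks1):
        for j, s2 in enumerate(stacks2):
            cost[i, j] = edit_distance(s1, s2)
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum())

# The example of Figure 6.1, after renaming:
print(matching_distance(["ac", "bx"], ["acx", "b", "x"]))  # -> 4.0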
6.4 The rib Algorithm
This section describes a number of different techniques that can be used with relational instance based regression to limit the number of examples stored in memory. As stated before, none of these techniques will require the use of vector representations. Some of these techniques are designed specifically to work well with Q-learning.

The rib algorithm will use c-nearest-neighbor prediction as a regression technique, i.e., the predicted Q-value q̂_i will be computed as follows:

    \hat{q}_i = \frac{\sum_j q_j / dist_{ij}}{\sum_j 1 / dist_{ij}}    (6.1)

where dist_ij is the distance between example i and example j and the sum is computed over all examples stored in memory. To prevent division by 0, a small amount δ can be added to this distance.
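A direct implementation of this prediction rule might look as follows; this is a minimal sketch, with the relational distance function and the list of stored examples assumed to be supplied by the caller.

```python
def predict_q(new_example, stored, distance, delta=1e-6):
    """Distance-weighted prediction of Equation 6.1.

    `stored` is a list of (example, q_value) pairs and `distance` the
    relational distance between two examples; `delta` avoids division
    by zero when an exact match is stored.
    """
    num, den = 0.0, 0.0
    for example, q in stored:
        d = distance(new_example, example) + delta
        num += q / d
        den += 1.0 / d
    return num / den if den > 0.0 else 0.0
```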
6.4.1 Limiting the Inflow
In IB2 (Aha et al., 1991) the inflow of new examples into the database is limited by only storing examples that are classified incorrectly by the examples already stored in the database. However, when predicting a continuous value, one cannot expect to predict a value exactly very often. A certain margin for error in the predicted value will have to be tolerated. Comparable techniques used in a regression context (Forbes and Andre, 2002) allow an absolute error when making predictions as well as limit the density of the examples stored in the database.

To translate the idea of IB2 towards regression in a more adaptive manner, instead of adopting an absolute error margin it is better to use an error margin which is proportional to the standard deviation of the values of the examples closest to the new example. This makes the regression engine more robust against large variations in the values that need to be predicted. Translated to a formula, this means that examples will be stored if

    |q - \hat{q}| > \sigma_{local} \cdot F_l    (6.2)

with q the real Q-value of the new example, q̂ the predicted Q-value, σ_local the standard deviation of the Q-values of a representative set of the closest examples (the rib algorithm uses the 30 closest examples) and F_l a suitable parameter.

The idea of limiting the number of examples which occupy the same region of the example space is also adopted, but without the rigidity that a global
maximum density imposes. Equation 6.2 will limit the number of examples stored in a certain area. However, when trying to approximate a function such as the one shown in Figure 6.2, it seems natural to store more examples of region A than of region B in the database. Unfortunately, region A will yield a large σ_local in Equation 6.2 and will not cause the algorithm to store as many examples as it should.
Figure 6.2: To predict the shown function correctly, an instance based learner should store more examples from area A than area B.
rib therefore adopts an extra strategy that stores examples in the database until the local standard deviation (i.e., of the 30 closest examples) is only a fraction of the standard deviation of the entire database, i.e., an example will be stored if

    \sigma_{local} > \frac{\sigma_{global}}{F_g}    (6.3)

with σ_local the standard deviation of the Q-values of the 30 closest examples, σ_global the standard deviation of the Q-values of all stored examples and F_g a suitable parameter. This results in more stored examples in areas with large variance of the function value and fewer in areas with small variance. An example will be stored by the RRL system if it meets at least one of the two criteria. Both Equation 6.2 and Equation 6.3 can be tuned by varying the parameters F_l and F_g.
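The two inflow criteria can be combined into a single acceptance test. The sketch below reuses the hypothetical predict_q helper above and estimates the local standard deviation from the 30 closest stored examples, as rib does; the parameter defaults are only placeholders.

```python
import numpy as np

def should_store(new_example, q, stored, distance, predict_q,
                 F_l=5.0, F_g=5.0, k_local=30):
    # With an empty database, always keep the first example.
    if not stored:
        return True
    q_hat = predict_q(new_example, stored, distance)
    # Standard deviation of the Q-values of the k_local closest examples.
    by_dist = sorted(stored, key=lambda ex_q: distance(new_example, ex_q[0]))
    sigma_local = float(np.std([q_j for _, q_j in by_dist[:k_local]]))
    sigma_global = float(np.std([q_j for _, q_j in stored]))
    large_error = abs(q - q_hat) > sigma_local * F_l     # Equation 6.2
    sparse_region = sigma_local > sigma_global / F_g     # Equation 6.3
    return large_error or sparse_region
```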
6.4.2 Forgetting Stored Examples
The techniques described in the previous section might not be sufficient to limit the growth of the database appropriately. When memory limitations are reached, or when computation times grow too large, one might have to place a
hard limit on the number of examples that can be stored. The algorithm then has to decide which examples it can remove from the database. IB3 (Aha et al., 1991) uses a classification record for each stored example and removes the examples that perform worse than others. In IB3, this removal of examples is added to allow the instance based learner to deal with noise in the training data. Because Q-learning has to deal with moving target regression and therefore inevitably with noisy data, rib will probably benefit from a similar strategy. However, because the rib algorithm has to deal with continuous values, keeping a classification record which lists the number of correct and incorrect classifications is not feasible. Two separate scores are proposed that can be computed for each example and that indicate which example should be removed from the database.

6.4.2.1 Error Contribution
Since rib is in fact trying to minimize the prediction error, it is possible to compute for each example what the cumulative prediction error is with and without the example. The cumulative prediction error with example i included is computed as

    error_i = \sum_{j \neq i} (q_j - \hat{q}_j)^2

with q̂_j the prediction of the Q-value of example j by the database with example i included. The cumulative prediction error without example i is

    error_i^{-i} = \sum_{j} (q_j - \hat{q}_j^{-i})^2

with q̂_j^{-i} the prediction of the Q-value of example j by the database without example i. Note that this time a term is included to represent the loss of information about the Q-value of example i. The resulting score for example i, obtained by taking the difference between the two cumulative prediction errors, looks as follows:

    EC\text{-}score_i = (q_i - \hat{q}_i^{-i})^2 + \sum_{j \neq i} \left[ (q_j - \hat{q}_j^{-i})^2 - (q_j - \hat{q}_j)^2 \right]    (6.4)
The lowest scoring example is the example that has the lowest influence on the cumulative prediction error and thus is the example that should be removed.

6.4.2.2 Error Proximity
A score that is simpler to compute is based on the proximity of examples in the database that are predicted with large errors. Since the influence of stored examples is
inversely proportional to the distance, it makes sense to presume that examples which are close to the examples with large prediction errors are also causing these errors. The score for example i can be computed as

    EP\text{-}score_i = \sum_j \frac{|q_j - \hat{q}_j|}{dist_{ij}}    (6.5)
where q̂_j is the prediction of the Q-value of example j by the database and dist_ij the distance between example i and example j. In this case, the example with the highest score is the one that should be removed. Another scoring function is used by Forbes and Andre (2002). In this work, the authors also suggest not just forgetting examples, but using instance averaging instead. Applying the instance averaging technique would imply the use of first order prototypes, which is complex and therefore it is not used in the rib algorithm.
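The error-proximity score is cheap to compute from the current predictions. The sketch below reuses the earlier hypothetical helpers; since the text does not specify whether the prediction of example j uses the database with or without j itself, the sketch assumes a leave-one-out prediction, and it sums over j ≠ i so that the zero self-distance never appears in a denominator.

```python
def ep_scores(stored, distance, predict_q, delta=1e-6):
    """Error-proximity score (Equation 6.5) for every stored example.

    Returns a list of scores parallel to `stored`; the example with the
    highest score is the candidate for removal.
    """
    # Leave-one-out prediction errors of all stored examples (assumption).
    errors = []
    for example, q in stored:
        rest = [(e, v) for e, v in stored if e is not example]
        errors.append(abs(q - predict_q(example, rest, distance)))
    scores = []
    for example_i, _ in stored:
        score = 0.0
        for (example_j, _), err_j in zip(stored, errors):
            if example_j is example_i:
                continue
            score += err_j / (distance(example_i, example_j) + delta)
        scores.append(score)
    return scores
```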
6.4.3 A Q-learning Specific Strategy: Maximum Variance
The major problem encountered while using instance based learning for regression is that it is impossible to distinguish high function variation from actual noise. It seems impossible to do this without prior knowledge about the behavior of the function that the algorithm is trying to approximate. If one could impose a limit on the variation of the function to be learned, this limit might allow us to distinguish at least part of the noise from function variation. For example in Q-learning, an application expert could know a bound M on how much the Q-values q_i and q_j of two examples may differ (Equation 6.6). As described below, this bound can then be used, after each addition to the database, to decide which stored examples are redundant and can be removed.

The data management performed by rib can be summarized as follows:

    for each new learning example do
        predict the Q-value q̂ of the new example
        if (|q - q̂| > σ_local · F_l) or (σ_local > σ_global / F_g) then
            store the new example in the database
            if (number of stored examples > maximum allowed examples) then
                compute the EC- or EP-score for each stored example
                remove the example with the worst score
            end if
        end if
    end for
When a new learning example becomes available, the rib system will try to predict its Q-value. If this predicted value sufficiently differs from the real value, or if the new example is from a region of the state-action space where rib has not yet collected enough information, the new example is stored in the data-base. If this brings the number of stored examples over the allowed maximum, the EC- or EP-score is used to select the best example for removal. When the maximum variance strategy is used, all examples are stored when they arrive (independent of the predictive accuracy on these examples). However, after each addition to the data base, Equation 6.6 is used to select which examples stay in the data-base and which are removed. Prediction in the rib system is performed by evaluating Equation 6.1.
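Putting the pieces together, the data management loop for the EP-score variant could be sketched as follows; should_store, ep_scores and predict_q are the hypothetical helpers from the earlier sketches, and max_size is the hard limit on the database.

```python
def process_example(new_example, q, stored, distance, max_size=100):
    # Decide whether to keep the new (example, q) pair at all
    # (Equations 6.2 and 6.3 via should_store).
    if should_store(new_example, q, stored, distance, predict_q):
        stored.append((new_example, q))
        # If the hard limit is exceeded, forget the worst-scoring example.
        if len(stored) > max_size:
            scores = ep_scores(stored, distance, predict_q)
            worst = max(range(len(stored)), key=lambda i: scores[i])
            stored.pop(worst)
    return stored
```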
6.5 Experiments

6.5.1 A Simple Task
To test the different data management approaches, experiments are performed using a very simple (non-relational) Q-learning task. A reinforcement learning agent walks through the corridor of length 10 shown in Figure 6.4. The agent starts on one end of the corridor and receives a reward of 1.0 when it reaches the other end.
Figure 6.4: The corridor application.
Figure 6.5: Prediction errors for varying inflow limitations.
The distance between two (state, action) pairs is related to the number of steps it takes to get from one state to the other, slightly increased if the chosen actions differ.

The Q-function related to this problem is a very simple, monotonically increasing function, so that it only takes two (well chosen) examples for the Q-learner to learn the optimal policy. This being the case, the average prediction error on all (state, action) pairs is compared for the different suggested approaches.
6.5.1.1 Inflow Behavior
To test the two inflow filters of Section 6.4.1, several experiments varying the F_l and F_g values separately were performed. Figure 6.5 shows the average prediction errors over 50 test trials. Figure 6.6 shows the corresponding database sizes.

The influence of F_g is exactly what one would expect. A larger value for F_g forces the algorithm to store more examples but lowers the average prediction error. It is worth noticing that in this application the influence of F_g on the size of the database, and therefore on the computation time, is quite large compared to its relatively small effect on the prediction errors.

The influence of F_l is not so predictable. First of all, the influence of this parameter on the size of the database seems limited. Second, one would expect that an increase of the value of F_l would cause an increase in the prediction error as well. This does not seem to be the case, although the differences measured were not significant enough to make any strong claims.
Figure 6.6: Database sizes for varying inflow limitations.
6.5.1.2 Adding an Upper Limit
The two scoring functions from Section 6.4.2 were tested in the same setting by adding an upper limit to the database size that the rib algorithm is allowed to use. The two parameters F_l and F_g were set to 5.0 — values that gave average prediction errors as well as an average database size — and the number of examples that rib could store to make predictions was varied.

Figure 6.7 shows the average prediction error as a function of the number of learning episodes when using the error-contribution score (EC-score) of Equation 6.4 for different maximum database sizes. The 'no limit' curve in the graph shows the prediction error when no examples are removed. Figure 6.8 shows the average prediction error when managing the database size with the error-proximity score (EP-score) of Equation 6.5. Although the differences with the EC-score are small, EP-score management performs at least as well and is easier to compute.

A rather disturbing feature of both graphs is that the prediction error at first seems to reach a minimum, after which it goes up again. The reasons for this behavior are not entirely clear, although the order in which Q-values are usually discovered in Q-learning might offer a possible explanation. The typical exponentially decreasing shape of the Q-function in relation to the distance to the goal is shown in Figure 6.9. The first learning examples with a non-zero Q-value are usually discovered close to the goal, as it is easier to stumble onto a reward when already close to the goal. These examples lie in the region of the state-action space where differences between Q-values are large, and making good predictions in this area will greatly reduce the overall prediction error. Later in the learning experiment, examples from all over the state-action space will be stored to keep the local standard deviation low.
Figure 6.7: The effect of example selection by Error Contribution.
Figure 6.8: The effect of example selection by Error Proximity.
Figure 6.9: The shape of a Q-function in relation to the distance to the goal.
However, the influence on the prediction error of good predictions in the right part of the state space of Figure 6.9 is a lot smaller. Balancing the examples stored in the database over the entire state space might therefore result in a small increase of the prediction errors.

6.5.1.3 The Effects of Maximum Variance
Figure 6.10: The effect of example selection by Maximum Variance.
Figure 6.10 shows the prediction error when the maximum variance (or mv) strategy is used to manage the database. The M-value of Equation 6.6 was set to 0.1, the maximum difference in Q-values in this environment. The prediction errors are a lot larger than with the other strategies, but RRL is still able to find the optimal strategy. The advantage of the mv-strategy lies in the number
of examples stored in the database. With this particular application, only 20 examples are stored, one for each possible Q-value.
6.5.2 The Blocks World
The rib algorithm was also tested in the same blocks world environment as the tg system (described in the previous chapter). To apply the rib algorithm in the blocks world, a distance has to be defined between different (state, action) pairs. The following experiments use the blocks world distance as defined in Section 6.3.

6.5.2.1 The Influence of Different Data Base Sizes
Since the behavior of the error-proximity score and the error-contribution score is very similar, as shown in the previous section, the rib system is only tested on the blocks world with the error-proximity score and the maximum variance selection strategy. First, the influence of the maximum size of the database is tested with the use of the error-proximity selection strategy. The database size is limited to a number of different values ranging from 25 to 200 examples. A limit of 25 examples will cause the rib system to keep all new examples that arrive and to select examples only by forgetting others. The low number of stored examples causes both the global and local variance to be computed using all the stored examples and will cause Equation 6.3 to accept all new examples. The high limit of 200 stored examples will cause rib to take longer to select the right examples to generalize nicely over the entire state-action space.
Figure 6.11: The performance of rib with different data base sizes on the Stacking task.
Figure 6.11 shows the results for the Stacking goal. Although 25 examples is clearly too few for rib to be able to build a well performing Q-function, with 50 examples it already builds a Q-function that translates into a well performing policy. With approximately 3 possible actions per state, there are over 1500 different (state, action) pairs. Using 200 as an upper limit for the database slows the performance improvement of rib, as a higher number of examples makes it harder to select the ones that help to generalize well.
Figure 6.12: The performance of rib with different data base sizes on the Unstacking task.

Figure 6.12 shows the results for the rib regression algorithm on the Unstacking task, again for different sizes of the stored example set. Here, rib is not able to build a well performing policy using only 50 examples. When rib is allowed to store 200 examples, there is a bump in the learning curve where at first rib learns a well performing policy, then seems to forget what it has learned for a while and then starts to recover and learn a better policy once again. Remember that the Unstacking task generates a lot of uninformative examples, making it hard for the regression engine to filter out the informative examples, as they can appear to be nothing but noise. This is exactly what is happening in this case. At first, rib is allowed to store all the examples that it receives. It will recognize that some examples are non-informative and not accept these, and will store the examples that yield large Q-values as interesting. Since the translation of the Q-function into a policy only looks at the maximum Q-values of a state, this will lead to a largely incorrect but relatively well performing Q-function. When the database gets filled up with examples, rib will try to remove the ones that seem noisy and mistakes the few high-yielding (state, action) pairs for inaccurate ones. When it removes these examples, the overall performance of the constructed
policy degrades. Only when the discovered rewards start to spread to the computed Q-values of other (state, action) pairs does rib recover and does the performance start to increase again.
Figure 6.13: The performance of rib with different data base sizes on the On(A,B) task.

The performance of the rib regression engine on the On(A,B) task is shown in Figure 6.13. In this task, the differences between the different database sizes are small. None of the sizes allows rib to find an optimal policy. The largest database size again causes the slowest learning rate.

6.5.2.2 Comparing rib and tg
Choosing a database size of 100 for the example selection using error proximity, and also using the selection of examples by maximum variance, the performance of rib was compared to the performance of tg as a regression engine for RRL in the blocks world. The M-value of Equation 6.6 was set to 0.1, the maximum difference between two different Q-values in these tests.

Figure 6.14 shows the results for the three regression algorithms for the Stacking goal. rib seems to have the upper hand when it comes to the learning rate per episode. This is partly due to the fact that tg needs to collect a larger number of (state, action, q-value) examples before it can make use of them by generating different nodes and leaves in the Q-tree. rib will use the predictive power of examples as soon as they become available. This said, it needs to be considered that rib needs a lot more processing power (and thus processing time) to make Q-value predictions. This causes RRL-rib to be a lot slower than RRL-tg when it comes to computation times. However, since interaction with the environment usually yields the largest learning cost, the learning rate per episode is the fairest comparison between the different systems.
Figure 6.14: Performance comparison between the tg and rib algorithms for the Stacking task.

Between the two rib systems, rib-mv seems to learn a little quicker than rib-ep, but rib-ep has the advantage of yielding a better policy. This difference is more pronounced for the Unstacking task. The results for the Unstacking task are shown in Figure 6.15.
Figure 6.15: Performance comparison between the tg and rib algorithms for the Unstacking task.

For the Unstacking task, rib-mv learns its policy very quickly, but then gets stuck at the reached performance level. The performance graph shows this to be at ±90%, but actual levels in the different experiments ranged from the optimal policy to ±73%. Each time, rib-mv reached its eventual level of
performance quickly, but was unable to escape from that local maximum. rib-ep learns more slowly than rib-mv, but was able to learn the optimal policy in each of the 10 test runs.

Table 6.1: The average number of examples stored in the rib database using Maximum Variance example selection at the end of the learning experiment.

    Task          Average Final Database Size
    Stacking      35
    Unstacking    10
    On(A,B)       103
Table 6.1 shows the average number of examples that rib-mv had in its database at the end of the learning task. For the Stacking and Unstacking tasks, these numbers are a lot lower than the 100 examples used by rib-ep in its experiments. This lower number also makes rib-mv a lot more efficient computationally. Considering each test run separately, there was no correlation between the number of stored examples and the performance level reached by the rib-mv system.
Figure 6.16: Performance comparison between the tg and rib algorithms for the On(A,B) task.

Figure 6.16 shows the learning curves for the On(A,B) task. This is the first task in which rib-mv outperforms the rib-ep policy. The number of examples remembered by rib-mv is also a little higher than the 100 allowed examples for the rib-ep approach. tg is again a little slower to reach high performance levels than both rib versions.
Table 6.2: The execution times for RRL-tg, rib-mv and rib-ep on a Pentium III 1.1 GHz machine, in seconds. The first number in each cell indicates the learning time, the second indicates the testing time.

    Task          RRL-tg     rib-mv      rib-ep
    Stacking      14 – 1     27 – 27     920 – 90
    Unstacking    15 – 2     7 – 10      1500 – 200
    On(A,B)       16 – 1     44 – 100    1800 – 200
When applicable (see Section 6.4.3), the large reduction of the number of stored examples, together with the automatic adaptation to the difficulty of the learning task, makes rib-mv the preferred technique for instance based regression. rib-mv largely outperforms rib-ep with respect to computation times. Table 6.2 shows the execution times of the two rib systems next to the RRL-tg system. The first number in each cell is the time spent by the RRL system processing the 1000 learning episodes, the second number is the time spent on the 2000 testing trials. This second number gives an indication of the prediction efficiency of the system. To be fair, it must be stated that the rib-mv implementation was optimized with respect to computation speed, while rib-ep was not. However, this optimization cannot be held responsible for the large differences observed.

Compared to the RRL-tg system, the instance based regression technique of RRL-rib results in a smoother learning progression. RRL-tg relies on finding the correct split for each node in the regression tree. When such a split is found, this results in a large improvement in the learned policy. Instance-based RRL does not rely on such key decisions and can therefore be expected to be more robust than the tree based RRL-tg system. As can be seen in Table 6.2, the policy learned by tg can be evaluated much more efficiently.
6.6 Possible Extensions
The major drawback of the rib regression system is its computational complexity and the slow performance that results from this. Often, a strict constraint on the number of examples stored in the database is needed to reduce computation times. There are multiple ways of dealing with this problem.

The simplest solution, i.e., with the least intrusive change to the rib system itself, would be to use incrementally computable distances. An incrementally computable distance would first compute a rough estimate of the distance between different (state, action) pairs, which can later be refined if necessary. When the use of
approximate distances is limited to distant (state, action) pairs, the influence of the approximations would be limited or non-existent, but computation times might be reduced significantly. A more complex, but also more promising change to the rib system, would be the combination of a model building regression technique (such as the tg system) and instance based learning. Such an approach would use the model building regression algorithm to make a coarse division of the state-action space and use instance based learning locally to improve the predictive accuracy. From the rib viewpoint, this would remove the need to compute the distance from the new (state, action) pair to every stored (state, action) pair. The instance based part of the regression system would only need to consider the stored examples that are classified to the same group as the new example. From the model building algorithm’s viewpoint, this would allow for better local function fitting.
6.7 Conclusions
This chapter introduced relational instance based regression, a new regression technique that can be used when instances cannot be represented as vectors. Several database management approaches were developed that limit the memory requirements and computation times by limiting the number of examples that need to be stored in the database. The behavior of these different approaches was shown and discussed using a simple example application and was compared with the regression tree based RRL-tg algorithm. Empirical results show that instance-based RRL outperforms RRL-tg with respect to the learning speed per training episode. RRL-tg, on the other hand, has the advantage of producing a Q-function in a comprehensible format.
Chapter 7

Gaussian Processes and Graph Kernels

"Unfortunately, no one can be told what the Matrix is. You have to see it for yourself."
— The Matrix
7.1 Introduction
Kernel methods are among the most successful recent developments within the field of machine learning. Used together with the recently developed kernels for structured data, they yield powerful classification and regression methods that can be used for relational applications. This chapter introduces an incremental regression algorithm based on Gaussian processes and graph kernels. This algorithm is integrated into the RRL system to create RRL-kbr, and several system parameters are discussed and their influence is evaluated empirically. Afterwards, it is compared with the previous regression techniques.

After a brief introduction of kernel methods and kernel functions, this chapter introduces Gaussian processes as an incrementally learnable regression technique that uses kernels as the covariance function between learning examples. Section 7.4 introduces graph kernels that can be used as kernels between (state, action) pairs. Next, Section 7.5 illustrates how to apply a graph kernel in the blocks world environment. In Section 7.6 the behavior of Gaussian processes as a regression system for RRL is empirically evaluated and compared to the RRL-tg and RRL-rib systems. Section 7.7 discusses a few improvements that can still be made to the RRL-kbr system.

Related work on reinforcement learning with kernel methods is very limited so far. In the work by Ormoneit and Sen (2002) the term 'kernel' is not used
to refer to a positive definite function but to a probability density function. Dietterich and Wang (2002) and Rasmussen and Kuss (2004) do not use Q-learning as in the RRL system but model the reinforcement learning task in a closed form. Dietterich and Wang use support vector machines, while Rasmussen and Kuss use Gaussian processes.

The kernel based regression system was designed and implemented in collaboration with Thomas Gärtner and Jan Ramon and was first introduced in (Gärtner et al., 2003a).
7.2 Kernel Methods
Kernel methods work by embedding the data into a vector space and then looking for (often linear) relations between the data in that space. If the mapping to the vector space is well chosen, complex relations can be simplified and more easily discovered. These relations can then be used for classification, regression, etc. Based on the fact that all generalization requires some form of similarity measure, all kernel methods are in principle composed of 2 parts:

1. A general purpose machine learning algorithm.

2. A problem specific kernel function.

The kernel function is employed to avoid the need for an explicit mapping to the (often high dimensional) vector space. Technically, a kernel k computes an inner product in some feature space which is, in general, different from the representation space of the instances. The computational attractiveness of kernel methods comes from the fact that quite often a closed form of these 'feature space inner products' exists. Instead of performing the expensive transformation step φ explicitly, a kernel k(x, x') = ⟨φ(x), φ(x')⟩ computes the inner product directly and performs the feature transformation only implicitly.

Whether, for a given function k : X × X → R, a feature transformation φ : X → H into some Hilbert space H exists such that k(x, x') = ⟨φ(x), φ(x')⟩ for all x, x' ∈ X can be checked by verifying that the function is positive definite (Aronszajn, 1950). This means that any set, whether a linear space or not, that admits a positive definite kernel can be embedded into a linear space. Thus, throughout this text, 'valid' means 'positive definite'. Here then is the definition of a positive definite kernel. (N is the set of positive integers.)

Definition 7.1 Let X be a set. A symmetric function k : X × X → R is a positive definite kernel on X if, ∀n ∈ N, x_1, ..., x_n ∈ X, and c_1, ..., c_n ∈ R,

    \sum_{i,j \in \{1,\dots,n\}} c_i c_j k(x_i, x_j) \geq 0
While it is not always easy to prove positive definiteness for a given kernel, positive definite kernels do have some nice closure properties. In particular, they are closed under sum, direct sum, multiplication by a scalar, product, tensor product, zero extension, point-wise limits, and exponentiation (Cristianini and Shawe-Taylor, 2000; Haussler, 1999).
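Positive definiteness of a candidate kernel can at least be checked empirically on a finite sample of instances by verifying that the Gram matrix has no negative eigenvalues. The sketch below does this for a Gaussian RBF kernel on random vectors; the kernel and sample are only illustrative, and passing the test on one sample is a necessary, not a sufficient, condition.

```python
import numpy as np

def gram_matrix(kernel, xs):
    # Gram matrix K with K[i, j] = k(x_i, x_j).
    n = len(xs)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(xs[i], xs[j])
    return K

def looks_positive_definite(kernel, xs, tol=1e-8):
    # Definition 7.1 requires every such Gram matrix to be positive
    # semi-definite; here only one finite sample is tested.
    eigenvalues = np.linalg.eigvalsh(gram_matrix(kernel, xs))
    return bool(eigenvalues.min() >= -tol)

rbf = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))
sample = [np.random.randn(3) for _ in range(20)]
print(looks_positive_definite(rbf, sample))  # -> True
```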
Kernels for Structured Data

The best known kernel for representation spaces that are not mere attribute-value tuples is the convolution kernel proposed by Haussler (1999). The basic idea of convolution kernels is that the semantics of composite objects can often be captured by a relation R between the object and its parts. The kernel on the composed object is then a combination of kernels defined on its different parts. Let x, x' ∈ X be the objects and (x_1, x_2, ..., x_D), (x'_1, x'_2, ..., x'_D) ∈ X_1 × ... × X_D be tuples of parts of these objects. Given the relation R : (X_1 × ... × X_D) × X, a decomposition R^{-1} can be defined as R^{-1}(x) = {(x_1, x_2, ..., x_D) : R((x_1, x_2, ..., x_D), x)}. With positive definite kernels k_d : X_d × X_d → R for all d ∈ {1, ..., D}, the convolution kernel is defined as

    k_{conv}(x, x') = \sum_{\substack{(x_1, \dots, x_D) \in R^{-1}(x) \\ (x'_1, \dots, x'_D) \in R^{-1}(x')}} \prod_{d=1}^{D} k_d(x_d, x'_d)
The term convolution kernel refers to a class of kernels that can be formulated in the above way. The advantage of convolution kernels is that they are very general and can be applied in many different problems. However, because of that generality, they require a significant amount of work to adapt them to a specific problem, which makes choosing the composition relation R in ‘real-world’ applications a non-trivial task. There are other kernel definitions for structured data in the literature. However, they usually focus on a very restricted syntax and are more or less domain specific. Examples are string and tree kernels. Traditionally, string kernels (Lodhi et al., 2002) have focused on applications in text mining and measure similarity of two strings by the number of common (not necessarily contiguous) substrings. These string kernels have not been applied in other domains. However, other string kernels have been defined for other domains, e.g., recognition of translation initiation sites in DNA and mRNA sequences (Zien et al., 2000). Again, these kernels have not been applied in other domains. Tree kernels (Collins and Duffy, 2002) can be applied to “ranked” trees , i.e., trees where the number of children of a node is determined by the label of the node. They compute the similarity of trees based on their common subtrees. Tree
kernels have been applied in natural language processing tasks. A kernel for instances represented by terms in a higher-order logic is presented by Gärtner et al. (2003c). For an extensive overview of these and other kernels on structured data, the reader is referred to the overview paper by Gärtner (2003). None of these kernels, however, can be easily applied to the kind of state-action representations encountered in relational reinforcement learning problems. Kernels that can be applied there have independently been introduced by Gärtner (2002) and Kashima and Inokuchi (2002) and will be presented in Section 7.4.
7.3 Gaussian Processes for Regression
The exact mathematics in the description of the Gaussian processes here are kept to a minimum. Readers interested in a more rigorous explanation of Gaussian processes and their properties may want to consult MacKay (1997).

Parametric regression techniques use a parameterized function as a hypothesis. The learning algorithms use the observed data to tune the parameter vector w. In some cases, a single function is chosen and used for predictions of unseen examples. In other cases, a combination of functions is used. Examples of parametric regression techniques are neural networks and radial basis functions. Bayesian regression techniques assume a prior distribution over the parameter vector and calculate a posterior distribution over parameter vectors using Bayes' rule and the available learning data. Predictions for new, unseen data can be made by marginalizing over the parameters. Gaussian processes implement a non-parametric Bayesian technique. Instead of assuming a prior over the parameter vectors, a prior is assumed over the target function itself.

Assume that a set of data points {[x_i|t_i]}_{i=1}^{N} is observed, with x_i the description of the example and t_i the target value. The regression task in a Bayesian approach is to find the predictive distribution of the value t_{N+1} given the example description x_{N+1}, i.e.,

    P(t_{N+1} | [x_1 \cdots x_N], [t_1 \cdots t_N], x_{N+1})

To model this task as a Gaussian process it is assumed (Gibbs, 1997) that the target values t_N = [t_1 ··· t_N] have a joint distribution

    P(t_N | [x_1 \cdots x_N], C_N) = \frac{1}{Z} \exp\left(-\frac{1}{2}(t_N - \mu)^T C_N^{-1} (t_N - \mu)\right)    (7.1)

where µ is the mean vector of the target values, C is a covariance matrix (C_ij = C(x_i, x_j), 1 ≤ i, j ≤ N) and Z is an appropriate normalization constant.
    C_{N+1} = \begin{pmatrix} C_N & k_{N+1} \\ k_{N+1}^T & \kappa \end{pmatrix}
Figure 7.1: The relationship between the covariance matrices CN and CN +1 .
The choice of covariance functions is restricted to positive definite kernel functions (see Section 7.2). For simplicity reasons¹, assume the mean vector µ = 0. Because P(A ∧ B|C) = P(A|B, C) · P(B|C), the predictive distribution of t_{N+1} can be written as the conditional distribution

    P(t_{N+1} | [x_1 \cdots x_N], t_N, x_{N+1}) = \frac{P(t_{N+1} | [x_1 \cdots x_N], x_{N+1})}{P(t_N | [x_1 \cdots x_N])}

and, using Equation 7.1, as the following Gaussian distribution:

    P(t_{N+1} | [x_1 \cdots x_N], t_N, x_{N+1}, C_{N+1}) = \frac{Z_N}{Z_{N+1}} \exp\left(-\frac{1}{2}\left(t_{N+1}^T C_{N+1}^{-1} t_{N+1} - t_N^T C_N^{-1} t_N\right)\right)    (7.2)

with Z_N and Z_{N+1} appropriate normalizing constants and C_N and C_{N+1} as in Figure 7.1. The vector k_{N+1} and scalar κ are defined as

    k_{N+1} = [C(x_1, x_{N+1}) \cdots C(x_N, x_{N+1})]
    \kappa = C(x_{N+1}, x_{N+1})

By grouping the terms that depend on t_{N+1} (Gibbs, 1997), Equation 7.2 can be rewritten as

    P(t_{N+1} | [x_1 \cdots x_N], t_N, x_{N+1}, C_{N+1}) = \frac{1}{Z} \exp\left(-\frac{(t_{N+1} - \hat{t}_{N+1})^2}{2\sigma_{\hat{t}_{N+1}}^2}\right)    (7.3)

¹ Although this may seem like a leap of faith, assuming 0 as an a priori Q-value is standard practice in Q-learning. This assumption was also used in the case of the tg and rib algorithms, albeit less explicitly.
with

    \hat{t}_{N+1} = k_{N+1}^T C_N^{-1} t_N    (7.4)

    \sigma_{\hat{t}_{N+1}}^2 = \kappa - k_{N+1}^T C_N^{-1} k_{N+1}    (7.5)
and k_{N+1} and κ as previously defined. This expression is maximal at t̂_{N+1}, and therefore the value t̂_{N+1} is the one that will be predicted by the regression algorithm. σ_{t̂_{N+1}} gives the standard deviation on the predicted value. Note that, to make predictions, C_N^{-1} is used, so there is no need to invert the new matrix C_{N+1} for each prediction.

Although using Gaussian processes for Q-function regression might seem like overkill, this technique has some properties that make it very well suited for reinforcement learning. First, the probability distribution over the target values can be used to guide the exploration of the state space during the learning process, comparable to interval based exploration (Kaelbling et al., 1996). Secondly, the inverse of the covariance matrix can be computed incrementally, using the partitioned inverse equations of Barnett (1979):

    C_{N+1}^{-1} = \begin{pmatrix} M & m \\ m^T & \mu \end{pmatrix}

with

    M = C_N^{-1} + \mu \left(C_N^{-1} k_{N+1}\right)\left(C_N^{-1} k_{N+1}\right)^T
    m = -\mu \, C_N^{-1} k_{N+1}
    \mu = \left(\kappa - k_{N+1}^T C_N^{-1} k_{N+1}\right)^{-1}
While matrix inversion is of cubic complexity, computing the inverse of a matrix incrementally after expansion is only of quadratic time complexity. As stated before, no additional inversions need to be performed to make multiple predictions.
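A sketch of these two operations — prediction with Equations 7.4 and 7.5, and the incremental update of the inverse covariance matrix via the partitioned inverse — assuming the covariance (kernel) values are supplied by the caller:

```python
import numpy as np

def gp_predict(C_inv, t, k, kappa):
    """Mean and variance of the predictive distribution (Eqs. 7.4 / 7.5).

    C_inv : inverse covariance matrix of the N stored examples
    t     : vector of their target (Q-) values
    k     : covariances between the new example and the stored examples
    kappa : covariance of the new example with itself
    """
    mean = k @ C_inv @ t
    var = kappa - k @ C_inv @ k
    return mean, var

def grow_inverse(C_inv, k, kappa):
    """Partitioned-inverse update: the inverse of the (N+1)x(N+1)
    covariance matrix from the NxN inverse, in O(N^2) per update."""
    v = C_inv @ k
    mu = 1.0 / (kappa - k @ v)
    m = -mu * v
    M = C_inv + mu * np.outer(v, v)
    top = np.hstack([M, m[:, None]])
    bottom = np.hstack([m[None, :], np.array([[mu]])])
    return np.vstack([top, bottom])
```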
7.4 Graph Kernels
Graph kernels are an important means to extend the applicability of kernel methods to structured data. To be able to use Gaussian processes as a regression technique in the RRL system, a covariance function needs to be defined between different (state, action) pairs. This covariance function can be provided by the use of a graph kernel. This section gives a brief overview of graphs and graph kernels. For a more in-depth discussion of graphs the reader is referred to the work of Diestel (2000) and Korte and Vygen (2002). For a discussion of different graph kernels see (Gärtner et al., 2003b).
7.4.1 Labeled Directed Graphs
Before the graph kernel can be introduced, there are some concepts that need to be defined.

Definition 7.2 (Graph) A graph G is described by a finite set of vertices V, a finite set of edges E and a function Ψ that denotes which vertices belong to which edge.

Definition 7.3 (Labeled Graph) For labeled graphs there is additionally a set of labels L along with a function label assigning a label to each edge and vertex.

Definition 7.4 (Directed Graph) For directed graphs the function Ψ : E → {V × V} maps each edge to the tuple consisting of its initial and terminal node.

Edges e in a directed graph for which Ψ(e) = (v, v) are called loops. Two edges e, e' are parallel if Ψ(e) = Ψ(e'). Frequently, only graphs without parallel edges are considered. For application within the RRL setting, however, it is important to also consider graphs with parallel edges.

Sometimes some enumeration of the vertices and labels in a graph is assumed, i.e., V = {ν_i}_{i=1}^{n} where n = |V| and L = {ℓ_r}_{r∈N}.² To refer to the vertex and edge set of a specific graph, the notation V(G) and E(G) can be used. Wherever two graphs are distinguished by their subscript (G_i), the same notation will be used to distinguish their vertex and edge sets.

Figure 7.2 shows examples of a graph, a labeled graph and a directed graph. For all these graphs, V = {ν_1, ν_2, ν_3, ν_4, ν_5} and

    E = {(ν_1, ν_1), (ν_1, ν_3), (ν_2, ν_1), (ν_2, ν_5), (ν_3, ν_2), (ν_3, ν_4), (ν_4, ν_2), (ν_4, ν_3), (ν_4, ν_5)}

Some special graphs, relevant for the description of graph kernels, are walks, paths, and cycles.

Definition 7.5 (Walk) A walk w (sometimes called an 'edge progression') is a sequence of vertices v_i ∈ V and edges e_i ∈ E with w = v_1, e_1, v_2, e_2, ..., e_n, v_{n+1} and Ψ(e_i) = (v_i, v_{i+1}). The length of the walk is equal to the number of edges in this sequence, i.e., n in the above case.

² While ℓ_1 will be used to always denote the same label, l_1 is a variable that can take different values, e.g., ℓ_1, ℓ_2, .... The same holds for vertex ν_1 and variable v_1.
Figure 7.2: From left to right, a graph, a labeled graph and a directed graph, all with the same node and edge set.
Definition 7.6 (Path) A path is a walk in which v_i ≠ v_j ⇔ i ≠ j and e_i ≠ e_j ⇔ i ≠ j.

Definition 7.7 (Cycle) A cycle is a path followed by an edge e_{n+1} where Ψ(e_{n+1}) = (v_{n+1}, v_1).
Figure 7.3 gives an illustration of a walk, a path and a cycle in the directed graph of Figure 7.2. Note that when a cycle exists in a graph, there are infinitely many walks in that graph and walks of infinite length.
Figure 7.3: From left to right, a walk, a path and a cycle of a graph.
7.4.2 Graph Degree and Adjacency Matrix
Some functions describing the neighborhood of a vertex v in a graph G also need to be defined.

Definition 7.8 (Outdegree) δ⁺(v) = {e ∈ E | Ψ(e) = (v, u)} is the set of edges that start from the vertex v. The outdegree of a vertex v is defined as |δ⁺(v)|. The maximal outdegree of a graph G is denoted by Δ⁺(G) = max{|δ⁺(v)|, v ∈ V}.

Definition 7.9 (Indegree) δ⁻(v) = {e ∈ E | Ψ(e) = (u, v)} is the set of edges that arrive at the vertex v. The indegree of a vertex v is defined as |δ⁻(v)|. The maximal indegree of a graph G is denoted by Δ⁻(G) = max{|δ⁻(v)|, v ∈ V}.

For example, the outdegree of vertex ν_3 in the directed graph of Figure 7.3 is |δ⁺(ν_3)| = 2, while the maximal outdegree of the graph is Δ⁺(G) = 3. The indegree of the same node is |δ⁻(ν_3)| = 2, which is also the graph's maximal indegree.

For a compact representation of the graph kernel, the adjacency matrix E of a graph will be used.

Definition 7.10 (Adjacency Matrix) The adjacency matrix E of a graph G is a square matrix where component [E]_{ij} of the matrix corresponds to the number of edges between vertex ν_i and ν_j.

The adjacency matrix of the directed graph in Figure 7.3 is the following:

    E = \begin{pmatrix}
    1 & 0 & 1 & 0 & 0 \\
    1 & 0 & 0 & 0 & 1 \\
    0 & 1 & 0 & 1 & 0 \\
    0 & 1 & 1 & 0 & 1 \\
    0 & 0 & 0 & 0 & 0
    \end{pmatrix}

Parallel edges in the graph would give rise to components with values greater than 1. Replacing the adjacency matrix E by its n-th power (n ∈ N, n ≥ 0), the interpretation is quite similar: each component [E^n]_{ij} of this matrix gives the number of walks of length n from vertex ν_i to ν_j. It is clear that the maximal indegree equals the maximal column sum of the adjacency matrix and that the maximal outdegree equals the maximal row sum of the adjacency matrix. For a ≥ Δ⁺(G)Δ⁻(G), a^n is an upper bound on each component of the matrix E^n. This is useful to determine the convergence properties of some graph kernels.
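The walk-counting interpretation of matrix powers is easy to verify numerically for this example graph; a short sketch:

```python
import numpy as np

# Adjacency matrix of the directed graph discussed above.
E = np.array([[1, 0, 1, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 1, 1, 0, 1],
              [0, 0, 0, 0, 0]])

# [E^n]_{ij} counts the walks of length n from vertex nu_i to vertex nu_j.
E2 = np.linalg.matrix_power(E, 2)
print(E2[0, 1])              # walks of length 2 from nu_1 to nu_2: 1 (via nu_3)
print(E.sum(axis=1).max())   # maximal outdegree Delta+(G) = 3
print(E.sum(axis=0).max())   # maximal indegree  Delta-(G) = 2
```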
7.4.3 Product Graph Kernels
This section briefly reviews one of the graph kernels defined in (Gärtner et al., 2003b). Technically, this kernel is based on the idea of counting the number of walks in a product graph. Note that the definitions given here are more complicated than those given in (Gärtner et al., 2003b), as parallel edges have to be considered here.

Product graphs (Imrich and Klavžar, 2000) are a very interesting tool in discrete mathematics. The four most important graph products are the Cartesian, the strong, the direct, and the lexicographic product. While the most fundamental one is the Cartesian product, in this context the direct graph product is the most important one. However, the definition needs to be extended to labelled directed graphs. For that, consider a function match(l_1, l_2) that is 'true' if the labels l_1 and l_2 'match'. In the simplest case match(l_1, l_2) ⇔ l_1 = l_2. Using this function, the direct product of two graphs is defined as follows:

Definition 7.11 (Direct Product Graph) The direct product of the two graphs G_1 = (V_1, E_1, Ψ_1) and G_2 = (V_2, E_2, Ψ_2) is denoted by G_1 × G_2. The vertex set of the direct product is defined as

    V(G_1 × G_2) = {(v_1, v_2) ∈ V_1 × V_2 : match(label(v_1), label(v_2))}

The edge set is then defined as

    E(G_1 × G_2) = {(e_1, e_2) ∈ E_1 × E_2 : ∃ (u_1, u_2), (v_1, v_2) ∈ V(G_1 × G_2) ∧ Ψ_1(e_1) = (u_1, v_1) ∧ Ψ_2(e_2) = (u_2, v_2) ∧ match(label(e_1), label(e_2))}

Given an edge (e_1, e_2) ∈ E(G_1 × G_2) with Ψ_1(e_1) = (u_1, v_1) and Ψ_2(e_2) = (u_2, v_2), the value of Ψ_{G_1×G_2} is

    Ψ_{G_1×G_2}((e_1, e_2)) = ((u_1, u_2), (v_1, v_2))

The graphs G_1, G_2 are called the factors of graph G_1 × G_2. The labels of the vertices and edges in graph G_1 × G_2 correspond to the labels in the factors. Figure 7.4 shows two directed labelled graphs and their direct product. The edges are presumed to be unlabelled. Intuitively, higher levels of similarity between two graphs lead to a higher number of nodes and edges in their product graph.

Having introduced product graphs, the product graph kernel can finally be defined.
Figure 7.4: Two labelled directed graphs and their direct product graph at the bottom right. (For simplicity reasons, no labels are used on the edges in this example.)
Definition 7.12 (Product Graph Kernel) Let G_1, G_2 be two graphs, let E_× denote the adjacency matrix of their direct product E_× = E(G_1 × G_2), and let V_× denote the vertex set of the direct product V_× = V(G_1 × G_2). With a sequence of weights λ = λ_0, λ_1, ... (λ_i ∈ R; λ_i ≥ 0 for all i ∈ N) the product graph kernel is defined as

    k_\times(G_1, G_2) = \sum_{i,j=1}^{|V_\times|} \left[ \sum_{n=0}^{\infty} \lambda_n E_\times^n \right]_{ij}    (7.6)

if the limit exists.

For the proof that this kernel is positive definite, see (Gärtner et al., 2003b)³. There it is shown that this product graph kernel corresponds to the inner product in a feature space made up by all possible contiguous label sequences in the graph. Each feature value corresponds to the number of walks with such a label sequence, weighted by √λ_n, where n is the length of the sequence.

³ The extension to parallel edges is straightforward.
7.4.4 Computing Graph Kernels
To compute this graph kernel, it is necessary to compute the limit of the above matrix power series. Two possibilities immediately present themselves: the exponential weight setting (λ_i = β^i / i!), for which the limit of the above matrix power series always exists, and the geometric weight setting (λ_i = γ^i), for which the limit exists if γ < 1/a, where a = Δ⁺(G)Δ⁻(G) as above.

Exponential Series  Similar to the exponential of a scalar value (e^b = 1 + b/1! + b²/2! + b³/3! + ...), the exponential of the square matrix E is defined as

    e^{\beta E} = \lim_{n \to \infty} \sum_{i=0}^{n} \frac{(\beta E)^i}{i!}    (7.7)

where β⁰/0! = 1 and E⁰ = I. Feasible exponentiation of matrices in general requires diagonalizing the matrix. If the matrix E can be diagonalized such that E = T⁻¹DT, arbitrary powers of the matrix can be easily calculated as E^n = (T⁻¹DT)^n = T⁻¹D^nT, and for a diagonal matrix the power can be calculated component-wise: [D^n]_{ii} = [D_{ii}]^n. Thus e^{βE} = T⁻¹e^{βD}T, where e^{βD} can be calculated component-wise. Once the matrix is diagonalized, computing the exponential matrix can be done in linear time. Matrix diagonalization is a matrix eigenvalue problem and such methods have roughly cubic time complexity.

Geometric Series  The geometric series Σ_i γ^i is known to converge if and only if |γ| < 1. In that case the limit is given by

    \lim_{n \to \infty} \sum_{i=0}^{n} \gamma^i = \frac{1}{1-\gamma}

Similarly, the geometric series of a matrix is defined as

    \lim_{n \to \infty} \sum_{i=0}^{n} \gamma^i E^i    (7.8)

if γ < 1/a, where a = Δ⁺(G)Δ⁻(G). Feasible computation of the limit of a geometric series is then possible by inverting the matrix I − γE. To see this, suppose (I − γE)x = 0 and thus γEx = x and (γE)^i x = x. Now note that, given the limitations on γ, (γE)^i → 0 as i → ∞. Therefore x = 0, and I − γE is regular and can be inverted. Then (I − γE)(I + γE + γ²E² + ···) = I, and (I − γE)⁻¹ = (I + γE + γ²E² + ···) follows. Matrix inversion is roughly of cubic time complexity.

Weight Influences  When comparing Equation 7.6 with Equations 7.7 and 7.8, it can be seen that the role of the weight λ_i is taken by β^i / i! in the exponential series and by γ^i in the geometric series.
Figure 7.5: The weight distribution of the geometric series for different values of the parameter γ after normalization
Each of these two forms has a parameter (β and γ, respectively) that can be used to tune the weights of walks of a given length. However, the shape of these weights is different for each form. For the geometric series, Figure 7.5 shows the resulting weights for a number of different values of the parameter γ. Although the weights do change for varying parameters, the relative importance does not change: the highest weights are always given to the shortest walks. For the exponential series, however, different values of the parameter β shift the highest importance to walks of different lengths, as shown in Figure 7.6. This allows for better fine-tuning of the kernel towards different applications. Therefore, the exponential series will be used to compute kernels for the RRL system. In relatively sparse graphs, it is often more practical to actually count the number of walks rather than using the closed forms presented.
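A compact sketch of the kernel computation for vertex-labelled graphs: build the adjacency matrix of the direct product from the two factor graphs and sum the entries of the exponential (or geometric) series. The graph representation (an adjacency matrix plus a list of vertex labels), the parameter values and the neglect of edge labels are illustrative simplifications, not the exact RRL-kbr implementation.

```python
import numpy as np
from scipy.linalg import expm

def product_adjacency(E1, labels1, E2, labels2):
    # Vertices of the direct product: pairs of vertices with matching labels.
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    n = len(pairs)
    Ex = np.zeros((n, n))
    for a, (i1, j1) in enumerate(pairs):
        for b, (i2, j2) in enumerate(pairs):
            # The product has an edge iff both factors have one;
            # multiplying counts also handles parallel edges.
            Ex[a, b] = E1[i1, i2] * E2[j1, j2]
    return Ex

def product_graph_kernel(E1, labels1, E2, labels2, beta=10.0):
    # Exponential weights: sum_n (beta^n / n!) E_x^n = expm(beta * E_x);
    # the kernel value is the sum of all entries of this matrix (Eq. 7.6).
    Ex = product_adjacency(E1, labels1, E2, labels2)
    if Ex.size == 0:
        return 0.0
    return float(expm(beta * Ex).sum())

def geometric_graph_kernel(E1, labels1, E2, labels2, gamma=0.1):
    # Geometric weights: sum_n gamma^n E_x^n = (I - gamma E_x)^{-1},
    # valid only if gamma is small enough for the series to converge.
    Ex = product_adjacency(E1, labels1, E2, labels2)
    if Ex.size == 0:
        return 0.0
    return float(np.linalg.inv(np.eye(len(Ex)) - gamma * Ex).sum())
```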
7.4.5 Radial Basis Functions
In finite state-action spaces Q-learning is guaranteed to converge if the mapping between (state, action) pairs and Q-values is represented explicitly. One advantage of Gaussian processes is that for particular choices of the covariance function, the representation is explicit. To see this, consider the matching kernel k_δ : X × X → R, defined as

    k_\delta(x, x') = \begin{cases} 1 & \text{if } x = x' \\ 0 & \text{otherwise} \end{cases}

as the covariance function between examples.
Figure 7.6: The weight distribution of the exponential series for different values of the parameter β after normalization
Let the predicted Q-value be the mean of the distribution over target values, i.e., t̂_{N+1} = k_{N+1}^T C_N^{-1} t_N, where the variables are used as defined in Section 7.3. Assume the training examples are distinct and the test example is equal to the j-th training example. It then turns out that C_N = I = C_N^{-1}, where I denotes the identity matrix. As furthermore k_{N+1} is then the vector with all components equal to 0 except the j-th, which is equal to 1, it is obvious that t̂_{N+1} = t_j and the representation is thus explicit.
A frequently used kernel function for instances that can be represented by vectors is the Gaussian radial basis function (RBF) kernel. Given the bandwidth parameter ρ, the RBF kernel is defined as

    k_{rbf}(x, x') = \exp(-\rho \|x - x'\|^2)

For large enough ρ the RBF kernel behaves like the matching kernel. In other words, the parameter ρ can be used to regulate the amount of generalization performed in the Gaussian process algorithm: for very large ρ all instances are very different and the Q-function is represented explicitly; for small enough ρ all examples are considered very similar and the resulting function is very smooth.
7.5 Blocks World Kernels
This section first shows how the states and actions in the blocks world can be represented as a graph. Then it discusses the kernel that is used as the covariance function between blocks world (state, action) pairs.
Figure 7.7: The graph representation of the blocks world state.
7.5.1 State and Action Representation
To be able to apply the graph kernel to the blocks world, the (state, action) pairs of the blocks world need to be represented as a graph. Figure 7.7 shows a blocks world state and the graph that will be used to represent this state. To also represent the action of the (state, action) pair to which a Q-value belongs, an edge with the label 'action' is added between the two blocks that are manipulated, as well as the extra labels 'a_1' and 'a_2', which identify the moving block and the target block. For the On(A,B) goal, the graph representation needs to be extended with an indication of the two blocks that need to be stacked. This is represented both by adding extra labels 'g_1' and 'g_2' to the blocks and by adding an extra edge labeled 'goal' between the two blocks that need to be stacked. This addition of edges and labels allows for the representation of an arbitrary goal. Figure 7.8 shows an example of a full (state, action) pair with
Figure 7.8: The graph representation of the blocks world (state, action) pair with On(3,2) as the goal.
the representation of the on(3, 2) goal included. A more complete description of the blocks world to graph translation is included in Appendix A.
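As an illustration, the following sketch builds such a graph representation — an adjacency matrix plus one label set per vertex, matching the input expected by the product-kernel sketch above — from a blocks world state given as a list of stacks, together with an action and an optional on(A, B) goal. The edge directions, the single 'clear' vertex and the use of exact label-set equality for matching are an illustrative reading of Figures 7.7 and 7.8, not the exact translation described in Appendix A.

```python
import numpy as np

def blocks_world_graph(stacks, action=None, goal=None):
    """Graph encoding of a blocks world (state, action) pair.

    stacks : list of stacks, each a list of block names from bottom to top
    action : (moved_block, target_block) or None
    goal   : (upper_block, lower_block) for an on(A, B) goal, or None

    Returns (adjacency matrix, vertex label sets); vertex 0 is the floor,
    the last vertex is the 'clear' node, the others are the blocks.
    """
    blocks = [b for stack in stacks for b in stack]
    index = {b: i + 1 for i, b in enumerate(blocks)}   # 0 is the floor
    clear = len(blocks) + 1                            # extra 'clear' vertex
    n = len(blocks) + 2
    E = np.zeros((n, n))
    labels = [set() for _ in range(n)]
    labels[0].add("floor")
    labels[clear].add("clear")
    for b in blocks:
        labels[index[b]].add("block")
    for stack in stacks:
        E[index[stack[0]], 0] = 1                      # bottom block 'on' floor
        for lower, upper in zip(stack, stack[1:]):
            E[index[upper], index[lower]] = 1          # 'on' edges, top to bottom
        E[clear, index[stack[-1]]] = 1                 # 'clear' above each stack
    if action is not None:
        moved, target = action
        labels[index[moved]].add("a1")
        labels[index[target]].add("a2")
        E[index[moved], index[target]] = 1             # extra 'action' edge
    if goal is not None:
        upper, lower = goal
        labels[index[upper]].add("g1")
        labels[index[lower]].add("g2")
        E[index[upper], index[lower]] = 1              # extra 'goal' edge
    return E, [frozenset(l) for l in labels]

# Example: stacks [4, 2] and [3, 1] (bottom to top), action move(1, 2),
# goal on(3, 2).
E, labels = blocks_world_graph([[4, 2], [3, 1]], action=(1, 2), goal=(3, 2))
```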
7.5.2 A Blocks World Kernel
In order to have a means to regulate the amount of generalization in the blocks world setting, the graph kernel is not used directly, but 'wrapped' in a Gaussian RBF function. Of the two settings used to compute the graph kernel, the exponential setting allows tuning the graph kernel to consider different lengths of walks as most important. Thus, let k be the graph kernel with exponential weights; then the kernel used in the blocks world is given by

    k^*(x, x') = \exp\left[-\rho\left(k(x, x) - 2k(x, x') + k(x', x')\right)\right]

This choice introduces two parameters into the regression system: β, which allows the user to shift the focus to different lengths of walks in the direct product graph, and ρ, which allows the user to tune the level of generalization.
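Written out as code, the wrapping is a one-liner on top of any positive definite graph kernel k, for instance the exponential product graph kernel sketched earlier; k(x,x) − 2k(x,x') + k(x',x') is the squared distance in the feature space induced by k.

```python
import math

def wrapped_kernel(k, x, y, rho=0.01):
    # Gaussian RBF in the feature space induced by the graph kernel k.
    return math.exp(-rho * (k(x, x) - 2.0 * k(x, y) + k(y, y)))
```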
7.6 Experiments
To test the kernel based regression system (kbr), RRL once again tackled the same blocks world tasks as in the two previous chapters. As the system has two different parameters of which the influence needs to be investigated, an array of experiments was run to find good values of these parameters. In this text, only a small number of these experiments are shown, as the influence of one parameter is best illustrated when using the best value of the other parameter. In the experimental setup used, the best performance was reached with β = 10 and ρ = 0.01. Therefore, the influence of the β parameter is shown in tests where ρ = 0.01 and likewise for the tests on the influence of the parameter ρ.
7.6.1 The Influence of the Series Parameter β
As already stated, the exponential series parameter β influences the weights given to walks of a certain length, as shown in Figure 7.6. Although worlds with only 3 to 5 blocks are considered in the Stacking and Unstacking experiments, the maximum length of a walk in the two subgraphs is 6, because of the 'clear' and 'floor' nodes. Because of the extra 'goal' edge in the On(A,B) experiments, there is a possibility of a cycle in the graph and therefore also of walks of infinite length.
Figure 7.9: The performance of kbr with focussing on different length walks for the Stacking task.
The different values of the β parameter tested were 1 (which is basically the same as using the geometric setting), 5, 10, 50 and 100. For the Stacking and Unstacking tasks, there should be little difference between the values larger than 10; for the On(A,B) task, these large values could be significant. As expected, Figure 7.9 shows that there is little difference between the higher values of the β parameter for the Stacking task. The performance graph does show a difference between the low values. The value of 1, corresponding to the geometric kernel computation, performs less well than the rest. Even the value of 5 seems a little worse than the rest, indicating that the longest walks should be given the highest weights. Intuitively, this is consistent with what one would expect, as the Stacking task, when considered in the graph notation, basically consists of building the longest walk possible. Figure 7.10 shows that the behavior of the β parameter for the Unstacking task is very similar. The only difference is that the value of 5 works as well as the higher values. This makes sense, as the goal of Unstacking will cause most of the encountered (state, action) graphs to have many short walks. For the On(A,B) task, where it is possible for the (state, action) graph to contain cycles and walks of infinite length, there is a larger difference between the different parameter values. The best performance is reached for the values 5 and 10, i.e., focussing on walks of length 5 to 6 or 10 to 11. The performance drops for lower values, which are comparable to the weight distribution of the geometric setting, and for higher values, which make the kernel focus on the cycles in the graph.
Figure 7.10: The performance of kbr with focussing on different length walks for the Unstacking task.
Figure 7.11: The performance of kbr with focussing on different length walks for the On(A,B) task.
Figure 7.12: The performance of kbr for different levels of generalization on the Stacking task.
7.6.2 The Influence of the Generalization Parameter ρ
The parameter ρ allows tuning of the amount of generalization by controlling the width of the radial basis functions that are used to wrap the graph kernel. Low values, corresponding to a large amount of generalization, will cause kbr to learn more quickly, but possibly lead to lower accuracy of the resulting policy. High values, resulting in little generalization, will cause RRL to learn slowly and might prevent it from learning a policy for the entire state space. Figure 7.12 shows the influence of different generalization levels for the Stacking goal. High generalization leads to the lowest resulting performance (although the ρ value of 0.00001 works well in the beginning of the experiment), but overall the differences are quite small. Even for very high ρ values (1000), kbr succeeds in learning a good policy. The Unstacking goal causes a larger delay when kbr uses very little generalization. The Unstacking goal is harder to reach than the Stacking goal, so a smaller percentage of the set of learning examples contains useful information. Higher levels of generalization cause these values to spread more and help RRL generate a better performing policy. At very low values of ρ, overgeneralization causes the performance to drop again. For the On(A,B) goal, there is again little difference between the different values of the parameter, although very little generalization causes kbr to learn more slowly.
Figure 7.13: The performance of kbr for different levels of generalization on the Unstacking task.
Figure 7.14: The performance of kbr for different levels of generalization on the On(A,B) task.
Figure 7.15: Performance comparison between the tg, rib and kbr algorithms for the Stacking task.
7.6.3 Comparing kbr, rib and tg
For the comparison of kbr to the two other regression engines tg and rib, the parameter values β = 5 and ρ = 0.01 were chosen. The architecture of the kbr regression engine is most comparable to the rib algorithm, as each new learning example is stored and influences the prediction of new examples according to their similarity. tg, on the other hand, uses the learning examples to build an explicit model of the Q-function. Figure 7.15 shows the performances of the three algorithms on the Stacking goal. tg needs more learning episodes to reach the same level of performance as the two other algorithms, of which kbr is a little faster. It needs to be pointed out, though, that the tg algorithm is a lot faster computationally and that it can handle a lot more episodes than rib and kbr with the same computational capacity. However, this is only advantageous for environments that can react quickly to the agent's decisions and have a low exploration cost, such as completely simulated environments. For the Unstacking goal, the difference in the learning rate between tg and the two others is even more apparent, as tg does not learn the optimal policy during the available learning episodes. The performance curves of Figure 7.17 show that none of the three regression engines allow RRL to learn the optimal policy for the On(A,B) goal. Remarkably, the three systems perform very comparably, although tg is again a little slower at the start; it quickly catches up. As shown in Table 7.1, the current implementation of the RRL-kbr system is quite slow, and comparable in learning speed to the rib-ep system (see Table 6.2). This is largely due to the fact that no example selection strategy has been implemented for RRL-kbr so far.
Figure 7.16: Performance comparison between the tg, rib and kbr algorithms for the Unstacking task.
Figure 7.17: Performance comparison between the tg, rib and kbr algorithms for the On(A,B) task.
Table 7.1: The execution times for RRL-tg, RRL-rib and RRL-kbr on a Pentium III 1.1 GHz machine in seconds. The first number in each cell indicates the learning time, the second indicates the testing time.

Task          RRL-tg    RRL-rib     RRL-kbr
Stacking      14 – 1    27 – 27     650 – 900
Unstacking    15 – 2    7 – 10      1200 – 2300
On(A,B)       16 – 1    44 – 100    1500 – 3000
As such, the size of the covariance matrix used by the Gaussian processes algorithm grows with each visited (state, action) pair. This also greatly augments the time needed to make predictions on new (state, action) pairs, as indicated by the testing times (the second number in each cell) in Table 7.1.
7.7 Future Work
So far, no work has been done to select the examples that are accepted by the kbr system. All presented examples are used to build the covariance matrix C, which (as a consequence) tends to grow very large and greatly influences the computational efficiency of the kbr regression algorithm. To reduce the size of this matrix (and the related vectors), measures similar to those used for the rib algorithm could be applied. However, the use of kernels allows for other selection mechanisms as well, such as the removal of examples that give rise to parallel covariance vectors (i.e., (state, action) pairs that are identical with respect to the feature space). Another item that needs to be addressed in the future, in order to take full advantage of Gaussian processes for regression, is the use of the probability estimates that can be predicted. An obvious use for them would be to guide exploration. The probability estimates can be used to build confidence intervals for the predicted Q-value. Such confidence intervals are useful for the interval based exploration that was discussed in Section 2.3.3.2. Possibly, the probability estimates can also be used for learning example selection, much like the use of the standard deviation in the rib system. So far, the parameters used in the kernel function (such as the β and ρ parameters in the blocks world experiments) have been fixed throughout a single learning experiment. However, Gaussian processes allow the use of parameters in the covariance function — referred to as hyper-parameters — and the tuning of these parameters to fit the data more precisely. The application of these parameter learning techniques to the RRL setting seems non-trivial, however.
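As a rough illustration of how such confidence intervals could be obtained, the sketch below computes the Gaussian process predictive mean and variance from the stored examples and uses an optimistic upper bound to rank actions. The function names, the noise level and the interfaces are assumptions for illustration, not part of the RRL-kbr implementation.

import numpy as np

def gp_predict(kernel, X, y, x_new, noise=0.1):
    """Gaussian process prediction with predictive variance.
    kernel: covariance function over (state, action) pairs (e.g. the wrapped
    graph kernel), X: stored pairs, y: their Q-value targets."""
    C = np.array([[kernel(a, b) for b in X] for a in X], dtype=float)
    C += noise ** 2 * np.eye(len(X))       # covariance matrix of the stored examples
    k_vec = np.array([kernel(x_new, a) for a in X], dtype=float)
    alpha = np.linalg.solve(C, np.asarray(y, dtype=float))
    mean = float(k_vec @ alpha)
    var = float(kernel(x_new, x_new) + noise ** 2 - k_vec @ np.linalg.solve(C, k_vec))
    return mean, max(var, 0.0)

def ucb_score(kernel, X, y, candidate, width=2.0):
    """Optimistic, interval based score: predicted mean plus a confidence margin."""
    mean, var = gp_predict(kernel, X, y, candidate)
    return mean + width * np.sqrt(var)

# toy usage with a scalar RBF kernel standing in for the graph kernel
k = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
X, y = [0.0, 1.0, 2.0], [0.0, 0.81, 0.9]
print(gp_predict(k, X, y, 1.5))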
7.8 Conclusions
This chapter introduced Gaussian processes and graph kernels as a regression technique for the RRL system. The graph kernels act as the covariance function needed for the Gaussian processes and are based on the number of walks in the direct product graph of the two (state, action) graphs. With the weights of the exponential series setting, the kernels can be tuned to walks of different lengths. Wrapping this kernel in a radial basis function allows one to control the amount of generalization. Experimental results show that the behavior of kernel based regression is comparable to, and maybe even better than, that of instance based regression. Both of these outperform tg with respect to the learning rate per episode.
Part III
On Larger Environments
Chapter 8
On adding Guidance to Relational Reinforcement Learning

“How do you explain school to a higher intelligence?”
E.T. - The Extra Terrestrial
8.1 Introduction
In structural domains, the state space is typically very large, and although the relational regression algorithms introduced before can provide the right level of abstraction to learn in such domains, the problem remains that rewards may be distributed very sparsely in the state space. With random exploration of the state space, rewards may simply never be encountered. In some of the application domains mentioned above, this prohibits RRL from finding a good solution. While plenty of exploration strategies exist (Wiering, 1999), few deal with the problems of exploration at the start of the learning process. It is exactly this problem that occurs often in the RRL setting. There is, however, an approach which has been followed with success, and which consists of guiding the Q-learner with examples of “reasonable” strategies, provided by a teacher (Smart and Kaelbling, 2000). Thus a mix between classical unsupervised Q-learning and (supervised) behavioral cloning is obtained. It is the suitability of this mixture in the context of RRL that is explored in this chapter. This chapter introduces guidance as a way to help RRL (and other reinforcement learning techniques) tackle large environments with sparse rewards.
After an argument for the need for guidance in relational reinforcement learning, several modes of guidance are suggested and empirically evaluated on larger versions of the blocks world problems. This substantial array of test cases also provides a view of the individual characteristics of the three different regression engines. The chapter concludes by discussing related work and presenting a large array of possible directions for further work. The idea of using guidance was developed together with Sašo Džeroski and published in (Driessens and Džeroski, 2002a; Driessens and Džeroski, 2002b; Driessens and Džeroski, 2004).
8.2 Guidance and Reinforcement Learning

8.2.1 The Need for Guidance
In the early stages of learning, the exploration strategy used in Q-learning is pretty much random and causes the learning system to perform poorly. Only when enough information about the environment is discovered, i.e., when sufficient knowledge about the reward function is gathered, can better exploration strategies be used. Gathering knowledge about the reward function can be hard when rewards are sparse and especially if these rewards are hard to reach using a random strategy. A lot of time is usually spent exploring regions of the state-action space without learning anything because no rewards (or only similar rewards) are encountered. Relational applications often suffer from this problem because they deal with very large state-spaces when compared to attribute-value problems. First, the size of the state-space grows exponentially with regard to the number of objects in the world, the number of properties of each object and the number of possible relations between objects. Second, when actions are related to objects — such as moving one object to another — the number of actions grows equally fast. For example, when the number of blocks in the blocks world increases, the goal states — and as a consequence the (state, action) pairs that yield a reward — become very sparse. To illustrate this, Figure 8.1 shows the success-rate of random policies in the blocks world. The agent with the random policy starts from a randomly generated state (which is not a goal state) and is allowed to take at most 10 actions. For each of the three goals (i.e., Stacking, Unstacking and On(A,B)) the graph shows the percentage of trials that end in a goal state and therefore with a reward, with respect to the number of blocks in the world. As shown in the graph, the Unstacking goal in the blocks world with 10 blocks would almost never be reached by random exploration. Not only is there a single goal state out of 59 million states, but the number of possible actions increases as one gets closer to the goal state: in a state from which a single
action leads to the goal state, there are 73 actions possible.

Figure 8.1: Success rate for the three goals in the blocks world with a random policy.

The graph of Figure 8.2 shows the percentage of learning examples with a non-zero Q-value that is presented to the regression algorithm. Since all examples with a zero Q-value can be regarded as noise for the regression algorithm, it is clear that learning the correct Q-function from these examples is very hard.
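The figure of roughly 59 million states for the 10 blocks world quoted above can be verified with the standard recurrence for the number of blocks world configurations with n distinct blocks; the short sketch below is only an illustration of that arithmetic and is not part of the thesis itself.

# Number of blocks world states (arrangements of n distinct blocks into stacks):
# a(n) = (2n - 1) a(n-1) - (n - 1)(n - 2) a(n-2), with a(0) = a(1) = 1.

def blocks_world_states(n):
    if n <= 1:
        return 1
    a_prev2, a_prev = 1, 1
    for i in range(2, n + 1):
        a_prev2, a_prev = a_prev, (2 * i - 1) * a_prev - (i - 1) * (i - 2) * a_prev2
    return a_prev

print(blocks_world_states(10))  # 58941091, i.e. roughly 59 million states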
8.2.2 Using “Reasonable” Policies for Guidance
Although random policies can have a hard time reaching sparsely spread rewards in a large world, it is often relatively easy to reach these rewards by using “reasonable” policies. While optimal policies are certainly “reasonable”, non-optimal policies are often easy (or easier) to implement or generate than optimal ones. One obvious candidate for an often non-optimal, but reasonable, controller would be a human expert. To integrate the guidance that these reasonable policies can supply with our relational reinforcement learning system, the given policy is used to choose the actions instead of a policy derived from the current Q-function hypothesis (which will not be informative in the early stages of learning). The episodes created in this way can be used in exactly the same way as normal episodes in the RRL algorithm to create a set of examples which is presented to the relational regression algorithm. In case of a human controller, one could log the normal operation of a system together with the corresponding rewards and generate the learning examples from this log. Since tabled Q-learning is exploration insensitive — i.e., the Q-values will converge to the optimal values, independent of the exploration strategy used
(Kaelbling et al., 1996) — the non-optimality of the used policy will have no negative effect on the convergence of the Q-table. While Q-learning with generalization is not exploration insensitive, the experiments will demonstrate that the “guiding policy” helps the learning system to reach non-obvious rewards and that this results in a two-fold improvement in learning performance. In terms of learning speed, the guidance is expected to help the Q-learner to discover higher yielding policies earlier in the learning experiment. Through the early discovery of important states and actions and the early availability of these (state, action) pairs to the generalization engine, it should also be possible for the Q-learner to reach a higher level of performance — i.e., a higher average reward — in the available time.

Figure 8.2: The percentage of informative examples presented to the regression algorithm for the three goals in the blocks world with a random policy.

While the idea of supplying guidance or another initialization procedure to increase the performance of a tabula rasa algorithm such as reinforcement learning is not new (see Section 8.4), it is under-utilized. With the emergence of new reinforcement learning approaches, such as the RRL system, that are able to tackle larger problems, this idea is gaining importance and could provide the leverage necessary to solve really hard problems.
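The way a guided trace is turned into ordinary learning examples, as described at the start of this section, can be sketched as follows. The environment and Q-estimate interfaces (reset, step, actions) are illustrative assumptions, not the actual RRL interfaces.

# Sketch of guided-episode generation: the teacher policy only replaces the
# action selection step; the (state, action, q) examples are built exactly as
# for an explorative episode and then presented to the regression engine.

def guided_episode(env, teacher_policy, q_estimate, gamma=0.9, max_steps=50):
    examples = []
    state = env.reset()
    for _ in range(max_steps):
        action = teacher_policy(state)           # guidance instead of exploration
        next_state, reward, done = env.step(action)
        if done:
            target = reward
        else:                                    # standard Q-learning target
            target = reward + gamma * max(
                q_estimate(next_state, a) for a in env.actions(next_state))
        examples.append((state, action, target))
        state = next_state
        if done:
            break
    return examples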
8.2.3 Different Strategies for Supplying Guidance
When supplying guidance by creating episodes and presenting the resulting learning examples to the used regression engine, different strategies can be used to decide when to supply this guidance. One option that will be investigated is supplying the guidance at the beginning of learning, when the reinforcement learning agent is forced to use a
random policy to explore the state-space. This strategy also makes sense when using guidance from a human expert. After logging the normal operations of a human controlling the system, one can translate these logs into a set of learning examples and present this set to the regression algorithm. This will allow the regression engine to build a partial Q-function which can later be used to guide the further exploration of the state-space. This Q-function approximation will neither represent the correct Q-function, nor will it cover the entire state-action space, but it might be suitable for guiding RRL towards more rewards with the use of Q-function based exploration. The RRL algorithm explores the state space using Boltzmann exploration (Kaelbling et al., 1996) based on the values predicted by the partial Q-function. This strikes a compromise between exploration and exploitation of the partial Q-function.
Another strategy is to interleave the guidance with normal exploration episodes. In analogy with human learning, this mixture of perfect and more or less random examples can make it easier for the regression engine to distinguish beneficial actions from other ones. The influence of guidance when it is supplied with different frequencies will be compared.
One benefit of interleaving guided traces with exploration episodes is that the reinforcement learning system can remember the episodes or starting states that did not lead to a reward. It can then ask for guidance starting from the states in which it failed. This will allow the guidance to zoom in on areas of the state-space which are not yet covered correctly by the regression algorithm. This type of guidance will be called active guidance.
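A minimal sketch of the Boltzmann exploration step mentioned above, assuming the Q-value estimate is available as a plain function (all names are illustrative):

import math, random

def boltzmann_action(q_estimate, state, actions, temperature=1.0):
    """Boltzmann (softmax) exploration: actions with higher predicted Q-values
    are chosen more often; the temperature controls how greedy the choice is."""
    qs = [q_estimate(state, a) for a in actions]
    m = max(qs)                                   # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in qs]
    return random.choices(actions, weights=weights, k=1)[0]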
8.3 Experiments
The guidance added to exploration should have a two-fold effect on learning performance. In terms of learning speed, the guidance should help the Q-learner to discover better policies earlier. Through the early discovery of important states and actions and the early availability of these (state, action) pairs to the generalization engine, it should also be possible for the Q-learner to reach a higher level of performance — i.e., a higher average reward — for a given amount of learning experience. The experiments will test the effects of guidance at the start of the learning experiment as well as guidance that is spread throughout the experiment and active guidance, and will illustrate the differences between them. The influence of guided traces will be tested on each of the three different regression algorithms discussed previously. All have different strategies for dealing with incoming learning examples and as such will react differently to the presented guidance. Although the chosen setup will not allow for a fair comparison between the three algorithms, the different learning behaviors will provide a view on the consequences of the different inner workings of the three
regression engines and possibly an indication of the kind of problems that are best handled by each algorithm.
8.3.1 Experimental Setup
To test the effects of guidance, RRL (without guidance) will be compared with G-RRL (with guidance) in the following setup: first, RRL is run in its natural form, giving it the possibility to train for a certain number of episodes; at regular time intervals, the learned policy is extracted from RRL and tested on a number of randomly generated test problems. To compare with G-RRL, some of the exploration is substituted with guided traces. These traces are generated by either a hand-coded policy, a previously learned policy or a human controller. In between these traces, G-RRL is allowed to explore the state-space further on its own. Note that in the performance graphs, the traces presented to G-RRL will count as episodes. Using a blocks world with 10 blocks provides a learning environment which is large enough for the rewards to become sparse (see Table 5.1 and Figures 8.1 and 8.2). For the blocks world it is easy to write optimal policies for the three goals. Thus it is easy to supply RRL with a large amount of optimal example traces. The tg algorithm needs a higher number of learning examples compared to the rib and kbr algorithms. On the other hand, the tg implementation is a lot more efficient than the other implementations, so tg is able to handle more training episodes for a given amount of computation time. Since the goal of the experiments is to investigate the influence of guidance and not to compare the performance of the different systems, the tg algorithm will be supplied with a lot of training episodes (as it has little difficulty handling them) and the other algorithms with fewer training episodes (given the fact that they usually don't need them).
8.3.2 Guidance at the Start of Learning
Reaching a state that satisfies the Stacking goal is not really all that hard, even with 10 blocks and random exploration: approximately one of every 17 states is a goal state. Even so, some improvement can be obtained by using a small amount of guidance, as shown in Figure 8.3. The tg based algorithm is quite good at learning a close to optimal policy by itself. However, the added help from the guided traces helps it to decrease the number of episodes needed to obtain a certain performance level. (The strange behavior of tg when it is supplied with 100 guided traces will be further investigated in Section 8.3.3.) RRL-rib has a harder time with this goal. It doesn't come close to reaching the optimal policy, but the help it receives from the guided traces does allow it both to reach better performance earlier during learning and to reach a higher level of performance overall.
Figure 8.3: Guidance at the start for the Stacking goal in the blocks world.

The strangest behavior is exhibited by the kbr algorithm. Using the guided traces it quickly improves its strategy — with 100 guided traces it even reaches optimal behavior — but then starts to decrease its performance when it is allowed to explore the state space on its own. The graphs show the average performance of RRL over 10 test runs.
As already stated, in a world with 10 blocks it is almost impossible to reach a state satisfying the Unstacking goal at random. This is illustrated by the graphs on the left side of Figure 8.4, which also show the average performance over 10 test runs. RRL never learns anything useful on its own, because it doesn't succeed in reaching the reward often enough (if ever). Because tg does not make any decisions until enough evidence has been collected, small amounts of guidance do not help tg. Even when supplied with 100 optimal traces, tg does not learn a useful policy. Supplied with 500 optimal traces, tg has collected enough examples to learn something useful, but the difficulty of exploring such a huge state-space with so little reward still shows in the fact that tg is able to do very little extra with this information. This is caused by the fact that tg throws out the statistics it collected when it chooses a suitable splitting criterion and generates two new and empty leaves.
Figure 8.4: Guidance at the start for the Unstacking (left) and On(A,B) (right) goals in the blocks world.
Thus, when a split is made, tg basically forgets all the guidance it received. The rib algorithm was designed to remember high scoring (state, action) pairs. Once an optimal Q-value is encountered, it will never be deleted from the stored database. This makes rib perform very well on the Unstacking goal, reaching a close to optimal strategy with little guidance. Figure 8.4 does show that even 5 guided traces are sufficient, although a little extra guidance helps to reach high performance sooner. The kbr algorithm shows the same behavior as it did for the Stacking goal. Using the guided traces it quickly develops a high performance policy, but when it is left to explore on its own, the performance of the learned Q-function starts to decrease. The kbr algorithm bases its Q-value estimation on an estimated probability density. Since it has no example selection possibilities, the large number of uninformative (state, action) pairs (i.e., with 0 Q-value) generated during exploration overwhelms the informative ones and causes the probability estimate to degenerate. The On(A,B) goal has always been a hard problem for RRL (Džeroski et al., 2001; Driessens et al., 2001). The right side of Figure 8.4 shows the learning curves for each of the algorithms. Every data point is the average reward over 10 000 randomly generated test cases for the tg curve and over 1 000 for the rib and kbr graphs, all collected over 10 separate test runs. Although the optimal policy is never reached, the graph clearly shows the improvement that is generated by supplying RRL with varying amounts of guidance. Only in the kbr case, when supplied with limited amounts of guidance, is the performance comparable to that of RRL without guidance.
8.3.3 A Closer Look at RRL-tg
An interesting feature of the performance graphs of tg is the performance of the experiment with the Stacking goal where it was supplied with 100 (or more) optimal traces in the beginning of the learning experiment. Not only does this experiment take longer to converge to a high performance policy, but during the first 100 episodes, there is no improvement at all. rib and kbr do not suffer from this at all. This behavior becomes worse when tg is supplied with even more optimal traces. Figure 8.5 shows the learning curves when tg is supplied with 500 optimal traces. The reason for tg ’s failing to learn anything during the first part of the experiment (i.e., when being supplied with optimal traces) can be found in the specific inner workings of the generalization engine. Trying to connect the correct Q-values with the corresponding (state, action) pairs, the generalization engine tries to discover significant differences between (state, action) pairs with differing Q-values. In the ideal case, the tg -engine is able to distinguish between states that are at different distances from a reward producing state, and between optimal and non-optimal actions in these states.
Figure 8.5: Half optimal guidance in the blocks world for tg.

However, when tg is supplied with only optimal (state, action, Q-value) examples, overgeneralization occurs. The generalization engine never encounters a non-optimal action and therefore never learns to distinguish optimal from non-optimal actions. It will create a Q-tree that separates states which are at different distances from the goal state. Later, during exploration, it will expand this tree to account for optimal and non-optimal actions in these states. These trees are usually larger than they should be, because in the normal case, when supplied with both optimal and non-optimal examples, tg is often able to generalize in one leaf of its tree over both non-optimal actions in states that are close to the goal and optimal actions in states that are a little further from the goal.
To illustrate this behavior, tg was supplied with 500 half-optimal guidance traces in which the used policy alternates between a random and an optimal action. Figure 8.5 shows that, in this case, G-RRL does learn during the guided traces. Most noticeable is the behavior of tg with half optimal guidance when it has to deal with the Unstacking goal. Even though it is not trivial to reach the goal state when using a half optimal policy, it is reached often enough for G-RRL to learn a correct policy. Figure 8.5 shows that G-RRL is able to
learn quite a lot during the 500 supplied traces and is then able to reach the optimal policy after some extra exploration. This experiment (although artificial) shows that G-RRL can be useful even in domains where it is easy to hand-code a reasonable policy. G-RRL will use the experience created by that policy to construct a better (possibly optimal) one. The sudden leaps in performance are characteristic of tg: whenever a new (well chosen) test is added to the Q-tree, the performance jumps to a higher level. rib and kbr do not suffer from this overgeneralization. Since the Q-value estimation is the result of a weighted average of neighboring examples in the rib case, the rib algorithm is able to make more subtle distinctions between (state, action) pairs. Since the weights used in the average calculation are based on the distance between two (state, action) pairs, and since this distance has to include information about the resemblance of the two actions, there is almost no chance of overgeneralization of the Q-values over different actions in the same state. The same holds for the covariance function in the kbr case. The covariance function or kernel is based on the (state, action) graph, which includes the information on the chosen action. Where tg is left on its own to decide which information about the (state, action) pair to use in its Q-function model, rib and kbr are forced to use the complete (state, action) pair description.
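The half-optimal guidance used above can be sketched as a simple policy wrapper that alternates between an optimal and a random action; the policy interfaces are illustrative assumptions.

def half_optimal_policy(optimal_policy, random_policy):
    """Alternate between an optimal and a random action on successive steps,
    as in the half-optimal guidance traces described above."""
    step = 0
    def policy(state):
        nonlocal step
        step += 1
        return optimal_policy(state) if step % 2 == 1 else random_policy(state)
    return policy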
8.3.4 Spreading the Guidance
Instead of supplying all the guidance at the beginning of learning, it is also possible to spread the guidance throughout the entire experiment. In the following experiments, guidance is supplied either as 1 guided trace every 10 learning episodes or in batches of 10 guided traces every 100 learning episodes. Spreading the guidance through the entire learning experiment, compared to presenting an equal total amount of guidance at the beginning of learning, avoids the overgeneralization problem that occurred when using the tg algorithm. The top left graph of Figure 8.6 clearly shows that tg does not suffer the same learning delay. The influence of spread guidance for the Unstacking goal (top right of Figure 8.6) is remarkable. With initial (and optimal) guidance, tg was not able to learn any useful strategy. In this case, however, the mix of guided traces and explorative traces allows tg to build a well performing Q-tree. It is still less likely to find the optimal policy than with the (artificial) half-optimal guidance, but it performs reasonably well. The rib algorithm did not suffer from overgeneralization and, as a consequence, there is little difference between the results obtained with initial and spread guidance. rib is designed to select and store examples with high Q-values. It does not matter when during the learning experiment these examples are encountered.
Figure 8.6: Guidance at start and spread for the Stacking (left) and Unstacking (right) goals in the blocks world.
The only noticeable difference is that the performance increase becomes somewhat slower but smoother. With spread guidance, the kbr algorithm does not get the same jump-start as with all the guidance at the beginning of learning. As expected, the same total amount of guidance leads to approximately the same level of behavior, regardless of when this guidance is supplied.
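A trivial sketch of the two guidance schedules used in these experiments (the function and mode names are illustrative):

def is_guided(episode, mode):
    """Decide whether a given episode is a guided trace.
    '1_per_10': one guided trace every 10 episodes;
    '10_per_100': a batch of 10 guided traces every 100 episodes.
    Both schedules supply the same total amount of guidance."""
    if mode == '1_per_10':
        return episode % 10 == 0
    if mode == '10_per_100':
        return episode % 100 < 10
    return False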
Figure 8.7: Guidance at start and spread for the On(A,B) goal in the blocks world.

For the On(A,B) goal, the influence of the spread guidance on the performance of tg is large, both in terms of learning speed and in the overall level of performance reached, as shown in the top left graph of Figure 8.7. Again, for rib and kbr, there is little difference in the performance of the resulting policies.
All graphs, but especially the tg case of Figure 8.7, show the influence of different frequencies used to provide guidance. Note that in all cases, an equal amount of guidance was used. Although the results show little difference, there seems to be a small advantage for thinly spread guidance. Intuitively, it seems best to spread the available guidance as thinly as possible, and the performed experiments do not show any negative results for doing so. However, spreading out the guidance when there is only a small amount available (e.g. 1 guided
trace every 10 000 episodes) might prevent the guidance from having any effect. Another possibility for dealing with scarce guidance is to provide all the guidance after RRL has had some time to explore the environment. Although Figure 8.7 shows inferior results for this approach when compared to the spread out guidance, this is probably due to the large size of the presented batch. Note also that learning here is faster than when all guidance is provided at the beginning of the learning experiment.
8.3.5 Active Guidance
As stated at the end of Section 8.2.3, in the blocks world, where each episode is started from a randomly generated starting position, RRL can be given the opportunity to ask for guided traces starting from some of the starting states where it failed. In planning problems like the ones in the blocks world, RRL can discover whether or not it succeeded by checking whether it received a reward of 1 or not. This will allow RRL to receive information about parts of the state space where it does not yet have enough knowledge and to supply the generalization algorithm with examples which are not yet correctly predicted. Figures 8.8 and 8.9 show the results of this active guidance. In these learning experiments, guidance was spread like in the previous section, but replaced with active guidance. Two kinds of behavior can be distinguished. In the first, G-RRL succeeds in finding an almost optimal strategy, and the active guidance succeeds in pushing G-RRL to even better performance at the end of the learning experiment. This is the case for all goals using tg for regression, for the Unstacking goal with rib, and for the Stacking and Unstacking goals with the kbr algorithm, often leading to an optimal policy that was not reached before, or greatly reducing the number of cases in which the goal was not reached. For example, the percentage of episodes where RRL does not reach the goal state is reduced from 11% to 3.9% using the tg algorithm in the On(A,B) experiment. This behavior is completely consistent with what one would expect. In the beginning, both modes of guidance provide enough new examples to increase the accuracy of the learned Q-functions. However, when a large part of the state-space is already covered by the Q-functions, the specific examples provided by active guidance allow the Q-function to be extended to improve its coverage of the outer reaches of the state-space. In the second kind of behavior, G-RRL does not succeed in reaching a sufficiently high level of performance. This happens for rib on the tasks of Stacking and On(A,B) and for kbr with the On(A,B) goal. There is little difference here between the help provided by normal and active guidance. Active guidance is not able to focus on critical regions of the state-space and improve upon the examples provided by regular guidance.
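A rough sketch of this active guidance loop, with the agent, teacher and environment interfaces as illustrative assumptions:

import random

def active_guidance_loop(env, rrl_agent, teacher, episodes, guide_every=10):
    """Failed starting states are remembered and, whenever a guided trace is
    due, the teacher is asked to demonstrate from one of those states instead
    of from a random starting state."""
    failed_starts = []
    for episode in range(episodes):
        if episode % guide_every == 0 and failed_starts:
            start = failed_starts.pop(random.randrange(len(failed_starts)))
            examples = teacher.demonstrate(env, start)       # active guided trace
        else:
            start = env.random_start()
            examples, total_reward = rrl_agent.explore(env, start)
            if total_reward < 1.0:                           # goal was not reached
                failed_starts.append(start)
        rrl_agent.update(examples)                           # feed the regression engine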
Figure 8.8: Active guidance in the blocks world for the Stacking (left) and the Unstacking (right) goals.
Figure 8.9: Active guidance in the blocks world for the On(A,B) goal.
8.3.6 An “Idealized” Learning Environment
Because RRL is able to generalize over different environments (as already shown in Chapters 5 to 7) it is possible to create an “idealized” learning environment. In a small environment, RRL can create a set of learning examples with a variation of optimal and non-optimal (state, action) pairs by simply exploring on its own. In a large environment, G-RRL can be used to avoid large numbers of uninformative learning examples. To combine both of these ideas, the following experiments allow RRL to explore on its own in environments with 3 to 5 blocks, and guidance is provided in 10% of the cases in a world with 10 blocks. This kind of learning environment is comparable to human learning environments where often, a teacher will both show students how they solve difficult problems and make the students practice solving easier problems on their own. To test whether RRL is able to generalize over different environments, the learned Q-function and its resulting policy are tested in environments where the number of blocks ranges from 3 to 10. This means that RRL will have to handle worlds with for example 8 blocks without ever being allowed to train
itself in such a world.

Figure 8.10: Performance comparison between the tg, rib and kbr algorithms for the Stacking task in an idealized learning environment.

Figure 8.10 shows the learning curves for all three regression engines for the Stacking goal in the described setup. The kbr algorithm is the only one which builds an optimal Q-function in this case. Although the tg algorithm seems to be the slowest to improve, i.e., it needs the most learning episodes to yield a comparable policy, it must be said that the tg algorithm is much more efficient and is able to handle a much larger set of learning examples given a fixed amount of computational power. However, this is only beneficial when the learning environment is fast to interact with and the exploration costs of the environment are low. In real world applications this will often not be the case, causing the additional exploration cost needed by tg to dominate the computational cost of the rib and kbr algorithms.
The “idealized” learning environment works extremely well for the Unstacking task. All three regression algorithms succeed in quickly building an optimal policy, as shown in Figure 8.11. The mixed set of optimal and non-optimal examples resulting from the combination of small world exploration and large world guidance turns the difficult task of Unstacking in large worlds into an easy generalization problem.
For the On(A,B) task, tg is again slower than the two other algorithms in reaching a comparable level of performance. However, when tg is presented with 5 times more episodes than the two other algorithms, the performance becomes remarkably similar. (The “RRL-tg (episodes ∗ 5)” performance curve shows the average reward reached by tg after 5 times as many learning episodes as indicated on the x-axis.)
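A minimal sketch of how episodes could be generated in this idealized setup; the world, teacher and agent interfaces are illustrative assumptions.

import random

def idealized_episode(small_worlds, large_world, teacher, rrl_agent, p_guided=0.1):
    """Mix the two sources of experience described above: own exploration in
    small worlds (3 to 5 blocks) and, in roughly 10% of the episodes, a guided
    trace in the large 10-block world."""
    if random.random() < p_guided:
        return teacher.demonstrate(large_world)    # guidance in the large world
    env = random.choice(small_worlds)              # own exploration in a small world
    return rrl_agent.explore(env)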
Figure 8.11: Performance comparison between the tg, rib and kbr algorithms for the Unstacking task in an idealized learning environment.
Figure 8.12: Performance comparison between the tg, rib and kbr algorithms for the On(A,B) task in an idealized learning environment.
8.4 Related Work
The idea of incorporating guidance in the automated learning of control is not new. Chambers and Michie (1969) discuss three kinds of cooperative learning. In the first, the learning system just accepts the offered advice. In the second, the expert has the option of not offering any advice. In the third, some criterion decides whether the learner has enough experience to override the human decision. Roughly speaking, the first corresponds to behavioral cloning, the second to reinforcement learning and the third to guided reinforcement learning. The link between this work and behavioral cloning (Bain and Sammut, 1995; Urbancic et al., 1996) is not very hard to make. If the used regression algorithm were supplied with low Q-values for unencountered (state, action) pairs, RRL would learn to imitate the behavior of the supplied traces. Because of this similarity of the techniques, it is not surprising that similar problems are encountered as in behavioral cloning. Scheffer et al. (1997) discuss some of these problems. The differences between learning by experimentation and learning with “perfect guidance” (behavioral cloning), and the problems and benefits of both approaches, are highlighted. At first sight, behavioral cloning seems to have the advantage, as it sees precisely the optimal actions to take. However, this is all that it is given. Learning by experimentation, on the other hand, receives imperfect information about a wider range of (state, action) pairs. While some of the problems Scheffer mentions are solved by the combination of the two approaches as suggested in this chapter, other problems resurface in the presented experiments. Scheffer states that learning from guidance will experience difficulties when confronted with memory constraints, so that it cannot simply memorize the ideal sequence of actions but has to store associations instead. This is very closely related to the problems of the tg generalization engine when it is supplied with only perfect (state, action) pairs. Wang (1995) combines observation and practice in the OBSERVER learning system, which learns STRIPS-like planning operators (Fikes and Nilsson, 1971). The system starts with learning from “perfect guidance” and improves on the planning operators (pre- and post-conditions) through practice. There is no reinforcement learning involved. Lin's work on reinforcement learning, planning and teaching (Lin, 1992) and the work of Smart and Kaelbling on reinforcement learning in continuous state-spaces (Smart and Kaelbling, 2000) are closely related to the work presented in this chapter in terms of combining guidance and experimentation. Lin uses a neural network approach for generalization and uses a human strategy to teach the agent. The reinforcement learning agent is then allowed to replay each teaching episode to increase the amount of information gained from a single lesson. However, the number of times that one lesson can be replayed has to be restricted to prevent over-learning. This behavior is strongly related to the
over-generalization behavior of tg when only perfect guidance is presented. Smart’s work, dealing with continuous state-spaces, uses a nearest neighbor approach for generalization and uses example training runs to bootstrap the Q-function approximation. The use of nearest neighbor and convex hulls to select the examples for which predictions are made, successfully prevents overgeneralization. It is not clear how to translate the convex hull approach to the relational setting. Another technique that is based on the same principles as our approach is used by Dixon et al. (2000). They incorporate prior knowledge into the reinforcement learning agent by building an off-policy exploration module in which they include the prior knowledge. They use artificial neural networks as a generalization engine. Other approaches to speed up reinforcement learning by supplying it with non-optimal strategies include the work of Shapiro et al. (2001). There the authors embed hierarchical reinforcement learning within an agent architecture. The agent is supplied with a “reasonable policy” and learns the best options for this policy through experience. This approach is complementary to and can be combined with the G-RRL ideas.
8.5 Conclusions
This chapter addressed the problem of integrating guidance and experimentation in reinforcement learning, and in particular in relational reinforcement learning, as the problem of finding rewards that are sparsely distributed is more severe in large relational problem domains. It was shown that providing guidance to the reinforcement learning agent does help improve the performance in such cases. Guidance in this case takes the form of traces of the execution of a “reasonable policy” that provides sufficiently dense rewards. The utility of guidance was demonstrated through experiments in the blocks world domain with 10 blocks. The 10 blocks world is characterized by a huge state space and the three chosen tasks are characterized by their sparse rewards. The effect of using guidance was studied in a number of settings, characterized along two dimensions: the mode of providing guidance and the generalization engine used within relational reinforcement learning. Two modes of using guidance were investigated: providing all guidance at the start, and spreading guidance, i.e., providing some guided episodes followed by several exploration episodes, and repeating this. A variation on the latter mode is active learning, where the agent asks for guided traces starting from initial states that it selects itself rather than receiving guided traces from randomly chosen initial states. Overall, the use of guidance in addition to experimentation improves performance over using experimentation only, for all considered combinations of
the dimensions mentioned above. Improvements in terms of the overall performance level achieved, the convergence speed, or both were observed. The improvements result from using the best of both worlds: guidance provides perfect or reasonably good information about the optimal action to take in a narrow range of situations, while experimentation can obtain imperfect information about the optimal action to take in a wide range of situations. The actual magnitude of performance improvement does depend on the considered combination of the mode of providing guidance and generalization engine. While both guidance at the start and spread guidance improve the performance of the RRL system, spread guidance often yields higher, but more importantly, never lower performance. This is especially the case when the regression engine is vulnerable to over-generalization such as the tg algorithm. Providing all the guidance up front doesn’t quite work well in this case for several reasons. Namely, making a split on “perfect” guidance generated examples only distinguishes between the regions of equal state-values, but not the actions that allow movement between them. This can be corrected by splits further down the tree, but this requires lots of extra examples, and therefore more learning episodes. This problem is aggravated by the fact that after making a split, the guidance received so far is lost. These problems do not appear when instance-based regression is used as a generalization engine as the rib algorithm is designed to remember high yielding examples and the bias towards action difference of the used distance prevents the over-generalization that occurs with tg . The kbr algorithm also does not suffer from over-generalization and succeeds in translating only optimal examples into the optimal policy. However, since it has no example selection strategy like the rib algorithm, large amounts of uninformative examples can cause the optimal policy to degenerate during later exploration. Active learning with spread guidance helps improve performance in the later stages of learning, by enabling fine tuning of the learned Q-function by focusing on problematic regions of the state space. This often results in a significant reduction of the cases where RRL exhibits non-optimal behavior. Experiments show that a sufficiently high level of performance has to be reached by G-RRL for the active guidance to have any effect. If performance is too low to allow fine-tuning, active guidance does not improve on normal guidance.
8.6 Further Work
Since each of the three regression algorithms responds differently to the supplied guidance, one possible direction for further work is the tighter integration of guidance and the used generalization engine. For example, when dealing with a model building regression algorithm like the tg system, one could supply more and possibly specific guidance when the algorithm is bound to make an
important decision. When tg is used, this would be when tg is ready to choose a new test to split a leaf. Even when this guidance is not case specific, it could be used to check whether a reasonable policy contradicts the proposed split. Alternatively, one might decide to store (some of) the guided traces and re-use them: at present, all statistics, and thus all the information received from the guided traces, are forgotten once tg chooses a split. When looking for a more general solution, one could try to provide a larger batch of guidance after RRL has had some time to explore the state-space on its own. This is related to a human teaching strategy, where providing the student with the perfect strategy at the start of learning is less effective than providing the student with that strategy after he or she has had some time to explore the system's behavior. Another route of investigation that could yield interesting results would be to have a closer look at the relations of our approach to the human learning process. In analogy to human learner–teacher interaction, one could have a teacher look at the behavior of RRL or — given the declarative nature of the policies and Q-functions that are generated by tg — at the policy that RRL has constructed itself, and adjust the advice it wants to give. In the long run, because RRL-tg uses an inductive logic programming approach to generalize its Q-function and policies, this advice doesn't have to be limited to traces, but could include feedback on which part of the constructed Q-function is useless and has to be rebuilt, or even constraints that the learned policy has to satisfy. Although the idea of active guidance seems very attractive both intuitively and in practice, it is not easy to extend this approach to applications with stochastic actions or a fixed starting state, such as most games. If one looks at, for example, the popular computer game of Tetris (see also Chapter 9), the starting state always has an empty playing field and the next block to be dropped is chosen randomly. For stochastic applications one could try to remember all the stochastic elements and try to recreate the episode. For the Tetris game this would include remembering the entire sequence of blocks and asking the guidance strategy for a game with that given sequence. However, given the large difference in the Tetris state as a consequence of only a few different actions, the effect of this approach is anticipated to be small. Another step towards active guidance in stochastic environments would be to keep track of actions (and states) with a large negative effect. For example, in the Tetris game again, one could notice a large increase of the height of the wall of the playing field. These remembered states could then be used to ask for guidance. However, this approach requires not only a large amount of administration inside the learning system but also needs some a priori indication of bad and good results of actions.
Chapter 9
Two Computer Games

“Shall we play a game?”
War Games
9.1
Introduction
To test the applicability of the RRL system to problems larger than the ones already tested in the blocks world, computer games offer a cheap solution. Computer games provide a well defined world for the learning system to interact with, and their complexity can often be tuned to make the learning task easier or more difficult when needed. A computer game environment is often filled with a varying number of objects that exhibit several relational properties and are often defined only by their properties and their relations to other objects. This kind of environment is exactly what the RRL system was designed for. Computer games also offer cheap learning experience, as the environment is completely simulated. Real world applications, in contrast, are often hampered by long interaction times or high exploration costs. They often have slow response times, either because the system has to wait for human interaction or simply because of physical limitations; given access to the computer game code, the interaction of the learning system with the game can often be accelerated to speed up the learning experiments. Real world systems are also often not suited for random exploration, as wrong choices might have destructive consequences; in a completely simulated environment, the agent can be punished for bad mistakes with an appropriate reward value without imposing any real cost. It is therefore not surprising that some of the best showcases of reinforcement learning are based on games (Tesauro, 1992).
This chapter will show the learning possibilities of relational reinforcement learning in two computer games: Digger and Tetris.

Figure 9.1: A snapshot of the DIGGER game.

The Digger game will be tackled using the tg regression algorithm as well as with a new hierarchical reinforcement learning technique for concurrent goals. After a brief introduction of the Digger game and a short overview of hierarchical reinforcement learning, a new approach to hierarchical reinforcement learning for concurrent goals is presented. Section 9.5 reports on a number of experiments with the Digger game. The behavior of the RRL system in the Tetris game is discussed in Section 9.6. All three regression systems are used for the Tetris task. The Tetris game is handled using “afterstates”, and a utility function is learned instead of a Q-function. The work on learning the Digger game was started together with Jeroen Nees in the context of his master’s thesis. Parts of this work were published in (Driessens and Blockeel, 2001), where the new hierarchical approach was introduced, and in (Driessens and Džeroski, 2002a), which reported on some of the experiments.
9.2
The Digger Game
Digger1 is a computer game created in 1983 by Windmill Software. It is one of a few old computer games which still enjoy a fair amount of popularity. In this game, the player controls a digging machine or “Digger” in an environment that contains emeralds, bags of gold, two kinds of monsters (nobbins and hobbins) and tunnels. The goal of the game is to collect as many emeralds and as much gold as possible while avoiding or shooting monsters.
In the tests, the hobbins and the bags of gold were removed from the game. Hobbins are more dangerous than nobbins for human players, because they can dig their own tunnels and reach Digger faster, as well as increase the mobility of the nobbins. However, they are less interesting for learning purposes, because they reduce the implicit penalty for digging new tunnels (and thereby increasing the mobility of the monsters) when trying to reach certain rewards. The bags of gold were removed from the game to reduce the complexity. Although they are still shown during the game and consequently in the screen shots, Digger does not interact with them. (Bags of gold have to be pushed to and dropped down cliffs so that they burst open before the gold can be collected. The game is already sufficiently complex for reinforcement learning purposes without them.)

Figure 9.2: The possible paths that can be travelled on in the Digger Game.

The digging machine is not completely free to move anywhere through the game field. The screen is divided into 10 by 15 squares. Although it takes Digger several steps to get from one square to the next, and the human player can decide to turn back before the next square is reached, Digger can only change direction in the middle of these squares. In other words, Digger can only travel on the lines shown in Figure 9.2, but it can turn back at practically any time. The most elementary step in the Digger game lasts about 100 milliseconds during game-play, and it takes Digger 4 to 5 basic steps to go from one square to the next. To reduce the number of steps between events (and rewards), and therefore the complexity of learning the game through reinforcement learning, the game was discretized so that a single step makes Digger move to the next square. Although this removes one possible strategy from the game — a human player will often try to reach an emerald without connecting two separate tunnels, to limit the mobility of the monsters — it reduces the number of steps between rewards and the total number of steps taken during a learning episode.

1 http://www.digger.org
9.2.1
Learning Difficulties
The Digger computer game offers quite a challenge to a reinforcement learning system. Although the game is quite simple by human standards, it is too large for standard reinforcement learning techniques. Even with the discretization used, the number of possible states is very large. Indeed, the number of available emeralds and present monsters varies during the game, as does the tunnel structure. Although it is possible to represent the Digger game with a feature vector, it would be very large and very impractical. The different levels available in the game add to the difficulties of learning it with non-relational reinforcement learning. Because the identity of a particular emerald or monster is unimportant and only their relative properties matter, a small relational Q-function can yield a well performing policy.
9.2.2
State Representation
The used representation of the Digger game state consists of the following components:
• the coordinates of digger, e.g., digPos(6,9),
• information on digger itself, supplied in the format digInf(digger_dead, time_to_reload, level_done, pts_scored, steps_taken), e.g., digInf(false,63,false,0,17),
• information on tunnels as seen by digger (range of view in each direction: up, down, left and right), e.g., tunnel(4,0,2,0); the tunnel information is relative to the digger, and since there is only one digger, no digger index argument is needed,
• the list of emeralds (e.g., [em(14,9), em(14,8), em(14,5), ...]); this information holds the absolute coordinates of all the emeralds,
• the list of monsters (e.g., [mon(10,1,down), mon(10,9,down), ...]), also using absolute coordinates, and
• information on the fireball fired by digger (x-coordinate, y-coordinate, travelling direction), e.g., fb(7,9,right).
The use of lists removes the limitations of fixed size feature vectors, and the lossless representation of the game state allows for the computation of all possible (relational) state properties.
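For concreteness, the snippet below encodes the example state from the list above as a plain Python data structure; the key names and container types are illustrative assumptions and do not correspond to the actual representation used by the RRL system.

# A hedged sketch of the Digger state described above, written as a Python
# dictionary. Key names and tuple layouts are assumptions for illustration.
digger_state = {
    "dig_pos": (6, 9),                      # digPos(6,9): coordinates of Digger
    "dig_inf": {                            # digInf(false,63,false,0,17)
        "digger_dead": False,
        "time_to_reload": 63,
        "level_done": False,
        "pts_scored": 0,
        "steps_taken": 17,
    },
    "tunnel": {"up": 4, "down": 0, "left": 2, "right": 0},   # tunnel(4,0,2,0), relative to Digger
    "emeralds": [(14, 9), (14, 8), (14, 5)],                 # em/2 facts, absolute coordinates
    "monsters": [(10, 1, "down"), (10, 9, "down")],          # mon/3 facts, absolute coordinates
    "fireball": (7, 9, "right"),                             # fb(7,9,right), or None if no fireball
}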
The actions available to the learning system are moveOne(X) and shoot(X) with X ∈ {up, down, left, right}. These actions are implemented in such a way that Digger moves an entire row or column. Shooting in a direction will also move Digger in that direction, as it is impossible to make Digger stand still during the game.
To let the tg algorithm use relational information to build the Q-function instead of the state and action representation described above, the tg algorithm is presented with the following predicates to use as splitting criteria:
• actionDirection/2: gets the direction of the chosen action.
• moveAction/1: succeeds when the chosen action is a moving action and returns the direction.
• shootAction/1: succeeds when the chosen action is a shooting action and returns the direction.
• emerald/2: returns the relative direction of a given emerald.
• nearestEmerald/2: computes the nearest emerald and its direction.
• monster/2: returns the relative direction of a given monster.
• visibleMonster/2: computes whether there is a monster that is connected to Digger by a straight tunnel and returns the monster and its direction.
• monsterDir/2: returns the travelling direction of a given monster.
• distanceTo/2: computes the distance from Digger to a given emerald or monster.
• canFire/0: succeeds if Digger’s weapon is charged and ready to fire.
• lineOfFire/1: succeeds if a fireball is already travelling in the direction of the given monster.
None of these predicates use object-specific or level-specific information. This allows RRL to learn on the different levels of the Digger game at the same time. Learning experience on easy levels can be mixed with that on hard levels to help RRL learn a representative Q-function.
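To illustrate what such splitting criteria compute, the sketch below implements two of them, distanceTo/2 and nearestEmerald/2, over the state dictionary sketched in the previous subsection. The Manhattan metric and the direction convention are assumptions made only for this example; in the RRL system the predicates are defined as relational background knowledge, not Python code.

# Hedged sketch of two splitting predicates over the dictionary-based state
# sketched earlier. The Manhattan distance and the axis/direction convention
# are illustrative assumptions.
def distance_to(state, obj_pos):
    """distanceTo/2: distance from Digger to a given emerald or monster."""
    dig_x, dig_y = state["dig_pos"]
    return abs(dig_x - obj_pos[0]) + abs(dig_y - obj_pos[1])

def nearest_emerald(state):
    """nearestEmerald/2: the nearest emerald and its relative direction."""
    dig_x, dig_y = state["dig_pos"]
    em_x, em_y = min(state["emeralds"], key=lambda e: distance_to(state, e))
    if abs(em_x - dig_x) >= abs(em_y - dig_y):
        direction = "right" if em_x > dig_x else "left"
    else:
        direction = "down" if em_y > dig_y else "up"
    return (em_x, em_y), direction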
9.2.3
Two Concurrent Subgoals
A player controlling the Digger bot will have to collect as many emeralds as possible and avoid encounters with the nobbins. These two tasks can be regarded as two separate goals that have to be handled concurrently.

Figure 9.3: The dilemma of concurrent goals in the Digger game. The two grey actions are optimal for the individual subgoals, while the white action is optimal when both are considered at the same time.

An interesting feature of the Digger game is that it is often non-optimal to regard the two tasks as competing, as the optimal action when both tasks are considered is often different from the optimal action for either of the subtasks, as shown in the example of Figure 9.3. The left grey arrow indicates the action that is optimal for avoiding monsters, while the right grey arrow shows the action that would be optimal for collecting emeralds. The white action is the one chosen by most human players in the shown game state, as it combines the avoidance of the nearby monster with the path to the closest “safe” emerald. The fact that the Digger game consists of trying to solve these two rather distinct subgoals makes it a suitable application for hierarchical reinforcement learning. However, since the two subgoals have to be handled concurrently, not all hierarchical reinforcement learning approaches are well suited.
9.3
Hierarchical Reinforcement Learning
Hierarchical reinforcement learning is often used for reinforcement learning problems that are too large to be handled directly. By dividing the learning problem into smaller subproblems and learning these separately, the so-called “curse of dimensionality”2 can be partially circumvented. The translation of a learning problem into multiple smaller or simpler learning tasks can take different forms:

Sequential Division: When the original problem can be translated into a sequence of (higher level) steps that have to be performed, each of these subtasks can be treated as an individual learning task. The complete problem is solved by executing each policy until the local goal for which it was learned is reached. An obvious example is construction work, e.g., building a house: first the foundation has to be laid, then the walls have to be erected and finally a roof has to be put on top.

Temporal Abstraction: When the solution to the learning problem consists of (higher level) steps which have to be repeated or interleaved a certain number of times, the correct execution of each of these steps can be regarded as a learning task. The solution to the original problem will then be a high level policy that makes use of the low level abilities that were learned to reduce the number of steps to be taken. The term “temporal abstraction” stems from the fact that each of the subtasks may take a different amount of time to execute and decisions are no longer required at every step. An example would be robot navigation, where navigating from one location to another consists of following several corridors, making a number of turns and avoiding the necessary objects.

Concurrent Goal Selection: When the complete learning task is made up of different goals that have to be addressed concurrently, the learning system can first be allowed to train on each of the subgoals separately. An example of this kind of task is animal survival: a rabbit has to handle the goal of collecting food concurrently with the goal of avoiding predators.

2 The unfortunate fact that the number of parameters that need to be learned grows exponentially with the size of the problem.

The existing work on hierarchical reinforcement learning is extensive and a full overview of the topic will not be presented here. Interested readers are referred to the overviews by Kaelbling et al. (1996) or Barto and Mahadevan (2003). Most of the work on hierarchical reinforcement learning has been done on sequential division and temporal abstraction (Sutton et al., 1999; Parr and Russell, 1997; Dietterich, 2000). Both of these approaches require the subgoals to have a procedural nature, i.e., one subgoal can be achieved after another has been completed (not necessarily in a strict order) and each subgoal has a clear termination condition. Having the Q-learning agent build a policy for each subgoal leads to a number of “macro-actions” which are an abstract extension of single step actions and which can be used to simplify the policy of the agent for the complete problem. However, this does impose a restriction on the types of problems to which these techniques can be applied. The concurrent goal setting includes learning problems where subgoals do not have termination conditions, i.e., where all subgoals have to be pursued during the entire execution.
W-learning (Humphrys, 1995) is one of the few hierarchical reinforcement learning techniques that does deal with multiple parallel goals. The multiple
goals are handled by generating a separate Q-learning agent for each goal. The policies of the different agents are combined by attributing a weight factor to each (state, agent) combination that determines the importance of following a certain agent’s advice in the given state. The weight values are computed based on which agent is likely to suffer the most if its advice is disregarded.
Figure 9.4: The optimal action for the combination of both goals (shown in white) is different from either of the two optimal actions for each goal considered separately (shown in grey).

The downside of W-learning is that only actions that are optimal for one of the subgoals will ever be chosen. Consider the robot in Figure 9.4. To survive, the robot has to accomplish two tasks: it wants to reach the oil can on the left of the picture, but it needs to avoid contact with the magnet that can trap it when it gets too close. Two reinforcement learning agents inside the robot will suggest the two grey arrows as optimal actions. To reach the oil can, the robot will select the left grey action; to avoid the magnet, the robot will prefer the action represented by the right grey arrow. No matter which subagent is chosen as most important, the actual optimal action, represented by the white arrow, will never be chosen.
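The sketch below restates this limitation in Python. Each subagent nominates only its own greedy action and a weight decides whose nomination is executed, so a compromise action that is second-best for every subgoal (the white arrow in Figure 9.4) can never be selected. The function and variable names, and the form of the weights, are assumptions used purely for illustration and are a simplification rather than the original W-learning formulation.

# Illustrative sketch of weight-based arbitration between subgoal agents
# (a simplification of W-learning, not its original formulation).
def arbitrated_action(state, actions, agent_q_functions, agent_weights):
    # Each subagent nominates its own greedy action for the current state.
    nominations = {
        name: max(actions, key=lambda a: q(state, a))
        for name, q in agent_q_functions.items()
    }
    # The subagent with the highest weight in this state wins outright, so the
    # executed action is always optimal for exactly one subgoal; an action that
    # is merely a good compromise for all subgoals is never considered.
    winner = max(nominations, key=lambda name: agent_weights(state, name))
    return nominations[winner]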
9.4
Concurrent Goals and RRL-tg
The result of relational reinforcement learning (as with other Q-learning techniques) is some form of Q-function that is usually translated into a policy. However, letting RRL learn on a subgoal of the original problem will result in a Q-function that gives an indication of the quality of a (state, action) pair for that subgoal. This Q-function not only holds an indication of which actions are optimal, but also includes information about non-optimal but reasonable actions. To make use of the information contained in the subgoal Q-functions, the predictions made by these Q-functions can be used in the description of the Q-function for the entire problem.
Through the use of background knowledge and an adjusted language that defines the available splitting criteria, tg can make use of almost any information that is available about the states or actions. This can include the Q-value information learned for the different subgoals. A simple use of the Q-value predictions for each of the subgoals is to compare these values to constants or to each other. On top of this, tg can be allowed to add, negate or perform other computations on these values. The simple addition of the Q-values predicted for the different subgoals could, for example, emphasize actions that are reasonable for most subgoals, catastrophic for none and, as a consequence, close to optimal for the complete task.
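As a minimal sketch of one such combination, the snippet below scores every action by the sum of the Q-values predicted by the two subgoal Q-functions; the function names are hypothetical, and in the actual system these predictions are made available to tg as background predicates that it can compare and combine when choosing splitting tests.

# Hedged sketch: combining learned subgoal Q-functions for the complete task.
# q_collect and q_avoid stand for the Q-functions learned on the two Digger
# subgoals; summing them favours actions that are reasonable for both subgoals
# and catastrophic for neither.
def combined_greedy_action(state, actions, q_collect, q_avoid):
    return max(actions, key=lambda a: q_collect(state, a) + q_avoid(state, a))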
9.5
Experiments in the Digger Game
The actual coupling between the RRL system and the Digger game was done by Jeroen Nees in the context of his master’s thesis. For reasons beyond his (and my) control, this coupling remained unstable and unfortunately quite slow, which made it impossible to run an extensive set of experiments.
The rib and kbr algorithms will not be used in the Digger experiments. rib requires a first order distance and kbr a kernel to be defined on the (state, action) pairs of the presented learning problem. The definition of such a distance or kernel is non-trivial and, as a consequence, they were not developed.
The rewards in the Digger game are distributed as follows: 25 points for eating an emerald (and an extra 250 points for eating 8 emeralds in a row), 250 points for shooting a monster and -200 points for dying. RRL is trained and tested on all 8 standard Digger levels. Figure 9.5 shows the average learning results for RRL over 10 test runs. The y-axis displays the average reward obtained by the learned strategies over 640 Digger test games divided over the first 8 different Digger levels.
In tabula rasa form, RRL reaches an average performance of almost 600 points per level. It is hard to give the maximum number of points that can be earned, as this amount differs per level. It can be said that RRL performs worse than human players, although it does succeed in finishing the easiest level. Human players will score many more points, though, as RRL finishes the level simply by eating all available emeralds. Human players who want to maximize their score will eat all but one of the emeralds and then finish the level by shooting all monsters. This behavior is never exhibited by the RRL system, although it does occasionally succeed in killing a monster (and even makes a detour to do so).
Figure 9.5: The performance of RRL on the Digger game, both with and without guidance. The plot shows the average total reward against the number of learning episodes for tabula rasa learning and for learning with 5 and with 20 guided traces.
9.5.1
Bootstrapping with Guidance
In a second step, the learned strategy was used to provide RRL with a few guided episodes. The point of this experiment was to see whether RRL could improve on its own strategy. Figure 9.5 also shows the learning performance of RRL with 5 and with 20 guided traces and illustrates that RRL is indeed capable of improving its own learned strategy, although the resulting increase in performance is limited. A second iteration did not result in any further improvements. To reach higher levels of performance, a better set of splitting criteria is needed together with, most importantly, a substantially larger set of learning episodes.
Care should be taken when comparing the learning graphs of these approaches. While the system that is given guidance could be regarded as having more episodes to learn from, because it profits from earlier experience, this would not be entirely fair, as it also has to start building the Q-function from scratch. With tg, this means that a sufficient number of (state, action) pairs needs to be collected before a Q-tree can be built. However, the extra learning episodes could impose an extra cost when experimentation is not free.
9.5.2
Separating the Subtasks
For the experiment shown in Figure 9.6, RRL was first given the opportunity to learn to collect emeralds and to avoid or shoot monsters separately, using the ideas for hierarchical reinforcement learning for concurrent goals. RRL was allowed to train for 100 episodes on each of the subtasks. Each of the learned Q-functions was then added to the background knowledge used by the tg algorithm, and tg was permitted to compare the predicted Q-values to constant values and to each other. It was also allowed to add the two predicted values together or to negate one of the values before comparing them.

Figure 9.6: The performance of RRL both with and without hierarchical learning. The plot shows the average total reward against the number of learning episodes for regular RRL and for hierarchical RRL.

As shown in the learning curve, the extra information allows RRL to increase its performance faster than without it. However, the resulting level of performance is only comparable to the level reached without the added features; it does not improve on it. The same remark as for the guided experiment holds here: while the hierarchical learner profits from previous experience and could be regarded as having an extra 200 learning episodes, this is not entirely fair because tg needs to build a Q-function from scratch when the learning experiment is started. However, the extra learning episodes should be considered when the learning curves are compared.
9.6
The Tetris Game
Tetris3 is probably one of the most famous computer games around. It was designed by Alexey Pajitnov in 1985 and has been ported to almost every platform available, including most consoles. Tetris is a puzzle video game played on a two-dimensional grid, as shown in Figure 9.7. Differently shaped blocks fall from the top of the game field and fill up the grid. The object of the game is to score points while keeping the blocks from piling up to the top of the game field. To do this, one can move the dropping blocks right and left or rotate them as they fall. When one horizontal row is completely filled, that line disappears and the player scores a point. When the blocks pile up to the top of the game field, the game ends. The fallen blocks on the playing field will be referred to as the wall.

3 Tetris is owned by The Tetris Company and Blue Planet Software.

Figure 9.7: A snapshot of the Tetris game.

A playing strategy for the Tetris game consists of two parts. Given the shape of the dropping block, one has to decide on the optimal orientation and location of the block in the game field. This can be seen as the strategic part of the game and deals with the uncertainty about the shape of the blocks that will follow the present one. The other part consists of using the low level actions — turn, moveLeft, moveRight, drop — to reach this optimal placement. This part is completely deterministic and can be viewed as a rather simple planning problem. The RRL system will only be tested on the first, and most challenging, subtask.
Finding the optimal placement of a given series of falling blocks is an NP-complete problem (Demaine et al., 2002). Although the optimal placement is not required to play a good game of Tetris, the added difficulty of dealing with an unknown series of blocks makes it quite challenging for reinforcement learning, and Q-learning in particular, and a suitable application to test the limitations of the RRL system. There exist very good artificial Tetris players, most of which are hand built. The best of these algorithms score about 500,000 lines on average when they only use information about the falling block, and more than 5 million lines when the next block is also considered. The results in this chapter will be nowhere near this high and will be low even by human standards. However, the experiments shown will illustrate the capabilities of, and the difficulties still faced by, the RRL system, and possibly by Q-learning algorithms in general.
Compared to the Digger game, Tetris is a lot harder for the RRL system. This is largely due to the shape of the Q-function in Tetris. Where the Digger game was very object and relation oriented (having to deal with emeralds and
monsters and the distances between them), Tetris players need to focus on the shape of the wall in the game field, which evolves more chaotically. Although certain relational features exist (such as canyons, i.e., deep and narrow depressions in the wall on the playing field), the exact value of a Tetris state and/or action seems hard to predict.

Figure 9.8: Greedily taking the scoring action and dropping the block in the canyon on column 1 might lead to problems later.
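The sketch below shows how such wall features might be computed from a grid representation; the grid layout (a list of boolean columns) and the exact hole and canyon definitions are assumptions for illustration, and edge columns are ignored in the canyon count for simplicity. Features of this kind reappear as splitting tests and distance features in Section 9.6.3.

# Hedged sketch of the wall features referred to in this section. The wall is
# assumed to be a list of columns of booleans, True meaning an occupied cell,
# with index 0 at the bottom of the playing field.
def column_height(column):
    return max((i + 1 for i, filled in enumerate(column) if filled), default=0)

def wall_features(wall):
    heights = [column_height(col) for col in wall]
    holes = sum(1 for col in wall
                  for i, filled in enumerate(col)
                  if not filled and i < column_height(col))
    # A canyon of width 1: a column at least two cells lower than both neighbours.
    canyons = sum(1 for i in range(1, len(heights) - 1)
                    if heights[i - 1] - heights[i] >= 2
                    and heights[i + 1] - heights[i] >= 2)
    return {"max_height": max(heights), "min_height": min(heights),
            "avg_height": sum(heights) / len(heights),
            "holes": holes, "canyons_width_1": canyons}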
9.6.1
Q-values in the Tetris Game
The stochastic nature of the game (i.e., the unknown shape of the next falling block), in combination with the chaotic nature of the Tetris dynamics, makes it very hard to connect a Q-value to a given (state, action) pair. It is very hard to predict the future rewards starting from a given Tetris state. The state shown in Figure 9.8, for example, can quickly lead to a reward of 2, but the creation of a hole in column 1 could eventually lead to problems, depending on the blocks to come, for example when no block can be found to fill the small canyon on column 6.
The development of the Tetris wall is also quite chaotic. The shape of the wall that results from dropping the block of Figure 9.8 into the canyon on column 1 is barely related to the shape that results when the block is dropped almost anywhere else on the game field. Not only does the height of the wall decrease by deleting two lines, but the resulting shape of the top of the wall is also quite different, having lost the canyon on column 1, which was one of its most defining features. This chaotic behavior of the shape of the Tetris wall also makes the Q-values in the Tetris game very hard to predict.
9.6.2
Afterstates
The Tetris game can easily benefit from the use of afterstates, as discussed in the intermezzo on page 26. It is very natural for human players to decide where to place a Tetris block by quickly predicting what the resulting game field would look like after the chosen action is performed. This allows the player to evaluate a number of features of the resulting state, such as:
• Are any lines erased?
• What is the shape of the top of the wall after the block has landed? Will it fit a large variety of blocks that might follow?
• Are any new holes created or are some old ones deleted?
• How many canyons will (still) exist and what is their depth or width?
The state resulting from dropping a block can only be partially computed, because of the random selection of the next falling block. Regarding this selection of the next block as a counter move of the environment casts the problem into the afterstates setting. Calculating the resulting state (without the next block) and predicting the reward that accompanies the chosen action are fairly easy, and this allows RRL to calculate the Q-value of a (state, action) pair as

Q_aft(s, a) = r_pred(s, a) + V̂(δ_pred(s, a))

The use of the afterstate prediction therefore reduces the need for regression on (state, action) pairs to just predicting the value of a state. For the two (state, action) pairs in Figure 9.9, this means that they will both use the same state for the utility prediction by the chosen regression algorithm. The use of afterstates does not remove the unpredictability of the long term reward in the Tetris environment, however, and the associated difficulties in calculating the utility of states remain.
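A minimal sketch of this afterstate computation is given below. The helper functions predict_reward, predict_afterstate and v_hat are assumptions standing in for, respectively, the predicted immediate reward, the deterministic simulation of the drop (without the unknown next block), and the learned state-value approximation.

# Hedged sketch of Q-value computation via afterstates:
#   Q_aft(s, a) = r_pred(s, a) + V_hat(delta_pred(s, a))
# All helper names are assumptions, not part of the actual RRL implementation.
def q_afterstate(state, action, predict_reward, predict_afterstate, v_hat):
    reward = predict_reward(state, action)          # lines erased by this drop
    afterstate = predict_afterstate(state, action)  # wall after the drop, next block unknown
    return reward + v_hat(afterstate)

def greedy_placement(state, actions, predict_reward, predict_afterstate, v_hat):
    # Regression is only needed on states (via v_hat), not on (state, action) pairs.
    return max(actions, key=lambda a: q_afterstate(
        state, a, predict_reward, predict_afterstate, v_hat))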
9.6.3
Experiments
All three regression systems were tested on the Tetris application. tg was given a language that included tests such as:
• What is the maximum, average and minimum height of the wall?
• What is the number of holes?
• What are the differences in height between adjacent columns?
• Are there canyons (of width 1 or width 2)? How many are there?
• Does the next block fit? What is the lowest fitting row for the next block?
• How many points can the next block score?
• What is the lowest row the next block can be dropped on?
To turn the numerical values into the binary tests needed by the tg system, the values were compared to a number of user defined constants or to other related values, e.g., the average height to the lowest row where the next block can be dropped. With this language and 10% guided learning episodes (1 guided episode every 10), RRL-tg learned to delete around 10 rows per game after about 5000 learning games, averaged over 10 learning experiments. The resulting Q-trees were quite large, with an average of approximately 800 leaves. These results do not improve on previously reported results of RRL-tg on the Tetris game (Driessens and Džeroski, 2002a; Driessens and Džeroski, 2002b; Driessens and Džeroski, 2004), where afterstates were not used.

Figure 9.9: Both (state, action) pairs at the top of the figure result in the same afterstate shown at the bottom. No next block is shown, as the shape of this block is stochastic and considered as the counter move of the Tetris environment.

rib was given a Euclidean distance using a number of state features, thus basically working with a propositional representation of the Tetris states. These features were:
• The maximum, average and minimum height of the wall and the differences between the extremes and the average.
• The height differences between adjacent columns.
• The number of holes and canyons of width 1.
Also presented with 10% guided learning episodes, RRL-rib learned to remove an average of 12 lines per game after around 50 learning episodes, i.e., with only 5 guided traces. It could not, however, improve on that policy during another 450 learning games. This behavior was observed in each of the 10 learning experiments.
The covariance function needed for RRL-kbr was supplied by an inner product of feature vectors such as those used in the rib distance, thereby again working with a propositional representation of the problem. The RRL-kbr system very quickly learned to delete an average of 30 to 40 lines per game after only 10 to 20 learning games, but unlearned this strategy after another 20 to 30 learning games. To predict the value of a new example, the covariance matrix C needs to be inverted. This inversion can in theory be impossible, as there is no guarantee that the covariance matrix is non-singular. This is in particular a problem when many learning examples reside in a low-dimensional subspace of the feature space related to the used kernel, which is likely to happen in the Tetris application, as the used feature space has a rather low dimension. The current solution for this problem is to add a small multiple ε of the identity matrix to the covariance matrix to ensure that it is of full rank. This means using C + εI in the computations instead of C. Unfortunately, this is not always sufficient: if ε is large, the matrix C + εI differs too much from the real covariance matrix C, while if ε is small, computing the inverse (C + εI)^{-1} becomes numerically very unstable. It is this numerical instability that causes RRL-kbr to unlearn its policy.
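A small numerical sketch of this regularisation is given below; the toy kernel matrix and the choice of ε are assumptions used only to illustrate the trade-off between distorting the covariance matrix and keeping the inversion numerically stable.

import numpy as np

# Hedged sketch of Gaussian-process prediction with a jitter term: C + eps*I is
# used instead of the (near-)singular covariance matrix C. The kernel values
# below are toy numbers, not the graph-kernel covariances used by RRL-kbr.
def gp_predict(K, k_star, targets, eps=1e-6):
    C = K + eps * np.eye(K.shape[0])        # regularised covariance matrix
    alpha = np.linalg.solve(C, targets)     # avoids forming the explicit inverse
    return k_star @ alpha                   # predicted value for the new example

# Two almost identical training examples make K nearly singular: with eps too
# small the solve becomes unstable, with eps too large the prediction degrades.
K = np.array([[1.0, 0.999999],
              [0.999999, 1.0]])
k_star = np.array([1.0, 0.999999])
targets = np.array([0.5, 0.5])
print(gp_predict(K, k_star, targets))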
9.6.4
Discussion
None of the three regression algorithms fully exploited its relational capabilities. All representations used were in essence propositional. In contrast to the Digger game, Tetris does not seem to be a truly relational problem. Most existing work on artificial Tetris players uses numerical attributes to describe the shape of the game field.
The results on the Tetris game with RRL in its current form are a bit disappointing. For example, a strategy which always selects the lowest possible dropping position for the falling block already succeeds in deleting 19 lines on average per game. Only RRL-kbr succeeds in doing better (before it runs into problems). These results are presumed to be caused by the difficulty of predicting the future reward in the Tetris game. Even human Tetris experts, who often agree on the action that should be chosen in a given Tetris state, will have a hard time predicting how many lines will be deleted using the next 10 blocks. These difficulties are illustrated by the RRL-tg experiments and the large sizes of the learned Q-trees. Since at least a good estimate of the future cumulative reward is necessary for any Q-learning algorithm, Q-learning (or any derived learning algorithm such as the RRL system) is probably not the appropriate learning technique for the Tetris game.
This illustrates one of the problems still faced by the RRL system and indicates some directions for further work. The RRL system still relies strongly on learning a good Q-function approximation. The use of a regression algorithm allows RRL to generalize over different (state, action) pairs, but not over different Q-values. Other reinforcement learning techniques applied to the Tetris game (Bertsekas and Tsitsiklis, 1996; Lagoudakis et al., 2002) use a form of approximate policy iteration. These techniques perform much better than the RRL system, generating policies that delete 1000 lines per game or more. The advantage of these algorithms seems to lie in the iterative improvement nature of the policy iteration algorithm. Instead of having to learn a correct utility or Q-value, the policy improvement step only relies on an indication of which action is better than other actions to build the next policy. Therefore, these approaches seem to perform some kind of advantage learning. Applying an appropriate version of the relational approximate policy iteration technique of Fern et al. (2003) would probably yield comparable results. Also, the use of policy learning as described by Džeroski et al. (2001) generates a Q-value independent description of a policy and might yield better results on tasks such as the Tetris game.
9.7
Conclusions
Digger was the first “larger” application that the RRL system was tested on. Although Digger is still a rather simple game by human standards, the large number of objects present on the game field, the total number of different possible states and the availability of different levels would make the Digger game a very difficult task for non-relational reinforcement learning. Although the RRL learning system never reached a human playing standard, it was able to learn “decent” behavior and even to finish the first (and easiest) level. This illustrates that RRL is able to handle large problem domains compared to regular reinforcement learning.
This chapter also introduced a new hierarchical reinforcement learning technique that can be used to learn in an environment with concurrent goals. The technique relies on the ability of the tg algorithm to include background information on top of the (state, action) pair description to build a Q-function. The suggested approach first lets RRL build a Q-function for each of the present subgoals and then allows tg to use the Q-value predictions during the construction of the Q-function for the entire learning problem. RRL was able to bootstrap itself by using a previously learned policy as guidance and was able to improve upon its own behavior. Using the hierarchical learning technique for concurrent goals leads to faster learning, but not to a higher level of performance.
On the Tetris game, the performance of RRL was a bit disappointing. This seems to be caused by the difficulty of predicting the Q-values connected to the Tetris game. While other reinforcement learning techniques which apply some kind of advantage learning are probably better suited to deal with this kind of application, the behavior of the RRL system on the Tetris task provided some insights into the usability of RRL and of Q-learning in general. In further work, a more extended language could be written for tg, or a relational distance or kernel could be designed for rib and kbr, which could possibly lead to better performance. However, it does not seem feasible for RRL to reach human expert level performance in the current setup. More external help will be necessary to build a complex strategy such as the one used by humans to maximize the number of points earned per level.
Part IV
Conclusions
Chapter 10
Conclusions

“I’m sorry, if you were right, I would agree with you.”
Awakenings
10.1
The RRL System
This work presented the first relational reinforcement learning system. Through the adaptation of a standard Q-learning algorithm with Q-function generalization and the use of different incremental, relational regression algorithms, an applicable relational reinforcement learning technique was constructed.
Three new incremental and relational regression techniques were developed. A first order regression tree algorithm, tg, was designed as the combination of two existing tree algorithms, i.e., the Tilde algorithm and the G-tree algorithm. The tree building algorithm was made incremental through the use of performance statistics for all possible extensions in each of the leaves of intermediate trees. Based on the ideas of instance based learning, a relational instance based regression algorithm was developed. Different data management techniques were designed, based on different error-related example selection criteria. One example selection technique — based on maximum Q-value variation — was built on the specific dynamics of Q-learning algorithms. A third regression algorithm was based on Gaussian processes and graph kernels. Through the use of graph representations for states and actions and a kernel based on the number of walks in the product graph, the well defined statistical properties of Gaussian processes can be used for regression in the RRL system.
Although the three discussed regression algorithms were developed with their application in the RRL system in mind, most of the developed systems can
be used for any supervised relational learning problem as well, especially for tasks with a continuous prediction target.
Two additions were made to the RRL system to increase its applicability to large problems. Adding guidance to the exploration phase of the RRL system, based on an available reasonable policy, can greatly increase the performance of the RRL system in applications with sparse and hard to reach rewards. Several modes of guidance were tested, ranging from supplying all the guidance at the beginning of learning to spreading out the guidance and interleaving it with normal exploration. Although the different regression algorithms react differently to different kinds of guidance, overall, guidance improves the performance of the RRL system significantly.
A second addition to the RRL system was a new hierarchical reinforcement learning approach that can be used for concurrent goals with competing actions. In this hierarchical approach, the RRL system is first allowed to train on subtasks of the complete reinforcement learning task. When RRL is confronted with the original task, it is supplied with the learned Q-functions as background knowledge and given the ability to compare the predictions made by these functions to constants and to each other. Through the use of this information, the RRL system can increase its performance on the complete problem more quickly.
10.2
Comparing the Regression Algorithms
The three regression algorithms were compared on different applications, varying from different tasks in the blocks world to the computer games Digger and Tetris. Although there are differences in the performance of the three systems, it is remarkable how comparable their performance is. The differences in performance between the three algorithms are a result of the general characteristics of the families of algorithms the three engines belong to, i.e., model building algorithms (more specifically tree building algorithms) for the tg algorithm, instance based algorithms for RRL-rib and statistical methods for the kernel based system.
In general, decision trees perform well when dealing with large amounts of data. Through the use of their divide and conquer strategy, they can process large numbers of learning examples quickly. They generalize well when enough training data is available to build an elaborate tree. However, when only a small number of learning examples is available, or in the case where the learning examples are not evenly distributed over the space of values that need to be predicted, decision trees can perform poorly and are likely to over-generalize. These characteristics are of course amplified in an incremental implementation of a tree learning algorithm. The need for large amounts of learning examples can be recognized in the fact that the tg algorithm usually needs more learning
episodes to reach the same level of performance as the other two algorithms. However, when exploration costs are not an issue, this is largely made up for by the much higher processing speed of the tg algorithm. The vulnerability of decision trees to skewed learning data distributions surfaces in the difficulties that the tg algorithm has with the guidance mode in which all guidance is provided at the start of learning.
The tg algorithm’s greatest advantages lie in its construction of a declarative Q-function and its ability to easily incorporate extra domain knowledge in its Q-function description. The hierarchical approach for concurrent goals, for example, relies on the fact that tg is used as the regression engine. Although it is not entirely infeasible to use RRL-rib or RRL-kbr in this case, it would not be as convenient and elegant as with the tg algorithm. The declarative Q-function that results from the tg algorithm will also aid the introduction of reinforcement learning into software development and real world (or industrial) problems, as software designers can study and comprehend the Q-function that is learned and will be used as a policy in the software system. This facilitates verification of the behavior of the resulting system in extraordinary cases.
When only small numbers of learning examples are available or when cautious generalization is needed (for example in cases where the distribution of the learning data is skewed with respect to the space of values that need to be predicted), instance based and statistical methods perform better than decision trees. Through the use of an appropriate example selection method, instance based regression can be tuned to take full advantage of learning data with a very low percentage of informative content. This can be a great advantage in tasks that supply noisy data, such as Q-learning at the start of a learning experiment. With examples whose Q-values are based on very inaccurate predictions, instance based regression can be made to remember mostly examples with a high information content. This behavior results in very good performance of the RRL-rib system in the early stages of experiments in applications with sparse and hard to reach rewards. RRL-rib performs extremely well when it is supplied with guidance at the start of learning, as it will try to remember high-yielding examples and can exploit that knowledge when it generates a policy. The major drawbacks of instance based regression are computational. The evaluation of a relational distance is usually computationally very intensive, and when the shape of the Q-function forces the rib system to store a large number of examples, this can drastically increase the time needed for learning (through the computation of the example selection criteria) and for prediction. Also, the RRL-rib system relies on the availability of an appropriate relational distance between different state-action combinations. The design and implementation of such a distance can be non-trivial.
The statistical properties of the kernel based system make it the most informative regression algorithm of the three. Its basis in Bayesian statistics allows
it to make accurate predictions with limited amounts of learning data as well as to supply informative meta-data such as the confidence of the predictions made. This additional data can be very useful for the reinforcement learning algorithm in the RRL system. The high prediction accuracy of the kbr regression algorithm can be witnessed in the overall high performance of the RRL-kbr system. However, it must be noted that it suffers more from uninformative data than rib, as shown in the experiments where guidance is provided at the start of learning. In these experiments kbr is the quickest of all regression algorithms to build a well performing policy, but it is prone to forget the learned strategy during further uninformative exploration. Comparable to the rib system, RRL-kbr relies on the availability of a kernel between two state-action combinations. As with relational distances, the definition of such a kernel can be quite complex. Also, since the kbr system has currently not been supplied with any appropriate example selection criterion, processing large numbers of examples becomes a problem and can drastically increase the time complexity of learning and prediction.
The fact that both rib and kbr require fewer training examples to generate good predictions makes them well suited for applications with high exploration costs. The resulting Q-functions are, however, not directly suited for later interpretation.
10.3
RRL on the Digger and Tetris Games
The RRL system was tested on the computer games Digger and Tetris. The Digger game displays an environment that would be very difficult to handle with standard Q-learning or with Q-learning with propositional function generalization. The Digger game field is filled with a large number of objects (emeralds and monsters) and has a structure (tunnels) that changes during game-play. It also offers 8 different levels, on which RRL was able to learn and perform using a single Q-function generalization. Although the results of the RRL system on the Digger game do not reach human level performance, they do show the increased applicability that relational reinforcement learning has brought to the Q-learning approach.
The results on the Tetris game were a bit disappointing and illustrated the limitations of the current RRL system. None of the regression algorithms was able to handle the difficulties in learning the Q-function connected to the Tetris game very well. The Q-learning approach used in the RRL system does not seem suited to learning to play a chaotic game like Tetris. Other reinforcement learning techniques which do not rely on an accurate modeling of the Q-function will probably be more appropriate.
10.4
The Leuven Methodology
The development of a relational reinforcement learning system was only a small next step in the general philosophy of the machine learning research group in Leuven, which is to upgrade propositional learners to a relational setting. This upgrading methodology is based on the learning from interpretations setting introduced by De Raedt and Džeroski (1994). The same strategy had already been followed to develop a series of first order versions of machine learning algorithms, including a decision tree induction algorithm called Tilde (Blockeel and De Raedt, 1998; Blockeel, 1998), a first order frequent subset and association rule discovery system Warmr (King et al., 2001; Dehaspe and Toivonen, 1999; Dehaspe, 1998), a first order rule learner ICL (Van Laer, 2002) and first order clustering and instance based techniques (Ramon and Bruynooghe, 2001; Ramon, 2002). The methodology followed has also been formalized into a process presented in (Van Laer and De Raedt, 1998; Van Laer, 2002). A comparable approach was followed to develop Bayesian Logic Programs (Kersting and De Raedt, 2000) and Logical Markov Decision Programs (Kersting and De Raedt, 2003). Ongoing research includes first order sequence discovery (Jacobs and Blockeel, 2001) and relational neural networks (Blockeel and Bruynooghe, 2003).
10.5
In General
In their invited talks at IJCAI’97 (in Nagoya, Japan), both Richard Sutton and Leslie Pack Kaelbling challenged machine learning researchers to study the combination of relational learning and reinforcement learning. This thesis presents a first answer to these challenges. The work presented has very little theoretical content and instead focussed on the development and application of an applicable relational reinforcement learning system. Although the applications of the RRL system have been limited to toy examples and computer games so far, the demonstrated possibilities of the relational reinforcement learning approach have sparked a large interest in the research field. This has given rise to research into the theoretical foundations of relational reinforcement learning and into relational extensions of reinforcement learning approaches other than Q-learning (see Section 4.6). In a larger context, the emergence of the relational reinforcement learning research field has helped to renew the interest in relational learning and related topics such as stochastic relational learning.
Chapter 11
Future Work

“They made us too smart, too quick, and too many.”
A.I.

This chapter discusses some directions for future work. The field of relational reinforcement learning is still very young and a great variety of topics are waiting to be investigated. However, this chapter will focus on some of the extensions and improvements that can be made to the RRL system.
11.1
Further Work on Regression Algorithms
In each of the regression algorithm chapters, a number of ideas were presented for further development of the different systems; they are briefly repeated here.
The tg algorithm currently lacks a way of restructuring the tree once it discovers that the initial choices it made, i.e., the tests that were chosen at the top of the tree, were non-optimal. The first order setting of the tg algorithm prevents the simple approach used in propositional algorithms of storing statistics on all possible tests in each internal tree node and restructuring the tree according to these when necessary. However, it does seem possible to exploit at least parts of the previously built tree when the regression algorithm discovers it made a mistake earlier. Other extensions of the tg algorithm include building forests instead of single trees, where later trees try to model the prediction errors made by the previously built trees. Also interesting is the addition of aggregation functions to the language bias used by the tg system.
The improvements suggested for the rib regression algorithm are mainly focussed on the reduction of the computational requirements. These range from incrementally computable distances to the use of a partitioning of the state-action
space to reduce the number of stored examples that a new, to-be-predicted example needs to be compared with.
The kbr system can also benefit from the same example selection or state-action space partitioning as the rib system. However, the probability distributions predicted by the Gaussian processes also allow for the design of other example selection methods. The probabilities predicted by the Gaussian processes can also be used to guide exploration.
A more elaborate change to the suggested regression algorithms would be the combination of the model building approach of the tg system with the example driven approaches of the rib and kbr systems. This would allow the tg system (or any other model building approach) to make a coarse partitioning of the state-action space, and the example driven approaches to build a well fitting local Q-function approximation. The partitioning made by the model building algorithm would reduce the number of examples that need to be handled simultaneously by the example driven approach.
One still open problem in regression algorithms is a system that can handle uncertainty in the state and action description. None of the three proposed algorithms is currently able to handle probabilistic state or action information. One possible direction that can be investigated in this context is first order extensions of neural networks. Neural networks are good at handling numerical values, which can be used to represent the probabilities connected to state and action features. A few preliminary ideas on first order neural networks can be found in (Blockeel and Bruynooghe, 2003).
For the tg algorithm, it is also possible to interpret the resulting Q-function, i.e., the resulting regression tree. Due to the declarative nature of the tg algorithm, it might be possible to allow the user of the RRL system to intervene in the tree building process when he or she discovers a tree structure that contradicts his or her intuition. This intervention can in the first place be made by changing the language bias that is used by the tg algorithm, but it can also be extended to forcing tg to rebuild certain parts of the regression tree.
It is possible to use the maximum variance parameter of the rib system to limit the number of examples stored in the database, regardless of the exact value of the maximum variance of the Q-function with respect to the defined (state, action) pair distance. It could be possible to automatically tune this parameter for a given level of Q-function approximation or a given performance level that should be reached.
Although the regression algorithms were designed with their use in the RRL system in mind, their applicability is not limited to regression for Q-learning alone. Almost all systems could be used for regular relational regression tasks without any changes. Future work will certainly include the evaluation of the three systems on regular relational regression problems. The added feature that the algorithms can deal with incremental data might be
unnecessary in these applications.
11.2
Integration of Domain Knowledge
The integration of domain knowledge into the RRL system has, in its most obvious form, been limited to the use of guidance to help exploration and to the division of the learning problem into subgoals. Of course, the definition of the language bias used by tg or of the distance used by rib is also influenced by the domain knowledge of the user of the RRL system.
The use of guidance to help exploration allows for many different ways of integrating the knowledge of a domain expert. An idea still to be investigated is the closer integration of guidance strategies and the regression algorithm used. For example, when using the tg algorithm, the guidance could be adjusted to the next split that the tg system intends to make. This might alleviate some of the problems with early decisions made by the tg system. In line with observations made on human learning, the use of guidance could be delayed until RRL has had some time to explore, and the guidance could be tuned to the parts of the state space that the learning agent has the most problems with. Active guidance is a first step in this direction, but more work is needed to make it applicable to stochastic or highly chaotic environments.
However, the use of domain information in the RRL system can (and probably should) be extended beyond this. For the RRL-tg system, for example, this could include inspection of the constructed regression tree, as already stated. Recent work on relational sequence discovery tries to find frequent sequences of occurrences, possibly with gaps of different lengths between defining occurrences. Using this kind of discovery technique on a number of successful training episodes, it might be possible to discover important sequential subgoals, as they will appear in each (or at least a large number) of the episodes.
11.3
Integration of Planning, Model Building and Relational Reinforcement Learning
Further research will certainly include the integration of (partial) model building and planning under uncertainty into the RRL system. In this approach, the learning agent tries to build a partial model of its environment in the form of a set of rules that describe parts of the mechanics of the world as well as make predictions about the reward that will be received. This model can then be used to make predictions about the consequences of actions and thus add a planning component to the learning agent. The combination of this technique with the RRL system seems a straightforward next step. While the planning component allows the agent to look a few steps ahead instead of just using the information of the current state to choose which action to perform, the information supplied by a learned Q-function will reduce the required planning depth.
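As an illustration of what such a partial, rule-based model could look like in the blocks world representation of Appendix A, consider the sketch below. The rules and the predicate names effect/3 and predicted_reward/3 are hypothetical examples, not the output of an implemented model learner; the state is assumed to be given as a list of on/2 and clear/1 facts.

% A partial model: a rule that predicts one consequence of an action
% (the moved block ends up on its destination) without describing the
% complete next state, and a rule that predicts the immediate reward.
effect(State, move(X, Y), on(X, Y)) :-
    member(clear(X), State),
    member(clear(Y), State),
    X \= Y.

% For the on(3,1) task: reaching the goal configuration is predicted to
% yield reward 1, every other action is predicted to yield reward 0.
predicted_reward(State, move(3, 1), 1) :-
    member(clear(3), State),
    member(clear(1), State).
predicted_reward(_State, _Action, 0).

A planning component can chain rules of this kind to look a few steps ahead and fall back on the learned Q-function for the states at the planning horizon.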
11.4
Policy Learning
Since the Q-values for a given task implicitly encode both the distance to and the amount of the rewards that are received, the policy that is generated from such a Q-function is often much easier to represent. It might even be easier to learn. A policy learning approach was already suggested in the introductory work on the RRL algorithm (Džeroski et al., 2001), but it was not considered further in the work done in the context of this dissertation.

Preliminary results with learning a policy from examples generated from a learned Q-function are very promising. The simpler representation of a policy compared to a Q-function also makes the policy easier to learn and makes it generalize better. This could solve the problems that the current RRL system has with applications such as Tetris, which have a Q-function that is very difficult to predict. Using the tg algorithm to build a policy allows the user to guide the search and to inspect the resulting policy afterwards. The ability to steer the learned policy in certain directions will ease the introduction of machine learning techniques into software agent applications.
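A minimal sketch of how such policy learning examples could be generated from a learned Q-function is given below. It assumes a learned approximation qhat/3 and an action generator legal_action/2; both names are hypothetical, and the sketch only illustrates the idea of labelling (state, action) pairs as optimal or non-optimal before handing them to a relational classifier such as tg.

% Label every legal action in a state as optimal or non_optimal
% according to the current Q-function approximation.
policy_example(State, Action, optimal) :-
    legal_action(State, Action),
    best_q(State, Best),
    qhat(State, Action, Q),
    Q >= Best.
policy_example(State, Action, non_optimal) :-
    legal_action(State, Action),
    best_q(State, Best),
    qhat(State, Action, Q),
    Q < Best.

% The highest predicted Q-value over the legal actions in a state.
best_q(State, Best) :-
    findall(Q, (legal_action(State, A), qhat(State, A, Q)), Qs),
    max_list(Qs, Best).

The classifier then only has to learn which actions are optimal, not how valuable they are, which is the simpler representation referred to above.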
11.5
Applications
Although the application of RRL to the Digger game was new with respect to other Q-learning applications, other, more useful applications need to be investigated. A first possibility is web-spidering. A web-spider is a software agent that is used to gather web-pages on specific topics, or even just completely new web-pages, for storage in the database of internet search engines. In this task, the goal of the agent is to limit the amount of data it needs to download and, as a consequence, to limit the number of links it needs to follow to get to the wanted information. The relational structure of the internet might make it a well suited domain for relational reinforcement learning.

Since relational reinforcement learning was designed to handle worlds with objects, another type of application that might be looked at is one situated in the real world, for example through robotics. The RRL system is better suited to handle high level tasks such as planning package movements than low level tasks such as object avoidance. Therefore, a suitable starting platform with which to interface will have to be found. For this, the work on Golog and IndiGolog seems promising (De Giacomo et al., 2002).
11.6
Theoretical Framework for Relational Reinforcement Learning
Last but not least is the development of a theoretical framework for relational reinforcement learning, such as exists for regular reinforcement learning. Although the RRL system described above has illustrated the usefulness and applicability of relational reinforcement learning, very little is known about its theoretical foundations. Very recently, a lot of attention has gone to relational representations of Markov Decision Processes (MDPs), which might yield a theoretical framework in which relational reinforcement learning can be studied. Such a study would allow a better comprehension of why, and more importantly when, relational reinforcement learning works. Eventually, such an understanding can ease the introduction of relational reinforcement learning into applications or can be applied to build better learning systems.
Part V
Appendices
Appendix A
On Blocks World Representations
A.1
The Blocks World as a Relational Interpretation
To be able to describe the representation used in this work, a few concepts have to be introduced. The same conventions are used in this text as in the work of Flach (1994). These are standard in the Inductive Logic Programming community.
A.1.1
Clausal Logic
Names of individual entities, often objects, are called constants, and can be recognized by the fact that they are written starting with a lowercase character. Variables, which are used to denote arbitrary individuals, are written starting with an uppercase character. A term is either a constant, a variable, or a functor symbol followed by a number of terms; the number of terms behind the functor symbol is called its arity. An atom is a predicate symbol followed by a number of terms. That number is again referred to as the predicate's arity, and a predicate p with arity n is denoted as p/n. A ground atom is an atom without any variables. A literal is either an atom or a negated atom. A clause is a disjunction of literals. By grouping positive and negative literals, a clause can be written in the following format:

h1; ...; hn ← b1, ..., bm

where h1, ..., hn are the positive literals of the clause, called the head of the clause, and b1, ..., bm are the negative literals, also called the body of the clause.
The “;” symbol should be read as an or, the “,” as an and, and the ← as an if. A clause with a single positive literal is called a definite clause. A definite clause with no negative literals is called a fact.
A.1.2
The Blocks World
To represent the blocks world as a relational interpretation, the predicates on/2 and clear/1 are used. The ground fact on(a,b). represents the fact that block a is on top of block b, while clear(c). signifies that block c has no blocks on top of it. The action is represented by the move/2 predicate, i.e., move(c,a). represents the action of moving block c onto block a.
Figure A.1: An example state of a blocks world with 5 blocks. The action is indicated by the dotted arrow.

The example (state, action) pair of Figure A.1 is represented as follows:

on(1,5). on(2,floor). on(3,1). on(4,floor). on(5,floor).
clear(2). clear(3). clear(4).
move(3,2).
Through the use of definite clauses, this representation can be augmented by some derivable predicates. The predicates that were used throughout the experiments in the blocks world are the following:
eq/2 defines the equality of two blocks.

above/2 specifies whether the first argument is a block that is part of the stack on top of the second argument.

height/2 computes the height above the table of a given block.

number_of_blocks/1 gives the number of blocks in the world as a result.

number_of_stacks/1 computes the number of stacks in the blocks world state.

The definitions of these predicates are trivial; a possible set of definitions is given below. Although the clear/1 predicate could also be derived from the on/2 facts, it is included in the facts of the state representation.

When the goal the agent has to reach refers to specific blocks, this goal can be represented by adding a goal predicate and a definition of when this goal is reached. For example, the goal of putting block 3 on top of block 1 can be represented and defined as follows:

goal :- goal_on(3,1).
goal_on(X,Y) :- on(X,Y).

Other, more general goals can be represented in similar ways, but can also be left out of the representation altogether, because no specific blocks need to be referenced. For example, the task of Stacking all blocks could be represented and defined as follows:

goal :- stack.
stack :- not (on(X,floor),on(Y,floor),X\=Y).

and Unstacking as:

goal :- unstack.
unstack :- not (on(X,Y),Y\=floor).

For implementation purposes, the blocks world state is sometimes represented as a single term with three arguments. The first argument is then the list of on/2 facts, the second argument the list of clear/1 facts, and the third argument represents the goal configuration. This representation is completely equivalent to the one discussed above.
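The clauses below give one possible definition of these derived predicates, assuming the on/2 facts above. They are only meant to illustrate how simple the definitions are; the exact clauses used in the experiments may differ in details such as the height assigned to a block that lies on the floor.

eq(X, X).

% The first argument is part of the stack on top of the second argument.
above(X, Y) :- on(X, Y), Y \= floor.
above(X, Y) :- on(X, Z), Z \= floor, above(Z, Y).

% Height above the table, counting a block on the floor as height 1.
height(X, 1) :- on(X, floor).
height(X, H) :- on(X, Y), Y \= floor, height(Y, HY), H is HY + 1.

% Every block occurs exactly once as the first argument of an on/2 fact.
number_of_blocks(N) :- findall(X, on(X, _), Blocks), length(Blocks, N).

% Every stack has exactly one block on the floor.
number_of_stacks(N) :- findall(X, on(X, floor), Bottoms), length(Bottoms, N).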
A.2
The Blocks World as a Graph
Figure A.2 shows the graph representation of the blocks world (state, action) pair of Figure A.1. The vertices of the graph correspond either to a block, the floor, or ‘clear’, where ‘clear’ basically denotes ‘no block’. This is reflected in the labels of the vertices. An edge labelled ‘on’ (solid arrows) between two vertices labelled ‘block’ denotes that the block corresponding to its initial vertex is on top of the block corresponding to its terminal vertex. An edge labelled ‘on’ that starts from the vertex labelled ‘clear’ represents the fact that the block corresponding to its terminal vertex has no blocks on top of it. An edge labelled ‘on’ that terminates in the vertex labelled ‘floor’ signifies that the block corresponding to its initial vertex is on the floor. The edge labelled ‘action’ (dashed arrow) denotes the action of putting the block corresponding to its initial vertex on top of the block corresponding to its terminal vertex; in the example, “move block 3 onto block 2”. The labels ‘a1’ and ‘a2’ denote the initial and terminal vertex of the action, respectively, representing the fact that the block corresponding to the ‘a1’ label is moved onto the block represented by the vertex with the ‘a2’ label.
Figure A.2: The graph representation of the blocks world (state, action) pair of figure A.1.
To represent an arbitrary blocks world as a labelled directed graph, one proceeds as follows. Given the set of blocks numbered 1, ..., n and the set of stacks 1, ..., m:

1. The vertex set V of the graph is {ν0, ..., νn+1}.
2. The edge set E of the graph is {e1, ..., en+m+1}.
The node ν0 is used to represent the floor, νn+1 indicates which blocks are clear. Since each block is on top of something and each stack has one clear block, n + m edges are needed to represent the blocks world state. Finally, one extra edge is needed to represent the action. For the representation of a state it remains to define the function Ψ:

3. For 1 ≤ i ≤ n, define Ψ(ei) = (νi, ν0) if block i is on the floor, and Ψ(ei) = (νi, νj) if block i is on top of block j.
4. For n < i ≤ n + m, define Ψ(ei) = (νn+1, νj) if block j is the top block of stack i − n.

and the function label:

5. Define L as the set of all subsets of {floor, clear, block, on, a1, a2}, and
   • label(ν0) = {floor},
   • label(νn+1) = {clear},
   • label(νi) = {block} (1 ≤ i ≤ n),
   • label(ei) = {on} (1 ≤ i ≤ n + m).

All that is left now is to represent the action in the graph:

6. Define
   • Ψ(en+m+1) = (νi, νj) if block i is moved onto block j,
   • label(νi) = label(νi) ∪ {a1},
   • label(νj) = label(νj) ∪ {a2},
   • label(en+m+1) = {action}.

It is clear that this mapping from blocks worlds to graphs is injective.

In some cases the ‘goal’ of a blocks world problem is to stack blocks in a given configuration (e.g. “put block 3 on top of block 4”). This then needs to be represented in the graph. This is handled in the same way as the action representation, i.e., by an extra edge along with extra ‘g1’, ‘g2’, and ‘goal’ labels for the initial and terminal blocks and the new edge, respectively. Note that by using more than one ‘goal’ edge, arbitrary goal configurations can be modelled, e.g., “put block 3 on top of block 4 and block 2 on top of block 1”. More general goals, such as “Build a single stack”, are not represented in the graph, as no specific blocks need to be referenced.
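The mapping described above is straightforward to implement. The following sketch builds the labelled graph directly from the list representation of Section A.1; the predicate name blocks_world_graph/4 and the vertex and edge terms are hypothetical, and the sketch assumes the action moves a block onto another block rather than onto the floor.

% Vertices are Vertex-Label pairs, edges are e(From, To, Label) terms.
% The floor and 'clear' each get their own vertex; every block gets a
% vertex labelled block (extended with a1 or a2 for the action blocks).
blocks_world_graph(Ons, Clears, move(A, B), graph(Vertices, Edges)) :-
    findall(v(X)-Label,
            ( member(on(X, _), Ons),
              (   X == A -> Label = [block, a1]
              ;   X == B -> Label = [block, a2]
              ;   Label = [block]
              )
            ),
            BlockVertices),
    Vertices = [v(floor)-[floor], v(clear)-[clear] | BlockVertices],
    findall(e(v(X), v(Y), on), member(on(X, Y), Ons), OnEdges),
    findall(e(v(clear), v(X), on), member(clear(X), Clears), ClearEdges),
    append(OnEdges, ClearEdges, StateEdges),
    Edges = [e(v(A), v(B), action) | StateEdges].

Calling blocks_world_graph([on(1,5), on(2,floor), on(3,1), on(4,floor), on(5,floor)], [clear(2), clear(3), clear(4)], move(3,2), G) reproduces the graph of Figure A.2 up to the naming of the vertices.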
Bibliography
[Aha et al., 1991] D.W. Aha, D. Kibler, and M.K. Albert. Instance-based learning algorithms. Machine Learning, 6(1):37–66, January 1991.
[Aronszajn, 1950] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
[Asimov, 1976] I. Asimov. Bicentennial Man. 1976.
[Atkeson et al., 1997] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11–73, 1997.
[Bain and Sammut, 1995] M. Bain and C. Sammut. A framework for behavioral cloning. In S. Muggleton, K. Furukawa, and D. Michie, editors, Machine Intelligence 15. Oxford University Press, 1995.
[Barnett, 1979] S. Barnett. Matrix Methods for Engineers and Scientists. McGraw-Hill, 1979.
[Barto and Duff, 1994] A. Barto and M. Duff. Monte Carlo matrix inversion and reinforcement learning. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 687–694. Morgan Kaufmann Publishers, Inc., 1994.
[Barto and Mahadevan, 2003] A. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Systems, 13:41–77, 2003.
[Barto et al., 1995] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995.
[Bellman, 1961] R. Bellman. Adaptive Control Processes: a Guided Tour. Princeton University, 1961.
[Bertsekas and Tsitsiklis, 1996] Bertsekas and Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996. [Blockeel and Bruynooghe, 2003] H. Blockeel and M. Bruynooghe. Aggregation versus selection bias, and relational neural networks. In IJCAI-2003 Workshop on Learning Statistical Models from Relational Data, SRL-2003, Acapulco, Mexico, August 11, 2003, 2003. [Blockeel and De Raedt, 1998] H. Blockeel and L. De Raedt. Top-down induction of first order logical decision trees. Artificial Intelligence, 101(1-2):285– 297, June 1998. [Blockeel et al., 1998] H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees. In Proceedings of the 15th International Conference on Machine Learning, pages 55–63, 1998. [Blockeel et al., 1999] H. Blockeel, L. De Raedt, N. Jacobs, and B. Demoen. Scaling up inductive logic programming by learning from interpretations. Data Mining and Knowledge Discovery, 3(1):59–93, 1999. [Blockeel et al., 2000] H. Blockeel, B. Demoen, L. Dehaspe, G. Janssens, J. Ramon, and H. Vandecasteele. Executing query packs in ILP. In J. Cussens and A. Frisch, editors, Proceedings of the 10th International Conference in Inductive Logic Programming, volume 1866 of Lecture Notes in Artificial Intelligence, pages 60–77, London, UK, July 2000. Springer. [Blockeel, 1998] H. Blockeel. Top-down induction of first order logical decision trees. Phd, Department of Computer Science, K.U.Leuven, Leuven, Belgium, 1998. [Boutilier et al., 1999] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of AI Research, 11:1–94, 1999. [Boutilier et al., 2001] C. Boutilier, R. Reiter, and B. Price. Symbolic dynamic programming for first order MDP’s. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, 2001. [Chambers and Michie, 1969] R. A. Chambers and D. Michie. Man-machine co-operation on a learning task. Computer Graphics: Techniques and Applications, pages 179–186, 1969. [Chapman and Kaelbling, 1991] D. Chapman and L.P. Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisions. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, pages 726–731, 1991.
[Clarke, 1968] A.C. Clarke. 2001, A Space Oddyssey. 1968. [Collins and Duffy, 2002] M. Collins and N. Duffy. Convolution kernels for natural language. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. The MIT Press. [Cristianini and Shawe-Taylor, 2000] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods). Cambridge University Press, 2000. [De Giacomo et al., 2002] G. De Giacomo, Y. Lesp´erance, H. Levesque, and S. Sardi˜ na. On the semantics of deliberation in IndiGolog — from theory to implementation. In Proceedings of the 8th Conference on Principles of Knowledge Representation and Reasoning, 2002. [De Raedt and Dˇzeroski, 1994] L. De Raedt and S. Dˇzeroski. First order jkclausal theories are PAC-learnable. Artificial Intelligence, 70:375–392, 1994. [Dehaspe and Toivonen, 1999] L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3(1):7–36, 1999. [Dehaspe, 1998] L. Dehaspe. Frequent Pattern Discovery in First-Order Logic. Phd, Department of Computer Science, K.U.Leuven, Leuven, Belgium, 1998. [Demaine et al., 2002] E.D. Demaine, S. Hohenberger, and D. Liben-Nowell. Tetris is hard, even to approximate. Technical Report MIT-LCS-TR-865, Massachussets Institue of Technology, Boston, 2002. [Diestel, 2000] R. Diestel. Graph Theory. Springer-Verlag, 2000. [Dietterich and Wang, 2002] T.G. Dietterich and X. Wang. Batch value function approximation via support vectors. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. The MIT Press. [Dietterich, 2000] T.G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000. [Dixon et al., 2000] K.R. Dixon, R.J. Malak, and P.K. Khosla. Incorporating prior knowledge and previously learned information into reinforcement learning agents. Technical report, Institute for Complex Engineered Systems, Carnegie Mellon University, 2000.
[Driessens and Blockeel, 2001] K. Driessens and H. Blockeel. Learning digger using hierarchical reinforcement learning for concurrent goals. European Workshop on Reinforcement Learning, EWRL, Utrecht, the Netherlands, October 5-6, 2001, oct 2001. [Driessens and Dˇzeroski, 2002a] K. Driessens and S. Dˇzeroski. Integrating experimentation and guidance in relational reinforcement learning. In C. Sammut and A. Hoffmann, editors, Proceedings of the Nineteenth International Conference on Machine Learning, pages 115–122. Morgan Kaufmann Publishers, Inc, 2002. [Driessens and Dˇzeroski, 2002b] K. Driessens and S. Dˇzeroski. On using guidance in relational reinforcement learning. In Proceedings of Twelfth BelgianDutch Conference on Machine Learning, pages 31–38, 2002. Technical report UU-CS-2002-046. [Driessens and Dˇzeroski, 2004] K. Driessens and S. Dˇzeroski. Integrating guidance into relational reinforcement learning. Machine Learning, 2004. Accepted. [Driessens and Ramon, 2003] K. Driessens and J. Ramon. Relational instance based regression for relational reinforcement learning. In Proceedings of the Twentieth International Conference on Machine Learning, pages 123–130. AAAI Press, 2003. [Driessens et al., 2001] K. Driessens, J. Ramon, and H. Blockeel. Speeding up relational reinforcement learning through the use of an incremental first order decision tree learner. In L. De Raedt and P. Flach, editors, Proceedings of the 13th European Conference on Machine Learning, volume 2167 of Lecture Notes in Artificial Intelligence, pages 97–108. Springer-Verlag, 2001. [Driessens, 2001] K. Driessens. Relational reinforcement learning. In MultiAgent Systems and Applications, volume 2086 of Lecture Notes in Artificial Intelligence, pages 271–280. Springer-Verlag, 2001. [Dˇzeroski et al., 2001] S. Dˇzeroski, L. De Raedt, and K. Driessens. Relational reinforcement learning. Machine Learning, 43:7–52, 2001. [Dˇzeroski and Lavrac, 2001] S. Dˇzeroski and N. Lavrac, editors. Relational Data Mining. Springer, Berlin, 2001. [Dˇzeroski et al., 1998] S. Dˇzeroski, L. De Raedt, and H. Blockeel. Relational reinforcement learning. In Proceedings of the 15th International Conference on Machine Learning, pages 136–143. Morgan Kaufmann, 1998.
[Emde and Wettschereck, 1996] W. Emde and D. Wettschereck. Relational instance-based learning. In L. Saitta, editor, Proceedings of the 13th International Conference on Machine Learning, pages 122–130. Morgan Kaufmann, 1996. [Fern et al., 2003] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. In Thrun S., L. Saul, and B. Bernhard Schlkopf, editors, Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems. The MIT Press, 2003. [Fikes and Nilsson, 1971] R.E. Fikes and N.J. Nilsson. Strips: A new approach to the application for theorem proving to problem solving. In Advance Papers of the Second International Joint Conference on Artificial Intelligence, pages 608–620, Edinburgh, Scotland, 1971. [Finney et al., 2002] S. Finney, N. H. Gardiol, L. P. Kaelbling, and T. Oates. The thing that we tried didn’t work very well: Deictic representation in reinforcement learning,. In Proceedings of the 18th International Conference on Uncertainty in Artificial Intelligence, Edmonton, 2002. [Flach, 1994] P. Flach. Simply Logical. John Wiley, Chicester, 1994. [Forbes and Andre, 2002] J. Forbes and D. Andre. Representations for learning control policies. In E. de Jong and T. Oates, editors, Proceedings of the ICML-2002 Workshop on Development of Representations, pages 7–14. The University of New South Wales, Sydney, 2002. [G¨artner et al., 2003a] T. G¨ artner, K. Driessens, and J. Ramon. Graph kernels and Gaussian processes for relational reinforcement learning. In Inductive Logic Programming, 13th International Conference, ILP 2003, Proceedings, volume 2835 of Lecture Notes in Computer Science, pages 146–163. Springer, 2003. [G¨artner et al., 2003b] T. G¨ artner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In B. Sch¨olkopf and M.K. Warmuth, editors, Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, volume 2777 of Lecture Notes in Computer Science. Springer, 2003. [G¨artner et al., 2003c] T. G¨ artner, J.W. Lloyd, and P.A. Flach. Kernels for structured data. In Inductive Logic Programming, 12th International Conference, ILP 2002, Proceedings, volume 2583 of Lecture Notes in Computer Science. Springer, 2003. [G¨artner, 2002] T. G¨ artner. Exponential and geometric kernels for graphs. In NIPS Workshop on Unreal Data: Principles of Modeling Nonvectorial Data, 2002.
[G¨ artner, 2003] T. G¨ artner. A survey of kernels for structured data. SIGKDD Explorations, 5(1):49–58, 2003. [Gibbs, 1997] M.N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997. [Goldberg, 1989] D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989. [Guestrin et al., 2003] C. Guestrin, D. Koller, C. Gearhart, and N. Kanodia. Generalizing plans to new environments in relational MDP’s. In Proceedings of the 18th International Joint Conference on Artificial Intellingece, 2003. [Haussler, 1999] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999. [Humphrys, 1995] M. Humphrys. W-learning: Competition among selfish Qlearners. Technical Report 362, University of Cambridge, Computer Laboratory, 1995. [Imrich and Klavˇzar, 2000] W. Imrich and S. Klavˇzar. Product Graphs: Structure and Recognition. John Wiley, 2000. [Jaakkola et al., 1993] T. Jaakkola, M. Jordan, and S.P. Singh. Convergence of stochastic iterative dynamic programming algorithms. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 703–710. Morgan Kaufmann Publishers, Inc., 1993. [Jacobs and Blockeel, 2001] N. Jacobs and H. Blockeel. From shell logs to shell scripts. In C. Rouveirol and M. Sebag, editors, Proceedings of the 11th International Conference on Inductive Logic Programming, volume 2157 of Lecture Notes in Artificial Intelligence, pages 80–90. Springer-Verlag, 2001. [Jennings and Woodridge, 1995] N.R. Jennings and M. Woodridge. Intelligent agents and multi-agent systems. Applied Artificial Intelligence, 9:357–369, 1995. [Kaelbling et al., 1996] L. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237– 285, 1996. [Kaelbling et al., 2001] L.P. Kaelbling, T. Oates, N. Hernandez, and S. Finney. Learning in worlds with objects. In Working Notes of the AAAI Stanford Spring Symposium on Learning Grounded Representations, 2001.
[Karaliˇc, 1995] A. Karaliˇc. First-order regression: Applications in real-world domains. In Proceedings of the 2nd International Workshop on Artificial Intelligence Techniques, 1995. [Kashima and Inokuchi, 2002] H. Kashima and A. Inokuchi. Kernels for graph classification. In ICDM Workshop on Active Mining, 2002. [Kersting and De Raedt, 2000] K. Kersting and L. De Raedt. Baeyesian logic programs. In Proceedings of the Tenth International Conference on Inductive Logic Programming, work in progress track, 2000. [Kersting and De Raedt, 2003] K. Kersting and L. De Raedt. Logical Markov Decision Programs. In Proceedings of the IJCAI’03 Workshop on Learning Statistical Models of Relational Data, pages 63–70, 2003. [Kersting et al., 2004] K. Kersting, M. van Otterlo, and L De Raedt. Bellman goes relational. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004. Accepted. [Kibler et al., 1989] D. Kibler, D. W. Aha, and M.K. Albert. Instance-based prediction of real-valued attributes. Computational Intelligence, 5:51–57, 1989. [King et al., 2001] R. D. King, A. Srinivasan, and L. Dehaspe. Warmr: a data mining tool for chemical data. Journal of Computer-Aided Molecular Design, 15(2):173–181, feb 2001. [Kobsa, 2001] A. Kobsa, editor. User Modeling and User-Adapted Interaction, Ten Year Anniversary Issue, volume 1-2. Kluwer Academic Publishers, 2001. [Korte and Vygen, 2002] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Springer-Verlag, 2002. [Kramer and Widmer, 2000] Stefan Kramer and Gerhard Widmer. Inducing classification and regression trees in first order logic, pages 140–156. Springer-Verlag New York, Inc., 2000. [Lagoudakis et al., 2002] M.G. Lagoudakis, R. Parr, and M.L. Littman. Leastsquares methods in reinforcement learning for control. In Proceedings of the 2nd Hellenic Conference on Artificial Intelligence (SETN-02), pages 249– 260. Springer, 2002. [Langley, 1994] P. Langley. Elements of Machine Learning. Morgan Kaufmann, 1994. [Lin, 1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293–321, 1992.
[Lodhi et al., 2002] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. [MacCallum, 1999] A. MacCallum. Reinforcement learning with selective perception and Hidden State. PhD thesis, University of Rochestor, 1999. [MacKay, 1997] D.J.C. MacKay. Introduction to Gaussian processes. available at http://wol.ra.phy.cam.ac.uk/mackay, 1997. [Mahadevan et al., 1997] S. Mahadevan, N. Marchalleck, T.K. Das, and A. Gosavi. Self-improving factory simulation using continuous-time averagereward reinforcement learning. In Proc. 14th International Conference on Machine Learning, pages 202–210. Morgan Kaufmann, 1997. [Mitchell, 1996] M. Mitchell. An Introduction to Genetic Algorithms. The MIT Press, 1996. [Mitchell, 1997] T. Mitchell. Machine Learning. McGraw-Hill, 1997. [Morales, 2003] E. Morales. Scaling up reinforcement learning with a relational representation. In Proc. of the Workshop on Adaptability in Multi-agent Systems, pages 15–26, 2003. [Nilsson, 1980] N.T. Nilsson. Principles of Artificial Intelligence. Tioga Publishing Company, 1980. [Ormoneit and Sen, 2002] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49:161–178, 2002. [Parr and Russell, 1997] R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1997. [Puterman, 1994] M. L. Puterman. Markov Decision Processes. J. Wiley & Sons, 1994. [Ramon and Bruynooghe, 2001] J. Ramon and M. Bruynooghe. A polynomial time computable metric between point sets. Acta Informatica, 37:765–780, 2001. [Ramon, 2002] J. Ramon. Clustering and instance based learning in first order logic. PhD thesis, Department of Computer Science, K.U.Leuven, 2002. [Rasmussen and Kuss, 2004] C. E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2004.
[Rennie and McCallum, 1999] J. Rennie and A.K. McCallum. Using reinforcement learning to spider the web efficiently. In Proceedings of the 16th International Conf. on Machine Learning, pages 335–343. Morgan Kaufmann, San Francisco, CA, 1999. [Russell and Norvig, 1995] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995. [Schaal et al., 2000] S. Schaal, C. G. Atkeson, and S. Vijayakumar. Real-time robot learning with locally weighted statistical learning. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 288–293. IEEE Press, Piscataway, N.J., 2000. [Scheffer et al., 1997] T. Scheffer, R. Greiner, and C. Darken. Why experimentation can be better than “perfect guidance”. In Proceedings of the 14th International Conference on Machine Learning, pages 331–339. Morgan Kaufmann, 1997. [Sebag, 1997] M. Sebag. Distance induction in first order logic. In N. Lavraˇc and S. Dˇzeroski, editors, Proceedings of the Seventh International Workshop on Inductive Logic Programming, volume 1297 of Lecture Notes in Artificial Intelligence, pages 264–272. Springer, 1997. [Shapiro et al., 2001] D. Shapiro, P. Langley, and R. Shachter. Using background knowledge to speed reinforcement learning in physical agents. In Proceedings of the 5th International Conference on Autonomous Agents. Association for Computing Machinery, 2001. [Slaney and Thi´ebaux, 2001] J. Slaney and S. Thi´ebaux. Blocks world revisited. Artificial Intelligence, 125:119–153, 2001. [Smart and Kaelbling, 2000] W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the 17th International Conference on Machine Learning, pages 903–910. Morgan Kaufmann, 2000. [Sutton and Barto, 1998] R. Sutton and A. Barto. Reinforcement Learning: an introduction. The MIT Press, Cambridge, MA, 1998. [Sutton et al., 1999] R. Sutton, D. Precup, and S.P. Singh. Between MDP’s and semi-MDP’s: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999. [Tesauro, 1992] G. Tesauro. Practical issues in temporal difference learning. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 259–266. Morgan Kaufmann Publishers, Inc., 1992.
[Urbancic et al., 1996] T. Urbancic, I. Bratko, and C. Sammut. Learning models of control skills: Phenomena, results and problems. In Proceedings of the 13th Triennial World Congress of the International Federation of Automatic Control, pages 391–396. IFAC, 1996. [Utgoff et al., 1997] P. Utgoff, N Berkman, and J. Clouse. Decision tree induction based on efficient tree restructuring. Machine Learning, 29:5–44, 1997. [Van Laer and De Raedt, 1998] W. Van Laer and L. De Raedt. A methodology for first order learning: a case study. In F. Verdenius and W. van den Broek, editors, Proceedings of the Eighth Belgian-Dutch Conference on Machine Learning, volume 352 of ATO-DLO Rapport, 1998. [Van Laer, 2002] W. Van Laer. From Propositional to First Order Logic in Machine Learning and Data Mining. Induction of First Order Rules with ICL. PhD thesis, Department of Computer Science, K.U.Leuven, 2002. [van Otterlo, 2002] M. van Otterlo. Relational representations in reinforcement learning: Review and open problems. In E. de Jong and T. Oates, editors, Proc. of the ICML-2002 Workshop on Development of Representations, pages 39–46. University of New South Wales, 2002. [van Otterlo, 2004] M. van Otterlo. Reinforcement learning for relational MDP’s. In Proceedings of the Machine Learning Conference of Belgium and the Netherlands 2004, 2004. [Wagner and Fischer, 1974] R.A. Wagner and M.J. Fischer. The string to string correction problem. Journal of the ACM, 21(1):168–173, January 1974. [Wang, 1995] X. Wang. Learning by observation and practice: An incremental approach for planning operator acquisition. In Proceedings of the 12th International Conference on Machine Learning, pages 549–557, 1995. [Watkins, 1989] Christopher Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge., 1989. [Wiering, 1999] M. Wiering. Explorations in Efficient Reinforcement Learning. PhD thesis, University of Amsterdam, 1999. [Witten and Frank, 1999] I.A. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
[Yoon et al., 2002] S. Yoon, A. Fern, and R. Givan. Inductive policy selection for first-order MDP's. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, 2002.
[Zien et al., 2000] A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16(9):799–807, 2000.
Index action, 12 adjacency matrix, see graphs afterstates, 26 agent, 5, 12 arity, 175 artificial intelligence, 3 atom, 175 automated driving, 73 background knowledge, 39 Bayesian regression, 96 behavioral cloning, 6, 137 Bellman equation, 18–20 blackjack, 26 blocks world, 33 distance, 74 episode, 41 graph, 178 kernel, 106 relational interpretation, 176 state space size, 62 tasks, 61 Boltzmann exploration, 23 c-nearest neighbor, 76 CARCASS, 47 clause, 175 concurrent goals, 145, 148 convex hull, 73 convolution kernel, 95 corridor application, 81 covariance function, 98 covariance matrix, 96 curse of dimensionality, 21
data-mining, 5 deictic, 46 deictic representations, 28–29, 46 Digger, 142–146 direct policy search, 17 discount factor, see γ dynamic programming, 18 edit distance, 74 environment, 12 episode, 20 error contribution, 78, 83 error management, 83 error margin, 76 error proximity, 78 exploration, 22 exponential series, 104 fact, 176 feature vector, 27 focal point, 28, 46 function generalization, see regression functor, 175 G-algorithm, 54 γ, 14 Gaussian processes, 96 genetic programming, 17 geometric series, 104 graphs, 99–101 adjacency matrix, 101 cycle, 100 definition, 99
INDEX direct product, 102 directed, 99 indegree, 101 labelled, 99 outdegree, 101 walk, 100 ground, 175 guidance, 120–123 active, 123, 132 hierarchical reinforcement learning, 146–148 incremental distance, 91 incremental regression, 42 inflow limitations, 76, 82 instance averaging, 79 instance based learning, 72 instance based regression, 72 ITI algorithm, 54, 57 kernel, 95 for graphs, 103 for structured data, 95 kernel methods, 94–98 language bias, 56 learning graphs, 63 literal, 175 locally linear regression, 72 Logical Markov Decision Processes, 46 machine learning, 4 Markov Decision Processes, 15 relational, 46 Markov property, 15 matching distance, 74 matrix inversion incremental, 98 maximum variance, 79, 86 MDP, see Markov Decision Processes minimal sample size, 60, 68
193 Monte-Carlo methods, 18 moving target regression, 43 nearest neighbor, see instance based learning neural networks, 22, 28, 96 On(A,B), see blocks world tasks planning, 16 reward, 16 policy, 13, 17 policy evaluation, 18 policy improvement, 18 policy iteration, 18 approximate, 46 positive definite kernel, see kernel predicate, 175 probabilistic policies, 15 Prolog, 56 propositional representations, 27–28 Q-learning, 6, 19–23 algorithm, 20 Q-value, 19 query packs, 58 radial basis function, 96, 105 ranked tree, 95 reachable goal state, 62 receptive field, 72 refinement operator, 56 regression, 21, 28, 42 task definition, 22 reinforcement learning, 5, 6, 11–16 task definition, 13 RRL , 39 algorithm, 40 problem definition, 38 prototype, 43 regression task, 42 relational distances, 73 relational regression, 42
194 relational regression tree, 55 sizes, 64 relational reinforcement learning, 6 relational representations, 29 graphs, 32 interpretations, 31 reward, 12, 16 reward function, 13 reward backpropagation, 20 rib algorithm, 76 rmode, 56 rQ-learning, 47 SARSA, 18 software agents, see agent stacking, see blocks world tasks state, 12 state utility, see value function stochastic environments, 15–16, 80 state utility, 16 stochastic policies, 15 supervised learning, 6, 11 support vector machines, 19 temporal difference learning, 18 term, 175 tg algorithm, 55 Tilde , 43 transition function, 13 transition probability, 15 U-trees, 54 unstacking, see blocks world tasks user-modelling, 5 utility, see value function value function, 13, 14 class-based, 48 discounted cumulative reward, 14 value iteration, 18 web-spider, 5 world, see environment