Relational Learning by Observation

by

Tolga O. Könik

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in The University of Michigan 2007

Doctoral Committee:
Professor John E. Laird, Chair
Professor Martha E. Pollack
Professor Richmond H. Thomason
Associate Professor Satinder Singh Baveja
Associate Professor Thad A. Polk

© Tolga O. Könik 2007
All rights reserved

DEDICATION

For Aslı


ACKNOWLEDGEMENTS

First, I thank my advisor John Laird for his wise advice and the role model he provides. His advice always balanced open-mindedness, allowing me to choose hard questions, with a sense of practicality, ensuring that my quests were divided into manageable steps of finite duration. I always try to learn (by observation) from his rare ability to balance life and a successful scientific career, and from his friendly, sincere, and enthusiastic interactions with students and colleagues.

I also thank the rest of the Soar group at the University of Michigan, including John Hawkins, Devvan Stokes, Scott Wallace, Bob Marinier, Andrew Nuxoll, Shelley Nelson, Jonathan Voigt, Alex Kerfoot, Mazin Assanie, and Karen Coulter. I am grateful to Mike van Lent, who was very influential at the early stages of this research, and Doug Pearson, who made major contributions to the diagrammatic behavior specifications aspect of this research. Weekly teleconferences with Doug were intellectually satisfying and helped to drive the research forward. John (Laird), Brian, Devvan, and Alex implemented the Soar agent that I used in my experiments. I extend my thanks to the larger Soar community for listening to my presentations at various stages of my research and giving valuable feedback and support.

I thank all members of the Icarus group at ISLE and Stanford University, especially Pat Langley, for the lengthy afternoon discussions and his feedback on my research. He helped me to ground my research in the literature. I also thank the members of my committee, Martha Pollack, Rich Thomason, Satinder Baveja, and Thad Polk, who brought new perspectives to this dissertation with their questions and comments.

Finally, I thank my family. I thank my wife Aslı, who enjoyed and endured all the ups and downs of this research with me. She edited several parts of this dissertation and its earlier paper versions. She increased my awareness of the scientific jargon I use and helped me to enjoy communicating my research to people outside a small community. Most importantly, our deep discussions challenged me to think about science in a larger context. I thank Mum and Dad for being there for me literally from the beginning. Their unconditional love and support enabled me to go after a science career.


TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF DEFINITIONS
ABSTRACT

CHAPTER
1. INTRODUCTION
   1.1 Relational Learning by Observation
2. DESIGN DECISIONS
   2.1 Properties of Learning by Observation Problem
       2.1.1 Properties of the Environment where the Task Is Executed
       2.1.2 Properties of the Experts to be Modeled
       2.1.3 Properties of the Target Agent Program
   2.2 Properties of the Learning by Observation Framework We Explore
       2.2.1 Assumed Properties of the Available Information Sources
       2.2.2 Representation of Captured Knowledge
       2.2.3 Learning Algorithm
       2.2.4 Learning Strategies and Bias
3. RELATED WORK
   3.1 Inductive Learning by Observation
       3.1.1 Behavior Cloning
       3.1.2 Goal-driven Behavior Cloning
       3.1.3 KnoMic
       3.1.4 L2Act
   3.2 Theory-driven Learning by Observation
   3.3 Systems that Learn Planning Knowledge
       3.3.1 OBSERVER
       3.3.2 TRAIL
       3.3.3 Event & Situation Calculus
   3.4 Reinforcement Learning
   3.5 ILP Applications in Autonomous Agents Context
4. RELATIONAL LEARNING BY OBSERVATION
   4.1 Overview
   4.2 Using Relational Learning by Observation
       4.2.1 Learning from Behavior Performances
       4.2.2 Learning from Diagrammatic Behavior Specifications
   4.3 Knowledge Representation
       4.3.1 Task Performance Knowledge
       4.3.2 Symbolic Situations and Behavior Trace
       4.3.3 Annotated Behavior Trace
   4.4 Decision Concepts
       4.4.1 Learning from only Positive Examples
       4.4.2 Biasing Decision Concept Learning with Learned Decision Concepts
   4.5 Learning Decision Concepts using ILP
       4.5.1 ILP and Inverse Entailment
       4.5.2 Using ILP in Learning by Observation
   4.6 A Learning by Observation Example
   4.7 Background Knowledge
       4.7.1 Domain Independent Learning by Observation Knowledge
             4.7.1.1 Active Goal Access Knowledge
             4.7.1.2 Past Experience Knowledge
             4.7.1.3 Assumed Knowledge about Unobservable Features of the World
             4.7.1.4 Explicit Bias
       4.7.2 Task & Domain Knowledge
   4.8 Agent Generation for a Particular Agent Architecture
   4.9 Episodic Database
5. EVALUATION
   5.1 A Methodology for Evaluating Relational Learning by Observation
       5.1.1 Evaluating Learning by Observation using Decision Similarity
   5.2 Learning from Artificially Created Examples
       5.2.1 Learning from Correct Positive and Negative Examples
       5.2.2 Learning from Positive Examples Only
   5.3 Learning from Agent Program Generated Behavior
       5.3.1 Learning from Only Positive Examples
       5.3.2 Learning Multiple Concepts
   5.4 Summary of Experiments
6. DISCUSSION
   6.1 Research Contributions
       6.1.1 Defining Complex Learning by Observation Problems and Solution Systems
       6.1.2 Formalization of Relational Learning by Observation Problem
       6.1.3 A New Relational Learning by Observation Framework
       6.1.4 Defining Two Settings that use Relational Learning by Observation
       6.1.5 A Methodology to Evaluate Relational Learning by Observation
   6.2 Revisiting the Properties of our Learning by Observation Framework
   6.3 Future Work
       6.3.1 Expanding Learning by Observation
             6.3.1.1 Using More Expert Input
             6.3.1.2 Using Less Expert Input
             6.3.1.3 Using Additional Feedback
             6.3.1.4 Agents with Learning by Observation Capability
       6.3.2 Improving our Learning by Observation Framework
             6.3.2.1 Decision Concepts
             6.3.2.2 ILP
       6.3.3 More Experimental Evaluation
APPENDIX A
REFERENCES


LIST OF FIGURES

Figure 1. Trade-off between human effort and behavior complexity of learned agent programs
Figure 2. Properties of complex environments
Figure 3. Properties of the experts
Figure 4. Target agent program
Figure 5. Our assumptions about the properties of the learning by observation framework we investigate. A1-A5 describe the input, while A6-A23 describe the properties of our learning by observation framework.
Figure 6. Relational learning by observation framework
Figure 7. Learning from behavior performances. In mode 1, the expert generates annotated behavior. In mode 2, an agent program executes behavior and generates annotations, while the expert accepts or rejects them.
Figure 8. Diagrammatic behavior specification with Redux
Figure 9. Learning from diagrammatic behavior specifications setting. In mode 1, the expert generates annotated behavior. In mode 2, the behavior is interactively specified by the expert and the agent program.
Figure 10. An operator hierarchy in a building navigation domain
Figure 11. A snapshot of a SER situation in the Haunt 2 domain. The sensed relations are dynamically updated and are associated with relations using background knowledge.
Figure 12. A tree of partially ordered situations that contains four behavior paths
Figure 13. Annotation hierarchy. The horizontal direction depicts temporal extent over the behavior trace. The rejected annotations are marked and the rest are accepted annotations.
Figure 14. A recursive annotation hierarchy
Figure 15. The positive and negative example regions of different concepts. The horizontal dimension corresponds to the change in situations over time. A, B, and P are accepted annotations, where P is the context annotation of A and B. E_A+ and E_A- mark the positive and negative example regions of the annotation A.
Figure 16. Negative examples for selection-condition, extracted from a rejected annotation ¬A, where the expert rejects the agent program's annotation with the operator opA
Figure 17. Parameter heuristic negative examples are generated by changing the operator parameters of a positive example to parameters of that operator observed in other situations.
Figure 18. Single positive example and ground literals used to construct the bottom clause
Figure 19. The most specific (bottom) clause of the learned hypothesis
Figure 20. An overgeneral hypothesis for the selection condition of the go-to-door operator
Figure 21. A desired hypothesis for the selection condition of the go-to-door operator
Figure 22. An operator hierarchy used in Haunt domain experiments
Figure 23. A behavior example for the operator hierarchy in Figure 22
Figure 24. Some of the execution-condition hypotheses learned for the operator hierarchy in Figure 22
Figure 25. Correspondence of objects in different situations
Figure 26. Mode definitions for the execution condition of move-to-via-node in Figure 24
Figure 27. Three situations where the move-near goal of the agent is just terminated because the agent is close enough to the target object. Background knowledge infers compare_range predicates, and the relations that are relevant for the move-near goal of the agent are marked with bold typeface.
Figure 28. Search for the situation predicate query contains(s23, x1, Y) in the episodic database, which succeeds with Y=y1 and Y=y2
Figure 29. Propagation of a set of situations over the index of the contains predicate with the first argument x1
Figure 30. Theory accuracy is a weighted average of the accuracy of sibling operators, if the operators are selected independently.
Figure 31. Theory accuracy can be lower than the accuracy of all sibling operators if the accurate situations of two operators overlap.
Figure 32. Theory accuracy can be higher than the accuracy of all sibling operators if the inaccurate situations of two operators overlap.
Figure 33. Three different kinds of accuracies calculated on situations sampled from qualitatively different parts of the behavior trace
Figure 34. Accuracy distribution of the hypotheses learned with correct positive and negative examples
Figure 35. Accuracy distribution of the hypotheses learned with positive-only learning using correct positive and heuristic negative examples
Figure 36. A level map in the Haunt domain
Figure 37. The correct hypothesis learned with real behavior trace data of a Soar agent
Figure 38. The distribution of the learned selection-condition(move-to-via-node) concepts over positive examples and heuristic negative examples generated from real behavior trace data of a Soar agent
Figure 39. Distribution of selection accuracy of the execution-condition(move-to-via-node) concept in comparison to the hand-coded concept over positive examples and heuristic negative examples
Figure 40. Distribution of termination accuracy of the execution-condition(move-to-via-node) concept in comparison to the hand-coded concept over positive examples and heuristic negative examples
Figure 41. Distribution of activity accuracy of the execution-condition(move-to-via-node) concept in comparison to the hand-coded concept over positive examples and heuristic negative examples
Figure 42. Average single operator selection accuracy on training set
Figure 43. Average single operator termination accuracy on training set
Figure 44. Average single operator activity accuracy on training set
Figure 45. Average theory selection accuracy distribution over different contexts on training set
Figure 46. Average single operator selection accuracy on test set
Figure 47. Average single operator termination accuracy on test set
Figure 48. Average single operator activity accuracy on test set
Figure 49. Average theory selection accuracy distribution over different contexts on test set
Figure 50. Single operator selection accuracy of the best learned concepts on training set
Figure 51. Single operator termination accuracy of the best learned concepts on training set
Figure 52. Single operator activity accuracy of the best learned concepts on training set (hnew-testing)
Figure 53. Theory selection accuracy distribution of the best learned theory over different contexts on test set
Figure 54. Differences between learned agent program and expert program: (a) does not cause significant behavior difference, (b) might cause behavior difference, (c) causes behavior difference
Figure 55. Our assumptions about the properties of the learning by observation framework we investigate (Figure 5 repeated)


LIST OF DEFINITIONS

Definition 1. Decision concept
Definition 2. Execution condition
Definition 3. Assumptions to interpret execution condition
Definition 4. Episodic database storage
Definition 5. Coverage
Definition 6. Accuracy, overgenerality and overspecificity
Definition 7. Relevant context
Definition 8. Selection, activity and termination accuracy


ABSTRACT

Relational Learning by Observation
by
Tolga O. Könik

Chair: John E. Laird

In this dissertation, we investigate learning by observation, a machine learning approach to create cognitive agents automatically by observing the task-performance behavior of human experts. We argue that the most important challenge of learning by observation is that the internal reasoning of the expert is not available to the learner. As a solution, we propose a framework that uses multiple complex knowledge sources to model the expert more accurately. We describe a relational learning by observation framework that uses expert behavior traces and expert goal annotations as the primary input, interprets them in the context of background knowledge, inductively finds patterns in similar expert decisions, and creates an agent program. The background knowledge used to interpret the expert behavior includes not only task and domain knowledge, but also domain-independent learning by observation knowledge that models the fixed mental mechanisms of the expert.

We explore two learning approaches. In the learning from behavior performances approach, the main source of information used in learning is behavior traces of the expert recorded during actual task performance. In the learning from diagrammatic behavior specifications approach, the expert specifies behavior using a graphical representation, abstractly depicting the critical situations for the desired behavior. This provides the expert with additional modes of interaction with the learning system, simplifying the learning task at the expense of more expert effort. Both of these approaches are uniformly represented in the relational learning by observation framework.

Our framework maps the "learning an agent program" problem onto multiple learning problems that can be represented in a "supervised concept learning" setting. The acquired procedural knowledge is partitioned into a hierarchy of goals and is represented with first order rules. Using an inductive logic programming (ILP) learning component allows our system to combine complex knowledge from multiple sources. These sources include the behavior traces, which are temporally changing relational situations; the expert goal annotations, which are hierarchically organized and provide structured information; and background knowledge, which is represented as relational facts and first order rules.

Our learning by observation framework needs to store large amounts of behavior data and access it efficiently during learning. We propose an episodic database as a solution: an extension of Prolog that provides efficient and powerful mechanisms to store and query relational temporal information.

We evaluated our framework using both artificially created examples and behavior observation traces generated by AI agents. We developed a general methodology to test relational learning by observation. Our methodology is based on first using a hand-coded agent program as the expert, and then comparing the decision-making knowledge of the expert and learned agent programs on observed situations.


CHAPTER 1

INTRODUCTION

Developing cognitive agents that behave "intelligently" in complex environments (i.e., large, dynamic, nondeterministic, and with unobservable states) usually requires costly agent-programmer effort for acquiring knowledge from experts and encoding it into an executable representation. In this dissertation, we explore the use of machine learning techniques to automate this process. We investigate learning by observation systems that create cognitive agents automatically, by observing the task performance behavior of human experts. This line of research has two practical goals. The first goal is to develop systems that AI programmers can use to create agent programs faster and with less development cost. The second goal is to develop systems that people knowledgeable in a task can use to create AI programs without programming, but instead by showing how the target program should perform.

We want learning by observation systems to capture the general expertise and the style of the human experts for two reasons. First, even though human-like agents may generate suboptimal performance, the strategies acquired from humans can be robust and adaptive to changing situations, which is important in applications that require human-level intelligence. Second, human-like behavior is desired in applications such as educational/training environments, simulations, or computer games that contain synthetic characters that model humans.

Even if the goal is to create an agent program with the best possible performance, trying to capture the strategies of human experts still has merit. First of all, unlike approaches such as reinforcement learning, learning by observation does not require an externally provided criterion to measure the performance of the agent, which may be difficult to encode in complex domains. In a complex domain, even if there is a well-defined performance criterion, justifying all behavior in terms of future gains may be a difficult task. For example, a reinforcement learning technique that relies on an agent program to explore the environment can require an infeasible amount of search to associate the decisions with the rewards obtained from the environment, while an explanation-based learning technique that deductively justifies all expert behavior can require an impractical amount of correctness and completeness in the background theory. On the other hand, a learning by observation system that inductively detects repeated expert decision patterns can avoid these problems because it does not need to completely justify the imitated expert decisions in terms of future gains. Finally, even if the goal is to create an optimal agent for a given performance criterion, learning by observation can be an initial step to extract the relevant knowledge structures that humans use in their decisions. Additional optimizing learning methods, such as reinforcement learning, can then further improve performance by adapting the agent to maximize the performance criteria. In this approach, the structures captured with learning by observation would be used only to the extent they contribute to finding optimal solutions. For example, in a flight simulator domain, Morales and Sammut [25] used structures learned from expert-generated behavior traces to improve the results of reinforcement learning.

In creating cognitive agents, we face a trade-off between the human effort required to create agent programs and the complexity of the behavior the learned agents can generate (Figure 1). At one end of this spectrum, the agents are generated using traditional techniques such as knowledge acquisition by interviewing the experts and manual programming. This approach can be used to create fairly complex agent programs but requires significant human effort. For example, in the TacAir-Soar project [13], encoding a medium-fidelity agent required more than ten person years [48]. On the other end of the spectrum, we have methods such as reinforcement learning or unsupervised learning, which require little or no human effort, but have so far not generated agents that are comparable in behavior complexity to hand-coded agent programs. Although the behavior complexity of the learned agents may increase with further machine learning research, and the ease of developing complex agents using traditional techniques may improve with new tools, learning by observation is an alternative approach that takes advantage of both machine learning and human experts.

[Figure 1 appears here: a diagram placing learning techniques along a trade-off. The vertical axis is the behavior complexity of learned agent programs; the horizontal axis is techniques that use more human knowledge and require more human effort. Along this axis appear unsupervised learning, reinforcement learning, learning by observation (learning from behavior performance and learning from diagrammatic behavior specifications), knowledge acquisition, and expert system programming.]

Figure 1. Trade-off between human effort and behavior complexity of learned agent programs

The learning by observation approach we investigate in this dissertation is at an intermediate region in this trade-off: while the main source of information used in learning is relatively easy for the expert to communicate to the learning system, the complexity of the behavior that can be learned approaches that of manually coded systems. We investigate two learning by observation approaches in this intermediate region. In the learning from behavior performances approach we describe in section 4.2.1, the main source of information used in learning is traces of expert task performance. On the other hand, in the learning from diagrammatic behavior specifications approach we describe in section 4.2.2, the expert specifies behavior using a graphical representation, abstractly depicting the critical situations of the desired behavior. This provides the expert with additional modes of interaction with the learning system, simplifying the learning task at the expense of more expert effort. Although these settings are different from the perspective of how behavior data is collected, we investigate them in a unified relational learning by observation framework, where both settings use the same learning algorithm, albeit with minor differences in background knowledge and bias.

One of the key challenges of learning by observation is that the expert's mental reasoning is not directly available to the learner. Traditional learning by observation approaches such as behavior cloning [39] focused on problems where all relevant features of the task are directly observable. However, in many domains, the experts react not only to the directly observable data in the environment, but also to internal structures that they create and maintain while performing the task, such as the results of inferences, goals, and memories of data observed earlier. To tackle this difficulty, we investigate the use of additional knowledge sources besides the primary observed behavior input. Using these additional knowledge sources, our goal is to significantly increase the complexity of the performance knowledge that can be captured, without significantly increasing the expert effort required. Each type of knowledge source should be reasonably easy for the user to communicate, and we should have mechanisms to utilize them in learning. Our learning by observation framework can use expert annotations of the behavior, which contain additional information such as the goals being pursued. Moreover, our framework can utilize background knowledge about the task and domain that models the experts' own knowledge of the domain. Our framework first learns how the experts select goals based on observed situations, already selected goals, and background knowledge about the task and the domain; then it learns how they select actions that exhibit behavior consistent with the selected goals.
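As a purely illustrative sketch of this two-stage idea, the Prolog-style clauses below show the kind of first order patterns the framework aims to induce; the predicate names (select_goal/2, select_action/2, active_goal/2, holds/2) and the objects are hypothetical stand-ins, not the dissertation's actual encoding.

    % Goal selection: propose the go-to-door goal when an active
    % higher-level goal requires reaching a room that the door connects to.
    select_goal(S, go_to_door(Door)) :-
        active_goal(S, go_to_room(TargetRoom)),
        holds(S, in_room(agent, CurrentRoom)),
        holds(S, connects(Door, CurrentRoom, TargetRoom)).

    % Action selection consistent with the selected goal: keep the durative
    % move action aimed at the door while the goal is active and the door
    % is currently sensed.
    select_action(S, move_towards(Door)) :-
        active_goal(S, go_to_door(Door)),
        holds(S, sees(agent, Door)).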

1.1 Relational Learning by Observation

Most mainstream machine learning research focuses on approaches that use an attribute-value based representation, where the learned concepts are functions of a fixed-size feature vector. Many domains are inherently structured, with relations among objects, and are difficult to model using this representation. For example, in a circuit design domain, a circuit is a structure composed of interconnections of a varying number of components in complex ways. Describing this domain with flat feature vectors requires manually defining a constant number of features that summarize the substructures that are potentially important for learning. For example, we may have a numerical feature that counts the number of resistors, or a Boolean feature that tests the existence of a low-pass filter within the circuit. Unfortunately, if a domain is inherently structured, representing it with a feature vector is bound to lose information and generality. The features that are going to be useful in learning may depend on both the domain and the kind of problems the learning system is expected to solve. Consequently, determining the features may be a difficult task requiring significant human effort and expertise both in machine learning and in the task domain.

Relational machine learning approaches address this problem by using relational representations that are more flexible than feature vectors. This allows relevant structures to be dynamically constructed during learning, making feature selection part of the problem that the learning system is solving. Although structured representations have been investigated since the early days of machine learning [51], there has recently been renewed interest in them, and they are treated more formally under umbrella headings such as relational learning or inductive logic programming (ILP) [28]. ILP has been successfully applied to structured domains such as discovery of biological functions (e.g., predicting structural activity of drugs [43]), natural language processing (e.g., learning semantic parsers [53]), and mathematics (e.g., discovering novel theorems [5]).

In our learning by observation problem, the situations that the cognitive agents encounter are inherently structured, often containing complex relations between the objects. The agents have to combine these observed situations with complex internal structures such as goals and background knowledge to select actions and generate behavior. We use a relational learning approach, which provides a natural way of combining these multiple knowledge sources by representing all available information using a first order language.
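To make the contrast with feature vectors concrete, the sketch below encodes the circuit example above relationally in Prolog-style notation; the predicates (circuit_features/3, component/3, connected/3) and the toy concept are hypothetical and chosen only to illustrate the representational difference.

    % Attribute-value view: one fixed-length feature vector per circuit,
    % with the features chosen by hand in advance.
    circuit_features(c1, 4, true).   % circuit id, resistor count, has low-pass filter

    % Relational view: the structure itself is represented, so substructures
    % do not have to be anticipated as features.
    component(c1, r1, resistor).
    component(c1, c4, capacitor).
    connected(c1, r1, c4).

    % A structured concept expressible directly over the relations.
    has_rc_stage(Circuit) :-
        component(Circuit, R, resistor),
        component(Circuit, C, capacitor),
        connected(Circuit, R, C).

An attribute-value learner could only test such a substructure if a corresponding feature had been defined beforehand.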

learning by observation that uses temporally changing relational situations and actions of an “observed agent” as the input, interprets them in the context of structured goal annotations and complex background task/domain knowledge, and finally induces an agent program that behaves similar to the observed agent. Both learning from behavior performances and learning from diagrammatic behavior specifications can be uniformly represented within the relational learning by observation framework, albeit with minor 5

differences in background knowledge, bias and heuristics used in learning, and how the input data is collected. Our framework reduces the vague “behave like an expert” learning problem, to a set of well-defined supervised learning problems that can be framed in an Inductive Logic Programming (ILP) setting, where first order rules are learned from structured data. To be able to use ILP algorithms in the large domains we consider, we devised an efficient mechanism to store and query structured behavior data. Since learning with temporally changing structures is rarely studied in relational learning framework, one additional goal of this dissertation is to improve relational learning algorithms in this context. We use the general agent architecture Soar [18] as the execution system of the target agent program that our system learns. Soar uses a symbolic rule based representation that simplifies the interaction with the ILP learning component. Moreover, the style of knowledge representation our framework uses is inspired from Soar’s knowledge representation and programming styles. For example, the goals are structured hierarchically, which helps the learning system by imposing constraints on the structure of the target agent program. This bias is particularly important in a relational learning setting where rich knowledge structures produce large hypothesis spaces. Using a general agent architecture like Soar also makes it possible to create a domain independent learning system, so that it can work in different domains by changing only the background knowledge. Finally, a general architecture provides a common formalism for integrating our system with other learning strategies and architectural mechanism. For example Soar already uses chunking [19], an explanation based generalization type learning mechanism, which can learn to speedup the execution of the rules induced by our learning by observation component. Similarly, there is recent effort to include reinforcement learning within Soar [29], which could be used to better utilize the general structures captured with relational learning by observation in a particular setting. A longterm motivation is that Soar is one of the few candidates of unified cognitive architectures [32] and has been successful as the basis for developing knowledge-rich agents for complex environments [13, 20, 52]. One practical reason for this choice is that there exist interfaces between Soar and these environments that can be reused in our system. Moreover, there are hand-coded agents in these domains, which required 6

significant human effort and they can form a basis of comparison for the agents we create automatically. Although Soar influences how knowledge is represented in our framework, we introduce our framework independent of Soar to make our learning assumptions more explicit and to have results that are transferable to other agent architectures. The next Chapter describes the properties of the learning by observation problem and properties of the learning by observation framework that we investigate as a solution. Chapter 3 overviews related work. Chapter 4 describes our relational learning by observation framework. Chapter 5 demonstrates experimental results. Finally, Chapter 6 concludes with a summary, discussion of major contributions and future work.


CHAPTER 2

DESIGN DECISIONS

In this chapter, we first describe the properties of the learning by observation problem we want to address in this dissertation. Next, we describe constraints we pose on the learning by observation framework we investigate as a solution to this problem. These constraints guide our exploration and help to make our assumptions and design choices about the systems we investigate more explicit.

2.1 Properties of Learning by Observation Problem

In this section, we list the properties of the learning by observation problem we investigate. This may not be a complete list, and we do not claim that our learning by observation framework can fully deal with all issues associated with these properties. Nevertheless, we believe that listing them is useful to show our target and to motivate the design choices we describe in the next section.

A learning by observation system observes an expert and creates an agent program that performs tasks in an environment. A learning by observation problem can be characterized by these three components: the environment where the task is executed and the expert that is going to be modeled provide the inputs of the learning by observation system, whereas the agent program that is going to be created is the output. Next, we describe the assumed properties of these three components of the learning by observation problem. The properties of the expert that are relevant for learning by observation depend on the properties of the environment. Moreover, the properties of the learned agent program depend on both the environment and the expert.


2.1.1 Properties of the Environment where the Task Is Executed

We want to obtain agent programs that can deal with a complex environment that may change continuously (En1). We assume that the agent interacts with the environment using an interface, which allows it to exhibit behavior in the environment using a set of actions provided by the interface. We assume that these actions are durative (En2), in that they are continuously applied as long as they are kept active, and the agent has to continuously maintain their activity. From the perspective of the agent, the environment may be nondeterministic (En3); that is, the agent does not have exact knowledge of how the environment changes, or this information may be too complicated to be used in efficient predictions. The agent may not be able to observe all features of the environment at a time (En4). It may not be able to control all features of the environment, possibly because there are other agents or complicated dynamics (En5). The environment may have a large state space (En6) and be rich in structure (En7); that is, it may contain objects that are related in complex ways, and the structure of the environment may be relevant to the problem solving strategies the agent may need to use (Figure 2).

Environment
En1. changes continuously
En2. requires durative actions
En3. is nondeterministic
En4. is partially observable
En5. contains uncontrollable dynamics
En6. has a large state space
En7. is rich in structure

Figure 2. Properties of complex environments

For our experiments in section 5.3, we use problems from the "Haunt 2 game" [20], which is a 3-D first-person perspective adventure game built using the Unreal game engine. In this domain, the expert controls a game character, and the goal is to learn an agent program that can control the same game character. This environment has a large, structured state space, unobservable features, real-time decisions, continuous space, and external agents and events.
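As an illustration of properties En1-En7, the sketch below shows how a single symbolic snapshot of such an environment might be encoded relationally; the situation identifiers, objects, and predicates are invented for this example and are not taken from the actual Haunt 2 interface.

    % Situation s17: only currently sensed objects and relations appear,
    % reflecting partial observability (En4) and rich structure (En7).
    holds(s17, in_room(agent, hall2)).
    holds(s17, sees(agent, door5)).
    holds(s17, connects(door5, hall2, lab1)).
    holds(s17, carrying(agent, key3)).

    % Situation s18, a moment later: the durative action started in s17 is
    % still being maintained while the world continues to change (En1, En2).
    holds(s18, in_room(agent, hall2)).
    holds(s18, sees(agent, door5)).
    applied(s17, s18, move_towards(door5)).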


2.1.2 Properties of the Experts to be Modeled

The experts may use various sources of knowledge in decision-making. If the environment is nondeterministic (En3), the experts cannot predict the exact state of the world based on previous states and may need to use current sensory information (Ex1). The decisions the experts make may also depend on factual knowledge about a particular environment (e.g., knowledge about the building map in a navigation domain) and general common sense knowledge (e.g., rooms are connected to each other through doors) that the expert has about the task and the domain (Ex2). The decisions the experts make in selecting new goals and executing behavior may depend on the currently active goals (Ex3), such as deciding to open the door of a car because of the goal of driving it to work. Moreover, if the environment is only partially observable (En4), some decisions may need to be based on beliefs (Ex4) that the expert has obtained during the execution of the task and that persist even after the reasons for acquiring them have disappeared from the sensors. The expert may build beliefs about the unobservable parts of the world as well as about the current progress of his/her goals. For example, an agent in an office environment with the goal of fixing the photocopy machine should remember that it has achieved its goal after the machine is fixed. It should not return to the room to fix the machine each time it exits the room and the machine goes out of the scope of the sensors.

In complex domains, the expert we want to model may have multiple goals (Ex5), each continuing for a duration of time (Ex6) (e.g., "I will go to the printer in the next room and make a printout"). If the environment is nondeterministic (En3) and therefore the actions cannot be accurately planned beforehand, it is crucial that the expert is able to dynamically initiate and terminate a goal (Ex7), based on sensory information and internal processes. When the experts cannot predict all contingencies of their actions and cannot control the dynamics of the world (En5), using homeostatic goals (Ex8) that aim to maintain a condition based on frequent feedback from the environment is a common strategy in reactive tasks (e.g., keeping the direction of a plane towards the waypoint in a flight simulator domain). If the environment is not completely observable (En4), some persistence of the goals may be required, even if the original reasons to start pursuing a goal have been removed from the sensors (Ex9). For example, if the expert is following another agent, the expert may continue the pursuit towards the last seen location of the agent he/she follows, even though it may be temporarily blocked by an obstacle in the current sensors of the expert.

Expert
Ex1. can use sensory information
Ex2. can use task and domain knowledge
Ex3. can use goals
Ex4. can use dynamically created beliefs
Ex5. can have multiple goals
Ex6. can have durative goals
Ex7. can dynamically select and terminate goals
Ex8. can have homeostatic goals
Ex9. can have persisting goals
Ex10. can generate limited number of observations
Ex11. can make errors
Ex12. can use multiple strategies to achieve tasks

Figure 3. Properties of the experts

One of the major bottlenecks of learning by observation is the time cost of the expert, who can generate only a limited number of examples (Ex10). The problem gets even more difficult because the expert can make errors (Ex11), both in problem solving and during other communication with the learning system. Moreover, the expert may use multiple strategies or make arbitrary choices that are equally good while progressing towards a goal (Ex12); therefore, although the choices that the expert has not made may be indications of undesired behavior, they cannot be assumed to be definitely wrong (Figure 3).

2.1.3 Properties of the Target Agent Program

The agent program we want to create at the end of the learning process will probably not be perfect. Therefore, it is preferred that the extracted task performance knowledge is expressed in a format that can be understood, evaluated, and changed by knowledge engineers and experts (Ag1). Learning by observation is not the only source of information to create an agent program automatically. It is also desired that the extracted information can be integrated (Ag2) with information obtained from other sources (e.g., other learning strategies or a hand-coded program). The extracted knowledge should be translatable to a program of an efficient agent architecture so that the agent can execute actions in real time (Ag3). The agent program should be general enough to generate behavior in unforeseen situations. Moreover, even in identical situations the experts may behave differently due to random choices they make. The captured agent knowledge should be general enough to exhibit such variability (Ag4). On the other hand, a general agent program that merely solves problems in a domain, possibly in styles different from the experts', does not satisfy our learning by observation goal. Our goal is to obtain agents that behave similar to the experts (Ag5) that are observed for creating the agent program.

Agent Program
Ag1. uses a human understandable representation
Ag2. uses composable representation
Ag3. is reactively executed
Ag4. has general knowledge
Ag5. behaves similar to observed experts

Figure 4. Target agent program

2.2 Properties of the Learning by Observation Framework We Explore

In this section, we list the constraints (Figure 5) we pose on the space of learning by observation systems we explore. These constraints are motivated by the properties of the assumed learning by observation problem described in the previous section. This list has two purposes. First, it guides our research by limiting our exploration to a subspace of learning by observation systems that we believe are more likely to solve the learning by observation problem characterized in the previous section. Second, it makes our major design decisions and assumptions more explicit, clarifying the high-level properties of the learning by observation systems we investigate in this dissertation. We grouped this list into four parts: the assumed properties of the available information sources, which describe how the problem defined in the previous section is converted to the input of the learning by observation system; the representation of the captured knowledge, which describes how the captured knowledge is represented by the learning by observation system; the properties of the learning algorithm, which describe the properties of the learning algorithm the learning by observation system uses internally; and the learning bias and strategies, which describe how this learning algorithm is used to better model the learning by observation problem.

2.2.1 Assumed Properties of the Available Information Sources

We assume that our learning by observation framework uses as input traces of expert-generated behavior, consisting of the situations the experts encounter and the actions they select (A1). Our framework should also be able to use traces of behavior generated by a previously learned agent program (A2). In the case of expert-generated behavior, the framework uses annotations by the expert indicating which goals are being pursued. In the case of agent-program-generated behavior, the agent program should annotate the behavior with goals, and the expert validates, rejects, or corrects these annotations (A3). The expert goal annotations contain only the names of the goals and their parameters (e.g., go-to-door(d1)) and do not describe their meanings. Instead, their meanings are learned through observation. The observed behavior traces are interpreted not only in the context of goal annotations, but also in the context of hand-coded background knowledge, such as factual knowledge about the environment (e.g., the map of a building) and task-relevant common sense knowledge (e.g., rooms are connected through doors), which are used to model the internal knowledge of the expert that cannot be observed in the environment directly (A4), and in the context of expert annotations of the observed situations with data structures representing the knowledge the expert uses in these situations (e.g., indicating that a particular door is important for a decision or that the target location of a move action is related to the location of a particular object) (A5). We do not require that all of these inputs be available, but if they are available for a problem, we must have mechanisms that utilize them to simplify the learning problem.

The task and domain knowledge is not necessarily obtained from the expert being modeled, and does not need to correspond to the expert's understanding and reasoning about the world. The goal of learning by observation is not to exactly replicate the internal reasoning process of the expert, but to create a model that exhibits similar behavior. Of course, background knowledge that resembles the expert's internal knowledge may be more useful during this process. Therefore, while the goal annotations must be obtained from the expert that exhibits them, a separate knowledge engineer can encode the background knowledge. Although encoding commonsense background knowledge may be difficult, it may be shared in learning multiple tasks. Moreover, existing general commonsense theories can be used if they are relevant in a domain (e.g., a qualitative spatial theory in a building navigation domain). Factual background knowledge should be specific to an environment (e.g., the objects in a room), but it is typically easier to encode.

internal reasoning process of the expert, but to create a model that exhibits similar behavior. Of course, background knowledge that resembles the expert’s internal knowledge may be more useful during this process. Therefore, while the goal annotations must be obtained from the expert that exhibits them, a separate knowledge engineer can encode the background knowledge. Although encoding commonsense background knowledge may be difficult, it may be shared in learning multiple tasks. Moreover, existing general commonsense theories can be used if they are relevant in a domain. (e.g. a qualitative spatial theory in a building navigation domain). Factual background knowledge should be specific to an environment (e.g. the objects in a room), but that is typically easier to encode.

2.2.2 Representation of Captured Knowledge We assume that the learned agent program can continuously maintain durative actions in a real-time environment (A6) and the agent program explicitly represents durative goals that it pursues using the actions (A7). In the planning literature, the term “goal” is often used as a predefined condition that holds in desired end-states. The goals in our framework have a more general meaning. They represent desired internal states of an agent that exist over time, and in general, they indicate that particular kinds of behaviors are appropriate. They can represent intentions to achieve a particular condition, as in the regular planning sense, but they may be also about maintaining a process, such as watching a television show. We assume that the goals are structured in a hierarchy (A8), in order to represent complex behavior while keeping the learning tasks more manageable. The goals and actions may contain both constant valued and object valued parameters (A9) (e.g. fly-at-altitude(4000) in a flight simulator domain or go-to-door(d1) in a building navigation domain). The parameters of a goal expand the generality of the captured knowledge. The object-valued goal annotations also help learning by providing structured information about the internal reasoning of the expert. For example, when the expert annotates a behavior with the goal go-to-room(r1), by choosing the room object r1, the expert points to information related to that room, such as where on the map that room is or which items it contains. At the end of learning, the captured knowledge is 14

transformed to a program that can run in a rule based (A10) reactive agent architecture (A11).

Available Information Sources
A1. Expert-environment interaction observation traces
A2. Agent program-environment interaction observation traces
A3. Expert goal annotations of the observation traces
A4. Factual and common sense task and domain knowledge
A5. Expert data structures annotations

Representation of Captured Knowledge
A6. Durative actions
A7. Durative goals
A8. Hierarchical goals
A9. Object-valued and constant-valued goal and action parameters
A10. Rule based symbolic architecture
A11. Reactively executable representation

Properties of the Learning Algorithm
A12. Using complex background knowledge in learning
A13. First order rules in hypothesis space
A14. Explicit learning component
A15. Good generalizing learning algorithm
A16. Robust learning algorithm

Learning Bias and Strategies
A17. Learning multiple concepts to represent goals
A18. Testing sensory conditions in hypothesis
A19. Testing domain knowledge in hypothesis
A20. Testing active high-level goal conditions in hypothesis
A21. Testing goal completed beliefs
A22. Testing beliefs about unobservable features of the world
A23. Using expert data structures annotations for speedup and quality

Figure 5. Our assumptions about the properties of the learning by observation framework we investigate. A1-A5 describe the input, while A6-A23 describe the properties of our learning by observation framework.


2.2.3 Learning Algorithm

The internal learning algorithm of our learning by observation framework should be able to use a rich language for examples and background knowledge (A12) to represent the rich input (A1-A5) of the framework. Furthermore, it should be able to generate hypotheses in a rich language such as first-order logic so that it can test structured conditions in a general fashion (A13). Dealing with complex input and output structures is one of the major advantages of ILP algorithms over feature-attribute based machine learning techniques. Therefore, we frame our problem in a general ILP setting, without committing to a particular ILP algorithm. Using an explicit learning component in a frequently studied formalism such as ILP (as opposed to using ad hoc learning techniques) clarifies how the imprecise problem of "learning an agent program by observation" is mapped to a well-defined supervised learning problem (A14). ILP algorithms are well suited to abstracting away the details of specific examples because the relations contain variables (A15). For example, to learn the skill of moving towards a door, our framework does not need to learn a separate concept for each door, each requiring its own set of examples. Instead, the general knowledge of how to move towards any door is learned from all of these examples. One implication is that we might require fewer expert behavior traces, which are probably the most costly resource in our problem. We also need learning algorithms that can deal with noise, because the background theory used in learning will probably be only an approximate model of the expert's reasoning (A16). Moreover, it is not realistic to assume that the expert's behavior is always consistent with his/her goal annotations.
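To make the ILP setting more concrete, the sketch below shows the kind of first-order selection rule such a learner might induce. It is a hypothetical illustration only; the predicate names (selected/2, current_room/3, has_door/3, closer_than/4) are ours, not the vocabulary of the implemented system, and hyphens in operator names are replaced with underscores to form valid Prolog atoms.

    % Hypothetical first-order rule: select go-to-door(Door) in situation S when
    % Door belongs to the agent's current room and no other door of that room is
    % closer.  Because Door, Room, and S are variables, one rule covers every
    % door, so examples involving different doors contribute to the same concept.
    selected(S, go_to_door(Door)) :-
        current_room(S, agent, Room),
        has_door(S, Room, Door),
        \+ ( has_door(S, Room, Other),
             Other \= Door,
             closer_than(S, agent, Other, Door) ).

An attribute-value learner, in contrast, would need a separate concept, and a separate set of examples, for each concrete door.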

2.2.4 Learning Strategies and Bias

We assume that the learner represents the problem of maintaining the activity of a goal using multiple learned concepts (A17). For example, a goal may be represented by two concepts, one indicating when the goal is selected and the other indicating when the goal is terminated. Alternatively, a concept may represent the conditions that hold as long as a goal is active. Although these two schemes could be converted to each other if the

knowledge representation is rich enough, one of them might represent a particular goal more compactly, so that it can be learned more easily and with higher confidence. The conditions in the learned rules need to refer to current sensor values so that the agent's behavior can be conditional on changes in the environment (A18). To represent the internal knowledge and reasoning of the expert, the conditions in the hypothesis may refer to relations that are defined in the background knowledge (A19), as well as to active higher-level goals. For example, when the open(Door) goal is learned in the context of a previously selected goal drive(Car), which door should be opened depends on which car is being driven; that is, it must be a door of that particular car (A20). During performance, the expert may acquire beliefs based on his observations that persist even after the reasons for acquiring them are no longer observable. In general, it is difficult to learn decision processes that are based on such beliefs, since it is not clear which beliefs are going to be persistent and which beliefs from past experience are relevant for decisions in the current situation. Like KnoMic [47], we assume that our learning by observation system will maintain a limited set of persistent beliefs about completed goals. Such beliefs can help the agent maintain information about its progress towards its higher-level goals. We assume that the agents can create internal belief structures about the goals they have completed and use these facts in subsequent decision making (A21). Moreover, our framework allows the learned agents to form beliefs about unobservable parts of the environment. For example, when the agent program observes an object, it can remember its location and use this information in future decisions even when the object is no longer directly observable (A22). The expert-provided data structure annotations might highlight the parts of the situation or the internal knowledge of the expert that are relevant to the current decision. If such information is available, it can be used to bias the hypothesis search during learning, both to speed up the search and to improve the quality of the learned concepts (A23).
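As a hypothetical illustration of A17-A21, the sketch below shows separately learned selection and termination concepts for the open(Door) goal discussed above, conditioned on the active higher-level goal and on a persistent belief. All predicate names (select_goal/2, terminate_goal/2, active_goal/2, door_of/3, sensed/2, completed_goal/2, blocked/2) are illustrative assumptions, not the system's actual vocabulary.

    % Selection (A17, A20): open the door of the car that is currently being driven.
    select_goal(S, open(Door)) :-
        active_goal(S, drive(Car)),
        door_of(S, Car, Door).

    % Termination (A17, A18): the goal ends once the door is sensed to be open.
    terminate_goal(S, open(Door)) :-
        sensed(S, door_state(Door, open)).

    % A persistent belief about a completed goal (A21) can feed later decisions,
    % e.g. preventing a goal that was already achieved from being re-selected.
    blocked(S, open(Door)) :-
        completed_goal(S, open(Door)).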


CHAPTER 3
RELATED WORK

3.1 Inductive Learning by Observation

In this section, we discuss previous work closest to our learning by observation framework. We describe systems that learn task-performance knowledge by inductively generalizing the behavior traces of an expert, finding patterns across similar decisions.

3.1.1 Behavior Cloning

Learning to replicate expert behavior from behavior traces, without requiring the expert to provide additional information, is often called behavioral cloning. Most behavioral cloning research to date has focused on learning sub-cognitive skills in controlling a dynamic system such as pole balancing [22], controlling a simulated aircraft [23, 39], or operating a crane [46]. In contrast, our focus is capturing deliberate high-level reasoning, which we hope to achieve by interpreting the expert behavior in the context of additional information the expert provides. Behavioral cloning was originally formulated as a direct mapping from states to control actions, which produces a reactive agent. Sammut's system [39] is one of the earliest applications of behavior cloning to a complex task. This system learns decision trees [37] that map the current situation to actions in an airplane piloting task. The behavior traces are manually divided into a fixed set of segments (e.g., take-off, landing, etc. in the flight simulator domain they use), each of which is used to train a separate decision tree. These segments roughly correspond to our goals, but unlike our case, these segments are not hierarchically structured, they are not parameterized, and the conditions to transition between them are hand-coded rather than learned. Since the learned decision trees

are not parameterized and do not depend on background knowledge about the task, even slight changes in the task (e.g., changing the flight-plan altitude) require retraining of the decision trees.

3.1.2 Goal-driven Behavior Cloning

Later work proposed using goals to improve the robustness of the learned agents. Camacho's system [4] induced controllers that had goal parameters so that the execution system could use the same controllers under varying goal settings. It did not, however, learn how to set the goal parameters or how to change them dynamically during task performance. Since these goals were not learned, they are more similar to the background knowledge than to the goals in our learning by observation framework. Bain and Sammut [1] discuss a two-step approach where a goal model, which is a mapping from states to goal parameters, and an effect model, which is a mapping from control actions to changes in the state, are learned separately. The execution system selects control actions that will achieve the goal values by interpreting the effect rules. Isaac and Sammut [12] also present a two-step approach where an anticipatory level sets goal values and PID controllers at the lower level produce control actions to reduce the error between the goal and state values. Suc and Bratko [45] describe induction of constraints that model qualitative trajectories that the expert is trying to follow to achieve goals. These constraints are used to guide the choice of the control actions. As in the goal-directed behavioral cloning research described above, representing goals explicitly is an important component of our approach for obtaining complex and flexible agents, but the goals in our framework are used in a quite different way. In the systems described above, the goals are desired values for some predefined parameters of a dynamic system. For example, the learning-to-fly domain has goal parameters such as target turn-rate. In contrast, the goals in our framework correspond to durative high-level internal states of the expert indicating that a particular kind of behavior is desired. These goals may be related to a final state the expert wants to achieve, such as a go-to-room(r1) goal in a building navigation domain; they may be about maintaining a condition, such as a maintain-altitude goal in an airplane control domain; or they may simply represent the


desire to exhibit a complex behavior, as in a fly-in-a-circle goal. Unlike the above approaches, we do not assume pre-existing definitions for the goals. Instead, the meaning of each goal in our framework is discovered by learning under which circumstances the expert selects it, as well as by learning the behaviors that become relevant once it is selected. Unlike these systems, the goals in our framework are hierarchically organized so that goals at higher levels of the hierarchy correspond to more complex behavior.

3.1.3 KnoMic

van Lent's [48] learning by observation system KnoMic also learns hierarchies of durative goals using annotated behavior traces, although its goals do not have parameters, which prevents it from modeling complex relations between the goals. Nevertheless, KnoMic was able to model a complex task in a tactical air flight domain. The agent it trained can execute an area patrol mission (after some manual corrections), where it flies in a circular path by choosing goals repetitively. Moreover, KnoMic was able to dynamically change goals, for example to suspend the default patrolling behavior to engage a newly sighted enemy plane. Although KnoMic's learning by observation framework was able to model a complex task using a very simple learning algorithm, it used an attribute-value based representation that significantly limited its ability to model the internal reasoning of the expert. It would run into difficulties when structured properties of the environment are relevant, for example if it has to make decisions involving multiple objects (e.g., two enemy planes in a tactical air combat domain), if complex knowledge about the task (e.g., a building map in a navigation domain) is important in choosing the right strategy, or if the decisions of the expert involve inference beyond the directly observed features of the external world (e.g., choosing a door towards a room that is not directly observed). In addition, KnoMic uses a simple single-pass specific-to-general learning approach that does not scale to the structured behavior data required in the complex domains we consider. Our framework addresses these problems by allowing parametric goals, structured background knowledge, and structured sensors.

3.1.4 L2Act

Khardon [14] studied the learnability of action selection policies from the observed behavior of a planning system and demonstrated results on small planning problems. His framework requires that goals be given to the learner in an explicit representation, while we try to inductively learn the goals.

3.2 Theory-driven learning by observation

One alternative to our inductive approach to learning by observation is to use explanation-based learning (EBL), where learning results from explaining a few examples using a deductive background knowledge theory [24]. For example, Segre's ARMS system [40] used EBL on expert behavior traces to learn assembly plans for a simulated robot arm. Recently, learning by observation that utilizes an EBL-like technique was used with the general agent architecture ICARUS [31]. This system represents the learned procedural knowledge in a skill hierarchy (similar to our goal hierarchy) built through learning. Although it does not require expert goal annotations, it requires definitions of the goal conditions that hold when the task is terminated, a causal action model, and a hierarchical theory of concepts relevant for the task. Most EBL systems require background knowledge that can deductively explain expert behavior, whereas in our inductive approach background knowledge is optional and used only to the extent that it helps to find similarities between observed examples in the behavior. ILP systems usually handle noise better than EBL systems, which is important for learning by observation in complex domains.

3.3 Systems that Learn Planning Knowledge

Several systems learn how agent actions change the environment and the agent's state. One advantage of these systems is that the knowledge they try to capture is independent of the agent generating the behavior. Therefore, these systems can easily combine behavior generated by an expert with behavior generated by an agent program. On the other hand, since the goal is not to model the expert, this approach alone cannot be used

in applications that require human-like behavior. These systems usually do not generate executable agent programs; instead, they are usually combined with some sort of planning system that uses the captured knowledge to generate behavior. These systems would have difficulty if the changes caused by the actions are difficult to observe: for example, because the agent can take multiple actions at the same time (e.g., turn and move), because the actions cause delayed effects, or because there are external agents, all of which make it difficult to attribute changes to particular actions. Moreover, the changes the actions cause may simply not be observable to the agent. In such domains, our approach of trying to replicate expert decisions, without necessarily understanding what changes they will cause, may make learning easier.

3.3.1 OBSERVER

OBSERVER [50] uses expert observations to learn planning operators in a STRIPS-like [9] representation. OBSERVER learns the effects of operators and their preconditions, both expressed in a first-order language. It uses a version-space-like algorithm, which assumes noise-free data. It learns what changes the operators cause and under which conditions, and a planning algorithm then uses the learned operators to generate behavior. This approach is feasible in the discrete planning domains they consider, because those domains do not have delayed or unobservable effects, and all changes in the state can be attributed to the operators immediately selected by the expert.

3.3.2 TRAIL

Like OBSERVER, TRAIL [2] learns what effects the actions cause in the environment. It uses expert behavior observations and experimentation in the environment to collect data for learning. Unlike OBSERVER, TRAIL uses a teleoperator representation that supports durative actions. Each teleoperator T consists of a precondition p, an action a, an effect e, and a set of side effects S. For a correct T, if p is satisfied, continuous application of a is guaranteed to achieve e without changing p, potentially causing the side effects in

S. TRAIL uses an ILP algorithm to learn the teleoperators and, once they are learned, they can be used by a planning algorithm to achieve goal states. Although TRAIL uses user

behavior data, it does not capture behavior strategy. For behavior generation, it depends on explicit goals and planning. It does not have a goal hierarchy and does not maintain persistent beliefs, which limits the flexibility of the behavior it can generate.

3.3.3 Event & Situation Calculus

Moyle [26] describes an ILP system that learns theories in the event calculus, while Otero [34] describes an ILP system that learns effects in the situation calculus. These systems could have difficulty if the changes caused by the actions are difficult to observe, for example because the actions cause delayed effects that are difficult to attribute to particular actions. In such cases, our approach of trying to replicate expert decisions, without necessarily understanding what changes they will cause, may make learning easier.

3.4 Reinforcement Learning

Another alternative to learning by observation for training agent programs is reinforcement learning. Reinforcement learning algorithms search for a strategy of selecting actions that maximizes a predefined reward, usually by experimenting in an environment or by using an internal model of it. The knowledge captured by most reinforcement learning algorithms is specific to an instance of a domain and a task and does not satisfy our goal of capturing the general structures used in decisions. Dzeroski, De Raedt, and Driessens's [8] relational reinforcement learning system addresses this issue by proposing a way to use structured background knowledge and structured state descriptions within a reinforcement learning framework. The learning algorithm they present first learns a mapping from structured state descriptions to the utility of taking particular actions (Q-values) by experimenting in the domain and using the reinforcement signal obtained from the environment. They represent the Q-values implicitly using relational regression trees [17], a first-order generalization of decision trees. In a second phase, these Q-values are used to learn a compact action selection policy. Most notably, their system combines reinforcement learning methods devised to deal with delayed environmental feedback with relational learning techniques developed in the ILP framework.


Recently, Driessens and Dzeroski combined behavior traces generated by experts with traces obtained from experimentation in the environment [7]. Reinforcement learning with random exploratory strategies has difficulty in large state spaces with sparse feedback because the agents rarely reach the locations where feedback is given. Expert guidance helps their system reach states that return reward more quickly. Although they use expert behavior traces, unlike our setting, they do not aim to have their agents learn the reasoning of the expert. Their system does not try to learn the goals of the experts, and the actions are not learned in the context of goals. Moreover, the expert selections are not directly treated as correct decisions. Instead, the values of the actions are propagated backward from states that return positive reward. In complex domains, our strategy of trying to replicate the expert's decisions may be easier than trying to justify actions in terms of future gains, especially when the environment is large and the reward is sparse. Moreover, replicating the problem-solving style of an expert, even if he/she makes suboptimal decisions, is an important requirement for some applications such as creating "believably human-like" artificial characters.

3.5 ILP Applications in Autonomous Agents Context

There are few systems where ILP is used in an autonomous agent context. Although the following systems do not use expert observations to learn how to behave, the relational ILP representation they use could make it possible to integrate ideas from them with our system. Klingspor et al.'s system [15] learns to detect a hierarchy of concepts that correspond to abstract states of the agents in the world, using expert observations and annotations in an ILP setting. For example, it learns to detect the abstract "going through a door" state based on lower-level concepts such as "being in front of the door" and "facing the door", and actions such as "move forward". The lowest-level concepts are learned based on the immediate sensors (e.g., x-coordinate) and actions. Although our system also learns a hierarchy of concepts, the concepts have a quite different meaning. Our system learns to select and implement goals such as open(Door), based on previously selected higher-level goals such as drive(Car). Their system, in contrast, learns abstract states of the agent such as

open(Door), using lower-level state concepts such as infront(Door) and pull(Door). We

investigate how to select and execute goals, while their system could be used to detect which goals an agent may be pursuing at a given time. The learned concepts of Klingspor's system could be treated in our research as high-level qualitative sensors, which could be used along with other sensors to model the decision-making process of the expert. Since their system also uses a first-order logic based representation, we could integrate their learned "detection" knowledge into our system very naturally by simply adding their rules to the background knowledge database our system uses in learning. Matsui et al. [21] use ILP to train a system to detect situations where the execution of an agent's actions is going to be successful. They demonstrate how simulated robot soccer agents that already know how to kick a ball can learn when kicking will achieve its desired effect of scoring a goal. They do not use expert input; instead, they rely on agent experimentation in the domain, evaluated by hand-coded critic definitions. This approach is only useful if the desired effects of actions are easy to describe. The agents described by Inuzuka et al. [11] use a similar idea in a low-level robot navigation domain. Here, again, a simple action success criterion is used. Their robot should move in a hill-climbing direction towards the target points and it should not collide with the walls. Most notably, their agents start with a random action selection policy and incrementally improve it in a "relearn-action-policy" and "generate-new-examples" loop. They rely on a separate planning system to decide which actions should be selected if multiple actions may achieve their desired goals, while we want to learn the strategies to select actions. Their rules only refer to sensory information or its immediate implications. They do not use background knowledge that represents commonsense or factual world information, although their framework could allow such an inclusion.


CHAPTER 4
RELATIONAL LEARNING BY OBSERVATION

In this chapter, we describe our relational learning by observation framework in detail. We start with an overview of the framework.

4.1 Overview

The goal of our relational learning by observation framework is to create agent programs that mimic the task performance of experts, by generalizing expert-generated behavior data in the context of background knowledge about the task and domain, and additional expert-provided information such as goal annotations. The primary input of our framework is a set of annotated behavior traces consisting of a list of observed situations, actions, and goal annotations that are marked as "correct" or "incorrect". An annotated behavior trace is either generated by the expert (mode 1) or through an interaction of a previously learned agent program with an expert (mode 2). Our relational learning by observation framework interprets the annotated behavior traces in the context of task/domain

knowledge and returns an agent program that generates behavior consistent with the annotated behavior traces. Figure 6 gives an overview of our relational learning by observation framework. In the first step, a set of annotated behavior traces is inserted into an episodic database that efficiently stores and retrieves the observed behavior and the expert annotations. In order to create an agent that replicates the behavior of the expert, the system decomposes the problem of "obtaining an agent program" into multiple problems of learning decision

concepts for each step in behavior. The decision concepts include when the goals and actions should be selected, when they should be terminated, and when they should be interrupted. Learning the individual decision concepts is then formulated as a supervised concept-learning problem, with the training set generator creating positive and negative

examples of each decision concept using the annotated behavior traces stored in the episodic database. The concept learner component uses an ILP algorithm that learns rules for each decision concept by generalizing the concept examples in the context of background knowledge consisting of the hand-coded task/domain theory, the annotated behavior traces, and learning by observation knowledge. The learning by observation knowledge is based on the domain-independent assumptions of the learning by observation framework and is considered part of the framework. The agent generator component converts the learned decision concept rules into an executable agent program.

Figure 6. Relational learning by observation framework
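The pipeline of Figure 6 can be summarized as a simple Prolog driver, sketched below. This is a rough illustration only; generate_examples/4, ilp_learn/4, and add_to_agent_program/1 are placeholder predicates standing in for the training set generator, the ILP concept learner, and the agent generator, and the list of concept types is assumed for the example.

    % For every operator and every decision-concept type, draw positive and
    % negative examples from the episodic database, induce rules with ILP, and
    % hand the rules to the agent generator.
    learn_agent_program(Operators) :-
        forall(member(Op, Operators),
               forall(member(Concept, [selection, termination, interruption]),
                      ( generate_examples(Op, Concept, Positives, Negatives),
                        ilp_learn(Op-Concept, Positives, Negatives, Rules),
                        add_to_agent_program(Rules) ))).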


Section 4.2 describes two methods of generating the input for relational learning by observation. In the learning from behavior performances setting, the expert generates behavior data directly in the environment, whereas in the learning from diagrammatic behavior specifications setting, the expert specifies abstract scenarios using a graphical interface. Section 4.3 presents the knowledge representation of our framework. It describes not only the annotated behavior traces, the input of our framework, but also how knowledge is represented by the agent program that is going to be learned as the output of our framework. The learned agent programs use a restricted relational language to describe the environment and a hierarchical organization to represent their task-performance knowledge. Although we need a representation rich enough to express complex environments and tasks, proper restrictions are also important for dealing with the huge hypothesis spaces of the "learning an agent program" problem and for helping generalization during learning. Moreover, the hierarchical organization of task-performance knowledge helps the learner by biasing it with an assumed structure of the agent program that is going to be learned. Section 4.4 describes how the imprecise "learning an agent program" problem is reduced to multiple well-defined supervised concept learning problems, how these concepts are defined, and how their examples are created. Section 4.5 describes how the relational learning setting of Inductive Logic Programming is used to learn these decision concepts. In section 4.6 we go over an example agent program our system has learned from expert traces. Section 4.7 describes the task and domain knowledge and the learning by observation knowledge used by the ILP component as background knowledge in interpreting the behavior data. Task and domain knowledge is important for modeling knowledge of the expert that is not observable by the learning system. The learning by observation knowledge, on the other hand, encodes domain-independent knowledge about our relational learning by observation framework, the agent architecture, and assumed internal mechanisms of the expert. Learning by observation knowledge is an important component of our framework that helps bias the general-purpose ILP learner with knowledge about the learning by observation problem. Section 4.8 describes how learned concepts are converted to an agent program for a particular agent architecture. Finally, section 4.9 describes the episodic

database component of our framework, which enables it to deal efficiently with the large amounts of behavior data generated in complex environments.

4.2 Using Relational Learning by Observation

In this section, we describe two settings where relational learning by observation is used. In the learning from behavior performances setting, the learning system uses as input the behavior data recorded while an expert is executing a task in the environment. In contrast, in the learning from diagrammatic behavior specifications setting, rather than actually executing the task, an expert abstractly specifies desired behavior using a graphical interface that diagrammatically describes the task and the domain. Both of these settings use the same relational learning by observation component, albeit with minor differences in background knowledge and explicit bias.

4.2.1 Learning from Behavior Performances

The learning from behavior performances setting depicted in Figure 7 has two execution modes. In the first mode, the expert executes behavior in the environment by selecting actions and annotating the behavior with intended goals. The recorded annotated behavior traces are used to learn an initial, approximately correct agent program. In the second execution mode, a previously learned agent program generates behavior in a similar fashion and the expert gives feedback on the correctness of selected actions and goal annotations, leading to the creation of an improved agent program. In complex domains, an agent (expert or agent program) may receive vast amounts of raw sensory data, and the low-level motor interaction the agent has to control may be extremely complicated. Since we focus more on the high-level reasoning of cognitive agents than on low-level control, we assume that the agent programs interact with the environment using an interface that converts the raw data to a symbolic environmental representation (SER). While the expert makes his/her decisions using a visualization of the raw data, the agent program will make decisions with the corresponding symbolic data. Moreover, both the expert and the agent program execute only symbolic actions provided by the SER, which is responsible for implementing these actions in the environment at the control

level. We assume that the environmental interface converts the continuously changing environment to a symbolic representation at a sufficiently high frequency that no significant change occurs between two consecutive situations. The complexity of the environmental interface depends on the domain. While in robotics applications this interface could be fairly complex, in the simulated 3-D game environment we used in our experiments the environmental interface performs a simple transformation that converts the game representation into the SER.

Figure 7. Learning from behavior performances. In mode 1, the expert generates annotated behavior. In mode 2, an agent program executes behavior and generates annotations, while the expert accepts or rejects them.

While the expert or the agent program performs a task, the environmental interface records a trace of the behavior containing the executed actions and the symbolic representation of the encountered situations from the "task-executer" agent's point of view. In the first execution mode, the expert annotates the behavior trace with the intended goals. The goal annotations are durative and hierarchical. For example, the expert may annotate a continuous temporal region on the behavior trace as "In this time

interval, I am pursuing the goal go-to-room(r1) and to achieve that, I am pursuing the subgoal open-door(d1)". In the second mode, the agent proposes similar annotations and the expert accepts or rejects them. The durative action annotations are similar to the goal annotations, except that they can be recorded by the environmental interface without additional expert effort. The annotated behavior traces created with this process are passed to the relational learning by observation component, which interprets them in the context of task and domain knowledge to induce an agent program. The newly created agent program can generate further traces while the expert accepts or rejects the actions and goal annotations it selects (mode 2). Alternatively, the expert can generate more traces (mode 1) if the agent program is not yet good enough to generate reasonable behavior. This process continues until the expert is satisfied with the performance of the agent program and terminates training. In our current implementation, a new agent program is learned from scratch at each cycle, but since more behavior traces have accumulated, this should lead to a more accurate agent program being learned. We have partially implemented this framework to conduct the experiments reported in section 5.3. Although we have implemented the relational learning by observation component to use both annotated correct behavior and incorrect behavior with rejected annotations, the behavior generation works only in the first mode of the execution cycle, where only correct behavior instances are used. The experiments in section 5.3 demonstrate that this may be sufficient to capture correct behavior performance knowledge. In our experiments, instead of behavior generated by a human expert, we used the behavior of hand-coded Soar agents. Cloning artificial agents is a cost-effective way to evaluate our framework: it greatly simplifies data collection and it does not require us to build domain-specific components to track expert behavior and annotations. Instead, we built a general interface that can extract annotations and behavior from Soar agents in any environment Soar has been connected to. This approach enables us to evaluate learning by observation in a systematic fashion, as described in section 5.1.


4.2.2 Learning from Diagrammatic Behavior Specifications

In the learning from diagrammatic behavior specifications setting, the expert and the agent program (if one exists) interactively specify abstract behavior scenarios using a graphical interface (Figure 8) that diagrammatically represents the environment and the behavior. Compared to the learning from behavior performances setting described in the previous section, this approach simplifies the learning problem at the expense of more expert effort. The expert uses richer tools to generate behavior data and receives more knowledge about the behavior of the partially learned agent program. As a result, the learner can better predict the internal reasoning of the expert and it can better focus on the parts of the task where the learned agent program most lacks knowledge.

Figure 8. Diagrammatic behavior specification with Redux

Domains that require control of a dynamic environment (e.g., flying an airplane) prevent the expert and the learned agent program from performing the task at the same time. In transferring knowledge from the expert to the learned agent program, an important issue is how the "task-executer" and "observer" roles are coordinated between the expert and the learned agent program while the task is performed. A common approach to this problem is to assign fixed roles to the expert and the agent program. For example, in behavioral cloning systems [3, 39], a learner agent passively observes the task execution of the expert. On the other hand, in learning by instruction systems [10], an agent program executes the task while an observing expert gives instructions and feedback. The learning from behavior performances setting described in the


previous section supports learning from behavior generated by both the expert and a previously learned agent, but the behavior is not generated interactively. In contrast, the learning from behavior specifications setting described in this section is an interactive approach where the expert and the learned agent program can switch roles dynamically depending on the state of the environment, their knowledge about the task, and their knowledge about each other. We use the diagrammatic behavior specification tool Redux [36], which facilitates this dynamic interaction by acting as a communication medium. Redux works like a whiteboard (Figure 8) where the expert and the learned agent program share not only assumptions about the world state and desired behavior, but also internal knowledge such as the goals motivating the specified behavior, and meta knowledge such as feedback about the correctness of each other's behavior specification. This shared information helps the expert to focus on the parts of the task where the agent program is most lacking knowledge. Redux also acts as a translation tool among the expert, the learned agent program, and the learning system: while the expert uses the diagrams, the learning system and the agent program use a corresponding symbolic representation.

Figure 9. Learning from diagrammatic behavior specifications setting. In mode 1, the expert generates annotated behavior. In mode 2, the behavior is interactively specified by the expert and the agent program.


In the learning from diagrammatic behavior specifications setting depicted in Figure 9, an expert and an agent program (if there is one) specify behavior scenarios using a diagrammatic, storyboard-like representation. The generated scenarios are converted to annotated behavior traces that are passed to the relational learning by observation component, which generates an improved agent program. When the expert decides that the agent program is competent enough, it is exported to an external agent architecture such as Soar, which interacts with the real environment. The behavior specification starts with the expert actively specifying every step of the behavior (mode 1). The expert first draws an initial situation consisting of an abstract representation of the environment. The expert then creates new situations by selecting actions. The interface uses domain knowledge to help the expert with this process by predicting possible outcomes of the selected actions. For example, in a building navigation domain currently supported by Redux (Figure 8), the expert would first draw an abstract map representing the building (rooms, doors, etc.), and select actions for the "body" object that represents the entity to be controlled by the agent program at the end of the training. A scenario consists of discrete situations, each representing an important time point in a continuous behavior. The expert also annotates the scenario with goals and additional knowledge structures, such as markers highlighting the important objects, to communicate his/her reasoning to both the agent program and the learning by observation component. As the agent program improves through learning, it becomes a more active participant in behavior specification. If the agent program has knowledge about how the collectively selected goals can be achieved in the specified situation, it proposes subgoals or actions to achieve these goals. This eliminates the need for the expert to generate behavior and allows the expert to validate, reject, or override these suggestions, which provides valuable information for learning the next version of the agent program. After each learning phase, the improved agent program takes greater control of the interaction, until finally the expert is only passively verifying its behavior. The interactivity improves learning by focusing the interactions on the more relevant behavior traces. For example, relevant negative examples can be collected quickly, since the expert can immediately correct unacceptable decisions of the agent program.

Unlike the learning from behavior performances setting, where the encountered situations are constrained by real-time execution in the environment, during behavior specification the expert can focus on the critical situations by depicting only the important and qualitatively distinctive ones where new decisions are made. The expert can move forward and backward in time, inspecting the collectively created scenario. Moreover, the diagrammatic representation gives the expert more freedom to express knowledge that helps learning. In addition to desired behavior, the expert can specify undesired behavior, multiple possible action or goal selections that are equally desirable in a given situation, or multiple possible outcomes of a selected action. As a result, instead of a linear behavior trace, the expert generates a tree of partially ordered situations. This allows the learning system to get more information about both the variability and the limits of correct behavior. As in learning from behavior performances, in learning from behavior specifications the expert annotates the behavior data with goals, but since there are no real-time constraints on generating behavior, annotating is a natural part of behavior specification. As in the behavior performances setting, the goal annotations are important in communicating the internal reasoning of the expert to the learning by observation system, but in addition, the goals in the behavior specification setting are also used by the expert and agent program to communicate their understanding of the specified behavior. For example, each of them can propose subgoals that achieve the goal selected by the other. Moreover, the expert can propose alternative goals that replace the goals selected by the agent program. We have implemented both modes of this learning setting using the diagrammatic behavior generator tool Redux [36]. The additional knowledge structures this setting provides are handled by the relational learning by observation component using additional domain-independent background knowledge in the form of Prolog programs, learning heuristics, and bias.


4.3 Knowledge Representation

In this section, we describe the knowledge representation used by our learning by observation framework. Although this representation is rich enough to model complex problems, it is a restricted form of structured representation, which is crucial for biasing the general-purpose ILP framework with learning by observation specific properties and for dealing with the huge hypothesis spaces of the "learning an agent program" problem. Section 4.3.1 describes how the task-performance knowledge of our framework, which consists of the hierarchically organized goals and actions of the learned agent program, is represented. Section 4.3.2 describes the environment as represented by the agent program and the learning by observation system. Finally, section 4.3.3 describes the representation of the behavior data generated by the expert.

4.3.1 Task Performance Knowledge

We assume that the task performance knowledge of the target agent program is decomposed into a hierarchy of operators that represent both the goals that the agent pursues and the actions it takes to achieve these goals (Figure 10). The primitive operators at the leaves of the hierarchy represent primitive actions that the agent can execute in the environment, and the remaining operators represent the goals of the agent. The hierarchy is not a strict AND/OR hierarchy. Instead, the hierarchy indicates that the suboperators are "strategies" that can be applied in complex orderings (potentially multiple times) to achieve the goal of the parent operator. With the operator hierarchy assumption, we decompose the "learning an agent program" problem into multiple "learning to select and maintain the activity of an operator" problems. The suboperators correspond to strategies that the agent can use as part of pursuing the goal of the parent operator. The learned agent has to continuously maintain the activity of these operators based on current sensors and internal knowledge. Operators can have parameters, so that when the agent selects an operator, it must also instantiate its parameters. For example, if the agent selects the operator get-item(Item) in Figure 10, it must instantiate its parameter Item with a symbol that represents a collectable item in the environment. It then executes

the operator by selecting and executing its suboperators. For example, the operator get-

item is executed by selecting and terminating the operators get-item-in-room and get-item-different-room such that only one of them is active at a time, until the goal

of get-item is completed. The real execution in the environment occurs when actions, the lowest-level operators, are selected. The names of the selected actions and their parameters are sent to an environmental interface, which applies them in the environment. The actions are applied continuously in the environment as long as the agent keeps them active. We assume that there may be at most one operator active at each level of the hierarchy, except at the lowest level, where operators representing the actions can be executed in parallel. This assumption does not pose a major constraint on the possible behaviors, because multiple goals at the same level can be modeled by having a single operator that is decomposed into multiple parallel suboperators, which are randomly selected at each execution cycle. An operator at any level can be retracted in response to perceived events, in which case all of its suboperators are also retracted and another operator at the same level is selected. The assumption that forces a single active operator at a level greatly simplifies the learning task, because the learner associates the observed behavior only with the active operators and each operator is learned in the context of a single parent operator.

get-item(Item)
    get-item-in-room(Item)
        go-forward(LinearDir)
        turn(RotDir)
    get-item-different-room(Item)
        go-to-door(Door)
            go-forward(LinearDir)
            turn(RotDir)
        go-through-door(Door)
            turn(RotDir)
            go-forward(LinearDir)

Figure 10. An operator hierarchy in a building navigation domain


In this representation, information about how the operators are selected implies information about how the operators are executed, because execution of an operator at one level is realized by selection of suboperators at the lower level. The suboperators may be executed in complex orderings to achieve the goal of the parent operator, depending on the observed situation and internal knowledge. The real execution occurs through the lowest-level operators representing actions such as go(forward) or turn(left). For example, in Figure 10, assume that the agent decides to get an item i1 by selecting get-item(Item) and instantiating Item = i1. If the agent is not in the same room as i1, it selects the suboperator get-item-different-room(i1). The agent then executes this operator using its suboperators go-to-door and go-through-door. The operator go-to-door is used to select and approach a door leading towards the item i1. When the agent is close to that door, go-to-door is replaced with go-through-door, which moves the agent to the next room.

Once in the next room, the agent selects go-to-door again, but this time with a new Door instantiation. This process continues until the agent is in the same room as i1, at which point get-item-different-room is retracted with all of its suboperators and replaced with get-item-in-room.

Initially, the system only knows the names and the arity of the operators. The final agent obtained as the result of learning should have the capability of maintaining the activity of the operators (i.e., selecting them with correct parameters, terminating them when they achieve their goal, abandoning them in favor of other operators, etc.) and executing them (managing the suboperators).
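As an illustration, the initial knowledge about the hierarchy of Figure 10 could be encoded as the facts below. The predicate names operator/2 and suboperator/2 are hypothetical, and hyphens in operator names are replaced with underscores to form valid Prolog atoms.

    % Operator names and arities.
    operator(get_item, 1).
    operator(get_item_in_room, 1).
    operator(get_item_different_room, 1).
    operator(go_to_door, 1).
    operator(go_through_door, 1).
    operator(go_forward, 1).          % primitive action
    operator(turn, 1).                % primitive action

    % Parent/child relations of the hierarchy.
    suboperator(get_item, get_item_in_room).
    suboperator(get_item, get_item_different_room).
    suboperator(get_item_in_room, go_forward).
    suboperator(get_item_in_room, turn).
    suboperator(get_item_different_room, go_to_door).
    suboperator(get_item_different_room, go_through_door).
    suboperator(go_to_door, go_forward).
    suboperator(go_to_door, turn).
    suboperator(go_through_door, turn).
    suboperator(go_through_door, go_forward).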

4.3.2 Symbolic Situations and Behavior Trace

Our relational learning by observation framework assumes that there is an external

annotated behavior trace generator (ABTG) component creating the annotated behavior traces (Figure 6). In learning from behavior performances, the environmental interface plays the role of the ABTG, whereas in learning from behavior specifications, the diagrammatic behavior generator plays that role. In both cases, the ABTG maintains the situations (Figure 11), the observed state of the environment from the perspective of the "task-executer", in a symbolic environmental representation (SER). While in the behavior

performance setting the perceived environment is converted to the SER, in the behavior specification setting the diagrammatic representation is converted to the SER.

Figure 11. A snapshot of a SER situation in the Haunt 2 domain. The sensed relations are dynamically updated and are associated with other relations using background knowledge.

Our relational learning by observation framework and the agent programs it learns represent the observed situations using a directed graph of binary predicates, a representation adapted from the Soar architecture. Each situation consists of a set of binary predicates of the form p(a, b), where p is a relation between the objects in the environment denoted by a and b by the ABTG. In the Haunt domain, an example "snapshot" of this time-varying representation is depicted in Figure 11. The sensors can be represented with binary predicates where the first argument is a special symbol denoting the agent generating the behavior and the second argument is the sensed value. The sensors can be constant-valued, such as x-coordinate(agent, 35) or energy-level(agent, high), as well as object-valued, such as current-room(agent, r1). The

object-valued sensors can be used to represent structured relations among perceived objects. For example, when a book on top of a desk enters the visual display of the expert, the ABTG builds the corresponding objects and binds an "on" relation between them. The ABTG can enrich the situations with hand-coded factual knowledge or inferred

knowledge in addition to the directly sensed features. For example, in Figure 11, we not only know that the agent is in the room r1, but we also know that it can enter the room r2 by going through door d1. We say that the observed situation predicate p(si, a, b) holds if and only if p(a, b) was in the SER at the situation si. If the environment contains static facts (e.g., rooms, doors, etc.) that do not change over different situations, that information can be added to the beginning of the behavior trace manually, even if the expert does not perceive them directly. This corresponds to the assumption that the expert may already know about these facts, and the learning system can use this information as background knowledge as it creates the model of the expert. If p(x, y) is such a static fact, we say that the assumed

situation predicate p(si, x, y) is true for any si. Moreover, the ABTG can use generic rules that infer new facts from the observed and assumed facts at a situation. Such facts are called inferred situation predicates. During learning, all situation predicates in the SER are used uniformly. The behavior trace consists of a partially ordered set of situations, each symbolically representing an instant of the environment from the perspective of the agent program that has generated the behavior. In the learning from behavior performances setting (section 4.2.1), the situations are totally ordered and two consecutive situations represent two moments in the behavior with a short period of time in between. In the case of learning from diagrammatic behavior specifications (section 4.2.2), the situations represent important moments in the behavior that are depicted with the diagrammatic tool, and a corresponding symbolic situation is sent to the relational learning by observation component. In this setting, the situations are only partially ordered because the expert can specify multiple "next" situations for a given situation. A tree of partially ordered situations can be converted to a set of completely ordered behavior traces by creating a separate behavior trace for each path between the initial situation at the top node and a leaf situation. Using that transformation, we assume that the input of our relational learning by observation framework is a set of completely ordered situation sequences that are called behavior paths. For example, the partially ordered situation tree in Figure 12 contains four behavior paths. This assumption simplifies our formalization without loss of generality.
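As a hedged illustration, the three kinds of situation predicates could be rendered roughly as below; observed/2, assumed/1, holds/2, and leads_to/2 are illustrative names and concrete values, not the system's actual encoding.

    % Observed situation predicates, recorded by the ABTG at each situation:
    observed(s17, current_room(agent, r1)).
    observed(s17, visible(agent, d1)).

    % Assumed (static) situation predicates, holding at every situation:
    assumed(connection(d1, r1)).
    assumed(connection(d1, r2)).

    % Inferred situation predicates, derived by generic rules, e.g. "door D of
    % the current room leads to room R2":
    holds(S, P) :- observed(S, P).
    holds(_, P) :- assumed(P).
    holds(S, leads_to(D, R2)) :-
        holds(S, current_room(agent, R1)),
        holds(S, connection(D, R1)),
        holds(S, connection(D, R2)),
        R1 \= R2.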

Figure 12. A tree of partially ordered situations that contains four behavior paths: {S01 S11 S21 S31 S41 S51}, {S01 S11 S21 S31 S41 S52}, {S01 S11 S22 S32}, and {S01 S11 S22 S33}.

In practice, it is inefficient to store the list of all situation predicates that hold at each situation explicitly, especially in domains where sampling frequencies are high and there is a significant amount of sensory input. In section 4.9 we describe the episodic database, which efficiently stores and retrieves the situation predicates and operator annotations contained in the annotated behavior trace.
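The conversion from a tree of partially ordered situations to behavior paths can be pictured with the small sketch below, which encodes the edges of Figure 12 as hypothetical next/2 facts and enumerates the four paths; it is an illustration, not part of the implemented framework.

    % Edges of the situation tree in Figure 12.
    next(s01, s11).  next(s11, s21).  next(s11, s22).
    next(s21, s31).  next(s31, s41).  next(s41, s51).
    next(s41, s52).  next(s22, s32).  next(s22, s33).

    % behavior_path(S, Path): Path is a totally ordered behavior path starting at S.
    behavior_path(S, [S]) :-
        \+ next(S, _).
    behavior_path(S, [S | Rest]) :-
        next(S, Next),
        behavior_path(Next, Rest).

    % ?- behavior_path(s01, Path).   % enumerates the four behavior paths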

4.3.3 Annotated Behavior Trace

It is difficult to extract the goals and subgoals necessary for hierarchical execution from behavior traces alone. In our framework, this information is assumed to be given by expert-provided operator annotations that describe the intended goals and selected actions at each situation on the behavior trace. The annotations are created through a collaboration of the expert and a previously created agent program. In the learning from behavior performances setting (Figure 7), the expert and the agent work in turns: the expert generates the annotations in the first mode and edits the annotations generated by the agent program in the second mode. In the learning from behavior specifications setting, on the other hand, the annotations are created more interactively by the expert and the agent program. Two different types of operator annotations mark a list of consecutive situations with an operator instance. An accepted annotation indicates that the expert believes that the operator instance correctly depicts the desired goal or action for the selected situations, whereas a rejected annotation indicates that it does not. More formally, we say annotation(a, R, op(x0)) if a is an annotation that associates the annotation region R, a set of consecutive situations describing the temporal

extent of the annotation, with an annotation operator op(x0), where op(x0) is a ground literal consisting of the operator name op and the instantiated arguments x0. We say

accepted-annotation(a) if a is an annotation selected or verified by the expert, and rejected-annotation(a) if the expert indicated that a describes undesired behavior. A valid operator selection at a situation, one that satisfies the semantics of the operator hierarchy, must form a connected path of operators starting from the root of the hierarchy (e.g., get-item, get-item-different-room, go-to-door, … in Figure 10). Just like the operators, the annotations are hierarchically organized into a structure we call the annotation

hierarchy (Figure 13). The annotation hierarchy describes which ordered list of suboperators is executed at which time points to achieve the goal of which parent operator instance. Although the hierarchical relations of annotations are inherited from the operator hierarchy, the annotation hierarchy is guaranteed to be a finite tree for a finite behavior trace, even if the corresponding operator hierarchy is recursive (Figure 14). Moreover, unlike the operator hierarchy, the annotation hierarchy is temporally extended over the space of situations such that an annotation region encloses the annotation regions of its descendant annotations.

Figure 13. Annotation hierarchy. The horizontal direction depicts temporal extent over the behavior trace. The rejected annotations are marked and the rest are accepted annotations.


For two annotations a, b, if b is the parent annotation of a in the annotation hierarchy, we say context(a, b) or “a is in the context of b”. We say share-context(a, b) or “a and b share the same context”, if a and b are in the same context. For example, Figure 13 depicts annotations with the operators go-to-door(d1) and go-through-door(d1) that share the same context that has the operator get-item(i1). On a linear behavior trace with totally ordered situations, two accepted annotations that share the same context are not allowed to have intersecting annotation regions because in a given context, the expert can associate the selected behavior with only a single correct operator. However, we do not have a similar restriction for rejected annotations since the expert can indicate multiple operators that should not be selected at a context and situation. Note that although on partially ordered situations (Figure 12) multiple accepted annotations can exists at the same situation and context, each of these correct annotations have to be on a different behavior paths and the unique accepted annotation condition is preserved on each behavior path. get-item(i1)

[Figure 14 annotation labels: get-item(i1); go-to-room(r3); go-to-room(r2); get-item-in-room(i1); go-to-door(d3), go-through-door(d3); go-to-door(d1), go-through-door(d1).]

Figure 14. A recursive annotation hierarchy.

We assume that all annotations except the top one have exactly one context annotation, which must be an accepted annotation. As a consequence, rejected annotations are not allowed to contain other annotations in their context. Our framework allows multiple accepted annotations with instantiations of the same operator to be active in a given situation, but since at each context only one accepted annotation can be active at a time, these annotations must be at different levels of the annotation hierarchy. For example, in Figure 14 the operator go-to-room is used recursively, such that go-to-room(r2) is called as part of the behavior that achieves go-to-room(r3).
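To make these structures concrete, the following is a minimal Prolog sketch of how annotations and their hierarchy could be encoded. The annotation identifiers, situation names, region boundaries, and parent links are hypothetical and chosen only for illustration, and hyphens in the predicate names of the text are written as underscores to form valid Prolog atoms.

    % Hypothetical encoding of a few annotations (a1-a4, situations s1-s6
    % and the parent links below are invented for illustration).
    % annotation(Id, Region, OperatorInstance): Region is the list of
    % consecutive situations that the annotation covers.
    annotation(a1, [s1, s2, s3, s4, s5, s6], get_item(i1)).
    annotation(a2, [s1, s2, s3],             go_to_door(d1)).
    annotation(a3, [s4, s5, s6],             go_through_door(d1)).
    annotation(a4, [s1, s2, s3],             go_to_door(d3)).

    accepted_annotation(a1).
    accepted_annotation(a2).
    accepted_annotation(a3).
    rejected_annotation(a4).     % the expert marked this selection as undesired

    % context(A, B): B is the parent annotation of A in the annotation hierarchy.
    context(a2, a1).
    context(a3, a1).
    context(a4, a1).

    % share_context(A, B): A and B are in the same context.
    share_context(A, B) :-
        context(A, P),
        context(B, P),
        A \= B.

With such facts, a query like share_context(a2, a3) succeeds, mirroring the go-to-door(d1) and go-through-door(d1) example above.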

Like the situation predicates in the behavior trace (section 4.3.2), the annotations are stored in the episodic database (section 4.9). The episodic database is used to efficiently access the annotations, not only when creating the examples of the learned concepts but also when testing the candidate concept hypotheses generated during learning. Next, we describe how the annotated behavior traces are used to extract the examples of these concepts, which are used in constructing the learned agent program.

4.4 Decision Concepts

In section 4.3.1, we discussed how the problem of "learning an agent program" is decomposed into multiple "learning to maintain the activity of an operator" problems. In this section, we further decompose it into multiple "decision concept learning" problems that can be framed in a supervised ILP learning setting. This decomposition allows our system to learn each decision concept separately (such as when to select an operator and when to terminate an operator). This approach is taken because the decision concepts are often highly disjunctive; that is, there are many reasons for selecting an operator and many independent reasons for terminating an operator. After learning, the system dynamically combines decision concepts learned from different examples. If the decision concepts were not learned separately, the system would have to explicitly learn every combination of decision concepts, requiring many more examples and leading to less robust performance. Moreover, by learning different kinds of decision concepts for each operator (i.e. selection and termination), the operator execution logic is decomposed into simpler, easier-to-learn conditions.

In this section, we first describe how decision concepts are defined and how their examples can be extracted from annotated behavior traces. Subsection 4.4.1 describes how decision concept examples are created using the first mode of our framework, when only correct behavior traces are present. The ability to learn from only correct behavior traces is important when no prior agent program is present. Here we describe a positive-example-only learning approach, which builds on the positive-only learning literature in ILP but is improved for the special case of a learning by observation problem. Subsection 4.4.2 describes how learning one kind of decision concept for an operator (i.e. how to select that operator) depends upon learning other kinds of decision concepts for that operator (i.e. how to terminate that operator). We show how this dependency can be exploited during learning so that learning one kind of decision concept for an operator can boost learning of another kind.

A decision concept of an operator op is a mapping from the internal state of the agent and the perceived external state of the environment to a "decision suggestion" about the activity of op. Figure 15 depicts four decision concepts: selection-condition (when the operator should be selected if it is not currently selected), overriding-selection-condition (when the operator should be selected even if another operator is selected), maintenance-condition (what must be true for the operator to be continued during its application), and termination-condition (when the operator has completed and should be terminated). For each decision concept, we must define how its examples should be constructed from the behavior traces and how it is used during execution. Next, we describe the formal structure of the decision concepts, which is used in describing how their examples are constructed, how they are learned, and how they are used during execution.

Definition 1. Decision concept
For a concept of kind con and an operator op(x), we get a decision concept

    con(s, c, op(x))

where s is a situation, c is a symbol representing the context annotation where op(x) is selected, and x is a parameter vector of op. A learned decision concept hypothesis has the form:

    con(s, c, op(x)) ← P(s, c, x)

where P is a condition that can be tested using the situation predicates in the episodic database, the annotation hierarchy, or background knowledge. To simplify the presentation, we will also refer to a decision concept con(s, c, op(x)) as con(op(x)), or just con, if the missing parts are not relevant for a particular discussion. For example, if selection-condition(S, C, go-to-door(Door)) holds for a situation S = s0, context C = c0, and door object Door = d0, it represents advice indicating that the agent should select the operator go-to-door(d0) at situation s0 and context c0, if no operator is selected in c0.

Both our learning framework and the learned agent program that is generated as the output of the learning framework use decision concepts of the form con(s, c, op(x)), although they use different methods to instantiate the values s and c. During reactive execution, the agent program uses the decision concepts to make a decision about the activity of op(x) by assigning s to the current situation and c to a context at the current situation. Here s is used to access the perceived or inferred beliefs about the world in the current situation, whereas c is used to access the active goals of the agent. The learning by observation system, on the other hand, analyzes the hypothesized decision concepts by testing different parts of the recorded behavior trace using different values for s and c.

Although our focus in this thesis is on learning by observation and reactive execution, it is worth mentioning two other potential uses of decision concepts. An agent that maintains a history of its own behavior could use the decision concepts to reason about its past behavior (reflection) by setting s and c to previous situations and contexts, which would allow the agent to use information from its past experience during execution. An agent that has a model to make predictions about the future could use the decision concepts to reason about future events and behavior (planning) by hypothesizing expected situations and contexts.

The positive and negative examples of decision concepts are ground terms of the form con(s0, c0, op(x0)). The training set generator component of the learner constructs these examples by analyzing the behavior trace, using the accepted and rejected annotations provided by the expert, and the ILP component uses them to learn hypotheses of the form in Definition 1.


[Figure 15 panels: (a) termination-condition(opA), (b) selection-condition(opA), (c) overriding-selection-condition(opA), (d) maintenance-condition(opA); each panel shows the example regions E_A+ and E_A− relative to the annotations P, A, and B.]

Figure 15. The positive and negative example regions of different concepts. The horizontal dimension corresponds to the change in situations over time. A, B, and P are accepted annotations, where P is the context annotation of A and B. E_A+ and E_A− mark the positive and negative example regions of the annotation A.

To create positive (negative) examples of the concept con of an operator opA, first an annotation A with the operator instance opA(x0) and context c0 is randomly selected from the annotation hierarchy. Next, a situation s0 in c0 is randomly selected from a set of situations E_A+ (E_A−), called the positive (negative) example region, to obtain a positive (negative) example con(s0, c0, opA(x0)). Each decision concept has a corresponding function that maps an annotation A to negative and positive example regions for that decision concept. Figure 15 depicts this mapping for different decision concept kinds, such that the example regions E_A+ and E_A− are calculated given an annotation A with the operator opA. The horizontal dimension represents temporally consecutive situations in the behavior trace and the boxes represent the accepted annotations. P is the context annotation of A, and B is a randomly selected accepted annotation in the same context as A. The operator opB of the annotation B may have the same operator name as opA, but it should have a different parameter instantiation. For example, for selection-condition(opA), E_A+ is selected from a set of situations right around the initial situation of A, whereas E_A− is selected around the initial situation of another annotation B in the same context (Figure 15.b). If we have opA = go-to-door(d1), range(A) = s20-s30, range(B) = s50-s60, and context(A) = c5, we could get the example regions E_A+ = s18-s22 and E_A− = s48-s52, the positive example selection-condition(s20, c5, go-to-door(d1)), and the negative example selection-condition(s50, c5, go-to-door(d1)).
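The following Prolog sketch illustrates this example-generation scheme for selection-condition under simplifying assumptions that are not part of the framework itself: situations are represented by integer indices, annotation regions are stored as range(Annotation, First, Last) facts, operator instances as operator_of/2 facts, and the example regions extend two situations around an annotation's start (between/3 is a standard built-in in SWI-Prolog).

    % positive_example/1 draws situations from E_A+, the region around the
    % start of an accepted annotation A.
    positive_example(selection_condition(Sit, Ctx, Op)) :-
        accepted_annotation(A),
        operator_of(A, Op),
        context(A, Ctx),
        range(A, First, _),
        Lo is First - 2, Hi is First + 2,
        between(Lo, Hi, Sit).

    % negative_example/1 draws situations from E_A-, the region around the
    % start of a sibling accepted annotation B in the same context.
    negative_example(selection_condition(Sit, Ctx, Op)) :-
        accepted_annotation(A),
        operator_of(A, Op),
        context(A, Ctx),
        accepted_annotation(B),
        B \= A,
        context(B, Ctx),
        range(B, FirstB, _),
        Lo is FirstB - 2, Hi is FirstB + 2,
        between(Lo, Hi, Sit).

With the ranges of the go-to-door(d1) example above encoded as integers, these clauses would enumerate positive examples near situation 20 and negative examples near situation 50, matching the regions E_A+ and E_A−.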

In general, the examples of decision concepts of an operator are selected only from situations where there is an appropriate context in which to consider a decision about it. Since the expert can only generate an accepted annotation A when its context annotation P is selected, for a given concept E_A+ and E_A− must be subsets of range(P). Similarly, during execution, the decision concepts of opA are considered only at situations where parent(opA) is active. In learning from behavior performances with continuously changing domains, we use a range of values to generate the examples to better deal with annotation errors related to timing; that is, the expert may select an annotation in anticipation of a condition that is expected to be satisfied soon. In the behavior specifications setting, on the other hand, where the timing of events is less important, the example regions are reduced to single situations.

Different concepts make different suggestions about the activity of the operators. For example, a situation where termination-condition(opA) holds suggests that the agent has to terminate opA if opA is active, and that opA should not be selected if it is not active. selection-condition(opA) is useful to decide whether opA should be selected once a previous operator has terminated (e.g. because of its termination condition). It is not very useful while another operator is still active, because such situations are not considered as examples for selection-condition(opA). In contrast, overriding-selection-condition(opA) suggests terminating another operator opB and selecting opA even during the situations where opB is active, since the negative examples of this concept are collected throughout a region where opB is active. Neither selection-condition(opA) nor overriding-selection-condition(opA) makes a suggestion while opA is active, because their examples are not collected in such regions. Finally, like overriding-selection-condition(opA), maintenance-condition(opA) suggests that opA should start even if another operator is still active. Like the other selection conditions, the absence of maintenance-condition(opA) suggests that opA should not be started at situations where it is not active; unlike the other selection conditions, the absence of maintenance-condition(opA) also suggests that opA should be terminated if opA is the active operator. This difference stems from the fact that the positive examples of the maintenance condition concept are collected through all situations where opA is active.

If data collected in the second execution cycle of our framework is available, where the expert evaluates the agent's generated behavior and annotations, we get an opportunity to extract more reliable negative examples for the selection condition (Figure 16). The regions where the expert rejects the selection of an operator op(x0) at a situation s0 and context c0 provide reliable negative examples of selection-condition(s0, c0, op(x0)). Nevertheless, to get to the second cycle, our framework must be capable of learning in the first cycle an approximate agent that can generate reasonable behavior. Therefore, in the experiments in section 5.3, we focus on learning from only expert-generated behavior.

[Figure 16 content: the context annotation P, the rejected annotation ¬A for selection-condition(opA), and the negative example region E_A−.]

Figure 16. Negative examples for selection-condition, extracted from a rejected annotation ¬A, where the expert rejects the agent program's annotation with the operator opA.

This list of decision concepts is not a complete list of potentially useful decision concepts; it just exemplifies how different kinds of decision concepts can be defined for different execution strategies. A subset of these concepts may be sufficient to obtain an executable agent. For a given execution architecture, we can choose to learn a subset of decision concepts that are easier to encode and more efficient to execute. For example, Soar version 7 architecturally supports rules similar to the termination/selection decision concepts, while Soar version 8 supports only rules similar to the maintenance concept.

One problem with using multiple decision concepts for an operator is that they may make conflicting suggestions. There are several possibilities for dealing with this problem. One can commit to a particular priority between the decision concepts. For example, KnoMic [47] learns only concepts similar to our selection and termination conditions; in execution, KnoMic implicitly assumes that termination conditions have higher priority. Another alternative is to have a dynamic conflict resolution strategy. For example, a second learning step could be used to learn weights for each concept such that the learned weight vector best explains the behavior traces. In this dissertation, we do not explore conflict resolution strategies; like KnoMic, we commit to a fixed set of operator concepts and an execution mechanism with fixed priorities to interpret them. Next, we define an execution-condition concept, which defines an execution strategy for the agent program using only the selection and termination concepts.

Definition 2. Execution condition
Our current framework assumes that the agent architecture uses the following execution-condition concept to determine which operator instantiation op is considered for activation at a given situation s and context c:

    execution-condition(s, c, op) ↔ selection-condition(s, c, op) ∧ ¬termination-condition(s, c, op)

During agent program execution, for a given parent operator instance, if no child operator is selected because the parent operator has just started or because a previous child operator has just terminated, a child operator instance op that satisfies execution-condition(op) is randomly selected and activated. At each level of the operator hierarchy, an active operator op is continuously executed as long as execution-condition(op) holds; op is terminated if execution-condition(op) does not hold. For example, in Figure 13, right at the situation where get-item-different-room(i1) is selected, go-to-door(d1) is selected as a suboperator, because it is one of the operator instantiations that satisfy the execution condition at that situation and context. In that context, a new suboperator is not considered until the execution condition no longer holds and go-to-door is retracted. After the termination of go-to-door(d1), since the parent operator is still active, a new suboperator go-through-door(d1) is randomly selected among the operator instances that satisfy the execution-condition concept at that situation and context. The execution-condition predicate must be interpreted using additional assumptions about the structure of the operator hierarchy.
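As a sketch of how Definition 2 could drive execution, the following Prolog fragment combines the learned selection and termination conditions and picks a child operator at random among the instances whose execution condition holds. selection_condition/3 and termination_condition/3 stand for the learned concepts, and random_member/2 is SWI-Prolog's library(random) predicate; the code is illustrative rather than the Soar implementation described here.

    :- use_module(library(random)).    % provides random_member/2

    % Definition 2: an operator may run while it is selected and not terminated.
    execution_condition(Sit, Ctx, Op) :-
        selection_condition(Sit, Ctx, Op),
        \+ termination_condition(Sit, Ctx, Op).

    % When no child is active in a context, pick one operator instance at
    % random among those whose execution condition holds right now.
    choose_child(Sit, Ctx, Op) :-
        findall(O, execution_condition(Sit, Ctx, O), Candidates),
        Candidates \= [],
        random_member(Op, Candidates).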

Definition 3. Assumptions to interpret execution condition
1) Each active operator except the top operator must have an active parent operator.
2) Two active operators cannot share the same parent operator.

To achieve the first assumption, whenever execution-condition(op) does not hold and op is deactivated, all operators below op are also deactivated. To achieve the second assumption, if op is an active operator, no other operator op1 that shares the same parent operator is activated, even if execution-condition(op1) holds. If no operator is active in a context and execution-condition holds for multiple operators in that context, only one of them is randomly selected and activated.

While the agent architecture has to interpret execution-condition in the context of these assumptions (this behavior is supported architecturally by Soar), the learning system should be biased with those assumptions through how the decision concepts are defined and how their examples are generated. For example, in Figure 13, in the context of get-item-different-room(i1), only one operator is activated at a time due to the second assumption in Definition 3. Here, when get-item-different-room is first started, go-to-door(d1) is selected among all plausible alternatives (e.g. another door d2 could also be leading to the same target), and once it is activated, no other alternative operator is considered until its termination. At the situation where the execution condition of get-item(i1) does not hold (i.e. the termination condition is satisfied as soon as the agent has grabbed i1), all suboperators below this get-item instantiation are automatically retracted due to the first assumption in Definition 3. The learning system, on the other hand, is biased with the first assumption by generating examples of a decision concept only in regions where the parent operator is active (Figure 15). The second assumption is also used in generating the examples, where an operator opA is treated as undesired at a situation where another operator opB is selected (Figure 15.b).
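A rough sketch of how an interpreter could enforce the two assumptions of Definition 3 is shown below; active/1, child_of/2, and deactivate/1 are hypothetical bookkeeping predicates invented for this sketch, not part of the Soar-based implementation.

    :- dynamic active/1, child_of/2.

    % Assumption 1: deactivating an operator also deactivates everything below it.
    deactivate(Op) :-
        retractall(active(Op)),
        forall(child_of(Sub, Op), deactivate(Sub)).

    % Assumption 2: a new child may be activated only if no sibling is active.
    may_activate(Sit, Ctx, Op) :-
        \+ (child_of(Other, Ctx), active(Other)),
        execution_condition(Sit, Ctx, Op).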

4.4.1 Learning from only Positive Examples

The negative example regions in Figure 15.b, c, d implicitly assume that the selection of an operator opA is undesirable when another operator opB is selected. This assumption does not completely hold in domains where the desired agent behavior is nondeterministic, because there may be situations where the agent can select among equally good alternatives. In such domains, the negative examples of selection concepts should be treated more weakly, as preferences rather than completely accurate negative examples.

The parametric nature of our operators causes another difficulty in learning with only correct expert traces. The negative examples in Figure 15.b, c, d indicate when not to select an operator opA, but our early experiments have shown that they may not give enough information about which operator parameter instantiations would be incorrect at a situation where opA has to be selected. To deal with this problem, we describe an additional negative example generation scheme. For each positive example con(s0, c0, op(x1)), we generate heuristic negative examples of the form con(s0, c0, op(x2)) using the same situation s0 and context c0, but an operator parameter vector x2 that was used in a different selection of the same operator in the behavior trace. For example, in Figure 17, for a randomly selected positive example selection-condition(s0, c0, go-to-door(d1)), heuristic negative examples are created by using the same situation s0, context c0, and operator name go-to-door, whereas the argument of go-to-door is selected from a set of arguments that were used in instantiations of that operator at different annotations on the behavior trace. To differentiate between these two heuristic mechanisms that generate negative examples from a correct trace, we call the negative examples generated as in Figure 15.b, c, d situation heuristic negative examples, and the negative examples generated as in Figure 17 parameter heuristic negative examples.


Like the situation heuristic negative examples, the parameter heuristic negative examples should be treated as preferences in domains with nondeterministic desired behaviors, because a selection of op(x1) does not guarantee that another selection op(x2) would be incorrect. During learning with these examples, we search for compact hypotheses that cover all positive examples while minimizing coverage of negative examples. This approach is similar to the positive-only learning strategy described by Muggleton [13], except that in their algorithm the negative examples would be created by choosing all arguments of the learned concept randomly and independently, including the situation and context arguments. Compared to Muggleton's approach, we use special properties of our situation-based representation to obtain more relevant negative examples. First of all, we use the knowledge that the situations and contexts are dependent, such that our procedure ensures that the randomly selected situation is a situation in the randomly selected context. Moreover, we restrict the situation and context pairs to cases where an operator is known to be selected, creating near-miss examples to detect parameters that should not be selected at that situation. Muggleton's more generic approach would consider all situation and context pairs, even the ones where the considered operator is not selected. Our approach of using prior knowledge about dependencies of arguments in positive-only learning could be applied more generally in other ILP problems, but it is not the focus of this dissertation.

[Figure 17 content: a map of rooms r1, r2, r3 with doors d1–d8 and item i1. Positive example: selection-condition(s0, c0, go-to-door(d1)). Heuristic negatives: selection-condition(s0, c0, go-to-door(d4)), selection-condition(s0, c0, go-to-door(d6)), selection-condition(s0, c0, go-to-door(d7)).]

Figure 17. Parameter heuristic negative examples are generated by changing the operator parameters of a positive example to parameters of that operator observed in other situations.
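A minimal Prolog sketch of this parameter-heuristic scheme is given below; it assumes the positive_example/1 and annotation/3 representations sketched earlier and single-argument operators, neither of which is required by the framework.

    % Keep the situation and context of a positive example, but replace the
    % operator argument with an argument the same operator took elsewhere
    % on the behavior trace (=.., "univ", decomposes a term into name and args).
    parameter_heuristic_negative(selection_condition(Sit, Ctx, NewOp)) :-
        positive_example(selection_condition(Sit, Ctx, Op)),
        Op =.. [Name, Arg],
        annotation(_, _, OtherOp),
        OtherOp =.. [Name, OtherArg],
        OtherArg \== Arg,
        NewOp =.. [Name, OtherArg].

For the positive example in Figure 17, this would produce terms such as selection-condition(s0, c0, go-to-door(d4)) whenever go-to-door(d4) appears in some other annotation of the trace.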


4.4.2 Biasing Decision Concept Learning with Learned Decision Concepts

If we have reason to expect that one kind of operator concept can be learned more accurately than another kind, we may be able to use the hypotheses learned for the former kind to simplify learning the latter kind. For example, in learning with only correct expert traces, we extract reliable negative and positive examples for the termination-condition concept, but we use less reliable heuristic negative examples for the selection-condition concept. Next, we describe how hypotheses learned for the more reliable termination-condition concept can be used to bias learning of the selection-condition concept.

The relation between these two kinds of concepts is easy to demonstrate. If the expert terminates an operator op at a situation s (e.g. because it has achieved its goal), the execution strategy in Definition 2 assumes that the expert would not select the same operator instance op in that situation. Therefore, whenever termination-condition(s, op) holds, knowledge about selection-condition(s, op) is not required to determine the correct execution at s; consequently, selection-condition does not need to be learned accurately for those kinds of situations. Utilizing this intuition, we redesigned the learning scheme for the selection-condition concept such that while it is learned less accurately, it can be learned more easily and the implied behavior execution is not affected.

According to our new learning algorithm, termination-condition is learned before selection-condition. Moreover, any negative example selection-condition(s0, op0) that satisfies termination-condition(s0, op0) is removed from the example set used in learning selection-condition. Before this change, the learning algorithm could need to learn additional conditions that ensure that the learned concept is unsatisfied for these negative examples. In other words, any reason for terminating op0 would need to be explicitly encoded into the selection-condition, although it may already be encoded in the termination-condition. In the updated learning scheme, by eliminating the negative examples of selection-condition that satisfy termination-condition, these additional conditions do not need to be part of the selection-condition concept, and the concepts that need to be learned will be shorter and easier to learn. In this updated learning algorithm, the selection-condition does not correctly model decisions at situations where termination-condition holds, but this inaccuracy does not affect the accuracy of the execution condition in Definition 2. Instead, at a given situation, the selection-condition concept only encodes knowledge to choose among operators that would not be terminated by termination-condition.

This issue is also interesting from the perspective of the positive-example-only learning described in the previous subsection. In generating a heuristic negative example selection-condition(s0, op0), our system assumes that s0 is a situation where op0 could have been selected but the expert chose another operator instead. Given termination-condition(s0, op0), we can therefore eliminate a potential negative example selection-condition(s0, op0), because we know that s0 is not a situation where the expert could have selected op0. In general, knowledge about termination-condition restricts the space of operators that can be randomly selected in constructing heuristic negative examples, improving their quality.
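Expressed as a filter over the generated examples, this bias amounts to a single extra condition; the predicate names below follow the earlier sketches, and the learned termination_condition/3 is assumed to be available as a Prolog predicate.

    % A heuristic negative of selection-condition is kept only if the learned
    % termination-condition does not already explain why the operator was not
    % chosen at that situation.
    filtered_negative_example(selection_condition(Sit, Ctx, Op)) :-
        negative_example(selection_condition(Sit, Ctx, Op)),
        \+ termination_condition(Sit, Ctx, Op).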

4.5 Learning Decision Concepts using ILP

In this section, we overview Inductive Logic Programming and describe how it is used to learn the decision concepts in our relational learning by observation framework. The learning uses as input the decision concept examples generated by the training set generator, as described in the previous section, and background knowledge that is discussed in more detail in section 4.7. It returns as output decision concept rules (Definition 1) that are used in agent program generation.

Having an explicit ILP component in our relational learning by observation framework is advantageous, as opposed to having an ad hoc relational learning by observation algorithm, since it forces us to be more explicit about the bias and background knowledge used in learning. This approach clarifies the connection between the imprecisely specified learning by observation problem (learning an agent program from observations) and an actively studied and formally specified learning approach (ILP). Moreover, our framework can be connected to different ILP algorithms, which is useful not only from the learning by observation research perspective but also from the ILP perspective, since learning by observation provides a good testbed in a challenging domain that is rarely studied in an ILP setting.

First, we overview the ILP formulation we use in our framework and then describe how it is used in the learning by observation context.

4.5.1 ILP and Inverse Entailment

The general supervised ILP problem can be specified as follows. Given background knowledge B and a set of examples (positive and negative) E, find the simplest hypothesis H such that

    B ∧ H |= E                                  (1)

ILP algorithms are often cast as a search problem where the goal is to find an H that satisfies this constraint. They can usually be characterized by the structure of the search space of H theories, the strategy used in the search (e.g. depth-first, breadth-first), the order in which the hypotheses are considered (e.g. specific-to-general or general-to-specific), and heuristics for measuring the quality of a hypothesis to direct and stop the search. In the current implementation of our relational learning by observation framework, we use the ILP testbed Aleph [41] in a setting that uses the inverse entailment [27] formulation of ILP. In inverse entailment, the formula above is interpreted as

    B ∧ ¬E |= ¬H                                (2)

Next, the so-called bottom clause ⊥ is defined such that ¬⊥ is the conjunction of all (possibly infinitely many) skolemized ground literals that are true in all models of B ∧ ¬E, so that

    B ∧ ¬E |= ¬⊥ |= ¬H

and therefore we get:

    H |= ⊥                                      (3)

Candidate H clauses can be found by first constructing the ⊥ clause from a single positive example and then searching through the space of H clauses that subsume ⊥ while evaluating them for coverage of all examples. For example, given the background knowledge (as usual, a set of clauses is considered to be a conjunction of clauses)

    B = {dog(a), pet(X) ← dog(X), animal(X) ← pet(X)}

and a single example

    E = {nice(a)}

we get

    ¬⊥ = ¬nice(a) ∧ dog(a) ∧ pet(a) ∧ animal(a)

and therefore the bottom clause becomes

    ⊥ = (∀x) nice(x) ← dog(x) ∧ pet(x) ∧ animal(x)

In the next step, rules consisting of substructures of this bottom clause are generated in a heuristic search, where each one of them is tested for coverage on positive and negative examples. For example, hypotheses such as “∀x nice(x) ← true” or “∀x nice(x) ← dog(x) ∧ pet(x)” may be considered as candidates during this heuristic search and the best one is selected using all examples.
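The running example can be written directly as a small Prolog program to check the coverage of a candidate hypothesis; nice_hyp below is an illustrative name for one candidate so that it does not clash with the target predicate, and this is only a hand-run of the idea, not the Aleph search itself.

    % Background knowledge B.
    dog(a).
    pet(X)    :- dog(X).
    animal(X) :- pet(X).

    % One candidate hypothesis generated from the bottom clause by keeping a
    % subset of its body literals.
    nice_hyp(X) :- dog(X), pet(X).

    % Coverage check of the single positive example nice(a):
    % ?- nice_hyp(a).
    % true.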

4.5.2 Using ILP in Learning by Observation

In our learning by observation case, for a randomly selected positive example con(s0, c0, op(x0)), the bottom clause contains a conjunction of all predicates that hold in the situation s0 and context c0. These predicates include the observed and assumed predicates that are stored in the episodic database, as well as predicates inferred from them using the background theory. For example, Figure 18 depicts one positive example and the set of ground predicates that are used in constructing the ¬⊥ clause.


Although in principle literals inferred about any situation of the behavior trace can be used to construct ¬⊥, we use explicit bias (section 4.7.1.4) to force the learner to use only hypothesis conditions that refer to the situation and context that appear in the head of the concept example. Therefore, Figure 18 lists only predicates that hold at the situation s12 and context c25. This restriction reflects our assumption that the agent program determines all decision concepts using only information in the current situation. If the agent needs to use information from past experience in its current decisions, it has to build explicit memory structures that summarize past experience in the current situation. Using knowledge about past experience is further discussed in section 4.7.1.2.

One Positive Example:
    selection-condition(s12, c25, go-to-door(d1))
Assumed Predicates:
    door(s12, r1, d1), door(s12, r1, d2), connected-door(s12, d2, d3), in-room(s12, d3, r2), contains(s12, r2, d3), …
Observed Predicates:
    current-room(s12, agent, r1), …
Inferred Predicates:
    parent-goal(c25, get-item(i1)), path(s12, r1, p1), pathdoor(s12, p1, d1), destination(s12, p1, r2), contains(s12, r2, i1), …

Figure 18. Single positive example and ground literals used to construct the bottom clause. (The actual syntax of the parent-goal predicate is slightly more complicated than presented here and is discussed in section 4.7.1.1.)

Next, ¬⊥, which consists of the conjunction of the background predicates and the negation of the positive example, is negated to obtain the bottom clause ⊥; during this step the skolem constants are replaced with variables, and we obtain the bottom clause in Figure 19. Note that we use Prolog syntax, where capitalized names are variables. We replaced the random variable names that the learning code generates with semantically meaningful names for presentation purposes. We prevented variablization of special constants, such as the operator names and the constant "agent", using the explicit bias discussed in section 4.7.1.4.

selection-condition(S, Context, go-to-door(Door)) ←
    parent-goal(Context, get-item(Item))        and
    current-room(S, agent, Room1)               and
    door(S, Room1, Door)                        and
    door(S, Room1, Door2)                       and
    ...
    connected-door(S, Door2, Door3)             and
    in-room(S, Door3, Room2)                    and
    contains(S, Room2, Door3)                   and
    ...
    path(S, Room1, Path)                        and
    pathdoor(S, Path, Door)                     and
    destination(S, Path, Room2)                 and
    ...
    contains(S, Room2, Item)
    ...

Figure 19. The most specific (bottom) clause of the learned hypothesis.

Next, a search algorithm generates hypotheses that subsume the bottom clause and evaluates them based on coverage of positive and negative examples. In our experiments, we used the ILP testbed Aleph [41] in a setting where it performs a general-to-specific A* search (hypotheses with fewer conditions are considered earlier). We used a simple cost function frequently used in ILP systems that maximizes the covered positive examples while minimizing the covered negative examples and the minimum number of literals that must be added to the current hypothesis to generate the variables in the head of the hypothesis. This is an admissible heuristic because we want hypotheses that can generate the variables of the operators in the head of the decision concepts. For example, the learned hypothesis in Figure 21 generates the Door argument given a situation and context, whereas the overgeneral hypothesis in Figure 20 does not generate the output variable Door; looking at the bottom clause, it needs at least one additional condition (i.e. door(S, Room1, Door)) to become a structurally acceptable hypothesis that generates the output variables.

selection-condition(S, Context, go-to-door(Door)) ←
    parent-goal(Context, get-item(Item))        and
    current-room(S, agent, Room1)

Figure 20. An overgeneral hypothesis for the selection condition of the go-to-door operator.

Figure 21 depicts a correct hypothesis that is learned with this process in the experiment reported in section 5.2. It reads as: "At any situation S and context Context with an active context operator get-item(Item), the operator go-to-door(Door) should be selected if Door can be instantiated with the door on the shortest path from the current room to the room where Item is located." Here, the learning system models the selection decision of go-to-door by checking the high-level goals and retrieving relevant information (parent-goal retrieves information about the desired item), by using structured sensors (i.e. current-room), and by using task/domain knowledge (i.e. contains, path).

selection-condition(S, Context, go-to-door(Door)) ←
    parent-goal(Context, get-item(Item))        and
    current-room(S, agent, Room1)               and
    path(S, Room1, Path)                        and
    pathdoor(S, Path, Door)                     and
    destination(S, Path, Room2)                 and
    contains(S, Room2, Item).

Figure 21. A desired hypothesis for the selection condition of the go-to-door operator.
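The learned rule of Figure 21 can be checked by hand against the ground literals of Figure 18; the following Prolog sketch does exactly that, with hyphens rewritten as underscores and only the facts needed for the derivation repeated.

    % Ground literals from Figure 18 (abridged).
    parent_goal(c25, get_item(i1)).
    current_room(s12, agent, r1).
    path(s12, r1, p1).
    pathdoor(s12, p1, d1).
    destination(s12, p1, r2).
    contains(s12, r2, i1).

    % The learned hypothesis of Figure 21.
    selection_condition(S, Context, go_to_door(Door)) :-
        parent_goal(Context, get_item(Item)),
        current_room(S, agent, Room1),
        path(S, Room1, Path),
        pathdoor(S, Path, Door),
        destination(S, Path, Room2),
        contains(S, Room2, Item).

    % ?- selection_condition(s12, c25, go_to_door(Door)).
    % Door = d1.

The query reproduces the positive example selection-condition(s12, c25, go-to-door(d1)) of Figure 18, which is exactly the coverage the learner tests for.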

4.6 A Learning by Observation Example

In this section, we describe a learning by observation problem that we used in our experiments (section 5.3). In the next section, we will use this example to describe the role of background knowledge in our learning by observation framework.

[Figure 22 content: an operator hierarchy over the operators acquire(Item), move-to-new-area(Area), drop(Item), move-near(Item), pickup(Item), move-near-in-area(Item), move-to-area(Area), move-to-via-node(Node), move-to-connected-node(Node), and move-through-door(Node).]

Figure 22. An operator hierarchy used in the Haunt domain experiments.

Figure 22 depicts a Haunt domain operator hierarchy learned in our experiments. Figure 23 depicts a sample correct behavior. At the top level, the agent randomly decides to acquire an item with the operator acquire(Item), to go to a random room with move-to-new-area(Area), or to drop an item it carries at the current location using drop(Item).

The agent can interrupt and change the top-level goals at random times. If the agent has decided to acquire an item with acquire, and the item is sufficiently close, the agent collects the item using pickup. Otherwise, the agent approaches the item using move-near. If the agent is in the same room as the item, it approaches it with move-near-in-area; otherwise it uses move-to-area to first go to the same room as the item.


[Figure 23 map content: rooms r1–r4 connected by doors d1–d8, the item i1, and points g1–g6 marking where the goal stacks listed below are active.]

g1: acquire-item(i1) move-near(i1) move-to-area(r3) move-to-via-node(d1)
g2: acquire-item(i1) move-near(i1) move-to-area(r3) move-to-connected-node(d2)
g3: acquire-item(i1) move-near(i1) move-to-area(r3) move-to-via-node(d3)
g4: acquire-item(i1) move-near(i1) move-to-area(r3) move-to-connected-node(d4)
g5: acquire-item(i1) move-near(i1) move-near-in-area(i1)
g6: acquire-item(i1) pickup(i1)

Figure 23. A behavior example for the operator hierarchy in Figure 22.

As discussed in section 4.3.2, while the expert interacts with a visual representation of the environment, the learning by observation system and the learned agent program use a corresponding symbolic representation. In the Haunt domain, the doors and connections between two rooms are represented with nodes that mark both sides of the doors connecting the rooms. While the expert uses the 3-D display of doors on the screen for behavior, the agent program uses the corresponding nodes for navigation. To achieve move-to-area, the agent first chooses a door to exit towards its target room by selecting move-to-via-node(Node), where Node is the marker in front of the door the agent wants to exit. If the agent is close enough to the target door that it can see the node marking the other side of the door in the neighboring room, the agent moves to that node using move-to-connected-node and enters the new room. If the new room is not the target room, the agent chooses another door and proceeds towards it using a new instantiation of move-to-via-node. The same process continues until the agent gets to the target room and achieves move-to-area.

Some of the rooms are connected with closed doors, in which case the agent cannot see the node on the other side of the door until the agent tries to go through the door and the door opens. While move-to-via-node takes the agent close to a door the agent wants to exit, it does not always cause the door to open, for example if the agent is moving towards the node in front of the door at an angle parallel to the door. In those cases, the agent cannot use move-to-connected-node to exit through the door, since the closed door blocks the agent from seeing the connected node on the other side. In this case, the agent applies the move-through-door operator, which moves the agent towards the door at a direct angle so that the door opens; the agent either exits the room while applying move-through-door, or it opens the door, perceives the connected node, and uses move-to-connected-node to exit the room.

Figure 24 shows some of the execution conditions learned for this hierarchy in the experiments reported in section 5.3. A complete list is presented in Appendix A. In this experiment, expert traces similar to the one depicted in Figure 23 are converted to examples of the termination and selection condition concepts. In learning the selection condition concepts, we used only correct behavior traces with the positive-only learning strategy described in section 4.4.1. Moreover, our system first learned the termination condition concepts, which are then used in learning the selection condition concepts as described in section 4.4.2. The learned selection and termination condition concepts are converted to execution condition concepts as in Definition 2. According to the execution strategy in Figure 24, move-near(Object) is executed if there is a high-level goal to acquire Object and Object is no less than 77.32 distance units away. If move-near(Object) is selected but Object is not in the current room, move-to-area(Area) is executed, where Area is instantiated with the room where the Object

is located (checked with the area-id predicate). This execution stops when the agent is in Area or when the location information of Object changes. The location of Object might not be directly visible to the agent, in which case the agent uses the last seen location of the Object as its assumed location. If Object was not seen before, the whole goal stack is retracted and a different top-level goal is selected (i.e. traversing to a different area using move-to-new-area). If the move-to-area(Area) operator is active, the agent can select among three different subgoals, move-to-via-node(Node), move-to-connected-node(Node), or move-through-door(Node), each of which is used to select a node to navigate towards.


a) move-near
execution-condition(Sit, Context, move-near(Object)) ←
    parent-goal(Context, name(acquire), 1, Object),
    not( parent-goal(Context, name(acquire), 1, Object),
    compare_range(Sit, Object, s1, …

…until the expert collects the object or observes it in a different room, at which time the area-id predicate is updated with the new information. Of course this does not mean that

the expert must also be maintaining the location of all objects. This mechanism, like the other background predicates, is only used to model the expert to the extent that it explains expert behavior. If, for example, the expert indeed maintains the locations of the objects, the learner can benefit from this mechanism directly while modeling the expert. On the other hand, if the expert maintains the location of the target objects only in some cases (e.g. when he/she is not dealing with an intervening external agent), then the learner could learn to use this mechanism conditional on the existence of an external agent. Finally, if the expert does not know the location of the target item and uses other information in his/her choices (e.g. the expert may be using a random strategy of following red doors), the learner would ignore the location of the object and use another regularity (red doors) to model expert behavior.

Our learning by observation framework supports belief maintenance using two mechanisms. The annotated behavior generator interface (Figure 6) that the expert uses to generate behavior data can update the current belief state using the perceived situation and hand-coded background knowledge. Alternatively, the ILP learning component can use hand-coded background knowledge such that area-id(s, o1, r1) is dynamically calculated by inspecting the last directly observed location of the item o1 on the behavior trace. Using the episodic database (section 4.9), it is trivial to write a background knowledge predicate area-id(s, o1, r1) that accesses the last observed value of a SER predicate such as area-id(o1, r1), and such queries are handled very efficiently. This second implementation is also a mechanism that accesses information about past situations, as described in the previous section. We have implemented both of these mechanisms, although in the experiments reported in section 5.3 we used only the first method.

During agent execution, we also have two options for implementing belief maintenance. If the agent architecture supports episodic memory, it could dynamically determine area-id beliefs by examining its own history. Making episodic memory an architectural component of Soar is an ongoing research project [33]. In the alternative method we use in our current implementation, the agent program maintains beliefs about special predicates such as area-id using agent program rules. Ideally, maintaining beliefs should be a domain-independent mechanism that uses domain-dependent knowledge about which belief predicates are maintained and how. However, in our current implementation, all belief maintenance is implemented in a domain-dependent fashion. In learning, the interface uses hand-coded knowledge to determine what beliefs should be updated, and during execution the learned agent program uses hand-coded rules to update these beliefs.
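As an illustration of the second learning-time mechanism, the following Prolog sketch derives an assumed area-id belief from the most recent direct observation on the trace; it assumes situations are numbered integers and observations are stored as observed(Situation, area_id(Object, Area)) facts, which is a simplification of the episodic database interface.

    % The assumed location of Object at situation S is the Area in which it
    % was last directly observed at or before S.
    assumed_area_id(S, Object, Area) :-
        observed(S0, area_id(Object, Area)),
        S0 =< S,
        \+ (observed(S1, area_id(Object, _)),
            S1 =< S,
            S1 > S0).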

4.7.1.4 Explicit Bias

In many ILP algorithms, background knowledge is represented using the same language that is used to represent the hypotheses. Explicit bias is an additional ILP mechanism that explicitly encodes meta-knowledge such as size limitations on the allowed hypotheses. This can dramatically improve the efficiency of ILP algorithms by limiting the number of generated hypotheses. In addition, a restricted language often means better generality and higher quality of learned knowledge with fewer examples. In our learning by observation framework, we are committed to a rich but restricted first-order representation, which can be easily formalized using explicit bias so that the ILP learner can benefit from this assumption.

Our relational learning by observation framework significantly restricts the hypothesis search space of the ILP component by using explicit bias. First of all, we are only interested in execution condition concepts of the form execution-condition(s, c, op(x)), such that if the concept is called with the situation s and the context c instantiated, it should return an instance op(x) in which the operator name op and the argument vector x are both instantiated. In Figure 24, for example, this is achieved by requiring that all generated hypotheses use a constant operator name and contain a chain of relations that is guaranteed to instantiate the variable x if s and c are instantiated.

A second desired structural restriction on the hypotheses is motivated by their efficient evaluation. At a given time, our framework represents the environment using a directed graph of binary SER predicates (Figure 11). In regular Soar programs, these binary predicates are tested only after their first argument (input argument) is instantiated, since Soar can execute such rules more efficiently (a similar idea is also used in Prolog programs, which under normal circumstances index predicates using the first argument). This corresponds in our framework to the fact that the observed situation predicates (section 4.3.2) are only tested after their first and second arguments are instantiated, which correspond to the situation and the input variable, respectively. For example, suppose that in the execution condition of move-to-via-node in Figure 24, for a given situation Sit, the situation predicates path, path_via_node, and path_destination are called with their first two arguments instantiated by

previous predicates. If we search only for hypotheses that satisfy this syntactic constraint, not only is the hypothesis search space reduced, but the learned agent programs also execute faster. Moreover, since the episodic database of our framework (section 4.9) also makes use of this assumption, during learning it answers queries faster, thereby improving the efficiency of evaluating each candidate hypothesis considered during learning. Fortunately, this structural constraint can be naturally represented in many ILP algorithms using so-called mode definitions. For example, a mode definition can be specified as follows:

    pred(+x, −y, #z)

This structure indicates that in the learned hypothesis pred is tested with the input variable x instantiated, the output variable y either instantiated or not, and, due to "#z", the third argument of every occurrence of pred in the hypothesis must be a constant. For example, Figure 26 lists the mode definitions that are used in learning the execution condition of move-to-via-node in Figure 24.c. Like any hypothesis considered in learning, when this hypothesis is called as a Prolog query with the first two arguments of the head instantiated, it is guaranteed that all input variables in the body will be instantiated at the time their predicates are queried. For example, Path is an input variable of the path_via_node predicate, and at the time path_via_node is queried, Path is guaranteed to be instantiated by the path predicate.


execution-condition(+sit, +context, #op(-arg))
parent-goal(+context, #op, #nth, +arg)
sensors(+sit, #sensor, -value)
path(+sit, +area, -path)
path_via_node(+sit, +path, -node)
path_destination(+sit, +path, -targetArea)

Figure 26. Mode definitions for the execution condition of move-to-via-node in Figure 24.

In our learning by observation framework, for each concept predicate con we use the mode definition con(+sit, +context, #op(−arg1, −arg2, ..)), which encodes the assumptions that the concepts are called with the situation and context arguments instantiated, that each concept is learned specifically for a particular operator (since #op denotes a constant name), and that the learned hypothesis must guarantee to instantiate the values of the operator arguments. We have already discussed that the mode definition of each observed situation predicate is defined as p(+s, +x, −y). Moreover, the inferred predicates require manually defined mode definitions.

Mode definitions are used to limit the structure of the bottom clause. For example, in Figure 19, the predicate door(S, Room1, Door) can be added to the bottom clause only after a predicate that is guaranteed to instantiate the variable Room1, which is current-room(S, agent, Room1) in this case. Moreover, only hypotheses that follow the mode constraints are generated from the bottom clause. For example, even though the hypothesis

    selection-condition(S, Context, go-to-door(Door)) ← door(S, Room1, Door)

subsumes the bottom clause in Figure 19, it is not generated during learning because it does not satisfy the constraint that Room1 must be an input variable of the door predicate. As described in section 4.7.1.1, the generated hypotheses consider only predicates at the current situation, and any information about past situations is only referred to indirectly through special background predicates such as completed-goal. This is achieved by defining situation variables as output variables in the head of concepts and as input variables in all background predicates. Consequently, the bottom clause generates only conditions that hold in the current situation, which greatly reduces the hypothesis search space.

only conditions that hold in the current situation, which greatly reduces the hypothesis search space.

4.7.2 Task & Domain Knowledge

Our learning by observation framework can use task and domain knowledge that is not directly observed in the environment but can be used to model the prior knowledge of the expert. Task and domain knowledge includes specific factual knowledge (such as which rooms are interconnected on a playground map) and general inference knowledge (such as how to find the shortest path between two rooms). While factual knowledge provides context to interpret the observed situations, inference knowledge provides mechanisms for this interpretation.

Specific factual domain/task knowledge is represented using ground literals that are added to the behavior trace. For example, in the Haunt domain, connected-node(s1, n1, n2) represents the fact that there is a connection between the nodes n1 and n2 (corresponding to the two sides of a door connecting two rooms) at the situation s1. For example, the execution condition of move-through-door (Figure 24) tests connected-node facts in deciding whether to exit a room through a door. Specific factual knowledge is typically static during the execution of a task. For example, a predicate connected-node(s, n1, n2) may hold for all situations s on a behavior trace. Nevertheless, our framework allows such facts to change during the behavior, as they are treated just like the dynamically observed facts. This flexibility is useful because the distinction between static and dynamic features of the world is not always clear. For example, the connection between two rooms is usually static during the performance of a task; nevertheless, if the agent digs a hole in a wall, the connectivity structure between rooms may change during behavior execution.

In addition to specific background knowledge, our relational learning by observation framework can use knowledge about inferences the expert may make at a given world state. This kind of knowledge usually requires a more powerful language than ground literals. In our framework, general inference knowledge is encoded using first-order Horn clauses. These structures are used to guess the inferences the expert may be conducting at a given state of the world. For example, the execution condition of move-to-connected-node (Figure 24) uses the inference predicate visible_node(Sit, Node1) to test that the node Node1 in a neighboring room is visible (i.e. there is no closed door between the rooms). Here visible_node compares Node1 against the nodes that are detected by the sensors. Background knowledge can also be used to model the expert's focus on relevant objects in a decision. For example, compare_range(Sit, Object,