Learning by Doing: A Dynamic Architecture for Generating Adaptive Behavioral Sequences
Axel Steinhage∗ and Thomas Bergener
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany
Tel: +49 (0)234 322 7969, Fax: +49 (0)234 321 4209
Email: Axel.Steinhage/[email protected]
NC 2000, invited paper for the session Nonlinear Dynamics and Neural Fields
Keywords: Dynamical Systems, Behavioral Organization, Learning, Autonomous Robots
Abstract
We present a software architecture for the behavioral organization of a mobile robot which is entirely based on nonlinear dynamical systems. The activity of elements from a set of predefined elementary behaviors is controlled over time by coupled differential equations such that complex behavioral sequences are generated. By means of this architecture the behavioral system can be designed such that the robot autonomously plans its actions, selecting between multiple behavioral goals specified by an operator. On the basis of predefined local sensor contexts and logical interrelations between the elementary behaviors, the system learns to organize these behaviors into a sequence directed towards a behavioral goal. Learning can be run in two modes: (1) the system explores its behavioral space autonomously and (2) learning can be accelerated through guidance by the operator. We demonstrate the feasibility of the approach for sequence generation by an implementation on an anthropomorphic robot and show its planning and learning characteristics in a computer simulation.
1 Introduction
Autonomous systems organize their behavior depending on the information currently acquired by their sensors and on the experiences gathered in previous situations. The key aspect which distinguishes autonomous systems from simple control systems is the flexibility to react to continuous variations of the sensor input with qualitative changes of the system's behavior. For biological organisms, which are perfect examples of autonomous systems, this characteristic becomes obvious if one divides a complex behavior into a number of so-called elementary behaviors: the behavior "hunting", for instance, often consists of a temporal sequence of behavioral phases like "exploring the terrain", "detection and identification of a potential prey", "approaching", "hiding", "attacking", "fighting", "killing", "eating" etc. The switching between the different phases is triggered by continuous changes of the system's sensor information, e.g. the perceived distance of the prey. The strong influence of the current sensor information in its entirety, which we will call "sensor context" from now on, lets the system react quickly to sudden changes of the situation, which often results in branching, termination or re-execution of a behavioral sequence.

However, there are several characteristics of autonomous systems which show that the current sensor context is not the only information used to organize the behavior. The most prominent features of this kind are state dependency, planning and adaptation.

State dependency: the decision which elementary behavior to activate within a specific situation sometimes depends on the internal behavioral state of the autonomous system. Although the sensory information acquired from the environment may be the same in two situations, the behavior may be different depending on the history that led to the current behavioral state of the system. For example, a satiated animal which has just come back from a successful hunt will often not go after a prey right away, although the conditions may be advantageous.

Planning: some behavioral goals can only be reached through a complex sequence of activation and de-activation of the elementary behaviors. These sequences must fulfill an internal logical and temporal relationship in the sense that the result of a specific behavior forms the basis for the next behavior within the goal-directed sequence. Therefore, the activation of an elementary behavior depends on the expected outcome it leads to. The behavioral goal as the endpoint of the sequence (e.g. "eating") is reached in an optimal way if, at any decision point within the sequence, that behavior is activated which brings the system closer to the situation in which the goal behavior can be acted out. Generating an optimal sequence in that way can be considered as following a behavioral plan.

Adaptation: autonomous systems are able to learn the consequences of their behavior. This enables them to avoid behaviors which turned out to be unsuccessful and to favor successful behaviors. In similar situations the system may therefore react differently depending on the knowledge acquired from previous trials.

Through a complicated interplay of evolution and lifelong learning, the nervous systems of biological organisms have become well suited for the task of adaptive and flexible behavioral organization. Inspired by the aspects of biological behavioral organization and the knowledge about information processing in nervous systems, we developed a mathematical framework for the design of an artificial autonomous system.

∗ Corresponding author.
2 The dynamic approach to behavioral organization
Our work is based on the Dynamic Approach to Behavioral Organization, the main principles of which we briefly describe now. Please refer to [4][5][6] for a more detailed description of the approach. We represent the behavioral state of the system by the vector $\vec{n}(t)$, the elements $n_i(t)$ of which parameterize the activity of the elementary behavior $i$ at time $t$. The $n_i(t)$ are the state variables of the coupled dynamical systems

$$\tau_n \dot{n}_i = \alpha_i n_i - |\alpha_i| n_i^3 - \sum_j \gamma_{i,j} n_j^2 n_i + \xi(t). \qquad (1)$$

Herein $\tau_n$ is the time scale of the dynamics, $\gamma_{i,j} \in \{0, 1\}$ is the competition matrix and $\alpha_i \in [-1, 1]$ is the competitive advantage. $\xi(t) \simeq 0$ is a small stochastic noise term which is necessary to prevent the dynamics from getting stuck in unstable fixed points. This noise term can be neglected in the following mathematical analysis. For $\alpha_i < 0$ the dynamics (1) has one fixed point $n_i = 0$ with $\dot{n}_i|_{n_i=0} \equiv 0$, which we assign to the state where behavior $i$ is not active. This fixed point is stable as $\partial \dot{n}_i / \partial n_i |_{n_i=0} = \alpha_i < 0$. For $\alpha_i > 0$ the activity of behavior $i$ depends on the competitive term $\sum_j \gamma_{i,j} n_j^2$: for $\sum_j \gamma_{i,j} n_j^2 = 0$ the dynamics (1) has three fixed points, an unstable one at $n_i = 0$ and two stable ones at $n_i = 1, -1$. We assign these latter two fixed points with $n_i^2 = 1$ to the state where behavior $i$ is active. For $\sum_j \gamma_{i,j} n_j^2 > \alpha_i$ the dynamics again has only one stable fixed point at $n_i \equiv 0$, so that behavior $i$ is not active in that case. The latter situation occurs if $\gamma_{i,j} = 1$ for $n_j^2 = 1$, i.e. if at least one other elementary behavior $j$ is active that has an entry $\gamma_{i,j} = 1$ in the competition matrix. The competition matrix $\gamma$ therefore encodes inhibitions between the elementary behaviors: an active elementary behavior $j$ prevents the activation of an elementary behavior $i$ if $\gamma_{i,j} = 1$. In summary, an elementary behavior $i$ is activated when its competitive advantage $\alpha_i$ is positive and no competing behavior $j$ is active. The competitive advantage $\alpha_i$ is the state variable of a second dynamics

$$\dot{\alpha}_i = \frac{1 + \alpha_i}{\tau_{\alpha,1}} (I_i - 1) + \frac{1 - \alpha_i}{\tau_{\alpha,2}} I_i, \qquad \tau_{\alpha,1} \ll \tau_{\alpha,2}. \qquad (2)$$
Here $I_i \in [0, 1]$ is the activating input for behavior $i$: for $I_i = 0$, $\alpha_i$ relaxes into the stable fixed point $\alpha_i = -1$ on the fast time scale $\tau_{\alpha,1}$, thereby switching off the elementary behavior $i$. For $I_i = 1$ the dynamics (2) relaxes into the stable fixed point $\alpha_i = 1$ on the slow time scale $\tau_{\alpha,2}$, allowing for an activation of the behavior. The dynamics (2) works as a low-pass filter for the input $I_i$ which suppresses oscillations: the slow time scale $\tau_{\alpha,2}$ prevents noisy variations in the input $I_i$ from being transferred directly to the behavioral level. We have implemented a second dynamics of this type to equip the system with a short-term memory:

$$\dot{m}_i = \frac{m_i}{\tau_i} (n_i^2 - 1) + \frac{1 - m_i}{\tau_{m,2}} n_i^2, \qquad \tau_i \gg \tau_{m,2}. \qquad (3)$$
Here the variable $m_i$ follows the behavioral state $n_i^2$: for an activation $n_i^2 = 1$ the dynamics (3) relaxes to $m_i = 1$ on the fast time scale $\tau_{m,2}$. The de-activation $n_i = 0$, however, is followed by $m_i = 0$ on the slow time scale $\tau_i$. This means that $\vec{m}$ is a memory for the behavioral state $\vec{n}$ which has a characteristic forgetting time $\tau_i$ for every behavior $i$. This memory is used to control the activation of an elementary behavior $i$ depending on the behavioral state. Following the line of thought of the introduction, the input $I_i$ consists of two parts, the required context of the behaviors and the logical interaction between them:

$$I_i = C_i \prod_j \left(1 + A_{i,j} (m_j - 1)\right). \qquad (4)$$
The behavioral context $C_i \in [0, 1]$ parameterizes to which extent the conditions for activating behavior $i$ are fulfilled, and the product term evaluates whether or not the rest of the elementary behaviors have the required activation states. The concrete functional form of $C_i$ depends on the knowledge about the external situation necessary for the activation of the behavior and on the internal knowledge about which behavioral sequence must be executed to achieve this situation. We will go into that in more detail in section 4. To understand the functionality of (4) we assume for now that $C_i$ subsumes all the external conditions required for the activation of behavior $i$. If these conditions are not given, indicated by $C_i = 0$, the input remains $I_i = 0$ and the corresponding behavior cannot be activated. If, however, the sensor context is fulfilled ($C_i = 1$), the value of $I_i$ depends on the product term in (4), which implements a logical AND-condition¹ for the memory state $m_j$ and the so-called activation matrix $A_{i,j} \in \{0, 1\}$. This matrix has the same dimension as the competition matrix $\gamma_{i,j}$ in (1). However, its function is the opposite: a behavior $i$ can only be activated if all behaviors $j$ with $A_{i,j} = 1$ are active, indicated by $m_j = 1$. The activation matrix therefore specifies the behavioral state that is necessary to activate a certain elementary behavior. If the necessary logical presuppositions are not fulfilled for one or more $m_j$, the product in (4) vanishes and the input $I_i$ remains zero even though the behavioral context $C_i = 1$ may be given.

Just by specifying the competition matrix, the activation matrix, the memory time scales $\tau_i$ and the contexts $C_i$, we have implemented this scheme for the organization of various rather complex robot behaviors and behavioral sequences [7][3]. On the anthropomorphic robot Arnold (see Fig. 1), for instance, we have implemented a door-passing behavior which consists of the elementary behaviors visually search for doors, detect and identify a specific door, track the door, approach the door and pass through the door [6].
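The gating described by equation (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `activating_input`, the ordering of the five door-passing behaviors and the concrete entries of the activation matrix are hypothetical choices made for this example, not values taken from the Arnold implementation.

```python
from math import prod

def activating_input(i, C, A, m):
    """Equation (4): I_i = C_i * prod_j (1 + A[i][j] * (m[j] - 1))."""
    return C[i] * prod(1.0 + A[i][j] * (m[j] - 1.0) for j in range(len(m)))

# Hypothetical ordering of the door-passing behaviors (illustration only).
SEARCH, DETECT, TRACK, APPROACH, PASS = range(5)

# Assumed activation matrix: "pass" requires "detect" and "approach".
A = [[0] * 5 for _ in range(5)]
A[PASS][DETECT] = 1
A[PASS][APPROACH] = 1

C = [1.0] * 5                            # all behavioral contexts fulfilled
m_ready = [0.0, 1.0, 1.0, 1.0, 0.0]      # detect and approach are remembered
m_not_ready = [0.0, 1.0, 1.0, 0.0, 0.0]  # approach has not been active

I_ready = activating_input(PASS, C, A, m_ready)        # input is switched on
I_blocked = activating_input(PASS, C, A, m_not_ready)  # product term vanishes
```

Each factor equals 1 when $A_{i,j} = 0$ and equals $m_j$ when $A_{i,j} = 1$, so a single missing presupposition drives the whole product, and hence $I_i$, to zero.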
An example of a context $C_i$ here is a characteristic distance between the robot and the door: if this distance is reached, indicated by $C_i = 1$, the behavior changes from approaching to passing through. This is an example of the ability of autonomous systems to react to continuous variations in the sensor input with a qualitative change of the behavior. In the dynamic approach this qualitative change is brought about by a phase change (or bifurcation) of the underlying dynamics. The competition matrix for this example has been set such that $\gamma_{i,j} = 1$ for incompatible behaviors like searching and pass through, while e.g. the behaviors track the door and approach have $\gamma_{i,j} = 0$ because they should be active simultaneously. The activation matrix has been set such that a sensible sequence is generated: for pass through to become active, the behaviors detect door and approach door must necessarily have been active before. It turned out that the design of these parameters is straightforward once the elementary behaviors and their pairwise logical relationships have been identified. Due to the use of continuous dynamics, the generated behavior and the phase switches are smooth. The influence of the context $C_i$ lets the system react flexibly to sudden changes in the environment (e.g. the closure of the approached door) with a re-execution of the sequence (the search behavior is activated again).

Figure 1: The anthropomorphic robot Arnold.

¹ As the logical variables within our approach can take any value from the interval $[0, 1]$, we have developed a continuous fuzzy form of the well-known Boolean binary logic: Not: $\bar{a} = 1 - a$, And: $a \wedge b = ab$, Or: $a \vee b = 1 - (1 - a)(1 - b)$.
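To make the interplay of equations (1) to (3) concrete, the following sketch integrates them with a simple Euler scheme for two behaviors. All numerical values (time scales, step size, noise amplitude, the 2x2 competition matrix) are assumptions chosen for illustration, the inputs $I_i$ are held constant, the diagonal of the competition matrix is assumed to be zero, and the noise term is added directly to the state at every step; none of this is claimed to match the parameters used on the robot.

```python
import random

def simulate(I, gamma, steps=4000, dt=0.01, tau_n=0.1, tau_a1=0.1,
             tau_a2=1.0, tau_forget=5.0, tau_m2=0.1, noise=1e-3, seed=0):
    """Euler integration of equations (1)-(3) for constant inputs I."""
    rng = random.Random(seed)
    k = len(I)
    n = [0.0] * k        # behavioral activities n_i
    alpha = [-1.0] * k   # competitive advantages alpha_i
    m = [0.0] * k        # memory variables m_i
    for _ in range(steps):
        n2 = [x * x for x in n]
        n_next = []
        for i in range(k):
            # Competitive term of equation (1); gamma[i][i] is assumed 0.
            competition = sum(gamma[i][j] * n2[j] for j in range(k))
            dn = (alpha[i] * n[i] - abs(alpha[i]) * n[i] ** 3
                  - competition * n[i]) / tau_n
            # Equation (2): fast switch-off, slow switch-on.
            da = ((1 + alpha[i]) / tau_a1 * (I[i] - 1)
                  + (1 - alpha[i]) / tau_a2 * I[i])
            # Equation (3): fast rise, slow forgetting.
            dm = (m[i] / tau_forget * (n2[i] - 1)
                  + (1 - m[i]) / tau_m2 * n2[i])
            # Small noise kicks the state off the unstable point n_i = 0.
            n_next.append(n[i] + dt * dn + noise * rng.gauss(0.0, 1.0))
            alpha[i] += dt * da
            m[i] += dt * dm
        n = n_next
    return n, alpha, m

# Assumed competition matrix: behavior 0 inhibits behavior 1.
gamma = [[0, 0],
         [1, 0]]

# Both inputs on: behavior 0 wins and suppresses behavior 1.
n_both, alpha_both, m_both = simulate([1.0, 1.0], gamma)

# Only behavior 1 has input: it becomes active without competition.
n_single, alpha_single, m_single = simulate([0.0, 1.0], gamma)
```

Because the active state corresponds to $n_i^2 = 1$ with either sign of $n_i$, activity is best read off from the squared values, e.g. `n_both[0] ** 2` close to 1 and `n_both[1] ** 2` close to 0.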
3 Problems of learning in robotics
Why should a robot learn? Even if we were able to program an autonomous robot such that it perfectly solves given tasks in an unknown and cluttered environment, which is practically impossible, the robot's performance would decrease for the following reasons: over long periods of time the system will undergo two kinds of changes. On the one hand, the system's physical structure will alter due to wear and minor defects in the sensors and actuators. On the other hand, the environment will change unpredictably. Both effects make it necessary to adapt the robot's control scheme depending on the environment's feedback, i.e. to learn from experience. In this paper we restrict ourselves to learning in action-selection mechanisms.

How does a robot learn? Maes groups the different approaches to learning in autonomous agents into three classes [2]: Reinforcement Learning means learning an action policy that maps every pair of a situation the robot finds itself in and an action to perform onto a new situation, assigning a specific reward to this pair. The learning strategy maximizes a policy's expected cumulative reward over time. Classifier Systems optimize a set of rules that propose actions in specific situations. Learning is done by a "bucket-brigade algorithm": a selected rule increases the strength of the rule that was used in the previous time step, thereby strengthening successful action-selection rules. The final class of learning approaches for autonomous agents are Model Builders: these systems try to learn a causal model of their actions and the observable states in the environment by measuring the probability of the effects when a specific action is chosen in some situation. Given a goal state, the system is able to backtrack an optimal sequence of actions towards the current situation.

A learning scheme for the robot Arnold: we require certain qualities of a learning scheme for our physical robot. First of all, the learning scheme shall enhance the existing framework for behavioral organization that performed well in many experiments with the anthropomorphic robot Arnold. We think that many learning approaches are based on action-selection schemes that would not work in realistic environments. Therefore we base our learning scheme on an architecture whose internal state information we can show to be available in experiments with a vision-guided anthropomorphic robot. Though we demonstrate learning only in computer simulations, we claim that this approach is also applicable to the physical robot.

Learning must be a continuous process in a robot's lifetime. Though the robot might distinguish phases of exploration and goal-directed acting, learning from experience should be done throughout. The switching between exploration and exploitation must be performed autonomously and depending on the current goals. The robot has to find out whether its knowledge is sufficient to reach at least one of the given goals. Whenever no useful action can be found for the current goals, the robot should explore the most promising behaviors.

A human user's insight into the robot's working domain can enormously accelerate the robot's learning process, even if there is no real communication or understanding between man and machine. Simple hints on behaviors to try next in an exploration phase can guide the robot efficiently, without making any changes in the exploration scheme necessary.

To be applicable to real robots, the learning scheme must accomplish learning from few tries, or even realize one-shot learning. Since anthropomorphic robots like Arnold spend up to a few minutes showing a single behavior like grasping or door passing, it is unrealistic to count on many thousands of trials for a statistical evaluation.
4 Mathematical framework of our approach
4.1 Planning
The context $C_i$ in equation (4) denotes whether behavior $i$ is applicable in the sense that the current sensor situation is adequate for behavior $i$ to be activated and that (a) it supports at least one currently given goal, or (b) the behavior is selected to be explored, or (c) the behavior can be activated independently of the current goals (and is thus called a default behavior). These different aspects are expressed by assigning a set of context variables to each behavior. For example, the goal context $C_i^{goal}$ is set to 1 if behavior $i$ is a currently active goal; otherwise we set $C_i^{goal} = 0$. In the same way the default behaviors are marked by a variable $C_i^{default} = 1$. The sensor context $C_i^s$ represents a classification of the present sensor situation, which may be suited for showing the behavior $i$ ($C_i^s = 1$) or not ($C_i^s = 0$). The goal-supporting behaviors are found using a voting scheme: every behavior $i$ which is a subgoal for a currently active goal (denoted by an external context $C_i^e = 1$) and the sensor context of which is not present ($C_i^s = 0$) votes for the activation of all behaviors that are capable of producing the sensor condition of behavior $i$. These relations are coded in a voting matrix $V$: an entry $V_{i,j} = 1$ says that behavior $j$ causes the sensor context of behavior $i$, so $i$ votes for $j$ to produce its environmental presuppositions. A voting context $C_i^{voting}$ then collects the votes from the remaining behaviors:

$$C_i^{voting} = \left((A_{1,i} \vee V_{1,i}) \wedge C_1^e \wedge \bar{C}_1^s\right) \vee \left((A_{2,i} \vee V_{2,i}) \wedge C_2^e \wedge \bar{C}_2^s\right) \vee \ldots \vee \left((A_{n,i} \vee V_{n,i}) \wedge C_n^e \wedge \bar{C}_n^s\right)$$
$$= 1 - \prod_k \left(1 - (V_{k,i} + A_{k,i} - A_{k,i} V_{k,i})\, C_k^e\, (1 - C_k^s)\right) \qquad (5)$$

The external context is $C_i^e = 1$ if either the behavior $i$ is currently marked as a goal ($C_i^{goal} = 1$) or at least one behavior votes for $i$ (resulting in $C_i^{voting} = 1$):

$$C_i^e = C_i^{goal} \vee C_i^{voting} = 1 - (1 - C_i^{goal})(1 - C_i^{voting}) \qquad (6)$$

A behavior's overall context is $C_i = 1$ if its external context is $C_i^e = 1$, or if it is a default behavior (given by $C_i^{default} = 1$), or if it is selected for exploration ($C_i^{explore} = 1$), and its sensor context is $C_i^s = 1$:

$$C_i = (C_i^e \vee C_i^{default} \vee C_i^{explore}) \wedge C_i^s = \left(1 - (1 - C_i^e)(1 - C_i^{default})(1 - C_i^{explore})\right) C_i^s. \qquad (7)$$
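The planning contexts of equations (5) to (7) can be sketched as follows. In the architecture itself the votes propagate through the continuous feedback between the contexts; the sketch below replaces this by a fixed number of discrete sweeps, which is an assumption made purely for illustration. The three-behavior chain, the function names and the matrix entries are likewise hypothetical.

```python
from math import prod

def voting_context(i, A, V, Ce, Cs):
    """Equation (5): fuzzy OR over all behaviors k that vote for behavior i."""
    return 1.0 - prod(
        1.0 - (A[k][i] + V[k][i] - A[k][i] * V[k][i]) * Ce[k] * (1.0 - Cs[k])
        for k in range(len(Ce))
    )

def planning_contexts(goal, default, explore, Cs, A, V):
    """Evaluate equations (5)-(7); returns (external contexts, overall contexts)."""
    n = len(goal)
    Ce = list(goal)
    # Assumption: n sweeps are enough to pass votes along a chain of n behaviors.
    for _ in range(n):
        Cv = [voting_context(i, A, V, Ce, Cs) for i in range(n)]
        Ce = [1.0 - (1.0 - goal[i]) * (1.0 - Cv[i]) for i in range(n)]  # (6)
    C = [(1.0 - (1.0 - Ce[i]) * (1.0 - default[i]) * (1.0 - explore[i])) * Cs[i]
         for i in range(n)]                                             # (7)
    return Ce, C

# Hypothetical chain: target -> grasp -> drop, with "drop" as the only goal.
TARGET, GRASP, DROP = range(3)
A = [[0.0] * 3 for _ in range(3)]   # no logical presuppositions in this example
V = [[0.0] * 3 for _ in range(3)]
V[DROP][GRASP] = 1.0    # grasp produces the sensor context of drop
V[GRASP][TARGET] = 1.0  # target produces the sensor context of grasp

goal = [0.0, 0.0, 1.0]
Cs = [1.0, 0.0, 0.0]    # only the sensor context of "target" is fulfilled
Ce, C = planning_contexts(goal, [0.0] * 3, [0.0] * 3, Cs, A, V)
```

In this example the goal "drop" votes for "grasp", which in turn votes for "target"; since only the sensor context of "target" is fulfilled, it is the only behavior whose overall context becomes 1.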
The voting scheme implements a causal model as a mapping from behaviors to sensor contexts and thus to behaviors, since behaviors and sensor contexts are directly associated (compare [1]). Note that a subset of the repertoire of behaviors is decoupled from the planning mechanism by defining them as default behaviors. These behaviors can always become active provided that their sensor context is fulfilled and the logical presuppositions and mutual competitions allow an activation.

4.2 Learning
The voting matrix $V$ codes the effects of every single behavior with respect to the sensor contexts. Observing the different variables $C_i^s$ in an exploration phase makes it possible to adapt this causal model and thus to learn from experience. To this end we analyze the correlation of a changing sensor context with the system's "working memory" $\vec{m}$ (see eq. (3)). The dynamics

$$\dot{C}_i^{ms} = \frac{(1 - C_i^{ms})\, C_i^s}{\tau_{cms,2}} - \frac{C_i^{ms}\, (1 - C_i^s)}{\tau_{cms,1}}, \qquad \tau_{cms,1} \gg \tau_{cms,2} \qquad (8)$$

filters the sensor context such that the term

$$C_i^s - C_i^{ms} \qquad (9)$$

is greater than 0 for a short time (depending on the time scale $\tau_{cms,2}$) when $C_i^s$ changes from 0 to 1. We define a learning signal $T_{i,j}$ with

$$T_{i,j} = (1 - C_i^{default})(C_i^s - C_i^{ms})(2 m_j - 1) \qquad (10)$$

that signals a correlation between $m_j$ and a rising $C_i^s$ with $T_{i,j} > 0$ and anti-correlation with $T_{i,j} < 0$. A learning matrix $W$ integrates this learning signal whenever it differs significantly from 0, such that $W_{i,j}$ follows $T_{i,j}$ on the slow time scale $\tau_W$:

$$\tau_W \dot{W}_{i,j} = T_{i,j}^2 (T_{i,j} - W_{i,j}). \qquad (11)$$

The learning matrix is projected onto the voting scheme by applying a threshold function $\theta$, which is 0 for $x <$ [...]

$$V_{i,j} = \theta(W_{i,j}, 0.15). \qquad (13)$$

This closes the feedback loop and enables the system to use its experiences to realize the current goals by intelligently activating subgoals if the sensor context of some goal or subgoal is not fulfilled.

4.3 Exploration
The system explores when there is no behavior $i$ for which both the external context $C_i^e$ and the sensor context $C_i^s$ are set, as given by the global exploration context

$$E = \prod_j (1 - C_j^e C_j^s). \qquad (14)$$

For $E = 0$ the system is in a "working mode" and there is at least one behavior which is a goal or subgoal and the sensor context of which is fulfilled. For $E = 1$ the robot is currently not able to realize its goals and thus starts to explore by applying behaviors at random. A behavior's motivation to explore is controlled by a dynamics that depends on a random term $\zeta(p_i)$ with a probability $p_i$:

$$\dot{M}_i = -\frac{1}{\tau_M} M_i + \zeta(p_i)(1 - M_i). \qquad (15)$$

$\zeta(p_i)$ draws $M_i$ to one with a mean frequency that is proportional to $p_i$. The motivation $M_i$ then decays on the slow time scale $\tau_M$. A behavior's exploration context then is

$$C_i^{explore} = E M_i \qquad (16)$$

and is linked with the behavior organization according to equation (7).

[...] $> 0$ is used to spread the probability over all $p_i$ such that every behavior is explored with a probability greater than 0. The probability is then shared over all behaviors the sensor context of which is fulfilled and which are not default behaviors. $p_i$ is assigned depending on the weights $J_i$:

$$p_i = \frac{J_i C_i^s (1 - C_i^{default})}{\sum_k J_k C_k^s (1 - C_k^{default})} \qquad (18)$$
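The learning rule of equations (8), (10), (11) and (13) and the exploration quantities of equations (14) and (18) can be sketched as below. Several points are assumptions made for this illustration and are not taken from the paper: the threshold function is taken to return 1 at or above the threshold and 0 below it, the memory vector is held constant during the episode, the filter $C^{ms}$ is initialized to the first sensor-context sample so that only later rises are detected, and the time scales are arbitrary "optimistic" values with $\tau_W$ shorter than $\tau_{cms,2}$. The stochastic motivation dynamics (15) is not simulated.

```python
from math import prod

def theta(x, s):
    """Assumed threshold function: 1 if x >= s, else 0."""
    return 1 if x >= s else 0

def learn_voting_matrix(cs_trace, m, default, dt=0.001,
                        tau_cms1=5.0, tau_cms2=0.5, tau_w=0.25, s=0.15):
    """Euler integration of (8), (10), (11), followed by the threshold (13).

    cs_trace: list of sensor-context vectors, one per time step.
    m: memory vector, held constant here (an assumption of this sketch).
    """
    k = len(m)
    cms = list(cs_trace[0])              # filter starts settled (assumption)
    W = [[0.0] * k for _ in range(k)]
    for cs in cs_trace:
        for i in range(k):
            rising = cs[i] - cms[i]                              # term (9)
            for j in range(k):
                T = (1 - default[i]) * rising * (2 * m[j] - 1)   # (10)
                W[i][j] += dt / tau_w * T * T * (T - W[i][j])    # (11)
            cms[i] += dt * ((1 - cms[i]) * cs[i] / tau_cms2
                            - cms[i] * (1 - cs[i]) / tau_cms1)   # (8)
    V = [[theta(W[i][j], s) for j in range(k)] for i in range(k)]
    return W, V

def exploration_context(Ce, Cs):
    """Equation (14): E = prod_j (1 - Ce_j * Cs_j)."""
    return prod(1.0 - e * c for e, c in zip(Ce, Cs))

def exploration_probabilities(J, Cs, default):
    """Equation (18); returns all zeros if no behavior is eligible (assumption)."""
    w = [J[i] * Cs[i] * (1 - default[i]) for i in range(len(J))]
    total = sum(w)
    return [x / total if total > 0 else 0.0 for x in w]

# Hypothetical episode: "target" was recently active (m = 1) when the sensor
# context of "grasp" rises from 0 to 1 after one simulated second.
TARGET, GRASP, DROP = range(3)
cs_trace = [[1.0, 0.0, 0.0]] * 1000 + [[1.0, 1.0, 0.0]] * 5000
W, V = learn_voting_matrix(cs_trace, m=[1.0, 0.0, 0.0], default=[0, 0, 0])

E = exploration_context([1.0, 0.0, 0.0], [0.0, 1.0, 1.0])
p = exploration_probabilities([1.0, 1.0, 2.0], [1.0, 1.0, 1.0], [0, 1, 0])
```

With these assumed values the single rising edge is enough to push `W[GRASP][TARGET]` above the threshold, so `V[GRASP][TARGET]` becomes 1, while the entries for behaviors that were not in memory are driven negative and stay at 0 in `V`. The factor $T_{i,j}^2$ in (11) freezes $W$ once the transient (9) has died out.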
5 Experiments
The framework for planning and learning was implemented as a computer simulation for easier experimentation. The good results, and the fact that the information used (the sensor contexts) is available on the robot Arnold in a very similar way, make us confident that the adaptive planning scheme can be successfully demonstrated on Arnold in the near future. The simulated robot shows 10 different behaviors in a simple “box world” (Fig. 2) containing a number of grey obstacles, red graspable boxes and two target objects (a green one at the top border of the robot's world in Fig. 2 and a blue one at the bottom border) where the robot shall put the red boxes. The robot controls its velocity and rotation speed; it can grasp objects it touches with the front bumper and can put them down anywhere. Its sensor system measures the distance and color of visible objects in all directions. The implemented behaviors are defined as follows:

wander: The robot moves around and avoids obstacles. The sensor context is constantly set to 1.
stop: The robot stops, $C^s_{stop} = 1$. This is a default behavior.
target red box: The robot moves around searching for a red box and moves towards it. $C^s_{target\ red\ box} = 1$ if the robot does not carry a red box.
target green box: The robot moves towards a green box. $C^s_{target\ green\ box} = 1$ if the robot carries a red box.
target blue box: Analogous to target green box.
grasp red box: The robot grasps a red box if it touches one with its front side and does not already carry one.
drop red box on green: If the robot carries a red box and touches a green one with its front side, it places its load close to its front.
drop red box on blue: Analogous to drop red box on green.
collision: The robot stops if it touches any kind of obstacle.
free: The robot frees itself from a collision by driving backwards, with $C^s_{free} = 1$. This behavior requires that a collision occurred a short time before: $A_{free,collision} = 1$.
A button row serves as an interface to switch the goal contexts on or off. In the experiment that we present here we marked drop red box on blue as the only goal (Fig. 2). Since the robot starts without any experience concerning its own abilities and the sensor context for drop red box on blue is not present, it starts to explore. After moving around, the robot grasps a red box, moves to the green object, puts the red box down, hits another red box without grasping it and starts wandering (Fig. 2/1). In this phase the robot collects experiences about the interrelations between behaviors, e.g. that target red box is useful to produce the sensor context for grasp red box. Hitting an object always activates collision and free, which keeps the robot maneuverable (see the behavioral activity in Fig. 2/5). When the robot puts down its load on the blue box for the first time (Fig. 2/2), the voting scheme contains enough knowledge to solve the given task. The exploration context $E$ switches from 1 to 0 and the robot switches from exploration to exploitation (Fig. 2/5). From that point in time it repeats the sequence of targeting a red box, grasping it, searching for the blue box and putting down the load there (Fig. 2/3 and 2/4). The learning rate $\tau_W$ is chosen “optimistic” here, so the robot integrates experiences into the planning mechanism very fast and in some situations even shows one-shot learning. Decreasing the learning rate lets the robot collect more evidence before it accepts experiences as causal relations, whereas the accelerated exploration scheme favors behaviors that have been successful before.

Analyzing this experiment, one should note a few characteristics of the approach. First, recall that our planning and learning approach differs strongly from e.g. Markov decision systems, since we assume a fixed association of situations and behaviors.
Each planning step is thus a mapping of behaviors to behaviors, while Markov systems map a set of states $S$ and actions $A$ to states ($S \times A \rightarrow S$). Although the state information is decisive for the next behavioral pattern to develop, the search space for planning and learning is significantly reduced. Second, we do not assume models of fixed state machines for the robot and the environment. The whole framework is formulated assuming continuous time and thus circumvents problems that occur in the realization of many classical approaches in artificial intelligence. These facts make our system extremely robust and allow fast learning, but on the other hand limit some other capabilities, e.g. the ability to deal with symbolic operands for goals and behaviors. That is why we had to define two different behaviors drop red box on blue and drop red box on green, although these behaviors are very similar.

Figure 2: Snapshots of the computer simulation (1-4) and activity of the behaviors over time (5), with the exploration and exploitation phases marked. See text for explanations.
6 Summary and Conclusion
We have presented an approach for autonomously generating and learning complex and flexible behavioral sequences of given elementary behaviors in time. The patterns of active and non-active behaviors correspond to stable states of a high-dimensional nonlinear dynamical system. This system consists of a number of differential equations that are coupled by means of two constant matrices which define pairwise activation and competition rules for the elementary behaviors. Changes in the behavior are brought about by bifurcations within the dynamics which let the system switch from one stable pattern to the next one within the sequence. Therefore, behavioral diversity is a consequence of the multi-stability of the high-dimensional dynamics. The transitions between the different activation patterns are smooth, as the underlying dynamics always relaxes continuously into the stable states on a defined time scale. The generated sequences are flexible, as the activation of a specific behavioral pattern depends on the sensor context of the elementary behaviors. Based on a recursive evaluation of the logical interrelations coded within the matrices, the system can plan a promising behavioral sequence directed towards one of a set of given goal situations. The system favors the activation of those elementary behaviors whose execution results in situations that fulfill the sensor context of behaviors closer to a goal situation. The consequences of behaviors can be learned in two modes: guided by an external operator or completely autonomously by exploration. The architecture is mathematically simple yet very powerful: the high-dimensional behavioral space, which would usually require a full search over all combinations of activation patterns, is reduced to a search over all goal-directed sequences. The possibility to gather experience through one-shot learning allows the system to display successful behavior based on a small set of training data. The unified description of the architecture as continuous dynamical systems makes the system easy to design and to scale up towards a larger behavioral repertoire.

References
[1] P. Maes. How to do the right thing. Connection Science Journal, 1(3), 1989.
[2] P. Maes. Modeling adaptive autonomous agents. Artificial Life, 1(1/2):135–162, 1994.
[3] R. Menzner and A. Steinhage. Nonlinear attractor dynamics for guiding an anthropomorphic robot by means of speech control. In H. Araujo and J. M. M. Dias, editors, Proceedings of the International Symposium on Intelligent Robotic Systems, SIRS, number ISBN 972-96889-4-X, pages 177–180. University of Coimbra, Portugal, 1999.
[4] G. Schöner, M. Dose, and C. Engels. Dynamics of behavior: Theory and applications for autonomous robot architectures. Robotics and Autonomous Systems, 16:213–245, 1995.
[5] A. Steinhage. Dynamical Systems for the Generation of Navigation Behavior (Ph.D. thesis). Number ISBN 3-8265-3508-1 in Berichte aus der Physik. SHAKER-Verlag, Aachen, Germany, 1998.
[6] A. Steinhage and T. Bergener. Dynamical systems for the behavioral organization of an anthropomorphic mobile robot. In R. Pfeifer, B. Blumberg, J. Meyer, and S. Wilson, editors, From Animals to Animats 5: Proceedings of the Fifth International Conference on Simulation of Adaptive Behavior, pages 147–152. The MIT Press/Bradford Books, 1998.
[7] A. Steinhage and G. Schöner. Dynamical systems for the behavioral organization of autonomous robot navigation. In P. S. Schenker and G. T. McKee, editors, Sensor Fusion and Decentralized Control in Robotic Systems: Proceedings of SPIE, volume 3523, pages 160–180. SPIE-publishing, 1998.