Adaptive Organization of Generalized Behavioral Concepts for Autonomous Robots: Schema-Based Modular Reinforcement Learning

Tadahiro Taniguchi
Dept. of Precision Engineering, Graduate School of Engineering, Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, Japan
[email protected]

Tetsuo Sawaragi
Dept. of Precision Engineering, Graduate School of Engineering, Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, Japan
[email protected]

Abstract— In this paper, we introduce a reinforcement learning method that lets autonomous robots obtain generalized behavioral concepts. Reinforcement learning is a well-formulated method by which autonomous robots can acquire new behavioral concepts by themselves. However, such behavioral concepts cannot be applied to environments that differ from the one in which the robots learned them. In contrast, we human beings can apply our behavioral concepts to different environments, objects, and situations. We attribute this ability to a memory structure like the schema system originally proposed by J. Piaget. We previously proposed a modular learning method called the Dual-Schemata model; in this paper we add a reinforcement learning mechanism to this model. Provided with this structure, autonomous robots become able to obtain new generalized behavioral concepts by themselves. We also show that this kind of structure enables autonomous robots to behave appropriately even in a novel, socially interactive environment. Index Terms— schema, hierarchical reinforcement learning, modular reinforcement learning, generalized behavioral concept.

I. INTRODUCTION

In the near future, many socially intelligent robots will join our social life. At present, however, no robot has been developed that can behave adequately in human social communities. The most obvious difference between laboratory environments and human social environments lies in their diversity and complexity. In laboratories, robots are required to accomplish typical tasks in the same situation. Once the robots go out of the room, however, they will find that the world is filled with many kinds of objects, dynamics, and animate beings. In order to perform well in such an environment, the robots have to derive how to act in each particular environment from a representation of what to do. For this, the robots must have generalized behavioral concepts corresponding to what to do, independent of particular environments or objects. Thus, social robots should be able to develop generalized behavioral concepts so that they can adapt to changes in their task environments. Here, we define a behavioral concept as a memory unit of motor programs governing a performer's (i.e., an agent's) activities of

perceiving stimuli and executing responses. We introduce Piaget's schema theory and implement this classical idea in a computational model called the Dual-Schemata model. A schema is a characteristic of some population of objects and/or movements, and consists of a set of rules serving as instructions for producing a population prototype for a large variety of different objects and/or motions. The classical theory of motor programs postulates that for each movement there must be either a motor program or a reference against which to compare feedback, and that there is a one-to-one mapping between stored states and movements to be made. However, this presents problems for the central nervous system in terms of the amount of material that must be stored (a storage problem) and of how the performer produces a "novel" movement (a novelty problem), neither of which can be solved by the conventional ideas. To address these problems, the classical notion of a motor program should be modified to mean that there are generalized motor programs for a given class of movements. For example, there might be a single behavioral concept for the many ways of throwing a ball. In this paper, we attempt to construct such generalized behavioral concepts based on the schema theory. The generalized behavioral concepts are assumed to provide prestructured commands for a number of movements once specific response specifications are given. These specifications can be thought of as variations of performances concerning some common "verb" concepts. For instance, with respect to the verb "throw", many ways of throwing are possible, and the actual motor commands given to the muscles by the central nervous system may differ, i.e., with a different shoulder angle, at a different speed, with a different force, and so on. However, we can define some common behavioral concepts by noticing the following components [12]:

• Information received from the various sensors prior to the response
• Response specifications specified by the performer before the movement can be run off
• Sensory consequences of the response produced: the actual feedback stimuli received from the external world and/or internal organs
• Response outcome of that movement

All of these components are modeled in our Dual-Schemata model, and we show that this model can solve the above-mentioned problems. That is, by assuming an abstract class of behaviors as a generalized behavioral concept, we can explain the observed invariant features that are commonly maintained across a seemingly different collection of behaviors, such as the order of contractions, the temporal relationships among the set of contractions, the relative forces, and the preservation of common trajectory shapes regardless of which muscles contract. As noted above, motor commands are not directly related to what generalized behavioral concepts represent. In this paper we formulate generalized behavioral concepts as intentional schemata. This formulation enables an autonomous robot to obtain a generalized behavioral concept in a simulated world through reinforcement learning. To realize this behavioral concept formation process, we propose a novel reinforcement learning architecture, designed as the intentional schema's learning mechanism in the Dual-Schemata model [9]. The Dual-Schemata model is an incremental modular learning system for developmental robots. In the next section we introduce the Dual-Schemata model. After that, we propose a reinforcement learning architecture that enables a robot to obtain a generalized behavioral concept as an intentional schema. We then show simulation results and finally conclude the paper.

Fig. 1. Overview of the Dual-Schemata model

II. DUAL-SCHEMATA MODEL

We have been proposing the Dual-Schemata model. This model enables autonomous robots to obtain their own concepts about objects and/or environments incrementally, without any supervisor. Using this method we showed that a facial robot, equipped with two CMOS cameras and a pan-tilt unit with two degrees of freedom, became able not only to chase a moving blue ball but also to distinguish four kinds of ball movements without a supervisor. This process is driven by the subjective error defined below. In this section we briefly explain the Dual-Schemata model (Fig. 1). The version used here is named LDS (Light Dual-Schemata model) [1], [8], which differs slightly from the original Dual-Schemata model [9] but shares the same concept.

A. Sensor and Motor Vectors

The Dual-Schemata model has two kinds of modules: perceptional schemata and intentional schemata (Fig. 1). We assume that autonomous robots have some sensors and motors. The sensory inputs are encoded in a finite-dimensional real vector $x_t$ at every sampling time step, and the robots output a motor vector $u_t$ to the motors.

B. Perceptional Schema

Through the agent's interaction with its environment or with an object, a perceptional schema becomes a representation of the dynamics of the environment or object the robot is facing. A perceptional schema continues to learn the relationships among the three vectors $y_t$, $u_t$, and $y_{t+1}$; this relationship embodies the agent's concept about the object on which it acts. Here, $y_t$ is the direct sum of the sensor vector $x_t$ and $w_t$ (Eq. (1)), where $w_t$ represents information available to the robot other than $x_t$, e.g., items stored in a short-term memory, messages from other agents, and so on. The relationships can then be formulated as a forward model (Eq. (2)) and an inverse model (Eq. (3)).

$y_t = x_t \oplus w_t$   (1)

$y_{t+1} = f(y_t, u_t)$   (2)

$u_t = g(y_t, y_{t+1})$   (3)

$f$ is often called a forward model, and $g$ an inverse model. A concept about environmental dynamics is represented by these functions insofar as they correctly predict the corresponding values. Thus, the $n$-th perceptional schema $P_n$ has two functions, $f_n$ and $g_n$, both of which are refined continuously. Though the original Dual-Schemata model [9] allows these functions to be of any type, the Light Dual-Schemata model restricts them to linear functions so that an inverse model is guaranteed to exist mathematically.













In the Light Dual-Schemata model, these functions take the linear forms

$y_{t+1} = A_n y_t + B_n u_t + c_n$   (4)

$u_t = C_n y_t + D_n y_{t+1} + e_n$   (5)

In these expressions $A_n$, $B_n$, $C_n$, and $D_n$ are coefficient matrices, and $c_n$ and $e_n$ are constant vectors.
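As an illustration, the following Python sketch shows one way such a linear perceptional schema could be implemented. It is a minimal sketch under assumptions of our own: the class and method names (PerceptionalSchema, accommodate, and so on) are illustrative, and a plain gradient step stands in for the inertial steepest-descent update described in Section II-E; the paper itself does not prescribe this implementation.

import numpy as np

class PerceptionalSchema:
    """Linear perceptional schema: forward model y_{t+1} = A y + B u + c (Eq. (4)),
    with the inverse model (Eq. (5)) derived algebraically from it."""

    def __init__(self, dim_y, dim_u, lr=0.01):
        self.A = np.zeros((dim_y, dim_y))   # coefficient matrices of Eq. (4)
        self.B = np.zeros((dim_y, dim_u))
        self.c = np.zeros(dim_y)            # constant vector of Eq. (4)
        self.lr = lr

    def forward(self, y, u):
        # Eq. (4): predicted next extended sensor vector
        return self.A @ y + self.B @ u + self.c

    def inverse(self, y, y_desired):
        # Eq. (5): motor command expected to realize y_desired, obtained by
        # inverting the linear forward model in the least-squares sense
        return np.linalg.pinv(self.B) @ (y_desired - self.A @ y - self.c)

    def accommodate(self, y, u, y_next):
        # One gradient step on the squared prediction error (a simplified
        # stand-in for the inertial steepest-descent accommodation)
        err = self.forward(y, u) - y_next
        self.A -= self.lr * np.outer(err, y)
        self.B -= self.lr * np.outer(err, u)
        self.c -= self.lr * err
        return np.abs(y_next - self.forward(y, u))   # element-wise error, cf. Eq. (12)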

C. Intentional Schema

The other schema is the intentional schema. While a perceptional schema serves as a representation of the environment, an intentional schema serves as a representation of a generalized behavior. The $m$-th intentional schema $I_m$ has an inner function $\phi_m$:

$y^d_{t+1} = \phi_m(y_t, Q_t)$   (6)

where $Q_t$ is a queue storage. In the Light Dual-Schemata model, $\phi_m$ is defined without using storages, i.e., $y^d_{t+1} = \phi_m(y_t)$. Here, $y^d_{t+1}$ denotes a desired value of the next sensor input $y_{t+1}$; the schema outputs what the agent wants to perceive at the next time step. In addition, we define a by-pass intentional schema $\tilde{I}_0$. It acts as a behavioral function without referring to inverse models; this kind of intentional schema expresses behaviors that are irrelevant to the outer dynamics, such as shaking the head, nodding, or swinging the hands. It also has an inner function $\tilde{\phi}_0$:

$u_t = \tilde{\phi}_0(y_t)$   (7)

Unlike the functions of perceptional schemata, these functions need not be linear; they can be obtained through reinforcement learning, and we introduce this method in the next section.

D. Schema State

As shown in Fig. 1, the Dual-Schemata model has an intentional schema switch and a perceptional schema switch in its structure. Each switch selects one schema at a time; how each switch makes its selection at every moment is explained later. If the perceptional schema switch selects $P_n$ and the intentional schema switch selects $I_m$ at time step $t$, the schema state $S(t)$ at time $t$ is defined as below.

$S(t) = (P_n, I_m)$   (8)

By using these symbols, a behavioral function $\Psi_{n,m}$ can be expressed as below.

$u_t = \Psi_{n,m}(y_t) = g_n\bigl(y_t, \phi_m(y_t)\bigr)$   (9)

If the selected intentional schema is the by-pass intentional schema, the behavioral function is expressed as below.

$u_t = \Psi_{n,0}(y_t) = \tilde{\phi}_0(y_t)$   (10)
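To make the role of the two switches concrete, the short sketch below composes the selected schemata into a motor command, following Eqs. (9) and (10). It assumes the hypothetical PerceptionalSchema sketch given earlier, and the function name behavioral_function is ours.

def behavioral_function(perceptional_schema, intentional_schema, y, bypass=False):
    """Compose the currently selected schemata into a motor command u_t."""
    if bypass:
        # Eq. (10): a by-pass intentional schema maps the state directly to motors
        return intentional_schema(y)
    # Eq. (9): the intentional schema proposes a desired next sensor value ...
    y_desired = intentional_schema(y)
    # ... and the perceptional schema's inverse model turns it into a motor command
    return perceptional_schema.inverse(y, y_desired)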

E. Equilibration

When the robot obtains $(y_{t+1}, y_t, u_t)$ while the perceptional schema state is $P_n$, $P_n$ tries to assimilate this sample into itself. Assimilation and accommodation are important keywords in Piaget's schema model [10]. Whether $P_n$ can assimilate the sample or not depends on the subjective error $E_n(t)$, which is defined as below.

$E_n(t) = \bigl\| \mathrm{diag}\bigl(\bar{e}_n(t)\bigr)^{-1} e_n(t) \bigr\|_1$   (11)

diag(*) denotes the diagonal matrix whose elements are *. Here, $e_n(t)$ is the predictive error and $\bar{e}_n(t)$ is the expected predictive error. The predictive error is calculated as below using the forward model.

$e_n(t) = \bigl|\, y_{t+1} - f_n(y_t, u_t) \,\bigr|$   (12)

Here, $|\cdot|$ means taking the absolute value of each element; for example, $|(-1, 2)| = (1, 2)$. In the previous Dual-Schemata model [9], $\bar{e}_n(t)$ was calculated from averaged errors accumulated in queue storages. In the Light Dual-Schemata model it is instead given by an error predictor $h_n$ that needs no storages and, like the other functions, is restricted to a linear form.

$\bar{e}_n(t) = h_n(y_t, u_t)$   (13)

$h_n(y_t, u_t) = G_n y_t + H_n u_t + k_n$   (14)



This error predictor changes so as to approximate the relationship between $(y_t, u_t)$ and the observed predictive error $e_n(t)$. If $E_n(t) < \kappa$, where $\kappa$ is an assimilation threshold parameter, $P_n$ assimilates the triplet $(y_{t+1}, y_t, u_t)$. Following the assimilation, the inner functions defined within $P_n$ change themselves so as to fit the new samples better; this is an accommodation process. These iterative operations as a whole are called an equilibration process. In the accommodation process, $f_n$ in $P_n$ changes itself by an inertial steepest-descent method, through which $f_n$ becomes able to predict the environmental forward dynamics. Once the system has this forward model, the inverse model $g_n$ can be derived from $f_n$ through algebraic operations.

F. Differentiation

Perceptional schemata are switched and differentiated according to the perceptional schema activity $a_n(t)$, which represents the degree to which the schema fits the environmental dynamics the agent is encountering. $a_n(t)$ is updated by referring to $E_n(t)$ at every time step as below.

$a_n(t+1) = \lambda\, a_n(t) + (1 - \lambda)\, \mathbb{1}\bigl[E_n(t) < \kappa\bigr]$   (15)

$\lambda$ is a parameter that defines how strongly the schema persists in its current recognition of the environment. Usually $a_n(t) = 0$ means that the current environment is irrelevant to $P_n$, and $a_n(t) = 1$ means that the environment corresponds well to $P_n$. The perceptional schema switch selects and divides perceptional schemata by referring to this activity, according to the rules below.

1) The perceptional schema switch selects the perceptional schema whose index is the smallest (i.e., the oldest) among the schemata whose activity is larger than a threshold $\theta_a$.
2) If the selected perceptional schema's activity falls to 0, the perceptional schema switch divides the selected perceptional schema into two parts. If the number of perceptional schemata is $N$ at that time, the new perceptional schema is named $P_{N+1}$.
These rules enable the perceptional schema switch to switch and differentiate perceptional schemata depending on the context of interaction [1] and on the robot's embodiment [8]. The threshold $\theta_a$ plays a role similar to the vigilance parameter in ART [11]; in ART, however, the dynamics the robot encounters are neglected.
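The following sketch puts the equilibration and differentiation steps together for one time step. It is an illustrative simplification: the exact forms of the error predictor, the activity update, and the division rule are those assumed in our reconstruction above (a running average and a cloning step), and the function and attribute names are ours.

import copy
import numpy as np

def equilibration_step(schemata, selected, y, u, y_next,
                       kappa=1.0, lam=0.9, theta_a=0.5):
    """One cycle of equilibration (cf. Eqs. (11)-(14)) and differentiation (cf. Eq. (15))."""
    P = schemata[selected]
    e = np.abs(y_next - P.forward(y, u))              # Eq. (12): predictive error
    e_bar = getattr(P, "e_bar", np.ones_like(e))      # expected predictive error
    E = float(np.sum(e / np.maximum(e_bar, 1e-8)))    # Eq. (11): subjective error

    if E < kappa:                                     # assimilation test
        P.accommodate(y, u, y_next)                   # accommodation
        P.e_bar = 0.9 * e_bar + 0.1 * e               # crude error-predictor update
        fit = 1.0
    else:
        fit = 0.0

    # Activity update (assumed form of Eq. (15)): persistence vs. current fit
    P.activity = lam * getattr(P, "activity", 1.0) + (1.0 - lam) * fit

    # Rule 2: if the selected schema's activity has fallen to (near) zero,
    # divide it, i.e. spawn a new schema P_{N+1} initialised from it
    if P.activity < 1e-3:
        new_schema = copy.deepcopy(P)
        new_schema.activity = 1.0
        schemata.append(new_schema)
        return len(schemata) - 1, E

    # Rule 1: select the oldest schema whose activity exceeds the threshold
    for i, s in enumerate(schemata):
        if getattr(s, "activity", 0.0) > theta_a:
            return i, E
    return selected, E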

III. REINFORCEMENT LEARNING FOR INTENTIONAL SCHEMA

In this section, we introduce a novel reinforcement learning architecture that enables an autonomous robot to obtain a generalized behavioral concept as an intentional schema.

Fig. 2. Intentional schema as a reinforcement learning module

A. Reinforcement Learning and Models

Reinforcement learning was originally formulated for discrete MDPs (Markov decision processes), but it has already been extended to continuous time and space [6], [5]. Reinforcement learning is a key idea for making an autonomous robot learn new behaviors so as to cope with tasks given not by designers but by users. In reinforcement learning, rewards drive an agent to seek the policy that maximizes the amount of future reward. Most reinforcement learning methods do not require an environmental model (some do use models, e.g., [4]), so reinforcement learning is often called a model-free learning method. However, this sometimes has an unpleasant consequence: policies obtained through model-free reinforcement learning are easily damaged by environmental changes. Therefore, a behavioral concept (i.e., a policy) obtained through reinforcement learning in a certain environment cannot be applied to other environments without repeating the expensive computations of reinforcement learning. In other words, behavioral concepts acquired through simple model-free reinforcement learning are not generalized behavioral concepts. This follows from the definition of policies in reinforcement learning, which are usually written as below.

$u_t = \pi(x_t)$   (16)

That is, a policy is a function representing regularities between stimuli and responses. In the Dual-Schemata model, the intentional schema is designed to be robust against changes of the environmental dynamics: the dynamics are treated exclusively by the perceptional schemata, so an intentional schema can realize its stored behavioral concept in various environments.
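The brittleness of a direct sensor-to-motor policy can be seen in a toy one-dimensional example (illustrative numbers only, not the task of Section IV): the environment is y_{t+1} = y_t + b u_t with the gain b flipping sign, and the goal is to drive y to zero. A policy of the form of Eq. (16) learned for b = +1 diverges when b flips, whereas a schema-style policy that only asks for y^d = 0 keeps working once the linear forward/inverse model has been re-fitted.

def run(policy, b, steps=20, y=1.0):
    # simulate y_{t+1} = y_t + b * policy(y_t) and return the final |y|
    for _ in range(steps):
        y = y + b * policy(y)
    return abs(y)

direct = lambda y: -y                      # Eq. (16)-style policy learned with b = +1
print(run(direct, b=+1.0))                 # ~0.0: works in the training environment
print(run(direct, b=-1.0))                 # ~1e6: the same policy diverges when b flips

# Schema-style: the intentional schema asks for y_desired = 0, and the
# (re-identified) inverse model supplies u = (y_desired - y) / b
schema_based = lambda y, b=-1.0: (0.0 - y) / b
print(run(schema_based, b=-1.0))           # ~0.0 again once b has been re-identified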

Fig. 3. Overview of the collaborative ball-balancing task

B. Actor-Critic Method for the Intentional Schema

We adopt the actor-critic method as the intentional schema's learning dynamics. The actor-critic method is a well-known reinforcement learning method [6], and it has been fully formulated in continuous time and space by Doya [5], so we can use it regardless of the time step size and without an explicit cell partition of the state space. As mentioned in Section II, an intentional schema has an inner function $\phi_m$, which outputs the next desired sensor input vector $y^d_{t+1}$ to a perceptional schema. We regard this output as the action output of $I_m$. Therefore, the policy of $I_m$ is written as below.

$y^d_{t+1} = \Phi\bigl( F(y_t;\, w) + \nu_t \bigr)$   (17)

where $F$ is a function approximator with a parameter vector $w$, $\nu_t$ is noise for exploration, and $\Phi$ is an output function that restricts excessively large or small output vectors to a moderate size. The magnitude of the noise is controlled by the value of the value function. Under the condition that the policy is $\pi$, the value function is written as below.

$V^{\pi}(y(t)) = E\Bigl[ \int_t^{\infty} e^{-(s-t)/\tau}\, r(s)\, ds \Bigr]$   (18)

Here, $r(t)$ represents the reward which the learning agent obtains at time $t$. The TD error $\delta(t)$ is calculated as below.

$\delta(t) = r(t) - \frac{1}{\tau}\, V(t) + \dot{V}(t)$   (19)

By using this TD error, the agent can improve both the policy and the value function; the details of the update algorithm are omitted here (see [7], [5]). What is important in this formulation is that the policies of intentional schemata never handle actual motor outputs: intentional schemata delegate the actual dynamics to the perceptional schemata. Owing to this formulation, an intentional schema's behavioral concept (i.e., its policy) can be applied to different environments by joining hands with each perceptional schema.
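A minimal sketch of such an actor-critic intentional schema is given below, assuming linear actor and critic functions and a discrete-time TD update at the 4 Hz sampling rate instead of Doya's exact continuous-time formulation [5]; the class and parameter names (IntentionalSchemaRL, sigma, gamma, and so on) are ours.

import numpy as np

class IntentionalSchemaRL:
    """Actor-critic intentional schema (cf. Eqs. (17)-(19)), linear in y."""

    def __init__(self, dim_y, sigma=0.1, gamma=0.95, lr=0.01):
        self.W = np.zeros((dim_y, dim_y))   # actor parameters (F in Eq. (17))
        self.v = np.zeros(dim_y)            # critic parameters
        self.sigma, self.gamma, self.lr = sigma, gamma, lr

    def value(self, y):
        return float(self.v @ y)            # linear approximation of V (Eq. (18))

    def policy(self, y):
        # Eq. (17): squashed actor output plus exploration noise; the noise is
        # scaled down as the value estimate grows, as described in the text
        noise = self.sigma * np.exp(-max(self.value(y), 0.0)) * np.random.randn(len(y))
        return np.tanh(self.W @ y) + noise  # desired next sensor value y^d_{t+1}

    def update(self, y, y_desired, reward, y_next):
        # discrete-time stand-in for the TD error of Eq. (19)
        delta = reward + self.gamma * self.value(y_next) - self.value(y)
        self.v += self.lr * delta * y                                      # critic
        self.W += self.lr * delta * np.outer(y_desired - np.tanh(self.W @ y), y)  # actor
        return delta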

IV. EXPERIMENT


In this section, we verify the advantageous effect of our reinforcement learning method within the Dual-Schemata model.

A. Conditions

We take up an example task whose purpose is to keep a rolling ball balanced at the center of a flat table. In this experiment, an agent cannot control both sides of the table, but only one side (Fig. 3). We put two agents into this experimental situation: one, named the player, is equipped with the Dual-Schemata model; the other, named the partner, plays the task in company with the player. In this experiment the partners' policies are fixed, with no learning capability embedded. We prepare the three intentional schemata shown in Table I and the three types of partner agents shown in Table II.

TABLE I
INTENTIONAL SCHEMATA

Index | Name | Policy
$\tilde{I}_0$ | move random | $u_t = \nu_t$ ($\nu_t$ is a noise term)
$I_1$ | (for reinforcement learning) | learned by the actor-critic method, Eq. (17)
$I_2$ | shake a ball | fixed desired-value pattern that shakes the ball

TABLE II
CONTROLLERS OF PARTNERS

Index | Name | Controller
0 | cooperator | tries to keep the ball at the center of the table
1 | lazy bones | keeps its hand a little below the horizontal
2 | disturber | tries to push the ball off the table

Fig. 4. Three different reinforcement learning architectures

The system dynamics, each agent's sensor input vector, the motor output vectors, and the reward function for the player are defined by Eqs. (20)-(24) below. Both the player and the partners share the same sensor inputs. The agents' sampling rate is set to 4.0 [Hz].

[Eqs. (20)-(24): definitions of the ball-and-table system dynamics, the shared sensor input vector, the agents' motor output vectors, and the player's reward function.]

We also prepare three intentional schemata. Two of them, "move random" and "shake a ball", are designed a priori; the third is learned through reinforcement learning. As for the partners, three kinds of characters are assumed: the "cooperator" tries to keep the ball at the center of the table; the "lazy bones" does nothing but keep its hand a little lower than the horizontal axis; the "disturber" tries to get the ball off the table. What we should keep in mind is that the player has to behave differently depending on its partner's policy even though the goal of the task is the same. Under these conditions, we run a simulation experiment in which the partner is replaced from time to time. During the first 5000 [s], the "cooperator" acts as the partner (1st period). During the next 5000 [s], the "lazy bones" takes over (2nd period). After that, the "cooperator" and the "lazy bones" take turns every 200 [s] from 10000 [s] to 15000 [s] (3rd period). Finally, the "disturber" comes in after 15000 [s] (4th period). In addition to these replacements, we stop the reinforcement learning dynamics after 10000 [s] in order to test the acquired concept's generality. If the player has acquired a generalized behavioral concept, it will be able to perform well even in a new situation without repeating wasteful reinforcement learning.
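For reference, the partner schedule and the learning switch described above can be summarized by the following small sketch; the time values are those given in the text, the function names are ours, and the partner controllers themselves are not reproduced.

def partner_schedule(t):
    """Return the partner controlling the other side of the table at time t [s]."""
    if t < 5000:                                   # 1st period
        return "cooperator"
    if t < 10000:                                  # 2nd period
        return "lazy bones"
    if t < 15000:                                  # 3rd period: alternate every 200 s
        return "cooperator" if int((t - 10000) // 200) % 2 == 0 else "lazy bones"
    return "disturber"                             # 4th period

def learning_enabled(t):
    # reinforcement learning is frozen after 10000 s to test concept generality
    return t < 10000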

B. Experimental Result

In this task, we compare the three types of reinforcement learning schemes shown in Fig. 4. Scheme 1) is the simple continuous actor-critic method [5]. Scheme 2) is our Light Dual-Schemata model without the perceptional schema's differentiation dynamics; this model can be regarded as a hierarchical reinforcement learning system [3]. Scheme 3) is our complete Dual-Schemata model, which can be regarded as a hierarchical modular reinforcement learning system. We design three types of player agents provided with these different reinforcement learning schemes and run the experiment five times for each. The transitions of the averaged rewards are shown in Fig. 5. In addition to these three conditions, we also test the simple continuous actor-critic method without stopping reinforcement learning after 10000 [s].

Curves C and D in Fig. 5 show that the policies obtained by the simple actor-critic method are easily damaged when the environment (i.e., the partner) changes. If reinforcement learning continues (curve D in Fig. 5), the averaged-reward curve recovers from the bottom, but the recovery is too slow for the robot to manage changing situations. This means the behavioral concepts obtained by the simple actor-critic method are too specialized and too brittle to environmental changes. On the contrary, curves A and B in Fig. 5 show good performance even after reinforcement learning is stopped at 10000 [s]. This adaptivity is due to the perceptional schemata's equilibration process. From the viewpoint of the player agent, a change of partner can be regarded as a change in the forward dynamics; therefore, the desired next sensor values output by the intentional schema need not be modified. Moreover, the equilibration process is a supervised learning process, which is much faster than the reinforcement learning process, so these schemes can manage mildly changing situations without constant reinforcement learning effort. However, if the partner changes frequently, even this learning speed is not fast enough. In such a case the perceptional schemata's modularity has a certain advantage over a single perceptional schema. Once a new perceptional schema has been created for each partner by the perceptional schema switch, switching among these perceptional schemata is enough to deal with a mildly changing situation.

Fig. 5. Comparison of the time course of learning with different learning schemes: (A) Dual-Schemata model, (B) Dual-Schemata model without differentiation of perceptional schemata, (C) simple continuous actor-critic, (D) simple continuous actor-critic that continues to learn after 10000 [s].

Fig. 6. Concepts acquired by the Dual-Schemata model through interaction with the task environment.

Fig. 5 (B) shows that the rewards fluctuate considerably in the 3rd period (from 10000 [s] to 15000 [s]); this means a single perceptional schema must re-adapt to the new partner whenever the partner changes. The Dual-Schemata model overcomes both of these problems. Fig. 6 shows all the concepts the agent equipped with the Dual-Schemata model obtained through this collaborative ball-balancing task. $I_1$ became a behavioral concept representing "to balance a ball"; this concept does not depend on environmental changes (i.e., partner changes). In addition, the perceptional schema $P_1$ was divided into three perceptional schemata through the task. These differentiated perceptional schemata correspond to "with cooperator", "with lazy bones", and "with disturber", respectively. The acquired intentional schema acts in company with each perceptional schema as if they were subordinate concepts of a more general, abstract one. Therefore, the intentional schema can be called a generalized behavioral concept.

V. CONCLUSION

In this paper, we proposed a new reinforcement learning architecture. In the Dual-Schemata model, the reinforcement learning dynamics does not work alone but together with the perceptional schemata's equilibration dynamics. By collaborating with this other learning dynamics, an agent can obtain a more generalized concept through reinforcement learning.

However, our formulation is limited to linear environments, although the real world is full of nonlinearity; we must extend the formulation to nonlinear environments. Moreover, in this paper we showed that the robot can acquire only one new generalized behavioral concept. In future work we will make an agent obtain more behavioral concepts incrementally.

ACKNOWLEDGMENT

This work was supported in part by the Center of Excellence for Research and Education on Complex Functional Mechanical Systems (the 21st Century COE program of the Ministry of Education, Culture, Sports, Science and Technology, Japan).

REFERENCES

[1] T. Taniguchi and T. Sawaragi, "Design and Performance of Symbols Self-Organized within an Autonomous Agent Interacting with Varied Environments", IEEE International Workshop on RO-MAN, 2004.
[2] D. M. Wolpert and M. Kawato, "Multiple paired forward and inverse models for motor control", Neural Networks, Vol. 11, pp. 1317-1329, 1998.
[3] J. Morimoto and K. Doya, "Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning", Robotics and Autonomous Systems, Vol. 36, pp. 37-51, 2001.
[4] K. Doya et al., "Multiple Model-based Reinforcement Learning", Neural Computation, Vol. 14, pp. 1347-1369, 2002.
[5] K. Doya, "Reinforcement Learning in Continuous Time and Space", Neural Computation, Vol. 12(1), pp. 219-245, 2000.
[6] R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction", MIT Press, 1998.
[7] H. Kimura and S. Kobayashi, "An Analysis of Actor/Critic Algorithms using Eligibility Traces: Reinforcement Learning with Imperfect Value Function", 15th International Conference on Machine Learning, pp. 278-286, 1998.
[8] T. Taniguchi and T. Sawaragi, "Self-Organization of Inner Symbols for Chase: Symbol Organization and Embodiment", IEEE International Conference on SMC 2004, CD-ROM, 2004.
[9] T. Taniguchi and T. Sawaragi, "Assimilation and Accommodation for Self-organizational Learning of Autonomous Robots: Proposal of Dual-Schemata Model", IEEE International Symposium on CIRA 2003, pp. 277-282, 2003.
[10] J. H. Flavell, "The Developmental Psychology of Jean Piaget", Van Nostrand Reinhold, 1963.
[11] G. A. Carpenter and S. Grossberg, "A massively parallel architecture for a self-organizing neural pattern recognition machine", Computer Vision, Graphics, and Image Processing, Vol. 37, pp. 54-115, 1987.
[12] R. A. Schmidt, "Motor and Action Perspectives on Motor Behaviour", in O. G. Meijer and K. Roth (Eds.), Complex Movement Behaviour, Elsevier Science Publishers, pp. 3-44, 1988.