
Title: Modeling Behavior Cycles as a Value System for Developmental Robots

Running Title: Behavior Cycles in Robots

Kathryn E. Merrick

School of Engineering and Information Technology University of New South Wales, Australian Defence Force Academy Northcott Drive Canberra, 2600, ACT, Australia Phone: 61 2 6268 8023; Fax: 61 2 6268 8443; E-mail: [email protected]


Abstract

The behavior of natural systems is governed by rhythmic behavior cycles at the biological, cognitive and social levels. These cycles permit natural organisms to adapt their behavior to their environment for survival, behavioral efficiency or evolutionary advantage. This paper proposes a model of behavior cycles as the basis for motivated reinforcement learning in developmental robots. Motivated reinforcement learning is a machine learning technique that incorporates a value system with a trial-and-error learning component. Motivated reinforcement learning is a promising model for developmental robotics because it provides a way for artificial agents to build and adapt their skill-sets autonomously over time. However, new models and metrics are needed to scale existing motivated reinforcement learning algorithms to the complex, real-world environments inhabited by robots. This paper presents two such models and an experimental evaluation on four Lego Mindstorms NXT robots. Results show that the robots can evolve measurable, structured behavior cycles adapted to their individual physical forms.

Keywords Behavior cycles, developmental robotics, motivated reinforcement learning, neural networks, SART networks.


1. Developmental Learning

Developmental learning is an autonomous, incremental process by which an individual progressively adapts their behavior to increase the variety or complexity of their activities (Oudeyer, Kaplan, & Hafner, 2007). Developmental learning is a long term goal for robotics researchers because it promises a way for robots to increase their skill-sets autonomously by selecting their own learning goals (Weng et al., 2001). This provides a mechanism by which robots can adapt to new tasks; adapt to unexpected changes in their physical structure or environment; or identify and achieve interesting or creative goals not envisaged by system engineers.

Developmental robotics is a challenging application for traditional artificial intelligence (AI) approaches for a number of reasons. First, the complexity of the real-world environments inhabited by robots renders many AI approaches intractable. Secondly, a majority of existing AI algorithms assume that there is a known goal or goals to be achieved. As such they generally do not incorporate modules to self-select goals. Finally, developmental robotics is challenging because of the difficulties defining and evaluating what it means for a developmental robot to be successful. Evaluation of traditional robots is generally made with respect to a specific goal or goals (Arkin, 1998). This becomes inappropriate for developmental robots that can self-select the goals they will achieve and self-organize their behavior to achieve them.

This paper uses the idea of behavior cycles as the basis for extending motivated reinforcement learning (MRL) to the developmental robotics domain. MRL is a machine learning technique that incorporates a value system with a trial-and-error learning component. Existing MRL models and metrics have found success in simulations and virtual world applications (Merrick & Maher, 2009a, 2009b; Singh, Barto, & Chentanez, 2005). However, existing results show they do not yet extend to complex environments such as those inhabited by robots (Merrick, 2008a; Merrick & Huntington, 2008; Merrick & Maher, 2009b). Despite this, MRL is a promising model for developmental robotics because the value system provides a mechanism for goal selection and the learning module provides a way to achieve solutions to those goals. The approach in this paper is inspired by the behavior cycles in natural systems, which permit organisms to adapt behaviors suitable for their physical form and for survival, efficiency of action or evolutionary advantage in their environment (Ahlgren & Halberg, 1990). The new models provide a structured and extensible approach to designing and evaluating developmental robots.

The remainder of this paper is organized as follows. Section 2 reviews relevant literature about behavior cycles in natural systems. It discusses how the concept can be used in the design and evaluation of artificial systems such as robots. Section 3 reviews value systems in developmental robots and related fields, and describes how the new MRL models presented in this work differ from existing value system approaches. Section 4 presents our model of behavior cycles and how it can be incorporated, first into MRL models with function approximation for complex environments and, secondly, into new metrics for evaluating developmental robots. The new models are demonstrated in Section 5 on four Lego Mindstorms NXT robots.
Results show that robots using the new model can autonomously adapt structured behavior cycles through experimentation with their physical structure and interaction with their environment. The paper concludes with a discussion of directions for future work in Section 6.

2. Behavior Cycles

2.1. Natural Systems

Three broad categories of behavior cycles are evident in natural systems: biological cycles, cognitive cycles and social cycles (Merrick, 2008b). At a biological level, these cycles support basic survival and adaptation. At a cognitive level they permit exploration, learning and creativity. Finally, at a social level behavior cycles drive evolution and cultural advancement. All of these traits are desirable for future developmental robots. The following sections consider each category in detail.


2.1.1. Biological Cycles

Biological cycles, biorhythms or biocycles are closely associated with both the physical form of a natural organism and its environment (Ahlgren & Halberg, 1990). Environmental factors influencing the evolution of biocycles include seasonal cycles in temperature and light (Dunlap, Loros, & DeCoursey, 2003), variations in the earth’s magnetic field (Wever, 1973) and tidal cycles (Brown, 1954). Biocycles may also be influenced internally by hormonal or chemical cycles. Ahlgren and Halberg (1990) propose four advantages of biocycles in natural systems: anticipation of environmental change, efficiency of action, competition reduction through exploitation of environmental niches, and navigation.

Designing robots capable of evolving emergent behavior cycles has the potential to draw on these advantages. First, it provides a basis for building robots that can synchronize their behavior with their environment to achieve efficient, long term activity through anticipation and competitive behavior. In robots, biological form is replaced by physical hardware, but the possibility for pseudo-biological cycles remains. These may be synchronized with salient internal variables such as battery level, heat, oil or other fluids, or with salient cycles in the robot’s environment. In particular, there is a possibility for robots to synchronize their behavior with that of humans sharing their environment. This would permit them to anticipate, predict and support those cycles. Potential applications for robots with this capability include home assistant bots (Merrick and Shafi, 2009) and industrial robots. The idea of an automated home assistant in particular is desirable in regions with aging population demographics.

The ability to evolve biocycles also provides a basis for building robots that can synchronize their behavior with their structure. This is particularly important for achieving the goal of developmental robots that can progressively adapt their behavior to increase the variety or complexity of their activities. In the future, this might also permit robots to adapt to reconfiguration or tolerate damage or failure of a component. The models presented in this paper focus on the evolution of behavior cycles so robots can learn behavior synchronized with their physical structure. In future however, the study of cognitive and social cycles also promises advantages for developmental robots.

2.1.2. Cognitive Cycles

Where biocycles are associated with the physical form or environment of an individual, cognitive cycles are associated with abstract reasoning processes. Cycles in risk taking behavior (Chavanne, 2003; Zimecki, 2006), habituation and recovery (Geen, Beatty, & Arkin, 1984) and learning (Kolb, Rubin, & McIntyre, 1984) are examples of cognitive cycles. Habituation describes the process of stimuli losing novelty for an individual. Loss of novelty motivates exploration and experimentation to achieve a more optimal level of stimulation. In this way, individuals engage in an ongoing cycle of habituation and recovery. Closely related to habituation and recovery is the learning cycle. Kolb et al. (1984) describe a four-step learning cycle that moves through phases of concrete experience, reflection and observation, abstract conceptualization, and active experimentation. Depending on the results of experimentation, the cycle starts over with a new learning experience or with a revision of the current one.

Where biocycles tend to be closely associated with action efficiency and survival, cognitive behavior cycles are frequently associated with risk taking behavior, experimentation or creativity. The relationship between cognitive cycles and creativity, for example, is evident in phenomena such as fashion cycles where similar-yet-different fashions reemerge from time to time. Habituation and recovery of ideas in the fashion industry drive trends through cycles.

The capacity for experimentation and creativity are potential advantages of designing robots capable of emergent cognitive behavior cycles. For example, they provide a basis for building robots themselves capable of creative design; robots that can identify and use objects creatively as tools; and robots that can discover novel problems to solve, not necessarily envisaged by the robot’s designers. Application areas for such robots include architecture, engineering and the construction industry where systems that can mimic the creative role traditionally played only by humans are a long term research goal (Gero, 1992). Scientific research may also potentially benefit from such robots, if they are capable of conducting creative research and experimentation.

2.1.3. Social Cycles

Cyclic behavior can be observed in groups of individuals, at both biological and cognitive levels. At the biological level, co-evolution (Ehrlich & Raven, 1964) of parasites and their hosts, predators and prey or birds and flowers is an example of a social cycle. At the cognitive level, Social Cycle Theory describes the evolution of society and human history as progressing through a series of sociodemographic cycles. Various mathematical models of these cycles have been developed (Nefedov, 2004; Usher, 1989).

Social cycles can be thought of as describing the rise and fall of species, trends, cultures and societies. These cycles force social progress through a cyclic process of social evolution. Habituation and recovery, for example, occur at a social as well as an individual level, influencing the focus of attention of large groups. Csikszentmihalyi (1996) notes how a society provides the constructive criticism necessary for an individual’s self-improvement and an audience for their creative contributions. Society-wide habituation and recovery of ideas drives trends through cycles.

Just as the implementation of pseudo-biological and cognitive cycles in robots promises advantages for efficiency, survival and creativity, in future, robots that can self-organize social cycles may have the potential for social evolution and self-advancement. While the disadvantages of robots with self-evolutionary capabilities are often dramatized, the drive towards nano-technologies and robot swarms for tasks such as search-and-rescue in disaster zones or interplanetary exploration creates a new setting for machines capable of adapting both their own social structure and their physical form.

2.1.4. Evaluating Cyclic Behavior

The importance of identifying repetitive patterns – particularly cyclic patterns – in natural systems can be seen in a number of studies (Forbes & Fiume, 2005; Kovar & Gleicher, 2004; Li & Holstein, 2002; Tang, Leung, Komura, & Shum, 2008). Tang et al. (2008), for example, visualize and analyze posture similarity in human motion-capture data for dancers. Posture is represented as a multidimensional vector and plotted on a point-cloud matrix. Patterns on the point-cloud matrix reveal patterns and cycles in a dance. The point-cloud matrices can also be analyzed numerically to produce a quantitative analysis of the individual’s behavior. For example, the number, length and period of cycles can be determined.

Tang et al. (2008) propose their work as a starting point for the design of artificial systems that can generate cyclic motion, such as automated dance tutors. However, both their visual and numerical analysis techniques can also be adapted to characterize the behavior of an artificial system such as a robot. Merrick (2009), for example, applies the visual technique to analyze robot behavior. This paper extends the qualitative visual analysis by adapting the associated numerical techniques to evaluate robots. This is discussed in Section 4.4.2.

2.2. Artificial Systems

Artificial agents that exhibit behavior cycles have been considered in a number of domains. For example, the simplest approaches to artificially intelligent characters in computer games use scripted animations that repeat a small set of behaviors as a short cycle (Laird & van Lent, 2000). More complex story-telling approaches create longer cycles using role based models (Mac Namee, Dobbyn, Cunningham, & O'Sulivan, 2003). However, both these approaches assume a fixed set of behaviors through which the agent cycles. The agent cannot adapt its behavior cycles to changes in its environment, beyond those for which it has been programmed. Robots require adaptive behavior cycles to function in complex environments. That is, they need to be able to learn and modify their behavior cycles as required. This suggests two requirements in the robot agent model: a learning module to learn and adapt cycles and a value system to identify which cycles to learn. This is the topic of the next section.


3. Robotic Value Systems

The question of whether robots can have human-like value systems – such as emotions or complex motivations – sparks a similar ethical debate to whether they can have human-like consciousness (Arkin, 1998). However, the idea that robots may have an artificial value system as a mechanism for allowing them to deal effectively with the contingencies of life in complex real-world environments does have support (Oudeyer et al., 2007; Sporns, Almassy, & Edelman, 2000; Huang & Weng, 2002; Friston, Tononi, Reeke, Sporns, & Edelman, 1994; Lungarella, Metta, Pfeifer, & Sandini, 2003; Marsland, Nehmzow, & Shapiro, 2000).

Robotic value systems mediate the saliency of environmental stimuli to allow robots to self-supervise and self-organize their learning (Lungarella et al., 2003). That is, the value system signals the occurrence of important stimuli and triggers the formation of learning goals. Value systems play an important role in the design of robots with adaptive, lifelong learning behavior because they provide a way for robots to learn autonomously through spontaneous, self-generated activity. This is in contrast to robots without value systems, which often rely on instructions provided by their human designer or a human teacher to determine the goals they will pursue. Different value systems have been considered, including systems for innate and acquired values (Friston et al., 1994); neuromodulatory systems (Sporns et al., 2000) and cognitive value systems (Huang & Weng, 2002; Merrick & Huntington, 2008; Oudeyer et al., 2007).

3.1. Neuromodulatory Value Systems

The primary role of neuromodulatory systems in natural organisms is to tune the dynamics of the neural network comprising the brain at different stages of the organism’s development (Lungarella et al., 2003). In doing this, neuromodulatory systems can function as value systems to mediate or value environmental stimuli and trigger a behavioral response. Neuromodulatory systems in robots are based on computational models of neurons in the brain. They define properties such as how neurons influence each other, how long neurons activate for and the regions of the brain that are affected. Existing work with neuromodulatory value systems in robots has focused on areas such as the adaptation of appetitive and aversive behavior (Sporns et al., 2000) and adaptation of the visual system (Friston et al., 1994).

3.2. Cognitive Value Systems

In contrast to neuromodulatory value systems, cognitive value systems are based on psychological theories. These include theories describing psychological phenomena such as emotion and motivation. In addition, many of the cognitive value systems developed so far have been studied in the context of their interaction with machine learning approaches, such as reward-based learning (Huang & Weng, 2002; Merrick & Huntington, 2008; Nagai, Asada, & Hosoda, 2002; Oudeyer et al., 2007). The idea is that the value system will identify a series of goals over the course of the robot’s life and generate a reward signal to direct the learning module to find a solution for each goal.

This approach is represented diagrammatically in Fig. 1. At each time t, the robot senses a state S(t). The value system interprets this state as an observation O(t) and generates a reward R(t) representing the immediate value of that observation to the agent. The learning module then updates utility values stored in a behavioral policy based on O(t) and R(t). Utility values may simply represent immediate reward, or they may represent more complex concepts such as expected future reward. Finally the robot uses the learned policy to select an action A(t). Execution of this action causes the environment to transition to a new state.

Oudeyer et al. (2007) classify artificial value systems in three categories: error maximization (EM), progress maximization (PM) and similarity-based progress maximization (SBPM). These categories reflect the idea that robots using value systems try to choose actions that will maximize the value of a reward signal, and that this reward signal may be calculated in different ways. Robots using EM techniques focus on actions that permit them to learn about stimuli for which they currently have a high prediction error. Examples of EM approaches (Huang & Weng, 2002; Marshall, Blank, & Meeden, 2004; Marsland, Nehmzow, & Shapiro, 2000; Thrun, 1995) can often be thought of as modeling the ‘novelty’ of a stimulus and seeking out stimuli of high novelty. The main criticism of EM approaches is that random occurrences often result in a high prediction error, but there is also little to be learned from such occurrences. Alternative approaches that can filter out random occurrences are required in robotics domains where random occurrences may be a result of sensor noise.

Fig. 1. Reward-based learning using a value system. The value system interprets sensed states S(t) as observations O(t) and generates a reward value R(t). The learning module updates a behavioral policy based on O(t) and R(t) and uses the learned policy to select an action A(t).
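As a concrete illustration of this loop, the control flow of Fig. 1 can be sketched in Java (the language the experiments in Section 5 were implemented in). The interface and class names below are introduced for this sketch only; they are not part of any of the systems cited above.

```java
// Illustrative skeleton of the reward-based learning loop in Fig. 1.
// All type names here (Environment, ValueSystem, LearningModule) are
// assumptions made for this sketch, not an API from the cited systems.
public final class MotivatedAgentLoop {

    interface Environment {
        double[] sense();              // S(t): raw sensor readings
        void act(int action);          // execute A(t)
    }

    interface ValueSystem {
        double[] observe(double[] state);     // S(t) -> O(t)
        double reward(double[] observation);  // O(t) -> R(t)
    }

    interface LearningModule {
        void update(double[] observation, double reward);  // adjust the policy
        int selectAction(double[] observation);             // A(t) from the policy
    }

    public static void run(Environment env, ValueSystem values,
                           LearningModule learner, int steps) {
        for (int t = 0; t < steps; t++) {
            double[] state = env.sense();                  // S(t)
            double[] observation = values.observe(state);  // O(t)
            double reward = values.reward(observation);    // R(t)
            learner.update(observation, reward);           // update the behavioral policy
            env.act(learner.selectAction(observation));    // A(t)
        }
    }
}
```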

PM techniques focus attention on stimuli for which the robot ‘predicts that it will have a high prediction error’. This more indirect method of computing reward overcomes some of the difficulties associated with EM techniques in environments that may contain random occurrences. Examples include work by Kaplan and Oudeyer (2003) and Herrmann et al. (2000). SBPM techniques are like PM techniques, but take into account the similarity of observations when making predictions. These approaches – of which MRL is an example – often incorporate an unsupervised learning algorithm or other mechanism to cluster similar experiences, before computing a ‘novelty’, ‘interest’ or ‘curiosity’ value for the learned cluster (Marsland et al., 2000; Merrick, 2008a; Merrick & Huntington, 2008; Oudeyer et al., 2007).

SBPM techniques appear the most promising in robotics domains because they provide a way to generalize observations from large volumes of sensor data. However, existing state-of-the-art SBPM systems do not yet employ state-of-the-art machine learning approaches for behavior acquisition. As a result, existing SBPM systems are either limited to solving certain types of problems for which there is immediate reward feedback to the value system (Marsland et al., 2000; Oudeyer et al., 2007) or suffer from the challenges of environmental complexity and evaluation (Merrick, 2008a; Merrick & Huntington, 2008).

The work in this paper thus innovates in two ways. First, it incorporates function approximation approaches for RL (Brignone & Howarth, 2003; Coulom, 2002) into the MRL framework, extending MRL to the developmental robotics domain (and potentially other complex domains). Secondly, by explicitly modeling behavior cycles as a MRL value system, a specific structure is introduced that is to be achieved in the robot’s behavior, without explicitly specifying which goals the robot should address. This provides a generic basis for measuring the emergent behavior of developmental robots and comparing the behavior generated by different approaches.

4. Modeling Behavior Cycles

The purpose of the cycle-based value system presented in this section is to motivate learning of biocycles adapted to the robot’s physical structure or that of its environment. By identifying ‘achievable’ behavior cycles it can be thought of as the lowest layer of a potentially multilayer value system. This multilayer value system might generate not only biocycles, but also cognitive and social cycles. The cognitive and social layers might further distinguish ‘useful’, ‘interesting’ or ‘correct’ behaviors from ‘achievable’ behaviors. While the focus of this paper is on development of a single-layer value system for generating biocycles, further discussion of multilayer value systems is included in Section 6.

Using the notation from Fig. 1, a robot’s experiences can be described as a trajectory of states and actions over time:

S(1), A(1), S(2), A(2), S(3), A(3), …


A behavior cycle is a sequence of states in which one or more states is repeated. The simplest behavior cycle is thus a sequence of the same state repeated at adjacent time steps: S1(t), S1(t+1), S1(t+2), S1(t+3), … This represents the robot maintaining itself in some configuration, or performing a maintenance task. A more complex behavior cycle might repeat several states as follows: S1(t), S2(t+1), S3(t+2), S1(t+3), S2(t+4), S3(t+5), … This represents a robot achieving change, or performing an achievement task. This paper focuses on the repetition of states rather than the repetition of actions, as repeated actions may not necessarily generate behavior cycles in noisy environments. A robot with a sticky wheel, for example, may repeat a move forward action, but generate haphazard motion.

To create robots that can achieve cyclic behavior, the role of the value system is to reward certain observations of the sensed world to be repeated. This approach identifies two properties of observations to reward:

• The observation must be repeatable;
• Repetition must cause the agent to learn.

A robot using MRL can identify repeatable observations by querying its behavioral policy to check if the current observation already has an entry in the policy. Likewise, a robot using MRL can identify observations that cause it to learn by analyzing the learning error (change in utility values) when it updates the behavioral policy.

Fig. 2. shows a possible reward signal using these ideas. Utility is optimistically initialized at one and learning error at zero. The first few times an observation occurs, reward is zero until the agent recognizes the observation as repeatable. On subsequent occurrences of the observation, reward has the higher value of one. When learning error becomes very low (indicating there is little being learned) reward drops back to zero. Reward increases to one again when learning error is low and negative, indicating that the robot has almost completely forgotten the observation (so there is once more something to be learned).

Fig. 2. Relationship between learned utility, learning error and reward. For simplicity, this chart depicts learned utility for immediate reward received for a single observation.
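The relationship depicted in Fig. 2 can be read as a small decision rule: reward an observation only if the policy already recognizes it as repeatable and its learning error indicates either ongoing learning or near-complete forgetting. The sketch below is a minimal reading of that chart (using the 0/1 reward levels of Fig. 2; the concrete value systems in Sections 4.1–4.3 use graded rewards and punishments), and its names are illustrative only.

```java
// Minimal sketch of the cycle-based reward idea shown in Fig. 2.
// Reward is high only for observations that are (a) recognised as
// repeatable and (b) still being learned, or almost completely forgotten.
public final class CycleReward {

    /**
     * @param repeatable    the observation already has an entry in the policy
     * @param learningError change in utility (delta Q) from the last update
     * @param epsilon       forgetting threshold
     */
    public static double reward(boolean repeatable, double learningError, double epsilon) {
        if (!repeatable) {
            return 0.0;                       // not yet recognised as repeatable
        }
        boolean stillLearning = learningError > epsilon;
        boolean almostForgotten = learningError < 0 && learningError > -epsilon;
        return (stillLearning || almostForgotten) ? 1.0 : 0.0;
    }
}
```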

This reward signal motivates cycles of exploration and exploitation. When reward is low the robot will explore until it encounters a situation that generates a higher reward value. The robot will then exploit that situation by learning about it, until it is no longer highly rewarding. These cycles of exploration and exploitation have a number of implications. First, the cycle-based reward signal motivates cycles of learning and forgetting about different observations. While the robot is forgetting about one observation it may be learning about another, so its focus of attention changes over time. The result of this shifting attention focus is that the policy learned by the robot is also non-stationary. This is one of the fundamental differences between MRL and RL using a stationary reward signal. Secondly, because the learned policy is non-stationary, there is no concept of a single optimal policy in MRL. Rather, MRL using this cycle-based reward signal makes use of the learning ability of traditional RL algorithms to permit the robot to converge on a series of different policies that are efficient with respect to different tasks at different times.

The following sub-sections describe three MRL models that implement this reward signal as a single layer, cycle-based value system with variations of the traditional Q-learning algorithm (Watkins & Dayan, 1992). Using a ‘flat’ Q-learning approach means that when the robot’s focus of attention shifts, previous learning is forgotten. This implies that the robot’s current behavior is always relevant to its current situation and out-of-date policies are not retained. A discussion of alternative approaches using hierarchical models to permit recall and reuse of learned policies is made in Section 6.2. Each of the models defined below specifies (1) how the robot interprets observations O(t) from sensed data, (2) the reward signal R(t) produced by the value system, (3) the learning step that updates the behavioral policy Q and (4) how actions A(t) are selected.

4.1. Q-MRL – Behavior Cycles and Q-Learning

This model combines a cycle-based value system with table-based Q-learning (Watkins & Dayan, 1992). This model has been used successfully in virtual and simulated environments (Merrick & Maher, 2009a), but tends to suffer from the ‘curse of dimensionality’ in robotic domains. Specifically, continuous-valued or noisy sensor readings mean that there are a large number of unique states. Learning becomes quickly unviable in terms of time and memory requirements. However, this model serves as a baseline against which to compare the performance of the new MRL models using function approximation.

In this model, the observation at time t contains the same data as the sensed state. That is:

O(t) = S(t)   (1)

The reward signal representing the cycle-based value system is:

$$
R(t) = \begin{cases}
1 & \text{if } \exists\, \tau > 1 \text{ such that } O(t-\tau) = O(t-1) \text{ and} \\
  & \quad \big( \Delta Q_{(t-1)}(O(t-1), A(t-1)) > \varepsilon \ \text{ or } \ 0 > \Delta Q_{(t-1)}(O(t-1), A(t-1)) > -\varepsilon \big) \\
-10 & \text{if } O(t-1) = O(t) \\
-1 & \text{otherwise}
\end{cases}
\tag{2}
$$

This reward signal has three rules. The first rule assigns the highest reward of 1 to repeated observations that cause learning. This rule contains a component to continue rewarding observations that continue to cause learning (i.e. continue to have ΔQ(t–1) > ε) and a component to switch to rewarding observations that have been almost completely forgotten (i.e. those with 0 > ΔQ(t–1) > –ε). The assumption is that observations that have been almost forgotten have the potential to be relearned, causing the robot to cycle through its own behavior cycles. The second rule assigns observations that are repeated at subsequent time steps a punishment of –10. This rule means that this model focuses attention on achievement tasks, rather than maintenance tasks. The third rule assigns all other observations, including those that are repeated without causing learning, a punishment of –1.

The Q-learning update is:

$$
Q_{(t)}(O(t-1), A(t-1)) = Q_{(t-1)}(O(t-1), A(t-1)) + \Delta Q_{(t)}(O(t-1), A(t-1))
\tag{3}
$$

where

$$
\Delta Q_{(t)}(O(t-1), A(t-1)) = \alpha \big[ R(t) + \gamma \max_{A(t) \in A} Q_{(t-1)}(O(t), A(t)) - Q_{(t-1)}(O(t-1), A(t-1)) \big]
$$

In this MRL model, the Q-table stores not only the utility value Q for each observation-action mapping, but also the learning error ΔQ. This stored error-value is used to compute the cycle-based reward signal the next time a particular observation-action sequence occurs. The parameter α governs the rate of change in Q. α can be further divided into two parameters, αL and αF, governing the rate of learning and forgetting respectively. The parameter γ is the discount factor for future expected reward.

Finally, greedy action-selection is used:

$$
A(t) = \arg\max_{A(t) \in A} Q_{(t)}(O(t), A(t))
\tag{4}
$$

An exploration component, such as ε-greedy action-selection traditionally needed in Q-learning, is not required in this MRL model as the value system incorporates exploration, as described earlier.
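A compact sketch of Q-MRL as defined by Equations 2–4 is given below. It is illustrative rather than a reference implementation: observations are assumed to be usable as table keys (for example, as strings), the repeatability test of Equation 2 is approximated by a policy lookup as described above, and the class and method names are this sketch’s own.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of Q-MRL (Section 4.1): table-based Q-learning driven by the
// cycle-based reward of Equation 2. Class and field names are illustrative.
public final class QMrl {
    private final Map<String, double[]> qTable = new HashMap<>();   // Q(O, A)
    private final Map<String, double[]> deltaQ = new HashMap<>();   // last delta Q(O, A)
    private final int numActions;
    private final double alpha, gamma, epsilon;

    public QMrl(int numActions, double alpha, double gamma, double epsilon) {
        this.numActions = numActions;
        this.alpha = alpha; this.gamma = gamma; this.epsilon = epsilon;
    }

    // Simplified reading of Equation 2: reward repeated observations that are
    // still being learned or almost forgotten; punish immediate repetition
    // and everything else.
    public double reward(String prevObs, int prevAction, String obs) {
        if (obs.equals(prevObs)) return -10.0;                 // maintenance, punished
        boolean repeatable = qTable.containsKey(prevObs);      // already in the policy
        double dq = deltaQ.containsKey(prevObs) ? deltaQ.get(prevObs)[prevAction] : 0.0;
        boolean learning = dq > epsilon;
        boolean almostForgotten = dq < 0 && dq > -epsilon;
        return (repeatable && (learning || almostForgotten)) ? 1.0 : -1.0;
    }

    // Equation 3: table-based Q-learning update; delta Q is cached for Equation 2.
    public void update(String prevObs, int prevAction, double reward, String obs) {
        double[] qPrev = qTable.computeIfAbsent(prevObs, k -> optimistic());
        double[] qNext = qTable.computeIfAbsent(obs, k -> optimistic());
        double delta = alpha * (reward + gamma * max(qNext) - qPrev[prevAction]);
        qPrev[prevAction] += delta;
        deltaQ.computeIfAbsent(prevObs, k -> new double[numActions])[prevAction] = delta;
    }

    // Equation 4: greedy action selection over the learned utilities.
    public int selectAction(String obs) {
        double[] q = qTable.computeIfAbsent(obs, k -> optimistic());
        int best = 0;
        for (int a = 1; a < numActions; a++) if (q[a] > q[best]) best = a;
        return best;
    }

    private double[] optimistic() {           // utilities optimistically initialised at one
        double[] q = new double[numActions];
        Arrays.fill(q, 1.0);
        return q;
    }

    private static double max(double[] v) {
        double m = v[0];
        for (double x : v) m = Math.max(m, x);
        return m;
    }
}
```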

4.2. NN-MRL – Behavior Cycles and Neural Network Reinforcement Learning

This model combines a cycle-based value system with neural-network Q-learning. In this model, we assume an attribute-based representation for sensed states:

S(t) = (s1(t), s2(t), s3(t), … sL(t), …)   (5)

In NN-MRL the observation at time t also contains exactly the same data as the sensed state, as in Equation 1. The reward signal has a similar structure to that used for Q-MRL, but uses a distance function dist(O(t), O(t′)) to determine the similarity of observations, rather than enforcing strict equality. Use of the distance function allows for sensor noise in determining if observations are equal. This complements the function approximation in the neural-network itself. This paper uses Euclidean distance as the distance function. The reward signal is:

$$
R(t) = \begin{cases}
1 & \text{if } \exists\, \tau > 1 \text{ such that } \text{dist}(O(t-\tau), O(t-1)) < \rho \text{ and} \\
  & \quad \big( \Delta Q_{(t-1)}(O(t-1), A(t-1)) > \varepsilon \ \text{ or } \ 0 > \Delta Q_{(t-1)}(O(t-1), A(t-1)) > -\varepsilon \big) \\
-5 & \text{if } \exists\, \tau > 1 \text{ such that } \text{dist}(O(t-\tau), O(t-1)) < \rho \text{ and} \\
  & \quad \big( 0 < \Delta Q_{(t-1)}(O(t-1), A(t-1)) \le \varepsilon \ \text{ or } \ \Delta Q_{(t-1)}(O(t-1), A(t-1)) \le -\varepsilon \big) \\
-10 & \text{if } \text{dist}(O(t-1), O(t)) < \rho \\
-1 & \text{otherwise}
\end{cases}
$$

This reward signal has four rules. The first rule assigns the highest reward of 1 to repeated observations that cause learning. This rule contains a component to continue rewarding observations that continue to cause learning (i.e. continue to have ΔQ(t–1) > ε) and a component to switch to rewarding observations that have been almost completely forgotten (i.e. those with 0 > ΔQ(t–1) > –ε). The assumption is that observations that have been almost forgotten have the potential to be relearned, causing the robot to cycle through its own behavior cycles. The second rule assigns observations that are repeated without causing learning a punishment of –5. This rule contains a component to continue to punish observations that do not cause learning (i.e. continue to have ΔQ(t–1) ≤ –ε) and a component to switch to punishing observations that have been almost completely learned (i.e. those with 0 < ΔQ(t–1) ≤ ε). The assumption is that observations that are almost completely learned become boring, so the robot should shift its attention elsewhere. The third rule assigns observations that are repeated at subsequent time steps a punishment of –10. This rule means that this model again focuses attention on achievement tasks, rather than maintenance tasks. Finally, the fourth rule assigns all other observations a punishment of –1.

In neural-network function approximation for RL, the behavioral policy is represented by a fixed-size neural-network mapping observations to actions. This paper uses a back-propagation network (Russell and Norvig, 1995; Chapter 19) with one hidden layer, as shown in Fig. 3. Each element of the observation is represented by an input neuron (leftmost layer in Fig. 3). Each action is represented by an output neuron (rightmost layer in Fig. 3). Weights connect input, hidden and output neurons. In this paper, the number of hidden neurons is half the number of input neurons.

The neural network has just one neuron for each element of the observation, rather than entries for every combination of values for observation elements needed in the table-based representation used in Q-MRL. This significantly reduces the memory requirements of learning, while providing a way for the robot to adapt its learned behavior policy.


Fig. 3. A neural-network is used to represent the learned mapping from observations to actions and utility values. Input nodes (left) represent elements of observations. Output nodes (right) represent actions.

The neural network represents the robot’s current hypothesis as to which action should be taken in response to a given observation. This hypothesis is defined by the weights w of the network, which can be thought of as the robot’s long-term memory. Thus, the predicted utility of taking a given action Aa(t–1) in response to an observation O(t–1) is computed as a function of the network weights as follows:

$$
Q_{(t-1)}(O(t-1), A_a(t-1)) = g\!\left( \sum_n g\!\left( \sum_L o_L(t-1)\, w_{Ln}(t-1) \right) w_{na}(t-1) \right)
\tag{6}
$$

g(i) is the activation function for the network. The activation function is a non-linear component that transforms the input i to each neuron into the activation-value output by the neuron. A sigmoid activation function is used in this paper, shifted to give outputs in the range (–1, 1). This means that the network can represent both negative and positive feedback from the reward function. The activation function is:

$$
g(i) = \frac{2}{1 + e^{-i}} - 1
$$

The input i to the activation function is determined by the network topology. It is a weighted sum of the inputs provided by connected neurons. Inputs are indicated by arrows in Fig. 3. According to the RL update rule, the actual utility of observing O(t) after taking action Aa(t–1) in response to observation O(t–1) is:

$$
U(t) = R(t) + \gamma \max_{A(t) \in A} Q_{(t-1)}(O(t), A(t))
$$

The prediction error Ea(t) of the network in response to action Aa(t–1) is thus: Ea(t) = U(t) – Q(t–1)(O(t–1), Aa(t–1)). The weights of the network are updated to improve the robot’s hypothesis by reducing the prediction error. This is done by assessing the neurons to blame for the error and dividing the error among the contributing weights. This division is done by propagating to each hidden neuron Nn a weighted portion En(t) of the network error as follows: En(t) = wna(t–1) Ea(t). The weights connecting the output neuron (representing the current action) to neurons in the hidden layer are updated to bring the predicted utility closer to the actual utility. The update equation is:

$$
w_{na}(t) = w_{na}(t-1) + \alpha\, g\!\left( \sum_L o_L(t-1)\, w_{Ln}(t-1) \right) E_a(t)\, g'\!\left( \sum_n N_n\, w_{na}(t-1) \right)
\tag{7}
$$

This update equation increases or decreases each weight wna proportionally to the network error and the neuron’s input. The update equation can be interpreted as performing a gradient descent search in the weight space. α is now the learning rate of both the neural-network and the Q-learning. The activation gradient g'(i) is calculated stepwise as:

$$
g'(i(t)) = \frac{g(i(t)) - g(i(t-1))}{i(t) - i(t-1)}, \qquad \text{where } g(i(0)) = 1
$$

g'(i(t–1)) is cached in each neuron to speed up computation at time t. Finally, weights connecting hidden neurons to input neurons are updated according to:

$$
w_{Ln}(t) = w_{Ln}(t-1) + \alpha\, o_L(t-1)\, E_n(t)\, g'\!\left( \sum_L o_L(t-1)\, w_{Ln}(t-1) \right)
\tag{8}
$$

The network updates in Equations 7 and 8 replace the standard Q-learning update used in Q-MRL (Equation 3). Once the network has been updated, the learning error ΔQ(t) can be computed and cached in the appropriate action neuron for use in calculating reward at the next time-step. The learning error is:

ΔQ(t)(O(t–1), Aa(t–1)) = Q(t)(O(t–1), Aa(t–1)) – Q(t–1)(O(t–1), Aa(t–1))

The action-selection function is the same as that used for Q-MRL (Equation 4).
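The forward pass of Equation 6 and the weight updates of Equations 7 and 8 can be sketched as follows. This is a simplified, illustrative implementation: it uses the analytic gradient of the shifted sigmoid in place of the cached stepwise gradient described above, and its class and method names are assumptions of this sketch.

```java
import java.util.Random;

// Sketch of the NN-MRL function approximator (Section 4.2): a single hidden
// layer maps an observation to one utility estimate per action (Equation 6),
// and weights are adjusted by back-propagating the RL prediction error
// (Equations 7 and 8). For simplicity this sketch uses the analytic gradient
// of the shifted sigmoid rather than the cached stepwise gradient in the text.
public final class NnMrlNetwork {
    private final double[][] wIn;    // wLn: input L -> hidden n
    private final double[][] wOut;   // wna: hidden n -> action a
    private final double alpha;      // shared learning rate

    public NnMrlNetwork(int inputs, int actions, double alpha, long seed) {
        int hidden = Math.max(1, inputs / 2);            // half the number of inputs
        this.alpha = alpha;
        Random rng = new Random(seed);
        wIn = new double[inputs][hidden];
        wOut = new double[hidden][actions];
        for (double[] row : wIn) for (int j = 0; j < row.length; j++) row[j] = rng.nextGaussian() * 0.1;
        for (double[] row : wOut) for (int j = 0; j < row.length; j++) row[j] = rng.nextGaussian() * 0.1;
    }

    private static double g(double i) { return 2.0 / (1.0 + Math.exp(-i)) - 1.0; }   // shifted sigmoid
    private static double gPrime(double activation) { return (1.0 - activation * activation) / 2.0; }

    private double[] hiddenActivations(double[] obs) {
        double[] h = new double[wOut.length];
        for (int n = 0; n < h.length; n++) {
            double sum = 0.0;
            for (int L = 0; L < obs.length; L++) sum += obs[L] * wIn[L][n];
            h[n] = g(sum);
        }
        return h;
    }

    // Equation 6: predicted utility Q(O, Aa) for every action a.
    public double[] utilities(double[] obs) {
        double[] h = hiddenActivations(obs);
        double[] q = new double[wOut[0].length];
        for (int a = 0; a < q.length; a++) {
            double sum = 0.0;
            for (int n = 0; n < h.length; n++) sum += h[n] * wOut[n][a];
            q[a] = g(sum);
        }
        return q;
    }

    // Equations 7 and 8: move the prediction for the taken action towards the
    // RL target U(t) = R(t) + gamma * max_a Q(O(t), a); returns delta Q.
    public double update(double[] prevObs, int action, double target) {
        double[] h = hiddenActivations(prevObs);
        double[] q = utilities(prevObs);
        double error = target - q[action];                        // Ea(t)
        double outGrad = gPrime(q[action]);
        for (int n = 0; n < h.length; n++) {
            double errorN = wOut[n][action] * error;              // En(t): blame assignment
            wOut[n][action] += alpha * h[n] * error * outGrad;    // Equation 7
            double hiddenGrad = gPrime(h[n]);
            for (int L = 0; L < prevObs.length; L++) {
                wIn[L][n] += alpha * prevObs[L] * errorN * hiddenGrad;   // Equation 8
            }
        }
        return utilities(prevObs)[action] - q[action];            // delta Q for the reward signal
    }
}
```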

4.3. SART-MRL – Behavior Cycles and Simplified Adaptive Resonance Theory Function Approximation in Reinforcement Learning

This model combines a cycle-based value system with a simplified adaptive resonance theory (SART) network (Baraldi & Alpaydin, 1998) and table-based Q-learning. Robots using this model can generalize over the observation space using a variable-sized network of neurons. This approach is related to the growing neural gas self-organizing map (GNG SOM) proposed by Fritzke (1995), but uses a set of unconnected neurons, rather than learning network topology. Our approach extends the work of Brignone and Howarth (2003) – who combined standard ART networks with gradient-descent Q-learning – to the MRL case. We use table-based learning rather than gradient-descent learning, and allow the table to grow as required. This means that the robot’s representation of the observation space can grow as required.

In this model, the observation at time t is determined by clustering states using a SART network as shown in Fig. 4. The sensed state S(t) is first presented to the attentional subsystem of the SART network. The attentional subsystem comprises a set O(t) of cluster prototypes or receptive field centers representing general observations. Initially the set O is empty. When presented with a sensed state, the attentional subsystem compares it to each existing observation and determines the best-matching observation Owinning. The best-matching observation is defined as the observation with the minimum Euclidean distance to the sensed state:

$$
O_{winning} = \arg\min_{O \in \mathbf{O}(t)} \text{dist}(S(t), O)
$$

The best-matching observation is passed to the orienting subsystem of the SART network. The orienting subsystem then computes which observation O(t) should be passed on to the value system and learning processes. If the best-matching observation is within some distance ρ of the sensed state (called the vigilance constraint), it is updated to shift it closer to the sensed state. That is, each of its elements oL is modified using the update equation:

oL(t) = oL(t–1) + β(sL(t–1) – oL(t–1))

where β is the learning rate of the SART network.



Fig. 4. As the first stage of the value system, sensed states are clustered using a Simplified-ART network. Observations thus represent state clusters.

The updated observation Owinning-updated is used to compute the reward value and passed to the learning module. Otherwise, if the best-matching observation does not satisfy the vigilance constraint, a new observation Onew is created and passed on. The new observation uses the attributes s1(t), s2(t), s3(t), … sL(t), … of the current sensed state as its elements. This means that the SART network is never randomized. In summary:

$$
O(t) = \begin{cases}
O_{winning\text{-}updated} & \text{if } \text{dist}(S(t), O_{winning}) \le \rho \\
O_{new} & \text{otherwise}
\end{cases}
$$

The vigilance constraint ensures that the SART network remains stable enough to guard against expansion caused by noisy sensor data, or simply the size and complexity of the state space, but flexible enough to generate new observations when required. The reward signal, learning update and action-selection functions are the same as those used for Q-MRL (Equations 2, 3 and 4, respectively).
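The attentional and orienting steps described above amount to a nearest-prototype clusterer with a vigilance test. A minimal sketch (class and method names are this sketch’s own):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the SART clustering stage used by SART-MRL (Section 4.3).
// Sensed states are mapped to the nearest stored observation prototype;
// prototypes are shifted towards matching states, and a new prototype is
// created whenever no existing one satisfies the vigilance constraint.
public final class SartClusterer {
    private final List<double[]> prototypes = new ArrayList<>();  // the set O of observations
    private final double rho;    // vigilance threshold
    private final double beta;   // SART learning rate

    public SartClusterer(double rho, double beta) {
        this.rho = rho;
        this.beta = beta;
    }

    /** Maps a sensed state S(t) to an observation O(t), growing the network if needed. */
    public double[] observe(double[] state) {
        double[] winner = null;
        double best = Double.POSITIVE_INFINITY;
        for (double[] p : prototypes) {                  // attentional subsystem
            double d = dist(state, p);
            if (d < best) { best = d; winner = p; }
        }
        if (winner != null && best <= rho) {             // orienting subsystem: vigilance test
            for (int L = 0; L < winner.length; L++) {
                winner[L] += beta * (state[L] - winner[L]);   // shift prototype towards the state
            }
            return winner;
        }
        double[] created = state.clone();                // new observation from the raw state
        prototypes.add(created);
        return created;
    }

    private static double dist(double[] a, double[] b) { // Euclidean distance
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```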

4.4. Evaluating Cyclic Behavior in Robots

Natural systems may exhibit many different behavior cycles varying in length from minutes to days or months. Likewise, the model presented in this paper aims to achieve robots capable of multiple behavior cycles of different lengths and durations at different times. To identify and characterize these behaviors, this section uses visualizations and numerical analysis based on point-cloud matrices (Tang et al., 2008; Merrick, 2009).

4.4.1. Point-Cloud Visualizations of Cyclic Behavior

A robot’s posture at any time t can be characterized by its sensed state S(t). Using an attribute-based, vector representation of sensed states as in Equation 5, a point-cloud visualization of a robot’s behavior can be constructed by computing the Euclidean distance dist(S(t), S(t′)) between pairs of postures at all times t and t′. The intensity of a pixel (t, t′) on the point-cloud diagram is determined by dist(S(t), S(t′)). A darker color indicates more similar postures as shown in Fig. 5. Dark diagonals on the point-cloud matrix thus indicate that the robot is cycling through a sequence of similar postures.

Point-cloud matrices provide a useful qualitative technique for visualizing the emergent behavior of robots and identifying time periods of interest. This technique is, however, limited by the amount of data that can be displayed on a screen or page. In general it is only effective for visualizing short behavioral sequences. In contrast, the numerical techniques presented in the next section can be used to characterize the behavior of a robot over long periods.



Fig. 5. Point-cloud matrix for a fragment of robot data. Diagonals indicate cycle behavior. This example shows a three-posture cycle: S(804) ≈ S(801), S(805) ≈ S(802), S(806) ≈ S(803), S(807) ≈ S(804), and so on.
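The point-cloud matrix itself is simply the matrix of pairwise Euclidean distances between the robot’s postures. A minimal sketch, with illustrative names:

```java
// Minimal sketch: the point-cloud matrix of Section 4.4.1 is the matrix of
// pairwise Euclidean distances between the robot's postures (sensed states).
public final class PointCloud {

    /** postures[t] is the sensed-state vector S(t); returns dist(S(t), S(t')). */
    public static double[][] matrix(double[][] postures) {
        int n = postures.length;
        double[][] m = new double[n][n];
        for (int t = 0; t < n; t++) {
            for (int tp = 0; tp < n; tp++) {
                m[t][tp] = dist(postures[t], postures[tp]);
            }
        }
        return m;
    }

    static double dist(double[] a, double[] b) {   // Euclidean distance between two postures
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```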

4.4.2. Numerical Analysis of Cyclic Behavior

Previous work with point-cloud visualizations of robot behavior (Merrick, 2009) has provided a useful means of qualitative analysis. However, difficulties arise when robots have long lifetimes that cannot be adequately visualized on a screen or page. Numerical analysis of the point-cloud matrices provides a complementary approach for quantifying the cyclic behavior of robots. Of particular interest are the duration, number and length of cycles. These give an indication of the stability, variety and complexity of the robot’s behavior. This provides a way to understand the tradeoff between exploratory behavior and exploitation of learned cyclic behavior. The numerical analysis described by Tang et al. (2008) is adapted for robots in the following sections.

Identifying Cyclic Behavior

In the example in Fig. 5, cycles can be identified by analyzing unbroken sequences of ‘dark’ pixels that form diagonals. A ‘dark’ pixel is defined as one where dist(S(t), S(t′)) < ρ′. For example, by inspection of Fig. 5, S(804) ≈ S(801), S(805) ≈ S(802), S(806) ≈ S(803), S(807) ≈ S(804), …, S(816) ≈ S(813). This cycle starts at t1 = 804 and ends at t2 = 816. It thus has a duration of thirteen time-steps and a length of three postures.

Formally, using a point-cloud matrix, a behavior cycle B is a sequence of posture-pairs (t, t′) such that for all t1 ≤ t ≤ t2, dist(S(t), S(t′)) < ρ′ and t′ = t – NB. NB is the length of the cycle. The duration of the cycle is DB = t2 – t1 + 1. The sequence of posture-pairs must be repeated in its entirety at least once (i.e. DB > NB) and cannot be a multiple of another shorter cycle (i.e. there is no N < NB for which dist(S(t), S(t′)) < ρ′ and t′ = t – N for all t1 ≤ t ≤ t2).
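This definition can be turned into a direct scan of the point-cloud matrix: for each candidate length N, find unbroken runs of time steps whose posture matches the posture N steps earlier. The sketch below is a simplified reading of the definition (it omits the multiple-of-a-shorter-cycle check), and its names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of cycle identification on a point-cloud matrix (Section 4.4.2).
// A cycle of length N is an unbroken run of time steps t whose posture is
// within rhoPrime of the posture N steps earlier, repeated at least once in
// full (duration > N). The multiple-of-a-shorter-cycle check described in
// the text is omitted here for brevity.
public final class CycleFinder {

    public record Cycle(int start, int end, int length) {
        public int duration() { return end - start + 1; }
    }

    public static List<Cycle> find(double[][] pointCloud, double rhoPrime, int maxLength) {
        List<Cycle> cycles = new ArrayList<>();
        int T = pointCloud.length;
        for (int n = 1; n <= maxLength; n++) {              // candidate cycle length NB
            int runStart = -1;
            for (int t = n; t <= T; t++) {
                boolean dark = t < T && pointCloud[t][t - n] < rhoPrime;  // 'dark' pixel
                if (dark && runStart < 0) runStart = t;
                if (!dark && runStart >= 0) {               // run of dark pixels just ended
                    int duration = t - runStart;
                    if (duration > n) cycles.add(new Cycle(runStart, t - 1, n));
                    runStart = -1;
                }
            }
        }
        return cycles;
    }
}
```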

Behavioral Stability

The stability ζ of a robot’s behavior over a time period of length T is the total duration of all behavior cycles in the period, divided by the length of the period:

$$
\zeta_T = \frac{\sum_{B \in T} D_B}{T}
$$

This gives a theoretical stability value normalized between zero (fewer cycles/shorter durations) and one (many cycles/longer durations). In practice, behavior cycles may overlap, with the last few actions of one cycle also included in the start of a new cycle. This can lead to stability values slightly higher than one in some applications.

A higher stability value is generally desirable as it indicates that the robot is exploiting learned behavioral cycles more productively than a robot with a lower stability value. However, a very high stability value may also mean that the robot is not exploring adequately to learn a range of different behaviors. Such a value might occur, for example, for a robot maintaining a particular posture for a long period. Stability values should thus not be considered in isolation, but rather in conjunction with statistics describing the number and length of cycles. This gives an additional indication of the variety and complexity of the robot’s behavior. This combined approach is used in the following section.
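Given the cycles identified above, the stability of a period is simply the summed cycle durations divided by the period length. A short sketch, reusing the hypothetical CycleFinder.Cycle type from the previous sketch:

```java
import java.util.List;

// Sketch of the behavioral stability metric (Section 4.4.2): total duration
// of all cycles in a period divided by the period length. Values near one
// indicate that the robot spends most of the period inside learned cycles.
public final class Stability {
    public static double of(List<CycleFinder.Cycle> cycles, int periodLength) {
        double totalDuration = 0.0;
        for (CycleFinder.Cycle c : cycles) totalDuration += c.duration();
        return totalDuration / periodLength;   // may slightly exceed one when cycles overlap
    }
}
```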

5. Biocycles in Lego Critter-bots

This section presents experiments with each of the MRL algorithms described in Section 4 on four Lego Mindstorms NXT ‘critter-bot’ robots. ‘Critter’ forms such as insects and snails were selected for the robots to reflect the focus of this paper on learning low-level biocycles for movement, rather than cognitive cycles for higher level activities. Several different robots were used to show that the MRL algorithms can enable the robots to adapt behaviors relevant to their different physical forms.

5.1. Experimental Setup

The critter-bots are shown in Fig. 6. The first, shown in Fig. 6(a), is a snail with a single motor controlling the height of its antennas. The snail can sense only whether the motor is moving or not. This critter-bot is thus relatively noise free, aside from the delay in data sent from the robot to the PC. It thus represents a control application for our experiments. The movement reading is an enumerated set where 0 means the motor is stopped, 200 means the motor is moving forwards and 100 means the motor is moving backwards. An example sensed state for the snail is: S(12) = (motor:200.0)


Fig. 6. Four critter-bots: (a) a snail with a single motor connected to the intelligent brick; (b) a bee with a motor and color sensor; (c) a cricket with a motor and ultrasonic sensor; (d) an ant with a motor and accelerometer.

The second critter-bot, shown in Fig. 6(b), is a bee with a motor and color sensor. The motor allows the bee to turn its color sensor ‘head’ through 90° (45° to both the left and right). The bee can sense whether the motor is moving or not, and red, blue and green color intensities in the direction the color-sensor is pointing. The movement reading is the same enumerated set used in the snail. Color intensity readings range between 0 and 255. An example sensed state for the bee is: S(226) = (motor:100.0, red:0.0, green:80.0, blue:0.0)

The bee was placed between two color panels, one red and the other green, as shown in Fig. 6(b). The color-sensor has a limited range so color intensity readings are higher at the far left and right when the bee’s head is closest to the panels.


Fig. 6(c) shows the third critter-bot, a cricket, with a motor and ultrasonic distance sensor. As with the bee, the motor allows the cricket to turn its ultrasonic sensor ‘head’ through 45° to both the left and right. The cricket can sense whether the motor is moving or not, and eight ping values describing the distance of any object in the direction the ultrasonic sensor is pointing. The cricket was placed in a corner such that it was further from one wall than from the other. An example sensed state for the cricket is: S(20) = (motor:100.0, dist0:41.0, dist1:48.0, dist2:161.0, dist3:166.0, dist4:0.0, dist5:0.0, dist6:0.0, dist7:0.0)

Finally, the fourth critter-bot in Fig. 6(d) is an ant with a motor and accelerometer. The motor moves the ant’s legs, which can grip the surface it is on and propel the robot forwards or backwards. The ant can sense whether the motor is moving or not, and six values from the accelerometer. Three of these describe its acceleration in three dimensions and three describe its tilt from the horizontal in the same dimensions. These values range from 0 to 255. An example sensed state for the ant is: S(20) = (motor:100.0, xacc:1.0, xtilt:0.0, yacc:3.0, ytilt:0.0, zacc:19.0, ztilt:4.0)
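For all four critter-bots, a sensed state is treated as a fixed-order vector of readings so that postures can be compared by Euclidean distance, as required by NN-MRL, SART-MRL and the point-cloud analysis. The small example below uses the ant’s S(20) from the text together with a hypothetical second posture invented purely for illustration:

```java
// Illustrative only: a critter-bot's sensed state is a fixed-order vector of
// its readings, so the distance-based models (NN-MRL, SART-MRL) and the
// point-cloud analysis can compare postures by Euclidean distance.
public final class SensedState {

    public static double dist(double[] a, double[] b) {   // Euclidean distance between postures
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // S(20) for the ant, as given in the text: motor, xacc, xtilt, yacc, ytilt, zacc, ztilt
        double[] s20 = { 100.0, 1.0, 0.0, 3.0, 0.0, 19.0, 4.0 };
        // A hypothetical later posture, for illustration only:
        double[] later = { 0.0, 0.0, 0.0, 0.0, 0.0, 18.0, 4.0 };
        System.out.println("dist = " + dist(s20, later));
    }
}
```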

All the critter-bots have three actions available in any state: A1 move the motor forward at a fixed speed, A2 move the motor backwards at a fixed speed, and A3 stop the motor. All four bots are also designed such that there are no limitations on the rotation of the motor. For example, continuous forward (or backward) motion of the motor on the bee will oscillate the head between its left and right extremes. This sort of physical design is important for intrinsically motivated robots that can experiment with their physical structure. Without this design, the robots could damage themselves unless additional logic is pre-programmed to prevent over-extension of their joints. (For example, in natural systems this ‘additional logic’ is a pain response.)

The three MRL algorithms were implemented in Java. Due to the virtual machine and memory limitations of the Mindstorms NXT intelligent brick, the algorithms are run on a PC. The intelligent brick connected to each critter-bot (shown in Fig. 6(a)) receives commands from the PC via Bluetooth and triggers the motor. Sensor data is returned to the PC via Bluetooth. Each algorithm was run five times for 2,000 time-steps (approximately ten minutes) on each robot. Experimental parameters and their values are shown in Table 1. Values were generally chosen based on the findings of previous MRL work (Merrick, 2008b; Merrick & Huntington, 2008; Merrick & Maher, 2009a) although some tuning was done to select appropriate validation and SART thresholds.

Data was collected describing the number and length of cycles learned by each robot. The aim of this experiment is to characterize the structure of the behavior achieved by each robot and measure the overall stability of their behavior. This provides insight into the capacity of each algorithm to focus attention and learn achievable behavior cycles. The results in the following section also include case by case descriptions to provide insight into the potential usefulness and appropriateness of learned behaviors.
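The PC-side experimental loop described above can be sketched as follows. RobotLink is a placeholder for whatever Bluetooth wrapper carries commands and sensor data between the PC and the NXT intelligent brick; it is an assumption of this sketch, not a leJOS or Mindstorms API, and the agent shown is the QMrl sketch from Section 4.1.

```java
// Sketch of the PC-side experimental loop (Section 5.1). RobotLink stands in
// for whatever Bluetooth wrapper carries commands and sensor data to and from
// the NXT intelligent brick; it is a placeholder for this sketch, not a real API.
public final class CritterBotExperiment {

    static final int A1_FORWARD = 0, A2_BACKWARD = 1, A3_STOP = 2;  // the three actions

    interface RobotLink {
        double[] readSensors();        // sensed state S(t), returned over Bluetooth
        void sendAction(int action);   // trigger the motor command on the brick
    }

    // One run: 2,000 time-steps, as in the experimental runs described above.
    public static void run(RobotLink robot, QMrl agent) {
        double[] prevState = null;
        int prevAction = A3_STOP;
        for (int t = 0; t < 2000; t++) {
            double[] state = robot.readSensors();
            String obs = java.util.Arrays.toString(state);     // table key for Q-MRL
            if (prevState != null) {
                String prev = java.util.Arrays.toString(prevState);
                double r = agent.reward(prev, prevAction, obs);
                agent.update(prev, prevAction, r, obs);
            }
            int action = agent.selectAction(obs);
            robot.sendAction(action);
            prevState = state;
            prevAction = action;
        }
    }
}
```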

Table 1. Summary of experimental parameters and their values

Symbol   Description                                      Q-MRL   NN-MRL   SART-MRL
ε        Cycle forgetting threshold                       0.01    0.01     0.01
αL       Q-Learning rate                                  0.9     0.9      0.9
αF       Q-Forgetting rate                                0.1     0.1      0.1
γ        Discount factor                                  0.9     0.01     0.9
β        SART learning rate                               n/a     n/a      0.1
ρ        Validation threshold                             n/a     0.2      0.2
ρ'       Validation threshold for numerical analysis ¹    0.1     0.1      0.1

¹ ρ′ was chosen to be less than ρ to ensure none of the algorithms is advantaged by similarity to the measurement technique.


5.2. Results and Discussion

Fig. 7. charts the behavioral stability for the three algorithms on each robot. In the snail, behavioral stability is close to 1 for all algorithms. This means that the snail is engaged in stable behavior cycles for close to 100% of its lifetime, regardless of the algorithm used. This is an important result as it indicates that both of the function approximation algorithms (NN-MRL and SART-MRL) show comparable performance to standard Q-MRL in a noise-free setting.

Fig. 7. Average behavioral stability for each algorithm on each robot. Results show the 95% confidence interval.

Inspection of the log-files and segments of the point-cloud matrices for the snail reveals some of the actual behavior cycles learned. The snail using Q-MRL, for example, learned a number of different behaviors for raising and lowering its antennae. Fig. 8(a) shows one such behavior corresponding to a six-posture cycle resulting from the actions: A1, A2, A3, A2, A1, A3… Another shorter cycle resulted from the actions A1, A2, A3… while another longer cycle combined the actions A2, A1, A2, A3, A1, A3… In the context of a very simple artificial critter each of these behavior cycles is appropriate to the snail. Each cycle represents a different way that the snail can raise and lower its antennae to different heights and angles. The snail adapts new cycles over time and is continually learning new ways to manipulate its antennae.

On the remaining three, more complex, critter-bots, Fig. 7. shows that SART-MRL has significantly higher stability when compared to Q-MRL. This is expected, as SART-MRL has an advantage in the more complex robots because it can generalize over the state space of the robot. Stability results for NN-MRL compared to Q-MRL are statistically ambiguous, but do tend to be higher than Q-MRL.

Fig. 9. charts the median and maximum cycle lengths for each algorithm on each robot. The snail shows statistically higher median and maximum cycle lengths learned by the Q-MRL algorithm than by the two function approximation approaches. This is expected as in the relatively noise-free snail, Q-MRL is advantaged by its ability to distinguish between more unique states. This means that it can develop longer cycles that change between a larger number of postures. In the bee, the cricket and ant, however, SART-MRL shows significantly higher median and maximum cycle length than Q-MRL. In these robots the much larger state space means that Q-MRL is disadvantaged by being able to distinguish between all states as learning is more time and memory intensive. In contrast, SART-MRL is advantaged by its ability to generalize over the state space.



The bee using NN-MRL exhibits an unusually high maximum cycle length (albeit with a high standard deviation). NN-MRL has the greatest capacity of the three algorithms to generalize over the state space. NN-MRL generalizes more than SART-MRL because in NN-MRL the entire state space must be represented by a fixed size neural network. In SART-MRL the network can grow to encompass new stimuli. In the bee, the ability of NN-MRL to generalize and ignore sensor noise permits it to (occasionally) achieve long periods of high behavioral stability (see Fig. 7.) in which the bee oscillates its head between the far left and right. The same behavioral trait is not exhibited in the cricket because the ultrasonic sensor only detects a distance change near the centre of the robot’s visual range and thus requires an algorithm with a lower level of generalization to achieve stable behavior.


Fig. 8. Behavior cycles learned by (a) a snail using Q-MRL; (b) a bee using Q-MRL; (c) a bee using SART-MRL; (d) an ant using SART-MRL.

To illustrate the different characteristic behavior generated by the two algorithms, Fig. 8(b) and (c) show behavior cycles learned by the bee using Q-MRL and SART-MRL. The bee using Q-MRL tended to learn rapid head movements in the low color intensity (and thus low noise) region between the two color panels. For example, Fig. 8(b) shows a two-posture cycle resulting from repetition of the actions A1, A2… in which the bee’s head moved from left-to-right very quickly. In contrast, the bees using NN-MRL and SART-MRL learned longer behavior cycles for turning the head through its full range. For example, Fig. 8(c) shows a fourteen-posture sequence learned by a bee using SART-MRL. This bee moved its head more slowly through the full 90° range from left to right by repeating the actions A1, A3…. The structured, oscillating color sensor readings for this period are shown in Fig. 10.



Fig. 9. (a) Median cycle length and (b) maximum cycle length for each algorithm on each robot. This gives an indication of the behavioral complexity possible using each algorithm.

Fig. 10. Color sensor readings for the same time period as Fig. 8(c). The bee using SART-MRL has learnt to control its head to oscillate between the red and green panels.

The cricket also learned a similar set of head oscillation behaviors. Two examples are shown in Fig. 11. From t=600–620 there is a small oscillation in the distance readings as the cricket moves its head left to right through a small angle, slightly changing the distance to the near wall. From t=650 onwards, there is a larger oscillation in the distance readings as the cricket moves its head through a larger angle between the near and far walls.


Fig. 11. Ultrasonic sensor readings by one of the crickets using SART-MRL. Two of the eight ping values return readings. The cricket has learnt to control its head to oscillate between the near and far walls.
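The periodicity visible in Figs. 10 and 11 can also be checked numerically from a single sensor trace. The sketch below estimates the dominant oscillation period of a one-dimensional reading sequence from its autocorrelation peak; this is only an illustrative check (the function name and the alternating test signal are assumptions), not the metric used to produce the stability and cycle-length results above.

```python
import numpy as np

def dominant_period(readings, min_lag=2):
    """Estimate the dominant oscillation period of a 1-D sensor trace
    (e.g. ultrasonic distance readings) from its autocorrelation peak."""
    x = np.asarray(readings, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode='full')[x.size - 1:]   # lags 0 .. n-1
    if acf[0] == 0:
        return None                                      # flat trace: no oscillation
    acf = acf / acf[0]
    return min_lag + int(np.argmax(acf[min_lag:]))

# Example: a noisy alternating trace, as when the head swings between the
# near and far wall every five time steps (a full period of ten steps).
t = np.arange(200)
trace = 30 + 20 * ((t // 5) % 2) + np.random.normal(0, 1, t.size)
print(dominant_period(trace))   # approximately 10
```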

The ant was perhaps the most interesting of the critter-bots as it was able to learn to ‘walk’ motivated only by the cycle-based value system. To achieve a structured behavior cycle, the walk was somewhat jerky, with the ant learning to combine a sequence of ‘move-forward’ and ‘stop-motor’ actions. This behavior cycle is shown in Fig. 8(d). All these qualitative results are particularly interesting as they show that robots can learn basic joint manipulation behaviors such as ‘head movement’ and ‘walking’ without being specifically programmed to learn those behaviors.

In addition, even though each robot uses the same cycle-based value system, each robot learns different cycles in response to its physical structure. This can be seen numerically from the results in Fig. 12, which charts the number of unique cycles of any duration DB > NB learned by each algorithm. When the algorithms are used to control the bee, for example, more cycles are learned than when they control the snail. This is a response to the more complex state space of the bee. While none of the algorithms clearly results in more or less behavioral variety than the others on any given robot, when considered in conjunction with the other results we can conclude that the cycles learned by the robots using SART-MRL and NN-MRL tend to be longer and to have longer durations.

Fig. 12. Number of unique cycles of any duration learned by each algorithm (Q-MRL, NN-MRL and SART-MRL) on each robot (snail, bee, cricket and ant). This gives an indication of the behavioral variety possible using each algorithm and of the difference in behavioral variety between robots with different physical forms.
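A count like the one in Fig. 12 can be expressed compactly once cycles have been extracted from the logs. The sketch below assumes each detected cycle is available as an (action sequence, duration) pair and counts distinct cycles whose duration DB exceeds the threshold NB; the rotation-invariant comparison and the function names are illustrative assumptions rather than the exact bookkeeping behind the figure.

```python
def canonical(cycle):
    """Rotation-invariant key for a behavior cycle, so that e.g. ('A1','A2','A3')
    and ('A2','A3','A1') count as the same cycle. (Assumed convention.)"""
    cycle = tuple(cycle)
    rotations = [cycle[i:] + cycle[:i] for i in range(len(cycle))]
    return min(rotations)

def unique_cycle_count(cycles, min_duration):
    """Count distinct cycles whose duration DB exceeds the threshold NB.

    `cycles` is an iterable of (action_sequence, duration) pairs.
    """
    return len({canonical(seq) for seq, duration in cycles if duration > min_duration})

# Hypothetical log: three detected cycles, two of which are the same cycle
# started from different postures and one of which is too short-lived to count.
observed = [(('A1', 'A2', 'A3'), 12),
            (('A2', 'A3', 'A1'), 9),
            (('A1', 'A2'), 4)]
print(unique_cycle_count(observed, min_duration=5))   # 1
```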


6. Conclusion and Future Work

The algorithms in this paper are a step towards developmental robots that can select their own learning goals autonomously using a value system, and learn to achieve those goals. Likewise, the numerical metrics are a step towards consistent evaluation metrics for developmental robots. The experimental results presented are promising because they show that structured behavior tailored to the physical form of a robot can emerge using a generic, cycle-based value system. In particular, the SART-MRL model shows promise as an approach that can learn structured behavior cycles in the noisy environments encountered by robots. A number of directions remain for future work developing this model.

6.1. Multilayer Cycle-Based Value Systems

The models presented in this paper focus on the development of ‘pseudo-biological’ behavior cycles so robots can learn behavior adapted to their physical structure. In future, however, further study of different biological cycles, as well as cognitive and social cycles, also promises advantages for developmental robots. For example, potential applications for models of cognitive behavior cycles include robots capable of creative behavior, such as tool use, and adaptive behavior, such as fault tolerance. Potential applications for models of social cycles may include evolutionary behavior for self-advancement of robotic societies. In the long term, multilayer, cycle-based value systems may motivate biological, cognitive and social cycles to permit the design of robots with a range of these capabilities.

6.2. Combining Value Systems with Other Learning Algorithms

This paper combines value systems with ‘flat’ RL. That is, the learning algorithm maintains a single policy describing the current behavior cycle being learned and exploited by the robot. Previously learned cycles are forgotten. This means that the behavior of the robot is always adapted to its current situation and there is no possibility of out-of-date policies being stored. However, in certain situations the ability to recall and reuse learned policies can be an advantage. This is the case if the robot is likely to encounter similar situations in the future. Thus, other RL approaches – such as hierarchical RL models – that can recall and reuse learned policies may be combined with function approximation and value systems to create adaptive robots with memory of past experiences. Furthermore, value systems may be combined with machine learning approaches beyond RL. For example, in situations where trial-and-error learning is inappropriate, combining a value system with a supervised learning approach would create robots that can select their own goals and learn by mimicking other robots or even humans.

6.3. Evaluating Robots with Value Systems

The results in this paper show that structured behavior cycles can emerge using a generic value system and RL. However, the proposed numerical metrics do not give an indication of the ‘intelligence’, ‘usefulness’ or ‘correctness’ of this behavior. Thus there is still a role to be played by qualitative, domain-specific analysis of developmental robots, such as the case-by-case descriptions used in this paper. As more complex value systems are developed, more complex metrics will be required to characterize the performance of robots using them. Formalizing such approaches also remains an area for future work.

6.4. Beyond Critter-Bots

Finally, this paper has considered four simple robotic platforms to analyze the proposed algorithms. Further exploration of the sensitivity of the algorithms to different parameter values, and of their performance in more complex robots, is an area for future work. Directions of particular interest include the design of more complete critters with more complex sensory systems and actuator sets. This will improve not only our understanding of developmental robots, but also our understanding of the role of value systems in natural and artificial agents.


Acknowledgement

This research was supported by a UNSW@ADFA Rector’s Staff Start Up Grant.


References

Ahlgren, A., & Halberg, F. (1990). Cycles of nature: an introduction to biological rhythms. Washington DC: National Teachers Association.
Arkin, R. (1998). Behavior-based robotics. Cambridge, Massachusetts: The MIT Press.
Baraldi, A., & Alpaydin, E. (1998). Simplified ART: a new class of ART algorithms (Technical Report, TR 98004). Berkeley, CA: International Computer Science Institute.
Brignone, L., & Howarth, M. (2003). ART-R: a novel reinforcement learning algorithm using an ART module for state representation. Paper presented at the IEEE Workshop on Neural Networks for Signal Processing, pp 829–837.
Brown, F. A. (1954). Persistent activity rhythms in the oyster. American Journal of Physiology, 178(3):510–514.
Chavanne, T. (2003). Variation in risk taking behavior among female college students as a function of the menstrual cycle. Evolution and Human Behavior, 19(1):27–32.
Coulom, R. (2002). Reinforcement learning using neural networks, with applications to motor control. PhD thesis, Institut National Polytechnique de Grenoble.
Csikszentmihalyi, M. (1996). Creativity: flow and the psychology of discovery and invention. New York: Harper Collins.
Dunlap, J., Loros, J., & DeCoursey, P. (2003). Chronobiology: biological timekeeping. Sinauer Associates.
Ehrlich, P. R., & Raven, P. H. (1964). Butterflies and plants: a study in coevolution. Evolution, 18:584–608.
Forbes, K., & Fiume, E. (2005). An efficient search algorithm for motion data using weighted PCA. Paper presented at the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Los Angeles, CA, pp 67–76.
Fritzke, B. (1995). Incremental learning of local linear mappings. In Proceedings of the International Conference on Artificial Neural Networks, pp 217–222.
Friston, K., Tononi, G., Reeke, G., Sporns, O., & Edelman, G. (1994). Value-dependent selection in the brain: simulation in a synthetic neural model. Neuroscience, 59:229–243.
Geen, R. G., Beatty, W. W., & Arkin, R. M. (1984). Human motivation: physiological, behavioral and social approaches. Massachusetts: Allyn and Bacon, Inc.
Gero, J. S. (1992). Creativity, emergence and evolution in design. Second International Roundtable Conference on Computational Models of Creative Design, Sydney, pp 1–28.
Herrmann, J., Pawelzik, K., & Geisel, T. (2000). Learning predictive representations. Neurocomputing, 32–33:785–791.
Huang, X., & Weng, J. (2002). Novelty and reinforcement learning in the value system of developmental robots. Paper presented at the Second International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, Edinburgh, Scotland, pp 47–55.
Kaplan, F., & Oudeyer, P.-Y. (2003). Motivational principles for visual know-how development. Paper presented at the Third International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, Lund University Cognitive Studies, pp 73–80.
Kolb, D. A., Rubin, I. M., & McIntyre, J. M. (Eds.). (1984). Organizational psychology: readings on human behavior in organizations. Englewood Cliffs, NJ: Prentice-Hall.
Kovar, L., & Gleicher, M. (2004). Automated extraction and parametrization of motions in large data sets. Paper presented at ACM SIGGRAPH 2004, Los Angeles, CA.
Laird, J., & van Lent, M. (2000). Interactive computer games: human-level AI's killer application. Paper presented at the National Conference on Artificial Intelligence (AAAI), pp 1171–1178.
Li, B., & Holstein, H. (2002). Recognition of human periodic motion – a frequency domain approach. Paper presented at the Sixteenth International Conference on Pattern Recognition, Washington DC, pp 311–314.
Lungarella, M., Metta, G., Pfeifer, R., & Sandini, G. (2003). Developmental robotics: a survey. Connection Science, 15(4):151–190.
Mac Namee, B., Dobbyn, S., Cunningham, P., & O'Sullivan, C. (2003). Simulating virtual humans across diverse situations. Paper presented at Intelligent Virtual Agents, 4th International Workshop (IVA 2003), Kloster Irsee, Germany, pp 159–163.
Marshall, J., Blank, D., & Meeden, L. (2004). An emergent framework for self-motivation in developmental robotics. Paper presented at the Third International Conference on Developmental Learning, San Diego, CA, pp 104–111.
Marsland, S., Nehmzow, U., & Shapiro, J. (2000). A real-time novelty detector for a mobile robot. Paper presented at the EUREL European Advanced Robotics Systems Masterclass and Conference.
Merrick, K. (2008a). Designing toys that come alive: curious robots for creative play. Paper presented at the Seventh International Conference on Entertainment Computing (ICEC 2008), Carnegie Mellon University, pp 149–154.
Merrick, K. (2008b). Modelling behaviour cycles for life-long learning in motivated agents. Seventh International Conference on Simulated Evolution and Learning (SEAL 2008), Melbourne, Australia: LNCS, Springer, pp 1–10.
Merrick, K. (2009). Evaluating intrinsically motivated robots using affordances and point-cloud matrices. The Ninth International Conference on Epigenetic Robotics, Venice, Italy, pp 105–112.
Merrick, K., & Huntington, E. (2008). Attention focus in curious, reconfigurable robots. Paper presented at the Australian Conference on Robotics and Automation, ANU, Canberra, Australia (CD, no page numbers).
Merrick, K., & Maher, M. L. (2009a). Motivated learning from interesting events: adaptive, multitask learning agents for complex environments. Adaptive Behavior, 17(1):7–27.
Merrick, K., & Maher, M. L. (2009b). Motivated reinforcement learning: curious characters for multiuser games. Berlin: Springer-Verlag.
Merrick, K., & Shafi, K. (2009). Agent models for self-motivated home assistant bots. International Symposium on Computational Models for Life Sciences, Sofia, Bulgaria (to appear).
Nagai, Y., Asada, M., & Hosoda, K. (2002). Developmental learning model for joint attention. Paper presented at the 15th International Conference on Intelligent Robots and Systems (IROS 2002), pp 932–937.
Nefedov, S. A. (2004). A model of demographic cycles in traditional societies: the case of ancient China. Evolution and History, 3(1):69–80.
Oudeyer, P.-Y., Kaplan, F., & Hafner, V. V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286.
Russell, S., & Norvig, P. (1995). Artificial intelligence: a modern approach. New Jersey: Prentice-Hall.
Singh, S., Barto, A. G., & Chentanez, N. (2005). Intrinsically motivated reinforcement learning. Paper presented at Advances in Neural Information Processing Systems 17 (NIPS), pp 1281–1288.
Sporns, O., Almassy, N., & Edelman, G. (2000). Plasticity in value systems and its role in adaptive behavior. Adaptive Behavior, 8:129–148.
Tang, K.-T., Leung, H., Komura, T., & Shum, H. (2008). Finding repetitive patterns in 3D human motion captured data. Paper presented at the Second International Conference on Ubiquitous Information Management and Communication, Suwon, Korea, pp 396–403.
Thrun, S. (1995). Exploration in active learning. In Handbook of Brain Science and Neural Networks. Cambridge, MA: MIT Press.
Usher, D. (1989). The dynastic cycle and the stationary state. The American Economic Review, 79:1031–1044.
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.
Weng, J., McClelland, J., Pentland, A., Sporns, O., Stockman, I., Sur, M., et al. (2001). Artificial intelligence: autonomous mental development by robots and animals. Science, 291:599–600.
Wever, R. (1973). Human circadian rhythms under the influence of weak electric fields and the different aspects of these studies. International Journal of Biometeorology, 17(220).
Zimecki, M. (2006). The lunar cycle: effects on human and animal behavior and physiology. Postepy Higieny i Medycyny Doswiadczalnej (Online), 60:1–7.