A Model-Based Goal-Directed Bayesian Framework for Imitation Learning in Humans and Machines

Aaron P. Shon† David B. Grimes† Chris L. Baker† Rajesh P. N. Rao† Andrew N. Meltzoff‡

†Department of Computer Science and Engineering, Box 352350, University of Washington, Seattle WA 98195 USA

‡Department of Psychology, University of Washington, Seattle WA 98195 USA

Abstract

Imitation offers a powerful mechanism for knowledge acquisition, particularly for intelligent agents (like infants) that lack the ability to transfer knowledge using language. Several algorithms and models have recently been proposed for imitation learning in humans and robots. However, few proposals offer a framework for imitation learning in noisy stochastic environments where the imitator must learn and act under real-time performance constraints. In this paper, we present a novel probabilistic framework for imitation learning in stochastic environments with unreliable sensors. We develop Bayesian algorithms, based on Meltzoff and Moore’s AIM hypothesis for infant imitation, that implement the core of an imitation learning framework. Our algorithms are computationally efficient, allowing real-time learning and imitation in an active stereo vision robotic head. We present results of both software simulations and our algorithms running on the head, demonstrating the validity of our approach. We conclude by advocating a research agenda that promotes interaction between cognitive and robotic studies of imitation.

Key words: imitation learning, Bayesian inference, forward models, inverse models, probabilistic models, action selection, intention, robotics

Preprint submitted to Elsevier Science, 4 October 2004

1 Introduction: imitation learning in animals and machines

The capacity of human infants to learn and adapt is astonishing. In just a few short years, a child is able to speak, read, write, interact with others, and perform myriad other complex tasks. In contrast, digital computers possess limited capabilities to learn from their environments. The learning they exhibit arises from explicitly programmed algorithms, often tuned for very specific applications. Human children accomplish seemingly effortlessly what the artificial intelligence community has labored for more than 50 years to achieve, with varying degrees of success. One key to solving this puzzle is that children are highly adept at observing and imitating the actions of others. Imitation is a versatile mechanism for transferring knowledge from a skilled agent (the instructor) to an unskilled agent (the observer) using direct demonstration rather than manipulating symbols. Various forms of imitation have been studied in apes [Visalberghy and Fragaszy, 1990], [Byrne and Russon, 2003], [Whiten, 2004], in children (including infants only 42 minutes old) [Meltzoff and Moore, 1977], [Meltzoff and Moore, 1997], and in an increasingly diverse selection of machines [Fong et al., 2002], [Lungarella and Metta, 2003]. The reason behind the growing interest in imitation in the machine learning and robotics communities is obvious: a machine with the ability to imitate has a drastically lower cost of reprogramming than one that requires programming by an expert.

Robotic imitation may allow non-experts to directly instruct robots through demonstration. Imitative robots also offer testbeds for cognitive researchers to test computational theories, and provide modifiable agents for contingent interaction with humans in psychological experiments. From an information-theoretic viewpoint, imitation learning involves transmitting information from the mind of the instructor to the mind of the observer via a noisy communications channel, the environment. One novel computational aspect of imitation learning, in contrast to typical supervised or unsupervised learning paradigms in the machine learning literature, is that attributes of the environment—the characteristics of the channel itself—are manipulated to convey information. The instructor manipulates the state of certain objects in the environment (including possibly the instructor’s own body). The observer must determine what components of environmental state are relevant to performing imitation, then determine what actions the instructor performed (or intended to perform) to achieve a goal state. Finally, the observer must map the actions performed by the instructor into its own coordinate frame so that it can execute actions to achieve the goal state. We consider imitation from three different perspectives:

• Cognitive theories: Cognitive studies of imitation in infants offer an example of a non-linguistic imitation learning system that operates in real time, using massive amounts of noisy data. Cognitive theories suggest constraints for developing imitation learning algorithms. We discuss a theoretical framework suggested by Meltzoff and colleagues based on their results from infant studies.

• Artificial intelligence/reinforcement learning: Recent developments in reinforcement learning have suggested how plan recognition and observation of instructor agents can assist an observer agent in maximizing its expected reward in stochastic environments. We examine two strains of research from the reinforcement learning community.

• Robotics: The robotics community has recently begun adopting ideas from cognitive and developmental studies of imitation. We examine a small sample of these research efforts.

We begin with an overview of some recent cognitive work in infant imitation. Next, we present our probabilistic imitation learning framework. We offer simulation results and examples of a robotic implementation that demonstrate the viability of our framework. We next examine prior work in machine learning and robotics for imitation learning. Finally, we conclude with a proposal for future research, including a program of interactive study between cognitive psychology and machine learning.

Throughout this paper, we concentrate on developing probabilistic algorithms, that is, algorithms that maintain probability distributions rather than single values to represent variables of interest. Our emphasis on a probabilistic approach stems from the growing number of successful demonstrations of probabilistic algorithms in a variety of real-world domains [Rubin, 1988], [Isard and Blake, 1998], [Gordon et al., 1993], [Thrun et al., 1999], [Schiele and Pentland, 1999], [Olshausen and Field, 1996], [Schulz et al., 2003], [Patterson et al., 2003], [Spirtes et al., 2001].

Fig. 1. Neonates and infants are capable of imitating body movements: (a) Tongue protrusion, opening the mouth, and lip protrusion are imitated by neonates (from [Meltzoff and Moore, 1977]). (b) A 14-month-old child replays an observed sequence required to cause a pad to light up (from [Meltzoff, 1988b]). The precise action needed to obtain the goal state of lighting up the pad is unlikely to be discovered by the child through random exploration of actions.

2 Cognitive theories of imitation

Although non-human primates demonstrate some imitative capability, their abilities are extremely limited in comparison to those of human children (and adults). For this reason, we focus our view of cognitive theories of imitation on data from human children.

A number of imitative behaviors are observed in human infants, including: manual imitation, where infants imitate hand gestures; vocal imitation, where infants replicate utterances of others; and facial imitation, where infants replicate the facial gestures of others. Facial imitation is particularly interesting because (in the absence of mirrors or similar devices) infants are unable to directly match the visual appearance of a teacher’s face with their own appearance. Such imitation demands a certain degree of cross-modal processing, which we highlight below.

2.1 Stages of imitative behavior

A number of psychological theories of imitation have been proposed. Experimental evidence obtained by Meltzoff and colleagues indicates a gradual progression of infants’ imitative capabilities, beginning with exploratory movements of muscle groups and progressing to imitating bodily movements, imitating actions performed on objects, and finally inferring intent during imitation. Meltzoff and Moore [Meltzoff and Moore, 1997] identified ten characteristics of early infant imitation:

(1) Infants imitate a wide range of acts
(2) Imitation is specific to a particular motor system (e.g., tongue protrusion evokes tongue protrusion, not lip protrusion)
(3) Newborn infants imitate
(4) Activation of appropriate body parts for imitation is fast
(5) Infants correct for mistakes made in imitation
(6) Novel acts can be imitated
(7) Targets that are not present in the infant’s field of view can be imitated
(8) Static gestures can be imitated
(9) Infants know when they are being imitated
(10) As infants develop, their imitative capabilities change

Any system that attempts to capture the imitative capabilities of infants must address these characteristics. The ability of newborns as young as 42 minutes to imitate ([Meltzoff and Moore, 1983] and see Fig. 1(a)) suggests that imitation is inborn; it is a biologically-endowed capability universal among typically developing children.

With that said, it is also clear that newborns do not begin life with the ability to perform imitative acts of arbitrary complexity. Neonates are capable of imitating facial gestures and gross bodily movements; infants gradually acquire more advanced imitative capabilities [Meltzoff and Moore, 1997]. Based on the developmental discoveries enumerated below, we suggest that the fundamental algorithms used by infants to perform imitation do not change significantly over time. Rather, more complex imitation occurs because infants gradually learn more complex models of their environment (and other agents in that environment). Gradual acquisition of more complex models for interacting with the environment, in turn, argues that these environmental models are distinct from one another (and possibly hierarchical).

2.1.1 The nature of early imitation

Meltzoff and Moore [Meltzoff and Moore, 1997] enumerate three major conclusions supported by evidence from infant imitation. First, imitation is representationally mediated. That is, imitation is not directly transduced by virtue of specialized connections between, say, the visual system and the motor system in an infant’s brain. Rather, some internal representation of the states of the instructor is stored in the infant’s mind. This is demonstrated by studies [Meltzoff and Moore, 1994], [Meltzoff and Moore, 1992] where infants were capable of deferred imitation, reproducing behaviors shown 24 hours or more before testing. Second, imitation is goal directed. Infants do not slavishly reproduce the motor sequences shown by adult instructors—they only partially match the motor patterns presented by an instructor [Meltzoff and Moore, 1977],

[Meltzoff and Moore, 1994]. Early in the imitative process, infants commit “creative” errors, i.e., their errors seem designed to reproduce the end state shown by the instructor, but the infants use novel body part configurations to reach this end state. For example, one experiment [Meltzoff and Moore, 1994] showed that, after observing an adult sticking his tongue out to the side, 70% of infants responded by sticking their own tongues out straight and turning their heads so as to produce a new type of “tongue-to-the-side” facial gesture [Meltzoff and Moore, 1997]. Meltzoff and Moore argue that these results show imitation is a goal-directed process. That is, when infants imitate, they have in mind a terminal motor state that they seek to match to the target state of the instructor. How infants decide what aspects of a terminal motor state are relevant for imitation is an open question.

Finally, imitation is generative and specific. Infants can imitate a wide range of gestures within the first 2 months of life (including tongue and lip protrusion, hand gestures, and cheek and brow movements, among others). Combined with the finding that infants employ novel gestures to reach desired end states, Meltzoff and Moore [Meltzoff and Moore, 1997] suggest that infants use “generative models” of motor movements to imitate. This parallels the use of generative models [Bishop, 1995], [Hinton and Ghahramani, 1997] in machine learning: rather than being equipped with a predefined set of motor primitives used to perform imitation, infants appear to use internal models to find suitable motor trajectories in real time. Their trajectories are designed to achieve a goal state or sequence of goal states, not to mimic the precise trajectories followed by an adult. Related evidence showing the goal-directedness of imitation comes from slightly older children [Gleissner et al., 2000].

2.1.2 Organ identification

One problem that any imitation learning system must overcome is the correspondence problem [Nehaniv and Dautenhahn, 1998], [Alissandrakis et al., 2002a], [Meltzoff, 2002], where the observer must determine a mapping between its own state and action spaces and those of the instructor. For facial imitation, infants seem to come pre-equipped with a mapping between adult facial states and their own proprioceptive feedback. Experimental evidence shows that infants respond to seeing facial gestures by quickly activating the corresponding body part—if an adult protrudes his or her tongue, for example, the infant’s tongue will also begin to move, with a quieting of other body parts. Meltzoff and Moore [Meltzoff and Moore, 1997] call this process, in which infants determine which body part to move before identifying the correct method to move it, organ identification. Meltzoff and Moore suggest two possible mechanisms for how organ identification might occur: (i) by perceiving characteristic collections of features (“organs as forms”) and (ii) by perceiving characteristic spatiotemporal movement patterns (“organs from motion”). In terms of a computational framework, we can view mechanism (i) as resulting from some collection of genetically hardwired feature detectors. Other studies [Buccino et al., 2001], [Gross, 1992] have shown that visual displays of parts of the face and hands cause activity at specific brain sites in both monkeys and humans. Such neural hardware may provide the underpinnings for organ identification (also see the growing literature on “mirror neurons” [di Pellegrino et al., 1992], [Rizzolatti et al., 2000], [Rizzolatti et al., 2002]). Mechanism (ii) relies on being able to match the outputs of forward models of different muscle group movements with their visual correlates (see Section 3 below).

The following three sections detail increasingly complex imitative behaviors observed in developing infants. We first discuss the hypothesis that infants learn forward models of their own muscle kinematics, a process Meltzoff and Moore call “body babbling” [Meltzoff and Moore, 1997]. We then describe successively more complex developmental stages of imitation, from imitating body movements of others to imitating actions on objects and, finally, inferring the intent of other agents.

2.1.3 Body babbling

Despite the supposition of specialized neural circuits for matching observed face states to proprioceptive states in neonatal infants, it is unrealistic to assume that infants begin life with accurate a priori models of their muscle systems’ kinematics. Meltzoff and Moore [Meltzoff and Moore, 1997] propose that a period of experimentation occurs early in life, during which infants learn a mapping between motor commands and the resulting states of their bodies. This experimentation process is called “body babbling,” in direct analogy to the process of vocal babbling through which children acquire auditory-articulatory maps. Several open questions remain regarding body babbling. For example, to what extent do “remembered” mappings learned during this process interact with dynamic proprioceptive feedback encountered during performance of a motor task? How fine-grained a representation of resulting motor states do infants maintain—are motor states on the level of gross organ states (e.g., “all fingers extended straight out from the palm at an angle of 0 degrees”), relationships between nearby organs or motor groups

(“the forefinger is displaced 5 degrees relative to the middle finger”), or detailed positions of body segments in a global egocentric coordinate frame (“the tip of the forefinger is at coordinate (45, 12, 34)”)? How are models learned during body babbling extended to accommodate holding and using objects? To what extent does body babbling extract specialized information about how body segments react to motor commands, and to what extent does it extract general knowledge about the passive physics underlying the infant’s limbs (inertia, momentum, acceleration, force, etc.)? Based on computational considerations, it seems probable that the models developed during body babbling are hierarchical and specialized to movements of an infant’s muscle system. A hierarchy seems probable because the huge number of degrees of freedom available to the human body makes a complete description of all possible muscle states unfeasibly large. Even given the long developmental period of human infants, it seems unlikely that all possible muscle states of interest could be explored. This problem may be mitigated if infants are biased toward attempting to emulate body configurations shown by adult instructors; that is, infants may concentrate their sampling over actions to emulate gestures and facial expressions common to the culture around them (cf. Section 2.3 below) even at an early stage of body babbling.

2.1.4 Imitative learning of body movements

The next developmental stage in infant imitation is the replay of others’ body movements. Neonates as young as 42 minutes can imitate facial expressions, and 12- to 21-day-old infants can imitate up to four different facial and manual gestures [Meltzoff and Moore, 1977], [Meltzoff and Moore, 1983].

It seems likely that this stage of imitation relies on combining hardwired organ identification abilities with a gradual refinement in the model acquired during body babbling (a process which could begin prenatally). Manual gestures could be matched during imitation by visually matching the adult’s hand state with the infant’s hand state. This would require visual pattern matching because of the different size, shape, color, and proportion of the adult and infant hands, but nonetheless the self and others’ states can both be visually perceived (unlike the case of facial imitation). The ability of infants to model hand gestures within a matter of days after birth may argue for the pre-existence of specialized hand detectors in the neonate brain, as well as the aforementioned face detectors. Which specific feature detectors are inborn and which are learned remains a topic for future investigation.

2.1.5 Imitative learning of actions on objects

Infants in the second half-year of life are capable of performing imitative actions on objects. In one experiment [Meltzoff, 1988b], toddlers saw an adult lean forward and touch a yellow panel with his forehead, causing the panel to light up. The infants were not allowed to perform immediate imitation, or even to inspect the panel beyond the period of demonstration. After a 1-week delay, infants were brought back to the lab. 67% of infants in the test group imitated the previously observed gesture (Fig. 1(b)) while 0% of controls who had not seen the target act did so. This study is interesting not just because it shows toddlers’ increased imitative capabilities; it is also interesting because of what the toddlers did not do. Rather than simply remembering the object properties (“depressing that panel makes it light up”), which could have been elicited by simply touching the panel with their hands, for example,

the children followed the specific forehead-touching action they had earlier seen in the adult instructor’s demonstration. Thus, some representation of the adult’s motor act must have been stored in the children’s minds for later use. Another study [Hanna and Meltzoff, 1993] showed that infants who observe their peers play in unique ways with toys in a day-care setting can later imitate these unique play styles when presented with the same toys at home. Yet another study [Meltzoff, 1988a] showed that 14-month-olds can extrapolate from actions on toys shown on a TV screen and imitate those actions when given the same toys in the real world. This indicates that infants can use imitative processes even when the instructor’s actions are represented in 2-D space. Finally, recent findings have linked the mechanisms underlying imitation to learning to follow the gaze of others when they turn to look at objects. This behavior, known as “gaze tracking” or “gaze following,” is an important first step towards developing shared attention between the infant and other agents [Brooks and Meltzoff, 2002].

2.1.6 Imitative learning via inferring intent

The final developmental step of imitation, and one where humans far exceed the capabilities of machines, is inferring intent—knowing what the instructor “means to do” even before the instructor has achieved a goal state (or when the instructor fails to reach a goal state). The ability to infer the intent of others represents a key step in forming a “theory of mind” for other agents, that is, being able to simulate the internal mental states of others. In one study [Meltzoff, 1995], 18-month-old infants were shown an adult performing an unsuccessful motor act. For example, the adult “accidentally” over- or

undershot his target object with his hands, or his hands “slipped” several times in manipulating a toy, preventing him from transforming the toy in some way. The results showed that infants can infer which actions were intended, even when they had not seen the goal successfully achieved. The same study [Meltzoff, 1995] also showed that infants do not infer intent when the instructor is a clearly inanimate object. In this study, Meltzoff constructed a device from wood and plastic that had poles for “arms” and mechanical pincers for “hands.” The goal was to pull apart two halves of a toy. The device was made to demonstrate the same unsuccessful act as the human instructor. Although the device did not appear human, its manipulators traced out the same spatiotemporal path as the adult instructor. The infants inferred intent of the adult instructor after the unsuccessful act, but not in the case of the inanimate device. Their rate of performing the correct action after viewing the inanimate device was no different from controls. Interestingly, if the device was successful in performing the task, infants’ odds of replaying the behavior did increase compared to controls. This result shows that infants are not merely tracing out spatiotemporal patterns—they are able to simulate or predict what state the instructor means to achieve, but only when the instructor matches certain criteria that identify him or her as an intelligent agent “worthy” of imitation.

2.2 The AIM Model

Fig. 2 provides a conceptual schematic of Meltzoff and Moore’s active intermodal mapping (AIM) hypothesis for imitation in infants [Meltzoff and Moore, 1977], [Meltzoff and Moore, 1994], [Meltzoff and Moore, 1997].

The key claim is that imitation is a matching-to-target process. The active nature of the matching process is captured by the proprioceptive feedback loop. The loop allows infants’ motor performance to be evaluated against the seen target and serves as a basis for correction. Imitation begins by mapping perceptions of the teacher and the infant’s own somatosensory or proprioceptive feedback into a supramodal representation, allowing matching to the target to occur. Meltzoff and Moore’s [Meltzoff and Moore, 1997] hypothesis for early infant imitation is that infants maintain a directory of muscle commands indexed by goal organ states. To determine how this directory is constructed, Meltzoff and Moore performed a detailed analysis of infant movements in a task involving a novel act, tongue protrusion out of the side of the mouth instead of straight ahead [Meltzoff and Moore, 1994]. Infants appeared to match their organ states to the instructor’s states by isolating and matching a single dimension at a time, e.g., first the lateral dimension is matched, then forwardness of the tongue, and finally the amount of protrusion beyond the lips. Meltzoff and Moore suggest that saliency determines the dimension that infants choose to imitate next; i.e., mismatches that the visual system considers “interesting” at the neural level guide infant attention toward which dimension to match next. This ordering of dimensions for matching seems to be consistent across individual infants, ruling out trial and error. Several issues remain unclear as to how AIM might be implemented in the infant brain. For example, precisely which criteria are used to determine saliency of instructor facial features? What is the minimum set of facial features necessary to determine that an agent is intelligent, and thus worthy of imitation? Do infants use other bases to make this judgment, such as prior communicative efforts or turn taking?

What algorithms and computational resources are needed to map sensory data into a supramodal space?

2.3 The “like me” hypothesis: developmental changes in modeling others’ intentions

One hallmark of infant development occurs at about 1 year of age: rather than simply repeating the instructor’s end motor state, 1-year-olds will “test” for contingent interaction with the instructor by first matching the instructor’s state, then abruptly changing motor state while attending to the instructor, as if to see what the instructor will do next [Meltzoff, 1990]. Neonates do not display this type of behavior. In a related finding, Brooks and Meltzoff showed that infants under 10 months of age tend to cue on the instructor’s head turning to determine when to follow gaze; older children (12 to 18 months old) have a more refined model of others’ attention, only looking when the instructor’s eyes are open [Brooks and Meltzoff, 2002], [Brooks and Meltzoff, 2004]. Another study [Meltzoff, 1999] found that 9-month-old infants could not infer the intent of adults after unsuccessful demonstrations, but that 15- and 18-month-old children could infer the goal even though the adult performed only failed demonstrations. These results suggest that some shift in representation of imitative acts must occur around 1 to 1.5 years of age. Meltzoff [Meltzoff, 2002] proposes that infants begin life with a rudimentary hard-wired mapping between self and other, but that this is refined over time through interaction. Meltzoff’s “like me” hypothesis postulates that tendencies toward social behavior (turn-taking, empathy, etc.) possess inborn components, which are fine-tuned as the infants interact more and more with adult and peer exemplars. This interaction allows infants to find correlations between their own actions and internal states, and the observed states of others.


2.4 Summary: cognitive studies

Cognitive studies of imitation in infants reveal a number of basic characteristics that may help guide the development of robotic imitation learning systems:

• Infant imitation appears to be goal-directed
• Infant imitation appears to involve a matching-to-target process, where actions are corrected one (muscle) degree of freedom at a time
• Sensations seem to be mapped into a supra-modal space, allowing visual perception of the instructor to be compared to proprioceptive signals of the observer
• Infant imitation seems to be a staged developmental process, with increasing complexity in imitative capability
• The “like me” hypothesis states that imitation is an act of social cognition, which in turn relies on inborn mechanisms that are gradually refined to match the inferred social norms and intentions of others in the infant’s environment

3 A Bayesian framework for goal-directed imitation learning

Imitation learning systems that only learn deterministic mappings from states to actions are susceptible to noise and uncertainty in stochastic real-world environments. Development of a probabilistic framework for robotic imitation learning, capable of scaling to complex hierarchies of goals and subgoals,

remains a largely untouched area of research. The following section sketches a proposal for such a framework. Systems that use deterministic models rather than probabilistic ones ignore the stochastic nature of realistic environments. We propose a goal-directed Bayesian formalism that overcomes both of these problems.

We use the notation st to denote the state of an agent at time t, and at to denote the action taken by an agent at time t. sG denotes a special “goal state” that is the desired end result of the imitative behavior. Imitation learning can be viewed as a model-based goal-directed Bayesian task by identifying:

• Forward model: Predicts a probability distribution over future states given current state(s), action(s), and goal(s): P(st+1 | at, st, sG). Models how different actions affect the state of the agent and environmental state.

• Inverse model: Infers a distribution over actions given current state(s), future state(s), and goal(s): P(at | st, st+1, sG). Models which action(s) are likely to cause a transition from a given state to a desired next state.

• Prior model: Infers a distribution over actions given current state(s) and goal(s): P(at | st, sG). Models the policy (or preferences) followed by a particular instructor in transitioning through the environment to achieve a particular goal.
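To make these definitions concrete, the sketch below (our own illustration, not the implementation used in this work) represents the forward and prior models as count-based conditional probability tables over discrete states, actions, and goals; the inverse model is not stored explicitly but is derived from these two tables via the Bayes'-rule combination given in Eqn. 1 below. All variable names and values here are hypothetical.

```python
from collections import defaultdict

class TabularModel:
    """Count-based estimate of a conditional distribution P(y | x).

    x and y can be any hashable objects (here: states, actions, goals)."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(float))

    def update(self, x, y, weight=1.0):
        self.counts[x][y] += weight

    def distribution(self, x):
        ys = self.counts[x]
        total = sum(ys.values())
        if total == 0:
            return {}                      # no data seen for this context yet
        return {y: c / total for y, c in ys.items()}

# Forward model: P(s_next | s, a)   -- assumed goal-independent in this sketch
forward = TabularModel()
# Prior (policy) model: P(a | s, g) -- instructor's preferences per goal
prior = TabularModel()

# Updating from one observed transition of the instructor pursuing goal g:
s, a, s_next, g = (1, 1), 'E', (2, 1), 'goal2'   # hypothetical discrete values
forward.update((s, a), s_next)
prior.update((s, g), a)

print(forward.distribution((s, a)))   # e.g. {(2, 1): 1.0}
print(prior.distribution((s, g)))     # e.g. {'E': 1.0}
```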

Thus the prior model involves learning a Markov Decision Process (MDP) or a partially observable MDP (POMDP), while the forward model involves learning a “simulator” of how the environment (including other agents and the agent’s own body) reacts to actions performed within it. Learning inverse models is a notoriously difficult task [Jordan and Rumelhart, 1992], not least because multiple actions may cause transitions from st to st+1.

However, using Bayes’ rule, we can infer an entire distribution over possible actions using the forward and prior models:

P(at | st, st+1, sG) ∝ P(st+1 | at, st, sG) P(at | st, sG)    (1)

Equation 1 can be used to either select the maximum a posteriori (MAP) action, or to sample from the distribution over actions. The latter method occasionally picks an action different from the MAP action and allows better exploration of the action space (cf. the exploration-exploitation tradeoff in reinforcement learning). Sampling from the distribution over actions is also called probability matching. Evidence exists that the brain employs probability matching in at least some cases [Herrnstein, 1961], [Krebs and Kacelnik, 1991]. According to the AIM model [Meltzoff and Moore, 1997], imitation begins with an infant (or other agent) forming a representation of features in the outside world. An equivalence detector matches the current modality-independent representation of the instructor’s state with a modality-independent representation of the infant’s state. Proprioceptive feedback guides the infant’s motor output toward matching the instructor’s state. Our framework for Bayesian action selection using learned models captures this idea of imitation as a “matching-to-target” process. Fig. 3 depicts a block diagram of our architecture. Like AIM, our system begins by running several feature detectors (skin detectors, face trackers, etc.) on sensor inputs from the environment. Detected features are monitored over time to produce state sequences. In turn, these sequences define actions. The next step is to transform state and action observations into instructor-centric values, then map from instructor-centric to observer-centric coordinates.

Fig. 2. AIM hypothesis model for infant imitation: The AIM hypothesis [Meltzoff and Moore, 1997] argues that infants match observations of adults with their own proprioceptions using a modality-independent (“supramodal”) representation of state. Our computational framework suggests a probabilistic implementation of this hypothesis.

Observer-centric values are employed to update probabilistic forward and prior models in our Bayesian inference framework. Finally, combining distributions from the forward and prior models as in Eqn. 1 yields a distribution over actions. The resulting distribution over actions is converted into a single motor action the observer should take next, with an efference copy conveyed to the feature detectors to cancel out the effects of self-motion.

Our approach shares some similarities with other recent model-based architectures for robotic imitation and control. Billard and Matarić developed a system that learned to imitate movements using biologically-inspired algorithms [Billard and Matarić, 2001]. The system learned a recurrent neural network model for motor control based on visual inputs from an instructor. A software avatar performed imitation of human arm movements. Demiris and Hayes proposed coupled forward and inverse models for robotic imitation of human motor acts [Demiris and Hayes, 2002]. Their architecture also draws inspiration from the AIM model of infant imitation. In contrast to their deterministic, graph-based approach, we advocate using probabilistic models to handle real-world, noisy data. Wolpert and Kawato likewise proposed learning probabilistic forward and inverse models to control a robotic arm [Wolpert and Kawato, 1998], [Haruno et al., 2000]. Both of these proposals still require the (potentially difficult) explicit learning of inverse models. Further, it is unclear how well these architectures scale to real-world tasks, or how they mirror developmental stages observed in human infants. Our framework differs from previous efforts by combining Bayesian inference with forward models to predict environmental state and prior models to learn social norms, yielding a computationally feasible probabilistic approach to imitation. In the sections below, we discuss in more detail some of the problems that the observer must surmount in attempting to replicate the behavior of the instructor (see also [Schaal et al., 2003], [Rao and Meltzoff, 2003]).
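The sketch below illustrates the action-selection step implied by Eqn. 1: the forward and prior tables (here plain nested dictionaries with invented values) are multiplied and normalized, and an action is then chosen either as the MAP action or by probability matching (sampling). This is a minimal illustration under our own simplifying assumptions, not the robot's actual controller.

```python
import numpy as np

def action_posterior(forward, prior, s, s_desired, goal, actions):
    """Unnormalized P(a | s, s_desired, goal) ∝ P(s_desired | a, s) P(a | s, goal),
    following Eqn. 1. `forward` and `prior` are nested dicts mapping a context
    (e.g. (state, action)) to a probability dictionary."""
    post = np.array([
        forward.get((s, a), {}).get(s_desired, 0.0) *
        prior.get((s, goal), {}).get(a, 0.0)
        for a in actions
    ])
    total = post.sum()
    return post / total if total > 0 else np.ones(len(actions)) / len(actions)

def select_action(post, actions, mode='map', rng=np.random.default_rng(0)):
    """MAP selection or probability matching (sampling from the posterior)."""
    if mode == 'map':
        return actions[int(np.argmax(post))]
    return actions[rng.choice(len(actions), p=post)]

# Hypothetical toy numbers: at state s, action 'N' reliably reaches s_desired.
actions = ['N', 'S', 'E', 'W', 'X']
forward = {((1, 1), 'N'): {(1, 2): 0.9}, ((1, 1), 'E'): {(1, 2): 0.1}}
prior = {((1, 1), 'g1'): {'N': 0.5, 'E': 0.5}}
post = action_posterior(forward, prior, (1, 1), (1, 2), 'g1', actions)
print(select_action(post, actions, mode='map'))      # -> 'N'
print(select_action(post, actions, mode='sample'))   # usually 'N', sometimes 'E'
```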

3.1 State and action identification

Deriving state and action identity from sensor data involves task- and sensor-specific functions.

[Fig. 3 block diagram (components): Cameras; Perception; Feature detectors (skin detection, face detection, pose classification); Saliency model (pixels ↦ P(objects)); Projection along gaze vector; Correspondence model (instructor-centric coordinates ↦ ego-centric coordinates); Supramodal representation; Forward model (world dynamics); Prior model (instructor’s policy); Inverse model (action selection); Mismatch detection; Proprioceptive feedback; Motor planning.]

Fig. 3. Overview of model-based Bayesian imitation learning architecture: As in Meltzoff and Moore’s AIM proposal, the initial stages of our model correspond to the formation of a modality-independent representation of world state. Mappings from instructor-centric to observer-centric coordinates and from the instructor’s motor degrees of freedom (DOF) to the observer’s motor DOF play the role of equivalence detector in our framework, matching the instructor’s motor output to the motor commands of the observer. Proprioceptive feedback from the execution of actions closes the motor control loop.


Although it is impossible to summarize the extensive body of work in action and state identification here, we note recent progress in extracting actions from laser rangefinder and radio [Fox et al., 2003] and visual [Efros et al., 2003] data. In most cases, computational expediency necessitates employment of dimensionality reduction techniques such as principal components analysis, Isomap [Tenenbaum et al., 2000], or locally linear embedding [Roweis and Saul, 2000]. Saliency detection algorithms (e.g., [Itti et al., 1998]) may also help reduce high-dimensional visual state data to tractable size.
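As a minimal illustration of the dimensionality-reduction step mentioned above, the sketch below projects high-dimensional visual state vectors onto their top principal components using an SVD; the data, dimensions, and function names are invented for illustration.

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X (n_samples x n_features) onto the top-k principal
    components. Returns the low-dimensional codes and the component matrix."""
    X_centered = X - X.mean(axis=0)
    # Economy-size SVD; rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                  # (k, n_features)
    codes = X_centered @ components.T    # (n_samples, k)
    return codes, components

# Hypothetical visual state vectors (e.g. flattened image patches)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024))
codes, components = pca_project(X, k=10)
print(codes.shape)   # (200, 10)
```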

3.2 Learning correspondences between instructor and observer states

A prerequisite for any robotic imitation task is to determine a mapping from the instructor’s state to the observer’s [Nehaniv and Dautenhahn, 2002], [Johnson and Demiris, 2004]. We view this state mapping problem as an instance of subgraph isomorphism, where the goal is to match subgraphs from the instructor (corresponding to effectors, e.g., limbs) to their corresponding subgraphs in the observer. Subgraph isomorphism denotes the problem of finding a 1-to-1 mapping between the nodes and edges of one subgraph and the nodes and edges of another. Nodes represent rigid skeletal segments, and edges represent movable joints between segments. Such an approach attempts to find common “skeletal segment groups” (subgraphs) shared between instructor and observer. For instance, a humanoid robot might not possess precisely the same degrees of freedom as a human, but similarities in skeletal structure could be determined by finding closely matched subgraphs such as arms, legs, or hands. Future efforts in solving the correspondence problem might also employ notions of similarity and analogical mapping from the cognitive literature

(for example, see [Gentner et al., 2001]). In the simulation and robotic head results shown below, the mappings are trivial. In our simulation results, the instructor and observer have the same state and action spaces; no correspondence problem need be solved. In our robotics results, the matched subgraphs are programmed into the system, and comprise a two-node graph (the “head” and base of the robot, corresponding to the head and neck of its human instructor) with one edge (the movable neck). Developing detailed graph-theoretic approaches that map between instructor states and observer states remains an ongoing topic of investigation. Although subgraph isomorphism in the general case is computationally intractable, most realistic examples of correspondence matching (for instance, finding matches between a human’s arm and a robot’s arm) involve small graphs well within the limits of real-time computation.
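The sketch below gives a brute-force version of the correspondence search described above for the small body graphs of interest: nodes are rigid segments, edges are joints, and candidate 1-to-1 node mappings are kept only if they preserve adjacency. The graphs and labels are hypothetical, and a practical system would add kinematic compatibility checks on top of this purely structural test.

```python
from itertools import permutations

def subgraph_matches(instructor, observer):
    """Find 1-to-1 mappings from instructor nodes onto observer nodes that
    preserve edges (a brute-force subgraph-isomorphism check; feasible only
    for the small body graphs considered here).
    Graphs are given as (nodes, list_of_undirected_edges)."""
    i_nodes, i_edges = instructor
    o_nodes, o_edges = observer
    o_edge_set = {frozenset(e) for e in o_edges}
    matches = []
    for perm in permutations(o_nodes, len(i_nodes)):
        mapping = dict(zip(i_nodes, perm))
        if all(frozenset((mapping[a], mapping[b])) in o_edge_set
               for a, b in i_edges):
            matches.append(mapping)
    return matches

# Hypothetical example: human head-neck-torso chain vs. robot head-base pair
human = (['head', 'neck', 'torso'], [('head', 'neck'), ('neck', 'torso')])
robot = (['cam_head', 'pan_tilt_base'], [('cam_head', 'pan_tilt_base')])
# Match the robot's (smaller) graph into the human's graph:
print(subgraph_matches(robot, human))
# e.g. [{'cam_head': 'head', 'pan_tilt_base': 'neck'}, ...]
```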

3.3 Learning forward models and policies

Numerous supervised and unsupervised approaches (see e.g. [Jordan and Rumelhart, 1992], [Todorov and Ghahramani, 2003]) have been proposed to learn forward models which map motor commands to their sensory consequences. Evidence demonstrates that infants learn forward models of how their limbs, facial muscles, and other body parts react to motor commands, a process referred to as “body babbling” in [Meltzoff and Moore, 1997]. Such learning could occur both prenatally and during interaction with objects and people in the world. Since both the motor command issued and its sensory consequence are available to the agent, forward models of environmental dynamics can be acquired using supervised learning algorithms.
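For continuous state and motor spaces, one simple supervised formulation (our assumption for illustration, not necessarily the form used here) is to fit a linear forward model st+1 ≈ A st + B at by least squares from (state, action, next state) triples gathered during exploration, with the residuals estimating the sensor noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "body babbling" data: unknown true dynamics plus sensor noise.
A_true = np.array([[1.0, 0.1], [0.0, 0.9]])
B_true = np.array([[0.5, 0.0], [0.0, 0.5]])
states = rng.normal(size=(500, 2))
actions = rng.normal(size=(500, 2))
next_states = (states @ A_true.T + actions @ B_true.T
               + 0.05 * rng.normal(size=(500, 2)))

# Fit [A B] jointly by least squares on stacked [state, action] inputs.
X = np.hstack([states, actions])                       # (500, 4)
W, *_ = np.linalg.lstsq(X, next_states, rcond=None)    # (4, 2)
A_hat, B_hat = W[:2].T, W[2:].T

# Residuals give the noise model; their std estimates the Gaussian spread.
residuals = next_states - X @ W
print(A_hat.round(2), B_hat.round(2), residuals.std(axis=0).round(3))
```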

A popular approach to learning the policy P(at | st, sG) is reinforcement learning, where the goal is to maximize the total reward obtained by the agent (see [Kaelbling et al., 1996] for a survey). In reinforcement learning, the reward signal alone is used to learn models of the environment, rather than relying on actions inferred from observing a teacher as in imitation learning.
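For contrast with the imitation route, the sketch below learns a policy for a tiny invented chain-world purely from a reward signal using tabular Q-learning, then converts the learned Q-values into a stochastic policy P(at | st) with a softmax; all parameters and the environment are illustrative stand-ins rather than anything used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, actions = 5, ['left', 'right']        # goal is the rightmost state
Q = np.zeros((n_states, len(actions)))
alpha, gamma, eps = 0.5, 0.9, 0.2

def step(s, a):
    # Action index 1 moves right, 0 moves left; reward only at the goal state.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(200):
    s = 0
    for t in range(20):
        a = rng.integers(len(actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = step(s, a)
        # Standard Q-learning update toward the bootstrapped target.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

# Soft policy P(a | s) from learned Q-values (temperature tau).
tau = 0.1
policy = np.exp(Q / tau) / np.exp(Q / tau).sum(axis=1, keepdims=True)
print(policy.round(2))   # probability mass concentrates on 'right' in every state
```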

3.4 Sequence learning and segmentation

Realistic imitation learning systems must be able to learn sequences of states that define actions, and to segment these sequences into meaningful chunks for later recall or replay. Part of our ongoing work is to determine how semantically meaningful chunks can be defined and recalled in real time. Recent developments in concept learning (e.g., [Tenenbaum, 1999]) suggest how similar environmental states might be grouped together, enabling development of hierarchical state and action representations in machine systems.

3.5 A Bayesian algorithm for inferring intent

Being able to determine the intention of others is a crucial requirement for any social agent, particularly an agent that learns by watching the actions of others. One appealing aspect of our framework is that it suggests a probabilistic algorithm for determining the intent of the instructor. That is, an observer can determine a distribution over goal states based on watching what actions the instructor executes over some period of time. This could have applications in machine learning systems that predict what goal state the user is attempting to achieve, then offer suggestions or assist in performing actions that help the user reach that state.

The theory could lead to quantitative predictions for future cognitive studies to determine how humans infer intent in other intelligent agents. Our algorithm for inferring intent uses applications of Bayes’ rule to compute the probability over goal states given a current state, action, and next state obtained by the instructor, P(sG | st+1, at, st). This probability distribution over goal states represents the instructor’s intent. One point of note is that P(st+1 | at, st, sG) ≡ P(st+1 | at, st); i.e., the forward model does not depend on the goal state sG, since the environment is indifferent to the desired goal. Our derivation proceeds as follows:

P(sG | st+1, at, st) = k P(st+1 | sG, at, st) P(sG | at, st)    (2)
                     ∝ P(st+1 | at, st) P(sG | at, st)    (3)
                     ∝ P(st+1 | at, st) P(at | sG, st) P(st | sG) P(sG)    (4)

where k is a normalization constant.

The first term in the equation above is the forward model. The second term represents the prior model (the “policy”; see above). The third term represents a distribution over states at time t, given a goal state sG. This could be learned by, e.g., observing the instructor manipulate an object, with a known intent, and recording how often the object is in each state. Alternatively, the observer could itself “play with” or “experiment with” the object, bearing in mind a particular goal state, and record how often each object state is observed. The fourth term is a prior distribution over goal states characterizing how often a particular goal is chosen. If the observer can either assume that the instructor has a similar reward model to itself (the “like-me” hypothesis [Meltzoff, 2002]), or model the instructor’s desired states in some other way, it can infer P(sG).

Interestingly, the four terms above roughly match the four developmental stages laid out by Meltzoff [Meltzoff, 2002]. The first term is the forward model, whose learning is assumed to begin very early in development during the “body babbling” stage. The second term corresponds to a distribution over actions as learned during imitation and goal-directed actions. This distribution can be used if all the observer wants to do is imitate body movements (the first type of imitation that infants learn to perform according to Meltzoff’s theory of development). The third term refers to distributions over states of objects given a goal state. Because the space of actions an agent’s body can execute is presumably much smaller than the number of state configurations objects in the environment can assume, this distribution requires collecting much more data than the first. Once this distribution is learned, however, it becomes easier to manipulate objects to a particular end—an observer that has learned P(st | sG) has learned which states of an object or situation “look right” given a particular goal. The complexity of this third term in the intent inference equation could explain why it takes babies much longer to learn to imitate goal-directed actions on objects than it does to perform simple imitation of body movements (as claimed in Meltzoff’s theory). Finally, the last term, P(sG), is the most complex term to learn. This is both because the number of possible goal states sG is huge and because the observer must model the instructor’s distribution over goals indirectly (the observer obviously cannot directly access the instructor’s reward model). The observer must rely on features of its own reward model, as well as telltale signs of desired states, to infer this prior distribution. For example, states that the instructor tends to act to remain in, or that cause the instructor to change the context of its actions, could be potential goal states. The difficulty of learning this distribution could explain why it takes so long for infants to acquire the final piece of the imitation puzzle, determining the intent of others.

It should be noted that the four terms in the equation above were not explicitly engineered into our intent inference algorithm to match childhood developmental stages; rather, they result from using the inverse model formulation in Eqn. 1 and straightforward applications of Bayes’ rule.
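A minimal numerical sketch of the computation in Eqn. 4: given lookup tables for the four factors (all values below are invented placeholders rather than learned distributions), the posterior over goals is obtained by multiplying the factors for each candidate goal and normalizing.

```python
import numpy as np

def infer_intent(s, a, s_next, goals, forward, policy, state_given_goal, goal_prior):
    """P(g | s_next, a, s) ∝ P(s_next | a, s) P(a | g, s) P(s | g) P(g)  (Eqn. 4).
    The forward term is goal-independent, so it cancels in the normalization,
    but we keep it for clarity."""
    scores = np.array([
        forward.get((s, a), {}).get(s_next, 0.0) *
        policy.get((g, s), {}).get(a, 1e-6) *
        state_given_goal.get(g, {}).get(s, 1e-6) *
        goal_prior.get(g, 1e-6)
        for g in goals
    ])
    return scores / scores.sum()

# Invented toy tables for two goals:
goals = ['g1', 'g2']
forward = {((1, 1), 'N'): {(1, 2): 0.8, (1, 1): 0.2}}
policy = {('g1', (1, 1)): {'N': 0.9, 'E': 0.1},
          ('g2', (1, 1)): {'N': 0.2, 'E': 0.8}}
state_given_goal = {'g1': {(1, 1): 0.3}, 'g2': {(1, 1): 0.3}}
goal_prior = {'g1': 0.67, 'g2': 0.27}
print(infer_intent((1, 1), 'N', (1, 2), goals, forward, policy,
                   state_given_goal, goal_prior).round(2))   # ~[0.92, 0.08]
```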

4 Results: simulated imitation and intent inference

We tested the proposed Bayesian imitation framework using a simulated maze environment. The environment (shown in Fig. 4(a)) consists of a 20 × 20 discrete array of states (thin lines). Thick lines in the figure denote walls, through which agents cannot pass. Three goal states exist in the environment; these are indicated by shaded ovals. Lightness of ovals is proportional to the a priori probability of the instructor selecting each goal state (reflecting, e.g., relative reward value experienced at each state). In this example, prior probabilities for each goal state (from highest to lowest) were set at P(sG) = {0.67, 0.27, 0.06}. All instructor and observer trajectories begin at the lower left corner, maze location (1,1) (black asterisk). Figs. 4 and 5 demonstrate imitation results in the simulated environment. The task was to reproduce observed trajectories through a maze containing three different goal states (maze locations marked with ovals in Fig. 4(a)). This simulated environment simplifies a number of the issues mentioned above: the location and value of each goal state is known by the observer a priori; the movements of the instructor are observed free from noise; the forward model is restricted so that only moves to adjacent maze locations are possible; and the observer has no explicit knowledge of walls (hence any wall-avoiding behavior results from watching the instructor).

The set of possible actions for an agent (Fig. 4(b)) includes moving one square to the north (N), south (S), east (E), or west (W), or simply staying put at the current location (X). The stochastic nature of the maze environment means an agent’s selected actions will not always have the intended consequences. Fig. 4(c) compares the true probabilistic kernel underlying state transitions through the maze (left matrix) with the observer’s forward model (right matrix). Lighter squares denote greater probability mass. Here the E and W moves are significantly less reliable than the N, S, or X moves. The observer acquires the estimated state transition kernel P̂(st+1 | at, st) by applying randomly chosen moves to the environment for 500 simulation steps before imitation begins. In a more biologically relevant context, this developmental period would correspond to “body babbling,” where infants learn to map from motor commands to proprioceptive states [Meltzoff and Moore, 1997]. After the observer learns a forward model of the environmental dynamics, the instructor demonstrates 10 different trajectories to the observer (3 to the white goal, 4 to the light gray goal, 3 to the dark gray goal), allowing the observer to learn a prior model. Fig. 5(a) shows a sample training trajectory (dashed line) where the instructor moves from location (1,1) to goal 1, the state at (19,19) indicated by the white oval. The solid line demonstrates the observer moving to the same goal after learning has occurred. The observer’s trajectory varies somewhat from the instructor’s due to the stochastic nature of the environment, but the final state is the same as the instructor’s. We tested the intent inference algorithm using the trajectory shown in Fig. 5(b), where the instructor moves toward goal 1.
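The sketch below captures the flavor of this setup under our own stand-in numbers: a stochastic step function in which E and W moves are less reliable than N, S, or X, and a "body babbling" phase of 500 random moves used to build a count-based estimate P̂(st+1 | at, st). Walls and the goal layout are omitted, and the reliability values are not those used in our experiments.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
SIZE = 20
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0), 'X': (0, 0)}
RELIABILITY = {'N': 0.95, 'S': 0.95, 'E': 0.6, 'W': 0.6, 'X': 1.0}  # stand-in values

def step(state, action):
    """Stochastic maze step: with probability 1 - reliability the move fails
    and the agent stays put. Walls are omitted in this sketch."""
    dx, dy = MOVES[action] if rng.random() < RELIABILITY[action] else (0, 0)
    x, y = state
    return (min(max(x + dx, 1), SIZE), min(max(y + dy, 1), SIZE))

# "Body babbling": 500 random moves, counting the observed transitions.
counts = defaultdict(lambda: defaultdict(int))
state = (1, 1)
for _ in range(500):
    action = rng.choice(list(MOVES))
    nxt = step(state, action)
    counts[(state, action)][nxt] += 1
    state = nxt

def forward_estimate(state, action):
    outcomes = counts[(state, action)]
    total = sum(outcomes.values())
    return {s: c / total for s, c in outcomes.items()} if total else {}

print(forward_estimate((1, 1), 'E'))   # sparse estimate; improves with more babbling
```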


Fig. 4. Simulated environment for imitation learning: (a) Maze environment used to train observer. Thick black lines denote walls; ovals represent goal states. (b) Possible actions for the observer are relative to the current grid location: on each simulation step, the observer can choose to move North (N), South (S), East (E), West (W), or make no move (X). Because the environment is stochastic, it is not guaranteed that the chosen action will cause a desired state transition. (c) Actual (left) and estimated (right) state transition matrices. Each row of each matrix encodes a probability distribution P (st+1 |at , st ) of reaching the desired next state given a current state and action. (adapted from [Rao et al., 2004]).

The observer’s task for this trajectory is to infer, at each time step of the trajectory, the intent of the instructor by estimating a distribution over possible goal states the instructor is headed toward. The graph in Fig. 5(c) shows this distribution over goals, where data points represent inferred intent averaged over epochs of 5 simulation steps each (i.e., the first data point on the graph represents inferred intent averaged over simulation steps 1-5, the second data point spans simulation steps 6-10, etc.).

Fig. 5. Imitation and goal inference in simulation: (a) The observer (solid line) successfully imitates the instructor’s trajectory (dashed line), starting from map location (1,1). Grayscale color of the lines indicates time: light gray denotes steps early on in the trajectory, with the line gradually becoming darker as the timecourse of the trajectory continues. Note that the observer requires more steps to complete the trajectory than does the instructor. This is because the environment is stochastic and because of the limited training examples presented to the observer. (b) Sample trajectory of the instructor moving from starting location (1,1) to the upper right goal (goal 1). The task is for the observer to use its learned models to estimate a distribution over possible goal states while the instructor is executing this trajectory. Total length of the trajectory is 45 simulation steps. (c) Graph showing a distribution over the instructor’s goal states inferred by the observer at different time points as the instructor is executing the trajectory in (b). Note how the actual goal state, goal 1, maintains a high probability relative to the other goal states throughout the simulation. Goal 2 initially takes on a relatively high probability due to ambiguity in the training trajectories and limited training data. After moving past the ambiguous part of the trajectory, goal 1 (the correct answer) clearly becomes dominant. Each data point in the graph shows an average of the intent distribution taken over 5 time steps. (d) Example prior distributions P(at | st, sG) learned by the observer. In this case, we show the learned distribution over actions for each state given that the goal state sG is goal 2, the gray oval.

Because of the prior probabilities over goal states (given above), the initial distribution at time t = 0 would be {0.67, 0.27, 0.06} (and hence in favor of the correct goal). Note that the estimate of the goal, i.e., the goal with the highest probability, is correct over all epochs. As the graph shows, the first epoch of 5 time steps introduces ambiguity: the probabilities for the first and second goals become almost equal. This is because the limited set of training examples was biased toward trajectories that lead to the second goal. However, the algorithm is particularly confident once the ambiguous section of the trajectory, where the instructor could be moving toward the dark gray or the light gray goal, is passed. Performance of the algorithm would be enhanced by more training; only 10 sample trajectories were presented to the algorithm, meaning that its estimates of the distributions on the right-hand side of Eqn. 4 were extremely biased.

Fig. 5(d) shows examples of learned prior distributions P̂(at | st, sG) for the case where the goal state sG is goal 2, the light gray oval at map location (8,1). The plots show, for each possible map location st that the observer could be in at time t, the prior probability of each of the possible actions N, E, S, and W, given that sG is goal 2 (the X action is not shown since it has negligible probability mass for the prior distributions shown here). In each of the four plots, the brightness of a location is proportional to the probability mass of choosing a particular action at that location. For example, given goal 2 and the current state st = (1, 1), the largest probability mass is associated with action E (moving east). These prior distributions encode the preferences of the instructor as learned by the observer.
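To mirror the analysis behind Fig. 5(c), the sketch below combines observations over a trajectory. It assumes the goal stays fixed and updates the goal posterior recursively from the learned policy tables, then averages the per-step posteriors within epochs of 5 steps; whether to filter recursively or average independent per-step estimates is a design choice left open here, and the tables and trajectory are placeholders rather than our experimental data.

```python
import numpy as np

def sequential_goal_inference(trajectory, goals, policy, goal_prior, epoch=5):
    """Track P(g | observations so far) along a trajectory of (s, a) pairs,
    assuming a fixed goal, then average the posterior within epochs of
    `epoch` steps (as in the Fig. 5(c) analysis). The goal-independent
    forward term cancels and is omitted."""
    log_post = np.log(np.array([goal_prior[g] for g in goals]))
    per_step = []
    for s, a in trajectory:
        likes = np.array([policy.get((g, s), {}).get(a, 1e-6) for g in goals])
        log_post += np.log(likes)
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        per_step.append(post)
    per_step = np.array(per_step)
    n_epochs = len(per_step) // epoch
    return per_step[:n_epochs * epoch].reshape(n_epochs, epoch, -1).mean(axis=1)

# Placeholder policy tables and a short trajectory heading "north" toward g1:
goals = ['g1', 'g2', 'g3']
goal_prior = {'g1': 0.67, 'g2': 0.27, 'g3': 0.06}
policy = {('g1', (1, y)): {'N': 0.8, 'E': 0.2} for y in range(1, 11)}
policy.update({('g2', (1, y)): {'N': 0.3, 'E': 0.7} for y in range(1, 11)})
policy.update({('g3', (1, y)): {'N': 0.1, 'E': 0.9} for y in range(1, 11)})
trajectory = [((1, y), 'N') for y in range(1, 11)]
print(sequential_goal_inference(trajectory, goals, policy, goal_prior).round(3))
```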

5 Results: real-time gaze following in a robotic head

We have also tested our probabilistic approach in a Biclops robotic head (Fig. 8(a)) for the task of gaze following [Brooks and Meltzoff, 2002], [Scassellati, 1999]. The goal is to follow the gaze of a human instructor and track the orientation of the instructor’s head to determine where to look next. Gaze following (Fig. 6(a,b,c)) represents a key step in the development of shared attention, in turn bootstrapping more complicated imitation tasks. The first step in performing gaze following is to learn a predictive model of motor commands: given that the robotic head is in some state and that a motor command is issued, what is the distribution over next states? Thus, we are learning a function f : (motor encoder position, motor command) ↦ P(next motor encoder position). We measured the reliability of the motors in the Biclops head to learn a model of the head’s forward dynamics (Fig. 7(a-d)). The forward model assumes linear error functions for head pan and tilt (Epan, Etilt), plus additive, zero-mean Gaussian noise (magnitudes σpan, σtilt):

Epan = α ∆pan + N(0, σpan) + c    (5)
Etilt = β ∆tilt + N(0, σtilt) + d    (6)

with α, β constant line slopes and c, d additive fit constants. The residual error in head position after learning the model (shown in Fig. 7(c,d)) demonstrates the validity of our assumption of Gaussian error. The prior model for this case was simply defined as a uniform distribution over actions that would move the head closer to the estimated gaze vector of the instructor (see Fig. 7(f)). For the example state trajectory plotted in Fig. 7(e), the corresponding inverse model as computed by Eqn. 1 is shown in Fig. 7(g).
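The sketch below shows one way the slope and offset terms in Eqns. 5 and 6 could be fit from logged (commanded displacement, observed error) pairs by ordinary least squares, with the residual standard deviation serving as the Gaussian noise magnitude. The data are synthetic and the procedure is an assumption for illustration, not a description of the actual Biclops calibration code.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_error_model(delta, error):
    """Fit error ≈ slope * delta + offset by least squares and estimate the
    residual noise std (the sigma term in Eqns. 5-6)."""
    X = np.column_stack([delta, np.ones_like(delta)])
    (slope, offset), *_ = np.linalg.lstsq(X, error, rcond=None)
    sigma = (error - X @ np.array([slope, offset])).std()
    return slope, offset, sigma

# Synthetic pan-axis data: true slope 0.02, offset 0.5 deg, noise std 0.3 deg
delta_pan = rng.uniform(-40, 40, size=200)
err_pan = 0.02 * delta_pan + 0.5 + rng.normal(0, 0.3, size=200)
alpha, c, sigma_pan = fit_error_model(delta_pan, err_pan)
print(round(alpha, 3), round(c, 3), round(sigma_pan, 3))   # ≈ 0.02, 0.5, 0.3
```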

Fig. 6. Infants track gaze to establish shared attention: Infants as young as 9 months can detect gaze based on head direction; older infants (≥ 10 months) use opened eyes as a cue to detect whether they should perform gaze tracking. In this example (from [Brooks and Meltzoff, 2002]), an infant and instructor begin interacting (a). When the instructor looks toward an object (b), the infant focuses attention on the same object using the instructor’s gaze as a cue (c).

Lighter values correspond to greater probability of selecting that action; the figure demonstrates that the inverse model is highly peaked about the action most likely to reproduce the observed trajectory. Once trained, our system begins by identifying an image region likely to contain a face (based on detecting skin tones and bounding box aspect ratio). We employ a Bayesian pose detection algorithm [Wu et al., 2000] that matches an elliptical model of the head to the human instructor’s face. Our algorithm then transforms the estimated gaze into the Biclops’ egocentric coordinate frame, causing the Biclops to look toward the same point in space as the human instructor. We trained the pose detector on a total of 13 faces, with each training subject looking at 36 different targets; each target was associated with a different pan and tilt angle relative to pan 0, tilt 0 (with the subject looking straight ahead). Fig. 8(e) depicts a likelihood surface over pan and tilt angles of the instructor’s head in the pose shown in Fig. 8(d). Our system generates pan and tilt motor commands by selecting the maximum a posteriori estimate of the

tor’s pan and tilt, and performing a simple linear transform from instructorcentric to egocentric coordinates. Out of 27 out-of-sample testing images using leave-one-out cross-validation, our system is able to track the angle of the instructor’s head to a mean error of ±4.6 degrees.

1
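The final step just described can be summarized by the sketch below: take the MAP pan/tilt from a likelihood surface over head poses, then map it into the robot's frame with a linear (affine) transform. The grid resolution and transform coefficients are hypothetical placeholders, not the values used on the Biclops.

```python
import numpy as np

def map_pose(likelihood, pan_grid, tilt_grid):
    """likelihood: 2D array over (tilt, pan) angles; returns the MAP pan/tilt."""
    i, j = np.unravel_index(np.argmax(likelihood), likelihood.shape)
    return pan_grid[j], tilt_grid[i]

def to_egocentric(pan_deg, tilt_deg, gain=(1.0, 1.0), offset=(0.0, 0.0)):
    """Instructor-centric -> robot egocentric pan/tilt (placeholder coefficients)."""
    return gain[0] * pan_deg + offset[0], gain[1] * tilt_deg + offset[1]

pan_grid = np.linspace(-60, 60, 121)
tilt_grid = np.linspace(-45, 45, 91)
likelihood = np.random.rand(len(tilt_grid), len(pan_grid))   # stand-in surface
robot_pan, robot_tilt = to_egocentric(*map_pose(likelihood, pan_grid, tilt_grid))
```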

Our previous efforts [Shon et al., 2003] demonstrated the ability of our system to track the gaze of an instructor; ongoing robotics work involves learning policy models specific to each instructor, and inferring instructor intent based on object saliency. Real-time control is critical to performing certain types of imitation, and especially to performing motor control in rapidly changing stochastic environments. Our present system² uses a cluster of 3 workstations to perform all steps of our algorithm, from observation to motor execution, in real time. Although fast motor control is possible in a limited domain such as gaze tracking, achieving the same performance in more complex problem domains will probably require development of a hierarchy of controllers, from a central top-down planner to lower-level motor control. Again, we anticipate guidance from biological results in constraining our representations: psychophysical and physiological studies have begun to yield a rich trove of data describing how human neural and motor systems act in response to real-time motor control tasks [Miall and Wolpert, 1996], [Wolpert et al., 1995], [Ghahramani and Wolpert, 1995], [Blakemore et al., 1998], [Buccino et al., 2001], [Olshausen and Field, 1996], [Attneave, 1954].

¹ We define error as E = √((θpan − θ̂pan)² + (θtilt − θ̂tilt)²), where θ is the true angle and θ̂ is our system's estimate of the angle.

² Please see http://neural.cs.washington.edu/files/final-demo.avi for a demonstration video recorded in real time from the Biclops' point of view.


6 Related work: computational frameworks for imitation

6.1 Imitation and reinforcement learning

Although reinforcement learning has been known for decades in the psychology community, it has more recently been developed as a computational framework in the machine learning community [Barto et al., 1983, Sutton, 1988]. Most reinforcement learning results are based on the Markov Decision Process (MDP) framework [Bertsekas, 1987], in which a reinforcement learning agent exists in a discrete state space S. A discrete set of actions A can be used to transition between states. In a stochastic environment, a transition kernel T defines the probability Pr(st+1 | st, at) of transitioning from the state at time t to the state at time t+1, given the action at taken at time t. A feasibility function F(A, s) constrains the set of available actions when an agent is in some state s; for example, in the stereotypical maze environment, F might limit the choice of actions so that an agent cannot attempt to move through a wall. The reward function R(s) defines how much positive reinforcement an agent receives for being in state s; alternatively, this can be expressed as a cost function C defining how much negative reinforcement an agent receives. The main problem in reinforcement learning is to find a policy at ← π(st), mapping states to actions, that maximizes future reward. Reinforcement learning has attracted a great deal of attention because it enables intelligent agents to achieve goals in large, stochastic environments without requiring their designers to encode any prior knowledge except, possibly, the reward provided in each state. Unfortunately, discovering optimal policies in even the most trivial environments has been shown to be

intractable [Littman et al., 1995]. Solving an MDP has computational cost polynomial in the number of states |S|; however, because the number of states is exponential in the number of variables involved in solving a problem, finding an optimal policy is intractable in practice (indeed, finding optimal policies in the general case is NP-hard). In the following sections, we consider two reinforcement learning frameworks. The first uses imitation to discover the structure of the transition kernel T, allowing the imitative agent to discover reasonable policies without needing to explore the full extent of its environment. The second attempts to directly infer the policy followed by an instructor agent based on observing the instructor's state transitions.
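The MDP ingredients listed above can be collected into a compact data structure; the sketch below is our own illustration (the paper defines the quantities but not a representation), with field names chosen for readability.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]
Action = str

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    # transition[(s, a)] maps next states s' to Pr(s' | s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    # feasible(s) -> actions allowed in state s (e.g., no moves through walls)
    feasible: Callable[[State], List[Action]]
    # reward(s) -> reward for occupying state s
    reward: Callable[[State], float]
```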

6.1.1 Price's model: Bayesian imitation learning

Work by Price [Price, 2003] has resulted in a reinforcement learning framework known as implicit imitation. Implicit imitation relies on the idea of transferring information about T from the mind of the instructor to that of the observer. This model is also known as Bayesian imitation learning because, unlike traditional reinforcement learning methods, the model maintains higher-order statistical moments (covariances, etc.) over model parameters. For example, while a traditional reinforcement learner might maintain a single estimated transition kernel T̂ while exploring the environment, a Bayesian imitation learning agent would maintain a distribution Pr(T̂) over possible transition kernels while watching the instructor. Price identifies several advantages to a Bayesian imitation approach:

(1) Prior information about observer and instructor actions can be incorporated.

(2) Observers can avoid suboptimal policies if the instructor can demonstrate a better policy.

(3) Bayesian techniques can be applied to the "exploration-exploitation" tradeoff, in which an agent must balance exploration of unseen parts of the state space (which might contain higher rewards relative to areas of the state space it has already seen) against exploitation (the need to make use of what has already been discovered about the environment to achieve rewards). A related advantage is the removal of hard-to-determine exploration-rate parameters often found in reinforcement learning systems.

Reinforcement learning is difficult primarily because the expected reward value of executing a particular action at a particular state is difficult to determine. The value function Vπ(s) gives the expected sum of future rewards obtained if an agent follows policy π starting from state s. The simplest formulation for finding an optimal reinforcement learning policy starts from an arbitrary value function estimate V^0(s) and uses the well-known Bellman backup equation [Bellman, 1957]:

V^n(s) = R(s) + γ max_{a∈A} Σ_{s'∈S} Pr(s, a, s') V^{n−1}(s')    (7)

where 0 ≤ γ < 1 is a discount factor used to ensure that future reward stays bounded. This formulation is known as value iteration. Although it produces an optimal answer, value iteration requires many iterations (again, exponential in the number of state variables) to converge to the optimum. Because value iteration takes so long to converge, function approximators with some interpolation properties (e.g., neural networks) are typically used to approximate the unknown, true value function V*.
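The sketch below implements the Bellman backup of Eqn. 7 on the MDP structure sketched earlier in this section; it is a minimal illustration that assumes every state has at least one feasible action, not a production solver.

```python
def value_iteration(mdp, gamma=0.95, tol=1e-6, max_iters=10_000):
    """Iterate the Bellman backup V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')."""
    V = {s: 0.0 for s in mdp.states}
    for _ in range(max_iters):
        delta = 0.0
        for s in mdp.states:
            best = max(
                sum(p * V[s2] for s2, p in mdp.transition[(s, a)].items())
                for a in mdp.feasible(s)          # assumes feasible(s) is nonempty
            )
            new_v = mdp.reward(s) + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                           # stop once values have converged
            break
    return V
```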

Bayesian reinforcement learners [Price, 2003, Dearden et al., 1998, Dearden et al., 1999] assume a prior density Pr(T) and compute a posterior over T by applying Bayes' rule, given a state history H = ⟨s0, s1, . . . , st⟩ and an action history A = ⟨a0, a1, . . . , at−1⟩ experienced by the agent up to time t:

Pr(T | H, A) = Pr(H | T, A) Pr(T | A) / ∫ Pr(H | T', A) Pr(T' | A) dT'    (8)

Here the integral in the denominator of Eqn. 8, taken over the set of all possible transition functions, is simply a normalization term ensuring that the resulting posterior is a proper distribution. Bayesian imitation learners [Price, 2003] are reinforcement learning agents that have the ability to observe state histories generated by an instructor agent. Rather than iteratively updating the estimated value function V for each state using value iteration or a similar technique, Bayesian imitation learners use the instructor's state history H to derive a set of likely instructor actions A and to update the posterior Pr(T | H, A). Under this framework, imitation serves to transfer knowledge about the environment model T from instructor to observer in cases where explicit communication is infeasible.

Price also proposes an algorithm, called k-step repair, for finding alternate ways of implementing a policy π when executing π requires actions only available to the instructor, not the observer (for example, if the bodies of the instructor and the observer differ). Suppose that the observer learns, through watching the instructor, that a certain state transition s → x yields a high reward. Unfortunately, however, the observer cannot directly transition from s to x due to differences in the action spaces A of the observer and the instructor. The intuition behind k-step repair is that the observer may be able to reach the same high-reward state x through a "bridge" transition, e.g., s → u → v → x. When an action a executed by the instructor proves infeasible for the observer to replicate, the observer attempts to find a bridge whose length is bounded by some integer k. The value of the parameter k is domain dependent; Price notes that the degree of connectedness in the environment may guide intuition for setting this parameter. If very few bridges can be found, Price mentions the possibility of performing higher-level inference to determine that imitation should be abandoned altogether (and, possibly, that another instructor should be selected, if available).
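To make the posterior update at the heart of implicit imitation concrete, the sketch below keeps Dirichlet-style counts over next states for each (state, action) pair and updates them from transitions observed while watching the instructor. This is our paraphrase of the idea using a standard count model, not Price's implementation; the posterior mean shown normalizes only over successors seen so far.

```python
from collections import defaultdict

class TransitionPosterior:
    def __init__(self, prior_count=1.0):
        # alpha[(s, a)][s'] holds Dirichlet pseudo-counts for Pr(s' | s, a).
        self.alpha = defaultdict(lambda: defaultdict(lambda: prior_count))

    def observe(self, s, a, s_next):
        """Update the posterior with one observed (or inferred) transition."""
        self.alpha[(s, a)][s_next] += 1.0

    def mean(self, s, a):
        """Posterior mean estimate of Pr(s' | s, a) over observed successors."""
        counts = self.alpha[(s, a)]
        total = sum(counts.values())
        return {s2: c / total for s2, c in counts.items()}

# The observer watches the instructor's state history and credits the action it
# infers the instructor most likely took (given directly here for brevity).
posterior = TransitionPosterior()
posterior.observe((1, 1), "E", (2, 1))
posterior.observe((1, 1), "E", (2, 1))
posterior.observe((1, 1), "E", (1, 2))   # occasional slip in a noisy world
print(posterior.mean((1, 1), "E"))
```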

6.1.2 Limitations of the framework

Price’s framework ignores the correspondence problem [Nehaniv and Dautenhahn, 1998],[Alissandrakis et al., 2002a] of mapping between the states and actions of the instructor and those of the observer. This is not a terribly worrisome defect; presumably, solving correspondences does not depend on the environment itself, so that T can be learned after correspondences have been established. This independence assumption reinforces the idea from the cognitive literature that a robotic imitation learning system could employ a staged process (like infants appear to), beginning with a refinement of organ identification, then learning a model of how its motor commands affect its limb kinematics, and progressing to more complex imitative behaviors. However, Price’s framework does not make use of all the information available to the observer. Not only can the observer learn about the dynamics of the world by watching the instructor; it can also learn something about the inter40

nal states of the instructor. That is, the framework misses out on the idea of imitation as a mechanism for transmitting social information. In the context of Section 3 above, Price’s formalism learns forward and environmental models, but neglects the existence of prior models and goal distributions. However, Price does briefly mention the idea of symmetric Markov games as a fruitful future research topic. These are game-theoretic situations in which agents can interact by exchanging knowledge between one another. A final, less theoretical criticism of Price’s framework is that his results deal with simulated environments. Manifold engineering challenges exist in implementing many of these ideas in real-time robotic systems. As an example, Price’s framework does not address the problem of feature selection—i.e., the problem of determining how to process raw sensory information to yield accurate distributions over observed states in real time. Another example would be a lack of hierarchy in his model—robotic systems acting in real time are likely to require efficient data structures for representing and learning hierarchies of skills and subgoals, a topic which goes unaddressed in his work.

6.1.3 Bui's model: Abstract Markov policies

Bui and colleagues have developed a novel stochastic model called the Abstract Hidden Markov Model (AHMM) [Bui et al., 2002]. The model comes from the AI plan recognition literature, where the observer must infer the instructor's policy by watching the instructor's actions. Their model is based on the same MDP framework as Price's implicit imitation formulation; the specific policy model they consider is the Abstract Markov Policy (AMP) model [Parr and Russell, 1997, Sutton et al., 1999]. AMPs are essentially hierarchical MDPs: a policy at one level of the hierarchy can activate one of several MDPs on the next level down. At each level of the hierarchy (except the first), one or more terminal states exist that return control to the next level up in the hierarchy. Executing an AMP produces a stochastic process that Bui et al. call an Abstract Markov Model. With a nontrivial observation model (i.e., the case where environmental states are hidden and the agent perceives inputs that are stochastic functions of the true state), executing an AMP results in an Abstract Hidden Markov Model. The attraction of the AMP is its hierarchical structure, which makes inferring the intent of other agents more computationally tractable than a model which assumes a single, flat set of states and transitions. The two main contributions of [Bui et al., 2002] are: i) the representation of AHMMs as Dynamic Bayesian Networks (DBNs) [Dean and Kanazawa, 1989], [Nicholson and Brady, 1992], [Pearl, 1988], a general type of graphical model for probabilistic inference; and ii) a particle filtering [Gordon et al., 1993, Isard and Blake, 1998] algorithm employing Rao-Blackwellisation [Casella and Robert, 1996].
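The control flow of an abstract policy hierarchy can be conveyed with a toy sketch: a top-level policy selects a lower-level policy, which runs until it reaches one of its terminal states and then returns control upward. This deliberately strips away the stochastic transitions and hidden observations that define real AMPs/AHMMs; all names are ours and purely illustrative.

```python
def run_abstract_policy(top_policy, sub_policies, state, env_step, max_outer=100):
    """top_policy(state) -> name of a sub-policy (or None when done);
    sub_policies[name] = (policy_fn, terminal_states);
    env_step(state, action) -> next state."""
    for _ in range(max_outer):
        name = top_policy(state)          # abstract (higher-level) choice
        if name is None:                  # top level has nothing left to do
            return state
        policy_fn, terminals = sub_policies[name]
        while state not in terminals:     # concrete (lower-level) execution
            state = env_step(state, policy_fn(state))
        # reaching a terminal state hands control back to the level above
    return state
```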

6.1.4 Limitations of the model

One omission in Bui et al.'s work is that learning the model parameters and model structure of AHMMs is not addressed. The cognitive literature suggests that it is possible for infants to determine the policies being followed by others; as mentioned above, computational considerations make it likely that some sort of hierarchy is employed in representing these observed policies. This gives hope that tractable algorithms do exist for determining the parameters and structure of hierarchical Markov decision processes.

6.1.5 Summary

Reinforcement learning provides a viable theoretical framework for reasoning about the computational value of imitation. With the exception of the work examined here, relatively little research has focused on examining the connection between reinforcement learning and imitation learning. The two papers examined here make significant contributions toward a computational theory of imitation from the point of view of reinforcement learning:

• Price's model emphasizes imitation as the implicit transfer of environmental knowledge from the instructor to the observer.
• Keeping a distribution over environmental models using Bayesian methods can enhance the performance of imitation learners.
• Price's framework includes a simple method for the observer to achieve the same goals as the instructor even when the capabilities of the two differ.
• Bui et al.'s framework proposes a hierarchical model for inferring intent, and provides an efficient algorithm for performing inference in this model.

6.2 Robotic implementations of imitation learning

Despite the potential for imitation to help robots acquire new skills and to interact more richly with humans, imitation learning has, until recently, received relatively little attention in the robotics community. The following sections consider the work of several robotics groups in creating biologically-inspired robotic systems that acquire information via imitation learning.

6.2.1 Component assembly: Kuniyoshi's "Learning by Watching" system

Kuniyoshi et al. proposed a system for imitating human instructors performing simple assembly tasks [Kuniyoshi et al., 1994]. This work is particularly impressive since it offered an example of a robotic imitation learning system, capable of deferred replay of learned behaviors, using technology available 10 years ago. Kuniyoshi et al. called their approach "learning by watching" to distinguish it from four preexisting methods for human instructors to guide robot observers in learning a motor task:

(1) Explicit textual programming: Programming a robot directly using a programming language. This method may be necessary for extremely complex tasks where well-known, off-the-shelf control algorithms are available. It fails in situations where the problem is ill-defined or where expert programmers are unavailable to develop programs.

(2) Off-line simulation-based programming: Combines explicit textual programming with a "world simulator" (e.g., a physics-based kinematic simulator) to train robots offline. This approach requires potentially expensive, relatively rare hardware (because high-fidelity physical simulations demand extensive processing and memory capabilities). Further, the (potentially high) cost of developing a realistic world simulator software package must also be taken into account.

(3) Inductive learning: A form of reinforcement learning in which the robot learns how to move by trial and error. As Kuniyoshi et al. explain, this method is infeasible for learning complex tasks, and is really only useful for fine-tuning an already-established model.

(4) Teaching by guiding: A human instructor directly moves a robotic manipulator to show the robot correct trajectories to follow in performing a

skill. This method has seen widespread adoption in some industrial fields, but suffers from several drawbacks. It gives rise to inflexible skills (because guided trajectories are hard to debug); the work is tiresome for the human instructor; it has the potential to damage either the robotic manipulator or the human instructor if an accident occurs; and it does not give the robot a policy to follow when deviations from the programmed trajectory occur (for example, owing to slop in the motors of the manipulator, or to unforeseen changes in the environment around the manipulator).

The learning by watching approach attempts to overcome the shortcomings of all these methods by allowing a robot to observe a human using his or her own hands to perform a task, then having the robot match the trajectory of the human instructor using its own manipulators. Imitation in this system follows five stages:

(1) Presentation: The instructor shows the system a target task using his or her hands.

(2) Recognition: The system uses a stereo camera system to extract 3D information about the objects being manipulated by the instructor. This knowledge is stored using a symbolic representation.

(3) Analysis: The system uses the symbolic representation derived in the previous step to build a hierarchical representation of goals, and to identify dependencies among operations.

(4) Generation: At the start of a deferred replay of the learned task, the system recognizes the initial configuration of objects in its workspace. It then uses its knowledge of the instructor's actions during training to generate a program its manipulator can execute.

(5) Execution: The robot executes the program using its manipulator.

Kuniyoshi et al.'s experiments use a domain in which groups of blocks are moved on a workspace (a flat, black table), representing an assembly task. Steps in an assembly task are identified with changes in the configuration of blocks. For example, if one block is stacked on top of another, this is considered a relevant change; another example of a relevant change would be the faces of two blocks being made coplanar with one another. Given the limited hardware available at the time, it should not be surprising that this system was only capable of assembly tasks in a very restricted domain. Nonetheless, the system is noteworthy in implementing a robotic system that learns by imitation using visual data, and that forms a hierarchical model of the plan being implemented by the instructor. Additionally, the fact that such a simple collection of feature detectors was sufficient to perform relatively complex tasks bodes well for current robotic systems that are designed to imitate in real time using visual inputs.

6.2.2 Socially interactive robots

Fong et al. [Fong et al., 2003] recently surveyed the state of the art in socially interactive robotics. Their work addresses several topics, including: how the physical appearance of a robotic platform affects its ability to interact socially with humans (robots that appear similar, but not identical, to humans are perceived as "uncanny," and are likely to have difficulty interacting successfully); how robot personalities should be implemented according to the functional role the robot is designed to perform; and the types of sensors and feature detectors likely to be necessary for social interaction with humans. Examples of the latter include: i) people tracking; ii) speech recognition; iii) gesture recognition; iv) face detection and recognition; v) facial expression recognition (and possibly the ability to generate the robot's own simulated facial expressions); and vi) gaze tracking for shared attention. Fong et al. also mention the importance of user modeling for social robotics, and of the ability to infer intent. They identify a number of research groups (including the well-known Cog project at MIT [Scassellati, 2000]) that believe the first step toward implementing intentionality in robots is establishing mechanisms for shared attention, particularly using gaze tracking and pointing. The survey mentions that most robotic imitation efforts thus far have focused on simple tasks, such as block stacking or pendulum balancing, and have been situated in relatively impoverished environments, often with explicit markers telling the system which stimuli are important.

Roy's CELL architecture [Roy, 1999], [Roy and Pentland, 2002] performs limited linguistic imitation and concept learning on a robotic platform. The system segments an instructor's utterances into words, and connects words to objects in the robot's visual field in real time. While impressive, the system's motor capabilities are limited, and hence so are the types of imitation available to it.

6.2.3 Robots as social agents: Breazeal and Scassellati

Breazeal and Scassellati have been involved in the Kismet and Cog projects at MIT; Kismet [Breazeal and Velasquez, 1998] (and a more recent effort, Leonardo [Breazeal et al., 2004]) is an attempt to build an articulated robotic face capable of interacting with passersby, while Cog is a humanoid torso designed to explore efficient algorithms for imitative behavior. In [Breazeal and Scassellati, 2001], Breazeal and Scassellati propose a list of challenges in building robots capable of imitating humans. These challenges include:

(1) How does the robot know when to imitate? Numerous constraints will operate on any robotic platform, requiring the robot to choose which behaviors to engage in at any point in time. This raises a problem similar to the reinforcement learning "exploration-exploitation" tradeoff: when should the robot decide that its skill set for completing a task is "good enough," and stop imitating others? How should the robot handle, for example, the case where it must decide between learning a new skill by imitation (a skill it may not get another chance to learn) and recharging its batteries so it can continue operating?

(2) How does the robot know what to imitate? Given a choice of several possible instructors, which should the robot imitate? In a more general sense, how should the robot divide up its computational resources and sensor bandwidth so as to maximize its future reward? We are currently pursuing an information-theoretic account of this problem, which we call active imitation learning.

(3) How does the robot map observed actions into behavioral responses? This question again addresses the correspondence problem of how to map between the observer's and the instructor's bodies.

(4) How does the robot evaluate its actions, correct errors, and recognize when it has achieved its goal? This is essentially a credit assignment problem: how does the robot know where it went wrong when it fails to replicate a demonstrated task? This challenge also involves knowledge of side effects and contextual constraints. One example given by Breazeal and Scassellati involves a robot opening a jar: does the robot still succeed if it opens the jar by smashing it?

Breazeal and Scassellati’s approach employs three key ideas. First, knowing what to imitate is determined by saliency. Saliency, in turn, is determined by inherent object properties (color, motion, etc.), by context (food is more salient when an agent is hungry, for example), and the instructor’s attention (via gaze, pointing, touching an object, etc.). Second, Breazeal and Scassellati constrain the state and action spaces they must consider by assuming a similar morphology between the instructor and the observer. Third, they propose using prior knowledge about how social interactions typically proceed to recognize when failures occur, and to assist in determining saliency. They conclude by proposing, at a high level, a software architecture for performing real-time robotic imitation learning.

Breazeal and Scassellati’s work is important in raising several issues that robotic imitation learning platforms must address. However, their approach to solving these problems lacks several desirable traits for performing complex imitation tasks in stochastic environments. First, their methods do not appear to be based on a unifying computational framework (such as the Bayesian framework) to perform imitation. Their methods seem to be deterministic rather than probabilistic—Kismet’s behaviors and expressions are relatively fixed, and deterministically-valued computations drive its attention. Lack of an overarching computational principle makes their system’s results difficult to compare with results from the cognitive literature. 49

6.2.4 Computational frameworks for robotic imitation learning

Schaal et al. [Schaal et al., 2003] provide a further review of the robotics literature, this time with a focus on the motor aspects of imitation learning. In their view, as in the reinforcement learning literature, imitation is a problem of finding the correct control policy π that determines a motor output vector u, given an internal state vector z, a time t, and a set of (possibly learned) parameters α:

u(t) = π(z(t), t, α)    (9)

This type of policy, which explicitly depends on the time t, is called a non-autonomous policy. A policy of the form:

u(t) = π(z(t), α)    (10)

is called an autonomous policy. Autonomous policies are, in some sense, more useful than non-autonomous policies because they are more robust: an autonomous policy can be goal-directed, whereas a non-autonomous policy may encounter problems if it is forced to deviate from its predefined time course. This highlights a problem with systems such as that of Kuniyoshi et al. [Kuniyoshi et al., 1994]: although non-autonomous policies may be admissible in tightly-controlled environments like a factory floor, they are inappropriate for social interaction, where the environment and the behaviors of other agents are difficult to predict in terms of an explicit timeline. An additional requirement for any imitation task is the specification of a cost function J (similar to the reward function R seen in reinforcement learning):

J = g(z(t), u(t), t)    (11)
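The distinction drawn in Eqns. 9–11 can be summarized schematically: a non-autonomous policy consults an explicit clock, an autonomous one does not, and a cost function scores a state/command pair. The quadratic cost and parameter names below are hypothetical stand-ins, not a cost function proposed by Schaal et al.

```python
import numpy as np

def nonautonomous_policy(z, t, alpha):
    """u(t) = pi(z(t), t, alpha): e.g., track a time-indexed reference trajectory."""
    return alpha["reference"][t] - z

def autonomous_policy(z, alpha):
    """u(t) = pi(z(t), alpha): e.g., feedback toward a goal state, no explicit time."""
    return alpha["gain"] * (alpha["goal"] - z)

def cost_J(z, u, goal, effort_weight=0.01):
    """A hypothetical g(z(t), u(t), t): goal error plus a control-effort penalty."""
    return float(np.sum((z - goal) ** 2) + effort_weight * np.sum(u ** 2))
```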

Schaal et al. mention the difficulty of defining J so that the observer's behavior remains goal-directed while staying focused enough on the instructor's trajectory that the observer does not waste its resources needlessly exploring alternative solutions. They mention, for example, specifying J so as to penalize deviations from the observed trajectory. Based on both the cognitive literature and computational considerations, however, this concept of J may require revision. Infants and young children clearly focus on the goal of the imitation rather than on a specific trajectory (e.g., [Gleissner et al., 2000]); computationally, it seems to create unnecessary complexity to include both the reward (or cost) for achieving or missing the goal and the cost of adherence to the demonstrated trajectory in one function J. Rather, this argues once again for a hierarchical representation. Each level of the hierarchy would have its own cost function. At the lowest level of the hierarchy, the cost function might include terms to minimize the amount of energy used to replay a motor trajectory, or to minimize the amount of strain on the observer's motors, etc. At this lowest level, control policies could be non-autonomous, more akin to built-in reflexes in being driven only by feedforward control. Successively higher levels of the hierarchy would encode successively more abstract subgoals; each subgoal node in the hierarchy could potentially have its own cost function. At the topmost level of the hierarchy, the system-wide reward function R would drive the observer's behaviors.

Schaal et al. take the view that solutions to motor problems can be viewed as solutions to systems of differential equations in one of two possible coordinate spaces: either the "internal" (joint or muscle) space,

θ̇ = π(z(t), t, α)    (12)

or the "external" (task coordinate) space,

ẋ = π(z(t), t, α)    (13)

Using differential equations with known attractors allows for the formation of autonomous policies, as opposed to earlier approaches [Miyamoto and Kawato, 1998], [Miyamoto et al., 1996] that used parameterized splines to guide motor trajectories and can only encode non-autonomous policies. The representation (internal or external) used for motor planning has a profound effect on the correspondence problem mentioned earlier. If the coordinates are in external space, the observer and instructor are in total agreement about what the coordinates of any given object should be, and only a linear transformation is necessary to convert from one frame to the other. Using external coordinates can be problematic, however, in cases where the dynamics of a trajectory (not just the static placement of limbs) are important, or where the effectors of observer and instructor differ widely. In such cases, an internal representation seems to be more economical. Schaal et al. cite recent work by Alissandrakis et al. [Alissandrakis et al., 2002b] in which an algorithm is developed for finding and reaching subgoals via alternate trajectories when the observer has different action capabilities from the instructor. This work bears similarities to Price's k-step repair algorithm described above.
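The attractor idea can be illustrated with a minimal sketch: a critically damped point attractor pulls the state toward the goal from wherever it is perturbed, with no explicit clock, which is exactly the robustness property claimed for autonomous policies. The gains and integration scheme are our own illustrative choices, not a method from the works cited above.

```python
import numpy as np

def attractor_policy(x, v, goal, k=25.0, dt=0.01):
    """One Euler step of x_ddot = k*(goal - x) - 2*sqrt(k)*v (critically damped)."""
    a = k * (goal - x) - 2.0 * np.sqrt(k) * v
    v = v + dt * a
    x = x + dt * v
    return x, v

x, v, goal = np.array([0.0]), np.array([0.0]), np.array([1.0])
for _ in range(1000):
    x, v = attractor_policy(x, v, goal)   # converges to the goal even if x is perturbed
```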

6.2.5 Summary

Few research groups in the robotics community have recognized the importance of imitation learning for skill acquisition. The groups that are working in this area have primarily concentrated on learning simple motor sequences

(tennis serves, etc.), or have focused on tasks in tightly controlled environments (e.g., assembly tasks using blocks on a clean table). Significant contributions could be made in the area of hierarchical imitative learning models implemented on a robotics platform. Additionally, to date, most of the robotic systems that use imitation employ deterministic control policies. Significant gains could also be made by implementing robots that perform robust imitation learning using probabilistic models. For example, many real-world dynamics are inherently unpredictable: effectors slip, motors heat up and become unreliable, manipulated objects can be slippery or unwieldy, and the actions executed by other agents may not adhere to any deterministic policy. These concerns motivate our reliance on model-based Bayesian methods for imitative computations.

7 Conclusion

We have described a Bayesian framework for imitation learning, based on the AIM model of imitation learning by Meltzoff and Moore. The framework emphasizes imitation as a "match-to-target" task, and promotes separation between the dynamics of the environment and the policy a particular teacher chooses to employ in reaching a goal. We have sketched the basic components of an imitation learning system operating in realistic large-scale environments with stochastic dynamics and noisy sensor observations. Our model naturally leads to a Bayesian algorithm for inferring the intent of other agents. We presented results of applying our framework to a simulated maze task and to gaze following in an active stereo vision robotic head. We are currently investigating the ability of the framework to scale up to more complex robotic

imitation tasks in real-world environments. The components of our framework described thus far suggest several avenues for interaction between cognitive psychology and robotics research. For example:

(1) Development of a "robotic Turing test": Because robots' external appearances and degree of contingent interaction are so easily manipulated, we intend to explore the question of how infants distinguish between animate and inanimate objects (or objects worthy/unworthy of interaction). For example, using a robotic platform like the humanoid robot HOAP-2 (Fig. 9), we can ask the question, "What is the minimal amount of human-like bodily or facial features children need to consider the robot as another being?" Similarly, we can ask whether it helps for the robot to interact contingently with the child to be considered a fellow intelligent agent. These questions could be assessed, for example, by seeing whether the child engages in play or other social behavior with the robot, or feels empathy for the robot. If the robot "breaks down," does the child display an emotional response?

(2) Cognitive studies in intent inference: A study involving older children might allow us to manipulate each of the four distributions that underlie our intent inference algorithm (Eqn. 4) to confirm or refute our Bayesian account of how infants infer the intent of others. As an example, we envision showing children videos of an adult manipulating a toy with several possible end states. We would first allow a child to observe video of an adult actor manipulating the toy through a sequence of states. During testing, as we manipulate each of the terms in the equation (the forward dynamics of the toy, the instructor's policy or preferences for actions in manipulating the toy, and the saliency or reward model for each possible state of the toy, respectively), we can ask the child about his or her estimate of what the adult is trying to accomplish.

(3) Determining the character of prior models: A key component in our Bayesian algorithm for action selection (Eqn. 1) is the shape of the prior model P(at|st, sG). We anticipate that cognitive studies of children interacting with simple toys or other common objects with well-defined, discrete sets of possible actions will help us determine the structure of these distributions. Also of interest is how these distributions change over time as a child interacts more and more with an object: do expert users of an object have significantly different priors over the actions they select compared to novice users?

The connection between cognitive and robotics studies will undoubtedly become stronger in the near future, as cognitive findings provide empirical constraints on machine learning and control algorithms for imitation. At the same time, we anticipate that robotic systems will facilitate more informative cognitive studies by providing flexible, easily manipulable platforms for contingent interaction with infants, children, and other intelligent agents.

8 Acknowledgements

This research was supported by an ONR Young Investigator award and an NSF CAREER grant to RPNR, and by NIH grant HD-22514 and an NSF LIFE Science of Learning Center grant to ANM.


References

[Alissandrakis et al., 2002a] Alissandrakis, A., Nehaniv, C. L., and Dautenhahn, K. (2002a). Do as I do: correspondences across different robotic embodiments. In Polani, D., Kim, J., and Martinetz, T., editors, Proc. Fifth German Workshop on Artificial Life, pages 143–152.
[Alissandrakis et al., 2002b] Alissandrakis, A., Nehaniv, C. L., and Dautenhahn, K. (2002b). Imitating with ALICE: Learning to imitate corresponding actions across dissimilar embodiments. IEEE Trans. on Systems, Man, and Cybernetics, Part A: Systems and Humans, 32:482–496.
[Attneave, 1954] Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61(3):183–193.
[Barto et al., 1983] Barto, A., Sutton, R., and Anderson, C. (1983). Neuronlike elements that can solve difficult control learning problems. IEEE Trans. on Systems, Man, and Cybernetics, 13:835–846.
[Bellman, 1957] Bellman, R. E. (1957). Dynamic Programming. Princeton: Princeton University Press.
[Bertsekas, 1987] Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs: Prentice-Hall.
[Billard and Matarić, 2001] Billard, A. and Matarić, M. J. (2001). Learning human arm movements by imitation: Evaluation of a biologically inspired connectionist architecture. Robotics and Autonomous Systems, 37(2–3):145–160.
[Bishop, 1995] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press.
[Blakemore et al., 1998] Blakemore, S. J., Goodbody, S. J., and Wolpert, D. M. (1998). Predicting the consequences of our own actions: the role of sensorimotor context estimation. J. Neurosci., 18(18):7511–7518.
[Breazeal et al., 2004] Breazeal, C., Buchsbaum, D., Gray, J., Gatenby, D., and Blumberg, B. (2004). Learning from and about others: towards using imitation to bootstrap the social understanding of others by robots (to appear). Artificial Life (special issue).
[Breazeal and Scassellati, 2001] Breazeal, C. and Scassellati, B. (2001). Challenges in building robots that imitate people. In Dautenhahn, K. and Nehaniv, C., editors, Imitation in Animals and Artifacts. MIT Press.
[Breazeal and Velasquez, 1998] Breazeal, C. and Velasquez, J. (1998). Toward teaching a robot "infant" using emotive communication acts. In Proc. 1998 Simulation of Adaptive Behavior, Workshop on Socially Situated Intelligence, pages 25–40.
[Brooks and Meltzoff, 2002] Brooks, R. and Meltzoff, A. N. (2002). The importance of eyes: How infants interpret adult looking behavior. Developmental Psychology, 38:958–966.
[Brooks and Meltzoff, 2004] Brooks, R. and Meltzoff, A. N. (2004). The development of gaze following and its relation to language. Developmental Science (in press).
[Buccino et al., 2001] Buccino, G., Binofski, F., Fink, G. R., Fadiga, L., Fogassi, L., Gallese, V., Seitz, R. J., Zilles, K., Rizzolatti, G., and Freund, H.-J. (2001). Action observation activates premotor and parietal areas in a somatotopic manner: an fMRI study. European Journal of Neuroscience, 13:400–404.
[Bui et al., 2002] Bui, H. H., Venkatesh, S., and West, G. (2002). Policy recognition in the Abstract Hidden Markov Model. Journal of Artificial Intelligence Research, 17:451–499.
[Byrne and Russon, 2003] Byrne, R. W. and Russon, A. E. (2003). Learning by imitation: a hierarchical approach. Behavioral and Brain Sciences.
[Casella and Robert, 1996] Casella, G. and Robert, C. P. (1996). Rao-Blackwellisation of sampling schemes. Biometrika, pages 81–94.
[Dean and Kanazawa, 1989] Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142–150.
[Dearden et al., 1999] Dearden, R., Friedman, N., and Andre, D. (1999). Model-based Bayesian exploration. In Proc. UAI-99, pages 150–159.
[Dearden et al., 1998] Dearden, R., Friedman, N., and Russell, S. (1998). Bayesian Q-learning. In AAAI-98, pages 761–768.
[Demiris and Hayes, 2002] Demiris, Y. and Hayes, G. (2002). Imitation as a dual-route process featuring predictive and learning components: a biologically-plausible computational model. In Dautenhahn, K. and Nehaniv, C., editors, Imitation in Animals and Artifacts, chapter 13, pages 327–361. MIT Press.
[di Pellegrino et al., 1992] di Pellegrino, G., Fadiga, L., Fogassi, L., Gallese, V., and Rizzolatti, G. (1992). Understanding motor events: a neurophysiological study. Experimental Brain Research, 91:176–180.
[Efros et al., 2003] Efros, A. A., Berg, A. C., Mori, G., and Malik, J. (2003). Recognizing action at a distance. In ICCV '03, pages 726–733.
[Fong et al., 2002] Fong, T., Nourbakhsh, I., and Dautenhahn, K. (2002). A survey of socially interactive robots. Robotics and Autonomous Systems, 42(3–4):142–166.
[Fong et al., 2003] Fong, T., Nourbakhsh, I., and Dautenhahn, K. (2003). A survey of socially interactive robots. Robotics and Autonomous Systems, 42:143–166.
[Fox et al., 2003] Fox, D., Hightower, J., Liao, L., Schulz, D., and Borriello, G. (2003). Bayesian filtering for location estimation. IEEE Pervasive Computing.
[Gentner et al., 2001] Gentner, D., Holyoak, K. J., and Kokinov, B. K., editors (2001). The Analogical Mind: Perspectives from Cognitive Science. MIT Press.
[Ghahramani and Wolpert, 1995] Ghahramani, Z. and Wolpert, D. (1995). Modular decomposition in visuomotor learning. Nature, 386:392–395.
[Gleissner et al., 2000] Gleissner, B., Meltzoff, A. N., and Bekkering, H. (2000). Children's coding of human action: Cognitive factors influencing imitation in 3-year-olds. Developmental Science, 3:405–414.
[Gordon et al., 1993] Gordon, N. J., Salmond, D. J., and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings on Radar and Signal Processing, 140:107–113.
[Gross, 1992] Gross, C. G. (1992). Representation of visual stimuli in inferior temporal cortex. In Bruce, V., Cowey, A., and Ellis, A. W., editors, Processing the Facial Image, pages 3–10. New York: Oxford University Press.
[Hanna and Meltzoff, 1993] Hanna, E. and Meltzoff, A. N. (1993). Peer imitation by toddlers in laboratory, home, and day-care contexts: Implications for social learning and memory. Developmental Psychology, 29:701–710.
[Haruno et al., 2000] Haruno, M., Wolpert, D., and Kawato, M. (2000). MOSAIC model for sensorimotor learning and control. Neural Computation, 13:2201–2222.
[Herrnstein, 1961] Herrnstein, R. J. (1961). Relative and absolute strength of responses as a function of frequency of reinforcement. J. Exp. Anal. Behaviour, 4:267–272.
[Hinton and Ghahramani, 1997] Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. Royal Soc. B, 352:1177–1190.
[Isard and Blake, 1998] Isard, M. and Blake, A. (1998). Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28.
[Itti et al., 1998] Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE PAMI, 20(11):1254–1259.
[Johnson and Demiris, 2004] Johnson, M. and Demiris, Y. (2004). Abstraction in recognition to solve the correspondence problem for robotic imitation (to appear). In Proceedings of TAROS.
[Jordan and Rumelhart, 1992] Jordan, M. I. and Rumelhart, D. E. (1992). Forward models: supervised learning with a distal teacher. Cognitive Science, 16:307–354.
[Kaelbling et al., 1996] Kaelbling, L. P., Littman, L. M., and Moore, A. W. (1996). Reinforcement learning: A survey. J. Artificial Intelligence Res., 4:237–285.
[Krebs and Kacelnik, 1991] Krebs, J. R. and Kacelnik, A. (1991). Decision making. In Krebs, J. R. and Davies, N. B., editors, Behavioural Ecology (3rd edition), pages 105–137. Blackwell Scientific Publishers.
[Kuniyoshi et al., 1994] Kuniyoshi, Y., Inaba, M., and Inoue, H. (1994). Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Trans. Robotics and Automation, 10:799–822.
[Littman et al., 1995] Littman, M. L., Dean, T. L., and Kaelbling, L. P. (1995). On the complexity of solving Markov decision problems. In Proc. UAI-95, pages 394–402.
[Lungarella and Metta, 2003] Lungarella, M. and Metta, G. (2003). Beyond gazing, pointing, and reaching: a survey of developmental robotics. In EPIROB '03, pages 81–89.
[Meltzoff, 1988a] Meltzoff, A. N. (1988a). Imitation of televised models by infants. Child Development, 59:1221–1229.
[Meltzoff, 1988b] Meltzoff, A. N. (1988b). Infant imitation after a 1-week delay: Long-term memory for novel acts and multiple stimuli. Developmental Psychology, 24:470–476.
[Meltzoff, 1990] Meltzoff, A. N. (1990). Foundations for developing a concept of self: the role of imitation in relating self to other and the value of social mirroring, social modeling, and self practice in infancy. In Cicchetti, D. and Beeghly, M., editors, The Self in Transition: Infancy to Childhood, pages 139–164. Chicago: University of Chicago Press.
[Meltzoff, 1995] Meltzoff, A. N. (1995). Understanding the intentions of others: Re-enactment of intended acts by 18-month-old children. Developmental Psychology, 31:838–850.
[Meltzoff, 1999] Meltzoff, A. N. (1999). Origins of theory of mind, cognition, and communication. Journal of Communication Disorders, 32:251–269.
[Meltzoff, 2002] Meltzoff, A. N. (2002). Elements of a developmental theory of imitation. In Meltzoff, A. N. and Prinz, W., editors, The Imitative Mind: Development, Evolution, and Brain Bases, pages 19–41. Cambridge: Cambridge University Press.
[Meltzoff and Moore, 1977] Meltzoff, A. N. and Moore, M. K. (1977). Imitation of facial and manual gestures by human neonates. Science, 198:75–78.
[Meltzoff and Moore, 1983] Meltzoff, A. N. and Moore, M. K. (1983). Newborn infants imitate adult facial gestures. Child Development, 54:702–709.
[Meltzoff and Moore, 1992] Meltzoff, A. N. and Moore, M. K. (1992). Early imitation within a functional framework: the importance of person identity, movement, and development. Infant Behavior and Development, 15:479–505.
[Meltzoff and Moore, 1994] Meltzoff, A. N. and Moore, M. K. (1994). Imitation, memory, and the representation of persons. Infant Behavior and Development, 17:83–99.
[Meltzoff and Moore, 1997] Meltzoff, A. N. and Moore, M. K. (1997). Explaining facial imitation: a theoretical model. Early Development and Parenting, 6:179–192.
[Miall and Wolpert, 1996] Miall, R. and Wolpert, D. (1996). Forward models for physiological motor control. Neural Networks, 8(9):1265–1279.
[Miyamoto and Kawato, 1998] Miyamoto, H. and Kawato, M. (1998). A tennis serve and upswing learning robot based on bi-directional theory. Neural Networks, 11:1331–1344.
[Miyamoto et al., 1996] Miyamoto, H., Schaal, S., Gandolfo, F., Koike, Y., Osu, R., Nakano, E., Wada, Y., and Kawato, M. (1996). A Kendama learning robot based on bi-directional theory. Neural Networks, 9:1281–1302.
[Nehaniv and Dautenhahn, 1998] Nehaniv, C. and Dautenhahn, K. (1998). Mapping between dissimilar bodies: affordances and the algebraic foundations of imitation. In Demiris, J. and Birk, A., editors, Proc. European Workshop on Learning Robots.
[Nehaniv and Dautenhahn, 2002] Nehaniv, C. and Dautenhahn, K. (2002). The correspondence problem. In Imitation in Animals and Artifacts, pages 41–61. MIT Press.
[Nicholson and Brady, 1992] Nicholson, A. E. and Brady, J. M. (1992). The data association problem when monitoring robot vehicles using dynamic belief networks. In Proc. Tenth European Conf. on Artificial Intelligence, pages 689–693.
[Olshausen and Field, 1996] Olshausen, B. A. and Field, D. J. (1996). Natural image statistics and efficient coding. Network, 7:333–339.
[Parr and Russell, 1997] Parr, R. and Russell, S. (1997). Reinforcement learning with hierarchies of machines. In Advances in NIPS '97.
[Patterson et al., 2003] Patterson, D., Liao, L., Fox, D., and Kautz, H. (2003). Inferring high level behavior from low level sensors. In UBICOMP '03.
[Pearl, 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.
[Price, 2003] Price, B. (2003). Accelerating Reinforcement Learning with Imitation. PhD thesis, University of British Columbia.
[Rao and Meltzoff, 2003] Rao, R. P. N. and Meltzoff, A. N. (2003). Imitation learning in infants and robots: Towards probabilistic computational models. In Proc. AISB.
[Rao et al., 2004] Rao, R. P. N., Shon, A. P., and Meltzoff, A. N. (2004). A Bayesian model of imitation in infants and robots (to appear). In Imitation and Social Learning in Robots, Humans, and Animals. Cambridge University Press.
[Rizzolatti et al., 2002] Rizzolatti, G., Fadiga, L., Fogassi, L., and Gallese, V. (2002). From mirror neurons to imitation, facts, and speculation. In Meltzoff, A. N. and Prinz, W., editors, The Imitative Mind: Development, Evolution, and Brain Bases, pages 247–266. Cambridge University Press.
[Rizzolatti et al., 2000] Rizzolatti, G., Fogassi, L., and Gallese, V. (2000). Mirror neurons: intentionality detectors? Int. J. Psychol., 35:205.
[Roweis and Saul, 2000] Roweis, S. and Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.
[Roy and Pentland, 2002] Roy, D. and Pentland, A. (2002). Learning words from sights and sounds: A computational model. Cognitive Science, 26(1):113–146.
[Roy, 1999] Roy, D. K. (1999). Learning from sights and sounds: a computational model. PhD thesis, Massachusetts Institute of Technology.
[Rubin, 1988] Rubin, D. B. (1988). Using the SIR algorithm to simulate posterior distributions. In Bernardo, M., DeGroot, K., Lindley, D., and Smith, A., editors, Bayesian Statistics 3, pages 395–402. Oxford, UK: Oxford University Press.
[Scassellati, 1999] Scassellati, B. (1999). Imitation and mechanisms of joint attention: A developmental structure for building social skills on a humanoid robot. Lecture Notes in Computer Science, 1562:176–195.
[Scassellati, 2000] Scassellati, B. (2000). Investigating models of social development using a humanoid robot. In Webb, B. and Consi, T., editors, Biorobotics. Cambridge, MA: MIT Press.
[Schaal et al., 2003] Schaal, S., Ijspeert, A., and Billard, A. (2003). Computational approaches to motor learning by imitation. Phil. Trans. Royal Soc. London: Series B, 358:537–547.
[Schiele and Pentland, 1999] Schiele, B. and Pentland, A. (1999). Probabilistic object recognition and localization. In ICCV '99, volume I, pages 177–182.
[Schulz et al., 2003] Schulz, D., Burgard, W., Fox, D., and Cremers, A. B. (2003). People tracking with a mobile robot using sample-based joint probabilistic data association filters. Int. J. Robotics Res.
[Shon et al., 2003] Shon, A. P., Grimes, D. B., Baker, C. L., and Rao, R. P. N. (2003). Bayesian imitation learning in a robotic head. In NIPS (demonstration track).
[Spirtes et al., 2001] Spirtes, P., Glymour, C., and Scheines, R. (2001). Causation, Prediction, and Search. MIT Press, 2nd edition.
[Sutton, 1988] Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.
[Sutton et al., 1999] Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.
[Tenenbaum, 1999] Tenenbaum, J. (1999). Bayesian modeling of human concept learning. In Kearns, M. S., Solla, S. A., and Cohn, D. A., editors, Advances in Neural Information Processing Systems, volume 11, pages 59–68. MIT Press, Cambridge, MA.
[Tenenbaum et al., 2000] Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.
[Thrun et al., 1999] Thrun, S., Langford, J. C., and Fox, D. (1999). Monte Carlo hidden Markov models: Learning non-parametric models of partially observable stochastic processes. In Proc. 16th International Conf. on Machine Learning, pages 415–424. Morgan Kaufmann, San Francisco, CA.
[Todorov and Ghahramani, 2003] Todorov, E. and Ghahramani, Z. (2003). Unsupervised learning of sensory-motor primitives. In Proc. 25th IEEE EMB, pages 1750–1753.
[Visalberghy and Fragaszy, 1990] Visalberghy, E. and Fragaszy, D. (1990). Do monkeys ape? In "Language" and intelligence in monkeys and apes: comparative developmental perspectives, pages 247–273. Cambridge University Press.
[Whiten, 2004] Whiten, A. (2004). The imitator's representation of the imitated: ape and child. In Meltzoff, A. N. and Prinz, W., editors, The Imitative Mind: Development, Evolution, and Brain Bases, pages 98–121. Cambridge University Press.
[Wolpert et al., 1995] Wolpert, D., Ghahramani, Z., and Jordan, M. (1995). An internal model for sensorimotor integration. Science, 269:1880–1882.
[Wolpert and Kawato, 1998] Wolpert, D. and Kawato, M. (1998). Multiple paired forward and inverse models for motor control. Neural Networks, 11:1317–1329.
[Wu et al., 2000] Wu, Y., Toyama, K., and Huang, T. (2000). Wide-range, person- and illumination-insensitive head orientation estimation. In AFGR00, pages 183–188.

Fig. 7. Probabilistic models for the robotic head: (a) Discrete empirical distribution of deviation (in degrees) between intended motor states and observed motor states in the Biclops stereo head. The distribution was estimated using a training dataset of 597 example movements. If motors in the head were completely accurate, the entire distribution would display a spike at (0,0). (b) Distribution of residual error after model learning. A linear model was learned in a maximum likelihood fashion from the training data. The residual error distribution is marginally Gaussian, an important assumption of the model. (c) Gaussian approximation of the discretized distribution shown in (b). (d) Gaussian approximation of errors on a cross-validation set of 896 testing movements resembles that of the training set. (e-g) Example of an inverse model computation. (e) States used in this example were: st = (−40, −30), st+1 = (−10, −10), and sG = (10, 10). (f) Policy model P (at |st , sG ). The grid depicts the policy that only actions yielding next states closer to the goal than the current state are allowed, i.e., given non-zero probability by the policy model. (g) Inverse model P (at |st , st+1 , sG ). The distribution shows the likelihood of each action to move the Biclops head toward st+1 . This distribution is sharply peaked around the “optimal” action at = (30.36, 19.63).


Fig. 8. Gaze tracking in a robotic head: (a) (Left) A Biclops active stereo vision head from Metrica, Inc. (Right) The view from the Biclops’ cameras as it executes our gaze tracking algorithm. The bounding box outlines the estimated location of the instructor’s face, while the arrow shows the instructor’s estimated gaze vector. (b) Likelihood surface for the face shown in (d), depicting the likelihood over pan and tilt angles of the subject’s head. The region of highest likelihood (the brightest region) matches the actual pan and tilt angles (black X) of the subject’s face shown in (d).


Fig. 9. Humanoid robot platform for future research: Our future cognitive and machine learning studies will employ a HOAP-2 humanoid robot from Fujitsu Automation, Ltd. The robot possesses stereo cameras, a variety of proprioceptive sensors, and 25 motor degrees of freedom.
