Hierarchical, Attentive, Multiple Models for Execution and Recognition (HAMMER)

Yiannis Demiris & Bassam Khadhouri
Biologically Inspired Autonomous Robots Team (BioART)
Department of Electrical and Electronic Engineering, Imperial College London
[email protected], [email protected]
http://www.iis.ee.ic.ac.uk/yiannis

Introduction

Following the increased interest in mechanisms that would endow robots with the capability to imitate human action, several computational architectures have been proposed to match visual information from an observed demonstrator to motor plans that would achieve the corresponding action for the observer. Internal models of the motor systems of the observer and the demonstrator, and of their capabilities, have frequently been suggested as useful tools for this matching (for a review, see [Schaal, Ijspeert and Billard, 2003]). Building on our previous work on architectures that employ multiple inverse and forward models [Demiris and Hayes 2002], we review here our recent work extending these architectures with hierarchical models that incorporate a principled method for the top-down control of attention during action perception.

The HAMMER architecture

HAMMER is organised around, and contributes towards, three concepts:

(1) The basic building block is a pair of inverse and forward models in the dual role of either executing or perceiving an action [Demiris and Hayes 2002].
(2) These building blocks are arranged in a distributed, hierarchical manner [Demiris and Johnson 2003].
(3) The limited computational and sensor resources are taken explicitly into consideration: rather than assuming that all state information is instantly available to the inverse model that requires it, requests for information are made to an attention mechanism. This provides a principled approach to top-down control of attention during imitation.

Building blocks

HAMMER makes extensive use of the concepts of inverse and forward models.
An inverse model (akin to a controller, behaviour, or action) is a function that takes as input the current state of the system and the target goal(s), and outputs the control commands needed to achieve or maintain those goal(s). Related to this concept is the forward model of a controlled system (akin to an internal predictor): a function that takes as input the current state of the system and a control command to be applied to it, and outputs the predicted next state of the controlled system.

The building block of HAMMER is an inverse model paired with a forward model (figure 1). When HAMMER is asked to rehearse or execute an action, the inverse model receives information about the current state (and, optionally, about the target goal(s)) and outputs the motor commands it believes are necessary to achieve or maintain these implicit or explicit goal(s). The forward model provides an estimate of the upcoming states should these motor commands be executed. This estimate is fed back to the inverse model, allowing it to adjust any parameters of the action (for example, to achieve different movement speeds [Demiris and Hayes 2002]).

If HAMMER is to determine whether a visually perceived demonstrated action matches a particular inverse-forward model coupling, the demonstrator's current state, as perceived by the imitator, is fed to the inverse model. The inverse model generates the motor commands it would output if it were in that state and wanted to execute this particular action; these commands are inhibited from being sent to the motor system. The forward model outputs an estimated next state, which is a prediction of what the demonstrator's next state will be.

Figure 1: the architecture's basic building block, an inverse model paired with a forward model (from Demiris & Hayes 2002, Demiris and Johnson 2003).

This predicted state is compared with the demonstrator's actual state at the next time step. The comparison yields an error signal that is used to increase or decrease the behaviour's confidence value, an indicator of how closely the demonstrated action matches this particular action in the imitator's repertoire.

Distribution and Hierarchy

HAMMER consists of multiple pairs of inverse and forward models operating in parallel [Demiris and Hayes 2002]. When the demonstrator executes an action, the perceived states are fed into all of the imitator's available inverse models. As described above, this generates multiple motor commands (representing multiple hypotheses as to which action is being demonstrated) that are sent to the corresponding forward models. The forward models generate predictions about the demonstrator's next state; these are compared with the demonstrator's actual state at the next time step, and the resulting error signals adjust the confidence values of the inverse models. At the end of the demonstration (or earlier if required), the inverse model with the highest confidence value, i.e. the closest match to the demonstrator's action, is selected. This architecture has been implemented in real-dynamics robot simulations [Demiris and Hayes 2002] and on real robots [Demiris and Johnson 2003, Johnson and Demiris 2004], and has offered plausible explanations and testable predictions regarding the behaviour of biological imitation mechanisms in humans and monkeys (reviewed in [Demiris and Johnson 2005]).
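The recognition mechanism above can be illustrated with a minimal, hypothetical Python sketch (not the architecture's actual implementation): a one-dimensional state, inverse models realised as proportional controllers towards invented goal positions, and a trivial forward model. Several inverse-forward pairs observe the same demonstration in parallel, and each pair's confidence rises or falls with its prediction error; the gains, the error threshold, and the unit confidence increments are all illustrative choices.

```python
class InverseModel:
    """Maps the current state to a motor command; here a simple
    proportional controller driving the state towards a goal."""
    def __init__(self, goal, gain=0.5):
        self.goal = goal
        self.gain = gain

    def command(self, state):
        return self.gain * (self.goal - state)


class ForwardModel:
    """Predicts the next state given the current state and a motor command."""
    def predict(self, state, command):
        return state + command  # the command is treated as a state increment


class ModelPair:
    """Inverse-forward pair whose confidence tracks its prediction error."""
    def __init__(self, inverse, forward):
        self.inverse = inverse
        self.forward = forward
        self.confidence = 0.0

    def observe(self, state, next_state):
        cmd = self.inverse.command(state)            # command it *would* issue (inhibited)
        predicted = self.forward.predict(state, cmd)
        error = abs(predicted - next_state)
        # small prediction error raises confidence, large error lowers it
        self.confidence += 1.0 if error < 0.1 else -1.0
        return error


# Demonstrator moves towards position 10.0; three rival hypotheses observe it.
demo = [0.0]
for _ in range(20):
    demo.append(demo[-1] + 0.5 * (10.0 - demo[-1]))

pairs = {goal: ModelPair(InverseModel(goal), ForwardModel()) for goal in (0.0, 5.0, 10.0)}
for state, next_state in zip(demo, demo[1:]):
    for pair in pairs.values():
        pair.observe(state, next_state)

# The hypothesis whose predictions matched the demonstration wins.
recognised = max(pairs, key=lambda g: pairs[g].confidence)
```

Only the pair whose inverse model shares the demonstrator's goal predicts the observed trajectory, so its confidence dominates and it is selected as the recognised action.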
More recently we have designed and implemented a hierarchical extension to this arrangement [Demiris and Johnson 2003], shown in figure 2: primitive inverse models are combined to form higher-level, more complex sequences, with the eventual goal of arriving at increasingly abstract inverse models [Johnson and Demiris 2004], thus helping to deal with the correspondence problem. This aspect of the architecture is described in detail in [Demiris and Johnson 2003]. We have also completed experiments on learning forward and inverse models through motor babbling [Dearden and Demiris, 2005].

Figure 2: HAMMER can build hierarchies of composite inverse & forward models by arranging them sequentially or in parallel [Demiris and Johnson 2003].
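One way to read the sequential composition of figure 2 as code is sketched below. This is a hypothetical illustration, not the published implementation: a composite inverse model chains primitives and exposes the same interface as a primitive, so composites can themselves be nested into higher-level composites. The 1-D state, goals, gains, and tolerances are invented for the example.

```python
class Primitive:
    """A primitive inverse model driving the state towards its goal."""
    def __init__(self, goal, tol=0.1, gain=0.5):
        self.goal, self.tol, self.gain = goal, tol, gain

    def command(self, state):
        return self.gain * (self.goal - state)

    def achieved(self, state):
        return abs(self.goal - state) < self.tol


class Sequence:
    """Composite inverse model: runs its children one after another.
    It exposes the same command()/achieved() interface as a primitive,
    so sequences can be nested inside higher-level composites."""
    def __init__(self, children):
        self.children, self.index = list(children), 0

    def command(self, state):
        # hand over to the next child once the current one has finished
        if self.children[self.index].achieved(state) and self.index < len(self.children) - 1:
            self.index += 1
        return self.children[self.index].command(state)

    def achieved(self, state):
        return self.index == len(self.children) - 1 and self.children[-1].achieved(state)


# "Reach then retract": move the (1-D) hand to 10.0, then back to 0.0.
behaviour = Sequence([Primitive(10.0), Primitive(0.0)])
state = 0.0
for _ in range(60):
    state += behaviour.command(state)
```

Because a `Sequence` looks like any other inverse model from the outside, the same confidence-based recognition machinery can run over composites as well as primitives, which is what allows the hierarchy to recognise abstract actions.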

Top-down control of attention

The architecture as stated so far assumes that complete state information is available to, and fed to, all the available inverse models. Since each inverse model requires only a subset of the global state information (for example, one might need only the arm position rather than full body state information), we can optimise this process by allowing each inverse model to request the relevant subset from an attention mechanism, thus exerting top-down control over it. Since HAMMER is inspired by the "simulation theory of mind" view of action perception, the information it tries to extract for a given behaviour during the demonstration is the state of the variables it would control if it were executing that behaviour itself. Apart from reducing the resource requirements of the architecture, this approach provides a principled way of supplying top-down signals to attention: the saliency of each request can be a function of the confidence of the requesting inverse model, removing the need for ad-hoc ways of computing the saliency of top-down requests. Top-down control can then be integrated with saliency information from the stimuli themselves, allowing a control decision to be made as to where to focus the observer's attention. An overall diagram is shown in figure 3.

Figure 3: inverse models submit requests to the attention mechanism, exerting top-down control.

Strategies for selecting among the different requests include "equal time sharing", "highest priority first", or other suitable resource-scheduling algorithms. In the experiments reported below, we summarise various strategies for incorporating top-down saliency and their effects on action recognition.

Experiments

We implemented this architecture on a PeopleBot mobile robot equipped with a camera and a gripper. A number of inverse models were implemented, including opening and closing the gripper, moving towards an object, picking up objects, and transporting objects [Demiris and Johnson 2003, Johnson and Demiris 2004]. Simple forward models were also implemented, based on the kinematics of the various movements. In the graphs below, we show the progression of the confidence values over time when the demonstrator performs a "pick apple" behaviour (B1), with the observer robot having eight behaviours in its repertoire (B1: pick apple, B2: move apple, B3: move hand away from apple, B4-B6: as B1-B3 but with a pen, B7: drop apple, and B8: drop pen). Figure 4a shows the confidence progression without any attention influences, while figure 4b shows the influence of the attention mechanism. In figure 4b we use a combination of "equal time sharing" followed by "highest confidence first" for allocating state data to the requesting inverse models, the combination that worked best in this set of experiments. Intuitively, this corresponds to "playing it safe" initially (allocating resources equally) until the separability between the models allows a switch to a more aggressive strategy that ignores requests from inverse models that are not confident. Note that, since behaviours do not always get the information they need, they typically end up with a lower confidence level than if they received all the information they needed at each time step, but with significant computational savings. We discuss how this top-down (goal-driven) mechanism interacts with bottom-up (stimulus-driven) attention in [Khadhouri and Demiris, 2005].

Figure 4a: no attention mechanism

Figure 4b: “Equal time sharing” followed by “Highest confidence has higher priority”
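The two-phase allocation strategy of figure 4b, "equal time sharing" followed by "highest confidence first", can be sketched as a scheduling function. This is a hypothetical illustration under invented assumptions: the behaviour names, the requested features, the warm-up length, and the confidence values are all made up for the example.

```python
def allocate_attention(requests, confidences, t, warmup=10):
    """Decide which inverse model's state request the attention system
    serves at time step t.  Strategy: equal time sharing during a warm-up
    phase, then highest confidence first."""
    models = sorted(requests)                      # stable ordering of requesters
    if t < warmup:
        chosen = models[t % len(models)]           # round-robin: equal time sharing
    else:
        chosen = max(models, key=lambda m: confidences[m])  # serve the best hypothesis
    return chosen, requests[chosen]


# Three behaviours request different parts of the state; B1 is most confident.
requests = {"B1": "apple position", "B2": "hand position", "B3": "gripper state"}
confidences = {"B1": 5.0, "B2": 1.0, "B3": -2.0}
served = [allocate_attention(requests, confidences, t)[0] for t in range(12)]
```

During the warm-up the three requests are served in turn; once the models have separated, only the confident hypothesis keeps receiving the state information it asks for, which is why the losing behaviours in figure 4b plateau at lower confidence levels.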

Epilogue

We have reviewed our approach to developing architectures that incorporate distributed, hierarchical networks of inverse and forward models, and described how HAMMER can be used to perceive a demonstrated action. The novelty of our approach lies in the idea that the features the observer chooses to attend to are the ones that he or she would have to control when performing the same action in the same situation. This is compatible with recent biological evidence on the use of action plans in action recognition (e.g. Flanagan and Johansson, 2003). It stems naturally from the simulation approach to action perception (Demiris and Johnson 2005), which has been the main inspiration underlying our work, and it provides a timely opportunity to study the interplay between the two important topics of attention and action perception.

Acknowledgements

The first author acknowledges the support of EPSRC (grant GR/S11305/01) and the Royal Society. The second author is supported by a doctoral scholarship from the Morphy Trust. Thanks to all the BioART members for their valuable feedback.

References

(Schaal, Ijspeert and Billard, 2003): Computational approaches to motor learning by imitation, S. Schaal, A. Ijspeert and A. Billard, Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 358(1431), pp. 537-547, 2003.
(Dearden and Demiris, 2005): Learning forward models for robotics, A. Dearden and Y. Demiris, Proceedings of IJCAI, Edinburgh, 2005 (to appear).
(Demiris and Hayes, 2002): Imitation as a dual-route process featuring predictive and learning components: a biologically plausible computational model, Y. Demiris and G. Hayes, in Imitation in Animals and Artifacts, K. Dautenhahn and C. Nehaniv (eds), MIT Press, 2002.
(Demiris and Johnson, 2003): Distributed, predictive perception of actions: a biologically inspired robotics architecture for imitation and learning, Y. Demiris and M. Johnson, Connection Science, 15(4), pp. 231-243, 2003.
(Demiris and Johnson, 2005): Simulation theory for understanding others: a robotics perspective, Y. Demiris and M. Johnson, in Imitation and Social Learning in Robots, Humans and Animals: Behavioural, Social and Communicative Dimensions, K. Dautenhahn and C. Nehaniv (eds), Cambridge University Press.
(Flanagan and Johansson, 2003): Action plans used in action observation, J. R. Flanagan and R. S. Johansson, Nature, 424, pp. 769-771, 2003.
(Johnson and Demiris, 2004): Abstraction in recognition to solve the correspondence problem for robot imitation, M. Johnson and Y. Demiris, in Proceedings of TAROS 2004, pp. 63-70, Essex, 2004.
(Khadhouri and Demiris, 2005): Compound effect of bottom-up and top-down influences in attention during action recognition, B. Khadhouri and Y. Demiris, Proceedings of IJCAI, Edinburgh, 2005 (to appear).
