Towards Combining Autonomy and Interactivity for Social Robots

Yasser MOHAMMAD, Toyoaki NISHIDA

Graduate School of Informatics, Kyoto University
Yoshida-Honmachi, Sakyo-ku, Kyoto, 606-8501, Japan

[email protected], [email protected]

ABSTRACT The success of social robots in achieving natural coexistence with humans depends on both their level of autonomy and their interactive abilities. Although many robotic architectures have been proposed and many researchers have focused on human-robot interaction, a robotic architecture that effectively combines interactivity and autonomy is still unavailable. This paper contributes to the research effort towards such an architecture in the following ways. First, a theoretical analysis is provided that leads to the notion of co-evolution between the agent and its environment, and between the agent and other agents, as the condition needed to combine autonomy and interactivity. The analysis also shows that the basic competencies needed to achieve the required level of autonomy and the envisioned level of interactivity are similar but not identical. Second, nine specific requirements that the architecture should satisfy are formalized. Third, a robotic architecture that tries to meet those requirements by utilizing two theoretical hypotheses and several insights from social science, developmental psychology, and neuroscience is detailed. Lastly, two experiments with a humanoid robot and a simulated agent are reported to show the potential of the proposed architecture.

Keywords: Embodiment, EICA, Social Robotics, HRI

1 INTRODUCTION

A robot that can correctly execute its task but fails to interact with humans in a natural way is as unacceptable in the area of social robotics as a robot that knows how to interact with humans but fails to achieve the task for which it was designed. The ability to combine natural social interactivity with autonomy is therefore a vital requirement for a successful social robot. Researchers usually focus on one of those two areas while ignoring the other, but the history of AI tells us that although system decomposition is useful for studying the various parts, combining those parts into a productive final system usually turns out to be extremely difficult if the subcomponents were not designed with the complete system in mind. After defining the basic terms in section 2, the theoretical hypotheses behind the design of the proposed architecture are stated in section 3. Based on this analysis, the basic requirements for the design of the proposed architecture are given in section 4, followed by the details of two levels of specification of the proposed architecture in section 5. Two example implementations of the proposed levels of specification are given in section 6, followed by a discussion of the limitations and future work in section 7, while the relation to related systems is discussed in section 8. The paper is then concluded.


2 DEFINITIONS AND RELATED LITERATURE

2.1 AUTONOMY Autonomy is an important term in modern AI as well as in theoretical biology and the social sciences, and many researchers in these areas have tried to define it; the available definitions are numerous and inter-related. In the robotic control literature autonomy is usually taken to mean the ability of the robot to work without direct control by a human. This definition is limited to a single kind of autonomy, ignoring other kinds such as environmental autonomy and social autonomy. A better definition comes from the Agent and MAS (Multi-Agent Systems) literature, where autonomy is a relational concept that cannot be defined without reference to the agent's goals and environment: an agent X is said to be autonomous from an entity Y toward a goal G if and only if X has the power to achieve G without needing help from Y. In (Castelfranchi and Falcone 2004), the authors argue convincingly that autonomy and sociality (the ability to live and interact in a society) are inter-related, and that as much as a society limits the autonomy of its members, it also enhances it through different processes. In this work, we focus on general autonomy according to the aforementioned definition.

2.2 INTERACTIVITY In this work interactivity is defined as the ability to use normal modes of interaction and stable social norms during close encounters with humans. This definition focuses on short term, fast interaction, in contrast to the general definition of interactivity in the social sciences and MAS literature, which usually focuses on long term sociality. Interactivity in this sense is a pre-condition for implementing social robots but is not by itself what makes the robot social; sociality needs explicit reasoning about social relations and the expectations of others. Sociality in this sense can be built on top of the EICA architecture we propose in this paper using a set of deliberative processes that manage trust, reputation, and various social relations. Interactivity in the short term sense given here is a real time quality that resists traditional deliberative reasoning implementations and suggests a reactive implementation.

2.3 EMBODIMENT The GOFAI (Good Old Fashioned AI) view of intelligence as a transcendental phenomenon has been challenged by many authors on both philosophical and practical grounds (Vogt 2002; Ziemke 1999). The need for an alternative encouraged many authors to challenge the basic assumptions of GOFAI, leading to many interrelated hypotheses including the dynamical hypothesis in cognitive science and the behavioral approach in robotics (Brooks 1991). These alternatives have something in common: all of them are enactive approaches that rely on some form of embodiment to overcome the grounding problem (Ziemke 1999). Five different notions of embodiment can be identified as suggested in (Ziemke 1999). The first two of them are Structural Coupling, which means that the agent is dynamically related to its environment, forming a single combined dynamical system, and Historical Embodiment, through a history of structural coupling with the environment that affects the agent's internal dynamical system. The notion of historical embodiment stresses two concepts:
• A form of coupling between the agent and its environment at multiple time scales, ranging from reactive to long-term deliberative interactions. This property is common to historical embodiment and structural coupling.
• This coupling is dynamic and is adapted through the interaction itself. This is a unique feature of historical embodiment.

2.4 MUTUAL INTENTION To be able to interact naturally with humans, the real world agent (e.g. robot, ECA, etc.) needs several faculties, including the ability to sense human-generated signals, to understand them all the way to the detection of the underlying intentions, and to show its own intention in a way that is natural for humans. Usually those are treated as three separate problems, but in natural interactions between humans this separation does not normally exist. In natural interaction situations the intentions of the two agents co-evolve rather than being communicated. Of course, communication situations in which information is transferred in only one direction do exist, but this communication framework cannot be assumed to cover all possible interactions in the real world, especially those involving nonverbal behavior. Consider a very simple situation in which one person is giving another person directions to a specific location. This situation appears to be a one-way transfer of information that should conform to the three-step formulation outlined above:
1. The listener perceives the signals given by the speaker (Passive Perception).
2. The listener analyzes the perceived signals (Intention Understanding).
3. The listener gives a final feedback (Intention Communication).
In reality the interaction will not go this way. The listener will not be passive during the instructions but will actively align his body and give feedback to the speaker, and those actions by the listener will change the cognitive state of the speaker and indirectly change the signals perceived by the listener. So perception will in fact be interactive, not passive. The second step is also not realistic, because during the interaction the listener will continuously shape her understanding (internal representation) of the speaker's intention, so no final analysis step separate from perception will occur. The third step is again a simplification, because the feedback given is continuous, in the form of mutual alignment, and not just a confirmation as suggested by the scenario decomposition above. This analysis suggests recasting the problem of real world interaction from three separate problems into a single problem we call Mutual Intention formation and maintenance. Mutual Intention is defined in this work as a dynamically coherent first and second order view toward the interaction focus. The first order view of the interaction focus is the agent's own cognitive state toward the interaction focus. The second order view of the interaction focus is the agent's view of the other agent's first order view. Two cognitive states are said to be in dynamical coherence if the two states co-evolve according to a fixed dynamical law. In the case of verbal communication a discrete relation between the interacting agents does not impose a serious problem in most cases, as the delay between utterances can be long enough for the analysis, but for nonverbal communication this discrete relation can impose unacceptable limitations on the ability of the two agents to interact naturally. The gestural dance theory proposed in (Condon and Ogston 1966) and confirmed in (Kendon 1970) and (Birdwhistell 1970) hypothesizes that nonverbal bodily communication shows synchrony effects between the interacting humans at time scales of 40 milliseconds or less. Although this hypothesis is not yet confirmed (e.g. the experiments reported in (McDowall 1978) did not find evidence of such fine-scale synchronization at time scales below 500 milliseconds), fine-scale synchrony in body movements during interactions (especially during smooth turn taking) challenges the possibility of using disembodied techniques for interactive agents that utilize not only verbal but also nonverbal communication channels.
Another challenge for building agents that can form and maintain mutual intention in real world environments is that the synchrony found in body movements during human-human interactions is continuous, in the sense that the space of possible body poses a human can take is continuous and is difficult to model with discrete interaction models. From this discussion, it can be concluded that two characteristic features of interactive agents that can achieve mutual intention with humans are:
• The ability to synchronize their behavior with that of their partners (e.g. humans) at different time scales.
• The ability to discover and adapt to the rhythms of synchronization in multiple modalities (e.g. proxemics, facial expression, gaze direction, etc.) and at multiple speeds.

3 THEORETICAL FOUNDATIONS OF EICA

The EICA (Embodied Interactive Control Architecture) proposed in this paper is based on two theoretical hypotheses. The following subsections introduce these hypotheses and discuss why we believe they are useful for combining autonomy and interactivity (as defined in the previous section).


3.1 HYPOTHESES The EICA architecture presented in this paper is based on two theoretical hypotheses made by the authors:

(H1) Historical Embodiment Hypothesis: The level of embodiment that is a precondition for achieving intelligent autonomous real world agents is the historical embodiment level as defined in (Ziemke 1999). What this level of embodiment emphasizes is the role of the extended interaction between the agent and the environment. This extended interaction is usually overlooked as a detail that is not important for implementing intelligent agents, but it is the only way around the grounding problem, as the history of the interaction can associate an internal meaning with the perceived signals, allowing the agent to act in the environment in a rooted and situated way that is not possible using only externally coded algorithms (Mohammad and Nishida 2006).

(H2) Intention through Interaction Hypothesis: Intention is best modeled not as a fixed hidden variable of unknown value, but as a dynamically evolving function. Interaction between two agents couples their intention functions, creating a single system that co-evolves as the interaction proceeds. This co-evolution can converge to a mutual intention state of either a cooperative or a conflicting nature if the dynamical coupling law is well designed (Mohammad and Nishida 2007).
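One illustrative way to write H2 down (our notation; the paper states the hypothesis only in prose) is as a pair of coupled intention dynamics:

\dot{I}_A = f_A(I_A, s_A) + g_{AB}(\hat{I}_B), \qquad \dot{I}_B = f_B(I_B, s_B) + g_{BA}(\hat{I}_A)

where I_X is the intention function of agent X, s_X its sensory input, \hat{I}_Y its running estimate of the partner's intention, and g_{AB}, g_{BA} are the coupling terms introduced by the interaction. A mutual intention state then corresponds to the coupled system converging to a regime in which each agent's estimate tracks the partner's actual intention under a fixed dynamical law.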

3.2 RELATION BETWEEN H1 AND H2 The relation between mutual intention and embodiment is very important for understanding both of them. The level of embodiment required to achieve any nontrivial level of intelligent behavior in the real world is interactive historical embodiment (as hypothesized by the authors), and the precondition of natural interaction between intelligent agents is the ability to interactively form and maintain mutual intention. The common concept here is interaction as a kind of co-evolution between the agent and its environment (historical embodiment) and between the agent and other agents (mutual intention formation and maintenance). Nevertheless, this analogy does not imply that interacting with the environment is the same as interacting with other agents, because of two differences:
• In the case of historical embodiment the coupling between the agent dynamics and the environment dynamics is asymmetric, in the sense that the agent cannot usually affect the internal dynamics of the environment to the same degree that the environment can affect the agent's internal dynamics, while the coupling in the case of mutual intention should be symmetric if a partnership relation is to emerge. For example, failing to do turn taking correctly can greatly affect the psychological state of the partner, while failing to adjust to a change in temperature will not affect the dynamics of the world.
• The amount of initial coupling needed for the interaction to converge into a mutual intention state is much larger than the amount of initial coupling needed to achieve historical embodiment, as long as disastrous states are avoided. For example, if the agent does not know how to share attention with the human during a conversation, the human will not interact normally, and this in turn can prevent the agent from ever learning how to share attention with humans; but if the agent cannot navigate correctly in an environment in the beginning, the failure itself can be useful for learning the competence as long as no disastrous collisions happen.
This discussion reveals that similar but not identical competencies are needed to support autonomy (H1) and interactivity (H2). Those competencies are consolidated into nine requirements in the following section.

4 FROM THEORETICAL FOUNDATIONS TO SYSTEM REQUIREMENTS

To build real world agents that can potentially achieve historical embodiment (H1), several restrictions have to be put on the design of the agent architecture:
R1. As the rate of change of the environment state cannot be modeled at any single time scale but needs multiple time scales (e.g. fast changes in lighting conditions versus slow changes in furniture location in an indoor environment), the architecture should support multiple levels of computational complexity and response times that range from reactive to deliberative processes.


R2. No computationally intense deliberative processes should be mandatory for the working of the agent.
R3. The architecture must support incremental adaptation to allow the agent to use its interaction with the environment for its own benefit. Given that parallel systems are much harder to learn than simple serial systems, the internal structure of the agent should be arranged so that single processes can be learned incrementally while the exact timing is controlled by higher order processes that can be adapted to the situation.
R4. It should support incremental evolution of the internal agent design, either through programming or through automatic learning.
To build agents that can form and maintain mutual intention in real world interactions (H2), more requirements should be added to the architecture:
R5. Intentionality should be explicitly modeled in the agent design, and as intentionality can be described at multiple levels of abstraction (e.g. a high level conscious intention to move, and low level motor plans controlling the body during the movement), the system should support both low level motor-plan (reactive plan) modeling and high level intentional modeling.
R6. The agent should be able to model the intention function of the interacting partners in order to predict their responses to its own actions.
R7. The agent should have a mechanism for learning the interaction protocol and adapting to different agents.
R8. The agent should have a mechanism for combining the results of its interaction oriented processes and its task completion and survival oriented autonomous processes.
R9. As the interactive subsystem of the agent needs to explore the intention function of the interacting partners in a timely fashion, the sensory input originating from intelligent agents (e.g. humans) should be processed separately from other environmental input, because it carries information about the internal intentional state of the other agents while other sensors carry no intentional information.
The EICA architecture tries to meet these requirements to provide a general architecture for implementing real world agents capable of combining autonomy with social interactivity. The following section describes the general design of EICA.

5 THE ARCHITECTURE

The EICA architecture consists of a set of specification levels. Every specification level in EICA describes a set of primitives that can be used to build the computational processes of the robot. Three specification levels are currently available, namely L0EICA, LaEICA, and LiEICA. The first is designed to meet the first two requirements (R1 and R2) stemming from the historical embodiment hypothesis (H1) while providing the basis for meeting R5 and R6; the second is designed to meet R3 and R4 for agents that interact only with experienced humans; while the third is designed to meet these two requirements along with the last five requirements (R5-R9) stemming from the intention through interaction hypothesis (H2) for agents that can engage in a partnership relation with humans. This paper only considers L0EICA and LiEICA.

5.1 L0EICA Fig. 1 shows a simplified version of the lowest level of specification of the EICA architecture, called L0EICA. Every processing component in EICA must ultimately implement the Active interface. In that sense this type is equivalent to the Behavior abstraction in BBS (behavior based systems) or the Agent abstraction in MAS (multi-agent systems). Every Active object has the following attributes (see Fig. 1):
Attentionality: a real number that specifies the relative attention that should be given to this process. This number is used to calculate the speed at which the process is allowed to run. As shown in Fig. 1, this attribute is connected to the output of the attention effect channel of the object.
Actionability: a real number specifying the activation level of the object. A zero or negative actionability prevents the object from executing. A positive actionability means that the process is allowed to run, and the exact value of this attribute is used in calculating the effect of the object on other active objects. As shown in Fig. 1, this attribute is connected to the output of the action effect channel of the object.


Attributes: a set of general values that can be connected to effect channels (see below) and provide a means of transferring effects between running processes.
Effects: a set of output ports that connect this object through effect channels to other active components.
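As a rough sketch of this interface (a minimal Python rendering; the names and type choices are ours, not taken from the paper), an Active component and an effect channel could be written as:

from typing import Callable, Dict, List

class EffectChannel:
    """Combines the outputs of several Active components into a single value
    according to its operation attribute (e.g. sum, max, product)."""
    def __init__(self, operation: Callable[[List[float]], float]):
        self.operation = operation
        self.inputs: List[Callable[[], float]] = []   # callables returning the current input values

    def connect(self, source: Callable[[], float]) -> None:
        self.inputs.append(source)

    def output(self) -> float:
        values = [read() for read in self.inputs]
        return self.operation(values) if values else 0.0

class Active:
    """Base type of every EICA processing component (motor plan, process, reflex, sensor)."""
    def __init__(self) -> None:
        self.attentionality = 0.0                     # relative share of computation time
        self.actionability = 0.0                      # zero or negative: the component may not run
        self.attributes: Dict[str, float] = {}        # general values adjustable through effect channels
        self.effects: Dict[str, EffectChannel] = {}   # outgoing effect channels

    def step(self) -> None:
        """One execution slice; scheduled at a rate proportional to attentionality."""
        raise NotImplementedError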

Fig. 1 L0EICA components and their relations

By separating actionability from attentionality and allowing actionability to have a continuous range, EICA enables a form of attention focusing that is usually unavailable to behavioral systems. This separation allows the robot to select the active processes depending on the general context (by setting the actionability value) while still being able to assign computational power according to the exact environmental and internal conditions (by setting the attentionality value). The fact that the actionability is variable allows the system to use it to change the possible influence of various processes (through the operators of the effect channels) based on the current situation. Active components can be connected together through effect channels. Every effect channel has a set of n inputs that use continuous signaling and a single output that is continuously calculated from those inputs according to the operation attribute of the effect channel. At this level of specification the types that can be used to directly implement the processing components of the robot are:
MotorPlan: represents a simple reactive plan that generates a short-path control mechanism from sensing to actuation. The action integration mechanism provides the means to integrate the actuation commands generated by all running motor plans into final commands sent to the executers to be applied to the robot actuators, based on the intentionality assigned to every motor plan. The motor plans in EICA are more like reactive motor schemata than traditional behaviors.
Process: provides a higher level of control over the behavior of the robot by controlling the temporal evolution of the intentionality of various motor plans. As will be shown in the application presented in this paper, the interaction between multiple simple processes can generate arbitrarily complex behavior. Processes in EICA are not allowed to generate actions directly.
Reflex: a special type of process that can bypass the action integrator and send direct commands to the executer(s). Reflexes provide safety services like collision avoidance during navigation, or safety measures that prevent accidents involving the interacting person in case of failures in other modules.
Sensor: an active entity intended to communicate directly with the hardware sensors of the robot. This kind of object was introduced to provide a more efficient sensing capability by utilizing the latency property of the data channel component.
The internal organization of the agent is determined by its behavioral graph. A behavioral graph is a labeled weighted directed graph BG = <A, EC, C>, where A is the set of all processes and motor plans, EC is the set of effect channels, and C is a set of weighted directed edges each of which connects a member of A to a member of EC or vice versa. Active components of the robot can also be connected together using data channels to exchange data. The operation and details of data channels are outside the scope of this paper. Other than the aforementioned types of active entities, EICA has a central Action Integrator that receives actions from motor plans and uses the source's intentionality level, as well as an assigned priority and mutuality for every DoF of the robot in the action, to decide how to integrate it with actions from other motor plans using simple weighted averaging subject to mutuality constraints. This algorithm, although very simple, can generate a continuous range of integration possibilities ranging from pure action selection to potential-field-like action integration, based on the parameters assigned by the various motor plans of the robot.
One of the main purposes of having agent architectures is to make it easier for the programmer to divide the required task into smaller computational components that can be implemented directly. The proposed architecture helps in achieving a natural division of the problem by following a simple procedure. First, the task is analyzed to find the basic competencies that the robot must possess in order to achieve it. Those competencies are not complex behaviors like attend-to-human but finer behaviors like look-right, follow-face, etc. Those competencies are then mapped to the motor plans of the system. Each of these motor plans should be carefully engineered (or learned) and tested before adding any more components to the system. The next step in the design process is to design the behavioral processes of the system. To do that, the task is analyzed to find the underlying processes that control the required behavior, and those processes are then implemented. The most difficult part of the whole process is finding the correct parameters of those processes to achieve the required external behavior. Currently this parameter choice is done using trial-and-error, but it would be more effective to use machine learning techniques to learn those parameters from the interactions, either offline or online. The current architecture supports run-time adaptation of those parameters, and this feature will be exploited in the future to implement learning of the behavioral processes. The behavioral processes are added incrementally and the relative timing between them is adjusted according to the required final behavior. The final step of the design process is to implement the sensors needed for achieving the goals of the agent. This simple design procedure is made possible by the separation between the basic behavioral components (motor plans) and the behavioral processes.
The current design of L0EICA targets the autonomous part of the robot, and it is instructive to see how it tries to meet the requirements stemming from H1:
R1. The architecture at this level of specification does not impose any restriction on the design of the processes (motor plans, on the other hand, are restricted to reactive processing). In real implementations the behavioral graph will take the form of a tree. The leaves of this tree are the motor plans, and the processes in higher levels usually use slower computational algorithms. This allows the robot to interact well with changes in the environment at different time scales.
R2. The motor plans implement the fast reactive response. The processes can use either reactive processing or deliberation, but the real time performance of the robot is mainly controlled by the final motor plans, and those are reactive by definition. As the only restriction on the implementation technology at this level is to use reactive processing in the motor plans, the architecture does not require any mandatory deliberative components. The central action integrator used with EICA is very simple and fast and does not impose speed restrictions on the system (Sabbagh 2004).
R3. The No-Free-Lunch theorem proves that there can be no learning algorithm that is optimal for learning all kinds of behavior. For this reason L0EICA does not provide a specific learning algorithm but provides the means for implementing such a learning system as a process in the behavioral graph. Those learning processes can use the mechanism of effect channels to adapt the parameters of the learned behaviors that are encoded as attributes in the running processes. The example implementation given in this paper shows how the LiEICA architecture explores this possibility and implements learning through interaction in the agent.
R4. The design procedure outlined earlier in this section shows how the agent design can be divided into three steps, during each of which computational components are added (and potentially learned) incrementally until the final design of the robot is completed. As processes can be added and removed at runtime along with effect channels, L0EICA provides the basic support for online adaptation. Higher levels of specification can utilize these features to implement online adaptation as needed.
L0EICA does not support the last five requirements presented in the previous section. For example, as Fig. 1 shows, the agent at this level uses the same processing for signals originating from inanimate environmental components and from other intelligent agents, and it cannot understand the behavior of other agents as goal directed behavior. Those limitations are addressed by the LiEICA level of specification presented in the following subsection.
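The paper describes the central action integrator only as weighted averaging subject to mutuality constraints. The sketch below is one possible reading of that description, in which a command's weight is its source's intentionality times the per-DoF priority, and a DoF that any command claims exclusively is given to the highest-weighted exclusive claimant; the data layout and the exclusivity rule are our assumptions, not the paper's algorithm.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class DoFCommand:
    value: float        # desired value for this degree of freedom
    priority: float     # priority assigned by the motor plan for this DoF
    exclusive: bool     # mutuality flag: True means do not blend with other sources

@dataclass
class ActionCommand:
    intentionality: float            # intentionality of the issuing motor plan
    dofs: Dict[str, DoFCommand]      # per-DoF commands

def integrate(commands: List[ActionCommand]) -> Dict[str, float]:
    """Blend motor-plan commands per DoF by intentionality-weighted averaging,
    subject to the mutuality (exclusivity) constraint."""
    per_dof: Dict[str, List[Tuple[float, DoFCommand]]] = {}
    for cmd in commands:
        for dof, d in cmd.dofs.items():
            per_dof.setdefault(dof, []).append((cmd.intentionality * d.priority, d))
    result: Dict[str, float] = {}
    for dof, entries in per_dof.items():
        exclusive = [(w, d) for w, d in entries if d.exclusive]
        if exclusive:                                  # the strongest exclusive claim takes the whole DoF
            result[dof] = max(exclusive, key=lambda e: e[0])[1].value
        else:                                          # plain weighted averaging
            total = sum(w for w, _ in entries)
            result[dof] = (sum(w * d.value for w, d in entries) / total) if total > 0 else 0.0
    return result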

5.2 LIEICA To fulfill the last five requirements (R5-R9) for real world agents, which stem from the intention through interaction hypothesis (H2), a new level of specification called LiEICA (for Interactive) was added to EICA. To achieve those goals we looked for inspiration in what is known about the interactive behavior of humans in the fields of psychology, developmental psychology, and social science. The following paragraphs explain how research in those fields affected the design of this level of specification. Fig. 2 shows the general structure of this level of specification and its main components during protocol learning (see below for details). LiEICA differs from L0EICA in that it does not only provide basic building blocks that the programmer can use to build the robot, but prescribes a specific way to implement the computational structure of the agent. The main goal of LiEICA is to discover an interaction coupling function that converges into a mutual intention state. Social researchers have discovered various levels of synchrony in natural interactions, ranging from role switching during free conversations and slow turn taking during verbal interaction to the hypothesized gestural dance (Kendon 1970). To achieve natural interaction with humans, the agent needs to synchronize its behavior with the behavior of the human at different time scales using different kinds of processes, ranging from deliberative role switching to reactive body alignment. LiEICA tries to achieve that by allowing the agent to discover how to synchronize its behavior with its partner(s) on those timescales. The architecture is a layered control architecture consisting of multiple interaction control layers. Within each layer, a set of interactive processes provides the competencies needed to synchronize the behavior of the agent with the behavior of its partner(s), based on a global role variable that specifies the role of the agent in the interaction. In this paper we focus on the case in which each interactive process is known except for a parameter vector. The goal of the system then translates into learning the optimal parameter vectors of the interactive processes that achieve the required synchrony, as specified by the behavior of the target partners (e.g. humans). To achieve natural interaction, humans develop a theory of mind that tries to understand the actions of interacting partners in a goal directed manner. Failure to develop this theory of mind is hypothesized to be a major factor in developing autism and other interaction disorders (Sabbagh 2004). Two major theories compete to explain how humans learn and encode the theory of mind, namely the theory of theory and the theory of simulation (Davies and Stone 1995). The theory of theory hypothesizes that a separate recognition mechanism is available that can decode the partner's behavior, while the theory of simulation suggests that the same neuronal circuitry is used both for generating actions and for recognizing those actions when performed by others (Sabbagh 2004). The discovery of mirror neurons in the F5 premotor cortex area of monkeys (Murata 1997) and recent evidence of their existence in humans (Oberman 2007) support the theory of simulation, although the possibility that a separate recognition mechanism also exists is far from being ruled out. The proposed system utilizes those results in a novel way by providing a simulation-theoretic mechanism for recognizing the interaction acts of the partner in a goal directed manner. This mechanism is augmented by a separate recognition mechanism to enable learning the interaction protocol from the interaction itself, as will be explained later in this section.


Fig. 2 LiEICA - a simplified version

Fig. 2 gives a simplified version of the architecture. The main parts of the architecture are:
Interaction Perception Processes (IPPs): used to sense the actions of the other agents.
Perspective Taking Processes (PTPs): for every interacting partner a set of perspective taking processes is spawned to provide a view of the interaction from the partner's point of view. These processes generate the same kinds of signals that are generated by the agent's interaction perception processes, but assuming that the agent is in the position of the partner.
Forward Basic Interaction Acts (FBIAs): the basic interactive acts that the agent is capable of. In the current version those acts must be specified by the designer using arbitrary logic. These processes must use the simplest possible logic and should be deterministic, to simplify the design of the reverse basic interaction acts explained next.
Reverse Basic Interaction Acts (RBIAs): every FBIA has a reverse version that detects the probability of its execution in the signals perceived by the IPPs or the PTPs. These are the first steps in both the simulation and theory paths of the system and allow the agent to represent the acts it perceives in the same vocabulary used to generate its own actions. The FBIAs and RBIAs constitute the first interaction control layer of the system, which is not learnable. The rest of the interaction control layers can be learned by the agent.
Interactive Control Processes (ICPs): these constitute the higher interactive control layers. Every interactive control process consists of two twin processes. The forward process is responsible for adjusting the actionability of various processes in the lower layer based on the interaction protocol, and at the same time it is used to simulate the partner. The reverse process represents the theory the agent has about the partner and the protocol, and is related to the forward process in the same way that RBIAs are related to FBIAs.
Shared Variables: two global shared variables represent the agent's role during the interaction (e.g. listener, instructor, etc.) and the age of the agent, which is the total time of interactions the agent has recognized or engaged in. A third variable (Robust) is initialized for every partner and stores the aggregated difference between the theory and the simulation of this partner. This variable is used in conjunction with the age to determine the learning rate for that partner.
Mirror Trainer: a special process that is responsible for adapting reverse processes once their twin forward processes are changed. The algorithm used is simply to spawn an offline version of the changed process, try random values of its parameters while keeping its activation at one and at zero to collect positive and negative data, and then train the reverse process using this training set.
Interaction Structure Learner: this process is responsible for learning the processes of the interactive control layers by watching other agents interacting. This process is not discussed in this paper.
Interactive Adaptation Manager: this process learns the parameters of various processes online during interactions with other partners. It is the heart of LiEICA and will be the focus of the following presentation.
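A minimal sketch of the mirror training step described above, assuming the changed forward process can be run offline as a function of its parameters and activation and that a trainer for its reverse twin is available (both interfaces are hypothetical, not APIs from the paper):

import random
from typing import Callable, List, Tuple

def mirror_train(run_forward: Callable[[List[float], float], List[float]],
                 train_reverse: Callable[[List[Tuple[List[float], int]]], None],
                 n_params: int, n_samples: int = 200) -> None:
    """Collect positive (activation = 1) and negative (activation = 0) samples from an
    offline copy of the forward process, then fit its reverse (recognition) twin."""
    data: List[Tuple[List[float], int]] = []
    for _ in range(n_samples):
        params = [random.uniform(-1.0, 1.0) for _ in range(n_params)]
        data.append((run_forward(params, 1.0), 1))    # observable behavior with the act executed
        data.append((run_forward(params, 0.0), 0))    # observable behavior without it
    train_reverse(data)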


During interactions the processes of every layer are divided into two sets based on the role of the agent in the current interaction. The first set is the running interactive processes, which generate the actual behavior of the agent and run in the forward direction. The second set is the simulated interactive processes, which represent the other roles in the interaction (one set is instantiated for every other agent) and run in both the forward and reverse directions. The forward direction represents the simulation of the interacting partner at this layer, while the reverse direction represents the current theory-based interpretation of the partner's actual behavior at the lower layers. This is shown in Fig. 2. For simplicity, a two-agent interaction scenario (e.g. a listener-speaker scenario) will be considered in this section; generalization to interactions that involve more than two agents is straightforward. In the beginning the role and age variables have to be set based on the task and the current situation. Once those variables are determined, the running interactive processes start driving the agent during the interaction. The perspective taking processes continuously translate the input stream into the partner's frame of reference, while the reverse basic interaction acts measure the most probable actionability of the partner's various basic interaction acts. This is then fed to the reverse processes in the higher layers to generate the expected actionability of all the ICPs. This constitutes the theory about the intention of the other agent at different levels of detail based on the learned interaction structure, and it moves bottom-up through the interaction control layer hierarchy. The forward direction of the processes representing the partner is also executed over the whole hierarchy to generate the expected actionability of each of them according to the simulation of the partner; this moves top-down through the hierarchy. The difference between the theory and the simulation is used at every layer to drive the adaptation system, but only if the difference is higher than a threshold that depends on the age of the agent (currently we use a threshold that increases linearly with the age). After adaptation, mirror training is used to bring the reverse and forward processes of the simulated partner together. In all cases a weighted sum of the theory and the simulation results is used as the final partner actionability level for all processes and is utilized by the forward running processes to drive the agent. It is instructive to show how this design meets the requirements of section 4:
R3. Learning how to interact is built into this level of specification through mirror training, interaction structure learning, and interactive adaptation, as explained above. The agent starts by learning its own interactive capabilities (mirror training), then it learns the structure of the interactions it is expected to engage in (interaction structure learning), and finally it adapts this structure to the partner in real time.
R4. This level of specification keeps the incremental design advantage of L0EICA and provides a simple way of online adaptation based on the age of the agent, through the interactive adaptation process that combines elements from the simulation theory and the theory of theory as explained above.
R5. The hierarchical design of this level of specification makes it possible to represent intention at different layers, encoded in the actionability levels of the processes representing the self (running processes) and the partner(s) (simulation processes).
R6. By using simulation the system provides the means to interpret the interacting partner as an intentional agent. The intention of the interacting partner is not represented by passive BDI-like intentions but by dynamic processes at various layers of the control architecture, and this allows the determination of the partner's whole intention function. Another advantage of the proposed architecture is that the agent represents the intention of the partner using its own representation of intention (interaction processes).
R7. As seen above, every interactive control layer learns its processes during the interaction structure learning stage of the agent's development.
R8. The action integrator provides the mechanism for integrating interactive and non-interactive acts. Also, the highest interactive control layer is not driven by the interaction itself, and other goals of the agent can be used to drive this layer, which in turn affects all of the lower layers.
R9. The perceptual subsystem separates the signals from interacting partners from the signals from the environment, as shown in Fig. 2. This allows the designer to use human-specific perceptual processes like the interactive perception mechanism (Mohammad and Nishida 2006) to get more accurate information about the behavior of the interacting human.
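The per-layer adaptation step described at the beginning of this subsection can be summarized roughly as follows; the constants and the adapt callback are illustrative placeholders, since the paper only states that the threshold grows linearly with the age and that a weighted sum of the theory and the simulation drives the running processes.

from typing import Callable, Dict

def adapt_layer(theory: Dict[str, float],
                simulation: Dict[str, float],
                age: float,
                adapt: Callable[[str, float], None],
                k: float = 0.01,
                sim_weight: float = 0.5) -> Dict[str, float]:
    """One interactive adaptation step for a single interaction control layer.
    theory and simulation hold the expected actionability of each partner process,
    obtained bottom-up (reverse processes) and top-down (forward simulation)."""
    threshold = k * age                       # tolerance grows linearly with the agent's age
    blended: Dict[str, float] = {}
    for name, t in theory.items():
        s = simulation.get(name, 0.0)
        if abs(t - s) > threshold:
            adapt(name, t - s)                # ask the adaptation manager to retune this process
        blended[name] = sim_weight * s + (1.0 - sim_weight) * t
    return blended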

6 EXAMPLE IMPLEMENTATIONS

6.1 REACTIVE GAZE CONTROL DURING LISTENING

Fig. 3 Reactive Gaze Control

To show how L0EICA can be used to implement reactive behavior during natural human-robot interactions, a simple system for controlling head orientation while naturally listening to an explanation given by a human subject was implemented on a humanoid Robovie II robot (Mohammad et al. 2007). Fig. 3 shows the various parts of the control software. The goal of this system was to achieve human-like gaze control in terms of the average times of mutual attention, gaze toward the instructor, and mutual gaze. The details of this system are reported in (Mohammad et al. 2007); only a brief account is given here. Following the procedure outlined in section 5.1, the design process involved the following steps:
1. Analysis of gaze control during face to face encounters suggested the need for three behavioral processes:
a. A process that pulls the robot head toward the direction of the instructor, called Look-At-Human.
b. A process that counters the effect of Look-At-Human, called Be-Polite. These two processes form an approach-escape mechanism that resembles the processes controlling spatial behavior during human-human interactions as reported in (Mohammad et al. 2007).
c. A third process that tries to make the robot look at the salient object in the environment, called Mutual-Attention, which was needed to form visual mutual attention with the instructor.
The role of these three behavioral processes was to control the intentionality of the basic motor plans.
2. Analysis of the motion required by these processes suggested the need for four basic motor plans (follow-face, follow-object, follow-gaze, and look-around). Those plans were implemented by four simple state machines and their intentionality was controlled by the three behavioral processes described earlier.
3. To implement those processes and motor plans, five perceptual processes (sensors) were needed: the Human-head and Robot-head sensors provide the current location and orientation of the human and robot heads respectively, the Speaking sensor detects human speech, the Confirming sensor detects a confirmation act by the human (the output of this sensor was manually added), and the Gaze-Map sensor generates a representation of the saliency of the space around the robot, based on the gaze and pointing behaviors of the human, utilizing a summation of Gaussians.
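As an illustration of the Gaze-Map sensor's summation of Gaussians (the spread parameter and the weighting of indicated points are our assumptions, not values from the paper):

import math
from typing import Iterable, Tuple

Point = Tuple[float, float]                  # a location in the plane around the robot

def gaze_map_saliency(query: Point,
                      indicated: Iterable[Tuple[Point, float]],
                      sigma: float = 0.3) -> float:
    """Saliency of a location as a sum of Gaussians centered on the points the human
    has gazed or pointed at; each point carries a weight (e.g. decaying with time)."""
    total = 0.0
    for (cx, cy), weight in indicated:
        d2 = (query[0] - cx) ** 2 + (query[1] - cy) ** 2
        total += weight * math.exp(-d2 / (2.0 * sigma ** 2))
    return total

# Example: one gazed-at and one pointed-at location, the first weighted more strongly.
recent = [((0.2, 0.5), 1.0), ((-0.1, 0.3), 0.6)]
print(gaze_map_saliency((0.15, 0.45), recent))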

Fig. 4 Performance of the Reactive Gaze Controller: (a) simulated versus human behavior; (b) effect of noise on the performance of various interactive behaviors.

To test the applicability of this approach and its noise sensitivity, a simulation study was conducted in which six sessions of explanations were collected and the data of the speaker was fed to the system, which controlled the head of a Robovie II humanoid robot. Twenty different noise levels were added to each of the scenarios and the effect of the noise level on the robot behavior was analyzed. The performance of the robot in comparison with human listeners is depicted in Fig. 4-a. The robot achieved human-like behavior in terms of mutual gaze and gaze toward the instructor. No human-human data was available to evaluate the mutual attention behavior of the robot. The effect of noise on the system is shown in Fig. 4-b. As the figure shows, the degradation in system performance is proportional to the square of the noise level, which gives good noise rejection properties for low levels of signal-correlated noise. This system had two main disadvantages:
1. The system is purely reactive and does not have any knowledge about the internal state of the human.
2. The parameters of all the processes in the system were manually adjusted, because L0EICA does not provide a specific learning capability.
The following subsection presents the design of a new system based on LiEICA that overcomes the aforementioned limitations of the reactive controller.

6.2 LEARNING HOW TO LISTEN BY INSTRUCTING In order to solve the problems of the reactive agent presented in the previous section, and as a proof of concept for the LiEICA level of specification, a simulation study was conducted to measure the capacity of the agent to learn how to control its gaze direction during listening while it is acting as an instructor for an agent that knows how to listen. The main focus of this study was to analyze the effectiveness of mirror training and Interactive Adaptation in learning how to interact similarly to the agents encountered. Neither Interaction Structure Learning nor the naturalness of the resulting behavior was studied in this simulation. A simulation study rather than a real world human-agent interaction was selected because it allows us to control all the parameters of the fully designed agent, the noise levels, etc., and because it can be sped up to allow us to study more interactions (the simulations in this experiment were run 600 times faster than real-time speed). Ten different fully designed agents were implemented that differ in the details of how they conduct instruction and how they respond to it, while three agents were designed as instruct-only agents; the goal of the experiment was to study how those agents can learn listening by instructing the fully designed agents. Because the verbal content was not needed in this experiment, a single 10-minute speech was recorded and parts of it are played during instruction. The virtual environment in which the agents interacted consists of a table with six different objects, with the agents standing facing each other on opposite sides of the table. The locations of the objects were selected randomly on the surface of the table. During the interaction, when the instructor is speaking about an object or working on it, there is a probability (7% and 10% respectively) that it will move the object. The maximum distance between the agent and the objects can be longer than its reach, so the instructor sometimes has to move along its side of the table. Every agent has two arms that can be used to manipulate objects or point at them.

Fig. 5 The Basic Interaction Acts and Interaction Control Processes of the synthetic agents used in the experiment: (a) the fully designed agent; (b) the instruct-only agent.

The inputs to the agents are the 3D locations of objects, along with eight position sensors attached virtually to the front and back of the heads of the agents and to their right palms and index fingers (used to discover pointing). The final input channel is the speech signal of the other agent. Fig. 5 shows the FBIAs and ICPs of the two kinds of agents. The common FBIAs were implemented as augmented state machines. A zero-mean Gaussian random error signal was added to the final joint angles used to simulate the pose of the agent. The first layer processes are implemented as probability distributions over the actionability of the ten FBIAs. The nominal values of these distributions are given in Table 1. These values were based on the analysis of the six scenarios collected for the reactive agent (section 6.1). For every agent the exact values of the probabilities were selected randomly around those nominal values.
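One possible reading of these first-layer processes (using part of the nominal Speak To row of Table 1; the per-agent jitter and the Bernoulli switching of acts are our assumptions) is sketched below:

import random
from typing import Dict

SPEAK_TO_NOMINAL: Dict[str, float] = {        # a subset of the Speak To row of Table 1
    "Look@Partner": 0.8, "Look@Salient": 0.2, "Say Statement": 0.9, "Point@Salient": 0.4,
}

def make_agent_distribution(nominal: Dict[str, float], jitter: float = 0.05) -> Dict[str, float]:
    """Each fully designed agent draws its probabilities around the nominal values."""
    return {act: min(1.0, max(0.0, p + random.uniform(-jitter, jitter)))
            for act, p in nominal.items()}

def sample_actionability(dist: Dict[str, float]) -> Dict[str, float]:
    """The ICP turns its distribution into FBIA actionability for the next control period:
    each act is switched on with its probability."""
    return {act: 1.0 if random.random() < p else 0.0 for act, p in dist.items()}

speak_to = make_agent_distribution(SPEAK_TO_NOMINAL)
print(sample_actionability(speak_to))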


The Explain process was implemented as a probabilistic state machine in which the period spent in every state is governed by a timing distribution, while next-state selection is governed by the transition distribution. The timing distribution is a uniform distribution between the minimum and maximum numbers of seconds shown in Table 2, and the transition probability distribution is also shown in Table 2. A more realistic design of this process would have been to incorporate the expected current state of the listener in the calculation of the transition probabilities, but as the listener agent in this experiment has access to this process, this complication would not have affected the results, and we decided to simplify the design by removing this dependency. The Listen process was implemented as a state machine governed by the state of the instructor, as shown in Table 3.
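A sketch of the Explain process as a probabilistic state machine follows; the timing limits and transition probabilities used here are illustrative placeholders and are not the values of Table 2.

import random
from typing import Dict, Tuple

TIMING: Dict[str, Tuple[float, float]] = {    # state -> (min seconds, max seconds), placeholders
    "Speak To": (2.0, 6.0), "Speak About": (3.0, 10.0), "Busy Working": (2.0, 8.0),
    "Bring to Attention": (1.0, 3.0), "Ask Confirmation": (1.0, 2.0),
}
TRANSITIONS: Dict[str, Dict[str, float]] = {  # state -> {next state: probability}, placeholders
    "Start": {"Speak To": 1.0},
    "Speak To": {"Speak About": 0.7, "Bring to Attention": 0.3},
    "Speak About": {"Busy Working": 0.4, "Ask Confirmation": 0.3, "Speak About": 0.3},
    "Busy Working": {"Speak About": 0.6, "Bring to Attention": 0.4},
    "Bring to Attention": {"Speak About": 1.0},
    "Ask Confirmation": {"Speak To": 0.8, "Halt": 0.2},
}

def run_explain(max_time: float = 60.0) -> None:
    """Run the state machine until Halt or until the time budget is exhausted."""
    state, t = "Start", 0.0
    while state != "Halt" and t < max_time:
        names = list(TRANSITIONS[state])
        probs = list(TRANSITIONS[state].values())
        state = random.choices(names, weights=probs)[0]   # transition distribution
        lo, hi = TIMING.get(state, (0.0, 0.0))
        dwell = random.uniform(lo, hi)                    # timing distribution (uniform)
        t += dwell
        print(f"{t:6.1f}s  {state}  (dwelt {dwell:.1f}s)")

run_explain()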

Fig. 6 Progress in learning with different initialization settings (Random, Uniform, Imitation): RMS error in actionability versus interaction number.

Table 1. Nominal probability distributions associated with the various Interaction Control Processes in Layer 1. Each of the fully designed agents has a different probability distribution selected around these nominal values. The instruct-only agents have only the Instructor-role ICPs, initialized as in the table. Columns give the nominal Basic Interaction Act probabilities.

Role       | Interaction Control Process | Look@Partner | Look@Salient | Look@LeastSa. | Look@Other | Look@Rand. | Nod | Say Statement | Say "mmmm" | Point@Salient | Wait | Random FBIA
Instructor | Speak To                    | 0.8  | 0.2  | 0.0  | 0.0 | 0.0  | 0.0 | 0.9 | 0.0 | 0.4 | 0.0 | 0.0
Instructor | Speak About                 | 0.2  | 0.8  | 0.0  | 0.0 | 0.0  | 0.0 | 0.9 | 0.0 | 0.7 | 0.0 | 0.0
Instructor | Busy Working                | 0.05 | 0.8  | 0.01 | 0.1 | 0.04 | 0.0 | 0.1 | 0.1 | 0.1 | 0.2 | 0.0
Instructor | Bring to Attention          | 0.5  | 0.5  | 0.0  | 0.0 | 0.0  | 0.0 | 0.5 | 0.0 | 0.9 | 0.0 | 0.0
Instructor | Ask Confirmation            | 0.99 | 0.01 | 0.0  | 0.0 | 0.0  | 0.6 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0
Listener   | Attend to Partner           | 0.9  | 0.1  | 0.0  | 0.0 | 0.0  | 0.2 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0
Listener   | Attend to Salient           | 0.1  | 0.9  | 0.0  | 0.0 | 0.0  | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0
Listener   | Scan Environment            | 0.2  | 0.2  | 0.2  | 0.2 | 0.2  | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0
Listener   | Show Interest               | 0.3  | 0.7  | 0.0  | 0.0 | 0.0  | 0.0 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0
Listener   | Give Confirmation           | 0.99 | 0.01 | 0.0  | 0.0 | 0.0  | 0.9 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0

Table 2. The Timing and Transition Distributions for the Explain Process (states: Start, Speak To, Speak About, Busy Working, Bring to Attention, Ask Confirmation, Halt; columns: timing limits in seconds (Min, Max), interaction time in minutes, and the transition probabilities from the current state).
