DEEP REINFORCEMENT LEARNING FOR HUMAN-ROBOT INTERACTION IN THE REAL-WORLD
A thesis submitted in conformity with the requirements for the degree of Master of Engineering, Graduate School of Engineering Science, Osaka University, Japan
by Ahmed Hussain Qureshi September 2017
DEEP REINFORCEMENT LEARNING FOR HUMAN-ROBOT INTERACTION IN THE REAL-WORLD Ahmed Hussain Qureshi Osaka University 2017
For robots to coexist with humans in a social world like ours, it is crucial that they possess human-like social interaction skills. Programming a robot to possess such skills is a challenging task. To address this challenge, we present two deep reinforcement learning based self-learning paradigms: 1) the Multimodal Deep Q-Network (MDQN), with which the robot learned to interpret and respond to complex human behaviors, and 2) the Multimodal Deep Attention Recurrent Q-Network (MDARQN), with which the robot not only learned to interpret and respond to complex human behaviors but also learned to indicate its attention. This attention indication adds perceivability to the robot's actions. The results of this method indicate that people show more willingness to interact with a robot which has a perceivable and interpretable behavior than with a robot which does not. The proposed models (MDQN and MDARQN) are trained using human-robot interaction experiences gathered in a real, uncontrolled world. These methods are also evaluated on a test dataset (not seen by the system during training) and through real-world human-robot interaction experiments.
ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest gratitude to my research supervisor, Prof. Hiroshi Ishiguro, who introduced me to the various real-world issues in our society which can be effectively solved by robotics. He continuously and convincingly conveyed a spirit of adventure with regard to research and provided indispensable advice, information and support on all aspects of this project. I am very grateful for his excellent guidance throughout my master's studies. I would also like to thank my co-advisors Dr. Yutaka Nakamura and Dr. Yuichiro Yoshikawa for their assistance, patience and dedicated involvement in every step of this project. They were always available to answer my questions and gave generously of their time and vast knowledge. Most importantly, none of this could have been possible without the support of my family. This research thesis stands as a testament to their unconditional love and encouragement.
TABLE OF CONTENTS

Acknowledgements
Table of Contents

1 Introduction

2 Reinforcement Learning
  2.1 Q-learning
  2.2 Deep Q-learning

3 Deep Reinforcement Learning for HRI
  3.1 System Overview
    3.1.1 States s ∈ S
    3.1.2 Actions A
    3.1.3 Reward function
  3.2 Interaction learning procedure
    3.2.1 Data generation phase
    3.2.2 Training phase
  3.3 Deep Neural Networks for Q-function approximation
    3.3.1 Multimodal Deep Q-Network
    3.3.2 Multimodal Deep Attention Recurrent Q-Network
  3.4 Gaze Control Mechanism

4 Results & Discussion
  4.1 Similarity to human behavior
  4.2 Discussion
    4.2.1 Intentions depicting factors
    4.2.2 Impact of attention steering on HRI
    4.2.3 Impact of reward function on robot's behavior

5 Conclusion & Future work

A Implementation Details
  A.1 Training dataset, data augmentation and pre-processing
  A.2 Experiment details and hyper-parameters

Bibliography
CHAPTER 1 INTRODUCTION
The field of Human-Robot Interaction (HRI) has emerged with the objective of socializing robots [6]. One of the biggest challenges faced by sociable robots is interpreting complex human interactive behaviors [4]. These interactive behaviors are complex not only because of their enormous diversity but also because they are driven by people's underlying, invisible intentions and social norms [4] [8]. It is therefore nearly impossible to devise a handcrafted policy for robots to interact with people. Perhaps the only practical way to realize social HRI is to build robots that can learn autonomously [5] through interaction with people in the real world. Recently, the notion of building robots that learn through interaction with people has gained the interest of many researchers and has thus led to various approaches. Notable methods include [17] [3] [2]. In [17], the robot learned to choose appropriate responsive behavior based on human intentions inferred from their body movements. However, we believe that human intentions are not only conveyed by body movements but also by
Figure 1.1: Robot learning social interaction skills from people.
other intention-depicting factors such as gaze direction, facial expression, body language, walking trajectory, ongoing activities, the surrounding environment, etc. In [3] [2], the interaction between two persons was recorded using a motion capture system, and the robot learned a responsive behavior from this recorded data by imitating the behavior of a human interaction partner. However, all the methods mentioned above consider only a single interaction partner, whereas, in the real world, the robot can be approached by any number of people. Hence, to cope with the problem of learning human-like interaction skills, it is important for a robot to learn through interaction with people in their natural and uncontrolled environment. Reinforcement learning (RL) attempts to solve the challenge of building a robot that learns through interaction (i.e., taking actions) with the physical world [16]. However, research in end-to-end reinforcement learning had been hampered by difficulties in the perception task until the recent development of the Deep Q-Network (DQN) [12]. The DQN combines deep learning [9] with reinforcement learning [16], and it learned to map high-dimensional images to actions for playing various arcade Atari games. In this thesis, two methods, both variants of DQN, are presented to extend the application of DQN from the Atari emulator to the problem of a robot learning social interaction skills through interaction with people in open public places (e.g., cafeterias, department receptions, common rooms, etc.). The first method is the Multimodal Deep Q-Network (MDQN) [13]. MDQN uses dual-stream convolutional neural networks to approximate the action-value function of the Q-learning method.
The dual stream structure processes the
grayscale and depth images independently, and the q-values from both streams
are fused together for choosing the best possible action in the given scenario. By using MDQN, the robot learned to greet people after 14 days of trial-and-error interaction with people at different public places such as a cafeteria, common rooms, the department entrance, etc. (as shown in figure 1.1). The robot could perform only one of four actions for an interaction, and the actions were Wait (W), LookTowardsHuman (LTH), WaveHand (H) and HandShake (HS). The results showed that the robot augmented with MDQN learned to choose appropriate actions in diverse real-world scenarios. However, in MDQN [13], the robot actions lacked perceivability as the robot could not indicate its attention during the interaction with people in the real world. Attention indication by the robot is crucial for an effective HRI. The research in [7] highlights that humans show more willingness to interact with a robot that can indicate its attention than with a robot that cannot. Therefore, to add perceivability to the robot actions in [13], this thesis presents a second method called the Multimodal Deep Attention Recurrent Q-Network (MDARQN) [14]. MDARQN adds perceivability to robot actions through a recurrent attention model (RAM) [15]. RAM enables the Q-network to focus on certain parts of the input image instead of processing it entirely at a fine scale. This region selection reduces the number of training parameters as well as the computational operations. Besides these computational benefits, RAM provides information about where MDARQN is looking while making a decision. In the proposed work, we utilize this visual attention information from RAM to realize perceivable HRI. So far, recurrent attention models have been applied successfully to various tasks such as object tracking [11], image classification [11], machine translation [1], and image captioning [19]. The research in [15] integrates RAM into DQN and surpasses the previous performance of DQN on
some of the Atari games. However, to the best of our knowledge, the applicability of RAM to perceivable HRI has not been explored yet. Furthermore, for MDARQN, we consider the same problem setting as for MDQN [13], i.e., the robot learns to greet people in public places with a set of four actions: wait, look towards human, wave hand and handshake. As the problem settings of [13] and MDARQN are the same, we exploit the human-robot interaction experiences collected in [13] for the off-policy training of our MDARQN method. For testing, we evaluate the performance of both MDQN and MDARQN on a test dataset (not seen by the system during training) as well as through experimentation in the real world. To facilitate the implementation of the proposed methods, we release the source code and the depth dataset collected during the experimentation (available at https://sites.google.com/a/irl.sys.es.osaka-u.ac.jp/member/home/ahmed-qureshi/deephri). During the experimentation, we collected both grayscale and depth frames; however, due to privacy concerns, we only release the depth dataset. The rest of the thesis is organized as follows. Chapter 2 presents the background on reinforcement learning and deep reinforcement learning. Chapter 3 explains the proposed reinforcement learning frameworks for real-world Human-Robot Interaction (HRI) problems. Chapter 4 presents the results as well as a discussion of the results. Finally, Chapter 5 concludes the thesis together with some pointers to future areas of research.
CHAPTER 2 REINFORCEMENT LEARNING
This chapter formulates the Q-learning framework [16] [18], which is a well-known reinforcement learning approach.
It also describes the Deep Q-
Network (DQN) [12], which forms the baseline for the proposed methods. We consider a discrete-time Markov decision process, as shown in figure 2.1, in which the agent interacts with an environment. At time t, the agent observes a state s_t and takes a discrete action a_t from a finite set of legal actions A = {1, 2, . . . , |A|} under a policy π. This action execution leads to a state transition from s_t to s_{t+1} and a scalar reward r(s_t, a_t) ∈ R. The ultimate goal of the agent is to learn a policy π that maximizes the expected total return,

R(s_t, a_t) = \mathbb{E}\left[ \sum_{t'=t}^{T} \gamma^{t'-t} r(s_{t'}, a_{t'}) \right],   (2.1)
where T and γ ∈ (0, 1] correspond to the terminal time and the discount factor, respectively. For the rest of the thesis, we denote the reward r(s_t, a_t) as r_t for brevity.
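To make Eq. 2.1 concrete, the following minimal Python sketch (illustrative only, not part of the thesis implementation) computes the discounted return of a finite reward sequence by accumulating rewards backwards in time:

```python
def discounted_return(rewards, gamma):
    """Compute R(s_t, a_t) = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for one episode suffix."""
    ret = 0.0
    # Accumulate from the last reward backwards: R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# Example: two zero rewards followed by a terminal reward of 1, with gamma = 0.99.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.99))  # 0.9801
```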
Figure 2.1: Reinforcement learning framework: (a) block view; (b) Markov decision process.
2.1 Q-learning
In Q-learning [16] [18], the policy π is determined from an action-value function Q(s, a), which indicates the expected return when the agent is in state s and takes an action a. Furthermore, the optimal action-value function represents the expected return for taking an action a in a state s and thereafter following an optimal policy π, i.e.,

Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R(s_t, a_t) \mid s_t = s, a_t = a, \pi \right].   (2.2)
This optimal action-value function also obeys a recursive Bellman relation, which is formalized as follows:

Q^*(s, a) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right].   (2.3)

2.2 Deep Q-learning
One of the practices in RL, especially Q-learning, is to estimate the action-value function by using a function estimator such as a neural network, i.e., Q(s, a) ≈ Q(s, a; θ). The parameters θ of the neural Q-network are adjusted iteratively towards the Bellman target. Recently, a new approach to approximate the action-value function, called the deep Q-network (DQN), has been introduced which is much more stable than previous techniques. In DQN, the action-value function is approximated by a deep convolutional neural network. The DQN technique for function approximation differs from previous methods in two ways: 1) It uses experience replay [10], i.e., it stores the agent's interaction experience with the environment, w_t = (s_t, a_t, r_t, s_{t+1}), into the replay memory M = {w_1, · · · , w_t} at each time-step; 2) It maintains two Q-networks: the Bellman target is computed by the target network with old parameters, i.e., Q(s, a; θ^-), while the learning network Q(s, a; θ) keeps the current parameters, which may get updated several times at each time-step. The old parameters θ^- are updated to the current parameters θ after every C iterations. In DQN, the parameters of the Q-network are adjusted iteratively towards the Bellman target by minimizing the following loss function:

L_i(θ_i) = \mathbb{E}_{(s,a,r,s') \sim M}\left[ \left( r + \gamma \max_{a'} Q(s', a'; θ^-) - Q(s, a; θ) \right)^2 \right].   (2.4)

Figure 2.2: Deep Q-Network
For each update i, a mini-batch is sampled from the replay memory. The current parameters θ are updated by stochastic gradient descent in the direction of the gradient of the loss function with respect to the parameters, i.e.,

\nabla_{θ_i} L_i(θ_i) = \mathbb{E}_{(s,a,r,s') \sim M}\left[ \left( r + \gamma \max_{a'} Q(s', a'; θ^-) - Q(s, a; θ) \right) \nabla_{θ_i} Q(s, a; θ) \right].   (2.5)
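The following sketch shows how the Bellman target and the loss of Eq. 2.4 can be computed for a sampled minibatch. It is written in PyTorch for readability (the original implementation was in Torch/Lua), and the function and argument names are hypothetical:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Sketch of the loss in Eq. 2.4 for one sampled minibatch.

    `batch` is assumed to be a tuple (s, a, r, s_next, terminal) of tensors drawn
    from the replay memory M; q_net holds theta and target_net holds theta^-.
    """
    s, a, r, s_next, terminal = batch
    # Q(s, a; theta) for the actions that were actually taken.
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bellman target: r + gamma * max_a' Q(s', a'; theta^-); no bootstrap at terminals.
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - terminal.float()) * q_next
    # Mean squared error of Eq. 2.4; its gradient w.r.t. theta corresponds to Eq. 2.5.
    return F.mse_loss(q_sa, target)
```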
Finally, the agent's behavior at each time-step is selected by an ε-greedy policy: the greedy strategy is adopted with probability 1 − ε, while the random strategy is adopted with probability ε. In this thesis, the baseline algorithm for the proposed methods is DQN. As
mentioned earlier, DQN has demonstrated its ability to learn from high-dimensional visual input to play arcade video games (see figure 2.2) at human and superhuman levels. However, the applicability of DQN to real-world human-robot interaction problems was not explored until the proposition of our methods (MDQN and MDARQN).
Algorithm 1: Interaction learning procedure

Initialize replay memory M to size N
Initialize training Q-network Q(s, a; θ) with parameters θ
Initialize target Q-network Q̂(s, a; θ^-) with weights θ^- = θ
for episode = 1, M do
    Data generation phase:
    Initialize the start state to s_1
    for t = 1, T do
        With probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(s_t, a; θ)
        s_{t+1}, r_t ← ExecuteAction(a_t)
        Store the transition (s_t, a_t, r_t, s_{t+1}) in M
    end for
    Training phase:
    Randomize the memory M for experience replay
    for i = 1, n do
        Sample a random mini buffer B from M
        while B is not empty do
            Sample a minibatch m of transitions (s_k, a_k, r_k, s_{k+1}) from B without replacement
            if s_{k+1} == Terminal then
                y_k = r_k
            else
                y_k = r_k + γ max_a Q̂(s_{k+1}, a; θ^-)
            end if
            Perform gradient descent on the loss (y_k − Q(s_k, a_k; θ))^2 w.r.t. the network parameters θ
        end while
    end for
    After every C episodes, sync θ^- with θ
end for
CHAPTER 3 DEEP REINFORCEMENT LEARNING FOR HRI
This chapter explains the two proposed methods for the real-world HRI problem. The first method is the Multimodal Deep Q-Network (MDQN) [13]. By using MDQN, the robot acquired social intelligence after 14 days of interaction with people in open public places (see figure 1.1). The results indicated the success of the method in interpreting complex human behaviors and choosing appropriate actions for the interaction. The second method is the Multimodal Deep Attention Recurrent Q-Network (MDARQN) [14], which extends MDQN by adding perceivability to the robot's behavior. In MDQN, the robot actions lacked perceivability as the robot could not indicate its attention. Therefore, MDARQN is proposed, which adds perceivability to robot actions through a recurrent attention model (RAM) [11]. This thesis considers a scenario in which the robot learns to greet people using a set of four legal actions, i.e., wait, look towards human, wave hand and handshake. The objective of the robot is to learn which action to perform in each situation. To realize this kind of real-world HRI, we consider the system shown in figure 3.1. It should be noted that our system, as shown in figure 3.1, takes real-world snapshots of the robot's interaction with its environment, whereas DQN, as shown in figure 2.2, considers simulated, non-real-world environments.
Figure 3.1: Deep Q-learning for HRI
3.1 System Overview
This section describes the states s ∈ S, the actions A and the reward formulation r. To realize real-world HRI, a Pepper robot (www.aldebaran.com/en/cool-robots/pepper/find-out-more-about-pepper; see figure 3.1) was used. In addition, we equipped Pepper's right hand with a Force-Sensitive Resistor (FSR) touch sensor which detects whether a handshake has happened or not, and this handshake detection forms the basis of our reward function (as discussed later in section 3.1.3). For aesthetic reasons, we hid the touch sensor under a woolen glove, as can be seen in figures 1.1 and 3.1.
3.1.1 States s ∈ S
The input state for the Q-function approximator comprises eight grayscale and eight depth images. Out of the many built-in sensors of Pepper, only the 2-D camera (located on the robot's forehead) and the ASUS Xtion 3-D sensor (located behind the robot's eyes) were used to obtain the grayscale and depth images, respectively. The 2-D camera and the 3-D sensor were operated at 10 fps with 320×240 resolution. These grayscale and depth images were then downsized to 198×198.
To prepare an input state for our proposed deep neural networks, the eight most recent grayscale and depth frames at each time step were stacked together.
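A minimal sketch of this state construction is shown below. It assumes an OpenCV-style resize and NumPy arrays; the actual frame-grabbing and pre-processing code of the thesis is not reproduced here:

```python
from collections import deque

import cv2            # assumed here only for resizing; any image library would do
import numpy as np

FRAME_SIZE = 198      # frames are downsized to 198 x 198 as described above
STACK_LEN = 8         # the eight most recent frames form one input state

gray_stack = deque(maxlen=STACK_LEN)
depth_stack = deque(maxlen=STACK_LEN)

def push_frames(gray_frame, depth_frame):
    """Downsize the latest 320x240 grayscale/depth frames and append them to the frame stacks."""
    gray_stack.append(cv2.resize(gray_frame, (FRAME_SIZE, FRAME_SIZE)))
    depth_stack.append(cv2.resize(depth_frame, (FRAME_SIZE, FRAME_SIZE)))

def current_state():
    """Return the state s as two 8 x 198 x 198 arrays once eight frames have been collected."""
    assert len(gray_stack) == STACK_LEN and len(depth_stack) == STACK_LEN
    return np.stack(gray_stack), np.stack(depth_stack)
```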
3.1.2 Actions A
This section describes the robot's four actions, i.e., wait, look towards human, wave hand and handshake. Before performing any of these actions, the robot steers its attention as indicated by the gaze control system. There are two methods for gaze control (see section 3.4): 1) gaze steering with a neural attention model and 2) gaze steering without a neural attention model. The implementation of the four actions varies slightly with the underlying gaze control system. The description of the actions is as follows.
Wait:
In case of MDARQN-based neural attentioning, the robot does nothing other than attending to the attention location given by MDARQN. However, in case of MDQN/random-policy, where there is no neural attention system, the robot randomly moves its head within the allowable range of head pitch and head yaw (see figure 3.2).

Figure 3.2: Action: Wait

Look towards human:

During this action, if there is a human in the robot's field of view, the robot tracks the person with its head (see figure 3.3). If this action is being performed under the neural attention mechanism, the robot tracks the human within a narrow field in order to avoid any desynchronization with the attention mechanism.
Figure 3.3: Action: Look towards human
Wave hand:
During this action, the robot waves its hand (see figure 3.4) and says "Hello".
Figure 3.4: Action: Wave hand
Handshake:
To perform a handshake, the robot lifts its right hand up to a certain height and then waits for a few seconds. If the FSR touch sensor detects a touch, the robot grabs the person's hand; otherwise, the robot brings its hand down to the default position (see figure 3.5).
Figure 3.5: Action: Handshake
3.1.3 Reward function
Handshake detection through the touch sensor forms the basis of our reward function. The robot gets a reward of 1 for a successful handshake and -0.1 for an unsuccessful one. Furthermore, a reward of 0 is given for actions other than handshake (see Equation 3.1). A handshake is successful if the human and the robot actually shake each other's hands, while it is unsuccessful when the robot attempts a handshake but the handshake does not happen (see figure 3.6).
r_t = \begin{cases} 1 & \text{if the handshake is successful} \\ -0.1 & \text{if the handshake is unsuccessful} \\ 0 & \text{for actions other than handshake} \end{cases}   (3.1)

Figure 3.6: Instances of (a) successful and (b) unsuccessful handshakes.
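The reward of Eq. 3.1 amounts to the following simple rule; the action names used in this sketch are illustrative labels, not identifiers from the original code:

```python
def reward(action, handshake_successful):
    """Reward of Eq. 3.1: only the handshake action is rewarded or penalised.

    `handshake_successful` is the boolean reading derived from the FSR touch sensor.
    """
    if action != "handshake":
        return 0.0                                 # wait, look towards human, wave hand
    return 1.0 if handshake_successful else -0.1   # successful vs. unsuccessful handshake
```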
3.2 Interaction learning procedure
Algorithm 1 outlines the interaction learning procedure for a robot to learn social behavior through interaction with people in the real world. Unlike DQN [12], we separate the data generation and training phases. Each day of the experiment corresponds to an episode during which the algorithm executes both the data generation phase and the training phase. A brief description of both phases follows.
3.2.1 Data generation phase
During the data generation phase, the system interacts with the environment using the Q-network Q(s, a; θ). The system observes the current scene, which comprises grayscale and depth frames, and takes an action using the ε-greedy strategy. The environment in return provides the scalar reward (please refer to section 3.1.3 for the definition of the reward function). The interaction experience w_i = (s_i, a_i, r_i, s_{i+1}) is stored in the replay memory M. The replay memory M keeps the N most recent experiences, which are later used by the training phase for updating the learning network parameters θ.
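A minimal sketch of such a replay memory is given below (the capacity of 3750 and the mini buffer size of 2000 follow Appendix A.2; the class itself is illustrative, not the original implementation):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size replay memory M keeping only the N most recent interaction experiences."""

    def __init__(self, capacity=3750):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded automatically

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample_mini_buffer(self, size=2000):
        """Randomly sample the mini buffer B used by the training phase."""
        return random.sample(list(self.buffer), min(size, len(self.buffer)))
```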
3.2.2 Training phase
During the training phase, the system utilizes the collected data, stored in the replay memory M, for training the networks. The hyperparameter n denotes the number of experience replays. For each experience replay, a mini buffer B of 2000 interaction experiences is randomly sampled from the finite-sized replay memory M. The model is trained on mini-batches m sampled from the buffer B, and the network parameters θ are updated iteratively in the direction of the Bellman targets (see Equations 2.4 & 2.5). The random sampling from the replay memory breaks the correlation among the samples, since standard reinforcement learning methods assume that the samples are independently and identically distributed. The reason for dividing the algorithm into two phases is to avoid the delay that would be caused if the networks were trained during the interaction period. The DQN agent [12] works in a cycle in which it first interacts with the environment and stores the transition into the replay memory, then samples a mini batch from the replay memory and trains the network on this mini batch. This cycle is repeated until termination occurs. Such a sequential process of interaction and training may be acceptable in fields other than HRI. In HRI, the agent has to interact with people based on social norms, so any pause or delay while the robot is on the field is unacceptable. Therefore, we divide the algorithm into two stages: in the first stage, the robot gathers data through interaction for some finite period of time; in the second stage, it goes to its rest position. During the resting period, the training phase is activated to train the neural model.
3.3 Deep Neural Networks for Q-function approximation
The proposed method uses a deep neural network based action-value function approximator. The input to this function is a state s. In this research, a tuple of observed images is treated as the state, and thus the input becomes a high-dimensional vector consisting of the pixel values of all images. The output of the network is a K-dimensional vector whose k-th element corresponds to the value of the k-th action. The function approximator is trained so that each output value is adjusted towards the expected return (Eq. 2.1), i.e.,

q_k(s) ≈ R(s, a_k),   (3.2)

where q_k(s) is the value of the k-th element of the output vector corresponding to the k-th action. The training objective for our function approximators is the same as for DQN (see Equation 2.4). The following sections describe our two function approximators: first, the Multimodal Deep Q-Network, and second, the Multimodal Deep Attention Recurrent Q-Network.
Figure 3.7: Multimodal Deep Q-Network
3.3.1 Multimodal Deep Q-Network
The input state s to our MDQN function approximator consists of grayscale and depth images. Figure 3.7 shows the structure of MDQN. It comprises two streams of convolutional neural networks: a grayscale channel which processes grayscale images and a depth channel which processes depth images. It should be noted that these two streams are trained independently. The output q-values are fused together only for taking the greedy action: the outputs from each stream of the dual-stream Q-network are fused together, and the action with the highest q-value is selected. For the fusion of outputs from each stream of the Q-network, the algorithm first normalizes the q-values from each stream and then takes an average of these normalized q-values.
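The fusion rule can be sketched as follows. The exact normalization used in the original code is not specified in the text, so a simple sum-of-absolute-values normalization is assumed here:

```python
import numpy as np

def fuse_q_values(q_gray, q_depth):
    """Fuse the two streams' outputs: normalise each q-vector, then average them."""
    norm_gray = q_gray / np.sum(np.abs(q_gray))
    norm_depth = q_depth / np.sum(np.abs(q_depth))
    return 0.5 * (norm_gray + norm_depth)

def greedy_action(q_gray, q_depth):
    """Return the index of the action with the highest fused q-value."""
    return int(np.argmax(fuse_q_values(q_gray, q_depth)))
```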
MDQN architecture
The proposed model comprises two streams, one for the grayscale information and another for the depth information. The structure of the two streams is identical, and each stream comprises eight layers (excluding the input layer). The overall model architecture is schematically shown in figure 3.7. The inputs to the grayscale channel and the depth channel of the multimodal Q-network are grayscale (198 × 198 × 8) and depth images (198 × 198 × 8), respectively. Since each stream takes eight frames as input, the last eight frames from the corresponding camera are pre-processed and stacked together to form the input for each stream of the network. Since the two streams are identical, we only discuss the structure of one of them. The 198 × 198 × 8 input images are given to the first convolutional layer (C1), which convolves 16 filters of 9 × 9 with stride 3, followed by a rectified linear unit (ReLU) function, and results in 16 feature maps each of size 64 × 64 (we denote this by 16@64 × 64). The output from C1 is fed into the sub-sampling layer S1, which applies 2 × 2 max-pooling with a stride of 2 × 2. The second (C2) and third (C3) convolutional layers convolve 32 and 64 filters, respectively, of size 5 × 5 with stride 1. The outputs from C2 and C3 pass through the non-linear ReLU function and are fed into sub-sampling layers S2 and S3, respectively. The final hidden layer is fully connected with 256 rectifier units. The output layer is a fully-connected linear layer with 4 units, one unit for each legal action.
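For illustration, a single MDQN stream with the layer sizes listed above can be sketched in PyTorch as follows (the original model was implemented in Torch/Lua, so this is a reconstruction for readability, not the released code):

```python
import torch.nn as nn

class MDQNStream(nn.Module):
    """One stream (grayscale or depth) of MDQN with the layer sizes described above."""

    def __init__(self, num_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(8, 16, kernel_size=9, stride=3),   # C1: 8x198x198 -> 16x64x64
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                          # S1: -> 16x32x32
            nn.Conv2d(16, 32, kernel_size=5, stride=1),  # C2: -> 32x28x28
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                          # S2: -> 32x14x14
            nn.Conv2d(32, 64, kernel_size=5, stride=1),  # C3: -> 64x10x10
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                          # S3: -> 64x5x5
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 256),                  # fully connected, 256 rectifier units
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # one linear output per legal action
        )

    def forward(self, x):
        return self.head(self.features(x))
```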
3.3.2 Multimodal Deep Attention Recurrent Q-Network
In this section, we describe our second proposed neural model, MDARQN, with which the robot learns to perform perceivable HRI. The MDARQN architecture comprises two streams of identically structured neural Q-networks, one for processing the grayscale frames and the other for processing the depth frames (see figure 3.8). Each of these neural Q-network streams is trained independently of each other.

Figure 3.8: Multimodal Deep Attention Recurrent Q-Network

Since the two Q-network streams are identical and trained independently, for simplicity we only discuss the structure of a single stream of the dual-stream Q-network. Each stream consists of three neural models: 1) convnets; 2) a Long Short-Term Memory (LSTM) network; and 3) an attention network (G). The rest of this section explains these three neural models and the flow of information between them (as also shown in figure 3.8).
Convnets
The convnets take a pre-processed visual frame as input at each time-step and transform it into L feature vectors, each of which provides a D-dimensional representation of a part of the input image, i.e., a_t = {a_t^1, · · · , a_t^L}, a_t^l ∈ R^D. These feature vectors are taken as input by the attention network for generating the annotation vector z ∈ R^D.
Figure 3.9: MDARQN: LSTM
LSTM
We employ the LSTM implementation shown in figure 3.9. In figure 3.9, i_t, f_t, o_t, c_t, and h_t correspond to the input, forget, output, memory and hidden state of the LSTM, respectively. Let d be the dimensionality of all LSTM states, and let the matrix M : R^a → R^b be an affine transformation with trainable parameters, with dimensions a = d + D and b = 4d. As shown in figure 3.9, the LSTM network takes the annotation vector z_t ∈ R^D, the previous hidden state h_{t−1}, and the previous memory state c_{t−1} as input in order to produce the next hidden state h_t. This hidden state h_t is given to the attention network G and to the linear output layer, for generating the annotation vector z_{t+1} at the next time step and for providing the output q-value for each of the legal actions, respectively.
Attention network
The attention network generates a dynamic representation, called the annotation vector z_t, of the corresponding parts of the input image at time t. The attention mechanism φ, a multilayer perceptron, takes the L D-dimensional feature vectors a_t and the previous hidden state h_{t−1} of the LSTM network as input for computing positive weights β_t^l for each location l. The weights β_t^l are computed as follows:
\beta_t^l = \frac{\exp(\alpha_t^l)}{\sum_{k=1}^{L} \exp(\alpha_t^k)},   (3.3)

where \alpha_t^l = \phi(a_t^l, h_{t-1}). The annotation vector z_t is computed as z_t = \sum_{l=1}^{L} \beta_t^l a_t^l.
This annotation vector is used by the LSTM to compute the next hidden state. There are two types of attention networks in the literature [19]: the soft and the hard attention network. The attention network used in MDARQN is the soft attention network, which, unlike the hard attention network, is fully differentiable and deterministic. Since each stream of the MDARQN model is fully differentiable, each network stream is trained by minimizing the general loss function (Equation 2.4) through standard back-propagation. Finally, the outputs from the two streams are fused together for taking a greedy action, as shown in figure 3.8. For the fusion, the output q-values from each Q-network stream are first normalized, and then these normalized q-values are averaged together to generate the output q-values of MDARQN. The greedy action is then taken by picking the action with the highest fused q-value.
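A sketch of the soft attention step of Eq. 3.3 is given below, again in PyTorch for readability. The internal layout of the multilayer perceptron φ (an additive scoring network) is an assumption, since the text does not specify it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Sketch of the soft attention step of Eq. 3.3 with L = 49 locations and D = 256 features."""

    def __init__(self, feat_dim=256, hidden_dim=256, attn_dim=256):
        super().__init__()
        # phi: a small multilayer perceptron scoring each location l given h_{t-1}.
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h_prev):
        # feats: (batch, L, D) feature vectors a_t; h_prev: (batch, hidden_dim) LSTM state h_{t-1}.
        alpha = self.score(torch.tanh(self.feat_proj(feats)
                                      + self.hidden_proj(h_prev).unsqueeze(1)))   # (batch, L, 1)
        beta = F.softmax(alpha, dim=1)       # attention weights beta_t over the L locations
        z = (beta * feats).sum(dim=1)        # annotation vector z_t = sum_l beta_t^l a_t^l
        return z, beta.squeeze(-1)
```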
MDARQN architecture
This section provides the architecture details of the MDARQN model. The MDARQN consists of two streams: a grayscale-channel stream and a depth-channel stream for processing the grayscale and depth images, respectively.
Figure 3.10: MDARQN: Convnets
Since the structures of both streams are identical, we only discuss one of them. The convolutional neural network (see figure 3.10) consists of four convolution layers, each of which is followed by a non-linear rectifier function. The input dimension to the CNN is 1 × 198 × 198. Convolution layers 1, 2, 3 and 4 convolve 16 filters of 9 × 9, 32 filters of 8 × 8, 64 filters of 7 × 7 and 256 filters of 6 × 6, respectively. The strides of convolution layers 1, 2, 3 and 4 are 3, 2, 2 and 1, respectively. The CNN outputs 256 feature maps of dimension 7 × 7. The output feature maps from the CNN are given to the attention network, which takes 49 vectors, each of size 256. To be consistent with the attention network, the LSTM network also has 256 units. To generate Q-values for the four actions, the output of the LSTM is transformed to four units through a linear layer preceded by a non-linear rectifier unit.
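The feature-map sizes quoted above can be verified with the following PyTorch sketch of one convnet stream (a reconstruction for illustration, not the original Torch/Lua code):

```python
import torch
import torch.nn as nn

# One MDARQN convnet stream; the comments trace the feature-map sizes quoted above.
convnet = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=9, stride=3), nn.ReLU(),    # 1x198x198 -> 16x64x64
    nn.Conv2d(16, 32, kernel_size=8, stride=2), nn.ReLU(),   # -> 32x29x29
    nn.Conv2d(32, 64, kernel_size=7, stride=2), nn.ReLU(),   # -> 64x12x12
    nn.Conv2d(64, 256, kernel_size=6, stride=1), nn.ReLU(),  # -> 256x7x7
)

x = torch.zeros(1, 1, 198, 198)
feats = convnet(x)                        # (1, 256, 7, 7)
feats = feats.flatten(2).transpose(1, 2)  # (1, 49, 256): L = 49 vectors of size D = 256
print(feats.shape)
```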
3.4 Gaze Control Mechanism
There are two types of gaze control methods: 1) neural attention based gaze steering, which utilizes the MDARQN model to enable the robot to indicate its attention; and 2) gaze steering without a neural attention model, which is used by the robot with MDQN or the random policy to interact with the environment.
Neural Attention Gaze Steering
In order to ensure perceivable HRI, we utilize the annotation vector z given by the attention network G for attention steering of the robot. This attention steering is done as follows. The images used by MDARQN have dimensions 198 × 198. We divide the horizontal and vertical axes of the input image into five sub-regions each, based on author-defined thresholds. The horizontal axis is divided into leftmost, left, center, right and rightmost regions, while the vertical axis is divided into topmost, top, center, bottom and bottommost regions. The indicators I^x ∈ {−2, −1, 0, 1, 2} and I^y ∈ {−2, −1, 0, 1, 2} indicate these sub-regions of the horizontal and vertical axes, respectively, starting from −2, which corresponds to the leftmost/topmost region. The robot attention mechanism uses the annotation vector z_t to extract the pixel location on the input image, which we call the attention mark, where MDARQN pays the maximum attention. The attention mark and the author-defined thresholds are then used to determine the indicator values I = {I^x, I^y}. The indicator I and the robot's actual position indicator I^a = {I_x^a, I_y^a} are then used to compute the next attention location. The value of the robot's actual position indicator at a given time-step is computed relative to the actual location at the previous time-step, and it is calculated as follows:
I_t^a = \begin{cases} 0 & \text{if } I + I_{t-1}^a > 2 \text{ or } I + I_{t-1}^a < -2 \\ I + I_{t-1}^a & \text{otherwise} \end{cases}   (3.4)
The robot's actual position indicator is initialized to its central location, i.e., I_0^a = 0. The motion of the robot in the real world is determined by ω = {ω_x, ω_y}, where ω_x controls the rotation of the robot body while ω_y controls the robot's head projection. ω is computed as follows:

\{\omega_x, \omega_y\} = \{s\theta_1, s\theta_2\}, \quad \text{where } s = I_t^a - I_{t-1}^a.   (3.5)
The values of θ_1 and θ_2 are π/6 and π/9, respectively. It should be noted that it is important to bind the robot's motion to predefined regions: in the proposed work, we utilize the visual sensors embedded in the robot, which have a limited field of view and are mobile due to the robot's motion. Hence, restricted motion allows safe localization of the robot in public environments.
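Reading Eqs. 3.4 and 3.5 per axis, the indicator update and the resulting motion command can be sketched as follows (the per-axis interpretation is an assumption made for this illustration):

```python
import math

THETA_1 = math.pi / 6   # angular step for the body rotation (horizontal axis)
THETA_2 = math.pi / 9   # angular step for the head projection (vertical axis)

def update_indicator(I, I_prev):
    """Eq. 3.4, applied to one axis: reset to the centre (0) when the sum leaves [-2, 2]."""
    return 0 if (I + I_prev > 2 or I + I_prev < -2) else I + I_prev

def motion_command(I_t, I_prev):
    """Eq. 3.5, interpreted per axis: the indicator change scales the predefined angular steps.

    I_t and I_prev are (I_x, I_y) pairs; the result (omega_x, omega_y) drives the body
    rotation and the head projection, respectively.
    """
    s_x = I_t[0] - I_prev[0]
    s_y = I_t[1] - I_prev[1]
    return s_x * THETA_1, s_y * THETA_2
```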
Gaze Steering without Neural Attention Model
The robot augmented with MDQN or the random policy (the random policy is also used for taking non-greedy actions) does not indicate the region of the images responsible for its decision. Therefore, it is necessary to equip the robot with another attention system that facilitates its interaction with humans. This attention steering mechanism instills awareness into the robot and makes it sensitive to stimuli coming from the real world. The stimuli used are sound and movement detection. In case the robot senses any stimulus, it looks for a human at the stimulus origin. If there is no human, the robot returns to its previous orientation; otherwise, it tracks the human with its head in order to engage them for an interaction.
CHAPTER 4 RESULTS & DISCUSSION
To evaluate the performance of MDQN and MDARQN, we collected a test dataset through real-world human-robot interaction experimentation. For the test dataset, the robot was augmented with a random policy, i.e., the robot selected the actions for interaction randomly. This test dataset comprises 560 interaction experiences (4480 grayscale and depth frames). In addition, we also evaluated the behavior of the trained MDQN and MDARQN through a two-day real-world human-robot interaction experiment. In this experiment, the robot executed 600 interaction steps each day, for around 2 hours during the lunch break. On the first day, the robot was given the trained MDQN model for interaction, whereas on the second day it was equipped with the trained MDARQN model. Through this experimentation, we noted the handshake ratio (described in the next section) for both MDQN and MDARQN. The rest of this chapter summarizes the results of these evaluations.

Trained Model             MDARQN(Aug)   MDARQN   MDQN
Hand-shake ratio          0.74          -        0.48
Accuracy (%)              95.2          91.3     95.3
True positive rate (%)    90.5          82.7     90.7
False positive rate (%)   3.15          5.77     3.09

Table 4.1: Performance measures of the trained Q-networks.
Figure 4.1: Successful cases of the policy networks' decisions: (a)-(c) Wait, (d)-(f) LookTowardsHuman, (g)-(i) Wave hand, (j)-(l) Handshake.
4.1 Similarity to human behavior
To evaluate the similarity between the policy networks' (MDQN and MDARQN) behavior and human social behavior, we employed the following evaluation procedure. Three volunteers evaluated the decisions of the policy networks on the test dataset. Each volunteer observed the sequence of eight grayscale frames depicting a scenario, followed by the policy network's decision (the action with the maximum q-value). The volunteer then decided whether the decision was in agreement with what a human would do in that situation. If the majority of volunteers considered the decision to be inappropriate, they were asked to pick the most appropriate action for the depicted scenario. Following this evaluation procedure, we evaluated several policy networks, as shown in table 4.1. This table compares the performance of three models: MDARQN(Aug), MDARQN and MDQN. MDARQN(Aug) was trained on an augmented training dataset, while MDARQN and MDQN were trained on the unaugmented training dataset. The details of the data augmentation technique are given in Appendix A.1. The nomenclature used in table 4.1 is as follows. The handshake ratio measures how often the robot augmented with a certain Q-network can attract people to a handshake in an uncontrolled public environment. This handshake ratio is computed as the ratio of the number of successful handshakes performed by the robot to the total number of handshake attempts. Accuracy is a measure of how often the Q-network's predictions were correct. The true positive rate corresponds to the percentage of positive targets predicted as positive. The false positive rate measures how often negative examples were classified as positive.
The first row of table 4.1 indicates that the handshake ratios for a robot augmented with MDARQN(Aug) and MDQN are 0.74 and 0.48, respectively. Furthermore, it can be seen in the last three rows of table 4.1 that MDARQN(Aug) and MDQN demonstrate similar performance, while MDARQN has relatively inferior performance on the test dataset. In addition, we noticed that the performance of the individual streams of MDQN and MDARQN(Aug) was also similar, with a true positive rate of around 70%. This indicates that, in order to provide performance similar to the neural network without attention, the attention-driven neural networks require more training data. From now onward, all results presented correspond to our proposed MDARQN(Aug) and MDQN. Figures 4.1 and 4.2 show the successful and unsuccessful cases of MDARQN(Aug)/MDQN decisions, respectively, in the depicted scenarios. In figure 4.1, in each sub-figure, the top two frames (starting from the left) show the first and the last frame out of the eight most recent frames for a situation, while the bottom two images indicate the region of attention (given by MDARQN) on these frames. The action with the maximum q-value is considered the best action. In general, the robot learned to take the waiting (W) action when there were no people around the robot, people were busy in some activity, or people were going away from the robot (see figures 4.1(a)-4.1(c)). The look towards human (LTH) action, as shown in figures 4.1(d)-4.1(f), was taken when people were mildly engaged with the robot. The term mildly engaged refers to the engagement in which people were either gazing at the robot from a distance or standing in front of the robot without looking at it. The wave hand (H) action (see figures 4.1(g)-4.1(i)) was taken when people were distant and not looking at the robot. Lastly, the handshake (HS) action, as shown in figures 4.1(j)-4.1(l), was taken when people were fully engaged with the robot. Full engagement means that people were not only standing in front of the robot but also looking at it. The robot's successful decisions and their implications are further discussed in detail in the next section. Finally, figure 4.2 shows the scenarios from the test dataset in which the neural policies (MDARQN/MDQN) made an incorrect decision, i.e., there was a contradiction between the decision of the human evaluators (HE) and the neural policy (MDQN and MDARQN(Aug)), denoted as NP.

Figure 4.2: Unsuccessful cases of the policy networks' decisions: (a) NP: HS, HE: H; (b) NP: HS, HE: W; (c) attention error.
4.2 Discussion
This section provides a brief discussion of the intention prediction ability of MDQN and MDARQN, the impact of neural attention model based gaze steering, and the impact of the reward function definition on HRI.
4.2.1 Intentions depicting factors
The instances shown in figure 4.1 and the decisions of the neural policies (NP: MDARQN/MDQN) in these scenarios suggest that the proposed NPs have an understanding of intention-depicting factors, such as people's walking trajectories, their level of engagement with the robot, head orientation, any activity in progress, etc., and their implications. In human-human interaction, these factors help people choose the appropriate action for an interaction. Likewise, the NPs' understanding of these factors further validates the high similarity between the NPs' behavior and human social behavior. In figures 4.1(c), 4.1(d) and 4.1(h), the first and last frames out of the eight most recent frames are shown to indicate the people's walking trajectories. In figure 4.1(c), a person is walking away from the robot and thus the NP decides to wait. However, in figures 4.1(d) and 4.1(h), people are approaching the robot, so the NP decides to look towards them and wave its hand, respectively, to get their attention for an interaction. The understanding of the level of human engagement with the robot can be seen in figures 4.1(d)-4.1(l). In figures 4.1(d)-4.1(f), people are either standing close to the robot or are a small distance away but looking at the robot. Therefore, the NP chooses the look towards human action, which is a softer way of gaining people's attention. In figures 4.1(g)-4.1(i), people are at a relatively longer distance, so the level of engagement is lower compared to the scenarios shown in figures 4.1(d)-4.1(f). Therefore, in the scenes of figures 4.1(g)-4.1(i), the NP picks the wave hand action, which is a stronger way of getting people's attention. Finally, in figures 4.1(j)-4.1(l), people are fully engaged with the robot, so the agent picks the handshake action. The agent has also learned that if people are not around the robot, no action other than waiting should be chosen (see figure 4.1(a)).
Finally, the NPs' understanding of ongoing activities is evident from the action taken in the situation shown in figure 4.1(b). In figure 4.1(b), the person is taking the robot's picture and, therefore, the NP decides to wait.
4.2.2 Impact of attention steering on HRI
The higher handshake ratio of MDARQN compared to MDQN indicates that people show more willingness to interact with a robot that exhibits its attention than with a robot that does not. This result is in accordance with the findings of [7], and hence attention indication is important for successful HRI. Despite the higher handshake ratio of MDARQN with respect to the other neural policy, the handshake ratios of both models are actually low. One of the reasons for this low ratio is the robot's repeated attempts to perform a handshake with a person who is fully engaged (as can also be seen in the accompanying video), whereas we humans avoid multiple handshakes. Therefore, in future work, we hope to add memory to the model so that the robot can determine with whom it has already interacted. In addition to willingness, attention is also important to determine with whom the robot is intending to interact out of the many people in a scene. For instance, in figures 4.1(j) and 4.1(k), the attention network highlights the person on the right side and the person at the center, respectively, in order to perform a handshake with them. It should be noted that this precise attentioning would not be possible without neural attention models.
Figure 4.3: Effect of the reward function on the robot's behavior: (a) MDQN, (b) MDARQN.
4.2.3 Impact of reward function on robot's behavior
The reward function definition determines the robot's behavior; the results presented so far were based on the reward function discussed earlier. In this section, we evaluate the effect of different reward functions on the robot's behavior. A high penalty on an unsuccessful handshake inculcates rude behavior in the robot, as the robot becomes reluctant to handshake, while a low penalty (e.g., 0) inculcates amiable behavior, as the robot repeatedly attempts to perform a handshake. In order to test which robot behavior is acceptable, we trained five models with penalties on an unsuccessful handshake of 0, 0.1, 0.2, 0.5 and 1, while the rest of the reward function definition was kept the same as discussed earlier. The performance of these models was evaluated on the test dataset following the agent decision evaluation procedure (discussed earlier). The graphs in figures 4.3(a) and 4.3(b) show that the models (MDQN and MDARQN, respectively) with a penalty of 0.1 on an unsuccessful handshake generate more socially acceptable decisions than the other models.
CHAPTER 5 CONCLUSION & FUTURE WORK
For robots to enter our social world, it is important for them to learn human-like social behavior through interaction with people in real, reward-sparse environments. Furthermore, for successful HRI, it is also essential for a robot to interpret complex human behaviors and respond to them in a socially acceptable and perceivable way. To realize successful HRI in the real world, this thesis introduced two deep neural network based methods: first, the Multimodal Deep Q-Network (MDQN), with which the robot learns social interaction skills through a trial-and-error method; and second, the Multimodal Deep Attention Recurrent Q-Network (MDARQN), which enabled the robot to respond to complex human behaviors by first interpreting them and then executing a responsive action with attention indication. The results show i) that the robot, with the given neural policies (MDQN and MDARQN), has learned to infer intentions from intention-depicting factors such as human body language, walking trajectory or any ongoing activity; ii) that the attention indication provided by MDARQN adds perceivability to robot actions and thus people show more willingness to interact with the robot; and iii) that the interaction scenarios were diverse and hard to envision in advance, and yet our neural policies learned to make appropriate decisions in these diverse scenarios. In future work, we plan to i) explore the impact of different fusion strategies on multimodal learning; ii) augment the proposed network with differentiable working memories in order to realize long-term HRI; iii) increase the robot's action space instead of restricting it to four actions only; and iv) evaluate the influence of the three actions other than handshake on human behavior.
APPENDIX A IMPLEMENTATION DETAILS
This section outlines the implementation details of the proposed project. The MDARQN/MDQN code was built on the baseline [12] and is implemented in Torch/Lua (http://torch.ch/). The robot-side programming was done in Python. The system used for training MDARQN/MDQN had a 3.40 GHz × 8 Intel Core i7 processor with 32 GB RAM and a GeForce GTX 980 GPU. The rest of this section explains the various modules of the project.
A.1 Training dataset, data augmentation and pre-processing
We double the training dataset through two data augmentation techniques: 1) random cropping of the 320 × 240 input image to the size suitable for the model, i.e., 198 × 198; and 2) mirroring the input image and then cropping it randomly to the size 198 × 198. The total training data collected during the 14 days of the experiment comprise 111,504 grayscale and depth frames. After data augmentation, the number of grayscale and depth frames grows to 223,008. In order to ensure similar performances of MDQN and MDARQN, MDARQN was trained with the augmented data whereas MDQN was trained with the unaugmented data.
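The two augmentation variants can be sketched as follows; the random-crop offsets and the use of NumPy are illustrative choices, not details taken from the released code:

```python
import random
import numpy as np

CROP = 198   # network input size

def random_crop(img):
    """Crop a random 198 x 198 window out of a 240 x 320 frame."""
    h, w = img.shape[:2]
    top = random.randint(0, h - CROP)
    left = random.randint(0, w - CROP)
    return img[top:top + CROP, left:left + CROP]

def augment(frame):
    """Return the two augmented variants described above: a random crop and a mirrored random crop."""
    return random_crop(frame), random_crop(np.fliplr(frame))
```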
A.2 Experiment details and hyper-parameters
We conducted the experiment for 14 days using MDQN. Every day, the data generation phase was executed for around 4 hours, followed by the learning phase. The number of interaction steps the robot could perform during the 4-hour data generation phase depended on the internet speed, as we used a wireless medium for communication between Pepper and the computer system running MDQN. For each interaction step, the robot provides the eight most recent depth and grayscale frames. The replay memory M stored up to the 3750 most recent interaction experiences. During the learning phase, a mini buffer B of 2000 samples was randomly sampled from the replay memory M. This mini buffer was then used for mini-batch training of the Q-network using the RMSProp algorithm. This mini-batch training was repeated 10 times (i.e., n = 10) during the learning phase, and the size of a mini-batch m was 25 samples. The target network parameters θ^- were updated to the learning network parameters θ every day after training, and the learning rate was kept constant at 0.00025. The exploration parameter ε was annealed linearly from 1 to 0.1 over 28,000 interaction steps; however, during the 14 days of the experiment, the robot could perform only 13,938 interaction steps due to variations in internet speed at different locations. The interaction experiences collected from the above-mentioned experimentation were utilized for the off-policy training of MDARQN. This model, MDARQN, was trained in the same fashion as MDQN, i.e., MDARQN was trained over the 14 episodes, and each episode comprised a data-loading and a learning phase. The hyper-parameters were exactly the same as mentioned above. Furthermore, as suggested in [15], the initial LSTM hidden and memory states were zeroed for each new mini-batch.
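For reference, the hyper-parameters reported above can be summarized as follows, together with the linear annealing schedule of ε (the helper below is a sketch, not the original code):

```python
# Hyper-parameter summary as reported above (values reproduced, not newly chosen).
HYPERPARAMS = {
    "replay_memory_size": 3750,     # N: most recent interaction experiences kept
    "mini_buffer_size": 2000,       # B: sampled once per experience replay
    "experience_replays": 10,       # n: replays per learning phase
    "minibatch_size": 25,           # m
    "learning_rate": 0.00025,       # RMSProp step size
    "epsilon_start": 1.0,
    "epsilon_end": 0.1,
    "epsilon_anneal_steps": 28000,
}

def epsilon(step, p=HYPERPARAMS):
    """Linear annealing of the exploration parameter over the first 28,000 interaction steps."""
    frac = min(step, p["epsilon_anneal_steps"]) / p["epsilon_anneal_steps"]
    return p["epsilon_start"] + frac * (p["epsilon_end"] - p["epsilon_start"])
```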
BIBLIOGRAPHY

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[2] Heni Ben Amor, Gerhard Neumann, Sanket Kamthe, Oliver Kroemer, and Jochen Peters. Interaction primitives for human-robot cooperation tasks. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 2831–2837. IEEE, 2014.

[3] Heni Ben Amor, Dominik Vogt, Marco Ewerton, Erik Berger, Byung-Ik Jung, and Jochen Peters. Learning responsive robot behavior by imitation. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 3257–3264. IEEE, 2013.

[4] Cynthia Breazeal. Toward sociable robots. Robotics and Autonomous Systems, 42(3):167–175, 2003.

[5] Cynthia Breazeal. Social interactions in HRI: the robot view. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 34(2):181–186, 2004.

[6] Cynthia L Breazeal. Designing Sociable Robots. MIT Press, 2004.

[7] Allison Bruce, Illah Nourbakhsh, and Reid Simmons. The role of expressiveness and attention in human-robot interaction. In Robotics and Automation, 2002. Proceedings. ICRA'02. IEEE International Conference on, volume 4, pages 4138–4142. IEEE, 2002.

[8] Vincent G Duffy. Handbook of Digital Human Modeling: Research for Applied Ergonomics and Human Factors Engineering. CRC Press, 2008.

[9] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[10] Long-Ji Lin. Reinforcement learning for robots using neural networks. PhD thesis, Fujitsu Laboratories Ltd, 1993.

[11] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[13] Ahmed Hussain Qureshi, Yutaka Nakamura, Yuichiro Yoshikawa, and Hiroshi Ishiguro. Robot gains social intelligence through multimodal deep reinforcement learning. In Humanoid Robots (Humanoids), 2016 IEEE-RAS 16th International Conference on, pages 745–751. IEEE, 2016.

[14] Ahmed Hussain Qureshi, Yutaka Nakamura, Yuichiro Yoshikawa, and Hiroshi Ishiguro. Show, attend and interact: Perceivable human-robot social interaction through neural attention q-network. arXiv preprint arXiv:1702.08626, 2017.

[15] Ivan Sorokin, Alexey Seleznev, Mikhail Pavlov, Aleksandr Fedorov, and Anastasiia Ignateva. Deep attention recurrent q-network. arXiv preprint arXiv:1512.01693, 2015.

[16] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT Press Cambridge, 1998.

[17] Zhikun Wang, Katharina Mülling, Marc Peter Deisenroth, Heni Ben Amor, David Vogt, Bernhard Schölkopf, and Jan Peters. Probabilistic movement modeling for intention inference in human–robot interaction. The International Journal of Robotics Research, 32(7):841–858, 2013.

[18] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[19] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2(3):5, 2015.
LIST OF PUBLICATIONS
[1] Ahmed Hussain Qureshi, Yutaka Nakamura, Yuichiro Yoshikawa and Hiroshi Ishiguro, "Robot gains social intelligence through Multimodal Deep Reinforcement Learning", Proceedings of IEEE-RAS International Conference on Humanoid Robots (Humanoids), pp. 745–751, Cancun, Mexico, 2016.

[2] Ahmed Hussain Qureshi, Yutaka Nakamura, Yuichiro Yoshikawa, Hiroshi Ishiguro, "Show, Attend and Interact: Perceivable Human-Robot Social Interaction through Neural Attention Q-Network", Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 1639–1645, Singapore, 2017.

[3] Ahmed Hussain Qureshi, Yutaka Nakamura, Yuichiro Yoshikawa, Hiroshi Ishiguro, "Robot Learns Responsive Behaviour through Interaction with People using Deep Reinforcement Learning", 3rd International Symposium on Cognitive Neuroscience Robotics: Toward Constructive Developmental Science, Osaka, Japan, 2016.