Human Intention Understanding Based On Object Affordance and Action Classification

Zhibin Yu, Sangwook Kim, Rammohan Mallipeddi, Minho Lee
School of Electronics Engineering
Kyungpook National University, Taegu, Korea
[email protected], [email protected]

This work was funded by the Ministry of Knowledge Economy (MKE), and was also funded by the Ministry of Science, ICT and Future Planning.

Abstract-Intention understanding is a basic requirement for human-machine interaction. Action classification and object affordance recognition are two possible ways to understand human intention. In this study, the Multiple Timescale Recurrent Neural Network (MTRNN) is adapted to analyze human action. Supervised MTRNN, which is an extension of the Continuous Timescale Recurrent Neural Network (CTRNN), is used for action and intention classification. On the other hand, deep learning algorithms have proved to be efficient in understanding complex concepts in complex real-world environments. A stacked denoising auto-encoder (SDA) is used to extract human implicit intention related information from the observed objects. A feature-based object detection method, namely Speeded Up Robust Features (SURF), is also used to find the objects. Object affordance describes the interactions between an agent and the environment. In this paper, we propose an intention recognition system using 'action classification' and 'object affordance information'. Experimental results show that supervised MTRNN is able to use different information in different time periods and improves the intention recognition rate by cooperating with the SDA.

Keywords-Supervised learning, Object affordance, Action classification, Intention understanding

I. INTRODUCTION

Intention understanding is important in developing artificial cognitive agents such as robots that can provide various services to human beings [1]. While humans can naturally understand each other's intentions by assimilating several cues, artificial agents fail to do so. Therefore, they need to be taught to understand the meaning of explicit and implicit intentional cues. Affordance represents the relationship between an object and an action [2]. It is a possible cue from which a robot can deduce possible human intention by considering the relationship between objects and actions after understanding the object labels. There are several ways to model the relationship between objects and intentions [3, 4]. Our previous work [5], an object affordance-based intention recognition model using the stacked denoising auto-encoder (SDA) [6], can be considered one of the candidate models. It is believed that the SDA can represent the relationship between objects and corresponding intentions. Since human action, in general, is not a static gesture but a dynamic process, various dynamic models have been developed to handle the dynamic action classification problem. For instance, the well-known Hidden Markov Model has been used to analyze and classify human action [7] and intention [8]. Similarly, Recurrent Neural Network (RNN) [9] based models such as the Multiple Timescale Recurrent Neural Network (MTRNN) [10] perform a similar task. The efficiency of supervised MTRNN in action classification has been shown in our previous work [11]. In this work, we extend this model to intention recognition based on the action classification results.


Fig. 1. Example of human implicit intention understanding: a robot observes an action-object sequence (find the cup, pick the coffee, pour the water, drink the coffee) and infers the intention "drinking the coffee".

The process of human implicit intention understanding is shown in Fig. 1. When a human wants to take a cup of coffee, he or she will perform several actions to achieve this goal. It may be difficult to guess the human's intention at first glance. However, it becomes possible to find out the human's intention after the robot observes a long enough sequence of actions and correlates it with object information. Considering the role of object affordance and action cues in intention recognition, in this paper we propose a system which is a hybrid of an object recognition model (SDA) and a dynamic model (supervised MTRNN). The SDA, as a static model, is used to find human implicit intention according to object affordance, where the objects are selected by human gaze. The supervised MTRNN is used to recognize human motion and perform motion-based intention classification. These two models are connected in the last layer and share information with each other. Moreover, a gradient descent algorithm is used to evolve the weights between the SDA and supervised MTRNN models. The cooperation between these two models is a possible way to recognize human intention. The experimental results show that the two different models can cooperate and produce more robust results for understanding human intentions.

II. RELATED WORK

In this section, we survey previous related work. For human intention understanding, action classification and object recognition are likely to be key technologies.

A. Action classification

Time series data have been used for action-based intention analysis. The hidden Markov model (HMM) is considered an efficient tool for modeling dynamic sequences, and it can also be used for action recognition [12]. The HMM uses the transition probability of each state to calculate the probability of each possible combination. However, it is limited in representing the contextual meaning of different motions and intentions. The recurrent neural network (RNN) is another useful tool for modeling dynamic sequences. The sensitivity of an RNN may not be as good as that of an HMM, but it is able to predict the output and classify the signals at the same time [9]. The continuous timescale recurrent neural network (CTRNN), developed in [13], further improved this model. Supervised MTRNN, which is used in this paper, is an extension of CTRNN [11].

B. Object recognition

In the case of human-object interaction, different objects have different relationships with various actions. Thus, correct object information based on object affordance can help intention understanding.


There are several object recognition technologies, which can be classified into one of two approaches: appearance-based and feature-based object recognition models. Appearance-based methods attempt to describe the objects using statistical analysis, while feature-based methods try to find feasible matches between the image and the target object. Scale-invariant feature transform (SIFT) [14], a feature-based method, is a useful technology for capturing the scale-invariant features of the target object. Speeded Up Robust Features (SURF) [15], also a feature-based method, performs several times faster than SIFT. In this paper, we use the SURF implementation provided by OpenCV [16] to recognize objects.
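As a rough sketch of this step, the following Python snippet counts SURF "good matches" between a stored object template and the current frame using OpenCV's contrib module (cv2.xfeatures2d); the file names, Hessian threshold, ratio-test value and match-count cutoff are illustrative assumptions rather than values taken from the paper.

import cv2

# SURF lives in the contrib package (opencv-contrib-python); it is not
# included in the main OpenCV distribution in recent versions.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
matcher = cv2.BFMatcher(cv2.NORM_L2)

template = cv2.imread("object_template.png", cv2.IMREAD_GRAYSCALE)  # known object
scene = cv2.imread("current_frame.png", cv2.IMREAD_GRAYSCALE)       # camera frame

kp_t, des_t = surf.detectAndCompute(template, None)
kp_s, des_s = surf.detectAndCompute(scene, None)

# Lowe's ratio test: keep a match only if its best distance is clearly
# smaller than the second-best, then count the survivors as "good matches".
matches = matcher.knnMatch(des_t, des_s, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]

object_detected = len(good) > 10  # illustrative cutoff on the match count
print(len(good), object_detected)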

III. PROPOSED MODEL

A. Supervised MTRNN

An Asus Xtion sensor [17] is used to extract the depth images and skeleton movements from human agents. The X and Y positions of each skeleton node are recorded as the input for the supervised MTRNN. We used a self-organizing map (SOM), which is generally used as a pre-processing method for MTRNN feature extraction [10, 11]. The activation rule in the SOM layer is defined using the following formula:

p_{i,t} = \frac{\exp\left(-\|v_{i,t} - v_{sample,t}\| / \sigma\right)}{\sum_{j \in O} \exp\left(-\|v_{j,t} - v_{sample,t}\| / \sigma\right)}    (1)

where v_{i,t} is the reference vector of the i-th node, \|v_{i,t} - v_{sample,t}\| is the prediction error of the i-th node at time step t, O denotes all the nodes of the input-output layer, \sigma is a constant set to 0.01, and p_{i,t} is the output vector used as the input of the upper layer of the supervised MTRNN. When the upper layer generates outputs, the SOM layer is used again to calculate the prediction outputs using the following equation:

v_{t+1} = \sum_{i \in V} y_{i,t} \, v_{i,t}    (2)

where v_{t+1} is the prediction output for the next step and y_{i,t} is the activation output of the i-th node from the input-output layer.

Context layers, which are the key components of MTRNN, are modeled by CTRNN. CTRNN considers the time scale effect, in contrast to a plain RNN: the output of each neuron is calculated using both the current input samples and the past history of the neural states. This makes the CTRNN suitable for predicting continuous sensory-motor sequences [18].
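A minimal numpy sketch of Eqs. (1)-(2), together with a generic leaky-integrator CTRNN state update of the kind described above; apart from sigma = 0.01, all sizes, names and random data are illustrative assumptions.

import numpy as np

def som_activation(refs, sample, sigma=0.01):
    # Eq. (1): softmax over (negative) distances to each reference vector.
    dists = np.linalg.norm(refs - sample, axis=1)    # ||v_i,t - v_sample,t||
    scores = np.exp(-(dists - dists.min()) / sigma)  # shift for numerical safety
    return scores / scores.sum()                     # p_i,t

def som_prediction(refs, y):
    # Eq. (2): next-step prediction as the activation-weighted reference sum.
    return y @ refs

def ctrnn_update(u, net_input, tau):
    # Leaky integration: each neuron state mixes its own history with the
    # current net input, at a speed controlled by the time constant tau.
    return (1.0 - 1.0 / tau) * u + (1.0 / tau) * net_input

refs = np.random.rand(100, 30)     # 100 SOM nodes over a 30-dim skeleton input
sample = np.random.rand(30)        # current input vector
p = som_activation(refs, sample)   # fed to the upper (context) layers
v_next = som_prediction(refs, p)   # decoded prediction for the next step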

Fig. 2. Supervised MTRNN model (v_{sample,t}: visual information in time step t).

Similarly, the original aim of MTRNN is to predict dynamic signals. MTRNN includes multiple CTRNN layers with different time constants. The information in the CTRNN layer with a small time constant (fast context) changes fast. In contrast, the slow context layer is believed to arrange the sequence of elemental action information which is saved in the fast context layer. Supervised MTRNN, shown in Fig. 2, is an extension of the MTRNN model which includes a supervised training rule in the slow context layer. The error function is defined using the Kullback-Leibler divergence:

E = \sum_{t} \sum_{i \in O} y^*_{i,t} \log \frac{y^*_{i,t}}{y_{i,t}}    (3)

where O is the set of nodes in the input-output layer, y^*_{i,t} is the desired output value of the i-th neuron at time step t, and y_{i,t} is the prediction value of the i-th neuron given the existing weights and the initial states. The partial derivatives are obtained by the back-propagation through time (BPTT) algorithm:



\frac{\partial E}{\partial w_{ij}} = \sum_{t} \frac{1}{\tau_i} \frac{\partial E}{\partial u_{i,t}} \, y_{j,t-1}    (4)

\frac{\partial E}{\partial u_{i,t}} =
\begin{cases}
y_{i,t} - y^*_{i,t} + \left(1 - \frac{1}{\tau_i}\right) \frac{\partial E}{\partial u_{i,t+1}}, & i \in O \text{ or } i \in C \\
\sum_{k} \frac{\partial E}{\partial u_{k,t+1}} \left[ \delta_{ik} \left(1 - \frac{1}{\tau_k}\right) + \frac{1}{\tau_k} w_{ki} f'(u_{i,t}) \right], & \text{otherwise}
\end{cases}    (5)

where f'(x) is the derivative of the sigmoid function, y_{i,t} - y^*_{i,t} is the difference between the neuron output and the ideal value, u_{i,t} is the i-th neuron state at time step t, \tau_i is the constant that controls the neuron state updating speed, and \delta_{ik} is the Kronecker delta (\delta_{ik} = 1 if i = k, otherwise 0). C denotes all of the classification nodes which are used for dynamic signal classification in the slow context layer [11].
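As a small sanity check of the objective, the snippet below evaluates the Kullback-Leibler error of Eq. (3) on made-up target and prediction sequences; in training, BPTT propagates the gradients of exactly this quantity through Eqs. (4)-(5). The sequence length follows the 150-frame intentions of Sec. IV, while the node count and random distributions are illustrative.

import numpy as np

def kl_error(y_target, y_pred, eps=1e-12):
    # Eq. (3): sum over time steps and output nodes of y* log(y*/y).
    y_target = np.clip(y_target, eps, 1.0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return np.sum(y_target * np.log(y_target / y_pred))

T, n_out = 150, 10                                    # 150 frames, 10 output nodes
y_star = np.random.dirichlet(np.ones(n_out), size=T)  # desired outputs
y_hat = np.random.dirichlet(np.ones(n_out), size=T)   # network outputs
print(kl_error(y_star, y_hat))                        # scalar error minimized by BPTT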

Fig. 3. The structure of the deep auto-encoder: the encoder maps the objects of attention (visible neurons) through hidden neurons to the code layer, and the decoder reconstructs the related objects and the estimated intention through its own hidden and visible neurons.

B. Stacked denoising auto-encoder based intention estimation

A stacked denoising auto-encoder, shown in Fig. 3, is a kind of deep neural network composed of multiple layers [5]. In [19], the authors tried to recognize human intention by analyzing the distance between the observed human's hand and the objects in the scene over several frames. However, that system could not recommend objects which may be related to the current intention. Unsupervised learning is used to initialize the weights between all layers except the last one. A stacked auto-encoder consists of restricted Boltzmann machine (RBM) layers. An RBM is a simple neural network model which includes only two layers. The RBM can extract features from the lower layer by reducing the energy. The energy function is defined as:

E(v, h) = -\sum_{i \in v} a_i v_i - \sum_{j \in h} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (6)

where a_i is the bias of the i-th visible node, b_j is the bias of the j-th hidden node, and w_{ij} is the weight value from the i-th visible node to the j-th hidden node.

The encoder and decoder are two complementary parts of the SDA. The encoder extracts high-order features of the data and transforms the data into a relatively low-dimensional space. The decoder, the counterpart of the encoder, tries to recover the data in the original dimension from the encoded vector. At the code layer, these probabilities are evaluated and compared to decide the most adequate intention for a given object of attention. The input vectors for the RBM are binary data. For example, if there are 5 objects and objects 1 and 2 are touched by the human, the input vector is set to [1 1 0 0 0]. The target code is encoded in the same way. The information stored in the code layer can be used for intention estimation [5].
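A short numpy sketch of the RBM energy of Eq. (6), evaluated on the binary five-object encoding from the example above; the weights, biases and hidden configuration are random placeholders rather than trained parameters.

import numpy as np

def rbm_energy(v, h, a, b, W):
    # Eq. (6): E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij
    return -(a @ v) - (b @ h) - v @ W @ h

n_visible, n_hidden = 5, 3
W = 0.1 * np.random.randn(n_visible, n_hidden)  # placeholder weights
a = np.zeros(n_visible)                         # visible biases
b = np.zeros(n_hidden)                          # hidden biases

# Objects 1 and 2 are touched by the human -> binary input [1 1 0 0 0].
v = np.array([1, 1, 0, 0, 0], dtype=float)
h = np.array([1, 0, 1], dtype=float)            # an arbitrary hidden state
print(rbm_energy(v, h, a, b, W))                # lower energy = better fit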

C. Object affordance based Supervised MTRNN

The primary goal of MTRNN is to predict dynamic signals. Therefore, MTRNN includes multiple CTRNN layers with a different time constant in each layer. The information in the CTRNN layer with a small time constant (fast context) changes rapidly, while the slow context layer arranges the sequence of elementary action information which is saved in the fast context layer. Fig. 4 shows the overview of our proposed model. We combine supervised MTRNN with the SDA. The object labels are provided by SURF using OpenCV and then fed to the SDA; the number of good matches is recorded as the raw data. The SDA is trained before the supervised MTRNN. Although the decode layer is needed to train the SDA, only the information stored in the code layer is used as an additional input to the supervised MTRNN.

The object information provided to the auto-encoder is binary data. According to the skeletal position, the space between the human head and hip center is defined as the region of interest. Ideally, objects which are not grasped will not be detected in the experiments. Objects detected by SURF are recorded as 1, otherwise 0. The weights between the code layer and the slow context layer can also be updated by the backpropagation rule of Eq. (4). The output value of the code layer of the SDA model is treated as an additional feature for the slow context layer of the supervised MTRNN. The weights between the code layer and the slow context layer are fully connected. The information stored in the code layer is believed to represent the correlated intention [5]. These weights are trained together with the other weights of the supervised MTRNN.

The information flowing from the SDA to the supervised MTRNN, represented as the arrow from the code layer to the slow context layer in Fig. 4, enriches the features required to recognize the human intention accurately. This means that by integrating this information, the intention classification module is able to consider skeletal dynamics and object affordance simultaneously in a unified framework. If several intentions share similar dynamic characteristics, the lower layers of MTRNN will give similar outputs, i.e., their features will be hardly distinguishable, but the encoded feature information from the SDA makes the data easily separable by inserting feature dimensions related to the object affordance. We expect that the information saved in the code layer helps supervised MTRNN improve its intention classification results.
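To make the fusion concrete, the sketch below shows one plausible way the SDA code-layer output could enter the slow-context net input as an extra, fully connected term; the names and dimensions are our assumptions, not the authors' implementation.

import numpy as np

n_slow, n_code = 20, 8                           # assumed layer sizes
W_code = 0.1 * np.random.randn(n_slow, n_code)   # code-to-slow-context weights,
                                                 # trained jointly via Eq. (4)

def slow_context_net_input(recurrent_term, code_output):
    # Add the object-affordance feature from the SDA code layer (the arrow
    # in Fig. 4) to the usual recurrent contribution of the slow context.
    return recurrent_term + W_code @ code_output

recurrent_term = np.random.randn(n_slow)  # placeholder MTRNN contribution
code_output = np.random.rand(n_code)      # SDA code-layer activations
u_input = slow_context_net_input(recurrent_term, code_output)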

IV. EXPERIMENT RESULTS

In our experiment, we used 5 intentions, which are combinations of 5 object categories. An Asus Xtion sensor is used to capture the skeleton information and depth images. Fig. 5 shows an example scene of the 5 different intentions in our experiments.

Fig. 4. Proposed model: the environment provides inputs to the stacked auto-encoder and the supervised MTRNN, which are connected for intention classification.

Fig. 5. 5 kinds of intentions: (a) reading book, (b) pouring chocolate powder into cup, (c) drinking, (d) eating noodles and (e) calling with a cellphone.

TABLE I. DETAILS OF 5 DIFFERENT MOTION-BASED INTENTIONS RELATED TO 5 MOTIONS

Intention label | Details         | Related objects
1               | Reading book    | Book
2               | Pouring powder  | Chocolate container, cup
3               | Drinking        | Cup
4               | Eating noodles  | Fast noodles
5               | Calling         | Cellphone

TABLE II. INTENTION RECOGNITION RESULTS

Classification result (true positive, %):

Intention label | Supervised MTRNN, Training set | Supervised MTRNN, Test set | Object-based supervised MTRNN, Training set | Object-based supervised MTRNN, Test set
1               | 88.33 | 80.67 | 93.67 | 79.6
2               | 67.84 | 54.29 | 79.67 | 67.6
3               | 77.29 | 81    | 68.45 | 72
4               | 62.78 | 61.54 | 69.47 | 73
5               | 67.11 | 66.67 | 80.34 | 75.33
Average         | 72.67 | 68.83 | 78.32 | 73.56

The relationship between intentions, objects and actions is shown in Table I. Each intention lasts 150 frames. It should be …

[Figure: intention classification outputs over frames, panels (a)-(d), with curves for each intention (reading book, pouring powder, drinking, eating noodles, calling) and for the detected objects (book, chocolate container, cup, noodles, cellphone).]
