A robot that learns in stages utilizing scaffolds: toward an active and long-term human-robot interaction

TANAKA Kazuaki
Kyoto Institute of Technology
d8821007 @ edu.kit.ac.jp

OKA Natsuki
Kyoto Institute of Technology
nat @ kit.ac.jp

ABSTRACT

In recent years, robots have begun to appear in our daily lives. However, people get bored with them after a short time. We therefore consider that robots that interact with people must be able to learn new actions, so that people enjoy the interaction for a long time. However, it is difficult for them to learn complex actions such as games through human-robot interaction. When humans learn through human-human interaction, scaffolding is known to be effective. Scaffolding is a method of promoting learning by gradually giving more difficult learning tasks according to the ability of the learner. If robots learn through human-robot interaction, it is possible that scaffolding also supports their learning. However, it has not been clarified whether scaffolding actually occurs in interactions with ordinary people in everyday situations. In this experiment, we clarify this problem and propose a method of utilizing the scaffolds given by ordinary people in everyday situations for robots that interact with people.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning; J.4 [Computer Applications]: Social and Behavioral Sciences - Psychology

General Terms

Algorithms, Experimentation, Human Factors

Keywords

human-robot interaction, scaffolding, game, Q-learning

1. INTRODUCTION

In recent years, robots have begun to appear in our daily lives as pets. However, they have low sociality, because almost all of them only perform pre-registered actions or are controlled by humans. Consequently, people get bored with them after a short time [2]. We therefore consider that robots that interact with people must be able to learn new actions through human-robot interaction, so that people enjoy the interaction with them for a long time.

When humans learn through human-human interaction, scaffolding is known to be effective [7]. Scaffolding is a method of promoting learning by gradually giving more difficult learning tasks according to the ability of the learner, beginning with easy ones. If robots learn through human-robot interaction, it is possible that scaffolding also supports their learning. There have been several attempts in which a human gives a robot increasingly difficult learning tasks in stages to promote learning [1, 5]. They demonstrated that robots can learn efficiently when difficult learning tasks are introduced gradually. However, in these works, it was researchers, not ordinary people, who set up the gradual learning tasks. Accordingly, when robots learn through interactions with ordinary people in everyday situations, the following three points have remained unclear:

• whether scaffolding actually occurs,
• the conditions under which scaffolding occurs, and
• whether robots can utilize the scaffolds given by ordinary people in everyday situations.

In this work, we set up a situation in which a human teaches a traditional game to a robot, and aim to clarify these points experimentally.

2. PRELIMINARY EXPERIMENT

We observed human-robot interaction in the game learning task described below.

2.1 Learning task

In this work, we employed a game acquisition task to elicit positive interaction. We consider the Japanese traditional game ATCHI MUITE HOI to be appropriate because it has moderate difficulty and is practicable even by the pet robot AIBO. ATCHI MUITE HOI is a game played between two players. Player A points in one of four directions (up, down, left, or right) while saying "ATCHI MUITE HOI", and simultaneously player B turns his face in one of the four directions. If player B looks in the same direction that player A points, player B loses; otherwise player B wins. In this experiment, we do not use voices but only hand movements as cues, because the voice-recognition system readily available to us has a time lag. Additionally, we use a pink ball in place of a finger, because AIBO's image recognition is more accurate for pink than for a human hand.
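To make the win/lose rule concrete, the following is a minimal sketch of one round of the game from the robot's (player B's) side; the function and direction names are illustrative and are not part of the system described later.

```python
import random

DIRECTIONS = ["up", "down", "left", "right"]

def play_round(pointed: str, faced: str) -> str:
    """Player B (the robot) loses if it faces the direction player A points at."""
    return "lose" if faced == pointed else "win"

# Example: the human gives the cue with the pink ball, the robot picks a direction.
pointed = random.choice(DIRECTIONS)   # cue given by player A (ball movement)
faced = random.choice(DIRECTIONS)     # direction chosen by player B (AIBO)
print(pointed, faced, play_round(pointed, faced))
```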

We consider that the following two processes are necessary for new action acquisition:

Sequence acquisition process: In this process, a robot learns the sequences of the learning task. If AIBO learns ATCHI MUITE HOI, it learns sequences such as the following: the game begins when a human player holds the ball in front of its face, and it looks in one of the four directions (up, down, left, or right) when the human player moves the ball.

Goal state acquisition process: In this process, a robot learns the goal states of the learning task. If AIBO learns ATCHI MUITE HOI, it learns the win state, in which it looked in a direction other than the one in which the human player moved the ball (the state in which the ball is not in front of its face), and the lose state, in which it looked in the same direction (the state in which the ball is in front of its face).

Figure 1: Learning system

2.2 Experimental method

We suppose that AIBO has already completed the sequence acquisition process described in Section 2.1, and we conducted an experiment in which it learns the win and lose states of ATCHI MUITE HOI through human-robot interaction. The usable instructions are two evaluations: a pat (positive evaluation) and a hit (negative evaluation).

2.3 Experimental result

We conducted a preliminary experiment with seven participants and found the following:

1. All seven participants stopped giving instructions once AIBO had learned the win and lose states and came to express the correct emotion.

2. Two participants gave a positive evaluation, with the motive "It is the correct expression", when AIBO lost a game and expressed the 'sad' emotion for the first time. In addition, one participant pulled his hand back after involuntarily starting to give a positive evaluation, and three participants stopped themselves after starting to give one in the same situation.

A traditional learning system learns an optimized solution to a single learning task, so it is preferable that the same instructions are consistently given for the same state. Nevertheless, in the interaction with humans, participants did not give consistent instructions for the same state; instead, the instructions changed as learning progressed, as in observations 1 and 2 above, so that learning does not progress well.

In this experiment, we regard these alterations of instructions as scaffolds, and aim to build a learning system that can follow or react to the alterations.

3. LEARNING SYSTEM THAT UTILIZES SCAFFOLDS

In this section, we propose a learning system that can follow or react to such alterations as scaffolds and learn new actions. Our learning system consists of three parts: 1) the Sequence acquisition part (see Section 3.1), 2) the Goal state acquisition part (see Section 3.2), and 3) the Emotional expression part (see Section 3.3). Figure 1 shows the composition of our learning system.

3.1 Sequence acquisition part

The Sequence acquisition part learns the best action in each state and thereby acquires the sequences of the learning task. We employ Q-learning [6], one of the reinforcement learning algorithms, for this part. In Q-learning, the action value Q(s, a), which is the value of an action a in a state s, is updated based on rewards r, and the best action in each state is found by trial and error. In this experiment, we set nine states, s0 to s8 (see Figure 2 for each state). Additionally, AIBO has the following five actions:

a0: Stop and keep the current state.
a1 to a4: Look in one of the four directions: up, down, left, or right.

Figure 2: Each state and transition conditions. In the performance assessment experiment (see Section 4), we showed participants this diagram with the dotted lines omitted.

AIBO selects and performs an action a_n in the current state s′ when the state has changed. In addition, if a participant pats the touch sensor on AIBO's back, the state returns to the initial state s0. Figure 2 shows the flow of the learning task ATCHI MUITE HOI.

If the robot is given a reward r by the human, the Q value of the performed action a_n is updated as follows:

Q(s, a_n) ← Q(s, a_n) + α{r − Q(s, a_n)}    (1)
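As a concrete illustration of Eq. (1), the following is a minimal sketch of the tabular update used in this part. The learning rate and reward magnitude are the values reported later in this section; the function name and the assumption that a1 denotes "look up" are ours.

```python
from collections import defaultdict

# Tabular action values Q(s, a) for the nine states s0..s8 and five actions a0..a4.
Q = defaultdict(float)

def update_q(state: str, action: str, reward: float, alpha: float) -> None:
    """Eq. (1): move Q(s, a_n) toward the immediate reward r (no delayed rewards)."""
    Q[(state, action)] += alpha * (reward - Q[(state, action)])

# Example: a pat (+0.1) received after AIBO looked up in state s3, learning rate 0.15.
update_q("s3", "a1", reward=+0.1, alpha=0.15)
print(Q[("s3", "a1")])
```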

In interactions in which scaffolding occurs, changes in the goal states set by the human are to be expected, so the learning system becomes complex if we consider delayed rewards. In this experiment, we therefore do not consider delayed rewards, for simplicity. We set the learning rate α to 0.15, the positive reward to +0.1, and the negative reward to −0.1. Boltzmann selection was used for selecting actions, and the Boltzmann temperature was set to 0.03. In the initial state, all actions are selected with equal probability. We therefore make AIBO involuntarily tend to look in the direction in which the ball moved: the selection probability of the action corresponding to the direction in which the ball moved is calculated after a bias of 0.05 is added to its Q value. For example, when AIBO recognizes that the ball moved up, the selection probabilities are calculated after a bias of 0.05 is added to the Q value of the action of looking up. However, the Q value itself is not actually updated at that point.
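The following is a minimal sketch of this biased Boltzmann selection under the stated parameters (temperature 0.03, bias 0.05). The helper names and the mapping from the ball's direction to an action index are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from typing import Dict, Optional

ACTIONS = ["a0", "a1", "a2", "a3", "a4"]  # a0 = stop; a1..a4 = look up/down/left/right (assumed order)

def select_action(q: Dict[str, float], ball_action: Optional[str],
                  temperature: float = 0.03, bias: float = 0.05) -> str:
    """Boltzmann (softmax) selection over the action values of the current state.

    The bias is added to the action matching the ball's direction only when
    computing the selection probabilities; the stored Q values stay unchanged.
    """
    prefs = {a: q.get(a, 0.0) for a in ACTIONS}
    if ball_action is not None:
        prefs[ball_action] += bias
    weights = [math.exp(prefs[a] / temperature) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights, k=1)[0]

# Example: the ball moved up, so the "look up" action receives the 0.05 bias.
print(select_action({"a1": 0.02}, ball_action="a1"))
```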

3.2 Goal state acquisition part

As discussed previously, in interactions in which scaffolding occurs, changes in the goal states set by the human are to be expected, so the robot must be able to learn the goal states at any given time. In this work, we assume that the state s′ in which the human gave a reward is a goal state, and we employ an update rule for the state value V(s′) that uses rewards without delay. The state value V(s′) is updated as follows:

V(s′) ← V(s′) + α{r − V(s′)}    (2)

Here, we set the learning rate α to 0.20.
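A minimal sketch of Eq. (2); the table layout and function name are ours, while the learning rate 0.20 is the value given above.

```python
from collections import defaultdict

V = defaultdict(float)   # state values V(s) for s0..s8

def update_state_value(state: str, reward: float, alpha: float = 0.20) -> None:
    """Eq. (2): move V(s') toward the reward the human just gave in state s'."""
    V[state] += alpha * (reward - V[state])

# Example: a hit (-0.1) received in state s5 marks it as a tentative lose state.
update_state_value("s5", reward=-0.1)
print(V["s5"])
```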

3.3 Emotional expression part

The emotional expression part feeds AIBO's learning state back to the participants. When AIBO has selected an action a according to the action values, it expresses 'happy' if the state s transits to a state s′ whose state value V(s′) is higher than the action value Q(s, a), and 'sad' if the state s′ has a lower state value.

As described in Section 2.3, participants stop giving rewards once AIBO begins to learn the correct actions. Accordingly, AIBO stops expressing emotions as the rewards decrease, which presents its habituation to the participants. Different learning rates are set for the sequence acquisition part (0.15) and the goal state acquisition part (0.20), so that the difference between the state value and the action value first increases and afterwards decreases. When the selection probability of an action exceeds 50%, AIBO begins to express emotions, but when the probability exceeds 80%, it stops expressing them. Additionally, when the ball is in front of AIBO's nose, it shows the participants an expression indicating that it is looking at the ball.
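A minimal sketch of the expression rule described above, assuming the 50% and 80% thresholds apply to the selection probability of the chosen action; the names are ours.

```python
def choose_expression(v_next: float, q_sa: float, selection_prob: float) -> str:
    """Return AIBO's feedback after taking action a in state s and reaching s'.

    Emotions are shown only while the action is neither too uncertain (<= 50%)
    nor already mastered (> 80%), which presents habituation to the teacher.
    """
    if selection_prob <= 0.50 or selection_prob > 0.80:
        return "no expression"
    return "happy" if v_next > q_sa else "sad"

# Example: V(s') exceeds Q(s, a) and the action is currently selected 60% of the time.
print(choose_expression(v_next=0.08, q_sa=0.03, selection_prob=0.60))
```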

3.4 Utilizing NNC

As described in Section 2.3, the rewards given by the human decrease once AIBO begins to learn the correct actions. Consequently, the action values and the state values asymptotically approach zero. We therefore employ NNC [3], an implicit criterion that treats the absence of instructions as a positive reward, in order to maintain the action values and the state values. The reward R given by NNC is set to the same value as the state value V(s′) in both the sequence acquisition part and the goal state acquisition part; that is, after rewards stop being given, the state value keeps its current value, and the action value is updated according to the state value V(s′).
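A minimal sketch of how the NNC substitute reward could plug into the two update rules above. Treating silence as a reward equal to V(s') is stated in the text; the function name and the way the caller detects "no instruction" are our assumptions.

```python
from typing import Optional

def nnc_reward(v_next: float, human_reward: Optional[float]) -> float:
    """No News is Good News: when no instruction is given, use V(s') as the reward."""
    return v_next if human_reward is None else human_reward

# Example: no pat or hit arrived after reaching s2, so both updates receive V(s2)
# as the reward, which keeps V(s2) unchanged and pulls Q(s1, a_n) toward it.
Q, V = {("s1", "a2"): 0.04}, {"s2": 0.06}
r = nnc_reward(V["s2"], human_reward=None)
Q[("s1", "a2")] += 0.15 * (r - Q[("s1", "a2")])   # Eq. (1) with alpha = 0.15
V["s2"] += 0.20 * (r - V["s2"])                   # Eq. (2): value stays at 0.06
print(Q, V)
```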

4. PERFORMANCE ASSESSMENT EXPERIMENT

We evaluated whether the system described in Section 3 can follow or react to the alterations of instructions through the game acquisition task. In this section, we describe the experimental method and its results.

4.1 Experimental method

Five participants were asked to teach AIBO the rule of ATCHI MUITE HOI described in Section 2.1. We showed the participants the expressions of AIBO and the sequences of the game shown in Figure 2, each printed on A4 paper. In addition, the usable instructions were two evaluations, a pat (positive evaluation) and a hit (negative evaluation), and the experiment lasted thirty minutes per participant in total.

4.2 Experimental result

We observed alterations of instructions in the experiment. They can be classified into three types: 1) IR (Increasing Rewards), an alteration in which instructors start giving rewards; 2) DR (Decreasing Rewards), an alteration in which instructors stop giving rewards; and 3) FR (Flipping Rewards), an alteration in which instructors start giving the opposite rewards. Table 1 shows the types of the alterations of instructions and their examples.

Table 1: The types of the alterations of instructions and their examples

IR (Increasing Rewards), an alteration in which instructors start giving rewards:
  no rewards → pat: at situations in which AIBO was looking at the ball when a participant showed it the ball; at situations in which AIBO won the game.
  no rewards → hit: at situations in which AIBO was not looking at the ball when a participant showed it the ball; at situations in which AIBO lost the game.

DR (Decreasing Rewards), an alteration in which instructors stop giving rewards:
  pat → no rewards: at situations in which AIBO had begun to look at the ball almost certainly; at situations in which AIBO expressed the emotion 'happy' when it won the game.
  hit → no rewards: at situations in which AIBO did not look at the ball as an exploratory behavior, even though it had begun to look at the ball almost certainly; at situations in which AIBO expressed the emotion 'sad' when it lost the game.

FR (Flipping Rewards), an alteration in which instructors start giving the opposite rewards:
  hit → pat: at situations in which AIBO expressed the emotion 'sad' when it lost the game.

Our learning system successfully followed or reacted to IR and DR. On the other hand, it could not follow FR.
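To make the three definitions concrete, the following is a hypothetical sketch that labels an alteration by comparing the rewards given for the same state before and after it. This detector is our illustration of the definitions only; it is not a component of the system described in Section 3.

```python
from typing import Optional

def alteration_type(before: Optional[float], after: Optional[float]) -> Optional[str]:
    """Classify how the instruction for one state changed between two visits.

    before/after are +0.1 (pat), -0.1 (hit), or None (no reward given).
    """
    if before is None and after is not None:
        return "IR"   # Increasing Rewards: instructor starts giving rewards
    if before is not None and after is None:
        return "DR"   # Decreasing Rewards: instructor stops giving rewards
    if before is not None and after is not None and before * after < 0:
        return "FR"   # Flipping Rewards: instructor gives the opposite reward
    return None       # no alteration

print(alteration_type(-0.1, +0.1))   # hit -> pat, as in the FR row of Table 1
```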

5. DISCUSSION

5.1 Following or reacting to the alterations of instructions

As described in Section 4.2, the alterations of instructions can be classified into three types. IR is the alteration in which instructors start giving rewards when a new learning task, that is, a scaffold, is given. It is possible to follow IR because the Q value is a weighted mean that attaches more importance to recently given instructions. In contrast, DR is the alteration in which instructors stop giving rewards when the learner achieves the given learning task. This alteration should not be followed, but it was nevertheless possible to react to it by utilizing NNC (see Section 3.4). Furthermore, FR, in which instructors start giving the opposite rewards, was also observed. This was the alteration in which participants gave a positive evaluation, with the motive "It is the correct expression", when AIBO lost a game and expressed the 'sad' emotion, similar to result 2 of the preliminary experiment described in Section 2.3. In the preliminary experiment, the goal state of the learning task was to express adequate emotions according to winning or losing; in this experiment, the goal state was to learn the rule of the game, but the alteration was nevertheless observed in the same way. Accordingly, FR occurs when the target of evaluation shifts from the learner's actions to its emotional expressions once the learner displays an emotion. Our learning system could not recognize the target of the evaluations, and consequently it could not follow FR.

In this experiment, it became clear that scaffolding occurred in the form of IR, DR, and FR, and that learning progressed through the alternate occurrence of IR and either DR or FR.

5.2 Adequacy of the expression

The expression method described in Section 3.3 could feed AIBO's learning states back to the participants, as can be seen from the fact that the alterations DR and FR occurred. However, participants gave instructions again (IR) after they had stopped giving instructions (DR) once AIBO's emotional expression stopped. One cause of this is that AIBO performed exploratory behaviors after it stopped expressing emotions. We therefore consider that more appropriate scaffolds would be given if the learner could feed back the motive of its behavior. For example, one such method would differentiate the learner's emotional expressions depending on whether the learner selects an optimal action or an exploratory behavior.

6. CONCLUDING REMARKS

We believe that robots must acquire new actions through interaction with humans and the environment, so that people enjoy the interaction with them for a long time, and hence we directed our attention to scaffolding. However, when robots learn through interaction with ordinary people in everyday situations, the following three points were not clear: 1) whether scaffolding actually occurs, 2) the conditions under which scaffolding occurs, and 3) whether robots can utilize the scaffolds given by ordinary people in everyday situations. Therefore, we built a learning system that consists of the Sequence acquisition part, the Goal state acquisition part, and the Emotional expression part, and conducted an experiment in which the robot learned the game rule from instructions given by humans in order to clarify these points. As a result, we observed that participants changed their instructions according to the robot's behavior, that is, participants gave scaffolds to the robot. The alterations of instructions can be classified into three types: IR (Increasing Rewards), DR (Decreasing Rewards), and FR (Flipping Rewards). However, it is not necessarily the case that scaffolds are given under conditions other than those of this experiment; we must therefore run experiments under various conditions to clarify the second point.

Additionally, our learning system could follow or react to IR and DR, but could not follow FR. It is necessary to recognize the target of the evaluations in order to follow FR. We will try in earnest to realize joint attention to solve this problem without superficial artifices.

Our future plan includes:

• We will consider a method of recognizing the target of evaluations in order to follow FR.
• We will reconsider the method of expressing emotions, and the emotional expressions themselves.
• We will conduct the experiments under various conditions.

7. ACKNOWLEDGMENTS

This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (C), "The mechanism of cognitive development through the triadic interactions".

8. REFERENCES

[1] Asada, M., Noda, S., Tawaratsumida, S., and Hosoda, K., Purposive Behavior Acquisition for a Real Robot by Vision-Based Reinforcement Learning, Machine Learning, Vol. 23, pp. 279-303, 1996.
[2] Sato, T., and Nakata, T., Methodology for Synthesizing Affectionate Pet Robots, Journal of Japanese Society for Artificial Intelligence, Vol. 16, No. 3, pp. 406-411, 2001 (in Japanese).
[3] Tanaka, K., Zuo, X., Sagano, Y., and Oka, N., Learning the Meaning of Action Commands Based on "No News Is Good News" Criterion, Workshop on Multimodal Interfaces in Semantic Interaction, ISBN 978-1-59593-869-5, pp. 9-16, 2007.
[4] Tanaka, K., and Oka, N., The Effect of the Response Timing of a Pet Robot on Human-Robot Interaction, Human-Agent Interaction Symposium, 1B-1, 2006 (in Japanese).
[5] Thomaz, A. L., and Breazeal, C., Tutelage and Socially Guided Robot Learning, IEEE/RSJ International Conference, Vol. 4, pp. 3475-3480, 2006.
[6] Watkins, C. J. C. H., and Dayan, P., Q-learning, Machine Learning, Vol. 8, No. 3-4, pp. 279-292, 1992.
[7] Wood, D., Bruner, J. S., and Ross, G., The Role of Tutoring in Problem-Solving, Journal of Child Psychology and Psychiatry, Vol. 17, pp. 89-100, 1976.
