A robot that learns in stages utilizing scaffolds: toward an active and long-term human-robot interaction

TANAKA Kazuaki, Kyoto Institute of Technology, d8821007@edu.kit.ac.jp
OKA Natsuki, Kyoto Institute of Technology, nat@kit.ac.jp
ABSTRACT
In recent years, robots have begun to appear in our daily lives. However, people get bored with them after a short time. We therefore consider that robots that interact with people must have the ability to learn new actions, so that people enjoy the interaction for a long time. However, it is difficult for them to learn complex actions such as games through human-robot interaction. When humans learn through human-human interaction, scaffolding is known to be effective. Scaffolding is a method of promoting learning by gradually giving more difficult learning tasks according to the ability of the learner. If robots learn through human-robot interaction, it is possible that scaffolding also supports their learning. However, it has not been clarified whether scaffolding actually occurs in interactions with ordinary people in everyday situations. In this work, we clarify this problem, and propose a method by which robots that interact with people can utilize the scaffolds given by ordinary people in everyday situations.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning; J.4 [Computer Applications]: Social and Behavioral Sciences (Psychology)

General Terms
Algorithms, Experimentation, Human Factors

Keywords
human-robot interaction, scaffolding, game, Q-learning

1. INTRODUCTION
In recent years, robots have begun to appear in our daily lives as pets. However, they have low sociality, because almost all of them only perform pre-registered actions or are controlled by humans. Consequently, people get bored with them after a short time [2]. We therefore consider that robots that interact with people must have the ability to learn new actions through human-robot interaction, so that people enjoy the interaction with them for a long time.

When humans learn through human-human interaction, scaffolding is known to be effective [7]. Scaffolding is a method of promoting learning by gradually giving more difficult learning tasks according to the ability of the learner, beginning with easy ones. If robots learn through human-robot interaction, it is possible that scaffolding also supports their learning. There have been several attempts in which a human gives a robot increasingly difficult learning tasks in stages to promote learning [1, 5]. These works demonstrated that robots can learn efficiently when difficult learning tasks are introduced gradually. However, in these works it was researchers, not ordinary people, who set up the graded learning tasks. Accordingly, when robots learn through interactions with ordinary people in everyday situations, the following three points have not become clear:

• whether scaffolding actually occurs,
• the conditions under which scaffolding occurs, and
• whether robots can utilize the scaffolds given by ordinary people in everyday situations.

In this work, we set up a situation in which a human teaches a traditional game to a robot, and aim to clarify these points experimentally.

2. PRELIMINARY EXPERIMENT
We observed human-robot interaction in the game learning task described below.

2.1 Learning task
In this work, we employed a game acquisition task to elicit positive interaction. We consider the Japanese traditional game ATCHI MUITE HOI appropriate because it has moderate difficulty and is practicable even by the pet robot AIBO. ATCHI MUITE HOI is a game played between two players. Player A points in one of four directions (up, down, left, or right) while saying "ATCHI MUITE HOI", and simultaneously player B turns his face in one of the four directions. If player B looks in the same direction that player A points, player B loses; otherwise player B wins. In this experiment, we do not use voices but only hand movements as cues, because the voice-recognition system readily available to us has a time lag. Additionally, we use a pink ball in place of a finger, because AIBO's image recognition accuracy is higher for the pink color than for a human hand.
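For concreteness, the round judgment that the robot must ultimately master fits in a few lines of code. This is a minimal sketch of the rule just described, not part of the original system; the function name and direction labels are our own.

```python
import random

DIRECTIONS = ["up", "down", "left", "right"]

def atchi_muite_hoi(pointed: str, faced: str) -> str:
    """Judge one round from player B's point of view.

    Player A points in `pointed`; player B simultaneously turns
    its face in `faced`.  B loses if the directions match.
    """
    return "lose" if pointed == faced else "win"

# Example round: the robot (player B) picks a direction at random.
print(atchi_muite_hoi(random.choice(DIRECTIONS), random.choice(DIRECTIONS)))
```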
! " # $#% &' ()(*++ (,&,( ) -+
We consider that the following two processes are necessary for new action acquisition:

Sequence acquisition process: In this process, a robot learns the sequences of the learning task. If AIBO learns ATCHI MUITE HOI, it learns sequences such as: the game begins when a human player holds the ball in front of its face, and it looks in one of the four directions (up, down, left, or right) when the human player moves the ball.

Goal state acquisition process: In this process, a robot learns the goal states of the learning task. If AIBO learns ATCHI MUITE HOI, it learns the win state, in which it looked in a direction different from the one in which the human player moved the ball (the state in which the ball is not in front of its face), and the lose state, in which it looked in the same direction (the state in which the ball is in front of its face).

Figure 1: Learning system
2.2 Experimental method
We suppose that AIBO has already acquired the sequence acquisition process described in Section 2.1, and we conducted an experiment in which it learns the win and lose states of ATCHI MUITE HOI through human-robot interaction. The usable instructions are two evaluations: a pat (positive evaluation) and a hit (negative evaluation).
2.3 Experimental result
We conducted the preliminary experiment with seven participants, and found the following:

1. All seven participants stopped giving instructions once AIBO had learned the win and lose states and came to express the correct emotions.
2. Two participants gave the positive evaluation, with the motive "It is a correct expression", when AIBO lost a game and expressed the 'sad' emotion for the first time. In addition, one participant pulled back his hand although he had involuntarily tried to give a positive evaluation, and three participants stopped themselves although they had tried to give one in the same situation.

Figure 2: Each state and its transition conditions. In the performance assessment experiment (see Section 4), we showed participants this diagram with the dotted lines omitted.

A traditional learning system learns an optimized solution to a single learning task, so it is preferable that the same instructions are consistently given for the same state. Nevertheless, in the interaction with humans, participants did not give consistent instructions for the same state; the instructions changed as learning progressed, as in findings 1 and 2 above, so that learning does not progress well. In this experiment, we regard these alterations of instructions as scaffolds, and aim to build a learning system that can follow or react to the alterations.

3. LEARNING SYSTEM THAT UTILIZES SCAFFOLDS
In this section, we propose a learning system that can follow or react to the alterations of instructions as scaffolds and learn new actions. Our learning system consists of three parts: 1) the Sequence acquisition part (see Section 3.1), 2) the Goal state acquisition part (see Section 3.2), and 3) the Emotional expression part (see Section 3.3). Figure 1 shows the composition of our learning system.

3.1 Sequence acquisition part
The sequence acquisition part learns the best action in each state and acquires the sequences of the learning task. We employ Q-learning [6], one of the reinforcement learning algorithms, for this part. In Q-learning, the action value Q(s, a), which is the value of an action a in a state s, is updated based on rewards r, and the best action in each state is found by trial and error. In this experiment, we set nine states (refer to Figure 2 for each state: s0 to s8). Additionally, AIBO has the following five actions:

a0: stop and keep the current state
a1 to a4: look in one of the four directions (up, down, left, and right)

AIBO selects and performs an action a_n in the current state when the state has changed. In addition, if a participant pats the touch sensor on AIBO's back, the state returns to the initial state s0. Figure 2 shows the flow of the learning task ATCHI MUITE HOI. If the robot is given a reward r by a human, the Q value of the performed action a_n is updated as follows:

Q(s, a_n) ← Q(s, a_n) + α{r − Q(s, a_n)}    (1)
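In code, update rule (1) is a one-line exponential moving average toward the reward. The following sketch is ours, not the authors' implementation; it assumes only the nine states s0 to s8, the five actions a0 to a4, and the learning rate and reward values given in the text.

```python
ALPHA = 0.15                 # learning rate of the sequence acquisition part
N_STATES, N_ACTIONS = 9, 5   # states s0..s8, actions a0..a4

# Q-table initialised to zero: Q[s][a] is the value of action a in state s.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def update_q(s: int, a_n: int, r: float) -> None:
    """Update rule (1): Q(s, a_n) <- Q(s, a_n) + alpha * (r - Q(s, a_n))."""
    Q[s][a_n] += ALPHA * (r - Q[s][a_n])

update_q(0, 2, +0.1)   # e.g. a pat (positive evaluation) after action a2 in s0
update_q(0, 2, -0.1)   # e.g. a hit (negative evaluation) in the same state
```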
In interactions in which scaffolding occurs, changes of the goal states set by a human are to be expected, so the learning system would become complex if we considered delayed rewards. In this experiment, we therefore do not consider delayed rewards, for simplicity. We set the learning rate α at 0.15, the positive reward at +0.1, and the negative reward at −0.1. Boltzmann selection was used for selecting actions, and the Boltzmann temperature was set to 0.03.

In the initial state, all actions are selected with equal probability. We therefore make AIBO involuntarily look in the direction in which the ball moved: the selection probability of the action corresponding to that direction is calculated after a bias of 0.05 is added to its Q value. For example, when AIBO recognizes that the ball moved up, the selection probabilities are calculated after the bias of 0.05 is added to the Q value of the action of looking up. However, the Q value itself is not actually updated at that time.
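A minimal sketch of the action selection just described, assuming a standard Boltzmann (softmax) rule with the stated temperature of 0.03 and the temporary 0.05 bias; the function and argument names are our own.

```python
import math
import random

TEMPERATURE = 0.03   # Boltzmann temperature from Section 3.1
BIAS = 0.05          # added to the action matching the ball's direction

def select_action(q_values, biased_action=None):
    """Boltzmann selection over the Q values of the current state.

    If `biased_action` is given (the action that looks where the ball
    moved), a temporary bias is added to its Q value before computing
    the selection probabilities; the stored Q value is not changed.
    """
    prefs = list(q_values)
    if biased_action is not None:
        prefs[biased_action] += BIAS
    weights = [math.exp(q / TEMPERATURE) for q in prefs]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample an action index according to the Boltzmann probabilities.
    return random.choices(range(len(probs)), weights=probs)[0]

# Example: the ball moved up (say, action a1), so a1 gets the 0.05 bias.
action = select_action([0.0, 0.0, 0.0, 0.0, 0.0], biased_action=1)
```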
3.2 Goal state acquisition part
As discussed previously, in interactions in which scaffolding occurs, changes of the goal states set by a human are to be expected, so the robot must be able to learn the goal states at any time. In this work, we assume that a state s′ in which a human gave rewards is a goal state, and employed an updating rule for the state value V(s′) that utilizes rewards without delay. The state value V(s′) is updated as follows:

V(s′) ← V(s′) + α{r − V(s′)}    (2)

Here, we set the learning rate α at 0.20.
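Update rule (2) has the same moving-average form as rule (1), applied to the state value. A minimal sketch with the stated learning rate; the names are ours.

```python
ALPHA_GOAL = 0.20   # learning rate of the goal state acquisition part

# V[s] is the state value of state s; states in which humans give
# rewards are treated as goal states.
V = [0.0] * 9

def update_v(s_next: int, r: float) -> None:
    """Update rule (2): V(s') <- V(s') + alpha * (r - V(s'))."""
    V[s_next] += ALPHA_GOAL * (r - V[s_next])

update_v(5, +0.1)   # a pat received on entering state s5 marks it as a goal
```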
3.3 Emotional expression part
The emotional expression part feeds back the learning state of AIBO to participants. When AIBO selects an action a according to the action values and the state s transits to a state s′ whose state value V(s′) is higher than the action value Q(s, a), it expresses 'happy'; if the state s′ has a lower state value, it expresses 'sad'.

As described in Section 2.3, participants stopped giving rewards when AIBO began to learn the correct actions. Accordingly, AIBO comes not to express emotions as rewards decrease, which represents its habituation to the participants: when the selection probability of an action exceeds 50%, AIBO begins to express emotions, but when the probability exceeds 80%, it discontinues expressing them. Additionally, if the ball is in front of AIBO's nose, it feeds back to participants an expression indicating that it is looking at the ball.
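The expression rules above can be condensed into one decision function. The sketch reflects our reading of the text, namely the 50%/80% gating and the comparison of V(s′) with Q(s, a); the function name and the concrete numbers in the examples are illustrative.

```python
def choose_expression(q_sa: float, v_next: float, p_action: float):
    """Decide AIBO's emotional feedback after taking action a in state s.

    q_sa     : action value Q(s, a) of the selected action
    v_next   : state value V(s') of the resulting state
    p_action : selection probability of the action before it was taken
    """
    # Habituation gating: express only while the action is being learned.
    if p_action <= 0.5 or p_action > 0.8:
        return None                # no emotional expression
    return "happy" if v_next > q_sa else "sad"

print(choose_expression(q_sa=0.02, v_next=0.08, p_action=0.6))  # -> happy
print(choose_expression(q_sa=0.05, v_next=0.01, p_action=0.6))  # -> sad
print(choose_expression(q_sa=0.02, v_next=0.08, p_action=0.9))  # -> None
```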
3.4 Utilizing NNC
As described in Section 2.3, the rewards given by humans come to decrease when AIBO begins to learn the correct actions. Consequently, the action value and the state value asymptotically approach zero. We therefore employ NNC (No News is Good News Criterion) [3], an implicit criterion that regards the absence of instructions as a positive reward, in order to maintain the action value and the state value. The reward R given by NNC is set to the same value as the state value V(s′) in both the sequence acquisition part and the goal state acquisition part; that is, after rewards are no longer given, the state value keeps its current value, and the action value is updated according to the state value V(s′). Different learning rates are used here, 0.15 in the sequence acquisition part and 0.20 in the goal state acquisition part, so that the delta first increases and afterwards decreases.
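Under NNC, silence after a state transition is treated as an implicit reward equal to the current state value, so the values no longer decay toward zero once the instructor falls silent. A sketch under that reading, reusing update rules (1) and (2); the helper name is ours.

```python
ALPHA_SEQ, ALPHA_GOAL = 0.15, 0.20   # per-part learning rates from Section 3.4

def reward_with_nnc(explicit_reward, v_next: float) -> float:
    """No News is Good News: absence of instruction counts as positive news.

    If the human gave no evaluation (explicit_reward is None), the NNC
    reward R is the current state value V(s'), so V keeps its value and
    Q(s, a_n) is pulled toward V(s') instead of decaying toward zero.
    """
    return v_next if explicit_reward is None else explicit_reward

# One learning step without any human evaluation:
r = reward_with_nnc(None, v_next=0.08)
# Q[s][a_n] += ALPHA_SEQ  * (r - Q[s][a_n])   # sequence acquisition part
# V[s_next] += ALPHA_GOAL * (r - V[s_next])   # goal state part: V is unchanged
```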
4. PERFORMANCE ASSESSMENT EXPERIMENT
We evaluated whether the system described in Section 3 can follow or react to the alterations of instructions through the game acquisition task. In this section, we describe the experimental method and its results.

4.1 Experimental method
Five participants were asked to teach AIBO the rule of ATCHI MUITE HOI described in Section 2.1. We showed the participants the expressions of AIBO and the sequences of the game shown in Figure 2, each printed on A4 paper. The usable instructions were the two evaluations, a pat (positive evaluation) and a hit (negative evaluation), and the experiment lasted thirty minutes per participant in total.

4.2 Experimental result
We observed alterations of instructions in the experiment. They can be classified into three types: 1) IR (Increasing Rewards), an alteration in which instructors begin to give rewards; 2) DR (Decreasing Rewards), an alteration in which instructors stop giving rewards; and 3) FR (Flipping Rewards), an alteration in which instructors begin to give opposite rewards. Table 1 shows the types of the alterations of instructions and their examples.

Table 1: The types of the alterations of instructions and their examples

IR (Increasing Rewards). Alterations: no rewards → pat, or no rewards → hit; instructors begin to give rewards. Examples: situations in which AIBO was looking at the ball when a participant showed it the ball; in which AIBO won the game; in which AIBO expressed the 'happy' emotion when it won the game; in which AIBO was not looking at the ball when a participant showed it the ball; in which AIBO lost the game.

DR (Decreasing Rewards). Alterations: pat → no rewards, or hit → no rewards; instructors stop giving rewards. Examples: situations in which AIBO had begun to look at the ball almost certainly; in which AIBO did not look at the ball as an exploratory behavior even though it had begun to look at the ball almost certainly; in which AIBO expressed the 'sad' emotion when it lost the game.

FR (Flipping Rewards). Alteration: hit → pat; instructors begin to give opposite rewards. Example: situations in which AIBO expressed the 'sad' emotion when it lost the game.

Our learning system successfully followed and reacted to IR and DR. On the other hand, it could not follow FR.
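For concreteness, one way to operationalize this three-way classification is to compare the rewards an instructor gave in the same state during an earlier and a later phase of the interaction. This classifier is purely our illustration and is not part of the system described above.

```python
def classify_alteration(earlier_rewards, later_rewards):
    """Label how the instruction for one state changed between two phases.

    Each argument is a list of rewards observed in that phase:
    +0.1 for a pat, -0.1 for a hit; an empty list means no rewards.
    """
    before, after = bool(earlier_rewards), bool(later_rewards)
    if not before and after:
        return "IR"   # Increasing Rewards: instructor starts rewarding
    if before and not after:
        return "DR"   # Decreasing Rewards: instructor stops rewarding
    if before and after and all(
        e * l < 0 for e in earlier_rewards for l in later_rewards
    ):
        return "FR"   # Flipping Rewards: sign of the evaluation reverses
    return "none"

print(classify_alteration([], [0.1]))       # -> IR
print(classify_alteration([0.1], []))       # -> DR
print(classify_alteration([-0.1], [0.1]))   # -> FR
```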
5. DISCUSSION

5.1 Following or reacting to the alterations of instructions
As described in Section 4.2, the alterations of instructions can be classified into three types. IR is the alteration in which instructors begin to give rewards when a new learning task, that is, a scaffold, is given. It is possible to follow IR because the Q value is a weighted mean of past rewards that attaches more importance to recently given instructions. In contrast, DR is the alteration in which instructors stop giving rewards when the learner achieves the given learning task. This alteration should not be followed, but it was nevertheless possible to react to it by utilizing NNC (see Section 3.4). Furthermore, FR, in which instructors begin to give opposite rewards, was also observed. It was the alteration in which participants gave the positive evaluation, with the motive "It is a correct expression", when AIBO lost a game and expressed the 'sad' emotion, similar to finding 2 of the preliminary experiment described in Section 2.3. In the preliminary experiment, the goal of the learning task was to express adequate emotions according to wins and losses. In this experiment, on the other hand, the goal of the learning task was to learn the rule of the game, but the alteration was nevertheless observed in the same way. Accordingly, FR occurs when the target of evaluation changes from the learner's actions to its emotional expressions once the learner displays an emotion. Our learning system could not recognize the target of the evaluations; consequently, it could not follow FR. In this experiment, it became clear that scaffolding occurred in the forms of IR, DR, and FR, and that learning progressed through IR alternating with DR or FR.

5.2 Adequacy of the expression
The expression method described in Section 3.3 could feed back the learning states of AIBO to participants, as can be seen from the fact that the alterations DR and FR occurred. However, participants gave instructions again (IR) after they had stopped giving them (DR), because the emotional expression of AIBO had stopped. One cause of this is that AIBO performed exploratory behaviors after it stopped its emotional expressions. We therefore consider that more appropriate scaffolds would be given if the learner could feed back the motive of its behavior.

6. CONCLUDING REMARKS
We believe that robots must acquire new actions through interaction with humans and the environment so that people enjoy the interaction with them for a long time, and hence we directed our attention to scaffolding. However, when robots learn through interaction with ordinary people in everyday situations, the following three points were not clear: 1) whether scaffolding actually occurs, 2) the conditions under which scaffolding occurs, and 3) whether robots can utilize scaffolds given by ordinary people in everyday situations. Therefore, we built a learning system that consists of the Sequence acquisition part, the Goal state acquisition part, and the Emotional expression part, and implemented an experiment in which the robot learned the game rule from instructions given by humans, in order to clarify these points. As a result, we observed that participants changed their instructions according to the robot's behavior; that is, participants gave scaffolds to the robot. The alterations of instructions can be classified into three types: IR (Increasing Rewards), DR (Decreasing Rewards), and FR (Flipping Rewards). However, it is not guaranteed that scaffolds are given under conditions other than those of this experiment; we must therefore run experiments under various conditions to clarify the second point. Additionally, our learning system could follow or react to IR and DR, but could not follow FR. It is necessary to recognize the target of the evaluations in order to follow FR. We will try to realize joint attention to solve this problem without superficial artifices.

Our future plans include:

• We will consider a method of recognizing the target of evaluations in order to follow FR.
• We will reconsider the method of expressing emotions; for example, a method that differentiates the learner's emotional expressions depending on whether the learner selects an optimum action or an exploratory behavior.
• We will implement the experiments under various conditions.

7. ACKNOWLEDGMENTS
This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (C), "The mechanism of cognitive development through the triadic interactions".

8. REFERENCES
[1] Asada, M., Noda, S., Tawaratsumida, S., and Hosoda, K. Purposive Behavior Acquisition for a Real Robot by Vision-Based Reinforcement Learning. Machine Learning, Vol. 23, pp. 279-303, 1996.
[2] Sato, T., and Nakata, T. Methodology for Synthesizing Affectionate Pet Robots. Journal of Japanese Society for Artificial Intelligence, Vol. 16, No. 3, pp. 406-411, 2001 (in Japanese).
[3] Tanaka, K., Zuo, X., Sagano, Y., and Oka, N. Learning the meaning of action commands based on the "No News Is Good News" criterion. Workshop on Multimodal Interfaces in Semantic Interaction, ISBN 978-1-59593-869-5, pp. 9-16, 2007.
[4] Tanaka, K., and Oka, N. The effect of the response timing of a pet robot on human-robot interaction. Human-Agent Interaction Symposium, 1B-1, 2006 (in Japanese).
[5] Thomaz, A. L., and Breazeal, C. Tutelage and Socially Guided Robot Learning. IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 4, pp. 3475-3480, 2006.
[6] Watkins, C. J. C. H., and Dayan, P. Q-learning. Machine Learning, Vol. 8, No. 3-4, pp. 279-292, 1992.
[7] Wood, D., Bruner, J. S., and Ross, G. The role of tutoring in problem-solving. Journal of Child Psychology and Psychiatry, Vol. 17, pp. 89-100, 1976.