In: IEEE International Conference on Robotics and Automation, Minneapolis, Minnesota, April 1996
Building Elementary Robot Skills from Human Demonstration

M. Kaiser, R. Dillmann
University of Karlsruhe, Institute for Real-Time Computer Systems & Robotics, D-76128 Karlsruhe, Germany
Abstract

This paper presents a general approach to the acquisition of sensor-based robot skills from human demonstrations. Since human-generated examples cannot be assumed to be optimal with respect to the robot, adaptation of the initially acquired skill is explicitly considered. Results for acquiring and refining manipulation skills for a Puma 260 manipulator are given.
1 Introduction

Since humans can carry out motions with no apparent difficulty, one would expect the generation of elementary skills to be a relatively simple problem. However, it turns out that it is extremely difficult to duplicate this elementary operative intelligence, which is used by humans unconsciously, in a computer-controlled robot [10]. This observation motivates research in the field of Robot Skill Acquisition via Human Demonstration [1, 11, 7], which is an extension of Robot Programming by Human Demonstration [9] that deals with the acquisition of sensor-based robot skills from human demonstrations (Fig. 1).
Figure 1: Demonstration of a peg-insertion skill for a Puma 260 manipulator.

Most approaches to skill acquisition rely on application-specific solutions [1, 11, 15, 16]. They
do not consider human deficiencies, which occur especially in service scenarios with skills demonstrated by untrained users. Consequently, skill adaptation is only rarely performed [11, 7]. In this paper, an approach to skill acquisition from human demonstration is presented that takes the characteristics of human-generated data, the existence of irrelevant perception and action components, and robustness requirements into account. The appropriateness of the approach and the developed methods is demonstrated for the acquisition of two manipulation skills for a Puma 260.
2 Skill Acquisition from Human Demonstration

"Skill" denotes "the learned power of doing a thing competently." From a systems-theoretic viewpoint, this means that for a given state $x(t)$, the skilled system (the robot) should perform an action $u(t)$ in order to achieve a goal that is associated with the particular skill. The action performed should be the result of a competent decision, i.e., it should be optimal with respect to an evaluation criterion (a reward) $r(x(t))$ that is related to the goal to be achieved. Essentially, a skill $s$ is therefore given through a control function $C_s$: $u(t) = C_s(x(t))$ that implicitly encodes the goal associated with the skill and produces in each state $x(t)$ a "competent" action $u(t)$, and a function $r_s(x(t))$ that evaluates the state $x(t)$ w.r.t. the goal. The aim of skill acquisition from human demonstration is to approximate these functions from data obtained during human performance of the skill $s$. In the case of elementary skills, the state $x$ is represented as a sequence of sensorial inputs $y$, i.e., $x(t) = (y(t-d), \ldots, y(t-d-p))$, $d, p \geq 0$, and the result of a human demonstration is a sequence $((y(0), u(0)), \ldots, (y(T), u(T)))$ of sensor measurements $y(t)$ and actions $u(t)$. If elementary manipulation skills are being demonstrated (Fig. 1),
the $(y(t), u(t))$ pairs could be measured forces/torques and translational/rotational velocities.
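To make the notation concrete, the following sketch shows how such a recording can be turned into delayed-state vectors $x(t) = (y(t-d), \ldots, y(t-d-p))$. The code is purely illustrative; the helper name and the random placeholder data are not part of the paper.

```python
import numpy as np

def build_states(Y, d, p):
    """Stack delayed perceptions into states x(t) = (y(t-d), ..., y(t-d-p)).

    Y: array of shape (T+1, n_y), one perception vector y(t) per time step.
    Returns an array whose row k corresponds to the state at t = d + p + k.
    """
    T = Y.shape[0] - 1
    states = []
    for t in range(d + p, T + 1):
        # concatenate y(t-d), y(t-d-1), ..., y(t-d-p)
        window = [Y[t - d - i] for i in range(p + 1)]
        states.append(np.concatenate(window))
    return np.array(states)

# Example with placeholder data: 357 samples of 6-D force/torque readings
Y = np.random.randn(357, 6)
X = build_states(Y, d=1, p=2)   # inputs for learning C_s
```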
2.1 Characteristics of human demonstrations

Ideally, the data obtained from the human demonstration represent the skill to be acquired perfectly. Then, the equation $u(t) = C_s(x(t)) = C_s(y(t-d), \ldots, y(t-d-p))$ holds for all recorded samples. In reality, this will seldom be the case. Several sources of suboptimality exist that result in disturbances of the above equation, the most prominent being

1. the existence of incorrect actions that must be corrected at a later instance and
2. the human tendency to perform "bang-bang" instead of smooth control.
Figure 2: Actions (position offsets) $(D_x, D_y, D_z)$ recorded from a human demonstration of the peg-into-hole skill.

The effect of these suboptimalities cannot be neglected (Fig. 2). The robot should therefore not simply copy the human when performing the demonstrated skill. Instead, it should avoid obvious mistakes from the very beginning and overcome other deficiencies through adaptation.
2.2 Generating training data
2.2.1 Identifying relevant action components
To rank the importance of a particular action component, the contribution of this component is determined. If $\|u_i(t)\|$ is the normed contribution (e.g., the change in position w.r.t. a particular degree of freedom) of action component $i$ at time $t$, the individual contribution of this component is
$$K_i = \frac{\sum_{t=0}^{T} \|u_i(t)\|}{\sum_{j=1}^{\dim u} \sum_{t=0}^{T} \|u_j(t)\|}.$$
Then, the set of relevant action components is the minimum subset of components of $u$ whose combined contribution is above a threshold $K_{\min}$, $0 < K_{\min} \leq 1$. Depending on the quality of the human demonstration, $K_{\min}$ is chosen from $[0.9, 1.0]$.
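A minimal sketch of this selection step, assuming the normed contribution of a component is measured by the sum of its absolute values over the demonstration (function and variable names are illustrative):

```python
import numpy as np

def relevant_action_components(U, K_min=0.9):
    """Return the minimal set of action components whose combined
    contribution exceeds K_min.

    U: array of shape (T+1, dim_u) holding the recorded actions u(t).
    """
    contrib = np.abs(U).sum(axis=0)       # per-component contribution over time
    K = contrib / contrib.sum()           # normalized contributions K_i
    order = np.argsort(K)[::-1]           # largest contributors first
    selected, total = [], 0.0
    for i in order:
        selected.append(int(i))
        total += K[i]
        if total >= K_min:
            break
    return sorted(selected), K

# Example: keep the components that jointly account for at least 90% of the motion
U = np.random.randn(357, 6)
relevant, K = relevant_action_components(U, K_min=0.9)
```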
2.2.2 Identifying relevant perceptions
Data obtained from a human demonstration must be preprocessed before they can be used as training data for learning the desired skill. This preprocessing must comprise the identification of those perception components $y_i$ that must be known in order to produce a "competent" action $u$, as well as the parameters $d$ and $p$. In addition, relevant action components, i.e., those components $u_j$ of the action vector $u$ that are required to execute the skill, must be found. These identification steps result in a reduced perception vector $y'$, a reduced representation $x'(t) = (y'(t-d), \ldots, y'(t-d-p))$ of the state, and a reduced action vector $u'$.
If sufficiently many samples $(y, u)$ are available from a single demonstration, relevant perception components can be identified statistically [7]. Otherwise, the reduced perception vector $y'$ and the parameter $p$ are determined such that, with $x'(t) = (y'(t-d), \ldots, y'(t-d-p))$,
$$\|x'(t_1) - x'(t_2)\| \geq \alpha \, \|u'(t_1) - u'(t_2)\|, \quad \alpha > 0,$$
holds for all $t_1, t_2 \in \{d+p, \ldots, T\}$ and the total number of components of $x'$ is minimal. Here, $d = d_o$ corresponds to the estimated reaction time of the operator. As a result of both preprocessing steps, the demonstration data are reduced to training data $((x'(d+p), u'(d+p)), \ldots, (x'(T), u'(T)))$.
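The consistency test above can be sketched as follows. The proportionality constant is written as alpha here because its name is an assumption, and a full implementation would additionally search over candidate component subsets and values of p, keeping the smallest x' that passes:

```python
import numpy as np
from itertools import combinations

def representation_is_consistent(X_red, U_red, alpha=0.1):
    """Check that states requiring different actions stay distinguishable
    in the reduced representation:
    ||x'(t1) - x'(t2)|| >= alpha * ||u'(t1) - u'(t2)|| for all pairs t1, t2.
    """
    for t1, t2 in combinations(range(len(X_red)), 2):
        lhs = np.linalg.norm(X_red[t1] - X_red[t2])
        rhs = alpha * np.linalg.norm(U_red[t1] - U_red[t2])
        if lhs < rhs:
            return False
    return True
```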
2.2.3 Removing ineffective and corrective actions
Ineffective and incorrect motions should be removed from the sampled data to prevent the robot from learning them. However, it cannot generally be determined whether an action is incorrect. Only two preprocessing steps can safely be performed:

1. Removal of all actions that do not contribute anything to solving the task, i.e., removal of all samples $(x', u')$ with $\|u'\| \leq \epsilon_s$, where $\epsilon_s \geq 0$ is a system-specific threshold, its "minimum stimulus."
2. If the space of relevant action components is a vector space and the subsequent application of two actions $u'_1$ and $u'_2$ with $u'_1 = \lambda u'_2$, $\lambda \in \mathbb{R}$, is equivalent to applying $(1+\lambda) u'_2$, then $u = u'(t+1)$ is a corrective action if $u'(t+1) = \lambda u'(t)$, $\lambda < 0$. In this case, $u$ can be "smoothed" by applying $u(t) = u(t+1) = \frac{1}{2}(u(t) + u(t+1))$.
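A rough sketch of both preprocessing steps is given below. Note that the corrective-action test uses a negative inner product as a loose stand-in for the exact condition $u'(t+1) = \lambda u'(t)$, $\lambda < 0$, and the threshold value is illustrative:

```python
import numpy as np

def remove_ineffective(X, U, eps_s=1e-3):
    """Drop samples whose action magnitude is below the system's
    'minimum stimulus' eps_s (the value used here is illustrative)."""
    keep = np.linalg.norm(U, axis=1) > eps_s
    return X[keep], U[keep]

def smooth_corrective(U):
    """Average consecutive actions that oppose each other; a negative inner
    product stands in for u'(t+1) = lambda * u'(t), lambda < 0."""
    U = U.copy()
    for t in range(len(U) - 1):
        if np.dot(U[t], U[t + 1]) < 0:
            mean = 0.5 * (U[t] + U[t + 1])
            U[t] = mean
            U[t + 1] = mean
    return U
```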
2.2.4 Initializing the evaluation criterion
The identification of relevant action and perception components, followed by the limited preprocessing that is possible in the general case, results in data that can be used to learn the function $C_s$, i.e., to build the skill-specific control function. To learn the evaluation criterion $r_s$, these data must be extended by assigning a specific value $r$ to each of the states $x'(t)$. Since a suitable evaluation function is not assumed to be known, this extension can only be based on heuristics. If demonstration data $((x'(d+p), u'(d+p)), \ldots, (x'(T), u'(T)))$ are given and the goal state is represented by $x'(T)$, $r = r_s(x'(t))$ can be initialized as
$$r = \begin{cases} r_+ & \text{if } \|x'(T) - x'(t)\| < \|x'(T) - x'(t-1)\| \\ r_- & \text{otherwise.} \end{cases}$$
Thus, any state that is closer to the goal state than its predecessor is given a positive reward $r_+$, while any state that is further away receives a negative reward $r_-$, with $r_{\min} \leq r_- < r_+ \leq r_{\max}$. For the experiments described in section 5, $r_{\min} = -1$, $r_- = 0.2$, $r_+ = 0.7$, $r_{\max} = 1$ were chosen.
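This heuristic is straightforward to implement. The sketch below assumes the states are stacked row-wise with the goal state $x'(T)$ last; the treatment of the first state, which has no predecessor, is an assumption:

```python
import numpy as np

def init_rewards(X, r_plus=0.7, r_minus=0.2):
    """Assign r+ to every state that is closer to the goal state x'(T) than
    its predecessor, r- otherwise; the first state (no predecessor) gets r-."""
    goal = X[-1]
    dist = np.linalg.norm(X - goal, axis=1)
    r = np.full(len(X), r_minus)
    r[1:][dist[1:] < dist[:-1]] = r_plus
    return r
```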
3 Initial skill learning

During initial learning, the functions $C_s$ and $r_s$ are built from the generated training data. The formalism used to represent these functions must fulfill two important criteria. First, the representation must be constructed from the training data, in order not to require a potentially untrained user to choose a representation. Second, the formalism must allow for incremental learning, since on-line adaptation and extension of the initially learned skill is mandatory. Both requirements exclude, for instance, multilayer perceptrons (MLPs) [15, 11] from being used in a general approach. A class of function approximators whose structure can be generated from examples and which supports incremental learning of both numerical and structural parameters is that of neural networks based on local receptive fields, such as Radial-Basis Function networks (RBFs) [14]. These networks calculate the output $u$ for a given input vector $x$ as the weighted
sum of the activity of the local receptive fields. Like multilayer perceptrons, they are universal approximators, and similar to MLPs, they can use time-delays for handling spatio-temporal data (Fig. 3), making them applicable also to identification and control [6].

Figure 3: Structure of a Radial-Basis Function Network with time-delays in the hidden (cluster) layer.
3.1 Network construction and training
To construct an RBF network, the number of clusters and their centers and widths have to be determined. Several algorithms have been proposed for this task [12, 13], among them a supervised clustering algorithm specifically designed for constructing networks that approximate continuous functions [2]. This algorithm is based on the idea that the amount of overlap between individual clusters of an RBF network should depend on the similarity of the actions associated with these clusters. Additionally, it allows one to determine the amount of generalization performed by the network by specifying the maximum network activity in regions of the input space that are not covered by examples (for further details see [2]). Following the construction, the network's output weights $w_i$ are trained using gradient descent.
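For illustration, a minimal RBF approximator with Gaussian receptive fields and gradient-descent training of the output weights might look as follows; cluster placement (the supervised clustering of [2]) and time-delay handling are omitted:

```python
import numpy as np

class RBFNetwork:
    """Minimal radial-basis function approximator: the output is the weighted
    sum of Gaussian cluster activations (cf. [14])."""

    def __init__(self, centers, widths, out_dim):
        self.centers = np.asarray(centers)               # (n_clusters, in_dim)
        self.widths = np.asarray(widths)                 # (n_clusters,)
        self.W = np.zeros((len(self.centers), out_dim))  # output weights

    def activations(self, x):
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.widths ** 2))

    def forward(self, x):
        return self.activations(x) @ self.W

    def train_step(self, x, target, lr=0.05):
        """One gradient-descent step on the output weights only."""
        a = self.activations(x)
        self.W += lr * np.outer(a, target - self.forward(x))

# Example: clusters placed on arbitrary prototype states (placement via the
# supervised clustering of [2] is not reproduced here)
centers = np.random.randn(20, 3)
net = RBFNetwork(centers, widths=np.full(20, 0.5), out_dim=3)
net.train_step(centers[0], np.array([0.1, 0.0, -0.2]))
```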
4 Skill refinement

The refinement of skills requires changing actions associated with known states (skill adaptation) as well as assigning appropriate actions to states that have not been encountered yet (skill extension). In the general setting of skill acquisition from human demonstration, both tasks have to be solved based on a delayed scalar evaluation (a delayed reward) of the skill execution. This is a typical setting for the application of reinforcement learning methods [3], and especially direct adaptation methods (Fig. 4, [4]). These methods do not rely on a model of the plant but try to directly estimate necessary changes in the
calculated action $u'$ on-line. To this aim, the generated action is systematically altered by an exploratory element (EE). The correlation between the change of the obtained evaluation (the reinforcement signal) and the change in the original action is exploited for adaptation (Fig. 4).

Figure 4: Direct adaptation scheme.

The technique used for skill refinement is based on Gullapalli's Stochastic Real-Valued units (SRV units, [4]), which allow for reinforcement learning of real-valued functions. The basic idea behind the SRV algorithm is to use a Gaussian distribution to produce a stochastic control signal $u$ given an input $x'$, i.e., $u = \mathcal{N}(C_s(x'), \sigma)$. At each time instant, $\sigma$ is calculated on the basis of the estimated reward $r_s(x'(t))$, with $\sigma$ becoming smaller as $r_s$ increases. To facilitate faster and more secure adaptation, an additional correction term $\Delta u'$, which estimates $\frac{\partial r}{\partial u'} = \frac{\partial r}{\partial x'} \frac{\partial x'}{\partial u'}$ for $\frac{\partial r}{\partial u'} > 0$, is calculated via a third network as $\Delta u' = X_s(x')$ that is built on-line from received rewards only. $\Delta u'$ serves as a bias during adaptation: $u = \mathcal{N}(C_s(x') + \Delta u', \sigma)$.
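A compact sketch of this stochastic action generation, assuming the controller $C_s$, the bias network $X_s$, and the critic $r_s$ are callables (for instance RBF networks as in section 3.1) and using the exploration scale introduced in section 4.1; the interface itself is an assumption:

```python
import numpy as np

def srv_action(C_s, X_s, r_s, x, delta, r_max=1.0, rng=None):
    """Draw an exploratory action around the biased controller output,
    in the spirit of SRV units [4]. C_s, X_s and r_s are callables
    (e.g. RBF networks); delta is the per-component exploration scale."""
    if rng is None:
        rng = np.random.default_rng()
    mean = C_s(x) + X_s(x)                  # controller output plus learned bias
    sigma = delta * np.abs(r_max - r_s(x))  # less exploration when high reward is predicted
    return rng.normal(mean, sigma)
```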
4.1 Skill refinement at work

The only kind of information that can be expected from the user during the skill application and refinement process is an evaluation after the termination of the skill execution (termination means that either a robot-specific error occurs or the goal state known from the demonstration is reached). Then, the temporal credit assignment problem, i.e., deciding how much a particular action $u'(t)$ contributed to this evaluation, must be solved. The approach is to use an exponentially discounted adaptation rate. For real-world adaptation, this has several advantages over using exponentially discounted rewards. During the actual skill execution (Fig. 5), it is necessary to determine how much an action should really be altered for exploration purposes, i.e., to define $\sigma$:
For each action component $u'_i$,
$$\sigma_i(t) = \Delta_i \, |r_{\max} - r_s(x'(t))|,$$
where $r_{\max}$ is the maximum reward that can be obtained and $\Delta_i$ is the maximum absolute difference that occurred between two subsequent actions $u(t)$ and $u(t+1)$ in component $u_i$. This amount of exploration is assumed to be safe. The complete skill execution cycle is shown in Fig. 5. After termination, skill refinement (Fig. 6) is performed.

1. Check if the current state $x(t)$ is the goal state or an error state. If this is the case, terminate.
2. Check if the current state $x(t)$ is known, i.e., check if the activation of any of the clusters in the network representing $C_s$ is sufficiently high.
3. If this is the case:
   (a) Calculate the action $u'(t)$ as $u'(t) = C_s(x'(t))$.
   (b) Calculate the prediction $r(t) = r_s(x'(t))$.
   (c) Alter $u'(t)$ dependent on $r(t)$ to obtain $u^*(t)$.
   (d) Store the quadruple $(x'(t), u'(t), u^*(t), r(t))$.
   (e) Apply action $u^*(t)$.
4. Else:
   (a) Insert a new cluster into the network representing $C_s$ that covers $x'(t)$.
   (b) Find the cluster closest to the new cluster for which $r_s(\cdot) > 0$ holds.
   (c) Assign the action $u'$ associated with that cluster to the new cluster (i.e., initialize the corresponding weights).
   (d) Insert a new cluster into the network representing $r_s$ that covers $x'(t)$.
   (e) Assign 0 ("unknown") as the predicted reward to the new cluster (i.e., initialize the corresponding weight).
   Then proceed with 3a.

Figure 5: Skill execution cycle.
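The post-termination refinement of Fig. 6 could then be replayed as sketched below, assuming networks with the forward/train_step interface of the RBF sketch in section 3.1; interpreting the exponentially discounted adaptation rate as a learning rate that decays with temporal distance from termination is an assumption:

```python
def refine_after_termination(C_s_net, r_s_net, history, final_reward,
                             base_lr=0.1, gamma=0.9):
    """Replay the stored (x', u', u*, r) quadruples after termination (cf. Fig. 6).
    The learning rate decays exponentially with the temporal distance from
    termination, so actions close to the final evaluation are adapted most."""
    T = len(history)
    for k, (x, u, u_star, r_pred) in enumerate(history):
        lr = base_lr * gamma ** (T - 1 - k)      # discounted adaptation rate
        error = (u_star - u) if r_pred < final_reward else (u - u_star)
        C_s_net.train_step(x, C_s_net.forward(x) + error, lr=lr)
        r_s_net.train_step(x, r_s_net.forward(x) + (final_reward - r_pred), lr=lr)
```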
5 Acquiring Manipulation Skills

For the evaluation of the presented methods, two distinct manipulation skills have been investigated. Both were taught to a Puma 260 manipulator equipped with a 6D force/torque sensor using the environment shown in Fig. 1.
1. Let $r$ be the final reward.
2. For each stored quadruple $(x'(t), u'(t), u^*(t), r(t))$:
   (a) If $r(t) < r$, train the network representing $C_s$ with $x'(t)$ as input, $u^*(t) - u'(t)$ as output error, and a discounted learning rate.
   (b) Else train the network representing $C_s$ with $x'(t)$ as input, $u'(t) - u^*(t)$ as output error, and a discounted learning rate.
   (c) Train the network representing $r_s$ with $x'(t)$ as input, $r - r(t)$ as output error, and a discounted learning rate.

Figure 6: Skill refinement.
Figure 7: Forces recorded during peg-insertion under neural network control (after 5 practice runs).
For demonstrations performed with the 6D mouse (Fig. 1), the operator's reaction time was identified as 0.25 s, such that $d_o = 7$ [7]. For direct manual demonstration, $d_o = 1$. Prior to any kind of analysis, the sampled data were shifted to achieve $d_o = 1$ in all cases.
5.1 Peg insertion
The first skill to be demonstrated was peg-into-hole insertion, to be performed with a round peg of 10 mm diameter and a clearance of 0.5 mm. The demonstration was performed using the 6D mouse and resulted in 357 samples taken over a period of approximately 12 seconds. With $K_{\min} = 0.9$, the relevant actions were identified as $u' = \{D_x, D_y, D_z\}$ (the translational offsets in x, y, and z direction). The relevant perceptions were determined as $x'(t) = (F_x(t-1), F_y(t-1), F_z(t-1))$. Because no rotational degrees of freedom were considered, incorrect actions could be removed according to section 2.2.3. From the preprocessed data (Fig. 2), an RBF network was built that featured 88 clusters. The network representing $r_s$ was constructed from training data generated according to section 2.2.4 with $r_+ = 0.7$ and $r_- = 0.2$, resulting in 127 clusters. Using the network representing $C_s$ without adaptation, the robot was able to insert the peg, albeit with unsatisfactory performance. After five practice runs that were only evaluated after termination of each run, the performance became significantly better (Fig. 7). During these five runs, 28 neurons were inserted into the network representing $C_s$.
5.2 Opening a door
A second skill to be demonstrated and learned was the opening of a door (Fig. 8). This skill is important especially in service scenarios. It is difficult, since the robot operates in a closed kinematic chain. It could
also not be demonstrated using the 6D mouse. Instead, the robot had to be manually guided. This demonstration resulted in 261 samples. Using again $K_{\min} = 0.9$, relevant action components were identified as $u' = \{D_z, R_x, R_y\}$, i.e., translation in the z direction and rotations around the x and y axes. The relevant perceptions were $x'(t) = (F_z(t-1), M_y(t-1))$, i.e., the force in the z direction and the torque around the y axis. Since both translational and rotational degrees of freedom had to be considered, preprocessing was reduced to removing irrelevant actions. From the preprocessed data, a network was built that featured 103 clusters. Using this network without adaptation, the robot was not able to open the door. This was because the manual guidance of the robot during the demonstration affected the measured forces and torques, such that situations encountered during the
demonstration were different from those the robot experienced when applying the skill. However, since the demonstrated actions were valid, after 15 trials the robot managed the task (Fig. 9), even without explicit sample correction (as in [5]). By then, 189 neurons had been added.

Figure 8: Puma 260 opening a door.
Figure 9: Forces and torques recorded during door opening under neural network control (after 15 runs).
6 Summary and Conclusion

This paper presented an approach to the acquisition of robot skills from human demonstrations. The approach considers the characteristics of human-generated data, the existence of irrelevant perception and action components, and the need for robust methods that do not require skill-specific parameterizations. The acquisition of two different manipulation skills for a Puma 260 manipulator has shown the validity of the approach and the robustness of the presented methods. Currently, our work focuses on methods to correct effects of user presence on the sampled data and on interfaces that allow the user to intuitively define more sophisticated evaluation functions, resulting in faster adaptation. We do not expect skills acquired via an interactive learning approach to be comparable to those originating from an in-depth task analysis and explicit robot programming, although virtual reality techniques [8] as well as advanced user interfaces and teaching devices may help the user to produce better examples. However, if robots are to become consumer products, they will be exposed to users who are not familiar with computers or robots. For such users, explicitly programming their robot according to their personal requirements is not an option, whereas teaching by showing definitely is.
Acknowledgements
This work has been partially supported by the ESPRIT Project 7274 "B-Learn II". It has been performed at the Institute for Real-Time Computer Systems and Robotics (Prof. Dr.-Ing. U. Rembold, Prof. Dr.-Ing. R. Dillmann).
References

[1] H. Asada and B.-H. Yang. Skill acquisition from human experts through pattern processing of teaching data. In Proceedings of the 1989 IEEE International Conference on Robotics and Automation, 1989.
[2] C. Baroglio, A. Giordana, M. Kaiser, M. Nuttin, and R. Piola. Learning controllers for industrial robots. Machine Learning, 1996.
[3] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pages 835-846, 1983.
[4] V. Gullapalli, J. A. Franklin, and H. Benbrahim. Acquiring robot skills via reinforcement learning. IEEE Control Systems Magazine, 14(1):13-24, 1994.
[5] G. Hirzinger and J. Heindl. Sensor programming - a new way for teaching robot paths and forces/torques simultaneously. In Third International Conference on Robot Vision and Sensory Controls, Cambridge, Mass., 1983.
[6] M. Kaiser, A. Retey, and R. Dillmann. Designing neural networks for adaptive control. In IEEE International Conference on Decision and Control, 1995.
[7] M. Kaiser, A. Retey, and R. Dillmann. Robot skill acquisition via human demonstration. In Proceedings of the International Conference on Advanced Robotics, 1995.
[8] R. Koeppe and G. Hirzinger. Learning compliant motions by task-demonstration in virtual environments. In Fourth Int. Symp. on Experimental Robotics, 1995.
[9] Y. Kuniyoshi, M. Inaba, and H. Inoue. Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, 10(6):799-822, 1994.
[10] J.-C. Latombe. Robot Motion Planning. Kluwer, Boston, 1991.
[11] S. Liu and H. Asada. Teaching and learning of deburring robots using neural networks. In Proceedings of the IEEE International Conference on Robotics and Automation, Atlanta, Georgia, 1993.
[12] J. Moody and C. Darken. Learning with localized receptive fields. In Proceedings of the Connectionist Models Summer School, Carnegie Mellon University, 1988.
[13] M. T. Musavi, W. Ahmed, K. H. Chan, K. B. Faris, and D. M. Hummels. On the training of radial basis function classifiers. Neural Networks, 5:595-603, 1992.
[14] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481-1497, September 1990.
[15] D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3:88-97, 1991.
[16] P. Reignier, V. Hansen, and J. L. Crowley. Incremental supervised learning for mobile robot reactive control. In Intelligent Autonomous Systems 4, 1995.