Paladyn, J. Behav. Robot. 2015; 6:30–41
Research Article
Open Access
A. Lemme*, Y. Meirovitch, M. Khansari-Zadeh, T. Flash, A. Billard, and J. J. Steil
Open-source benchmarking for learned reaching motion generation in robotics

Abstract: This paper introduces a benchmark framework to evaluate the performance of reaching-motion generation approaches that learn from demonstrated examples. The system implements ten different performance measures for typical generalization tasks in robotics, using open-source MATLAB software. Systematic comparisons are based on a default training data set of human motions, which specifies the respective ground truth. In technical terms, an evaluated motion generation method needs to compute velocities, given a state provided by the simulation system. The framework, however, is agnostic to how the method does this or how it learns from the provided demonstrations. The framework focuses on robustness, which is tested statistically by sampling from a set of perturbation scenarios. These perturbations interfere with motion generation and challenge its generalization ability. The benchmark thus helps to identify the strengths and weaknesses of competing approaches, while giving the user the opportunity to configure the weightings between different measures.

Keywords: benchmarking, standardized comparisons, human-like motions, reaching motions, movement primitive, dynamical systems, learning from demonstrations, programming by demonstrations

DOI 10.1515/pjbr-2015-0002
Received August 25, 2014; accepted December 11, 2014
*Corresponding Author: A. Lemme: Research Institute for Cognition and Robotics (CoR-Lab), Bielefeld University, Germany, E-mail: [email protected]
Y. Meirovitch: Weizmann Institute of Science (WIS), Israel, E-mail: [email protected]
M. Khansari-Zadeh: Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland, E-mail: [email protected]
T. Flash: Weizmann Institute of Science (WIS), Israel, E-mail: [email protected]
A. Billard: Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland, E-mail: [email protected]
J. J. Steil: Research Institute for Cognition and Robotics (CoR-Lab), Bielefeld University, Germany, E-mail: [email protected]
1 Introduction

The new generation of redundant robots, with a high number of degrees of freedom, needs to perform a wide variety of tasks and to autonomously adapt to perturbations, uncertainties or changes in the environment [1, 2]. Additionally, human-like natural motions are essential for the social acceptance of humanoid robots [3, 4]. In this context, a large variety of attractor-based dynamical-system approaches have recently been developed in order to implement various reaching motions, mostly derived from human demonstration data [5–7]. However, systematic generalization tests using well-established measures, including those for human-likeness as derived in human-motion science, are still lacking. To improve this situation, this paper proposes a benchmarking framework to compare algorithms for motion generation.

Benchmarks are a common approach for providing systematic comparisons in many fields of research, including supercomputing [8], hardware design [9] and optimization software [10]. This motivates us to propose a benchmark software framework for motion generation algorithms that model point-to-point reaching or drawing motions with the help of dynamical systems, e.g. [11–19]. An extensive performance and generalization evaluation is missing in most of these previous works; success is mostly illustrated by displaying only a few example motions. There is, however, a large number of possible training data sets, generalization tasks and measures that could potentially be useful for evaluation, which poses a substantial challenge for designing an effective benchmark. Such a benchmark needs to provide a rich set of training data and to evaluate a specific, standardized set of performance measures on significant tasks. These tasks should be reasonably diverse and challenging. In particular, it is not sufficient to evaluate only the accuracy of reproducing a given (set of) demonstrations; to be meaningful for actual applications on real robots, it is necessary to evaluate the robustness to perturbations and uncertainties. In the proposed framework, perturbations such as 'perturbations applied to the end-effector' or 'changes in
the goal state' are applied during the execution of the motion. The decision as to which performance measures should be used depends on the properties desired for motion generation. We argue that, on the one hand, robots must be precise in the execution of the task and be robust to perturbations that may occur during execution. On the other hand, to promote the acceptance of robots in human environments, human-like motions are crucial [3, 4]. Our benchmark framework therefore currently comprises ten different evaluation measures, including some derived from investigations of human motion in order to measure human-likeness using typical features such as power laws [20, 21] or the minimum jerk model [22–24]. Although the control principles underlying human motion generation are disputed in motor-control research [25], we do not subscribe to a particular theory; rather, we believe that the proposed measures evaluate particular features, strengths or respective weaknesses. We thus provide the means for a multi-faceted benchmarking system through statistical evaluation and diverse exploration of robustness in various scenarios, but refrain from weighting the criteria, instead leaving the task of proposing an overall score or ranking to the user. This benchmark framework considers only the motion generation of already-trained modules; the proposal is therefore agnostic to properties of the learning process such as training time, offline vs. online training, or data efficiency. We are aware that such features are practically relevant, but the respective measures would be very difficult to implement in a generic framework, because they would need to address features of the learning process that are intermingled with the algorithm itself. We hereby invite the robot motion generation community to employ this benchmark to evaluate their methods. Participation will be a valuable addition to research in this field, and the findings of such investigations could contribute to a greater understanding of the differences between the various motion generation approaches. The MATLAB framework and the first sets of "trained modules" are available for download at the persistent doi:10.4119/unibi/2678439.
2 Benchmark system

We describe the benchmark system presented in Figure 1 to clarify the roles of the software framework and the participants.

Figure 1: Schematic illustration of the benchmark architecture. (The diagram links the data set (Sec. 3) and the benchmark scenarios (Sec. 4), which supply start points, target points and perturbations, to the participants' trained modules, the simulation and motion generation, and on to the evaluation (Sec. 5) and ranking (Sec. 6) of reproduced against demonstrated trajectories.)

Assumptions and ground truth of this benchmark framework. It is assumed that a motion task is implicitly given through a demonstration data set, which is recorded from a goal-directed motion, i.e. point-to-point motions with a common goal but varying starting points. The framework makes no assumption as to whether these data were actually generated by a dynamical system or by some other method. It does implicitly assume that a suitable representation of the motion can be learned from these data; most existing methods in the literature use dynamical systems of various forms. Using the training data, the user is required to provide such a learned representation, in a form suitable for evaluation in the benchmark. The data comprise a ground truth against which the learned dynamical system can be evaluated, but no further assumptions are made about any ground-truth dynamics underlying the data. Generalization refers to the reproduction of motions similar to the recorded data under perturbations, where similarity can have different meanings, as discussed in Section 5.

Trained modules need to be prepared by each participant, according to a software interface and a training data set available in this framework. Each trained module in M = {m_1, ..., m_N} is trained for a specific shape in the data set D = {d_1, ..., d_N}, where N is the number of shapes in the data set. The simulation tests only one shape d_i at a time, using the corresponding trained module m_i. The software interface for the trained module is a class structure, which can be used to store both the learned representation and the control algorithm, and which provides the functionality to process the trained modules. Each module must be represented as a first-order dynamical system that maps positions x to velocity vectors v. The velocities are integrated over time in the simulation module described below.
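For illustration, a minimal MATLAB sketch of what such a class could look like. The class name, the Params property and the velocity/isFinished methods are our own hypothetical names, not the framework's actual interface, which is specified in the downloadable package; the linear attractor is a trivial stand-in for a learned model.

```matlab
% Hypothetical sketch of a trained-module class; the actual interface
% ships with the benchmark package and may differ in naming.
classdef TrainedModule
    properties
        Params   % learned representation, e.g. regression weights
    end
    methods
        function obj = TrainedModule(params)
            obj.Params = params;              % store what was learned offline
        end
        function v = velocity(obj, x, target)
            % First-order dynamics: map the current position x to a velocity v.
            % Trivial stand-in: a linear attractor towards the target.
            v = obj.Params.gain * (target - x);
        end
        function done = isFinished(obj, x, target)
            % Signal the simulation that the motion has converged.
            done = norm(target - x) < 1e-3;
        end
    end
end
```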
Benchmark scenarios specify uncertainties and perturbation types that can occur during motion generation (e.g. a displacement of the predicted position). An overview of the available scenarios is given in Section 4. A set of parameters can be drawn randomly from specified probability distributions; this set of parameters forms a reproducible basis for the simulation. The parameters and the corresponding probability distributions are specified in Section 4.

Data sets play an important, two-fold role in this benchmark. Besides serving as training data, they also comprise the ground truth for comparisons and provide the initial conditions for motion generation. The initial conditions are given by the start and target points of each demonstrated trajectory. The initial data set used in this benchmark is described in Section 3.

The simulation module receives the initial start and target points from the given data set, together with a specification identifying which scenario should be tested with the corresponding perturbation parameters. Given this configuration, the simulation module invokes the trained modules to generate a motion in the specified scenario by integrating the velocities v provided by the trained modules over time:

$$x_{t+1} = x_t + \Delta t \cdot v(x_t), \qquad (1)$$

where $\Delta t$ is the time constant for the discretization of the continuous dynamics and is set according to the ground-truth data set. After each integration step, the simulation module provides feedback to the trained modules, including the current position of the target and of the state (end-effector), the current velocity and the current time step. The simulation stops either when the trained module indicates that the motion generation is finished or when the generation process exceeds the maximum execution time allowed for the motion.
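Equation (1) amounts to explicit Euler integration. A minimal sketch of such a simulation loop, assuming a start point x0 from the data set and a module object that follows the hypothetical interface sketched above:

```matlab
% Minimal Euler-integration loop implementing Equation (1).
dt     = 0.01;               % time step, set from the ground-truth data
x      = x0;                 % start point taken from the data set
target = [0; 0];             % common final point of the LASA shapes
Tmax   = 10;                 % maximum allowed execution time [s]
traj   = x;
for t = dt:dt:Tmax
    v = module.velocity(x, target);   % query the trained module
    x = x + dt * v;                   % Equation (1)
    traj(:, end+1) = x;               % record the reproduced trajectory
    if module.isFinished(x, target)
        break;                        % module signals completion
    end
end
```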
The evaluation is performed by computing measures on the reproduced trajectory as integrated in the simulation. These measures compare the generated trajectory to the demonstrated trajectory given by the ground-truth data set. The provided measures are introduced in Section 5.

Ranking using multiple measures is a difficult task if the focus is not fully known. The result of this benchmark is therefore a variable set of statistical evaluations calculated on the set of measures. This allows each participant to choose among the different evaluations in order to generate a solid basis for comparison. Further elaboration can be found in Section 6.
3 Benchmark data set

In this benchmark, the task is to generate point-to-point reaching motions. These motions not only approach a specific target, but also follow a specified trajectory pattern to reach it. This benchmark therefore evaluates not only the precision of the executed reaching motions, but also the human-likeness of their trajectories (see Section 5). These criteria impose two constraints on the kind of data set that can be used in this benchmark, as it constitutes the ground truth for the evaluation. First, the data set should naturally offer a variety of movement shapes to test the scalability of the considered learning methods. Second, with respect to evaluating the human-likeness of the reproduced motions, the movement shapes in the data set should be demonstrated by human subjects.

For this purpose, the proposed benchmark uses the LASA human handwriting library [26] as the benchmark data set. This library was first introduced in [27, 28] and extended in [29, 30] to compare the reproduction performance of different regression techniques. It has also been adopted in several works as the baseline for performance comparison [14, 15, 31]. The LASA library comprises data from handwriting motions collected from pen input on a Tablet PC. For each desired motion shape, the human subject was asked to draw seven demonstrations, starting from different initial positions and moving to the same final point. The initial points are close to each other, which results in demonstrations that may intersect each other but represent the desired movement shape with approximately the same size and rotation. The recorded library contains 26 human handwriting motion sets and four additional sets. The additional sets accommodate more than one movement shape in one set (called Multi Models). Without loss of generality, the target (final) point is by definition set to (0, 0) for all motions (shapes) in this library. All demonstrations of the different shapes are displayed in Figure 2.

Figure 2: The library of LASA handwriting motions [26]. This library is composed of N = 30 two-dimensional point-to-point motions.

The benchmark tests the robustness of the trained modules in different scenarios, as described in Section 4, which raises the question: what is the correct response to perturbations? According to the criterion of human-likeness, we expect the motion generation methods to react to perturbations similarly to how humans do. The performance of human subjects under perturbations has been examined before [32–35] and could in principle also be included in this benchmark. In these studies, the considered tasks are point-to-point straight motions, which are not very challenging for motion learning. Nevertheless, the results could provide interesting insights into each motion-generation approach, especially with regard to perturbation handling compared to human performance. At this point, however, we chose to simplify the evaluation process and focus on the robotic side, by deciding that perturbations should be compensated for with respect to the different measures described in Section 5.

4 Benchmark scenarios

The motion generation in this benchmark is subjected to different kinds of perturbations. We identified four scenarios that cover the majority of typical perturbation types occurring in robot motion. The goal is to evaluate the ability of the different motion-generation algorithms to cope with these perturbations during motion execution, i.e. their generalization abilities with respect to (i) initialization from different starting points; (ii) a discrete push of the end-effector (realized as a sudden displacement of the current position); (iii) a continuous push of the end-effector; (iv) changes of the goal position during motion execution. The four types of perturbations are visualized in Figure 3.

4.1 Generalization to different initial conditions

The most common generalization test involves perturbation of the starting point of the motion generation (see Figure 3(a)). A displacement of the end-effector is applied at time t = 0 with amplitude a and direction v, where the systematic variations use the a, v combinations specified below.

Amplitude: Let l be the length span of the motion along the x and y axes. Samples for the amplitude a are then drawn from a normal probability distribution:

$$a \sim \mathcal{N}(\mu, \sigma), \qquad (2)$$

where µ = 0.1 l and σ = 0.05 l.

Direction: The direction of the perturbation is given by v_p. Each component of the vector v ∈ R^n is drawn from a uniform distribution on the interval [−0.5, 0.5] to determine a random direction. This vector is normalized and multiplied by the amplitude to obtain the actual perturbation vector v_p = a v/‖v‖.

4.2 Discrete Push of the end-effector

In this scenario, a sudden displacement of the current position is applied during motion generation (see Figure 3(b)). This simulates a hit against the end-effector at a particular point in time. The perturbation appears with varying timing t_p, direction v and amplitude a; the start and target points remain fixed. Direction and amplitude are chosen as described in Section 4.1.

Timing: The timing parameter t_p specifies when to execute the perturbation. The benchmark data set consists of multiple motion shapes, where each shape is demonstrated a number of times. For each set of demonstrations, the mean motion duration τ is calculated. The perturbation start time is then given by t_p = t_s τ, where t_s is drawn uniformly from [0, 1].
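As an illustration, a short sketch of how the perturbation parameters of Sections 4.1 and 4.2 could be sampled in MATLAB; demo (a 2 x M demonstrated trajectory) and tau (its mean duration) are assumed given, and the variable names are our own:

```matlab
% Hypothetical sketch of the perturbation-parameter sampling.
l  = max(max(demo, [], 2) - min(demo, [], 2));  % length span along x and y
a  = 0.1*l + 0.05*l*randn;     % amplitude a ~ N(0.1 l, 0.05 l), Eq. (2)
v  = rand(2, 1) - 0.5;         % direction components uniform in [-0.5, 0.5]
vp = a * v / norm(v);          % perturbation vector v_p = a v / ||v||
tp = rand * tau;               % onset t_p = t_s * tau with t_s ~ U[0, 1]
```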
4.3 Continuous Push of the end-effector

In this scenario (see Figure 3(c)), the motion generation is continuously perturbed during a specific time interval. This simulates, for instance, a teaching scenario in which a human tutor corrects the movement for a certain period of time. As previously, perturbations appear at varying times and with a range of directions and amplitudes. The target point remains fixed. Timing and direction are chosen as described in Section 4.2.

Amplitude: The samples for the amplitude are drawn from a normal probability distribution (see Equation (2)) with µ = 0.5 v̄ and σ = 0.25 v̄. The parameter v̄ corresponds to the mean speed of all demonstrations across all motions in the library.

Duration: The duration of the perturbation τ_d (in seconds) is given by τ_d = t_d τ, where t_d is drawn from a uniform distribution over [0.1, 0.3]; τ is again the mean motion duration of the set of demonstrations for each motion shape.

Figure 3: Four different benchmark scenarios used in the benchmark software framework to test generalization capabilities: (a) change of the start position, (b) discrete push, (c) continuous push, (d) change of the target position. The blue trajectory (solid line) is the demonstrated motion, whereas the red trajectory (dashed line) is the perturbed motion generated in the simulation. The black arrows indicate the direction and amplitude of the applied perturbation.
4.4 Target displacement

This benchmark scenario quantifies the ability of the trajectory generators to track and reach moving targets (see Figure 3(d)). The target motion starts at different times and lasts for a specified duration; the amplitude and direction of the target motion can also be varied. The parameters are chosen as in the continuous-push scenario, using the same probability distributions, but are now applied to the target point.
4.5 Possible extensions

In robotics, motion generation needs to be robust against various perturbations to allow safe robot interaction. Four different types of perturbations are implemented in the current scenarios, but more sophisticated scenarios, such as obstacle avoidance, are conceivable; for instance, the ability of a motion generation approach to avoid an obstacle as smoothly as possible without colliding with it could be quantified. However, such scenarios would require an additional mechanism to respond to the obstacle, which is beyond the scope of the current benchmark. A possible extension beyond spatial perturbations is to perturb the inner clock of the benchmark, since some motion generation methods use an explicit representation of time.
It is, of course, difficult to guarantee that the system clock is being used rather than an internal clock inside the model. A possible implementation of such a perturbation is to add noise to the ∆t variable to simulate signal delays from the sensor array, which again effectively results in a spatial displacement of the predicted position. Further investigations are required to provide systematic means for performance evaluation in the above scenarios, and this possibility is thus left for future extensions.
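As a rough sketch of this idea, assuming the hypothetical module interface from Section 2 and a nominal step of 0.01 s (our own choice):

```matlab
% Hypothetical sketch of a temporal perturbation: jitter the
% integration step of Equation (1) to emulate signal delays.
dt = 0.01 * (1 + 0.1 * randn);             % noisy time step
dt = max(dt, 0);                           % keep the step non-negative
x  = x + dt * module.velocity(x, target);  % effectively a spatial shift
```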
5 Evaluation and performance measures

In the following, we describe the evaluation of the perturbed trajectories. All trajectories are scaled in time by a parameter τ (dτ ∝ dt) in order to have a standardized duration. This allows inspection of the motion kinematics independently of the total movement duration. The geometric and kinematic accuracies of the reproductions are inspected separately and in combination. To retain only the shape-relevant geometric information, called the path, all trajectories are parameterized according to their Euclidean arc length, i.e. the result is a resampled motion trajectory with constant speed that provides only the path information of the motion.
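A minimal sketch of such an arc-length reparameterization in MATLAB, assuming a trajectory X of size 2 x M:

```matlab
% Resample a trajectory X (2 x M) at points equally spaced along its path.
s = [0, cumsum(vecnorm(diff(X, 1, 2)))];   % cumulative Euclidean arc length
s = s / s(end);                            % normalize arc length to [0, 1]
[sU, iU] = unique(s);                      % guard against repeated points
path = interp1(sU, X(:, iU)', linspace(0, 1, size(X, 2)))';
```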
5.1 Measures on the Geometric level

The measures in this section evaluate the geometric features of the reproductions according to the path information, i.e. the precision of the reproductions.
5.1.1 Path Accuracy

The agreement between the reproduced path and the demonstrated path is measured by the point-wise root mean squared error (RMSE):

$$\mathrm{RMSE}_{pos} = \sqrt{\frac{1}{M}\sum_i \left\| x_{demo}(i) - x_{repro}(i) \right\|^2}, \qquad (3)$$

where ‖·‖ specifies the L2-norm. The paths x_demo and x_repro are the discretized paths of the demonstration and reproduction, respectively, with M being the number of samples following normalization.
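A one-line sketch of Equation (3), assuming Xd and Xr are the arc-length-resampled demonstration and reproduction paths (2 x M); the velocity RMSE of Equation (4) below has the same form:

```matlab
% Point-wise RMSE between resampled paths/profiles (2 x M each).
rmsePos = sqrt(mean(vecnorm(Xd - Xr).^2));   % path accuracy, Eq. (3)
rmseVel = sqrt(mean(vecnorm(Vd - Vr).^2));   % velocity accuracy, Eq. (4) below
```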
5.1.2 Target Position Error

The precision in reaching the target point is given by the distance from the last recorded position to the target. For robotic tasks such as pick-and-place of objects, it is interesting to evaluate how precisely the end point is attained by the movement-primitive modules.

5.2 Measures on the Kinematic level

The following measures are specialized to evaluate performance over velocity or speed profiles.

5.2.1 Velocity Accuracy

The reproduced and demonstrated velocity vector profiles are compared using the RMSE:

$$\mathrm{RMSE}_{vel} = \sqrt{\frac{1}{M}\sum_i \left\| v_{demo}(i) - v_{repro}(i) \right\|^2}, \qquad (4)$$

where v_demo and v_repro are the first derivatives of the trajectories x_demo and x_repro, and M is the number of samples following normalization.

5.2.2 Speed Accuracy

Speed profiles of the demonstrated and reproduced trajectories with standardized duration are compared using the R² measure,

$$R^2_{speed-accuracy} = 1 - \frac{\sum_i \left( \|v_{demo}(i)\| - \|v_{repro}(i)\| \right)^2}{\sum_i \left( \|v_{demo}(i)\| - \|\bar v_{demo}\| \right)^2}, \qquad (5)$$

where ‖v̄_demo‖ is the mean value of the demonstrated speed profile (e.g. [36–38]). The R² measure compares the predictions of the trained movement-generation model against those of a simple constant-speed model: a value of R² = 1 indicates perfect agreement, whereas R² ≤ 0 means that the constant-speed model explains the variance of the demonstrations better than the reproductions do. In the evaluation process the R² measure is used not only for the speed profiles, but also for scoring the reproduced path of the motions.

5.2.3 Target Velocity Error

This is a task-relevant measure because, in reaching tasks, stopping at the target is necessary for safe human-robot interaction. We therefore check whether the last velocity generated by the model is close to zero, taking the L2-norm of the final velocity vector.

5.2.4 Movement Duration

The movement durations of the reproduction and the corresponding demonstration are compared. Let us denote the movement durations of the demonstration and its reproduction by t_f^d and t_f^r, respectively. The movement duration error is then computed according to:

$$\varepsilon_{movement-duration}(t_f^d, t_f^r) = \left| 1.0 - \frac{t_f^r}{t_f^d} \right|. \qquad (6)$$

This measure is difficult to interpret if perturbations occur during the motion, but together with other measures it delivers interesting information about the motion generation. For example, if a module needs a long time to reach the target due to strict accuracy constraints, then the motion generation will take significantly longer than the demonstration, which shows up as a large duration error.

5.3 Measures using Geometric and Kinematic levels

In the motor-control community, laws of motion are observed and extensively discussed, but it is not clear what drives the human motor system per se. We do not claim, or intend to claim, to know which specific laws of motion are used. However, several of these regularities have been shown to agree with optimal performance for human drawing-like motions.
Ample research has been devoted to the tendency of humans to produce stereotypical motor behavior [22, 39–46]. One key observation is that point-to-point motions tend to be straight and their speed profiles bell-shaped, regardless of the direction and end-point locations of the generated trajectories. A theoretical account for this invariance is suggested by the minimum jerk model [22, 23]. The predictions of this model were originally tested for tasks involving via points [23] and also curved movements (see the constrained minimum jerk model [24]). Another common observation is that hand movements tend to slow down when the trajectory becomes curved. This tendency is quantified by the two-thirds power law [20], which predicts the hand's speed to be proportional to the curvature of the path raised to the power of minus one third (see Equation (9)). Numerous studies have analyzed the persistence of this rule, mainly in drawing motions [21] but also in other modalities such as smooth-pursuit eye movements [47]. In this evaluation we have included measures that compare the combined geometric and kinematic information of the reproduction against the demonstration.
5.3.1 Trajectory Accuracy

The trajectories x_repro and x_demo are compared using the following R² measure:

$$R^2_{trajectory-accuracy} = 1 - \frac{\sum_i \left\| x_{demo}(i) - x_{repro}(i) \right\|^2}{\sum_i \left\| x_{demo}(i) - \bar x_{demo} \right\|^2}. \qquad (7)$$

Perturbed trajectories cannot deliver a perfect match; however, it is interesting to evaluate how closely the results of different methods can follow the demonstrated trajectory. Note that 'path accuracy' is measured by the RMSE, which directly gives the distance between the two paths. In the case of 'trajectory accuracy', both shape information and velocity profile are evaluated at the same time. Therefore, we decided to use the R² measure to express the similarity between the test and demonstration trajectories.
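Equations (5) and (7) share the same structure; a minimal sketch, assuming Xd, Xr, Vd, Vr are the normalized (2 x M) position and velocity profiles:

```matlab
% Shared form of the R^2 measures, Equations (5) and (7):
% 1 minus the residual variance relative to a constant baseline.
r2 = @(err2, base2) 1 - sum(err2) / sum(base2);

% Speed accuracy, Eq. (5): baseline is the mean (constant) speed.
sd = vecnorm(Vd);  sr = vecnorm(Vr);
r2speed = r2((sd - sr).^2, (sd - mean(sd)).^2);

% Trajectory accuracy, Eq. (7): baseline is the mean demonstrated position.
r2traj = r2(vecnorm(Xd - Xr).^2, vecnorm(Xd - mean(Xd, 2)).^2);
```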
5.3.2 Minimum jerk trajectories

The minimum-jerk model predicts that human trajectories minimize the following functional,

$$I(r) = \int_0^T \left| \frac{d^3 r}{dt^3} \right|^2 dt, \qquad (8)$$

where r(t) is any end-effector trajectory. We therefore use the root of the mean squared derivative (RMSD) of a trajectory r(t),

$$\mathrm{RMSD}(r) = \sqrt{ \frac{1}{T} \int_0^T \left| \frac{d^3 r}{dt^3} \right|^2 dt }.$$

Use of this measure on the perturbed trajectory is not ideal, because the perturbation applies additional jerk to the movement. When evaluating the test trajectory, the perturbed part is therefore ignored.

5.3.3 The 2/3 power law

The 2/3 power law predicts the speed profile of a trajectory, ds/dt, based on its curvature κ(t),

$$\frac{ds}{dt} = \alpha \, \kappa(t)^{\beta}, \qquad (9)$$

where α and β are segment-wise constants. The parameter β is usually close to −1/3 for drawing motions and α is a velocity gain factor (see [48] for the use of this law in complex 3D tasks). The curvature-speed power law defined by (9) is a prominent feature of human drawing movements, although there is little consensus about how to infer movement segments from these laws [36]. The question of human segmented control is not addressed in this benchmark; the procedure described in this section therefore focuses solely on average compliance with the law rather than on the temporal features of this compliance. Since compliance with such power laws is determined over motion segments, it is necessary to evaluate compliance of the reproduced trajectories with this law for different trajectory segments with varying segment durations. To avoid explicit segmentation, sliding windows are used for each demonstration by regressing the R² measure for the following linear relation between the logarithms of speed and curvature in each individual segment [21]:

$$\log \frac{ds}{dt} = \log \alpha + \beta \log \kappa(t). \qquad (10)$$
The parameters α and β and the fitness R² (i.e. the square of the Pearson correlation coefficient) are extracted from this linear regression for each segment. The fitness is used for further analysis. To carry out the linear regression (10), all trajectories are low-pass filtered, and the beginning and end of each trajectory are then trimmed based on the first and last samples reaching 1/2 of the median speed.
Table 1: Overview of the different categories of measures used in the benchmark for performance evaluation of each participant. The optimality threshold is given for each measure in the context of the described data set (see Section 3).

| Category | Property | Measure | Threshold | Scope | Label used in Figure 4 |
|---|---|---|---|---|---|
| Geometric | path | RMSE | − | global | trajectory-position-error |
| Geometric | target position error | L2-norm | 1 mm | local | target-position-error |
| Kinematic | velocity profile | RMSE | − | global | trajectory-velocity-error |
| Kinematic | speed profile | R2 | 80% | global | R2-speed |
| Kinematic | target velocity error | L2-norm | − | local | target-velocity-error |
| Kinematic | movement duration | see Equation (6) | 0.1 | global | normalizedFinalTime |
| Kinematic & Geometric | trajectory | R2 | 95% | global | R2 |
| Kinematic & Geometric | power law | ∆R2 | >0 | global | PL-R2 |
| Kinematic & Geometric | minimum jerk | RMSD | − | global | mean-jerk |
| Software model | processing time | ms | − | global | MeanComputationTime |
In addition, the segments of the trajectory where the curvature is below its 30th percentile are discarded. These procedures are necessary since curvature-speed power laws are satisfied neither at the beginning or end of a movement, where the speed is low, nor over low-curvature portions where, according to (10), the speed becomes singular.

The procedure examines segment durations W = {0.2, 0.3, ..., 2}, measured in seconds. For each demonstration demo_i, the R² scores S_n = {s_1, s_2, ..., s_k} are averaged across all k segments in S_n; these segments are extracted by one sliding window with duration W_n ∈ W. Hence, the score of the demonstration is a function of the segment duration:

$$S_{demo_i}(W_n) = E[S_n], \qquad (11)$$

where the expected value E is taken over all possible segments s_m ∈ S_n corresponding to the sliding-window duration W_n. A segment duration W_n is marked and discarded from further analysis if S_{demo_i}(W_n) does not comply with the power law (S_{demo_i}(W_n) < 0.35). The remaining scores S_{demo_i}(W̄_n) are then averaged across demonstrations. The resulting score is a function of these segment durations W̄_n:

$$S(\bar W_n) = E_i\!\left[ S_{demo_i}(\bar W_n) \right]. \qquad (12)$$

Therefore, only segment durations for which the respective segments comply, on average, with the power law are used to evaluate the reproductions. The set of scores {S(W̄_n)} and the respective set of segment durations {W̄_n} are referred to as the "ground-truth" compliance of the task's demonstrations with the power law.

The same procedure is carried out to obtain the score for each reproduction, S_{repro_i}(W̄_n), where only the "ground-truth" segment durations are used for comparison against the demonstrations. For each W̄_n the score is given by:

$$S_{repro_i}(\bar W_n) = E[S_n], \qquad (13)$$

where the expected value is taken over all possible segments of duration W̄_n.

Power-law compliance in the reproductions is compared against demonstration power-law compliance for each segment duration, as described below. The deviation of each reproduction repro_i from the ground-truth score is calculated for each segment duration and then averaged over all durations. For each reproduction, we use a relative score which determines the change in the power-law fitness from the demonstration's fitness:

$$F(repro_i) = E_n\!\left[ \frac{S_{repro_i}(\bar W_n) - S(\bar W_n)}{S(\bar W_n)} \right]. \qquad (14)$$

We use the threshold ∆R² = F(repro_i) > 0, which indicates that the power-law compliance of the reproduction is at least as good as the averaged power-law compliance of the demonstrations.
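A rough sketch of the per-segment fit of Equation (10), assuming sampled speed and curvature profiles; for brevity it uses non-overlapping windows and omits the low-pass filtering and trimming steps described above:

```matlab
function scores = powerLawScores(speed, curv, dt, winDur)
    % Fit log(ds/dt) = log(alpha) + beta*log(kappa) per window, Eq. (10),
    % and return the fitness R^2 (squared Pearson correlation) per window.
    w = round(winDur / dt);              % window length in samples
    nWin = floor(numel(speed) / w);
    scores = zeros(1, nWin);
    for k = 1:nWin
        idx = (k-1)*w + (1:w);
        r = corrcoef(log(curv(idx)), log(speed(idx)));
        scores(k) = r(1, 2)^2;           % R^2 of the linear regression
    end
end
```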
5.4 Implemented model properties

As well as the performance of the motion generation module, we are also interested in resource management, which can give some insight into the complexity of motion generation methods. The processing time provides an estimate of the computational complexity of the motion generator algorithm. It corresponds to the amount of time (in milliseconds) the algorithm needs to provide the next desired state, based on the current state of the motion.
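A minimal sketch of this measure, again assuming the hypothetical module interface:

```matlab
% Wall-clock time of one velocity query, reported in milliseconds.
tic;
v = module.velocity(x, target);
stepMs = toc * 1000;
```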
Figure 4: General overview of the results from the four benchmark scenarios: (a) discrete push, (b) generalization, (c) changing target, (d) continuous push. The task consists of reproducing learned trajectories under the perturbations specified in the four benchmark scenarios. The names of the different approaches are given on the x-axes; the y-axes identify the different measure labels (see Table 1). Each square takes one of three colors comparing all motion generation approaches to each other: high performance is shown in white, common performance in gray, and low performance in black.
6 Competition

For each benchmark scenario, 100 parameter vectors are drawn from the given probability distributions, as described in the previous sections. Note that all scores whose optimal values may potentially be attained by a good reproduction adhere to an optimality threshold. For example, all end positions within 1 mm of the target position are classified as optimal. Doing this for, e.g., the mean squared jerk, however, would be artificial and meaningless, because an optimal attainable jerk threshold is unknown and only second-order time polynomials attain the optimal value of zero mean squared jerk [49]. Table 1 provides an overview of the defined measures and the optimality thresholds for the data set described in Section 3.
6.1 Reference results

In [50], an early version of this benchmark framework was introduced and discussed. The first users of this benchmark submitted their trained models, which gives subsequent users the opportunity to use the same trained models for further comparisons. The following modules are currently available. First, a Task-parameterized Gaussian Mixture Model (TpGMM) [12], which implements a virtual spring-damper system. Second, a probabilistic approach to movement primitives called Probabilistic Movement Primitives (ProMP) [11]; in this module, a motion is represented as a distribution over trajectories, which is used as a stochastic feedback controller that can reproduce similar trajectories when given the corresponding distribution. Both approaches use an internal second-order dynamical-system representation, whereas the following approaches are first-order dynamical systems. The third approach is called Control Lyapunov Function-based Dynamic Movements (CLF-DM) [30]. This approach builds an estimate of an energy function generalized from user demonstrations, which is then used during runtime to ensure global asymptotic stability of nonlinear dynamical systems at the target. A neural-network implementation called Neural imprinted Vector Field (NiVF) [14, 15] is also evaluated; the network learning uses stability constraints from Lyapunov theory to implement a vector field, which is then used to encode a stable dynamical system. All of these models were implemented and provided by their corresponding authors. Finally, the dynamic movement primitives approach (DMP) is evaluated. The DMP model was implemented by the authors of this paper and follows the original design of the DMP algorithm described in [19, 51]. All of these approaches are applied to the default data set introduced in Section 3. The trained modules are available for download and can be used in comparisons.
6.2 Standardized Scores and Ranking

In this benchmark, normalized scores are chosen that can be used for comparisons between methods in the benchmark, or across all motion patterns for each method. To reduce the sensitivity of the scoring system to noise and digitization, standardized scores are used. One way to achieve this standardization is to equip all measures with an optimality threshold whose optimal values may potentially be attained by a good reproduction. In addition, a non-parametric Kruskal-Wallis test, which determines whether samples originate from the same distribution, is used as a statistical tool to evaluate the performance. The parametric equivalent of the Kruskal-Wallis test, the one-way analysis of variance (ANOVA), is also used.

With the help of the integrated analysis tool in the benchmark software, each participant is able to generate plots similar to those shown in Figure 4 and Figure 5. Figure 4 provides a statistical overview of performance, given the benchmark scenarios, the participating methods and the different measures with respect to the optimality thresholds. The color of each square is determined by the mean performance over all tested parameters, where each result corresponding to one test parameter vector is ranked as high performance (white), common performance (gray) or low performance (black). Note that white does not necessarily mean an optimal result; it shows the statistical tendency arising from the comparison of the models. Starting from this overview, each user can generate more detailed plots. In Figure 5 we give exemplary results: the application of the DMP model to each benchmark scenario (Figure 5(a)), the inter-motion-pattern performance over all benchmark scenarios per model (Figure 5(b)), and the inter-model performance for one measure over all benchmark scenarios (Figure 5(c)).

The benchmark system allows participants to systematically evaluate their own modules with respect to the different benchmark scenarios. However, the ranking of these results is left to each participant, because the unique design of each module may target specific features in motion generation and may therefore change the focus of the ranking.
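Both tests are available in MATLAB's Statistics and Machine Learning Toolbox; a minimal sketch, assuming scores is a vector of per-trial results for one measure and models is a matching grouping variable (both names are our own):

```matlab
% Compare one measure across models with a non-parametric
% Kruskal-Wallis test and its parametric counterpart (one-way ANOVA).
p_kw    = kruskalwallis(scores, models, 'off');
p_anova = anova1(scores, models, 'off');
```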
Figure 5: Exemplary plots illustrating different comparisons: (a) the performance of the DMP1 model for each benchmark scenario, considering the performance over all shapes in the data set; (b) the inter-shape performance, displayed for each shape in the generalization benchmark scenario; (c) results of all trained modules over all shapes in the generalization benchmark scenario. The y-axes show standardized scores of the trajectory-position-error; such plots can be generated for each measure described in Section 5.

References

[1] B. Adams, C. L. Breazeal, R. Brooks, and B. Scassellati. Humanoid robots: a new kind of tool. IEEE Intell. Syst. Control lett., 15(4):25–31, 2000.
[2] A. Albu-Schäffer, S. Haddadin, C. Ott, A. Stemmer, T. Wimböck, and G. Hirzinger. The DLR lightweight robot: design and control concepts for robots in human environments. Industrial Robot: An International Journal, 34(5):376–385, 2007.
[3] E. Oztop, D. Franklin, T. Chaminade, and G. Cheng. Human–humanoid interaction: Is a humanoid robot perceived as a human? Humanoid Robotics, 2(04):537–559, 2005.
[4] T. Chaminade, D. Franklin, E. Oztop, and G. Cheng. Motor interference between humans and humanoid robots: Effect of biological and artificial motion. In Proc. of Int. Conf. on Development and Learning, pages 96–101. IEEE, 2005.
[5] A. Ude, C. G. Atkeson, and M. Riley. Programming full-body movements for humanoid robots by observation. Robotics and Autonomous Systems, 47(2-3):93–108, 2004.
[6] A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Robot Programming by Demonstration, chapter 59, pages 1371–1394. Springer, 2008.
[7] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[8] D. Bailey and J. Barton. The NAS kernel benchmark program. National Aeronautics and Space Administration, Ames Research Center, 1985.
[9] K. Gaj, E. Homsirikamol, and M. Rogawski. Fair and comprehensive methodology for comparing hardware performance of fourteen round two SHA-3 candidates using FPGAs. In Cryptographic Hardware and Embedded Systems (CHES), pages 264–278. Springer, 2010.
[10] E. Dolan and J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201–213, 2002.
[11] A. Paraschos, G. Neumann, and J. Peters. A probabilistic approach to robot trajectory generation. In Proc. of Int. Conf. on Humanoid Robots (Humanoids), pages 477–483, 2013.
[12] S. Calinon, T. Alizadeh, and D. G. Caldwell. On improving the extrapolation capability of task-parameterized movement models. In Proc. of Int. Conf. on Intelligent Robots and Systems (IROS), pages 610–616. IEEE, 2013.
[13] S. M. Khansari-Zadeh and A. Billard. Learning stable nonlinear dynamical systems with Gaussian mixture models. IEEE Transactions on Robotics, 27(5):943–957, 2011.
[14] A. Lemme, K. Neumann, F. R. Reinhart, and J. J. Steil. Neural learning of vector fields for encoding stable dynamical systems. Neurocomputing, 141:3–14, 2014.
[15] K. Neumann, A. Lemme, and J. J. Steil. Neural learning of stable dynamical systems based on data-driven Lyapunov candidates. In Proc. of Int. Conf. on Intelligent Robots and Systems (IROS), pages 1216–1222. IEEE, 2013.
[16] A. Ude, A. Gams, T. Asfour, and J. Morimoto. Task-specific generalization of discrete and periodic dynamic movement primitives. IEEE Transactions on Robotics, 26(5):800–815, 2010.
[17] H. Hoffmann, P. Pastor, D.-H. Park, and S. Schaal. Biologically-inspired dynamical systems for movement generation: Automatic real-time goal adaptation and obstacle avoidance. In Proc. of Int. Conf. on Robotics and Automation (ICRA), pages 2587–2592, 2009.
[18] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal. Learning and generalization of motor skills by learning from demonstration. In Proc. of Int. Conf. on Robotics and Automation (ICRA), pages 763–768. IEEE, 2009.
[19] S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert. Learning movement primitives. In Robotics Research, pages 561–572. Springer, 2005.
[20] F. Lacquaniti, C. Terzuolo, and P. Viviani. The law relating the kinematic and figural aspects of drawing movements. Acta Psychologica, 54:115–130, 1983.
[21] P. Viviani and M. Cenzato. Segmentation and coupling in complex movements. Journal of Experimental Psychology: Human Perception and Performance, 11(6):828–845, 1985.
[22] N. Hogan. An organizing principle for a class of voluntary movements. Journal of Neuroscience, 4(11):2745, 1984.
[23] T. Flash and N. Hogan. The coordination of arm movements: an experimentally confirmed mathematical model. Journal of Neuroscience, 5(7):1688–1703, 1985.
[24] E. Todorov and M. I. Jordan. Smoothness maximization along a predefined path accurately predicts the speed profiles of complex arm movements. Journal of Neurophysiology, 80:696–714, 1998.
[25] T. Flash, Y. Meirovitch, and A. Barliya. Models of human movement: trajectory planning and inverse kinematics studies. Robotics and Autonomous Systems, 61(4):330–339, 2013.
[26] S. M. Khansari-Zadeh. Benchmark data. http://www.amarsi-project.eu/open-source, 2012. [Online; accessed 17-October-2014].
[27] S.-M. Khansari-Zadeh and A. Billard. BM: An iterative algorithm to learn stable non-linear dynamical systems with Gaussian mixture models. In Proc. of Int. Conf. on Robotics and Automation (ICRA), pages 2381–2388, 2010.
[28] S. M. Khansari-Zadeh and A. Billard. Imitation learning of globally stable non-linear point-to-point robot motions using nonlinear programming. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 2676–2683, 2010.
[29] S. M. Khansari-Zadeh. A Dynamical System-based Approach to Modeling Stable Robot Control Policies via Imitation Learning. PhD thesis, EPFL, 2012.
[30] S. M. Khansari-Zadeh and A. Billard. Learning control Lyapunov function to ensure stability of dynamical system-based robot reaching motions. Robotics and Autonomous Systems, 62(6):752–765, 2014.
[31] J. Gómez, D. Alvarez, S. Garrido, and L. Moreno. Kinesthetic teaching via fast marching square. In Proc. of Int. Conf. on Intelligent Robots and Systems (IROS), pages 1305–1310. IEEE, 2012.
[32] R. Shadmehr and F. Mussa-Ivaldi. Adaptive representation of dynamics during learning of a motor task. Journal of Neuroscience, 14(5):3208–3224, 1994.
[33] F. Gandolfo, F. Mussa-Ivaldi, and E. Bizzi. Motor learning by field approximation. Proceedings of the National Academy of Sciences, 93(9):3843–3846, 1996.
[34] M. Conditt, F. Gandolfo, and F. Mussa-Ivaldi. The motor system does not learn the dynamics of the arm by rote memorization of past experience. Journal of Neurophysiology, 78(1):554–560, 1997.
[35] A. Karniel and F. Mussa-Ivaldi. Does the motor control system use multiple models and context switching to cope with a variable environment? Experimental Brain Research, 143(4):520–524, 2002.
[36] D. Sternad and S. Schaal. Segmentation of endpoint trajectories does not imply segmented control. Experimental Brain Research, 124(1):118–136, 1999.
[37] F. Pollick, U. Maoz, A. Handzel, P. Giblin, G. Sapiro, and T. Flash. Three-dimensional arm movements at constant equi-affine speed. Cortex, 45(3):325–339, 2009.
[38] D. Bennequin, R. Fuchs, A. Berthoz, and T. Flash. Movement timing and invariance arise from several geometries. PLoS Computational Biology, 5(7):e1000426, 2009.
[39] K. Lashley. The problem of serial order in psychology. In Cerebral Mechanisms in Behavior. Wiley, New York, 1951.
[40] N. Bernstein. The Co-ordination and Regulation of Movements. Pergamon Press, Oxford, 1967.
[41] W. Abend, E. Bizzi, and P. Morasso. Human arm trajectory formation. Brain, 105:331–348, 1982.
[42] T. Flash. Organizing principles underlying the formation of hand trajectories. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1983.
[43] C. M. Harris and D. M. Wolpert. Signal-dependent noise determines motor planning. Nature, 394:780–784, 1998.
[44] F. A. Mussa-Ivaldi and E. Bizzi. Motor learning through the combination of primitives. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 355:1755–1759, 2000.
[45] E. Bizzi, M. C. Tresch, P. Saltiel, and A. d'Avella. New perspectives on spinal motor systems. Nature Reviews Neuroscience, 1(2):101–108, 2000.
[46] T. Flash and B. Hochner. Motor primitives in vertebrates and invertebrates. Current Opinion in Neurobiology, 15(6):660–666, 2005.
[47] P. Viviani and C. de'Sperati. The relationship between curvature and velocity in two-dimensional smooth pursuit eye movements. The Journal of Neuroscience, 17:3932–3945, 1997.
[48] D. Endres, Y. Meirovitch, T. Flash, and M. Giese. Segmenting sign language into motor primitives with Bayesian binning. Frontiers in computational neuroscience, 7, 2013. [49] F. Polyakov, E. Stark, R. Drori, M. Abeles, and T. Flash. Parabolic movement primitives and cortical states: merging optimality with geometric invariance. Biological Cybernetics, 100(2):159– 184, 2009. [50] M.S. Khansari, A. Lemme, Y. Meirovitch, B. Schrauwen, M. A. Giese, A.J. Ijspeert, A. Billard, and J.J. Steil. Workshop on benchmarking of state-of-the-art algorithms in generating human-like robot reaching motions. In Humanoids. IEEE, 2013. [51] A. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical movement primitives: learning attractor models for motor behaviors. Neural computation, 25(2):328–373, 2013.