Learning Action Primitives

Dana Kulić, Danica Kragic and Volker Krüger

Abstract

The use of action primitives plays an important role in modeling human and robot actions. Action primitives are not only motivated by neurobiological findings; they also allow efficient and effective action modeling from an information-theoretic viewpoint. Different approaches for modeling action primitives have been proposed. This chapter gives an overview of recent approaches for learning and modeling action primitives for human and robot action and describes common approaches such as stochastic methods and dynamical systems approaches. Active research questions in the field are introduced, including temporal segmentation, dimensionality reduction, and the integration of action primitives into complex behaviors.

1 Introduction

Research on human motion and other biological movement postulates that movement behavior is composed of action primitives: simple, atomic movements that can be combined and sequenced to form complex behavior [86]. Action primitives allow complex actions to be decomposed into an alphabet of primitives, and grammar-like descriptions to be constructed to govern their order of use. A similar strategy of using primitives can be found in linguistics, where utterances of words are broken down into phonemes. Action primitives enable a "symbolic" description of complex action instances as words over the alphabet of action primitives, which subsequently enables the use of techniques from AI and automata theory for parsing and planning.

Dana Kulić, University of Waterloo, e-mail: [email protected]
Danica Kragic, Centre for Autonomous Systems, Royal Institute of Technology - KTH, e-mail: [email protected]
Volker Krüger, CIT, Aalborg University, e-mail: [email protected]


The recognition and learning of these primitives has received a lot of attention in the research literature of the AI, vision and robotics communities. From the vision perspective, formulating the recognition problem in terms of action primitives rather than continuous movement significantly reduces the search space, facilitating movement recognition. From the robotics and AI perspective, using action primitives to generate movement improves computational efficiency, both in motion planning and control. Motion planning is simplified because it can be carried out over the space of action primitives, rather than over the much larger muscle/joint space. Similarly, the control problem is simplified because controllers can be designed locally (within a single action primitive), rather than having to operate over the entire movement space.

Most approaches used for learning and modeling of action primitives are of a generative type, even though discriminative approaches exist. One reason is that generative models in principle allow actions to be generated and recognized with the same model. A further reason is that generative models are able to identify previously unseen primitives if the available action primitive models are not able to explain the new observation sufficiently well. Discriminative models, on the other hand, have to choose among the known models, which means that they implicitly assume that the set of known primitives is exhaustive. In this chapter we give an overview of recent work on automated action primitive learning from observation of human movement.

1.1 Connections to Biological Models of Movement Primitives

Research on action primitives has been motivated by evidence from biology and neuroscience for the presence of action primitives at several different hierarchy levels in the animal and human brain. Studies of frogs and birds reveal evidence of muscle synergies: coordinated, synchronized movements of multiple muscle units, such as swim kicks and wing movements, which are generated by the firing of a single neuron in the spinal cord [69]. Recent research on primates and humans has uncovered the mirror neuron system, thought to encode a sensorimotor representation of action primitives [81, 80]. The mirror neuron system is believed to be a neural mechanism which links action observation and execution.

A key question in biology and cognitive science is the model of learning, i.e., how the mirror neurons acquire their mirror-like properties. Heyes and Ray [33, 32] propose the Associative Sequence Learning (ASL) theory of imitation. Learning proceeds via two sets of associative processes, forming sequence associations between sensory representations and associations between the sensory and motor representations. Meltzoff [65, 66] proposes the Active Intermodal Mapping theory, which postulates that infants possess an innate ability to map from perception to motor acts, and that initially captured motions get progressively refined through repeated practice. These theories imply that the development of the imitation mechanism is highly experience-dependent, consisting of correlation links between sensory and motor data which are formed over time.

However, some researchers [9] postulate that many of the animal behaviors classified as imitation can more accurately be explained by priming mechanisms, such as stimulus enhancement, emulation and response facilitation. They argue that the main imitation mechanism used by primates and humans is not at the action level, but rather at the program level, i.e., primates learn to imitate the efficient organization of the actions, while the individual actions comprising the complex behavior are learned by other means, such as trial and error. Similarly, Wohlschläger et al. [102] propose the goal directed imitation (GOADI) theory, which suggests that imitation is guided by cognitively specified goals. According to the theory, the imitator does not imitate the exact movements produced by the demonstrator, but instead extracts from the demonstrator a hierarchy of goals and subgoals present in the task. These goals and subgoals are then accomplished by the motor programs already in the imitator's repertoire which are most strongly associated with the achievement of the given goal.

These neurobiological findings have been used as models for implementing artificial action primitive recognition and learning systems, most commonly by modeling the mirror neuron system [38, 12] and hypothesizing that implementing algorithms which learn from observation and by imitation of action primitives will facilitate robot learning and provide an intuitive and human-like programming interface [8, 86].

2 Representations and Learning of Movement Primitives

A key question in learning action primitives is the mathematical representation of the primitive. Two broad representation classes are most commonly used: stochastic models and dynamical systems models. Other techniques for modeling movement primitives, such as B-splines [98], B-spline wavelets [97] or polynomial functions [72], have also been proposed, but are not covered here due to space constraints.

2.1 Stochastic Approaches

Stochastic models represent each action primitive as a stochastic dynamical system, where the evolution of the system state is governed by a stochastic process, and observation of the system is impeded by noise. Two types of stochastic models can be used: generative or discriminative. Generative models are the ones used most commonly, as they can be used both to recognize the action primitive and to generate a prototype, for example on a robot or in an animation, analogous to the idea of the mirror neurons (see Sect. 1.1). Discriminative models, on the other hand, can only be used to classify actions; a further shortcoming is their inability to detect unknown actions. The choice of model type is strongly influenced by the target application: most activity recognition systems, where only recognition capability is required, focus on discriminative models. In robotics applications, where reproduction of the demonstrated actions is required, generative models are typically used.

2.1.1 Generative Models

The most commonly used generative stochastic models are Hidden Markov Models (HMMs) [78] and Gaussian Mixture Models (GMMs) [79]. Hidden Markov models have been a popular technique for human motion modeling, and have been used in a variety of applications, including skill transfer [103, 19], robot assisted surgery [51, 23], sign language and gesture modeling [90, 4, 35] and motion prediction [34, 3]. Developing from the earlier work on Teaching by Showing [58], a common paradigm in robotics research is Programming by Demonstration (PbD) [45, 19, 18, 4, 5].

An HMM models the trajectory dynamics of the motion by the evolution of a hidden variable through a finite set of states. The hidden state variable evolves according to a stochastic state transition model A[N, N], where N is the number of states in the model. The probability of a state being the initial state is given by the initial state distribution π[N]. The hidden state cannot be observed directly; only a vector of output variables is observable, from which the hidden state needs to be inferred. The output variables and each hidden state are related through the output distribution model B[N, K], where K is the dimension of the output vector. For continuous output variables, the output distribution model typically used is a Gaussian or a mixture of Gaussians. When modeling action primitives, the output variables typically describe joint angles, Cartesian coordinates of key body points, or object positions.

An HMM is used to represent each action primitive. Training the model consists of finding the model parameters A, B, π that maximize the likelihood that the trained model generated the exemplar data. The Baum-Welch algorithm, a type of expectation-maximization, is typically used to perform the training [78].
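The likelihood that an HMM with parameters A, B, π generated an observation sequence can be evaluated with the forward algorithm, and comparing this likelihood across primitive models gives a simple recognizer. The following is a minimal sketch, not the implementation of any cited work. It assumes diagonal Gaussian outputs, strictly positive entries in A and π, and two hypothetical one-dimensional primitives named `reach` and `retract`; all parameter values are illustrative.

```python
import numpy as np

def gaussian_log_density(y, means, variances):
    # Log density of observation y under each state's diagonal Gaussian; shape (N,).
    return -0.5 * np.sum(np.log(2.0 * np.pi * variances)
                         + (y - means) ** 2 / variances, axis=1)

def hmm_log_likelihood(Y, pi, A, means, variances):
    # Forward algorithm in the log domain for an observation sequence Y of shape (T, K).
    log_alpha = np.log(pi) + gaussian_log_density(Y[0], means, variances)
    for y in Y[1:]:
        shift = log_alpha.max()                      # log-sum-exp stabilisation
        predicted = shift + np.log(np.exp(log_alpha - shift) @ A)
        log_alpha = predicted + gaussian_log_density(y, means, variances)
    shift = log_alpha.max()
    return shift + np.log(np.sum(np.exp(log_alpha - shift)))

# Hypothetical two-state models sharing A, pi and output variances.
pi = np.array([0.95, 0.05])
A = np.array([[0.90, 0.10],
              [0.05, 0.95]])
variances = np.full((2, 1), 0.05)
models = {
    "reach": np.array([[0.0], [1.0]]),    # state means move from 0 to 1
    "retract": np.array([[1.0], [0.0]]),  # state means move from 1 to 0
}

# A noisy observation that starts near 0 and ends near 1.
rng = np.random.default_rng(0)
Y = np.concatenate([np.zeros(10), np.ones(10)])[:, None] + 0.05 * rng.normal(size=(20, 1))

scores = {name: hmm_log_likelihood(Y, pi, A, means, variances)
          for name, means in models.items()}
recognized = max(scores, key=scores.get)
print(recognized, scores)
```

Working in the log domain avoids numerical underflow for long sequences, which matters for motion data sampled at high frame rates.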
Once the models are trained, recognition of novel observations is implemented by computing the likelihood that each of the known models produced the novel observation, and selecting the highest likelihood model as the recognized model. Since the HMM represents a stochastic model of the action, there is no unique trajectory which can be used for action reproduction. A representative trajectory can be generated by first sampling from the state transition model to generate a state sequence, and then sampling from each state output distribution to generate an output sequence. However, this can result in a noisy trajectory which does not respect continuity constraints. To generate an appropriate trajectory, several techniques have been proposed. One approach is to repeat the sampling many times and average over the generated trajectories [38]. As the number of samples approaches infinity, the generated trajectory will approach the most likely trajectory. However, this procedure is time consuming, as many trajectories must be generated. A second approach is deterministic generation: first the most likely state sequence is selected, and then the mean from each output distribution in the case of a Gaussian output function, or the weighted sum of means in the case of the mixture of Gaussians model [56]. This approach is fast, but discards the information about allowable trajectory variations from the mean. Both of these generation approaches typically require a post-processing smoothing step to ensure a smooth trajectory. A final approach is to include both position and velocity information in the output variables, and use the velocity information to aid with smoothing [91, 92].

A second type of stochastic model which can be used to encode action primitives is the Gaussian Mixture Model (GMM) [79]. In this approach, the time series data of an action primitive is modeled as a mixture of multivariate Gaussian distributions. The model parameters are learned using the Expectation-Maximization algorithm. Unlike the HMM, where the temporal evolution of the trajectory is encoded through the evolution of the hidden state, the GMM does not contain any explicit timing information. The GMM encodes the spatial variance, but not the timing of the advancement along the trajectory. To encode timing information, two approaches are possible: adding time as an additional variable in the observation vector [14], or encoding both the position and velocity in the observation vector, rather than modeling time explicitly [28]. With the first approach, simply adding time to the observation vector is typically not sufficient, as it does not account for temporal variability across demonstrations. To address this issue, a preprocessing step is added to temporally align the exemplar data sets using dynamic time warping. With the second approach, no preprocessing is required, but the size of the output vector is doubled to include both the position and velocity of each variable of interest.
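The temporal alignment step mentioned above can be illustrated with a basic dynamic time warping routine. This is a minimal sketch of the textbook algorithm with a Euclidean local cost and unconstrained step pattern; the function name `dtw_path` and the two toy demonstrations are assumptions made for illustration.

```python
import numpy as np

def dtw_path(a, b):
    # Align sequences a (Ta, K) and b (Tb, K); returns the total cost and the warping path.
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    Ta, Tb = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # cost[i, j] is the cheapest alignment of the first i frames of a with the first j of b.
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j - 1],
                                                  cost[i - 1, j],
                                                  cost[i, j - 1])

    # Backtrack from the end to recover which frames were matched.
    i, j = Ta, Tb
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1, j - 1], i - 1, j - 1),
                      (cost[i - 1, j], i - 1, j),
                      (cost[i, j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    return cost[Ta, Tb], path[::-1]

# Two demonstrations of the same motion executed with different timing.
demo_a = [0.0, 0.0, 1.0, 2.0, 3.0]
demo_b = [0.0, 1.0, 1.0, 2.0, 3.0, 3.0]
total_cost, path = dtw_path(demo_a, demo_b)
print(total_cost, path)
```

After alignment, the matched frame pairs in `path` can be used to resample each demonstration onto a common time axis before time is added to the observation vector.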
Once the model structure is selected and the model is trained for each primitive, Gaussian Mixture regression can be used to generate a trajectory. The strengths of stochastic models in general include their ability to capture both the spatial and temporal variability in the movement, and in particular to capture the change in variance along the movement. For example, in a goal reaching movement, the human demonstrator will typically exhibit high variance at the start of the movement, as there is significant variability in the starting location, but very low variance at the goal location. HMMs and GMMs can explicitly capture this feature of the action primitive by encoding both the mean location and the variance around that location for key points along the trajectory. Another advantage of stochastic modeling is the ability to handle data noise and missing data. Stochastic generative models are also convenient for movement recognition and comparison. Due to the stochastic representation, both a likelihood and a distance measure comparing movements are easily computed, allowing movement differences and similarities to be analyzed [56, 53]. On the other hand, a weakness of basic HMMs and GMMs is the lack of parametrization. For example, if a reaching movement is learned for a specific reach target, it is not intuitive how to adapt the learned model to a new reach target. Parametric hidden Markov models (PHMMs) have been proposed to address this issue [101, 53]. The PHMM extends basic HMMs by adding a latent parameter modeling the systematic variation within a class.
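To make the trajectory generation step concrete, the sketch below applies Gaussian mixture regression to a GMM whose observation vector is [t, x], i.e., the time-augmented variant described earlier. It assumes the mixture parameters are already trained; the function name `gmr` and the two-component parameter values are hypothetical.

```python
import numpy as np

def gmr(t_query, priors, means, covs):
    # Condition a GMM over [t, x] on time t and return E[x | t] for each query time.
    # priors: (M,), means: (M, 1 + D), covs: (M, 1 + D, 1 + D); time is dimension 0.
    t_query = np.atleast_1d(np.asarray(t_query, dtype=float))
    n_components, dim = means.shape
    out = np.zeros((len(t_query), dim - 1))
    var_t = covs[:, 0, 0]
    for n, t in enumerate(t_query):
        # Responsibility of each component for this time instant.
        h = priors * np.exp(-0.5 * (t - means[:, 0]) ** 2 / var_t) / np.sqrt(2.0 * np.pi * var_t)
        h = h / h.sum()
        for m in range(n_components):
            # Conditional mean of x given t for a single Gaussian component.
            cond_mean = means[m, 1:] + covs[m, 1:, 0] / covs[m, 0, 0] * (t - means[m, 0])
            out[n] += h[m] * cond_mean
    return out

# Hypothetical trained model: x rises from about 0 to about 1 over t in [0, 1].
priors = np.array([0.5, 0.5])
means = np.array([[0.25, 0.0],
                  [0.75, 1.0]])
covs = np.array([[[0.02, 0.01], [0.01, 0.02]],
                 [[0.02, 0.01], [0.01, 0.02]]])

trajectory = gmr(np.linspace(0.0, 1.0, 11), priors, means, covs)
midpoint = gmr(0.5, priors, means, covs)[0, 0]
print(trajectory.ravel())
```

The same conditioning also yields a conditional covariance for each time step, which is how the change in variance along the movement can be recovered; it is omitted here to keep the sketch short.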


A related issue with stochastic generative models is the lack of generalization away from the trained trajectories, and the lack of an appropriate trajectory when the system is disturbed. Recent work has addressed this issue by adding a control command pushing the system towards the learned trajectory [13]. Another issue with stochastic models is the requirement for multiple examples of a motion, in order to determine valid statistics. This has traditionally meant that training data must be collected off-line, which limits the potential for on-line and continuous learning. Recent research [11, 56, 57] has proposed algorithms for incremental, on-line learning of stochastic models to overcome this limitation.

A final issue with stochastic models concerns the parameter selection problem, i.e., how to select the appropriate number of states (in an HMM), or the number of mixtures (in a GMM). In simple learning systems with few action primitives, it may be possible for the designer to specify the parameters manually. However, this technique is not appropriate when the potential number of motion primitives is large and not known ahead of time. Several techniques have been proposed, including iterative addition and merging of states [20], using Support Vector Machines as a pre-processing step to determine the appropriate number of states [23], the Bayesian information criterion [6], the Akaike information criterion [55], or the use of a variable structure HMM which can be incrementally adapted [56, 53].
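As an illustration of criterion-based parameter selection, the sketch below scores candidate HMM sizes with the Akaike and Bayesian information criteria. The formulas are the standard definitions; the log-likelihood values are made up for illustration, and the parameter count assumes diagonal Gaussian outputs.

```python
import math

def hmm_free_parameters(n_states, n_outputs):
    # Transition rows and the initial distribution each sum to one;
    # every state has a mean and a variance per output dimension.
    return n_states * (n_states - 1) + (n_states - 1) + 2 * n_states * n_outputs

def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    return n_params * math.log(n_samples) - 2 * log_likelihood

# Hypothetical training log-likelihoods for models with 2, 4 and 8 states.
candidate_log_likelihoods = {2: -540.0, 4: -455.0, 8: -440.0}
n_samples, n_outputs = 200, 3

best_by_aic = min(
    candidate_log_likelihoods,
    key=lambda n: aic(candidate_log_likelihoods[n], hmm_free_parameters(n, n_outputs)),
)
best_by_bic = min(
    candidate_log_likelihoods,
    key=lambda n: bic(candidate_log_likelihoods[n], hmm_free_parameters(n, n_outputs), n_samples),
)
print(best_by_aic, best_by_bic)
```

Both criteria trade goodness of fit against model size; the Bayesian criterion penalizes additional states more heavily once the number of samples is large.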

2.1.2 Discriminative Models

Discriminative models formulate the action primitive learning problem as a classification problem. With this formulation, the key issues are how to find the most discriminant features, and how to find the cluster boundaries via supervised learning. The questions of feature selection and classifier training are common topics in machine learning research, where many general algorithms have been proposed. A recent example is the work of Han et al. [31], who use a Hierarchical Gaussian Process Latent Variable Model to find a reduced dimensionality subspace over the motion features. A conditional random field model is then applied to model each motion pattern, and a support vector machine classifier is used to distinguish between the motions. In another exemplar implementation [87], features based on boundary and content characteristics are manually selected by the designers, and then a discriminative semi-Markov model is used to simultaneously distinguish between motions and identify segment points between motions.

Loesch et al. [62] evaluate several general machine learning methods for feature selection and classification of human activity, focusing on motions representing typical at-home activities. For the feature selection algorithms, the Evaluation of Information Gain from Attribute (EIGA) and the Correlation-based Feature Subset Selection (CbFSS) methods are compared. The EIGA method measures the utility of a feature based on the information gain with respect to the activity class. The CbFSS method evaluates the utility of a subset of features by considering the predictive ability of each feature relative to the degree of redundancy between the features in the subset. For the classification algorithms, Naive Bayes Network (NBN), Bayes Network (BN), Multilayer Perceptron (MLP), Radial Basis Function Network (RBF) and Support Vector Machine (SVM) are compared. The results show that CbFSS is usually slightly better than EIGA at selecting features, regardless of the classifier used. The classifier results show that NBN and RBF perform poorly, while BN, MLP and SVM perform well, with the performance of MLP and SVM increasing with the number of features used. Since multiple classification algorithms give similar performance, it can be concluded that the selected data features have intrinsically high information value, so that the classifier selection is not very important. Thus, feature selection plays a significant role in classification performance.

A strong assumption commonly made in the case of generative models such as the HMM or its variants is that the observations are conditionally independent, in order to ensure computational tractability. To accommodate long-range dependencies among observations, or multiple overlapping features of observations at multiple time steps, stochastic grammars and parsing would have to be used subsequently [22]. Conditional Random Fields (CRFs) [59] elegantly avoid the independence assumption of HMMs and are able to incorporate both overlapping features and long-term dependencies into the model. Although CRFs have been extensively used in the areas of computer vision and machine learning, there are almost no examples of their use in the robotics community. An example of their use in integrated object/action recognition is presented in [46], which describes a method for classifying manipulation actions in the context of the objects manipulated, and classifying objects in the context of the actions used to manipulate them.
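The information gain measure that underlies attribute ranking of this kind can be sketched as follows. This is the generic definition for discrete features, not a reproduction of the procedure evaluated in [62]; the activity labels and feature values are hypothetical.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a discrete label array.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels):
    # Reduction in class entropy obtained by splitting on a discrete feature.
    feature, labels = np.asarray(feature), np.asarray(labels)
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Hypothetical activity labels and two binary features.
activity = np.array(["walk", "walk", "sit", "sit"])
legs_moving = np.array([1, 1, 0, 0])      # perfectly predicts the activity
left_hand_raised = np.array([0, 1, 0, 1])  # unrelated to the activity

gain_informative = information_gain(legs_moving, activity)
gain_uninformative = information_gain(left_hand_raised, activity)
print(gain_informative, gain_uninformative)
```

Continuous motion features such as joint angles would first need to be discretized, or the entropy replaced by a density-based estimate, before this measure can be applied.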

2.2 Dynamical Systems Approaches

Dynamical systems approaches model an action primitive as an evolution of state variables in a dynamical system. Typically, a differential equation is used to describe the state evolution [36, 70, 84], but neural network based models have also been proposed [39, 71]. With differential equation based approaches, a distinction must be made between goal oriented motion (for example, reaching or tennis swinging) and cyclic motion, such as walking or swimming. For goal oriented motions, the evolution of the trajectory is modeled as a non-linear attractor, while for cyclic motions, a non-linear oscillator is used. The trajectory dynamics are typically described in state-space form, with additional phase dynamics to allow for a non-uniform rate of advancement along the trajectory [36]:

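$$\dot{\nu} = \alpha_\nu \left( \beta_\nu (g - x) - \nu \right), \qquad \dot{x} = \alpha_x \nu, \tag{1}$$

$$\dot{z} = \alpha_z \left( \beta_z (g - y) - z \right), \qquad \dot{y} = \alpha_y \left( f(x, \nu) + z \right). \tag{2}$$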

Here, $z$ is the state variable, $g$ is a known goal state, $\alpha_z$, $\beta_z$, $\alpha_\nu$, $\beta_\nu$ are time constants, and $f(x, \nu) = \sum b_i \varphi(x, \nu)$ is a non-linear function model. The state variable $z$ is a vector of the motion variables of interest, typically joint angles or body or object Cartesian positions. The pair $(x, \nu)$ is a linear second-order dynamical system representing the evolution of the temporal variable, allowing time evolution along the trajectory to be modified. By choosing $\alpha_\nu$, $\beta_\nu$ such that the system is critically damped, the system has guaranteed monotonic global convergence to the goal. An action primitive is learned by learning the non-linear function $f(x, \nu)$ which best approximates the demonstrated action. A common approach is to use a locally linear approximation for the non-linear function

$$f(x, \nu) = \frac{\sum_{i=1}^{k} w_i b_i \nu}{\sum_{i=1}^{k} w_i}, \qquad w_i = \exp\left( -\tfrac{1}{2} d_i (\bar{x} - c_i)^2 \right), \qquad \bar{x} = \frac{x - x_0}{g - x_0}, \tag{3}$$

where $b_i$ are the weights to be learned, and $d_i$ is the region of validity for each locally linear model. The weights can now be learned from demonstration data by a regression algorithm [85], such as locally weighted projection regression [99]. Dynamical model approaches have been used successfully to learn and reproduce fast and accurate motions [76], such as tennis swinging [84], ball hitting [75] and catching [47], and pick and place actions [74]. For cyclic motions, instead of using a point goal attractor, a limit cycle attractor is used [70]:

$$\dot{r} = \alpha_r (A - r), \qquad \dot{\varphi} = \omega. \tag{4}$$

Here, $r$ is the oscillator amplitude, $A$ is the desired amplitude, and $\varphi$ is the phase. The cyclic motion state equation is then:

$$\dot{z} = \alpha_z \left( \beta_z (g - y) - z \right), \qquad \dot{y} = \alpha_y \left( f_i(r, \varphi) + z \right), \tag{5}$$

where $g$ becomes the setpoint about which the oscillation takes place, and the non-linear function to be learned is $f(r, \varphi)$. To coordinate multiple degrees of freedom (DoF), a single oscillator, described by Equation (4), is coupled to a unique non-linear function associated with each DoF. This approach has been demonstrated for gait with humanoid robots [70], where the parameters of the dynamical system are learned from human motion data.

A strength of dynamical systems approaches is their ability to represent complex trajectory shapes, including complex phase relationships and movement reversals. Dynamical systems also offer good mechanisms for parametrization and generalization. For example, for a reaching motion, the location of the target can be easily modified by changing the value of the goal in the state equation. Recently, techniques for automatically learning the required goal location when a new environment is encountered have also been proposed [96]. The system behavior is defined for the entire state space, not only near the observed demonstration trajectory. This allows the system to respond appropriately when disturbances during execution push the system away from the trained trajectory. This can also be beneficial for robot systems needing to generate a trajectory for a novel instantiation of a task. However, the generation of trajectories far away from the demonstrated one can be questionable from the point of view of action recognition, as there is no guarantee that the motion generated by the equation away from the demonstrated trajectory will be similar to human motion. Another weakness of dynamical systems approaches is that data noise and missing data cannot be easily handled by this type of model. Recent work has also proposed combining dynamical systems approaches with stochastic methods, by combining an HMM trajectory with a learned attractor field [30], or by approximating the dynamic equation with a GMM [28].
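The goal-directed system of Equations (1)-(3) can be explored numerically. The sketch below integrates it with the explicit Euler method for a single degree of freedom, under several assumptions that are not taken from the chapter: $\alpha_x = \alpha_y = 1$, identical gains for both subsystems with $\beta = \alpha / 4$ (critical damping under these choices), $x_0$ equal to the initial position, and hand-picked rather than learned weights $b_i$. The function name `integrate_primitive` is hypothetical.

```python
import numpy as np

def integrate_primitive(y0, g, b, c, d, alpha=25.0, dt=0.001, duration=3.0):
    # Euler integration of the point-attractor primitive for one degree of freedom.
    beta = alpha / 4.0                 # critical damping for alpha_x = alpha_y = 1
    x0 = y0                            # assumption: phase variable starts at the start position
    x, nu, y, z = x0, 0.0, y0, 0.0
    trajectory = [y]
    for _ in range(int(round(duration / dt))):
        # Locally weighted forcing term, Equation (3).
        x_bar = (x - x0) / (g - x0)
        w = np.exp(-0.5 * d * (x_bar - c) ** 2)
        f = np.sum(w * b) * nu / np.sum(w)
        # Phase dynamics, Equation (1), and transformation dynamics, Equation (2).
        nu_dot = alpha * (beta * (g - x) - nu)
        x_dot = nu
        z_dot = alpha * (beta * (g - y) - z)
        y_dot = f + z
        x, nu = x + dt * x_dot, nu + dt * nu_dot
        y, z = y + dt * y_dot, z + dt * z_dot
        trajectory.append(y)
    return np.array(trajectory)

centers = np.linspace(0.0, 1.0, 5)            # c_i, spread over the normalized phase
widths = np.full(5, 50.0)                     # d_i
weights = np.array([0.0, 2.0, -1.0, 0.5, 0.0])  # b_i, illustrative values only

plain = integrate_primitive(0.0, 1.0, np.zeros(5), centers, widths)
shaped = integrate_primitive(0.0, 1.0, weights, centers, widths)
new_goal = integrate_primitive(0.0, 2.0, weights, centers, widths)
print(plain[-1], shaped[-1], new_goal[-1])
```

Because the forcing term is scaled by $\nu$, which decays to zero, the weights change the shape of the path but not its end point, and changing `g` retargets the same primitive to a new goal without retraining.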

2.3 Measurement Systems

For the purpose of generating training data, different sensory modalities have been used. A popular approach is kinesthetic training, where the teacher manually guides the robot through the desired motion [14, 39, 23, 6]. The work of [37] incorporates the use of a laser range finder to learn assembly tasks. Data gloves have been used for learning both assembly tasks [16] and human grasping strategies [19]. Magnetic sensors in different configurations have been used for generating both arm [10] and hand [24, 25, 4] trajectories. Finally, there are examples of different vision based approaches. A common approach is to use motion capture systems [38, 56, 53], which require special equipment and markers on the humans to be tracked. The most user-friendly approach is to develop techniques for generating the training data from humans based solely on observation from a video camera without any fiducial markers. These methods are mentioned in a previous chapter in this book and are thus not further discussed here. There have also been systems that perform one part of the training using various simulated environments [18, 1]. Recent work presented in [89] uses a simulator to generate training data for symbolic task learning. Finally, approaches that integrate several sensory modalities, such as visual and verbal cues, have been presented in [73].

3 Dimensionality Reduction

An important problem to solve when learning action primitives is the curse of dimensionality. The representation of full body motion such as walking, dancing or running requires a kinematic model with high DoF (typically 15-20), with higher order models required for better accuracy [54]. When the actions to be learned also include interaction with the environment, additional data may need to be included in the action primitive representation, including Cartesian data about the body location, and the location of pertinent objects in the environment, such as the object to be lifted or the location of obstacles. However, not all the data may be important for each action primitive (for example, leg data may not be important for reaching). In addition, while the observed data may be high dimensional, there may be a latent variable actuating the motion which is of lower dimensionality. For example, in a cyclic motion such as walking, the movement of the joints is synchronized and not independent, and could be described with a lower dimensionality latent variable.

Dimensionality reduction techniques can be classified either by the type of algorithm used to find the reduced dimensionality subspace, or by the processing stage at which the reduction takes place. One approach is to apply dimensionality reduction at the input stage, i.e., attempt to either remove spurious input variables, or find input variable features which reduce the dimensionality of the input vector [6, 41]. A second approach is to introduce dimensionality reduction during model learning for each action primitive. This technique is commonly used with locally weighted projection regression (LWPR) learning [99]. A final approach is to search for a low dimensional relationship describing the space of models [93]. The basic idea of the dimensionality reduction techniques is to model the relationship between some low dimensional representation $X = \{x_1, \ldots, x_N\}$, with $x_i \in \mathbb{R}^D$, and observations $Y$ through a mapping $f$:

$$y_i = f(x_i). \tag{6}$$

The methods for dimensionality reduction can be roughly divided into spectral and generative methods.

3.1 Spectral Dimensionality Reduction

The best-known spectral methods for linear dimensionality reduction are Principal Component Analysis (PCA) [44] and metric multidimensional scaling (MDS) [15]. Metric MDS computes the low dimensional representation of a high dimensional data set based on preserving the inner products between the input data. PCA, on the other hand, preserves its covariance structure up to a rotation. More specifically, the goal of MDS is to find a geometrical configuration of data points which respects a specific proximity relationship [15]. The proximity relationship is represented using the square matrix $D$ with elements $d_{ij}$ representing the pairwise distances between entities $i$ and $j$. The objective in MDS is to find a set of points $X$ which, under the Euclidean distance metric, approximates the proximity measure defined by $D$, leading to the following objective,

$$\operatorname*{argmin}_{X} \sum_{i}^{N} \sum_{j}^{N} \left( d_{ij} - \lVert x_i - x_j \rVert_{L_2} \right), \tag{7}$$

which can be solved in closed form as an eigenvalue problem.

Possibly the most well known algorithm for dimensionality reduction is PCA [44]. The objective in PCA is to find a rotation of the current axes such that each consecutive basis vector maximizes the variance of the data. This is done by diagonalizing the covariance matrix $C = Y^T Y$ and projecting the data on the dimensions representing the dominant portion of the variance in the data. PCA is an instance of MDS and leads to the same solution (up to scale) when using the Euclidean distance as the proximity relation when constructing $D$. Even though PCA has been successfully applied in many scenarios, it is built on the assumption that the intrinsic representation is simply a linear subspace of the original data. Since this is commonly not the case, it has led to a large body of work on extending MDS to handle non-linear embeddings. In Isomap [94] the proximity matrix is computed by finding the shortest path through the proximity graph, whereas in Maximum Variance Unfolding (MVU) [100] the minimum proximity allowed without violating the proximity graph is used. Further algorithms such as Locally Linear Embeddings (LLE) [82] and Laplacian Eigenmaps [2] are also based on proximity graphs, aiming to extend a truncated distance measure to a global one.

Methods based on MDS are attractive as they are associated with convex objectives. However, PCA assumes a linear subspace, and the non-linear extensions suffer significantly when there is noise in the data, as the local distance measure is very sensitive. Further, even though it is assumed to exist, none of the algorithms models the inverse of the generative mapping. This means that, having found the intrinsic representation of the data, mapping previously unseen points onto this representation is non-trivial.
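The relationship between PCA and metric MDS with Euclidean distances can be checked numerically. The sketch below uses the classical (Torgerson) eigenvalue solution of MDS, which is one standard closed-form variant and is assumed here rather than taken from the chapter; the data are synthetic and the function names are hypothetical.

```python
import numpy as np

def pca_embed(Y, q):
    # Project centered data onto the q leading eigenvectors of the covariance matrix.
    Yc = Y - Y.mean(axis=0)
    C = Yc.T @ Yc / len(Y)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:q]
    return Yc @ evecs[:, order]

def classical_mds(D, q):
    # Recover q-dimensional coordinates from a matrix of pairwise distances.
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # inner products of the centered points
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:q]
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# Synthetic 5-D data with most variance in the first two dimensions.
rng = np.random.default_rng(0)
Y = rng.normal(size=(30, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)

X_pca = pca_embed(Y, 2)
X_mds = classical_mds(D, 2)
same_up_to_sign = np.allclose(np.abs(X_pca), np.abs(X_mds), atol=1e-6)
print(same_up_to_sign)
```

The two embeddings agree up to the sign of each axis, even though MDS only ever sees the distance matrix and never the original coordinates.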

3.2 Generative Dimensionality Reduction

Here, it is assumed that the observed data $Y$ have been generated from $X$, often referred to as the latent representation of the data, through a mapping parameterized by $W$. Such models are often referred to as latent variable models for dimensionality reduction. Assuming the observed data $Y$ to be independent and identically distributed and corrupted by Gaussian noise leads to the likelihood of the data,

$$p(Y \mid X, W, \beta^{-1}) = \prod_{i=1}^{N} \mathcal{N}\left( y_i \mid f(x_i, W), \beta^{-1} \right), \tag{8}$$

where β^{-1} is the noise variance. In the Bayesian formalism both the parameters of the mapping W and the latent locations X are nuisance variables. Seeking the manifold representation of the observed data, we want to formulate the posterior distribution over the parameters given the data, p(X, W|Y). This means inverting the generative mapping through Bayes' Theorem, which implies marginalization over both the latent locations X and the mapping W. Reaching the posterior means we need to formulate prior distributions over X and W. However, this is severely ill-posed, as there is an infinite number of combinations of latent locations and mappings that could have generated the data. To proceed, assumptions about the relationship need to be made. In Probabilistic PCA (PPCA) [95] the generative mapping is assumed to be linear, which together with placing a spherical Gaussian prior over X means that the parameters of the mapping W can be found through maximum likelihood. In order to cope with problems related to applying mixture models in high-dimensional spaces, the Gaussian Process Latent Variable Model (GPLVM) [60] was suggested. Rather than marginalizing over the latent locations and finding the maximum likelihood solution of the mapping, the GPLVM takes the opposite approach: a prior over the mapping f is specified through the use of Gaussian Processes (GP), and the latent locations are optimized instead. One advantage of the GPLVM is that it is straightforward to specify additional prior distributions over the latent locations. Examples of employing the GPLVM in robotics applications have been presented in [7].
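For the linear PPCA case mentioned above, the maximum-likelihood solution is known in closed form [95]. As a sketch in the notation of Eq. (8), and under symbols introduced here only for illustration (D observed dimensions, q latent dimensions, eigenvalues \lambda_1 \geq \dots \geq \lambda_D of the sample covariance of the centred data, U_q the matrix of the q leading eigenvectors, \Lambda_q the diagonal matrix of the q leading eigenvalues, and R an arbitrary q x q rotation), it reads:

```latex
\[
W_{\mathrm{ML}} = U_q \left( \Lambda_q - \sigma^2_{\mathrm{ML}} I \right)^{1/2} R,
\qquad
\sigma^2_{\mathrm{ML}} = \beta^{-1}_{\mathrm{ML}} = \frac{1}{D - q} \sum_{j = q + 1}^{D} \lambda_j .
\]
```

The estimated noise variance is the average of the discarded eigenvalues, so the columns of W span the same subspace as standard PCA, and the rotation R reflects the fact that the latent coordinates are only identified up to rotation.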

3.3 Approaches specific to action primitives

When using dynamic movement primitives, dimensionality reduction can be incorporated into the training process by using Partial Least Squares during the regression [99]. In this approach, orthogonal projections of the input data are recursively computed, and single-variable regression is performed along these projections on the residuals of the previous iteration step. If the number of degrees of freedom of the data is significantly lower than the dimension of the input space, fewer projections will be needed to get an accurate fit. The method also automatically excludes irrelevant input dimensions. This approach takes advantage of the fact that the models used are local, and relies on the assumption that movement data is locally low dimensional, so that local models can be of significantly lower dimension than the full state space. Jenkins and Matarić [42] describe an algorithm for dimensionality reduction for spatio-temporal data, based on the Isomap algorithm [94]. In the extended algorithm, neighborhoods are defined both spatially and temporally, and then distances between temporally adjacent points are reduced prior to generating the global distance matrix. Two versions of the algorithm are presented, one for continuous data and one for data that has been pre-segmented. The algorithm is applied to robot and human motion data, and is able to extract lower-dimensional manifolds capturing looping behavior (for example, multiple executions of the same motion) much better than PCA or the original Isomap algorithm.
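The Partial Least Squares procedure described at the start of this section (recursive projections, each followed by a single-variable regression on the residuals) can be illustrated with a small global, batch version. This is only a sketch under stated assumptions: the method in [99] is local and incremental, whereas the hypothetical functions `pls1_fit` and `pls1_predict` below fit one global model to a toy data set whose inputs live in a 2-D subspace of a 6-D input space.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Univariate PLS: one projection and one single-variable regression per step."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    X_res, y_res = X - x_mean, y - y_mean
    steps = []
    for _ in range(n_components):
        w = X_res.T @ y_res                  # direction most correlated with the residual
        norm = np.linalg.norm(w)
        if norm < 1e-12:                     # nothing left to explain: stop early
            break
        w = w / norm
        s = X_res @ w                        # projected inputs
        ss = s @ s
        b = (s @ y_res) / ss                 # single-variable regression coefficient
        p = X_res.T @ s / ss                 # loading used to deflate the inputs
        y_res = y_res - b * s                # residual passed to the next step
        X_res = X_res - np.outer(s, p)
        steps.append((w, p, b))
    return x_mean, y_mean, steps

def pls1_predict(model, x):
    x_mean, y_mean, steps = model
    x_res, y_hat = x - x_mean, y_mean
    for w, p, b in steps:
        s = x_res @ w
        y_hat = y_hat + b * s
        x_res = x_res - s * p
    return y_hat

# Hypothetical toy data: 6-D inputs that only vary within a 2-D subspace.
rng = np.random.default_rng(1)
Z = rng.normal(size=(100, 2))
A = rng.normal(size=(2, 6))
c = np.array([1.5, -0.7])
X, y = Z @ A, Z @ c

model = pls1_fit(X, y, n_components=6)       # up to 6 projections allowed
z_new = np.array([0.3, -1.2])
x_new, y_new = z_new @ A, z_new @ c
print("projections used:", len(model[2]))
print("prediction error:", abs(pls1_predict(model, x_new) - y_new))
```

Although six projections are permitted, the loop stops once the residuals vanish, which illustrates the point made above that fewer projections are needed when the data has few degrees of freedom.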

4 Segmentation

Many of the algorithms proposed for learning action primitives through observation consider the case where the number of actions to be learned is specified by the designer, the demonstrated actions are observed and segmented a priori, and the learning is a one-shot, off-line process. In this case, there is no need to autonomously segment or cluster the motion data, as this task is performed off-line by the designer. However, during natural human movement, action primitives appear as combinations and sequences, and are observed by the learning system as a single continuous stream of time series data. In order to perform on-line learning from continuous observation, the action primitives must be identified in the continuous stream of observations; this is the problem of autonomous segmentation. Existing data segmentation algorithms can be divided into two broad categories: algorithms which take advantage of known motion primitives to perform the segmentation, and unsupervised algorithms which require no a-priori knowledge of the motions.

4.1 Movement Primitives known a-priori

The first class of segmentation algorithms performs segmentation given a known set of motion primitives. In other words, given an alphabet of motion primitives P, for any observation O = a_1 a_2 a_3 ... a_T that consists of the movement primitives a_i, our goal is to recover these primitives and their precise order. The recovery of the motion primitives is a non-trivial problem. Hidden Markov models (HMMs) have commonly been used to segment and label sequences. Applications can be found in biology, where HMMs have been used on biological sequences [22], and in computational linguistics, where HMMs are applied to a variety of problems including text and speech processing, detection of phonemes, etc. [78, 64]. Let X_t be the random variable over the observation sequences and Y_t be the random variable over the sequences of possible movement primitives (or labels); then an HMM is able to provide us with a joint distribution p(X_t, Y_t). However, when using HMMs for modeling p, one important assumption is that the Y_t are all pairwise independent, as pointed out earlier in Sect. 2.1.2. This is because an HMM is not able to model long-range statistical dependencies between the primitives. In order to model statistical dependencies, stochastic context-free grammars are often used [22]. As an alternative to generative models, discriminative models such as conditional random fields (CRFs) [59] can be used. Like HMMs, CRFs are finite state models; however, contrary to HMMs, CRFs generalize to analogues of stochastic context-free grammars [59] and are able to assign a well-defined probability distribution over possible labelings of the entire sequence of labels, given the entire observation sequence. To be precise, CRFs are undirected graphical models over the observed variables X_t and the state variables Y_t, with the graph G modeling the distribution of Y_t given the observations X_t. The graph G is unconstrained as long as it represents the conditional independencies in the label sequences being modeled. The likelihood of a sequence y_t of motion primitives, given an observation x_t, is given as

\[ P(y_t \mid x_t, \theta) = \frac{1}{Z} \prod_{c \in C} \Theta(y_t, x_t; \theta), \tag{9} \]

where C is the set of cliques in G and Θ is a potential function over the set of cliques, given as

\[ \Theta(y_t, x_t; \theta) = \exp\left( \sum_{k} \theta_{c,k} f_k(y_c, x_c) \right). \tag{10} \]

Here, the {f_k} are called feature functions, and Z is a normalizing factor. During training, the parameters θ are found, and belief propagation is used to compute Eq. (9). A CRF computes the most probable label for each time step, which provides us with the required segmentation.
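To make Eqs. (9) and (10) concrete, the following toy sketch evaluates a linear-chain CRF over a short observation sequence. Several simplifications are assumptions of this sketch and not part of the formulation above: the two labels, the two feature functions and the weights are hypothetical, the weights are shared across cliques (θ_k rather than θ_{c,k}), and Z and the best labeling are found by brute-force enumeration instead of belief propagation, which is only feasible for very short sequences.

```python
import itertools
import math

LABELS = ("rest", "reach")                   # hypothetical primitive alphabet

def features(y_prev, y_cur, x_cur):
    """Feature functions f_k for one clique (y_{t-1}, y_t, x_t); y_prev is None at t = 0."""
    return (
        1.0 if (y_cur == "reach") == (x_cur > 0.5) else 0.0,  # label agrees with observed speed
        1.0 if y_prev == y_cur else 0.0,                       # label persists over time
    )

def potential(theta, y_prev, y_cur, x_cur):
    """Clique potential in the form of Eq. (10): exp(sum_k theta_k * f_k)."""
    return math.exp(sum(t * f for t, f in zip(theta, features(y_prev, y_cur, x_cur))))

def unnormalised(theta, ys, xs):
    """Product of clique potentials along the chain (numerator of Eq. (9))."""
    score, y_prev = 1.0, None
    for y_cur, x_cur in zip(ys, xs):
        score *= potential(theta, y_prev, y_cur, x_cur)
        y_prev = y_cur
    return score

def sequence_probability(theta, ys, xs):
    """Eq. (9), with Z obtained by enumerating every possible labeling."""
    Z = sum(unnormalised(theta, cand, xs)
            for cand in itertools.product(LABELS, repeat=len(xs)))
    return unnormalised(theta, ys, xs) / Z

def most_probable_labelling(theta, xs):
    return max(itertools.product(LABELS, repeat=len(xs)),
               key=lambda cand: unnormalised(theta, cand, xs))

xs = [0.1, 0.2, 0.9, 0.8, 0.1]               # hypothetical normalised hand speed
theta = (2.0, 1.0)                           # hand-picked weights, not learned
best = most_probable_labelling(theta, xs)
print(best, sequence_probability(theta, best, xs))
```

The segmentation points are then simply the time steps at which the label in `best` changes.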

4.2 Assumption about segment point indicators

A second class of algorithms attempts to isolate action primitives and identify segmentation points without any a-priori knowledge about the action primitives being observed. In this case, some assumption must be made about the underlying structure of the data at a segmentation point. For example, several algorithms have been developed for segmenting motions based on the velocity properties of the observation vector [77, 27, 61]. In Pomplun and Matarić [77], a segment is recognized when the root mean square (RMS) value of the joint velocities falls below a certain threshold. In this case, the assumption is that there will be a pause in the motion between motion primitives. While this assumption allows for easy identification of segment points, it is fairly restrictive and not representative of natural, fluid human motion. In Fod et al. [27], it is assumed that there is a change in the direction of movement accompanying a change between motion primitives. Therefore, a segmentation point is recognized when a Zero Velocity Crossing (ZVC) is detected in the joint angle data, in a sufficient number of dimensions. Lieberman and Breazeal [61] improve upon this approach by automating the threshold selection and adding heuristic rules for creating segments when there is contact with the environment. This approach works well when the number of DoF of the observation vector is fairly small, but it becomes more difficult to tune the algorithm as the number of joints increases. For example, it becomes more difficult to select a single threshold for the RMS value of the joint velocities which will accurately differentiate between segments at rest and segments in motion when the dimension of the space is large and different types of motions (arm vs. full body motions) are considered. Koenig and Matarić [48] develop a segmentation algorithm based on the variance of the feature data.
The algorithm searches for a set of segment points which minimize a cost function of the data variance. In a related approach, Kohlmorgen and Lemm [49] describe a system for automatic on-line segmentation of time series data, based on the assumption that data from the same motion primitive will belong to the same underlying distribution. The incoming data is described as a series of probability density functions, which are formulated as the states of a Hidden Markov Model (HMM), and a minimum cost path is found among the states using an accelerated version of the Viterbi algorithm. This algorithm has been successfully applied to human motion capture data with high DoF [40, 54]. In [52], Bayes propagation over time in combination with a particle filtering approach is used for on-line segmentation of the time series data. Here, the primitives are also modeled as HMMs, but the approach estimates the maximum a-posteriori probability of each primitive, given the data, directly. It is also possible to improve the segmentation result by including information about any known motion primitives [57] or by including knowledge about an object on which the action is applied [52, 53].
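The two velocity-based heuristics discussed earlier in this section (an RMS velocity threshold, and zero velocity crossings in a sufficient number of dimensions) are simple enough to sketch directly. The function names, the thresholds and the two-joint toy trajectory below are hypothetical, and the sketch omits the filtering and tuning that real motion data would need.

```python
import math

def rms_pause_segments(velocities, threshold):
    """Time indices at which the RMS joint velocity first drops below the threshold."""
    points, below_prev = [], False
    for t, v in enumerate(velocities):
        rms = math.sqrt(sum(vi * vi for vi in v) / len(v))
        below = rms < threshold
        if below and not below_prev:         # report only the start of each pause
            points.append(t)
        below_prev = below
    return points

def zvc_segments(velocities, min_dims):
    """Time indices at which at least min_dims joint velocities change sign."""
    points = []
    for t in range(1, len(velocities)):
        crossings = sum(1 for a, b in zip(velocities[t - 1], velocities[t]) if a * b < 0)
        if crossings >= min_dims:
            points.append(t)
    return points

# Hypothetical joint velocities for a 2-DoF arm: move, pause, reverse direction.
velocities = [
    (1.0, 0.5), (0.8, 0.4), (0.01, 0.02), (0.02, 0.01),
    (-0.9, -0.6), (-1.0, -0.5), (0.7, -0.4),
]
print("pause starts at:", rms_pause_segments(velocities, threshold=0.05))
print("direction changes at:", zvc_segments(velocities, min_dims=2))
```

Both functions depend on a single hand-chosen parameter (`threshold` or `min_dims`), which is exactly what becomes hard to tune as the number of joints grows.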

5 Connections to Learning at Higher Levels of Abstraction

The use of action primitives is closely connected to the use of higher-level models such as grammars that govern how these primitives are interlinked with each other. Furthermore, primitives are often meant to cause a very specific effect on the environment, such as removing an object, pushing an object or inserting object A into object B. To formalize the possible effect on the environment, grammatical production rules for action primitives, objects, object states and object affordances are often used [26, 17]. Affordances were first introduced by J. J. Gibson [43] and refer to action possibilities that an object or an environment offers to the acting agent. For example, a door can be {open or closed}, and the affordance is {close door, open door}. The observation that objects and actions are intertwined is not new to robotics researchers [26, 29, 68, 50, 21, 63, 88]. Objects and production rules can be specified a-priori by an expert, and the scene state is often considered to be independent of the presence of the agent itself. For surveillance applications, simple predefined grammars are used to describe actions such as "leaving a bag" or "taking a bag". However, for robotics, it must also be taken into account that a robot might physically not be able to execute a particular action because it might be, e.g., in the wrong location or it might be too weak. The research on motion planning takes this into account, while in most cases it is assumed that the environment does not change while the agent performs the planned movements. However, unless the programmer has a precise model of the physical robot body as well as of the scene objects, the affordances need to be learned by the robot itself through exploration. In order to learn how valid and appropriate an action is, the robot eventually needs to try to execute it [26, 29, 21, 88]. This could be interpreted as "playing" or "discovering".
Similarly to humans, the learning process can in some cases be biased through imitation learning. In [26] a robot learns affordances of objects on a table by poking, pulling and pushing them. In [21, 83] the authors formalize the relationship between action, affordances and effect,

\[ (\text{effect}\ (\text{state}, \text{action})), \tag{11} \]

that describes a certain effect if an action is applied on a scene while being in a certain state. In robotics, a developmental approach [67] can be used, by first letting the robot apply a predefined set of actions randomly on its environment and record what effects each action has on the environment. In [21] this is called a relation instance. Here, state is the scene state as perceived by the robot before the action, and effect marks the changes of the environment due to the action, again, as perceived by the robot. After quantizing the principally continuous set of effects into a small set of discrete effect-ids, a support vector machine or similar can be used to model and predict the effects vs. scene state and action relationships in Eq. (11). This way, one becomes able to generate and recognize goal-directed actions and behaviors, where a goal is defined to be a desired or final effect in a given scene state. To perform scene understanding in a surveillance scenario the systematic application of the relationships in Eq. (11) can be used to predict the possible outcomes of the actions observed thus far, and for goal-directed robot control one has to generate the desired effect by a systematic application of its available set of actions given the relationships in Eq. (11).
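The exploration loop described above (apply actions at random, record relation instances in the form of Eq. (11), quantize the effects, then predict effects and select actions for a goal) can be sketched as follows. Everything concrete here is an assumption for illustration: the simulated "robot" (`simulate`), the one-dimensional state (object weight), the two actions and the threshold quantization are hypothetical, and a 1-nearest-neighbour lookup is swapped in for the support vector machine mentioned in the text to keep the example dependency-free.

```python
import random

ACTIONS = ("push", "lift")                   # hypothetical predefined action set

def simulate(state, action, rng):
    """Hypothetical stand-in for the robot: returns a continuous displacement."""
    weight = state[0]
    if action == "push":
        return 0.3 / weight + rng.gauss(0.0, 0.005)
    return (0.2 if weight < 1.0 else 0.0) + rng.gauss(0.0, 0.005)

def effect_id(displacement):
    """Quantize the continuous effect into a small set of discrete effect-ids."""
    return "moved" if displacement > 0.1 else "unchanged"

def explore(n, rng):
    """Random exploration: collect relation instances (effect, (state, action))."""
    data = []
    for _ in range(n):
        state, action = (rng.uniform(0.2, 5.0),), rng.choice(ACTIONS)
        data.append((effect_id(simulate(state, action, rng)), (state, action)))
    return data

def predict(data, state, action):
    """1-nearest-neighbour over instances with the same action (stand-in for an SVM)."""
    same = [(e, s) for e, (s, a) in data if a == action]
    return min(same, key=lambda es: abs(es[1][0] - state[0]))[0]

def actions_for_goal(data, state, goal):
    """Goal-directed selection: actions whose predicted effect matches the goal."""
    return [a for a in ACTIONS if predict(data, state, a) == goal]

rng = random.Random(0)
data = explore(300, rng)
for weight in (0.5, 2.0, 4.5):
    print(weight, actions_for_goal(data, (weight,), "moved"))
```

The same `predict` function serves both uses named in the text: predicting the outcome of an observed action, and searching over the robot's own actions for one that yields a desired effect.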

6 Conclusions and Open Questions

Learning action primitives from observation of human motion is an active research area with many potential applications, including activity and behavior recognition, robot learning, and sports training and rehabilitation. Two dominant approaches have emerged for representing action primitives: stochastic models and dynamical systems models. Other active research areas include dimensionality reduction, extracting action primitives from continuous time series data via segmentation, and incorporating action primitives into higher-order models of behavior. While significant advances have been made, many open questions remain before a fully autonomous action primitive learning system can be realized. Many of the current systems are semi-autonomous: data is collected off-line and manually sorted and pre-processed, and algorithm parameters are selected and tuned by the designers. In these systems, the number and type of action primitives to be learned must typically be specified a-priori, limiting the generality and reusability of such systems. A second issue is the choice of representation for the input data, such as the choice of joint angle or Cartesian representation of the motion, or the choice of representation frame when describing object motion. In current systems, this choice is typically made by the designer, which is simple when the action is straightforward and known a-priori, but becomes more difficult when the action incorporates both self-motion and interaction with the environment. A third open research goal is a fully autonomous segmentation system capable of segmenting full body motion composed of arbitrary action primitives, where there may be a significant overlap among the primitives. Tied to this goal is the fundamental question of primitive ambiguity and how a primitive is defined: should this be done by the autonomous system, or by the human demonstrator? Another open issue when applying learned action primitives to movement generation by a robot is how the primitive should be adapted to the robot's own morphology and sensorimotor loop (i.e., combining learning from observation and practice).

References

1. J. Aleotti, S. Caselli, and M. Reggiani. Leveraging on a virtual environment for robot programming by demonstration. Robotics and Autonomous Systems, 47(2-3):153–161, 2004.
2. M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 14:585–591, 2002.
3. M. Bennewitz, W. Burgard, G. Cielniak, and S. Thrun. Learning motion patterns of people for compliant robot motion. Int. Journal of Robotics Research, 24(1):31–48, 2005.
4. K. Bernardin, K. Ogawara, K. Ikeuchi, and R. Dillmann. A sensor fusion approach for recognizing continuous human grasping sequences using hidden markov models. IEEE Trans. on Robotics, 21(1):47–57, 2005.
5. A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Handbook of Robotics, chapter Robot Programming by Demonstration, pages 1371–1394. Springer, 2008.
6. A. Billard, S. Calinon, and F. Guenter. Discriminative and adaptive imitation in uni-manual and bi-manual tasks. Robotics and Autonomous Systems, 54:370–384, 2006.
7. S. Bitzer and S. Vijayakumar. Latent spaces for dynamic movement primitives. In IEEE Int. Conf. on Humanoid Robots, pages 574–581, 2009.
8. C. Breazeal and B. Scassellati. Robots that imitate humans. Trends in Cognitive Sciences, 6(11):481–487, 2002.
9. R. W. Byrne and A. E. Russon. Learning by imitation: a hierarchical approach. Behavioral and Brain Sciences, 21:667–721, 1998.
10. S. Calinon and A. Billard. Active teaching in robot programming by demonstration. In IEEE Int. Conf. on Robot and Human Interactive Communication, pages 702–707, 2007.
11. S. Calinon and A. Billard. Incremental learning of gestures by imitation in a humanoid robot. In ACM/IEEE Int. Conf. on Human-Robot Interaction, pages 255–262, 2007.
12. S. Calinon and A. Billard. Learning of gestures by imitation in a humanoid robot. In C. L. Nehaniv and K. Dautenhahn, editors, Imitation and Social Learning in Robots, Humans and Animals, pages 153–177. Cambridge University Press, 2007.
13. S. Calinon, F. D'halluin, E. L. Sauser, D. G. Caldwell, and A. G. Billard. Learning and reproduction of gestures by imitation. Robotics Automation Magazine, IEEE, 17(2):44–54, 2010.
14. S. Calinon, F. Guenter, and A. Billard. On learning, representing and generalizing a task in a humanoid robot. IEEE Trans. on Systems, Man and Cybernetics B, 37(2):286–298, 2007.
15. T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman & Hall/CRC, 2001.
16. C. P. Tung and A. C. Kak. Automatic learning of assembly task using dataglove system. In IEEE Int. Conf. on Intelligent Robots and Systems, volume 1, pages 1–8, 1995.
17. Y. Demiris and M. Johnson. Distributed, predictive perception of actions: a biologically inspired robotics architecture for imitation and learning. Connection Science, 15(4):231–243, 2003.
18. R. Dillmann. Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems, 47:109–116, 2004.
19. R. Dillmann, O. Rogalla, M. Ehrenmann, R. Zollner, and M. Bordegoni. Learning robot behaviour and skills based on human demonstration and advice: The machine learning paradigm. In Int. Symp. on Robotics Research, pages 229–238, 1999.

20. K. R. Dixon, J. M. Dolan, and P. K. Khosla. Predictive robot programming: Theoretical and experimental analysis. Int. Journal of Robotics Research, 23(9):955–973, 2004.
21. M. R. Dogar, M. Cakmak, E. Ugur, and E. Sahin. From primitive behaviors to goal-directed behavior using affordances. In IEEE Int. Conf. on Intelligent Robots and Systems, pages 729–734, 2007.
22. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
23. S. Ekvall, D. Aarno, and D. Kragic. Online task recognition and real-time adaptive assistance for computer-aided machine control. IEEE Trans. on Robotics, 22(5):1029–1033, 2006.
24. S. Ekvall and D. Kragic. Interactive grasp learning based on human demonstration. In IEEE Int. Conf. on Robotics and Automation, volume 4, pages 3519–3524, 2004.
25. S. Ekvall and D. Kragic. Grasp recognition for programming by demonstration tasks. In IEEE Int. Conf. on Robotics and Automation, pages 748–753, 2005.
26. P. Fitzpatrick, G. Metta, L. Natale, S. Rao, and G. Sandini. Learning about objects through action - initial steps towards artificial cognition. In IEEE Int. Conf. on Robotics and Automation, volume 3, pages 3140–3145, 2003.
27. A. Fod, M. J. Matarić, and O. C. Jenkins. Automated derivation of primitives for movement classification. Autonomous Robots, 12(1):39–54, 2002.
28. E. Gribovskaya, S. M. Khansari-Zadeh, and A. Billard. Learning non-linear multivariate dynamics of motion in robotic manipulators. Int. Journal of Robotics Research. In press.
29. S. Griffith, J. Sinapov, M. Miller, and A. Stoytchev. Toward interactive learning of object categories by a robot: A case study with container and non-container objects. In IEEE Int. Conf. on Development and Learning, pages 1–6, 2009.
30. F. Guenter and A. G. Billard. Using reinforcement learning to adapt an imitation task. In IEEE Int. Conf. on Intelligent Robots and Systems, pages 1022–1027, 2007.
31. L. Han, X. Wu, W. Liang, G. Hou, and Y. Jia. Discriminative human action recognition in the learned hierarchical manifold space. Image and Vision Computing, 28:836–849, 2010.
32. C. Heyes. Causes and consequences of imitation. Trends in Cognitive Sciences, 5(6):253–261, 2001.
33. C. Heyes and E. Ray. What is the significance of imitation in animals? Advances in the Study of Behavior, 29:215–245, 2000.
34. M. A. T. Ho, Y. Yamada, and Y. Umetani. An adaptive visual attentive tracker for human communicational behaviors using hmm-based td learning with new state distinction capability. IEEE Trans. on Robotics, 21(3):497–504, 2005.
35. S. Iba, C. J. J. Paredis, and P. K. Khosla. Interactive multi-modal robot programming. Int. Journal of Robotics Research, 24(1):83–104, 2005.
36. A. J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In IEEE Int. Conf. on Robotics and Automation, pages 1398–1403, 2002.
37. K. Ikeuchi and T. Suehiro. Towards an assembly plan from observation, part i: Assembly task recognition using face-contact relations (polyhedral objects). In IEEE Int. Conf. on Robotics and Automation, volume 3, pages 2171–2177, 1992.
38. T. Inamura, I. Toshima, H. Tanie, and Y. Nakamura. Embodied symbol emergence based on mimesis theory. Int. Journal of Robotics Research, 23(4–5):363–377, 2004.
39. M. Ito and J. Tani. On-line imitative interaction with a humanoid robot using a dynamic neural network model of a mirror system. Adaptive Behavior, 12(2):93–115, 2004.
40. B. Janus and Y. Nakamura. Unsupervised probabilistic segmentation of motion data for mimesis modeling. In IEEE Int. Conf. on Advanced Robotics, pages 411–417, 2005.
41. O. C. Jenkins and M. Matarić. Performance-derived behavior vocabularies: Data-driven acquisition of skills from motion. Int. Journal of Humanoid Robotics, 1(2):237–288, 2004.
42. O. C. Jenkins and M. Matarić. A spatio-temporal extension to isomap nonlinear dimension reduction. In Int. Conf. on Machine Learning, pages 441–448, 2004.
43. J. J. Gibson. Perceiving, acting and knowing: toward an ecological psychology, chapter The theory of affordances, pages 67–82. Lawrence Erlbaum Associates Publishers, 1977.

44. I. T. Jolliffe. Principal component analysis. Springer Verlag, 2002.
45. S. B. Kang and K. Ikeuchi. Toward automatic robot instruction from perception – temporal segmentation of tasks from human hand motion. IEEE Trans. on Robotics and Automation, 11:432–443, 1993.
46. H. Kjellstrom, J. Romero, and D. Kragic.
47. J. Kober, B. Mohler, and J. Peters. Learning perceptual coupling for motor primitives. In IEEE Int. Conf. on Intelligent Robots and Systems, 2008.
48. N. Koenig and M. J. Matarić. Behavior-based segmentation of demonstrated tasks. In Int. Conf. on Development and Learning, 2006.
49. J. Kohlmorgen and S. Lemm. A dynamic hmm for on-line segmentation of sequential data. In Neural Information Processing Systems, pages 793–800, 2001.
50. H. Kozima, C. Nakagawa, and H. Yano. Emergence of imitation mediated by objects. In Int. Workshop on Epigenetic Robotics, pages 59–61, 2002.
51. D. Kragic, P. Marayong, M. Li, A. M. Okamura, and G. D. Hager. Human-machine collaborative systems for microsurgical applications. Int. Journal of Robotics Research, 24(9):731–742, 2005.
52. V. Krueger and D. Grest. Using hidden markov models for recognizing action primitives in complex actions. In Scandinavian Conf. on Image Analysis, 2007.
53. V. Krüger, D. Herzog, S. Baby, A. Ude, and D. Kragic. Learning actions from observations. Robotics Automation Magazine, IEEE, 17(2):30–43, 2010.
54. D. Kulić and Y. Nakamura. On-line segmentation of whole body human motion data for large kinematic models. In IEEE Int. Conf. on Intelligent Robots and Systems, pages 4300–4305, 2009.
55. D. Kulić, W. Takano, and Y. Nakamura. Incremental on-line hierarchical clustering of whole body motion patterns. In IEEE Int. Symp. on Robot and Human Interactive Communication, pages 1016–1021, 2007.
56. D. Kulić, W. Takano, and Y. Nakamura. Incremental learning, clustering and hierarchy formation of whole body motion patterns using adaptive hidden markov chains. Int. Journal of Robotics Research, 27(7):761–784, 2008.
57. D. Kulić, W. Takano, and Y. Nakamura. On-line segmentation and clustering from continuous observation of whole body motions. IEEE Trans. on Robotics, 25(5):1158–1166, 2009.
58. Y. Kuniyoshi, M. Inaba, and H. Inoue. Teaching by showing: Generating robot programs by visual observation of human performance. In Int. Symp. on Industrial Robots, pages 119–126, 1989.
59. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Int. Conf. on Machine Learning, pages 282–289, 2001.
60. N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783–1816, November 2005.
61. J. Lieberman and C. Breazeal. Improvements on action parsing and action interpolation for learning through demonstration. In IEEE Int. Conf. on Humanoid Robots, pages 342–365, 2004.
62. M. Loesch, S. Schmidt-Rohr, S. Knoop, S. Vacek, and R. Dillmann. Feature set selection and optimal classifier for human activity recognition. In IEEE Int. Conf. on Robot and Human Interactive Communication, pages 1022–1027, 2007.
63. M. C. Lopes and J. Santos Victor. Visual learning by imitation with motor representations. SMC, 35(3):438–449, 2005.
64. C. Manning and H. Schuetze. Foundations of statistical natural language processing. MIT Press, 1999.
65. A. N. Meltzoff. Imitation as a mechanism of social cognition: origins of empathy, theory of mind, and the representation of action. In U. Goswami, editor, Blackwell Handbook of Childhood Cognitive Development, pages 6–25. Blackwell Publishers, 2002.

66. A. N. Meltzoff. Imitation and other minds: the 'like me' hypothesis. In S. Hurley and N. Chater, editors, Perspectives on imitation: from neuroscience to social science, volume 2, pages 55–77. MIT Press, 2005.
67. G. Metta, G. Sandini, L. Natale, R. Manzotti, and F. Panerai. Development in artificial systems. In EDEC Symp. at the Int. Conf. on Cognitive Science, (Beijing, China), 2001.
68. L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor. Learning object affordances: From sensory motor coordination to imitation. IEEE Trans. on Robotics, 24(1):15–26, 2008.
69. F. A. Mussa-Ivaldi and E. Bizzi. Motor learning through the combination of primitives. Phil. Trans. of the Royal Society of London B, 355:1755–1769, 2000.
70. J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47:79–91, 2004.
71. T. Ogata, S. Sugano, and J. Tani. Open-end human-robot interaction from the dynamical systems perspective: mutual adaptation and incremental learning. Advanced Robotics, 19:651–670, 2005.
72. M. Okada, K. Tatani, and Y. Nakamura. Polynomial design of the nonlinear dynamics for the brain-like information processing of whole body motion. In IEEE Int. Conf. on Robotics and Automation, pages 1410–1415, 2002.
73. M. Pardowitz, R. Zoellner, S. Knoop, and R. Dillmann. Incremental learning of tasks from user demonstrations, past experiences and vocal comments. IEEE Trans. on System, Man and Cybernetics B, 37(2):322–332, 2007.
74. P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal. Learning and generalization of motor skills by learning from demonstration. In Robotics and Automation, 2009. ICRA '09. IEEE International Conference on, pages 763–768, 2009.
75. J. Peters and S. Schaal. Applying the episodic natural actor-critic architecture to motor primitive learning. In European Symposium on artificial neural networks, 2007.
76. J. Peters and S. Schaal. Reinforcement learning for operational space control. In IEEE Int. Conf. on Robotics and Automation, pages 2111–2116, 2007.
77. M. Pomplun and M. J. Matarić. Evaluation metrics and results of human arm movement imitation. In IEEE Int. Conf. on Humanoid Robotics, 2000.
78. L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257–286, 1989.
79. R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the em algorithm. SIAM Review, 26(2):195–239, 1984.
80. G. Rizzolatti and L. Craighero. The mirror-neuron system. Annual Reviews of Neuroscience, 27:169–192, 2004.
81. G. Rizzolatti, L. Fogassi, and V. Gallese. Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews: Neuroscience, 2:661–670, 2001.
82. S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
83. E. Sahin, M. Cakmak, M. R. Dogar, E. Ugur, and G. Uecoluk. To afford or not to afford: A new formalization of affordances toward affordance-based robot control. Adaptive Behavior, 15(4):447–472, 2007.
84. S. Schaal. Dynamic movement primitives - a framework for motor control in humans and humanoid robotics. In H. Kimura, K. Tsuchiya, A. Ishiguro, and H. Witte, editors, Adaptive Motion of Animals and Machines, pages 261–280. Springer Tokyo, 2006.
85. S. Schaal, C. G. Atkeson, and S. Vijayakumar. Scalable techniques from nonparametric statistics for real time robot learning. Applied Intelligence, 17:49–60, 2002.
86. S. Schaal, A. Ijspeert, and A. Billard. Computational approaches to motor learning by imitation. Phil. Trans. of the Royal Society of London B, 358:537–547, 2003.
87. Q. Shi, L. Wang, L. Cheng, and A. Smola. Human action segmentation and recognition using discriminative semi-markov models. International Journal of Computer Vision, ??:??, 2008.
88. J. Sinapov and A. Stoytchev. Detecting the functional similarities between tools using a hierarchical representation of outcomes. In IEEE Int. Conf. on Development and Learning, pages 91–96, 2008.

89. D. Song, K. Huebner, V. Kyrki, and D. Kragic. Learning task constraints for robot grasping using graphical models. In IEEE Int. Conf. on Intelligent Robots and Systems, pages 1579–1585, 2010.
90. T. Starner and A. Pentland. Visual recognition of american sign language using hidden markov models. In Int. Conf. on Automatic Face and Gesture Recognition, pages 189–194, 1995.
91. K. Sugiura and N. Iwahashi. Learning object-manipulation verbs for human-robot communication. In Workshop on Multi-Modal Interfaces in Semantic Interaction, pages 32–38, 2007.
92. K. Sugiura and N. Iwahashi. Motion recognition and generation by combining reference-point-dependent probabilistic models. In IEEE Int. Conf. on Intelligent Robots and Systems, pages 852–857, 2008.
93. W. Takano, K. Yamane, T. Sugihara, K. Yamamoto, and Y. Nakamura. Primitive communication based on motion recognition and generation with hierarchical mimesis model. In IEEE Int. Conf. on Robotics and Automation, pages 3602–3608, 2006.
94. J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
95. M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
96. A. Ude, A. Gams, T. Asfour, and J. Morimoto. Task-specific generalization of discrete and periodic dynamic movement primitives. IEEE Transactions on Robotics, 26(5):800–815, 2010.
97. A. Ude, C. G. Atkeson, and M. Riley. Programming full-body movements for humanoid robots by observation. Robotics and Autonomous Systems, 47(2-3):93–108, 2004.
98. A. Ude, M. Riley, B. Nemec, A. Kos, T. Asfour, and G. Cheng. Synthesizing goal-directed actions from a library of example movements. In IEEE Int. Conf. on Humanoid Robots, pages 115–121, 2007.
99. S. Vijayakumar, A. D'Souza, and S. Schaal. Incremental online learning in high dimensions. Neural Computation, 17:2602–2634, 2005.
100. K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Int. Conf. Machine Learning, 2004.
101. A. D. Wilson and A. F. Bobick. Parametric hidden markov models for gesture recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(9):884–900, 1999.
102. A. Wohlschlaeger, M. Gattis, and H. Bekkering. Action generation and action perception in imitation: An instance of the ideomotor principle. Phil. Trans. of the Royal Society of London B, 358:501–515, 2003.
103. J. Yang, Y. Xu, and C. S. Chen. Human action learning via hidden markov model. IEEE Trans. on Systems, Man and Cybernetics A, 27(1):34–44, 1997.