
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 38, NO. 5, OCTOBER 2008

Natural Movement Generation Using Hidden Markov Models and Principal Components

Junghyun Kwon and Frank C. Park

Abstract—Recent studies have shown that the perception of natural movements—in the sense of being “humanlike”—depends on both joint and task space characteristics of the movement. This paper proposes a movement generation framework that merges two established techniques from gesture recognition and motion generation—hidden Markov models (HMMs) and principal components—into an efficient and reliable means of generating natural movements, which uniformly considers joint and task space characteristics. Given human motion data that are classified into several movement categories, for each category, the principal components extracted from the joint trajectories are used as basis elements. An HMM is, in turn, designed and trained for each movement class using the human task space motion data. Natural movements are generated as the optimal linear combination of principal components, which yields the highest probability for the trained HMM. Experimental case studies with a prototype humanoid robot demonstrate the various advantages of our proposed framework.

Index Terms—Hidden Markov model (HMM), movement primitive, natural movement, principal component.

Manuscript received September 17, 2007; revised December 31, 2007 and April 16, 2008. This research was supported in part by grants from the Frontier 21C Program in Intelligent Robotics, by the Seoul National University (SNU) Center for Biomimetic Systems, and by IAMD—SNU. This paper was recommended by Associate Editor C.-T. Lin. This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors. This includes four movie clips, which show the humanoid motion generation results for the numerical and experimental case studies in Section IV. This material is 9.9 MB in size. The authors are with the School of Mechanical and Aerospace Engineering, Seoul National University, Seoul 151-742, Korea (e-mail: jhkwon@robotics.snu.ac.kr; [email protected]). Digital Object Identifier 10.1109/TSMCB.2008.926324

I. INTRODUCTION

While the various motivating arguments put forth for designing robots that physically resemble humans—the need for robots to function in environments designed expressly for humans and the better facilitation of human–robot interaction leading to wider social acceptance of robots—are by now well known, recent works that focus on getting humanoid robots to move in a humanlike way, e.g., [1] and [2], point to a growing realization that how the robot moves and behaves is just as important in the way that humans ultimately perceive robots.

Several recent works have attempted to quantify how humanlike or natural a particular movement is. Considering only the angle and velocity profiles of the joints, Ren et al. [3] employ various statistical methods, e.g., a mixture of Gaussians and hidden Markov models (HMMs), to measure how similar a movement is to a previously acquired database of human movements. In contrast, the task space properties of a movement are emphasized in [4] and [5]; drawing upon evidence

from human motor control research, Simmons and Demiris [4] generate human point-to-point arm-reaching movements in which the hand typically follows a linear path and the velocity profile resembles a smooth bell-shaped curve. In [5], human subjects are asked to subjectively evaluate the naturalness of arm movements generated from various algorithms, e.g., the minimum velocity or acceleration of the hand, minimum angular velocity or acceleration of the joints, minimum joint torque or torque change based on dynamic optimization, etc., and the authors conclude, among other things, that the smoothness of hand velocity profiles is critical and that minimum torque and other dynamically optimal motions tend not to be perceived as being natural (the latter finding is also confirmed in [6]). The important consequence inferred from such previous works is that both the joint and task space characteristics of a movement are important ingredients affecting the movement’s naturalness.

The easiest and most popular way to generate humanoid movements resembling human movements in both the joint and task spaces is to exploit some existing database of human motion capture data. Recent approaches attempt to generate movements that are similar, in some statistical sense, to this database of movements; this is also true for many works in imitation learning and programming by demonstration. Nearly all of the metrics used to measure the similarity between movements rely on joint and task space representations of the movement. However, directly applying a particular set of human motion capture data—in the form of joint angle trajectories—to a robot will more often than not yield unsatisfactory movements because topological and structural differences inevitably exist between the human subjects providing the motion data and the actual robot (e.g., differences in kinematic degrees of freedom, joint types, link dimensions, joint limits, mass and inertial properties, etc.). Therefore, what is needed is a procedure for generating new movements from a collection of similar movements (or, in the context of imitation learning, from multiple demonstrations of a specific movement), which adjusts the joint trajectories to compensate for these structural differences while still preserving the “natural” qualities of the movement in both the joint and task space senses.

The main contribution of this paper is an algorithm for generating natural humanlike movements, which simultaneously accounts for task and joint space characteristics in a unified and consistent fashion while preserving the natural qualities of the movement within an optimization setting. The approach that we set forth is based on the following: 1) the encoding of movement primitives in joint space as a set of basis elements that are extracted via the principal component analysis (PCA)


of existing human movement data and 2) the use of HMMs to determine the optimal linear combination of basis elements, which best represents the movement in the task space.

The idea that complex movements are created from a vocabulary of movement primitives has inspired a wealth of related studies in the human motor control literature (see, e.g., [7]–[9] and the references cited therein). The method of representing and generating human arm movements as a linear superposition of principal components was first investigated in [8]. Basis elements obtained from PCA have also been used in interpolation-based schemes for deriving action and behavior primitives for humanoid robots [10]–[12] and for obtaining dynamically suboptimal motions in real time, e.g., suboptimal minimum torque motions [12].

The use of HMMs is ubiquitous in time series signal processing, particularly in speech recognition [13]. Recently, HMMs have also appeared in video-based gesture recognition [14]–[16] and robot task learning [17], [18]. The ability of HMMs to generalize human demonstrations has naturally led to several proposed methods for HMM-based movement generation [19]–[23], whose details we discuss in the following.

In our proposed method, each movement primitive is represented by an HMM trained with human motion capture data. The HMM training process uses only the task space properties of the movement, i.e., only the hand trajectories and velocities in Cartesian space. At the same time, basis elements in the form of joint angle trajectories for each movement primitive are extracted via the PCA of the same HMM training data. Movements are then generated by selecting the basis element weights that result in the maximum probability for the given HMM and that meet boundary conditions and other user-specified constraints.

A. Related Work

Several closely related works that employ HMMs for movement generation have been recently proposed in the literature. Inamura et al. [19] train a continuous HMM using joint angle trajectories of human motion capture data as the observation data. New joint angle trajectories are obtained via direct simulation of the state transition and observation emission processes with an averaging strategy, in which the following are accomplished: 1) the average state sequence is obtained from repeated trials of the state transition process; 2) the joint angle trajectories are obtained from a single trial of the observation emission process based on the averaged state sequence; and 3) the averaged joint angle trajectories are obtained from repeated trials of 1) and 2).

Calinon and Billard [20]–[22] primarily use HMMs in an imitation learning context, i.e., for generalizing multiple human demonstrations of a given task. In contrast with Inamura et al. [19], Calinon and Billard [20] use only key-point values, such as inflection points extracted from the joint angle or 3-D hand path trajectories, to train two corresponding separate HMMs. New movements are generated by interpolating the key-point sequence retrieved from the trained observation mean vectors. The works of Calinon and Billard in [21] and [22] extend their work in [20] by the following: 1) using


PCA or independent component analysis to reduce the data dimension and the effects of noise before HMM training and 2) introducing a cost function to determine whether to control the humanoid using either the retrieved joint angle or 3-D hand path trajectories.

Asfour et al. [23] also rely on key-point retrieval from the trained HMM and interpolation of those key points, similar to [20]. A distinctive feature of [23] is that three separate HMMs are employed for a specific movement, namely, joint-angle-based, hand-path-based, and hand-orientation-based. The newly generated movement is the blend of the joint angle trajectories produced by those HMMs.

The approach presented in this paper enhances, in several important respects, the previously mentioned HMM-based methods for movement generation. The primary original contributions can be summarized as follows.

1) Our framework provides a systematic and efficient means of simultaneously considering the goodness of fit in both the joint and task spaces; the basis elements extracted via PCA reflect the joint space characteristics of the movement, whereas the trained HMM focuses on the features of the movement in the task space. Because the requisite optimization involves selecting the best linear combination of the movement basis elements that yields the highest probability for the trained HMM, the resulting movement preserves both the joint and task space characteristics of the original human motion. Our approach can be contrasted with that of Inamura et al. [19], which uses only human joint angle trajectories and does not consider the task space features of the movement. The other methods described in [20]–[23] generate separate movements according to the task and joint space characteristics, of which one is either selected or the two are suitably merged through some ad hoc user-defined criterion.

2) Our optimization procedure effectively resolves any movement discrepancies arising from the inherent kinematic differences between the target humanoid and the human subject providing the movement database. Because previous approaches [19]–[23] generate movements directly, without considering such structural differences, it becomes necessary to, e.g., manually adjust the human joint angle trajectories prior to HMM training to ensure that the joint limits of the target humanoid are not exceeded and that the qualitative features that make the movement natural are preserved even for different link length proportions. Such adjustments must, moreover, be made repeatedly whenever the target humanoid is changed. In contrast, because any joint limit constraints can be easily included in our optimization procedure, our approach avoids laborious manual modification of the human joint angle trajectories. As we will also discuss later, because the HMM observation vector is defined to be the Cartesian hand position and velocity (normalized by the sum of the link lengths), humanlike movements can be generated simply by explicitly specifying the initial and


final joint configurations of the movement as constraints in the optimization procedure. Unlike past approaches, modification of the human joint angle trajectories and retraining of the HMM are unnecessary even when the target humanoid is changed.

We emphasize that although our proposed approach bears some superficial similarities to the methodology presented in [21] and [22], in that both apply PCA and HMMs to human movement data, the two approaches are fundamentally different in both purpose and means; the works of Calinon and Billard in [21] and [22] apply PCA to human movement data across different joint or Cartesian position variables to reduce the data dimension and the effects of noise prior to HMM training. In contrast, our approach applies PCA to human movement data across multiple human demonstrations of every single joint comprising a movement in order to extract the movement basis elements for each corresponding joint.

B. Paper Organization

This paper is organized as follows. In Section II, we describe the principal component-based approach to constructing joint space basis elements and how these elements can be used to generate movements. Section III then presents our overall HMM-based optimization framework for movement generation. Section IV provides a detailed case study involving diverse sets of arm movements for a prototype humanoid robot, which is followed by a summary and topics for further study in Section V.

II. JOINT SPACE BASIS ELEMENTS VIA PRINCIPAL COMPONENTS

In our procedure for movement basis extraction via PCA, a human demonstrator first executes repeated trials of a specific movement (e.g., hand raising), which are then recorded with a motion capture device. Assuming that the kinematic parameters of the human subject are available, the movement data are stored in the form of joint angle trajectories obtained from the inverse kinematics. For each joint, PCA is then applied to the captured joint angle trajectories, and a set of dominant principal components (or, more precisely in our case, their continuous-time polynomial approximations) is selected as basis elements representing that movement class. This procedure is repeated for different types of basic movements, with the associated principal components constituting a convenient “primitive” representation for each movement class.

If the demonstrator executes a specific movement N times, with each movement sampled at M uniform time intervals, the joint angle trajectories of a particular joint can be represented as a sample movement matrix X ∈ ℝ^{M×N}; here, each column of X, denoted x_i ∈ ℝ^M, i = 1, 2, ..., N, represents a single (sampled) joint angle trajectory for one execution. The associated sample covariance matrix C ∈ ℝ^{M×M} can be obtained as

$$C = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top \qquad (1)$$

Fig. 1. (a) Joint angle trajectories obtained from human motion capture session. The thick line represents the mean joint angle trajectory. (b) First four dominant principal components extracted from human joint angle trajectories.

where x̄ is the sample mean. Denoting the eigenvalues of C, in decreasing order, by λ_i (i.e., λ_1 ≥ λ_2 ≥ ··· ≥ λ_M), the corresponding eigenvectors e_i are the principal components of X. The ratio between each eigenvalue and the sum of all the eigenvalues is then the percentage of the data explained by the corresponding principal component. Usually, the first p dominant principal components are selected as movement basis elements according to the following criterion:

$$\frac{\sum_{i=1}^{p} \lambda_i}{\sum_{i=1}^{M} \lambda_i} \geq T \qquad (2)$$

where T is a specific user-supplied threshold (e.g., 0.9 or 0.95). Fig. 1 shows an example of the captured joint angle trajectories and the four dominant principal components.
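To make the extraction step concrete, the following is a minimal Python sketch of (1) and (2) for a single joint. It is our own illustration rather than code from the paper, and the function name and return layout are assumptions:

```python
import numpy as np

def extract_basis(X, T=0.9):
    """PCA-based movement basis extraction for one joint.

    X : (M, N) array; each column is one sampled joint angle
        trajectory (M time samples, N demonstrations).
    Returns the mean trajectory, the first p principal components
    satisfying criterion (2), and their eigenvalues.
    """
    M, N = X.shape
    x_bar = X.mean(axis=1, keepdims=True)       # sample mean trajectory
    D = X - x_bar
    C = D @ D.T / N                             # sample covariance, Eq. (1)
    lam, E = np.linalg.eigh(C)                  # eigendecomposition of C
    order = np.argsort(lam)[::-1]               # sort eigenvalues descending
    lam, E = lam[order], E[:, order]
    explained = np.cumsum(lam) / np.sum(lam)
    p = int(np.searchsorted(explained, T)) + 1  # smallest p meeting Eq. (2)
    return x_bar.ravel(), E[:, :p], lam[:p]
```

In the paper's setting, the selected components would subsequently be fitted with continuous-time polynomials to obtain the basis functions b_i(t) used below.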

If we use only the first three principal components as the movement basis elements, the newly generated joint angle trajectory can be expressed as a linear combination of b_i(t), the continuous-time polynomial approximation of each principal component e_i, as follows:

$$q(t) = \bar{q}(t) + c_1 b_1(t) + c_2 b_2(t) + c_3 b_3(t) + c_4 \qquad (3)$$

where q̄(t) is the continuous-time polynomial approximation of the mean trajectory and the c_i are scalar weighting coefficients. What remains now is how to determine the coefficients c_i. One of the simplest ways is to perform a linear movement interpolation via the solution of the linear equation

$$\begin{bmatrix} b_1(t_0) & b_2(t_0) & b_3(t_0) & 1 \\ b_1(t_f) & b_2(t_f) & b_3(t_f) & 1 \\ \dot{b}_1(t_0) & \dot{b}_2(t_0) & \dot{b}_3(t_0) & 0 \\ \dot{b}_1(t_f) & \dot{b}_2(t_f) & \dot{b}_3(t_f) & 0 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix} = \begin{bmatrix} q(t_0) - \bar{q}(t_0) \\ q(t_f) - \bar{q}(t_f) \\ \dot{q}(t_0) - \dot{\bar{q}}(t_0) \\ \dot{q}(t_f) - \dot{\bar{q}}(t_f) \end{bmatrix} \qquad (4)$$

where {q(t_0), q(t_f), q̇(t_0), q̇(t_f)} are boundary conditions at the initial and final configurations specified by the user. The computational time for solving (4) is negligible. Modifying the number of principal components used or applying other boundary conditions, e.g., joint angle accelerations at the end points, can be accommodated in a similar fashion. Note that (4) must be applied separately to each joint.
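A correspondingly small sketch of solving (4) for one joint, again our own illustration; the sampled evaluation of the basis polynomials at the boundary instants is an assumption:

```python
import numpy as np

def interp_coeffs(b, b_dot, rhs):
    """Solve the boundary-condition system (4) for one joint.

    b, b_dot : (M, 3) samples of the basis polynomials b_i(t) and
               their time derivatives; rows 0 and -1 correspond to
               t0 and tf.
    rhs      : the four boundary residuals on the right side of (4).
    """
    A = np.array([
        [b[0, 0],      b[0, 1],      b[0, 2],      1.0],
        [b[-1, 0],     b[-1, 1],     b[-1, 2],     1.0],
        [b_dot[0, 0],  b_dot[0, 1],  b_dot[0, 2],  0.0],
        [b_dot[-1, 0], b_dot[-1, 1], b_dot[-1, 2], 0.0],
    ])
    return np.linalg.solve(A, np.asarray(rhs))  # c1, c2, c3, c4
```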


When the number of weights to be determined exceeds the number of boundary conditions, the weight values can be determined via optimization with respect to a suitable criterion. Suboptimal motions for physical criteria, e.g., minimum torque motion, can then generally be obtained far more efficiently than by using more general basis elements such as B-splines. Finally, we note that although our framework employs principal component representations of the joint trajectories, in principle, one can use any number of methods for constructing the basis elements for each movement primitive, e.g., independent component analysis, nonnegative matrix factorization, etc.

III. MOVEMENT GENERATION FRAMEWORK

By using the principal components as the movement basis elements, we can generate new joint angle trajectories that appear similar to the captured human joint angle trajectories. However, this apparent similarity in the joint space does not ensure a natural appearance in the task space. In previous case studies involving arm movements [24], we have shown that both the linear interpolation method using (4) and torque minimization, although efficient, produce movements that somehow fail to capture the small but essential details that result in the perception of a movement as being natural—the spatiotemporal characteristics of the movement in the task space (such as the motion of the hand while waving) clearly need to be considered.

To address this concern, we combine the PCA-based movement generation framework [12] and the HMM representation of movements used for gesture recognition into a single movement generation framework. Specifically, the HMM is trained in the task space for a specific movement primitive, using the same human motion capture data employed in the PCA-based movement basis extraction procedure. We then determine the linear combination of weights for the movement basis elements so as to maximize the probability for the trained HMM, subject to any user-imposed boundary joint values and constraints. The end result is a movement that is natural from the perspective of both the task and joint spaces and that satisfies the desired initial and final configurations of the end effector. We describe the proposed movement generation framework in the following two stages: 1) HMM setup and training and 2) motion optimization using the principal components with the trained HMM.

A. HMM Setup and Training

HMMs can be characterized as discrete state stochastic processes consisting of hidden states S and observation vectors o emitted from S. The characteristics of a specific HMM are completely determined by the probability distribution set λ_HMM = {A, B, π}, where A, B, and π are, respectively, the state transition probability distribution, the observation probability distribution, and the initial state distribution. In Fig. 2, a_ij comprising A represents the transition probability from state S_i to state S_j, i.e., a_ij = P(q_{t+1} = S_j | q_t = S_i), where we denote the actual state at time t as q_t, and b_j(o) comprising B represents the observation probability distribution at S_j.

Fig. 2. Examples of HMMs. (a) Ergodic HMM with four states. (b) Left–right HMM with three states.

If a mixture of M Gaussian densities is used, b_j(o) can be expressed as

$$b_j(o) = \sum_{m=1}^{M} w_{jm} \, \mathcal{N}_{jm}(o; \mu_{jm}, \Sigma_{jm}) \qquad (5)$$

where w_jm represents the mixture coefficient for the mth Gaussian mixture in the state S_j, and μ_jm and Σ_jm are the mean vector and the covariance matrix for the mth Gaussian mixture in the state S_j. Given the observation sequences, HMM training involves determining the probability distribution set λ_HMM = {A, B, π} so as to maximize the probability of the observation sequences for the given model. Many different training algorithms are available; e.g., the Baum–Welch algorithm is widely used in speech recognition applications [13].

As a first step toward using HMMs for movement generation, the observation vector must be defined; for this purpose, we use the 3-D Cartesian position and velocity of the end effector, which can be straightforwardly obtained from the forward kinematics once the joint angle trajectories and kinematic parameters of the human links are available. Because in most cases the link length proportions of the humanoid robot will differ from those of the human demonstrator, we use the Cartesian position normalized by the sum of the link lengths; i.e., if the Cartesian position with respect to the fixed frame is p = (X, Y, Z) and the normalized position is $\hat{p} = (\hat{X}, \hat{Y}, \hat{Z})$, then the observation vector is chosen to be $o = (\hat{X}, \hat{Y}, \hat{Z}, \dot{\hat{X}}, \dot{\hat{Y}}, \dot{\hat{Z}}) \in \mathbb{R}^6$. For the observation density, we use a single Gaussian density rather than a mixture of multiple Gaussians, because the training observation sequences are necessarily limited in practice.
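A minimal sketch of how such an observation sequence might be assembled, assuming the hand positions have already been computed via forward kinematics; the finite-difference velocity below stands in for the paper's curve-fitting and differentiation step:

```python
import numpy as np

def observation_sequence(p, dt, link_length_sum):
    """Build the observation sequence o_t = (p_hat, p_hat_dot) in R^6.

    p : (M, 3) array of Cartesian hand positions over M time samples.
    Positions are normalized by the sum of the link lengths so that a
    trained HMM transfers across different link length proportions.
    """
    p_hat = p / link_length_sum             # normalized position
    v_hat = np.gradient(p_hat, dt, axis=0)  # finite-difference velocity
    return np.hstack([p_hat, v_hat])        # (M, 6) observation matrix
```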


Next, the type of HMM must be decided. The most general type of HMM is the ergodic HMM shown in Fig. 2(a), in which all of the states are mutually connected. In the ideal case, the choice of the type of HMM should not be a critical factor. For example, in speech recognition applications, where the left–right HMM shown in Fig. 2(b) is widely used in accordance with the temporal characteristics of the spoken word, all the a_ij’s that

do not correspond to the left–right structure would vanish during the training process, even for the ergodic HMM. However, because the training observation sequences are noisy to some degree and their available quantity is limited in practice, choosing an appropriate type of HMM is crucial for effective training. For our application, in which the observation vector is defined to be the position and velocity of the end effector, the left–right HMM shown in Fig. 2(b) is found to be an appropriate choice.

The number of hidden states must also be decided. In the extreme, the number of states can be set equal to the length of the observation vector sequence [13]; however, this is not recommended, for the obvious reasons of limited training data and excessive computational requirements at the training stage. In the case of speech recognition, the number of hidden states typically corresponds to the number of phonemes in a specific word. Similarly, we can select the number of hidden states according to the spatial properties of a specific movement. We heuristically split the specific movement into the initial and final states and a minimal number of intermediate states that are representative of the movement characteristics. Fig. 3 shows examples of such choices of the number of hidden states for the hand movements considered in our case studies, i.e., hand raising and a gesture representing the figure “3.”

Fig. 3. Examples of choices of the number of hidden states for hand movements, namely, hand raising and the figure “3.”

We note that our method is not bound to any particular type of movement segmentation and that any number of methods based on, e.g., curvature considerations or information-theoretic criteria, such as the minimum description length, can be employed.

For HMM training, the human motion capture data used to extract the movement basis elements are also used to obtain the observation sequences. Thus, one observation sequence O = {o_1, o_2, ..., o_M} is available for each motion capture execution, and N sample observation sequences in total are available for HMM training. As a result of HMM training, multiple human demonstrations are generalized, and the essential characteristics of a specific movement are encoded in the probability distribution set λ_HMM = {A, B, π}.
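The paper performs this training with the Matlab HMM toolbox [25]; as a rough modern equivalent, a left–right Gaussian HMM could be trained with the Python hmmlearn library as sketched below. The left–right topology is imposed by zeroing the disallowed transitions, which Baum–Welch preserves; the helper name and parameter choices are our own assumptions:

```python
import numpy as np
from hmmlearn import hmm

def train_left_right_hmm(sequences, n_states=5):
    """Train a left-right Gaussian HMM on N observation sequences.

    sequences : list of (M, 6) observation matrices, one per
                motion capture execution.
    """
    # Left-right topology: start in state 0; each state may only
    # persist or advance to its immediate successor.
    start = np.zeros(n_states)
    start[0] = 1.0
    trans = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        trans[i, i] = trans[i, i + 1] = 0.5
    trans[-1, -1] = 1.0
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="full",
                            init_params="mc", n_iter=100)
    model.startprob_ = start
    model.transmat_ = trans
    X = np.vstack(sequences)                 # concatenated training data
    lengths = [len(s) for s in sequences]    # per-sequence lengths
    model.fit(X, lengths)                    # Baum-Welch (EM) training
    return model
```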

B. Movement Optimization

Once the trained HMM and the principal components extracted from the human motion capture data are available, we can determine the optimal movement, using the principal components as movement basis elements, that yields the highest probability for the trained HMM. This step can be expressed in the form of an optimization involving the trained HMM, i.e.,

$$\max_{c} P(\hat{O} \mid \lambda_{\mathrm{HMM}}) \qquad (6)$$


where c denotes the weights in the linear combination of the PCA-based basis elements and Ô is the observation vector sequence [computed from the forward kinematic equations with the joint angle trajectories obtained from the linear combination of the principal components for each joint as given by (3)], subject to appropriate boundary joint values at the initial and final configurations. When the joint limits of the humanoid robot differ from those of the human demonstrator, joint limit constraints can be easily inserted into the optimization process. The boundary joint values and joint limit constraints are considered, respectively, as linear equality and nonlinear inequality constraints in the optimization. The forward procedure [13] is used in calculating P(Ô|λ_HMM).

To reduce the optimization search space, we use the projected values of the human motion data onto the selected principal components. For each principal component of a specific joint, there exist N projected values of the original human sample movement matrix X ∈ ℝ^{M×N}. In order for the resulting movement to be similar to the sample human movements, each weighting coefficient of (3) should be close to the corresponding projected values of the human motion data. We therefore restrict the search space of each weighting coefficient according to |c_i| ≤ p_{i,max} · (1 + δ), where p_{i,max} is the maximum absolute value of the N projected values of the human movements onto the corresponding principal component and δ is a small positive number that allows the new movement to exceed the range of the original human movements. For our following case studies, δ is set to 0.3. Weighting coefficients not associated with the principal components, such as c_4 in (3), are restricted to lie between −π/2 and π/2.
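Schematically, and under the same hedges as before (the authors use Matlab's fmincon, whereas this sketch uses SciPy, and every name below is an assumption), the optimization stage might look as follows, reusing observation_sequence and the trained model from the earlier sketches:

```python
import numpy as np
from scipy.optimize import minimize

def optimize_movement(model, traj_from_weights, fk, c0, bounds,
                      q0, qf, dt, link_length_sum):
    """Search for basis weights c maximizing P(O_hat | lambda_HMM), Eq. (6).

    model             : trained GaussianHMM (see earlier sketch).
    traj_from_weights : callable mapping weights c to an (M, n_joints)
                        joint trajectory via Eq. (3), joint by joint.
    fk                : forward kinematics, joint trajectory -> (M, 3)
                        Cartesian hand path.
    """
    def neg_log_likelihood(c):
        q = traj_from_weights(c)               # joint trajectories, Eq. (3)
        p = fk(q)                              # Cartesian hand path
        O = observation_sequence(p, dt, link_length_sum)
        return -model.score(O)                 # forward-procedure log P

    # Boundary joint values as equality constraints; humanoid joint
    # limits could be appended here as inequality constraints.
    cons = [{"type": "eq", "fun": lambda c: traj_from_weights(c)[0] - q0},
            {"type": "eq", "fun": lambda c: traj_from_weights(c)[-1] - qf}]
    res = minimize(neg_log_likelihood, c0, bounds=bounds,
                   constraints=cons, method="SLSQP")
    return res.x                               # optimal weights
```

Here, bounds would encode the |c_i| ≤ p_{i,max}(1 + δ) restriction described above.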

IV. NUMERICAL AND EXPERIMENTAL CASE STUDIES

In this section, we examine the feasibility of our proposed movement generation framework via numerical and experimental case studies involving several arm movements. We consider a 4-DOF humanoid arm, consisting of a 3-DOF shoulder and a 1-DOF elbow, with the wrist assumed fixed (see Fig. 4).

Fig. 4. Four joints of the humanoid arm.

To test the adaptability of our framework to variations in the link length proportions, three different humanoid models are used in the case studies, as shown in Table I. The forearm length represents the length between the elbow and the reference point of the hand. The first humanoid model has the same link length proportions as the human demonstrator. Simple arm movements, including hand raising and a figure “3” gesture, are

considered in the case studies. The Matlab HMM toolbox [25] is used throughout for all calculations related to our HMMs.

TABLE I: THREE HUMANOID MODELS USED IN THE SIMULATION. ALL UNITS ARE IN MILLIMETERS.

A. Hand Raising

We first consider a simple hand-raising motion. The human demonstrator executes this motion 50 times, and the human joint angle trajectories are calculated from the recorded 3-D position data of the markers. The principal components and observation sequences are then extracted from the obtained joint angle trajectories. We use a left–right HMM with five hidden states defined as follows.

State 1) The hand is at the initial lower position, with the elbow outstretched.
State 3) The hand is in the middle position, with the elbow bent.
State 5) The hand is in the final vertical position, with the elbow outstretched.

States 2) and 4) are defined in the obvious way as the intermediate spatial positions of the adjacent states.

Fig. 5 shows both the training data and the training results of the associated HMM for the hand-raising motion. In Fig. 5(b), the estimated mean vectors of the observation Gaussian density for each state are plotted as asterisk markers, together with the mean of the observation sequences. Only the first three components of the mean vector, corresponding to the normalized Cartesian position of the hand, can be displayed. As shown in Fig. 3, the mean vectors are well placed at the positions presumed during the design of the associated HMM. Fig. 5(c) shows the log-likelihoods of each training observation sequence for the trained HMM, i.e., P(O|λ_HMM). In Fig. 5(d), for comparison, the maximum and minimum probability training sequences for the trained hand-raising HMM are displayed with the mean observation sequence. Again, only the first three components, i.e., the normalized Cartesian hand trajectories, are plotted. The maximum probability observation sequence, whose log-likelihood for the trained HMM is −59.5, appears very similar to the mean observation sequence, whose log-likelihood is −36.1, whereas the minimum probability observation sequence, whose log-likelihood is −584.3, does not. These results suggest that the HMM has been adequately trained for the hand-raising motion.

Fig. 5. (a) Cartesian hand trajectories of human demonstrations of the hand-raising motion. (b) Results of HMM training for the hand-raising motion. The thick line represents the mean of the observation sequences. (c) Log-likelihoods of training observation sequences for the trained HMM. (d) Normalized Cartesian hand trajectories of the (A) mean, (B) maximum probability, and (C) minimum probability hand-raising motions.

Before showing the results of our movement optimization, we illustrate, with examples, the problems that arise when directly applying human motion capture data to humanoids. First, the resulting motions may exceed joint limits, as is the case for the humanoid elbow joint limit being set to (2/3)π in Fig. 6(a). Fig. 6(b) shows the resulting Cartesian hand trajectories obtained from the human joint trajectories for our three humanoid models. The log-likelihood of the observation sequence for humanoid model 1 for the trained


HMM is −197.5, whereas those for humanoid models 2 and 3 are, respectively, −308.5 and −328.1. These numbers reflect the diminished naturalness of the movement arising from the difference in length proportions.
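In code terms, and again as an illustrative assumption built on the earlier sketches rather than the authors' implementation, this comparison amounts to scoring each retargeted movement with the trained HMM:

```python
# `observation_sequence` and the trained `model` are from the earlier
# sketches; `fk` is a placeholder forward-kinematics callable for one
# humanoid model, and `q_human` the captured human joint trajectories.
def naturalness_score(model, fk, q_human, dt, link_length_sum):
    p_hand = fk(q_human)                    # (M, 3) Cartesian hand path
    O = observation_sequence(p_hand, dt, link_length_sum)
    return model.score(O)                   # log P(O | lambda_HMM)
```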


Fig. 6. Examples of problems encountered when directly applying human motion capture data to humanoids. (a) Human motion capture data may exceed the humanoid joint limits. The dotted line represents the humanoid’s elbow joint limit. (b) The movement may appear unnatural due to the different link length proportions. The lines A, B, and C represent the normalized Cartesian hand trajectories of the hand-raising motion applied, respectively, to humanoid models 1, 2, and 3.

TABLE II: PERCENTAGE OF DATA EXPLAINED BY EACH PRINCIPAL COMPONENT

We now compare the movements obtained from the optimization stage with the previous results. Table II shows the percentage of data explained by each principal component for our basic movements (hand raising and figure “3”). For both cases, we use the four dominant principal components in the optimization, which explain at least 80% of the original motion capture data. The joint angle values at the initial and final configurations are imposed as equality constraints in the optimization. We first examine the effects of different joint limits on optimized hand-raising movements for humanoid model 1. The solid and dashed lines of Fig. 7(a) represent, respectively, the optimized joint angle trajectories for the case without and with elbow joint limits (set to (2/3)π in the latter case). The corresponding normalized Cartesian trajectories of the hand position

are shown in Fig. 7(b). See video attachment 1, available at http://ieeexplore.ieee.org, for the corresponding movie of the optimized movements. The log-likelihood of the movement is −56.3 in the case without the elbow joint limit and −95.9 in the joint limit case.

Fig. 7. (a) Joint angle trajectories of the optimized hand-raising motions (solid line) without and (dashed line) with the elbow joint limit. The dotted line represents the elbow joint limit. (b) Normalized Cartesian hand trajectories of the optimized hand-raising motion (A) without and (B) with elbow joint limits.

We next examine the results of the optimized motions for different link length proportions. We first optimize the hand-raising motion for humanoid model 1 for the prescribed boundary values of the joints. We then determine the optimal motions for humanoid models 2 and 3, using boundary values calculated such that the initial and final normalized Cartesian hand positions are the closest (in the sense of least-squares distance) to those for humanoid model 1. From the results shown in Fig. 8 and video attachment 2 (also available at http://ieeexplore.ieee.org), the similarities between the resulting normalized Cartesian hand trajectories are immediately apparent in spite of the differences evident in the corresponding joint angle trajectories. The log-likelihoods of the optimized motions for humanoid models 1, 2, and 3 are, respectively, −56.3, −62.3, and −67.4.

Fig. 8. (a) Joint angle trajectories for the optimized hand-raising motions. The solid, dashed, and dotted lines represent, respectively, the results for humanoid models 1, 2, and 3. (b) Normalized Cartesian hand trajectories of the optimized hand-raising motions. The lines A, B, and C represent, respectively, the results for humanoid models 1, 2, and 3.

B. Figure “3” Gesture

We now apply our proposed framework to an arm gesture corresponding to the figure “3.” The entire procedure for movement generation is identical to the previous cases. A left–right HMM with seven states is used (see Fig. 3). Again using 50 executions by the human demonstrator, the results of the HMM training are shown in Fig. 9. Fig. 9(b) shows that our choice of the number of states for the figure “3”



gesture is appropriate. The log-likelihoods of the maximum and minimum probability figure “3” gestures are, respectively, −162.7 and −420.8, whereas that of the mean observation sequence is −141.8. The corresponding normalized Cartesian hand trajectories are shown in Fig. 9(d).

Fig. 9. (a) Cartesian hand trajectories of human demonstrations for the figure “3” gesture. (b) Results of HMM training. The thick line represents the mean of the observation sequences. (c) Log-likelihoods of training observation sequences for the trained HMM. (d) Cartesian hand trajectories for the (A) mean, (B) maximum probability, and (C) minimum probability motions.

The figure “3” gesture generation results are shown in Fig. 10 and video attachment 3, available at http://ieeexplore.ieee.org. The boundary joint values for each humanoid model are set following the hand-raising case. Despite the differences in joint angle trajectories for the three humanoid models, the normalized Cartesian hand trajectories are almost identical. The log-likelihoods of the resulting gestures for humanoid models 1, 2, and 3 are, respectively, −138.9, −146.2, and −143.3, suggesting that the training has been adequate.

C. Application to the KIBO Humanoid Prototype

We now demonstrate the results of our movement generation procedure using the KIBO humanoid robot prototype. Similar to the robot used for our earlier simulation studies, KIBO also has a 4-DOF arm (3-DOF shoulder and 1-DOF elbow). KIBO’s upper arm and forearm lengths are both 120 mm, and its elbow joint limit is set to (110/180)π. The two simple movements used in the previous case studies are generated again for KIBO. The initial and final boundary joint values for each movement are specified in the same way as in the previous case studies. The optimized motions are shown in Fig. 11 and video attachment 4, available at http://ieeexplore.ieee.org.


The white lines represent the 2-D trajectories of the reference point on the hand. For the resulting hand-raising motion shown in the top row of Fig. 11, it can be verified that the resulting motion is natural even with the elbow joint limit constraint. The


resulting figure “3” gesture, which is shown in the bottom row of Fig. 11, also appears quite satisfactory. The log-likelihoods of the optimized hand-raising and figure “3” gestures are, respectively, −147.1 and −138.4 (also summarized in Table III). The log-likelihoods expressed in italics denote cases where the elbow joint limit constraint was in effect. It should be noted that the comparison of log-likelihoods across different movement classes is not meaningful because each movement class possesses its own trained HMM.

Fig. 10. (a) Joint angle trajectories of the optimized figure “3” gestures. The solid, dashed, and dotted lines represent, respectively, the results for humanoid models 1, 2, and 3. (b) Normalized Cartesian hand trajectories of the optimized figure “3” gestures. The lines A, B, and C represent, respectively, the results for humanoid models 1, 2, and 3.

Fig. 11. Movement generation results for the humanoid robot prototype KIBO. (Top) Hand raising. (Bottom) Figure “3” gesture.

TABLE III: LOG-LIKELIHOODS FOR THE TRAINED HMMs

The average computational times for generating each movement are shown in Table IV. The optimization is performed on a Pentium Dual-Core 2.8-GHz processor with 2 GB of memory using the “fmincon” function of the Matlab Optimization Toolbox. The optimization termination tolerances on the objective function value and the weighting coefficients are all set to 0.01. The computational times are averaged over ten trials for each movement with varying boundary joint values.

TABLE IV: AVERAGE COMPUTATIONAL TIMES

From Table IV, we can see that the computational time increases slightly with the number of states. An analysis reveals that, in the evaluation of the objective function, most of the computational time is spent calculating the forward kinematics for the observation vector, particularly in the curve-fitting and differentiation procedure used to obtain the requisite velocities for the hand trajectory. In our simulations, this stage is sped up considerably, without any noticeable loss in accuracy, by using finite-difference approximations of the derivatives. It should be noted that the main emphasis of this paper is not on minimizing computation times but on evaluating the validity of the algorithm and providing a qualitative analysis of the generated movements. We expect that further speedup can be attained via, for example, more efficient approximation techniques, customized optimization algorithms, and even by implementing our algorithm directly in C or C++.

D. Comparison With Other Algorithms

We now compare our movement generation procedure with other natural movement generation algorithms reported in the literature. In [24], we have already shown that our proposed approach can yield far more natural movements than linear interpolation or joint torque minimization. Here, among the various algorithms introduced in [5], the following four algorithms are compared: the minimum end-effector velocity and acceleration movements (MV and MA) in 3-D Cartesian space and the minimum angular velocity and acceleration movements (MAV and MAA) in the joint space. Lim et al. [12] report that it is more efficient to use, as basis elements in the movement optimization, the principal components of human joint trajectories than, for example, piecewise B-spline polynomials; we therefore use the principal components as the movement basis elements for these four algorithms as well.

The 3-D Cartesian trajectories of the resulting movements generated via each method under the same boundary conditions for each movement (hand-raising and figure “3” gestures) are shown in Fig. 12. The log-likelihoods of the resulting hand-raising movements generated via our approach, MV, MA, MAA, and MAV for the trained HMM are, respectively, −41.3, −237.5, −177.7, −339.1, and −258.5. We can also see that the trajectories of the hand-raising movements generated via


our approach, shown in Fig. 12(a), are much more similar to the mean trajectories shown in Fig. 5(b) than the trajectories of the other algorithms.

Fig. 12. Normalized Cartesian hand trajectories of (a) hand-raising movements and (b) figure “3” gestures generated via (A) our framework; (B) hand velocity and (C) acceleration minimization; and (D) joint velocity and (E) acceleration minimization.

The advantage of our proposed movement generation framework over MV, MA, MAV, and MAA can be seen even more clearly in the results for the figure “3” gesture shown in Fig. 12(b). Although the principal components representing the joint space characteristics of the figure “3” gesture are used as the movement basis elements, none of the other methods considers the task space characteristics of the figure “3” gesture in the optimization process; the consequence is an unnatural-looking movement, as shown in Fig. 12(b). The log-likelihoods of the resulting figure “3” gestures generated via our approach, MV, MA, MAA, and MAV are, respectively, −163.7, −1019.1, −1241.9, −956.1, and −1135.4, further supporting our claims of naturalness.

E. Discussion

Beyond the simple arm movements demonstrated in our experiments, in principle, our framework can also be extended to more complex movements, which are both of a longer duration and involve a greater number of states to describe the movement. To assess the associated computational complexity, we recall that the complexity of the forward procedure for computing P(Ô|λ_HMM) is O(N²T), where N and T denote, respectively, the number of states and the length of the observation sequence [13].
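For reference, a log-space sketch of the forward procedure (the standard algorithm from [13], not code from the paper) makes the O(N²T) cost visible: each of the T − 1 induction steps performs an N × N operation.

```python
import numpy as np
from scipy.special import logsumexp

def forward_log_likelihood(log_b, log_A, log_pi):
    """Log-space forward procedure: log P(O | lambda) in O(N^2 T).

    log_b  : (T, N) per-state observation log-densities log b_j(o_t).
    log_A  : (N, N) log transition matrix.
    log_pi : (N,) log initial state distribution.
    """
    T, N = log_b.shape
    alpha = log_pi + log_b[0]                  # initialization
    for t in range(1, T):                      # T - 1 induction steps
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_b[t]
    return logsumexp(alpha)                    # termination
```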


Based on this result, it would seem that, in general, it is better to break up movements involving a large number of states into a concatenation of smaller (and simpler) movement segments. Such a movement segmentation can be manually or automatically performed, as done in [26] and [27]. While it is difficult to provide a precise quantitative projection, because it would depend on factors such as the kinematic structure and the geometric and topological characteristics of the movement trajectory, for our arm movement studies, we have found that limiting the number of states for each movement to below ten seems to be reasonably effective.

In our HMM training procedure, we use the 3-D Cartesian position and velocity of the end effector, normalized by the total length of the links, as the observation vector. Via such a normalization, it is possible to apply the trained HMM to other humanoids with different link length proportions without retraining the HMM. The observation vector normalization also allows us to use the motion capture data of multiple human subjects for HMM training purposes without degrading the quality of the resulting humanoid movement. For the movement basis element extraction procedure via PCA, using multiple human subjects should not significantly affect the quality of the generated movements, considering that the joint angle trajectories should not significantly differ, provided that they belong to the same movement class.

Finally, we remark that our proposed framework is best suited for situations where natural movements are a high priority, e.g., gestural human–robot interaction and entertainment robots performing dance or athletic movements. In its current form, our framework is not well suited for situations where movement accuracy is critical, e.g., two-arm cooperative movements interacting physically or manipulating a common object. Indeed, for movements requiring any combination of precise force-position-impedance control, demanding that the movement further appear natural would seem to call for an altogether different approach. For two-arm movements that do not involve physical interaction, one straightforward way to apply our framework is to first generate one arm movement using our methodology; the second arm movement can then be obtained from the same methodology, with collision avoidance between the two arms imposed as constraints in the associated optimization for the latter arm.

V. CONCLUSION

This paper has presented an algorithm for generating natural humanlike movements, which simultaneously accounts for task and joint space characteristics in a unified and consistent fashion. Each movement class is represented as an HMM, which is trained with only the task space trajectories of human motion capture data. At the same time, basis elements in the form of joint angle trajectories for each movement primitive are extracted via the PCA of the motion data. Movements are then generated by selecting the basis element weights that result in the maximum probability for the given HMM and that meet boundary conditions and other user-specified constraints.

The results obtained from our numerical and experimental case studies demonstrate the advantages of our proposed framework. The resulting movements appear substantially more natural than, for example, movements obtained from a torque minimization procedure or movements obtained by directly


applying human joint trajectory data to the humanoid platform. Another important advantage of our method is that any kinematic differences between the humanoid and human subjects, i.e., different link length proportions or joint limit constraints, are naturally accounted for in the embedded optimization procedure.

Our current efforts are focused on the following: 1) considering, in lieu of PCA, other methods for constructing the joint space basis elements, e.g., independent component analysis and nonnegative matrix factorization; 2) replacing the manual movement segmentation step in the HMM representation of a complex movement with an automatic segmentation procedure, such as one based on minimizing the minimum description length of the movement encoding; 3) seeking ways to improve the computational efficiency of the optimization procedure; and 4) considering cyclic movements, such as walking and running, and more complex movements that arise in sports and dance.

REFERENCES

[1] D. Matsui, T. Minato, K. MacDorman, and H. Ishiguro, “Generating natural motion in an android by mapping human motion,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2005, pp. 3301–3308.
[2] M. Bennewitz, F. Faber, D. Joho, M. Schreiber, and S. Behnke, “Towards a humanoid museum guide robot that interacts with multiple persons,” in Proc. 5th IEEE-RAS Int. Conf. Humanoid Robots, 2005, pp. 418–423.
[3] L. Ren, A. Patrick, A. A. Efros, J. K. Hodgins, and J. M. Rehg, “A data-driven approach to quantifying natural human motion,” ACM Trans. Graph., vol. 24, no. 3, pp. 1090–1097, Jul. 2005.
[4] G. Simmons and Y. Demiris, “Imitation of human demonstration using a biologically inspired modular optimal control scheme,” in Proc. 4th IEEE-RAS Int. Conf. Humanoid Robots, 2004, pp. 215–234.
[5] F. E. Pollick, J. G. Hale, and M. Tzoneva-Hadjigeorgieva, “Perception of humanoid movement,” Int. J. Humanoid Robot., vol. 2, no. 3, pp. 277–300, 2005.
[6] A. Safonova, J. K. Hodgins, and N. S. Pollard, “Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces,” ACM Trans. Graph., vol. 23, no. 3, pp. 514–521, Aug. 2004.
[7] N. A. Bernstein, The Coordination and Regulation of Movements. London, U.K.: Pergamon, 1967.
[8] T. D. Sanger, “Human arm movements described by a low-dimensional superposition of principal components,” J. Neurosci., vol. 20, no. 3, pp. 1066–1072, Feb. 2000.
[9] F. A. Mussa-Ivaldi and E. Bizzi, “Motor learning through the combination of primitives,” Philos. Trans. R. Soc. Lond. B, Biol. Sci., vol. 355, no. 1404, pp. 1755–1769, Dec. 2000.
[10] O. C. Jenkins and M. Mataric, “Deriving action and behavior primitives from human motion data,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2002, pp. 2551–2556.
[11] O. C. Jenkins and M. Mataric, “Automated derivation of primitives for movement classification,” Auton. Robots, vol. 12, no. 1, pp. 39–54, Jan. 2002.
[12] B. Lim, S. Ra, and F. C. Park, “Movement primitives, principal component analysis, and the efficient generation of natural motions,” in Proc. IEEE Int. Conf. Robot. Autom., 2005, pp. 4630–4635.
[13] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[14] J. Yamato, J. Ohya, and K. Ishii, “Recognizing human action in time-sequential images using hidden Markov model,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., 1992, pp. 379–385.
[15] T. Starner and A. Pentland, “Real-time American sign language recognition using desk and wearable computer based video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1371–1375, Dec. 1998.
[16] A. D. Wilson and A. F. Bobick, “Parametric hidden Markov models for gesture recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 9, pp. 884–900, Sep. 1999.
[17] J. Yang, Y. Xu, and C. S. Chen, “Hidden Markov model approach to skill learning and its application to telerobotics,” IEEE Trans. Robot. Autom., vol. 10, no. 5, pp. 621–631, Oct. 1994.
[18] J. Yang, Y. Xu, and C. S. Chen, “Human action learning via hidden Markov model,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 27, no. 1, pp. 34–44, Jan. 1997.
[19] T. Inamura, H. Tanie, and Y. Nakamura, “Keyframe compression and decompression for time series data based on the continuous hidden Markov model,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2003, pp. 1487–1492.
[20] S. Calinon and A. Billard, “Stochastic gesture production and recognition model for a humanoid robot,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2004, pp. 2769–2774.
[21] S. Calinon and A. Billard, “Recognition and reproduction of gestures using a probabilistic framework combining PCA, ICA and HMM,” in Proc. 22nd Int. Conf. Mach. Learn., 2005, pp. 105–112.
[22] S. Calinon and A. Billard, “Learning of gestures by imitation in a humanoid robot,” in Imitation and Social Learning in Robots, Humans and Animals: Behavioural, Social and Communicative Dimensions, K. Dautenhahn and C. L. Nehaniv, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2007.
[23] T. Asfour, F. Gyarfas, P. Azad, and R. Dillmann, “Imitation learning of dual-arm manipulation tasks in humanoid robots,” in Proc. 6th IEEE-RAS Int. Conf. Humanoid Robots, 2006, pp. 40–47.
[24] J. Kwon and F. C. Park, “Using hidden Markov models to generate natural humanoid movements,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2006, pp. 1990–1995.
[25] K. Murphy, Hidden Markov Model (HMM) Toolbox for Matlab. [Online]. Available: http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
[26] J. Barbič, A. Safonova, J. Y. Pan, C. Faloutsos, J. K. Hodgins, and N. S. Pollard, “Segmenting motion capture data into distinct behaviors,” in Proc. Graph. Interface, 2004, pp. 185–194.
[27] D. Bouchard and N. Badler, “Semantic segmentation of motion capture using Laban movement analysis,” in Proc. Intell. Virtual Agents, 2007, vol. 4722, pp. 37–44.

Junghyun Kwon received the B.S. and Ph.D. degrees in mechanical engineering from Seoul National University, Seoul, Korea, in 2002 and 2008, respectively. He is currently with the School of Mechanical and Aerospace Engineering, Seoul National University. His research interests include visual tracking and estimation in robotics-related applications, movement recognition and generation for human–robot interaction, and 2-D/3-D image visualization for medical imaging and nondestructive testing.

Frank C. Park received the B.S. degree in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, in 1985, and the Ph.D. degree in applied mathematics from Harvard University, Cambridge, MA, in 1991. From 1991 to 1994, he was an Assistant Professor of mechanical and aerospace engineering with the University of California, Irvine. Since 1995, he has been with the School of Mechanical and Aerospace Engineering, Seoul National University, Seoul, Korea, where he is currently a Professor. He is an Associate Editor of the American Society of Mechanical Engineers Journal of Mechanisms and Robotics and the Parts Editor of the Springer-Verlag Handbook of Robotics. His research interests are robotics, mathematical systems theory, and related areas of applied mathematics. Dr. Park is a 2007–2008 IEEE Robotics and Automation Society Distinguished Lecturer and serves as Senior Editor of the IEEE TRANSACTIONS ON ROBOTICS and as the Secretary of the IEEE Robotics and Automation Society.
