Intelligent Service Robotics manuscript No. JIST-D-17-00046
Humanoids Skill Learning Based on Real-Time Human Motion Imitation Using Kinect

Reda Elbasiony · Walid Gomaa
Received: 08 May 2017 / Revised: 24 Oct 2017 / Accepted: 06 Feb 2018
Abstract In this paper, a novel framework which enables humanoid robots to learn new skills from demonstration is proposed. The proposed framework makes use of a real-time human motion imitation module as a demonstration interface for providing the desired motion to the learning module in an efficient and user-friendly way. This interface overcomes many problems of the currently used interfaces such as direct motion recording, kinesthetic teaching, and immersive teleoperation. It gives the human demonstrator the ability to control almost all body parts of the humanoid robot in real time (including hand shape and orientation, which are essential to perform object grasping). The humanoid robot is controlled remotely and without using any sophisticated haptic devices; it depends only on an inexpensive Kinect sensor and two additional force sensors. To the best of our knowledge, this is the first time the Kinect sensor has been used to estimate hand shape and orientation for object grasping within the field of real-time human motion imitation. The observed motions are then projected onto a latent space using the Gaussian Process Latent Variable Model (GPLVM) to extract the relevant features. These relevant features are then used to train regression models through the Variational Heteroscedastic Gaussian Process Regression (VHGPR) algorithm, which has proved to be a very accurate and very fast regression algorithm. Our proposed framework is validated using different activities involving both the human upper and lower body parts, as well as object grasping.

Keywords Imitation learning · humanoid robot · Gaussian Process Latent Variable Model (GPLVM) · Variational Heteroscedastic Gaussian Process Regression (VHGPR) · Kinect sensor · NAO robot · Grasping

Reda Elbasiony
Cyber-Physical Systems Lab, Egypt-Japan University of Science and Technology (E-JUST), P.O. Box 179, New Borg El-Arab City, Postal Code 21934 Alexandria, Egypt.
Faculty of Engineering, Tanta University, Tanta 31511, Egypt.
Tel.: +20-100-293-7033
E-mail: [email protected]

Walid Gomaa
Cyber-Physical Systems Lab, Egypt-Japan University of Science and Technology (E-JUST), P.O. Box 179, New Borg El-Arab City, Postal Code 21934 Alexandria, Egypt.
Faculty of Engineering, Alexandria University, Alexandria 21544, Egypt.
E-mail: [email protected]
1 Introduction

The design of humanoid robots has been inspired by the structure of the human body so that they can interact with the real world in the same manner that people do. Humanoids are expected to be integrated into our everyday life to help us perform our tasks. However, controlling them to perform the everyday activities that people can do is not easy, for several reasons. First, humanoid robots are very complex systems with many degrees of freedom, which makes controlling them in traditional ways very tough and time-consuming. Second, most of the targeted users of humanoid robots are not robotics experts and cannot program or control such a complex system [34]. Finally, it is impossible for humanoid robot manufacturers to cover all the required tasks and embed their instructions into the operating system of the robot by default.
At the beginning of the 1980s, 'Programming by Demonstration' (PbD) and 'Imitation Learning' started to appear in the field of robotics to replace the traditional methods of robot programming [3]. Imitation learning is an approach in which the robot learns by being provided with a set of demonstrations of some skill, which are then generalized and reproduced through suitable machine learning techniques [30]. Therefore, controlling the robot can be considered just like controlling any other typical daily tool such as a car or even a video game. Thus, the process of robot programming does not require any previous experience in robotics or control systems and can be achieved by lay people within an acceptable time.

Skill representation is an essential matter in the field of imitation learning. There are two main trends for skill representation: a high-level representation method and a low-level representation method. The high-level representation, recently referred to as symbolic encoding, depends on decomposing any skill into a sequence of predefined action-perception units. Although this encoding is suitable for reproducing complex high-level skills, it requires pre-defining a set of basic controllers for reproduction. The low-level representation, recently referred to as trajectory encoding, takes the form of a non-linear mapping between sensory and motor information, which gives a general representation of motion and allows encoding of various types of signals. However, it cannot be used in the reproduction of complicated high-level skills [3].

Many types of sensors can be used in the motion capturing process. Early approaches used traditional video cameras, where the required motion is observed as video streams. This method is more suitable for symbolic encoding only, because it depends on dividing the overall skill into predefined primitive actions and recognizing them using a classification technique. Moreover, this method cannot give accurate information about the motion, such as 3D joint positions or joint angles [19]. Some newer approaches use very accurate motion sensors such as the Xsens MVN motion capture system, which consists of a set of inertial sensors attached to the desired body segments of the human demonstrator to give very accurate positions of these parts while acting out skills [17]. However, it is uncomfortable for humans to fix many wired sensors to their bodies; moreover, these sensors are very expensive and are not within reach of many people. Recently, some other approaches depend on the Microsoft Kinect sensor for motion capturing. Although this type of sensor does not give accuracy as high as the Xsens MVN motion sensors, it is more comfortable because it does not require any
parts to be fixed on the human body, and it has a very low price compared to them.

Providing the demonstrations to the robot is not an easy task. It requires a suitable interface that can capture the human motion and transfer it in a suitable way to the robot, so that the robot can imitate the movements using its joints and effectors. Three main types of interfaces are used in the literature. One of the oldest methods depends on recording the human motion directly through vision techniques or some wearable sensors; the required analysis and optimization are then performed offline to extract the motion from the recorded data and to map it to the robot body [42] and [18]. However, this method is suitable only for generating human-like motions and cannot be used in cases that require object manipulation or precise control, because the action demonstration and the robot imitation always happen in different phases. Another common interface depends mainly on a direct connection between human and robot. The human demonstrator controls the robot body physically by moving its limbs directly to perform the required task. This method is called kinesthetic teaching [4]. Although this approach does not require any mapping between human and robot bodies, it is very hard to use when many degrees of freedom must be controlled simultaneously or when the robot's legs must be moved while keeping it balanced, for example. It is clear that the human demonstrator cannot move many robot limbs simultaneously using only two hands. The third known interface depends on immersive teleoperation, where the human demonstrator uses haptic devices [34] or remote controls [14] to control the robot movements. Teleoperation can be employed in tasks that require precise control and object manipulation and does not require mapping models. However, very complicated and expensive haptic devices are needed when many degrees of freedom must be synchronized. Also, the human demonstrator must first be trained, possibly for a long time, on using these remote controls or haptic devices.

So, the first contribution of our work is proposing a new interface for providing whole-body human demonstrations to the robot in a reliable way. The interface can be used not only to produce anthropomorphic motions or signs but also, in contrast to all similar previous work, to manipulate and grasp objects remotely using the robot hands. The proposed interface depends on the Microsoft Kinect sensor for motion capturing, which allows the human demonstrator to move freely while acting out the demonstration, in the same way that the 'directly recording human action' method works. However, all the processing happens in real time, with
acceptable delay, and does not need a separate phase for analysis or optimization like the old vision-based methods. The interface also allows the human demonstrator to perform the required task using the robot's own body, like kinesthetic teaching and immersive teleoperation, because the demonstrator effectively controls the robot body remotely in real time to perform the task. However, the demonstrator can control all the needed effectors simultaneously, without using any remote controls or haptic devices, and without requiring any previous training or experience before using the system. Controlling the robot in this way gives the human demonstrator the ability to perform the required task precisely, as if being in a closed-loop control cycle with the robot, where the ability of the human demonstrator to observe the robot motion visually in real time can be considered the feedback of this control system.

The second contribution of this work is the integration between the learning framework and the demonstration interface, giving the robot the ability to learn the demonstrated motions and reproduce them autonomously, not merely imitate them. In the proposed framework, the Variational Heteroscedastic Gaussian Process Regression (VHGPR) algorithm is employed for the first time in the topic of learning by imitation. The accuracy of this algorithm has been verified by comparison with the gold standard Markov chain Monte Carlo (MCMC); it gives an accuracy close to MCMC with a very low computational cost compared to it [22].

The rest of this paper is organized as follows. Section 2 presents the related work on both human motion imitation and learning. Section 3 gives a background on the used machine learning techniques: GPLVM, DTW, and VHGPR. Section 4 presents our proposed framework in detail. Section 5 presents the experiments which are used to validate the framework. Finally, Section 6 concludes the paper and proposes some directions for future work.

2 Related Work

2.1 Human motion imitation

One of the first approaches to real-time human motion imitation was proposed in [37]. Riley et al. use simple colored markers attached to the upper body of the human demonstrator and a stereo vision system to determine the human joint angles and then find an inverse kinematic solution to be mapped to the humanoid. Symbolic representations of some primitive motions are extracted offline. These primitive motions consist of essential postures in arm motions and step
primitives in leg motions, which are used to generate sequences of joint angles to be modified to satisfy some mechanical constraints of the robot. The leg motions, however, are generated based on the desired Zero Moment Point (ZMP). In [29] and [28], Nakaoka et al. propose an imitation technique for dance movements for the humanoid robots HRP-1S and HRP-2. The leg motions are improved to be very close to the demonstrated motions by proposing stable leg task models to be used in the imitation process. These leg motion primitives are recognized from the captured motion data and regenerated in the imitation phase to ensure the robot's balance. In [10], Chalodhorn et al. propose a framework for learning human behavior by imitation through sensory-motor mapping in reduced dimensional spaces, where the offline optimization of the motions is performed in the reduced space. The authors consider the task of teaching a humanoid robot to walk through imitation. The Fujitsu HOAP-2 humanoid robot is used in their experiments. In [31], a spring-model-based Cartesian control approach is proposed, where a set of control points on the humanoid is selected and the robot is virtually connected to the measured marker points via translational springs. The authors perform their experiments on the IRT humanoid robot, which imitates the motion of the upper body of a human demonstrator and uses the humanoid legs mainly for balancing. Suleiman et al. in [40] formulate the imitation problem as a constrained optimization problem. The main objective is to make the humanoid robot able to reproduce imitated motions which are very close to the corresponding demonstrated motions through an optimization technique constrained by the physical limits of the humanoid robot. The experiments are performed on some upper-body motions only of the HRP-2 humanoid robot. In [12], Dariush et al. formulate 'the human to humanoid retargeting problem' as a task space control problem. They control the upper body of the humanoid by computing generalized coordinates which minimize the Cartesian tracking error between the normalized human motion descriptors and the corresponding motion descriptors on the humanoid robot's upper body. Experiments are performed on the Honda humanoid robot ASIMO. In [43], Yamane and Hodgins present a control framework that contains two main components. The first component is a balance controller which uses a simplified humanoid model to obtain the desired input to keep the balance based on the current state of the robot. The second component is a tracking controller which computes the joint torques that minimize the deviation from the desired inputs and the error
from the desired joint accelerations to track the motion capture data. They use the humanoid robot developed by Sarcos and owned by Carnegie Mellon University in their experiments. In [36], Ramos et al. propose a methodology for reshaping human motions and adapting the dynamics of these motions to the dynamics of humanoid robots based on an inverse dynamics control scheme with a quadratic programming optimization solver. The HRP-2 humanoid is used to test this work. In [39], Stanton et al. propose a human motion imitation system based on a mapping between sensor data from the Xsens MVN motion capture system and the angular position of the robot's actuators. This mapping is obtained by training a feed-forward neural network for each degree of freedom (DOF) of the robot. In the initial training phase, the human demonstrator is asked to imitate some predefined motions performed by the humanoid robot. However, the motions which can be imitated by this method are very limited. The NAO H23 humanoid robot is used in their experiments. In [9], Cela et al. propose an imitation system using a motion capture technique that depends on a human suit consisting of 8 sensors: six resistive linear potentiometers on the legs and two digital accelerometers for the arms. A stability control module is proposed to keep the robot balanced during imitation based on a feedback control system using an accelerometer placed on the robot's back. The single support mode is possible in this system; however, it cannot be used to imitate complex motions. The MechRc educational humanoid robot is used in the experiments. In [17], Koenemann et al. present a system that generates a statically stable pose for every point in time depending on the positions of the end effectors and the center of mass of a compact human model. The authors use the Xsens MVN motion capture system to capture the motion of the human demonstrator. This system was tested with the humanoid NAO H25 to imitate complex whole-body motions including the single support mode.

All the abovementioned methods suffer from some limitations. Some of them consider imitation of upper body motions only [37], [31], [40] and [12]; some other methods require offline training or optimization phases or even some predefined tasks [29], [28], [10], [40], [43], [36] and [39]; and some others depend on very expensive sensors for human motion capturing which are not available to many users and are also not comfortable for the human demonstrators [39] and [17].

The Kinect sensor has also recently been used in some other human motion imitation systems. In [24], Luo et al. propose an approach to control an anthropomorphic dual-arm robot in real time. They use a Kinect sensor to capture the positions of human skeleton joints and
provide them to the robot to control its arms based on Cartesian impedance control. In [32] and [23] the authors propose whole-body human motion imitation systems which also depend on the Kinect sensor for human motion capturing. Although the methods presented in these works are very similar to ours, the reported experiments show only the ability of the robot to produce poses which are analogous to the corresponding human poses. However, the real value of learning by imitation comes from ensuring that the robot can deal with the real world using the ability to act like a human, not from merely mimicking a single pose at a time. This can be achieved by ensuring that the robot can perform a complete skill while manipulating or grasping an object, which is a principal point in our work.

2.1.1 Object grasping using Kinect

In contrast to wearable sensors, the Kinect sensor has rarely been utilized for teaching robots to grasp objects by imitation. This is due to its inability to give accurate information about the user's hand state and orientation, which are crucial for object grasping. In [11], the authors propose a human-aided robotic grasping approach based on teleoperation using Kinect. They use a point-cloud-based method for hand state detection and assume that the palm always faces the camera. They also use two predefined thresholds: one for detecting the shape of the hand and the other for estimating the state of the hand based on the length of the hand shape. However, it is very hard to keep the palm facing the camera during the whole grasping process, and using a threshold on the hand length to determine its state is inaccurate and depends on the hand shape of the particular user. The authors also depend on pre-trained grasping software to generate the appropriate grasping position according to the type of the object, which requires an additional object recognition process and also requires that the object be known to the grasping software and exist in its database. Moreover, this method cannot guarantee that the robot will grasp the object in the same way and from the same position wanted by the user, and it cannot be applied to grasping new objects which are not included in the database of the grasping software.

Our work proposes determining the shape of the hand based on the depth data captured by the Kinect sensor, which guarantees completely separating the hand from all the surrounding objects and gives almost the real shape of the hand. The state of the hand is
estimated using a pre-trained SVM classification model based on the depth histogram of the hand, which gives accurate results for all shapes and positions of the hands and does not require the palm to face the camera all the time. The human demonstrator can also choose the grasping object and the grasping position exactly as required, not as proposed by any external grasping software.

2.2 Motion learning

In [42], Ude et al. propose a method to generate joint trajectories for humanoids based on the similarity between human motion and humanoid robot motion, through an automatic approach that combines the kinematic parameters of both the humanoid and the human demonstrator using B-spline wavelets. Experiments are done using the DB humanoid robot. In [38], Shon et al. argue that learning by imitation can be reduced to a regression problem. They employ Scaled GPLVM for regression in an approach called Gaussian process canonical correlation analysis (GPCCA) to map between the high-dimensional observation space of human and humanoid joint angles and a low-dimensional latent variable space. They also use several low-dimensional latent variable spaces, where each space covers a subset of the degrees of freedom. Training data is collected by having a human demonstrator imitate some generated motions of a model of a humanoid robot skeleton. Fujitsu HOAP-2 is used as the test platform.

In [5], Calinon and Billard propose a framework to enable humanoid robots to recognize and reproduce gestures performed by human demonstrators. In the learning phase, the proposed framework starts with a decomposition of the motion data into either principal components (PCA) or independent components (ICA) to reduce the dimensionality of the data. Then, the low-dimensional data is encoded using a continuous Hidden Markov Model (HMM). In the reproduction phase, the HMM is used to recognize the new gesture. Then, the robot imitates the gesture by producing a generalized form of it: the best sequence of states is extracted and a time series of joint angles and trajectories is generated, reprojected into the robot's workspace, and fed to the robot's controller. Fujitsu HOAP-2 is also used to test this method.

In [13], Ekvall and Kragic propose a method for learning by demonstration where each task is decomposed into subtasks which are then modelled as states; similar subtasks are represented by the same state. The task is generalized through the relationships between the states to make a plan for this task to be used in the reproduction phase. The ActiveMedia PowerBot robot is employed in this method. In a similar way, Pardowitz et al. in [33] present a system to record and reason over demonstrations of household tasks. A hierarchical and incremental approach is used to encode these tasks, which are then used to create rules that manage the way of handling objects to reach the goal.

In [7], Calinon et al. propose a learning by imitation framework for extracting essential features and generalizing human-demonstrated tasks. The proposed framework depends on PCA, Gaussian mixture models (GMM), and Gaussian mixture regression (GMR). First, the captured motion data is projected onto a latent space using PCA to extract relevant features; then, the resulting data are encoded using GMM. Finally, the selected task is generalized using GMR. Fujitsu HOAP-2 is used to test the proposed method. In [6], Calinon et al. propose a probabilistic approach to learn human motion through imitation using multiple demonstrations. They employ HMM and GMR in addition to dynamical systems to build time-independent models to be utilized in the reproduction of the demonstrated motions. The iCub robot is used in the experiment.

In [16], Khansari-Zadeh and Billard propose a method to learn robot motion through multiple demonstrations. They model the motion as a non-linear dynamical system, where the parameters of the dynamical system are learned through a proposed learning method called Stable Estimator of Dynamical Systems (SEDS) by solving an optimization problem under a strict stability constraint. They also propose two objective functions, SEDS-MSE and SEDS-Likelihood, for this optimization problem. The 7-DOF right arm of the humanoid robot iCub and the six-DOF industrial robot Katana-T arm are both used in the experiments. In [1], Akgun et al. propose an approach based on keyframe demonstrations, where the human demonstrator provides a set of consecutive keyframes that can be connected to perform the skill. The proposed method uses GMMs as a skill learning algorithm. Trajectories of joint angles, which are recorded while the teacher moves the robot's arm to perform the task, are provided to the K-means algorithm to generate initial mean vectors and covariance matrices for the expectation-maximization (EM) algorithm. Then, the EM algorithm is used to extract a GMM from the data. To reproduce the skill, a time dimension vector is provided as input to Gaussian Mixture Regression (GMR) to reproduce the required joint positions. The authors use the Simon humanoid robot to validate their work.

In [27], Mulling et al. propose a framework that enables a robot to learn table tennis by demonstration.
The robot first learns a set of primitive table tennis motions to be compiled into a set of Dynamic Movement Primitives (DMPs). Then, an approach called mixture of motor primitives (MoMP) is proposed to generalize these motions. Mulling et al. use a Barrett WAM arm with seven DOFs to evaluate the proposed method. In [34], Peternel and Babic propose a system to teach a humanoid robot to maintain postural stability in the presence of perturbations in real time. They built a haptic interface to provide feedback of the robot stability to the human demonstrator in real time. The proposed system uses Locally Weighted Projection Regression (LWPR) as a learning technique, besides a novel approach to gradually transfer the control responsibility from the human demonstrator to the built robot controller. The Fujitsu HOAP-3 humanoid robot is used to perform the experiments.

There are many limitations in the abovementioned works, as described in the introduction of this paper. The methods proposed in [42] and [38] cannot be used in cases that require object manipulation or precise control, because the human demonstration and the robot imitation always happen in different phases. The methods proposed in [5], [13], [7], [6] and [16] consider learning upper body motions only. In the methods proposed in [7], [1] and [27] the robot is taught through kinesthetics, i.e., by the human demonstrator moving the robot's effectors to perform the required task; these methods are therefore valid only for learning upper body tasks and cannot be used with lower body tasks, because kinesthetic methods cannot preserve the robot's balance during training. In contrast, our proposed approach exploits real-time human motion imitation to control the full body of the humanoid robot in an efficient way and to teach it whole-body tasks, including tasks that require changing the support mode. Moreover, the proposed method gives the human demonstrator accurate feedback from the robot while performing the task in the training phase.

3 Background

3.1 Gaussian Process Latent Variable Models (GPLVM)

Dimensionality reduction can be achieved using both linear and nonlinear methods. In linear methods, like probabilistic principal component analysis (PPCA), the data is represented in a linear subspace of the observation space. Although linear methods are easy to implement and have closed-form solutions, they are not efficient for large datasets and not suitable for most of
the datasets which come from nonlinear models. Nonlinear methods for dimensionality reduction are appropriate for modeling data generated from a nonlinear manifold and can reduce them to a lower number of dimensions. In these methods, the latent coordinates and the nonlinear mapping from the latent subspace to the observation space are considered as parameters in a generative model, which are learned through optimization or Monte Carlo methods. GPLVM can be considered a generalization of PPCA, where the linear kernel is replaced with a nonlinear kernel to allow a nonlinear mapping from the latent space to the observed-data space [20, 21]. Assume a linear relationship, with Gaussian noise, between a set of n observations Y and some latent space X of the form:

$y_i = W x_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I)$   (1)

where W is the transformation matrix, i.e., the mapping parameters. Then,

$P(Y \mid X) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid 0, K)$   (2)
where Eq. (2) represents a product of Gaussian processes with a linear kernel. Thus, by replacing the linear kernel with a nonlinear one, e.g. a Gaussian kernel, the resulting model is the GPLVM.
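A minimal sketch of this construction is given below (our illustrative code, not the implementation used later in the paper): it minimizes the negative log form of Eq. (2) with a Gaussian (RBF) kernel over the latent coordinates X using NumPy and SciPy. Kernel hyperparameters are fixed for brevity; a full GPLVM would optimize them jointly with X.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, variance=1.0, lengthscale=1.0, noise=1e-2):
    # Squared-exponential (Gaussian) kernel plus a small noise term on the diagonal.
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return variance * np.exp(-0.5 * sq / lengthscale**2) + noise * np.eye(len(X))

def gplvm_nll(x_flat, Y, q):
    # Negative log marginal likelihood of the GPLVM: Y modeled as GPs over latent X.
    n, d = Y.shape
    X = x_flat.reshape(n, q)
    K = rbf_kernel(X)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * d * logdet + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))

# Toy usage: reduce 50 frames of 18-D joint angles (here random) to q = 2 latent dims.
Y = np.random.randn(50, 18)
Y -= Y.mean(0)
q = 2
x0 = 0.1 * np.random.randn(50 * q)
res = minimize(gplvm_nll, x0, args=(Y, q), method="L-BFGS-B")
X_latent = res.x.reshape(50, q)   # latent coordinates of each frame
```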
3.2 Variational Heteroscedastic Gaussian Process Regression (VHGPR)

Standard Gaussian processes, or homoscedastic Gaussian processes, model observation noise as a Gaussian distribution with constant covariance. However, this assumption causes somewhat inaccurate predictions in many situations. Heteroscedastic Gaussian processes have been introduced to model observation noise as a Gaussian distribution with an input-dependent covariance function, allowing more accurate predictions than the homoscedastic method. Unfortunately, this makes the predictive density and marginal likelihood of the heteroscedastic model no longer analytically tractable. Therefore, a solution called Variational HGP (VHGP) has been proposed, which depends on a variational approximation that allows more accurate inference, close to the accuracy of the gold standard MCMC, with the same computational complexity as the standard GP, i.e. O(n^3). The interested reader is referred to [22] for more details about this approach.

One of the main disadvantages of parametric probability density functions (like GMM) is the need to estimate the required parameters beforehand. A wrong selection of the values of these parameters has a
strongly negative impact on the generated model. For example, in GMM, the Gaussian parameters and the number of Gaussian components must be estimated. Many works, such as [1, 7], perform this parameter estimation by iterative Maximum Likelihood Estimation using the standard Expectation Maximization (EM) algorithm. However, the EM algorithm has the disadvantages of being very sensitive to initialization and of getting stuck in local minima, so it must be trained multiple times with multiple initializations, as illustrated in [7]. EM also suffers from slow convergence. In contrast, non-parametric methods, such as Gaussian Processes with all their variations, do not require any assumptions about the underlying functions and can fit many functional forms with higher accuracy.

4 The Proposed Framework

Our proposed framework, shown in Figure 1, contains two main parts: the imitation part and the learning and reproduction part. The imitation part is responsible for controlling the humanoid robot to perform some task by imitating the human demonstrator, who acts out the same task in real time. The learning and reproduction part is responsible for learning and generalizing the performed task by modeling the joint-space data extracted from multiple demonstrations of the same task given by the imitation part. This part is also responsible for reproducing the learned task upon request.

To teach the humanoid a new activity, the human teacher demonstrates the new activity to a Kinect sensor, which tracks the skeleton of the teacher and provides a set of frames that contain the 3D positions of each joint of the teacher's body. These positions are then used to calculate the human joint angles in a proper form to be provided to the humanoid robot, which follows the teacher's motion in real time. Thus, while teaching the task, the human demonstrator controls the humanoid robot remotely to perform it. The teacher is required to demonstrate the same task multiple times to provide the number of demonstrations needed for generality. The collected data, which contain elapsed times, joint angles, balance constraints, and objects' positions, are then provided to the learning phase.

In the learning phase, the observed joint angles are first provided to a dimensionality reduction process which employs the GPLVM algorithm; they are projected onto a latent space to extract only the relevant features from the observed data and to decrease the processing time in the next steps. Then, the low-dimensional data in the latent space are provided to the next component, which is responsible for the temporal alignment of the feature
Algorithm 1 Motion imitation and dataset construction for a specific task
Require: Kinect sensor S, number of demonstrations N, task name tname.
1: procedure MotionImitation
2:   for N demonstrations do
3:     Initialize variables and robot;
4:     while a skeleton frame exists do
5:       Read 3D positions of the joints;
6:       Read feet FSRs data;
7:       Calculate the joint angles in Euler form;
8:       Detect the states and orientations of hands;
9:       if each joint angle within allowed range then
10:        Apply low pass filter to joint angles;
11:        Apply balance constraints;
12:        Transfer final joint angles to robot;
13:        Store the current frame (joint angles, FSRs, time step T, demonstration index di, and task name tname) into dataset DS;
14:        Wait for the next time step T;
15:      end if
16:    end while
17:  end for
18:  return DS;
19: end procedure
signals of multiple demonstrations with each other; this component uses the DTW algorithm for that purpose. Next, the aligned features are used to train regression models through the VHGPR algorithm, where we train a regression model for each feature in the latent space. In the reproduction phase, the temporal data, which are simply a sequence of time steps, are used to query the trained regression model of each feature to give the predicted values of this feature at multiple time steps. These, in turn, are used to reconstruct the original joint angles, which are applied to the humanoid to reproduce the generalized motion. Algorithms 1, 3, and 4 summarize the implementation steps of our proposed framework.

The computational complexity of our proposed framework depends mainly on the learning part, which consists of three components: GPLVM with a computational complexity of O(n^3), DTW with a computational complexity of O(n^2), and VHGP with a computational complexity of O(n^3). Thus, the overall computational complexity of the framework is O(n^3). Now we describe the framework in detail.
4.1 Motion capturing

The proposed motion capturing system depends on a low-cost RGB-D sensor called Microsoft Kinect [15]. The Kinect was initially designed as a game controller in a computer game environment [26]. However, it has been used in various types of research in computer vision, such as object tracking and recognition, gesture recognition, and human motion analysis.
Fig. 1 The proposed framework of learning through imitation
Algorithm 2 Detecting hands' states and orientations
Require: Hands' positions P1, P2, depth frame D.
1: procedure HandStateOrientation
2:   Read 3D positions of the hands P1, P2;
3:   Read depth frame D;
4:   Scan all the depth pixels in D around P1, P2;
5:   Separate depth pixels of the hands H1D, H2D;
6:   Calculate depth histograms for hands H1, H2;
7:   Classify H1, H2 to detect hand states H1s, H2s;
8:   Apply PCA to H1D, H2D to get hand orientations;
9:   Calculate wrist yaw joint angles WrY1, WrY2 based on hand orientations;
10:  return H1s, H2s, WrY1, WrY2;
11: end procedure
Algorithm 3 Motion learning algorithm
Require: Motion dataset DS.
1: procedure MotionLearning
2:   for all tasks in DS do
3:     Initialize variables;
4:     Find optimal number of latent dimensions;
5:     Apply GPLVM to reduce dimensionality of DS;
6:     for all latent features plus support mode do
7:       Apply DTW to align data of current feature;
8:       Train a VHGP model for the current feature;
9:     end for
10:  end for
11:  return array of all trained VHGP models for all tasks, VHGPArr;
12: end procedure
The Kinect device contains an RGB camera, a depth sensor, and a microphone array that includes four microphones. Thus, the device can provide RGB images, depth data, and audio signals at the same time. In our work, we are interested in the depth data, which provide the 3D positions of the human demonstrator as a human skeleton model with 20 joints.
4.2 Joint angles calculation and mapping

As illustrated in the previous subsection, the Kinect sensor provides a skeleton model of the human demonstrator that contains the 3D positions of 20 joints. These positions are used in our framework to calculate the corresponding joint angles in the form of Euler angles. Using Euler angles, we can represent the 3D orientation of the human body using a combination of three rotations about different axes: (i) roll, which represents rotation about the X-axis, (ii) pitch, which represents rotation about the Y-axis, and (iii) yaw, which represents rotation about the Z-axis. Calculating the joint angles in Euler form is more suitable for transferring them to the humanoid robot to imitate the human motion. In our work, we use the well-known vector algebra rules to calculate the joint angles from the joints' 3D positions. As shown in Figure 2, to calculate the pitch angle of the left shoulder $\Theta_1$:

$\Delta Z = P_z - Q_z, \quad \Delta X = Q_x - P_x$   (3)

then

$\Theta_1 = \tan^{-1}\left(\frac{\Delta X}{\Delta Z}\right)$   (4)

To calculate the roll angle of the left shoulder $\Theta_2$:

$\Delta Z = P_z - Q_z, \quad \Delta Y = Q_y - P_y$   (5)

then

$\Theta_2 = \tan^{-1}\left(\frac{\Delta Y}{\Delta Z}\right)$   (6)

and so on for all the relevant joint angles of the human body.

Fig. 2 Illustration of joint angles calculation of the left shoulder. Left: left shoulder pitch angle. Right: left shoulder roll angle

After calculating the values of the relevant joint angles in Euler form, direct mapping is used to transfer these values to the humanoid robot. Each human joint is mapped to the same joint in the robot by applying the calculated values to the actuator's motor of the same joint. The differences between the human body and the robot body regarding the length of their limbs do not affect the similarity of the joint angles between them, especially in our proposed method, because the task is performed using the robot body while the human demonstrator just controls it remotely in real time. The only thing we have to care about is the available ranges of the robot joint angles, shown in Table 1. Where the human ranges are wider than the robot's, the out-of-range movements are neglected and never sent to the robot.
Table 1 The motion range of the relevant DOFs in the NAO robot (degrees).

DOF              Range               DOF              Range
LShoulderPitch   -119.5 to 119.5     RShoulderPitch   -119.5 to 119.5
LShoulderRoll    -18 to 76           RShoulderRoll    -76 to 18
LElbowRoll       -88.5 to -2         RElbowRoll       2 to 88.5
LElbowYaw        -119.5 to 119.5     RElbowYaw        -119.5 to 119.5
LHipPitch        -88.00 to 27.73     RHipPitch        -88.00 to 27.73
LHipRoll         -21.74 to 45.29     RHipRoll         -45.29 to 21.74
LKneePitch       -5.29 to 121.04     RKneePitch       -5.90 to 121.47
LAnklePitch      -68.15 to 52.86     RAnklePitch      -67.97 to 53.40
LAnkleRoll       -22.79 to 44.06     RAnkleRoll       -44.06 to 22.80
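As an illustration of Eqs. (3)-(6), the sketch below computes the left-shoulder pitch and roll from the Kinect 3D positions of the shoulder P and elbow Q. The joint names follow the NAOqi convention, and the range check against Table 1 mirrors the rule that out-of-range values are neglected rather than sent to the robot; the numeric example frame is made up.

```python
import math

# Subset of the NAO joint limits from Table 1, in degrees.
NAO_RANGE_DEG = {"LShoulderPitch": (-119.5, 119.5), "LShoulderRoll": (-18.0, 76.0)}

def in_range(angle_deg, joint):
    lo, hi = NAO_RANGE_DEG[joint]
    return lo <= angle_deg <= hi

def left_shoulder_angles(P, Q):
    """P, Q: (x, y, z) Kinect positions of the left shoulder and left elbow."""
    dz = P[2] - Q[2]
    dx = Q[0] - P[0]
    dy = Q[1] - P[1]
    pitch = math.degrees(math.atan2(dx, dz))  # Eq. (4); atan2 also handles dz = 0
    roll = math.degrees(math.atan2(dy, dz))   # Eq. (6)
    # Out-of-range values are neglected and never sent to the robot (Sec. 4.2).
    return (pitch if in_range(pitch, "LShoulderPitch") else None,
            roll if in_range(roll, "LShoulderRoll") else None)

# Example frame: elbow slightly in front of and below the shoulder (metres).
print(left_shoulder_angles(P=(0.20, 1.40, 2.00), Q=(0.25, 1.15, 1.95)))
```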
Algorithm 4 Motion reproduction algorithm
Require: Trained VHGP models VHGPArr, and task name tname.
1: procedure MotionReproduction
2:   Generate a sequence of time steps;
3:   for all generated time steps do
4:     for all latent space features of the given task do
5:       Apply current time step to the VHGP to predict the feature value;
6:     end for
7:     Reconstruct the original features from the predicted latent features;
8:     Apply predicted balance constraints;
9:     Transfer final joint angles to humanoid;
10:  end for
11: end procedure
Table 2 The meaning of the FSRs' binary data

Left FSR   Right FSR   Support Mode
0          0           No ground contact (N/A)
1          0           Left leg support
0          1           Right leg support
1          1           Both legs support
4.3 Force sensitive resistors (FSR) data

Support mode changing is one of the most important actions which people perform while executing tasks like walking, kicking, and stepping over obstacles. So, mode changing is crucial in any motion imitation system. However, using the Kinect for motion capturing causes problems in detecting the mode changes of the human demonstrator in real time. These problems arise because support mode changing cannot be identified when the human demonstrator lifts one of his feet slightly off the ground, given the resolution of the depth data and the accuracy of the skeleton tracking algorithm. To solve this problem, we use two FSRs which the human demonstrator wears on his shoes, between the shoes and the ground, while acting out the required task. These FSRs are connected to a wireless transmitter that transmits a binary state (representing whether or not there is pressure on each of them) to a wireless receiver attached to a personal computer. This binary state denotes the current support mode of the human demonstrator accurately and in real time. Table 2 shows the meaning of all possible combinations of FSR binary states, where '0' means there is no applied force and '1' means there is an applied force.

4.4 Balance control

The balance controller is an essential component in any humanoid robot control system. The Whole Body Balancer in NAO is formulated as a generalized inverse kinematics problem which takes into account all the constraints affecting the robot's balance during motion. This balance problem can be written as a quadratic program which is solved periodically using proper software. In NAO, it is solved every 20 ms using an open-source C++ library. The quadratic program has the following general form:

$\min_{Y} \; \frac{1}{2}\,\| Y - Y^{des} \|_{Q}^{2} \quad \text{such that} \quad \begin{cases} AY + b = 0 \\ CY + d \geq 0 \end{cases}$   (7)

where Y is the unknown vector containing the velocities of the torso and all relevant joints, Y^{des} is the desired solution, and A, b, C, and d are matrices and vectors expressing the equality and inequality constraints. The equality constraints are about keeping both feet or a single foot fixed, and the inequality constraints are about both joint limits and maintaining the center of mass (COM) of the robot's body within its support polygon [2]. The support polygon is the convex hull of the robot's feet, or foot, positions on the ground according to its current support mode. Thus, we have three balance modes: both-legs support mode, left-leg support mode, and right-leg support mode. As shown in Figure 3, the balance mode is selected according to the current support mode of the human demonstrator, which is observed through the states and transitions of the FSRs. While the human demonstrator is in the both-legs support mode, the humanoid robot is first required to move to a stable pose calculated through an inverse kinematics (IK) solver, where the projection of the COM must lie within the support polygon of both its feet. Both feet of the humanoid robot are then constrained to be fixed on the ground, and any imitated motion that could lead to losing balance is neglected. In the single support mode, the humanoid robot is required to move to a stable pose calculated through the IK solver, where the projection of the COM must lie within the support polygon of the current single support foot. The single support foot is then constrained to be fixed on the ground while the other leg is left free, and any imitated motion that could lead to losing balance is ignored.

Fig. 3 State transition diagram for human support mode changing

Fig. 4 Sample depth histograms for opened and closed hands

Table 3 Confusion matrices for the test process of the hand state determination method

              Right Hand           Left Hand
              Opened    Closed     Opened    Closed
Hand Opened   58        3          53        8
Hand Closed   5         52         5         51
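The following sketch ties the two pieces together (illustrative only: the dimensions and constraint matrices are made up, and SciPy's SLSQP solver stands in for the dedicated QP library that NAO's Whole Body Balancer runs every 20 ms). It maps the FSR bits of Table 2 to a support mode and solves a small instance of the quadratic program (7).

```python
import numpy as np
from scipy.optimize import minimize

# Table 2: (left FSR, right FSR) -> support mode.
SUPPORT_MODE = {(0, 0): "no ground contact", (1, 0): "left leg support",
                (0, 1): "right leg support", (1, 1): "both legs support"}

def solve_balance_qp(Y_des, Q, A, b, C, d):
    """min 0.5*||Y - Y_des||_Q^2  s.t.  A Y + b = 0,  C Y + d >= 0  (Eq. 7)."""
    cost = lambda Y: 0.5 * (Y - Y_des) @ Q @ (Y - Y_des)
    cons = [{"type": "eq", "fun": lambda Y: A @ Y + b},
            {"type": "ineq", "fun": lambda Y: C @ Y + d}]
    return minimize(cost, np.zeros_like(Y_des), method="SLSQP", constraints=cons).x

# Toy example: 3 joint velocities, one equality row (a "foot stays put" constraint)
# and simple joint-rate limits |Y_i| <= 0.4 as inequality rows.
Y_des = np.array([0.3, -0.2, 0.5])
Q = np.eye(3)
A, b = np.array([[1.0, 1.0, 0.0]]), np.array([0.0])
C = np.vstack([np.eye(3), -np.eye(3)])
d = 0.4 * np.ones(6)
print(SUPPORT_MODE[(1, 1)], solve_balance_qp(Y_des, Q, A, b, C, d))
```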
4.5 Hand state determination

Hand state determination is a crucial step for grasping. However, the Kinect sensor can only detect the state of the demonstrator's hand when the palm faces the camera, and it is not possible to keep the palm facing the camera all the time while teaching the robot to grasp by imitation. So, another method is needed to estimate the hand state (opened or closed) during imitation without any restriction on the hand direction or orientation. In our method, as shown in Algorithm 2, we perform this job using a pre-trained SVM classification model. The classification model is first trained offline on a proposed histogram of the depth data of multiple hand shapes, directions, and orientations of multiple users, for both the opened and closed hand states. Then, during imitation, the histogram of the hand is calculated in the same way and is used to classify the hand state with the pre-trained model.

In more detail, we first detect the shapes of the hands based on their depth data in the depth frame captured by the Kinect, by scanning and segmenting all the depth data around the hand positions obtained from the skeleton data. Then, given the set of segmented depth pixels of the hand D, a frequency histogram H with N bins is calculated as follows:

$H_i = \frac{d_i}{size(D)}$

where $d_i$ is the number of depth pixels belonging to the interval $x_i \equiv [a_i, a_{i+1})$, $a_i = i \cdot T_2$, $i = 0, 1, 2, \ldots, N-1$, and $N = T_1/T_2$. $T_1$ and $T_2$ are thresholds which are taken to be 30 mm and 2 mm respectively, and $size(D)$ is the number of pixels of D. The histogram data is then used to classify the hand state with the SVM model into one of the two hand states (opened or closed). Figure 4 shows a sample histogram for each hand state. As shown in Figure 4, the two histograms have different patterns: due to the shape of the hand in the two situations, the closed-hand histogram is skewed right, while the opened-hand histogram is almost symmetric.

Regarding the training dataset, it contains 687 records; each record contains the calculated frequencies of the 15 bins of the histogram of the corresponding 3D hand shape. A separate SVM classification model is built for each hand based on a training dataset, as shown
in Figure 5.¹ To evaluate the trained SVM models before real operation, we generated and used a separate test dataset which contains 235 records for both right and left hands. Table 3 shows the confusion matrices of the test process. The overall test accuracy is 91.1%.

¹ The training datasets are visualized after reducing their dimensionality from 15 to 2 dimensions using t-Distributed Stochastic Neighbor Embedding (t-SNE) [25].

Fig. 5 The trained SVM classification models with the training datasets

4.6 Hand orientation determination

One of the difficulties facing Kinect users is the inability to calculate the rotation angle of the wrist around its forearm, which is critical in the grasping process. We propose a new method to approximate the wrist yaw angle by calculating the orientation of the user's hand, which is closely related to the wrist rotation in both the human and the robot body. Figure 6 shows that the hand orientation can be considered a good approximation of the wrist yaw angle in both the closed and opened hand states. In our method, we calculate the hand orientation from the depth data of the hand using PCA. When PCA is applied to the set of hand points in 2D, the eigenvector corresponding to the maximum eigenvalue indicates the direction of maximum variance of the hand depth points. Thus, the direction of this eigenvector is used to approximate the hand orientation, which in turn approximates the wrist yaw joint angles $\phi_R$ and $\phi_L$ shown in Figure 6.

Fig. 6 Depth frame example showing the resulting eigenvectors and the calculated wrist yaw angles for both a closed right hand ($\phi_R$) and an opened left hand ($\phi_L$)
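A condensed sketch of Algorithm 2 follows. It is illustrative: the depth segmentation is assumed to be already done, scikit-learn's SVC stands in for the paper's SVM models, the training data below are random placeholders, and the binning relative to the nearest hand pixel is our guess at how the 15-bin histogram (T1 = 30 mm, T2 = 2 mm) is anchored.

```python
import numpy as np
from sklearn.svm import SVC

T1, T2 = 30.0, 2.0                    # histogram span and bin width in mm (Sec. 4.5)
N_BINS = int(T1 / T2)                 # 15 bins

def depth_histogram(hand_depths_mm):
    """Normalized frequency histogram H_i = d_i / |D| over [i*T2, (i+1)*T2)."""
    rel = hand_depths_mm - hand_depths_mm.min()   # depth relative to nearest hand pixel
    hist, _ = np.histogram(rel, bins=N_BINS, range=(0.0, T1))
    return hist / hand_depths_mm.size

def hand_orientation_deg(hand_pixels_xy):
    """Wrist-yaw approximation: angle of the principal axis of the hand pixels (PCA)."""
    centered = hand_pixels_xy - hand_pixels_xy.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    major = vecs[:, -1]                           # eigenvector of the largest eigenvalue
    return np.degrees(np.arctan2(major[1], major[0]))

# Offline training on precomputed histograms with open/closed labels (random stand-ins),
# then online classification of a newly segmented hand, mirroring Algorithm 2.
clf = SVC(kernel="rbf").fit(np.random.rand(100, N_BINS), np.random.randint(0, 2, 100))
new_hand_depths = np.random.rand(400) * T1        # stand-in for segmented depth pixels
state = clf.predict([depth_histogram(new_hand_depths)])[0]
yaw = hand_orientation_deg(np.random.rand(400, 2))
```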
4.7 Dimensionality reduction

The proposed learning approach depends on the collected data, which contain the angles of all selected joints of the humanoid robot. However, the relevant joint angles are not the same for all tasks. For example, upper body tasks depend only on the joints of the upper body, such as the shoulder and elbow joints, whereas lower body tasks rely only on the joints of the lower body, such as the hip, knee, and ankle joints. So, each task has its own latent space, and the original dataset can be projected onto this latent space to give an optimal low-dimensional representation of the dataset. GPLVM is one of the most popular dimensionality reduction techniques; it devises a nonlinear mapping between the observation space and the latent space. Non-linear dimensionality reduction methods have been proven to be more accurate than linear ones in both the reduction and the reconstruction of multi-dimensional datasets [20, 35]. This is because linear-projection-based methods (such as PCA) cannot capture intrinsic nonlinearities of the original dataset, which are captured by latent-variable models and the non-linear learning characteristic of the GP in GPLVM. The number of dimensions of the latent space is selected to preserve at least 95% of the accuracy of the original dataset after data reconstruction. The implementation of the variational GPLVM algorithm presented in [41] is used in this component of our framework.
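The selection criterion can be scripted as a simple search, sketched below with two hypothetical helpers, gplvm_project and gplvm_reconstruct, standing in for the variational GPLVM implementation of [41]; the accuracy measure (one minus the normalized mean squared reconstruction error) is our assumption about how the 95% threshold is evaluated.

```python
import numpy as np

def reconstruction_accuracy(Y, Y_rec):
    # 1 - normalized mean squared reconstruction error, used here as the accuracy score.
    return 1.0 - np.mean((Y - Y_rec) ** 2) / np.var(Y)

def choose_latent_dim(Y, gplvm_project, gplvm_reconstruct, target=0.95):
    """Smallest latent dimensionality whose reconstruction keeps >= 95% accuracy."""
    for q in range(1, Y.shape[1] + 1):
        X = gplvm_project(Y, q)                  # hypothetical: fit GPLVM with q latent dims
        acc = reconstruction_accuracy(Y, gplvm_reconstruct(X))
        if acc >= target:
            return q, acc
    return Y.shape[1], 1.0
```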
4.8 Temporal alignment using DTW

Temporal alignment of the imitation data is a simple but critical step in such learning approaches. As mentioned before, the proposed learning framework depends on the observations of multiple demonstrations of the same task in the training phase. The relationship between the temporal values and the corresponding joint angles across all demonstrations of the same task has a large impact on the training phase. However, it is impossible for the human demonstrator to act all the demonstrations of the same task over the same time period and with the same actions at the same time steps. So, the collected sequences of joint angles from multiple demonstrations of the same task must be aligned temporally before being used in the training phase to ensure harmony between these signals. The DTW
algorithm is used to align these temporal sequences as a preprocessing step before the training process.
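For completeness, a textbook dynamic-programming DTW between two one-dimensional feature sequences is sketched below (the paper does not specify its DTW implementation); backtracking through the cost matrix D would give the warping path used to resample one demonstration onto the time base of another.

```python
import numpy as np

def dtw_cost(a, b):
    """Classic O(len(a)*len(b)) DTW between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two demonstrations of the same latent feature, one acted more slowly than the other.
demo_fast = np.sin(np.linspace(0, np.pi, 40))
demo_slow = np.sin(np.linspace(0, np.pi, 65))
print(dtw_cost(demo_fast, demo_slow))
```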
4.9 Building VHGP models

The low-dimensional, aligned joint data of the multiple demonstrations are then modeled using the VHGP algorithm, where a VHGP model is built for each feature in the latent space. In addition to these models, another model is built for the support mode changes that occur while acting the task. The core of this component depends on the algorithm in [22] and the implementation of VHGPR in [8].
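A minimal stand-in for this step is sketched below: one regressor per latent feature (plus the support-mode signal) is fitted against the aligned time steps. Scikit-learn's homoscedastic GaussianProcessRegressor is used here purely as a placeholder for the VHGPR implementation of [8]; it does not model the input-dependent noise that VHGPR provides, and the toy data are made up.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def train_feature_models(t, Z):
    """t: (n,) aligned time steps; Z: (n, q+1) latent features plus a support-mode column."""
    models = []
    for j in range(Z.shape[1]):
        kernel = RBF(length_scale=5.0) + WhiteKernel(noise_level=0.1)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        models.append(gp.fit(t.reshape(-1, 1), Z[:, j]))
    return models

# Toy usage: 70 aligned frames, 7 latent features and 1 support-mode signal.
t = np.arange(70, dtype=float)
Z = np.column_stack([np.sin(t / 10 + k) for k in range(7)] + [np.ones(70)])
models = train_feature_models(t, Z)
```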
4.10 Motion reproduction

The reproduction process starts by applying a sequence of time steps as an input query. Then, a sequence of values for each latent feature, as well as the support mode feature, is generated from the VHGPR models produced in the training phase. However, the predicted data cannot be applied directly to the robot, since they are still in the latent space. So, a reconstruction process is required to map the feature values back into the space of joint angles. The predicted values of all joint angles and support modes are then provided to the balance controller, which constrains the input data to ensure the humanoid's balance as described in Section 4.4. Finally, the output data are applied to the humanoid robot to perform the required task.
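Reproduction then amounts to querying those models over a fresh sequence of time steps, mapping the predictions back to joint space, and filtering the result through the balance controller of Section 4.4. In the sketch below, gplvm_reconstruct and apply_balance_constraints are hypothetical placeholders for those two steps; the model objects are the per-feature regressors trained above.

```python
import numpy as np

def reproduce_motion(models, n_steps, gplvm_reconstruct, apply_balance_constraints):
    t_query = np.arange(n_steps, dtype=float).reshape(-1, 1)
    # One predicted trajectory per latent-feature / support-mode model (Algorithm 4).
    Z_pred = np.column_stack([m.predict(t_query) for m in models])
    support_mode = np.rint(Z_pred[:, -1])        # last column: support-mode signal
    joints = gplvm_reconstruct(Z_pred[:, :-1])   # hypothetical: back to joint-angle space
    return apply_balance_constraints(joints, support_mode)
```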
5 Experiments and Results

Our proposed framework is evaluated through 5 experiments of 5 different motions which involve both the upper and lower body parts of the humanoid robot, including grasping. The NAO H25 humanoid is used in our experiments to imitate and reproduce the proposed motions. NAO H25 has 25 degrees of freedom, its height is 57.3 cm, and its weight is 5.2 kg. The computations were performed using a standard laptop with a second-generation dual-core processor (Intel Core i5-2467M 1.6 GHz) and 8 gigabytes of RAM. The motion data of the human demonstrator are observed at a rate of 5 frames per second, where 18 relevant degrees of freedom are considered, as shown in Table 1. The attached video shows our experimental setup and results.²

² The video is available at: http://www.videosprout.com/video?id=a69781a4-6817-42a0-9079-66dc2250c69d

Fig. 7 Mean square error for different numbers of latent dimensions for all the experiments.

5.1 Experiment I

In this experiment, the robot is trained on a motion of hitting and dropping down a steady ball using its right hand, as shown in Figure 8. In the training phase, the human demonstrator controls the humanoid robot to perform the required task by imitation; this process is performed six times. Each time, a set of frames of joint angles is calculated and recorded, which finally forms the training data of this task. Figure 10 shows the training data of the first task.

Fig. 8 Motion training sample of Experiment I.
As shown in Figure 10, and from the nature of the current task, which depends mainly on the motions of the right hand, we can notice that the joint angles of the right arm form a remarkable, distinct pattern, whereas most of the other joint angles are almost fixed during the motion. So, dimensionality reduction is effective in such situations. Figure 7 shows the accuracy of dataset reconstruction as a function of the number of latent dimensions; we chose seven latent dimensions according to this figure. The GPLVM algorithm is used to reduce the dimensionality of the original dataset, which consists of 18 dimensions, to seven dimensions only.
Fig. 9 The reproduced motion of Experiment I in the latent space using VHGPR (the red line). The horizontal axis represents the elapsed time steps, and the vertical axis represents the value of the latent dimension.

Fig. 10 The reproduced motion of Experiment I (the red line) against the observed joint angles of the 6 training demonstrations. The horizontal axis represents the elapsed time steps, and the vertical axis represents the joint angles in radians.

Fig. 11 Motion reproduction of Experiment I.
Then, the lower-dimensional training data are aligned through DTW to ensure harmony among the signals from the multiple demonstrations.
After being trained on the aligned data, the VHGPR models are used to generalize the motion in the latent space as described before. The generalized signals in latent space are shown in Figure 9, where the red line is the mean and the gray area is the confidence of the generated models. It can be noticed that the confidence area is not constant over all the motion period, but it changes according to the noise in each area to give a more accurate representation of the motion. Then, the original motion is reconstructed from the latent space generalized motion using GPLVM as shown in Figure 10, where the red line represent the predicted joint angles. The predicted joint angles are applied to the robot to reproduce the motion as shown in the captured images in Figure 11. 5.2 Experiment II In this experiment, the humanoid robot is trained on a motion of kicking a steady ball with its right leg as shown in the captured images in Figure 12. This motion tests the ability to change the support mode of the robot. It is required to act this motion to change the support mode from the double-support mode to the left-leg-support mode while kicking the ball with the right leg, then return to the double-support mode. As shown in Figure 14, most of the variations of the humanoid robot’s joint angles are concerned with the robot’s legs. Figure 7 represents the accuracy of dimensionality reduction and shows the reason for considering only five latent dimensions. Figures 13, and 14 show the same procedure for Experiment II in the same manner as Experiment I. Figure 15 shows the reproduced motion after applying it to the robot. 5.3 Experiment III In this experiment, the humanoid robot is trained on a more complex motion of holding and lifting up a cube of
5.2 Experiment II

In this experiment, the humanoid robot is trained on a motion of kicking a steady ball with its right leg, as shown in the captured images in Figure 12. This motion tests the ability to change the support mode of the robot: acting it requires changing the support mode from the double-support mode to the left-leg-support mode while kicking the ball with the right leg, and then returning to the double-support mode. As shown in Figure 14, most of the variations of the humanoid robot's joint angles are concerned with the robot's legs. Figure 7 shows the accuracy of the dimensionality reduction and the reason for considering only five latent dimensions. Figures 13 and 14 show the same procedure for Experiment II in the same manner as Experiment I. Figure 15 shows the reproduced motion after applying it to the robot.

Fig. 12 Motion training sample of Experiment II.

Fig. 13 The reproduced motion of Experiment II in the latent space using VHGPR (the red line). The horizontal axis represents the elapsed time steps, and the vertical axis represents the value of the latent dimension.

Fig. 14 The reproduced motion of Experiment II (the red line) against the observed joint angles of the 6 training demonstrations. The horizontal axis represents the elapsed time steps, and the vertical axis represents the joint angles in radians.

5.3 Experiment III

In this experiment, the humanoid robot is trained on a more complex motion of holding and lifting up a sponge cube with both its arms, as shown in Figure 16. This motion tests the ability of the proposed framework to learn and reproduce more complex motions that employ more than one effector. The humanoid NAO uses both its arms to hold the sponge cube and lift it up. Figure 7 also shows the accuracy of the dimensionality reduction and the reason for considering a latent space of nine dimensions in this experiment. Figures 17 and 18 show the same procedure for Experiment III in the same manner as Experiments I and II. Figure 19 shows the test process.
Fig. 15 Motion reproduction of Experiment II.

Fig. 16 Motion training sample of Experiment III.

5.4 Experiment IV

In this experiment, the grasping capability is tested. The robot is trained to grasp a plastic cup with a handle using its right hand, as shown in Figure 20. This experiment tests the feature of determining the robot's hand state and orientation depending only on the data observed by the Kinect. Figure 21 shows the observed and final reproduced joint angles for the relevant right-arm joints only. Figure 22 shows the prediction process in the latent space. Finally, Figure 23 shows the test process.

5.5 Experiment V

In this experiment, a whole-body action is tested. The robot is trained to hit a steady ball using its right hand while changing its support mode from the double-support mode to the right-leg-support mode, as shown in Figure 24. The experiment tests the effect of the whole-body motion on the robot's stability. Figure 25 shows the observed and final reproduced joint angles for all body joints. Figure 26 shows the prediction process in the latent space. Finally, Figure 27 shows the motion reproduction after learning.
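In all five experiments, reproduction on the physical robot consists of streaming the predicted joint-angle trajectories to the corresponding NAO joints. The paper does not spell out the playback calls; the sketch below shows one plausible way to do this with the NAOqi Python SDK, where the robot address, the joint subset, the placeholder trajectories, and the frame period are assumptions for illustration.

```python
from naoqi import ALProxy  # NAOqi Python SDK, assumed to be installed
import math

ROBOT_IP, ROBOT_PORT = "192.168.1.10", 9559   # placeholder robot address

# Placeholder reproduced trajectories (radians), one list per controlled joint;
# in practice these would be the joint angles predicted by the learning module.
names = ["RShoulderPitch", "RShoulderRoll", "RElbowYaw", "RElbowRoll"]
predicted_angles = [[0.1 * math.sin(0.2 * t) for t in range(50)] for _ in names]
dt = 0.1                                      # assumed time step between predicted frames

motion = ALProxy("ALMotion", ROBOT_IP, ROBOT_PORT)
motion.setStiffnesses("Body", 1.0)            # stiffen the joints before moving them

angle_lists = predicted_angles
time_lists = [[dt * (k + 1) for k in range(len(traj))] for traj in predicted_angles]

# Blocking call that interpolates through the whole reproduced trajectory.
motion.angleInterpolation(names, angle_lists, time_lists, True)
```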
Fig. 17 The reproduced motion of Experiment III in the latent space using VHGPR (the red line). The horizontal axis represents the elapsed time steps, and the vertical axis represents the value of the latent dimension.
5.6 Results accuracy and evaluation

Let $N$ be the number of training demonstrations and $T$ the number of time steps in each reproduced motion. Also, let $y^{l} \in Y^{l}$ and $x^{l} \in X^{l}$ be the values of the joint angles of the reproduced motions and the training motions for a single dimension in the latent space, respectively, while $y \in Y$ and $x \in X$ are their counterparts in the observation space after data reconstruction.

Table 4 Correlation coefficient for the regression process of all experiments

Latent dim.   Ex.I   Ex.II   Ex.III   Ex.IV   Ex.V
1             0.98   0.98    0.99     0.99    0.85
2             0.84   0.99    0.91     0.98    0.95
3             0.95   0.97    0.99     0.97    0.87
4             0.96   0.96    0.94     0.98    0.77
5             0.96   0.97    0.85     0.98    0.57
6             0.97   —       0.84     0.83    0.55
7             0.96   —       0.75     0.97    —
8             —      —       0.90     0.87    —
9             —      —       0.81     —       —
Fig. 18 The reproduced motion of Experiment III (the red line) against the observed joint angles of the 6 training demonstrations. The horizontal axis represents the elapsed time steps, and the vertical axis represents the joint angles in radians.

Fig. 19 Motion reproduction of Experiment III.

Fig. 20 Motion training sample of Experiment IV.

Four metrics are used to evaluate the reproduced motions.

Correlation coefficient ($\rho$): measures the statistical relationship or dependency between the reproduced motions and the training demonstrations. The Pearson correlation coefficient between the reproduced motion $Y^{l}$ and each training motion $X^{l}$ in the latent space is defined as:

\[
\rho\left(Y^{l}, X^{l}\right) = \frac{1}{T-1}\sum_{t=1}^{T}\left(\frac{y_{t}^{l}-\mu_{Y^{l}}}{\sigma_{Y^{l}}}\right)\left(\frac{x_{t}^{l}-\mu_{X^{l}}}{\sigma_{X^{l}}}\right) \qquad (8)
\]

where $\mu$ and $\sigma$ are the mean and the standard deviation, respectively. Table 4 shows the mean correlation coefficient for this regression process between the reconstructed motion and the training demonstrations in the latent space. The table shows a very high correlation between the predicted signal and the training signals, exceeding 0.95 for most signals.
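A direct NumPy transcription of Eq. (8) is sketched below; the function name and the synthetic example are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def latent_correlation(y_l, x_l):
    """Pearson correlation of Eq. (8) between a reproduced latent signal y_l and
    one training latent signal x_l, both of length T (sample std, ddof=1)."""
    T = len(y_l)
    y_std = (y_l - y_l.mean()) / y_l.std(ddof=1)
    x_std = (x_l - x_l.mean()) / x_l.std(ddof=1)
    return float(np.sum(y_std * x_std) / (T - 1))

# Toy usage: mean correlation of one reproduced latent dimension against N = 6
# synthetic training signals, analogous to the averaged values reported in Table 4.
rng = np.random.default_rng(0)
y_rep = np.sin(np.linspace(0.0, 2.0 * np.pi, 70))
demos = [y_rep + 0.1 * rng.standard_normal(70) for _ in range(6)]
print(np.mean([latent_correlation(y_rep, x) for x in demos]))
```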
Fig. 21 The reproduced motion of Experiment IV (the red line) against the observed joint angles of the 6 training demonstrations. The horizontal axis represents the elapsed time steps, and the vertical axis represents the joint angles in radians.

Fig. 22 The reproduced motion of Experiment IV in the latent space using VHGPR (the red line). The horizontal axis represents the elapsed time steps, and the vertical axis represents the value of the latent dimension.

Fig. 23 Motion reproduction of Experiment IV.

Fig. 24 Motion training sample of Experiment V.
Root-mean-square error ($E_{DOF}$): measures the match between the reproduced motion and the training demonstrations. The root-mean-square error is computed along the reproduced motion for each DOF as:

\[
E_{DOF} = \frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}\sqrt{\left(y_{t}-x_{n,t}\right)^{2}} \qquad (9)
\]
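The sketch below is a straightforward NumPy rendering of Eq. (9) for a single DOF; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def rmse_per_dof(y, X):
    """Eq. (9): error of one reproduced DOF trajectory y (shape (T,)) against the
    N training demonstrations X (shape (N, T)); sqrt((y - x)^2) reduces to |y - x|."""
    N, T = X.shape
    return float(np.sum(np.sqrt((y[None, :] - X) ** 2)) / (N * T))

# Toy usage with synthetic joint-angle data in radians.
rng = np.random.default_rng(1)
y = 0.5 * np.sin(np.linspace(0.0, np.pi, 70))
X = y[None, :] + 0.05 * rng.standard_normal((6, 70))
print(rmse_per_dof(y, X))
```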
Fig. 25 The reproduced motion of Experiment V (the red line) against the observed joint angles of the 6 training demonstrations. The horizontal axis represents the elapsed time steps, and the vertical axis represents the joint angles in radians.
Table 5 shows the calculated root-mean-square errors in radians.

Derivative of acceleration ($S_{DOF}$): measures the smoothness of the reproduced motion based on calculating the rate of change of acceleration of the reproduced motion as:

\[
S_{DOF} = \frac{1}{T}\sum_{t=1}^{T}\left\lVert \dddot{y}_{t} \right\rVert \qquad (10)
\]
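A minimal NumPy sketch of Eq. (10) follows; the finite-difference approximation of the third derivative, the unit time step, and the toy trajectory are assumptions for illustration.

```python
import numpy as np

def smoothness_per_dof(y, dt=1.0):
    """Eq. (10): mean magnitude of the third derivative (jerk) of one reproduced
    DOF trajectory y; a finite-difference approximation over the available samples."""
    jerk = np.diff(y, n=3) / dt ** 3
    return float(np.mean(np.abs(jerk)))

# Toy usage: a smooth sinusoidal trajectory yields a very small value,
# on the order of the entries reported in Table 6.
y = 0.5 * np.sin(np.linspace(0.0, np.pi, 70))
print(smoothness_per_dof(y))
```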
Fig. 26 The reproduced motion of Experiment V in the latent space using VHGPR (the red line). The horizontal axis represents the elapsed time steps, and the vertical axis represents the value of the latent dimension.
Fig. 27 Motion reproduction of Experiment V.

Table 5 Root-mean-square errors measured in radians

DOF              Ex.I   Ex.II   Ex.III   Ex.IV   Ex.V
RElbowRoll       0.32   0.07    0.20     0.08    0.24
RElbowYaw        0.62   0.07    0.79     0.15    0.40
RShoulderPitch   0.36   0.09    0.53     0.11    0.40
RShoulderRoll    0.18   0.07    0.12     0.09    0.09
LElbowRoll       0.01   0.04    0.14     0.02    0.03
LElbowYaw        0.04   0.06    0.63     0.06    0.10
LShoulderPitch   0.04   0.07    0.63     0.05    0.15
LShoulderRoll    0.02   0.07    0.11     0.03    0.08
RAnklePitch      0.03   0.39    0.03     0.03    0.04
RAnkleRoll       0.03   0.12    0.04     0.03    0.05
RKneePitch       0.03   0.23    0.04     0.05    0.08
RHipPitch        0.02   0.19    0.03     0.03    0.04
RHipRoll         0.02   0.12    0.02     0.03    0.06
LAnklePitch      0.04   0.05    0.02     0.02    0.14
LAnkleRoll       0.02   0.10    0.03     0.03    0.10
LKneePitch       0.03   0.13    0.02     0.03    0.26
LHipPitch        0.02   0.03    0.02     0.02    0.17
LHipRoll         0.02   0.08    0.02     0.03    0.19
RWristYaw        —      —       —        0.15    —
Table 6 shows the calculated smoothness values for each DOF; their very low magnitudes indicate the high degree of smoothness of the reproduced motions.
Elapsed training time ($T_{tr}$): measures the time elapsed in the training process from its beginning until motion generation. Table 7 shows the elapsed training time for each of our experiments.
Table 6 Smoothness of the reproduced motions for the relevant DOFs (derivative of acceleration, ×10⁻⁴)

DOF              Ex.I   Ex.II   Ex.III   Ex.IV   Ex.V
RElbowRoll       12.9   2.3     8.5      1.8     4.9
RElbowYaw        13.8   2.1     27.2     7.3     4.6
RShoulderPitch   10.8   3.6     6.7      3.9     3.9
RShoulderRoll     9.5   2.3     3.8      4.7     2.2
LElbowRoll        0.2   1.5     3.2      0.37    1.0
LElbowYaw         0.6   2.3     12.5     0.72    4.5
LShoulderPitch    0.5   1.9     6.8      0.52    4.4
LShoulderRoll     0.3   2.4     4.4      0.35    3.2
RAnklePitch       0.2   20.5    0.8      0.2     1.2
RAnkleRoll        0.2   4.4     1.1      1.2     2.1
RKneePitch        0.3   12.1    0.9      0.02    7.5
RHipPitch         0.3   10.5    0.6      0.2     3.3
RHipRoll          0.3   3.7     0.5      1.4     1.5
LAnklePitch       0.4   1.7     1.1      1.1     5.7
LAnkleRoll        0.2   4.1     1.2      0.47    1.3
LKneePitch        0.5   3.2     0.5      0.02    3.0
LHipPitch         0.5   1.2     0.4      1.1     6.7
LHipRoll          0.2   3.0     0.5      0.05    4.2
RWristYaw         —     —       —        5.7     —
Table 7 The elapsed training times measured in seconds for all the experiments

Experiment          I     II    III   IV    V
Training time (s)   90    421   114   213   248
6 Conclusion and Future Work

In this paper, the problem of teaching humanoid robots to perform new actions in a simple and efficient way is considered. A new framework is proposed to teach humanoid robots to move and grasp objects through real-time human motion imitation. The proposed framework consists of two parts. The first part is the real-time human motion imitation module, which is responsible for controlling the humanoid robot to perform the same motion as the human demonstrator. The second part is the learning module, which is responsible for learning and reproducing the demonstrated motions. Observation of the human motion depends mainly on the Microsoft Kinect sensor and an additional set of two wireless FSRs fixed under both feet of the human demonstrator. The learning procedure depends on three machine learning approaches: GPLVM for dimensionality reduction, DTW for signal alignment, and VHGPR for modeling and reproducing the demonstrated motions. Although a single demonstration could be used for learning, multiple demonstrations are used in our proposed method to achieve better accuracy by averaging out the noise that is highly probable from the human demonstrator and from observing the motion with the Kinect sensor. Five experiments using the NAO H25 humanoid robot demonstrate the ability of the proposed framework to teach the robot five different motions concerned with both the upper and the lower body parts, including grasping a cup.

In the future, several extensions could be made. For example, more degrees of freedom could be considered, such as the head yaw and pitch. More complex grasping tasks that need cooperation between both hands could also be considered. Besides, the technique used to control the robot hand could be improved to handle more hand shapes, not just 'open' and 'close' motions.

Acknowledgements This research has been supported by the Ministry of Higher Education (MoHE) of Egypt through a PhD fellowship. Our sincere thanks to Egypt-Japan University of Science and Technology (E-JUST) for guidance and support.