Segmentation and Recognition of Human Grasps for Programming-by-Demonstration using Time-clustering and Fuzzy Modeling

Rainer Palm, Senior Member, IEEE, and Boyko Iliev

Rainer Palm is adjunct professor at the AASS, Dept. of Technology, Örebro University, SE-70182 Örebro, Sweden. Email: [email protected]
Boyko Iliev is with the AASS, Dept. of Technology, Örebro University, SE-70182 Örebro, Sweden. Email: [email protected]

Abstract— In this article we address the problem of Programming by Demonstration (PbD) of grasping tasks for a five-fingered robotic hand. The robot is instructed by a human operator wearing a data glove that captures the hand poses. For a number of human grasps, the corresponding fingertip trajectories are modeled in time and space by fuzzy clustering and Takagi-Sugeno modeling. This so-called time-clustering leads to grasp models that use time as the input parameter and the fingertip positions as outputs. For a test sequence of grasps, the control system of the robot hand identifies the grasp segments, classifies the grasps, and reproduces the sequence of grasps demonstrated before. For this purpose, each grasp is correlated with a training sequence. By means of a hybrid fuzzy model the demonstrated grasp sequence can be reconstructed.

I. INTRODUCTION

The applications of multi-fingered robotic hands have so far been relatively few, although the field has attracted significant research efforts. One of the reasons is that such hands are fairly difficult to program due to the inherently high dimensionality of grasping and manipulation tasks. One approach to solve this problem is to program the robotic hand directly by human demonstrations. That is, the operator performs the task while the robot "observes" the demonstrated actions with the help of motion capture devices or video cameras. The robot has to recognize the actions performed by the demonstrator and replicate them on its own. This approach is called Programming by Demonstration (PbD) and is used in complex robotics applications such as grasping and manipulation, see Kang et al. [1] and Bernardin et al. [2]. Different techniques and motion capture setups have been applied in PbD. For example, Kang et al. [1] describe a system which observes, recognizes, and maps human grasps to a robot manipulator. The human demonstration is captured by a stereo vision system and a data glove. Zoellner et al. [3] use a data glove with integrated tactile sensors, where the classification method is based on support vector machines (SVM). In a similar setup, Bernardin et al. [2] apply Hidden Markov Models (HMMs) to segment and recognize grasp sequences. Ekvall and Kragic [4] also address the PbD problem, using the arm trajectory and hand pose as additional features for grasp classification. Li et al. [5] use the singular value decomposition (SVD) to generate feature vectors for human grasps and support vector machines (SVM) which are applied to the classification problem. Aleotti and Caselli [6] describe a virtual-reality based PbD system for grasp recognition where final grasp postures are modeled based on the finger joint angles.

Despite the advanced state of the art, most of the cited methods lack the dynamical aspect of the grasp motion, including the pre-grasp posture, the grasp phase while closing the hand, and the release phase while opening the hand. In addition, the estimation of the time of occurrence of a specific grasp within a sequence of grasps is not solved by these methods. The method presented in this paper tries to overcome some of these drawbacks. For such an inherently 'fuzzy' task, fuzzy modeling turns out to be a suitable approach. The main focus of this method is programming by demonstration and recognition of human grasps using fuzzy clustering and Takagi-Sugeno fuzzy modeling. Two types of models are built:
- models of the fingertip trajectories
- models for segmentation and recognition

The first model type incorporates the time variable of the grasp motion in the modeling process (time-clustering), i.e. it interprets time as a model input, see Palm and Iliev [7]. The second type of model, which is the main contribution of this paper, deals with the recognition of grasps from a grasp sequence. The recognition model consists of a set of fuzzy expert rules that identify a grasp from a combination of grasps using membership functions of exponential type. This model is called a hybrid fuzzy model because it combines a fuzzy rule structure with membership functions and a crisp decision-making procedure. One of the main advantages of our approach is that grasp recognition, hand control, and execution monitoring can be implemented within the same modeling framework.

This paper is organized as follows: Section II describes our experimental platform consisting of a data glove and a hand simulation tool. Furthermore, the simulation model of the hand for the generation of grasp models is presented. Section III briefly discusses the learning of grasps by time-clustering. Section IV is the main part of this paper and describes a new method for the segmentation and recognition of grasp sequences using a hybrid fuzzy approach. Section V presents experiments on the segmentation and recognition of selected grasps. Section VI draws some conclusions and outlines directions for future work.

II. AN EXPERIMENTAL PLATFORM FOR PROGRAMMING BY DEMONSTRATION

A. General principles

The PbD methodology for robotic grasping involves two main tasks: segmentation of human demonstrations and grasp recognition, see Fig. 1. The first task is to partition the data record into a sequence of episodes, where each one contains a single grasp. The second task is to recognize the grasp performed in each episode. At the final stage, the demonstrated task is (automatically) converted into program code that can be executed on a particular robotic platform. The idea of our PbD approach is to employ grasping primitives corresponding to human grasps to control the robot, see [7]. Then, if the system is able to recognize the corresponding human grasps in a demonstration, the robot will be able to perform the demonstrated task by activating the respective grasping primitives. Therefore, in this article we address the problem of segmentation and recognition of human grasp sequences.

Fig. 1. Programming by demonstration of grasping tasks

B. Experimental platform

The experimental platform consists of a hand motion capturing device and a hand simulation environment. The human demonstrator's motions are recorded by a data glove (CyberGlove) which measures 18 joint angles of the hand and the wrist (see [8]). Since humans mostly use a limited number of grasp types, the recognition process can be restricted to a certain grasp taxonomy, such as those developed by Cutkosky [9] and Iberall [10]. Here, we use Iberall's human grasp taxonomy. To test the grasping primitives, we developed a simulation model of a five-fingered hand with 3 links and 3 joints in each finger. The simulation environment allows us to perform a kinematic simulation of the artificial hand and its interaction with modeled objects (see Fig. 2). Moreover, we can simulate recorded demonstrations of human operators and compare them with the results from the execution of the corresponding grasping primitives. We tested 15 different objects (see Fig. 3), some of which belong to the same class in terms of the applied grasp type. For example, cylinder and small bottle correspond to a cylindrical grasp, sphere and cube to a precision grasp, etc.

Fig. 2. Simulation of the hand

Fig. 3. Grasp primitives

III. LEARNING OF GRASPING PRIMITIVES BY TIME-CLUSTERING AND TAKAGI-SUGENO MODELING

The recognition of a grasp type just by taking into account the initial and the final posture of the hand is difficult to achieve. Therefore, there is a need for a model that reflects the behavior of the hand in time. To build a model of the complete grasp motion, two main features are considered:
- Fingertip positions at selected time points
- Differences between selected fingertip positions (e.g. between index finger and thumb)

If the fingertip positions are not available, the joint angles of the fingers and the transformation to the fingertip positions have to be known. The incorporation of the time parameter in the modeling process has, among others, the advantage of enabling the recognition of grasps from given time sequences of hand motions. In the following we briefly describe an approach to the learning of human grasps from demonstrations using so-called time-clustering [7]. The result is a set of grasp models for a selected number of human grasp motions. According to Section II-A, experiments were performed in which time sequences for 15 different grasps were collected using a data glove with 18 sensors (see [8]). Each record consists of four phases:
• pre-grasp posture
• grasping the object (closing the hand)
• releasing the object (opening the hand)
• hand at rest

Each demonstration has been repeated several times to collect enough samples of every particular grasp. The time period for a single grasp is about 3 seconds. From those data, models for each individual grasp have been developed. Such a model is generated by Takagi-Sugeno fuzzy modeling and fuzzy clustering [11]. The basic idea is described in [7] but is briefly revisited in the following. We consider the 3 fingertip


coordinates as model outputs and the time instants as model inputs. The TS fuzzy model is constructed from captured data as follows. Let the trajectory of a fingertip be described by the nonlinear function

$x(t) = f(t)$   (1)

where $x(t) \in R^3$, $f \in R^3$, and $t \in R^+$. Equation (1) can be linearized at selected time points $t_i$ as follows:

$x(t) = x(t_i) + \frac{\Delta f(t)}{\Delta t}\big|_{t_i} \cdot (t - t_i)$   (2)

which is a linear equation in $t$,

$x(t) = A_i \cdot t + d_i$   (3)

where $A_i = \frac{\Delta f(t)}{\Delta t}|_{t_i} \in R^3$ and $d_i = x(t_i) - \frac{\Delta f(t)}{\Delta t}|_{t_i} \cdot t_i \in R^3$ are determined by a clustering procedure. Using (3) as a local linear model, one can express (1) in terms of an interpolation between several local linear models by applying Takagi-Sugeno fuzzy modeling [12] (see Fig. 4):

$x(t) = \sum_{i=1}^{c} w_i(t) \cdot (A_i \cdot t + d_i)$   (4)

where $w_i(t) \in [0, 1]$ is the degree of membership of the time point $t$ to a cluster with the cluster center $t_i$, $c$ is the number of clusters, and $\sum_{i=1}^{c} w_i(t) = 1$.

Fig. 4. Time-clustering principle for one fingertip and its motion in (x,y)
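To make (4) concrete, here is a minimal sketch of how such a time-clustering TS model could be evaluated once it has been learned. The cluster centers, the local models $(A_i, d_i)$, and the normalized Gaussian shape of the memberships $w_i(t)$ are illustrative assumptions of this sketch; in the paper these quantities result from fuzzy clustering of the recorded fingertip data.

```python
import numpy as np

# Minimal sketch of evaluating a time-clustering TS model, eq. (4):
# x(t) = sum_i w_i(t) * (A_i * t + d_i), with normalized memberships w_i(t).
# Cluster centers, local models, and the Gaussian membership shape are
# placeholders; the paper obtains them by fuzzy clustering of recorded data.

c = 5                                   # number of time clusters
t_centers = np.linspace(0.0, 3.0, c)    # cluster centers t_i (grasp ~3 s)
A = np.random.randn(c, 3)               # local slopes A_i in R^3 (placeholder)
d = np.random.randn(c, 3)               # local offsets d_i in R^3 (placeholder)
sigma = 0.4                             # membership width (assumed)

def memberships(t):
    """Degrees of membership w_i(t) of time t to each cluster, summing to 1."""
    w = np.exp(-0.5 * ((t - t_centers) / sigma) ** 2)
    return w / w.sum()

def fingertip_position(t):
    """TS model output x(t): membership-weighted blend of local linear models."""
    w = memberships(t)
    local = A * t + d                   # row i holds A_i * t + d_i
    return w @ local                    # sum_i w_i(t) * (A_i * t + d_i)

# Example: reconstruct one fingertip trajectory over a grasp cycle.
trajectory = np.array([fingertip_position(t) for t in np.linspace(0, 3, 100)])
print(trajectory.shape)                 # (100, 3)
```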

IV. SEGMENTATION AND RECOGNITION OF GRASPS

In the previous section we have shown that T-S fuzzy models can be successfully used for the modeling and imitation of human grasps. Now we will show that they can also serve for the recognition and classification of grasps from recorded human demonstrations. This task can be divided into two aspects:
- Segmentation of recorded data and recognition of those time sequences that correspond to models of grasp primitives.
- Recognition and classification of the identified grasps according to the models.

Recognition of a grasp from a combination of a priori known grasps is based on a recognition model. This model is represented by a set of recognition rules using the model grasps mentioned in the last section. The generation of the recognition model is done in the following steps:
1. Computation of distance norms between a time sequence of a given grasp combination and the model grasps involved.
2. Computation of extrema along the sequence of distance norms obtained.
3. Formulation of a set of fuzzy rules reflecting the relationship between the extrema of the distance norms and the model grasps involved.
4. Computation of a vector of similarity degrees between the model grasps and the grasp combination.

1) Distance norms: Let, for example, grasp$_2$, grasp$_5$, grasp$_7$, grasp$_{10}$, grasp$_{14}$ be a combination of grasps taken from the list of grasps shown in Fig. 3. A time series of this combination of grasps, in the following called the 'grasp combination', is generated in the order i = 2, 5, 7, 10, 14 using the existing time series of the corresponding grasp models, in the following called 'grasps' or 'model grasps'. Each grasp is separated from the others by a 'pause' in order to obtain a realistic recognition problem. The 'pause' is a specific time series representing an open hand. After that, each of the model grasps i = 2, 5, 7, 10, 14 is shifted along the time sequence of the grasp combination and compared with parts of it. Shifting and comparison take place in steps of the time cluster centers $t_1 \ldots t_{n_c}$ of the individual grasp, where $n_c$ is the number of time clusters. "Comparison" means taking the norm $\|X_{c_i} - X_{m_i}\|$ of the difference between the fingertip positions $X_{m_i} = (x_m(t_1), \ldots, x_m(t_{n_c}))^T$ of a grasp$_i$ and the fingertip positions of the grasp combination $X_{c_i} = (x_c(\tilde{t}_1), \ldots, x_c(\tilde{t}_{n_c}))^T$


where $\tilde{t}_k = t_k + (j-1)\Delta t$, $k = 1 \ldots n_c$, $j = 1 \ldots (n - n_c)$, $n$ is the number of steps in the grasp combination, and $\Delta t = (t_{n_c} - t_1)/n_c$. The vectors $x_m$ and $x_c$ include the 3 fingertip positions for each of the 5 fingers plus the distances between pinkie finger, index finger, and thumb, respectively. For scaling reasons the norm of the difference is divided by the norm $\|X_{m_i}\|$ of the model grasp. We then obtain the scaled norm

$n_i = \frac{\|X_{c_i} - X_{m_i}\|}{\|X_{m_i}\|}$   (5)

where the $n_i$ are functions of time, $n_i = n_i(\tilde{t})$, $\tilde{t} = t_1 + (j-1)\Delta t$. With this, for each grasp i = 2, 5, 7, 10, 14 a time sequence $n_i(\tilde{t})$ is generated. During the pauses the norms $n_i$ do not change significantly because of the low correlation between the time series. But once the model grasp starts to overlap a grasp in the grasp combination, the norms $n_i$ change rapidly and reach an extremum at the highest overlap, which is either a minimum or a maximum (see Fig. 5).
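The sliding comparison described above can be summarized in a short sketch. It assumes that the model grasp and the grasp combination are given as arrays whose rows stack the fingertip coordinates and finger distances at the cluster-center time points; the array names and shapes are assumptions for illustration.

```python
import numpy as np

# Sketch of the sliding distance norm of eq. (5). X_model holds the model
# grasp features at its n_c cluster-center time points; X_comb holds the
# feature sequence of the whole grasp combination (n steps). Each row stacks
# the 15 fingertip coordinates plus the finger distances (assumed layout).

def scaled_norms(X_model, X_comb):
    """n_i(t~) = ||X_c - X_m|| / ||X_m|| for each shift j of the model grasp."""
    n_c = X_model.shape[0]
    n = X_comb.shape[0]
    norm_model = np.linalg.norm(X_model)
    norms = []
    for j in range(n - n_c):
        window = X_comb[j:j + n_c]      # part of the combination under the model
        norms.append(np.linalg.norm(window - X_model) / norm_model)
    return np.array(norms)

# Example: one norm sequence per model grasp in the combination, e.g.
# n_seq = {i: scaled_norms(models[i], combination) for i in (2, 5, 7, 10, 14)}
```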

Fig. 5. Overlap principle

2) Extrema in the distance norms: One important result is that for a specific model grasp the norms $n_i$ form individual patterns. For the discussed grasp combination it can be expected that, for each model grasp, a minimum or a maximum occurs at five particular time points. These time points are known in advance from the model generation phase. In order to find the local extrema in $n_i$ during the search process, the whole time period $T = (n - n_c - 1)\Delta t$ is partitioned into $l$ time slices. In order to identify all relevant extrema, the length $T_{slice}$ of the time slices should satisfy $T_{grasp,min} < T_{slice} < T_{dist,min}/2$, where $T_{grasp,min}$ is the minimum time length of a grasp and $T_{dist,min}$ is the minimum time distance between two grasps. In doing so, all 5 local extrema of $n_i$ can be reliably identified. This search yields two vectors whose elements are, per time slice, the absolute values of the minima of $n_i$,

$\mathbf{z}_{min,i} = (z_{1\,min,i}, \ldots, z_{l\,min,i})^T$,

and the absolute values of the maxima of $n_i$,

$\mathbf{z}_{max,i} = (z_{1\,max,i}, \ldots, z_{l\,max,i})^T$.

3) Set of fuzzy rules: In order to recognize a model grasp$_i$ from the grasp combination, a set of 5 rules is formulated, one rule for each grasp in the combination. A general recognition rule for grasp$_i$ to be identified from the combination reads

IF grasp$_j$ is $ex_{ji}$ AND ... AND grasp$_k$ is $ex_{ki}$ THEN grasp is grasp$_i$   (6)

where $j, k$ are the indexes of the individual grasps in the grasp combination and $i$ is the index of the single model grasp$_i$ that is compared with the grasps in the grasp combination. The terms $ex_{ji}$ indicate local extrema which are either minima $min_{ji}$ or maxima $max_{ji}$. Extrema appear at the time points $\tilde{t} = t_j$ where model grasp$_i$ meets grasp$_j$ in the grasp combination with a maximum overlap. In our example the rule to identify grasp$_5$ reads

IF grasp$_2$ is $max_{2,5}$ AND grasp$_5$ is $min_{5,5}$ AND grasp$_7$ is $max_{7,5}$ AND grasp$_{10}$ is $min_{10,5}$ AND grasp$_{14}$ is $max_{14,5}$ THEN grasp is grasp$_5$   (7)
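To illustrate the rule structure, the antecedent pattern of a rule such as (7) could be stored as plain data. The min/max pattern below is the one stated in rule (7); the container layout itself is an assumption of this sketch.

```python
# Sketch of the rule base of eqs. (6)-(7) encoded as data: for each rule i
# (the grasp to be identified), the expected extremum type of every norm n_j
# at the moment grasp_i occurs. The entry for i = 5 mirrors rule (7); the
# dict layout is an illustrative assumption.

RULES = {
    5: {2: "max", 5: "min", 7: "max", 10: "min", 14: "max"},
    # ... one entry per model grasp i in the combination
}

def antecedents(i):
    """Yield (grasp_j, extremum type ex_ji) pairs of rule i."""
    return RULES[i].items()
```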

Then each local minimum (maximum) is normalized to the total minimum (maximum) of the whole set of rules, taking into account the connection between the particular rules. Let the total extremum $z_{extot}$ be either a total minimum $z_{mintot}$ or a total maximum $z_{maxtot}$ over all 5 rules and all time slices (see Fig. 6):

$z_{mintot} = \min(z_{j\,min,i})$, $z_{maxtot} = \max(z_{j\,max,i})$, $j = 1, \ldots, l$; $i = 2, 5, 7, 10, 14$   (8)

Furthermore, let us expand the scalars $z_{extot}$ to vectors:

$\mathbf{z}_{extot} = z_{extot} \cdot v$, $\mathbf{z}_{mintot} = z_{mintot} \cdot v$, $\mathbf{z}_{maxtot} = z_{maxtot} \cdot v$, $v = (1, 1, \ldots, 1)^T$, $v \in R^{1 \times l}$   (9)

Then a local extremum $z_{j\,ex,i}$ can be expressed by the total extremum $z_{extot}$ and a corresponding weight $w_{ji} \in [0, 1]$:

$z_{j\,ex,i} = w_{ji} \cdot z_{extot}$, i.e. $z_{j\,min,i} = w_{ji} \cdot z_{mintot}$, $z_{j\,max,i} = w_{ji} \cdot z_{maxtot}$, $i = 2, 5, 7, 10, 14$   (10)
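A brief sketch of the extrema search of step 2 and the weights of (10) follows; the slicing with numpy and the use of the largest-magnitude total extremum for normalization are assumptions of this illustration.

```python
import numpy as np

# Sketch of step 2 and eqs. (8)-(10): split each norm sequence n_i into l
# time slices, take the absolute values of the per-slice minima and maxima,
# and express each local extremum as a weight w_ji relative to the total
# extremum over all rules and slices. Names and slicing are illustrative.

def slice_extrema(n_i, l):
    """Per-slice |min| and |max| of one norm sequence (inputs to eq. (8))."""
    slices = np.array_split(np.asarray(n_i), l)
    z_min = np.array([abs(s.min()) for s in slices])
    z_max = np.array([abs(s.max()) for s in slices])
    return z_min, z_max

def extremum_weights(z_ex, z_extot):
    """Weights w_ji of eq. (10); with z_extot the largest-magnitude total
    extremum, the ratios z_jex,i / z_extot fall into [0, 1]."""
    return np.asarray(z_ex) / z_extot

# Example (assumed data): norm sequences per model grasp, l = 20 slices.
# z = {i: slice_extrema(n_seq[i], 20) for i in (2, 5, 7, 10, 14)}
# z_maxtot = max(z_i[1].max() for z_i in z.values())
```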

4) Similarity degrees: From the time plots in Fig. 6 of the norms $n_i$ for the training sets, the analytical form of (6) for the identification of grasp$_i$ is chosen as

$a_i = \prod_j m_{ji}$, $j = 2, 5, 7, 10, 14$   (11)

with

$m_{ji} = \mathrm{Exp}(-|w_{ji} \cdot \mathbf{z}_{extot} - \mathbf{z}_{ex,i}|)$

where $\mathbf{z}_{ex,i} = (z_{1ex,i}, \ldots, z_{lex,i})^T$ and $a_i = (a_{1,i}, \ldots, a_{l,i})^T$; $a_{m,i} \in [0, 1]$, $m = 1 \ldots l$, is a vector of similarity degrees between the model grasp$_i$ and the individual grasps 2, 5, 7, 10, 14 in the grasp combination at the time point $t_m$. The vector $\mathbf{z}_{ex,i}$ represents either the vector of minima $\mathbf{z}_{min,i}$ or of maxima $\mathbf{z}_{max,i}$ of the norms $n_i$, respectively. Furthermore we define

$\mathrm{Exp}(-|w_{ji} \cdot \mathbf{z}_{extot} - \mathbf{z}_{ex,i}|) = (e^{-|w_{ji} \cdot z_{extot} - z_{1ex,i}|}, \ldots, e^{-|w_{ji} \cdot z_{extot} - z_{lex,i}|})^T = (m_{ji}^1, \ldots, m_{ji}^l)^T$   (12)

For the product $\prod_j m_{ji}$ we use the convention

$m_{ji} \cdot m_{ki} = (m_{ji}^1 \cdot m_{ki}^1, \ldots, m_{ji}^l \cdot m_{ki}^l)^T$   (13)

The product operation in (11) represents the AND-operation in the rules (6) and (7). The exponential function in (12) is a membership function indicating the distance of a norm $n_i$ to a local extremum $w_{ji} \cdot z_{extot}$. With this, the exponential function reaches its maximum at exactly that time point $t_m$ at which grasp$_i$ in the grasp combination has its local extremum (see, e.g., Fig. 7). Since we assume the time points for the occurrence of the particular grasps in the combination to be known, we also know the similarity degrees $a_{m,i}$ between the model grasp$_i$ and the grasps 2, 5, 7, 10, 14 at these time points. If, for example, grasp$_5$ occurs at the time point $t_m$ in the grasp combination, then we obtain $a_{m,5} = 1$. All the other grasps lead to smaller values of $a_{k,i}$, $k = 2, 7, 10, 14$. With this, both the start of a grasp and its type are identified.

Fig. 6. Norms of a grasp sequence
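The evaluation of (11)-(13) then reduces to a few lines. In this sketch we read the extrema vector in (12) as that of the antecedent grasp $j$, since each antecedent of rule $i$ tests the norm pattern of a different model grasp; this reading and all names are illustrative assumptions.

```python
import numpy as np

# Sketch of eqs. (11)-(13): exponential memberships m_ji compare observed
# per-slice extrema against the expected level w_ji * z_extot, and the
# element-wise product over the rule antecedents (the AND in (6)-(7))
# yields the similarity-degree vector a_i over the time slices.

def membership(w_ji, z_extot, z_ex_j):
    """m_ji of eq. (12): element-wise exponential membership over slices."""
    return np.exp(-np.abs(w_ji * z_extot - np.asarray(z_ex_j)))

def similarity_degrees(w_i, z_extot, z_ex):
    """a_i of eq. (11): slice-wise product over all antecedents of rule i.
    w_i maps grasp j -> scalar weight w_ji; z_ex maps grasp j -> observed
    per-slice extrema of norm n_j (assumed interpretation)."""
    a_i = None
    for j, w_ji in w_i.items():
        m_ji = membership(w_ji, z_extot, z_ex[j])
        a_i = m_ji if a_i is None else a_i * m_ji   # convention of eq. (13)
    return a_i

# The slice where a_i peaks (close to 1) gives both the start time and the
# type of the recognized grasp.
```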

Fig. 7. Grasp membership functions

V. EXPERIMENTS AND SIMULATIONS

In this section we present the experimental evaluation of the proposed approach for the segmentation and classification of grasp behaviors. We tested the 15 different grasps shown in Fig. 3. As an example, a grasp combination with the grasps 2, 5, 7, 10, 14 is discussed. For the training phase a sequence of the model grasps 2, 5, 7, 10, 14 is generated. Then, with the help of the same model grasps, the norms $n_i = n_i(\tilde{t})$ are computed according to (5). Based on the resulting curves in Fig. 8 (see details in Fig. 9), the weights for the rule model (7) are determined.

Fig. 8. Grasp combination

Fig. 9. Detail of grasp 5

The training results in the table in Fig. 10 show that both the starting points and the types of the grasps 5, 7, 10, and 14 can be reliably recognized. Grasp 2, however, shows rather high values at other time points too, meaning that the recognition of this grasp is less reliable. A test sample of the same grasp combination confirms this: there are good results for the grasps 5, 7, 10, and 14 but a different (wrong) result for grasp 2 (see Fig. 11). Many tests with other grasp combinations have shown that some grasps can be recognized more reliably than others. Grasps with distinct maxima and minima in their $n_i$ patterns can be recognized better than grasps without this feature. In addition to the above example, another combination with a high recognition rate is grasp(1,13,12,8,10), where grasps 1, 12, 8, and 10 are reliably recognized. In the combination grasp(11,12,13,14,15) only the grasps 11, 14, and 15 could be recognized well, and other combinations show even lower recognition rates. It could be shown that grasps 1, 10, and 14 are the most reliably recognized grasps, whereas grasps 2, 9, and 13 are the least reliable ones. Reliable grasps are also robust against variations in the time span of an unknown test grasp compared to the time span of the respective model grasp. Our results show that the method can handle up to 20% temporal difference. The analysis confirms the obvious fact that distinct grasps can be discriminated quite well from each other, while the discrimination between similar grasps is difficult. Therefore, merging similar grasps into larger classes can improve the recognition process significantly. Examples of such classes are grasp(5,6,15) and grasp(8,9). The final decision about the recognition of a grasp within a class may be taken on the basis of the technological context.

Fig. 10. Grasp combination, training results

Fig. 11. Grasp combination, test results

VI. CONCLUSIONS

In this paper, a new fuzzy-logic based method for the modeling and classification of human grasps has been presented. The goal is to develop an easy way of programming grasps by demonstration for a humanoid robotic arm. Grasp primitives are captured by a data glove and modeled by TS fuzzy models. Fuzzy clustering and modeling of time and space data is applied to the modeling of the fingertip trajectories of grasp primitives. Given a test sequence of grasps, the method allows us to identify the segments in which grasps occur and to recognize the grasp types. For this purpose, each grasp is correlated with a training sequence. By means of a hybrid fuzzy model the original grasp sequence can be reconstructed. With this method, both the segmentation of grasps and their recognition are possible. The method also allows the recognition of a single grasp from a set of possible grasps. To improve the PbD process, the method will be further developed for the recognition and classification of operator movements in a robotic environment using more sensor information about the robot workspace and the objects to be handled.

REFERENCES

[1] S.B. Kang and K. Ikeuchi. Towards automatic robot instruction from perception - mapping human grasps to manipulator grasps. IEEE Transactions on Robotics and Automation, 13(1), 1997.
[2] K. Bernardin, K. Ogawara, K. Ikeuchi, and R. Dillmann. A sensor fusion approach for recognizing continuous human grasp sequences using hidden Markov models. IEEE Transactions on Robotics, 21(1), February 2005.
[3] R. Zoellner, O. Rogalla, R. Dillmann, and J.M. Zoellner. Dynamic grasp recognition within the framework of programming by demonstration. In Proceedings of the 10th IEEE International Workshop on Robot and Human Interactive Communication, Bordeaux, Paris, France, September 18-21, 2001.
[4] S. Ekvall and D. Kragic. Grasp recognition for programming by demonstration. In Proceedings of the International Conference on Robotics and Automation (ICRA 2005), Barcelona, Spain, 2005.
[5] Ch. Li, L. Khan, and B. Prabhakaran. Real-time classification of variable length multi-attribute motions. Knowledge and Information Systems, DOI 10.1007/s10115-005-0223-8, 2005.
[6] J. Aleotti and S. Caselli. Grasp recognition in virtual reality for robot pregrasp planning by demonstration. In Proceedings of the International Conference on Robotics and Automation (ICRA 2006), Orlando, Florida, USA, May 2006.
[7] R. Palm and B. Iliev. Learning of grasp behaviors for an artificial hand by time clustering and Takagi-Sugeno modeling. In Proceedings of FUZZ-IEEE 2006 - IEEE International Conference on Fuzzy Systems, Vancouver, BC, Canada, July 16-21, 2006.
[8] H.H. Asada and J.R. Fortier. Task recognition and human-machine coordination through the use of an instrument-glove. Progress report No. 2-5, pages 1-39, March 2000.
[9] M. Cutkosky. On grasp choice, grasp models, and the design of hands for manufacturing tasks. IEEE Transactions on Robotics and Automation, 5(3), 1989.
[10] T. Iberall. Human prehension and dexterous robot hands. The International Journal of Robotics Research, 16:285-299, 1997.
[11] R. Palm and Ch. Stutz. Generation of control sequences for a fuzzy gain scheduler. International Journal of Fuzzy Systems, 5(1):1-10, March 2003.
[12] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, SMC-15(1):116-132, January/February 1985.
