
View-adaptive Manipulative Action Recognition for Robot Companions

Zhe Li, Sven Wachsmuth, Jannik Fritsch, and Gerhard Sagerer

Abstract— This paper puts forward an approach for a mobile robot to recognize a human's manipulative actions from different single camera views. While most of the related work in action recognition assumes a fixed static camera view that is the same for training and testing, such constraints do not apply to mobile robot companions. We propose a recognition scheme that is able to generalize an action model, learned from very few data items observed from a single camera view, to varying viewpoints and different settings. We tackle the problem of compensating the view dependence of 2D motion models on three different levels. Firstly, we pre-segment the trajectories based on an object vicinity that depends on the camera tilt and object detections. Secondly, an interactive feature vector is designed that represents the relative movements between the human hand and the objects. Thirdly, we propose an adaptive HMM-based matching process that is based on a particle filter and includes a dynamically adjusted scaling parameter that models the systematic error of the view dependency. Finally, we use a two-layered approach for task recognition which decouples the task knowledge from the view-dependent primitive recognition. The results of experiments in an office environment show the applicability of this approach.

Index Terms— action recognition, manipulative gesture, hidden Markov model, particle filter, view-adaptive, varying view-angles

I. INTRODUCTION

For human-centered robots, it is very important to be aware of the state of the user. The visual recognition of human actions provides a non-intrusive way for such communication between a human and the robot, especially in passive, more observational situations. Recently, much work has been done in this area [10]. Our aim is the vision-based recognition of object manipulations, e.g. "take cup", or even action sequences like "prepare tea", which contains "take cup", "take tea can", etc. According to Yacoob's analysis [11], two main aspects affect the modeling and recognition: the human performance and the image perception process. The former includes aspects like variations in the repeated performance of the same activity, even for the same person. Different individuals perform similar activities in significantly different ways. In this context, defining the onset and offset of an activity is challenging, as similar activities frequently have different temporal durations. The latter contains issues like varying viewpoints as well as observational distances and occlusion during performance.

Zhe Li, Sven Wachsmuth, and Gerhard Sagerer are with the Applied Computer Science Group, Faculty of Technology, Bielefeld University, D-33594 Bielefeld, Germany. {lizhe, swachsmu, sagerer}@techfak.uni-bielefeld.de
Jannik Fritsch is with the Honda Research Institute Europe GmbH, 63073 Offenbach, Germany. [email protected]

As an extension of our previous work [4], this paper focuses on the recognition of manipulative actions from different view-angles. There are two main strategies for view-invariant action recognition. The first one is to reconstruct a 3D representation of the movements. This can be achieved using multiple calibrated cameras, which is too restrictive for most robotic platforms. Other approaches use an elaborate human body model for 3D tracking from mono-camera images [8]. This approach typically suffers from the initialization problem, which is currently unsolved. The second strategy is to find view-invariant features in 2D images and to model the actions based on such features. Rao introduced view-invariant features, the so-called dynamic instants [7], which are the dramatic changes of the spatio-temporal curvature of a 2D trajectory. He argues that the same action should have the same number of instants. But in our scenario, the trajectories of object manipulations can have very different dynamic appearances because of the different positions of the objects and the unpredictable movements that take place far away from the objects. In contrast to these approaches, we use a 2D motion model but do not insist on seeking view-invariant features. Inspired by the work of Wilson [9], we regard the difference between the trajectories of the same action observed from different viewpoints as a kind of systematic error. He used a parametric HMM (PHMM) to model gestures with the same meaning but different scalar quantities, like the gesture accompanying "this" within the sentence "I caught a fish. It is this big". Instead of using a time-invariant linear model for the observation probability, we use a dynamic scaling parameter of the observation model in order to cope with the nonlinear changes of the trajectories caused by different view-angles. In order to make the problem tractable, we assume that manipulative actions are performed on a table plane, which is a typical situation in domestic environments. To recognize the interactions between the hand and the objects, our approach uses object-centered context. But different to Moore's object space [6], we model a local relevance by introducing object vicinities and we treat different camera views, which lead to much more severe trajectory variations. The proposed observation model considers a coarse estimate of the camera view point and the object distance, which leads to feature vectors that are less affected by the change of view-angles. We understand the action recognition as a two-layered process consisting of primitive manipulations and structured tasks. Hence, the task level becomes view-independent and can provide top-down knowledge for the view-dependent part.

Fig. 1. Processing flow (a) and dependency structure of two time slices (b) in the recognition model. Each object-centered processing thread in (a) corresponds to one of the L plates in the dependency model. K is the number of different tasks modeled in the system and M is the number of possible primitives; these correspond to the states of the task-level variable p^1_t and the primitive-level variable p^0_t, respectively. The upper index of these variables denotes the primitive vs. task level.

The manipulative primitives are modeled by Hidden Markov Models (HMMs), which are extended by an additional coupled hidden state modeling the systematic error introduced by different view-angles. The primitives are spotted from trajectories in the objects' vicinities by a particle filter (PF). The upper layer of the system takes the primitive recognition results of the lower layer as input and recognizes the underlying tasks from the sequences of primitives. This paper is organized as follows: In the next section, we first present the system architecture. Then, the object vicinity and the feature vector are introduced. After that, we describe how the particle filter is used to match the trajectories from different viewpoints to one of the HMM models. The task model is explained at the end of that section. Section III presents the experimental setting and the evaluation of the method. Finally, a summary concludes the paper.

II. RECOGNITION SYSTEM

In our definition, a manipulative task has two semantic layers. The bottom layer consists of the object-specific manipulative primitives. Each object has its own set of manipulative primitives because we argue that different object types serve different manipulative functions, and even manipulations with the same functional meaning are performed differently on different objects. The top layer represents the manipulative tasks, which are modeled by typical transitions between certain manipulative primitives. The system architecture is shown in Figure 1(a). From bottom to top, a processing thread is created for each detected object, so the feature computation and HMM-based recognition are performed in parallel for different objects. The task level takes the detected primitives as input. The task decision is based on matching the complete sequence detected on the first level, which includes all primitives from the different threads, against the different task models.

For recognition, a top-down processing step utilizes the task-level prediction of possible primitives as a task-driven attention filter on the low-level image processing [4].

A. Feature Extraction

The manipulative gesture differs from the face-to-face interactional gesture in that it reflects the interaction between the human hand and the objects, not a pure hand movement with a meaningful trajectory. The hand is detected in a color image sequence by an adaptive skin-color segmentation algorithm (see [1] for details) and tracked over time using Kalman filtering. The hand observation o^{hand}_t is represented by the hand position (h_x, h_y)_t at time t. In order to avoid partial occlusion problems with interacting hands and to obtain the size of the objects in the scene, we use an object recognizer based on the Scale-Invariant Feature Transform (SIFT) [5] on the static scene. Object-dependent primitive actions are then defined purely on the basis of the hand trajectory that approaches an object, instead of considering the object in the hand as context. If a moved object is applied to another object, the second object defines the object context. The observation vector of a detected object o^{obj}_i contains its position (o_{x,i}, o_{y,i}), a unique identifier ID_i for each different object type in the scene, and its height o_{h,i} and width o_{w,i}. As there can be several objects in the scene, the overall object observation vector contains multiple objects:

o^{obj} = {o^{obj}_1, ..., o^{obj}_i, ..., o^{obj}_L}  with  o^{obj}_i = (o_{x,i}, o_{y,i}, ID_i, o_{h,i}, o_{w,i}).   (1)
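As a rough illustration only, the hand and object observations of Eq. 1 can be collected in simple record types; the field and type names below are our own and not taken from the original implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HandObservation:
    """Hand position (h_x, h_y) at time t from skin-color tracking."""
    x: float
    y: float

@dataclass
class ObjectObservation:
    """One detected object o_i^obj = (o_x, o_y, ID, o_h, o_w), cf. Eq. 1."""
    x: float
    y: float
    obj_id: str      # unique identifier per object type
    height: float    # o_{h,i} in pixels
    width: float     # o_{w,i} in pixels

# The overall object observation is simply the list of all detected objects.
ObjectObservations = List[ObjectObservation]
```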

Intuitively, the relative movement between hand and object carries fewer interaction features when the two are far away from each other. But the real-world distance between two objects cannot, in general, be measured from 2D images without depth information. Because the manipulations in our scenario mainly happen on a flat table surface, a vicinity of an object on the table is defined.

Fig. 2. The projection of an object vicinity.

Fig. 3. The effect of the feature transform: (a) trajectories in pixel coordinates, (b) features of distance and speed, (c) transformed features, (d) relative angle vs. time.

This vicinity is centered in the middle of the object and limited by the ratio β of its radius to the object size (see Figure 2). In order to obtain the projection of the border of the object vicinity without measuring the relative position between the object and the camera, it is assumed that the size of a detected object in the image is inversely proportional to its distance to the camera. Note that the exact closed curve of the projected object vicinity is not an ellipse with the object in the center. To avoid a complex representation of the curve caused by the 3-D to 2-D projection, it is approximated by an ellipse. Figure 2 illustrates the projection. Based on the focal length and the tilt angle of the observing camera, the lengths of the axes of the ellipse for object i are estimated as:

a_i = r_i · β · cos(arctan(o′_x / f))
b_i = r_i · β · sin(c_t + arctan(o′_y / f))   (2)

Here, a_i and b_i are the horizontal and vertical semi-axes, r_i is the radius of object i in the image, o′_x and o′_y are the offsets of the object position to the image center, f is the camera focal length measured in pixels, and c_t is the tilt angle of the camera. In fact, the approximation can be thought of as "moving the optical axis of the lens to the center of the detected object and then taking the projection of the object vicinity". By applying this approximation, the prior knowledge needed to obtain the vicinity of an object in 2-D images only consists of the tilt angle and the intrinsic calibration parameters of the robot camera. There is no need to know the distance between the robot and the table, the height of the table, the height of the robot, etc., which gives great flexibility to the system. Based on this vicinity, a pre-segmentation step of the hand trajectory is performed that ignores motions irrelevant for primitive recognition. Considering possible occlusions during manipulation and the uncertainty in moving an object, a segment is started when the hand enters the vicinity or when an object is newly detected and the hand is already in the vicinity (object put down into the scene). It ends when the hand leaves the object's vicinity or when the object is lost after the hand moves away (object has been taken).
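A minimal sketch of Eq. 2 and of the pre-segmentation test follows, assuming angles in radians, the focal length f in pixels, and an object radius estimated elsewhere; the helper names are ours, not from the paper's implementation.

```python
import math

def vicinity_axes(r_i, beta, off_x, off_y, f, tilt):
    """Semi-axes (a_i, b_i) of the projected object vicinity, Eq. 2.

    r_i:   object radius in the image (pixels)
    beta:  ratio of vicinity radius to object size
    off_x, off_y: offsets (o'_x, o'_y) of the object centre to the image centre
    f:     camera focal length in pixels
    tilt:  camera tilt angle c_t in radians
    """
    a_i = r_i * beta * math.cos(math.atan(off_x / f))
    b_i = r_i * beta * math.sin(tilt + math.atan(off_y / f))
    return a_i, b_i

def in_vicinity(hand_xy, obj_xy, a_i, b_i):
    """Pre-segmentation test: is the hand inside the elliptical vicinity?"""
    dx = hand_xy[0] - obj_xy[0]
    dy = hand_xy[1] - obj_xy[1]
    return (dx * dx) / (a_i * a_i) + (dy * dy) / (b_i * b_i) <= 1.0
```

The inside test is simply the scaled distance of Eq. 3 compared against 1, i.e. the border of the vicinity ellipse.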

As a consequence, the trajectory is segmented differently based on the different objects in the scene. To handle this multi-observation problem, one processing thread is started for each detected object. In processing thread i, the interaction of the hand and the object is represented by a 3-dimensional feature vector v^f_i that is calculated from o^{hand} and o^{obj}_i. It contains the following features: (i) the distance d_i between the object and the operative hand, which is not the absolute distance in the image but is measured in polar coordinates scaled by the object size. Suppose the hand position relative to the center of the object vicinity is (h′_{x,i}, h′_{y,i}); then

d_i = sqrt( h′^2_{x,i} / a_i^2 + h′^2_{y,i} / b_i^2 ),   (3)

(ii) the magnitude of the hand speed v_i, which is the difference of d_i between two successive time steps, and (iii) the angle γ_i of the line connecting object and hand relative to the direction of the hand motion:

v^f_i = (d_i, v_i, γ_i).   (4)
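The interaction feature vector of Eqs. 3 and 4 could be computed roughly as sketched below; the speed is taken as the difference of d_i between successive frames and the angle is measured between the object-to-hand direction and the hand motion direction, which is our reading of the text, and all names are illustrative.

```python
import math

def interaction_features(hand_xy, prev_hand_xy, obj_xy, a_i, b_i, prev_d):
    """3-D feature vector v_i^f = (d_i, v_i, gamma_i), cf. Eqs. 3 and 4."""
    hx = hand_xy[0] - obj_xy[0]     # hand position relative to the vicinity centre
    hy = hand_xy[1] - obj_xy[1]
    d_i = math.sqrt(hx * hx / (a_i * a_i) + hy * hy / (b_i * b_i))   # Eq. 3
    v_i = d_i - prev_d              # change of the scaled distance per time step
    # angle between the object-to-hand line and the hand motion direction
    mx = hand_xy[0] - prev_hand_xy[0]
    my = hand_xy[1] - prev_hand_xy[1]
    dot = hx * mx + hy * my
    norm = math.hypot(hx, hy) * math.hypot(mx, my)
    gamma_i = (math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
               if norm > 0 else 0.0)
    return d_i, v_i, gamma_i
```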

This feature vector is not view-invariant, but it transforms the trajectories of the same action perceived from different view-angles into a normalized and comparable feature space. With this representation, the effects of view-variant observation are suppressed. Figure 3 shows the effect using simulated data. Two trajectories stem from the same action in reality, but are observed from two view-angles with 90 degrees difference in pan angle and the same tilt angle of 20.5°. Figure 3(a) displays them in a 2D image of size 320x240. Figure 3(b) shows the original feature sequences in pixels and Figure 3(c) presents the transformed features of distance and speed. The difference between the two trajectories is suppressed after the transform. Figure 3(d) shows the trajectories of the relative angle to the object with respect to time steps. Because this component is heavily affected by the viewpoint, it is quantized into 3 discrete states, [0°, 60°], (60°, 120°), and [120°, 180°]. The other dimensions are kept continuous.

B. Manipulative Primitive Detection

The typical manipulations related to one object type are called object-oriented manipulative primitives, e.g., "take a cup". They are modeled by semi-continuous HMMs with left-right topology. In addition to the normal parameter set λ = (A, B, Π) of an HMM, a terminal probability E is added. It reflects the probability that the HMM terminates given a hidden state s_i. The parameters are learned from manually segmented trajectories with the Baum-Welch algorithm; E is calculated similarly to Π, except that the last states are used. As already discussed in the previous section, the features vary significantly with different viewpoints. In contrast to the variance introduced by different persons or by different performances of the same person, this is a kind of systematic error that we aim to compensate for. Given that we have no a-priori information about the current view-angle, we model this influence by an additional hidden variable which adapts the mean value of the observation probability (Figure 1(b)). Within the particle filter (PF) approach that has already been adopted in previous work, the state of the particles is extended by a newly introduced scaling variable. As a consequence, the robot only needs to observe the action from one point of view during learning and can afterwards recognize it from a significantly different view-angle within a certain range. The underlying PF is a Sampling Importance Resampling (SIR) filter (better known as CONDENSATION, introduced by Isard and Blake [3]). The matching of the HMM and the observation is achieved by temporal propagation of a set of weighted particles:

{(s^{(1)}_t, w^{(1)}_t), ..., (s^{(N)}_t, w^{(N)}_t)}  with  s^{(i)}_t = {p^{0(i)}_t, q^{(i)}_t, e^{(i)}_t, Θ^{(i)}_t}.   (5)

The number of particles is N. The sample s^{(i)}_t contains the primitive index p^{0(i)}_t, the hidden state q^{(i)}_t, the terminal state e^{(i)}_t of this primitive at time t, and the observation scaling vector Θ^{(i)}_t. The weight w^{(i)}_t of a sample is calculated as

w^{(i)}_t = p(o_t | s^{(i)}_t) / Σ_{j=1}^{N} p(o_t | s^{(j)}_t).   (6)

Here, p(o_t | s^{(i)}_t) models the observation probability of the scaled o_t given q^{(i)}_t and HMM p^{0(i)}_t. Let o_{t,m} be the m-th component of the observation vector at time t; then Pr{o_{t,m} | s^{(i)}_t} is calculated as

Pr{o_{t,m} | s^{(i)}_t} = 1 / (sqrt(2π) σ_{s^{(i)}_t,m}) · exp( −(o_{t,m} − θ_{t,m} · μ_{s^{(i)}_t,m})^2 / (2 σ^2_{s^{(i)}_t,m}) ).   (7)

Here, μ_{s^{(i)}_t,m} and σ_{s^{(i)}_t,m} are the mean and standard deviation of the m-th component given the hidden state s^{(i)}_t, which is determined by q^{(i)}_t and p^{0(i)}_t.
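A sketch of Eqs. 6 and 7: the observation probability of a particle is a product of per-component Gaussians whose means are multiplied by the particle's scaling values, and the weights are the normalized probabilities. For simplicity, all components are treated as continuous here (the quantized angle component would need a discrete table instead), and all names are ours.

```python
import math

def scaled_obs_prob(obs, mu, sigma, theta):
    """Eq. 7: product over components m of N(o_m; theta_m * mu_m, sigma_m^2).

    obs, mu, sigma, theta are sequences of equal length, one entry per feature
    component of the hidden state assigned to this particle.
    """
    p = 1.0
    for o_m, mu_m, s_m, th_m in zip(obs, mu, sigma, theta):
        diff = o_m - th_m * mu_m
        p *= math.exp(-diff * diff / (2.0 * s_m * s_m)) / (math.sqrt(2.0 * math.pi) * s_m)
    return p

def normalize_weights(raw_probs):
    """Eq. 6: particle weights proportional to p(o_t | s_t^(i))."""
    total = sum(raw_probs)
    if total == 0.0:
        return [1.0 / len(raw_probs)] * len(raw_probs)   # degenerate case
    return [p / total for p in raw_probs]
```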

Fig. 4. The effect of the scaling parameter.

The effect of this scaling parameter θ_{t,m} is illustrated in Figure 4. Suppose the red line is the model and the green line is the observation; the model can be scaled to the dashed red line within the allowed scaling limits (the dashed rectangles) so that a match is achieved. Note that the original model variance remains unchanged (the two shadowed areas). Following the survival-of-the-fittest rule, the PF chooses the best value within a limited range of the scaling parameters through the particle propagation. The propagation of the weighted samples over time consists of three steps:

Select: N − M samples s^{(i)}_{t−1} are selected according to their respective weights w^{(i)}_{t−1}, and M new samples are randomly initialized. In every newly initialized sample, the value of θ^{(i)}_{t,j} is randomly chosen from its allowed range.

Predict: The current state of each sample s^{(i)}_t is predicted from the samples of the select step according to the graphical model given in Fig. 1(b). The terminal state e^{(i)}_{t−1} is a binary variable: 0 means the primitive is continuing and 1 means the primitive ends here. If e^{(i)}_{t−1} is 0, the next hidden state q^{(i)}_t is sampled from q^{(i)}_{t−1} according to the transition probabilities of the HMM of primitive p^{0(i)}_{t−1}, and the primitive index p^{0(i)}_t stays the same as p^{0(i)}_{t−1}. The θ^{(i)}_{t,m} is updated to θ^{(i)}_{t−1,m} + N(σ_{θ_m}), where N(σ_{θ_m}) is a normal distribution representing the uncertainty in the prediction of θ_m, which spans a much smaller range than the allowed one. If the terminal state e^{(i)}_{t−1} is 1, the primitive index p^{0(i)}_t is sampled according to the currently possible primitives of this object. Then the hidden state q^{(i)}_t is sampled according to the initial probabilities of the HMM of the new primitive p^{0(i)}_t, and Θ^{(i)} is reinitialized. At the end of this step, the terminal state e^{(i)}_t of this particle is sampled based on the terminal probability of the current primitive state q^{(i)}_t.

Update: The weights w^{(i)}_t of the predicted samples s^{(i)}_t are determined using Eq. 6.

The recognition of a manipulative primitive is achieved by calculating the end-probability P_{end} that a certain HMM model p_i is completed at time t:

P_{end,t}(p_i) = Σ_n w^{(n)}_t,  if p_i ∈ s^{(n)}_t.   (8)
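The three propagation steps might be organized as in the sketch below, which reuses scaled_obs_prob and normalize_weights from the previous sketch. The `hmms` dictionary (per-primitive 'initial', 'transition', 'terminal', 'mu', 'sigma' entries), the per-particle dictionaries, and the random initialization of p and q for fresh particles are our own assumptions, not the paper's implementation.

```python
import random

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def sample_from(dist):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return random.choices(range(len(dist)), weights=dist, k=1)[0]

def propagate(particles, weights, obs, hmms, primitives,
              n_new, theta_range=(0.8, 1.2), sigma_theta=0.1):
    """One select/predict/update step over particles {p, q, e, theta}."""
    n = len(particles)
    # Select: resample n - n_new particles by weight, add n_new fresh particles
    # whose theta is drawn uniformly from the allowed range.
    selected = random.choices(particles, weights=weights, k=n - n_new)
    for _ in range(n_new):
        p = random.choice(primitives)
        q = sample_from(hmms[p]['initial'])
        theta = [random.uniform(*theta_range) for _ in range(len(obs))]
        selected.append({'p': p, 'q': q, 'e': 0, 'theta': theta})
    # Predict: advance each particle according to the graphical model in Fig. 1(b).
    predicted = []
    for s in selected:
        if s['e'] == 0:                         # primitive continues
            p = s['p']
            q = sample_from(hmms[p]['transition'][s['q']])
            theta = [clip(t + random.gauss(0.0, sigma_theta), *theta_range)
                     for t in s['theta']]
        else:                                   # primitive ended: start a new one
            p = random.choice(primitives)
            q = sample_from(hmms[p]['initial'])
            theta = [random.uniform(*theta_range) for _ in range(len(obs))]
        e = 1 if random.random() < hmms[p]['terminal'][q] else 0
        predicted.append({'p': p, 'q': q, 'e': e, 'theta': theta})
    # Update: reweight with the scaled observation probability (Eqs. 6 and 7).
    raw = [scaled_obs_prob(obs, hmms[s['p']]['mu'][s['q']],
                           hmms[s['p']]['sigma'][s['q']], s['theta'])
           for s in predicted]
    return predicted, normalize_weights(raw)

def end_probability(particles, weights, primitive):
    """Eq. 8 (our reading): summed weight of particles assigned to this
    primitive whose terminal state flags completion."""
    return sum(w for s, w in zip(particles, weights)
               if s['p'] == primitive and s['e'] == 1)
```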

A primitive model is considered recognized if the probability P_{end,t}(p_k) of the primitive model p_k exceeds a threshold p^0_{th}, which has been determined empirically. The resampling step in the particle propagation is able to adapt the starting point of the model matching process if the beginning of the primitive does not match the beginning of the segment. The end-probability gives an estimate of the primitive's ending point. This combination solves, to a certain extent, the problem of the forward-backward algorithm, which needs a clear segmentation of the pattern.

C. Model of the tasks

The manipulative tasks are modeled as a first-order Markov process, following Moore's definition [6]. Although this assumption violates certain domain dependencies, it is an efficient and practical way to deal with task knowledge. All tasks share a set of possible manipulative primitives. The model Λ_i for a manipulative task i contains the transition matrix A_i, the initial probability Π_i, the terminal probability E_i, and a threshold Th_i. Suppose the result from the manipulative primitive recognition is the sequence P_o. To calculate the acceptance of a task Λ_i = (Π_i, A_i, E_i, Th_i), a random model Λ_r is used, which is defined similarly to a task model but has no associated threshold and is learned from all the training data of the different tasks. The similarity of the sequence and a task model, s(i, P_o), is calculated as

s(i, P_o) = log( p{P_o | Π_i, A_i, E_i} / p{P_o | Λ_r} ).   (9)

The task decision d(P_o) for recognition is

d(P_o) = argmax_i { s(i, P_o) | s(i, P_o) > Th_i },  or null if no task exceeds its threshold,   (10)

which takes the possible rejection into consideration.
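A sketch of the task-level decision (Eqs. 9 and 10) is given below, assuming each task model is a dictionary with 'initial', 'transition', 'terminal' probabilities over primitive labels and a 'threshold'; the data layout and function names are illustrative only.

```python
import math

def sequence_log_prob(seq, model):
    """Log-probability of a primitive sequence under a first-order Markov model
    (initial, transition, terminal probabilities), a minimal stand-in."""
    lp = math.log(model['initial'][seq[0]])
    for a, b in zip(seq, seq[1:]):
        lp += math.log(model['transition'][a][b])
    lp += math.log(model['terminal'][seq[-1]])
    return lp

def task_similarity(seq, task_model, random_model):
    """Eq. 9: log-likelihood ratio of the primitive sequence under a task
    model versus the random model learned from all training tasks."""
    return sequence_log_prob(seq, task_model) - sequence_log_prob(seq, random_model)

def task_decision(seq, task_models, random_model):
    """Eq. 10: accept the best-scoring task above its threshold, else reject."""
    best, best_score = None, float('-inf')
    for name, model in task_models.items():
        s = task_similarity(seq, model, random_model)
        if s > model['threshold'] and s > best_score:
            best, best_score = name, s
    return best          # None means rejection: no task exceeded its threshold
```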

III. EXPERIMENTS

In our experiment, a scenario in an office environment is set up as shown in the images in Figure 5. A person is sitting behind a table and manipulates the objects located on it. She or he is assumed to perform one of three different manipulation tasks: (1) water plant: take cup, water plant, put cup; (2) prepare tea: take/put cup, take tea can/sugar, pour tea/take sugar into cup, put tea can; (3) prepare coffee: take/put cup, take milk/sugar, pour milk/take sugar into cup, put milk. The images are recorded by 4 cameras at the same time with a resolution of 320x240 pixels and a frame-rate of 15 images per second. The positions and view-angles of the cameras are given in Table I. The distance to the table and the tilt angle of the camera are illustrated in Figure 2. The pan angles of the cameras are indicated in Figure 5. All cameras have the same height of 160 cm as the robot BIRON [2].

TABLE I
THE POSITIONS AND VIEW-ANGLES OF THE CAMERAS

                        cam 1   cam 2   cam 3   cam 4
dis. to table (cm)      100     100     150     100
pan angle c_p (°)       17.3    0       0       30.4
tilt angle c_t (°)      21.6    20.5    12.4    22.7


Fig. 5. Views from 4 cameras with visualized pan angles cp . The arrow indicates the direction of the view on the horizontal plane. The dashed line is the center line of the table.

In the experiment, each task is performed 15 times with different object layouts by 2 different persons. The object recognition results have been labeled manually because the evaluation should concentrate on the performance of the trajectory matching.

A. Evaluation of primitive detection

To test the performance of the scaling parameter in the primitive detection, the training data for each primitive are taken from the same single camera (camera 2). Furthermore, each training sample of a specific action was performed with regard to an object at the same place on the table. In this study, the primitive training is person-specific. The learned models are then applied to the data taken from all 4 cameras. Table II compares the detection results of the primitives related to the objects for the methods with and without scaling.

TABLE II
THE DETECTION OF THE MANIPULATIVE PRIMITIVES

Object   Primitive   Num. of Truth   PER(%)   PER(%, no scale)
tea      take        15              41.6     39.5
tea      put         15              8.3      19.8
milk     take        12              14.4     14.4
milk     put         12              42.1     52.6
sugar    take        15              6.3      16.9
cup      take        45              17.8     27.1
cup      put         45              11.6     22.2
cup      pour        42              32.7     36.9
plant    water       15              20.8     19.8

The results are calculated based on the primitives from all camera views with the parameter set N = 1000, M = 50, p^0_{th} = 0.2, β = 3, and a scaling range of [0.8, 1.2] with prediction variance σ = 0.1. The primitive error rate (PER), defined as PER = (#Substitution + #Insertion + #Deletion) / #Truth, is used to assess the quality of the detection, because the primitives are detected from long trajectories. The ground truth of every primitive has a time stamp and an allowed time variance for matching. No detection within the allowed time variance of a primitive causes a deletion error; a false detection within it is counted as a substitution. An insertion error is a detection outside that range or an additional detection within the range.
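For completeness, once detections have been aligned to the time-stamped ground truth, the error rate reduces to a simple ratio; the alignment itself (matching detections to the allowed time windows) is not shown here.

```python
def primitive_error_rate(n_substitutions, n_insertions, n_deletions, n_truth):
    """PER = (#Substitution + #Insertion + #Deletion) / #Truth."""
    return (n_substitutions + n_insertions + n_deletions) / float(n_truth)
```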

Table II shows the results. Generally, the PER of the method with the scaling parameter is lower than that of the method without it for most primitives, but there are also a few cases where it increased. In the experiment, the big milk carton was placed near the tea can (see Figure 5, Cam 3), which caused many insertion errors in the PERs of 'take tea' and 'put milk'. In order to investigate the results in more detail, we split the PER into the different error types. Figure 6 shows the error rates of the different types for both methods. The numbers summarize the results of all primitives from all cameras. According to this figure, the scaling caused a significant drop of the deletion error rate, as we expected. It also brought a slight increase of the insertion errors because it generalizes the HMMs.

Fig. 6. The primitive detection results shown by different error types and with/without scaling

Figure 7 presents the effect of the scaling parameter on the results for the different cameras. The PERs for all cameras decrease when it is used. Because the observations from one camera contain manipulations on different object layouts, the results for camera 2 are also listed here, even though the models were trained on data from this camera. The result for camera 4, which has the largest PER before scaling because of the largest view-angle difference, becomes comparable to the others.

Fig. 7. The primitive detection results with regard to different camera views.

B. Manipulative task recognition

The second evaluation assesses the recognition of the manipulative tasks. For each task, the training set contains the primitive sequences detected from 18 observations from the same view-angle; 18 sequences from other viewpoints are used as test data. The results in Table III show that the proposed approach achieves good manipulative task recognition rates, which are further improved by using the scaling parameter.

TABLE III
THE RECOGNITION RESULTS OF THE MANIPULATIVE TASKS WITH AND WITHOUT SCALING

Name             Num.   ER(%)   ER(%, no scale)
water plant      18     12.3    13.4
prepare tea      18     14.5    16.7
prepare coffee   18     7.8     10.0

IV. SUMMARY

The recognition of manipulative actions and tasks is an essential component for natural, pro-active, and non-intrusive interaction between humans and robots. The proposed approach focuses on the recognition of human manipulative actions for a mobile robot in 2D images from different view-angles. To this end, a feature vector which is less affected by the change of view-angles is designed based on a coarse estimate of the camera view point. Then, a particle-filter-based HMM matching process detects the object-specific manipulative primitives from longer trajectories. In this process, a dynamic scaling parameter in the observation model is used to cope with the nonlinear changes of the trajectories caused by different view-angles. The first experiments showed that the feature vector yields reasonable results for manipulative action recognition from different view-angles and that the scaling parameter significantly improves these results.

REFERENCES

[1] J. Fritsch. Vision-based Recognition of Gestures with Context. Dissertation, Bielefeld University, 2003.
[2] A. Haasch, S. Hohenner, S. Huwel, M. Kleinehagenbrock, S. Lang, I. Toptsis, G. A. Fink, J. Fritsch, B. Wrede, and G. Sagerer. BIRON — the Bielefeld robot companion. In Proc. Int. Workshop on Advances in Service Robotics, pages 27–32, Stuttgart, Germany, 2004.
[3] M. Isard and A. Blake. Condensation — conditional density propagation for visual tracking. Int. J. Computer Vision, 29(1):5–28, 1998.
[4] Z. Li, J. Fritsch, S. Wachsmuth, and G. Sagerer. An object-oriented approach using a top-down and bottom-up process for manipulative action recognition. In DAGM, pages 212–221, Berlin, Germany, 2006. Springer-Verlag.
[5] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[6] D. J. Moore, I. A. Essa, and M. H. Hayes, III. Exploiting human actions and object context for recognition tasks. In Proc. ICCV, pages 20–27, 1999.
[7] C. Rao, A. Yilmaz, and M. Shah. View-invariant representation and recognition of actions. Int. J. Comput. Vision, 50(2):203–226, 2002.
[8] J. Schmidt, B. Kwolek, and J. Fritsch. Kernel particle filter for real-time 3D body tracking in monocular color images. In Proc. of Automatic Face and Gesture Recognition, pages 567–572, Southampton, UK, April 2006. IEEE.
[9] A. D. Wilson and A. F. Bobick. Hidden Markov models for modeling and recognizing gesture under variation, pages 123–160. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2002.
[10] Y. Wu and T. S. Huang. Vision-based gesture recognition: A review. Lecture Notes in Computer Science, 1739:103–114, 1999.
[11] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities. Computer Vision and Image Understanding, 73(2):232–247, 1999.