3-D HAND POSE ESTIMATION AND SHAPE MODEL REFINEMENT FROM A MONOCULAR IMAGE SEQUENCE

Nobutaka SHIMADA and Yoshiaki SHIRAI

Dept. of Mech. Eng. for Computer-Controlled Machinery, Osaka University, Yamadaoka 2-1, Suita, Osaka, 565 Japan. E-mail: [email protected]

Abstract

This paper proposes a method to precisely estimate the shape and pose of articulated objects such as a human hand. First, a rough estimate is obtained by silhouette matching. Next, we apply the extended Kalman filter (EKF) to fit a model to an image. However, because monocular images contain no depth information, the ambiguity of the shape and pose of articulated objects cannot essentially be resolved. In our method, the distribution of the EKF solution is modified so as to satisfy restrictions on the shape and pose of the human body. The possible solution space is then incrementally reduced as informative observations are obtained. Experimental results are shown.

Fig. 1 : 3-D hand model (labels: finger tip, DIP, PIP, MP, IP, CM, palm, wrist; joint types: 1 DOF, 2 DOF, 3 DOF)

1 Introduction

Automatic recognition of gestures is useful for man-machine interfaces. Many visual methods have been developed for recognition of human gestures. Gesture estimation is based on 3-D reconstruction from various information in 2-D images, in most cases with a 3-D model as shape knowledge of the target. Such visual methods using image features are divided into two categories: ones using the projected positions [1] [2], and others using the motion information [3].

Some other methods [4] [5] [6] can estimate human gestures from silhouettes using the idea of Estimation by Synthesis, without extracting features. Using an initialized 3-D hand model, these methods generate possible pose images and evaluate their degree of matching to a silhouette image. Searching for the best-matching candidate is computationally expensive owing to the high degree of freedom (DOF) of the human body. Therefore, it is important to reduce the search space appropriately.

We previously proposed a silhouette-matching method for hand pose estimation from a monocular image sequence [7]. Using the positions and shapes of finger-like features extracted from the silhouette, obviously impossible candidates are excluded from candidate generation. This method estimates only the pose parameters (joint angles) and initially fixes the shape parameters (lengths and widths of parts). Because the shape parameters are not precise, the computed pose parameters are not precise either.

Next we consider improving the shape and pose simultaneously. If images from multiple views are available, the model can be fitted to the image features using the least squares method [2]. However, this method cannot be applied to monocular sequences due to the depth ambiguity. Further information is necessary to estimate the 3-D shape and pose of objects.

In this paper, we utilize knowledge of the human body's shape, pose and motion restrictions, which can be formulated as inequalities on the parameter space [8]. We first obtain the fitting solution using the extended Kalman filter [9]. Next, the distribution of the EKF solution is modified so as to satisfy the restricting inequalities, and the possible solution space is then incrementally reduced (i.e. the shape and pose become precise) as various observations are obtained. Details are described in section 3.

2 Rough Estimation by Silhouette Matching

Our hand model includes the position of the wrist, the 3-D shapes of the parts and the joint angles (Fig.1), and they are initialized at the first frame. Because the shapes are fixed in the rough estimation, a hand pose is represented by the set of the wrist position and the joint angles.

At each frame, the following preprocessing is performed. As the wrist position, the point where the width of the arm changes abruptly is detected from the silhouette image. In the same way, finger-like regions are extracted from the silhouette (Fig.4(a)).

We then search for candidates that match the input silhouette well. The matching of hand pose consists of two phases: generating candidates using the model and evaluating their degree of matching to the silhouette. In the generation phase, to reduce the total number of generated candidates, we utilize the following three tactics:

1. hierarchical estimation from the palm to the fingers
2. adaptive quantization of the model parameter space considering its sensitivity to shape deformation of the model projection
3. a search strategy considering the prior probability of an appearance based on motion prediction.

A candidate with higher probability has its matching degree to the silhouette evaluated earlier, and estimation at that frame finishes when the number of well-matching candidates reaches a number fixed in advance.

In the evaluation phase, for the generated palm pose, the mean protrusion length of the input silhouette is evaluated as the matching degree, as shown in Fig.2. For the finger pose, the projected differences are evaluated between the axes of the model fingers and the pre-extracted finger-like features shown in Fig.3, changing the model-feature correspondence. Next, in order to integrate these different sorts of evaluation, we suppose distributions of the evaluated matching degrees considering the errors due to the shape modeling and the quantization of the parameters. Then, the matching degrees of the respective parts are integrated as a probability by the Bayes rule. The rough estimation result at a frame in a certain sequence is shown in Fig.4.

Still, considering the later observations, the best candidate at one frame is not always the best due to various causes: model approximation errors, too rapid motion changes or ambiguities caused by occlusion. To resolve this problem, we utilize one more tactic:

4. preserving multiple estimations at one frame by the beam search method [10].

Even if the best estimation at one frame is actually an error¹, the following estimations can be continued based on the remaining estimations, and the globally optimal solutions are obtained over a long sequence without back-tracking.

Fig. 2 : Evaluation of palm candidates (labels: input silhouette, palm model, $S_p$, $L_p$; mean protrusion length $x_p = S_p / (L_p \cdot C_{scale})$)

Fig. 3 : Evaluation of finger candidates (labels: finger model, palm model, finger-like feature, translation difference $X_{ft}$, angular difference $X_{fa}$)

Fig. 4 : Rough Estimation Results. Panels: (a) Silhouette, (b) Estimation
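The fourth tactic, preserving several estimations per frame, can be illustrated with a generic beam search. The following is only a minimal sketch and not the implementation used in the paper: the function `expand`, which is assumed to return candidate poses for a frame together with the log of their matching probability, and the toy integer "poses" are hypothetical stand-ins for the candidate generation and silhouette evaluation described above.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, Tuple

# (cumulative log-probability, pose history)
Hypothesis = Tuple[float, tuple]


def beam_search(
    frames: Sequence,
    initial_pose,
    expand: Callable[[object, object], Iterable[Tuple[object, float]]],
    beam_width: int = 3,
) -> List[Hypothesis]:
    """Keep the `beam_width` best pose sequences after every frame."""
    beam: List[Hypothesis] = [(0.0, (initial_pose,))]
    for frame in frames:
        candidates: List[Hypothesis] = []
        for log_p, history in beam:
            # `expand` is a hypothetical generator/evaluator of candidates
            for pose, log_match in expand(history[-1], frame):
                candidates.append((log_p + log_match, history + (pose,)))
        if not candidates:
            break
        # Survivors are sorted best-first; weaker ones stay alive as backups
        beam = heapq.nlargest(beam_width, candidates, key=lambda h: h[0])
    return beam


def toy_expand(prev_pose: int, frame: int):
    """Toy model: a pose is an integer that may change by at most 1 per frame."""
    for pose in (prev_pose - 1, prev_pose, prev_pose + 1):
        yield pose, -float((pose - frame) ** 2)


best = beam_search([1, 2, 3], 0, toy_expand)
```

Because lower-ranked hypotheses survive each frame, a sequence that looked second best at one frame can still become the overall best later, without any back-tracking.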

¹ An occurrence of estimation error often cannot be detected until the next frame, or even later in worse cases.

3 Refinement of Shape and Pose Parameters

In the rough estimation, the shape parameters are initially fixed. Next we refine both the shape and the pose using the rough estimation result. This task can be formulated as a model-fitting problem using the extended Kalman filter. However, it does not work appropriately due to the depth ambiguity. For example, any link longer than the correct length can explain the observed silhouette. To address this problem, we consider the following restrictions on the shape and pose of a human body:

a) shape parameters (lengths and widths) are constant
b) pose parameters (joint angles) change continuously
c) each parameter is within a certain range and has relations with the other parameters.

The above restrictions are simple but must be satisfied simultaneously. Therefore, if the range of one parameter is limited by a restriction, the ranges of the other parameters are also limited through the relations. We introduce this mechanism into EKF in section 3.2. Even with the above consideration, we still cannot distinguish poses that are symmetric about the screen because of the non-linearity of the system. For this problem, we propose generation and preservation of multiple EKF solutions in section 3.3.

Fig. 5 : Representation of Articulation (labels: joint angles $\theta_1$ to $\theta_4$, link lengths $r_1$ to $r_3$)

Fig. 6 : Truncating the EKF solution (labels: model configurations, observations, $h_t(\cdot)$, restriction $\varphi_k^T x \le b_k$, excluded area, $(q_t^{(k-1)}, Q_t^{(k-1)})$, $(q_t^{(k)}, Q_t^{(k)})$)

3.1 Formulation of EKF

Here, a state vector of the shape and pose is defined as

$$ x_t = (\theta_1, \cdots, \theta_m, \dot{\theta}_1, \cdots, \dot{\theta}_m, r_1, \cdots, r_n)^T \qquad (1) $$

where $\theta_i$, $\dot{\theta}_i$ and $r_i$ denote the joint angle, its velocity and the length of the $i$th link respectively (Fig.5). Supposing the constancy of the shape, $\dot{r}_i$ is not included. The transition and observation formulas are represented as

$$ x_{t+1} = A_t x_t + u_t \qquad (2) $$
$$ y_t = h_t(x_t) + w_t \qquad (3) $$

where $u_t$ and $w_t$ are white noises with zero mean and variances $U_t$ and $W_t$. Supposing linear prediction, $A_t$ is represented by the following $(2m+n) \times (2m+n)$ matrix:

$$ A_t = \begin{pmatrix} I_m & I_m & O \\ O & I_m & O \\ O & O & I_n \end{pmatrix} \qquad (4) $$

where $I_m$ denotes the $m \times m$ unit matrix. $U_t$ is determined considering the continuity of the pose changes. Because $h_t$ is non-linear, the observation is formulated based on EKF as

$$ y_t \simeq \left. \frac{\partial h_t}{\partial x_t} \right|_{\tilde{x}_t} (x_t - \tilde{x}_t) + h_t(\tilde{x}_t) + w_t \qquad (5) $$

where $\tilde{x}_t = A_{t-1} \hat{x}_{t-1}$ and $\hat{x}_{t-1}$ is the mean at time $t-1$. Although the mean and variance are computed in the ordinary EKF way, this simple solution often differs from the correct one due to the depth ambiguity and the prediction error.
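As an illustration of Eqs.(2) to (5), the sketch below builds the transition matrix of Eq.(4) and performs one ordinary EKF step with NumPy. It is a minimal sketch under stated assumptions: the observation function `observe` (the 1-D screen coordinate of each joint of a planar chain with relative joint angles, $n = m$), the numerically differentiated Jacobian, and all noise values are illustrative choices, not taken from the paper.

```python
import numpy as np


def transition_matrix(m: int, n: int) -> np.ndarray:
    """A of Eq.(4): angles advance by their velocities, the rest is constant."""
    A = np.eye(2 * m + n)
    A[:m, m:2 * m] = np.eye(m)
    return A


def observe(x: np.ndarray, m: int) -> np.ndarray:
    """Hypothetical h: screen coordinate of each joint of a planar chain.

    Assumes n == m links and joint angles relative to the previous link.
    """
    theta, r = x[:m], x[2 * m:]
    return np.cumsum(r * np.cos(np.cumsum(theta)))


def numerical_jacobian(f, x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Central-difference approximation of dh/dx at x."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J


def ekf_step(x_hat, P, y, A, U, W, h):
    """One prediction and update using the linearized observation of Eq.(5)."""
    x_pred = A @ x_hat                       # prediction x~_t
    P_pred = A @ P @ A.T + U
    H = numerical_jacobian(h, x_pred)        # dh/dx evaluated at x~_t
    S = H @ P_pred @ H.T + W
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (y - h(x_pred))
    P_new = (np.eye(x_hat.size) - K @ H) @ P_pred
    return x_new, P_new


# Illustrative values only (not from the paper)
m = n = 3
A = transition_matrix(m, n)
x_hat = np.array([0.3, 0.2, 0.1, 0.01, 0.01, 0.01, 60.0, 50.0, 40.0])
P = np.diag([1.0] * m + [0.1] * m + [400.0] * n)
U = np.diag([0.0] * m + [0.01] * m + [0.0] * n)
W = np.eye(m)
y = observe(np.array([0.35, 0.2, 0.1, 0.0, 0.0, 0.0, 55.0, 50.0, 40.0]), m)
x_new, P_new = ekf_step(x_hat, P, y, A, U, W, lambda x: observe(x, m))
```

With only `m` scalar observations for `2m + n` unknowns, the update leaves some directions of the state unobserved, which is the ambiguity the restrictions are meant to reduce.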

3.2 Modifying EKF Solution

In order to resolve the problem in the previous section, we modify the EKF solution by applying the restrictions on the possible ranges and relations of the pose and shape. They are formulated as below:

$$ \theta_{min,i} \le \theta_i \le \theta_{max,i}, \quad |\theta_i - \theta_j| \le \Delta\theta_{ij} \qquad (6) $$
$$ r_{min,i} \le r_i \le r_{max,i}, \quad |r_i - r_j| \le \Delta r_{ij} \qquad (7) $$

One way is to regard the region represented by the restrictions as an observation of the state $x_t$ at each frame. The region is approximated by a Gaussian distribution, so the restrictions are easily introduced into EKF. However, each application of the Kalman filter improperly decreases the variance. Another way is to introduce the restrictions as an initial distribution. But this method is also inappropriate because application of the Kalman filter decreases the effect of the initial distribution².

Recall that the shape and pose must satisfy the restrictions. Therefore, the distribution obtained by EKF must be refined by the restrictions. Accordingly, we truncate the distribution of the EKF solution $(\hat{x}^*_t, P^*_t)$ outside of the restrictions (Fig.6). Because the restrictions Eq.(6)(7) are linear inequalities, we generally represent them as

$$ \varphi_k^T x_t \le b_k \quad (k = 1, \cdots, K). \qquad (8) $$

It is difficult to compute exactly the distribution truncated with all restrictions. Hence, the truncation with all restrictions is approximated by repeating the truncation with a single restriction sequentially. Consider truncating a distribution with mean $q_{k-1}$ and variance $Q_{k-1}$ by the restriction $\varphi_k^T x \le b_k$. This computation is reduced to the case where the mean is 0, the variance is $I$ and the restriction is $(1, 0, \cdots, 0)\, q' \le c_k$ by applying the following transformation:

$$ q' = R\, W^{-1/2}\, T^T (q - q_{k-1}) \qquad (9) $$

where $R$ and $T$ are orthogonal, $W$ is diagonal and

$$ T W T^T = Q_{k-1} \qquad (10) $$
$$ R\, W^{1/2}\, T^T \varphi_k = (\varphi_k^T Q_{k-1} \varphi_k)^{1/2}\, (1, 0, \cdots, 0)^T \qquad (11) $$
$$ c_k = \frac{b_k - \varphi_k^T q_{k-1}}{(\varphi_k^T Q_{k-1} \varphi_k)^{1/2}}. \qquad (12) $$

$c_k$ is the Mahalanobis distance between $q_{k-1}$ and the plane represented by $\varphi_k^T x = b_k$. In this case, the truncated mean $(\mu, 0, \cdots, 0)^T$ and variance $S$ are computed as

$$ \mu = -\sqrt{\frac{2}{\pi}}\, \exp\left(-\frac{c_k^2}{2}\right) \Big/ \left(1 + \mathrm{erf}\left(\frac{c_k}{\sqrt{2}}\right)\right) \qquad (13) $$
$$ S_{ij} = \begin{cases} 1 + c_k \mu - \mu^2 & i = j = 1 \\ 1 & i = j \ne 1 \\ 0 & \text{otherwise} \end{cases} \qquad (14) $$

where $\mathrm{erf}(\cdot)$ represents the error function. Using Eq.(9), the truncated mean and variance are then computed as

$$ q_k = T\, W^{1/2} R^T (\mu, 0, \cdots, 0)^T + q_{k-1} \qquad (15) $$
$$ Q_k = T\, W^{1/2} R^T S\, R\, W^{1/2}\, T^T. \qquad (16) $$

Finally, the fully truncated mean and variance are computed recursively: $\hat{x}_t = q_K$ and $P_t = Q_K$, where $q_0 = \hat{x}^*_t$ and $Q_0 = P^*_t$. For efficiency, we process the truncation only if $c_k$ is smaller than a threshold.

² The effect on non-observable modes does not decrease.
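The single-restriction truncation can be sketched in a few lines of NumPy. This is a sketch under assumptions rather than the authors' code. Instead of forming $T$, $W$ and $R$ explicitly, it uses an algebraically equivalent rearrangement: since $R W^{1/2} T^T \varphi_k$ is a multiple of the first unit vector, $T W^{1/2} R^T (1, 0, \cdots, 0)^T = Q_{k-1}\varphi_k / (\varphi_k^T Q_{k-1} \varphi_k)^{1/2}$. The threshold `c_max` and the use of `erfc` in place of $1 + \mathrm{erf}$ (the same quantity, better behaved for negative arguments) are choices made here.

```python
import math
import numpy as np


def truncate_gaussian(q, Q, phi, b):
    """Mean and variance of N(q, Q) restricted to the half-space phi^T x <= b."""
    Qphi = Q @ phi
    s = math.sqrt(phi @ Qphi)                     # std. dev. of phi^T x
    c = (b - phi @ q) / s                         # Eq.(12)
    # Eq.(13); erfc(-z) equals 1 + erf(z). It still underflows to zero if the
    # mean lies extremely far outside the restriction (not handled here).
    mu = (-math.sqrt(2.0 / math.pi) * math.exp(-0.5 * c * c)
          / math.erfc(-c / math.sqrt(2.0)))
    s11 = 1.0 + c * mu - mu * mu                  # Eq.(14), element (1,1)
    q_new = q + Qphi * (mu / s)                   # equivalent to Eq.(15)
    Q_new = Q + np.outer(Qphi, Qphi) * ((s11 - 1.0) / (s * s))  # Eq.(16)
    return q_new, Q_new


def truncate_all(q, Q, constraints, c_max=3.0):
    """Apply the restrictions (phi_k, b_k) one after another."""
    for phi, b in constraints:
        c = (b - phi @ q) / math.sqrt(phi @ Q @ phi)
        if c < c_max:                             # skip far-away restrictions
            q, Q = truncate_gaussian(q, Q, phi, b)
    return q, Q


# Two correlated parameters; only the first one is restricted (x_1 <= 0)
q0 = np.zeros(2)
Q0 = np.array([[1.0, 0.5], [0.5, 1.0]])
q1, Q1 = truncate_gaussian(q0, Q0, np.array([1.0, 0.0]), 0.0)
```

In the example, the mean of the unrestricted second parameter also moves because it is correlated with the first one, which is how a restriction on one parameter limits the others.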

3.3 Multiple Solution

Despite the use of the restrictions, the linearized filtering may fail because the distribution becomes multimodal due to the non-linearity and non-observability of the system. In Fig.7, for example, EKF produces only one of the estimations A and B. However, both A and B are possible solutions. For the $i$th link, this problem arises when the link is nearly parallel to the screen, namely when

$$ \left. \frac{\partial h_i}{\partial \Theta_i} \right|_{\tilde{x}_t} \simeq 0 \qquad (17) $$

where $h_i$ denotes the observations (Eq.(3)) of the $i$th link and $\Theta_i$ denotes the angles $(\theta_1, \cdots, \theta_p)^T$ of the proximal joint of the $i$th link, which has $p$ DOF³.

Fig. 7 : The case in which EKF possibly fails (labels: Depth, Screen, pose at $t-1$, candidate poses A and B, "Which is correct?")

If the prediction satisfies Eq.(17), the following processes are activated:

1. generate multiple predictions
2. for each prediction, obtain the linearized observation formula Eq.(5) and compute the EKF solution
3. truncate each EKF solution.

One of the multiple predictions is the ordinary prediction $\tilde{x}_t$ in EKF. Another, $\tilde{x}_t^{sym_i}$, is the same except that only the $i$th link is symmetric to $\tilde{x}_t$ about the image plane. Thus, $2^n$ ($n$: the number of links) predictions are possible. At this time, the probability density of each prediction is approximately calculated based on the distribution of the ordinary prediction:

$$ p_{pred}(x) = f(x;\ \tilde{x}_t,\ A_{t-1} P_{t-1} A_{t-1}^T + U_{t-1}) \qquad (18) $$
$$ x \in \{ \tilde{x}_t,\ \tilde{x}_t^{sym_1},\ \cdots,\ \tilde{x}_t^{sym_N} \} \qquad (19) $$

where $N$ denotes the number of symmetric solutions and $f(\cdot\,; m, \Sigma)$ denotes a Gaussian probability density function with mean $m$ and variance $\Sigma$.

Finally, each EKF solution is computed and truncated in the way of section 3.2. If the truncated area exceeds a threshold, the solution is eliminated as illegal. Among the remaining solutions, the one arising from the prediction with the largest $p_{pred}$ is selected as the estimation at that frame. The others are also preserved for robustness, in the same way as in the rough estimation.
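A possible way to organize the test of Eq.(17) and the ranking of Eq.(18) is sketched below. It is illustrative only: the tolerance `tol`, the numerical derivative, and the way the mirrored prediction is obtained (here simply negating one angle of a hypothetical one-link system observed as $50 \cos\theta$) are assumptions, since the mirroring depends on the angle convention of the model.

```python
import numpy as np


def is_ambiguous(h_i, x_pred, angle_idx, tol=1e-2, eps=1e-6):
    """Eq.(17): the observation of link i barely changes with its joint angle."""
    dx = np.zeros_like(x_pred)
    dx[angle_idx] = eps
    grad = (h_i(x_pred + dx) - h_i(x_pred - dx)) / (2 * eps)
    return bool(np.all(np.abs(grad) < tol))


def log_gaussian(x, mean, cov):
    """Log of the Gaussian density f(x; mean, cov)."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet
                   + x.size * np.log(2.0 * np.pi))


def rank_predictions(x_pred, P_prev, A, U, symmetric_preds):
    """Score the ordinary and the mirrored predictions by Eq.(18), best first."""
    cov = A @ P_prev @ A.T + U
    candidates = [x_pred] + list(symmetric_preds)
    scored = [(log_gaussian(x, x_pred, cov), x) for x in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored


# Hypothetical one-link example: screen coordinate 50*cos(theta)
h_link = lambda x: 50.0 * np.cos(x[0])
parallel = is_ambiguous(h_link, np.array([0.0]), 0)     # link along the screen
oblique = is_ambiguous(h_link, np.array([1.0]), 0)

x_pred = np.array([0.05])
x_sym = np.array([-0.05])            # assumed mirror image of the prediction
ranking = rank_predictions(x_pred, np.array([[0.01]]), np.eye(1),
                           np.array([[0.01]]), [x_sym])
```

Each ranked prediction would then go through the EKF update and the truncation of section 3.2, and the surviving solutions can be kept in a beam as in the rough estimation.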

³ If the joint has 3 DOF, the joint parameters are represented as a quaternion.

4 Experimental Results

We show simulation results of our method. For simplicity, we consider a 2-D link system. An observation is given as a 1-D position of each joint. The most proximal joint is fixed to the origin and three links are connected like a finger. Here, we utilize the restrictions shown in Tab.1.

Tab. 1 : The restrictions and initial distribution

pose
  restrictions: $|\theta_1 - \theta_2| \le \pi/4$, $|\theta_2 - \theta_3| \le \pi/6$, $0 \le \theta_1 \le 3\pi/4$, $0 \le \theta_2, \theta_3 \le \pi/2$
  initial mean: correct value $\pm 0.2$ rad
  initial variance: 1.0 rad$^2$
shape
  restrictions: $|r_1 - r_2| \le 40$, $|r_2 - r_3| \le 10$, $30 \le r_1 \le 90$, $25 \le r_2, r_3 \le 70$
  initial mean: correct value $\pm 10$ pixels
  initial variance: 400 pixels$^2$

Fig. 8 : A problem of linearized observation (labels: $r$, $\theta$, observation, linearized observation, linearizing point, original direction, shrinking variance)

Fig.10 shows the correct poses and the estimated poses. Fig.11-14 show the changes of the angles and lengths of the 1st link ($\theta_1$, $r_1$) and the 3rd link ($\theta_3$, $r_3$). The broken line shows the correct value, and the mark and vertical line show the estimated mean and variance (twice the standard deviation). In Fig.11 and 12, $\theta_1$ is accurately estimated but $r_1$ is comparatively less accurate because the 1st link is nearly perpendicular to the screen. At the 18th frame (Fig.10(e)), the 3rd link is incorrectly estimated. By evaluating Eq.(17), however, another estimation is simultaneously generated. The wrong estimation is eliminated at the 20th frame and the following estimation continues successfully. In Fig.14, $r_3$ is correctly estimated despite the longer initial length, owing to the observation of the 2nd link. Because its length $r_2$ is accurately estimated from the 20th to the 30th frame, this information propagates into $r_3$ through the distribution truncation.
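The restrictions of Tab.1 can be written in the form $\varphi_k^T x \le b_k$ of Eq.(8). The sketch below shows one possible encoding; the state ordering follows Eq.(1) with $m = n = 3$, the angular bounds are read as multiples of $\pi$, and the helper names (`box`, `diff`, `satisfies`) are hypothetical.

```python
import numpy as np

M, N = 3, 3                 # joints and links of the simulated finger
DIM = 2 * M + N             # state: angles, angular velocities, lengths


def _row(index_coeffs):
    """Build phi with the given (index, coefficient) pairs, zero elsewhere."""
    phi = np.zeros(DIM)
    for i, coeff in index_coeffs:
        phi[i] = coeff
    return phi


def box(i, lo, hi):
    """lo <= x_i <= hi as two inequalities phi^T x <= b."""
    return [(_row([(i, 1.0)]), hi), (_row([(i, -1.0)]), -lo)]


def diff(i, j, bound):
    """|x_i - x_j| <= bound as two inequalities phi^T x <= b."""
    return [(_row([(i, 1.0), (j, -1.0)]), bound),
            (_row([(i, -1.0), (j, 1.0)]), bound)]


TH = [0, 1, 2]              # indices of theta_1..theta_3 in the state vector
R = [6, 7, 8]               # indices of r_1..r_3 in the state vector

constraints = (
    diff(TH[0], TH[1], np.pi / 4) + diff(TH[1], TH[2], np.pi / 6)
    + box(TH[0], 0.0, 3 * np.pi / 4)
    + box(TH[1], 0.0, np.pi / 2) + box(TH[2], 0.0, np.pi / 2)
    + diff(R[0], R[1], 40.0) + diff(R[1], R[2], 10.0)
    + box(R[0], 30.0, 90.0) + box(R[1], 25.0, 70.0) + box(R[2], 25.0, 70.0)
)


def satisfies(x, constraints, tol=1e-9):
    """True if the state x fulfils every restriction."""
    return all(phi @ x <= b + tol for phi, b in constraints)


feasible = np.array([0.5, 0.4, 0.3, 0.0, 0.0, 0.0, 60.0, 50.0, 45.0])
too_short = feasible.copy()
too_short[R[2]] = 30.0      # violates |r_2 - r_3| <= 10
```

Each range or difference bound yields two inequalities, so Tab.1 gives $K = 20$ restrictions. The relation $|r_2 - r_3| \le 10$ is what lets an accurate estimate of $r_2$ narrow the possible range of $r_3$.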



5 Conclusion and Discussion

In this paper, we proposed a method to precisely estimate the shape and pose of articulated objects from a monocular image sequence. We resolved the ambiguity of the EKF estimation by truncating the distribution with the shape and pose restrictions. Then we reduced the possible solution space incrementally with informative observations over time. In addition, we resolved the ambiguity of symmetric poses by preserving multiple solutions. We showed the effectiveness of our method by simulation.

However, a problem remains. In some cases, the variance decreases improperly. This is caused by the error of linearization of the observation. The linearized observation constrains $\theta$ and $r$ onto a plane whose direction varies depending on the linearizing point (Fig.8). Accordingly, the variance decreases in all directions if the linearizing point moves.

In section 3.1, we formulate the EKF in $\theta$-$r$ space. If we alternatively construct the state vector with the depth and screen coordinates, then the observation is linear. But the restrictions are not linear and, moreover, the transition noise is not Gaussian. The Gaussian approximation of the noise causes a serious problem, shown in Fig.9. The appropriate estimation should be $P_A$, assuming $\dot{r} = 0$. By approximating the transition distribution, however, $P_B$ is obtained as the estimation. Therefore, $r$ is improperly stretched for any observation. Resolving this problem is left for future work.

Fig. 9 : A problem of estimation in depth and screen coordinates (labels: depth, screen, prediction, approximated transition distribution, observation, $r$, $\theta$, $P_A$, $P_B$)

Fig. 10 : Estimation results for a sequence. Panels: (a) t=16/correct, (b) t=18/correct, (c) t=20/correct, (d) t=16/estimate, (e) t=18/estimate, (f) t=20/estimate. In (e), the solid line shows the wrong estimation generated by ordinary EKF, and the broken line shows the alternative estimation generated by our method.

Fig. 11 : Mean and variance of $\theta_1$ (legend: angle1 estimate, angle1 true)

Fig. 12 : Mean and variance of $r_1$ (legend: length1 estimate, length1 true)

Fig. 13 : Mean and variance of $\theta_3$ (legend: angle3 estimate, angle3 true, sub estimation)

Fig. 14 : Mean and variance of $r_3$ (legend: length3 estimate, length3 true, sub estimation)

References

[1] J. Davis and M. Shah. "Recognizing Hand Gestures". In ECCV'94, pages 331-340, 1994.
[2] J. M. Rehg and T. Kanade. "Visual Tracking of High DOF Articulated Structures: an Application to Human Hand Tracking". In ECCV'94, pages 35-46, 1994.
[3] M. Yamamoto and K. Koshikawa. "Human Motion Analysis Based on A Robot Arm Model". In CVPR'91, pages 664-665. IEEE, 1991.
[4] Y. Kameda, M. Minoh, and K. Ikeda. "Three Dimensional Pose Estimation of an Articulated Object from its Silhouette Image". In ACCV'93, pages 612-615, 1993.
[5] M. Mochimaru and N. Yamazaki. "The Three-dimensional Measurement of Unconstrained Motion Using a Model-matching Method". ERGONOMICS, vol.37, No.3, pages 493-510, 1994.
[6] J. J. Kuch and T. S. Huang. "Virtual Gun: A Vision Based Human Computer Interface Using the Human Hand". In MVA'94, pages 196-199, 1994.
[7] N. Shimada, Y. Shirai, and Y. Kuno. "Hand Gesture Recognition Using Computer Vision Based on Model-matching Method". In Proc. of 6th International Conference on HCI, pages 11-16. Elsevier, 1995.
[8] J. O'Rourke and N. I. Badler. "Model-Based Image Analysis of Human Motion Using Constraint Propagation". IEEE Trans. of Pattern Anal. and Machine Intell., PAMI-2, No.6, pages 522-536, 1980.
[9] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, New York, San Francisco, London, 1970.
[10] B. T. Lowerre and R. D. Reddy. "The Harpy Speech Understanding System". In W. A. Lea, editor, Trends in Speech Recognition, pages 340-360. Prentice-Hall, Englewood Cliffs, NJ, 1980.