Probabilistic Contour Extraction Using Hierarchical Shape Representation

Xin Fan (1,2), Chun Qi (2), Dequn Liang (1), Hua Huang (2)
(1) School of Information Engineering, Dalian Maritime University, Dalian, P.R. China, 116026. [email protected]
(2) School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an, P.R. China, 710049. {qichun,huanghua}@mail.xjtu.edu.cn

Abstract

In this paper, we address the problem of extracting the contour of an object with a specific shape. A hierarchical graphical model is proposed to represent shape variations. A complex shape is decomposed into several components, each described by a Principal Component Analysis (PCA) based model, at different levels of the hierarchy. The hierarchical representation allows for chain-like conditional dependency within a single level and bidirectional communication between different levels. Additionally, a Sequential Monte-Carlo (SMC) based inference algorithm that explores this graphical structure is proposed to estimate the contour. Experiments performed on real-world hand and face images show that the proposed method is effective in combating occlusion and cluttered background. Moreover, the hierarchical representation makes it possible to confine localization errors to an individual component of the shape.

1. Introduction

Finding the boundary of an object of interest against an unconstrained background is a challenging problem in computer vision, and a large number of methods have been developed over the past decades [11]. Because prior knowledge about the shape of the target object is incorporated, model-based approaches offer the capability of extracting contours of objects with complex shape structures even in cases of noisy or missing data. In model-based approaches, a prior model of the shape expected in an image is defined, and an optimization algorithm is then applied to find the best match of the model to the given image data. One approach to representing the shape structure is to construct a shape template from combinations of simple geometric primitives, such as line segments, circles [19] and polygons [4].

The other widely used models are built by statistically analyzing examples of the shapes of interest, namely statistical models [2, 3, 17]. Regardless of the model representation, most model-based approaches to contour extraction find the solution with deterministic optimization techniques. In [4], dynamic programming is used to find a globally optimal match. In the Active Shape Models (ASM) proposed by Cootes et al. [2], a local minimum of the matching energy is found by greedy searching in local neighborhoods. In [17], the matching is formulated as Bayesian estimation and the problem is solved via a pyramid searching strategy. Cremers et al. provide a gradient descent scheme for the minimization of a modified Mumford-Shah energy functional that incorporates statistical shape information [3]. However, it is worth noting that multiple inconsistent interpretations of image features exist owing to the complexity of natural scenes. It may therefore be problematic to search along one direction determined by local image features toward a globally optimal solution that incorporates all low-level image features and high-level shape information [8].

In comparison with these deterministic optimization techniques, the Sequential Monte-Carlo (SMC) method, also known as the particle filter, has the ability to carry multiple hypotheses, and is widely used to track multiple targets against cluttered background in image sequences [1, 5]. In this paper, we take the spirit of SMC to perform inference over a spatial chain rather than over a temporal chain, the latter being typical of tracking applications. Pérez et al. use SMC to extract the contours of general amorphous objects in an interactive way [12]. An extension of SMC that performs inference on arbitrarily structured graphical models has been proposed [6, 14] and applied to an edge linking task [6]. However, shape information is not introduced in either application.

Work more similar to ours is reported in [20], in which SMC is used to localize the articulated and deformable shape of a human body. However, the shape prior information there is organized in a sequential way, and the shape samples are generated by stochastically perturbing some predefined parameters of a fixed reference shape.

Motivated by the hierarchical probabilistic structure developed by biological researchers to model human inference in the visual cortex [8], we propose a hierarchical representation that encodes rigid shape information for finding contours. A shape is decomposed into a few levels in a semantic sense. In this paper, we use at most three levels, each of which is composed of nodes that represent, from bottom to top: local shape structure, shape components, and connections between components (i.e. the whole shape). The shape nodes are described by Principal Component Analysis (PCA) based models. The nodes within a level are connected in a chain-like structure, and the information between neighbouring levels is communicated in a message-updating fashion [18]. Finally, the boundary of the target object is obtained via inference based on samples of the shape structure in the spirit of SMC.

We first describe the hierarchical shape structure in section 2, followed by the definition of the likelihood in section 3. The inference algorithm is proposed in section 4. Experiments on hand and face images are presented in section 5. Section 6 summarizes the paper.

2. Hierarchical shape representation

The matching of a deformable shape to the object in an image can be formulated in a Bayesian framework. We use denotations similar to those in [17]: the template shape contour is denoted $f$ and the deformed contour in the image domain $f_c$, where both contours are regarded as sequences of $N_c$ control points in a 2D plane, denoted $\bar{x}_{1:N_c}$ and $x_{1:N_c}$ respectively. For given image features $y$, the joint posterior probability of $f$ and $f_c$ can be derived as
$$p(f, f_c \mid y) \propto p(y \mid f_c)\, p(f, f_c), \quad (1)$$
where the likelihood $p(y \mid f_c)$ describes how the deformed shape contour fits the image features. In [17], the joint prior probability of $f$ and $f_c$ is expressed as the product of two energy functions, and the contour estimate is obtained by maximizing (1) via a pyramid searching strategy. In this paper, we propose a hierarchical graphical structure [7], and the shape contour is inferred by a stochastic SMC method, as discussed later.

We attempt to design a shape representation such that: 1) the shape contour is piecewise linear and can be sequentially augmented by one control point at a time; 2) the shape may have several conditionally dependent parts; 3) the shape prior information can communicate with local image features. The graphical structure of the proposed hierarchical shape model is depicted in Fig. 1, exemplified by a face shape. A face shape is decomposed into three levels, and the dependency between nodes within a level exhibits a chain-like structure, as described in detail below.

• Local shape structure level (L1). Since we assume that the contours are piecewise linear, the nodes of this level are the line segments between two adjacent control points, which give the local structure of the shape contour, as shown in Fig. 2. We denote the set of nodes $\mathcal{N}_1$ as
$$\mathcal{N}_1 = \{E_1, \ldots, E_k, \ldots, E_{N_c}\}, \quad (2)$$
where the line segment $E_k = x_{k-1:k}$. Given the current segment $E_k$, the next segment $E_{k+1}$ is determined by two parameters $(s_k, \theta_k)$ through
$$E_{k+1} = A_k(s_k, \theta_k)\, E_k, \quad (3)$$
where $A_k$ is a matrix representing the transformation from $E_k$ to $E_{k+1}$, with the form
$$A_k = \begin{bmatrix} s_k\cos\theta_k & s_k\sin\theta_k \\ -s_k\sin\theta_k & s_k\cos\theta_k \end{bmatrix} = \begin{bmatrix} a_k & b_k \\ -b_k & a_k \end{bmatrix}, \quad (4)$$
where $a_k$ and $b_k$ are substitution variables used in the calculation of $A_k$, to be discussed in the next section.
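To make the transition concrete, the following is a minimal NumPy sketch of (3)-(4). It is illustrative only: it assumes $E_k$ is handled as the displacement vector between consecutive control points, and the function and variable names are not from the paper.

```python
import numpy as np

def transition_matrix(s_k, theta_k):
    """A_k of (4): a scaled rotation parameterized by (s_k, theta_k)."""
    a_k = s_k * np.cos(theta_k)
    b_k = s_k * np.sin(theta_k)
    return np.array([[a_k,  b_k],
                     [-b_k, a_k]])

def next_segment(E_k, s_k, theta_k):
    """E_{k+1} = A_k(s_k, theta_k) E_k, as in (3)."""
    return transition_matrix(s_k, theta_k) @ E_k

# Example: shrink the current segment slightly and rotate it by 10 degrees;
# under the displacement-vector reading, the next control point is x_k + E_{k+1}.
E_k = np.array([1.0, 0.0])
E_next = next_segment(E_k, s_k=0.95, theta_k=np.deg2rad(10.0))
```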

• Shape component level (L2). The nodes of this level, $\mathcal{N}_2$, are the template shapes of the components, denoted as
$$\mathcal{N}_2 = \{f_1, \ldots, f_k, \ldots, f_{N_p}\}, \quad (5)$$
where $N_p$ is the number of shape components, and $f_k$ is the template shape of the $k$-th component, which can be represented by a PCA-based model [2, 17]:
$$f_k = f_{k0} + \Phi_k b_k, \quad (6)$$
where $f_{k0}$ and $\Phi_k$ are respectively the mean component shape and the matrix of principal feature vectors of the $k$-th shape component; both are obtained by performing PCA on the training examples of the corresponding shape component. The model parameter $b_k$ can be used to generate component shape samples. We share with [15, 17] the idea of decomposing a full shape into several semantically meaningful parts, but here the deformation of one part $f_k$ depends on the shape of its preceding neighbour $f_{k-1}$:
$$f_k = T_k(f_{k-1}), \quad (7)$$
where $T_k$ is an affine transformation.

• Full shape level (L3).

Figure 1. Hierarchical shape representation. The nodes within a level are linked as a chain, and nodes at different levels are connected. The thick lines in the L1 nodes denote local structure, while dotted lines indicate the connection to their parent nodes.

Figure 2. Local geometry of the contour curve.

The set of nodes at this level has only one element, the full shape template, i.e. $\mathcal{N}_3 = \{f\}$. Similarly, the full shape is expressed by a PCA-based model:
$$f = f_0 + \Phi b. \quad (8)$$
Performing PCA on the training examples of the whole shape yields the mean shape $f_0$ and the principal feature matrix $\Phi$, and varying $b$ generates plausible full shape samples.

Biological research on the visual cortex shows that there exists a bidirectional interaction between low-level and high-level visual perception areas: low-level cells generate hypotheses for perception, while high-level areas provide prior directions or constraints for shape inference [8]. To model this bidirectional communication, nodes at two adjacent levels are connected in the proposed shape representation to represent conditional dependency, as shown in Fig. 1. This dependency is expressed by a potential function $\phi$:
$$\phi(x_{l-1}, x_l) = \frac{1}{D(x_{l-1}, x_l)}, \quad (9)$$
where $x_{l-1}$ and $x_l$, by slight abuse of notation, denote the corresponding sequences of control points in the template or deformed shapes at levels $l-1$ and $l$ respectively, and $D(\cdot)$ is an affine-invariant distance defined in [16] to measure the similarity between two 2D point sets.
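As a rough illustration of how a potential of the form (9) can be evaluated, the sketch below substitutes a simple least-squares affine-alignment residual for $D(\cdot)$. This surrogate is only a stand-in for exposition; it is not the affine-invariant distance of [16], and the function names are assumptions of the sketch.

```python
import numpy as np

def affine_residual(P, Q):
    """Stand-in for D(P, Q): RMS residual of the best least-squares affine map of P onto Q.

    P, Q: (N, 2) arrays of corresponding control points. This is NOT the
    affine-invariant distance of Werman and Weinshall [16]; it only mimics the
    "small distance = similar shapes" behaviour needed by (9).
    """
    X = np.hstack([P, np.ones((len(P), 1))])      # homogeneous coordinates
    A, *_ = np.linalg.lstsq(X, Q, rcond=None)     # 3x2 affine parameters
    residual = Q - X @ A
    return np.sqrt((residual ** 2).mean())

def potential(x_lower, x_upper, eps=1e-6):
    """phi(x_{l-1}, x_l) = 1 / D(x_{l-1}, x_l), as in (9)."""
    return 1.0 / (affine_residual(x_lower, x_upper) + eps)
```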

3. Likelihood

We use the spatial gradient of the image intensity as the observed feature $y$, denoted by $y \equiv \nabla I(x)$. The likelihood $p(y \mid E_k)$ takes the form of the likelihood ratio of $p_{\mathrm{on}}$ to $p_{\mathrm{off}}$ [12]. $p_{\mathrm{on}}$ is the probability distribution of the gradients along the contours of interest. It can be observed that the gradients of pixels on the boundary possess larger norms, with directions perpendicular to the boundary curve. Examining the pixels $u_j$ $(j = 1, \ldots, N)$ along the line segment $E_k$ (as shown in Fig. 2), we define the distribution as
$$p_{\mathrm{on}} \propto \exp\Bigl(-\sum_{j=1}^{N} \bigl(1 - \|\nabla I(u_j)\|\bigr)\, n(u_j) \cdot h(u_j)\Bigr), \quad (10)$$
where $n$ and $h$ are respectively the directions of the image gradient and of the contour at the point $u_j$. $p_{\mathrm{off}}$ expresses the distribution of the gradients away from the contours, which is approximated by an exponential distribution with parameter $\gamma$ for simplicity [12]:
$$p_{\mathrm{off}} \propto \exp\Bigl(-\sum_{j=1}^{N} \frac{\|\nabla I(u_j)\|}{\gamma}\Bigr). \quad (11)$$
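A hedged sketch of evaluating the likelihood ratio $p_{\mathrm{on}}/p_{\mathrm{off}}$ of (10)-(11) along one candidate segment is given below. The nearest-neighbour gradient lookup, the normalization of gradient norms to roughly [0, 1], and the absolute value on the direction term are simplifying assumptions of the sketch rather than details from the paper.

```python
import numpy as np

def segment_likelihood_ratio(grad, p0, p1, n_points=10, gamma=0.5):
    """Likelihood ratio p_on / p_off of (10)-(11) for the segment from p0 to p1.

    grad: (H, W, 2) image gradient field, assumed pre-normalized so that
          gradient norms lie roughly in [0, 1] (an assumption of this sketch).
    p0, p1: segment endpoints as (x, y) arrays.
    """
    h = (p1 - p0) / (np.linalg.norm(p1 - p0) + 1e-9)   # contour direction
    ts = np.linspace(0.0, 1.0, n_points)
    samples = p0[None, :] + ts[:, None] * (p1 - p0)[None, :]

    e_on, e_off = 0.0, 0.0
    for x, y in samples:
        g = grad[int(round(y)), int(round(x))]          # nearest-neighbour lookup
        norm = np.linalg.norm(g)
        n = g / (norm + 1e-9)                           # gradient direction
        # (10): favour strong gradients whose direction is perpendicular to the contour.
        e_on += (1.0 - norm) * abs(float(n @ h))
        # (11): exponential model for off-contour gradients.
        e_off += norm / gamma
    return np.exp(-e_on) / np.exp(-e_off)
```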

4. Contour inference

Using the graphical representation of the prior joint probability $p(f, f_c)$ and the likelihood defined above, the contour of the object of interest can be estimated by exploring the posterior joint probability (1). Since the graphical structure shown in Fig. 1 is not a Markov chain, inference cannot be achieved with a standard SMC method [1, 5], and the methods that perform inference on arbitrary graphical structures require sophisticated techniques to sample products of Gaussian mixtures [6, 14]. In this paper, we propose an SMC-based method that explores the chain-like structure within a level, coupled with a message-updating procedure between levels.

Following the idea of SMC and noticing that the random variables within a level of the proposed hierarchical graph are connected like a chain, the probabilities along the chains are maintained by propagating samples and their associated weights, denoted $\{S^i_{1:k}(l), w^i_k(l)\}_{i=1}^{N_s(l)}$, where the superscript $i$ is the sample index, $k$ is the step along the chain, $l$ is the level index, and $N_s(l)$ is the number of samples at the $l$-th level. When proceeding along the $l$-th level, the generation of samples at a new step relies on weighted sampling of the probability of its parent $Pa(k)$ at the $(l+1)$-th level. These sampling weights are determined by the messages at the $(l+1)$-th level. When the procedure reaches the end of the chain at L1, the contour of the expected object can be inferred by a MAP estimate or a mean estimate from the particles at L1 [1]. The inference algorithm is summarized in Fig. 3, and the detailed calculations are specified in the sequel.
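The per-step resampling and the final MAP/mean estimates referenced above can be realized with standard particle-filter operations. The following is a minimal sketch under that reading; systematic resampling is one common choice, not necessarily the one used by the authors, and the function names are illustrative.

```python
import numpy as np

def systematic_resample(samples, weights, rng=None):
    """Resample a particle set according to its (normalized) weights."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    indices = np.searchsorted(np.cumsum(weights), positions)
    indices = np.minimum(indices, n - 1)                 # guard against round-off
    return [samples[i] for i in indices]

def map_estimate(samples, weights):
    """MAP estimate: the sample carrying the largest weight."""
    return samples[int(np.argmax(weights))]

def mean_estimate(samples, weights):
    """Mean estimate: the weighted average of the samples (assumed stackable arrays)."""
    stacked = np.stack(samples)
    return np.tensordot(weights, stacked, axes=1)
```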

4.1. Sampling template shapes

The samples of template shapes at levels L2 and L3 are generated from the PCA-based models (6) and (8). For a wide variety of shapes, it is reasonable to assume that the model parameter $b$ is zero-mean Gaussian with covariance matrix $D_M = \mathrm{diag}(\lambda_1, \ldots, \lambda_M)$ [2, 3], where $\lambda_i$ $(i = 1, \ldots, M)$ are the variances along the corresponding principal feature vectors, obtained as by-products of PCA. Thus, a sample vector of the shape parameter, $b^i$, can be generated independently by $M$ Gaussian random number generators with the corresponding variances $\lambda_i$. Shape samples $\{S^i(l)\}_{i=1}^{N_s(l)}$ at L2 and L3 are reconstructed by substituting the sample vector $b^i$ into (6) and (8).
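A minimal sketch of this sampling step, together with fitting the PCA model of (6)/(8) from aligned training contours, is given below. The flattened (x1, y1, x2, y2, ...) layout of the training matrix and the function names are assumptions of the sketch.

```python
import numpy as np

def fit_pca_shape_model(training_shapes, n_modes):
    """Fit the PCA model f = f0 + Phi b of (6)/(8) to pre-aligned training shapes.

    training_shapes: (n_examples, 2 * n_points) array, each row a flattened contour.
    Returns the mean shape f0, the mode matrix Phi, and the per-mode variances lambda_i.
    """
    f0 = training_shapes.mean(axis=0)
    X = training_shapes - f0
    cov = X.T @ X / (len(training_shapes) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, 0.0)              # clip tiny negative eigenvalues
    order = np.argsort(eigvals)[::-1][:n_modes]
    return f0, eigvecs[:, order], eigvals[order]

def sample_template_shapes(f0, Phi, lambdas, n_samples, rng=None):
    """Draw N_s(l) template-shape samples: b_i ~ N(0, lambda_i) independently, then (6)/(8)."""
    rng = np.random.default_rng() if rng is None else rng
    b = rng.normal(0.0, np.sqrt(lambdas), size=(n_samples, len(lambdas)))
    return f0[None, :] + b @ Phi.T                  # one flattened shape sample per row
```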

4.2. Message updating

The communication between nodes at different levels is encoded by propagating bottom-up messages $B_u(S^i_{1:k}(l))$ and top-down messages $B_d(S^i_{1:k}(l))$ [8]. At each level $l$, the messages are updated through
$$B_u(S^i_{1:k}(l)) = \max_m \bigl[ B_u(S^m_{1:k}(l-1))\, \phi(S^m_{1:k}(l-1), S^i_{1:k}(l)) \bigr], \quad (12)$$
$$B_d(S^i_{1:k}(l)) = \max_m \bigl[ B_d(S^m_{1:k}(l+1))\, \phi(S^i_{1:k}(l), S^m_{1:k}(l+1)) \bigr]. \quad (13)$$

The potential function $\phi(\cdot)$ in (12) and (13) is the one given in (9), which measures the similarity between two 2D point sets, i.e. curve segments in this paper. It can be seen from (12) and (13) that the message-updating scheme enforces the shape correlation between levels, so that the shape information from both the higher and the lower level is incorporated when inferring at a given level. At each updating step, the bottom-up messages $B_u(\cdot)$ are recursively updated by (12) from bottom to top, i.e. from $B_u(S^i_{1:k}(1))$ to $B_u(S^i_{1:k}(2))$ and then to $B_u(S^i_{1:k}(3))$, where $B_u(S^i_{1:k}(1))$ is evaluated by the likelihood to capture the information of image features. The top-down messages $B_d(\cdot)$ are calculated by (13) in a similar way except for the updating direction, i.e. starting from the message at the top level, $B_d(S^i_{1:k}(3))$, which is set to 1 to indicate that no information is passed down to this level. The weights are then given as
$$w^i_k(l) = B_u(S^i_{1:k}(l))\, B_d(S^i_{1:k}(l)). \quad (14)$$

At level L1, the contours are estimated based on these weights and the corresponding samples, whereas at higher levels $(l = 2, 3)$ we resample the template shape samples according to these weights and use the perturbed template samples to predict new samples at the $(l-1)$-th level.

Figure 3. The inference algorithm.
  Initialize {S^i_{0:2}(l), w^i_2(l)} for i = 1..N_s(l), B_u(S^i_{0:2}(l)), B_d(S^i_{0:2}(l))
  For k = 1 : N_c
    If E_k is within the current component
      For i = 1 : N_s(1)                     // explore L1
        Prediction:
          - generate new samples through (3)
          - update messages B_u and B_d through (12) and (13)
        Weighting:
          - update weight w^i_k(l) by (14)
      End
    Else
      For i = 1 : N_s(2)                     // explore L2
        Prediction:
          - generate new samples through (7)
          - update messages B_u and B_d through (12) and (13)
        Weighting:
          - update weight w^i_k(l) by (14)
      End
    End
    Normalize the weights w^i_k(l)
    Resample the sample set according to the weights
  End
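For concreteness, a hedged sketch of one message-updating step (12)-(14) for the samples of a level l is shown below, as it might be used inside the loop of Fig. 3. Here `potential` is an inter-level potential in the sense of (9) (at the lowest level the bottom-up messages would instead be likelihood values), and the list-based data layout is an assumption of the sketch.

```python
import numpy as np

def update_messages(samples_l, samples_below, Bu_below, samples_above, Bd_above, potential):
    """One step of (12)-(14) for the partial contours S^i_{1:k} at level l.

    samples_*: lists of partial contours at level l, l-1 and l+1.
    Bu_below:  bottom-up messages of level l-1 (likelihood values at L1).
    Bd_above:  top-down messages of level l+1 (all ones at the top level L3).
    """
    Bu, Bd = [], []
    for S_i in samples_l:
        # (12): best lower-level sample weighted by the inter-level potential.
        Bu.append(max(Bu_below[m] * potential(samples_below[m], S_i)
                      for m in range(len(samples_below))))
        # (13): same construction using the level above.
        Bd.append(max(Bd_above[m] * potential(S_i, samples_above[m])
                      for m in range(len(samples_above))))
    # (14): the weight of each sample combines both message directions.
    w = np.asarray(Bu) * np.asarray(Bd)
    return np.asarray(Bu), np.asarray(Bd), w / w.sum()
```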

4.3. Exploring the chain structures

To propagate probability samples in a sequential way, SMC needs to predict new samples (i.e. to sample a proposal density) from the samples at the current step. In this paper, the information from the higher level is incorporated into the prediction of new samples. As described in section 2, the transition property at level L1 differs from that at level L2, so the exploration of the two chains is described separately.

• Chain at L1. As seen from (3), the transformation matrix $A_k$ needs to be determined in order to obtain new samples. Given a shape sample of $Pa(k)$ at level L2, $S^i_{1:k}(l+1)$, $A^i_k$ can be derived from

$$a^i_k = \frac{S_{-1} \cdot S_1}{\|S_{-1}\|^2}, \qquad b^i_k = \frac{S^y_{-1} S^x_1 - S^x_{-1} S^y_1}{\|S_{-1}\|^2}, \quad (15)$$
where $S_{-1} = [S^x_{-1}, S^y_{-1}] = S^i_k(l+1) - S^i_{k-1}(l+1)$, $S_1 = [S^x_1, S^y_1] = S^i_{k+1}(l+1) - S^i_k(l+1)$, and $\cdot$ denotes the dot product.

• Chain at L2. For the 2D points in this paper, the affine transformation $T_k$ in (7) can be represented by a $2 \times 2$ matrix $B_k$ and a translation vector $t_k$. Given the current sample $S^i_{1:k}(l)$ and a shape sample at the higher level L3, $S^i_{1:k}(l+1)$, the affine transformation is derived as [17]:
$$B^i_k = \bigl[(S^i_{1:k}(l) - \bar{S}^i_{1:k}(l))(S^i_{1:k}(l+1) - \bar{S}^i_{1:k}(l+1))^T\bigr]\, \bigl[(S^i_{1:k}(l+1) - \bar{S}^i_{1:k}(l+1))(S^i_{1:k}(l+1) - \bar{S}^i_{1:k}(l+1))^T\bigr]^{-1},$$
$$T^i_k = \bigl[t_k [1, \ldots, 1]\bigr]_{2 \times N_c(k)} = \bar{S}^i_{1:k}(l) - B^i_k \bar{S}^i_{1:k}(l+1), \quad (16)$$
where $\bar{S}^i_{1:k}(l) = \bigl[\bigl(\tfrac{1}{N_c(k)} \sum S^i_{1:k}(l)\bigr)[1, \ldots, 1]\bigr]_{2 \times N_c(k)}$, $\bar{S}^i_{1:k}(l+1) = \bigl[\bigl(\tfrac{1}{N_c(k)} \sum S^i_{1:k}(l+1)\bigr)[1, \ldots, 1]\bigr]_{2 \times N_c(k)}$, and $N_c(k)$ denotes the number of control points in the $k$-th component shape.
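A minimal sketch of the prediction step at L1 is shown below: $a^i_k$ and $b^i_k$ are derived from three consecutive control points of the parent (L2) sample via (15), and the resulting $A_k$ is applied as in (3). Treating the segments as displacement vectors, and the function names, are assumptions of the sketch.

```python
import numpy as np

def transition_from_parent(S_prev, S_curr, S_next):
    """A_k via (15): a_k, b_k from consecutive control points of the parent sample.

    S_prev, S_curr, S_next: 2D points S^i_{k-1}(l+1), S^i_k(l+1), S^i_{k+1}(l+1).
    """
    S_m1 = S_curr - S_prev                 # S_{-1}
    S_p1 = S_next - S_curr                 # S_{1}
    denom = float(S_m1 @ S_m1)             # ||S_{-1}||^2
    a_k = float(S_m1 @ S_p1) / denom
    b_k = (S_m1[1] * S_p1[0] - S_m1[0] * S_p1[1]) / denom
    return np.array([[a_k,  b_k],
                     [-b_k, a_k]])

def predict_next_point(x_prev, x_curr, A_k):
    """Apply (3): transform the current segment and append it to the contour."""
    return x_curr + A_k @ (x_curr - x_prev)
```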

5. Experimental results

The performance of the proposed shape representation and inference algorithm is evaluated in experiments on hand and face images. Although a hand shape could be decomposed into fingers and a palm and represented by the three-level graphical structure shown in Fig. 1, we use only a full shape model and remove the middle level of the hierarchical graph, in order to demonstrate the impact of global shape information and the behaviour of the stochastic inference algorithm.

5.1. Results on hand images

The experiments are performed on real-world frontal hand images with fingers at different stretching angles, 15 of which are manually annotated with 52 control points for training. $N_s(1) = N_s(2) = 200$ samples are used, and two points are manually specified as seed points. The hand shapes in the testing images are not part of the training set. The PCA-based model is able to represent the linear variations exhibited by the training set.

In our experiment, hand shapes with linear combinations of finger movements are generated based on the independent Gaussian assumption [2]. By varying the first parameter $b_1$ between $-2\sqrt{\lambda_1}$ and $2\sqrt{\lambda_1}$, the parameter representing the most significant variation found in the training set, hand shapes with maximum, medium and minimum stretching angles are generated, as shown in Fig. 4. From the characteristics of the Gaussian distribution, 95% of the shape samples generated by (8) fall into this range. Directed by the shape samples generated at the higher level, contours exhibiting shapes within this range are obtained, as Fig. 5 illustrates.

Since part of the image information is missing in cases of occlusion, methods without shape information (e.g. JetStream [12]) fail to handle them. In contrast, the proposed method is able to reconstruct the missing part by predicting samples using the shape information from the higher level, as seen in Fig. 6. In comparison, the ASM algorithm [2], which also incorporates a prior shape model, does not give the desired hand contour, as shown in Fig. 7(b); instead it shrinks from its initial contour (shown in Fig. 7(a)).

For further discussion, we examine the posterior density evolution of $x_{20}$, one component of the state sequence $x_{1:52}$. Fig. 8 shows the posterior density curves represented by the state samples and corresponding weights at three consecutive steps, and presents the locations of the corresponding MAP estimates, marked by dark points in the images beneath the line graph. Due to occlusion, the available image data at the 20th step do not provide strong evidence for samples of the "true" state; consequently, the correct mode is not the one that maximizes the posterior density. In contrast to deterministic approaches, the proposed stochastic inference method maintains these non-optimal samples. As the evolution goes on, new samples are predicted from the plausible shapes generated at the higher level, and the image evidence visible at the two subsequent steps accentuates the samples that comply with the boundary of the object, so that the MAP estimate eventually approaches the desired boundary, as seen in Fig. 6. On the other hand, the local search used in ASM finds a local minimum, where a plausible shape fits the available image information at the junctions of the index, middle and ring fingers, as shown in Fig. 7(b).

5.2. Results on face images

The face shape models are trained on 100 frontal face images chosen from the AR database [9], with variations in facial expression. Each image is annotated with 47 control points drawn from four facial features (8 points from each eye, 12 from the mouth, and 19 from the outer face outline).

Figure 4. Shape samples generated by varying the first shape parameter between $-2\sqrt{\lambda_1}$ and $2\sqrt{\lambda_1}$.

Figure 6. Hand contour with occlusion extracted by the proposed method.

Figure 5. Extracted contours with shape variations.


Figure 7. Hand contour with occlusion obtained by ASM: (a) initial contour and (b) final result.

In the experiments we use 500 samples for the nodes at levels L1 and L3. Since there are fewer shape variations for individual facial organs, only 50 samples are used for level L2. The testing images are from the CMU face detection database [13]. We first compare the proposed method with ASM [2] for extracting the contours of facial features from cluttered background. Due to distractions from edges in the cluttered background, the facial contour obtained by ASM is enlarged, as shown in Fig. 9(a). In our method, the incorporated higher-level shape information successfully directs the samples to evolve according to the shape variations represented by the training set, and rules out samples that are likely to represent undesired edges. As Fig. 9(b) illustrates, the estimated contour is therefore consistent with the boundary of the facial features and free from distractions of the cluttered background.

Figure 8. Evolution of posterior density and MAP estimates.

Fig. 10 shows that the proposed method gives satisfactory results when faces in images have in-plane rotation and slight pose changes. It is worth noting that although the proposed method fails to localize the contour of the baby's chin in the right image, the wrong location of the outer contour does not affect the other facial organs. This is mainly because we decompose the face shape and encode the shape information of each facial feature individually.

6. Conclusion and future work

In this paper, we propose a hierarchical representation for rigid shapes. The shape representation has three levels that encode shape information with different semantic meanings, and allows for chain-like conditional dependency within a level and bidirectional communication between different levels. An SMC-based algorithm is proposed to perform inference on the proposed structure.

Figure 9. Comparison with ASM: face contour obtained by (a) ASM and (b) the proposed method.

Experiments performed on hand and face images show that the proposed approach is effective in combating cluttered background and occlusion, owing to its ability to maintain multiple hypotheses. Moreover, the hierarchical representation makes it possible to confine localization errors to an individual part of the shape. In the future, modular PCA-based shape models [10] could be placed into the hierarchical structure, which would make the method adaptive to multiple views of an object. A simple gradient-based likelihood is used in the proposed method, so more elaborate image data models integrating multiple cues could be introduced to further improve performance.

Acknowledgements

The Matlab code of ASM was provided by Ghassan Hamarneh, Simon Fraser University, Canada. We would like to thank Cheng Chang for fruitful discussions on SMC and Jing Li for proofreading the English of the paper.

Figure 10. Face contours with pose changes.

References
[1] S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Processing, 50(2):174-188, 2002.
[2] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham. Active shape models: their training and application. Computer Vision and Image Understanding, 61(1):38-59, 1995.
[3] D. Cremers, F. Tischhäuser, J. Weickert, and C. Schnörr. Diffusion snakes: introducing statistical shape knowledge into the Mumford-Shah functional. IJCV, 50(3):295-313, 2002.
[4] P.F. Felzenszwalb. Representation and detection of deformable shapes. IEEE Trans. PAMI, 27(2):208-220, 2005.
[5] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In Proc. ECCV, pages 343-356, 1996.
[6] M. Isard. PAMPAS: real-valued graphical models for computer vision. In Proc. CVPR, 1:613-620, 2003.
[7] M.I. Jordan. Graphical models. Statistical Science, 19(1):140-155, 2004.
[8] T.S. Lee and D. Mumford. Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A, 20(7):1434-1448, 2003.
[9] A.M. Martinez and R. Benavente. The AR face database. CVC Technical Report #24, 1998.
[10] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Trans. PAMI, 19(7):696-710, 1997.
[11] N.R. Pal and S.K. Pal. A review on image segmentation techniques. Pattern Recognition, 26(9):1277-1294, 1993.
[12] P. Pérez, A. Blake, and M. Gangnet. JetStream: probabilistic contour extraction with particles. In Proc. ICCV, II:524-531, 2001.
[13] H.A. Rowley, S. Baluja, and T. Kanade. CMU Image Database: face. http://vasc.ri.cmu.edu/idb/images/face/frontal_images/images.tar.
[14] E.B. Sudderth, A.T. Ihler, W.T. Freeman, and A.S. Willsky. Nonparametric belief propagation. In Proc. CVPR, 1:605-612, 2003.
[15] F. De la Torre and M.J. Black. Robust parameterized component analysis: theory and applications to 2D facial modeling. In Proc. ECCV, LNCS 2353, pages 653-669, 2002.
[16] M. Werman and D. Weinshall. Similarity and affine invariant distances between 2D point sets. IEEE Trans. PAMI, 17(8):810-814, 1995.
[17] Z. Xue, S.Z. Li, and E.K. Teoh. Bayesian shape model for facial feature extraction and recognition. Pattern Recognition, 36(12):2819-2833, 2003.
[18] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Technical Report TR2001-22, MERL, 2001.
[19] A.L. Yuille, D.S. Cohen, and P. Hallinan. Feature extraction from faces using deformable templates. IJCV, 8(2):99-112, 1992.
[20] J. Zhang, R. Collins, and Y. Liu. Representation and matching of articulated shapes. In Proc. CVPR, 2004.