Prior Knowledge, Level Set Representations & Visual Grouping

Mikael Rousson ([email protected])
Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, USA

Nikos Paragios ([email protected])
M.A.S, Ecole Centrale de Paris, Grande Voie des Vignes, 92295 Chatenay-Malabry, FRANCE

Abstract. In this paper, we propose a level set method for shape-driven object extraction. We introduce a voxel-wise probabilistic level set formulation to account for prior knowledge. To this end, objects are represented in an implicit form. Constraints on the segmentation process are imposed by seeking a projection onto the image plane of the prior model modulo a similarity transformation. The optimization of a statistical metric between the evolving contour and the model leads to motion equations that evolve the contour toward the desired image properties while recovering the pose of the object in the new image. Upon convergence, a solution that is similarity invariant with respect to the model, along with the corresponding transformation, is recovered. Promising experimental results demonstrate the potential of such an approach.

Keywords: Level set method, distance transforms, curve propagation, similarity transformation, pose estimation, object extraction.

1. Introduction

Segmentation is a core component of imaging and vision; it aims at separating the image domain into entities/regions with consistent properties. A vast amount of research has been performed during the past three decades toward completely automated solutions for general-purpose image segmentation. Variational techniques (Mumford and Shah, 1985; Paragios et al., 2005), statistical methods (Geman and Geman, 1984; Zhu and Yuille, 1996), combinatorial approaches (Boykov et al., 2001), curve-propagation techniques (M.Kass et al., 1988), and methods that perform non-parametric clustering (Cheng, 1995) are some examples. However, unconstrained segmentation is an ill-posed problem. Changes in the object pose, illumination, camera viewpoint, etc. can lead to inconsistent segmentation results. Model-based segmentation (Cootes et al., 1995) aims at recovering from an image a particular structure of interest that follows some predefined characteristics. Local geometric properties (curvature, local smoothness constraints) can be used when defining such a prior model (Leventon et al., 2000a), or this can be done in a more global manner (Cremers et al., 2001) that captures the variance of the entire structure of interest. While local models are quite efficient, global representations are more appropriate to cope with occlusions, noise, and changes in the object pose. The modeling of shape properties should be addressed before introducing shape-driven constraints. Such modeling aims to extract a compact representation for the structure of interest from a set of training examples. Simple geometric components, like straight segments (Fischler and Elschlager, 1973) and ellipsoids (Birchfield, 1998), were a first attempt to create a compact representation for modeling faces. Deformable templates (Lipson et al., 1990; Yuille, 1991; Yuille et al., 1992; Metaxas, 1997) were a step further toward global modeling.
Object extraction is performed through matching between the learnt visual model and the actual image. The introduction of active shape and appearance models (Cootes et al., 1995; Cootes et al., 1999) was a major step toward addressing model-based segmentation, object extraction and tracking. Prior knowledge of the structure to be recovered is expressed through an average shape/image and a set of basis vectors that account for the shape/image variability. Segmentation is then addressed in two steps: (i) the method finds an optimal transformation between the average model and the image according to the desired image features, and (ii) the solution is constrained to be a linear combination of the basis elements, toward capturing local variations of the structure of interest. The snake/active contour, a pioneering framework introduced in (M.Kass et al., 1988), was a breakthrough in the area of object extraction. Such a model seeks the lowest potential of a cost function defined along a curve. It comprises two terms: an internal one that imposes certain smoothness constraints, and an external one that guides the curve toward the desired image attributes (visual properties, etc.). Region-based information was also considered to make

active contours more efficient (Zhu and Yuille, 1996). The use of basis functions to approximate the evolving contour, such as spline-based (Bascle, 1994) and Fourier-driven methods (Staib and Duncan, 1992), is an appropriate technique to represent curves and was therefore considered to a significant extent. Such a concept can be used to introduce constraints according to a prior model. In (Staib and Duncan, 1992), modeling consists of learning a distribution of the Fourier coefficients from a training set. Bayesian inference is then used to perform object extraction as a compromise between the model's geometric form and the image features to be recovered. The selection of an appropriate parameterization to describe the structure of interest is the most challenging part of snake-based segmentation approaches. Such a selection involves the nature of the basis functions, the number of control points, the sampling rule, and the re-parameterization of the evolving structure. Level set methods (Osher and Sethian, 1988; Dervieux and Thomasset, 1979; Dervieux and Thomasset, 1980) are established techniques in computational vision (Osher and Paragios, 2003). Image restoration, segmentation, tracking and stereo reconstruction are some examples (Tek and Kimia, 1995; Kimmel and Bruckstein, 1995; Sethian, 1996; Faugeras and Keriven, 1998; Malladi and Sethian, 1998; Bertalmio et al., 2000; Yezzi et al., 2001; Sapiro, 2001; Yezzi and Soatto, 2003). Implicit representations can cope with tasks that involve tracking moving interfaces (curves, surfaces, hyper-surfaces, etc.). The main strengths of level set methods are that they are parameter free, they can handle topological changes, they capture local deformations, and they provide a natural approach to the estimation of geometric properties of the evolving interface.
On the other hand, dealing with noisy or incomplete data and with solid/rigid objects are their most notable shortcomings; in these cases parametric models perform better. In this paper we introduce a mathematical formulation that constrains an implicit surface to follow a global shape consistency while preserving its ability to capture local deformations. Toward this end, we propose a novel objective function that can account for global/local shape properties of the object to be recovered. Our approach consists of two stages. During the first stage, a shape model is built directly in the level set space using a collection of samples. The model is constructed using a variational framework and refers to an implicit function that accounts for shape variability in a qualitative manner. This model is used as a basis to introduce the shape prior term in an energetic form. Such a term aims at minimizing the distance between the evolving curve and the shape model. The evolving contour is also globally deformed to fit the image according to a segmentation criterion. Both the global deformation between the contour and the image, and the local deformations of the contour in the image plane, are recovered. In order to demonstrate the performance of such an approach, it is applied to existing data-driven variational methods to perform object extraction from corrupted and incomplete data.

The work most closely related to our approach can be found in (Leventon et al., 2000b; Tsai et al., 2001; Chen et al., 2002). More recent techniques can be found in (Charpiat et al., 2003; Bresson et al., 2003; Jehan-Besson et al., 2003; Cremers et al., 2003; Taron et al., 2005). In (Leventon et al., 2000b; Tsai et al., 2001; Bresson et al., 2003), modeling consists of an average shape and modes of variation obtained through a principal component analysis on the space of implicit functions, while in (Chen et al., 2002; Jehan-Besson et al., 2003; Cremers et al., 2003) the model refers to an average shape. In (Charpiat et al., 2003), principal modes of variation are also estimated, but considering a non-linear shape metric. The image term and the prior term are well separated in (Leventon et al., 2000b), where the MAP criterion is used to impose prior knowledge: the Gaussian distribution learnt through the principal component analysis guides the prior term, while the geodesic active contour is the data-driven term. Prior and data terms are coupled in (Chen et al., 2002) within the geodesic active contour. There, the model refers to an average shape in implicit form, and the prior term refers to the projection of the evolving contour onto this space according to a similarity transformation. A different approach is proposed in (Tsai et al., 2001), where a region-driven statistical measure defines the image component of the function, while the prior term refers to the projection of the contour onto the model space using a global transformation and a linear combination of the basic modes of variation. In (Bresson et al., 2003), an approach that integrates (Chen et al., 2002) and (Tsai et al., 2001) is presented, while (Jehan-Besson et al., 2003) is a straightforward application of the term proposed in (Chen et al., 2002).
Our approach adopts an alternative model and a different concept of the prior term compared with the above techniques, since it is based on a direct comparison between the evolving contour and the model. Also, contrary to related works that assume that the whole shape can be modeled with the same confidence, our prior captures local confidence. The remainder of this paper is organized as follows: in Section 2, we briefly introduce level set representations. Section 3 is dedicated to the construction of the shape prior. In Section 4, we introduce our shape-prior functional, which is integrated with a data-driven variational framework. Finally, experiments and discussion are part of Section 5.

2. Implicit Representations

Evolving interfaces (curves) according to some flow is a popular segmentation technique (M.Kass et al., 1988). The flow that governs the propagation is either recovered through the minimization of an objective function (Paragios and Deriche, 2002), or defined according to the application context (Caselles et al., 1993; Malladi et al., 1994). Snake-based segmentation approaches often refer

to the propagation of curves from an initial position toward the desired image characteristics. Such flows consist of internal and external terms. In order to introduce level set representations, one can consider a parametric curve C(p) : [0, 1] → R × R that evolves according to a given motion equation in the normal direction N¹:

\frac{d}{d\tau} C(p) = F(C(p))\,\mathcal{N}

where F is a scalar function of the local properties of the curve (e.g. curvature). The level set method (Osher and Sethian, 1988), initially introduced in the area of fluid dynamics, is an emerging technique to cope with various applications in imaging, vision and graphics (Osher and Paragios, 2003). Such methods rely on representing the evolving curve as the zero level of a surface [x, y, φ(x, y)]:

\phi(C(p)) = 0

Such a representation is implicit, intrinsic, and parameter free. One can then evolve the surface in such a way that its zero level set always represents the deforming curve. Taking the derivative of φ with respect to time leads to:

\frac{d}{d\tau}\phi + F\,|\nabla\phi| = 0

Thus, we have established a connection between the family of evolving curves C and the family of evolving surfaces φ. Such a propagation scheme can account for topological changes and provides natural support for the estimation of the local geometric properties of the curve. Techniques related to the introduction of the level set method in imaging and vision were initially reported in (Caselles et al., 1993; Malladi et al., 1994; Kimia et al., 1995) and then spread across various applications. Such tools were considered as efficient numerical approximation techniques to implement curve propagation according to various flows. The protocol was rather simple: a flow was obtained using energy minimization techniques and then implemented in the level set space. The development of appropriate numerical approximation tools has made these techniques far more popular.
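As a minimal sketch of the propagation equation above (not the authors' implementation), one can evolve a signed distance map under a constant speed F with explicit Euler steps. The grid size, time step, and the use of simple central differences are illustrative choices; a production scheme would use upwind differencing:

```python
import numpy as np

def evolve(phi, F=1.0, dtau=0.4, steps=10):
    # d(phi)/dtau = -F |grad(phi)|, explicit Euler with central
    # differences (illustrative; upwinding omitted for brevity).
    for _ in range(steps):
        gy, gx = np.gradient(phi)
        phi = phi - dtau * F * np.sqrt(gx**2 + gy**2)
    return phi

# Initialize phi as the signed distance to a circle of radius 20,
# positive inside (the distance-transform convention used below).
y, x = np.mgrid[0:100, 0:100]
phi0 = 20.0 - np.sqrt((x - 50.0)**2 + (y - 50.0)**2)

# With this sign convention, F > 0 shrinks the zero level set:
# after 10 steps of size 0.4 the radius is roughly 16.
phi = evolve(phi0, F=1.0)
```

Only the implicit function is updated; the curve itself is never parameterized, which is precisely what allows topological changes.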
Definition of objective functions in the level set space (Zhao et al., 1996) for grouping was a step further toward the establishment of these techniques in imaging and vision. To this end, one can consider the distance transform D(s, C) as the embedding function for C:

\phi(s) = \begin{cases} D(s, \partial R), & s \in \Omega_C \\ -D(s, \partial R), & s \in \Omega - \Omega_C \end{cases}

¹ The tangential component of the flow affects only the internal parameterization of the curve.

where Ω_C is the area inside the contour. We also introduce the regularized Dirac and Heaviside distributions:

\delta_\alpha(\phi) = \begin{cases} 0, & |\phi| > \alpha \\ \frac{1}{2\alpha}\left(1 + \cos\left(\frac{\pi\phi}{\alpha}\right)\right), & |\phi| < \alpha \end{cases}

H_\alpha(\phi) = \begin{cases} 1, & \phi > \alpha \\ 0, & \phi < -\alpha \\ \frac{1}{2}\left(1 + \frac{\phi}{\alpha} + \frac{1}{\pi}\sin\left(\frac{\pi\phi}{\alpha}\right)\right), & |\phi| < \alpha \end{cases}

and use them to express an image partition objective function (Samson et al., 1999; Vese and Chan, 2002). Smoothness constraints, boundary-driven object detection, as well as general region-consistency grouping terms can now be introduced directly on the level set space φ. Length minimization is a well-known geometric smoothness term that can be introduced in a straightforward manner using the proposed formulation:

E_{smoothness}(\phi) = \iint_\Omega \delta_\alpha(\phi)\,|\nabla\phi|\; d\Omega
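The regularized distributions and the resulting length and area estimates can be sketched as follows (a toy example on a synthetic circle; the band half-width ALPHA and grid size are illustrative):

```python
import numpy as np

ALPHA = 1.5  # band half-width of the regularization (illustrative)

def dirac(phi, a=ALPHA):
    # Regularized Dirac: nonzero only in the band |phi| < a.
    out = np.zeros_like(phi, dtype=float)
    band = np.abs(phi) <= a
    out[band] = (1.0 / (2 * a)) * (1 + np.cos(np.pi * phi[band] / a))
    return out

def heaviside(phi, a=ALPHA):
    # Regularized Heaviside: smooth step across the zero level set.
    out = np.where(phi > a, 1.0, 0.0)
    band = np.abs(phi) <= a
    out[band] = 0.5 * (1 + phi[band] / a
                       + np.sin(np.pi * phi[band] / a) / np.pi)
    return out

# Signed distance to a circle of radius 20, positive inside.
y, x = np.mgrid[0:100, 0:100]
phi = 20.0 - np.sqrt((x - 50.0)**2 + (y - 50.0)**2)

gy, gx = np.gradient(phi)
length = np.sum(dirac(phi) * np.sqrt(gx**2 + gy**2))  # ~ 2*pi*20 ~ 126
area_inside = np.sum(heaviside(phi))                  # ~ pi*20**2 ~ 1257
```

Both integrals are evaluated over the whole image plane, yet only the narrow band around the zero level set contributes to the length term.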

The geodesic active contour (Caselles et al., 1997; Kichenassamy et al., 1995) is a step forward that aims to recover a minimal-length curve

E_{geodesic}(\phi) = \iint_\Omega \delta_\alpha(\phi)\, b(\cdot)\,|\nabla\phi|\; d\Omega

according to some arbitrary metric function b : R⁺ → [0, 1]. Such a function is monotonically decreasing, with minimal values at the image locations with the desired features (high gradient). The calculus of variations provides a geometric flow that updates the position of the interface toward the desired image properties:

\frac{\partial \phi}{\partial \tau} = \delta_\alpha(\phi)\, \mathrm{div}\left(b(\cdot)\,\frac{\nabla\phi}{|\nabla\phi|}\right)

Such flows can lead to precise boundary extraction under certain initial conditions. The use of regional/global information modules (Paragios and Deriche, 2002) aims at separating the object from the background and can lead to adaptive balloon forces. Within such a module, the evolving interface is used to define an image partition that is optimal with respect to some grouping criterion. Such a criterion can easily be derived from the Heaviside distribution:

E_{regional}(\phi) = \underbrace{\iint_\Omega H_\alpha(\phi)\, r_O(I(x))\; d\Omega}_{object} + \underbrace{\iint_\Omega \left(1 - H_\alpha(\phi)\right) r_B(I(x))\; d\Omega}_{background}

according to some global descriptors r_O : R⁺ → [0, 1], r_B : R⁺ → [0, 1] that are monotonically decreasing functions. Such descriptors measure the quality of the match between the observed image and the expected regional properties of the structure of interest and of the background. Such terms can improve segmentation performance and make the approach relatively independent of the initial conditions. The calculus of variations with respect to φ leads to the following flow:

\frac{\partial \phi}{\partial \tau} = \delta_\alpha(\phi)\,(r_B - r_O)

which is an adaptive balloon force (Cohen, 1991). Such a force is based on relative measurements and either expands or shrinks the curve according to the local fit of the data with respect to the expected intensity properties of the object and the background class. The introduction of shape-driven modules is a valuable element of the segmentation process. Such an action involves the definition/recovery of a structure to represent such knowledge and the introduction of constraints that guide the segmentation process toward solutions that respect the prior.

3. Building an Implicit Shape Model

Selecting a representation to describe prior knowledge is a critical component of knowledge-based segmentation. Within such a task, the objective is to recover a compact structure from a set of N training examples to represent the prior. Such a structure should be able to describe the variability of the training examples. Registration (Veltkamp and Hagedoorn, 1999) is a required step within such a process. One would like to align all training examples to a common pose and then seek a meaningful compact representation that can encode prior knowledge for this particular pose. The alignment of shapes is an open problem in imaging and vision with numerous potential applications. Since registration is not in the scope of this paper, we will not present in detail the approach followed to address the problem.
Implicit representations and distance transforms can be considered to represent shapes in a higher dimension (Paragios et al., 2002). One can then perform registration in this space, seeking a transformation that aligns the implicit representation of the source with that of the target. Global error metrics like the sum of squared differences (Paragios et al., 2003), as well as maximization of mutual information (Huang et al., 2006) in the space of implicit representations, can be used to recover a parametric model that describes the displacement between the source and the target. Local deformations can be accounted for in the space of implicit representations using either optical-flow constraints (Paragios et al., 2003) or free-form deformations (Huang et al., 2006). Introducing prior knowledge within level set methods requires the definition of a model. A cloud of points is a simple way to represent such knowledge. Building an average shape across the examples of the training

set can be sufficient to represent such a prior (Chen et al., 2002). Such a technique cannot capture variability, however, and is not convenient within a level set framework, where the evolving interface is not represented using points. Within such a framework, a natural selection is to define the prior within the level set space (Leventon et al., 2000b; Tsai et al., 2001). Consistency between the propagation/optimization framework and the form of the prior is meaningful. In other words, the objective is to recover from a set of examples [φ₁, φ₂, ..., φ_N] a compact representation encoding the prior, where φᵢ is the level set representation of Ĉᵢ. Principal Component Analysis (PCA) can be applied to capture the statistics of the corresponding elements across the training examples. PCA refers to a linear transformation of variables that retains, for a given number n of operators, the largest amount of variation within the training data. Such a technique models global variations of the samples within the training set. One can consider creating a model that combines the simple structure of an average shape with the ability to capture the local variability of the training set. Such a model should consist of two components, the most prominent shape as well as the voxel-wise confidence along the shape parts:

− φ_S : R × R → R, an optimal (according to some criterion) implicit function derived from the training set,

− σ_S : R × R → R⁺, a confidence map determined according to the agreement between the considered model φ_S and the training set.

When the training examples agree on a particular part, the confidence should be high and the recovery of the object in the image should strongly respect the prior. When this is not the case, the prior constraint should be relaxed and the image information should be more important.
To this end, we consider a probabilistic level set representation that consists of a representative shape φ_m and a confidence map σ_m, both defined at the pixel level (s), according to:

p_s(\phi(s)) = \frac{1}{\sqrt{2\pi}\,\sigma_m(s)}\; e^{-\frac{(\phi(s) - \phi_m(s))^2}{2\sigma_m(s)^2}}

A conceptually similar model was considered in (Wang and Staib, 1998). Such a model is presented in Figure (1). The representative shape should be a level set where we consider the distance transform as the embedding function, leading to the constraint |∇φ_m| = 1. One can relax this constraint and seek the shape that best describes the training samples [φ₁, φ₂, ..., φ_N], e.g. the average:

\phi_m = \frac{1}{N} \sum_{i=1}^{N} \phi_i

Since distance functions do not form a vector space, φ_m will most probably not be a distance function. We will see how to overcome this by means of reinitializations in the optimization process.
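The fact that the average of distance functions is generally not a distance function is easy to observe numerically. In the following sketch (synthetic circles, positive inside; all sizes illustrative), the gradients of the two maps cancel between the shapes, so the mean violates |∇φ| = 1:

```python
import numpy as np

# Two signed distance maps to circles at different centers
# (positive inside), and their pixel-wise mean.
y, x = np.mgrid[0:100, 0:100]
phi1 = 15.0 - np.sqrt((x - 35.0)**2 + (y - 50.0)**2)
phi2 = 15.0 - np.sqrt((x - 65.0)**2 + (y - 50.0)**2)
phi_m = 0.5 * (phi1 + phi2)

gy, gx = np.gradient(phi_m)
grad_norm = np.sqrt(gx**2 + gy**2)

# Between the two circles the gradients point in opposite directions
# and cancel, so |grad(phi_m)| drops far below 1 there, while far to
# the left of both shapes the gradients align and the norm stays ~1.
```

This is exactly the situation Figure (2) illustrates, and the reason a reinitialization step is needed.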

Toward the construction of the level set prior representation, one can consider solving the inference problem at the pixel level (s): given a set of values [φ₁(s), φ₂(s), ..., φ_N(s)], recover the distribution (φ_m(s), σ_m(s)) that best expresses the data. Maximizing the posterior of this distribution over the training samples is equivalent to minimizing

E(\phi_m(s), \sigma_m^2(s)) = -\sum_{i=1}^{N} \log p_s(\phi_i(s)) = \frac{1}{2} \sum_{i=1}^{N} \left( \log(\sigma_m^2(s)) + \frac{(\phi_i(s) - \phi_m(s))^2}{\sigma_m^2(s)} \right)

where some constant terms were omitted. Integrating over all image locations, we obtain a criterion defined on the image plane:

E(\phi_m, \sigma_m^2) = \frac{1}{2} \sum_{i=1}^{N} \iint_\Omega \left( \log(\sigma_m^2(s)) + \frac{(\phi_i(s) - \phi_m(s))^2}{\sigma_m(s)^2} \right) d\Omega.

Natural objects are composed of local segments and articulations. Such articulations can lead to low-confidence segments when building the considered model. Since this energy does not introduce any spatial constraint, the optimal φ_m and σ_m² are given by the pixel-wise empirical means and variances of the training distance maps. However, even though the motion of such components is not regular, one can expect that, at a local level, the confidence of the model should be smooth. The object can be decomposed into segments that are solid, and σ_m² has to be smooth along these segments, or within a small neighborhood system in the image plane. Smoothness terms are quite popular in optimization criteria in the spatially continuous domain. A common technique (Tikhonov, 1992) is to penalize the spatial derivatives of the field to be recovered (σ_m²):

E(\phi_m, \sigma_m^2) = \alpha \sum_{i=1}^{N} \iint_\Omega \left( \log(\sigma_m^2) + \frac{(\phi_i - \phi_m)^2}{\sigma_m^2} \right) d\Omega + \iint_\Omega \psi\!\left(\nabla \sigma_m^2\right) d\Omega,

where ψ(u, v) is a regularization function. A simple selection for ψ is a variant of the squared error norm:

E(\phi_m, \sigma_m^2) = \alpha \sum_{i=1}^{N} \iint_\Omega \left( \log(\sigma_m^2) + \frac{(\phi_i - \phi_m)^2}{\sigma_m^2} \right) d\Omega + \iint_\Omega |\nabla \sigma_m^2|^2\, d\Omega.

The calculus of variations and a gradient descent method can now be used to recover the solution for the prior model (φ_m, σ_m²). The last constraint to be accounted for relates to φ_m. Given the form of the training examples (level set representations with distance transforms as embedding functions), it is natural to recover a model φ_m on this manifold. Constrained optimization of this functional can be done using Lagrange multipliers and a gradient descent method. However,

given the form of the constraints, one cannot assume that the conditions guaranteeing the validity of the Lagrange theorem are satisfied. Furthermore, the number of unknown variables of the system is too high, leading to a quite unstable system. One can overcome such limitations through the use of an augmented Lagrangian function; however, even in that case the proof of validity and the initial conditions are open issues. One can account for the distance-function constraint by decoupling the problem into two stages. In the first step, an optimal data-driven solution is recovered that explains the training set. One can then reinitialize this solution to a distance function. Prior art on this subject includes several techniques (Sussman et al., 1994). Some of them require the extraction of the level set (Sethian, 1996), while others can perform the same task directly on the implicit representation space (Sussman et al., 1994). We consider a well-known PDE that preserves the location of the zero crossing (Sussman et al., 1994):

\frac{d\phi}{d\tau} = \mathrm{sgn}(\phi_0)\,(1 - |\nabla\phi|)

where φ₀ is the initial representation to be projected onto the manifold of signed distance functions. Taking into account the constraints on φ_m and σ_m², the overall process consists of the following steps:

1. The optimal level set φ_m minimizing E(φ_m, σ_m²) is obtained by averaging the training shapes (the regularization introduced on the variance has no influence on the estimation of the mean shape): \phi_m = \frac{1}{N}\sum_{i=1}^{N}\phi_i.

2. The level set φ_m is projected onto the manifold of signed distance functions using the PDE described previously. This gives a distance map φ̃_m which can be used as the mean shape representation.

3. The variance σ_m² is then initialized as the empirical variance with respect to the distance map φ̃_m:

\sigma_m^2 = \frac{1}{N} \sum_{i=1}^{N} (\phi_i - \tilde{\phi}_m)^2.


Figure 1. Building probabilistic prior models using small training sets: Two horse images and corresponding voxel-wise probabilistic shape model.

Figure 2. Probabilistic prior models comparison - Left: mean level set and confidence map without projecting to the space of distance functions. Right: same but with the reinitialization to a distance function.

4. A smooth confidence map is then obtained by minimizing E(φ_m, σ_m²) with the following gradient descent²:

\frac{d}{d\tau}\sigma_m^2 = \alpha\, \frac{1}{\sigma_m^2} \sum_{i=1}^{N} \left( \left(\frac{\phi_i - \tilde{\phi}_m}{\sigma_m}\right)^2 - 1 \right) + \frac{\partial^2}{\partial x \partial x}\sigma_m^2 + \frac{\partial^2}{\partial y \partial y}\sigma_m^2.
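The four steps above can be sketched end to end as follows. This is an illustrative reimplementation, not the authors' code: synthetic circle shapes stand in for registered training samples, central differences stand in for the PDE schemes, and α, the time steps, the iteration counts, and the variance floor are all assumed values:

```python
import numpy as np

def sd_circle(cx, cy, r, shape=(80, 80)):
    # Signed distance to a circle, positive inside (the
    # distance-transform convention used in the text).
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    return r - np.sqrt((x - cx)**2 + (y - cy)**2)

def reinitialize(phi0, dtau=0.3, steps=50):
    # Step 2: project onto signed distance functions with
    # d(phi)/dtau = sgn(phi0)(1 - |grad(phi)|), smoothed sign.
    phi = phi0.copy()
    sign = phi0 / np.sqrt(phi0**2 + 1.0)
    for _ in range(steps):
        gy, gx = np.gradient(phi)
        phi = phi + dtau * sign * (1.0 - np.sqrt(gx**2 + gy**2))
    return phi

def laplacian(f):
    # 5-point Laplacian with periodic boundaries (for brevity).
    return (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
            np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4.0 * f)

# Step 1: average the training level sets.
train = [sd_circle(40, 40, 16), sd_circle(38, 42, 20), sd_circle(42, 40, 18)]
phi_m = np.mean(train, axis=0)

# Step 2: reinitialize the mean to a distance map.
phi_m_t = reinitialize(phi_m)

# Step 3: empirical variance w.r.t. the projected mean.
sigma2 = np.mean([(p - phi_m_t)**2 for p in train], axis=0)
sigma2 = np.maximum(sigma2, 0.1)  # floor keeps the variance positive

# Step 4: smooth the variance by gradient descent on the
# regularized criterion (data term pulls toward the empirical
# variance, the Laplacian diffuses it spatially).
alpha, dtau = 0.05, 0.1
for _ in range(200):
    data = sum((p - phi_m_t)**2 / sigma2**2 - 1.0 / sigma2 for p in train)
    sigma2 = np.maximum(sigma2 + dtau * (alpha * data + laplacian(sigma2)),
                        0.1)
```

The zero crossings of `phi_m_t` give the mean shape; `sigma2` is high where the training circles disagree (near their boundaries) and low elsewhere.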

In Figure (1), we show the outcome of this modeling process on a set of two horse shapes. One can wonder whether we really need to constrain the mean shape to be a distance function. In Figure (2), we show a comparison with and without reinitialization. It appears clearly that when φ_m is not a distance function, it includes not only shape information but also some variability of the training shapes; in that case, we lose the concept of shape. On the other hand, when φ_m is reinitialized to a distance function, it clearly represents a unique shape, and the variability within the training set can then be entirely captured by the variable σ_m³. The advantages of such a prior model are numerous. It encodes prior knowledge within implicit representations in a natural form, provides straightforward techniques for the estimation of geometric properties, can deal with multi-component objects, and can be determined from a small set of training examples. Such encoding can support meaningful comparisons between the evolving interface and the model. A minimal difference between the prior (φ_m) and the evolving interface (φ) corresponds to a solution that to some extent respects the prior.

² In order to avoid stability problems in special cases, one can replace the variability estimates with σ_m = 1 + σ̂_m and then seek a σ̂_m that is constrained to be strictly positive at the pixel level.

³ Segmentation experiments without reinitializing φ_m to a distance function are very unstable.
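Such a confidence-weighted comparison can be sketched as follows (synthetic circles, positive inside; the non-uniform confidence map is purely illustrative). The pixel-wise score is the −log of the Gaussian p_s defined above, with constants dropped:

```python
import numpy as np

def model_discrepancy(phi, phi_m, sigma2_m):
    # Sum over the image of -log p_s(phi(s)), constants dropped:
    # log(sigma_m) + (phi - phi_m)^2 / (2 sigma_m^2).
    return np.sum(0.5 * np.log(sigma2_m)
                  + (phi - phi_m)**2 / (2.0 * sigma2_m))

y, x = np.mgrid[0:100, 0:100]
phi_m = 20.0 - np.sqrt((x - 50.0)**2 + (y - 50.0)**2)
sigma2_m = np.ones_like(phi_m)
sigma2_m[:, 60:] = 4.0  # lower confidence on the right (illustrative)

# Two candidate interfaces: one close to the prior, one far from it.
phi_close = 19.0 - np.sqrt((x - 50.0)**2 + (y - 50.0)**2)
phi_far = 20.0 - np.sqrt((x - 30.0)**2 + (y - 50.0)**2)

d_close = model_discrepancy(phi_close, phi_m, sigma2_m)
d_far = model_discrepancy(phi_far, phi_m, sigma2_m)
# d_close < d_far: the interface that respects the prior scores lower,
# and mismatches in low-confidence regions are penalized less.
```
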

4. Introducing Prior Knowledge

The assumption that all training examples are registered in a common pose was made during model construction. Such an assumption is required in order to recover a meaningful model. Knowledge-based segmentation has to address the same concern: objects in the image can have a different scale, orientation, etc., compared to the prior model. The parameters of the transformation between these two elements are unknown, while its form can be known. In this paper we address the similarity-invariant case, where the object to be detected is a similarity transformation of the model combined with some local deformations. At an abstract level, knowledge-based segmentation is then equivalent to imposing a constraint that forces the evolving interface, in all instances, to be a similarity transformation of the average model. Such an action leads to the recovery of an image structure that has the same geometric properties as the prior. This concept is demonstrated in Figure (3, top row). Such a constraint should be based on a meaningful comparison between the prior (φ_m) and the evolving implicit representation (φ). Two choices are possible for the introduction of the similarity transformation. It can either be between the evolving level set (φ) and the model, as presented in most related publications (Rousson et al., 2004; Rousson and Paragios, 2002; Chen et al., 2001; Leventon et al., 2000b), or between the evolving level set and the image. This second possibility has not been considered for curve evolutions, but it is common for template matching (Dufour et al., 2002) and model-based segmentation⁴ (Tsai et al., 2003; Rousson and Cremers, 2005). In the following, we consider this second approach to introduce the shape model defined previously into level-set-based segmentation. Therefore, we make the assumption that the evolving level set has the same pose as the shape model.
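As an illustration of applying a similarity transformation A = (T, R_Θ, S) to an implicit shape model, one can warp the model grid by the inverse mapping and resample. This is a sketch under assumptions: nearest-neighbor sampling, a centered rotation, and the common convention of multiplying distance values by S so the warped map remains (approximately) a signed distance function:

```python
import numpy as np

def warp_similarity(phi_m, theta, scale, tx, ty):
    # Warp an implicit map by a similarity transformation
    # (translation (tx, ty), rotation theta, scale). Inverse
    # mapping: each image pixel looks up its model coordinate.
    h, w = phi_m.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    dx, dy = x - cx - tx, y - cy - ty
    c, s = np.cos(theta), np.sin(theta)
    u = (c * dx + s * dy) / scale + cx
    v = (-s * dx + c * dy) / scale + cy
    ui = np.clip(np.round(u).astype(int), 0, w - 1)
    vi = np.clip(np.round(v).astype(int), 0, h - 1)
    # Distance values scale with S (assumed convention).
    return scale * phi_m[vi, ui]

# Example: translate a circle model by (10, 0) and scale it by 1.5;
# the warped object is a circle of radius ~30 centered near x=60.
yy, xx = np.mgrid[0:100, 0:100]
phi_m = 20.0 - np.sqrt((xx - 50.0)**2 + (yy - 50.0)**2)  # positive inside
phi_w = warp_similarity(phi_m, theta=0.0, scale=1.5, tx=10.0, ty=0.0)
```

Whether this transformation is placed between φ and the model or between φ and the image is exactly the design choice discussed above; the warp itself is the same in both cases.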
First, we present a Bayesian introduction of the shape model; then, we show how to introduce invariance to similarity transformations with respect to the model by introducing the transformation parameters in the data term of the segmentation criterion.

⁴ We refer to "model-based segmentation" as an approach that optimizes the parameters of a given parametric shape model to fit it to the image. This is different from a shape-constrained curve evolution.


Figure 3. Rigid invariant object extraction using a static prior in the space of implicit representations. Rows correspond to images with the object in different pose or visual properties. Columns correspond to evolution of the contour over time.

4.1. Constraining the level set evolution

Let p(φ|φ_m) be the prior distribution of the level set φ given the model φ_m. Such a distribution is unknown, varies across different objects, and cannot be recovered in the general case. However, Monte-Carlo sampling or other techniques can be used to recover such a distribution when sufficient empirical evidence is available. Let us consider a Bayesian formulation for this density:

p(\phi|\phi_m) = \frac{p(\phi_m|\phi)\, p(\phi)}{p(\phi_m)}.

The constant term p(φ_m) can be omitted, and p(φ) can be assumed to be constant since, other than the one induced by the shape model, we do not want to introduce any prior on φ. Then, if we assume pixels to be independent⁵, recovering the optimal interface is equivalent to finding the maximum of the posterior p(φ_m|φ), i.e. the extremum of

p(\phi_m|\phi) = \prod_{s \in \Omega} p(\phi_m(s)|\phi(s)),

where s is an image location and p(φ_m(s)|φ(s)) is the probabilistic prior at this location. The pixel-defined prior distributions p_s(·) are known from the modeling phase, and solving the inference problem is equivalent to finding the lowest potential of the −log function, or:

E(\phi) = -\log \prod_{s \in \Omega} p(\phi_m(s)|\phi(s)) = -\iint_\Omega \log(p_s(\phi(s)))\, d\Omega.

Using the known Gaussian properties of the pixel-defined prior distributions, one can recover the following analytical expression for the objective function:

E(\phi) = \iint_\Omega \left( \log(\sigma_m) + \frac{(\phi - \phi_m)^2}{2\sigma_m^2} \right) d\Omega,

where the constant terms have been omitted. Such an objective function consists of two terms. The first penalizes voxels with low confidence. The second evolves the interface so that it becomes like the prior. Such a term has a conceptual interpretation similar to that of a static prior, while being able to account for model confidence: the projection error (φ − φ_m)² is weighted according to the model confidence σ_m. The resulting criterion is defined on the entire image plane. The definition of the prior is consistent mainly around the object region, so we constrain the objective function to the structure of interest:

E(\phi) = \iint_\Omega \delta_\alpha(\phi) \left( \log(\sigma_m) + \frac{(\phi - \phi_m)^2}{2\sigma_m^2} \right) d\Omega

The calculus of variations with respect to φ leads to the following gradient descent:

\frac{d\phi}{d\tau} = -\underbrace{\frac{\partial \delta_\alpha(\phi)}{\partial \phi} \left( \log(\sigma_m) + \frac{(\phi - \phi_m)^2}{2\sigma_m^2} \right)}_{area\ force} \;-\; \underbrace{\delta_\alpha(\phi)\, \frac{\phi - \phi_m}{\sigma_m^2}}_{shape\ consistency\ force}.

This flow consists of two terms: (i) a shape consistency force that updates the interface toward a better local consistency with the prior, and (ii) a force that

⁵ Independence between pixels is clearly a strong approximation; however, spatial correlation is already partially incorporated in the level set representation by construction.

aims to update the level set values such that the region on which the objective function is evaluated (from −ε to ε) becomes smaller in the image plane. In order to better understand the influence of this force, one can consider a negative φ value within the range (−ε, ε):

φ < 0 ⇒ −(∂/∂φ) δε(φ) = (π/(2ε²)) sin(πφ/ε) < 0,

so that (for a positive bracketed term) the update drives φ further away from zero and E(φ^τ) > E(φ^{τ+1}). Therefore, such a force does not change the position of the interface, since the sign of the implicit representation at each pixel is preserved. It affects only the form of the implicit function, such that the area on which the objective function is evaluated decreases. One can ignore such a force, since it does not have a meaningful interpretation in the process of imposing the prior knowledge.

4.2. Constraining image grouping with pose invariance

To integrate pose invariance in the context of visual grouping, we augment the formulation of the previous section with a similarity transformation A composed of a translation T, a rotation RΘ and a scale operator S. As mentioned previously, we differ from related papers by introducing this transformation not between φ and the shape model (φm, σm) but between φ and the image I. The problem is thus transformed to a joint estimation of the most probable level set φ and the transformation A. Following the Bayesian formulation proposed in the previous section, we aim at maximizing the joint posterior

p(φ, A|φm, σm, I) = [p(φm, σm, I|φ, A) / p(φm, σm, I)] p(φ, A).

Assuming uniform distributions of the transformation parameters and independence between these parameters and the shape model, we end up with:

p(φ, A|φm, σm, I) ∝ p(φm, σm|φ) p(φ) p(I|A, φ).

The first term is the shape prior described in the previous section. The second one can be expressed as a regularity prior on φ, and the last one is the image term, for which several possibilities have been presented in Section 2. Applying the negative logarithm to the posterior probability, we end up with the following energy:

E(φ, A) = −log p(φm, σm|φ) − log p(φ) − log p(I|A, φ)
        = Eshape(φ) + µ Esmoothness(φ) + λ Eregional(φ, A),


Figure 4. Implicit representations, prior knowledge and object extraction under occlusions. The evolution of the contour is presented in a raster scan format (the iteration number is shown in the upper left corner). The first row shows the segmentation of the original image, from which the prior was extracted, when no shape prior is introduced. The other rows successively show the shape-constrained segmentation for the unmodified original image, for changes of pose of the object with noise added to the image, and for changes of pose with missing parts.

where the weights µ and λ come from the definition of the two corresponding probabilities. To take into account that distance transforms are not invariant to scale variations, region terms such as the ones presented in Section 2 have to be modified. It has been shown in (Paragios et al., 2003) that the application of a scale operator to a contour scales the embedding function accordingly. Therefore, the regional term is adapted as follows:

Eregional(φ, A) = ∫∫_Ω [ Hα(φ(A)/S) rO(I(x)) + (1 − Hα(φ(A)/S)) rB(I(x)) ] dΩ,

where φ(A) is defined such that φ(A)(x) = φ(A(x)) for all x in the image.
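A sketch of this scale-corrected regional term in NumPy, using a regularized Heaviside and nearest-neighbour sampling of φ(A); the grid discretization, the interior-is-positive convention for φ, and all function names are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def heaviside_eps(z, eps=1.5):
    """Regularized Heaviside matched to the cosine-based Dirac of width eps."""
    h = 0.5 * (1.0 + z / eps + np.sin(np.pi * z / eps) / np.pi)
    return np.where(z > eps, 1.0, np.where(z < -eps, 0.0, h))

def regional_energy(phi, r_obj, r_bg, T=(0.0, 0.0), theta=0.0, S=1.0):
    """Evaluate the regional term with the similarity transform applied
    between phi and the image: phi is sampled at A(x) = S R_theta x + T
    and divided by S so the embedding remains (approximately) a distance
    function. r_obj / r_bg are the per-pixel object / background costs."""
    H, W = phi.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    c, s = np.cos(theta), np.sin(theta)
    ya = S * (c * ys - s * xs) + T[0]
    xa = S * (s * ys + c * xs) + T[1]
    # nearest-neighbour sampling of phi(A) (a crude stand-in for interpolation)
    yi = np.clip(np.rint(ya).astype(int), 0, H - 1)
    xi = np.clip(np.rint(xa).astype(int), 0, W - 1)
    h = heaviside_eps(phi[yi, xi] / S)
    return float(np.sum(h * r_obj + (1.0 - h) * r_bg))
```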

The new cost function has to be minimized with respect to the level set φ and the transformation A. Assuming the global descriptors rO and rB to be independent of the location of φ, the calculus of variations leads to the following gradient descent:

d/dτ φ = −(d/dφ) Eshape(φ) − µ (d/dφ) Esmoothness(φ) − λ δα(φ(A)/S) (rO − rB).

The detailed expression of the first term can be found in the previous section, and that of the second term in Section 2. Regarding the region term, different descriptors are available. In our experiments, we chose a Parzen window estimate of the region intensity distribution, which is updated after each iteration as in (Rousson and Cremers, 2005). Jointly with this evolution of φ, we need to estimate the pose parameters. This is done with the following coupled gradient descents:

d/dτ T = ∫∫_Ω δα(φ(A)/S) (∇φ/S) (rO − rB) dΩ,

d/dτ S = ∫∫_Ω δα(φ(A)/S) (∇φ/S · dA/dS) (rO − rB) dΩ,

d/dτ Θ = ∫∫_Ω δα(φ(A)/S) (∇φ/S · dA/dΘ) (rO − rB) dΩ.

The gradient descents of this system are guided only by the data term and not by the shape prior. This is an important difference compared to the gradient descents obtained in (Rousson and Paragios, 2002), where the transformation is optimized to register the prior shape with the evolving level set. Introducing the transformation between φ and the image has two main advantages: (i) the evolving level set remains in the same pose as the prior, and (ii) the image term is used to estimate the pose of the object in the image, resulting in a much faster and more robust estimation of the transformation parameters. Interestingly, Cremers et al. also tried to resolve this issue of pose estimation in (Cremers et al., 2006) by proposing an intrinsic alignment. This approach is quite elegant, since translation and scaling no longer need to be estimated directly. However, it has a main drawback compared to our formulation: the gradient descent with respect to φ becomes more complex, which increases computational time.

4.3. Implementation

General implementation details of a level set evolution have been extensively studied in the past (Sethian, 1996). However, our formulation has some particularities, since we also need to estimate pose parameters. The first important point is that in each gradient descent the Dirac function appears as a factor. This is very interesting, since it means that all computations can be restricted to a narrow band around the zero crossing of φ. Implementing these gradient descents also requires a choice of time step. For the curve evolution, it can be adaptively estimated such that the largest displacement of the curve is equal to a constant (say, 1). For the transformation parameters, this choice was more empirical: 10⁻⁴ for the translations and 10⁻⁷ for the rotation and the scaling. Once these implementation details are settled, two free parameters still need to be set: λ and µ. We set µ to 1 and λ to 0.3 in all the experiments shown in the next section. This choice gives a very smooth contour and imposes the shape model strongly. Of course, reducing the shape weight λ will progressively change the solution toward a purely image-based segmentation. Another interesting question is: how long does it take? Very few seconds indeed! As mentioned above, all computations can be done inside a very narrow band (2 pixels wide in our implementation), thus the complexity is proportional to the length of the curve. On a 3GHz computer, it takes less than 2 seconds to do 100 iterations on a 320x240 image.
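These implementation notes (narrow band, adaptive time step for the curve, fixed 10⁻⁴ step for the translation) can be sketched as one iteration in NumPy; the translation-only pose update and all function names are illustrative simplifications, not the authors' code (the rotation and scale updates follow the same pattern with dA/dΘ and dA/dS):

```python
import numpy as np

EPS = 2.0  # narrow band half-width in pixels, cf. the 2-pixel band above

def dirac(phi, eps=EPS):
    """Regularized Dirac: (1/2eps)(1 + cos(pi*phi/eps)) on |phi| < eps."""
    d = np.zeros_like(phi)
    band = np.abs(phi) < eps
    d[band] = (1.0 + np.cos(np.pi * phi[band] / eps)) / (2.0 * eps)
    return d

def narrow_band_step(phi, phi_m, sigma_m, r_obj, r_bg, T, lam=0.3):
    """One iteration: shape-consistency and data forces on phi, plus the
    data-driven translation update."""
    d = dirac(phi)
    # shape consistency force (the area force is ignored, as argued above)
    f_shape = -d * (phi - phi_m) / sigma_m**2
    # regional data force
    f_data = -lam * d * (r_obj - r_bg)
    # adaptive time step: largest level set displacement equals 1
    upd = f_shape + f_data
    m = np.max(np.abs(upd))
    dt = 1.0 / m if m > 0 else 0.0
    phi = phi + dt * upd
    # translation update, driven by the data term only (fixed 1e-4 step)
    gy, gx = np.gradient(phi)
    T = (T[0] + 1e-4 * np.sum(d * (r_obj - r_bg) * gy),
         T[1] + 1e-4 * np.sum(d * (r_obj - r_bg) * gx))
    return phi, T
```

Restricting every force by the regularized Dirac is what confines the computation to the narrow band, so the per-iteration cost grows with the curve length rather than the image area.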

5. Experiments and comparisons

5.1. Static shape prior

First, we show several experimental results of this procedure for the propagation of an interface toward the recovery of an object of known shape. In all experiments, the regional term is defined by approximating the intensity distribution of each region (inside and outside the contour) with a Gaussian distribution. Following (Rousson and Deriche, 2002), the means and variances of these distributions are estimated dynamically. In Figure (3), a cup is recovered from different images to which rigid deformations and intensity perturbations were applied. Initializing with two circles, the curve evolves toward the cup shape while the pose of the object in the current image is estimated. As illustrated in the last two rows, our approach is robust to missing parts and occlusion. Figure (4) shows a similar experiment on a real image composed of a hand in front of a cluttered background. Despite the very low discriminative power of the Gaussian model used to approximate the intensity distribution of each region (see the segmentation without shape prior in the first row), our algorithm is able to recover the hand, even with additional noise and occlusion. A very common issue with gradient descent methods such as the one used here is the possibility of getting trapped in local minima. To avoid this, we can rely on the direct geometric interpretation of curve evolutions to introduce simple heuristics, like initializing with a shape with the same topology as the prior shape, setting the initial pose by choosing the one with minimal cost from a few possibilities (coarse quantization of the search space), etc. Figure (5) shows


Figure 5. Probabilistic implicit representations, prior knowledge and object extraction. The evolution of the contour is presented in a raster scan format (the iteration number is shown in the upper left corner). First row: original image from which the prior was extracted. Second row: missing components. Third row: random noise applied to the original image.


Figure 6. Probabilistic versus static prior model. (1) Segmentation using the exact static prior, (2) segmentation using the wrong static prior, (3) segmentation using the mean shape, (4) segmentation using the probabilistic prior model.

that our algorithm is very robust: as long as the initial contour has even a small overlap with the object, it converges toward the expected solution.
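The coarse-initialization heuristic mentioned above can be sketched as an exhaustive search over a quantized pose space; the `cost` callable stands in for the regional energy, and its interface is hypothetical:

```python
from itertools import product

def coarse_pose_init(cost, T_range, theta_range, S_range):
    """Return the (T, theta, S) with minimal cost over a coarse quantization
    of the search space, used to initialize the gradient descents."""
    best = None
    for T, theta, S in product(T_range, theta_range, S_range):
        c = cost(T, theta, S)
        if best is None or c < best[0]:
            best = (c, T, theta, S)
    return best[1:]
```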


Figure 7. Probabilistic versus PCA model. The PCA model imposes a global constraint while the probabilistic one uses the shape prior only in parts with high confidence in the model.

5.2. Probabilistic shape prior

The proposed framework can translate a soft constraint (representation of the prior using a probabilistic level set) into a hard constraint for knowledge-based segmentation. The prior term also involves a confidence map σm. Up to now, only the representative model φm was considered to impose the constraint. Introducing the confidence map is a step forward: one should expect areas with a strong prior to be recovered in a quite accurate way, while in areas with low confidence, image information should play a more dominant role than the prior shape model. Experimental results demonstrate the performance of this framework in Figures (5, 6). Comparisons with the static prior term previously introduced are shown in Figure (6). Both terms refer to an additional component for imposing prior knowledge on the segmentation and do not account for the visual properties of the object. The behavior of the algorithm is easy to understand: the level set follows the data in regions of low confidence in the prior (high σm), while it discards the data and follows the mean shape φm where the confidence in the prior is high (low σm).

5.3. Comparison to prior works

Several papers proposed to model global deformations of the shape through a Principal Component Analysis of the training shapes (Leventon et al., 2000b; Tsai et al., 2003; Rousson et al., 2004; Rousson and Cremers, 2005). These approaches are very efficient in extracting the main variations of a family of shapes. However, they rely on the assumption that the entire shape follows the prior

model. If, for example, an object has a similar shape from one image to another only in a particular region, these models are not adapted, because they try to "guess" the shape even in parts where the prior is not reliable. Our approach overcomes this problem by estimating a voxel-wise/local confidence map of the shape prior. Hence, during the segmentation, the shape prior is imposed only in regions with high confidence. In Figure (7), we illustrate this behavior on a horse image, as well as on an explicit synthetic example. In the horse image, the PCA model converges very fast but remains very close to the mean shape. Because of the noise, global image forces cannot attract the model to the expected solution. In the synthetic case, two contours are used to learn the model: a simple ellipse and an ellipse overlapped with a rectangle on its right side. From this learning set, we can assume that the left part should look like an ellipse, while no reliable information is available for the right part. The results obtained on a third image show that the PCA-based model gives a segmentation that does not follow the data, while the probabilistic shape model imposes the prior in the left part and follows the data in the right part.

6. Conclusion

We have proposed a voxel-wise probabilistic shape model for the recovery of a given object in a new image. By modeling local variances of the shape, we are able to impose the prior model with a different intensity at each voxel according to the confidence of the model. We have also introduced an efficient and robust way to estimate pose parameters. The proposed framework has been validated on synthetic and real images with various perturbations: changes of pose, missing parts, additions of noise, and occlusions. An explicit comparison with PCA shape modeling has shown the different interpretations given by the two approaches.
One or the other may be more or less adapted to a given problem, depending on whether the shape can be completely modeled or not. To conclude, we think that the analysis and the integration of local confidence is very important in practical applications, and we believe that its introduction in global shape models such as PCA is an interesting direction for future work.

References

Bascle, B.: 1994, 'Contributions and Applications of Deformable Models in Computer Vision'. Ph.D. thesis, University of Nice/Sophia Antipolis, France.
Bertalmío, M., G. Sapiro, and G. Randall: 2000, 'Morphing Active Contours'. IEEE Trans. Pattern Anal. Mach. Intell. 22, 733.
Birchfield, S.: 1998, 'Head Tracking Using Intensity Gradients and Color Histograms'. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 232–237.
Boykov, Y., O. Veksler, and R. Zabih: 2001, 'Fast Approximate Energy Minimization Via Graph Cuts'. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 1222–1239.

Bresson, X., P. Vandergheynst, and P. Thiran: 2003, 'A Priori Information in Image Segmentation: Energy Functional based on Shape Statistical Model and Image Information'. In: IEEE International Conference on Image Processing, Vol. 3. pp. 428–428.
Caselles, V., F. Catte, B. Coll, and F. Dibos: 1993, 'A Geometric Model for Edge Detection'. Numerische Mathematik 66, 1–31.
Caselles, V., R. Kimmel, and G. Sapiro: 1997, 'Geodesic Active Contours'. International Journal of Computer Vision 22, 61–79.
Charpiat, G., O. Faugeras, and R. Keriven: 2003, 'Approximation of Shape Metrics and Applications to Shape Warping and Empirical Shape Statistics'. Rapport de Recherche 4820.
Chen, Y., H. Thiruvenkadam, H. Tagare, F. Huang, and D. Wilson: 2001, 'On the Incorporation of Shape Priors into Geometric Active Contours'. In: IEEE Workshop on Variational and Level Set Methods in Computer Vision. Vancouver, Canada, pp. 145–152.
Chen, Y., S. Thiruvenkadam, F. Huang, K. Gopinath, and R. Brigg: 2002, 'Simultaneous Segmentation and Registration for Functional MR Images'. In: ICPR (1). p. 747.
Cheng, Y.: 1995, 'Mean Shift, Mode Seeking, and Clustering'. IEEE Trans. Pattern Anal. Mach. Intell. 17, 790.
Cohen, L.: 1991, 'On Active Contours and Balloons'. CVGIP: Image Understanding 53, 211–218.
Cootes, T., C. Beeston, G. Edwards, and C. Taylor: 1999, 'A Unified Framework for Atlas Matching Using Active Appearance Models'. In: IPMI. p. 322.
Cootes, T., C. Taylor, D. Cooper, and J. Graham: 1995, 'Active Shape Models - Their Training and Applications'. Computer Vision and Image Understanding 61, 38–59.
Cremers, D., S. J. Osher, and S. Soatto: 2006, 'Kernel Density Estimation and Intrinsic Alignment for Shape Priors in Level Set Segmentation'. IJCV, to appear.
Cremers, D., C. Schnörr, and J. Weickert: 2001, 'Diffusion-Snakes: Combining Statistical Shape Knowledge and Image Information in a Variational Framework'.
In: IEEE Workshop on Variational and Level Set Methods. pp. 137–144.
Cremers, D., N. Sochen, and C. Schnörr: 2003, 'Towards Recognition-Based Variational Segmentation Using Shape Priors and Dynamic Labeling'. In: Scale-Space. p. 388.
Dervieux, A. and F. Thomasset: 1979, 'A Finite Element Method for the Simulation of Rayleigh-Taylor Instability'. Springer Lect. Notes in Math. 771, 145–158.
Dervieux, A. and F. Thomasset: 1980, 'Multifluid Incompressible Flows by a Finite Element Method'. In: W. Reynolds and R. MacCormack (eds.): Seventh International Conference on Numerical Methods in Fluid Dynamics, Vol. 141 of Lecture Notes in Physics. pp. 158–163.
Dufour, R., E. Miller, and N. Galatsanos: 2002, 'Template Matching Based Object Recognition with Unknown Geometric Parameters'. IEEE Transactions on Image Processing 11(12), 1385–1396.
Faugeras, O. and R. Keriven: 1998, 'Complete Dense Stereovision Using Level Set Methods'. In: ECCV (1). p. 379.
Fischler, M. A. and R. Elschlager: 1973, 'The Representation and Matching of Pictorial Structures'. IEEE Transactions on Computers C-22, 67–92.
Geman, S. and D. Geman: 1984, 'Stochastic Relaxation, Gibbs Distribution, and the Bayesian Restoration of Images'. PAMI 6, 721–741.
Huang, X., N. Paragios, and D. Metaxas: 2006, 'Shape Registration in Implicit Spaces Using Information Theory and Free Form Deformations'. IEEE Trans. Pattern Anal. Mach. Intell. 28(8), 1303–1318.
Jehan-Besson, S., M. Gastaud, M. Barlaud, and G. Aubert: 2003, 'Region-Based Active Contours Using Geometrical and Statistical Features for Image Segmentation'. IEEE International Conference on Image Processing 2, 643–646.

Kichenassamy, S., A. Kumar, P. Olver, A. Tannenbaum, and A. Yezzi: 1995, 'Gradient Flows and Geometric Active Contour Models'. In: IEEE Intl. Conf. on Comp. Vis. pp. 810–815.
Kimia, B., A. Tannenbaum, and S. Zucker: 1995, 'Shapes, Shocks, and Deformations I: The Components of 2-Dimensional Shape and the Reaction-Diffusion Space'. IJCV 15, 189–224.
Kimmel, R. and A. Bruckstein: 1995, 'Tracking Level Sets by Level Sets: A Method for Solving the Shape from Shading Problem'. Computer Vision and Image Understanding 62, 47–58.
Leventon, M., E. Grimson, and O. Faugeras: 2000a, 'Level Set Based Segmentation with Intensity and Curvature Priors'. In: IEEE Mathematical Methods in Biomedical Image Analysis. pp. 4–11.
Leventon, M., W. Grimson, and O. Faugeras: 2000b, 'Statistical Shape Influence in Geodesic Active Contours'. In: CVPR. p. 1316.
Lipson, P., A. Yuille, D. Keeffe, J. Cavanaugh, J. Taaffe, and D. Rosenthal: 1990, 'Deformable Templates for Feature Extraction from Medical Images'. In: ECCV. p. 413.
Malladi, R. and J. Sethian: 1998, 'A Real-Time Algorithm for Medical Shape Recovery'. In: IEEE International Conference on Computer Vision. Bombay, India, pp. 304–310.
Malladi, R., J. Sethian, and B. Vemuri: 1994, 'Evolutionary Fronts for Topology-Independent Shape Modeling and Recovery'. In: ECCV (1). p. 3.
Metaxas, D.: 1997, 'Physics-Based Deformable Models: Applications to Computer Vision, Graphics and Medical Imaging'. In: Graphics and Medical Imaging.
Kass, M., A. Witkin, and D. Terzopoulos: 1988, 'Snakes: Active Contour Models'. International Journal of Computer Vision 1, 321–331.
Mumford, D. and J. Shah: 1985, 'Boundary Detection by Minimizing Functionals'. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 22–26.
Osher, S. and N. Paragios: 2003, Geometric Level Set Methods in Imaging, Vision and Graphics. Springer Verlag.
Osher, S. J. and J. A.
Sethian: 1988, 'Fronts Propagating with Curvature-Dependent Speed: Algorithms Based on Hamilton–Jacobi Formulations'. J. of Comp. Phys. 79, 12–49.
Paragios, N., Y. Chen, and O. Faugeras: 2005, Handbook of Mathematical Models in Computer Vision. Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Paragios, N. and R. Deriche: 2002, 'Geodesic Active Regions and Level Set Methods for Supervised Texture Segmentation'. The International Journal of Computer Vision 46(3), 223–247.
Paragios, N., M. Rousson, and V. Ramesh: 2002, 'Matching Distance Functions: A Shape-to-Area Variational Approach for Global-to-Local Registration'. In: ECCV (2). p. 775.
Paragios, N., M. Rousson, and V. Ramesh: 2003, 'Non-Rigid Registration Using Distance Functions'. Computer Vision and Image Understanding 23, 142–165.
Rousson, M. and D. Cremers: 2005, 'Efficient Kernel Density Estimation of Shape and Intensity Priors for Level Set Segmentation'. In: Intl. Conf. on Medical Image Computing and Comp. Ass. Intervention (MICCAI), Vol. 2. pp. 757–764.
Rousson, M. and R. Deriche: 2002, 'A Variational Framework for Active and Adaptive Segmentation of Vector Valued Images'. In: IEEE Workshop on Motion and Video Computing. Orlando, Florida, pp. 56–61.
Rousson, M. and N. Paragios: 2002, 'Shape Priors for Level Set Representations'. In: A. Heyden, G. Sparr, M. Nielsen, and P. Johansen (eds.): Proceedings of the 7th European Conference on Computer Vision, Vol. 2. Copenhagen, Denmark, pp. 78–92.
Rousson, M., N. Paragios, and R. Deriche: 2004, 'Implicit Active Shape Models for 3D Segmentation in MR Imaging'. In: MICCAI (1). p. 209.
Samson, C., L. Blanc-Feraud, G. Aubert, and J. Zerubia: 1999, 'A Level Set Method for Image Classification'. In: Int. Conf. Scale-Space Theories in Computer Vision. pp. 306–317.

Sapiro, G.: 2001, Geometric Partial Differential Equations and Image Analysis. Cambridge University Press.
Sethian, J.: 1996, Level Set Methods. Cambridge University Press.
Staib, L. and J. Duncan: 1992, 'Boundary Finding with Parametrically Deformable Models'. IEEE Trans. Pattern Anal. Mach. Intell. 14, 1061.
Sussman, M., P. Smereka, and S. Osher: 1994, 'A Level Set Method for Computing Solutions of Incompressible Two-Phase Flows'. Journal of Computational Physics 114, 146–159.
Taron, M., N. Paragios, and M. Jolly: 2005, 'Modelling Shapes with Uncertainties: Higher Order Polynomials, Variable Bandwidth Kernels and Non Parametric Density Estimation'. In: ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision. Washington, DC, USA, pp. 1659–1666.
Tek, H. and B. Kimia: 1995, 'Image Segmentation by Reaction-Diffusion Bubbles'. In: ICCV. p. 156.
Tikhonov, A.: 1992, Ill-Posed Problems in Natural Sciences. Coronet.
Tsai, A., A. Yezzi, W. Wells III, C. Tempany, D. Tucker, A. Fan, W. Grimson, and A. Willsky: 2001, 'Model-Based Curve Evolution Technique for Image Segmentation'. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 463–468.
Tsai, A., A. Yezzi, W. Wells III, C. Tempany, D. Tucker, A. Fan, W. Grimson, and A. Willsky: 2003, 'A Shape-Based Approach to the Segmentation of Medical Imagery Using Level Sets'. IEEE Trans. Med. Imaging 22, 137.
Veltkamp, R. and M. Hagedoorn: 1999, 'State-of-the-Art in Shape Matching'. Technical report, Utrecht University, Department of Computer Science.
Vese, L. and T. Chan: 2002, 'A Multiphase Level Set Framework for Image Segmentation Using the Mumford and Shah Model'. International Journal of Computer Vision 50, 271.
Wang, Y. and L. Staib: 1998, 'Boundary Finding with Correspondence Using Statistical Shape Models'. In: CVPR. p. 338.
Yezzi, A. and S. Soatto: 2003, 'Stereoscopic Segmentation'. International Journal of Computer Vision 53, 31.
Yezzi, A., L. Zollei, and T.
Kapur: 2001, 'A Variational Framework for Joint Segmentation and Registration'. In: IEEE Mathematical Methods in Biomedical Image Analysis. pp. 44–51.
Yuille, A.: 1991, 'Deformable Templates for Face Recognition'. Journal of Cognitive Neuroscience 3, 59–70.
Yuille, A., P. Hallinan, and D. Cohen: 1992, 'Feature Extraction from Faces Using Deformable Templates'. IJCV 8, 99–111.
Zhao, H.-K., T. F. Chan, B. Merriman, and S. Osher: 1996, 'A Variational Level Set Approach to Multiphase Motion'. Journal of Computational Physics 127, 179.
Zhu, S. and A. Yuille: 1996, 'Region Competition: Unifying Snakes, Region Growing, and Bayes/MDL for Multiband Image Segmentation'. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 884–900.