Flexible Object Recognition in Cluttered Scenes Using Relative Point Distribution Models

Alexandros Bouganis and Murray Shanahan
Department of Computing, Imperial College London
180 Queen's Gate, South Kensington, London SW7 2AZ, UK
{alexandros.bouganis, m.shanahan}@imperial.ac.uk

Abstract

This paper introduces an edge-based object recognition method that is robust with respect to clutter, occlusion and object deformations. The method combines local features and their spatial relationships to identify the point correspondences between the object-of-interest and the input scene. Local features encode information from their neighbourhood, which renders them insensitive to noise at a distance. However, they have moderate discriminating power, and the proposed method exploits their spatial structure to compensate for this. Our flexible localisation technique, which is based on Point Distribution Models, also makes the method applicable to deformable objects. The point matching task is formulated as an optimisation problem that is solved using the Viterbi algorithm. The method has been validated on challenging real scenes.
1. Introduction

It has been acknowledged that an object's shape is typically the most discriminative cue for its recognition by humans. Motivated by this, many researchers have developed computer vision methods that aim to localise a target object in an input image using only contour-based information. Umeyama [11] uses interpretation trees to specify the correspondences between points that represent the object-of-interest and the input scene, but this approach is computationally expensive and neglects all information other than point locations. Belongie et al. [2] introduced a global shape descriptor, the shape context, and specify the correspondences between the two point sets by minimising the sum of feature dissimilarities. Thayananthan et al. [10] have shown that this method cannot handle clutter, and impose a
"figural continuity" constraint to improve its performance. However, their method remains sensitive to heavy clutter and to large deformations of the target object. Moreover, their "figural continuity" constraint might have limited impact when occlusion occurs. Chamfer matching methods (e.g., [10]) typically require a good pose initialisation near the target object's actual location in the input image and have to search through a large number of model templates. Felzenszwalb [6] proposes a method for detecting deformable shapes by representing them with triangulated polygons, but his approach is computationally expensive. Athitsos et al. [1] developed a flexible method that tolerates structural shape changes, such as the occlusion of some fingers when localising hand instances in the input image. However, their approach is driven solely by bottom-up mechanisms, which might deform the target object into non-meaningful shape instances. Ferrari et al. [7] initialise the pose of the target object in the input image using a Hough-style voting scheme and then perform non-rigid shape matching using a variation of [3]. Their method, mainly due to the initialisation stage, might run into difficulties when the target object consists of small subparts whose configuration can vary significantly between instances of the object-of-interest.

This paper presents an edge-based object recognition method that is robust with respect to occlusion, clutter and object deformations, without requiring any a priori initialisation near the actual pose of the target object. This robustness is accomplished using: (i) local features, which, in contrast to global ones (e.g., the shape context), are insensitive to noise at a distance; (ii) a flexible localisation technique, a top-down mechanism that imposes spatial constraints on feature correspondences based on our novel Relative Point Distribution Models – this compensates for the moderate discriminating power of local descriptors, such as ours, and handles deformable objects; (iii) an efficient optimisation stage that employs the Viterbi algorithm for minimising a cost function.
2. The Proposed Method

In our approach, we follow the argument made by Cootes et al. [4] that a model should be able to deform only in ways characteristic of the object it represents. Thus, we provide the system with a training set of representative shape variations of the target object and let it extract an informative model representation using statistical shape analysis (Section 2.1). Our input features are extracted from the input image by approximating the detected Canny edges with polylines [5]. The paper first presents the off-line training phase and then describes our shape matching method (Section 2.2).
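For concreteness, the following is a minimal sketch of this feature-extraction step using OpenCV; the Canny thresholds and the polyline tolerance are illustrative assumptions, not values reported in the paper.

```python
import cv2
import numpy as np

def extract_polyline_features(image_path, canny_lo=80, canny_hi=160, eps=2.0):
    """Approximate detected Canny edges with polylines (Douglas-Peucker [5]).

    The thresholds and the approximation tolerance `eps` are illustrative
    assumptions, not values from the paper.
    """
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, canny_lo, canny_hi)
    # Trace the edge map into pixel chains, then simplify each chain.
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    polylines = [cv2.approxPolyDP(c, eps, closed=False) for c in contours]
    # The polyline vertices serve as the input point set B.
    if polylines:
        points = np.vstack([p.reshape(-1, 2) for p in polylines])
    else:
        points = np.empty((0, 2))
    return polylines, points
```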
Figure 1. (i)-(iv) Training set for a reading lamp; Global (v) and local (vii) alignment; Global (vi) and relative (viii) mean shape and displacement vectors.

2.1 Training Phase
Given a training set of representative shape deformations of the target object, with the corresponding pairs of points labelled, our system learns two types of point distribution: the Point Distribution Models (PDM) [4] and our Relative Point Distribution Models (RPDM). The former represents the distribution of the model points in the training set when the shapes are globally aligned; the latter, when the exemplar shapes are aligned locally around a reference point. Principal Component Analysis (PCA) is applied in the same way to both types of distribution to extract the associated main modes of shape variation. Each distribution is then represented by a (relative or global) mean shape and a set of displacement vectors (Figure 1).

The Relative Point Distribution Models are introduced in this paper and play a vital role in our shape matching process by imposing spatial constraints. In particular, RPDM_r captures the relative mean position and displacement vectors of each model point with respect to the reference point indexed by r. Thus, if the position of the reference point in the input image is known, the subspace of possible positions of the remaining model points can be predicted. Note that, in contrast to the PDM, RPDM_r represents the point distribution in the training set with the displacement of the reference point filtered out. In our implementation, the set of reference points, denoted by Q, is a subset of the points that describe the target object.

Let us now see how PCA is applied to the PDM and RPDM to extract the main modes of shape variation. Let \bar{X} denote the mean shape found when the training shapes are (locally or globally) aligned based on Procrustes Analysis¹ [4]. If X^j is the concatenated vector containing the point coordinates of the j-th training shape, the covariance matrix is given by

    S = \frac{1}{N} \sum_{j=1}^{N} (X^j - \bar{X})(X^j - \bar{X})^T

where N is the total number of training shapes. Its z-th eigenvector p_z and eigenvalue λ_z can be computed from S p_z = λ_z p_z. Typically, a small number of eigenvectors suffices to "explain" most of the shape variation exhibited. Thus, any shape instance x of the target object can now be approximated according to eq. (1), with an appropriate parameter vector h:

    x ≈ \bar{X} + P · h    (1)

where the columns of P are the principal eigenvectors extracted. In practice, we constrain the z-th component of h to lie in the range [−3√λ_z, 3√λ_z]. During the training phase, the system also acquires the normal vector orientation at each model point, as it varies across the training shapes that have been aligned locally around the reference point. Using directional statistics [9], we obtain, for each reference point, the mean orientation and the associated standard deviation at each model point.

¹ When building the PDM, all the corresponding points between the training shapes are aligned. In RPDM_r, however, the alignment is only between samples which are extracted locally around the reference point from the training shapes.
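To make the training computation concrete, here is a minimal numpy sketch of building a PDM and applying eq. (1); the Procrustes alignment step is assumed to have been done beforehand, and all names are illustrative. The same routine applies to an RPDM_r once the exemplars have been aligned locally around the reference point. The last function sketches the directional statistics [9] used for the normal orientations.

```python
import numpy as np

def build_pdm(shapes):
    """Build a Point Distribution Model from aligned training shapes.

    shapes: (N, n, 2) array of N aligned exemplars with n points each
    (the Procrustes alignment step [4] is assumed done beforehand).
    Returns the mean shape, the principal modes P, and the eigenvalues.
    """
    N, n, _ = shapes.shape
    X = shapes.reshape(N, 2 * n)              # concatenated coordinate vectors X^j
    X_mean = X.mean(axis=0)
    S = (X - X_mean).T @ (X - X_mean) / N     # covariance matrix of Section 2.1
    eigvals, eigvecs = np.linalg.eigh(S)      # S p_z = lambda_z p_z
    order = np.argsort(eigvals)[::-1]         # largest modes first
    return X_mean, eigvecs[:, order], eigvals[order]

def approximate_shape(X_mean, P, eigvals, h, num_modes=2):
    """Eq. (1): x ~ X_mean + P h, each h_z clipped to [-3 sqrt(l_z), 3 sqrt(l_z)]."""
    bound = 3 * np.sqrt(eigvals[:num_modes])
    h = np.clip(h[:num_modes], -bound, bound)
    return X_mean + P[:, :num_modes] @ h

def circular_mean_std(angles):
    """Mean orientation and circular standard deviation [9] of a set of angles."""
    C, Sn = np.cos(angles).mean(), np.sin(angles).mean()
    R = np.hypot(C, Sn)
    return np.arctan2(Sn, C), np.sqrt(-2.0 * np.log(max(R, 1e-12)))
```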
2.2 Shape Matching

Let the two point sets that represent the target object and the input image be denoted by A = {a_i}_{i=1:n} and B = {b_j}_{j=1:m}, respectively. We represent the object-of-interest with samples taken from its boundary, and thus A is an ordered set of points. Our aim is to find a function f : [1, n] → [1, m] that maps the model points
to the corresponding input points. Since some of the model points may not be visible in the input image, due, for example, to occlusion, we add a "dummy" point to the input set B.

We formulate the point matching task as an optimisation problem, in terms of minimising the matching cost between the two point sets. Our aim is not just to find correspondences that minimise the total sum of feature dissimilarities, but also to preserve an allowable deformation of the model shape. Thus, we devise the following objective function and search for the point correspondences Y = {p_i}_{i=1:n} that minimise it:

    D(Y) = d_0 + \sum_{i=2}^{n} TC(p_{i-1}, p_i) \cdot (SC(p_i) + LC(p_i))    (2)
where d_0 = SC(p_1) + LC(p_1), SC is the feature similarity cost, TC is the transition cost, and LC is the localisation cost. Below, we specify these costs for exemplar pairs p_{i-1} = (a_{u-1}, b_e) and p_i = (a_u, b_v), where a_{u-1}, a_u are two consecutive model points and b_e, b_v are two input points. Weights, denoted by a subscripted w, are chosen empirically.

Feature Similarity Cost: This cost evaluates the similarity between model and input features. We favour local descriptors as they are insensitive to clutter, occlusion, and object deformations. In this paper, we use the turning angles (TA) of the curve fragments to which the points under comparison belong. Thus,

    SC(p_i) = w_{sc} \cdot | TA(a_u) - TA(b_v) |    (3)
Transition Cost: This cost is applied consecutively between neighbouring model features. The intuition behind it is that neighbouring model features should be mapped to neighbouring input features. If L = { ||a_{u-1} - a_u||, ||b_e - b_v|| }, this cost is given by:

    TC(p_{i-1}, p_i) = w_{tc} \cdot \frac{\max(L)}{\min(L)}    (4)
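A minimal sketch of these two costs, assuming the turning angles have already been computed and with illustrative weights:

```python
import numpy as np

def feature_similarity_cost(ta_model, ta_input, w_sc=1.0):
    """Eq. (3): SC = w_sc * |TA(a_u) - TA(b_v)|, where ta_model and ta_input
    are the turning angles at the two points under comparison."""
    return w_sc * abs(ta_model - ta_input)

def transition_cost(a_prev, a_cur, b_prev, b_cur, w_tc=1.0):
    """Eq. (4): ratio of the distance between consecutive model points to the
    distance between their candidate input matches (or vice versa), so that
    the cost is smallest when the two spacings agree."""
    L = (np.linalg.norm(np.asarray(a_prev) - np.asarray(a_cur)),
         np.linalg.norm(np.asarray(b_prev) - np.asarray(b_cur)))
    return w_tc * max(L) / max(min(L), 1e-12)
```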
Since the transition cost is applied consecutively between neighbouring model features, its impact on the matching process is undermined when a gap appears between visible model features in the input image, due, for example, to occlusion. Our method succeeds in maintaining coherence in the spatial structure of the point correspondences by introducing the following localisation cost.

Localisation Cost: The method makes a hypothesis about the location of a reference model point in the input image; let a_r be the reference point currently under consideration and b_f(r) its hypothesised location in the input image. Given this hypothesis, all the remaining model points can appear only in certain regions of the input image, according to RPDM_r. If a model point is mapped to an input point that lies outside its allowable region, the localisation cost penalises the match. An important aspect of this cost is its flexibility in dealing with deformable objects: the spatial relationship between a model point and the reference point can vary in the input scene, but this variation should agree with the modes learned in the training stage and captured by RPDM_r. The localisation cost also evaluates how feasible the matching of two points is, taking into account the orientation of their normal vectors.

Let us now quantify this cost by considering the potential match between a_u and b_v. The first step is to specify the shape deformation, according to RPDM_r, that can "explain" this match. We therefore compute the parameter vector h according to eq. (1), where: x contains the relative coordinates of b_v with respect to b_f(r); \bar{X} contains the relative coordinates of the mean position of a_u with respect to a_r in RPDM_r; and P contains the displacement vectors associated with a_u, also from RPDM_r. If Δs is the residual distance between a_u and b_v after the mean shape of RPDM_r has been deformed according to the computed vector h, then

    LC(p_i) = w_{l1} \cdot \max_{j=1:M} \frac{h_j^2}{3^2 \lambda_j} + w_{l2} \cdot \Delta s + w_{l3} \cdot E    (5)

where E = \frac{\pi - |\pi - |\varphi^{B,(f(r),v)} - \bar{\varphi}^{A,(r,u)}||}{2 \sigma}, \varphi^{U,(s,z)} is the angle between the normal vectors at indices s and z in U (U ∈ {A, B}), (\bar{\varphi}^{A,(r,u)}, σ) are the mean value and the standard deviation of \varphi^{A,(r,u)} as computed in the training phase when the reference point is a_r, and M is the number of variation modes in RPDM_r.² The localisation cost becomes orientation invariant when the mean shape of RPDM_r is posed in the input image by locally aligning the curve fragments of the related input edge and mean model shape around (a_r, b_f(r)). Note that x, \bar{X} and P are expressed with respect to the local coordinate frame of the model object.

² The localisation cost considers the two most important modes of shape variation in RPDM_r. In this work, we concentrate on object deformations where the large displacements of points can be captured by at most two eigenvectors.
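The following sketch illustrates the localisation cost under the reconstruction of eq. (5) given above; the least-squares solve for h and the helper names are assumptions:

```python
import numpy as np

def localisation_cost(x_rel, X_mean_rel, P_u, eigvals,
                      phi_input, phi_mean, phi_std,
                      w_l1=1.0, w_l2=1.0, w_l3=1.0):
    """Sketch of eq. (5) for one candidate match (a_u, b_v), given a
    reference-point hypothesis (a_r, b_f(r)).

    x_rel      : position of b_v relative to b_f(r) (2-vector)
    X_mean_rel : mean relative position of a_u w.r.t. a_r in RPDM_r
    P_u        : (2, M) rows of the displacement-vector matrix for a_u
    phi_*      : normal-orientation terms learned in the training phase
    All weights and the least-squares solve are illustrative assumptions.
    """
    # Deformation parameters that best "explain" the match via eq. (1),
    # solved in the least-squares sense since M <= 2.
    h, *_ = np.linalg.lstsq(P_u, x_rel - X_mean_rel, rcond=None)
    # Residual misplacement after deforming the relative mean shape.
    delta_s = np.linalg.norm(x_rel - (X_mean_rel + P_u @ h))
    # Penalise deformations beyond the learned modes (first term of eq. (5)).
    mode_term = np.max(h**2 / (9.0 * eigvals[:len(h)]))
    # Circular distance between the normal orientations, scaled by 2*sigma.
    d_phi = np.pi - abs(np.pi - abs(phi_input - phi_mean))
    E = d_phi / (2.0 * max(phi_std, 1e-12))
    return w_l1 * mode_term + w_l2 * delta_s + w_l3 * E
```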
2.2.1 Reference Points
This section presents how the method builds a set of hypotheses about reference points and their locations in the input image. Exhaustively, every pair (a_i, b_j) could constitute a hypothesis, where a_i is a reference point (i.e., a_i ∈ Q) and b_j its location in the input image. For efficiency, however, a "strict" feature matching criterion is adopted to evaluate how likely it is that (a_i, b_j) is a true match, rejecting unpromising pairs. Each point in Q and B is assigned a semi-local shape descriptor based on the triangular areas that neighbouring
vertices form, following [8]. By comparing these descriptors, we reject pairs that exceed a dissimilarity threshold. Among the remaining pairs, the method keeps the K with the smallest displacement between their aligned local curve fragments. We have found it sufficient to build the descriptors using two triangles, with K = 70. In our implementation, the triangular areas of each feature are normalised by their mean value.

Figure 2. Two input images and the point correspondences found by our method.
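A sketch of this hypothesis-generation step; the neighbour offsets, the dissimilarity threshold and the alignment_error helper are illustrative assumptions:

```python
import numpy as np

def triangle_descriptor(pts, i, offsets=(1, 2)):
    """Semi-local descriptor for point i: areas of triangles formed with
    neighbouring vertices (two triangles, as in the paper), normalised by
    their mean [8]. `pts` is an ordered (closed) polyline; the offsets are
    assumptions."""
    areas = []
    for d in offsets:
        a, b, c = pts[i - d], pts[i], pts[(i + d) % len(pts)]
        # Twice the signed area of triangle (a, b, c), halved and made positive.
        cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        areas.append(0.5 * abs(cross))
    areas = np.asarray(areas)
    return areas / max(areas.mean(), 1e-12)

def generate_hypotheses(Q_desc, B_desc, alignment_error, thresh=0.2, K=70):
    """Keep the K most promising (reference point, input point) pairs.

    Q_desc/B_desc    : descriptors for the reference and input points
    alignment_error  : hypothetical helper measuring the displacement between
                       the locally aligned curve fragments of a pair (i, j)
    """
    candidates = [(i, j)
                  for i in range(len(Q_desc)) for j in range(len(B_desc))
                  if np.abs(Q_desc[i] - B_desc[j]).sum() < thresh]
    candidates.sort(key=lambda ij: alignment_error(*ij))
    return candidates[:K]
```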
2.2.2 Optimisation – Viterbi Algorithm
For each hypothesis (K in total), the aim of this stage is to determine the point-to-point matching that minimises the objective function given in eq. (2). Since the model points are sampled along the boundary of the model object, the inherent ordering information enables us to solve the optimisation problem efficiently using the Viterbi algorithm. We build a matrix whose columns correspond to the ordered set of model points and whose rows correspond to the (unordered) input points. Each cell of this matrix corresponds to a possible assignment of a model point to an input point. The Viterbi algorithm finds the path through the matrix whose point correspondences minimise the objective function. Note that a constant penalty is applied whenever a model point is matched to the "dummy" point.
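The following is a minimal sketch of this dynamic-programming stage; the cost callables and the handling of transitions that touch the dummy state are illustrative assumptions:

```python
import numpy as np

def viterbi_match(n, m, unary, transition, dummy_penalty=1.0):
    """Minimise eq. (2) over mappings of the n ordered model points to the
    m input points plus one "dummy" state (index m) for occluded points.

    unary(i, v)         : SC(p_i) + LC(p_i) for model point i vs input point v
    transition(i, e, v) : TC between consecutive pairs (i-1 -> e) and (i -> v)
    The constant dummy penalty and the treatment of transitions involving the
    dummy state are assumptions.
    """
    S = m + 1                                   # states: input points + dummy
    cost = np.full((n, S), np.inf)
    back = np.zeros((n, S), dtype=int)

    def step_cost(i, e, v):
        if v == m:                              # match to dummy: flat penalty
            return dummy_penalty
        tc = 1.0 if e == m else transition(i, e, v)
        return tc * unary(i, v)                 # TC * (SC + LC), as in eq. (2)

    # First column: d_0 = SC(p_1) + LC(p_1), or the dummy penalty.
    cost[0] = [dummy_penalty if v == m else unary(0, v) for v in range(S)]
    for i in range(1, n):
        for v in range(S):
            cand = [cost[i - 1, e] + step_cost(i, e, v) for e in range(S)]
            back[i, v] = int(np.argmin(cand))
            cost[i, v] = cand[back[i, v]]

    # Backtrack the minimising path (O(n * S^2) overall).
    path = [int(np.argmin(cost[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(back[i, path[-1]])
    return path[::-1], float(cost[-1].min())
```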
2.2.3 Alignment – Hypothesis Evaluation
Each hypothesis about a reference point and its location in the input image is associated with a potential solution. We evaluate the "goodness" of these solutions with the function EF = w_1/R + w_2 · Δ, where R is the percentage of recognised model points and Δ is the average residual distance between the corresponding points after the model has been affinely transformed and then deformed (based on eq. (1)) to minimise the alignment error. Among the candidate solutions, the system favours the one with the minimum EF value. Note that in the alignment the method employs the PDM, instead of the RPDM, to obtain some balance in the misplacement of the matching points.

Figure 3. Exemplar deformations for the pair of scissors and the cup.
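As an illustration, a small sketch of this scoring step, assuming the reconstructed form EF = w_1/R + w_2 · Δ, in which R enters inversely so that higher recognition rates lower the cost:

```python
def evaluate_hypothesis(recognised_fraction, mean_residual, w1=1.0, w2=1.0):
    """Score a candidate solution; lower EF is better.

    recognised_fraction : R, the fraction of model points recognised
    mean_residual       : Delta, the average residual alignment distance
    The inverse dependence on R and the weights are assumptions.
    """
    R = max(recognised_fraction, 1e-12)  # guard against an empty match
    return w1 / R + w2 * mean_residual
```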
3. Results

We have tested our method on a database of 100 real images. The target object appears in different poses (translation, rotation and small scale changes), while the images are taken under various lighting conditions and moderate viewpoint changes. The target objects are a pair of scissors, a reading lamp and a cup (Figures 1 and 3). The deformation modelled for the cup concerns its projection onto the camera as it rotates around its vertical axis. The database and the results achieved are publicly available³.

A set of 15 images tests our system solely with respect to shape deformations: the object-of-interest appears in different shape instances, without noticeable noise introduced by clutter or occlusion. Our method performed extremely well, localising the object with a 15/15 success rate. Another set of 65 images includes various deformations of the target object, while clutter and occlusion make the recognition task more challenging. Our method still performed well, with a success rate of 48/65 (74%). The last class of experiments involves 20 images in which an object of similar shape to the model object is also present. Our method managed to recognise the model object in 16 of these images (80%).

³ www.doc.ic.ac.uk/∼abougani/ObjectRecognition
Figure 4. First row: Extracted edges from input images. Second row: The deformation and pose of the target object, as found by our method (dashed yellow line). In the last column, the method successfully distinguishes the target object from an object of similar shape.
In three of the remaining images, the system failed to recognise the model object but succeeded in retrieving an object of the same shape category. Examples of the results achieved are shown in Figures 2 and 4. The execution time is typically 10–40 seconds per image, using Matlab on a Pentium IV 3 GHz processor.
4. Conclusion

An edge-based method has been presented for recognising deformable objects in cluttered scenes. We have shown that by combining local features, their spatial relationships, and knowledge of how these relationships can change, we obtain good results despite clutter, occlusion and object deformations.
Acknowledgment

This work was funded by the EPSRC project EP/C51050X/1.

References

[1] V. Athitsos, J. Wang, S. Sclaroff, and M. Betke. Detecting Instances of Shape Classes that Exhibit Variable Structure. In Proc. of the European Conference on Computer Vision, 2006.
[2] S. Belongie, J. Malik, and J. Puzicha. Shape Matching and Object Recognition Using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002.
[3] H. Chui and A. Rangarajan. A New Point Matching Algorithm for Non-Rigid Registration. Computer Vision and Image Understanding, 89(2-3):114–141, 2003.
[4] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active Shape Models - Their Training and Application. Computer Vision and Image Understanding, 61(1):38–59, 1995.
[5] D. Douglas and T. Peucker. Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or its Caricature. The Canadian Cartographer, 10:112–122, 1973.
[6] P. Felzenszwalb. Representation and Detection of Deformable Shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(2):208–220, 2005.
[7] V. Ferrari, F. Jurie, and C. Schmid. Accurate Object Detection with Deformable Shape Models Learnt from Images. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[8] H. H. S. Ip and D. Shen. An Affine-Invariant Active Contour Model (AI-Snake) for Model-Based Segmentation. Image and Vision Computing, 16:135–146, 1998.
[9] K. V. Mardia and P. E. Jupp. Directional Statistics. Wiley, 2000.
[10] A. Thayananthan, B. Stenger, P. H. S. Torr, and R. Cipolla. Shape Context and Chamfer Matching in Cluttered Scenes. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[11] S. Umeyama. Parameterized Point Pattern Matching and Its Application to Recognition of Object Families. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(2):136–144, 1993.