Object Categorization via Local Kernels
Barbara Caputo NADA/CVAP, KTH Stockholm, Sweden
[email protected]
Christian Wallraven Max-Planck-Institute for Biological Cybernetics Tübingen, Germany
[email protected]
Abstract In this paper we consider the problem of multi-object categorization. We present an algorithm that combines support vector machines with local features via a new class of Mercer kernels. This class of kernels allows us to perform scalar products on feature vectors consisting of local descriptors, computed around interest points (such as corners); these feature vectors are generally of different lengths for different images. The resulting framework is able to recognize multi-object categories in different settings, from lab-controlled to real-world scenes. We present several experiments on different databases, and we benchmark our results against state-of-the-art algorithms for categorization, achieving excellent results.
1 Introduction Over the last three decades there has been significant progress in the performance of computational object recognition systems. Today it is possible to perform object identification in different poses [4, 2, 18, 14]; significant improvements have been achieved in identifying objects in the presence of clutter, occlusion and varying lighting conditions [19, 3, 12, 14]. Moreover, approaches for category detection (such as faces, cars, pedestrians and horses) have obtained remarkable results [1, 15, 20, 8, 23]. However, limited progress has been achieved on the more general task of multi-object categorization, on which relatively few efforts have been reported in the literature [23, 24, 25, 11]. We argue that an effective algorithm for multi-object categorization must satisfy two main requirements:

Robust representation An effective representation for object categories must be able to extract the key visual information which is common to the objects. At the same time, objects belonging to the same category may present a large visual variability; a good representation should be able to capture this. Finally, objects belonging to given categories should be recognized in real-world settings. Thus, an effective representation for object categories should support all the robustness properties which are desirable for object identification (i.e. robustness to noise, occlusion, pose and scale changes, clutter and varying lighting).

Robust classification An effective classification algorithm for categorization must face all the challenges described above for representation, and tackle them within an appropriate learning rule. A further challenge is the possible lack of control over the quality of training views, which means that robust classification should be possible even with difficult training data containing noise and/or clutter.
In this paper we present an algorithm for multi-category recognition which employs local features for representation, and SVMs for classification. Local features have shown excellent results for robust object identification (see for instance [14, 2, 12] and many others) and category detection [20, 1, 23, 25]. SVMs are state-of-the-art large margin classifiers which have demonstrated remarkable performance in object recognition [15, 16, 17]. We combine these two successful approaches via the definition of a new class of local kernels for categorization (see [27] for recognition). The resulting algorithm satisfies our robustness requirements for representation and classification. We present several experiments on multi-category recognition and category detection. We investigate the performance of our algorithm (1) when training is performed on a limited amount of training data; (2) when training is done on images taken in controlled settings, and testing on images taken in real-world scenes; and (3) when training and testing are done on cluttered views. For all the experiments we benchmarked against at least one other method, obtaining excellent results. The paper is organized as follows: section 2 describes the approach, reviews local representations (section 2.1) and SVMs, and derives local kernels (section 2.2). Section 3 describes the experimental setup and the results obtained. The paper concludes with a summary discussion.
2 The Approach

2.1 Robust Representation: Local Features
Local features have been shown to be remarkably successful for object identification and category detection in real-world settings [19, 26, 10]. They consist of a number of localized features in the image. Each local feature vector has the same dimension, but the number of local features describing a given object (and thus the overall feature representation) will generally vary for different poses, backgrounds and so on. Being local, they permit us to recognize objects in cases of partial visibility, under image transformations and within complex scenes. The underlying philosophy in describing an image by local features is that, once "interesting points" in the image are detected, local descriptors are computed around these points. Such local descriptors should be discriminative in the sense that, if a point is detected again in a new image, the comparison of the descriptors computed around the points will allow them to match correctly. The recognition step is based on local matching, which makes the approach robust to occlusion and heterogeneous background [19, 26, 10]. Thus, given a set of images $\{I_j\}_{j=1}^{M}$, the most general local feature vector for the image $I_j$ can be described as $L_j = \{(x_i, y_i, \mathbf{f}_i)\}_{i=1}^{n_j}$, computed as follows:

- an interest point detector (a popular choice is the Harris corner detector [19, 10]) detects $n_j$ points; in general, the number of interest points detected for each image will differ;
- $(x_i, y_i)$ are the coordinates (in the image plane) of the $i$-th point;
- $\mathbf{f}_i$ is a feature vector computed locally around the $i$-th point (see for instance [19, 26]).

When one does not consider interest point coordinates, the local feature vector reduces to $L_j = \{\mathbf{f}_i\}_{i=1}^{n_j}$. In this paper we used jet features in order to define $\mathbf{f}_i$ [19]. They are local grey value features computed at interest points. The local characteristics are based on differential grey value invariants, which ensures invariance under the group of displacements within an image; a multi-scale approach makes the characterization robust to scale changes. We performed detection of interest points using a standard Harris-type corner detector, which was shown to have high repeatability and robust performance [13].
2.2 Robust Classification: Local Kernels
In this section we first give a brief overview of binary classification with SVMs¹. We then describe the new class of local Mercer kernels [27].

Consider the problem of separating the set of training data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$ into two classes, where $\mathbf{x}_i \in \mathbb{R}^N$ is a feature vector and $y_i \in \{-1, +1\}$ its class label. If we assume that the two classes can be separated by a hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$, and that we have no prior knowledge about the data distribution, then the optimal hyperplane (i.e. the one with the lowest bound on the expected generalization error) is the one which maximizes the margin [6, 22]. The optimal values for $\mathbf{w}$ and $b$ can be found by solving a constrained minimization problem, using Lagrange multipliers $\alpha_i$:

minimize $W(\alpha) = -\sum_{i=1}^{m} \alpha_i + \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$, subject to $\alpha_i \geq 0$, $\sum_{i=1}^{m} \alpha_i y_i = 0$, $i = 1, \ldots, m$. (1)

This results in a classification function

$f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + b \right)$, (2)

where $\alpha_i$ and $b$ are found by using an SVC learning algorithm [6, 22]. Those $\mathbf{x}_i$ with nonzero $\alpha_i$ are the "support vectors". In cases where the two classes are non-separable, the solution is identical to the separable case except for a modification of the Lagrange multipliers into $0 \leq \alpha_i \leq C$, $i = 1, \ldots, m$, where $C$ is the penalty for misclassification.

To obtain a nonlinear classifier, one maps the data from the input space $\mathbb{R}^N$ to a high dimensional feature space $\mathcal{H}$ by $\Phi : \mathbb{R}^N \to \mathcal{H}$, such that the mapped data points of the two classes are linearly separable in the feature space. Assuming there exists a kernel function $K$ such that $K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$, a nonlinear SVM can be constructed by replacing the inner product $\mathbf{x}_i \cdot \mathbf{x}$ in the linear SVM by the kernel function $K$:

$f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right)$. (3)
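The decision function of eq. (3) can be sketched directly, assuming the multipliers and offset have already been obtained from a trained SVM; the function names and the Gaussian bandwidth gamma below are hypothetical illustrative choices:

```python
import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    # RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) on fixed-length vectors
    return np.exp(-gamma * np.sum((x - y) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=gaussian_kernel):
    # eq. (3): f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b )
    s = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return np.sign(s + b)
```

With two toy support vectors at -1 and +1 (labels -1 and +1, equal multipliers, b = 0), points to the right of the origin are classified +1 and points to the left -1, as the nearest support vector dominates the kernel sum.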
This corresponds to constructing an optimal separating hyperplane in the feature space. Local features cannot be used in a straightforward way as input for SVMs: they have different lengths for different images, and this makes it impossible to perform a scalar product on them². But they can be used as input for an SVM by defining a new class of Mercer kernels [27]:
Given a set of images $\{I_j\}_{j=1}^{M}$ and the corresponding local feature vectors $L_j = \{\mathbf{f}_i^j\}_{i=1}^{n_j}$, consider, for all $L_h, L_k$:

$K_L(L_h, L_k) = \frac{1}{2} \left[ \hat{K}(L_h, L_k) + \hat{K}(L_k, L_h) \right]$, (4)

with $\hat{K}(L_h, L_k) = \frac{1}{n_h} \sum_{i=1}^{n_h} \max_{j = 1, \ldots, n_k} \left\{ K(\mathbf{f}_i^h, \mathbf{f}_j^k) \right\}$.

Proof Note first that $K(\mathbf{f}_i^h, \mathbf{f}_j^k)$ is a Mercer kernel by hypothesis (the Gaussian kernel, [6, 22]), thus its product with another Mercer kernel is still a Mercer kernel. The operation of $\max_j$ will result in a Mercer kernel as well (it does nothing but choose one of the Mercer kernels). This means that equation (4) is a linear combination with positive coefficients of Mercer kernels; it follows that $K_L(L_h, L_k)$ is a Mercer kernel.
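The matching kernel of eq. (4) is short enough to transcribe directly; this sketch uses a Gaussian base kernel on individual descriptors, with function names and the bandwidth gamma chosen for illustration:

```python
import numpy as np

def base_kernel(f1, f2, gamma=0.5):
    # Gaussian (RBF) kernel on two individual local descriptors
    return np.exp(-gamma * np.sum((f1 - f2) ** 2))

def k_hat(Lh, Lk, kernel=base_kernel):
    # K-hat(L_h, L_k): average over L_h of the best match found in L_k
    return np.mean([max(kernel(fi, fj) for fj in Lk) for fi in Lh])

def local_kernel(Lh, Lk, kernel=base_kernel):
    # eq. (4): symmetrized matching kernel; Lh and Lk are lists of
    # descriptors and may have different lengths
    return 0.5 * (k_hat(Lh, Lk, kernel) + k_hat(Lk, Lh, kernel))
```

Note that the two feature sets may contain different numbers of descriptors, which is exactly the situation a plain scalar product cannot handle; the symmetrization in eq. (4) guarantees $K_L(L_h, L_k) = K_L(L_k, L_h)$, and any set compared with itself scores 1 since every descriptor best-matches itself.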
It is possible to show that several matching techniques used for state-of-the-art local features are related to this class of kernels [27]. In this paper, we used the following kernels for jet features: