Structural Context for Object Categorization Wei Liu and Yubin Yang State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
[email protected],
[email protected]
Abstract. The Bag of Words model has been widely used for Object Categorization, with SIFT descriptors, computed at local interest regions, serving as the representative features, since they are robust and invariant to many kinds of image transformation. Even so, they capture only local information and are blind to the larger picture of the image. Besides, the same part of different objects (such as the headlamp of different cars) may not be identically represented by SIFT and the like. In order to represent an object category efficiently, we design a new local descriptor, Structural Context, which shares the same idea as Shape Context: it captures the relationship between the current point and the remaining points, which are the extrema of the scale space of the image and can to some extent represent the structure of the image. This newly proposed descriptor provides a more discriminative representation of the object category, being invariant to intra-class difference, scale change, illumination variation, clutter noise, partial occlusion, small deformations, rotation and viewpoint change. Experiments on object categorization and image matching have proved the effectiveness of our newly proposed descriptor in describing images of the same category.

Keywords: SIFT, Shape Context, Structural Context, Mean Shift, Bag of Words model, Image Matching, Object Categorization.
1 Introduction
With the large volume of digital images that emerges every day, it has become an urgent desire for researchers to endow the computer with the ability to "know" what exactly the images are. Here we focus in particular on the problem of Object Categorization. In order to analyze an image, we first need to represent it efficiently. Existing methods in the literature are of two types: geometric-based methods and appearance-based methods. At the very beginning, most of the literature focused on geometric-based methods [11], like the Blocks World [12], trying to extract properties like lines, vertices and ellipses to recognize objects. As time went by, such rigid shape and edge detection and matching methods gradually gave way to appearance-based models, which do not focus on what has to be seen in the

P. Muneesawang et al. (Eds.): PCM 2009, LNCS 5879, pp. 280–291, 2009. © Springer-Verlag Berlin Heidelberg 2009
image (lines, points, ...) but rather what really appears in the image (the intensity values). Early work such as color histograms and eigenspaces for face recognition [16] proved the effectiveness of appearance-based methods. However, these methods are global and thus have difficulty dealing with partial visibility and extraneous information. Using local descriptors computed at interest points therefore becomes a natural way to solve this problem. There exist many kinds of local descriptors, among which SIFT (Scale-Invariant Feature Transform) [8], which is invariant to image scale, rotation and illumination change and also provides good tolerance to affine distortion, viewpoint change and noise, has been shown to perform best [9]. In the object categorization field, the widely and quite successfully used "Bag of Words (BoW)" model treats each image as a histogram of visual words obtained by clustering the SIFT descriptors from all images [4]; each image thus becomes a collection of local features, while the geometric relationships between them are totally ignored. However, many others argue that the geometric information between these visual words is crucially important if we really want to understand what exactly the images describe. Sivic [15] extends probabilistic Latent Semantic Analysis (pLSA) with "doublets" which encode spatial co-occurrences of word pairs. Savarese [13] introduces correlograms to capture spatial co-occurrences at the feature level. Fergus [5] models objects as flexible constellations of parts, each with a distinctive appearance and spatial information. Although these approaches all aim at incorporating spatial information, their results are sometimes not even as good as BoW, either because of sensitivity to background clutter or because of long computation times.
We therefore propose a new descriptor which directly encodes the spatial information between different interest points, to solve this problem of BoW. Related descriptors already exist in the literature. For example, Shape Context [1] captures the shape configuration using a log-polar histogram; Geometric Blur [2] provides an approach to template matching that is robust under affine distortions; kAS [6] forms chains of k connected, roughly straight contour segments to encode shape structure. A relevant evaluation [10] has shown that Shape Context provides the best robustness and discriminative power in describing shapes. However, Shape Context operates directly on points sampled from the edges of the object, which means it can only efficiently represent fairly simple shapes. In our framework, we introduce the Structural Context, which shares the same idea as Shape Context but operates on interest points, i.e. the extrema of the scale space of the image, which can to some extent represent the structural information of the image. Such a descriptor is effective for representing objects of the same category as long as they share a similar structural configuration. Experiments using our proposed descriptor on image matching and object categorization show its effectiveness in describing more complicated objects.
The rest of the paper is organized as follows: in Section 2, we provide some brief background on the currently widely used local descriptors, SIFT and Shape Context. Our new descriptor, Structural Context, is introduced and applied in the Bag of Words model for object categorization in Section 3. Experimental results of object categorization on Caltech-101 are included in Section 4. Conclusions and future work are given in Section 5.
2 Background

2.1 Scale-Invariant Feature Transform (SIFT)
In [8], Lowe proposes an efficient way to extract distinctive invariant local descriptors, the Scale-Invariant Feature Transform (SIFT). The first stage is to search over all scales and image locations to detect the scale-space extrema of the Difference-of-Gaussian (DoG) functions, which are successively convolved with the image and sampled. Since the DoG is an approximation of the normalized Laplacian, it achieves scale invariance [7]. Then, extrema that are of low contrast or poorly localized along edges are eliminated, and the remaining ones become the interest points of the image. After that, one or more orientations are assigned to each interest point based on the histogram of local image gradient orientations. Finally, the local descriptor is computed from the gradients (normalized by the orientation of the current interest point) of the region around each interest point at the selected scale. Interest points with orientation and scale (red arrows) and the SIFT descriptor are shown in Fig. 1(a), (b). SIFT has been shown to perform best among state-of-the-art descriptors [9]: it provides not only tolerance to a substantial range of affine distortion, viewpoint change and noise, but also invariance to scale, rotation and illumination change. It
Fig. 1. Interest points and local descriptors. (a) SIFT descriptors (red arrows) and Structural Context (blue log-polar grid) of the interest point p. (b) SIFT of p, a 3D histogram of orientations on a 4×4 grid in the scale space where p is detected. (c) Structural Context, a log-polar histogram in which each bin is the sum of the scale values of the points falling into it.
is because of these good features that SIFT descriptors are widely and efficiently used for representing objects.

2.2 Shape Context
Shape Context [1] is another rich local descriptor sharing essentially the same idea as SIFT: by computing a log-polar histogram of edge point locations and orientations, it encodes the configuration of the entire shape into the reference point. After the object has been represented by a set of points sampled from its edges, the Shape Context of a reference point p_i is computed by first quantizing the locations of the other points into log r × θ bins of a log-polar coordinate system, as shown in Fig. 1(a); the value of each bin of the Shape Context is then the number of points falling into it:

h_i(k) = #{q ≠ p_i | q ∈ bin_i(k)}    (1)

Shape Context has been demonstrated to be invariant to translation, small perturbations of parts of the shape, and occlusion. It can also achieve complete rotation invariance by treating the tangential direction as the positive x-axis and rotating the reference frame by the tangent angle. As introduced above, our aim is to design a different local descriptor which can discriminatively represent the interest points extracted from the object. SIFT, which can identify the same object in different images, is not general enough to represent the same parts appearing on different objects. Shape Context, on the other hand, operates on edge points and thus can, to some extent, only describe simple objects that maintain an explicit shape. For these reasons, we design a new local descriptor, Structural Context, which operates on the interest points extracted from the scale space of the image, captures the global structural configuration of the whole image, and in turn can identify the same parts of different objects.
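The Shape Context computation described above can be sketched as follows. This is a minimal illustration, not the reference implementation; the binning scheme (5 log-spaced radii, 12 angular sectors, radii normalized by the mean pairwise distance) follows the description in the text, while the function and parameter names are our own.

```python
import math

def shape_context(points, i, n_r=5, n_theta=12, r_max=2.0):
    """Log-polar histogram of eq. (1): count, for reference point i,
    how many of the other points fall into each (log-radius, angle) bin."""
    px, py = points[i]
    n = len(points)
    # normalize radii by the mean pairwise distance for scale invariance
    mean_d = sum(math.dist(points[a], points[b])
                 for a in range(n) for b in range(n) if a != b) / (n * (n - 1))
    # log-spaced radial bin edges: r_max/16, r_max/8, r_max/4, r_max/2, r_max
    edges = [r_max / 2 ** (n_r - 1 - k) for k in range(n_r)]
    hist = [[0] * n_theta for _ in range(n_r)]
    for j, (qx, qy) in enumerate(points):
        if j == i:
            continue
        r = math.hypot(qx - px, qy - py) / mean_d
        theta = math.atan2(qy - py, qx - px) % (2 * math.pi)
        rb = next((k for k, e in enumerate(edges) if r <= e), n_r - 1)
        tb = int(theta / (2 * math.pi) * n_theta) % n_theta
        hist[rb][tb] += 1          # eq. (1): one count per point in the bin
    return hist
```

Every point other than the reference contributes exactly one count, so the histogram entries always sum to the number of remaining points.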
3 Structural Context
Structural Context, which shares the same idea as Shape Context, operates on interest points. Here we use the interest points extracted when computing SIFT, that is, the extrema in the DoG scale space. Because of the special nature of these interest points, we use their orientation as well as their scale to capture more information about the shape and thus solve the problems that exist in Shape Context. Since Structural Context focuses on finding the rough position of a point within the object, in other words, the structural configuration around the current point, it is a robust representation of object parts as long as the objects are of roughly the same shape, which is usually true; e.g., motorbikes all look roughly like the one in Fig. 1(a). It is because we operate on the interest points that the descriptor can capture the structural configuration of the object, hence the name Structural Context.
Suppose now that we have the interest points extracted from the DoG scale space; we can then compute the Structural Context with the following steps.

3.1 Structural Orientation Assignment
Scale-space theory [7] provides a framework for multi-scale image representation, and thus interest points detected at different scales represent different levels of structure of the object. Intuitively, when representing a tree, we first derive its scale-space representation; interest points at lower scales then denote fine structure such as the leaves, while those at larger scales capture the more global configuration of the object, such as the trunk. So the orientations of the large-scale interest points can possibly indicate the orientation of the whole object. To be more robust, we construct an orientation histogram with 36 bins of 10 degrees each from the orientations of the interest points. The value of each bin is the sum of the scale values of the interest points that fall into the bin according to their orientation. The structural orientation is assigned from the peaks of this histogram: first the highest peak is detected, and then any other peaks within 90% of the highest peak are also kept. Therefore, one or more main orientations are assigned as the orientation of the whole object. This provides more robustness for achieving rotation invariance, because the interest points may be assigned two or more orientations during the construction. As stated in [1], Shape Context rotates the coordinate system to the tangential direction of the current point to achieve rotation invariance. However, experimental results show that this method is not as effective as described. The method we propose here provides a more robust way to achieve rotation invariance because it utilizes the orientation property of the interest points. Since the orientation of each point is assigned according to the distribution of gradients around it, points with large scale should represent the main orientation trend of the object.
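The structural orientation assignment above can be sketched in a few lines. This is an illustrative sketch under the stated rules (36 bins of 10 degrees, scale-weighted votes, all peaks within 90% of the highest kept); the function name and the `(orientation_deg, scale)` input format are our own assumptions.

```python
def structural_orientations(keypoints, peak_ratio=0.9):
    """Structural orientation assignment: a 36-bin (10-degree) histogram
    over interest point orientations, with each point voting with its
    scale value; every peak within peak_ratio of the highest peak yields
    one main orientation (bin-center angle in degrees)."""
    hist = [0.0] * 36
    for orient, scale in keypoints:
        hist[int(orient // 10) % 36] += scale   # scale-weighted vote
    top = max(hist)
    # return the bin-center angles of all qualifying peaks
    return [b * 10 + 5 for b, v in enumerate(hist) if v >= peak_ratio * top]
```

Note that when two bins have comparable mass, two main orientations are returned, which is exactly the multi-orientation behavior the text relies on for rotation robustness.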
Thus, if we rotate the whole object
Fig. 2. Orientation histogram
corresponding to the main direction, the objects will then all lie in roughly the same direction.

3.2 Structural Context Descriptor
As stated above, in order to achieve rotation invariance, the coordinates of the descriptor and the interest point orientations are rotated relative to the structural orientations. In the new coordinate system, we then construct the Structural Context for each interest point using a similar approach to Shape Context. One important point to notice is that our newly proposed descriptor, Structural Context, is specifically designed to represent the information of an object category, which means that the object should occupy the larger part, say 80%, of the image; otherwise the descriptor loses its power because it is strongly affected by the background and other irrelevant information. Fortunately, many existing object category databases, such as Caltech-101, satisfy this constraint. If not, we would first need to segment the object from the image; in this work, we do not consider that situation. Even under the above constraint, in order to achieve more precision, we should first eliminate outlier interest points. These can be detected by comparing the mean distance of each interest point to the other points with the mean distance over all interest point pairs: if the former is more than 30% larger than the latter, we define the point as an outlier and eliminate it. Suppose that after the elimination we have n interest points; we then normalize all radial distances by the mean distance over the n² interest point pairs. The Structural Context is constructed as a 5×12 histogram, just as in Shape Context, with log-polar radii r/16, r/8, r/4, r/2, r, where r is set to 2 after the scale normalization. Given a point p_i, we compute its Structural Context h_i according to

h_i(k) = ( s(p_i) / max s ) · Σ_{p_j ∈ bin_i(k)} s(p_j)    (2)

where s(p_i) is the scale value of point p_i and max s is the largest scale among the interest points. We assign the sum of the scale values of all interest points in each bin to the corresponding bin value. To avoid the situation where an interest point with small scale matches one with larger scale just because their neighborhood configurations happen to be similar, we multiply each histogram by the factor s(p_i)/max s. Besides, because different images will of course contain different numbers of interest points, we normalize each histogram to sum to 1. This approach generates discriminative descriptors which capture the structural configuration of the whole object, so that points with corresponding scales can be matched correctly.
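The descriptor construction of eq. (2) can be sketched as below. This is a minimal illustration of the binning, scale weighting and normalization described in the text (the outlier elimination step is omitted for brevity); function and parameter names are our own.

```python
import math

def structural_context(pts, scales, i, n_r=5, n_theta=12, r_max=2.0):
    """Sketch of eq. (2): a 5x12 log-polar histogram whose bins sum the
    DoG scale values of the points they contain, scaled by s(p_i)/max s
    and finally normalized to sum to 1."""
    px, py = pts[i]
    n = len(pts)
    # normalize radial distances by the mean pairwise distance
    mean_d = sum(math.dist(pts[a], pts[b])
                 for a in range(n) for b in range(n) if a != b) / (n * (n - 1))
    edges = [r_max / 2 ** (n_r - 1 - k) for k in range(n_r)]  # r/16 .. r
    hist = [0.0] * (n_r * n_theta)
    for j, (qx, qy) in enumerate(pts):
        if j == i:
            continue
        r = math.hypot(qx - px, qy - py) / mean_d
        theta = math.atan2(qy - py, qx - px) % (2 * math.pi)
        rb = next((k for k, e in enumerate(edges) if r <= e), n_r - 1)
        tb = int(theta / (2 * math.pi) * n_theta) % n_theta
        hist[rb * n_theta + tb] += scales[j]   # sum of scale values per bin
    w = scales[i] / max(scales)                # the s(p_i)/max s factor
    hist = [w * v for v in hist]
    total = sum(hist) or 1.0
    return [v / total for v in hist]           # normalized histogram
```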
3.3 Invariance Properties
When representing an object, a descriptor needs to be invariant to many factors: scale change, rotation, intra-class variation, background clutter, illumination change, partial occlusion, deformation and viewpoint variation. If a descriptor can to some extent cope with these challenges, it will certainly be useful for representing the object. We argue that Structural Context is such a descriptor. Since we assume the object occupies most of the image, and since we have eliminated outlier interest points and normalized the pairwise distances by the mean distance, Structural Context is clearly scale invariant. Because we first compute the structural orientation from the orientations of all interest points, and the Structural Context is computed in a coordinate system rotated according to this structural orientation, it is also rotation invariant. Intra-class variation is also handled well, because we operate on interest points, which represent blob-like regions at different scales of the image and can to some extent represent the structure of each object, which is usually the same within a category. The descriptor also provides robustness to partial occlusion, small deformations and small viewpoint variations, because the log-polar histogram is tolerant to these problems. Even when objects are surrounded by background clutter, the experimental results show that Structural Context can still effectively find correspondences. Finally, since it operates directly on the interest points, its tolerance to illumination change depends on the properties of the interest points, and DoG extrema can effectively cope with illumination change.

3.4 Image Matching
It has been shown in the previous sections that the Structural Context is invariant to many image transformations. To see this more explicitly, we choose two motorbike images and apply different transformations to one of them, including scale change, rotation and background clutter. These examples represent the most common situations and thus show the power of Structural Context in describing the same object category. Because the Structural Context is in fact a histogram, we compare two descriptors h_i and h_j using the χ² test statistic:

C_ij = (1/2) Σ_{k=1}^{K} [h_i(k) − h_j(k)]² / [h_i(k) + h_j(k)]    (3)
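The χ² cost of eq. (3) is a one-liner; a sketch (with our own function name, and skipping bins that are zero in both histograms to avoid division by zero):

```python
def chi2_cost(h_i, h_j):
    """Chi-squared test statistic of eq. (3) between two histograms:
    0 for identical histograms, larger for more dissimilar ones."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h_i, h_j) if a + b > 0)
```

For L1-normalized histograms this cost lies in [0, 1], so matching can proceed by picking, for each descriptor in one image, the descriptor in the other image with minimal cost.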
The matching results of SIFT and Structural Context are shown in Fig. 3. From the matching results, we can see that Structural Context clearly finds more correspondences between two images containing objects of the same category. This
(a) SIFT, found 38 matches
(b) Structural Context, found 77 matches

Fig. 3. SIFT and Structural Context match result comparison
is partly because SIFT is only efficient at finding the same part in different images of the same object, while Structural Context can find structurally similar parts of the object. This shows that Structural Context is more appropriate as a descriptor for object categories.

3.5 Object Categorization
Matching results on a few specific images cannot by themselves guarantee that Structural Context is more suitable for describing object categories. We therefore investigate the task of object categorization within the "Bag of Words" framework, in which each image is seen as a document consisting of visual words formed by clustering the low-level descriptors, such as Structural Context. Here we adopt Mean Shift [3], a non-parametric kernel density estimation clustering technique, to obtain the codebook, because it requires no prior knowledge of the number of clusters and does not constrain the shape of the clusters. Although Mean Shift avoids many of the problems of K-means, it has one important issue: how to select the optimal bandwidth h. If the bandwidth is too large, the kernel density estimate captures only the "large picture" of the data; if too small, only the "local structure". We here use a simple form of the optimal bandwidth h_k for each dimension k:

h_k = ( 4 / ((d + 2) n) )^{1/(d+4)} · θ̂_k ,   k = 1, 2, . . . , d    (4)
where θ̂_k is the standard deviation of the k-th variate of the points. This method is known as the rule of thumb [14].
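The rule-of-thumb bandwidth of eq. (4) is straightforward to compute. A sketch (function name ours; the sample standard deviation is used for θ̂_k, which is one reasonable reading of the text):

```python
import statistics

def rule_of_thumb_bandwidths(points):
    """Per-dimension Mean Shift bandwidth h_k from eq. (4):
    h_k = (4 / ((d + 2) n)) ** (1 / (d + 4)) * sigma_k,
    with sigma_k the standard deviation of the k-th coordinate."""
    n, d = len(points), len(points[0])
    factor = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    return [factor * statistics.stdev(p[k] for p in points) for k in range(d)]
```

The resulting h_k would then be passed to the Mean Shift procedure of [3] as the kernel bandwidth along each dimension of the descriptor space.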
After applying Mean Shift to the collection of points, suppose we obtain a codebook of D visual words, with each point belonging to one visual word. Given a collection of images I = {I_1, I_2, . . . , I_N} with visual words from a codebook V = {v_1, v_2, . . . , v_D}, we can summarize the whole image data set as an N × D co-occurrence matrix of counts M(I_i, v_j), denoting how often visual word v_j occurs in image I_i. We then need a learning method to learn a model for each category, so that when given a new image, we can determine which category it belongs to according to the learned model. To determine the category of a new image I_i, we here use the simplest learning method, the Naive Bayes classifier, and take the largest posterior score as the categorization result:

P(C_j | I_i) ∝ P(C_j) P(I_i | C_j) = P(C_j) Π_{t=1}^{D} P(v_t | C_j)^{M(t,i)}    (5)

where I_i is assigned to the category C_j for which P(C_j | I_i) is largest. The Naive Bayes model requires estimating the conditional probabilities P(v_t | C_j) of visual word v_t given category C_j:

P(v_t | C_j) = ( 1 + Σ_{I_i ∈ C_j} M(t, i) ) / ( D + Σ_{s=1}^{D} Σ_{I_i ∈ C_j} M(s, i) )    (6)

where we use Laplace smoothing to avoid zero probabilities.
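Equations (5) and (6) together form a standard multinomial Naive Bayes classifier over visual-word counts; a sketch (function names ours; the product of eq. (5) is evaluated in log space for numerical stability):

```python
import math

def train_nb(M, labels, D):
    """Laplace-smoothed word likelihoods P(v_t | C_j), eq. (6).
    M[i][t]: count of visual word t in image i; labels[i]: its category."""
    cond = {}
    for c in set(labels):
        rows = [M[i] for i in range(len(M)) if labels[i] == c]
        totals = [sum(r[t] for r in rows) for t in range(D)]
        denom = D + sum(totals)              # D + total word mass in class c
        cond[c] = [(1 + totals[t]) / denom for t in range(D)]
    return cond

def classify_nb(m, cond, prior):
    """Pick argmax_j P(C_j) * prod_t P(v_t|C_j)^m[t], eq. (5), in log space."""
    return max(cond, key=lambda c: math.log(prior[c]) +
               sum(m[t] * math.log(cond[c][t]) for t in range(len(m))))
```

A test image is thus reduced to its word-count vector m and assigned to the category with the highest posterior score.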
4 Experimental Results
In this section, we test our newly proposed descriptor, Structural Context, on Caltech-101¹, a challenging data set widely used in the literature for object categorization and other purposes, by comparing it with SIFT using the method of Section 3.5. We select 8 categories: airplane, beaver, binocular, brontosaurus, buddha, camera, cup, dollar bill. We use half of the images as training data and the other half as testing data. Sample images are shown in Fig. 4. We apply the Naive Bayes classifier to the task of object categorization as described in Section 3.5, and compare SIFT and Structural Context using the confusion matrix, defined as

M_ij = |{I_k ∈ C_j , f(I_k) = i}| / |C_j|    (7)
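The confusion matrix of eq. (7) is easy to compute from the classifier outputs; a sketch (function name ours, with f represented by a list of predicted labels):

```python
def confusion_matrix(true_labels, pred_labels, classes):
    """Eq. (7): M[i][j] = fraction of images of true class j that the
    classifier assigned to class i, so each column sums to 1."""
    idx = {c: k for k, c in enumerate(classes)}
    n = len(classes)
    M = [[0.0] * n for _ in range(n)]
    for t, p in zip(true_labels, pred_labels):
        M[idx[p]][idx[t]] += 1.0
    for j in range(n):                     # normalize each column by |C_j|
        col = sum(M[i][j] for i in range(n)) or 1.0
        for i in range(n):
            M[i][j] /= col
    return M
```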
We can derive some useful information from the confusion matrices. Firstly, it is clear that categories with less intra-class variation, such as cup and brontosaurus, can be accurately discriminated, and vice versa. Secondly, we can see that objects with similar local appearance have a much higher chance of being
¹ http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Fig. 4. Image database

Table 1. Confusion matrix for SIFT

              airplane bronto. camera dollar buddha beaver binoc.  cup
airplane        0.72    0.06   0.02   0.05   0.02   0.01   0.03   0.04
brontosaurus    0.08    0.78   0.01   0      0.02   0.03   0.05   0.03
camera          0.03    0.02   0.64   0.05   0.04   0.06   0.08   0.06
dollar bill     0.05    0.01   0.08   0.68   0.06   0.03   0.05   0.04
buddha          0.02    0.02   0.06   0.1    0.74   0.03   0.01   0.02
beaver          0.03    0.04   0.06   0.05   0.06   0.75   0.01   0
binocular       0.03    0.02   0.05   0.04   0.04   0.05   0.75   0.02
cup             0.04    0.03   0.08   0.03   0.02   0.04   0.02   0.84

Table 2. Confusion matrix for Structural Context

              airplane bronto. camera dollar buddha beaver binoc.  cup
airplane        0.83    0.03   0.04   0.04   0.02   0.02   0.01   0.01
brontosaurus    0.03    0.86   0.05   0.04   0.01   0.01   0      0
camera          0.02    0.01   0.78   0.05   0.04   0.03   0.04   0.04
dollar bill     0.03    0.04   0.03   0.74   0.03   0.03   0.05   0.05
buddha          0.01    0.02   0.04   0.04   0.82   0.03   0.02   0.02
beaver          0.03    0.03   0.01   0.03   0.03   0.82   0.03   0.02
binocular       0.02    0      0.03   0.05   0.04   0.03   0.82   0.01
cup             0.03    0.01   0.02   0.01   0.01   0.01   0.03   0.88
wrongly categorized. For example, airplane and brontosaurus both contain many smooth local intensity distributions, so their SIFT descriptors are more likely to be similar, which leads to more confusion when trying to discriminate them. From the confusion matrix for Structural Context, we can see that Structural Context is much more discriminative than SIFT, especially on categories with similar structural configurations. For example, except for camera and dollar bill, all categories attain at least 80% recognition accuracy, which is a large improvement over SIFT, for which only the result of the cup category is satisfying, and that is because the cup is not very complicated to recognize. All the above observations show that SIFT and the like, which describe local appearance, are not ideal for object categorization and will often
obtain wrong results, while on the other hand descriptors such as Structural Context, which encode the spatial relationships between different parts of the object, can achieve better performance in categorizing different objects.
5 Conclusion
In this work, we propose a new descriptor, Structural Context, to describe object categories. Unlike other local descriptors, which operate on the local appearance of the image, it encodes the geometric configuration of the interest points into a local point descriptor. From this angle it is closer to the geometric-based methods, yet it is much more robust in representing the geometric properties of objects. As shown by the construction procedure and the analysis above, the descriptor is invariant to scale change, rotation, intra-class variation, background clutter, illumination change, partial occlusion, deformation and viewpoint variation, and is thus a discriminative and powerful descriptor for representing object categories. The object categorization experiments have also shown that Structural Context is more suitable than SIFT and the like for representing object categories: by encoding spatial information into each local descriptor, even a simple learning method achieves better results. We should also mention the shortcoming of the new descriptor: it only works on images in which the object occupies most of the frame, and is thus restricted to certain situations; how to remove this restriction is left for future work. Further open questions remain: how to better build the visual words; how to handle the scalability of the clustering; whether more powerful models than the simple Naive Bayes classifier used here, such as topic models, would perform better; and whether transfer learning could be applied to object categorization to improve efficiency. All these questions require further research and testing on the problem of object categorization.
Acknowledgments. This work is supported by the National Natural Science Foundation of P. R. China (Grants 60875011, 60723003, 60505008, 60603086), and the Natural Science Foundation of Jiangsu Province, P. R. China (Grant BK2007520).
References

1. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
2. Berg, A.C., Malik, J.: Geometric blur for template matching. In: Computer Vision and Pattern Recognition (2001)
3. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Analysis and Machine Intelligence (2002)
4. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: ECCV 2004 (2004)
5. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learning. In: CVPR 2003, pp. 264–271 (2003)
6. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments for object detection. IEEE Trans. Pattern Analysis and Machine Intelligence (2006)
7. Lindeberg, T.: Feature detection with automatic scale selection. International Journal of Computer Vision (1998)
8. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
9. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Analysis and Machine Intelligence 27 (2005)
10. Mikolajczyk, K., Tuytelaars, T., Schmid, C.: A comparison of affine region detectors. International Journal of Computer Vision (2006)
11. Mundy, J.L.: Object recognition in the geometric era: A retrospective. In: Toward Category-Level Object Recognition, pp. 3–29. Springer, Heidelberg (2006)
12. Roberts, L.G.: Machine perception of three-dimensional solids. In: Optical and Electro-Optical Information Processing, pp. 159–197. MIT Press, Cambridge (1965)
13. Savarese, S., Winn, J., Criminisi, A.: Discriminative object class models of appearance and shape by correlatons. In: CVPR 2006, pp. 2033–2040. IEEE Computer Society, Los Alamitos (2006)
14. Scott, D.W.: Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley, New York (1992)
15. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering objects and their localization in images. In: ICCV, vol. 1 (2005)
16. Turk, M., Pentland, A.: Face recognition using eigenfaces. In: Proc. Conf. Computer Vision and Pattern Recognition (1991)