Simultaneous Classification and Visual Word Selection using Entropy-based Minimum Description Length

Sungho Kim and In So Kweon
Korea Advanced Institute of Science and Technology
373-1 Guseong-dong, Yuseong-gu, Daejeon, Korea
[email protected], [email protected]
Abstract

In this paper, we present a new entropy-based minimum description length (MDL) criterion for simultaneous classification and visual word selection. Conventional MDL criteria focus on minimizing cluster size and maximizing the likelihood of the data points. We extend MDL by replacing the likelihood term with the entropy of the class posterior. This new criterion provides optimal visual words with sufficient classification accuracy. We validate the entropy-based MDL criterion by learning optimal visual words for place classification and for categorization on the Caltech-101 object database.
1. Introduction
Classification and categorization are active research topics in the computer vision community. In particular, local feature-based classification approaches show promising results [1] [2] [9] [11]. These methods commonly use a set of visual words. The term visual word originates from linguistics: just as a paragraph consists of a set of words, a scene or an object can be thought of as being composed of visual words. The key issue in visual word-based classification is how to learn the optimal set of visual words. Csurka et al. selected the visual words by k-means clustering [1]; the value of k is chosen empirically by cross-validation on the training set. Winn et al. proposed a pair-wise feature clustering method that maximizes inter-class variation and minimizes intra-class variation [11]. Although this method is intuitively reasonable, it can be problematic when the feature set is huge. Visual word selection can also be regarded as a model selection problem. Several model selection criteria, such as the minimum description length (MDL), the Bayesian information criterion (BIC), and Akaike's information criterion (AIC), have been proposed and compared [7] [10] [12] [5]. These criteria find an optimal set of clusters by minimizing the number of clusters and maximizing the likelihood; they differ in how model complexity is defined [5]. They assume that there is an optimal set of clusters for each class. The limitation of MDL-, BIC-, and AIC-based clustering is that these methods select optimal clusters for a single class only; they do not attempt to find universal clusters over all classes for scene or object classification. The goal of our research is to find optimal visual words that can be shared across classes while retaining sufficient classification performance. The key idea is to replace the likelihood term with the entropy of the class posterior: by minimizing entropy, we reduce the classification ambiguity, while the size term in MDL penalizes model complexity. Conceptually, we can learn an optimal set of visual words by minimizing this modified criterion. We explain the overall learning scheme with the proposed criterion in Section 2, formulate the entropy-based MDL criterion in Section 3, and present implementation details of the learning system in Section 4. In Section 5, we validate the learning criterion on place classification and object categorization.
Figure 1. Overall system structure for visual word learning and categorization: local features [G-RIF/SIFT] are extracted and, in the learning phase, visual words are generated by the entropy-based MDL criterion; in the categorization phase, posteriors for novel objects are calculated using Bayes rule and the learned visual words.
0-7695-2521-0/06/$20.00 (c) 2006 IEEE
2. Optimal learning structure
Figure 1 shows the overall structure for visual word learning and categorization from local features. We represent a category in terms of learned visual words. First, we extract all local features from the labeled training set of objects [6] [8]. Through a learning phase driven by the entropy-based MDL criterion, we obtain an optimal set of visual words and the class-conditional distributions of the visual words used for inference. Figure 2 summarizes the steps for learning the optimal visual words. There is only one parameter, ε, which controls the size of the visual word set. Through an iterative learning process, we obtain the visual words that are optimal in terms of the entropy-based MDL criterion; each block is explained in Sections 3 and 4. When a novel object is presented, categorization is conducted using the detected features and the learned visual words.
Figure 2. Details of the learning procedures in the learning-phase block of Figure 1: ε-controlled automatic visual word generation, automatic class-conditional word distribution estimation, and evaluation of the entropy-based MDL criterion.

3. Entropy-based MDL criterion
In this section, we introduce an entropy-based MDL criterion for simultaneous classification and visual word learning. The conventional maximum likelihood (ML)-based MDL criterion provides optimal clusters using equation (1) [10]:

\hat{V} = \arg\min_{V} \left[ -\sum_{i=1}^{N} \log p(I_i \mid \hat{\Theta}^{(V)}) + \frac{\zeta(V)}{2} \log N \right]   (1)

Here I_i represents training images belonging to a single category, V = {v_i} represents the visual words, N is the number of training samples, and ζ(V) is the parameter dimension of the visual words. Each visual word has parameters \hat{\theta}_i = \{\mu_i, \sigma_i^2\} (mean, variance). This MDL criterion is useful only for class-specific learning of visual words (low distortion with minimal complexity).

In our problem, this MDL criterion is not suitable, since we have to find universal visual words for all classes with sufficient classification accuracy. If the classification is discriminative, the entropy of the class posterior should be low [3]. We therefore propose an entropy-based MDL criterion for simultaneous classification and visual word learning by combining MDL with the entropy of the class posterior. Let L = \{(I_i, c_i)\}_{i=1}^{N} be a set of labeled training images, where c_i \in \{1, 2, \cdots, C\} is the class label. The entropy-based MDL criterion is then defined as equation (2), where λ weights the complexity term:

\hat{V} = \arg\min_{V} \left[ \sum_{i=1}^{N} H(c \mid I_i, \hat{\Theta}^{(V)}) + \lambda \frac{\zeta(V)}{2} \log N \right]   (2)

where the entropy H is defined as

H(c \mid I_i, \hat{\Theta}^{(V)}) = -\sum_{c=1}^{C} p(c \mid I_i, \hat{\Theta}^{(V)}) \log_2 p(c \mid I_i, \hat{\Theta}^{(V)})   (3)

In equation (2), the first term represents the overall entropy over the training image set; the lower the entropy, the better the guaranteed classification accuracy. The second term acts as a penalty on learning: if the number of visual words increases, the model needs more parameters. So, by minimizing equation (2), we can find the optimal set of visual words for successful classification with moderate model complexity. Note that we need to tune only one parameter, ε, which controls the size of the visual word set automatically. Details are explained in the next section.
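The criterion in equations (2)-(3) is easy to evaluate once the class posteriors are available. The following is a minimal sketch (not the authors' implementation), assuming the posteriors p(c | I_i, Θ̂^(V)) have already been collected into an N × C array; the function and argument names are our own.

```python
import numpy as np

def entropy_mdl(posteriors, n_params, n_samples, lam=1.0):
    """Entropy-based MDL score (equation 2): summed class-posterior
    entropy plus a lambda-weighted complexity penalty.
    `posteriors` is an (N x C) array of p(c | I_i, Theta)."""
    p = np.clip(posteriors, 1e-12, 1.0)      # guard against log2(0)
    entropy = -np.sum(p * np.log2(p))        # equation 3, summed over all images
    penalty = lam * n_params / 2.0 * np.log(n_samples)
    return entropy + penalty
```

A uniform posterior over two classes contributes one bit of entropy per image, while a peaked posterior contributes almost nothing, so vocabularies that classify confidently score lower under the first term.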
4. Implementation details
We follow the process of Figure 2 to learn the optimal visual words for classification. If ε is the distance threshold between normalized features, we observe a clustering property as in Figure 3 (top) using ε-nearest neighbor (ε-NN) grouping. Applying this property to the whole feature set yields initial clusters whose number is determined automatically [see Figure 3 (bottom)]; the initial seed features are selected randomly. We then refine the cluster parameters (\hat{\Theta}^{(V)} = \{\theta^{(i)}\}_{i=1}^{V}) with k-means clustering.

After ε-NN-based visual word generation, we estimate the class-conditional visual word distribution needed for the entropy calculation. The Laplacian smoothing-based estimate is defined as equation (4) [1]:

p(v_t \mid c_j) = \frac{1 + \sum_{I_i \in c_j} N(t, i)}{V + \sum_{s=1}^{V} \sum_{I_i \in c_j} N(s, i)}   (4)

where N(t, i) is the number of occurrences of visual word v_t in training image I_i, and V is the size of the visual word set.
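Equation (4) is straightforward to compute from a word-count matrix. The sketch below is our illustration, not the paper's code; `counts`, `labels`, and the function name are assumptions.

```python
import numpy as np

def estimate_word_likelihoods(counts, labels, n_classes):
    """Laplacian-smoothed class-conditional word distribution (equation 4).
    `counts[i, t]` holds N(t, i), the occurrences of word v_t in image I_i;
    `labels[i]` is the class of image I_i."""
    n_images, vocab = counts.shape
    p = np.empty((n_classes, vocab))
    for c in range(n_classes):
        word_totals = counts[labels == c].sum(axis=0)  # sum of N(t, i) over I_i in c_j
        p[c] = (1.0 + word_totals) / (vocab + word_totals.sum())
    return p  # each row sums to 1; no word has zero probability
```

The +1 in the numerator and the +V in the denominator ensure that a word never observed in a class still receives nonzero likelihood, which keeps the logarithms in the entropy term finite.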
Table 1. The composition of training images and test images (scenes are 640 × 480).

           No. of places   No. of scenes
Training   10              120
Test       10              7,208

Figure 3. ε-NN based feature clustering: (top) clustered visual patches according to ε (ε = 0.15, 0.2, 0.25); (bottom) automatic clustering procedure (seed features and cluster centers).
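The ε-NN clustering of Figure 3 can be sketched as a greedy procedure. The code below is our own reading of the description, with assumptions about details the paper leaves open (random seed order, mean of the absorbed members as the center):

```python
import numpy as np

def epsilon_nn_seeds(features, eps, rng=None):
    """Greedy epsilon-NN seeding: pick a random remaining feature as seed,
    absorb every feature within distance eps of it, store the cluster
    center, and repeat until no features remain. The number of clusters
    thus follows automatically from eps."""
    rng = np.random.default_rng() if rng is None else rng
    remaining = list(range(len(features)))
    centers = []
    while remaining:
        seed = remaining[rng.integers(len(remaining))]          # random seed feature
        d = np.linalg.norm(features[remaining] - features[seed], axis=1)
        members = {remaining[k] for k in np.flatnonzero(d <= eps)}
        centers.append(features[sorted(members)].mean(axis=0))  # cluster center
        remaining = [i for i in remaining if i not in members]
    return np.array(centers)  # initial centers, later refined by k-means
```

These centers would then serve as the initialization for the k-means refinement described in Section 4.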
This equation gives the empirical likelihood of the visual words for a given class. Finally, we calculate the posterior p(c \mid I_i, \hat{\Theta}^{(V)}), which is used for the entropy calculation in equation (2). Using Bayes rule and a uniform distribution of class labels, this can be stated as equation (5), where the image is approximated by a set of local features (I_i \approx \{y\}_i):

p(c_i \mid I_i, \hat{\Theta}^{(V)}) = \alpha\, p(I_i \mid c_i, \hat{\Theta}^{(V)})\, p(c_i) \approx \alpha\, p(\{y\}_i \mid c_i, \hat{\Theta}^{(V)})   (5)

Assuming independent features (naive Bayes),

p(\{y\}_i \mid c_i, \hat{\Theta}^{(V)}) = \prod_{j} \sum_{t=1}^{V} p(y_j \mid v_t)\, p(v_t \mid c_i)   (6)

where p(y_j \mid v_t) = \exp(-\|y_j - \mu_t\|^2 / 2\sigma_t^2). By calculating equations (3)-(6), we can evaluate equation (2), and we learn the optimal set of visual words by iteratively changing ε and re-evaluating equation (2).
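Putting equations (5) and (6) together, the posterior for a single image can be sketched as follows. This is our illustration, with the uniform class prior folded into the normalization α; the array names are assumptions.

```python
import numpy as np

def image_posterior(features, mu, sigma2, word_given_class):
    """Class posterior p(c | I_i) via equations (5)-(6) under naive Bayes.
    `features` is (M x D): the local descriptors {y}_i of one image;
    `mu` (V x D) and `sigma2` (V,) parameterize the visual words;
    `word_given_class` is p(v_t | c), shape (C x V), e.g. from equation (4)."""
    # p(y_j | v_t) = exp(-||y_j - mu_t||^2 / (2 sigma_t^2)), shape (M x V)
    d2 = ((features[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    feat_given_word = np.exp(-d2 / (2.0 * sigma2))
    # equation (6): product over features j of a sum over words t, per class
    per_class = feat_given_word @ word_given_class.T   # (M x C)
    log_lik = np.log(per_class + 1e-300).sum(axis=0)   # log p({y}_i | c)
    # equation (5): uniform prior, so normalization plays the role of alpha
    post = np.exp(log_lik - log_lik.max())
    return post / post.sum()
```

Working in the log domain avoids underflow when an image contributes many features; subtracting the maximum before exponentiating is the usual numerical stabilization.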
5. Experimental results
The proposed visual word learning scheme is applicable to any visual categorization problem. We applied it to both topological place recognition and object categorization on the Caltech-101 database, using G-RIF as the local feature in all tests [6]. In the first experiment, we acquired 120 training images for 10 places; Table 1 summarizes the composition of the training and test sets.
Figure 4. Training images for the learning of topological classification (lobby, elevator, hallway, washstand, etc.).
We acquired the whole test sequence using a SONY-828 camcorder. Some of the training images are shown in Figure 4. The topological places are the lobby, elevator, hallway on the 3rd floor, washstand, etc. Figure 5 shows the evaluation curves of equation (2) obtained by varying ε. The green dotted line shows the relative size of the learned visual word set according to ε; we set the weight λ to 1. The blue dotted line represents the relative entropy term. As ε increases and the number of visual words is thereby reduced, the entropy increases: with fewer visual words, the classification becomes more ambiguous (higher probability of false classification). Through the overall evaluation, we obtained the global minimum of the entropy-based MDL criterion at ε = 0.2. Table 2 summarizes the final learned results at ε = 0.2; the size of the visual word set was reduced by 41.0%. With this trained DB, we evaluated topological classification on 7,208 test images; the correct classification rate is 92.6%. In the second test, we learned from the Caltech-101 DB (available at http://www.vision.caltech.edu/htmlfiles/archive.html). For learning, we randomly selected 10 examples for each category (category ID: 1-camera, 2-cell phone, 3-chair, 4-cup, 5-fan, 6-headphone, 7-helicopter,
Figure 5. Evaluation of the entropy-based MDL criterion versus ε (curves for the entropy term, the complexity term, and the combined entropy-MDL criterion, for ε from 0.1 to 0.4).

6. Conclusions
In this paper, we presented an entropy-based MDL criterion for simultaneous classification and visual word learning. We applied the criterion to visual similarity-based automatic clustering for optimal learning: by controlling a single parameter and checking the criterion, we obtain optimal visual words for classification. We applied the proposed learning scheme to topological place classification and to object categorization on a challenging open DB; the tests show very promising performance.
Acknowledgements This research has been partially supported by the Korean Ministry of Science and Technology through the National Research Laboratory Program (grant M1-0302-00-0064) and by Microsoft Research Asia.
Table 2. Results after proposed MDL learning.

Type    ε̂     No. of features       No. of features      Reduction [%]
              (before learning)     (after learning)
Scene   0.2   106,119               62,610               41.0
8-notebook, 9-bike, 10-saxophone). Through learning, the visual words were reduced from 14,517 to 1,344 at ε = 0.2. Table 3 shows the categorization results for each category and the overall performance. Our system achieves a success rate of 41.8%. Note that this performance is comparable to state-of-the-art methods: according to [4], geometric blur-based matching achieves 45%, one-shot learning 17%, and Holub et al.'s method 40.1%. Although the comparison may be unfair due to different handling of the database and different numbers of categories, the results show that our proposed learning scheme is feasible for object categorization.
Table 3. Categorization results for Caltech-101 DB.

Category   Correct/total   Rate [%]
1          12/40           32
2          32/49           65
3          15/52           34
4          4/47            8.5
5          1/38            2.6
6          12/33           36
7          43/79           54.4
8          45/72           62.5
9          49/90           54.4
10         9/30            30.0
Total      222/530         41.8
References

[1] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision, 2004.
[2] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In ICCV, 2005.
[3] G. Fritz, L. Paletta, and H. Bischof. Object recognition using local information content. In ICPR (2), pages 15–18, 2004.
[4] A. Holub, M. Welling, and P. Perona. Combining generative models and Fisher kernels for object recognition. In ICCV, pages 136–143, 2005.
[5] X. Hu and L. Xu. Investigation on several model selection criteria for determining the number of cluster. Neural Information Processing – Letters and Reviews, 4(1):1–10, 2004.
[6] S. Kim and I.-S. Kweon. Biologically motivated perceptual feature: Generalized robust invariant feature. In ACCV (2), pages 305–314, 2006.
[7] J. Li and H. Zha. Simultaneous classification and feature clustering using discriminant vector quantization with applications to microarray data analysis. In Bioinformatics Conference (CSB), pages 246–255, 2002.
[8] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004.
[9] K. Mikolajczyk, B. Leibe, and B. Schiele. Local features for object class recognition. In ICCV, pages 1792–1799, 2005.
[10] A. Vailaya, M. Figueiredo, A. Jain, and H. Zhang. Image classification for content-based indexing. IEEE Trans. on Image Processing, 10(1):117–130, 2001.
[11] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In ICCV, pages 1800–1807, 2005.
[12] X. Zhou, X. Wang, and E. R. Dougherty. Gene selection using logistic regressions based on AIC, BIC, and MDL criteria. New Mathematics and Natural Computation, 1(1):129–145, 2005.