Applied Mechanics and Materials Vols. 433-435 (2013), pp. 778-782. © 2013 Trans Tech Publications, Switzerland. doi:10.4028/www.scientific.net/AMM.433-435.778
An Object Classification Approach Based on Randomized Visual Vocabulary and Clustering Aggregation

Li Su, Jienan Liu, Lanfang Ren, Feng Zhang
32#, Xuanwumenxi Avenue, Xicheng District, Beijing, China
[email protected],
[email protected],
[email protected],
[email protected]

Keywords: Object classification; Clustering aggregation; Exact Euclidean Locality Sensitive Hashing; Randomized visual vocabulary
Abstract: Conventional Bag-of-Visual-Words approaches suffer from high time consumption, the synonymy and ambiguity of visual words, and the instability of clustering high-dimensional image local features. To address these problems, this paper presents a novel object classification approach based on randomized visual vocabularies and clustering aggregation. First, Exact Euclidean Locality Sensitive Hashing (E2LSH) is used to cluster the local features of the training dataset, producing a group of randomized visual vocabularies. These vocabularies are then combined with a clustering aggregation technique into a Randomized Visual Vocabularies Aggregating Dictionary (RVVAD). Finally, visual word histograms are generated from this dictionary and Support Vector Machines are trained to perform image object categorization. Experimental results indicate that the expressive ability of the dictionary is effectively improved and that classification precision increases markedly.

Introduction

In the field of image object categorization, the visual dictionary [1] has become a mainstream technique because of its strong performance. Nevertheless, the approach still has two problems.

The first is high time complexity: K-Means severely restricts the speed at which a visual dictionary can be generated. Philbin [2] therefore uses approximate K-Means to accelerate the convergence of clustering and introduces an inverted document structure to further improve retrieval efficiency; Wang et al. [3] use fast approximate K-Means to reduce the number of data points, which accelerates convergence and improves the efficiency of dictionary construction. However, because local features are high-dimensional and plentiful [4], the "curse of dimensionality" still severely limits the efficiency of K-Means.

The second is the instability of clustering high-dimensional local feature points. Beyer et al. [5] prove that the efficiency of K-Means, Mean Shift and other clustering algorithms drops dramatically in high-dimensional space, and that the stability of the clustering results also deteriorates. Moosmann et al. [6] therefore present a randomized clustering forest, inspired by decision trees and random forests, which generates a group of visual dictionaries forming a visual dictionary forest; this reduces the randomness in dictionary generation, but at the cost of high memory usage and high time complexity. To improve the discriminative ability of the visual dictionary forest, López-Sastre et al. [7] apply clustering aggregation [8] to multiple coarse clustering results to generate the final visual dictionary; this weakens synonymy and ambiguity to a certain extent, but running K-Means many times increases the overall complexity.
To address these problems, we replace K-Means with Exact Euclidean Locality Sensitive Hashing (E2LSH) for mapping local feature points, which reduces complexity and weakens the synonymy and ambiguity of visual words. Moreover, to reduce the randomness of the vocabularies produced by hash clustering, we generate the final visual dictionary by clustering and aggregating the outputs of multiple LSH functions. This improves the stability of the visual dictionary, further weakens synonymy and ambiguity, and lets the size of the final dictionary (the number of visual words) be determined automatically during clustering aggregation.

The rest of this paper is organized as follows: Section 2 introduces the fundamentals of E2LSH and clustering aggregation; Section 3 presents the key techniques of object categorization based on randomized visual vocabularies and clustering aggregation, focusing on the construction of the randomized visual vocabularies and of the Randomized Visual Vocabularies Aggregating Dictionary; Section 4 reports the experimental verification and performance analysis; Section 5 concludes.

Relevant Knowledge

2.1 The fundamentals of E2LSH
The key idea of E2LSH is to hash high-dimensional data points with several locality sensitive functions based on p-stable distributions, guaranteeing that points close to each other have a much higher probability of collision than points that are far apart. The definitions of locality sensitive functions and p-stable distributions are given in [6].

2.2 The fundamentals of clustering aggregation
Clustering aggregation means: given a set of original clusterings, find a single clustering that agrees with all of them as much as possible. The relevant background is introduced in [8][9].
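To fix the notation used later, the standard p-stable hash family behind E2LSH and the consensus objective behind clustering aggregation can be summarized as follows. This is a reference sketch in our own notation, not a formulation taken from the paper.

```latex
% E2LSH (Euclidean case): each elementary hash projects a descriptor v
% onto a random direction a (i.i.d. Gaussian entries, i.e. 2-stable),
% shifts by b ~ U[0, w), and quantizes with bucket width w; a compound
% hash g concatenates k such functions.
h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor ,
\qquad
g(v) = \bigl( h_1(v), \dots, h_k(v) \bigr)

% Clustering aggregation (following [8]): given base clusterings
% C_1, \dots, C_L of the same points, find the clustering C that
% minimizes the total number of pairwise disagreements with them.
C^{*} = \arg\min_{C} \sum_{i=1}^{L}
  \bigl| \{ (x, y) : [C(x) = C(y)] \neq [C_i(x) = C_i(y)] \} \bigr|
```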
Object classification based on randomized visual vocabulary and clustering aggregation

Fig. 1 System structure

The system structure is shown in Fig. 1. The flow of the method is as follows: first, the SIFT features of all images in the training library are extracted and mapped with E2LSH to obtain a group of randomized visual vocabularies, which are then clustered and aggregated into the randomized visual vocabularies aggregating dictionary; next, a soft assignment method is used to construct the visual word histograms; finally, these histograms are used to train the SVM classifier.
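A minimal sketch of this pipeline is given below. OpenCV is assumed for SIFT extraction and scikit-learn for the SVM (our tooling choices, not the paper's); build_randomized_vocabularies, aggregate_vocabularies and soft_assign_histogram are placeholders for the steps of Sections 3.1-3.3, two of which are sketched further down.

```python
# Sketch of the Fig. 1 pipeline under the assumptions stated above.
import cv2
import numpy as np
from sklearn.svm import SVC

def extract_sift(image_paths):
    """Extract 128-D SIFT descriptors from each training image."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, des = sift.detectAndCompute(img, None)
        if des is not None:
            per_image.append(des)
    return per_image

def train_pipeline(image_paths, labels):
    # Assumes every image yields descriptors; otherwise the matching
    # labels would have to be dropped as well.
    per_image = extract_sift(image_paths)
    all_feats = np.vstack(per_image)
    # Section 3.1: hash the features with L random E2LSH functions to get
    # L randomized vocabularies (per-descriptor word assignments).
    assignments = build_randomized_vocabularies(all_feats, L=10)
    # Section 3.2: aggregate them into the final dictionary (RVVAD).
    dictionary = aggregate_vocabularies(all_feats, assignments)
    # Section 3.3: one soft-assignment histogram per training image.
    X = np.array([soft_assign_histogram(des, dictionary) for des in per_image])
    clf = SVC(kernel="rbf", C=10.0)  # illustrative kernel and C
    clf.fit(X, labels)
    return dictionary, clf
```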
3.1 Randomized visual vocabulary construction based on E2LSH

The algorithmic flow is as follows. SIFT points are extracted from all images of the training dataset and collected into a training feature library R = {r1, r2, ..., rN}, where ri is a 128-dimensional SIFT descriptor and N is the number of SIFT points. Then:
(1) Select a locality sensitive function g with which the SIFT feature library is mapped;
(2) For an arbitrary SIFT point ri ∈ R, map it through g to obtain a k-dimensional vector g(ri);
(3) Compute the primary hash key h1(g(ri)) and the secondary hash key h2(g(ri)) of ri;
(4) For all SIFT points of R, store the points sharing the same primary and secondary hash keys in the same bucket, obtaining the hash table Tg = {b1, b2, ..., bZ}, where bk is the k-th bucket and Z is the number of buckets in the table.
Since g is selected randomly, the resulting vocabulary also carries considerable randomness and is regarded as a randomized visual vocabulary V = {w1, w2, ..., wZ}, where wk is the visual word corresponding to bk. A code sketch of this bucketing step is given below.
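The following sketch builds one such randomized vocabulary under stated assumptions: a Gaussian (2-stable) projection is used for the Euclidean case, the parameter values (k, bucket width w) are illustrative, and E2LSH's primary/secondary hash keys are replaced by a Python dict keyed directly on g(r), which groups identical keys to the same effect.

```python
import numpy as np

def build_randomized_vocabulary(feats, k=8, w=4.0, seed=0):
    """Hash 128-D SIFT descriptors with one random E2LSH function g and
    return the bucket centroids as the words of one randomized vocabulary."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(k, feats.shape[1]))   # rows a_i ~ N(0, 1): 2-stable
    b = rng.uniform(0.0, w, size=k)
    # g(r) = (floor((a_1·r + b_1)/w), ..., floor((a_k·r + b_k)/w))
    keys = np.floor((feats @ A.T + b) / w).astype(np.int64)
    buckets = {}
    for i, key in enumerate(map(tuple, keys)):
        buckets.setdefault(key, []).append(i)   # same key -> same bucket
    # Each non-empty bucket b_z yields one visual word w_z (its centroid).
    words = np.array([feats[idx].mean(axis=0) for idx in buckets.values()])
    # Per-descriptor index of the bucket/word it fell into, i.e. V(r_i).
    bucket_ids = {key: z for z, key in enumerate(buckets)}
    assignments = np.array([bucket_ids[tuple(key)] for key in keys])
    return words, assignments
```

Calling this with L different seeds yields the per-descriptor assignments V1(ri), ..., VL(ri) consumed by the aggregation step in Section 3.2; the plural wrapper build_randomized_vocabularies used in the earlier pipeline sketch would simply loop over seeds and stack the assignment vectors into an (L, N) array.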
3.2 Aggregation of the randomized visual vocabularies

The specific process is as follows (a code sketch follows the list):
(1) Construct the undirected connected graph of the feature library. For any two SIFT points ri and rj, the distance between them is computed according to formula (1):

d(r_i, r_j) = \frac{1}{L} \bigl| \{ k \mid V_k(r_i) \neq V_k(r_j),\ 1 \le k \le L \} \bigr|    (1)

where V_k(r_i) ≠ V_k(r_j) indicates that ri and rj are assigned to different visual words in the k-th vocabulary and L is the number of randomized vocabularies. An undirected connected graph G is then constructed with the SIFT points as vertices and the distances d(ri, rj) as edge weights;
(2) Rank the SIFT points. For each SIFT point ri, compute the sum ωi of the weights of its incident edges and rank all SIFT points of R by these weights; the resulting order is denoted π;
(3) Select from π the first feature point u that has not yet been clustered, and find all unclustered SIFT points r satisfying d(u, r) ≤ 1/2, collected in a set B;
(4) Compute the average distance between u and the SIFT points in B:

d(u, B) = \frac{1}{|B|} \sum_{r \in B} d(u, r)    (2)
(5) If d(u, B) ≤ α, then B ∪ {u} forms a cluster and all SIFT points in B are labeled as clustered; otherwise {u} forms a cluster by itself;
(6) Update the undirected connected graph G and the order π by deleting the vertices and incident edges of the SIFT points that have been clustered, and removing those points from π;
(7) Repeat steps (3)-(6) until all SIFT points have been clustered, yielding the clustering result C = {c1, c2, ..., cK} of the feature library R, where K is the number of clusters and ck is the k-th cluster. The visual dictionary is obtained by taking the center of each cluster as a visual word wk;
(8) Visual word filtering. Visual words associated with too few or too many data points generally have weak discriminating power, so these words are filtered out with minimal information loss, giving the final visual dictionary W* = {w1, w2, ..., wM}.
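A sketch of this aggregation loop, written directly from steps (1)-(8), is shown below. It builds the full N×N disagreement matrix, so in practice it would be run on a subsample of the descriptors; the value of alpha, the ascending ranking direction, and the omission of step (8)'s filtering are our assumptions.

```python
import numpy as np

def aggregate_vocabularies(feats, assignments, alpha=0.35):
    """Greedy aggregation of L randomized vocabularies (Section 3.2).

    assignments: (L, N) array, assignments[k, i] = V_k(r_i), i.e. the word
    of vocabulary k that descriptor i fell into.
    alpha: merge threshold of step (5); the value here is illustrative.
    """
    L, N = assignments.shape
    # Step (1): d(ri, rj) = fraction of vocabularies that separate ri and rj.
    D = np.zeros((N, N))
    for k in range(L):
        D += (assignments[k][:, None] != assignments[k][None, :])
    D /= L
    # Step (2): rank points by the total weight of their incident edges
    # (ascending; the direction is not stated, so it is an assumption here).
    order = np.argsort(D.sum(axis=1))
    clustered = np.zeros(N, dtype=bool)
    clusters = []
    for u in order:                            # steps (3)-(7)
        if clustered[u]:
            continue
        cand = np.where(~clustered & (D[u] <= 0.5))[0]
        cand = cand[cand != u]                 # B: unclustered points near u
        if len(cand) > 0 and D[u, cand].mean() <= alpha:   # step (5)
            members = np.append(cand, u)
        else:
            members = np.array([u])
        clustered[members] = True
        clusters.append(members)
    # Visual words = cluster centres; step (8)'s filtering of words with
    # too few or too many points is omitted here for brevity.
    return np.array([feats[c].mean(axis=0) for c in clusters])
```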
3.3 Visual word histogram construction

Hard assignment cannot fully exploit the expressive power of the visual dictionary, so this paper uses a soft assignment method based on visual word uncertainty to generate the visual word histograms.

Experiments

The Caltech-101 dataset [10] and the PASCAL VOC 2007 dataset [11] are used in this paper. The Average Precision (AP) of every category and the Mean Average Precision (MAP) over all categories serve as the evaluation criteria for categorization performance. To verify the object categorization performance of the proposed method (RVVAD), we compare it with methods based on AKM with a soft assignment scheme (AKM+SA), Reciprocal Nearest Neighbor Clustering with Related Clustering Optimization (RNNC+CC), the Randomized Locality Sensitive Vocabulary (RLSV), and AKM with Visual Words Aggregation and Soft Assignment (AKM+VWA+SA). The comparison is carried out on the Caltech-101 dataset; the results are shown in Table 1.

Table 1 Experimental results on Caltech-101 (AP, %)

object        AKM+SA   RNNC+CC   AKM+VWA+SA   RLSV    RVVAD
brain          53.6      62.0       59.1       67.8    74.3
butterfly      40.3      38.2       60.5       49.0    63.4
ewer           54.8      43.4       43.5       52.6    56.8
grand piano    68.5      73.3       61.4       56.5    72.3
helicopter     46.7      64.5       63.7       74.2    75.6
kangaroo       52.1      70.7       71.9       75.8    78.5
laptop         65.6      59.8       68.8       66.4    72.1
menorah        36.1      49.6       56.4       53.5    62.4
starfish       47.8      60.5       58.0       58.0    63.1
sunflower      55.4      63.2       60.3       68.5    70.8
MAP            52.09     58.52      60.36      62.23   68.93

As Table 1 shows, the MAP of RVVAD is higher than that of the other methods, because RVVAD weakens the synonymy and ambiguity of visual words while also reducing the randomness of the dictionary and improving its expressive ability. To further verify the effectiveness of the proposed method, an object categorization experiment on the PASCAL VOC 2007 dataset is carried out against other methods; the results are shown in Table 2.
Table 2 Experimental results on PASCAL VOC 2007 (MAP, %)

         AKM+SA   AKM+VWA+SA   RLSV    INRIA_Genetic   RVVAD
MAP       58.22      58.9      59.07       59.40       62.10

As Table 2 shows, the Mean Average Precision of RVVAD is 62.10%, which is higher than that of the other methods.

Conclusions

To address the three problems of the traditional visual-dictionary-based method, we combine E2LSH with clustering aggregation to construct a randomized visual vocabularies aggregating dictionary and apply it to image object categorization. Compared with traditional methods, the use of E2LSH reduces time and memory costs effectively and overcomes synonymy and ambiguity to a certain extent, while clustering aggregation reduces the randomness of the visual dictionaries and improves their expressive ability. As future work, we plan to introduce a supervisory mechanism into the dictionary generation process in order to construct better visual dictionaries.

References

[1] Kersorn K: An enhanced bag-of-visual-word vector space model to represent visual content in athletics images. IEEE Transactions on Multimedia, 14(1): 211-222. (2012)
[2] Philbin J: Scalable object retrieval in very large image collections. PhD thesis, University of Oxford. (2010)
[3] Wang J, Wang J D, Ke Q F: Fast approximate k-means via cluster closures. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, 3037-3044. (2012)
[4] Lowe D G: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2): 91-110. (2004)
[5] Beyer K S, Goldstein J, Ramakrishnan R: When is "nearest neighbor" meaningful? Proceedings of the 7th International Conference on Database Theory, Jerusalem, 217-235. (1999)
[6] Moosmann F, Nowak E, Jurie F: Randomized clustering forests for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8): 1632-1646. (2008)
[7] López-Sastre R J, Renes-Olalla J, Gil-Jiménez P: Visual word aggregation. Proceedings of the 5th Iberian Conference on Pattern Recognition and Image Analysis, Las Palmas de Gran Canaria, 676-683. (2011)
[8] Gionis A, Mannila H, Tsaparas P: Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1): 1-30. (2007)
[9] Bansal N, Blum A, Chawla S: Correlation clustering. Machine Learning, 56: 89-113. (2004)
[10] Fei-Fei L, Fergus R, Perona P: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Proceedings of the IEEE CVPR Workshop on Generative-Model Based Vision, Washington D.C. (2004)
[11] Everingham M, Van Gool L, Williams C K I: The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/results/index.shtml. (2012)