EFFICIENT INSTANCE SEARCH FROM LARGE VIDEO DATABASE VIA SPARSE FILTERS IN SUBSPACES

Yan Yang¹,²
¹ The University of Queensland, School of ITEE, QLD 4072, Australia
² NICTA Queensland Research Laboratory, Australia

Shin'ichi Satoh
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
ABSTRACT

In this paper, we propose a biologically inspired approach to overcome the challenges of searching for instances in large video databases. Specifically, we train sparse filters in subspaces from unlabelled natural images, then derive image features for new image instances through the pre-learned filters. Therefore, no traditional "hand-designed" features (e.g. colour histograms, interest point descriptors) are required in our system. Experiments on a challenging large video database containing 20982 videos show that our approach outperforms traditional approaches such as Bag-of-Words using SURF, or the combination of SIFT, SURF, RGB and texture features.

Index Terms— Instance Search, Independent Subspace Analysis, Image Retrieval, Large Multimedia Database, Sparse Filters.

1. INTRODUCTION

Techniques for searching and classifying objects in image databases have developed rapidly in recent years. Despite significant improvements in both accuracy and speed, current artificial vision systems remain far less advanced than human vision. Studies of the primate visual system have inspired researchers to use techniques that simulate processes in the vertebrate retina, such as the Difference of Gaussian (DoG) filters adopted in the Scale-Invariant Feature Transform (SIFT) [1]. Another related example is the spatial pooling of features (e.g. SIFT, colour histograms) to build a codebook [2]; the corresponding frequency of "a term" is a sparse response, similar to the "neural code" generated by the human brain [3]. However, the features commonly used in image retrieval systems are often carefully chosen and "hand-designed" for individual datasets, and it is usually difficult and time-consuming to extend these features to different systems [4].

One challenging task for a traditional "hand-designed" feature is to search for relevant videos containing a specific instance (e.g. a person, object, or place) when several examples of this instance are given (as shown in Fig. 1).
Fig. 1. Examples of 8 query topics out of the total of 25 topics in the TRECVID 2011 instance search task. The contours indicate the specific object, person, or location to be searched. The instances from left to right, top to bottom are: setting sun, upstairs inside the windmill, fork, trailer, newsprint balloon, tortoise, Mrs Clark, American flag.

The instance search task is more difficult than problems such as near-duplicate detection, because it must consider the same instance appearing in other scenes (e.g. the same fork on a different plate). It also differs from general object classification, as it must identify the specific object rather than its category. For example, the first query in Fig. 1, row two, is "the newsprint balloon", which does not include the other balloons. The challenge of the instance search task is that the specific instance must often be found in a large, unlabelled database, which makes it difficult to apply supervised algorithms [5].

Existing instance search systems focus mainly on efficient feature extraction and quantization [6]. Many use the Bag-of-Visual-Words (BOVW or BOW) model based on interest point descriptors. Mansencal et al. [7] apply Speeded Up Robust Features (SURF), then cluster the database into 16K clusters via the K-means algorithm. This approach works well for objects with clear texture and for locations containing many interest points; without these, however, it performs poorly. Zhao et al. [8] propose a combination of an HSV histogram, RGB moment features, BOW of three types of interest points (i.e. SIFT, SURF, CSIFT), Gabor features, an edge histogram, LBP, and HOG features. Such a combination can overcome the deficiencies of a single feature to some extent. However, it often needs careful tuning via supervised learning (e.g. AdaBoost) to ensure the features
are selected and combined correctly. This is particularly difficult for the instance search problem, where no labels are available.

An alternative solution is to replace the "hand-designed" feature with a "self-taught" feature. This is motivated by studies of human visual neurons [1, 9, 10, 11]. Specifically, features are learned from unlabelled natural images in an unsupervised process; examples include the Deep Belief Net (DBN) [12], (some versions of) Independent Component Analysis (ICA) [1], and sparse coding [13]. Among these models, ICA and its extensions resemble the structure of the receptive fields in the primary visual cortex, and are commonly applied to detect salient regions via the entropy change of filtered pixel intensities [9]. The derived ICA saliency map has been applied to near-duplicate detection, as it is simpler to measure in the log-Euclidean Riemannian metric than traditional features like SIFT [14]. A similar ICA feature is also employed by Kanan and Cottrell to classify faces, flowers, and objects in different categories [10]. However useful ICA features may be, the theoretical assumption of independence is generally untrue for real image data [1]; it is therefore important to consider the relations between ICA components.

In this paper, we introduce an approach based upon Independent Subspace Analysis (ISA), which is an extension of ICA. Specifically, ISA filters are trained with an extra layer representing "subspaces", while the subspaces themselves are learned as standard ICA components. This makes ISA features more robust to local translation than the ICA features used in [10, 14], as ISA responses are less sensitive to variations within neighbourhoods. Our ISA feature differs from Santos et al. [15], as we use colour and do not restrict the training images to a single category. It also differs from Le et al. [11], who only use ISA responses to detect moving edges through time. In our system, we compute the responses of ISA filters in coarse grids to extract a feature descriptor for each image, then classify the input image according to the L1 distance between the feature descriptors of the input and gallery images. The accuracy of our approach is evaluated on the TRECVID 2011 instance search dataset, while the filters are learned on another dataset distinct from the system. We show that ISA filters yield discriminative features for instances in different categories, and work well for deformable objects such as animals and balloons. Compared to the two methods mentioned earlier (SURF [7] and Combo [8]), our proposed sparse filter approach achieves higher overall accuracy and shows robustness across object categories.

The contribution of this work is the use of a novel Independent Subspace Analysis model for image instance search in a large video database. Despite the growing interest in learning natural image statistics (e.g. sparse coding, ICA, and its extensions), our approach is among the small number of attempts made to apply sparse filters to image classification, matching, recognition, and retrieval tasks. We demonstrate that the "self-taught" ISA feature, learned from the structure of existing images, can be applied to replace fixed
features like SURF. It is much less subjective than a "hand-tuned" combination of features and is also simpler to compute. Our approach is capable of fast instance search in a large image or video database, as feature extraction takes 0.08 seconds per image on a PC with an ordinary configuration.

2. PROPOSED METHOD

The proposed approach consists of three steps. Firstly, ISA filters are learned from a database of unlabelled natural images by maximizing the sparseness in subspaces. Secondly, we generate a stack of filter responses using the learned ISA filters; the average responses of the ISA filters in 1 × 1 and 2 × 2 grids are concatenated to form a descriptor vector for each image. Finally, we use a K-Nearest-Neighbour classifier to retrieve a small subset of the image database as a rank-list from 1 to K.

2.1. Learning ISA Filters

The generative model of ISA is:

$x = As$   (1)
where $x$ is a vectorised image patch at a random location, $A = [a_1, \ldots, a_m]$ is the collection of bases, and $s = [s_1, \ldots, s_m]^T$ contains the coefficients corresponding to the individual basis vectors $a_i = [a_{1i}, \ldots, a_{mi}]^T$. If $A$ is restricted to be invertible, yielding filters $W = A^{-1}$, we get $s = Wx$. Then $s_i$ can be regarded as the filter response for the corresponding $w_i$.

To learn the filters $W$, we extend the Independent Subspace Analysis of [1] to 3-channel colour images. An image patch is represented by the pixel intensities of its R, G, B channels concatenated into a single vector. A large number of image patches of size $b \times b$ are selected at random locations from each image and stored as column vectors in a matrix $X$. The pre-processing steps for ISA learning are similar to classic ICA: each patch vector is normalised to zero mean and unit length; then we apply a whitening transformation to the columns of the patch collection matrix $X$ and reduce the column dimension from $m = 3 \times b^2$ to $d$ by Principal Component Analysis (PCA). This ensures that the rows of $X$ are orthogonal in the principal subspaces and yields $d$ (reduced) filters after maximizing the sparseness of the filter responses in the independent subspaces. The filter matrix $W = [w_1, \ldots, w_d]^T$ is initialized as a $d \times d$ matrix of random values to yield the response matrix $S = WX$, where $S = [s_1, \ldots, s_d]^T$. A layer of subspaces is created by measuring the energy of neighbouring components:

$w_g = \sqrt{\sum_{i \in S_g} s_i^2}$   (2)
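As a concrete illustration of the patch collection and whitening steps, the following minimal NumPy sketch collects random colour patches, normalises each patch vector, and applies PCA whitening. All function and variable names are ours, not the paper's implementation.

```python
import numpy as np

# A minimal sketch of the Sec. 2.1 pre-processing (illustrative names):
# collect random b x b colour patches, normalise each patch vector to zero
# mean and unit length, then PCA-whiten and reduce from m = 3*b^2 to d.
def collect_whitened_patches(images, b=8, patches_per_image=1000, d=128, seed=0):
    rng = np.random.default_rng(seed)
    cols = []
    for img in images:                              # img: H x W x 3 float array
        h, w, _ = img.shape
        for _ in range(patches_per_image):
            r = rng.integers(0, h - b + 1)
            c = rng.integers(0, w - b + 1)
            cols.append(img[r:r + b, c:c + b, :].reshape(-1))  # R,G,B concatenated
    X = np.stack(cols, axis=1)                      # shape (3*b^2, n_patches)
    X = X - X.mean(axis=0)                          # zero mean per patch vector
    X = X / (np.linalg.norm(X, axis=0) + 1e-8)      # unit length per patch vector
    cov = X @ X.T / X.shape[1]
    eigval, eigvec = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:d]              # keep d principal components
    V = eigvec[:, top] / np.sqrt(eigval[top] + 1e-8)  # whitening matrix, (m, d)
    return V.T @ X, V                               # whitened patches (d, n), and V
```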
We then learn independent feature subspaces by maximizing the sparseness of the filter responses via gradient ascent of the log-likelihood. The stochastic gradient ascent can be represented as:

$\Delta w_i \propto X \langle w_i, X \rangle \, h'\!\left( \sum_{S_g} \langle w_g, X \rangle^2 \right)$   (3)

where $S_g$ is the index set of the subspace that $w_i$ belongs to, and $\langle w_i, X \rangle$ denotes the dot product. $h'$ is the derivative of a convex function $h(u)$. Here, we use $h(u) = -\sqrt{u}$ and $h'(u) = -\frac{1}{2}cu^{-1/2}$, where the constant $\frac{1}{2}c$ can be ignored. To speed up convergence, we constrain the vectors $w_i$ to be orthogonal and of unit length. The learned $W$ can be projected back to its original column space as $\hat{W}$ by applying the whitening matrix generated in the PCA step.
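A minimal sketch of one update of Eq. (3) is given below. It assumes a whitened patch matrix X (d × N), subspaces formed by consecutive components, and symmetric orthogonalization to enforce the orthogonality and unit-length constraint; all names are illustrative.

```python
import numpy as np

# One sparseness-maximizing step of Eq. (3), assuming subspaces are groups
# of `group_size` consecutive components and h(u) = -sqrt(u).
def isa_gradient_step(W, X, group_size=2, lr=0.1):
    d = W.shape[0]
    S = W @ X                                    # filter responses, (d, N)
    # Energy of each subspace, shared by all components in the group.
    E = (S ** 2).reshape(d // group_size, group_size, -1).sum(axis=1)
    h_prime = -0.5 / np.sqrt(E + 1e-8)           # h'(u) for each subspace
    G = np.repeat(h_prime, group_size, axis=0)   # broadcast back to d rows
    W = W + lr * (S * G) @ X.T / X.shape[1]      # Δw_i ∝ x <w_i, x> h'(Σ ...)
    # Symmetric orthogonalization keeps the w_i orthogonal and unit-length.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt
```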
The shapes of the learned ISA filters are similar to Gabor filters (as shown in Fig. 2). There are two visible properties in these filters: 1) they are mostly achromatic bases, followed by blue-yellow and finally red-green ones, as these are the principal axes that encode chromatic fluctuations [16]; 2) filters are related to their neighbours, depending on the size of the subspaces initialized during learning.

Fig. 2. During training, the components are grouped as subspaces. Shown are 16 selected components of the trained ISA filters: (1) ISA filters with group size 2, where 2-neighbourhood components are grouped together; (2) ISA filters with group size 4, where 4-neighbourhoods are grouped together.

2.2. Feature Extraction

We compute a stack of $d$ filter responses for an input image, then apply spatial pooling in coarse grids of $1 \times 1$ and $2 \times 2$. The average filter response in each grid cell is $\bar{s} = \frac{1}{N} \sum_{j=1}^{N} \hat{W} x_j$, where $N$ image patch vectors $x_j$ fall inside the cell. Here, $\bar{s} = [\bar{s}_1, \bar{s}_2, \ldots, \bar{s}_d]$ is a $d$-dimensional vector corresponding to the $d$ filters. The average filter responses are collected at all grid levels and concatenated into a single vector of $(1 + 4) \times d$ values, which serves as the feature descriptor of the input image. Using filter responses as features can be regarded as a histogram-like approach, where the values in the histogram bins are the probabilities of filter activities averaged across the image. Thus, the proposed approach is invariant to shift and rotation changes. It is also resistant to illumination change, because the average brightness of an image is usually separated from the other feature components during PCA whitening [1].
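The pooling step can be sketched as follows; `cell_patches` is a hypothetical helper that returns the vectorised patches falling in a given grid cell, and all names are ours.

```python
import numpy as np

# Minimal sketch of the Sec. 2.2 descriptor: average the d ISA filter
# responses over a 1x1 grid and a 2x2 grid, then concatenate into one
# (1 + 4) * d vector. `cell_patches(img, gx, gy, grid)` is hypothetical
# and is assumed to return the (3*b^2, N) patch vectors in one cell.
def isa_descriptor(img, W_hat, cell_patches):
    feats = []
    for grid in (1, 2):                              # spatial pyramid levels
        for gy in range(grid):
            for gx in range(grid):
                Xc = cell_patches(img, gx, gy, grid)
                s_bar = (W_hat @ Xc).mean(axis=1)    # average response per filter
                feats.append(s_bar)
    return np.concatenate(feats)                     # 5 * d feature histogram
```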
2.3. Classification

We use a K-Nearest-Neighbour (KNN) classifier based on the absolute difference ($L_1$) between the descriptor vectors of the query image and each database image. The system returns a ranking list of all image pairs sorted by distance in ascending order.
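A minimal sketch of this ranking step, assuming descriptors are stacked row-wise in a NumPy array (the names are ours):

```python
import numpy as np

# L1 ranking as in Sec. 2.3: `gallery` holds one descriptor per database
# image, stacked row-wise; `query_desc` is a single descriptor vector.
def rank_by_l1(query_desc, gallery, k=100):
    dists = np.abs(gallery - query_desc).sum(axis=1)  # L1 distance per image
    order = np.argsort(dists)                         # ascending distance
    return order[:k], dists[order[:k]]                # top-K rank list
```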
3. EXPERIMENTS

The proposed approach is evaluated on the instance search dataset of TRECVID 2011¹, while the set of ISA filters is trained on another dataset, BSD300². In this section, we first introduce these datasets and the parameter settings used in our experiments. Then, we compare the performance of the proposed ISA filters to ICA filters, as well as to approaches in the literature using Bag-of-Words on SURF features [7] and a combination of HSV histogram, RGB moment, SIFT, SURF, CSIFT, Gabor, edge histogram, LBP and HOG features [8].
3.1. Dataset and parameter settings

The TRECVID 2011 instance search task is evaluated on 20982 video clips with frame size 352 × 288 in the test dataset. Each video is a short clip lasting less than 1 minute, around 30 seconds on average. This amounts to around 150 hours of video occupying 18.8 GB of disk space as the test database. There are 25 query topics containing 2 to 5 images per query, representing three categories: person, location, and object (examples are shown in Fig. 1). Performance is evaluated as Mean Average Precision (MAP) per query as well as over all queries, using the labelled TRECVID ground truth.

Filter learning: ISA filters are learned on the 300 natural images of the BSD300 dataset, so training can be prepared separately from TRECVID 2011. We resize the images in BSD300 to a tiny size (60 × 40), then randomly extract 500000 patches of size 8 × 8 from the whole dataset. PCA dimension reduction is applied to reduce the patch collection matrix's column dimension from 192 (8 × 8 × 3) to 128.

Feature extraction: As a 2-level spatial pyramid of 1 × 1 and 2 × 2 is applied, each image yields a 128 × 5 = 640-bin feature histogram. During testing, the query images and the images in the video dataset are also downsampled to a tiny size of 88 × 72 using Gaussian pyramids. There are two reasons for using smaller images instead of the original sizes: 1) patches collected from tiny images are more likely to contain junctions of edges and irregular shapes, instead of the straight lines and empty patches found in larger images; 2) memory usage while searching a large image/video dataset is smaller, and feature extraction and comparison time is greatly reduced. A simple frame-skip approach that processes every fifth frame is applied to reduce the frame rate. We process the entire image rather than the segmented area, even though masks are given for the queries, because the surrounding context of an object is often important for recognition and should not be neglected [17].

¹ http://www-nlpir.nist.gov/projects/tv2011/tv2011.html
² http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
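The test-time downsampling and frame skipping described above can be sketched with OpenCV as below. The paper states Gaussian pyramids are used; this sketch assumes OpenCV's pyrDown, where two successive reductions take a 352 × 288 frame to 88 × 72. Names are ours.

```python
import cv2

# Keep every fifth frame and downsample twice with a Gaussian pyramid:
# 352x288 -> 176x144 -> 88x72, matching the tiny size in Sec. 3.1.
def tiny_frames(path, skip=5):
    cap = cv2.VideoCapture(path)
    idx, frames = 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % skip == 0:
            small = cv2.pyrDown(cv2.pyrDown(frame))  # 88 x 72 tiny image
            frames.append(small)
        idx += 1
    cap.release()
    return frames
```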
Fig. 3. Results compared to the method using a combination of HSV histogram, RGB moment, SIFT, SURF, CSIFT, Gabor, edge histogram, LBP and HOG features [8], labeled "Combo", and the method using BoW on SURF features [7], labeled "SURF". The query IDs in the TRECVID dataset are originally "9023" to "9047"; we use the shortened forms "23" to "47" here.

Group Size   1 (ICA)   2         4         8
MAP (avg)    24.95%    38.46%    36.42%    36.03%

Table 1. Results of ISA with different group sizes.
Fig. 4. An example of retrieval results for query topic 41, "monkey". The first row shows the 5 query images; the second and third rows show retrieval results from rank 1 to rank 16.

3.2. Results and evaluation

The Mean Average Precision (MAP) of the proposed approach is 38.46% under the current experiment settings. The results show that using subspace constraints is beneficial for image matching compared to ICA (equivalent to subspace size 1), as shown in Table 1. The best MAP is achieved with subspace size 2, and it is slightly lower when the group size is increased to 4 or 8.

The most significant improvement of our approach is the retrieval accuracy for deformable objects, such as animals ((35) tortoise, (41) monkey), balloons ((33) newsprint balloon, (36) yellow balloon, (47) airplane balloon), and humans ((38) female presenter, (40) Linda Robson, (46) grey-haired lady). A visual example of a search result based on our ISA filter approach is shown in Fig. 4. From the five input query images, we are able to locate the same object in a large test set of 20982 videos under illumination and poses that may differ from those of the original input.

We compare the proposed system to two recently proposed approaches, Mansencal et al. [7] and Zhao et al. [8]. Mansencal et al. apply the standard SURF feature with a BOW of 16K clusters; this method is denoted "SURF". Zhao et al. apply a combination of a 225-bin RGB colour moment, a 145-bin edge histogram, a 256-bin LBP histogram, a 2520-bin HOG histogram, 3-scale and 6-direction Gabor features, an HSV colour histogram, and BOW of standard SIFT, SURF, and CSIFT features with 1000 visual words; this method is denoted "Combo". As shown in Fig. 3, our proposed method performs better than both "SURF" and "Combo" on most queries. We achieve the best average MAP of 38.46%, compared with SURF (23.74%) and Combo (36.29%). However, without pre-tuning, we have no advantage over the combination of features [8] on the following query topics: (24) upstairs inside the windmill, (28) plane, (29) downstairs, and (32) staircase inside the windmill. We outperform SURF on all topics except query topic (43) Mrs Clark, as its indoor scenario contains many interest points, which favours the SURF feature.

The computational cost of extracting ISA filter responses is that of filter convolution, which can be computed in O(n log n) time. Our implementation uses the OpenCV library³. Feature extraction on the entire test dataset of 150 hours of video takes 0.08 seconds per image, around 7 hours in total, running on an ordinary PC (Intel Core i3, 2.4 GHz, 4 GB RAM). The search time per query is about 1 minute on the same hardware.
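For reference, the MAP metric reported above can be computed as in the following sketch; the data-handling names are ours, not those of the official TRECVID scoring tool.

```python
# Average precision of one ranked list against a set of relevant video IDs.
def average_precision(ranked_ids, relevant):
    hits, precision_sum = 0, 0.0
    for i, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precision_sum += hits / i        # precision at each relevant hit
    return precision_sum / max(len(relevant), 1)

def mean_average_precision(runs):
    # `runs` is a list of (ranked_ids, relevant_set) pairs, one per topic.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```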
4. CONCLUSIONS

In this paper, we have proposed a biologically inspired approach based on sparse filters learned in subspaces. Experiments on the TRECVID 2011 instance search dataset show that our approach outperforms existing systems using traditional hand-designed features, while also achieving high efficiency when executing image queries over a large video database.
3 http://opencv.willowgarage.com/
5. REFERENCES

[1] A. Hyvärinen, J. Hurri, and P.O. Hoyer, Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, vol. 39, Springer, 2009.

[2] Josef Sivic and Andrew Zisserman, "Video Google: A text retrieval approach to object matching in videos," in IEEE International Conference on Computer Vision, 2003, pp. 1470–1477.

[3] B.A. Olshausen et al., "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607–609, 1996.

[4] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in IEEE International Conference on Computer Vision, 2009, pp. 221–228.

[5] G. Carneiro, A.B. Chan, P.J. Moreno, and N. Vasconcelos, "Supervised learning of semantic classes for image annotation and retrieval," Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, 2007.

[6] P. Over, G. Awad, J. Fiscus, A.F. Smeaton, W. Kraaij, and G. Quénot, "TRECVID 2011 – An overview of the goals, tasks, data, evaluation mechanisms and metrics," in Proceedings of TRECVID 2011, NIST, USA, Dec 2011.

[7] B. Mansencal, J. Benois-Pineau, R. Vieux, and J.-P. Domenger, "Search of objects of interest in videos," in International Workshop on Content-Based Multimedia Indexing, 2012, pp. 1–6.

[8] Z. Zhao, Y. Zhao, X. Guo, et al., "BUPT-MCPRL at TRECVID 2011," in TRECVID, 2011.

[9] X. Hou and L. Zhang, "Dynamic visual attention: Searching for coding length increments," Advances in Neural Information Processing Systems, vol. 21, pp. 681–688, 2008.

[10] C. Kanan and G. Cottrell, "Robust classification of objects, faces, and flowers using natural image statistics," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2472–2479.

[11] Q.V. Le, W.Y. Zou, S.Y. Yeung, and A.Y. Ng, "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3361–3368.

[12] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in International Conference on Machine Learning, ACM, 2009, pp. 609–616.

[13] H. Lee, A. Battle, R. Raina, and A.Y. Ng, "Efficient sparse coding algorithms," Advances in Neural Information Processing Systems, vol. 19, pp. 801, 2007.

[14] L. Zheng, Y. Lei, G. Qiu, and J. Huang, "Near-duplicate image detection in a visually salient Riemannian space," Transactions on Information Forensics and Security, vol. 7, no. 5, pp. 1578–1593, Oct 2012.

[15] Carlos Silva Santos, Joao Eduardo Kogler Jr., and Emilio Del Moral Hernandez, "Using independent subspace analysis to build independent spectral representations of images," in International Joint Conference on Neural Networks, IEEE, 2005, vol. 3, pp. 1860–1865.

[16] D.L. Ruderman, T.W. Cronin, and C.C. Chiao, "Statistics of cone responses to natural images: implications for visual coding," Journal of the Optical Society of America A, vol. 15, no. 8, pp. 2036–2045, 1998.

[17] C. Zhu and S. Satoh, "Large vocabulary quantization for searching instances from videos," in Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, 2012, pp. 52:1–52:8.