Efficient Multi-modal Retrieval in Conceptual Space∗
Jun Imura1, Teppei Fujisawa1, Tatsuya Harada1,2, Yasuo Kuniyoshi1
1 Grad. School of Information Science and Technology, The University of Tokyo, 2 JST PRESTO
{imura,fujisawa,harada,kuniyosh}@isi.imi.i.u-tokyo.ac.jp
∗Area chair: Xian-Sheng Hua
ABSTRACT
In this paper, we propose a new, efficient retrieval system for large-scale multi-modal data including video tracks. With large-scale multi-modal data, the huge data size and the diversity of contents degrade both the efficiency and the precision of retrieval. Recent research on image annotation and retrieval shows that image features based on the Bag-of-Visual-Words approach with local descriptors such as SIFT perform surprisingly well on large-scale image datasets. Such powerful descriptors tend to be high-dimensional, imposing a high computational cost on approximate nearest neighbor search in the raw feature space. Our video retrieval method therefore focuses on the correlation between image, sound, and location information recorded simultaneously, and learns a conceptual space describing the contents of the data to realize efficient searching. Experiments show good performance of our retrieval system with low memory usage and computational cost.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval Models

General Terms
Algorithms, Performance, Experimentation

Keywords
Generalized Canonical Correlation Analysis, Video Retrieval, Product Quantization
1. INTRODUCTION
In recent years, image and video sharing services (e.g., Flickr and YouTube) have become popular, and the popularization of smartphones is expected to accelerate the collection of personal life-log data pervasively in daily life. The amount of multimedia data available on the web will continue its rapid growth into the future. Several search algorithms have been proposed to process large amounts of image and video data. The most successful is the appearance-based approach, which is based on rapid similarity calculation over indexed low-level image features. However, a well-known semantic gap separates low-level image features from high-level image content: in general, semantic similarity differs from similarity in the low-level feature space. Because of this gap, the appearance-based approach has limitations for semantic search over large amounts of visual data, and the extraction of rich features from images is crucial for searching enormous amounts of data with diverse contents.

Recently, large-scale generic object recognition has become a rapidly evolving field in which many Bag-of-Visual-Words (BoVW) based approaches have been proposed, improving recognition and retrieval performance dramatically [1]. These features can represent diverse and complicated image contents; on the other hand, their dimensionality becomes huge, so the curse of dimensionality degrades similarity measures and increases computational costs. Proper evaluation of similarity requires selecting, from the high-dimensional space, the important elements that represent contents.

In generic object recognition, kernel methods are typically used to learn the relation between images and labels from a tagged image dataset. Kernel methods can represent a nonlinear manifold in high-dimensional space and are known to provide superior performance. However, their computational and memory complexity is no less than O(N²), where N is the training set size, so it is nontrivial to apply them to real-world data whose size is far beyond millions. Furthermore, constructing a tagged image dataset is impractical for the large amounts of real-world data collected by individuals, because tagging such huge amounts of data is labor intensive.

Other modality information collected simultaneously with images through other sensors (e.g., audio and GPS) is expected to represent contents that the image cannot capture by itself; effective utilization of multi-modal information is therefore a key issue for practical large-scale semantic data search. In summary, four points are important for practical large-scale semantic data search: 1) rich image features representing diverse contents, 2) extraction of crucial elements from high-dimensional data for proper semantic similarity calculation, 3) avoidance of labor-intensive dataset construction, and 4) scalability to large-scale real-world data.

As described in this paper, we propose a new efficient retrieval system for large-scale multi-modal data including video tracks, designed to address the four key issues above. In particular, our retrieval method focuses on the correlation between image, sound, and location information recorded simultaneously, and learns a conceptual space that describes the contents of the data to realize efficient searching. Moreover, the training complexity of learning the conceptual space from the large-scale multi-modal data is O(N), without high-cost manual tagging, while achieving even better search performance than the traditional appearance-based approach.
Figure 1: Illustration of the proposed multi-modal retrieval system
2. RELATED WORK
In recent research on image retrieval, several efficient methods based on compressing and encoding powerful image descriptors have been proposed. [6] developed a rapid similar-image retrieval method using a Fisher Vector compressed by Principal Component Analysis (PCA) and encoded with Locality Sensitive Hashing [2], Spectral Hashing [9], or α = 0 binarization. The Fisher Vector is a descriptor computed by applying a Fisher Kernel [3] to Gaussian Mixture Models (GMM) in the feature space of a local descriptor; it can be regarded as an extension of BoVW that uses the first-order and second-order statistics of the local descriptors. Its dimension is D = (2d + 1) × N − 1, or D = N × d when only the mean vectors are used, where N is the number of Gaussians and d is the dimension of the local descriptor. In this method, retrieval results are computed using the Hamming distance between compact codes.

[5] proposed a powerful descriptor, the Vector of Locally Aggregated Descriptors (VLAD), which approximates the GMM of the Fisher Vector by hard quantization of local descriptors using k-means. It is a D = N × d dimensional vector corresponding to the mean-vector components of the Fisher Vector. By encoding the principal components of VLAD with Product Quantization (PQ) [4], [5] achieved highly accurate and rapid similar-image retrieval under a severe memory budget.

In these studies, PCA is used as the dimensionality reduction that cuts off noisy components to avoid the curse of dimensionality and to reduce the spatial and temporal complexity of the approximate nearest neighbor (ANN) search used for retrieval. However, PCA projection is based only on the variance in the raw feature space, so it can hardly narrow the semantic gap: the projection might discard components that are necessary to describe the contents of the data, which degrades retrieval accuracy.

[7] proposed learning the semantic space of bi-modal data (images on the internet and the texts linked to them) using CCA, and performing cross-modal retrieval in that space. Using data from more than one modality lets us exploit the structure between the modalities in addition to the sample distribution within each feature space. This method is scalable, learns a space that represents contents efficiently, and requires no labeling effort. Such an approach, which estimates latent topics from bi-modal data and learns a necessary representation of the contents, is expected to be applicable not only to images and text on the internet but also to large-scale multi-modal data in the real world.
3. APPROACH
In this section, we present a new method for efficient multi-modal retrieval. Using other modalities collected together with images is a reasonable way to enrich the features representing contents; moreover, these multi-modal data are mutually related through their contents. Therefore, we first extract features expected to relate strongly to the contents, and then learn the crucial elements of the high-dimensional features based on the structure between the modalities. In this work we use three modalities: environmental sounds from audio tracks, locational categories from GPS, and visual features from video tracks.
3.1 Canonical Correlation Analysis
For scalability to large-scale data collection, the learning method must be simple and effective. We specifically exploit the correlation of multi-modal data and use Canonical Correlation Analysis (CCA) to learn the conceptual space representing the contents of the multi-modal data. CCA is a linear dimensionality reduction method that maximizes the correlation between datasets. Because the original CCA is designed for two datasets, Generalized Canonical Correlation Analysis (GCCA) [8] is used to handle our three modalities. As illustrated in Figure 1, after extracting features from each modality, the linear projection is calculated using GCCA.

Let $X_k \in \mathbb{R}^{N \times D_k}$ be the centered feature matrix of each modality and $R_{kl} = X_k^T X_l$ their correlation matrices, where $N$ is the number of samples, $D_k$ is the feature dimension, and $k, l$ index the modalities ($k = 1$ for image, $k = 2$ for sound, and $k = 3$ for location). The canonical vectors $h_k$ are calculated so as to maximize the sum of the correlations $\rho$ between the canonical components $z_k = X_k h_k$ under the normalization $\frac{1}{3} \sum_{k=1}^{3} h_k^T R_{kk} h_k = 1$:
$$\operatorname*{arg\,max}_{h_1, h_2, h_3} \rho = \frac{1}{6} \sum_{\substack{k,l=1 \\ k \neq l}}^{3} z_k^T z_l = \frac{1}{6} \sum_{\substack{k,l=1 \\ k \neq l}}^{3} h_k^T R_{kl} h_l. \qquad (1)$$
The stacked canonical vector $h = [h_1^T \; h_2^T \; h_3^T]^T$ is calculable by solving the following single Generalized EigenValue (GEV) problem:
$$\frac{1}{2} \begin{bmatrix} 0 & R_{12} & R_{13} \\ R_{21} & 0 & R_{23} \\ R_{31} & R_{32} & 0 \end{bmatrix} h = \rho \begin{bmatrix} R_{11} & 0 & 0 \\ 0 & R_{22} & 0 \\ 0 & 0 & R_{33} \end{bmatrix} h. \qquad (2)$$
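To make the learning step concrete, here is a minimal NumPy/SciPy sketch of solving (2); this is an illustration under our own assumptions, not the authors' implementation. It already adds the small ridge βI to the diagonal blocks that the next paragraph introduces for numerical stability.

import numpy as np
from scipy.linalg import eigh

def gcca(X1, X2, X3, dim, beta=1e-4):
    """Solve the GEV problem (2) for three feature matrices X_k (N x D_k).
    Returns one (D_k x dim) projection matrix per modality."""
    Xs = [X - X.mean(axis=0) for X in (X1, X2, X3)]   # center each modality
    Ds = [X.shape[1] for X in Xs]
    offs = np.concatenate(([0], np.cumsum(Ds)))
    A = np.zeros((offs[-1], offs[-1]))   # left-hand side: (1/2) R_kl blocks
    B = np.zeros_like(A)                 # right-hand side: R_kk + beta*I blocks
    for k in range(3):
        for l in range(3):
            Rkl = Xs[k].T @ Xs[l]
            blk = (slice(offs[k], offs[k + 1]), slice(offs[l], offs[l + 1]))
            if k == l:
                B[blk] = Rkl + beta * np.eye(Ds[k])
            else:
                A[blk] = 0.5 * Rkl
    # eigh solves A h = rho B h with eigenvalues in ascending order,
    # so the strongest correlations are the last columns.
    _, vecs = eigh(A, B)
    h = vecs[:, -dim:][:, ::-1]
    return [h[offs[k]:offs[k + 1]] for k in range(3)]

Projecting each modality as z_k = X_k h_k with the returned matrices then yields the canonical components used below.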
For numerical stability, regularization terms βI can be added to the correlation matrices R_kk, where β is a small positive value. As many canonical variables are obtained as the minimum of the three feature dimensions; a low-dimensional conceptual space is then obtained by keeping the canonical components from the one with the largest correlation down to an arbitrary dimension. Because the complexity of this GEV problem depends not on the number of learning samples but only on the total dimension of the features, this learning of the linear projection to the conceptual space is readily applicable to large-scale data.

For retrieval, the vectors in the canonical space thus calculated are encoded by Product Quantization [4] and stored in RAM. PQ is a quantization method designed for efficient ANN search in high-dimensional space. It divides each feature vector into several lower-dimensional subvectors and applies k-means to each subvector, so that every feature vector is represented by a small code giving the combination of nearest centroids of its subvectors. When a query feature vector is given, the system first calculates the distances from the query subvectors to all centroids in each subspace, then estimates the distance between the query and each learning sample as the sum of these pre-calculated subvector-to-centroid distances (Asymmetric Distance Computation, ADC). The system supports both image queries and multi-modal queries: the distance for ANN search is the L2 distance for an image query, and the sum of the L2 distances over all spaces for a multi-modal query.
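As a concrete illustration, the following is our own sketch of such a product quantizer with ADC, not the implementation of [4]; it assumes SciPy's k-means, squared L2 distances, and a vector dimension divisible by the number of subvectors.

import numpy as np
from scipy.cluster.vq import kmeans2, vq

def pq_train(X, m=8, ks=256):
    """Learn ks centroids independently in each of the m subspaces
    obtained by splitting the D-dim vectors (D divisible by m)."""
    return [kmeans2(sub.astype(np.float64), ks, minit='++')[0]
            for sub in np.split(X, m, axis=1)]

def pq_encode(X, codebooks):
    """Represent each vector by its nearest-centroid index in every
    subspace: m bytes per vector when ks <= 256."""
    subs = np.split(X.astype(np.float64), len(codebooks), axis=1)
    codes = [vq(s, C)[0] for s, C in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def adc(query, codes, codebooks):
    """Asymmetric Distance Computation: precompute squared distances from
    each query subvector to all centroids, then score every stored code
    by table lookups and summation."""
    qsubs = np.split(query.astype(np.float64), len(codebooks))
    tables = [((C - q) ** 2).sum(axis=1) for q, C in zip(qsubs, codebooks)]
    return sum(t[codes[:, j]] for j, t in enumerate(tables))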
3.2 Features for each modality
Selecting appropriate features is an important part of designing the system. Because we use linear GCCA for the sake of scalability, the features must be properly comparable under a linear similarity metric. Fast feature extraction is also important for an acceptable response time of the retrieval procedure. For these purposes, we use VLAD for image features, Bag-of-Audio-Words (BoAW) for sound features, and a 1-of-k vector representation for location features.

As an image descriptor, VLAD estimates the distribution of local descriptors through hard k-means clustering, so quantization of local descriptors and image description are conducted efficiently (a sketch of the aggregation closes this subsection). From the perspective of complexity and performance, SURF128 features are densely sampled at points on a grid. To remove trivial elements and allow a proper evaluation of the correlation, the VLAD is compressed to a still relatively high dimension (e.g., 1024) using PCA before solving GCCA.

Environmental sounds can reasonably be expected to relate strongly to contents. To represent continuous environmental sound, 100 samples of a 39-dimensional Mel-Frequency Cepstral Coefficients (MFCC) feature are extracted from each 1 s of the audio track; the MFCC vectors are then quantized and pooled into a histogram of the 100 vectors per second, just as in the BoVW method. This BoAW feature ignores short-term temporal changes in the frequency domain.

The feature for locational information is designed as a 1-of-k vector representation, where each dimension corresponds to one locational category (e.g., park, campus, street). Recent recording devices can easily attach location information to a video clip, so information on the category of a location is readily obtained. Through the GCCA projection of these features, the low-level multi-modal features are mapped into a conceptual space representing contents; thereby, we obtain semantically important information from bottom-up features.
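As referenced above, a minimal sketch of the VLAD aggregation; this is our own simplification of [5], the final L2 normalization is an assumption, and the PCA compression would be a separate step.

import numpy as np

def vlad(descriptors, centroids):
    """Aggregate local descriptors (n x d) into an (N*d)-dim VLAD vector:
    the per-centroid sum of residuals of the descriptors assigned to it."""
    N, d = centroids.shape
    # Hard-assign each descriptor to its nearest k-means centroid.
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    v = np.zeros((N, d))
    for k in range(N):
        members = descriptors[assign == k]
        if len(members) > 0:
            v[k] = (members - centroids[k]).sum(axis=0)
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)   # assumed L2 normalization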
4. EXPERIMENTS
In this section, we present experimental results for our method. First, we study the effect of CCA on image descriptors as a preliminary experiment; then we evaluate the proposed system on our own dataset for multi-modal data retrieval.
4.1 Effects of CCA
To confirm that CCA can find the essential elements representing contents, we conducted an experiment applying CCA to image features and the category labels of the images, which are considered to have the strongest relation to the contents. We used a subset of Caltech256 (about 11,000 images in 100 categories), with 80% of the data for learning and the rest for evaluation. BoVW and VLAD over grid-sampled SIFT were extracted as image features so that the effect of CCA could be studied on different descriptors. The canonical space, regarded as the conceptual space, is obtained by solving CCA between the image features and the category labels given in a 1-of-k vector representation. The evaluation is based on recognition accuracy under the nearest neighbor rule applied to image features in this canonical space; PCA of BoVW and VLAD is evaluated for comparison.

Figure 2: Effect of CCA and PCA on image features of the Caltech256 dataset

The results are shown in Figure 2. VLAD with CCA shows a large improvement in performance over VLAD with PCA. This indicates that CCA can effectively select important elements, and that the proposed system can obtain a conceptual space representing semantic information by utilizing multi-modal information.
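As a hedged illustration of this preliminary experiment (not the authors' code; the data below are random stand-ins for the Caltech256 features and labels), two-set CCA against 1-of-k labels followed by the nearest neighbor rule could look like this with scikit-learn:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))           # stand-in image features
labels = rng.integers(0, 100, size=1000)   # stand-in category indices
Y = np.eye(100)[labels]                    # 1-of-k label vectors

cca = CCA(n_components=30, scale=False).fit(X[:800], Y[:800])
Z_tr, Z_te = cca.transform(X[:800]), cca.transform(X[800:])
# Nearest neighbor rule in the canonical space of the image features.
nn = ((Z_te[:, None, :] - Z_tr[None, :, :]) ** 2).sum(-1).argmin(1)
accuracy = (labels[800:] == labels[:800][nn]).mean()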
4.2 Multi-modal Dataset
We built a multi-modal dataset for evaluating retrieval performance; example images from it are portrayed in Figure 3. The data were collected using a smartphone camera and microphone (640 × 480 resolution, 44.1 kHz monaural sound). The dataset includes about 2,100 frames sampled at 1.0 fps from about 35 min of video clips, and features of each modality are extracted from each frame.

For image feature extraction, each frame is first downsampled to 320 × 240 pixels; the 128-dimensional SURF128 local descriptor is then calculated at every 10-pixel grid point. All of the approximately 1.5M SURF128 feature vectors are quantized using k-means++ into 64 clusters, yielding the 8192-dimensional VLAD. For sound features, 2.1M MFCC feature vectors are quantized into 32 clusters, yielding a 32-dimensional BoAW (a sketch of this extraction follows). Note that the locational features are provided manually for each video clip in this experiment.

Figure 3: Example images of the multi-modal dataset
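A hedged sketch of the BoAW extraction described in Sect. 3.2, assuming librosa for the MFCCs (13 coefficients plus Δ and ΔΔ to reach 39 dimensions is our assumption) and a precomputed codebook such as the 32 centroids above:

import numpy as np
import librosa
from scipy.cluster.vq import vq

def boaw_per_second(path, codebook, sr=44100):
    """Quantize 100 MFCC vectors per second against the codebook and
    pool them into one histogram per second of audio."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    hop = sr // 100                                    # 100 frames per second
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    feats = np.vstack([m, librosa.feature.delta(m),
                       librosa.feature.delta(m, order=2)]).T  # (frames, 39)
    feats = feats.astype(np.float64)
    hists = []
    for t in range(len(feats) // 100):                 # one histogram per 1 s
        codes, _ = vq(feats[100 * t:100 * (t + 1)], codebook)
        hists.append(np.bincount(codes, minlength=len(codebook)))
    return np.array(hists)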
4.3 Multi-modal Retrieval Performance
We evaluated the accuracy of our multi-modal retrieval system against a baseline based on the PCA of each feature. First, both systems were evaluated with a single-modal (image-only) query: queries are given as images, only image features are extracted, and retrieval results are ranked by the approximated L2 distance between canonical components or principal components. Second, with a multi-modal (image, sound, and location) query, all features are extracted and retrieval results are ranked by the sum of the approximated L2 distances in each space (see the sketch at the end of this subsection). The compressed dimension is a critical parameter for both accuracy and efficiency. Performance of the retrieval methods is measured by Mean Average Precision (MAP).

Figure 4: MAP vs. dimension with different query types

The results are shown in Figure 4. Our method, which uses a conceptual space composed through multi-modal correlation, outperforms the PCA baseline. It achieved its highest MAP of 0.69 at 30 dimensions with the multi-modal query, where the highest MAP of the PCA-based method is 0.45. Even when the query is given as an image only, the MAP of the proposed system outperformed the traditional approach.
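As mentioned above, a sketch of this ranking, reusing the hypothetical adc() and per-modality PQ codebooks from the sketch in Section 3.1 (summing squared distances is our simplification of the summed L2 distances):

import numpy as np

def rank(query_feats, codes, codebooks, topk=10):
    """query_feats, codes, and codebooks are parallel lists with one entry
    per modality; single-entry lists reproduce the image-only query.
    Samples are ranked by the sum of PQ-approximated distances."""
    scores = sum(adc(q, c, cb)                 # adc() from the PQ sketch
                 for q, c, cb in zip(query_feats, codes, codebooks))
    return np.argsort(scores)[:topk]           # indices of the top retrievals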
4.4 Computational cost
Evaluation was performed on dual quad-core Xeon processors @ 2.67 GHz (Intel Corp.). In the feature extraction step, local descriptor extraction took 10 min, quantization 44 s, and VLAD description 100 s; an additional 22 min were needed to load all SURF128 feature vectors from hard disk to RAM. In the dimensionality reduction step, loading all VLAD features took 110 s, solving the PCA of the image features 11 s, estimating the correlation matrices less than 1 s, and solving the GEV problem of CCA 14 s. It is noteworthy that the correlation matrices can be obtained even faster by incremental calculation when new samples are added to the learned dataset. The retrieval cost is estimated at 120 ms per query, including feature extraction.

5. CONCLUSION
As described in this paper, we have proposed an efficient retrieval system for large-scale multi-modal data based on the conceptual space calculated using Generalized Canonical Correlation Analysis of image, sound, and location features. Our method scales as O(N) for learning and obviates the cost of manual tagging. Experimental results for semantically similar data retrieval show that the proposed system performs better than the traditional appearance-based approach.
6. REFERENCES
[1] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV SLCV Workshop, 2004.
[2] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In ACM Symposium on Theory of Computing, 1998.
[3] T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In NIPS, 1998.
[4] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE PAMI, 33(1):117–128, 2011.
[5] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[6] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. In CVPR, 2010.
[7] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimedia retrieval. In ACM Multimedia, 2010.
[8] J. Vía, I. Santamaría, and J. Pérez. Canonical correlation analysis (CCA) algorithms for multiple data sets: Application to blind SIMO equalization. In EUSIPCO, 2005.
[9] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.