Linear Cross-Modal Hashing for Efficient Multimedia Search

Xiaofeng Zhu†, Zi Huang‡, Heng Tao Shen‡, Xin Zhao‡
† College of CSIT, Guangxi Normal University, Guangxi, 541004, P.R. China
‡ School of ITEE, The University of Queensland, QLD 4072, Australia
{zhux,huang,shenht}@itee.uq.edu.au, [email protected]

ABSTRACT
Most existing cross-modal hashing methods suffer from a scalability issue in the training phase. In this paper, we propose a novel cross-modal hashing approach with a linear time complexity to the training data size, to enable scalable indexing for multimedia search across multiple modals. Taking both the intra-similarity in each modal and the inter-similarity across different modals into consideration, the proposed approach aims at effectively learning hash functions from large-scale training datasets. More specifically, for each modal, we first partition the training data into k clusters and then represent each training data point with its distances to the k centroids of the clusters. Interestingly, such a k-dimensional data representation reduces the time complexity of the training phase from the traditional O(n^2) or higher to O(n), where n is the training data size, leading to practical learning on large-scale datasets. We further prove that this new representation preserves the intra-similarity in each modal. To preserve the inter-similarity among data points across different modals, we transform the derived data representations into a common binary subspace in which binary codes from all the modals are "consistent" and comparable. The transformation simultaneously outputs the hash functions for all the modals, which are used to convert unseen data into binary codes. Given a query of one modal, it is first mapped into binary codes using that modal's hash functions, followed by matching against the database binary codes of any other modal. Experimental results on two benchmark datasets confirm the scalability and the effectiveness of the proposed approach in comparison with the state of the art.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing Methods; H.3.3 [Information Search and Retrieval]: Search Process

Keywords
Cross-modal, hashing, index, multimedia search
1. INTRODUCTION

Hashing is increasingly popular for supporting approximate nearest neighbor (ANN) search over multimedia data. The idea of hashing for ANN search is to learn hash functions that convert high-dimensional data into short binary codes while preserving the neighborhood relationships of the original data as much as possible [13, 15, 21, 31]. It has been shown that hash function learning (HFL) is the key process for effective hashing [3, 12]. Existing hashing methods on single-modal data (referred to as uni-modal hashing methods in this paper) can be categorized into LSH-like hashing (e.g., locality sensitive hashing (LSH) [7, 8], KLSH [15], and SKLSH [21]), which randomly selects linear functions as hash functions; PCA-like hashing (e.g., SH [33], PCAH [30], and ITQ [10]), which uses the principal components of training data to learn hash functions; and manifold-like hashing (e.g., MFH [26] and [34]), which employs manifold learning techniques to learn hash functions.

More recently, some hashing methods have been proposed to index data represented by multiple modals1 (referred to as multi-modal hashing in this paper) [26, 36], which can be used to facilitate retrieval for data described by multiple modals in many real-life applications, such as near-duplicate image retrieval. Considering an image database where each image is described by multiple modals, such as SIFT descriptors, color histograms, bags-of-words, etc., multi-modal hashing learns hash functions from all the modals to support effective image retrieval, where the similarities from all the modals are considered in ranking the final results with respect to a multi-modal query. Cross-modal hashing also constructs hash functions from all the modals by analyzing their correlations. However, it serves a different purpose, i.e., supporting cross-modal retrieval, where a query of one modal can search for relevant results of another modal [2, 16, 22, 37, 38]. For example, given a query described by a SIFT descriptor, relevant results described by other modals such as color histograms and bags-of-words can also be found and returned2.
1 Modal, feature and view are often used with subtle differences in multimedia research. In this paper, we consistently use the term modal.
2 In this sense, cross-modal retrieval is defined more generally than traditional cross-media retrieval [35], where queries and results can be of different media types, such as text document, image, video, and audio.
[Figure 1 depicts the offline training process (training data of multiple modals, clustering into centroids, k-dimensional representations, hash function learning, database binary codes) and the online search process (a query image is converted into query binary codes via the learnt hash functions and matched against the database binary codes to return text results).]
Figure 1: Flowchart of the proposed linear cross-modal hashing (LCMH).

While few attempts have been made towards effective cross-modal hashing, most existing cross-modal hashing methods [16, 22, 27, 37, 38] suffer from high time complexity in the training phase (i.e., O(n^2) or higher, where n is the training data size) and thus fail to learn from large-scale training datasets in a practical amount of time. Such a high complexity prevents these methods from being applied to large-scale datasets. For example, multi-modal latent binary embedding (MLBE) [38] is a generative model such that only a small training dataset (e.g., 300 out of 180,000 data points) can be used in the training phase. Although cross-modal similarity sensitive hashing (CMSSH) [2] is able to learn from large-scale training datasets, it requires prior knowledge (i.e., positive pairs and negative pairs among training data points) to be predefined and known, which is not practical in most real-life applications. To enable cross-modal retrieval, inter-media hashing (IMH) [27] explores the correlations among multiple modals from different data sources and achieves better hashing performance, but the training process of IMH, with a time complexity of O(n^3), is too expensive for large-scale cross-modal hashing.

In this paper, we propose a novel hashing method, named linear cross-modal hashing (LCMH), to address the scalability issue without using any prior knowledge. LCMH achieves a linear time complexity to the training data size in the training phase, enabling effective learning from large-scale datasets. The key idea is to first partition the training data of each modal into k clusters by applying a linear time clustering method, and then represent each training data point using its distances to the k clusters' centroids. That is, we approximate each data point with a k-dimensional representation. Interestingly, such a representation leads to a time complexity of O(kn) for the training phase. Given a really large-scale training dataset, it is expected that k ≪ n. Since k is a constant, the overall time complexity of the training phase becomes linear to the training data size, i.e., O(n). To achieve high quality hash functions, LCMH also preserves both the intra-similarity among data points in each modal and the inter-similarity among data points across different modals. The learnt hash functions ensure that all the data points described by different modals in the common binary subspace are "consistent" (i.e., relevant data of different modals should have similar binary codes) and comparable (i.e., binary codes of different modals can be directly compared).

Fig. 1 illustrates the whole flowchart of the proposed LCMH. The training phase of LCMH is an offline process and includes five key steps. In the first step, for each modal we partition its data into k clusters. In the second step, we represent each training data point with its distances to the k clusters' centroids. In the third step, hash functions are learnt efficiently with a linear time complexity and effectively with the intra- and inter-similarity preservations. In the fourth step, all the data points in the database are approximated with k-dimensional representations, which are then mapped into binary codes with the learnt hash functions in the fifth step. In the online search process, a query of one modal is first approximated with its k-dimensional representation in this modal, which is then mapped into the query binary codes with the hash functions for this modal, followed by matching the database binary codes to find relevant results of any other modal. Extensive experimental results on two benchmark datasets confirm the scalability and the effectiveness of the proposed approach in comparison with the state of the art.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. The proposed LCMH and its analysis are presented in Section 3. Section 4 reports the results and the paper is concluded in Section 5.
2. RELATED WORK
In this section we review existing hashing methods in three major categories: uni-modal hashing, multi-modal hashing and cross-modal hashing.

In uni-modal hashing, early work such as LSH-like hashing methods [7, 8, 15, 21] constructs hash functions based on random projections and is typically unsupervised. Although they have some asymptotic theoretical properties, LSH-like hashing methods often require long binary codes and multiple hash tables to achieve reasonable retrieval accuracy [20]. This leads to long query time and high storage cost. Recently, machine learning techniques have been applied to improve hashing performance. For example, PCA-like hashing [10, 30, 33] learns hash functions by preserving the maximal covariance of the original data and has been shown to outperform LSH-like hashing in [14, 17, 29]. Manifold-like hashing [18, 26] employs manifold learning techniques to learn hash functions. Besides, some hashing methods conduct hash function learning by making the best use of prior knowledge of the data. For example, supervised hashing methods [14, 17, 19, 24, 28] improve the hashing performance using pre-provided pairs of data, with the assumption that there are "similar" or "dissimilar" pairs in the datasets. There are also semi-supervised hashing methods [30, 34] in which a supervised term is used to minimize the empirical error on the labeled data while an unsupervised term is used to maximize desirable properties, such as the variance and independence of individual bits in the binary codes.

Multi-modal hashing is designed to conduct hash function learning for encoding multi-modal data. To this end, the method in [36] first uses an iterative method to preserve the semantic similarities among training examples, and then keeps the consistency between the hash codes and the corresponding hash functions designed for multiple modals. Multiple feature hashing (MFH) [26] preserves the local structure information of each modal and also globally considers the alignments for all the modals to learn a group of hash functions for real-time large-scale near-duplicate web video retrieval.

Cross-modal hashing also encodes multi-modal data. However, it focuses more on discovering the correlations among different modals to enable cross-modal retrieval. Cross-modal similarity sensitive hashing (CMSSH) [2] is the first cross-modal hashing method for cross-modal retrieval. However, CMSSH only considers the inter-similarity and ignores the intra-similarity. Cross-view hashing (CVH) [16] extends spectral hashing [33] to the multi-modal case, aiming at minimizing the Hamming distances for similar points and maximizing those for dissimilar points. However, it needs to construct the similarity matrix over all the data points, which leads to a quadratic time complexity to the training data size. Rasiwasia et al. [22] employ canonical correlation analysis (CCA) to conduct hash function learning, which is a special case of CVH. Recently, multi-modal latent binary embedding (MLBE) [38] uses a probabilistic latent factor model to learn hash functions. Similar to CVH, it also has a quadratic time complexity for constructing the similarity matrix. Moreover, it uses a sampling method to address the issue of out-of-sample extension. Co-regularized hashing (CRH) [37] is a boosted co-regularization framework which learns a group of hash functions for each bit of the binary codes in every modal. However, its objective function is non-convex. Inter-media hashing (IMH) [27] aims to discover a common Hamming space for learning hash functions. IMH preserves the intra-similarity of each individual modal via enforcing that data with similar semantics should have similar hash codes, and preserves the inter-similarity among different modals via preserving the local structural information embedded in each modal.
3. LINEAR CROSS-MODAL HASHING

In this section we describe the details of the proposed LCMH method. To explain the basic idea, we first focus on hash function learning for bimodal data from Section 3.1 to Section 3.5, and then extend it to the general setting of multi-modal data in Section 3.6. In this paper, we use boldface uppercase letters, boldface lowercase letters and plain letters to denote matrices, vectors and scalars, respectively. Besides, the transpose of X is denoted as X^T, the inverse of X is denoted as X^{-1}, and the trace of a matrix X is denoted as tr(X).
3.1 Problem formulation
Assume we have two modals, X^(i) = {x_1^(i), ..., x_n^(i)}, i = 1, 2, describing the same data points, where n is the number of data points. For example, X^(1) is the SIFT visual feature extracted from the content of images, and X^(2) is the bag-of-words feature extracted from the text surrounding the images. In general, the feature dimensionalities of different modals are different. Under the same assumption as in [4, 11] that there is an invariant common space among multiple modals, the objective of LCMH is to effectively and efficiently learn hash functions for different modals to support cross-modal retrieval. To this end, LCMH needs to generate the hash functions f^(i): x^(i) → b^(i) ∈ {−1, 1}^c, i = 1, 2, where c is the code length. Note that all the modals have the same code length. Moreover, LCMH needs to ensure that the neighborhood relationships within each individual modal and across different modals are preserved in the produced common Hamming space. To do this, LCMH is devised to preserve both the intra-similarity and the inter-similarity of the original feature spaces in the Hamming space.

The main idea of learning the hash functions goes as follows. Data of each individual modal are first converted into new representations, denoted as Z^(i), which preserve the intra-similarity (see Section 3.2). Data of all modals, represented by Z, are then mapped into a common space where the inter-similarity is preserved to generate hash functions (see Section 3.3). Finally, the values generated by the hash functions are binarized into the Hamming space (see Section 3.4). With the learnt hash functions, queries and database data can be mapped into the Hamming space to facilitate fast search by efficient binary code matching. A minimal sketch of this setting is given below.
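To make the setup concrete, the following sketch instantiates the bimodal setting with toy data and shows the Hamming-distance matching used at query time. The shapes, variable names, and helper function are illustrative assumptions, not part of the paper:

```python
import numpy as np

# Toy bimodal setting (illustrative shapes): n data points, a d1-dimensional
# visual modal X1 and a d2-dimensional textual modal X2; c is the code length.
n, d1, d2, c = 1000, 128, 500, 16
X1 = np.random.rand(n, d1)   # modal 1, e.g., SIFT features
X2 = np.random.rand(n, d2)   # modal 2, e.g., bag-of-words features

def hamming_distances(b_query, B_database):
    """Hamming distances between one code in {-1, +1}^c and a database of codes."""
    return np.count_nonzero(B_database != b_query, axis=1)
```

The following subsections fill in how the binary codes themselves are produced.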
3.2 Intra-similarity preservation

Intra-similarity preservation is designed to maintain the neighborhood relationships among training data points in each individual modal after they are mapped into the new space spanned by their new representations. To achieve this, manifold-like hashing [26, 27, 36, 39] constructs a similarity matrix in which each entry represents the distance between two data points. In such a matrix, each data point can be regarded as an n-dimensional representation indicating its distances to the n data points. Typically, the neighborhood of a data point is described by its few nearest neighbors.
To preserve the neighborhood of each data point, only the few dimensions corresponding to its nearest neighbors in the n-dimensional representation are non-zero. In other words, the n-dimensional representation is highly sparse. However, building such a sparse matrix requires quadratic time complexity, i.e., O(n^2), which is impractical for large-scale datasets.

As observed from the sparse n-dimensional representation, only a few data points are used to describe the neighborhood of a data point. This motivates us to derive a smaller k-dimensional representation (i.e., k ≪ n) to approximate each training data point, aiming at reducing the time complexity of building the neighborhood structures. The idea is to select the k most representative data points from the training dataset and approximate each training data point using its distances to these k representative data points. To do this, we use a scalable k-means clustering method [5] to generate k centroids, which are taken as the k most representative data points in the training dataset. It has been shown that k centroids have a strong representation power to adequately cover large-scale datasets [5].

More specifically, given a training dataset in the first modal X^(1), instead of mapping each training data point x_i^(1) into the n-dimensional representation, which leads to quadratic time complexity, we convert it into the k-dimensional representation z_i^(1) using the obtained k centroids, denoted by m_j^(1), j = 1, 2, ..., k. For a z_i^(1), its j-th dimension carries the distance from x_i^(1) to the j-th centroid m_j^(1), denoted as z_ij^(1). To obtain the value of z_ij^(1), we first calculate the Euclidean distance between x_i^(1) and m_j^(1), i.e.,

    z_{ij}^{(1)} = \| x_i^{(1)} - m_j^{(1)} \|_2,    (1)

where \| \cdot \| stands for the Euclidean norm. As in [9], the value of z_ij^(1) can be further defined as a function of this Euclidean distance to better fit the Gaussian distribution in real applications. Denoting the redefined value of z_ij^(1) as p_ij^(1), we have

    p_{ij}^{(1)} = \frac{\exp(-z_{ij}^{(1)}/\sigma)}{\sum_{l=1}^{k} \exp(-z_{il}^{(1)}/\sigma)},    (2)

where σ is a tuning parameter controlling the decay rate of z_ij^(1). For simplicity, we set σ = 1 in this paper, while an adaptive setting of σ can lead to better results.

Let p_i^(1) = [p_i1^(1); ...; p_ij^(1); ...; p_ik^(1)]; then p_i^(1) forms the new representation of x_i^(1). The rationale of defining p_i^(1) is similar to that of kernel density estimation with a Gaussian kernel, i.e., if x_i^(1) is near the j-th centroid, p_ij^(1) will be relatively high; otherwise, p_ij^(1) will decay. To preserve the neighborhood of each training data point in the new k-dimensional space, we further represent each training data point using only several (say s, with s ≪ k) nearest centroids so that the new representation p_i^(1) of x_i^(1) is sparse. Therefore, in the implementation, for each training data point we only keep the values corresponding to its s nearest centroids in p_i^(1) and set the rest to 0. After this, we normalize the derived values to generate the final value of z_ij^(1). Following the perspective of geometric reconstruction in the literature [23, 25, 32], we can easily show that the intra-similarity is well preserved in the derived k-dimensional representation, i.e., the invariance to rotations, rescalings, and translations.

According to Eqs. 1-2, we convert the training data X^(i) into their k-dimensional representations Z^(i), i = 1, 2. That is, we use a k × n matrix to approximate the original n × n similarity matrix with intra-similarity preservation. The advantage is to reduce the complexity from O(n^2) to O(kn). Note that one can select different numbers of centroids for each modal; for simplicity, we select the same number of centroids in our experiments. The next problem is to preserve the inter-similarity between Z^(1) and Z^(2) via seeking a common latent space between them.
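As an illustration, the following sketch computes this k-dimensional representation for one modal. It is a minimal sketch under stated assumptions: scikit-learn's KMeans stands in for the scalable k-means of [5], and the function and parameter names are ours rather than the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the scalable k-means of [5]

def k_dim_representation(X, k=256, s=3, sigma=1.0, centroids=None):
    """Approximate each row of X (n x d) by its distances to k centroids,
    following Eqs. 1-2, keeping only the s nearest centroids (sparsification)
    and renormalizing. Returns the n x k representation and the centroids."""
    if centroids is None:
        centroids = KMeans(n_clusters=k, n_init=4).fit(X).cluster_centers_
    # Eq. 1: z_ij = ||x_i - m_j||_2
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Eq. 2: p_ij = exp(-z_ij / sigma) / sum_l exp(-z_il / sigma)
    p = np.exp(-dist / sigma)
    p /= p.sum(axis=1, keepdims=True)
    # Keep only the s largest entries per row (s nearest centroids), zero the rest.
    thresh = -np.sort(-p, axis=1)[:, s - 1][:, None]
    p = np.where(p >= thresh, p, 0.0)
    p /= p.sum(axis=1, keepdims=True)   # renormalize to obtain the final rows of Z
    return p, centroids
```

The pairwise-distance broadcast keeps the sketch short; a memory-conscious implementation would compute the distances in chunks.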
3.3 Inter-similarity preservation

It is well known that multimedia data with the same semantics can exist in different types of modals. For example, a text document and an image can describe exactly the same topic. Research has shown that if data described in different modal spaces are related to the same event or topic, they are expected to share some common latent space [16, 38]. This suggests that multi-modal data with the same semantics should share some common space in which relevant data are close to each other. Such a property is understood as inter-similarity preservation when multi-modal data are mapped into the common space. In our problem setting, multi-modal data are eventually represented by binary codes in the common Hamming space. To this end, we first learn a "semantic bridge" for each modal Z^(i) in its k-dimensional space to map Z^(i) into the common Hamming space. To ensure inter-similarity preservation, data describing the same object from different modals should have the same or similar binary codes in the Hamming space. For example, in Fig. 2, we map both the visual modal and the textual modal of images via the learnt "semantic bridges" (i.e., the arrows in Fig. 2) into the Hamming space (i.e., the circle in Fig. 2), in which the two modals of an image are represented with the same or similar binary codes. That is, consistency across different modals is achieved.

[Figure 2: An illustration of inter-similarity preservation. The visual and textual modals of the same images are mapped via "semantic bridges" into a common Hamming space, where they share the same or similar binary codes (e.g., 0011, 0111, 1011).]

More formally, given Z^(1) ∈ R^{n×k} and Z^(2) ∈ R^{n×k}, where n is the sample size and k is the number of centroids, we learn the transformation matrices (i.e., "semantic bridges") W^(1) ∈ R^{k×c} and W^(2) ∈ R^{k×c} for converting Z^(1) and Z^(2) into the new representations B^(1) ∈ {−1, 1}^{n×c} and B^(2) ∈ {−1, 1}^{n×c}
in a common Hamming space, in which each sample pair (describing the same object, i.e., B_i^(1) and B_i^(2) describing the i-th object in different modals) has the minimal Hamming distance, i.e., the maximal consistency. This leads to the following objective function:

    \min_{B^{(1)}, B^{(2)}} \| B^{(1)} - B^{(2)} \|_F^2
    s.t.  B^{(i)T} e = 0,  b^{(i)} \in \{-1, 1\},  B^{(i)T} B^{(i)} = I_c,  i = 1, 2,    (3)

where \| \cdot \|_F denotes the Frobenius norm, e is an n × 1 vector whose entries are all 1, and I_c is a c × c identity matrix. The constraint B^(i)T e = 0 requires each bit to have an equal chance of being 1 or −1, the constraint B^(i)T B^(i) = I_c requires the bits to be obtained independently, and the loss term \| B^{(1)} - B^{(2)} \|_F^2 achieves the minimal difference (i.e., the maximal consistency) between the two representations of an object.

The optimization problem in Eq. 3 is equivalent to balanced graph partitioning and is NP-hard. Following the literature [16, 33], we first denote by Y^(i) the real-valued relaxation of B^(i) and solve the derived objective function on Y^(i) in this subsection, and then binarize Y^(i) into binary codes with the median threshold method in Section 3.4. To map Z^(i) into Y^(i) ∈ R^{n×c} via the transformation matrix W^(i), we let Y^(i) = Z^(i) W^(i). According to Eq. 3, we have the objective function

    \min_{W^{(1)}, W^{(2)}} \| Z^{(1)} W^{(1)} - Z^{(2)} W^{(2)} \|_F^2
    s.t.  W^{(1)T} W^{(1)} = I,  W^{(2)T} W^{(2)} = I,    (4)

where the orthogonal constraints are set to avoid trivial solutions. To optimize the objective function in Eq. 4, we first convert its loss term into

    \| Z^{(1)} W^{(1)} - Z^{(2)} W^{(2)} \|_F^2
    = tr(W^{(1)T} Z^{(1)T} Z^{(1)} W^{(1)} + W^{(2)T} Z^{(2)T} Z^{(2)} W^{(2)} - W^{(1)T} Z^{(1)T} Z^{(2)} W^{(2)} - W^{(2)T} Z^{(2)T} Z^{(1)} W^{(1)})
    = -tr(W^T Z W),    (5)

where W = [W^{(1)}; W^{(2)}] ∈ R^{2k×c} and

    Z = \begin{bmatrix} -Z^{(1)T} Z^{(1)} & Z^{(1)T} Z^{(2)} \\ Z^{(2)T} Z^{(1)} & -Z^{(2)T} Z^{(2)} \end{bmatrix} \in R^{2k \times 2k}.

Then the objective function in Eq. 4 becomes

    \max_{W} tr(W^T Z W)  s.t.  W^T W = I.    (6)

Eq. 6 is an eigenvalue problem, and we can obtain the optimal W by solving the eigenvalue problem on Z. W represents the hash functions used to generate Y as follows:

    Y^{(i)} = Z^{(i)} W^{(i)},    (7)

where W^(1) = W(1 : k, :) and W^(2) = W(k + 1 : end, :).
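To make Eqs. 5-7 concrete, here is a minimal NumPy sketch of the eigenproblem. The function names are ours, and it assumes Z1 and Z2 are the n × k representations produced by the earlier sketch:

```python
import numpy as np

def learn_hash_projections(Z1, Z2, c):
    """Solve Eq. 6: max_W tr(W^T Z W) s.t. W^T W = I, with Z the 2k x 2k block
    matrix of Eq. 5 built from Z1 and Z2 (each n x k).
    Returns W1 = W(1:k, :) and W2 = W(k+1:2k, :), each k x c."""
    k = Z1.shape[1]
    Z = np.block([[-Z1.T @ Z1, Z1.T @ Z2],
                  [ Z2.T @ Z1, -Z2.T @ Z2]])
    # Z is symmetric; take the eigenvectors of its c largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh((Z + Z.T) / 2.0)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:c]]    # 2k x c
    return W[:k, :], W[k:, :]

def project(Z_modal, W_modal):
    """Eq. 7: real-valued codes Y = Z W for one modal."""
    return Z_modal @ W_modal
```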
3.4 Binarization

After obtaining all Y^(i), we compute the median vector of Y^(i),

    u^{(i)} = \mathrm{median}(Y^{(i)}) \in R^c,    (8)

and then binarize Y^(i) as follows:

    b_{jl}^{(i)} = 1  if  y_{jl}^{(i)} \ge u_l^{(i)},  and  b_{jl}^{(i)} = -1  if  y_{jl}^{(i)} < u_l^{(i)},    (9)

where Y^(i) = [y_1^(i), ..., y_n^(i)]^T, i = 1, 2, j = 1, ..., n and l = 1, ..., c. Eq. 9 generates the final binary codes B for the training data X, in which the median value of each dimension is used as the threshold for binarization.

The learnt hash functions and the binarization step are used to map unseen data (e.g., database and query points) into the Hamming space. In the online search phase, given a query x_q^(i) from the i-th modal, we first approximate it with its distances to the k centroids, i.e., z_q^(i), using Eqs. 1-2, then compute its y_q^(i) using Eq. 7, followed by the binarization of y_q^(i) to generate its binary codes b_q^(i). Finally, the Hamming distances between b_q^(i) and the database binary codes are computed to find the neighbors of x_q^(i) in any other modal.
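A small sketch of the binarization and query-encoding steps (Eqs. 8-9), reusing `k_dim_representation` from the earlier sketch; the names and default values are illustrative:

```python
import numpy as np

def median_thresholds(Y):
    """Eq. 8: per-bit median vector u of the real-valued training codes Y (n x c)."""
    return np.median(Y, axis=0)

def binarize(Y, u):
    """Eq. 9: +1 where y_jl >= u_l, and -1 otherwise."""
    return np.where(Y >= u, 1, -1)

def encode_query(x_q, centroids, W, u, s=3, sigma=1.0):
    """Map one query of a given modal to its binary code:
    Eqs. 1-2 (k-dimensional representation), Eq. 7 (projection), Eq. 9 (binarization)."""
    z_q, _ = k_dim_representation(x_q[None, :], s=s, sigma=sigma, centroids=centroids)
    return binarize(z_q @ W, u)[0]
```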
3.5 Summary and analysis

We summarize the proposed LCMH approach in Algorithm 1 (training phase) and Algorithm 2 (search phase).

Algorithm 1: Pseudo code of the training phase
Input: X, c, k
Output: u^(i) ∈ R^c; W^(i) ∈ R^{k×c}, i = 1, 2
1 Perform scalable k-means on X^(i) to obtain m^(i);
2 Compute Z^(i) by Eqs. 1-2;
3 Generate W^(i) by Eq. 6;
4 Generate u^(i) by Eq. 8;

Algorithm 2: Pseudo code of the search phase
Input: x_q^(1), u^(1), W^(1)
Output: Nearest neighbors of x_q^(1) in another modal
1 Compute z_q^(1) by Eqs. 1-2;
2 Compute y_q^(1) by Eq. 7;
3 Generate b_q^(1) by Eq. 9;
4 Match b_q^(1) with the database binary codes of another modal;

In the training phase of LCMH, the time cost mainly comes from the clustering process, the generation of the new representations, and the eigenvalue decomposition for generating hash functions. Applying a scalable clustering method such as [5], clusters can be generated in time linear to the training data size n. Generating the k-dimensional representations Z takes O(kn). The time complexity of generating W is O(min{nk^2, k^3}); since k ≪ n for large-scale training datasets, O(k^3) is the complexity of generating the hash functions. Therefore, the overall time complexity is O(max{kn, k^3}). Given that k ≪ n, we expect that k^2 < n or that both have a similar scale, which leads to an approximate O(kn) time complexity for the training phase. With k a constant, the final time complexity is linear to the training data size. In the search phase, the time complexity is constant.
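For orientation, the sketches above can be chained to mirror Algorithms 1 and 2 on toy data; all shapes, names, and parameter values below are illustrative:

```python
import numpy as np

n, d1, d2, k, c = 5000, 64, 300, 128, 16
X1, X2 = np.random.rand(n, d1), np.random.rand(n, d2)   # toy bimodal training data

# Offline (Algorithm 1)
Z1, M1 = k_dim_representation(X1, k=k)                  # steps 1-2, modal 1
Z2, M2 = k_dim_representation(X2, k=k)                  # steps 1-2, modal 2
W1, W2 = learn_hash_projections(Z1, Z2, c)              # step 3 (Eq. 6)
u1, u2 = median_thresholds(Z1 @ W1), median_thresholds(Z2 @ W2)   # step 4 (Eq. 8)
B2 = binarize(Z2 @ W2, u2)                              # database codes, modal 2

# Online (Algorithm 2): an image query retrieving its nearest texts
b_q = encode_query(X1[0], M1, W1, u1)
ranking = np.argsort(np.count_nonzero(B2 != b_q, axis=1))   # Hamming ranking
```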
3.6 Extension

We present an extension of Algorithm 1 and Algorithm 2 to the case of more than two modals, which allows us to use the information available in all the modals to achieve better learning results. To do this, we first generate the new representations of each modal according to Section 3.2 to preserve the intra-similarity, and then transform the new representations of all the modals into a common latent space to preserve the inter-similarity across any pair of modals. The objective function for preserving the inter-similarity is defined as

    \min_{B^{(i)}, i=1,...,p} \sum_{i=1}^{p} \sum_{j=1}^{p} \| B^{(i)} - B^{(j)} \|_F^2,

where p is the number of modals.
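The eigenproblem sketch generalizes naturally to p modals. The block construction below is our own straightforward extension of Eq. 5, consistent with the pairwise objective above but not necessarily the authors' exact formulation:

```python
import numpy as np

def learn_hash_projections_multi(Z_list, c):
    """Stack a pk x pk block matrix whose diagonal blocks are -(p-1) * Zi^T Zi and
    whose off-diagonal blocks are Zi^T Zj, so that -tr(W^T Z W) equals the sum of
    pairwise losses ||Zi Wi - Zj Wj||_F^2 over i < j; then solve the same
    eigenproblem as Eq. 6 and split W into one k x c projection per modal."""
    p, k = len(Z_list), Z_list[0].shape[1]
    blocks = [[-(p - 1) * Z_list[i].T @ Z_list[i] if i == j else Z_list[i].T @ Z_list[j]
               for j in range(p)] for i in range(p)]
    Z = np.block(blocks)
    eigvals, eigvecs = np.linalg.eigh((Z + Z.T) / 2.0)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:c]]
    return [W[i * k:(i + 1) * k, :] for i in range(p)]
```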
4. EXPERIMENTS

[...] and textual modals in different representations. In our experiments, each dataset is partitioned into a query set and a database set which is used for training.

4.1 Comparison algorithms

The comparison algorithms include a baseline algorithm, BLCMH, and the state-of-the-art algorithms CVH [16], CMSSH [2] and MLBE [38]. BLCMH is our LCMH without intra-similarity preservation, included to test the effect of intra-similarity preservation in our method. We compare LCMH with the comparison algorithms on two cross-modal retrieval tasks: one task uses a text query in the textual modal to search for relevant images in the visual modal (abbreviated as "Text query vs. Image data"), and the other uses an image query in the visual modal to search for relevant texts in the textual modal (abbreviated as "Image query vs. Text data").
4.2 Evaluation Metrics

We use mean Average Precision (mAP) [38] as one of the performance measures. Given a query and a list of R retrieved results, its Average Precision is defined as

    AP = \frac{1}{l} \sum_{r=1}^{R} P(r) \, \delta(r),    (13)

where l is the number of relevant results in the retrieved list, P(r) denotes the precision of the top r retrieved results, and δ(r) = 1 if the r-th retrieved result is relevant and 0 otherwise.
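As a sketch, Eq. 13 can be computed as follows. The helper names are ours, and we take l to be the number of relevant items in the retrieved list, since the surrounding definition is truncated here:

```python
import numpy as np

def average_precision(relevance, R=None):
    """Eq. 13: AP = (1/l) * sum_{r=1..R} P(r) * delta(r), where `relevance` is a
    boolean array over the ranked results, P(r) is the precision at rank r, and
    delta(r) marks relevant ranks; l is taken as the number of relevant items."""
    rel = np.asarray(relevance, dtype=float)[:R]
    if rel.sum() == 0:
        return 0.0
    precision_at_r = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_r * rel).sum() / rel.sum())

def mean_average_precision(per_query_relevance, R=None):
    """mAP: mean of the per-query Average Precision values."""
    return float(np.mean([average_precision(r, R=R) for r in per_query_relevance]))
```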