Task-dependent and Query-dependent Subspace Learning for Cross-modal Retrieval

Li Wang, Lei Zhu, En Yu, Jiande Sun, and Huaxiang Zhang

L. Wang, L. Zhu, E. Yu, J. Sun, and H. Zhang are with the School of Information Science and Engineering, Shandong Normal University, Jinan 250358, Shandong Province, China. Corresponding emails: [email protected], [email protected].
Abstract—Most existing cross-modal retrieval approaches learn the same couple of projection matrices for different sub-retrieval tasks (such as image retrieves text (I2T) and text retrieves image (T2I)) and for various queries. They ignore the important fact that, in real practice, different sub-retrieval tasks and queries have their own unique characteristics. To tackle this problem, we propose a task-dependent and query-dependent subspace learning (TQSL) approach for cross-modal retrieval. Specifically, we first develop a unified cross-modal learning framework in which task-specific and category-specific subspaces are learned simultaneously via an efficient iterative optimization. Based on this step, a task-category-projection mapping table is built. Subsequently, an efficient linear classifier is trained to learn a semantic mapping function between multimedia documents and their potential categories. In the online retrieval stage, the task-dependent and query-dependent matching subspace is adaptively identified by considering the specific sub-retrieval task type, the potential semantic category of the query, and the task-category-projection mapping table. Experimental results demonstrate the superior performance of the proposed approach compared with several state-of-the-art techniques.

Index Terms—Cross-modal retrieval, task- and query-dependent subspace learning, task-category-projection mapping table, semantic mapping function
I. INTRODUCTION
With the arrival of the big data era, different types of multimedia data have grown rapidly [1], [2]. These multimedia data often appear in heterogeneous feature spaces but have strong semantic correlations when describing the same topic or event. There is an emerging need to retrieve semantically relevant results in a different modality for a given query. Hence, cross-modal retrieval has attracted great attention in the literature [3]–[6]. In this paper, we mainly focus on two typical cross-modal sub-retrieval tasks: image retrieves text (I2T) and text retrieves image (T2I).

To achieve the desired performance on cross-modal retrieval tasks, a prerequisite is to effectively measure the similarity between two documents (a multimedia document could be a text document, an image, or a piece of video or audio) in different modalities, which is still very challenging due to the heterogeneity of different modalities. Moreover, as semantic correlation is an abstraction of human understanding [7], it is hard to bridge the semantic gap between high-level semantics and the low-level features extracted from multimedia data [8]–[10]. To address these problems, many works have been proposed to model the
relations of different modalities by learning a shared subspace, where the data similarity across different modalities can be measured [11]–[15]. However, existing subspace-based cross-modal retrieval approaches mainly learn an identical couple of projection matrices for different sub-retrieval tasks. The important differences between sub-retrieval tasks (such as I2T and T2I) have not been seriously considered when learning their subspaces. Under this circumstance, these methods cannot guarantee an effective semantic representation of the query in the shared subspace, which results in a certain deterioration of retrieval performance [16]–[18].

Furthermore, these existing approaches generally neglect the semantic characteristics of various queries. The same couple of projection matrices is also employed for different queries. However, in real practice of online cross-modal retrieval, the queries are semantically diverse. They are likely to come from categories that possess their own unique semantic distributions [6]. Under such circumstances, projecting them into a common subspace may lead to sub-optimal retrieval performance. Consequently, it is necessary to distinguish different sub-retrieval tasks and various queries when learning the cross-modal shared subspace.

Motivated by the above analysis, in this paper we propose a task-dependent and query-dependent subspace learning (TQSL) approach for cross-modal retrieval. First, different from existing techniques, we formulate a unified cross-modal learning framework to learn different couples of projection matrices for different sub-retrieval tasks and the involved semantic categories. Specifically, a linear regression term is formulated to further correlate the task-specific subspace with explicit high-level semantics. The resulting subspace therefore involves task-specific semantics and is task-dependent. Furthermore, a category-specific subspace for each semantic category in the database is learned separately by imposing different supervised learning penalties. Through iterative optimization of the learning framework, the task-specific and category-specific subspaces can be obtained simultaneously. Consequently, a task-category-projection mapping table is built to support online retrieval. Second, an efficient linear classifier is trained to learn the semantic mapping function between multimedia documents and their potential categories. At the online retrieval stage, the task-dependent and query-dependent subspace projection matrices of a query are automatically identified by considering the sub-retrieval task type, its potential semantic category, and the task-category-projection mapping table. Thus, the retrieval system can adaptively return more semantically relevant results for different queries and sub-retrieval tasks.
TABLE I: List of notations

Notation      Description
n             Number of training samples
m             Index of different sub-retrieval tasks
C             Number of semantic categories
d1, d2        Feature dimensions of image and text
V             d1 × n feature matrix of image
T             d2 × n feature matrix of text
Yc            n × C semantic label matrix corresponding to class c
Umc           d1 × C projection matrix for image
Wmc           d2 × C projection matrix for text
w^c_ij        n × n inter-modality similarity matrix
w^(p)_ij      n × n intra-modality similarity matrix
λg, β         Balance parameters
The main contributions of our work are summarized as follows:
1) We simultaneously consider the differences between sub-retrieval tasks and the characteristics of diverse queries when learning the matching subspace for cross-modal retrieval. To the best of our knowledge, there is no similar work.
2) We develop a unified task-dependent and query-dependent cross-modal learning framework. An iterative optimization with guaranteed convergence is proposed to solve for the optimal subspace projection matrices. With task and query awareness, an adaptive subspace for cross-modal retrieval can be effectively identified to measure the similarities between heterogeneous multimedia documents.
3) Experimental results on publicly available datasets demonstrate that the proposed approach outperforms several state-of-the-art methods.

The rest of this paper is organized as follows. Section II reviews the related work on cross-modal retrieval. Section III describes the details of the proposed method. Section IV introduces the experimental setting. The experimental results are presented in Section V. Section VI concludes the paper.

II. RELATED WORK

Cross-modal retrieval has received great attention in the literature due to its widespread application in practice. Unlike single-modal techniques [19]–[26], users can retrieve semantically relevant multi-modal objects for any type of query via cross-media retrieval. Recently, various techniques have been proposed to improve its performance.

To bridge the heterogeneous gap, many subspace learning based cross-modal methods have been proposed to project heterogeneous modalities into a common subspace. With the learned subspace, the similarity among different modalities can be measured accordingly. In [27], canonical correlation analysis (CCA) explores the correlation coefficients as measurement criteria to maximize the feature correlation of different modalities. In [28], CCA is further extended with a deep learning framework as Deep CCA, which exploits complex nonlinear transformations of multi-modal data and thus learns linearly correlated representations. In addition, as a kernelized extension of CCA, kernel canonical correlation analysis (KCCA) [29] first maps the original features into a common latent subspace with a nonlinear mapping, and then establishes the modality correlations. Besides CCA and its extensions, partial least squares (PLS) [30] is also exploited for cross-modal retrieval. It linearly maps heterogeneous data into a common subspace where the directions of maximum covariance are found. Although these unsupervised methods can achieve certain success, they merely leverage the paired multimedia documents to learn projections. Under such circumstances, the discriminative explicit high-level semantics cannot be captured in subspace learning. To address this problem, supervised cross-modal retrieval methods are developed to learn the shared subspace with
supervised semantic labels. For example, [31] proposes a generalized multiview analysis (GMA), which is a supervised extension of CCA. In [5], a coupled linear regression framework is formulated to perform cross-modal matching. Considering that real-world data is usually annotated with multiple labels, multi-label CCA (ml-CCA) [4] learns the shared subspace with the modality correspondence established by multi-label annotations. Besides, many deep learning [32]–[34] and hashing models [35], [36] have been developed to stimulate cross-modal retrieval research.

The above cross-modal retrieval methods are designed to learn the same couple of projections for the involved sub-retrieval tasks. They fail to capture the characteristics of each sub-retrieval task and probably deteriorate the retrieval performance. Modality-dependent cross-modal retrieval (MDCR) [17] makes a further effort by learning two couples of projections for the different sub-retrieval tasks. However, it still ignores the important semantic characteristics of various queries in subspace learning, which may lead to sub-optimal cross-modal retrieval performance.

III. THE PROPOSED APPROACH

In this section, we first introduce the preliminaries and the system overview in Sections III-A and III-B, respectively. Then, the details of the linear classifier training are described in Section III-C. Next, the unified task-specific and category-specific subspace learning is illustrated in Section III-D. Finally, we design an iterative algorithm to solve the optimization problem in Section III-E, and its convergence is proved in Section III-F.

A. Preliminaries

In this paper, a multi-modal dataset that consists of n training sample pairs is represented as $\{(v_i, t_i)\}_{i=1}^{n}$. The samples in each image-text pair describe the same topic and belong to one of C classes. Let $V = [v_1, ..., v_n] \in \mathbb{R}^{d_1 \times n}$ and $T = [t_1, ..., t_n] \in \mathbb{R}^{d_2 \times n}$ represent the image and text modalities, respectively. $Y_c = [y_1, ..., y_C] \in \mathbb{R}^{n \times C}$ denotes the semantic label matrix for learning the projection matrices of class c. For brevity, a list of notations is shown in Table I.

B. System Overview

The basic system framework of TQSL for the I2T sub-retrieval task is shown in Figure 1. The system is comprised of two main components: offline learning and online retrieval.
Fig. 1: The system framework of TQSL for the I2T task. This figure is best viewed in color and with PDF magnification.
• Offline Learning: First, a unified cross-modal learning framework is formulated to learn task-specific and category-specific projection matrices by jointly optimizing the linear regression term, correlation analysis, multi-modal graph regularization, and feature selection. Via iterative optimization, the task-specific and category-specific subspaces can be learned simultaneously, and consequently a task-category-projection mapping table is built to support online retrieval. Second, we train an efficient classifier with a linear support vector machine (LinearSVM) [37], [38] to obtain the semantic mapping function.
• Online Retrieval: Given a query, we first predict its potential semantic category with the learned semantic mapping function. Then, by jointly considering the specific sub-retrieval task and the potential semantic category, the optimal projection matrices of the query are identified with the built task-category-projection mapping table. Next, the multi-modal data is mapped into a shared matching subspace. Finally, the similarities between the query image and the database texts are measured and the semantically relevant retrieval results are returned.
C. Linear Classifier Training

In our approach, we train an efficient classifier to support identifying the most proper projection matrices for a given query in online retrieval. In particular, we resort to a linear support vector machine (LinearSVM) [37] to train a classifier model for category prediction because of its high time efficiency and empirical success. The learned linear classifier model is simply taken as the aforementioned semantic mapping function. Note that our model is flexible; this part can be substituted with other effective classification models.
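As an illustration of this step, the following sketch trains such a category predictor and uses it as the semantic mapping function. It assumes scikit-learn's LinearSVC as the LinearSVM implementation [37]; the function and variable names are illustrative and not taken from the original implementation.

```python
# Minimal sketch of the semantic mapping function (category predictor),
# assuming scikit-learn's LinearSVC as the LinearSVM implementation.
import numpy as np
from sklearn.svm import LinearSVC

def train_semantic_mapping(features, labels, C=1.0):
    """Train a linear classifier that maps a document's feature vector
    to its potential semantic category.
    features: (n, d) array, labels: (n,) array of class indices."""
    clf = LinearSVC(C=C)
    clf.fit(features, labels)
    return clf

def predict_category(clf, query_feat):
    """Predict the potential semantic category of a single query."""
    return int(clf.predict(query_feat.reshape(1, -1))[0])
```

At retrieval time, the predicted category is the key used, together with the sub-retrieval task type, to look up the projection matrices in the task-category-projection mapping table.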
D. Unified Task-specific and Category-specific Subspace Learning

We develop a unified cross-modal learning framework to perform the task-specific and category-specific subspace learning. In each sub-retrieval task, we focus on learning C couples of projection matrices $U_{mc} \in \mathbb{R}^{d_1 \times C}$ and $W_{mc} \in \mathbb{R}^{d_2 \times C}$ (m = 1, 2 denotes the different sub-retrieval tasks) for the image and text modalities, respectively. The cth couple of projection matrices is learned for class c. With the learned 2C couples of projection matrices, we establish a task-category-projection mapping table for online retrieval. Specifically, the overall learning framework is formulated as:

$$\min_{U_{mc}, W_{mc}} f(U_{mc}, W_{mc}) = L(U_{mc}, W_{mc}) + C(U_{mc}, W_{mc}) + \Omega(U_{mc}, W_{mc}) + S(U_{mc}, W_{mc})$$
The objective function consists of four parts. $L(U_{mc}, W_{mc})$ is a linear regression term. It further correlates the task-specific subspace with explicit high-level semantics; the resulting subspace therefore involves task-specific semantics and is task-dependent. In addition, by imposing higher weights in the semantic label matrix, more learning penalties are imposed on the cth category, so that the learned subspace is embedded with more class-specific semantics. Specifically, if the ith data sample belongs to the jth class and j ≠ c, we set the ith element of $y_j$ to 0.5; if j = c, we set it to 1; otherwise it is set to 0. Thus, the category-specific subspace can be learned for the cth semantic category. $C(U_{mc}, W_{mc})$ is a correlation analysis term that learns the shared subspace between the two modalities. $\Omega(U_{mc}, W_{mc})$ is a multi-modal graph regularization term which exploits the underlying feature structure to preserve the inter-modality and intra-modality semantic consistency. When calculating the inter-modality similarity, we also increase the weights for the samples of the cth class to learn the cth class-specific subspace,
as shown in the following. $S(U_{mc}, W_{mc})$ is a feature selection term that avoids noise and selects the discriminative features. In the I2T task, we denote the two optimal projection matrices for the image and text modalities as $U_{1c}$ and $W_{1c}$, respectively; the objective function is shown in Eq.(1). Different from the I2T task, the linear regression term of the T2I task is a regression operation from the text space to the corresponding semantic space. In the T2I task, we denote the projection matrices for the image and text as $U_{2c}$ and $W_{2c}$, respectively; the objective function is shown in Eq.(2).
$$\min_{U_{1c}, W_{1c}} f(U_{1c}, W_{1c}) = \|V^T U_{1c} - Y_c\|_F^2 + \|V^T U_{1c} - T^T W_{1c}\|_F^2 + \lambda_1 \Omega(U_{1c}, W_{1c}) + \lambda_2 \|U_{1c}\|_{2,1} + \lambda_3 \|W_{1c}\|_{2,1} \quad (1)$$

$$\min_{U_{2c}, W_{2c}} f(U_{2c}, W_{2c}) = \|T^T W_{2c} - Y_c\|_F^2 + \|V^T U_{2c} - T^T W_{2c}\|_F^2 + \lambda_1 \Omega(U_{2c}, W_{2c}) + \lambda_2 \|U_{2c}\|_{2,1} + \lambda_3 \|W_{2c}\|_{2,1} \quad (2)$$

where $\lambda_g$ ($g = 1, 2, 3$) controls the importance of the three regularization terms, $\|\cdot\|_{2,1}$ represents the $l_{2,1}$ norm, and $\|\cdot\|_F$ denotes the Frobenius norm. In the following, we introduce the details of each term in the objective function.

Multi-modal graph regularization $\Omega(U_{mc}, W_{mc})$: We exploit the explicit semantic labels and the local manifold structure so that projected heterogeneous features with the same semantics are close to each other in the shared subspace, and vice versa [39]–[41]. We define the semantic similarity matrix $w^c_{ij} \in \mathbb{R}^{n \times n}$ between $v_i$ and $t_j$ from different modalities as follows:

$$w^c_{ij} = \begin{cases} 0.5, & \text{if } v_i \text{ and } t_j \text{ belong to the } k\text{th category } (k \neq c) \\ 1, & \text{if } v_i \text{ and } t_j \text{ belong to the } c\text{th category} \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

Eq.(3) characterizes the inter-modality sample relations in the original space. Note that the similarities of samples from the cth class are set to 1, so that they are preserved more effectively in subspace learning; the category-specific subspace can therefore be learned accordingly. In addition, data with neighborhood relationships in the intra-modality feature space should also be close to each other in the corresponding shared subspace. We construct a k-nearest-neighbor graph to preserve this local manifold structure [42], [43]. Taking the image modality as an example, the intra-modality similarity matrix $w^{(1)}_{ij} \in \mathbb{R}^{n \times n}$ between two images is defined as:

$$w^{(1)}_{ij} = \begin{cases} 1, & \text{if } v_i \in N_k(v_j) \text{ or } v_j \in N_k(v_i) \\ 0, & \text{otherwise} \end{cases}$$

where $N_k(v_i)$ denotes the k nearest neighbors of $v_i$. We preserve the inter-modality and intra-modality similarity relationships in the shared subspace with:

$$\Omega(U, W) = \frac{1}{2} \sum_{p=1}^{2} \sum_{i,j=1}^{n} \beta w^{(p)}_{ij} \|f_i - f_j\|_F^2 + \frac{1}{2} \sum_{i,j=1}^{n} w^c_{ij} \|f_i - f_j\|_F^2 = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \|f_i - f_j\|_F^2 = \mathrm{Tr}(F L F^T)$$

where $\beta > 0$ balances the inter-modality and intra-modality similarity, $W_{ij} = \beta w^{(1)}_{ij} + \beta w^{(2)}_{ij} + w^c_{ij}$, $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{n} W_{ij}$, and $L = D - W$ is the Laplacian matrix. $F = (F_1^T, F_2^T)^T = (U_{mc}^T V, W_{mc}^T T)$ denotes the projected data of the different modalities in the shared subspace. The multi-modal graph regularization terms in Eq.(1) and Eq.(2) are rewritten as follows:

$$\Omega(U_{1c}, W_{1c}) = \mathrm{Tr}(U_{1c}^T V L_{II} V^T U_{1c}) + \mathrm{Tr}(W_{1c}^T T L_{TT} T^T W_{1c}) + \mathrm{Tr}(U_{1c}^T V L_c T^T W_{1c})$$

$$\Omega(U_{2c}, W_{2c}) = \mathrm{Tr}(U_{2c}^T V L_{II} V^T U_{2c}) + \mathrm{Tr}(W_{2c}^T T L_{TT} T^T W_{2c}) + \mathrm{Tr}(W_{2c}^T T L_c V^T U_{2c})$$

where $L_{II}$ and $L_{TT}$ are the Laplacian matrices that preserve the similarity within the image modality and the text modality, respectively, and $L_c$ is the Laplacian matrix corresponding to the samples of the cth class; it preserves the inter-modality sample relationship.

Feature selection terms $\|U_{mc}\|_{2,1}$ and $\|W_{mc}\|_{2,1}$: The real document representation may contain noise that deteriorates the retrieval performance. To tackle this problem, we impose the $l_{2,1}$ norm on the projection matrices to select the discriminative features. The $l_{2,1}$-norm is the sum of the $l_2$-norms of the rows of a matrix [44], [45], which ensures that the matrix is row-sparse. $\|U_{mc}\|_{2,1}$ ($m = 1, 2$) can be replaced by

$$\|U_{mc}\|_{2,1} = \mathrm{Tr}(U_{mc}^T R_{mv} U_{mc})$$

where $R_{mv} = \mathrm{Diag}(r_{mv})$ and $r_{mv}$ is an auxiliary vector of the $l_{2,1}$-norm whose bth element is $r^b_{mv} = 1/(2\|u^b_{mc}\|_2)$. In practice, $r^b_{mv}$ is regularized as

$$r^b_{mv} = \frac{1}{2\sqrt{\|u^b_{mc}\|_2^2 + \varepsilon}} \quad (4)$$

where $\varepsilon$ is a smoothing term, usually set to a small constant value to ensure the convergence of the iterative algorithm. Similarly, $\|W_{mc}\|_{2,1}$ ($m = 1, 2$) can be replaced by

$$\|W_{mc}\|_{2,1} = \mathrm{Tr}(W_{mc}^T R_{mt} W_{mc})$$
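To make the class-specific supervision above concrete, the sketch below constructs the weighted label matrix Y_c, the inter-modality similarity of Eq.(3), a kNN-based intra-modality similarity, and the graph Laplacian L = D − W. It is a minimal numpy illustration assuming small, dense inputs and paired image-text labels; all function and variable names are hypothetical and not from the original implementation.

```python
import numpy as np

def build_Yc(labels, C, c):
    """Weighted semantic label matrix Y_c (n x C): entry (i, j) is 1 when
    sample i belongs to class j and j == c, 0.5 when j != c, 0 elsewhere."""
    n = len(labels)
    Yc = np.zeros((n, C))
    for i, y in enumerate(labels):
        Yc[i, y] = 1.0 if y == c else 0.5
    return Yc

def build_inter_similarity(labels, c):
    """Eq.(3): w^c_ij between image i and text j (paired data share labels)."""
    same = (labels[:, None] == labels[None, :]).astype(float)
    Wc = 0.5 * same                       # same category k != c
    in_c = (labels == c)
    Wc[np.ix_(in_c, in_c)] = 1.0          # both samples in the c-th category
    return Wc

def build_knn_similarity(X, k=5):
    """Intra-modality kNN similarity (symmetric 0/1 graph) for features X (d x n)."""
    n = X.shape[1]
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D2[i])[1:k + 1]   # k nearest neighbors, skipping the point itself
        W[i, nn] = 1.0
    return np.maximum(W, W.T)             # vi in Nk(vj) or vj in Nk(vi)

def laplacian(W):
    """Graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W
```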
E. Optimization

By comprehensively considering the above four terms, the objective functions of the different sub-retrieval tasks in Eq.(1) and Eq.(2) can be rewritten as follows, respectively.

1. The overall objective function for I2T:

$$\min_{U_{1c}, W_{1c}} \|V^T U_{1c} - Y_c\|_F^2 + \|V^T U_{1c} - T^T W_{1c}\|_F^2 + \lambda_1 \big[\mathrm{Tr}(U_{1c}^T V L_{II} V^T U_{1c}) + \mathrm{Tr}(W_{1c}^T T L_{TT} T^T W_{1c}) + \mathrm{Tr}(U_{1c}^T V L_c T^T W_{1c})\big] + \lambda_2 \mathrm{Tr}(U_{1c}^T R_{1v} U_{1c}) + \lambda_3 \mathrm{Tr}(W_{1c}^T R_{1t} W_{1c}) \quad (5)$$

2. The overall objective function for T2I:

$$\min_{U_{2c}, W_{2c}} \|T^T W_{2c} - Y_c\|_F^2 + \|V^T U_{2c} - T^T W_{2c}\|_F^2 + \lambda_1 \big[\mathrm{Tr}(U_{2c}^T V L_{II} V^T U_{2c}) + \mathrm{Tr}(W_{2c}^T T L_{TT} T^T W_{2c}) + \mathrm{Tr}(W_{2c}^T T L_c V^T U_{2c})\big] + \lambda_2 \mathrm{Tr}(U_{2c}^T R_{2v} U_{2c}) + \lambda_3 \mathrm{Tr}(W_{2c}^T R_{2t} W_{2c}) \quad (6)$$
The optimization problems in Eq.(5) and Eq.(6) are difficult to solve directly because of their non-convexity. Fortunately, Eq.(5) is convex with respect to either $U_{1c}$ or $W_{1c}$ when the other is fixed, and Eq.(6) is convex with respect to either $U_{2c}$ or $W_{2c}$ when the other is treated as a constant. We therefore design an iterative algorithm to effectively solve the optimization problem. We calculate the partial derivatives of $U_{1c}$ and $W_{1c}$ in Eq.(5) and set them to zero:

$$\frac{\partial f(U_{1c}, W_{1c})}{\partial U_{1c}} = 2VV^T U_{1c} - VY_c - VT^T W_{1c} + \lambda_2 R_{1v} U_{1c} + \lambda_1 V L_{II} V^T U_{1c} + \frac{\lambda_1}{2} V L_c T^T W_{1c} = 0 \quad (7)$$

$$\frac{\partial f(U_{1c}, W_{1c})}{\partial W_{1c}} = TT^T W_{1c} - TV^T U_{1c} + \lambda_3 R_{1t} W_{1c} + \lambda_1 T L_{TT} T^T W_{1c} + \frac{\lambda_1}{2} T L_c V^T U_{1c} = 0 \quad (8)$$

Similarly, the partial derivatives of $U_{2c}$ and $W_{2c}$ in Eq.(6) are calculated and set to zero:

$$\frac{\partial f(U_{2c}, W_{2c})}{\partial U_{2c}} = VV^T U_{2c} - VT^T W_{2c} + \lambda_2 R_{2v} U_{2c} + \lambda_1 V L_{II} V^T U_{2c} + \frac{\lambda_1}{2} V L_c T^T W_{2c} = 0$$

$$\frac{\partial f(U_{2c}, W_{2c})}{\partial W_{2c}} = 2TT^T W_{2c} - TY_c - TV^T U_{2c} + \lambda_3 R_{2t} W_{2c} + \lambda_1 T L_{TT} T^T W_{2c} + \frac{\lambda_1}{2} T L_c V^T U_{2c} = 0$$

The optimal solution can be calculated by alternately applying the above updates until convergence. The main optimization procedures for the I2T task are summarized in Algorithm 1; they can be easily extended to the T2I task.

Algorithm 1: Task-dependent and Query-dependent Subspace Learning for Cross-modal Retrieval in I2T
Input: Image representation V and text representation T; parameters λ1, λ2, λ3, and β.
Output: The projection matrices $U_{1c}$ and $W_{1c}$.
1: Train the classifier with LinearSVM.
2: Construct the Laplacian matrices $L_{II}$ and $L_{TT}$ of the multi-modal graph regularization term.
3: for c = 1, 2, ..., C do
4:   Initialize $U^t_{1c}$ and $W^t_{1c}$ as identity matrices, and set t = 0.
5:   Construct the Laplacian matrix $L_c$ and the semantic label matrix $Y_c$ corresponding to the samples of the cth class.
6:   repeat
7:     Compute $r^t_i$ according to Eq.(4).
8:     By fixing the other variable in Eq.(7), update $U_{1c}$ as:
$$U^{t+1}_{1c} = (2VV^T + \lambda_2 R_{1v} + \lambda_1 V L_{II} V^T)^{-1} \big(V Y_c + V T^T W^t_{1c} - \tfrac{\lambda_1}{2} V L_c T^T W^t_{1c}\big) \quad (9)$$
9:     By fixing the other variable in Eq.(8), update $W_{1c}$ as:
$$W^{t+1}_{1c} = (T T^T + \lambda_3 R_{1t} + \lambda_1 T L_{TT} T^T)^{-1} \big(T V^T U^t_{1c} - \tfrac{\lambda_1}{2} T L_c V^T U^t_{1c}\big) \quad (10)$$
10:    t = t + 1
11:  until convergence
12: end for
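The following numpy sketch illustrates the alternating updates of Eq.(9) and Eq.(10) for a single class c in the I2T task, with the l2,1 reweighting matrices recomputed from Eq.(4) at every iteration. It assumes dense matrices and a simple objective-based stopping test; it is an illustrative reading of Algorithm 1, not the authors' released code, and all names are hypothetical.

```python
import numpy as np

def l21_reweight(M, eps=1e-8):
    """Diagonal R with R_bb = 1 / (2 * sqrt(||m_b||_2^2 + eps)), Eq.(4)."""
    row_norms2 = np.sum(M ** 2, axis=1)
    return np.diag(1.0 / (2.0 * np.sqrt(row_norms2 + eps)))

def solve_i2t_class(V, T, Yc, L_II, L_TT, Lc, lam1, lam2, lam3,
                    n_iter=20, tol=1e-4):
    """Alternating updates (9)-(10) of U_1c, W_1c for the I2T task and class c.
    V: d1 x n image features, T: d2 x n text features, Yc: n x C label matrix."""
    d1, d2, C = V.shape[0], T.shape[0], Yc.shape[1]
    U = np.eye(d1, C)                      # identity-like initialization
    W = np.eye(d2, C)
    prev = np.inf
    for _ in range(n_iter):
        R1v, R1t = l21_reweight(U), l21_reweight(W)
        # Eq.(9): update U_1c with W_1c fixed
        A = 2 * V @ V.T + lam2 * R1v + lam1 * V @ L_II @ V.T
        b = V @ Yc + V @ T.T @ W - 0.5 * lam1 * V @ Lc @ T.T @ W
        U = np.linalg.solve(A, b)
        # Eq.(10): update W_1c with U_1c fixed
        A = T @ T.T + lam3 * R1t + lam1 * T @ L_TT @ T.T
        b = T @ V.T @ U - 0.5 * lam1 * T @ Lc @ V.T @ U
        W = np.linalg.solve(A, b)
        # monitor the regression part of Eq.(5) as a simple convergence test
        obj = np.linalg.norm(V.T @ U - Yc) ** 2 + np.linalg.norm(V.T @ U - T.T @ W) ** 2
        if abs(prev - obj) < tol:
            break
        prev = obj
    return U, W
```

Running this routine for every class c (and analogously for the T2I task) yields the 2C couples of projection matrices that populate the task-category-projection mapping table.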
F. Convergence Analysis

In this section, we prove that the proposed iterative algorithm in Algorithm 1 converges. To this end, we first introduce two lemmas.

Lemma 1. Let $u^i_t$ be the ith row of $U^t_{1c}$ in the previous iteration and $u^i_{t+1}$ be the ith row of $U^{t+1}_{1c}$ in the current iteration. It has been shown in [47] that the following inequality holds:

$$\|u^i_{t+1}\|_2 - \frac{\|u^i_{t+1}\|_2^2}{2\|u^i_t\|_2} \le \|u^i_t\|_2 - \frac{\|u^i_t\|_2^2}{2\|u^i_t\|_2}$$

Lemma 2. Given $U^t_{1c} = [u^1_t, u^2_t, ..., u^{d_1}_t]$, where $u^i_t$ is the ith row of $U^t_{1c}$, we have the following conclusion:

$$\sum_{i=1}^{d_1}\|u^i_{t+1}\|_2 - \sum_{i=1}^{d_1}\frac{\|u^i_{t+1}\|_2^2}{2\|u^i_t\|_2} \le \sum_{i=1}^{d_1}\|u^i_t\|_2 - \sum_{i=1}^{d_1}\frac{\|u^i_t\|_2^2}{2\|u^i_t\|_2}$$

Proof. By summing up the inequalities of Lemma 1 over all rows of $U^t_{1c}$, we can easily obtain the conclusion of Lemma 2.

For a given $W^t_{1c} = [w^1_t, w^2_t, ..., w^{d_2}_t]$, we can similarly obtain the conclusions of Lemma 1 and Lemma 2:

$$\|w^i_{t+1}\|_2 - \frac{\|w^i_{t+1}\|_2^2}{2\|w^i_t\|_2} \le \|w^i_t\|_2 - \frac{\|w^i_t\|_2^2}{2\|w^i_t\|_2}$$

$$\sum_{i=1}^{d_2}\|w^i_{t+1}\|_2 - \sum_{i=1}^{d_2}\frac{\|w^i_{t+1}\|_2^2}{2\|w^i_t\|_2} \le \sum_{i=1}^{d_2}\|w^i_t\|_2 - \sum_{i=1}^{d_2}\frac{\|w^i_t\|_2^2}{2\|w^i_t\|_2}$$

Theorem 1. At each iteration of Algorithm 1, the value of the objective function in Eq.(5) monotonically decreases until convergence.

Proof. For brevity, we denote by $R(U^t_{1c}, W^t_{1c})$ the loss corresponding to the first three terms of the objective function in Eq.(5). Suppose $U^{t+1}_{1c}$ and $W^{t+1}_{1c}$ are the optimized solutions of problem (5); then we have:

$$R(U^{t+1}_{1c}, W^{t+1}_{1c}) + \lambda_2 \mathrm{Tr}\big((U^{t+1}_{1c})^T R_{1v} U^{t+1}_{1c}\big) + \lambda_3 \mathrm{Tr}\big((W^{t+1}_{1c})^T R_{1t} W^{t+1}_{1c}\big) \le R(U^t_{1c}, W^t_{1c}) + \lambda_2 \mathrm{Tr}\big((U^t_{1c})^T R_{1v} U^t_{1c}\big) + \lambda_3 \mathrm{Tr}\big((W^t_{1c})^T R_{1t} W^t_{1c}\big)$$

$$\Rightarrow R(U^{t+1}_{1c}, W^{t+1}_{1c}) + \lambda_2 \sum_{i=1}^{d_1}\frac{\|u^i_{t+1}\|_2^2}{2\|u^i_t\|_2} + \lambda_3 \sum_{i=1}^{d_2}\frac{\|w^i_{t+1}\|_2^2}{2\|w^i_t\|_2} \le R(U^t_{1c}, W^t_{1c}) + \lambda_2 \sum_{i=1}^{d_1}\frac{\|u^i_t\|_2^2}{2\|u^i_t\|_2} + \lambda_3 \sum_{i=1}^{d_2}\frac{\|w^i_t\|_2^2}{2\|w^i_t\|_2}$$

$$\Rightarrow R(U^{t+1}_{1c}, W^{t+1}_{1c}) + \lambda_2 \sum_{i=1}^{d_1}\|u^i_{t+1}\|_2 - \lambda_2 \Big(\sum_{i=1}^{d_1}\|u^i_{t+1}\|_2 - \sum_{i=1}^{d_1}\frac{\|u^i_{t+1}\|_2^2}{2\|u^i_t\|_2}\Big) + \lambda_3 \sum_{i=1}^{d_2}\|w^i_{t+1}\|_2 - \lambda_3 \Big(\sum_{i=1}^{d_2}\|w^i_{t+1}\|_2 - \sum_{i=1}^{d_2}\frac{\|w^i_{t+1}\|_2^2}{2\|w^i_t\|_2}\Big) \le R(U^t_{1c}, W^t_{1c}) + \lambda_2 \sum_{i=1}^{d_1}\|u^i_t\|_2 - \lambda_2 \Big(\sum_{i=1}^{d_1}\|u^i_t\|_2 - \sum_{i=1}^{d_1}\frac{\|u^i_t\|_2^2}{2\|u^i_t\|_2}\Big) + \lambda_3 \sum_{i=1}^{d_2}\|w^i_t\|_2 - \lambda_3 \Big(\sum_{i=1}^{d_2}\|w^i_t\|_2 - \sum_{i=1}^{d_2}\frac{\|w^i_t\|_2^2}{2\|w^i_t\|_2}\Big)$$

According to Lemma 2, we finally arrive at

$$R(U^{t+1}_{1c}, W^{t+1}_{1c}) + \lambda_2 \sum_{i=1}^{d_1}\|u^i_{t+1}\|_2 + \lambda_3 \sum_{i=1}^{d_2}\|w^i_{t+1}\|_2 \le R(U^t_{1c}, W^t_{1c}) + \lambda_2 \sum_{i=1}^{d_1}\|u^i_t\|_2 + \lambda_3 \sum_{i=1}^{d_2}\|w^i_t\|_2$$

Hence, we have proved that the value of the objective function in Eq.(5) monotonically decreases at each iteration. It is clear that Theorem 1 guarantees that Algorithm 1 converges to an optimal solution.
TABLE II: mAP of all compared approaches on the three datasets. The best result in each column is marked in bold.

                Wikipedia-CNN             Pascal Sentence           INRIA-Websearch
Methods         I2T    T2I    Average     I2T    T2I    Average     I2T    T2I    Average
CCA [27]        0.226  0.246  0.236       0.261  0.356  0.309       0.274  0.392  0.333
SM [27]         0.403  0.357  0.380       0.426  0.467  0.446       0.439  0.517  0.478
SCM [27]        0.351  0.324  0.337       0.369  0.375  0.372       0.403  0.372  0.387
T-V CCA [46]    0.310  0.316  0.313       0.337  0.439  0.388       0.329  0.500  0.415
GMLDA [31]      0.372  0.322  0.347       0.456  0.448  0.452       0.505  0.522  0.514
GMMFA [31]      0.371  0.322  0.346       0.455  0.447  0.451       0.492  0.510  0.501
MDCR [17]       0.435  0.394  0.415       0.455  0.471  0.463       0.520  0.551  0.535
TQSL            0.463  0.415  0.439       0.505  0.502  0.504       0.541  0.550  0.546
TABLE III: Statistics of the test collections (BoW denotes Bag-of-Words).

Datasets          Wikipedia-CNN       Pascal Sentence     INRIA-Websearch
#Database         2,866               1,000               14,698
#Query            693                 400                 4,366
#Training         2,173               600                 10,332
Visual Feature    CNN (4,096-D)       CNN (4,096-D)       CNN (4,096-D)
Text Feature      BoW+LDA (100-D)     BoW+LDA (100-D)     BoW+LDA (1,000-D)
IV. EXPERIMENTAL CONFIGURATION

In this section, we introduce the experimental settings, including the experimental datasets, the evaluation metric, and the compared approaches.

A. Experimental Datasets

Table III summarizes the key statistics of the test collections.
• Wikipedia-CNN [17] consists of 2,866 image-text pairs. Each pair belongs to one of 10 semantic categories. The dataset is randomly split into two parts: 2,173 pairs for training and the remaining 693 pairs for testing. The image features are 4,096-dimensional Convolutional Neural Network (CNN) [6] features and the text features are 100-dimensional latent Dirichlet allocation (LDA) [27] features.
• Pascal Sentence [6] consists of 1,000 image-text pairs comprising 20 semantic categories. In each semantic class, we randomly select 30 pairs as the training set and the remaining 20 pairs as the testing set. In this dataset, each image is represented with a 4,096-dimensional CNN feature, and each text is represented with a 100-dimensional LDA feature.
• INRIA-Websearch [48] contains 71,478 image-text pairs which are classified into 353 semantic categories. In the experiments, the largest 100 semantic classes are selected to construct an experimental dataset containing 14,698 image-text pairs. Among them, we select 70% of the image-text pairs in each class as the training set (10,332 pairs) and the rest as the testing set (4,366 pairs). Each image is represented by a 4,096-dimensional CNN feature and each text is represented by a 1,000-dimensional LDA feature.
B. Evaluation Metric

In the experiments, we use the Euclidean distance to measure the similarity of multi-modal data in the shared subspace. Mean Average Precision (mAP) [27] is adopted to evaluate the retrieval performance. The AP of the top T returned results is calculated by:

$$AP = \frac{1}{R} \sum_{k=1}^{T} p(k)\,\delta(k)$$

where R is the number of relevant samples in the retrieved set, p(k) denotes the precision of the top k retrieved samples, and δ(k) = 1 if the kth retrieved sample is relevant to the query and δ(k) = 0 otherwise. The mAP score is calculated by averaging the AP scores of all the queries. In addition, we also report precision-recall curves [32] to illustrate the variation of retrieval precision with the recall rate.

C. Compared Methods

In the experiments, we mainly compare the proposed TQSL with several state-of-the-art cross-modal retrieval methods. CCA [27] utilizes the pair-wise information to learn a common subspace where only the pair-wise closeness is concerned. T-V CCA [46] explicitly merges high-level semantic information into a third view to provide better separation between classes. Four popular supervised methods (i.e., GMMFA [31], GMLDA [31], SM [27], and SCM [27]) take semantic information into account to learn a common subspace. GMLDA [31] and GMMFA [31] use Linear Discriminant Analysis (LDA) and Marginal Fisher Analysis (MFA) with GMA [31] to learn a common discriminative subspace for cross-view classification, respectively. SM [27] represents documents at a higher level of abstraction so that there is a natural correspondence between the text and image spaces. SCM [27] combines subspace and semantic modeling, where logistic regression is performed within two maximally correlated subspaces. In addition, MDCR [17] is also compared with our method. It learns two couples of projection matrices for the different retrieval tasks.
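For completeness, a small sketch of the AP and mAP computation described above is given below; the input is assumed to be a binary relevance list per query, ordered by decreasing similarity, and the function names are illustrative.

```python
import numpy as np

def average_precision(relevance, T=None):
    """AP over the top-T returned results.
    relevance: 1/0 sequence ordered by decreasing similarity to the query."""
    rel = np.asarray(relevance[:T], dtype=float)
    R = rel.sum()                              # number of relevant samples retrieved
    if R == 0:
        return 0.0
    ranks = np.arange(1, len(rel) + 1)
    precision_at_k = np.cumsum(rel) / ranks    # p(k)
    return float(np.sum(precision_at_k * rel) / R)

def mean_average_precision(all_relevance, T=None):
    """mAP: mean of the AP scores over all queries."""
    return float(np.mean([average_precision(r, T) for r in all_relevance]))
```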
Fig. 2: Precision-Recall curves of all compared approaches on the three datasets: (a) I2T on Wikipedia-CNN; (b) I2T on Pascal Sentence; (c) I2T on INRIA-Websearch; (d) T2I on Wikipedia-CNN; (e) T2I on Pascal Sentence; (f) T2I on INRIA-Websearch.
V. EXPERIMENTAL RESULTS

In this section, we evaluate the performance of the proposed TQSL with the following four experiments. First, we compare our method with the state-of-the-art methods based on the mAP score, the precision-recall curve, and the retrieval performance on each class. Then, the effectiveness of task-dependent and query-dependent subspace learning for cross-modal retrieval is further validated. Next, a parameter experiment is conducted to evaluate the robustness of TQSL. Finally, the convergence of the proposed method is verified experimentally.

A. Performance Comparisons

Table II presents the main results on the three datasets. From the table, we find that the mAP scores of our method on the two sub-retrieval tasks are higher than, or at least comparable to, those of the compared approaches. In addition, we report the precision-recall curves of all compared approaches in Figure 2. From the figure, it is clear that our approach achieves better performance on the two sub-retrieval tasks.

On Wikipedia-CNN, the mAP scores of our method on the two retrieval tasks are 0.463 and 0.415, respectively. They are higher than the best performance of the compared approaches. Furthermore, the corresponding retrieval precision of each category is presented in Figure 3(a)-(c). The experimental results show that our method obtains better results on 7 classes. With category-specific subspace learning and adaptive query accommodation, the semantic characteristics of queries can be well captured and thus the performance is improved by TQSL.

On Pascal Sentence, TQSL outperforms the second best performance by more than 3%. Among the compared approaches, CCA obtains the worst performance in all cases. This is because CCA only employs unsupervised sample pairs to learn the shared space and thereby no explicit semantics are
exploited. We also find that MDCR achieves the second best performance. The reason is that MDCR learns a modality-dependent subspace for cross-modal retrieval. This result confirms the importance of considering the characteristics of different sub-retrieval tasks. The corresponding results on each category are presented in Figure 3(d)-(f). The experimental results show that our method is effective.

On INRIA-Websearch, we can observe that our method achieves superior performance over the compared methods in most cases. This dataset contains 100 semantic classes. The results demonstrate that our proposed approach can still perform well when complex semantic categories are included in the database.

B. Effects of Task- and Query-dependent Subspace Learning

In this subsection, we conduct experiments on Pascal Sentence to validate the effectiveness of task-dependent and query-dependent subspace learning. The main results are presented in Table IV. TQSL-I denotes the variant of our method that removes the query-dependent subspace learning from TQSL. In this implementation, the same projection matrices are learned for all queries, so the relationship between the features and the semantic information of each semantic category is not fully considered when learning the projection matrices. We can observe from Table IV that the average mAP score of TQSL-I is 2.1% lower than that of TQSL on Pascal Sentence. TQSL-II denotes the variant of our method that removes the task-dependent subspace learning from TQSL. To implement TQSL-II, the linear regression terms from the different feature spaces to the semantic spaces are simultaneously optimized in the same objective function. As shown in Table IV, the performance of the different retrieval tasks is greatly degraded, especially for the image query. The possible reason is that the same projection matrices are learned for the different sub-retrieval tasks. Under such circumstances, the method cannot guarantee an effective semantic representation of the query in the shared subspace and thus leads to sub-optimal retrieval performance.
Fig. 3: mAP performance of each class on Wikipedia-CNN and Pascal Sentence: (a) I2T, (b) T2I, and (c) average performance on Wikipedia-CNN; (d) I2T, (e) T2I, and (f) average performance on Pascal Sentence.

TABLE IV: mAP comparison between TQSL and its variants.

Methods    Image query    Text query    Average
TQSL       0.505          0.502         0.504
TQSL-I     0.472          0.493         0.483
TQSL-II    0.315          0.482         0.399
We also illustrate the variation of retrieval performance with the number of predicted semantic categories used in the subspace matching process. In particular, in this experiment, we first calculate the matching similarity in each selected category-specific subspace. The final similarities of the multimedia documents are then obtained as the weighted sum of the similarities calculated from the selected category-specific subspaces, where the weights are the predicted probabilities of the corresponding semantic categories. Figure 4 presents the main results on Pascal Sentence. From the figure, we can observe that the retrieval performance is best when only one optimal predicted category is employed. The results demonstrate the effectiveness of the task-category-projection mapping strategy in TQSL.
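The sketch below illustrates this weighted fusion for an I2T query: the query and the database texts are projected with the projection matrices of each selected category, the negative Euclidean distance is used as the similarity (Section IV-B), and the per-subspace similarities are combined with the predicted class probabilities as weights. The variable names, the probability source, and the mapping-table layout are assumptions made for illustration.

```python
import numpy as np

def fused_i2t_ranking(query_img, texts, proj_table, class_probs, top_c=1):
    """Rank database texts for an image query by fusing the similarities from
    the top-`top_c` predicted categories' subspaces, weighted by class probability.
    query_img: (d1,), texts: (d2, N), proj_table[c] = (U_1c, W_1c)."""
    chosen = np.argsort(class_probs)[::-1][:top_c]   # most probable categories
    fused = np.zeros(texts.shape[1])
    for c in chosen:
        U, W = proj_table[c]
        q = query_img @ U                       # project the query image
        db = texts.T @ W                        # project the database texts
        dist = np.linalg.norm(db - q, axis=1)   # Euclidean distance in the subspace
        fused += class_probs[c] * (-dist)       # weighted similarity (negative distance)
    return np.argsort(fused)[::-1]              # indices sorted by decreasing similarity
```

With top_c = 1 this reduces to the default TQSL behavior of using only the single predicted category, which Figure 4 shows to perform best.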
Fig. 4: The variations of mAP with the number of selected category-specific subspaces on Pascal Sentence: (a) I2T, (b) T2I.
C. Parameter Sensitivity
In this subsection, we conduct experiments to evaluate the robustness of the proposed approach. There are four parameters λ1, λ2, λ3, and β in the proposed method. λ1 is the weighting parameter of the multi-modal graph regularization term. λ2 and λ3 are the balance parameters for the feature selection in the image and text modalities, respectively. β balances the importance of the inter-modality and intra-modality similarity relationships. In the experiments, we tune β from {0.0001, 0.001, 0.01, 0.1}. As shown in Figure 5(a)-(b), the retrieval performance on the different sub-retrieval tasks is stable over a wide range. On different datasets, λ1, λ2, and λ3 may be set to different values for different retrieval tasks to optimize the retrieval performance. In this section, these three parameters are adjusted within {0.001, 0.01, 0.1, 1} on Pascal Sentence. In the experiments, one of the parameters is fixed to observe the performance variations with the other two parameters. The experimental results for the I2T and T2I tasks are shown in Figure 5(c)-(e) and Figure 5(f)-(h), respectively. It can be seen that the performance of the proposed method is relatively stable with respect to the parameters λ2 and λ3. For λ1, the performance is stable in the range of {0.001, 0.01, 0.1}.
D. Convergence Experiment

In this subsection, we conduct an experiment to validate the convergence of the proposed iterative optimization approach. Figure 6 shows the convergence curves of the different retrieval tasks on Pascal Sentence. It can be seen that the objective function value tends to be stable as the number of iterations increases, and the proposed approach converges within about 5 iterations. These experimental results show that the convergence of the TQSL method can be guaranteed.
Fig. 5: Performance variations with the key parameters on Pascal Sentence. Panels (a)-(b) show the effect of β; panels (c)-(e) show the I2T task with λ1 = 0.1, λ2 = 1, and λ3 = 1 fixed in turn; panels (f)-(h) show the T2I task with λ1 = 0.01, λ2 = 1, and λ3 = 1 fixed in turn.
Fig. 6: The variations of objective function values with iterations on Pascal Sentence: (a) I2T, (b) T2I.
VI. CONCLUSION

In this paper, we propose a novel task-dependent and query-dependent subspace learning (TQSL) approach for cross-modal retrieval. Via iterative optimization on a unified cross-modal learning framework, we obtain the task-dependent and query-dependent projection matrices. In online retrieval, an adaptive projected subspace for cross-modal retrieval can be effectively identified by considering both the specific sub-retrieval task and the potential semantic category of the query. Experimental results on publicly available datasets demonstrate the superiority of the proposed TQSL compared with several state-of-the-art approaches.
VII. ACKNOWLEDGEMENT

The work is partially supported by the National Natural Science Foundation of China (No. 61572298, 61772322, 61402268, 61401260, 61601268) and the Technology and Development Project of Shandong (No. 2017GGX10117).

REFERENCES

[1] X. Chang, Z. Ma, Y. Yang, Z. Zeng, and A. G. Hauptmann, "Bi-level semantic representation analysis for multimedia event detection," IEEE Trans Cybern, vol. 47, no. 5, pp. 1180–1197, 2017. [2] L. Zhu, Z. Huang, X. Liu, X. He, J. Sun, and X. Zhou, "Discrete multimodal hashing with canonical views for robust mobile landmark search," IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2066–2079, 2017. [3] L. Zhang, B. Ma, G. Li, Q. Huang, and Q. Tian, "Generalized semisupervised and structured subspace learning for cross-modal retrieval," IEEE Transactions on Multimedia, vol. 20, no. 1, pp. 128–141, 2018. [4] V. Ranjan, N. Rasiwasia, and C. V. Jawahar, "Multi-label cross-modal retrieval," in IEEE International Conference on Computer Vision, 2015, pp. 4094–4102.
[5] K. Wang, R. He, W. Wang, L. Wang, and T. Tan, “Learning coupled feature spaces for cross-modal matching,” in IEEE International Conference on Computer Vision, 2013, pp. 2088–2095. [6] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, “Cross-modal retrieval with cnn visual features: A new baseline.” IEEE Transactions on Cybernetics, vol. 47, no. 2, p. 449, 2017. [7] X. Chang and Y. Yang, “Semi-supervised feature analysis by mining correlations among multiple tasks,” IEEE Transactions on Neural Networks & Learning Systems, vol. 28, no. 10, pp. 2294–2305, 2014. [8] A. Hauptmann, R. Yan, W. H. Lin, M. Christel, and H. Wactlar, “Can high-level concepts fill the semantic gap in video retrieval? a case study with broadcast news,” IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 958–966, 2007. [9] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention-based lstm and semantic consistency,” IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045–2055, 2017. [10] X. Chang, Y. L. Yu, Y. Yang, and E. P. Xing, “Semantic pooling for complex event analysis in untrimmed videos,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 39, no. 8, pp. 1617–1632, 2017. [11] X. Zhai, Y. Peng, and J. Xiao, “Heterogeneous metric learning with joint graph regularization for cross-media retrieval,” in AAAI Conference on Artificial Intelligence, 2013, pp. 1198–1204. [12] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, “Learning consistent feature representation for cross-modal multimedia retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 370–381, 2015. [13] L. Zhang, L. Wang, and W. Lin, “Conjunctive patches subspace learning with side information for collaborative image retrieval,” IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society, vol. 21, no. 8, pp. 3707–3720, 2012. [14] G. Cao, A. Iosifidis, K. Chen, and M. Gabbouj, “Generalized multiview embedding for visual recognition and cross-modal retrieval,” IEEE Transactions on Cybernetics, vol. PP, no. 99, pp. 1–14, 2017, doi:10.1109/TCYB.2017.2742705, [15] X. Xu, Y. Yang, A. Shimada, R. I. Taniguchi, and L. He, “Semisupervised coupled dictionary learning for cross-modal retrieval in internet images and texts,” in ACM International Conference on Multimedia, 2015, pp. 847–850. [16] X. Dong, J. Sun, P. Duan, L. Meng, Y. Tan, W. Wan, H. Wu, B. Zhang, and H. Zhang, “Semi-supervised modality-dependent cross-media retrieval,” Multimedia Tools & Applications, vol. 77, no. 3, pp. 3579–3595, 2018. [17] Y. Wei, Y. Zhao, Z. Zhu, S. Wei, Y. Xiao, J. Feng, and S. Yan, “Modalitydependent cross-media retrieval,” Acm Transactions on Intelligent Systems & Technology, vol. 7, no. 4, pp. 1–13, 2016. [18] J. Yan, H. Zhang, J. Sun, Q. Wang, P. Guo, L. Meng, W. Wan, and X. Dong, “Joint graph regularization based modality-dependent crossmedia retrieval,” Multimedia Tools & Applications, vol. 77, no. 3, pp. 3009–3027, 2018. [19] L. Zhu, J. Shen, L. Xie, and Z. Cheng, “Unsupervised visual hashing with semantic assistant for content-based image retrieval,” IEEE Transactions on Knowledge & Data Engineering, vol. 29, no. 2, pp. 472–486, 2017. [20] L. Zhang, L. Wang, and W. Lin, “Generalized biased discriminant analysis for content-based image retrieval,” IEEE Transactions on Systems Man & Cybernetics Part B Cybernetics A Publication of the IEEE Systems Man & Cybernetics Society, vol. 42, no. 1, pp. 282–290, 2012. [21] L. Zhu, J. She, X. Liu, L. Xie, and L. 
Nie, “Learning compact visual representation with canonical views for robust mobile landmark search,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016, pp. 3959–3965. [22] Y. Wang, H. Zhang, and F. Yang, “A weighted sparse neighbourhoodpreserving projections for face recognition,” IETE journal of research, vol. 63, no. 3, pp. 358–367, 2017. [23] Y. H. Jia, W. N. Chen, T. Gu, H. Zhang, H. Yuan, Y. Lin, W. J. Yu, and J. Zhang, “A dynamic logistic dispatching system with setbased particle swarm optimization,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. PP, no. 99, pp. 1–15, 2017, doi:10.1109/TSMC.2017.2682264. [24] X. Chang, Z. Ma, M. Lin, Y. Yang, and A. Hauptmann, “Feature interaction augmented sparse learning for fast kinect motion detection.” IEEE Trans Image Process, vol. 26, no. 8, pp. 3911–3920, 2017. [25] H. Zhang and J. Lu, “Creating ensembles of classifiers via fuzzy clustering and deflection,” Fuzzy Sets and Systems, vol. 161, no. 13, pp. 1790–1802, 2010.
[26] L. Zhang, L. Wang, and W. Lin, “Semi-supervised biased maximum margin analysis for interactive image retrieval.” IEEE Trans Image Process, vol. 21, no. 4, pp. 2294–2308, 2012. [27] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in ACM International Conference on Multimedia, 2010, pp. 251–260. [28] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning, 2013, pp. III–1247. [29] S. J. Hwang and K. Grauman, “Accounting for the relative importance of objects in image retrieval,” in British Machine Vision Conference, 2010, pp. 1–12. [30] A. Sharma and D. W. Jacobs, “Bypassing synthesis: Pls for face recognition with pose, low-resolution andsketch,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 7, 2011, pp. 593–600. [31] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 157, no. 10, 2012, pp. 2160–2167. [32] Q. Y. Jiang and W. J. Li, “Deep cross-modal hashing,” in CVPR, 2016, pp. 3270–3278. [33] Z. Zeng, Z. Li, D. Cheng, H. Zhang, K. Zhan, and Y. Yang, “Twostream multi-rate recurrent neural network for video-based pedestrian re-identification,” IEEE Transactions on Industrial Informatics, vol. PP, no. 99, pp. 1–1, 2017, doi:10.1109/TII.2017.2767557. [34] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, “Pairwise relationship guided deep hashing for cross-modal retrieval,” in AAAI, 2017, pp. 1618–1625. [35] L. Zhu, Z. Huang, Z. Li, L. Xie, and H. T. Shen, “Exploring auxiliary context: Discrete semantic transfer hashing for scalable image retrieval,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–13, 2018, doi:10.1109/TNNLS.2018.2797248. [36] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 40, no. 4, pp. 769–790, 2018. [37] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin, “Liblinear: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, no. 9, pp. 1871–1874, 2008. [38] H. Zhang, L. Cao, and S. Gao, “A locality correlation preserving support vector machine,” Pattern Recognition, vol. 47, no. 9, pp. 3168–3178, 2014. [39] J. Tang, K. Wang, and L. Shao, “Supervised matrix factorization hashing for cross-modal retrieval,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3157–3166, 2016. [40] L. Zhu, J. Shen, L. Xie, and Z. Cheng, “Unsupervised topic hypergraph hashing for efficient mobile image retrieval,” IEEE Transactions on Cybernetics, vol. 47, no. 11, pp. 3941–3954, 2017. [41] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint feature selection and subspace learning for cross-modal retrieval,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 38, no. 10, pp. 2010–2023, 2016. [42] T. Yao, X. Kong, H. Fu, and Q. Tian, “Semantic consistency hashing for cross-modal retrieval,” Neurocomputing, vol. 193, no. C, pp. 250–259, 2016. [43] J. Song, L. Gao, F. Nie, H. T. Shen, and N. Sebe, “Optimized graph learning using partial tags and multiple features for image and video annotation,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 4999–5011, 2016. [44] R. He, T. Tan, L. Wang, and W. 
Zheng, “L2, 1 regularized correntropy for robust feature selection,” in IEEE Conference on Conputer Vision and Pattern Recognitionn, vol. 157, no. 10, 2012, pp. 2504–2511. [45] M. Nikolova and M. K. Ng, “Analysis of half-quadratic minimization methods for signal and image recovery,” vol. 27, no. 3, pp. 937–966, 2005. [46] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014. [47] Y. Yang, F. Shen, H. T. Shen, H. Li, and X. Li, “Robust discrete spectral hashing for large-scale image semantic indexing,” IEEE Transactions on Big Data, vol. 1, no. 4, pp. 162–171, 2015. [48] J. Krapac, M. Allan, J. Verbeek, and F. Juried, “Improving web image search results using query-relative classifiers,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 119, no. 5, 2010, pp. 1094–1101.
Li Wang received the bachelor degree in Information Management and Information Systems from the Institute of Disaster Prevention Science and Technology, Langfang, Hebei, in 2016. She is currently working toward the master degree in Internet of Things Application Technology at Shandong Normal University. Her research interests include cross-modal retrieval/hashing, machine learning, and computer vision. She is a student member of the CCF.
Huaxiang Zhang is currently a professor with the School of Information Science and Engineering, the Institute of Data Science and Technology, Shandong Normal University, China. He received his Ph.D. from Shanghai Jiaotong University in 2004, and worked as an associated professor with the Department of Computer Science, Shandong Normal University from 2004 to 2005. He has authored over 170 journal and conference papers and has been granted 10 invention patents. His current research interests include machine learning, pattern recognition, evolutionary computation, cross-media retrieval, web information processing, bioinformatics, etc.
Lei Zhu received the B.S. degree (2009) at Wuhan University of Technology, the Ph.D. degree (2015) at Huazhong University of Science and Technology. He is currently a full Professor with the School of Information Science and Engineering, Shandong Normal University, China. He was a Research Fellow under the supervision of Prof. Heng Tao Shen at the University of Queensland (2016-2017), and Dr. Jialie Shen at the Singapore Management University (2015-2016). His research interests are in the area of large-scale multimedia content analysis and retrieval.
En Yu received the bachelor degree from the School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong, in 2016. He is currently working toward the master degree in Communication and Information Systems at Shandong Normal University. His research interests include multimedia processing and analysis, machine learning, and deep learning. He is a student member of the CCF.
Jiande Sun received the Ph.D. degree in communication and information system from Shandong University, Jinan, China, in 2005. From September 2008 to August 2009, he was a Visiting Researcher with the Institute of Telecommunications System, Technical University of Berlin, Berlin, Germany. From October 2010 to December 2012, he was a Post-Doctoral Researcher with the Institute of Digital Media, Peking University, Beijing, China, and with the State Key Laboratory of Digital-Media Technology, Hisense Group, respectively. From July 2014 to August 2015, he was a DAAD Visiting Researcher with Technical University of Berlin and University of Konstanz, Germany. From October 2015 to November 2016, he was a Visiting Researcher with the Language Technology Institute, School of Computer Science, Carnegie Mellon University, USA. He is currently a Professor with the School of Information Science and Engineering, Shandong Normal University. He has published more than 60 journal and conference papers. He is the co-author of two books. His current research interests include multimedia content analysis, video hashing, gaze tracking, image/video watermarking, 2D to 3D conversion, and so on.