Dense Auto-Encoder Hashing for Robust Cross-Modality Retrieval

Hong Liu¹²†, Mingbao Lin¹²†, Shengchuan Zhang¹², Yongjian Wu³, Feiyue Huang³, Rongrong Ji¹²‡

¹ Fujian Key Laboratory of Sensing and Computing for Smart City, Xiamen University, China
² School of Information Science and Engineering, Xiamen University, China
³ Tencent Youtu Lab, Tencent Technology (Shanghai) Co., Ltd, China

[email protected], [email protected], [email protected], {littlekenwu,garyhuang}@tencent.com, [email protected]

ABSTRACT

Cross-modality retrieval, which aims to search images in response to text queries or vice versa, has been widely studied. When facing large-scale datasets, cross-modality hashing serves as an efficient and effective solution, learning binary codes that approximate the cross-modality similarity in the Hamming space. Most recent cross-modality hashing schemes focus on learning hash functions from data instances with full modalities. However, learning robust binary codes in the presence of incomplete modalities (i.e., with one modality missing or only partially observed), a situation that widely occurs in real-world applications, remains unexplored. In this paper, we propose a novel cross-modality hashing, termed Dense Auto-encoder Hashing (DAH), which explicitly imputes the missing modality and produces robust binary codes by leveraging the relatedness among different modalities. To that end, we propose a novel Dense Auto-encoder Network (DAN) to impute missing modalities, which densely connects each layer to every other layer in a feed-forward fashion. For each layer, a noisy auto-encoder block is designed to compute the residual between the current prediction and the original data. Finally, a hash layer is added at the end of DAN, which serves as a special binary encoder that handles incomplete-modality input. Quantitative experiments on three cross-modality visual search benchmarks show that the proposed DAH outperforms state-of-the-art approaches.

Figure 1: An example of an incomplete dataset. White areas represent missing data, and colored blocks denote the corresponding data representations. The black dashed box marks complete raw data, as used in most existing cross-modality hashing methods. (Best viewed in color.)

ACM Reference Format:
Hong Liu, Mingbao Lin, Shengchuan Zhang, Yongjian Wu, Feiyue Huang, and Rongrong Ji. 2018. Dense Auto-Encoder Hashing for Robust Cross-Modality Retrieval. In 2018 ACM Multimedia Conference (MM '18), October 22-26, 2018, Seoul, Republic of Korea. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3240508.3240684

CCS CONCEPTS

• Computing methodologies → Computer vision;

KEYWORDS

Hash Learning; Binary Code Learning; Cross-Modal Search; Large-Scale Image Retrieval

† Contributed equally. ‡ Corresponding author.

1 INTRODUCTION

Cross-modality multimedia retrieval has been widely studied in recent years [6, 12-15, 23, 24, 27, 31-35]. In a typical scenario, the query comes from one modality, e.g., images, while the returned results come from another, e.g., texts, and vice versa. To handle cross-modality retrieval on large-scale datasets, binary code learning, a.k.a. hashing, has attracted much attention due to its low storage cost and fast retrieval speed. The key design is to encode features from different modalities into a common Hamming space, where the cross-modality similarities among the data are well preserved. Both supervised and unsupervised schemes have been explored for cross-modality hashing. For unsupervised hashing, representative works include, but are not limited to, Cross-View Hashing (CVH) [10], Predictable Dual-View Hashing (PDH) [20], Inter-Media Hashing (IMH) [24], Collective Matrix Factorization Hashing (CMFH) [2], and Fusion Similarity Hashing (FSH) [13].
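To make the efficiency argument concrete, the following minimal sketch (our illustration, not material from the paper) shows why matching binary codes in the Hamming space is cheap: with bit-packed codes, each distance reduces to XOR plus a bit count.

```python
import numpy as np

def hamming_distances(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Hamming distances between one bit-packed code and a database of codes.

    Codes are packed into uint8 arrays (a 64-bit code occupies 8 bytes), so
    each distance costs a few XORs and bit counts instead of float arithmetic.
    """
    xor = np.bitwise_xor(query[None, :], database)           # differing bits
    return np.unpackbits(xor, axis=1).sum(axis=1)            # popcount per code

rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(10000, 8), dtype=np.uint8)   # 10k 64-bit codes
q = rng.integers(0, 256, size=8, dtype=np.uint8)             # one query code
ranking = np.argsort(hamming_distances(q, db))               # retrieval by Hamming rank
```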


Figure 2: The framework of the proposed DAH. It contains two stages: the construction of the Dense Auto-encoder, and Variational Hash Learning. The first stage (left), the Dense Auto-encoder, transforms the incomplete modality input xi into a common space ỹi shared by all modalities, and is built by stacking denoising residual auto-encoder blocks. The second stage (right) uses an LSH-based hashing scheme to project the label vector L into supervised hash codes b, where label-based similarity is preserved. The variational hashing framework is then used to quantify the matching between the generated hash codes and the supervised codes (middle). All losses are jointly optimized by end-to-end deep training via back-propagation. The black holes in the input variable xi denote incomplete modalities filled with zeros. The notation ⊙ denotes the concatenation of vectors. (Best viewed in color.)

Given category or pairwise labels, supervised cross-modality hashing preserves the label relations in the produced Hamming space, which can further improve retrieval performance. In Co-Regularized Hashing (CRH) [36], a co-regularization based framework was put forward. Heterogeneous Translated Hashing (HTH) [30] learned hash functions embedding heterogeneous media into different Hamming spaces, and then aligned these spaces. Semantic Correlation Maximization (SCM) [33] was proposed to integrate semantic labels into the hash learning procedure. Semantic-Preserving Hashing (SePH) [11] transformed semantic affinities into a probability distribution and approximated it with to-be-learnt hash codes in the Hamming space by minimizing the Kullback-Leibler divergence. Liu et al. [12] proposed Supervised Matrix Factorization Hashing (SMFH), tackling the multi-modal hashing problem with a collective non-negative matrix factorization across the different modalities. Generalized Semantic Preserving Hashing (GSePH) [18] further generalized SePH to the n-label case.

Despite the extensive progress made, most existing cross-modality hashing schemes are still far from practical usage. In various real-world applications, the modalities of data samples are not always complete, which contradicts the assumption that data always comes with full modalities. This arises for various reasons, such as partial data sampling, the breakdown of sensors, or errors introduced during data processing. Fig. 1 illustrates this modality-missing problem. To this end, some previous works aim at dealing with incomplete datasets, such as Partial Multi-Modal Hashing (PM2H) [29] and Semi-Paired Discrete Hashing (SPDH) [22]. However, some problems remain unsolved. PM2H merely exploits cross-modality consistency with given correspondences but ignores the incomplete pairs; it assumes that a consistent representation among all complete/incomplete data must be preserved in the produced Hamming space.

To alleviate this problem, SPDH constructs a common Hamming space that preserves the similarities of complete and incomplete paired data with the help of complete anchor pairs, but its unsupervised learning and randomly sampled anchors leave the results unsatisfactory. Therefore, a novel cross-modality hashing scheme specifically designed to handle missing modalities is urgently needed.

To tackle the above problems, this paper presents a novel cross-modality hashing, termed Dense Auto-encoder Hashing (DAH), whose framework is shown in Fig. 2. DAH consists of two stages: (1) the Dense Auto-encoder Network (DAN) and (2) Variational Hashing Learning.

Our first innovation is the Dense Auto-encoder Network (DAN). Inspired by the Cascaded Residual Autoencoder (CRA) [25] and Densely Connected Networks [7], we minimize the residual error between the incomplete input and the complete output, relying on a densely connected structure for data completion. In this network, we adopt a denoising auto-encoder [26] as the basic component that imputes the missing data modality: its input layer is the data with missing modalities, while its output layer is the approximated complete data. The DAN structure not only differentiates between incomplete and complete samples, but also explicitly preserves the data information despite the missing modality. Towards robustly imputing the missing modality, we combine a metric learning loss with a reconstruction error loss for joint optimization when training the network.

Our second innovation is a Variational Hashing Learning scheme, which addresses the challenge of generating binary codes for the imputed full-modality data. The proposed scheme learns hash functions from supervised labels with a novel label distribution augmentation, and these functions can be attached to the output of DAN. Compared to other supervised hashing schemes, the proposed variational hashing scheme aims to match the distribution of hash codes to that of labels, which is more suitable for handling missing modalities. In particular, to bring the two distributions closer, we first restrict the label to a binary vector generated from a random signed projection, under which label similarity is well preserved [3].


Finally, hashing and DAN are learned jointly and optimized uniformly via back-propagation, which avoids the fundamental difficulty of directly integrating the binary constraint. The proposed DAH method is compared against various state-of-the-art cross-modality hashing methods, including [2, 10, 11, 18, 20, 22, 33], on several widely used cross-modality benchmarks, i.e., Wiki, MIR-Flickr25K, and NUS-WIDE. Quantitative results demonstrate that, given incomplete training data, DAH achieves very competitive results compared with other schemes given complete data. Given complete data, DAH still outperforms the existing cross-modality hashing methods by a large margin.

2 DENSE AUTO-ENCODER HASHING

2.1 Notations and Problems

In this section, we describe the details of the proposed DAH. Assume that $\mathcal{A} = \{A_1, A_2, ..., A_n\}$ is the training set with $n$ instances, where $A_i = [a_i^1; a_i^2; ...; a_i^t]$ is the $i$-th instance with $t$ modalities. Let $\mathbf{A}^t = [a_1^t, a_2^t, ..., a_n^t] \in \mathbb{R}^{d_t \times n}$ be the feature matrix of the $t$-th modality, where $a_i^t$ is the $i$-th data point of $\mathbf{A}^t$ with dimension $d_t$. We further denote $L = [l_1, l_2, ..., l_n]^T \in \{0, 1\}^{n \times C}$ as the label matrix, where $L_{i,j} = 1$ if the $i$-th instance belongs to class $j$ and $0$ otherwise, $C$ is the number of classes, and $l_i$ is the label vector of the $i$-th instance. We also denote $o^k = [0, 0, ..., 0] \in \mathbb{R}^{d_k}$ as the zero vector with the same dimension as the $k$-th modality feature.

Given the training set $\mathcal{A}$, existing cross-modality hashing learns the hash function $H^t(a^t)$ for the $t$-th modality and simultaneously learns the corresponding binary codes $b^t \in \{0, 1\}^r$, where $r$ is the number of hash bits. In many real applications, however, the training data comes with incomplete modalities, a setting under which most existing methods cannot work. An intuitive solution is stacked Denoising Auto-encoders (sDA) [26], which learn robust data representations by reconstruction, recovering the original features from artificially corrupted input. Similar to sDA, if the $k$-th modality is missing, we represent it with the zero vector, so the incomplete data can be rewritten as:

$$\hat{A}_i = [a_i^1; ...; o^k; ...; a_i^t], \qquad (1)$$

which serves as the new input. However, sDA can only deal with partially missing modalities; when an entire modality is absent, it cannot sufficiently recover the lost information by comparing the heterogeneous information among different modalities. In addition, due to differences in scale and dimension among the modalities, the completed feature recovered via sDA is not well-suited for direct binary encoding.

Therefore, the key challenge becomes how to encode the incomplete training data $\hat{A}_i$ into binary codes. The proposed DAH handles the above problems via two key innovations: the first is a Dense Auto-encoder that imputes the missing modality feature, and the second is Variational Hashing Learning, which preserves the supervised information in the learned hash codes. In the following, we first introduce the Dense Auto-encoder in Sec. 2.2 and then Variational Hash Learning in Sec. 2.3; both are integrated into a joint end-to-end learning framework.

Figure 3: Framework of the denoising auto-encoder block in the proposed Dense Auto-encoder Network.

2.2 Dense Auto-encoder Network

For ease of description, we take two-modality data as an example, in which the data $A_i$ covers three cases: the incomplete case $x_i = [a_i^1; o^2]$, the incomplete case $x_i = [o^1; a_i^2]$, and the complete case $y_i = [a_i^1; a_i^2]$. Following the recent advances in [25], DAN aims at recovering the complete data feature $y_i \in \mathbb{R}^{d_1 + d_2}$ via reconstruction. That is, we aim to find the best imputed full-modality vector $\tilde{y}_i$ that minimizes the error:

$$\mathrm{Err}(y_i, \tilde{y}_i) = \|y_i - \tilde{y}_i\|^2, \qquad (2)$$

where $\|\cdot\|$ is the $\ell_2$-norm. The key problem is thus to design the structure of DAN to minimize the reconstruction error in Eq. (2). To this end, the proposed Dense Auto-encoder Network (DAN) is composed of a set of denoising residual auto-encoder blocks; Fig. 3 shows the framework of each basic building block. Each block consists of several fully-connected layers, each followed by a Rectified Linear Unit (ReLU), similar to the traditional stacked denoising auto-encoder. In detail, for the $m$-th block, the first layer receives the features of all preceding blocks, $\Delta x_i^0, \Delta x_i^1, ..., \Delta x_i^{m-1}$, as:

$$\tilde{x}_i^m = [\Delta x_i^0; \Delta x_i^1; ...; \Delta x_i^{m-1}], \qquad (3)$$

where $[\Delta x_i^0; \Delta x_i^1; ...; \Delta x_i^{m-1}]$ denotes the concatenation of the outputs produced by blocks $\{0, ..., m-1\}$, and $\Delta x_i^0$ is the original input feature $x_i$, which serves as the $0$-th layer without any processing. This concatenated vector is then mapped to a latent representation with the same dimension as the input feature $x_i$. We further feed this latent representation into a residual auto-encoder layer, forming a residual structure that estimates the difference between the latent data and the desired target $y_i$, denoted as $\Delta x_i^m$. In other words, the residual $\Delta x_i^m$ can be viewed as the output of the $m$-th block, an approximation of the feature missing from the incomplete input. The corresponding estimate of the complete modality $\tilde{y}_i^m$ is given by:

$$\tilde{y}_i^m = x_i + \Delta x_i^m. \qquad (4)$$

Such a block architecture allows us to stack a set of basic blocks in a cascaded manner.
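As a concrete illustration of Eqs. (3)-(5), the following PyTorch sketch implements the dense stacking of residual auto-encoder blocks. It is our unofficial reading of the architecture: the layer widths, number of blocks, and the omission of explicit noise injection are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualAEBlock(nn.Module):
    """One denoising residual auto-encoder block (cf. Eqs. (3)-(4)).

    It receives the concatenation of the input and all previous residuals and
    outputs a residual Delta_x with the same dimension as the input feature.
    """
    def __init__(self, concat_dim: int, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(concat_dim, feat_dim), nn.ReLU(),  # concat -> latent of dim(x_i)
            nn.Linear(feat_dim, hidden), nn.ReLU(),      # encoder
            nn.Linear(hidden, feat_dim),                 # decoder estimates the residual
        )

    def forward(self, concat: torch.Tensor) -> torch.Tensor:
        return self.net(concat)

class DenseAutoEncoder(nn.Module):
    """Block m sees [x; Delta_x^1; ...; Delta_x^{m-1}] (Eq. (3)); the imputed
    feature is x plus the sum of all residuals (Eq. (5))."""
    def __init__(self, feat_dim: int, num_blocks: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ResidualAEBlock((m + 1) * feat_dim, feat_dim) for m in range(num_blocks)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs = [x]                                    # Delta_x^0 is the raw input
        for block in self.blocks:
            outputs.append(block(torch.cat(outputs, dim=1)))
        return x + sum(outputs[1:])                      # y_tilde = x + sum_m Delta_x^m

# Usage: zero-fill the missing modality (Eq. (1)), then impute; dims are toy values.
x = torch.cat([torch.randn(4, 128), torch.zeros(4, 10)], dim=1)  # text part missing
y_tilde = DenseAutoEncoder(feat_dim=138)(x)
```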


As shown in Fig. 2, cascading $M$ denoising residual auto-encoder blocks forms our densely connected auto-encoder, which recovers the complete modality feature by integrating all $M$ residual values:

$$\tilde{y}_i = x_i + \sum_{m=1}^{M} \Delta x_i^m, \qquad (5)$$

where $\tilde{y}_i$ is the final complete-modality representation of the incomplete input $x_i$. The reconstruction error in Eq. (2) for our densely connected model is therefore redefined as:

$$\mathcal{L}_{re} = \frac{1}{2}\big\|y_i - \tilde{y}_i\big\|_2^2 = \frac{1}{2}\Big\|y_i - \big(x_i + \textstyle\sum_{m=1}^{M} \Delta x_i^m\big)\Big\|_2^2. \qquad (6)$$

The estimate $\tilde{y}_i$ forms a latent variable from a common space shared by the input $x_i$ and its corresponding target $y_i$: $x_i$ contains just one modality, i.e., text or image, while the accumulated residuals approximate the other, desired modality. Thus, $\tilde{y}_i$ can be seen as the imputed full multi-modality feature, in which the missing modality has been recovered through DAN.

However, when the training data $\mathcal{A}$ contains only the incomplete data $x_i$ instead of the complete data $y_i$, the reconstruction in Eq. (6) cannot be used. We therefore add an indicator $o(A)$ that returns 1 if the current training instance $A$ is complete and 0 otherwise, giving the new reconstruction error:

$$\mathcal{L}_{re} = \frac{1}{2}\, o(A)\, \|y_i - \tilde{y}_i\|_2^2. \qquad (7)$$

In addition, the input $x_i$ and its estimate $\tilde{y}_i$ should share the same label, i.e., $\tilde{y}_i$ should be categorized into the same class as $x_i$. To this end, we adopt a margin-based Hinge loss, integrated with the label $l_i$ of $x_i$, to improve the robustness of $\tilde{y}_i$:

$$\mathcal{L}_{cl} = \sum_{l_h = l_i} \sum_{l_g \neq l_i} \max\big(0,\ \alpha + D(i, h) - D(i, g)\big), \qquad (8)$$

where $D(i, h)$ denotes the Euclidean distance between the imputed complete data $\tilde{y}_i$ and $\tilde{y}_h$, and $\alpha$ is the margin between the matched and unmatched classes.

Discussion. DAN builds on the concatenation of the outputs of all preceding blocks. By doing so, modality information from earlier layers is forwarded to deeper layers, which reduces information loss during message passing and leads to more robust features. Moreover, with more connections from front layers to deep layers, more gradient information is back-propagated to the earlier layers of the network.

2.3 Variational Hash Learning

Recent advances in cross-modality hashing often resort to the kernel trick [8, 11]. Most such works construct a different regression model for each modality, upon which each feature is regressed into the common Hamming space. However, since the modalities are heterogeneous, it is difficult to balance the common Hamming space against preserving the information of each modality. In addition, these methods, e.g., SePH [11], can be viewed as two-stage hashing: they first learn the hash codes and then train hash functions via kernelized logistic regression. Such a two-step procedure is problematic, since the construction of hash codes and the learning of hash functions are separated, which leads to sub-optimal results.

Based on Sec. 2.2, the output of DAN is a feature $\tilde{y}_i$ covering all modalities, in which the information from all modalities is preserved; the problem therefore reduces to regression from a single-modality feature. However, direct regression from the real-valued feature to discrete hash codes introduces additional information loss. Recent works [5, 21] adopted two regressions to reduce this quantization loss, which however may not be optimal for hash learning, and whose optimization is highly inefficient.

To solve this problem, we develop a novel hashing scheme in a generative way, which aims to match the distribution of the hash function to that of the labels. As shown in [3], locality-sensitive hashing can generate hash codes for labels that well preserve the similarity between two label vectors. We therefore randomly sample $K$ independent hyper-plane hash functions $R = [r_1, r_2, ..., r_K]^T \in \mathbb{R}^{K \times C}$ to generate the supervised hash codes $\hat{b}_i = \mathrm{sign}(R\, l_i) \in \{-1, 1\}^K$ for label $l_i$, where each $r_i \in \mathbb{R}^C$ is sampled from the standard Gaussian distribution $\mathcal{N}(0, I)$¹. As the labels have a much less complicated distribution than the data, the binary codes $\hat{b}_i$ can better approximate the true distribution in the Hamming space.

We then expect the learned hash codes of $\tilde{y}_i$ to be as close to $\hat{b}_i$ as possible. We achieve this with a simple Gaussian approximation that is matched to the true distribution of the labels. Define a generative model $p(\tilde{y}_i | b_i)$ and a probability $p(b_i)$ as the true distribution of the label's binary codes. We model the generation of $\tilde{y}_i$ given $b_i$ as:

$$p(\tilde{y}_i, b_i) = p(\tilde{y}_i | b_i)\, q(b_i), \quad \text{where } p(\tilde{y}_i | b_i) \sim \mathcal{N}(W b_i,\ \rho^2 I), \qquad (9)$$

where $W$ is a dictionary with $r$ words. Since the true discrete representation $b_i$ has been used, $p(b_i)$ is a constant, so the joint distribution can be written as:

$$p(\tilde{y}_i, b_i) \propto \exp\Big(-\frac{1}{2\rho^2}\, \|W b_i - \tilde{y}_i\|^2\Big). \qquad (10)$$

That is, a higher joint probability means the generated data is more similar to the original. Eq. (10) can also be seen as a confidence level for the reconstruction error. Similar to [19], we use a variational lower bound to define our objective function:

$$\log p(\tilde{y}_i) = \mathbb{E}_{q(b_i | \tilde{y}_i)}\big[\log p(\tilde{y}_i)\big] \qquad (11)$$
$$= \mathbb{E}_{q(b_i | \tilde{y}_i)}\big[\log p(\tilde{y}_i, b_i) - \log p(b_i | \tilde{y}_i)\big] \qquad (12)$$
$$= \mathbb{E}_{q(b_i | \tilde{y}_i)}\big[\log p(\tilde{y}_i, b_i) - \log q(b_i | \tilde{y}_i)\big] + KL\big(q(b_i | \tilde{y}_i)\,\|\,p(b_i | \tilde{y}_i)\big), \qquad (13)$$

where $q(b_i | \tilde{y}_i)$ can be seen as an encoder, added on top of Eq. (5) as the hash layer at the output of DAN, and $KL(\cdot\|\cdot)$ is the Kullback-Leibler divergence. The learned estimate $\tilde{y}_i$ is passed to a fully-connected layer, after which $\mathrm{sgn}(\cdot)$ is applied to quantize the output into binary codes. According to the theoretical results on the evidence lower bound (ELBO) in [9], the first term in Eq. (13) is the ELBO, which follows the Minimum Description Length principle and lower-bounds the log-likelihood of the imputed data:

$$\mathcal{L} \leq \log p(\tilde{y}_i) - KL\big(q(b_i | \tilde{y}_i)\,\|\,p(b_i | \tilde{y}_i)\big). \qquad (14)$$

¹ $0$ and $I$ are the all-zeros vector and the identity matrix, standing for the mean and covariance of this distribution, respectively.
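The label-to-code step above is plain random-hyperplane LSH, which is easy to reproduce; a minimal sketch (dimensions are illustrative):

```python
import numpy as np

def lsh_label_codes(labels: np.ndarray, num_bits: int, seed: int = 0) -> np.ndarray:
    """Supervised target codes b_hat = sign(R l) for an (n, C) binary label matrix.

    R has i.i.d. N(0, 1) entries, so similar label vectors fall on the same side
    of most hyperplanes and label similarity is preserved in the Hamming space.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((num_bits, labels.shape[1]))  # K random hyperplanes
    return np.sign(labels @ R.T)                          # (n, num_bits) codes

labels = np.eye(10)[np.random.randint(0, 10, size=5)]     # five one-hot labels, C = 10
b_hat = lsh_label_codes(labels, num_bits=64)
```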


Table 1: mAP and Pre@100 comparison for retrieval with incomplete modalities on two benchmarks, with hash bits fixed at 64. Percentages denote the fraction of training data with a missing modality; '–' denotes results that are not available.

Wiki:

| Task | Method | mAP 25% | mAP 50% | mAP 75% | mAP 100% | Pre@100 25% | Pre@100 50% | Pre@100 75% | Pre@100 100% |
|------|--------|---------|---------|---------|----------|-------------|-------------|-------------|--------------|
| I2T | CMFH | 0.1566 | 0.1549 | 0.1553 | – | 0.1713 | 0.1708 | 0.1684 | – |
| I2T | PDH | 0.2053 | 0.1968 | 0.2070 | – | 0.1781 | 0.1627 | 0.1768 | – |
| I2T | CVH | 0.1464 | 0.1526 | 0.1536 | – | 0.1117 | 0.1304 | 0.1285 | – |
| I2T | SCM | 0.1698 | 0.1655 | 0.1376 | – | 0.1425 | 0.1344 | 0.1174 | – |
| I2T | SePH | 0.2988 | 0.2861 | 0.3117 | – | 0.2411 | 0.2338 | 0.2535 | – |
| I2T | SPDH | 0.2778 | 0.2547 | 0.2245 | – | 0.2311 | 0.2147 | 0.1789 | – |
| I2T | DAH | 0.4938 | 0.4818 | 0.4873 | 0.3223 | 0.5638 | 0.5517 | 0.5616 | 0.2599 |
| T2I | CMFH | – | – | – | – | – | – | – | – |
| T2I | PDH | 0.1597 | 0.1514 | 0.1572 | – | 0.1823 | 0.1673 | 0.1859 | – |
| T2I | CVH | 0.1230 | 0.1239 | 0.1194 | – | 0.1383 | 0.1385 | 0.1258 | – |
| T2I | SCM | 0.1248 | 0.1240 | 0.1238 | – | 0.1452 | 0.1456 | 0.1494 | – |
| T2I | SePH | 0.6795 | 0.6791 | 0.6716 | – | 0.6902 | 0.6895 | 0.6833 | – |
| T2I | SPDH | 0.4878 | 0.4811 | 0.4689 | – | 0.4275 | 0.4055 | 0.3987 | – |
| T2I | DAH | – | – | – | 0.7504 | – | – | – | 0.7035 |

MIR-Flickr25K:

| Task | Method | mAP 25% | mAP 50% | mAP 75% | mAP 100% | Pre@100 25% | Pre@100 50% | Pre@100 75% | Pre@100 100% |
|------|--------|---------|---------|---------|----------|-------------|-------------|-------------|--------------|
| I2T | CMFH | 0.5841 | 0.5829 | 0.5831 | – | 0.6196 | 0.6194 | 0.6192 | – |
| I2T | PDH | 0.6072 | 0.6009 | 0.5978 | – | 0.6389 | 0.6182 | 0.6286 | – |
| I2T | CVH | 0.5659 | 0.5650 | 0.5645 | – | 0.5841 | 0.5836 | 0.5799 | – |
| I2T | SCM | 0.6437 | 0.6330 | 0.6291 | – | 0.6960 | 0.6857 | 0.6815 | – |
| I2T | SePH | 0.6613 | 0.6485 | 0.6480 | – | 0.6883 | 0.6724 | 0.6769 | – |
| I2T | SPDH | 0.6547 | 0.6410 | 0.6332 | – | 0.6478 | 0.6315 | 0.6178 | – |
| I2T | DAH | – | – | – | 0.7130 | 0.7096 | 0.7076 | 0.7037 | 0.7632 |
| T2I | CMFH | 0.5884 | 0.5864 | 0.5855 | – | – | – | – | – |
| T2I | PDH | 0.5986 | 0.5945 | 0.5874 | – | 0.6205 | 0.6043 | 0.6126 | – |
| T2I | CVH | 0.5663 | 0.5649 | 0.5638 | – | 0.5901 | 0.5838 | 0.5777 | – |
| T2I | SCM | 0.6357 | 0.6275 | 0.6239 | – | 0.7004 | 0.6838 | 0.6830 | – |
| T2I | SePH | 0.7036 | 0.6907 | 0.6806 | – | 0.7901 | 0.7744 | 0.7653 | – |
| T2I | SPDH | 0.6738 | 0.6570 | 0.6444 | – | 0.6667 | 0.6471 | 0.6345 | – |
| T2I | DAH | – | – | – | 0.7142 | – | – | – | 0.8236 |

Note that $p(b_i | \tilde{y}_i)$ is the true distribution of the binary codes given the data $\tilde{y}_i$, and it can be replaced by $p(\hat{b}_i)$, generated via LSH. Minimizing the first term in Eq. (14) improves the DAN itself, which can be expressed through the integral of the generative model in Eq. (10). Furthermore, minimizing the KL divergence between the approximation $q(b_i | \tilde{y}_i)$ and the true distribution $p(\hat{b}_i)$ improves the produced binary codes. As a result, we denote the loss function of the final hash layer as:

$$\mathcal{L}_H = \mathbb{E}_{b_i}\big[p(\tilde{y}_i, b_i)\big] - KL\big(q(b_i | \tilde{y}_i)\,\|\,p(\hat{b}_i)\big). \qquad (15)$$

Combining $\mathcal{L}_{re}$, $\mathcal{L}_{cl}$ and $\mathcal{L}_H$, we obtain the final loss of our model:

$$\mathcal{L} = \mathcal{L}_{re} + \mathcal{L}_{cl} + \mathcal{L}_H. \qquad (16)$$

Discussion. The learning of DAN and the variational hashing can be carried out simultaneously with respect to $x_i$ using any optimization procedure. Since the gradient of Eq. (16) is easy to obtain, we use a standard SGD algorithm to update the parameters. Interestingly, different combinations of the three loss functions make the model suitable for different situations: if only $\mathcal{L}_{cl}$ and $\mathcal{L}_H$ are considered, our model applies to the case of incomplete modalities; conversely, by further considering $\mathcal{L}_{re}$, the model extends to handle the complete-modality case.
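To summarize how the three terms interact during training, here is a simplified, unofficial sketch of Eq. (16). In particular, `l_h` below stands in for the full variational matching of Sec. 2.3 with a plain regression of relaxed codes onto the LSH target codes, and `dist_pos`/`dist_neg` are distances to same-class and different-class samples that a data loader would supply; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def dah_loss(y_tilde, y_true, is_complete, codes, target_codes,
             dist_pos, dist_neg, alpha: float = 1.0):
    """Simplified stand-in for Eq. (16): L = L_re + L_cl + L_H."""
    # Eq. (7): reconstruction error, gated by the completeness indicator o(A).
    l_re = 0.5 * (is_complete * (y_true - y_tilde).pow(2).sum(dim=1)).mean()
    # Eq. (8): margin-based hinge loss pulling same-class samples together.
    l_cl = F.relu(alpha + dist_pos - dist_neg).mean()
    # Proxy for L_H: match relaxed (tanh) hash-layer outputs to the LSH codes.
    l_h = F.mse_loss(codes, target_codes)
    return l_re + l_cl + l_h
```

Since all three terms are differentiable in the network parameters (the sign quantization being relaxed, e.g., by tanh), a single SGD loop over mini-batches optimizes them jointly, matching the end-to-end training described above.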

Figure 4: Precision and Recall curves of all algorithms on MIR-Flickr25K with 64-bit codes: (a) Pre@K on I2T; (b) Pre@K on T2I; (c) Rec@K on I2T; (d) Rec@K on T2I.

3 EXPERIMENTS

In this section, we explore the performance of the proposed DAH in comparison with several state-of-the-art methods on three widely used cross-modality datasets.

The Wiki dataset² consists of 2,866 documents crawled from Wikipedia, each containing an image and the corresponding text. The documents are categorized into 10 classes. Each image is represented by a 128-D bag-of-words feature, and each text by a 10-D Latent Dirichlet Allocation (LDA) feature. We randomly select 75% of the image-text pairs as the training set and use the rest as the query samples.

MIR-Flickr25K³ contains 25,000 instances collected from Flickr, each manually annotated with some of 24 provided labels. The image of each instance is described by a 150-D edge histogram. For the text representation, we extract a 500-D feature vector derived from PCA on the binary tagging vector. We take 5% of the dataset as the query set and the remainder as the training set.

³ http://archive.ics.uci.edu/ml/datasets/Multiple+Features

² http://www.svcl.ucsd.edu/projects/crossmodal/
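A sketch of that text-feature preprocessing step with scikit-learn; the tag-vocabulary size and the use of `sklearn` here are our assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

tags = (np.random.rand(25000, 1386) > 0.99).astype(np.float64)  # binary tagging vectors
pca = PCA(n_components=500)                # project the tags down to 500 dimensions
text_features = pca.fit_transform(tags)    # (25000, 500) text features
```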


Table 2: mAP and Pre@100 comparison for retrieval with complete modalities on two benchmarks under different hash bits (32, 64, and 128).

Wiki:

| Task | Method | mAP 32 | mAP 64 | mAP 128 | Pre@100 32 | Pre@100 64 | Pre@100 128 |
|------|--------|--------|--------|---------|------------|------------|-------------|
| I2T | CMFH | 0.1575 | 0.1544 | 0.1596 | 0.1710 | 0.1710 | 0.1740 |
| I2T | PDH | 0.1975 | 0.1987 | 0.2023 | 0.1669 | 0.1670 | 0.1730 |
| I2T | CVH | 0.1551 | 0.1505 | 0.1549 | 0.1444 | 0.1272 | 0.1328 |
| I2T | SCM | 0.1538 | 0.1466 | 0.1466 | 0.1396 | 0.1338 | 0.1338 |
| I2T | SePH | 0.3010 | 0.2917 | 0.3047 | 0.2502 | 0.2372 | 0.2509 |
| I2T | GSePH_knn | 0.3021 | 0.3072 | 0.3021 | 0.2499 | 0.2498 | 0.2464 |
| I2T | GSePH_rnd | 0.3010 | 0.2942 | 0.2919 | 0.2305 | 0.2416 | 0.2301 |
| I2T | DAH | 0.3215 | 0.3350 | 0.3009 | 0.2511 | 0.2517 | 0.2460 |
| T2I | CMFH | 0.4652 | 0.4894 | 0.5139 | 0.5458 | 0.5615 | 0.5714 |
| T2I | PDH | 0.1613 | 0.1637 | 0.1673 | 0.1919 | 0.1993 | 0.1996 |
| T2I | CVH | 0.1298 | 0.1240 | 0.1220 | 0.1522 | 0.1402 | 0.1414 |
| T2I | SCM | 0.1347 | 0.1264 | 0.1212 | 0.1733 | 0.1484 | 0.1373 |
| T2I | SePH | 0.6881 | 0.6909 | 0.6922 | 0.6911 | 0.6962 | 0.6952 |
| T2I | GSePH_knn | 0.6771 | 0.6914 | 0.7030 | 0.6858 | 0.6942 | 0.6989 |
| T2I | GSePH_rnd | 0.6893 | 0.6889 | 0.7210 | 0.6801 | 0.6912 | 0.7000 |
| T2I | DAH | 0.7416 | 0.7644 | 0.7552 | 0.7041 | 0.7200 | 0.7177 |

MIR-Flickr25K:

| Task | Method | mAP 32 | mAP 64 | mAP 128 | Pre@100 32 | Pre@100 64 | Pre@100 128 |
|------|--------|--------|--------|---------|------------|------------|-------------|
| I2T | CMFH | 0.5828 | 0.5858 | 0.5847 | 0.6213 | 0.6252 | 0.6212 |
| I2T | PDH | 0.6092 | 0.6107 | 0.6125 | 0.6309 | 0.6302 | 0.6452 |
| I2T | CVH | 0.5706 | 0.5671 | 0.5646 | 0.6003 | 0.5914 | 0.5828 |
| I2T | SCM | 0.6479 | 0.6559 | 0.6606 | 0.6999 | 0.7093 | 0.7710 |
| I2T | SePH | 0.6673 | 0.6627 | 0.6744 | 0.7009 | 0.6905 | 0.7035 |
| I2T | GSePH_knn | 0.6797 | 0.6785 | 0.6899 | 0.7263 | 0.7632 | 0.7254 |
| I2T | GSePH_rnd | 0.6804 | 0.6631 | 0.6812 | 0.7002 | 0.7536 | 0.7187 |
| I2T | DAH | 0.7105 | 0.7137 | 0.7328 | 0.7701 | 0.7754 | 0.7757 |
| T2I | CMFH | 0.5892 | 0.5905 | 0.5880 | 0.6920 | 0.7109 | 0.7304 |
| T2I | PDH | 0.5966 | 0.6005 | 0.6020 | 0.6107 | 0.6152 | 0.6207 |
| T2I | CVH | 0.5706 | 0.5672 | 0.5648 | 0.6014 | 0.5918 | 0.5883 |
| T2I | SCM | 0.6371 | 0.6439 | 0.6485 | 0.7073 | 0.7237 | 0.7297 |
| T2I | SePH | 0.7100 | 0.7109 | 0.7104 | 0.7942 | 0.7984 | 0.7976 |
| T2I | GSePH_knn | 0.7145 | 0.7164 | 0.7244 | 0.8011 | 0.8141 | 0.8210 |
| T2I | GSePH_rnd | 0.7078 | 0.7183 | 0.7165 | 0.7993 | 0.8056 | 0.8197 |
| T2I | DAH | 0.7223 | 0.7219 | 0.7438 | 0.8221 | 0.8315 | 0.8399 |

NUS-WIDE⁴ is collected from Flickr and contains 296,648 images with associated tags. All image-tag pairs are manually annotated with at least one label from 81 concepts. Following [33, 37], we keep the 186,577 labeled image-tag pairs that belong to the 10 most frequent labels. Each image is represented by a 500-D bag-of-visual-words feature, and its tag by a 1,000-D bag-of-words feature. We choose 2,000 image-tag pairs as the query set and the remaining pairs as the training set.

⁴ http://lms.comp.nus.edu/research/NUSWIDE.htm

Figure 5: Precision and Recall curves of all algorithms on NUS-WIDE with 64-bit codes: (a) Pre@K on I2T; (b) Pre@K on T2I; (c) Rec@K on I2T; (d) Rec@K on T2I.

3.1 Competing Methods

We compare our method with six state-of-the-art methods: Semantic Correlation Maximization (SCM) [33], Semantics-Preserving Hashing (SePH) [11], Generalized Semantic Preserving Hashing (GSePH) [18], Collective Matrix Factorization Hashing (CMFH) [2], Predictable Dual-View Hashing (PDH) [20], and Cross-View Hashing (CVH) [10]; three are supervised (SCM, SePH, and GSePH) and three are unsupervised (CMFH, PDH, and CVH). Moreover, under the incomplete settings, we further compare with Semi-Paired Discrete Hashing (SPDH) [22], which is similar to our work and achieves state-of-the-art results. The source code of all methods is publicly available, and we directly adopt the original parameter settings described in their papers. All experiments were run on a workstation with a 3.6GHz Intel Core i7-4790 CPU and 16GB RAM.

3.2 Preprocessing and Evaluation Protocols

For each dataset, we perform two cross-modality retrieval tasks: (1) image-to-text retrieval, termed I2T, and (2) text-to-image retrieval, termed T2I, under two settings: incomplete modalities and complete modalities. For the incomplete setting, we consider three cases where 25%, 50%, and 75% of the training data have a modality removed by random selection; in each case, half of the removals are image modalities and the other half text modalities. For a fair comparison, we apply a quantization method, e.g., multi-view clustering [1], to compensate for the missing modality in all approaches except ours. For instance, a missing text modality is filled with the centroid text of the cluster to which its corresponding image modality belongs. As for the proposed method, the inputs for DAN are always incomplete, with the missing modality filled with zeros.


By removing the term $\mathcal{L}_{re}$ from the overall loss $\mathcal{L}$, the proposed DAH can even be applied to an incomplete setting where 100% of the training data has a missing modality. Otherwise, the missing modality serves as supervised information under the completeness setting.

The quantitative performance is evaluated by mean Average Precision (mAP), the mean of Average Precision (AP) over all queries, which jointly considers search accuracy and ranking. Given a query and a list of $n$ retrieval results, AP is defined as $\mathrm{AP} = \frac{1}{N}\sum_{i=1}^{n} p(i)\,\delta(i)$, where $N$ is the number of true neighbors among the results, $p(i)$ denotes the precision of the top $i$ retrieved items, and $\delta(i) = 1$ iff the $i$-th retrieved item is a true neighbor of the query, and $\delta(i) = 0$ otherwise. We also consider three other evaluation protocols, i.e., Precision at top-100 (Pre@100), Precision curves at top-K (Pre@K), and Recall curves at top-K (Rec@K).
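For reference, a direct implementation of the AP/mAP protocol above (our helper, not the authors' evaluation script):

```python
import numpy as np

def average_precision(relevance: np.ndarray) -> float:
    """AP of one ranked result list; relevance[i] = 1 iff the (i+1)-th
    retrieved item is a true neighbor of the query (delta(i) above)."""
    hits = np.cumsum(relevance)
    p_at_i = hits / np.arange(1, len(relevance) + 1)       # p(i)
    if hits[-1] == 0:
        return 0.0
    return float((p_at_i * relevance).sum() / hits[-1])    # normalized by #neighbors

# mAP is the mean of AP over all queries:
ranked = [np.array([1, 0, 1, 1, 0]), np.array([0, 1, 0, 0, 0])]
mAP = float(np.mean([average_precision(r) for r in ranked]))
```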

Table 3: mAP and Pre@100 comparison among different combinations of the loss functions (MIR-Flickr25K, 64 bits).

| Loss | I2T mAP | I2T Pre@100 | T2I mAP | T2I Pre@100 |
|------|---------|-------------|---------|-------------|
| L_re + L_cl | 0.5695 | 0.5865 | 0.5749 | 0.6103 |
| L_re + L_H | 0.7092 | 0.7552 | 0.6915 | 0.8001 |
| L_cl + L_H | 0.7130 | 0.7632 | 0.7142 | 0.8236 |
| L_re + L_cl + L_H | 0.7137 | 0.7754 | 0.7219 | 0.8315 |

Table 4: mAP and Pre@100 comparison with prevailing uni-modal hashing methods (MIR-Flickr25K, 64 bits).

| Method | I2T mAP | I2T Pre@100 | T2I mAP | T2I Pre@100 |
|--------|---------|-------------|---------|-------------|
| SDH | 0.5561 | 0.5520 | 0.5561 | 0.5105 |
| KSH | 0.5910 | 0.6310 | 0.5981 | 0.6726 |
| ITQ_CCA | 0.6582 | 0.7470 | 0.6635 | 0.7943 |
| DAH | 0.7137 | 0.7754 | 0.7219 | 0.8315 |

Figure 6: Influence of different depths of the proposed DAN: (a) mAP on I2T; (b) mAP on T2I.

Figure 7: Analysis of time consumption under different depths of the proposed DAN.

3.3 Quantitative Results

We start the experiments with incomplete data. We first explore the mAP and Pre@100 on Wiki and MIR-Flickr25K, with hash bits fixed at 64. The quantitative results are shown in Tab. 1, Fig. 4, and Fig. 5. In Tab. 1, it can be observed that, no matter which metric on which dataset, our scheme consistently performs best, even compared to baselines with 75% of the data remaining. Interestingly, compared to the results in Tab. 2 with 64-bit hash codes, some baselines do not degenerate as the percentage of incomplete data grows. This is due to the outlier removal effected by the quantization: data gathered via Web crawling are often noisy [16, 38], and the quantization step mitigates this by quantizing outliers into centroids, which belong to dense regions and are less likely to be outliers. Also, SCM [33] surpasses SePH [11] on Pre@100 for the I2T task on MIR-Flickr25K. The performance of our method decreases a little, apparently because the missing modality in our model is simply filled with zeros. However, our scheme still performs best among all methods even with such simple zero-filling, which proves its superiority.

We further evaluate on MIR-Flickr25K and the large-scale NUS-WIDE, as shown in Fig. 4 and Fig. 5, where results are measured by Pre@K and Rec@K on both the I2T and T2I tasks, considering the cases with 50% and 75% of the modalities removed at random, respectively. Clearly, unlike the other methods, which replace a missing modality with the corresponding centroid, the proposed DAH achieves the best performance on both retrieval tasks on both datasets.

Under the setting of complete data, we further evaluate our scheme by adding the loss $\mathcal{L}_{re}$ into the training process with different hash bits. Tab. 2 shows the mAP and Pre@100 by Hamming ranking under 32, 64, and 128 bits. In most cases, the proposed DAH performs best. On the T2I task, no matter which metric on which dataset, DAH achieves substantially better performance at all code lengths. On the I2T task, there are two cases on Wiki, i.e., 128 bits in both mAP and Pre@100, where the proposed method is not the best; in all other situations it still outperforms all the baselines.

A further analysis: compared with the second-highest scores, in terms of mAP, DAH achieves average improvements of 5.02% and 7.66% on the I2T and T2I tasks on Wiki, and 5.28% and 1.42% on MIR-Flickr25K, respectively. Regarding Pre@100, on Wiki we obtain a 0.32% improvement on I2T and 2.61% on T2I, and on MIR-Flickr25K 4.85% and 2.35% on the corresponding tasks. On Wiki, our model degrades from 64 bits to 128 bits on both tasks, a phenomenon not observed on MIR-Flickr25K: since Wiki is a smaller dataset, longer binary codes introduce information redundancy, which degenerates performance. As mentioned in [28], long codes are also time-consuming in retrieval, which contradicts the original intention of hashing. Under such circumstances, the proposed DAH serves as a more reasonable solution.


Figure 8: Comparison between DAN and sDA: (a) Pre@K on I2T; (b) Pre@K on T2I; (c) Rec@K on I2T; (d) Rec@K on T2I.

Figure 9: Averaged error over each epoch.

3.4 Parameter Analysis

We further conduct a series of experiments to analyze the different components of the proposed scheme. All experiments in this section are done on MIR-Flickr25K with hash bits fixed at 64 and can easily be extended to other datasets under the same settings.

We first explore how the depth of DAN (i.e., the number of residual auto-encoder blocks) affects performance. As shown in Fig. 6, stacking more dense auto-encoder blocks improves hashing performance, but also increases training time, as illustrated in Fig. 7, which shows linear growth as the depth goes up. Moreover, when the number of residual auto-encoder blocks increases beyond about 3, performance falls off; the most plausible explanation is that the model suffers from over-fitting. In all experiments, we set the depth of DAN to 2, which gives a good trade-off between efficiency and accuracy compared to the baselines.

We then explore the impact of the three loss functions ($\mathcal{L}_{re}$, $\mathcal{L}_{cl}$ and $\mathcal{L}_H$) by dropping one of them during training. The performance is evaluated by mAP and Pre@100, as summarized in Tab. 3. In the first row, we observe that the combination $\mathcal{L}_{re} + \mathcal{L}_{cl}$ drops dramatically, which confirms the necessity and correctness of $\mathcal{L}_H$, as elaborated in Sec. 2.3. The third row shows the removal of $\mathcal{L}_{re}$, where performance decreases only a little, which explains why the proposed scheme handles data incompleteness well.

To further analyze the importance of the proposed Variational Hashing Learning, we extract the reconstructed imputed data $\tilde{y}_i$ and treat it as the input feature of some prevailing uni-modal hashing methods, including SDH [21], KSH [17], and ITQ_CCA [4]. The experimental results are shown in Tab. 4, where DAH holds the leading position in all metrics; our model makes full use of the label information to reinforce retrieval performance.

We also analyze the performance of the proposed DAN scheme by replacing DAN with sDA [26]. For a fair comparison, we set the number of sDA blocks to 2 and use the same learning rate and number of epochs. The quantitative results are shown in Fig. 8: DAN outperforms sDA on both retrieval tasks in terms of Pre@K and Rec@K, which demonstrates the advantage of the proposed scheme.

Finally, to verify the convergence of the proposed DAH framework, we plot the averaged training error per epoch in Fig. 9. The batch size at each training step is 100. The initial learning rate is set to 0.9 with a decay factor of 0.1 every 40 epochs. The training error drops quickly in the first 40 epochs, then decreases slightly from epoch 40 to 50; after 60 training epochs, our model converges to a training loss of about 0.667.
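The training schedule described above corresponds to a standard step decay; a sketch with PyTorch's built-in scheduler (the model here is a placeholder, not the DAH network):

```python
import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(138, 64)                   # placeholder for the DAH network
opt = torch.optim.SGD(model.parameters(), lr=0.9)  # initial learning rate 0.9
sched = StepLR(opt, step_size=40, gamma=0.1)       # decay by 0.1 every 40 epochs

for epoch in range(60):
    # ... iterate mini-batches of size 100, compute Eq. (16), call opt.step() ...
    sched.step()                                   # advance the decay schedule
```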

4 CONCLUSION

In this paper, we propose a novel cross-modality hashing method, termed Dense Auto-encoder Hashing (DAH). The core idea is to leverage the relatedness among different modalities and encode the completed modality feature into binary codes. The two innovations of the proposed scheme are a novel Dense Auto-encoder Network and Variational Hashing Learning. The former is stacked from a set of residual auto-encoders and explicitly imputes the incomplete modality, while the latter serves as a special encoder model that accepts incomplete-modality input and outputs complete hash codes. Quantitative experiments conducted on the Wiki, MIR-Flickr25K, and NUS-WIDE visual search benchmarks demonstrate the superior performance of the proposed DAH over several state-of-the-art approaches.

5 ACKNOWLEDGMENTS

This work is supported by the National Key R&D Program (No. 2017YFC0113000 and No. 2016YFB1001503), the National Natural Science Foundation of China (No. U1705262, No. 61772443, and No. 61572410), the Post-Doctoral Innovative Talent Support Program under Grant BX201600094, the China Post-Doctoral Science Foundation under Grant 2017M612134, the Scientific Research Project of the National Language Committee of China (Grant No. YB135-49), and the Natural Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).


REFERENCES

[1] Xiao Cai, Feiping Nie, and Heng Huang. 2013. Multi-View K-Means Clustering on Big Data. In Proceedings of the IJCAI.
[2] Guiguang Ding, Yuchen Guo, and Jile Zhou. 2014. Collective Matrix Factorization Hashing for Multimodal Data. In Proceedings of the CVPR.
[3] Kun Ding, Chunlei Huo, Bin Fan, Shiming Xiang, and Chunhong Pan. 2016. In Defense of Locality-Sensitive Hashing. IEEE Transactions on Neural Networks and Learning Systems (2016).
[4] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. 2013. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-Scale Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (2013).
[5] Jie Gui, Tongliang Liu, Zhenan Sun, Dacheng Tao, and Tieniu Tan. 2017. Fast Supervised Discrete Hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[6] Yao Hu, Zhongming Jin, Hongyi Ren, Deng Cai, and Xiaofei He. 2014. Iterative Multi-View Hashing for Cross Media Indexing. In Proceedings of the ACM MM.
[7] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the CVPR.
[8] Qing-Yuan Jiang and Wu-Jun Li. 2017. Deep Cross-Modal Hashing. In Proceedings of the CVPR.
[9] Diederik P. Kingma. 2017. Variational Inference & Deep Learning: A New Synthesis. Ph.D. Dissertation, University of Amsterdam.
[10] Shaishav Kumar and Raghavendra Udupa. 2011. Learning Hash Functions for Cross-View Similarity Search. In Proceedings of the IJCAI.
[11] Zijia Lin, Guiguang Ding, Mingqing Hu, and Jianmin Wang. 2015. Semantics-Preserving Hashing for Cross-View Retrieval. In Proceedings of the CVPR.
[12] Hong Liu, Rongrong Ji, Yongjian Wu, and Gang Hua. 2016. Supervised Matrix Factorization for Cross-Modality Hashing. In Proceedings of the IJCAI.
[13] Hong Liu, Rongrong Ji, Yongjian Wu, Feiyue Huang, and Baochang Zhang. 2017. Cross-Modality Binary Code Learning via Fusion Similarity Hashing. In Proceedings of the CVPR.
[14] Li Liu, Zijia Lin, Ling Shao, Fumin Shen, Guiguang Ding, and Jungong Han. 2017. Sequential Discrete Hashing for Scalable Cross-Modality Similarity Retrieval. IEEE Transactions on Image Processing (2017).
[15] Li Liu, Mengyang Yu, and Ling Shao. 2015. Multiview Alignment Hashing for Efficient Image Search. IEEE Transactions on Image Processing (2015).
[16] Wei Liu, Gang Hua, and John R. Smith. 2014. Unsupervised One-Class Learning for Automatic Outlier Removal. In Proceedings of the CVPR.
[17] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. 2012. Supervised Hashing with Kernels. In Proceedings of the CVPR.
[18] Devraj Mandal, Kunal N. Chaudhury, and Soma Biswas. 2017. Generalized Semantic Preserving Hashing for N-Label Cross-Modal Retrieval. In Proceedings of the CVPR.
[19] Bahadir Ozdemir and Larry S. Davis. 2014. A Probabilistic Framework for Multimodal Retrieval Using Integrative Indian Buffet Process. In Proceedings of the NIPS.
[20] Mohammad Rastegari, Jonghyun Choi, Shobeir Fakhraei, Hal Daumé III, and Larry Davis. 2013. Predictable Dual-View Hashing. In Proceedings of the ICML.
[21] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. 2015. Supervised Discrete Hashing. In Proceedings of the CVPR.
[22] Xiaobo Shen, Fumin Shen, Quan-Sen Sun, Yang Yang, Yun-Hao Yuan, and Heng Tao Shen. 2017. Semi-Paired Discrete Hashing: Learning Latent Hash Codes for Semi-Paired Cross-View Retrieval. IEEE Transactions on Cybernetics (2017).
[23] Xiaobo Shen, Fumin Shen, Quan-Sen Sun, Yun-Hao Yuan, and Heng Tao Shen. 2016. Robust Cross-View Hashing for Multimedia Retrieval. IEEE Signal Processing Letters (2016).
[24] Jingkuan Song, Yang Yang, Yi Yang, Zi Huang, and Heng Tao Shen. 2013. Inter-Media Hashing for Large-Scale Retrieval from Heterogeneous Data Sources. In Proceedings of the SIGMOD.
[25] Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. 2017. Missing Modalities Imputation via Cascaded Residual Autoencoder. In Proceedings of the CVPR.
[26] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the ICML.
[27] Di Wang, Xinbo Gao, Xiumei Wang, and Lihuo He. 2015. Semantic Topic Multimodal Hashing for Cross-Media Retrieval. In Proceedings of the IJCAI.
[28] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. 2017. A Survey on Learning to Hash. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[29] Qifan Wang, Luo Si, and Bin Shen. 2015. Learning to Hash on Partial Multi-Modal Data. In Proceedings of the IJCAI.
[30] Ying Wei, Yangqiu Song, Yi Zhen, Bo Liu, and Qiang Yang. 2014. Scalable Heterogeneous Translated Hashing. In Proceedings of the SIGKDD.
[31] Botong Wu, Qiang Yang, Wei-Shi Zheng, Yizhou Wang, and Jingdong Wang. 2015. Quantized Correlation Hashing for Fast Cross-Modal Search. In Proceedings of the IJCAI.
[32] Ting Yao, Tao Mei, and Chong-Wah Ngo. 2015. Learning Query and Image Similarities with Ranking Canonical Correlation Analysis. In Proceedings of the ICCV.
[33] Dongqing Zhang and Wu-Jun Li. 2014. Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization. In Proceedings of the AAAI.
[34] Dan Zhang, Fei Wang, and Luo Si. 2011. Composite Hashing with Multiple Information Sources. In Proceedings of the SIGIR.
[35] Yi Zhen, Yue Gao, Dit-Yan Yeung, Hongyuan Zha, and Xuelong Li. 2016. Spectral Multimodal Hashing and Its Application to Multimedia Retrieval. IEEE Transactions on Cybernetics (2016).
[36] Yi Zhen and Dit-Yan Yeung. 2012. Co-Regularized Hashing for Multimodal Data. In Proceedings of the NIPS.
[37] Jile Zhou, Guiguang Ding, and Yuchen Guo. 2014. Latent Semantic Sparse Hashing for Cross-Modal Similarity Search. In Proceedings of the SIGIR.
[38] Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. 2012. A Survey on Unsupervised Outlier Detection in High-Dimensional Numerical Data. Statistical Analysis and Data Mining (2012).
