Multimed Tools Appl DOI 10.1007/s11042-013-1737-9

Improving cross-modal and multi-modal retrieval combining content and semantics similarities with probabilistic model

Shixun Wang · Peng Pan · Yansheng Lu · Liang Xie

© Springer Science+Business Media New York 2013

Abstract With the ongoing development of the internet, a large number of multimedia documents containing images and texts have appeared in people's daily lives. Therefore, how to effectively and efficiently conduct cross-modal and multi-modal retrieval is becoming an important issue. Although some methods have been proposed to deal with this issue, their retrieval processes are confined to a single information source of multimedia documents, such as the representations of images and texts at a semantic level. In this paper, we propose a novel probabilistic model, namely CCSS, which not only combines low-level content and high-level semantics similarities through a first-order Markov chain, but also provides heterogeneous similarity measures for different unimedia types. The ranked list for a query is obtained by highlighting an optimal path across the chain. Content similarity focuses on the internal structure of each modality, while semantics similarity focuses on the semantic correlation between different modalities. Both of them are significant, and their combination can be complementary. Multi-class logistic regression and random forests are used to map the original features of each unimedia type into a semantic space. Under the query-by-example scenario, experiments on the Wikipedia dataset show that our model significantly outperforms state-of-the-art approaches for cross-modal retrieval. Additionally, the proposed multi-modal method is also shown to outperform previous systems on the image retrieval task.

Keywords Cross-modal retrieval · Multi-modal retrieval · Content similarity · Semantics similarity · Probabilistic model

S. Wang · P. Pan (*) · Y. Lu · L. Xie
School of Computer Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, People's Republic of China
e-mail: [email protected]
S. Wang e-mail: [email protected]
Y. Lu e-mail: [email protected]
L. Xie e-mail: [email protected]


1 Introduction

In recent years, there has been an explosion of multimedia content on the web, such as Flickr for photographs, Last.fm for music and YouTube for videos. Therefore, an equivalent growth in the development of technologies is required to effectively store and retrieve this multimedia content. At present, prevailing search engines such as Google are still based on text descriptors or keywords. However, text keywords often do not represent the multimedia content itself, which weakens the retrieval performance. In order to address this problem, many researchers have devoted themselves to retrieval techniques for unimodal content data. These techniques include text retrieval [10], image retrieval [23, 26], music retrieval [16] and video retrieval [24], where the user query and the corresponding retrieved results are of the same modality.

However, in the real world, data of different modalities, such as texts, images and music, increasingly co-exist in a multimedia document to better express the same semantic information. Such multimedia documents include e-commerce web pages, newspaper articles and personal blogs. In fact, a user may want to retrieve texts or other modalities with a query image, or to retrieve multimedia documents containing images and texts by submitting the combination of an image and a text. In these cases, where traditional unimodal retrieval methods cannot measure the content similarity between unimedia data of different modalities, cross-modal and multi-modal retrieval can deal with the task effectively and are becoming increasingly important. In cross-modal retrieval, the modality of the query is different from that of the retrieved results, for instance, the retrieval of images in response to a query text. In multi-modal retrieval, both the query document and the retrieved documents can include two or more modalities, but they need to share at least one modality; for instance, a query can be the fusion of an image and a text, and so can the retrieved results. Multi-modal media data can supply complementary information that strongly helps people comprehend a multimedia document at a high level. For example, in an e-commerce website, the commodity pictures quickly provide intuitive visual information that is not only incomplete (it lacks information such as price, place of production and function) but also redundant because of irrelevant background. In contrast, the accompanying texts can accurately describe the detailed information, but this information is not intuitive. Therefore, jointly modeling and retrieving multi-modality media data is significant. Although several existing techniques [4, 6, 25] have annotated images and music with matching texts such as tags, these texts are usually restricted to describing visible and auditory objects. As a consequence, these techniques do not make full use of the whole information of multi-modality data.

Generally speaking, semantics is an abstraction that exists in the human mind, such as the scene of an image. The data space of a unimodal media object can be divided into two sub-spaces, namely the low-level feature space and the high-level semantic space. The former is constituted by the original features, and the latter is a space in which each dimension represents a meaningful concept. Given a unimedia object, its corresponding vector in the semantic space represents the posterior probabilities of the predefined semantic concepts.
On the one hand, media data such as images are stored in computers in the form of low-level features; on the other hand, people can efficiently understand media data by exploiting high-level semantics. Most traditional unimodal retrieval methods rely on content similarity, which is an objective relationship. However, due to the existence of the well-known semantic gap, content-based information cannot directly provide semantic information to a user, and how to choose an optimal similarity measure is usually crucial in


the design of a retrieval system. Additionally, the retrieval model can also be designed by computing semantics similarity (here, semantics similarity means metric similarity rather than similarity among concept labels), which is subjective to some extent. Nevertheless, in order to accurately predict the semantic representation of an unknown unimedia object, approaches based on semantics similarity depend heavily on the training data, which might lead to over-fitting and weaken the robustness of the system. Therefore, approaches that only take a single source of information into account have some limitations, and a combined use of the two similarities may instead reduce such disadvantages. To the best of our knowledge, such a combination has not been explored for cross-modal and multi-modal retrieval in previous work.

In this paper, the retrieval problem on which we focus includes two tasks: 1) the retrieval of texts in response to a query image, and vice versa; 2) the retrieval of image-text pairs in response to the query of an image-text pair. We introduce a probabilistic model and propose the corresponding retrieval algorithms to accomplish these tasks. The model, which takes the form of a first-order Markov chain, combines content similarity and semantics similarity in a probabilistic framework. Content similarity focuses on the original features within each modality, while semantics similarity focuses on the vectors of semantic features in a common space. We adopt multi-class logistic regression and random forests to map the original features of each medium into a semantic space. In particular, each state in the Markov chain represents a media object belonging to the retrieved modality. Each edge connecting two states is weighted by the corresponding content similarity, whereas the edge between a hidden state and the semantic space is indicated by the semantic vector of the state. Given a query such as an image at a time step, the text that is most related to the query can be retrieved in the space of states at that time.

As will be detailed in Section 2, the existing methods for cross-modal and multi-modal retrieval usually focus on a single information source of media data, either the content information source or the semantic information source. Our approach considers both information sources. In summary, the major contributions of this paper are twofold: 1) To address the cross-modal and multi-modal tasks, we propose a probabilistic model that jointly models the content and semantic features; the framework of the model is also applicable to any combination of unimedia types. 2) Two corresponding retrieval algorithms based on the proposed model are implemented; the algorithms do not require additional parameter estimation.

The rest of this article is organized as follows. After a brief review of related work in Section 2, we describe our model in Section 3. Section 4 presents the existing methods for projecting images and texts into a common semantic space. The retrieval algorithms for cross-modal and multi-modal retrieval are discussed in Section 5. The experimental results and analyses are provided in Section 6. Finally, concluding discussions and possible future work are listed in Section 7.

2 Related works

The low-level features of an image are usually based on the popular scale-invariant feature transform (SIFT) [17]. SIFT descriptors are local representations of an image neighborhood: they are Gaussian derivatives computed at 8 orientation planes over a 4×4



grid of spatial locations, giving a 128-dimensional vector. In the bag-of-words (BoW) model [7], a codebook is first learned from the entire set of features, which can be done with a clustering technique such as K-means. Then the large number of features extracted from an image or text are mapped to the closest entries in the codebook. Finally, each image or text can be vector-quantized as a frequency histogram. In general, topic models are used to discover hidden structure in text processing, for example probabilistic latent semantic analysis (pLSA) [10] and latent Dirichlet allocation (LDA) [2]. LDA is a generative probabilistic model for collections of discrete data such as text corpora, where each text is modeled as a finite mixture over an underlying set of topics, and each topic is characterized by a multinomial distribution over words. Due to the existence of the semantic gap, these techniques are not always effective for the demands of users, but they provide a way to obtain the original features of unimedia data.

The automatic annotation of media is the beginning of a cross-modal system; for example, images and music can be retrieved by submitting an explicit tag query in [4] and [25], respectively. However, the techniques of automatic annotation are limited because of the scarcity of text data. Recently, many efforts [12, 15, 22, 30, 32–36] have been devoted to cross-modal retrieval, where a narrative text is loosely related to an image. The key problem in cross-modal retrieval is how to measure the similarity between different unimedia modalities; most of the existing methods focus on a common space in which a classical measure can be applied directly. Popular common spaces include the correlative subspace [11, 14, 15, 22, 27, 28], the semantic space [21, 22] and the hash space [35–37].

The correlative subspace is usually learned by canonical correlation analysis (CCA) [11] or generalized canonical correlation analysis (GCCA) [27]. CCA is a linear dimensionality reduction method that maximizes the correlation between two sets of heterogeneous data, and GCCA is adapted to three modalities. Rasiwasia et al. [22] apply CCA to learn the subspace that maximizes the correlation between image and text. Lmura et al. [15] use GCCA to consider the correlation among image, sound and location information simultaneously. As an extension of CCA, kernel canonical correlation analysis (KCCA) [28] can model non-linear dependencies between image and text features and maximize the correlation in the transformed space. In addition, Li et al. [14] introduce cross-modal factor analysis (CFA) to seek transformations that best represent the association between two different modalities. Although these approaches model the correlation between different unimedia types, they do not consider the semantic information obtained by mapping the low-level original features to high-level semantic features. Therefore, we refer to such approaches as content-based cross-modal retrieval.

The semantic space [21] is a probability simplex in which each dimension represents a predefined semantic concept, and a data point is a vector of posterior concept probabilities. Rasiwasia et al. [22] utilize multi-class logistic regression (SM) to obtain the semantic vectors of images and texts, and show that the best experimental results are obtained by combining CCA and SM. However, this method only considers the semantics similarity in the semantic space and does not take the content similarity in the original feature space into account.
Hash-based cross-modal retrieval methods project different modalities into a hash space such that collisions in bins reflect nearest-neighbor relationships. Zhen et al. [35] propose a model called multimodal latent binary embedding (MLBE) for automatically learning hash functions, namely binary latent factors, from data of different modalities. In [36], a multimodal hash function learning method named Co-Regularized Hashing (CRH) is proposed on the basis of a boosted co-regularization framework. Unlike the metric semantics similarity discussed in this paper, the semantic information considered in these two hash function learning methods is defined based on the semantic concept labels. That is to say, two


unimedia data coming from different modalities are similar if they have the same concept label and dissimilar otherwise. We refer to this as concept-label similarity rather than semantics similarity. Additionally, Zhai et al. [34] propose a heterogeneous similarity measure with nearest neighbors, obtained by computing the probability that two unimedia data belong to the same semantic concept, and utilize these heterogeneous similarities to learn an effective ranking model for cross-modal retrieval. In [33], the negative correlation between unimedia data of different categories is considered, and a cross-media correlation propagation approach is proposed to simultaneously deal with the positive and negative correlations between unimedia data of different modalities. The above two approaches rely on the original features and concept-label similarity, but not on semantics similarity. Jia et al. [12] propose a probabilistic model that defines a Markov random field on the media data; the model learns a set of shared topics across the image and text modalities, and the encoded similarity can then be applied to cross-modal retrieval. However, this method does not consider semantic information. In [9], a new approach using query expansion, visual models and reranking is presented to retrieve visual information, but the content similarity between visual information is not considered.

In parallel, many efforts [1, 5, 29–31] have been devoted to multi-modal retrieval. The early methods for multi-modal retrieval are usually based on the fusion of multi-modal media data; the well-known fusion strategies include early fusion and late fusion. The former fuses features derived from different modalities into a single vector, and the latter fuses predictions after learning different models for different modalities. However, the fusion approaches ordinarily do not consider content similarity or semantics similarity. The combination of late fusion and image reranking is utilized to rank documents in [5]. Westerveld et al. [29] present a probabilistic model for video retrieval in which textual and visual retrieval models are integrated according to Bayesian decision theory. Although different modalities are integrated in the search tasks, these two methods only focus on a single information source, namely the content information source. Yang et al. [31] claim that two unimodal data items can carry the same semantics if they appear in the same multimedia document; they construct a multimedia correlation space using co-occurrence information and propose a novel ranking algorithm, namely ranking with Local Regression and Global Alignment. However, this method's retrieval performance degrades if the query is outside the dataset, and it lacks semantics similarities among different modalities. Xie et al. [30] first assume that unimedia data from different modalities can be independently generated from the same semantic concepts, and then propose a semantic generation model (SGM) that describes the semantic correlation of different modalities. Essentially, the SGM_Gaussian method only considers the original features of different modalities, and the SGM_RF method only relies on the semantic features; therefore, they do not simultaneously combine the two different information sources.

Rabiner [20] describes a particular class of probabilistic graphical model, namely the hidden Markov model (HMM), which is widely used to model discrete time series.
An HMM is essentially a mixture model in which the hidden state for each observation is not selected independently but depends on the state of the previous observation. In particular, the HMM combines two stochastic processes to highlight the most likely state sequence that has generated a given series of observations. Although there are a few similarities, our model is not treated as an HMM because it has some variations with respect to the standard model [20], and the algorithm used to highlight the most likely path is different from the standard algorithm [8, 20]. Miotto et al. [19] explore a music retrieval framework based on combining tags and acoustic similarity through a probabilistic graph-


based representation of a collection of songs. On the one hand, the framework constructed in [19] is used only to retrieve unimodal media objects such as music, whereas our model is designed to tackle cross-modal and multi-modal retrieval problems. On the other hand, the retrieval process proposed in [19] is similar to the Viterbi algorithm [8] and its time complexity is proportional to the square of the size of the test set, whereas our algorithm searches for the optimal state at the current time step according to the state retrieved at the previous step, and the time complexity of this retrieval process is linear in the size of the test set.

3 The probabilistic model

In this section, we design a probabilistic model for cross-modal and multi-modal retrieval. Although the essential ideas of our model can be applied to other modalities, the discussion in this paper is limited to multimedia documents containing images and texts. We use boldface letters to denote matrices or vectors; for instance, a matrix can be denoted by A. The multimedia dataset is denoted as Δ={D1, …, D|Δ|}, in which each document consists of an image and its accompanying text, namely Di = (Ii, Ti). Images and texts are described as low-level feature vectors Ii ∈ ℜI and Ti ∈ ℜT, respectively. Images are represented by the bag-of-words (BoW) model [7], and texts are represented by the BoW model or a topic model (LDA) [2]. Consider a vocabulary L consisting of |L| unique labels, where each label li ∈ L is a semantic concept such as "Music", "Warfare" or "Biology". Each image or text in the training set is associated with a semantic concept, but images and texts in the test set are not assigned any semantic concept. The goal of cross-modal retrieval is, given an image (text) query Iq ∈ ℜI (Tq ∈ ℜT) in the test set, to search for the closest match in the text (image) space ℜT (ℜI) of the test set. Given an image-text pair as the query, the goal of multi-modal retrieval is to return the closest image-text pairs in the test set.

Our proposed model simultaneously combines content and semantics similarities in the retrieval process; that is to say, one similarity is not merely used to correct the ranking produced by the other. We name this model CCSS for brevity. A general framework of the CCSS model is shown in Fig. 1; it mainly consists of the following two parts.

- Feature representation: In this part, two types of features are incorporated. One is the original features extracted from images and texts; the other is the semantic features based on the original features. The existing methods for semantic mapping are briefly introduced in Section 4.
- Retrieval scheme: Utilizing a first-order Markov chain, we rank the retrieved media data, which is the primary contribution of this paper. The retrieved media in the chain provide content and semantics information. The accompanying algorithms for cross-modal and multi-modal retrieval are proposed in Section 5.

When a user retrieves something he wants, the goal is to consistently observe the fulfillment of his need during the time spent accessing the test set. In particular, the semantic vector of a query can be treated as an observation that does not change over the time steps. Each retrieved object in the test set is treated as a state in the hidden state space, and the content-based similarity matrix is represented as the transition probability matrix, where each element denotes the weight between two generic states. Moreover, the semantic vector of each retrieved object is described as an emission probability, which characterizes the mapping mechanism from an original feature into the semantic space. Therefore, the retrieved


Fig. 1 Framework of the CCSS model. On the right, a query is denoted as a rectangle and the retrieved media data are represented as ellipses, namely hidden states. The emission probability of each hidden state is a high-level semantic vector, and all hidden states are linked by edges (solid arrows) weighted according to content similarities

object at a time step can proceed to another one at the next time step according to the transition probability, and each retrieved object emits a semantic vector at every time step. The retrieval of the current object depends on both its emission probability and the transition probability of moving from the previously retrieved object to itself. Thus the neighboring states in the retrieval path simultaneously carry analogous content and semantics information. Several symbols in the CCSS model are defined as follows:

- K is the total number of returned results, namely the number of time steps;
- k is a time step, ranging from 1 to K;
- N is the size of the test set, and N ≥ K;
- X = {Xi | i = 1, …, N} is the set of retrieved objects;
- X(k) is the index of the retrieved object at the kth time step;
- bq and bi are the semantic vectors of the query Y and of Xi, respectively;
- aij is the probability of moving from object Xi to object Xj in a single step, which depends on the similarity between the original feature vectors;
- A = {aij | i, j = 1, …, N} is the state transition probability matrix of X;
- B = {bi | i = 1, …, N} is the emission probability matrix of X;
- F(Xi) is a subset of X that includes the most similar neighbors of object Xi in the original feature space;
- F is the number of elements in the set F(Xi), namely F = |F(Xi)|;
- Si(k) is the semantics similarity between bq and bi at the kth time step; it has the same value for all k ≥ 1, i.e. Si(k) = Si(k+1);
- Simc and Sims are the similarity measures in the original feature space and in the semantic space, respectively.


The value of aij is positive if the object Xj is in the similar set of object Xi and zero otherwise. Additionally, self-transitions are set to zero because they are useless in the retrieval process. In order to preserve the probabilistic nature of the transitions, the transition probabilities must be normalized to one, which yields the following expression:

a_{ij} = \begin{cases} \dfrac{Sim_c(X_i, X_j)}{\sum_{j=1}^{N} Sim_c(X_i, X_j)}, & \text{if } X_j \in F(X_i) \text{ and } i \neq j \\ 0, & \text{elsewhere} \end{cases} \qquad (1)

Through such a process, the transition probability matrix is usually not symmetric, which is a reasonable phenomenon in a certain sense. For example, a complex image can contain many different visible parts, and a simple image has only one of these visible parts. While the simple image is considered similar to the complex image, the latter may be only loosely related to the former. Therefore, the CCSS model can be specified by the compact notation λ=(A, B).
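To make the construction of A concrete, the following sketch builds the transition matrix of Eq. 1 from a precomputed content-similarity matrix. It is only an illustration under stated assumptions: the function name, the NumPy implementation and the neighbour selection by sorting are ours, not the authors' original MATLAB code.

```python
import numpy as np

def build_transition_matrix(sim_c, F):
    """Transition matrix A of Eq. 1.

    sim_c : (N, N) array of content similarities Sim_c(X_i, X_j)
    F     : size of the similar subset F(X_i) kept for every object
    """
    N = sim_c.shape[0]
    A = np.zeros((N, N))
    for i in range(N):
        sims = sim_c[i].copy()
        sims[i] = -np.inf                         # exclude the useless self-transition
        neighbours = np.argsort(sims)[::-1][:F]   # F(X_i): the F most similar objects
        A[i, neighbours] = sim_c[i, neighbours]
        row_sum = A[i].sum()
        if row_sum > 0:
            A[i] /= row_sum                       # normalise the row to a probability distribution
    return A
```

Because every row is normalised independently over its own similar subset, the resulting matrix is generally not symmetric, in line with the discussion above.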

4 Semantic mapping

Generative and discriminative methods are usually used to project the original features of media data into a semantic space; the former include GMM, while the latter include logistic regression, probabilistic SVM and random forests. In this section, we briefly review the multi-class logistic regression and random forests methods.

4.1 Multi-class logistic regression

Multi-class logistic regression engenders linear classifiers with a probabilistic interpretation. The posterior probability of concept li is calculated by fitting the original features of the data to a logistic function:

P(l_i \mid x; w) = \frac{1}{Z(x, w)} \exp(w_i \cdot x) \qquad (2)

where Z(x, w) = Σ_i exp(w_i · x) is a normalization constant, x is the original feature vector, and w_i (i = 1, …, |L|) is a parameter vector for concept l_i. After learning a multi-class logistic regression for each modality, we obtain semantic vectors of posterior probabilities with respect to each concept in L. As in [22], CCA and multi-class logistic regression can be combined to obtain semantic vectors, by using the feature representation produced by correlation maximization. With this mapping, our model is called CCSS_SCM.
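As a hedged illustration of this mapping, the sketch below fits a multi-class logistic regression with scikit-learn and returns posterior concept probabilities as in Eq. 2; the library choice and the function names are assumptions made for illustration, not the authors' implementation. The random-forests mapping of Section 4.2 could be obtained analogously with RandomForestClassifier, which exposes the same predict_proba interface.

```python
from sklearn.linear_model import LogisticRegression

def fit_semantic_mapping(features, labels):
    """Fit the logistic mapping of Eq. 2 on one modality.

    features : (n_samples, d) low-level vectors (e.g. SIFT-BoW for images, LDA for texts)
    labels   : (n_samples,) integer concept labels drawn from the vocabulary L
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return clf

def semantic_vectors(clf, features):
    """Map low-level features to |L|-dimensional vectors of posterior probabilities."""
    return clf.predict_proba(features)   # each row sums to one over the concepts in L
```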


4.2 Random forests

Random forests [3] can bag an ensemble of decision trees for classification, where the randomness comes from bagging and random feature selection. Each tree in the forest is grown as follows: 1) Sample a new training set at random, with replacement, from the original training set; this bootstrapped sample is the set on which the tree is trained. 2) If there are R variables, a number r (r < R) of variables is selected at random at each node of the tree, and the best split on these r variables is used to split the node.

5 Retrieval algorithms

5.1 The algorithm of cross-modal retrieval

If the query Y is an image Iq in the test set, then X is the set of texts. According to Eq. 1, the state transition probability moving from text Ti to text Tj in a single step is defined as follows:

a_{ij} = a^{T}_{ij} = \begin{cases} \dfrac{Sim_c(T_i, T_j)}{\sum_{j=1}^{N} Sim_c(T_i, T_j)}, & \text{if } T_j \in F(T_i) \text{ and } i \neq j \\ 0, & \text{elsewhere} \end{cases} \qquad (4)

The semantics similarity between Iq and Ti at the kth time step is given as follows:

S_i(k) = S^{IT}_i(k) = Sim_s\left(b^{I}_q, b^{T}_i\right) \qquad (5)

where b^{I}_q and b^{T}_i are the semantic vectors of Iq and Ti, respectively. If the query Y is a text Tq in the test set, then X is the set of images. According to Eq. 1, the state transition probability moving from image Ii to image Ij in a single step is similarly defined as follows:

a_{ij} = a^{I}_{ij} = \begin{cases} \dfrac{Sim_c(I_i, I_j)}{\sum_{j=1}^{N} Sim_c(I_i, I_j)}, & \text{if } I_j \in F(I_i) \text{ and } i \neq j \\ 0, & \text{elsewhere} \end{cases} \qquad (6)

Furthermore, the semantics similarity between Tq and Ii at the kth time step is also given:

S_i(k) = S^{TI}_i(k) = Sim_s\left(b^{T}_q, b^{I}_i\right) \qquad (7)

where b^{T}_q and b^{I}_i are the semantic vectors of Tq and Ii, respectively.


Based on the CCSS model and the query-by-example scenario, we summarize the general cross-modal retrieval algorithm in Algorithm 1.

Algorithm 1: The Cross-modal Retrieval Algorithm of CCSS
Input: the dataset X and a query Y
Output: the ranked list of objects {X(1), X(2), …, X(K)}
1: Calculate the state transition probability matrix A = {a_{ij} | i, j = 1, …, N}.
2: Calculate the emission probability matrix B = {b_i | i = 1, …, N} and the semantic vector b_q.
3: Calculate the semantic similarity between b_q and b_i for all k ≥ 1: S_i(k) = Sim_s(b_q, b_i), i = 1, …, N.
4: Initialize the path, i.e., calculate X(1) = \arg\max_{1 \le i \le N} (\log S_i(1) + \log a_{qi}).
5: For k = 2 to K do
6:    Given u = X(k−1), grow the path at each step: X(k) = \arg\max_{1 \le i \le N} (\log S_i(k) + \log a_{ui}), where i ∉ {X(1), …, X(k−1)}.
7: End For

It is easy to discern that the time complexity of computing the similarity matrices is O(N²). The similarity matrices can be computed beforehand, so we next analyze the complexity of ranking the retrieved objects. For each time step, the algorithm calculates numerical values that reflect the combined use of content and semantics similarities: for every state in the hidden space, the logarithm of the transition probability from the previously retrieved state to this state is added to the logarithm of the semantic similarity between the query and this state. These values are saved at the same time. Therefore, the best time complexity of this process (from step 4 to step 7) is O(KN) when the first object indexed by the maximum operation satisfies the constraint in step 6 at every time step, and the space complexity is also O(KN). In particular, as can be seen in step 6, the choice of state at a time step relies on two factors, namely the transition probability and the semantic similarity.
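For completeness, a minimal sketch of the ranking loop of Algorithm 1 is given below. It assumes that A, the query-to-state transition probabilities a_q and the semantic similarities S have already been computed; the small epsilon guarding against log(0) is our own addition, not part of the paper.

```python
import numpy as np

def ccss_cross_modal_rank(A, a_q, S, K):
    """Greedy path growing of Algorithm 1 (steps 4-7).

    A   : (N, N) transition matrix of the retrieved modality (Eq. 1)
    a_q : (N,) transition probabilities from the query to every state
    S   : (N,) semantic similarities S_i, constant over the time steps
    K   : number of objects to return, K <= N
    """
    eps = 1e-12
    log_S = np.log(S + eps)
    path = [int(np.argmax(log_S + np.log(a_q + eps)))]   # step 4: initialise the path

    for _ in range(1, K):                                 # steps 5-7: grow the path
        u = path[-1]
        scores = log_S + np.log(A[u] + eps)
        scores[path] = -np.inf                            # i must not repeat earlier states
        path.append(int(np.argmax(scores)))
    return path
```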


5.2 The algorithm of multi-modal retrieval

In multi-modal retrieval, both the query document and the retrieved documents are multi-modal; the case where each document is an image-text pair is considered here. The image and the text are complementary to each other; consequently, the retrieval accuracy can be enhanced by making full use of the whole information of the multi-modality data. Let the query Y be Dq = (Iq, Tq) and let X be the set of Di = (Ii, Ti). The content similarity between Dq and Di simultaneously depends on the two intra-modality content similarities. To attain the semantic similarity between Dq and Di, we incorporate the following terms: two intra-modality semantic similarities and two inter-modality semantic similarities. In view of the query-by-example scenario, the algorithm of multi-modal retrieval is summarized as follows:

1) Calculate the state transition probability matrix A = {aij | i, j = 1, …, N}. Firstly, if Ij ∈ F(Ii), Tj ∈ F(Ti) and i ≠ j, the content similarity between Di and Dj is defined according to the geometric mean:



a^{c}_{ij} = \left( a^{I}_{ij} \cdot a^{T}_{ij} \right)^{1/2} = \left( \dfrac{Sim_c(I_i, I_j)}{\sum_{j=1}^{N} Sim_c(I_i, I_j)} \cdot \dfrac{Sim_c(T_i, T_j)}{\sum_{j=1}^{N} Sim_c(T_i, T_j)} \right)^{1/2} \qquad (8)

otherwise the similarity is zero. Secondly, the transition probability a_{ij} is derived:

a_{ij} = \dfrac{a^{c}_{ij}}{\sum_{j=1}^{N} a^{c}_{ij}} \qquad (9)

2) Calculate the emission probability matrices B^I and B^T for images and texts.

3) For all k ≥ 1, the semantic similarity between Dq and Di (i = 1, …, N) is also based on the geometric mean:

S_i(k) = \left( S^{IT}_i(k) \cdot S^{II}_i(k) \cdot S^{TI}_i(k) \cdot S^{TT}_i(k) \right)^{1/4} = \left[ Sim_s(b^{I}_q, b^{T}_i) \cdot Sim_s(b^{I}_q, b^{I}_i) \cdot Sim_s(b^{T}_q, b^{I}_i) \cdot Sim_s(b^{T}_q, b^{T}_i) \right]^{1/4} \qquad (10)

4) Initialize the path, i.e., calculate

D(1) = \arg\max_{1 \le i \le N,\ i \neq q} \left( \log S_i(1) + \log a_{qi} \right) \qquad (11)

5) For k = 2, …, K, let u = D(k−1) and grow the path D(k) at each step as follows:

D(k) = \arg\max_{1 \le i \le N} \left( \log S_i(k) + \log a_{ui} \right) \qquad (12)

where i ∉ {q, D(1), …, D(k−1)}.

6) Output the path {D(1), D(2), …, D(K)}, which stands for the retrieved ranking.

It is noted that Eq. 12 is slightly different from the equation in step 6 of Algorithm 1, as the retrieved set no longer includes the query image-text pair. The matrices A and B can be computed beforehand to decrease the computation cost. As in Algorithm 1, the time complexity of ranking the retrieved image-text pairs (from step 4 to step 6) is O(KN). Therefore, the algorithm can be performed more efficiently by choosing a smaller value of K.
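A sketch of the similarity combination used by the multi-modal algorithm (Eqs. 8-10) follows; the helper names and the sim_s callable are hypothetical placeholders for whichever semantic measure is chosen.

```python
import numpy as np

def combined_transition_row(a_I_row, a_T_row):
    """Eqs. 8-9: geometric mean of the image and text transition rows, re-normalised.

    a_I_row, a_T_row : (N,) rows a^I_ij and a^T_ij for a fixed document i
                       (zero outside F(I_i) and F(T_i), respectively)
    """
    a_c = np.sqrt(a_I_row * a_T_row)     # non-zero only where both modalities agree
    total = a_c.sum()
    return a_c / total if total > 0 else a_c

def combined_semantic_similarity(b_q_I, b_q_T, b_i_I, b_i_T, sim_s):
    """Eq. 10: geometric mean of the two intra- and two inter-modality similarities."""
    terms = [sim_s(b_q_I, b_i_T), sim_s(b_q_I, b_i_I),
             sim_s(b_q_T, b_i_I), sim_s(b_q_T, b_i_T)]
    return float(np.prod(terms) ** 0.25)
```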


5.3 Extension

The CCSS model can naturally support more than two unimedia types. Let M be the number of unimedia types, let the query Y be Dq = (d^1_q, …, d^M_q), and let the retrieved X be the set of Di = (d^1_i, …, d^M_i). If i ≠ j and d^m_j ∈ F(d^m_i) (m = 1, …, M), the content similarity between Di and Dj is defined according to the geometric mean:

a^{c}_{ij} = \left[ \prod_{m=1}^{M} \left( a^{m}_{ij} \right)^{h_{im} h_{jm}} \right]^{1 / \sum_{m} h_{im} h_{jm}} \qquad (13)

where h_{im} = 1 if the mth unimedia type exists in the ith multimodal document and h_{im} = 0 otherwise. a^{m}_{ij} is defined analogously to a^{T}_{ij}, and a_{ij} can then be computed by Eq. 9. The semantic similarity between Dq and Di is similarly defined for all k ≥ 1:

S_i(k) = \left[ \prod_{m=1}^{M} \prod_{o=1}^{M} Sim_s\left( b^{m}_q, b^{o}_i \right)^{h_{qm} h_{io}} \right]^{1 / \sum_{m} \sum_{o} h_{qm} h_{io}} \qquad (14)

For this extension, it is straightforward to adapt the algorithms proposed above to solve the new retrieval problem.
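Under the same assumptions as the earlier sketches, Eq. 13 can be illustrated for documents with missing modalities by carrying the presence indicators h explicitly; the epsilon inside the logarithm is an implementation guard of ours, not part of the formula.

```python
import numpy as np

def extended_content_similarity(a_ij, h_i, h_j):
    """Eq. 13: content similarity between documents i and j over M unimedia types.

    a_ij : (M,) per-modality transition probabilities a^m_ij
    h_i  : (M,) indicators, h_i[m] = 1 if modality m exists in document i
    h_j  : (M,) indicators for document j
    """
    w = h_i * h_j                                  # h_im * h_jm for the shared modalities
    if w.sum() == 0:
        return 0.0                                 # the documents share no modality
    log_mean = np.sum(w * np.log(np.maximum(a_ij, 1e-12))) / w.sum()
    return float(np.exp(log_mean))                 # geometric mean over shared modalities
```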

6 Experiments

In this section, extensive experimental evaluations of the proposed model are described and compared with existing methods for cross-modal and multi-modal retrieval. All retrieval methods evaluated in this section have been fully implemented and tested in MATLAB 7.11, installed on a PC with an Intel(R) Core(TM)2 Duo CPU E7500 at 2.93 GHz and 2 GB RAM, running the Windows XP operating system.

6.1 Experimental preparation

Since there are few publicly available cross-modal datasets, we only carry out experiments on the Wikipedia dataset [22], which is derived from Wikipedia's "featured articles". Each article is accompanied by one or more images from Wikipedia Commons. In addition, both text and image objects are categorized by Wikipedia into one of 29 concept labels. As some of the concept labels are very scarce, only the 10 most populated labels are considered in this dataset. Each article is split into several sections according to its section headings, and the accompanying images are assigned to the sections according to their positions in the article. The final multimedia corpus contains a total of 2866 documents, which are image-text pairs annotated with a label from the vocabulary of 10 semantic concepts. The multimedia corpus is randomly split into a training set of 2173 documents and a test set of 693 documents.

In the image domain, due to the intractability of representing images as a distribution over topics, each image is represented using a histogram over a 128-codeword SIFT codebook. Principal component analysis (PCA) [13] orthogonally projects the original data onto a lower-dimensional linear space, called the principal subspace, such that the variance of the projected data is maximized. We also use the PCA technique to reduce the dimensionality of


image data.

As for the text domain, the low-level original features can be naturally represented by the bag-of-words (BoW) model. The term-document matrix of the text dataset is learned with the Text to Matrix Generator (TMG, http://scgroup20.ceid.upatras.gr:8000/tmg/), which is widely used in text mining. Besides the BoW representation, each text is also represented using a histogram of a 10-topic LDA text model. To obtain a fair comparison, all of the compared retrieval methods in our experiments utilize the same features.

The similarity measures include normalized correlation (NC), histogram intersection (HI) and inner product (IP). NC and IP are commonly used in classical text information retrieval, and HI plays an important role in image processing. Consider a W-dimensional space in which the query and target vectors are α and β, respectively. The NC and HI distances between α and β are defined as follows:

Dist_{NC}(\alpha, \beta) = 1 - \frac{\sum_{i=1}^{W} \alpha_i \beta_i}{\|\alpha\| \cdot \|\beta\|} \qquad (15)

Dist_{HI}(\alpha, \beta) = 1 - \frac{\sum_{i=1}^{W} \min(\alpha_i, \beta_i)}{\sum_{i=1}^{W} |\alpha_i|} \qquad (16)

For these two cases, the similarity between objects is calculated as the negative exponentiation of their distances. We set V = 500 in Eq. 3 for CCSS_TB. Moreover, the precision-recall (PR) curves and mean average precision (MAP) [18] are taken as performance measures.
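A short sketch of the two distances (Eqs. 15-16) and the negative-exponentiation similarity just described; the vector arguments are illustrative NumPy arrays rather than anything prescribed by the paper.

```python
import numpy as np

def dist_nc(a, b):
    """Normalized-correlation distance of Eq. 15."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dist_hi(a, b):
    """Histogram-intersection distance of Eq. 16."""
    return 1.0 - np.minimum(a, b).sum() / np.abs(a).sum()

def similarity(distance):
    """Similarity as the negative exponentiation of a distance."""
    return float(np.exp(-distance))
```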

6.2 Preliminary experiments

The combination of two similarities (content and semantics), two retrieval methods (CCSS_SCM and CCSS_TB), three similarity measures (NC, HI and IP), the size of the similar subset, two text representations (LDA and BoW), and the use of the PCA technique leads to many possible configurations of the retrieval model, so it is difficult to evaluate all of them. To obtain valid comparisons, we randomly split the training set into 5 folds and validate on each fold with the remaining 4 as training data. According to this split, four preliminary experiments are conducted.

Firstly, we evaluate different similarity measures for the content and semantics similarities in cross-modal retrieval (using the whole validation set as the similar subset, CCSS_SCM to retrieve objects, 10-topic LDA to represent texts, and 128-codeword SIFT to represent images without PCA). Table 1 presents the average MAP scores and their standard deviations, which are achieved on the validation set. As can be seen from Table 1, for every semantic measure, the best performance for image queries is obtained when the content similarities of texts are measured by IP, while the highest MAP score for text queries is achieved when the content similarities of images are measured by HI. This indicates that IP and HI are suitable for the low-level features of texts and images, respectively. In addition, the best average is attained when the semantic similarities are measured by IP. Thereafter, IP and HI are respectively used to measure the content similarities of texts and images in the remaining experiments.

The next experiment is designed to examine how the performance of cross-modal retrieval is affected by the value of F; the average results (using CCSS_SCM to retrieve objects, 10-topic LDA to represent texts, and 128-codeword SIFT to represent images without PCA) are shown in Fig. 2.


Table 1 Cross-modal retrieval performance (MAP Scores) on the validation set using different similarity measures for CCSS_SCM

Semantic measure              Content measure   Image query    Text query     Average
Normalized Correlation (NC)   NC                0.298±0.008    0.219±0.010    0.259±0.004
                              HI                0.300±0.006    0.222±0.010    0.261±0.005
                              IP                0.404±0.015    0.182±0.008    0.293±0.007
Histogram Intersection (HI)   NC                0.240±0.011    0.215±0.011    0.228±0.010
                              HI                0.245±0.009    0.220±0.013    0.233±0.009
                              IP                0.389±0.015    0.163±0.008    0.276±0.006
Inner Product (IP)            NC                0.348±0.010    0.227±0.014    0.288±0.007
                              HI                0.337±0.007    0.228±0.015    0.283±0.009
                              IP                0.379±0.011    0.217±0.013    0.298±0.008

Bold number means the highest MAP score for each semantic measure

We consider that each retrieved object in the low-level original feature space has the same number of similar objects, such as F = |F(Ii)| = |F(Ij)| for generic images, and use NC, HI and IP as the semantic measures. As can be seen from Fig. 2, the performance for text queries is generally better when F is larger (about 90 % of the dataset size), whereas the value of F needs to be smaller, namely 20 % of the dataset size, to obtain satisfactory retrieval performance when the query is an image. Following these results, we henceforth set the value of F to 90 % and 20 % of the dataset size for images and texts, respectively, in the remaining experiments.

Thirdly, we compare the performance of different text representations in cross-modal retrieval, using the CCSS_TB method based on 128-codeword SIFT for images (without PCA).


Fig. 2 Plots of how the size of the similar subset affects the performance of cross-modal retrieval. Three patterns of CCSS_SCM are plotted. The left-hand plot corresponds to text queries and the right-hand plot to image queries

Table 2 Cross-modal retrieval performance (MAP Scores) on the validation set using different text representations for CCSS_TB

Text representation   Image query    Text query     Average
LDA (10 topics)       0.447±0.013    0.253±0.020    0.350±0.016
BoW (6,603 words)     0.346±0.008    0.249±0.018    0.297±0.011
BoW (4,066 words)     0.348±0.010    0.248±0.017    0.298±0.012
BoW (3,042 words)     0.347±0.010    0.249±0.018    0.298±0.012

Bold number means the highest MAP score

The average MAP scores and their standard deviations, achieved on the validation set, are presented in Table 2. As can be seen, for both image queries and text queries, the retrieval performance of the BoW representation is inferior to that of the LDA representation. The reason may be that the LDA model can mine more helpful and abstract information than the BoW model. In addition, since the features extracted by the BoW model are extremely high-dimensional and sparse, the size of the dictionary has almost no effect on the performance of cross-modal retrieval. Thereafter, the 10-topic LDA model is adopted to represent texts in the remaining experiments.

Finally, we design an experiment to explore whether the PCA technique applied to the image domain can enhance the performance of cross-modal retrieval; the average results on the validation set are shown in Fig. 3. It can be clearly seen from this figure that applying PCA to the image domain does not improve the retrieval performance for either image queries or text queries. In PCA, some helpful information may be discarded as the high-dimensional features are projected onto the principal subspace, which may be the reason for this. Therefore, the PCA technique is not utilized in the remaining experiments.


Fig. 3 Plots of whether the PCA technique applied to the image domain could enhance the performance of cross-modal retrieval, for text query on the left and image query on the right


6.3 Experimental results for cross-modal retrieval

The algorithm of cross-modal retrieval is executed on the test set, and the parameter configurations are those that achieved the best cross-validation performance in the previous section. The MAP scores of our proposed model are compared with those of state-of-the-art methods in Table 3. In this table, the Random method means randomly retrieving the objects. It can be seen from Table 3 that both of our proposed approaches outperform the compared methods. For example, CCSS_SCM achieves an average MAP score of 0.306, improving about 22 % over SCM. CCSS_TB improves about 39 % over SCM and 21 % over CMCP, reaching an average MAP score of 0.349. On one hand, the content similarity concentrates on the internal structure of each modality; on the other hand, the semantics similarity reflects the semantic correlation between different modalities. Both of them are beneficial and their combination can be complementary. That is the reason behind the improvement. In addition, CCSS_TB shows a significant improvement over CCSS_SCM, which means that the more precise semantics similarity is more effective.

Moreover, Fig. 4 shows the PR curves of cross-modal retrieval with CM, SM, SCM, CCSS_SCM and CCSS_TB. For text queries, CCSS_SCM has higher precision than SM and CM, while CCSS_TB has higher precision than CCSS_SCM, at most levels of recall. The dimensionality of the original image feature is huge and the feature vector contains many zeros, so the content similarity only provides a little helpful information; that may be the reason why CCSS_SCM is merely competitive with SCM. For image queries, CCSS_TB and CCSS_SCM significantly outperform the compared methods at lower levels of recall. As the recall increases, the precision gap gets smaller.

In Fig. 5, we also compare the time cost of CCSS_TB with that of SCM for all queries in the test set. If Time_sim and Time_rank are the times for computing the similarity matrices and for ranking the retrieved objects, respectively, then the retrieval time is Time_sim + Time_rank. In the SCM method, the similarity matrix is measured by NC, and the time difference between image queries and text queries can be ignored. As for our method, there are two similarity matrices to be computed, namely the content and semantics similarity matrices; the content similarity matrices for image and text queries are measured by IP and HI, respectively. The time cost of CCSS_SCM is not shown because it is similar to that of CCSS_TB. As can be seen from Fig. 5, the time cost of CCSS_TB is significantly lower than that of SCM when the value of K is small. As the number of time steps increases, the time difference gets smaller. Two aspects may cause this phenomenon. It is clear that the time cost of CCSS_TB is directly proportional to the value of K. Furthermore, the possibility of having repetitious objects in the retrieved path grows as the value of K gets bigger, so some extra operations are needed to avoid the repetitions.

Table 3 Cross-modal retrieval performance (MAP Scores) on the test set

Experiment   Image query   Text query   Average
CCSS_TB      0.435         0.263        0.349
CCSS_SCM     0.380         0.231        0.306
CMCP [33]    0.326         0.251        0.289
SCM [22]     0.275         0.226        0.251
SM [22]      0.271         0.212        0.242
CM [22]      0.242         0.198        0.220
Random       0.118         0.118        0.118

Bold number means the highest MAP score


Fig. 4 Plots of the precision-recall curves of our CCSS and three compared methods for text query on the left and image query on the right

An example of a text query and the corresponding retrieval results, using CCSS_TB, CCSS_SCM and SCM, is presented in Fig. 6. The text query is shown along with its ground-truth image. The top five images retrieved by each method are shown on the corresponding row. Note that CCSS_TB returns images that have the same semantic class ("Biology") as the query text; in particular, the top three contain a similar visible object, namely a bird. Moreover, CCSS_SCM has better results than SCM.

Fig. 5 Plot of the retrieval time on the test set, for SCM and CCSS_TB


Fig. 6 An example of text query and the top five retrieved images. The query text and ground truth image are shown on the top row; the top images retrieved by SCM, CCSS_SCM and CCSS_TB are shown on the second, third and bottom rows, respectively

6.4 Experimental results for multi-modal retrieval

Based on the CCSS model, we compare our multi-modal image retrieval methods with some existing unimodal and cross-modal image retrieval methods. According to the methods in [22, 33], there are two manners in which a cross-modal retrieval system can be adapted to perform unimodal image retrieval. In the first manner, a query image is complemented with a text object, and this text object then serves as a proxy to rank the images in the dataset.

Table 4 The performance (MAP Scores) of image retrieval on the test set

Experiment                       MAP score
CCSS_TB                          0.621
CCSS_SCM                         0.580
EF (Ranking by NC)               0.545
CMCP [33] (Proxy Text Ranking)   0.326
SCM [22] (Proxy Text Ranking)    0.275
CMCP [33] (Proxy Text Query)     0.251
SCM [22] (Proxy Text Query)      0.226
Image SMN [21]                   0.161
Image SIFT Features [26]         0.135

Bold number means the highest MAP score

In the second manner, the images in the dataset are complemented with text objects; these text objects then serve as proxies for the images in the dataset and are ranked by a query image. However, the two proxy methods do not make full use of the whole information of the multi-modality media data. Our multi-modal image retrieval methods can combine images and texts in the retrieval process.

Table 4 shows the performance of every image retrieval method. The method in [26] represents images as distributions of SIFT features, and the approach in [21] projects the images into a high-level semantic space. The former only utilizes the content similarity of images, and the latter only uses the semantics similarity of images. As can be seen, the MAP scores of these two unimodal image retrieval methods are lower than those of cross-modal image retrieval. This indicates that a benefit can be obtained from a cross-modal point of view, where the complementary information of each modality is partly used. EF denotes the early fusion method, which fuses the features derived from image and text into a single vector. The MAP score of EF is higher than those of cross-modal retrieval, which indicates that the combination of the modalities is very beneficial to image retrieval. Our two multi-modal methods significantly improve the retrieval performance, achieving MAP scores of 0.580 and 0.621 for CCSS_SCM and CCSS_TB, respectively. EF only considers the content similarity between multimedia documents, while our multi-modal methods take both the content similarity and the semantics similarity into account; that may be the reason why our model is better than EF. In addition, we find that CCSS_TB outperforms CCSS_SCM, which indicates that the semantics similarity is more influential than the content similarity.

The advantages of our multi-modal methods are also illustrated by Fig. 7, where the top five results of two queries are shown under EF, CCSS_SCM and CCSS_TB. The first column shows the queries while the remaining columns show the retrieved images. For the query from the "Warfare" class, EF cannot effectively distinguish the "Warfare" class from the "History" class and has no ability to bridge the semantic gap. The reason may be that the two categories share similar content information, such as visual objects, and the semantics similarity is not taken into account. Our multi-modal methods, which utilize the semantics similarity to bridge the gap, produce semantically relevant results. In particular, CCSS_TB matches images that are similar to the query in terms of both content and semantics. The other query presents a similar result.

7 Conclusions and future work In this paper, we have presented a novel probabilistic model for cross-modal and multimodal retrieval. The model called as CCSS not only combines low-level content and high-


Fig. 7 Examples of multi-modal image retrieval. Two queries from the "Warfare" and "Music" classes are shown in the left-most column; for the remaining columns, the first, second, and third rows of every query show the top five retrieved images using EF, CCSS_SCM and CCSS_TB, respectively

7 Conclusions and future work

In this paper, we have presented a novel probabilistic model for cross-modal and multi-modal retrieval. The model, called CCSS, not only combines low-level content and high-level semantics similarities through a first-order Markov chain, but also provides different similarity measures for different unimedia types. Content similarity focuses on the original features within each modality, while semantics similarity focuses on the vectors of semantic features in a semantic space. Both of them are very important and their combination can be complementary. Multi-class logistic regression and random forests are used to map the original features of each medium into a semantic space. The ranked list for a query is attained by highlighting an optimal path across the Markov chain. Although the proposed model is uncomplicated, the experimental results on the Wikipedia dataset show that our model significantly outperforms previous approaches for cross-modal and multi-modal retrieval. Moreover, due to the more precise semantics similarity, CCSS_TB outperforms CCSS_SCM.

In general, the strategy of query-by-example has been used in unimodal retrieval, where it may be divided into query-by-content-example and query-by-semantic-example; the content and semantic features of query examples are regarded as the inputs in the former and the latter paradigm, respectively. Therefore, the query strategy in our method is query-by-example to some extent. This discussion also inspires us to try the strategy of query-by-semantic-labels in future work.

For a specific semantic concept such as "Bird", the proposed model does not effectively distinguish it from other biological concepts such as "Dinosaur", although both belong to the biology category. That is to say, the proposed model performs well only for high-level concepts, such as "Biology" and "Music". Therefore, there may be a larger gain when a hierarchical concept taxonomy is considered. With a hierarchical taxonomy, the vocabulary becomes richer and the semantic relations are more complex. To better capture the correlation between concepts, the stacked generalization technique or the structure of natural language could be utilized. We will explore these directions in future work.

Since the inspiration of our proposed model is to simultaneously combine content and semantics similarities in the retrieval process, we believe the methodology holds promise for other multi-modalities, although images and texts are the focus of this paper. As long as the relationships among different multimedia documents can be described in terms of content and semantics similarities, it is possible to utilize our model to retrieve such multimedia documents. Research on other modalities such as video and audio is our ongoing work. Furthermore, the first-order Markov chain, which can be treated as 2-grams to some extent, only considers the dependence between two adjacent nodes; it may be difficult for it to express the real relations in time-series data. Therefore, a benefit may be achieved by considering two first-order or higher-order Markov chains, such as the factorial HMM model and the n-gram (n > 2) technique. The motivation for exploring these methods in future work is to capture more sophisticated correlations among the retrieved objects.



Shixun Wang is pursuing the Ph.D. degree in the School of Computer Science & Technology, Huazhong University of Science and Technology, Wuhan, China. His current research interests include cross-modal retrieval, multi-modal retrieval and machine learning.

Peng Pan is an Associate Professor in the School of Computer Science & Technology, Huazhong University of Science and Technology, Wuhan, China. His research interests include database systems, multimedia information retrieval and machine learning.


Yansheng Lu is a Professor in the School of Computer Science & Technology, Huazhong University of Science and Technology, Wuhan, China. His research interests include database system, software engineering, multimedia information retrieval and machine learning.

Liang Xie is pursuing the Ph.D. degree in the School of Computer Science & Technology, Huazhong University of Science and Technology, Wuhan, China. His current research interests include cross-modal retrieval, multi-modal retrieval and machine learning.
