Multimed Tools Appl DOI 10.1007/s11042-014-1873-x
An efficient framework of Bregman divergence optimization for co-ranking images and tags in a heterogeneous network

Lin Wu · Xiaodi Huang · Chengyuan Zhang · John Shepherd · Yang Wang
© Springer Science+Business Media New York 2014
Abstract Graph-based ranking is an effective way of ranking images by making use of the graph structure. However, its applications are usually limited to individual image graphs, which are derived from self-contained features of images. Nowadays, many images in social web sites are often associated with semantic information (i.e., tags). Ranking these orderless tags is helpful for understanding and retrieving images, and the overall ranking performance can be improved if their mutual reinforcement is considered. Unlike previous work focusing only on individual image or tag graphs, in this paper we investigate the problem of co-ranking images and tags in a heterogeneous network. Considering that ranking on images and tags can be conducted simultaneously, we present a novel co-ranking method based on random walks that significantly improves the ranking effectiveness on both images and tags. We further improve our algorithm with respect to computational complexity and the out-of-sample problem. This is achieved by casting the co-ranking as a Bregman divergence optimization, under which we transform the original random walks into an equivalent optimal kernel matrix learning problem. Extensive experiments conducted on three benchmarks show that our approach outperforms the state-of-the-art local ranking approaches and scales to large databases.
L. Wu · C. Zhang (✉) · J. Shepherd · Y. Wang
School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, Australia
e-mail: [email protected]

X. Huang
School of Computing and Mathematics, Charles Sturt University, Albury, NSW 2640, Australia
e-mail: [email protected]
Keywords Co-ranking · Random walks · Heterogeneous network · Bregman divergence
1 Introduction

The explosion of online community-contributed multimedia data has drawn great attention to image retrieval. Most social media sharing websites, such as Flickr, allow users to upload personal images and annotate their content with descriptive keywords called tags. Many ranking algorithms specialized to images on such social media repositories have been proposed to help organize the shared media data [15, 26, 42, 43] or to facilitate the image ranking process [9, 21, 41]. For instance, Jing et al. [21] propose to rank images by utilizing the visual similarity among images. Although a lot of encouraging results have been reported from these approaches focusing on centrality measures over image content, the relative importance of images is evaluated independently, which fails to take advantage of useful metadata such as tags and manual labels. Similarly, the well-studied treatments of tag recommendation [13] and tag ranking [27] are conducted solely on the tag graph. Another effective approach proposed in [15] helps users in the tagging process by suggesting relevant tags. Nevertheless, the natural connections between images and tags are not fully leveraged in these approaches.

Although these methods [13, 15, 21, 27] have achieved better ranking performance than previous approaches, they do not consider the reinforcing dependency between images and tags, which is beneficial to further improving ranking results. For instance, the tag ranking list provided by Liu et al. [27] simply relies on the tag graph built upon a given image, while ignoring the additional ranking information from the image graph. This information could greatly improve the accuracy of tag ranking. Our work is motivated by the observation that, in an expanding heterogeneous network where large amounts of images associated with descriptions are continually uploaded, conventional ranking models on individual image or tag graphs oversimplify the ranking context, making them inapplicable in real circumstances. We argue that the reinforcing dependency between images and tags can be regarded as mutual boosting of their individual ranking tasks. For image ranking, a set of images is ranked by minimizing the errors in visual content with respect to a query. However, there is a semantic gap between visual appearance and semantic meanings [16, 45], which can be narrowed with assistance from semantics, e.g., tags. On the other hand, the ranking information from an image graph is able to facilitate tag ranking, since tags are ordered according to their relevance to the image content. Intuitively, tags from highly ranked images are more likely to be ranked at higher positions. Therefore, it is desirable to develop a novel algorithm that handles dual-relational data over a heterogeneous network to rank images and tags simultaneously.

To this end, this paper aims to design a co-ranking scheme for images and their associated tags in a principled manner. The co-ranking framework faces two major challenges. The first is how to design a framework that is effective in leveraging the mutual reinforcement over image and tag graphs. The second lies in the high computational complexity caused by the out-of-sample problem. Given a query, co-ranking constructs two affinity graphs and propagates the ranking scores over the combined graph, leading to a complexity of O(n^3), where n is the number of samples in the database.
Such a high cost is prohibitive on large databases, particularly when out-of-sample queries occur. Normally, if an in-sample query is issued, co-ranking can use off-line pre-computation to reduce the computational cost. However, if the query is outside the database, the expensive ranking-score propagation step needs
to be performed in the on-line stage, which is referred to as the out-of-sample problem [5]. To this end, we make use of random walks, and further reformulate them as a Bregman divergence optimization. This provides a novel perspective on the co-ranking algorithm: the optimal co-ranking function can be modeled by learning an optimal kernel matrix under a Bregman matrix divergence metric. As a result, our co-ranking algorithm and its extension can effectively address the challenges of high computational cost and the out-of-sample case. The main contributions of this paper are summarized as follows.

– We explore the mutual reinforcement between image and tag graphs by constructing three graphs: an image graph, a tag graph, and a bridging graph that combines the two.
– The importance of images and tags in a collection is obtained by random walks on their individual graphs. The importance of tags is effectively utilized to enhance image ranking through the bridging graph, and vice versa for tag ranking by combining the importance of images.
– To further improve the performance, we formulate our co-ranking algorithm as a Bregman divergence optimization problem, through which we make a novel extension to combat the challenges of computational complexity and the out-of-sample problem.
– Extensive experiments are conducted to demonstrate the effectiveness and efficiency of our co-ranking approach over existing local ranking approaches for image ranking and tag ranking.
The rest of the paper is organized as follows. In Section 2, we briefly review the literature on image ranking, tag recommendation, and the few frameworks that jointly consider images and tags. In Section 3, we introduce our technique by describing the construction of the three graphs. Section 4 presents the co-ranking algorithm, followed by its efficient extension derived from the Bregman matrix divergence in Section 5. Extensive experiments are reported in Section 6, and conclusions are drawn in Section 7.
2 Related work

Image retrieval and ranking The goal of image retrieval is to return the set of images in a database that are relevant to a query. The aim of ranking is that the retrieved images are ordered according to their relevance to the query [40]. For large-scale social media networks, it is essential that an image search application ranks images so that the most relevant ones appear at the top. Ranking models can be content based [34]; link-structure based, like Pagerank [6] and HITS [22]; or cross-media based [20]. Apart from these conventional ranking models, an important paradigm is learning to rank [13], which aims to optimize a ranking function that incorporates relevance features. Moreover, graph-based ranking, a particular kind of ranking model, has found wide application and superior performance in link-structure analysis of the web [22] and in social network research [15, 27]. A typical graph-based ranking algorithm is manifold ranking, which has been widely applied to texts [39], images [16, 45], and videos [46]. Observing that manifold ranking is not efficient in graph construction and ranking computation, Xu et al. [45] proposed a framework that overcomes these shortcomings by building an anchor graph and further speeding up the ranking computation. Our approach belongs to the family of graph-based ranking models. However, we go beyond the individual image graph and effectively leverage the beneficial information from tag ranking to
enhance image ranking performance. We build a combined heterogeneous graph that bridges the image graph and the tag graph, thus making the ranking tasks over the two graphs mutually reinforcing.

Tag recommendation and ranking Social media sharing web sites like Flickr and Zooomr allow users to upload personal media data and annotate the content with descriptive keywords called tags. Even a modest amount of annotation can significantly improve the usefulness of such media collections as they continue to grow [2]. For example, the tags in Flickr are mostly assigned by the user who uploads the images, making photos searchable by the contributing users as well as enabling users to discover other users' photos. Existing tag recommendation methods can be roughly categorized into semi-automatic and automatic approaches. Semi-automatic tag recommendation relies on one or several initial tags provided by users [37, 44]. For example, by investigating a representative snapshot of Flickr and presenting the results by means of a tag characterization, Sigurbjörnsson et al. [37] evaluated tagging strategies that assist users in the photo annotation task by recommending a set of tags to add to a photo. Moreover, in [44], Wu et al. formulated tag recommendation as a learning problem through which a multi-modality recommendation is proposed to combine both tag and visual correlations. Automatic tag recommendation, on the other hand, is usually accomplished by exploiting the image content [9, 26]. These methods typically annotate unlabeled images based on low-level features learned from a labeled training set. However, due to the semantic gap between tags and low-level features, the performance of these approaches is still far from satisfactory in large-scale practice. Manual tagging, as an alternative, can provide more accurate tags at the expense of labor cost. Nonetheless, the tags labeled by users are in a random order without any importance or relevance information, motivating the tag ranking scheme of [27], where the authors rank the tags associated with a given image according to their relevance to the image content. We argue that image ranking and tag ranking are complementary: the ranking results from an image graph can boost the performance of tag ranking, and vice versa. We therefore develop an effective co-ranking framework that incorporates both rankings for a specific ranking task, and then go a step further to derive our co-ranking formulation from the perspective of Bregman divergence, leading to extensions for computational efficiency as well as an out-of-sample strategy.

Joint consideration of images and tags There have been a few projects considering visual features and tags in a joint framework [14, 30–32, 35]. For instance, an image analysis scheme that automates the detection of landmarks and events in tagged image collections is proposed by Papadopoulos et al. [31]. There are critical differences between [31] and our work. First, their method operates on a hybrid image-similarity graph, including visual and tag similarities between images, from which image clusters can be found and further classified as either landmarks or events. By contrast, we explicitly address the reinforcing relationship between images and tags via two individual graphs, and the performance is boosted by a bridging graph. This process alternately compensates the two graphs and makes them complementary to each other.
Second, the out-of-sample problem is not tackled in [31]. Another graph-based approach is presented in [32], where images are ranked over an image similarity graph by considering visual features and user tags. However, the complementary information from the tag graph is not exploited. In [30], Nikolopoulos et al. extend probabilistic latent semantic analysis (pLSA) to a high-order version motivated by the co-existence of the visual and tag modalities.
3 Graph constructions

In this section, we introduce the constructions of three graphs: an image graph, a tag graph, and a combined heterogeneous graph. For the image graph, we model the relations between images using a probabilistic hypergraph, in which a hyperedge corresponds to a particular tag and the probability that images belong to hyperedges effectively quantifies a higher-order relationship.

3.1 Image hypergraph

The existing way of modeling the relations between images is to construct a pairwise image graph, in which images are taken as vertices and two similar images are connected by an edge whose weight is computed as the image-image affinity [21, 45]. However, simple pairwise graphs cannot describe high-order relationships among more than two vertices, as pointed out in [1, 18, 19, 49]. In a hypergraph, the edges (known as hyperedges) may connect an arbitrary number of vertices. Thus, compared to standard graphs that model pairs of images as their edges, hypergraphs can model semantic labels as their hyperedges. This property leads to the definition of higher-order relations among images and provides a way of exploiting semantics for image ranking.

3.1.1 Notations and definitions

We consider a weighted hypergraph G_M = (V_M, E_M, W_M), which consists of a set of vertices V_M = {v_1, v_2, ..., v_{|V_M|}}, a set of nonempty subsets of V_M referred to as the hyperedges E_M = {e_1, e_2, ..., e_{|E_M|}}, and a diagonal weight matrix W_M in which w(i) is the weight of hyperedge e_i. In our framework, each tag t_i is associated with its corresponding hyperedge e_i, which contains the vertices of the images labeled with t_i. A toy example showing the rationale of the image hypergraph is illustrated in Fig. 1. We establish the relationship between hyperedges and vertices in a probabilistic incidence matrix H of size |V_M| × |E_M|, defined as

h(v_i, e_j) = \begin{cases} p(v_i \,|\, e_j), & \text{if } v_i \in e_j, \\ 0, & \text{otherwise.} \end{cases}    (1)
Fig. 1 A toy example for the image hypergraph. We consider each tag t_i as a hyperedge containing an arbitrary number of images which are associated with the tag t_i

H induces the definition of a vertex-degree matrix D_v with diagonal elements d(v) = \sum_{e \in E_M} h(v, e) w(e), and a hyperedge-degree matrix D_e with diagonal elements \delta(e) = \sum_{v \in V_M} h(v, e). To estimate the probability p(v_i | e_j), we employ a kernel density estimation approach, which can be formulated as

p(v_i \,|\, e_j) = \frac{1}{|Y_i|} \sum_{y_k \in Y_i} K_\sigma(y - y_k),    (2)

where Y_i is the set of images that contain the tag t_i and K_\sigma is the Gaussian kernel function with radius parameter \sigma.
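To make the construction concrete, the sketch below builds the probabilistic incidence matrix H of (1)–(2) together with the degree matrices D_v and D_e from a toy tag-to-image assignment. It is a minimal illustration under assumed inputs (random features, a hand-coded tag map, uniform hyperedge weights); helper and variable names such as `build_probabilistic_hypergraph` and `tag_to_images` are ours, not from the paper.

```python
import numpy as np

def build_probabilistic_hypergraph(features, tag_to_images, weights, sigma=1.0):
    """Probabilistic incidence matrix H (Eqs. 1-2) plus degree matrices Dv, De.

    features      : (n_images, d) array of image feature vectors
    tag_to_images : list of image-index lists, one list per tag (= hyperedge)
    weights       : (n_tags,) hyperedge weights w(e)
    """
    n, m = features.shape[0], len(tag_to_images)
    H = np.zeros((n, m))
    for j, members in enumerate(tag_to_images):
        Y = features[members]                       # images carrying tag t_j
        for i in members:                           # Eq. (2): kernel density estimate
            diff = features[i] - Y
            H[i, j] = np.mean(np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2)))
    d_v = H @ weights                               # d(v) = sum_e h(v, e) w(e)
    delta_e = H.sum(axis=0)                         # delta(e) = sum_v h(v, e)
    return H, np.diag(d_v), np.diag(delta_e)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(6, 4))                 # 6 toy images, 4-dim features
    tags = [[0, 1, 2], [2, 3], [3, 4, 5]]           # 3 tags / hyperedges
    H, Dv, De = build_probabilistic_hypergraph(feats, tags, np.ones(3))
    print(H.round(3))
```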
3.1.2 Hyperedge weight

In order to quantify the similarity between two images, we use several feature descriptors: the SURF-based appearance descriptor [3] and the pyramid histogram of oriented gradients (PHOG) based shape descriptor [4]. Another combined feature vector, concatenating color, texture, edge, and SIFT features, is also used in this setting; a brief introduction is given in Section 6. SURF features are extracted from the images, a 128-bin codebook of SURF features is created using the K-means algorithm, and the features are further quantized into a histogram by soft assignment, as suggested in [19, 38]. For the PHOG descriptor, we discretize the gradient orientation into 8 bins to build the corresponding histograms. Thus, the distance between two images (v_i, v_j) can be calculated by a spatial pyramid matching (SPM) approach formulated as (3):

\phi(v_i, v_j) = \sum_{l=0}^{L} \sum_{p=1}^{m(l)} \frac{1}{2^{L-l}} \, \beta^l_p \, \chi^2\big(His^l_p(v_i), His^l_p(v_j)\big),    (3)

where His^l_p(\cdot) denotes the local histogram at the p-th position of level l, \beta is the weighting parameter, and \chi^2(\cdot,\cdot) is the chi-square distance between two histograms. Both SURF and PHOG descriptors are based on 3-level image pyramids. Specifically, the three levels of spatial pyramids are:

– 1×1: whole image, l = 0, m(0) = 1, \beta^0_1 = 1;
– 2×2: image quarters, l = 1, m(1) = 4, \beta^1_1 \sim \beta^1_4 = 1/4;
– 4×4: 16-divided image, l = 2, m(2) = 16, \beta^2_1 \sim \beta^2_{16} = 1/16.
Therefore, the similarity between two images can be computed using the following kernel function:

A(v_i, v_j) = \exp\big(-\phi(v_i, v_j)/\bar{\sigma}\big),    (4)

where \bar{\sigma} is the standard deviation of \phi(v_i, v_j) over all the data. An example case is shown in Fig. 2. To determine the weight of a hyperedge, we employ the homogeneity of the appearance and shape features over all images constituting the hyperedge. In particular, we define the weight as a function of the variances of the features of the images associated with a hyperedge:

w(e) = \exp\left(-\frac{\sum_{v_i, v_j \in e} \phi(v_i, v_j)}{|e|(|e|-1)\mu^2}\right),    (5)

where the parameter \mu characterizes the homogeneity of the images at the spatial pyramid grid levels,¹ and |e| denotes the number of images belonging to the hyperedge e. A large value of \mu allows an object of potentially different shape or appearance, while a small value of \mu prohibits such differences.
Fig. 2 Spatial pyramid grids for appearance and shape descriptors. From left to right, l=0 to l=2
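The sketch below is one possible reading of (3)–(5): it computes the SPM distance from precomputed per-cell histograms, turns it into the kernel of (4), and derives the hyperedge weight of (5). The data layout (nested lists of per-level histograms) and names such as `spm_distance` are assumptions on our part, with the level weights β = 1, 1/4, 1/16 listed above.

```python
import numpy as np

def chi2(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def spm_distance(pyr_i, pyr_j, L=2):
    """Eq. (3): phi(v_i, v_j) over an L-level spatial pyramid.

    pyr_i, pyr_j: lists over levels l = 0..L of arrays with shape (m(l), bins),
                  i.e. one local histogram per grid cell.
    """
    phi = 0.0
    for l in range(L + 1):
        beta = 1.0 / pyr_i[l].shape[0]          # beta^l_p = 1, 1/4, 1/16 for l = 0, 1, 2
        for p in range(pyr_i[l].shape[0]):
            phi += (1.0 / 2 ** (L - l)) * beta * chi2(pyr_i[l][p], pyr_j[l][p])
    return phi

def similarity(phi, sigma_bar):
    """Eq. (4): A(v_i, v_j) = exp(-phi / sigma_bar)."""
    return np.exp(-phi / sigma_bar)

def hyperedge_weight(pyramids, members, mu):
    """Eq. (5): w(e) from the pairwise SPM distances of the images in hyperedge e."""
    total, cnt = 0.0, len(members)
    for a in range(cnt):
        for b in range(cnt):
            if a != b:
                total += spm_distance(pyramids[members[a]], pyramids[members[b]])
    return np.exp(-total / (cnt * (cnt - 1) * mu ** 2))
```

The per-cell weight β is implemented as 1/m(l), which reproduces exactly the three values given above; a homogeneous hyperedge yields a small accumulated distance and hence a weight close to one.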
3.2 Tag graph

For the tag graph, we use a simple pairwise graph to model the pairwise constraints between tags. A possible explanation for constructing a simple tag graph rather than a hypergraph is the difficulty of defining hyperedges for tags: if an image were taken as a hyperedge, tags would be hard to categorize into a particular image, and determining the hyperedges would be equivalent to manually assigning tags to images, which is too laborious. An effective way to model the relations between tags is to build models using visual cues that come from the nearest-neighbor images associated with the tags. The nearest-neighbor strategy can reduce the noise caused by polysemy [27]. Unlike the popular method [27], which considers a fixed number of nearest neighbors and assumes an equal contribution from different neighbors, we learn the distances between tags more accurately by introducing locally specific distance metrics, which are further combined with the popular Google distance [10] to measure the similarity between tags. Suppose that two sets U_i and U_j are the representative image collections of tags t_i and t_j, respectively.² Then the distance between t_i and t_j is defined as follows:

d(t_i, t_j) = \sum_{u_i \in U_i, u_j \in U_j} \| u_i - u_j \|^2.    (6)

The distance defined in (6) is the Euclidean distance, which is independent of the input data. To truly reflect the similarity among tags and capture ranking information from the image graph, we use the Mahalanobis distance with an appropriate distance metric.³
¹ We weight a hyperedge e based on the homogeneity of the SURF and PHOG features at all grids derived from the images constituting the hyperedge. The weighting principle assumes that a homogeneous area should be assigned a higher weight than one with large variation.
² For a tag t_i associated with an image u_i, we collect the N nearest neighbors from the image collection containing tag t_i, and these images are regarded as the representative images of t_i with respect to u_i.
³ As each tag does not contribute equally to the labeling of an image, it is desirable to learn multiple weights for each tag. The weights reflect the relative importance of the tag with respect to an image.
Essentially, instead of using (6), we compute the tag distances using the Mahalanobis distance:

d_m(t_i, t_j) = \sum_{u_i \in U_i, u_j \in U_j} (u_i - u_j)^T M_{ij} (u_i - u_j),    (7)
where M_{ij} is an element of the Mahalanobis matrix M, which captures the ranking information of the representative images. The Google distance [10] is also incorporated into the metric for the tags, motivated by [27]. Specifically, the Google distance between t_i and t_j based on their co-occurrence is defined as follows:

d_g(t_i, t_j) = \frac{\max(\log g(t_i), \log g(t_j)) - \log g(t_i, t_j)}{\log G - \min(\log g(t_i), \log g(t_j))},    (8)
where G is the total number of images, g(t_i) and g(t_j) denote the numbers of images associated with tags t_i and t_j respectively, and g(t_i, t_j) is the number of images associated with both t_i and t_j. Note that these numbers can be obtained by performing searches on social media websites, e.g., Flickr, with the tags as keywords. Based on the two distances defined in (7) and (8), we formalize the affinity value between tags t_i and t_j as:

s(t_i, t_j) = \alpha \cdot \exp(-d_m(t_i, t_j)) + (1 - \alpha) \cdot \exp(-d_g(t_i, t_j)),    (9)

where 0 < \alpha < 1. In our experiments, we set \alpha = 0.8 to be consistent with the baseline method [27].
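A compact sketch of the tag affinity in (7)–(9) is given below, assuming the representative image sets, the pairwise weight M_ij (treated here as the scalar element described in the text), and the tag occurrence counts are already available; all function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def mahalanobis_tag_distance(U_i, U_j, m_ij):
    """Eq. (7), with M_ij treated as the scalar pair weight described in the text."""
    d = 0.0
    for u in U_i:
        for v in U_j:
            diff = np.asarray(u) - np.asarray(v)
            d += m_ij * diff @ diff            # (u - v)^T M_ij (u - v)
    return d

def google_distance(g_i, g_j, g_ij, G):
    """Eq. (8): normalized Google distance from tag occurrence counts."""
    num = max(np.log(g_i), np.log(g_j)) - np.log(g_ij)
    den = np.log(G) - min(np.log(g_i), np.log(g_j))
    return num / den

def tag_affinity(U_i, U_j, m_ij, g_i, g_j, g_ij, G, alpha=0.8):
    """Eq. (9): s(t_i, t_j) = alpha * exp(-d_m) + (1 - alpha) * exp(-d_g)."""
    return (alpha * np.exp(-mahalanobis_tag_distance(U_i, U_j, m_ij))
            + (1 - alpha) * np.exp(-google_distance(g_i, g_j, g_ij, G)))
```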
3.3 The combined heterogeneous graph

To allow a random walk to move from image v_i to tag t_j (and from tag t_j to image v_i) via the graph G*, we define the conditional transition matrices MT (with entries MT_{i,j}) and TM (with entries TM_{j,i}), whose entries are the transition probabilities from the image graph G_M to the tag graph G_T and vice versa. After the next step on the graph G*, we have

MT_{i,j} = Y(t_j \,|\, v_i) = \frac{w^*(i,j)}{\sum_k w^*(i,k)}, \qquad TM_{j,i} = Y(v_i \,|\, t_j) = \frac{w^*(i,j)}{\sum_k w^*(k,j)},    (10)

where w^*(i,j) encodes the relevance score of tag t_j to image v_i. The value of w^*(i,j) can be estimated probabilistically by the algorithm in [27]. The matrices MT and TM reflect the asymmetric relationship between images and tags: it is desirable for a tag to be associated with many related images, while for an image it is better to have tightly correlated tags, but not necessarily more tags. Moreover, the significance of a particular tag depends on its degree, reflecting that the more popular a tag is, the more connections it has to other tags.
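Under the reading of (10) above, the two bridging matrices are simply row and column normalizations of the relevance matrix W*; a minimal sketch (names assumed) follows.

```python
import numpy as np

def bridging_transitions(W_star, eps=1e-12):
    """Eq. (10): MT[i, j] = P(t_j | v_i), TM[j, i] = P(v_i | t_j).

    W_star: (n_images, n_tags) nonnegative relevance scores w*(i, j).
    """
    MT = W_star / (W_star.sum(axis=1, keepdims=True) + eps)      # normalize over tags per image
    TM = (W_star / (W_star.sum(axis=0, keepdims=True) + eps)).T  # normalize over images per tag
    return MT, TM
```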
4 Co-ranking framework

In this section, we present the co-ranking mechanism based on random walks on the heterogeneous network described in Section 3. An overview of the framework is illustrated in Fig. 3.
Fig. 3 The framework for co-ranking images and tags. GM and GT are image graph and tag graph, respectively. G∗ is the image-tag graph derived from images and their associated tags
4.1 Random walks on image and tag graphs

Random walks on the individual image and tag graphs can be used to compute the importance of image or tag nodes by propagating similarities over the graph structures. Through an iterative procedure, a numerical weight is assigned to each image (tag) node, measuring its relative importance to the other images (tags) in the collection for ranking. Consider the transition matrix Q = [q_{ij}] \in R^{|V_M| \times |V_M|} derived from the weighted image hypergraph G_M, and let r_k(i) denote the relevance score of image v_i with respect to the query image at the k-th iteration. The relevance scores of all images in the hypergraph at iteration k form a column vector r_k = [r_k(i)]_{n \times 1}. The element q_{ij} of Q indicates the probability of the transition from image v_i to image v_j, which can be formulated as

q_{ij} = \sum_{v_i, v_j \in e} w(e) \, A(v_i, v_j) \, \frac{h(v_i, e) h(v_j, e)}{d(v_i) d(v_j)},    (11)

where A(v_i, v_j) denotes the visual similarity of the two images v_i and v_j, which is further enhanced by the accumulation of their co-occurrence (h(v_i, e) h(v_j, e)) in the same hyperedges. Accordingly, the random walk process is formulated as

r_k(j) = \rho \sum_i r_{k-1}(i) \, q_{ij} + (1 - \rho) c_j,    (12)

where c_j is the initial relevance score of image v_j, and \rho is a damping factor in [0, 1]. The matrix Q is accordingly changed to \hat{Q} = (1 - \alpha) Q + \frac{\alpha}{|V_M|} e e^T, where e is the all-ones vector with |V_M| entries. The existence and uniqueness of the solution of (12) follows from the random walk \hat{Q} being ergodic [6]. The algorithm converges to its unique solution

r_\pi = (1 - \rho)(I - \rho \hat{Q})^{-1} c,    (13)

where c is a constant vector.
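A minimal sketch of the propagation (12) and its closed form (13) is given below, assuming a row-stochastic transition matrix and a one-hot query vector c; the helper names are ours. Because (12) sums over the first index, the iteration uses the transpose of the transition matrix, and the closed form is written with the same transpose so that the two agree.

```python
import numpy as np

def random_walk_ranking(Q_hat, c, rho=0.85, iters=200):
    """Iterate Eq. (12): r_k(j) = rho * sum_i r_{k-1}(i) q_ij + (1 - rho) * c_j."""
    r = c.copy()
    for _ in range(iters):
        r = rho * (Q_hat.T @ r) + (1 - rho) * c
    return r

def random_walk_closed_form(Q_hat, c, rho=0.85):
    """Closed form of Eq. (13); Q_hat.T is used so it matches the fixed point above."""
    n = Q_hat.shape[0]
    return (1 - rho) * np.linalg.solve(np.eye(n) - rho * Q_hat.T, c)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.random((5, 5))
    Q_hat = A / A.sum(axis=1, keepdims=True)   # toy row-stochastic transition matrix
    c = np.zeros(5); c[0] = 1.0                # query indicator
    r_iter = random_walk_ranking(Q_hat, c)
    r_closed = random_walk_closed_form(Q_hat, c)
    print(np.allclose(r_iter, r_closed))       # both recover the same stationary scores
```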
Theorem 1 The iteration of (12) converges to a fixed distribution r_\pi.

Proof Equation (12) can be rewritten as r_k = \rho \hat{Q} r_{k-1} + (1 - \rho) c, and then we have r_\pi = \lim_{k \to \infty} (\rho \hat{Q})^k r_0 + (1 - \rho) \big( \sum_{i=1}^{k} (\rho \hat{Q})^{i-1} \big) c. Here the transition matrix \hat{Q} is row-normalized. We can derive that

\sum_j (\rho \hat{Q})^k_{ij} = \sum_j \sum_y (\rho \hat{Q})^{k-1}_{iy} (\rho \hat{Q})_{yj} = \sum_y (\rho \hat{Q})^{k-1}_{iy} \big( \rho \sum_j \hat{Q}_{yj} \big) = \sum_y (\rho \hat{Q})^{k-1}_{iy} \, \rho \le \sum_y (\rho \hat{Q})^{k-1}_{iy} \, \eta \le \eta^k,

where 0 \le \rho \le 1, \eta \le 1, and \rho \le \eta. Then we have the unique solution r_\pi = (1 - \rho)(I - \rho \hat{Q})^{-1} c.

Similarly, given the transition matrix Z (with random walk matrix \hat{Z}) induced by the tag graph G_T, the random walks over G_T converge to another stationary probability distribution:

f_\pi = (1 - \rho)(I - \rho \hat{Z})^{-1} \bar{c},    (14)
where \bar{c} denotes the initial relevance scores of the tags. In addition, the element z_{ij} of the matrix Z encodes the probability of the transition from tag t_i to tag t_j, given by z_{ij} = s(t_i, t_j) / \sum_{t_k} s(t_i, t_k), where the t_k are the tags connected to t_i.

4.2 Combined random walks on a heterogeneous network

The combined random walk on the heterogeneous graph simulates a random surfer who is capable of jumping between images and their tags. By coupling the two random walks, we obtain a probability distribution of the form (r_\pi, f_\pi) [50], satisfying \|r_\pi\|_1 + \|f_\pi\|_1 = 1. Furthermore, we parameterize the coupling process by four parameters m, n, \theta, and \lambda, which are used in the following procedure:

(1) If the current state of the random surfer is an image, v \in V_{G_M}, it takes 2\theta + 1 steps on G* with probability \lambda, and m steps on G_M with probability 1 - \lambda.
(2) If the current state of the random surfer is a tag, v \in V_{G_T}, it takes 2\theta + 1 steps on G* with probability \lambda, and n steps on G_T with probability 1 - \lambda.
We summarize this process in Algorithm 1.
4.2.1 Analysis of convergence

Algorithm 1 must converge; otherwise, we cannot obtain a stationary distribution of the combined random walks. Suppose that we have two ranking vectors r_\tau and f_\tau at iteration \tau for images and tags, respectively. For the next iteration \tau + 1, we have
r_{\tau+1} = (1 - \lambda)(\hat{Q}^T)^m r_\tau + \lambda \, TM^T (MT^T \cdot TM^T)^k f_\tau;
f_{\tau+1} = (1 - \lambda)(\hat{Z}^T)^n f_\tau + \lambda \, MT^T (TM^T \cdot MT^T)^k r_\tau.

We define a vector combining the ranking vectors r and f: u = [r^T, f^T]^T. In particular, u_\tau = [r_\tau^T, f_\tau^T]^T consists of r_\tau and f_\tau after \tau iterations. Thus, we can construct a matrix U as follows:

U = \begin{bmatrix} (1 - \lambda)(\hat{Q}^T)^m & \lambda \, TM^T (MT^T \cdot TM^T)^k \\ \lambda \, MT^T (TM^T \cdot MT^T)^k & (1 - \lambda)(\hat{Z}^T)^n \end{bmatrix}.    (15)
Herein, U is a stochastic matrix that parameterizes the combined random walks in the form u_{\tau+1} = U u_\tau. Thus, for any initial vector u, the stationary probabilities can be obtained as \lim_{n \to +\infty} U^n u [25]. Accordingly, r and f converge to the ranking scores r_\pi and f_\pi.
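A sketch of the coupled iteration u_{τ+1} = U u_τ of (15) is shown below. It assumes row-stochastic intra-graph matrices Q̂ and Ẑ and the bridging matrices MT, TM from (10), assembles the block matrix, and then runs a normalized power iteration; the helpers are our stand-in for Algorithm 1, not the authors' exact procedure, and the choices of m, n, k, and λ are arbitrary.

```python
import numpy as np
from numpy.linalg import matrix_power

def combined_walk_matrix(Q_hat, Z_hat, MT, TM, lam=0.5, m=2, n=2, k=1):
    """Assemble the block matrix U of Eq. (15)."""
    coup_rf = lam * TM.T @ matrix_power(MT.T @ TM.T, k)   # contribution of tags to image scores
    coup_fr = lam * MT.T @ matrix_power(TM.T @ MT.T, k)   # contribution of images to tag scores
    top = np.hstack([(1 - lam) * matrix_power(Q_hat.T, m), coup_rf])
    bot = np.hstack([coup_fr, (1 - lam) * matrix_power(Z_hat.T, n)])
    return np.vstack([top, bot])

def co_rank(U, r0, f0, iters=200):
    """Power iteration u_{tau+1} = U u_tau, keeping ||r||_1 + ||f||_1 = 1."""
    u = np.concatenate([r0, f0])
    u = u / np.abs(u).sum()
    for _ in range(iters):
        u = U @ u
        u = u / np.abs(u).sum()
    return u[:len(r0)], u[len(r0):]
```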
5 Efficient co-ranking on images and tags using Bregman matrix divergence

We have established the formulation of co-ranking images and tags by performing combined random walks over a heterogeneous graph. The ranking performance can be improved remarkably by virtue of the rich information from the other side. However, another challenge remains unsolved: the high computational complexity on both the image and tag graphs. The computation time is O(n^3) if an out-of-sample query is issued, where n is the number of samples in the database. Therefore, it is highly desirable to make the algorithm efficient and scalable to large graphs. Since on-line ranking has to deal with out-of-sample queries, we extend the proposed framework to make it effective and efficient in their presence. To this end, we cast the random walk technique as a Bregman divergence optimization, through which we transform the original random walks into an equivalent optimal kernel matrix learning problem.

5.1 Preliminaries on Bregman divergence

Let \phi: \Omega \to R be a real-valued strictly convex function defined over a convex set \Omega. The Bregman divergence [7] with respect to \phi is defined as D_\phi(x, x_0) = \phi(x) - \phi(x_0) - (x - x_0)^T \nabla\phi(x_0). Intuitively, the Bregman divergence measures the closeness of two vectors. For example, if \phi(x) = x^T x, the corresponding Bregman divergence turns out to be the squared Euclidean distance D_\phi(x, x_0) = \|x - x_0\|^2. We can naturally extend this definition to convex functions defined over matrices [23]. In this case, given a strictly convex, continuously differentiable function \phi(X), the Bregman matrix divergence is defined as D_\phi(X, X_0) = \phi(X) - \phi(X_0) - tr((\nabla\phi(X_0))^T (X - X_0)), where tr(X_0) denotes the trace of the matrix X_0. Examples include \phi(X) = \|X\|_F^2, which leads to the squared Frobenius norm \|X - X_0\|_F^2. In this paper, we use the log-determinant function \phi(X) = -\log\det X, which can be expressed as the Burg entropy of the eigenvalues, i.e., \phi(X) = -\sum_i \log\lambda_i. The resulting matrix divergence becomes

D(X, X_0) = tr(X X_0^{-1}) - \log\det(X X_0^{-1}) - n,    (16)

which we call the Log-Determinant divergence [23].
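For reference, (16) can be evaluated directly; the small helper below (ours) uses NumPy's slogdet and assumes both arguments are symmetric positive definite.

```python
import numpy as np

def logdet_divergence(X, X0):
    """Eq. (16): D(X, X0) = tr(X X0^{-1}) - log det(X X0^{-1}) - n."""
    n = X.shape[0]
    A = X @ np.linalg.inv(X0)
    _, logdet = np.linalg.slogdet(A)
    return np.trace(A) - logdet - n

if __name__ == "__main__":
    I = np.eye(3)
    print(logdet_divergence(I, I))        # 0.0: a matrix has zero divergence from itself
    print(logdet_divergence(2 * I, I))    # positive for any other SPD argument
```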
5.2 Bregman divergence derived random walks

In the following, we derive the random walks from a Bregman divergence optimization framework. By virtue of this new derivation, extensions can be naturally obtained to combat the aforementioned challenges. Considering the convergent solution in (13), we can rewrite it as

r_\pi^* = (I - \rho \hat{Q})^{-1} c = K c.    (17)
We omit the scaling factor 1 - \rho as it does not influence the solution. Let T = [t_1, t_2, \ldots, t_n] \in R^{m \times n} be the data representation of the samples in a new tag feature space. In particular, we define t_i = \Phi(v_i), for i = 1, \ldots, n, where \Phi is a transformation function from the original image vector to the new tag feature space. Specifically, we introduce
a canonical representation [\tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_m] for t_i, which is composed of 0–1 indicators over the tag feature basis. Specifically, we have

t_i(j) = \begin{cases} 1, & \text{if } p(v_i \,|\, \tilde{t}_j) \ge \varepsilon, \\ 0, & \text{otherwise.} \end{cases}    (18)

Intuitively, we project each image into a tag feature space using a binary representation, which indicates to what extent the image is associated with the basis tags. We define the matrix K as

K = T^T T,    (19)

which is a positive semi-definite matrix. We then present our primary theorem on the new derivation of random walks as follows.

Theorem 2 The matrix K in the convergence formulation (17) is the solution of the following optimization problem:

\min_K D(K, I) \quad s.t. \quad \sum_{i,j} \Big\| \frac{1}{\sqrt{d(v_i)}} t_i - \frac{1}{\sqrt{d(v_j)}} t_j \Big\|^2 q_{ij} \le \xi, \quad K \succeq 0,

where \xi is a smoothness parameter constraining similar images to have close distances in the new space.

Proof The optimization problem seeks a K closest to the identity matrix, measured by the Log-Determinant divergence, with a regularization on normalized graph Laplacian smoothness, which can be written in matrix form [48]:

\sum_{i,j} \Big\| \frac{1}{\sqrt{d(v_i)}} t_i - \frac{1}{\sqrt{d(v_j)}} t_j \Big\|^2 q_{ij} = tr(T L T^T),

where L = I - \hat{Q} is the normalized graph Laplacian. Replacing the objective function with (16) and introducing the Lagrange multiplier, the minimization problem can be reformulated as follows:

\min_{K \succeq 0} D(K, I)
= \min_{K \succeq 0} tr(K I^{-1}) - \log\det(K) + \eta \, tr(T L T^T)
= \min_{K \succeq 0} tr(K I) + \eta \, tr(T^T T L) - \log\det(K)
= \min_{K \succeq 0} tr(K E) - \log\det(K),    (20)

where E = I + \eta L is a positive-definite matrix and \eta is the Lagrange multiplier. The optimal solution K^* of the above optimization problem is K^* = E^{-1} = (I + \eta L)^{-1} [17]. Recall that E = I + \eta L = (1 + \eta)\big(I - \frac{\eta}{1+\eta}\hat{Q}\big); then we have K^* = (I - \beta\hat{Q})^{-1} up to a constant scale, where \beta = \frac{\eta}{1+\eta}. We use K_M^* to denote this matrix for the image graph.

Similarly, we can derive another extension from the Bregman divergence perspective for the random walks over the tag graph, which is summarized in the following theorem. We define K_T as K_T = C^T C, where C = [c_1, \ldots, c_n] \in R^{m \times n} is the representation that transforms the tag samples to a new feature space. Likewise, we rewrite (14) as

f_\pi^* = (I - \rho \hat{Z})^{-1} \bar{c} = K_T \bar{c}.    (21)
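The identity behind Theorem 2 — that the LogDet-optimal kernel (I + ηL)^{-1} coincides, up to the constant 1/(1+η), with the random-walk kernel (I − βQ̂)^{-1} for β = η/(1+η) — can be checked numerically. The snippet below is such a sanity check on a random row-stochastic Q̂ (all names are ours); the constant scale does not affect the induced ranking.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((6, 6))
Q_hat = A / A.sum(axis=1, keepdims=True)       # toy row-stochastic transition matrix
I = np.eye(6)
eta = 4.0
beta = eta / (1 + eta)

L = I - Q_hat                                  # normalized graph Laplacian used in the text
K_opt = np.linalg.inv(I + eta * L)             # solution of the LogDet problem (Theorem 2)
K_walk = np.linalg.inv(I - beta * Q_hat)       # random-walk kernel with damping beta

# Identical up to the scale factor 1/(1 + eta), which does not change the ranking order.
print(np.allclose(K_opt, K_walk / (1 + eta)))  # True
```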
Theorem 3 The matrix K_T in the convergence formulation (21) is the solution of the following optimization problem:

\min_{K_T} D(K_T, I) \quad s.t. \quad \sum_{i,j} \Big\| \frac{1}{\sqrt{D_{ii}}} c_i - \frac{1}{\sqrt{D_{jj}}} c_j \Big\|^2 z_{ij} \le \xi, \quad K_T \succeq 0.
Likewise, we obtain the optimal matrix K_T^*, which is exactly the matrix K_T in (21).

5.3 Efficient co-ranking algorithm

From the perspective of Bregman divergence, the random walk essentially learns an optimal matrix K close to the identity matrix under certain constraints. However, we still need to invert the n × n matrix I + \eta L, which has complexity O(n^3). In the following, we derive an efficient extension of the co-ranking algorithm. Suppose the mapping t_i = \Phi(v_i) is linear, i.e., T = P^T V, where P is a k × m matrix. Then we have

tr(K I^{-1}) - \log\det(K) + \eta \, tr(T L T^T)
= tr(V^T P P^T V) - \log\det(V^T P P^T V) + \eta \, tr(P^T V L V^T P)
= tr(H V V^T) - \log\det(V V^T) - \log\det(H) + \eta \, tr(H V L V^T),    (22)

where H = P P^T \succeq 0. As a result, the optimization problem (20) becomes

\min_{H \succeq 0} tr\big(H (V V^T + \eta V L V^T)\big) - \log\det(H),    (23)
where H is a k × k matrix. To obtain the optimal H, we only need to invert a matrix of size k × k, which remains unchanged as the size of the database (n) grows. That is, if k ≪ n, the computational complexity of the matrix inversion is reduced dramatically. Thereby, we optimize the small matrix H to estimate the optimal matrix K_T^* or K_M^*, and the complexity is reduced from O(n^3) to O(m^3) + O(n^2). In fact, the computation of the optimal matrix H learns a distance metric:

d_H^2(v_i, v_j) = \|P^T v_i - P^T v_j\|^2 = (v_i - v_j)^T H (v_i - v_j).    (24)

With the learned distance metric H, K_M^* can be computed by K_M^* = T^T T = V^T H V. Hence, each element of K_M^* can be calculated as

K_M^*[ij] = \exp\big(-d_H^2(v_i, v_j) / 2\sigma^2\big).    (25)
Therefore, similar to (17), the ranking on the image graph can use K_M^* to compute the ranking scores:

r_\pi^* = K_M^* c.    (26)

For simplicity, we summarize the efficient extension for image ranking in Algorithm 2; the efficient version of tag ranking can be derived in a similar manner.

5.4 Out-of-sample extension

Given a query outside the database, e.g., a new image query, we only need to compute a new column of the matrix K_M^*, which avoids updating the entire matrix.
Algorithm 2 Efficient extension of the co-ranking algorithm.
Input: Q.
1. \hat{Q} = (1 - \alpha) Q + \frac{\alpha}{|V_M|} e e^T
2. Compute K_M = (I - \rho \hat{Q})^{-1}
3. Derive K_M to be K_M^* = (I - \beta \hat{Q})^{-1}, where \beta = \frac{\eta}{1+\eta}
4. Compute the ranking score: r_\pi^* = K_M^* c, with K_M^*[ij] = \exp(-d_H^2(v_i, v_j)/2\sigma^2)
Assume that \hat{V} = [V, v_u] \in R^{k \times (n+1)} is the new data matrix containing the new sample v_u; then we have the new optimal matrix \hat{K}_M^* \in R^{(n+1) \times (n+1)}:

\hat{K}_M^* = \begin{bmatrix} K_M^* & k_{nu} \\ k_{nu}^T & 1 \end{bmatrix}.    (27)

Compared with traditional ranking algorithms based on random walks, which need O(n^3) to propagate the ranking scores, our approach updates the matrix K_M^* with complexity O(n).
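Putting Sections 5.3 and 5.4 together, one possible sketch of the efficient extension is given below: the k×k metric H is obtained in closed form from (23), K*_M is built element-wise as in (25), rankings follow (26), and a single column/row is appended for an out-of-sample query as in (27). The data layout (V as a k×n matrix of reduced features), the toy Laplacian, and all helper names are our assumptions, not the authors' exact implementation.

```python
import numpy as np

def learn_metric(V, L, eta=1.0):
    """Eq. (23): minimize tr(H (V V^T + eta V L V^T)) - log det H  =>  H = (V V^T + eta V L V^T)^{-1}."""
    E = V @ V.T + eta * (V @ L @ V.T)          # k x k, cheap to invert when k << n
    return np.linalg.inv(E)

def pairwise_metric_sq(V, H):
    """Eq. (24): d_H^2(v_i, v_j) = (v_i - v_j)^T H (v_i - v_j) for all pairs."""
    R = np.linalg.cholesky((H + H.T) / 2)      # H = R R^T, so d_H is Euclidean after projecting by R^T
    P = R.T @ V
    return ((P[:, :, None] - P[:, None, :]) ** 2).sum(axis=0)

def metric_kernel(V, H, sigma=1.0):
    """Eq. (25): K*_M[i, j] = exp(-d_H^2(v_i, v_j) / (2 sigma^2))."""
    return np.exp(-pairwise_metric_sq(V, H) / (2 * sigma ** 2))

def out_of_sample_kernel(K_M, V, v_new, H, sigma=1.0):
    """Eq. (27): append one row/column for a query outside the database (O(n) work)."""
    diff = V - v_new[:, None]
    d2 = np.sum(diff * (H @ diff), axis=0)
    k_nu = np.exp(-d2 / (2 * sigma ** 2))
    return np.block([[K_M, k_nu[:, None]], [k_nu[None, :], np.ones((1, 1))]])

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    k, n = 5, 40
    V = rng.normal(size=(k, n))                        # reduced image representations (assumed layout)
    W = np.exp(-((V[:, :, None] - V[:, None, :]) ** 2).sum(axis=0))
    D = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L = np.eye(n) - D @ W @ D                          # symmetric normalized graph Laplacian
    H = learn_metric(V, L, eta=2.0)
    K_M = metric_kernel(V, H)
    c = np.zeros(n); c[0] = 1.0
    r = K_M @ c                                        # Eq. (26): scores for an in-sample query
    K_hat = out_of_sample_kernel(K_M, V, rng.normal(size=k), H)
    print(r[:5].round(3), K_hat.shape)
```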
6 Experiments

In this section, we experimentally evaluate the co-ranking algorithm on the tasks of both image ranking and tag ranking over real-world databases. We then conduct experiments to show the effectiveness and efficiency of our efficient co-ranking variant derived from the Bregman divergence formulation.

6.1 Experimental settings

Databases All experiments are conducted on three real-world benchmark datasets: LabelMe [33], the Pascal-Yahoo! image corpus [12, 47], and the dataset collected from Flickr [27].

– LabelMe. It is a large collection of annotated and unlabeled images. In our experiments, we use the training set of this database, which contains more than 1,000 fully annotated images and 2,000 partially annotated images.
– Pascal-Yahoo!. This dataset contains 15,339 images collected from Pascal VOC 2008 [11] (12,695 images) and the Yahoo! image search engine (2,644 images). The images come from 32 object categories (20 in Pascal and 12 in Yahoo!) and all of them were annotated with 64 pre-defined tags, such as "round", "wool", and "wheel". We randomly select 10 images from each category as experimental queries, giving 320 queries in total.
– Flickr dataset. This dataset comprises 50,000 images crawled from Flickr and collected by Liu et al. [27], who selected the ten most popular tags, including "cat", "automobile", "mountain", "water", "sea", "bird", "tree", "sunset", "flower", and "sky".
Baselines In our experiments, we implemented the following approaches:

– Pagerank: Pagerank is regarded as a baseline yet effective approach for image ranking [21].
– EMR [45]: Xu et al. proposed an efficient extension of conventional Manifold Ranking (MR) that preserves its effectiveness in image ranking.
– Tag ranking [27]: The tag ranking scheme aims to rank the tags associated with a given image according to their relevance scores to the image content.
– Learning to tag [44]: Learning to tag is a tag recommendation method that learns an optimal combination of ranking features from multiple modalities based on tag and visual correlations.
– CLED [31]: CLED is a cluster-based scheme for automatically detecting landmarks and events in tagged photo collections.
– MMIR [32]: A multimodal image ranking method proposed to boost image ranking performance via the two modalities of visual features and tags.
– HpLSA [30]: High-order pLSA is an efficient indexing technique for tagged images.
– Joint-learning [14]: A visual-text joint hypergraph learning approach that simultaneously explores the two information sources.
– MTS-MLMIL [35]: MTS-MLMIL is a multi-task structured SVM algorithm which aims to leverage inter-object relations and tagged images to improve classification accuracy.
Image features We use several types of features, each of which is encoded by a bag-of-words style representation.

– SURF [3] and PHOG [4].
– Color: Color descriptors are densely extracted from each pixel as the 3-channel LAB values, which are further quantized into a 128-bin histogram.
– Texture: Each image is first scaled to 64×64 pixels. The Gabor wavelet transform [24] is then applied to the scaled image with 5 levels and 8 orientations, resulting in 40 subimages. For each subimage, 3 moments are calculated: mean, variance, and skewness. Thus, we have a 120-dimensional vector.
– Edge: The Canny edge detector [8] is used to obtain the edge map for the edge orientation histogram, which is quantized into 36 bins of 10 degrees each. An additional bin is used to count the number of pixels without edge information, producing a 37-dimensional vector.
– SIFT descriptor: SIFT descriptors [28] are densely extracted from the 8×8 neighboring block of each pixel with a 4-pixel step size. The descriptors are quantized into a 1,000-dimensional bag-of-words feature.
Evaluation metrics To evaluate the performance of the co-ranking method as well as the baseline algorithms, we adopt three evaluation metrics to measure ranking quality: Precision at top K (P@K), NDCG@K, and Mean Average Precision (MAP) [29]. We present the definitions of NDCG@K and MAP as follows. For a query image, each of its tags is labeled with one of five levels: most relevant (score 5), relevant (score 4), partially relevant (score 3), marginally relevant (score 2), and irrelevant (score 1). Assume that an image is associated with a tag ranking list L = {t_1, \ldots, t_K}; the NDCG value of the list is

NDCG@K = Z_n \sum_{j=1}^{K} \frac{2^{r(j)} - 1}{\log_2(j + 1)},    (28)

where r(j) is the relevance score of the j-th returned item and Z_n is a normalization constant.
Fig. 4 Top image returns and the corresponding ranked tags for various image queries. Each row of the figure lists the ranked tags of one query, e.g., "aeroplane, airfield, aerospace, aviation, flying, aircraft, jet, transport, sky, falcon, SonyS70" and "sunset, sundown, sky, cloud, orange, fiery, water, lake, night, boat, nikon"
For a single query, Average Precision is obtained over the set of top K items after each relevant item is retrieved, and this value is then averaged over all queries. If the set of relevant items for a query b_j ∈ B is {d_1, \ldots, d_{m_j}} and R_{jk} is the set of ranked retrieval results from the top result down to item d_k, then we have

MAP(B) = \frac{1}{|B|} \sum_{j=1}^{|B|} \frac{1}{m_j} \sum_{k=1}^{m_j} Precision(R_{jk}).    (29)
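For completeness, (28) and (29) translate directly into code; the sketch below assumes graded relevance scores for NDCG and binary relevance lists for MAP, and normalizes NDCG by the ideal ordering (one common choice of Z_n). The helper names are ours.

```python
import numpy as np

def ndcg_at_k(relevance, k):
    """Eq. (28), with Z_n chosen so that the ideal ordering scores 1."""
    rel = np.asarray(relevance, dtype=float)[:k]
    gains = (2 ** rel - 1) / np.log2(np.arange(2, rel.size + 2))
    ideal = np.sort(np.asarray(relevance, dtype=float))[::-1][:k]
    ideal_gains = (2 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2))
    return gains.sum() / ideal_gains.sum() if ideal_gains.sum() > 0 else 0.0

def mean_average_precision(result_lists):
    """Eq. (29): mean of per-query Average Precision; each list holds 0/1 relevance flags."""
    aps = []
    for rels in result_lists:
        rels = np.asarray(rels, dtype=float)
        precisions = np.cumsum(rels) / np.arange(1, rels.size + 1)
        aps.append((precisions * rels).sum() / max(rels.sum(), 1))
    return float(np.mean(aps))

if __name__ == "__main__":
    print(ndcg_at_k([5, 3, 4, 1, 2], k=5))                    # graded tag relevance of one image
    print(mean_average_precision([[1, 0, 1, 1], [0, 1, 0, 0]]))
```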
6.2 Co-ranking results: a case study

To evaluate the performance of co-ranking on images and tags, we randomly select five images from the Flickr collection [27] as the query set, as illustrated in the first column of Fig. 4. For each image query, the top 10 returns are retrieved according to their relevance scores to the query. At the same time, the orderless tags attached to the query image are ordered by considering their intrinsic semantic similarity as well as the reinforcing influence of the image graph. As illustrated in Fig. 4, each row shows the original query followed by the top 10 most relevant items, and the ranked list of tags is presented alongside.

6.3 Image ranking

As Pagerank and EMR are two popular image ranking methods, we first compare our algorithm against them in terms of MAP. The MAP values are reported in Fig. 5, in which we present three groups of bars corresponding to 15 popular topics collected from the three benchmarks.
Fig. 5 Comparisons on image ranking (MAP of Pagerank, EMR, and Co-ranking per topic) with respect to three benchmarks. From left to right: LabelMe, Pascal-Yahoo!, and Flickr dataset. The topics include bird, cat, motorbike, mountain, water, flower, tree, person, sofa, sky, bike, aeroplane, boat, horse, and building
Fig. 6 Comparisons on image ranking w.r.t. state-of-the-art baselines over Flickr dataset
This evaluation shows that the co-ranking method outperforms the other two algorithms, achieving, for example, average improvements of 33.11 % and 20.82 % over Pagerank and EMR, respectively, on the LabelMe database. Both Pagerank and EMR perform the ranking procedure over the image graph alone, overlooking the mutual reinforcement between images and tags. We can conclude that, by fully leveraging the tag ranking scores provided by random walks on the tag graph, our co-ranking approach shows superiority over the baseline methods, which are limited to the simple image graph and thus fail to utilize the additional tag ranking information. Moreover, we report results, as shown in Fig. 6, comparing the co-ranking algorithm against another three state-of-the-art methods that jointly leverage information from visual features and tags. From Fig. 6, we can see that there is no consistent winner among CLED, MMIR, and HpLSA in terms of MAP values. For some concepts, e.g., "street", "chairs", and "mountains", CLED performs better than MMIR and HpLSA, partially because these concepts can be easily detected and clustered in the user-contributed Flickr dataset. However, benefiting from the co-ranking technique, our method consistently outperforms these baselines. Meanwhile, in Table 1 we show their time efficiency over the three real-world datasets by averaging the running time over all test samples. From Table 1, it can be seen that the Co-rank method runs faster than the other methods because of its Bregman divergence extension. HpLSA is second only to Co-rank due to its effective indexing of tagged photos. Owing to its hierarchical K-means algorithm, MMIR is the slowest method, while CLED is more efficient on account of its community detection [31], which is more efficient than K-means.
Table 1 Comparisons on time efficiency (in sec) over three benchmarks

                 CLED    MMIR    HpLSA   Co-rank
LabelMe          13.55   22.48   7.88    4.64
Pascal-Yahoo!    12.88   22.03   7.07    4.53
Flickr           14.79   24.33   8.24    4.82
Fig. 7 Performance of different tag ranking strategies over three benchmark databases. Baseline algorithms include: tag ranking [27], learning to tag [44], Joint-learning [14], and MTS-MLMIL [35]
6.4 Tag ranking

For a particular database, e.g., Pascal-Yahoo!, given the NDCG values of each image's tag list, we average them to obtain the overall performance of a tag ranking algorithm on that data set. The experimental results in terms of NDCG are shown in Fig. 7. The following conclusions can be drawn.

– The existing tag ranking methods that do not consider metadata from visual features, namely "Tag-rank" and "Learn-tag", fall short because only the tag graph is considered and the ranking scores of images are not used; they are therefore not comparable with our co-ranking method.
– Comparing the results of Joint-learn, MTS-MLMIL, and Co-rank, we can conclude that methods that jointly consider the complementary visual information can indeed boost ranking performance greatly. However, Joint-learn and MTS-MLMIL are inferior to our algorithm because they both squeeze tag information into their main model (either an image hypergraph or an SVM classifier) rather than explicitly exploiting a distinct tag graph.
– Our co-ranking algorithm outperforms the baselines by leveraging the ranking information from the image graph, which is incorporated in the Mahalanobis matrix M in (7); this is the main difference from [27], which weights all neighbors equally.
6.5 Parameters learning
Considering that the probability value λ balances the random walks over the heterogeneous network, we report the learning process of the parameter λ for values ranging from 0.1 to 0.9. In Fig. 8, the MAP metric is used to evaluate the performance of co-ranking with varying λ on the three benchmarks.
Fig. 8 Learning of parameter λ ranging from 0.1 to 0.9 on three benchmark databases. From left to right: LabelMe, Pascal-Yahoo!, and Flickr dataset
Fig. 9 Effect of [m, n] on the number of iterations (left) and CPU running time (right)
We can observe that the selection of λ is critical to the co-ranking algorithm, and that parameter learning is necessary for different image databases.

6.6 Running time evaluations

We first show the effect of m and n on the number of iterations as well as the CPU running time before convergence. The results are reported in Fig. 9, from which we observe that as m and n increase, the number of iterations and the CPU running time (sec) decrease slowly. This is because the random walks on the individual graphs have sufficient steps to become locally stationary before taking the next step on the combined graph. We then evaluate the efficient extension of co-ranking presented in Section 5, which is based on the Bregman divergence principle. To this end, apart from our co-ranking approach (Co-rank), we also implemented the following algorithms:

1. PCA: PCA is the most popular linear dimensionality reduction method [36]. We set the reduced dimensionality to 40 in our experiments. PCA is performed in advance to obtain a low-dimensional representation, and the Euclidean metric is then used to compute similarities between images.
2. DCoR: our efficient extension of Co-rank obtained by using the Bregman divergence principle.
Table 2 Results on the LabelMe and Pascal-Yahoo! data sets in terms of P@K, NDCG@K and MAP

            LabelMe                         Pascal-Yahoo!
Metric      PCA      DCoR     Co-rank      PCA      DCoR     Co-rank
P@10        40.57 %  42.99 %  44.77 %      34.54 %  38.94 %  38.65 %
P@20        34.46 %  37.55 %  38.35 %      29.83 %  34.87 %  35.79 %
P@30        31.83 %  34.67 %  34.05 %      29.83 %  32.88 %  33.73 %
NDCG@10     42.49 %  45.09 %  47.51 %      35.43 %  36.31 %  40.37 %
NDCG@20     38.12 %  41.02 %  41.98 %      33.15 %  36.07 %  37.67 %
NDCG@30     34.47 %  38.20 %  38.27 %      31.81 %  32.84 %  36.05 %
MAP         32.38 %  36.71 %  35.47 %      31.07 %  35.22 %  36.56 %
Fig. 10 Time complexity. We report the running time of the three algorithms as the number of samples ranges from 1k to 6k. This experiment is performed on the Pascal-Yahoo! benchmark
Ranking evaluations As shown in Table 2, for the LabelMe database, where the image features are discriminative, the PCA baseline performs reasonably well by simply using the Euclidean distance. DCoR, the extension of Co-rank, loses only slightly to the best performance, shown by Co-rank. For the Pascal-Yahoo! database, PCA does not perform well, partially due to the large number of categories and the complex content of the images. Again, Co-rank consistently shows superior results because of the reinforcing relation implemented by its heterogeneous graph. Unsurprisingly, compared to Co-rank, our extended method DCoR sacrifices only a little accuracy by virtue of distance metric learning.

Time complexity We show the running time in Fig. 10. It can be seen that Co-rank incurs a high computational cost and is hard to apply to large-scale databases. Although PCA reduces the running time to some degree, its cost is still not low enough to make it applicable to large data sets. In contrast, our extended approach DCoR remains efficient as the size of the database increases.

6.7 Out-of-sample case

In many real applications, queries are often issued from outside the database, which forces the ranking to be performed in an on-line stage. To tackle the out-of-sample problem, we derived an efficient extension of our co-ranking algorithm from the perspective of Bregman divergence, which merely needs to compute a new column and row of the data matrix. We conducted experiments in terms of MAP as well as time efficiency over the three benchmarks by randomly issuing 10 queries from outside each benchmark.
Fig. 11 Evaluation on the out-of-sample case. We report the MAP values of the three algorithms by ranging top K from 10 to 50. From left to right: LabelMe, Pascal-Yahoo!, and Flickr database
Fig. 12 Time complexity on the out-of-sample case. We report the running time of the three algorithms by ranging top K from 10 to 50. From left to right: LabelMe, Pascal-Yahoo!, and Flickr database
As shown in Fig. 11, our DCoR outperforms the other baselines because the extension, being based on co-ranking, also leverages ranking information from the tag graph. As for time complexity, Fig. 12 shows that our DCoR runs more efficiently than the others. In contrast, standard Pagerank is the least efficient, as it needs to update the whole data matrix by computing affinity values between each sample and the new query. EMR is faster than Pagerank but slower than DCoR; it takes advantage of anchor graphs, which are efficient for graph construction and ranking score computation.
7 Conclusions

In this paper, we propose a novel co-ranking method that couples two random walks by exploiting the mutually reinforcing relationship between images and tags. The ranking information on the individual image and tag graphs is complementary, and hence the leveraging mechanism incorporated in our algorithm achieves more satisfactory performance than current algorithms for image ranking and tag ranking. To further improve the performance, we present a new perspective on the random walks used in our approach, which learns an optimal kernel matrix under the Bregman matrix divergence metric. This novel formulation allows us to derive an efficient extension of our co-ranking algorithm as well as an effective solution for the out-of-sample case. Extensive experiments on various image databases validate the effectiveness and efficiency of our method.

Acknowledgment This research was supported in part by round 4 compact funding A541-2003-xxx25242, Charles Sturt University, Australia.
References 1. Agarwal S, Lim J, Zelnik-manor L, Perona P, Kriegman D, Belongie S (2005) Beyond pairwise clustering. In: CVPR 2. Ames M, Naaman M (2007) Why we tag: motivations for annotation in mobile and online media. In: SIG CHI 3. Bay H, Tuytelaars T, Gool LV (2006) Surf: speeded up robust features. In: ECCV 4. Bay H, Tuytelaars T, Gool LV (2007) Presenting shape with a spatial pyramid kernels. In: International conference on image and video retrieval 5. Bengio Y, Paiement J, Vincent P, Delalleau O, Roux NL, Ouimet M (2004) Out-of-sample extensions for lle, isomap, mds, eigenmaps and spectral clustering. In: NIPS, pp 177–184 6. Berkhin P (2005) A survey on pagerank computing. Internet Math 2(1):73–120
7. Bregman L (1967) The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput Math Math Phys 7(3):200–217 8. Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 6:679–698 9. Chen H, Chang MH, Chang P, Tien M, Hsu W, Wu J (2008) Sheepdog-group and tag recommendation for flickr photos by automatic search-based learning. In: ACM multimedia 10. Cilibrasi R, Vitányi PMB (2007) The google similarity distance. IEEE Trans Knowl Data Eng 19(3):370–383 11. Everingham M, Gool LV, Williams CKI, Winn J, Zisserman A (2008) The PASCAL visual object classes challenge (VOC2008) Results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html 12. Farhadi A, Endres I, Hoiem D, Forsyth D (2009) Describing objects by their attributes. In: CVPR 13. Gao W, Cai P, Wong K, Zhou A (2010) Learning to rank only using training data from related domain. In: ACM SIGIR 14. Gao Y, Wang M, Luan H-B, Shen J, Yan S, Tao D (2011) Tag-based social image search with visual-text joint hypergraph learning. In: ACM multimedia 15. Guan Z, Bu J, Mei Q, Chen C, Wang C (2009) Personalized tag recommendation using graph-based ranking on multi-type interrelated objects. In: SIGIR 16. He J, Li M, Zhang H-J, Tong H, Zhang C (2006) Generalized manifold-ranking-based image retrieval. IEEE Trans Image Process 15(10):3170–3177 17. Hoi SCH, Liu W, Chang S-F (2010) Semi-supervised distance learning for collaborative image retrieval and clustering. ACM Trans Multimed Comput Commun Appl 6(3) 18. Huang Y, Liu Q, Lv F, Gong Y, Metaxas DN (2011) Unsupervised image categorization by hypergraph partition. TPAMI 33(8):1266–1273 19. Huang Y, Liu Q, Zhang S, Metaxas DN (2010) Image retrieval via probabilistic hypergraph ranking. In: CVPR 20. Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: ACM SIGIR 21. Jing Y, Baluja S (2008) Pagerank for product image search. In: WWW 22. Kleinberg J (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632 23. Kulis B, Sustik M, Dhillon I (2006) Learning low-rank kernel matrices. In: ICML 24. Lades M, Vorbruggen J, Buhmann J, Lange J, von der Malsburg C, Wurtz R, Konen W (1993) Distortion invariant object recognition in the dynamic link architecture. IEEE Trans Comput 42(3):300–311 25. Langville AN, Meyer CD (2005) Deep inside PageRank. Internet Math 1(3):335–380 26. Li J, Wang JZ (2008) Real-time computerized annotation of pictures. IEEE Trans Pattern Anal Mach Intell 30(6):985–1002 27. Liu D, Hua X-S, Yang L, Wang M, Zhang H-J (2009) Tag ranking. In: WWW, pp 351–360 28. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vision 60(2):91–110 29. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge 30. Nikolopoulos S, Zafeiriou S, Patras I, Kompatsiaris I (2013) High-order plsa for indexing tagged images. Sig Process Elsevier, Spec Issue Index Large-Scale Multimed Signals 93(8):2212–2228 31. Papadopoulos S, Zigkolis C, Kompatsiaris Y, Vakali A (2011) Cluster-based landmark and event detection on tagged photo collections. IEEE Multimed Mag 18(1):52–63 32. Richter F, Romberg S, Hörster E, Lienhart R (2012) Leveraging community metadata for multimodal image ranking. Multimed Tools Appl 56(1):35–62 33.
Russell BC, Torralba A, Murphy KP, Freeman WT (2008) Labelme: a database and web-based tool for image annotation. Int J Comput Vis 77(1–3):157–173 34. Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620 35. Shen Y, Fan J (2010) Leveraging loosely-tagged images and inter-object correlations for tag recommendation. In: ACM multimedia 36. Shlens J. (2005) A tutorial on principal component analysis. Technical report, Measurement 37. Sigurbjornsson B, Zwol RV (2008) Flickr tag recommendation based on collective knowledge. In: WWW 38. van Gemert JC, Geusebroek J-M, Veenman CJ, Smeulders AW (2008) Kernel codebooks for scene categorization. In: ECCV 39. Wan X, Yang J, Xiao J (2007) Manifold-ranking based topic-focused multi-document summarization. In: IJCAI
40. Wang Y, Cheema MA, Lin X, Zhang Q (2013) Multi-manifold ranking: using multiple features for better image retrieval. In: PAKDD 2:449–460 41. Wang Y, Lin X, Zhang Q (2013) Towards metric fusion on multi-view data: a cross-view based graph random walk. In: CIKM, pp 805–810 42. Wu L, Wang Y, Shepherd J (2013) Co-ranking images and tags via random walks on a heterogeneous graph. In: Multimedia modeling 43. Wu L, Wang Y, Shepherd J (2013) Efficient image and tag co-ranking: a bregman divergence optimization method. In: ACM multimedia 44. Wu L, Yang L, Yu N, Hua X (2009) Learning to tag. In: WWW, pp 361–370 45. Xu B, Bu J, Chen C, Cai D, He X, Liu W, Luo J (2011) Efficient manifold ranking for image retrieval. In: SIGIR, pp 525–534 46. Yuan X, Hua X, Wang M, Wu X (2006) Manifold-ranking based video concept detection on large database and feature pool. In: ACM multimedia 47. Zhang H, Zhan Z-J, Yan S, Bian J, Chua T-S (2012) Attribute feedback. In: ACM multimedia 48. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2003) Learning with local and global consistency. In: NIPS 49. Zhou D, Huang J, Schölkopf B (2006) Learning with hypergraphs: clustering, classification, and embedding. In: NIPS 50. Zhou D, Orshanskiy SA, Zha H, Giles CL (2007) Co-ranking authors and documents in a heterogeneous network. In: ICDM, pp 739–744
Lin Wu received her BE degree from Dalian University of Technology, China, in 2008. Her research interests include pattern recognition, computer vision, and multimedia. She is currently a PhD candidate at the University of New South Wales, Sydney, Australia.
Dr. Xiaodi Huang is a senior lecturer in the School of Computing and Mathematics at Charles Sturt University, Australia. He received his PhD degree in 2004. His research areas include visual information analysis. He has published over 60 scholarly papers in international journals and conferences. Dr. Huang is a regular reviewer for several international journals, and serves on the committees of international conferences. He is a member of the ACM and the IEEE Computer Society. For details, please visit his homepage at http://csusap.csu.edu.au/∼xhuang/.
Chengyuan Zhang received his BE degree from Sun Yat-sen University, China, in 2008. He is currently pursuing his PhD at the University of New South Wales, Australia. His research interests include spatial databases, stream data, and multimedia systems.
John Shepherd received the Ph.D. degree in 1990 from the University of Melbourne, Melbourne, Australia. He is a Senior Lecturer in the School of Computer Science and Engineering, University of New South Wales, Sydney, Australia. His main research interests are query processing for both relational and nonrelational (e.g., multimedia) databases, information organization/retrieval, and applications of information technology to teaching and learning. Dr. Shepherd has served on the Program Committees of conferences such as VLDB, WISE, and DASFAA.
Yang Wang is currently a PhD student in the School of Computer Science and Engineering, The University of New South Wales, Sydney, Australia. Prior to that, he obtained his bachelor degree from Dalian University of Technology, China, in 2009. He was a master student at Tianjin University, China, from 2010 to 2011. His research interests cover data mining, machine learning, and graph databases. He is an IEEE student member.