Annotation Propagation in Image Databases Using Similarity Graphs

MICHAEL E. HOULE, National Institute of Informatics
VINCENT ORIA, New Jersey Institute of Technology
SHIN’ICHI SATOH, National Institute of Informatics
JICHAO SUN, New Jersey Institute of Technology

The practicality of large-scale image indexing and querying methods depends crucially upon the availability of semantic information. The manual tagging of images with semantic information is in general very labor intensive, and existing methods for automated image annotation may not always yield accurate results. The aim of this paper is to reduce to a minimum the amount of human intervention required in the semantic annotation of images, while preserving a high degree of accuracy. Ideally, only one copy of each object of interest would be labeled manually, and the labels would then be propagated automatically to all other occurrences of the objects in the database. To this end, we propose an influence propagation strategy, SW-KProp, that requires no human intervention beyond the initial labeling of a subset of the images. SW-KProp distributes semantic information within a similarity graph defined on all images in the database: each image iteratively transmits its current label information to its neighbors, and then readjusts its own label according to the combined influences of its neighbors. SW-KProp influence propagation can be efficiently performed by means of matrix computations, provided that pairwise similarities of images are available. We also propose a variant of SW-KProp which enhances the quality of the similarity graph by selecting a reduced feature set for each prelabeled image and rebuilding its neighborhood. The performances of the SW-KProp method and its variant were evaluated against several competing methods on classification tasks for three image datasets: a handwritten digit dataset, a face dataset and a web image dataset. For the digit images, SW-KProp and its variant performed consistently better than the other methods tested. For the face and web images, SW-KProp outperformed its competitors for the case when the number of prelabeled images was relatively small. 
The performance was seen to improve significantly when the feature selection strategy was applied.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—Multimedia databases

General Terms: Algorithms

Additional Key Words and Phrases: Classification, feature selection, image annotation, iterative method, linear system, neighborhood

ACM Reference Format:
Houle, M. E., Oria, V., Satoh, S., and Sun, J. 2013. Annotation propagation in image databases using similarity graphs. ACM Trans. Multimedia Comput. Commun. Appl. 10, 1, Article 7 (December 2013), 21 pages.
DOI: http://dx.doi.org/10.1145/2487736

M. E. Houle gratefully acknowledges the financial support of JSPS Grant 24500135.

Authors’ addresses: M. E. Houle and S. Satoh, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan; email: {meh,satoh}@nii.ac.jp; V. Oria and J. Sun, New Jersey Institute of Technology, University Heights, Newark, NJ 07102; email: {oria, js87}@njit.edu.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].

© 2013 ACM 1551-6857/2013/12-ART7 $15.00
DOI: http://dx.doi.org/10.1145/2487736

ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 10, No. 1, Article 7, Publication date: December 2013.


1. INTRODUCTION

Practical methods for the indexing and querying of large-scale image databases often require that the images be annotated with semantic information beforehand. Unfortunately, it is generally difficult to obtain large numbers of annotated images, due to the high costs associated with manual annotation. In order to resolve this problem, the topic of automated (or semi-automated) image annotation has received much attention from researchers in recent years.

In Ono et al. [1996], image recognition techniques have been applied to the annotation problem. However, due to the difficulties associated with general object recognition, the practicality of such techniques is severely limited. More recently, statistical learning strategies have become widely used in automated image annotation, such as in Barnard et al. [2003], Duygulu et al. [2002], Hardoon et al. [2006], and Jeon et al. [2003]. Given a set of training images with manually labeled keywords that describe the visual content of the images, a statistical model is learned via the analysis of the correlation between visual features and keywords. The keywords can then be ranked according to their probability of occurrence given an image or a region thereof. Such methods, however, usually assume that there exists a strong “one-to-one” relationship between a keyword and a set of visual features, which is often not the case. Another drawback is the high computational cost of the learning process, especially when the number of keywords is very large.

Classification techniques are also popular among researchers and practitioners. Examples can be found in Chang et al. [2003] and Cusano et al. [2003], where the annotation process is simply viewed as the assignment of images (or regions) to predefined classes, with each class corresponding to an individual keyword. In some cases, graph-based semi-supervised learning (GSSL) methods can be applied [Hu and Qian 2009; Liu et al. 2006, 2012; Tang et al. 2011].
Both labeled and unlabeled images are treated as nodes in an undirected graph, with edge weights that depend on the similarity between the images corresponding to the two incident nodes. Popular GSSL methods predict the labels of unlabeled nodes using minimum graph cut techniques [Blum and Chawla 2001], by minimizing a cost function defined over the graph, as in the local and global consistency technique [Zhou et al. 2003a], or by using Gaussian fields and harmonic functions (GFHF) [Zhu et al. 2003]. Another approach, deriving from content-based image retrieval (CBIR) [Li et al. 2006; Makadia et al. 2008], relies solely on the visual similarities between two images (or two regions) to decide whether they should share a label.

Both CBIR and learning-based methods suffer from poor accuracy when the number of annotated images used for training is too low. To achieve higher accuracy, user feedback is sometimes employed as an intermediate step in image annotation, for example in Liu et al. [2001]. However, the cost and efficiency of the annotation process are severely compromised by the use of human intervention.

In this article, we consider the problem of accurately labeling as many instances of images as possible, given a very small number of annotated images as a training set. The visual vocabulary consists of the set of annotations associated with the training set. In Houle et al. [2011], we proposed an image-labeling strategy, KProp, that propagates semantic information within a similarity graph having images as nodes. Each node iteratively transmits its current label information to its neighbors, and then readjusts its own labeling status according to the combined label scores of its neighbors. KProp adopts a straightforward averaging scheme: once the neighbors of a node have been decided, they are treated uniformly. In this article, we propose SW-KProp, an influence propagation approach based on KProp.
Instead of weighting all the edges in the graph equally, SW-KProp weights an edge using the similarity value of the two incident nodes, which can be computed from a linear transformation of their distance value. In addition, edges in the new model are classified into “strong” and “weak” edges according to the influence types of the connected nodes, and are treated differently. To enhance the quality of the similarity graph, we also propose SW-KProp+, a variant of SW-KProp that selects a reduced feature set for each prelabeled image and recomputes its neighborhood according to the new feature vector.

Fig. 1. Applying SW-KProp for the classification of a face image set.

Figure 1 illustrates the results of an SW-KProp classification of a small face image set. Initially, faces 1 and 2 are labeled as A and B, respectively. Scores measuring the degree of association between labels and faces are propagated from labeled faces 1 and 2 to unlabeled faces 3 to 6. Edges in the directed graph indicate the directions of the influences. The thin and bold arrows represent “strong” and “weak” edges, respectively. The scores obtained after the convergence of SW-KProp are given in braces beside each face, with the first value corresponding to A and the second to B. By assigning to each initially unlabeled image the label associated with the greater of the two scores, we obtain a labeling of images 3, 4, and 6 with A, and image 5 with B. The details of the graph construction and score computation will be described later, in Section 3.

The contributions of this article are as follows.

—We present the SW-KProp label propagation method, a variant of KProp with a different edge weighting scheme leading to improved quality of classification;
—We provide a proof of the convergence of SW-KProp score computation, through a formulation in terms of matrix computations;
—We give the SW-KProp+ variant of SW-KProp, in which the quality of the similarity graph is enhanced by selecting a reduced feature set for each prelabeled image, thereby leading to further improvements in classification accuracy;
—We present an experimental comparison of SW-KProp and SW-KProp+ for three image datasets, each of a different level of difficulty for the annotation task.

The remainder of this article is organized as follows. Section 2 reviews the research literature related to this work. The description of SW-KProp appears in Section 3, divided into two phases: graph construction and label propagation. The variant SW-KProp+ is also described.
The experimental framework is outlined in Section 4. In Section 5, we present and discuss the experimental results for three image datasets: a handwritten digit dataset, a face dataset, and a web image dataset. We conclude this article in Section 6 with a discussion of future research directions.

2. RELATED WORK

In this section, we review the research literature related to our approach, focusing on automated image annotation and on iterative methods.

2.1 Image Annotation

The major benefits gained from the effective annotation of images include easy organization and communication for both personal and social purposes. Semantic labels, such as the names of people or the descriptions of events, not only help the owners of images recall the situations depicted therein, but
also provide a basis for image organization and retrieval. In social networks or other image hosting services, labels are added to images in order to allow better understanding of the image context, and better communication between participants who share images. Labels also play a key role in commercial search engines for fast image indexing and querying. For more on the history and benefits of image annotation, we refer the reader to Ames and Naaman [2007] and Nov and Ye [2010].

Several methods have been proposed for assisting users in the annotation of images. Users can annotate images verbally as they are created, by means of a microphone built into the camera device [Desai et al. 2009]. Verbal annotations are transcribed into text by a speech recognizer incorporating external semantic knowledge sources. A web-based labeling tool has been developed with a drawing interface for object boundaries [Russell et al. 2008]. Users can identify new objects in images, or edit existing object labels. An interactive gaming system was developed in which a pair of players are encouraged to propose labels for each displayed image [von Ahn and Dabbish 2004]. If the two players happen to agree on a common label for the image, the label is added to the annotation information for that image. However, despite the assistance that these methods provide, the semi-automated association of images with semantic information is still too expensive to be applied on a large scale.

In recent years, the topic of fully automated image annotation has generated great interest within the multimedia research community. In typical query-based annotation methods, the image to be annotated is submitted as a query to a CBIR system. Filtering schemes are then used to select labels from result images, and apply them as annotations to the query image. One such approach employs a simple greedy strategy for label selection [Makadia et al. 2008].
In their paper, the authors also made the claim that simple query-based baseline techniques often outperform more complex state-of-the-art annotation methods, according to a family of baseline measures which they proposed. A more sophisticated approach was proposed in Li et al. [2006], in which annotation keywords are mined from the query results. The keywords found in titles and other text associated with result images are clustered, from which representative keywords are selected as labels for the query image.

Another popular solution involves the study of the correlation between visual features and semantic labels. A correlation method was proposed for mapping image descriptors to keywords, by which a query image can be annotated directly without retrieving matching images [Hardoon et al. 2006]. In Duygulu et al. [2002], the process of image annotation was viewed as analogous to machine translation, wherein a visual representation is transformed into a textual representation. Here, the mapping between blobs (clustered image features) and keywords is learned using the Expectation-Maximization (EM) algorithm. Image regions can then be labeled with the most likely keywords as determined by EM. The performance of the translation method was improved using a cross-media relevance model (CMRM) introduced in Jeon et al. [2003]. Instead of assuming the existence of a one-to-one correspondence between the keywords and blobs in an image, their approach assumes only that a set of keywords is related to the blob set that represents the image. The probability of observing a keyword given an image is estimated by the joint probability of observing the keyword and the blob set.

Classification methods are extensively used in image annotation. One example is Cusano et al. [2003], in which salient regions of training images are extracted and manually labeled with one of several predefined classes for the image set under consideration.
Regions of test images are then classified by support vector machines (SVMs). Another example uses Bayes point machines (BPMs) to train classifiers on a small set of labeled images [Chang et al. 2003]. Test images are classified by means of ensembles of multi-class classifiers, and assigned multiple soft labels with association scores.

Some learning-based methods have also taken into account ontological information associated with textual labels. Text ontologies were used in Srikanth et al. [2005] to generate a visual vocabulary for the representation of images. The same paper proposed a hierarchical classification approach for automated
image annotation. Concept ontologies were used in Shi et al. [2007] to provide additional annotations for training images, so as to expand the training sets available for each concept class.

Graph-based semi-supervised learning (GSSL) methods have also attracted much attention in recent years. In Liu et al. [2006], several nearest spanning chains (NSCs) are built for an image set, each of which sequentially connects an image node with its nearest neighbor from among the remaining nodes. The weight between two nodes is computed based on their similarity value, and on the frequency of the edges connecting them in the computed NSCs. Once the weighted similarity matrix is built, the annotation process is modeled as a manifold ranking (MR) problem [Zhou et al. 2003b]. Hu and Qian extended MR to multi-instance scenarios in Hu and Qian [2009]. Two bags are generated to represent each image, based on its quantized regions and on its nearest neighbors. The weight between two image nodes is the product of the two similarity values computed in the two bag spaces. Tang et al. [2011] proposed a sparse graph reconstruction method to reduce semantically unrelated links in traditional graphs, whose similarity weightings are based solely on visual features. The label inference step is formulated by minimizing a function of the label reconstruction error. Several GSSL methods and their applications to web-scale image annotation are reviewed in Liu et al. [2012]. In one such method, anchor graphs have been deployed to tackle the problem of large graph construction [Liu et al. 2010]. The key idea of this approach is to reduce the cost of computing similarities among data items via estimation from a small set of anchor points.

Our proposed SW-KProp method resembles existing GSSL methods in that links are created between similar images in an attempt to share annotation information among them.
In general, the performance of GSSL methods depends crucially on the structure of the graph, and on the edge weightings and aggregation functions used to decide the relative scoring of annotation options. However, as will be discussed in Section 3, there are several important differences in the graph construction and edge weighting schemes of our method, as compared to existing GSSL methods.

A major drawback of existing annotation methods lies in their low accuracy when the number of labeled samples is not sufficiently large. This motivates our work on the proposed annotation method based on iterative information propagation, in which query images receive annotation information not only directly from their immediate neighbor images according to a supplied similarity measure (as with k-NN classification and query-based annotation), but also indirectly via propagation within a similarity graph on the image set.

2.2 Iterative Methods

The SW-KProp strategy of propagating information within a graph bears a superficial resemblance to variants of the well-known PageRank algorithm for web page ranking [Page et al. 1999]. PageRank determines the relative importance of web pages by means of a simulation of a browsing session in which the user clicks on successive links, while periodically jumping directly to another page. PageRank models this process as a random walk in a directed graph, in which nodes represent pages and edges represent embedded hyperlinks. PageRank assigns a score to each web page proportional to the probability of the random walk reaching the associated node; these scores are then used to determine a ranking. The probability scores are calculated through an iterative process in which the score of each node is divided by its out-degree and propagated to the nodes it points to.
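The iteration just described can be sketched in a few lines. The following is a minimal illustration, not the code of Page et al.: the `jump` parameter plays the role of the periodic direct jump, and the uniform treatment of pages with no outgoing links is an assumption of this sketch.

```python
import numpy as np

def pagerank(adj, jump=0.15, tol=1e-10):
    """Iterative PageRank: adj[i, j] = 1 iff page i links to page j."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Each page divides its score by its out-degree and passes it along its
    # links; pages with no outgoing links are treated as linking everywhere.
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    score = np.full(n, 1.0 / n)
    while True:
        new_score = jump / n + (1 - jump) * trans.T @ score
        if np.abs(new_score - score).sum() < tol:
            return new_score
        score = new_score

# Three pages linked in a cycle: 0 -> 1 -> 2 -> 0.
adj = np.array([[0, 1, 0],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
scores = pagerank(adj)   # by symmetry, all three scores are equal
```

Since the scores form a probability distribution over the pages, they sum to 1; ranking the pages amounts to sorting this vector.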
This property guarantees that the propagation matrix of PageRank is positive and stochastic, and that the score vector of web pages converges to a unique solution.

The PageRank algorithm has given rise to a number of close variants. One such variant, SimRank [Jeh and Widom 2002], computes the structural-context similarity of two data items using a graph derived from the natural link structure, where the nodes and edges represent data items and
relationships between them, respectively. The graph models all possible combinations of node pairs as new nodes, and draws directed edges between them according to certain rules. Unlike PageRank, the SimRank node scores are computed as an average of the scores of their reverse neighbors. Another PageRank-based data analysis method has been developed in which hypertext document collections are clustered based on the link structure, rather than on the content of the documents [Avrachenkov et al. 2008].

In the case of image data, the VisualRank method proposed in Jing and Baluja [2008] analyzes the visual link structure of images and computes authority scores for them. For each query, a relevant and diverse set of images can be produced within the top-ranked results. RankCompete [Cao et al. 2010] provides an effective tool for organizing web photos through a generalization of PageRank for simultaneous image ranking and clustering. Images are ranked in the clusters with respect to their local neighborhoods, and the diversity among the result images can be displayed in a compact and structured way for subsequent browsing or querying by the user.

The random walk with restarts (RWR) method [Wang et al. 2006] can be used to re-rank the precomputed annotations of images. RWR constructs a complete graph whose nodes are candidate annotations and whose edges are weighted by the co-occurrence similarity values of the connected nodes. The PageRank algorithm is then performed with a restart vector initialized to hold the normalized confidence scores of the candidate annotations. The final annotations selected are those achieving the highest probability scores in the stable state. A similar approach to annotation refinement appeared in Li et al. [2010]. There, the edges were weighted by a fusion of the co-occurrence similarity values of the connected annotations with the similarity values of their “visual content”, as measured by their most representative images in the training set.
Although these methods can achieve improvements in the quality of the annotations produced, their performance depends greatly on the candidate annotations and confidence scores initially provided.

3. THE INFLUENCE PROPAGATION MODEL

In this section we present SW-KProp, our proposed neighborhood-based influence propagation scheme. Under SW-KProp, each data item determines its labeling by iteratively consulting its neighbors for recommendations, weighing and combining the collected opinions, and then serving as a consultant for its own neighboring items. This iterative procedure eventually results in the dissemination of node influences throughout the entire dataset. To avoid confusion with the term “object of interest”, which we reserve for the subject of an image, we use the term “data item” (or simply “item”) to denote an element of a database (for example, an image or a region thereof).

Let D = {o_1, o_2, ..., o_n} be a set of n data items, with each item associated with a subset of the label set L = {λ_1, λ_2, ..., λ_t}. If the label set L(o) associated with o ∈ D is empty, then o will be said to be unlabeled; otherwise, o will be referred to as labeled. Given an initial labeling Λ ⊆ D × L whose elements ⟨o, λ⟩ refer to the association of item o ∈ D with label λ ∈ L, the goal is to determine an n × t score matrix S whose elements s_{i,j} (1 ≤ i ≤ n, 1 ≤ j ≤ t) measure the degree of association of item o_i with label λ_j. SW-KProp solves this problem in two phases, by first modeling the similarity information of data items as a neighborhood graph, and then propagating label scores through the graph according to certain weighting and combination rules.

We next present the general framework of the SW-KProp algorithm, in Section 3.1. This is followed by discussions of the construction of the influence graph and the computation of the influence scores, in Sections 3.2 and 3.3, respectively. In Section 3.4, we propose a variant of SW-KProp which applies a simple strategy to compute a reduced feature set for each prelabeled image, and then uses these features to refine the structure of the similarity graph.

ALGORITHM 1: SW-KProp
input : data item set D, label set L, initial labeling Λ
output: score matrix S
 1  n ← |D|, t ← |L|;
 2  Let G be an influence graph modeling the neighborhood relationships of items in D;
 3  Compute the n × n adjacency matrix A of G;
 4  Compute the n × n propagation matrix P from A;
 5  Initialize the n × t score matrix S with respect to Λ;
 6  repeat
 7      S′ ← S;
 8      S ← P S′;
 9  until S = S′;
10  return S;

3.1 The SW-KProp Algorithm

The overall framework of the SW-KProp algorithm is shown in Algorithm 1. Line 1 of the algorithm acquires the number of data items and the number of distinct labels in the dataset. Line 2 corresponds to the first phase of our model, in which an influence graph is constructed according to the neighborhood information of items in the dataset. The definition of the neighborhood relies on a user-supplied distance measure. The remainder of the algorithm corresponds to the second phase, propagation through the influence graph. Lines 3–4 and 5 prepare the propagation matrix and the initial score matrix, respectively. The propagation of label scores is accomplished by iterative multiplication of these two matrices (lines 6–9). The details of the two phases are presented in Sections 3.2 and 3.3.

As will be seen, the iteration converges toward a unique solution S_f, which can be interpreted by reading off either its rows or its columns. If each column is sorted in non-increasing order, we obtain a ranked list of items, with the first item having the highest degree of association with a specific label. By sorting each row in non-increasing order, we obtain ranked lists of labels, with the first entries corresponding to the maximum likelihood assignment of labels to items.

The decision to annotate initially unlabeled data items can be made based on the ranked lists, via a simple thresholding scheme. Let r_{i,j} denote the rank of label λ_j on the ranked list corresponding to item o_i. Given two user-supplied threshold values, r_max on the maximum rank and s_min on the minimum score, each unlabeled item o_i can be annotated by the label set {λ_j | r_{i,j} ≤ r_max, s_{i,j} ≥ s_min}. As a special case, if r_max = 1 and s_min = 0, each distinct label will be treated as a class identifier, and the entire set of unlabeled data items will be classified.
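As an illustration, the propagation loop (lines 6–9 of Algorithm 1) and the thresholding scheme can be sketched as follows. This is a simplified sketch rather than the authors' implementation: the propagation matrix P and initial score matrix S0 are assumed to be given, and the exact-equality termination test is replaced by a floating-point tolerance.

```python
import numpy as np

def propagate(P, S0, tol=1e-9, max_iter=1000):
    """Lines 6-9 of Algorithm 1: iterate S <- P S until the scores stabilize."""
    S = S0.copy()
    for _ in range(max_iter):
        S_next = P @ S
        if np.abs(S_next - S).max() < tol:   # tolerance in place of exact equality
            return S_next
        S = S_next
    return S

def annotate(S, labels, r_max=1, s_min=0.0):
    """Thresholding scheme: keep labels ranked within r_max with score >= s_min."""
    result = []
    for row in S:
        order = np.argsort(-row)   # label indices, best score first
        result.append([labels[j] for r, j in enumerate(order, start=1)
                       if r <= r_max and row[j] >= s_min])
    return result

# Toy case: item 0 is a source node with a self-edge, and item 1 receives
# scores from item 0 with damping 0.9 (a hypothetical propagation matrix).
P = np.array([[1.0, 0.0],
              [0.9, 0.0]])
S0 = np.array([[1.0, 0.0],    # item 0 initially carries label 'A'
               [0.0, 0.0]])
S = propagate(P, S0)
print(annotate(S, ['A', 'B']))   # -> [['A'], ['A']]
```

With r_max = 1 and s_min = 0, as in the special case above, the call reduces to plain classification of every item.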
3.2 The Influence Graph

As a preprocessing step, a directed graph is constructed whose nodes represent the data items, and whose edges denote pairs of items whose similarity is sufficient to allow propagation of contextual information from one to the other. As will be seen later, the quality of the similarity measure has great influence on the performance of the label propagation method.

The modeling of data relationships as graph edges often arises naturally according to the specific data domain. In some domains, such as web pages with embedded hyperlinks, scientific papers with citations, and user-item pairs in a recommender system, the similarity relationships are explicitly indicated by link structure, references, or pairings as the case may be. Here, we consider the case where no explicit item pairings are defined, but where a pairwise similarity measure (or distance
measure) exists. We will make the natural assumption that contextual information should be shared and propagated between items whose similarity is sufficiently high.

Let us denote the symmetric pairwise distance between two items o, o′ ∈ D by d(o, o′). Given an item o, the distance function d determines a ranking of the items of D relative to o. More precisely, the rank of o′ relative to o is given by

    ρ(o, o′) = |{z ∈ D | d(o, z) < d(o, o′)}|.

Note that under this definition it is possible for two items to have the same rank with respect to o. Uniqueness of ranks is guaranteed only if all pairwise distance values between items of D are unique; if desired, this can be achieved by breaking ties arbitrarily yet consistently.

Let τ_d(o) and τ_ρ(o) be positive threshold values for item distances and ranks, respectively. We define the region of influence of item o to be the set of nodes simultaneously falling within distance τ_d(o) of o, and rank τ_ρ(o) of o:

    Infl(o) = {z ∈ D | d(o, z) ≤ τ_d(o) ∧ ρ(o, z) ≤ τ_ρ(o)}.

We will say that an item o influences item o′ if o′ lies within the region of influence associated with o.

More formally, we model item relationships as a directed influence graph G(V, E), with the node set partitioned into V = V_l ∪ V_u, where V_l and V_u represent the initially labeled (source) item set D_l and the initially unlabeled (nonsource) item set D_u, respectively. E is composed of three types of edges:

(1) ⟨v, v⟩ ∈ E for all v ∈ V_l;
(2) ⟨v, u⟩ ∈ E whenever v ∈ V_l, u ∈ V_u, and u ∈ Infl(v); and
(3) ⟨u, u′⟩, ⟨u′, u⟩ ∈ E whenever u, u′ ∈ V_u, and either u ∈ Infl(u′), or u′ ∈ Infl(u) (or both).

It can be observed that each v ∈ V_l has a self-edge, and all other edges lead to nodes of V_u. This construction prevents items whose labels are known in advance from being influenced by other items.
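The rank function and the region of influence follow directly from the definitions. A small sketch with made-up distances (the item names and distance values are hypothetical; excluding o itself from its own region of influence is an assumption of the sketch, since self-edges are introduced explicitly by rule (1)):

```python
def rho(D, d, o, o2):
    """rho(o, o2): number of items of D strictly closer to o than o2 is."""
    return sum(1 for z in D if d[o][z] < d[o][o2])

def infl(D, d, o, tau_d=float('inf'), tau_rho=2):
    """Region of influence of o: items within distance tau_d AND rank tau_rho.
    The item o itself is excluded here (an assumption of this sketch)."""
    return {z for z in D
            if z != o and d[o][z] <= tau_d and rho(D, d, o, z) <= tau_rho}

# Hypothetical symmetric distances among four items.
d = {'a': {'a': 0, 'b': 1, 'c': 2, 'e': 5},
     'b': {'a': 1, 'b': 0, 'c': 2, 'e': 4},
     'c': {'a': 2, 'b': 2, 'c': 0, 'e': 3},
     'e': {'a': 5, 'b': 4, 'c': 3, 'e': 0}}
D = set(d)

print(infl(D, d, 'a'))               # rank threshold alone: {'b', 'c'}
print(infl(D, d, 'a', tau_d=1.0))    # distance threshold tightens it: {'b'}
```

Note that ρ counts o itself (every other item is at rank at least 1), matching the set-cardinality definition above.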
The influence graph of SW-KProp differs from those of other graph-based methods, such as the k-NN graphs that are commonly used in MR and GFHF, in that edges ⟨v, u⟩ are excluded from the graph if u ∈ V_u influences v ∈ V_l and v does not influence u. The reason is that this type of edge may introduce greatly imbalanced distributions in the number of edges leading from source nodes, and thereby bias the propagation of label scores. For a pair of nonsource nodes u, u′ ∈ V_u, the influence is applied in both directions, even if the influence relationship is unidirectional. We will refer to the edges connecting two mutually influenced nodes as strong edges, and those connecting two singly influenced nodes as weak edges. As will be seen in Section 3.3, the two types of edges are treated differently.

In general, there are several difficulties associated with the selection of a distance threshold for the region of influence. Rank thresholds have an important advantage over distance thresholds in that they do not require an explicit interpretation of distance values. Choosing a fixed rank threshold k—that is, considering k-nearest neighbor (k-NN) sets of the items—compensates for local variations in data density in a way that distance thresholds cannot. Although distance thresholds can be (and sometimes should be) used together with rank thresholds for some applications, in this article we consider only rank thresholds.

Figure 1 shows the influence graph based on the following 2-NN lists of faces 1 to 6: {3, 4}, {4, 5}, {4, 6}, {1, 6}, {1, 6}, and {3, 4}. Note that there is no edge between faces 1 and 5: although face 5 influences face 1, face 1 does not influence face 5. The problem of choosing a practical value of the rank threshold k will be addressed empirically in light of the pre-experimental test results of Section 4.3.1. A method that automatically computes a reasonable k will be given as well.
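The edge rules can be checked against Figure 1. The sketch below rebuilds the edge set from the 2-NN lists quoted above (rank threshold only, so the region of influence of a face is simply its 2-NN list); the strong/weak classification is applied to the nonsource-to-nonsource edges of rule (3).

```python
# 2-NN lists of faces 1..6, as given in the text for Figure 1.
knn = {1: {3, 4}, 2: {4, 5}, 3: {4, 6}, 4: {1, 6}, 5: {1, 6}, 6: {3, 4}}
labeled, unlabeled = {1, 2}, {3, 4, 5, 6}

edges, strong, weak = set(), set(), set()
for v in labeled:
    edges.add((v, v))                      # rule (1): self-edge on each source
    for u in knn[v] & unlabeled:
        edges.add((v, u))                  # rule (2): source -> influenced nonsource
for u in unlabeled:
    for w in unlabeled:
        if u < w and (w in knn[u] or u in knn[w]):
            edges |= {(u, w), (w, u)}      # rule (3): edges in both directions
            kind = strong if (w in knn[u] and u in knn[w]) else weak
            kind |= {(u, w), (w, u)}       # mutual -> strong, one-way -> weak

# Face 5 influences face 1 (1 is in 5's 2-NN list), but face 1 does not
# influence face 5, and face 1 is a source: no edge joins faces 1 and 5.
assert (1, 5) not in edges and (5, 1) not in edges
```

On this input, faces 3 and 6 (and 4 and 6) are mutually influenced and thus joined by strong edges, while the 3–4 and 5–6 pairs are joined by weak edges.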
ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 10, No. 1, Article 7, Publication date: December 2013.

3.3 Label Propagation

We now formulate the SW-KProp procedure in terms of iterative matrix multiplications of a propagation matrix with the score matrix. We then prove that the computation of SW-KProp scores will converge in a finite number of iterations. The problem of label propagation can finally be reduced to a
linear system with a sparse strictly diagonally dominant coefficient matrix, to which faster iterative methods can be applied. Let item oi correspond to row i and column i of the n × n adjacency matrix A of the influence graph G(V, E). Entries of A can be computed by:

a_{i,j} = α · sim(oi, oj)   if ⟨j, i⟩ is a strong edge,
a_{i,j} = sim(oi, oj)       if ⟨j, i⟩ is a weak edge,        (1)
a_{i,j} = 0                 otherwise,

where α ≥ 1 is an amplifying factor that favors strong edges, and sim(·, ·) denotes the similarity value between two items. Instead of using a binary value to weight the edges as in KProp, or the Gaussian kernel as in typical graph-based methods, we adopt a simple linear transformation for the similarity function:

sim(o, o′) = 1 − (d(o, o′) − dmin) / (dmax − dmin),

where dmin and dmax are the minimum and maximum pairwise distances between distinct items in the graph, respectively. This similarity function normalizes the similarity values between pairs of graph nodes into [0, 1], and requires no parameter tuning. The amplifying factor α is applied in order to increase the influences of strong edges. Intuitively, two nodes are more likely to share the same label if each is a member of the k-NN list of the other. The choice of the parameter α will be discussed in Section 4.3.3. Entries of the n × n propagation matrix P can be computed by:

p_{i,j} = a_{i,j}                           if node i ∈ Vl,
p_{i,j} = β · a_{i,j} / Σ_{q=1}^{n} a_{i,q}   otherwise.        (2)
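A minimal sketch of Eqs. (1)-(2) (illustrative code, not the authors' implementation; the toy adjacency matrix and the unit self-edge weight on the source row are assumptions made for the example):

```python
import numpy as np

def similarity(dist, d_min, d_max):
    """Linear normalization of a distance into a [0, 1] similarity."""
    return 1.0 - (dist - d_min) / (d_max - d_min)

def propagation_matrix(A, labeled, beta):
    """Eq. (2): rows of source nodes keep their adjacency weights; rows of
    nonsource nodes are normalized by their row sum and damped by beta."""
    P = np.array(A, dtype=float)
    for i in range(P.shape[0]):
        if i not in labeled:
            row_sum = P[i].sum()
            if row_sum > 0:
                P[i] = beta * P[i] / row_sum
    return P
```

Note that every nonsource row of P sums to exactly β, which is what bounds the spectral radius in the convergence proof below.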

Here, β is a damping factor (0 < β < 1) used to penalize nodes that are far away from source nodes, and to accelerate the convergence. Let item oi ∈ D correspond to row i of the score matrix S, and let label λj ∈ L correspond to column j of S. Entries of the n × t initial score matrix S^0 can be computed as:

s_{i,j} = 1   if oi is associated with λj,
s_{i,j} = 0   otherwise.        (3)

Let S^q be the state of the score matrix in the qth iteration. S^q is computed from the previous state according to the formula

S^q = P S^{q−1}.        (4)

The iteration continues until each element δ^q_{i,j} in Δ^q = S^q − S^{q−1} falls within the bound |δ^q_{i,j}| ≤ ε, where ε is a user-specified tolerance value. Let f be the iteration at which convergence is achieved; accordingly, S^f is the final state of the score matrix. Unlike PageRank, Eq. (4) does not represent a stochastic process in which the final scores are unique no matter what values are used to initialize the score matrix. In SW-KProp, given a propagation matrix P, each column C^f_j (1 ≤ j ≤ t) of S^f is entirely determined by its corresponding column C^0_j of the initial score matrix S^0. C^f_j will turn out to be an eigenvector of the propagation matrix P for the eigenvalue 1.

We now prove that by iteratively multiplying the propagation matrix with the score matrix (starting from S^0), the process converges to a unique score matrix S^f, of which each column C^f_j represents the stabilized scores of all items for label λj, while each row R^f_i represents the stabilized scores of λ1 through λt for item oi.
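Concretely, the iteration of Eq. (4) and its convergence test can be sketched as follows (a hypothetical implementation; the 3-node propagation matrix in the check below is illustrative):

```python
import numpy as np

def propagate(P, S0, eps=1e-9, max_iter=100000):
    """Iterate S_q = P S_{q-1} (Eq. (4)) until every entry of S_q - S_{q-1}
    has absolute value at most eps, then return the converged score matrix."""
    S = np.array(S0, dtype=float)
    for _ in range(max_iter):
        S_next = P @ S
        if np.max(np.abs(S_next - S)) <= eps:
            return S_next
        S = S_next
    return S
```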

THEOREM 3.1. Given a propagation matrix P corresponding to an influence graph G(V, E), the sequence of score matrices (S^q) in Eq. (4) converges to a unique matrix S^f.

PROOF. By remapping the order of all data items, we may label the source nodes from 1 to m, and the nonsource nodes from m + 1 to n. The propagation matrix P and the score matrix S^q can then be written in the block forms

P = [ I_m  0 ; P2  P3 ]   and   S^q = [ S^q_0 ; S^q_1 ].

Denoting a submatrix by its ranges of rows and columns, let P0 = P(1 : m, 1 : m), P1 = P(1 : m, m + 1 : n), P2 = P(m + 1 : n, 1 : m), and P3 = P(m + 1 : n, m + 1 : n). P0 is then an identity matrix corresponding to the self-links of labeled items, and P1 is a zero matrix. Let S^q_0 = S^q(1 : m, 1 : t) and S^q_1 = S^q(m + 1 : n, 1 : t). Then S^q = P S^{q−1} can be computed by:

S^q_0 = P0 × S^{q−1}_0 + P1 × S^{q−1}_1 = S^0_0,   and
S^q_1 = P2 × S^{q−1}_0 + P3 × S^{q−1}_1.

S^q_0 remains equal to S^{q−1}_0, and its entries are either 0 or 1, confirming that the scores of labeled items remain fixed at every step of the iteration. Let X^q = S^q_1, H = P3 and B = P2 S^q_0 = P2 S^0_0; then X^q = H X^{q−1} + B. Clearly, B is a constant matrix, and H is the iteration matrix. X converges if and only if the spectral radius r of H is smaller than 1. By the Gershgorin circle theorem [Higham and Tisseur 2003], each eigenvalue of H lies within at least one closed disc centered at h_{i,i} with radius r_i, where h_{i,i} is the element on the main diagonal and r_i is the sum of the absolute values of the nondiagonal elements in row i of H. Observing that the elements on the main diagonal of H are zeros, and that the sum of each row of H is less than or equal to the damping factor β, the absolute value of each eigenvalue lies in [0, β]. Therefore r ≤ β < 1, and X has a unique solution:

X = (I − H)^{−1} B.        (5)
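Since H has a zero diagonal, the fixed-point iteration X ← HX + B coincides with the Jacobi method for the system (I − H)X = B. A sketch of this solver (the matrices in the check are illustrative, not drawn from the paper's data):

```python
import numpy as np

def jacobi_solve(H, B, eps=1e-12, max_iter=100000):
    """Solve (I - H) X = B by the iteration X <- H X + B, which converges
    because the spectral radius of H is at most beta < 1."""
    X = np.zeros_like(B, dtype=float)
    for _ in range(max_iter):
        X_next = H @ X + B
        if np.max(np.abs(X_next - X)) <= eps:
            return X_next
        X = X_next
    return X
```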
It can be seen from Eq. (5) that the problem of label propagation is modeled as a linear system. Observing that I − H is a sparse, strictly diagonally dominant matrix, X can be solved by two widely used iterative methods, Jacobi and Gauss-Seidel [Hageman and Young 2004]. There also exist faster iterative methods for this problem, such as the conjugate gradient method (CG) [Hestenes and Stiefel 1952] and the generalized minimal residual method (GMRES) [Saad and Schultz 1986]. The details of these methods are beyond the scope of this article.

3.4 The SW-KProp+ Variant

In this section, we present SW-KProp+, a variant of SW-KProp that improves the quality of the influence graph by selecting, for each prelabeled image, a reduced feature set that is discriminative for its immediate neighborhood, and then using it to recompute a new neighborhood for the prelabeled image. The difference between SW-KProp+ and SW-KProp lies in the graph construction step (line 2 of Algorithm 1). Ideally, edges in the influence graph should connect images that share the same labels. However, for any given image, due to the presence of features that are irrelevant or indiscriminative for that image, and due to the difficulty of choosing an appropriate value for k, there usually exist "false positive" edges connecting it to images whose label sets differ greatly. As the most important edges for the propagation
are those that lead from labeled nodes to unlabeled nodes, we are especially interested in reducing the number of false positive edges that originate from labeled nodes. In the following, we propose a simple algorithm that computes a reduced feature vector for each prelabeled image, and then uses the new feature vectors to rebuild the graph link structure. For each labeled item o ∈ Dl, given its original feature descriptor F, we can compute a reduced feature set for o according to Algorithm 2.

ALGORITHM 2: Reduced feature set selection
input: prelabeled set Dl and corresponding feature vectors, o ∈ Dl, parameters rd ∈ (0, 1) and tc ∈ (0, 1)
output: a reduced feature vector Fo for o
1  foreach dimension i (1 ≤ i ≤ dim(F)) of the feature vectors do
2      Compute d(o, o′) for all o′ ∈ Dl with o′ ≠ o;
3      Find o's tc · |Dl| nearest neighbors {o1, o2, . . . , o_{tc·|Dl|}} with respect to i;
4      Measure the discriminative ability of dimension i for o by Σ_{j=1}^{tc·|Dl|} I(L(o), L(oj)), where I(L(o), L(oj)) is an indicator function equal to 1 if L(o) = L(oj), and to 0 otherwise;
5  end
6  Rank all dim(F) dimensions according to their discriminative abilities with respect to o, and concatenate the features in the top rd · dim(F) highest-ranking dimensions into Fo.
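Algorithm 2 can be sketched as follows (hypothetical code, not the authors' implementation; distances are taken along a single dimension, ties are broken by index order, and the toy data in the check are illustrative):

```python
import numpy as np

def discriminative_ability(i, o_idx, X, labels, tc):
    """Lines 2-4 of Algorithm 2: among o's tc*|Dl| nearest neighbors along
    dimension i, count those sharing o's label."""
    others = [j for j in range(len(labels)) if j != o_idx]
    dists = [abs(X[j][i] - X[o_idx][i]) for j in others]
    m = max(1, int(tc * len(labels)))
    nearest = [others[r] for r in np.argsort(dists, kind="stable")[:m]]
    return sum(1 for j in nearest if labels[j] == labels[o_idx])

def reduced_feature_dims(o_idx, X, labels, rd, tc):
    """Line 6: rank the dimensions by discriminative ability for o and keep
    the indices of the top rd fraction."""
    dim = len(X[0])
    scores = [discriminative_ability(i, o_idx, X, labels, tc) for i in range(dim)]
    keep = max(1, int(rd * dim))
    order = sorted(range(dim), key=lambda i: -scores[i])
    return sorted(order[:keep])
```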

Ideally, for each prelabeled image o, we wish to select those dimensions (features) for which neighboring prelabeled images with the same label as o rank higher (closer to o) in terms of the value of the feature, as compared to those prelabeled images from the neighborhood of o with labels different from that of o. By combining those features that achieve the best discrimination in ranks for all images sharing the label of o, we hope to produce a feature set that discriminates well for this label, even when applied elsewhere within the dataset. The two parameters tc and rd control the number of nearest neighbors of o to check, and the target dimension of the reduced feature vector, respectively. The influence of the two parameters will be discussed experimentally in Section 4.3.4. Once a new feature set Fo has been computed for o, we then measure the discriminative ability of Fo in a neighborhood of o (in the same spirit as Algorithm 2, lines 2-4), and compare it to that of the original feature set F. Fo will be used to recompute the k-NN of item o only if its discriminative ability is greater than that of F. If Fo is chosen to replace F, then along edges oriented outwards from o, distances of the form d(o, o′) are computed using the reduced feature set Fo rather than the full set F, regardless of whether o′ is labeled or unlabeled. Note that this simple feature selection strategy differs from traditional feature selection algorithms, which aim at removing redundant and irrelevant features from the full set of features, and at applying the reduced set of features uniformly across the entire data domain. Our method instead computes a different set of dimensions for each prelabeled image, in an effort to identify subspaces within which clusters of prelabeled images reside.

4. EXPERIMENTAL FRAMEWORK

In this section, we present the experimental framework for the comparison of SW-KProp and SW-KProp+ with several competing methods.
We describe the three datasets used for our experiments in Section 4.1, and state the evaluation criteria in Section 4.2. The choice of the parameters for SW-KProp, as well as the selection of features for prelabeled images, is discussed in Section 4.3. In Section 4.4, we summarize the methods to be evaluated in the experiments.

4.1 Datasets

4.1.1 MNIST. MNIST [LeCun et al. 1998] contains 60,000 training and 10,000 test images of handwritten digits, with each image represented by a vector of 784 gray-scale texture values. For our
experiments, we constructed a reduced subset of MNIST containing 10,000 images, by randomly selecting 1000 images of each digit from the training set.

4.1.2 Google-23. We queried the names of 23 celebrities in Google Image Search (as per Ozkan and Duygulu [2006]), and crawled 11,811 images from the query results. After manually removing irrelevant images, we ran the face detector of OpenCV [Bradski and Kaehler 2008] and detected 8381 frontal faces. Of these faces, 6686 were manually labeled with one of the 23 names, to produce a dataset which we refer to as Google-23. The number of faces per individual ranges from 97 to 406. Feature descriptors were computed by the Oxford face processing pipeline as per the description in Everingham et al. [2006]; for each face, 13 points of interest were detected, each of which was represented by a 149-dimensional vector. Concatenating these 13 vectors into a single descriptor yielded a 1937-dimensional data point for each face image.

4.1.3 NUS-WIDE-OBJECT. The NUS-WIDE-OBJECT dataset is a subset of NUS-WIDE [Chua et al. 2009], a collection of general web images from the Flickr image sharing website. The original NUS-WIDE-OBJECT set contains 30,000 images associated with 31 different concepts. To evaluate the performance of classification methods on this dataset, we removed all images with multiple labels, and retained the 23,953 images that remained. The number of images for each concept varies greatly, from 108 to 3201. Each image in the dataset is represented by a 634-dimensional descriptor produced from a combination of five types of features: color histogram, color correlogram, edge direction histogram, wavelet texture and color moments.

4.2 Evaluation Criteria

For simplicity, we require that each image be associated with at most one label, which in our experiments is the class ID or name.
For each method, at the termination of each run, each test (initially unlabeled) image was assigned the label with the maximum association score for that image. No score- or distance-based thresholding was applied when assigning a label to an image. We evaluate the overall propagation performance, that is, the proportion of correct label assignments to the total number of unlabeled items, by modifying the usual definition of recall:

recall = #(correctly labeled test items) / #(test items).
For Google-23 and NUS-WIDE-OBJECT, we also evaluate the performances of the methods in terms of average accuracy. It is worth noting that for MNIST the average accuracy is equivalent to the average recall, due to the fact that in this dataset, the data items are evenly distributed among the classes.

4.3 Tuning of System Parameters

We first choose appropriate values for the native SW-KProp parameters empirically, and then discuss the influence of rd and tc on the performance of SW-KProp+.

4.3.1 The Rank Threshold k. To test the influence of the parameter k on the performance of SW-KProp, for each of the three datasets, we randomly labeled one image per category, and computed the average recall over 3 testing runs with respect to k over the range 1 ≤ k ≤ 15. The damping factor β was set at 0.9, α was set at 1.0, and no feature selection was applied. The result is plotted in Figure 2. The highest average recall was achieved when k = 10, 9, and 14 for MNIST, Google-23 and NUS-WIDE-OBJECT, respectively. SW-KProp produced stable results on all datasets when k is sufficiently large. For simplicity and efficiency, we fixed k to be 10 throughout the remainder of the experiments. We also tested the effect of the choice of k on the proportion of nodes that are unreachable from any source node. The value of k was iteratively increased from 1 until the set of unreachable nodes became
Fig. 2. Average recall with respect to k.

Fig. 3. Proportion of nodes unreachable from source nodes with respect to k.

Table I. Average recall and number of iterations required for convergence with respect to β, for the Google-23 set with one prelabeled face per individual

β                    0.75   0.80   0.85   0.90   0.95   0.99
Average recall (%)   36.38  36.84  37.44  37.84  38.12  35.17
#(Iterations)        31     38     51     75     137    457
empty. The result is plotted in Figure 3. When k = 1, the majority of the nodes lie in small connected components that do not include source nodes. Every node in the graph becomes reachable from at least one source node for k ≥ 3, 7, and 3 on MNIST, Google-23 and NUS-WIDE-OBJECT, respectively. For smaller choices of k, items unrelated to labeled items are more likely to be isolated from annotation sources, and (as one would expect) remain unlabeled. On the other hand, an inappropriately small value of k could severely limit the range of the propagation. Unreachable nodes counted as incorrect label assignments would have a negative effect on assessments of classification performance: as shown in Figures 2 and 3, the average recall improves as the number of unreachable nodes decreases, and stabilizes as the number of unreachable nodes approaches zero. Based on this observation, we propose a method that computes a reasonable value of k, for scenarios in which an estimate p is available for the proportion of unlabeled data items in a dataset containing n items. Denoting by VuR the set of nonsource nodes that are reachable from source nodes, the idea is to expand the influence graph by increasing k from 1 until |VuR| ≥ pn, or until a constant number of consecutive iterations have been performed during which VuR did not increase. For example, for classification applications, we may increase k until all nodes are reachable from source nodes in the dataset. In practice, the proportion of unreachable nodes decreases rapidly as k increases, as can be seen from Figure 3. This method does not necessarily determine the best possible value of k; however, it does eliminate the need for tuning of this parameter while still allowing most if not all nodes to be reachable from source nodes.

4.3.2 The Damping Factor β. The influence of β on the performance of SW-KProp was also assessed.
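The automatic selection of k described in Section 4.3.1 can be sketched as follows (hypothetical code, not the authors' implementation; for simplicity, edges are treated as undirected in the reachability test, and `neighbor_lists[v]` is assumed to list v's neighbors in increasing order of distance):

```python
from collections import deque

def choose_k(neighbor_lists, sources, p, n, k_max=100, patience=5):
    """Grow k from 1 until at least p*n nonsource nodes are reachable from
    some source node, or until the reachable set has not grown for
    `patience` consecutive values of k."""
    prev_reachable, stall = -1, 0
    for k in range(1, k_max + 1):
        # undirected adjacency induced by the first k neighbors of each node
        adj = {v: set(ns[:k]) for v, ns in neighbor_lists.items()}
        for v, ns in list(adj.items()):
            for u in ns:
                adj.setdefault(u, set()).add(v)
        # breadth-first search from all source nodes simultaneously
        seen, queue = set(sources), deque(sources)
        while queue:
            v = queue.popleft()
            for u in adj.get(v, ()):
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        reachable = len(seen - set(sources))
        if reachable >= p * n:
            return k
        if reachable == prev_reachable:
            stall += 1
            if stall >= patience:
                return k
        else:
            prev_reachable, stall = reachable, 0
    return k_max
```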
For each choice of β considered, we randomly labeled one face per person from Google-23, and computed the average recall and the number of iterations required for convergence. The neighborhood size k was set to 10, α was set to 1.0, and no feature selection was applied. The score matrix was computed using Eq. (4). The results (averaged over 3 testing runs) can be found in Table I. It can be observed from Table I that the average recall improves slightly as β grows, and drops rapidly as β approaches 1; the number of iterations required for convergence increases rapidly when β exceeds 0.9. We chose β = 0.9 as a good tradeoff between performance and efficiency for the remaining experiments.


Fig. 4. The average recall with respect to α on the three datasets.

Table II. The Average Recall (%) with Respect to rd and tc on the Three Datasets

(a) MNIST
tc \ rd   1/6     2/6     3/6     4/6     5/6
1/6       77.20   81.93   82.14   82.22   82.64
2/6       64.59   80.18   81.03   81.90   82.83
3/6       59.89   80.32   81.17   81.69   82.76
4/6       61.20   81.54   82.27   82.53   83.30
5/6       58.92   82.38   82.38   82.38   82.98
rd = 1 (any tc): 82.94

(b) Google-23
tc \ rd   1/6     2/6     3/6     4/6     5/6
1/6       49.47   50.51   50.72   50.30   50.22
2/6       49.89   50.72   51.18   50.88   50.25
3/6       50.30   51.04   51.09   50.87   50.50
4/6       50.61   50.84   51.07   50.84   50.30
5/6       49.55   50.57   50.70   50.52   50.30
rd = 1 (any tc): 50.04

(c) NUS-WIDE-OBJECT
tc \ rd   1/6     2/6     3/6     4/6     5/6
1/6       18.84   18.79   18.58   18.34   17.90
2/6       18.99   19.07   18.94   18.57   17.98
3/6       18.63   18.97   18.87   18.73   18.09
4/6       18.36   18.84   18.89   18.66   18.09
5/6       17.40   18.49   18.65   18.42   17.93
rd = 1 (any tc): 17.11

4.3.3 The Amplifying Factor α. Instead of using an arbitrary value for the parameter α ≥ 1, we tested a wide range of values from 1 to 512 (in the form of powers of 2). From each dataset, 5 random images per category were prelabeled. Figure 4 plots the average performance over 3 test runs versus α. On MNIST, the average recall keeps increasing until α > 128 (Figure 4(a)). This means that if an image node has both weak and strong edges pointing to it, the strong edges should dominate the label propagation. However, we cannot simply remove the weak edges from the graph: nodes having only weak edges pointing to them would be disconnected from the graph and remain unlabeled. The highest average recall was achieved when α = 2 on Google-23, and when α = 4 on NUS-WIDE-OBJECT (Figures 4(b) and 4(c)). This implies that the strong edges between image nodes of these two datasets deserve higher weights, but should not dominate over weak edges. For the remaining experiments, we used α = 128, 2 and 4 for MNIST, Google-23 and NUS-WIDE-OBJECT, respectively. In practice, for relatively simple datasets, we can use a large value of α to increase the influences of strong edges, as for such sets we can reasonably expect images with a common label to be close in distance. However, in datasets whose semantically related images present largely diverse visual features, such mutual influences are rare, and a small value of α should be considered.

4.3.4 The Parameters for Feature Selection. For all datasets, we tested values of both rd and tc from 1/6 to 5/6. For the MNIST and Google-23 sets, 1 to 7 images per category were prelabeled, whereas for the NUS-WIDE-OBJECT dataset, 1, 5, 10, 20, 50, and 100 images per category were prelabeled. For each pair of values of rd and tc, the recall values were averaged over 3 test runs. The average recall values thus obtained were themselves averaged, over all possible numbers of prelabeled images taken over all categories.
In addition, we also measured the performance of SW-KProp+ with rd = 1 (which is equivalent to SW-KProp, no matter what value tc takes) in the same configuration. The results are recorded in Table II.


It can be seen from Table II that the best values of the parameters rd and tc depend heavily on the quality of the original descriptors. With the MNIST dataset, the performance of SW-KProp+ increases as both rd and tc approach 1 (Table II(a)), indicating that better performance is achieved when each prelabeled image produces a feature vector that resembles the full feature set. Conversely, for the Google-23 and NUS-WIDE-OBJECT sets, SW-KProp+ (with rd and tc smaller than 1) performs better than SW-KProp in most cases (Tables II(b) and II(c)). The best performances are achieved when rd and tc are relatively small, indicating that the original image descriptors of the Google-23 and NUS-WIDE-OBJECT datasets are less reliable than those of the MNIST set. As suggested by the outcomes reported in Table II, we used the parameter choices {rd, tc} = {5/6, 2/3}, {1/2, 1/3}, and {1/3, 1/3} for MNIST, Google-23 and NUS-WIDE-OBJECT, respectively, in the remainder of the experimentation involving SW-KProp+. In practice, SW-KProp+ is not able to greatly boost the annotation performance on simple image datasets with discriminative feature vectors. For web image datasets whose original descriptors are not fully reliable, we expect that choosing small values for rd and tc (e.g., on the order of 1/3 or 1/2) can effectively improve the classification performance.

4.4 Methods Evaluated

In Section 4.4.1, we summarize the implementation details of the SW-KProp methods and their predecessor KProp. The GSSL and SVM-like methods adopted in the experiments are discussed in Sections 4.4.2 and 4.4.3, respectively.

4.4.1 SW-KProp and KProp. The KProp, SW-KProp and SW-KProp+ propagation methods were tested. All require that the nearest neighbor set of each data item be available. Neighbor sets can be generated either by precomputing the k-NN lists of all data items, or by retrieving them on demand via fast index structures such as SASH [Houle and Sakuma 2005].
The corresponding distance values between an item and its neighboring items are also required by the SW-KProp methods. We used the Jacobi method to compute the score matrices, which saved up to 32% of the iterations required for convergence, as compared to the original implementation of KProp based on Eq. (4).

4.4.2 GSSL Methods. We tested our graph-based propagation methods against two well-known GSSL methods that are related to our approach: manifold ranking (MR) and Gaussian fields and harmonic functions (GFHF). MR allows unlabeled nodes to influence the labeled nodes, while GFHF explicitly protects the original scores of the labeled nodes. For both methods, we used traditional undirected k-NN graphs with k = 10, and set the damping factor to 0.9. In both methods the edges were weighted by exp(−d²/2σ²), where d is the distance value between the two incident nodes, and σ is a bandwidth hyperparameter that was estimated by the average distance between pairs of graph nodes.

4.4.3 SVM and LapSVM. As suggested in Zhu et al. [2008], for our experimentation we used implementations of SVM and LapSVM [Melacci and Belkin 2011] as representative supervised and semi-supervised learning classifiers, respectively. SVMs are widely used in classification and other machine learning tasks. Using a supplied kernel function for similarity computation, they build a global boundary that has the largest distances to the nearest data points from both the positive and negative training sets. Once the boundary has been established, each unlabeled data item can be classified clearly as belonging to one set or the other. LapSVMs have achieved state-of-the-art performance among semi-supervised learning methods [Belkin et al. 2006]. They incorporate kernel methods in a manifold regularization framework that seeks to minimize a loss function involving quantities such as classification scores, together with regularization terms.
The regularization term that ensures the smoothness of the target function over the


manifold structure of the input data is approximated by a weighted graph defined over all input data points, in the form of a corresponding (symmetrically normalized) Laplacian matrix. For both the SVM and LapSVM methods, multi-class classifiers were assembled according to the one-versus-all scheme, and trained using the linear kernel. The number of nearest neighbors used for the construction of the weighted graph in LapSVM was also set to 10.

5. EXPERIMENTAL RESULTS AND DISCUSSION

In this section, we present and discuss the experimental results for the classification of the three datasets under consideration. In MNIST and Google-23, 1 to 7 images from each class were randomly selected for initial labeling in each experimental run. The largest number of prelabeled images per concept in NUS-WIDE-OBJECT was increased to 100, due to the fact that in this dataset, the images associated with a common concept are more visually and semantically diverse, and thus more labeled examples are required for a comprehensive performance evaluation. For each choice of the number of prelabeled images per class, we conducted five experimental runs. For all experiments, Euclidean (L2) distance was used as the distance measure. The average recall versus the number of prelabeled images is plotted in Figure 5. It can be observed from Figure 5 that, in terms of average recall, the best performance of all tested methods is achieved on MNIST (Figure 5(a)). There, SW-KProp and SW-KProp+ obtain consistently better results than their competitors. MR, GFHF, and KProp, which have similar results in the second tier, outperform the SVM and LapSVM classifiers significantly. The overall improvement in average recall of SW-KProp over MR is approximately 5%. However, the use of the feature selection technique for prelabeled images does not lead to a significant improvement for this dataset. One possible reason is that MNIST is a relatively easy dataset whose original image descriptors are already of sufficient quality, making it difficult to find reduced feature sets with better discriminative ability. For Google-23, the average recall performance curves are plotted in Figure 5(b), and the values of average accuracy are recorded in Table III. From these results, we can observe that the average recall and the average accuracy present a consistent trend.
When the number of prelabeled faces per person is relatively small, SW-KProp and SW-KProp+ perform better than their competitors. However, we note that SVM outperforms SW-KProp and SW-KProp+ when 5 and 6 face images, respectively, are prelabeled for each individual. We can also observe that the feature selection strategy boosts the performance of SW-KProp when each person has multiple prelabeled faces. For the web image dataset NUS-WIDE-OBJECT, the label prediction problem is quite difficult, as can be seen from Figure 5(c). None of the methods tested is able to achieve an average recall of more than 30%, even with 100 images prelabeled per category. In terms of average recall, KProp and SW-KProp consistently outperform MR, GFHF, and SVM. They also maintain their advantages over LapSVM when the number of prelabeled images is 20 or fewer. The recall performance of LapSVM matches those of KProp and SW-KProp when this number reaches 50. Over all tested methods, the best performer is SW-KProp+, which consistently outperforms LapSVM with respect to average recall. However, LapSVM has better average accuracy when the number of prelabeled images per category is 50 or more (Table IV). Thus, even if LapSVM were to correctly label fewer images than our method, it would still be possible to use LapSVM to build classifiers for NUS-WIDE-OBJECT with better average quality. For each concept in NUS-WIDE-OBJECT, the number of training images is the same, but the number of test images varies greatly. With an unreliable distance measure, test images from a very small concept class are less likely to be linked close to the source images, and thus tend to be mislabeled. LapSVM, on the other hand, has better performance on small concept classes, which boosts the average accuracy of the individual classifiers.


Fig. 5. Average recall for the three datasets.

Table III. Average Accuracy (%) for Google-23

#(labeled)/class  SVM         LapSVM      MR          GFHF        KProp       SW-KProp    SW-KProp+
1                 23.90±1.88  25.34±1.76  35.20±2.66  32.00±2.67  35.65±2.62  37.13±2.85  37.13±2.85
2                 36.75±2.11  34.87±2.05  42.73±2.57  42.15±2.64  42.01±2.51  44.19±2.76  45.08±2.64
3                 45.05±1.99  40.61±2.09  47.47±2.39  46.26±2.48  46.38±2.36  49.64±2.49  51.21±2.40
4                 52.79±1.88  47.22±2.10  51.14±2.37  50.47±2.42  50.75±2.36  53.74±2.40  54.89±2.36
5                 56.48±1.70  50.91±2.00  53.70±2.27  52.45±2.35  53.08±2.26  55.95±2.37  57.22±2.26
6                 59.83±1.66  53.19±2.10  54.65±2.25  53.58±2.33  54.51±2.18  56.66±2.31  58.36±2.09
7                 62.13±1.52  55.05±2.06  56.10±2.12  55.57±2.18  55.89±2.13  58.12±2.18  59.72±2.03

Table IV. Average Accuracy (%) for NUS-WIDE-OBJECT

#(labeled)/class  SVM         LapSVM      MR          GFHF        KProp       SW-KProp    SW-KProp+
1                 9.46±0.83   8.43±0.66   9.46±0.81   8.68±0.91   9.50±0.73   9.43±0.78   9.43±0.78
5                 13.06±0.93  15.40±1.21  16.00±1.12  14.97±1.15  15.65±1.05  15.87±1.12  16.98±1.08
10                13.78±0.93  19.01±1.43  18.18±1.23  16.84±1.21  18.17±1.18  18.08±1.20  20.38±1.15
20                16.67±0.95  22.78±1.61  20.72±1.27  19.75±1.27  20.88±1.25  20.68±1.24  23.37±1.17
50                20.72±1.24  27.55±1.81  24.33±1.38  23.15±1.31  24.22±1.37  23.96±1.35  26.79±1.21
100               26.38±1.54  30.31±1.88  26.58±1.48  25.56±1.39  26.45±1.40  26.62±1.40  28.36±1.12

In Figure 5, the three datasets are arranged in increasing order of their level of difficulty in classification. MNIST is a relatively easy dataset to process, in that its distance measure is discriminative. On the other hand, the images of Google-23 and NUS-WIDE-OBJECT are taken under uncontrolled conditions, resulting in great variation and diversity. Inter- and intra-class distance distributions of the three datasets are shown in Figure 6. Clearly, compared to the digits of MNIST, based solely on their L2 distance, it is more difficult to tell whether two faces of Google-23 belong to a common individual, and nearly impossible to distinguish images of different concepts in NUS-WIDE-OBJECT. SVM and LapSVM fail to give stable and consistent results on the three datasets: both have poor performance on MNIST, SVM outperforms LapSVM on Google-23, and LapSVM outperforms SVM on NUS-WIDE-OBJECT. When the number of labeled items is sufficiently large, SW-KProp is outperformed on Google-23 by SVM, and on NUS-WIDE-OBJECT by LapSVM. The relative performance of SW-KProp can be explained in terms of the transitivity of data item relationships. Unlike classifiers, which build global boundaries between instances of different classes, SW-KProp transmits label

7:18



M. E. Houle et al.

Fig. 6. Distance distributions for the three datasets.

information locally, along paths leading from labeled images to unlabeled images. The reliability of links connecting image nodes decays as their graph link distance from source nodes increases. MNIST is a relatively simple dataset whose influence graph contains well-established paths from labeled images to unlabeled images of the same object. Conversely, such transitivity is rare or non-existent within the face image and the web image datasets. When image a is similar to image b, and b is similar to image c, it is often the case that a does not resemble c; in such situations, c would iteratively receive incorrect information from a, and propagate this incorrect information to its adjacent nodes. Classifiers, by not relying on the transitivity of similarity information, can avoid such errors when there are adequate numbers of training examples. Any ambiguous items are classified once, and incorrect decisions will not be propagated. SW-KProp has consistently better performance over MR and GFHF on all datasets, and over KProp on MNIST and Google-23. This confirms the effectiveness of its edge weighting schemes. On NUSWIDE-OBJECT, SW-KProp has no particular advantage over KProp, the reason being that for the web images, similarity values are less reliable with respect to semantic concepts, and the influence relationships defined by distances and ranks suffer greatly from noise. On Google-23 and NUS-WIDE-OBJECT, SW-KProp+ outperforms SW-KProp considerably, when the number of prelabeled images per category is larger than 1. This implies that with only a few images of the same category, SW-KProp+ can effectively select a subset of features with better discriminative ability for each prelabeled image, and enhance the quality of the similarity graph by recomputing the neighborhood of prelabeled images. 6. 
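The local transmission just described can be illustrated with a generic iterative propagation sketch on a row-stochastic similarity graph. This follows the general pattern of transmitting scores to neighbors and readjusting while clamping prelabeled nodes; the damping parameter `alpha` and the update rule are illustrative assumptions, not SW-KProp's exact weighting scheme:

```python
import numpy as np

def propagate(W, labels, n_classes, alpha=0.8, iters=50):
    """Generic graph label propagation sketch (not SW-KProp's exact rule).

    W:      (n, n) row-stochastic similarity/weight matrix of the graph.
    labels: length-n integer array; class index for prelabeled nodes,
            -1 for unlabeled nodes.
    Each node repeatedly absorbs its neighbors' current scores, while
    prelabeled nodes are pulled back toward their known class.
    """
    n = len(labels)
    Y = np.zeros((n, n_classes))
    Y[labels >= 0, labels[labels >= 0]] = 1.0    # one-hot seed scores
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (W @ F) + (1 - alpha) * Y    # transmit, then readjust
    return F.argmax(axis=1)                      # predicted class per node
```

On a chain graph with labeled endpoints, for example, each interior node ends up with the label of the nearer seed, illustrating how influence decays with graph distance.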
6. CONCLUSION AND FUTURE WORK

We have proposed SW-KProp for propagating the annotations associated with a small number of objects of interest to the remaining items in an image database. SW-KProp operates in two phases: it first models the data items in an influence graph according to their visual similarities, and then propagates influence scores representing a tentative labeling of the items along the edges of the graph. The influence scores of SW-KProp can be computed by solving a sparse linear system, to which fast iterative methods and optimized matrix operations can be applied. To enhance the quality of the similarity graph, we have also proposed a variant, SW-KProp+, that computes a discriminative subset of the features and reconstructs the neighborhoods of prelabeled images according to the reduced feature sets. We have discussed the influence of system parameters on the practical performance of both SW-KProp and SW-KProp+. We have also proposed a dynamic method for suggesting a reasonable value of the rank threshold k.
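To illustrate the sparse-linear-system route in general terms: the fixed point of a damped propagation of the form F = αSF + (1−α)Y can be obtained by solving (I − αS)F = (1−α)Y with an iterative solver such as conjugate gradient. The symmetric normalization and the parameter value below are illustrative assumptions chosen so that the system matrix is positive definite; this is a sketch of the technique, not the paper's exact system:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, identity
from scipy.sparse.linalg import cg

def propagate_closed_form(A, Y, alpha=0.8):
    """Solve (I - alpha*S) F = (1 - alpha) Y column by column with CG.

    A: sparse symmetric adjacency/similarity matrix.
    S: its symmetrically normalized form D^{-1/2} A D^{-1/2}, so that
       I - alpha*S is positive definite and conjugate gradient applies.
    """
    d = np.asarray(A.sum(axis=1)).ravel()
    Dinv_sqrt = diags(1.0 / np.sqrt(d))
    S = Dinv_sqrt @ A @ Dinv_sqrt
    M = identity(A.shape[0], format="csr") - alpha * S
    # One sparse solve per class column; cg returns (solution, info).
    F = np.column_stack([cg(M, (1 - alpha) * Y[:, c])[0]
                         for c in range(Y.shape[1])])
    return F.argmax(axis=1)
```

At database scale, the sparsity of the kNN similarity graph is what makes such solves practical, since each CG iteration costs only one sparse matrix-vector product.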

Our methods were compared with GSSL and classification-based methods on three image datasets: a handwritten digit dataset, a face dataset, and a web image dataset. Experimental results showed the effectiveness of SW-KProp when the number of prelabeled data items per class is small. Our feature selection strategy for SW-KProp+ was also shown to improve the quality of the similarity graph, and therefore the annotation performance, especially when the original image descriptors are not fully reliable. Our approach can easily be adapted as an initial step for classifiers to boost performance, in such applications as family photo management and the identification of individuals in surveillance videos. Possible directions for future research include:

—the reduction of ambiguities in the influence graph for complex objects, using clustering methods based on shared-neighbor information;
—propagating the refined features from prelabeled images to unlabeled images, to enhance the link structure of unlabeled image nodes;
—the extension of SW-KProp to multigraph and multilabel propagation;
—augmenting training sets using SW-KProp, to boost the performance of classifiers;
—the adaptation of SW-KProp to incremental database systems, so that updates of influence graphs and score matrices can be performed without a full recomputation.

Received August 2012; revised November 2012 and April 2013; accepted May 2013

ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 10, No. 1, Article 7, Publication date: December 2013.
