Hidden Markov Models for Automatic Annotation and Content-Based Retrieval of Images and Video

Arnab Ghoshal, Johns Hopkins University, Baltimore, MD, U.S.A. ([email protected])
Pavel Ircing, University of West Bohemia, Pilsen, Czech Republic ([email protected])
Sanjeev Khudanpur, Johns Hopkins University, Baltimore, MD, U.S.A. ([email protected])

ABSTRACT This paper introduces a novel method for automatic annotation of images with keywords from a generic vocabulary of concepts or objects for the purpose of content-based image retrieval. An image, represented as a sequence of feature-vectors characterizing low-level visual features such as color, texture or oriented-edges, is modeled as having been stochastically generated by a hidden Markov model whose states represent concepts. The parameters of the model are estimated from a set of manually annotated (training) images. Each image in a large test collection is then automatically annotated with the a posteriori probability of concepts present in it. This annotation supports content-based search of the image-collection via keywords. Various aspects of model parameterization, parameter estimation, and image annotation are discussed. Empirical retrieval results are presented on two image-collections: COREL and key-frames from TRECVID. Comparisons are made with two other recently developed techniques on the same datasets.

Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Retrieval Models; I.5.1 [Models]: Statistical; I.4.8 [Scene Analysis]: Object Recognition

General Terms Design, Experimentation, Performance

Keywords Hidden Markov models, Image & video retrieval

1. INTRODUCTION

The content of communications in the digital age is increasingly multi-modal in nature, with text, images and even speech or video being used in a single “document.”


This is true of web pages on the Internet, reports at a workplace, and even interpersonal interactions, as witnessed by the file-contents of most personal computers. Content-based indexing and retrieval of multimedia is therefore becoming an increasingly important issue. Unlike search and retrieval of text documents, in which the modality in which the user naturally, or at least usually, specifies her information need is the same as the modality of the search collection, there is relatively little work on image retrieval based on textual queries. The major reason, of course, is that open-domain image understanding remains a largely unsolved problem, and even detecting a single predetermined object in an image, such as a human face, is a fairly difficult problem (cf. e.g. [13]). Yet, important progress has been made in the last few years in content-based image retrieval, as reported in [4, 2, 3, 7] and elsewhere.

We are not the first to observe that while the "pure" image understanding problem, i.e. the problem of recognizing all the objects in a given image, is very difficult due to several invariance issues, there are two aspects of the image search and retrieval problem which make it relatively much more tractable. One is the availability of side information in the accompanying text: images in multimedia documents are often accompanied by captions or other descriptive text which a search engine may use to guess the content of an image. This is indeed what is used, e.g., by Google Images and similar tools. The other is that the search problem is far more tolerant of erroneous inferences than is usually assumed necessary for object recognition in individual images: the system need not recognize with high certainty whether any particular image in a collection contains, say, a tiger, since it suffices if the inferred likelihood of tiger in images containing tigers is higher than its likelihood in images not containing tigers, no matter how small either likelihood is. These two observations lead to the research paradigm subscribed to in this paper.

1. We develop a generative stochastic model for images and their accompanying captions. Training material for such a model, we believe, is readily obtained from naturally occurring image+caption pairs. The vocabulary of the words used in the captions is fairly generic, enabling the annotation of additional images that may not have surrounding text. If surrounding text is available, our model is capable of using it as a prior, and computing posterior probabilities of the words in the caption given the visual evidence in the image.

2. We estimate the parameters of our model from a manually annotated (training) collection of image+caption pairs, and then automatically annotate (or caption) a collection of test images using this fairly generic vocabulary of keywords.

3. We measure the efficacy of our image annotation model by performing image retrieval experiments on the test collection using single-word queries, and compute standard information retrieval performance measures. We take the view that as long as the retrieved images are in the right rank-order, human users are very efficient at scanning thumbnails for the images they need.

The remainder of this paper is organized as follows. Section 2 formally describes the hidden Markov model (HMM) used for image annotation, and compares it with two other recently developed techniques for the same task. Section 3 delves into various issues in the design and parameter estimation of the model. Section 4 presents a series of experimental results on two image-collections, namely the COREL and TRECVID datasets. We conclude with some discussion in Section 5.

2. HIDDEN MARKOV MODELS FOR IMAGE ANNOTATION

Let a collection of image+caption pairs be provided and consider the problem of developing a stochastic generative model that jointly describes each pair. Let I = ⟨i_1, ..., i_T⟩ denote segments (regions) in an image, and C = {c_1, ..., c_N} denote the objects (concepts) present in that image, as specified by the corresponding label (caption) C. For each image region i_t, t = 1, ..., T, let x_t ∈ R^d represent the color, texture, edges, shape and other salient visual features of the region. (The T segments may be object-based, with each region corresponding to one semantically distinct object, or they may be a simple rectangular partition of the image into fixed-size blocks. We have chosen to use fixed-size segments in our experiments, based on a 4×6 or 5×7 partition of the image.) Finally, let V denote the global vocabulary of the caption-words c_n across the entire collection of images.

We propose to model the {x_t}-vectors of an image I as a hidden Markov process [11], generated by an underlying unobserved Markov chain whose states s_t take values in C. Specifically, each x_t is generated according to some probability density function f(·|s_t) given the state s_t, where s_t itself is a Markov chain with a known initial state s_0 and transition probabilities p(s_t|s_{t-1}). Figure 1 illustrates an image with T = 6 rectangular regions with the caption {tiger, grass, water, ground}, and the state-transition diagram of the underlying Markov chain which generates these regions. Formally, for a state-sequence ⟨s_1, ..., s_T⟩ = s_1^T ∈ C^T,

    f(x_1, \ldots, x_T, s_1, \ldots, s_T \mid s_0) = \prod_{t=1}^{T} f(x_t \mid s_t) \, p(s_t \mid s_{t-1}).

Figure 1: Illustration of the State Transitions Graph and Output for an Image+Caption HMM.

Note that observing the state sequence s_t is equivalent to knowing the alignment of each image region i_t with one of the words in the caption. Since this level of detail is usually not provided in a caption, a hidden Markov model (HMM) is an appropriate formalism for computing the joint likelihood:

    f(x_1^T, C \mid s_0) = \sum_{s_1^T \in C^T} \prod_{t=1}^{T} f(x_t \mid s_t) \, p(s_t \mid s_{t-1}),    (1)

where x_1^T denotes the T-length sequence ⟨x_1, ..., x_T⟩.

The emission densities f and transition probabilities p of the HMM are estimated, given a training collection of image+caption pairs, to maximize the joint likelihood (1). In particular, we model the output density f(·|c) for each state c ∈ V as a mixture of multivariate Gaussian densities on R^d:

    f(x \mid c) = \sum_{m=1}^{M} \frac{w_{m,c}}{\sqrt{(2\pi)^d \, |\Sigma_{m,c}|}} \, e^{-\frac{1}{2}(x - \mu_{m,c})^T \Sigma_{m,c}^{-1} (x - \mu_{m,c})},    (2)

where the w_{m,c}'s are the mixture weights, μ_{m,c} the mean-vector and Σ_{m,c} the diagonal covariance-matrix of the m-th mixture component of the state c. The transition probabilities are non-parametric, with the initial values being the uniform pmf over all permissible states in C (or in V, as described below). Details of this maximum likelihood estimation procedure are standard (cf. [11]) and therefore omitted.

Given an image+caption pair (I, C), the most likely alignment of image regions with caption-words is

    \hat{s}_1^T = \arg\max_{s_1^T \in C^T} \prod_{t=1}^{T} f(x_t \mid s_t) \, p(s_t \mid s_{t-1}).    (3)

This is sometimes called the Viterbi alignment. Finally, the marginal likelihood of an image may be computed as

    f(x_1^T \mid s_0) = f(x_1^T, V \mid s_0) = \sum_{s_1^T \in V^T} \prod_{t=1}^{T} f(x_t \mid s_t) \, p(s_t \mid s_{t-1}).    (4)

When an image with no caption is provided, this marginal likelihood enables one to use the HMM to compute the conditional probability that a particular image region i_t was generated by a particular concept c ∈ V, based on the total visual evidence x_1^T in the image I:

    p(s_t = c \mid x_1^T, s_0) = \frac{f(x_1^T, s_t = c \mid s_0)}{f(x_1^T \mid s_0)} = \frac{\sum_{s_1^T : s_t = c} \prod_{t=1}^{T} f(x_t \mid s_t) \, p(s_t \mid s_{t-1})}{f(x_1^T \mid s_0)}.    (5)

Finally, the probability of a particular concept c ∈ V being present (somewhere) in an image may be calculated as

    p(c \in C \mid I, s_0) \propto \sum_{t=1}^{T} p(s_t = c \mid x_1^T, s_0).    (6)
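To make the computation in (2) and (4)-(6) concrete, the following is a minimal NumPy sketch, not the authors' HTK-based implementation, of the forward-backward recursion that yields the state posteriors of (5) and the concept scores of (6) for a fully connected HMM with diagonal-covariance Gaussian-mixture emissions; all function and variable names are ours.

```python
import numpy as np
from scipy.special import logsumexp

def log_gmm(x, weights, means, variances):
    """log f(x | c) of Eq. (2) for one concept, with diagonal covariances.
    x: (d,); weights: (M,); means, variances: (M, d)."""
    diff = x - means
    log_comp = -0.5 * (np.log(2.0 * np.pi * variances) + diff ** 2 / variances).sum(axis=1)
    return logsumexp(np.log(weights) + log_comp)

def concept_posteriors(X, gmms, log_A, log_pi):
    """Forward-backward over a fully connected concept HMM.

    X: (T, d) block features of one image; gmms: list of (weights, means,
    variances), one per concept in V; log_A: (V, V) log p(s_t | s_{t-1});
    log_pi: (V,) log p(s_1 | s_0).  Returns gamma[t, c] = p(s_t = c | x_1^T)
    as in Eq. (5) and the per-image concept scores of Eq. (6).
    """
    T, V = X.shape[0], len(gmms)
    log_b = np.array([[log_gmm(x, *g) for g in gmms] for x in X])        # (T, V)

    log_alpha = np.zeros((T, V))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):                                                # forward pass
        log_alpha[t] = log_b[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)

    log_beta = np.zeros((T, V))
    for t in range(T - 2, -1, -1):                                       # backward pass
        log_beta[t] = logsumexp(log_A + log_b[t + 1] + log_beta[t + 1], axis=1)

    log_evidence = logsumexp(log_alpha[-1])                              # log f(x_1^T | s_0), Eq. (4)
    gamma = np.exp(log_alpha + log_beta - log_evidence)                  # Eq. (5)
    # Eq. (6), up to normalization (the paper excludes the null state from this sum).
    return gamma, gamma.sum(axis=0)
```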

Thus jointly modeling the image-features and caption-text as the realization of such a generative stochastic process, i.e. an HMM, provides for

1. estimating the parameters of the model from annotated image+caption pairs,
2. aligning image-regions with caption-words in an image+caption pair, and
3. computing the likelihood of a caption-word being present in an image.

HMMs have several strengths, and a few limitations, as models of stochastic processes, which our methodology inherits.

• It seems natural to use a common density f(·|c) to model image-features of regions corresponding to a concept c across different images, and to use different densities to model image-features corresponding to different concepts in the same image. Image-regions in our dataset are not manually aligned to words in the image-captions; HMMs provide a natural way to treat the missing alignment information.

• HMMs provide efficient algorithms for the model estimation and inference described in Steps 1-3 above. In particular, indexing a large number of (test) images simultaneously with all the concepts in the vocabulary V is efficient: the computation grows linearly in the size of the test collection and, at worst, quadratically in the number of concepts, and does not depend on the size of the training collection.

• Features {x_t} of adjacent image regions are assumed to be conditionally independent given the underlying concepts. We plan to address this limitation in future work using context-dependent HMM states and spatial derivatives of the image-features.

A freely available toolkit, HTK [12], efficiently implements all the basic parameter estimation and probability calculation algorithms needed for working with the model described above, and this is what we have used in the experiments described in Section 4 below.

Note that the model proposed above differs fundamentally from the use of HMMs by [9] for image annotation. In their work, different HMMs, one for each category (concept), are first estimated from all images labeled with that category. The states of the HMMs have no correspondence with any semantic concepts in an image. Only the likelihood (4) of each image in a test collection is computed under the HMM of every category. The actual assignment of index terms to a test image is done post hoc, based on (a few) categories under whose HMMs the test image is highly likely.

2.1 HMMs vs Image Annotation as MT

In some recent work, Duygulu et al. [4] have presented the image annotation problem as a variant on the statistical machine translation problem. In their formulation, the representations x_t of the segmented image regions are discretized and then treated as "words," replacing R^d with a vocabulary X of visual expressions (visterms). The training corpus of image+caption pairs is then treated as a corpus of aligned bilingual text, and statistical machine translation techniques are employed to infer a stochastic translation lexicon p(x|c) between c ∈ V and x ∈ X. Given a test image I with visterms x_1^T, a probability is calculated for each concept c in the caption-vocabulary (in the simplest case) as

    p(c \in C \mid I) = \sum_{x \in X} p(c \mid x) \, p(x \mid I) = \sum_{t=1}^{T} p(c \mid x_t) \, \frac{1}{T}.

Note that the estimation of p(x|c) in their case, particularly when using IBM Model-1, is identical in form to the embedded estimation of f(x|c) in the HMM case, provided we set the transition probabilities to be uniform. In both cases, the alignment of individual image-regions to caption-words is treated as a hidden variable and the EM algorithm is employed. Thus the primary difference between our model and theirs is that the image features are modeled here as continuous-valued vectors, avoiding the need for quantization. A lesser difference is that unlike the object-based segmentation in their case, we have used regular rectangles to segment the image. In more recent work [6], the model of [4] has been investigated with rectangular regions. It will be shown in Section 4 that preserving the continuous-valued representation of the x_t's results in significant improvement in annotation and retrieval performance.
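For orientation, here is a small sketch (ours, not code from [4]) of the discrete-visterm scoring rule above; the codebook and the translation table p(c|x) are assumed to have been trained separately (e.g. with IBM Model-1, which is not shown).

```python
import numpy as np

def mt_concept_scores(X, codebook, p_c_given_x):
    """Score p(c in C | I) = (1/T) * sum_t p(c | x_t) after vector quantization.

    X: (T, d) block features; codebook: (K, d) visterm centroids;
    p_c_given_x: (K, V) translation probabilities p(c | visterm k).
    """
    # Map each block to its nearest visterm (the discretization step of [4]).
    dists = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)   # (T, K)
    visterms = dists.argmin(axis=1)                                     # (T,)
    # Average the translation probabilities over the T blocks.
    return p_c_given_x[visterms].mean(axis=0)                           # (V,)
```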

2.2 HMMs vs Continuous Relevance Models

Manmatha et al. [7, 8] have used a continuous version of the relevance model (CRM) to perform image annotation and retrieval. In their case, continuous-valued visual features {x_t} are extracted from rectangular regions of the (test) image I and compared with the visual features extracted from each training image J using a Gaussian kernel function. In particular, for a training image J with visual features {r_1, ..., r_T} and caption {y_1, ..., y_N},

    k(x \mid J) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \, e^{-\frac{1}{2}(x - r_t)^T \Sigma^{-1} (x - r_t)}.

The parameter Σ is estimated from held-out data. This kernel then yields the probability of each concept being present in I as

    p(c \mid I) \propto \sum_{J} \hat{P}(c \mid J) \, k(I, J) = \sum_{J} \hat{P}(c \mid J) \left[ \prod_{t=1}^{T} k(x_t \mid J) \right],

where P̂ is computed from the caption of J. Note that the sum need only run over the training images which assign a nonzero P̂-probability to the concept c. But for this respite, each test image needs to be compared, in principle, with every training image during annotation. While, upon first glance, the continuous relevance model seems to differ fundamentally from the HMM paradigm proposed here, deeper analysis reveals remarkable similarities. Each concept c ∈ V is modeled in the HMM by a mixture of Gaussians. The maximization of the joint likelihood (1) results in the emission probabilities f(x|c) in the HMM being estimated by a fractional assignment of each r_t ∈ J to one of the states y_n; i.e. image regions r_t from all images labeled with c contribute to the estimate of the probability density function (pdf) f(x|c). Therefore, when the likelihood of a region x in a test image is computed, concepts whose pdf's were estimated from "similar looking" vectors r_t will have high a posteriori probability (6). In particular, if each state c ∈ V has exactly as many Gaussian pdf's in the mixture as the number of images containing c in their caption, if the mixture weights w_m are uniform, if each Gaussian is modeled by a mean μ_{m,c} = r_t, and if the variance Σ is held constant, there is a remarkable similarity between the probability calculations of the HMM and the continuous relevance model. As may be anticipated, the annotation performance of the HMM and the relevance model are comparable, as will be shown in Section 4.

In a nutshell, the HMM performs an abstraction of the information presented in the paired image+caption training data. For each concept c ∈ V, only the sufficient statistics (mean and covariance) of the image features are retained. By contrast, the relevance model retains the entire training corpus for comparison with the test image. Consequently, the HMM performs the test annotations very efficiently, since each image-feature vector needs to be compared only with the Gaussian-mixture representation of each concept, not each training image. This results in tremendous computational speed-ups compared to the relevance model. Our approach is also much more scalable as a consequence. In our current experiments (see Sec. 4) the average time required to annotate an image with all the concepts is less than 2 seconds. Also, the annotation is readily parallelized on multiple CPUs, leading to further speed-ups.
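The following is a minimal sketch, in our own notation, of the CRM scoring rule above; it is written for clarity rather than efficiency (in practice one would keep the per-image products in the log domain throughout to avoid underflow).

```python
import numpy as np
from scipy.special import logsumexp

def crm_concept_scores(X, train_images, sigma2):
    """Continuous relevance model scoring, as sketched above.

    X: (T, d) block features of the test image.
    train_images: list of (R, P_hat) pairs, where R is the (T_J, d) matrix of
        block features of training image J and P_hat its (V,) caption-based
        concept probabilities.
    sigma2: (d,) shared diagonal variance, tuned on held-out data.
    Returns a (V,) vector proportional to p(c | I).
    """
    log_norm = -0.5 * np.log(2.0 * np.pi * sigma2).sum()
    scores = np.zeros(train_images[0][1].shape)
    for R, P_hat in train_images:
        diff = X[:, None, :] - R[None, :, :]                       # (T, T_J, d)
        log_k = log_norm - 0.5 * (diff ** 2 / sigma2).sum(axis=2)  # Gaussian kernels, (T, T_J)
        # log k(x_t | J): average the kernels over the blocks of training image J.
        log_kxJ = logsumexp(log_k, axis=1) - np.log(R.shape[0])    # (T,)
        # Accumulate P_hat(c | J) * prod_t k(x_t | J).
        scores += P_hat * np.exp(log_kxJ.sum())
    return scores
```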

3. HMM DESIGN AND TRAINING ISSUES

The aforementioned comparison with the relevance model ought to convince the reader that the HMM paradigm permits several design choices which need to be explored. We briefly describe some of the major issues in this section.

Fixed vs Object-Based Image Segmentation: Before we begin, note that we partition each image into a fixed number of rectangular blocks. One problem with rectangular partitioning is that a block may contain pixels belonging to different "objects" (or concepts), whereas our statistical model assumes that each block is generated by the state corresponding to a single object (concept). However, it was shown in [5] that models trained on rectangular partitions outperform those trained on object-based partitions. The decision to work with fixed blocks is also driven by pragmatic concerns: we were kindly provided the processed training and test data from the 2004 Johns Hopkins University Summer Workshop [6], where considerable effort had been made in extracting image-features for the entire COREL and TRECVID datasets; since state-of-the-art image annotation and retrieval performance has been published on these datasets, they provide a convenient venue for comparisons with related techniques.

Visual Feature Vectors: The workshop team extracted color, texture and oriented-edge features of 30 dimensions (COREL) or 76 dimensions (TRECVID) from a rectangular partition of the images, and this is what we have used in our experiments. In the case of the COREL images there are 4 × 6 blocks per image (T = 24), and in the case of the TRECVID collection it is 5 × 7 (T = 35). The block size, we are told, is an ad hoc balance between two conflicting criteria: one would like each block to be sufficiently small to contain only one of the concepts with which the image is annotated, and at the same time one would like it to be as big as needed to cover the entire object, so that features x_t that characterize the object may be easily localized.

HMM Topology and Permitted Transitions: We begin the HMM-design discussion by noting that, given a set of image+caption pairs, we construct for each image an HMM with as many states as the number of words in its caption. The states of an HMM are fully connected, permitting any state to follow any other state. This corresponds to permitting any concept to occur in the image next to any other concept. One could, of course, investigate exploiting the spatial propensities of concepts such as sky and grass to appear in certain positions t in the image, and the propensity of certain concepts such as tiger and sky (not) to appear in adjacent blocks, but we have not done so in this preliminary investigation. Every state has an equal likelihood of being reached at every position t.

Shared State HMMs: A caption-word c ∈ V appearing in two different images must, of course, be modeled by the same state. Therefore the HMM for each image "shares" states from a common pool of |V| states. In HTK parlance, the states of the HMM of each training image are "tied" to the corresponding states of other HMMs. Note that |V| = 375 for the COREL dataset and |V| = 75 for TRECVID, as provided to us by [6].

Accounting for Unlabeled Objects: An additional concept, which we call null, has been introduced into the vocabulary and is added to the caption of each image. The primary motivation for doing so was our observation that in several images the annotators did not label some prominent concept, in spite of there being a word in V corresponding to it. Figure 2 depicts an image in which the sky occupies more than a third of the image, and yet the image is annotated with sand, beach, trees and people but not sky. The null state is akin to the silence model used in automatic speech recognition: in both cases, since the stochastic model of the image or speech is a generative model, the entire image or speech sequence needs to be accounted for by the model, and adding an optional silence or null to the transcripts or captions results in better modeling of the data. (Unlike the speech recognition application, however, we did not include the null state in the calculation of (6), since a null has no direct benefit for image retrieval.)

Figure 2: Illustration of Unlabeled Objects in Images: this image was labeled with sand, beach, trees and people but not sky.
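As an illustration of the shared-state construction just described, here is a short sketch of our own (not the HTK configuration actually used): each training image gets a fully connected HMM over its caption words plus null, and all images draw their emission models from one shared pool, which is what "ties" the states.

```python
import numpy as np

NULL = "null"  # extra concept added to every caption (see above)

def build_image_hmm(caption, shared_pool, feat_dim, global_mean, global_var):
    """Build the HMM for one training image: one state per caption word plus
    null, fully connected with uniform transitions.  Emission models are drawn
    from a pool keyed by concept, so two images with a common caption word
    share (are 'tied' to) the same emission model.  New concepts get a
    single-Gaussian model initialized from the global mean and variance, as
    described in Section 3.1."""
    states = sorted(set(caption)) + [NULL]
    n = len(states)
    trans = np.full((n, n), 1.0 / n)                 # fully connected, uniform
    for c in states:
        shared_pool.setdefault(c, {
            "weights": np.ones(1),
            "means": global_mean.reshape(1, feat_dim).copy(),
            "variances": global_var.reshape(1, feat_dim).copy(),
        })
    emissions = [shared_pool[c] for c in states]     # shared objects, not copies
    return states, trans, emissions

# Usage: one pool shared by every training image (placeholder global statistics).
pool = {}
gmean, gvar = np.zeros(30), np.ones(30)
build_image_hmm(["tiger", "grass", "water", "ground"], pool, 30, gmean, gvar)
```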

3.1 HMM Training Procedure

Once the visual features of the images are extracted and the HMM for each training image defined, estimation of the HMM parameters proceeds in a standard manner as prescribed in [12]. In particular, we first use a single Gaussian pdf for each state c ∈ V. Each Gaussian pdf is initialized with the common mean and variance computed from all the images in the training data. Several iterations of the EM algorithm are carried out, updating the means and variances of the HMMs. We note that the transition probabilities are not updated in the experiments reported below; however, they have a negligible effect on the likelihood and therefore, hopefully, on the eventual annotation performance.

Mixture Splitting: Once the single Gaussian pdf's have been estimated, the pdf for each state c ∈ V is replaced by a mixture of a pair of Gaussian pdf's obtained by minor perturbation. Further iterations of the EM algorithm are then carried out to (re)estimate the Gaussian mixture pdf's. This procedure is standard in training speech recognition systems and the details are described in [12]. Note that if the image features x_t that "align" with any state are very well modeled by a mixture of a certain number of Gaussian pdf's, i.e. there is no gain in likelihood from increasing the complexity of the pdf, then one may choose not to "mix up" the pdf of that state. Such a model-based stopping criterion has been implemented in the HTK tools, and the experimental results reported below are for this default setting.

Setting the Variance-Floor: One subtle consideration in estimating Gaussian pdf's is the estimate of the variance. If a large number of mixture components are used, the amount of data used to estimate each component may be small, and the corresponding estimate of the variance may be too low. Besides being inappropriate as a statistical model, this gives rise to numerical problems (divide-by-zero) during the probability calculations. It is standard to impose a "variance floor", a minimum value for the estimate of the variance. Typically, this is set to be a fraction of the total empirical variance of all the image data.

Maximum Mixture Size: As long as the data-likelihood increases with the number of mixture components in f(x|c), we continue to increase the mixture size of the output pdf's of each state until a certain pre-specified maximum mixture size is attained.
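The two operations above can be summarized in a few lines; the sketch below is a simplified stand-in of ours (HTK's own mixture-up operation and flooring use their own heuristics), intended only to make the mechanics concrete.

```python
import numpy as np

def split_mixture(weights, means, variances, perturb=0.2):
    """Double the number of Gaussian components by perturbing each mean along
    its standard deviation (simplified: every component is split, whereas HTK
    applies its own selection heuristics)."""
    offset = perturb * np.sqrt(variances)
    new_weights = np.concatenate([weights, weights]) / 2.0
    new_means = np.concatenate([means + offset, means - offset])
    new_variances = np.concatenate([variances, variances])
    return new_weights, new_means, new_variances

def floor_variances(variances, global_variance, floor_fraction=0.001):
    """Clamp each diagonal variance to a fraction of the total empirical
    variance of the image data, e.g. 0.001 for a 0.1% floor."""
    return np.maximum(variances, floor_fraction * global_variance)
```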

3.2 Image Annotation using the HMM

Once the output pdf's f(x|c) of all the states c ∈ V have been estimated from the annotated image+caption pairs, we proceed to construct the so-called decoding HMM. This is simply a fully connected |V|-state HMM, with one state corresponding to each concept in the vocabulary. Transition probabilities of the HMM are set to be uniform. Given a test image, we run the Baum-Welch (forward-backward) computation to obtain the posterior probability of (6). Once this is done for each image in the test collection, we are ready for image retrieval experiments.

Note that the transition probabilities of the decoding HMM need not be uniform. A simple variation is to compute the co-occurrence statistics of the words in the concept vocabulary, and use those frequencies to adjust the transition probabilities. For instance, plane and sky tend to co-occur much more often than plane and water, suggesting that p(sky|plane) could be set higher than p(water|plane). In a contrastive experiment, we set these probabilities to be proportional to the corresponding relative frequencies, with adequate smoothing to avoid zero probabilities. This setting of transition probabilities is akin to the language model in speech recognition, and an experiment with this setting is so named in the results below.
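A minimal sketch (ours, with assumed smoothing details) of the co-occurrence-based transition estimate just described is shown below.

```python
import numpy as np

def cooccurrence_transitions(captions, vocab, smoothing=1.0):
    """Estimate p(c' | c) from how often concepts co-occur in training captions,
    with additive smoothing so that no transition gets zero probability (a sketch
    of the contrastive 'language model' setting described above)."""
    index = {c: i for i, c in enumerate(vocab)}
    counts = np.full((len(vocab), len(vocab)), smoothing)
    for caption in captions:
        ids = [index[c] for c in set(caption) if c in index]
        for i in ids:
            for j in ids:
                if i != j:
                    counts[i, j] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)
```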

4. EXPERIMENTAL RESULTS

We present image retrieval results on two collections as described below, one of still images and another of key-frames extracted from video.

The COREL Image Dataset consists of 5000 images from 50 Corel Stock Photo CDs, provided to us by [4]. Each CD contains 100 images with a certain theme (e.g. polar bears), of which 90 are designated to be in the training set and 10 in the test set, resulting in 4500 training images and a balanced 500-image test collection. We note that this collection and training/test division is also that used by [7] and [5]. As stated earlier, each image is divided into 24 rectangular regions, and 30-dimensional image features, extracted from each region, were provided to us by [6]. The vocabulary has |V| = 375 words.

The TRECVID Dataset consists of key-frames extracted from Broadcast News video from several sources, such as ABC News and C-SPAN. A community-wide effort has resulted in 44,100 images from the video portion of the 2002 TREC SDR corpus being annotated with a set of 137 concepts arranged in a hierarchical tree. This collection has been divided into four parts by [1], named "concept-train," "concept-validate," "concept-fusion-1" and "concept-fusion-2." Each key-frame has been divided into 35 rectangles, and 76-dimensional visual feature-vectors were extracted for each image-block by researchers at the 2004 Johns Hopkins Summer Workshop [6]. The workshop team also culled the concept vocabulary to remove annotations with very low counts, etc., so that the resulting collection of 38K images has captions covered by a |V| = 75 word vocabulary. Of the 44K manually labeled images, roughly 35K are used to train the HMMs, and about 9K are held out for image retrieval experiments. We again acknowledge the generosity of the workshop team in letting us use their training and test data in our experiments.

The TRECVID-2003 Dataset consists of about 32K key-frames, again from Broadcast News sources [10], and was used for multiple tasks in the 2003 benchmark tests conducted by the U.S. National Institute of Standards and Technology. One of the tasks, named "feature detection", corresponds to our image annotation task. Participants were asked to detect 17 different features (concepts) in each shot of the video. Relevance judgements were created post hoc by a pooling of the retrieved shots of participating systems in the usual manner. Of these 17 concepts, 6 were of the sort that we are unable to cover using the generic concept annotation we trained our HMMs for, e.g. Madeleine Albright and Physical Violence. Of the remaining 11 concepts, some, such as Buildings, correspond directly with a concept in our vocabulary, making it easy for us to retrieve images for them. Finally, some concepts, such as Sporting Events, may be synthesized by a union of events in our vocabulary, e.g. basketball, hockey, football, baseball, etc. We therefore present retrieval results for this subset of 11 concepts (or features).

Evaluating Image Annotation: Since our goal is to evaluate image annotation performance in the service of image retrieval, we follow [7] and construct single-word queries for each word in our concept vocabulary V. Thus there are 375 queries for the COREL dataset (of which 260 have at least one relevant image in the test collection) and 75 for the TRECVID dataset. We retrieve images from the appropriate manually-labeled test sets (500 images for COREL and 9000 for TRECVID). All test images whose caption contains the query-word are deemed "relevant" to that query, and all other test images are deemed irrelevant. Standard mean average precision (mAP) is calculated over all queries using the standard trec_eval protocols, and precision-recall curves are plotted where appropriate.
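For reference, here is a small sketch (ours; the reported numbers come from the standard trec_eval tool) of how a single-word query is scored with uninterpolated average precision, and how the per-query values are combined into mAP.

```python
import numpy as np

def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one single-word query: images are
    ranked by p(c in C | I), and an image is relevant if the query word appears
    in its manual caption."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(per_query_aps):
    """mAP over all single-word queries."""
    return float(np.mean(per_query_aps))
```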

4.1 Annotation Performance: COREL

Tables 1 and 2 show the mean average precision of the HMM-based image annotation system on the 500 COREL images. Two primary design parameters are investigated: the maximum number of Gaussian mixture components in f(x|c), and the variance-floor set during re-estimation of the HMM parameters. All experiments use a word co-occurrence based language model during decoding. It is clear from the two tables that performance improves up to a certain point as more and more complex models are built by adding mixture components, and then begins to level off. It is also clear that a very conservative variance floor (1% of the global variance of all image blocks) hurts performance, and that individual concepts have sufficiently sharp densities that a variance floor of 0.1% is significantly better. The best performance is an mAP of 0.19. We note for the sake of comparison that on the 260-word query set the normalized CRM of [5] has an mAP of 0.26, which is significantly higher than that of the HMM, while the MT-based model of [4] on the same training and test sets has an mAP of 0.15 [6].

Mixture-size    1      2      8      10     12
mAP             0.142  0.169  0.176  0.178  0.177

Table 1: Effect of Mixture-Size on Mean Average Precision for 260 One-Word Queries on COREL. Variance-Floor is set to 0.01.

Variance-Floor  0.01   0.002  0.001  0.0005
mAP             0.178  0.179  0.192  0.185

Table 2: Effect of the Variance-Floor on Mean Average Precision for 260 One-Word Queries on COREL. Number of mixture components is set to 10.

4.2 Annotation Performance: TRECVID

The annotation performance of the HMM as a function of an increasing number of mixture components on the TRECVID dataset is shown in Table 3. It confirms the earlier conclusion based on the COREL data that, up to a certain point, increasing model complexity improves performance. Note that the TRECVID training set is significantly larger than the COREL one, and hence performance continues to increase even up to 20-Gaussian mixture densities. The improvement in mAP from 10 to 20 mixture components is significant at a p-value less than 0.0001, while that from 16 to 20 is significant at 0.006.

Mixture size    1      10     16     20     22
mAP             0.095  0.141  0.157  0.163  0.163

Table 3: Effect of Mixture-Size on Mean Average Precision for 75 One-Word Queries on TRECVID. Variance-Floor is set to 0.01.

Variance-Floor  0.01   0.001  0.0002
mAP             0.163  0.166  0.162

Table 4: Effect of the Variance-Floor on Mean Average Precision for 75 One-Word Queries on TRECVID. Number of mixture components is set to 20.

Before proceeding to increase the number of mixture components beyond 20, it is fair to contemplate whether a concept which has been attested, say, 20 times in the training images should be modeled by an HMM state with a mixture of more than 20 Gaussian pdf's. Of course, each image has up to 35 feature vectors and a concept may span more than one image block. In order to establish where this trade-off may lie, we undertook a controlled mixture growth experiment. With a variance floor of 0.001, we estimated 4 different HMMs. The output pdf f(x|c) for concept c was

1. "mixed-up" to a maximum of 20 pdf's no matter how many tokens were seen in the 28K image+caption pairs,
2. "mixed-up" to a maximum of 1 pdf per 3 tokens of c,
3. "mixed-up" to a maximum of 1 pdf per 5 tokens of c,
4. "mixed-up" to a maximum of 1 pdf per 15 tokens of c.

Note that the very frequent concepts are not affected by this, since many of them have more than 300 tokens. It is only the infrequent concepts whose maximum mixture size is progressively reduced in the successive experiments. Table 5 presents the results for the increasingly conservative model growth strategies.

Max mixture size = 20, limited by:   1 pdf per 15 tokens   5 tokens   3 tokens   no limit
mAP                                  0.134                 0.157      0.167      0.168

Table 5: Annotation Performance for Progressively Conservative Limits on the Maximum Number of Mixture Components for each Concept.

The reader should be convinced that the TRECVID dataset is large enough not to take a conservative view about model size in the range in which we are operating. Therefore, we went ahead and estimated an HMM system with up to 100 Gaussian pdf's per concept (state), limited of course by the likelihood-gain based automatic limits implemented in HTK. The annotation performance results of the final system are presented in Table 6.

Max mixture size = 100, limited by:  1 pdf per 15 tokens   5 tokens   3 tokens   no limit
mAP                                  0.153                 0.179      0.184      0.185

Table 6: Annotation Performance with up to 100 Gaussian pdf's for each Concept.

We plot the 11-point precision-recall curve for the 100-Gaussian HMM system in Figure 3. The mAP of 0.185 for the 100-Gaussian system is significantly better than the 0.168 mAP for the 20-Gaussian system (p-value = 0.0012). We note from [6] that the MT-based system has a mean average precision of 0.13 and the Normalized CRM has 0.16 for identical training and test partitions and feature extraction. The MT-based system discretizes the x_t's while the CRM uses the 76-dim features. We believe that the HMM-based system is significantly better than either of them. (The authors of [5] state that the CRM attains a mean average precision of 0.18 on this task using 30-dimensional features instead of 76-dim.) The HMM-based approach is therefore competitive with the best of the available techniques.

Figure 3: Precision Recall Curve for 75 One-Word Queries on the TRECVID Dataset.

Figure 4 illustrates some of the more successful queries out of the 75 by showing the top 5 retrieved images, and the precision-recall curves for these queries. As the reader may notice, the images have several visually striking features, which the HMM has apparently managed to model even with such a primitive model. Note that one of the queries shown, weather news, happens to be among the features prescribed in the TRECVID-2003 feature-detection task, as described next.

4.3 Detection Performance: TRECVID-2003

We mapped 11 of the 17 "feature detection" queries from TRECVID-2003 onto one or more concepts in our 75-word vocabulary, and performed image retrieval using the HMM developed on the TRECVID dataset. Some features, such as sporting events, were obtained by the union of concepts such as hockey, baseball, etc., while others, such as weather, had direct analogues in V. The average precision for the 11 concepts is reported in Table 7 and Figure 5. Also reported in Table 7 is the average precision of an almost comparable stage (BOU) from [1]. Note that BOU represents the best of three feature detectors, only one of which is comparable to our HMM based system. Since the overall system submitted by [1] was the best performing feature detection system in TRECVID-2003, we feel confident in claiming that the HMM based methods proposed here are competitive with the state of the art.

Figure 5: Precision Recall Curve for 11 Features on TRECVID-2003.

No.    Feature                   HMM     IBM
1011   Outdoors                  0.043   0.138
1012   News subject face         --      0.161
1013   People                    0.151   0.217
1014   Building                  0.034   0.09
1015   Road                      0.036   0.271
1016   Vegetation                0.395   0.271
1017   Animal                    0.042   0.02
1018   Female speech             --      0.075
1019   Car/truck/bus             0.06    0.194
1020   Aircraft                  0.093   0.237
1021   News subject monologue    --      0.043
1022   Non-studio setting        0.084   0.073
1023   Sporting event            0.48    0.467
1024   Weather news              0.769   0.584
1025   Zoom in                   --      0.118
1026   Physical violence         --      0.076
1027   Madeleine Albright        --      0.316
       mAP                       0.199   0.186

Table 7: Feature Detection Performance on the TRECVID-2003 Feature-Detection Task. (A dash marks one of the six features not covered by our generic concept vocabulary.)
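For illustration, a minimal sketch of the concept-union mapping described above; the max-over-concepts combination rule and the exact concept lists are our assumptions, since the paper names only example concepts.

```python
def feature_score(image_scores, concept_union):
    """Score a TRECVID-2003 feature as a union of vocabulary concepts.
    The combination rule (maximum posterior over the mapped concepts) is an
    assumption of this sketch, not specified in the paper.

    image_scores: dict concept -> p(c in C | I) for one key-frame.
    concept_union: list of vocabulary concepts mapped to the feature.
    """
    return max(image_scores.get(c, 0.0) for c in concept_union)

# Hypothetical mapping for illustration only; the paper lists these as examples
# of concepts whose union synthesizes the "sporting event" feature.
SPORTING_EVENT = ["basketball", "hockey", "football", "baseball"]
```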

5. DISCUSSION AND CONCLUSIONS

We have presented a novel method for image annotation for the purpose of content-based image retrieval, and evaluated it in several ways to establish that it is indeed competitive with or better than the state of the art. The system was implemented with publicly available tools and tested on standard datasets.

Figure 4: Top 5 Retrieved Images for Some Concepts from the TRECVID Dataset.

Much more is known in the speech recognition literature about training and decoding using hidden Markov models, including context-dependent modeling, unsupervised adaptation, adaptive and discriminative training, and minimum error decoding, all of which remains to be applied to the image annotation and retrieval problem. Each of these techniques has resulted in significant improvements in speech recognition accuracy, and together they constitute the bulk of the progress in speech recognition research in the last decade. Adaptation, for instance, adjusts the models to overall variations from image to image; discriminative training replaces the maximization of the likelihood (1) with maximization of the ability to discriminate concepts; and context-dependent modeling accounts for the influence of adjacent objects on the visual features of an object. We expect that significant improvements in image annotation and retrieval performance will be obtained by intelligently importing these ideas from speech recognition.

6. ACKNOWLEDGMENTS

The authors gratefully acknowledge the considerable assistance, particularly standardized image datasets and visual features, provided by the Johns Hopkins Summer Workshop team in conducting these experiments. This research was partially supported by the U.S. National Science Foundation via Grant No. IIS-0121285, the U.S. Department of Defense via Contract No. MDA-904-01-C-1005, and the Czech Academy of Sciences via Grant No. 1ET101470416.

7. REFERENCES

[1] A. Amir et al. IBM Research TRECVID-2003 Video Retrieval System. In Proc. TRECVID2003, November 2003.
[2] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.
[3] D. M. Blei and M. I. Jordan. Modeling Annotated Data. In 26th Annual International ACM SIGIR Conference, pages 127–134, 2003.
[4] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. In Seventh European Conference on Computer Vision, volume 4, pages 97–112, 2002.
[5] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages II-1002–II-1009, 2004.
[6] G. Iyengar et al. Joint Visual-Text Modeling for Multimedia Retrieval. Available at: http://www.clsp.jhu.edu/ws2004/groups/ws04vstxt/, 2004.
[7] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic Image Annotation and Retrieval using Cross-Media Relevance Models. In 26th Annual International ACM SIGIR Conference, pages 119–126, 2003.
[8] V. Lavrenko, S. L. Feng, and R. Manmatha. Statistical models for automatic video annotation and retrieval. In Proc. IEEE International Conf. on Acoustics, Speech and Signal Processing, volume 3, pages 17–21, May 2003.
[9] J. Li and J. Z. Wang. Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(9):1075–1088, 2003.
[10] NIST. Proceedings of the TREC Video Retrieval Evaluation Conference (TRECVID2003), November 2003.
[11] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–286, 1989.
[12] S. Young et al. The HTK Book. 2002.
[13] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages I-511–I-518, December 2001.