Bayesian Representations and Learning Mechanisms for Content-Based Image Retrieval

Nuno Vasconcelos and Andrew Lippman
MIT Media Laboratory, 20 Ames St, E15-354, Cambridge, MA 02139

Further author information: (Send correspondence to N.V.)
N.V.: E-mail: [email protected], URL: http://www.media.mit.edu/~nuno
A.L.: E-mail: [email protected]

ABSTRACT

We have previously introduced a Bayesian framework for content-based image retrieval (CBIR) that relies on a generative model for feature representation based on embedded mixtures. This is a truly generic image representation that can jointly model color and texture and has been shown to perform well across a broad spectrum of image databases. In this paper, we expand the Bayesian framework along two directions. First, we show that the formulation of CBIR as a problem of Bayesian inference leads to a natural criterion for evaluating local image similarity without requiring any image segmentation. This allows the practical implementation of retrieval systems where users can provide image regions, or objects, as queries. Region-based queries are significantly less ambiguous than queries based on entire images, leading to significant improvements in retrieval precision. Second, we present a Bayesian learning algorithm that relies on belief propagation to integrate the feedback provided by the user over a retrieval session. When combined with local similarity, this algorithm leads to a powerful paradigm for user interaction. Experimental results show that significant improvements in the frequency of convergence to the relevant images can be achieved by the inclusion of learning in the retrieval process.

Keywords: Bayesian image retrieval, local vs. global queries, Bayesian learning, relevance feedback, generative models, embedded mixtures

1. INTRODUCTION

Due to the large amounts of imagery that can now be accessed and managed via computers, the problem of CBIR has recently attracted significant interest from the vision and image processing communities. Unlike most traditional vision applications, very few assumptions about the content of the images to be analyzed are allowable in the context of CBIR. This restricts the space of valid image representations to those of a generic (and typically low-level) nature, making the image understanding problem even more complex. On the other hand, CBIR systems have access to feedback from their users that can be exploited to simplify the task of finding the desired images. There are, therefore, two fundamental problems to be addressed: first, the design of the image representation itself and, second, the design of learning mechanisms to facilitate the interaction. The two problems cannot, however, be solved in isolation, as a careless selection of the representation will make learning more difficult, and vice versa.

The impact of a poor image representation on the difficulty of the learning problem is visible in CBIR systems that rely on holistic metrics of image similarity, forcing user feedback to be relative to entire images. In response to a query, the CBIR system suggests a few images and the user rates those images according to how well they satisfy the goals of the search. Because each image usually contains several different objects or visual concepts, this rating is both difficult and inefficient. How can the user rate an image that contains the concept of interest, but in which this concept only occupies 30% of the field of view, the remaining 70% being filled with completely unrelated stuff? And how many example images will the CBIR system have to see in order to figure out what the concept of interest is?

A much better interaction paradigm is to let the user explicitly select the regions of the image that are relevant to the search, i.e. user feedback at the region level. However, region-based feedback requires sophisticated image representations. The problem is that the most obvious choice, object-based representations, is difficult to implement because 1) it is still too hard to segment arbitrary images in a meaningful way and 2) segmentation leads to a combinatorial explosion in the number of similarity evaluations to be performed. In this paper we argue that a better formulation is to view the problem as one of Bayesian inference and rely on probabilistic image representations, showing that such a formulation naturally leads to representations that support region-based interaction and learning without the need for segmentation.

2. PRIOR WORK

While the importance of both local queries and relevance feedback has been recognized by various authors, very few of the systems developed so far can actually satisfy these two requirements.

2.1. Local queries

The common solution for handling local queries is to rely on image segmentation and then perform retrieval on the individual segments, i.e. evaluate the similarity of each query region against all the regions extracted from the images in the database. This approach suffers from two fundamental problems: 1) segmentation is hard, and 2) there is a combinatorial explosion in the number of similarity evaluations to be performed.

Despite the difficulty of automatic image segmentation, several retrieval systems have relied on it to determine image regions. While, in theory, precise image segmentation enables shape-based retrieval, in practice it is not uncommon for a segmentation algorithm to break a single object into several regions or to unify various objects into one region, making shape-based similarity close to hopeless. Hence, even when automated segmentation is used, shape representations tend to be very crude. It is therefore not clear that precise segmentation is an advantage for region-based queries. In fact, the use of sophisticated segmentation can be more harmful than beneficial: for example, in the context of "blob-world", Howe reports significant improvements from replacing the sophisticated segmentation algorithm used by Belongie et al with a much simpler variation. The only clear exceptions to this observation seem to be applications where it is possible to manually pre-segment all the imagery, because 1) there is an economic incentive to do so and 2) it is very clear what portions of each database image will be relevant to the queries posed to the retrieval system. An example of such an application domain is medical imaging, in particular lesion diagnostics. On the contrary, for generic databases there is usually too much imagery to allow manual processing, and it is rarely known what specific objects may be of interest to the users of such databases.

Since precise segmentations are hard, several authors have adopted the simplifying view of relying on arbitrary image partitions to obtain local information. While this solves the problem of segmentation complexity without noticeable degradation of performance (in fact, it does not even seem clear at this point that segmentation works better than arbitrary image partitioning), it still does not address the second problem, i.e. the combinatorial explosion associated with matching all image segments. To overcome this difficulty, several mechanisms have been proposed in the literature. The simplest among these is to make the individual regions large enough, and their feature representation compact enough, that each image can still be represented by a single feature vector (the concatenation of the individual region features) of manageable dimension. Such approaches are of limited use for local queries since 1) several objects or visual concepts may fall on a single image region, 2) the feature representations are not expressive enough to finely characterize each region and 3) it is hard to guarantee invariance to image transformations when dealing with regions of large spatial support. An alternative view is not to worry about compactness and simply deal with the combinatorics of region-based retrieval at the level of traditional database indexing. Minka and Picard propose clustering of the individual image regions as a database organization step that significantly reduces query time (since query regions are matched against cluster representatives instead of all the members).
The use of clustering as an indexing tool has two major disadvantages: first, the standard problems of clustering itself (how to determine the number of clusters, what are good functions to decide cluster membership, etc.) and, second, the fact that the entire database must be re-clustered (an expensive operation) whenever images are added to or deleted from the database. An alternative to clustering, proposed by Ravela et al and Smith and Chang, is to rely on indexing mechanisms derived from those traditionally used with text databases. The idea is to consider all the dimensions of the feature space independent, create one-dimensional indices (which can be searched quickly) for each of them, and then use standard database operations, such as joins, to perform the retrieval. The main problem with these approaches is that, for the high-dimensional spaces required for meaningful image characterization, the indexing savings vanish as the database grows. The problem is therefore particularly acute for databases of image regions. In summary, the downside of approaches based on indexing is that, when dealing with region databases, the indexing problem becomes orders of magnitude more complex. Since, at this point, indexing is itself an open question (even for the simpler case of non-region-based representations), this can be a significant hurdle.

By relying on a generative model for feature representation and a probabilistic similarity criterion, our solution avoids most of the problems associated with region-based representations. First, it does not require segmentation of the images in the database in order to support region-based queries. The only segmentation information required are the image regions that make up the query, which are provided by the user himself. The indexing complexity is therefore not increased. Second, since the generative model (a probability density) is compact independently of the number of elemental regions that compose each image, we can make these regions as small as desired, all the way down to single-pixel size. Our choice of local 8 × 8 neighborhoods provides a good tradeoff between invariance, the ability to model local image dependencies, and allowing users to include regions of almost arbitrary size and shape in their queries. Finally, unlike the representations discussed above, a generative model provides automatic support for inference and learning, facilitating the design of relevance feedback mechanisms.
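To make this concrete, the sketch below (in Python, with NumPy and scikit-learn, neither of which is used in the original work) illustrates the general idea: each image is summarized by a mixture density fit to features extracted from its 8 × 8 neighborhoods, so the size of the model is independent of how many elemental regions the image contains. The paper's actual representation is based on embedded mixtures of joint color/texture features; a plain Gaussian mixture over flattened patches is used here only as a simplified stand-in.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_patch_features(image, patch_size=8, stride=4):
    """Collect feature vectors from local 8x8 neighborhoods.

    Each patch is simply flattened into a vector here; the paper
    uses a richer joint color/texture feature, so this is only a
    placeholder feature transformation.
    """
    h, w = image.shape[:2]
    feats = []
    for i in range(0, h - patch_size + 1, stride):
        for j in range(0, w - patch_size + 1, stride):
            patch = image[i:i + patch_size, j:j + patch_size]
            feats.append(patch.reshape(-1))
    return np.asarray(feats, dtype=np.float64)

def fit_image_model(image, n_components=8):
    """Fit a mixture density to the patch features of one image.

    The fitted model is a fixed-size summary of the image,
    regardless of how many patches (or query regions) it spans.
    """
    feats = extract_patch_features(image)
    return GaussianMixture(n_components=n_components,
                           covariance_type='diag').fit(feats)
```

Because the density is defined over individual neighborhoods, a user's query region of almost arbitrary size and shape can be evaluated directly against each image model, without segmenting the database images.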

2.2. Relevance feedback

In the context of relevance feedback, local queries are interesting because they allow users to indicate explicitly what they are looking for, greatly facilitating the learning task. Consider a retrieval system faced with the query image of Figure 1. Given the entire picture, the only possible conclusion is that the user may be looking for any combination of the objects contained in it (fireplace, bookshelves, painting on the wall, flower baskets, white table, sofas, carpet, rooms with light-painted walls), and the query is too ambiguous. By allowing the user to select the relevant regions of the image, this ambiguity is significantly reduced.

Figure 1. Example of a query image with multiple interpretations.

The design of relevance feedback algorithms is particularly difficult for retrieval systems based on holistic image similarity because, in this case, the learner must actually perform two tasks: first, figure out exactly which set of visual image properties or concepts the user is interested in, and only then figure out what the good matches in the database are. As the example of Figure 1 demonstrates, the first step cannot be accomplished from the observation of a single image, and several iterations of the interaction between user and retrieval system must occur before the latter knows exactly what the former is looking for, assuming that this ever becomes clear. By avoiding this first learning step, systems relying on localized feedback need only concentrate on the second problem, which has an easier solution. Given this observation, it is somewhat surprising to realize that, while various solutions have been presented to the problem of relevance feedback, most are intimately tied to image representations that preclude local similarity. In fact, to the best of our knowledge, only the "FourEyes" system combines learning with local queries, although these are restricted to image patches of sizeable dimension, and the feature representations and similarity criteria used are not themselves conducive to learning (hence the requirement for the clustering mechanism discussed above).

On the other hand, being a generative model, our representation fits naturally into a Bayesian formulation of the retrieval problem that provides an integrated solution for local queries and learning. The basic idea is that, at each point in time, the retrieval system must integrate the information contained in the current query with that in the previous queries and retrieve the images that best satisfy the entire interaction. Under the Bayesian formulation, retrieval is based on the system's beliefs about the adequacy of each image for the given query, and this integration simply consists of belief propagation according to the laws of probability. Since 1) all the necessary beliefs are an automatic outcome of the similarity evaluation and 2) all previous interaction can be summarized in a small set of prior probabilities, this belief propagation is very efficient from the points of view of both computation and storage.

In terms of the learning mechanism, our approach is closest to that of the "PicHunter" system, which also relies on Bayesian relevance feedback. There are, however, two major differences. First, "PicHunter" does not rely on a representation that can support local queries. Second, and more importantly, our approach does not require an explicit model of user actions and is therefore much more reliable (user modeling is known to be a very difficult problem) and generic. "PicHunter" defines a set of actions that a user may take and, given the images retrieved at a given point, tries to estimate the probabilities of the actions the user will take next. Upon observation of these actions, Bayes rule gives the probability of each image in the database being the target. Due to the difficulty of user modeling, the system relies on several simplifying assumptions and heuristics to estimate action probabilities. The problem is that these estimates can only be obtained through an ad-hoc function of image similarity which is hard to believe valid for all, or even most, of the users the system will encounter. Indeed, it is not even clear that such a function can be derived when the action set becomes more complicated than that supported by the simple interface of "PicHunter". For example, in the context of local queries, the action set would have to account for all possible segmentations of the query image, which are not even defined a priori. These problems are eliminated by our formulation, where all inferences are drawn directly from the observation of the image regions selected by the user. Since we rely on a generative model for feature representation, there is no need to establish heuristic functions relating the similarity between image and target to the belief that the image is the target. Under a generative model, the similarity function is, by definition, that belief.
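As a rough sketch of this belief propagation (hypothetical code, not taken from the paper), the posterior beliefs computed after each iteration become the priors for the next one, so the entire interaction history is summarized by a single probability per database image:

```python
import numpy as np

def update_beliefs(log_priors, log_likelihoods):
    """One round of Bayesian relevance feedback.

    log_priors      -- current log-beliefs, one per database image,
                       summarizing all previous user feedback
    log_likelihoods -- log P(X_t | y = i): likelihood of the regions
                       selected at iteration t under each image model

    Returns normalized log-posteriors, which serve both as the
    ranking used for retrieval and as the priors for iteration t+1.
    """
    log_post = log_priors + log_likelihoods
    # normalize with log-sum-exp for numerical stability
    log_post -= np.logaddexp.reduce(log_post)
    return log_post
```

Only this vector of beliefs needs to be kept between iterations, which is why the propagation is cheap in both computation and storage.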


3. RETRIEVAL AS BAYESIAN INFERENCE

The standard interaction paradigm for CBIR is the so-called "query by example", where the user provides the system with a few examples, and the system retrieves from the database images that are visually similar to those examples. The problem is naturally formulated as one of statistical classification. Given a representation (or feature) space F for the entries in the database, the design of a retrieval system consists of finding a map

g: F \to M = \{1, \ldots, K\}, \qquad x \mapsto y

from F to the set M of classes identified as useful for the retrieval operation. In our work, we set as the goal of content-based retrieval the minimization of the probability of retrieval error, i.e. the probability P(g(x) \neq y) that, if the user provides the retrieval system with a set of feature vectors x drawn from class y, the system will return images from a class g(x) different from y. Once the problem is formulated in this way, it is well known that the optimal map is the Bayes classifier

g^*(x) = \arg\max_i P(y = i | x)                         (1)
       = \arg\max_i \{ P(x | y = i) P(y = i) \},         (2)

where P(x | y = i) is the likelihood function, or feature representation, of the image features for the i-th class and P(y = i) the prior probability for this class. In the absence of prior information about which class is most suited for the query, an uninformative prior can be used, and the optimal decision becomes the maximum likelihood (ML) criterion

g^*(x) = \arg\max_i P(x | y = i)                         (3)

or, if instead of a single feature vector x we have a collection of N independent query features, X = \{x_1, \ldots, x_N\},

g^*(X) = \arg\max_i \left\{ \sum_{j=1}^{N} \log P(x_j | y = i) \right\}.     (4)
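As a minimal sketch of how this criterion might be evaluated (hypothetical code, assuming each database image is represented by a fitted density model exposing per-vector log-likelihoods, as the scikit-learn stand-in above does via score_samples):

```python
import numpy as np

def ml_retrieval(query_features, image_models):
    """Maximum-likelihood retrieval, implementing equation (4).

    query_features -- N feature vectors extracted from the image
                      regions selected by the user as the query
    image_models   -- fitted density models, one per database image

    Returns the index of the best match along with the per-image
    scores (sums of log-likelihoods of the independent query vectors).
    """
    scores = np.array([np.sum(m.score_samples(query_features))
                       for m in image_models])
    return int(np.argmax(scores)), scores
```

Note that only the query regions themselves enter this computation; no segmentation of the database images is required, which is what makes region-based queries practical under this formulation.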