Clustering Art

Kobus Barnard, Pinar Duygulu, and David Forsyth
Computer Science Division, University of California, Berkeley
{kobus, duygulu, daf}@cs.berkeley.edu
Abstract

We extend a recently developed method [1] for learning the semantics of image databases using text and pictures. We incorporate statistical natural language processing in order to deal with free text. We demonstrate the current system on a difficult dataset, namely 10,000 images of work from the Fine Arts Museum of San Francisco. The images include line drawings, paintings, and pictures of sculpture and ceramics. Many of the images have associated free text whose nature varies greatly, from physical description to interpretation and mood. We use WordNet to provide semantic grouping information and to help disambiguate word senses, as well as to emphasize the hierarchical nature of semantic relationships. This allows us to impose a natural structure on the image collection that reflects semantics to a considerable degree. Our method produces a joint probability distribution for words and picture elements. We demonstrate that this distribution can be used (a) to provide illustrations for given captions and (b) to generate words for images outside the training set. Results from this annotation process yield a quantitative study of our method. Finally, our annotation process can be seen as a form of object recognizer that has been learned through a partially supervised process.
1. Introduction

It is a remarkable fact that, while text and images are separately ambiguous, jointly they tend not to be; this is probably because the writers of text descriptions of images tend to leave out what is visually obvious (the colour of flowers, etc.) and to mention properties that are very difficult to infer using vision (the species of the flower, say). We exploit this phenomenon, and extend a method for organizing image databases using both image features and associated text ([1], using a probabilistic model due to Hofmann [2]). By integrating the two kinds of information during model construction, the system learns links between the image features and semantics
which can be exploited for better browsing (§3.1), better search (§3.2), and novel applications such as associating words with pictures and unsupervised learning for object recognition (§4). The system works by modeling the statistics of word and feature occurrence and co-occurrence. We use a hierarchical structure which further encourages semantics through levels of generalization, as well as being a natural choice for browsing applications. An additional advantage of our approach is that, since it is a generative model, it contains processes for predicting image components (words and features) from observed image components. Since we can ask whether some observed components are predicted by others, we can measure the performance of the model in ways not typically available for image retrieval systems (§4). This is exciting because an effective performance measure is an important tool for further improving the model (§5).

A number of other researchers have introduced systems for searching image databases; there are reviews in [1, 3]. A few systems combine text and image data. Search using a simple conjunction of keywords and image features is provided in Blobworld [4]. Webseer [5] uses similar ideas for querying images on the web, but also indexes the results of a few automatically estimated image features, including whether the image is a photograph or a sketch and, notably, the output of a face finder. Going further, Cascia et al. integrate some text and histogram data in the indexing [6]. Others have experimented with using image features as part of a query refinement process [7]. Enser and others have studied the nature of the image database query task [8-10]. Srihari and others have used text information to disambiguate image features, particularly in face finding applications [11-15].

Our primary goal is to organize pictures in a way that exposes as much semantic structure to a user as possible. The intention is that, if one can impose a structure on a collection that "makes sense" to a user, then the user can grasp the overall content and organization of the collection quickly and efficiently. This suggests a hierarchical model which imposes a coarse to fine, or general to specific, structure on the image collection.
2. The Clustering Model

Our model is a generative hierarchical model, inspired by one proposed for text by Hofmann [2, 16], and first applied to multiple data sources (text and image features) in [1]. This model is a hierarchical combination of the asymmetric clustering model, which maps documents into clusters, and the symmetric clustering model, which models the joint distribution of documents and features (the "aspect" model). The data is modeled as being generated by a fixed hierarchy of nodes, with the leaves of the hierarchy corresponding to clusters. Each node in the tree has some probability of generating each word, and similarly, each node has some probability of generating an image segment with given features. The documents belonging to a given cluster are modeled as being generated by the nodes along the path from the leaf corresponding to the cluster up to the root node, with each node being weighted on a document and cluster basis. Conceptually a document belongs to a specific cluster, but given finite data we can only model the probability that a document belongs to a cluster, which essentially makes the clusters soft. We note also that clusters which have insufficient membership are extinguished, and therefore some of the branches down from the root may end prematurely. The model is illustrated further in Figure 1. To the extent that the sunset image illustrated is in the third cluster, as indicated in the figure, its words and segments are modeled by the nodes along the path shown. Taking all clusters into consideration, the document is modeled by a sum over the clusters, weighted by the probability that the document is in the cluster. Mathematically, the process for generating the set of observations D associated with a document d can be described by

    P(D \mid d) = \sum_{c} P(c) \prod_{i \in D} \sum_{l} P(i \mid l, c) \, P(l \mid c, d)        (1)

where c indexes clusters, i indexes items (words or image segments), and l indexes levels. Notice that D is a set of observations that includes both words and image segments.
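To make the notation concrete, the following is a minimal Python sketch of equation (1). The array layout, the function names, and the item-probability callback are our own assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def doc_log_likelihood(items, P_c, P_l_given_cd, item_prob):
    """Equation (1): P(D|d) = sum_c P(c) prod_{i in D} sum_l P(i|l,c) P(l|c,d).

    items        -- the observations D for document d (words and segments)
    P_c          -- shape (C,): cluster prior P(c)
    P_l_given_cd -- shape (C, L): document-specific node weights P(l|c,d)
    item_prob    -- item_prob(i, l, c) returns P(i|l,c): tabulated counts
                    for words, a Gaussian density for segment features
    """
    C, L = P_l_given_cd.shape
    total = 0.0
    for c in range(C):
        term = P_c[c]
        for i in items:
            # marginalize each item over the levels on the path for cluster c
            term *= sum(item_prob(i, l, c) * P_l_given_cd[c, l]
                        for l in range(L))
        total += term
    return np.log(total)
```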
2.1. An Alternative Model

Note that in (1) there is a separate probability distribution over the nodes for each document. This is an advantage for search, as each document is optimally characterized. However, this model is expensive in space, and documents belonging mostly to the same cluster can still be quite different, because their distributions over nodes can differ substantially. Finally, when a new document is considered, as is the case with the "auto-annotate" application described below, the distribution over the nodes must be computed using an iterative process. Thus, for some applications we propose a simpler variant of the model which uses a cluster dependent, rather than document dependent, distribution over the nodes.
[Figure 1. Illustration of the generative process implicit in the statistical model. Each document has some probability of being in each cluster. To the extent that it is in a given cluster, its words and segments are modeled as being generated from a distribution over the nodes on the path to the root corresponding to that cluster. Higher level nodes emit more general words and blobs (e.g. sky), intermediate nodes emit moderately general words and blobs (e.g. sun, sea), and lower level nodes emit more specific words and blobs (e.g. waves).]

Documents are generated with this model according to

    P(D) = \sum_{c} P(c) \prod_{i \in D} \sum_{l} P(i \mid l, c) \, P(l \mid c)        (2)

In training, the average distribution P(l | c) is maintained in place of a document specific one; otherwise training proceeds as before. We will refer to the standard model in (1) as Model I, and the model in (2) as Model II. Either model provides a joint distribution for words and image segments: Model I by averaging over documents using some document prior, and Model II directly.

Items are generated conditionally independently given a node in the tree; a node is uniquely specified by its cluster and level. In the case of a word, P(i | l, c) is simply tabulated, being determined by the appropriate word counts during training. For image segments, we use Gaussian distributions over a number of features capturing some aspects of size, position, colour, texture, and shape. These features taken together form a feature vector X. Each node, subscripted by cluster c and level l, specifies a probability distribution over image segments by the usual Gaussian formula. In this work we assume independence of the features, as learning the full covariance matrix leads to precision problems.
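With the independence assumption just stated, the "usual Gaussian formula" reduces to a product of one-dimensional Gaussians over the features. A sketch, where the per-node parameter storage is our own choice:

```python
import numpy as np

def segment_prob(x, mean, var):
    """Diagonal-covariance Gaussian P(x | l, c) for one image segment.

    x, mean, var -- shape (F,): segment feature vector and the per-feature
                    mean and variance stored at node (l, c)
    """
    return float(np.prod(np.exp(-0.5 * (x - mean) ** 2 / var)
                         / np.sqrt(2.0 * np.pi * var)))
```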
A reasonable compromise would be to enforce a block diagonal structure for the covariance matrix to capture the most important dependencies.

To train the model we use the Expectation-Maximization (EM) algorithm [17]. This involves introducing hidden variables H_{d,c}, indicating that training document d is in cluster c, and V_{d,i,l}, indicating that item i of document d was generated at level l. Additional details on the EM equations can be found in [2].

We chose a hierarchical model over several non-hierarchical possibilities because it best supports browsing of large collections of images. Furthermore, because some of the information for each document is shared among the higher level nodes, the representation is more compact than a similar non-hierarchical one. This economy is exactly why the model can be trained appropriately: more general terms and more generic image segment descriptions occur in the higher level nodes because they occur more often.
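As a rough illustration of the hidden variables just introduced (the exact update equations are in [2], so treat this as an assumption-laden sketch), an E step for a single document might compute:

```python
import numpy as np

def e_step(items, P_c, P_l_given_cd, item_prob):
    """Posterior responsibilities for one document d.

    Returns H, shape (C,), the posterior that d belongs to each cluster,
    and V, shape (C, n_items, L), the posterior that each item was
    generated at each level, conditioned on the cluster.
    """
    C, L = P_l_given_cd.shape
    V = np.zeros((C, len(items), L))
    H = np.array(P_c, dtype=float)
    for c in range(C):
        for n, i in enumerate(items):
            w = np.array([item_prob(i, l, c) * P_l_given_cd[c, l]
                          for l in range(L)])
            V[c, n] = w / w.sum()   # level responsibilities V_{d,i,l}
            H[c] *= w.sum()         # accumulate per-cluster likelihood
    return H / H.sum(), V           # cluster responsibilities H_{d,c}
```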
3. Implementation

Previous work [1] was limited to a subset of the Corel dataset and to features from Blobworld [4]. Furthermore, the text associated with the Corel images is simply 4-6 keywords, chosen by hand by Corel employees. In this work we incorporate simple natural language processing in order to deal with free text and to take advantage of additional semantics available using natural language tools (see §4). Feature extraction has also been improved, largely through the use of Normalized Cuts segmentation [18, 19]. For this work we use a modest set of features, specifically region colour and standard deviation, region average orientation energy (12 filters), and region size, location, convexity, first moment, and ratio of region area to boundary length squared.
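A sketch of how a few of these region statistics might be computed from a binary segment mask follows; the paper does not give exact definitions or normalisations, so this is illustrative only.

```python
import numpy as np

def basic_region_features(mask):
    """Size, location, and area-to-boundary-squared for one region mask."""
    mask = mask.astype(bool)
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    area = float(len(xs))
    size = area / (h * w)                       # fraction of image covered
    location = (xs.mean() / w, ys.mean() / h)   # normalised centroid
    # boundary pixels: region pixels with at least one 4-neighbour outside
    p = np.pad(mask, 1)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    boundary = float(np.sum(mask & ~interior))
    return size, location, area / boundary ** 2
```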
3.1. Data Set

We demonstrate the current system on a completely new, and substantially more difficult, dataset: 10,000 images of work from the Fine Arts Museum of San Francisco. The images are extremely diverse, and include line drawings, paintings, sculpture, ceramics, antiques, and so on. Many of the images have associated free text provided by volunteers. The nature of this text varies greatly, from physical description to interpretation and mood. Descriptions can run from a short sentence to several hundred words, and were not written with machine interpretation in mind.
3.2. Scale

Training on a large image collection requires sensitivity to scalability issues. A naive implementation of the method described in [2] requires a data structure for the vertical indicator variables which grows linearly in each of four parameters: the number of images, the number of clusters, the number of levels, and the number of items (words and image segments). The dependence on the number of images can be removed, at the expense of programming complexity, by careful updates in the EM algorithm, as described here. In the naive implementation, an entire E step is completed before the M step is begun (or vice versa). However, since the vertical indicators are used only to weight sums in the M step on an image by image basis, the part of the E step which computes the vertical indicators can be interleaved with the part of the M step which updates sums based on those indicators. This means that the storage for the vertical indicators can be recycled, removing the dependency on the number of images. This requires some additional initialization and cleanup of the loop over points (which contains a mix of both E and M parts). Weighted sums must be converted to means after all images have been visited, but before the next iteration. The storage reduction also applies to the horizontal indicator variables (which have a smaller data structure). Unlike the naive implementation, our version requires keeping both a "new" and a "current" copy of the model (e.g. means, variances, and word emission probabilities), but this extra storage is small compared with the overall savings.
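The loop structure just described might look as follows. Here `zero_copy`, `accumulate`, and `normalise` are hypothetical helpers standing in for the model bookkeeping, not part of any published code, and `e_step` is the per-document routine sketched in section 2.

```python
def em_iteration(documents, current):
    """One interleaved E/M pass: per-image indicators are computed, folded
    into the running M-step sums, and immediately discarded, so indicator
    storage no longer grows with the number of images."""
    new = current.zero_copy()     # "new" model copy: weighted-sum accumulators
    for d in documents:
        # E part for this image only
        H, V = e_step(d.items, current.P_c, current.P_l_given_c,
                      current.item_prob)
        new.accumulate(d, H, V)   # M part: update running sums; H, V recycled
    new.normalise()               # convert weighted sums to means/probabilities
    return new                    # becomes the "current" model next iteration
```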
4. Language Models

We use WordNet [20], an on-line lexical reference system developed by the Cognitive Science Laboratory at Princeton University, to determine word senses and semantic hierarchies. Every word in WordNet has one or more senses, each of which has a distinct set of words related to it through relationships such as hypernyms and hyponyms (IS_A), holonyms (MEMBER_OF), and meronyms (PART_OF). Most words have more than one sense. Our current clustering model requires that the sense of each word be established. Word sense disambiguation is a long-standing problem in natural language processing, and several methods have been proposed in the literature [21-23]. We use WordNet hypernyms to disambiguate the senses.
[Figure 2: Four possible senses of the word "path".]
For example, in the Corel database, it sometimes happens that one keyword is a hypernym of one sense of another keyword. In such cases, we always choose the sense that has this property. This method is less helpful for free text, where there are more, and less carefully chosen, words. For free text, we use shared parentage to identify the sense, because we assume that senses are shared across the text associated with a given picture (as in Gale et al.'s one sense per discourse hypothesis [24]). Thus, for each word we use the sense which has the largest hypernym sense in common with the neighboring words. For example, Figure 2 shows four available senses of the word path. Corel image 187011 has the keywords path, stone, trees, and mountains. The sense chosen for path is the one whose hypernyms are most shared with the senses of the other keywords.
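A minimal sketch of this shared-hypernym heuristic, using NLTK's WordNet interface (our choice for illustration; the paper does not describe its implementation):

```python
from nltk.corpus import wordnet as wn

def hypernym_closure(sense):
    """A sense plus every hypernym reachable above it in WordNet."""
    return {sense} | set(sense.closure(lambda s: s.hypernyms()))

def choose_sense(word, neighbours):
    """Pick the sense of `word` sharing the most hypernyms with the senses
    of the neighbouring keywords (one sense per discourse)."""
    pooled = set()
    for n in neighbours:
        for s in wn.synsets(n, pos=wn.NOUN):
            pooled |= hypernym_closure(s)
    return max(wn.synsets(word, pos=wn.NOUN),
               key=lambda s: len(hypernym_closure(s) & pooled))

# Keywords of Corel image 187011:
print(choose_sense("path", ["stone", "trees", "mountains"]))
```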