Image Retrieval: Content versus Context

Thijs Westerveld
University of Twente, Department of Computer Science, Parlevink Group
PO Box 217, 7500 AE Enschede, The Netherlands
[email protected]

Abstract

In this paper, we introduce a new approach to image retrieval. This approach takes the best from two worlds by combining image features (content) and words from collateral text (context) into one semantic space. Our approach uses Latent Semantic Indexing, a method that uses co-occurrence statistics to uncover hidden semantics. This paper shows how this method, which has proven successful in both monolingual and cross-lingual text retrieval, can be used for multi-modal and cross-modal information retrieval. Experiments with an on-line newspaper archive show that Latent Semantic Indexing can outperform both content based and context based approaches and that it is a promising approach for indexing visual and multi-modal data.

1 Introduction

In the last few years, several research groups have been investigating content based image retrieval (Flickner et al., 1997; Gevers and Smeulders, 1997; Marsico, Cinque and Levialdi, 1997). A popular approach is querying by example and computing relevance based on visual similarity using low-level image features like colour histograms, textures and shapes. However, user studies show that most users are interested in semantic entities rather than in visual appearance. A study of journalists' search behaviour (Markkula and Sormunen, 2000) showed that in 56% of the cases journalists were searching for concrete objects (people, buildings, places etc.) using textual 'named entity' queries (Bill Clinton, Eiffel Tower). Looking at these results, one could argue that image retrieval is not necessary at all; to retrieve images, a text retrieval system would do. One could simply go to a common search engine like AltaVista (http://www.altavista.com) or HotBot (http://www.hotbot.com), type a textual query and check the 'must include image' check-box, since most documents with images of Bill Clinton would also contain the phrase 'Bill Clinton'. This kind of retrieval, disclosure of multi-modal information based on associated text, has also been a popular research issue in the last few years (De Jong et al., this volume; Hauptman and Witbrock, 1997).

However, images are often not only used to show a certain concrete object; they are also supposed to express a certain feeling. For a tourist brochure, one would use a different image of the Eiffel Tower than for a spooky story in a Paris setting. When one wants to find images in a certain style (sharp, blurry, dark, warm, cold), analysing only the collateral text may not be sufficient. In these cases, image features might help: 'spooky' images probably have different colour histograms than 'bright' images. Apart from the need for images of concrete objects, user studies also report thematic needs (like pictures about 'holidays in the south') (Markkula and Sormunen, 2000) and needs concerning image content (activities, types of people, visible objects) (Buscher, 1998). These needs cannot be expressed by a simple textual query, leading to the conclusion that image retrieval systems should facilitate both visual and textual querying on a semantic level.

This paper introduces a new approach to image retrieval, which uses Latent Semantic Indexing to combine visual and textual elements into one semantic space, allowing for cross-modal querying on a semantic level. The next section introduces some of the basic methods and common problems in traditional image retrieval. In section 3, Latent Semantic Indexing is introduced as an approach to multi-modal indexing. Then section 4 explains our feature extraction process, section 5 describes our experiments with an on-line newspaper archive and finally, in section 6, the results are discussed and our future research plans are explained.

2 Information Retrieval

For many years, the retrieval of text and images has been based on manually created indexes stored as back-of-the-book indexes or card indexes. Most libraries, image archives and video archives still use these indexes. In the last decades, techniques have been developed to automatically index large volumes of text, and some of these techniques are also used to index images on the basis of associated text (context based image retrieval). In recent years, image processing techniques have been developed that allow the indexing of images based on their visual content (content based image retrieval). This section describes these two approaches to image retrieval; the next subsections discuss the basic methods and problems of context based and content based image retrieval respectively.

2.1 Context Based Image Retrieval
A lot of information about the content of an image can come from sources other than the image itself. All information that does not come from the visual properties of the image itself can be seen as the context of an image. For example, the place where you found an image or the person who pointed you to it can tell a lot about the information displayed in the image. In this paper, however, we use the term context only for the textual information that comes with an image. Context based image retrieval can be based on annotations that were manually added for disclosing the images (keywords, descriptions), or on collateral text that is 'accidentally' available with an image (captions, subtitles, nearby text). From these texts, indexes can be created using standard text retrieval techniques. The similarity between images is then based on the similarity between the associated texts, which in turn is often based on similarity in word use. An important problem with this approach is the difference in word use between documents. Documents can discuss the same subject using different words (synonymy) or use the same words to describe different concepts (ambiguity). This problem, which also occurs in full-text retrieval, is known as the paraphrase problem (Oard and Dorr, 1996). It can be overcome by using a restricted vocabulary for manual annotation (controlled term indexing), but it is very expensive to manually index all images in a large collection.

2.2 Content Based Image Retrieval
Content based image retrieval (CBIR) using query by example (QBE) has become popular in the last few years. CBIR systems try to return those images that are visually most similar to an example image; similarity is based on a set of low-level image features. Features that can be used to index images are colour, texture, shape and spatial layout. Some studies exist on which features best match human perception (Gargi and Kasturi, 1996; Liu and Picard, 1996), but, partly because of the subjectivity involved, it is improbable that such a feature set exists at all. Another important problem with content based indexing is the fact that visual similarity does not correspond to semantic similarity. Therefore, even if a feature set existed that matches human vision, the retrieved images would still not necessarily be related to the example image on a semantic level. This problem is known as the semantic gap; it causes current image retrieval systems to retrieve, for example, images of women in red dresses when the example image was a picture of a red car.
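To make the query-by-example idea concrete, here is a minimal sketch (our illustration, not a description of any of the cited systems) that represents each image by a normalised colour histogram and ranks a collection by histogram intersection; the bin count and the intersection measure are arbitrary choices for the example.

```python
# Minimal query-by-example sketch (our illustration, not one of the cited systems).
# Images are represented by normalised colour histograms and ranked by
# histogram intersection.
import numpy as np

def colour_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Normalised joint histogram of an (h, w, 3) image with values in [0, 255]."""
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def histogram_intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    """Similarity in [0, 1]; 1 means identical colour distributions."""
    return float(np.minimum(h1, h2).sum())

def rank_by_example(example: np.ndarray, collection: list) -> list:
    """Indices of the collection, sorted from most to least similar to the example."""
    q = colour_histogram(example)
    scores = [histogram_intersection(q, colour_histogram(img)) for img in collection]
    return sorted(range(len(collection)), key=lambda i: scores[i], reverse=True)
```

Ranking on this kind of low-level similarity is exactly what produces the red-dress versus red-car confusion mentioned above.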
2.3 Uncovering Hidden Semantics
In the previous sections we saw that one of the major problems in both context based image retrieval and content based image retrieval is the fact that the terms in a document (words or low-level image features) differ from the semantic content of a document. (The term 'term' is often used to refer to textual items only; in this paper it refers to both textual items, i.e. words, and visual items, i.e. low-level image features; it should be read as 'indexing term'.) However, this does not mean that the terms of a document are totally meaningless. After all, humans make use of, among other things, the same set of terms to discover the semantics of a document. Therefore, we need a technique that uncovers these hidden semantics of a document, or at least is able to disregard different term use in related documents.

3 Latent Semantic Indexing

To uncover the hidden semantics of a document, we use Latent Semantic Indexing (LSI), an approach that has proven successful in both monolingual and cross-lingual text retrieval (Deerwester, Dumais and Harshman, 1990; Dumais, Landauer and Littman, 1996; Yang, Carbonell and Brown, 1998). LSI is a method that uses co-occurrence statistics of terms to find the semantics behind a document's terms, concluding that documents using similar terms are probably related. Using LSI, one can for example infer that a document containing the words reservation, double room, shower and breakfast is related to (other) documents about hotels, even though the word hotel is not mentioned in that particular document.

3.1 State of the Art
Monolingual, text based LSI starts with a term-document matrix that represents the term occurrences in the documents. Depending on the weighting that is used, the cells of the matrix can indicate whether a term occurs in a document, how often a term occurs, or how important a term is in a certain document. This term-document matrix can be seen as a high-dimensional representation of the document base in which each document is represented as a vector of term occurrences. The dimensionality of this matrix can be reduced by applying the Singular Value Decomposition (SVD), a form of factor analysis that computes the most meaningful linear combinations of terms and documents. When we take only the first few dimensions of the resulting matrices, we have an optimal approximation of the original term-document matrix in a lower dimension. This lower-dimensional space, in which similar terms and similar documents are close to each other, can be used for image retrieval. In the resulting lower-dimensional space, a document containing the words reservation, double room, shower and breakfast will be close to (other) documents about hotels and close to the term hotel. Therefore, people searching with the term hotel will also find this document although the term hotel is not in it. Because LSI groups related concepts, it can also resolve ambiguity: a document containing the words bank, money, interest, account and cash-dispenser will be closer to documents about financial institutes than to documents about river banks.

3.2 Combining Text and Images
Learning from the field of cross-language retrieval, where LSI is used to index documents from multiple languages into one semantic space (Dumais, Landauer and Littman, 1996; Yang, Carbonell and Brown, 1998), we apply LSI to build a multi-modal space in which terms from both text and images are represented. Latent Semantic Indexing has also been used for image retrieval before (Pečenović, 1997) and even for the textual part in a combination of content and context based image retrieval (Cascia, Sethi and Sclaroff, 1998), but as far as we know, no one has combined text and image into the same semantic space using Latent Semantic Indexing. The basic principle for building a multi-modal space (Westerveld, Hiemstra and de Jong, 2000) is quite simple: we just list terms from both modalities in one term-document matrix and then apply the SVD, resulting in a semantic space that contains both visual and textual items (see Figure 1).
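As a minimal sketch of this indexing step (ours, not the implementation used in our experiments), the following code builds the toy term-document matrix of Figure 1 below, computes the SVD, keeps two latent dimensions and folds a one-term query into the reduced space; the query-folding formula and the choice of cosine similarity are standard LSI practice, assumed here.

```python
# Multi-modal LSI sketch on the toy matrix of Figure 1 (below).
# Rows are terms (textual and visual mixed), columns are documents.
import numpy as np

terms = ["tree", "park", "sun", "day",                            # textual terms
         "dark green", "brown", "light green", "blue", "yellow"]  # visual terms
A = np.array([  # columns: doc 1, doc 2, doc 3
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
    [0, 0, 1],
], dtype=float)

# Singular Value Decomposition and rank-k approximation.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # number of latent dimensions (our choice)
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the k-dimensional space
terms_k = U[:, :k] * s[:k]             # terms in the same space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A query is folded into the space like a pseudo-document: q_k = q^T U_k.
q = np.zeros(len(terms)); q[terms.index("park")] = 1
q_k = q @ U[:, :k]
print([cosine(q_k, d) for d in docs_k])  # docs 1 and 3 score high, doc 2 low
```

Even though doc 2 shares the term tree with doc 1, the query park retrieves docs 1 and 3 because they co-occur with the same textual and visual terms in the reduced space.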

Documents (caption and image):
  doc 1: "a tree in the park"
  doc 2: "a tree"
  doc 3: "a sunny day in the park"

Term-document matrix (textual and visual terms combined):

                 doc 1   doc 2   doc 3
  tree             1       1       0
  park             1       0       1
  sun              0       0       1
  day              0       0       1
  dark green       1       1       0
  brown            1       1       0
  light green      1       0       1
  blue             1       0       1
  yellow           0       0       1

[Plot of the resulting semantic space, in which terms such as park, tree, sun and day lie close to the documents that use them.]
Figure 1: Multi-Modal LSI: documents, term-document matrix and semantic space

The main difficulty, however, lies in the choice of terms. What are the terms of an image? The words that are normally used for LSI in text retrieval now have to be replaced by image features. In order to make LSI perform well, the terms from the images should have a similar type and distribution as the ones from the text. Therefore, on the one hand, the number of possible image features should be high, just like the total number of terms in a document collection is very high (O(10^4)). On the other hand, the number of image features per image should be low, since a typical textual document also contains only a small portion of all possible words (O(10^2)). This is different from the common use of terms in content based image retrieval, where the number of possible terms is often relatively low (O(10^2) to O(10^3)) and almost every image has a value for each term. Apart from that, image terms often have values on a continuous scale, whereas terms from a text are discrete. To use LSI on image content, we have to define a set of discrete image terms that has roughly the same distribution as the set of textual terms. If we create a set of image terms that is both as small and as sparse as the set of words from the text, this set will not cover all of the visual characteristics of the images. Therefore, we tried two different approaches for calculating image terms. In the first approach we create a set of image terms that is roughly as sparse as the set of textual terms; in the second approach we create a set of image terms that has about the same size as the set of textual terms. In the next section we will discuss both approaches.
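To make the comparison of size and sparseness concrete, the small helper below (ours, not part of the original experiments) computes, from a binary term-document matrix, the three quantities reported later in Tables 1 and 2: the total number of terms that occur at least once, the average number of terms per document, and the ratio between the two.

```python
import numpy as np

def term_statistics(A: np.ndarray):
    """A is a binary term-document matrix (terms x documents).
    Returns (total number of terms occurring in at least one document,
             average number of terms per document,
             ratio between the two, a measure of the sparseness of the matrix)."""
    total_terms = int((A.sum(axis=1) > 0).sum())
    avg_terms_per_doc = float((A > 0).sum(axis=0).mean())
    return total_terms, avg_terms_per_doc, total_terms / avg_terms_per_doc
```

Applied to the term-document matrices of our collection, this yields the figures reported in Tables 1 and 2.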

4 Feature Extraction

Before we can start the process of LSI, we need to extract the indexing terms from our documents. The textual terms we use are extracted by lemmatising the image captions and selecting only those terms that occur in at least two different captions. For the visual part of our index, we extract two kinds of image features, namely colours and textures. To extract these, we use two different approaches: one to extract a sparse set of image terms and one to extract a small set of image terms. In the next subsections, the two approaches are discussed and the resulting sets of terms are analysed.

4.1 Sparse Set of Image Terms
The first set of image terms we extract is supposed to be as sparse as the set of textual terms. To extract such a set of features, we adopt the approach of Squire et al. (1999), who used a sparse image feature set to apply text retrieval techniques to visual data. Below, we first describe how we convert colour features and texture features to a sparse set of terms usable for LSI and then we analyse the resulting set of features.

Colour Features
For colour feature extraction, we use the HSV colour space, a space that is close to human perception. Since different lighting conditions or different viewpoints change the values for saturation and value, we make these bins somewhat larger and divide our colour space into 18 hues, 3 saturations and 3 values. We extract two sets of colour features from our images. The first set of features is a standard colour histogram for the whole image. For the second set of colour features, we cut each image in blocks ranging from 2 × 2 to 16 × 16 blocks. We then calculate the most frequent colour for each block and store it as a binary feature. This way, we have 55,080 possible block colour features (340 blocks × 18 hues × 3 saturations × 3 values) and 162 colour histogram features (18 hues × 3 saturations × 3 values).

Texture Features
To extract texture from the images we use Gabor filters (Fogel and Sagi, 1989) at three different wavelengths and four orientations. The energy (the sum over the phases of the squared convolution values) is computed at each pixel for each combination of wavelength and orientation. We cut each image in 2 × 2 and 4 × 4 blocks and then compute the average energy for each combination of wavelength and orientation, yielding 12 values per block (due to time constraints, we did not compute the average energies for the 8 × 8 and 16 × 16 blocks). These average energy values are then quantised into 128 bands. The bands were chosen such that the average energy values for 1000 images were distributed equally over the bands. Finally, average energies that fall into the lower 16 bands are disregarded and a binary-valued feature is stored for each filter that has an average energy in one of the other bands. This means we have in total 26,880 possible texture terms per image (20 blocks × 12 filters × (128 − 16) bands).

Term Frequencies
To get an idea of the sparseness and the size of this set of terms, we analysed a set of 3379 newspaper photographs from the on-line archives of a Dutch newspaper. We also analysed the captions of these photographs to calculate the number of textual terms in the collection. Table 1 shows the total number of terms (i.e. the number of terms that occurred in at least one document, which is smaller than the number of possible terms since some terms do not occur at all), the average number of terms per document, and the ratio between these two, which is a measure of the sparseness of our term-document matrix.

                         Text      Image    Combination
  tot. # terms           4283      37752          42035
  avg. # terms / doc.      27        625            598
  ratio                 158 : 1    63 : 1         70 : 1

Table 1: Term frequency measures for the sparse set of image terms

In this table we see that, although the set of image features is not as sparse as the set of text features, it is still a lot sparser than features commonly used in image retrieval (colour histograms, for example, often have values for 1 out of 8 terms). However, since we still want to capture as much as possible of the images' visual properties, we need a lot of terms, and the number of visual terms per document is a lot higher than the number of textual terms.

4.2 Small Set of Image Terms
To get a more balanced set of terms, we also adopted an approach in which we focused more on keeping the number of terms roughly the same for text and images. This section describes this second approach to extracting features from images. Again, colour and texture feature extraction are described first, after which an analysis of the resulting space follows.

Colour Features
For colour feature extraction, we again use the HSV colour space divided into 18 hues, 3 saturations and 3 values, and we cut each image in blocks ranging from 2 × 2 to 4 × 4 blocks. However, for this second set of terms, we calculate the colour histogram for each block as well as for the whole image. This way, we have only 3240 possible colour histogram features ((20 blocks + whole image) × 18 hues × 3 saturations × 3 values).

Texture Features
To extract texture from the images we again use Gabor filters at three different wavelengths and four orientations. We cut each image in 2 × 2 and 4 × 4 blocks and again compute the average energy for each combination of wavelength and orientation, yielding 12 values per block. These average energy values are now quantised into only 10 bands, and average energies that fall into one of the lower 2 bands are disregarded. We store a binary-valued feature for each filter that has an average energy in one of the other bands. This means we have in total 1920 possible texture terms per image (20 blocks × 12 filters × 8 bands).

Term Frequencies
For this second set of image features, we again analysed our newspaper photographs and calculated the term frequencies; the results can be seen in Table 2.

                         Text      Image    Combination
  tot. # terms           4283       4442           8725
  avg. # terms / doc.      27       1131           1158
  ratio                 158 : 1     4 : 1          8 : 1

Table 2: Term frequency measures for the small set of image terms

The numbers of terms from text and images are more balanced than in the first set of terms, but this set of image terms is a lot less sparse. Experiments have to show which of the two sets is best suited for image retrieval.
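The sketch below reconstructs the term extraction of sections 4.1 and 4.2 under our own assumptions (image format, library choice and the actual Gabor wavelengths are ours); it is meant as an illustration, not as the code used for the experiments.

```python
# Sketch of the image-term extraction (our reconstruction, not the original code).
# Assumes RGB images as float arrays in [0, 1]; bin sizes follow section 4.
import numpy as np
from skimage.color import rgb2hsv          # assumed library choice
from skimage.filters import gabor

N_H, N_S, N_V = 18, 3, 3                   # 162 colour bins (18 hues, 3 sat., 3 val.)

def colour_bins(image_rgb: np.ndarray) -> np.ndarray:
    """Map every pixel to one of the 162 HSV bins; returns an int array (h, w)."""
    hsv = rgb2hsv(image_rgb)
    h = np.minimum((hsv[..., 0] * N_H).astype(int), N_H - 1)
    s = np.minimum((hsv[..., 1] * N_S).astype(int), N_S - 1)
    v = np.minimum((hsv[..., 2] * N_V).astype(int), N_V - 1)
    return (h * N_S + s) * N_V + v

def block_colour_terms(image_rgb: np.ndarray, grid: int) -> set:
    """Sparse colour terms: one binary term per block of a grid x grid partition,
    identified by the block position and its most frequent colour bin."""
    bins = colour_bins(image_rgb)
    terms = set()
    rows = np.array_split(np.arange(bins.shape[0]), grid)
    cols = np.array_split(np.arange(bins.shape[1]), grid)
    for bi, r in enumerate(rows):
        for bj, c in enumerate(cols):
            dominant = np.bincount(bins[np.ix_(r, c)].ravel()).argmax()
            terms.add(("colour", grid, bi, bj, int(dominant)))
    return terms

def gabor_texture_terms(image_gray: np.ndarray, band_edges: np.ndarray,
                        skip_low: int, grid: int = 2) -> set:
    """Texture terms: per block, the average Gabor energy of each of the 12 filters
    (3 wavelengths x 4 orientations) is quantised into bands; energies falling in
    the lowest `skip_low` bands are discarded."""
    terms = set()
    for wavelength in (2, 4, 8):                       # assumed wavelengths
        for k in range(4):                             # 4 orientations
            real, imag = gabor(image_gray, frequency=1.0 / wavelength,
                               theta=k * np.pi / 4)
            energy = real ** 2 + imag ** 2
            rows = np.array_split(np.arange(energy.shape[0]), grid)
            cols = np.array_split(np.arange(energy.shape[1]), grid)
            for bi, r in enumerate(rows):
                for bj, c in enumerate(cols):
                    band = int(np.digitize(energy[np.ix_(r, c)].mean(), band_edges))
                    if band >= skip_low:
                        terms.add(("texture", grid, bi, bj, wavelength, k, band))
    return terms
```

With grid sizes 2, 4, 8 and 16 (340 blocks in total) plus the whole-image histogram, block_colour_terms approximates the sparse colour terms of section 4.1, and gabor_texture_terms with 128 equal-frequency bands and skip_low=16 approximates the sparse texture terms; the small set of section 4.2 instead uses per-block colour histograms for the 2 × 2 and 4 × 4 grids, 10 bands and skip_low=2.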

5 Experiments

To compare the combined LSI described in section 3.2 to pure content based and pure context based approaches, we used a collection of 3379 images from a Dutch newspaper (Reformatorisch Dagblad, http://www.reformatorischdagblad.nl) together with their captions (no other collateral text was available). Obviously, this collection is far too small to perform extensive recall and precision evaluations, but it can give us an idea of the possibilities and limitations of our approach. The terms from the images were extracted by applying the techniques described in the previous section and the terms from the text were extracted by lemmatising the captions and selecting only those terms that occur in at least two different documents. We then used LSI to index our document collection six times: once with only the visual terms, once with only the textual terms and once with terms from both image and text, and each of these with both sets of image terms.

To test how the different distributions of image features affected the semantic space, we randomly selected 20 documents from our collection as query documents and used the different indexes to retrieve the most similar documents for each query. For each query, we compared the top 100 documents from each index, and it turned out that the results from the combined approach were similar to the ones from the image-only approach (this holds for both sets of image terms). The top 100 documents returned by the combined index and the image index showed an overlap of 88.6% for the sparse set of image terms and 84.0% for the small set of image features; the combined index and the text index overlapped only 6.6% and 8.4% respectively. Manual inspection of the results confirmed this: visual similarity seems more important than semantic similarity.

Although the percentages for the two sets of image features are roughly the same, the small set of image features seems to perform somewhat better than the sparse set. Manual inspection of the results shows that, for this set of features, the combined approach outperforms both the image and the text approach for queries with many relevant documents in the data set. For a query consisting of two images of floods in the Far East together with their collateral text (in Dutch), the top 5 results from the combined approach contain four relevant images, whereas the image and text approaches each contain only three (see Figure 2).

Query (two flood images with their Dutch captions):

PHNOM-PENH - Een Cambodjaanse jongen in een rolstoel wordt door het water geduwd. Hevige stortregens hebben de straten van Phnom-Penh onder water gezet. Foto EPA
[Translation: A Cambodian boy in a wheelchair is pushed through the water. Heavy downpours have put the streets of Phnom Penh under water. Photo EPA]

BHABER CHAR - Bewoners van het Bengaalse dorp Bhaber Char, 40 kilometer ten zuidoosten van de hoofdstad Dhaka, zwoegen voort door het water in de straatjes van hun woonplaats. Meer dan de helft van het grondgebied van Bangladesj wordt momenteel door overstromingen geteisterd. Het hoge water, veroorzaakt door zware regenval, heeft al aan 163 mensen het leven gekost. Ruim 7,5 miljoen burgers zijn dakloos. Foto EPA
[Translation: Residents of the Bengali village of Bhaber Char, 40 kilometres southeast of the capital Dhaka, struggle on through the water in the streets of their home town. More than half of the territory of Bangladesh is currently stricken by floods. The high water, caused by heavy rainfall, has already cost 163 people their lives. More than 7.5 million citizens are homeless. Photo EPA]

[Result grids: for each of the three indexes (text, images, combination), the top 5 retrieved images are shown, ranked 1 to 5.]

Figure 2: Example query with results for text space, image space and combined space

The two non-relevant documents in the set of results from the text index do depict floods, but not in the Far East (Venice at rank 4 and Ecuador at rank 5). Of course, this single example is no proof of the viability of this approach, but it shows how a combination of textual and visual terms can yield better results than the two separate approaches.
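For completeness, a sketch (ours) of the evaluation used above: documents are ranked by cosine similarity in the reduced LSI space, and two indexes are compared by the overlap of their top-100 result lists.

```python
import numpy as np

def rank(doc_vectors: np.ndarray, query_vector: np.ndarray) -> np.ndarray:
    """Indices of documents sorted by decreasing cosine similarity to the query.
    doc_vectors: (n_docs, k) matrix of documents in the reduced LSI space."""
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = query_vector / np.linalg.norm(query_vector)
    return np.argsort(d @ q)[::-1]

def topk_overlap(ranking_a: np.ndarray, ranking_b: np.ndarray, k: int = 100) -> float:
    """Fraction of documents shared by the two top-k result lists."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

# Averaging topk_overlap over the 20 query documents, once for the combined and
# image-only indexes and once for the combined and text-only indexes, gives
# overlap figures such as the 88.6% and 6.6% reported above.
```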

6 Discussion and Future Research

Although we did not experiment extensively with our approach and used only a very small set of test data, we think Latent Semantic Indexing is a promising method for indexing visual and multi-modal data. Since LSI can find semantic associations between terms, it can help bridge the semantic gap, one of the major problems in image retrieval. We showed that the combination of text and images can outperform both text based and image based approaches in some cases. However, since LSI ignores uncommon words, it is less useful for finding named entities. In these cases, when one wants to retrieve images depicting a certain object or topic, the text based approach seems most suitable. However, text is not available with every image; therefore, future research will investigate whether a combined approach can help in building a semantic space that allows the disclosure of images that do not have any text associated with them. When text is available, a combined approach can still help to improve retrieval results: it can help to find images of a certain object or topic that also express a certain mood or feeling.

7 Acknowledgements

The research for this paper has been funded by the Dutch Organisation for Scientific Research (NWO). The data we used in our experiments comes from the archives of the Dutch newspaper 'het Reformatorisch Dagblad'; the example query and results in Figure 2 are also taken from these archives.

8 References

Buscher, I. (1998). Going digital at SWR TV-archives: New dimensions of information management for professional and public demands. In Proceedings of the 14th Twente Workshop on Language Technology (TWLT-14) (pp. 07-116). Enschede, The Netherlands.

Cascia, M. La, Sethi, S. and Sclaroff, S. (1998). Combining textual and visual cues for content-based image retrieval on the World Wide Web. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (pp. 24-28). Los Alamitos, CA: IEEE Computer Society.

Deerwester, S., Dumais, S.T. and Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.

Dumais, S.T., Landauer, T.K. and Littman, M.L. (1996). Automatic Cross-Linguistic Information Retrieval using Latent Semantic Indexing. In Proceedings of SIGIR'96, Workshop on Cross-Linguistic Information Retrieval.

Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D. and Yanker, P. (1997). Query by image and video content: the QBIC system. In Maybury, M.T. (ed.), Intelligent Multimedia Information Retrieval (pp. 7-22). Menlo Park, CA: AAAI Press; Cambridge, MA: MIT Press.

Fogel, I. and Sagi, D. (1989). Gabor filters as texture discriminator. Biological Cybernetics, 61(2), 103-113.

Gargi, U. and Kasturi, R. (1996). An evaluation of color histogram based methods in video indexing. In Proceedings of the International Workshop on Image Databases and Multimedia Search (pp. 75-82).

Gevers, T. and Smeulders, A.W.M. (1997). PicToSeek: A content-based image search engine for the WWW. In Proceedings of VISUAL'97. San Diego, USA.

Hauptman, A.G. and Witbrock, M.J. (1997). Informedia: news-on-demand multimedia information acquisition and retrieval. In Maybury, M.T. (ed.), Intelligent Multimedia Information Retrieval (pp. 215-239). Menlo Park, CA: AAAI Press; Cambridge, MA: MIT Press.

Jong, F. de, Gauvain, J.-L., Hiemstra, D. and Netter, K. (this volume). Language-Based Multimedia Information Retrieval.

Liu, F. and Picard, R. (1996). Periodicity, directionality, and randomness: Wold features for image modelling and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7), 722-733.

Markkula, M. and Sormunen, E. (2000). End-user searching challenges indexing practices in the digital newspaper photo archive. Information Retrieval, 1(4), 259-285.

Marsico, M. De, Cinque, L. and Levialdi, S. (1997). Indexing pictorial documents by their content: a survey of current techniques. Image and Vision Computing, 15(2), 119-141.

Oard, D.W. and Dorr, B.J. (1996). A Survey of Multilingual Text Retrieval. Technical report TR-96-19, University of Maryland. http://www.ee.umd.edu/medlab/mlir/mlir.html



Pečenović, Z. (1997). Image Retrieval using Latent Semantic Indexing. Diploma thesis, École Polytechnique Fédérale de Lausanne.

Squire, D. McG., Müller, W., Müller, H. and Raki, J. (1999). Content-based query of image databases, inspirations from text retrieval: inverted files, frequency-based weights and relevance feedback. In The 11th Scandinavian Conference on Image Analysis (pp. 143-149). Kangerlussuaq, Greenland.

Westerveld, T., Hiemstra, D. and De Jong, F. (2000). Extracting Bimodal Representations for Language-Based Image Retrieval. In Multimedia '99, Proceedings of the Eurographics Workshop in Milano, Italy (pp. 33-42). Wien, Austria: Springer-Verlag.

Yang, Y., Carbonell, J.G., Brown, R.D. and Frederking, R.E. (1998). Translingual Information Retrieval: Learning from Bilingual Corpora. Artificial Intelligence, 103(1-2), 323-345.