Pattern Recognition Letters 20 (1999) 1337–1345
www.elsevier.nl/locate/patrec
Image retrieval using hierarchical self-organizing feature maps
I.K. Sethi *,1, I. Coman
Vision and Neural Networks Laboratory, Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
Abstract
This paper presents a scheme for image retrieval that lets a user retrieve images either by exploring summary views of the image collection at different levels or by similarity retrieval using query images. The proposed scheme is based on image clustering through a hierarchy of self-organizing feature maps. While the suggested scheme can work with any kind of low-level feature representation of images, our implementation and description of the system are centered on the use of image color information. Experimental results using a database of 2100 images are presented to show the efficacy of the suggested scheme. © 1999 Published by Elsevier Science B.V. All rights reserved.
Keywords: Exploration-based retrieval; Image databases; Image retrieval; Self-organizing feature maps
Electronic Annexes available. See www.elsevier.nl/locate/patrec.
* Corresponding author.
1 Present address: Department of Computer Science and Engineering, Oakland University, Rochester, MI 48309-4478, USA.

1. Introduction

The widespread availability of images and videos in digital form has created a growing interest in methods that can search image and video archives and retrieve images and videos of desired content. The current methods for providing content-based access to images and videos follow one of two approaches: (1) keyword-based retrieval (KBR) and (2) similarity-based retrieval (SBR). The KBR approach relies on manual cataloging to generate a set of descriptive keywords for each image or video. The keywords selected for an image are generally based on the
most direct description of the objects present in the image. It is the most widely used approach, followed by large on-line stock photography archives. Although simple and straightforward, the KBR approach has two main limitations. First, the descriptive keywords of an image do not provide any clue about its compositional aspects, which are important in many applications such as advertising. Second, users in different contexts or with different backgrounds tend to describe the same object using different descriptive terms, causing difficulties in image retrieval. Additionally, manual cataloging is prone to subjectivity and other cataloging errors. The SBR approach follows the dictum that the best representation of an image is the image itself. Instead of assigning descriptive keywords to each image, a feature vector representation for each image is extracted at the time of image cataloging. Access to images in the SBR approach is provided by searching for images that exhibit
0167-8655/99/$ - see front matter © 1999 Published by Elsevier Science B.V. All rights reserved. PII: S0167-8655(99)00103-8
I.K. Sethi, I. Coman / Pattern Recognition Letters 20 (1999) 1337–1345
feature vectors similar to the feature vector of the query image. Thus, the SBR approach lets a user search for images in an image database by presenting a query in visual form, making it more suitable for searching images based on their compositional aspects. The SBR approach is also well suited for computerized indexing. Therefore, the SBR approach has received considerable attention in the image processing, pattern recognition and database communities in the last few years. Several prototypical image retrieval systems have been built in recent years and some of these systems, e.g., QBIC (Niblack et al., 1993) and Virage (Bach et al., 1996), have been commercialized. A deficiency of the existing image-similarity retrieval systems is that they do not provide their users with a summary view of the images in the database. The availability of a summary view is important in situations where a user has no specific query image at the beginning of the search process and wants to explore the image collection to locate images of interest. In existing systems, the only way a user can obtain a feel for the images in a collection is through random browsing of thumbnail images. Such browsing is not guaranteed to take the user through the entire image collection, and it is time-consuming. The goal of the present paper is to describe a scheme for image retrieval that lets a user retrieve images either by exploring summary views of the image collection at different levels, or by similarity retrieval using query images. The proposed scheme is based on image clustering through a hierarchy of self-organizing feature maps (Kohonen, 1995). While the suggested scheme can work with any kind of low-level feature representation of images, our implementation and description of the system are centered on the use of image color information. The organization of the paper is as follows.
Section 2 presents a method for encoding the color composition information of images that is used to organize images for exploration and retrieval using hierarchical, self-organizing feature maps. Section 3 provides a brief exposition of hierarchical, self-organizing feature maps. Section 4 describes the
proposed scheme for image retrieval by exploration and similarity search. The performance of the system is described in Section 5. Finally, a summary of the work and conclusions are presented in Section 6.

2. Color composition representation

The image exploration and retrieval scheme described here uses image color information to build a feature vector representation for images. The motivation for choosing a color-based representation lies in the fact that color is an easily recognizable element of an image and the human visual system is capable of differentiating between a very large number of colors. The use of color for similarity retrieval requires two main considerations: (1) the selection of the color space, i.e. the color-coordinate system, and (2) a scheme for representing the color composition of an image. There is no consensus on the choice of color space; the RGB, HSV, HSI and YUV systems have all been used in different systems. Histogramming is the most commonly used scheme to capture the color composition of an image. For 24-bit images, the number of bins in the full color histogram is 2^24. Since such a high resolution is not needed for image similarity retrieval, it is common to quantize the color space by reducing either the color resolution or the color depth (Wan and Jay Kuo, 1996). The global color histogram, whether quantized or not, suffers from one drawback: it is not able to capture the spatial component of the color composition of an image. This has led to many variations of histogramming. For example, a local histogramming approach is suggested by Gong et al. (1995), where an image is divided into nine equal partitions and each partition has its own local color histogram. A multi-level histogramming approach based on a quad-tree structure is used by Lu et al. (1994) to incorporate spatial components of the color composition of an image.
Although these variations of color histogramming are able to capture the spatial distribution of color information, they do not provide an efficient representation scheme. The color information of each image is represented in a very high-dimensional space because of the many local histograms. This leads to high storage demands and inefficient searches during similarity retrieval. Our image representation scheme is guided primarily by three major factors. First, the representation must be closely related to human visual perception, since a user determines whether a retrieval operation in response to an example query image is successful or not. Second, the representation must encode the spatial distribution of color in an image. Third, the representation should be as compact as possible to minimize storage and computation efforts. Following these considerations, we use the HSV (hue, saturation, value) color coordinate system, which correlates well with human color perception and is commonly used by artists. Since digital images are normally available in the RGB space, we use the conversion program given in (Foley et al., 1994) to obtain HSV values in the range [0, 1]. In order to represent the spatial distribution of color in an image, we rely on a fixed image-partitioning scheme. This is in contrast with several proposals in the literature (Smith and Chang, 1996) suggesting color-based segmentation to characterize the spatial distribution of color information. Although the color-based segmentation approach provides a more flexible representation and hence more powerful queries, we believe that these advantages are outweighed by the simplicity of the fixed partitioning approach. In the fixed partitioning scheme, each image is divided into M × N overlapping blocks as shown in Fig. 1. The overlapping blocks allow a certain amount of `fuzziness' to be incorporated in the spatial distribution of color information, which helps in obtaining a better performance. To provide for partial-image queries, a masking bit is associated with each block. The default value of this bit for
Fig. 2. Two examples of original and approximated images.
every block is one. Only during partial-image queries are some of the mask bits set to zero. Three separate local histograms (hue, saturation, value) are computed for each block. Although these local histograms can be used to encode the spatial distribution of color information, the resulting representation is not compact enough. To obtain a compact representation, we extract from each local histogram the location of its area-peak. This is done by placing a fixed-sized window on the histogram at every possible location. At each location, the histogram area falling within the window is calculated. The location of the window yielding the highest area determines the histogram area-peak. This value then represents the corresponding histogram. Thus, each image is reduced to 3 × M × N numbers, three per block to account for the hue, saturation and value histograms. To demonstrate that our representation scheme is able to retain essential color information, we show in Fig. 2 two example images and their respective approximations using the area-peak representation.

3. Hierarchical self-organizing feature maps
Fig. 1. The fixed partitioning scheme with overlapping blocks.
The self-organizing feature map (SOFM) is a neural network-based method for unsupervised clustering that maps high-dimensional data onto a two-dimensional grid of neurons in such a way that similar high-dimensional data points are
mapped to the same or neighboring neurons (Kohonen, 1995). While some distortion is inevitable, the mapping generally preserves the neighborhood relationships. Grids of other dimensions are also possible; however, two-dimensional rectangular or hexagonal grids are most common. The SOFM learning process is a generalization of competitive learning. To construct a map, each neuron in the grid is initialized with small random weights. The neighborhood for each neuron, which shrinks during learning, is also initialized. The initialized weights then adapt through an iterative learning process consisting of the following steps:
1. randomly select an input vector and apply it to all neurons;
2. determine the winning neuron, i.e. the neuron whose weights most resemble the input vector;
3. bring the weights of the winning neuron closer to the input vector;
4. also bring the weights of the neurons in the neighborhood of the winning neuron closer to the input vector.
The learning process terminates when the weight adjustments become arbitrarily small. A hierarchical self-organizing map (HSOFM) is formed by arranging several layers of two-dimensional maps in a hierarchy. For each map unit in one layer of the hierarchy, a two-dimensional map is added to the next layer. The learning in an HSOFM is done in a sequential fashion; the map at the first layer, the highest level of the hierarchy, is trained first. While the first-layer map is trained with all the example vectors, the successive-layer maps are trained only with those example vectors that are won by their respective parent map unit. In many instances, the maps at a lower level are trained with truncated example vectors, omitting those vector components that are equal across the original training vectors. Compared to an SOFM, an HSOFM may be viewed as organizing the information at several levels, making finer and finer distinctions.
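The four learning steps above can be sketched as follows on a small two-dimensional grid. The grid size, Gaussian neighborhood function, decay schedules and random data are illustrative assumptions of this sketch, not parameters used in the paper.

```python
# Minimal SOFM training sketch: pick an input, find the winning
# neuron, and pull the winner and its grid neighborhood toward it.
import numpy as np

rng = np.random.default_rng(0)
rows, cols, dim = 4, 4, 3
weights = rng.random((rows, cols, dim)) * 0.1  # small random init
# Grid coordinates of every neuron, shape (rows, cols, 2).
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                            indexing="ij"), axis=-1)

def train_step(x, weights, lr, radius):
    # Step 2: winner = neuron whose weights most resemble the input.
    dists = np.linalg.norm(weights - x, axis=-1)
    win = np.unravel_index(np.argmin(dists), dists.shape)
    # Steps 3 and 4: move winner and its neighborhood toward the input,
    # with influence decaying over distance on the grid.
    grid_dist = np.linalg.norm(grid - np.array(win), axis=-1)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    weights += lr * influence[..., None] * (x - weights)
    return win

data = rng.random((200, dim))
for t in range(1000):
    lr = 0.5 * (1 - t / 1000)                 # decaying learning rate
    radius = max(0.5, 2.0 * (1 - t / 1000))   # shrinking neighborhood
    train_step(data[t % len(data)], weights, lr, radius)  # step 1
```

After training, nearby neurons hold similar weight vectors, which is the topology-preserving property the paper relies on for its mosaic summary views.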
This particular property of HSOFMs has been exploited by many researchers in the information retrieval area for organizing text documents and providing an exploratory search mode. For example, Merkl (1997) has used HSOFMs to generate a taxonomy of software manuals. Other
applications include the organization of full-text Usenet documents (Kohonen et al., 1996; Kaski et al., 1996), an analysis of the AI literature (Lin et al., 1991) and context learning in natural language processing (Scholtes, 1991). It should be noted that the organization of information learned by an HSOFM is not independent of the HSOFM architecture; it depends, in addition to the training data, on the number of layers and the size of each layer.

4. Image retrieval by exploration and similarity search

Our scheme for image retrieval by exploration and similarity search uses an HSOFM of three layers, as shown in Fig. 3. The first layer is called the global view layer and the corresponding map is called the global view map; it consists of a lattice of r1 × c1 neurons. The function of this layer is to provide an overall summary view, in the form of a mosaic image, of the entire image collection. The second layer, called the regional layer, has r1 × c1 maps. These maps are called regional view maps. Each map, consisting of a lattice of r2 × c2 neurons, corresponds to a neuron in the global view layer and provides a finer summary of the associated images in the form of a mosaic image. The final layer in the hierarchy of self-organizing maps consists of r2 × c2 maps, with each map having r3 × c3 neurons. The maps in the final layer are known as local view maps and the layer is called the local view layer. The local view maps provide yet another level of detailed summary, and each
Fig. 3. HSOFM architecture for image retrieval.
element of a local view map points to a group of images, which are directly accessible from that map element. These images constitute another layer that is called the image layer. The system can operate in two modes: exploration mode and similarity search mode. In the exploration mode, a user simply browses through the image collection by traversing up and down the hierarchy, and sideways at each level of the hierarchy, to view a small set of images of a chosen color composition. In the similarity search mode, a user accesses images via a query image. In this mode, both full and partial query modes are possible. In full query mode, the color composition of the full query image is used in the similarity search process. In partial query mode, a user has the option to specify which part of the query image should be used in the similarity search process. To provide for partial query mode, the system uses the masking bits associated with each image block. The default setting of all masking bits is 1. Some of these bits are cleared in the partial query mode, and the information from the corresponding blocks is not used in the similarity computation. To perform a similarity search, the color composition of the query image is first matched at the global view level to determine the most appropriate regional view that should be searched further. The matching is then repeated at the regional view level to locate the best matching local view map. A further search at the local view level brings out a set of images that may be most similar to the query image. These images are then individually compared with the query image to retrieve them in ranked order. The rank ordering is calculated by block-by-block matching of the dominant HSV triplets of the query image and the target image. Let q_i and t_i represent block number i in a query (Q) and a target (T) image, respectively. Let (h_{q_i}, s_{q_i}, v_{q_i}) be the dominant hue-saturation-value triplet for block i of the query image, and let (h_{t_i}, s_{t_i}, v_{t_i}) represent the same in the target image. The block similarity is then defined by the following relationship:

S(q_i, t_i) = 1 / (1 + a (h_{q_i} - h_{t_i})^2 + b (s_{q_i} - s_{t_i})^2 + c (v_{q_i} - v_{t_i})^2),   (1)

where a, b and c are constants that are selected to define the relative importance of hue, saturation and value in the similarity calculation. Using the similarities between the corresponding pairs of blocks from the query and target images, the similarity measure between a query-target image pair is computed by the following expression:

S(Q, T) = ( sum_{i=1}^{M×N} b_i S(q_i, t_i) ) / ( sum_{i=1}^{M×N} b_i ),   (2)
where b_i stands for the masking bit for block i and M × N is the number of blocks. Before the system can be used, the maps for the different layers must be trained using the images that the system is expected to handle. The training is performed using the self-organizing feature map learning briefly described earlier. The global view layer is trained using all the images. The subsequent layers are trained with the respective image subsets; for example, a regional map corresponding to neuron (i, j) of the global view is trained with an image subset consisting of only those images from the entire collection that are `won' by neuron (i, j) of the global view at the end of training. Once the maps at the different levels are fixed, changes to the image database, for example the addition of new images or the deletion of some existing images, are made only at the image layer level. This is done to avoid retraining; however, as more and more modifications to the image layer are made, the summary maps start to contain inaccurate information. In such a situation, a retraining of the system is performed to generate a new set of summary maps.

5. Performance

In this section, we present some results to show the performance of the system. These results are based on an implementation using a database of 2100 images. The implemented system has a global layer of 6 × 6 neurons. The regional layer consists of 36 maps of size 4 × 4. The number of maps in the local layer is 576; each map has 3 × 3 neurons. The main criterion for the selected architecture
was to have a moderate number of images associated with each neuron at the lowest level of the hierarchy. All layers use a hexagonal lattice structure, which was found to yield relatively even cluster sizes at the different levels of the hierarchy. To train the HSOFM, each image was partitioned into 8 × 8 overlapping blocks as described earlier and was represented by a 192-component vector consisting of 64 elements each for hue, saturation and value. The training was done using the SOM_PAK software (Kohonen et al., 1995). All layers were trained in two stages consisting of 10,000 iterations each. The first stage is the ordering phase, during which the reference vectors of the map units are ordered. During the second stage, the values of the reference vectors are fine-tuned. At the end of the training for each layer, the weight vectors of each neuron were mapped into an image, and mosaic images for the global view, the various regional views and the numerous local views were constructed. A small modification was made to the training procedure described earlier. The modification involves defining the training subset of images for the maps at the regional and local layers. The image subset for each map at these layers was constituted by pooling the images won by the parent neuron as well as the images won by its three closest neighbors. This modification was made in consideration of the size of the image database used in our experiment.
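The construction of the 192-component vector described above (8 × 8 overlapping blocks, each reduced to its three HSV area-peaks) can be sketched as follows. The histogram bin count, the area-peak window width and the overlap fraction are our own illustrative assumptions; the paper does not report these values.

```python
# 192-component image representation: 8 x 8 overlapping blocks,
# each block reduced to the area-peak locations of its hue,
# saturation and value histograms.
import numpy as np

def area_peak(channel, n_bins=16, window=3):
    """Center (scaled to [0, 1]) of the fixed-size histogram window
    enclosing the largest area, as described in Section 2."""
    hist, _ = np.histogram(channel, bins=n_bins, range=(0.0, 1.0))
    areas = [hist[i:i + window].sum() for i in range(n_bins - window + 1)]
    start = int(np.argmax(areas))
    return (start + window / 2) / n_bins

def feature_vector(hsv_image, grid=8, overlap=0.25):
    """H x W x 3 float HSV image -> 3 * grid * grid component vector."""
    h, w, _ = hsv_image.shape
    bh, bw = h // grid, w // grid
    oh, ow = int(bh * overlap), int(bw * overlap)  # overlap margins
    feats = []
    for r in range(grid):
        for c in range(grid):
            # Extend each block by the overlap margin on every side.
            r0, r1 = max(0, r * bh - oh), min(h, (r + 1) * bh + oh)
            c0, c1 = max(0, c * bw - ow), min(w, (c + 1) * bw + ow)
            block = hsv_image[r0:r1, c0:c1]
            feats.extend(area_peak(block[..., k].ravel()) for k in range(3))
    return np.array(feats)

# Synthetic HSV image with components already in [0, 1].
rng = np.random.default_rng(1)
img = rng.random((64, 64, 3))
v = feature_vector(img)
print(v.shape)  # (192,)
```

Vectors of this form would then serve as the training inputs for all three SOFM layers.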
Fig. 4. The global view of the experimental database of 2100 images.
Fig. 4 shows the global view of the entire image collection. The different color compositions present in the image database are clearly seen in this global view. It is evident from the global view that not many images with a dominating red hue are present. Furthermore, the global view mosaic shows a gradual change in color composition as we move in any direction in the global view map. This is due to the topology-preserving property of self-organizing maps.
Fig. 5. Two regional views of the image database. The view on the left corresponds to the third tile of the first row of the global view map. The view on the right is for the last tile of the fourth row of the global view map.
Fig. 5 presents two regional views, which show the next level of views corresponding to two different areas of the global view. Similarly, Fig. 6 shows two local views. These views from different layers indicate that the images have been organized at several levels according to their color content. It is easy to see that the hierarchical organization of images provides a convenient method for image exploration. For example, the map areas in the top left corner of the global image correspond to images with a predominance of blue. A user searching for sky images could explore this region of the map at the regional and local levels to see finer distinctions and finally retrieve a set of images with dominating blue in the upper part of the images. One such result of image retrieval by exploration is shown in Fig. 7. Figs. 8 and 9 show two sets of retrieval results for the similarity search mode. The first image in each set was used as the query image. The numerical value above each thumbnail image represents the similarity value between the thumbnail and the query image. The three constants, a, b and c, of the similarity measure defined earlier were taken as 2.5, 0.5 and 3.0, respectively. These values were chosen empirically. The keyword above each thumbnail comes from the image CD. It is evident from the keywords that they are too general to give a real idea of the image content.
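The ranking stage can be sketched directly from Eqs. (1) and (2) using the constants a = 2.5, b = 0.5 and c = 3.0 reported above. The two-block example vectors below are synthetic illustrations, not data from the experiments.

```python
# Block similarity (Eq. 1) and masked image similarity (Eq. 2).
import numpy as np

A, B, C = 2.5, 0.5, 3.0  # relative weights of hue, saturation, value

def block_similarity(q, t):
    """Eq. (1): q and t are (h, s, v) area-peak triplets for one block."""
    hq, sq, vq = q
    ht, st, vt = t
    return 1.0 / (1.0 + A * (hq - ht) ** 2
                      + B * (sq - st) ** 2
                      + C * (vq - vt) ** 2)

def image_similarity(query, target, mask=None):
    """Eq. (2): query and target are (n_blocks, 3) arrays of triplets;
    mask holds the per-block masking bits (1 = use the block)."""
    n = len(query)
    mask = np.ones(n) if mask is None else np.asarray(mask, dtype=float)
    sims = np.array([block_similarity(q, t) for q, t in zip(query, target)])
    return (mask * sims).sum() / mask.sum()

q = np.array([[0.2, 0.5, 0.8], [0.4, 0.4, 0.4]])
print(image_similarity(q, q))  # identical images -> 1.0
```

Clearing a masking bit, as in the partial query mode, simply removes that block from both the numerator and the denominator of Eq. (2).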
Fig. 6. Two local views of the image database. The view on the left corresponds to the first tile of the first regional view of Fig. 5. The view on the right is for the first tile of the second regional view of Fig. 5.
Fig. 7. Image retrieval by exploration. These images are retrieved when a user arrives at the first local view map of Fig. 6 and clicks on the middle image tile in the last row.
Fig. 8. An example of image retrieval via similarity search mode. The top left image was used as the query image.
Fig. 9. Another example of image retrieval.
6. Summary and conclusions

A scheme for image retrieval using color composition information was presented. The salient feature of the scheme is its ability to provide an image exploration mode in addition to a similarity search mode. The exploration mode is useful as it provides a summary view of the image collection at different levels of detail. The summary views are made possible by the use of hierarchical, self-organizing feature maps. These maps provide a trainable method of organizing images into different clusters based on color composition information. While the present scheme is meant for image color information, similar schemes are possible using other image features such as shape and texture. We are currently investigating such implementations. For further reading, see (Sethi et al., 1998).

References
Bach, J.R. et al., 1996. Virage image search engine: an open framework for image management. Proc. SPIE: Storage and Retrieval for Image and Video Databases 2670, 76–87.
Foley, J.D., van Dam, A., Feiner, S.K., Hughes, J.F., Phillips, R.L., 1994. Introduction to Computer Graphics. Addison-Wesley, Reading, MA.
Gong, Y., Chua, H., Guo, X., 1995. Image indexing and retrieval based on color histogram. In: Proc. 2nd Internat. Conf. Multimedia Modeling, Singapore, pp. 115–126.
Kaski, S., Honkela, T., Lagus, K., Kohonen, T., 1996. Creating an order in digital libraries with self-organizing maps. In: Proc. World Congress on Neural Networks, pp. 814–817.
Kohonen, T., 1995. Self-Organizing Maps. Springer, Berlin.
Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., 1995. SOM_PAK: The self-organizing map program package. Helsinki University of Technology, Finland.
Kohonen, T., Kaski, S., Lagus, K., Honkela, T., 1996. Very large two-level SOM for the browsing of newsgroups. In: Proc. Internat. Conf. Artificial Neural Networks, Bochum, Germany.
Lin, X., Soergel, D., Marchionini, G., 1991. A self-organizing semantic map for information retrieval. In: Proc. 14th Annual Internat. ACM SIGIR Conf., Chicago, IL, pp. 192–201.
Lu, H., Ooi, B., Tan, K., 1994. Efficient image retrieval by color contents. In: Proc. Internat. Conf. Applications of Databases, Vadstena, Sweden, pp. 95–108.
Merkl, D., 1997. Exploration of text collections with hierarchical feature maps. In: Proc. 20th Annual Internat. ACM SIGIR Conf., Philadelphia, PA, pp. 186–195.
Niblack, W. et al., 1993. The QBIC project: querying images by content using color, texture and shape. Proc. SPIE: Storage and Retrieval for Image and Video Databases 1908, 173–187.
Scholtes, J.C., 1991. Unsupervised learning and the information retrieval problem. In: Proc. Internat. Joint Conf. Neural Networks, Seattle, WA, pp. 95–100.
Sethi, I.K. et al., 1998. Color-Wise: A system for image similarity retrieval using color. Proc. SPIE: Storage and Retrieval for Image and Video Databases 3312, 140–149.
Smith, J.R., Chang, S.-F., 1996. Tools and techniques for color image retrieval. Proc. SPIE: Storage and Retrieval for Image and Video Databases IV 2670, 426–437.
Wan, X., Jay Kuo, C.-C., 1996. Color distribution analysis and quantization for image retrieval. Proc. SPIE: Storage and Retrieval for Image and Video Databases IV 2670, 8–16.
Discussion

Kamel: I would like to hear your comments about another application, in which the similarity in the images is not expressed by the color or by the shape, but rather in the semantics. What is your comment on how to handle this?

Sethi: I can give you a few ideas on that. Most people ignore that issue. My opinion is to move a little bit up in terms of the features. Instead of using the low-level features, one may try to extract mid-level features. Each of these mid-level features can be associated with a set of semantic concepts. Through relaxation, or some other similar scheme, one can then narrow down the semantic concepts associated with a collection of mid-level features detected in an image. (Note of the editors: at this point, recording of the discussion was interrupted by a power failure.)