Category: Multimedia Technology
Content-Based Image Retrieval
Alan Wee-Chung Liew, Griffith University, Australia
Ngai-Fong Law, The Hong Kong Polytechnic University, Hong Kong
Introduction

With the rapid growth of the Internet and multimedia systems, the use of visual information has increased enormously, such that indexing and retrieval techniques have become important. Historically, images have usually been manually annotated with metadata such as captions or keywords (Chang & Hsu, 1992). Image retrieval is then performed by searching for images with similar keywords. However, the keywords used may differ from one person to another, and many different keywords can describe the same image. Consequently, retrieval results are often inconsistent and unreliable. Because of these limitations, there is a growing interest in content-based image retrieval (CBIR). CBIR techniques extract meaningful information or features from an image so that images can be classified and retrieved automatically based on their contents. Existing image retrieval systems such as QBIC and Virage extract so-called low-level features such as color, texture and shape from an image in the spatial domain for indexing. Low-level features often fail to capture high-level semantic image content, which is subjective and depends greatly on user preferences. To bridge this gap, a top-down retrieval approach involving high-level knowledge can complement these low-level features. This article deals with various aspects of CBIR, including bottom-up feature-based image retrieval in both the spatial and compressed domains, as well as top-down task-based image retrieval using prior knowledge.
Background

Traditional text-based indexes for large image archives are time-consuming to create. A domain expert is required to examine each image scene and describe its content using several keywords. Language-based descriptions, however, can never capture the visual content sufficiently, because a description of the overall semantic content of an image cannot enumerate all the objects and their properties. Manual text-based annotation generally suffers from two major drawbacks: (i) content mismatch, and (ii) language mismatch. A content mismatch arises when the information that the domain expert ascertains from an image differs from the information that the user is interested in. When this occurs, little can be done to recover the missing annotations. A language mismatch, on the other hand, occurs when the user and the domain expert use different languages or phrases to describe the same scene. To circumvent language mismatch, a strictly controlled formal vocabulary or ontology is needed, but this complicates both the annotation and the query processes. In text-based image query, when the user does not specify the right keywords or phrases, the desired images cannot be retrieved without visually examining the entire archive.

In view of the deficiencies of the text-based approach, major research effort has been spent on CBIR over the past 15 years. CBIR generally involves the application of computer vision techniques to search for certain images in large image databases. “Content-based” means that the search makes use of the contents of the images themselves, rather than relying on manually annotated text. From a user perspective, CBIR should involve image semantics. An ideal CBIR system would perform semantic retrievals like “find pictures of dogs” or even “find pictures of George Bush.” However, this type of open-ended query is very difficult for computers to perform because, for example, a dog’s appearance can vary significantly between breeds. Current CBIR systems therefore generally make use of low-level features like texture, color, and shape. However, biologically inspired vision research generally suggests two processes in visual analysis: bottom-up image-based analysis and top-down task-related analysis (Navalpakkam & Itti, 2006). Bottom-up analysis consists of memoryless, stimulus-centric factors such as low-level image features. Top-down analysis uses prior domain knowledge to influence bottom-up analysis. An effective image retrieval system should therefore combine both low-level features and high-level knowledge so that images can be classified automatically according to their context and semantic meaning.
Existing CBIR Systems and Standards

The best-known commercial CBIR system is the QBIC (Query by Image Content) system developed by IBM (Flickner et al., 1995). Image retrieval is achieved by any combination of color, texture or shape, as well as by keyword. Image queries can be formulated by selecting colors from a palette, specifying an example image, or sketching a desired shape on the screen. Other well-known commercial CBIR systems are Virage (Gupta & Jain, 1997), which is used by AltaVista for image searching, and Excalibur (Feder, 1996), which is adopted by Yahoo! for image searching. Photobook (Pentland, Picard, & Sclaroff, 1996) from the MIT Media Lab is a representative research CBIR system. Like QBIC, it represents images by color, shape, texture and other appropriate features. However, Photobook computes information-preserving features, from which all essential aspects of the original image can be reconstructed.

In 1996, the Moving Picture Experts Group (MPEG), a working group (JTC1/SC29/WG11) of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), decided to start a standardization project called MPEG-7 (Manjunath, Salembier, & Sikora, 2002). Its aim is to provide quick and efficient identification and management of multimedia content so that audio-visual information can be easily searched. MPEG-7 specifies descriptors, description schemes and a description definition language. A descriptor is a representation of a feature at some level of abstraction, ranging from low-level visual features like shape, texture and color to high-level semantic information such as abstract concepts and genres; it defines the syntax and semantics of the feature representation. A description scheme specifies the structure and semantics of the relationships between its components, such as descriptors, and provides a way to model and describe content in terms of structure and semantics. The description definition language allows the creation of new description schemes and descriptors, and hence the extension and modification of existing description schemes. The MPEG-7 standard has eight parts. Of these, Part 3 specifies a set of standardized low-level descriptors and description schemes for visual content, including shape, color, texture and motion descriptors. Note that MPEG-7 specifies only the descriptors; how they are extracted is not specified as part of the standard.
CBIR Methodology

A CBIR system has three key components: feature extraction, efficient indexing, and a user interface.

•	Feature extraction: Image features include primitive features and semantic features. Examples of primitive features are color, texture, and shape. Primitive features are usually quantitative in nature and can be extracted automatically from the image. Semantic features are qualitative in nature and provide abstract representations of visual data at various levels of detail; typically, they are extracted manually. Once the features have been extracted, image retrieval becomes a task of measuring similarity between image features.
•	Efficient indexing: To facilitate efficient query and search, the image indices need to be organized into an efficient data structure. Because image features may be interrelated, flexible data structures should be used to facilitate storage and retrieval. Structures such as the k-d tree, R-tree, R*-tree, quad-tree, and grid file are commonly used (see the sketch after this list).
•	User interface: In visual information systems, user interaction plays an important role. The user interface consists of a query processor and a browser that provide an interactive environment for querying and browsing the database. Common query mechanisms provided by the user interface are: query by keyword, query by sketch, query by example, browsing by categories, feature selection, and retrieval refinement.
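To make the indexing step concrete, the following is a minimal Python sketch (assuming NumPy and SciPy are available) that builds a k-d tree over an archive of image signatures and answers a nearest-neighbor query. The random feature vectors are placeholders for real color/texture/shape signatures.

    import numpy as np
    from scipy.spatial import cKDTree

    # Hypothetical archive: 10,000 images, each described by a
    # 32-dimensional signature (e.g., a compact color/texture vector).
    rng = np.random.default_rng(0)
    archive_features = rng.random((10_000, 32))

    # Build the k-d tree once, offline.
    index = cKDTree(archive_features)

    # At query time, retrieve the five images whose signatures are
    # closest (in Euclidean distance) to the query signature.
    query_signature = rng.random(32)
    distances, image_ids = index.query(query_signature, k=5)
    print(image_ids, distances)

The index is built once, offline; each subsequent query then avoids a linear scan of the archive.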
In query by example, the user specifies a query image (either supplied by the user or chosen from a random set), and the system finds images similar to it based on various low-level criteria. In query by sketch, the user draws a rough sketch of the image he or she is looking for, for example, with blobs of color at different locations, and the system locates images whose layout matches the sketch. In either case, features are first extracted automatically from the query image or sketch to form a query signature. Matching against all other images in the archive is then performed by measuring the similarity between their signatures. The matching result is thus heavily influenced by the choice of features.
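The brute-force version of this matching step can be sketched as follows. The grey-level histogram used as the signature here is a deliberately simple placeholder; a real system would use the color, texture and shape features described in the sections below.

    import numpy as np

    def extract_signature(image):
        # Placeholder feature extractor: a normalized grey-level histogram.
        # A real system would combine color, texture and shape features.
        hist, _ = np.histogram(image, bins=64, range=(0, 256))
        return hist / hist.sum()

    def rank_archive(query_image, archive_images):
        # Compare the query signature against every archive signature and
        # return image indices ordered from most to least similar.
        q = extract_signature(query_image)
        dists = [np.linalg.norm(q - extract_signature(img))
                 for img in archive_images]
        return np.argsort(dists)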
Bottom-Up Content-Based Retrieval in the Spatial Domain

Content-based retrieval makes use of low-level image features computed from the image itself for matching. Commonly used features are color, texture and shape, which can be obtained directly from the spatial domain representation.
Color Signature

Color is one of the most widely used features, as it is relatively robust to translation and to rotation about the viewing axis (Deng et al., 2001). One often used color feature is the color histogram, which partitions the color distribution into discrete bins and can be used to show the overall color composition of an image. MPEG-7 specifies six color descriptors: the color space descriptor, dominant color descriptor, scalable color descriptor, group-of-frames/group-of-pictures descriptor, color structure descriptor and color layout descriptor. Two of these are related to the color histogram. For example, the dominant color information is defined as F = {(c_i, p_i, v_i), s}, i = 1, 2, ..., N, where N is the total number of dominant colors, c_i is a vector storing the color component values, p_i is the fraction of pixels in the image corresponding to color c_i, v_i is the color variance representing the color variation in the cluster surrounding c_i, and s is a number representing the overall spatial homogeneity of the dominant colors in the image. The color structure information is obtained from a localized color histogram computed with a small structuring window, so that the local spatial structure of the color can be characterized.
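As an illustration of the histogram idea (not of the MPEG-7 descriptors themselves), the sketch below computes a joint RGB histogram signature and compares two signatures with histogram intersection. It assumes images arrive as H×W×3 NumPy arrays with values in [0, 255].

    import numpy as np

    def color_histogram(rgb_image, bins=8):
        # Quantize each of R, G, B into `bins` levels and build a joint
        # 3D color histogram, flattened into a bins**3 feature vector.
        pixels = rgb_image.reshape(-1, 3)
        hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                                 range=((0, 256),) * 3)
        return hist.ravel() / hist.sum()

    def histogram_intersection(h1, h2):
        # Classic histogram similarity: 1.0 means identical color
        # composition, 0.0 means no overlap at all.
        return np.minimum(h1, h2).sum()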
Texture Signature

Texture is one of the basic attributes of natural images. Commonly used methods for texture characterization fall into three categories: statistical, model-based and filtering approaches. Statistical methods such as co-occurrence features describe the tonal distribution in textures (Wouwer, Scheunders, & Van Dyck, 1999). Model-based methods such as Markov random fields (Cross & Jain, 1983) provide a description in terms of spatial interaction, while filtering approaches, including wavelets, Gabor filters and the directional filter bank (DFB), characterize textures in the frequency domain (Chang & Kuo, 1993; Manjunath & Ma, 1996). It has been shown that directional information, together with scale information, is important for texture perception. As texture patterns can be analyzed at various orientations and multiple scales using Gabor filters, good texture descriptions can be obtained. However, Gabor filtering involves a nonseparable transform, which is computationally expensive. Recently, the contourlet transform and the multiscale directional filter bank have been proposed to solve this problem by combining the DFB with the Laplacian pyramid (LP) (Cheng, Law, & Siu, 2007). Although the LP is somewhat redundant, the combined approach is still computationally efficient while providing high angular resolution. As a result, efficient texture descriptors can be obtained.

There are three texture descriptors in MPEG-7: a homogeneous texture descriptor, a texture browsing descriptor and an edge histogram descriptor. The homogeneous texture descriptor adopts Gabor-like filtering for the texture description. The texture browsing descriptor provides a perceptual texture characterization in terms of the regularity, coarseness and directionality of the texture pattern. The edge histogram descriptor is obtained by analyzing the spatial distribution of edges in an image.
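The sketch below illustrates Gabor-style texture features in the spirit of Manjunath and Ma (1996): the image is filtered with a small bank of real-valued Gabor kernels at several frequencies and orientations, and the mean and standard deviation of each response form the signature. The kernel size, frequencies and orientation count are illustrative choices, not values from any standard.

    import numpy as np
    from scipy.signal import fftconvolve

    def gabor_kernel(frequency, theta, sigma=3.0, size=17):
        # Real part of a Gabor filter: a Gaussian-windowed sinusoid tuned
        # to one spatial frequency (cycles/pixel) and one orientation.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)
        envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
        return envelope * np.cos(2 * np.pi * frequency * xr)

    def gabor_texture_signature(gray_image, freqs=(0.1, 0.2, 0.3), n_orients=4):
        # Filter at several scales and orientations; the mean and standard
        # deviation of each response form the texture signature.
        feats = []
        for f in freqs:
            for k in range(n_orients):
                kernel = gabor_kernel(f, k * np.pi / n_orients)
                response = fftconvolve(gray_image, kernel, mode='same')
                feats.extend([np.abs(response).mean(), response.std()])
        return np.array(feats)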
Shape Signature

Shape is one of the key visual features used by humans to distinguish visual data. Compared with color and texture, shape is easier for users to describe in a query, whether by example or by sketch. However, because the 2D shape of a natural object depends on the view from which it is captured, the same shape can appear rotated, scaled, or skewed. An effective shape representation should therefore be invariant to rotation, translation and scaling, and ideally to affine transforms, to accommodate the different views of an object. Two common approaches to shape representation are boundary-based and region-based. The boundary-based approach works on the edges/outlines in the image, while the region-based approach considers entire regions. The MPEG-7 standard defines three shape descriptors: a region-based shape descriptor, a contour-based shape descriptor and a 3D shape spectrum descriptor. The region-based descriptor is based on a complex 2D angular radial transform, which belongs to a class of shape analysis techniques related to Zernike moments. The contour-based descriptor is based on extracting features such as contour circularity and eccentricity from the curvature scale-space representation of the contour. The 3D shape descriptor is based on a polygonal 3D mesh representation of the object.
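As a simple example of the boundary-based approach (distinct from the MPEG-7 contour descriptor), the sketch below computes Fourier descriptors of a closed contour and normalizes them so that the resulting signature is invariant to translation, scale and rotation.

    import numpy as np

    def fourier_shape_descriptor(contour, n_coeffs=16):
        # `contour` is an (N, 2) array of (x, y) boundary points traced
        # around the object. Treat each point as a complex number and
        # take the FFT of the boundary sequence.
        z = contour[:, 0] + 1j * contour[:, 1]
        coeffs = np.fft.fft(z)
        # Drop coeffs[0] (the centroid)       -> translation invariance.
        # Keep magnitudes only                -> rotation invariance.
        # Divide by the first harmonic's size -> scale invariance.
        mags = np.abs(coeffs[1:n_coeffs + 1])
        return mags / mags[0]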
Bottom-Up Content-Based Retrieval in the Compressed Domain

Features used in retrieval are often extracted from the spatial domain. Images, however, are usually compressed with JPEG or JPEG 2000 to reduce their size for storage and transmission. Retrieving such compressed images then requires decompressing them back to the spatial domain before features can be extracted. This requires many decompression operations, especially for large image archives. To avoid some of these operations, it has been proposed that feature extraction be done directly in the transform domain.
JPEG employs the discrete cosine transform (DCT) for image compression (Pennebaker & Mitchell, 1993). Low-level features such as color, shape and texture can be extracted directly by analyzing the DCT coefficients. For example, DCT coefficients can be reorganized into a tree structure that captures the spatial-spectral characteristics of the image for retrieval (Climer & Bhatia, 2002; Ngo, Pong, & Chin, 2001). JPEG 2000 is a newer compression standard that compresses images using wavelets (Taubman & Marcellin, 2002). As wavelets provide a multiresolution view of an image, several indexing techniques have been proposed that extract wavelet coefficients for coarse-to-fine image retrieval (Liang & Kuo, 1999; Xiong & Huang, 2002).

One of the major concerns with compressed-domain feature extraction is that the features are domain specific. Different compression techniques produce different transform coefficients, so the features that can be extracted in the compressed domain depend greatly on the compression scheme used. As a result, such a retrieval system can only retrieve images stored in a particular compression format. To extract features in the compressed domain irrespective of the compression format, a common framework called the subband filtering model has been proposed (Au, Law, & Siu, 2007). Under this model, the block-based DCT coefficients are reorganized into structures that are similar to wavelet subbands, allowing similar features to be extracted for retrieval. It has been shown that similar features can always be extracted in the JPEG and JPEG 2000 domains for retrieval, irrespective of the compression ratio.
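The sketch below illustrates the idea of DCT-domain features: the image is tiled into 8×8 blocks as in JPEG, and a few low-frequency coefficients per block are kept as a crude spatial-spectral signature. In a real JPEG pipeline these coefficients would come straight from the entropy-decoded stream, so no inverse DCT is needed; here the DCT is applied explicitly only so that the fragment is self-contained.

    import numpy as np
    from scipy.fft import dctn

    def dct_block_features(gray_image, block=8):
        # Tile the image into 8x8 blocks (as JPEG does), transform each
        # block with a 2D DCT, and keep the DC coefficient plus the first
        # two AC coefficients as a crude spatial-spectral signature.
        h, w = gray_image.shape
        feats = []
        for i in range(0, h - block + 1, block):
            for j in range(0, w - block + 1, block):
                c = dctn(gray_image[i:i + block, j:j + block], norm='ortho')
                feats.append([c[0, 0], c[0, 1], c[1, 0]])
        return np.array(feats)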
Top-Down Retrieval Using Prior Knowledge

Although the human visual mechanism is still not well understood, biological vision research generally suggests that two processes are involved in visual analysis: bottom-up image-based analysis and top-down task-related analysis. Bottom-up analysis consists of the extraction of the low-level features described above. Top-down analysis refers to high-level concepts that cannot be extracted directly from the image, but instead derive from the semantics of objects and image scenes as perceived by human beings. These conceptual aspects are subjective and are closely related to users’ preferences. Low-level image features fail to represent high-level semantic image features because they are only the basic components from which cognitive features are built. A human always focuses on the interesting part of an image based on what he or she is looking for, and the cognitive features of an image play the most important role in its semantic understanding. Whereas low-level features are usually context-free, high-level semantic features are context-rich. The relationship (spatial or conceptual) between image objects is a strong constraint for semantic image description. In Hare et al. (2006), a semantic space is used to describe the spatial relationships between image objects and hence the scene content. A user’s domain knowledge can be used to constrain such a semantic space; for example, a mountain scene would consist of blue sky in the top portion of the image, mountains in the middle portion, and grassland or forest in the foreground.

To incorporate high-level knowledge incrementally, relevance feedback was proposed as a way to interactively refine retrieval results (Rui, Huang, Ortega, & Mehrotra, 1998). The user indicates to the retrieval system whether each retrieved result is “relevant,” “not relevant” or “neutral.” With the use of learning algorithms such as support vector machines, a set of possibly better results can be obtained after a few rounds of this feedback.
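A minimal sketch of one feedback round, assuming the archive signatures are held in a NumPy array and using a support vector machine from scikit-learn as the learner:

    import numpy as np
    from sklearn.svm import SVC

    def relevance_feedback_rerank(archive_features, relevant_ids, irrelevant_ids):
        # Train a classifier on the user's relevance marks and re-rank the
        # whole archive by decision value, so that images resembling the
        # "relevant" examples move to the top of the result list.
        X = np.vstack([archive_features[relevant_ids],
                       archive_features[irrelevant_ids]])
        y = np.array([1] * len(relevant_ids) + [0] * len(irrelevant_ids))
        clf = SVC(kernel='rbf').fit(X, y)
        scores = clf.decision_function(archive_features)
        return np.argsort(-scores)  # best matches first

Repeating the round with fresh user labels lets the decision boundary, and hence the ranking, converge toward the user’s intent.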
Future Trends

A major challenge in CBIR research is to develop a semantic-sensitive retrieval system. Semantic similarity is an important image matching criterion for humans and is related to how we interpret image content. An effective image retrieval system should combine low-level features with high-level semantic knowledge so that images can be classified according to their context and meaning. However, very little research has been done on high-level semantic features. A possible reason is that high-level knowledge is difficult to define and formulate, because it is highly subjective and closely related to viewers’ expectations of scene context. There have been attempts to segment an image into regions using homogeneous low-level features and then combine these regions with context information to form a model for image indexing and retrieval. However, meaningful segmentation is often difficult to obtain in practice.

The central problem in semantic-sensitive CBIR is to define and extract the semantic concept behind the image. The concept can be related to image categories, such as buildings and gardens. It can also be tailored to particular application domains, such as detecting human faces or intruders. Once a concept is extracted and associated with certain low-level features, content-based image retrieval becomes concept matching, which is semantically more meaningful than low-level feature matching. Performing concept matching instead of low-level feature matching could also speed up the retrieval process. Machine learning techniques can be used to learn high-level semantic knowledge automatically, and relevance feedback has been proposed to improve retrieval results. The major challenge here is to develop effective and efficient algorithms for automatic concept learning and representation. Due to the fuzzy nature of semantic concepts,
probabilistic graphical models such as Bayesian networks or relevance networks could find applications here.
Conclusion
With the rapid growth of the Internet and multimedia systems, the use of visual information has increased enormously, such that image-based indexing and retrieval techniques have become important. This article has given an overview of CBIR. Current CBIR systems mainly extract low-level features such as color, texture, and shape from an image for classification and retrieval. However, it is well known that these low-level features cannot capture image semantics, whereas humans search for images by relying heavily on the semantic concepts behind them. To close this semantic gap, much more research effort is needed to find ways to define and incorporate high-level semantic knowledge into CBIR systems. We have discussed two major challenges in this area: (1) the formulation and extraction of semantic concepts from images, and (2) the development of effective machine learning algorithms for concept learning and representation.
References

Au, K.M., Law, N.F., & Siu, W.C. (2007). Unified feature analysis in JPEG and JPEG2000 compressed domains. Pattern Recognition, 40(7), 2049-2062.

Chang, S.K., & Hsu, A. (1992). Image information systems: Where do we go from here? IEEE Transactions on Knowledge and Data Engineering, 5(5), 431-442.

Chang, T., & Kuo, C.C.J. (1993). Texture analysis and classification with tree-structured wavelet transform. IEEE Transactions on Image Processing, 2(4), 429-441.

Cheng, K.O., Law, N.F., & Siu, W.C. (2007). Multiscale directional filter bank with applications to structured and random texture classification. Pattern Recognition, 40(4), 1182-1194.

Climer, S., & Bhatia, S.K. (2002). Image database indexing using JPEG coefficients. Pattern Recognition, 35(11), 2479-2488.

Cross, G.R., & Jain, A.K. (1983). Markov random field texture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(1), 25-39.

Deng, Y., Manjunath, B.S., Kenney, C., Moore, M.S., & Shin, H. (2001). An efficient color representation for image retrieval. IEEE Transactions on Image Processing, 10(1), 140-147.

Feder, J. (1996). Towards image content-based retrieval for the World Wide Web. Advanced Imaging, 11(1), 26-29.

Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., et al. (1995). Query by image and video content: The QBIC system. IEEE Computer, 28(9), 23-32.

Gupta, A., & Jain, R. (1997). Visual information retrieval. Communications of the ACM, 40(5), 71-79.

Hare, J.S., Sinclair, P.A.S., Lewis, P.H., Martinez, K., Enser, P.G.B., & Sandom, C.J. (2006). Bridging the semantic gap in multimedia information retrieval: Top-down and bottom-up approaches. In Proceedings of Mastering the Gap: From Information Extraction to Semantic Representation, 3rd European Semantic Web Conference.

Liang, K.C., & Kuo, C.C.J. (1999). WaveGuide: A joint wavelet-based image representation and description system. IEEE Transactions on Image Processing, 8(11), 1619-1629.

Manjunath, B.S., & Ma, W.Y. (1996). Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837-842.

Manjunath, B.S., Salembier, P., & Sikora, T. (2002). Introduction to MPEG-7: Multimedia content description interface. John Wiley & Sons.

Navalpakkam, V., & Itti, L. (2006). An integrated model of top-down and bottom-up attention for optimal object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2049-2056).

Ngo, C.W., Pong, T.C., & Chin, R.T. (2001). Exploiting image indexing techniques in DCT domain. Pattern Recognition, 34(9), 1841-1851.

Pennebaker, W.B., & Mitchell, J.L. (1993). JPEG: Still image data compression standard. Van Nostrand Reinhold.

Pentland, A., Picard, R.W., & Sclaroff, S. (1996). Photobook: Content-based manipulation of image databases. International Journal of Computer Vision, 18, 223-254.

Rui, Y., Huang, T.S., Ortega, M., & Mehrotra, S. (1998). Relevance feedback: A powerful tool in interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 8(5), 644-655.

Taubman, D.S., & Marcellin, M.W. (2002). JPEG2000: Image compression fundamentals, standards and practice. Kluwer Academic Publishers.

Wouwer, G.V.D., Scheunders, P., & Van Dyck, D. (1999). Statistical texture characterization from discrete wavelet representations. IEEE Transactions on Image Processing, 8(4), 592-598.

Xiong, Z., & Huang, T.S. (2002). Subband-based, memory-efficient JPEG2000 images indexing in compressed-domain. In Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation (pp. 290-294).
Key Terms

Bottom-Up Image Analysis: The use of low-level features, such as high luminance/color contrast or an orientation distinct from the surroundings, to identify objects in an image.

Compressed Domain Feature Analysis: The process of image signature extraction performed in the transform domain. Image features are extracted by analyzing the transform coefficients of the image without incurring a full decompression.

Content-Based Image Retrieval: An image retrieval scheme that searches for and retrieves images by matching information extracted from the images themselves. The information can be color, texture, shape, or high-level features representing image semantics and structure.

Feature Descriptors: A set of features used for image annotation and indexing. The features can be keywords, low-level features such as color, texture and shape, or high-level features describing image semantics and structure.

High Level Semantics: The image context as perceived by humans. It is generally subjective in nature and depends greatly on a user’s preferences.

Image Retrieval System: A computer system for users to search images stored in a database.

Image Signature: The same as the feature descriptors used for image annotation and indexing.

Keyword-Based Image Retrieval: An image retrieval scheme that searches for and retrieves images using metadata such as keywords. In this scheme, all images are annotated with certain keywords, and searching is performed by matching these keywords.

Relevance Feedback: An interactive way for humans to refine retrieval results. Users indicate to the image retrieval system whether the retrieved results are “relevant,” “irrelevant” or “neutral,” and the retrieval results are then refined iteratively.

Spatial Domain Feature Analysis: The process of image signature extraction performed in the spatial domain. Image features are extracted by analyzing the spatial domain representation of the image.

Top-Down Image Analysis: The use of high-level semantics, such as a viewer’s expectations of objects and image context, to analyze and annotate an image.