Image Retrieval Based On Structural Content
Sushil K. Bhattacharjee
Signal Processing Laboratory, Dept. of Electrical Engg., E.P.F.L., CH-1015 Lausanne
[email protected]
Abstract
A content-based image-retrieval system is described in this paper. The system is designed to administer a heterogeneous collection of images. The goal is to support querying in a rotation-, translation-, and juxtaposition-invariant fashion. The conceptual similarity measure used to compare two images is the number of small image-patches the images have in common. The patches to be compared are chosen using a 2D continuous wavelet which acts as a low-level corner detector. The local maxima in the response of this wavelet are used to locate potential corner-like features in the image, in an affine-invariant way. Then, a small region around every image-feature is used for the similarity measure. Each region is characterized by a set of Gaussian-derivative filter-responses evaluated at the corresponding feature-point. The filters extract local texture information from the luminance-channel of the image. The responses of n such filters for one region are organized in an n-D vector, referred to as a token. These tokens are further quantized to select indexing-terms which are used to describe each image. Every image is represented by a vector of weights corresponding to an ordered set of indexing-terms. The task of comparing images then boils down to computing a vector product.
1 Introduction
As digital imaging technology gains maturity, consumer-needs are evolving in a new direction: to be able to select particular images from a large collection. Already, the sheer volume of digital imagery available makes manual browsing impractical. To benefit fully from the advantages of digital archival, tools are required for automatically indexing a collection of images, and for retrieving the images of interest to the user. This requirement has triggered research in the area of automatic image-retrieval. Images are inherently unstructured data, and cannot be adequately characterized by external descriptions (manually assigned keywords). We need to generate automatic descriptions based on the content of the image. The image-retrieval system proposed in this paper is designed to administer an unstructured, heterogeneous collection of still images. Two important problems have to be solved for content-based indexing and retrieval of images. First, we need a meaningful measure of image-similarity. This measure will be used to establish the degree to which an image in the collection matches the user's query. Secondly, we have to generate a representation for images that is suitable for content-based access. For the similarity measure to be meaningful, the representation should characterize the essence of the image-content. The measure of image-similarity used in the proposed system is based on small regions of the images containing significant low-level structural information: the more such regions two images have in common, the more similar they are considered to be. My choice of similarity-measure for images is inspired, in part, by results from psychovisual research. We know that when a human subject analyses an image, the subject does not view the entire image at the same time. Rather, the
eye fixates momentarily on some point of interest, and then jumps to another point of significant interest within the field of view of the current fixation [Hub95]. This process is repeated over and over, with the eye saccading successively to different points of interest. Thus, in comparing two images, we seem to compare small subimages around points of interest. Here, I first attempt to find points of interest and then, for each such point, I characterize the local neighborhood by a set of filter-responses. Thus, using n filters, each image-patch selected from an image can be described by an n-dimensional (n-D) vector. Let us refer to these n-D vectors as tokens. Now, two image-patches can be compared based on their corresponding tokens. Every image in the collection is projected onto a set of canonical indexing-terms (collectively referred to as the indexing vocabulary). The indexing-terms used to describe an image are derived from the tokens of that image. Assuming that the indexing vocabulary consists of m terms, any image can be represented by an m-D vector. The problem of comparing two images then boils down to computing the classical vector-product. This technique is commonly used for describing textual documents in text-retrieval systems [Kow97]. In Sec. 2 I describe a 2D continuous wavelet that can be used to detect meaningful feature-points in an image. The feature-points are used to identify small patches in the image. The method for generating tokens, which describe the neighborhoods of the feature-points, is described in Sec. 3. The tokens from an image are used to select predefined indexing-terms that are useful in describing the image. The process of selecting such indexing-terms is described in Sec. 4, and in Sec. 5 we see how to assign weights to these indexing-terms. Retrieving desired images from a collection is an iterative process, where the results for one query can be used to present a more refined query in the next iteration.
One query-refinement procedure is described in Sec. 6. The overall design of the proposed image-retrieval system is given in Sec. 7, and in Sec. 8 sample results for two queries are presented. Finally, some concluding remarks are made in Sec. 9.
2 Feature-Point Detection In Images
In this section we address the problem of selecting points that correspond to visually meaningful features in the image. The most important requirement is that the process of detecting these points should be invariant to rotation and to juxtaposition. The second condition implies that if a feature-point F_A is detected in image I_A, we should also be able to detect the same feature-point in another image I_B if I_A is a subimage of I_B. I use a specific 2D continuous wavelet to detect such points. This wavelet is designed to simulate the end-stopped cells found in the mammalian visual cortex. Psycho-visual experiments with oriented linear stimuli show that such a cell responds strongly if a line at a specific orientation ends inside the receptive field of the cell [Hub95]. These cells are tuned to detect visual features like line-endings and corners. In computational terms, one way of detecting endings of lines in a specific orientation is to first separate out the lines in the desired orientation, and to then search for the ends of these lines. The Morlet wavelet [AM96], a direction-sensitive filter, can isolate linear structures having a given orientation in an image. Considering a line as a mono-dimensional structure, the ends of a line can be thought of as points of sharp discontinuity, which can be detected by applying the first-derivative-of-Gaussian (FDoG) filter along the line. Equation (1) specifies a mother-wavelet, ES_1 (in terms of spatial image-coordinates x and y), which combines these two ideas to detect end-points of lines:

ES_1(x, y) = \frac{1}{4}\, x\, e^{-\frac{x^2 + y^2 + y_1(y_1 - 2iy)}{4}},   (1)

where y_1 controls the frequency-selectivity of the filter. Let us look at the frequency-domain form of ES_1, given by \widehat{ES}_1 in Eqn. (2) in terms of spatial-frequency coordinates u and v:
\widehat{ES}_1(u, v) = 2iu\, e^{-\frac{u^2 + v^2}{2}} \cdot \frac{1}{4}\, e^{-\frac{u^2 + (v - y_1)^2}{2}}.   (2)
Equation (2) shows that \widehat{ES}_1 is a product of two components. The first component is a FDoG wavelet oriented along the frequency-axis u. The second component is a Morlet wavelet oriented along the v-axis, that is, in the direction perpendicular to the orientation of the FDoG wavelet. (The Morlet wavelet has a wave-vector of length |y_1|.) The behavior of the ES_1 wavelet can be easily discerned from Eqn. (2). The Morlet wavelet (a corrected form of the 2D Gabor function) tuned to a given orientation responds strongly to linear structures perpendicular to its orientation [AM96]. Therefore, the FDoG filter is applied only in the direction parallel to the orientation of the lines detected by the Morlet wavelet. The net effect of the wavelet is to highlight the end-points of lines oriented perpendicular to the Morlet component. The properties of this wavelet are discussed in greater detail elsewhere [Bha99]. In practice, the algorithm for detecting the coordinates of image feature-points is as follows:

1. Filter the input image with a suitably scaled version of the ES_1 wavelet at several orientations. The result is a set of response-images, each showing high activity near the end-points of linear structures oriented along a particular direction.

2. For each pixel position, retain only the strongest response value from among all the orientations. This produces a new `image', which we refer to as the maxima-image. Also, store the orientation corresponding to the maximum response at each pixel position. This angle is taken to be the dominant orientation associated with the point, and is used later to compute rotation-invariant representations of structural information.

3. Detect peaks of significant local maxima in the maxima-image. The coordinates of these peaks give the feature-points.

Figure 1 shows typical results for a photograph. The points detected for the input image of Fig. 1(a) are shown in Fig. 1(b). In Fig. 1(c) the same points are superimposed on the original image. In this case, the points were detected at a scale of 8 pixels (the wavelet has a spatial support of 8 pixels) using 18 orientations (corresponding to 10°-steps). First, we note that all points lie along strong linear structures in the image, and usually mark the end-points of lines. Also, the wavelet avoids the spurious linear structures which appear in the image because of gray-level quantization problems (the blocky artifacts in the background).
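The three-step detector above can be sketched with a frequency-domain implementation of the ES_1 filter-bank of Eqn. (2). This is a minimal numpy sketch, not the author's implementation: the frequency scaling, the thresholding used in peak detection, and the parameter values (`scale`, `y1`, `threshold_rel`) are illustrative assumptions.

```python
import numpy as np

def es1_response(image, scale=8.0, n_orient=18, y1=2.0):
    """Filter `image` with the ES1 wavelet of Eqn. (2) at `n_orient`
    orientations, and return the maxima-image together with the dominant
    orientation (radians) at each pixel.  Parameter values are illustrative."""
    h, w = image.shape
    F = np.fft.fft2(image)
    u = np.fft.fftfreq(w) * scale * 2 * np.pi     # scaled frequency grid
    v = np.fft.fftfreq(h) * scale * 2 * np.pi
    U, V = np.meshgrid(u, v)
    maxima = np.full((h, w), -np.inf)
    dominant = np.zeros((h, w))
    for k in range(n_orient):
        theta = k * np.pi / n_orient
        # rotate the frequency plane: FDoG along u', Morlet along v'
        Ur = U * np.cos(theta) + V * np.sin(theta)
        Vr = -U * np.sin(theta) + V * np.cos(theta)
        ES1 = (2j * Ur * np.exp(-(Ur**2 + Vr**2) / 2)
               * 0.25 * np.exp(-(Ur**2 + (Vr - y1)**2) / 2))   # Eqn. (2)
        resp = np.abs(np.fft.ifft2(F * ES1))
        upd = resp > maxima
        maxima[upd] = resp[upd]
        dominant[upd] = theta
    return maxima, dominant

def local_maxima(maxima, threshold_rel=0.5):
    """Keep pixels that dominate their 8-neighbourhood and exceed a fraction
    of the global peak; returns (row, col) feature coordinates."""
    m = maxima
    keep = m > threshold_rel * m.max()
    for dr, dc in [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]:
        keep &= m >= np.roll(np.roll(m, dr, 0), dc, 1)
    return np.argwhere(keep)
```

The dominant orientation returned per pixel is the one consumed later, in Sec. 3, to steer the Gaussian-derivative filters.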
Figure 1: Example of feature-point detection: (a) original image; (b) the extracted points; (c) points shown
in (b) superimposed on original image.

In most cases, the points lie at the ends of linear structures. In some cases, a point does not fall exactly on any corner. This is usually because of the interaction between several structures close together, which influences the position of the local maxima in the wavelet-response.
3 Texture Characterization of Image Patches
Each feature-point detected in an image is used to select a small patch of the image. Two images are compared based on such patches. To determine if an image patch P_A is the same as another
patch, P_B, we do not require an exact, pixel-by-pixel match. General visual similarity of the two patches is sufficient for our purposes. Instead of using a correlation-based approach, here we compare image-patches based on their textural properties. I use a set of Gaussian-derivative filters to characterize the local image-patch around each feature-point. The description generated is invariant to rotation and translation. The extent of the image-patches is not specified explicitly; it is implicit in the scale used for evaluating the filter-response at a particular feature-point. Researchers using several different approaches have reported that Gaussian-derivative filters can be used to compactly represent local image information [You85, San89, OF96, RB97]. In fact, successive derivative-of-Gaussian (DoG) filters have been shown to approximate the Principal Components of the image-patch under examination [You85]. There is also evidence from the field of psycho-visual research of the presence of cells in the visual cortex which simulate lower-order DoG filters (commonly of orders 0 to 4). From the computational point of view, DoG filters have the attractive property that they are steerable [FA91]. In the proposed system three families of filters, namely the first-, second-, and third-order derivatives of Gaussians, are used to capture the texture properties of the neighborhood of a feature-point. For each feature-point in the image, the filter-responses are computed only at the dominant orientation associated with the feature-point. The filter-responses may be computed at several scales. If a bank of n filters is used to characterize an image-patch, the responses are organized in an n-D vector, which then represents information about the immediate neighborhood of the corresponding feature-point. Comparing the situation with text-retrieval systems, it is easy to see the analogy between the n-D vectors describing small regions in an image, and significant words found in a text-document.
For this reason, following text-retrieval jargon, we will refer to these vectors as tokens. By construction, the tokens extracted from an image are invariant to rotation and translation. Images can now be compared based on their respective sets of tokens. Given three different images, A, B, and Q, we can say that Q is more similar to A than to B if Q has more tokens in common with A than with B. Recall that the tokens actually consist of real-valued filter-responses. Unless two image-patches are exactly the same, they are not likely to produce exactly the same filter-responses. So, in order to declare that a given token τ^Q ∈ Q is the same as some token τ^A ∈ A, we must compute the distance between the two tokens, and check if this distance is shorter than some predetermined threshold. This approach has two problems. First, the process of comparing a query image to images in the collection quickly becomes computationally expensive as the size of the collection increases. Second, the use of a threshold makes the entire system fragile, because the choice of an appropriate threshold is crucial to good performance. One way of avoiding these problems is by projecting the n-D tokens onto a set of predetermined points (in the same n-D space). The process of selecting such points is described next.
4 Construction of an Indexing Vocabulary
We can make the querying-complexity independent of the number of tokens representing the collection by further mapping the tokens onto an indexing vocabulary consisting of a set of canonical indexing-terms. Following the approach of Lorenz [Lor96], I construct an indexing vocabulary by first subdividing every axis of the n-D space of tokens into fixed-size intervals. In general, the interval may be different for each axis. The intervals impose an n-D lattice on the space of tokens. Each grid-point of this lattice is taken as an indexing-term. In the absence of any constraint on the image-content present in the collection, this is a reasonable choice. Next I will describe the process of selecting indexing-terms appropriate for describing a specific image, based on the tokens extracted from the image. In what follows, the symbol τ_k will represent the kth token of an image, and the symbol φ_i will represent the ith indexing-term of the indexing vocabulary.
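The lattice construction can be sketched as follows: given a token and the per-axis interval sizes, the 2^n grid-points at the corners of the enclosing hyper-cuboid are the candidate indexing-terms. The helper below is hypothetical; the paper does not prescribe an implementation.

```python
import numpy as np
from itertools import product

def enclosing_terms(token, step):
    """Grid-points (indexing-terms) at the corners of the hyper-cuboid
    of the n-D lattice (spacing `step` per axis) enclosing `token`.
    Returns a (2**n, n) array of corner coordinates."""
    token = np.asarray(token, dtype=float)
    step = np.asarray(step, dtype=float)
    lo = np.floor(token / step) * step      # lower corner of the cell
    corners = [lo + np.array(bits) * step
               for bits in product((0, 1), repeat=len(token))]
    return np.array(corners)
```

For a 2-D token (0.3, 1.7) on a unit lattice, this returns the four corners (0,1), (0,2), (1,1), (1,2); in Sec. 5 the influence of the token is distributed over exactly these l = 2^n terms.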
5 Term-Weighting
Obviously most of the tokens generated from an image will not coincide exactly with the indexing-terms in n-D space. In fact, the likelihood of this happening is zero. A token may be placed anywhere inside a hyper-cuboid which has indexing-terms at every vertex. The simplest approach for mapping an arbitrarily placed token onto the indexing vocabulary is to consider the grid-point closest to the token as the representative indexing-term. Thus, the weight of an indexing-term is incremented every time it happens to be the closest one to a token. In practice, however, as the density of tokens increases, the high quantization noise incurred in this approach renders the entire system useless. A more robust approach is to select all the indexing-terms that are influenced by a given token. In this work the influence of a given token is distributed over only the indexing-terms that are immediate neighbors of the token (i.e., the indexing-terms that lie at the corners of the hyper-cuboid enclosing the token). The image can then be represented as a vector of weights based on these indexing-terms. This model of image representation is called the vector space model (VSM). In the simplest approach, each image is represented by a binary vector: the indexing-terms not used to describe the image are assigned weight `0', and every indexing-term that is influenced by at least one token found in the image carries the weight `1' in the vector. The Jaccard coefficient [JD89] can be used to determine the similarity, Sim_J(I_1, I_2), of two images, I_1 and I_2, based on their corresponding binary vectors:
Sim_J(I_1, I_2) = \frac{d_{11}}{m - d_{00}},   (3)
where d_{11} is the number of indexing-terms that are present in both images, d_{00} is the number of indexing-terms not present in either image, and m is the length of the two binary vectors being compared (i.e., the size of the vocabulary). In the binary weighting scheme, the weight of an indexing-term does not indicate its relative significance to the corresponding image; all indexing-terms present in the image are considered to be equally significant. Other weighting schemes, with multi-valued weights of indexing-terms, have been found to perform better in the context of text-retrieval systems [SB88]. In such algebraic schemes the weight of an indexing-term reflects the capacity of this term to distinguish the image in question from other images. Three factors should be considered when determining the significance of an indexing-term to a given image:

1. the term frequency (tf) of the indexing-term, given by the number of times the term appears in the image;
2. the inverse collection frequency (icf) of the term, given by the number of images where the term appears; and,
3. the relative lengths of the weight-vectors. The vectors may all be normalized to unit length.

In our case, the indexing-terms do not explicitly appear in the image. The image produces only tokens, and the indexing-terms are computed from the tokens. However, the distance between a token and an indexing-term may be interpreted as a measure of the probability that the token is a noisy version of the indexing-term. Therefore, we use the expected term frequency and the expected inverse collection frequency to determine the weight of an indexing-term for a given image. Let us first compute the influence of a token τ_k on an indexing-term φ_i. The influence ω_{ik} of τ_k on φ_i is an estimate of the probability that τ_k is a noisy version of φ_i. I assume that this influence drops exponentially with the distance between τ_k and φ_i. Specifically, ω_{ik} should have the following properties:

1. ω_{ik} should be non-negative;
2. ω_{ik} should be an exponentially decreasing function of d_{ik}, the distance between φ_i and τ_k;
3. the influences of a given token τ_k over all l neighboring indexing-terms should sum to unity (\sum_{i=1}^{l} ω_{ik} = 1); and,
4. if τ_k coincides with φ_i, then ω_{ik} should be unity, and the influence of τ_k on all other indexing-terms should be zero.

These requirements make intuitive sense. (The actual form of the function used in the second requirement is not very crucial; the exponential may be replaced by any other continuous, monotonically decreasing function.) Equation (4) gives the mathematical expression of a weighting function that satisfies the above conditions:

ω_{ik} = \frac{e^{-d_{ik}^2} \prod_{j \neq i}^{l} d_{jk}^2}{\sum_{x=1}^{l} e^{-d_{xk}^2} \prod_{j \neq x}^{l} d_{jk}^2},   (4)
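A direct transcription of Eqn. (4), assuming at most one of the distances is zero (a token can coincide with at most one grid-point of the lattice):

```python
import numpy as np

def influences(dists):
    """Influence weights of one token on its l neighbouring indexing-terms,
    per Eqn. (4), given the Euclidean distances d_1k, ..., d_lk.
    If some d_ik is zero, every other numerator contains the factor
    d_ik**2 = 0, so omega_ik = 1 and all other weights vanish."""
    d2 = np.asarray(dists, dtype=float) ** 2
    l = len(d2)
    # prod_{j != x} d_jk^2 for each x
    prods = np.array([np.prod(np.delete(d2, x)) for x in range(l)])
    num = np.exp(-d2) * prods
    return num / num.sum()
```

Note how the product terms implement the "collapsing Gaussian" of the special case: the exponential alone would spread a small residual influence over the other corners even when the token sits exactly on a grid-point.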
where the influence of token τ_k is distributed onto l = 2^n indexing-terms, and ω_{ik} is the influence of τ_k on the indexing-term φ_i. Here, d_{ik} is the Euclidean distance between φ_i and τ_k. Note that in Eqn. (4), if d_{ik} is zero for some i ∈ {1, …, l} (i.e., τ_k falls exactly on φ_i), then the corresponding ω_{ik} is unity, and all other weights ω_{jk}, j = 1, …, l, j ≠ i, are zero. A physical interpretation of Eqn. (4) is the following: the influence of a token on the neighboring indexing-terms is determined basically by placing an n-D Gaussian centered over the token. The influence on an indexing-term is then the value of the Gaussian at that indexing-term. The influences on all l indexing-terms are further adjusted so that the sum of the influences is unity. In the special case where the token coincides with an indexing-term, the spread of the Gaussian is forced to collapse to zero, thus preventing the token from influencing any other indexing-term. For a given image I_j, we can now compute the expected term frequency, etf(φ_i, I_j), as
etf(φ_i, I_j) = \sum_{τ_k ∈ I_j} ω_{ik}.   (5)
The higher the value of etf(φ_i, I_j), the more significant the term φ_i is to the image I_j. However, if an indexing-term appears in many images, it is not very discriminatory, and hence not very useful for retrieval. Such terms should have low weights. The second factor, the expected inverse collection frequency, serves this purpose. The expected collection frequency, ecf(φ_i), of an indexing-term φ_i is an estimate of the number of times φ_i appears in the entire collection. We can define ecf(φ_i) based on the number of images in which φ_i appears with non-zero probability:

ecf(φ_i) = \sum_{I_j ∈ collection} p(etf(φ_i, I_j) > 0),   (6)

where

p(etf(φ_i, I_j) > 0) = 1 − \prod_{τ_k ∈ I_j} (1 − ω_{ik}).   (7)
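Equations (5)-(7) can be sketched as follows. The data layout is an assumption made for illustration: the influences of image j are stored as an (n_tokens × m) array `omega[j]`, one row per token, one column per indexing-term.

```python
import numpy as np

def etf(omega, j):
    """Expected term frequency, Eqn. (5): sum the influences of all
    tokens of image j, per indexing-term."""
    return omega[j].sum(axis=0)

def ecf(omega):
    """Expected collection frequency, Eqns. (6)-(7)."""
    # per image: p(etf > 0) = 1 - prod_k (1 - omega_ik)
    p = np.array([1.0 - np.prod(1.0 - om, axis=0) for om in omega])
    return p.sum(axis=0)

def weights(omega, j, C):
    """tf-icf weight of Eqn. (8), using the probabilistic expected
    inverse collection frequency log(C / (ecf + 1))."""
    return etf(omega, j) * np.log(C / (ecf(omega) + 1.0))
```

In a real system ecf would be computed once at indexing time over the whole collection, not per query.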
The expected inverse collection frequency, eicf(φ_i), of the indexing-term φ_i is then given by 1/ecf(φ_i). In information-theoretic terms, the information content of φ_i varies logarithmically with the inverse of its frequency. This leads to the definition of the probabilistic expected inverse collection frequency, given by log(C / (ecf(φ_i) + 1)), where C is the number of images in the collection. Another factor that affects retrieval effectiveness in the vector-space model is the length-normalization of the m-D image-vectors. Each vector can be normalized to unit length, simply by dividing the vector by its magnitude. Several weighting schemes can be designed by combining the three factors in different ways. In text-retrieval systems, tf·icf weighting is the most popular scheme. In our case, this scheme is implemented as

w_i = etf(φ_i, I_j) · eicf(φ_i),   (8)

where w_i gives the weight of indexing-term φ_i for image I_j. Here the normalization component is unity (i.e., no normalization is done). Therefore, assuming an indexing vocabulary of m indexing
terms, the image I_j may now be described by an m-D vector of weights. Formally, we have a mapping, F, from the space, S, of images to the space, R^m, of m real-valued weights of indexing-terms:

F : S → R^m : I_j → F(I_j) = [w_0, w_1, …, w_i, …, w_{m−1}]^T.   (9)
The vector-product of two vectors is taken as a measure of similarity between the corresponding images. (When length normalization is used, this measure essentially returns the cosine of the angle between the two vectors; in this case, all similarity values lie in the range [0, 1].) Given a query, the similarity of an image from the collection to the query is formally called the retrieval status value (RSV) of the image. For every query, we can generate a permutation of the set of images in the collection, by ranking the images in decreasing order of RSV.
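Ranking by RSV then reduces to a matrix-vector product; with unit-length normalization this is the cosine measure. A minimal sketch:

```python
import numpy as np

def rank_by_rsv(query_vec, image_vecs, normalize=True):
    """Rank collection images by retrieval status value (RSV): the vector
    product of the m-D weight vectors of Eqn. (9).  With unit-length
    normalization this is the cosine measure, so RSV lies in [0, 1] for
    non-negative weights."""
    q = np.asarray(query_vec, dtype=float)
    M = np.asarray(image_vecs, dtype=float)
    if normalize:
        q = q / np.linalg.norm(q)
        M = M / np.linalg.norm(M, axis=1, keepdims=True)
    rsv = M @ q
    order = np.argsort(-rsv)          # decreasing RSV
    return order, rsv[order]
```

The returned `order` is exactly the permutation of the collection mentioned above.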
6 Query Refinement
When using a retrieval system, the user may not be satisfied with the first set of results. More often than not, the user will not be able to express the information-need as the most appropriate query in the first attempt. For example, the user may not be able to find the image that represents one's needs exactly. Thus, the querying-process should be iterative, and should allow the user to refine the query from one iteration to the next. The vector-space model allows us to implement a simple query-refinement scheme, called relevance feedback [Kow97]. After the retrieval results for the initial query are available, the user can interactively indicate which images are relevant. The remaining images in the answer-set are considered non-relevant. The system automatically generates a new query-vector, based on the feedback provided by the user. Let R be the set of images marked relevant by the user and let N be the set of non-relevant images, with cardinalities |R| and |N| respectively. If an indexing-term φ_i appears with high probability in most of the images marked relevant, this term should have a larger weight in the new query, whereas if the term appears quite often in the non-relevant documents, its weight in the subsequent query should be small. This will usually bias the new query-vector more towards the kind of images that the user considers relevant. Equation (10) gives Rocchio's formula [Roc71] for updating the weights of terms in the query-vector according to the relevance-feedback:

w_i^{Q'} = α w_i^{Q} + β \left(\frac{1}{|R|} \sum_{K ∈ R} w_i^{K}\right) − γ \left(\frac{1}{|N|} \sum_{K ∈ N} w_i^{K}\right),   (10)
where w_i^{Q'} is the updated weight of the indexing-term φ_i, which previously had the weight w_i^{Q} in the query, and w_i^{K} represents the weight of term φ_i in image K. α, β, and γ are suitably chosen constants. Basically, Eqn. (10) generates the refined query-vector as a weighted combination of the current query-vector and the difference between the average of the relevant vectors and the average of the non-relevant vectors. Besides query-refinement, relevance feedback also serves another important purpose. In a similarity-based scheme, such as that used in this image-retrieval system, there is no way of implementing a rejection criterion. The feedback is one way of introducing external (subjective) knowledge into the system, which can be used to avoid certain images in the collection that the user definitely does not want.
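Rocchio's update of Eqn. (10) is only a few lines; the default constants below match the experimental settings reported in Sec. 8 (α = 1, β = 0.5, γ = 0.25):

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Rocchio update of Eqn. (10): shift the query-vector towards the
    mean of the relevant vectors and away from the mean of the
    non-relevant ones."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q
```

The guards simply skip a term when the user marked no image relevant (or none non-relevant) in an iteration.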
7 A Retrieval System for Images
Figure 2 shows the design of the proposed system. Given a collection of images, feature-points are first detected in every image, and the corresponding tokens are generated using the derivatives-of-Gaussian filters. Based on these tokens, the weights for the associated indexing-terms are computed. These weights are stored in an indexing structure. To retrieve images, the user specifies an example image as the query, and also specifies the desired number, d, of returned images. Tokens are extracted from the query-image, and the corresponding indexing-terms are computed, along with
Figure 2: Architecture of the proposed image-retrieval system.

their weights for the query-image. These weights form the query-vector. The query-vector is compared, in turn, with every vector representing an image in the collection, based on the vector-product formula, and the result is a list of images ranked by their RSV with respect to the current query-vector. The first d images in this list are presented to the user as the answer-set. The user may then provide relevance feedback, and execute a more refined query in the next iteration. This process can be repeated until the user is satisfied, or until the subjective quality of the answer-set no longer improves.
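The retrieval loop just described can be sketched end-to-end with toy stand-ins for the indexed image-vectors; the random `collection` and the simulated feedback are assumptions made purely for illustration.

```python
import numpy as np

def cosine_rank(q, M):
    """Rank rows of M (image-vectors) by cosine similarity to query q."""
    qn = q / np.linalg.norm(q)
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    return np.argsort(-(Mn @ qn))

# toy stand-ins for indexed image-vectors (assumed data, m = 4 terms)
rng = np.random.default_rng(0)
collection = rng.random((5, 4))
query = collection[2].copy()            # query-by-example: image 2 itself

for _ in range(2):                      # two query iterations
    order = cosine_rank(query, collection)
    # simulated feedback: top-ranked image relevant, last one non-relevant
    rel = collection[order[:1]]
    nonrel = collection[order[-1:]]
    # Rocchio update with the constants of Sec. 8
    query = 1.0 * query + 0.5 * rel.mean(axis=0) - 0.25 * nonrel.mean(axis=0)
```

In the real system the feedback comes from the user, and the loop terminates when the answer-set stops improving.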
8 Experimental Results
Results obtained for two sample queries on a test collection are presented in this section. The collection used in the experiments discussed here consists of about 1400 images. Of these, 600 images have been taken from a public-domain collection of face images from the XM2VTSDB collection², and 700 images are non-overlapping subimages extracted from single-texture images of MIT's VisTex collection³. The textured images appear in several orientations. The remaining images are the so-called ornament images⁴. Ornament images are black-and-white images (represented
² http://www.ee.surrey.ac.uk/Research/VSSP/xm2vtsdb
³ http://www-white.media.mit.edu/vismod/imagery/VisionTexture/vistex.html
⁴ Some examples of ornament images are shown in [BBM96], where a retrieval system for ornament images is described.
in gray-scale) depicting intricate hand-drawn patterns. For these experiments, only three filters (one scale per DoG) have been used in the token-extraction process. The images in the collection are represented using the normalized best probabilistically weighted scheme (etf · log(C / (ecf + 1)), with image-vectors normalized to unit length). The vector similarity is computed using the vector product, which, in this case, returns the cosine of the angle between the vectors (i.e., the well-known Cosine measure). For relevance feedback, the parameter-settings α = 1, β = 0.5, and γ = 0.25 have been used. Figure 3 shows the result of a `Building' query after one round of relevance feedback. Twenty images were requested (d = 20). The query image is shown in Fig. 3(1). This image is also the image retrieved with the highest RSV. The other images shown in Fig. 3 are arranged in decreasing order of RSV (the number below each image shows the rank of the image). The results demonstrate that the indexing method is able to capture low-level visual features in images. Comparing the query with the fourth and sixth images retrieved, we see that the system can also accommodate moderate variations in scale. Figure 4 shows the result, after one round of relevance feedback, of a `Bark' query, where 20 images were requested. The query image is shown in Fig. 4(1). Again, this image is also the image retrieved with the highest RSV. The other images shown in Fig. 4 are arranged in decreasing order of RSV, in row-major order. In the answer-set, only two images (ranked 13 and 14) are not bark-images. However, even when evaluated visually, these images seem to have textures similar to some of the other bark-images retrieved (e.g., the texture of image 15, or 17). These examples demonstrate that the proposed system is invariant even to significant rotation, and that it works well for queries with strong structural information, as well as for seemingly random textures.
In the future, the collection will also include photographic images (e.g., vacation photographs).
9 Discussion
In recent years several schemes have been proposed for implementing content-based automatic image-retrieval systems. Most of them rely on color-distribution attributes of images, or shape-characteristics of objects in view [GR95]. Retrieval based on shape-descriptors of objects is not feasible in the case of heterogeneous collections of images, because the approach relies on scene-segmentation, which is not a stable process. In this paper I have proposed a system for retrieving images using low-level structural information. The measure for comparing digital images is based on small image-patches. The use of only local information allows us to implement retrieval based on subimages as well. The image-patches are selected automatically, using a feature-detection scheme that detects corner-like features. Also, the extent of an image-patch is not explicitly defined; it is determined implicitly, from the spread of the filters used to evaluate the texture properties surrounding a feature-point. The texture of an image-patch is characterized using a bank of Gaussian-derivative filters. The responses of the filters are organized into a vector, referred to as a token. The token-extraction process uses only the luminance information in the image. The number of filters, as well as the number of scales per filter, may be set arbitrarily within the proposed framework. In addition to the filter-responses, other attributes, such as color information, can be readily incorporated in the tokens. The feature-detector is essentially a directional-derivative-of-Morlet wavelet. As Fig. 1 shows, the feature-points detected do not always fall exactly on corners. This happens when the natural scale of the feature is smaller than the scale of the analyzing wavelet, or when there are several features close together which interact with each other and influence the wavelet response. However, this is not a problem.
For retrieval, all we require is that the process of detecting points be stable under rotation and translation. The exact position of a feature-point is not important. Images in the collection are represented in a vector space model, where the space of vectors is defined by a set of indexing-terms. The indexing-terms are basically quantized versions of the tokens extracted from an image. The use of indexing-terms reduces the computational complexity of
comparing images, and also increases the robustness of the retrieval system. The negative eects of this quantization can be mitigated to some extent by using appropriate weighting schemes for the the indexing-terms. Several weighting schemes originally proposed in the context of textretrieval [SB88] can be used in the VSM. I have shown results of sample queries using the normalized best probabilistically weighted scheme. Preliminary results show that the retrieval process is able to compare images based on low-level structural attributes in a rotation-invariant fashion, and that the system can also accommodate small variations in scale. The proposed approach can be adapted for specialized applications where the collection is more homogeneous, by adding additional layers which incorporate knowledge about the type of images in the collection.
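The vector-space comparison can be sketched as follows. This is a hedged illustration: the codebook, the tf-idf-style weighting (a stand-in for the probabilistic scheme used in the paper), and the function names are all assumptions made for the example.

```python
import numpy as np

def index_image(tokens, codebook):
    """Quantize each token to its nearest codebook entry (its indexing-term)
    and return a term-frequency vector over the codebook."""
    dists = np.linalg.norm(tokens[:, None, :] - codebook[None, :, :], axis=2)
    terms = dists.argmin(axis=1)
    return np.bincount(terms, minlength=len(codebook)).astype(float)

def weight(tf, df, n_images):
    """A tf-idf style weighting, L2-normalized (a stand-in for the
    normalized probabilistic weighting scheme referred to in the text)."""
    w = tf * np.log((n_images + 1) / (df + 1))
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# Toy example: 2-D tokens, a 2-term codebook, invented document frequencies
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
tokens = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]])
tf = index_image(tokens, codebook)                      # term counts per indexing-term
w = weight(tf, df=np.array([4.0, 2.0]), n_images=10)
score = float(w @ w)                                    # comparing images = a vector product
```

With L2-normalized weight vectors, the vector product of an image with itself is 1, and the product of two different images gives their cosine similarity.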
References
[AM96] J.-P. Antoine and R. Murenzi. Two-dimensional directional wavelets and the scale-angle representation. Signal Processing, 52(3):259–281, 1996.
[BBM96] J. Bigun, S. K. Bhattacharjee, and S. Michel. Orientation radiograms for image retrieval: An alternative to segmentation. In Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, Aug. 25–30, 1996.
[Bha99] S. K. Bhattacharjee. Detection of feature-points using an end-stopped wavelet. Submitted to IEEE Trans. Image Processing, 1999.
[FA91] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):891–906, 1991.
[GR95] V. N. Gudivada and V. V. Raghavan, editors. IEEE Computer (Special Issue on Image Retrieval), volume 28(9), 1995.
[Hub95] D. H. Hubel. Eye, Brain, and Vision. Scientific American Library, USA, 1995.
[JD89] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1989.
[Kow97] G. Kowalski. Information Retrieval Systems. Kluwer Academic Publishers, Boston, USA, 1997.
[Lor96] O. Lorenz. Automatic Indexing of Line Drawings for Content Based Information Retrieval. PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, 1996.
[OF96] B. A. Olshausen and D. J. Field. Emergence of simple cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
[RB97] R. P. N. Rao and D. H. Ballard. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9:805–847, 1997.
[Roc71] J. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, chapter 14. Prentice Hall, Englewood Cliffs, NJ, USA, 1971.
[San89] T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459–473, 1989.
[SB88] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
[You85] R. A. Young. The Gaussian derivative theory of spatial vision: Analysis of cortical cell receptive-field line-weighting profiles. General Motors Research Publication GMR-4920, 1985.
[Figure 3 here: a grid of the 20 retrieved images, labelled (1) through (20).]

Figure 3: Sample retrieval results for a `Building' query. The first image was also the query image. The figure shows the top 20 images retrieved after one round of relevance feedback. The query shows a segment of a building. Note that most of the retrieved images show portions of buildings; only three images, at ranks 12, 18, and 20, contain images of `Brick' texture.
[Figure 4 here: a grid of the 20 retrieved images, labelled (1) through (20).]

Figure 4: Sample retrieval results for a `Bark' query. The query-image (also the first image returned) shows a close-up of the bark of a tree. The figure shows the top 20 images retrieved after one round of relevance feedback. Note that all but two of the images retrieved are `Bark' images. Two images of `Brick' texture appear at ranks 13 and 14; these images are close-ups of rocks, but visually they have textures similar to some of the other images retrieved.