Content Based Image Retrieval Using Pixel Descriptors

S. Nepal and M.V. Ramakrishna
Department of Computer Science, RMIT University, GPO Box 2476V, Melbourne VIC 3001
nepal, [email protected]
Abstract
Content based image retrieval systems have brought together the database and image processing communities. Both commercial developers and the academic research community are actively interested in this area. Such systems will find wide usage in the next millennium, much as relational systems do today; this has become possible due to ever cheaper processing and storage costs. Texture and color are important cues for Content Based Image Retrieval (CBIR) systems. Texture is usually extracted from gray scale images, and the contribution of color to texture perception has been mostly ignored. In this paper we introduce the notion of the PIxel Descriptor (PID), which combines the human perception of color and texture into a single vector. It is a vector of responses of the image to a set of texture and color descriptor functions (filters). We use Gabor functions to capture human texture perception, and Gaussian functions to capture colors which fall within the range of human perception. We describe the details of a feature for CBIR based on PIDs, and compare its retrieval performance against other published features using the CHITRA system we are developing. For this comparison we use the relevance feedback technique to account for user subjectivity.

Keywords:
image features, texture, color, pixel descriptor, R*-tree indexing.
1 Introduction

Efficient content based image retrieval systems have become an urgent necessity for effectively using the very large collections of unconstrained images that have become common as the next millennium approaches. Several CBIR systems have been prototyped, and are being used in real life applications [5, 7]. Most of these systems use low level features such as color, texture, structure and shape; texture and color are used in almost all of them, and it is generally accepted that they are key features for CBIR systems [12, 4]. Many texture feature representations have been proposed and used in CBIR systems [10, 15, 7, 9]. These texture features are defined for gray scale images and are obtained after converting the color image to gray scale; they essentially ignore the contribution of color to the perception of texture patterns. There have been attempts to combine color and texture [8]. In this paper, we introduce the notion of the PIxel Descriptor (PID), which encodes color information into the texture feature. The PID encodes only color information that is within the range of human perception. Our contention is that the feature derived from PIDs is more meaningful to human perception than the texture feature representation alone, and significantly improves retrieval performance. We compare the performance of our PID based feature representation with the Gabor texture feature representation of [10] using interactive relevance feedback [14]. This is accomplished by incorporating the new feature into the data model of the CHITRA CBIR system we are developing. We also describe a new technique for performance comparison in CBIR systems: we measure the user's subjectivity about feature representations using interactive relevance feedback on an unconstrained collection of images. The CHITRA system uses R*-trees to index the feature vectors (37 dimensional in this case). Thus the main contributions of this paper are as follows.
- An algorithm to combine a texture feature representation with color information using PIDs.
- Development of a CBIR system, using the PID and Gabor texture feature representations, that supports the interactive relevance feedback technique. This enables measurement of the user's subjectivity via weights on feature representations and gives a better comparison of different feature representations.
The remainder of this paper is organized as follows. We briefly discuss the current state of image retrieval, texture feature representations, and region extraction in Section 2. We describe PIDs and a CBIR system using PIDs in Sections 3 and 4, respectively. The last section presents conclusions and our plans for future work.
2 Background

Traditionally, retrieval from large image databases was managed by posing queries against text annotated with the images. This does not scale well to large databases, and its use is limited by the annotator's view of the image. Advances in image processing techniques enable us to model image feature information to support content based retrieval. The best known CBIR system is IBM's QBIC [5], which allows users to specify queries using example images and various low level features such as color, texture and shape; images are retrieved from the database based on similarity and displayed in order. Other recent content based retrieval systems are Photobook [13], VisualSEEk [15], Chabot [12], NETRA [9] and Virage [6]. These systems use many advanced features, of which the most commonly deployed are color and texture. Texture is a well researched image feature, and many texture descriptors have been proposed, such as multi-orientation filter banks and the gray level dependency matrix. In most cases texture regions are characterized by the responses of a set of filters to the image. Wavelet decomposition [8] and Gabor decomposition [10] have been used in CBIR systems such as [7]. These texture representations are derived from gray scale images and have largely ignored the importance of color intensity. Belongie and Malik [2] have combined intensity with a texture feature representation. In this paper, we investigate the use of texture extraction filters that include color information.
3 PIxel Descriptors

We propose a new method of describing pixels in an image, called the PIxel Descriptor (PID). The purpose of the PID is to combine the vector form of texture feature representation, derived from the responses of the image to a set of filters, with color information. In short, PIDs combine texture information with color. PIDs are derived using a set of texture and color descriptors described below, and can be obtained in different ways. In this section, we consider a Gabor function as the texture descriptor function, and a Gaussian function as the color descriptor function. The PID for a pixel (x, y) in an image I is then obtained by taking the vector of both color and texture descriptors for that particular pixel.
3.1 Texture Descriptors
We use Gabor functions to derive a set of filter banks for use in texture description. Gabor functions are Gaussian functions modulated by complex sinusoids. In two dimensions, the Gabor function is as follows [10]:

g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) + 2\pi j \omega x \right]

where j = \sqrt{-1}, \omega is the frequency of the sinusoid, and \sigma_x, \sigma_y are the standard deviations (the parameters of the Gabor function). We can obtain a class of self-similar Gabor wavelets by appropriate dilations and rotations of g(x, y) through the generating function,
g_{mn}(x, y) = a^{-m}\, g(x', y'), \quad a > 1, \quad m, n \text{ integers}

x' = a^{-m}(x\cos\theta + y\sin\theta), \qquad y' = a^{-m}(-x\sin\theta + y\cos\theta)

where \theta = n\pi/K, K is the number of orientations (n = 0, 1, ..., K-1), and S is the number of scales (m = 0, 1, ..., S-1). Let U_l and U_h denote the lower and upper center frequencies of interest. Then the following filter design ensures that the half-peak magnitude supports of the filter responses in the frequency spectrum touch each other:

\sigma_u = \frac{1}{2\pi\sigma_x}, \quad \sigma_v = \frac{1}{2\pi\sigma_y}, \quad a = (U_h/U_l)^{1/(S-1)}, \quad \omega = U_h

\sigma_u = \frac{(a-1)\,U_h}{(a+1)\sqrt{2\ln 2}}

\sigma_v = \tan\left(\frac{\pi}{2K}\right)\left[ U_h - 2\ln 2\,\frac{\sigma_u^2}{U_h} \right]\left[ 2\ln 2 - \frac{(2\ln 2)^2\,\sigma_u^2}{U_h^2} \right]^{-1/2}

For the experimental results reported here, we used S = 4 and K = 6 and a filter size of 61x61. Given an image I(x, y), we compute the transform coefficients
T_{mn}(x, y) = \int\!\!\int I(x_1, y_1)\, g^{*}_{mn}(x - x_1, y - y_1)\, dx_1\, dy_1

where * indicates the complex conjugate. The texture descriptor for the given image I is then the vector of matrices u_texture = [T_1, T_2, ..., T_N], where N = S x K = 24. Note that for simplicity we use a single subscript for T in the vector of matrices, rather than the mn of the equation above.
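As an illustration of this construction, the sketch below generates the S x K = 24 filters and the magnitude responses |T_mn(x, y)| by discrete convolution. This is a minimal NumPy/SciPy sketch, not code from CHITRA; the center frequencies Ul = 0.05 and Uh = 0.4 are assumed values (the text does not state them), while S = 4, K = 6 and the 61x61 filter size follow the experiments above.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_bank(S=4, K=6, Ul=0.05, Uh=0.4, size=61):
    """Self-similar Gabor filters obtained by rotating and dilating the
    mother Gabor function; returns a list of S*K complex size x size filters."""
    a = (Uh / Ul) ** (1.0 / (S - 1))
    su = ((a - 1) * Uh) / ((a + 1) * np.sqrt(2 * np.log(2)))
    sv = (np.tan(np.pi / (2 * K)) * (Uh - 2 * np.log(2) * su ** 2 / Uh)
          / np.sqrt(2 * np.log(2) - (2 * np.log(2) * su / Uh) ** 2))
    sx, sy = 1 / (2 * np.pi * su), 1 / (2 * np.pi * sv)  # sigma_u = 1/(2 pi sigma_x)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    bank = []
    for m in range(S):                      # scales
        for n in range(K):                  # orientations
            theta = n * np.pi / K
            xr = a ** (-m) * (x * np.cos(theta) + y * np.sin(theta))
            yr = a ** (-m) * (-x * np.sin(theta) + y * np.cos(theta))
            g = (1 / (2 * np.pi * sx * sy)) * np.exp(
                -0.5 * (xr ** 2 / sx ** 2 + yr ** 2 / sy ** 2)
                + 2j * np.pi * Uh * xr)
            bank.append(a ** (-m) * g)      # g_mn = a^{-m} g(x', y')
    return bank

def texture_descriptors(image, bank):
    """Magnitude responses |T_mn(x, y)| via convolution with g*_mn."""
    return [np.abs(fftconvolve(image, np.conj(f), mode='same')) for f in bank]
```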
3.2 Color Descriptors
We use a normalized Gaussian model to extract color descriptors from the images. The Gaussian model in n dimensions is represented as:

G_n(\mathbf{x}) = K_c \exp\left[ -\frac{1}{2}\left( \frac{\sqrt{\sum_{i=1}^{n}(x_i - \mu_i)^2}}{\sigma} \right)^{2} \right]
The total number of colors present in an image is very large; with 24 bit color an image can contain over 16 million colors, and it is extremely difficult to model such a large number. To limit the number of colors without losing much of the information an image contains from the point of view of the user's perception, we choose 13 colors following the experimental results in [3]: gray, black, pink, red, brown, orange, yellow, green, green-blue, blue, blue-light, purple, and white. Carson et al. have experimentally found that these 13 different colors are significant in a collection of natural images and fall within the range of human perception [3]. Thus, in order to compute the color descriptors, we generate 13 different Gaussian functions by changing the values of \mu_i. Since we are working in the RGB color space, we formulate 3-dimensional (n = 3) Gaussian functions; \mu represents the RGB values of a particular color. If C_c is the response of the image I to a color c, the color descriptor for I is then given by the vector of matrices
u_color = [C_1, C_2, ..., C_13]

By varying the value of \sigma we can tune the response of the Gaussian function. For our experiments we chose K_c = 1 and \sigma = 0.5; these values were obtained from the results in [3].
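The color descriptors admit a direct per-pixel implementation: evaluate the 3-dimensional Gaussian at each pixel's RGB value, once per reference color. The sketch below is illustrative only; the RGB means in COLOR_MEANS are hypothetical placeholders (the actual values come from the experiments in [3]), and only 4 of the 13 colors are listed.

```python
import numpy as np

# Hypothetical RGB means (on a 0-1 scale) for 4 of the 13 perceptual
# colors; the actual values come from the experiments in [3].
COLOR_MEANS = {
    'black': (0.0, 0.0, 0.0),
    'white': (1.0, 1.0, 1.0),
    'gray':  (0.5, 0.5, 0.5),
    'red':   (1.0, 0.0, 0.0),
    # ... the remaining 9 colors follow the same pattern
}

def color_descriptors(rgb, means=COLOR_MEANS, Kc=1.0, sigma=0.5):
    """Response C_c of every pixel to each reference color c.
    rgb: H x W x 3 array with channels scaled to [0, 1]."""
    responses = []
    for mu in means.values():
        d2 = ((rgb - np.asarray(mu)) ** 2).sum(axis=-1)  # sum_i (x_i - mu_i)^2
        responses.append(Kc * np.exp(-0.5 * d2 / sigma ** 2))
    return responses  # [C_1, ..., C_13] as H x W matrices
```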
3.3 PIxel Descriptors
PIDs are obtained from the texture and color descriptors: the PID is a vector holding the color descriptors alongside the texture descriptors. In our case the length of a PID vector is 37 (24 texture descriptors and 13 color descriptors). For a pixel at (x, y) in an image, the PID is derived as follows.
u_pixel(x, y) = [u_texture(x, y), u_color(x, y)]

where u_texture(x, y) is the 24 dimensional vector of the T(x, y), and similarly for color. We then normalize the PID vectors as follows:

\hat{u}_{pixel}(x, y) = \frac{u_{pixel}(x, y)}{\| u_{pixel}(x, y) \|_2}

We use this normalized PID to generate the feature representation.
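A minimal sketch of the PID assembly, assuming the 24 texture responses and 13 color responses have already been computed as H x W matrices (e.g. by the two sketches above):

```python
import numpy as np

def pixel_descriptors(texture_resp, color_resp):
    """Stack the 24 texture and 13 color responses into a 37-dimensional
    PID per pixel, then L2-normalize each pixel's vector."""
    u = np.stack(list(texture_resp) + list(color_resp), axis=-1)  # H x W x 37
    norm = np.linalg.norm(u, axis=-1, keepdims=True)
    return u / np.maximum(norm, 1e-12)  # guard against all-zero pixels
```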
4 Architecture of the CHITRA CBIR System

We are building the CHITRA prototype CBIR system using a four layer data model, as shown in Figure 1 [11]. The first layer, the image representation layer, stores the raw image data. The feature information extracted from the image is stored at the second layer, the image feature layer; it includes global image features (features extracted from the image as a whole), image objects, and features of the objects. The mapping from the image representation layer to the image feature layer is called the feature extraction mapping. The system semantic layer is the third layer; it stores system defined functions such as similarity functions and spatial relationships. The fourth (top) layer is called the user semantic layer. It contains semantics defined by the user based on the
[Figure 1: A four layer data model. From top to bottom: the User Semantic Layer (e.g. house), the System Semantic Layer (e.g. triangle, rectangle, above), the Image Feature Layer (e.g. texture, shape, colour, AverageColour, size, Caption), and the Image Representation Layer, connected by the User Defined Mapping, the System Defined Mapping, and the Feature Extraction Mapping.]
information from the lower levels. The mapping to the user semantic layer is user defined; this user definition is accomplished using the interactive feedback technique discussed later. We extracted features from the images using statistical measures of the PIDs, similar to those in [9], and stored them in the image feature layer of CHITRA. Indexing is accomplished by an R*-tree [1]. We have implemented the interactive feedback technique in CHITRA to compare the texture feature representation derived from PIDs with the Gabor texture feature representation. The performance of the feature representations is measured through the weights associated with them while posing queries. Our results indicate that the feature representation derived from PIDs outperforms the Gabor feature representation presented in [9].
5 Texture-color Feature and Relevance Feedback

A texture region is represented by the means and the standard deviations of the energy distributions of the transform coefficients, calculated as follows [9]:

\mu = \int\!\!\int |u_{texture}(x, y)|\, dx\, dy

\sigma = \sqrt{ \int\!\!\int \left( |u_{texture}(x, y)| - \mu \right)^2 dx\, dy }
Then the feature representation for the texture descriptor alone is represented in [9] as

r_{11} = [\mu_0, \sigma_0, ..., \mu_{23}, \sigma_{23}]^T

Similarly, the feature representation from PIDs is extracted as follows:
\mu = \int\!\!\int |u_{pixel}(x, y)|\, dx\, dy

\sigma = \sqrt{ \int\!\!\int \left( |u_{pixel}(x, y)| - \mu \right)^2 dx\, dy }

Then the feature representation for the texture descriptors combined with the color descriptors is

r_{12} = [\mu_0, \sigma_0, ..., \mu_{23}, \sigma_{23}, ..., \mu_{36}, \sigma_{36}]
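In discrete form the integrals above become sums over the pixels, so r_12 can be sketched as the per-component mean and standard deviation of the normalized PID array (the function name feature_vector is our own):

```python
import numpy as np

def feature_vector(pid):
    """r12 = [mu_0, sigma_0, ..., mu_36, sigma_36]: per-component mean and
    standard deviation of an H x W x 37 PID array, interleaved."""
    mu = pid.mean(axis=(0, 1))    # discrete analogue of the mu integral
    sigma = pid.std(axis=(0, 1))  # discrete analogue of the sigma integral
    return np.ravel(np.column_stack([mu, sigma]))
```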
5.1 Retrieval Performance
The performance of feature representations in CBIR systems is generally characterised by measuring recall/precision or similar methods. This does not take into account the user's inherent perception of the results. In our experiments, we characterise performance based on the user's perception through the weights given to feature representations: if a feature representation agrees with the user's perception, the corresponding feature is assigned a higher weight. This is accomplished by the interactive relevance feedback technique described below [14].
5.2 Interactive Relevance Feedback System
The image data is represented as a five-tuple Im = <I, F, R, V, M>, where I is the raw image data and F = {f_i} is a set of features. R = {r_ij} is the feature representation set (each feature can have more than one representation). M = {m_ij} represents the similarity measures (different representations can use different similarity measures). V is the value realized by each representation. In our case, we have the two representations explained above, and we use the n-dimensional Gaussian function mentioned earlier as the similarity measure for both. Each representation in our CBIR system is realized by a corresponding vector; for example, the feature representation using pixel descriptors (r_12) itself consists of multiple components, i.e., r_ij = [r_ij1, ..., r_ijK], where K = 37 is the length of the vector. Let r_11 and r_12 be the two texture feature representations, W_11 and W_12 their weights, and W_11k and W_12k the weights of the components of each vector. Based on the image data model, we describe the retrieval process as follows.

Retrieval with feedback
1. Initialize the weights W = [W_ij, W_ijk] as follows:
   W_11 = W^0_11 = W_12 = W^0_12 = 1/2, and W_ijk = W^0_ijk = 1/K_ij, where K_ij = 24 for r_11 and K_ij = 37 for r_12.

2. The user's query Q is represented by the r_ij with their corresponding weights W_ij.

3. The similarity between the query Q and each database image I is measured for each representation r_ij using the corresponding similarity measure m_ij and the weights W_ijk:
   Sim(Q_rij, I_rij) = m_ij(r_ij, W_ijk)

4. The overall similarity between the query Q and the database image I is then evaluated by combining the individual feature representation similarities:
   Sim(Q, I) = \sum_j W_ij Sim(Q_rij, I_rij)

5. The database images are ordered by their overall similarity to Q, and the system returns the top d (say d = 10) images to the user.

6. The user marks the relevant returned images, which are fed back to the system, and the weights are updated based on the user's feedback (the procedure is described below). The idea is to adjust the query weights to encode the subjectivity of human perception: the representation that better represents the user's perception is assigned a higher weight.

7. Repeat from step (2) until the user is satisfied with the query result.

As mentioned above, the weights associated with the feature representations and their vector components are dynamically updated based on the images the user feeds back. We describe the two levels of relevance feedback below; a sketch of the ranking steps (3)-(5) follows first.
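The following sketch shows steps 3-5 of the loop. It stands in a component-weighted negative squared distance for the Gaussian similarity measure m_ij, purely for illustration; the function names and data layout are our own assumptions, not the CHITRA implementation.

```python
import numpy as np

def similarity(q, x, wk):
    """Stand-in for m_ij: component-weighted negative squared distance
    between the query's and an image's representation vectors."""
    return -np.sum(wk * (q - x) ** 2)

def rank_images(query_reps, db_reps, W, Wk, d=10):
    """Steps 3-5: combine the per-representation similarities with the
    feature weights W_ij and return the indices of the top d images.
    query_reps[j] is the query's j-th representation vector;
    db_reps[i][j] is image i's j-th representation vector."""
    overall = []
    for img in db_reps:
        sims = [similarity(q, x, wk)
                for q, x, wk in zip(query_reps, img, Wk)]
        overall.append(sum(w * s for w, s in zip(W, sims)))
    return np.argsort(overall)[::-1][:d]
```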
Updating weights of vector components (low level)
The lower level feedback mechanism updates the weights W_ijk, which are associated with the components of a feature vector, according to the user's perception. When we compare two vectors, we initially assign equal weight to each component. But human perception may be such that the weight given to each component should differ; for example, the user may perceive red more significantly than green, in which case we need to assign more weight to red than to green (while using an RGB vector as the feature). We use a standard deviation based approach to update the weights based on the user's feedback. The basic idea is that if the values of a particular component of the returned vectors are all close to each other, then the corresponding weight should be higher. Suppose L images are marked by the user as relevant; we form an L x K matrix of the corresponding vectors. Each column of the matrix then contains a sequence (of length L) of values r_ijk. If all values in a column are similar, the column is a good indicator of the user's perception/information need and is given a higher weight. On the contrary, very dissimilar values in a column indicate that the corresponding r_ijk is not relevant to the user's information need, and it is given a lower weight. Based on this, the inverse of the standard deviation of each column (the sequence of r_ijk over the relevant images) is used to update the weight W_ijk of the corresponding component r_ijk:

W_{ijk} = \frac{1}{\sigma_{ijk}}

where \sigma_{ijk} is the standard deviation of the corresponding column. The weights are then normalized in a similar manner:

W_{ijk} = \frac{W_{ijk}}{\sum_k W_{ijk}}

A sketch of this update follows.
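A minimal sketch of the low level update, assuming the relevant images' vectors are collected row-wise into an L x K array:

```python
import numpy as np

def update_component_weights(relevant_vectors):
    """relevant_vectors: L x K array, one row per user-marked relevant image.
    A column with small spread marks a component the user cares about,
    so W_ijk = 1 / sigma_ijk, then normalize to sum to 1."""
    sigma = relevant_vectors.std(axis=0)
    w = 1.0 / np.maximum(sigma, 1e-12)  # guard against zero deviation
    return w / w.sum()
```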
Updating feature weights (high level)
The higher level relevance feedback process updates the weights W_ij. The weight W_ij associated with feature representation r_ij indicates the user's preference for that representation in the overall similarity; the higher the weight, the better the feature representation. The final weights assigned are thus a measure of the retrieval performance of the PID feature representation. Let d be the number of best matching images the user would like to retrieve from the database (in our case 10), and denote the query result by R = [J_1, J_2, ..., J_d]. For each feature representation, the system retrieves d images using the similarity measure Sim(D_rij, Q_rij). Let the first d images corresponding to feature representation r_ij be S_ij = [S^ij_1, ..., S^ij_d]. Then the weights are updated as follows:

for l = 1 to d do
    if S^ij_l is in R then W_ij = W_ij + 1

We then normalize the weights so that the sum of the normalized weights is equal to 1:

W_{ij} = \frac{W_{ij}}{\sum_j W_{ij}}

A sketch of this update is given below.
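A sketch of the high level update; here per_rep_topd[j] holds S_ij (the top d images under representation r_ij alone) and overall_topd holds R. Names are our own illustration.

```python
import numpy as np

def update_feature_weights(W, per_rep_topd, overall_topd):
    """W_ij grows by one for each image common to S_ij and R,
    then the weights are renormalized to sum to 1."""
    W = np.asarray(W, dtype=float)
    for j, S_ij in enumerate(per_rep_topd):
        W[j] += len(set(S_ij) & set(overall_topd))
    return W / W.sum()
```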
It is clear that the weight W_ij increases with the number of images common to R and S_ij. Thus, if a representation r_ij captures the user's perception better, it receives more emphasis. We conclude that the representation with the higher weight at the end of the feedback process better represents the user's perception.
6 Experimental Results From Our Prototype CBIR System

We used the CHITRA CBIR system to test the effectiveness and power of our PID based feature and to compare its performance to the texture feature of [9]. We considered the two feature representations described above and loaded a collection of images obtained from the World Wide Web (WWW) into the image database. The initial weights associated with both representations r_11 and r_12 were equal to 0.5. Figure 2 shows the response of CHITRA to a query. The top left airplane is the query image for the "example" query; the other two rows of images are the result of the query, returned by the system and ordered by similarity. The user then marked the images containing an airplane as relevant and fed them back to the system, which updated the weights using the two levels of relevance feedback technique described above. The query result after the weights were updated by relevance feedback is shown in Figure 3. The updated weights of feature representations r_11 and r_12 are 0.2 and 0.8 (starting from 0.5 each). The sums of the weights over both interactions within the same query for r_11 and r_12 are 0.7 (0.5+0.2) and 1.3 (0.5+0.8), respectively (this is shown in the message window as well). We observed that the weight associated with feature r_12 increased from 0.5 to 0.8, whereas the weight of feature r_11 decreased from 0.5 to 0.2. We posed a large number of similar queries to the database, and in all cases the total weight (summed over all interactions within a single query) for r_12 (derived from PIDs) was higher than for r_11 ([9]). Thus, we conclude that image retrieval using the PID based feature gives better performance than the texture feature extracted from gray scale images.
Figure 2: Initial query image and results obtained from our CBIR system. The initial weights associated with feature representations are shown in the message window
Figure 3: Initial query image and results obtained from our CBIR system after the first user interaction. The updated weights associated with the feature representations are shown in the message window
7 Conclusions

The new computing paradigm and hardware economies have resulted in large collections of image databases, and content based image retrieval will definitely find wide usage in the next millennium. In this paper we defined PIDs and proposed a feature representation method that includes both texture and color. We used the CHITRA CBIR system (built on a four layer data model) to implement the proposed feature and evaluate it using the interactive feedback technique. We found that the feature representation using PIDs consistently captures the user's information need better than the texture based feature. Work in progress includes using PIDs to extract regions from the images. Since PIDs capture information such as locality, color and neighborhood, the regions extracted from an image using PIDs are found to be consistent with the human perception of image regions; these regions have not yet been incorporated into our prototype system. Our plans include development of a full-fledged CHITRA CBIR system using the proposed data model, with many global features as well as regions extracted using PIDs. In addition, a fuzzy query language is under development and implementation. At present we are using R*-trees for indexing the multidimensional feature vectors, and more efficient structures are under development.
Acknowledgements
Contributions of James Thom for the CHITRA architecture are gratefully acknowledged.
References
[1] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider and Bernhard Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 322-331, Atlantic City, NJ, 1990.

[2] Serge Belongie and Jitendra Malik. Finding boundaries in natural images: A new method using point descriptors and area completion. In Fifth European Conference on Computer Vision, Freiburg, Germany, 1998.

[3] Chad Carson, Serge Belongie, Hayit Greenspan and Jitendra Malik. Region-based image querying. Technical Report 97-941, Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, 1997. URL: http://HTTP.cs.Berkeley.EDU/~carson/papers/tr941.ps.gz.

[4] Chad Carson and Virginia E. Ogle. Storage and retrieval of feature data for a very large online image collection. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, Volume 19, Number 4, pages 19-27, December 1996.

[5] Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele and Peter Yanker. Query by image and video content: The QBIC system. IEEE Computer, Volume 28, Number 9, pages 23-32, September 1995.

[6] Amarnath Gupta. Visual information retrieval: A Virage perspective. Technical Report Revision 4, Virage Inc., 9605 Scranton Road, Suite 240, San Diego, CA 92121, 1997.

[7] Jesse S. Jin, Ruth Kurniawati and Guangyu Xu. A scheme for intelligent image retrieval in multimedia databases. Journal of Visual Communication and Image Representation, Volume 7, Number 4, December 1996.

[8] J.R. Smith and S.-F. Chang. Local color and texture extraction and spatial query. In IEEE Proc. Int. Conf. Image Processing, Lausanne, Switzerland, 1996.

[9] W.Y. Ma and B.S. Manjunath. NETRA: A toolbox for navigating large image databases. In IEEE International Conference on Image Processing, Santa Barbara, California, 1997.

[10] B.S. Manjunath and W.Y. Ma. Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 18, Number 8, pages 837-842, August 1996.

[11] Surya Nepal, M.V. Ramakrishna and J.A. Thom. Four layer schema for image data modelling. In Chris McDonald (editor), Australian Computer Science Communications, Vol 20, No 2, Proceedings of the 9th Australasian Database Conference, ADC'98, pages 189-200, 2-3 February, Perth, Australia, 1998.

[12] Virginia E. Ogle and Michael Stonebraker. Chabot: Retrieval from a relational database of images. IEEE Computer, Volume 28, Number 9, pages 40-48, September 1995.
[13] A. Pentland, R.W. Picard and S. Sclaroff. Photobook: Tools for content based manipulation of image databases. International Journal of Computer Vision, Volume 18, Number 3, pages 233-254.

[14] Yong Rui, Thomas S. Huang and Sharad Mehrotra. Relevance feedback techniques in interactive content based image retrieval. In Proc. of IS&T and SPIE Storage and Retrieval of Image and Video Databases VI, San Jose, CA, 1998.

[15] John R. Smith and Shih-Fu Chang. VisualSEEk: A fully automated content-based image query system. In ACM Multimedia, Boston, MA, November 1996.