ImageSeeker: A Content-based Image Retrieval System
Alaa Tawfik, HebatAllah Fouad, Reem Megahed, Samar Mohamed, Elsayed Hemayed*
Computer Engineering Dept., Faculty of Engineering, Cairo University, Giza, Egypt
*Contact author email: [email protected], website: http://www.hemayed.net

ABSTRACT

In many areas of commerce, government, academia, and medicine, large collections of digital images are in use. Usually, the only way to search these collections is by file name or by browsing, which is impractical for a large number of images. ImageSeeker aims to provide an improved technique for image searching. It focuses on extracting visual content from images and annotating them; it is based on the concept of Content-Based Image Retrieval (CBIR) and identifies the contents of an image based on what the system has learned during past training. When a user requests images of a certain object, all images containing that object are returned. ImageSeeker maintains high accuracy in finding results that match the user's query. The system was tested on images of natural scenes, specifically sea, sand, grass, clouds, and sky.

Keywords: Image Seeker, CBIR, Search Engine, Image Retrieval, Image Annotation
1. INTRODUCTION

Almost 30 years have passed since the idea of the World Wide Web (WWW) was just a proposal, and today we are witnessing the evolution of that idea as it invades every aspect of our lives. Many technologies and products have emerged since then, and one of the most important to come to life was multimedia. Sound, music, video, and graphics are the most common forms of multimedia today, and the future holds further forms, such as holography. As a consequence, the amount of non-text-based data has increased exponentially over the last decade, in particular images, which have been gaining popularity as an alternative and sometimes more viable option for information storage. Images gained their popularity after the introduction of digital cameras, mobile phones, scanners, and other inexpensive devices that allow a large amount of information to be stored in image format.

Extracting specific information from images and retrieving the appropriate and relevant data from databases has always been a challenge. Researchers have conducted a great deal of research on extracting information from non-text-based data (in particular, images) according to its actual content. The image content can be described using either its semantic or its visual information. Retrieval of images based on semantic content is mostly done via keywords or a text phrase. This is often achieved by applying the traditional text-based retrieval approach to analyze the image content through its file name and description tags. This approach is not suitable for large existing databases, where text annotations are often not available and the image filename rarely reflects the true interpretation of the actual content. For instance, images captured by digital cameras are usually named based on the time and date at which they were taken; additional descriptions are normally entered by the user after the images have been uploaded to the computer. In most cases, unless some form of intelligent technique is applied, these images have to be labeled manually. Preparing this type of image database for text-based retrieval is labor intensive and clearly not a viable solution.

The alternative is to retrieve images based on their visual content. Recently, Content-Based Image Retrieval (CBIR) has emerged as a new intelligent field for retrieving images based on their visual content. A CBIR system is an extension of the traditional text-based information retrieval system. However, the techniques and approaches used in CBIR have deviated from text-based retrieval systems, and CBIR has now matured into a distinct research discipline in its own right.

In this paper, we present a new content-based image retrieval system. The proposed system allows searching for images according to their actual visual content. It also provides methods for automatically annotating different image regions and associating these annotations with the image in the database. This method provides higher accuracy when
searching for images, as the user searches for keywords corresponding to the actual content of the image, not keywords merely associated with it.

1.1 Previous Work

The term CBIR originated in 1992, when it was used by T. Kato to describe experiments on automatic retrieval of images from a database based on the colors and shapes present. Since then, the term has been used to describe the process of retrieving desired images from a large collection on the basis of syntactical image features. The techniques, tools, and algorithms that are used originate from fields such as statistics, pattern recognition, signal processing, and computer vision.

There is growing interest in CBIR because of the limitations inherent in metadata-based systems, as well as the large range of possible uses for efficient image retrieval. Textual information about images can be easily searched using existing technology, but it requires humans to describe every image in the database. This is impractical for very large databases, or for images that are generated automatically, e.g., from surveillance cameras. Potential uses for CBIR include:

• Art collections
• Photograph archives
• Retail catalogs
• Medical diagnosis
• Crime prevention
• The military
• Intellectual property
• Architectural and engineering design
• Geographical information and remote sensing
Over the past few years, many image retrieval systems have been developed, such as:

QBIC [1], or Query by Image Content, was developed by IBM Almaden Research Center to allow users to graphically pose and refine queries based on multiple visual properties such as color, texture, and shape. It supports queries based on input images, user-constructed sketches, and selected color and texture patterns.

VIR Image Engine [2], by Virage Inc., like QBIC, enables image retrieval based on primitive attributes such as color, texture, and structure. It examines the pixels in the image and performs an analysis process, deriving image characterization features.

VisualSEEK and WebSEEK [3] were developed by the Department of Electrical Engineering, Columbia University. Both systems support color and spatial location matching as well as texture matching.

NeTra [4] was developed by the Department of Electrical and Computer Engineering, University of California. It supports color, shape, spatial layout, and texture matching, as well as image segmentation.

MARS [5], or Multimedia Analysis and Retrieval System, was developed by the Beckman Institute for Advanced Science and Technology, University of Illinois. It supports color, spatial layout, texture, and shape matching.

Other work has been applied to image classification, annotation, and retrieval, such as visual templates [6], feedback learning [7], support vector machines [8], active learning [9], generative modeling [10], latent space models [11], spatial context models [12], statistical boosting [13], manifold learning [14], and real-time computerized annotation [15]. For a comprehensive survey of image retrieval, see [16].

The common approach of the aforementioned systems is to extract features for every image based on its pixel values and to define a rule for comparing images. The features can be color, texture, or shape. The proposed system uses a similar approach for extracting visual content from an image: it combines texture and color features and applies classification techniques that enable the system to annotate images containing the supported categories. The architecture of the proposed system is presented in Section 2. Experimental results are shown and discussed in Section 3. Section 4 contains the conclusion and planned extensions. Finally, references are cited in the last section.
2. SYSTEM ARCHITECTURE

The system consists of three main modules, as shown in Figure 1. The first module, the Object Learning Module, is responsible for training the system to recognize the system objects (sand, sea, grass, cloud, and sky). The second module, the Image Annotator Module, is responsible for annotating a new image with its contents according to what the system learned in the previous module. The third module, the Image Search Module, is responsible for searching the database to retrieve the requested images according to the user's query string. Each module is described in more detail in the sections below.

2.1 Object Learning Module

The input to this module, shown in Figure 2, is the set of training images for each class supported in the system (sea, sky, cloud, sand, and grass). Sixty different features (texture and color features) are extracted from these images and combined into one feature vector. Principal component analysis (PCA) [17] is used to reduce the length of the feature vector. The mean of the feature vectors is calculated for each category after eliminating outliers. The mean is then stored and used when annotating unknown images. The algorithm steps are listed below and are applied to a window of size 50 x 50 pixels; this window size is also used on new incoming images before they are passed to the feature extraction module.
• First, the window is quantized to 64 levels for each of the R, G, and B channels.
• The co-occurrence matrix algorithm [18] is applied to the three color components (RGB) of the window, producing three co-occurrence matrices of size 64 x 64 (64 is the chosen quantization level). The algorithm steps can be summarized as follows (see the sketch after Table 1):
  o The algorithm considers the relation between two pixels at a time, called the reference and the neighbor pixel. The neighbor pixel is chosen to be the one to the east (right) of each reference pixel. This can also be expressed as a (1, 0) relation: 1 pixel in the x direction, 0 pixels in the y direction.
  o Each pixel within the window becomes the reference pixel in turn, starting in the upper left corner and proceeding to the lower right. Pixels along the right edge have no right-hand neighbor, so they are not used for this count.
  o A 64 x 64 matrix is created in which the top-left cell holds the number of times the combination (0, 0) occurs, i.e., how many times within the window a pixel with color level 0 (neighbor pixel) falls to the right of another pixel with color level 0 (reference pixel).
  o A symmetrical matrix is calculated by summing the east matrix, which is the (1, 0) relation, and the west matrix, which is the (-1, 0) relation. (The west matrix is the transpose of the east matrix calculated previously.)
  o Finally, the symmetrical matrix is normalized by dividing by the total number of possible outcomes, which is 64 x 64 x 2 (the factor of 2 accounts for symmetry).
• Then 10 features are calculated from each co-occurrence matrix, as follows:
  1. Mean (Sum Mean): Provides the mean of the color levels in the image. The sum mean is expected to be large if the sum of the color levels of the image is high.
  2. Variance: Demonstrates how spread out the distribution of color levels is. The variance is expected to be large if the color levels of the image are spread out widely.
  3. Energy (Angular Second Moment): Measures the number of repeated pairs. Energy is expected to be high if the occurrence of repeated pixel pairs is high.
  4. Contrast: Measures the local contrast of an image. Contrast is expected to be low if the color levels of each pixel pair are similar.
  5. Homogeneity: Measures the local homogeneity of a pixel pair. Homogeneity is expected to be large if the color levels of each pixel pair are similar.
  6. Correlation: Provides a correlation between the two pixels in the pixel pair. The correlation is expected to be high if the color levels of the pixel pairs are highly correlated.
  7. Auto Correlation: Describes the fineness/coarseness of the texture.
  8. Cluster Tendency: Measures the grouping of pixels that have similar color level values.
  9. Entropy: Measures the randomness of the color-level distribution. Entropy is expected to be high if the color levels are distributed randomly throughout the image.
  10. Inverse Difference: Indicates the smoothness of the image, like homogeneity. The inverse difference moment is expected to be high if the color levels of the pixel pairs are similar.

Table 1 lists the equations of the features computed from the co-occurrence matrix M [19]. This yields a feature vector of 30 values for the current window; these are the texture-related features.

Additional features are calculated for the processed window. These additional features are color-related and are calculated in the HSV color space. Each RGB image is therefore transformed into three images, a Hue image, a Saturation image, and a Value image, using the equations listed in Table 2 (see the sketch after Table 2). In the HSV color space, the quantization of hue receives the highest attention. The hue circle consists of the primaries red, green, and blue separated by 120 degrees. A circular quantization at 20-degree steps sufficiently separates the hues such that the three primaries and yellow, magenta, and cyan are each represented with three sub-divisions; thus hue is quantized to 18 levels. Saturation and value carry less weight, so saturation is quantized to 3 levels and value to 9 levels. These 30 values (18 + 3 + 9) are computed for each window to form the color feature vector. The final result is a feature vector of 60 values, combining both color and texture features.

Table 1: The computed features
Feature                              Equation
Mean (Sum Mean)                      ∑i ∑j (i·M(i, j) + j·M(i, j)) / 2
Variance                             ∑i ∑j ((i − µ)²·M(i, j) + (j − µ)²·M(i, j)) / 2
Energy (Angular Second Moment)       ∑i ∑j M(i, j)²
Contrast                             ∑i ∑j M(i, j)·(i − j)²
Homogeneity                          ∑i ∑j M(i, j) / (1 + |i − j|)
Correlation                          ∑i ∑j M(i, j)·(i − µ)·(j − µ) / σ²
Auto correlation                     ∑i ∑j M(i, j)·i·j
Cluster tendency                     ∑i ∑j M(i, j)·(i + j − 2µ)²
Entropy                              −∑i ∑j M(i, j)·log₂ M(i, j)
Inverse Difference                   ∑i ∑j M(i, j) / (1 + (i − j)²)

(µ and σ² denote the Mean and Variance defined in the first two rows.)
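To make the texture pipeline concrete, the sketch below (in Python with NumPy; it is not part of the original paper, and variable names and the integer quantization are illustrative assumptions) shows one plausible implementation of the co-occurrence matrix construction described above and the ten features of Table 1 for a single 50 x 50 window.

```python
import numpy as np

def cooccurrence_matrix(channel, levels=64):
    """Symmetric co-occurrence matrix for the (1, 0) east neighbor,
    normalized as stated in the text (division by levels * levels * 2)."""
    east = np.zeros((levels, levels), dtype=np.float64)
    ref = channel[:, :-1].ravel()      # reference pixels (right edge excluded)
    nbr = channel[:, 1:].ravel()       # east neighbors
    np.add.at(east, (ref, nbr), 1.0)
    sym = east + east.T                # add the west (-1, 0) relation
    return sym / (levels * levels * 2)

def glcm_features(M):
    """The 10 texture features of Table 1, computed from matrix M."""
    i, j = np.indices(M.shape)
    mu = np.sum((i * M + j * M) / 2.0)                       # sum mean
    var = np.sum(((i - mu) ** 2 * M + (j - mu) ** 2 * M) / 2.0)
    energy = np.sum(M ** 2)
    contrast = np.sum(M * (i - j) ** 2)
    homogeneity = np.sum(M / (1.0 + np.abs(i - j)))
    correlation = np.sum(M * (i - mu) * (j - mu)) / var
    autocorr = np.sum(M * i * j)
    cluster = np.sum(M * (i + j - 2 * mu) ** 2)
    entropy = -np.sum(M[M > 0] * np.log2(M[M > 0]))
    inv_diff = np.sum(M / (1.0 + (i - j) ** 2))
    return [mu, var, energy, contrast, homogeneity,
            correlation, autocorr, cluster, entropy, inv_diff]

# Example: 30 texture features for one 50x50 RGB window (values in 0..255)
window = np.random.randint(0, 256, size=(50, 50, 3))
texture_fv = []
for c in range(3):                                           # R, G, B channels
    quantized = (window[:, :, c] * 64 // 256).astype(int)    # 64 quantization levels
    texture_fv.extend(glcm_features(cooccurrence_matrix(quantized)))
```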
Table 2: RGB to HSV conversion (Hue, Saturation, Intensity/Value)
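The conversion equations of Table 2 did not survive in this copy, so the sketch below assumes the standard RGB-to-HSV transformation (here via Python's colorsys module) and applies the 18/3/9 quantization described above to build the 30-bin color feature vector. The bin layout and normalization are assumptions consistent with the text, not the authors' exact implementation.

```python
import colorsys
import numpy as np

def color_features(window):
    """30-bin color feature vector: 18 hue + 3 saturation + 9 value bins."""
    hue_hist = np.zeros(18)
    sat_hist = np.zeros(3)
    val_hist = np.zeros(9)
    for r, g, b in window.reshape(-1, 3) / 255.0:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)      # standard conversion, all in [0, 1]
        hue_hist[min(int(h * 18), 17)] += 1         # 20-degree hue steps
        sat_hist[min(int(s * 3), 2)] += 1
        val_hist[min(int(v * 9), 8)] += 1
    fv = np.concatenate([hue_hist, sat_hist, val_hist])
    return fv / fv.sum()                            # normalize by the number of pixels

# color_fv = color_features(window)   # 'window' is the 50x50x3 block from the previous sketch
```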
2.1.1 Data Transformation

The recorded feature vector contains 60 different types of measurements, each with its own scale. In order to use these mixed measurements to differentiate between image contents, the measurement data should be normalized. In our system, the statistical normalization technique [20] was used to normalize each measurement to a standard normal distribution (zero mean and unit variance):
z = (X − µ) / σ

where the mean µ = (1/N) ∑ X and the standard deviation σ = sqrt((1/N) ∑ (X − µ)²).
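A minimal sketch of this per-feature normalization, together with the PCA reduction described next (assuming the combined 60-dimensional feature vectors are stacked in a NumPy array; the function names are illustrative, not the authors' code):

```python
import numpy as np

def normalize(features):
    """Z-score normalization: each of the 60 feature columns to zero mean, unit variance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / sigma, mu, sigma

def pca_reduce(features, n_components=10):
    """Project the normalized feature vectors onto the top principal components."""
    cov = np.cov(features, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)                  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return features @ top, top                              # reduced data and projection basis

# training_fvs: (num_windows, 60) array of combined texture + color features
# z, mu, sigma = normalize(training_fvs)
# reduced, basis = pca_reduce(z, n_components=10)
```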
For better performance and accuracy, the length of the feature vector is reduced. Principal component analysis (PCA) [17] is used to shorten the feature vector while keeping only the most important components. In our case we kept only the first ten components; the number ten was chosen empirically.

2.1.2 Representative Election

The feature vectors of the training set of each class are taken from the PCA output after reducing their dimensionality. Since the training set includes different inputs for the same class, a representative of these inputs is elected for each class. This representative is elected according to the selective averaging technique [20], in which we discard some of the minimum and maximum values of each feature to reduce the noise introduced by rogue patterns, compute the average of the remaining inputs, and label it the center feature value. All remaining features are handled similarly. Finally, a center feature vector for each class is created and saved into the database to be used afterwards in the classification process.

2.2 Image Annotator Module

When a new image is uploaded to the system, the Image Annotator Module, shown in Figure 3, processes the image and assigns labels to the different image segments. The annotated image is stored in the database with all the keywords (labels) found and the rank of each keyword, so that it can be used later when searching for images that contain a specific keyword or object. The annotation steps are as follows:

• First, the image is divided into windows of size 50 x 50 pixels; say we have i = 1..n windows.
• The texture and color features defined in the previous section are extracted for each window.
• For each of the n windows, we compute the Euclidean distance between its feature vector and the stored center feature vector of each class.
• The class with the closest distance is assigned to the window, provided that the distance is less than a predefined threshold; otherwise, a dummy class (other) is assigned to the window.
• For each class found in the image, we count its occurrences in the image (say N_C > 3) and then compute its probability P_C using Equations 1-3.
P_C = ((D_T − D̄_C) / D_T) × 100%          (1)

D̄_C = (1 / N_C) ∑_{i=1}^{N_C} D_i^C        (2)

D_T = ∑_{i=1}^{n} D_i                       (3)
As shown in Eqn. (1), the probability of class C is the percentage difference between the class average distance D̄_C and the total distance D_T. The average distance of class C is the sum of the distances between the window feature vector and the class center feature vector, over all windows assigned to class C, divided by their count N_C, Eqn. (2). The total distance is the sum of the distances between the window feature vector and the class center feature vector over all windows in the image, Eqn. (3). The image is then labeled with the selected classes along with their probability percentages. Image labels are stored in a database to be used in the image search engine.

2.3 Image Search Module

Like other search engines, ImageSeeker has a database containing images of categories such as nature. Each category has objects for the users to search for, and each image stored in the database has its own keywords (labels) attached to it. These keywords are the contents found in that image (the annotated image). A user who is interested in images containing a certain object simply chooses the object to search for, and the system then searches the database for images having a keyword matching that object. All images containing the object are displayed to the user according to their ranking: each object in each image has a rank, which is the accuracy percentage indicating how well the requested object matches the object present in the stored image.
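As a rough illustration of the per-window classification, the probability computation of Equations 1-3, and the keyword ranking used by the search module, a sketch is given below. It is not the authors' implementation; the threshold, the minimum count, and the database access are assumptions.

```python
import numpy as np

def annotate(window_fvs, class_centers, threshold=2.0, min_count=3):
    """Nearest-center classification per window, then the class probabilities
    of Equations 1-3 (a sketch, not the authors' code)."""
    assignments, distances = [], []
    for fv in window_fvs:
        dists = {c: np.linalg.norm(fv - center) for c, center in class_centers.items()}
        best = min(dists, key=dists.get)
        # label the window, or fall back to the dummy class "other"
        assignments.append(best if dists[best] < threshold else "other")
        distances.append(dists[best])               # distance D_i of window i
    D_T = sum(distances)                            # Eqn. (3)
    labels = {}
    for c in class_centers:
        d_c = [d for a, d in zip(assignments, distances) if a == c]
        if len(d_c) > min_count and D_T > 0:
            D_bar_c = np.mean(d_c)                  # Eqn. (2)
            labels[c] = (D_T - D_bar_c) / D_T * 100.0   # Eqn. (1), probability percentage
    return labels                                   # e.g. {"sky": 91.3, "sea": 78.0}

# A query then reduces to a ranked keyword lookup over the stored labels, e.g.:
# results = sorted((img for img in db if "sky" in img.labels),
#                  key=lambda img: img.labels["sky"], reverse=True)
```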
3. EXPERIMENTAL RESULTS

Several experiments were executed to assess the accuracy of the developed system; in this section we discuss some of them. The system was trained using 250 images (of size 50x50 pixels) from each class (sea, sand, sky, cloud, and grass). Figure 4 shows samples of the training data set, with 5 images for each class. Figure 5 shows an example of an image annotated by the proposed system, with the classification selected for each 50x50 image window. Image windows that do not match the known categories are labeled as other. Figure 6 shows two examples of images annotated by the system. The figure shows the labels of each image, including the class name, its probability percentage, and its location in pixel coordinates. The class with the highest probability percentage is shown in a white box. As shown in the figure, the classification decisions of the individual windows are fused, and only the fused classification is shown to the user and stored in the system. This fusion step is necessary to maximize the accuracy of the annotation.
4. CONCLUSION

In this paper, we presented a content-based image retrieval system that learns five classes of the nature category (sea, sky, sand, cloud, and grass). The system uses the learned classes to label unknown images and stores the labels along with the image link. Such a system does not search for images by name; rather, it searches for them by content. Our future work includes extending the system's capability by adding more categories and more classes in each category.
REFERENCES

[1] IBM's Query By Image Content, http://www.qbic.almaden.ibm.com/
[2] Bach, J., Fuller, C., Gupta, A., Hampapur, A., Gorowitz, B., Humphrey, R., Jain, R. and Shu, C., "Virage image search engine: an open framework for image management," in Proc. SPIE, Storage and Retrieval for Image and Video Databases IV, San Jose, CA, 76-87 (1996).
[3] Smith, J. R. and Chang, S. F., "Querying by color regions using the VisualSEEK content-based visual query system," in M. T. Maybury, ed., [Intelligent Multimedia Information Retrieval], AAAI Press (1997).
[4] Ma, W. Y., [NETRA: A Toolbox for Navigating Large Image Databases], PhD thesis, Dept. of Electrical and Computer Engineering, University of California at Santa Barbara (1997).
[5] Ortega, M., Rui, Y., Chakrabarti, K., Mehrotra, S. and Huang, T. S., "Supporting similarity queries in MARS," in Proc. 5th ACM International Multimedia Conference, Seattle, Washington, 8-14 Nov. 1997, 403-413 (1997).
[6] Chang, S. F., Chen, W. and Sundaram, H., "Semantic visual templates: linking visual features to semantics," in Proc. Int. Conf. on Image Processing, Chicago, IL, Vol. 3, 531-535 (1998).
[7] Rui, Y., Huang, T. S., Ortega, M. and Mehrotra, S., "Relevance feedback: a power tool in interactive content-based image retrieval," IEEE Transactions on Circuits and Systems for Video Technology 8(5), 644-655 (1998).
[8] Tong, S. and Chang, E., "Support vector machine active learning for image retrieval," in Proc. ACM Multimedia Conf., 107-118 (2001).
[9] Zhang, C. and Chen, T., "An active learning framework for content-based information retrieval," IEEE Transactions on Multimedia 4(2), 260-268 (2002).
[10] Barnard, K., Duygulu, P., de Freitas, N., Forsyth, D. A., Blei, D. M. and Jordan, M. I., "Matching words and pictures," Journal of Machine Learning Research 3, 1107-1135 (2003).
[11] Monay, F. and Gatica-Perez, D., "On image auto-annotation with latent space models," in Proc. ACM Multimedia Conf. (2003).
[12] Singhal, A., Luo, J. and Zhu, W., "Probabilistic spatial context models for scene content understanding," in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (2003).
[13] Tieu, K. and Viola, P., "Boosting image retrieval," International Journal of Computer Vision 56(1/2), 17-36 (2004).
[14] Vasconcelos, N. and Lippman, A., "A multiresolution manifold distance for invariant image similarity," IEEE Transactions on Multimedia 7(1), 127-142 (2005).
[15] Li, J. and Wang, J. Z., "Real-time computerized annotation of pictures," IEEE Transactions on Pattern Analysis and Machine Intelligence 30(6) (2008).
[16] Datta, R., Joshi, D., Li, J. and Wang, J. Z., "Image retrieval: ideas, influences, and trends of the new age," ACM Computing Surveys 40(2), article 5 (2008).
[17] Jolliffe, I. T., [Principal Component Analysis], Springer-Verlag, New York (1986).
[18] Haralick, R. M., Shanmugam, K. and Dinstein, I., "Textural features for image classification," IEEE Transactions on Systems, Man, and Cybernetics SMC-3(6), 610-621 (1973).
[19] Haralick, R. M. and Shapiro, L. G., [Computer and Robot Vision], Addison-Wesley (1992).
[20] Webb, A., [Statistical Pattern Recognition], 2nd ed., John Wiley & Sons (2002).
Figure 1: System Architecture (block diagram of the Object Learning, Image Annotator, and Image Search modules and the data flow between them)
Figure 2: Image Learning Module (block diagram of the object learning pipeline: co-occurrence matrix calculation, RGB to HSI conversion and 18/3/9 quantization, feature extraction, eigen space conversion, and representative election producing the class centers)
Figure 3: Image Annotator Module (block diagram: RGB to HSI conversion and quantization, co-occurrence matrix calculation, feature extraction, eigen space conversion, and distance and probability calculation against the class centers, producing the annotated image)
Figure 4: Samples from the training data set: (a) Sky, (b) Cloud, (c) Sea, (d) Grass, (e) Sand
Figure 5: Annotated Windows (per-window labels such as sky, clouds, sea, and other overlaid on the image grid)
Figure 6: Annotated Images (each label shows the class name, its accuracy percentage, and its region in pixel coordinates; e.g., sand 85%, clouds 83%, sky 90%, sea 95%)