Proc. of Int. Conf. on Advances in Computer Science, AETACS
Efficient Classification of Images using Histogram based Average Distance Computation Algorithm Extended with Duplicate Image Detection

Parag Shinde 1, Amrita A. Manjrekar 2
1 Computer Science and Technology, Department of Technology, Shivaji University, Kolhapur, India
[email protected]
2 Computer Science and Technology, Department of Technology, Shivaji University, Kolhapur, India
[email protected]
Abstract— Internet image search is a day-to-day activity for many users. A user enters a keyword into a search engine such as Google, Yahoo, or Bing, and millions of related images are retrieved. The problem with keyword search is that the keywords entered by users are often short and ambiguous, so the retrieved images fall into different categories and some of them are irrelevant. Visual information is used to resolve this ambiguity in text based image retrieval: the user only has to click on one query image. The query image is categorized using textual features such as image title, image URL and context, together with visual features such as histogram distance computation, SIFT and region based features. The query image selected by the user is first classified into a particular category, and the images related to it are then retrieved by matching the class of the query image against the classes of the other images. Using image clustering, the classified images are clustered and the keywords corresponding to the image clusters are extracted. The original keyword is expanded by appending the extracted keyword with the highest frequency, which gives a more detailed picture of the user's search intention. The images are then re-ranked using visual and textual similarity metrics. Duplicate images in the search results are detected and eliminated using the SURF (Speeded Up Robust Features) technique. The system is tested on a variety of categories, such as person and scenery images at the semantic level, and on other general categories such as general objects and objects with simple backgrounds. The system is entirely web based and works dynamically on any keyword given as input by the user.

Index Terms— Image Retrieval; Re-Ranked; Non-duplicate; Visual Information; Histogram; Scores

I. INTRODUCTION
© Elsevier, 2013

Data mining aims at extracting useful and interesting patterns from large collections of data. The data may be in the form of text, images, audio or video. Large collections of image data can also be mined to extract useful information. Image mining can therefore be defined as the process of discovering useful and potentially understandable features and patterns from large image data sets. It is not only a branch of data mining but is also interdisciplinary in nature, drawing on digital image processing, image understanding, image analysis, artificial intelligence and related fields. The main aim of image mining is to extract useful patterns from images for a particular domain, to identify the best features, and to gather relevant knowledge from images. The current technique used in image retrieval and classification is content based image retrieval, where image retrieval is the process of searching, browsing and mining images from the web or from large databases. Content based image retrieval is based on analysis of the actual contents of images rather than metadata such as keywords, tags and other information surrounding the image. The term content refers to the extraction of low level and high level features. Low level features include color, shape and texture, while high level features are domain specific; for example, a person image could be detected using the face as an extracted feature, and scenery images could be detected by extracting features such as trees and mountains. Images can also be retrieved based on metadata such as URL, image alt, context, context10 and image title. To search for images, a user provides a query keyword or an image file/link, and the system returns images indexed to the query keyword. Internet image search engines such as Google and Bing use only keywords as queries to retrieve images. The search engine retrieves thousands of images ranked by the keywords provided by the user. Web image search relies on textual features, i.e. the text based keywords given by the user. Thus, the user's search intention cannot be interpreted from query keywords alone, and because keywords are ambiguous, the results obtained are noisy and irrelevant images may also be retrieved. The ambiguity is caused by the following [1]:

• The user may not have enough knowledge of the textual description of the target images.
• Users cannot accurately describe the visual information of images using keywords [1].

The query images can be classified at different levels. Level 1 classification is based on primary features of the image, which include color, texture and shape. Level 2 classification is based on features of the objects contained in the image, such as the face in person images, or the sky and trees in scenery images. Textual information alone is not enough to classify images efficiently, so visual information is also considered for classification. Re-ranking plays an important role in retrieving relevant images, i.e. images closely related to the query image. For example, if the user types the keyword "cat", the images retrieved through the search engine belong to different categories such as black cat, white cat and brown cat, because the word "cat" is ambiguous. If the user clicks on an image of a black cat, the image is classified as an animal image and a cluster of black-cat images is formed. The keyword "black" is extracted, and the original keyword "cat" entered by the user is expanded with it. Re-ranking is then applied, which yields relevant images related to the image of a black cat. When the user clicks on one particular image, i.e. a query image, the search engine must automatically retrieve good quality relevant images for that query image without retrieving duplicates.

II. RELATED WORK

Web-scale image search engines are fully based on text features [1], [2]. J. Cui et al. proposed the use of an adaptive visual similarity concept to re-rank the text-based search results. A query image is first classified into one of several predefined categories, and the images are re-ranked based on a category-specific similarity measure.
Feng Jing et al. [3] proposed IGroup, an efficient and effective algorithm that clusters web images based on extracted keywords and then assigns all the resulting images to the corresponding clusters based on visual or textual features. Zhiguo Gong et al. [4] focused on keyword expansion using WordNet, which relates words along the three dimensions of hypernym, hyponym and synonym. To extract a Term Semantic Network (TSN), a popular association mining algorithm, the Apriori algorithm, is used, and the keywords are weighted using term frequency and inverse document frequency. The introduction of generic classifiers is proposed in [5]; these are based on query-relative features and combine textual features, visual features based on the occurrence of query terms in web pages, image metadata, and a visual histogram representation of images. A method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene is proposed in [6]. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. D. S. Guru et al. [7] proposed an algorithm to classify flower images using a KNN classifier. Textural and Gabor features are extracted from the segmented flower image, and the feature values are fed to the KNN classifier for classification.
Vaibhav Kumar Singh and Ankit Goyal [8] proposed the SURF (Speeded Up Robust Features) algorithm, which is used to locate local feature values of images.

III. SYSTEM DESIGN

The system flow is depicted in Figure 1, which summarizes the steps performed by the system.

1. Text based search: the user enters a keyword to search for the particular images he or she is interested in. The search engine retrieves millions of images indexed on keywords.

2. The user clicks on one image of interest. This query image is classified into one of the categories such as portrait, people, scenery or animals, and the other retrieved images are classified into the same set of categories. The class of each retrieved image is compared with the class of the query image, and the result is refreshed so that only the images with a matching class are retained.

3. Images of different categories are clustered with respect to the query image. The histogram based Euclidean distance is computed over the RGB color components, and the images with the minimum distance to the query image are placed in one cluster corresponding to that query image.

4. The common keyword corresponding to the clustered images is extracted using term frequency, and the original keyword is expanded with this extracted keyword.

5. The query image and the clustered images are combined for further search, so that the retrieved images are more relevant. Once all the images related to the expanded keyword have been retrieved, re-ranking is applied to obtain the most relevant images.

6. Duplicate images in the search results are detected and eliminated.

The system consists of the following modules:

A. Image Classification

The images are classified into different classes such as scenery, people or images of general objects by extracting several features such as the RGB color histogram, SIFT, face, and color spatialet. Figure 2 represents image classification into general and semantic classes.
General classification groups images according to the keyword entered by the user (e.g. apple, animal, landscape), while semantic classification groups images into different categories based on the semantics of the image, such as person and scenery images.

B. Histogram Based Image Comparison for Image Clustering

A histogram graphically represents pixel intensity values: it plots the number of pixels of each color in an image. A color histogram represents the color distribution in an image; in this paper the color histogram is built for the RGB color space. Many histogram distances have been used to define the similarity of two color histogram representations, and Euclidean distance computation is the most commonly used method. The color histogram is defined by

h_{A,B,C}(a, b, c) = N · Prob(A = a, B = b, C = c),   (1)

where A, B and C represent the three color channels and N is the number of pixels in the image; it defines the probability of the intensities of the three color components. Let h and g represent two color histograms. The Euclidean distance between the color histograms h and g can be computed as

d(h, g) = sqrt( Σ_a Σ_b Σ_c ( h(a, b, c) − g(a, b, c) )² ),   (2)

where the comparison is made over identical bins in the respective histograms. The steps used to cluster images based on histogram comparison are shown in Figure 3. The color histogram for an image is constructed by counting the number of pixels of each color. The steps in the formation and comparison of histograms are [9]:

• The R, G, B color space is selected for the formation of the histogram.
• The color space is quantized to reduce the number of distinct colors in the image.
• Compute the histogram for the corresponding image.
• Calculate the histogram distance based on Euclidean distance.
• Compare the two histograms based on the histogram distance.
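The quantized histogram construction and the Euclidean comparison above can be sketched in Python. This is a minimal illustration: the choice of 4 bins per channel and the sparse dictionary representation are assumptions for the sketch, not details specified in the paper.

```python
def quantized_histogram(pixels, bins_per_channel=4):
    """Build a quantized RGB color histogram.

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    Returns a dict mapping a quantized (r, g, b) bin to a pixel count.
    """
    step = 256 // bins_per_channel  # width of each quantization bin
    hist = {}
    for r, g, b in pixels:
        bin_key = (r // step, g // step, b // step)
        hist[bin_key] = hist.get(bin_key, 0) + 1
    return hist


def euclidean_distance(h, g):
    """Euclidean distance over identical bins of two histograms (Eq. 2)."""
    bins = set(h) | set(g)
    return sum((h.get(b, 0) - g.get(b, 0)) ** 2 for b in bins) ** 0.5
```

With a finer quantization the histograms grow larger, but the comparison over the union of bins stays the same, so the two functions compose directly into the clustering step that follows.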
The above histogram distance computation algorithm is used to cluster visually similar images. Two clusters are formed: one for closely related images and the other for images with a large Euclidean distance. A centroid is computed from the distance values obtained by applying the Euclidean distance to the histogram color components. Based on a threshold value, images with smaller distances to the centroid are placed in one cluster and images with larger distances are put into the other cluster.
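The two-cluster split described above can be sketched as follows. This is a hedged sketch: using the mean distance as the centroid/threshold is an assumption about a detail the paper does not spell out.

```python
def cluster_by_distance(distances, threshold=None):
    """Split images into two clusters by their histogram distance
    to the query image.

    distances: dict mapping an image id to its Euclidean distance
    from the query image's histogram.
    threshold: cut-off distance; if None, the mean distance is used
    as the centroid (an assumed default).
    Returns (near_cluster, far_cluster) as lists of image ids.
    """
    if threshold is None:
        threshold = sum(distances.values()) / len(distances)
    near = [img for img, d in distances.items() if d <= threshold]
    far = [img for img, d in distances.items() if d > threshold]
    return near, far
```

The near cluster is then passed on to the keyword extraction step, while the far cluster is discarded from further search.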
Figure 1. System Flow
Figure 2. Classification of Images
Figure 3. Clustering of Classified Images
C. Extract Common Keywords Corresponding to Clustered Images (Term Frequency)

The common keyword corresponding to the clustered images is extracted by calculating term frequencies, and the keyword is appended to the original keyword entered by the user.
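A minimal sketch of this term-frequency expansion is given below. The whitespace tokenization and the exclusion of the original keyword itself are assumed details; the paper only states that the most frequent term is appended.

```python
from collections import Counter


def expand_keyword(original_keyword, cluster_captions):
    """Expand the user's keyword with the most frequent term found in
    the textual metadata (titles, context) of the clustered images.

    cluster_captions: list of strings, one per clustered image.
    """
    counts = Counter()
    for caption in cluster_captions:
        for term in caption.lower().split():
            if term != original_keyword.lower():
                counts[term] += 1
    if not counts:
        return original_keyword  # nothing to expand with
    best_term, _ = counts.most_common(1)[0]
    return f"{original_keyword} {best_term}"
```

For the running example in the introduction, captions of a black-cat cluster would expand the query "cat" to "cat black".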
D. Re-Ranking

Scores are assigned to the images based on feature values. Images with the maximum scores are ranked at the highest positions and are considered the images most relevant to the query image. The re-ranking algorithm consists of the following steps:

1. Compute the R-histogram (Rhist), G-histogram (Ghist) and B-histogram (Bhist) for each image. Each image histogram is divided into three parts: shadow, mid-tone and highlight. Rhist_shadow1, Ghist_shadow1 and Bhist_shadow1 are the shadow parts of the R-, G- and B-histograms of the query image; Rhist_shadow2, Ghist_shadow2 and Bhist_shadow2 are the shadow parts of the R-, G- and B-histograms of any other retrieved image. Rhist_midtone1, Ghist_midtone1 and Bhist_midtone1 are the mid-tone parts of the R-, G- and B-histograms of the query image; Rhist_midtone2, Ghist_midtone2 and Bhist_midtone2 are the mid-tone parts for any other retrieved image. Rhist_highlight1, Ghist_highlight1 and Bhist_highlight1 are the highlight parts of the R-, G- and B-histograms of the query image; Rhist_highlight2, Ghist_highlight2 and Bhist_highlight2 are the highlight parts for any other retrieved image.

2. Compute the distance of each image from the query image using the following steps:

a. The distances of the shadow, mid-tone and highlight regions of the R-histogram are computed using the following formulae:

d_shadow = |Rhist_shadow1 − Rhist_shadow2|,   (3)

d_midtone = |Rhist_midtone1 − Rhist_midtone2|,   (4)

d_highlight = |Rhist_highlight1 − Rhist_highlight2|,   (5)

Calculate the score for the R-histogram by using the following equation:

Score_Rhist = 1 − (d_shadow + d_midtone + d_highlight) / 1000,   (6)
b. The distances of the shadow, mid-tone and highlight regions of the G-histogram are computed using the same formulae, applied to Ghist_shadow1/2, Ghist_midtone1/2 and Ghist_highlight1/2. The score for the G-histogram is then calculated analogously:

Score_Ghist = 1 − (d_shadow + d_midtone + d_highlight) / 1000,   (7)
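Under the reconstruction of Eqs. (3)–(6) above, the per-channel score can be sketched as follows. The split of a 256-bin histogram into equal shadow/mid-tone/highlight thirds and the normalization constant of 1000 are assumptions; the paper does not fully specify either.

```python
def channel_score(hist_query, hist_other, norm=1000.0):
    """Score one color channel as in Eqs. (3)-(6): split each 256-bin
    histogram into shadow, mid-tone and highlight regions, take the
    absolute difference of the region totals, and convert the summed
    distance into a score where higher means more similar.
    """
    def region_totals(hist):
        # Assumed split: bins 0-84 shadow, 85-169 mid-tone, 170-255 highlight.
        return (sum(hist[:85]), sum(hist[85:170]), sum(hist[170:]))

    q = region_totals(hist_query)
    o = region_totals(hist_other)
    d_shadow, d_midtone, d_highlight = (abs(a - b) for a, b in zip(q, o))
    return 1.0 - (d_shadow + d_midtone + d_highlight) / norm
```

The same function applies unchanged to the R-, G- and B-histograms; the final rank of an image would then combine the three per-channel scores.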