Image Representation and Retrieval Using Support Vector Machine and Fuzzy C-means Clustering Based Semantical Spaces

Prabir Bhattacharya
Institute for Information Systems Engineering, Concordia University, Montreal, Quebec, CANADA
[email protected]

Md. Mahmudur Rahman, Bipin C. Desai
Dept. of Computer Science, Concordia University, Montreal, Quebec, CANADA
mah [email protected]

Abstract

This paper presents a learning based framework for content-based image retrieval that aims to bridge the gap between low-level image features and the high-level semantic information present in images, for semantically organized collections. Both supervised (probabilistic multi-class support vector machine) and unsupervised (fuzzy c-means clustering) learning techniques are investigated to associate global MPEG-7 based color and edge features with their high-level semantic and/or visual categories. Images are represented at a successive semantic level of information abstraction based on the confidence or membership scores obtained from the learning algorithms. A fusion-based similarity matching function is employed on these new image representations to rank and retrieve the images most similar to a query image. Experimental results on a generic image database with manually assigned semantic categories and on a medical image database with different modalities and examined body parts demonstrate the effectiveness of the proposed approach compared to the commonly used Euclidean distance measure on MPEG-7 based descriptors.

1. Introduction

In recent years, there has been an exponential growth of image data, mainly due to the rapid advancement of World Wide Web (WWW) technology, low storage costs, and the availability of many digital devices. This creates a compelling need for innovative tools for managing, retrieving, and visualizing images from large collections. Image retrieval has gained wide popularity in different communities during the last two decades and is now an inter-disciplinary research activity [1].


Many applications, such as digital libraries, image search engines, and medical decision support systems, require effective and efficient techniques to access images based on their contents, an approach commonly known as content-based image retrieval (CBIR) [1, 2]. In a typical CBIR system, low-level visual features (e.g., color, texture, shape, edge) are generated in vector form and stored to represent the query and target images in the database. When a user submits a query, retrieval is performed by computing similarity in the feature space, and the images most similar to the query are returned to the user according to the computed similarity values [1]. In general, the similarity comparison is performed either globally, based on visual content descriptors extracted from the entire image, or locally, based on descriptors derived from decomposed regions of the image. While much research effort has been devoted to the development of CBIR systems and various methods have been proposed [2], the performance is still limited. The limited retrieval accuracy is mainly due to the mismatch between user semantic concepts and system-generated low-level image features [1]. Several issues therefore remain largely open: automatic extraction of semantically related visual features, indexing of image content to address slow response times, and similarity matching for effective and precise image retrieval. When images are semantically organized in a database, or semantic queries are available, it is possible to extract a set of low-level features and predict the semantic category of each image by identifying its class assignment with a classifier. In many application domains, such as medical images of different modalities or personal photo collections, prior semantic groupings of the images exist and can be exploited for more effective and efficient retrieval. Recently, several machine learning based approaches have been explored to classify image/video collections into multiple semantic categories in order to support automatic image annotation or semantically adaptive searching [8, 9, 10].


Motivated by this, we present in this paper a novel approach to image representation and retrieval in semantically organized image databases. In this approach, both supervised and unsupervised machine learning techniques are investigated to associate MPEG-7 based global color and edge features with their high-level semantic categories. Specifically, we explore probabilistic multi-class support vector machines (SVM) [4] and fuzzy c-means (FCM) clustering [5] to represent images in a semantic space based on category or cluster membership values for each image in the database. The SVM classifier is responsible for finding semantically meaningful categories in the image collection, based on training on a subset of images, whereas FCM tries to find the natural cluster structure of the collection from the input feature vectors without any training. For each image in the database, both SVM and FCM produce a membership or confidence score for each category, which represents the weight of a label or category in the overall description of the image; the class with the highest confidence is considered the class of the image. However, instead of assigning an image exclusively to a single category or cluster prototype, we use the confidence scores of each image as a new vector representation in a semantic, or more specifically category and cluster based, feature space and perform similarity matching on it for retrieval. Instead of relying entirely on one representation scheme, which may not be accurate enough given the present state of computer vision and machine learning techniques, we unify the semantic category and the natural cluster specific representations in a similarity matching function so that they complement each other; this provides better accuracy, as demonstrated in the experimental section. The rest of this paper is organized as follows: Sections 2 and 3 briefly describe the supervised and unsupervised learning based categorization and clustering techniques considered. Section 4 discusses the framework for image representation by categorization based on both SVM and FCM. Section 5 describes a fusion-based similarity matching technique on this new image representation. Exhaustive experiments and analysis of results are presented in Section 6, based on a generic image database with 20 manually assigned semantic categories and on a medical image database with different modalities and as many as 22 categories. Finally, Section 7 provides the conclusions.

2. Supervised learning based multi-class SVM

Supervised learning based image classification is an area of active research in the field of machine learning and pattern recognition [3, 6]. A supervised multi-class classification system works by classifying an image into one of many predefined categories.

In this context, a semantic concept is first defined by a sufficient number of training images. The classifier learns a function from the training data, where each instance in the training set is represented by a feature vector together with its category or class label [3]. The task of the supervised learner or classifier is to predict the category of a newly encountered image based on this previous learning. When the categories are considered mutually exclusive, a confidence score is generally predicted for each category, and the highest one is considered the winner and chosen as the image category. In this paper we investigate a multi-class probability estimation technique that combines all pairwise comparisons of binary support vector machine (SVM) classifiers, known as pairwise coupling (PWC) [4]. SVM is an emerging supervised machine learning technique [7] that has already been used successfully for the classification and annotation of natural images [8, 9]. Given training data $(\mathbf{x}_1, \ldots, \mathbf{x}_N)$, where the $\mathbf{x}_i \in \mathbb{R}^d$ are feature vectors with labels $(y_1, \ldots, y_N)$, $y_i \in \{+1, -1\}$, the general form of the binary linear classification function is

$$g(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \quad (1)$$

which corresponds to the separating hyperplane

$$\mathbf{w} \cdot \mathbf{x} + b = 0 \quad (2)$$

where $\mathbf{x}$ is an input vector, $\mathbf{w}$ is a weight vector, and $b$ is a bias. The goal of SVM is to find the parameters $\mathbf{w}$ and $b$ of the optimal hyperplane that maximize the geometric margin $2/\|\mathbf{w}\|$ between the hyperplanes, by solving the following optimization problem [7]:

$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \;\; \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum_{i=1}^{N}\xi_i \quad (3)$$

$$\text{subject to} \quad y_i\left(\mathbf{w}^{T}\phi(\mathbf{x}_i) + b\right) \geq 1 - \xi_i \quad (4)$$

where $\xi_i \geq 0$ and $C > 0$ is the penalty parameter of the error term. Here the training vectors $\mathbf{x}_i$ are mapped into a high-dimensional space by the nonlinear mapping function $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^f$, where $f > d$ or $f$ may even be infinite. Both the optimization problem and its solution can be expressed in terms of inner products. Hence,

$$\mathbf{x}_i \cdot \mathbf{x}_j \rightarrow \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j) \quad (5)$$

where $K$ is a kernel function. The SVM classification function is given by [7]:

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right) \quad (6)$$

A number of methods have been proposed to extend SVM to the multi-class problem of separating L mutually exclusive classes, essentially by solving many two-class problems and combining their predictions in various ways [4]. One technique, known as pairwise coupling (PWC) or one-vs-one, constructs binary SVMs between all possible pairs of classes. This method uses L(L−1)/2 binary classifiers, each of which provides a partial decision for classifying a data point. PWC then combines the outputs of all the classifiers to form a class prediction: during testing, each of the L(L−1)/2 classifiers votes for one class, and the winning class is the one with the largest number of accumulated votes. Although the voting procedure requires only pairwise decisions, it predicts just a class label. To label or represent each image with a category specific confidence score, probability estimation is required; in our experiments, the probability estimation approach of [4] for multi-class classification by PWC is utilized.
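As an illustration of how such probability estimates can be obtained in practice, the following sketch (hypothetical code, not part of the original work) uses scikit-learn's SVC, whose multi-class probability output is computed by Platt scaling of the pairwise decisions followed by the pairwise coupling method of [4]; the experiments in this paper used the LIBSVM package directly. The file names and the 96-dimensional global descriptor introduced in Section 4 are assumptions made only for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical precomputed inputs:
# X: (N, 96) global CLD+EHD feature vectors of the training images
# y: (N,)  integer semantic category labels in {0, ..., L-1}
X = np.load("train_features.npy")
y = np.load("train_labels.npy")

# RBF-kernel SVM; probability=True enables per-class probability estimates
# obtained by coupling the L(L-1)/2 pairwise binary classifiers (PWC).
clf = SVC(kernel="rbf", C=100, gamma=0.003, probability=True)
clf.fit(X, y)

# For an unlabeled image x, the L-dimensional confidence vector used later
# as its semantic-space representation, and the winning category.
x = X[:1]                              # placeholder: any (1, 96) descriptor
p = clf.predict_proba(x)[0]            # p[k] = confidence of category k
winner = clf.classes_[p.argmax()]
```

The parameter values C=100 and gamma=0.003 match the general-database setting reported later in Table 1; in practice they would be chosen by cross-validation as described in Section 6.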

3. Unsupervised learning based FCM clustering

If natural clusters exist in the feature space of the database images, they may be located by applying a clustering technique. Clustering is the unsupervised classification of images, which separates feature vectors into several subsets or groups on the basis of their similarity in feature space [3]. It is usually performed when no information is available concerning the membership of input patterns in predefined classes; hence it is called unsupervised learning. The degree of membership of a data item in a cluster lies in [0, 1] if the clusters are fuzzy, or in {0, 1} if the clusters are crisp. Fuzzy c-means (FCM) is the most widely used fuzzy clustering algorithm; it assigns degrees of membership in several clusters to each input pattern [5]. The algorithm is based on the iterative optimization of the following fuzzy objective function [5]:

$$J_{FCM}(\mathbf{U}, \mathbf{V}; \mathbf{X}) = \sum_{i=1}^{N} \sum_{j=1}^{c} (\mu_{ji})^{m}\, \mathrm{dist}(\mathbf{x}_i, \mathbf{v}_j) \quad (7)$$

where $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ is a finite set of $N$ unlabeled feature vectors $\mathbf{x}_i \in \mathbb{R}^d$, $c$ is the number of clusters, and $\mathbf{V} = \{\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_c\}$ represents the unknown prototypes $\mathbf{v}_j \in \mathbb{R}^d$, known as the cluster centers. The fuzzy c-partition is defined by a $c \times N$ matrix $\mathbf{U} = [\mu_{ji}]$, where $\mu_{ji}$ is the membership degree of vector $\mathbf{x}_i$ in the $j$th cluster and satisfies $\mu_{ji} \in [0, 1]$, $\forall j, i$. The distance measure $\mathrm{dist}(\mathbf{x}_i, \mathbf{v}_j)$ can be the Euclidean or the Mahalanobis distance. Two issues to consider in developing FCM are the number of clusters and a proper method for selecting the initial cluster centers. In this work we take a simple approach to these problems: we assume that the semantic cluster centers of the training image feature vectors used for SVM training are a good choice for cluster initialization, and we keep the number of clusters equal to the number of predefined semantic classes. This point of view suggests that, in a semantically organized image database, the natural clusters tend to be close to the semantic clusters perceived by the user. The steps of the FCM algorithm with the proposed initialization are as follows:

Step 1: Initialize the cluster number $c = L$, where $L$ is the number of semantic classes, and select the initial cluster centers $\mathbf{v}_j^{(0)}$ for $j = 1, 2, \cdots, c$, where $\mathbf{v}_j \in \mathbb{R}^d$ is the mean vector of semantic cluster $j \in L$ in feature space. Set the iteration index $t = 1$.

Step 2: Select $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ and compute the initial membership matrix $\mathbf{U}^{(0)}$ as

$$\mu_{ji} = \frac{\mathrm{dist}(\mathbf{x}_i, \mathbf{v}_j)^{-2/(m-1)}}{\sum_{k=1}^{c} \mathrm{dist}(\mathbf{x}_i, \mathbf{v}_k)^{-2/(m-1)}} \quad \forall j, i \quad (8)$$

Step 3: Update all the cluster centers $\mathbf{v}_j^{(t)}$ using

$$\mathbf{v}_j^{(t)} = \frac{\sum_{i=1}^{N} (\mu_{ji})^{m}\, \mathbf{x}_i}{\sum_{i=1}^{N} (\mu_{ji})^{m}} \quad \forall j \quad (9)$$

Step 4: Update $\mathbf{U}^{(t)}$ by recomputing (8) with the distances of the $\mathbf{x}_i$ to the new cluster centers $\mathbf{v}_j^{(t)}$.

Step 5: If $|\mathbf{v}_j^{(t)} - \mathbf{v}_j^{(t-1)}| < \epsilon$, $\forall j$, then stop; otherwise set $t = t + 1$ and go to Step 3.

Once the change of the cluster centers between iterations is less than the termination criterion $\epsilon$, the convergence point is reached and the final clustering is achieved. The properties $\sum_{j=1}^{c} \mu_{ji} = 1$, $\forall i$ and $0 < \sum_{i=1}^{N} \mu_{ji} < N$, $\forall j$ must hold for $\mathbf{U}$ to be a non-degenerate fuzzy c-partition. Hence, each image can be labeled or represented by its cluster membership scores ($\mu$), analogous to the probability or confidence scores of the multi-class SVM. However, these category labels are data driven; they are obtained solely from the data.

4. Image representation in semantic space

The performance of a classification or clustering based algorithm depends mainly on the underlying image representation in the form of a feature vector and on the employed similarity matching function. For SVM training, the initial input to the retrieval system is the feature vector set of the training images, in which each image is manually annotated with a single semantic label out of L labels or categories. The MPEG-7 based Color Layout Descriptor (CLD) and Edge Histogram Descriptor (EHD) are extracted to represent each image at the global level.


These descriptors serve as input to the learning based algorithms [14]. The CLD is obtained by applying a DCT transformation to the 2-D array of local representative colors in the YCbCr color space, where each channel is represented by 8 bits; the representative colors are obtained by averaging each of the 3 channels separately over 8×8 image blocks. A scalable representation of the CLD is allowed in the standard, so one can select the number of coefficients used from each channel's DCT output: for each channel, 3, 6, 10, 15, 21, 28, or 64 coefficients can be used [15]. In this paper, a CLD with 10 Y, 3 Cb, and 3 Cr coefficients is extracted, due to the presence of both color and grey-level images. The spatial distribution of edges is used for global shape representation through the EHD descriptor. The EHD represents the local edge distribution in an image by dividing the image into 4×4 sub-images and generating a histogram from the edges present in each of these sub-images. Edges are categorized into five types: vertical, horizontal, 45° diagonal, 135° diagonal, and non-directional. In the end, a histogram with 16 × 5 = 80 bins is obtained, corresponding to a feature vector of dimension 80 [14, 15]. The CLD and EHD feature vectors are concatenated into a single vector $\mathbf{f}^{global} = (\mathbf{f}^{CLD}, \mathbf{f}^{EHD}) \in \mathbb{R}^{d}$, where $\mathbf{f}^{CLD}$ and $\mathbf{f}^{EHD}$ are the color and edge descriptors respectively and $d = 96$ (16 for CLD and 80 for EHD).
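To make the color descriptor concrete, the following is a simplified, CLD-style sketch (an approximation for illustration, not the exact MPEG-7 CLD, which additionally quantizes the coefficients): the image is reduced to an 8×8 grid of average colors, converted to YCbCr, transformed with a 2-D DCT per channel, and the first 10 Y, 3 Cb, and 3 Cr zigzag-ordered coefficients are kept, giving the 16 values that are later concatenated with the 80-bin EHD.

```python
import numpy as np
from scipy.fftpack import dct

def cld_like(rgb, n_y=10, n_c=3):
    """Simplified Color-Layout-style descriptor of an RGB image array (H, W, 3)."""
    h, w, _ = rgb.shape
    grid = np.zeros((8, 8, 3))
    for i in range(8):                       # representative (average) color per block
        for j in range(8):
            block = rgb[i*h//8:(i+1)*h//8, j*w//8:(j+1)*w//8]
            grid[i, j] = block.reshape(-1, 3).mean(axis=0)
    r, g, b = grid[..., 0], grid[..., 1], grid[..., 2]
    ycbcr = np.stack([0.299*r + 0.587*g + 0.114*b,                 # Y
                      128 - 0.168736*r - 0.331264*g + 0.5*b,       # Cb
                      128 + 0.5*r - 0.418688*g - 0.081312*b], -1)  # Cr
    # Zigzag ordering of the 8x8 DCT coefficients (low frequencies first)
    zz = sorted(((i, j) for i in range(8) for j in range(8)),
                key=lambda p: (p[0]+p[1], p[0] if (p[0]+p[1]) % 2 else -p[0]))
    feats = []
    for ch, n in zip(range(3), (n_y, n_c, n_c)):
        coef = dct(dct(ycbcr[..., ch], axis=0, norm="ortho"), axis=1, norm="ortho")
        feats.extend(coef[i, j] for (i, j) in zz[:n])
    return np.asarray(feats)                 # 10 + 3 + 3 = 16 dimensions

# A precomputed 80-bin EHD vector (assumed available) is then concatenated:
# f_global = np.concatenate([cld_like(image), f_ehd])   # 16 + 80 = 96 dims
```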

After training the SVM based classifier, each database image, without any label information, is classified against the L categories in the index generation phase. This produces a ranking of the L categories, with each category assigned a confidence or probability score for the image. The confidence represents the weight of a label or category in the overall description of an image, analogous to the weight of a keyword in the vector space model of information retrieval (IR) [13]. The probability or confidence scores of the categories form an L-dimensional vector $\mathbf{f}_i^{svm} \in \mathbb{R}^{L}$ for image $i$:

$$\mathbf{f}_i^{svm} = \{p_{1i}, p_{2i}, \cdots, p_{Li}\} \quad (10)$$

where $p_{ki}$ is the probability or confidence score of semantic class $k \in L$ for image $i$. For the FCM based image representation, the feature vectors $\mathbf{f}^{global}$ of all database images are provided as input to the clustering algorithm with the number of clusters $c = L$, as described in the previous section. The output of the clustering is the set of cluster prototypes $\mathbf{V} = \{\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_c\}$ and the membership matrix $\mathbf{U} = [\mu_{ji}]$. Hence, the membership scores of the clusters form a c-dimensional vector $\mathbf{f}_i^{fcm} \in \mathbb{R}^{c}$ for image $i$:

$$\mathbf{f}_i^{fcm} = \{\mu_{1i}, \mu_{2i}, \cdots, \mu_{ci}\} \quad (11)$$

where $\mu_{ji}$ is the membership score of cluster prototype $j \in c$ for image $i$.

Similarly, the label vector $\mathbf{f}_q^{svm}$ of a query image $q$ is obtained online by applying its global feature vector $\mathbf{f}_q^{global}$ to the multi-class SVM classifier, whereas, once the cluster prototypes $\mathbf{V}$ are known, the label vector $\mathbf{f}_q^{fcm}$ of the query image is obtained by applying equation (8) of the FCM algorithm. These approaches to semantic image categorization and clustering may not attain high accuracy given the present state of computer vision and pattern recognition technologies. However, the reliability of vector representations based on probability or membership scores is significantly better than chance, and they are useful for similarity matching and retrieval, much like keyword based text retrieval approaches. In addition, instead of relying entirely on either the SVM or the FCM based image representation, we unify both approaches in a similarity matching function so that they complement each other, and we obtain better accuracy as demonstrated in the experimental section.
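Combining the two previous sketches, a query image's pair of label vectors could be obtained online as follows (hypothetical helper code; `clf` denotes a trained probabilistic SVM as in the Section 2 sketch, and `V` the FCM cluster centers from the Section 3 sketch):

```python
import numpy as np

def fcm_membership(x, V, m=1.5):
    """Eq. (8) for a single feature vector x (d,) against fixed cluster
    prototypes V (c, d), without re-running the clustering."""
    d2 = ((V - x) ** 2).sum(axis=1) + 1e-12   # squared Euclidean distances
    w = d2 ** (-1.0 / (m - 1))
    return w / w.sum()                        # memberships sum to 1

def semantic_vectors(f_global, clf, V, m=1.5):
    """Map a 96-d CLD+EHD descriptor to its two semantic-space vectors:
    the L-d SVM confidence vector (Eq. 10) and the c-d FCM membership
    vector (Eq. 11)."""
    f_svm = clf.predict_proba(f_global.reshape(1, -1))[0]
    f_fcm = fcm_membership(f_global, V, m)
    return f_svm, f_fcm
```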

5. Fusion-based similarity matching

This section presents a fusion-based similarity matching function built as a weighted combination of individual cosine similarity measures on the SVM and FCM based image representations. In the vector space model (VSM) of information retrieval (IR), a common measure of similarity is the cosine of the angle between the query and document vectors, whereas distance-based measures compute, for example, the Euclidean distance between the query and document vectors in the space. In some cases, the directions of the vectors are a more reliable indication of the semantic similarity of the objects than the distance between them in the term-document space [13]. As the SVM and FCM based semantic feature representations closely resemble the term-document representation, cosine similarity matching is performed on each representation individually. The cosine similarity between the SVM based feature vectors of query image $q$ and database image $i$ is defined as

$$S_{SVM}(\mathbf{f}_q^{svm}, \mathbf{f}_i^{svm}) = \frac{\sum_{k=1}^{L} p_{kq}\, p_{ki}}{\sqrt{\sum_{k=1}^{L} (p_{kq})^2}\,\sqrt{\sum_{k=1}^{L} (p_{ki})^2}} \quad (12)$$

where $p_{kq}$ and $p_{ki}$ are the probability scores (weights) of the semantic label $k \in L$ for $q$ and $i$ respectively. Similarly, the cosine similarity between the FCM based feature vectors is

$$S_{FCM}(\mathbf{f}_q^{fcm}, \mathbf{f}_i^{fcm}) = \frac{\sum_{j=1}^{c} \mu_{jq}\, \mu_{ji}}{\sqrt{\sum_{j=1}^{c} (\mu_{jq})^2}\,\sqrt{\sum_{j=1}^{c} (\mu_{ji})^2}} \quad (13)$$

where $\mu_{jq}$ and $\mu_{ji}$ are the membership scores (weights) of the cluster prototype $j \in c$ for $q$ and $i$ respectively.


Figure 1. Block diagram of the proposed retrieval approach.

After the similarity measures of the SVM and FCM representations are determined as $S_{SVM}(\cdot)$ and $S_{FCM}(\cdot)$, we aggregate or fuse them into a single similarity matching function as follows:

$$S(q, i) = w_1\, S_{SVM}(\cdot) + w_2\, S_{FCM}(\cdot) \quad (14)$$

Here, $w_1$ and $w_2$ are non-negative weighting factors for the two feature level similarities, normalized so that $w_1 + w_2 = 1$; they need to be selected experimentally. Fig. 1 shows the block diagram of the proposed image retrieval approach from a query image viewpoint.
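A short sketch of the fused matching is given below (again hypothetical code, not the authors' implementation): it assumes the semantic-space vectors of all database images have been precomputed and stacked row-wise into two matrices, and it returns the indices of the best-ranked images.

```python
import numpy as np

def cosine_sim(q, M):
    """Cosine similarity (Eqs. 12-13) between a query vector q (d,) and
    every row of a database matrix M (n, d)."""
    qn = q / (np.linalg.norm(q) + 1e-12)
    Mn = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)
    return Mn @ qn

def fused_ranking(fq_svm, fq_fcm, DB_svm, DB_fcm, w1=0.8, w2=0.2, top_k=15):
    """Eq. (14): weighted fusion of the SVM- and FCM-space cosine similarities
    (w1 + w2 = 1); returns the indices of the top_k most similar images."""
    s = w1 * cosine_sim(fq_svm, DB_svm) + w2 * cosine_sim(fq_fcm, DB_fcm)
    return np.argsort(-s)[:top_k]
```

The default weights w1 = 0.8 and w2 = 0.2 correspond to the values used in the experiments of Section 6.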


6. Experiments and results

In order to verify the effectiveness of the proposed approach, experiments are performed on two different image collections. The first collection contains 5000 generic images in 20 manually assigned semantic categories (such as mountain, sea-beach, people, animal, food, etc.) obtained from the COREL and IAPR databases [11]. The second collection contains 2400 medical images of 22 different modalities and/or body parts, collected from a subset of the ImageCLEFmed collection [12]. Images in the first collection are all in color and in JPEG format, whereas the medical images are stored in various formats (e.g., JPEG, TIFF, PNG) and include both grey-level and color images. For the training of the multi-class SVM, 20% of the images of both collections (1000 images from the generic collection and 480 images from the medical collection) are selected as the training set, and the rest of the images (80%) are used to measure the accuracy (average precision) of the retrieval approaches. For SVM based image classification, recent work shows that the radial basis function (RBF) kernel works well when the relation between class labels and attributes is nonlinear [8]. Therefore, we use the RBF kernel as a reasonable first choice. There are two tunable parameters when using RBF kernels, C and γ, and it is not known beforehand which values are best for the classification problem at hand, so they are selected by cross-validation (CV). Hence, the RBF kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^{2})$, $\gamma > 0$, and 5-fold cross-validation are used during training to find the best parameters C and γ, as shown in Table 1.

Table 1. Cross-validation accuracy

Database   Kernel   C     γ        Accuracy
General    RBF      100   0.003    77.56%
Medical    RBF      200   0.0002   89.78%

After finding the best parameters C and γ for both collections, they are used to train on the entire training sets to generate the SVM model files. We use the LIBSVM software package [16] for the implementation of the SVM classifiers. For FCM based clustering, the global feature vectors of the entire databases are used to generate the cluster prototypes V and the membership matrix U. The parameters for FCM based clustering in this experiment were set as follows:

• The number of clusters was set to c = 20 for the general database and c = 22 for the medical database, the same as the number of semantic categories in each case.

• The mean of each semantic category was used for initialization, instead of random selection.

• The Euclidean distance was used to quantify the distance between each datum and the cluster centroids, and as a minimum distance classifier for query images.

• The weighting exponent was set to m = 1.5.

• The termination criterion was set to ε = 0.001, with a maximum of 500 iterations.

For a quantitative evaluation, the performances of the individual and fusion-based similarity measures are compared using average precision curves obtained by evaluating the top N = {10, 20, 50, 100, 150, 200} returned results. Precision is the ratio of the number of relevant images returned to the total number of images returned; a high precision value means that there are few false alarms (i.e., a small percentage of irrelevant images in the retrieval). We selected all the test images (80%) in the data sets as query images and used the query-by-example method, where the query is specified by providing an example image to the system. A retrieved image is considered a correct match if it is in the same category as the query image.
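The evaluation just described can be summarized by the following sketch (a hypothetical helper, not the authors' evaluation code): a retrieved image counts as relevant when its category label equals the query's, and precision is computed at each cut-off N and then averaged over all test queries.

```python
import numpy as np

def precision_at_n(ranked_ids, query_label, db_labels,
                   cutoffs=(10, 20, 50, 100, 150, 200)):
    """Precision at several cut-offs for one query: the fraction of the top-N
    retrieved images whose category matches the query's category."""
    hits = np.array([db_labels[i] == query_label for i in ranked_ids], dtype=float)
    return {n: hits[:n].mean() for n in cutoffs}

# Hypothetical usage over all test queries, where rank(q) returns the ranked
# database indices for query q (e.g., fused_ranking with top_k = 200):
# per_query = [precision_at_n(rank(q), labels[q], labels) for q in test_ids]
# avg_prec = {n: np.mean([p[n] for p in per_query]) for n in (10, 20, 50, 100, 150, 200)}
```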


Figure 2. Average precision curves for the different similarity measures: (a) general database, (b) medical database.

For the proposed fusion-based similarity matching function in equation (14), a weight of w1 = 0.8 is assigned to the SVM based similarity measure, as its features capture the semantics of the collections more closely, and a weight of w2 = 0.2 is assigned to the FCM based similarity measure. Figures 2(a) and 2(b) present the average precision curves of the different similarity measures for the generic and medical databases respectively. As shown in both figures, the best performance is always achieved when the search is performed with the fusion-based similarity matching function, whereas the Euclidean similarity measure on the global CLD and EHD feature vector and the FCM based cosine similarity measure perform poorly compared to the SVM or fusion based similarity matching. This supports our assumption that category specific search is more appropriate in a semantically organized image database. The poor performance of the FCM based feature might be due to setting the number of natural clusters equal to the number of semantic categories; by employing cluster validation techniques, the performance could be further improved in the future. However, when both approaches are combined in the fusion-based similarity matching, a slightly better performance is achieved for both data sets, as shown in Figures 2(a) and 2(b). This result is expected, as in the fusion-based similarity function the two representations complement each other. To give a qualitative idea of the performance improvement, Fig. 3 and Fig. 4 show snapshots of the proposed CBIR interface for a query image. In Fig. 3, for a query image belonging to the X-ray-femur category (the top left-most image), the system returns 8 images of the same category out of the 15 retrieved from the database when the Euclidean distance measure is applied to the global CLD and EHD based feature vector.

Figure 3. A snapshot of the image retrieval (Global Euclidean similarity measure)

Figure 4. A snapshot of the image retrieval (Fusion-based similarity measure)


In Fig. 4, by contrast, for the same query image, the system returns 14 images of the same category out of the 15 retrieved from the database when the proposed fusion-based similarity matching function is applied. This clearly shows the performance improvement of the proposed image retrieval scheme for this particular query image.

7. Conclusion

In this paper, a novel image representation and similarity matching technique is proposed for semantically organized image databases. In this technique, images are represented in a new feature space derived from both a supervised multi-class SVM and unsupervised FCM clustering. In this way, both the semantic and the natural organization of the images in a database are considered and exploited by a fusion-based similarity matching function. A further advantage is that the feature dimension, and consequently the computational complexity, can be reduced when the number of image categories or natural clusters is smaller than the dimension of the initial input features. Our study shows that, even though the initial categorization may not be entirely correct, the retrieval approach finds relevant images via the proposed similarity measures in the new feature spaces. We have tested and evaluated the approach on two different image collections with known ground truth, with promising results.

References

[1] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. on Pattern Anal. and Machine Intell., vol. 22, pp. 1349–1380, 2000.

[2] J. P. Eakins, "Towards Intelligent Image Retrieval," Pattern Recognition, vol. 35, pp. 3–14, 2002.

[3] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, vol. 31(3), pp. 264–323, 1999.

[4] T. F. Wu, C. J. Lin, and R. C. Weng, "Probability Estimates for Multi-class Classification by Pairwise Coupling," Journal of Machine Learning Research, vol. 5, pp. 975–1005, 2004.

[5] J. C. Bezdek et al., Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer Academic Publishers, Boston, 1999.

[6] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, 1990.

[7] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, pp. 273–297, 1995.

[8] O. Chapelle, P. Haffner, and V. Vapnik, "SVMs for Histogram-Based Image Classification," IEEE Trans. on Neural Networks, vol. 10(5), pp. 1055–1064, 1999.

[9] E. Chang, K. Goh, G. Sychay, and G. Wu, "CBSA: Content-Based Soft Annotation for Multimodal Image Retrieval Using Bayes Point Machines," IEEE Trans. on CSVT, vol. 13(1), pp. 26–38, 2003.

[10] M. R. Naphade, C. Lin, J. R. Smith, B. Tseng, and S. Basu, "Learning to Annotate Video Databases," Proceedings of SPIE, vol. 4676, pp. 264–275, 2002.

[11] M. Grubinger, P. Clough, H. Müller, and T. Deselaers, "The IAPR TC-12 Benchmark - A New Evaluation Resource for Visual Information Systems," Proceedings of the International Workshop OntoImage'2006: Language Resources for Content-Based Image Retrieval, Genova, Italy, 2006.

[12] P. Clough, H. Müller, and M. Sanderson, "The CLEF 2004 Cross-Language Image Retrieval Track," 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004, Bath, UK, September 15-17, 2004, Revised Selected Papers, LNCS, vol. 3491, pp. 597–613.

[13] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[14] B. S. Manjunath, P. Salembier, and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons Ltd., England, 2002.

[15] ISO/IEC JTC1/SC29/WG11/W3703, MPEG-7 Multimedia Content Description Interface Part 3: Visual, October 2000.

[16] C. C. Chang and C. J. Lin, "LIBSVM: A Library for Support Vector Machines," 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

