To appear in GEOINFORMATICA, An International Journal on Advances of Computer Science for Geographic Information Systems
A Multi-resolution Content-Based Retrieval Approach for Geographic Images

Gholamhosein Sheikholeslami and Aidong Zhang
Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260, USA

Ling Bian
Department of Geography, State University of New York at Buffalo, Buffalo, NY 14260, USA

Abstract
Current retrieval methods in geographic image databases use only pixel-by-pixel spectral information. Texture is an important property of geographical images that can improve retrieval effectiveness and efficiency. In this paper, we present a content-based retrieval approach that utilizes the texture features of geographical images. Various texture features are extracted using wavelet transforms. Based on the texture features, we design a hierarchical approach to cluster geographical images for effective and efficient retrieval, measuring distances between feature vectors in the feature space. Using wavelet-based multi-resolution decomposition, two different sets of texture features are formulated for clustering. For each feature set, different distance measurement techniques are designed and evaluated experimentally for clustering images in a database. The experimental results demonstrate that retrieval efficiency and effectiveness improve when our clustering approach is used.
Keywords: geographical image retrieval, multi-resolution wavelet transform, texture features, hierarchical clustering
1 Introduction

The development of satellite remote sensing has generated a huge amount of image data during the last two decades. With the launch of the new breed of high spatial resolution satellite systems, the volume
This research is partially supported by Xerox Corporation and NCGIA at Buffalo.
of digital image data will significantly increase. Paralleling the remotely sensed imagery is the availability of Geographic Information System (GIS) data, such as digital orthophotographs and digital elevation models, which are available in pixel format and in sheer volume. Thus, there is a great demand for effective retrieval mechanisms. In processing these geographic images, the ultimate interest is to recognize and locate specific contents. A content-based retrieval approach is paramount to organizing geographic images. In the design of such retrieval approaches, significant features must first be extracted from geographical data in their pixel format. The features are numerical measurements that characterize an image. These features can then be used to cluster and index the images for efficient retrieval. Methods have been developed for indexing and accessing alpha-numerical data in traditional databases. However, the traditional approaches to indexing may not be appropriate in the context of content-based image retrieval [CYDA88, RS91, BPJ93]. In this context, a challenging problem arises with many image databases, in which queries are posed via visual or pictorial examples (termed visual queries). A typical visual query might entail locating all images in a database that contain a subimage similar to a given query image. Such a query is well suited to geographic images for identifying specific geographic contents such as land use/land cover types. Texture in images has been an important aspect of geographic image classification and retrieval [Har79, Joh94, RW96, WH90], due primarily to the heterogeneous nature of geographic images and the association between texture patterns and geographic contents. The variety and complexity of geographic texture present a challenge as well as an impetus for seeking more effective approaches to feature representation.
We have conducted experimental research on image data retrieval based on texture feature extraction [SZ97, RSZSM97, SZB97]. In most existing content-based retrieval approaches, feature vectors of images are first constructed (e.g., from wavelet transforms), which are then used to distinguish images by measuring distances between feature vectors. Our experimental results demonstrate that this type of approach can be effectively used to perform content-based retrieval based on similarity comparison between images. However, we encountered a critical problem: the feature vectors of some semantically irrelevant images may be located very close together in the feature space. Figure 1 presents two images, of water and stone, between which the distance of the feature vectors is very small, although the images are semantically dissimilar. Given a query whose feature vector is located in the neighborhood of the feature vectors of the two given images, both images are likely to be retrieved together in response to the query. Thus, indexing alone, using an R-tree or its variants based on the closeness of feature vectors in the feature space, sometimes may not provide satisfactory solutions. An effective clustering approach needs to be integrated with indexing techniques for efficient and effective retrieval of image data. Once the query is narrowed to a specific category, image retrieval can then proceed efficiently.
Figure 1: Semantically different images with similar feature vectors.

Figure 2 shows two arbitrarily shaped semantic clusters C1 and C2 in a two-dimensional feature space. We may assume that C1 and C2 represent the images of stone and water in the feature space. Consider a query image q which belongs to cluster C1. If we only consider the closeness of feature vectors in the feature space to return relevant images, we may retrieve many images from cluster C2 that are semantically irrelevant. Indexing trees such as R-trees are very efficient in retrieving the nearest neighbors of the query image, but since they do not consider the semantics of images, the retrieved images might be semantically irrelevant. If we successfully classify the database images into different semantic clusters, a visual query can be quickly narrowed to a specific category. Therefore, by combining clustering with R-tree-related indexing, image retrieval can proceed both effectively and efficiently.
Figure 2: Two semantic clusters.

In this paper, we present an approach to effective and efficient retrieval of images in large geographical image database systems. The goal is, given a query image of size n × n pixels, to retrieve all the database images of size N × N (where n ≤ N) that contain a portion similar to the query image. Database images have been decomposed into smaller segments (down to n × n segments) offline. The query image will be compared to the corresponding segments (as will be discussed later). A clustering approach is designed to be incorporated into the existing indexing methods [SRF87, BKSS90, LJF94] to achieve effectiveness and efficiency. Similar to most other current clustering methods, our approach cannot distinguish overlapping clusters¹. Thus,
¹The proposed method in [SCZ98] handles overlapping clusters using heterogeneous features.
we assume that different clusters do not overlap in the feature space. The semantic clusters can have arbitrary shapes or can be disconnected. We first choose a set of training samples to identify predefined semantic clusters (for example, residential or agriculture). These clusters include the semantically similar images. Then, using a hierarchical method, we find template images representing each cluster. This approach considers both the semantics and the feature vectors in classifying images. Using the multi-resolution property of wavelet transforms, we extract features at different scales, where only the coarse features are used in clustering. In our clustering approach, the distribution of feature vectors in the feature space is also considered. Combining the proposed clustering with any existing indexing approach, our experiments demonstrate that retrieval from a clustered database outperforms methods that are based only on nearest-neighbor retrieval. In addition, since the retrieval of images is narrowed to a specific category, image retrieval can proceed more efficiently. Since geographical images (especially airphoto images) are rich in texture and have relatively well-defined clusters, our approach, which extracts multi-resolution texture features and clusters the images, is a good candidate for this domain. The remainder of this paper is organized as follows. Section 2 presents the approach to multi-resolution texture feature extraction. Section 3 discusses the application of the feature extraction approach to image clustering and retrieval. Experiments are presented in Section 4. Concluding remarks are offered in Section 5.
2 Texture Feature Extraction

In this section, we will first discuss the importance of texture features in retrieval. We will then introduce wavelet transforms and the wavelet-based texture features extracted from images.
2.1 Texture Features

In current methods, spectral features (pixel values) are usually used to retrieve images from a geographical image database. However, spectral features alone may not be sufficient to characterize a geographical image because they fail to provide information about the spatial pattern of pixels. One way of obtaining better classification and retrieval is the use of spatial pattern recognition techniques that are sensitive to factors such as image texture and pixel content. In addition, using texture features of a group of pixels instead of individual pixels reduces the information to be processed, and hence decreases the memory storage and computation required for image retrieval and clustering. Because of the spatial nature of geographic images, texture has been used to distinguish among images [RW96, Lan80, HK69, MM96, Joh94, GT83]. However, compared to purely spectral-based procedures, texture-based retrieval systems have received limited attention in remote sensing in the past [LK94].
Different methods of defining texture have been proposed to model natural texture features or for use in classification. These approaches have included the use of Fourier transforms [SF86], fractals [Man77], random mosaic models [AR81, JMV81], mathematical morphology [DP89], syntactic methods, and linear models [CJ83]. However, most of these approaches have been used in the field of pattern recognition and have not been applied to the analysis of remotely sensed data [RW96]. One texture feature extraction method that has been used for remotely sensed data is to pass a filter over the images and use the variance of the pixels covered by the filter as the value of the pixel in the texture image [AN91, BN91, Jen79]. Woodcock and Ryherd found that using this local variance texturing technique in an adaptive windowing procedure resulted in an improved image for certain applications [WR89]. The adaptively located window used to calculate local variance is the window with the lowest local variance of all windows that include the pixel. They then apply a multi-pass, pair-wise, region-growing algorithm to find spatially cohesive units, or "regions", in the image [RW96]. Johnson used a knowledge-based approach to the classification of built-up land from multi-spectral data. The approach involves spectral classification, determination of size and neighbor relations for the segments in the spectrally classified image, and rule-based classification of the image segments into land-use categories [Joh94]. The implementation is slow, and performance problems were encountered due to the large number of objects being processed [Joh94]. The gray-level co-occurrence matrix, a matrix of second-order probabilities, has also been used in texture analysis [BL91, HTC86]. The element s(i, j, v) of the co-occurrence matrix is the estimated probability of going from gray-level i to gray-level j, given the displacement vector v = (x, y).
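As a concrete illustration of the definition above, the following sketch estimates s(i, j, v) by counting gray-level transitions under a displacement vector. The function name, the toy four-level image, and the normalization choice are our own illustrative assumptions, not an implementation from the cited works.

```python
import numpy as np

def cooccurrence_matrix(img, dx, dy, levels):
    """Estimate s(i, j, v): the probability of going from gray-level i
    to gray-level j under the displacement vector v = (dx, dy)."""
    h, w = img.shape
    counts = np.zeros((levels, levels), dtype=float)
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[img[y, x], img[y2, x2]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

# Toy 4-level image with vertical stripes: gray-level rises by 1 to the right,
# so only the transitions 0->1, 1->2, and 2->3 occur for v = (1, 0).
img = np.array([[0, 1, 2, 3]] * 4)
P = cooccurrence_matrix(img, dx=1, dy=0, levels=4)
```

For this image, each of the three transitions occurs equally often, so P[0, 1] = P[1, 2] = P[2, 3] = 1/3.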
A set of statistical values computed from co-occurrence matrices, such as entropy, angular second moment, and inverse difference moment, have also been used [HSD73, PF91]. In practice, there are two problems with this approach. The first is that the co-occurrence matrix depends not only on the spatial relationship of gray-levels, but also on regional intensity background variation within the image. The second is the choice of the displacement vector v = (x, y), because the co-occurrence matrix characterizes the spatial relationship between pixels only for a given vector v [WH90]. Wang and He presented a statistical method for texture analysis, where they define a texture unit as the smallest complete unit that best characterizes the local texture in eight directions from a pixel. The distribution of texture units yields a texture spectrum which can be used in texture analysis [WH90]. Since the local texture information for a pixel is extracted from a neighborhood of 3 × 3 pixels, instead of the one displacement vector used in the gray-level co-occurrence matrix, their method is more complete for the characterization of texture. They provided limited experimental results. The textural structures that can be described within a neighborhood are naturally limited to those observable within the size of the neighborhood. Hence, features that are extracted based on measurements within a fixed-size neighborhood (for example, a 3 × 3 pixel window) have poor
discrimination power when applied to textures not observable within the neighborhood because of the wrong scale [BGW91]. All the above texture analysis methods share this common problem. This issue motivates the use of a multi-resolution method. Bigun et al. formulated a general n-D least mean square problem whose solution facilitates the extraction of dominant orientation [BGW91]. In 2-D space, they used linear symmetry along with the Laplacian pyramid approach to produce feature images. The linear symmetry features directly measure the dominant orientation information in the frequency channels [BGW91]. QBIC is a content-based image retrieval system which retrieves images based on their color, shape, or texture [FSN+95]. In QBIC, texture features are based on modified versions of the coarseness, contrast, and directionality features proposed in [TMY78]. The distance between the three-element feature vectors is measured by weighted Euclidean distance. However, new mathematical representations of image attributes are needed in QBIC [FSN+95]. Features that describe new image properties, such as alternate texture measures, or that are based on fractals or wavelet representations, for example, may offer advantages in representation, indexability, and ease of similarity matching [FSN+95]. Jacobs et al. presented a search algorithm for query images in an image database [JFS95]. They apply a wavelet transform to query and database images. The wavelet coefficients are then truncated, and only the coefficients with the largest magnitudes are used in comparing images. They developed an image query metric which uses those truncated and quantized wavelet coefficients. According to them, wavelet coefficients provide information that is independent of the original resolution. Thus, a wavelet scheme allows queries to be specified at any resolution (potentially different from that of the database images) [JFS95]. Hence they call their method multi-resolution image querying.
Manjunath and Ma used the Gabor wavelet transform to extract texture features and showed that it performed better than other texture classification methods [MM96]. They used the mean and variance of frequency sub-bands to represent texture. In our method, we take advantage of the multi-resolution property of wavelets explicitly. In our approach, as will be explained later, we apply a wavelet transform to extract texture features. Based on the property mentioned in [JFS95], a query image can have a different size from that of a database image, which we exploit. In addition, by applying a wavelet transform, we obtain information about images at different scales (resolutions), from fine to coarse, at the same time. Our interpretation of the term (multi-)resolution is different from that of [JFS95]. In their case, resolution indicates the number of pixels in the image, and since the numbers of pixels of the query image and the database image do not necessarily have to be the same, they call their method multi-resolution image querying. In our method, since we extract the image features at different scales or levels of detail (based on the number of levels in the wavelet transform), we have a multi-resolution representation of the image. Having information about images at different scales (resolutions), we can control the desired level of accuracy in comparing images. For example, in
clustering images, where we roughly categorize images, we use only the coarse features, whereas in image retrieval, all the fine details of the images will be used. However, the method proposed in [JFS95] uses a distance metric with fixed accuracy. Moreover, their method directly compares the m largest wavelet coefficients of images one by one. Thus, if the image is translated, so are the corresponding wavelet coefficients, and the result of the comparison will be very different. In our method, we compare derived values (mean, variance, or number of edge pixels) from each of the corresponding frequency sub-bands, resulting in a comparison that is less sensitive to translation. The method by Jacobs et al. has the advantage of being fast. However, it has the common problem of nearest-neighbor retrieval approaches that rely only on the closeness of the feature vectors without considering the semantics of images. As we explained in the previous section, such methods sometimes may not provide satisfactory solutions. In our approach, we consider the semantics of images through clustering, which results in more effective retrieval. Feature extraction may be performed on either a whole image or the segments of images. We assume that images are decomposed into segments properly. Some decomposition approaches can be found in [SC94a, SC94b, RSZSM97, SRGZ97]. Upon the proper decomposition of images, the system extracts the features of all the segments of the images. The query image will be compared to the corresponding segments of database images of the proper size. In our implementation, we used nona-tree decomposition [RSZSM97, SRGZ97]. Nona-tree decomposition is a block-oriented approach in which images (and their segments) are hierarchically decomposed into nine quadrants. The nine overlapping quadrants are uniformly distributed segments. If there are k levels in the nona-tree, the sizes of its segments will be N², N²/2², N²/4², …, N²/2^(2k), where N² is the size of the original image.
A query image is compared to the smallest segments in the nona-tree whose size is greater than or equal to the size of the query image. If the size of the nona-tree's smallest segments is larger than that of the query image, the query image will not be compared to any of the segments. For example, if a nona-tree has segments of sizes 128×128, 64×64, and 32×32, a query image of size 64×64 will be compared to the 64×64 segments. A query image of size 60×60 will also be compared to the 64×64 segments, but a 24×24 query image will not be compared to any of the segments. In general, the segments generated by block-oriented decomposition methods such as the quad-tree, quin-tree, or nona-tree may not perfectly align with the portion that matches the query image. For example, in a quad-tree, in the worst case, only 25% of the matching segment will be covered. However, in [RSZSM97, SRGZ97], we proved that nona-tree segments cover at least 56% of the matching segment. On average, a larger portion of the matching segment will be covered, and our experiments showed that the nona-tree performs better than other block-oriented decomposition approaches [RSZSM97, SRGZ97].
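The segment sizes and the query-matching rule above can be sketched as follows. The helper names are hypothetical, and the matching rule is our reading of the text (a query smaller than the smallest segment is not compared at all; otherwise it goes to the smallest segment at least as large as itself):

```python
def segment_sides(N, k):
    """Side lengths of nona-tree segments for an N x N image with k levels:
    N, N/2, N/4, ..., N/2^k (so segment sizes N^2, N^2/2^2, ..., N^2/2^(2k))."""
    return [N // (2 ** level) for level in range(k + 1)]

def matching_segment_side(sides, q):
    """Side length of the segments a q x q query is compared against,
    or None when the query is smaller than the smallest segment."""
    if q < min(sides):
        return None
    candidates = [s for s in sides if s >= q]
    if not candidates:          # query larger than the whole image
        return None
    return min(candidates)
```

With segments of sides 128, 64, and 32, a 60×60 query matches the 64×64 segments, while a 24×24 query matches none, reproducing the example in the text.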
2.2 Wavelet Transform

The wavelet transform is a type of signal representation that can give the frequency content of a signal at a particular instant of time. In this context, one row/column of image pixels can be considered a signal. Applying a wavelet transform to such a signal decomposes it into different frequency sub-bands (for example, high-frequency and low-frequency sub-bands). Initially, regions of similar texture need to be separated out. This may be achieved by decomposing the image in the frequency domain into a full sub-band tree using filter banks [Vai93]. Each of the sub-bands obtained after filtering has uniform texture information. A filter bank based on wavelets can be used to decompose the image into low-pass and high-pass spatial-frequency bands [Mal89a]. We will now briefly review the wavelet-based multi-resolution decomposition. More details can be found in Mallat's paper [Mal89b]. To obtain a multi-resolution representation of signals, we can use a discrete wavelet transform. We can compute a coarser approximation of an input signal A0 by convolving it with the low-pass filter H̃ and down-sampling the signal by two [Mal89b]. By down-sampling, we mean skipping every other signal sample (for example, a pixel in an image). All the discrete approximations Aj, 1 ≤ j ≤ J (J is the maximum possible scale), can thus be computed from A0 by repeating this process. Scales become coarser with increasing j. Figure 3 illustrates the method.
Figure 3: Block diagram of the multi-resolution wavelet transform.

We can extract the difference of information between the approximations of the signal at scales j − 1 and j. Dj denotes this difference of information and is called the detail signal at scale j. We can compute the detail signal Dj by convolving Aj−1 with the high-pass filter G̃ and retaining every other sample of the output. The wavelet representation of a discrete signal A0 can therefore be computed by successively decomposing Aj into Aj+1 and Dj+1 for 0 ≤ j < J. This representation
provides information about the signal approximation and detail signals at different scales. We denote the wavelet representation of signal A0 after K levels as {A_K, D_K, D_{K−1}, …, D_1}, 1 ≤ K ≤ J. The idea of using the multi-resolution property of wavelets in clustering is to use the features of the wavelet coefficients at the coarse scale levels. Corresponding to the lowpass filter, there is a continuous-time scaling function φ(t), and corresponding to the highpass filter, there is a wavelet ω(t). The dilation equation produces φ(t), and the wavelet equation produces ω(t) [SN96]. For example, for the Haar wavelet transform with H̃ = [1/√2, 1/√2] and G̃ = [1/√2, −1/√2], the dilation equation is

φ(t) = φ(2t) + φ(2t − 1),  (1)

and the wavelet equation is

ω(t) = φ(2t) − φ(2t − 1).  (2)

Figure 4 shows the Haar wavelet ω(t).
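A minimal sketch of the one-dimensional multi-resolution decomposition with the Haar filters given above: each level filters the current approximation with the low- and high-pass filters and down-samples by two. The function names are ours, and the sign convention of the detail signal is one of the two usual choices.

```python
import math

def haar_step(signal):
    """One decomposition level: apply the Haar filters
    H~ = [1/sqrt(2), 1/sqrt(2)] and G~ = [1/sqrt(2), -1/sqrt(2)]
    to adjacent pairs and keep every other output (down-sampling by two)."""
    s = 1 / math.sqrt(2)
    pairs = list(zip(signal[::2], signal[1::2]))
    approx = [(a + b) * s for a, b in pairs]   # A_{j+1}
    detail = [(a - b) * s for a, b in pairs]   # D_{j+1}
    return approx, detail

def wavelet_representation(a0, K):
    """Return (A_K, [D_1, ..., D_K]) after K levels of decomposition."""
    details = []
    a = list(a0)
    for _ in range(K):
        a, d = haar_step(a)
        details.append(d)
    return a, details
```

Because the Haar filter pair is orthonormal, the total energy (sum of squared samples) of {A_K, D_K, …, D_1} equals that of A0, which gives a quick sanity check.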
Figure 4: The Haar wavelet ω(t).

We can easily generalize the wavelet model to two dimensions for images, by applying two separate one-dimensional transforms [HJS94]. The image is first filtered along the horizontal (x) dimension, resulting in a lowpass image L and a highpass image H. We then down-sample each of the filtered images in the x dimension by 2. Both L and H are then filtered along the vertical (y) dimension, resulting in four subimages: LL, LH, HL, and HH. Once again, we down-sample the subimages by 2, this time along the y dimension. The two-dimensional filtering decomposes an image into an average signal (LL) and three detail signals which are directionally sensitive: LH emphasizes the horizontal image features, HL the vertical features, and HH the diagonal features. Figure 5-a shows a sample airphoto image. Figures 5-b, c, and d show the wavelet representation of the image at three scales from fine to coarse. At each level, sub-band LL (the wavelet approximation of the original image) is shown in the upper left quadrant. Sub-band LH (horizontal edges) is shown in the upper right quadrant, sub-band HL (vertical edges) is displayed in the lower left quadrant, and sub-band HH (corners) is in the lower right quadrant. Our feature extraction and clustering methods can use any appropriate wavelet transform, such as the Haar, Daubechies, Cohen-Daubechies-Feauveau ((2,2) and (4,2)), or Gabor wavelet transforms [Vai93, SN96, URB97, MM96].

Figure 5: Multi-resolution wavelet representation of an airphoto image: a) original image; b) wavelet representation at scale 1; c) wavelet representation at scale 2; d) wavelet representation at scale 3.

Applying a wavelet transform to images results in wavelet coefficients corresponding to each sub-band. We can extract different features from the wavelet coefficients of each of these sub-bands. The next subsection explains the features that we used in the experiments.
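The separable two-dimensional decomposition into the LL, LH, HL, and HH sub-bands can be sketched with the Haar filters. This is an illustrative NumPy version under our own naming, not the authors' code; it assumes even image dimensions and one decomposition level.

```python
import numpy as np

def haar2d(img):
    """One level of the separable 2-D Haar transform: filter and
    down-sample along x, then along y, giving LL, LH, HL, HH."""
    s = 1 / np.sqrt(2)
    # Horizontal pass: lowpass L and highpass H, down-sampled along x.
    L = (img[:, ::2] + img[:, 1::2]) * s
    H = (img[:, ::2] - img[:, 1::2]) * s
    # Vertical pass on each result, down-sampled along y.
    LL = (L[::2, :] + L[1::2, :]) * s   # average signal
    LH = (L[::2, :] - L[1::2, :]) * s   # horizontal features
    HL = (H[::2, :] + H[1::2, :]) * s   # vertical features
    HH = (H[::2, :] - H[1::2, :]) * s   # diagonal features
    return LL, LH, HL, HH

# A constant image has zero detail sub-bands; LL carries the average.
flat = np.full((4, 4), 3.0)
LL, LH, HL, HH = haar2d(flat)
```

A horizontal intensity edge shows up in LH, and a vertical one in HL, matching the directional interpretation of the sub-bands given above.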
2.3 Image Features

In clustering database images, different kinds of features such as shape, color, texture, layout, and the position of various objects in images can be used. Texture in images has been recognized as an important aspect of human visual perception [TMY78, Har79, Joh94, RW96, WH90]. Tamura et al. [TMY78] have discussed various texture features including contrast, directionality, and coarseness. Contrast measures the vividness of the texture and is a function of the gray-level distribution. Some factors influencing the contrast are the dynamic range of gray-levels, the polarization of the distribution of black and white on the gray-level histogram, and the sharpness of edges. Directionality measures the "peakedness" of the distribution of gradient directions in the image. Coarseness measures the scale of texture: when two patterns differ only in scale, the magnified one is coarser. In a geographical image database, we classify images into residential, agriculture, water, and grass clusters. These clusters have different textures. For example, their contrast is usually different; or, the cluster of water images does not have any directionality, whereas residential images are considered directional images. These texture features of geographical images contain the main visual characteristics of images that can be utilized in geographical image clustering and retrieval. Applying a wavelet transform to images can provide information about the contrast of images. Also, the different sub-bands generated by the wavelet transform carry information about the horizontal, vertical, and diagonal edges in images [Mal89b], which helps in extracting features related to the directionality of images. Moreover, the multi-resolution aspect of the wavelet transform provides information about images at different scales (from coarse to fine). These properties make the wavelet transform a good
candidate for extracting texture features of geographical images. We calculate the mean and variance of the wavelet coefficients to represent the contrast of the image. We also count the number of edge pixels in the horizontal, vertical, and diagonal directions to approximate the directionality of the image. These features are extracted at different scales to exploit the multi-resolution property of wavelets. Edge pixels in the image have high amplitude in the detail signals (LH, HL, and HH); that is, the wavelet coefficients of such pixels have large positive or negative values. Let Image(i, j) be a pixel in an image of size n × m. Let W_LH(i, j), W_HL(i, j), and W_HH(i, j) be the corresponding wavelet coefficients of Image(i, j) for the LH, HL, and HH sub-bands, respectively. We define m_LH as the maximum of the absolute values of W_LH(i, j). That is,

m_LH = max{|W_LH(i, j)|}, 1 ≤ i ≤ n and 1 ≤ j ≤ m.  (3)

To detect horizontal edge pixels, we first compute m_LH. We consider Image(i, j) a horizontal edge pixel if

|W_LH(i, j)| ≥ α · m_LH,  (4)

where α is a factor that is determined by experiments. In our experiments, we chose α = 0.35. Similarly, we define

m_HL = max{|W_HL(i, j)|}, 1 ≤ i ≤ n and 1 ≤ j ≤ m,  (5)

and

m_HH = max{|W_HH(i, j)|}, 1 ≤ i ≤ n and 1 ≤ j ≤ m.  (6)

Image(i, j) is considered an edge pixel if

|W_HL(i, j)| ≥ α · m_HL  or  |W_HH(i, j)| ≥ α · m_HH.  (7)
Tables 1, 2, and 3 show the wavelet coefficients of Figure 5-a at scale 3 using the Haar wavelet transform. The corresponding edge pixels are shown in Figure 5-d.
Table 1: Wavelet coefficients W_LH of Figure 5-a at scale 3 (coefficient matrix omitted; m_LH = 394).

Table 2: Wavelet coefficients W_HL of Figure 5-a at scale 3 (coefficient matrix omitted; m_HL = 480).

Table 3: Wavelet coefficients W_HH of Figure 5-a at scale 3 (coefficient matrix omitted; m_HH = 228).
Figure 6: The seven sub-bands generated after applying two levels of the wavelet transform.

We tested our approach using two different sets of features. To obtain the features in feature set 1, we apply a two-level wavelet transform. For the sub-band LL of the coarsest scale of the wavelet transform, we take the mean and variance of the absolute values of the wavelet coefficients as its features, <mean, variance>. For the other sub-bands, in addition to the mean and variance, the system counts the number of edge pixels in the sub-band, so the feature vector for each of these sub-bands is <mean, variance, number of edge pixels>. In this feature set, we have seven sub-bands with 20 (2 + 6 × 3) features. Figure 6 shows the seven sub-bands of an image schematically. In clustering, we consider only the features of the sub-bands of the coarsest scale of the wavelet transform (11 features) to compute the distance. Table 4 shows the extracted feature set 1 for Figure 5-a.

Sub-band   Mean     Variance   Number of edge pixels
A2         561.68   189.90     -
D2_LH      32.94    36.16      114
D2_HL      36.46    43.57      91
D2_HH      15.94    17.95      96
D1_LH      13.44    15.96      265
D1_HL      14.14    17.45      272
D1_HH      5.81     6.82       159

Table 4: Feature set 1 of Figure 5-a.

In feature set 2, the two-level wavelet transform is applied, giving seven sub-bands. The feature vector for each sub-band is <mean, variance>. Each subsegment of the image has 14 features. To compute the distance between feature vectors in clustering, we consider only the features of the second scale of the wavelet transform (4 sub-bands, 8 features). We have extracted the feature sets using both Haar and Daubechies wavelet transforms [SN96]. Feature set 2 for Figure 5-a is shown in Table 5, where the Haar wavelet transform is applied.

Sub-band   Mean     Variance
A2         561.68   189.89
D2_LH      32.94    36.16
D2_HL      36.46    43.57
D2_HH      15.94    17.95
D1_LH      13.44    15.94
D1_HL      14.14    17.45
D1_HH      5.81     6.82

Table 5: Feature set 2 of Figure 5-a.

To give equal contributions to all the features in determining the similarity, and to remove the
arbitrary effects of different measurement units, we normalize the feature values and recast them in dimensionless units. To normalize the features, we first compute the mean μ_i and standard deviation σ_i of each feature f_i using the training samples. Our training samples include residential, agriculture, water, and grass areas. Let a feature vector be f = (f_1, f_2, ..., f_n), and its normalized feature vector be f̂ = (f̂_1, f̂_2, ..., f̂_n). We normalize f_i using Equation (8):

    f̂_i = (f_i − μ_i) / σ_i,   1 ≤ i ≤ n,        (8)

where f̂_i is the normalized value of f_i. Our experiments show that we should give a larger (controlled) contribution to the more important features. We can do so by assigning weights to the features and giving larger weights to the more important features. Let W = (w_1, w_2, ..., w_n) be a weight vector, where w_i is the weight assigned to feature f_i. The weighted feature vector f̃ of feature vector f is then:

    f̃ = (f̃_1, f̃_2, ..., f̃_n) = (w_1 f̂_1, w_2 f̂_2, ..., w_n f̂_n).        (9)
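The normalization and weighting steps of Equations (8) and (9) can be sketched as follows. This is a minimal illustration assuming features are held in plain Python lists; the weight values a caller would pass are illustrative, not the paper's trained parameters.

```python
def normalize_features(vectors):
    """Normalize each feature to zero mean and unit standard deviation
    (Equation 8), using the given vectors as the training samples."""
    n = len(vectors[0])
    means = [sum(v[i] for v in vectors) / len(vectors) for i in range(n)]
    stds = [
        # guard against zero variance: fall back to 1.0 so division is safe
        (sum((v[i] - means[i]) ** 2 for v in vectors) / len(vectors)) ** 0.5 or 1.0
        for i in range(n)
    ]
    return [[(v[i] - means[i]) / stds[i] for i in range(n)] for v in vectors]

def weight_features(normalized, weights):
    """Apply per-feature weights to a normalized vector (Equation 9)."""
    return [w * f for w, f in zip(weights, normalized)]
```

In practice the means and standard deviations would be computed once from the training samples and then reused to normalize every database and query image.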
3 Clustering and Retrieval

Applying the feature sets introduced in Section 2, we now present template-based image clustering and retrieval approaches. We first describe how cluster templates are chosen. Given the cluster templates and features, images in a database can be classified based on their similarity to the cluster templates. Based upon the resulting clusters, image retrieval can be efficiently supported.
3.1 Clustering - Related Work

Given a set of feature vectors, the feature space is usually not uniformly occupied. Clustering the data identifies the sparse and the dense regions, and hence discovers the overall distribution patterns of the feature vectors. Many existing clustering algorithms use k-medoid methods, in which each cluster is represented by a centrally located object, the medoid. For example, PAM (Partitioning Around Medoids) [KR90] was proposed to determine the most centrally located object (medoid) for each cluster. Each non-selected object is grouped with the medoid to which it is most similar. CLARA (Clustering LARge Applications) [KR90] draws a sample of the data set, applies PAM on the sample, and finds the medoids of the sample. Ng and Han introduced CLARANS (Clustering Large Applications based on RANdomized Search), an improved k-medoid method [NH94]. Ester et al. presented DBSCAN, a clustering algorithm relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shapes [EKSX96]. The key idea in DBSCAN is that for each point of a cluster, the neighborhood of a given radius must contain at least a minimum number of points, i.e., the density in the neighborhood has to exceed some threshold. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) incrementally and dynamically clusters incoming data points to produce the best clusters with the available resources (i.e., available memory and time constraints) [ZRL96]. Ma and Manjunath used a neural network to learn the similarity measure and partition the feature space into clusters representing visually similar patterns [MM95].

As mentioned in Section 1, retrieval based only on the closeness of feature vectors may not necessarily give satisfactory results, because semantically irrelevant images may have close feature vectors. Indexing trees such as the R-tree and its variants have been used successfully in the past and are very efficient in retrieving the nearest neighbors of the query image in the feature space. However, they retrieve all the nearest neighbors, whether or not they have the same semantics as the query image. Our goal in clustering is to group images together considering their semantics. In other words, we want to cluster images such that images with the same semantics belong to the same cluster. These semantic clusters can be used to narrow down the search to the regions of the feature space where images have semantics similar to that of the query image. Searching only the semantically relevant images makes the retrieval more effective, and also more efficient. Note that our proposed approach is not an alternative to the R-tree or its variants; these indexing trees can be integrated within each cluster to provide better efficiency, while the clusters preserve the similarity in semantics. The semantics of images is application dependent; for example, in a GIS visual database, we can consider residential, water, grass, and agriculture clusters.
3.2 Template-based Clustering

To support efficient content-based image retrieval, methods have been proposed to first cluster database images based on their similarity to predefined image templates, termed cluster templates or cluster icons [WAL+93, SZ97]. To retrieve database images which are similar to or contain a query image, the query image is first compared to the cluster templates. Only database images in the clusters corresponding to the chosen cluster templates are then searched. The cluster templates and their associated information (as will be explained later) approximate the clusters. An effective clustering of database images can greatly reduce the number of images to be searched, resulting in more efficient retrieval. Indexing techniques can be incorporated within each cluster to provide faster retrieval. R-trees could be used to find the templates representing the clusters, but since in practice we have high-dimensional feature vectors (20 features), R-trees might suffer from the "dimensionality curse" [BKK96]. Thus, we decided to apply hierarchical clustering to find the cluster templates. Using separate training data for each cluster, we classify images considering their semantics. The
Figure 7: a) arbitrary shaped clusters. b) circular sub-clusters of clusters.

type of semantic clusters is application dependent and should be determined by the application experts. For example, in a geographical image database, we can define the clusters residential, agriculture, water, grass, and a special cluster other. The cluster other contains all images which cannot be included in any of the residential, agriculture, water, or grass clusters. We represent clusters by cluster templates to organize database images. In the feature space, since semantically different image data may have their feature vectors close to each other, each cluster may have an arbitrary shape. Figure 7-a shows an example in a two-dimensional space where clusters have arbitrary shapes. Our proposed method is a hybrid (both supervised and unsupervised) approach. Using a supervised classification approach, we choose a set of training samples for each cluster. Then, using unsupervised classification, we find the sub-clusters within each cluster. The centroids of those sub-clusters are used as the cluster templates of the cluster. Figure 7-b shows such centroids within each cluster as black square dots. Cluster other should include all the objects in Figure 7-a shown by "*". In a hierarchical approach, sub-clusters are defined within each cluster at different levels. Suppose the training set of a cluster C_c has n objects. Initially, we assume there are n sub-clusters corresponding to the n objects in the cluster. We compute the distances between all pairs of sub-clusters. We then group together the two closest sub-clusters, resulting in n − 1 sub-clusters. We repeat the process of computing the distances between all pairs of current sub-clusters and grouping the two closest ones until only one sub-cluster remains. This hierarchically nested set of sub-clusters can be represented by a tree diagram, or dendrogram [Gor81]. Figure 8 shows a group of 5 objects and their corresponding dendrogram.
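The agglomerative grouping just described can be sketched as follows. This is a minimal illustration using the centroid distance and plain Python lists, a sketch of the technique rather than the paper's implementation; the returned merge history plays the role of the dendrogram.

```python
def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(objects):
    """Repeatedly merge the two sub-clusters with the closest centroids
    (the centroid method) until one sub-cluster remains; return the merge
    history, i.e. the dendrogram as (distance d, merged members) pairs."""
    clusters = [[o] for o in objects]
    merges = []
    while len(clusters) > 1:
        # find the pair of sub-clusters whose centroids are closest
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((d, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Cutting the resulting dendrogram at a chosen distance then amounts to keeping only the merges whose distance d falls below that cut.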
Each of the n leaves of the tree represents the single object whose label appears below it. Each position up the tree at which branches join has an associated numerical value d, which gives the distance between the two grouped sub-clusters. The smaller the value of d, the more similar the two sub-clusters. We consider the distance between the centroids of sub-clusters as the distance between the sub-clusters; we call this the centroid method. There are different methods to define the distance between sub-clusters. In the unweighted pair-group method using arithmetic averages (UPGMA), the distance between two sub-clusters is the arithmetic average of the distances between the objects
Figure 8: A group of objects and the corresponding dendrogram.

in one cluster and the objects in the other [Rom84]. In the single linkage clustering method (SLINK), the distance is the minimum of the distances between the objects in one cluster and the objects in the other (the distance between the two closest objects). The distance between sub-clusters in the complete linkage clustering method (CLINK) is defined as the maximum of the distances between objects in the two sub-clusters [Rom84]. Applying each of these methods may give a different dendrogram. We can cut the dendrogram of each cluster at different levels to obtain k sub-clusters (cluster templates). The choice of k and the corresponding cutting distance is one of the most difficult problems of cluster analysis, for which no unique solution exists. For example, if we cut the dendrogram in Figure 8-b where the distance between sub-clusters is d_1, we will have two sub-clusters, whereas if we cut at level d_2, we will have three sub-clusters. Let S = {e_1, e_2, ..., e_n} be the sub-cluster containing objects e_1, e_2, ..., e_n. If we cut a dendrogram at the level at which the distance between the last two grouped sub-clusters is d_i, then the maximum distance between the children of the remaining sub-clusters will be d_i. For example, in Figure 8-b, if we cut the dendrogram at level d_1, the distance between sub-clusters {1} and {2}, or {3} and {4}, or {3,4} and {5}, will be less than d_1.

A measure of the quality of a clustering can be obtained by computing the average silhouette width [KR90], s_k, of the n image samples for each value of k. Given an image i which is assigned to a cluster C_j, 1 ≤ j ≤ k, let a_i be the average distance of i to all other images in cluster C_j, and b_i be the average distance of i to all images in cluster C_h, where C_h is the nearest neighbor cluster of cluster C_j (h ≠ j). The silhouette value g_i of image i is calculated as follows:

    g_i = (b_i − a_i) / max{a_i, b_i},        (10)
    s_k = (1/n) Σ_{i=1}^{n} g_i,        (11)
where −1.0 ≤ g_i ≤ 1.0. The value of s_k can be used to select an optimal value of k such that s_k is as high as possible. The maximum value of s_k is known as the silhouette coefficient. In general, a silhouette coefficient of 0.70 or higher indicates a good clustering selection [KR90].

Assume that a cluster C_c consists of m sub-clusters S_c1, S_c2, ..., S_cm. In clustering database images, whenever a sub-cluster S_ci is chosen, the database images will be assigned to the corresponding cluster C_c. Similarly, in image retrieval, if a sub-cluster S_ci is chosen, only the corresponding cluster C_c will be searched. We also tested different methods of determining the distance between sub-clusters when building the dendrograms.
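The average silhouette width of Equations (10) and (11) can be computed as in the following sketch; the cluster assignment passed in, and the choice of the distance function `dist`, are illustrative assumptions.

```python
def silhouette_width(clusters, dist):
    """Average silhouette width s_k (Equations 10 and 11).
    `clusters` is a list of clusters, each a list of feature vectors;
    `dist` is a distance function on pairs of vectors."""
    g_values = []
    for j, cj in enumerate(clusters):
        for x in cj:
            # a_i: average distance to the other members of x's own cluster
            others = [y for y in cj if y is not x]
            a = sum(dist(x, y) for y in others) / len(others) if others else 0.0
            # b_i: average distance to the nearest neighboring cluster
            b = min(sum(dist(x, y) for y in ch) / len(ch)
                    for h, ch in enumerate(clusters) if h != j)
            # assumes max(a_i, b_i) > 0, i.e. not all vectors are identical
            g_values.append((b - a) / max(a, b))
    return sum(g_values) / len(g_values)
```

For well-separated clusters the returned value approaches 1, so candidate values of k can be compared by recomputing s_k for each cut of the dendrogram.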
3.3 Clustering

To cluster database images, we first compute the distance between the cluster feature vectors of the cluster templates and all database images. Using the multi-resolution property of the wavelet transform, we extract the image features at different scales. In comparing cluster feature vectors, only coarse-scale features are used, while in retrieving images all the features are compared. In Section 2.2, we used the notation {A_K, D_K, D_{K−1}, ..., D_1}, 1 ≤ K ≤ J, as the wavelet representation of signal A_0 after K levels of the wavelet transform. We define the clustering scale (K, c) as the number of coarsest scales that will be used in clustering after applying K levels of the wavelet transform. It means that in clustering only the features extracted from {A_K, D_K, D_{K−1}, ..., D_{K−c+1}} will be used, where 1 ≤ c < K ≤ J. For example, clustering scale (5, 2) means that the 5-level wavelet transform is applied and the features of the 2 coarsest scales ({A_5, D_5, D_4}) are used in clustering. We define the retrieval scale (K, r) as the number of scales that will be used in retrieval after applying K levels of the transform. In retrieving images, the features extracted from {A_K, D_K, D_{K−1}, ..., D_{K−r+1}} will be used, where 1 ≤ r ≤ K ≤ J, and usually r > c. In retrieving, we use all the features. The clustering scale and retrieval scale for both feature sets 1 and 2 are (2, 1) and (2, 2), respectively. Let X and Y be two undecomposed images (for example, a query image, a cluster template, or an undecomposed segment of a database image). Let CF_X = (x_1, x_2, ..., x_m) and CF_Y = (y_1, y_2, ..., y_m) be their cluster feature vectors extracted from the sub-bands {A_K, D_K, D_{K−1}, ..., D_{K−c+1}}, based on the clustering scale (K, c).
Let RF_X = (x_1, x_2, ..., x_n) and RF_Y = (y_1, y_2, ..., y_n) be the retrieval feature vectors of X and Y, respectively, extracted from the sub-bands {A_K, D_K, D_{K−1}, ..., D_{K−r+1}}, according to the retrieval scale (K, r). To measure the similarity between X and Y in clustering, we define the clustering distance d_c as follows:
    d_c(X, Y) = dist(CF_X, CF_Y) = sqrt( (1/m) Σ_{i=1}^{m} (x_i − y_i)² ),        (12)

where dist could be any appropriate distance function; here, we use the Euclidean distance for dist. To measure the similarity between X and Y in retrieval, we define the retrieval distance d_r as follows:

    d_r(X, Y) = dist(RF_X, RF_Y) = sqrt( (1/n) Σ_{i=1}^{n} (x_i − y_i)² ).        (13)
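Equations (12) and (13) share the same form and can be sketched as a single helper; representing the feature vectors as plain Python lists is an illustrative assumption.

```python
def rms_distance(fx, fy):
    """Root-mean-square (length-normalized Euclidean) distance between
    two feature vectors, as used for both the clustering distance d_c
    (Equation 12) and the retrieval distance d_r (Equation 13)."""
    assert len(fx) == len(fy)
    n = len(fx)
    return (sum((x - y) ** 2 for x, y in zip(fx, fy)) / n) ** 0.5
```

In use, d_c compares the m coarse-scale clustering features, while d_r compares the full n retrieval features; only the vectors passed in differ.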
A database image Y can be decomposed into a set of smaller segments {Y_1, Y_2, ..., Y_p}. To measure the distance between the database image Y and an undecomposed image X (a query image, or a cluster template) in clustering, we use D_c:

    D_c(X, Y) = min { d_c(X, Y_i) },  1 ≤ i ≤ p.        (14)
Similarly, in retrieval, we use D_r to compute the distance between the query image X and the database image Y:

    D_r(X, Y) = min { d_r(X, Y_i) },  1 ≤ i ≤ p.        (15)

In determining the correct cluster(s) for a database image, we use the distribution of objects (feature vectors) in each sub-cluster. For each sub-cluster we define a scope. Let μ_i and σ_i be the mean and standard deviation of the clustering distances d_c of objects to the centroid of sub-cluster i. The scope of sub-cluster i is the part of the feature space whose distance to the centroid of the sub-cluster is less than μ_i + ασ_i. The parameter α is a factor that is determined by experiments. In a two-dimensional space, the scope of a sub-cluster is a circle with radius μ_i + ασ_i. In our clustering approach, whenever a database image Y falls within the scope of any sub-cluster, it is assigned to the cluster corresponding to that sub-cluster. The reason is that a database image may contain the features of several different cluster templates, and it should be assigned to all the corresponding clusters. If the database image Y is not within the scope of any of the sub-clusters, but the distance D_c(t, Y) between the database image and the closest cluster template t is less than a threshold T, it will be assigned to the cluster represented by the closest cluster template. If there is no cluster template for which the distance between the database image and the cluster template is less than T, the database image will be assigned to the special cluster other. The cluster other contains all the database images that cannot be assigned to the predefined clusters. As an example, Figure 9 shows the scope of 3 sub-clusters A, B, and C in a two-dimensional space. In general, these sub-clusters do not necessarily belong to different clusters. The object denoted by point 1 falls in the scope of sub-clusters A and B and will be assigned to both.
Figure 9: Three sample sub-clusters.

However, some objects may not fall within any of the scopes considered above. For example, point 2 in Figure 9 is not in the scope of any of the sub-clusters. Our clustering approach assigns such an
object to the closest sub-cluster. Although the distance d_c between point 2 and the centroid of sub-cluster C is less than its distance to the centroids of sub-clusters A and B, based on the distribution of objects within the clusters it is more logical to assign object 2 to sub-cluster A. So we first compute the distance between the object and the border of each sub-cluster's scope, and then choose the closest sub-cluster. Hence, object 2 is assigned to sub-cluster A. The object shown as point 3 in Figure 9 is assigned to the cluster other, because it is far away from all clusters and its distance d_c to the closest cluster is greater than the threshold T. Algorithm 1 shows the steps in choosing the cluster(s) for database images. After generating the templates, for each image, the time complexity of choosing clusters (either for adding the image to the clusters or for retrieving images) is O(t), where t is the number of templates. The required time to assign all database images to the clusters is O(tN), where N is the number of images. Since the number of templates is generally far less than the number of images, that is t
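The assignment procedure described above (the paper's Algorithm 1, whose listing is not reproduced in this excerpt) can be sketched as follows. The scope-radius rule μ_i + ασ_i, the threshold T, and the fall-back to the closest scope border follow the text; the dictionary-based data structures and field names are illustrative assumptions.

```python
def assign_image(segment_features, subclusters, alpha, T, dist):
    """Choose semantic cluster label(s) for a database image given the
    feature vectors of its segments.

    `subclusters`: list of dicts with keys 'centroid' (template vector),
    'mu' and 'sigma' (mean / std of member distances to the centroid),
    and 'cluster' (semantic label such as 'residential' or 'water').
    """
    # D_c(template, Y): minimum clustering distance over Y's segments
    def D_c(sc):
        return min(dist(sc['centroid'], f) for f in segment_features)

    # 1. scope test: assign to every cluster whose sub-cluster scope
    #    (radius mu + alpha * sigma) contains the image
    labels = {sc['cluster'] for sc in subclusters
              if D_c(sc) < sc['mu'] + alpha * sc['sigma']}
    if labels:
        return labels
    # 2. otherwise pick the sub-cluster whose scope border is closest
    #    (distance to the border = D_c minus the scope radius)
    best = min(subclusters,
               key=lambda sc: D_c(sc) - (sc['mu'] + alpha * sc['sigma']))
    # 3. accept it only if the image is within threshold T of the
    #    template; otherwise assign to the special cluster "other"
    return {best['cluster']} if D_c(best) < T else {'other'}
```

Note that step 1 may return several labels at once, reflecting the observation that one database image can contain the features of several cluster templates.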