ICMCS99
Perceptually Based Metrics for the Evaluation of Textural Image Retrieval Methods Janet S. Payne Dept of Computing, Buckinghamshire Chilterns University College, High Wycombe, HP11 2JZ, UK email:
[email protected]
L. Hepplewhite and T. J. Stonham Dept of Electrical and Electronic Engineering, Brunel University, Uxbridge, Middlesex UB8 3PH, UK email:
[email protected]
0-7695-0253-9/99 $10.00 © 1999 IEEE

Abstract

Texture is widely used in CBIR, and there have been a number of studies over the years to establish which features are perceptually significant. However, it is still difficult to retrieve reliably images that the human user would agree are “similar”. This paper reviews a range of computational methods, and compares their performance in classifying and retrieving images from the Brodatz set. Their performance is then related to the combined ranking of “similar” images from the same dataset, obtained from experiments where human volunteers were asked to identify which images were most like each of the Brodatz images. The full set of 112 images was used. We conclude that no one method consistently returns retrievals which the human user would agree were similar across the full range of textures, but that statistical methods appear to perform better overall. We propose a subset of the Brodatz images for comparison of retrieval methods, based on the correlation between individuals' rankings.

1. Introduction

Texture is all around us. It is a low level vision process and consequently plays a major role in human vision. Given its importance to human vision, it is also of great significance to Content Based Image Retrieval (CBIR), and to the computer vision community in general. Not only has it received considerable attention in the computer vision literature, ranging in applications from medical diagnosis and industrial inspection to remote sensing, but it has also contributed extensively to the understanding of early human vision. Several studies have focused on relating computational measures of texture to human perceptions: we discuss some of them below. However, it is still not unusual to hear or read a statement like “the computer got 94% right, and the humans only correctly classified 78% of the textures”. For CBIR, who is the final judge - the human who requested the images, or the computer which has classified them? We compare the results of human matching of images for “most like” with those for ten computational methods.

2. Texture

Interest in visual texture was triggered by the phenomenon of texture segregation, which occurs when a shape is defined purely by its texture, with no associated change in colour or brightness: colour alone cannot distinguish between tigers and cheetahs! The phenomenon gives clear justification for texture features to be utilised in CBIR together with colour and shape. Inspired by the work of Julesz et al. [1], statistical measures were among the first texture methods proposed in the computer vision literature. Haralick et al. [2] proposed the use of Grey Level Co-occurrence Matrices (GLCM) to capture second order properties of a texture. A subset of five of these features is commonly used, to reduce dimensionality [3]. In an attempt to overcome the dependence of the GLCM methods on intensity level, Wang and He proposed the Texture Unit and Texture Spectrum (TUTS), which characterises the local texture aspect of each pixel by texture units, and the image by the distribution of these units, to form the texture spectrum [4]. The Binary Texture Co-occurrence Spectrum (BTCS) of Patel and Stonham [5] similarly makes use of local texture and overall distribution, but uses oriented n-pixel operators. It is however dependent upon the thresholding used, and the Grey Level Texture Co-occurrence Spectrum (GLTCS) was developed to extend it [6]. A further improvement, Sequential Ranking (SRank), is discussed in [7]. Directionality and coarseness are also of importance in texture analysis, as found for example in studies of the human visual system [8], and of the visual cortex of the macaque monkey [9]. Several authors have used ring shaped regions to measure the spectral energy in different frequency bands, and wedge shaped regions to measure the energy in various orientations [10]. Shape measures have been used to describe the Fourier Spectrum; for example, Liu and Jernigan [11]
propose descriptions of prominent peaks in the spectrum, which can give the principal direction and fundamental spatial period of the texture pattern. Caelli [12] proposed a model in which texture segregation involves three stages: decomposition by spatial filters, a ‘filling-in’ process, and the grouping of similar regions. A popular, empirically justified set of filters was suggested by Laws [13]. Laws termed his measures “Texture Energy” due to their resemblance to an energy computation. Daugman, realising that texture perception is largely scale invariant, proposed the use of 2-D Gabor wavelet-like basis functions [14]. These have been shown to correlate well with those found in the macaque monkey cortex [9], which justifies their use in texture methods [15].

3. Texture Classification

Although a plethora of methods have been evaluated in the literature for texture classification tasks, the difficulty comes in making any comparison between methods. This is seen in the difficulty review-type articles, such as [16] and [17], have in drawing any conclusions, primarily because methods are commonly evaluated against different datasets. Hence, in this section a comparison of ten texture methods, chosen to represent a cross-section of techniques, is presented using the commonly accepted benchmark dataset, the Brodatz album [18]. There is one obvious omission: the model based MRF or SAR methods. These have been omitted from this comparison due to their high computational complexity, as shown for example by the slow execution of a SAR method when compared with the relatively complex Gabor based method in [19]. Although many of the methods have already been evaluated against this dataset, there is still variation in implementation details. Since the Brodatz album is a printed text, it requires digitisation for use in texture analysis; hence, variation remains in lighting, resolution, scanning equipment etc. The following ten methods were implemented as in the relevant reference; any specific parameters are detailed below.
- Haralick: implemented using the matrix itself as the feature vector. In order to keep the dimensionality usable, the number of grey levels in the texture image has been reduced to 16 and a single displacement vector used [2].
- GLCM: as above, but with 64 levels of intensity. However, this time only the commonly cited matrix features of energy, entropy, correlation, homogeneity and inertia were extracted, using four displacement vectors [3].
- TUTS: implemented as local binary patterns, reducing the feature space dimensionality to N=256, from N=6561 [4].
- BTCS: with n-tuple size of n=4, interpixel spacing t=1, and a global threshold level [5].
- GLTCS: with n=4 and t=1 as in BTCS above [6].
- SRank: n=4 and t=1 as in GLTCS and BTCS, but with a “roughly equal to” band of ±5 levels [7].
- R&W: implemented as in [10] with four ring features and four wedge features.
- LSF: Liu's spectral features [11], implementing six computationally efficient and optimal features.
- LTE: Laws texture energy method using nine 3 x 3 masks and a 5 x 5 moving window standard deviation estimate [13].
- Gabor: Gabor filter energy with four orientations and up to four scales depending on the window size. The features extracted are quadrature filter pair mean and standard deviation of energy [15].

3.1 Retrieval performance

We have benchmarked the above methods for the texture retrieval task, as designed by Manjunath and Ma [19]. In this test, the image database consists of the 9 non-overlapping examples of all 112 Brodatz textures (i.e. 1008 images). The nearest matches in the database are selected as those with the lowest distance to the query in the Euclidean sense (i.e. k=1 in a k-th nearest neighbour classifier). Retrieval is performed by presenting each of the 1008 samples in turn as a query. Performance is then calculated based on what percentage of the other 8 samples of that particular texture is retrieved by the system. Figure 1 plots this performance as a percentage against the number of samples retrieved in total. As expected, the retrieval performance increases as the number of retrievals is increased, since there is more chance of retrieving all 8 samples in the first 40 retrievals than in the first 10.

Figure 1. Retrieval performance for ten methods

It should also be noted that not all images are homogeneous. The statistical methods SRank and TUTS perform best, with LSF showing the poorest result. Given that the Brodatz dataset contains many high frequency images, this is not surprising.
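The benchmark just described can be sketched in a few lines of Python. The feature extraction itself is not shown, and the `features` mapping from (texture, sample) pairs to feature vectors is an assumed input; the function and variable names are ours, not the authors':

```python
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieval_performance(features, num_retrievals):
    """Average percentage of the other 8 same-texture samples found among
    the `num_retrievals` nearest neighbours of each query, as in [19].
    `features` maps (texture_id, sample_id) -> feature vector."""
    ids = list(features)
    total = 0.0
    for query in ids:
        # rank all other samples by distance to the query
        others = sorted((i for i in ids if i != query),
                        key=lambda i: euclidean(features[i], features[query]))
        retrieved = others[:num_retrievals]
        hits = sum(1 for tex, _ in retrieved if tex == query[0])
        total += hits / 8.0   # 8 other samples of the query's texture
    return 100.0 * total / len(ids)
```

Averaging over every sample as a query, rather than a fixed query set, is what makes the curve in Figure 1 rise smoothly as `num_retrievals` grows.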
4. Texture and human perceptions

Several studies have focused on relating computational measures of texture to human perceptions: some of these are discussed below. For textures, the Brodatz album is a
well-established standard, and all of these studies used some or all of its images. Using the Brodatz dataset in this way has a long tradition: following the early work by Julesz and Beck, a 1978 study by Tamura, Mori and Yamawaki [20] chose to work with naturally-occurring textures instead of computer-generated patterns, and selected the Brodatz photographic album. They decided on six features to cover the whole album: coarseness, contrast, directionality, line- or blob-like, regularity, and roughness, but concluded that only the first three were straightforward to compute, and matched the results of their human study on 16 of the textures. They suggested that humans use multiple cues, rather than “simple combinations of our features”. Amadasun and King [21] define seven textural features: coarseness, contrast, complexity, busyness, shape, directionality and texture strength, and compare the computational with human ranking for ten images from the Brodatz textures. They state that there was good correspondence between the two. This may have been affected by the small size of the dataset. After discussing earlier work, they make the comment “A major disadvantage of almost all these approaches is that they do not have general applicability. The human perception mechanism, in comparison, seems to work well for almost all types of textures.” More recently, Rao and various colleagues [22] have been developing a Texture Naming System, based on unsupervised human classification of a set of 56 Brodatz images, and matching with texture names. They suggest the classification can be interpreted as three dimensions: repetitive versus non-repetitive, non-directional versus directional, and complexity, but caution that the names given are subjective. There is some agreement between most researchers about the main categories of texture classification, but they also note that humans tend to combine rather than use one single method.

It would be very useful to have a method which produces the same or similar results to human perceptions, but this has so far not been achieved, and may not be possible.

5. Evaluating retrieval performance

A number of measures have been proposed in the literature for retrieval: these include retrieval rate, missed target and average rank of retrieved image [23]. Retrieval rate and missed target basically measure the same thing: the number of images in the retrieved set that are present (or absent, for missed target) in the comparison set. For the case of four retrieved images, retrieval rate can range from 0 (none that match) to 1 (all four match, in any order, in any position in the comparison set). Missed target, similarly, can range from 4 (no matches at all) to 0 (all four present in the comparison set). However, the order or ranking of retrieved images may well be significant to the user. In our experiments, for example, the overall first choice often had two or three times the “vote” of the overall second choice [24,25]. Measures that take some account of the ordering should therefore be appropriate. AVRR is one such measure: as its name implies, it is calculated from the average rank of the relevant images in the retrieved set. Other, more generally applicable statistical techniques, such as Spearman's or Kendall's correlation coefficients, have also been used [e.g. in 20]; these provide more information about the ordering of relevant images within the retrieved set. We have used Kendall's tau coefficient of rank correlation [26], which ranges from -1 (inversely correlated), through 0 (no correlation), to +1 (perfect correlation). For the number of retrievals used in our experiments, the (one-tailed) 1% significance level is 0.8, and the 5% significance level is 0.6.

6. Results from our human study

We have carried out two sets of human studies, where volunteers were asked to select which images, in order, they considered to be most like a given target image. The full Brodatz dataset of 112 images was used in each case, but taking only one sample from each image. The initial study used four printed sheets, so that each person could view all the images at once. However, due to the lengthy nature of this process, only six people took part in this first stage. The results are reported in [24], and were used to produce a computer-based version, shown in Figure 2.

Figure 2. “Pick-a-pit” screen shot

For this, all 112 images are again used, in a randomised order. Each one is shown in turn as the centre image in three rows of five, and the volunteer is asked to select up to four images which they consider most like this target image. They are instructed to do this “as quickly as possible”, and most people have taken about 40 to 45 minutes to work through the full set. So far, 24 people have taken part in this second stage, and the selections made by each person have been used to calculate an overall ranking [25].
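The set-overlap measures of Section 5 (retrieval rate and missed target [23]) are simple to state in code. A minimal sketch, with helper names of our own choosing; the example values use, for illustration, the D107 retrievals shown later in Figures 4 and 5:

```python
def retrieval_rate(retrieved, comparison):
    """Fraction of retrieved images that appear anywhere in the
    comparison set: 0 (none match) to 1 (all match, in any order)."""
    return sum(1 for img in retrieved if img in comparison) / len(retrieved)

def missed_target(retrieved, comparison):
    """Number of comparison-set images absent from the retrieved set."""
    return sum(1 for img in comparison if img not in retrieved)

# SRank's four retrievals for D107 versus the three top human choices:
print(retrieval_rate(["D109", "D39", "D108", "D101"],
                     ["D108", "D109", "D87"]))   # 0.5 (D109 and D108 match)
print(missed_target(["D109", "D39", "D108", "D101"],
                    ["D108", "D109", "D87"]))    # 1   (D87 is missed)
```

As the text notes, both measures ignore ordering entirely, which is why a rank correlation such as Kendall's tau is used in the comparisons that follow.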
6.1 Images with significant correlation

The results from this second study have been used to select a subset of the Brodatz dataset where there was significant agreement between the ranking selected by each individual and the overall ranking: that is, only images where half or more of the group had a significant correlation of τ > 0.6 with the overall ranking. This subset contained 84 of the Brodatz images, excluding images such as D37 and D48, which are unlike any other images in the set, and others such as D71 and D86, for which people found it hard to agree which other images were most similar.

Figure 3. Examples of images excluded: D37, D48, D71, D86

The selected images were: D2, D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D15, D16, D17, D18, D19, D22, D23, D25, D26, D27, D28, D29, D30, D31, D32, D33, D34, D35, D36, D38, D40, D41, D42, D43, D44, D45, D46, D50, D51, D56, D57, D58, D59, D60, D62, D63, D64, D65, D66, D68, D69, D70, D73, D74, D75, D77, D78, D79, D81, D85, D87, D88, D89, D90, D91, D92, D93, D94, D95, D96, D98, D99, D101, D102, D103, D104, D105, D106, D107, D108, D109, D110, and D111.

Having obtained this agreed-upon set of retrievals, we then compared these with the results obtained by the ten computer methods outlined above, using Kendall's correlation coefficient.

6.2 Comparison with computational methods

The average results across the full Brodatz set, using Kendall's tau to correlate each method with the combined results from the human study, ranged from 0.4772 (Haralick) to 0.5779 (SRank), with most methods achieving around 0.53 to 0.55. Since the 5% significance level is 0.6, as described above, the results of these computer methods are not very impressive. Using just the subset selected by the human study produced a slight improvement in most cases, with tau now ranging from 0.4758 (Haralick, again) to 0.5809 (SRank), with most others around 0.55 to 0.57. However, as has already been noted, we would not expect any one method to perform equally well on all types of textures. It is hoped that further work will allow us to refine these results. For most of the computer methods used in this paper, there are a few images in each case where the correlation coefficient is greater than 0.98. Two of the statistical techniques, BTCS and SRank, both show high agreement with the humans' ranking for image D4. GLTCS, similarly, matches with D5 and D62, and GLCM with D36. Haralick only matches with D103, and none of its other correlations are very significant. The other statistical technique, TUTS, appears to classify brick walls well, matching D25, D26 and D94. The Fourier based method of Liu strongly matches with D104, D74, D3 and D22, but performs much less well for all other images. These all contain significant high frequency components. Ring & Wedge matches for D26 (brick wall) and D81, and performs significantly well over a range of other images. The spatial filtering approach of Laws produced no highly significant correlations, while the spatial/spatial-frequency Gabor method matched with D81, another image with high frequency components. One example will be given of how the retrievals obtained using some of the computational methods compare with the perceptual rankings. Figure 4 shows D107, and the first three, in order, of the images considered most similar, from the overall human rankings. Figure 5 shows the corresponding first four retrievals for SRank, with τ = 0.5908; for Gabor, with the same value of τ; and for one of the poorly correlated methods for this image, LTE, with τ = 0.3743.

Figure 4. “Most like” rankings for D107: D107 (target), then D108, D109, D87

Figure 5. Retrievals for SRank (D109, D39, D108, D101), Gabor (D108, D45, D87, D111), and LTE (D41, D101, D111, D109)
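The τ values quoted above can be checked, in outline, with a plain tau-a computation. This sketch assumes tie-free rankings of the same items; the paper's exact treatment of partial, four-image selections is not specified, so the routine is illustrative rather than a reproduction of the authors' procedure:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two tie-free rankings (lists in rank
    order) of the same items: +1 identical, 0 unrelated, -1 reversed."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # a pair is concordant when both rankings order it the same way
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Against the one-tailed thresholds given in Section 5, a method's ranking for an image would be judged significant at the 5% level when this coefficient exceeds 0.6, and at the 1% level when it exceeds 0.8.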
Extending the range of tau being considered, Table 1 shows the performance of each method cumulatively for decreasing values of the correlation coefficient, for the selected subset of 84 images. The last row shows the percentage of rankings for each method that correlate with the overall humans' ranking with τ > 0.5.

Table 1. Number of similar rankings with τ > 0.5, out of the 84-image subset, for the ten methods: 62, 59, 58, 56, 55, 54, 52, 49, 48, 41 (as %: 74, 70, 69, 67, 65, 64, 62, 58, 57, 49)

As can be seen from the table, the statistical methods generally match more closely to the human ranking of retrieved images, with the exception of the original Haralick GLCM method.

7. Conclusions

We have reviewed a range of textural methods, and compared their performance for classification and retrieval tasks using the Brodatz dataset. However, we note that such measures may not correspond particularly well with perceptual similarity, and we outline some results from an experiment which asked volunteers to select, in order, up to four “most like” images for each of the full set of 112 images in the Brodatz album. Using the combined rankings obtained, we propose a subset which can be used for evaluating and comparing the retrieval performance of computational methods. Our results show that in general statistical methods (excluding the original GLCM) correspond more closely to perceptual similarity over this subset, which represents images where there was significant agreement. These images tend either to show regularity, such as the various examples of brick walls at various scales, or reptile skin, or to be “disordered”, in Rao's classification, such as D4 or D2. The proposed subset of 84 images may be used to evaluate the retrieval performance of computational measures.

References

[1] B. Julesz, Textons, the elements of texture perception and their interactions, Nature, vol 290, pp 91-97, 1981.
[2] R.M. Haralick, K. Shanmugam and I. Dinstein, Textural features for image classification, IEEE Trans. SMC, vol 3 no 6, pp 610-621, 1973.
[3] R.W. Conners and C.A. Harlow, A theoretical comparison of texture algorithms, IEEE Trans. PAMI, vol 2 no 3, pp 204-222, 1980.
[4] D.C. He and L. Wang, Texture unit, texture spectrum and texture analysis, IEEE Trans. Geoscience and Remote Sensing, vol 28 no 4, pp 509-512, 1990.
[5] D. Patel and T.J. Stonham, Low level image segmentation via texture segmentation, Proc. SPIE Visual Comms. and Image Processing, vol 1606, p 621, 1991.
[6] D. Patel and T.J. Stonham, Unsupervised / supervised texture segmentation and its application to real world data, Proc. SPIE Visual Comms. and Image Processing, vol 1818, 1992.
[7] L. Hepplewhite, T.J. Stonham and R.J. Glover, Automated visual inspection of magnetic disk media, Proc. of 3rd ICECS, vol 2, pp 732-735, 1996.
[8] F.W. Campbell and J.G. Robson, Application of Fourier analysis to the visibility of gratings, Journal of Physiology, vol 197, pp 551-566, 1968.
[9] R.L. DeValois, D.G. Albrecht and L.G. Thorell, Spatial-frequency selectivity of cells in macaque visual cortex, Vision Research, vol 22, pp 545-559, 1982.
[10] J.S. Weszka, C.R. Dyer and A. Rosenfeld, A comparative study of texture measures for terrain classification, IEEE Trans. SMC, vol 6 no 4, pp 269-285, 1976.
[11] S.S. Liu and M.E. Jernigan, Texture analysis and discrimination in additive noise, CVGIP, vol 49, pp 52-67, 1990.
[12] T.M. Caelli, An adaptive computational model for texture segmentation, IEEE Trans. SMC, vol 18 no 1, pp 9-17, 1988.
[13] K.I. Laws, Texture image segmentation, PhD thesis, University of Southern California, 1980.
[14] J.G. Daugman and D.M. Kammen, Image statistics, gases, and visual neural primitives, Proc. of IEEE ICNN, vol 4, pp 163-175, 1987.
[15] A.K. Jain and F. Farrokhnia, Unsupervised texture segmentation using Gabor filters, Pattern Recognition, 24(12): 1167-1186, 1991.
[16] L. Van Gool, P. Dewaele and A. Oosterlinck, Survey: texture analysis anno 1983, CVGIP, vol 29, pp 336-357, 1985.
[17] T.R. Reed and J.M. Hans du Buf, A review of recent texture segmentation and feature extraction techniques, CVGIP, vol 57 no 3, pp 359-372, 1993.
[18] P. Brodatz, Textures - a photographic album for artists & designers, Dover, New York, 1966.
[19] B.S. Manjunath and W.Y. Ma, Texture features for browsing and retrieval of image data, Tech Report, UCSB, 1995.
[20] H. Tamura, S. Mori and T. Yamawaki, Textural features corresponding to visual perception, IEEE Trans. SMC, vol 8 no 6, pp 460-473, 1978.
[21] M. Amadasun and R. King, Textural features corresponding to textural properties, IEEE Trans. SMC, vol 19 no 5, pp 1264-1274, 1989.
[22] A.R. Rao, N. Bhushan and G.L. Lohse, The relationship between texture terms and texture images: a study in human texture perception, Proc. SPIE, vol 2670, pp 206-214, 1996.
[23] C. Faloutsos, M. Flickner, D. Petkovic, W. Niblack, W. Equitz and R. Barber, Efficient and effective querying by image content, Tech Report, IBM, 1993.
[24] J.S. Payne and L. Hepplewhite, Texture similarity: using human studies to interpret retrieval results, Technology Letters, vol 2 no 2, pp 30-36, 1998.
[25] J.S. Payne, L. Hepplewhite and T.J. Stonham, Evaluating content-based image retrieval techniques using perceptually based metrics, to appear in: Proc. SPIE Electronic Imaging 99, 1999.
[26] M. Kendall and J. Dickinson Gibbons, Rank Correlation Methods, 5th ed, Edward Arnold, London, 1990.