Compact combination of MPEG-7 color and texture ... - IEEE Xplore

COMPACT COMBINATION OF MPEG-7 COLOR AND TEXTURE DESCRIPTORS FOR IMAGE RETRIEVAL

Ramprasath Dorairaj and Kamesh R. Namuduri
Department of Electrical and Computer Engineering, Wichita State University, Wichita, KS 67260, USA
Email: {rxdorairaj, kamesh.namuduri}@wichita.edu

Abstract: Content based image retrieval (CBIR) systems index and retrieve multimedia databases in a meaningful way. The MPEG-7 standard describes visual descriptors and performance metrics for CBIR systems. Among image features, color and texture are the more visually expressive and are therefore attractive for image retrieval. Efficient descriptors enhance both image retrieval and the efficiency of a multimedia database. Further, combining features makes image retrieval more relevant and robust. This paper proposes efficient methods for compactly representing color and texture features and combining them for image retrieval. The proposed methods use two MPEG-7 visual descriptors: the scalable color descriptor (SCD) for color and the homogeneous texture descriptor (HTD) for texture. The paper investigates how compactly a descriptor can be defined while still serving as a metric for image retrieval. A descriptor that is compact and still retrieves images meaningfully is computationally efficient. Image classification, clustering and hierarchical modeling of image databases will benefit from such a representation. The performance of retrieval based on the compact descriptors obtained with the proposed techniques is analyzed with MPEG-7 metrics.

I. INTRODUCTION

The volume of multimedia databases has increased phenomenally in the past few years, since MPEG-1 standardized the storage of multimedia data. This was followed by the standardization of compression by MPEG-4, which enhanced the storage efficiency of multimedia data. Multimedia databases were indexed and retrieved by various content based indexing and retrieval algorithms until MPEG-7 [2], [3] standardized the content description of a multimedia database. The MPEG-7 standard defines several visual descriptors, and the discussion in this paper is restricted to them. MPEG-7 visual descriptors do not represent every minute detail of a selected image feature, but they do describe the content of the multimedia data. These descriptors can therefore classify images in response to an incoming query, which makes them suitable for efficient and fast content based image retrieval (CBIR) [10], [12]–[14], searching and filtering applications.

This paper investigates two visual descriptors, the scalable color descriptor and the homogeneous texture descriptor. It investigates how coarsely a descriptor can be quantized and still be used as a metric for image retrieval, and it proposes techniques that represent these descriptors succinctly while retaining their discriminating ability. If a descriptor can be represented compactly and still support meaningful image retrieval, then it is computationally more effective. Visualization, clustering and hierarchical modelling of image databases can be done more efficiently with this representation. The performance of retrieval is measured using the performance metrics defined in MPEG-7 [1].

The paper is organized as follows. Section II describes the MPEG-7 visual descriptors SCD [1] and HTD [4]. Section III presents new methods for the compact representation of color and texture for image retrieval, together with a performance analysis of the proposed methods. Section IV explains the method for combining these compact descriptors for image retrieval, the testing procedures and the results, and Section V presents the conclusions.

0-7803-8622-1/04/$20.00 ©2004 IEEE

II. MPEG-7 VISUAL DESCRIPTORS

MPEG-7 defines a multimedia content description interface for information retrieval. It supports the concept of multimedia access anytime and anywhere. MPEG-7 specifies a standardized description of various types of multimedia information. This description is associated with the content itself, to allow fast and efficient searching for material that a user may be interested in. MPEG-7 defines several descriptors for color, texture, shape, audio, video and so on. This paper addresses only the descriptors concerned with color and texture.

Color descriptors are defined in various domains while keeping the number of possible variants to a minimum. This allows interoperability between differently generated MPEG-7 color descriptions. In general, color descriptors are histogram descriptors, dominant color descriptors or color layout descriptors. The descriptors defined by MPEG-7 include the scalable color, dominant color, color structure and color layout descriptors. The color descriptor of interest here is the scalable color descriptor, which is explained in detail in Section II-A.

MPEG-7 defines three texture descriptors: the texture browsing descriptor, the homogeneous texture descriptor (HTD) and the local edge histogram descriptor. The texture browsing descriptor characterizes perceptual attributes such as the directionality, coarseness and regularity of a texture, while the homogeneous texture descriptor provides a quantitative characterization of homogeneous texture regions for similarity retrieval.
HTD is based on local spatial-frequency statistics of the texture. The algorithm for obtaining an HTD, as given in the MPEG-7 standard, is explained in Section II-B.

A. Scalable color descriptor

The scalable color descriptor is a color histogram descriptor. It is computed in the HSV color space, which is uniformly quantized into 256 bins: 16 levels of H, four levels of S and four levels of V. Resolutions range from 16 bits/histogram to 1000 bits/histogram. The histogram values are first truncated to an 11-bit integer representation and then non-linearly mapped to a 4-bit representation. This mapping gives higher significance to the small values, which occur with high probability. If a histogram were expressed as 256 bins of 4 bits each, it would need a total of 1024 bits/histogram. To reduce the size of the descriptor, the Haar transform is used. The Haar transform applies sum and difference operations to adjacent bin values. The sums give the low-pass coefficients, while the differences give the high-pass coefficients. The high-pass coefficients are used for compact representation. After these coefficients are computed, further subsets are calculated iteratively from the source histogram until a level of 16 bins is reached, corresponding to 4 levels of H, 2 levels of S and 2 levels of V.

B. Homogeneous texture descriptor

Texture is described with the homogeneous texture descriptor. The MPEG-7 homogeneous texture descriptor splits an image into 30 channels in the frequency domain, with equal divisions in the angular direction and octave divisions in the radial direction. The feature channels are modelled using 2D Gabor functions [5], which are modulated Gaussians. The Fourier transform of a 2D Gabor function can be written in polar coordinates as

$$G_{P_{s,r}}(\omega_s, \theta) = \exp\left[\frac{-(\omega - \omega_s)^2}{2\sigma_{\omega_s}^2}\right] \cdot \exp\left[\frac{-(\theta - \theta_r)^2}{2\sigma_{\theta_r}^2}\right]$$

where $G_{P_{s,r}}(\omega_s, \theta)$ is the Gabor function at the $s$th radial index and $r$th angular index, with $r \in \{0,1,2,3,4,5\}$ and $s \in \{0,1,2,3,4\}$, and

$$\omega_0 = 3/4, \qquad \omega_s = \omega_0 \cdot 2^{-s}$$
$$B_s = (1/2)^{s+1} \quad \text{(octave bandwidth)}$$
$$\sigma_{\omega_s} = \frac{B_s}{2\sqrt{2\ln 2}} \quad \text{(standard deviation in radial direction)}$$
$$\sigma_{\theta_r} = \frac{15^\circ}{2\sqrt{2\ln 2}} \quad \text{(standard deviation in angular direction)}$$

The homogeneous texture descriptor characterizes texture using the mean ($f_{DC}$), standard deviation ($f_{SD}$), energy ($e_i$) and energy deviation ($d_i$) values. The energy and energy deviation values are calculated for every channel. Energy deviation values are considered only in the enhanced layer representation. The proposed system includes the energy deviation values. The HTD for a single image is thus a vector of 62 values, represented as

$$HTD = [f_{DC}, f_{SD}, e_1, e_2, \ldots, e_{30}, d_1, d_2, \ldots, d_{30}]$$
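The channel layout defined above can be tabulated in a few lines. The following is a minimal Python sketch, not the reference MPEG-7 implementation: the function names are hypothetical, and the angular centre $\theta_r = 30^\circ \cdot r$ is an assumption, since the text gives only the angular standard deviation. Angular wrap-around is ignored for brevity.

```python
import math

OMEGA_0 = 3.0 / 4.0                                   # highest centre frequency (from the text)
FWHM_FACTOR = 2.0 * math.sqrt(2.0 * math.log(2.0))    # the 2*sqrt(2 ln 2) term


def channel_params(s, r):
    """Return (omega_s, theta_r, sigma_omega, sigma_theta) for channel (s, r)."""
    if not (0 <= s <= 4 and 0 <= r <= 5):
        raise ValueError("s must be in 0..4 and r in 0..5")
    omega_s = OMEGA_0 * 2.0 ** (-s)        # centre frequency halves with each octave
    b_s = 0.5 ** (s + 1)                   # octave bandwidth B_s
    sigma_omega = b_s / FWHM_FACTOR
    theta_r = math.radians(30.0 * r)       # ASSUMPTION: 30 degree angular spacing
    sigma_theta = math.radians(15.0) / FWHM_FACTOR
    return omega_s, theta_r, sigma_omega, sigma_theta


def gabor_response(omega, theta, s, r):
    """Evaluate the polar-frequency Gabor function of channel (s, r)."""
    omega_s, theta_r, sigma_omega, sigma_theta = channel_params(s, r)
    radial = math.exp(-((omega - omega_s) ** 2) / (2.0 * sigma_omega ** 2))
    angular = math.exp(-((theta - theta_r) ** 2) / (2.0 * sigma_theta ** 2))
    return radial * angular


# 5 radial x 6 angular divisions give the 30 channels of the descriptor.
channels = [(s, r) for s in range(5) for r in range(6)]
```

With these definitions, each channel responds with 1 at its centre and with one half at a radial offset of $B_s/2$, which is what the $2\sqrt{2\ln 2}$ factor in the standard deviations encodes.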

III. COMPACT REPRESENTATION OF FEATURE DESCRIPTORS

In this section, we propose several methods for expressing feature vectors as compact descriptors. Color is represented by the compact color descriptor number (CCD), and texture by the compact texture descriptor number (CTD). These descriptors express a feature as a scalar instead of a vector. A descriptor has better discriminating ability if it groups images with similar ground truth information and at the same time differentiates images with different ground truth information. The discriminating ability of the compact descriptors is analyzed graphically by plotting them, as shown in Figure 1. The interpretation of these plots is explained in Section III-C.

A. Compact representation of color descriptor

Color information is initially represented by the scalable color descriptor, which has 32 Haar coefficients: 16 low-pass and 16 high-pass. Considering only the sign bits of the high-pass coefficients gives a very compact representation [1]. The compact descriptor number for color is therefore a 16-bit binary number that holds the sign information of the high-pass Haar coefficients. Each bit is set to 0 if the corresponding coefficient is positive and to 1 otherwise. This 16-bit binary number is converted to a single decimal number, which is used as the compact color descriptor number (CCD).

B. Compact representation of texture descriptor

Image texture is described by the homogeneous texture descriptor, a vector of 62 values as explained in Section II. This vector is abridged to obtain the CTD. The sign information of the imaginary part of the energy and energy deviation components is used for the compact representation. Three methods are proposed. In all three, the texture descriptor is initially represented as a 60-bit binary number. The first 30 bits hold the sign of the imaginary part of the energy values of the thirty channels, and the next 30 bits hold the sign of the imaginary part of the energy deviation values. Each bit is set to 0 when the imaginary part is positive and to 1 otherwise. Texture can be represented using either the 30 energy bits or the 30 energy deviation bits.

1) Method 1: Representation by grouping: The energy and energy deviation values of the 30 channels are complex numbers. We observe that the signs of the imaginary parts of these values are similar for images of similar texture. The aim of this method is to compress the descriptor as much as possible and still classify images based on texture. To achieve this, only the sign of the imaginary part of the energy or energy deviation is considered, and a 30-bit binary number is formed for every image. The sum of the 30 bits gives the number of channels in which the imaginary part is negative. The sum does not, however, identify which channels hold a negative imaginary part. For this reason, the sum cannot be used to classify images with similar texture: two images may have the same sum while holding negative imaginary parts in different channels.
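The sign-bit packing used in Sections III-A and III-B, and the limitation of the plain bit sum noted above, can be illustrated with a minimal Python sketch. All function names are hypothetical. The single Haar step on a toy histogram, the difference convention (first bin minus second) and the most-significant-bit-first packing order are assumptions made for illustration; the text does not specify them.

```python
def haar_step(hist):
    """One Haar level: pairwise sums (low-pass) and differences (high-pass)."""
    if len(hist) % 2:
        raise ValueError("histogram length must be even")
    pairs = list(zip(hist[0::2], hist[1::2]))
    low = [a + b for a, b in pairs]
    high = [a - b for a, b in pairs]      # ASSUMPTION: first bin minus second bin
    return low, high


def sign_bits(values):
    """0 for a positive value, 1 otherwise (the convention stated in the text)."""
    return [0 if v > 0 else 1 for v in values]


def bits_to_int(bits):
    """Pack bits into an integer; ASSUMPTION: the first bit is the most significant."""
    number = 0
    for bit in bits:
        number = (number << 1) | bit
    return number


def compact_color_descriptor(high_pass):
    """CCD: 16 sign bits of the high-pass Haar coefficients as one number."""
    if len(high_pass) != 16:
        raise ValueError("expected 16 high-pass coefficients")
    return bits_to_int(sign_bits(high_pass))


def imag_sign_bits(channel_values):
    """Texture bits: sign of the imaginary part of each channel value."""
    return sign_bits([complex(v).imag for v in channel_values])


# Two hypothetical 30-bit texture patterns with the same number of set bits.
bits_a = [1, 1, 0] + [0] * 27
bits_b = [0] * 27 + [0, 1, 1]
same_sum = sum(bits_a) == sum(bits_b)     # the plain sum cannot tell them apart
same_pattern = bits_a == bits_b           # although the channels differ
```

The two patterns at the end have equal sums but mark different channels, which is exactly the ambiguity that rules out the plain sum as a texture descriptor.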
However, if the 30 bits are divided into groups of 3 and the sum of each group is taken as a digit, the result is a 10-digit number. This number can be used to represent the texture information. Experiments suggest that this representation is useful for image classification. For retrieval, the L1 norm should be applied digit by digit; applying it directly to the number as a whole is inefficient.

2) Method 2: Representation based on correlation with a standard reference texture: This method is similar to Method 1 but uses a standard reference texture. The 30 bits representing energy or energy deviation are XORed with the bit-stream of a standard reference texture, which is a 30-bit binary number. The result is again divided into groups of 3 bits to form a 10-digit decimal number representing the texture. This representation also gives meaningful results when compared digit by digit, but it fails to classify images when treated as a single number.

3) Method 3: Representation with weights: From experiments and observation, we find that sign information by itself can represent texture compactly. The sign information should, however, also identify the channel it belongs to; otherwise it becomes inefficient. To achieve this, a weight is assigned to every channel so that the resulting number represents both the sign and the channel information. The compact texture descriptor $CTD_i$ is the sum of all bits multiplied by their weights:

$$CTD_i = \sum_{k=1}^{30} w_k \, b_k$$

where $CTD_i$ is the compact texture descriptor of the $i$th image, $w_k$ is the weight assigned to the $k$th channel and $b_k$ is the sign bit of the $k$th channel. The choice of weights determines how well the resulting compact number describes the texture information. In this paper, we use prime numbers as the channel weights. The weights are assigned in descending order so as to give more weight to the channels representing low-frequency components. Deviation bits are preferred over energy bits for the compact descriptor. A graphical analysis of this representation is given in Section III-C. The choice of descending weights and the preference for deviation bits are justified quantitatively, with the help of MPEG-7 metrics, in Section III-D.

C. Discriminating ability of the proposed methods

[Figure 1: plots of CTD value against image index (0 to 90), for CTDs from energy deviation values and from energy values. (a) Method 3, weights assigned in descending order; (b) Method 3, weights assigned in ascending order; (c) Method 1 and Method 2.]

Fig. 1. Discriminating ability of CTDs. Panels (a) and (b) plot CTDs from Method 3. The CTDs in (a) classify images better than the CTDs in (b), as discussed in Section III-B. Panel (c) plots CTDs from Method 1 and Method 2.

The proposed methods are analyzed on the basis of their discriminating ability, which is deduced by plotting the compact texture descriptors (CTDs). The test dataset consists of 84 images, of which the first 44 are grouped into 4 sets according to ground truth information. Each set contains 11 images from the same scene. A descriptor has better discriminating ability if it clearly identifies images of the same set and at the same time differentiates images from different sets. This shows up as uniquely identifiable clusters formed by the CTDs in Figure 1. The difference between the average CTD values of these clusters depends on the similarity of the texture features of the images.

Figure 1(a) plots the CTDs obtained from Method 3 with weights assigned in descending order. Figure 1(b) plots the CTDs obtained from Method 3 with weights assigned in ascending order. The CTDs in Figure 1(a) have more discriminating ability than those in Figure 1(b). The reason is that, when weights are assigned in ascending order, low frequency components are given less importance while high frequency components are given more. This does not characterize texture accurately, because the high frequency components carry noise while the low frequency components carry more signal energy. Figure 1(c) plots the CTDs formed with Method 1 and Method 2. These CTDs are not as efficient as the CTDs from Method 3 at classifying images according to ground truth information. The graphs also show that the energy deviation bits classify images more efficiently than the energy bits.

D. Discussion

This section quantitatively investigates the discriminating ability of the compact descriptors described above, using MPEG-7 performance metrics. The MPEG-7 metrics [1] include rank, retrieval rate (RR), average retrieval rate, modified retrieval rank (MRR), normalized modified retrieval rank (NMRR) and average normalized modified retrieval rank (ANMRR). NMRR takes a value between 0 and 1. A system is more efficient when its NMRR values lie close to zero, and its performance worsens as they approach one. Figures 2(a) and 2(b) plot the NMRR values of retrieval techniques that employ CTDs based on Method 3, described in Section III-B. Figure 2(a) plots the NMRR values obtained for retrieval techniques that employ either energy bits or deviation bits. The system performs better when deviation bits are used. In both cases, weights are assigned in descending order. Figure 2(b) substantiates the observation made in Section III-C that retrieval techniques perform better when weights are assigned in descending order than in ascending order.
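The three compact texture representations compared in this section can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation. The function names are hypothetical, the use of the first 30 primes as weights is an assumption (the text says only that prime numbers are used), and descending weights are simply applied to the bits in the order given, since the channel ordering is not specified here.

```python
def group_digits(bits, size=3):
    """Method 1: sum each group of `size` bits to obtain one digit per group."""
    if len(bits) % size:
        raise ValueError("bit count must be a multiple of the group size")
    return [sum(bits[i:i + size]) for i in range(0, len(bits), size)]


def digitwise_l1(digits_a, digits_b):
    """L1 distance taken digit by digit, as recommended for Methods 1 and 2."""
    return sum(abs(a - b) for a, b in zip(digits_a, digits_b))


def xor_with_reference(bits, reference_bits):
    """Method 2: XOR the sign bits with a reference texture's bits before grouping."""
    return [b ^ r for b, r in zip(bits, reference_bits)]


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def weighted_ctd(bits, descending=True):
    """Method 3: sum of w_k * b_k with prime weights (ASSUMPTION: first primes)."""
    weights = sorted(first_primes(len(bits)), reverse=descending)
    return sum(w * b for w, b in zip(weights, bits))


# Two hypothetical 30-bit patterns with equal bit sums but different channels.
bits_a = [1, 1, 0] + [0] * 27
bits_b = [0] * 27 + [0, 1, 1]
digits_a, digits_b = group_digits(bits_a), group_digits(bits_b)
ctd_a, ctd_b = weighted_ctd(bits_a), weighted_ctd(bits_b)
```

In this toy case the two patterns have the same plain bit sum, but they differ in the grouped digits and in the weighted CTD, because both of those keep some information about which channels are set.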

[Figure 2: two plots of NMRR value against query index (1 to 8). (a) NMRR values for retrieval based on CTDs from deviation bits and from energy bits; (b) effect of the choice of weights on NMRR values in the representation with weights, for weights in descending and in ascending order.]

Fig. 2. NMRR plot of retrieval based on the various methods discussed in Section III.

IV. COMBINING COLOR AND TEXTURE

Images in the database are represented by two numbers, one for color information (CCD) and one for texture (CTD), obtained with the methods discussed in the previous section. Of the three methods proposed for compactly representing texture, the proposed technique uses the representation with weights assigned in descending order. The query image is represented in the same way. Let $CCD_i$ and $CTD_i$ be the color and texture numbers of image $i$ in the database, and let $CCD_q$ and $CTD_q$ be the color and texture numbers of the query. Then

$$d_{qi} = |CCD_q - CCD_i| + |CTD_q - CTD_i|$$

where $d_{qi}$ is the distance between query image $q$ and image $i$ in the database. The relevant images selected with this metric belong to the set defined below:

$$\text{Relevant Images} \in \{ I : d_{qi} \le \epsilon \}$$

where $I$ is the set of images selected and $\epsilon$ is the user-defined threshold.

[Figures 3 to 5: windows of retrieved images, three panels each.]

Fig. 3. Retrieved images for the query shown at the top left corner of the window: (a) result based on color alone, (b) result based on texture alone, (c) result after combining color and texture.

Fig. 4. Retrieved images for the query shown at the top left corner of the window: (a) result based on color alone, (b) result based on texture alone, (c) result after combining color and texture.

Fig. 5. Retrieved images for the query shown at the top left corner of the window: (a) result based on color alone, (b) result based on texture alone, (c) result after combining color and texture.

[Figure 6: (a) retrieval efficiency against image index (1 to 8) for combined, color and texture retrieval; (b) NMRR plot for retrieval based on color and texture separately and combined, against query index (1 to 8).]

Fig. 6. (a) Plot of retrieval efficiency and (b) plot of NMRR values for retrieval employing CCD and CTD separately and combined.

The MPEG-7 dataset [16] is used to test the efficiency of the proposed methods for image retrieval. The test database contains 84 images whose ground truth information is known. Each image is represented by its compact numbers. Twelve images are taken as queries. Image matching and retrieval are based on the distance metric discussed above. Images are matched and retrieved using color and texture separately and also in combination. The results show that retrieval after combining color and texture is more meaningful than retrieval with either feature alone, as shown in Figure 2.
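The matching rule defined in this section amounts to a thresholded L1 distance on the pair (CCD, CTD). The following Python sketch shows it on made-up values; the function names, the toy database and the flags for color-only or texture-only matching are hypothetical and serve only as illustration.

```python
def distance(query, image, use_color=True, use_texture=True):
    """d_qi = |CCD_q - CCD_i| + |CTD_q - CTD_i|, optionally with one feature only."""
    d = 0
    if use_color:
        d += abs(query[0] - image[0])
    if use_texture:
        d += abs(query[1] - image[1])
    return d


def retrieve(query, database, threshold, **features):
    """Return the names of images with d_qi <= threshold, nearest first."""
    scored = [(distance(query, desc, **features), name)
              for name, desc in database.items()]
    return [name for d, name in sorted(scored) if d <= threshold]


# Hypothetical (CCD, CTD) pairs; real values would come from the descriptors.
database = {"a": (100, 900), "b": (120, 880), "c": (5000, 300)}
query = (110, 890)

combined_hits = retrieve(query, database, threshold=50)
texture_hits = retrieve(query, database, threshold=600, use_color=False)
```

Here the combined distance keeps only the two images close to the query in both numbers, while a loose threshold on texture alone also admits the third image.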


Figures 3, 4 and 5 show a few results of image retrieval with the proposed methods. The query image is the image in the top left corner. The first set of images was retrieved using color alone, the second set using texture alone, and the third set after combining color and texture. The number of relevant images improves for the query showing a man in a coat: the texture of the coat is very helpful for meaningful retrieval, and the retrieval is further enhanced by adding color information. Figure 6(a) plots the retrieval efficiency [15] of the retrieval technique based on the proposed methods. The graph shows that retrieval efficiency is more consistent when color and texture information are combined than when either feature is used alone. Figure 6(b) plots the NMRR values of the proposed methods, based on the MPEG-7 performance metrics. From the NMRR values in Figure 6(b), we conclude that the retrieval rate of a technique that employs both CCDs and CTDs is much better than that of a system that employs CCDs or CTDs separately for similarity retrieval. The ANMRR values in Table 1 support this observation quantitatively.
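The NMRR and ANMRR figures used in this evaluation can be computed along the following lines. The paper does not spell out the formulas, so this Python sketch follows the definition as it is commonly stated for MPEG-7 (ranks beyond the cutoff K are replaced by a penalty of 1.25 K), which should be treated as an assumption here. The function names are hypothetical.

```python
def cutoff(ng, gtm):
    """K(q) = min(4 * NG(q), 2 * GTM), with GTM the largest ground-truth set size."""
    return min(4 * ng, 2 * gtm)


def nmrr(ranks, k):
    """NMRR for one query.

    ranks: 1-based retrieval ranks of the query's ground-truth images.
    k:     the cutoff K(q); ranks beyond it are penalised as 1.25 * k.
    """
    if not ranks:
        raise ValueError("a query needs at least one ground-truth image")
    ng = len(ranks)
    penalised = [r if r <= k else 1.25 * k for r in ranks]
    avr = sum(penalised) / ng                      # average rank
    mrr = avr - 0.5 * (1 + ng)                     # modified retrieval rank
    return mrr / (1.25 * k - 0.5 * (1 + ng))       # normalised to [0, 1]


def anmrr(per_query_ranks):
    """Average NMRR over all queries."""
    gtm = max(len(ranks) for ranks in per_query_ranks)
    scores = [nmrr(ranks, cutoff(len(ranks), gtm)) for ranks in per_query_ranks]
    return sum(scores) / len(scores)
```

Under this definition a query whose ground-truth images occupy the top ranks scores 0, and a query whose ground-truth images all fall beyond the cutoff scores 1, consistent with the statement that lower NMRR means better retrieval.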

Table 1: ANMRR values

Compact descriptor used    ANMRR value
CCD and CTD                0.24994
CCD                        0.30152
CTD                        0.3248

V. CONCLUSIONS

The features of an image have been represented in a compact form derived from the MPEG-7 visual descriptors. We observe that the sign information of the descriptors alone can be used for image matching and retrieval. The results confirm quantitatively that the retrieval efficiency of the proposed technique is high for most of the queries. Different techniques can be modelled on the methods proposed. For a huge database, the proposed methods can be a solution in terms of memory management, since the representation is more compact and still efficient.

REFERENCES

[1] B. S. Manjunath, J. R. Ohm, V. V. Vasudevan and A. Yamada, “Color and texture descriptors,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, 2001, pp. 703-715.
[2] T. Sikora, “The MPEG-7 visual standard for content description - an overview,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, 2001, pp. 696-702.
[3] Shih-Fu Chang, T. Sikora and A. Puri, “Overview of the MPEG-7 standard,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, 2001, pp. 688-695.
[4] Y. M. Ro, M. C. Kim, H. K. Kang, B. S. Manjunath and J. W. Kim, “MPEG-7 Homogeneous Texture Descriptor,” ETRI Journal, Vol. 23, No. 2, June 2001, pp. 41-51.
[5] I. Fogel and D. Sagi, “Gabor filters as texture discriminator,” Biological Cybernetics, Vol. 61, 1989, pp. 103-113.
[6] W. Y. Ma and B. S. Manjunath, “Netra: A toolbox for navigating large image databases,” Multimedia Systems, Vol. 7, No. 7, 1999, pp. 184-198.
[7] W. Niblack et al., “QBIC Project: Querying images by content, using colour, texture, and shape,” Proceedings of the SPIE: Storage and Retrieval for Image and Video Databases, Vol. 1908, February 1993, pp. 173-187.
[8] A. Pentland, R. W. Picard and S. Sclaroff, “Photobook: Content-based manipulation of image databases,” Int. J. Comput. Vis., Vol. 18, 1996, pp. 233-254.
[9] J. R. Smith and S. Chang, “Visualseek: A fully automated content-based query system,” Proc. ACM Multimedia ’96, pp. 87-98.
[10] Aleksandra Mojsilović, Jelena Kovačević, Jianying Hu, Robert J. Safranek and S. Kicha Ganapathy, “Matching and Retrieval Based on the Vocabulary and Grammar of Color Patterns,” IEEE Transactions on Image Processing, Vol. 9, No. 1, January 2000.
[11] Qasim Iqbal and J. K. Aggarwal, “CIRES: A System for Content-Based Retrieval in Digital Image Libraries,” Seventh International Conference on Control, Automation, Robotics and Vision (ICARCV ’02), Dec. 2002, Singapore.
[12] Constantin Vertan, Mihai Ciuc, Christine Fernandez-Maloigne and Vasile Buzuloiu, “IRIS - Color Texture Indexing and Recognition Toolbox,” Proc. of 8th Intl. Conf. on Optimization and Electronic Equipment OPTIM ’02, Brasov, Romania, May 2002, pp. 757-762.
[13] W. Y. Ma, Y. Deng and B. S. Manjunath, “Tools for texture/color based search of images,” Human Vision and Electronic Imaging II, Feb. 1997, Proc. SPIE, Vol. 3016, pp. 496-507.
[14] B. S. Manjunath and W. Y. Ma, “Texture features for browsing and retrieval of image data,” IEEE Transactions PAMI, Vol. 18, Aug. 1996, pp. 837-842.
[15] H. Muller, W. Muller, D. Squire, S. Marchand-Maillet and T. Pun, “Performance Evaluation in Content-Based Image Retrieval: Overview and Proposals,” 2001.
[16] MPEG-7, “MPEG-7 content set,” ROM Squared, Inc., Milwaukee, WI 53202, USA.
