Classification of Images on Internet by Visual and Textual Information

Theo Gevers, Frank Aldershoff, Arnold W.M. Smeulders
Faculty of Mathematics & Computer Science, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands

ABSTRACT

In this paper, we study computational models and techniques to combine textual and image features for the classification of images on Internet. A framework is given to index images on the basis of textual, pictorial and composite (textual-pictorial) information. The scheme makes use of weighted document terms and color invariant image features to obtain a high-dimensional similarity descriptor to be used as an index. Based on supervised learning, the k-nearest neighbor classifier is used to organize images into semantically meaningful groups of Internet images. Internet images are first classified into photographical and synthetical (artwork) images. Photographical images are then further classified into portraits (i.e. the image contains a substantial face) and non-portraits, and synthetical images are classified into button and non-button images. Experiments have been conducted on a large set of images downloaded from Internet, evaluating the accuracy of combining textual and pictorial information for classification. From the experimental results it is concluded that for classifying images into photographic/synthetic classes, the contributions of image and textual features are equally important. Consequently, high discriminative classification power is obtained based on composite information. Classifying images into portraits/non-portraits shows that pictorial information is more important than textual information. This is due to the inconsistent textual image descriptions, such as surnames, assigned to portrait images which we found on Internet. Hence, only a marginal improvement in performance is achieved by using composite information for classifying into portrait and non-portrait classes. Extensions have been made by adding a list of surnames to the training set, enhancing the classification rate significantly. The classification scheme can be experienced within PicToSeek [6] on-line at: http://www.wins.uva.nl/research/isis/zomax/.

Keywords: WWW, HTML, image search engines, Internet image classification, supervised learning, combining textual/image features, image databases

1. INTRODUCTION

With the growth of the World Wide Web, a tremendous amount of visual information has become publicly available. Traditionally, verbal descriptions, such as keywords, file identifiers, or text, have been used to describe and retrieve images. More recently, content-based image retrieval systems have been developed which retrieve images on the basis of multiple image features (e.g. color, shape and texture) [1,2,15,19,20]. Today, a number of systems are available for retrieving images from the World Wide Web on the basis of textual or visual information [3,6,10,14,18]. New research is directed towards the use of both textual and pictorial information to retrieve images from the World Wide Web [11].

Most of the content-based search systems are based on the so-called query-by-example paradigm. The basic idea of image retrieval by image example is to extract characteristic features from the images in the database, which are then stored and indexed. This is done off-line. These features are typically derived from shape, texture or color information. The on-line image retrieval process consists of a query example image from which image features are extracted. These image features are used to find the images in the database which are most similar to the query image. Although significant results have been achieved, low-level image features extracted from the images are often too restricted to describe images on a conceptual or semantic level. This semantic gap is a well-known problem in content-based image retrieval. Therefore, image classification has been proposed to group images in the database into semantically meaningful classes to enhance the performance of content-based retrieval systems [16,17,22,23]. The advantage of these classification schemes is that simple low-level image features can be used to express semantically meaningful classes. In this way, the gap is bridged between low-level image features and high-level concepts. Image classification can be based on unsupervised learning techniques such as clustering, Self-Organizing Maps (SOM) [23] and Markov models [22]. Further, supervised grouping can be applied. For example, vacation images have been classified, based on a Bayesian framework, into city vs. landscape by supervised learning [16,17]. Landscape images are further classified into sunset, forest, and mountain classes. However, surprisingly little attention has been paid to using both textual and pictorial information for classifying images on the WWW. The more so as images in HTML pages are often accompanied by textual descriptions (i.e. words) in their Url-path and near various HTML-tags such as IMG, HREF, and SRC. Hence, images on Internet have intrinsic annotation information induced by the HTML structure. Consequently, the set of images on Internet can be seen as an annotated image set. The challenge is now to get to a framework allowing images on Internet to be classified by means of composite pictorial and textual (annotated) information into semantically meaningful groups, and to evaluate its added value, i.e. to what extent the use of composite information will increase the classification rate as opposed to classification based only on visual or textual information.

Therefore, in this paper, we study computational models and techniques to combine textual and image features to classify images on Internet. A framework is presented to index images on textual, pictorial and composite (textual-pictorial) information. The scheme makes use of weighted document terms and color invariant image features to obtain a high-dimensional similarity descriptor to be used as an index. The indexing scheme is used to study the added value of using both information sources. Therefore, the classification scheme will be applied on image, text and composite features. To achieve this, our float of Web-robots downloaded over 100.000 images of the GIF and JFIF (JPEG) formats. Further, for these 100.000 images, textual descriptions (i.e. words) located near the images, such as Url's and HTML-tags, have been downloaded. These textual descriptions are parsed and stemmed, yielding a robust set of discriminative words. Then, images and their textual attachments are represented in a multidimensional feature space. The classification method is based on supervised learning with the k-nearest neighbor classifier. Images are classified into photographical and synthetical images. After classifying images into photographical and synthetical images, we further classify photographical images into portraits (i.e. the image contains a substantial face) and non-portraits. Further, synthetical images are classified into button and non-button images.

The paper is organized as follows. First, in Section 2, the classification approach is discussed. An HTML parser is given in Section 3, yielding robust, non-redundant and discriminative words. Low-level image features are proposed in Section 4. Experiments, conducted on a large set of images from Internet, are described in Section 5. Finally, conclusions are drawn.

2. APPROACH

On the World Wide Web, images are published by different people with various interests. As a consequence, a collection of images of various types and styles is created. Our aim is to crawl images efficiently by Web-robots and pre-classify them into different styles and types. The goal of the Web-crawler is to download images from a large variety of sites. To that end, the Web-crawler is briefly outlined in Section 2.1. In Section 2.2, the image representation scheme is given. Finally, in Section 2.3, the image classification approach is outlined.

2.1. Image Collection

A Web-crawler has been implemented [21]. The Web-crawler consists of a float of 10 Web-robots, implemented in C, running in parallel on a 24-processor machine. A biased image collection is prevented in two ways. First, a general index page such as Yahoo! has been taken as a starting point. Secondly, by using a breadth-first search strategy, the chance is minimized that pictures will come from a few sites only. The Web-robots downloaded over 100.000 images of the GIF, JPEG (JFIF), MPEG, and AVI formats. The results of the Web-crawler show a broad range of different Url-addresses. The 100.000 images in the collection come from more than 20.000 sites. The number of images per type in the collection is approximately 15% photographs and 85% synthetic images. 15% of the collection are grey value images.
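The crawling strategy can be illustrated with a minimal sketch. The snippet below is not the crawler used in the paper (a float of C robots running in parallel, attributed to [21]); it only shows the breadth-first traversal idea that spreads the downloads over many sites when starting from a general index page. The seed URL, the image-extension filter and the page limit are illustrative assumptions.

```python
# Minimal breadth-first crawl sketch (illustrative; not the authors' C implementation).
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

SEED = "http://www.yahoo.com/"          # assumed starting index page
IMG_RE = re.compile(r'src=["\']([^"\']+\.(?:gif|jpe?g))["\']', re.I)
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.I)

def crawl(seed=SEED, max_pages=100):
    """Breadth-first traversal: pages close to the seed are visited first,
    so images come from many different sites instead of one deep branch."""
    queue, seen, images = deque([seed]), {seed}, []
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        images += [urljoin(url, src) for src in IMG_RE.findall(html)]
        for link in HREF_RE.findall(html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)   # FIFO queue gives breadth-first order
    return images
```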

2.2. Unified Representation of Textual and Pictorial Information

The Web-crawler downloaded 100.000 images together with their textual descriptions (i.e. words) appearing near the images in the HTML documents, such as the Url-address and relevant HTML-tags such as IMG, HREF, and SRC. HTML parsing, stemming and selection will be discussed in Section 3. After HTML parsing, image features are computed from the images. Image feature extraction is discussed in Section 4. Then, images and their textual attachments are represented in a unified multidimensional feature space. To be precise, let an image I be represented by a feature vector of the form

$I = (f_0, w_{I0}; f_1, w_{I1}; \ldots; f_t, w_{It})$

and a typical query Q by

$Q = (f_0, w_{Q0}; f_1, w_{Q1}; \ldots; f_t, w_{Qt})$,

where $w_{Ik}$ (or $w_{Qk}$) represents the weight of feature $f_k$ in image I (or query Q), and t image features are used for classification. The weights are assumed to be between 0 and 1. A feature can be seen as an image characteristic or an HTML word term. For general classification, weights are assigned corresponding to the feature frequency as defined by:

$w_i = \mathit{ff}_i$   (1)

giving the well-known histogram form, where $\mathit{ff}_i$ (feature frequency) is the frequency of occurrence of feature value $i$, e.g. the total number of red pixels in an image or the number of occurrences of the same word in an HTML document. In this way, a feature x image matrix is created, where the element $(f_i, w_{Ii})$ represents the frequency of image or document feature $f_i$ in image $I_i$. For accurate classification, it is desirable to assign weights in accordance with the importance of the features. Hence, weights are assigned to increase/decrease the importance of features in and among images and documents. In this paper, the feature weights used for both images and queries are computed as the product of the feature frequency and the inverse collection frequency factor, defined by:

$w_i = \mathit{ff}_i \log\left(\frac{N}{n_i}\right)$   (2)

where N is the number of images/documents in the dataset and $n_i$ denotes the number of images/documents to which feature value $i$ is assigned. In this way, features are emphasized which have high feature frequencies but low overall collection frequencies. Further, we use cosine normalization by dividing each feature by a factor representing the Euclidean vector length [8]:

$w_i = \frac{\mathit{ff}_i \log(N/n_i)}{\sqrt{\sum_{j=1}^{t} \left(\mathit{ff}_j \log(N/n_j)\right)^2}}$   (3)

equalizing the length of the image/document vectors.
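As an illustration of this weighting scheme, the sketch below builds the weights of equations (2) and (3) for a small feature-frequency matrix. It is a minimal sketch assuming the raw feature frequencies are already available; the variable names are ours, not from the paper.

```python
import numpy as np

def weight_features(ff):
    """ff: (n_items, n_features) matrix of raw feature frequencies
    (color bin counts or word counts). Returns cosine-normalized
    weights as in equations (2) and (3)."""
    ff = np.asarray(ff, dtype=float)
    N = ff.shape[0]                                # number of images/documents
    n = np.maximum((ff > 0).sum(axis=0), 1)        # collection frequency per feature
    w = ff * np.log(N / n)                         # eq. (2): frequency * inverse collection frequency
    norms = np.sqrt((w ** 2).sum(axis=1, keepdims=True))  # Euclidean vector length
    return w / np.maximum(norms, 1e-12)            # eq. (3): cosine normalization

# Tiny usage example with 3 items and 4 features:
ff = [[2, 0, 1, 3],
      [0, 1, 1, 0],
      [4, 0, 0, 1]]
print(weight_features(ff))
```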

2.3. Image Classification


The actual classification process searches for the k elements in the trained image set closest to the query image. In the field of pattern recognition, several methods have been proposed that improve classification automatically through experience, such as artificial neural networks, decision tree learning, Bayesian learning and k-nearest neighbor classifiers. Except for the k-nearest neighbor classifier, these methods construct a general, explicit description of the target function when training examples are provided. In contrast, k-nearest neighbor classification consists of finding the relationship to the previously stored images each time a new query image is given. When a new query is given by the user, a set of similar related images is retrieved from the image database and used to classify the new query image. The advantage of k-nearest neighbor classification is that the technique constructs a local approximation to the target function that applies in the neighborhood of the new query image, and never constructs an approximation designed to perform well over the entire instance space. Because the k-nearest neighbor algorithm delays classification until a new query is received, significant computation can be required to process each new query. Various methods have been developed for indexing the stored images so that the nearest neighbors can be identified efficiently at some additional cost in memory, such as k-d trees or R*-trees [7], for example. Unfortunately, the complexity of these search algorithms grows exponentially with the dimension of the vector space, making them impractical for dimensionality above 15. To overcome this problem, we use the SR-tree for image indexing and high-dimensional nearest neighbor search. It has been shown that the SR-tree outperforms the R*-tree [9] when using high-dimensional feature spaces.
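A minimal sketch of the classification step is given below. It implements a plain weighted-distance k-nearest neighbor vote over the normalized feature vectors; the SR-tree index used in the paper for efficient high-dimensional search is not reproduced here, and the inverse-distance weighting is our assumption about what "weighted distance" means.

```python
import numpy as np
from collections import defaultdict

def knn_classify(query, train_vectors, train_labels, k=8):
    """Weighted-distance k-NN vote (illustrative stand-in for the
    SR-tree-backed classifier described in the paper)."""
    d = np.linalg.norm(train_vectors - query, axis=1)   # Euclidean distances to all training items
    nearest = np.argsort(d)[:k]                         # indices of the k closest training items
    votes = defaultdict(float)
    for i in nearest:
        votes[train_labels[i]] += 1.0 / (d[i] + 1e-12)  # closer neighbors weigh more (assumed scheme)
    return max(votes, key=votes.get)

# Usage: train_vectors is the (n, t) matrix produced by weight_features(),
# train_labels a list of class names such as "photo" or "artwork".
```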

3. TEXT FEATURES

Text features have been used for document retrieval and classification for many years. When objects are to be described by keywords, the goal is to get to proper and consistent keywords. However, it is known that two people choose the same keyword for a single well-known object less than 20% of the time. Further, synonymy and polysemy may influence the classification rate. Synonymy is the phenomenon that two different words can have the same meaning. Polysemy is the phenomenon that a single word has different meanings. The effect of these two phenomena is that documents on the same topic do not necessarily contain the same words (synonymy) and that documents that contain the same words are not necessarily about the same topic (polysemy). The common solution to the synonymy problem is the use of a thesaurus to find synonyms and add these words, possibly automatically, to the query. This solution may lead to the addition of many words, which in turn may lead to an unwanted degradation of classification precision. Many words are derivations of one common word, i.e. the stem. This can be seen as a form of synonymy. In that respect, the stemming of words (walking to walk and French to France) can also be seen as a way to reduce synonymy. The effect of polysemy is reduced by restricting the set of keywords for indexing and searching. In order to yield a robust, consistent and non-redundant set of discriminative words, HTML documents are parsed for each class during the training stage as follows:

1. Class description: The class of Internet images is given manually, such as photo, art, or portrait.
2. Text parsing: Words appearing in the Url-address and near certain HTML-tags such as SRC, IMG and HREF are excerpted.
3. Eliminating redundant words: A stop list is used to eliminate redundant words. For example, to discriminate photographical and synthetical images, words such as image and gif, which appear in equal number in both classes, are eliminated.
4. Stemming: Suffix removal methods are applied to the remaining words to reduce each word to its stem form.
5. Stem merging: Multiple occurrences of a stem form are merged into a single text term.
6. Word reduction: Words with too low a frequency are eliminated.

In this way, a highly representative set of words is computed for each class during the supervised training stage. Then, weights are given to the words, which are represented in the multidimensional vector space discussed in Section 2. Classification of a given Internet image can now be computed by the standard k-nearest neighbor classifier as follows. First, the query is given by an image on Internet. Query words appearing in the Url-address and near various HTML-tags such as SRC, IMG and HREF are excerpted, redundant words are eliminated, and word stemming is applied, corresponding to steps 2-6 of the HTML parsing method described above. Then, a k-nearest neighbor classification with weighted distances is applied to provide the class label, such as photo, artwork, portrait or button.
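The sketch below illustrates steps 2-6 of the parsing pipeline on a single HTML page. It is a simplification under our own assumptions: the stop list, the toy suffix-stripping stemmer and the frequency threshold stand in for the (unspecified) stop list, suffix-removal method and word-reduction threshold used in the paper.

```python
import re
from collections import Counter

STOP_WORDS = {"image", "gif", "jpg", "www", "http", "html"}   # assumed stop list

def stem(word):
    """Toy suffix stripper standing in for a full suffix-removal method."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def parse_html(url, html, min_freq=2):
    """Steps 2-6: excerpt words from the Url and near IMG/HREF/SRC tags,
    drop stop words, stem, merge stems and remove low-frequency terms."""
    words = re.findall(r"[a-z]+", url.lower())                        # words from the Url-path
    for attr in re.findall(r'(?:src|href|alt)=["\']([^"\']+)["\']',   # words near IMG/HREF/SRC
                           html, flags=re.I):
        words += re.findall(r"[a-z]+", attr.lower())
    stems = Counter(stem(w) for w in words if w not in STOP_WORDS)    # stem merging
    return {term: freq for term, freq in stems.items() if freq >= min_freq}
```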

4. IMAGE FEATURES

In this section, low-level image features are proposed to allow the classification of images into the following groups: photographical/synthetical images in Section 4.1 and portrait/non-portrait images in Section 4.2. Image features are selected according to the following criteria: robustness to noise, compactness, and high discriminative power. Image features are derived by visual inspection of a large set of images by various observers.

4.1. PHOTOGRAPHIC VS. ART

A useful classification problem for Internet images is the distinction between photographic and synthetic (artwork) images [12]. To capture the differences between the two classes, various human observers were asked to visually inspect and classify images into photos and artwork. The following observations were made. Because artwork is manually created, it tends to have a limited number of colors. In contrast, photographs usually contain many different shades of colors. Further, artwork is usually designed to convey information, such as buttons. Hence, the limited set of colors in artwork is often very bright to attract the user's attention. In contrast, photographs contain mostly dull colors. Further, edges in photographs are usually soft and subtle due to light variations and shading. In artwork, edges are usually very abrupt. Based on the above mentioned observations, the following image features have been selected to distinguish photographic images from artwork:

Color variation: The number of distinct hue values in an image relative to the total number of hues. Hue values are computed by converting the RGB image into HSI color space, from which H is extracted. Synthetic images tend to have fewer distinct hue colors than photographs.

Color saturation: The accumulation of the saturation of colors in an image relative to the total number of pixels. S from the HSI model is used to express saturation. Colors in synthetic images are likely to be more saturated.

Color transition strength: The pronouncement of hue edges in an image. Color transition strength is computed by applying the Canny edge detector to the H color component. Synthetic images tend to have more abrupt color transitions than photographs.
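A minimal sketch of these three features is shown below, using OpenCV's HSV conversion as a stand-in for the HSI model mentioned above; the exact hue quantization and the Canny thresholds are our assumptions, not values from the paper.

```python
import cv2
import numpy as np

def photo_vs_art_features(bgr_image):
    """Color variation, color saturation and color transition strength
    (sketch; HSV is used here as a proxy for the paper's HSI model)."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hue, sat = hsv[:, :, 0], hsv[:, :, 1]
    n_pixels = hue.size

    color_variation = len(np.unique(hue)) / 180.0              # distinct hues / possible hues (OpenCV: 0..179)
    color_saturation = float(sat.sum()) / (255.0 * n_pixels)   # accumulated saturation relative to pixel count

    edges = cv2.Canny(hue, 50, 150)                            # hue edges; thresholds are assumptions
    transition_strength = float(np.count_nonzero(edges)) / n_pixels
    return np.array([color_variation, color_saturation, transition_strength])

# Usage: features = photo_vs_art_features(cv2.imread("example.jpg"))
```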

4.2. PORTRAIT VS. NON-PORTRAIT

We use the observation that portrait images are substantially occupied by the skin colors of faces. To that end, a two-step filter is proposed in this section to detect portraits. First, all pixels within the skin-tone color range are determined, yielding blobs. Then, these blobs are tested with respect to shape and size requirements. In order to make the whole skin detection scheme robust to varying imaging conditions, color models are selected satisfying the following criteria:

- Robust to face folding;
- Robust to changes in the viewing direction;
- Robust to a change in the direction of the illumination;
- Robust to changes in the intensity of the illumination;
- Robust to changes in the color of the illumination.

In Section 4.2.1, the dichromatic reflection model is used to derive color invariant models for skin detection. As the color invariant models are still sensitive to changes in illumination color, in Section 4.2.2, color ratios and geometric constraints are discussed to propose a robust portrait detector.

4.2.1. Color Invariant Skin Detection

The interaction between light and skin can be modeled as follows [13]:

$C_k(\vec{x}) = G_B(\vec{x}, \vec{n}, \vec{s}) \, E(\vec{x}) \int_{\lambda} B(\vec{x}, \lambda) F_k(\lambda) \, d\lambda$   (4)

giving the k-th sensor response of an infinitesimal skin patch under the assumption of a white light source. Note that the above reflection model assumes approximately white illumination. Further, it is assumed that no highlights are present when measuring skin. Then, when the white light source E shines upon an object B, some wavelengths are reflected and some are absorbed. The distribution of the wavelengths of the reflected light determines the color of a face or the human body. The photometric properties of skin reflection depend on many factors. If we assume a random distribution of the skin pigments, the light exits in random directions from the skin. In this case, the distribution of exiting light can be described by Lambert's law. Lambertian reflection models surfaces which appear equally bright regardless of the angle from which they are viewed: they reflect light with equal intensity in all directions. As a consequence, a face, which is usually folded (i.e. has varying surface orientation), will give rise to a broad variance of RGB values. Note that the above reflection model is not restricted to Lambertian surfaces. In contrast, the normalized color space rgb, given by:

$r(R, G, B) = \frac{R}{R + G + B}$   (5)

$g(R, G, B) = \frac{G}{R + G + B}$   (6)

$b(R, G, B) = \frac{B}{R + G + B}$   (7)

is insensitive to face folding and to a change in surface orientation, illumination direction and illumination intensity, as can be shown mathematically by substituting equation (4) in equations (5)-(7):

$r(R_b, G_b, B_b) = \frac{G_B(\vec{x}, \vec{n}, \vec{s}) E(\vec{x}) k_R}{G_B(\vec{x}, \vec{n}, \vec{s}) E(\vec{x}) (k_R + k_G + k_B)} = \frac{k_R}{k_R + k_G + k_B}$   (8)

$g(R_b, G_b, B_b) = \frac{G_B(\vec{x}, \vec{n}, \vec{s}) E(\vec{x}) k_G}{G_B(\vec{x}, \vec{n}, \vec{s}) E(\vec{x}) (k_R + k_G + k_B)} = \frac{k_G}{k_R + k_G + k_B}$   (9)

$b(R_b, G_b, B_b) = \frac{G_B(\vec{x}, \vec{n}, \vec{s}) E(\vec{x}) k_B}{G_B(\vec{x}, \vec{n}, \vec{s}) E(\vec{x}) (k_R + k_G + k_B)} = \frac{k_B}{k_R + k_G + k_B}$   (10)

factoring out dependencies on illumination and object geometry, and hence depending only on the sensors and the skin color. Further, the color space c1c2c3, given by [4]:

$c_1 = \arctan\left(\frac{R}{\max\{G, B\}}\right)$   (11)

$c_2 = \arctan\left(\frac{G}{\max\{R, B\}}\right)$   (12)

$c_3 = \arctan\left(\frac{B}{\max\{R, G\}}\right)$   (13)

is also color invariant for skin, cf. eq. (4) and eqs. (11)-(13):

$c_1(R_b, G_b, B_b) = \arctan\left(\frac{G_B(\vec{x}, \vec{n}, \vec{s}) E(\vec{x}) k_R}{\max\{G_B(\vec{x}, \vec{n}, \vec{s}) E(\vec{x}) k_G, \, G_B(\vec{x}, \vec{n}, \vec{s}) E(\vec{x}) k_B\}}\right) = \arctan\left(\frac{k_R}{\max\{k_G, k_B\}}\right)$   (14)

$c_2(R_b, G_b, B_b) = \arctan\left(\frac{k_G}{\max\{k_R, k_B\}}\right)$   (15)

$c_3(R_b, G_b, B_b) = \arctan\left(\frac{k_B}{\max\{k_R, k_G\}}\right)$   (16)

depending only on the sensors and the skin albedo.

To illustrate the effect of the imaging conditions, differentiated for the various color models, 10 by 10 squared neighborhoods of pixels have been collected from different images. Of these squares, 19 have been collected from regions depicting skin and 20 from other regions, resulting in 1900 pixels with skin tones and 2000 pixels with random colors. When these pixels are plotted in the RGB color cube, they form two overlapping clouds which are difficult to separate, see Figure 1.a, where skin pixels are depicted bright and non-skin pixels are depicted dark. The skin/non-skin separation improves when pixels are represented in the color invariant spaces c1c2c3 and rgb, see Figures 1.b and 1.c respectively. The two clouds are now more dense and the skin-tone pixels occupy a smaller space. However, the spreading of the clouds is partially due to distortions caused by changes in illumination color. Note that rgb and c1c2c3 are still dependent on a change in the color of the illumination. To reduce these disturbing effects, ratios have been chosen based on the observation that a change in illumination color affects c1 and c2 proportionally. Hence, by computing the ratios c2/c1 and c3/c2, and r/g and g/b, the disturbing influence of a change in illumination color is cancelled out. A drawback is that c1c2c3 and rgb become unstable when the intensity is very low, near 5% of the total intensity range [5]. Therefore, pixels with low intensity have been removed from the skin-tone pixels. In Figure 2, pixels are plotted based on the ratios. As can be seen in Figure 2.a, the ratios achieve a robust isolation of the skin-tone pixels. In fact, they allow us to specify a rectangle containing mostly skin pixels. This rectangle defines the limits of the c2/c1 and c3/c2 values of pixels taken from regions with skin tones. These limits are the following: c2/c1 = [0.35...0.80] and c3/c2 = [0.40...0.98]. They have been determined empirically by visual inspection on a large set of test images and have proved to be effective in our application.
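The color computations above translate directly into a per-pixel skin test. The sketch below computes c1c2c3 for an image, discards low-intensity pixels and applies the empirically found rectangle on c2/c1 and c3/c2; the exact intensity cutoff implementation and the small epsilon guards are our assumptions.

```python
import numpy as np

def skin_mask(rgb):
    """rgb: (H, W, 3) array with R, G, B channels in 0..255.
    Returns a boolean mask of candidate skin-tone pixels using the
    c2/c1 and c3/c2 limits reported in the paper."""
    rgb = rgb.astype(float)
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-6                                          # guard against division by zero (assumed)

    c1 = np.arctan(R / (np.maximum(G, B) + eps))        # eq. (11)
    c2 = np.arctan(G / (np.maximum(R, B) + eps))        # eq. (12)
    c3 = np.arctan(B / (np.maximum(R, G) + eps))        # eq. (13)

    ratio21 = c2 / (c1 + eps)
    ratio32 = c3 / (c2 + eps)

    bright_enough = (R + G + B) / 3.0 > 0.05 * 255      # drop unstable low-intensity pixels
    return (bright_enough &
            (ratio21 >= 0.35) & (ratio21 <= 0.80) &     # empirical c2/c1 limits
            (ratio32 >= 0.40) & (ratio32 <= 0.98))      # empirical c3/c2 limits
```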

Figure 1. a. Skin (bright) and non-skin (dark) points plotted in RGB color space. b. Skin (bright) and non-skin (dark) points plotted in c1c2c3 color space. c. Skin (bright) and non-skin (dark) points plotted in rgb color space.

Figure 2. a. Skin (bright) and non-skin (dark) pixels plotted in the ratio color space: c2/c1 and c3/c2. b. Skin (bright) and non-skin (dark) pixels plotted in the ratio color space: r/g and g/b.

4.2.2. Portrait Detection

In the previous section, blobs having skin tone have been identified. The second step is to impose shape restrictions on these blobs to actually identify portraits. The reason is that there are various problems when using color alone for images taken under a wide variety of imaging circumstances. Also, Internet images are often of very poor image quality. Further, other object colors might match skin color as well; for example, the color of sand can be similar to the color of a face. Furthermore, highlights and extreme shadows may cause detection problems. Therefore, blobs are post-processed to meet certain geometric and size criteria. Small blobs are removed by applying morphological operations such as dilation and erosion. After removing small blobs, the remaining blobs are tested with respect to two criteria. First, a blob is required to occupy at least 5% of the total image. Further, at least 20% of the blob should occupy the mask. This ensures that only fairly large blobs are left. Figure 3 shows the performance of the portrait detector, where the skin mask is given before and after the test on size and shape constraints.
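The sketch below shows one way this post-processing could look, using OpenCV morphology and connected-component statistics. The structuring-element size and the interpretation of the 20% criterion (blob area relative to its bounding box) are our assumptions; only the 5% image-area threshold is taken from the paper.

```python
import cv2
import numpy as np

def portrait_blobs(mask, min_image_frac=0.05, min_fill_frac=0.20):
    """mask: boolean skin mask from skin_mask(). Returns a cleaned mask
    keeping only blobs large enough to suggest a portrait."""
    mask = mask.astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))   # assumed kernel size
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)           # erosion + dilation removes small blobs
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)          # fill small holes

    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    keep = np.zeros_like(mask)
    image_area = mask.shape[0] * mask.shape[1]
    for i in range(1, n):                                           # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        bbox_area = stats[i, cv2.CC_STAT_WIDTH] * stats[i, cv2.CC_STAT_HEIGHT]
        if area >= min_image_frac * image_area and area >= min_fill_frac * bbox_area:
            keep[labels == i] = 255
    return keep > 0
```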

5. EXPERIMENTS

For classifying Internet images into different semantically meaningful classes, we assess in this section the precision of the proposed automated image classification method along the following criteria:

- The classification rate differentiated for textual, visual, and composite (textual-visual) information.

Figure 3. a. Original images. b. Skin regions detected by the skin filter. c. Skin regions satisfying shape constraints. d. Skin regions superimposed on the original image.

- The classification accuracy differentiated for photographical-synthetical, portrait-nonportrait, and button-nonbutton classes.

In the experiments, we use two different sets of images. The first set consists of a total of 200 images taken from the set of downloaded (100.000) images from Internet and a subset of the Corel Stock Photo Libraries. This first test set does not provide textual annotation and is used to classify images on the basis of only visual information. The second dataset consists of 1432 images coming from the Internet, for which textual annotations are provided.

5.1. Photographical vs. Synthetical

5.1.1. TEST SET I

The first dataset was composed of 100 images per class, resulting in a total of 200 images. Hence the training set for each class is 100 images. The test set (query set) consisted of 50 images for each class.

Classification based on Image Features: Typical images in the test set of photographical and synthetical images are shown in Fig. 4.a. Typical classification queries are shown in Figs. 4.b.1 and 4.c.1. The corresponding results, based on saturation and edge strength, are shown in Figs. 4.b.2 and 4.c.2. Based on edge strength and saturation, defined in Section 4.1, the 8-nearest neighbor classifier with weighted distance provided a classification success of 90% (i.e. 90% were correctly classified) for photographical images and 85% for synthetical images. From the results it is concluded that automated type classification based on low-level image features provides a satisfactory distinction between photo and artwork.

5.1.2. TEST SET II

The second dataset consists of a total of 1432 images, composed of 1157 synthetic images and 275 photographical images. Hence, 81% artwork and 19% photos were provided, yielding a representative type ratio for Internet images.

Classification based on Image Features: Images are classified on the basis of color variation, color saturation, and color edge strength. The 8-nearest neighbor classifier with weighted distance provided a classification success of 74% for photographical images and 87% for synthetical images.

Classification based on Composite Information: Images are classified on the basis of composite information. To this end, the same image features are used: color variation, color saturation and color edge strength. Further, the annotations of the images have been processed on the basis of the HTML parsing method given in Section 3. Typical high-frequency words derived for photos were: photo (in different languages, such as bilder (German) and foto (Dutch)), picture and people. High-frequency words derived for artwork were: icon, logo, graphics, button, home, banner, and menu. Based on image features and text, the 8-nearest neighbor classifier with weighted distance provided a classification success of 89% (i.e. 89% were correctly classified) for photographical images and 96% for synthetical images. From these results it is concluded that the classification accuracy based on both text and visual information is very high and that it outperforms the classification rate based entirely on visual information.
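A minimal sketch of how such a composite descriptor could be assembled is shown below: the three image features are simply concatenated with the text-term weights over a fixed vocabulary before the k-nearest neighbor step. The plain concatenation without per-modality weighting is our assumption; the paper does not spell out how the two vectors are merged beyond the unified weighting of Section 2.

```python
import numpy as np

def composite_vector(image_features, text_weights, vocabulary):
    """Concatenate the 3 image features with text-term weights over a fixed
    vocabulary (sketch of the composite textual-pictorial descriptor)."""
    text_vec = np.array([text_weights.get(term, 0.0) for term in vocabulary])
    return np.concatenate([image_features, text_vec])

# Usage, combining the earlier sketches (after weighting the parsed terms):
#   img_feat = photo_vs_art_features(cv2.imread("page_image.jpg"))
#   terms = parse_html(url, html)
#   query = composite_vector(img_feat, terms, vocabulary)
#   label = knn_classify(query, train_vectors, train_labels, k=8)
```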

5.2. Portraits vs. Non-portraits

5.2.1. TEST SET I

To classify images into portrait and non-portrait images, a dataset is used consisting of 110 images containing a portrait and an arbitrary set of 100 images, resulting in a total of 210 images. The test set consisted of 32 queries.

Portraits vs. Non-portraits based on Image Features: Typical images in the test sets are shown in Fig. 5.a. Typical queries are shown in Figs. 5.b.1 and 5.c.1. The corresponding results, based on the skin feature, are shown in Figs. 5.b.2 and 5.c.2. Based on the skin detector, the 8-nearest neighbor classifier provided a classification success of 81% for images containing portraits.

5.2.2. TEST SET II

The second dataset is composed of 64 images, of which 26 were portraits and 32 non-portraits (arbitrary photos).

Portraits vs. Non-portraits based on Image Features: Images have been represented by the skin feature in addition to 3 eigenvectors expressing color in the c1c2c3 color space. As a consequence, the skin feature is used together with global color invariant features for classification. Based on a 4-nearest neighbor classifier, a classification success of 72% has been obtained for portrait images and 92% for non-portrait images.

Portraits vs. Non-portraits based on Composite Information: HTML pages were parsed and relevant words were derived. Typical high-frequency words derived for portraits were: usr, credits, team and many different names such as laura, kellerp, eckert, arnold, and schreiber. High-frequency words derived for non-portraits were (see above): photo (in different languages, such as bilder (German) and foto (Dutch)), picture and tour. Based on image features and text, the 8-nearest neighbor classifier with weighted distance provided a classification success of 72% (i.e. 72% were correctly classified) for portrait images and 92% for non-portrait images. From these results it is concluded that the classification accuracy based on both text and visual information is roughly the same as the classification rate based entirely on visual information. This is due to the inconsistent textual image descriptions assigned to portrait images which we found on Internet. For example, different nicknames, surnames and family names were attached to portraits. The use of a list of names in the training set will certainly improve the classification rate.

6. CONCLUSION

In this paper, a framework has been provided to classify images on the basis of textual, pictorial and composite (textual-pictorial) information. The scheme makes use of weighted document terms and color invariant image features to obtain a high-dimensional similarity descriptor to be used as an index. The k-nearest neighbor classifier is used to organize images into semantically meaningful classes: photographical and synthetical images, and portraits (i.e. the image contains a substantial face). From the experimental results it is concluded that for classifying images into photographic/synthetic classes, the contributions of image and text features are equally important. Consequently, high discriminative classification power is obtained based on composite information. Classifying images into portraits/non-portraits shows that pictorial information is more important than textual information. This is due to the inconsistent textual image descriptions for portrait images which were found on Internet. Hence, only a marginal improvement in performance is achieved by using composite information for this classification. Extensions have been made by adding a list of surnames to the training set, enhancing the classification rate.

REFERENCES

1. Proceedings of the First International Workshop on Image Databases and Multi Media Search, IDB-MMS '96, Amsterdam, The Netherlands, 1996.
2. Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries, CVPR, 1997.
3. M. Flickner et al., "Query by Image and Video Content: the QBIC System", IEEE Computer, Vol. 28, No. 9, 1995.
4. T. Gevers and A. W. M. Smeulders, "Color Based Object Recognition", Pattern Recognition, 32, pp. 453-464, March 1999.
5. T. Gevers and A. W. M. Smeulders, "Content-based Image Retrieval by Viewpoint-invariant Image Indexing", Image and Vision Computing, 17(7), 1999.
6. T. Gevers and A. W. M. Smeulders, "The PicToSeek WWW Image Search System", IEEE ICMCS, June 1999.
7. A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching", ACM SIGMOD, pp. 47-57, 1984.
8. G. Salton and C. Buckley, "Term-weighting Approaches in Automatic Text Retrieval", Information Processing and Management, 24, pp. 513-523, 1988.
9. N. Katayama and S. Satoh, "The SR-tree: An Index Structure for High-dimensional Nearest Neighbor Queries", ACM SIGMOD, Arizona, 1997.
10. S. Sclaroff, L. Taycher, M. La Cascia, "ImageRover: A Content-based Image Browser for the World Wide Web", Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries, 1997.
11. M. La Cascia, S. Sethi, S. Sclaroff, "Combining Textual and Visual Cues for Content-based Image Retrieval on the World Wide Web", IEEE Workshop on Content-based Access of Image and Video Libraries, June 1998.
12. C. Frankel, M. Swain and V. Athitsos, "WebSeer: An Image Search Engine for the World Wide Web", TR-96-14, University of Chicago, 1996.
13. S. A. Shafer, "Using Color to Separate Reflection Components", COLOR Research and Application, 10(4), pp. 210-218, 1985.
14. J. R. Smith and S.-F. Chang, "VisualSEEk: A Fully Automated Content-based Image Query System", Proceedings of ACM Multimedia, 1996.
15. Proceedings of SPIE Storage and Retrieval for Image and Video Databases VII, San Jose, 1999.
16. A. Vailaya, M. Figueiredo, A. Jain, H. Zhang, "A Bayesian Framework for Semantic Classification of Outdoor Vacation Images", in Storage and Retrieval for Image and Video Databases VII, M. M. Yeung, B. Yeo, C. A. Bouman, eds., Proc. SPIE 1999, pp. 415-426, 1999.
17. A. Vailaya, M. Figueiredo, A. Jain, H. Zhang, "Content-based Hierarchical Classification of Vacation Images", IEEE International Conference on Multimedia Computing and Systems, June 7-11, 1999.
18. A. Gupta, "Visual Information Retrieval Technology: A Virage Perspective", TR 3A, Virage Inc., 1996.
19. Proceedings of Visual97, The Second International Conference on Visual Information Systems, San Diego, USA, 1997.
20. Proceedings of Visual99, The Third International Conference on Visual Information Systems, Amsterdam, The Netherlands, 1999.
21. M. A. Windhouwer, A. R. Schmidt, M. L. Kersten, "Acoi: A System for Indexing Multimedia Objects", Proc. International Workshop on Information Integration and Web-based Applications and Services, Indonesia, November 1999.
22. H.-H. Yu and W. Wolf, "Scene Classification Methods for Image and Video Databases", Proc. SPIE on Digital Image Storage and Archiving Systems, San Jose, CA, February, pp. 363-371, 1995.
23. D. Zhong, H. J. Zhang, S.-F. Chang, "Clustering Methods for Video Browsing and Annotation", Proc. SPIE on Storage and Retrieval for Image and Video Databases, San Jose, CA, February, 1995.

Figure 4. a. The test sets of photographical and synthetical images. b.1-b.2 Typical photographical query and corresponding result. c.1-c.2 Typical photographical query and corresponding result.

Figure 5. a. The test sets of photographical and skin images. b.1-b.2 Typical skin query and corresponding result. c.1-c.2 Typical skin query and corresponding result.
