Image annotation for adaptive enhancement of uncalibrated color images

Claudio Cusano, Francesca Gasparini and Raimondo Schettini
Dipartimento di Informatica Sistemistica e Comunicazione
Università degli studi di Milano-Bicocca
Via Bicocca degli Arcimboldi 8, 20126 Milano, Italy
{cusano, gasparini, schettini}@disco.unimib.it
http://www.ivl.disco.unimib.it/

Abstract. The paper describes an innovative image annotation tool, based on a multi-class Support Vector Machine, for classifying image pixels into one of seven classes - sky, skin, vegetation, snow, water, ground, and man-made structures - or as unknown. These visual categories mirror high-level human perception, permitting the design of intuitive and effective color and contrast enhancement strategies. As a pre-processing step, a smart color balancing algorithm is applied, making the overall procedure suitable for uncalibrated images, such as images acquired by unknown systems under unknown lighting conditions.

1 Introduction

The new generation of content-based image retrieval systems is often based on whole-image and/or region-based image classification [1]. Several works have therefore been proposed for labeling images with words describing their semantic content; these labels are usually exploited to increase retrieval performance. Among these works, Fan et al. [2] have proposed a multi-level approach to annotate the semantics of natural scenes by using both the dominant image components and the relevant semantic concepts. Jeon et al. [3] have proposed an automatic approach to annotating and retrieving images, assuming that regions in an image can be described using a small vocabulary of blobs generated from image features by clustering. Our interest in image annotation is related to the practical need of automatically performing selective color and contrast enhancement of amateur digital photographs. Color management converts colors from one device space to another; however, originals with bad color quality will not be improved. Since interactive correction may prove difficult and tedious, especially for amateur users, several algorithms have been proposed for color and contrast enhancement. Among these, some authors have suggested selective color/contrast correction for objects having an intrinsic color, such as human skin or sky [4-6]. Fredembach et al. [7] have presented a framework for image classification based on region information, for automatic color correction of real-scene images. Some authors have proposed a region-based approach to color correction restricted to a limited set of image classes (sky, skin, and vegetation). Among these, Fredembach et al. [8] have tested the performance of eigenregions in an image classification experiment, where the goal was to correctly identify semantic image classes such as "blue sky," "skin tone," and "vegetation."

Taking this state of affairs as our point of departure, we propose a pixel classification tool to be used to tune image-processing algorithms in intelligent color devices, namely color balancing, noise removal, contrast enhancement, and selective color correction. The tool is capable of automatically annotating digital photographs by assigning the pixels to seven visual categories that mirror high-level human perception, permitting the design of intuitive and effective enhancement strategies. Image feature extraction and classification may not be completely reliable when acquisition conditions and imaging devices are not known a priori, or are not carefully controlled. While in many cases a human observer will still be able to recognize, say, the skin colors in the scene, we can only guess to what extent a classification algorithm trained on high-quality images can perform the same task on typical unbalanced digital photos. To this end, we explore here the hypothesis that pre-processing the images with a white balance algorithm can improve the classification accuracy.

2 Data selection and pre-processing

Our annotation tool is capable of classifying image pixels into semantically meaningful categories. The seven classes considered - sky, skin, vegetation, snow, water, ground, and man-made structures - are briefly defined in Table 1.

Table 1. Description of the classes

Class                  Description
Ground                 Sand, rocks, beaches, …
Man-made structures    Buildings, houses, vehicles, …
Skin                   Caucasian, Asian, Negroid
Sky                    Sky, clouds, and sun
Snow                   Snow and ice
Vegetation             Grass, trees, woods, and forests
Water                  Sea, oceans, lakes, rivers, …

The set of images used for our experiments consisted of 650 amateur photographs. In most cases, imaging systems and lighting conditions are either unknown or very difficult to control. As a result, the acquired images may show an undesirable shift of the entire color range (color cast). We have approached the open issue of recovering the color characteristics of the original scene by designing an adaptive color cast removal algorithm [9]. Traditional methods of cast removal do not discriminate between images with a true cast and those with predominant colors, but are applied in the same way to all images. This may result in an unwanted distortion of the chromatic content with respect to the original scene. To avoid this problem, a multi-step algorithm classifies the input images as: i) no-cast images; ii) evident cast images; iii) ambiguous cast images (images with a feeble cast, or for which whether or not the cast exists is a subjective opinion); iv) images with a predominant color that must be preserved; v) unclassifiable images. Classification makes it possible to discriminate between images requiring color correction and those in which the chromaticity must, instead, be preserved. If an evident or ambiguous cast is found, a cast remover step, which is a modified version of the white balance algorithm, is then applied. The whole analysis is performed by simple image statistics on the thumbnail image. Since the color correction is calibrated on the type of cast, a wrong choice of the region to be whitened is less likely, and even ambiguous images can be processed without color distortion.
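To give a concrete flavor of the kind of thumbnail statistics involved, the sketch below estimates how far the image chromaticity distribution sits from the neutral axis of the CIELab ab-plane and how concentrated it is, and derives a tentative cast label. This is only a minimal sketch in the spirit of [9]: the statistics, the thresholds, and the function names are illustrative assumptions, not the actual algorithm.

```python
import numpy as np
from skimage.color import rgb2lab

def cast_statistics(rgb_thumbnail):
    """Offset and spread of the chromaticity distribution of a thumbnail
    (RGB values in [0, 1]) in the CIELab ab-plane."""
    lab = rgb2lab(rgb_thumbnail)
    a, b = lab[..., 1].ravel(), lab[..., 2].ravel()
    offset = np.hypot(a.mean(), b.mean())  # distance of the mean from the neutral axis
    spread = np.hypot(a.std(), b.std())    # how concentrated the color cloud is
    return offset, spread

def classify_cast(rgb_thumbnail, t_evident=3.0, t_ambiguous=1.5):
    # A cast is suspected when the ab cloud is concentrated and far from the
    # neutral axis; the thresholds here are illustrative assumptions.
    offset, spread = cast_statistics(rgb_thumbnail)
    ratio = (offset - spread) / spread if spread > 0 else 0.0
    if ratio > t_evident:
        return "evident cast"
    if ratio > t_ambiguous:
        return "ambiguous cast"
    return "no cast or predominant color (needs further analysis)"
```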

After this color balancing step, the dataset was randomly subdivided into a training set of 350 photographs and a test set composed of the remaining 300 images. The salient regions of the images in both sets were manually labeled with the correct class. Our tool has not been designed to classify image pixels directly; instead, it selects and classifies several image tiles, that is, square subdivisions of the image whose area is a fixed fraction of the total image area. The length l of the side of a tile of an image of width w and height h is computed as

l = \sqrt{p \cdot w \cdot h},    (1)

meaning that the area of a tile is p times the area of the whole image (here p = 0.01). We randomly selected two sets of tiles, one from the training set and one from the test set, and used them respectively for the learning and for the validation of the classifier. For both sets, we first drew an image, then a region of that image, and finally selected at random a tile lying entirely inside that region. This process was repeated until we had 1800 tiles for each class extracted from the training set, and 1800 for each class extracted from the test set. At the end of the selection process we thus had two sets of tiles, consisting of 1800 × 7 = 12600 tiles each.
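The tile-selection procedure is straightforward to implement. The sketch below (with hypothetical names, and a NumPy boolean mask standing in for a manually labeled region) computes the tile side from Equation (1) and rejection-samples a tile lying entirely inside the region:

```python
import math
import random

def tile_side(width, height, p=0.01):
    """Side l of a square tile whose area is a fraction p of the image
    area, as in Equation (1): l = sqrt(p * w * h)."""
    return int(round(math.sqrt(p * width * height)))

def sample_tile(region_mask, side, rng=random, max_tries=1000):
    """Pick the top-left corner of a random tile lying entirely inside a
    labeled region; region_mask is a 2-D boolean array."""
    h, w = region_mask.shape
    if h < side or w < side:
        return None
    for _ in range(max_tries):
        y = rng.randrange(0, h - side + 1)
        x = rng.randrange(0, w - side + 1)
        if region_mask[y:y + side, x:x + side].all():
            return y, x
    return None  # region too small or too irregular for this tile size
```

Repeating sample_tile over randomly drawn images and regions until 1800 tiles per class are collected reproduces the selection process described above.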

3 Tiles description

We compute a description of each tile by extracting a set of low-level features that can be computed automatically from the tile simply on the basis of its color and luminance values. We have used histograms because they are very simple to compute and have given good results in practical applications where feature extraction must be as simple and rapid as possible [10]. Two kinds of histograms are used: a color histogram and edge-gradient histograms. The first allows us to describe the pictorial content of the tile in terms of its color distribution; the second give information about the overall strength of the edges in the tile, computed through edge-gradient statistics. To combine the histograms in a single descriptor, we have used what is called a joint histogram [11]. This is a multidimensional histogram which can incorporate different kinds of information, such as color distribution, edge and texture statistics, and any other local pixel feature. Every entry in a joint histogram contains the fraction of pixels in the image that are described by a particular combination of feature values. We used a simple color histogram based on the quantization of the HSV color space into eleven bins [12]. Color histograms, thanks to their efficiency and their invariance to rotation and translation, are widely used for content-based image indexing and retrieval.

3.1 Edge-Gradient Histogram

Two edge-gradient histograms are computed: a vertical edge-gradient histogram and a horizontal edge-gradient histogram. The horizontal and vertical components of the gradient are computed by applying Sobel's filters to the luminance image. For each component, the absolute value is taken and then quantized into four bins by comparison with three thresholds. The thresholds were selected as the 0.25, 0.50, and 0.75 quantiles of the distribution of the absolute value of the gradient components, estimated on a random selection of over two million pixels from the images of the training set. The joint histogram combines the information of the color histogram and of the two edge-gradient histograms, and is thus a three-dimensional histogram with 11 × 4 × 4 = 176 bins. Although this joint histogram is quite small (compared with typical joint histograms of thousands of bins), we believe that the information it conveys can describe the pictorial content of each tile adequately for image annotation applications.
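To make the construction concrete, the following sketch assembles the 11 × 4 × 4 joint histogram from an RGB tile. The exact 11-bin HSV quantization of [12] is not detailed in the text, so the split used here (three achromatic bins plus eight hue sectors), as well as the use of the RGB mean as the luminance image, are illustrative assumptions; the gradient thresholds are the three quantiles estimated on the training set.

```python
import numpy as np
from scipy.ndimage import sobel
from skimage.color import rgb2hsv

def quantize_hsv(hsv, s_thr=0.2, v_low=0.25, v_high=0.75):
    """Map each pixel to one of 11 color bins: 3 achromatic bins
    (dark/gray/light) plus 8 hue sectors. An assumed scheme, not [12]."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    achromatic = np.where(v < v_low, 0, np.where(v > v_high, 2, 1))
    hue_bins = 3 + np.minimum((h * 8).astype(int), 7)
    return np.where(s >= s_thr, hue_bins, achromatic)

def joint_histogram(rgb, thr_x, thr_y):
    """176-bin joint histogram: 11 color bins x 4 horizontal-gradient
    bins x 4 vertical-gradient bins. thr_x and thr_y are the three
    quantile thresholds (0.25, 0.50, 0.75) estimated on the training set."""
    color = quantize_hsv(rgb2hsv(rgb))
    luma = rgb.mean(axis=2)  # stand-in for the luminance image
    gx = np.digitize(np.abs(sobel(luma, axis=1)), thr_x)  # 4 bins: 0..3
    gy = np.digitize(np.abs(sobel(luma, axis=0)), thr_y)
    hist = np.zeros((11, 4, 4))
    np.add.at(hist, (color, gx, gy), 1)
    return (hist / hist.sum()).ravel()  # fractions of pixels, 176 values
```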

4 Support Vector Machines

Here we briefly introduce the Support Vector Machine (SVM) framework for data classification. SVMs are binary classifiers trained according to statistical learning theory, under the assumption that the probability distribution of the data points and of their classes is not known [13-15]. Recently, this methodology has been successfully applied to very challenging and high-dimensional tasks such as face detection [16] and 3-D object recognition [17]. Briefly, an SVM is the separating hyperplane whose distance to the closest point of the training set is maximal. This distance is called the margin, and the points closest to the hyperplane are called support vectors. The requirement of maximal margin ensures a low complexity of the classifier and thus, according to statistical learning theory, a low generalization error. Often, of course, the points in the training set are not linearly separable. When this occurs, a non-linear transformation Φ(⋅) can be applied to map the input space ℜ^d into a high (possibly infinite) dimensional space H, called the feature space. The separating hyperplane with maximal margin is then found in the feature space. Since the only operation needed in the feature space is the inner product, it is possible to work directly in the input space, provided that a kernel function K(⋅,⋅) [18] is available which computes the inner product between the projections in the feature space of two points:

K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, \quad K(x_1, x_2) = \Phi(x_1) \cdot \Phi(x_2).    (2)

Several suitable kernel functions are known; the most widely used are the polynomial kernels

K(x_1, x_2) = (x_1 \cdot x_2 + 1)^k,    (3)

with natural parameter k, and the Gaussian kernel

K(x_1, x_2) = \exp\left(-\frac{\lVert x_1 - x_2 \rVert^2}{\sigma^2}\right),    (4)

with real parameter σ. The label f(x) assigned by the SVM to a new point x is determined by the separating hyperplane in the feature space:

f(x) = \mathrm{sgn}\Big(b + \sum_{i=1}^{N} \alpha_i y_i \big(\Phi(x) \cdot \Phi(x_i)\big)\Big) = \mathrm{sgn}\Big(b + \sum_{i=1}^{N} \alpha_i y_i K(x, x_i)\Big),    (5)

where N is the size of the training set, x_i ∈ ℜ^d are the points of the training set, and y_i ∈ {-1, +1} their classes. The real bias b and the positive coefficients α_i can be found by solving a convex quadratic optimization problem with linear constraints. Since the only points x_i for which the corresponding coefficient α_i can be nonzero are the support vectors, the solution is expected to be sparse. If the training set is not linearly separable even in the feature space, a penalization for the misclassified points can be introduced; this is achieved simply by bounding the coefficients α_i by some value C.

4.1 Multi-class SVM

Although SVMs are designed for the discrimination of two classes, they can be adapted to multi-class problems. A multi-class SVM classifier can be obtained by training several binary classifiers and combining their results. The strategy we adopted for combining SVMs is the "one per class" method [19, 20]: one classifier is trained for each class to discriminate between that class and all the others. Each classifier defines a discrimination function g^(k) that should assume positive values when a point belongs to class k and negative values otherwise. These values are then compared, and the output of the combined classifier is the index k for which the value of the discrimination function g^(k) is largest. The most commonly used discrimination function is the signed distance between the case to classify and the hyperplane, obtained by discarding the sign function in Equation (5) and introducing a suitable normalization:

g^{(k)}(x) = \frac{b^{(k)} + \sum_{i=1}^{N} \alpha_i^{(k)} y_i^{(k)} K(x, x_i)}{\sqrt{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i^{(k)} \alpha_j^{(k)} y_i^{(k)} y_j^{(k)} K(x_i, x_j)}}.    (6)

The label c assigned to a point x by the multi-class SVM is

c(x) = \arg\max_{k \in \{1, \dots, K\}} g^{(k)}(x),    (7)

where K is the number of classes.
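As a concrete sketch of the "one per class" scheme of Equations (6) and (7), the code below trains one Gaussian-kernel binary SVM per class and classifies a point by the largest normalized distance. The use of scikit-learn is an assumption of this sketch (the paper does not name an implementation); note that scikit-learn's RBF kernel exp(-γ‖x_1 - x_2‖²) matches Equation (4) when γ = 1/σ².

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

class OnePerClassSVM:
    """One-per-class combination of binary Gaussian-kernel SVMs (Sec. 4.1).
    The paper reports C = 25 and sigma = 0.1, i.e. gamma = 1 / sigma**2."""

    def __init__(self, n_classes, C=25.0, sigma=0.1):
        self.gamma = 1.0 / sigma ** 2
        self.machines = [SVC(kernel="rbf", C=C, gamma=self.gamma)
                         for _ in range(n_classes)]

    def fit(self, X, y):
        for k, svm in enumerate(self.machines):
            svm.fit(X, np.where(y == k, 1, -1))  # class k vs. all the others
        return self

    def _g(self, svm, X):
        # Equation (6): signed distance normalized by the norm of the weight
        # vector, ||w|| = sqrt(sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)).
        dual, sv = svm.dual_coef_, svm.support_vectors_
        w_norm = np.sqrt(dual @ rbf_kernel(sv, sv, gamma=self.gamma) @ dual.T)
        return svm.decision_function(X) / w_norm.item()

    def predict(self, X):
        # Equation (7): the label is the index of the largest g^(k)(x).
        g = np.stack([self._g(svm, X) for svm in self.machines])
        return g.argmax(axis=0)
```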

5 Experimental results

For classification we used a multi-class SVM constructed according to the "one per class" strategy. The binary classifiers are non-linear SVMs with a Gaussian kernel. To train each of them, we used the 1800 tiles of one class and a random selection of 1800 tiles of the other classes; all the tiles used in the learning phase were taken from the images of the training set. Each SVM was thus trained to discriminate between one class and the others. To evaluate the performance of the classification engine, we used all the 12600 tiles extracted from the test set. For each tile in the training and in the test set, a joint histogram was computed combining color and gradient statistics, as described in Section 3. The results obtained on the test set by the multi-class SVM are reported in Table 2.

Table 2. Confusion matrix of a non-linear multi-class SVM trained with the joint-histogram feature vector. The penalization coefficient C is set to 25, and the kernel parameter σ is set to 0.1. The values are computed only on the test set.

                     Predicted class
True class           Man-made s.  Ground  Skin   Sky    Snow   Veget.  Water
Man-made structures  0.814        0.063   0.022  0.012  0.026  0.046   0.017
Ground               0.035        0.841   0.033  0.009  0.014  0.064   0.004
Skin                 0.016        0.025   0.928  0.011  0.006  0.006   0.008
Sky                  0.010        0.002   0.000  0.899  0.036  0.002   0.051
Snow                 0.017        0.011   0.011  0.074  0.832  0.010   0.045
Vegetation           0.035        0.056   0.014  0.007  0.005  0.869   0.014
Water                0.008        0.018   0.009  0.042  0.040  0.009   0.874

The overall accuracy is about 86%. The best recognized class is skin (more than 92%), while the worst results are obtained on the man-made structures class (about 81%). Typical errors involve classes with overlapping color distributions: ground tiles, in particular, have often been misclassified as vegetation tiles (6.4% of cases) and vice versa (5.6%), snow tiles have been misclassified as sky tiles (7.4%), and man-made structures tiles as ground tiles (6.3%). Table 3 reports the confusion matrix obtained with the same classification strategy on the images of the dataset pre-processed by color balancing. Note that, with or without color correction, the results are worse than those obtained in a previous experiment in which all the images considered were free of color cast [21].

Table 3. Confusion matrix of a non-linear multi-class SVM trained with the joint-histogram feature vector, applying the color balancing pre-processing. The penalization coefficient C is set to 25, and the kernel parameter σ is set to 0.1. The values are computed only on the test set.

                     Predicted class
True class           Man-made s.  Ground  Skin   Sky    Snow   Veget.  Water
Man-made structures  0.830        0.055   0.017  0.011  0.024  0.047   0.016
Ground               0.030        0.857   0.035  0.008  0.011  0.056   0.003
Skin                 0.001        0.006   0.989  0.000  0.002  0.002   0.000
Sky                  0.003        0.002   0.000  0.920  0.032  0.000   0.043
Snow                 0.014        0.008   0.001  0.058  0.890  0.007   0.022
Vegetation           0.039        0.055   0.013  0.005  0.007  0.870   0.011
Water                0.009        0.019   0.008  0.040  0.036  0.008   0.880
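Since the test set contains the same number of tiles (1800) per class, the overall accuracy quoted above can be read directly off a confusion matrix as the mean of its diagonal. A minimal check against Table 2:

```python
import numpy as np

# Table 2, rows = true class, columns = predicted class.
confusion = np.array([
    [0.814, 0.063, 0.022, 0.012, 0.026, 0.046, 0.017],  # man-made structures
    [0.035, 0.841, 0.033, 0.009, 0.014, 0.064, 0.004],  # ground
    [0.016, 0.025, 0.928, 0.011, 0.006, 0.006, 0.008],  # skin
    [0.010, 0.002, 0.000, 0.899, 0.036, 0.002, 0.051],  # sky
    [0.017, 0.011, 0.011, 0.074, 0.832, 0.010, 0.045],  # snow
    [0.035, 0.056, 0.014, 0.007, 0.005, 0.869, 0.014],  # vegetation
    [0.008, 0.018, 0.009, 0.042, 0.040, 0.009, 0.874],  # water
])

print(np.diag(confusion).mean())  # -> about 0.865, i.e. "about 86%"
```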

We also believe that a richer description of the tiles would significantly improve the performance of the classifiers. This will be the main topic of our future research in this area. Having trained a satisfactory classifier for image tiles, we designed a strategy for annotating the pixels of whole images. In order to label each pixel of the image as belonging to one of the classes, we needed a way to select the tiles and then combine multiple classification results. In our approach, the tiles are sampled at fixed intervals. Since several tiles overlap, every pixel of the image is found in a given number of tiles. Each tile is independently classified, and the pixel's final label is decided by majority vote. The size of the tiles is determined by applying Equation (1). Frequently an area of the image cannot be labeled with any of the seven classes selected, and in this case different classes are often assigned to overlapping tiles. To correct this kind of error, and in general to achieve a more reliable annotation strategy, we introduced a rejection option: when the fraction of concordant votes on overlapping tiles lies below a certain threshold, the pixels inside these tiles are labeled as belonging to an unknown class. In practice, the rejection option selects the pixels that cannot be assigned to any class with sufficient confidence. Our annotation method performed successfully on the 300 images of the test set. The tiles are sampled at one third of their side in both the x and y dimensions; as a result, each pixel is found in 9 tiles. Pixels on the image borders are classified using only the available tiles. A sketch of this voting scheme is given below.
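The following is a minimal sketch of the tiling-and-voting scheme. The classify_tile callable (for instance, the joint histogram of Section 3 fed to the multi-class SVM of Section 4) and the rejection fraction are placeholders; the paper does not report the exact rejection threshold it used.

```python
import numpy as np

def annotate_image(image, classify_tile, side, n_classes, reject_frac=0.5):
    """Pixel-wise annotation by overlapping tiles: tiles are sampled every
    side // 3 pixels, each tile is classified independently, and each pixel
    takes the majority label of the tiles covering it; pixels whose winning
    label collects less than reject_frac of the votes are marked unknown (-1)."""
    h, w = image.shape[:2]
    step = max(side // 3, 1)  # one-third sampling -> up to 9 votes per pixel
    votes = np.zeros((n_classes, h, w), dtype=np.int32)
    for y in range(0, h - side + 1, step):
        for x in range(0, w - side + 1, step):
            label = classify_tile(image[y:y + side, x:x + side])
            votes[label, y:y + side, x:x + side] += 1
    total = votes.sum(axis=0)                 # fewer votes near the borders
    winner = votes.argmax(axis=0)
    support = np.take_along_axis(votes, winner[None], axis=0)[0]
    return np.where(support >= reject_frac * np.maximum(total, 1), winner, -1)
```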

Fig. 1. Original (a) and annotated image (b). Classes shown: sky, man-made structures, other classes, unknown.

Fig. 2. Original (a) and annotated image (b). Classes shown: sky, man-made structures, unknown.

Fig. 3. Original (a) and annotated image (b). Classes shown: snow, man-made structures, other classes, unknown.

Fig. 4. Original (a) and annotated image (b). Classes shown: snow, man-made structures, vegetation, other classes, unknown.

Figures 1-4 show examples of annotated images (right), with the identified classes visually represented, compared with the corresponding originals (left). Note the rejected pixels, labeled as unknown. Figures 5 and 6 show the improvement in the annotation performance after a color correction pre-processing of the original images.

Fig. 5. Example of annotation in the presence of a color cast, before (a) and after (b) color correction. In the first case the blue cast confuses the classification process, which erroneously detects regions of water (c). The enhanced image has been correctly annotated (d). Classes shown: sky, snow, water, man-made structures, unknown.

Fig. 6. Example of annotation in the presence of a color cast, before (a) and after (b) color correction. Due to the red cast, almost all pixels have been classified as skin (c). Significantly better results have been obtained on the corrected image (d). Classes shown: sky, skin, man-made structures, other classes, unknown.

Although the accuracy of the tool is quite satisfactory, we plan to further refine the whole strategy. For instance, the rejection option could be improved by introducing a rejection class directly in the SVMs. Furthermore, we plan to introduce new application-specific classes. We are also considering the application of our annotation tool in content-based image retrieval systems.

REFERENCES

1. A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain: Content-based image retrieval at the end of the early years. IEEE Trans. on PAMI Vol. 22 (2000).
2. J. Fan, Y. Gao, H. Luo and G. Xu: Automatic image annotation by using concept-sensitive salient objects for image content representation. Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval, July 25-29, Sheffield, United Kingdom (2004).
3. J. Jeon, V. Lavrenko and R. Manmatha: Retrieval using Cross-Media Relevance Models. Proceedings of the 26th Intl ACM SIGIR Conf. (2003) 119-126.
4. H. Saarelma, P. Oittinen: Automatic Picture Reproduction. Graphics Art in Finland Vol. 22(1) (1993) 3-11.
5. K. Kanamori, H. Kotera: A Method for Selective Color Control in Perceptual Color Space. Journal of Imaging Technologies Vol. 35(5) (1991) 307-316.
6. L. MacDonald: Framework for an image sharpness management system. IS&T/SID 7th Color Imaging Conference, Scottsdale (1999) 75-79.
7. C. Fredembach, M. Schröder and S. Süsstrunk: Region-based image classification for automatic color correction. Proc. IS&T/SID 11th Color Imaging Conference (2003) 59-65.
8. C. Fredembach, M. Schröder and S. Süsstrunk: Eigenregions for image classification. Accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (2004).
9. F. Gasparini and R. Schettini: Color Balancing of Digital Photos Using Simple Image Statistics. Pattern Recognition Vol. 37 (2004) 1201-1217.
10. M. Stricker and M. Swain: The Capacity of Color Histogram Indexing. Computer Vision and Pattern Recognition (1994) 704-708.
11. G. Pass and R. Zabih: Comparing Images Using Joint Histograms. Multimedia Systems Vol. 7(3) (1999) 234-240.
12. I.J. Cox, M.L. Miller, S.M. Omohundro and P.N. Yianilos: Target Testing and the PicHunter Bayesian Multimedia Retrieval System. Advances in Digital Libraries (1996) 66-75.
13. V. Vapnik: The Nature of Statistical Learning Theory. Springer (1995).
14. T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning. Springer (2001).
15. C. Cortes and V. Vapnik: Support-Vector Networks. Machine Learning Vol. 20(3) (1995) 273-297.
16. E. Osuna, R. Freund and F. Girosi: Training support vector machines: an application to face detection. Proceedings of CVPR'97 (1997).
17. V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik and T. Vetter: Comparison of view-based object recognition algorithms using realistic 3D models. Artificial Neural Networks ICANN'96 (1996) 251-256.
18. K.R. Müller, S. Mika, G. Rätsch, K. Tsuda and B. Schölkopf: An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks Vol. 12(2) (2001) 181-201.
19. J. Weston and C. Watkins: Support vector machines for multiclass pattern recognition. Proc. Seventh European Symposium on Artificial Neural Networks (1999).
20. K. Goh, E. Chang and K. Cheng: Support vector machine pairwise classifiers with error reduction for image classification. Proc. ACM Workshops on Multimedia: Multimedia Information Retrieval (2001) 32-37.
21. C. Cusano, G. Ciocca and R. Schettini: Image annotation using SVM. Proc. Internet Imaging V, Vol. SPIE 5304 (2004) 330-338.