USING ONE-CLASS SVM OUTLIER DETECTION FOR VERIFICATION OF COLLABORATIVELY TAGGED IMAGE TRAINING SETS

Hanna Lukashevich, Stefanie Nowak, Peter Dunker
Fraunhofer Institute for Digital Media Technology, Ehrenbergstr. 31, 98693 Ilmenau, Germany

ABSTRACT

Supervised learning requires adequately labeled training data. In this paper, we present an approach for the automatic detection of outliers in image training sets using a one-class Support Vector Machine (SVM). The image sets were downloaded from photo communities solely based on their tags. We conducted four experiments to investigate whether the one-class SVM can automatically differentiate between target and outlier images. As test setup, we chose four image categories, namely Snow & Skiing, Family & Friends, Architecture & Buildings and Beach. Our experiments show that in all tests a significant tendency to remove the outliers and retain the target images is present. This offers a great opportunity to gather large data sets from the web without the need for a manual review of the images.

Index Terms— Outlier Detection, One-class SVM, Data Set Verification, Image Classification

1. INTRODUCTION

Automatic methods for indexing, classification and retrieval are well suited for providing flexible access to data. The construction of appropriate classifiers is commonly based on supervised training and requires adequately labeled training data. Such data sets can either be manually labeled by experts or retrieved from web photo communities like Flickr (http://www.flickr.com), where the images are tagged by users. In the latter case, the tags have to be manually revised, since they do not necessarily describe the content of the image. Upload services commonly provide the possibility to upload a set of photos and assign the same tags to the whole set. This fact significantly reduces the reliability of user-given tags.
So the key challenge is the automatic detection of images that do not match the provided tag and their subsequent removal from the training sets. In our approach, we show that a one-class SVM can differentiate between images with wrongly associated tags and correctly tagged images. In the following, we refer to images that do not match their tags as outliers and to correctly tagged
images as targets. The paper is organized as follows. First, we briefly present how other research deals with uncertain or noisy data. Then, we introduce the one-class SVM classifier and explain the four experiments that were conducted. We give a short summary of the database and the visual features we use for classification, and discuss the results of the experiments.
2. RELATED WORK

In general, there are three ways of dealing with incorrectly labeled training data. Firstly, a manual review of the data often takes place. In practice, this means that one or more persons look through the data and decide whether the images fit into the category or not (e.g., in [1] or [2]). As this work is very time-consuming, many researchers test their algorithms on standard databases that have been manually reviewed beforehand. Secondly, some research groups focus on developing robust learning algorithms and classifiers that can deal with uncertain data. In [3], an approach is proposed to learn classifiers from automatically selected relevant frames in weakly labeled video. Many methods focus on retrieving images from the web by textual keywords (e.g., [4], [5], [6]). In [4], Fergus et al. utilize probabilistic latent semantic analysis (pLSA) to automatically learn topic models for retrieving object categories from Google's Image Search results. They show that their method improves the retrieval results and that it can be applied for re-ranking Google's Image Search result list or for training classifiers. In [5], images of animals are collected via a Google text search on the category name. With the help of the surrounding text, positive and negative image examples for each category are automatically selected and manually refined in a relevance feedback loop. A completely automatic approach is introduced by Schroff et al. [6]. They investigate different approaches to download images from the web, automatically sort the images into drawings and natural images of the desired category, and experiment with text-based and visual-based approaches for re-ranking. They train a two-class SVM with a radial basis function kernel, where the best text-based re-ranked images serve as positive examples and randomly
retrieved images from the complete download as negative examples, using only visual features for classification. Thirdly, there are approaches that automatically eliminate wrongly labeled training data in a pre-processing step and then train their classifiers on the reduced image sets. To compute a representation of a landmark site, Li et al. [7] collect thousands of images by tags from Flickr. Outlier images are automatically sorted out through a clustering step based on the gist of the images and through a geometric verification step that searches for a common 3D structure in the single clusters. They create an iconic scene graph and reject isolated nodes by tag-based filtering. In [8], target images from a web search are found through a k-nearest neighbour search on visual clusters, combined with a feature selection method that rejects features based on strangeness. Our approach relies on the assumption that a large proportion of all images that are given the same tag relate to the desired category and share similar visual properties to a certain extent (this assumption is also made in [8]). The visual similarity is expressed through the extracted features and used to separate targets from outliers without explicitly telling the algorithm what the negative and positive examples are. In contrast to [7], we focus on general scene categories, so that one cannot make an assumption concerning the 3D structure of objects depicted in the image.

3. MATHEMATICAL BACKGROUND
3.1. One-class SVM

The one-class Support Vector Machine (SVM) was first proposed by Schölkopf et al. [9] for estimating the support of a high-dimensional distribution. As in the case of a two-class SVM, a kernel function is used to map the feature vectors into a higher-dimensional space, where most of the data are separated from the origin by a large margin. Given the training vectors x_i ∈ R^n, i = 1, ..., l, the model is estimated by solving

    min_{w, ξ, ρ}  (1/2) wᵀw − ρ + (1/(νl)) Σ_{i=1}^{l} ξ_i    (1)
    subject to  wᵀφ(x_i) ≥ ρ − ξ_i,  ξ_i ≥ 0,

where ρ/‖w‖ specifies the distance from the decision hyperplane to the origin, and ξ_i are slack variables. The trade-off parameter ν ∈ (0, 1] corresponds to the expected fraction of outliers among the feature vectors. A solution of system (1) enables the usage of the decision function

    sgn( Σ_{i=1}^{l} α_i K(x_i, x) − ρ ),

where K(x_i, x_j) ≡ φ(x_i)ᵀφ(x_j) is a kernel function, equivalent to a dot product of the feature vectors mapped into the higher-dimensional space, and α is the vector of Lagrange multipliers needed to solve eq. (1). In the experiments we use the most common type of kernel, namely the Radial Basis Function (RBF):

    K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²).

3.2. Proposed estimation of kernel parameters

In this paper, we use a novel approach to optimize the one-class SVM kernel parameters. Due to the lack of class labels in the one-class SVM, it is not possible to optimize the kernel parameters using cross-validation, as is common for a two-class SVM. In the case of the RBF kernel, we have to tune the kernel parameter γ. It was shown by Schölkopf et al. [9] that the trade-off parameter ν is an upper bound on the fraction of outliers (training points outside the estimated region) and a lower bound on the fraction of support vectors. Our optimization method is based on minimizing the differences between the fraction of outliers f_out, the fraction of support vectors f_SV and the trade-off parameter ν. The optimal γ minimizes

    (ν − f_SV)² + (ν − f_out)².

4. EXPERIMENTAL FRAMEWORK

As already mentioned in Section 1, for each image category (see Sec. 5.1) we distinguish between two classes of images, namely targets and outliers. Additionally, all images are randomly split into training and test sets. The four experiments of the proposed experimental framework are schematically displayed in Figure 1.
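The parameter estimation of Sec. 3.2 can be sketched as follows. This is a minimal illustration, assuming scikit-learn's OneClassSVM as the one-class SVM implementation; the random data stands in for the real image feature vectors, and the candidate grid for γ is an illustrative choice, not taken from the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # synthetic stand-in for the image feature vectors

nu = 0.2                        # expected fraction of outliers
best_gamma, best_cost = None, np.inf
for gamma in np.logspace(-3, 2, 12):               # illustrative candidate grid
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X)
    f_sv = len(model.support_vectors_) / len(X)    # fraction of support vectors
    f_out = float(np.mean(model.predict(X) == -1)) # fraction of rejected points
    cost = (nu - f_sv) ** 2 + (nu - f_out) ** 2    # criterion of Sec. 3.2
    if cost < best_cost:
        best_cost, best_gamma = cost, gamma
```

Because ν bounds f_out from above and f_SV from below, both fractions approach ν for a well-chosen γ, which is exactly what the squared-difference criterion rewards.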
Experiment 1: We train the one-class SVM on the targets and test it on the outliers. For this experiment, the images are not divided into training and test sets. With this experiment we intend to check how well the one-class SVM is able to capture the properties of the targets.

Experiment 2: To make sure that the model in Exp. 1 is not over-trained (over-training leads to a loss of generalization: the model captures the properties of the particular targets rather than the class properties), we perform the following cross-check. Now we train the one-class SVM using only targets from the training set, and test on targets from the test set and on all outliers (from both training and test sets).

Experiment 3: Here we come to a real-world scenario, where the information about targets and outliers is not available. In this experiment, we train the one-class SVM on all available images (both targets and outliers, test and training sets) and then check whether images classified by the model as outliers are targets or outliers in reality.

Experiment 4: As a cross-check to Exp. 3, we now train the one-class SVM on all training images and test on all test images.
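The four train/test configurations above can be sketched with boolean masks. This is a hypothetical illustration: `is_target` and `in_train` are assumed ground-truth and split indicators with the paper's approximate proportions, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
is_target = rng.random(n) > 0.2    # ca. 80% targets, 20% outliers
in_train = rng.random(n) < 0.75    # 75%/25% random train/test split

idx = np.arange(n)
experiments = {
    # Exp. 1: train on all targets, evaluate on everything (self-test included)
    1: (idx[is_target], idx),
    # Exp. 2: train on targets of the training set, test on all remaining images
    2: (idx[is_target & in_train], idx[~(is_target & in_train)]),
    # Exp. 3: train on all images, then inspect which ones the model rejects
    3: (idx, idx),
    # Exp. 4: train on the training set, test on the test set
    4: (idx[in_train], idx[~in_train]),
}
```

In Exps. 2 and 4 the train and test index sets are disjoint, which is what makes them cross-checks for the self-testing Exps. 1 and 3.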
Fig. 1. Schematic representation of experiments 1-4 described in Sec. 4. The panel enumeration corresponds to the number of the experiment. The left side of each panel displays the training phase and the samples used for training, while the right side correspondingly depicts the test phase. Exp. 1: train on targets, test with outliers; Exp. 2: train on targets from the training set, test with the rest; Exp. 3: train on all images and investigate which images are regarded as outliers; Exp. 4: train on the training set and test on the test set.
Fig. 2. Results of the four experiments described in Sec. 4, displayed as ROC curves. The fraction of accepted targets corresponds to the true positive rate and the fraction of accepted outliers corresponds to the false positive rate. In all experiments we vary the trade-off parameter as ν = {0.01, 0.03, 0.05, 0.1, 0.15, 0.20, 0.30, 0.5, 0.8}. The steps are chosen non-equidistantly to provide a finer resolution in the more interesting region of possible ν values.
5. EVALUATION AND RESULTS

5.1. Database and Feature Extraction

Flickr not only allows searching for images by keywords, but also retrieving photos from Flickr groups by their group names. Flickr groups provide disk space for photos, videos and discussion boards, and connect people who are interested in the same topic. Photos in a Flickr group usually correspond to one topic. Different search terms were used to extract photos from the Flickr groups and organize them into four categories: 1) Beach, 2) Snow & Skiing, 3) Architecture & Buildings and 4) Family & Friends. For instance, the queries Beach and Beaches, Beach Photography or beach life were used for the category Beach. Each category consists of about 1000 photos. The data set for all categories was randomly split into training (75%) and test (25%) sets. The photos of all sets were manually annotated with respect to outliers and targets by three persons to generate a ground truth for the experiments. This annotation yielded ca. 20% outliers in each of the above-mentioned categories.

For the image classification task, we implemented a set of visual features partly derived from the MPEG-7 visual descriptors [10]. All features of one image are concatenated and form a feature vector of 427 floating-point values. The following MPEG-7-derived features were used: the color layout descriptor with 10 DCT coefficients for each color channel, the color structure descriptor, and the edge histogram descriptor without final quantization and with an additional histogram bin for homogeneous regions. Based on the dominant color descriptor, the color temperature of each LUV cluster centroid was calculated and an eight-bin color temperature histogram was created. Further non-MPEG-7-derived features were integrated: a Haar wavelet feature with three decompositions and the energy and deviation value of each band, a simple HSV color histogram, and a feature describing the blur factor of an image.

5.2. Results

The aim of the conducted experiments is to illustrate the principal feasibility of the approach. We present the results of all four experiments in the form of ROC curves [11]. A ROC curve makes it possible to observe how the change of a free parameter influences the performance of the system. It represents the relation between the true positive rate and the false positive rate of a retrieval algorithm for each parameter value in one plot. The results of our experiments are depicted in Fig. 2. Exp. 1 partly includes a self-test: the training samples are used in testing in order to establish a ROC curve. This influence of the training samples is excluded in Exp. 2. As one can see, the true positive rates in Exp. 1 become slightly worse in comparison to Exp. 2. In Exp. 3, we test a real-world scenario to prove the feasibility of automatic outlier rejection.
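The ROC points obtained by sweeping ν can be sketched as follows. This is a hypothetical illustration of the Exp. 3 setup using scikit-learn's OneClassSVM; the features and ground-truth labels are synthetic stand-ins, and only the ν grid is taken from the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))        # synthetic stand-in feature vectors
is_target = rng.random(300) > 0.2    # synthetic ground truth, ca. 20% outliers

roc = []
for nu in [0.01, 0.03, 0.05, 0.1, 0.15, 0.20, 0.30, 0.5, 0.8]:
    # Exp. 3 style: fit on all images, then see which ones the model accepts
    model = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(X)
    accepted = model.predict(X) == 1
    tpr = accepted[is_target].mean()   # fraction of accepted targets
    fpr = accepted[~is_target].mean()  # fraction of accepted outliers
    roc.append((fpr, tpr))
```

Each (fpr, tpr) pair yields one point of a curve such as those in Fig. 2; larger ν values reject more images and therefore move the point towards the origin.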
Later, in Exp. 4, we check to what extent the model established in Exp. 3 can generalize the class properties, and our results show that it performs well on unseen data. In every experiment the one-class SVM rejects more outliers than target images and delivers better results than a random selection process. This confirms the assumption that correctly labeled images share similar visual properties.

6. CONCLUSIONS

All in all, our experiments demonstrate a proof of concept for automatically rejecting outliers from training data with a one-class SVM. Outlier detection with a one-class SVM offers promising research directions, e.g., as a pre-filtering step or combined with approaches that can deal with a small amount of uncertainty. An open issue remains concerning the polysemy of tags. In the research community, automatically resolving polysemy is regarded as an unsolved problem. The assumption that an intrinsic visual similarity is inherent in correctly labeled images does not hold when dealing with ambiguous tags.

7. REFERENCES

[1] M.J. Huiskes and M.S. Lew, "The MIR Flickr retrieval evaluation," in Proc. of the 1st Intern. Conf. on Multimedia Information Retrieval (ACM MIR), Vancouver, British Columbia, Canada, Oct. 2008, pp. 39-43.

[2] P. Dunker, S. Nowak, A. Begau, and C. Lanz, "Content-based mood classification for photos and music," in Proc. of the 1st Intern. Conf. on Multimedia Information Retrieval (ACM MIR), Vancouver, British Columbia, Canada, Oct. 2008, pp. 97-104.

[3] A. Ulges, C. Schulze, D. Keysers, and T. Breuel, "Identifying relevant frames in weakly labeled videos for training concept detectors," in Proc. of the 2008 Intern. Conf. on Content-based Image and Video Retrieval, New York, NY, USA, 2008, pp. 9-16.

[4] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, "Learning object categories from Google's image search," in Proc. of the 10th Intern. Conf. on Computer Vision, Beijing, China, 2005, vol. 2, pp. 1816-1823.

[5] T.L.
Berg and D.A. Forsyth, "Animals on the Web," in Proc. of the 2006 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), 2006, vol. 2, pp. 1463-1470.

[6] F. Schroff, A. Criminisi, and A. Zisserman, "Harvesting image databases from the web," in Proc. of the IEEE 11th Intern. Conf. on Computer Vision, 2007, pp. 1-8.

[7] X. Li, C. Wu, C. Zach, S. Lazebnik, and J.M. Frahm, "Modeling and recognition of landmark image collections using iconic scene graphs," in Proc. of the 10th European Conf. on Computer Vision, 2008, pp. 427-440.

[8] K. Wnuk and S. Soatto, "Filtering internet image search results towards keyword based category recognition," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2008, pp. 1-8.

[9] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2001.

[10] T. Sikora, "The MPEG-7 visual standard for content description - an overview," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 696-702, Jun. 2001.

[11] J.A. Swets, "Measuring the accuracy of diagnostic systems," Science, vol. 240, pp. 1285-1293, 1988.