Learning Object Detectors from Weakly-Labeled Internet Images∗

Inayatullah Khan, Peter M. Roth, and Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology, Austria
{khan,pmroth,bischof}@icg.tugraz.at

Abstract

Learning visual object detectors typically requires a large amount of labeled data, which is hard to obtain. To overcome this limitation, we propose a system that avoids any human labeling and autonomously learns an object detector from unlabeled Internet images. Without using any visual information, we obtain these images by just typing the name of an object class. First, we determine the presence of the target object in a number of images and then estimate its localization. Since we have to cope with ambiguously or wrongly labeled data, we apply multiple instance learning (MIL) in both stages. In the experimental results, we demonstrate the benefits of this approach on publicly available benchmark datasets. In fact, we show that we can train competitive object detectors without using visually labeled data.

1. Introduction

Object category recognition and detection are among the main challenges in computer vision. Due to high variations in object shape, appearance, background, and scale, many state-of-the-art methods require a large amount of labeled training data. For the classification task, an annotator has to provide a set of training images containing the objects of interest; for the detection task, additionally the bounding box around the location of the object has to be specified. Hence, the annotation effort limits the scalability of such methods, which raises the need for methods reducing the amount of human supervision.

In particular, there is an increasing interest in exploiting the Internet for training visual models [11, 4, 16, 18] in an unsupervised way. Especially, keyword-based image search engines allow for gathering large amounts of unlabeled images without any manual effort. The images are filtered and ranked by the search engine according to the textual information in their close surroundings. Even though in this way most retrieved images are related to the visual concept, the results also contain unrelated images. While category classification models can be learned from such noisy data collected from the Internet (e.g., [18]), it is more difficult to train object detectors, which often require the exact location of the objects of interest.

Due to the complexity of the detection task and the higher supervision requirements, previous work has focused on learning models for object classification. For example, Fergus et al. [11] developed a model to learn object categories from images retrieved from Google image search. Berg and Forsyth [4] focused on the classification of animal categories. Schroff et al. [16] re-ranked the images based on the text accompanying Internet images and built an object category classification model.∗

∗This work has been supported by the "knowcenter project" and by the HEC Pakistan under OSs (Phase-II-Batch-I).

Guillaumin et al. [14] learned an image classifier and tried to improve classification by combining text and images.

One way to resolve the ambiguities of image labels is to use multiple-instance learning (MIL). In contrast to supervised learning, in MIL the training examples (instances) are grouped into bags and the label is attached to the whole bag (instead of a single instance). While a negative bag ensures that all of its instances are negative, for positive bags it is only ensured that the bag contains at least one positive instance; however, the identity of these positive instances is not known. For instance, Vijayanarasimhan and Grauman [18] proposed a MIL-based method to build an object category classification model by grouping the results obtained from different image search engines into bags of images.

In this paper, we build on these ideas and propose a system to learn an object detector model from images automatically obtained from the Internet, requiring only the name of the target class. In particular, we first learn an object model that specifies the presence or absence of a target object. Then, in a localization step, we crop image patches describing the object, which are finally used to learn a detector. To demonstrate the benefits of the approach, we show results for two different publicly available datasets.
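To make the bag semantics concrete, the following minimal Python sketch (our own illustration with hypothetical data structures, not part of the original system) encodes the MIL labeling constraint:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Bag:
    instances: List[list]  # one feature vector per instance
    label: int             # +1 or -1, attached to the whole bag

def satisfies_mil_assumption(bag: Bag, instance_labels: List[int]) -> bool:
    """A negative bag forces all instance labels to -1; a positive bag
    only requires at least one instance label to be +1."""
    if bag.label == -1:
        return all(y == -1 for y in instance_labels)
    return any(y == +1 for y in instance_labels)
```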

2. Object Classification and Localization

The goal of this paper is to learn an object detector from weakly-labeled Internet images. This is realized in a three-stage framework, which is illustrated in Figure 1. The first stage collects bags of images for a given object class name from the Internet and learns a binary object category classifier. The result is a collection of highly confident images containing the target object, but with unknown location. The second stage uses this collection to learn an object detector by treating each image as a bag of unknown objects. This detector is used to localize the target objects in the given images. These bounding boxes are then used in the third stage to train a fully supervised part-based model [10]. In the following, we first review the used representation and learning method and then describe the single stages in more detail.

2.1. Image Representation and Learning

Representation. We use two types of features to capture the shape and appearance of an object, i.e., a pyramid of histograms of oriented gradients (PHOG) and a pyramid of histograms of visual words (PHOW) [5]. For PHOG, the number of histogram bins is set to 40, the number of pyramid levels is 2, and the shape angle is set to 360 degrees. We use these fixed numbers based on the best results reported in [5]. Similarly, for the Internet images, color SIFT descriptors are computed densely on a regular grid with a cell size of five pixels at four different scales. A random selection of the collected descriptors is then quantized into 1000 visual words using standard k-means clustering. Each image is represented by a PHOW with 2 × 2 subdivisions. We use the χ²-based RBF kernel, where the width of the kernel is set to the mean χ² distance.
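As an illustration of the PHOW pipeline described above, the following sketch builds the 1000-word vocabulary and the χ²-based RBF kernel (dense color SIFT extraction is abstracted away; the helper names and the descriptor sampling size are our own assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, n_words=1000, seed=0):
    """Quantize a random selection of dense SIFT descriptors into visual words."""
    sample = np.vstack(descriptor_sets)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(sample), size=min(100_000, len(sample)), replace=False)
    return KMeans(n_clusters=n_words, n_init=1, random_state=seed).fit(sample[idx])

def bow_histogram(descriptors, kmeans, n_words=1000):
    """Bag-of-words histogram for one (sub)region; the full PHOW additionally
    concatenates such histograms over a 2x2 spatial pyramid."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / max(hist.sum(), 1.0)

def chi2_rbf_kernel(H1, H2, eps=1e-10):
    """K(x, z) = exp(-chi2(x, z) / mu), with mu set to the mean chi^2 distance."""
    d = 0.5 * np.sum((H1[:, None] - H2[None, :]) ** 2 /
                     (H1[:, None] + H2[None, :] + eps), axis=2)
    return np.exp(-d / d.mean())
```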

Multiple Instance Learning. For learning from Internet images, we need a robust method which is able to cope with unsurely labeled data. Thus, we decided to use multiple instance learning (MIL). Let $X_l = \{(B_1, y_1), \ldots, (B_l, y_l)\}$ and $X_u = \{B_{l+1}, \ldots, B_{l+u}\}$ be the sets of labeled and unlabeled bags, respectively, where each bag is represented as a collection of instances $B_i = \{x_{i1}, \ldots, x_{in_i}\}$, $x_{ij} \in \mathbb{R}^d$. Then, assuming $y_{ij}$ to be the unknown label of $x_{ij}$, the goal is to obtain an instance classifier. In our application, we build on sMIL [18]. In general, for MIL SVMs the objective is to compute the decision hyperplane $w$ with bias $b$ as follows:

$$\min_{\{y_{ij}\}} \; \min_{w, \xi, b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \sum_{j=1}^{n_i} \xi_{ij} \tag{1}$$

$$\text{s.t.} \quad \forall i \le l,\; y_i = 1: \quad \sum_{j=1}^{n_i} y_{ij} \ge 2 - n_i$$

$$\forall i \le l,\; y_i = -1: \quad y_{ij} = -1$$

$$\forall i, j: \quad y_{ij}\left(w^T \phi(x_{ij}) + b\right) \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0,$$

where $C > 0$ is a constant controlling the trade-off between training error minimization and margin maximization, and $\xi_{ij}$ represents the slack variable for creating the soft margin. Note that a particular instance $x_{ij}$ is described by a single non-linear feature space $\phi(x_{ij})$. sMIL [18] considers the positive instances to be sparse within the bags, uses an iterative refinement scheme by introducing weights for the instances in positively labeled bags, and uses a single-view representation of the instances. In our case, we slightly extend the formulation in Eq. (1) by replacing $\phi(x_{ij})$ with the responses of $V$ kernels, $h(x_{ij}) = [\phi_1(x_{ij}), \ldots, \phi_V(x_{ij})]$, based on a multiple-feature representation of each instance. We then solve the problem in Eq. (1) using the same iterative optimization as proposed by Andrews et al. [2].
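The multi-feature extension has a simple kernel-space interpretation: stacking the feature maps $h(x_{ij}) = [\phi_1(x_{ij}), \ldots, \phi_V(x_{ij})]$ makes the inner product decompose into a sum of the per-feature kernels. A minimal sketch of this combination (the uniform weighting is our own assumption; the paper does not specify the weights):

```python
import numpy as np

def combined_kernel(kernel_matrices, weights=None):
    """Combine V per-feature kernel matrices K_v into a single kernel.

    Stacking feature maps h(x) = [phi_1(x), ..., phi_V(x)] yields
    <h(x), h(z)> = sum_v <phi_v(x), phi_v(z)> = sum_v K_v(x, z).
    """
    K = np.stack(kernel_matrices)                # shape (V, n, n)
    if weights is None:
        weights = np.full(len(K), 1.0 / len(K))  # uniform average
    return np.tensordot(weights, K, axes=1)      # shape (n, n)
```

The combined kernel is then plugged into the sMIL optimization of Eq. (1) in place of the single kernel induced by $\phi$.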

2.2. Stage 1: Object Category Classification

The goal of this work is to exploit keyword-based image search engines for the automatic learning of object models. Since keyword-based search engines use text for ranking images, there is an ambiguity: an image may or may not be related to the object of interest. Therefore, we explicitly model this ambiguity by collecting bags of images via a number of different image search engines. Such bags of images can be obtained by translating the object's name into a number of languages, e.g., English, Spanish, German, Arabic, and French, using an automatic translation tool∗. The idea of using linguistic translation to assist visual search has been explored previously by several authors (e.g., [12, 18, 8]). Additionally, rather than including all images retrieved by one image search in one bag, we assume that a fixed number of top-ranked images contains at least one positive image and treat the rest of the images as members of unlabeled bags. Finally, we get a binary object category classifier that can classify unseen images, i.e., decide whether the target object is present in the image or not. Using this model, we re-rank the collected images based on their confidence scores. We threshold the confidence scores for the next stage of our framework and consider highly confident images as labeled and less confident ones as unlabeled images. In this way, we can, to some extent, prevent the propagation of noise from one stage to the other.

∗http://www.google.com/language_tools?hl=en
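The bag-collection step of Stage 1 could be sketched as follows (`translate` and `search` are hypothetical helpers wrapping the translation tool and the image search APIs; the bag sizes follow the experimental setup in Section 3.1):

```python
def collect_bags(class_name, engines, languages, translate, search, n_top=30):
    """Form one positively labeled bag and one unlabeled bag per
    (engine, language) query, following the assumption that the
    top-ranked results contain at least one true example.

    translate(text, lang) and search(engine, query) are assumed helpers.
    """
    positive_bags, unlabeled_bags = [], []
    for engine in engines:
        for lang in languages:
            query = translate(class_name, lang)
            urls = search(engine, query)
            positive_bags.append(urls[:n_top])           # top-ranked images
            unlabeled_bags.append(urls[n_top:2 * n_top]) # next lower-ranked
    return positive_bags, unlabeled_bags
```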

Figure 1. Three-stage framework: (a) training an object classifier from Internet images; (b) object localization and cropping of training data; (c) learning an object detector using the cropped patches.

2.3. Stage 2: Object Localization

In order to train an object detector, we need to localize the target objects in the training images. This ambiguity can also be cast as a MIL problem, where each image is considered as a bag containing its sub-regions. Thereby, a positive bag ensures that at least one sub-region covers the target object. We need some reasonable guess about the object's location so that we can group these sub-regions to form a bag. There are a number of approaches available to produce a set of possible locations of an object in an image: the objectness measure proposed by Alexe et al. [1], the saliency detection proposed by Bruce and Tsotsos [6], the hierarchical segmentation proposed by Arbeláez et al. [3], and the method proposed by Endres and Hoiem [7]. Here, in particular, we use the objectness measure [1] and the hierarchical segmentation [3] to represent an image by a group of sub-regions, as sketched below.
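A sketch of this bag construction (the proposal and feature-extraction helpers are hypothetical placeholders standing in for the objectness measure [1], the hierarchical segmentation [3], and the features of Section 2.1):

```python
def image_to_bag(image, propose_windows, extract_features, n_regions=100):
    """Turn one image into a MIL bag of sub-region instances.

    A positively classified image from Stage 1 becomes a positive bag:
    at least one of its sub-regions is assumed to cover the target object.
    """
    windows = propose_windows(image, n_regions)  # e.g., objectness [1] or
                                                 # hierarchical segmentation [3]
    instances = [extract_features(image, w) for w in windows]
    return windows, instances
```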

2.4. Stage 3: Learning a Supervised Detector

It has been shown that part-based models such as [10] yield state-of-the-art results in various object detection tasks, but such models require labeled object locations for training. We treat the bounding boxes predicted by our model on the Internet images as such annotations and use the latent SVM part-based model [10] to train a final detector. This detector is then used to detect objects in any test image. Note that recently a new approach for object localization has been proposed by Nguyen et al. [15], which requires only image-level labels for training; such a model could be trained directly on the output of Stage 1.
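The hand-over to Stage 3 then amounts to converting the most confident Stage-2 detection per image into a bounding-box annotation; a sketch under the simplifying assumption of one annotation per image (names and the detector API are ours):

```python
def detections_to_annotations(localizer, images):
    """Convert the Stage-2 localizer's most confident detection per image
    into a bounding-box annotation for training the part-based model [10]."""
    annotations = []
    for image in images:
        boxes, scores = localizer.detect(image)  # Stage-2 MIL detector (assumed API)
        if not boxes:
            continue
        best = max(range(len(scores)), key=lambda i: scores[i])
        annotations.append((image, boxes[best]))
    return annotations
```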

3. Experimental Results

To demonstrate the proposed system, we give a detailed experimental analysis of the framework based on two publicly available datasets, i.e., the ETHZ [13] and the PASCAL VOC [9] benchmark datasets. In particular, we show that our framework allows us to train an object detector without having any visually labeled image data. We first substantiate the benefits of using multiple features on the ETHZ dataset and then give a detailed analysis of the full system on the PASCAL dataset.

3.1. Experimental Setup

For both experiments, we collected training bags of images using three different image search engines (Yahoo, Google, and Bing). Each object class name is translated from English into 12 different languages before querying a keyword-based image search engine. In this way, we obtain an ambiguous collection of images for each class, indexed differently by a total of 36 text-based sources. The top 30 images are assumed to be positively labeled and the next lower-ranked 30 images form an unlabeled bag of images. The negative bags of images were formed from images unrelated to the target object. Using these bags of images, we train a MIL-based binary object classifier and then re-rank the images according to the confidence of the classifier. We chose the highly confident images as labeled and the low-confidence ones as unlabeled images.

3.2. ETHZ

The ETHZ test dataset is mainly built for object localization based on shape features. It contains five object classes (apple logos, bottles, giraffes, mugs, and swans) with a total of 255 images. The total number of object instances is 289, as in some images objects appear multiple times. The dataset is highly challenging, with high intra-class variability, and the objects appear at various scales. Most of the images are photographs, but there are also some drawings and paintings. In the majority of the images, the target object occupies only a small fraction of the image. The purpose of our experiments on this test dataset is to compare the performance of our proposed extension to the baseline sMIL [18] and to demonstrate that the detection results can be improved by using multiple features in parallel.

[Figure 2: five panels (apple logos, bottles, giraffes, mugs, swans), each plotting the detection rate against the number of false positives per image for the proposed approach and sMIL.]

Figure 2. Object detection results on the ETHZ dataset based on the detection rate under the 50% intersection-over-union criterion. The results are shown for sMIL and the multi-feature extension used in this paper.

For this experiment, we train MIL-based binary object detectors using the images collected by the first stage of our framework. We treat each image as a bag of sub-regions. The hierarchical segmentation [3] is used to decompose an image into a bag of 100 sub-regions, where each sub-region is represented by a separate PHOG and PHOW for our proposed model and by concatenated features for the baseline. The final detection results are shown in Figure 2. In particular, we show the results obtained by the original sMIL approach and by our modified method. The performance is measured by the detection rate against the number of false positives per image (FPPI), averaged over all 255 test images. A detection is correct if the intersection-over-union ratio with the ground-truth bounding box is greater than 50%. The plots show that the proposed method performs well on all classes compared to the baseline at the moderate false-positive rate of 0.5 FPPI, used as a reference point. This shows that using multiple features in parallel can improve the detection results.
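For reference, the 50% intersection-over-union criterion can be computed as follows (a standard definition, not code from the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / max(area_a + area_b - inter, 1e-10)

def is_correct(detection, ground_truth, threshold=0.5):
    """A detection counts as correct if IoU with the ground truth exceeds 50%."""
    return iou(detection, ground_truth) > threshold
```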

3.3. PASCAL VOC

For this experiment, we use a subset of the object classes from the PASCAL VOC 2007 dataset. In particular, we select mainly compact objects where fully supervised methods perform reasonably well [9], instead of classes such as potted plant and dining table where the performance is rather low. The selected objects are horse, aeroplane, bicycle, motorbike, bus, train, boat, and car. We chose PASCAL VOC 2007 because this dataset has been used for the evaluation of other state-of-the-art object detection models and is still being used for evaluation [19].

We chose the highly confident images as labeled and the low-confidence ones as unlabeled images for input to Stage 2 of the framework. In Stage 2, we sample 100 windows from the 10,000 windows per image generated according to the objectness measure (OM) [1], using the publicly available code, in order to decompose an image into a bag of objects. Based on this input (labeled and unlabeled bags of objects), we train a MIL-based binary object localization model using our proposed method. This model is used to detect the target objects in the queried images; some examples are shown in Figure 3. Finally, in Stage 3 of the framework, this output is used to train a fully supervised part-based detection model (LSVM-part-based) [10]. We evaluate the performance of this model by detecting objects in the PASCAL test dataset.

In order to show the performance gain, we use the object localization model learned in Stage 2 as a baseline to directly detect objects in the test dataset, i.e., without training the LSVM-part-based detector. We have observed that in most of the queried images, some of which are shown in Figure 3, the target object is the central focus and sometimes largely fills the frame. Hence, the OM may produce a well-localized object in the images fed to Stage 2. Therefore, in order to quantify this effect, we take the most confident window per image produced by the OM as a bounding-box annotation for training the LSVM-part-based detector as our second baseline.

Method                                     | bike | mbike | bus  | aero | horse | car  | train | boat | mAP
-------------------------------------------|------|-------|------|------|-------|------|-------|------|-----
Proposed method, Stage 2 (baseline)        | 30.1 | 27.5  | 29.7 |  9.5 |  6.0  | 29.4 |  8.3  |  2.4 | 17.9
LSVM-part-based + OM (baseline)            | 31.0 | 28.6  | 30.8 |  7.1 |  4.2  | 32.3 |  8.4  |  2.4 | 18.1
LSVM-part-based + proposed method, Stage 3 | 39.6 | 34.5  | 36.7 | 11.8 |  7.0  | 34.2 | 11.9  |  3.4 | 22.4
LSVM-part-based [10]                       | 59.5 | 48.7  | 49.6 | 28.9 | 56.8  | 57.9 | 45.1  | 15.2 | 45.2
Best VOC2007 [9]                           | 40.9 | 37.5  | 39.3 | 26.2 | 33.5  | 43.2 | 45.3  |  9.4 | 34.4
MKL detector [17]                          | 47.8 | 45.5  | 50.7 | 37.6 | 51.2  | 50.6 | 45.3  | 15.3 | 43.0

Table 1. Object detection results (AP in percent) for different methods on eight of the PASCAL VOC 2007 [9] challenge categories.

The thus obtained results, in comparison to other state-of-the-art methods, are given in Table 1. In particular, we give a comparison to the LSVM-part-based model [10], the challenge winners (Oxford, INRIA PlusClass, and UoCTTI) [9], and an MKL-based approach [17]. In contrast to our approach, these methods follow the 'comp3' protocol for the detection task and thus are fully supervised, using both the image label and the location of the object during training. The detection performance for each object class is measured by the average precision (AP) on the entire PASCAL VOC 2007 test set (4952 images). Notice that the results for the LSVM-part-based model are taken from the currently improved results†, where an object class is represented by a three-component mixture of deformable part models.

†http://people.cs.uchicago.edu/~pff/latent/
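The AP values above follow the PASCAL VOC 2007 protocol; for reference, a sketch of the 11-point interpolated average precision used in that challenge (a standard definition, not the authors' code):

```python
import numpy as np

def voc2007_ap(recall, precision):
    """11-point interpolated average precision (PASCAL VOC 2007 protocol).

    recall/precision: arrays over ranked detections, recall non-decreasing.
    """
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap
```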

Figure 3. Object locations in Internet images obtained automatically by Stage 2: bicycle, motorbike, bus, aeroplane, horse, car, train, and boat. Correct detections are shown on the left side and bad detections (where the bounding box covers either only parts of the object or is much larger than the object) on the right side.

Compared to the baselines, the proposed method (Stage 3) performs better for all object classes. However, the performance for horse, train, and boat is not satisfactory. The main reason is the diversity in appearance, shape, and scale of these objects, which our model does not capture very well using the standard representation. As expected, our model cannot outperform the state-of-the-art detectors, which use high-quality annotations for training. However, the results are still encouraging, as we learn discriminative visual object models provided only with a textual prior.

4. Conclusion

In this paper, we proposed a system which is able to learn from unlabeled images downloaded from the Internet. In fact, the goal was to learn an object detector just by using a textual prior, i.e., without using any visually labeled data. Our system works in three stages: in the first stage, we determine the presence of a target object; in the second stage, we localize the object; finally, in the third stage, we learn a detector using a supervised method. Since uncertainly labeled data has to be handled in the first two stages, we apply a MIL method for the actual learning task. To demonstrate the benefits of the approach, we evaluated it on two benchmark datasets. Even though state-of-the-art results could not be reached, which was not a goal of this work, we were able to show that we are able to train a reasonable detector without using any visually labeled data.

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, 2010.
[2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2003.
[3] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. In CVPR, 2009.
[4] T. L. Berg and D. A. Forsyth. Animals on the web. In CVPR, 2006.
[5] A. Bosch, A. Zisserman, and X. Munoz. Representing shape with a spatial pyramid kernel. In CIVR, 2007.
[6] N. D. B. Bruce and J. K. Tsotsos. Saliency based on information maximization. In NIPS, 2006.
[7] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, 2010.
[8] O. Etzioni, K. Reiter, S. Soderland, and M. Sammer. Lexical translation with application to image search on the Web. In Machine Translation Summit XI, 2007.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge, 2007.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 2009.
[11] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In ICCV, 2005.
[12] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Internet image searches. Proceedings of the IEEE (Special Issue on Internet Vision), 2010.
[13] V. Ferrari, T. Tuytelaars, and L. Van Gool. Object detection by contour segment networks. In ECCV, 2006.
[14] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010.
[15] M. H. Nguyen, L. Torresani, F. de la Torre, and C. Rother. Weakly supervised discriminative localization and classification: a joint learning process. In ICCV, 2010.
[16] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In CVPR, 2007.
[17] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.
[18] S. Vijayanarasimhan and K. Grauman. Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization. In CVPR, 2008.
[19] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR (to appear), 2011.
