Noname manuscript No. (will be inserted by the editor)

A co-boost framework for learning object categories from Google Images with 1st and 2nd order features Xi Liu · Zhi-Ping Shi · Zhong-Zhi Shi

Received: date / Accepted: date

Abstract Conventional object recognition techniques rely heavily on manually annotated image datasets to achieve good performance, but collecting high quality datasets is laborious. Image search engines such as Google Images seem to provide large quantities of object images; unfortunately, a large portion of the returned images are irrelevant. In this paper, we propose a semi-supervised framework for learning visual categories from Google Images. We exploit a co-training algorithm, the CoBoost algorithm, and integrate it with two kinds of features, the 1st and 2nd order features, which define a bag of words representation and the spatial relationship between local features respectively. We create two boosting classifiers based on the 1st and 2nd order features during training, in which each classifier provides labels for the other. The 2nd order features are generated dynamically rather than extracted exhaustively, to avoid high computation. An active learning technique is also introduced to further improve the performance. Experimental results show that the object models learned from Google Images by our method are competitive with state-of-the-art unsupervised approaches and some supervised techniques on standard benchmark datasets.

Keywords CoBoost learning · Co-training · 1st and 2nd order features · Google Images

X. Liu
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, PR China
E-mail: [email protected]
Z.-P. Shi
E-mail: [email protected]
Z.-Z. Shi
E-mail: [email protected]

1 Introduction

The recognition of generic object categories is one of the most challenging tasks in computer vision. Several works [10, 14, 18, 22, 23] have shown that a number of well-defined object categories can be learned well. Although many advances have been made recently, current object recognition systems still do not perform well, because real-world object images may contain large amounts of background clutter, extensive occlusions, or incredible intra-class variation.

Among the obstacles for existing object recognition approaches, the acquisition of scalable training datasets needs to be addressed first. Most current visual models rely heavily on the availability of large image datasets, and in some cases the datasets may even have to keep growing as algorithm capabilities increase. Presently many of the datasets are hand-collected, which is a time-consuming and arduous task. What is more, human errors and biases are inevitably introduced during manual collection and annotation.

The web seems to provide a feasible solution to this problem: type keywords related to an object category and we can easily get hundreds of object images. However, the precision of image search engines, which are mainly based on text rather than image content, is not satisfying because of text ambiguity and polysemy. As can be seen in Fig. 1, typically more than one third of the returned images are unrelated to the category. If we can learn good object models from these noisy images, the reward is tremendous: it would allow us to obtain a classifier for any object category and thus provide high quality image datasets for other visual models. Many works have focused on this


recently; some methods combine text and visual models to achieve good performance [3, 25, 29–31], while others learn directly from the visual information by employing graphical models or multiple-instance learning [8, 11, 16, 17, 27, 28]. They all seek to obtain high precision in a fully automatic manner, but this is genuinely difficult and hard to generalize due to the variable quality of the search returns and the nature of the task.

Fig. 2 The proposed visual category learning framework.

Fig. 1 Images returned from Google Images using the keywords "airplane" and "motorbike".

Learning object models from noisy images can be formulated as a semi-supervised learning task. We employ a co-training algorithm to accommodate the semi-supervised setting. An independent and redundant feature split is necessary for a co-training style algorithm to work well [4]. Recently, [18] proposed a novel method for object categorization that integrates both feature selection and higher-order spatial feature extraction. Low and high order features are simultaneously absorbed into the method to represent object images. That work shows that low and high order features make good representations and can form an approximately independent and redundant feature split for the co-training algorithm. Also inspired by the idea of higher-order spatial feature extraction in that work, we dynamically generate high order features based on the selected low order features to avoid exhaustive computation.

In this paper, we propose a novel semi-supervised framework to learn discriminative category models, which is illustrated in Fig. 2. Given a set of object images collected automatically from Google Images, we select some images as labeled data and leave the others unlabeled. With the labeled and unlabeled data, we show how to integrate a co-training algorithm, CoBoost, proposed by [7], with two kinds of features: the 1st and 2nd order

features, which define a bag of words representation and the spatial relationship between local features respectively. We co-train two boosting classifiers based on the two kinds of features iteratively; each classifier is trained on one kind of feature and provides labels for the other classifier. A final classifier is obtained by combining the two boosting classifiers. The trained boosting classifiers also serve to select features, and the 2nd order features are generated incrementally based on the selected 1st order features rather than extracted exhaustively in advance, to avoid high computation. The process of feature selection and extraction is similar to the work of [18]; however, our approach is designed for labeled and unlabeled data, whereas the method in [18] is intended for labeled data only. Furthermore, an active learning scheme is introduced to allow users to label the images the two boosting classifiers most disagree on. Better performance can be expected with a little more supervision. Experimental results show that our learned model is competitive with supervised methods for object recognition on the benchmark Caltech datasets, and also matches or exceeds the state-of-the-art ranking results on the Fergus and Berg datasets.

2 Related work

Several existing approaches handle the problems of filtering and learning visual categories from a large number of unlabeled data in a semi-supervised or unsupervised manner. Below we give a brief overview of some of these works and discuss the differences between them and our method.

Li et al. [16] propose an incremental learning framework, termed "OPTIMOL", which trains a classifier built upon a Hierarchical Dirichlet Process (HDP). The HDP does not assume the topic number a priori and avoids the issue of selecting the proper number of topics. They first use the top 15 images returned by the image


search engine as the seed set for a category and learn an initial model from that set. The model then classifies subsequently obtained images; newly classified relevant images are added to the dataset and the HDP model is refined. As this process repeats, more and more images are absorbed and the final classifier learns the object category with larger intra-class variation. A big problem facing this method is the accuracy of the top 15 images: if the initial model is not good enough, the model may be iteratively updated in a wrong direction.

In the work of [8], a translation and scale invariant pLSA (TSI-pLSA) is trained on the image search result set. It extends pLSA to incorporate location information by introducing a second latent variable. The hidden topic most representative of the category is discovered according to topic performance on a validation set, and novel images are classified with the discovered topic. One disadvantage of this approach is the difficulty of determining the appropriate number of topics and of selecting the right topic for each category. The authors of [8] later propose a novel pLSA-based model [9] for learning object models from Google Images. They extend the pLSA-based model proposed by [27] in two ways: one is adding location information into the pLSA model in a straightforward manner, the other is picking the topics via an automatically gathered noisy validation set. This avoids the problem of manually determining the topics in [8].

The methods of [8] and [9] both incorporate location information by introducing a second latent variable. They capture only the absolute location of objects within the image using a discrete grid, and they argue that this spatial model is useful because images containing good examples usually have the object of interest in the center. Our method does not require such an assumption: it automatically incorporates the intrinsic spatial structure by using the spatial relationship between local features. Compared with the absolute locations used in [8] and [9], the spatial features used in our method are more descriptive. Besides, [9] gathers the validation set for each object category by combining the top images returned from Google Images for the object keywords in different languages. Although this requires no human labeling, it is affected by the quality of the top search results. Our method requires a few initially labeled images, but the manual labeling effort is small and an active labeling strategy is provided to make labeling easier for users; we consider the extra manual effort worthwhile with respect to the resulting performance. Finally, the training of a pLSA-based model is quite costly since many parameters have to be estimated via EM. In comparison, our method is very


fast because it automatically selects discriminative features within the boosting framework.

Recently, [28] converted the category learning problem into a multiple-instance learning (MIL) problem and offered a direct solution for constructing discriminative category models from the images returned by search engines. They treat the groups of images returned by different search engines, queried with the object name in different languages, as positive bags and learn a large-margin classifier by optimizing an objective function with constraints that reflect the expected sparsity of true positive examples. The learned classifier can re-rank the search results and categorize novel images. This method also requires no manual supervision, but it does not exploit spatial information and is partially affected by the search results.

More recently, [5] formulated a simple two-step filtering scheme to select good images. It first uses saliency filtering to discard images with a cluttered background, and then considers both content and contour consistency to further discard unsuitable images. Strict criteria such as a simple background and consistent contour and content are applied, so a large portion of images are removed and only a very small set of images are kept as good object images. This is acceptable for the image blending application in [5], but such a scheme is not appropriate for learning object categories in our setting. Our method can obtain more object images from the initial labeled objects by iteratively co-training two boosting classifiers with an active labeling strategy.

Unlabeled images are easily available, and in recent years semi-supervised learning techniques have been increasingly used in image classification and object recognition [6, 15, 34]. Semi-supervised learning algorithms attempt to leverage a large amount of unlabeled data to boost the classification accuracy obtained with labeled data. Some semi-supervised learning methods include the Transductive SVM [2], co-training [4], and graph-based semi-supervised learning methods [33, 36]; a good survey can be found in [35]. In particular, multi-view semi-supervised learning such as co-training works well when the different views of the data are each effective for learning. It is therefore meaningful to integrate the CoBoost algorithm with the bag of words representation and spatial features for the purpose of category learning.

The main contributions of our work are: 1) a co-training based algorithm, CoBoost, is combined with the 1st and 2nd order features to learn object models from noisy images. The learning approach naturally incorporates both the good theoretical foundation of the original CoBoost algorithm and the rich


representational abilities of the 1st and 2nd order features; 2) the discriminative 1st and 2nd order features are selected within the framework under a semi-supervised setting. In particular, the high-dimensional 2nd order features are dynamically generated and selected; due to the substantially reduced feature dimension, memory and computation costs are reduced significantly.


Algorithm 1: Build2ndOdrFeat
Input: visual word pair (wa, wb)
Output: a vector of bin counts
1  Suppose there are Na instances of wa and Nb instances of wb in the image
2  Initialize Na spatial histograms, using each instance of wa as a reference center
3  for i = 1, . . . , Na do
4      Calculate the number of instances of wb falling in each bin
5  end
6  Sum up the corresponding bins over the Na spatial histograms
7  Divide the bin counts by Na

3 1st and 2nd order features

We use bags of local feature descriptors as our image representation in this paper. Local feature descriptors are image statistics extracted from pixel neighborhoods or patches that denote distinctive patterns and properties. As defined in [18], the features originating from the local feature descriptors are called 1st order features, and the features that encode the spatial information between a pair of patches are called 2nd order features.

Local descriptors are clustered and the prototype of each cluster is treated as a "visual word"; in this manner a visual vocabulary is built. For any image, each local feature descriptor within it is assigned to the closest visual word from the vocabulary. The image is then considered as a "document" composed of "visual words". Counting each visual word in the "document", each word frequency gives one 1st order feature.
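As a rough illustration of this construction (our own sketch, not code from the paper), the following Python snippet computes the 1st order feature vector of a single image from its local descriptors and a k-means vocabulary; all names are hypothetical.

```python
import numpy as np

def first_order_features(descriptors, vocabulary):
    """Bag-of-visual-words histogram: one count per visual word.

    descriptors: (m, d) array of local feature descriptors of one image
    vocabulary:  (k, d) array of visual word prototypes (k-means centers)
    Returns a length-k vector of word frequencies (the 1st order features).
    """
    # Assign every descriptor to its closest visual word
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    # Count how often each visual word occurs in the image "document"
    return np.bincount(words, minlength=len(vocabulary))
```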

Fig. 3 a) Spatial histogram. b) Example of a 2nd order feature.

The 2nd order features retain information about the spatial layout of two 1st order features. Given a pair of 1st order features (wa, wb), a corresponding 2nd order feature can be built with a spatial histogram. The spatial histogram, as illustrated in Fig. 3a, is divided into three log scales and four directions, i.e. twelve bins. The log scales deal with the larger uncertainty of bin counts at longer ranges, while the directions capture the semantics 'above', 'below', 'left' and 'right'. Taking an instance of wa as the reference center of the spatial histogram, we count how many instances of wb fall into each bin. These bin counts form a 2nd order feature. Algorithm 1 details the process of building the 2nd order features, which follows the work of [18]; a code sketch is given below.
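The following Python sketch is one possible reading of Algorithm 1 (our own illustration, not the authors' code). The ring radii and the helper's name are assumptions, and positions are taken to be already normalized by the interest point size as described in Section 5.1.

```python
import numpy as np

def build_2nd_order_feature(pos_a, pos_b, radii=(2.5, 7.0, 20.0)):
    """Sketch of Build2ndOdrFeat (Alg. 1): 3 log scales x 4 directions = 12 bins.

    pos_a: (Na, 2) positions of instances of word wa (reference centers)
    pos_b: (Nb, 2) positions of instances of word wb
    radii: assumed outer radius of each log-scale ring
    Returns a 12-dimensional vector of bin counts averaged over the Na centers.
    """
    hist = np.zeros((3, 4))                      # scale x direction
    for center in pos_a:                         # one spatial histogram per wa instance
        d = pos_b - center                       # offsets of the wb instances
        dist = np.hypot(d[:, 0], d[:, 1])
        angle = np.arctan2(d[:, 1], d[:, 0])
        scale = np.digitize(dist, radii)         # 0,1,2 inside the rings, 3 = outside
        direction = ((angle + np.pi / 4) % (2 * np.pi) // (np.pi / 2)).astype(int)
        for s, a in zip(scale, direction):
            if s < 3:                            # ignore points beyond the outermost ring
                hist[s, a] += 1
    return hist.ravel() / max(len(pos_a), 1)     # average over the Na reference centers
```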

It is worth noting that there is a data sparseness issue for the 2nd order features, due to the large number of possible 2nd order features and the limited number of local descriptors in an image. To overcome this problem, we employ a top-N technique in the construction of the 2nd order features: we assign a local feature descriptor to the top N closest visual words rather than only to the closest one. Here we choose N = 5. Fig. 3b shows a twelve-dimensional 2nd order feature.

4 CoBoost learning

The CoBoost algorithm performs well when an independent and redundant feature split exists. The 1st and 2nd order features are both known to be effective for classifying object categories, as illustrated in [1, 13, 24, 32]. Besides, the two kinds of features are not strongly correlated with each other; the occurrences of local features and the spatial information between the local features complement each other to some extent. Therefore, they provide a well-suited feature split for the CoBoost algorithm, and it is natural to combine both of them to accommodate our semi-supervised setting. In the following, we first briefly review the original CoBoost algorithm, then describe how our method integrates CoBoost with the 1st and 2nd order features, and finally clarify some implementation details.

4.1 The CoBoost algorithm

The CoBoost algorithm adapts and generalizes AdaBoost [12] to the semi-supervised problem. It follows the idea of co-training and builds two boosting classifiers based on a pair of independent and redundant features in parallel. In the algorithm, a set of n training examples of the form (x_{1,i}, x_{2,i}) is provided. The first n_l examples are labeled, with labels y^l = (y_1^l, y_2^l, ..., y_{n_l}^l), where each class label y_i^l is either +1 or -1; the remaining n_u = n - n_l examples are unlabeled.

The goal is to find two classifiers $f_1 : X_1 \to \{-1,+1\}$ and $f_2 : X_2 \to \{-1,+1\}$ such that $f_1(x_{1,i}) = f_2(x_{2,i}) = y_i$ for the examples $i = 1, \ldots, n_l$ and $f_1(x_{1,i}) = f_2(x_{2,i})$ as much as possible on the examples $i = n_l+1, \ldots, n$. This leads to the following objective function:

$$Z_{co} = \sum_{i=1}^{n_l} \exp(-y_i g_1(x_{1,i})) + \sum_{i=1}^{n_l} \exp(-y_i g_2(x_{2,i})) + \sum_{i=n_l+1}^{n} \exp(-f_2(x_{2,i})\, g_1(x_{1,i})) + \sum_{i=n_l+1}^{n} \exp(-f_1(x_{1,i})\, g_2(x_{2,i})), \quad (1)$$

where $f_j(x) = \mathrm{sign}(g_j(x))$ and $g_j(x)$ is the unthresholded strong hypothesis, $j = 1, 2$. $Z_{co}$ provides an error bound on the total number of misclassified labeled examples and the amount of disagreement between the two classifiers on the unlabeled examples. The CoBoost algorithm seeks to minimize $Z_{co}$; it divides the function $Z_{co}$ into two parts and builds two boosting classifiers so as to minimize $Z_{co}^1$ and $Z_{co}^2$ respectively. The division is as follows:

$$Z_{co}^1 = \sum_{i=1}^{n_l} \exp(-y_i g_1(x_{1,i})) + \sum_{i=n_l+1}^{n} \exp(-f_2(x_{2,i})\, g_1(x_{1,i})), \qquad Z_{co}^2 = \sum_{i=1}^{n_l} \exp(-y_i g_2(x_{2,i})) + \sum_{i=n_l+1}^{n} \exp(-f_1(x_{1,i})\, g_2(x_{2,i})). \quad (2)$$

Through this division, it is apparent that the two boosting classifiers should be created simultaneously in the CoBoost algorithm to achieve the minimization. The training of the boosting classifiers is similar to AdaBoost and works in rounds. During each round, there are two stages in which one classifier is updated with the other classifier fixed. The fixed classifier provides pseudo-labels of the unlabeled data for the classifier to be updated. Thus we can define the pseudo-labels

$$\tilde{y}_{i,j} = \begin{cases} y_i, & 1 \le i \le n_l \\ \mathrm{sign}(g_{3-j}(x_{3-j,i})), & n_l < i \le n \end{cases}, \quad (3)$$

where $g_j(x) = \sum_t \alpha_t^j h_t^j(x)$, $j \in \{1, 2\}$, $t = 1, 2, \ldots, T$. According to formulas (1), (2) and (3), $Z_{co}^j$ is rewritten as

$$Z_{co}^j = \sum_{i=1}^{n} \exp\big(-\tilde{y}_{i,j}\,(g_j^{t-1}(x_{j,i}) + \alpha_t^j h_t^j(x_{j,i}))\big) \;\longrightarrow\; \sum_{i=1}^{n} D_t^j(i)\,\exp(-\tilde{y}_{i,j}\,\alpha_t^j h_t^j(x_{j,i})), \quad (4)$$

where $t$ is the current round number, $D_t^j(i) = \exp(-\tilde{y}_{i,j}\, g_j^{t-1}(x_{j,i}))/Z_t^j$ is the virtual data distribution, and $Z_t^j$ is a normalization constant. In each round, a weak hypothesis $h_t(x_{j,i})$ with minimal value of $2\sqrt{W_+ W_-}$ is selected from the hypothesis space based on $x_j$ by using $\tilde{y}_{i,j}$ and $D_t^j(i)$, where $W_+ = \sum_{i: h_t(x_{j,i}) = \tilde{y}_{i,j}} D_t(i)$ and $W_- = \sum_{i: h_t(x_{j,i}) = -\tilde{y}_{i,j}} D_t(i)$. The boosting classifier is updated by adding the weak hypothesis $h_t(x_{j,i})$ multiplied by the confidence value $\alpha_t = 0.5\ln(W_+/W_-)$; that is, it becomes $\sum_t \alpha_t^j h_t^j(x_j)$ after $t$ rounds. This procedure is repeated for $T$ rounds while alternating between the two classifiers. The final hypothesis is the combined output of the two boosting classifiers.

4.2 CoBoost learning with 1st and 2nd order features

In our category learning problem, we have n images, of which n_l are labeled. Each image is represented as a pair of 1st and 2nd order feature vectors (x_{1,i}, x_{2,i}). The goal is to find two classifiers f_1(x_{1,i}) and f_2(x_{2,i}) such that f_1(x_{1,i}) = f_2(x_{2,i}) = y_i^l for i = 1, ..., n_l and f_1(x_{1,i}) = f_2(x_{2,i}) as much as possible for i = n_l+1, ..., n. Algorithm 2 gives the procedure of CoBoost learning with the 1st and 2nd order features. Note that two main steps are added to the original CoBoost algorithm: dynamically generating new 2nd order features (Alg. 2, step 5) and actively labeling new images (Alg. 2, step 8). We describe our algorithm in detail below.

Firstly, we set the initial outputs of both the 1st and the 2nd order feature based boosting classifiers to zero. The two boosting classifiers are then trained over a number of iterations. During each iteration, there are two stages in which one classifier is updated with the other classifier fixed (Alg. 2, steps 3 and 7). Algorithm 3 illustrates the details of the updating process. The fixed classifier provides pseudo-labels ỹ_i for the other classifier, and the data distribution is recalculated based on ỹ_i and the current normalized output g(x) of the classifier to be updated (Alg. 3, steps 2 and 4). From the data distribution formula D_t(i) = exp(−ỹ_i g(x_i))/Z_t, we can see that the misclassified labeled data and the unlabeled data on which the two boosting classifiers disagree are particularly emphasized. With the data distribution and pseudo-labels, a weak hypothesis h(x) is obtained by minimizing the weighted error, and the boosting classifier is updated by adding the weak hypothesis with a confidence weight (Alg. 3, steps 5, 6, and 7).

Fig. 4 An example of generating new 2nd order features.

Secondly, for the 1st order feature based boosting classifier, features are selected from the 1st order feature pool P1 to create a weak hypothesis such as a stump


classifier; but for the 2nd order feature based boosting classifier, training a weak classifier over all pairs is prohibitively expensive due to the huge number of 2nd order features. To overcome this, we dynamically generate and add new 2nd order features based on the selected 1st order features (Alg. 2, step 5). The 2nd order feature pool P2 is initially empty. When a new 1st order feature z(k) is selected, several new 2nd order features are built from z(k) and the previously selected 1st order features z(j) using Algorithm 1. P2 is then augmented with the generated features, and a weak hypothesis is trained by selecting 2nd order features from P2. The computation is greatly reduced in this way, since only a small portion of the 2nd order features are considered. Note that in the first iteration only one 1st order feature would be selected, which is not enough to generate new 2nd order features; therefore, two 1st order features are selected in the first iteration.

An example of generating new 2nd order features is illustrated in Fig. 4. In the figure, the 1st order features 4 and 6 were already selected in previous iterations, and in the current iteration the 1st order feature 9 is selected. From feature 9 and the previously selected features 4 and 6, two new 2nd order features are generated, i.e. (9,4) and (9,6). The selected 1st order features 4 and 6 (the green blocks) and the newly selected feature 9 (the yellow block) are representative of the object but cannot reflect the object's structural information, whereas the 2nd order features combining 4 and 6, or 9 and 6, are representative of the object's structure, as shown in the figure. The 1st order features capture the appearance of individual parts, while the 2nd order features capture the appearance of geometric structures. In this way, the two boosting classifiers built on the 1st and 2nd order features complement each other to some extent.

Lastly, an active learning scheme is introduced to allow users to label images (Alg. 2, step 8). Every Tl iterations, the algorithm picks out the top Nl unlabeled images sorted by the value of g1(x1) * g2(x2) in ascending order. The smaller the value of g1(x1) * g2(x2), the more strongly the two classifiers disagree on the image, and the harder it is for the classifiers to learn.

4.3 Implementation details

a) Weak classifiers

We employ classification and regression trees (CART) for the 1st order feature classifier and multilayer perceptrons (MLP) for the 2nd order feature classifier. The CART classifier is set to have at most two nodes.


Algorithm 2: CoBoost with 1st and 2nd order features
Input: data = {(x_{1,i}, x_{2,i})}_{i=1}^{n}, y_i^l ∈ {−1, +1}, i = 1, ..., n_l; P1, P2: feature pools; g1(x1), g2(x2): boosting classifiers
1  Initialize: x_{1,i} ∈ P1 = {all visual words}, x_{2,i} ∈ P2 = ∅, g1^0(x_{1,i}) = g2^0(x_{2,i}) = 0, k = 0
2  for t = 1, . . . , T do
3      g1^t(x_{1,i}) = TrainUpdateBoost(P1, data, g2^{t−1}(x_{2,i}), g1^{t−1}(x_{1,i}))   (Alg. 3)
4      Denote the feature index selected by g1^t(x_{1,i}) as i(t)
5      // Add new 2nd order features
       if i(t) has not been selected by g1(x1) before then
           k = k + 1; z(k) = i(t)
           for j = 1, . . . , k − 1 do
               for each image do Build2ndOdrFeat(z(k), z(j)) end   (Alg. 1)
           end
           Augment feature pool P2
       end
6      if P2 == ∅ then g2^t(x_{2,i}) = g2^{t−1}(x_{2,i}); continue end
7      g2^t(x_{2,i}) = TrainUpdateBoost(P2, data, g1^{t−1}(x_{1,i}), g2^{t−1}(x_{2,i}))   (Alg. 3)
8      // Active label
       if t % Tl == 0 then   (Tl: the number of iterations between labeling rounds)
           Label the images whose g1(x1) * g2(x2) rank in the top Nl among the unlabeled images in ascending order
           m = m + Nl
       end
   end
Output: the final classifier f(x) = sign(g1^T(x1) + g2^T(x2))
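As a small illustration of the active-labeling step (Alg. 2, step 8), the following Python sketch, with hypothetical names, selects the Nl unlabeled images on which the two classifiers disagree most:

```python
import numpy as np

def select_images_to_label(g1_scores, g2_scores, unlabeled_idx, Nl=8):
    """The smaller g1(x1)*g2(x2) is, the more the two boosting classifiers
    disagree on an image, so those images are shown to the user for labeling.
    g1_scores, g2_scores: unthresholded outputs g1(x1), g2(x2) for all images."""
    unlabeled_idx = np.asarray(unlabeled_idx)
    disagreement = np.asarray(g1_scores)[unlabeled_idx] * np.asarray(g2_scores)[unlabeled_idx]
    order = np.argsort(disagreement)          # ascending: most disagreed-on first
    return unlabeled_idx[order[:Nl]]
```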

Algorithm 3: TrainUpdateBoost(P, data, g_fixed(x'), g_to-be-updated(x))
Input: P: feature pool; data: all training data (labeled and unlabeled); g_fixed(x'): the fixed classifier, g'(x') = g_fixed(x'); g_to-be-updated(x): the classifier to be updated, g(x) = g_to-be-updated(x)
1  Select the most confident unlabeled data decided by the value of |g'(x')|, then data = {(x_{1,i}, x_{2,i})}, i = 1, . . . , n_s
2  Set pseudo-labels: ỹ_i = y_i, 1 ≤ i ≤ n_l; ỹ_i = sign(g'(x'_i)), n_l < i ≤ n_s
3  Normalize the value of g(x_i), i = 1, . . . , n_s to a fixed range
4  Set virtual distribution: D(i) = exp(−ỹ_i g(x_i))/Z, where Z = Σ_{i=1}^{n_s} exp(−ỹ_i g(x_i))
5  Train weak classifier h(x) using P with the pseudo-labels and virtual distribution so as to minimize the weighted error
6  Set α = 0.5 · ln((Σ_{i: h(x_i)=ỹ_i} D(i) + ε)/(Σ_{i: h(x_i)≠ỹ_i} D(i) + ε)), ε: a small value
7  Update g(x) = g(x) + α · h(x)
Output: the updated classifier g(x)
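The following Python sketch is a simplified, self-contained reading of Algorithm 3; it is not the authors' implementation. It uses a depth-one decision tree from scikit-learn as a stand-in weak learner, represents a boosting classifier as a list of (alpha, weak_classifier) pairs, and omits the per-feature-pool bookkeeping of the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_update_boost(X, y_labeled, g, g_fixed_scores, n_l, keep=0.3, eps=1e-8):
    """Sketch of Alg. 3. X: (n, d) feature matrix, y_labeled: +1/-1 labels of the
    first n_l rows, g: list of (alpha, weak_clf) pairs forming the classifier to
    update, g_fixed_scores: unthresholded output of the fixed classifier."""
    n = len(X)
    # Step 1: keep only the most confident unlabeled images (top `keep` fraction by |g'(x')|)
    unlabeled = np.arange(n_l, n)
    conf_order = unlabeled[np.argsort(-np.abs(g_fixed_scores[n_l:]))]
    use = np.concatenate([np.arange(n_l), conf_order[: int(keep * len(unlabeled))]])
    # Step 2: pseudo-labels from the true labels / the fixed classifier
    y_tilde = np.where(g_fixed_scores[use] >= 0, 1.0, -1.0)
    y_tilde[:n_l] = y_labeled
    # Step 3: current output of the classifier being updated, clipped to a fixed range
    g_scores = sum(a * clf.predict(X[use]) for a, clf in g) if g else np.zeros(len(use))
    g_scores = np.clip(g_scores, -2.0, 2.0)
    # Step 4: virtual distribution emphasizing mistakes and cross-classifier disagreement
    D = np.exp(-y_tilde * g_scores)
    D = D / D.sum()
    # Step 5: weak classifier trained to minimize the weighted error
    h = DecisionTreeClassifier(max_depth=1).fit(X[use], y_tilde, sample_weight=D)
    # Step 6: confidence weight alpha (eps avoids division by zero)
    correct = h.predict(X[use]) == y_tilde
    alpha = 0.5 * np.log((D[correct].sum() + eps) / (D[~correct].sum() + eps))
    # Step 7: add the weighted weak hypothesis to the boosting classifier
    g.append((alpha, h))
    return g
```

The clipping to [-2, +2] and the 30% confidence cut in this sketch mirror the empirical choices described in the implementation details below.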

Each node corresponds to one 1st order feature, and the feature corresponding to the root node is regarded as the selected 1st order feature i(t). Note that in the first round only one 1st order feature would be selected and no 2nd order features could be added to the empty 2nd order feature pool; to avoid this, we take the two 1st order features corresponding to the root node and the other node as selected features. The MLP classifier has 12 input nodes, 4 hidden nodes and 1 output node; its number of input nodes is exactly equal to the dimension of one 2nd order feature.


The MLP's training is terminated when the error rate falls below 0.3 or the number of iterations exceeds 150.

In the updating of each boosting classifier, the class labels of the unlabeled data are given by the fixed classifier g'(x'). Note that these labels may be erroneous, which suggests using only the most confident unlabeled data rather than all of it. However, selecting too few samples may make convergence slow, while selecting too many may include non-informative or poor samples in the training set. We determine this number empirically and select the top 30% of the unlabeled data ranked by the value of |g'(x')| in descending order. Besides, regarding the data distribution, when the value of the classifier to be updated g(x) becomes large, some data weights may become too large while others approach zero. To solve this problem, we normalize the value of g(x) to a fixed range; by empirical analysis, the range is set to [-2, +2].

b) Stopping criterion

According to the optimization procedure, the CoBoost algorithm stops when there are no misclassifications of the labeled data and no disagreements between the two boosting classifiers on the unlabeled data. For each round, we calculate the number of misclassified labeled images and contentious unlabeled images. In most cases these numbers decrease quickly in the beginning and then converge to a small range of values after some iterations, as shown in Section 5.2. We use a fixed number of iterations in all the experiments.
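As a concrete illustration of the weak classifiers in a) above, one possible scikit-learn configuration is sketched below; the paper does not use scikit-learn, and the exact mapping of "at most two nodes" to tree parameters is our assumption.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# 1st order features: a small CART with at most two split nodes, so each weak
# hypothesis looks at no more than two visual-word frequencies.
cart_weak_learner = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=3)

# 2nd order features: a 12-4-1 multilayer perceptron over one 12-dimensional
# spatial histogram; max_iter=150 mirrors the iteration cap in the paper (the
# "error rate below 0.3" stopping rule would require a custom training loop).
mlp_weak_learner = MLPClassifier(hidden_layer_sizes=(4,), max_iter=150)
```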

5 Experiments

In order to test the effectiveness of our method, we divide the experiments into two parts. The first part is designed to compare the recognition performance with some state-of-the-art supervised and unsupervised methods. The second part is designed to test the re-ranking results on the images returned by Google Images.


5.1 Experimental setup

We test the proposed framework on the following datasets, which we will later refer to by acronyms.

Caltech-7 set (CT): a benchmark dataset for object recognition, which contains 2148 images in total from seven categories. For comparison with the results of several supervised methods, we choose five categories (Airplane, Car, Face, Leopard, and Motorbike) from it.

Berg animal set (BA): a benchmark dataset collected by [3] for image re-ranking, which contains about 10,000 images from 10 different classes of animals (alligator, ant, bear, beaver, dolphin, frog, giraffe, leopard, monkey and penguin).

Fergus Google set (FG): a Google-downloaded image set used in [8], which has seven categories (airplane, car rear, face, guitar, leopard, motorbike, and wristwatch) and contains 600 images on average for each category. The images in FG are divided into three groups: "bad", "ok" and "good". For simplicity, we neglect the "ok" images and only use the "good" and "bad" images.

New Google set (NG): a new Google-downloaded image set collected by ourselves, which has the same five categories as CT and contains about 700 images per category. During collection we drop images smaller than 120×120 pixels and remove abstract images such as drawings and sketches using the method mentioned in [25].

Table 1 details the statistics for each dataset. Since the image datasets used are noisy, 'positive' in the table is the number of actual object images, 'negative' is the number of irrelevant images, and 'total' is their sum.

We use the Difference of Gaussian [19] and Maximally Stable Extremal Region [20] interest point detectors and represent each detected region with a 136-dimensional Gradient Location and Orientation Histogram descriptor [21]. The local feature descriptors are then clustered using k-means, with k = 500. Each image is converted to gray scale and resized to no larger than 400×400 pixels. Besides, for the spatial histogram in Fig. 3a, the scale is normalized according to the interest point size; the radius of the outermost bin is set to 20 times the interest point size.

For all experiments, we randomly label 10 positive and 30 negative images and set the total number of iterations to T = 100. The number of iterations between labeling rounds Tl and the number of images Nl labeled per round in active labeling are specified by the user: for better performance, Tl is set smaller and Nl larger; for less manual effort, Tl is set larger and Nl smaller. To balance the amount of supervision against performance, we empirically set Tl = 10 and Nl = 8 for T = 100 iterations.

The evaluation criterion is the error rate measured at the point of equal error rate on an ROC curve. Fig. 5 shows the ROC curves of our method over the five categories (airplane, car rear, face, leopard and motorbike); the red circles in the figure mark the equal error rate points. A small sketch of this measure is given below.
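For reference, the equal error rate point can be computed from classifier scores as in the following sketch (our own illustration, using scikit-learn's roc_curve):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """Error rate at the ROC operating point where the false positive rate
    equals the false negative rate (1 - true positive rate)."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))   # operating point closest to fpr == fnr
    return (fpr[idx] + fnr[idx]) / 2.0
```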


Table 1 Statistics for the Caltech-7 set, Berg animal set, Fergus Google set and New Google set. '-' indicates no value for the dataset.

Caltech-7     Airplane   Car(rear)   Face   Leopard   Motorbike
Positive      800        526         435    200       798
Negative      -          -           -      -         -
Total         800        526         435    200       798

Berg animal   alligator  ant    bear   beaver  dolphin  frog   giraffe  leopard  monkey  penguin
Positive      291        242    237    81      648      408    362      413      175     254
Negative      1154       685    1458   1057    1290     1954   876      1344     1496    906
Total         1445       927    1695   1138    1938     2362   1238     1757     1671    1160

Google        Airplane   Car(rear)   Face   Leopard   Motorbike
Positive      158        192         138    101       230
Negative      641        327         306    273       253
Total         799        519         444    374       483

Google new    Airplane   Car(rear)   Face   Leopard   Motorbike
Positive      143        201         202    128       176
Negative      549        539         503    528       575
Total         692        740         705    656       751

Fig. 5 ROC curves of our method over five categories.

5.2 Categorize new images

We demonstrate that the discriminative category model learned from Google Images is competitive with the results reported by other authors using both supervised and unsupervised recognition methods. We compare our approach with the semi-supervised methods [8] and [27] and the supervised methods [23] and [18]. Note that these methods are trained on different datasets: the TSI-pLSA model of [8] and the boosting model of [23] are trained on the FG dataset, while the models in [27] and [18] are trained on the CT dataset. Therefore, to make the comparison fairer, we train our CoBoost framework on three datasets, FG, NG and CT, referred to as CB FG, CB NG and CB CT in Table 2. The FG and NG training datasets

are the same as the Fergus Google set and the New Google set introduced in Section 5.1. Since we cannot reproduce the Caltech training set used in [27] and [18], we follow those two papers and collect 600 images for each category, of which the positive images come from the pure CT training set of the object category and the negative images come from the Caltech Background; the ratio of positive to negative images is kept at 1:2. Besides, we implement the work of [18] and train their algorithm with only 1st and 2nd order features on a 100-100 mix of training data from the CT dataset. To make the performance reporting for our approach more robust, we repeat each experiment 10 times and report the average performance.

Table 2 shows the error rates of the semi-supervised methods [8] and [27], the supervised methods [23] and [18], and our three variants (CB FG, CB NG, CB CT). First we compare with the semi-supervised methods [8] and [27]. Our method CB FG and [8] are both trained on the FG dataset, and CB FG surpasses [8] on four categories; the average error rate of CB FG is 6.12 points lower than that of [8]. CB CT and [27] are both trained on the CT dataset, and CB CT surpasses [27] on three categories; the average error rate of CB CT is very low (only 3.78). Then we compare with the supervised methods [23] and [18]. Our method CB FG and [23] are both trained on the FG dataset, and CB FG surpasses [23] on two categories. Although CB FG falls short of [23] for the categories car rear and motorbike, the differences in error rates between the two methods are small. CB CT and [18] are both trained on the CT dataset.


Table 2 Comparison of the error rates for different methods. '-' indicates no result for the method and the category.

            [8]     [23]    [27]    [18]    CB FG   CB NG   CB CT
Airplane    15.5    11.1    3.4     2.0     10.5    10.1    3.6
Car rear    16.0    8.9     21.4    3.6     10.4    11.1    6.4
Face        20.7    6.5     5.3     1.0     3.7     3.2     1.7
Leopard     13.0    -       -       2.3     7.9     7.0     4.8
Motorbike   6.2     7.8     15.4    1.4     8.3     9.2     2.4
Average     14.28   -       -       2.06    8.16    8.12    3.78

Table 3 Comparison of the error rates with active label and random label.

            r CB FG   r CB NG   r CB CT   CB FG   CB NG   CB CT
Airplane    11.8      11.3      4.6       10.5    10.1    3.6
Car rear    12.0      12.5      7.5       10.4    11.1    6.4
Face        5.8       5.2       2.8       3.7     3.2     1.7
Leopard     9.9       8.3       6.2       7.9     7.0     4.8
Motorbike   9.5       10.5      3.6       8.3     9.2     2.4
Average     9.8       9.56      4.94      8.16    8.12    3.78

Table 4 Comparison of the error rates with only 1st order features and with 1st and 2nd order features.

            1st CB FG   1st CB NG   1st CB CT   CB FG   CB NG   CB CT
Airplane    12.3        12.9        4.7         10.5    10.1    3.6
Car rear    13.8        14.2        8.4         10.4    11.1    6.4
Face        6.6         5.9         2.6         3.7     3.2     1.7
Leopard     10.1        10.0        6.1         7.9     7.0     4.8
Motorbike   9.9         11.2        4.5         8.3     9.2     2.4
Average     10.5        10.84       5.26        8.16    8.12    3.78

The method of [18] is reported to be computationally much more efficient than previous approaches while also comparing well against state-of-the-art approaches. Our method CB CT is competitive with [18], only 1.72 points higher, considering that we use only a few labeled data and a large number of unlabeled data. The CB NG method is trained on the newly collected Google-downloaded image dataset NG, so it can be seen as the object model learned from Google Images. Its results are encouraging: the average error rate of CB NG is well below that of [8] and even slightly lower than that of CB FG. Note that even when we compare the CB NG method with the supervised technique [23], our method still does reasonably well.

We also verify the benefits of the active learning process used in our method. The r CB FG, r CB NG and r CB CT methods in Table 3 correspond to our co-boost learning method trained with random labeling instead of active labeling on the FG, NG and CT datasets respectively. From the table, we can see that CB FG, CB NG and CB CT perform better than r CB FG, r CB NG and r CB CT respectively over all five categories, and 1.64, 1.44 and 1.16 average points are gained over r CB FG, r CB NG and r CB CT. The active learning technique can thus further improve the performance.

Our method can select representative features from the labeled and unlabeled data. In Fig. 6, we show the 1st and 2nd order features selected in the first two rounds of co-boosting for all five categories. The 1st order features in the images are shown as green points and the 2nd order features as red lines. From the figure, we can see that the selected 1st and 2nd order features are highly discriminative and detect meaningful patterns in these images. To further show the effectiveness of using both the 1st and the 2nd order features, we trained our method with two kinds of 1st order features instead of the 1st and 2nd order features over the FG, NG and CT datasets (1st CB FG, 1st CB NG, 1st CB CT). The two kinds of 1st order features are built with the Difference of Gaussian detector and the Maximally Stable Extremal Region detector respectively; both are described by a 136-dimensional Gradient Location and Orientation Histogram descriptor. Table 4 shows that our method with the 1st and 2nd order features performs much better than with the two kinds of 1st order features.

Finally, we analyze the effect of the total iteration number parameter in the proposed method. For each iteration, we count the misclassified labeled images and the unlabeled images the two boosting classifiers disagree on, and we define their sum as the total inconsistent number. Fig. 7 gives the total inconsistent number curves of the five categories for the FG, NG and CT datasets over the iterations.


Fig. 6 The 1st and 2nd order features selected by our method.

Fig. 7 The total inconsistent image number over the iterations for each category in the FG, NG and CT datasets. Above left: FG, above right: NG, below: CT.

Initially, the total inconsistent number falls rapidly, and after around 90 iterations for FG and NG and 70 iterations for CT, the value becomes insignificantly small relative to that of the initial classifiers. As the total inconsistent number becomes small and changes little over the iterations, the learning method converges, and adding more weak classifiers in boosting will not significantly change the decision value. To ensure generality, we set the total iteration number to T = 100 for all the experiments. Note that at most 100 1st order features are selected, so at most $C_{100}^{2}$ = 4950 2nd order features are extracted; this is rather efficient in contrast to the complete extraction of $C_{500}^{2}$ = 124750 2nd order features.

5.3 Re-rank search images

We use our framework to re-rank Web search images by the output value of g1(x1) + g2(x2). The FG and BA datasets are both collected from the web, and

so far they have been widely used by most category learning algorithms. We consider re-ranking the FG dataset and the ten-animal BA dataset. To compare our results with the works of [3] and [25], we report the precision at 15% recall for the FG dataset and the precision of the first 100 images for the BA dataset. Note that the work of Schroff et al. treats "ok" images as "good" images; therefore we train our framework on the FG dataset but re-rank the original Fergus dataset, regarding "ok" images as positive images. Besides, we use the result of the "classification on test data" setting of [3] rather than their "final dataset", which includes ground truth from their manual step.

Fig. 8 illustrates the comparative results between our approach and the works of [3] and [25]. As shown in the figure, we compare with the work of [25] for the first five categories, and with the works of [3] and [25] for the ten animal categories.
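The two ranking measures can be computed as in the following sketch (our own illustration, not the authors' evaluation code):

```python
import numpy as np

def precision_at_recall(y_true, scores, recall_level=0.15):
    """Precision of the ranked list at the point where `recall_level` of all
    positive images have been retrieved (used for the FG dataset)."""
    order = np.argsort(-scores)
    hits = np.asarray(y_true)[order] == 1
    needed = int(np.ceil(recall_level * hits.sum()))
    cutoff = np.searchsorted(np.cumsum(hits), needed) + 1   # smallest prefix with enough hits
    return hits[:cutoff].mean()

def precision_at_k(y_true, scores, k=100):
    """Precision of the top-k ranked images (used for the BA dataset)."""
    order = np.argsort(-scores)
    return (np.asarray(y_true)[order][:k] == 1).mean()
```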


Fig. 8 Re-ranking results, compared with [3] and [25].

Our approach gives superior precision to [25] on four categories (airplane, leopard, motorbike, guitar) of the FG dataset and on seven categories (ant, bear, dolphin, giraffe, leopard, monkey, penguin) of the BA dataset. It outperforms [3] on nine animal categories, all except "frog". Analyzing the categories (e.g. beaver, monkey) for which our approach gives low precision, we believe this is partly due to the small number of actual positive images: in the beaver category there are only 81 positive images against 1057 negative images, and in the monkey category only 175 positive images against 1496 negative images. The number of actual positive images plays a great role in selecting discriminative features for CoBoost.

Fig. 9 Top-ranked 15 images of airplane, car (rear), face, leopard and motorbike in the New Google Set.

Finally, we re-rank the images returned by Google Images. The top ranked images for the categories airplane, car (rear), face, leopard and motorbike are shown in Fig. 9, in which the false positives are marked with red squares. It is easily seen that the results are quite satisfying.

6 Conclusion

In this paper, we have developed a semi-supervised learning framework which integrates the CoBoost algorithm with the 1st and 2nd order features to learn visual categories from Google Images.


To further reduce the memory and computation costs of our approach, the high-dimensional 2nd order features are dynamically generated and selected based on the selected 1st order features within the framework. Extensive experiments on several datasets, including the Caltech, Fergus and Berg datasets, demonstrate that our learned model performs very well in comparison with both state-of-the-art unsupervised approaches and traditional fully supervised techniques.

In the future, we intend to add mutual information to our CoBoost framework. The mutual information is expected to eliminate ineffective weak classifiers and thus lead to faster convergence and better accuracy; this idea has been used in [26] for face recognition. Besides, we would also like to apply our algorithm to the problem of boosting the performance of current object recognition techniques with unlabeled data.

Acknowledgements This work is supported by the National Basic Research Priorities Programme (No. 2007CB311004), National Science and Technology Support Plan (No. 2006BAC08B06) and National Science Foundation of China (No. 60775035, No. 60903141, No. 60933004, No. 60970088).

References 1. S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(4):509–522, 2002. 2. K. Bennett and A. Demiriz. Semi-supervised support vector machines. Advances in Neural Information processing systems, pages 368–374, 1999. 3. T.L. Berg and D.A. Forsyth. Animals on the web. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1463– 1470. IEEE, 2006. 4. A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92–100. ACM, 1998. 5. T. Chen, M.M. Cheng, P. Tan, A. Shamir, and S.M. Hu. Sketch2photo: internet image montage. Graphics, ACM Transactions on, 28(5):124:1–124:10, 2009. 6. I. Cohen, F.G. Cozman, N. Sebe, M.C. Cirelo, and T.S. Huang. Semisupervised learning of classifiers: Theory, algorithms, and their application to human-computer interaction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(12):1553–1566, 2004. 7. M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 189–196,

1999. 8. R. Fergus, F.F. Li, P. Perona, and A. Zisserman. Learning object categories from google’s image search. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1816–1823. IEEE, 2005.

9. R. Fergus, F.F. Li, P. Perona, and A. Zisserman. Learning object categories from internet image searches. Proceedings of the IEEE, 98(8):1453–1466, 2010.


10. R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol-

ume 2, pages 264–271. IEEE, 2003. 11. R. Fergus, P. Perona, and A. Zisserman. A visual category filter for google images. Lecture notes in computer science, pages 242–256, 2004. 12. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning, 1996. ICML 1996. Thirteenth IEEE International Conference on, pages 148–156. MORGAN KAUFMANN PUBLISHERS, INC., 1996. 13. S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178. Ieee, 2006. 14. B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 17–32, 2004. 15. C. Leistner, H. Grabner, and H. Bischof. Semi-supervised boosting using visual similarity learning. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

16. F.F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007. 17. L.J. Li and F.F. Li. Optimol: automatic online picture collection via incremental model learning. International journal of computer vision, 88(2):147–168, 2010. 18. D. Liu, G. Hua, P. Viola, and T. Chen. Integrated feature selection and higher-order spatial feature extraction for object categorization. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008. 19. D.G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004. 20. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10):761–767, 2004. 21. K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(10):1615–1630, 2005. 22. E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification. Computer Vision–ECCV 2006, pages 490–503, 2006. 23. A. Opelt, M. Fussenegger, A. Pinz, and P. Auer. Weak hypotheses and boosting for generic object detection and recognition. Computer Vision–ECCV 2004, pages 71–84, 2004. 24. S. Savarese, J. Winn, and A. Criminisi. Discriminative object class models of appearance and shape by correlatons. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2033–2040. IEEE, 2006. 25. F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007. 26. L. Shen and L. Bai. Mutualboost learning for selecting gabor features for face recognition. Pattern Recognition Letters, 27(15):1758–1767, 2006.

27. J. Sivic, B.C. Russell, A.A. Efros, A. Zisserman, and W.T. Freeman. Discovering objects and their location in images. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 370– 377. IEEE, 2005. 28. S. Vijayanarasimhan and K. Grauman. Keywords to visual categories: Multiple-instance learning forweakly supervised object categorization. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

29. G. Wang and D. Forsyth. Object image retrieval by exploiting online knowledge resources. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

30. G. Wang, D. Hoiem, and D. Forsyth. Building text features for object image classification. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1367–1374. IEEE, 2009.

31. J. Wang, Y.G. Jiang, and S.F. Chang. Label diagnosis through self tuning for web image search. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1390–1397. IEEE, 2009.

32. J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International journal of computer vision, 73(2):213–238, 2007. 33. D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in neural information processing systems, 16:321–328, 2004. 34. Z.H. Zhou, K.J. Chen, and Y. Jiang. Exploiting unlabeled data in content-based image retrieval. Machine Learning: ECML 2004, pages 525–536, 2004. 35. X. Zhu. Semi-supervised learning literature survey. Technical report, Department of Computer Sciences, University of Wisconsin at Madison, 2005. 36. X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Machine Learning, 2003. ICML 2003. Twentieth IEEE International Conference on, volume 20, pages 912–919, 2003.
