Visual Object Recognition through One-Class Learning

QingHua Wang 1, Luís Seabra Lopes 1, and David M. J. Tax 2

1 IEETA/Department of Electronics & Telecommunication, University of Aveiro,
Campus Santiago, 3810-153 Aveiro, Portugal
[email protected], [email protected]
2 Faculty of Information Technology and Systems, Delft University of Technology,
P.O. Box 5031, 2600 GA Delft, The Netherlands
[email protected]

Abstract. In this paper, several one-class classification methods are investigated in pixel space and in PCA (Principal Component Analysis) subspace, with the aim of finding suitable learning and classification methods to support natural language grounding in the context of Human-Robot Interaction. Face versus non-face classification is used as an example to demonstrate the effectiveness of these one-class classifiers. The idea is to train target class models with only target (face) patterns, while still keeping good discrimination over outlier (never-seen non-target) patterns. Promising results are reported and discussed.

1 Introduction

Consider the task of teaching a robot to recognize an object, say an "apple", through its camera, in the context of Human-Robot Interaction (HRI). How can the teaching be conducted? To apply state-of-the-art statistical approaches, e.g., Hidden Markov models [6, 22], Bayesian networks [11], the naïve Bayes classifier [14], PCA [18], or other methods described in [20], it is basically necessary to collect a large number of apples, as well as enough non-apples (itself an ambiguous concept), in order to estimate the class distributions precisely. One might wonder whether these requirements are realistic in the context of HRI. Since learning is supervised and teaching is interactive, typically only a small number of samples is available. This makes the conventional methods mentioned above inapplicable, as they require both target and non-target patterns. Thus, it is useful to construct classifiers based only on target class patterns that still discriminate well against never-seen non-target patterns.

Following this idea, a method based on the combination of wavelet-domain Hidden Markov Trees (HMTs) and the Kullback-Leibler distance (KLD) was proposed in [19]. In that method, only target (face) samples were used to train an object model in terms of HMT parameters. For each unknown pattern, its KLD to this model is then computed. If the KLD is smaller than a certain threshold, obtained from the training session, the pattern is recognized as a target; otherwise, it is rejected. One problem of this HMT/KLD-based approach is that it cannot derive robust class models when there are large in-class variations among the training patterns. One cause is that the overall class model is obtained by simply averaging the individual HMTs; if the individual HMTs vary greatly from each other, averaging loses precision in the HMT parameter estimates. In this paper, several one-class classification methods, previously described in [17], are investigated to address this problem.

The rest of this paper is organized as follows. A brief review of one-class classification is provided in Section 2. In Section 3, the experimental setup and results are presented. Conclusions are given in Section 4, with some discussion and future work.

2 One-Class Classifiers

The design of one-class classifiers is motivated by the fact that patterns from the same class usually cluster together regularly, while patterns from other classes scatter across feature space. One-class learning and classification was first presented in [7], but similar ideas had appeared earlier, including outlier detection [12], novelty detection [2], concept learning in the absence of counter-examples [5] and positive-only learning [9]. Generally, in multi-class approaches, one can capture class descriptions precisely because samples from all classes are available. In contrast, one-class approaches require only samples of the target class. A natural method for decision-making under this condition is to use some distance-based criterion: if the measurement of an unknown pattern x is smaller than a learned threshold, the pattern is accepted as belonging to the target class; otherwise, it is rejected. This can be formulated as follows.

    Class(x) = { target,      if Measurement(x) ≤ threshold;
               { non-target,  otherwise.                          (1)

This rule is comparable to the Bayesian decision rule. The main difference is that, here, the threshold is learned only from target class patterns, whereas in the Bayesian decision rule it is determined by both target and non-target class patterns. With an appropriate model of the target class (and thus a proper threshold), most patterns from the target class are accepted and most non-target class patterns are rejected. The ideal model would accept all target patterns and reject all non-target patterns, but such a model is usually not attainable in practice. Common practice is to define a priori the fraction of training target patterns that may be discarded (known as the reject rate), in order to obtain a compact data description and minimize false positives; in many cases 5% or 1% is used.

Several methods have been proposed to construct one-class classification models. A simple one is to generate artificial outlier data [13], so that conventional two-class approaches become applicable. This method depends severely on the quality of the artificial data and often does not work well. Some statistical methods have also been proposed: one can estimate the density or distribution of the target class, e.g., using a Parzen density estimator [2], a Gaussian [9], multimodal density models [21] or wavelet-domain HMTs [19]. The requirement of well-sampled training data to capture the density precisely makes this type of method problematic. In [7, 17], boundary-based methods were proposed to avoid density estimation on small or not well-sampled training data, but a well-chosen distance measure and threshold are needed. Tax provides a systematic description of one-class classification in [17], where the decision criteria are mostly based on the Euclidean distance. Below is a brief description of seven one-class classifiers previously described in [17] and [10].

The Support Vector Data Description (SV-DD) method, proposed in [17], finds a hypersphere boundary of minimal volume around the target class, containing all or most of the target class patterns. It can provide excellent results when a suitable kernel is used; here, the Gaussian kernel is chosen. The method can be optimized to reject a pre-defined fraction of the target data in order to obtain a good and compact data description (so some remote target data points may be discarded); for different rejection rates, the shape of the boundary changes. For classification, objects outside this spherical decision boundary are regarded as outliers (objects from other classes). The main drawback of the method is that it requires a difficult quadratic optimization.

Another method, GAUSS-DD, models the target class as a single Gaussian distribution. To avoid numerical instabilities, the density estimate itself is avoided, and just the Mahalanobis distance

f(x) = (x − µ)^T Σ^{−1} (x − µ)

is used, where the mean µ and covariance matrix Σ are sample estimates. The classifier is then defined by (1).

In KMEANS-DD, a class is described by k clusters, placed such that the average distance to a cluster center is minimized. The cluster centers c_i are placed using the standard k-means clustering procedure [2]. The target class is then characterized by

f(x) = min_i ||x − c_i||^2

and the classifier is again defined as in (1).
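As a minimal sketch of this kind of target-only classifier, the following code fits a GAUSS-DD-style model: sample mean and covariance plus a Mahalanobis-distance threshold chosen so that a pre-defined reject rate of the training targets falls outside. Function names are illustrative, and this is not the DDTOOLS implementation.

```python
import numpy as np

def fit_gauss_dd(targets, reject_rate=0.01):
    """Fit a GAUSS-DD-style model from target patterns only (sketch)."""
    mu = targets.mean(axis=0)
    # Regularize the sample covariance so the inverse always exists.
    cov = np.cov(targets, rowvar=False) + 1e-6 * np.eye(targets.shape[1])
    inv_cov = np.linalg.inv(cov)
    # Mahalanobis distance of every training target to the mean.
    d = np.einsum('ij,jk,ik->i', targets - mu, inv_cov, targets - mu)
    # Threshold: ~reject_rate of the training targets fall outside it.
    threshold = np.quantile(d, 1.0 - reject_rate)
    return mu, inv_cov, threshold

def classify(x, mu, inv_cov, threshold):
    """Decision rule (1): accept as target iff f(x) <= threshold."""
    f = (x - mu) @ inv_cov @ (x - mu)
    return 'target' if f <= threshold else 'non-target'
```

Note that the threshold is learned from target data alone; no outlier samples are needed at training time, which is exactly the setting discussed above.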

The PCA-DD method, based on Principal Component Analysis, describes the target data by a linear subspace. This subspace is defined by the eigenvectors of the data covariance matrix Σ. Only k eigenvectors are used; they are stored in a d×k matrix W (where d is the dimensionality of the original feature space). To check how well a new object fits the target subspace, the reconstruction error is computed: the difference between the original object and its projection onto the subspace, expressed in the original feature space. This projection is computed by:

x_proj = W (W^T W)^{−1} W^T x                                     (2)

The reconstruction error is then given by

f(x) = ||x − x_proj||^2.
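The projection and reconstruction error above can be sketched as follows. This is an illustrative implementation, not DDTOOLS' pca_dd; it assumes mean-centred data and takes the k leading eigenvectors of the sample covariance as W.

```python
import numpy as np

def pca_dd_error(x, X_train, k=10):
    """PCA-DD-style reconstruction error f(x) = ||x - x_proj||^2 (sketch)."""
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train, rowvar=False)
    # Eigen-decomposition; keep the k eigenvectors of largest eigenvalue.
    vals, vecs = np.linalg.eigh(cov)
    W = vecs[:, np.argsort(vals)[::-1][:k]]          # d x k matrix W
    xc = x - mu
    # x_proj = W (W^T W)^{-1} W^T x, as in (2).
    x_proj = W @ np.linalg.solve(W.T @ W, W.T @ xc)
    return np.sum((xc - x_proj) ** 2)
```

Since the eigenvectors returned here are orthonormal, (W^T W)^{−1} reduces to the identity; the general form is kept to match (2).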

The NN-DD method is a simple nearest neighbor method. A new object x is evaluated by computing the distance to its nearest neighbor NN(x) in the training set. This distance is normalized by the distance between NN(x) and its own nearest neighbor in the training set, NN(NN(x)).
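A minimal sketch of this normalized nearest-neighbor score (function name illustrative; scores near or below 1 indicate target-like objects):

```python
import numpy as np

def nn_dd_score(x, X_train):
    """NN-DD-style score: d(x, NN(x)) / d(NN(x), NN(NN(x))) (sketch)."""
    d = np.linalg.norm(X_train - x, axis=1)
    i = int(np.argmin(d))                  # index of NN(x)
    nn = X_train[i]
    d2 = np.linalg.norm(X_train - nn, axis=1)
    d2[i] = np.inf                         # exclude NN(x) itself
    return d[i] / d2.min()                 # normalized distance
```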

The KNN-DD method is a k-nearest neighbor method. In its simplest version, just the distance to the k-th nearest neighbor is used. Slightly more advanced versions use averaged distances, which work somewhat better. This simple method is often very good in high-dimensional feature spaces.

The LP-DD method is a linear programming method [10]. This data descriptor is specifically constructed to describe target classes that are represented in terms of distances to a set of support objects. In some cases it may be much easier to define distances between objects than to define informative features (for instance, when shapes have to be distinguished). This classifier uses the Euclidean distance by default and basically has the following form:

f(x) = Σ_i w_i d(x, x_i),

where the weights w_i are optimized such that only a few of them remain non-zero and the boundary is as tight as possible around the data.
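Of the neighbor-based descriptions above, KNN-DD in its simplest form is easy to sketch: the score of a new object is just its distance to the k-th nearest training pattern, thresholded as in (1). The function name is illustrative.

```python
import numpy as np

def knn_dd_score(x, X_train, k=2):
    """KNN-DD in its simplest form (sketch): distance from x to its
    k-th nearest neighbour in the target training set."""
    d = np.sort(np.linalg.norm(X_train - x, axis=1))
    return d[k - 1]
```

As with the other descriptors, a threshold on this score would be set from target training data only, e.g. at a quantile corresponding to the chosen reject rate.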

3 Experiments and Results

3.1 Experimental Setup

All seven one-class classifiers are investigated using the dataset in [19]. This dataset contains two parts: 400 pictures from the AT&T/ORL face database [1] and 402 non-face pictures from our previous work [15, 16]. Some examples from each part are shown in Figures 1 and 2, respectively. It should be noted that all patterns were resized to 32×32. The reported experiments are all carried out with the PRTOOLS [4] and DDTOOLS [17] packages, from Delft University of Technology. Face is the target class.

Fig. 1. Some face patterns

Fig. 2. Some non-face patterns

Two feature schemes are currently used in the experiments reported in this paper. The first experiments are conducted directly in the full pixel space (1024 dimensions); similar experiments are then repeated in a PCA subspace. For all seven methods, the reject rate for the target class is set to 0.01. For PCA-DD, the subspace dimension is 10 unless otherwise stated. For SV-DD, σ = 1128 is used. For KMEANS-DD, k is 5. For KNN-DD, k is 2.

3.2 Results and Discussion

To determine how the amount of training data affects the performance of each classifier, a fraction of the face data, from 10% to 90% (randomly selected from the whole face database each time), is used for training; the rest of the face data and all non-face data are used for independent testing. Each experiment is repeated ten times and the average error rate is used as the final score. The first series of experiments is conducted directly in pixel space; PCA is used to reduce the dimensionality for the second series. Results are shown in Fig. 3.

In pixel space, SV-DD shows a decrease of the overall error rate (OA) from about 40% to 5%, and of false negatives (FN) from 80% to 30%, while its false positive (FP) rate stays very steadily below 5%. No other method shows a similar trend. Two methods, LP-DD and GAUSS-DD, do not work well in pixel space: both have 100% FN and 0% FP in all experiments, and are therefore not included in Figures 3.a, 3.b and 3.c. In the PCA subspace (10 Principal Components), SV-DD shows a similar trend in FN as in pixel space, but the decreases are relatively slight; its overall error rate and FP stay very steadily below 10%. Again, no other method works as well as SV-DD, and LP-DD still behaves as it does in pixel space. Methods like NN-DD, KMEANS-DD and KNN-DD have very low FN but very high FP, both in pixel space and in the PCA subspace.

The relatively good performance of SV-DD compared to the other six methods can be attributed to its flexibility. The other methods mainly use very strict models, such as plane-like shapes or nearest-neighbor-type models. They tend to capture large areas of feature space, since the reject rate for the target class was set relatively low at 0.01, which leads to large FP and low FN.

How the number of features affects these classifiers is also investigated. For the specific case of SV-DD, 10, 15, 20 and 30 PCs are used. Table 1 shows a decrease of the error rates (OA, FP, FN) as more training patterns are used. There is a more or less similar trend as more features are used (last row of the table). However, once the main variation over a specific training set is captured, more features do not always guarantee better results: when more features are used, more training data are generally needed to estimate the class models reliably. With a fixed training set, more features may thus prevent the class models from being estimated reliably, and the performance fluctuates slightly (the "curse of dimensionality" [3]). This is also why SV-DD performs better in the PCA subspace than in the full pixel space.
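The evaluation protocol described above (train on a random fraction of the target data, test on the remaining targets plus all outliers, repeat and average) can be sketched as below. The classifier is passed in as a hypothetical fitting function that returns an accept/reject predicate; names are illustrative, not the PRTOOLS/DDTOOLS API.

```python
import numpy as np

def evaluate(fit_fn, targets, outliers, fraction, repeats=10, seed=0):
    """Protocol sketch: returns mean (OA, FN, FP) error rates in %."""
    rng = np.random.default_rng(seed)
    oa, fn, fp = [], [], []
    n = len(targets)
    n_train = int(fraction * n)
    for _ in range(repeats):
        idx = rng.permutation(n)
        train = targets[idx[:n_train]]       # random fraction for training
        test_t = targets[idx[n_train:]]      # remaining targets for testing
        accept = fit_fn(train)               # predicate: x -> bool
        fn_i = np.mean([not accept(x) for x in test_t])   # rejected targets
        fp_i = np.mean([accept(x) for x in outliers])     # accepted outliers
        n_test = len(test_t) + len(outliers)
        oa_i = (fn_i * len(test_t) + fp_i * len(outliers)) / n_test
        oa.append(100 * oa_i)
        fn.append(100 * fn_i)
        fp.append(100 * fp_i)
    return np.mean(oa), np.mean(fn), np.mean(fp)
```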

Fig. 3. Some results: diagrams a, b, c show the overall classification error, false positives and false negatives of five methods in full pixel space; diagrams d, e, f show the overall classification error, false positives and false negatives of six methods in the PCA subspace (10 Principal Components). The Y-axis is the error rate (%), and the X-axis is the percentage of faces used in training.

Table 1. Error rates (%) of SV-DD over the PCA subspace (FN = false negatives, FP = false positives, OA = overall error rate)

Data size   Error   10 PCs   15 PCs   20 PCs   30 PCs   Average
10%         FN       12.97    12.14    10.31     9.36    11.20
            FP        6.57     4.18     4.70     9.36     6.20
            OA        9.59     7.94     6.69     6.59     7.70
20%         FN       10.03    10.07    10.85    10.16    10.28
            FP        4.78     4.53     4.13     3.63     4.27
            OA        7.10     7.27     7.13     6.53     7.01
30%         FN       10.07    11.39    10.04    10.89    10.60
            FP        4.95     4.78     4.58     4.18     4.62
            OA        7.05     7.49     6.82     6.94     7.08
40%         FN       10.67    10.15    10.58    10.89    10.57
            FP        5.95     5.65     5.32     5.05     5.49
            OA        7.71     7.31     7.29     7.20     7.38
Average (OA)          7.91     7.50     6.98     6.82     7.29

4 Concluding Remarks

In this paper, face versus non-face classification is used as an example for investigating several one-class classification methods. This is preliminary work towards finding suitable learning and classification methods for natural language concept grounding in the context of Human-Robot Interaction. In the reported experiments, target class models are intentionally learned with only target patterns, while still keeping good discrimination with respect to outlier patterns. It is found that some of these one-class classifiers, particularly SV-DD, attain very good performance on our data set (overall error rate, false negatives and false positives all below 10%). All other one-class classifiers perform less well in our experiments: some accept target patterns well, others reject outlier patterns well. Only SV-DD performs very steadily, especially when discriminant features such as the PCA subspace are used. It can be concluded that SV-DD can form a good foundation for developing a learning and classification method suitable for HRI, since it not only obtains reasonable performance with a relatively small number of training patterns, but also achieves very good results when more training patterns are available. From the viewpoint of lifelong learning for a robot, this potential of SV-DD can be further exploited. Further study of these one-class classifiers should of course be conducted, for example using other, larger data sets and/or feature extraction methods. More importantly, it will be interesting to apply some of these methods on Carl, a service robot prototype previously developed by our group [16].

Acknowledgement

Q. H. Wang is supported by IEETA (Instituto de Engenharia Electrónica e Telemática de Aveiro), Universidade de Aveiro, Portugal, under a PhD research grant.

References

1. AT&T Face Database, formerly "The ORL Database of Faces", http://www.uk.research.att.com/facedatabase.html
2. Bishop, C.: Novelty detection and neural network validation. In: IEE Proc. Vision, Image and Signal Processing 141 (1994) 217-222
3. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
4. Duin, R.: PRTOOLS 4.0. Delft University of Technology, The Netherlands (2004)
5. Japkowicz, N.: Concept-Learning in the Absence of Counter-Examples: an Autoassociation-Based Approach to Classification. Ph.D. thesis, The State University of New Jersey (1999)
6. Meng, L. M.: An Image-based Bayesian Framework for Face Detection. In: Proc. IEEE Intl. Conf. on Computer Vision and Pattern Recognition (2000)
7. Moya, M., Koch, M. and Hostetler, L.: One-class classifier networks for target recognition applications. In: Proc. World Congress on Neural Networks (1993) 797-801
8. Muggleton, S. and Firth, J.: CProgol4.4: a tutorial introduction. In: Dzeroski, S. and Lavrac, N. (eds.): Relational Data Mining. Springer-Verlag (2001) 160-188
9. Parra, L., Deco, G. and Miesbach, S.: Statistical independence and novelty detection with information preserving nonlinear maps. In: Neural Computation 8 (1996) 260-269
10. Pekalska, E., Tax, D. M. J. and Duin, R. P. W.: One-Class LP Classifiers for Dissimilarity Representations. In: Advances in Neural Information Processing Systems, vol. 15. MIT Press (2003) 761-768
11. Pham, T. V., Worring, M. and Smeulders, A. W. M.: Face Detection by Aggregated Bayesian Network Classifiers. In: Pattern Recognition Letters 23(4) (2002) 451-461
12. Ritter, G. and Gallegos, M.: Outliers in statistical pattern recognition and an application to automatic chromosome classification. In: Pattern Recognition Letters 18 (1997) 525-539
13. Roberts, S. and Penny, W.: Novelty, confidence and errors in connectionist systems. Technical report TR-96-1, Imperial College, London (1996)
14. Schneiderman, H. and Kanade, T.: A Statistical Method for 3D Object Detection Applied to Faces and Cars. In: Proc. CVPR 2000 (2000) 746-751
15. Seabra Lopes, L.: Carl: from Situated Activity to Language-Level Interaction and Learning. In: Proc. IEEE Intl. Conf. on Intelligent Robots and Systems (2002) 890-896
16. Seabra Lopes, L. and Wang, Q. H.: Towards Grounded Human-Robot Communication. In: Proc. IEEE Intl. Workshop RO-MAN (2002) 312-318
17. Tax, D. M. J.: One-class classification. Ph.D. dissertation, Delft University of Technology, The Netherlands (2001)
18. Turk, M. and Pentland, A.: Eigenfaces for Recognition. In: Journal of Cognitive Neuroscience 3 (1991) 71-86
19. Wang, Q. H. and Seabra Lopes, L.: An Object Recognition Framework Based on Hidden Markov Trees and Kullback-Leibler Distance. In: Proc. ACCV 2004 (2004) 276-281
20. Yang, M. H., Kriegman, D. and Ahuja, N.: Detecting Faces in Images: A Survey. In: IEEE Trans. PAMI 24 (2002) 34-58
21. Yang, M. H., Kriegman, D. and Ahuja, N.: Face Detection Using Multimodal Density Models. In: Computer Vision and Image Understanding 84 (2001) 264-284
22. Zhu, Y. and Schwartz, S.: Efficient Face Detection with Multiscale Sequential Classification. In: Proc. IEEE Intl. Conf. on Image Processing (2002) 121-124
