
Partitioning a Feature Space Using a Locally Defined Confidence Measure

Jigang Wang, Predrag Neskovic, Leon N. Cooper
Physics Department and Institute for Brain and Neural Systems, Brown University, Providence, RI 02912, USA
Email: [email protected], [email protected], Leon [email protected]

Abstract– In this paper, we introduce a network for pattern classification, referred to as the Locally Confident Network (LCN), that learns object categories by partitioning the feature space into local regions with maximum confidence levels. We show that the probability of decision error over a region is a decreasing function of its confidence measure; the probability of decision error can therefore be minimized by maximizing the confidence measure. Moreover, in the asymptotic limit, the Bayes probability of error can be achieved by partitioning the feature space into locally maximum-confidence regions. We compare our approach with the Nearest Neighbor rule and the RCE network and show that our method overcomes some of their shortcomings. Empirical results are given on several datasets and compared to well-known classification methods.

I. Introduction

When a human teaches a student to recognize objects, and the student is not able to recognize an example of a given class, he or she can communicate that to the teacher during the learning process. Although many supervised learning algorithms are inspired by human teaching, it is difficult, if not impossible, to find algorithms that engage in such interactive learning. Recognition systems based on some of the most prominent learning algorithms, such as back-propagation, RBF, SVM and HMM, "learn" during a training phase by updating weights or other parameters of the system. But if the features chosen to represent objects are not sufficient for class separation, no amount of training will help; what is required is a better representation of the object.

One way to build a system that can interactively exchange information during the learning process is to design a system that knows its limitations and knows when to ask a question. In this work we present a first step toward constructing such a system. Given a confidence level that can be set by a user/teacher, we construct a system that learns object categories by partitioning the feature space into regions within which it has above-threshold confidence of correctly classifying a pattern, and regions with confidence below the preset threshold value. In this way, the user/teacher can help the system by providing additional information (e.g. by introducing new features) only for the selected low-confidence regions. Perhaps more importantly, the user knows the limitations and strengths of the system - how well the system has learned the task - before it is used on unseen patterns.

The paper is organized as follows. In Section II we review the pattern classification problem from a Bayesian point of view and point out the shortcomings of particular existing approaches. In Section III we derive the probability of error over a region and show that it is a decreasing function of its confidence measure. We also show that the Bayes rate of error can be achieved asymptotically by locally maximizing the confidence measure. In Section IV we present one possible algorithm for partitioning the feature space into different confidence regions and describe the architecture of the network we call the Locally Confident Network (LCN). In Section V we test our system on several real-world datasets and compare it to some of the standard learning algorithms. Concluding remarks are given in Section VI.

II. Background

Let us consider a classifier for a two-class (ω1, ω2) classification problem. Assume that the classifier divides the feature space Ω into two disjoint regions R1 and R2, where all the points within region R1 are classified into class ω1, all the points within region R2 are classified into class ω2, and Ω = R1 ∪ R2. The classifier then makes an error if x ∈ R2 and the true state of nature is ω1, or if x ∈ R1 and the true state of nature is ω2. Therefore, the probability of error of this classifier is given by:

P(\mathrm{error}) = P(x \in R_1, \omega_2) + P(x \in R_2, \omega_1)
                  = \int_{R_1} P(\omega_2|x)\, p(x)\, dx + \int_{R_2} P(\omega_1|x)\, p(x)\, dx    (1)

Combining Eq. (1) with the identity P(\omega_1) = P(x \in R_1, \omega_1) + P(x \in R_2, \omega_1), one arrives at the following expression:

P(\mathrm{error}) = P(\omega_1) - \int_{R_1} \big( P(\omega_1|x) - P(\omega_2|x) \big)\, p(x)\, dx    (2)

Since P(ω1) does not depend on the choice of the partition, this means that the probability of error is minimized if all the points at which P(ω1|x) ≥ P(ω2|x) belong to R1 and all the points at which P(ω2|x) ≥ P(ω1|x) belong to R2. This is the Bayes decision rule [1]. It is obvious that in order to build a classifier that obeys this rule, one has to be able to estimate the posterior probability of each class at every single point of the feature space, which in general is not possible.
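As a toy illustration (ours, not from the paper), the rule is trivial to apply once the posteriors are known; the sketch below assumes two hypothetical one-dimensional Gaussian class-conditional densities with equal priors, which is exactly the knowledge that is unavailable in practice.

    from math import exp, pi, sqrt

    def gaussian_pdf(x, mu, sigma):
        # class-conditional density p(x|w_i); the parameters are purely illustrative
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    def bayes_decide(x, prior1=0.5, prior2=0.5):
        # Bayes decision rule: pick the class with the larger posterior,
        # i.e. the larger p(x|w_i) * P(w_i)
        p1 = gaussian_pdf(x, 0.0, 1.0) * prior1
        p2 = gaussian_pdf(x, 2.0, 1.0) * prior2
        return "w1" if p1 >= p2 else "w2"

    print(bayes_decide(0.5), bayes_decide(1.5))  # -> w1 w2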

Two well-known prototype-based approaches are the Nearest Neighbor (NN) rule [2] and the RCE network [4], [5]. In the NN rule, each labeled example functions as a prototype that represents all the points in its Voronoi cell. Although, as the number of samples grows, the posterior probability at the nearest neighbor approaches that at the point itself, i.e., lim_{n→∞} p(ωi | nn(x)) = p(ωi | x) [3], it is well known that the NN rule is suboptimal. In addition, it is computationally demanding. The RCE network, on the other hand, avoids this problem by partitioning the feature space into a relatively small number of prototype regions, but it relies on two a priori parameters (the maximal and minimal size of the prototype regions) and lacks a well-defined objective function. In the next section, we introduce a new approach to this problem. Compared with RCE, the new approach has a well-defined objective function and doesn't require any a priori parameters. Moreover, it is optimal in the asymptotic limit.

III. Probability of Error and Confidence Measure

We begin by studying the probability of error over a region and show that it is a decreasing function of the region's confidence measure, which we will define. We also find that maximizing the confidence measure of a region automatically guarantees that one class has a higher posterior probability at all points of the region; thus, in the asymptotic limit, the Bayes probability of error can be achieved.

Let P = P(ω1 | x ∈ R) = P(ω1, x ∈ R)/P(x ∈ R) be the posterior probability of the state of nature being ω1 given that the feature vector x falls in region R. Let x1, x2, ..., xn in R be n independently and identically distributed random variables with distribution P(ωi | x ∈ R). Then the probability that n1 of them belong to class ω1 (and therefore n2 = n − n1 belong to ω2) is given by \binom{n}{n_1} P^{n_1} (1-P)^{n-n_1}. The probability of observing n1 samples from class ω1 and n2 samples from class ω2 with n2 − n1 ≥ ∆ is therefore

P_e(\Delta; n) = \sum_{n_2 - n_1 \ge \Delta} \binom{n}{n_1} P^{n_1} (1 - P)^{n - n_1}    (3)

We claim that Pe(∆; n), with ∆ > 0 and P ∈ (0.5, 1], is the probability of error for the following reason: if n2 − n1 ≥ ∆ > 0, R will be associated with class ω2 by the majority rule; however, since P ∈ (0.5, 1], according to the Bayes rule R should be associated with class ω1. It is easy to show that Eq. (3) is a decreasing function of P, which means that it is bounded above by

P_e(\Delta; n)_{\max} = \sum_{n_2 - n_1 \ge \Delta} \binom{n}{n_1} \left(\tfrac{1}{2}\right)^{n_1} \left(\tfrac{1}{2}\right)^{n - n_1}
                      = \frac{1}{2^n} \sum_{k=0}^{(n-\Delta)/2} \binom{n}{k}
                      \approx \Phi\!\left( -\frac{\Delta - 1}{\sqrt{n}} \right)    (4)

where Φ is the cumulative distribution function of the standard normal distribution. The same result holds for the case n1 − n2 ≥ ∆ > 0. So, for any ∆ = |n1 − n2| and n = n1 + n2, Pe(∆; n)max gives an upper bound on the probability of decision error based on the majority rule.

Now consider the relation between Pe(∆; n)max and (∆ − 1)/√n. Because Φ(x) is increasing in x, Pe(∆; n)max is decreasing in (∆ − 1)/√n. The message of this relation is two-fold. First, if our decision is based on the majority rule, then the larger (∆ − 1)/√n is, the smaller our probability of error, and hence the more confidence we have in our decision. Second, it tells us how large (∆ − 1)/√n should be in order to keep the probability of error under some preset value. For n < 200, we enumerate all possible values of ∆ and n and calculate (∆ − 1)/√n and the corresponding value of Pe(∆; n)max. The result is shown in Figure 1.

Fig. 1. Probability of error as a function of the confidence measure.
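The quantities in Eqs. (3)-(4) are easy to evaluate numerically. The following sketch (plain Python, standard library only; the function names are ours) computes the confidence measure (∆ − 1)/√n of a region together with the exact binomial bound and its normal approximation from Eq. (4).

    from math import comb, erf, sqrt

    def phi(x):
        # cumulative distribution function of the standard normal distribution
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def confidence_measure(n1, n2):
        # (|n1 - n2| - 1) / sqrt(n), with n = n1 + n2
        return (abs(n1 - n2) - 1) / sqrt(n1 + n2)

    def error_bound_exact(delta, n):
        # Pe(delta; n)_max = (1/2^n) * sum_{k=0}^{(n-delta)/2} C(n, k)
        return sum(comb(n, k) for k in range((n - delta) // 2 + 1)) / 2 ** n

    def error_bound_normal(delta, n):
        # normal approximation Phi(-(delta - 1)/sqrt(n)) from Eq. (4)
        return phi(-(delta - 1) / sqrt(n))

    # example: a region covering 18 samples of one class and 6 of the other
    n1, n2 = 18, 6
    print(confidence_measure(n1, n2),
          error_bound_exact(abs(n1 - n2), n1 + n2),
          error_bound_normal(abs(n1 - n2), n1 + n2))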

In pattern classification, we can make use of the confidence measure in the following way. Think of the whole feature space as consisting of infinitely many different regions, with the finite training samples scattered among them. Typically, a given region will not have a high confidence measure over the samples it contains, which means that a decision based on those samples would have a large probability of being wrong. But if one can adjust the region so that its confidence measure increases, one can make a more confident decision on the new region. This process can be repeated until the confidence measure cannot be increased any further.

An earlier attempt in this direction, using multiple neural networks, is given in [6]. With this process we can find all the regions over which the probabilities of decision error have been minimized. As we show below, in the asymptotic limit all subregions with maximum confidence measures have the property that, with probability one, the true state of nature of all the points in them is either ω1 or ω2. This property is crucial for classification. Up to now, what we have discussed is the decision over regions, namely, whether P = P(ω1 | x ∈ R) is greater or less than 0.5. But P(ω1 | x ∈ R) is different from P(ω1 | x) for x ∈ R, which is what we really want to decide. Once we can say that either P(ω1 | x) > 0.5 for all x ∈ R or P(ω1 | x) < 0.5 for all x ∈ R, we can tell for sure whether P(ω1 | x) is greater or less than 0.5 from our decision on P(ω1 | x ∈ R).

Fig. 2. Decision boundary and prototype region.

This property can easily be proved by contradiction. Assume that the confidence measure (n1 − n2 − 1)/√n of region R has been maximized and that R still covers some area where P(ω2|x) > P(ω1|x), as illustrated in Figure 2. Let n1 = n11 + n12 and n2 = n22 + n21, where n11 and n12 are the numbers of samples of class ω1 coming from the area where P(ω1|x) > P(ω2|x) and from the area where P(ω2|x) > P(ω1|x), respectively, and n22 and n21 are the numbers of samples of class ω2 coming from the area where P(ω2|x) > P(ω1|x) and from the area where P(ω1|x) > P(ω2|x), respectively. Because P(ω2|x) > P(ω1|x) on the right side of the Bayes decision boundary, we have Prob{n12 < n22} = 1 as n → ∞. With a little arithmetic, we then have

(\Delta - 1)/\sqrt{n} < (n_{11} - n_{21} - 1)/\sqrt{n_{11} + n_{21}}    (5)

which is to say that the confidence measure of R increases if we adjust R so that it does not cover the area on the right side of the boundary. This contradicts the assumption that the confidence measure of R has been maximized.
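A small numeric check of inequality (5), with hypothetical sample counts chosen so that n12 < n22 (as expected asymptotically), shows the effect directly:

    from math import sqrt

    # hypothetical counts: n11, n21 fall on the left of the Bayes boundary,
    # n12, n22 on the right
    n11, n12, n21, n22 = 40, 5, 3, 20
    n1, n2 = n11 + n12, n21 + n22               # class counts in the original region R
    before = (n1 - n2 - 1) / sqrt(n1 + n2)      # confidence of R
    after = (n11 - n21 - 1) / sqrt(n11 + n21)   # confidence after excluding the right side
    print(before, after)  # the adjusted region has the larger confidence measure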

IV. Network Architecture and Implementation

In this section we present the architecture of the network that partitions the feature space into locally confident prototype regions. The network consists of three layers: the input layer, the hidden layer and the output layer. The input layer units represent the attributes of the incoming feature vector and are connected to every unit of the hidden layer. Each unit of the hidden layer, called a prototype unit, is class-specific and is therefore connected to only one unit of the output layer. A prototype unit represents a local region characterized by its weight vector (the center of the region) and its size (the radius of the hypersphere). A prototype unit also outputs its confidence level over the region it represents, as given by Eq. (4).

The learning algorithm for the network consists of two phases. The task of the first phase is to associate a hypersphere with each sample point, with the radius equal to the distance to the closest point of the opposite class. For each such hypersphere, we calculate its confidence measure using Eq. (4). As a result, there are as many regions as there are sample points. The goal of the second phase is to reduce the number of regions. This is done by assigning each sample point to the region that covers it with the highest confidence. We call the regions constructed in this way prototypes.

The feature space is thus divided into three types of regions: regions that represent only one class, regions that represent more than one class, and regions that do not belong to any class - the holes in the feature space. If a point falls in a region of the first type, the assignment is unambiguous, provided that the confidence level of the region is satisfactory. Regions of the second and third type represent areas for which the network does not have sufficient knowledge for reliable classification. In order to compare our results with other methods, we choose to assign samples from those regions to the prototype with the smallest distance-to-size ratio, where the distance refers to the distance from the sample point to the center of the prototype.
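The two-phase construction and the fallback classification rule just described admit a compact implementation. The sketch below is ours and makes a few assumptions not spelled out in the text (Euclidean distance, strict inclusion in a hypersphere, and the names fit_lcn and predict_lcn are hypothetical); it is meant to illustrate the flow of the algorithm rather than to reproduce the authors' implementation.

    import numpy as np

    def fit_lcn(X, y):
        # Phase 1: one hypersphere per training point, with radius equal to the
        # distance to the closest point of the opposite class.
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
        radii = np.array([D[i, y != y[i]].min() for i in range(len(X))])
        # Confidence measure (|n1 - n2| - 1)/sqrt(n) over the training samples
        # that each sphere covers (strict inclusion assumed).
        conf = np.empty(len(X))
        for i in range(len(X)):
            covered = D[i] < radii[i]
            n1 = int(np.sum(covered & (y == y[i])))
            n2 = int(np.sum(covered & (y != y[i])))
            conf[i] = (abs(n1 - n2) - 1) / np.sqrt(max(n1 + n2, 1))
        # Phase 2: assign every sample to the covering sphere with the highest
        # confidence; the spheres selected in this way become the prototypes.
        keep = set()
        for j in range(len(X)):
            covering = np.where(D[:, j] < radii)[0]
            if covering.size:
                keep.add(int(covering[np.argmax(conf[covering])]))
        idx = sorted(keep)
        return X[idx], y[idx], radii[idx]

    def predict_lcn(x, centers, labels, radii):
        # Unambiguous if covered by prototypes of a single class only; otherwise
        # fall back to the smallest distance-to-radius ratio.
        d = np.linalg.norm(centers - np.asarray(x, dtype=float), axis=1)
        inside = np.where(d < radii)[0]
        if len(inside) > 0 and len(set(labels[inside])) == 1:
            return labels[inside[0]], "confident"
        return labels[int(np.argmin(d / radii))], "low-confidence"

With prototypes fitted this way, the three types of regions discussed above correspond to test points whose covering prototypes belong to one class, to several classes, or to no prototype at all.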

V. Results and Discussion

We tested our algorithm on three datasets from the StatLog repository [7]: the Breast Cancer, SatImage and Letter datasets. For each of the three datasets, our result is compared with those of the four best algorithms reported, as summarized in Tables I-III.

The Wisconsin Breast Cancer dataset contains 699 instances, each with 9 attributes. Among the 699 instances, 458 (65.5%) are classified as benign and the remaining 241 (34.5%) as malignant. The result obtained with the leave-one-out test (also called the jackknife method) is shown in Table I.

The SatImage dataset is generated from Landsat Multi-Spectral Scanner image data. It has 4435 samples for training and 2000 samples for testing. Each sample contains 36 pixel values and a number indicating one of the six class categories of the central pixel.

The Letter dataset is generated from 20,000 black-and-white images of the 26 capital letters of the English alphabet. Each instance has 16 primitive numerical attributes representing statistical moments and edge counts. We trained the network on the first 15,000 samples and tested on the remaining 5,000 samples.

TABLE I. Results on the Breast Cancer dataset

Method    Accuracy (%)
FSM       98.3
LCN       98.0
2NN       97.1
21NN      96.9
CART      96.0

TABLE II. Results on the SatImage dataset

Method    Accuracy (%)
LCN       90.6
k-NN      90.6
LVQ       89.5
DIPOL92   88.9
RBF       87.9

TABLE III. Results on the Letter dataset

Method    Test error (%)
Alloc80   6.4
k-NN      6.8
LCN       7.4
LVQ       7.9
Quadisc   11.3

As can be seen from the results, our method performs well on real-world datasets. It should be emphasized that, in addition to the test error, we can obtain much more detailed information about the performance of the network. Take the SatImage dataset as an example: among the 2000 test samples, only 93 (4.65%) are covered only by wrong prototypes, 204 (10.2%) are covered by both right and wrong prototypes, and 89 (4.45%) are not covered by any prototype at all. In many situations, such as breast cancer diagnosis, it is extremely important not to miss any case of cancer. An important property of the system is therefore that it knows its limitations, so that it is not used in regions where its knowledge is insufficient for reliable classification; these cases should be referred to a doctor or another expert system.

As with the Nearest Neighbor rule and the RCE network, other distance functions, such as the Manhattan distance, can also be used; we found that the results are not much different. Another interesting observation is that most of the generalization errors come from the boundaries of large prototypes rather than from prototypes with small radii. This is not surprising, because small prototypes represent rare events, but it does suggest that our partitioning algorithm might not be optimal in practice.

VI. Conclusion

In this paper we introduced a pattern recognition system that learns object categories by partitioning the feature space into local regions with maximal confidence levels. We derived an expression for the probability of decision error over a region and showed that it is a decreasing function of the region's confidence measure. We also demonstrated that the Bayes rate of error can be achieved asymptotically by locally maximizing the confidence measure.

Our approach offers several advantages over the Nearest Neighbor rule and the RCE network. It is asymptotically optimal and doesn't require any a priori parameters. In addition, the algorithm for constructing prototype regions is extremely easy to implement and yields a unique partitioning of the feature space regardless of the order in which the training samples are presented to the system.

A significant property of the system is that it knows its limitations, since its knowledge is represented locally. During the training phase, this property allows the system to interact with a teacher to request more samples or new features. Since this can be done in local areas of the feature space, it does not adversely affect regions in which the system is already functioning well. Similarly, during the testing/recognition phase, this local representation of knowledge allows the system to report suspicious examples to its supervisor for detailed diagnosis. We therefore propose that the LCN represents a first step toward constructing an interactive recognition system.

Acknowledgment

This work was supported in part by the Army Research Office under Contract DAAD19-01-1-0754.

References

[1] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons, New York, 2000.
[2] T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, Vol. IT-13, No. 1, Jan. 1967, 21-27.
[3] T. M. Cover, Rates of convergence of nearest neighbor procedures, Proc. 1st Ann. Hawaii Conf. Systems Theory, Jan. 1968, 413-415.
[4] D. L. Reilly, L. N. Cooper and C. Elbaum, A neural model for category learning, Biol. Cybern., Vol. 45, 1982, 35-41.
[5] C. L. Scofield, D. L. Reilly, C. Elbaum and L. N. Cooper, Pattern class degeneracy in an unrestricted storage density memory, in Neural Information Processing Systems (Denver, CO, 1987), D. Z. Anderson, ed., American Institute of Physics, New York, NY, 1988, 674-682.
[6] L. N. Cooper, Hybrid neural network architectures: equilibrium systems that pay attention, in Neural Networks and Applications, R. J. Mammone and Y. Zeevi, eds., Academic Press, San Diego, 1991, 81-96.
[7] D. Michie, D. J. Spiegelhalter and C. C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, New York, 1994.