Validation of Nearest Neighbor Classifiers

Eric Bax

Computer Science Dept., California Institute of Technology, 256-80, Pasadena, CA 91125
[email protected]

June 2, 1998

Abstract

We develop a probabilistic bound on the error rate of the nearest neighbor classifier formed from a set of labelled examples. The bound is computed using only the examples in the set. A subset of the examples is used as a validation set to bound the error rate of the classifier formed from the remaining examples. Then a bound is computed for the difference in error rates between the original classifier and the reduced classifier. This bound is computed by partitioning the validation set and using each subset to compute bounds for the error rate difference due to the other subsets.

1 Framework

Consider the following machine learning framework. There is an unknown boolean-valued target function, and there is a distribution over the input space of the function. For example, the input distribution could consist of typical satellite images of the North Atlantic Ocean, and the target function could be 1 if the image contains a large iceberg and 0 otherwise. We have a set of in-sample data examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ with inputs $x_i$ drawn independently from the input distribution and outputs $y_i$ determined by the target function. We will use a nearest neighbor classifier, composed of the in-sample examples and a distance metric, to classify test inputs drawn independently from the input distribution. For each test input, the classifier returns the output corresponding to the closest in-sample input. The test error rate is the fraction of test inputs for which the classifier and the target function disagree. The underlying error rate is the expected test error rate over the input distribution.
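As an illustrative aside (not part of the paper), the following is a minimal sketch of the nearest neighbor classifier described above, assuming a Euclidean distance metric; the function and variable names are hypothetical.

```python
import numpy as np

def nearest_neighbor_classify(train_x, train_y, query_x):
    """Return the label of the closest in-sample input under a Euclidean metric.

    train_x: (n, d) array of in-sample inputs
    train_y: (n,) array of 0/1 labels
    query_x: (d,) array, a single test input
    """
    # Squared Euclidean distances from the query to every in-sample input.
    dists = np.sum((train_x - query_x) ** 2, axis=1)
    return train_y[np.argmin(dists)]

# Tiny usage example with made-up data.
train_x = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
train_y = np.array([0, 1, 1])
print(nearest_neighbor_classify(train_x, train_y, np.array([0.2, 0.1])))  # -> 0
```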


2 Introduction

Let $L_n$ be the underlying error rate. Let $R_n$ be the average of $L_n$ over all size $n$ in-sample data sets drawn from the input distribution. While this paper focuses on $L_n$, the error rate of the classifier at hand, much work in the past has focused on $R_n$, the average error rate over classifiers formed from randomly drawn examples. Cover and Hart [5] proved that under mild continuity assumptions $R_\infty$ is no more than twice the Bayes (optimal) error rate. Cover [3] showed that the average error rate of using the remaining examples to classify each example in the in-sample set is an unbiased estimate of $R_{n-1}$. Cover [4] and Psaltis, Snapp, and Venkatesh [12] have investigated the convergence of $R_n$ to $R_\infty$. Cover [4] worked with the case of a one-dimensional input space, and Psaltis et al. [12] extended the work, showing that dimensionality can have a strong effect on convergence. Wagner [15], Fritz [7], and Györfi [8] have studied the convergence of $L_n$ to $R_\infty$.

This paper develops the following method to bound $L_n$. Some in-sample examples are removed from the in-sample set to form a validation set. The remaining in-sample examples form a reduced classifier. Since the validation examples are independent of the reduced classifier, the validation examples can be used to obtain an error bound for the reduced classifier. To produce an error bound for the original classifier, we must also bound the difference in error rates between the original classifier and the reduced classifier. To do this, we partition the validation examples into subsets. Since the examples in each subset are independent of the examples in the other subsets, we can use each subset to obtain a bound for the difference in error rates due to the other subsets. We combine these bounds using truncated inclusion and exclusion to bound the error rate difference due to the entire validation set, which is the error rate difference between the original and reduced classifiers.

After developing bounds for the underlying error rate that can be computed using only the in-sample examples, we develop bounds that incorporate independently drawn unlabelled inputs and bounds that incorporate knowledge of the input distribution. Next, we develop bounds for the test error rate, including bounds that incorporate knowledge of test inputs.

3 Validation

For now, imagine that we have $k$ labelled validation examples $(x_{n+1}, y_{n+1}), \ldots, (x_{n+k}, y_{n+k})$ drawn according to the input distribution, independently of the in-sample examples. Let $C_n$ be the classifier formed from the in-sample examples. Let $\nu_k$ be the error rate of $C_n$ over the validation examples. By Hoeffding's inequality [9],

$$\Pr\{L_n \geq \nu_k + \epsilon\} \leq e^{-2k\epsilon^2}. \quad (1)$$

This is a probabilistic upper bound on the underlying error rate. For more information on this approach to validation, refer to work by Vapnik and Chervonenkis [14], Vapnik [13], Abu-Mostafa [1], and Bax [2].
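As a sketch of how bound (1) might be used in practice (illustrative, not from the paper; the function name is an assumption): for a chosen bound-failure probability $\delta$, solving $e^{-2k\epsilon^2} = \delta$ for $\epsilon$ gives the slack to add to the observed validation error rate.

```python
import math

def hoeffding_upper_bound(validation_error_rate, k, delta):
    """Upper bound on the underlying error rate that holds with probability
    at least 1 - delta, following bound (1): eps = sqrt(ln(1/delta) / (2k))."""
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * k))
    return validation_error_rate + eps

# Example: 1000 validation examples, observed error rate 0.08, delta = 0.05.
print(hoeffding_upper_bound(0.08, 1000, 0.05))  # about 0.119
```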

4 Reduced Classifier

Now return to the basic framework, in which we have only the labelled in-sample examples. To employ Hoeffding's inequality, remove the last $k$ examples from the in-sample set to form a validation set. Let $C_{n-k}$ be the classifier formed from the remaining examples. Let $L_{n-k}$ be the underlying error rate of $C_{n-k}$. Let $\nu_k$ be the error rate of $C_{n-k}$ over the validation examples. Then

$$\Pr\{L_{n-k} \geq \nu_k + \epsilon\} \leq e^{-2k\epsilon^2}. \quad (2)$$
Now that we have an upper bound for $L_{n-k}$, we can develop an upper bound for $L_n - L_{n-k}$ to obtain an upper bound for $L_n$.
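A minimal sketch (names and the Euclidean metric are assumptions for illustration) of holding out the last $k$ in-sample examples and computing the validation error rate $\nu_k$ of the reduced classifier $C_{n-k}$:

```python
import numpy as np

def reduced_classifier_validation_error(xs, ys, k):
    """Hold out the last k in-sample examples as a validation set and return
    the error rate of the reduced 1-NN classifier C_{n-k} on them (k < n assumed)."""
    train_x, train_y = xs[:-k], ys[:-k]   # examples forming C_{n-k}
    val_x, val_y = xs[-k:], ys[-k:]       # validation examples
    errors = 0
    for x, y in zip(val_x, val_y):
        dists = np.sum((train_x - x) ** 2, axis=1)  # Euclidean metric assumed
        if train_y[np.argmin(dists)] != y:
            errors += 1
    return errors / k
```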

5 Error Rate Difference

Let $p$ be the probability that an example drawn from the input distribution is incorrectly classified by $C_n$ and correctly classified by $C_{n-k}$. Let $q$ be the probability that the example is correctly classified by $C_n$ and incorrectly classified by $C_{n-k}$. Note that
$$L_n - L_{n-k} = p - q. \quad (3)$$
Hence, the difference between an upper bound for $p$ and a lower bound for $q$ is an upper bound for $L_n - L_{n-k}$.

To compute an upper bound for $p$, partition the validation examples into subsets $A$ and $B$. Define $C_A$ to be the classifier formed from the union of $A$ and the examples in $C_{n-k}$. Define $C_B$ to be the classifier formed from the union of $B$ and the examples in $C_{n-k}$. Let $p_A$ be the probability that $C_A$ is incorrect and $C_{n-k}$ is correct. Let $p_B$ be the probability that $C_B$ is incorrect and $C_{n-k}$ is correct. Let $p_{AB}$ be the probability that $C_A$ and $C_B$ are both incorrect and $C_{n-k}$ is correct. By inclusion and exclusion,
$$p = p_A + p_B - p_{AB}. \quad (4)$$
Since the examples in $B$ are independent of the examples in $C_A$, we can use them to bound $p_A$. Let $\hat{p}_A$ be the fraction of examples in $B$ for which $C_A$ is incorrect and $C_{n-k}$ is correct. By Hoeffding's inequality,

$$\Pr\{p_A \geq \hat{p}_A + \epsilon\} \leq e^{-2|B|\epsilon^2}. \quad (5)$$
Let $\hat{p}_B$ be the fraction of examples in $A$ for which $C_B$ is incorrect and $C_{n-k}$ is correct. Then
$$\Pr\{p_B \geq \hat{p}_B + \epsilon\} \leq e^{-2|A|\epsilon^2}. \quad (6)$$
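As an illustrative aside (not part of the derivation), the following sketch shows how the cross-estimates $\hat{p}_A$ and $\hat{p}_B$ might be computed: $C_A$ is evaluated on the examples in $B$ and $C_B$ on the examples in $A$, counting an example only when the enlarged classifier is wrong while $C_{n-k}$ is right. The names and the Euclidean metric are assumptions.

```python
import numpy as np

def nn_predict(train_x, train_y, x):
    """1-NN prediction under a Euclidean metric (assumed for illustration)."""
    return train_y[np.argmin(np.sum((train_x - x) ** 2, axis=1))]

def p_hat(base_x, base_y, add_x, add_y, eval_x, eval_y):
    """Fraction of the evaluation examples on which the enlarged classifier
    (base examples plus added subset) is wrong while the base classifier
    C_{n-k} is right: p_hat_A when (add = A, eval = B), p_hat_B when
    (add = B, eval = A)."""
    big_x = np.concatenate([base_x, add_x])
    big_y = np.concatenate([base_y, add_y])
    count = 0
    for x, y in zip(eval_x, eval_y):
        big_wrong = nn_predict(big_x, big_y, x) != y
        base_right = nn_predict(base_x, base_y, x) == y
        if big_wrong and base_right:
            count += 1
    return count / len(eval_y)
```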


We cannot bound $p_{AB}$ in this manner because we have no examples that are independent of the examples in both $C_A$ and $C_B$. Note that
$$p_A + p_B \geq p \quad \text{and} \quad p \geq L_n - L_{n-k}. \quad (7)$$
Combine the bounds for $p_A$ and $p_B$ to form a bound for $L_n - L_{n-k}$:

$$\Pr\{L_n - L_{n-k} \geq \hat{p}_A + \hat{p}_B + \epsilon\} \leq e^{-2|A|\epsilon_A^2} + e^{-2|B|\epsilon_B^2} \quad (8)$$
where $\epsilon_A + \epsilon_B = \epsilon$. Combine with bound 2 for $L_{n-k}$ to form a bound for $L_n$:
$$\Pr\{L_n \geq \nu_k + \hat{p}_A + \hat{p}_B + \epsilon\} \leq e^{-2k\epsilon_k^2} + e^{-2|A|\epsilon_A^2} + e^{-2|B|\epsilon_B^2} \quad (9)$$
where $\epsilon_k + \epsilon_A + \epsilon_B = \epsilon$. To find the optimal bound for a given tolerance $\epsilon$, solve
$$\min_{(\epsilon_k, \epsilon_A, \epsilon_B) > 0} \; e^{-2k\epsilon_k^2} + e^{-2|A|\epsilon_A^2} + e^{-2|B|\epsilon_B^2} \quad \text{subject to} \quad \epsilon_k + \epsilon_A + \epsilon_B = \epsilon. \quad (10)$$

This is easy to solve numerically. Because the partial derivatives of the objective function with respect to the individual variables are all negative and monotonically increasing, the solution is the unique positive value of $(\epsilon_k, \epsilon_A, \epsilon_B)$ that satisfies the constraint $\epsilon_k + \epsilon_A + \epsilon_B = \epsilon$ and has all partial derivatives of the objective function equal. This value can be found by gradient descent within the region indicated by the constraints. For the theory that motivates this method, consult Kuhn and Tucker [10] or Franklin [6]. (All feasible solutions are valid bounds.)

In the bound 9, an implicit lower bound of zero is used for $q$. This need not be the case. Let $q_A$ be the probability that $C_A$ is correct and $C_{n-k}$ is incorrect. Define $q_B$ and $q_{AB}$ analogously to $p_B$ and $p_{AB}$. Then
$$q_A \leq q \quad \text{and} \quad q_B \leq q. \quad (11)$$
Define $\hat{q}_A$ and $\hat{q}_B$ analogously to $\hat{p}_A$ and $\hat{p}_B$. By Hoeffding's inequality,
$$\Pr\{q_A \leq \hat{q}_A - \epsilon\} \leq e^{-2|B|\epsilon^2} \quad \text{and} \quad \Pr\{q_B \leq \hat{q}_B - \epsilon\} \leq e^{-2|A|\epsilon^2}. \quad (12)$$
Since $q_A \leq q$,
$$L_n - L_{n-k} \leq p_A + p_B - q_A. \quad (13)$$
Combining the bound for $q_A$ with the bound 9 produces the bound
$$\Pr\{L_n \geq \nu_k + \hat{p}_A + \hat{p}_B - \hat{q}_A + \epsilon\} \leq e^{-2k\epsilon_k^2} + e^{-2|A|\epsilon_A^2} + 2e^{-2|B|\epsilon_B^2} \quad (14)$$
where $\epsilon_k + \epsilon_A + 2\epsilon_B = \epsilon$. Similarly, since $q_A \leq q$ and $q_B \leq q$,
$$L_n - L_{n-k} \leq p_A + p_B - \max(q_A, q_B). \quad (15)$$
So we have the bound
$$\Pr\{L_n \geq \nu_k + \hat{p}_A + \hat{p}_B - \max(\hat{q}_A, \hat{q}_B) + \epsilon\} \leq e^{-2k\epsilon_k^2} + 2e^{-2|A|\epsilon_A^2} + 2e^{-2|B|\epsilon_B^2} \quad (16)$$
where $\epsilon_k + 2\epsilon_A + 2\epsilon_B = \epsilon$.
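One way to solve (10) numerically is with a generic constrained optimizer. The sketch below uses SciPy's SLSQP solver as a convenience rather than the gradient descent procedure described above; by the remark that all feasible solutions are valid bounds, any feasible output still yields a correct (if slightly loose) bound. The function and variable names are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_epsilon_split(k, a, b, eps_total):
    """Split the total tolerance eps into (eps_k, eps_A, eps_B) to minimize
    the right-hand side of bound (9):
        exp(-2*k*eps_k**2) + exp(-2*a*eps_A**2) + exp(-2*b*eps_B**2)."""
    def objective(e):
        return (np.exp(-2 * k * e[0] ** 2)
                + np.exp(-2 * a * e[1] ** 2)
                + np.exp(-2 * b * e[2] ** 2))
    x0 = np.full(3, eps_total / 3.0)  # feasible starting point
    cons = [{"type": "eq", "fun": lambda e: np.sum(e) - eps_total}]
    bounds = [(1e-9, eps_total)] * 3
    res = minimize(objective, x0, method="SLSQP", bounds=bounds, constraints=cons)
    return res.x, objective(res.x)

# Example: k = 500 validation examples split into |A| = |B| = 250, eps = 0.1.
eps_split, failure_prob = optimal_epsilon_split(500, 250, 250, 0.1)
```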

6 More Partitions

The sequence of bounds 9, 14, and 16 is a progression in which the expected bias in the estimates of $L_n$ decreases, but the confidence in the bound also decreases. The progression can be continued by partitioning the validation set into increasing numbers of subsets. For example, partition into sets $A$, $B$, and $C$ to compute a lower bound for $q$. Define $q_A, \ldots, q_{ABC}$ as before, e.g., $q_{ABC}$ is the probability that $C_A$, $C_B$, and $C_C$ are all correct and $C_{n-k}$ is incorrect. Note that
$$q = q_A + q_B + q_C - q_{AB} - q_{AC} - q_{BC} + q_{ABC}. \quad (17)$$
Let $\hat{q}_A$ be the estimate of $q_A$ calculated using $B \cup C$, calculate $\hat{q}_{AB}$ using $C$, etc. As before, we do not have data to estimate the final inclusion and exclusion term $q_{ABC}$. For simplicity, assume $|A| = |B| = |C| = k/3$. The lower bound for $q$ is

$$\Pr\{q \leq \hat{q}_A + \hat{q}_B + \hat{q}_C - \hat{q}_{AB} - \hat{q}_{AC} - \hat{q}_{BC} - \epsilon\} \leq 3e^{-2(2k/3)\epsilon_1^2} + 3e^{-2(k/3)\epsilon_2^2} \quad (18)$$
where $3\epsilon_1 + 3\epsilon_2 = \epsilon$. Lower bounds for $q$ are derived using odd numbers of subsets, and upper bounds for $p$ are derived using even numbers of subsets, because the final term in the inclusion and exclusion formula is positive for odd numbers of sets and negative for even numbers of sets. In every case, we can validate all terms except the final term. Since the probability of bound error (the RHS of the bound inequalities) increases with the number of terms in the inclusion and exclusion formula, it increases exponentially in the number of subsets in the partition. The expected bias due to truncating the inclusion and exclusion formula decreases more slowly. For some ideas about how to estimate the inclusion and exclusion formula using fewer terms, refer to work by Linial and Nisan [11].
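A sketch (names hypothetical, and the failure-probability coefficients mirroring bound (18) as reconstructed above) of assembling the truncated inclusion and exclusion lower bound for $q$ from the six cross-estimates; each $\hat{q}$ term would be computed the same way as $\hat{p}$ above, but counting examples where the enlarged classifiers are right and $C_{n-k}$ is wrong.

```python
import math

def q_lower_bound(q_hats_single, q_hats_pair, k, eps1, eps2):
    """Probabilistic lower bound on q per (17)-(18), assuming the validation
    set of size k is split into three equal subsets A, B, C.

    q_hats_single: [q_hat_A, q_hat_B, q_hat_C], each estimated on the other 2k/3 examples
    q_hats_pair:   [q_hat_AB, q_hat_AC, q_hat_BC], each estimated on the remaining k/3 examples
    Returns the lower bound (estimate minus total slack 3*eps1 + 3*eps2) and
    the probability that the bound fails.
    """
    estimate = sum(q_hats_single) - sum(q_hats_pair)
    failure_prob = (3 * math.exp(-2 * (2 * k / 3) * eps1 ** 2)
                    + 3 * math.exp(-2 * (k / 3) * eps2 ** 2))
    return estimate - (3 * eps1 + 3 * eps2), failure_prob
```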

7 Validation Set Size


The choice of validation set size $k$ mediates a tradeoff between confidence and robustness. Observe from the RHS of bound 9 that as $k$ increases, the probability of bound failure decreases exponentially. However, as more examples are removed from the in-sample set to form the validation set, the rate of disagreement between the original and reduced classifiers tends to increase. This, in turn, tends to increase the bias in the estimate of the difference in error rates. The exact increase in bias is determined by the input distribution, the target function, and the random drawing of examples. However, the difference in error rates is sure to become pronounced if enough examples are removed from the in-sample set to significantly weaken the accuracy of the classifier formed by the remaining examples.


8 Using Unlabelled Inputs

If we have some unlabelled inputs drawn from the input distribution independently of the in-sample inputs, we can use them to bound $L_n - L_{n-k}$. Let $m$ be the number of unlabelled inputs. Let $d$ be the expected rate of disagreement between $C_n$ and $C_{n-k}$ over the input distribution. Note that $d \geq L_n - L_{n-k}$, i.e., the rate of disagreement bounds the difference in error rates. Let $\hat{d}$ be the rate of disagreement over the unlabelled inputs. By Hoeffding's inequality,
$$\Pr\{d \geq \hat{d} + \epsilon\} \leq e^{-2m\epsilon^2}. \quad (19)$$
So

$$\Pr\{L_n - L_{n-k} \geq \hat{d} + \epsilon\} \leq e^{-2m\epsilon^2}. \quad (20)$$
Combine with bound 2 for $L_{n-k}$ to produce the bound
$$\Pr\{L_n \geq \nu_k + \hat{d} + \epsilon\} \leq e^{-2k\epsilon_1^2} + e^{-2m\epsilon_2^2} \quad (21)$$
where $\epsilon_1 + \epsilon_2 = \epsilon$. If the input distribution is known, then any number of random inputs can be generated, so $d$ can be estimated to arbitrary accuracy.

In general, bounding the error rate difference by the rate of disagreement introduces more expected bias than using validation data to estimate $p$ and $q$. Note that $d = p + q$. Since $L_n - L_{n-k} = p - q$, estimating $L_n - L_{n-k}$ by $d$ introduces an expected bias of $2q$. If the original classifier is more accurate than the reduced classifier, then $q > p$, and the expected bias is greater than $p + q$. In contrast, the truncated inclusion and exclusion estimates can only have expected bias $p + q$ if the truncated final terms are as large as the entire inclusion and exclusion formulas.
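A minimal sketch (names and Euclidean metric assumed for illustration) of estimating the disagreement rate $\hat{d}$ of (19) between $C_n$ and $C_{n-k}$ over a set of unlabelled inputs:

```python
import numpy as np

def nn_predict(train_x, train_y, x):
    """1-NN prediction under a Euclidean metric (assumed for illustration)."""
    return train_y[np.argmin(np.sum((train_x - x) ** 2, axis=1))]

def disagreement_rate(full_x, full_y, k, unlabelled_x):
    """Fraction of unlabelled inputs on which C_n and the reduced classifier
    C_{n-k} (last k examples removed) disagree -- the d_hat of (19)."""
    red_x, red_y = full_x[:-k], full_y[:-k]
    disagreements = sum(
        nn_predict(full_x, full_y, x) != nn_predict(red_x, red_y, x)
        for x in unlabelled_x
    )
    return disagreements / len(unlabelled_x)
```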

9 Using Test Inputs

If the test inputs are known, then the rate of disagreement between $C_n$ and $C_{n-k}$ over the test inputs can be calculated directly. Denote it as $\hat{d}_T$. Let $T_n$ be the error rate of $C_n$ over the test inputs, and let $T_{n-k}$ be the error rate of $C_{n-k}$ over the test inputs. By Hoeffding's inequality,

$$\Pr\{T_{n-k} \geq L_{n-k} + \epsilon\} \leq e^{-2t\epsilon^2} \quad (22)$$
where $t$ is the number of test inputs. Since $\hat{d}_T \geq T_n - T_{n-k}$,
$$\Pr\{T_n \geq L_{n-k} + \hat{d}_T + \epsilon\} \leq e^{-2t\epsilon^2}. \quad (23)$$
Combine with bound 2 for $L_{n-k}$ to produce a bound in terms of computable quantities:
$$\Pr\{T_n \geq \nu_k + \hat{d}_T + \epsilon\} \leq e^{-2k\epsilon_k^2} + e^{-2t\epsilon_t^2} \quad (24)$$
where $\epsilon_k + \epsilon_t = \epsilon$.


10 Bounds on Test Error

Even if the test inputs are unknown, we can still produce a bound for the test error rate in terms of computable quantities. First, compute a bound for $L_n$ (such as 9, 14, 16, or 21). Then use Hoeffding's inequality again:

$$\Pr\{T_n \geq L_n + \epsilon\} \leq e^{-2t\epsilon^2}. \quad (25)$$

11 Conclusion

We have developed methods to bound the underlying error rate and the test error rate of a nearest neighbor classifier. The bounds can be computed using only the examples in the classifier. Thus, all available examples can be employed for the primary purpose of forming the classifier. No examples must be sacrificed for the secondary purpose of computing probabilistic bounds for the error rate on unseen data.

A challenge for the future is to extend the bounds for single classifiers to uniform bounds over multiple classifiers. For finite sets of classifiers, uniform bounds may be developed by summing the probabilities of bound failures for single classifiers to bound the probability of failure in the uniform bound [1]. For example, these bounds apply to sets of classifiers formed by eliminating different subsets of examples from the in-sample set. It may be possible to improve these bounds using methods of validation by inference based on rates of agreement among classifiers, as in [2]. For infinite classes of classifiers, it may be possible to develop bounds using the concept of VC dimension [14]. For example, these bounds would apply to all classifiers with a particular set of examples and a metric from some class of metrics. Another challenge is to develop bounds for k nearest neighbor classifiers, in which the output is the result of voting among the k classifier examples with inputs nearest to the test input.

References

[1] Y. Abu-Mostafa, What you need to know about the VC inequality, Class notes from CS156, California Institute of Technology, 1996.
[2] E. Bax, Validation of voting committees, Neural Computation 10 (4) (1998) 975-986.
[3] T. M. Cover, Learning in pattern recognition, in Methodologies of Pattern Recognition, S. Watanabe, Ed., Academic Press, New York, 1969, 111-132.
[4] T. M. Cover, Rates of convergence of nearest neighbor decision procedures, Proc. First Annual Hawaii Conf. on Systems Theory, 1968, 413-415.
[5] T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Trans. Inform. Theory, IT-13 (1967) 21-27.

[6] J. Franklin, Methods of Mathematical Economics, Springer-Verlag New York, Inc., 1979, 190-203.
[7] J. Fritz, Distribution-free exponential error bound for nearest neighbor classification, IEEE Trans. Inform. Theory, IT-21 (1975) 552-557.
[8] L. Györfi, On the rate of convergence of nearest neighbor rules, IEEE Trans. Inform. Theory, IT-24 (1978) 509-512.
[9] W. Hoeffding, Probability inequalities for sums of bounded random variables, Am. Stat. Assoc. J., 58 (1963) 13-30.
[10] H. W. Kuhn and A. W. Tucker, Nonlinear programming, in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, Univ. of California Press, 1950, 481-492.
[11] N. Linial and N. Nisan, Approximate inclusion-exclusion, Combinatorica 10 (4) (1990) 349-365.
[12] D. Psaltis, R. Snapp, and S. Venkatesh, On the finite sample performance of the nearest neighbor classifier, IEEE Trans. Inform. Theory, 40 (3) (1994) 820-837.
[13] V. N. Vapnik, Estimation of Dependences Based on Empirical Data, p. 31, Springer-Verlag New York, Inc., 1982.
[14] V. N. Vapnik and A. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Prob. Appl., 16 (1971) 264-280.
[15] T. J. Wagner, Convergence of the nearest neighbor rule, IEEE Trans. Inform. Theory, IT-17 (1971) 566-571.
