A Statistical Confidence-Based Adaptive Nearest Neighbor Algorithm for Pattern Classification

Jigang Wang, Predrag Neskovic, and Leon N. Cooper

Institute for Brain and Neural Systems, Department of Physics, Brown University, Providence RI 02912, USA
[email protected], [email protected], Leon [email protected]

Abstract. The k-nearest neighbor rule is one of the simplest and most attractive pattern classification algorithms. It can be interpreted as an empirical Bayes classifier based on the estimated a posteriori probabilities from the k nearest neighbors. The performance of the k-nearest neighbor rule relies on the locally constant a posteriori probability assumption. This assumption, however, becomes problematic in high dimensional spaces due to the curse of dimensionality. In this paper we introduce a locally adaptive nearest neighbor rule. Instead of using the Euclidean distance to locate the nearest neighbors, the proposed method takes into account the effective influence size of each training example and the statistical confidence with which the label of each training example can be trusted. We test the new method on real-world benchmark datasets and compare it with the standard k-nearest neighbor rule and the support vector machines. The experimental results confirm the effectiveness of the proposed method.

1 Introduction

One of the simplest and most attractive pattern classification algorithms, first proposed by Fix and Hodges in 1951, is the nearest neighbor rule [1]. Given a set of n labeled examples D_n = {(X_1, Y_1), . . . , (X_n, Y_n)}, where X_i are the feature vectors in some input space and Y_i ∈ {ω_1, . . . , ω_M} are the corresponding class labels, the nearest neighbor rule classifies an unseen pattern X into the class of its nearest neighbor in the training data D_n. It can be shown that, at any given point X in the input space, the probability that its nearest neighbor X' belongs to class ω_i converges to the corresponding a posteriori probability P(ω_i|X) as the size of the training sample goes to infinity, namely, P(ω_i|X) = lim_{n→∞} P(ω_i|X'). Furthermore, it was shown in [2,3] that under certain continuity conditions on the underlying distributions, the asymptotic probability of error L_{NN} of the nearest neighbor rule is bounded by

L^* \le L_{NN} \le L^* \left( 2 - \frac{M}{M-1} L^* \right),    (1)

This work was partially supported by ARO under Grant W911NF-04-1-0357. Jigang Wang was supported by a dissertation fellowship from Brown University.



where L^* is the optimal Bayes probability of error. Therefore, the nearest neighbor rule, despite its extreme simplicity, is asymptotically optimal when the classes do not overlap, i.e., L^* = 0. However, when the classes do overlap, the nearest neighbor rule is suboptimal. In this case, the problem occurs at overlapping regions where P(ω_i|X) > 0 for more than one class ω_i. In those regions, the nearest neighbor rule deviates from the Bayes decision rule by classifying X into class ω_i with probability P(ω_i|X) instead of assigning X to the majority class with probability one.

Theoretically, this shortcoming of the nearest neighbor rule can be overcome by a natural extension, the k-nearest neighbor rule. As the name suggests, this rule classifies X into the class that appears most frequently among the k nearest neighbors. Indeed, as shown by Stone and Devroye, respectively, in [4,5], the k-nearest neighbor rule is universally consistent provided that the growth of k with n is properly controlled, i.e., k → ∞ and k/n → 0 as n → ∞. However, in practical applications where there is only a finite amount of training data, choosing an optimal value for k is a non-trivial task, and one usually resorts to methods such as cross-validation to pick the best value for k.

With only a finite number of training examples, the k-nearest neighbor rule can be viewed as an empirical Bayes classifier based on the a posteriori probabilities estimated from the k nearest neighbors. Its performance, therefore, relies heavily on the assumption that the a posteriori probability is approximately constant within the neighborhood determined by the k nearest neighbors. With the commonly used Euclidean distance measure, this method is subject to severe bias in a high-dimensional input space due to the curse of dimensionality [7].

To overcome this problem, many methods have been proposed to locally adapt the metric so that a neighborhood of constant a posteriori probability can be produced. These methods include the flexible metric method by Friedman [7], the discriminant adaptive method proposed by Hastie and Tibshirani [8], and the adaptive metric method proposed by Domeniconi et al. [9]. Although differing in their approaches, the common idea underlying these methods is that they estimate feature relevance locally at each query point. The locally estimated feature relevance leads to a weighted metric for computing the distance between a query point and the training data. As a result, neighborhoods get constricted along the most relevant dimensions and elongated along the less important ones. These methods improve the original k-nearest neighbor rule because they are capable of producing local neighborhoods in which the a posteriori probabilities are approximately constant. However, the computational complexity of such improvements is high. Furthermore, these methods usually introduce more model parameters, which need to be optimized along with the value of k via cross-validation, therefore further increasing the computational load.

In this paper, we propose a simple adaptive nearest neighbor rule for pattern classification. Compared to the standard k-nearest neighbor rule, our proposed method has two appealing features. First, with our method, each training example defines an influence region that is centered on the training example itself and covers as many training examples from the same class as possible. Therefore,


the size of the influence region associated with each training example depends on its relative position to other training examples, especially training examples of other classes. In determining the nearest neighbors of a given query point, the distance from a query point to each training example is divided by the size of the corresponding influence region. As a result, training examples identified as nearest neighbors according to the scaled distance measure are more likely to have the same class label as the query point. The second feature of our proposed method is that it takes into account possible labeling errors in the training data by weighting each training example with the statistical confidence that can be associated with it. Our experimental results show that the proposed method significantly improves the generalization performance of the standard k-nearest neighbor rule.

This paper is organized as follows. In Section 2, we define the probability of error in decisions made by the majority rule based on a finite number of observations, and show that the probability of error is bounded by a decreasing function of the confidence measure. We then use the defined probability of error as a criterion for determining the influence region and the statistical confidence that can be associated with each training example. Finally, we present a new adaptive nearest neighbor rule based on the influence region and statistical confidence. In Section 3, we test the new algorithm on several real-world datasets and compare it to the original k-nearest neighbor rule. Concluding remarks are given in Section 4.

2 Adaptive Nearest Neighbor Rule for Classification

One of the main reasons for the success of the k-nearest neighbor rule is that, for an arbitrary query point X, the class labels Y' of its k nearest neighbors can be treated as approximately distributed according to the desired a posteriori probability P(Y|X). Therefore, the empirical frequency with which each class ω_i appears among the k nearest neighbors provides an estimate of the a posteriori probability P(ω_i|X). The k-nearest neighbor rule can thus be viewed as an empirical Bayes decision rule based on the estimate of P(Y|X) from the k nearest neighbors. In practice, the performance of the k-nearest neighbor rule therefore depends to a large extent on how well the constant a posteriori probability assumption is met. In addition, even if the constant a posteriori probability assumption holds, there is still a probability that the empirical majority class turns out to be different from the true majority class based on the underlying distribution. In this section, we will address these issues.

2.1 Probability of Error and Confidence Measure

For simplicity we consider a two-class classification problem. Let R ⊂ Ω be an arbitrary region in the input space, and let p = P(Y = +1|X ∈ R) be the a posteriori probability of the class being +1 given that X ∈ R. Let X_1, . . . , X_n be n independently and identically distributed (i.i.d.) random feature vectors


that fall in R. The n corresponding labels Y_1, . . . , Y_n can then be treated as i.i.d. according to the Bernoulli distribution Bern(p). According to the binomial law, the probability that n_+ of them belong to class +1 (and therefore n_- = n - n_+ belong to class -1) is given by \binom{n}{n_+} p^{n_+} (1-p)^{n_-}. Without loss of generality, let us assume that R contains n_- examples of class -1 and n_+ examples of class +1, with n_- - n_+ = δ > 0, i.e., R contains δ more negative examples than positive examples. By the majority rule, R will be associated with the class -1. In practice, p is unknown; hence whether p ∈ [0, 0.5) or p ∈ (0.5, 1] is unknown. However, there are only two possibilities: if p ∈ [0, 0.5), the class with the maximum a posteriori probability is also -1 and the classification of R to -1 is correct; however, if p ∈ (0.5, 1], the class with the maximum a posteriori probability is +1 and the classification of R to -1 is a false negative error. Therefore, given p ∈ (0.5, 1], the expression

P_{err}(p; \delta; n) = \sum_{i=0}^{(n-\delta)/2} \binom{n}{i} p^i (1-p)^{n-i}    (2)

is the probability of false negatives. It is easy to show that, when R contains δ more positive examples than negative examples, the probability of false positives can be defined similarly. To keep the probability of false negatives or false positives small, it is important to keep Eq. (2) under control. It is easy to check that P_{err}(p; δ; n) is a decreasing function of p, which means that it is bounded above by

P_{err}(\delta; n)_{max} = \sup_{p \in (0.5, 1]} P_{err}(p; \delta; n) = \frac{1}{2^n} \sum_{i=0}^{(n-\delta)/2} \binom{n}{i} \approx \Phi\left(-\frac{\delta - 1}{\sqrt{n}}\right),    (3)

where Φ is the cumulative distribution function (CDF) of a standard Gaussian random variable. The approximation of the upper bound is obtained by applying the normal approximation to the binomial distribution, which is in turn based on the central limit theorem. The probability of error P_{err}(p; δ; n) can also be bounded by applying other concentration-of-measure inequalities. For example, applying Hoeffding's inequality [10], one can obtain

P_{err}(\delta; n)_{max} = \frac{1}{2^n} \sum_{i=0}^{(n-\delta)/2} \binom{n}{i} \approx e^{-\delta^2/(2n)}.    (4)

In the following, we use the normal approximation of the upper bound (3) to illustrate the relationship between P_{err}(δ; n)_{max} and (δ − 1)/√n. Obviously, P_{err}(δ; n)_{max} is decreasing in (δ − 1)/√n because, as a cumulative distribution function, Φ(x) is an increasing function of x. Eq. (3) also tells us quantitatively how large (δ − 1)/√n should be in order to keep the probability of error under some preset value. For n < 200, we enumerate all possible values of δ and n and calculate (δ − 1)/√n and the corresponding value of P_{err}(δ; n)_{max}. The results are shown in Fig. 1. As we can see, the upper bound of the probability of error is a rapidly decreasing function of (δ − 1)/√n.
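To make the bound concrete, here is a short Python sketch (our illustration, not part of the original paper; function names are ours) that evaluates the worst-case probability of error of Eq. (3) exactly from the binomial sum and compares it with the normal approximation and with the Hoeffding-style bound of Eq. (4):

```python
import math
from scipy.stats import norm  # standard normal CDF, used for the approximation in Eq. (3)

def p_err_max(delta, n):
    """Exact worst-case probability of error: 2^-n * sum_{i=0}^{(n-delta)/2} C(n, i)."""
    limit = (n - delta) // 2          # n - delta = 2 * n_minority, so it is always even
    return sum(math.comb(n, i) for i in range(limit + 1)) / 2.0 ** n

def p_err_normal(delta, n):
    """Normal approximation Phi(-(delta - 1) / sqrt(n)) from Eq. (3)."""
    return norm.cdf(-(delta - 1) / math.sqrt(n))

def p_err_hoeffding(delta, n):
    """Hoeffding-style bound exp(-delta^2 / (2 n)) from Eq. (4)."""
    return math.exp(-delta ** 2 / (2.0 * n))

# Example: a region containing 20 examples, 14 of one class and 6 of the other (delta = 8).
print(p_err_max(8, 20))        # ~0.058
print(p_err_normal(8, 20))     # ~0.059
print(p_err_hoeffding(8, 20))  # ~0.202 (looser, but distribution-free)
```

For this example region the confidence measure is (δ − 1)/√n = 7/√20 ≈ 1.57, and the exact worst-case error is already close to its normal approximation.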


Fig. 1. Probability of error as a function of the confidence measure

Since P_{err}(\bar{p}; δ; n) is the probability that the observation is at odds with the true state of nature, 1 − P_{err}(\bar{p}; δ; n) is the probability that the observation agrees with the true state of nature. We therefore define

\alpha(\bar{p}; \delta; n) \equiv 1 - P_{err}(\bar{p}; \delta; n)    (5)

to be the confidence level. From Eq. (3), it follows that the confidence level is bounded below by

\alpha(\delta; n) = 1 - P_{err}(\delta; n)_{max} \approx \mathrm{erf}\left(\frac{\delta - 1}{\sqrt{n}}\right),    (6)

and the larger (δ − 1)/√n, the higher the confidence level. For this reason, and for convenience, we will call (δ − 1)/√n the confidence measure.

An alternative way to define the probability of error for a decision that is made by the majority rule based on a finite number of observations is to use the Beta prior model for the binomial distribution. Based on the same argument, the probability of error can be defined as

P_{err}(\delta; n) = \frac{\int_{0.5}^{1} p^{(n-\delta)/2} (1-p)^{(n+\delta)/2} \, dp}{\int_{0}^{1} p^{(n-\delta)/2} (1-p)^{(n+\delta)/2} \, dp},    (7)

which gives the probability that the actual majority class of the posterior probability distribution differs from the one that is concluded empirically from the majority rule based on n and δ. Likewise, the confidence level can be defined as

\alpha(\delta; n) = 1 - P_{err}(\delta; n) = \frac{\int_{0}^{0.5} p^{(n-\delta)/2} (1-p)^{(n+\delta)/2} \, dp}{\int_{0}^{1} p^{(n-\delta)/2} (1-p)^{(n+\delta)/2} \, dp}.    (8)
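For illustration, the Beta-prior quantities in Eqs. (7) and (8) reduce to the regularized incomplete beta function, so they can be evaluated without explicit numerical integration. The sketch below is ours (it assumes SciPy is available; the function names are not from the paper):

```python
from scipy.special import betainc  # regularized incomplete beta function I_x(a, b)

def beta_prior_p_err(delta, n):
    """Eq. (7): probability, under a uniform prior on p, that the true majority class
    differs from the empirically observed one (delta more votes for the observed class)."""
    a = (n - delta) / 2.0 + 1.0   # exponent of p in the integrand, plus one
    b = (n + delta) / 2.0 + 1.0   # exponent of (1 - p) in the integrand, plus one
    return 1.0 - betainc(a, b, 0.5)

def beta_prior_confidence(delta, n):
    """Eq. (8): confidence level alpha(delta; n) = 1 - P_err(delta; n)."""
    return 1.0 - beta_prior_p_err(delta, n)

# Same example region as before: n = 20 observations with delta = 8.
print(beta_prior_p_err(8, 20), beta_prior_confidence(8, 20))
```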

In Fig. 2, the probability of error based on the Beta prior model is plotted against the confidence measure. Numerically, the two different definitions give about the same results, which can be seen by comparing Figs. 1 and 2.


Fig. 2. Probability of error (based on the Beta prior model) against the confidence measure

More precisely, the first definition of the probability of error can be better approximated as a function of the confidence measure, which is easily computable. In addition, for the same values of n and δ, the first definition gives a higher probability of error than the second one, since it is based on a worst-case consideration.

2.2 Adaptive Nearest Neighbor Rule

Given a training data set D_n = {(X_1, Y_1), . . . , (X_n, Y_n)} and a query point X, the k-nearest neighbor rule first finds the k nearest neighbors of X, denoted by X_(1), . . . , X_(k), and then assigns X to the majority class among Y_(1), . . . , Y_(k), where Y_(i) are the corresponding class labels of X_(i). Commonly used distance measures d(X, X_i) include the Euclidean distance and the L1 distance. For a binary classification problem in which Y ∈ {−1, 1}, it amounts to the following decision rule:

f(X) = \mathrm{sgn}\left( \sum_{i=1}^{k} Y_{(i)} \right).    (9)
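For reference, a minimal sketch of the standard rule in Eq. (9), assuming Euclidean distance and labels in {−1, +1} (our code, not the authors'):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Standard k-NN rule (Eq. 9): sign of the sum of the k nearest labels."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training example
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    return np.sign(y_train[nearest].sum())        # ties (sum == 0) are returned as 0 here
```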

The adaptive nearest neighbor rule starts off by first constructing an influence region for each training example. For each training example X_i, i = 1, . . . , n, its influence region is constructed as follows. Centered on X_i, we draw a sphere that is as large as possible without enclosing a training example of a different class. For simplicity, we call the influence region a sphere even if another metric, such as the L1 metric, is used to compute the distance. Denote the radius of the sphere by r_i. We then count the number of training examples that fall inside the sphere and compute the statistical confidence level α_i according to Eq. (8). Depending on the value of the computed confidence level, there are two possibilities. If the confidence level is above a preset threshold, e.g., 75%, we keep the influence region as it is. Otherwise, we enlarge the sphere to include more training examples until its confidence level reaches the preset threshold. In the second case, r_i is set to the radius of the enlarged sphere that achieves the desired confidence level, and α_i is set to the confidence level actually attained. Therefore, we associate


with each training example an influence sphere whose radius is r_i and whose confidence level is α_i. Given a query point X, the adaptive nearest neighbor rule works as follows. It first finds the k nearest neighbors of X, denoted by X_(1), . . . , X_(k), according to the scaled distances d(X, X_i)/r_i for i = 1, . . . , n, i.e., the distance measure according to which the training examples are sorted is

d_{new}(X, X_i) = \frac{d(X, X_i)}{r_i}.    (10)
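The influence-sphere construction described above might be sketched as follows. This is our reading of the procedure rather than the authors' implementation; the confidence level is computed from Eq. (8), and ties and other edge cases are handled in the simplest way:

```python
import numpy as np
from scipy.special import betainc

def confidence(delta, n):
    # Eq. (8): confidence level under the uniform Beta prior
    return betainc((n - delta) / 2.0 + 1.0, (n + delta) / 2.0 + 1.0, 0.5)

def influence_regions(X, y, threshold=0.75):
    """Return (r_i, alpha_i) for every training example: the radius of its influence
    sphere and the associated confidence level."""
    n_samples = len(X)
    radii = np.empty(n_samples)
    alphas = np.empty(n_samples)
    for i in range(n_samples):
        dists = np.linalg.norm(X - X[i], axis=1)
        order = np.argsort(dists)                 # order[0] is the example itself
        enemy = (y[order] != y[i])
        # largest sphere centered on X[i] that encloses no example of another class
        k = int(np.argmax(enemy)) if enemy.any() else n_samples
        r = dists[order[k - 1]]
        # enlarge the sphere until the confidence of its majority vote reaches the threshold
        while True:
            inside = y[order[:k]]
            delta = abs(int((inside == y[i]).sum()) - int((inside != y[i]).sum()))
            alpha = confidence(delta, k)
            if alpha >= threshold or k == n_samples:
                break
            k += 1
            r = dists[order[k - 1]]
        radii[i], alphas[i] = r, alpha
    return radii, alphas
```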

Once the k nearest neighbors X_(1), . . . , X_(k) are identified, it classifies X according to the following weighted majority rule:

f(X) = \mathrm{sgn}\left( \sum_{i=1}^{k} \alpha_{(i)} Y_{(i)} \right),    (11)

where α_(i) is the confidence level associated with X_(i), i.e., each neighbor is weighted by the confidence level associated with the corresponding influence sphere. Therefore, as one can easily imagine, a neighbor that happens to be close to the decision boundary has less weight because of its lower confidence level.
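Putting the pieces together, query-time classification by Eqs. (10) and (11) can be sketched as below (again our illustration; radii and alphas are assumed to come from an influence-region construction such as the one sketched above, and labels are in {−1, +1}):

```python
import numpy as np

def adaptive_knn_predict(X_train, y_train, radii, alphas, x, k):
    """Adaptive k-NN rule: rank training examples by the scaled distance of Eq. (10)
    and classify by the confidence-weighted vote of Eq. (11)."""
    scaled = np.linalg.norm(X_train - x, axis=1) / radii   # d(x, x_i) / r_i
    nearest = np.argsort(scaled)[:k]                       # k smallest scaled distances
    return np.sign(np.sum(alphas[nearest] * y_train[nearest]))

# Typical usage (hypothetical variable names):
# radii, alphas = influence_regions(X_train, y_train, threshold=0.75)
# y_hat = adaptive_knn_predict(X_train, y_train, radii, alphas, x_query, k=7)
```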

3 Results and Discussion

In this section, we present experimental results obtained on several real-world benchmark datasets from the UCI Machine Learning Repository [11]. Throughout our experiments, we used the 10-fold cross-validation method to estimate the generalization error of our algorithm and of the nearest neighbor rule. Table 1 shows the error rates (%) and the corresponding standard deviations of the nearest neighbor (NN) rule and our adaptive nearest neighbor (A-NN) rule.

Table 1. Comparison of error rates

Dataset          NN             A-NN
Breast Cancer    4.85 (0.91)    3.53 (0.63)
Ionosphere       12.86 (1.96)   7.14 (1.15)
Pima             31.84 (1.05)   29.74 (1.38)
Liver            37.65 (2.80)   37.94 (1.93)
Sonar            17.00 (2.26)   13.00 (1.70)
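The 10-fold cross-validation protocol used for all of the error estimates in this section could be reproduced along the following lines (a sketch using scikit-learn's fold splitter; the predict_fn wrapper and the use of stratified folds are our assumptions, not details given in the paper):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_error(predict_fn, X, y, n_splits=10, seed=0):
    """Mean and standard deviation (in %) of the per-fold test error.
    predict_fn(X_train, y_train, X_test) must return predicted labels for X_test."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in folds.split(X, y):
        y_pred = predict_fn(X[train_idx], y[train_idx], X[test_idx])
        errors.append(np.mean(y_pred != y[test_idx]))
    return 100 * np.mean(errors), 100 * np.std(errors)
```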

As we can see from the results, the adaptive nearest neighbor rule outperforms the nearest neighbor rule on almost all 5 datasets being tested. On some datasets, such as the Ionosphere and Sonar datasets, the improvement of the adaptive nearest neighbor rule is statistically significant. These results show


that the nearest neighbor identified according to the scaled distance measure is more likely to have the same class label as the query point. Table 2 shows the corresponding results of the k-nearest neighbor (k-NN) rule and the adaptive k-nearest neighbor (A-k-NN) rule. On each dataset, we ran the two algorithms at various values of k from 1 to 99. We report the lowest generalization errors obtained by these two algorithms together with the corresponding standard deviations. For comparison, we also report the best results obtained by support vector machines (SVMs) equipped with Gaussian kernels. The kernel parameter and the regularization parameter for the support vector machines were determined via cross-validation.

Table 2. Comparison of results

Dataset          k-NN           SVMs           A-k-NN
Breast Cancer    2.79 (0.67)    3.68 (0.66)    2.65 (0.84)
Ionosphere       12.86 (1.96)   4.86 (1.05)    4.00 (0.87)
Pima             24.61 (1.36)   27.50 (1.68)   24.21 (1.39)
Liver            30.88 (3.32)   31.47 (2.63)   30.59 (2.33)
Sonar            17.00 (2.26)   11.00 (2.33)   13.00 (1.70)

As we can see from Table 2, the adaptive k-nearest neighbor rule outperforms the k-nearest neighbor rule on all 5 datasets being tested. On the Ionosphere and Sonar datasets, the improvement of the adaptive k-nearest neighbor rule is significant. These results further confirm that, by modulating the distance with the size of the influence region associated with each training example, better neighbors are found by the adaptive nearest neighbor rule. Furthermore, the adaptive nearest neighbor rule also outperforms the support vector machines on all datasets except the Sonar dataset.

In Table 3, we compare the numbers of nearest neighbors used by the two nearest neighbor rules when their respective lowest generalization errors are achieved. As we can see, on three of the 5 datasets being tested, namely the Breast Cancer, Pima, and Liver datasets, the adaptive k-nearest neighbor rule uses significantly fewer nearest neighbors for decision making. Combining the results in Tables 2 and 3, we can see that the adaptive k-nearest neighbor rule achieves better classification performance while using fewer nearest neighbors. On the Ionosphere dataset, although the adaptive k-nearest neighbor rule uses more nearest neighbors than the k-nearest neighbor rule, its performance is much better than the latter when using the same number of nearest neighbors. For example, as we can see from Table 1, using only a single nearest neighbor, the adaptive nearest neighbor rule already performs much better than the simple nearest neighbor rule.

Table 3. Comparison of the number of nearest neighbors

Dataset          k-NN   A-k-NN
Breast Cancer    13     7
Ionosphere       1      15
Pima             31     7
Liver            55     29
Sonar            1      1

Using the data from the Wisconsin Breast Cancer dataset, Figure 3 shows how the generalization error of the two algorithms varies as the value of k changes. The solid line represents the error rate of the adaptive k-nearest neighbor rule at different values of k. The dashed line is the corresponding result of the

k-nearest neighbor rule. From Fig. 3, we can see that, when k is very small or very large, the generalization error of both algorithms is large, due to the large variance or bias, respectively. This is a result of the so-called bias/variance dilemma [6]. The best tradeoff between the bias error and the variance is achieved somewhere in the middle. Also clear from this figure is that the adaptive k-nearest neighbor rule almost always performs better than the k-nearest neighbor rule on this particular dataset. The flat shape of the adaptive k-nearest neighbor rule's error curve also suggests that it is less prone to bias as k increases.


Fig. 3. Error rates at different values of k

4 Conclusion

In this paper, we presented an adaptive nearest neighbor rule for pattern classification. This new classifier overcomes the shortcomings of the k-nearest neighbor rule in two respects. First, before a query point is presented to the classifier, it constructs an influence region for each training example. To determine the k nearest neighbors of the query point, it divides the standard distance from the query point to each training example by the size of the corresponding influence region. As a result, training examples with large influence regions have a relatively large influence on determining the nearest neighbors, and the nearest neighbors identified according to the new distance measure are more likely to have the same class labels as the query point. Second, each training example is weighted according to the corresponding statistical confidence measure. This weighting


mechanism reduces the influence of boundary points on the classification decision, thereby reducing the generalization error. We tested the adaptive k-nearest neighbor rule on real-world benchmark datasets and compared it with the original k-nearest neighbor rule and with support vector machines. It outperforms both the k-nearest neighbor rule and the support vector machines on almost all datasets tested, demonstrating better generalization performance.

References

1. E. Fix and J. Hodges, Discriminatory analysis, nonparametric discrimination: consistency properties. Tech. Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
2. T. M. Cover and P. E. Hart, Nearest neighbor pattern classification. IEEE Transactions on Information Theory, Vol. IT-13, No. 1, Jan. 1967, pp. 21-27.
3. L. Devroye, On the inequality of Cover and Hart. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 3, 1981, pp. 75-78.
4. C. J. Stone, Consistent nonparametric regression. Annals of Statistics, Vol. 5, 1977, pp. 595-645.
5. L. Devroye, L. Györfi, A. Krzyżak, and G. Lugosi, On the strong universal consistency of nearest neighbor regression function estimates. Annals of Statistics, Vol. 22, 1994, pp. 1371-1385.
6. S. Geman, E. Bienenstock, and R. Doursat, Neural networks and the bias/variance dilemma. Neural Computation, Vol. 4, No. 1, 1992, pp. 1-58.
7. J. Friedman, Flexible metric nearest neighbor classification. Technical Report 113, Stanford University Statistics Department, 1994.
8. T. Hastie and R. Tibshirani, Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, 1996, pp. 607-615.
9. C. Domeniconi, J. Peng, and D. Gunopulos, Locally adaptive metric nearest-neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, 2002, pp. 1281-1285.
10. W. Hoeffding, Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, Vol. 58, 1963, pp. 13-30.
11. C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Sciences, University of California, Irvine, http://www.ics.uci.edu/∼mlearn/MLRepository.html, 1998.