An Adaptive Nearest Neighbor Algorithm for Classification

JIGANG WANG, PREDRAG NESKOVIC, LEON COOPER
Institute for Brain and Neural Systems, Department of Physics
Brown University, Providence, RI 02912, USA
E-mail: [email protected], [email protected], Leon [email protected]

Abstract: The k-nearest neighbor rule is one of the simplest and most attractive pattern classification algorithms. It can be interpreted as an empirical Bayes classifier based on the a posteriori probabilities estimated from the k nearest neighbors. The performance of the k-nearest neighbor rule therefore relies on the assumption that the a posteriori probability is locally constant. This assumption, however, becomes problematic in high-dimensional spaces due to the curse of dimensionality. In this paper we introduce a locally adaptive nearest neighbor rule. Instead of using the Euclidean distance to locate the nearest neighbors, the proposed method takes into account the effective influence size of each training example and the statistical confidence with which the label of each training example can be trusted. We test the new method on real-world benchmark datasets and compare it with the standard k-nearest neighbor rule and with support vector machines. The experimental results confirm the effectiveness of the proposed method.

Keywords: Classification; nearest neighbor rule; curse of dimensionality; statistical confidence.

1. Introduction

In a typical classification problem, one is given a set of n examples D_n = {(X_1, Y_1), ..., (X_n, Y_n)}, where the X_i are the feature vectors and the Y_i are the corresponding class labels. The pairs (X_i, Y_i) are assumed to be independently and identically distributed according to some unknown distribution P of (X, Y) on R^d × {ω_1, ..., ω_M}. The goal is to design a function φ_n : R^d → {ω_1, ..., ω_M} that maps an unseen feature vector X to the desired class from {ω_1, ..., ω_M}. The performance of a classifier φ_n can be measured by the probability of error, which is defined as
\[
L(\phi_n) = P\{(X, Y) : \phi_n(X) \neq Y\}. \tag{1}
\]

If the underlying probability distribution is known, the optimal decision rule for minimizing the probability of error is the Bayes decision rule [1]:
\[
\phi^*(X) = \arg\max_{Y \in \{\omega_1, \ldots, \omega_M\}} P(Y \mid X), \tag{2}
\]
which assigns an observation X to the class with the maximum a posteriori probability. However, in practice one rarely knows the underlying distribution. Therefore, a classifier has to be designed based on the given training data D_n.

One of the most attractive classification algorithms, first proposed by Fix and Hodges in 1951, is the nearest neighbor rule [2]. It classifies an unknown pattern X into the class of its nearest neighbor in the training data. Geometrically, each labeled observation in the training dataset serves as a prototype representing all the points in its Voronoi cell [1]. It can be shown that at any given point X the probability that its nearest neighbor X' belongs to class ω_i converges to the corresponding a posteriori probability P(ω_i | X) as the number of reference observations goes to infinity, i.e., P(ω_i | X) = lim_{n→∞} P(ω_i | X'). Furthermore, it was shown in [3, 4] that, under certain continuity conditions on the underlying distributions, the asymptotic probability of error L_NN of the nearest neighbor rule is bounded by
\[
L^* \le L_{NN} \le L^*\left(2 - \frac{M}{M-1} L^*\right), \tag{3}
\]
where L^* is the optimal Bayes probability of error. Therefore, the nearest neighbor rule, despite its extreme simplicity, is asymptotically optimal when the classes do not overlap. However, when the classes do overlap, the nearest neighbor rule is suboptimal. In these situations, the problem occurs in overlapping regions where P(ω_i | X) > 0 for more than one class ω_i. In those regions, the nearest neighbor rule deviates from the Bayes decision rule by classifying X into class ω_i with probability P(ω_i | X) instead of assigning X to the majority class with probability one.

In principle, this shortcoming can be overcome by a natural extension, the k-nearest neighbor rule. As the name suggests, this rule classifies X by assigning it to the class that appears most frequently among its k nearest neighbors. Indeed, as shown by Stone and Devroye in [5, 6], the k-nearest neighbor rule is universally consistent provided that the speed with which k approaches n is properly controlled, i.e., k → ∞ and k/n → 0 as n → ∞. However, choosing an optimal value of k in a practical application is always a problem, because only a finite amount of training data is available. This problem is known as the bias/variance dilemma in the statistical learning community [7]. In practice, one usually uses methods such as cross-validation to pick the best value of k.

With only a finite number of training examples, the k-nearest neighbor rule can be viewed as an empirical Bayes classifier based on the a posteriori probabilities estimated from the k nearest neighbors. Its performance therefore relies heavily on the assumption that the a posteriori probability is approximately constant within the neighborhood determined by the k nearest neighbors. With the commonly used Euclidean distance measure, this method is subject to severe bias in high-dimensional input spaces due to the curse of dimensionality [8]. To overcome this problem, many methods have been proposed to locally adapt the metric so that a neighborhood of approximately constant a posteriori probability can be produced. These methods include the flexible metric method by Friedman [8], the discriminant adaptive method proposed by Hastie and Tibshirani [9], and the adaptive metric method proposed by Domeniconi et al. [10], among others. Although the approaches differ, the common idea underlying these methods is that they estimate feature relevance locally at each query point. The locally estimated feature relevance leads to a weighted metric for computing the distance between a query point and the training data. As a result, neighborhoods are constricted along the most relevant dimensions and elongated along the less important ones. These methods improve the original k-nearest neighbor rule in that they are capable of producing local neighborhoods in which the a posteriori probabilities are approximately constant. However, the computational complexity of such improvements is high. Furthermore, these methods usually introduce additional model parameters, which need to be optimized along with the value of k through cross-validation, further increasing the computational load.

In this paper, we propose a simple adaptive nearest neighbor rule for pattern classification. Compared to the standard k-nearest neighbor rule, our proposed method has two appealing features. First, with our method each training example defines an influence region whose size depends on its position relative to the other training examples. Roughly speaking, each training example is associated with an intrinsic influence region, defined as the largest sphere centered on the training example itself that does not include a training example of a different class. In locating the nearest neighbors of a given query point, the distance from the query point to each training example is modulated by the size of the influence region associated with that training example. In this way, training examples with large influence regions have a relatively large influence on determining the nearest neighbors. The second feature of our proposed method is that it takes into account possible labeling errors in the training data by weighting each training example with the statistical confidence that can be associated with it. Our experimental results show that the proposed method significantly improves the generalization performance of the standard k-nearest neighbor rule.

This paper is organized as follows. In section 2, we define the probability of error in decisions made by the majority rule based on a finite number of observations, and show that the probability of error is bounded by a decreasing function of the confidence measure. We then use the defined probability of error as a criterion for determining the influence region and the statistical confidence that can be associated with each training example. Finally, we present a new adaptive nearest neighbor rule based on the influence region and the statistical confidence. In section 3 we test the new algorithm on several real-world datasets and compare it with the original k-nearest neighbor rule. Concluding remarks are given in section 4.

2. Adaptive Nearest Neighbor Rule for Classification

One of the main reasons for the success of the k-nearest neighbor rule is the fact that, for an arbitrary query point X, the class labels Y' of its k nearest neighbors can be treated as approximately distributed according to the desired a posteriori probability P(Y | X). Therefore, the empirical frequency with which each class ω_i appears within the neighborhood provides an estimate of the a posteriori probability P(ω_i | X). The k-nearest neighbor rule can thus be viewed as an empirical Bayes decision rule based on the estimate of P(Y | X) from the k nearest neighbors. There are two sources of error in this procedure. One results from whether or not the class labels Y' of the neighbors can be approximated as i.i.d. as Y. The other source of error is caused by the fact that, even if Y' can be approximated as i.i.d. as Y, there is still a probability that the empirical majority class differs from the true majority class under the underlying distribution. In this section, we address these issues.

2.1. Probability of error and confidence measure

For simplicity we consider a two-class classification problem. Let R ⊆ Ω be an arbitrary region in the feature space, and let p = P(Y = +1 | X ∈ R) be the a posteriori probability of the class being +1 given that X ∈ R. Let X_1, ..., X_n be n i.i.d. random feature vectors that fall in R. The n corresponding labels Y_1, ..., Y_n can then be treated as i.i.d. draws with P(Y = +1) = p, so that the class counts follow the binomial distribution B(n, p). According to the binomial law, the probability that n_+ of them belong to class +1 (and therefore n_- = n - n_+ belong to -1) is given by \binom{n}{n_+} p^{n_+} (1-p)^{n_-}. Without loss of generality, let us assume that R contains n_- samples of class -1 and n_+ samples of class +1, with n_- - n_+ = δ > 0, i.e., R contains δ more negative samples than positive samples. By the majority rule, R will be associated with the class -1. In practice, p is unknown; hence whether p ∈ [0, 0.5) or p ∈ (0.5, 1] is unknown. However, there are only two possibilities: if p ∈ [0, 0.5), the class with the maximum a posteriori probability is also -1 and the classification of R as -1 is correct; if p ∈ (0.5, 1], the class with the maximum a posteriori probability is +1 and the classification of R as -1 is a false negative error. Given p ∈ (0.5, 1], the expression
\[
P_{\mathrm{err}}(p; \delta; n) = \sum_{i=0}^{\lfloor (n-\delta)/2 \rfloor} \binom{n}{i} p^i (1-p)^{n-i} \tag{4}
\]
is the probability of false negatives. To keep the probability of false negatives small, it is important to keep Eq. (4) under control. It is easy to check that P_err(p; δ; n) is a decreasing function of p, which means that it is bounded above by
\[
P_{\mathrm{err}}(\delta; n) = \sup_{p \in (0.5, 1]} P_{\mathrm{err}}(p; \delta; n) = \frac{1}{2^n} \sum_{i=0}^{\lfloor (n-\delta)/2 \rfloor} \binom{n}{i} \approx \Phi\!\left(-\frac{\delta - 1}{\sqrt{n}}\right), \tag{5}
\]
where Φ is the cumulative distribution function (CDF) of a standard Gaussian random variable. When R contains δ more positive samples than negative samples, it is easy to check that the probability of false positives can be defined similarly, and it is also bounded by Eq. (5).

Obviously, P_err(δ; n) is a decreasing function of (δ - 1)/√n because, as a cumulative distribution function, Φ(x) is increasing in x. Therefore, the larger (δ - 1)/√n, the smaller the probability of error, or equivalently, the more confident we are about the classification. For this reason, and for notational convenience, we call (δ - 1)/√n the confidence measure. For n < 200, we enumerate all possible values of δ and n and calculate (δ - 1)/√n and the corresponding value of P_err(δ; n). The result is shown in Fig. 1.

Figure 1: Probability of error as a function of the confidence measure.

Since P_err(p; δ; n) is the probability that the observation is at odds with the true state of nature, 1 - P_err(p; δ; n) is the probability that the observation agrees with the true state of nature. We therefore define
\[
\alpha(p; \delta; n) \equiv 1 - P_{\mathrm{err}}(p; \delta; n) \tag{6}
\]
to be the confidence level. From Eq. (5), it follows that the confidence level is bounded below by
\[
\alpha(\delta; n) = 1 - P_{\mathrm{err}}(\delta; n) \approx \Phi\!\left(\frac{\delta - 1}{\sqrt{n}}\right). \tag{7}
\]
The larger (δ - 1)/√n, the higher the confidence level.
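As a concrete illustration, the worst-case probability of error of Eq. (5) and the confidence measure can be computed directly from the counts n and δ. The following Python sketch is ours, not part of the original paper; the function names and the example counts are purely illustrative.

from math import comb, sqrt
from statistics import NormalDist

def perr_worst_case(delta, n):
    """Worst-case probability of error of Eq. (5): the binomial tail of Eq. (4)
    evaluated at p -> 0.5, i.e. (1/2^n) * sum_{i <= floor((n-delta)/2)} C(n, i)."""
    assert 0 < delta <= n and (n - delta) % 2 == 0, "delta and n must have the same parity"
    return sum(comb(n, i) for i in range((n - delta) // 2 + 1)) / 2 ** n

def confidence_measure(delta, n):
    """The confidence measure (delta - 1) / sqrt(n)."""
    return (delta - 1) / sqrt(n)

def perr_gaussian(delta, n):
    """Gaussian approximation of Eq. (5): Phi(-(delta - 1) / sqrt(n))."""
    return NormalDist().cdf(-confidence_measure(delta, n))

# Example: a region holding n = 25 labels with a majority margin delta = 7
# (16 negative labels vs. 9 positive labels).
print(perr_worst_case(7, 25), perr_gaussian(7, 25), confidence_measure(7, 25))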

An alternative way to define the probability of error for a decision made by the majority rule based on a finite number of observations is to use a Beta prior model for the binomial distribution. Using the same argument, the probability of error can be defined as
\[
P_{\mathrm{err}}(\delta; n) = \frac{\int_{0.5}^{1} p^{(n-\delta)/2} (1-p)^{(n+\delta)/2} \, dp}{\int_{0}^{1} p^{(n-\delta)/2} (1-p)^{(n+\delta)/2} \, dp}, \tag{8}
\]
which gives the probability that the actual majority class under the posterior probability distribution differs from the one concluded empirically from the majority rule based on n and δ. Likewise, the confidence level can be defined as
\[
\alpha(\delta; n) = 1 - P_{\mathrm{err}}(\delta; n) = \frac{\int_{0}^{0.5} p^{(n-\delta)/2} (1-p)^{(n+\delta)/2} \, dp}{\int_{0}^{1} p^{(n-\delta)/2} (1-p)^{(n+\delta)/2} \, dp}. \tag{9}
\]
In Fig. 2, the probability of error based on the Beta prior model is plotted against the confidence measure. Numerically, the two definitions give about the same results, as can be seen by comparing Figs. 1 and 2. Compared to the second definition, the first definition of the probability of error can be better approximated as a function of the confidence measure, which is easily computable. For the same values of n and δ, the first definition also gives a higher probability of error, since it is based on a worst-case consideration.

Figure 2: Probability of error (based on the Beta prior model) against the confidence measure.
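Since the integrand in Eqs. (8) and (9) is an unnormalized Beta density, both quantities reduce to regularized incomplete Beta functions and need no explicit numerical integration. The sketch below is our own illustration of this identity and assumes SciPy is available; the function names are not from the paper.

from scipy.stats import beta

def perr_beta_prior(delta, n):
    """Probability of error under the uniform (Beta) prior, Eq. (8).

    With n_plus = (n - delta)/2 positive and n_minus = (n + delta)/2 negative
    labels, the posterior of p is Beta(n_plus + 1, n_minus + 1); Eq. (8) is
    the posterior mass of the 'wrong majority' region p > 0.5."""
    a = (n - delta) / 2 + 1          # n_plus + 1
    b = (n + delta) / 2 + 1          # n_minus + 1
    return beta.sf(0.5, a, b)        # 1 - CDF at 0.5

def confidence_beta_prior(delta, n):
    """Confidence level of Eq. (9): posterior mass of the region p < 0.5."""
    return beta.cdf(0.5, (n - delta) / 2 + 1, (n + delta) / 2 + 1)

# Same example counts as above: n = 25 observations, margin delta = 7.
print(perr_beta_prior(7, 25), confidence_beta_prior(7, 25))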

2.2. Adaptive Nearest Neighbor Rule

Assume that we are given a training data set D_n = {(X_1, Y_1), ..., (X_n, Y_n)} and a query point X. We want to assign X to its desired class. The traditional k-nearest neighbor rule works as follows. It first finds the k nearest neighbors of X, denoted by X_(1), ..., X_(k), according to the distances d(X, X_i) for i = 1, ..., n. Commonly used distance measures include the Euclidean distance and the L1 distance. Once the k nearest neighbors are identified, it classifies X into the class that appears most often among the k nearest neighbors. For a binary classification problem in which Y ∈ {-1, +1}, it amounts to the following decision rule:
\[
f(X) = \mathrm{sgn}\!\left(\sum_{i=1}^{k} Y_{(i)}\right), \tag{10}
\]
where Y_(i) are the corresponding class labels of the k nearest neighbors.
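For reference, Eq. (10) is the plain unweighted vote over the k closest training points. A minimal NumPy sketch of this baseline (our own code, assuming Euclidean distance and labels in {-1, +1}) might look as follows.

import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Standard k-NN rule of Eq. (10): sign of the summed labels
    of the k nearest neighbors (labels are in {-1, +1})."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest points
    vote = y_train[nearest].sum()
    return 1 if vote >= 0 else -1                      # ties broken toward +1

# Toy example.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 0.8]])
y = np.array([-1, 1, -1, 1])
print(knn_predict(X, y, np.array([0.1, 0.2]), k=3))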

The adaptive nearest neighbor rule starts by constructing an influence region for each training example. For each training example X_i, i = 1, ..., n, its influence region is constructed as follows. Centered on X_i, we draw a sphere that is as large as possible without enclosing a training example of a different class. For simplicity, we call the influence region a sphere even if another metric, such as the L1 metric, is used to compute the distance. Denote the radius of the sphere by r_i. We then count the number of training examples that fall inside the sphere and compute the statistical confidence level α_i according to Eq. (9). Depending on the value of the computed confidence level, there are two possibilities. If the confidence level is above a preset threshold, e.g., 75%, we keep the influence region as it is. Otherwise, we enlarge the sphere to include more training examples until its confidence level reaches the threshold. In the second case, the radius r_i is set to the radius of the sphere that achieves the desired confidence level, and the confidence level α_i is updated accordingly. We therefore associate with each training example an influence sphere whose radius is r_i and whose confidence level is α_i.
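The construction just described can be sketched in Python as follows. This is our own simplified reading of the procedure rather than code from the paper: it assumes labels in {-1, +1}, that both classes are present, that the radius of the enemy-free sphere is taken to be the distance to the closest differently labeled example, and it computes the confidence level with the Beta-prior formula of Eq. (9).

import numpy as np
from scipy.stats import beta

def confidence_level(delta, n):
    """Confidence level of a majority vote with margin delta out of n points, Eq. (9)."""
    return beta.cdf(0.5, (n - delta) / 2 + 1, (n + delta) / 2 + 1)

def build_influence_regions(X, y, threshold=0.75):
    """Return per-example radii r_i and confidence levels alpha_i."""
    n = len(X)
    radii = np.empty(n)
    confidences = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        order = np.argsort(d)                     # neighbors of X[i], closest first
        enemy = d[y != y[i]].min()                # distance to the closest differently labeled example
        m = int((d < enemy).sum())                # points strictly inside the enemy-free sphere
        r = enemy
        delta = abs(int(y[order[:m]].sum()))      # majority margin among those points
        alpha = confidence_level(delta, m)
        while alpha < threshold and m < n:        # grow the sphere until the threshold is met
            m += 1
            r = d[order[m - 1]]
            delta = abs(int(y[order[:m]].sum()))
            alpha = confidence_level(delta, m)
        radii[i], confidences[i] = r, alpha
    return radii, confidences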

Given a query point X, the adaptive nearest neighbor rule works as follows. It first finds the k nearest neighbors of X, denoted by X_(1), ..., X_(k), according to the rescaled distances d(X, X_i)/r_i for i = 1, ..., n, i.e., the distance measure according to which the training examples are sorted is
\[
d_{\mathrm{new}} = \frac{d(X, X_i)}{r_i}. \tag{11}
\]
Once the k nearest neighbors X_(1), ..., X_(k) are identified, it classifies X according to the following weighted majority rule:
\[
f(X) = \mathrm{sgn}\!\left(\sum_{i=1}^{k} \alpha_{(i)} Y_{(i)}\right), \tag{12}
\]
where α_(i) is the confidence level associated with X_(i), i.e., each neighbor is weighted by the confidence level associated with the corresponding influence sphere. Therefore, as one can easily imagine, a neighbor that happens to be close to the decision boundary has less weight because of its lower confidence level.
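Putting Eqs. (11) and (12) together, the query-time rule can be sketched as follows (again our own illustrative code, reusing the radii and confidence levels produced by the build_influence_regions sketch above).

import numpy as np

def adaptive_knn_predict(X_train, y_train, radii, confidences, x_query, k):
    """Adaptive k-NN rule: rank neighbors by the rescaled distance of Eq. (11)
    and take the confidence-weighted vote of Eq. (12)."""
    d = np.linalg.norm(X_train - x_query, axis=1) / radii           # Eq. (11)
    nearest = np.argsort(d)[:k]
    score = float(np.dot(confidences[nearest], y_train[nearest]))   # Eq. (12)
    return 1 if score >= 0 else -1

# Usage (radii, confidences as returned by build_influence_regions):
# radii, confidences = build_influence_regions(X_train, y_train)
# y_hat = adaptive_knn_predict(X_train, y_train, radii, confidences, x_query, k=7)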

3. Results and discussion

In this section, we present experimental results obtained on several real-world benchmark datasets from the UCI Machine Learning Repository [11]. Throughout our experiments, we used 5-fold cross-validation to estimate the generalization error of our algorithm and of the nearest neighbor rule. Table 1 shows the error rates and the corresponding standard deviations of the nearest neighbor (NN) rule and our adaptive nearest neighbor (A-NN) rule.

Table 1: Comparison of error rates (%)

Dataset        NN            A-NN
BreastCancer   4.85 (0.91)   3.53 (0.63)
Ionosphere     12.86 (1.96)  7.14 (1.15)
Pima           31.84 (1.05)  29.74 (1.38)
Liver          37.65 (2.80)  37.94 (1.93)
Sonar          17.00 (2.26)  13.00 (1.70)

As we can see from the results, the adaptive nearest neighbor rule outperforms the nearest neighbor rule on almost all of the 5 datasets tested. For some datasets, such as the Ionosphere and Sonar datasets, the improvements of our proposed adaptive nearest neighbor rule are significant. These results confirm that better nearest neighbors are found by modulating the distance with the size of the influence region associated with each training example.
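The evaluation protocol described above (error rates averaged over 5 cross-validation folds) could be reproduced along the following lines. This is an illustrative sketch under our own assumptions; it uses scikit-learn only for fold generation, and the exact folds and preprocessing behind the reported numbers are not specified in the paper.

import numpy as np
from sklearn.model_selection import KFold

def cv_error(predict_fn, X, y, n_splits=5, seed=0):
    """Mean and standard deviation of the per-fold error rate (in %)."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in folds.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        mistakes = sum(predict_fn(X_tr, y_tr, x) != y_true
                       for x, y_true in zip(X[test_idx], y[test_idx]))
        errors.append(100.0 * mistakes / len(test_idx))
    return float(np.mean(errors)), float(np.std(errors))

# e.g., nearest neighbor baseline vs. the adaptive rule (k = 1):
# nn_err = cv_error(lambda Xt, yt, x: knn_predict(Xt, yt, x, k=1), X, y)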

Table 2 shows the corresponding results of the k-nearest neighbor (k-NN) rule and the adaptive k-nearest neighbor (A-k-NN) rule. For each dataset, we ran the two algorithms for various values of k, ranging from 1 to 99, and we report the lowest generalization errors obtained by the two algorithms together with the corresponding standard deviations. For comparison, we also report the best results obtained by support vector machines with Gaussian kernels; the kernel parameter and the regularization parameter of the support vector machines were determined via cross-validation.

Table 2: Comparison of results (%)

Dataset        k-NN          SVM           A-k-NN
BreastCancer   2.79 (0.67)   3.68 (0.66)   2.65 (0.84)
Ionosphere     12.86 (1.96)  4.86 (1.05)   4.00 (0.87)
Pima           24.61 (1.36)  27.50 (1.68)  24.21 (1.39)
Liver          30.88 (3.32)  31.47 (2.63)  30.59 (2.33)
Sonar          17.00 (2.26)  11.00 (2.33)  13.00 (1.70)

As we can see from Table 2, the adaptive k-nearest neighbor rule outperforms the k-nearest neighbor rule on all 5 datasets tested. For the Ionosphere and Sonar datasets, the improvements of the adaptive k-nearest neighbor rule are significant. These results further confirm that, by modulating the distance with the size of the influence region associated with each training example, better neighbors are found by the adaptive nearest neighbor rule. Furthermore, the adaptive nearest neighbor rule also outperforms the support vector machines on all datasets except the Sonar dataset.

We also compared the numbers of nearest neighbors used when the best generalization errors are achieved by the two nearest neighbor rules. The results are shown in Table 3. As we can see, on three of the datasets tested, i.e., the Breast Cancer, Pima, and Liver datasets, the adaptive k-nearest neighbor rule uses significantly fewer nearest neighbors for inference. Combining the results in Tables 2 and 3, we can see that the adaptive k-nearest neighbor rule achieves better classification performance while using fewer nearest neighbors. On the Ionosphere dataset, the number of nearest neighbors used is much larger than in the k-nearest neighbor rule; however, as we can see from Table 1, using only a single nearest neighbor, the adaptive nearest neighbor rule already performs much better than the simple nearest neighbor rule.

Table 3: Comparison of the number of nearest neighbors

Dataset        k-NN   A-k-NN
BreastCancer   13     7
Ionosphere     1      15
Pima           31     7
Liver          55     29
Sonar          1      1

Using the data from the Wisconsin Breast Cancer dataset, Figure 3 shows how the generalization error of the two algorithms changes as the value of k changes. The solid line represents the error rate of the adaptive k-nearest neighbor rule at different values of k; the dashed line is the corresponding result of the k-nearest neighbor rule. From the figure we can see that, for both algorithms, the best result is achieved somewhere in the middle. This is a consequence of the so-called bias/variance dilemma: when k is very small or very large, the generalization error of both algorithms is large, and the best tradeoff between the bias error and the variance is achieved somewhere in between. Also clear from this figure is that the adaptive k-nearest neighbor rule performs almost always better than the k-nearest neighbor rule. The flat shape of the adaptive k-nearest neighbor curve also suggests that it might be less subject to bias as k increases.

Figure 3: Error rates at different values of k.

4. Conclusion

In this paper, we present an adaptive nearest neighbor rule for pattern classification. This new classifier overcomes the shortcomings of the k-nearest neighbor rule in two respects. First, before a query point is presented to the classifier, it constructs an influence region for each training example. When the k nearest neighbors of the query point are to be found, the standard distance from the query point to each training example is modulated by the corresponding influence size. As a result, training examples with large influence regions have a relatively large influence on determining the nearest neighbors. Second, each training example is weighted according to the corresponding statistical confidence measure. This weighting mechanism leads to a lower influence of boundary points on the classification decision, thereby reducing the generalization error.

We tested the proposed adaptive k-nearest neighbor rule on real-world benchmark datasets and compared it with the original k-nearest neighbor rule and with support vector machines. It outperforms both the k-nearest neighbor rule and the support vector machines on almost all of the datasets tested and demonstrates better generalization performance.

Acknowledgments

This work is partially supported by ARO under Grant DAAD19-01-1-0754. Jigang Wang is supported by the Brown University Dissertation Fellowship.

References

[1] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification. John Wiley & Sons, New York, 2000.

[2] E. Fix and J. Hodges, Discriminatory analysis, nonparametric discrimination: consistency properties. Tech. Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.

[3] T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, Vol. IT-13, No. 1, Jan. 1967, pp. 21-27.

[4] L. Devroye, On the inequality of Cover and Hart. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 3, 1981, pp. 75-78.

[5] C. J. Stone, Consistent nonparametric regression. Annals of Statistics, Vol. 5, 1977, pp. 595-645.

[6] L. Devroye, L. Györfi, A. Krzyżak and G. Lugosi, On the strong universal consistency of nearest neighbor regression function estimates, Annals of Statistics, Vol. 22, 1994, pp. 1371-1385.

[7] S. Geman, E. Bienenstock and R. Doursat, Neural networks and the bias/variance dilemma, Neural Computation, Vol. 4, No. 1, 1992, pp. 1-58.

[8] J. Friedman, Flexible metric nearest neighbor classification. Technical Report 113, Department of Statistics, Stanford University, 1994.

[9] T. Hastie and R. Tibshirani, Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, 1996, pp. 607-615.

[10] C. Domeniconi, J. Peng and D. Gunopulos, Locally adaptive metric nearest-neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, 2002, pp. 1281-1285.

[11] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Sciences, University of California, Irvine, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
