Active and Passive Nearest Neighbor Algorithm: A Newly-Developed Supervised Classifier

KaiYan Feng¹,², JunHui Gao¹, KaiRui Feng³, Lei Liu¹,², and YiXue Li¹,²

¹ Shanghai Center for Bioinformatics Technology, 100 Qinzhou Road, Shanghai, China
² Key Laboratory of System Biology, Shanghai Institute for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, China
³ Simcyp Limited, Blades Enterprise Centre, John Street, Sheffield S2 4SU, United Kingdom
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract. The k nearest neighbor algorithm (k-NN) is an instance-based lazy classifier that does not need to delineate the entire boundaries between classes. Classification tasks that would otherwise require repeated training may therefore favor k-NN when high efficiency is needed. However, k-NN is sensitive to the underlying data distribution. In this paper, we define a new neighborhood relationship, called passive nearest neighbors, which is intended to counteract variations in data density. Based on this relationship, we develop a new classifier called the active and passive nearest neighbor algorithm (APNNA). The classifier is evaluated by 10-fold cross-validation on 10 randomly chosen benchmark datasets. The experimental results show that APNNA performs better than other classifiers on some datasets and worse on others, indicating that APNNA is a good complement to the current state of the art in classification.

Keywords: Machine Learning, Lazy Classifier, Nearest Neighbor Algorithm, Active and Passive Nearest Neighbor Algorithm.

1 Introduction

Many learning algorithms exist in the literature, rooted in and developed from various theories – some are based on statistics, some are cast as optimization problems, some are inspired by biological structures, and yet others evolve from simple or intricate logical reasoning. A few examples of classifiers include the classification tree [3], AdaBoost [5], SVM [4], k-NN [1,2], KStar [6], and LWL [7]. Among these, decision tree, AdaBoost and SVM are eager classifiers, while k-NN, KStar and LWL are lazy classifiers. Eager classifiers build a classification model from training instances and only the prebuilt model is used to classify new instances, whereas lazy classifiers conduct all computations at test time without any explicit training stage.


Lazy classifiers like k-NN, KStar and LWL are inherently able to handle multi-class classification problems efficiently, and are usually faster in cross-validation tests because, in each test, there is no need for these algorithms to determine the entire boundaries between classes, which would require a lot of computation. In essence, k-NN, KStar and LWL are all based on the same idea as the nearest neighbor algorithm: search for the k nearest neighbors of a new instance under some distance function, and assign the class of the new instance according to the labels of those k nearest neighbors.

In this paper, we define a new neighborhood relationship, called passive nearest neighbors, and integrate it with the conventional nearest neighbors, termed active nearest neighbors, to define the neighborhood of a test instance. Because the cores of k-NN, KStar and LWL are the same, they all suffer from being sensitive to the distribution of the training data: if the data of one class are denser, then within the same neighborhood area more data of the denser class will be included as nearest neighbors of the test datum, which in turn biases the classification of the test datum. The passive nearest neighborhood relationship is defined to counterbalance the active relationship, and the combination of the two is deemed more reliable in determining the class of a datum.

First let us consider the passive 1st-order nearest neighbor, defined as follows: B is the passive 1st-order nearest neighbor of A if and only if A is the active 1st-order nearest neighbor of B. The idea of the passive 1st-order nearest neighbor and its use in classification are depicted in figure 1, which shows data from two classes, circles and rectangles, with the test instance depicted as a triangle. According to the conventional 1-NN algorithm, datum No. 4 (the test instance) would be labeled as the rectangle class – the same class as datum No. 3 – since datum No. 3 is the nearest neighbor of datum No. 4. However, because datum No. 4 has 2 passive nearest neighbors among the circles and 1 active nearest neighbor among the rectangles, datum No. 4 should be labeled with the circle class, which has the majority of votes. In the following section we present a version of APNNA and describe it in detail.
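To make the 1st-order case concrete, the sketch below implements the voting rule just described: the test instance collects one vote from its active 1st-order nearest neighbor and one vote from each training datum whose own nearest neighbor is the test instance. This is a minimal illustration assuming numeric features and Euclidean distance; the function and variable names (apnn_1st_order_vote, X_train, etc.) are ours, not from the paper.

```python
import numpy as np

def apnn_1st_order_vote(X_train, y_train, x_test):
    """Classify x_test by majority vote over its active 1st-order nearest
    neighbor and all of its (global) passive 1st-order nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - x_test, axis=1)

    # Active 1st-order nearest neighbor: the training datum closest to x_test.
    votes = [y_train[np.argmin(dists)]]

    # Passive 1st-order nearest neighbors: every training datum whose own
    # active 1st-order nearest neighbor is x_test, i.e. x_test is closer to
    # it than any other training datum.
    for i in range(len(X_train)):
        d_others = np.linalg.norm(X_train - X_train[i], axis=1)
        d_others[i] = np.inf  # a datum is not its own neighbor
        if dists[i] < d_others.min():
            votes.append(y_train[i])

    # Majority vote over the collected active and passive neighbors.
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```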

2 Active and Passive Nearest Neighbor Algorithm

2.1 Local Passive kth-Order Nearest Neighbors

We defined the passive 1st-order nearest neighbor above and discussed it through a schematic graph. If the conditions are further restricted, we can define a local passive 1st-order nearest neighbor as follows: B ∈ L (meaning datum B belongs to class L) is the local passive 1st-order nearest neighbor of datum A if and only if A is the active 1st-order nearest neighbor of B among all data of class L, i.e. A is the local active nearest neighbor of B. It is easy to deduce that if B is the global passive 1st-order nearest neighbor of A, then B is also the local passive 1st-order nearest neighbor of A; however, the reverse does not always hold. Again we demonstrate the idea of the local passive 1st-order nearest neighbor in a schematic figure – please refer to figure 2 for detail.


Fig. 1. A schematic image demonstrating the idea of the 1st-order active and passive nearest neighbor algorithm, with the test instance depicted as a triangle, which has two passive nearest neighbors and one active nearest neighbor. A → B means that B is the active 1st-order nearest neighbor of A, and A is the passive 1st-order nearest neighbor of B.

Fig. 2. A schematic image demonstrating the idea of the local passive 1st-order nearest neighbor. The symbols have the same meaning as those in figure 1, except that the red arrow indicates a local nearest neighbor relationship.

In figure 1, datum No. 1 is the global passive nearest neighbor of datum No. 4; in figure 2 it becomes the local passive nearest neighbor of datum No. 4, since datum No. 4 is the nearest neighbor of datum No. 1 only among the circle data. In figure 2, datum No. 6 is the global nearest neighbor of datum No. 1; however, knowing this does not contribute much to the task of classification. Thus datum No. 4 has two passive nearest neighbors among the circles (a global one and a local one) and one active nearest neighbor among the rectangles, and datum No. 4 can still be classified as the circle class.
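The sketch below, in the same hedged style as before, identifies the local passive 1st-order nearest neighbors of a test instance: a training datum B votes if the test instance is closer to B than any other datum of B's own class. The names are again illustrative assumptions, not from the paper.

```python
import numpy as np

def local_passive_neighbors(X_train, y_train, x_test):
    """Return the indices of training data that have x_test as their active
    1st-order nearest neighbor within their own class (the local relation)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    passive = []
    for i in range(len(X_train)):
        same_class = np.where(y_train == y_train[i])[0]
        d_to_test = np.linalg.norm(X_train[i] - x_test)
        # Distances from datum i to the other data of its own class.
        d_within = np.linalg.norm(X_train[same_class] - X_train[i], axis=1)
        d_within[same_class == i] = np.inf  # exclude the datum itself
        if d_to_test < d_within.min():
            passive.append(i)
    return passive
```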


Fig. 3. A schematic image illustrating that local passive nearest neighbors can counteract differing data densities among classes. The number inside a datum indicates the passive distance between that datum and the test instance (the triangle).

We will use local passive nearest neighbors, rather than global ones, to construct an APNNA classifier. The local passive kth-order nearest neighbor is defined as follows: B ∈ L is the local passive kth-order nearest neighbor of datum A if and only if A is the active kth-order nearest neighbor of B among all data of class L. If B is the passive kth-order nearest neighbor of A, we define the passive distance from B to A to be k.

Passive nearest neighbors can intrinsically counteract variations in data density because relative distances (ranks) are used to quantify the neighborhood relationship, as demonstrated in figure 3. Under k-NN, the rectangle data are much nearer to the test instance than the circle data; however, the passive distances of the circle data are smaller than those of the rectangle data.

We now provide a way to combine passive nearest neighbors with active nearest neighbors to construct an APNNA classifier. We restrict the conditions so that the active factor and the passive factor contribute equally to a classification task: given l classes, the same number of passive and active nearest neighbors (k active nearest neighbors and k passive nearest neighbors) is taken from each class to calculate the prediction factors.
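As a concrete reading of the passive distance, the sketch below computes, for each training datum B, the rank of the test instance among B's class-local neighbors; a rank of k means the test instance is B's active kth-order nearest neighbor within B's class. This is our own minimal interpretation under the Euclidean assumption, with illustrative names.

```python
import numpy as np

def passive_distances(X_train, y_train, x_test):
    """For each training datum B, return the 1-based rank of x_test among
    B's class-local neighbors, i.e. the passive distance from B to x_test."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    ranks = np.empty(len(X_train), dtype=int)
    for i in range(len(X_train)):
        same_class = np.where(y_train == y_train[i])[0]
        d_to_test = np.linalg.norm(X_train[i] - x_test)
        d_within = np.linalg.norm(X_train[same_class] - X_train[i], axis=1)
        d_within = d_within[same_class != i]  # exclude the datum itself
        # x_test ranks just behind the same-class data that are closer to B.
        ranks[i] = 1 + int((d_within < d_to_test).sum())
    return ranks
```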

2.2 The Calculation of the Active Nearest Neighbor Factor

First, from each class collect the k data that are closest to the processed datum D – in total l × k = z data are taken. For numerical data, the closeness of two data $D_i = \langle x_1, x_2, \ldots, x_n \rangle$ and $D_j = \langle x'_1, x'_2, \ldots, x'_n \rangle$ can be measured by the Euclidean distance between $D_i$ and $D_j$, calculated by equation (1):

$$M(D_i, D_j) = \sqrt{\sum_{t=1}^{n} (x_t - x'_t)^2} \qquad (1)$$
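For reference, equation (1) transcribes directly into code; this is the standard Euclidean distance and carries no assumptions beyond numeric features.

```python
import numpy as np

def euclidean(Di, Dj):
    """Equation (1): Euclidean distance between two numeric data."""
    Di, Dj = np.asarray(Di, dtype=float), np.asarray(Dj, dtype=float)
    return np.sqrt(((Di - Dj) ** 2).sum())
```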


To obtain the rank of each datum, the z data are sorted in ascending order according to the distance between each datum and the test instance D; the higher a datum ranks (the smaller its index), the closer it is to D. The sorted active neighbors are expressed as $\Gamma(D^{AN}) = \{D_1^A, D_2^A, \ldots, D_z^A\}$ with labels $y_i = L$ ($i = 1, \ldots, z$; $L = 1, \ldots, l$), where $D_i^A$ is the ith datum in the sorted data, i.e. $M(D_i^A, D) \le M(D_{i+1}^A, D)$ for $i = 1, \ldots, z-1$.
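A sketch of this selection-and-sorting step, under the same illustrative naming as the earlier snippets: take the k data closest to D from each of the l classes and order the resulting z = l × k data by ascending distance to D.

```python
import numpy as np

def sorted_active_neighbors(X_train, y_train, x_test, k):
    """Take the k data closest to x_test from each class and return their
    indices and labels sorted by ascending distance (z = l * k data)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - x_test, axis=1)
    chosen = []
    for L in np.unique(y_train):
        members = np.where(y_train == L)[0]
        # The k members of class L nearest to the test instance.
        chosen.extend(members[np.argsort(dists[members])[:k]])
    chosen = np.array(chosen)
    order = chosen[np.argsort(dists[chosen])]  # ascending distance to x_test
    return order, y_train[order]
```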
