Nearest Neighbor Voting in High-Dimensional Data: Learning from Past Occurrences

Nenad Tomašev, Dunja Mladenić
Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana, Slovenia
Email: [email protected], [email protected]

Abstract—Hubness is a recently described aspect of the curse of dimensionality inherent to nearest-neighbor methods. In this paper we present a new approach for exploiting the hubness phenomenon in k-nearest neighbor classification. We argue that some neighbor occurrences carry more information than others, by virtue of being less frequent events. This observation is related to the hubness phenomenon, and we explore how it affects high-dimensional k-nearest neighbor classification. We propose a new algorithm, Hubness Information k-Nearest Neighbor (HIKNN), which introduces k-occurrence informativeness into the hubness-aware k-nearest neighbor voting framework. Our evaluation on high-dimensional data shows significant improvements over both the basic k-nearest neighbor approach and all previously used hubness-aware approaches.

I. INTRODUCTION

The k-nearest neighbor algorithm is one of the simplest pattern classification algorithms. It is based on the notion that instances which are judged to be similar in the feature space often share common properties in other attributes, one of them being the instance label itself. The basic algorithm was first proposed in [1]. The label of a new instance is determined by a majority vote of its k nearest neighbors (kNN) from the training set; a minimal illustrative sketch of this baseline vote is given after the contributions list below.

Let D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be the data set, where each x_i ∈ R^d. The x_i are feature vectors which reside in some high-dimensional Euclidean space, and y_i ∈ {c_1, c_2, ..., c_C} are the labels. In an infinite data sample, the following asymptotic result holds: p(c|x_i) = lim_{n→∞} p(c|NN(x_i)). Real-world data is usually very sparse, so the point probability estimates achieved by kNN in practice are much less reliable. However, this is merely one aspect of the well-known curse of dimensionality. The concentration of distances [2] is another phenomenon of interest, since all nearest-neighbor approaches require a similarity measure. In high-dimensional spaces it is very difficult to distinguish between relevant and irrelevant points, and the very concept of nearest neighbors becomes much less meaningful.

Hubness is a recently described aspect of the dimensionality curse, related specifically to nearest neighbor methods [3], [4]. The term was coined to reflect the emergence of hubs, i.e. very frequent nearest neighbors. As such, these points exert a substantial influence on the classification outcome. Two types of hubs can be distinguished, good hubs and bad hubs, based on the proportion of label matches/mismatches in their k-occurrences.

A. Contributions

• When there is hubness, some points occur much more frequently in k-neighbor sets. We claim that some occurrences hence become much less informative than others, and are consequently of much lower value for the kNN classification process.
• We propose a new hubness-aware approach to k-nearest neighbor classification, Hubness Information k-Nearest Neighbor (HIKNN). The algorithm exploits the notion of occurrence informativeness, which leads to a more robust voting scheme.
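The following minimal sketch illustrates the baseline majority vote described in the introduction; the brute-force Euclidean search and the NumPy-based data layout are illustrative choices only, not part of any particular implementation.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Distances from the query to every training point (brute force).
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points.
    neighbor_idx = np.argsort(dists)[:k]
    # Each neighbor casts one vote for its own label; ties are broken arbitrarily.
    votes = Counter(y_train[i] for i in neighbor_idx)
    return votes.most_common(1)[0][0]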

II. RELATED WORK

A. Hubs, frequent nearest neighbors

The emergence of hubs as prominent points in k-nearest neighbor methods was first noted in the analysis of music collections [5]. The researchers discovered songs which were similar to many other songs (i.e. frequent neighbors). This conceptual similarity, however, did not reflect the expected perceptual similarity. The phenomenon of hubness was further explored in [3], [6], where it was shown that hubness is a natural property of many inherently high-dimensional data sets. Not only do some very frequent points emerge, but the entire distribution of k-occurrences exhibits very high skewness. In other words, most points occur very rarely in k-neighbor sets, less often than would otherwise be expected. We refer to these rarely occurring points as anti-hubs.

Denote by N_k(x_i) the number of k-occurrences of x_i and by N_{k,c}(x_i) the number of such occurrences in neighborhoods of elements from class c. The latter will also be referred to as the class hubness of instance x_i. The k-neighborhood of x_i is denoted by D_k(x_i).

B. Hubness-aware methods

Hubs, as frequent neighbors, can exert both good and bad influence on kNN classification, based on the number of label matches and mismatches in the respective k-occurrences. The number of good occurrences is denoted by GN_k(x_i) and the number of bad ones by BN_k(x_i), so that N_k(x_i) = GN_k(x_i) + BN_k(x_i).
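The occurrence counts defined above can be tallied directly from precomputed k-neighbor lists. The short sketch below is only illustrative; neighbor_idx is assumed to hold, for each training point, the indices of its k nearest neighbors (excluding the point itself), and the helper name is hypothetical.

import numpy as np

def occurrence_counts(neighbor_idx, y, num_classes):
    # neighbor_idx: (n, k) array, row i = indices of the k nearest neighbors of x_i.
    n, k = neighbor_idx.shape
    Nk = np.zeros(n, dtype=int)                  # N_k(x_j): total k-occurrences of x_j
    Nkc = np.zeros((n, num_classes), dtype=int)  # N_{k,c}(x_j): occurrences in class-c neighborhoods
    GNk = np.zeros(n, dtype=int)                 # good occurrences (label match)
    BNk = np.zeros(n, dtype=int)                 # bad occurrences (label mismatch)
    for i in range(n):
        for j in neighbor_idx[i]:
            Nk[j] += 1
            Nkc[j, y[i]] += 1
            if y[i] == y[j]:
                GNk[j] += 1
            else:
                BNk[j] += 1
    return Nk, Nkc, GNk, BNk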

Three recently proposed approaches exploit the hubness information in different ways:
• Hubness-weighted kNN (hw-kNN) [3] introduced a weighting scheme for reducing the influence of bad hubs on classification.
• Hubness fuzzy kNN (h-FNN) [7] is based on the notion that the class hubness frequencies of neighbors, N_{k,c}(x_i), can be used to define occurrence fuzziness within a fuzzy k-nearest neighbor framework [8].
• In Naive hubness Bayesian kNN (NHBNN) [9], all k-occurrences are treated as random events, and the class affiliation of a new instance is inferred via naive Bayesian inference from the respective kNN set.
All of these approaches have some unresolved issues. For instance, hw-1-NN is the same as 1-NN (since all neighbors vote by their original labels), so no improvement is observed for k = 1. The other two approaches have in turn failed to provide a consistent way of handling anti-hubs. In our proposed approach, these rarely occurring points are no longer treated as a special case; moreover, we consider them to be highly informative.

III. THE MOTIVATION

Let x be an observed neighbor point. Two possible previous 1-occurrence scenarios are considered in Figure 1, where circles represent class '0' and squares class '1'. In the first example, N_{1,0}(x) = 3 and N_{1,1}(x) = 21, which indicates high bad hubness. Therefore, if x is a neighbor of the point of interest, it is certainly beneficial to base the vote on the class hubness scores instead of its label, as this reduces the probability of error. In the second example, N_{1,0}(x) = 3 and N_{1,1}(x) = 4, which makes for a very small difference in class hubness scores. Even though N_{1,1}(x) > N_{1,0}(x), the label of x is 0, so it is not entirely clear how x should vote. What needs to be evaluated is how much trust should be placed in the neighbor's label and how much in the occurrence information. If there had been no previous occurrences of x in the training data (an anti-hub), there would be no choice but to use the label. On the other hand, for high-hubness points we should probably rely more on their occurrence tendencies. It is precisely the points in between which need to be handled more carefully.
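To make the trade-off concrete, the occurrence evidence in the two scenarios can be summarized by the empirical class probabilities implied by the class hubness counts alone (this anticipates the class hubness estimate used in Section IV; the numbers are those of Figure 1):

$$\text{(a)}\quad p(y = 1 \mid x) \approx \frac{N_{1,1}(x)}{N_{1,0}(x) + N_{1,1}(x)} = \frac{21}{24} \approx 0.88, \qquad \text{(b)}\quad p(y = 1 \mid x) \approx \frac{4}{7} \approx 0.57$$

In scenario (a) the occurrence evidence clearly outweighs the label of x, while in scenario (b) it is weak enough that discarding the label outright seems hard to justify.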

Fig. 1. A binary classification example. Class hubness is shown for point x towards both classes. Two examples are depicted: example 'a', where there is a big difference in the previous k-occurrences, and example 'b', where there is nearly no observable difference.

This is not the only problem, though. Suppose that there is a data point x_i ∈ D which appears in all k-neighbor sets of other x_j ∈ D. Assume then that we are trying to determine the label of a new data point x and that x_i also appears in this neighborhood, D_k(x). This would not be surprising at all, since x_i appears in all other previously observed neighborhoods. Since such an occurrence carries no information, x_i should not be allowed to cast a vote. Going one step further, it is easy to see that less frequent occurrences may in fact be more informative and that such neighbors might be more local to the point of interest. This is exploited in our proposed approach.

IV. THE ALGORITHM

Let now x_i be the point of interest, to which we wish to assign a label, and let x_it, t ∈ {1, 2, ..., k}, be its k nearest neighbors. We calculate the informativeness of the occurrences according to Equation 1. In all our calculations, we assume each data point to be its own 0th nearest neighbor, thereby making all N_k(x_i) ≥ 1. Not only does this give us some additional data, but since it makes all k-occurrence frequencies non-zero, it also avoids any pathological cases in the calculations.

$$p(x_{it} \in D_k(x_i)) \approx \frac{N_k(x_{it})}{n}, \qquad I_{x_{it}} = \log \frac{1}{p(x_{it} \in D_k(x_i))} \qquad (1)$$

We proceed by defining the relative and absolute normalized informativeness, which we also refer to as surprise values:

$$\alpha(x_{it}) = \frac{I_{x_{it}} - \min_{x_j \in D} I_{x_j}}{\log n - \min_{x_j \in D} I_{x_j}}, \qquad \beta(x_{it}) = \frac{I_{x_{it}}}{\log n} \qquad (2)$$

As discussed above, one of the things we wish to achieve is to combine the class information from the neighbor labels with the information from their previous occurrences. To do this, we need one more small observation. As the number of previous occurrences N_k(x_it) increases, two things happen simultaneously. First, the informativeness of the current occurrence of x_it drops. Second, class hubness gives a more accurate estimate of p_k(y_i = c | x_it ∈ D_k(x_i)). Therefore, when the hubness of a point is high, more information is contained in its class hubness scores; when the hubness of a point is low, more information is contained in its label. This motivates the following combined estimate:

$$\bar{p}_k(y_i = c \mid x_{it} \in D_k(x_i)) =
\begin{cases}
\alpha(x_{it}) + (1 - \alpha(x_{it})) \cdot \bar{p}_{k,c}(x_{it}), & y_{it} = c \\
(1 - \alpha(x_{it})) \cdot \bar{p}_{k,c}(x_{it}), & y_{it} \neq c
\end{cases}
\qquad (3)$$

where the class hubness estimate is $p_k(y_i = c \mid x_{it}) \approx \frac{N_{k,c}(x_{it})}{N_k(x_{it})} = \bar{p}_{k,c}(x_{it})$.

The α factor controls how much information is contributed to the vote by the instance label and how much by its previous occurrences. If x_it never appeared in a k-neighbor set apart from its own, i.e. N_k(x_it) = 1, then it votes by its label. If, on the other hand, N_k(x_it) = max_{x_j ∈ D} N_k(x_j), then the vote is cast entirely according to the class hubness scores.

The fuzzy votes are based on the p_k(y_i = c | x_it), which are approximated according to Equation 3. These probabilities are then weighted by the absolute normalized informativeness β(x_it), as shown in Equation 4.

$$u_c(x_i) \propto \sum_{t=1}^{k} \beta(x_{it}) \cdot d_w(x_{it}) \cdot p_k(y_i = c \mid x_{it}) \qquad (4)$$

Additional distance weighting has been introduced for the purpose of later comparison with the h-FNN algorithm [7], since it also employs distance weighting; it is not an essential part of the algorithm. We opted for the same distance weighting scheme used in h-FNN, which was in turn first proposed in FNN [8]. It is given in Equation 5.

$$d_w(x_{it}) = \frac{\|x_i - x_{it}\|^{-2}}{\sum_{t=1}^{k} \|x_i - x_{it}\|^{-2}} \qquad (5)$$

Equations 1, 2, 3 and 4 represent our proposed solution for exploiting the information contained in the past k-occurrences on the training data, and we will refer to this new algorithm as Hubness Information k-Nearest Neighbor (HIKNN). It embodies some major improvements over the previous approaches:
• Unlike h-FNN and NHBNN, it is essentially parameter-free; one only needs to set the neighborhood size k.
• Anti-hubs are no longer a special case. They are, however, handled appropriately via the information-based framework.
• Label information is combined with information from the previous k-occurrences, so that both sources of information are exploited for the voting.
• Total occurrence informativeness is taken into account.
The training phase of the algorithm is summarized in Algorithm 1. The voting is simply done according to Equation 4 and requires no further explanation.

Algorithm 1 HIKNN: Training
Input: (X, Y, k)
    training set T = (X, Y), X ⊂ R^d
    number of neighbors k ∈ {1, 2, ..., n − 1}
Train:
    kNeighbors = findNeighborSets(T, k)
    for all (x_i, y_i) ∈ (X, Y) do
        N_k(x_i) = 0
        for all c = 1 ... C do
            count N_{k,c}(x_i)
            N_k(x_i) += N_{k,c}(x_i)
        end for
        calculate α(x_i) and β(x_i) by Eq. 2
        for all c = 1 ... C do
            calculate p_k(y = c|x_i) by Eq. 3
        end for
    end for
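For illustration, the training and voting phases can also be sketched compactly in code. The sketch below mirrors Equations 1-5 and Algorithm 1 under the assumption that the kNN graph of the training set (neighbor_idx) has already been computed and that labels are integers in {0, ..., C−1}; all function and variable names here are illustrative, not part of a reference implementation.

import numpy as np

def hiknn_train(neighbor_idx, y, num_classes):
    # neighbor_idx: (n, k) kNN graph of the training set; y: integer labels.
    n, k = neighbor_idx.shape
    # Each point is its own 0th neighbor, so all N_k >= 1 (Section IV).
    Nk = np.ones(n)
    Nkc = np.zeros((n, num_classes))
    Nkc[np.arange(n), y] = 1.0
    for i in range(n):
        for j in neighbor_idx[i]:
            Nk[j] += 1
            Nkc[j, y[i]] += 1
    # Equation 1: occurrence informativeness (surprise) of each training point.
    I = np.log(n / Nk)
    # Equation 2: relative (alpha) and absolute (beta) normalized informativeness.
    alpha = (I - I.min()) / (np.log(n) - I.min())
    beta = I / np.log(n)
    # Class hubness estimates and Equation 3: mixture of label and occurrence information.
    p_bar = Nkc / Nk[:, None]
    p = (1.0 - alpha)[:, None] * p_bar
    p[np.arange(n), y] += alpha
    return p, beta

def hiknn_vote(x, X_train, neighbors_of_x, p, beta, num_classes):
    # Equation 5: distance weighting, as in h-FNN / FNN (epsilon guards zero distances).
    inv_sq = (np.linalg.norm(X_train[neighbors_of_x] - x, axis=1) + 1e-12) ** -2
    dw = inv_sq / inv_sq.sum()
    # Equation 4: informativeness- and distance-weighted fuzzy votes.
    u = np.zeros(num_classes)
    for t, j in enumerate(neighbors_of_x):
        u += beta[j] * dw[t] * p[j]
    return int(np.argmax(u))

The α mixture in hiknn_train corresponds to Equation 3: anti-hubs (N_k = 1) vote purely by their labels, while the strongest hub votes purely by its class hubness scores.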

The time complexity of HIKNN, as with all other hubness-based approaches, is asymptotically the same as that of constructing a kNN graph. Fast algorithms for constructing approximate kNN graphs exist, such as the one proposed in [10]. This particular procedure runs in Θ(dn^{1+τ}) time, where τ ∈ (0, 1] is a parameter used to trade off speed against graph construction accuracy.
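For moderately sized data, the required kNN graph can also be computed exactly with an off-the-shelf neighbor search; the scikit-learn routine below is one illustrative way (our assumption, not something prescribed by the approach) to obtain the neighbor_idx array used in the sketches above.

from sklearn.neighbors import NearestNeighbors

def build_knn_graph(X, k=5):
    # Query k + 1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    return idx[:, 1:]  # drop the self-neighbor column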

TABLE I
Pairwise comparison of classifiers: number of wins (with statistically significant ones in parentheses). Each entry gives the number of datasets on which the row classifier outperforms the column classifier.

           kNN       hw-kNN    h-FNN     NHBNN     HIKNN     Total
kNN        –         0 (0)     1 (0)     10 (8)    0 (0)     11 (8)
hw-kNN     27 (22)   –         14 (1)    18 (13)   1 (0)     60 (36)
h-FNN      26 (22)   12 (6)    –         21 (16)   1 (1)     60 (45)
NHBNN      17 (14)   9 (7)     3 (1)     –         1 (1)     30 (23)
HIKNN      27 (26)   24 (11)   26 (9)    24 (20)   –         101 (66)

V. EXPERIMENTS

We compared our proposed HIKNN algorithm to the existing related algorithms - kNN, hw-kNN, h-FNN and NHBNN - on 27 classification problems. In all cases, 10-times 10-fold cross-validation was performed, and the corrected resampled t-test was used to check for statistical significance. All experiments were performed for k = 5, which is a standard choice. Experiment summaries are given in Table II.

A. The data

The first 10 datasets are from the UCI repository (http://archive.ics.uci.edu/ml/datasets.html) and exhibit low or medium hubness. The Manhattan distance was used for this data, as well as for the image data, and all features were normalized prior to classification. The Acquis aligned corpus data (http://langtech.jrc.it/JRCAcquis.html) represents a set of more than 20000 documents in several different languages. In the experiments we only used the English documents, represented as bag-of-words, and examined 14 different binary classification problems on top of this data, using the cosine similarity. The iNet3 image dataset is a subset of the ImageNet repository (http://www.image-net.org/) that was used in [11], given as a quantized Haar feature representation. The three iNet3 representations have an interesting property: due to an I/O error during feature extraction, 5 images were accidentally assigned zero-vector representations. It turns out that the hubness of the zero vectors increased dramatically with the representation dimensionality. Since all 5 of these points were of the minority class, the classification results were affected greatly, and this is a prime example of how bad the bad hubness can get in high-dimensional data.

B. The results

The results in Table II show that the hubness-aware algorithms clearly outperform the basic kNN algorithm, and HIKNN appears to be the overall best approach. The detailed comparison between the algorithms is shown in Table I. By comparing both the total number of wins and the number of wins between pairs of algorithms, we see that HIKNN is to be preferred to the second-best algorithm in the experiments, h-FNN, since it beats it quite convincingly: 26 (9) : 1 (1) in direct comparison and 101 (66) : 60 (45) overall.

The results for the iNet3Err data require special attention. As previously discussed, 5 data points ended up as strong bad hubs of the minority class, which completely disabled the basic kNN classifier. On this 3-class dataset, it ends up performing worse than zero-rule!

TABLE II
Overview of the experiments. Each dataset is described by its size, dimensionality (d), the number of categories (C), the skewness of the N5 distribution (SN5), the proportion of bad 5-occurrences (BN5), and the maximal achieved number of 5-occurrences on the dataset (max N5). Classification accuracy (%) is given for kNN, hubness-weighted kNN (hw-kNN), hubness-based fuzzy nearest neighbor (h-FNN), naive hubness Bayesian k-nearest neighbor (NHBNN) and hubness information k-nearest neighbor (HIKNN). All experiments were performed for k = 5.

Data set       size    d       C   SN5    BN5     max N5   kNN           hw-kNN        h-FNN         NHBNN         HIKNN
dexter         300     20000   2   6.64   30.5%   219      57.2 ± 7.0    67.7 ± 5.4    67.6 ± 4.9    68.0 ± 4.9    68.0 ± 5.3
diabetes       768     8       2   0.19   32.3%   14       67.8 ± 3.7    75.6 ± 3.7    75.4 ± 3.2    73.9 ± 3.4    75.8 ± 3.6
glass          214     9       6   0.26   35.0%   13       61.5 ± 7.3    65.8 ± 6.7    67.2 ± 7.0    59.1 ± 7.5    67.9 ± 6.7
ionosphere     351     34      2   2.06   12.5%   34       80.8 ± 4.5    87.9 ± 3.6    90.3 ± 3.6    92.2 ± 3.2    87.3 ± 3.8
isolet1        1560    617     26  1.23   28.7%   30       75.2 ± 2.5    82.5 ± 2.1    83.8 ± 1.8    83.0 ± 2.0    86.8 ± 1.5
page-blocks    5473    10      5   0.31   5.0%    16       95.1 ± 0.6    95.8 ± 0.6    96.0 ± 0.6    92.6 ± 0.6    96.2 ± 0.6
segment        2310    19      7   0.33   5.3%    15       87.6 ± 1.5    88.2 ± 1.3    88.8 ± 1.3    87.8 ± 1.3    91.2 ± 1.1
sonar          208     60      2   1.28   21.3%   22       82.7 ± 5.5    83.4 ± 5.3    82.0 ± 5.8    81.1 ± 5.6    85.3 ± 5.5
vehicle        846     18      4   0.64   36.0%   14       62.5 ± 3.8    65.9 ± 3.2    64.9 ± 3.6    63.7 ± 3.5    67.2 ± 3.6
vowel          990     10      11  0.60   9.7%    16       87.8 ± 2.2    88.2 ± 1.9    91.0 ± 1.8    88.1 ± 2.2    93.6 ± 1.6
Acquis1        23412   254963  2   62.97  19.2%   4778     78.7 ± 1.0    87.5 ± 0.8    88.8 ± 0.7    88.4 ± 0.7    89.4 ± 0.6
Acquis2        23412   254963  2   62.97  8.7%    4778     92.4 ± 0.5    93.6 ± 0.5    93.3 ± 0.5    92.5 ± 0.5    93.7 ± 0.5
Acquis3        23412   254963  2   62.97  27.3%   4778     72.7 ± 0.9    78.7 ± 0.9    79.5 ± 0.9    78.9 ± 0.9    79.6 ± 0.9
Acquis4        23412   254963  2   62.97  12.2%   4778     89.8 ± 0.6    90.6 ± 0.6    90.5 ± 0.6    87.4 ± 0.7    91.0 ± 0.5
Acquis5        23412   254963  2   62.97  5.7%    4778     97.3 ± 0.3    97.6 ± 0.3    97.5 ± 0.3    95.1 ± 0.4    97.7 ± 0.3
Acquis6        23412   254963  2   62.97  7.6%    4778     93.6 ± 0.4    94.4 ± 0.5    94.0 ± 0.5    92.5 ± 0.5    94.6 ± 0.5
Acquis7        23412   254963  2   62.97  18.1%   4778     82.9 ± 0.8    86.3 ± 0.7    86.1 ± 0.6    85.7 ± 0.7    87.0 ± 0.7
Acquis8        23412   254963  2   62.97  9.3%    4778     92.3 ± 0.5    93.0 ± 0.5    93.1 ± 0.5    91.0 ± 0.5    93.5 ± 0.5
Acquis9        23412   254963  2   62.97  7.6%    4778     93.0 ± 0.5    94.8 ± 0.4    94.2 ± 0.5    93.4 ± 0.5    94.8 ± 0.4
Acquis10       23412   254963  2   62.97  21.4%   4778     83.1 ± 1.6    88.8 ± 0.7    88.7 ± 0.6    87.4 ± 0.7    89.7 ± 0.5
Acquis11       23412   254963  2   62.97  23.4%   4778     77.7 ± 0.9    81.8 ± 0.8    82.4 ± 0.6    81.9 ± 0.7    82.5 ± 0.5
Acquis12       23412   254963  2   62.97  9.8%    4778     91.9 ± 0.6    92.8 ± 0.5    92.6 ± 0.5    90.7 ± 0.6    92.8 ± 0.5
Acquis13       23412   254963  2   62.97  16.4%   4778     85.6 ± 0.7    87.5 ± 0.6    87.1 ± 0.7    85.2 ± 0.7    88.0 ± 0.7
Acquis14       23412   254963  2   62.97  6.9%    4778     94.2 ± 0.4    94.9 ± 0.4    94.6 ± 0.5    92.5 ± 0.5    95.0 ± 0.5
iNet3Err100    2731    100     3   20.56  10.2%   375      92.4 ± 0.9    93.6 ± 0.9    97.5 ± 0.9    97.5 ± 0.9    97.6 ± 0.9
iNet3Err150    2731    150     3   25.1   34.8%   1280     80.0 ± 2.0    88.7 ± 2.0    94.6 ± 0.9    94.6 ± 0.9    94.8 ± 0.9
iNet3Err1000   2731    1000    3   23.3   79.7%   2363     21.2 ± 2.0    27.1 ± 11.2   59.5 ± 3.2    59.6 ± 0.9    59.6 ± 3.2
AVG                                                        80.63         84.17         85.96         84.59         86.69

Note that the major hub in the 1000-dimensional case appears in 86.5% of all k-neighbor sets. Bad hubness of the data is closely linked to the error of kNN classification: the Pearson correlation coefficient between the kNN error and the bad hubness percentages on the datasets in our experiments is 0.94, which indicates a strong positive correlation. HIKNN bases its votes on expectations derived from the previous k-occurrences, so it is encouraging that the correlation between the accuracy gain over kNN and the bad hubness of the data is also very strong: 0.87 according to the Pearson coefficient.

VI. CONCLUSION

In this paper we presented a novel approach for handling high-dimensional data in kNN classification, Hubness Information k-Nearest Neighbor (HIKNN). It is a hubness-aware approach based on evaluating the informativeness of individual neighbor occurrences, and the algorithm is parameter-free. Rare neighbors (anti-hubs) are shown to carry valuable information which can be well exploited for classification. The algorithm was compared to the three previous hubness-aware approaches (hw-kNN, h-FNN, NHBNN), as well as the kNN baseline, on 27 classification problems. Our proposed approach had the overall best performance in the experiments.

Acknowledgments. This work was supported by the ICT Programme of the EC under PASCAL2 (ICT-NoE-216886) and PlanetData (ICT-NoE-257641).


REFERENCES
[1] E. Fix and J. Hodges, "Discriminatory analysis, nonparametric discrimination: Consistency properties," USAF School of Aviation Medicine, Randolph Field, Texas, Tech. Rep., 1951.
[2] D. François, V. Wertz, and M. Verleysen, "The concentration of fractional distances," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 7, pp. 873–886, 2007.
[3] M. Radovanović, A. Nanopoulos, and M. Ivanović, "Nearest neighbors in high-dimensional data: The emergence and influence of hubs," in Proc. 26th Int. Conf. on Machine Learning (ICML), 2009, pp. 865–872.
[4] N. Tomašev, M. Radovanović, D. Mladenić, and M. Ivanović, "The role of hubness in clustering high-dimensional data," in PAKDD (1), 2011, pp. 183–195.
[5] J. Aucouturier and F. Pachet, "Improving timbre similarity: How high is the sky?" Journal of Negative Results in Speech and Audio Sciences, vol. 1, 2004.
[6] M. Radovanović, A. Nanopoulos, and M. Ivanović, "Hubs in space: Popular nearest neighbors in high-dimensional data," Journal of Machine Learning Research, vol. 11, pp. 2487–2531, 2011.
[7] N. Tomašev, M. Radovanović, D. Mladenić, and M. Ivanović, "Hubness-based fuzzy measures for high dimensional k-nearest neighbor classification," in Proceedings of the MLDM Conference, 2011.
[8] J. E. Keller, M. R. Gray, and J. A. Givens, "A fuzzy k-nearest-neighbor algorithm," IEEE Transactions on Systems, Man and Cybernetics, pp. 580–585, 1985.
[9] N. Tomašev, M. Radovanović, D. Mladenić, and M. Ivanović, "A probabilistic approach to nearest-neighbor classification: Naive hubness Bayesian kNN," in Proc. CIKM Conference, 2011.
[10] J. Chen, H.-R. Fang, and Y. Saad, "Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection," Journal of Machine Learning Research, vol. 10, pp. 1989–2012, 2009.
[11] N. Tomašev, R. Brehar, D. Mladenić, and S. Nedevschi, "The influence of hubness on nearest-neighbor methods in object recognition," in Proc. IEEE Conf. on Intelligent Computer Communication and Processing, 2011.
