Nearest neighbor search by using Partial KD-tree method

Piotr Kraus 1 and Witold Dzwinel 1,2

1 AGH Institute of Computer Science, al. Mickiewicza 30, 30-059 Kraków, Poland
2 WSEiA, Department of Computer Science, Jagiellońska 109a, 25-734 Kielce, Poland

Abstract

We present a new nearest neighbor (NN) search algorithm, the Partial KD-Tree Search (PKD), which couples Friedman's algorithm with the Partial Distance Search (PDS) technique. Its efficiency was tested on a wide spectrum of input datasets of various sizes and dimensions. The test datasets were both generated artificially and selected from the UCI repository. Our hybrid algorithm turns out to be very efficient in comparison both to its component methods and to another popular NN search technique, the Slicing Search algorithm. The test results show that PKD outperforms the brute force method by up to 100 times and is substantially faster than the other techniques. We conclude that the Partial KD-Tree is a universal and efficient nearest neighbor search scheme.

Keywords: nearest neighbor search, partial distance, Friedman's algorithm, KD-tree, slicing approach, hybrid method, UCI datasets, efficiency

1. Introduction

The nearest neighbor (NN) problem was formulated more than fifty years ago ([FH51], [FH52]), and since the middle of the 20th century numerous solutions have been proposed. The works of Bentley [Ben75b], Dasarathy [Das91] and Arya [Ary95] present broad overviews of the most efficient NN techniques. They have found many applications in fundamental algorithms of pattern recognition (such as classification, clustering, quantization and multidimensional scaling), in high performance N-body simulations, in navigation schemes and in many other scientific and technological problems. In the famous book by Duda et al. [DHS00] the Authors distinguish three general classes of techniques which allow for reducing the computational complexity of the NN search:
1. partial distance search [CGRS84],
2. prestructuring, i.e., constructing a search tree in which prototypes are selectively linked [Ben75a], [Yan93b],
3. editing the stored training points [HT00], [Wil72], [Das94], [Kun97], [Har68], [Yan93a], [HCSG02].

The easiest solution to the nearest neighbor search problem is the Exhaustive Search (ES), or brute force search. It requires linear storage proportional to the number of points N (or feature vectors) in the searched dataset, and each NN query is answered with N-1 distance calculations. This technique is trivial and hard to beat, especially for small datasets of multidimensional feature vectors. However, its efficiency is insufficient in many real world applications, such as high performance molecular dynamics simulations involving 10^9 atoms or data mining procedures working on datasets of more than 10^7 multidimensional feature vectors. Therefore, schemes faster than ES that allow for penetrating huge datasets are in great demand.

Cheng et al. [CGRS84] originally proposed an algorithm called the Partial Distance Search (PDS), which is arguably the most popular and the easiest full-search technique. The algorithm is a simple modification of ES: during the calculation of the distance between two points, if the partial sum of squared differences exceeds the distance to the nearest neighbor found so far, the calculation is aborted. Like the brute force search, PDS does not require any preprocessing or extra storage, and its performance is almost always substantially better than that of the Exhaustive Search.
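As an illustration, the following minimal Python sketch (our own, purely illustrative code; the names pds_nearest, query and points are ours, not taken from [CGRS84]) shows the early-abort rule of PDS applied to squared Euclidean distances.

    def pds_nearest(query, points):
        """Partial Distance Search: a linear scan with early abort.

        query is a sequence of D coordinates and points is a list of such
        sequences; returns (index, squared distance) of the nearest point.
        """
        best_idx, best_d2 = -1, float("inf")
        for i, p in enumerate(points):
            d2 = 0.0
            for a, b in zip(query, p):
                d2 += (a - b) ** 2
                if d2 >= best_d2:          # partial sum already exceeds the best distance
                    break                  # -> abort this distance calculation
            else:                          # loop completed: p is the new nearest neighbor
                best_idx, best_d2 = i, d2
        return best_idx, best_d2

The worst case is still N-1 full distance computations, but in practice most of them are abandoned after a few coordinates.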

The projection methods build on Friedman's algorithm [FBS75] and its clones. Their basic steps are as follows:
1. select the best projection axis,
2. project every data point onto this axis and sort the projections,
3. select the points on the sorted list which are the closest to the query point Q,
4. focus the NN search only on these points.

The KD-Tree algorithm [Ben75a] is based on Friedman's idea. First, a binary tree (called a KD-Tree) is constructed from all the data points (see Fig. 1). Then the tree is searched starting from the root towards the terminal nodes. A terminal node usually contains a group of b points (called "the bucket") whose distances to the query point Q may be examined by the Exhaustive Search. At each search step, a lower bound on the minimum distance to each subset of points is calculated. If the lower bound is greater than the distance to the nearest neighbor found so far, then the entire subset can be eliminated without calculating the distance to each of its points explicitly [FN75], [FBF77], [KP86], [MOC96], [KKZ96]. Because search trees are capable of eliminating entire groups of points, a query time of O(logN) is expected in low dimensions and for uniformly distributed data points [FBF77], [KP86]. The idea of the KD-Tree technique is explained in Fig. 1 and sketched in code after the figure.

Fig. 1: The scheme of the KD-Tree Search algorithm.
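The sketch below (our own illustrative Python, not the authors' implementation; Node, build, search and the bucket size b are names we introduce) follows the description above: the tree is built by cycling through the coordinates and splitting at the median, terminal nodes hold buckets of at most b points, and the search descends into the closer child first and prunes the other child whenever the squared distance to the splitting plane already exceeds the best squared distance found so far.

    import math

    class Node:
        def __init__(self, axis=None, split=None, left=None, right=None, bucket=None):
            self.axis, self.split = axis, split    # splitting dimension and value
            self.left, self.right = left, right    # child subtrees
            self.bucket = bucket                   # list of points in a terminal node

    def build(points, b=8, depth=0):
        if len(points) <= b:                       # small group -> terminal bucket
            return Node(bucket=points)
        axis = depth % len(points[0])              # cycle through the coordinates
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return Node(axis=axis, split=points[mid][axis],
                    left=build(points[:mid], b, depth + 1),
                    right=build(points[mid:], b, depth + 1))

    def search(node, q, best=(None, math.inf)):
        if node.bucket is not None:                # terminal node: scan the bucket exhaustively
            for p in node.bucket:
                d2 = sum((a - c) ** 2 for a, c in zip(q, p))
                if d2 < best[1]:
                    best = (p, d2)
            return best
        if q[node.axis] <= node.split:
            near, far = node.left, node.right
        else:
            near, far = node.right, node.left
        best = search(near, q, best)               # descend into the closer subtree first
        if (q[node.axis] - node.split) ** 2 < best[1]:  # lower bound on the far subtree
            best = search(far, q, best)
        return best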

Another approach employing the projection method for NN search was proposed by Nene and Nayar [NN96], [NN97]. The Authors suggested a simple algorithm devised for efficient search for the nearest neighbor of a query point Q within a distance ε. This algorithm does not tackle the nearest neighbor problem as it is originally stated, but it can be successful in many situations (e.g., in high performance molecular dynamics simulations). In [NN97] the authors introduce the Slicing Search, which significantly reduces the total number of operations required to localize the points within the hypercube centered in Q with sides equal to 2ε, to roughly O(εN + N(1 - (1 - ε)^D)). The closest point to Q can then be found by performing the Exhaustive Search inside this hypercube. If there are no points inside the cube, we know that there are no points within the radius ε from Q. The major computational effort goes into constructing and trimming the candidate list, which can be done in a variety of ways. In [NN96] the Authors propose a method that uses a simple preconstructed data structure along with binary searches ([Knu02b]) to efficiently detect the points sandwiched between a pair of parallel hyperplanes. Trimming the list of candidates yields the points lying within the L∞ norm. Since L∞ bounds L2, one could naively perform the Exhaustive Search inside the L∞ cube. However, as is shown in Fig. 2, this does not always work correctly: P2 is closer to Q than P1, yet the Exhaustive Search within the cube will incorrectly identify P1 as the nearest neighbor. There is a simple solution to this problem: when performing the Exhaustive Search, impose an additional constraint that only points within a radius ε (in L2) are considered (see Fig. 2b). This, however, increases the probability that the hypersphere turns out to be empty, which poses a serious drawback for data dimensionality D > 4.


Fig. 2: The Exhaustive Search within a hypercube may produce an incorrect result. (a) P2 is closer to Q than P1, but the exhaustive search restricted to the cube will incorrectly identify P1 as the closest point. (b) This can be remedied by imposing the constraint that the exhaustive search considers only points within a distance ε (in L2) from Q (assuming that the length of the hypercube side is 2ε).
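The following sketch (our own naive Python illustration, not Nene and Nayar's data structure, which relies on presorted coordinate arrays and binary search) shows the essence of the ε-search: the candidate list is trimmed dimension by dimension with the L∞ test, and the final exhaustive scan keeps the additional L2 constraint of Fig. 2b.

    def slicing_nn(q, points, eps):
        """Nearest neighbor of q within L2 radius eps, or None if the sphere is empty.

        A naive sketch of the slicing idea: keep only the points whose every
        coordinate differs from q by at most eps (the hypercube with side 2*eps),
        then scan the survivors exhaustively under the L2 constraint of Fig. 2b.
        """
        candidates = points
        for axis in range(len(q)):                 # trim the candidate list dimension by dimension
            candidates = [p for p in candidates
                          if abs(p[axis] - q[axis]) <= eps]
            if not candidates:
                return None                        # no point inside the hypercube
        best, best_d2 = None, eps * eps            # the additional L2 radius constraint
        for p in candidates:
            d2 = sum((a - b) ** 2 for a, b in zip(q, p))
            if d2 <= best_d2:
                best, best_d2 = p, d2
        return best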

In our tests, four nearest neighbor search techniques, i.e., the Exhaustive Search (brute force), the Partial Distance Search, the KD-Tree and the Slicing Search, were examined. We compare their performance with two hybridized versions:
1. the KD-Tree search supported by the Partial Distance Search technique,
2. a coupling of the Slicing Search and the PDS method.

Both the KD-Tree and the Slicing Search create an optimal structure of the prototype dataset, while PDS is used to speed up the distance calculations. A sketch of the first hybrid is given below.
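In the first hybrid the only change with respect to the plain KD-Tree search is inside the scan of the terminal buckets, where the full distance computation is replaced by a partial one. The sketch below is our own illustrative code, not the authors' implementation; it reuses the Node and build definitions from the KD-Tree sketch in Section 1.

    import math

    def pkd_search(node, q, best=(None, math.inf)):
        """KD-Tree search with PDS applied inside the terminal buckets."""
        if node.bucket is not None:
            for p in node.bucket:
                d2 = 0.0
                for a, c in zip(q, p):
                    d2 += (a - c) ** 2
                    if d2 >= best[1]:      # partial distance already too large
                        break              # -> abandon this bucket point early
                else:
                    best = (p, d2)
            return best
        if q[node.axis] <= node.split:
            near, far = node.left, node.right
        else:
            near, far = node.right, node.left
        best = pkd_search(near, q, best)
        if (q[node.axis] - node.split) ** 2 < best[1]:  # lower bound test on the far subtree
            best = pkd_search(far, q, best)
        return best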

2. Results of tests

The tests consist of two tracks. The first one includes tests on three different collections of generated datasets. To verify the methods on realistic data we also use ten sets from the UCI (University of California, Irvine) repository. The timings concentrate on the search time and completely neglect the prestructuring time and the space complexity. For all the experiments we use an AMD Athlon64 3200+ (2200 MHz, 512 kB cache), 1024 MB RAM (DDR 400 MHz) and Gentoo Linux 2005.0 (kernel 2.6.12).

2.1 Description of the test data

The key factors affecting the nearest neighbor search time are the number of feature vectors, their dimensionality and the variations in the spatial distribution of points. We introduce the spread factor S to describe the differences in data distributions; a greater value of S means greater variations in density. The artificially generated datasets consist of various numbers of feature vectors N, of various dimensions D and spread factors S. The generated datasets are as follows (an illustrative generator sketch is given after the list):
1. Dataset I [S = 1.0] — uniformly distributed prototypes with no distinguished clusters,
2. Dataset II [S = 2.0] — groups of clusters on a background of uniformly distributed points,
3. Dataset III [S = 3.0] — separate clusters only.

The testbed consists of 336 datasets that represent all the combinations of three values of the spread factor S (1.0, 2.0 and 3.0), eight data sizes N (10^3, 2×10^3, 5×10^3, 10^4, 2×10^4, 3×10^4, 4×10^4 and 5×10^4 points) and fourteen dimensions D (1, 2, 3, 4, 8, 12, 16, 24, 32, 40, 48, 64, 80, 96). In Fig. 3 we present a 2-dimensional visualization of some of them (after the Karhunen-Loeve transformation from the D-dimensional to the 2-dimensional space). The coordinates of the generated points are bounded between -0.5 and 0.5 in all dimensions.
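The exact generator is not specified in the text; the Python sketch below is purely our assumption of how such datasets could be produced (make_dataset, n_clusters, cluster_std and background are hypothetical names and parameters, not taken from the paper).

    import random

    def make_dataset(n, d, kind, n_clusters=10, cluster_std=0.02, background=0.3):
        """Illustrative generator for the three dataset types (our assumption only).

        kind = 1: uniform points; kind = 2: clusters on a uniform background;
        kind = 3: separate clusters.  All coordinates stay inside [-0.5, 0.5].
        """
        def uniform_point():
            return [random.uniform(-0.5, 0.5) for _ in range(d)]

        def clustered_point(centers):
            c = random.choice(centers)
            return [min(0.5, max(-0.5, x + random.gauss(0.0, cluster_std))) for x in c]

        if kind == 1:
            return [uniform_point() for _ in range(n)]
        centers = [uniform_point() for _ in range(n_clusters)]
        if kind == 3:
            return [clustered_point(centers) for _ in range(n)]
        n_bg = int(background * n)      # kind == 2: part of the points forms a uniform background
        return ([uniform_point() for _ in range(n_bg)] +
                [clustered_point(centers) for _ in range(n - n_bg)])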


Fig. 3: The Karhunen-Loeve transformation to 2-D space of the tested data. The rows represent the values of the spread factor (S), while the columns correspond to various dimensionalities D of the feature space.

Table 1: The UCI Datasets

Dataset Name   Description                                  Training Points (N)   Test Points   Dimension (D)   Spread (S)
iono           radar returns from the ionosphere                            281            70              34          2.8
iris           iris flowers                                                 120            30               4          2.8
isolet         spoken letters                                              6238          1559             617          2.1
letter         latin capital letters                                      12200          7800              16          2.5
page-blocks    blocks of the page layout of a document                     4377          1096              10          2.5
pendigits      written digits                                              7494          3498              16          2.5
shuttle        shuttle dataset                                            43500         14500               9          2.5
satimage       satellite image of soil                                     4435          2000              36          2.6
waveform       waveforms of signals                                        4000          1000              21          2.7
wine           wine cultivars                                               142            36              13          2.7

To test the search techniques on realistic data we selected ten datasets from the UCI Machine Learning Repository, which are collected in Table 1. The datasets differ in the number of training points (120 ≤ N ≤ 43500), data dimensions (4 ≤ D ≤ 617) and spread (2.1 ≤ S ≤ 2.8).
