Proceedings of ACIVS 2002 (Advanced Concepts for Intelligent Vision Systems), Ghent, Belgium, September 9-11, 2002

AN EFFICIENT BRANCH AND BOUND SEARCH ALGORITHM FOR COMPUTING K NEAREST NEIGHBORS IN A MULTIDIMENSIONAL VECTOR SPACE

Wim D’haes 1,2, Dirk Van Dyck 1, Xavier Rodet 2

1 Visionlab – University of Antwerp (UA), Groenenborgerlaan 171 · 2020 Antwerp · Belgium
2 IRCAM – Centre Georges Pompidou, 1, place Igor-Stravinsky · 75004 Paris · France

ABSTRACT


An efficient branch and bound search algorithm is proposed for the computation of the K nearest neighbors in a multidimensional vector space. In a preprocessing step, the sample of feature vectors is decomposed hierarchically using hyperplanes determined by principal component analysis (PCA). During the search for the nearest neighbors, the tree that represents this decomposition is traversed in depth-first order, avoiding nodes that cannot contain nearest neighbors. The behavior of the algorithm is studied on artificial data.

1. INTRODUCTION AND STATE OF THE ART

Searching the K nearest neighbors in a multidimensional vector space is a very common procedure in the field of pattern recognition, where it is used for nonparametric density estimation and classification [1]. When the number of samples is large, however, the computational cost of the nearest neighbor search can prohibit its practical use. Various techniques that reduce the number of distance computations have been proposed. For data that can be represented in a vector space, branch and bound search algorithms have been proposed [2, 3, 4, 5, 6]. For nearest neighbors in a metric space, several approximating and eliminating search algorithms (AESA) are available [7, 8, 9, 10, 11]. In the present work, a branch and bound algorithm is proposed that searches the nearest vectors in a vector space where the dissimilarity between two vectors is expressed by the Euclidean distance. The main contribution is a very efficient hierarchical decomposition that uses hyperplanes determined by principal component analysis. It is shown that this decomposition is optimal when the distribution of the sample is Gaussian.


2. DECOMPOSITION OF THE SAMPLE

2.1. Definition of the decomposition

In order to obtain an efficient decomposition of the sample we require that:
1. each subset contains an equal number of vectors, and
2. the number of vectors that have nearest neighbors in both subsets is minimized.

This is realized in the following manner. First, a multivariate Gaussian is fit to the set. Then, the vectors are divided into two subsets by the hyperplane that contains the mean vector and is perpendicular to the eigenvector of the covariance matrix with the largest eigenvalue. It is shown below that this decomposition is optimal for a multivariate Gaussian distribution. Since any hyperplane containing the mean vector divides the distribution into two equal halves, the first efficiency criterion is automatically fulfilled. Considering a continuous distribution as a sampling with infinite accuracy, only vectors that lie exactly on the hyperplane have nearest neighbors in both subsets. Therefore, we wish to determine the hyperplane for which the integral of the distribution over that plane is minimal. A multivariate Gaussian, fit to a D-dimensional data set S = {x̄_1, ..., x̄_N} with x̄_i ∈ R^D, is given by

    \exp\left( -\tfrac{1}{2} (\bar{x} - \bar{\mu})^T \Sigma^{-1} (\bar{x} - \bar{\mu}) \right)    (1)

where μ̄ is the mean vector and Σ the covariance matrix over S, and T denotes the transpose operator. By calculating the eigenvectors ū_k and eigenvalues λ_k of Σ (implying Σū_k = λ_k ū_k), any vector x̄ can be expressed in the coordinate system of the eigenvectors using α_k = (x̄ − μ̄)^T ū_k, resulting in

    \bar{x} = \bar{\mu} + \sum_{k=1}^{D} \alpha_k \bar{u}_k    (2)
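The following minimal NumPy sketch makes these definitions concrete (it is not from the paper, and the function and variable names are ours): it fits the Gaussian of equation (1) to a sample and expresses a vector in the eigenvector coordinate system of equation (2).

import numpy as np

def fit_gaussian(X):
    """Mean vector and covariance matrix of the sample X (one vector per row)."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False, bias=True)   # (1/N) sum_i (x_i - mu)(x_i - mu)^T
    return mu, sigma

def eigen_coordinates(x, mu, sigma):
    """Coefficients alpha_k of x in the eigenvector basis of sigma, equation (2)."""
    lam, U = np.linalg.eigh(sigma)               # columns of U are the eigenvectors u_k
    alpha = U.T @ (x - mu)                       # alpha_k = (x - mu)^T u_k
    return alpha, lam, U

# x is recovered as mu + sum_k alpha_k u_k:
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
mu, sigma = fit_gaussian(X)
alpha, lam, U = eigen_coordinates(X[0], mu, sigma)
assert np.allclose(mu + U @ alpha, X[0])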


[Figure 1, left: contour of a two-dimensional Gaussian with mean μ̄, eigenvectors ū_1 and ū_2, and axes labelled λ_1^{-1/2} and λ_2^{-1/2}. Right: the decomposition parameters v̄_p and μ̄_p of the optimal split, whose hyperplane satisfies v̄_p^T(x̄ − μ̄_p) = 0.]

Figure 1: Optimal decomposition after Gaussian fit.

Equation (1) can be simplified using (2), resulting in

    \exp\left( -\tfrac{1}{2} \sum_{k=1}^{D} \alpha_k^2 \lambda_k^{-1} \right)    (3)

The integration of the distribution over all axes determined by the eigenvectors yields

    \prod_{k=1}^{D} \int \exp\left( -\frac{\alpha_k^2}{2\lambda_k} \right) d\alpha_k = \prod_{k=1}^{D} \sqrt{2\pi\lambda_k}    (4)

The inverse of this integral is the normalization factor that is used when a multivariate Gaussian is applied as a probability density function. The integral of the Gaussian distribution, including this normalization factor, over the hyperplane through the mean and perpendicular to the principal component ū_{k_max} is given by

    \frac{\prod_{k=1,\, k \neq k_{\max}}^{D} \sqrt{2\pi\lambda_k}}{\prod_{k=1}^{D} \sqrt{2\pi\lambda_k}} = \frac{1}{\sqrt{2\pi\lambda_{k_{\max}}}}    (5)

Since the principal component gives the direction of maximal variance, this equation shows that the integral over the hyperplane perpendicular to the principal component is minimal. Therefore, the decomposition according to this plane results in the fewest vectors having nearest neighbors in both subsets, which is the second efficiency criterion given at the beginning of this subsection. Figure 1 shows the contour of a two-dimensional Gaussian with its corresponding eigenvalues and eigenvectors; on the right, the decomposition parameters of the optimal split are depicted.

As visualized in Figure 2, the decomposition process is applied iteratively, resulting in a hierarchic decomposition represented by a binary tree in which each node represents a subset of the total data set S. Enumerating the nodes starting from zero, the root node S_0 denotes the entire data sample. The child nodes of a node p are defined by induction using

    S_{2p+1} = \{ \bar{x}_i \in S_p : \bar{v}_p^T(\bar{x}_i - \bar{\mu}_p) < 0 \}
    S_{2p+2} = \{ \bar{x}_i \in S_p : \bar{v}_p^T(\bar{x}_i - \bar{\mu}_p) \geq 0 \}    (6)

with

    \bar{v}_p = \bar{u}_{k_{\max}}, \qquad k_{\max} = \arg\max_k \{ \lambda_k \}    (7)

where λ_k and ū_k are the eigenvalues and eigenvectors, respectively, of the covariance matrix Σ over S_p. This definition implies that S_p = S_{2p+1} ∪ S_{2p+2} and S_{2p+1} ∩ S_{2p+2} = ∅. Obviously, the distribution of the vectors might not be Gaussian at all, so the tree will not necessarily be balanced. In order to balance the tree, the value of μ̄_p is redefined to be the median in the direction of v̄_p instead of the mean μ̄. The balancing is realized by defining a scalar β,

    \beta = \mathrm{median}\{ \bar{v}_p^T(\bar{x}_i - \bar{\mu}) \}, \qquad \bar{x}_i \in S_p    (8)

and calculating the value of μ̄_p using

    \bar{\mu}_p = \beta \bar{v}_p + \bar{\mu}    (9)
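Returning to equation (5) for a concrete numerical illustration (the eigenvalues are chosen arbitrarily for the example): for a two-dimensional Gaussian with λ_1 = 4 and λ_2 = 1, the normalized integral over the hyperplane through the mean and perpendicular to ū_1 is 1/\sqrt{2\pi \cdot 4} ≈ 0.20, whereas perpendicular to ū_2 it is 1/\sqrt{2\pi \cdot 1} ≈ 0.40. Splitting perpendicular to the principal axis therefore places about half as much probability mass on the split boundary, which is exactly the quantity the second efficiency criterion asks to minimize.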

[Figure 2: Example of a hierarchical decomposition in two dimensions. The data space is split by the hyperplanes v̄_0^T(x̄ − μ̄_0) = 0, v̄_1^T(x̄ − μ̄_1) = 0 and v̄_2^T(x̄ − μ̄_2) = 0 into the leaf nodes 3, 4, 5 and 6; the corresponding binary tree has the root node 0 at level 0, nodes 1 and 2 at level 1, and nodes 3 to 6 at level 2.]

2.2. The Decomposition Algorithm

The decomposition of the data set up to a level L ≤ log_2 N is now described. The vectors are organized so that all vectors belonging to the same node are grouped together. For each node p, the index of its first vector b_p and of its last vector e_p are stored, so that S_p = {x̄_i ∈ S_0 : b_p ≤ i ≤ e_p}. The decomposition parameters that are determined for each node p are
• b_p, the index of the first vector of S_p
• e_p, the index of the last vector of S_p
• v̄_p, the eigenvector of the covariance matrix of S_p with the largest eigenvalue
• μ̄_p, the median vector of S_p in the direction of v̄_p
The complete decomposition algorithm is listed below as Algorithm 1.



S_0 = {x̄_1, ..., x̄_N}
b_0 = 1
e_0 = N
p = 0
while p < 2^L − 1
    // Calculation of the decomposition parameters
    N_p = e_p − b_p + 1
    μ̄ = (1/N_p) · sum_{i=b_p..e_p} x̄_i
    Σ = (1/N_p) · sum_{i=b_p..e_p} (x̄_i − μ̄)(x̄_i − μ̄)^T
    ū_k, λ_k = eigenvectors(Σ)
    k_max = argmax_k {λ_k}
    v̄_p = ū_{k_max}
    β = median{v̄_p^T(x̄_i − μ̄), b_p ≤ i ≤ e_p}
    μ̄_p = β v̄_p + μ̄
    // Reorganization of the vector order
    i = b_p
    j = e_p
    while j > i
        while v̄_p^T(x̄_i − μ̄_p) ≤ 0
            i = i + 1
        end
        while v̄_p^T(x̄_j − μ̄_p) > 0
            j = j − 1
        end
        // Exchange the positions of x̄_i and x̄_j
        if j > i
            SWAP(x̄_i, x̄_j)
        end
    end
    b_{2p+1} = b_p
    e_{2p+1} = j
    b_{2p+2} = i
    e_{2p+2} = e_p
    p = p + 1
end

Algorithm 1: Hierarchical decomposition.
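For readers who prefer executable code, the following Python/NumPy transcription of Algorithm 1 is a sketch rather than the authors' implementation: it uses 0-based indices and balances each node by sorting on the projection instead of the two-pointer swap of the listing, but it builds the same tree of parameters b_p, e_p, v̄_p and μ̄_p.

import numpy as np

def decompose(X, L):
    """Hierarchical PCA decomposition of the rows of X up to level L (Algorithm 1 sketch)."""
    X = X.copy()
    n_nodes = 2 ** (L + 1) - 1
    b = np.zeros(n_nodes, dtype=int)          # index of the first vector of each node
    e = np.zeros(n_nodes, dtype=int)          # index of the last vector of each node
    v = [None] * n_nodes                      # principal eigenvector v_p (internal nodes only)
    m = [None] * n_nodes                      # median split point mu_p (internal nodes only)
    b[0], e[0] = 0, len(X) - 1
    for p in range(2 ** L - 1):               # internal nodes; leaves are never split
        S = X[b[p]:e[p] + 1]
        mu = S.mean(axis=0)
        lam, U = np.linalg.eigh(np.cov(S, rowvar=False, bias=True))
        v[p] = U[:, -1]                       # eigenvector with the largest eigenvalue
        proj = S @ v[p] - mu @ v[p]           # v_p^T (x_i - mu)
        m[p] = np.median(proj) * v[p] + mu    # mu_p = beta * v_p + mu, equations (8)-(9)
        order = np.argsort(proj, kind="stable")
        X[b[p]:e[p] + 1] = S[order]           # negative side of the hyperplane first
        half = len(S) // 2
        b[2 * p + 1], e[2 * p + 1] = b[p], b[p] + half - 1
        b[2 * p + 2], e[2 * p + 2] = b[p] + half, e[p]
    return X, b, e, v, m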

Results of the decomposition are shown in Figure 3, in which every vector is visualized by drawing a line from it to the mean vector of the node it belongs to.

[Figure 3: Examples of the decomposition. Two two-dimensional data sets are shown (axes from −4 to 4); each vector is connected by a line to the mean vector of its node.]

3. THE SEARCH ALGORITHM

3.1. Elimination rule

The branch and bound algorithm searches the nearest neighbors of a given vector x̄ and consists of a depth-first traversal of the tree that represents the hierarchical decomposition of the data set. When a node is evaluated, it is determined whether it can contain nearest neighbors. If it cannot, the node is omitted from the search procedure. The rule that is used to determine whether a node can contain nearest neighbors is called the elimination rule. It is adapted to the definition of the decomposition and relies on a distance measure between a vector x̄ and a node with index p: a lower bound on the distance from x̄ to any vector belonging to p, which is compared with the distance to the current Kth nearest neighbor. A vector-to-node distance d(x̄, p) is proposed that is defined to be 0 when p is the root node. According to the decomposition defined in equation (6), the distances to the child nodes of p are determined from d(x̄, p) using

    d(\bar{x}, 2p+1) = d(\bar{x}, p), \qquad d(\bar{x}, 2p+2) = \max\left( d(\bar{x}, p),\ |\bar{v}_p^T(\bar{x} - \bar{\mu}_p)| \right) \quad \text{if } \bar{v}_p^T(\bar{x} - \bar{\mu}_p) < 0
    d(\bar{x}, 2p+1) = \max\left( d(\bar{x}, p),\ |\bar{v}_p^T(\bar{x} - \bar{\mu}_p)| \right), \qquad d(\bar{x}, 2p+2) = d(\bar{x}, p) \quad \text{otherwise}    (10)

For the child node that contains the vector x̄, the same distance is taken as for the parent node. For the other child node, all vectors belonging to it have a greater distance to x̄ than the perpendicular distance |v̄_p^T(x̄ − μ̄_p)| to the hyperplane. This perpendicular distance would therefore already be a valid vector-to-node distance on its own. However, hyperplanes at previous levels may provide larger, and thus more effective, distances; therefore the maximum of d(x̄, p) and |v̄_p^T(x̄ − μ̄_p)| is taken.

To clarify this distance measure, consider the decomposition depicted in Figure 2; the distances from a given vector x̄ to all nodes are given in Figure 4. Note in particular the distance d(x̄, 3), where one has the choice between |v̄_0^T(x̄ − μ̄_0)| and |v̄_1^T(x̄ − μ̄_1)|. Since the second distance is greater, it is more likely to eliminate nodes from the search and is therefore chosen as d(x̄, 3).

[Figure 4: Vector-to-node distances for the decomposition of Figure 2:
d(x̄, 0) = 0
d(x̄, 1) = max(d(x̄, 0), |v̄_0^T(x̄ − μ̄_0)|) = |v̄_0^T(x̄ − μ̄_0)|
d(x̄, 2) = d(x̄, 0) = 0
d(x̄, 3) = max(d(x̄, 1), |v̄_1^T(x̄ − μ̄_1)|) = |v̄_1^T(x̄ − μ̄_1)|
d(x̄, 4) = d(x̄, 1) = |v̄_0^T(x̄ − μ̄_0)|
d(x̄, 5) = max(d(x̄, 2), |v̄_2^T(x̄ − μ̄_2)|) = |v̄_2^T(x̄ − μ̄_2)|
d(x̄, 6) = d(x̄, 2) = d(x̄, 0) = 0]

During the search procedure, the currently found nearest neighbors and their distances to the vector x̄ are stored in the variables ȳ_k and d_k respectively, with k = 1, ..., K. The values of d_k are initially set to ∞. The index of the vector ȳ that is the current Kth nearest neighbor is denoted k_max. When a vector is found that is closer than ȳ_{k_max}, it replaces ȳ_{k_max}; it is then determined which ȳ_k is the new Kth nearest neighbor, which results in a new value for k_max. By the definition of the vector-to-node distance, the following elimination rule can be applied: a node p can be discarded from the search procedure if

    d(\bar{x}, p)^2 > (\bar{y}_{k_{\max}} - \bar{x})^T (\bar{y}_{k_{\max}} - \bar{x})    (11)

since all the vectors belonging to p are then further from x̄ than ȳ_{k_max}.
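The child-distance update of equation (10) and the elimination test of equation (11) translate directly into code. The sketch below (our naming, not from the paper) works on squared distances, as Algorithm 2 in the next subsection does.

def child_distances(d2_xp, x, v_p, mu_p):
    """Squared lower-bound distances from x to the two children of node p, equation (10)."""
    s = float(v_p @ (x - mu_p))     # signed distance of x to the splitting hyperplane
    far = max(d2_xp, s * s)         # bound for the child on the other side of the hyperplane
    if s < 0:                       # x lies in child 2p+1
        return d2_xp, far           # (bound for node 2p+1, bound for node 2p+2)
    return far, d2_xp               # x lies in child 2p+2

def can_eliminate(d2_xp, x, y_kmax):
    """Elimination rule of equation (11): node p cannot contain one of the K nearest
    neighbors when its lower bound exceeds the squared distance to the current
    Kth nearest neighbor y_kmax."""
    return d2_xp > float((y_kmax - x) @ (y_kmax - x))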

3.2. The Search Procedure

The tree that represents the decomposition of the data sample is traversed in depth-first order, which can be implemented efficiently using a stack s addressed by an index t. On this stack, the indices p of the nodes that still need to be evaluated and their vector-to-node distances dxp are stored. Initially, the root node and the distance to this node (both zero) are pushed onto the stack. When a node is evaluated, it is popped off the stack and the distance dxp from the vector x̄ to the node p is compared with d_{k_max}. When the node is further away than d_{k_max}, it cannot contain nearest neighbors and the next node on the stack is evaluated. If not, two cases are distinguished. If the node is a leaf node (p ≥ 2^L − 1), the vectors in S_p = {x̄_{b_p}, ..., x̄_{e_p}} are searched, which means that their distances to x̄ are calculated and compared with d_{k_max}. If the node is branched, the child nodes and their distances are pushed onto the stack. Note that the node closest to x̄ is pushed onto the stack last, so that it is evaluated first. The algorithm terminates when the stack is empty (t < 0), which indicates that the entire tree has been traversed and the K nearest neighbors have been found. The complete search algorithm is listed below as Algorithm 2. The K nearest neighbors of a vector x̄ are determined from a data set S_0 that is decomposed up to a level L.

d_1, ..., d_K = ∞
S_0 = {x̄_1, ..., x̄_N}
k_max = 1
s_0 = 0                                  // push the root node
s_1 = 0                                  // push the distance to the root node
t = 1
while t ≥ 0
    dxp = s_t                            // pop the distance
    p = s_{t−1}                          // pop the node
    t = t − 2
    if dxp < d_{k_max}                   // elimination rule
        if p ≥ 2^L − 1                   // p is a leaf node
            i = b_p
            while i ≤ e_p
                dxx = (x̄ − x̄_i)^T (x̄ − x̄_i)
                if dxx < d_{k_max}
                    ȳ_{k_max} = x̄_i
                    d_{k_max} = dxx
                    k_max = argmax_k {d_k}
                end
                i = i + 1
            end
        else                             // p is a branched node
            if v̄_p^T(x̄ − μ̄_p) < 0
                s_{t+1} = 2p + 2
                s_{t+2} = max(dxp, (v̄_p^T(x̄ − μ̄_p))^2)
                s_{t+3} = 2p + 1
                s_{t+4} = dxp
                t = t + 4
            else
                s_{t+1} = 2p + 1
                s_{t+2} = max(dxp, (v̄_p^T(x̄ − μ̄_p))^2)
                s_{t+3} = 2p + 2
                s_{t+4} = dxp
                t = t + 4
            end
        end
    end
end

Algorithm 2: Search Algorithm.
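A Python sketch of Algorithm 2 that matches the data structures of the decomposition sketch above is given below (again our naming, not the authors' implementation); it keeps a list of (node, squared bound) pairs instead of the flat stack s of the listing and returns the neighbors sorted by distance.

import numpy as np

def knn_search(x, X, b, e, v, m, L, K):
    """K nearest neighbors of x in the decomposed data set (Algorithm 2 sketch)."""
    d2 = np.full(K, np.inf)               # squared distances of the current candidates
    y = np.zeros((K, X.shape[1]))         # current candidate neighbors
    stack = [(0, 0.0)]                    # (node index p, squared lower bound d(x, p)^2)
    while stack:
        p, d2_xp = stack.pop()
        kmax = int(np.argmax(d2))         # index of the current Kth nearest neighbor
        if d2_xp >= d2[kmax]:             # elimination rule, equation (11)
            continue
        if p >= 2 ** L - 1:               # leaf node: scan its vectors
            for xi in X[b[p]:e[p] + 1]:
                dxx = float((x - xi) @ (x - xi))
                if dxx < d2[kmax]:
                    y[kmax], d2[kmax] = xi, dxx
                    kmax = int(np.argmax(d2))
            continue
        s = float(v[p] @ (x - m[p]))      # signed distance to the splitting hyperplane
        near, far = (2 * p + 1, 2 * p + 2) if s < 0 else (2 * p + 2, 2 * p + 1)
        stack.append((far, max(d2_xp, s * s)))   # far child: the bound can only grow
        stack.append((near, d2_xp))              # near child is pushed last, evaluated first
    order = np.argsort(d2)
    return y[order], np.sqrt(d2[order])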

4. RESULTS

The behavior of the algorithm was studied by means of experiments on artificial data. Prototype sets were produced from a D-dimensional normal probability distribution with mean 0 and unit covariance matrix. Each result was obtained as the average over 2^10 experiments. The performance of the algorithm was studied with respect to different values of
• D, the dimensionality of the vector space
• N, the number of vectors in the data set
• K, the number of nearest neighbors that are searched
• L, the level of decomposition that is used for the search
Table 1 gives the average number of vector-to-vector distance computations. Experiments with a uniform distribution and with a mixture of Gaussian distributions were also performed successfully, yielding about the same number of distance computations.

Table 1: Average number of distance computations.

                Number of vectors in the data set N
  D   K     2^10    2^11    2^12    2^13    2^14    2^15
  2   1      4.5     4.7     4.7     4.9     4.7     4.7
      2      7.7     7.9     8.0     7.9     8.0     7.9
      4     13.5    13.8    13.6    13.6    13.5    13.5
      8     23.7    23.8    23.8    23.9    23.8    24.0
  4   1       34      35      37      39      39      41
      2       51      55      58      62      62      63
      4       79      87      93      98     101     102
      8      121     136     149     157     163     168
  8   1      371     544     739     968    1197    1515
      2      495     736    1025    1380    1789    2210
      4      608     954    1388    1915    2534    3203
      8      726    1185    1785    2567    3443    4483
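A configuration of this experiment can be sketched with the helpers introduced above (decompose and knn_search are our hypothetical functions, not the authors' code, and the level L chosen here is arbitrary); reproducing the counts of Table 1 would additionally require a counter inside the leaf-scan loop of the search.

import numpy as np

rng = np.random.default_rng(1)
D, N, K, L = 4, 2 ** 12, 4, 7            # one configuration from Table 1; L is chosen arbitrarily
X = rng.standard_normal((N, D))          # prototypes from a unit-covariance Gaussian
Xd, b, e, v, m = decompose(X, L)         # see the decomposition sketch above
query = rng.standard_normal(D)
neighbors, dists = knn_search(query, Xd, b, e, v, m, L, K)   # see the search sketch above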

In addition to these distance computations, the search algorithm also spends time calculating vector-to-node distances and traversing the tree. If the cost of a distance computation is very high relative to this extra effort, the extra effort can be neglected. However, when observing the average calculation time of the algorithm for increasing levels of decomposition, the total calculation time decreases quickly at the lower levels, reaches a minimum, and increases again towards the highest level. Every extra level of decomposition reduces the number of distance computations but increases the traversal cost. If this extra cost exceeds the reduction in distance calculation time, an additional level of decomposition will increase the total search time. This implies that there is an optimal level L_opt for which the total computation time is minimal. Using this optimal level, the average computation time was determined as a function of the number of vectors N, as shown in Figures 5 to 7.

[Figure 5: Average computation time (in seconds) for D = 2, as a function of log_2 N.]

[Figure 6: Average computation time (in seconds) for D = 4, as a function of log_2 N.]

[Figure 7: Average computation time (in seconds) for D = 8, as a function of log_2 N.]

5. CONCLUSIONS AND FURTHER WORK

Since the number of nodes in the tree is 2^{L+1} − 1, with L bounded by log_2 N, the space complexity of the algorithm grows linearly with N. It was observed that the average number of distance computations was bounded by a constant that is independent of the sample size. The lowest average computation time was obtained for an optimal level L_opt, which realizes the tradeoff between the distance computation cost and the tree traversal cost. Using this optimal level of decomposition, it was shown that the average calculation time grows sublinearly with the data set size N; for a low dimensionality (≤ 4) it was very close to logarithmic. A weak property of the algorithm, in the form presented here, is that its performance decreases drastically with the dimensionality D of the vector space. However, it decreases only with the intrinsic dimensionality of the data set. For a four-dimensional sample in which the third and fourth dimensions were a linear combination of the first two (intrinsic dimensionality of 2), the same number of distance computations was observed as for a two-dimensional sample. The results presented for the Gaussian distribution with unit covariance matrix therefore provide a worst-case estimate of the efficiency. During the revision process of this paper, a recent article was brought to the attention of the author in which a branch and bound algorithm with exactly the same decomposition method was proposed [6]. This algorithm was called the Principal Axis Tree (PAT).



Although our idea was developed independently, we cannot claim to have invented a new decomposition method. Our work puts a strong emphasis on the theoretical motivation of the decomposition, by showing that the plane perpendicular to the principal component minimizes the number of vectors that have nearest neighbors in both subsets for a Gaussian distribution. By contrast, in [6] the emphasis lies on an experimental comparison of the efficiency of PAT with sixteen other fast nearest neighbor search algorithms. A clear difference between the two algorithms, however, lies in the proposed elimination rules. Future research will point out the benefits of each rule. Further possible points of improvement are:
• the use of other elimination rules, or combinations of different elimination rules
• the traversal order
• automatic optimization of the user-specified parameters, in this case the optimal level of decomposition L_opt

6. ACKNOWLEDGEMENTS

This work was financially supported by the Flemish Institute for the Promotion of Scientific and Technological Research in the Industry (IWT), Brussels. The author thanks Diemo Schwarz, Steve de Backer, Jan Sijbers and Paul Scheunders of the Visionlab at the University of Antwerp, and the Analysis/Synthesis team at IRCAM.

7. REFERENCES

[1] Keinosuke Fukunaga, Statistical Pattern Recognition, Academic Press, 1990.
[2] Keinosuke Fukunaga and Patrenahalli M. Narendra, "A branch and bound algorithm for computing k-nearest neighbors," IEEE Transactions on Computers, vol. 24, pp. 750–753, July 1975.
[3] Behrooz Kamgar-Parsi, "An improved branch and bound algorithm for computing k-nearest neighbors," Pattern Recognition Letters, vol. 3, pp. 7–12, January 1985.
[4] Baek S. Kim and Song B. Park, "A fast k nearest neighbor finding algorithm based on the ordered partition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 761–766, November 1986.
[5] H. Niemann and R. Goppert, "An efficient branch and bound nearest neighbour classifier," Pattern Recognition Letters, vol. 7, pp. 67–72, February 1988.
[6] James McNames, "A fast nearest neighbor algorithm based on a principal axis search tree," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 964–976, September 2001.
[7] E. Vidal, "An algorithm for finding nearest neighbors in (approximately) constant average time complexity," Pattern Recognition Letters, vol. 4, pp. 145–157, July 1986.
[8] María Luisa Micó, José Oncina, and Enrique Vidal, "An algorithm for finding nearest neighbors in constant average time with a linear space complexity," Proc. of the 11th ICPR, vol. II, pp. 557–560, 1992.
[9] E. Vidal, "New formulation and improvements of the nearest-neighbour approximation and elimination search algorithm (AESA)," Pattern Recognition Letters, vol. 15, pp. 1–7, January 1994.
[10] María Luisa Micó, José Oncina, and Enrique Vidal, "A new version of the nearest-neighbour approximation and elimination search algorithm (AESA) with linear preprocessing time and memory requirements," Pattern Recognition Letters, vol. 15, pp. 9–18, January 1994.
[11] María Luisa Micó, José Oncina, and Rafael C. Carrasco, "A fast branch & bound nearest neighbor classifier in metric spaces," Pattern Recognition Letters, vol. 17, pp. 731–739, June 1996.