Towards Enhanced Compression Techniques for Efficient High-Dimensional Similarity Search in Multimedia Databases

Sören Balko, Ingo Schmitt, and Gunter Saake

Database Research Group, Institut für Technische und Betriebliche Informationssysteme, Fakultät für Informatik, Universität Magdeburg, Germany
{balko|schmitt|saake}@iti.cs.uni-magdeburg.de
Abstract. In this paper, we introduce a new efficient compression technique for high-dimensional similarity search in MMDBS. We propose the Active Vertice Tree which is based on concave cluster geometries. Furthermore, we briefly sketch a model for high-dimensional point alignments and specify basic requirements for high-dimensional cluster shapes. Finally, we compare the Active Vertice Tree with other methods for high-dimensional similarity search in terms of their retrieval performance.
1 Foundations
Similarity search in multimedia databases has a wide spectrum of applications. For instance, operations like authentication control based on biometrical data or the search for missing antiques in art databases may benefit from efficient algorithms for similarity search. Similarity queries are typically conducted by means of a nearest neighbor retrieval. For this purpose, single data objects (e.g. iris scans, art photos) are mapped to high-dimensional feature vectors. In the scope of this paper, we consider these feature vectors¹ to be scaled to an interval of [0, 1] in each dimension. Therefore, we obtain data points p_i with:

    p_i \in [0, 1]^d

We observe two orthogonal problems: Firstly, there is typically a large number N of data objects. For example, an art database may easily comprise several thousand objects. Therefore, naive implementations of similarity queries based on sequential processing of the data objects are too time consuming. Secondly, the dimensionality d of the data points tends to be high. Actually, this number depends on the selected feature extraction method and may, for instance, exceed 1000 feature values for color histograms. In fact, the dimensionality, in turn,
¹ Subsequently, we use the term data point instead of feature vector.
causes manifold secondary problems like barely avoidable cluster overlap and almost equal distances between arbitrary (uniformly distributed) data points. The most prominent problem is the rising approximation error which inherently characterizes the approximation quality of a cluster shape. For the time being, we restrict our considerations to a set P of artificially created, uniformly distributed data points (|P| = N). The similarity between two data points p, q ∈ P is expressed by a distance function. In this paper, we consider distance functions based on the L_2 (Euclidean) norm:

    L_2(p - q) = \|p - q\|_2 = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2}
The "nearest neighbor" (NN) of a query point q is the closest data point p ∈ P:

    NN(q) = p \in P \text{ such that } \forall p_i \in P: \|q - p\|_2 \le \|q - p_i\|_2

We measure the "cost" of a nearest neighbor retrieval as the amount of data which has to be fetched from the database during this process. Since particular hardware configurations and numerous optimizations in the chosen operating system, programming language, and database system influence the elapsed time, we believe that the amount of data is a more objective basis of comparison. A sequential scan of all data points is a simple naive implementation of a nearest neighbor retrieval. In this case, the retrieval cost equals the amount of data which is occupied by all data points²:

    cost_{sequential scan} = N \cdot d \cdot \text{sizeof(float)}

Therefore, the goal of any nearest neighbor retrieval approach is to provide an acceleration which yields lower retrieval costs. In fact, the unsatisfactory performance of a sequential scan has triggered considerable research activity in this field (see [4] for an overview). Most of the proposed indexing techniques can be classified as clustering approaches. That is, either the vector space [0, 1]^d (space partitioning) or the data set P (data partitioning) is separated into clusters. These clusters are typically assembled in tree-like hierarchies where a node corresponds to a cluster which, both in terms of cluster region and data points, comprises its sub nodes. This method aims at providing logarithmic retrieval costs: O(log N). There are different geometrical cluster shapes in the various indexing methods. For example, the X-tree [3] is based on minimal bounding (hyper-) rectangles. In contrast, the SR-tree [6] is based on an intersection of hyperballs and hyper-rectangles. However, as a common feature, these shapes are convex.
² Here, float denotes a 4-byte single precision real number (IEEE 754).
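As a point of reference for the cost formula above, the following minimal sketch (Python with NumPy; variable names and sizes are our own choices) implements the naive sequential scan and reports the amount of data it has to touch.

    import numpy as np

    def sequential_scan_nn(P, q):
        """Naive nearest neighbor retrieval: scan all N data points."""
        dists = np.linalg.norm(P - q, axis=1)   # L2 (Euclidean) distances
        return int(np.argmin(dists))            # index of the nearest neighbor

    # Artificially created, uniformly distributed data points, as assumed in the paper
    rng = np.random.default_rng(0)
    N, d = 1000, 100
    P = rng.random((N, d)).astype(np.float32)   # data points p_i in [0, 1]^d
    q = rng.random(d).astype(np.float32)        # query point

    nn_index = sequential_scan_nn(P, q)
    print("retrieval cost:", N * d * 4, "bytes")  # N * d * sizeof(float)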
In [7] we investigated the growth of the approximation error³ in rising dimensionalities for the minimal convex cluster shape (convex hull). This deterioration of the approximation error is often called the "curse of dimensionality". That is, in rising dimensionalities the retrieval performance of convex clustering methods quickly approaches linear scan costs. Furthermore, these cluster trees decrease in height if overlap between the nodes is strictly avoided [3], i.e. we retrieve a list of clusters. Consequently, as another approach towards efficient nearest neighbor retrieval, compression techniques (e.g. the VA-File [8]) have been proposed. Instead of pruning whole subtrees, these methods rely on a mapping of data points to compressed representations. During the retrieval process these representations are transformed to "small regions" which are completely scanned. As a result, compression techniques (almost) do not suffer from the "curse of dimensionality". However, they still yield linear retrieval costs which only differ from the sequential scan by a constant factor. To overcome the disadvantages of convex cluster geometries, in this paper we advance the idea of concave cluster geometries (see [2,1] for details) towards efficient compression techniques.
2 Point Approximations
In this section, we propose a model for high-dimensional data point alignments and derive basic requirements for efficient cluster shapes. In [7] we provided a statistical model for the expected value E(X)_{\|p-q\|_2} and the standard deviation \sigma_{\|p-q\|_2} of distances between uniformly distributed high-dimensional data points:

    E(X)_{\|p-q\|_2} \rightarrow \sqrt{d/6}, \qquad \sigma_{\|p-q\|_2} \rightarrow \text{const} \approx 0.24

That is, distances between arbitrarily selected points increase with rising dimensionality. However, the variation in these distances remains constant:

    \forall p_1, p_2, q_1, q_2 \in [0, 1]^d: \quad \lim_{d \to \infty} \frac{\|p_1 - q_1\|_2}{\|p_2 - q_2\|_2} = 1
Informally, we observe relatively converging distances. In consequence, a geometrical formation of equally distant points would be a good approximation for high-dimensional points. However, the only hyper-polyhedron that obeys this requirement is the regular simplex. In [2] we introduced the concept of regular simplexes (regular hyper-polyhedra) whose vertices approximate the location of high-dimensional data points. Unfortunately, arbitrarily aligned simplexes require very costly geometrical descriptions (see [1] for details).
³ Difference between the distance of a query point to the cluster surface and the distance to the actual nearest data point in the cluster.
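The stated convergence behaviour is easy to check empirically. The following Monte Carlo sketch (an illustration only, not the derivation from [7]) estimates the mean and standard deviation of pairwise distances and compares the mean against \sqrt{d/6}.

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (10, 100, 1000):
        # pairs of uniformly distributed points in [0, 1]^d
        p = rng.random((20000, d))
        q = rng.random((20000, d))
        dist = np.linalg.norm(p - q, axis=1)
        # the mean distance approaches sqrt(d/6); the standard deviation stays near 0.24
        print(f"d={d}: mean={dist.mean():.3f}  sqrt(d/6)={np.sqrt(d / 6):.3f}  std={dist.std():.3f}")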
Therefore, we propose the normalized regular simplex which is assembled by means of fixed vectors (i.e. parallel to the Cartesian coordinate axes). A normalized d-simplex with an edge length of l comprises the following vertices⁴ v_i:

    v_i = (\underbrace{0, \ldots, 0}_{i-1}, x, \underbrace{0, \ldots, 0}_{d-i})

The value of x is computed by x = l/\sqrt{2}. Due to the fact that (i) the vertices v_i all share the common coordinate value x and (ii) any two vertices are orthogonal to each other (⟨v_i, v_j⟩ = 0), we propose to inscribe normalized regular simplexes into hypercubes. That is, we can regard any corner point of a hypercube as the origin of an inscribed simplex.
Fig. 2.1: Active Vertice Cluster
Therefore, hypercubes are the basis of Active Vertice Clusters (see [1] for details). In Fig. 2.1 an Active Vertice Cluster inscribing a normalized simplex is depicted. For the sake of clarity, we pictured only one inscribed simplex and marked its origin point. Again, we can state some observations: Though regular simplexes are an appropriate model for high-dimensional point alignments, real data points are unlikely to be located precisely in the simplex vertices. Instead, we have data points that correlate with a vertex. This correlation corresponds to a maximal distance r_max of the data point from the vertex. In Fig. 2.1 the data points p_1 and p_2 correlate with vertices of an inscribed simplex. Due to the fact that any corner point may be regarded as the origin point of an inscribed simplex, in consequence, any corner point may also be part (i.e. a vertex) of some inscribed simplex. However, a hypercube comprises 2^d corner
⁴ Scaled unit vectors e_i = (0, \ldots, 0, 1, 0, \ldots, 0) with i−1 leading and d−i trailing zeros.
points which will exceed the number of data points even in comparatively low dimensionality (e.g. d = 50: 2^50 ≈ 1.1 · 10^15). Obviously, only a small subset of the corner points can correlate to data points. That is, the inscribed simplexes may be incomplete, i.e. comprise less than d points. Therefore, we have to "tag" those corner points that correlate to a data point. This may be done by bit codes. For instance, the corner points that correlate to the data points p_1 and p_2 are labeled with the bit codes 100 and 001 (Fig. 2.1). Finally, we can subsume four basic requirements for high-dimensional clusters:

– The geometrical representation of a cluster must be small in storage. In case of Active Vertice Clusters it comprises the origin point o (d · float), the edge length x (float), and one bit code per assigned data point (d · bit); see the sketch after this list.
– The cluster shape should closely follow the actual alignment of the data points. The regular simplex/normalized regular simplex model may be used as a "guideline".
– The cluster shape should allow some "variance" in the location of the data points. For example, this may be expressed by a maximum distance r_max around precisely located points.
– Finally, the cluster shape should not enclose "empty" regions. That is, any region in the cluster should be assigned to an enclosed data point.
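As referenced in the first requirement, the following sketch (Python with NumPy; the helper names are hypothetical) constructs the vertices of a normalized d-simplex as scaled unit vectors, verifies orthogonality and edge length, and spells out the storage cost of an Active Vertice Cluster.

    import numpy as np

    def normalized_simplex_vertices(d, edge_length):
        """Vertices v_i = (0, ..., 0, x, 0, ..., 0) with x = l / sqrt(2):
        scaled unit vectors, pairwise orthogonal and pairwise at distance l."""
        x = edge_length / np.sqrt(2)
        return np.eye(d) * x

    V = normalized_simplex_vertices(5, 1.0)
    assert np.isclose(V[0] @ V[1], 0.0)                  # <v_i, v_j> = 0
    assert np.isclose(np.linalg.norm(V[0] - V[1]), 1.0)  # ||v_i - v_j||_2 = l

    def cluster_bytes(d, num_points):
        """Storage of an Active Vertice Cluster: origin (d floats), edge length
        (1 float), and one d-bit code per assigned data point."""
        return d * 4 + 4 + num_points * ((d + 7) // 8)

    print(cluster_bytes(d=100, num_points=50))   # e.g. 1054 bytes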
3 Active Vertice Trees
Based on the Active Vertice Cluster, we introduce a new compression technique: the Active Vertice Tree (AV-Tree). By advancing the Active Vertice Cluster we pursue three major goals. Firstly, we intend to utilize a cluster description which inherently derives most of its geometrical parameters (e.g. origin point, edge length) and locates the data points only by sequences of bit codes. Secondly, the Active Vertice Cluster turned out to still suffer from the dimensionality curse. Consequently, an improved cluster geometry must approximate the enclosed data points more closely. Finally, our indexing proposal must outperform existing approaches like the VA-File. The main idea of AV-Trees is a hierarchical approximation of data point locations. That is, based on a predefined center point c_0 = (0.5, \ldots, 0.5) and a fixed edge length x_0 = 0.5 we describe a hypercube H_0. Subsequently, we assign center points c_i of "child" hypercubes H_i to selected vertices of H_0. The edge length x_i is recursively computed as follows⁵:

    x_i = 1/2^{i+1}
⁵ Please note that i denotes the AV-Tree "level".
The center point c_i is recursively computed by means of a bit code (here: b) as follows:

    c_i = \begin{cases} (0.5, \ldots, 0.5) & \text{if } i = 0 \\ (\ldots, \ c_{i-1,j} + (b_j - \tfrac{1}{2}) \cdot x_{i-1}, \ \ldots) & \text{otherwise} \end{cases}

Please note that b_j denotes the j-th position in the bit code b (0 or 1). Similarly, c_{i-1,j} denotes the j-th coordinate of the center point c_{i-1} of the superordinated hypercube H_{i-1}. Fig. 3.1 depicts this situation:
Fig. 3.1: Active Vertice Tree
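A minimal sketch of the recursion for x_i and c_i (the helper names are our own; the bit code is given as a list of 0/1 values), reproducing plausible corner points for the two-dimensional setting of Fig. 3.1:

    import numpy as np

    def edge_length(i):
        """x_i = 1 / 2^(i+1): edge length of a level-i hypercube."""
        return 1.0 / 2 ** (i + 1)

    def child_center(c_prev, x_prev, bit_code):
        """c_i from c_{i-1}, x_{i-1} and a bit code b (one bit per dimension)."""
        b = np.asarray(bit_code, dtype=float)
        return c_prev + (b - 0.5) * x_prev

    # 2-D illustration: the corner for bit code 10 at level 1, then the corner for
    # bit code 00 at level 2 (values derived from the formulas, not quoted from the paper)
    c0 = np.full(2, 0.5)                               # c_0 = (0.5, 0.5)
    c1 = child_center(c0, edge_length(0), [1, 0])      # -> (0.75, 0.25)
    c2 = child_center(c1, edge_length(1), [0, 0])      # -> (0.625, 0.125)
    print(c1, c2)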
In the example, to insert the data point p′ one has to:

(i) Find the appropriate corner point c_1 of H_0 and save its bit code (b_1 = 10).
(ii) Check if \|p′ − c_1\|_2 ≤ r_max holds.
(iii) If not, create a new hypercube H_1 with c_1 as its center point and x_1 = x_0/2 as its edge length.
(iv) Find the appropriate corner point c_2 of H_1 and save its bit code (b_2 = 00).
(v) Check if \|p′ − c_2\|_2 ≤ r_max holds.
(vi) If so, assign the complete bit code (B = b_1 ◦ b_2 = 1000) to p′ and store it in the database.

For the data points p′ (see example) and p′′ from Fig. 3.1 we obtain the following AV-Tree:
Fig. 3.2: Database Tree Representation
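The insertion steps (i)–(vi) can be summarized in a short sketch (the function name, the max_depth guard, and the string encoding of B are assumptions of ours, not prescribed by the paper):

    import numpy as np

    def av_tree_insert(p, r_max=0.2, max_depth=16):
        """Return the concatenated bit code B = b_1 ◦ b_2 ◦ ... for data point p."""
        d = len(p)
        c = np.full(d, 0.5)                     # center c_0 = (0.5, ..., 0.5)
        x = 0.5                                 # edge length x_0 = 0.5
        bits = []
        for _ in range(max_depth):
            b = (p > c).astype(int)             # bit code of the nearest corner point
            corner = c + (b - 0.5) * x          # corner of the current hypercube
            bits.append("".join(map(str, b)))
            if np.linalg.norm(p - corner) <= r_max:
                break                           # point is approximated closely enough
            c, x = corner, x / 2                # descend into the child hypercube
        return "".join(bits)

    # 2-D example roughly in the spirit of Fig. 3.1
    print(av_tree_insert(np.array([0.7, 0.2])))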
Obviously, the insertion procedure terminates if the distance between the data point and the corner point of a certain hypercube lies below the threshold value r_max. To get a grasp of how the distance between the data point and the nearest corner point changes at the different depths of an AV-Tree, we measured the average distance in an experiment (d = 100, r_max = 0.2) and depicted the results in Fig. 3.3.
[Plot: average vertex-data point distance (y-axis, 0–3) over the AV-Tree depth (x-axis, 0–5)]
Fig. 3.3: Vertice - Data Point Distances
Clearly, this distance scales down quickly and, on average, approaches the threshold value r_max at a tree depth of 4. The nearest neighbor retrieval algorithm was adopted from the VA-File [8], which is roughly based on the GEMINI [5] approach. The algorithm retrieveNN comprises two stages: the first stage processes the compressed point approximations into a candidate list; in the second stage the assigned data points are retrieved:
    function retrieveNN(q: Point): Point
      // 1st stage: filter on the compressed point approximations
      candidates = {}; nn = p_1; min = ||q − nn||_2;
      for i = 1 to N do
        if LB(q, B_i) ≤ min then
          candidates = candidates ∪ {B_i};
          if UB(q, B_i) < min then min = UB(q, B_i); endif
        endif
      od
      // 2nd stage: refine on the actual data points
      sort candidates ascendingly by LB;
      for each B_i ∈ candidates do
        if LB(q, B_i) ≤ min then
          if ||q − p_i||_2 < min then nn = p_i; min = ||q − p_i||_2; endif
        endif
      od
      return nn
In this algorithm, p_i (with 1 ≤ i ≤ N) denotes a point from the set of data points, and B_i is the bit code assigned to p_i. LB and UB compute the lower and upper bound distances from a query point q to the hyperball (with radius r_max) that approximates p_i and whose center point is determined by B_i.
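One possible realization of LB and UB for the AV-Tree is sketched below (assuming, as in the insertion sketch above, that B is stored as a bit string with d bits per level; these helpers are illustrative, not the authors' implementation):

    import numpy as np

    def corner_from_bitcode(B, d):
        """Reconstruct the approximating corner point from the concatenated bit code B."""
        c, x = np.full(d, 0.5), 0.5
        for level in range(len(B) // d):
            b = np.array([int(ch) for ch in B[level * d:(level + 1) * d]], dtype=float)
            c = c + (b - 0.5) * x               # corner of the level-i hypercube
            x /= 2
        return c

    def lb_ub(q, B, d, r_max):
        """Lower/upper bound of the distance from q to the hyperball (radius r_max)
        around the corner point encoded by B."""
        dist = np.linalg.norm(q - corner_from_bitcode(B, d))
        return max(0.0, dist - r_max), dist + r_max

    lb, ub = lb_ub(np.array([0.9, 0.1]), "1000", d=2, r_max=0.2)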
To put our AV-Tree into a contest with established retrieval methods, we additionally implemented the VA-File approach. That is, the vector space is separated by a grid with 2^n intervals in each dimension. Each data point is approximated by a sufficiently small grid cell. The VA-File requires n bits in each dimension to uniquely address each interval. Therefore, one data point requires d · n bits for its approximation. In Fig. 3.4 a VA-File (d = 2, n = 3) is depicted:
Fig. 3.4: VA-File
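A minimal sketch of this grid approximation (a simplified, hypothetical encoding; the original VA-File [8] may choose the partition points per dimension differently):

    import numpy as np

    def va_bitcode(p, n):
        """Approximate a point p in [0, 1]^d by its grid cell: n bits per dimension."""
        cells = np.minimum((p * 2 ** n).astype(int), 2 ** n - 1)    # interval index per dimension
        return "".join(format(int(c), f"0{n}b") for c in cells)     # d * n bits in total

    # the 2-D point (0.55, 0.8) with n = 3 is approximated by the intervals 100 and 110
    print(va_bitcode(np.array([0.55, 0.8]), 3))   # -> "100110"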
Both the AV-Tree and the VA-File require external parameterization. In case of the AV-Tree, the threshold value r_max indirectly determines the tree depth. For the VA-File, the number n of bits specifies the precision of the VA-File grid. For the time being, we abstain from an analytic model for choosing r_max and n such that the retrieval costs become minimal in a given dimensionality d. Instead, we determined "optimal" values in both cases for three selected dimensionalities: d = 10, 100, 1000. To do so, we scaled r_max from 0.025 to 0.3 (in steps of 0.025) and n from 1 to 10 and measured the resulting retrieval cost. According to our retrieval algorithm, the overall retrieval cost is composed of two factors: (i) the "compressed" data point representations which are processed in the filtering step and (ii) the actual data points (candidate list) which have to be fetched afterwards. In Fig. 3.5 both factors (i: bright, ii: dark) are depicted for d = 1000. For the selected dimensionalities (d = 10, 100, 1000) we obtain the following "optimal" values: r_max = 0.15, 0.2, 0.15 and n = 3, 5, 6.
[Plots: overall retrieval cost over r_max (0.05–0.3) for the AV-Tree and over n (2–10) for the VA-File at d = 1000, with the respective minima marked]
Fig. 3.5: Optimal Parameterization (d = 1000)
Considering the diameter of the point approximations in the AV-Tree and the VA-File, we can analyze the d = 10 case as an example. Here, n = 3 is the best parameter for the VA-File whereas r_max = 0.15 yields the best results for the AV-Tree. Since one grid cell in the VA-File specifies a hyper-rectangle with an edge length of 1/2^n, we obtain a diameter (maximal extent) of \sqrt{d}/2^n ≈ 0.4. Clearly, the diameter of the approximation region in the AV-Tree is 2 · r_max = 0.3. In Fig. 3.6 this situation is depicted by means of one selected dimension:
[(a) AV-Tree corner points c_0, c_1, c_2 on [0, 1]; (b) the eight VA-File intervals 000–111]
Fig. 3.6: AV-Tree and VA-File in Single Dimension
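The two diameters quoted above follow directly from the chosen parameters; a short worked check under the stated values d = 10, n = 3, r_max = 0.15:

    import math

    d, n, r_max = 10, 3, 0.15
    va_cell_diameter = math.sqrt(d) / 2 ** n   # sqrt(d) / 2^n ≈ 0.395
    av_ball_diameter = 2 * r_max               # 2 * r_max = 0.30
    print(va_cell_diameter, av_ball_diameter)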
Even though the diameter of the AV-Tree approximation lies below the diameter of the VA-File approximation, it takes 3 bits (100) in the VA-File to locate a data point (•). In case of the AV-Tree we can locate this data point with two bits (10). Finally, we directly compared the retrieval performance of AV-Tree and VA-File in a number of experiments. We conducted these experiments in d = 10, 100, 1000 for N = 1000, \ldots, 5000. In Fig. 3.7 we depict the retrieval cost (y-axis, in bytes) for d = 100 and different numbers of data points (x-axis). The AV-Tree reveals lower retrieval costs than the VA-File:
Fig. 3.7: Retrieval Performance for d = 100
Actually, we obtain superior retrieval performance of the AV-Tree in all measured dimensionalities. That is, for d = 1000 retrieval costs improve by ≈ 2.2%, for d = 100 the improvement is ≈ 16.4% and for d = 10 we obtain an improvement of ≈ 25.7%. Nevertheless, we are bound to restrict these results to artificially created uniformly distributed data points.
4 Conclusions
In this extended abstract, we sketched the AV-Tree approach for efficient high-dimensional nearest neighbor retrieval. We briefly presented the foundation of concave cluster geometries and high-dimensional data point alignments. Furthermore, we classified our AV-Tree as a data compression approach and related it to the well-known VA-File. Finally, we examined the retrieval costs and compared them with the VA-File performance. We outlined the superior AV-Tree performance by means of a concrete example. Ongoing work should focus on a further improvement of this hierarchical compression technique. In particular, the high fanout of the AV-Tree even at the first level should be subject to improvement. Therefore, we are currently investigating alternative cluster geometries. Furthermore, we intend to analyze the impact of file compression tools on AV-Trees and VA-Files:
Compression Program | AV-Tree | VA-File
zip                 | ±0%     | −2.5%
gzip                | ±0%     | −2.5%
bzip2               | ±0%     | −10.2%

Fig. 4.1: Post Compression Test on AV-Tree and VA-File (N = 5000, d = 100)
Finally, it is up to future research to confirm our results in real data domains.
References

1. S. Balko and I. Schmitt. Active Vertice Clusters – A Sophisticated Concave Cluster Shape Approach for Efficient High Dimensional Nearest Neighbor Retrieval. Preprint 24, Fakultät für Informatik, Universität Magdeburg, 2001.
2. S. Balko and I. Schmitt. Concave Cluster Shapes for Efficient Nearest Neighbor Search in High Dimensional Space. Preprint 23, Fakultät für Informatik, Universität Magdeburg, 2001.
3. S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An Index Structure for High-Dimensional Data. In T. M. Vijayaraman et al., editors, Proc. of VLDB, pages 28–39. Morgan Kaufmann Publishers, San Francisco, CA, 1996.
4. C. Böhm, S. Berchtold, and D. A. Keim. Searching in High-dimensional Spaces – Index Structures for Improving the Performance of Multimedia Databases. ACM Computing Surveys, 2001. To appear.
5. Christos Faloutsos. Searching Multimedia Databases by Content. Kluwer Academic Publishers, Boston/Dordrecht/London, 1996.
6. Norio Katayama and Shin'ichi Satoh. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In Proc. of the 1997 ACM SIGMOD Int. Conf. on Management of Data, pages 369–380, 1997.
7. I. Schmitt. Nearest Neighbor Search in High Dimensional Space by Using Convex Hulls. Preprint 6, Fakultät für Informatik, Universität Magdeburg, 2001.
8. R. Weber, H.-J. Schek, and S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. In A. Gupta et al., editors, Proc. of VLDB, pages 194–205. Morgan Kaufmann Publishers, San Francisco, CA, 1998.