Efficient High-Dimensional Indexing by Superimposing Space-Partitioning Schemes Jack Lukaszuk and Ratko Orlandic Department of Computer Science Illinois Institute of Technology 10 West 31st Street, Chicago, IL 60616 ph: (312) 567-5343 fax: (312) 567-5067
Abstract
The problem of accessing data in high-dimensional spaces has attracted considerable scientific interest. Multi-dimensional access methods typically employ a space-partitioning strategy whose goal is to direct the search toward relevant parts of the space. Unfortunately, monolithic partitioning, whether it is static, dynamic or some hybrid scheme, has serious limitations in high-dimensional situations. This paper presents a new family of indexing techniques that superimposes static partitioning over a dynamic partitioning scheme in order to improve the search performance in high-dimensional spaces. A comprehensive set of experiments shows that, in realistic scenarios, the proposed solutions are usually much more efficient and more scalable than the multi-dimensional access methods based on monolithic partitioning.
Keywords: access methods, space partitioning, data dimensionality, performance evaluation.
Supported in part by the NSF grant IIS-0312266.
1. Introduction
A growing number of scientific, engineering, and business applications regard their objects as feature vectors in multi-dimensional spaces [8, 16, 20]. Once collected and stored, this data is analyzed in order to extrapolate higher-level facts, such as physical phenomena, similarities, or trends. Since typical analytical tasks require frequent searches through the data, analytical engines generally use some form of a multi-dimensional index. Many database environments, including scientific applications, such as molecular biology [19] and high-energy physics [18], as well as real-time applications, e.g. robotic vision [13], deal with data of very high dimensionality. Traditional multi-dimensional indexing techniques [6, 14], which were designed for 2- and 3-dimensional situations as in CAD or geographic applications, do not scale well as dimensionality increases [4]. Despite the considerable interest in the problem of accessing data in high-dimensional spaces [4], we still do not have appropriate high-dimensional indexing techniques.
In this paper, we consider the problem of supporting region (window) queries over a dynamically growing set of data in multi-dimensional feature spaces. With the exception of certain indexing techniques, such as those based on bit slicing or hashing [10], most multi-dimensional access methods organize the space into partitions (i.e., index regions) in order to eliminate from inspection all but the relevant parts of the space. Typically, one data page corresponds to a single index region. A large data set can have a virtually infinite number of possible page assignments. In reality, however, the contents of pages and the layout of partitions are determined by the space-partitioning strategy, which is often characterized as either space-driven or data-driven [5]. A typical space-driven technique, e.g. the KDB-tree [14], splits the space into non-overlapping index regions and assigns points to the corresponding pages. A data-driven technique, e.g. the R-tree and its variants [1, 7], assigns to each page the minimum bounding hyper-rectangle (MBR) enclosing all points in the page. The two may seem similar, but their effects on retrieval performance are often different.
Many multi-dimensional access methods employ hybrid partitioning, which combines different partitioning strategies into a single monolithic scheme. For example, the Buddy Tree [17] divides the space into non-overlapping partitions (just like the KDB-tree), but it also maintains for each partition the MBR enclosing all points in the partition (like the R-tree).
While the KDB-tree partitioning eliminates region overlap, the MBRs help avoid searching dead (empty) space. The same effect is achieved by a more compact structure, called the Hybrid Tree [5], which reduces dead space by performing the split at two different locations, creating two partitions and an empty area in between. The MBRs are compactly represented by the split plane and two location values. The A-tree [15] is another hybrid design, which combines R-tree partitioning with the VA-file [21].
The partitioning strategies of these techniques dynamically adjust the index regions to the changing data distribution while loading the data. As a result, these strategies can be regarded as dynamic partitioning schemes. However, since an overfilled partition is split along a single dimension, a dynamic scheme provides no assurance that every dimension is split, even for very large data sets [2]. Because of this, dynamic schemes generally do not scale well as dimensionality increases. Motivated by this observation, several static partitioning schemes have been designed specifically for high-dimensional spaces. These include the Pyramid Technique [2] and Gamma partitioning [11]. Unfortunately, since the partitions are statically determined prior to data loading, they cannot be easily adjusted to a changing data distribution.
Even this brief examination of contemporary multi-dimensional access methods points to the conclusion that their monolithic partitioning, whether it is a static, dynamic, or some hybrid scheme, is the main source of their problems. One can hope to avoid these problems by superimposing a static partitioning scheme on top of a dynamic one. While the static pre-partition of the space would enable splitting along all dimensions regardless of data dimensionality, the dynamic sub-partitioning would enable a graceful adaptation of the structure to the changing data distribution. These superimposed partitioning schemes could accommodate voluminous data of very high dimensionality.
In this paper, we develop a family of high-dimensional indexing techniques with superimposed partitioning and demonstrate the superiority of this approach through a comprehensive set of experiments with both synthetic and real application data. The approach of superimposing partitioning schemes is not only original, but also represents a significant step forward with respect to the flexibility, performance, and scalability of multi-dimensional indexing. The experimental evidence presented in the paper will show that superimposed partitioning is at least as good as the underlying strategies used. More importantly, in practical situations, it enables significant performance improvements over its underlying partitioning schemes.
While the superimposed designs introduced in this paper combine only two partitioning strategies, in principle, there is no restriction on the number of component partitioning schemes. For example, one can use a static partitioning in a recursive manner, or mix one static scheme with more than one dynamic partitioning. This approach can also incorporate and take advantage of partitioning schemes yet to be discovered. Another important aspect of this approach is that it can be easily implemented by reusing an existing indexing technique. For analytical engines and commercial database management systems with the embedded retrieval functions already in place, these add-on capabilities could enable an easy transition toward more functional systems with better performance and scalability.
In the rest of the paper, Section 2 reviews several examples of monolithic partitioning. Section 3 describes three variants of the proposed superimposed partitioning. Section 4 presents the experimental evidence and Section 5 summarizes the paper.
2. Monolithic Partitioning Schemes As noted in the introduction, most multi-dimensional access methods associate space partitions with pages on the storage. In dynamic partitioning, the final layout of partitions depends on the volume and distribution of data. In contrast, static partitioning does not change in the course of data loading. The following paragraphs describe selected examples of both dynamic and static partitioning. Grid File [9] was one of the first multi-dimensional indexing techniques with dynamic partitioning. This structure initially starts with one data page, which is associated with the entire space. When the page overfills, it is split into two pages corresponding to two equal halves of the space. The process repeats for every data page that overfills. While a page split always bisects the entire space, the remaining data pages are not forced to split. As shown in Figure 1a, a partition can cover multiple cells of the spatial grid. The mapping between cells and data pages is done in a directory file, which has a reference to every cell in the conceptual grid. With the space partition of the Grid File, the number of cells in the grid increases exponentially with a growing data-set size. As a result, the directory file grows very fast. This problem was resolved by the hierarchical directory structure of the KDB-tree [14]. At the lowest level, directory pages store cell descriptors along with the pointers to the corresponding data pages. One level above, the directory pages contain the descriptors of combined cells covering the cells in a directory page one level below (see Figure 1b). The root page contains master cells representing the main branches of the hierarchy. Since there is no overlapping between the partitions, which also cover the entire space, the space division of the Grid File and KDB-tree can be called complete and disjoint. However, each partition may have an outer shell of dead space without any points. 
This dead space is dramatically reduced in another dynamic structure, called the R-tree [7], which also builds a hierarchy of partitions. However, the building block of the hierarchy is not a grid cell, but an MBR. As shown in Figure 1c, at the lowest level of the hierarchy, an MBR encloses the points of a data page. At a higher (interior) level, an MBR is the smallest encapsulation of an MBR group one level below. Typically, the MBRs at any level do not cover the space, thus reducing the amount of dead space. Unfortunately, building the hierarchy of MBRs leads to region overlapping, which in high-dimensional spaces can result in a severe degradation of the search performance [3].
Figure 1. Space-partitioning strategies: a) Grid File, b) KDB-tree, c) R-tree, d) Pyramid, e) Gamma, f) Live Regions.
Dynamic partitioning has evolved into a large family of different structures. In contrast, static partitioning is a fairly new concept. The Pyramid Technique and Gamma partitioning are typical representatives. As shown in Figure 1d, the Pyramid Technique [2] divides a d-dimensional space into 2d pyramids with a common vertex and bases on the outside walls. Normally, the common vertex is located in the center of the space. However, one can shift it towards a dense population of points in order to even the distribution of points among the pyramids [2].
Gamma partitioning [11] is based on nested hyper-rectangles, called generators, which have a common corner in the origin of the space. The shell-like sub-space between two adjacent generators, which in a 2-dimensional space resembles the Greek letter Γ, is further divided into rectangular Gamma regions. Figure 1e gives an example of Gamma partitioning with 3 generators and 5 regions in a 2-dimensional space. In general, the number of Gamma partitions is at most 1 + (m - 1)d, where m is the number of generators and d is the data dimensionality. Depending on the generators, different Gamma regions can have widely different volumes, but the generators are usually chosen to produce Gamma
regions of equal size. Because Gamma partitions are rectangular and static, each can be reduced to an MBR, called the "live region". Each live region bounds the points of the corresponding Gamma region, minimizing the amount of dead space (see Figure 1f).
The Pyramid and the original Gamma [11] techniques build the multi-dimensional index in the same way. Each data point is assigned a one-dimensional key, representing a combination of the partition number and a linear position within the partition. The Pyramid Technique orders the partitions by the number of the dimensions on which the pyramids lie, allowing two numbers for each dimension. In addition to the number of the pyramid containing the given point, the key value includes the distance from the point to the common vertex. Gamma orders its partitions by the generator and the number of the dimension on which the partition is cut out from the sub-space. The projection of the point on the longest edge of the region is the second component of that point's key.
In both Pyramid and Gamma, data points with their one-dimensional key values are stored in a B+-tree. Since points are projected onto a linear space, a multi-dimensional query must be transformed into a set of one-dimensional ranges. While eliminating irrelevant partitions decreases the search area, one-dimensional projection within a partition often enlarges the scope of the search. In the rest of the paper, the variant of Gamma partitioning with live regions and the B+-tree is called GammaSLB.
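The Gamma partitioning described above can be illustrated with a small sketch. It assumes a normalized space [0, 1]^d and hypercube generators given by their side lengths. The rule used to pick a region inside a shell (the lowest-numbered dimension whose coordinate exceeds the inner generator) is an assumption made here for illustration; the text above only states that regions are rectangular and ordered by generator and dimension. The function names are hypothetical.

```python
from bisect import bisect_left


def gamma_region(point, generators):
    """Return (shell, dim) for a point in the normalized space [0, 1]^d.

    generators: increasing side lengths g_1 < ... < g_m of nested hypercubes
    anchored at the origin, with g_m = 1 covering the whole space.
    ASSUMPTION: inside a shell, the region is chosen by the lowest-numbered
    dimension whose coordinate exceeds the inner generator.
    """
    shell = bisect_left(generators, max(point))  # innermost generator holding the point
    if shell == 0:
        return (0, 0)  # the innermost generator forms a single region
    inner = generators[shell - 1]
    dim = next(j for j, x in enumerate(point) if x > inner)
    return (shell, dim)


def region_number(point, generators):
    """Flatten (shell, dim) into a partition number in 0 .. (m - 1) * d."""
    shell, dim = gamma_region(point, generators)
    d = len(point)
    return 0 if shell == 0 else 1 + (shell - 1) * d + dim


def max_regions(m, d):
    """Upper bound on the number of Gamma partitions: 1 + (m - 1) * d."""
    return 1 + (m - 1) * d
```

With 3 generators in a 2-dimensional space, `max_regions(3, 2)` gives 5, which agrees with the example of Figure 1e. Under the assumed rule, each region is a hyper-rectangle, so it can be bounded by a live region as described above.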
3. Superimposed Gamma-KDB Partitioning
Dynamic partitioning schemes do not split all axes of a high-dimensional space. This causes the partitions to stretch over the entire ranges of possible values along some dimensions. As a consequence, the search space of the given query is artificially expanded [2, 12]. This problem, which has been identified as "unused information" [12], leads to a degradation of retrieval performance in high-dimensional situations. By dividing the space on all dimensions (see Figures 1d and 1e) and restricting the search only to relevant regions, static partitioning schemes solve this particular problem. But the linear transformation associated with these schemes introduces problems that sometimes outweigh the benefits. Another problem is that the partitions cannot be adjusted to dynamic data distributions.
By combining static and dynamic partitioning, one can overcome the limitations of monolithic partitioning schemes and improve the performance of multi-dimensional indexing, especially in high-dimensional situations. Our superimposed partitioning combines static Gamma partitioning with dynamic KDB-tree partitioning. We named this concept GammaSLK (Gamma space-partitioning with live regions and KDB-trees). As we will see, this general idea can be implemented in different ways.
We chose Gamma as the static partitioning scheme because it has many advantages over the Pyramid Technique. For example, since the number and size of Gamma partitions can be tuned to the actual data distribution, the Gamma strategy is more flexible than the Pyramid Technique. Since individual partitions have a hyper-rectangular shape, they can deal with dead space in a simple way and can be easily coupled with dynamic partitioning. The selection of the dynamic structure was based on several factors, including the fact that KDB-trees avoid the problem of region overlap. R-trees result in a potentially significant reduction of dead space, but the superimposed Gamma partitioning already deals with that problem. The following paragraphs discuss different variants of GammaSLK.
The simplest way to combine static and dynamic partitioning is to assign a separate structure to each. In this variant, called GammaSLK1, Gamma partitioning is used to pre-configure the space and to maintain information about live regions. The KDB-tree serves as the data storage for points and their IDs. During the data load, each point is first "probed" against the Gamma partitions in order to enlarge the appropriate live region, if necessary. The point is then inserted into the KDB-tree using the regular insertion procedure without any alteration. While Gamma partitions remain constant when loading the data, the hierarchy of KDB-tree partitions changes dynamically. As shown in Figure 2a, a region query is first clipped against all live regions that overlap the query. The individual sub-queries are then processed against the KDB-tree. By avoiding the search of the dead space, this query pre-processing can improve the retrieval performance. The main advantages of GammaSLK1 are the implementation simplicity and portability.
The approach can be used with any kind of dynamic partitioning, and any multi-dimensional access method can be reused without alteration. The improved performance comes from clipping the queries against live portions of static partitions, avoiding the search over potentially large amounts of dead space. Since Gamma partitioning divides the space on all dimensions, even this simple combination of static and dynamic partitioning effectively deals with unused information. Even though the sub-queries can generate potentially high access to the same pages, when coupled with an appropriate page-caching scheme, this variant of GammaSLK is likely to outperform the basic KDB-tree structure. Unfortunately, since there exists virtually no correlation between Gamma partitions and the data pages of the underlying KDB-tree index, the points of the same Gamma region could be scattered across many data pages. Thus, GammaSLK1 may not always compare favorably against GammaSLB or the Pyramid Technique.
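The live-region bookkeeping and query clipping of GammaSLK1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class `LiveRegion` and the function `sub_queries` are hypothetical names, and the dynamic index that would process each sub-query is left out.

```python
class LiveRegion:
    """Minimum bounding rectangle of the points seen in one static partition."""

    def __init__(self):
        self.lo = None  # lower corner; None until the first point arrives
        self.hi = None  # upper corner

    def extend(self, point):
        """Enlarge the live region, if necessary, so that it covers the point."""
        if self.lo is None:
            self.lo, self.hi = list(point), list(point)
        else:
            self.lo = [min(a, x) for a, x in zip(self.lo, point)]
            self.hi = [max(b, x) for b, x in zip(self.hi, point)]

    def clip(self, q_lo, q_hi):
        """Return the part of the window [q_lo, q_hi] inside the live region, or None."""
        if self.lo is None:
            return None  # empty partition: nothing to search
        lo = [max(a, x) for a, x in zip(self.lo, q_lo)]
        hi = [min(b, x) for b, x in zip(self.hi, q_hi)]
        return (lo, hi) if all(l <= h for l, h in zip(lo, hi)) else None


def sub_queries(live_regions, q_lo, q_hi):
    """Clip a region query against every live region that overlaps it."""
    clips = (region.clip(q_lo, q_hi) for region in live_regions)
    return [c for c in clips if c is not None]
```

Each returned sub-query would then be passed unchanged to the underlying dynamic index; a query that overlaps no live region produces an empty list and touches no pages at all.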
Figure 2. Illustration of the three variants of Gamma-KDB partitioning: a) GammaSLK1 (Gamma with live regions and a KDB-tree), b) GammaSLK2 (Gamma with live regions and a KDB-tree with an extra dimension), c) GammaSLK3 (Gamma with live regions and multiple KDB-trees).
GammaSLK1 reduces the scope of the search, but it does not alter the static or dynamic structure. Our next variant, called GammaSLK2, increases the dimensionality of the space with an extra dimension. As depicted in Figure 2b, during the data load, every point is augmented with an additional value, representing the number of the Gamma partition containing the point. Since this extra dimension is subject to splitting just like other dimensions, the static partitioning now has a direct effect on the way data is stored in the underlying KDB-tree. The KDB-tree splitting of data pages selects the split dimension from the ordered set of coordinates [14]. In our implementation, the extra value is the first dimension, making it the most likely axis to be split. Except for the extra dimension, GammaSLK2 is the same as GammaSLK1 in every other aspect.
Through live regions and by grouping points in the KDB-tree pages according to the Gamma partitions they belong to, GammaSLK2 improves the search performance. These performance improvements are achieved without altering the original KDB-tree structure and its associated algorithms. Unfortunately, since KDB-tree partitions include dead space and the structure must be searched for each sub-query from the root down, the same pages could be accessed many times, generating potentially many duplicate page accesses. Another problem is that the extra dimension increases the size of the KDB-tree index.
The highest degree of integration between static and dynamic partitioning is achieved by the nested design of GammaSLK3, which is illustrated in Figure 2c. In this design, each Gamma partition is assigned its own KDB-tree index. In essence, each Gamma partition represents the "root" of a hierarchy of dynamic index regions. All KDB-tree indices are stored in a single file. With respect to GammaSLB, the insertion procedure must be changed so that each new point can be dynamically inserted into the KDB-tree of the Gamma partition that contains the point. The search procedure starts by clipping the query window against each live region that overlaps the query. For each of these clips of the original query, the procedure must perform a region search through the corresponding KDB-tree index.
While GammaSLK3 distinguishes between static and dynamic partitioning, the resulting structure has one integrated design. No separate structure is required to keep live regions and KDB-tree partitions; all these parts can be stored at various levels of a single integrated tree. The top pages of the tree contain information about live regions along with pointers to the roots of sub-trees. Each sub-tree represents a standard KDB-tree hierarchy of directory and data pages. An important advantage of this variant of GammaSLK is its static pre-clustering of data on the storage according to the Gamma partitions. Another advantage is that the query, despite its break-up into multiple sub-queries, never visits the same page twice.
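The nested design of GammaSLK3 can be sketched in a few lines. This is only an illustration under simplifying assumptions: a plain list of points stands in for each per-partition KDB-tree, the mapping from points to static partitions is supplied by the caller, and the class name `GammaSLK3Sketch` is hypothetical.

```python
class GammaSLK3Sketch:
    """One dynamic sub-index per static partition, each guarded by a live region."""

    def __init__(self, partition_of):
        self.partition_of = partition_of  # maps a point to its static partition number
        self.trees = {}                   # partition number -> (lo, hi, points)

    def insert(self, point):
        key = self.partition_of(point)
        if key not in self.trees:
            self.trees[key] = (list(point), list(point), [tuple(point)])
            return
        lo, hi, points = self.trees[key]
        for i, x in enumerate(point):     # grow the live region if needed
            lo[i] = min(lo[i], x)
            hi[i] = max(hi[i], x)
        points.append(tuple(point))       # stand-in for a KDB-tree insertion

    def search(self, q_lo, q_hi):
        result = []
        for lo, hi, points in self.trees.values():
            c_lo = [max(a, b) for a, b in zip(lo, q_lo)]
            c_hi = [min(a, b) for a, b in zip(hi, q_hi)]
            if any(a > b for a, b in zip(c_lo, c_hi)):
                continue                  # query misses this live region: skip the sub-tree
            # Stand-in for a KDB-tree region search with the clipped window.
            result += [p for p in points
                       if all(a <= x <= b for a, x, b in zip(c_lo, p, c_hi))]
        return result
```

Because every sub-query is confined to its own sub-index, no storage unit is shared between sub-queries, which mirrors the property that a query never visits the same page twice.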
4. Experimental Results
A comprehensive set of experiments was performed to compare the three variants of GammaSLK against the Pyramid Technique, GammaSLB, and regular KDB-trees. The structures were loaded with data sets of varying volumes, dimensionalities and distributions, and then searched with different types of region queries. We measured both the number of pages accessed per query and the total execution time for all queries in an experiment. Not surprisingly, we found a strong correlation between the numbers of page accesses and the corresponding retrieval times. While page accesses are portable indicators of performance, the timings are heavily dependent on the hardware and operating system. For the experiments, we used a standard PC configuration with a 2.8 GHz Pentium-4 processor and the Microsoft Windows XP operating system. We had 512K of main memory on a 533 MHz bus, and 60 GB of secondary storage with 10 ms access time. The machine did not have any special hardware caching.
Our KDB-tree implementation splits a data page along the longest side of the corresponding index region, producing two pages with approximately the same numbers of points. In all structures, the page size was 8K bytes. Each point had a 2-byte coordinate for every dimension and a unique 4-byte point ID. The points stored in a B+-tree index also had an additional 4-byte key value, containing the number of the corresponding partition and the projected distance within the partition. The correctness of the B+-tree and KDB-tree implementations was verified through an extensive set of checking and counting routines. In the experiments, the static Gamma partitioning with Gamma regions of equal size was obtained using 10 generators. However, in some experiments that are not reported here, we observed that the performance of GammaSLK3 is somewhat better with fewer than 10 generators.
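The page-split rule stated above (split along the longest side of the index region, leaving roughly equal numbers of points on each side) can be sketched as follows. The function name `split_page` is hypothetical, and the sketch assumes that the split position is taken at the median point; it does not handle several points sharing the split coordinate.

```python
def split_page(points, region_lo, region_hi):
    """Split an overfull data page along the longest side of its index region.

    Returns two (points, (lo, hi)) pairs with about the same number of points.
    ASSUMPTION: the split position is the coordinate of the median point.
    """
    d = len(region_lo)
    dim = max(range(d), key=lambda i: region_hi[i] - region_lo[i])  # longest side
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    split_value = ordered[mid][dim]

    left_hi = list(region_hi)
    left_hi[dim] = split_value
    right_lo = list(region_lo)
    right_lo[dim] = split_value
    left = (ordered[:mid], (list(region_lo), left_hi))
    right = (ordered[mid:], (right_lo, list(region_hi)))
    return left, right
```

Note that the two resulting regions still cover the whole original region, so any dead space inside it is carried over to the new pages.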
4.1. Synthetic Data
The first set of experiments involved four synthetic data sets with different data distributions (center, corner, edge, and wall), each with 1,000,000 points and dimensionality between 2 and 50. To describe the data distributions, we assume a normalized space [0, 1]^d. In this space, a corner is a point with each coordinate equal to 0 or 1; an edge is a group of all points with an arbitrary subset of d-1 coordinates equal to 0 or 1; and a wall is a set of all points that have one particular coordinate equal to 0 or 1.
For the center distribution, the randomly selected points of each space were normalized to fit within a hypercube centered in the middle of the space, whose each side was 50% of the linear measure on every dimension. In a high-dimensional space, this hypercube represented only a tiny fraction of the entire space. In the corner distribution, a hypercube whose each side was 25% of the linear measure was placed in a randomly selected corner of the space. In the edge distribution, points were located within 25% in linear measure from the walls adjacent to a randomly selected edge of the space. In this distribution, the sub-space with data did not have the shape of a hypercube. In the fourth set of data with wall distribution, the points were placed in a strip of 10% linear measure along a randomly selected wall of the space.
In each space, exactly 1,000 randomly selected query windows were focused only in the areas where data appears. In other words, the center of each query was located within the same part of the space as the points. The length of every query window along each dimension was initially fixed to 15% linear extension. However, the linear query ranges that extended beyond the boundaries of the space were simply clipped. While these experimental scenarios may seem contrived, they are actually very instructive from the research perspective. Since both data and queries were focused in
fairly small portions of the space, these scenarios represent optimal conditions for dynamic partitioning schemes. Under these conditions, one can observe the potentially negative impact of static partitioning in our superimposed designs.
Figure 3 shows the results measured as the average number of unique page accesses per query (all accesses to the same page were counted as one page access). As one can see, monolithic static partitioning typically does not perform well under these extreme conditions. In these scenarios, the locations of data and queries had little effect on the performance of KDB-trees, whose dynamic partitioning was able to adjust to any distribution. But the most important observation from this set of experiments is that our superimposed designs were able to closely match the performance of KDB-trees for all data distributions.
The results clearly show that, even under the best conditions for dynamic partitioning, the GammaSLK variants are not inferior to dynamic partitioning. Since we measured here only unique page accesses, for GammaSLK1 and GammaSLK2, this observation is conditioned on the presence of a good page-caching facility. This conclusion established, we will further investigate the positive impact of static pre-partitioning in our superimposed designs.
Figure 3. Simulated data and queries with 15% linear extensions. Panels: data in center (50% linear range), data in corner (25% linear range), data on edge (25% linear range), data on wall (10% linear range). Each panel plots pages accessed against the number of dimensions (2, 3, 5, 10, 25, 50) for PYR, KDB, SLB, SLK1, SLK2, and SLK3.
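The center, corner, and wall distributions of Section 4.1 can be imitated with a short generator. This is a sketch under assumptions: points are drawn uniformly inside the occupied sub-space, which the text does not state explicitly, the edge distribution is omitted, and all function names are hypothetical.

```python
import random


def center_point(d, side=0.5, rng=random):
    """Point in a hypercube of the given side centered in [0, 1]^d."""
    lo = (1.0 - side) / 2.0
    return [lo + side * rng.random() for _ in range(d)]


def corner_point(corner, side=0.25, rng=random):
    """Point in a hypercube of the given side placed in a corner.

    corner: one 0/1 value per dimension selecting the corner of the space.
    """
    return [c * (1.0 - side) + side * rng.random() for c in corner]


def wall_point(d, dim, at, width=0.10, rng=random):
    """Point in a strip of the given width along the wall x[dim] = at (0 or 1)."""
    p = [rng.random() for _ in range(d)]
    p[dim] = at * (1.0 - width) + width * rng.random()
    return p


def occupied_fraction(side, d):
    """Fraction of the unit space covered by a hypercube with the given side."""
    return side ** d
```

For the center distribution in 50 dimensions, `occupied_fraction(0.5, 50)` is below 1e-15, which illustrates why the occupied hypercube is only a tiny fraction of a high-dimensional space.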
4.2. Real Data
While synthetic data can help draw isolated conclusions, the core of our evaluation is based on the experiments with a set of real high-dimensional data. In these experiments, we loaded the structures with varying subsets of data, and measured the performance of each structure for different types of queries in terms of the structure size, load time, unique and non-unique page accesses per query, as well as the total execution time for the given set of queries. Since the Pyramid Technique was by far the worst performer in these experiments, it is not included in the figures below.
The data set was generated from a database of a local company. Using some order-preserving transformations, 1,000,000 records from the database, with attributes such as name, address and phone number, were transformed into 25-dimensional points, whose coordinates were normalized to floating point numbers between 0 and 1. Since many attributes had only a few distinct values, the data set was heavily skewed, distributed across many distant clusters.
Each structure loaded with a subset of data was searched with two different sets of 1,000 window queries, uniformly distributed in the space. The queries were constructed to test the efficiency of the structures under two extreme conditions. The first set consisted of very large queries, each with a fixed volume of 1% of the entire space constructed around a randomly selected center. In the second set, a window query also had a randomly selected center and an initial volume of 1% of the total space. However, since the extension of each side of the query was multiplied by a random value between 0 and 1, these queries were so small that they usually retrieved no matching points.
Figures 4a and 4b show the number of pages and the time required to load each structure with 1,000,000 points, respectively. Since 8K-byte pages can hold up to 151 points of 25-dimensional data with 2 bytes per dimension and 4-byte point IDs, the minimum number of data pages required to store this set is 6,622. Since all structures had between 12,000 and 14,000 pages at all levels of the index tree, the overall storage utilization was between 45% and 55%.
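The page-capacity arithmetic above can be checked with a few lines. The sketch assumes that an 8K-byte page is 8,192 bytes with no space reserved for a page header, and it treats all pages of a structure as data pages when estimating utilization; both are simplifying assumptions, and the function names are hypothetical.

```python
import math

PAGE_BYTES = 8 * 1024  # ASSUMPTION: 8K bytes = 8,192 bytes, no page header


def points_per_page(d, coord_bytes=2, id_bytes=4, page_bytes=PAGE_BYTES):
    """Number of d-dimensional points, with their IDs, that fit in one page."""
    return page_bytes // (d * coord_bytes + id_bytes)


def min_data_pages(n_points, d):
    """Fewest pages that can hold n_points when every page is completely full."""
    return math.ceil(n_points / points_per_page(d))


def utilization(n_points, d, pages_used):
    """Fraction of the point slots in pages_used that are actually occupied."""
    return n_points / (pages_used * points_per_page(d))
```

Under these assumptions a page holds 151 of the 25-dimensional points, and the resulting page count lands within one page of the figure quoted above, depending on whether the division is rounded up or down. The utilization estimates for 12,000 and 14,000 pages fall in the reported range.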
An interesting observation here is that the static pre-partitioning of our superimposed designs did not adversely affect page utilization. Mainly because every point in the B+-tree index had an additional 4-byte key value, the B+-tree index of GammaSLB had more pages than the other structures. Nevertheless, since the splits of B+-tree pages are much faster than those of KDB-tree pages, it took less time to load GammaSLB than the other structures.
Perhaps the most representative indicator of search performance is the average number of unique page accesses per query, shown in Figures 4c and 4d. Comparing these with the results for synthetic data (recall Figure 3), we can observe here dramatic changes in the relative performance of the five structures. While KDB-trees generally outperformed other structures in the experiments with synthetic data, they were the worst performer in the experiments with real data (the Pyramid Technique, which is not displayed in Figures 4 and 5, had by far the worst performance of all).
A closer look reveals that the performance of KDB-trees changed very little, but the other structures had significantly fewer unique page accesses than before. Since the real data was distributed across many clusters in a very sparse space, this was mainly due to the effects of live regions. Since live regions were also small, for many small queries, the Gamma indexing techniques did not have to access any index page. Having nothing comparable to live regions, KDB-trees could not avoid searching the dead space. In the experiments with synthetic data, the restricted locations of queries made the live regions of Gamma indexing techniques ineffective, which favored the KDB-tree structure.
Figure 4. Size, time to load, and number of unique page accesses per query for real data. Panels: (a) total number of pages, (b) time to load in seconds, (c) unique pages for the fixed query window (1% volume), (d) unique pages for the variable query window (maximum 1% volume). Each panel plots SLB, KDB, SLK1, SLK2, and SLK3 against the number of points (250,000 to 1,000,000).
Figure 5. Total page accesses per query and total query time for real data. Panels: (a) non-unique pages for the fixed query window (1% volume), (b) non-unique pages for the variable query window (maximum 1% volume), (c) seconds for the fixed query window, (d) seconds for the variable query window. Each panel plots SLB, KDB, SLK1, SLK2, and SLK3 against the number of points (250,000 to 1,000,000).
While GammaSLK1 and GammaSLK2 show good performance in terms of the unique page accesses, as one can see from Figures 5a and 5b, for large queries, they generate large numbers of total (non-unique) page accesses per query. (Due to the grouping of data in the index according to Gamma partitions, GammaSLK2 outperformed GammaSLK1.) In contrast, since KDB-trees, GammaSLB, and GammaSLK3 do not generate duplicate accesses to the same pages, they generate the same number of unique and non-unique page accesses. The improvements of GammaSLK3 over GammaSLB (about an order of magnitude for small queries) were due to its method of clustering data on the storage, which takes into account multiple dimensions of each Gamma partition.
Figures 5c and 5d show the total retrieval times for large and small query windows, respectively, which have a strong correlation with the corresponding results for non-unique page accesses. A small exception was the reversal of the relative performance of KDB-trees and GammaSLK2, which was due to operating-system (file-system) caching
that made duplicate accesses of GammaSLK2 to the same pages somewhat faster than the first accesses to the pages.
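The distinction between unique and total (non-unique) page accesses can be illustrated with a minimal sketch. The access trace below is hypothetical rather than measured data, and the function name `count_page_accesses` is an assumption introduced only for illustration.

```python
def count_page_accesses(trace):
    """Return (total, unique) page accesses for one query.

    `trace` is a hypothetical list of page identifiers in the order
    they were read while answering a single query.
    """
    total = len(trace)         # every read counts, including repeats
    unique = len(set(trace))   # each distinct page counts once
    return total, unique


# Hypothetical trace: page 7 is read three times, page 3 twice.
trace = [7, 3, 7, 9, 3, 7]
total, unique = count_page_accesses(trace)
print(total, unique)
```

A structure that never revisits a page produces equal counts, which is the behavior the text reports for KDB-trees, GammaSLB, and GammaSLK3.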
5. Summary
It is well known that contemporary multi-dimensional access methods do not scale well to high data dimensionalities. Many limitations of these techniques are due to their space-partitioning schemes. In this paper, we developed a family of indexing techniques with superimposed partitioning schemes that outperform the original structures. The three variants of the superimposed partitioning scheme, called GammaSLK, were evaluated through a comprehensive set of experiments on simulated and real data with various distributions and dimensionalities, and for different types of queries.
The static pre-partition of the space using Gamma partitioning divides every axis multiple times, so that each dimension can effectively contribute to the search process. Since Gamma partitions (regions) are static and rectangular, they enable dynamic maintenance of live regions as an effective way of dealing with the problem of dead space. Storing the points of each Gamma region in a separate KDB-tree, as in GammaSLK3, enables a dynamic partitioning scheme that adapts gracefully to changes in the data distribution. The experimental evidence clearly demonstrates the superiority of GammaSLK3. However, in environments where portability and implementation simplicity are important concerns, GammaSLK1 and GammaSLK2 can be attractive alternatives, provided they are coupled with a good page-caching facility.
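The role of live regions in avoiding dead space can be sketched as follows. This is a minimal illustration, not the implementation evaluated above: it assumes that a live region is the tight bounding box of the points currently stored in one static region, and the class name `LiveRegion` and its methods are hypothetical.

```python
class LiveRegion:
    """Tight bounding box of the points stored in one static region.

    Assumption: this models a live region as a minimum bounding
    rectangle that only grows as points are inserted.
    """

    def __init__(self, dims):
        # An empty box: low > high in every dimension.
        self.low = [float("inf")] * dims
        self.high = [float("-inf")] * dims

    def insert(self, point):
        # Grow the box just enough to enclose the new point.
        for i, x in enumerate(point):
            self.low[i] = min(self.low[i], x)
            self.high[i] = max(self.high[i], x)

    def intersects(self, q_low, q_high):
        # A window query can skip this region when it misses the box.
        # An empty box never intersects because low > high.
        return all(
            lo <= qh and ql <= hi
            for lo, hi, ql, qh in zip(self.low, self.high, q_low, q_high)
        )


region = LiveRegion(2)
region.insert((0.20, 0.30))
region.insert((0.25, 0.35))
print(region.intersects((0.0, 0.0), (0.1, 0.1)))   # query in dead space
print(region.intersects((0.2, 0.3), (0.5, 0.5)))   # query overlapping data
```

In this sketch, a query that falls entirely in dead space is rejected by the bounding-box test alone, without touching any index or data page of the region.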
References
[1] N. Beckmann, H.P. Kriegel, R. Schneider and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 322–331, 1990.
[2] S. Berchtold, C. Bohm and H.-P. Kriegel. The Pyramid-Technique: Towards breaking the curse of dimensionality. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 142–153, 1998.
[3] S. Berchtold, D. A. Keim and H. P. Kriegel. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 28–39, 1996.
[4] C. Bohm, S. Berchtold and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys 33(3):322–373, 2001.
[5] K. Chakrabarti and S. Mehrotra. The Hybrid Tree: An index structure for high dimensional feature spaces. In Proceedings of the 15th International Conference on Data Engineering, pp. 440–447, 1999.
[6] R.H. Guting. An introduction to spatial database systems. VLDB Journal 3:357–399, 1994.
[7] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 47–54, 1984.
[8] F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel and Z. Protopapas. Fast nearest neighbor search in medical image databases. In Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 215–226, 1996.
[9] J. Nievergelt and H. Hinterberger. The Grid File: An adaptive, symmetric multikey file structure. ACM Transactions on Database Systems 9(1):38–71, 1984.
[10] B. C. Ooi, K.-L. Tan, C. Yu and S. Bressan. Indexing the edges - A simple and yet efficient approach to high-dimensional indexing. In Proceedings of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems PODS’2000, pp. 166–174, 2000.
[11] R. Orlandic, J. Lukaszuk and C. Swietlik. The design of a retrieval technique for high-dimensional data on tertiary storage. SIGMOD Record 31(2):15–21, 2002.
[12] R. Orlandic and B. Yu. A retrieval technique for high-dimensional data and partially specified queries. Data and Knowledge Engineering 42(1):1–21, 2002.
[13] G. Pass and R. Zabih. Histogram refinement for content-based image retrieval. In Proceedings of the 3rd IEEE Workshop on Applications of Computer Vision, pp. 96–102, 1996.
[14] J. T. Robinson. The K-D-B Tree: A search structure for large multidimensional dynamic indexes. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 10–18, 1981.
[15] Y. Sakurai, M. Yoshikawa, S. Uemura and H. Kojima. The A-tree: An index structure for high-dimensional spaces using relative approximation. In Proceedings of the 26th International Conference on Very Large Data Bases, pp. 516–526, 2000.
[16] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys 34(1):1–47, 2002.
[17] B. Seeger and H.P. Kriegel. The Buddy-tree: An efficient and robust access method for spatial data base systems. In Proceedings of the 16th International Conference on Very Large Data Bases, pp. 590–601, 1990.
[18] D.F. Sutter and B.P. Strauss. Next generation high energy physics colliders: technical challenges and prospects. IEEE Trans. on Applied Superconductivity 10(1):33–43, 2000.
[19] H.E. Williams and J. Zobel. Indexing and retrieval for genomic databases. IEEE Trans. on Knowledge and Data Engineering 14(1):63–78, 2002.
[20] Y. Wang, Z. Liu and J.C. Huang. Multimedia content analysis: using both audio and visual clues. IEEE Signal Processing Magazine 11:12–36, 2000.
[21] R. Weber, H.J. Schek and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases, pp. 194–205, 1998.