Empirical Evaluation of Excluded Middle Vantage Point Forest on Biological Sequences Workload

Weijia Xu
Texas Advanced Computing Center
The University of Texas at Austin

Lee Parnell Thompson
Department of Computer Sciences
The University of Texas at Austin

Daniel P Miranker
Department of Computer Sciences
The University of Texas at Austin
ABSTRACT We develop and evaluate a version of the excluded middle vantage point forest in support of range searches and load balancing for parallel queries. The algorithm is evaluated using a benchmark suite that includes real-world biological sequence workloads. Favorable results are demonstrated when comparing to the Multiple Vantage Point Tree and Spatial Approximation Tree algorithms with respect to sequential measures. We also demonstrate that the performance of this approach scales linearly up to at least 128 cores and outperforms a naïve distributed multiple vantage point forest approach when run in parallel.
Categories and Subject Descriptors H.3.1 [Content Analysis and Indexing]: Indexing methods
General Terms Algorithms
Keywords Metric space index, Exclusion
1. INTRODUCTION Partition-based indexing schemes are descended from binary search trees, which solve set membership queries in O(log n). The complexity result depends on the property that when a search descends the tree, the decision to search the right or left child of a node is mutually exclusive. Starting at least with Guttman's R-tree, partition-based multidimensional indexing methods include decision procedures that do not guarantee an exclusive decision [1]. A search may descend through multiple children of a node when covering predicates overlap or when, for range searches, the distance separating the partitions is less than twice the search radius. Subsequently, many indexing schemes have been developed whose performance is evaluated strictly empirically and depends on the workload [2]. We investigate the use of exclusion as a method to introduce parallelism and to improve the performance of partition-based indexing schemes for range queries in metric space indexing. The basic idea starts with building a conventional partition-based index tree on a data set. The data in the middle partition are removed such that the covering predicates are reduced in size, which eliminates overlap and even introduces gaps between them. The process is repeated recursively on the removed data, until further exclusion no longer provides an advantage. Search of the resulting forest of index trees can be done in parallel.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. NTSS 2011, Mar 25, 2011, Uppsala, Sweden. Copyright 2011 ACM 978-1-4503-0612-6/11/03 ...$10.00.

Herein we report on an empirical assessment of the use of exclusion in conjunction with distance-based indexing and range queries. Distance-based indexing assumes only that there is a set of data S and a metric-distance function d. A range query of x, q(x, S)

      m|D|
        leftIndex ← mid - m|D|/2
        rightIndex ← mid + m|D|/2
      end if
      E.add(D, leftIndex, rightIndex)
      this.left ← build(0, leftIndex, D)
      this.right ← build(rightIndex, |D|-1, D)
      left_min ← distance(D[0], pivot)
      left_max ← distance(D[leftIndex], pivot)
      right_min ← distance(D[rightIndex], pivot)
      right_max ← distance(D[|D|-1], pivot)
    end while
    this.data ← D

Figure 2 Pseudo code for building index with exclusion.
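The idea of building a tree, removing the middle partition at each node, and rebuilding on the removed data can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation evaluated in this paper: the names Node, build_tree, build_forest, and forest_range_search are hypothetical, the first point of each subset is taken as the pivot, a fixed fraction m of each node's data is excluded, and the forest is grown until nothing more is excluded rather than until exclusion stops providing an advantage. The trees are searched one after another here, although each tree could be searched independently.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional

Metric = Callable[[Any, Any], float]


@dataclass
class Node:
    pivot: Any = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    left_max: float = 0.0    # largest pivot distance kept in the left child
    right_min: float = 0.0   # smallest pivot distance kept in the right child
    bucket: List[Any] = field(default_factory=list)  # data stored at a leaf


def build_tree(points, dist: Metric, m: float, leaf_size: int, excluded: list) -> Node:
    """Build one tree; points in each node's middle partition go to `excluded`."""
    if len(points) <= leaf_size:
        return Node(bucket=list(points))
    pivot, rest = points[0], points[1:]          # assumption: first point is the pivot
    keyed = sorted(((dist(pivot, x), x) for x in rest), key=lambda t: t[0])
    n = len(keyed)
    width = int(m * n)                           # number of points to exclude here
    lo = n // 2 - width // 2
    hi = lo + width
    left, middle, right = keyed[:lo], keyed[lo:hi], keyed[hi:]
    if not left or not right:                    # too small to split: keep as a leaf
        return Node(bucket=list(points))
    excluded.extend(x for _, x in middle)
    return Node(
        pivot=pivot,
        left=build_tree([x for _, x in left], dist, m, leaf_size, excluded),
        right=build_tree([x for _, x in right], dist, m, leaf_size, excluded),
        left_max=left[-1][0],
        right_min=right[0][0],
    )


def build_forest(points, dist: Metric, m: float = 0.2, leaf_size: int = 8) -> List[Node]:
    """Repeat the build on the excluded data until nothing more is excluded."""
    forest, remaining = [], list(points)
    while remaining:
        excluded: list = []
        forest.append(build_tree(remaining, dist, m, leaf_size, excluded))
        remaining = excluded                     # strictly smaller on every pass
    return forest


def range_search(node: Node, q, r: float, dist: Metric, out: list) -> None:
    if node.left is None:                        # leaf: scan the bucket
        out.extend(x for x in node.bucket if dist(q, x) <= r)
        return
    dq = dist(q, node.pivot)
    if dq <= r:
        out.append(node.pivot)
    # Triangle inequality: a child is skipped when the query ball cannot reach it.
    if dq - r <= node.left_max:
        range_search(node.left, q, r, dist, out)
    if dq + r >= node.right_min:
        range_search(node.right, q, r, dist, out)


def forest_range_search(forest: List[Node], q, r: float, dist: Metric) -> list:
    out: list = []
    for tree in forest:                          # each tree could be searched in parallel
        range_search(tree, q, r, dist, out)
    return out
```

In this sketch both children of a node are visited only when right_min - left_max is at most 2r, so the gap opened by exclusion is what makes the descent exclusive for small radii.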
Figure 1 Illustration of three partitions with one pivot. The middle partition is the shaded area.

Figure 1 shows an example of pivot-based indexing on one pivot with three partitions defined by distance values r1 and r2. Given a query point q and a radius r, if d(q, p) + r < r1 or d(q, p) - r > r2, then only one partition needs to be searched. Otherwise, both partitions need to be searched. But, if the middle partition, the shaded area in Figure 1, can be excluded from the current tree, then all range searches with r2τ;2) rightIndex – leftIndex