Access Structures for Advanced Similarity Search in Metric Spaces

Vlastislav Dohnal¹, Claudio Gennaro², Pasquale Savino², and Pavel Zezula¹

¹ Masaryk University Brno, Czech Republic, {xdohnal, zezula}@fi.muni.cz
² ISTI-CNR Pisa, Italy, {gennaro, savino}@isti.pi.cnr.it
Abstract. Similarity retrieval is an important paradigm for searching in environments where exact match has little meaning. Moreover, in order to enlarge the set of data types for which similarity search can be performed efficiently, the notion of a mathematical metric space provides a useful abstraction of similarity. In this paper we consider the problem of organizing and searching large data-sets from arbitrary metric spaces and discuss a novel access structure for similarity search in metric data, called D-Index. The D-Index combines a novel clustering technique and the pivot-based distance searching strategy to speed up the execution of similarity range and nearest neighbor queries for large files with objects stored in disk memories. Moreover, we propose an extension of this access structure, the eD-Index, which is able to deal with the problem of similarity self join. Though this approach is not able to eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments.
1
Introduction
Similarity searching has become a fundamental computational task in a variety of application areas, including multimedia information retrieval, data mining, pattern recognition, machine learning, computer vision, genome databases, data compression, and statistical data analysis. This problem, originally studied mostly within the theoretical area of computational geometry, has recently been attracting more and more attention in the database community because of the growing need to deal with large, often distributed, volumes of data. Consequently, high performance has become an important feature of cutting-edge designs. In this article, we consider the problem from a broad perspective and assume the data to be objects from a metric space where only pair-wise distances between objects are known. We present the D-Index, a multi-level access structure with separable buckets on each level, which supports easy insertion, range search, and nearest neighbor search. Moreover, in order to complete the set of similarity
search operations, similarity joins are needed. For this purpose, an extension of the D-Index is introduced. Applications of the similarity join operation include data cleaning and copy detection, to name a few. The former is useful, for instance, in the development of Internet services, which often require an integration of heterogeneous sources of data. Such sources are typically unstructured, whereas the intended services often require structured data. In this case, the main challenge is to provide consistent and error-free data, which calls for data cleaning, typically implemented by a sort of similarity join. This work focuses on the similarity self join, useful for copy detection and data cleaning of large databases. The paper is organized as follows. In Section 2, we define the general principles of our access structures. The structure and algorithms of the D-Index are presented in Section 2.3, while the eD-Index is illustrated in Section 2.4. Section 3 briefly summarizes the experimental evaluation results. Finally, Appendix A formally presents the proposed algorithms. The article concludes in Section 4.
2 Access Structures and Search Algorithms

2.1 Search Strategies
A convenient way to assess the similarity between two objects is to apply a metric function which measures the closeness of the objects as a distance, i.e., as a measure of their dis-similarity. A metric space M = (D, d) is defined by a domain of objects (elements, points) D and a total (distance) function d – a non-negative (d(x, y) ≥ 0 with d(x, y) = 0 iff x = y) and symmetric (d(x, y) = d(y, x)) function which satisfies the triangle inequality (d(x, y) ≤ d(x, z) + d(z, y), ∀x, y, z ∈ D). Without any loss of generality, we assume that the maximum distance never exceeds d+. For a query object q ∈ D, two fundamental similarity queries can be defined. A range query retrieves all elements within distance r of q, that is the set {x ∈ X, d(q, x) ≤ r}. A k-nearest neighbor query retrieves the k closest elements to q, that is a set R ⊆ X such that |R| = k and ∀x ∈ R, y ∈ X − R, d(q, x) ≤ d(q, y). The similarity join is a search primitive which combines objects of two subsets of D into one set of pairs such that a similarity condition, defined in terms of the metric distance d, is satisfied. Formally, the similarity join X ⋈ Y between two finite sets X = {x_1, ..., x_N} and Y = {y_1, ..., y_M} (X ⊆ D and Y ⊆ D) is defined as the set of pairs X ⋈ Y = {(x_i, y_j) | d(x_i, y_j) ≤ µ}, where the threshold µ is a real number such that 0 ≤ µ ≤ d+. If the sets X and Y coincide, we talk about the similarity self join.
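To fix the semantics of the three query types, the following Python sketch evaluates them by brute force over an in-memory collection. The function dist and the variable data are placeholders for an arbitrary metric and data set; this is only a naive reference, not an efficient implementation.

import heapq

def range_query(data, dist, q, r):
    # all objects within distance r of the query object q
    return [x for x in data if dist(q, x) <= r]

def knn_query(data, dist, q, k):
    # the k objects closest to q (ties broken arbitrarily)
    return heapq.nsmallest(k, data, key=lambda x: dist(q, x))

def similarity_self_join(data, dist, mu):
    # all pairs (x_i, x_j), i < j, with d(x_i, x_j) <= mu
    return [(data[i], data[j])
            for i in range(len(data))
            for j in range(i + 1, len(data))
            if dist(data[i], data[j]) <= mu]

The self join sketch makes the intrinsic quadratic cost mentioned in the abstract explicit: every pair of objects is compared once.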
2.2 General Approach: Clustering Through Separable Partitioning
To achieve our objectives, we base our partitioning principles on a mapping function, which we call the ρ-split function, where ρ is a real number constrained as 0 ≤ ρ < d+ . In order to gradually explain the concept of ρ-split functions, we
first define a first order ρ-split function and its properties. More details about the mathematical specification of the ρ-split function are available in [4].

Definition 1. Given a metric space (D, d), a first order ρ-split function s^{1,ρ} is the mapping s^{1,ρ}: D → {0, 1, −} such that, for arbitrary different objects x, y ∈ D, s^{1,ρ}(x) = 0 ∧ s^{1,ρ}(y) = 1 ⇒ d(x, y) > 2ρ (separable property), and ρ2 ≥ ρ1 ∧ s^{1,ρ2}(x) ≠ − ∧ s^{1,ρ1}(y) = − ⇒ d(x, y) > ρ2 − ρ1 (symmetry property).

In other words, the ρ-split function assigns to each object of the space D one of the symbols 0, 1, or −. We can generalize the ρ-split function by concatenating n first order ρ-split functions with the purpose of obtaining a split function of order n.

Definition 2. Given n first order ρ-split functions s_1^{1,ρ}, ..., s_n^{1,ρ} in the metric space (D, d), a ρ-split function of order n, s^{n,ρ} = (s_1^{1,ρ}, s_2^{1,ρ}, ..., s_n^{1,ρ}): D → {0, 1, −}^n, is the mapping such that, for arbitrary different objects x, y ∈ D, ∀i s_i^{1,ρ}(x) ≠ − ∧ ∀j s_j^{1,ρ}(y) ≠ − ∧ s^{n,ρ}(x) ≠ s^{n,ρ}(y) ⇒ d(x, y) > 2ρ (separable property), and ρ2 ≥ ρ1 ∧ ∀i s_i^{1,ρ2}(x) ≠ − ∧ ∃j s_j^{1,ρ1}(y) = − ⇒ d(x, y) > ρ2 − ρ1 (symmetry property).
An obvious consequence of the ρ-split function definitions, useful for our purposes, is that by combining n ρ-split functions of order 1, s_1^{1,ρ}, ..., s_n^{1,ρ}, we obtain a ρ-split function of order n, s^{n,ρ}. We often refer to the number of bits generated by s^{n,ρ}, that is the parameter n, as the order of the ρ-split function. In order to obtain an addressing scheme, we need another function that transforms the ρ-split strings into integers, which we define as follows.

Definition 3. Given a string b = (b_1, ..., b_n) of n elements 0, 1, or −, the function ⟨·⟩: {0, 1, −}^n → [0..2^n] is specified as

  ⟨b⟩ = [b_1, b_2, ..., b_n]_2 = Σ_{j=1}^{n} 2^{j−1} b_j,  if ∀j b_j ≠ −,
  ⟨b⟩ = 2^n,  otherwise.

When all the elements are different from '−', the function ⟨b⟩ simply translates the string b into an integer by interpreting it as a binary number (which is always < 2^n); otherwise the function returns 2^n. By means of the ρ-split function and the ⟨·⟩ operator we can assign an integer number i (0 ≤ i ≤ 2^n) to each object x ∈ D, i.e., the function can group objects from X ⊂ D into 2^n + 1 disjoint sub-sets.

Though several different types of first order ρ-split functions are proposed, analyzed, and evaluated in [3], the ball partitioning split (bps), originally proposed in [6] under the name excluded middle partitioning, provided the smallest exclusion set. For this reason, we also apply this approach, which can be characterized as follows. The ball partitioning ρ-split function bps^ρ(x, x_v) uses one reference object x_v ∈ D and the medium distance d_m to partition the data file into three subsets (see Figure 1a). The objects mapped to the values 0 or 1 of the bps function form the separable partitions, and the objects mapped to '−' form the exclusion set:

  bps^{1,ρ}(x) = 0,  if d(x, x_v) ≤ d_m − ρ,
                 1,  if d(x, x_v) > d_m + ρ,      (1)
                 −,  otherwise.

Once we have defined a set of first order ρ-split functions, it is possible to combine them in order to obtain a function which generates more partitions. This idea is depicted in Figure 1b, where two bps-split functions on a two-dimensional space are used. The domain D, represented by the grey square, is divided into four regions corresponding to the separable partitions. The exclusion set is represented by the brighter region and is formed by the union of the exclusion sets resulting from the two splits.
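A minimal Python sketch of the ball partitioning split and of the ⟨·⟩ addressing operator is given below, assuming a metric dist and per-level lists of pivots and medium distances chosen elsewhere. The names bps and bucket_index are ours and only illustrate Definitions 1–3 and Equation (1).

def bps(x, pivot, d_m, rho, dist):
    # first order rho-split function: 0 and 1 are separable, '-' is the exclusion zone
    d = dist(x, pivot)
    if d <= d_m - rho:
        return 0
    if d > d_m + rho:
        return 1
    return '-'

def bucket_index(x, pivots, d_ms, rho, dist):
    # combine n first order splits and apply the <.> operator:
    # a plain binary number if no '-' occurs, 2**n otherwise (exclusion set)
    bits = [bps(x, p, dm, rho, dist) for p, dm in zip(pivots, d_ms)]
    if '-' in bits:
        return 2 ** len(bits)
    return sum(b << j for j, b in enumerate(bits))  # b_1 is the least significant bit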
2.3 D-Index
The basic idea of the D-Index [4] is to create a multilevel storage and retrieval structure that uses several ρ-split functions, one for each level, to create an array of buckets for storing objects. On the first level, we use a ρ-split function for separating objects of the whole data set. For any other level, objects mapped to the exclusion bucket of the previous level are the candidates for storage in the separable buckets of this level. Finally, the exclusion bucket of the last level forms the exclusion bucket of the whole D-Index structure. It is worth noting that the ρ-split functions of the individual levels use the same ρ. Moreover, the split functions can have different orders, typically decreasing with the level, allowing the D-Index structure to have levels with different numbers of buckets. More precisely, the D-Index structure can be defined as follows. From the structural point of view, the buckets are organized in the following two-dimensional array consisting of 1 + Σ_{i=1}^{h} 2^{m_i} elements:

  B_{1,0}, B_{1,1}, ..., B_{1,2^{m_1}−1}
  B_{2,0}, B_{2,1}, ..., B_{2,2^{m_2}−1}
  ...
  B_{h,0}, B_{h,1}, ..., B_{h,2^{m_h}−1}, E_h

All separable buckets are included, but only the E_h exclusion bucket is present – the exclusion buckets E_i of levels i < h are recursively re-partitioned on level i + 1. Objects are inserted by Algorithm A1 (see Appendix A): an object is stored in the separable bucket of the first level whose split function maps it to a value smaller than 2^{m_i}, otherwise it ends up in the exclusion bucket E_h. Range queries with radius r_q ≤ ρ access at most one separable bucket per level plus the exclusion bucket, but queries with r_q > ρ can also be executed. Indeed, queries with r_q > ρ can be executed by evaluating the split function s^{r_q−ρ}, i.e., the split function of the level with ρ replaced by r_q − ρ. In case s^{r_q−ρ}(q) returns a string without any '−', the result
is contained in a single bucket (namely B_{⟨s^{r_q−ρ}(q)⟩}) plus, possibly, the exclusion bucket. Let us now consider that the returned string contains at least one '−'. We indicate this string as (b_1, ..., b_n) with b_i ∈ {0, 1, −}. In case there is only one b_i = '−', we must access all buckets B whose index is obtained by substituting the '−' in s^{r_q−ρ}(q) with 0 and 1. In the most general case we must substitute in (b_1, ..., b_n) all the '−' with zeros and ones and generate all possible combinations. In order to define an algorithm for this process, we need some additional terms and notation.

Definition 4. We define an extended exclusive OR bit operator, ⊗, based on the following truth table:

  bit_1: 0 0 0 1 1 1 − − −
  bit_2: 0 1 − 0 1 − 0 1 −
  ⊗:     0 1 0 1 0 0 0 0 0

Note that the operator ⊗ can be used bit-wise on two strings of the same length and that it always returns a standard binary number (i.e., it does not contain any '−'). Consequently, ⟨s_1 ⊗ s_2⟩ < 2^n is always true for strings s_1 and s_2 of length n (see Definition 3).

Definition 5. Given a string s of length n, G(s) denotes the sub-set of Ω_n = {0, 1, ..., 2^n − 1} obtained by eliminating all elements e_i ∈ Ω_n for which s ⊗ e_i ≠ 0 (interpreting e_i as a binary string of length n).

Observe that G(− − ... −) = Ω_n, and that the cardinality of G(s) is two to the power of the number of '−' elements in s; in effect, G(s) generates all partition identifications in which the symbol '−' is alternatively substituted by zeros and ones.

Given a query region Q = R(q, r_q), with q ∈ D and r_q ≤ d+, the advanced similarity range query can be executed following Algorithm A2. The task of the nearest neighbor search is to retrieve the k closest elements of X to q ∈ D, with respect to the metric space M. For convenience, we designate the distance to the k-th nearest neighbor as d_k (d_k ≤ d+). Due to the lack of space, we do not comment on this algorithm in detail; the general strategy of Algorithm A3 works as follows. The algorithm starts with an optimistic strategy assuming that the k-th nearest neighbor is at distance at most ρ. If this assumption fails, additional search steps are performed to find the correct result.
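The set G(s) of Definition 5 can also be generated directly by substituting zeros and ones for every '−'. The sketch below tests compatibility with the bit-wise ⊗ operator, purely as an illustration of Definitions 4 and 5; the helper names are ours.

def xor_ext(b1, b2):
    # extended exclusive OR of Definition 4: '-' always yields 0
    if b1 == '-' or b2 == '-':
        return 0
    return b1 ^ b2

def G(s):
    # indices of all buckets compatible with the string s (Definition 5)
    n = len(s)
    result = []
    for e in range(2 ** n):
        bits = [(e >> j) & 1 for j in range(n)]   # b_1 is the least significant bit
        if all(xor_ext(sj, ej) == 0 for sj, ej in zip(s, bits)):
            result.append(e)
    return result

# e.g. G([0, '-', 1]) -> [4, 6], i.e. the single '-' is replaced by 0 and by 1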
2.4 eD-Index
The idea behind the eD-Index is to modify the insertion algorithm of the D-Index in such a way that the exclusion set overlaps with the separable partitions by a margin of width ε (see Figure 2). The objects which fall in the region where the exclusion set intersects a separable partition are replicated in both sets. This principle, called the overloading exclusion set, ensures that a similarity self join query with threshold µ ≤ ε cannot miss a qualifying pair of objects (x, y) whose members fall into different sets, i.e., bps(x) ≠ bps(y), because all objects of a separable set which can make a qualifying pair with an object of the exclusion set are copied to the exclusion set. In this way, the eD-Index speeds up the execution of similarity self join queries up to a predefined value ε of the threshold µ, and the join can be evaluated apart in each separable bucket.
[Figure omitted: both panels show the ball partitioning with medium distance d_m, the separable partitions, and the exclusion set of width 2ρ; panel (b) additionally shows the overlap of width ε between the separable partitions and the exclusion set.]

Fig. 2. The difference between D-Index (a) and eD-Index (b).
eD-Index Insertion. Insertion proceeds according to Algorithm A4. The algorithm is similar to the one for insertion into the D-Index (A1). The difference is that the eD-Index uses both split functions s^{m_i,ρ} and s^{m_i,ρ+ε}, which allows us to replicate all the objects which fall in the overlapping region: an object separable with respect to ρ but not with respect to ρ + ε is stored in its separable bucket and also passed on to the following levels, eventually reaching the exclusion bucket.

eD-Index Similarity Self Join. The outline of the similarity self join algorithm is the following: execute the join query independently on every separable bucket of every level of the eD-Index and, additionally, on the exclusion bucket of the whole structure. This behaviour is correct due to the overloading exclusion set principle, which copies every object of a separable set that can form a qualifying pair with an object of the exclusion set to the exclusion set. In this respect, the join query elaboration can be done independently in every bucket and the results of the individual join subqueries form the result of the given join query. Algorithm A5 describes this idea.
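The following Python sketch illustrates the replication performed by the overloaded insertion, i.e., the structure of Algorithm A4. The callable split_index stands for the value ⟨s^{m_i,ρ}(x)⟩ of a level and is assumed to be provided by the structure, so this is only a schematic rendering under our own naming, not the authors' implementation.

def ed_index_insert(x, levels, rho, eps, exclusion_bucket):
    # levels: list of (split_index, buckets) per level, where split_index(x, r)
    # returns <s^{m_i,r}(x)> and buckets is the list of 2**m_i separable buckets
    for split_index, buckets in levels:
        if split_index(x, rho + eps) < len(buckets):
            # x is separable even with the enlarged exclusion zone: store it and stop
            buckets[split_index(x, rho)].append(x)
            return
        if split_index(x, rho) < len(buckets):
            # x lies in the overlap of width eps: store a copy here and keep descending,
            # so that it is replicated in the exclusion set of the structure
            buckets[split_index(x, rho)].append(x)
    exclusion_bucket.append(x)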
[Figure omitted: the window [o_lo, o_up] slides over the objects of a bucket ordered by the distance d(p, o_j) to the pivot p; only pairs within distance µ are examined.]

Fig. 3. The Sliding Window algorithm.
The procedure SimJoin(B_{i,j}, µ) executes the similarity self join within one bucket. It is possible to improve the algorithm by exploiting the pivots during the SimJoin elaboration. This idea is based on the sliding window algorithm (see Algorithm A6): the objects of a bucket are ordered with respect to a pivot p, which is the reference object of a ρ-split function used by the eD-Index, and we define a sliding window of objects [o_lo, o_up]. This window is moved through the whole ordered set of the bucket in order to find all qualifying pairs (see Figure 3).
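A possible rendering of the sliding window idea in Python is sketched below. It orders the bucket by the distance to the pivot p and uses that ordering, via the triangle inequality, to prune pairs; it mirrors Algorithm A6 but omits the additional PivotCheck filtering, and the function name is ours.

def sliding_window_join(bucket, dist, p, mu):
    # order the bucket by distance to the pivot p
    objs = sorted(bucket, key=lambda o: dist(o, p))
    dists = [dist(o, p) for o in objs]
    result = []
    lo = 0
    for up in range(1, len(objs)):
        # advance the lower bound: if d(o_up, p) - d(o_lo, p) > mu, the triangle
        # inequality guarantees d(o_lo, o_up) > mu, so the pair can be skipped
        while dists[up] - dists[lo] > mu:
            lo += 1
        for j in range(lo, up):
            if dist(objs[j], objs[up]) <= mu:
                result.append((objs[j], objs[up]))
    return result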
3
Performance Evaluation and Comparison
We have implemented the D-Index and the eD-Index and conducted numerous experiments to verify their properties on two metric data sets. The first data set (STR) consisted of sentences of a Czech language corpus compared by the edit distance measure, the so-called Levenshtein distance [5]. The most frequent distance was around 100 and the longest distance was 500, equal to the length of the longest sentence. The second data set (VEC) was composed of 45-dimensional vectors of color features extracted from images. Vectors were compared by the quadratic form distance measure. The distance distribution of this data set was practically a normal distribution with the most frequent distance equal to 4,100 and the maximum distance equal to 8,100.
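For reference, the edit (Levenshtein) distance used for the STR data set counts the minimum number of character insertions, deletions, and substitutions needed to transform one sentence into the other. A standard dynamic programming sketch (not the authors' code) is shown below.

def levenshtein(a, b):
    # dynamic programming over prefixes; O(len(a) * len(b)) time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1]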
3.1 D-Index Performance Evaluation
In all our experiments, the query objects are not chosen from the indexed data sets, but they follow the same distance distribution. The search costs are measured in terms of distance computations. All presented cost values are averages obtained by executing queries for 50 different query objects and constant search selectivity, that is, queries using the same search radius or the same number of nearest neighbors. We have considered about 11,000 objects for each of our data sets VEC and STR. We have compared the performance of the D-Index with other index structures under the same workload. In particular, we considered the M-tree³ [2] and the sequential organization, SEQ. Efficiency. The main objective of these experiments was to compare the similarity search efficiency of the D-Index with the other organizations. The results for the range search are shown in Figures 4a and 4b, while the performance of the nearest neighbor search is presented in Figures 4c and 4d. For all tested queries, i.e., retrieving subsets of up to 20% of the database, the M-tree and the D-Index always needed fewer distance computations than the sequential scan. The D-Index clearly performed much better for the sentences and also for the range search over vectors. However, for the nearest neighbor search on vectors, both structures needed practically the same number of distance computations.
³ The software is available at http://www-db.deis.unibo.it/research/Mtree/
3.2
eD-Index Performance Evaluation
In order to demonstrate the suitability of the eD-Index for the problem of similarity self join, we have compared several different approaches to the join operation. The nested loops algorithm uses the symmetry of the metric distance function to prune some pairs; its time complexity is O(N·(N−1)/2). A more sophisticated method, called range query join, uses the D-Index structure. Specifically, we assume a data set X ⊆ D organized by the D-Index and apply the following search strategy: for every o ∈ X, perform range_query(o, µ). Finally, the last compared method is the overloading join algorithm described in Section 2.4. In all experiments, we have compared these three techniques for the problem of the similarity self join: the nested loops (NL) algorithm, the range query join (RJ), and the overloading join (OJ) algorithm applied on the eD-Index. As the performance index we have used the speedup with respect to the naive approach, i.e., the number of distance computations of the naive approach, N·(N−1)/2, divided by the number of distance computations of the examined algorithm, where N is the number of stored objects.

Efficiency. The objective of this group of tests was to study the relationship between the query size and the efficiency measured in terms of distance computations. Figures 4e and 4f show the results of these experiments. They show that OJ is more than twice as fast as RJ for the small values of µ which are used in the data cleaning area. On the vector data set, the OJ algorithm performed even better, especially for small query radii.

Scalability. This is probably the most important issue to investigate, considering the web-based dimension of data. In the elementary case, it is necessary to study what happens with the performance of the algorithms when the size of a data set grows. We have experimentally investigated the behaviour of the eD-Index on the text data set with sizes from 50,000 to 300,000 objects (sentences). We have mainly concentrated on small queries, which are typical for the data cleaning area. Figures 4g and 4h report the speedups of the RJ and OJ algorithms, respectively. In summary, the figures demonstrate that the speedup is very high and remains constant for different values of µ with respect to the data set size. This implies that the similarity self join with the eD-Index, specifically the overloading join algorithm, is also suitable for large and growing data sets.
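As a small illustration of the speedup measure (with our own function names and made-up numbers in the usage comment), the sketch below relates the number of distance computations spent by an examined algorithm to the N·(N−1)/2 computations of the naive nested loops.

def naive_cost(n):
    # distance computations of the naive self join over n objects
    return n * (n - 1) // 2

def speedup(n, distance_computations):
    # how many times fewer distance computations than the naive approach
    return naive_cost(n) / distance_computations

# hypothetical example: 50,000 objects and 5 * 10**6 computed distances
# speedup(50000, 5 * 10**6) is roughly 250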
4
Conclusions
In this paper we have presented a new access structure able to cope with range queries, nearest neighbor queries, and similarity joins. In the performance evaluation, we have concentrated on the number of distance computations. However, the presented structure is also suitable for disk storage and is not limited to operating in main memory only. Our experiments, not reported in this paper, exhibit very good performance in terms of disk block accesses. Compared to the M-tree, it typically needs fewer distance computations and far fewer disk reads to execute a query.
[Figure omitted: panels (a)–(d) plot the distance computations of the range search and nearest neighbor search on the VEC and STR data sets for the D-Index, M-tree, and SEQ as a function of the search radius (or the number of neighbors); panels (e)–(f) plot the speedup of RJ and OJ on VEC and STR as a function of the search radius; panels (g)–(h) plot the speedup scalability of RJ and OJ for µ = 1, 2, 3 as a function of the data set size (×50,000).]

Fig. 4. The experiments of the performance evaluation.
The D-Index is also economical in its space requirements. It needs slightly more space than the sequential organization, but at least two times less disk space than the M-tree. We have extended the D-Index to implement two similarity join algorithms, and we have performed numerous experiments to analyze their search properties and suitability for the similarity join implementation. The main advantage of these structures is that they can perform the same operations on any other metric data. The challenge now is to apply our access structures to problems of similarity on XML structures, where metric indexes could be used for approximate matching of tree structures.
References

1. Edgar Chávez, José Luis Marroquín, and Gonzalo Navarro. Fixed queries array: A fast and economical data structure for proximity searching. Multimedia Tools and Applications (MTAP), 14(2):113–135, 2001.
2. Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proc. of VLDB’97, pages 426–435. Morgan Kaufmann, August 1997. 3. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. Separable splits of metric data sets. In 9th Italian Conf. on Database Systems (SEBD), pages 45–62. LCM Selecta Group - Milano, 2001. 4. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1), 2003. To appear. 5. G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001. 6. P.N. Yianilos. Excluded middle vantage point forests for nearest neighbor search. In Sixth DIMACS Implementation Challenge: Nearest Neighbor Searches workshop, January 1999.
A
The algorithms
Algorithm A1 Insertion
for i = 1 to h
  if ⟨s^{m_i,ρ}(x)⟩ < 2^{m_i} then
    x → B_{i,⟨s^{m_i,ρ}(x)⟩}; exit;
  end if
end for
x → E_h;

Algorithm A2 Range Search (for the query region Q = R(q, r_q))
for i = 1 to h
  if ⟨s^{m_i,ρ+r_q}(q)⟩ < 2^{m_i} then
    return all objects x such that x ∈ Q ∩ B_{i,⟨s^{m_i,ρ+r_q}(q)⟩}; exit;
  end if
  if r_q ≤ ρ then (search radius up to ρ)
    if ⟨s^{m_i,ρ−r_q}(q)⟩ < 2^{m_i} then
      return all objects x such that x ∈ Q ∩ B_{i,⟨s^{m_i,ρ−r_q}(q)⟩};
    end if
  else
    let {l_1, l_2, ..., l_k} = G(s^{m_i,r_q−ρ}(q))
    return all objects x such that x ∈ Q ∩ B_{i,l_1} or ... or x ∈ Q ∩ B_{i,l_k};
  end if
end for
return all objects x such that x ∈ Q ∩ E_h;

Algorithm A3 Nearest Neighbor Search
A = ∅, d_k = d+; (initialization)
for i = 1 to h (first – optimistic – phase)
  r = min{d_k, ρ};
  if ⟨s^{m_i,ρ+r}(q)⟩ < 2^{m_i} then
    access bucket B_{i,⟨s^{m_i,ρ+r}(q)⟩}; update A and d_k;
    if d_k ≤ ρ then exit; end if
  else
    if ⟨s^{m_i,ρ−r}(q)⟩ < 2^{m_i} then
      access bucket B_{i,⟨s^{m_i,0}(q)⟩}; update A and d_k;
    end if
  end if
end for
access bucket E_h; update A and d_k;
if d_k > ρ then (second phase – if needed)
  for i = 1 to h
    if ⟨s^{m_i,ρ+d_k}(q)⟩ < 2^{m_i} then
      access bucket B_{i,⟨s^{m_i,ρ+d_k}(q)⟩} if not already accessed; update A and d_k; exit;
    else
      let {b_1, b_2, ..., b_k} = G(s^{m_i,d_k−ρ}(q))
      access buckets B_{i,b_1}, B_{i,b_2}, ..., B_{i,b_k} if not accessed; update A and d_k;
    end if
  end for
end if

Algorithm A4 eD-Index Insertion
for i = 1 to h
  if ⟨s^{m_i,ρ+ε}(x)⟩ < 2^{m_i} then
    x → B_{i,⟨s^{m_i,ρ}(x)⟩}; exit;
  end if
  if ⟨s^{m_i,ρ}(x)⟩ < 2^{m_i} then
    x → B_{i,⟨s^{m_i,ρ}(x)⟩};
  end if
end for
x → E_h;

Algorithm A5 eD-Index Similarity Self Join
for i = 1 to h
  for j = 0 to 2^{m_i} − 1
    SimJoin(B_{i,j}, µ);
  end for
end for
SimJoin(E_h, µ);

Algorithm A6 Sliding Window
lo = 1
for up = 2 to n
  increment lo while d(o_up, p) − d(o_lo, p) > µ
  for j = lo to up − 1
    if PivotCheck() = FALSE then
      if d(o_j, o_up) ≤ µ then
        add pair (o_j, o_up) to result
      end if
    end if
  end for
end for