Access Structures for Advanced Similarity Search in Metric Spaces

Vlastislav Dohnal (1), Claudio Gennaro (2), Pasquale Savino (2), and Pavel Zezula (1)

(1) Masaryk University Brno, Czech Republic, {xdohnal, zezula}@fi.muni.cz
(2) ISTI-CNR Pisa, Italy, {gennaro, savino}@isti.pi.cnr.it

Abstract. Similarity retrieval is an important paradigm for searching in environments where exact match has little meaning. Moreover, in order to enlarge the set of data types for which similarity search can be performed efficiently, the notion of mathematical metric space provides a useful abstraction of similarity. In this paper we consider the problem of organizing and searching large data sets from arbitrary metric spaces and discuss a novel access structure for similarity search in metric data, called D-Index. The D-Index combines a novel clustering technique with a pivot-based distance searching strategy to speed up the execution of similarity range and nearest neighbor queries for large files of objects stored in disk memories. Moreover, we propose an extension of this access structure, the eD-Index, which is able to deal with the problem of similarity self join. Though this approach cannot eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments.

1 Introduction

Similarity searching has become a fundamental computational task in a variety of application areas, including multimedia information retrieval, data mining, pattern recognition, machine learning, computer vision, genome databases, data compression, and statistical data analysis. This problem, originally studied mostly within the theoretical area of computational geometry, has recently been attracting more and more attention in the database community, because of the growing need to deal with large, often distributed, volumes of data. Consequently, high performance has become an important feature of cutting-edge designs. In this article, we consider the problem from a broad perspective and assume the data to be objects from a metric space where only pair-wise distances between objects are known. We present the D-Index, a multi-level access structure with separable buckets on each level, which supports easy insertion, range search, and nearest neighbor search.

Moreover, in order to complete the set of similarity search operations, similarity joins are needed. For this purpose, an extension of the D-Index is introduced. Applications of the similarity join operation include data cleaning and copy detection, to name a few. The former is useful, for instance, in the development of Internet services, which often require an integration of heterogeneous data sources. Such sources are typically unstructured, whereas the intended services often require structured data. In this case, the main challenge is to provide consistent and error-free data, which calls for data cleaning, typically implemented by some form of similarity join. This work focuses on the similarity self join, useful for copy detection and data cleaning in large databases. The paper is organized as follows. In Section 2, we define the general principles of our access structures. The structure and algorithms of the D-Index are presented in Section 2.3, while the eD-Index is described in Section 2.4. Section 3 briefly summarizes the experimental evaluation results. Finally, Appendix A formally presents the proposed algorithms, and the article concludes in Section 4.

2 Access structures and search algorithms

2.1 Search Strategies

A convenient way to assess the similarity between two objects is to apply a metric function that quantifies the closeness of objects as a distance, which can be seen as a measure of the objects' dissimilarity. A metric space M = (D, d) is defined by a domain of objects (elements, points) D and a total (distance) function d – a non-negative (d(x, y) ≥ 0, with d(x, y) = 0 iff x = y) and symmetric (d(x, y) = d(y, x)) function which satisfies the triangle inequality (d(x, y) ≤ d(x, z) + d(z, y), ∀x, y, z ∈ D). Without any loss of generality, we assume that the maximum distance never exceeds d^+. For a query object q ∈ D, two fundamental similarity queries can be defined. A range query retrieves all elements within distance r of q, that is the set {x ∈ X, d(q, x) ≤ r}. A k-nearest neighbor query retrieves the k closest elements to q, that is a set R ⊆ X such that |R| = k and ∀x ∈ R, y ∈ X − R, d(q, x) ≤ d(q, y). The similarity join is a search primitive which combines objects of two subsets of D into one set such that a similarity condition is satisfied; the similarity condition between two objects is defined with respect to the metric distance d. Formally, the similarity join X ⋈ Y between two finite sets X = {x_1, ..., x_N} and Y = {y_1, ..., y_M} (X ⊆ D and Y ⊆ D) is defined as the set of pairs X ⋈ Y = {(x_i, y_j) | d(x_i, y_j) ≤ µ}, where the threshold µ is a real number such that 0 ≤ µ ≤ d^+. If the sets X and Y coincide, we talk about the similarity self join.
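To make these query semantics concrete, the following sketch (not part of the original paper) implements the three operations by a plain sequential scan; the parameter d stands for any metric satisfying the postulates above, and this naive evaluation is exactly the baseline that the structures described below are designed to improve upon.

```python
# Minimal reference implementations of the three similarity queries over a plain
# list of objects X; d is any metric function d(x, y) satisfying the postulates above.
import heapq

def range_query(X, d, q, r):
    """All objects of X within distance r of the query object q."""
    return [x for x in X if d(q, x) <= r]

def knn_query(X, d, q, k):
    """The k objects of X closest to q."""
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))

def similarity_self_join(X, d, mu):
    """All pairs (X[i], X[j]), i < j, with d(X[i], X[j]) <= mu."""
    return [(X[i], X[j])
            for i in range(len(X))
            for j in range(i + 1, len(X))
            if d(X[i], X[j]) <= mu]
```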

2.2 General approach: Clustering Through Separable Partitioning

To achieve our objectives, we base our partitioning principles on a mapping function, which we call the ρ-split function, where ρ is a real number constrained by 0 ≤ ρ < d^+. In order to gradually explain the concept of ρ-split functions, we first define a first order ρ-split function and its properties. More details about the mathematical specification of ρ-split functions are available in [4].

Definition 1. Given a metric space (D, d), a first order ρ-split function s^{1,ρ} is the mapping s^{1,ρ}: D → {0, 1, −}, such that for arbitrary different objects x, y ∈ D, s^{1,ρ}(x) = 0 ∧ s^{1,ρ}(y) = 1 ⇒ d(x, y) > 2ρ (separable property) and ρ_2 ≥ ρ_1 ∧ s^{1,ρ_2}(x) ≠ − ∧ s^{1,ρ_1}(y) = − ⇒ d(x, y) > ρ_2 − ρ_1 (symmetry property).

In other words, the ρ-split function assigns to each object of the space D one of the symbols 0, 1, or −. We can generalize the ρ-split function by concatenating n first order ρ-split functions with the purpose of obtaining a split function of order n.

Definition 2. Given n first order ρ-split functions s_1^{1,ρ}, ..., s_n^{1,ρ} in the metric space (D, d), a ρ-split function of order n, s^{n,ρ} = (s_1^{1,ρ}, s_2^{1,ρ}, ..., s_n^{1,ρ}): D → {0, 1, −}^n, is the mapping such that for arbitrary different objects x, y ∈ D, ∀i s_i^{1,ρ}(x) ≠ − ∧ ∀j s_j^{1,ρ}(y) ≠ − ∧ s^{n,ρ}(x) ≠ s^{n,ρ}(y) ⇒ d(x, y) > 2ρ (separable property) and ρ_2 ≥ ρ_1 ∧ ∀i s_i^{1,ρ_2}(x) ≠ − ∧ ∃j s_j^{1,ρ_1}(y) = − ⇒ d(x, y) > ρ_2 − ρ_1 (symmetry property).

An obvious consequence of the ρ-split function definitions, useful for our purposes, is that by combining n ρ-split functions of order 1, s_1^{1,ρ}, ..., s_n^{1,ρ}, we obtain a ρ-split function of order n, s^{n,ρ}. We often refer to the number of symbols generated by s^{n,ρ}, that is the parameter n, as the order of the ρ-split function. In order to obtain an addressing scheme, we need another function that transforms the ρ-split strings into integers, which we define as follows.

Definition 3. Given a string b = (b_1, ..., b_n) of n elements 0, 1, or −, the function ⟨·⟩: {0, 1, −}^n → [0..2^n] is specified as:

  ⟨b⟩ = [b_1, b_2, ..., b_n]_2 = Σ_{j=1}^{n} 2^{j−1} b_j, if ∀j b_j ≠ −
  ⟨b⟩ = 2^n, otherwise

When all the elements are different from '−', the function ⟨b⟩ simply translates the string b into an integer by interpreting it as a binary number (which is always < 2^n); otherwise the function returns 2^n. By means of the ρ-split function and the ⟨·⟩ operator we can assign an integer number i (0 ≤ i ≤ 2^n) to each object x ∈ D, i.e., the function can group objects from X ⊂ D into 2^n + 1 disjoint subsets.

Though several different types of first order ρ-split functions are proposed, analyzed, and evaluated in [3], the ball partitioning split (bps), originally proposed in [6] under the name excluded middle partitioning, provided the smallest exclusion set. For this reason, we also apply this approach, which can be characterized as follows. The ball partitioning ρ-split function bps^ρ(x, x_v) uses one object x_v ∈ D and the medium distance d_m to partition the data file into three subsets (see Figure 1a). The objects corresponding to the value 0 or 1 of the bps function form the separable partitions, and the objects corresponding to '−' form the exclusion set:

  bps^{1,ρ}(x) = 0, if d(x, x_v) ≤ d_m − ρ
  bps^{1,ρ}(x) = 1, if d(x, x_v) > d_m + ρ        (1)
  bps^{1,ρ}(x) = −, otherwise

Once we have defined a set of first order ρ-split functions, it is possible to combine them in order to obtain a function which generates more partitions. This idea is depicted in Figure 1b, where two bps-split functions on a two-dimensional space are used. The domain D, represented by the grey square, is divided into four regions corresponding to the separable partitions. The exclusion set is represented by the brighter region and is formed by the union of the exclusion sets resulting from the two splits.
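The following hedged sketch, added for illustration, shows one way the ball-partitioning split of Equation (1), its combination into an order-n ρ-split function (Definition 2), and the ⟨·⟩ addressing operator (Definition 3) could be coded; the pivot x_v and the medium distance d_m are assumed to be chosen in advance, and the bit-order convention inside bucket_index is an implementation detail, not prescribed by the paper.

```python
def bps(d, x_v, d_m, rho):
    """First order rho-split function based on ball partitioning (Equation 1)."""
    def split(x):
        dist = d(x, x_v)
        if dist <= d_m - rho:
            return '0'
        if dist > d_m + rho:
            return '1'
        return '-'
    return split

def combine(splits):
    """Order-n rho-split function: concatenation of n first order functions (Definition 2)."""
    return lambda x: ''.join(s(x) for s in splits)

def bucket_index(b):
    """<b>: map a split string to an integer in [0 .. 2^n]; 2^n marks the exclusion set."""
    n = len(b)
    if '-' in b:
        return 2 ** n
    return int(b, 2)   # read the string as a binary number, always < 2^n
```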

2.3 D-Index

The basic idea of the D-Index [4] is to create a multilevel storage and retrieval structure that uses several ρ-split functions, one for each level, to create an array of buckets for storing objects. On the first level, we use a ρ-split function for separating objects of the whole data set. For any other level, objects mapped to the exclusion bucket of the previous level are the candidates for storage in the separable buckets of this level. Finally, the exclusion bucket of the last level forms the exclusion bucket of the whole D-Index structure. It is worth noting that the ρ-split functions of the individual levels use the same ρ. Moreover, the split functions can have different orders, typically decreasing with the level, allowing the D-Index structure to have levels with different numbers of buckets. More precisely, the D-Index structure can be defined as follows. From the structural point of view, the buckets are organized as the following two-dimensional array consisting of 1 + Σ_{i=1}^{h} 2^{m_i} elements:

  B_{1,0}, B_{1,1}, ..., B_{1,2^{m_1}−1}
  B_{2,0}, B_{2,1}, ..., B_{2,2^{m_2}−1}
  ...
  B_{h,0}, B_{h,1}, ..., B_{h,2^{m_h}−1}, E_h

All separable buckets are included, but only the exclusion bucket E_h of the last level is present – the exclusion buckets E_i of levels i < h are re-partitioned on level i + 1. Objects are inserted by descending the levels: an object is stored in the first separable bucket that accommodates it and, if no level succeeds, in the global exclusion bucket E_h (Algorithm A1 in Appendix A). A range query with radius r_q ≤ ρ accesses at most one separable bucket per level plus, possibly, the exclusion bucket.
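Building on the previous sketch, the following illustrative code (an assumption-laden reading of Algorithm A1 in Appendix A, not the authors' implementation) shows how such level-wise insertion could look; level_splits[i] is assumed to be an order-m_i ρ-split function returning a string over {0, 1, −}.

```python
class DIndex:
    def __init__(self, level_splits):
        # level_splits[i]: order-m_i rho-split function for level i
        self.level_splits = level_splits
        self.buckets = [dict() for _ in level_splits]  # level -> {bucket index -> objects}
        self.exclusion = []                            # global exclusion bucket E_h

    def insert(self, x):
        for i, split in enumerate(self.level_splits):
            b = split(x)
            idx = bucket_index(b)
            if idx < 2 ** len(b):                      # x is separable at level i
                self.buckets[i].setdefault(idx, []).append(x)
                return
        self.exclusion.append(x)                       # no level separated x
```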

However, queries with r_q > ρ can also be executed. Indeed, such queries are answered by evaluating the split function s^{r_q−ρ}. In case s^{r_q−ρ}(q) returns a string without any '−', the result is contained in a single bucket (namely B_{⟨s^{r_q−ρ}(q)⟩}) plus, possibly, the exclusion bucket. Let us now consider that the returned string contains at least one '−'. We indicate this string as (b_1, ..., b_n) with b_i ∈ {0, 1, −}. In case there is only one b_i = '−', we must access all buckets whose index is obtained by substituting the '−' in s^{r_q−ρ}(q) with 0 and 1. In the most general case we must substitute in (b_1, ..., b_n) all the '−' with zeros and ones and generate all possible combinations. In order to define an algorithm for this process, we need some additional terms and notation.

Definition 4. We define an extended exclusive OR bit operator, ⊗, based on the following truth table:

  bit_1: 0 0 0 1 1 1 − − −
  bit_2: 0 1 − 0 1 − 0 1 −
  ⊗:     0 1 0 1 0 0 0 0 0

Note that the operator ⊗ can be used bit-wise on two strings of the same length and that it always returns a standard binary number (i.e., it does not contain any '−'). Consequently, s_1 ⊗ s_2 < 2^n always holds for strings s_1 and s_2 of length n (see Definition 3).

Definition 5. Given a string s of length n, G(s) denotes the subset of Ω_n = {0, 1, ..., 2^n − 1} obtained by eliminating all elements e_i ∈ Ω_n for which s ⊗ e_i ≠ 0 (interpreting e_i as a binary string of length n).

Observe that G(− − ... −) = Ω_n and that the cardinality of the set is 2 raised to the number of '−' elements; in other words, G(s) generates all partition identifications in which the symbol '−' is substituted by zeros and ones in all possible ways. Given a query region Q = R(q, r_q) with q ∈ D and r_q ≤ d^+, the advanced similarity range query can be executed following Algorithm A2.

The task of the nearest neighbor search is to retrieve the k closest elements of X to q ∈ D with respect to the metric d. For convenience, we designate the distance to the k-th nearest neighbor as d_k (d_k ≤ d^+). Due to lack of space, we do not comment on this algorithm in detail, but the general strategy of Algorithm A3 works as follows. The algorithm starts with an optimistic strategy assuming that the k-th nearest neighbor is at distance at most ρ. If this fails, additional search steps are performed to find the correct result.
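As an illustration of Definitions 4 and 5, the following sketch (not from the paper) implements the ⊗ operator and the set G(s) by filtering Ω_n exactly as the definition prescribes; this is the building block used by the range search for r_q > ρ in Algorithm A2.

```python
def xor_ext(b1, b2):
    """Extended exclusive OR on single symbols: '-' on either side yields 0 (Definition 4)."""
    if b1 == '-' or b2 == '-':
        return 0
    return int(b1 != b2)

def G(s):
    """All e in {0, ..., 2^n - 1} with s (x) e = 0, i.e. the bucket indices obtained
    by substituting the '-' symbols of s by zeros and ones (Definition 5)."""
    n = len(s)
    kept = []
    for e in range(2 ** n):
        bits = format(e, '0{}b'.format(n))        # e as a binary string of length n
        if all(xor_ext(si, ei) == 0 for si, ei in zip(s, bits)):
            kept.append(e)
    return kept

# Example: G('1-0-') == [8, 9, 12, 13], i.e. the buckets 1000, 1001, 1100, 1101 in binary.
```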

2.4 eD-Index

The idea behind the eD-Index is to modify the insertion algorithm of the D-Index in such a way that the exclusion set overlaps with the separable partitions by a margin of ε (see Figure 2). The objects which fall into the region where the exclusion set intersects the separable partitions are replicated in both of those sets. This principle, called the overloading of the exclusion set, ensures that a similarity self join query with threshold µ ≤ ε cannot miss a qualifying pair of objects (x, y) falling into different sets, i.e., with bps(x) ≠ bps(y), because all objects of a separable set which can form a qualifying pair with an object of the exclusion set are copied into the exclusion set.

Fig. 2. The difference between D-Index (a) and eD-Index (b).

In this way, the eD-Index speeds up the execution of similarity self join queries with thresholds up to the predefined value ε.

eD-Index Insertion. Insertion proceeds according to Algorithm A4. The algorithm is similar to the one for insertion into the D-Index (A1). The difference is that in the eD-Index we use both the split functions s^{m_i,ρ} and s^{m_i,ρ+ε}, allowing us to replicate all the objects which fall into the overlapping region of two consecutive levels.

eD-Index Similarity Self Join. The outline of the similarity self join algorithm is the following: execute the join query independently on every separable bucket of every level of the eD-Index and, additionally, on the exclusion bucket of the whole structure. This behaviour is correct due to the overloading exclusion set principle, which copies every object of a separable set that can form a qualifying pair with an object of the exclusion set into the exclusion set. In this respect, the join query elaboration can be done independently in every bucket, and the results of the individual join subqueries form the result of the given join query. Algorithm A5 describes this idea.
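To illustrate the overloading principle, the following hedged sketch extends the D-Index insertion sketched earlier along the lines of Algorithm A4; the two families of split functions, with margins ρ and ρ + ε, are assumed to share the same pivots, and the code is only one possible reading of the algorithm, not the authors' implementation.

```python
class EDIndex(DIndex):
    def __init__(self, level_splits_rho, level_splits_rho_eps):
        super().__init__(level_splits_rho)
        self.level_splits_eps = level_splits_rho_eps   # s^{m_i, rho+eps} per level

    def insert(self, x):
        for i, (split, split_eps) in enumerate(zip(self.level_splits,
                                                   self.level_splits_eps)):
            b_eps = split_eps(x)
            if bucket_index(b_eps) < 2 ** len(b_eps):
                # clearly separable even with the enlarged margin: store once and stop
                b = split(x)
                self.buckets[i].setdefault(bucket_index(b), []).append(x)
                return
            b = split(x)
            if bucket_index(b) < 2 ** len(b):
                # overlap region of width eps: store here and keep descending (replication)
                self.buckets[i].setdefault(bucket_index(b), []).append(x)
        self.exclusion.append(x)
```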

Fig. 3. The Sliding Window algorithm.

The procedure SimJoin(B_{i,j}, µ) executes the similarity self join in one bucket. It is possible to improve the algorithm by exploiting the pivots during the SimJoin elaboration. This idea is based on the sliding window algorithm (see Algorithm A6): the objects of a bucket are ordered with respect to a pivot p, which is the reference object of a ρ-split function used by the eD-Index, and we define a sliding window of objects [o_lo, o_up]. This window is moved through the whole ordered set of the bucket in order to find all qualifying pairs (see Figure 3).
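A possible reading of the sliding-window join inside a single bucket (Algorithm A6) is sketched below; the PivotCheck pruning of the original algorithm is omitted and replaced by a direct distance evaluation, so this is a simplified illustration rather than the full method. The correctness of the window bound follows from the triangle inequality: if the pivot distances of two objects differ by more than µ, the objects themselves are more than µ apart.

```python
def sim_join_bucket(bucket, d, p, mu):
    """Self join of one bucket: all pairs within distance mu, using pivot p for pruning."""
    # Pair each object with its pre-computed pivot distance and sort by that distance.
    order = sorted(((d(o, p), o) for o in bucket), key=lambda t: t[0])
    result = []
    lo = 0
    for up in range(1, len(order)):
        while order[up][0] - order[lo][0] > mu:        # slide the lower bound of the window
            lo += 1
        for j in range(lo, up):
            if d(order[j][1], order[up][1]) <= mu:     # verify candidates inside the window
                result.append((order[j][1], order[up][1]))
    return result
```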

3 Performance Evaluation and Comparison

We have implemented the D-Index and the eD-Index and conducted numerous experiments to verify their properties on two metric data sets. The first data set (STR) consisted of sentences of a Czech language corpus compared by the edit distance measure, the so-called Levenshtein distance [5]. The most frequent distance was around 100 and the longest distance was 500, equal to the length of the longest sentence. The second data set (VEC) was composed of 45-dimensional vectors of color features extracted from images. Vectors were compared by the quadratic form distance measure. The distance distribution of this data set was practically normal, with the most frequent distance equal to 4,100 and the maximum distance equal to 8,100.
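For concreteness, the edit distance used for the STR data set can be computed with the standard dynamic-programming recurrence; the snippet below is a generic textbook implementation (applied in the paper to whole sentences), not the code used in the experiments.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (O(len(a)*len(b)) time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (or match)
        prev = curr
    return prev[-1]
```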

3.1 D-Index Performance Evaluation

In all our experiments, the query objects are not chosen from the indexed data sets, but they follow the same distance distribution. The search costs are measured in terms of distance computations. All presented cost values are averages obtained by executing queries for 50 different query objects and constant search selectivity, that is, queries using the same search radius or the same number of nearest neighbors. We have considered about 11,000 objects for each of our data sets VEC and STR. We have compared the performance of the D-Index with other index structures under the same workload. In particular, we considered the M-tree [2] (the software is available at http://www-db.deis.unibo.it/research/Mtree/) and the sequential organization, SEQ.

Efficiency. The main objective of these experiments was to compare the similarity search efficiency of the D-Index with the other organizations. The results for the range search are shown in Figures 4a and 4b, while the performance for the nearest neighbor search is presented in Figures 4c and 4d. For all tested queries, i.e. retrieving subsets of up to 20% of the database, the M-tree and the D-Index always needed fewer distance computations than the sequential scan. The D-Index performed much better for the sentences and also for the range search over vectors; however, the nearest neighbor search on vectors needed practically the same number of distance computations.

3.2 eD-Index Performance Evaluation

In order to demonstrate the suitability of the eD-Index for the problem of similarity self join, we have compared several different approaches to the join operation. The nested loops algorithm uses the symmetry property of metric distance functions to prune some pairs; its time complexity is O(N·(N−1)/2). A more sophisticated method, called range query join, uses the D-Index structure. Specifically, we assume a data set X ⊆ D organized by the D-Index and apply the following search strategy: for every o ∈ X, perform range_query(o, µ). Finally, the last compared method is the overloading join algorithm, which is described in Section 2.4. In all experiments, we have thus compared three different techniques for the problem of the similarity self join: the nested loops (NL) algorithm, the range query join (RJ), and the overloading join (OJ) algorithm, applied to the eD-Index. As the performance index we have used the speedup with respect to the naive approach, i.e., N·(N−1)/2 divided by the number of distance computations of the examined algorithm, where N is the number of objects stored.

Efficiency. The objective of this group of tests was to study the relationship between the query size and the efficiency measured in terms of distance computations. Figures 4e and 4f show the results of these experiments. The experiments show that OJ is more than twice as fast as RJ for the small values of µ used in the data cleaning area. On the vector data set, the OJ algorithm performed even better, especially for small query radii.

Scalability. Scalability is probably the most important issue to investigate, considering the web-based dimension of data. In the elementary case, it is necessary to study what happens to the performance of the algorithms when the size of the data set grows. We have experimentally investigated the behaviour of the eD-Index on the text data set with sizes from 50,000 to 300,000 objects (sentences). We have mainly concentrated on small queries, which are typical for the data cleaning area. Figures 4g and 4h report the speedups of the RJ and OJ algorithms, respectively. In summary, the figures demonstrate that the speedup is very high and remains constant for different values of µ with respect to the data set size. This implies that the similarity self join with the eD-Index, specifically the overloading join algorithm, is also suitable for large and growing data sets.

4 Conclusions

In this paper we have presented a new access structure able to cope with range queries, nearest neighbor queries, and similarity joins. In the performance evaluation, we have concentrated on the number of distance computations. However, the presented structure is also suitable for disk storage and is not limited to operating in main memory only. Our experiments, not reported in this paper, exhibit very good performance in terms of disk block accesses.

Fig. 4. The experiments of the performance evaluation: (a), (b) distance computations of the D-Index, M-tree, and SEQ for range search on VEC and STR as a function of the search radius; (c), (d) distance computations for nearest neighbor search on VEC and STR; (e), (f) speedup of RJ and OJ for the similarity self join on VEC and STR as a function of the search radius µ; (g), (h) speedup scalability of RJ and OJ for µ = 1, 2, 3 as a function of the data set size (× 50,000).

Compared to the M-tree, it typically needs fewer distance computations and far fewer disk reads to execute a query. The D-Index is also economical in its space requirements: it needs slightly more space than the sequential organization, but at least two times less disk space than the M-tree. We have extended the D-Index to implement two similarity join algorithms and we have performed numerous experiments to analyze their search properties and their suitability for the similarity join implementation. The main advantage of these structures is that they can also perform similar operations on other metric data. The challenge is to apply our access structure to problems of similarity on XML structures, where metric indexes could be applied for approximate matching of tree structures.

References

1. Edgar Chávez, José Luis Marroquín, and Gonzalo Navarro. Fixed queries array: A fast and economical data structure for proximity searching. Multimedia Tools and Applications (MTAP), 14(2):113–135, 2001.

2. Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proc. of VLDB'97, pages 426–435. Morgan Kaufmann, August 1997.
3. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. Separable splits of metric data sets. In 9th Italian Conf. on Database Systems (SEBD), pages 45–62. LCM Selecta Group - Milano, 2001.
4. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-Index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1), 2003. To appear.
5. G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.
6. P. N. Yianilos. Excluded middle vantage point forests for nearest neighbor search. In Sixth DIMACS Implementation Challenge: Nearest Neighbor Searches workshop, January 1999.

A The algorithms

Algorithm A1 Insertion
for i = 1 to h
  if ⟨s_i^{m_i,ρ}(x)⟩ < 2^{m_i} then
    x → B_{i,⟨s_i^{m_i,ρ}(x)⟩}; exit;
  end if
end for
x → E_h;

Algorithm A2 Range Search
for i = 1 to h
  if ⟨s_i^{m_i,ρ+r_q}(q)⟩ < 2^{m_i} then
    return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ+r_q}(q)⟩}; exit;
  end if
  if r_q ≤ ρ then                                      (search radius up to ρ)
    if ⟨s_i^{m_i,ρ−r_q}(q)⟩ < 2^{m_i} then
      return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ−r_q}(q)⟩};
    end if
  else
    let {l_1, l_2, ..., l_k} = G(s_i^{m_i,r_q−ρ}(q))
    return all objects x such that x ∈ Q ∩ B_{i,l_1} or ... or x ∈ Q ∩ B_{i,l_k};
  end if
end for
return all objects x such that x ∈ Q ∩ E_h;

Algorithm A3 Nearest Neighbor Search
A = ∅, d_k = d^+;                                      (initialization)
for i = 1 to h                                         (first – optimistic – phase)
  r = min{d_k, ρ};
  if ⟨s_i^{m_i,ρ+r}(q)⟩ < 2^{m_i} then
    access bucket B_{i,⟨s_i^{m_i,ρ+r}(q)⟩}; update A and d_k;
    if d_k ≤ ρ then exit;
  else
    if ⟨s_i^{m_i,ρ−r}(q)⟩ < 2^{m_i} then
      access bucket B_{i,⟨s_i^{m_i,0}(q)⟩}; update A and d_k;
    end if
  end if
end for
access bucket E_h; update A and d_k;
if d_k > ρ then                                        (second phase – if needed)
  for i = 1 to h
    if ⟨s_i^{m_i,ρ+d_k}(q)⟩ < 2^{m_i} then
      access bucket B_{i,⟨s_i^{m_i,ρ+d_k}(q)⟩} if not already accessed; update A and d_k; exit;
    else
      let {b_1, b_2, ..., b_k} = G(s_i^{m_i,d_k−ρ}(q))
      access buckets B_{i,b_1}, B_{i,b_2}, ..., B_{i,b_k} if not accessed; update A and d_k;
    end if
  end for
end if

Algorithm A4 eD-Index Insertion
for i = 1 to h
  if ⟨s_i^{m_i,ρ+ε}(x)⟩ < 2^{m_i} then
    x → B_{i,⟨s_i^{m_i,ρ}(x)⟩}; exit;
  end if
  if ⟨s_i^{m_i,ρ}(x)⟩ < 2^{m_i} then
    x → B_{i,⟨s_i^{m_i,ρ}(x)⟩};
  end if
end for
x → E_h;

Algorithm A5 eD-Index Similarity Self Join
for i = 1 to h
  for j = 0 to 2^{m_i} − 1
    SimJoin(B_{i,j}, µ);
  end for
end for
SimJoin(E_h, µ);

Algorithm A6 Sliding Window
lo = 1
for up = 2 to n
  increment lo while d(o_up, p) − d(o_lo, p) > µ
  for j = lo to up − 1
    if PivotCheck() = FALSE then
      if d(o_j, o_up) ≤ µ then
        add pair (o_j, o_up) to result
      end if
    end if
  end for
end for
