Estimating the Selectivity of Approximate String Queries

ARTŪRAS MAZEIKA and MICHAEL H. BÖHLEN, Free University of Bozen-Bolzano
NICK KOUDAS, University of Toronto
DIVESH SRIVASTAVA, AT&T Labs–Research
Approximate queries on string data are important due to the prevalence of such data in databases and to the various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures. We study our technique analytically and experimentally. The space complexity of our estimator depends only on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear wrt the length of the query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets, and show that VSol is effective for large skewed databases of short strings.

Categories and Subject Descriptors: H.2.4 [Systems]: Query processing, Textual databases; H.2.8 [Database Applications]: Statistical Databases

General Terms: Database algorithms, Selectivity Estimation, Query Optimization

Additional Key Words and Phrases: Inverse strings, q-grams, min-wise hash signatures
This is a preliminary release of an article accepted by ACM Transactions on Database Systems. The definitive version is currently in production at ACM and, when released, will supersede this version.

Copyright 200x by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or [email protected].

© 20YY ACM 0362-5915/20YY/0300-0001 $5.00
ACM Transactions on Database Systems, Vol. V, No. N, Month 20YY, Pages 1–0??.
1. INTRODUCTION

This paper presents the VSol estimator, which estimates the selectivity of approximate string queries in large databases of short strings (names, addresses, etc.). The selectivity is defined as the number of database strings that are within edit distance d from query string σ. Our technique is based on inverse strings, and the time to estimate the selectivity of a query string is independent of the number of database strings. Similar to inverted file indices, where keywords are mapped to the sets of files/documents that contain them, we compute short substrings of length q (also known as q-grams) and associate q-grams with sets of string IDs. A hash index is used to access q-grams. This strategy permits us to quickly identify similar strings and estimate the selectivity of a query string.

Figure 1 provides an overview of our selectivity estimator. Detailed explanations follow in Section 5. First, we compute all substrings of length 2 (2-grams) of the database strings (cf. Bag of 2-grams: Q(Database) at the top of Figure 1). Then we associate each 2-gram with an inverse string: the IDs of the strings that contain the 2-gram (vertical arrows at the top, D(q)). For example, 2-gram "on" is contained in strings 1, 4 and 6. This setup allows us to efficiently identify the strings that contain q as a 2-gram. The cost of processing inverse strings is linear wrt the number of database strings, since the size of the inverse strings is proportional to the size of the database. We use signatures to compress inverse strings and obtain a time complexity that is linear wrt the number of neighborhoods (accumulations of strings, cf. Section 2). The compression of inverse strings with signatures is illustrated by the vertical arrows near the bottom of Figure 1. Intuitively, a set of string IDs is replaced with a short vector of numbers between 0 and 1. This trades accuracy for a lower space consumption.
We further decrease the memory usage of signatures by replacing clusters of signatures with representative signatures (cf. Clustered Signatures at the bottom of Figure 1). Given a set of clustered signatures that compress inverse strings, we develop the M choose L similarity to estimate the selectivity of a query string. The M choose L similarity estimates the similarity of a query string with respect to the q-grams of a database and their signatures. This solution differs from existing techniques that associate each string (or document) with its q-grams and then find similar strings by doing a pairwise comparison of sets of q-grams [Broder 2000].

We term our string selectivity estimator VSol (vertical solution) since it assigns sets of IDs and signatures to each q-gram. The vertical allocation of signatures is attractive because it improves on the space complexity of existing techniques, which is linear wrt the number of database strings. We prove that the space complexity of VSol is linear wrt the number of different q-grams of the database (cf. Q(Database) at the top of Figure 1), inverse quadratic wrt the selected precision ε, and linear wrt the number of neighborhoods R in the database.

The empirical evaluation investigates the impact of different parameters (number of database strings, string length, etc.) on the estimation precision, the space usage, and the timings of the method. We show that VSol is effective for large string databases of short strings. The query time of the VSol estimator is independent of the number of database strings and linear wrt the length of the query string. The application domain of VSol is large skewed databases with a few large neighborhoods and optionally many small neighborhoods (data is skewed in terms of the sizes of neighborhoods).

[Fig. 1. The Idea of VSol, q = 2. The database strings (ID 1: anderson, 2: brooks, 3: froyd, 4: simon, 5: schwartz, 6: wilkilson) are decomposed into the bag of 2-grams Q(Database); each 2-gram q (an, br, on, ks, tz, ...) is mapped to its inverse string D(q), e.g., D(on) = {1, 4, 6}; the inverse strings are then compressed into signatures (short vectors of numbers in [0, 1]) and into clustered signatures.]

The paper makes the following contributions:

—We introduce VSol, a new selectivity estimator for approximate string queries. The selectivity |S_d(σ)| of a query string σ is the number of database strings that are within edit distance d from σ. We determine the selectivity by counting all database strings that share at least L q-grams with the query string:

  |S_d(σ)| ≈ | ⋃_{q_{i1},...,q_{iL} ∈ Q(σ)} D(q_{i1}) ∩ ··· ∩ D(q_{iL}) |,    (1)
where Q(σ) are the q-grams of query string σ; q_{i1}, ..., q_{iL} range over all possible combinations of L q-grams of Q(σ); D(q_i) is the inverse string for q-gram q_i (the IDs of all strings that contain q_i); and L is the number of q-grams query string σ and database string α must share such that the edit distance is less than or equal to d: dist(α, σ) ≤ d (cf. Sections 4 and 5). S_d(σ) denotes the d-neighborhood of σ, i.e., the set of strings that are within edit distance d of σ.

—The computation of the selectivity of Equation (1) is expensive since the size of the inverse strings is proportional to the size of the database. We show how to use signatures to efficiently approximate the selectivity. Assume M is the number of q-grams of query string σ. We define the M choose L similarity, ρ(L, D(q_{i1}), ..., D(q_{iM})), which allows us to approximate the selectivity as follows:

  |S_d(σ)| ≈ ρ(L, D(q_{i1}), ..., D(q_{iM})) · |D(q_{i1}) ∪ ··· ∪ D(q_{iM})|.

We develop solutions to efficiently approximate the M choose L similarity and the size of the union of sets with only one scan of the signature vectors. Throughout, we abbreviate the M choose L similarity as (M L).
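Equation (1) can be sketched in a few lines. The sketch below is illustrative only: it uses plain set-encoded q-grams instead of the paper's bag encoding, and all function names (qgrams, inverse_strings, selectivity) are ours.

```python
from collections import defaultdict
from itertools import combinations

def qgrams(s, q=2):
    """Pad with q-1 '#' and '$' characters and slide a window of size q
    (set-encoded; the paper uses a bag encoding, cf. Section 4)."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def inverse_strings(db, q=2):
    """Map each q-gram to the IDs of the database strings that contain it."""
    D = defaultdict(set)
    for sid, s in enumerate(db, start=1):
        for g in qgrams(s, q):
            D[g].add(sid)
    return D

def selectivity(db, sigma, L, q=2):
    """Equation (1): count the IDs in the union, over all L-combinations of
    the query's q-grams, of the intersections of their inverse strings."""
    D = inverse_strings(db, q)
    hits = set()
    for combo in combinations(qgrams(sigma, q), L):
        hits |= set.intersection(*(D[g] for g in combo))
    return len(hits)
```

For instance, "froyd" and "freud" share three 2-grams (#f, fr, d$), so with L = 3 both strings are counted, while L = 4 retains only "froyd" itself.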
—We provide analytical results for our selectivity estimator. The space complexity of our estimator is O(2/ε² · log(2/δ) · |E|/n), where ε is the accepted error level, δ is the accepted probability that the estimator exceeds level ε, |E| is the smallest size of the neighborhood that is approximated robustly, and n is the number of data strings. We experimentally show that the approximation of the selectivity of large neighborhoods (|S_d(σ)| ≥ |E|) is invariant to the number of small neighborhoods (|S_d(σ)| < |E|) in the database. VSol supports insert-updates efficiently. Delete-updates may require recomputing the summary structure.

—We provide a detailed evaluation of our solution and compare our technique with HSol, a technique that calculates q-grams and signatures for each string (i.e., "horizontally"), and Sepia, a selectivity estimator that uses clustering and histograms to estimate the selectivity of fuzzy string predicates. HSol has been used to detect similar documents in information retrieval [Broder 2000] and to do approximate joins in databases. We prove that, without the signature approximation, VSol and HSol are equivalent. VSol enjoys a better space complexity for large databases of short strings, and its query time is independent of the number of database strings. Sepia groups strings into clusters and builds a histogram for the strings in each cluster. A global histogram stores distributions of similarity values based on the distances between strings. Sepia performs well in environments where strings are deleted but the clustering does not change.

—We compare the VSol and HSol estimators for real-world customer data from AT&T (names and addresses). The address database is skewed: there are a few large and many small neighborhoods in the data. VSol scales better than HSol as the number of database strings increases. The names database is less skewed, and both methods have to allocate larger signatures to accurately estimate the selectivity.
—We extend VSol with positional q-grams and discuss the benefits of the optimization. Positional q-grams provide a higher accuracy but enlarge the summary structure of the estimator.

The paper is organized as follows. Section 2 defines and motivates the problem. We review related work in Section 3. Section 4 introduces the background material and notation used in the paper. Section 5 defines the M choose L similarity, describes VSol, and proves the equivalence of HSol and VSol. We give a detailed experimental evaluation for synthetic and real world data in Section 6. Section 7 extends VSol with positional q-grams. Section 8 concludes the paper.

2. PROBLEM DEFINITION AND MOTIVATION

We use A to denote the alphabet, and Greek lower case symbols α, β, etc., possibly with subscripts, to denote arbitrary (finite) length strings in Ω = A*. For a string α, we write α[i..j] to denote the substring starting at the i-th position and ending at the j-th position of α. We denote a database by DB = {α_1, ..., α_n} ⊆ A*. We use σ ∈ A* to denote the query string. For a given query string σ and database DB the task is to compute the number of strings within edit distance d from σ, i.e.,

  |S_d(σ)| = |{α ∈ DB : dist(σ, α) ≤ d}|,    (2)
where |A| denotes the cardinality of set A, and S_d(σ) denotes the set of strings that are within edit distance d of string σ. The edit distance dist(α, β) between strings α and β is defined as the smallest number of character insertions, deletions, and substitutions needed to get string β from string α. For example, the edit distance between "wilkilson" and "wilkilsen" is one, since it requires one substitution (o replaced by e) to transform the first string into the second.

A brute-force approach to compute the selectivity is to scan database DB and compute dist(α, σ) for each string α. If dist(α, σ) ≤ d then α ∈ S_d(σ), otherwise α ∉ S_d(σ). The time complexity in terms of character comparisons of such an approach is slightly better than O(n · l · |σ|), where n is the number of strings in the database and l is the average length of the strings. We assume that the length of query string σ is also l on average (we are querying a database of names for a given name, or a database of addresses for a given address string). Thus, the time complexity of the brute-force approach is O(n · l²). In this paper we approximate the string selectivity, and thus trade accuracy for a lower space and time complexity.

Most real-world string databases consist of strings located far away from each other (centroids) and a highly varying number of other strings located close to the centroids. The reason for these characteristics is the high dimensionality of string data combined with typing errors and different text formatting rules. Throughout the paper we call accumulations of strings (centroids with close-by strings) neighborhoods. Intuitively, the number of neighborhoods is equal to the number of centroids in the data. Alternatively, one can define neighborhoods as a set {S_{d1}(σ_1), ..., S_{dk}(σ_k)}, where the distances between the S_{di} are large compared to the distances within the neighborhoods.
Thus, S_d(σ) denotes the neighborhood of query string σ up to edit distance d. We write ε to denote the estimation precision. VSol guarantees a robust approximation with O(2/ε² · log(2/δ) · |E|/|DB|) space complexity, where |E| is the smallest size of the neighborhood that shall be approximated robustly. Neighborhoods that are larger than |E| are large neighborhoods, whereas neighborhoods that are smaller than |E| are small neighborhoods. We assume skewed databases that consist of a few large neighborhoods and many small neighborhoods. Table I summarizes the notation used in the paper.

A fast and precise string selectivity estimation is useful in a number of application areas. For example, it can be used in the context of approximate string joins and data cleansing. Gravano et al. [Gravano et al. 2001] proposed a technique to join all strings that are within edit distance d. Their technique requires an edit distance d for each string. The efficient computation of the approximate selectivity of our method, combined with a binary search, allows us to compute these distances. Similarly, the combination of our solution with binary search allows us to find the edit distance that can be used to identify groups of strings that are very close and are likely to be the results of transcription errors or typos. A subsequent data cleansing step can correct groups of strings by replacing infrequent strings with nearby frequent ones [Mazeika and Böhlen 2006].
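The brute-force baseline described above can be sketched as follows. This is a plain dynamic-programming illustration of the baseline scan, not part of the estimator, and the function names are ours.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming; insertions,
    deletions, and substitutions all cost 1."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # distances for the empty prefix of a
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal cell
        for j in range(1, n + 1):
            cur = min(dp[j] + 1,                         # deletion
                      dp[j - 1] + 1,                     # insertion
                      prev + (a[i - 1] != b[j - 1]))     # substitution
            prev, dp[j] = dp[j], cur
    return dp[n]

def brute_force_selectivity(db, sigma, d):
    """|S_d(sigma)| by scanning DB: O(n * l * |sigma|) character work."""
    return sum(1 for alpha in db if edit_distance(alpha, sigma) <= d)
```

For example, dist("wilkilson", "wilkilsen") = 1, matching the substitution example in the text.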
Table I. Notations Used in the Paper

  A                            The alphabet of strings
  α, β, γ, ...                 Strings
  α[i..j]                      Substring starting at the i-th and ending at the j-th position
  DB = {α_1, ..., α_n}         Database of strings
  n                            Number of database strings
  R                            Number of neighborhoods
  l                            String length
  |E|                          The size of the smallest neighborhood that is approximated robustly
  σ                            Query string
  S_d(σ)                       The set of strings that are within edit distance d from σ
  Â                            Approximation of A
  ε                            Approximation error
  |A|, |α|                     Cardinality of bag A, length of string α
  (q, s)                       A q-gram (cf. Section 4)
  Q(α)                         The set of q-grams of string α (cf. Section 4)
  J(Q(α), Q(σ))                The Jaccard measure of two sets (cf. Section 4)
  X_α^1, ..., X_α^K            A signature vector of α (cf. Section 4)
  D(q, s)                      Inverse string of q-gram (q, s) (cf. Section 5.1)
  ρ(L, A_1, ..., A_M)          M choose L similarity of sets A_1, ..., A_M (cf. Section 5.1)
  Y_{q,s}^1, ..., Y_{q,s}^K    A signature vector for the inverse string D(q, s) (cf. Section 5.3)
  Ỹ^1, ..., Ỹ^K                A clustered signature vector (cf. Section 5.4)
  Z^1, ..., Z^K                The signature vector for ∪_{(q,s)∈Q(σ)} D(q, s) (cf. Section 5.3)
3. RELATED WORK

We discuss, in turn, related work in the areas of approximate string matching, histograms, substring selectivity, text and information retrieval, and automatic spell checking.

Broder [Broder 2000] uses q-grams and signatures to identify duplicate documents and strings. The method for approximate string/document matching and approximate joins [Navarro 2001; Ukkonen 1983] scans one or two tables, computes the sets of q-grams for each string (the sets of tokens), and compresses the sets with the help of signatures. The signatures are then used to join strings with high similarity. We used this approach to implement the HSol string selectivity estimator illustrated in Figure 2.

[Fig. 2. The Idea of HSol, q = 2. Each database string (anderson, brooks, froyd, simon, schwartz, wilkilson) is mapped to its set of 2-grams, e.g., anderson → {#a, an, nd, ..., n$}, and each set of 2-grams is compressed into a signature {X_i^1, ..., X_i^k}.]

HSol counts the database strings that have a large number of q-grams in common
with query string σ:

  |S_d(σ)| ≈ Σ_{α ∈ DB} 1{|Q(α) ∩ Q(σ)| ≥ L}¹    (3)

The similarity (the Jaccard measure of two sets) and the size of sets have been used to approximate Equation (3):

  |S_d(σ)| ≈ Σ_{α ∈ DB} 1{J(Q(α), Q(σ)) · |Q(α) ∪ Q(σ)| ≥ L},    (4)
where J(Q(α), Q(σ)) = |Q(α) ∩ Q(σ)|/|Q(α) ∪ Q(σ)| is the Jaccard measure of sets Q(α) and Q(σ). The similarity and the size of sets can be estimated efficiently with the help of signatures (cf. Section 4). Theoretical and experimental results have validated this approach for long strings [Navarro 2001; Ukkonen 1983]. The time complexity of the techniques is at least linear wrt the number of database strings. Another class of approximate string algorithms introduces two steps to compute the neighborhood Sd (σ) (and the selectivity |Sd (σ)|) [Sahinalp et al. 2003; Sheu et al. 2005]. In the first step non-relevant strings are pruned by clustering the strings. In the second step the remaining candidate strings are scanned and the exact edit distances are computed. The strings that are within edit distance d are returned as the answer. This technique prunes all neighborhoods whose centroids are far away from query string σ (triangle inequality). The pruning technique works well if the database contains a large number of small neighborhoods. In contrast, we focus on skewed databases with a few large neighborhoods. In this setup a pruning strategy is not effective since the expensive edit distance computation will be required for all strings in a neighborhood. Sepia [Jin and Li 2005] clusters the space of input strings, and computes local and global histograms to compute the approximate string selectivity. Sepia enjoys a good estimation precision and the histogram can be updated incrementally as long as the clustering does not change. The query time of Sepia depends on the data (the number of clusters, size of clusters), since during querying edit distances have to be computed. Traditionally, histograms have been used for selectivity estimation in databases (see, e.g., [Muralikrishna and DeWitt 1988; Blohsfeld et al. 1999; Matias et al. 1998]). Histograms work well for numerical attributes [Jagadish et al. 1998; Matias et al. 
2000; Muralikrishna and DeWitt 1988; Poosala et al. 1996]. It is possible to use value-range histograms by sorting strings according to their lexicographical order. However, the lexicographical order does not produce good approximations since a string α1 may be next to string α2 in lexicographical order (e.g., when α1 is a prefix of α2 ) even if the edit distance between them is large. The task of substring selectivity is to calculate (or estimate) the number of strings the given query string is a substring of (see e.g., [Jin et al. 2005; Chaudhuri et al. 2004; Jin et al. 2003; Chen et al. 2003; Jagadish et al. 2000; Krishnan et al. 1996]). The techniques build a tree data structure similar to a trie and prune the tree to make it fit into memory. The proposed solutions provide information about the 1 We
use 1{Pred} as a shorthand for
1 iff Pred 0 otherwise
ACM Transactions on Database Systems, Vol. V, No. N, Month 20YY.
selectivity of substrings, which is different from the approximate string selectivity. The substring selectivity does not provide information about the selectivity of an approximate string query.

There is also a substantial body of work in text and information retrieval [Salton and McGill 1986; Broder 1998; Frakes and Yates 1992]. The setting of work in this area is quite different, however. Information retrieval techniques consider large documents. In our setting we consider a large database of short strings. Thus, the comparison of strings has to be done on a different scale: for a given query string we need a much more precise control of the similarity of strings.

Finally, there is related work in the area of automatic spell checking (cf. [Kukich 1992; Hodge and Austin 2003] for an overview). Typically, the techniques compare the misspelled word to a given dictionary or a model of the dictionary. They output a correction (or a set of corrections). The developed methods are based on various techniques, including edit distance, q-grams, probabilistic techniques, Bayesian methods, neural networks, and Markov models, and they assume a dictionary of correct words.

4. BACKGROUND

This section introduces q-grams, signatures, and similarities of sets. We also present the HSol string selectivity estimator, which approximates the edit distance with q-grams and signature vectors for each string [Broder 2000].

4.1 Q-grams

Given a string σ, its q-grams are obtained by sliding a window of size q over the characters of σ. Since at the beginning and at the end of the string we have fewer than q characters, we extend the alphabet A with two new characters "#" and "$", not in A, and modify the string by prefixing it with q − 1 occurrences of # and suffixing it with q − 1 occurrences of $.

Definition 4.1. [Q-gram] A q-gram of a string σ is a substring σ[i .. i + q − 1]. We write Q(σ) for the bag (multiset) of all possible q-grams of σ.

We encode bags of q-grams as sets: each element e in the bag is encoded as a pair (e, s), where e is the element of the bag and s is the sequence number of the element. For each distinct element the sequence numbers start at 1. For example, the bag {{a, a, b}} is encoded as the set {(a, 1), (a, 2), (b, 1)}. This encoding simplifies equations and extends the concepts and results of set similarity to bags of q-grams.

Example 4.1. [Q-grams] The q-grams for q = 2 and string "wilkilson" are

  Q(wilkilson) = {(#w, 1), (wi, 1), (il, 1), (lk, 1), (ki, 1), (il, 2), (ls, 1), (so, 1), (on, 1), (n$, 1)}.

Note that we do not encode bags as sets where elements are associated with their multiplicity. Our encoding allows us to directly reuse standard set operations, e.g.,
  {{a, a, b, b}} ∩ {{a, a, a, b}} := {(a, 1), (a, 2), (b, 1), (b, 2)} ∩ {(a, 1), (a, 2), (a, 3), (b, 1)}
                                  = {(a, 1), (a, 2), (b, 1)} := {{a, a, b}}.

Q-grams can be used to approximate the edit distance:

Lemma 4.1. [q-gram approximation [Gravano et al. 2001]]. Consider strings α and σ of lengths |α| and |σ|. If α and σ are within edit distance d, then strings σ and α share at least max{|σ|, |α|} − 1 − q · (d − 1) q-grams. Thus, the d-neighborhood of σ consists of the following strings:

  Ŝ_d(σ) = {α : |Q(σ) ∩ Q(α)| ≥ L},    (5)

where L = max{|σ|, |α|} − 1 − q · (d − 1), and Ŝ_d(σ) denotes the estimation of S_d(σ).

Example 4.2. [q-gram approximation]. Let σ = "froyd", α = "freud", and q = 2. Then according to Lemma 4.1, α ∈ Ŝ_2(σ) if |Q(α) ∩ Q(σ)| ≥ 2. Indeed,

  |Q(α) ∩ Q(σ)| = |{(#f, 1), (fr, 1), (d$, 1)}| = 3.    (6)
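Definition 4.1's bag encoding and the shared-gram count of Example 4.2 can be reproduced in a few lines; the helper names below are ours, not the paper's.

```python
from collections import Counter

def qgram_bag(s, q=2):
    """Bag of q-grams encoded as (gram, sequence-number) pairs, so that
    standard set operations apply directly (cf. Definition 4.1)."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    seen, bag = Counter(), set()
    for i in range(len(padded) - q + 1):
        g = padded[i:i + q]
        seen[g] += 1
        bag.add((g, seen[g]))
    return bag

# Example 4.2: "froyd" and "freud" share three 2-grams; with
# L = max(5, 5) - 1 - 2 * (2 - 1) = 2 the strings fall within distance 2.
shared = qgram_bag("froyd") & qgram_bag("freud")
```

Note how the repeated 2-gram "il" of "wilkilson" yields the two distinct elements (il, 1) and (il, 2), exactly as in Example 4.1.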
Lemma 4.1 allows us to estimate the selectivity in string databases in O(l · n) time, where l is the average length of a database string and n is the number of database strings. The estimation procedure is the following. For each α ∈ DB compute |Q(σ) ∩ Q(α)|. If |Q(σ) ∩ Q(α)| ≥ max{|σ|, |α|} − 1 − (d − 1) · q, then α is in the estimated neighborhood (we write α ∈ Ŝ_d(σ)); otherwise α is not in the neighborhood. The size of the intersection |Q(α) ∩ Q(σ)| can be expressed in terms of the similarity and the size of sets:

  |Q(α) ∩ Q(σ)| = J(Q(α), Q(σ)) · |Q(α) ∪ Q(σ)|,    (7)
where J(Q(α), Q(σ)) is the similarity of sets.

4.2 Signature of q-grams

We use signatures to compress Q(α). Intuitively, we replace a large set Q(α) with a small vector of real numbers (X_α^1, ..., X_α^K). This section describes the calculation of signatures and shows how the size of the neighborhood S_d(σ) can be approximated with the help of signatures. Basically, signatures allow us to estimate the size of a set and the similarity of sets (cf. Theorem 4.1 for a more precise statement) with a good precision (precision ε) and low space requirements (size O(1/ε²)). A min-wise hashing signature vector (X_α^1, ..., X_α^K) for a set Q(α) ⊆ Q is calculated in the following way:

1. Let Q = {(q_1, s_1), ..., (q_p, s_p)}. To each element (q_i, s_i) we assign an independent (0, 1) uniformly distributed number U(q_i, s_i).
2. A component of the signature vector is calculated by the following formula:

  X_α^i = min_{(q,s) ∈ Q(α)} U(q, s)
3. A signature vector for Q(α) is calculated by repeating the first two steps K times.

Example 4.3. [Min-wise hashing] Consider a database of two strings {α_1, α_2}, where α_1 = "froyd" and α_2 = "freud". The set of all 2-grams of the database is

  Q = {(#f, 1), (fr, 1), (ro, 1), (oy, 1), (yd, 1), (d$, 1), (re, 1), (eu, 1), (ud, 1)}.

Assume K = 3, i.e., the length of a signature vector is 3. To build signatures for the database strings we need to assign three random [0, 1] uniformly distributed numbers to each q-gram in the database, e.g.:

  q:      (#f,1) (fr,1) (ro,1) (oy,1) (yd,1) (d$,1) (re,1) (eu,1) (ud,1)
  U¹(q):   0.2    0.4    0.8    0.5    0.4    0.6    0.5    0.2    0.6
  U²(q):   0.2    0.6    0.4    0.4    0.8    0.5    0.3    0.9    0.7
  U³(q):   0.8    0.3    0.1    0.5    0.8    0.4    0.4    0.6    0.9

The first signature component for α_1 = "froyd" is:

  X_{α1}^1 = min{U¹(#f, 1), U¹(fr, 1), U¹(ro, 1), U¹(oy, 1), U¹(yd, 1), U¹(d$, 1)}
           = min{0.2, 0.4, 0.8, 0.5, 0.4, 0.6} = 0.2

Similarly:

  X_{α1}^2 = min{0.2, 0.6, 0.4, 0.4, 0.8, 0.5} = 0.2
  X_{α1}^3 = min{0.8, 0.3, 0.1, 0.5, 0.8, 0.4} = 0.1

Thus, the signature vector for α_1 is X_{α1} = (0.2, 0.2, 0.1). Similarly, the signature vector for α_2 is X_{α2} = (0.2, 0.2, 0.3).

The following result extends the result of Cohen [Cohen 1994] to bags and formalizes the approximation of the similarity and size of bags (cf. Equation (7)) with the help of signatures.

Theorem 4.1. [Estimation of the similarity and size of bags] Let (X_A^1, ..., X_A^K) be a signature vector of bag A, and (X_B^1, ..., X_B^K) be a signature vector of bag B. Let

  Ĵ_K(A, B) = (1/K) Σ_{j=1}^K 1{X_A^j = X_B^j},    (8)
  |Â|_K = K / (Σ_{j=1}^K X_A^j) − 1,

where 1{X_A^j = X_B^j} = 1 if X_A^j = X_B^j, and 0 otherwise. Then:

—The estimator Ĵ has the following convergence rate:

  P( abs(Ĵ_K − J) / J ≥ ε ) ≤ δ,    (9)
where ε > 0, 0 < δ < 1, K ≥ 2/ε² · 1/C_J · log(2/δ), and C_J is the smallest value for which we want to have a robust estimate of J.

—The estimator |ˆ·| has the following convergence rate:

  P( abs(|Â|_K − |A|) / |A| ≥ ε ) ≤ δ,

where ε > 0, 0 < δ < 1, K ≥ 2/ε² · log(2/δ).

Proof. The proof follows from the proof for sets [Cohen et al. 2000; Cohen 1994] and the unique encoding of a bag of q-grams as a set.

Example 4.4. [Estimation of the similarity and size of sets] We continue Example 4.3. The true similarity for strings α_1 = "froyd" and α_2 = "freud" is

  J(Q(α_1), Q(α_2)) = |{(#f, 1), (fr, 1), (d$, 1)}| / |{(#f, 1), (fr, 1), (re, 1), (ro, 1), (oy, 1), (eu, 1), (yd, 1), (ud, 1), (d$, 1)}| = 1/3.
The estimated similarity is:

  Ĵ(Q(α_1), Q(α_2)) = (1/3) · (1{X_{α1}^1 = X_{α2}^1} + 1{X_{α1}^2 = X_{α2}^2} + 1{X_{α1}^3 = X_{α2}^3})
                    = (1/3) · (1 + 1 + 0) = 2/3 ≈ 1/3.

The true number of 2-grams in Q(α_1) is 6. The approximated size of Q(α_1) is:

  |Q(α_1)|ˆ = 3 / (X_{α1}^1 + X_{α1}^2 + X_{α1}^3) − 1 = 3 / (0.2 + 0.2 + 0.1) − 1 = 5 ≈ 6.

In order to get a robust estimation, C_J should be chosen such that |Q(α) ∩ Q(σ)|/|Q(α) ∪ Q(σ)| > C_J for all α and σ. In the worst case |Q(α) ∩ Q(σ)|/|Q(α) ∪ Q(σ)| is 1/(2(l + q − 1) − 1) (when bags Q(α) and Q(σ) share only one q-gram). Thus C_J < 1/(2l).

4.3 HSol Selectivity Estimator

The HSol selectivity estimator implements Lemma 4.1 and Theorem 4.1. HSol computes a signature vector for each string in the database (usually off-line), and then uses the signatures to answer selectivity queries approximately.

The signatures are calculated in two steps. For each string α we (i) compute the set of q-grams Q(α) = {(q_1^α, 1), ..., (q_{|α|+q−1}^α, s_{|α|+q−1})}, and (ii) compress the set with the signature (X_α^1, ..., X_α^K).

The calculation of the approximate selectivity using the signatures is done in two steps. First, we compute the signature vector (X_σ^1, ..., X_σ^K) for query string σ. Second, we scan the signature vectors and for each signature (X_α^1, ..., X_α^K) estimate Ĵ(Q(α), Q(σ)) and |Q(σ) ∪ Q(α)|ˆ. This provides an estimate for the size
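A short sketch of min-wise hashing reproduces Examples 4.3 and 4.4. The U values are copied verbatim from Example 4.3; the function names are ours, and the sequence-number encoding is elided since no 2-gram repeats in these strings.

```python
def minwise_signature(bag, U):
    """Signature vector: per hash function j, the minimum U_j value over
    the bag's elements (Section 4.2)."""
    return [min(u[e] for e in bag) for u in U]

def jaccard_estimate(sig_a, sig_b):
    """Equation (8): fraction of matching signature components."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def size_estimate(sig):
    """Theorem 4.1: |A| is estimated by K / sum_j X_A^j - 1."""
    return len(sig) / sum(sig) - 1

# The fixed U values of Example 4.3 (K = 3).
grams = ["#f", "fr", "ro", "oy", "yd", "d$", "re", "eu", "ud"]
U = [dict(zip(grams, row)) for row in [
    [0.2, 0.4, 0.8, 0.5, 0.4, 0.6, 0.5, 0.2, 0.6],
    [0.2, 0.6, 0.4, 0.4, 0.8, 0.5, 0.3, 0.9, 0.7],
    [0.8, 0.3, 0.1, 0.5, 0.8, 0.4, 0.4, 0.6, 0.9],
]]
sig_froyd = minwise_signature(["#f", "fr", "ro", "oy", "yd", "d$"], U)
sig_freud = minwise_signature(["#f", "fr", "re", "eu", "ud", "d$"], U)
```

This yields the signatures (0.2, 0.2, 0.1) and (0.2, 0.2, 0.3), the similarity estimate 2/3, and the size estimate 5 of the worked examples.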
of the intersection:

  |Q(σ) ∩ Q(α)|ˆ = Ĵ(Q(σ), Q(α)) · |Q(σ) ∪ Q(α)|ˆ.

If the size of the intersection is greater than max{|α|, |σ|} − 1 − (d − 1) · q, then α is in the neighborhood of σ.

Note that Equation (9) allows us to estimate the sizes of the sets |Q(α)| and |Q(σ)|. It does not help to pre-compute the number of q-grams for each string, since we need to compute |Q(α) ∪ Q(σ)|. This expression depends on the query string σ and cannot be pre-computed together with the signatures. Therefore, we have to approximate the size of the union with the help of signatures at runtime. Note that if (X_α^1, ..., X_α^K) is a signature vector for Q(α) and (X_σ^1, ..., X_σ^K) is a signature vector for Q(σ), then the signature vector (X_{α∪σ}^1, ..., X_{α∪σ}^K) for the union is (min{X_α^1, X_σ^1}, ..., min{X_α^K, X_σ^K}). Therefore, the size of the union can be estimated as follows:

  |Q(α) ∪ Q(σ)|ˆ_K = K / (Σ_{j=1}^K X_{α∪σ}^j) − 1.
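Combining the union-signature observation with the threshold of Lemma 4.1 gives the per-string HSol test. The sketch below uses illustrative names and is exercised on the signature vectors of Example 4.5.

```python
def union_signature(sig_a, sig_b):
    """The min-wise signature of a union is the component-wise minimum."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def hsol_within_distance(sig_alpha, sig_sigma, len_alpha, len_sigma, d, q=2):
    """Estimate |Q(alpha) ∩ Q(sigma)| from signatures and compare it
    against the q-gram threshold of Lemma 4.1."""
    K = len(sig_alpha)
    sig_union = union_signature(sig_alpha, sig_sigma)
    union_size = K / sum(sig_union) - 1                             # |Q(a) u Q(s)|
    jaccard = sum(a == b for a, b in zip(sig_alpha, sig_sigma)) / K  # J estimate
    threshold = max(len_alpha, len_sigma) - 1 - (d - 1) * q
    return jaccard * union_size >= threshold

# alpha1 = "froyd" vs. sigma = "freud", d = 2 (signatures from Example 4.5):
in_neighborhood = hsol_within_distance([0.2, 0.2, 0.1], [0.2, 0.2, 0.3], 5, 5, d=2)
```

With these values the estimated intersection is 2/3 · 5 ≈ 3.3 ≥ 2, so α_1 is placed in the neighborhood, matching the example.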
Basically, the time complexity of HSol to answer approximate selectivity queries is O(n · 1/ε²), where n is the number of strings in the database and ε is the desired precision. Note that HSol is expensive and not a practical solution for large databases with many strings, since we need to scan all signatures in order to answer approximate selectivity queries.

Example 4.5. [HSol Selectivity Estimator] Consider the database {α_1, α_2, α_3} with signature vectors:

  α_1 = froyd     → (0.2, 0.2, 0.1)
  α_2 = simon     → (0.2, 0.3, 0.4)
  α_3 = wilkilson → (0.1, 0.2, 0.2),

query string σ = "freud", and distance d = 2. Assume the signature vector of the query string is X_σ = (0.2, 0.2, 0.3) (cf. Example 4.3). To determine the selectivity of the query string we scan the signatures and perform the following computations.

(i) Compute the signature vector X_{α1∪σ} for Q(α_1) ∪ Q(σ):

  X_{α1∪σ} = (min{X_{α1}^1, X_σ^1}, min{X_{α1}^2, X_σ^2}, min{X_{α1}^3, X_σ^3})
           = (min{0.2, 0.2}, min{0.2, 0.2}, min{0.1, 0.3}) = (0.2, 0.2, 0.1).

(ii) Estimate the size of the union:

  |Q(α_1) ∪ Q(σ)|ˆ = 3 / (0.2 + 0.2 + 0.1) − 1 = 5.

(iii) Estimate the similarity:

  Ĵ(Q(α_1), Q(σ)) = (1 + 1 + 0) / 3 = 2/3.

(iv) Estimate the number of 2-grams strings α_1 and σ share:

  |Q(α_1) ∩ Q(σ)|ˆ = Ĵ(Q(α_1), Q(σ)) · |Q(α_1) ∪ Q(σ)|ˆ = 2/3 · 5 ≈ 3.3.
(v) Since max{|α1|, |σ|} − 1 − (d − 1) · q = 2 and |Q(α1) ∩ Q(σ)|^ ≈ 3.3 ≥ 2, we get dist(α1, σ) ≤ 2 and α1 ∈ Ŝ2(σ).

Similarly we get that |Q(α2) ∩ Q(σ)|^ ≈ 1 (dist(α2, σ) > 2) and |Q(α3) ∩ Q(σ)|^ ≈ 1 (dist(α3, σ) > 2). Therefore, |Ŝ2(σ)| = 1.

5. VSOL SELECTIVITY ESTIMATION
The VSol estimator is based on a new similarity, the M-choose-L similarity, and on inverse strings. It allows us to identify database strings that share a large number of q-grams with the query string in time independent of the size of the database. The inverse string D(q, s) of a q-gram (q, s) is the set of IDs of the database strings that contain (q, s). Let (qi1, si1), ..., (qiM, siM) be the q-grams of the query string σ. We then use {D(qi1, si1), ..., D(qiM, siM)} to calculate the selectivity of σ. Since the sets D(q, s) can be large, we compress them with the help of signatures.

5.1 Inverse Strings

Let Q be the set of all q-grams in database DB and let (q, s) ∈ Q. We define the inverse string of a q-gram (q, s) as follows.

Definition 5.1. [Inverse string of a q-gram] The inverse string of a q-gram (q, s) is the set D(q, s):

  D(q, s) = {i : αi ∈ DB ∧ (q, s) ∈ Q(αi)}.

Example 5.1. Let DB = {α1, α2, α3, α4, α5, α6} be a string database, where
α1 = anderson, α2 = brooks, α3 = froyd, α4 = simon, α5 = schwartz, α6 = wilkilson.
Then D(on, 1) = {1, 4, 6} and D(ro, 1) = {2, 3}.

The inverse string D(q, s) consists of the IDs of the database strings that contain the q-gram (q, s). We are interested in the string IDs that contain a large number of the query string's q-grams. Below we establish an expression over the D(q, s) that selects these string IDs.

5.2 String Selectivity Using Inverse Strings

In this section we rewrite |Sd(σ)| (cf. Lemma 4.1) in terms of the D(q, s):

  |Ŝd(σ)| = |{α : |Q(α) ∩ Q(σ)|^ ≥ max{|α|, |σ|} − 1 − (d − 1) · q}|.
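The inverse strings of Example 5.1 can be built in a single scan over the database. The following C++ sketch is illustrative only; the function names are our own, and q-grams are represented as (substring, occurrence) pairs over strings padded with q−1 '#' and '$' characters, mirroring the examples.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using QGram = std::pair<std::string, int>;  // (substring, occurrence number)

// Positional q-grams of a string padded with q-1 '#' and '$' characters,
// e.g. the 2-grams of "froyd" are (#f,1), (fr,1), (ro,1), (oy,1), (yd,1), (d$,1).
std::set<QGram> qgrams(const std::string& s, int q) {
    std::string padded = std::string(q - 1, '#') + s + std::string(q - 1, '$');
    std::map<std::string, int> occurrences;  // occurrence counter per substring
    std::set<QGram> result;
    for (std::size_t i = 0; i + q <= padded.size(); ++i) {
        std::string g = padded.substr(i, q);
        result.insert({g, ++occurrences[g]});
    }
    return result;
}

// Inverse strings: map each q-gram to the set of IDs of the database
// strings that contain it (1-based IDs, as in Example 5.1).
std::map<QGram, std::set<int>> inverseStrings(const std::vector<std::string>& db, int q) {
    std::map<QGram, std::set<int>> D;
    for (std::size_t id = 0; id < db.size(); ++id)
        for (const QGram& g : qgrams(db[id], q))
            D[g].insert(static_cast<int>(id) + 1);
    return D;
}
```

On the database of Example 5.1 this yields D(on, 1) = {1, 4, 6} and D(ro, 1) = {2, 3}.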
We approximate the expression max{|α|, |σ|} with |σ|:

  |Ŝd(σ)| ≈ |{α : |Q(α) ∩ Q(σ)|^ ≥ |σ| − 1 − (d − 1) · q}|.   (10)
This approximation is good if the length of the query string is close to the average length of the database strings. If the database strings are significantly longer than the query string, the approximation might include some false positives.

Let (q, s) and (q′, s′) be two q-grams of σ. Then D(q, s) ∩ D(q′, s′) yields all database strings that have both (q, s) and (q′, s′) among their q-grams. If (qi1, si1), ..., (qiL, siL) are L q-grams of σ, where L = |σ| − 1 − (d − 1) · q, then D(qi1, si1) ∩ ··· ∩ D(qiL, siL) yields strings that are approximately within edit distance d from σ. The union over all combinations of L q-grams of σ approximately gives us the strings that are within edit distance d from query string σ:

  Sd(σ) ≈ ∪_{(qi1,si1),...,(qiL,siL)∈Q(σ)} D(qi1, si1) ∩ ··· ∩ D(qiL, siL)   (11)
Example 5.2. [String selectivity using inverse strings] Consider the database DB = {α1, α2, α3} with α1 = froyd, α2 = royd, and α3 = froid. The 2-grams of the database are

  Q = {(#f, 1), (fr, 1), (ro, 1), (oy, 1), (yd, 1), (d$, 1), (#r, 1), (oi, 1), (id, 1)},

and the inverse strings for the 2-grams are the following:

  D(#f, 1) = {1, 3}, D(fr, 1) = {1, 3}, D(ro, 1) = {1, 2, 3},
  D(oy, 1) = {1, 2}, D(yd, 1) = {1, 2}, D(d$, 1) = {1, 2, 3},
  D(#r, 1) = {2}, D(oi, 1) = {3}, D(id, 1) = {3}.

Assume query string σ = froyd and d = 1. Then M = |Q(σ)| = 6 and L = |froyd| − 1 − (1 − 1) · 2 = 4. Thus, all sets of the form

  D(qi1, si1) ∩ D(qi2, si2) ∩ D(qi3, si3) ∩ D(qi4, si4)   (12)

are part of the neighborhood S1(σ), where (qi1, si1), (qi2, si2), (qi3, si3), and (qi4, si4) are different 2-grams of the query string. The union of all such intersections yields S1(σ). There are (6 choose 4) = 15 different intersections. For example,

  D(#f, 1) ∩ D(fr, 1) ∩ D(oy, 1) ∩ D(yd, 1) = {1, 3} ∩ {1, 3} ∩ {1, 2} ∩ {1, 2} = {1}.

Therefore, {1} ⊂ S1(σ). Similarly,

  D(ro, 1) ∩ D(oy, 1) ∩ D(yd, 1) ∩ D(d$, 1) = {1, 2, 3} ∩ {1, 2} ∩ {1, 2} ∩ {1, 2, 3} = {1, 2}.

Therefore, {1, 2} ⊂ S1(σ).
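A naive evaluation of Equation (11) enumerates all L-subsets of the query's q-grams and unions their intersections. The C++ sketch below does exactly that; it is illustrative only (exponential in the number of query q-grams, and for brevity q-grams are plain strings, i.e., occurrence numbers are omitted).

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

// Union over all L-subsets of the query q-grams of the intersection of
// their inverse strings (Equation (11)). Subsets are enumerated as
// bitmasks, so this sketch assumes fewer than 32 query q-grams.
std::set<int> neighborhood(const std::map<std::string, std::set<int>>& D,
                           const std::vector<std::string>& queryGrams, int L) {
    int M = static_cast<int>(queryGrams.size());
    std::set<int> result;
    for (unsigned mask = 0; mask < (1u << M); ++mask) {
        if (__builtin_popcount(mask) != L) continue;
        std::set<int> inter;
        bool first = true;
        for (int i = 0; i < M && (first || !inter.empty()); ++i) {
            if (!(mask & (1u << i))) continue;
            auto it = D.find(queryGrams[i]);
            std::set<int> ids = (it == D.end()) ? std::set<int>{} : it->second;
            if (first) { inter = ids; first = false; }
            else {
                std::set<int> tmp;
                std::set_intersection(inter.begin(), inter.end(),
                                      ids.begin(), ids.end(),
                                      std::inserter(tmp, tmp.begin()));
                inter.swap(tmp);
            }
        }
        result.insert(inter.begin(), inter.end());
    }
    return result;
}
```

On the inverse strings of Example 5.2 with L = 4 this returns the neighborhood {1, 2, 3}.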
Equation (11) uses inverse strings to define the selectivity of approximate string queries. The naive computation of the expression is expensive. The next section shows how to estimate the selectivity with the help of the M-choose-L similarity and the size of the union. Our algorithm requires one scan of the relevant signature vectors to estimate the similarity and the size of the union of bags (cf. Section 5.3).

We conclude this section by proving that, without a signature approximation, VSol and HSol are equivalent. Note that the neighborhood as defined by HSol consists of the actual strings, whereas the neighborhood as defined by VSol consists of string IDs. This difference is not relevant for the purpose of this paper. Without loss of generality we assume that the index of string αi is i.

Theorem 5.1. [Equivalence of VSol and HSol] Let H = {α : |Q(α) ∩ Q(σ)| ≥ L} be the estimated neighborhood of the HSol estimator and let

  V = ∪_{(qi1,si1),...,(qiL,siL)∈Q(σ)} D(qi1, si1) ∩ ··· ∩ D(qiL, siL)   (13)

be the estimated neighborhood of the VSol estimator, where (qi1, si1), ..., (qiL, siL) are different q-grams of query string σ. Then H = V.

Proof. We need to show that H = V, i.e., (i) αi ∈ H ⟹ i ∈ V and (ii) i ∈ V ⟹ αi ∈ H.

We investigate case (i) first. Assume αi ∈ H. Then αi and σ share at least L q-grams. We denote these q-grams by (qi1, si1), (qi2, si2), ..., (qiL, siL). Since (qi1, si1) is a q-gram of αi, we get i ∈ D(qi1, si1). Similarly, i ∈ D(qi2, si2), ..., i ∈ D(qiL, siL). Hence

  i ∈ D(qi1, si1) ∩ D(qi2, si2) ∩ ··· ∩ D(qiL, siL) ⊂ ∪_{(qi1,si1),...,(qiL,siL)∈Q(σ)} D(qi1, si1) ∩ ··· ∩ D(qiL, siL) = V.

This proves (i). The proof of case (ii) is similar. Assume i ∈ V. Then there exist (qi1, si1), (qi2, si2), ..., (qiL, siL) ∈ Q(σ) such that i ∈ D(qi1, si1) ∩ D(qi2, si2) ∩ ··· ∩ D(qiL, siL). This means that all these q-grams belong to Q(αi): (qi1, si1), ..., (qiL, siL) ∈ Q(αi). Therefore |Q(αi) ∩ Q(σ)| ≥ L, and αi ∈ H.
5.3 Signatures of Inverse Strings

VSol uses signatures to compress the inverse strings D(q, s) and to quickly approximate the similarity and the size of sets. First, we define the M-choose-L similarity of sets. Next, we develop a method to estimate the M-choose-L similarity of sets and the size of the union of sets with the help of the signatures.

Note that the signatures of VSol and HSol are semantically different. HSol compresses sets of q-grams, while VSol compresses sets of string IDs. We write X for signature vectors of HSol and Y for signature vectors of VSol to emphasize this difference. Algorithmically there is no difference between the signatures.

Figure 3 gives the intuition of VSol signatures. The figure illustrates the q-grams (qi, si) of the database together with their inverse strings D(qi, si); each inverse string is associated with a vector of real numbers (its signature).

Fig. 3. Signatures for Inverse Strings

Signatures exhibit the following properties. Set D(q1, s1) is a relatively large set, therefore most of the
signature values are small. Set D(q2, s2) is a small set, hence most of the signature values are large. Sets D(q3, s3) and D(q4, s4) are similar, therefore most of their signature components are equal. We use these properties of signatures to estimate the size of the union and the similarity.

Definition 5.2. [M-choose-L similarity] Let D = {D(qi1, si1), ..., D(qiM, siM)} be a set of inverse strings and let L ≤ M. The M-choose-L similarity of the sets is

  ρ(L, D(qi1, si1), ..., D(qiM, siM)) = |∪_{(q′i1,s′i1),...,(q′iL,s′iL)∈D} D(q′i1, s′i1) ∩ ··· ∩ D(q′iL, s′iL)| / |D(qi1, si1) ∪ ··· ∪ D(qiM, siM)|   (14)
Intuitively, the M-choose-L similarity of {D(qi1, si1), ..., D(qiM, siM)} measures how similar any L of the sets are in the context of D(qi1, si1) ∪ ··· ∪ D(qiM, siM). The more any L sets overlap, the higher the similarity. Since there is more than one combination
of L sets, we calculate the union over all possible combinations (cf. the numerator in Equation (14)).

Example 5.3. [M-choose-L similarity] We continue Example 5.2 and compute the similarity ρ(4, D(#f, 1), D(fr, 1), D(ro, 1), D(oy, 1), D(yd, 1), D(d$, 1)). From

  ∪_{(q′i1,s′i1),...,(q′i4,s′i4)∈Q(froyd)} D(q′i1, s′i1) ∩ ··· ∩ D(q′i4, s′i4) = {1, 2, 3}

(the combination D(#f, 1) ∩ D(fr, 1) ∩ D(ro, 1) ∩ D(d$, 1) = {1, 3} contributes string ID 3) and

  D(#f, 1) ∪ D(fr, 1) ∪ D(ro, 1) ∪ D(oy, 1) ∪ D(yd, 1) ∪ D(d$, 1) = {1, 2, 3},

we get ρ(4, D(#f, 1), D(fr, 1), D(ro, 1), D(oy, 1), D(yd, 1), D(d$, 1)) = 1.
Let (Yi^1, ..., Yi^K) be the signature vector of the set D(qi, si). The following result formalizes the approximation of the M-choose-L similarity of sets and of the size of the union with the help of signatures.

Theorem 5.2. Let {D(qi1, si1), ..., D(qiM, siM)} be a set of inverse strings. Let Yi = (Yi^1, ..., Yi^K) be the signature vector of D(qi, si), i = 1, ..., M. Let

  (Z^1, ..., Z^K) = (min_{i=1,...,M} Yi^1, ..., min_{i=1,...,M} Yi^K)

be the signature vector of the union ∪i D(qi, si), and let

  ρ̂K(L, D(qi1, si1), ..., D(qiM, siM)) = (1/K) Σ_{j=1}^{K} 1{∃ i1, ..., iL : Y_{i1}^j = ··· = Y_{iL}^j = Z^j},   (15)

  |D(qi1, si1) ∪ ··· ∪ D(qiM, siM)|^_K = K / Σ_{j=1}^{K} Z^j − 1.   (16)

Then:

—The estimator ρ̂ has the following convergence rate: P(abs((ρ̂K − ρ)/ρ) ≥ ε) ≤ δ, where ε > 0, 0 < δ < 1, K ≥ 2/ε² · 1/Cρ · log(2/δ), and Cρ is the smallest similarity value for which we require a robust estimate of ρ.

—The estimator |ˆ·| has the following convergence rate: P(abs((|ˆ·|_K − |·|)/|·|) ≥ ε) ≤ δ, where ε > 0, 0 < δ < 1, K ≥ 2/ε² · log(2/δ).

Proof. The proof builds on the proof of Theorem 4.1. We give the idea of the proof for the M-choose-L similarity; the proof for the size of the union is analogous.
From

  P(∃ i1, ..., iL : Y_{i1}^j = ··· = Y_{iL}^j = Z^j) = ρ(L, D(qi1, si1), ..., D(qiM, siM))

we get

  E 1{∃ i1, ..., iL : Y_{i1}^j = ··· = Y_{iL}^j = Z^j} = ρ(L, D(qi1, si1), ..., D(qiM, siM)) = ρ.

Note that we often omit the arguments of the similarity in the mathematical expressions (we write ρ instead of ρ(L, D(qi1, si1), ..., D(qiM, siM))). The Chernoff bounds for the random variable

  T = Σ_{j=1}^{K} 1{∃ i1, ..., iL : Y_{i1}^j = ··· = Y_{iL}^j = Z^j}

complete the proof. We need to prove that

  P(abs((ρ̂ − ρ)/ρ) ≥ ε) ≤ δ   (17)

for all K larger than 2/ε² · 1/Cρ · log(2/δ). This is equivalent to

  P((1 − ε)ρ ≤ ρ̂ ≤ (1 + ε)ρ) > 1 − δ.

This is true if

  P(ρ̂ ≥ (1 + ε)ρ) ≤ δ/2,   (18)
  P(ρ̂ ≤ (1 − ε)ρ) ≤ δ/2.   (19)

Both equations can be proved with the help of Chernoff's bounds. We prove Equation (18); the proof of Equation (19) is analogous. Indeed,

  P(ρ̂ ≥ (1 + ε)ρ) = P((1/K) T ≥ (1 + ε)ρ) = P(T ≥ (1 + ε)Kρ) = P(T ≥ (1 + ε)ET).

The Chernoff bound for T is P(T ≥ (1 + ε)ET) ≤ e^{−(ε²/2)·ET}. Therefore,

  P(ρ̂ ≥ (1 + ε)ρ) ≤ e^{−(ε²/2)·ET} = e^{−(ε²/2)·Kρ},

which is less than δ/2 for all K greater than 2/(ε²ρ) · log(2/δ), and hence, since ρ ≥ Cρ, for all K greater than 2/ε² · 1/Cρ · log(2/δ).

Figure 4 illustrates Theorem 5.2 with L = 2 and M = 7. Figure 4(a) illustrates the estimation of the similarity of two sets. We assign uniform random numbers to each element of Q. The higher the similarity of sets D(qi1, si1) and D(qi2, si2), the more the sets overlap, and the bigger the chance that the minimum values of sets D(qi1, si1) and D(qi2, si2) coincide (cf. the solid circles in Figure 4(a)). The similarity of any two sets is considered wrt the sets {D(qi1, si1), ..., D(qi7, si7)}. Therefore, 1{Y_{(qi1,si1)}^j = Y_{(qi2,si2)}^j = Z^j} yields 1 only
Fig. 4. Estimating the Similarity and the Size of Sets. (a) Similarity of Sets; (b) Size of a Set.
if the smallest value of the intersection D(qi1, si1) ∩ D(qi2, si2) is the same as the smallest value of the union D(qi1, si1) ∪ ··· ∪ D(qi7, si7). Figure 4(b) illustrates the estimation of the number of elements in a set D(q, s): the bigger the set, the more small values it contains, and therefore the bigger the estimated value of the size |ˆ·| (cf. Equation (16)).

In order to get a robust estimation, Cρ should be chosen smaller than 1/R. Intuitively, this derives from the following observations. Assume the database consists of R neighborhoods of similar size. The expected selectivity is n/R, and the expected size of the union is n. Therefore, similarity values will typically be around 1/R, and Cρ should be smaller than 1/R. If |E| is the size of the smallest neighborhood we need to approximate robustly, then Cρ should be smaller than |E|/n.

Theorem 5.2 permits the efficient computation of the M-choose-L similarity of sets and of the size of the union. Both the similarity and the size of the union can be computed in one scan of the signature vectors.

Example 5.4. [Estimation of the M-choose-L similarity and the size of the union] Consider the following signatures:

  Y(#f,1) = (0.1, 0.2, 0.3)   Y(fr,1) = (0.1, 0.2, 0.3)   Y(ro,1) = (0.1, 0.2, 0.3)
  Y(oy,1) = (0.1, 0.3, 0.3)   Y(yd,1) = (0.1, 0.2, 0.3)   Y(d$,1) = (0.1, 0.2, 0.3)
  Y(oi,1) = (0.3, 0.4, 0.4)   Y(id,1) = (0.3, 0.4, 0.4)   Y(re,1) = (0.4, 0.3, 0.3)
  Y(eu,1) = (0.4, 0.3, 0.3)   Y(ud,1) = (0.4, 0.3, 0.4)

The 5-choose-2 similarity of D = {D(#f, 1), D(fr, 1), D(ro, 1), D(oy, 1), D(d$, 1)} is estimated in two steps. First we compute the signature vector of the union ∪_{Di∈D} Di:

  (Z^1, Z^2, Z^3) = (min(0.1, 0.1, 0.1, 0.1, 0.1), min(0.2, 0.2, 0.2, 0.3, 0.2), min(0.3, 0.3, 0.3, 0.3, 0.3)) = (0.1, 0.2, 0.3).

Second, we estimate the 5-choose-2 similarity with the help of Equation (15):

  ρ̂(2, D(#f, 1), D(fr, 1), D(ro, 1), D(oy, 1), D(d$, 1))
  = (1/3) · (1{0.1 occurs at least 2 times in (0.1, 0.1, 0.1, 0.1, 0.1)} + 1{0.2 occurs at least 2 times in (0.2, 0.2, 0.2, 0.3, 0.2)} + 1{0.3 occurs at least 2 times in (0.3, 0.3, 0.3, 0.3, 0.3)}) = (1 + 1 + 1)/3 = 1.

The estimate of the size of the union |D(#f, 1) ∪ D(fr, 1) ∪ D(ro, 1) ∪ D(oy, 1) ∪ D(d$, 1)| is computed in the following way:

  |D(#f, 1) ∪ D(fr, 1) ∪ D(ro, 1) ∪ D(oy, 1) ∪ D(d$, 1)|^ = 3/(0.1 + 0.2 + 0.3) − 1 = 4.

5.4 Clustering the Signatures Space

Large skewed databases yield inverse strings, and therefore signatures, that are similar to each other. It is possible to substantially reduce the memory usage without a loss of precision by clustering the signature space. This section explains how to apply the k-means clustering technique to reduce the memory usage of the estimator. Throughout the section we assume that the reader is familiar with the basics of clustering and the k-means clustering algorithm [Jain et al. 1999; Jain and Dubes 1988].

We use three steps to decrease the number of signatures: (i) cluster the signature space with the k-means clustering algorithm, (ii) compute a representative signature for each cluster, and (iii) replace all signatures in a cluster with its representative.

In the first step (clustering step) we choose k signatures and run the k-means clustering algorithm with the Euclidean distance for a fixed number of iterations. The convergence rate and the number of iterations of k-means depend on the choice of the initial centroids. In our experience 5–7 iterations are enough to get a robust clustering of the signatures; we used 10 iterations to cluster the signatures.

In the second step we compute a representative signature for each cluster of signatures. The representative signature Y˜ for a cluster Y1, Y2, ..., Yc is the component-wise minimum vector:

  (Y˜^1, Y˜^2, ..., Y˜^K) = (min_{i=1,...,c} Yi^1, min_{i=1,...,c} Yi^2, ..., min_{i=1,...,c} Yi^K).

Remember that Y1 represents the set D(q1, s1), Y2 represents the set D(q2, s2), etc. Therefore the signature Y˜ represents the set ∪_{i=1,...,c} D(qi, si). In the third step the signatures Y1, Y2, ..., Yc are replaced by the signature Y˜, thereby decreasing the number of signatures.

5.5 The VSol Selectivity Estimator

Theorem 5.2 is the final building block for the VSol selectivity estimator. We define the VSol selectivity estimator as follows:
  |Ŝd(σ)| = ρ̂K(L, D(qi1, si1), ..., D(qiM, siM)) × |D(qi1, si1) ∪ ··· ∪ D(qiM, siM)|^_K
          = ((1/K) Σ_{j=1}^{K} 1{∃ i1, ..., iL : Y˜_{i1}^j = ··· = Y˜_{iL}^j = Z^j}) × (K/(Z^1 + ··· + Z^K) − 1),   (20)
where Q(σ) = {(qi1 , si1 ), . . . , (qiM , siM )} and L = |σ| − 1 − (d − 1) · q. Figure 5 illustrates the VSol estimator. The method starts with the clustered signatures of inverse strings (cf. (q1 , s1 ), . . . , (qp , sp ) pointing to the clustered signature vectors). Then the method calculates the q-grams of the query string (cf. σ = wilkilson), identifies the clustered signatures Y˜i of the q-grams of the query string, and computes the signature Z of the union. Finally, we scan the clustered signature vectors of the q-grams of query string σ and the signature vector of the union to estimate the similarity and the size of the union.
Fig. 5. The VSol Estimator (clustered signatures Y˜1, ..., Y˜M and the signature Z of their union; the estimator scans the j-th coordinates of the clustered signature vectors).
Algorithm 5.1 constructs the signatures of a database DB. The algorithm starts by initializing the signatures and then updates them by processing the data strings α1, ..., αk in sequence. For each data string α the algorithm identifies the signatures of Q(α) and updates them (step 2). Then it reduces the number of signatures with the help of k-means clustering (step 3).

Algorithm 5.1. [VSol (Construction of Clustered Signatures)]
Input:  DB = {α1, ..., αk}: database of strings
        k: number of clustered signatures
Output: Y: signatures for database DB. Y_{(q,s)}^j is the j-th coordinate of the signature vector of D(q, s).

0. Initialize Y: Y_{(q,s)}^j = ∞ for all (q, s) and j
1. Seed K hash functions h^1, ..., h^K
2. FOR EACH αi ∈ DB DO
   2.1 Compute Q(αi). FOR EACH (q, s) ∈ Q(αi) DO
       2.1.1 FOR EACH hash function h^j DO
             2.1.1.1 Y_{(q,s)}^j = min(Y_{(q,s)}^j, h^j(i))
3. Cluster signatures. Select random signatures Y˜1, ..., Y˜k to be the centers in k-means
   3.1 Repeat the following steps 10 times
       3.1.1 Assign signatures to the centers: scan the signatures Y_{(q,s)}. Y_{(q,s)} is assigned to center Y˜i iff d(Y_{(q,s)}, Y˜i) = min_{j=1,...,k} d(Y_{(q,s)}, Y˜j), where d(·,·) is the Euclidean distance
       3.1.2 Update the center of each cluster: for each center Y˜j
             3.1.2.1 Scan all signatures Y_{(q1,s1)}, ..., Y_{(qm,sm)} assigned to center Y˜j and update the center: Y˜j = min_{i=1,...,m} Y_{(qi,si)} (component-wise)
   3.2 Decrease the number of signatures: for each center Y˜i replace all signatures Y_{(q,s)} assigned to Y˜i with the center signature Y˜i
The algorithm to estimate the selectivity, i.e., the size of the neighborhood of a query string, is presented in Algorithm 5.2. The algorithm implements Equation (20). It identifies the clustered signatures associated with the q-grams of the query string Q(σ) = {(qi1, si1), ..., (qiM, siM)} and calculates the signature of the union D(qi1, si1) ∪ ··· ∪ D(qiM, siM). Then it scans the coordinates of the clustered signatures (cf. the j-th coordinates, Figure 5) and computes the sums of Equation (20).

Algorithm 5.2. [VSol (Selectivity Estimation)]
Input:  Y: clustered signatures for database DB. Y˜_{(q′,s′)}^j is the j-th coordinate of the signature vector of inverse string D(q′, s′).
        σ: query string
        d: distance
Output: sel = |Ŝd(σ)|

0. sel = 0, union_size = 0
1. Calculate Q(σ) = {(qi1, si1), ..., (qiM, siM)}
2. For each (qi, si) identify the associated clustered signature Y˜i = (Y˜i^1, ..., Y˜i^K)
3. Calculate (Z^1, ..., Z^K), the signature for ∪ D(qi, si), where Z^j = min_{i=1,...,M} {Y˜i^j}
4. Calculate the size of the union: union_size = K / Σ_{j=1}^{K} Z^j − 1
5. Scan the j-th coordinates of the signature vectors and count the coordinates that are equal to the minimal value:
   FOR EACH j = 1, ..., K DO
   5.1 L = |{i : Y˜i^j = Z^j}|
   5.2 IF L ≥ |σ| − 1 − (d − 1) · q THEN sel++
6. sim = sel/K
7. sel = sim · union_size

We implemented the VSol (and HSol) selectivity estimators in C++. The signatures of VSol and HSol are implemented with map and hash-map containers. The q-grams of a string and the signatures were computed during the creation of the summary structures. Only the signatures were materialized. Accessing the signature of a given database string (for HSol) and of a given q-gram (for VSol) was implemented with the index operators of the map and hash-map containers.
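As a companion to Algorithm 5.2, the following C++ fragment sketches steps 3–7 (Equation (20)) once the clustered signatures of the query's q-grams have been looked up. It is an illustrative sketch, not the actual implementation; vsolEstimate and the plain vector-of-vectors representation are our own.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// signatures[i][j]: j-th component of the (clustered) signature of the
// inverse string of the i-th query q-gram. L is the threshold
// |sigma| - 1 - (d - 1) * q from Equation (20).
double vsolEstimate(const std::vector<std::vector<double>>& signatures, int L) {
    std::size_t K = signatures[0].size();
    // Step 3: signature of the union is the component-wise minimum.
    std::vector<double> Z(K);
    for (std::size_t j = 0; j < K; ++j) {
        double m = signatures[0][j];
        for (const std::vector<double>& sig : signatures) m = std::min(m, sig[j]);
        Z[j] = m;
    }
    // Step 4: estimated size of the union, K / sum(Z) - 1.
    double sum = 0.0;
    for (double z : Z) sum += z;
    double unionSize = static_cast<double>(K) / sum - 1.0;
    // Steps 5-6: fraction of components where at least L signatures
    // attain the union minimum (the M-choose-L similarity estimate).
    int hits = 0;
    for (std::size_t j = 0; j < K; ++j) {
        int count = 0;
        for (const std::vector<double>& sig : signatures)
            if (sig[j] == Z[j]) ++count;
        if (count >= L) ++hits;
    }
    // Step 7: selectivity = similarity * union size.
    return (static_cast<double>(hits) / K) * unionSize;
}
```

On the five signatures of Example 5.4 with L = 2, the similarity estimate is 1 and the union estimate is 4, so the function returns 4.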
6. EXPERIMENTS

This section thoroughly evaluates VSol on synthetic and real world datasets. We structure the results into four parts: (i) scalability of VSol as the number of database strings increases, (ii) comparison of VSol and HSol for different parameters, (iii) comparison of VSol and HSol for real world datasets, and (iv) comparison of VSol and Sepia.

6.1 Setup

6.1.1 The Datasets. The synthetic datasets vary three main parameters: the number of neighborhoods R, the number of database strings n, and the length l of the database strings. The size of the neighborhoods |Sd(σ)| is equal to the number of database strings divided by the number of neighborhoods. The datasets were generated in the following way. We first generate R centroid strings of length l that are far apart from each other. Then for each centroid we generate |Sd(σ)| strings within a short edit distance from the centroid (usually edit distance 1). The query strings are chosen from the set of centroid strings.²

Unless stated otherwise the parameters of the synthetic datasets are the following: the number of database strings is n = 100,000, the string length is l = 25, and the number of neighborhoods is R = 10. The size of one signature is 100 components.

We also report experiments with real world data. Specifically, we worked with two customer datasets: a dataset of company names (more than 13 million records) and a dataset with addresses of companies (more than 100,000 records). The name dataset contains the company names of customers. The strings are 10–40 characters long with an average of 20, and they fall into neighborhoods of small size (1–5 strings). The address database consists of the addresses of customers. The strings in the address database are longer (30–60 characters) and the database is skewed: there are many small neighborhoods (1–2 strings) and a few large neighborhoods (several hundred strings).

6.1.2 The Measurements.
Four measurements are recorded for each experiment: the creation time of the signatures (measured in seconds), the query time (measured in seconds), the relative error of the estimation, and the size of the signatures (measured in megabytes). The relative error is defined as ε = abs(|Sd(σ)| − |Ŝd(σ)|)/|Sd(σ)|. For real world experiments with small neighborhoods (cf. Section 6.4) we also report the absolute error, which is defined as abs(|Sd(σ)| − |Ŝd(σ)|).

Note that both HSol and VSol are probabilistic methods and therefore the exact estimate varies from one set of signatures to another. For every set of parameters we ran the experiment 5 times (i.e., computed 5 sets of signatures). The results (creation and query time, and signature size) were averaged. Note that since the signatures were generated independently for each experiment, the cross points of the same parameters in different figures do not match exactly. The experiments were performed on an Intel Xeon 3.06GHz machine. The methods were implemented in C++.
²We choose query strings from the centroids since this makes the computation of the true size of the neighborhoods efficient. We got similar results if the query strings are not centroids.
6.1.3 Length of q-grams and Query Distance. The impact of the length of the q-grams and of the query distance on the estimation precision is described by the following expression:

  dist(σ, α) < d ⟹ |Q(σ) ∩ Q(α)| ≥ L(d, q),   (21)

where L(d, q) = max{|σ|, |α|} + q − 1 − d · q is the number of q-grams the query string and the data string must have in common to be within edit distance d. Figure 6 summarizes the empirical results for VSol (very similar results hold for HSol). The results are further discussed below. They show that q = 3 is a good choice for large databases of short strings, and that it is feasible to estimate the neighborhood size with high precision for query distances up to 25% of the length of the database strings.
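The bound L(d, q) is cheap to evaluate. A small helper (ours, for illustration; not the paper's code) makes the trade-off concrete:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// L(d, q) from Equation (21): the minimum number of q-grams two strings
// within edit distance d must share. A non-positive value means the
// q-gram filter carries no information for this (d, q) combination.
int qgramLowerBound(const std::string& sigma, const std::string& alpha, int d, int q) {
    int longer = static_cast<int>(std::max(sigma.size(), alpha.size()));
    return longer + q - 1 - d * q;
}
```

For freud vs. froyd with d = 2 and q = 2 the bound is 2 (cf. Example 4.5), while q = 10 yields a negative bound, which is exactly the failure mode for large q discussed below.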
Fig. 6. Impact of q and Query Distance on the VSol Estimator (l = string length; curves for l = 20, 40, 60). (a) Impact of q; (b) Query Distances in %.
6.1.3.1 Length of q-grams. q-grams divide database strings into overlapping substrings of length q. The length q controls the representativeness of the individual q-grams and the number of q-grams. Figure 6(a) evaluates the impact of q on the quality of the estimator. The estimation error is high for small values of q (q = 1, 2) and for large values of q (q ≥ 10). Short q-grams (q = 1, 2) do not represent the database strings precisely and result in poor estimations. Large values of q allow large blocks of the database strings to differ from the query string, even for small query distances d; this results in high estimation errors. Mathematically, for large q the value L(d, q) (cf. Equation (21)) is negative and the estimators return the size of the database as the answer.

There is a range of q values for which the relative error is low. The range depends on the length of the database strings: the longer the strings, the longer the q-range. The size of the signatures depends linearly on the number of different q-grams in the database, and the smaller the q value, the fewer q-grams. A larger number of q-grams increases the computational time slightly. In our experiments we chose q = 3. This balances precision and space consumption for databases of short strings (names, addresses, etc.).
6.1.3.2 Query Distance. Similar to the length of the q-grams, the impact of the query distance on the estimation precision is described by Equation (21). A query distance d yields the strings that are within edit distance d of the query string. The estimator provides good results for distance d if L(d, q) is positive. For large values of d, L(d, q) is negative and the estimators return the size of the database as the result. Figure 6(b) shows the experimental evaluation as d changes for different string lengths. The figure shows that the precision of our estimator is invariant wrt d in the range up to 25% of the length of the database strings. As d reaches about 25% of the length of the database strings, the estimators return the size of the database, which is the reason for the sharp increase of the error.
VSol starts with the query string and uses a hash index to directly identify the signatures that are relevant for the answer of the approximate query. This makes the solution good for databases with a large number of strings that are relatively short. VSol signatures are effective if the number of neighborhoods is small. As a rule of thumb VSol is good for databases with a million or more short strings and a small number of neighborhoods (l ≈ 30 and R ≈ 100). HSol cannot directly identify the strings that are relevant for the answer of an approximate query. HSol always needs to scan the whole signature space. Therefore HSol works well, if the number of database strings is small and the length of strings is large. The effectiveness of HSol signatures increases as the length of the database strings increases and it is independent of the number of neighborhoods in the database. As a rule of thumb HSol is good for databases with a small number n < 100, 000 of long l > 500 strings, and a large number of neighborhoods R > 1000. 6.2.1 Error and Signature Size. Figure 7 shows that the estimation error does not increase as the number of strings increases, and is roughly 0.25 (the variations are due to the probabilistic nature of the estimator, cf. Figure 7(a)). The size of the signatures is as large as the database for 100,000 strings, and decreases very fast as the number of strings increases (cf. the dashed line, Figure 7(b)). For n = 108 strings the size of the signatures is four orders of magnitude smaller than ACM Transactions on Database Systems, Vol. V, No. N, Month 20YY.
the database.

Fig. 7. Number of Database Strings vs. Estimation Error and Signature Size. (a) Estimation Error; (b) Signature Size.
The solid line in Figure 7(b) shows the space usage of the HSol estimator for the same estimation error. The size of the signatures of HSol is proportional to the size of the database. This is because HSol assigns signatures for each database string, which makes the space complexity linear wrt the number of database strings. In our experiment the size of the signatures of HSol is slightly bigger than the database itself. Clearly, HSol does not scale up for large databases. 6.2.2 Signature Creation and Update Time. Figure 8(a) illustrates the creation time of VSol and HSol signatures. Both solutions need to scan all strings to compute signatures. This results in a linear time complexity wrt the number of database strings. The overall time complexity for HSol is higher than the time complexity of VSol, because the space usage of HSol is higher (cf. next section).
Fig. 8. Number of Database Strings vs. Creation and Update Time ((a) Creation Time; (b) Update Time).
For incremental additions the signatures of the inverse strings of the added q-grams must be identified and possibly updated (in case the hash values of the newly added string ID are smaller than the signature values). The relative error under incremental additions remains constant (cf. Figure 7(a)) and the update time is very fast (cf. Figure 8(b)). For incremental deletions the signatures of the inverse strings of the deleted q-grams must be identified and compared against the hash values of the string ID of the deleted string. If the hash values of the deleted string ID are larger than the signature values, the summary structure does not change. Otherwise, the summary structure must be recalculated (cf. Figure 8(a)).

6.2.3 Signature Query Time. The query time of VSol (cf. dashed line, Figure 9(a)) is fast and independent of the size of the database. Because of the vertical allocation of signatures, VSol identifies the signatures of the corresponding string IDs in time independent of the number of database strings. In contrast, HSol needs to scan all signatures to answer approximate selectivity queries. This yields a query time that is linear wrt the number of database strings. VSol is several orders of magnitude faster than HSol for 10^8 database strings.
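The incremental maintenance described above exploits a property of min-wise signatures: an insertion can only lower a component, and a deletion forces recomputation only for components whose minimum was attained by the deleted string ID. A sketch under the same assumed SHA-1-based hash family (function names are ours):

```python
import hashlib

def h(x, seed):
    d = hashlib.sha1(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2 ** 64

def insert_id(signature, new_id):
    """Incremental addition: a component changes only if the new string
    ID hashes below the stored minimum -- O(K), no database rescan."""
    for j, cur in enumerate(signature):
        hv = h(new_id, j)
        if hv < cur:
            signature[j] = hv

def delete_id(signature, old_id, remaining_ids):
    """Incremental deletion: only components where the deleted ID attains
    the stored minimum must be recomputed from the inverse string."""
    for j, cur in enumerate(signature):
        if h(old_id, j) == cur:
            signature[j] = min((h(i, j) for i in remaining_ids), default=1.0)

K = 8
sig = [1.0] * K   # 1.0 plays the role of the empty-set signature
insert_id(sig, 1)
insert_id(sig, 2)
```

The deletion path is the expensive one, exactly as the text observes: when the deleted ID held a minimum, the component has to be rebuilt from the remaining IDs of the inverse string.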
Fig. 9. Query Time and Number of q-grams ((a) Query Time; (b) Number of q-grams).
Figure 9(b) shows the number of q-grams as the number of neighborhoods increases. The increase is sub-linear: for a small number of neighborhoods the number of different q-grams increases fast (cf. R = 0–100, Figure 9(b)). As the number of neighborhoods grows, the increase of different q-grams slows down (cf. R = 100–500, Figure 9(b)).

6.3 Comparison of HSol and VSol for Synthetic Data

This section compares HSol and VSol for synthetic datasets. In the experiments we vary the number and size of neighborhoods, and the length of the database strings. The experiments focus on the evaluation of the signature component and do not include clustering (cf. Section 6.5 for experiments with clustering). With a fixed signature size the estimation error of HSol increases as the number of database strings increases, whereas the estimation error of VSol increases as the number of neighborhoods in the database increases. The comparison of VSol and HSol is done for n = 10^5 database strings because of the space complexity of HSol. Even with this constraint we had to give more space to HSol to get a statistically robust trend. For example, to get a precise selectivity estimation a typical size of the VSol signatures is ~5MB. If the database consisted of 10^6 strings then the number of components per HSol signature would be 5,000,000/sizeof(int)/1,000,000 = 1.25. Clearly, 1–2 components per signature vector for HSol is not enough to get a robust statistical trend.
6.3.1 Length of Database Strings. HSol and VSol are both independent of the length of the database strings (cf. Figure 10). However, the space complexity of HSol is an order of magnitude higher for similar precision. This is because HSol allocates a signature for each string while VSol allocates a signature for each q-gram. Since the database strings are short, HSol signatures are less effective than VSol signatures and the space-precision tradeoff is better for VSol.

Fig. 10. Estimation Error for Different String Lengths ((a) HSol; (b) VSol).
6.3.2 Different Number of Neighborhoods. In the experiment in Figure 11 we fixed the number of database strings and varied the number of neighborhoods (and therefore the selectivity). This means that the database gets less skewed as the number of neighborhoods increases. Therefore, we expect a constant error for HSol and a linear increase of the error for VSol. Sufficiently large summary structures confirm the constant trend for HSol (cf. the 40MB and 80MB summary structures in Figure 11(a)). Since the number of database strings and the size of the summary structure are fixed, the HSol signature size is the same for each database string, and therefore the HSol estimation error does not increase as the number of neighborhoods increases. The 20MB summary structure is too small and leads to more false positives as the number of neighborhoods increases.

Fig. 11. Estimation Error for Different Number of Neighborhoods ((a) HSol; (b) VSol).
For VSol the estimation error increases linearly as the number of neighborhoods increases (cf. Figure 11(b)). This is because the value of the M-choose-L similarity decreases, and larger signatures have to be used to get the same precision. This is consistent with Theorem 5.2 (cf. Section 5.3).

6.3.3 Different Size of Neighborhoods. Figure 12 compares VSol and HSol as the size of the neighborhoods and the number of strings increase. With a fixed number of neighborhoods this means that the database gets more skewed. We expect a constant estimation error for VSol and a linear increase of the estimation error for HSol.
Fig. 12. Estimation Error for Varying Number of Database Strings ((a) HSol; (b) VSol).
Figure 12 confirms our expectations. The HSol estimation error increases as the size of the neighborhoods increases. Since the size of the summary structure is fixed and the number of strings increases, HSol allocates smaller signatures for each data string. VSol experiences an almost constant estimation error since the M-choose-L similarity is almost constant. Note that we had to increase the space for HSol substantially (to a few hundred megabytes) to get at least a few signature components for n = 10^7 database strings.

6.4 Comparing VSol and HSol for Real Data

For the experiments with real-world data we use two customer datasets: a dataset with company names and a dataset with company addresses. The company address database is skewed in terms of the size of the neighborhoods: there are a few (approximately 50) large neighborhoods and many (several hundred) small neighborhoods. VSol scales up nicely for this dataset. The company names database consists of individual strings distributed in space. Each string forms a cluster, possibly with another couple of strings in its neighborhood. For this type of data both methods have to allocate large signatures to get a robust estimator.

We investigate the trends of the errors for (i) different numbers of strings (the signature size is fixed at 3,000 integers) and (ii) different signature sizes (the number of database strings is fixed at 30,000 strings). We considered two types of query scenarios. For the address database we selected query strings from large neighborhoods. For the company name database we selected query strings from small neighborhoods. Note that the centroid strings of the neighborhoods are unknown for the real-world data. We selected a number of database strings as possible query strings and computed the exact size of the neighborhoods for different distances. For the address dataset we chose query strings and distances that yield large neighborhoods. For the name dataset we chose query strings and distances that yield small neighborhoods.

In the experiments we measure the absolute and relative errors of the estimators. For large neighborhoods the relative error better quantifies the precision of the estimators. For small neighborhoods the absolute error better quantifies the precision. For example, in the company name dataset the selectivity of a typical string is 3–4. The estimated size of the neighborhood is usually 1–2. Thus, the relative error is 50% while the absolute error is 1–3. Clearly, in this case the absolute error better characterizes the precision. Note that the absolute values of the precision of HSol and VSol on the real-world data cannot be compared directly with the ones on the synthetic data, since the distribution of the data (number of neighborhoods, number of data strings) is different.

6.4.1 Company Addresses. The address dataset exhibits a skewed distribution: it contains a few large neighborhoods and many small neighborhoods. This setup works well for VSol, as illustrated in Figure 13.
Fig. 13. Impact of Number of Strings and Signature Size (Address Database) ((a) Number of Strings; (b) Signature Size).
For VSol the error is almost constant as the number of strings increases (cf. Figure 13(a)). This is because the M-choose-L similarity ρ does not change as the number of strings increases. The number of large neighborhoods R = 1/ρ remains the same and the estimation stays accurate as the number of strings increases. This result is consistent with Theorem 5.2. HSol deteriorates as the number of strings increases since the number of components per signature gets too small.

Figure 13(b) evaluates the estimation precision for the address database and different signature sizes. Small signatures (10–100 components) exhibit a low precision and low confidence of the approximation. Increasing the size of the signatures to a few hundred components decreases the errors and increases the confidence of the approximation. Further increasing the signature size neither decreases the error nor increases the confidence level.
6.4.2 Company Names. Figure 14 evaluates the solutions for the database with company names. The precision of VSol and HSol is similar. The name dataset consists of individual strings distributed in space. Each string forms a single cluster, maybe with another couple of strings in its neighborhood. This means that larger signatures have to be allocated to get accurate results.
Fig. 14. Impact of Number of Strings and Signature Size (Name Database) ((a) Number of Strings; (b) Signature Sizes).
Figure 14(a) shows that the absolute error increases as the number of strings increases. This indicates that the signature size is too small. VSol and HSol return 0 as the size of the neighborhood, which is reasonable. Figure 14(b) shows that for the company names dataset the signature has to have 2,000–10,000 components to get a good and robust estimation of the approximate string selectivity. The precision of the solutions is similar, since the signature is large enough to capture all neighborhoods.

6.5 Clustering the Signature Space

Clustering has the potential to reduce the memory usage to 10% of the original size, although for robust results one should not go below 20%–30% of the original size. Figures 15 and 16 show that clustering reduces the signatures to 5%–10% of their original size without loss of precision. In some cases we even observe a minor increase of the precision. This comes from the combined effects of the error of signatures, the error of clustering, and the error of the q-gram approximation.

Figure 15 investigates the effect of clustering for the string length l, the size of the neighborhoods ns, and the number of neighborhoods nn. In all experiments the signatures can be reduced to 5%–10% of their original size. The estimation error of clustering does not depend on the length of the strings or the size of the neighborhoods (cf. Figures 15(a)–15(b)). As the number of neighborhoods increases the effectiveness of the clustering is slightly reduced (cf. Figure 15(c)). We ran a similar experiment for numbers of clusters up to a few hundred and a larger signature, and got similar results.
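The clustering step replaces each signature vector by a nearby cluster representative, so that only the k representatives plus a q-gram-to-cluster map need to be stored. The paper's conclusion names k-means as the clustering method; the following is a minimal Euclidean k-means sketch, where the initialization, iteration count, and names are our choices:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means with Euclidean distance on the signature vectors."""
    rnd = random.Random(seed)
    centroids = rnd.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(v, centroids[i])))
            groups[nearest].append(v)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster runs empty
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

# each q-gram's signature is then replaced by its nearest centroid, so the
# summary stores k vectors plus a small q-gram -> cluster-ID map
centroids = kmeans([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]], k=2)
```

Reducing the summary to 5%–10% of its size then simply means choosing k to be 5%–10% of the number of distinct q-grams.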
Fig. 15. String Length, Size of Neighborhood, and Number of Neighborhoods vs. Clustering Error ((a) String Length; (b) Size of the Neighborhood; (c) Number of Neighborhoods).
Figure 16 shows the clustering results for the real-world address dataset. As for the synthetic datasets, the memory usage can be substantially reduced, but for robust results one should not go below 20%–30% of the original size.
Fig. 16. Impact of Clustering (Address Database).
6.6 Comparison with the Sepia Selectivity Estimator

This section compares our method with the Sepia string selectivity estimator [Jin and Li 2005]. Sepia uses k-medoid clustering to cluster the data. Based on the medoids and the edit distance a proximity-pair-distribution table is computed. We show that Sepia performs well in environments where strings are deleted but the clustering does not change (or where effective sampling can be applied). In contrast, VSol is robust to changes of the neighborhoods, and the creation of its summary structure is faster than the clustering step of Sepia.

The following summarizes our implementation of the Sepia estimator:

—The database is clustered into k clusters according to the partition around medoids (PAM) algorithm [Jain and Dubes 1988]. We start the clustering with k random strings as medoids and assign database strings to the closest medoid. Then we swap the medoid and database string that decreases the clustering error most. We repeat the last step until the clustering error is minimized. The complexity of this step is O(tk(n−k)^2 l^2), where t is the number of iterations, n is the number of database strings, and l is the average length of database strings.
—For each cluster we build a frequency table with entries ⟨EditVector(µ, ζ), m⟩, where µ is the medoid, ζ is a string of the cluster, and m is the number of occurrences of EditVector(µ, ζ).

—We compute the proximity-pair-distribution (PPD) table as follows [Jin and Li 2005]. We scan the medoids µ, all strings ζ of the clusters, and 10% of the database strings α. Then we insert ⟨v1, v2, CPDF⟩ into the PPD table, where CPDF is the cumulative probability density function of the edit distance dist(α, ζ) under the condition that v1 = EditVector(α, µ) and v2 = EditVector(µ, ζ). The PPD table is implemented as a sorted array. The complexity of the lookup of an element (v1, v2) is logarithmic wrt the number of entries in the PPD table.

We compare four characteristics of Sepia and VSol: estimation error, creation time, query time, and memory usage. We used the following parameters: the number of neighborhoods is 10, the average length of the database strings is l = 25, and the diameter of the neighborhoods is d = 3. The database is pre-clustered during the data generation into k = 10 clusters (to avoid the quadratic PAM clustering step). We used centroid strings as well as strings far away from the centroids as query strings with d = 3. We compare the solutions for the same estimation error.

The results are discussed in detail below and can be summarized as follows. The memory complexity of Sepia is better than the memory complexity of VSol without clustering, and roughly the same as the memory complexity of VSol with clustering (cf. Section 5.4). Even with pre-clustered databases the creation of the summary structures of Sepia was an expensive and limiting factor. Wrt query time VSol is the preferred estimator, since it does not perform expensive edit distance operations. Sepia is well-suited for static environments where clustering is an off-line process and the precision needs to be maximized. VSol is well-suited for dynamic environments where the data changes and fast selectivity estimations are required.

6.6.1 Estimation Error and Memory Complexity. Figure 17(a) evaluates Sepia and VSol (without clustering) as the number of strings increases. We compared Sepia and VSol for an estimation error of approximately 0.2.
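The logarithmic PPD lookup mentioned above follows from the sorted-array layout. A minimal sketch; the class name, the tuple encoding of edit vectors, and the CPDF payloads are illustrative placeholders of ours, not Sepia's actual layout:

```python
import bisect

class PPDTable:
    """PPD table as a sorted array keyed on the pair (v1, v2) of edit
    vectors; lookup is O(log n) via binary search."""

    def __init__(self, entries):
        # entries: iterable of ((v1, v2), cpdf)
        self.entries = sorted(entries, key=lambda e: e[0])
        self.keys = [e[0] for e in self.entries]

    def lookup(self, v1, v2):
        i = bisect.bisect_left(self.keys, (v1, v2))
        if i < len(self.keys) and self.keys[i] == (v1, v2):
            return self.entries[i][1]
        return None

ppd = PPDTable([
    (((1, 0, 1), (0, 1, 0)), [0.2, 0.9, 1.0]),  # placeholder CPDF values
    (((0, 0, 1), (1, 0, 0)), [0.5, 1.0]),
])
```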
Sepia is precise if the query string is close to the centers of the k-medoid (PAM) clusters. In this case the estimation error is below 10%, and a few percent on average. The estimation error increases as the distance between the query string and the closest cluster center increases. In this case the estimation error is typically above 50%, and in rare cases several hundred percent. There are two reasons for this: (i) Sepia is based on the triangle inequality, and the triangle inequality is inaccurate, and (ii) Sepia uses a subset of the input database to train the frequency and PPD tables; as the distance to the center increases, more strings are required to train the frequency and PPD tables. In contrast, VSol does not depend on the position of the query string wrt the center of the cluster, and the relative error does not increase as the query string approaches the border of neighborhoods.

The memory complexity is around 5MB for VSol (without clustering) and 1MB for Sepia (cf. Figure 17(b)). Sepia is more memory efficient, because the solution efficiently allocates the cluster centers at the centers of the neighborhoods. The memory complexity of VSol depends on the number of different q-grams in the database and does not depend on the number of neighborhoods. Note that the memory complexity of VSol gets as small as that of Sepia if the signatures are clustered (cf. Section 5.4).

Fig. 17. Number of DB Strings vs. Estimation Error and Memory Usage ((a) Estimation Error; (b) Memory Usage).

6.6.2 Creation and Query Time. Figure 18(a) evaluates the creation time of Sepia and VSol. The quadratic complexity of the PAM clustering makes it infeasible to run the clustering for more than a few thousand strings. Therefore, we clustered the strings during the data generation and report only the creation time of the frequency and PPD tables. Due to expensive edit distance computations the time complexity is still high and a limiting factor. The creation time of VSol signatures is linear. Sepia exhibits a quadratic query time wrt the length of strings since it needs to compute edit distances from the query string to the centers of the clusters (cf. Figure 18(b)). VSol identifies the signatures of the q-grams of the query string in linear time.
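The quadratic query time of Sepia comes from full edit distance computations between the query string and the cluster centers. For reference, the standard O(|a|·|b|) dynamic program (a generic sketch, not Sepia's code):

```python
def edit_distance(a, b):
    """Classic dynamic program: O(len(a) * len(b)) time, O(len(b)) space."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[len(b)]

d = edit_distance("froyd", "froid")  # -> 1
```

VSol avoids this cost entirely at query time: it only hashes the q-grams of the query string, which is linear in the query length.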
Fig. 18. Creation and Query Time ((a) Creation Time; (b) Varying String Length).
A number of modifications of the k-medoid clustering algorithm have been used to improve the clustering time and speed up the computation of the summary structures of Sepia [Vernica and Li]: (i) sample the database with sample size 5×NC, where NC is the number of clusters, (ii) pre-compute the edit distances dist(αi, αj), where αi, αj are strings from the sample, (iii) restrict the search space for the best clustering: start with an initial random clustering and improve the position of the medoid within the cluster, and (iv) limit the number of iterations to find the best clustering. These techniques decrease the computation time substantially, but they also decrease the robustness of the approximation. It becomes necessary to allocate substantially more medoids than there are neighborhoods in the dataset, and the sample of the database that is used to construct the frequency and PPD tables has to be chosen carefully.

6.6.3 Comparison of Sepia with VSol with Clustered Signatures. This section summarizes the comparison of Sepia with VSol with clustered signatures. VSol with clustered signatures outperforms Sepia. For the same accuracy, VSol with clustered signatures has the same (or even slightly better) space usage than Sepia. Clustering decreases the signatures to 10% of their original size without loss of precision. VSol with clustered signatures occupies 100–550KB, while Sepia occupies 285–800KB; VSol with clustered signatures thus requires around 30%–40% less memory than Sepia. Note that the computation time of the summary structure of VSol with clustered signatures is still better than the computation time of Sepia for large databases. Sepia has quadratic complexity wrt the number of input strings, while VSol has linear complexity.

7. VSOL WITH POSITIONAL Q-GRAMS

This section describes and evaluates the extension of VSol to positional q-grams. Positional q-grams, which also take the position of a q-gram in the string into account, decrease the error by around 10%, though they increase the memory usage by one to two orders of magnitude.

7.1 Preliminaries

Positional q-grams are extensions of regular q-grams. In addition to sliding a window of size q over the string, positional q-grams also record the position of the window. Consider the following example.

Example 7.1. Let α = george.
The regular 2-grams Q(α) and positional 2-grams Q^P(α) of α are

Q(george) = {(#g, 1), (ge, 1), (eo, 1), (or, 1), (rg, 1), (ge, 2), (e$, 1)},
Q^P(george) = {(#g, 1), (ge, 2), (eo, 3), (or, 4), (rg, 5), (ge, 6), (e$, 7)}.

In our paper regular as well as positional q-grams are expressed as a pair (q, i), where q is the substring, and i is the sequence number for regular q-grams and the position for positional q-grams. Since positional q-grams encode the position, there cannot be two identical positional q-grams in a string. Consequently, we do not need sequence numbers for positional q-grams.

Positional q-grams are more precise than regular q-grams because they not only relate strings in terms of the number of shared q-grams but also filter out false positives with the help of positional information. Intuitively, VSol with positional q-grams is based on two properties: (i) strings that are within edit distance d should share a large number of q-grams, and (ii) if a q-gram occurs in α and σ, then the positions of the q-gram in α and σ shall differ by at most d positions.

Definition 7.1. [d-match of positional q-grams.] Let α and σ be strings. Then (q, p) ∈ Q^P(σ) d-matches (q', p') ∈ Q^P(α) iff (q', p') ∈ (q, p ± d), where

(q, p ± d) = {(q, p − d), (q, p − d + 1), . . . , (q, p + d)} if p > d,
(q, p ± d) = {(q, 1), (q, 2), . . . , (q, p + d)} otherwise.

Example 7.2. [d-match of positional q-grams.] Let α = george, σ = org, and q = 2. Let (q, p) = (or, 2) ∈ Q^P(org). The positional q-grams of α that can 4-match (q, p) are

(or, 2 ± 4) = {(or, 1), (or, 2), (or, 3), (or, 4), (or, 5), (or, 6)}.

Since Q^P(α) = {(#g, 1), (ge, 2), (eo, 3), (or, 4), (rg, 5), (ge, 6), (e$, 7)}, the 2-grams (or, 2) in σ and (or, 4) in α 4-match.

The following lemma formalizes the approximation of the edit distance with positional q-grams [Gravano et al. 2001].

Lemma 7.1. [Positional q-gram approximation.] If α and σ are within edit distance d, then a positional q-gram in one string cannot correspond to a positional q-gram in the other string that differs from it by more than d positions. Thus, the d-neighborhood of σ for positional q-grams consists of the following strings:

S_d^P(σ) = {α ∈ DB : |{(q, s) ∈ Q^P(σ) : (q, s ± d) ∩ Q^P(α) ≠ ∅}| ≥ L},

where L = max{|σ|, |α|} − 1 − q · (d − 1).
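Definition 7.1 and Example 7.1 can be made concrete in a few lines; the function names are ours:

```python
def positional_qgrams(s, q=2):
    """Pad with q-1 '#'/'$' characters and record the 1-based window
    position instead of an occurrence counter."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(padded[i:i + q], i + 1) for i in range(len(padded) - q + 1)]

def d_matches(gram_sigma, gram_alpha, d):
    """Definition 7.1: (q, p) d-matches (q', p') iff the substrings are
    equal and the positions differ by at most d."""
    (qs, ps), (qa, pa) = gram_sigma, gram_alpha
    return qs == qa and abs(ps - pa) <= d

grams = positional_qgrams("george")
```

Running this on α = george reproduces the positional 2-grams of Example 7.1, and `d_matches(("or", 2), ("or", 4), 4)` reproduces the 4-match of Example 7.2.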
7.2 Extending VSol with Positional Q-grams

Definition 7.2. [String selectivity with positional q-grams.] Let σ be a query string and Q^P(σ) = {(q_1, p_1), . . . , (q_M, p_M)} be the positional q-grams of string σ. Then the VSol string selectivity for positional q-grams is defined as follows:

Ŝ_d^P(σ) = ∪_{(q_{i1}, p_{i1}), . . . , (q_{iL}, p_{iL}) ∈ Q^P(σ)} D(q_{i1}, p_{i1} ± d) ∩ · · · ∩ D(q_{iL}, p_{iL} ± d),   (22)

where

D(q_j, p_j ± d) = D(q_j, p_j − d) ∪ · · · ∪ D(q_j, p_j) ∪ · · · ∪ D(q_j, p_j + d),

and

L = |σ| − 1 − (d − 1) · q.
Example 7.3. [VSol selectivity for positional q-grams.] Consider a database D = {α1, α2, α3} with α1 = froyd, α2 = royd, and α3 = froid, query string σ = froyd, q = 2, and query distance d = 1 (so L = 4). Figure 19 illustrates the inverse strings for positional q-grams. All sets of the form

D(q_{i1}, p_{i1} ± 1) ∩ D(q_{i2}, p_{i2} ± 1) ∩ D(q_{i3}, p_{i3} ± 1) ∩ D(q_{i4}, p_{i4} ± 1)
are part of the answer of the string selectivity |Ŝ_1^P(froyd)|. The union of the intersections yields Ŝ_1^P(froyd). For example, consider

D(ro, 3 ± 1) ∩ D(oy, 4 ± 1) ∩ D(yd, 5 ± 1) ∩ D(d$, 6 ± 1),

where

D(ro, 3 ± 1) = D(ro, 2) ∪ D(ro, 3) ∪ D(ro, 4) = {2} ∪ {1, 3} ∪ ∅,
D(oy, 4 ± 1) = D(oy, 3) ∪ D(oy, 4) ∪ D(oy, 5) = {2} ∪ {1} ∪ ∅,
D(yd, 5 ± 1) = D(yd, 4) ∪ D(yd, 5) ∪ D(yd, 6) = {2} ∪ {1} ∪ ∅,
D(d$, 6 ± 1) = D(d$, 5) ∪ D(d$, 6) ∪ D(d$, 7) = {2} ∪ {1, 3} ∪ ∅.
Therefore, D(ro, 3 ± 1) ∩ D(oy, 4 ± 1) ∩ D(yd, 5 ± 1) ∩ D(d$, 6 ± 1) = {1, 2}.

Definition 7.3. [M-choose-L similarity for positional q-grams.] Let D = {D(q_{i1}, p_{i1} ± d), . . . , D(q_{iM}, p_{iM} ± d)} be a set of inverse strings and L ≤ M. Then the M-choose-L similarity of the sets is
ρ(L, D(q_{i1}, p_{i1} ± d), . . . , D(q_{iM}, p_{iM} ± d)) =
|∪_{(q_{i1}', p_{i1}' ± d), . . . , (q_{iL}', p_{iL}' ± d) ∈ D} D(q_{i1}', p_{i1}' ± d) ∩ · · · ∩ D(q_{iL}', p_{iL}' ± d)| / |D(q_{i1}, p_{i1} ± d) ∪ · · · ∪ D(q_{iM}, p_{iM} ± d)|.   (23)

Note that the only difference from the M-choose-L similarity for non-positional q-grams is that the inverse strings D(q, s) are exchanged with the unions of inverse strings of positional q-grams D(q, p ± d). Below we give the algorithm for the VSol selectivity estimator with positional q-grams. Note that the computation of the summary structure is the same as for non-positional q-grams.
Algorithm 7.1. [VSol (Selectivity Estimation with Positional Q-grams)]
Input: Y: signatures for database DB; Y^j_{q,p} is the j-th coordinate of the signature vector of D(q, p).
σ: query string
d: distance
Output: sel = |Ŝ_d^P(σ)|
Fig. 19. Inverse Strings for the Positional Q-grams of Example 7.3: D(#f, 1) = {1, 3}, D(fr, 2) = {1, 3}, D(ro, 3) = {1, 3}, D(oy, 4) = {1}, D(yd, 5) = {1}, D(d$, 6) = {1, 3}, D(#r, 1) = {2}, D(ro, 2) = {2}, D(oy, 3) = {2}, D(yd, 4) = {2}, D(d$, 5) = {2}, D(oi, 4) = {3}, D(id, 5) = {3}.
0. sel = 0, union_size = 0
1. Calculate Q^P(σ) = {(q_{i1}, p_{i1}), . . . , (q_{iM}, p_{iM})}
2. For each (q_{il}, p_{il} + k), k ∈ [−d, d], identify the associated signature Y_{il,k} = (Y^1_{il,k}, . . . , Y^K_{il,k})
2P. Calculate (U^1_{il}, . . . , U^K_{il}), the signature vector for D(q_{il}, p_{il} ± d), where U^j_{il} = min_{k∈[−d,d]} Y^j_{il,k}, l = 1, . . . , M
3. Calculate (Z^1, . . . , Z^K), the signature for ∪ D(q_i, s_i), where Z^j = min_{i=1,...,M} {Y^j_i}
4. Calculate the size of the union: union_size = K / Σ_{j=1}^K Z^j − 1
5. Scan the j-th coordinates of the signature vectors U_{il}; calculate the number of coordinates that are equal to the minimal value:
   FOR EACH j = 1, . . . , K DO
   5.1. L = |{l : U^j_{il} = Z^j}|
   5.2. IF L ≥ |σ| − 1 − (d − 1) · q THEN sel++
6. sim = sel / K
7. sel = sim · union_size
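The algorithm above can be exercised on the data of Example 7.3 with a runnable sketch. Assumptions of ours: a SHA-1-based hash family mapped into [0, 1), signatures built from exact inverse strings, 1.0 as the signature of an empty inverse string, and Python-style names:

```python
import hashlib

def h(x, seed):
    d = hashlib.sha1(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2 ** 64

def positional_qgrams(s, q=2):
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(padded[i:i + q], i + 1) for i in range(len(padded) - q + 1)]

def build_signatures(strings, q=2, K=256):
    """Signatures Y[(q-gram, position)] for the positional inverse strings."""
    inverse = {}
    for sid, s in enumerate(strings, start=1):
        for g in positional_qgrams(s, q):
            inverse.setdefault(g, set()).add(sid)
    return {g: [min(h(sid, j) for sid in ids) for j in range(K)]
            for g, ids in inverse.items()}

def vsol_positional(Y, sigma, d, q=2, K=256):
    grams = positional_qgrams(sigma, q)                          # step 1
    empty = [1.0] * K
    # steps 2 and 2P: the signature of D(q, p +/- d) is the componentwise
    # minimum over the signatures of the shifted positional q-grams
    U = [[min(Y.get((g, p + k), empty)[j] for k in range(-d, d + 1))
          for j in range(K)]
         for (g, p) in grams]
    Z = [min(u[j] for u in U) for j in range(K)]                 # step 3
    union_size = K / sum(Z) - 1                                  # step 4
    threshold = len(sigma) - 1 - (d - 1) * q
    sel = sum(1 for j in range(K)                                # step 5
              if sum(1 for u in U if u[j] == Z[j]) >= threshold)
    return (sel / K) * union_size                                # steps 6-7

Y = build_signatures(["froyd", "royd", "froid"])
est = vsol_positional(Y, "froyd", d=1)   # true selectivity is 3
```

On this toy database every string shares at least L = 4 shifted positional q-grams with the query, so the estimate converges to the size of the union of the inverse strings.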
There are two differences between the algorithm for positional q-grams and the algorithm for non-positional q-grams. First, the algorithm computes the signature vector U_{il} for D(q, p ± d) (cf. line 2P, Algorithm 7.1), and second, it queries the signatures U_{il} instead of the individual signatures Y_{il,k}. Signatures for inverse strings of positional q-grams are less space effective than the signatures for inverse strings of non-positional q-grams. This is because there are substantially more positional q-grams for a database, and the inverse strings are shorter than for non-positional q-grams.

7.3 Experiments

Typically VSol for positional q-grams uses 500–1000 MB of memory (cf. the solid line, Figure 20(b)). This is because the number of different q-grams increases substantially if the position of the q-gram is considered, and we get more but shorter inverse strings. We increased the memory usage of VSol with regular q-grams to approximately the same amount (cf. the dashed line, Figure 20(b)) to have a common basis for the comparison of the solutions.
Fig. 20. Positional versus Regular q-grams ((a) Relative Error; (b) Memory).
VSol for positional q-grams is less precise than VSol for non-positional q-grams (cf. Figure 20(a)) if a similar amount of memory is used. This is because there are far fewer non-positional q-grams than positional q-grams, and the signature vectors of non-positional q-grams are an order of magnitude longer than the signature vectors of positional q-grams. The experiment was adjusted so that the sizes of the memory structures of VSol for positional and non-positional q-grams are as close as possible: we ran the experiments for VSol with positional q-grams, counted the number of different q-grams for the dataset, and adjusted the size of the signature vectors for VSol with regular q-grams accordingly.

Figure 21 shows the impact of positional q-grams on the address database. Positional q-grams improve the precision of VSol by 10%, though the signatures for positional q-grams are one to two orders of magnitude larger than the ones of regular q-grams.

Fig. 21. Impact of Positional q-grams (Address Database).
8. CONCLUSION

In this paper we study the problem of approximate string selectivity estimation for databases of short strings. We propose a new technique, called VSol, to answer selectivity queries approximately. VSol computes inverse strings for the input database, compresses the inverse strings with signatures, and clusters the signatures with the help of k-means clustering. The salient feature of our technique is a query time that is independent of the number of database strings. The space complexity of the signatures is inverse quadratic wrt the selected precision and linear wrt the number of neighborhoods in the dataset.

We give an extensive evaluation of our algorithm for synthetic and real-world data. For the same precision VSol enjoys a faster signature creation time, a faster query time, and a lower memory usage than other state-of-the-art approximate selectivity methods. In contrast to HSol, VSol allocates signatures vertically and scales nicely as the number of input strings increases. The query time of VSol is linear wrt the length of the query string and the size of a signature. The query time of Sepia is quadratic wrt the length of the query string and linear wrt the number of clusters in the summary structure.

Acknowledgments

This work is supported in part by the Danish research council through grant 505101-004. We thank the anonymous reviewers for their insightful and constructive comments.
REFERENCES

Blohsfeld, B., Korus, D., and Seeger, B. 1999. A comparison of selectivity estimators for range queries on metric attributes. In SIGMOD. 239–250.
Broder, A. Z. 1998. On the resemblance and containment of documents. In Compression and Complexity of Sequences (SEQUENCES'97).
Broder, A. Z. 2000. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium. 1–10.
Chaudhuri, S., Ganti, V., and Gravano, L. 2004. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE. 227–239.
Chen, Z., Korn, F., Koudas, N., and Muthukrishnan, S. 2003. Generalized substring selectivity estimation. J. Comput. Syst. Sci. 66, 1, 98–132.
Cohen, E. 1994. Estimating the size of the transitive closure in linear time. In 35th Annual Symposium on Foundations of Computer Science. 190–200.
Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J. D., and Yang, C. 2000. Finding interesting associations without support pruning. In ICDE. 489–499.
Cormode, G. and Muthukrishnan, S. 2005. An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55, 1, 58–75.
Frakes, W. B. and Baeza-Yates, R. 1992. Information Retrieval: Data Structures & Algorithms. Prentice-Hall.
Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In VLDB. 491–500.
Hodge, V. J. and Austin, J. 2003. A comparison of standard spell checking algorithms and a novel binary neural approach. TKDE 15, 5, 1073–1081.
Jagadish, H. V., Kapitskaia, O., Ng, R. T., and Srivastava, D. 2000. One-dimensional and multi-dimensional substring selectivity estimation. VLDB Journal 9, 3, 214–230.
Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. 1998. Optimal histograms with quality guarantees. In VLDB. 275–286.
Jain, A. and Dubes, R. 1988. Algorithms for Clustering Data. Prentice Hall.
Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: a review. ACM Computing Surveys 31, 3, 264–323.
Jin, L., Koudas, N., Li, C., and Tung, A. K. H. 2005. Indexing mixed types for approximate retrieval. In VLDB. 793–804.
Jin, L. and Li, C. 2005. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB. 397–408.
Jin, L., Li, C., and Mehrotra, S. 2003. Efficient record linkage in large data sets. In DASFAA. 137.
Krishnan, P., Vitter, J. S., and Iyer, B. 1996. Estimating alphanumeric selectivity in the presence of wildcards. In SIGMOD. 282–293.
Kukich, K. 1992. Technique for automatically correcting words in text. ACM Computing Surveys 24, 4, 377–439.
Matias, Y., Vitter, J. S., and Wang, M. 1998. Wavelet-based histograms for selectivity estimation. In SIGMOD. 448–459.
Matias, Y., Vitter, J. S., and Wang, M. 2000. Dynamic maintenance of wavelet-based histograms. In VLDB. 101–110.
Mazeika, A. and Böhlen, M. H. 2006. Cleansing databases of misspelled proper nouns. In Proceedings of the CleanDB Workshop, in conjunction with VLDB.
Muralikrishna, M. and DeWitt, D. J. 1988. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In SIGMOD. 28–36.
Navarro, G. 2001. A guided tour to approximate string matching. ACM Computing Surveys 33, 1, 31–88.
Poosala, V., Haas, P. J., Ioannidis, Y. E., and Shekita, E. J. 1996. Improved histograms for selectivity estimation of range predicates. In SIGMOD. 294–305.
Sahinalp, C., Tasan, M., Macker, J., and Ozsoyoglu, Z. M. 2003. Distance based indexing for string proximity search. In ICDE. 125–137.
Salton, G. and McGill, M. J. 1986. Introduction to Modern Information Retrieval. McGraw-Hill.
Sheu, S., Cheng, C.-Y., and Chang, A. 2005. Fast pattern detection in stream data. In AINA. 125–130.
Ukkonen, E. 1983. On approximate string matching. In Proceedings of Conference on Foundations of Computation Theory.
Vernica, R. and Li, C. Flamingo project. http://www.ics.uci.edu/~flamingo/.