Space and Time Scalability of Duplicate Detection in Graph Data

Melanie Weis and Felix Naumann
Hasso-Plattner-Institut für Softwaresystemtechnik GmbH
Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
{melanie.weis, felix.naumann}@hpi.uni-potsdam.de

June 2007

Abstract

The task of duplicate detection consists of determining, for every object in a database, the different representations of that same real-world object. Recent research has considered using relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. In this report, we present how duplicate detection in graph data scales up to large amounts of data and pairwise comparisons, using the support of a relational database system. To this end, we generalize the process of duplicate detection in graphs (DDG). We then define two methods to scale algorithms for DDG in space (amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computations can be performed efficiently. Experiments on data an order of magnitude larger than data considered so far in DDG clearly show that our methods scale to large amounts of data not residing in main memory, and we observe that they all scale linearly in time with the data size, the duplicate ratio, and the connectivity of the graph, which is better scalability than any previous algorithm.
1 Introduction

Duplicate detection is the problem of determining that different entries in a database actually represent the same real-world object, and this for all objects represented in a database. Duplicate detection is a necessary task in data cleaning [15, 24] and is relevant for data integration [11], personal information management [12], and many other areas. The problem has been studied extensively for data stored in a single relational table with a sufficient number of attributes to make sensible comparisons. However, much data comes in more complex structures. For instance, a relational schema is not limited to a single table; it also includes foreign keys that define relationships between tables. Such relationships provide additional information for comparisons. Recently, a new class of algorithms for duplicate detection has emerged. We refer to this class of algorithms as duplicate detection in graphs (DDG) algorithms. These algorithms detect duplicates between representations of entities in a data source, called
candidates, by using relationships among candidates. Within this class, we focus on those algorithms that iteratively detect duplicates when relationships form a graph. To the best of our knowledge, none of the proposed algorithms within this class has explicitly considered efficiency or scalability. Instead, the focus has so far been set on effectiveness. In this paper, we show how the class of iterative algorithms for DDG can scale up to large amounts of data and thus to a large number of pairwise comparisons. To this end, we use a relational DBMS as storage manager and overcome its limitations in supporting DDG. More specifically, our contributions are: (i) a generalization of iterative DDG, (ii) an algorithm, called RECUS/DUP, for scaling DDG in space, i.e., to arbitrarily large amounts of data, and an enhanced version of RECUS/DUP, named RECUS/BUFF, that additionally scales in time, and (iii) three methods to compute similarity considering relationships in our scalable environment. Both algorithms and all similarity computation techniques are based on our generalization of iterative DDG, thus making them applicable to existing algorithms performing DDG, so far limited to main-memory processing. Our most significant results include successful DDG over 1,000,000 candidates, a data set an order of magnitude larger than data sets considered so far, and a linear behavior of DDG runtime with respect to data set size, duplicate ratio, and connectivity when using RECUS/BUFF, which outperforms existing DDG algorithms. The paper is organized as follows: In Sec. 2, we cover related work. Next, we present a generalization of iterative DDG and provide definitions in Sec. 3. Section 4 describes how initialization is performed when main memory does not suffice to complete DDG, and Sec. 5 presents both RECUS/DUP and RECUS/BUFF. Similarity computation is discussed in Sec. 6. Evaluation is provided in Sec. 7, before we conclude in Sec. 8.
2 Related Work

The problem of identifying multiple representations of the same real-world object, originally defined in [22] and formalized in [14], has been addressed in a large body of work. Ironically, it appears under numerous names, e.g., record linkage [18], merge/purge [17], object matching [15], object consolidation [10], and reference reconciliation [12]. A duplicate detection survey is provided in [13]. Broadly speaking, research in duplicate detection falls into three categories: techniques to improve effectiveness, techniques to improve efficiency, and techniques to enable scalability. Research on the first problem is concerned with improving precision and recall, for instance by developing sophisticated similarity measures [6, 9] or by considering relationships [27]. Research on efficiency assumes a given similarity measure and develops algorithms that avoid applying the measure to all pairs of objects [1, 21]. To apply methods to very large data sets, it is essential not only to scale in time by increasing efficiency, but also to scale in space, for which relational databases are commonly used [17, 23]. In Tab. 1, we summarize some duplicate detection methods, classifying them along two dimensions, namely data and approach. For data, we distinguish between (i) data in a single table, without multi-valued attributes, (ii) tree data, such as data warehouse hierarchies or XML data, and (iii) data represented as a graph, e.g., XML with keyrefs or data for personal information management (PIM). The second dimension discerns between three approaches used for duplicate detection: (i) machine learning, where
         Learning                          Clustering                                Iterative
Table    Bil.03 [6] (Q), Sar.02 [26] (Q)   Chau.05 [9] (Q,T,S)                       Her.95 [17] (T,S), Mon.97 [21] (T), Jin03 [18] (Q,T)
Tree     —                                 Ana.02 [1] (Q,T,S), Puh.06 [23] (T,S)     Mil.06 [20] (Q)
Graph    Sin.05 [27] (Q), Bhat.06 [5] (Q)  Kala.06 [19, 10] (Q,T), Yin06 [31] (Q,T)  Dong05 [12] (Q), Weis06 [29, 30] (Q,T), Bhat.04 [2, 3, 4] (Q)
Table 1: Summary of duplicate detection approaches, their focus lying on efficiency (T), effectiveness (Q), or scalability (S).

models and similarity measures are learned, (ii) the use of clustering techniques, and (iii) iterative algorithms that iteratively detect pairs of duplicates, which are aggregated to clusters. In Tab. 1, we also show whether an article mainly focuses on efficiency (T), effectiveness (Q), or scalability (S). Due to space constraints, we limit the discussion to approaches falling in the class of algorithms we consider, i.e., iterative duplicate detection algorithms on graph data. Research presented in [12] performs duplicate detection in a PIM scenario, where relationships between persons, publications, etc. form a graph. Pairs of candidates to be compared are maintained in an active priority queue from which the first pair is retrieved at every iteration. The pair is compared and classified as non-duplicate or duplicate. In the latter case, relationships are used to propagate the decision to neighbors, which potentially become more similar due to the found duplicate, and the priority queue is reordered, neighbors being sent to the front or the back of the queue. The largest data set consists of 40,516 candidates, but no runtimes are reported. In [29, 30], we presented algorithms similar to [12] and focused on the impact of comparison order and recomparisons on the efficiency and effectiveness of DDG. We showed that the comparison order can significantly compromise efficiency and effectiveness in data graphs with high connectivity. For evaluation, we used small data sets (≈ 2,500 candidates) which easily fit in main memory. The RC-ER algorithm, first introduced in [3] and further described and evaluated in [2, 4], represents candidates and dependencies in a reference graph. The algorithm re-evaluates distances of candidate pairs at each iterative step, and selects the closest pair according to the distance measure.
Duplicates are merged before the next iteration, so effectively clusters of candidates are compared. The largest data set [3] contains 83,000 candidates, and duplicate detection takes between 310 and 510 seconds, depending on which variant of the algorithm is used. [4] describes the use of an indexed max-heap data structure to maintain the order of the priority queue, but this structure again fits in main memory. Experiments in [2] show that the higher the connectivity of the reference graph, the slower RC-ER runs. The focus of DDG has so far been set on effectiveness. Scaling DDG to large amounts of data not fitting in main memory has not been considered at all, and efficiency has been only a minor concern. This gap is filled by the research presented in this paper.
(a) MOVIE
MID   title
m1    Troy
m1'   Troja
m1"   Illiad Project

(b) ACTOR
AID   Name
a1    Brad Pitt
a2    Eric Bana
a1'   Brad Pit
a2'   Erik Bana
a3    Brian Cox
a1"   Prad Pitt
a3'   Brian Cox

(c) ACTOR-MOVIE
AID   MID
a1    m1
a2    m1
a1'   m1'
a2'   m1'
a3    m1'
a1"   m1"
a3'   m1"

Figure 1: Sample movie database
3 DDG in a Nutshell

The methods we propose to scale up DDG are intended to be as general as possible, so as to fit all algorithms performing iterative DDG. Therefore, the first contribution of this paper is a generalization of existing methods into a unified model. We also provide definitions that we use throughout the paper.
3.1 General DDG Approach

In general, we observe that algorithms for iterative DDG have in common that (i) they consider data as a graph, (ii) they perform some preprocessing before candidates are compared, to obtain initial similarities and avoid recurrent computations, and (iii) they compare pairs of candidates iteratively, in an order that potentially changes at every iteration where a duplicate is found, requiring the maintenance of an ordered priority queue. Merging [3] or enriching [12] detected duplicates is also a common technique to increase effectiveness and requires updating the graph. Based on these observations, we devise the following general approach for DDG.

3.1.1 Unified Graph Model for DDG

Existing DDG approaches all have in common that they use a graph to represent the data. [12] uses a dependency graph whose nodes are pairs of candidates m = (r1, r2) and pairs of attribute values n = (a1, a2); an edge between m and n exists if a1 is an attribute value of r1 and a2 is an attribute value of r2. Further edges between candidate nodes m1 and m2 exist if they are related (e.g., over a co-author, an e-mail, ...). The method used in [3] defines a reference graph, where nodes do not represent pairs but candidates (with attached attributes), and edges represent relationships (such as co-authoring). Other DDG algorithms use graphs that basically fall into one of the two categories described: nodes either represent single candidates [19, 30], or pairs [27]. Attribute values are usually treated separately. As pairs of nodes and the resulting edges in a dependency graph can be derived from a reference graph, our unified graph model for DDG is a reference graph Gref that consists of three components, namely (i) candidate nodes, (ii) attribute nodes, and (iii) dependency edges. More specifically, a candidate node vc is created for every candidate. Every attribute value associated with a candidate is translated into an attribute node va.
We represent relationships between candidates as directed edges in the graph. More specifically, let vc and vc′ be two candidate nodes. We add an edge ed = (vc, vc′) directed from vc to vc′ if vc′ influences
Figure 2: Sample reference graph.

the similarity of pairs in which vc is involved, which can represent co-authorship, e-mail correspondence, etc. We say that vc′ is an influencing neighbor of vc, and vc is a dependent neighbor of vc′.

Example 1 In Fig. 1 it is obvious to human experts that all three movies represent a single movie and that there are three distinct actors, despite slightly different representations. Movies, titles, and actors, which we consider as candidates, are identified by an ID for ease of presentation. Title candidates illustrate that candidates are not necessarily represented by a single relation. We use ′ to mark duplicates; e.g., m1 and m1’ are duplicate representations of the same movie. Fig. 2 depicts a possible reference graph Gref.
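To make the model concrete, the following Python sketch (our own illustration, not the paper's implementation; all names are ours) builds a small reference graph and derives influencing and dependent neighbors from the directed edges:

```python
# Minimal in-memory sketch of a reference graph G_ref: candidate nodes
# with attached attribute values (ODs) and directed dependency edges.
# Identifiers and class names are illustrative only.

class ReferenceGraph:
    def __init__(self):
        self.attributes = {}  # candidate node -> attribute values
        self.edges = {}       # candidate node -> set of influencing neighbors

    def add_candidate(self, cid, attrs=()):
        self.attributes[cid] = list(attrs)
        self.edges.setdefault(cid, set())

    def add_dependency(self, vc, vc_prime):
        # Edge (vc, vc') directed from vc to vc': vc' influences vc.
        self.edges[vc].add(vc_prime)

    def influencing(self, vc):
        # Candidates reached through outgoing edges of vc.
        return self.edges[vc]

    def dependent(self, vc):
        # Candidates whose similarity is influenced by vc.
        return {c for c, nbrs in self.edges.items() if vc in nbrs}

# Fragment of the movie example: actor a1 and title t1 influence movie m1,
# and the movie in turn influences its title.
g = ReferenceGraph()
for cid, attrs in [("m1", []), ("a1", ["Brad Pitt"]), ("t1", ["Troy"])]:
    g.add_candidate(cid, attrs)
g.add_dependency("m1", "a1")
g.add_dependency("m1", "t1")
g.add_dependency("t1", "m1")
```

On this fragment, `influencing("m1")` yields {a1, t1} and `dependent("m1")` yields {t1}, mirroring the shape of Example 5 (with a2 omitted).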
3.1.2 Unified DDG Initialization

To detect duplicates, DDG algorithms set up a priority queue P Q where candidate pairs are sorted according to a specific ordering scheme (see Sec. 2). Furthermore, iterative DDG algorithms have a jump-start phase, where attribute similarities are computed or obvious duplicates are detected. Also note that blocking [21] or filtering [1] methods can be used to restrict the candidate pairs entering P Q. Priority queue initialization and precomputations are performed before candidate pairs are compared and classified, and thus compose the initialization phase of a DDG algorithm.

Example 2 A possible priority queue is P Q = {(m1, m1’), (m1, m1”), (m1’, m1”), (t1, t2), (t1, t3), (t2, t3), (a1, a2), ...}. Precomputing attribute similarities, for example, consists of computing their edit distance.

3.1.3 Iterative Phase

After initialization, candidate pairs are compared and classified as duplicates or non-duplicates in the iterative phase (Fig. 3). Existing algorithms maintain P Q in main memory. At every iteration step, the first pair in P Q is retrieved, then classified using a similarity (distance) measure, and finally the classification causes some update in P Q or the graph (adding pairs to P Q, enrichment in [12], duplicate merging and similarity recomputation in [3], heuristic recomputation in [29]) before the next iteration starts. In general, the priority queue has to be reordered whenever a duplicate is detected to reduce the number of recomparisons, for which [3, 12, 30] devise different strategies (see Sec. 2). Reordering is an expensive task for a large P Q. However, by maintaining the order in P Q, complex similarity computations are potentially saved. We illustrate these concepts using our example.
Figure 3: Sample iteration for DDG

Example 3 The first pair retrieved from the above sample P Q is (m1, m1’). The pair is classified as non-duplicate, because the movies’ sets of influencing neighbors are disjoint. In the update phase, we remember this classification. The same occurs when subsequently iterating through movie candidate pairs and title candidate pairs. When comparing (a1, a1’), we finally find a duplicate. During the update phase, we notice that we have found a duplicate and add pairs of dependent movies back to P Q, because their similarity potentially increases. This requires reorganizing P Q before the next iteration. The same occurs in the subsequent iterations, where duplicate actors are detected. Now, when we compare movies again, they can be classified as duplicates, because they share a significant number of influencing actor neighbors. Hence, recomparisons increase effectiveness. Had we started by comparing actors, we would have saved classifying movies a second time, which shows that comparison order is important for efficiency.
3.2 Definitions

Having outlined the general DDG process, we provide formal definitions that are used throughout the paper.

Candidates, Object Descriptions, & Relationship Descriptions. Within a database, several object types T = {t1, t2, ..., tn} are described using tables and attributes. The real-world objects subject to duplicate detection are called candidates. We denote the set of candidates of type t ∈ T as Ct = {c1, c2, ..., ck}. To denote the type of a particular candidate c, we use tc. Candidates are an abstraction of an object, described by information stored in the database or implicit in the domain. We distinguish between attributes describing an object and relationships describing an object. Let At = {a1, ..., al} be the set of attribute labels that describe a candidate of type t. Then, ODa(c) = {v1, ..., vm} defines the attribute values of attribute label a that are part of c’s object description (OD). The complete OD of a candidate c is defined as OD(c) = ∪ai∈At ODai(c), and is usually obtained using a query parameterized by the candidate. The relationships describing an object are relationships to other objects. Hence, the relationship description (RD) of a candidate c is defined as a set of candidates to which it is related, such that RD(c) ⊆ ∪ti∈T Cti. The selection of candidates in RD(c) is based on foreign key constraints, or on implicit domain constraints provided by a domain expert.

Example 4 In our example, T = {movie, actor, title}. A candidate of type actor is described by its name, so the OD of actor candidate a1 is OD(a1) = ODName(a1) = {Brad Pitt}. Assuming that a movie is described by its associated actors and its title, RD(m1) = {a1, a2, t1}.

The definitions for candidates and ODs are adaptations to DDG of definitions used
in the XML duplicate detection framework we presented in [28]. We additionally introduce RDs to account for relationships. In the graph, candidates are represented by candidate nodes, ODs by attribute nodes, and RDs by directed edges.

Influencing & Dependent Candidates. Influencing candidates of a candidate c are candidates that influence the similarity of c to other candidates. Hence, c’s set of influencing candidates is I(c) = RD(c). Analogously, dependent candidates of c are candidates whose similarity is influenced by c, i.e., D(c) = {c′ | c′ ≠ c ∧ c ∈ RD(c′)}. In the data graph, influencing candidates of c are represented by candidate nodes reached through outgoing edges of c, whereas dependent candidates of c are represented by candidate nodes pointing to c through dependency edges.

Example 5 We assume I(m1) = {a1, a2, t1} and D(m1) = {t1}. Similarly, I(m1”) = {a1”, a3’, t3} and D(m1”) = {t3}.

Influencing & Dependent Candidate Pairs. Influencing candidate pairs of a candidate pair (c, c′) are defined as I(c, c′) = {(ni, nj) | ni ∈ I(c), nj ∈ I(c′), tni = tnj}. Analogously, its dependent candidate pairs are D(c, c′) = {(ni, nj) | ni ∈ D(c), nj ∈ D(c′), tni = tnj}. Intuitively, influencing and dependent candidate pairs of two candidates are obtained by forming the cross product of their respective influencing and dependent candidates, restricted to pairs of equal type.

Example 6 We assume I(m1, m1”) = {(a1, a1”), (a1, a3’), (a2, a1”), (a2, a3’), (t1, t3)}, and D(m1, m1”) = {(t1, t3)}.

Thresholded Similarity Measure. During classification, DDG algorithms commonly use a similarity measure. In this section, we provide a template for a similarity measure, for which different implementations can be found in the literature [1, 3, 12, 28]. The template is used below when providing general definitions and optimizations for scaling the classification phase in Sec. 6.
First, we define the set N≈rd of duplicate influencing candidate pairs of a pair of candidates (c, c′) as

N≈rd(c, c′) := {(n1, n2) | (n1, n2) ∈ I(c, c′) ∧ (n1, n2) are duplicates}

The set N≠rd of non-duplicate influencing candidate pairs is

N≠rd(c, c′) := {(n1, ⊥) | n1 ∈ I(c) ∧ n1 has no duplicate in I(c′)} ∪ {(⊥, n2) | n2 ∈ I(c′) ∧ n2 has no duplicate in I(c)}

where ⊥ denotes an empty entry.

Example 7 Assuming that duplicate actor and title candidates have already been detected, we obtain N≈rd(m1, m1”) = {(a1, a1”)} and N≠rd(m1, m1”) = {(a2, ⊥), (t1, ⊥), (⊥, a3’), (⊥, t3)}.

We further introduce a weight function wrd(S) that captures the relevance of a set of candidate pairs S = {(ni, n′i) | ni ∈ Ct ∧ n′i ∈ Ct, t ∈ T}. This weight function has the following properties, allowing its incremental computation and guaranteeing that the complete similarity of two candidates monotonously increases.

wrd(S) = WAgg({wrd({(ni, n′i)}) | (ni, n′i) ∈ S})    (1)
7
WAgg({wrd({(n, ⊥)}), wrd({(⊥, n′)})}) ≥ wrd({(n, n′)})    (2)
where WAgg is an aggregate function, e.g., sum or count, that combines multiple weights such that WAgg(S) ≥ WAgg(S′) for S′ ⊂ S. The first property guarantees that the weight of a set of pairs can be computed from the weights of the individual pairs, allowing an incremental computation of the weight. The second property implies that the aggregated weight of two non-duplicates is larger than or equal to their weight if they were duplicates. The second property is reasonable because the relevance of an object represented by two candidates should not be larger than the combined relevance of two individual representations of that object. On the other hand, when we do not know that n and n′ are duplicates, we have to assume that they are different and each contributes an individual relevance that is likely to be larger than or equal to their combined relevance. More interestingly, this property guarantees that the similarity of two candidates monotonously increases as duplicate descriptions among these candidates are detected. Furthermore, it allows us to estimate the total similarity (with an upper and lower bound) while duplicates are detected, allowing us to use the estimate as a pruning method (see Sec. 6.3). In practice, variations of the inverse document frequency are used as weight function.

Example 8 A simple example of such a weight function is wrd(S) = |S|, using count as aggregate function. Using this weight function, we obtain wrd(N≈rd(m1, m1”)) = 1 and wrd(N≠rd(m1, m1”)) = 4.

Similarity and difference of ODs are computed analogously. We denote by N≈od the set of pairwise similar OD attribute values. Here, similarity is measured with a secondary similarity measure, e.g., edit distance. We also define the set N≠od as the set of OD attribute values that are not shared between two candidates, analogously to N≠rd. Finally, a weight function wod(S) with the same properties as wrd(S) measures the relevance of a pair of OD attribute values.
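As a sanity check, the count weight of Example 8 indeed satisfies both properties; the following brief verification is our own illustration (with ⊥ modeled as None):

```python
# Illustrative check that w(S) = |S| with WAgg = sum satisfies
# Eq. 1 and Eq. 2; not part of the original framework.

def w(s):
    return len(s)

S = {("a1", 'a1"'), ("t1", "t3")}

# Eq. 1: the weight of a set equals the aggregate of individual pair weights.
eq1_holds = (w(S) == sum(w({p}) for p in S))

# Eq. 2: two unmatched entries weigh at least as much as one duplicate pair.
eq2_holds = (w({("n", None)}) + w({(None, "n'")}) >= w({("n", "n'")}))
```

Both checks hold: 2 = 1 + 1 for Eq. 1, and 1 + 1 ≥ 1 for Eq. 2.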
The final similarity measure template is:

sim(c, c′) = ( Σ_{i∈{rd,od}} wi(N≈i(c, c′)) ) / ( Σ_{i∈{rd,od}} [ wi(N≈i(c, c′)) + wi(N≠i(c, c′)) ] )    (3)
Example 9 The total similarity of m1 and m1”, which we considered in the previous examples as well, equals sim(m1, m1”) = (1 + 0) / (1 + 4 + 0 + 0) = 0.2. Note that the zeros are due to the empty ODs.

The similarity measure yields a result between 0 and 1. Given a user-defined similarity threshold θ, if sim(c, c′) > θ, c and c′ are duplicates; otherwise they are classified as non-duplicates. Note that in defining N≈rd and N≠rd, we assume that we know whether n1 and n2 are duplicates or non-duplicates. Clearly, this knowledge is acquired during the iterative phase, and the similarity measure is defined to increase as this knowledge is gained (based on Eq. 1 and Eq. 2). On the other hand, the similarity and the difference of ODs can be precomputed, as they do not vary during the iterative phase.

Proof. We prove that the similarity of two candidates c and c′ monotonously increases with the number of duplicates found in their ODs and RDs, assuming that the properties of the weight function defined in Eq. 1 and Eq. 2 hold and that WAgg = sum. Let N≠rd,i = {(n1, ⊥), (n2, ⊥), (⊥, n′1), (⊥, n′2), ...} and N≈rd,i = {(n3, n′3), ...} at a given processing point i of a comparison algorithm. In the next step, the algorithm determines that n1 and n′1 are duplicates, hence N≠rd,i+1 = {(n2, ⊥), (⊥, n′2), ...} and N≈rd,i+1 = {(n3, n′3), (n1, n′1), ...}. We prove that simi(c, c′) ≤ simi+1(c, c′) by showing that the numerator of sim increases from step i to step i + 1, whereas the denominator decreases.

It is true that N≈rd,i ⊂ N≈rd,i+1, so using Eq. 1 and the fact that sum(S) ≥ sum(S′) for S′ ⊂ S, we conclude that

sum[wrd(N≈rd,i), wod(N≈od)] ≤ sum[wrd(N≈rd,i+1), wod(N≈od)]

This proves that the numerator of simi(c, c′) is at most the numerator of simi+1(c, c′). The denominator of sim decreases from step i to step i + 1:

sum[wrd(N≈rd,i+1), wod(N≈od), wrd(N≠rd,i+1), wod(N≠od)]
= sum[wrd(N≈rd,i ∪ {(n1, n′1)}), wod(N≈od), wrd(N≠rd,i \ {(n1, ⊥), (⊥, n′1)}), wod(N≠od)]
≤ sum[wrd(N≈rd,i), wod(N≈od), wrd(N≠rd,i), wod(N≠od)]

because, following Eq. 2, wrd({(n1, n′1)}) − sum[wrd({(n1, ⊥)}), wrd({(⊥, n′1)})] ≤ 0. The numerator of sim(c, c′) increases from step i to step i + 1 and the denominator decreases, so we conclude that simi(c, c′) ≤ simi+1(c, c′).
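The template of Eq. 3 with the count weight of Example 8 can be sketched in a few lines of Python (an illustration of ours; function and variable names are not from the paper, and ⊥ is modeled as None):

```python
# Sketch of the similarity template (Eq. 3) with w(S) = |S| (count)
# as the weight function, as in Example 8. Names are illustrative.

def sim(n_rd_dup, n_rd_non, n_od_dup=(), n_od_non=(), w=len):
    num = w(n_rd_dup) + w(n_od_dup)
    den = num + w(n_rd_non) + w(n_od_non)
    return num / den if den else 0.0

# Example 7's sets for (m1, m1"): one duplicate influencing pair,
# four unmatched entries, and empty ODs.
n_dup = [("a1", 'a1"')]
n_non = [("a2", None), ("t1", None), (None, "a3'"), (None, "t3")]
s = sim(n_dup, n_non)  # 1 / (1 + 4) = 0.2, as in Example 9
```

Moving a pair from the non-duplicate set to the duplicate set can only increase the result, illustrating the monotonicity proven above.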
4 Scaling Up Initialization Having discussed the general DDG process, we now show how initialization is performed when processing large amounts of data. To scale up the initialization phase, we use a relational DBMS to store the data structures described in the next subsections. This way, DDG can process an arbitrarily large amount of data and can benefit from several DBMS features, such as access to the data, indices, or query optimization. Essentially, the DBMS acts as a data store.
4.1 Graph Model in Database

The reference graph Gref does not fit in main memory when considering large amounts of data, so we store Gref in a relational database. For every candidate type t ∈ T, a relation C_t(CID, a1, a2, ..., an) is created. CID represents a unique identifier for a candidate c, and ai ∈ At represents the candidate’s OD attributes. Hence, table C_t stores both the information contained in candidate nodes and the information contained in their related attribute nodes. Using standard SQL statements, the tables are created in the database and tuples are inserted as follows. For every candidate c of type t, e.g., returned by an SQL query, we determine OD(c), given the set of OD labels At = {a1, ..., al}, again using an extraction method that can be a query language or a parsing program. We then create a tuple for every candidate, id(c) being a function that creates a unique ID for a candidate. If ODai contains more than one value, the values are concatenated using a special character. To store RDs in the database, we create a single table EDGES(S, T, REL): the first attribute S is the source candidate, T is the target candidate, and REL is the relevance (weight) of that edge. Both the source and the target attribute are references to a CID in a candidate table.
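Using SQLite as a stand-in for the relational DBMS, the schema of this subsection might be set up as sketched below (table contents follow the movie example; this is our illustration, not the authors' code):

```python
import sqlite3

# Sketch of the storage schema of Sec. 4.1: one C_t table per candidate
# type, plus a single EDGES(S, T, REL) table for the RDs.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE C_ACTOR (CID TEXT PRIMARY KEY, OD1 TEXT);
    CREATE TABLE C_MOVIE (CID TEXT PRIMARY KEY);
    CREATE TABLE EDGES   (S TEXT, T TEXT, REL REAL);
""")
con.executemany("INSERT INTO C_ACTOR VALUES (?, ?)",
                [("a1", "Brad Pitt"), ("a2", "Eric Bana")])
con.execute("INSERT INTO C_MOVIE VALUES ('m1')")
# Actors a1 and a2 influence movie m1, each with relevance 1.
con.executemany("INSERT INTO EDGES VALUES (?, ?, 1)",
                [("m1", "a1"), ("m1", "a2")])

# Influencing neighbors of m1, read back through the EDGES table.
neighbors = [t for (t,) in
             con.execute("SELECT T FROM EDGES WHERE S = 'm1' ORDER BY T")]
```

Any relational DBMS would do here; SQLite merely keeps the sketch self-contained.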
(a) C_MOVIE
CID
m1
m1'
m1"

(b) C_TITLE
CID   OD1
t1    Troy
t2    Troja
t3    Illiad...

(c) C_ACTOR
CID   OD1
a1    Brad Pitt
a2    Eric Bana
...   ...

(d) EDGES
S     T     REL
m1    a1    1
m1    a2    1
...   ...   ...

Figure 4: Graph representation in database

CAND1   CAND2   STATUS   TYPE   RANK
m1      m1'     0        M      7
a1      a2      0        A      0
t1      t2      0        T      2
...     ...     ...      ...    ...

Figure 5: Initial priority queue sample

Example 10 Fig. 4 shows excerpts of the data graph tables for our movie example. There is one table for every candidate type, i.e., one for movie candidates (a), one for title candidates (b), and one for actor candidates (c). The table representing RDs is shown in Fig. 4(d). In our example, all edges have equal relevance, set to 1.
4.2 Initializing the Priority Queue

The priority queue used in DDG is a sequence of candidate pairs, where the two candidates of a pair have the same candidate type, and the priority queue contains all candidate pairs subject to comparison. The order of pairs is determined by an algorithm-dependent heuristic that satisfies the following property: let (c, c′) ∈ P Q, and let ranki(c, c′) be the position of (c, c′) at iteration i; then the heuristic guarantees that when a pair (i, i′) ∈ I(c, c′) is classified as duplicate, ranki+1(c, c′) ≤ ranki(c, c′) at the next iteration. P Q is ordered in ascending order of rank. Such a rank function can implement the orderings of [3, 12, 29], and it is natural, because the similarity of pairs increases with an increasing number of similar influencing neighbors, and more similar pairs should be compared first, as they are more likely to be classified as duplicates. In the database, we store the priority queue P Q in a relation PQT(C1, C2, STATUS, TYPE, RANK), where C1 and C2 are references to the CIDs of the two candidates of a pair, TYPE is both candidates’ type, RANK is the current rank of a pair, which determines its position in the priority queue order, and STATUS is the processing status of the pair in the specific DDG algorithm. For instance, STATUS = 0 if a pair is in the priority queue and has not been classified, STATUS = 1 if the pair is a duplicate, and STATUS = -1 if the pair has been classified as a non-duplicate but may re-enter the priority queue if an influencing duplicate is found.

Example 11 Examples of tuples in table PQT are shown in Fig. 5. As sample ranking heuristic, we use the number of unshared influencing neighbors of the two candidates.
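A minimal sketch of the PQT relation and the rank-ordered retrieval query, again with SQLite as a hypothetical stand-in (sample rows follow Fig. 5):

```python
import sqlite3

# Sketch of the priority-queue table PQT of Sec. 4.2; ranks follow the
# sample heuristic of Fig. 5 (number of unshared influencing neighbors).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE PQT (
    C1 TEXT, C2 TEXT, STATUS INTEGER, TYPE TEXT, RANK INTEGER)""")
con.executemany("INSERT INTO PQT VALUES (?, ?, 0, ?, ?)",
                [("m1", "m1'", "M", 7),
                 ("a1", "a2", "A", 0),
                 ("t1", "t2", "T", 2)])

# Retrieval query used in the iterative phase: unclassified pairs
# (STATUS = 0) in ascending order of RANK.
queue = con.execute(
    "SELECT C1, C2 FROM PQT WHERE STATUS = 0 ORDER BY RANK").fetchall()
```

With these sample ranks, the actor pair is retrieved first, then the title pair, then the movie pair.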
(a) ODW_actor_OD1
A           W
Brad Pitt   1
Eric Bana   1
...         ...

(b) ODSIMW_actor_OD1
A1          A2          W
Brad Pitt   Brad Pit    1
Eric Bana   Erik Bana   1
...         ...         ...

(c) DEP
S1    S2     T1    T2     STATUS_S   STATUS_T   W_S   W_T
m1    m1'    a1    a1'    0          0          1     1
m1    m1'    a1    a2'    0          0          1     1
...   ...    ...   ...    ...        ...        ...   ...

Figure 6: Sample precomputation tables
4.3 Precomputations

In the similarity template proposed in Eq. 3, the varying operands are N≈rd(c, c′) and N≠rd(c, c′), as their values depend on the duplicates detected among their influencing neighbor pairs I(c, c′). OD similarity and difference do not vary as more duplicates are found and can thus be computed prior to the iterative phase. This way, we save expensive and recurrent similarity computations. Indeed, the same computations may be performed several times due to recomparisons. Moreover, when using functions such as the inverse document frequency of a value for weight determination, or the edit distance between two values for OD similarity, the function depends on the value itself, not on the candidate. Hence, if the same attribute value occurs several times in different pairs, the computation would be performed more than once. Following these observations, we precompute (i) weights of attribute values, which we store in ODW_a_t(A, W) relations, i.e., one relation for each OD attribute, and (ii) weights of similar OD value pairs, which are stored in ODSIMW_a_t(A1, A2, W) relations. Note that A, A1, and A2 designate OD attribute values, and W stores the weight. Another precomputation is the creation of a table that associates candidate pairs with their influencing pairs and that stores the status of both pairs, as well as the weights of their relationships. This way, N≈rd(c, c′) and N≠rd(c, c′) can be computed incrementally and without an expensive join over PQT and EDGES that would otherwise be necessary. The schema of the table is DEP(S1, S2, T1, T2, STATUS_S, STATUS_T, W_S, W_T), where S1 and S2 describe a candidate pair, and T1 and T2 its influencing pair. The respective duplicate status of each candidate pair is stored in the STATUS attributes, whereas weights are stored in the W attributes.
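The two OD precomputations can be sketched as follows; in this illustration of ours, the uniform weight of 1 stands in for an idf-style weight, and the edit-distance threshold of 1 is an arbitrary illustrative choice:

```python
# Sketch of the OD precomputations of Sec. 4.3: a weight per attribute
# value (ODW) and a weight per pair of similar values (ODSIMW), with
# edit distance as the secondary similarity measure.

def edit_distance(s, t):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

values = ["Brad Pitt", "Brad Pit", "Eric Bana", "Erik Bana"]
odw = {v: 1 for v in values}                 # stands in for ODW_actor_OD1
odsimw = {(v1, v2): 1                        # stands in for ODSIMW_actor_OD1
          for v1 in values for v2 in values
          if v1 < v2 and edit_distance(v1, v2) <= 1}
```

On the sample values, exactly the two pairs of Fig. 6(b) survive the threshold.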
Example 12 Fig. 6 shows table excerpts for the three types of precomputations we perform. Fig. 6(a) shows precomputed weights for actor values, Fig. 6(b) shows weights of similar attribute values, and Fig. 6(c) shows the table storing candidate pair relationships.

How to perform precomputations efficiently is not discussed in this paper, as the approach is specific to the particular similarity measure or DBMS used. As an example, [16] discusses how to compute string similarity joins efficiently in a database, and [8] presents an operator for similarity joins for data cleaning. We focus on scaling up the unified iterative phase both in space and in time.
Figure 7: Sample RECUS/DUP iteration
5 Scaling Up Retrieval & Update

As described in Sec. 3, the iterative phase in DDG essentially consists of a retrieval step, a classification step, and an update step that involves sorting the priority queue when a duplicate has been found. We present methods to scale up this Retrieve-Classify-Update-Sort (RECUS) process in space using the RECUS/DUP algorithm, making DDG applicable to very large amounts of data. We then present RECUS/BUFF, which extends RECUS/DUP to scale up in time so that DDG on large amounts of data terminates within reasonable time. In this section, we focus on scaling up DDG retrieval and update, and postpone the discussion of scalable classification to Sec. 6, for which we present several solutions orthogonal to the choice of the retrieval and update algorithm.
5.1 Scaling in Space with RECUS/DUP To scale in space, we map the DDG process described in Fig. 3 from main memory to a relational database and use SQL and operations on returned result sets for retrieval and update, as shown in Fig. 7. In Alg. 1, we provide pseudocode for RECUS/DUP. RECUS/DUP Retrieval Phase. Candidate pairs are retrieved from the database as long as the priority queue table PQT contains candidate pairs that are not classified as duplicate or non-duplicate, i.e., as long as PQT contains pairs with STATUS = 0. To retrieve pairs, we send a query to the database, which returns unclassified pairs in ascending order of their RANK. We then iterate over these pairs, as long as no duplicate is found, according to the classify () function (discussion postponed to Sec. 6). This is possible, because the order of candidate pairs remains the same as long as no duplicate is found. When a duplicate is detected, the retrieval query is issued again at the next iteration (because of the isDup flag being set to true), and we classify pairs returned by the new result. Thus, we sort the data stored in table PQT during the retrieval phase only when it is necessary, i.e., when a duplicate is found and the rank of dependent neighbors potentially changes.
/* Retrieval */
while executeSQL(SELECT COUNT(*) C FROM PQT WHERE STATUS = 0) returns C > 0 do
    boolean isDup ← F;
    ResultSet r ← executeSQL(SELECT * FROM PQT WHERE STATUS = 0 ORDER BY RANK);
    while r has more tuples and isDup = F do
        Tuple t ← r.getNextTuple();
        /* Classification */
        isDup ← classify(t.getValue(CAND1), t.getValue(CAND2));
        /* Update */
        if isDup = true then
            t.updateValue(STATUS, 1);
            cluster(t.getValue(CAND1), t.getValue(CAND2));
            executeSQL(UPDATE DEP SET STATUS_T = 1
                       WHERE T1 = t.getValue(CAND1) AND T2 = t.getValue(CAND2));
            ResultSet d ← executeSQL(SELECT S1, S2, RANK(S1,S2) R FROM DEP
                                     WHERE T1 = t.getValue(CAND1)
                                     AND T2 = t.getValue(CAND2)
                                     AND STATUS_S ≠ 1);
            while d has more tuples do
                executeSQL(UPDATE PQT SET STATUS = 0, RANK = d.getValue(R)
                           WHERE CAND1 = d.getValue(S1) AND CAND2 = d.getValue(S2));
        else
            t.updateValue(STATUS, -1)

Algorithm 1: RECUS/DUP Algorithm

RECUS/DUP Update Phase. After a pair has been classified, the status of the retrieved pair is set to 1 if it has been classified as a duplicate, and to -1 if it has been classified as a non-duplicate. To update a value of the current tuple t, we use a function updateValue() that updates both the value in the ResultSet and the value in the database, i.e., in PQT. In case of a duplicate classification, duplicates can be merged, enriched, or clustered in any other way using the cluster() function, which updates graph tables and precomputed tables according to the algorithm-specific method. The next step in any algorithm is to update the status of the duplicate pair in table DEP, which is performed using a standard SQL update statement. As a reminder, DEP is used to retrieve influencing and dependent non-duplicate pairs more efficiently; this way, the next step does not require a join between PQT, EDGES, and PQT to retrieve non-duplicate and classified dependent pairs. Finally, the status of every non-duplicate dependent pair has to be reset to 0, because of new evidence that they
may be duplicates.

Example 13 Consider that the first iteration retrieves ResultSet r containing pairs (a1, a2), (a1, a1’), . . . in that order. At the first iteration, we classify pair (a1, a2), which is a non-duplicate. Hence, we only update the status of the pair in r and PQT and move to the next pair in r. When classifying (a1, a1’), we detect a duplicate. In addition to updating the status of the pair, we determine its dependent pairs, e.g., D(a1, a1’) = {(m1, m1’)}. The rank of dependent pairs changes due to the detected duplicate, e.g., the rank of the dependent movie pair changes from 7 to 5, which is updated in PQT. As a consequence, r is no longer consistent with relation PQT (it still contains rank 7) and the order of pairs has potentially changed. Hence, we issue a new SQL query to get a new, correctly sorted ResultSet r before we proceed with the next classification.

Essentially, we retrieve, classify, and update one pair in PQT at a time, and it is guaranteed that the database is always consistent with the actual state of the DDG algorithm, meaning that both PQT and DEP store the correct STATUS values for all pairs. This is not the case when using RECUS/BUFF, as discussed later. Using RECUS/DUP, a single pair and possibly a dependent pair (one at a time, using an iterator over these pairs) are kept in main memory at every iteration, plus some information to compute similarity. This amount of data very easily fits in main memory. Consequently, increasing the number of pairs to be compared does not affect the amount of space needed in main memory, making DDG scalable in space. However, main memory can be used more effectively. As experiments show (see Sec. 7), sorting PQT is the most time-consuming task in the iterative phase when using RECUS/DUP. The worst case is reached when every pair in the priority queue is classified as a duplicate, which requires sorting PQT after every iteration.
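A minimal executable sketch of this retrieve-classify-update-sort loop, with a stub classifier and a hard-wired dependency standing in for the DEP lookup, might look as follows:

```python
import sqlite3

# Minimal sketch of the RECUS/DUP loop on top of sqlite3. classify() is a
# stub for the similarity-based classification of Sec. 6; the dependency
# between (a1, a1') and (m1, m1') is hard-wired here, whereas the report
# derives dependent pairs from the DEP relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PQT (CAND1 TEXT, CAND2 TEXT, RANK REAL, STATUS INTEGER)")
conn.executemany("INSERT INTO PQT VALUES (?, ?, ?, 0)",
                 [("a1", "a2", 0), ("a1", "a1'", 0), ("m1", "m1'", 7)])

def classify(c1, c2):
    return c2 == c1 + "'"   # stub duplicate decision

while True:
    # Retrieval: unclassified pairs, sorted by ascending rank.
    rows = conn.execute("SELECT CAND1, CAND2 FROM PQT "
                        "WHERE STATUS = 0 ORDER BY RANK").fetchall()
    if not rows:
        break
    for c1, c2 in rows:
        if classify(c1, c2):
            conn.execute("UPDATE PQT SET STATUS = 1 "
                         "WHERE CAND1 = ? AND CAND2 = ?", (c1, c2))
            if (c1, c2) == ("a1", "a1'"):
                # Update: re-rank the dependent pair and reset its status.
                conn.execute("UPDATE PQT SET STATUS = 0, RANK = 5 "
                             "WHERE CAND1 = 'm1'")
            break  # duplicate found: re-issue the sorted retrieval query
        conn.execute("UPDATE PQT SET STATUS = -1 "
                     "WHERE CAND1 = ? AND CAND2 = ?", (c1, c2))
```

The `break` after a duplicate mirrors the re-sorting behavior: iteration over the stale result set stops, and the next outer round retrieves a freshly ordered set of unclassified pairs.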
The algorithm described in the next section reduces the sorting effort and makes DDG more efficient by using main memory more extensively, but with an upper bound, while still using a database to scale up in space.
5.2 Scaling in Time with RECUS/BUFF

RECUS/BUFF uses an in-memory buffer of fixed size s, denoted Bs, to avoid sorting PQT whenever a duplicate is found. The intuition behind this algorithm is that although the ranks of several pairs may change after a duplicate has been found, sorting PQT immediately after finding the duplicate is not always necessary and may actually occur several times before an affected pair is retrieved. For instance, consider the initial priority queue order of Fig. 5. All candidate actor pairs have rank 0. When a1 and a1’ are detected to be duplicates, the rank of their dependent neighbor pair (m1, m1’), computed as the number of non-shared influencing neighbors, changes from 7 to 5. Although the rank has decreased, all actor and title pairs still have a lower rank (0 and 2, respectively), and will be compared first. Hence, sorting the priority queue will not immediately affect the comparison order and should therefore be avoided. To this end, we use Bs to temporarily store dependent neighbors of pairs classified as duplicates, whose rank potentially decreases, and maintain these pairs in order of ascending rank (the same order as the similarity measure), which avoids sorting the much larger priority queue stored in PQT. Using an in-memory buffer requires modifications in the retrieval and update phase of RECUS/DUP, because a pair is either retrieved from PQT on disk or from the buffer Bs in main memory, and is either updated in PQT or in Bs, as depicted in Fig. 8.
Figure 8: Sample RECUS/BUFF iteration

RECUS/BUFF Retrieval Phase. In Alg. 2, we describe the retrieval phase of RECUS/BUFF. Similarly to the retrieval phase of RECUS/DUP (Alg. 1), the retrieval phase determines the pair to be classified. Clearly, this task is more complex in RECUS/BUFF, as both PQT and Bs have to be considered. More specifically, as long as the buffer Bs does not overflow, we distinguish the following two cases, using the lastFromPQ flag: In the first case, the pair to be classified potentially is the next pair in the ResultSet r containing all non-classified candidate pairs from PQT, sorted in ascending order of their rank. That is, the pair at the current cursor position has been classified at a previous iteration, and we use the getNextTuple() function to get the next pair. Note that the remaining steps performed in this case are related to the update phase, discussed further below. In the second case, the pair at the current cursor position has not been classified yet, because the first pair in Bs had a lower rank. In this case, we do not move the cursor forward on the ResultSet r.

RECUS/BUFF Update Phase. Having the correct pair in hand after the retrieval phase shown in Alg. 2, the pair is classified as duplicate or non-duplicate as in RECUS/DUP, before PQT and Bs are updated accordingly. The pseudo-code for the RECUS/BUFF update phase is similar to the update phase of RECUS/DUP in Alg. 1, and we highlight only the differences. When updating the status of the classified pair, two situations are possible: either the pair has been retrieved from PQT, in which case the cursor still points to that pair, or the pair has been retrieved from Bs. In the first case, STATUS can directly be updated at the current cursor position of r using updateValue(), without compromising consistency and without requiring costly repositioning of the cursor within r.
In the latter case, the current pair, which has been retrieved from Bs, is at a position with higher rank in PQT. Hence, the cursor would have to jump forward in r to update the STATUS value, and jump back. We avoid this expensive operation by adding the classified pair to Bs (or updating it there), together with its status. The pair stays in Bs until the iterator on r has moved to that pair in PQT, in which case it is updated in PQT and removed from Bs. We verify whether a pair has been classified and stored in Bs during retrieval, and avoid classifying it again by setting the updateOnly flag to true. Instead, we update the status in PQT to the status stored in Bs, which is guaranteed to contain the most recent status.

Using RECUS/BUFF, dependent neighbor pairs are added to Bs instead of being updated in PQT. Information about a dependent pair represented in a tuple d is added to the buffer as follows: if d is not contained in Bs, d is added to the buffer, i.e., the dependent pair, its status, and its newly computed rank. If d ∈ Bs, we check if the status of the pair equals 1 in Bs. If it does, the pair has been classified as duplicate and does not need to be compared again. If the pair is contained in Bs, but is not a duplicate, the rank of the pair in Bs is updated to the current rank (which is the
while PQT or Bs has unclassified pairs do
    boolean overflow ← false;
    boolean lastFromPQ ← true;
    boolean updateOnly ← false;
    ResultSet r ← executeSQL( same query as in Alg. 1 );
    while r has more tuples and overflow = false do
        Tuple t;
        if lastFromPQ = true then
            t ← r.getNextTuple();
            if t ∈ Bs then
                t.updateValue(STATUS, status in Bs);
                updateOnly ← true;
            else
                updateOnly ← false;
        else
            t ← r.getCurrentTuple();
            lastFromPQ ← true;
        Tuple tb ← Bs.getFirstEntry();
        if updateOnly = false and tb.getRank() ≤ t.getRank() then
            t ← tb;
            lastFromPQ ← false;
        else
            lastFromPQ ← true;
        ... classification & update ...

Algorithm 2: RECUS/BUFF Retrieval Phase
smallest) and the buffer is reordered. After this, the next retrieval phase starts, without sorting the complete priority queue data in PQT.

Example 14 Consider that we classify pair (a1, a1’) as duplicate before its dependent neighbor pair (m1, m1’). At this point, the buffer Bs = {(m1, m1’, 5)}, where 5 is the new rank of the pair. Unlike for RECUS/DUP, we do not sort PQT again to obtain a ResultSet r that is consistent with the algorithm state. Instead, we continue iterating through the same r, with outdated rank and order. Now, consider we reach pair (m1, m1”) with rank 6 in r. When checking for a lower rank in Bs, we see that it contains a pair with lower rank 5. Hence, the next pair to be classified is the first pair in the buffer, i.e., (m1, m1’). We classify the pair as duplicate. Now, because the cursor on r does not point to pair (m1, m1’), which is still sorted according to its old rank (7), we put the pair back into Bs to be updated later in PQT when the cursor on r reaches it. At the next iteration, the pair the cursor points to, i.e., (m1, m1”), has not been classified yet, so we do not move the cursor forward.

In case of a buffer overflow, a strategy that frees buffer space has to be devised. Any pair removed from the buffer needs to be added back to the priority queue, meaning an update of relation PQT. This in turn requires sorting the data of PQT, as the rank of the updated pair may have decreased. To minimize the occurrences of a buffer overflow, and hence of sorting PQT, we clear the entire buffer when it is full. That is, we update all pairs in PQT that were buffered, remove them from Bs, and sort PQT in the next retrieval phase.
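The buffer behavior just described, including clearing the whole buffer on overflow, can be sketched as follows; the Buffer class and the flushed list (standing in for writing buffered pairs back to PQT) are illustrative, not the report's implementation:

```python
import heapq

# Sketch of the bounded in-memory buffer Bs used by RECUS/BUFF, as a
# (rank, pair) min-heap. On overflow the entire buffer is cleared; the
# flushed list stands in for updating PQT, which then has to be sorted
# in the next retrieval phase.
class Buffer:
    def __init__(self, size):
        self.size, self.heap, self.flushed = size, [], []

    def add(self, rank, pair):
        heapq.heappush(self.heap, (rank, pair))
        if len(self.heap) > self.size:       # overflow: clear entire buffer
            self.flushed.extend(self.heap)   # i.e., update PQT and sort it
            self.heap = []

    def first_rank(self):
        return self.heap[0][0] if self.heap else float("inf")

b = Buffer(size=2)
b.add(5, ("m1", "m1'"))
b.add(3, ("m2", "m2'"))
print(b.first_rank())  # lowest buffered rank, compared against the cursor pair
```

During retrieval, `first_rank()` is what gets compared against the rank of the pair under the cursor on r to decide whether the next pair comes from the buffer or from PQT.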
As for RECUS/DUP, the theoretical worst case is reached when PQT needs to be sorted after every iteration. However, it is much less probable to reach the worst case than when using RECUS/DUP. Indeed, RECUS/DUP requires sorting at every iteration when every pair in the priority queue is classified as a duplicate. In RECUS/BUFF, the worst case is reached only if the buffer overflows at every iteration, an unlikely event, especially when choosing the size of the buffer wisely. In setting the buffer size, one should keep in mind that (i) it must fit in main memory, (ii) it should be significantly smaller than table PQT to make sorting it more efficient than sorting PQT, and (iii) it should be large enough to store a large number of dependent pairs to avoid sorting PQT.

In this section, we have discussed the retrieval phase and the update phase of two algorithms that enable scaling DDG both in space and in time. Next, we discuss how to scale up classification.
6 Scaling Up Classification

Classifying a candidate pair requires the computation of a similarity that considers both ODs and RDs, as described in Sec. 3.2. When the reference graph is not held in main memory, similarity computation requires communicating with the database, for which we discuss three methods: SQL-based similarity computation uses SQL queries to determine the operands of Eq. 3, thereby making use of the query optimization capabilities of the database. When database capabilities do not suffice, e.g., because they do not support the weight aggregate function, we use hybrid similarity computation, which performs some operations, e.g., weight aggregation, outside the database. Unlike SQL queries, the external program can furthermore be interrupted as soon as the classification result is clear (i.e., duplicate or non-duplicate). This enhancement, which we call early classification, improves classification efficiency, and experiments show that classification time remains constant with increasing connectivity of candidates in the data graph.
6.1 SQL-Based Similarity Computation

The first method to compute similarity makes heavy use of SQL queries to determine the operands of Eq. 3. These include computing the similarity and the difference of ODs and of RDs. As the techniques to compute these are similar, we focus the discussion on ODs and make some remarks concerning RDs.

Fig. 9 outlines the query determining OD weights. For every pair of candidates (c, c′) of type t, we first determine the set SIM of similar OD attribute values between these candidates, and the associated weights. Because an OD is usually composed of several types of attributes, similar ODs have to be determined for every attribute ai ∈ At and are unified to obtain the complete set N≈_od(c, c′) (ln. 5). Next, the set DIFF = N≠_od(c, c′) is determined by selecting all OD attributes from OD(c) and OD(c′) that are not in SIM, together with their associated weights. Again, different types of attributes are treated individually and unified. The weight aggregation function w_od is then applied to SIM (ln. 19) and to DIFF (ln. 20), and the two resulting double values, i.e., w_od(N≈_od(c, c′)) and w_od(N≠_od(c, c′)), are returned by the query. COALESCE is used to account for the case where SIM or DIFF is empty, in which case the weight is set to 0.

 1  WITH SIM AS (
 2    SELECT A1, A2, W
 3    FROM ODSIMW_t^a1 S, C_t C1, C_t C2
 4    WHERE S.A1 = C1.a1 AND S.A2 = C2.a1 AND C1.CID = c AND C2.CID = c′
 5    UNION ... a2 ... UNION ... UNION ... an ...
 6  ),
 7  DIFF AS (
 8    SELECT C1.a1 A, A1.W W
 9    FROM C_t C1, ODW_t^a1 A1
10    WHERE C1.CID = c AND A1.A = C1.a1
11      AND A1.A NOT IN (SELECT A1 FROM SIM)
12    UNION
13    SELECT C2.a1 A, A2.W W
14    FROM C_t C2, ODW_t^a1 A2
15    WHERE C2.CID = c′
16      AND A2.A = C2.a1
17      AND A2.A NOT IN (SELECT A2 FROM SIM)
18    UNION ... a2 ... UNION ... UNION ... an ...
19  ), SW AS (SELECT COALESCE(w_od(W), 0) ODSIM FROM SIM),
20  DW AS (SELECT COALESCE(w_od(W), 0) ODDIFF FROM DIFF)
21  SELECT ODSIM, ODDIFF FROM SW, DW

Figure 9: Computing OD weights in SQL

Computing the weights of influencing candidate pairs (RDs) is analogous to computing OD weights. We first determine the set of influencing pairs I(c, c′) of candidate pair (c, c′) by selecting all tuples from table DEP where S1 = c and S2 = c′. Due to the fact that we do not have multiple types of dependencies, no UNION operation is required. The set of duplicate influencing pairs SIM = N≈_rd(c, c′) is defined as the set of influencing neighbors in DEP where STATUS_T = 1, i.e., where the target pair (T1, T2) is a duplicate. Non-duplicates DIFF = N≠_rd(c, c′) are all influencing candidates of I(c, c′) that do not appear in any tuple of SIM. The weights of these sets are aggregated as for OD weights (see Fig. 9), using the weight function w_rd.

Using SQL queries to compute OD weights and RD weights has the benefit that the DBMS deals with computing these values for large amounts of data, making use of its optimization capabilities. However, the applicability of the query is limited to aggregate functions supported by the DBMS for the weight functions w_od and w_rd. As a consequence, we define hybrid similarity computation, where SQL queries are used to gather data, and aggregation is performed in an external program.
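To make the structure of Fig. 9 concrete, the following sketch runs a simplified single-attribute instance of the query on SQLite, assuming w_od = SUM (the aggregate used in later examples). All table names, as well as the candidate id a1x standing in for the duplicate a1', are illustrative:

```python
import sqlite3

# Simplified single-attribute instance of the Fig. 9 query, with SUM as
# the (assumed) weight aggregate; table names mirror the precomputed
# relations of Sec. 4.3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE C (CID TEXT, NAME TEXT);
CREATE TABLE ODW_NAME (A TEXT, W REAL);
CREATE TABLE ODSIMW_NAME (A1 TEXT, A2 TEXT, W REAL);
INSERT INTO C VALUES ('a1', 'Brad Pitt'), ('a1x', 'Brad Pit');
INSERT INTO ODW_NAME VALUES ('Brad Pitt', 1.0), ('Brad Pit', 1.0);
INSERT INTO ODSIMW_NAME VALUES ('Brad Pitt', 'Brad Pit', 1.0);
""")

row = conn.execute("""
WITH SIM AS (
  SELECT S.A1, S.A2, S.W
  FROM ODSIMW_NAME S, C C1, C C2
  WHERE S.A1 = C1.NAME AND S.A2 = C2.NAME
    AND C1.CID = ? AND C2.CID = ?),
DIFF AS (
  SELECT C1.NAME A, W1.W W FROM C C1, ODW_NAME W1
  WHERE C1.CID = ? AND W1.A = C1.NAME
    AND W1.A NOT IN (SELECT A1 FROM SIM)
  UNION
  SELECT C2.NAME A, W2.W W FROM C C2, ODW_NAME W2
  WHERE C2.CID = ? AND W2.A = C2.NAME
    AND W2.A NOT IN (SELECT A2 FROM SIM)),
SW AS (SELECT COALESCE(SUM(W), 0) ODSIM FROM SIM),
DW AS (SELECT COALESCE(SUM(W), 0) ODDIFF FROM DIFF)
SELECT ODSIM, ODDIFF FROM SW, DW
""", ("a1", "a1x", "a1", "a1x")).fetchone()
print(row)  # aggregated SIM and DIFF weights for the pair (a1, a1x)
```

For this pair, the one similar value pair yields an OD similarity weight of 1 and, since both values are covered by SIM, an OD difference weight of 0.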
6.2 Hybrid Similarity Computation

In the hybrid version of OD weight computation, we use two queries, Q1 and Q2. Q1 determines the set of similar OD attribute pairs with their weights, corresponding to the subquery SIM in Fig. 9. Q2 determines the set of all OD attributes, defined as OD(c) ∪ OD(c′), so it is essentially the subquery DIFF in Fig. 9 without ln. 11 and ln. 17, which check whether an attribute value is in the set of similar attribute values. In hybrid similarity computation, this check, as well as weight aggregation, is performed outside the database in an external program. The external program consists of the following steps, where the sets D (duplicates) and N (non-duplicates) are initially empty.

1. For every tuple < v1, v2, w > returned by Q1, let D := D ∪ {((v1, v2), w)}.
2. For every tuple < v, w > returned by Q2, check if {(v1,i, v2,i) | (v1,i, v2,i) ∈ D ∧ (v1,i = v ∨ v2,i = v)} = ∅. If it is empty, let N := N ∪ {((v, ⊥), w)}.

3. Once all pairs have been processed, compute the aggregate weights w_od(D) and w_od(N).

Example 15 When computing OD similarity and OD difference of (a1, a1’), Q1 returns similar value pairs, which are added to D in Step 1. Hence, D = {((Brad Pitt, Brad Pit), 1)}. Q2 returns OD(a1) ∪ OD(a1’) = {(Brad Pitt, 1), (Brad Pit, 1)}. Because both OD values are part of a similar pair in D, N = ∅ after Step 2. Step 3 computes w_od(D) = 1 and w_od(N) = 0, assuming w_od computes the sum of weights.

For RDs, the hybrid strategy is slightly different. Indeed, for ODs we have only a precomputed table for similar OD attribute values, so Q1 and Q2 have different input tables. For RDs, table DEP stores both duplicate influencing pairs and non-duplicate influencing pairs, together with their status. Consequently, we use a query Q3 to determine I(c, c′), which sorts influencing pairs in descending order of their status, so that duplicate pairs are the first pairs in the result set. The external program then takes care of splitting up duplicates and non-duplicates. The chosen order guarantees that all duplicates are added to D before the first non-duplicate appears, so that we can apply Step 2 as for OD weight computation.

Example 16 When comparing (m1, m1’), Q3 returns tuples {(a1,a1’,1,1), (a2,a2’,1,1), (a1,a3,1,0), (a2,a1’,1,0), (a2,a3,1,0), (t1,t2,1,0)} in that order, where each tuple contains the influencing pair, its weight, and its status. The external program iterates through these tuples and adds them to D as long as the status equals 1. This results in D = {((a1,a1’), 1), ((a2,a2’), 1)}. The remaining tuples are split up into individual candidates, and we apply Step 2 to obtain N = {((a3, ⊥), 1), ((t1, ⊥), 1), ((t2, ⊥), 1)}. Using the weight sum as w_rd, we obtain w_rd(D) = 2 and w_rd(N) = 3.
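The three steps of the external program can be sketched as follows, assuming weight-sum aggregation; q1 and q2 stand for the result sets of queries Q1 and Q2:

```python
# Sketch of the hybrid external program (Steps 1-3) for OD weights,
# assuming w_od is the sum of weights.
def hybrid_od_weights(q1, q2):
    D = {}                                    # similar value pairs -> weight
    for v1, v2, w in q1:                      # Step 1
        D[(v1, v2)] = w
    N = {}
    for v, w in q2:                           # Step 2: keep values not
        if not any(v in pair for pair in D):  # covered by any pair in D
            N[(v, None)] = w
    return sum(D.values()), sum(N.values())   # Step 3: w_od(D), w_od(N)

# Example 15 revisited:
q1 = [("Brad Pitt", "Brad Pit", 1)]
q2 = [("Brad Pitt", 1), ("Brad Pit", 1)]
print(hybrid_od_weights(q1, q2))  # -> (1, 0)
```

The membership check of Step 2 is exactly the predicate that ln. 11 and ln. 17 of Fig. 9 express in SQL, moved into the external program.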
Compared to SQL-based similarity computation, we have to send three queries to the database instead of one, which can be a drawback due to communication overhead. Furthermore, the results returned by Q1, Q2, and Q3 are larger than the result of the query used in SQL-based similarity computation, and these results need to be processed in main memory, i.e., entries are added to D and N and are aggregated. But compared to the in-memory buffer, we consider this main-memory consumption negligible, because in the worst case |D| + |N| = |I(c)| + |I(c′)|, which can be taken into account when setting the buffer size. In practice, these sets are small (14 candidates being the maximum observed in the experiments of the DDG algorithms summarized in Sec. 2). The main advantage of hybrid similarity computation is that it overcomes database system limitations, e.g., when weight aggregation functions are not supported. We expect both methods to have comparable time complexity, because they compute the same result, except that processing is split between the database and an external program for hybrid similarity computation. This is confirmed by our experiments. We now present a technique that increases classification scalability when using hybrid similarity computation.
6.3 Early Classification

We efficiently compute similarity by stopping the computation as soon as we know whether the outcome results in a duplicate or non-duplicate classification. This amounts to (i) stopping computing the similarity once we know that the result cannot exceed the similarity
threshold, in which case we know the pair is a non-duplicate, or (ii) stopping when the similarity is above the threshold and cannot decrease below the threshold again, hence the pair is a duplicate. We call this novel technique early classification. It distinguishes itself from existing filters defined as upper bounds to the similarity measure [1, 28] in that no extra filter function is defined to prune non-duplicates prior to similarity computation. Instead, the similarity function is computed incrementally, and intermediate results are used to classify non-duplicates prior to termination. Using a single SQL query, early classification is not possible, so the optimization proposed in this section applies only to hybrid similarity computation. As real-world data usually contains only a small percentage of duplicates, we decide to determine non-duplicates more efficiently using early classification, although early classification can also be used to classify pairs of candidates as duplicates before the similarity has been completely calculated.

Consider the following simplified notation of the similarity template provided in Eq. 3:

sim = (od≈ + rd≈) / (od≠ + rd≠ + od≈ + rd≈)

If a candidate pair similarity sim > θ, the candidate pair is classified as a duplicate, otherwise it is a non-duplicate. Hence, the following equations, which we use in our implementation, correctly classify non-duplicates although sim has not been calculated completely:

od≈ + rd≈ < θ → sim < θ → c, c′ not duplicates    (4)

(od≈ + rd≈) / (rd≠ + od≈ + rd≈) < θ → sim < θ → c, c′ not duplicates    (5)

To include early classification, we modify hybrid similarity computation. First, we do not treat ODs and RDs sequentially by first computing the similarity and the difference of ODs and then computing them for RDs; instead, we first compute both similarities to obtain od≈ + rd≈. That is, we execute Q1 and Q3, perform Step 1 for ODs, and iterate through the result of Q3 until the duplicate status stored in the attribute STATUS of a tuple switches to 0. Before we continue, we apply Eq. 4. If it classifies the pair as a non-duplicate, we avoid iterating over non-duplicates in Q3 and executing Q2. Otherwise, we start computing the RD difference by iterating through the remaining tuples returned by Q3. At every iteration, we check if Eq. 5 can classify the pair as a non-duplicate, to potentially save further iterations as well as query Q2. If the pair is not classified as a non-duplicate after iterating through Q3’s result, we execute Q2 and iterate over its result, again checking at every iteration if Eq. 5 classifies a non-duplicate. When reaching the final iteration, we have finally computed sim(c, c′) and return the corresponding classification result.

Using early classification helps to detect non-duplicates without computing the exact similarity, so it increases classification efficiency when a significant portion of pairs are non-duplicates. The larger RDs and ODs are (i.e., if candidates have a large number of influencing neighbors or OD attribute values), the more processing is potentially saved, because more iterations through non-duplicates among RDs and ODs (i.e., in the results of Q3 and, when executed, Q2) can be skipped. This explains why in our experiments, which are discussed next, classification time is not affected by increasing RD size when using early classification.
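A sketch of this incremental check, applying Eq. 4 once the similarities are known and Eq. 5 while accumulating difference weights, might look as follows; the function and parameter names are illustrative:

```python
# Sketch of early classification for non-duplicates. rd_diff_weights and
# od_diff_weights stand for the weights of non-duplicate tuples streamed
# from Q3 and Q2, respectively.
def classify_early(theta, od_sim, rd_sim, rd_diff_weights, od_diff_weights):
    sim_part = od_sim + rd_sim
    if sim_part < theta:                                  # Eq. 4
        return False                                      # non-duplicate
    rd_diff = 0.0
    for w in rd_diff_weights:                             # Q3 non-duplicates
        rd_diff += w
        if sim_part / (rd_diff + sim_part) < theta:       # Eq. 5
            return False
    od_diff = 0.0
    for w in od_diff_weights:                             # Q2 result
        od_diff += w
        if sim_part / (od_diff + rd_diff + sim_part) < theta:
            return False
    return True  # sim fully computed and not below the threshold
```

Each partial denominator can only grow, so once a bound drops below θ the final similarity is guaranteed to be below θ as well, and the remaining iterations (and possibly query Q2) are skipped.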
7 Evaluation

We evaluate how our proposed methods perform in terms of efficiency and scalability. The evaluation includes experiments on artificial data, used to control several parameters, experiments on real-world data, and a comparative study with related work.
7.1 Data Sets

RealCD (Real-world CD Data): The real-world data set we use comes from the CD domain, the data being available at FreeDB¹. In this scenario, we consider CDs, artists, and track titles to be candidates. CD ODs consist of title, year, genre, and category attributes, and CDs are related to artist and track candidates. Artist and track candidates respectively have the artist’s name and the track title as OD. Track candidates have an RD consisting of artist candidates. Using RealCD data, we evaluate how RECUS scales in large, real-world scenarios, but we have little influence on data characteristics. In order to modify parameters and study their effect, we generate artificial data.

ArtMov (Artificial Movie Data): From a list of 35,000 movie names and 800,000 actors from IMDB², we generate data sets for which we control (i) the number of candidate movies M and the number of candidate actors A to be generated, (ii) the connectivity c, i.e., the average number of actors per movie and movies per actor, (iii) the duplicate ratio dr, defined as the percentage of duplicate pairs that enter the priority queue, (iv) the probability of errors in ODs, and (v) the probability of errors in RDs. More specifically, every movie candidate has a movie title as OD, selected from the list of real movie titles obtained from IMDB. Similarly, every actor candidate has a name as OD attribute, selected from the list of real actor names. Furthermore, we assume that several different actors play in a movie, and set the RD of movies to be a set of randomly chosen actor candidates. Similarly, as actors can play in several movies, we add randomly selected movies to actor RDs. How many edges are added between movies and actors is controlled by the connectivity. Duplicates are generated by duplicating a certain number of movies and actors, whose ODs are polluted with typos (inserted/missing characters), non-similar strings, or completely deleted values.
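The OD pollution step could be sketched as follows; the probabilities, the replacement alphabet, and the function name are illustrative, not the generator's actual implementation:

```python
import random

# Illustrative sketch of OD pollution: with given probabilities, a
# duplicated value gets a typo (one dropped character), is replaced by a
# non-similar random string, or is deleted entirely; otherwise it is kept.
def pollute(value, rng, p_typo=0.2, p_replace=0.2, p_delete=0.2):
    r = rng.random()
    if r < p_typo and value:
        pos = rng.randrange(len(value))
        return value[:pos] + value[pos + 1:]        # drop one character
    if r < p_typo + p_replace:
        return "".join(rng.choice("abcdefgh") for _ in range(8))
    if r < p_typo + p_replace + p_delete:
        return ""                                   # value deleted
    return value                                    # unchanged

rng = random.Random(42)
print([pollute("Brad Pitt", rng) for _ in range(3)])
```

Varying the three probabilities is what lets the experiments control how hard polluted duplicates are to detect via their ODs.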
These errors occur with different probabilities. As for errors in RDs, the current generator removes influencing candidates with a certain probability. The number of candidates and the duplicate ratio in the priority queue are correlated as follows: we generate the minimum number of candidates necessary to build a priority queue of size pq with a given percentage dr of duplicate pairs. In our experiments, A = M. The number of duplicate pairs equals pq · dr, which results from duplicating (pq · dr)/2 original movie candidates as well as (pq · dr)/2 actor candidates. The remaining pq − pq · dr non-duplicate pairs are equally distributed between movie pairs and actor pairs, each type resulting from x non-duplicates such that pq − pq · dr = x · (x − 1). Hence,

A = M = max((pq · dr)/2, roundUp(x)) + (pq · dr)/2    (6)
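A small helper illustrating this sizing, under the assumption that Eq. 6 combines the duplicated candidates per type with the x non-duplicate originals, might read:

```python
import math

# Illustrative sketch: candidates per type for a priority queue of pq pairs
# with duplicate ratio dr. Solves x*(x-1) = pq*(1-dr) for the number x of
# non-duplicate candidates per type; the combination with the pq*dr/2
# duplicated candidates follows our reading of Eq. 6 and is an assumption.
def candidates_per_type(pq, dr):
    nondup_pairs = pq * (1 - dr)
    x = (1 + math.sqrt(1 + 4 * nondup_pairs)) / 2   # positive root
    return max(pq * dr / 2, math.ceil(x)) + pq * dr / 2
```

For dr = 1.0 every generated original is duplicated once, so a priority queue of pq pairs needs pq candidates per type.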
In all our experiments concerned with efficiency, all error probabilities are set to 20%. When not mentioned otherwise, the connectivity is c = 5 and the buffer size is set to 1,000. The number of candidates as well as the sizes of PQT for ArtMov data are moderate compared to the real-world data sets. This is mainly due to the number of

¹ http://www.freedb.org
² http://www.imdb.com
21
[Plot: three panels (dr = 0.2, dr = 0.4, dr = 1.0), each showing retrieval+update time (s) for RECUS/DUP and RECUS/BUFF over the number of pairs in PQT (in 1000s, bottom x-axis) and the number of candidates (in 1000s, top x-axis).]
Figure 10: Retrieval & update runtime for varying data set size & duplicate ratio

measurements performed for every experiment, which were all repeated several times to obtain average execution times, but this has no influence on the conclusions.
7.2 Experimental Evaluation

We perform experiments to evaluate (i) retrieval and update scalability using RECUS/DUP and RECUS/BUFF, (ii) classification scalability using our different classification techniques, and (iii) the performance on real-world data. As DBMS, we use DB2 V8.2, running on a Linux server and remotely accessed by a Java program running on a Pentium 4 PC (3.2 GHz) with 2 GB of RAM. That is, all reported runtimes also include network latency.

7.2.1 Retrieval and Update Scalability

We evaluate the scalability of both RECUS/DUP and RECUS/BUFF depending on (i) the number of candidates M and A and the priority queue size, (ii) the duplicate ratio dr, (iii) the connectivity c, and (iv) the size of Bs.

Experiment 1. We first evaluate how RECUS/DUP and RECUS/BUFF behave with varying dr, on various data set sizes.

Methodology. We generate artificial movie data according to Eq. 6, with pq ranging from 10,000 to 50,000 candidate pairs in increments of 5,000, and vary the duplicate ratio dr between dr = 0.2 and dr = 1.0 in increments of 0.2. As representative results, we show runtimes of RECUS/DUP and RECUS/BUFF (in seconds) for dr = 0.2, dr = 0.4, and dr = 1.0 in Fig. 10. The number of candidates (|A| + |M|) is shown on the top x-axis, whereas the bottom x-axis shows the size of the priority queue table PQT (in thousands). Both axes are correlated through Eq. 6.

Discussion. Fig. 10 clearly shows that RECUS/BUFF outperforms RECUS/DUP regardless of dr. Obviously, sorting PQT is more time consuming than maintaining the order of the smaller in-memory buffer. We further observe that the higher dr, the more time retrieval and update need for both algorithms, because sorting PQT occurs more frequently in both: RECUS/DUP sorts PQT every time a duplicate has been found, and RECUS/BUFF sorts PQT every time the buffer overflows, which happens more frequently because dependent neighbor pairs enter Bs more often.
The final observation is that with increasing priority queue size / number of candidates, RECUS/BUFF scales almost linearly, unlike RECUS/DUP. Therefore, we expect RECUS/BUFF to be efficient even on very large data sets like RealCD, as we confirm in Exp. 6.

[Figure 11: Retrieval & update runtime for varying buffer size & connectivity; (a) retrieval+update time, (b) retrieval time, (c) update time]

Experiment 2. From the previous experiment, we conclude that RECUS/BUFF performs better than RECUS/DUP because it sorts PQT less frequently, and instead maintains a main-memory buffer of fixed size s. Clearly, s plays a central role in the efficiency gain. Another factor that affects the filling of the internal buffer Bs is the connectivity c: the higher it is, the more neighbors enter the buffer when a duplicate is found, and an overflow occurs more frequently. We therefore expect RECUS/BUFF retrieval and update to slow down with increasing c and decreasing s.

Methodology. We study how a changing buffer size affects artificial movie data with 10,000 pairs in PQT and dr = 0.4. We vary s from 1 to 10,000, and further vary the connectivity c from 1 to 5 for all considered buffer sizes. Results are shown in Fig. 11, which depicts the sum of retrieval and update time for all buffer sizes (a), retrieval time for small buffer sizes (b), and update time for small buffer sizes (c).

Discussion. We observe in Fig. 11 that for all but small buffer sizes, both retrieval time and update time stabilize, and hence so does their sum. On the other hand, for very small buffer sizes, the smaller the buffer, the longer retrieval and update take. Furthermore, the larger the connectivity c, the more time is needed for retrieval and update, mainly because of the increased update complexity: for larger c, when a duplicate is found, a larger number of dependent pairs needs to be determined and added to the buffer. As a general rule of thumb, a buffer size of 1,000 already suffices to significantly improve efficiency.

7.2.2 Classification Scalability

We evaluate the scalability of SQL based similarity measurement, which we call SQL/Complete (SQL/C for short), vs. hybrid similarity computation without and with early classification, called HYB/Complete (HYB/C) and HYB/Optimized (HYB/O), respectively.
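Under the sum-of-weights aggregation used in these experiments, the early-classification idea behind HYB/O amounts to maintaining an upper bound on the final similarity and stopping once that bound falls below the duplicate threshold. The sketch below uses hypothetical names (`classify_early`, `components` as weight/comparison pairs) and stands in for the actual similarity measure template:

```python
def classify_early(components, theta):
    """Early classification sketch: the pair similarity is the weighted
    average sim = sum(w*s) / sum(w). Comparisons are evaluated one by one;
    assuming every not-yet-evaluated component scores a perfect 1.0 yields
    an upper bound on the final similarity. Once that bound drops below the
    threshold theta, the pair cannot be a duplicate, so the remaining
    (expensive) comparisons are skipped."""
    total = sum(w for w, _ in components)
    achieved, remaining = 0.0, total
    for weight, sim_fn in components:
        achieved += weight * sim_fn()   # pay for this comparison only now
        remaining -= weight
        upper = (achieved + remaining) / total  # unseen sims assumed = 1.0
        if upper < theta:
            return False                # early exit: certain non-duplicate
    return achieved / total >= theta
```

The savings grow with the number of components per pair, which is consistent with the observation below that early classification helps most for high connectivity.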
We vary (i) the number of candidates M and A and the priority queue size, (ii) the connectivity c, and (iii) the duplicate ratio dr. As weight aggregation function, we use the sum of weights, so that SQL based similarity is applicable.

Experiment 3. We compare runtimes of SQL/C, HYB/C, and HYB/O on artificial movie data of different sizes, using varying duplicate ratios. The goal of this experiment is to analyze how the different approaches scale and how their behavior is influenced by dr.

Methodology. We generate artificial movie data with PQT sizes ranging from 5,000 to 40,000 in increments of 5,000, and for each size, we generate data with dr varying
from 0.2 to 1.0 in increments of 0.2. We run each classification method on each data set and measure its runtime in seconds. Results are shown in Fig. 12 for selected duplicate ratios dr = 0.2, dr = 0.4, and dr = 0.8.

[Figure 12: Classification time comparison; comparison time (s) of SQL/C, HYB/C, and HYB/O vs. number of pairs in PQT, for (a) dr = 0.2, (b) dr = 0.4, (c) dr = 0.8]

[Figure 13: Effect of connectivity on classification time; comparison time (s) of SQL/C and HYB/O vs. connectivity c]

Discussion. SQL/C and HYB/C have comparable execution times, because they both have to compute the same result. All classification methods scale almost linearly over the range of considered priority queue sizes, and hence with the number of candidates, when no blocking technique is additionally used. Early classification saves classification time: at dr = 0.2, 32% of classification time (compared to HYB/C) is saved, which gracefully degrades to 26% at dr = 0.8, when 40,000 pairs are compared. For dr = 1, we still observe 5% savings, which are due to pairs that are classified as non-duplicates although they are in fact duplicates (false negatives). With increasing dr, we also observe that the comparison time of both SQL/C and HYB/C increases only slightly.

Experiment 4. Our next experiment shows how the connectivity c affects the efficiency of our different classification methods. Because c defines how many influencing neighbors a candidate has, which have to be determined using join operations and processed, we expect similarity computation to be slower for larger c when using SQL/C. On the other hand, HYB/O potentially saves more processing the larger c is.

Methodology. We vary the connectivity c, i.e., the average number of influencing candidates, from c = 10 to c = 50 in increments of 10 for an ArtMov data set of size 10,000 and duplicate ratio 0.4. Fig. 13 reports comparison time for SQL/C and HYB/O.

Discussion. When using SQL/C for duplicate classification, comparison time increases with increasing connectivity, an effect also observed for other DDG algorithms.
Indeed, the average comparison time per candidate increases from 9 ms for c = 10 to 13 ms for c = 50 when using SQL/C, whereas it is around 7 ms for all c when using HYB/O. This experiment clearly shows that HYB/O counters the negative effect of increasing c on efficiency. We currently do not have an explanation for the shape of the SQL/C curve, where comparison time is roughly constant between c = 20 and c = 40. We suspect
that “intriguing behavior of modern [query] optimizers” [25] is partly responsible for that behavior, but this needs further investigation. Nevertheless, the general trend is clear. The analysis of the techniques proposed in this paper using artificial data of moderate size leads to the conclusion that RECUS/BUFF and HYB/O scale up best. We now put them to the test by applying them to large, real-world data.

7.2.3 Real-World Behavior

Using RealCD, we show the real-world scalability of our approach on a very large data set. We do not present results on effectiveness due to lack of space, and because the benefit of DDG in terms of effectiveness has already been studied extensively (see Sec. 2).

Experiment 5. We show that RECUS/BUFF in combination with HYB/O scales linearly with the number of candidate pairs in PQT even on large data sizes that do not fit in main memory.

Methodology. Using RealCD data with 1,000,000 candidates, we apply a blocking technique to reduce the number of candidate pairs entering PQT, a common technique also used by [4, 19]. Note that we evaluate the behavior of our approaches according to the number of candidate pairs that are in PQT, so this does not affect our conclusions. To prune candidate pairs, we use the sorted neighborhood method (SNM) [17] with window size 3, i.e., 2,000,000 candidate pairs enter PQT. OD attribute weights are precomputed as the inverse document frequency (IDF) of their value, whereas the RD weight is fixed to 1. Due to the large amount of data, we consider only equal OD attribute values and do not perform expensive string similarity computation during initialization. Being part of initialization, this does not affect our results, which apply to the iterative phase. The buffer size is 1,000, so that we can compare runtimes with those obtained on artificial data. We obtain the following results: retrieval takes 1,379 s, classification takes 5,482 s, and update takes 17,572 s.

Discussion.
Among the two million candidate pairs in PQT, we found 600,000 duplicates, so the observed duplicate ratio is 0.3. On ArtMov data of size 50,000 with dr = 0.3 and a buffer of 1,000, we obtained a retrieval time of 36 seconds and an update time of 432 seconds. Extrapolating linearly to two million candidate pairs yields an expected retrieval and update time of 18,720 seconds, so the results obtained on a million candidates with similar parameters are in accord with the linear behavior of retrieval and update observed in Exp. 1. We further observe that update time dominates classification time, but the extra update effort required by DDG algorithms is justified by a significant improvement in duplicate detection effectiveness. Classification time is also in accord with the linear behavior of HYB/O observed in Exp. 3.
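The extrapolation above is simple arithmetic and can be checked directly; all numbers are taken from the text:

```python
# Sanity check of the Exp. 5 extrapolation (all figures from the report):
small_pairs = 50_000          # ArtMov priority queue size
large_pairs = 2_000_000       # RealCD candidate pairs entering PQT
small_time = 36 + 432         # retrieval + update seconds at dr = 0.3

scale = large_pairs // small_pairs   # 40x more candidate pairs
expected = small_time * scale        # linear extrapolation: 18,720 s
measured = 1_379 + 17_572            # observed retrieval + update: 18,951 s
# The measured total deviates from the linear prediction by about 1.2%.
```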
7.3 Comparative Evaluation

From all previous experiments, we conclude that DDG using RECUS/BUFF and HYB/O scales linearly in time with the size of the data s, the duplicate ratio dr, and the connectivity c. Indeed, Tab. 2 summarizes how the different phases of DDG scale in time depending on these three parameters, which in total amounts to linear behavior. Exp. 5 confirms this behavior on a data set an order of magnitude larger than any previously considered data set, as shown in Tab. 3. In addition to data set sizes, Tab. 3
summarizes runtimes reported in the respective publications, when available. We observe that RECUS/BUFF takes comparably long, but this comes as no surprise, as it communicates with a remotely located database instead of operating in main memory. Furthermore, Tab. 3 shows that none of the DDG algorithms except RECUS/BUFF scales linearly in time with all three parameters (s, dr, and c).

Parameter            Retrieval & update    Classification
PQT size s           linear (Exp. 1)       linear (Exp. 3)
duplicate ratio dr   linear (Exp. 1)       constant (Exp. 3)
connectivity c       linear (Exp. 2)       constant (Exp. 4)
Overall              linear                linear

Table 2: DDG scalability using RECUS/BUFF and HYB/O

Approach         # candidates    Classif. time (s)    Not linear in
Singla [27]      1,295           n.a.                 n.a.
Dong05 [12]      40,516          n.a.                 n.a.
RC-ER [3, 2]     83,000          310 - 510 (a)        s, c
RelDC [19]       75,000          180 - 13,000 (b)     s, c
LinkClus [31]    100,000         900                  s
RECUS/BUFF       1,000,000       24,433               -

(a) depending on algorithm. (b) depending on connectivity.

Table 3: Comparison time for different approaches
8 Conclusion and Outlook

We presented a generalization of existing iterative algorithms for duplicate detection in graphs (DDG), consisting of an initialization phase and an iterative phase; the latter in turn consists of retrieval, classification, and update steps. Using this generalization, we presented how initialization and the iterative phase scale up to large amounts of data with the help of an RDBMS.

For iterative retrieval and update, we proposed RECUS/DUP, which scales up in space by issuing a query to the database to get candidate pairs, and which sorts all pairs in the priority queue every time a duplicate is found, because the comparison order may have changed. An improvement to RECUS/DUP that we call RECUS/BUFF uses an in-memory buffer to store pairs whose rank has changed; when retrieving pairs, we check whether the next pair comes from the priority queue or the buffer. We thus save re-sorting the priority queue as long as the buffer does not overflow, thereby scaling DDG in time. Experiments demonstrated that RECUS/BUFF is significantly faster than RECUS/DUP and, unlike any previous DDG algorithm, scales linearly with increasing data set size, duplicate ratio, and connectivity.

To scale the classification step, we proposed a SQL based similarity computation that uses SQL queries to determine the operands of the similarity measure template. However, this method is limited by a DBMS's capabilities, especially those concerning user defined aggregate functions when it comes to determining weights. We therefore proposed a hybrid similarity computation that uses a query to retrieve relevant data, but computes weights
and similarities outside the database. We further improved hybrid classification with early classification, meaning that similarity computation stops as soon as it is certain that a pair is not a duplicate. Experiments show that SQL based similarity computation and hybrid similarity computation have comparable runtimes, and that early classification improves hybrid similarity efficiency. All three methods scale linearly with data set size. We also observe that the duplicate ratio does not significantly affect classification time. Early classification saves more time the larger the connectivity of the graph, and thus counters the degradation of classification runtime with increasing connectivity that is observed without early classification.

Essentially, this paper is the first to consider scaling up DDG to large amounts of data, such as one million candidates (more than an order of magnitude larger than previously considered data sets), and to propose techniques that perform DDG efficiently enough for applications to terminate in a reasonable amount of time. Besides further experimental evaluation, we plan to detect a candidate's OD attributes and RD candidate types automatically, because this is the task requiring the most user interaction. Instead of determining duplicates in a data source, we also plan to investigate how to prevent duplicates from entering the database in the first place, again using relationships, e.g., using techniques as described in [7].
References

[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Conference on Very Large Databases (VLDB), Hong Kong, China, 2002.
[2] I. Bhattacharya, L. Getoor, and L. Licamele. Query-time entity resolution. In Conference on Knowledge Discovery and Data Mining (KDD), Philadelphia, PA, 2006. Poster.
[3] I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD), 2004.
[4] I. Bhattacharya and L. Getoor. Entity resolution in graph data. Technical Report CS-TR4758, University of Maryland, 2006.
[5] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SIAM Conference on Data Mining (SDM), Bethesda, MD, 2006.
[6] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, 2003.
[7] S. Chakrabarti and A. Agarwal. Learning parameters in entity relationship graphs from ranking preferences. In Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2006.
[8] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Conference on Data Engineering (ICDE), 2006.
[9] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In Conference on Data Engineering (ICDE), Tokyo, Japan, 2005.
[10] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Exploiting relationships for object consolidation. In SIGMOD Workshop on Information Quality in Information Systems, Baltimore, MD, 2005.
[11] A. Doan, Y. Lu, Y. Lee, and J. Han. Object matching for information integration: A profiler-based approach. IEEE Intelligent Systems, pages 54-59, 2003.
[12] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Conference on the Management of Data (SIGMOD), Baltimore, MD, 2005.
[13] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1), 2007.
[14] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 1969.
[15] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model, and algorithms. In Conference on Very Large Databases (VLDB), Rome, Italy, 2001.
[16] L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Conference on Very Large Databases (VLDB), Rome, Italy, 2001.
[17] M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. In Conference on Management of Data (SIGMOD), San Jose, CA, 1995.
[18] L. Jin, C. Li, and S. Mehrotra. Efficient record linkage in large data sets. In Conference on Database Systems for Advanced Applications (DASFAA), Kyoto, Japan, 2003.
[19] D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst., 31(2), 2006.
[20] D. Milano, M. Scannapieco, and T. Catarci. Structure aware XML object identification. In VLDB CleanDB Workshop, 2006.
[21] A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In SIGMOD-1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, AZ, 1997.
[22] H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records. Science, 130(3381), 1959.
[23] S. Puhlmann, M. Weis, and F. Naumann. XML duplicate detection using sorted neighborhoods. In Conference on Extending Database Technology (EDBT), 2006.
[24] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23, 2000.
[25] N. Reddy and J. R. Haritsa. Analyzing plan diagrams of database query optimizers. In Conference on Very Large Databases (VLDB), Trondheim, Norway, 2005.
[26] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, 2002.
[27] P. Singla and P. Domingos. Object identification with attribute-mediated dependences. In Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, 2005.
[28] M. Weis and F. Naumann. DogmatiX tracks down duplicates in XML. In Conference on the Management of Data (SIGMOD), Baltimore, MD, 2005.
[29] M. Weis and F. Naumann. Detecting duplicates in complex XML data. In Conference on Data Engineering (ICDE), Atlanta, GA, 2006. Poster.
[30] M. Weis and F. Naumann. Relationship-based duplicate detection. Technical Report HUIB-206, Humboldt University Berlin, 2006.
[31] X. Yin, J. Han, and P. S. Yu. LinkClus: Efficient clustering via heterogeneous semantic links. In Conference on Very Large Databases (VLDB), 2006.