SJoin: A Semantic Join Operator to Integrate Heterogeneous RDF Graphs
Mikhail Galkin¹,²,⁵, Diego Collarana¹,², Ignacio Traverso-Ribón³, Maria-Esther Vidal²,⁴, Sören Auer¹,²
¹ Enterprise Information Systems (EIS), University of Bonn, {galkin|collaran|vidal|auer}@cs.uni-bonn.de
² Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS)
³ FZI Research Center for Information Technology, Germany, [email protected]
⁴ Universidad Simón Bolívar, Venezuela
⁵ ITMO University, Saint Petersburg, Russia
Abstract. Semi-structured data models like the Resource Description Framework (RDF) naturally allow for modeling the same real-world entity in various ways. For example, different RDF vocabularies enable the definition of various RDF graphs representing the same drug in Bio2RDF or Drugbank. Albeit semantically equivalent, these RDF graphs may be syntactically different, i.e., they have distinct graph structures, entity identifiers, or properties. Existing data-driven integration approaches only consider syntactic matching criteria or similarity measures to solve the problem of integrating RDF graphs. However, syntax-based approaches are unable to semantically integrate heterogeneous RDF graphs. We devise SJoin, a semantic similarity join operator that solves the problem of matching semantically equivalent RDF graphs, i.e., syntactically different graphs corresponding to the same real-world entity. Two physical implementations of SJoin are proposed, which follow blocking and non-blocking data processing strategies, i.e., RDF graphs can be merged in a batch or incrementally. We empirically evaluate the effectiveness and efficiency of the SJoin physical operators with respect to baseline similarity join algorithms. Experimental results suggest that SJoin outperforms the baseline approaches: non-blocking SJoin incrementally produces results faster, while blocking SJoin accurately matches all semantically equivalent RDF graphs.
1 Introduction
The support that Open Data and Semantic Web initiatives have received from society has resulted in the publication of a large number of publicly available datasets, e.g., United Nations Data⁶ or the Linked Open Data cloud⁷ allow for accessing billions of records. In the context of the Semantic Web, the Resource Description Framework (RDF) is utilized for semantically enriching data with vocabularies or ontologies. Albeit expressive, the RDF data model allows (e.g., due to the non-unique names assumption) multiple representations of a real-world entity using different vocabularies. To illustrate this, consider chemicals and drugs represented in the Drugbank and DBpedia knowledge graphs. Using different vocabularies, drugs are represented from different perspectives. DBpedia contains more general information, whereas Drugbank provides more domain-specific facts, e.g., the chemical composition and properties, pharmacology, and interactions with other drugs. Fig. 1 illustrates the representations of two drugs in Drugbank and DBpedia: Ibuprofen, a drug for treating pain, inflammation and fever, and Paracetamol, a drug with analgesic and antipyretic effects. Firstly, Drugbank Uniform Resource Identifiers (URIs) are textual IDs, e.g., drugbank:DB00316⁸ corresponds to Acetaminophen and drugbank:DB01050 to Ibuprofen. In contrast, DBpedia utilizes human-readable URIs (e.g., dbr:Acetaminophen and dbr:Ibuprofen) to identify drugs. Secondly, the same attributes are encoded differently with various property URIs, e.g., chemicalIupacName and casRegistryNumber in Drugbank, and iupacName and casNumber in DBpedia, respectively. Thirdly, some drugs might be linked to more than one analogue, e.g., Acetaminophen in Drugbank (drugbank:DB00316) corresponds to two DBpedia resources: dbr:Paracetamol and dbr:Acetaminophen.

Fig. 1: Motivating Example. The Ibuprofen and Paracetamol real-world entities are modeled in different ways by Drugbank and DBpedia. Syntactically the properties and objects are different, but semantically they represent the same drugs. Drug drugbank:DB01050 matches 1-1 with dbr:Ibuprofen, while drugbank:DB00316 matches 1-2 with dbr:Paracetamol and dbr:Acetaminophen.

⁶ http://data.un.org/
⁷ http://stats.lod2.eu/
Traditional join operators, e.g., Hash Join [2] or XJoin [11], are not capable of joining those resources as neither URIs nor properties match syntactically. Similarity join operators [3, 5, 6, 8, 12] tackle this heterogeneity issue, but, given the same degree of syntactic mismatch, string and set similarity techniques are limited in deciding whether two RDF resources should be joined or not. Therefore, we identify the need for a semantic similarity join operator able to satisfy the following requirements:
R1) Applicable to heterogeneous RDF knowledge graphs.
R2) Able to identify joinable tuples leveraging semantic relatedness between RDF graphs.
R3) Capable of performing perfect matching for one-to-one integration, and fuzzy conditional matching for integrating groups of N entities from one graph with M entities from another knowledge graph.
R4) Support of a blocking operation mode for batch processing, and a non-blocking mode for on-demand real-time cases whenever results are expected incrementally.
We present SJoin, a semantic join operator which meets these requirements. The contributions of this article include:
1) Definition and description of SJoin, a semantic join operator for integrating heterogeneous RDF graphs.
2) Algorithms and complexity study of a blocking SJoin for 1-1 integration and a non-blocking SJoin for the N-M similarity case.
3) An extensive evaluation that demonstrates the benefits of SJoin in terms of efficiency, effectiveness and completeness over time under various heterogeneity conditions and confidence levels.
The article is organized as follows: The problem addressed in this work is defined in Section 2. Section 3 presents the SJoin operator, as well as the blocking and non-blocking physical implementations, as solutions for detecting semantically equivalent entities in RDF knowledge graphs. Results from our experimental study are reported in Section 4. Section 5 gives an overview of traditional binary joins and similarity joins as related work. Finally, we sum up the lessons learned and outline future research directions in Section 6.

⁸ Prefixes are as specified on http://prefix.cc/
2 Problem Statement
In this work, we tackle the problem of identifying semantically equivalent RDF molecules from RDF graphs. Given an RDF graph G, we call a subgraph M of G an RDF molecule [4] iff the RDF triples of M = {t1, ..., tn} share the same subject, i.e., ∀ i, j ∈ {1, ..., n} (subject(ti) = subject(tj)). An RDF molecule can be represented as a pair M = (R, T), where R corresponds to the URI (or blank node) of the molecule subject, and T is a set of pairs p = (prop, val) such that the triple (R, prop, val) belongs to M. We name R and T the head and the tail of the RDF molecule M, respectively. For example, an RDF molecule of the drug Paracetamol is (dbr:Paracetamol, {(rdfs:label, "Paracetamol@en"), (dbo:casNumber, "103-90-2"), (dbo:iupacName, "N-(4-hydroxyphenyl)ethanamide")}). An RDF graph G can be described in terms of its RDF molecules as follows:

φ(G) = {M = (R, T) | t = (R, prop, val) ∈ G and (prop, val) ∈ T}    (1)
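Equation (1) can be read as a group-by on the triple subject. The following Python snippet is a minimal sketch of that reading, assuming triples are encoded as plain (subject, prop, val) tuples; the function name `molecules` and the dictionary-based representation of (R, T) are illustrative assumptions and not part of the SJoin implementation.

```python
from collections import defaultdict


def molecules(triples):
    """Group (subject, prop, val) triples into molecules: head R -> tail T."""
    tails = defaultdict(set)
    for subject, prop, val in triples:
        tails[subject].add((prop, val))
    # Freeze the tails so that molecules can be compared and hashed later on.
    return {head: frozenset(tail) for head, tail in tails.items()}


# Example triples taken from the Paracetamol molecule above and from Fig. 1.
graph = [
    ("dbr:Paracetamol", "rdfs:label", "Paracetamol@en"),
    ("dbr:Paracetamol", "dbo:casNumber", "103-90-2"),
    ("dbr:Paracetamol", "dbo:iupacName", "N-(4-hydroxyphenyl)ethanamide"),
    ("dbr:Ibuprofen", "rdfs:label", "Ibuprofen@en"),
    ("dbr:Ibuprofen", "dbo:casNumber", "15687-27-1"),
]
phi = molecules(graph)
```

Here `phi["dbr:Paracetamol"]` holds the three (prop, val) pairs of the example molecule, while `phi["dbr:Ibuprofen"]` holds two.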
Definition 1 (Problem of Semantically Equivalent RDF Graphs). Let φ(G), φ(D), and φ(F) be sets of RDF molecules, and let Me be an RDF molecule in φ(F) which corresponds to an entity e represented by different RDF molecules MG and MD in φ(G) and φ(D), respectively. The problem of identifying semantically equivalent entities between the sets of RDF molecules φ(G) and φ(D) consists of providing a homomorphism θ : φ(G) ∪ φ(D) → 2^φ(F), such that if two RDF molecules MG and MD represent the RDF molecule Me, then Me ∈ θ(MG) and Me ∈ θ(MD); otherwise, θ(MG) ≠ θ(MD).
Definition 1 considers perfect 1-1 matching, e.g., determining 1-1 semantic equivalences between drugbank:DB01050 and dbr:Ibuprofen, as well as N-M matching, e.g., drugbank:DB00316 with both dbr:Paracetamol and dbr:Acetaminophen.
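To make the role of θ concrete, the mapping for the drugs of Fig. 1 can be written down explicitly. This is only an illustration under assumptions: the target identifiers `F:Ibuprofen` and `F:Paracetamol` are hypothetical names for molecules in φ(F), and the helper `equivalent` is not part of the paper.

```python
# theta maps each source molecule (by URI) to a set of molecules in phi(F).
# The "F:..." identifiers are hypothetical placeholders for those molecules.
theta = {
    "drugbank:DB01050": {"F:Ibuprofen"},
    "dbr:Ibuprofen": {"F:Ibuprofen"},
    "drugbank:DB00316": {"F:Paracetamol"},
    "dbr:Paracetamol": {"F:Paracetamol"},
    "dbr:Acetaminophen": {"F:Paracetamol"},
}


def equivalent(m_g, m_d):
    """Two molecules are treated as equivalent if theta maps both to a shared Me."""
    return bool(theta[m_g] & theta[m_d])
```

Under this mapping, drugbank:DB01050 and dbr:Ibuprofen form a 1-1 match, whereas drugbank:DB00316 is equivalent to both dbr:Paracetamol and dbr:Acetaminophen.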
3 Proposed Solution: The SJoin Operator
We propose a similarity join operator named SJoin, able to identify joinable entities between RDF graphs, i.e., SJoin implements the homomorphism θ(.). SJoin is based on the Resource Similarity Molecule (RSM) structure that, in combination with a similarity function Simf and a threshold γ, produces a list of matching entity pairs. RSM is defined as follows:

Definition 2 (Resource Similarity Molecule (RSM)). Given a set M of RDF molecules, a similarity function Simf, and a threshold γ, a Resource Similarity Molecule is a pair RSM = (M, T), where:
• M = (R, T) is the head of RSM and the RDF molecule described in RSM.
• T is the tail of RSM and represents an ordered list of RDF molecules Mi = (Ri, Ti). T meets the following conditions:
  • M is highly similar to Mi, i.e., Simf(R, Ri) ≥ γ.
  • For all Mi = (Ri, Ti) ∈ T, Simf(R, Ri) ≥ Simf(R, Ri+1).

An RSM is composed of a head and a tail that correspond to an RDF molecule and a list of molecules whose similarity score meets the specified threshold γ, respectively. For example, an RSM of Ibuprofen (with omitted tails of property:value pairs) is ((dbr:Ibuprofen, T)[(drugbank:DB01050, T1), (chebi:5855, T2), (wikidata:Q186969, T3)]) given a similarity function Simf, a threshold γ, Simf(dbr:Ibuprofen, drugbank:DB01050) ≥ Simf(dbr:Ibuprofen, chebi:5855), and Simf(dbr:Ibuprofen, chebi:5855) ≥ Simf(dbr:Ibuprofen, wikidata:Q186969). The SJoin operator is a two-step algorithm that performs, first, Similarity Partitioning and, second, Similarity Probing to identify semantically equivalent RDF molecules. To address batch and real-time processing scenarios, we present two implementations of SJoin.
Blocking SJoin Operator solves the 1-1 weighted perfect matching problem, allowing for batch processing of the graphs.
Non-Blocking SJoin Operator employs fuzzy conditional matching for identifying communities of N-M entities in graphs, covering the on-demand case in which results are expected to be produced incrementally.

3.1 Blocking SJoin Operator
Fig. 2 illustrates the intuition behind the blocking SJoin operator. The Similarity Partitioning and Probing steps are executed sequentially. Thus, the blocking SJoin operator completely evaluates both datasets of RDF molecules in the Partitioning step, and then fires the Probing step to produce the whole output. The Similarity Partitioning step is described in Algorithm 1. The operator initializes two lists of RSMs for the two RDF graphs, and incoming RDF molecules are inserted into the respective list with a filled head M and an empty tail T. To populate the tail of an RSM in list A, SJoin resorts to a semantic similarity function for computing a similarity score between the RSM and all RSMs in the opposite list B. If the similarity score reaches the threshold γ, then the molecule from list B is appended to the tail of the RSM. Finally, the tail is sorted in descending order of similarity score such that the most similar RDF molecule obtains the top position in the tail. For instance, the semantic similarity function GADES [10] is able to decide relatedness between the RDF molecules of dbr:Ibuprofen and drugbank:DB01050 in Fig. 1, and assigns a similarity score of 0.8. The algorithm supports datasets with arbitrary amounts of molecules. However, in order to guarantee 1-1 perfect matching, we place a restriction card(φ(DA)) = card(φ(DB)), i.e., the number of molecules in φ(DA) and φ(DB) must be the same. Thus, card(List of RSMA) = card(List of RSMB).

Fig. 2: SJoin Blocking Operator. The Similarity Partitioning step initializes lists of RSMs and populates their tails through a similarity function Simf and a threshold γ. The Similarity Probing step performs 1-1 weighted perfect matching and outputs the perfect pairs of semantically equivalent molecules (MiA, MjB).

Algorithm 1: Similarity Partitioning step for Blocking SJoin operator according to similarity function Simf and threshold γ
Data: Dataset φ(DA), Simf, γ
Result: List of RSMA, List of RSMB
while getMolecule(φ(DA)) do
    MiA ← getMolecule(φ(DA)) ;
    RiA ← head(MiA) ;                                    // Get URI
    for RSMjB ∈ List of RSMB do
        RSMjB = ((RjB, TjB)[(RlA, TlA), ..., (RkA, TkA)]) ;
        RjB ← head(head(RSMjB)) ;                        // Get URI
        if Simf(RjB, RiA) ≥ γ then                       // Probe
            tail(RSMjB) ← tail(RSMjB) + (MiA) ;
return sort(List of RSMA), sort(List of RSMB)

A 1-1 weighted perfect matching is applied at the Similarity Probing stage of the Blocking SJoin operator. It accepts the lists of RSMA and RSMB created and populated during the previous Similarity Partitioning step. This step aims at producing perfect pairs of semantically equivalent RDF molecules (MiA, MjB), i.e., max(Simf(MiA, RSMB)) = max(Simf(MjB, RSMA)) = Simf(MiA, MjB). That is, no molecule in the List of RSMB is more similar to MiA than MjB, and no molecule in the List of RSMA is more similar to MjB than MiA. Algorithm 2 describes how perfect pairs are created; Fig. 3 illustrates the algorithm. Traversing the List of RSMA, the algorithm iterates over each RSMiA. Then, the tail of RSMiA, i.e., an ordered list of highly similar molecules, is extracted. The first molecule of the tail, (RjB, TjB), corresponds to the most similar molecule from the List of RSMB. The algorithm searches for RSMjB in the List of RSMB and examines whether the molecule (RiA, TiA) is the first one in the tail of RSMjB. If this condition holds and (RiA, TiA) is not already matched with another RSM, then the pair ((RiA, TiA), (RjB, TjB)) is identified as a perfect pair and is appended to the result list of pairs LP (cf. Fig. 3a). Otherwise, the algorithm finds the first occurrence of (RiA, TiA) in the tail of RSMjB and appends the resulting pair to LP. When all RSMs are matched, the algorithm yields the list of perfectly matched pairs (cf. Fig. 3b).

Fig. 3: 1-1 Weighted Perfect Matching. (a) 1-1 matching from the bipartite graph of RSMs: the matching is identified from the lists of RSMA and RSMB; RDF molecules MiA = (RiA, TiA) and MjB = (RjB, TjB) are semantically equivalent whenever RiA and RjB are reciprocally the most similar RDF molecules according to Simf. (b) Matched pairs.

Algorithm 2: 1-1 Weighted Perfect Matching of RSMs bipartite graph
Data: List of RSMA, List of RSMB
Result: List of pairs LP = ((RiA, TiA), (RjB, TjB))
for RSMiA ∈ List of RSMA do
    RSMiA = ((RiA, TiA)[(RjB, TjB), ..., (RkB, TkB)]) ;          // Ordered Set
    for (RjB, TjB) ∈ tail(RSMiA) do
        RSMjB ← Find in the List of RSMB ;
        RSMjB = ((RjB, TjB)[(RlA, TlA), ..., (RzA, TzA)]) ;      // Ordered Set
        if (RlA, TlA) = (RiA, TiA) and (RiA, TiA) ∉ LP then
            LP ← LP + ((RiA, TiA), (RjB, TjB)) ;                 // Add to result
        else
            for (RlA, TlA) ∈ tail(RSMjB) do
                find the position of (RiA, TiA) ;
return LP
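The two steps of the blocking operator can be sketched in a few lines of Python. The sketch below rests on several assumptions that are not taken from the paper: a plain Jaccard coefficient over (prop, val) pairs stands in for Simf (it is not GADES), the example molecules use hypothetical property names normalized to a shared vocabulary so that Jaccard can find overlaps, and the probing step is simplified to the reciprocal-best case of Algorithm 2, i.e., the fallback branch is omitted.

```python
def jaccard(tail_a, tail_b):
    """Stand-in for Simf: plain Jaccard over (prop, val) pairs (an assumption)."""
    union = tail_a | tail_b
    return len(tail_a & tail_b) / len(union) if union else 0.0


def partition(mols_a, mols_b, sim, gamma):
    """Similarity Partitioning: one RSM per molecule, tail sorted by score."""
    def build(heads, others):
        rsms = {}
        for uri, tail in heads.items():
            scored = [(sim(tail, t), other) for other, t in others.items()]
            kept = [(s, other) for s, other in scored if s >= gamma]
            kept.sort(key=lambda pair: pair[0], reverse=True)
            rsms[uri] = [other for _, other in kept]
        return rsms
    return build(mols_a, mols_b), build(mols_b, mols_a)


def reciprocal_pairs(rsm_a, rsm_b):
    """Simplified Similarity Probing: keep pairs that rank each other first."""
    pairs = []
    for uri_a, tail_a in rsm_a.items():
        if tail_a and rsm_b[tail_a[0]][:1] == [uri_a]:
            pairs.append((uri_a, tail_a[0]))
    return pairs


# Hypothetical, vocabulary-normalized molecules loosely based on Fig. 1.
drugbank = {
    "drugbank:DB01050": frozenset({("cas", "15687-27-1"), ("label", "Ibuprofen")}),
    "drugbank:DB00316": frozenset({("cas", "103-90-2"), ("label", "Acetaminophen")}),
}
dbpedia = {
    "dbr:Ibuprofen": frozenset({("cas", "15687-27-1"), ("label", "Ibuprofen")}),
    "dbr:Paracetamol": frozenset({("cas", "103-90-2"), ("label", "Paracetamol")}),
}
rsm_a, rsm_b = partition(drugbank, dbpedia, jaccard, gamma=0.3)
matches = reciprocal_pairs(rsm_a, rsm_b)
```

With these inputs, each Drugbank molecule is paired with its DBpedia counterpart. A semantic measure such as GADES would be plugged in through the `sim` parameter; with the original, non-normalized vocabularies, the Jaccard stand-in would find no overlap at all, which is the limitation that motivates SJoin.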
Fig. 4: SJoin Non-Blocking Operator. The operator identifies N-M matchings and produces results as soon as a new molecule arrives. When a molecule (RiA, TiA) arrives, it is inserted into the relevant list and probed against the other list. If the similarity score exceeds the threshold γ, a new matching is produced. (a) Molecule (RiA, TiA) yields a pair ((R1A, T1A), (R2B, T2B)). (b) Molecule (R3B, T3B) yields a pair ((R3B, T3B), (R2A, T2A)).

3.2 Non-Blocking SJoin Operator
The Non-Blocking SJoin operator aims at identifying N-M matchings, i.e., an RSMiA might be associated with multiple RSMs, e.g., RSMjB or RSMkB. Therefore, 1-1 weighted perfect matching is not executed, which enables the operator to produce results as soon as new molecules arrive, i.e., in a non-blocking, on-demand manner. The operator receives two sets of RDF molecules φ(DA) and φ(DB). The Lists of RSMA and RSMB are initialized as empty lists. Algorithm 3 describes the join procedure and Fig. 4 illustrates the algorithm. For every incoming molecule MiA from φ(DA), Algorithm 3 performs the same two steps: Similarity Partitioning and Similarity Probing. The URI RiA of an RDF molecule extracted from the tuple (RiA, TiA) is probed against the URIs of all existing RSMs in the List of RSMB (cf. Fig. 4). If the similarity score Simf(RiA, RjB) reaches the threshold γ, then the pair ((RiA, TiA), (RjB, TjB)) is considered a matching and appended to the results list LP. During the Similarity Insert step, an RSMiA is initialized, the molecule (RiA, TiA) becomes its head, and the RSM is eventually added to the respective List of RSMA. Algorithm 3 is applied to both φ(DA) and φ(DB) and is able to produce results while the Lists of RSMs are constantly updated, which supports the non-blocking operation workflow.

3.3 Time Complexity Analysis
The SJoin binary operator receives two RDF graphs of n RDF molecules each. To estimate the complexity of the blocking SJoin operator, the three most expensive operations have to be analyzed. Table 1 gives an overview of the analysis. The complexity of the Data Partitioner module depends on Algorithm 1, i.e., the construction of the Lists of RSMA and RSMB, and on the similarity function Simf. The asymptotic approximation equals O(n^2 · O(Simf)). To produce ordered tails of RSMs, the similar molecules in the tail have to be sorted in descending order of similarity score. The applicable merge sort and heapsort algorithms have O(n log n) asymptotic complexity. The 1-1 Weighted Perfect Matching component has O(n^3) complexity in the worst case according to Algorithm 2. The Hungarian algorithm [7], a standard approach for 1-1 weighted perfect matching, has the same O(n^3) complexity. Partitioning, sorting, and perfect matching are executed sequentially. Therefore, the overall complexity is the sum of the complexities, i.e., O(n^2 · O(Simf)) + O(n log n) + O(n^3), which equals O(n^2 · O(Simf)) + O(n^3). We thus deduce that the SJoin complexity depends on the complexity of the chosen similarity measure, whereas the lowest achievable order of complexity is limited to O(n^3). The complexity of the non-blocking SJoin operator stems from the analysis of Algorithm 3. The most expensive step of the algorithm is to compute a similarity score between an RSMiA and the RSMs in the List of RSMB. Applied to both φ(DA) and φ(DB), the complexity converges to O(n^2 · O(Simf)).

Algorithm 3: The Non-Blocking SJoin operator executes both Similarity Partitioning and Probing steps as soon as an RDF molecule arrives from an RDF graph.
Data: Dataset φ(DA), Simf, γ
Result: List of pairs LP = ((RiA, TiA), (RjB, TjB))
while getMolecule(φ(DA)) do
    MiA ← getMolecule(φ(DA)) ;
    RiA ← head(MiA), TiA ← tail(MiA) ;                   // Get URI, tail
    for RSMjB ∈ List of RSMB do
        RSMjB = ((RjB, TjB)[]) ;
        RjB ← head(head(RSMjB)) ;                        // Get URI
        TjB ← tail(head(RSMjB)) ;                        // Get tail
        if Simf(RiA, RjB) ≥ γ then                       // Probe
            LP ← LP + ((RiA, TiA), (RjB, TjB)) ;
    head(RSMiA) ← MiA, tail(RSMiA) ← [] ;                // Insert
    List of RSMA ← List of RSMA + RSMiA ;
return LP

Table 1: The SJoin Time Complexity. Results for the steps of Partitioning, Sorting, and Matching, where n is the number of RDF molecules.
Stage        | Blocking SJoin Complexity   | Non-Blocking SJoin Complexity
Partitioning | O(n^2 · O(Simf))            | O(n^2 · O(Simf))
Sorting      | O(n log n)                  | -
Matching     | O(n^3)                      | -
Overall      | O(n^2 · O(Simf)) + O(n^3)   | O(n^2 · O(Simf))
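The probe-then-insert loop of Algorithm 3 maps naturally onto a Python generator, which makes the incremental production of results explicit. The following is a sketch under assumptions: the interleaved `stream` of (source, uri, tail) tuples, the function name `non_blocking_sjoin`, and the Jaccard stand-in for Simf are all illustrative and not taken from the SJoin code base.

```python
def jaccard(tail_a, tail_b):
    """Stand-in for Simf: plain Jaccard over (prop, val) pairs (an assumption)."""
    union = tail_a | tail_b
    return len(tail_a & tail_b) / len(union) if union else 0.0


def non_blocking_sjoin(stream, sim, gamma):
    """Yield (uri_a, uri_b) matchings as soon as molecules arrive.

    `stream` yields (source, uri, tail) tuples with source in {"A", "B"}.
    """
    seen = {"A": [], "B": []}
    for source, uri, tail in stream:
        other = "B" if source == "A" else "A"
        # Probe: compare the new molecule with all molecules of the other graph.
        for other_uri, other_tail in seen[other]:
            if sim(tail, other_tail) >= gamma:
                yield (uri, other_uri) if source == "A" else (other_uri, uri)
        # Insert: make the molecule available to later arrivals.
        seen[source].append((uri, tail))


# Hypothetical arrival order; the property name "cas" is normalized for the demo.
stream = [
    ("A", "drugbank:DB00316", frozenset({("cas", "103-90-2")})),
    ("B", "dbr:Paracetamol", frozenset({("cas", "103-90-2")})),
    ("B", "dbr:Acetaminophen", frozenset({("cas", "103-90-2")})),
]
results = list(non_blocking_sjoin(stream, jaccard, gamma=0.5))
```

In this run, drugbank:DB00316 is matched with both DBpedia resources, i.e., a 1-2 matching is produced without any perfect-matching step. Each cross-graph pair is compared exactly once, which is in line with the O(n^2 · O(Simf)) bound derived above.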
4 Empirical Study
An empirical evaluation is conducted to study the efficiency and effectiveness of SJoin in blocking and non-blocking conditions on RDF graphs from DBpedia and Wikidata. We assess the following research questions: RQ1) Does blocking SJoin integrate RDF graphs more efficiently and effectively compared to the state of the art? RQ2) What is the impact of threshold values on the completeness of a non-blocking SJoin? RQ3) What is the effect of a similarity function on the SJoin results? The experimental configuration is as follows:

Benchmark: Experiment 1 is executed against a dataset of 500 molecules⁹ of type Person extracted from the live version of DBpedia (February 2017). Based on the original molecules, we created two sets of molecules by randomly deleting or editing triples in the two sets. Sharing the same DBpedia vocabulary, the Experiment 1 datasets have a higher resemblance degree compared to Experiment 2. Experiment 2 employs subsets of DBpedia and Wikidata of the Person class. Assessing SJoin in higher heterogeneity settings, we sampled datasets of 500 and 1000 molecules, varying the triples count from 16K up to 55K¹⁰. Table 2 provides basic statistics on the experimental datasets. DBpedia D1 and D2 refer to the dumps of 500 molecules. Further, the dumps of 500 and 1000 molecules for Experiment 2 are extracted from DBpedia and Wikidata.

Table 2: Benchmark Description. RDF datasets used in the evaluation.
          | Experiment 1: People    | Experiment 2: People
          | DBpedia D1 | DBpedia D2 | DBpedia | Wikidata | DBpedia | Wikidata
Molecules | 500        | 500        | 500     | 500      | 1000    | 1000
Triples   | 17,951     | 17,894     | 29,263  | 16,307   | 54,590  | 29,138

Baseline: Gold standards for the comparison of blocking operators include the original DBpedia Person descriptions (Experiment 1) and owl:sameAs links between DBpedia and Wikidata (Experiment 2). We compare SJoin with a Hash Join operator. For a fair comparison, the Hash Join was extended to support similarity functions at the Probing stage. That is, blocking SJoin is compared against a blocking similarity Hash Join, and non-blocking SJoin is evaluated against a non-blocking Symmetric Hash Join. The gold standard for evaluating non-blocking operators comprises the precomputed numbers of pairs whose similarity score exceeds a predefined threshold; gold standards are computed offline.
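The idea of a Hash Join extended with a similarity function at the Probing stage can be illustrated as follows. This is a hypothetical sketch, not the baseline used in the experiments: it assumes that molecules are hashed by their URI string into k buckets (here with `zlib.crc32` for reproducibility), that only molecules within the same bucket are compared, and that plain Jaccard stands in for Simf. The `ex:` molecules are invented sample data.

```python
import zlib
from collections import defaultdict


def jaccard(tail_a, tail_b):
    """Stand-in for Simf: plain Jaccard over (prop, val) pairs (an assumption)."""
    union = tail_a | tail_b
    return len(tail_a & tail_b) / len(union) if union else 0.0


def bucket_of(uri, k):
    """Purely syntactic, deterministic hash of the URI string."""
    return zlib.crc32(uri.encode("utf-8")) % k


def similarity_hash_join(mols_a, mols_b, sim, gamma, k=3):
    """Apply the similarity function only within matching hash buckets."""
    buckets = defaultdict(list)
    for uri, tail in mols_b.items():
        buckets[bucket_of(uri, k)].append((uri, tail))
    pairs = []
    for uri_a, tail_a in mols_a.items():
        for uri_b, tail_b in buckets[bucket_of(uri_a, k)]:
            if sim(tail_a, tail_b) >= gamma:
                pairs.append((uri_a, uri_b))
    return pairs


# Invented sample molecules with different URIs but overlapping descriptions.
people_a = {
    "ex:a1": frozenset({("name", "Ada")}),
    "ex:a2": frozenset({("name", "Alan")}),
}
people_b = {
    "ex:b1": frozenset({("name", "Ada")}),
    "ex:b2": frozenset({("name", "Alan")}),
}
all_pairs = similarity_hash_join(people_a, people_b, jaccard, 0.5, k=1)
bucketed = similarity_hash_join(people_a, people_b, jaccard, 0.5, k=3)
```

With a single bucket, every cross-graph pair is compared and both matches are found. With several buckets, a match is found only if the two differing URIs happen to hash to the same bucket, so `bucketed` can only be a subset of `all_pairs`.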
Metrics: We report on execution time (ET, in seconds) as the elapsed time required by the SJoin operator to produce all the answers. Furthermore, we measure Precision and Recall, and report the F1-measure during the experiments with blocking operators. Precision is the fraction of RDF molecules that have been identified and integrated (M) that intersects with the Gold Standard (GS), i.e., Precision = |M ∩ GS| / |M|. Recall corresponds to the fraction of the identified similar molecules in the Gold Standard, i.e., Recall = |M ∩ GS| / |GS|. Comparing non-blocking operators, we measure Completeness over time, i.e., the fraction of results produced at a certain time stamp. The timeout is set to one hour (3,600 seconds), and the operators' results are checked every second. Ten thresholds in the range [0.1 : 1.0] with step 0.1 were applied in Experiment 1. In Experiment 2, five thresholds in the range [0.1 : 0.5] were evaluated because no pair of entities in the sampled RDF datasets has a GADES similarity score higher than 0.5.

Implementation: Both blocking and non-blocking SJoin operators are implemented in Python 2.7.10¹¹. The baseline improved Hash Joins are implemented in Python as well¹². The experiments were executed on an Ubuntu 16.04 (64 bits) Dell PowerEdge R805 server, AMD Opteron 2.4GHz CPU, 64 cores, 256GB RAM. We evaluated two similarity functions: GADES [10] and Semantic Jaccard (SemJaccard) [1]. GADES relies on semantic descriptions encoded in ontologies to determine relatedness, while SemJaccard requires the materialization of implicit knowledge and mappings. To evaluate the schema heterogeneity of DBpedia and Wikidata in Experiment 2, the similarity function is fixed to GADES.

Fig. 5: Experiment 1 (GADES) with blocking operators. The partitioning bar shows the time taken to partition the molecules in RSMs; probing indicates the time required for 1-1 weighted perfect matching. The black line chart on the right axis denotes the F1 score. (a) SJoin performance: SJoin demonstrates a higher F1 score while consuming more time for perfect matching. (b) Hash Join performance: the baseline Hash Join demonstrates an F1 score below 0.25 even at lower thresholds while spending less time on probing. [Plots not reproduced; x-axis: Threshold (0.1 to 1.0); left y-axis: ET, sec; right y-axis: F1 score.]

⁹ https://github.com/RDF-Molecules/Test-DataSets/tree/master/DBpedia-People/20160819
¹⁰ https://github.com/RDF-Molecules/Test-DataSets/tree/master/DBpedia-WikiData/operators_evaluation

4.1 DBpedia – DBpedia People
Experiment 1 evaluates the performance and effectiveness of blocking and non-blocking SJoin compared to the respective Hash Join implementations. The testbed includes two split DBpedia dumps with semantically equivalent entities but non-matching resource URIs and randomly distributed properties, and the GADES and SemJaccard similarity functions. That is, both graphs are described in terms of one DBpedia ontology. Fig. 5 visualizes the results obtained when applying the GADES semantic similarity function in order to identify a perfect matching of graph resources, i.e., in blocking conditions. SJoin exhibits a better F1 score up to a very high threshold value of 0.9. Moreover, an effectiveness of more than 80% is ensured up to the 0.6 threshold value, whereas Hash Join barely reaches 25% even at lower thresholds. The partitioning time is constant for both operators, but Hash Join performs the partitioning slower due to the application of a hash function to all incoming molecules. However, the high effectiveness of SJoin is achieved at the expense of time efficiency. SJoin has to complete a 1-1 perfect matching algorithm against a large 500x500 matrix, whereas Hash Join performs the perfect matching three times but for smaller matrices equal to the size of its buckets, e.g., about 166x166 for three buckets, which is faster due to the cubic complexity of the weighted perfect matching algorithm (500^3 ≈ 1.25 · 10^8 versus 3 · 166^3 ≈ 1.4 · 10^7).

Fig. 6 shows the results of the evaluation of non-blocking operators with GADES. SJoin outperforms the baseline Hash Join in terms of completeness over time in all four cases with the threshold in the range 0.1-0.8. Fig. 6a demonstrates that the SJoin operator is capable of producing 100% of the results within the timeframe, whereas the Hash Join operator outputs only about 10% of the expected tuples. In Fig. 6b, SJoin achieves full completeness even faster. In Fig. 6c, both operators finish after 18 minutes, but SJoin retains full completeness while Hash Join reaches only 35%. Finally, with the 0.8 threshold in Fig. 6d, Hash Join performs very fast but still struggles to attain full completeness; SJoin takes more time but consistently achieves answer completeness. One of the reasons why Hash Join performs worse is its hash function, which does not consider the semantics encoded in the molecule descriptions. Therefore, the hash function partitions RDF molecules into buckets almost randomly, while it was originally envisioned to place similar entities in the same buckets.

Fig. 6: Experiment 1 (GADES) with non-blocking operators. SJoin produces complete results at all thresholds in contrast to Hash Join. Panels: (a) T = 0.1 (# triples: 166573), (b) T = 0.3 (# triples: 108922), (c) T = 0.5 (# triples: 15148), (d) T = 0.8 (# triples: 406). [Plots not reproduced; x-axis: Time, sec (0 to 3600); y-axis: Completeness, %; series: hash join, sjoin.]

Fig. 7 presents the efficiency and effectiveness of blocking SJoin and Hash Join when applying the SemJaccard similarity function. Since SemJaccard is an unsophisticated measure, the operators require less time for the partitioning and probing stages. That is, due to the heterogeneous nature of the compared datasets, SemJaccard is not able to produce similarity scores higher than 0.4. On the other hand, the simplicity of SemJaccard leads to a significant deterioration of the F1 score already at low thresholds, i.e., 0.3-0.4. Fig. 8 illustrates the difference in elapsed time and achieved completeness of SJoin and Hash Join applying the GADES or SemJaccard similarity functions. Evidently, SemJaccard outputs fewer tuples even at lower thresholds, e.g., 486 pairs at the 0.4 threshold against 50,857 pairs by GADES. We therefore demonstrate that plain set similarity measures such as SemJaccard, which consider only an intersection of exactly the same triples, are ineffective in integrating heterogeneous RDF graphs.

Fig. 7: Experiment 1 (SemJaccard) with blocking operators. (a) SJoin performance: SJoin takes less time to compute similarity scores while the F1 score quickly deteriorates after threshold 0.5. (b) Hash Join performance: the baseline Hash Join in most cases consumes more time and produces less reliable matchings. [Plots not reproduced; x-axis: Threshold (0.1 to 1.0); left y-axis: ET, sec; right y-axis: F1 score.]

Fig. 8: Experiment 1 with fixed threshold. GADES identifies two orders of magnitude more results than Jaccard while SJoin still achieves full completeness. Panels: (a) T = 0.4, GADES (# triples: 50857), (b) T = 0.4, Jaccard (# triples: 486). [Plots not reproduced; x-axis: Time, sec (0 to 3600); y-axis: Completeness, %; series: hash join, sjoin.]

¹¹ https://github.com/RDF-Molecules/operators/tree/master/mFuhsion
¹² https://github.com/RDF-Molecules/operators/tree/master/baseline_ops

4.2 DBpedia – Wikidata People
The distinctive feature of this experiment is that completely different vocabularies are used to semantically describe the same people. Therefore, traditional joins and set similarity joins, e.g., Jaccard, are not applicable. We evaluate the performance of SJoin employing the GADES semantic similarity measure. Fig. 9 reports the efficiency and effectiveness of SJoin compared to Hash Join in the 500 molecules setup. Fig. 9a justifies the range of selected thresholds, as only a small number of pairs have a similarity score higher than 0.5. Blocking SJoin achieves a higher F1 score (max 95%) up to a threshold value of 0.3, but requires significantly more time to accomplish the perfect matching.

[Figure: three panels. (a) GADES distribution; (b) SJoin and (c) Hash Join, each plotting ET, sec (partitioning and probing) and F1 score against Threshold values 0.1 to 0.5.]
Fig. 9: Experiment 2 (GADES) with blocking operators, 500 molecules. (a) The distribution of GADES similarity scores shows that there are few pairs whose score exceeds the 0.4 threshold. (b) SJoin requires more time but achieves an F1 score above 0.9 until T0.3. (c) The baseline Hash Join operates faster but achieves an F1 score below 0.25.

Results of non-blocking SJoin and Hash Join executed against the 500 and 1000 molecules configurations are reported in Fig. 10. The observed behavior of these operators resembles the one in Experiment 1, i.e., SJoin outputs complete results within a predefined time frame, while Hash Join barely achieves 40% completeness in the case with a relatively high threshold of 0.4 and a small number of outputs. Analyzing the observed empirical results, we are able to answer our research questions: RQ1) Blocking SJoin consistently exhibits higher F1 scores, and the results are more reliable. However, time efficiency depends on the input graphs and the applied similarity functions. RQ2) A threshold value prunes the number of expected results and does not affect the completeness of SJoin. RQ3) Clearly, a semantic similarity function allows for matching RDF graphs more accurately.
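As an illustration of why set similarity over exact triples is not applicable when the vocabularies differ, consider the following minimal sketch. The molecules are hypothetical: the prefixes `vocabA:` and `vocabB:`, the predicates, and the values are made up for illustration and do not come from the evaluated datasets. Each molecule is simplified to a set of (predicate, object) pairs describing one person.

```python
def jaccard(a: set, b: set) -> float:
    """Plain Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical molecule using vocabulary A.
person_vocab_a = {
    ("vocabA:birthDate", "1980-01-01"),
    ("vocabA:birthPlace", "vocabA:Bonn"),
    ("vocabA:type", "vocabA:Person"),
}

# The same (hypothetical) person described with vocabulary B:
# different predicates and different value identifiers.
person_vocab_b = {
    ("vocabB:dateOfBirth", "1980-01-01T00:00:00"),
    ("vocabB:placeOfBirth", "vocabB:city42"),
    ("vocabB:instanceOf", "vocabB:human"),
}

# A second description in vocabulary A that shares two of three pairs.
person_vocab_a_variant = {
    ("vocabA:birthDate", "1980-01-01"),
    ("vocabA:birthPlace", "vocabA:Bonn"),
    ("vocabA:deathPlace", "vocabA:Berlin"),
}

# No pair coincides across vocabularies, so the score is zero.
cross_vocab_score = jaccard(person_vocab_a, person_vocab_b)
# Within one vocabulary, exact overlap exists: 2 shared pairs out of 4.
same_vocab_score = jaccard(person_vocab_a, person_vocab_a_variant)

print(cross_vocab_score, same_vocab_score)
```

Under these assumptions, any positive threshold discards the cross-vocabulary pair even though both molecules describe the same person, which is the situation a semantic measure is meant to address.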
5
Related Work
Traditional binary join operators require join variable instantiations to be exactly the same. For example, the XJoin [11] and Hash Join [2] (chosen as a baseline in this paper) operators abide by this condition. At the Insert step, both blocking and non-blocking Hash Join algorithms partition incoming tuples into a number of buckets based on the assumption that, after applying a hash function, similar tuples will reside in the same bucket. The assumption holds true for simple data structures, e.g., numbers or strings. However, applying hash functions to string representations of complex data structures such as RDF molecules or RSMs tends to produce more collisions rather than efficient partitions. At the Probe stage, Hash Join performs matching with respect to a specified join variable. Thus, with the URI as the join variable, semantically equivalent RSMs with different URIs cannot be joined by Hash Join.

[Figure: four panels plotting Completeness, % against Time, sec (0 to 3600) for hash join and sjoin. (a) T = 0.2, 500 molecules, # triples: 153904; (b) T = 0.4, 500 molecules, # triples: 639; (c) T = 0.2, 1000 molecules, # triples: 160062; (d) T = 0.4, 1000 molecules, # triples: 3466.]
Fig. 10: Experiment 2. Non-blocking operators in different dataset sizes. In larger setups, SJoin still reaches full completeness.

Similarity join algorithms are able to match syntactically different entities and address the heterogeneity issue. String similarity join techniques reported in [3, 5, 12] rely on various metrics to compute a distance between two strings. Set similarity joins [6, 8] identify matches between sets. String and set similarity techniques are, however, inefficient when applied to RDF data, as they do not consider the graph nature of semantic data. There exist graph similarity joins [9, 13] which traverse graph data in order to identify similar nodes. However, those operators do not tackle the semantics encoded in knowledge graphs and are tailored to specific similarity functions. In contrast, SJoin, presented in this paper, is a semantic similarity operator that fully leverages the RDF and OWL semantics encoded in the RDF graphs. Moreover, SJoin is able to operate in a blocking, i.e., 1-1 perfect matching, or a non-blocking, i.e., incremental N-M, manner, allowing for on-demand and ad-hoc semantic data integration pipelines. Additionally, SJoin is flexible and is able to employ various similarity functions and metrics, e.g., from simple Jaccard similarity to complex NED [14] or GADES [10] measures, achieving the best performance with semantic similarity functions.
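The contrast between an exact-match Hash Join on URIs and a threshold-based similarity join can be sketched as follows. This is an illustrative sketch under stated assumptions, not the SJoin implementation: the entity records, the example URIs, the threshold, and `toy_similarity` (a plain Jaccard over property values standing in for a semantic measure such as GADES) are all hypothetical.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Classic hash join: partition `left` by `key`, then probe with `right`."""
    buckets = defaultdict(list)
    for row in left:                       # Insert (partitioning) step
        buckets[row[key]].append(row)
    # Probe step: only rows with an identical key value can match.
    return [(l, r) for r in right for l in buckets.get(r[key], [])]

def toy_similarity(a, b):
    """Hypothetical stand-in for a semantic measure: Jaccard over values."""
    va, vb = set(a["values"]), set(b["values"])
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0

def similarity_join(left, right, threshold):
    """Nested-loop similarity join keeping pairs that score >= threshold."""
    return [(l, r) for l in left for r in right
            if toy_similarity(l, r) >= threshold]

# Two hypothetical descriptions of the same entity with different URIs.
left = [{"uri": "http://a.example/Alice", "values": ["Alice", "1980", "Bonn"]}]
right = [{"uri": "http://b.example/Q1", "values": ["Alice", "1980", "Germany"]}]

exact = hash_join(left, right, "uri")                   # URIs differ: no match
similar = similarity_join(left, right, threshold=0.4)   # score is 2/4 = 0.5

print(len(exact), len(similar))
```

In this sketch the hash join returns no pairs because the two URIs fall into different buckets, while the similarity join pairs the records once their score reaches the threshold. The nested loop is chosen for brevity only.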
6
Conclusions and Future Work
We presented SJoin, an operator for detecting semantically equivalent RDF molecules from RDF graphs. SJoin implements two operators: Blocking and Non-Blocking, which rely on similarity measures and ontologies to effectively
detect equivalent entities from heterogeneous RDF graphs. Moreover, the time complexity of the SJoin operators depends on the time complexity of the similarity measure, i.e., SJoin does not introduce additional overhead. The behavior of SJoin was empirically studied on the real-world DBpedia and Wikidata RDF graphs, using the Jaccard and GADES similarity measures. The observed results suggest that SJoin is able to identify and merge semantically equivalent entities, and is empowered by the semantics encoded in ontologies and exploited by similarity measures. As future work, we plan to define new SJoin operators to compute on-demand integration of RDF graphs and to address streams of RDF data.
Acknowledgments. Mikhail Galkin is supported by the project Open Budgets (GA 645833). This work is also funded in part by the European Union under the Horizon 2020 Framework Program for the project BigDataEurope (GA 644564), and the German Ministry of Education and Research with grant no. 13N13627 (LiDaKra).
References
1. D. Collarana, M. Galkin, C. Lange, I. Grangel-González, M. Vidal, and S. Auer. Fuhsen: A federated hybrid search engine for building a knowledge graph on-demand (short paper). In ODBASE, pages 752–761, 2016.
2. A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, 2007.
3. J. Feng, J. Wang, and G. Li. Trie-join: a trie-based method for efficient string similarity joins. VLDB J., 21(4):437–461, 2012.
4. J. D. Fernández, A. Llaves, and Ó. Corcho. Efficient RDF interchange (ERI) format for RDF data streams. In ISWC, pages 244–259, 2014.
5. G. Li, D. Deng, J. Wang, and J. Feng. PASS-JOIN: A partition-based method for similarity joins. PVLDB, 5(3):253–264, 2011.
6. W. Mann, N. Augsten, and P. Bouros. An empirical evaluation of set similarity join techniques. PVLDB, 9(9):636–647, 2016.
7. J. Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.
8. L. A. Ribeiro, A. Cuzzocrea, K. A. A. Bezerra, and B. H. B. do Nascimento. Incorporating clustering into set similarity join algorithms: The sjclust framework. In DEXA 2016, Porto, Portugal, pages 185–204, 2016.
9. Z. Shang, Y. Liu, G. Li, and J. Feng. K-join: Knowledge-aware similarity join. IEEE Trans. Knowl. Data Eng., 28(12):3293–3308, 2016.
10. I. Traverso, M.-E. Vidal, B. Kämpgen, and Y. Sure-Vetter. Gades: A graph-based semantic similarity measure. In SEMANTiCS, pages 101–104. ACM, 2016.
11. T. Urhan and M. J. Franklin. Xjoin: A reactively-scheduled pipelined join operator. IEEE Data Eng. Bull., 23(2):27–33, 2000.
12. S. Wandelt, D. Deng, S. Gerdjikov, S. Mishra, P. Mitankin, M. Patil, E. Siragusa, A. Tiskin, W. Wang, J. Wang, and U. Leser. State-of-the-art in string similarity search and join. SIGMOD Record, 43(1):64–76, 2014.
13. Y. Wang, H. Wang, J. Li, and H. Gao. Efficient graph similarity join for information integration on graphs. Frontiers of Computer Science, 10(2):317–329, 2016.
14. H. Zhu, X. Meng, and G. Kollios. NED: an inter-graph node metric based on edit distance. PVLDB, 10(6):697–708, 2017.