Semantic Web 1 (2017) 1–5 IOS Press
Exploring Term Networks for Semantic Search over Large RDF Knowledge Graphs

Edgard Marx a,b, Saeedeh Shekarpour e, Konrad Höffner a, Axel-Cyrille Ngonga Ngomo a, Jens Lehmann a,d, Sören Auer c

a AKSW Research Group, Department of Computer Science, Leipzig University, Germany. E-mails: {marx,ngonga}@informatik.uni-leipzig.de, [email protected]
b Faculty of Mathematics, Leipzig University of Applied Sciences, Germany
c Enterprise Information Systems, Computer Science Institute, Bonn University, Germany. E-mail: [email protected]
d Smart Data Analytics, Computer Science Institute, Bonn University, Germany. E-mail: [email protected]
e Kno.e.sis, Ohio Center of Excellence in Knowledge-enabled Computing, Computer Science Institute, Wright State University, Dayton, United States of America. E-mail: [email protected]
Abstract. Information retrieval approaches are currently regarded as a key technology to empower lay users to access the Web of Data. To address this need, a large number of approaches such as Question Answering and Semantic Search have been developed. While Question Answering promises accurate results by returning a specific answer, Semantic Search engines are designed to retrieve the top-K resources based on a given scoring function. In this work, we focus on the latter paradigm. We aim to address one of the major drawbacks of current implementations, namely their accuracy. We propose *P, a Semantic Search approach that explores term networks to answer keyword queries on large RDF knowledge graphs. The proposed method is based on a novel graph disambiguation model. The adequacy of the approach is demonstrated on the QALD benchmark data set against state-of-the-art Question Answering and Semantic Search systems as well as in the Triple Scoring Challenge at the International Conference on Web Search and Data Mining (WSDM) 2017. The results suggest that *P is more accurate than the current best performing Semantic Search scoring function while achieving a performance comparable to an average Question Answering system.

Keywords: Semantic Search, Entity Search, Question Answering, Entity Linking, Graph Disambiguation, Information Retrieval
1. Introduction

The uptake of Semantic Web technologies has led to the publication of large volumes of data in RDF format. Approximately 10,000 Resource Description Framework (RDF, http://www.w3.org/RDF) data sets are currently available via public data portals (http://lodstats.aksw.org/). Examples of such data sets are DBpedia [25] and LinkedGeoData [46], which encompass more than 1 billion triples each. To enable lay users to access this data, a significant number of approaches such as Question Answering (QA) and Semantic Search
(SemS) engines have thus been developed. However, retrieving the desired information still poses a significant challenge. Although the use of triple stores leads to direct and efficient access to the data, lay users cannot be expected to familiarize themselves with the underlying formal languages and modeling structures. The use of Semantic Search and Question Answering systems can enhance access to the data. However, they often rely on methods adapted from traditional Information Retrieval (IR), including approaches such as document retrieval and the exploration of triple stores. On the one hand, a typical Question Answering approach begins by converting the input query into a syntax tree. Then,
it generates and ranks potential answer graphs by relying either on a triple store or on document retrieval techniques [43, 50]. On the other hand, a common approach for Semantic Search consists of adapting document retrieval engines and their scoring functions, e.g., term frequency–inverse document frequency (TF-IDF) [21], to Semantic Search [8, 13, 14, 49]. However, document retrieval engines rely on the assumption that the frequency of a term is related to the topic of the document [27]. Overall, both categories of systems that rely on traditional IR methods are commonly penalized with respect to their precision (see Section 5). Research in the area of search over RDF data has thus shifted towards developing methods for efficient Semantic Search [5, 54] or Question Answering [57] that take the topology of RDF data into consideration. This is due to evidence supporting the idea that better results can be achieved by exploring the graph structure of RDF knowledge bases, an assumption derived from linguistics [12, 19] and supported by results in Semantic Search [5, 54] and Question Answering [43, 50, 57]. However, these approaches suffer from low accuracy, especially when dealing with large volumes of data (see Section 5). In this work, we address the following research question: How can we increase the accuracy of the current scoring functions on RDF knowledge graphs (KGs)? While Semantic Search engines seek to retrieve the most relevant entity associated with the query intent, Question Answering systems seek to retrieve answers from the knowledge graphs. In both cases, the query must be correctly segmented and ultimately annotated with the KG resources. Many approaches perform this task using an Entity Linking approach [15, 52]. However, Entity Linking alone does not suffice: to achieve the final goal of a Question Answering or Semantic Search engine, a method is needed to correctly annotate the resources (Entities, Properties, and Objects). Our hypothesis is that a single scoring function can be used to annotate the resources correctly and therefore increase the accuracy of annotations, search results, and answers. Our solution relies on the concept of Term Networks, which we describe formally in the subsequent sections of this work. By relying on these networks, our approach provides means for improving information access on knowledge graphs and, in addition, for developing more reliable methods for Semantic Search and Question Answering systems. In particular, we focus on the study of entity retrieval scoring functions over a particular
graph structure called the Semantic Connected Component (SCC) [29]. Our contributions are as follows:

– We extend the previously introduced Semantic Search scoring function based on Term Networks [29], dubbed *P (read "star path"), that allows querying RDF data;
– We compare our approach with state-of-the-art Semantic Search techniques on the QALD-4 [51] benchmark and show that we outperform them w.r.t. F1 score;
– We provide an evaluation of the QA version of our approach against all participating systems in QALD-3 [9] and QALD-4 [51];
– We describe the participation of *P in the Triple Scoring Challenge at WSDM 2017 [30] and how it helped the Catsear team to achieve the overall 4th place;
– We provide a detailed discussion of the weaknesses and strengths of previously introduced approaches for IR on RDF data as well as of *P itself.

The rest of this paper is organized as follows. Section 2 defines the essential concepts necessary to understand this work. Section 3 provides a review of the related work. Section 4 describes the star path model, also written shortly as *Path or simply *P. Section 5 presents our evaluation and discusses the results. Finally, Section 6 concludes with a summary of the limitations of our approach and a discussion of potential future work.
2. Preliminaries

In this section, we introduce a formalization of Knowledge Graphs (KGs) for the Resource Description Framework (RDF). We then introduce some well-known String Similarity, String Distance, and Document Scoring functions.

2.1. RDF

RDF (https://www.w3.org/TR/REC-rdf-syntax/) is a standard for describing Web resources. A resource can refer to any physical or conceptual thing, such as a Web site, a person, or a device. The RDF data model expresses statements about resources in the form of subject-predicate-object triples.
[Figure 1. An excerpt of the KG: the entity e2, labeled "Mona Lisa", is linked via dbo:artist (labeled "artist") to the entity e1, labeled "Leonardo da Vinci", which is of rdf:type Person. The labels of the rdfs:label properties are omitted for simplification.]
The subject denotes an entity; the predicate expresses a property (of the subject) or a relationship (between subject and object); the object is either an entity or a literal. Entities are identified by Internationalized Resource Identifiers (IRIs), a generalization of Uniform Resource Identifiers (URIs), while literals are used to represent values such as numbers and dates using a lexical representation.

Definition 1 (RDF Knowledge Graph, KG). Formally, let a KG be a finite RDF knowledge graph. A KG can be regarded as a set of triples (s, p, o) ∈ (I ∪ B) × P × (I ∪ L ∪ B), where I is the set of all IRIs and B is the set of all blank nodes, with B ∩ I = ∅. P is the set of all predicates, P ⊆ I. L is the set of all literals, L ⊂ Σ* and L ∩ I = ∅, where Σ is the Unicode alphabet. E is the set of all entities, E = I ∪ B \ P. R = I ∪ B ∪ L is the set of all RDF resources r ∈ R in the KG. A KG is modeled as a directed labeled graph G = (V, D), where V = E ∪ L, D ⊆ E × (E ∪ L), and the labeling function of the edges (not to be confused with rdfs:label) is a mapping λ : D → P.

Figure 1 shows an excerpt of a KG where a literal vertex v_i ∈ L (respectively a resource vertex v_i ∈ R) is illustrated by a rectangle (respectively an oval). Each edge between two vertices corresponds to a triple, where the first vertex is called the subject, the labeled edge the predicate, and the second vertex the object. For example, the rdfs:label edge from e2 to "Mona Lisa" corresponds to the triple (e2, rdfs:label, "Mona Lisa"). SemS systems aim to retrieve the top-K ranked entities which address the information need behind a given user query.

Definition 2 (Query). A query q ∈ Σ* is a user-given keyword string expressing a factual information need.

Definition 3 (Term). A term (not to be confused with an RDF Term) can be a word or a phrase used to describe a thing or to express a concept [38]. In this work, we consider as a term any literal (l ∈ L) in a KG.

Definition 4 (Token). Tokens are sub-strings extracted from another string. A token t ∈ T is the result of a tokenizing function T : Σ* → Σ**, which converts a string (Σ*) to a set of tokens (Σ**).
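To make the data model concrete, the sketch below writes the Figure 1 excerpt down as a small set of triples and derives the directed labeled graph view of Definition 1. It is a minimal illustration, not part of the *P implementation; the names e1, e2 are the anonymized identifiers used in the figure.

```python
# The Figure 1 excerpt as (subject, predicate, object) triples.
KG = {
    ("e2", "rdfs:label", '"Mona Lisa"'),
    ("e2", "dbo:artist", "e1"),
    ("e1", "rdfs:label", '"Leonardo da Vinci"'),
    ("e1", "rdf:type", "Person"),
}

# The directed labeled graph G = (V, D) with edge labeling lambda : D -> P.
V = {s for s, _, _ in KG} | {o for _, _, o in KG}
D = {(s, o) for s, _, o in KG}
LAMBDA = {(s, o): p for s, p, o in KG}  # maps each edge to its predicate
```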
2.2. String Similarity and Distance

The Vector Space Model (VSM) is one of the best-known models used as a scoring function. It represents text as an algebraic model (vectors) and can be used for IR. In VSM, documents and queries are represented as vectors, where each dimension corresponds to a distinct token. Vector operations can therefore be used to compute the similarity between a query and a document. One method to compute how similar two vectors are to each other is the Cosine Similarity. For instance, the VSM score of two strings s1 and s2 can be computed by evaluating the cosine of the token vectors V(s1) and V(s2) as follows:

$$ V(s_1) = (t_{s_1 1}, t_{s_1 2}, \ldots, t_{s_1 m}) \qquad V(s_2) = (t_{s_2 1}, t_{s_2 2}, \ldots, t_{s_2 n}) $$

$$ \cos_\theta(s_1, s_2) = \frac{V(s_1) \cdot V(s_2)}{\lVert V(s_1) \rVert\, \lVert V(s_2) \rVert} $$
There are also works [43] that use the Jaccard coefficient [20] as a scoring function. The Jaccard coefficient is computed by dividing the size of the intersection of two token sets by the size of their union:
$$ J(s_1, s_2) = \frac{|T(s_1) \cap T(s_2)|}{|T(s_1) \cup T(s_2)|} $$
Both string similarity functions can be used as string distance measures by subtracting the similarity from the full-match value, i.e., 1:

$$ \cos_{\theta\,dist}(s_1, s_2) = 1 - \cos_\theta(s_1, s_2) \qquad J_{dist}(s_1, s_2) = 1 - J(s_1, s_2) $$
s1                    s2                            function       distance
"mona lisa artist"    "mona lisa artist"            Cosine         0.00
                                                    Jaccard        0.00
                                                    Levenshtein    0.00
"mona lisa artist"    "lisa mona artist"            Cosine         0.00
                                                    Jaccard        0.00
                                                    Levenshtein    6.00
"mona lisa artist"    "artist lisa mona"            Cosine         0.00
                                                    Jaccard        0.00
                                                    Levenshtein    12.00
"mona lisa artist"    "artist lisa mona unknown"    Cosine         0.13
                                                    Jaccard        0.25
                                                    Levenshtein    17.00

Table 1. Example of distances achieved by different string distance functions on different samples.
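The Cosine and Jaccard distances in Table 1 can be reproduced with a few lines of code. The sketch below assumes plain whitespace tokenization and is purely illustrative:

```python
from collections import Counter
from math import sqrt

def tokens(s):
    return s.lower().split()

def cosine_distance(s1, s2):
    v1, v2 = Counter(tokens(s1)), Counter(tokens(s2))
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return 1 - dot / norm if norm else 1.0

def jaccard_distance(s1, s2):
    t1, t2 = set(tokens(s1)), set(tokens(s2))
    return 1 - len(t1 & t2) / len(t1 | t2)

print(round(cosine_distance("mona lisa artist", "artist lisa mona unknown"), 2))   # 0.13
print(round(jaccard_distance("mona lisa artist", "artist lisa mona unknown"), 2))  # 0.25
```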
Another function that measures string distance is the Levenshtein distance [26]. Levenshtein is a scoring function that measures the minimum number of insertions, deletions, and substitutions required to change one string into another; thus, a high difference indicates a high Levenshtein distance. The main differences among the mentioned functions concern two aspects: (1) token position and (2) frequency. For instance, the Levenshtein distance between two strings can be very high if the tokens do not occur in the same sequence, whereas Jaccard and Cosine similarity disregard token position. Cosine similarity takes the local frequency into consideration; nevertheless, the Jaccard and Cosine functions often produce similar results (see Table 1).

2.3. Document Scoring

There are several functions [41, 42] proposed for scoring documents based on their representative vectors. Salton et al. [42] proposed the classical VSM implemented by well-known IR frameworks such as Lucene (http://lucene.apache.org), Solr (https://lucene.apache.org/solr), and Elasticsearch (https://www.elastic.co). In Salton's model, the token weights are based on their local and global frequency, the Term Frequency–Inverse Document Frequency (TF-IDF):

$$ \mathrm{TFIDF}(q, e) = \sum_{t \in T(q)} idf(t)\, tf(t, e) $$

In the function TFIDF(q, e), t ranges over the tokens of the query q, the function idf(t) returns the inverse document frequency of the token t in the entity collection E, and tf(t, e) returns the frequency of the token t in the entity e. Robertson et al. [41] proposed a TF-IDF scoring function introducing document length normalization. The proposed function comes from the observation that some authors are simply more verbose than others, while other authors may write a single document covering more ground:

$$ \mathrm{BM25}(q, e) = \sum_{t \in T(q)} idf(t)\, \frac{tf(t, e)\,(k + 1)}{tf(t, e) + k \left(1 - b + b\, \frac{|T(e)|}{avgdl}\right)} $$

Herein, |T(e)| is the number of tokens extracted from the entity e, and avgdl is the average number of entity tokens in the entity collection. The parameter b can be used to set different degrees of normalization (0 ≤ b ≤ 1), where b = 0 switches the normalization off. In the absence of advanced optimization, the parameters k and b are usually chosen as k ∈ [1.2, 2.0] and b ∈ [0.5, 0.8]. Scoring functions can be extended to support structured data. An example of a scoring function modified to deal with structured data is BM25F [10], an extended version of the BM25 scoring function designed for structured data. For instance, the tf function in TF-IDF can be applied to each field belonging to the data structure (in the case of RDF data, each of the entity's property-objects):

$$ \mathrm{TFIDF}_f(q, e) = \sum_{t \in T(q)} \sum_{f \in F_e} idf(t)\, tf(t, f) $$
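The BM25 formula above can be sketched directly. The idf variant used here, log((N + 1)/(df + 1)), is an assumption for illustration (Robertson's original uses a different smoothing); entities are simply lists of tokens:

```python
import math

def bm25(query, entity, collection, k=1.2, b=0.75):
    """Score a tokenized entity against tokenized query terms with BM25."""
    N = len(collection)
    avgdl = sum(len(e) for e in collection) / N
    score = 0.0
    for t in set(query):
        df = sum(1 for e in collection if t in e)   # document frequency of t
        idf = math.log((N + 1) / (df + 1))          # an assumed smoothed idf
        tf = entity.count(t)                        # term frequency tf(t, e)
        score += idf * tf * (k + 1) / (tf + k * (1 - b + b * len(entity) / avgdl))
    return score

docs = [["mona", "lisa", "painting"], ["leonardo", "da", "vinci"], ["the", "louvre"]]
print(bm25(["mona", "lisa"], docs[0], docs))
```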
3. Related Work

Information retrieval (IR) over Linked Data is an active and diverse research field. Over the years, a large number of approaches have been designed to tackle the requirements of diverse environments. Accordingly, the state-of-the-art approaches differ in complexity and precision. Two main categories of approaches are
currently used prominently: (1) approaches designed to recover information from KGs by using conventional IR techniques and (2) approaches based on answering natural language questions. In general, both models explore the semantics of an NL query by applying statistical measures and heuristics on the KG. The following subsections discuss these types of systems.

3.1. Semantic Search (SemS)

Existing SemS approaches commonly aim to retrieve the top-K ranked resources for a given NL input query. Swoogle [14] introduces a modified version of PageRank that takes into consideration the types of the links between ontologies. Semplore [57], Falcons [8], and Sindice [13] explore traditional document retrieval methods to retrieve relevant sources and/or resources. All these approaches are based on traditional document retrieval engines, whereby the structure and semantics are not taken into account during the matching phase. Semplore [57] is implemented on the Lucene IR engine (http://lucene.apache.org/), which allows the execution of all basic IR operations using the standard TF-IDF scoring functions to evaluate conjunctive queries. The retrieval is performed with an inverted index called PosIdx. PosIdx stores relation instances (s, p, o) in three different indexes (see Table 2): (1) classes, (2) properties, and (3) entities. In the PosIdx index, each entity e is treated as a document with a field named subjOf and the predicate p in the field. PosIdx stores the objects of predicates under an entity (subjOf) and its symmetric case (objOf). The proposed model is capable of evaluating keyword-based tree-shaped queries with a single target variable. Semplore implements four basic operators: (1) basic retrieval, (2) merge-sort, (3) concept expansion, and (4) relation expansion.
Resource      Index        Description
Class c       SubConOf     Sub-classes of c
              SuperConOf   Super-classes of c
              Text         Tokens in textual properties of c
Predicate p   SubRelOf     Sub-relations of p
              SuperRelOf   Super-relations of p
              Text         Tokens in textual properties of p
Entity e      type         Types of e where (e, rdf:type, c)
              SubjOf       All properties p where (e, p, o)
              ObjOf        All properties p where (s, p, e)
              Text         Tokens in textual properties of e

Table 2. Semplore PosIdx indexes.
Falcons [8] is a search engine that enables users to query RDF KGs by indexing entities along with their minimal self-contained graph. It achieves this by using the labels and URIs extracted from the entity's triples (e, p, o). The authors propose to score entity labels and their URIs ten times higher than other properties. The engine is implemented using the Apache Lucene framework, and the scoring function is based on a combination of cosine similarity and the entity's in-degree cardinality expressed by the function Popularity(e):

$$ InEdges(e) = \{ p \mid (s, p, e) \in KG \} $$

$$ Popularity(e) = \log |InEdges(e)| + 1 $$

$$ score_{Falcons}(q, e) = \cos_\theta(q, e) \cdot Popularity(e) $$
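A minimal sketch of the Falcons score above; cos_qe stands for the precomputed cosine similarity between the query and the entity's indexed text, and the guard for entities without in-edges is an assumption of this sketch:

```python
import math

def popularity(triples, e):
    """Log of the entity's in-degree plus one, as defined above."""
    in_degree = sum(1 for (s, p, o) in triples if o == e)
    return math.log(in_degree) + 1 if in_degree > 0 else 1.0  # avoid log(0)

def score_falcons(cos_qe, triples, e):
    return cos_qe * popularity(triples, e)
```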
Sindice [13] is a search engine that retrieves documents containing a given RDF data set, predicate, or token. Sindice is implemented on top of Lucene and employs the Semantic Information Retrieval Engine (SIREn), a high-performance indexing model that consists of three different inverted indexes:

– Data set and Entity: the tokens extracted from data set or entity labels are associated with a list of entity identifiers along with their position in the literal. This enables the search engine to perform phrase and proximity queries.
– Attribute: tokens from the predicate labels are indexed along with the entity identifiers, the attribute identifier, as well as their position in the label.
– Value: tokens extracted from the literal objects are indexed along with the entity identifier, its frequency, the attribute identifier, the object identifier, and its relative position.

YAHOO! BNC and Umass [16] were respectively the best and second best SemS systems in SemSearch'10. YAHOO! BNC uses a local, per-property term frequency as well as a global term frequency. It also applies a boost based on the number of matched
query terms. In addition, a prior is calculated for each domain and multiplied with the final score. Umass explores existing ranking functions applied to four field types: (1) title; (2) name; (3) dbo:title; and (4) all others. The fields are weighted separately with a specific boost applied to each of them.

Blanco et al. [5] proposed a modified version of the BM25F ranking function adapted for RDF data. The function is applied to a horizontal pairwise index structure composed of the subjects and their property-values. The approach splits properties into two categories, important and unimportant. The important properties are dbp:abstract, rdfs:label, rdfs:comment, rss:description, rss:title, skos:prefLabel, akt:family-name, wn:lexicalForm, and nie:title. The properties dc:date, dc:identifier, dc:language, dc:issued, dc:type, dc:rights, rss:pubDate, dbp:imagesize, dbp:coorDmsProperty, dbo:birthdate, foaf:dateOfBirth, foaf:nick, foaf:aimChatID, foaf:openid, foaf:yahooChatID, georss:point, wgs84:lat, and wgs84:long were categorized as unimportant fields. The proposed adaptation is implemented in the GlimmerY! engine [5] and is shown to be time-efficient and to outperform other state-of-the-art methods on the task of ranking RDF resources.

Zhiltsov et al. [59] proposed an entity-centric SemS model based on unigrams and bigrams applied to five different entity fields F_e (names, categories, similar entity names, related entity names, and other attributes; see Table 3). The model is based on a term frequency per field (f̃_T(q, e)) combined with different field weights for ordered bigrams (terms that appear sequentially, f̃_O(q, e)) and unordered bigrams (f̃_U(q, e)):

$$ \tilde{f}_T(q, e) = \sum_{t_i \in T(q)} \log \sum_{f' \in F_e} w^T_{f'}\, \frac{tf_{t_i, f'} + \mu_{f'}\, pct(t_i, F')}{length(f') + \mu_{f'}} $$

$$ \tilde{f}_O(q, e) = \sum_{t_{i,i+1} \in T(q)} \log \sum_{f' \in F_e} w^O_{f'}\, \frac{tf_{t_{i,i+1}, f'} + \mu_{f'}\, pct(t_{i,i+1}, F')}{length(f') + \mu_{f'}} $$

$$ \tilde{f}_U(q, e) = \sum_{\substack{t_{i,j} \in T(q) \\ 0 < j \le 8,\ j \ne i}} \log \sum_{f' \in F_e} w^U_{f'}\, \frac{tf_{t_{i,j}, f'} + \mu_{f'}\, pct(t_{i,j}, F')}{length(f') + \mu_{f'}} $$

$$ \mathrm{FSDM}(q, e) = \lambda_T\, \tilde{f}_T(q, e) + \lambda_O\, \tilde{f}_O(q, e) + \lambda_U\, \tilde{f}_U(q, e) $$

where t_i, t_{i,i+1}, and t_{i,j} are respectively a token, two consecutive tokens, and a bigram formed from any two distinct tokens extracted from the literal values of the entity's fields (the property-object instances). The variable µ_{f'} is the average number of tokens extracted from the instances of field f' in the collection. The function pct(x, F') is the percentage occurrence of the string x ∈ {t_i, t_{i,i+1}, t_{i,j}} in all instances of the same field f' ∈ F'. The function length(f') measures the number of tokens extracted from f'. The variables λ_T, λ_O, and λ_U are free parameters such that 0 ≤ λ ≤ 1 and λ_T + λ_O + λ_U = 1. The variable w is the field weight and obeys the constraints w > 0 and Σ_{f'∈F_e} w_{f'} = 1. In the proposed formula, the variable µ_{f'} and the functions pct and length act as the normalization proposed in the BM25 function; the difference here is that the normalization is applied to the fields containing the set of properties belonging to instances of the same property class, f' ∈ F', defined by the property classes. The evaluation shows that the best setup found for SemS using the different property classes is to score attributes higher than similar entity names, similar entity names higher than entity labels, entity labels higher than categories, and categories higher than related entity names. The weights were set as λ_T ≈ 0.6, λ_O ≈ 0.2, and λ_U ≈ 0.2.

Virgilio and Maccioni [54] introduced a distributed technique for SemS on RDF data using MapReduce. The method uses a distributed index of RDF paths. The proposed strategy returns the best top-K answers in the first K generated results. The retrieval is carried out by evaluating the paths containing the resources that are labeled with terms found in the query using two strategies: (1) Linear and (2) Monotonic. The Linear strategy uses only the highest-ranked path(s). As a result, it does not produce an optimal solution but has linear complexity in the size of matched entities. The Monotonic strategy uses all matched paths and thus produces better results; intuitively, measuring all suitable paths from all entities is less time-efficient. Koumenides and Shadbolt [24] give a comprehensive overview of entity-centric SemS approaches. Note that systems which abide by the first paradigm are usually time-efficient but lack the ability to deal with complex queries. In particular, Wang et al. [55] show that traditional IR engines are faster than the combination of a triple store with a full-text index.
Class                  Condition
labels                 o : ∃(e, p, o) & p ∈ P_labels = regex(*[name|label]$)
attributes             o : ∃(e, p, o) & p ∈ P & p ∉ P_labels
categories             o : ∃(e1, p1, e2) & (e2, p2, o) & p1 ∈ P_categories & p2 ∈ P_labels
similar entity names   o : ∃(e1, p1, e2) & (e2, p2, o) & e2 ∈ similar(e1_label) & p2 ∈ P_labels
related entity names   o : ∃(e1, p1, e2) & (e2, p2, o) & e2 ∉ similar(e1_label) & p2 ∈ P_labels

Table 3. Property classes proposed by Zhiltsov et al. [59].
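The unigram component f̃_T of the FSDM formula above can be sketched as follows. This is a simplified illustration under assumed data structures: an entity is a dict mapping field name to token list, w and mu are per-field weight and smoothing dicts, and pct(t, f) is a precomputed collection-wide relative frequency:

```python
import math

def f_T(query_tokens, entity_fields, w, mu, pct):
    score = 0.0
    for t in query_tokens:
        inner = 0.0
        for f, toks in entity_fields.items():
            tf = toks.count(t)
            inner += w[f] * (tf + mu[f] * pct(t, f)) / (len(toks) + mu[f])
        # log of a zero inner sum is undefined; unmatched tokens are skipped here
        score += math.log(inner) if inner > 0 else 0.0
    return score
```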
3.2. Question Answering (QA)

The second retrieval category is QA systems. The main goal of these systems is to answer a given NL query precisely rather than listing the most relevant resources. However, in some cases, their functionality is close to IR-based paradigms. QA on Linked Data commonly transforms the user question into a formal query against a triple store. Approaches running on a closed domain are optimized for a specific KG or field, such as biology [2] or medicine [3]. Since they do not need to integrate different schemas and KGs, and thus suffer less from ambiguity, they produce faster and higher-quality results. However, closed-domain approaches lack flexibility, and there are high costs when adapting such an approach to a new domain or implementing a new approach. Thus, the research focus has moved towards multi-purpose, open-domain QA approaches like SINA [43] or hybrid approaches like TBSL [50] with a domain-unspecific core and a domain-specific, adaptable extension. TBSL is a template-based QA system that relies on SPARQL query templates to generate queries whose structure resembles the semantic structure of the original query. This is performed in two steps. In the first step, TBSL creates an internal logical representation resembling the original input query; this representation is converted into a set of possible templates. In the second step, the system uses an entity identification and predicate detection method to fill the slots of the previously defined SPARQL templates. The predicate and entity detection are done by applying the TF-IDF function implemented by the Apache Lucene framework to the entity and predicate labels. Nevertheless, TBSL supports the resolution of complex queries, that is, queries that require Solution Sequence Modifiers and/or Filters. SINA is a QA system that explores the KG to formulate the SPARQL query by applying a Hidden Markov Model (HMM) for disambiguation between query matching candidates. The system treats the input query as a set of keywords, which allows the interpretation of single-keyword queries as well as full question queries. To evaluate the answer, the system uses the HMM to disambiguate between candidates which match the input query (or parts of it). SINA uses the Jaccard distance to prune resource candidates (hidden states) that do not match more than 65% with the input query, after which the SPARQL query graph is built. We refer to this rule as the 65 rule: J(q, r) ∈ ]0.65, 1]. Different from TBSL, which uses a machine learning technique to learn SPARQL graph patterns, SINA uses the explicit RDF graph connections to evaluate the SPARQL query that will lead to the answer.

Zhang et al. [58] proposed a five-step approach for QA over joint RDF knowledge graphs using the Levenshtein distance. The first step is name phrase detection and resource mapping; it consists in finding the suitable resources for SPARQL query generation. Different from SINA, the candidates are chosen based on a Levenshtein coefficient greater than 0.85. The second step is candidate triple pattern generation; it consists of checking the connections among the resources from the previous step while avoiding redundant alignments, i.e., alignments of resources containing overlapping phrases. The third step consists in evaluating potential triple pattern alignments. The fourth step is the global joint inference (Basic Graph Pattern alignment) across different RDF knowledge graphs; possible alignments are evaluated using Levenshtein with the 65 rule proposed by Shekarpour et al. [43] over the entity labels. Finally, the last step is the formal query generation.

Another type of system explores Named Entity Linking and syntactic parse tree approaches to answer questions [7, 15, 52]. For instance, Dubey et al. [15] exploit the query's canonical syntactic tree structure, called Normalized Query Structure (NQS), to generate Ask, Order by, Count, Set (List), and Property-Value SPARQL queries. In this conversion, queries are annotated and identified (query-type) based on pre-defined syntactic rules. In the retrieval process, entities are pre-mapped by checking the
coexistence of two pairs of tokens (a bigram) in MS Encarta 98 (Token Merger) and annotated using DBpedia Spotlight [11]. Finally, the system performs an exact token match on the properties and property-values of the extracted entities. A concept expansion is then applied using, in order, WordNet (https://wordnet.princeton.edu) and BOA (http://aksw.org/Projects/BOA) patterns in case the property mapping fails in the previous phase.

Lukovnikov et al. [28] proposed a fact retrieval approach on word and character level based on a recurrent neural network (RNN). The entities are encoded using their labels at the character level, while predicates and predicate data types are encoded at both character and word levels. The disambiguation is performed using the cosine similarity on the independently encoded vectors and pruning the matching predicates that do not have a connection to any matched entity. The final answer is the highest-ranked fact, given by summing the cosine distance of the entity's label with its respective predicate. The scoring model is then enhanced by a rank-learning approach trained over a selective sample of positive and negative examples. The resulting entities are sorted by the number of predicates.

Höffner et al. [18] discuss challenges and draw recommendations on how to develop QA systems based on an analysis of 72 publications about 62 systems developed from 2010 to 2015. The challenges listed are (1) Lexical Gap, (2) Ambiguity, (3) Multilingualism, (4) Complex Queries, and (5) Distributed Knowledge as well as Procedural, Temporal, and Spatial Questions. They also consider how to tackle some of the challenges using templates. Berant and Liang [4] argue that the central challenge in semantic parsing is to handle the myriad ways in which knowledge base predicates can be expressed. They proposed a paraphrase model that consists in automatically mapping entities and properties to Question/Answering template pairs. They show that the paraphrase model can improve state-of-the-art approaches on standard benchmarks. Since then, many works have been inspired by the same idea [28, 47, 56]. QA systems that use QA pairs derive from the same approach as earlier chatbot systems (e.g., http://www.alicebot.org), mapping questions to answers. However, earlier chatbot systems are based on manual QA mappings, whereas today it is possible to derive the question and answer pairs from the KG. The main drawback of these systems is the creation of the templates and/or QA pair data sets, which cannot be
fully automated due to the differences in vocabulary and representation in different KGs. Furthermore, different from SemS and other types of QA systems, the users are limited to accessing only the data exposed by the templates.
4. Approach

For many years, scientists from the diverse fields of cognitive science, such as psychology, neuroscience, philosophy, linguistics, and artificial intelligence, have tried to explain and reproduce the human cognition system. While diverse theories have been developed, a commonly shared idea is that knowledge is organized as a network [40]. Hudson [19] goes further and claims that grammar is organized as a network as well. According to Hudson's work, the syntactic structure of a sentence consists of a network of dependencies between single terms. Thus, everything that needs to be said about the syntactic structure of a sentence can be represented in such a network. Hudson explores Saussure's [12] idea that "language is a system of interdependent terms in which the value of each term results solely from the simultaneous presence of the others". He also argues for the psycholinguistic evidence for the use of spreading activation in supporting knowledge reasoning. However, according to Hudson [19], the main challenge consists in finding how the activation occurs in mathematical terms:

"How exactly does spreading activation work? How does such a crude, unguided process help us to achieve our cognitive goals, rather than leave us drifting aimlessly round our mental networks? It is very unclear exactly how it works in mathematical terms, but the ... hypothesis is that a single formula controls activation throughout the network." Hudson [19]

Our intuition is that the KG contains a network of terms formed by the labels (e.g., rdfs:label) of its resources; hence, entities, properties, and literals can be used for querying. Although there is no evidence that the previous works were influenced by Hudson's theory, some of the proposed models [43, 57] follow this assumption. Definition 5 below formally defines the resource label.
[Figure 2. Representation of the SCC of the entity e2 extracted from the KG depicted in Fig. 1: e2 (labeled "Mona Lisa") is connected via dbo:artist (labeled "artist") to e1 (labeled "Leonardo da Vinci").]
Definition 5 (Resource Label). A literal associated with a resource r, denoted by label(r), is the literal value, respectively the label, of the resource. Considering rdfs:label as the labeling property (other labeling properties may also be used), a resource's label is given as follows:

$$ label(r) = \begin{cases} value(r) & \text{if } r \in L\\ label(l) \mid (r, \texttt{rdfs:label}, l) \in K & \text{otherwise} \end{cases} $$

An RDF KG can have cycles as well as an arbitrary number of edges and nodes. In order to simplify the KG and eliminate its ambiguity, the proposed approach is bound to a simplified version of the entity's graph called the Semantic Connected Component (SCC).

Definition 6 (Semantic Connected Component). The Semantic Connected Component (SCC) of an entity e in an RDF graph G under a consequence relation |= is defined as

$$ SCC_{G,\models}(e) := \{(e, p, o) \mid G \models \{(e, p, o)\}\} \cup \{(p, \texttt{rdfs:label}, l) \in G\} \cup \{(o, \texttt{rdfs:label}, l) \in G\}. $$

If the graph and consequence relation are clear from the context, we use the shorter notation SCC(e). Within this paper, we use the RDFS entailment consequence relation as defined in its specification (http://www.w3.org/TR/rdf-mt/).

Example 1 (Semantic Connected Component). For instance, by RDFS entailment, the entity dbpedia:Australia is a dbo:PopulatedPlace. The inference is due to dbpedia:Australia being typed as dbo:Country, which is a subclass of dbo:PopulatedPlace. Considering the running example, the SCC of the entity e2 is SCC(e2) = ({e2, e1, "Mona Lisa"}, {p5, p4}).

One of the biggest challenges in IR for RDF data lies in evaluating the relatedness between an entity in a KG and the user's intent. Document retrieval engines rely on term frequency weighting functions based on the assumption that the more frequently a term occurs, the more related it is to the topic of the document [27]. While a good retrieval method needs to take the frequency into account, it suffers from frequent yet unspecific words such as "the", "a", or "in". Inverse document frequency corrects this by diminishing the weight of words that occur frequently in the corpus, leading to the combined term frequency–inverse document frequency [45] to score documents for a query. However, document retrieval approaches are not designed for RDF, because the most important feature of RDF is not merely the term occurrence but the relation of the concepts underlying its graph structure. We propose that the probability of a resource being part of an answer correlates with the number of tokens in the intersection between the query and its label. For instance, a query containing birth date should be more related to the property dbo:birthDate than to the property dbo:deathDate or dbpprop:date. Hence, a Jaccard string similarity function is used, allowing to distinguish between a non-, partial, and full match. We discard the Levenshtein function because it is position sensitive (see Table 1) and only measures matching characters. Jaccard produces results similar to Cosine similarity but is easier to understand.

Definition 7 (Resource Matching). Resource matching is a function that maps query tokens T = {t1, t2, t3, ..., tn} and a resource to a boolean value (true, false), formally defined by M(r, s), where J_r(r, s) is a Jaccard-style string similarity function that measures the percentage of the resource's tokens belonging to a given string s:

$$ J_r(r, s) = \frac{|T(s) \cap T(label(r))|}{|T(label(r))|} $$

$$ M(r, s) = \begin{cases} true & \text{if } J_r(r, s) > 0\\ false & \text{otherwise} \end{cases} $$
Example 2 (Resource Matching). Let T(q) = {"mona", "lisa", "artist"}. According to Fig. 1, the tokens match the resources e2 and p4 as follows: M(e2, "mona") = true, M(e2, "lisa") = true, M(p4, "artist") = true.

As the SCC is a graph, the resources and literal values are connected by paths formed by edges and vertices (see Fig. 2).

Definition 8 (Path). A path is a sequence of consecutive and distinct vertices and edges [6].

Example 3 (Path). In the SCC shown in Fig. 2, there are two paths starting from the entity e2: γ1 = ((e2, "Mona Lisa")) and γ2 = ((e2, e1)).
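Definition 7 and Example 2 can be sketched in a few lines, assuming whitespace tokenization:

```python
def j_r(resource_label, s):
    """Fraction of the resource's label tokens occurring in string s (Definition 7)."""
    t_label = set(resource_label.lower().split())
    t_s = set(s.lower().split())
    return len(t_s & t_label) / len(t_label)

def m(resource_label, s):
    return j_r(resource_label, s) > 0

q = "mona lisa artist"
print(m("Mona Lisa", q))  # True  -> e2 matches
print(m("artist", q))     # True  -> p4 (dbo:artist) matches
print(m("Person", q))     # False
```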
Furthermore, the resources that belong to a path from one resource to another are labeled (e.g., via rdfs:label). Therefore, it is possible to explore the labels of the resources in the entity's SCC paths to determine its relevance. The intuition behind our proposed model is that the relevance score of an entity depends on the number of matched resources on its associated paths: the higher the number of matched resources, the higher the relevance of an entity. However, when a token matches multiple resources and paths, its score must only be attributed to the highest scoring one. The final relevance of an entity is the sum of its individual path scores and is measured by the Semantic Weight Model (SWM), which is formally defined as follows.

Definition 9 (Semantic Weight Model (SWM)). The SWM evaluates the score of an entity based on the individual scores of its resources as evaluated by the function J_r(r, q). Tokens matching equally scored resources or paths only have their score added once; this avoids the over- and under-scoring of frequent or rare tokens. The score of a path γ is given by the function score_γ as the sum of its individual resources' scores:

$$ M_r(t, r, s, \gamma) = \begin{cases} true & \text{if } r \in resources(\gamma) \wedge J_r(r, s) > 0\ \wedge\ \forall r' \in resources(\gamma), r' \neq r:\ J_r(r', t) > 0 \Rightarrow J_r(r, s) > J_r(r', s)\\ false & \text{otherwise} \end{cases} $$

$$ score_\gamma(\gamma, s) = \sum_{r \in resources(\gamma)}\ \sum_{t \in T(s)} \begin{cases} J_r(r, t) & \text{if } M_r(t, r, s, \gamma)\\ 0 & \text{otherwise} \end{cases} $$

The final score of an SCC S is the sum of its n path scores and is measured by the function score_S(S, q) using a path weighting function w : γ → R as follows:

$$ M_\gamma(t, \gamma, q, S) = \begin{cases} true & \text{if } \gamma \in paths(S) \wedge w(\gamma)\, score_\gamma(\gamma, t) > 0\ \wedge\ \forall \gamma' \in paths(S), \gamma' \neq \gamma:\ score_\gamma(\gamma', t) > 0 \Rightarrow w(\gamma)\, score_\gamma(\gamma, q) > w(\gamma')\, score_\gamma(\gamma', q)\\ false & \text{otherwise} \end{cases} $$

$$ score_S(S, q) = \sum_{\gamma \in paths(S)}\ \sum_{t \in T(q)} \begin{cases} w(\gamma)\, score_\gamma(\gamma, t) & \text{if } M_\gamma(t, \gamma, q, S)\\ 0 & \text{otherwise} \end{cases} $$
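A minimal sketch of the SWM under strong simplifying assumptions: a path is represented by the list of the labels of its resources, an SCC by a list of (weight, path) pairs, and tokenization is whitespace splitting. The numeric weights in the example are assumptions, not the values used by *P:

```python
def j_r(label, token):
    t_label = set(label.lower().split())
    return (1 if token in t_label else 0) / len(t_label)

def score_path(path_labels, token):
    # Each token is attributed only to its best-matching resource on the path.
    return max((j_r(lbl, token) for lbl in path_labels), default=0.0)

def score_scc(weighted_paths, query):
    # Each token contributes only via its highest weighted-scoring path.
    total = 0.0
    for t in query.lower().split():
        total += max((w * score_path(p, t) for w, p in weighted_paths), default=0.0)
    return total

# SCC of e2 from Fig. 2: a label path and a dbo:artist path (assumed weights).
scc_e2 = [(1.1, ["Mona Lisa"]), (1.0, ["artist", "Leonardo da Vinci"])]
print(score_scc(scc_e2, "mona lisa artist"))  # 1.1*0.5 + 1.1*0.5 + 1.0*1.0 = 2.1
```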
One of the problems in QA as well as SemS regards the disambiguation of resources with overlapping tokens, e.g., dbp:place and dbo:Place. To tackle this problem, the SWM assigns different weights (the function w(γ)) based on the RDF property on the path. The weight hierarchy among paths is constructed to allow the exploration of the KG, acting as a tiebreaker for disambiguating among paths (facts) and resources.

Information Atomicity
In Set Theory, two distinct sets of elements (resources) can either be disjoint, joint, or intersect each other. This principle is fundamental to measuring differences and similarities among things [20]. Information retrieval is no different: when ranking information, a scoring function is, in other words, evaluating the most similar element(s) to a given query. That is one of the reasons why document scoring functions can also be used in recommendation systems [1]. Most current algorithms perform well when selecting disjoint sets of elements because one can always use the disjoint information to reach the desired element(s). However, a problem arises when the provided information is ambiguous, that is, shared among different elements. The challenge lies in understanding what an ambiguous piece of information refers to; in our context, a query containing information belonging to more than one resource in a KG. We theorize that, when disambiguating a piece of information shared among different elements, it should be attributed to the most atomic (smallest) of the ambiguous elements to which it could refer; we call this principle Information Atomicity. Furthermore, by following the Information Atomicity principle, every element (or subset) in a set can be reached independently, which is not possible to achieve otherwise. For instance, suppose there are two sets A = {x, z} and H = {x}, and a function score that receives an element as parameter and outputs the sets where the element occurs in relevance order (the most similar sets for the given element). In this hypothetical case, if the parameter of the function is x and the sets are equally weighted, score will return A and H; both sets are thus reachable, and it is also possible to retrieve A alone by using the difference, e.g., z. However, in this scenario, it will never be possible to select only H. Furthermore, since H is
more atomic and has only x as an element, it seems reasonable for H to be closer to x than A. Similarly, if the score(x) function weights A higher, H will never be reached. However, if score(x) returns the most atomic set, that is, H, it will be possible to select H using x, A using the difference z, as well as both sets, A and H, using the conjunction of x and z. To illustrate this principle, imagine two pieces of fabric of different sizes overlapping each other. It is difficult to reach the short piece of fabric when the large piece overlaps it; however, it is easy to reach either of them when the short one overlaps the large one. In our context, this principle applies to all components of a KG, such as tokens, literals (terms), resources, paths, and subgraphs, from the smallest component (tokens) to the biggest (subgraphs). In the following, we describe how we implement the Information Atomicity principle using a hierarchy of weights for SemS over a KG.

Is-a relation
The highest weight is assigned to resource tokens in an is-a relation property path γt, i.e., the paths containing the is-a relation property defined by rdf:type. The problem is that tokens in the path of an is-a relation property can also belong to other property paths. However, the entity label property references the entity itself, while an is-a relation references classes of entities. In this case, if a query intends to select a specific class of entities, other entities can be retrieved by mistake. Thus, it is important to provide an efficient method to disambiguate between classes and entities. To alleviate this problem, the weight of paths containing an is-a relation property is set higher than that of other paths.

Entity label
The second highest weight is assigned to labeling property paths γl, i.e., the paths containing the rdfs:label property, and those are assigned higher values than other property paths γo. Entities can be referenced multiple times in a KG, but when a query contains an entity label, it is more likely looking for the entity than for its references (an object instance). Therefore, to prevent entities with multiple references from being ranked higher than the entity itself, the weight of a path with a labeling property is set higher than that of a path with another property. Despite the different weights, we still want a higher number of matched tokens to score higher in practical cases, i.e., n + 1 matched tokens should score higher than n matched tokens for reasonably low n:

$$ (n + 1)\, w(\gamma_t) > (n + 1)\, w(\gamma_l) > (n + 1)\, w(\gamma_o) > n\, w(\gamma_t) > n\, w(\gamma_l) > n\, w(\gamma_o) \qquad (1) $$
Notice that the proposed weights do not violate the Information Atomicity hypothesis, as other entities can still be reached by using other properties. In the following, the model is explained using examples.

Case 1: Querying by entity label
For the query "Rio de Janeiro", the SWM should consider the DBpedia entity dbpedia:Rio_de_Janeiro as the best answer, although the DBpedia entity dbpedia:Tom_Jobim has the DBpedia property dbpprop:birthPlace referencing the entity dbpedia:Rio_de_Janeiro. For the term "The" in a query, the model will consider as possible answers the entities dbpedia:The_Simpsons and dbpedia:The_Beatles rather than the DBpedia property dbpprop:The_GIP.

Case 2: Querying by is-a relation
Considering the query "place", the implemented SWM will prefer the data type dbo:Place over the properties dbo:birthPlace and dbo:place.

Case 3: Querying by other properties
Let us consider the case where the query is "birth place" rather than "place" as in the previous example. As the number of matching terms in the property dbo:birthPlace is higher than for the data type dbo:Place, the weight of dbo:birthPlace will consequently be higher than that of the data type.
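One concrete weight assignment satisfying constraint (1) can be checked mechanically. The numeric values below are illustrative assumptions, not the weights used by *P; the assertion holds for n < 10 with these values:

```python
# w(γt) > w(γl) > w(γo), with gaps small enough that one extra matched
# token always outweighs the path-class difference.
w_type, w_label, w_other = 1.10, 1.05, 1.00

for n in range(1, 10):
    assert (n + 1) * w_type > (n + 1) * w_label > (n + 1) * w_other
    assert (n + 1) * w_other > n * w_type  # n+1 matches always beat n matches
```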
Query Analysis
Information retrieval systems for RDF are commonly designed to support full or keyword NL queries. As mentioned by Shekarpour et al. [43], usability studies contradict one another regarding users' predilection for full or keyword queries. Different from Reichert et al. [39], Kaufmann and Bernstein [23] show that users prefer the use of keywords instead of full questions. Nevertheless, full queries can easily be converted to keyword queries by applying well-known NL processing techniques such as stop word removal and stemming, whereas converting keywords to full queries is a more challenging task. The *P approach is designed to deal with keyword or full queries by converting the latter into keyword queries. The conversion of an input query to a tuple of keywords consists in applying known techniques, in order: (1) lowercasing, (2) lemmatization, and (3) the 65 rule proposed by Shekarpour et al. [43].

Ranking resources with equal SWM score
When users perform a query, they can be looking for a single resource or a set of resources. In the latter case, one important feature is to rank the resources according to their relevance. Ranking is crucial when dealing with the problem of information excess. For instance, the query "give me all persons" can return more than one million resources when applied to the DBpedia knowledge graph. As ranking functions can be very diverse, the introduced model supports any ranking function as a parameter. Therefore, the model can evaluate and also rank the most promising result set: the SWM is used to find the most promising entity candidates, and a sorting function is then used to sort the multiset of results, which allows resources with equal SWM score to be better ranked. There are several ranking functions designed for RDF data. In this work, we use a modified version of PageRank dubbed DBpedia Page-Rank, introduced by Thalhammer and Rettinger [48]: a variant of the original PageRank algorithm where the rank of a DBpedia entity is measured over the DBpedia Wikipedia-internal links data set. Marx et al. [32] have shown that this ranking function produces better results than others, such as the number of outgoing edges used by other approaches [28]. We made the DBpedia Page-Rank and other ranking methods for DBpedia entities publicly available for the research community via DBtrends (http://dbtrends.org). The Page-Rank value is normalized to a value lower than a token weight (w(γo) > Rank(S)) and added to the final *P score using the Rank function, which enables ranking the answers according to their relevance:
$$ starpath(q, S) = score_S(S, q) + Rank(S) $$

Retrieving resources in object position
The proposed SWM is used to reveal which SCC contains the answer by retrieving the best (top-K) answers in the first K SCCs. Despite that, there are scenarios where
post-processing is needed. Take as an example the query "Mona Lisa artist". After selecting the SCC of "Mona Lisa", post-processing is necessary to select the property "artist" and thus retrieve the target resource. This technique, also called property reasoning, is only applied in cases where a suitable query token matches one of the SCC properties. For this stage, we use the same path scoring function previously defined (see score_γ(γ, q)): we select the path (fact) where there is a property match and return its object(s). Note that, in this case, there might be more than one object per property; for instance, Query 13 of the QALD-4 benchmark data set, "carrot cake ingredients", is answered by the Basic Graph Pattern (BGP; see http://www.w3.org/TR/rdf-sparql-query/#BasicGraphPatterns) linking the carrot cake entity to the objects of its ingredient property. Figure 3 shows the full process of retrieving the result for the query "Mona Lisa artist".
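A self-contained sketch of this property-reasoning step follows. The SCC layout (a list of property-label/objects facts for the selected entity) is an assumption of the sketch, and the real *P additionally applies path weights as defined above:

```python
def j_r(label, s):
    t_label = set(label.lower().split())
    t_s = set(s.lower().split())
    return len(t_s & t_label) / len(t_label)

def property_reasoning(scc_facts, query):
    """Return the object(s) of the best property match, or None if no token matches."""
    best_objects, best_score = None, 0.0
    for prop_label, objects in scc_facts:
        score = j_r(prop_label, query)
        if score > best_score:
            best_objects, best_score = objects, score
    return best_objects

scc_mona_lisa = [("artist", ["Leonardo da Vinci"]), ("museum", ["Louvre"])]
print(property_reasoning(scc_mona_lisa, "mona lisa artist"))  # ['Leonardo da Vinci']
```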
5. Evaluation

In this section, we empirically evaluate the accuracy of *P. We start by describing the experimental setup, then present the achieved results and discuss the accuracy of the different strategies. In this work, we do not target runtime but rather conduct an empirical study on how much accuracy can be achieved and improved by current systems.

5.1. Benchmark

Several benchmark data sets can be used to measure the accuracy of our approach, including benchmarks from the initiatives SemSearch [16] (http://km.aifb.kit.edu/ws/semsearch10/) and QA Over Linked Data (QALD, http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/). SemSearch is based on user queries extracted from the YAHOO! search log, with an average of 2.2 words per query. QALD provides both QA and keyword search benchmarks for RDF data that aim to evaluate the intrinsic behavior of systems. The QALD data sets are the most suitable for our evaluation due to the wide variety of queries they contain and because they use DBpedia, a very large and diverse KG. In this work, we use QALD versions 4 (QALD-4) [51] and 3 (QALD-3) [9], the latest versions of the QALD benchmark data sets compatible with the openQA [31] framework.
[Figure 3. The full-stack stages for retrieving the result for the query "Mona Lisa artist": the query is tokenized, matched against the indexed SCCs (SCC1 "Mona Lisa", SCC2, ..., SCCn), the best SCC is selected, and the property "artist" yields "Leonardo da Vinci".]
5.2. Experimental Setup

All the queries for *P and GlimmerY! [5] were performed in OR mode. For all queries, we consider only the top-K entries returned by each approach, where K is equal to the number of entries in the target test answer. The results presented for *P were produced by an offline framework developed exclusively for this purpose using pre-extracted and indexed DBpedia SCCs. From all queries in the QALD-4 data set, four cannot be answered for being ASK queries (7, 19, 23, and 45), three for having Count (18, 27, and 38), seven for having Solution Sequence and Modifiers (SSM; 9, 10, 16, 17, 22, 25, and 37), two for having Filters (2 and 4), and four for having multiple features such as SSM and Count (3, 14, and 20) or SSM and Filter (24). These questions cannot be answered because the proposed approach targets simple BGP queries, i.e., queries that do not require aggregations, restrictions, or SSM to be answered. From the remaining queries, five require information that is outside the SCC span (8, 29, 31, 33, and 35). Eleven queries present a vocabulary mismatch problem (1, 5, 6, 11, 15, 28, 39, 40, 43, 46, and 48). Query 36, "pope, succeed, Johannes Paul II", could not be answered because the token "Johannes" appears in the German version of the data set, leading the approach to annotate "Johannes Paul II" with dbpedia:Pope_Paul_II instead of dbpedia:Pope_John_Paul_II; Query 36 therefore requires a cross-language approach [37]. Finally, two queries could not be answered using the information contained in the data set (49 and 50).

We also evaluate QALD with Levenshtein, Jaccard, and BM25F baseline scoring functions as well as a *P variant:

– Levenshteina uses the number of matched characters for each matched token;
– Levenshteinb uses the number of matched characters with the paraphrase disambiguation method proposed by Zhang et al. [58];
– Jaccarda uses the Jaccard distance of matched resources per matched token;
– Jaccardb uses the disambiguation model implemented by Shekarpour et al. [43];
– BM25F is the state-of-the-art SemS-for-RDF approach proposed by Blanco et al. [5], and;
– *PP65 is the *P disambiguation model with the 65 rule applied only to properties; the idea is that inflections need to be addressed only on properties, where verbs occur, rather than on objects, which usually contain proper names.

The Levenshteina and Jaccarda methods are used to measure term frequency without token occurrence normalization. Therefore, all tokens have the same weight, and the more frequent a token is in the entity, the more
relevant the entity is. The results achieved by these methods represent the upper bound of precision, recall, and F-measure that can be achieved without the use of optimization techniques such as bigrams and trigrams [22]. The use of bigrams and trigrams, although it can lead to a compact index, simplifies the tokens in a path to a neighborhood of two and three words respectively, which consequently does not capture the entire data structure. We extend this evaluation to entity-centric queries. We call entity-centric the subset of simple BGP queries where the result can be found by using the information located in the entity's predicates or objects. For this purpose, Queries 26, 32, and 41 were used, and *P was not allowed to retrieve resources in object position (property reasoning). We also evaluate the Entity Linking of different approaches and compare it with the latest version of DBpedia Spotlight (version 1.0), MAG [34], and AGDISTIS [53] on simple BGP queries. This evaluation was designed to measure how accurate *P can be compared with approaches that use entity linking to evaluate the NL input query [15, 52]. We discard queries that can only be answered using classes and properties, because annotators usually can only handle entities. All queries evaluated over DBpedia Spotlight used a refinement operator approach starting from confidence 0.5 on a decreasing scale of 0.05 until reaching an annotation (when possible) or zero. We also evaluate annotation using graph disambiguation approaches: for AGDISTIS [53] and MAG [34], we annotated the sub-strings referring to the entities manually and used each approach to evaluate the link to the corresponding DBpedia entities. Finally, we extend the evaluation to QALD-3. All output generated by the systems is publicly available at https://bitbucket.org/emarx/smart/. The evaluation was designed to answer the following research questions:

Q1. How accurate is the proposed approach?
Q2. How does *P compare to other SemS, QA, and Entity Linking approaches?
Q3. How much accuracy can be achieved by these systems?
Q4. Is it possible to increase the accuracy of current approaches?
Q5. Is it possible to design an approach that performs end-to-end Semantic Search, Entity Linking, and QA?
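The per-question measures reported below can be computed as in the following sketch, with K set to the size of the gold answer set as described in the setup above:

```python
def precision_recall_f1(retrieved, gold):
    """Precision, recall, and F-measure over the top-K entries, K = |gold|."""
    top_k = retrieved[: len(gold)]
    hits = len(set(top_k) & set(gold))
    p = hits / len(top_k) if top_k else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(precision_recall_f1(["a", "b", "c"], ["a", "c"]))  # (0.5, 0.5, 0.5)
```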
5.3. Results

Table 7 shows the performance of *P in comparison to all participating QA systems in the multilingual challenge of the QALD-4 benchmark, while Table 15 shows the individual Recall, Precision and F1-measure achieved per question. From the 11 targeted queries,29 our approach was able to answer 8 fully.30 Query 21 was wrongly annotated with dbpedia:Bachs because its label had a higher Jaccard similarity with the query than the target dbpedia:Johann_Sebastian_Bach (a worked example follows the footnotes below). Query 42 could not be answered because the property dbo:office scored higher than the target one, dbo:officialSchoolColour. Unexpectedly, three queries—one containing information outside the SCC span,31 another containing SSM,32 and one containing vocabulary mismatch33—were partially answered. Table 9 shows the distribution of features in QALD-4 queries, whereas Table 11 shows the number of Fully, Partially, and Not Answered queries as well as the Total. Queries 11, 25, and 29 were partially answered because they had extra information that helped *P to partially find the answer. For instance, Query 11 was partially answered because many animals have the property dbp:extinct, which dismissed the use of the predicate-object pair (dbo:conservationStatus EX). Query 25 was answered because some of the returned action role-playing games are in the top-10 answers. Finally, Query 29 consists of a union of two BGPs, one inside and another outside the SCC span; the BGP inside the SCC span helped to partially answer the query.

Table 4 shows the maximum Precision, Recall and F1-measure achieved by each scoring function for SemS—without property reasoning. Here *P achieved a better score than *PP65, mainly because it could overcome the problem of vocabulary mismatch in Query 29, by matching the term “australian” with “Austria”, and in Query 49, by matching the term “Swedish” with “Sweden”. Table 5 shows the results for entity-centric queries, where *P achieved the highest F1-measure of 1, while GlimmerY! and Levenshteinb were only able to, respectively, partially and fully retrieve the result for Query 41.

29 Queries 12, 13, 21, 26, 30, 32, 34, 41, 42, and 44.
30 Queries 12, 13, 26, 30, 32, 34, 41, and 44.
31 Query 29.
32 Query 25.
33 Query 11.
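As a worked illustration of the Query 21 error discussed above, assume token similarity is computed as the Jaccard overlap of character sets—one plausible reading of Jaccarda; the paper's exact formula may differ:

# Worked illustration (assumed character-set Jaccard): why "Bachs" can
# outscore the intended entity label for the query token "Bach".
def char_jaccard(a, b):
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

print(char_jaccard("Bach", "Bachs"))                  # 0.8
print(char_jaccard("Bach", "Johann Sebastian Bach"))  # ~0.33

The shorter, spurious label wins because the extra characters in the long label enlarge the union and dilute the similarity.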
Table 6 shows the entity linking evaluation, where *P achieved an F1-measure of 0.90, being 0.10 more accurate than MAG, the second best performing Entity Linking approach. *P was not capable of correctly annotating Query 21, while MAG failed on Queries 34 and 44, and DBpedia Spotlight on Queries 12, 41, and 42. *P did not annotate Query 21 (“birthplace, Bach”) because the words bach and II received a full match with the tokens bachs and III under rule 65. We then evaluated *P applying rule 65 only to properties (*PP65), which enabled the approach to annotate Query 21 correctly. Table 8 shows the results for *P with (QA) and without property reasoning (SemS), as well as all participating systems in the Multilingual Challenge of the QALD-3 benchmark data set.

The results of the QALD-4 benchmark data set in Table 4, Table 5, Table 6, and Table 7 show that:

Q1. *P achieved resp. 0.19, 0.11, 1, and 0.81 accuracy in the Question Answering, SemS, Entity-centric, and Entity Linking tasks;
Q2. *P can be resp. ≈275% and 10% more accurate than SemS and Entity Linking approaches, while achieving the average performance of a Question Answering system;
Q3. SemS approaches can be less accurate than Question Answering, but annotators can achieve higher accuracy than both SemS and Question Answering when performing Entity Linking tasks;
Q4. It is possible to increase the accuracy of SemS, Entity Linking, and Question Answering approaches; and
Q5. *P can achieve a high F-measure in SemS, Entity Linking, and Question Answering tasks alike, showing that it is possible to design a single scoring model that addresses all three approaches.

5.4. Discussion

The results of the QALD-4 benchmark data set in Table 4, Table 5, and Table 6 show that *P can achieve better accuracy than Entity-centric, SemS, and Entity Linking approaches. The proposed approach can achieve better results because it explores the connection among words, resources, and facts of the KG, in contrast with document-based entity search systems that rely on the bag-of-words paradigm.

34 The results were obtained using GlimmerY! in OR mode.
Approach       P     R     F1
*P             0.11  0.11  0.11
*PP65          0.09  0.09  0.09
Levenshteinb   0.04  0.05  0.04
BM25F34 [5]    0.03  0.03  0.03
Jaccardb       0.01  0.04  0.01
Levenshteina   0.00  0.00  0.00
Jaccarda       0.00  0.00  0.00

Table 4
Precision, Recall and F1-measure achieved by different SemS approaches on the QALD-4 benchmark data set.
Approach       P     R     F1
*P             1     1     1
Levenshteinb   0.33  0.33  0.33
BM25F34 [5]    0.04  0.04  0.04
Levenshteina   0.00  0.00  0.00
Jaccarda       0.00  0.00  0.00
Jaccardb       0.00  0.00  0.00

Table 5
Precision, Recall and F1-measure achieved by different SemS approaches, targeting only entity-centric BGP queries.
Approach                P     R     F1
*PP65                   1     1     1
*P                      0.90  0.90  0.90
MAG [34]                0.80  0.80  0.80
DBpedia Spotlight [11]  0.63  0.70  0.70
Levenshteinb            0.60  0.60  0.60
Jaccardb                0.60  0.60  0.60
AGDISTIS [53]           0.30  0.30  0.30
BM25F34 [5]             0.30  0.30  0.30
Levenshteina            0.00  0.00  0.00
Jaccarda                0.00  0.00  0.00

Table 6
Precision, Recall and F1-measure achieved by different approaches in Entity Linking, evaluating only entity annotation.
System   P     R     F1
Xser     0.71  0.72  0.72
gAnswer  0.37  0.37  0.37
CASIA    0.40  0.32  0.36
Intui3   0.25  0.23  0.24
ISOFT    0.26  0.21  0.23
*P       0.19  0.19  0.19
RO FII   0.12  0.12  0.12

Table 7
Precision, Recall and F1-measure achieved by different QA approaches in the QALD-4 Multilingual Challenge. The systems are GlimmerY!, *P and all QALD-4 participating systems.

System         P     R     F1    Approach
squall2sparql  0.88  0.93  0.90  QA
CASIA          0.36  0.35  0.36  QA
Scalewelis     0.33  0.33  0.33  QA
RTV            0.34  0.32  0.33  QA
Intui2         0.32  0.32  0.32  QA
*P             0.26  0.29  0.27  QA
*P             0.24  0.26  0.24  SemS
SWIP           0.16  0.17  0.17  QA

Table 8
Precision, Recall and F1-measure achieved by different QA approaches in the QALD-3 Multilingual Challenge. The systems are the QA and SemS versions of *P and all QALD-3 participating systems.

Feature            Percentage (%)
BGP                60
SSM                14
Multiple Features  8
Ask                8
Count              6
Filter             4

Table 9
Distribution of query features in the QALD-4 data set. The query features include the use of simple Basic Graph Patterns (BGP), Solution Sequence and Modifiers (SSM), Count, Filter, Multiple Features, and Ask queries.

Challenge            Percentage (%)
Outside SCC Span     10
Vocabulary Mismatch  24
Complex Queries      40
Cross Language       2
Target Queries       20
OUT OF SCOPE         4
Total                100

Table 10
Percentage of different challenges existing in BGP queries of the QALD-4 data set. The challenges include graph patterns Outside the SCC span, Vocabulary Mismatch between the query and the graph patterns, Cross Language, Complex Queries, as well as queries that cannot be answered with the information available in the KG (OUT OF SCOPE).

Answers       Queries
Fully         16
Not Answered  4
Partially     0
Total         20

Table 11
Total of fully, partially, and not answered queries from the target queries.
For instance, suppose the user issues Query 26 of the QALD-4 benchmark data set, “actors, born, Berlin”. By not considering the graph structure, a model can generate as a result the SPARQL query of Listing 2 instead of Listing 1, because it does not consider that the entity type dbo:Actor is a subclass of dbo:Person and thus has the property dbo:birthPlace (a sketch of the two interpretations is given below). The same problem can occur in any approach that employs a syntactic parse tree without considering the graph structure,35 because tokens may not be related to the same resource or fact. That is the case for approaches that evaluate entities [15, 52] without considering the relations (predicates), relations (predicates) based on pre-annotated entities [35, 44], as well as objects (literal or not) based on pre-annotated entities and relations. In TBSL as well as in paraphrase models [4, 28], this issue can be alleviated by a machine learning approach that maps NL input queries to intended SPARQL queries (or resources).

35 The connection among words, resources and facts.
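Listings 1 and 2 appear earlier in the paper; the sketch below is our reconstruction of the two interpretations of Query 26 under the usual DBpedia prefixes, not the authors' exact listings.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Graph-aware interpretation (in the spirit of Listing 1): "born" and
# "Berlin" attach to the same resource, via the property dbo:birthPlace
# that dbo:Actor inherits from dbo:Person.
SELECT ?actor WHERE {
  ?actor a dbo:Actor .
  ?actor dbo:birthPlace dbr:Berlin .
}

# Bag-of-words interpretation (in the spirit of Listing 2): each token is
# matched independently, so the triple patterns need not be connected.
SELECT ?actor WHERE {
  ?actor a dbo:Actor .
  ?anything dbo:birthPlace dbr:Berlin .
}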
However, the main limitations of these methods are the vocabulary dependency and the requirement of large corpora of manually curated training data. In other words, they require the publisher to establish all the possible ways in which a user can query the data set. Machine learning is appropriate for scenarios with little ambiguity to resolve. For instance, consider the task of learning to execute a walking movement or to brake a vehicle: in both scenarios, there is only one precise manner of performing the task. Natural language is different. As shown by the examples in Section 4 and demonstrated in this section, the same terms can lead to entirely different results. The use of machine learning in the disambiguation process, however, will always lead to the statistically most relevant interpretation. Therefore, when using automatically generated or manually annotated training sets, terms belonging to the most common properties will always prevail over less frequent ones, which may render some data inaccessible to end users. For instance, a training data set in which the property dbp:place occurs more often than the data type dbo:Place will always lead to entities having that property when querying for “places”. Nevertheless, in both cases, these challenges can be overcome by a specialist who knows both the users and the KG well enough to fix the training data set. Notice that the people who generate the training data usually do not know the entire KG profoundly. In fact, it would be humanly challenging to know to which data every possible term combination in the KG can lead. That said, previous works [28, 30] show that, when ignoring these facts, good results can still be achieved by the use of RNNs and similar machine learning techniques on the current standard benchmarks. We highlight that *P can be used on top of these models to replace the human disambiguation work and automatically generate correct training sets.

Another drawback of proposed SemS methods is the weight given to attributes over the entity label itself. For instance, Zhiltsov et al. [59] propose to score attributes and similar entity labels higher than the entity's own label. However, most often users query for an attribute value (e.g., Queries 12, 13, and 30). The problem is that, when weighting an entity's attributes higher than its label, one is querying the reverse pattern. For instance, for the query “Rio de Janeiro”, a bag-of-words entity search may return entities that have a high incidence of the query tokens in the property-object position. Therefore, the pattern