Capturing Knowledge in Semantically-typed Relational Patterns to Enhance Relation Linking

Kuldeep Singh (University of Bonn & Fraunhofer IAIS, Germany)
Isaiah Onando Mulang' (University of Bonn & Fraunhofer IAIS, Germany)
Ioanna Lytra (University of Bonn & Fraunhofer IAIS, Germany)
Mohamad Yaser Jaradeh (University of Bonn, Germany)
Ahmad Sakor (University of Bonn, Germany)
Maria-Esther Vidal (Fraunhofer IAIS, Germany)
Christoph Lange (University of Bonn & Fraunhofer IAIS, Germany)
Sören Auer (University of Hannover, Germany)

Abstract

Transforming natural language questions into formal queries is an integral task in Question Answering (QA) systems. QA systems built on knowledge graphs like DBpedia require a step, after natural language processing, for linking words, in particular named entities and relations, to their corresponding entities in the knowledge graph. To achieve this task, several approaches rely on background knowledge bases containing semantically-typed relations, e.g., PATTY, for an extra disambiguation step. Two major factors may affect the performance of relation linking approaches whenever such background knowledge bases are accessed: a) the limited availability of such semantic knowledge sources, and b) the lack of a systematic approach on how to maximize the benefits of the collected knowledge. We tackle this problem and devise SIBKB, a semantic-based index able to capture the knowledge encoded in background knowledge bases like PATTY. SIBKB represents a background knowledge base as a bi-partite graph and a dynamic index over the relational patterns included in the knowledge base. Moreover, we develop a relation linking component able to exploit SIBKB features. The benefits of SIBKB are empirically studied on existing QA benchmarks, and the observed results suggest that SIBKB is able to enhance the accuracy of relation linking by up to three times.

Keywords: Question Answering Systems; Relation Linking; Knowledge Graphs; Knowledge Capture

1 Introduction

The last decade has seen an explosion of data publicly available on the Web in the form of knowledge graphs (KGs), e.g., DBpedia [2] and Freebase [3]. Although these knowledge graphs provide access to a structured representation of unstructured knowledge, capturing and exploiting this knowledge in Natural Language Processing (NLP) remains a challenge. For example, Question Answering (QA) systems built on top of knowledge graphs are NLP systems empowered with the ability to translate natural language questions into equivalent formal queries against the knowledge graph [5]. The task of formal query generation can further be viewed as a pipeline consisting of several sub-tasks, namely: named entity recognition (NER), named entity disambiguation (NED), relation extraction and linking, and query generation.

Research shows that formal query formulation from natural language questions and, more specifically, linking relations to KG properties, often requires extra knowledge sources that contain semantic descriptions or extensions of the underlying knowledge graphs. These knowledge bases (KBs) capture knowledge from large corpora or taxonomies, e.g., Wordnet¹ [8, 12], PATTY [8, 18], or the BOA pattern library [5], and allow for enhancing the accuracy of the process of mapping natural language relations to concepts in a specific knowledge graph. Hence, extracting knowledge from such background knowledge bases will also improve the effectiveness of the relation linking task and increase the overall performance of QA systems.

A distinctive feature of some of these knowledge bases is that they provide semantically typed patterns for the properties of a knowledge graph. For example, PATTY [13] is a large knowledge base consisting of semantically typed relational patterns together with their associated properties in open-domain knowledge graphs such as DBpedia. Therefore, PATTY provides a rich source of relational patterns that can be used during relation linking. Höffner et al. [9] report that PATTY allows for flexible mapping of natural language relations to their KG properties. However, this flexibility implies that one relation can be matched to several patterns. For example, the natural language relational pattern been playing with appears 12 times in PATTY and is associated with 11 relations. Hence, efficient methods are needed both for capturing knowledge from a large corpus like PATTY and for exploiting its features in QA systems.

In this work, we devise an approach for capturing knowledge from collections of semantically typed relational patterns like PATTY; further, we present a relation linking method able to exploit these features. First, SIBKB, a semantic-index-based representation of these knowledge bases, is proposed; SIBKB provides searching mechanisms for accurately linking relational patterns to semantic types. The benefits of SIBKB have been empirically evaluated on existing QA benchmarks. The results suggest that SIBKB enhances the performance of relation linking methods by up to three times.

The remainder of the article is structured as follows. Section 2 motivates our work with an example. Section 3 elaborates on the specific problem of capturing knowledge from semantically typed knowledge bases; this is followed by a detailed illustration of the approach and the proposed solution. Section 4 presents the experimental study and the evaluation of our approach. Related work is presented in Section 5, and, finally, conclusions and future plans are outlined in Section 6.

¹ https://wordnet.princeton.edu/

[Figure 1. Motivating Example. The PATTY knowledge base is used to identify the DBpedia predicates associated with the patterns of a given question; SPARQL queries are built from these DBpedia predicates. (a) Excerpt of the PATTY knowledge base; (b) DBpedia predicates in PATTY associated with the pattern "was born" in the question; (c) Potential SPARQL queries to answer the input question. Only the DBpedia predicate dbo:birthPlace allows for collecting correct answers.]

2 Motivating Example

We motivate our work by analyzing the problem of extracting knowledge from a large background knowledge base during the relation (predicate) linking task in QA systems. PATTY [13] is one such large knowledge base of semantically-typed relational patterns; it contains 127,811 pairs of relational phrases and DBpedia predicates, involving 225 DBpedia relations in total. Many QA systems, such as Xser [18] and CASIA [8], use PATTY's relational phrases to match word patterns in an input question and find the corresponding DBpedia relation as part of understanding a natural language question.

Let us consider the following question: Where was Albert Einstein born? Understanding this natural language question includes: (1) the extraction of the named entities, and (2) the identification of the predicate(s). The successful completion of these tasks allows a QA system to construct a formal query, e.g., a SPARQL query, in order to retrieve the answers from a knowledge graph like DBpedia. For the first task, named entity recognition and disambiguation, tools such as DBpedia Spotlight [11] or AGDISTIS [16] can be used to identify the Albert Einstein entity and disambiguate it to its DBpedia mention http://dbpedia.org/resource/Albert_Einstein. For the second task, the PATTY knowledge base can be used to link phrase patterns in the question, such as was born, to their associated DBpedia predicates (i.e., relations).

In our exemplary question, the pattern was born can be mapped to six different DBpedia predicates of the PATTY corpus, namely dbo:birthPlace, dbo:deathPlace, dbo:spouse, dbo:parent, dbo:relation, and dbo:predecessor. In fact, all of these relations are linked to several textual patterns (e.g., dbo:birthPlace is related to more than 6,000 patterns of the PATTY corpus), which are often shared among different relations, as illustrated in Figure 1a. For example, in the PATTY corpus, the pattern was born appears 876 times and corresponds to six DBpedia relations, namely dbo:birthPlace (the correct answer in this case), dbo:religion, dbo:parent, dbo:predecessor, dbo:spouse, and dbo:deathPlace. Hence, if simple keyword-based matching or generic similarity techniques are used to match phrase patterns of the question, multiple DBpedia relations for a given pattern are retrieved from the knowledge base. For the pattern was born, for instance, this leads to six candidate DBpedia relations in PATTY, as depicted in Figure 1b. Hakimov et al. [7], the first two authors of the current paper [12], and Dubey et al. [5] describe this problem and report the noisy behavior of PATTY patterns when building QA systems and relation linking tools. Many incomplete and ambiguous patterns in PATTY, such as s son [[adj]] and lt ref gt with, also contribute to this noisy behavior.

However, for our question, only dbo:birthPlace provides the correct relation that allows a QA developer to construct a SPARQL query retrieving the correct answers. When identifying the predicates dbo:deathPlace, dbo:spouse, or dbo:parent, a QA system utilizing PATTY retrieves wrong answers, while matching the predicates dbo:predecessor or dbo:relation leads to an empty answer set (see Figure 1c). Therefore, relational pattern knowledge bases need to be exploited in an efficient way in order to increase precision and recall in QA systems.

[Figure 2. Instance of the Problem. A bi-partite graph of semantically-typed relational patterns is illustrated. (a) A portion of a bi-partite graph for PATTY; (b) A question, its patterns, and corresponding relations (DBpedia properties) from PATTY; (c) Potential SPARQL queries from the selected DBpedia properties.]

3 Problem Statement and Solution

In this section, we present the problem of capturing knowledge in semantically-typed relational patterns. Further, we propose an index-based approach that allows for efficiently extracting the properties from a knowledge base that solve the relation linking task in question answering pipelines.

3.1 Bi-partite Graphs of Semantically-typed Relational Patterns

A collection of semantically-typed relational patterns corresponds to a bi-partite graph of patterns and properties in a knowledge base. A collection G of semantically-typed relational patterns is defined as a triple G = (R, P, E), where:

● P and R are two disjoint sets representing semantic relational patterns and properties in a knowledge base (e.g., RDF properties from DBpedia or Yago), respectively.
● E is a set of pairs (r, p) in R × P representing a semantic type r of a relational pattern p, i.e., r is a property semantically related to p.

PATTY can be represented as a bi-partite graph G = (R, P, E) where the relational patterns in P are mined from large corpora, and the properties in R correspond to the DBpedia predicates associated or semantically related to these patterns. Figure 2a illustrates a portion of a bi-partite graph for PATTY. Given a question Q, let pattern(Q) denote the relational patterns identified in Q; the semantic types of these patterns are given by:

Rels(pattern(Q), G) = {r | p ∈ pattern(Q) and (r, p) ∈ E}    (1)

Figure 2b presents the relational patterns of the question Where was Albert Einstein born?, as well as their associated semantic types in DBpedia. The semantic types associated with a pattern are used in question answering pipelines for building SPARQL queries whose evaluation provides the answers to a question Q. For example, Figure 2c shows three SPARQL queries that can be built from the DBpedia predicates dbo:birthPlace, dbo:deathPlace, and dbo:relation.
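To make these definitions concrete, here is a minimal Python sketch (all data is a hypothetical toy excerpt, not the actual PATTY corpus) that represents the edge set E of a collection G and computes Rels(pattern(Q), G) as in Equation (1):

```python
# Toy collection G = (R, P, E) of semantically typed relational patterns.
# E: set of (relation, pattern) pairs, i.e., the edges of the bi-partite
# graph between properties R and relational patterns P.
E = {
    ("dbo:birthPlace", "was born"),
    ("dbo:deathPlace", "was born"),
    ("dbo:deathPlace", "died in [[det]] town of"),
    ("dbo:spouse", "also married [[det]]"),
    ("dbo:parent", "s son [[adj]]"),
}

R = {r for r, _ in E}  # properties (semantic types)
P = {p for _, p in E}  # relational patterns

def rels(question_patterns, edges):
    """Equation (1): all properties r such that a pattern p of the
    question appears in the bi-partite graph, i.e., (r, p) is in E."""
    return {r for (r, p) in edges if p in question_patterns}

# Patterns extracted from "Where was Albert Einstein born?"
print(rels({"was born"}, E))
# -> {'dbo:birthPlace', 'dbo:deathPlace'}: the ambiguity motivating SIBKB
```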

Given a set Rel of semantic types or RDF properties in Rels(pattern(Q), G), f(Rel, D, Q, G) denotes the set of SPARQL queries over the knowledge base D that use predicates in Rel and that provide the correct answers for the question Q; f(Rel, D, Q, G) is defined as follows:

f(Rel, D, Q, G) = {𝒬(r) | r ∈ Rel ∧ Rel ⊆ Rels(pattern(Q), G) ∧ 𝒬(r) ∈ IdealQueries(Q, D)}    (2)

where:
● 𝒬(r) is a SPARQL query composed of a triple pattern whose predicate is r;
● IdealQueries(Q, D) is the set that includes only the SPARQL queries that need to be run over D to produce the complete answer to the question Q.

In our running example, the resources dbr:German_Empire, dbr:Kingdom_of_Württemberg, and dbr:Ulm correspond to the complete answers of Q in DBpedia; one SPARQL query produces all these results, i.e., IdealQueries(Q, D) is composed of only this query. Thus, although f(Rel, D, Q, G) in Figure 1c includes this query, the other two queries in this set produce either incorrect or empty results.

[Figure 3. Example of SIBKB on PATTY. A portion of a Semantically Indexed Bi-partite Knowledge Base (SIBKB) for PATTY.]

3.2 Problem Statement

Given a question Q and a collection G of semantically typed relational patterns, the problem of linking relational patterns in Q to semantic types from a knowledge base D corresponds to selecting a subset Rel of Rels(pattern(Q), G) from which the maximal number of SPARQL queries that produce the correct answers of Q can be generated. We define the problem of linking relational patterns in a question as the following optimization problem:

argmax_{Rel ⊆ Rels(pattern(Q), G)}  |f(Rel, D, Q, G)| / max(|Rel|, |IdealQueries(Q, D)|)    (3)

Since the set IdealQueries(Q, D) includes only one query in our running example, the optimal solution to this optimization problem corresponds to the set Rel composed only of the DBpedia property dbo:birthPlace. This property is part of the only triple pattern of the SPARQL query that produces the complete answer for the question Q.
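The following sketch illustrates the criterion of Equation (3) on hypothetical data. Evaluating SPARQL queries against DBpedia is beyond its scope, so it approximates f(Rel, D, Q, G) by intersecting Rel with the predicates of the ideal queries; the exhaustive search over subsets is for illustration only:

```python
from itertools import combinations

# Candidate relations from Rels(pattern(Q), G) for the running example.
candidates = {"dbo:birthPlace", "dbo:deathPlace", "dbo:spouse",
              "dbo:parent", "dbo:predecessor", "dbo:relation"}
# Predicates of IdealQueries(Q, D); for our question, a single query.
ideal_predicates = {"dbo:birthPlace"}

def score(rel):
    """Objective of Equation (3): |f(Rel,D,Q,G)| / max(|Rel|, |Ideal|)."""
    f = rel & ideal_predicates  # stands in for f(Rel, D, Q, G)
    return len(f) / max(len(rel), len(ideal_predicates))

best = max((frozenset(s)
            for k in range(1, len(candidates) + 1)
            for s in combinations(candidates, k)), key=score)
print(best, score(best))  # -> frozenset({'dbo:birthPlace'}) 1.0
```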

3.3 Proposed Solution

For matching the correct relations from a knowledge base for a given input question Q, we follow a two-step process. In the first step, a semantically indexed bi-partite knowledge base (SIBKB) is built. In the second step, the SIBKB is utilized in a pipeline for relation linking.

3.3.1 Semantically Indexed Bi-Partite Knowledge Base (SIBKB)

In the first step, we apply the GloVe [14] model to PATTY and build a vector representation of its bi-partite graph G = (R, P, E), i.e., each node in R and P is replaced by its vector representation. PATTY is thus converted into G′ = (R′, P′, E′), where R′ and P′ are the vector representations of the DBpedia relations and their associated semantically typed relational patterns, respectively. Furthermore, a dynamic hashing [10] over the semantically typed relational patterns is built; each entry in the hash table corresponds to a bucket composed of the predicates (e.g., DBpedia predicates) associated with the pattern that is the key of the bucket. Figure 3 illustrates a portion of the SIBKB built on top of PATTY.
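The following sketch shows this construction on toy data; the vectors stand in for GloVe embeddings (in the actual system they come from a pre-trained GloVe model [14]), and a plain Python dict stands in for the dynamic hash table:

```python
import numpy as np

# Hypothetical stand-ins for GloVe word vectors.
glove = {
    "was":  np.array([0.1, 0.3, 0.2]),
    "born": np.array([0.7, 0.1, 0.4]),
    "died": np.array([0.6, 0.2, 0.5]),
    "at":   np.array([0.0, 0.1, 0.1]),
}

def embed(phrase):
    """Vector of a phrase: average of the vectors of its known words."""
    return np.mean([glove[w] for w in phrase.split() if w in glove], axis=0)

# PATTY-like (pattern, relation) pairs.
patty = [("was born", "dbo:birthPlace"),
         ("was born", "dbo:deathPlace"),
         ("died at", "dbo:deathPlace")]

# SIBKB: each pattern is the key of a bucket holding the predicates
# associated with it, next to the pattern's vector representation (P').
sibkb = {}
for pattern, relation in patty:
    entry = sibkb.setdefault(pattern,
                             {"vector": embed(pattern), "bucket": set()})
    entry["bucket"].add(relation)

print(sibkb["was born"]["bucket"])  # -> {'dbo:birthPlace', 'dbo:deathPlace'}
```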

[Figure 4. A SIBKB Relation Linking Pipeline. Four-step pipeline exploiting SIBKB indices and captured knowledge: (a) Finding potential relevant relations in a SIBKB; (b) Ranking potential relevant relations in a SIBKB; (c) Extending the set of relevant natural language relations for the input question; (d) Re-ranking the relevant relations.]

3.4 Pipeline for Relation Linking using a Semantically Indexed Bi-Partite Knowledge Base (SIBKB)

For finding the associated relation set Rel, which is part of Rels(pattern(Q), G) (see Section 3.2), a four-step process is followed; Figure 4 illustrates the steps of this pipeline.

Finding potential relevant relations in the SIBKB. In this first step of the pipeline, we convert pattern(Q) into its vector representation pattern(Q′). We then calculate the cosine similarity between pattern(Q′) and the indexed semantically typed relational patterns P′, keeping the patterns such that

Sim(pattern(Q′), P′) ≥ Threshold(T)    (4)

where Threshold(T) is the minimum admissible cosine similarity value. This results in a set of potentially relevant relation vectors potentialRels′(pattern(Q′), G′) in the SIBKB. In our example, the input for this step consists of the vectors of the question patterns, e.g., where, where was, was born, was [Noun], [Noun] born; the output is the list of vectors associated with the potentially relevant relations dbo:parent, dbo:spouse, dbo:relation, dbo:deathPlace, dbo:birthPlace, and dbo:predecessor.
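A minimal sketch of this step; the pattern vectors, buckets, and the threshold value are all illustrative:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical SIBKB entries: pattern vector (P') plus relation bucket.
sibkb = {
    "was born": {"vector": np.array([0.7, 0.1, 0.4]),
                 "bucket": {"dbo:birthPlace", "dbo:deathPlace", "dbo:parent",
                            "dbo:spouse", "dbo:predecessor", "dbo:relation"}},
    "is playing in": {"vector": np.array([0.1, 0.9, 0.2]),
                      "bucket": {"dbo:team"}},
}

def potential_rels(question_pattern_vectors, index, threshold=0.8):
    """Step 1: collect the buckets of all indexed patterns whose cosine
    similarity with a question pattern reaches Threshold(T), Equation (4)."""
    rels = set()
    for q in question_pattern_vectors:
        for entry in index.values():
            if cosine(q, entry["vector"]) >= threshold:
                rels |= entry["bucket"]
    return rels

# Toy vector for the question pattern "was born".
print(potential_rels([np.array([0.7, 0.1, 0.4])], sibkb))
```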

Ranking potential relevant relations in the SIBKB. The number of occurrences of a particular pattern in PATTY is not uniform, as illustrated in Section 2. Therefore, it is likely that, while calculating the cosine similarity, some relations are ranked higher than others due to a higher number of associated matched patterns. To solve this issue, we apply a penalty function: for each relation in PATTY, we first count the number of relational patterns associated with it; this value is then normalized by the total number of patterns in PATTY. The penalty function W is defined as follows:

W = 1 − ( count(P_{r,1})/count(P_all), …, count(P_{r,n})/count(P_all) )

where P_{r,1}, …, P_{r,n} are the numbers of patterns associated with each relation, and P_all is the total number of relational patterns in PATTY. This step changes the ranking of the relations retrieved in the previous step; potentialRels′(pattern(Q′), G′) is turned into RankedRel′(pattern(Q′), G′), the output of this pipeline step. In our example, the ranked list of relevant relations is updated from (dbo:parent, dbo:spouse, dbo:deathPlace, dbo:predecessor, dbo:birthPlace, dbo:relation) to (dbo:parent, dbo:spouse, dbo:birthPlace, dbo:deathPlace, dbo:predecessor, dbo:relation), i.e., the DBpedia predicate dbo:birthPlace is ranked at a higher position.
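The sketch below illustrates the penalty on toy pattern counts loosely based on Figure 1. How the penalty is combined with the similarity scores of the first step (here, by multiplication) is our assumption; the text defines only W itself:

```python
# Hypothetical per-relation pattern counts count(P_r) and the total
# count(P_all) of relational patterns in PATTY.
pattern_counts = {"dbo:spouse": 6138, "dbo:deathPlace": 3707,
                  "dbo:birthPlace": 1204}
total_patterns = 127811

def penalty(relation):
    """Component of W for one relation: 1 - count(P_r) / count(P_all)."""
    return 1.0 - pattern_counts.get(relation, 0) / total_patterns

# Hypothetical cosine similarities from Step 1; relations with many
# patterns (e.g., dbo:spouse) tend to match more often.
similarity = {"dbo:spouse": 0.90, "dbo:deathPlace": 0.90,
              "dbo:birthPlace": 0.89}

# Assumed combination: scale each similarity by the relation's penalty.
ranked = sorted(similarity, key=lambda r: similarity[r] * penalty(r),
                reverse=True)
print(ranked)  # -> ['dbo:birthPlace', 'dbo:deathPlace', 'dbo:spouse']
```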

Extending the set of relevant natural language relations for the input question. Often, an irrelevant pattern appearing in a question achieves a high number of matches when calculating the cosine similarities in the first step. For example, the word where appears 1,498 times in PATTY, which negatively impacts the overall results. To overcome this problem, we extract the NL relations from the input question. In DBpedia, it is very likely that the associated DBpedia predicate has a name similar to the NL predicate. For example, the NL relation was born is associated with dbo:birthPlace, the relation president of with dbo:President, and the relation wife of with dbo:spouse. Therefore, we extract Predicate(Pr) from the question Q and expand this list with synonyms from Wordnet. We then create a vector representation of each of the relations in extendedPredicate(Pr′) using the GloVe model. In our running example, the relation was born is expanded to the list (born, birth, bear, deliver), which is further converted into its vector representation.
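A sketch of the synonym expansion using NLTK's WordNet interface (assuming nltk is installed and the WordNet corpus has been fetched via nltk.download('wordnet')); the stopword list is illustrative:

```python
from nltk.corpus import wordnet as wn

STOPWORDS = {"was", "is", "the"}  # uninformative words to drop

def extend_predicate(nl_predicate):
    """Step 3: drop uninformative words from the NL predicate, then add
    the WordNet synonyms of the remaining words."""
    words = [w for w in nl_predicate.lower().split() if w not in STOPWORDS]
    extended = set(words)
    for word in words:
        for synset in wn.synsets(word):
            extended |= {lemma.name() for lemma in synset.lemmas()}
    return extended

print(extend_predicate("was born"))  # e.g., contains 'bear', 'deliver', ...
```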

Re-ranking the relevant relations. In the last step of the pipeline, we take the outputs of the second and third steps, i.e., the vector representations of the ranked potential relations (RankedRel′(pattern(Q′), G′)) and of the extended predicate patterns (extendedPredicate(Pr′)). We again calculate cosine similarities between them to re-rank the list of relations in RankedRel′(pattern(Q′), G′). In our example, the extended question predicate list from the third step is (born, birth, bear, deliver), and the ranked list of potential relations from the second step is (dbo:parent, dbo:spouse, dbo:birthPlace, dbo:predecessor, dbo:relation, dbo:deathPlace). After this step, the relation dbo:birthPlace has the highest similarity with birth, changing its position in the ranked list. Therefore, the final re-ranked list of relations associated with the pattern was born is (dbo:birthPlace, dbo:parent, dbo:spouse, dbo:deathPlace, dbo:predecessor, dbo:relation); the DBpedia predicate dbo:birthPlace is the top-1.
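A sketch of the re-ranking on toy vectors; ordering the relations by their best cosine similarity to any extended predicate term is our reading of this step:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical GloVe vectors for the extended predicate terms and for the
# labels of the candidate relations from Step 2.
extended = {"born": np.array([0.7, 0.1, 0.4]),
            "birth": np.array([0.7, 0.2, 0.4])}
relations = {"dbo:birthPlace": np.array([0.7, 0.2, 0.4]),
             "dbo:parent":     np.array([0.2, 0.6, 0.1]),
             "dbo:deathPlace": np.array([0.5, 0.2, 0.5])}

def rerank(relation_vectors, extended_vectors):
    """Step 4: order relations by their best similarity with any
    extended predicate vector."""
    best = {r: max(cosine(v, e) for e in extended_vectors.values())
            for r, v in relation_vectors.items()}
    return sorted(best, key=best.get, reverse=True)

print(rerank(relations, extended))  # dbo:birthPlace is ranked first
```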

4 Experimental Study

We empirically study the efficiency and effectiveness of SIBKB for extracting properties from a knowledge base to solve the relation linking task. In the first experiment, we assess the precision, recall, and F-Score of our approach using the QALD-7 benchmark. We address the following research questions:

RQ1) What is the impact of using a SIBKB on the relation linking task?
RQ2) What is the impact of using a SIBKB on the relation linking execution time?
RQ3) What is the impact of a SIBKB on the size of a collection of semantically-typed relational patterns?

The experimental configuration is as follows:

Relation Linking Benchmark: We created a benchmark based on the QALD (Question Answering over Linked Data) benchmark used for evaluating complete QA systems, in a similar way as done by [4] for entity linking. We devised a similar approach for the relation linking benchmark using the QALD-7 training set², which contains 215 questions.

Metrics:
i) Execution Time: Elapsed time between the submission of a question to an engine and the delivery of the relevant DBpedia relations. The timeout is set to 300 seconds.
ii) Inv.Time: Calculated as 1 − (average execution time for BaseLine / average execution time for SIBKB).
iii) In-Memory Size: The total size of the PATTY knowledge base and the size of its corresponding SIBKB.
iv) Inv.Memory: Calculated as 1 − (memory size of PATTY / memory size of SIBKB).
v) Global Precision: The number of correct relations retrieved at the first rank of the list of retrieved relations, out of the total number of questions.
vi) Global Recall: The number of questions answered at any position (in our case, up to the 5th position of occurrence of a relation in the retrieved list), out of the total number of questions.
vii) F-Score: Harmonic mean of global precision and global recall.
viii) Precision @ K: The cumulative precision at position K.
ix) Recall @ K: The correct relations for questions recommended in the top K positions, out of the total number of questions.
x) F-Score @ K: Harmonic mean of precision and recall at position K.

Implementation: The pipeline for relation linking has been implemented in Python 2.7.12. Experiments were executed on a laptop with a quad-core 1.50 GHz Intel i7-4550U processor and 8 GB RAM, running Fedora Linux 25. The word-to-vector conversion was done using GloVe [14]. Furthermore, for extracting NL predicates from the input question in the third step of the pipeline (Section 3.4), we used the TextRazor API³. The source code and evaluation results can be downloaded from https://github.com/WDAqua/ReMatch.

² https://github.com/ag-sc/QALD/blob/master/7/data/qald-7-train-multilingual.json
³ https://www.textrazor.com/
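Under our reading of the metric definitions above, the global metrics can be computed as in this small sketch (the results list is toy data):

```python
def global_precision(results):
    """Fraction of questions whose correct relation is retrieved at rank 1."""
    return sum(gold == ranked[0] for ranked, gold in results) / len(results)

def global_recall(results, k=5):
    """Fraction of questions whose correct relation appears in the top k."""
    return sum(gold in ranked[:k] for ranked, gold in results) / len(results)

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Each entry: (relations ranked by the pipeline, gold relation).
results = [(["dbo:birthPlace", "dbo:parent"], "dbo:birthPlace"),
           (["dbo:spouse", "dbo:birthPlace"], "dbo:birthPlace")]
p, r = global_precision(results), global_recall(results)
print(p, r, f_score(p, r))  # -> 0.5 1.0 0.666...
```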

4.1 Experiment 1: Performance Evaluation Using the Relation Linking Benchmark

4.1.1 Evaluation of the Relation Linking Task Using SIBKB

To evaluate the impact of the SIBKB on the relation linking task, we first measure the performance of PATTY using cosine similarity between question patterns and PATTY relational patterns [14]; we call this the BaseLine approach. In the BaseLine approach, PATTY is used directly, without indices; in our approach, we use the SIBKB, i.e., PATTY with indices, along with the pipeline described in Section 3.4. Out of the 215 questions of QALD-7, 143 can be answered using PATTY patterns; the remaining 72 questions have no associated relational patterns in PATTY and are therefore out of the scope of the evaluation. Table 2 illustrates the results. Using our approach, the global precision increases from 17% to 51% compared to the BaseLine, an improvement of nearly three times; the same holds for the global recall and F-score. We further observed the impact of our approach on capturing knowledge from the knowledge base by calculating the precision and recall values up to the first five positions in the obtained list of relations. We divided questions with two or three properties into different groups, as shown in Table 1. For example, the question Which professional surfers were born in Australia? contains two DBpedia properties, namely dbo:occupation and dbo:birthPlace. Accordingly, Table 1 has two or three rows per group, depending on the number of relations in a question. Using our SIBKB approach for relation linking, precision and recall at the first position are high enough to show that our implementation can easily be used as a relation linking tool in modular question answering frameworks such as Qanary [4] and OKBQA [1]; this would significantly improve the overall performance of QA systems in general. For example, the QA system CASIA, which uses PATTY, shows an average precision of 0.35 [8]⁴. If its relation linking tool were replaced by our approach, the overall performance of the CASIA system would improve. Furthermore, we have excluded a performance comparison with the state-of-the-art relation linking tool presented in [12], because that work does not use the background knowledge base PATTY; it relies on modeling NL relations with their underlying parts of speech, enhanced with Wordnet and dependency parsing. In contrast, our approach focuses on efficiently capturing knowledge from background knowledge bases for relation linking and can be extended to other knowledge bases similar to PATTY. However, combining both approaches would result in better performance of the relation linking task, since the relational patterns in PATTY are limited.

⁴ Based on QALD-3; questions of QALD-3 are also part of later versions.

Table 1. SIBKB Performance. Cumulative frequency at rank positions 1 to 5; Precision (P), Recall (R), and F-Score (F) are reported at Top-1 and Top-5. The accuracy of the SIBKB-based relation linking method is enhanced whenever Top-5 results are considered. Groups with two or three properties have one row per property.

Num Properties | Total | Rank#1 | Rank#2 | Rank#3 | Rank#4 | Rank#5 | P@1    | P@5    | R@5   | F@5
1 Property     | 116   | 55     | 71     | 78     | 84     | 87     | 47.4%  | 57.6%  | 75.0% | 63.2%
2 Properties   | 21    | 10     | 11     | 13     | 13     | 13     | 47.6%  | 53.1%  | 61.9% | 57.2%
               |       | 5      | 10     | 14     | 14     | 15     | 23.80% | 44.9%  | 71.4% | 52.6%
3 Properties   | 6     | 1      | 1      | 1      | 1      | 1      | 16.6%  | 16.6%  | 16.6% | 16.6%
               |       | 1      | 1      | 2      | 4      | 4      | 16.6%  | 30.5%  | 66.6% | 41.8%
               |       | 1      | 1      | 2      | 2      | 2      | 16.6%  | 22.2%  | 33.3% | 26.64%

Table 2. SIBKB Performance. Comparison of SIBKB and BaseLine. Top-1 predicates are considered; SIBKB enhances the accuracy of the proposed relation linking method.

Approach | Global Precision | Global Recall | F-Score
BaseLine | 17%              | 37%           | 23%
SIBKB    | 51%              | 73%           | 60%

4.2 Experiment 2: Trade-offs between Different Metrics

We illustrate the trade-offs between different dimensions of performance metrics for the SIBKB-based approach compared to the baseline. We choose global precision, global recall, F-score, in-memory size, and execution time as the five dimensions. The in-memory size of the PATTY knowledge base increased from 7.34 MB to 22.44 MB when we converted PATTY (a two-column corpus of relational patterns and associated DBpedia relations) into the SIBKB (an indexed bi-partite knowledge base) using the GloVe model. Also, the average execution time per question increased from 0.64 seconds to 5.96 seconds. A large portion of the total execution time of our implementation per question (nearly 20 percent) is spent calling the TextRazor API for extracting NL predicates from the question. Figure 5 illustrates the trade-offs between these five dimensions. Although the SIBKB-based approach is more expensive by an order of magnitude in terms of memory consumption and execution time, it shows a drastic improvement with regard to precision, recall, and F-Score.

[Figure 5. Performance of SIBKB. SIBKB and the BaseLine are compared in terms of Global Precision, Global Recall, Global F-Score, Inv.Time, and Inv.Memory; higher values are better. SIBKB increases Precision, Recall, and F-Score at the cost of evaluation time and memory consumption; Precision, Recall, and F-Score are improved by up to three times.]

5 Related Work

With the advancements in semantic web technologies, researchers have developed more than 50 QA systems since 2010 that interpret a user's natural language input question and fetch its answer from online knowledge bases such as DBpedia and Freebase [9]. However, most of them are monolithic implementations with limited reusability of their components [15]. To address this problem, considerable work has been done in the research community to build QA systems using generic frameworks. Qanary [4] and OKBQA [1] are collaborative efforts to build QA systems by creating components for the different tasks in the question answering process rather than building a whole QA system from scratch. Independent tools such as AGDISTIS [16] and RelMatch [12] thus find applicability in QA frameworks to perform the NED and relation linking tasks. However, in QA systems, developers have implemented the relation linking task using different approaches. The CASIA QA system [8], a monolithic implementation, has a dedicated component named Resource Mapper that maps phrases of the question to their corresponding DBpedia resources and properties. HAWK [17] has a similar component that takes the dependency tree generated from the input question and maps all nodes in this tree to concepts of the DBpedia ontology. For the same task, QA systems like CASIA [8] and Xser [18] use PATTY [13]; however, from the research publications (no source code is available), it is unclear how these systems use PATTY. The QA system AskNow [5] uses the BOA pattern library [6] and PATTY collectively for the relation linking task. It largely relies on the BOA pattern library, and only if a relational pattern is not in BOA does it search for the pattern in the PATTY corpus. AskNow does not fully rely on PATTY due to its noisy behavior⁵, which is also reported by Hakimov et al. [7].

⁵ As discussed with the main developer personally.

6 Conclusion

In this paper, we have presented the novel approach SIBKB, a semantic-based index able to capture the knowledge encoded in background knowledge bases such as PATTY for the relation linking task. Many QA systems (e.g., [5]) and state-of-the-art relation linking tools (e.g., [12]) could not completely rely on PATTY owing to its inherent noise, poor baseline performance, and in-memory overheads. SIBKB is an approach that can be generalized to similar knowledge bases to alleviate these limitations. SIBKB indices allow not only for speeding up the search but also for reducing the irrelevant relations appearing in the selection while efficiently and effectively matching natural language patterns to the semantic relational patterns of knowledge bases. We demonstrate a case where semantically typed knowledge bases can now be utilized to a degree comparable to already successful graph types. SIBKB can, therefore, be integrated with other successful techniques for semantic disambiguation, such as Wordnet similarity measures, besides the inclusion of synonyms, to extend the precision of relation linking tools. There are three possible ways to extend this work: (i) for relations that can be matched both by SIBKB and by other existing approaches, weighted metrics can be provided to discriminate among these approaches; (ii) exploring and evaluating the performance of an integrated relation matching system that incorporates multiple types of background knowledge bases; (iii) providing more semantic knowledge bases that can be used with SIBKB for the overall success of QA systems.

Acknowledgment

This project has received funding from the EU Horizon 2020 R&I programme for the Marie Skłodowska-Curie action WDAqua (GA No 642795), and the CSA BigDataEurope (GA No 644564).

References

[1] 2014. OKBQA QA Framework. http://www.okbqa.org/.
[2] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC.
[3] Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. 2007. Freebase: A Shared Database of Structured General Human Knowledge. In AAAI 2007.
[4] Dennis Diefenbach, Kuldeep Singh, Andreas Both, Didier Cherix, Christoph Lange, and Sören Auer. 2017. The Qanary Ecosystem: Getting New Insights by Composing Question Answering Pipelines. In ICWE.
[5] Mohnish Dubey, Sourish Dasgupta, Ankit Sharma, Konrad Höffner, and Jens Lehmann. 2016. AskNow: A Framework for Natural Language Query Formalization in SPARQL. In ESWC.
[6] Daniel Gerber and Axel-Cyrille Ngonga Ngomo. 2012. Extracting Multilingual Natural-Language Patterns for RDF Predicates. In EKAW.
[7] Sherzod Hakimov, Hakan Tunc, Marlen Akimaliev, and Erdogan Dogdu. 2013. Semantic Question Answering System over Linked Data Using Relational Patterns. In EDBT/ICDT '13.
[8] Shizhu He, Yuanzhe Zhang, Kang Liu, and Jun Zhao. 2014. CASIA@V2: A MLN-based Question Answering System over Linked Data. In CLEF (Working Notes).
[9] Konrad Höffner, Sebastian Walter, Edgard Marx, Ricardo Usbeck, Jens Lehmann, and Axel-Cyrille Ngonga Ngomo. 2016. Survey on Challenges of Question Answering in the Semantic Web. Semantic Web (2016).
[10] Per-Åke Larson. 1978. Dynamic Hashing. BIT (1978).
[11] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding Light on the Web of Documents. In I-SEMANTICS.
[12] Isaiah Onando Mulang', Kuldeep Singh, and Fabrizio Orlandi. 2017. Matching Natural Language Relations to Knowledge Graph Properties for Question Answering. In SEMANTiCS 2017. To appear.
[13] Ndapandula Nakashole, Gerhard Weikum, and Fabian M. Suchanek. 2012. PATTY: A Taxonomy of Relational Patterns with Semantic Types. In EMNLP-CoNLL.
[14] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP.
[15] Kuldeep Singh, Andreas Both, Dennis Diefenbach, and Saeedeh Shekarpour. 2016. Towards a Message-Driven Vocabulary for Promoting the Interoperability of Question Answering Systems. In ICSC.
[16] R. Usbeck, A.-C. Ngonga Ngomo, M. Röder, D. Gerber, S. A. Coelho, S. Auer, and A. Both. 2014. AGDISTIS – Graph-Based Disambiguation of Named Entities Using Linked Data. In ISWC.
[17] Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Lorenz Bühmann, and Christina Unger. 2015. HAWK: Hybrid Question Answering over Linked Data. In ESWC.
[18] K. Xu, Y. Feng, and D. Zhao. 2014. Xser@QALD-4: Answering Natural Language Questions via Phrasal Semantic Parsing.
