Lemonade: A web assistant for creating and debugging ontology lexica

Mariano Rico 1 and Christina Unger 2

1 Universidad Politécnica de Madrid, [email protected], http://oeg-upm.net
2 Bielefeld University, [email protected], http://www.sc.cit-ec.uni-bielefeld.de/

Work supported by LIDER (EU FP7 project No. 610782) and MINECO JdC Grant (JCI-2012-12719).

Abstract. The current state of the art in providing lexicalizations for ontologies is the lemon model. Based on our experience in creating a lemon lexicon for the DBpedia ontology in English and subsequently porting it to Spanish and German, we show that creating ontology lexica is a time-consuming, often tedious, and error-prone process. As a remedy, this paper introduces Lemonade, an assistant that facilitates the creation of lexica and helps users spot errors and inconsistencies in the created lexical entries, thereby ‘sweetening’ the otherwise ‘bitter’ lemon.

Keywords: Ontology lexicon, lemon, Grammatical Framework, DBpedia

1 Introduction

One of the major challenges in providing natural language access to Semantic Web data is relating natural language expressions to the corresponding vocabulary elements. One possibility to specify this relation is the use of ontology lexica [4]. The current state of the art for specifying ontology lexica is the lemon model (http://lemon-model.net) [2], which provides a standard format for capturing linguistically rich information about how the vocabulary elements used in a particular ontology or dataset are verbalized in natural language, in particular covering different verbalization variants, possibly in multiple languages. The resulting lexica are themselves expressed as RDF data, so that they can be shared in accordance with linked data principles and thus re-used across applications.

Although the process of creating ontology lexica can be automated to a certain extent [7], creating a wide-coverage and high-precision lexical resource still requires a significant manual effort. This presupposes familiarity with RDF, and even though there are tools like the lemon design patterns library (http://github.com/jmccrae/lemon.patterns) [3] to support and facilitate lexicon creation, the process of creating lexica manually is still time-consuming, and often also tedious and error-prone.

To illustrate this, we take as starting point the creation of an English lexicon for DBpedia [6] (available at http://github.com/cunger/lemon.dbpedia) and our efforts to port this lexicon to German and Spanish. In total, the DBpedia lexicon contains 1,217 lexicalizations for the most important classes and properties of the DBpedia ontology, specified using the lemon design pattern macros, which correspond to more than 50,000 RDF triples.

Although using the lemon design patterns almost completely frees the lexicon engineer from writing verbose RDF code, it also has several limitations. First, writing lexicalizations by hand is error-prone. For instance, if you type “femenine” instead of “feminine”, you get an error when converting the macros into RDF, and you typically run the conversion at least as many times as you have errors. As a result, removing all such errors takes considerable time.

Second, and more importantly, validation of lexica is currently only possible with respect to the well-formedness of the RDF code and its conformance to the lemon ontology [1]; it cannot be performed on the level of lexical consistency and correctness. For example, the gender of a particular word may be specified as feminine at one point and as masculine at another. There can also be typos in literals (say you typed “expecies” instead of “especies”). Another example concerns mistakes in the argument mapping. For instance, one of the English lexicalizations of the DBpedia property writer is specified as follows:

    StateVerb("write", dbpedia:writer,
              propSubj = DirectObject,
              propObj = Subject)

This establishes that the subject of the property corresponds to the direct object in syntactic structures, and that the object of the property corresponds to the syntactic subject. That is, the triple (Macbeth, writer, Shakespeare) can be expressed as “Shakespeare wrote Macbeth”. However, if we accidentally swapped DirectObject and Subject, the same triple would be expressed as “Macbeth wrote Shakespeare”. These kinds of errors are very hard to spot when all you get is a huge, automatically generated RDF file.

In this paper we therefore present Lemonade, a system that assists users in creating lexica by means of an easy-to-use web interface and furthermore provides support for spotting errors and inconsistencies in the created lexicon. In particular, we suggest that showing natural language sentences that would result from using the specified entries significantly helps in detecting erroneous and inappropriate entries.
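
To make the kinds of checks that such tool support should provide more concrete, the following is a minimal sketch, in Python and over a deliberately simplified entry representation assumed purely for illustration, of how invalid or conflicting gender values could be flagged automatically before the macros are converted to RDF. It is neither the macro format of the lemon design patterns nor the implementation used by Lemonade, which (as described in the next section) is written in R.

    # Illustrative sketch only: the entry format below is a simplified assumption,
    # not the lemon design pattern macro format, and this is not Lemonade's own code.
    from collections import defaultdict

    entries = [
        {"lemma": "libro",   "gender": "masculine", "reference": "dbpedia:Book"},
        {"lemma": "especie", "gender": "feminine",  "reference": "dbpedia:Species"},
        {"lemma": "especie", "gender": "femenine",  "reference": "dbpedia:Species"},  # typo and clash
    ]

    VALID_GENDERS = {"masculine", "feminine", "neuter"}

    def check_entries(entries):
        """Return human-readable warnings about suspicious entries."""
        warnings = []
        genders_per_lemma = defaultdict(set)
        for e in entries:
            if e["gender"] not in VALID_GENDERS:
                warnings.append(f'"{e["lemma"]}": unknown gender value "{e["gender"]}"')
            genders_per_lemma[e["lemma"]].add(e["gender"])
        # A lemma specified with different genders at different points is suspicious.
        for lemma, genders in genders_per_lemma.items():
            if len(genders) > 1:
                warnings.append(f'"{lemma}": conflicting gender values {sorted(genders)}')
        return warnings

    if __name__ == "__main__":
        for w in check_entries(entries):
            print("WARNING:", w)

Checks of this kind catch misspelled attribute values and contradictory specifications, but, as argued above, they cannot catch semantic mistakes such as a swapped argument mapping; for those, generated example sentences are a more effective diagnostic.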

2 Architecture of Lemonade

Fig. 1. Architecture of the Lemonade system.

Fig. 1 shows the architecture of Lemonade. At its very core, it is a library written in R that interfaces with lemon for the creation of lexica and with Grammatical Framework [5] for the construction of example sentences based on already created lexicalizations. On top of this library we have developed two applications intended to assist users in the creation of ontology lexica:

– The lemon assistant, shown on the left side of Fig. 1, is a web interface for creating lexicalizations of classes and properties. It covers the most common lemon design patterns, in particular common nouns (e.g. “mountain”), relational nouns (e.g. “capital of”), and state verbs (e.g. “to write”).

– The lemon lint remover (LEIRE), shown on the right side of Fig. 1, reads the created lexicalizations and implements several consistency checks for each design pattern, such as checking for multiple plural forms for nouns. In addition, it creates a natural language sentence that illustrates a possible use of the created entry. The result of the analysis is published as a web page on GitHub, which users can check in order to spot possible errors and inconsistencies.

We instantiated the system for DBpedia and three languages – English, Spanish, and German – but it can easily be ported to any other dataset and a wide range of other languages.

The web interface is shown in Fig. 2. For example, if we choose to create a class noun for the DBpedia class Book in Spanish, the assistant prompts us to provide the necessary information, in this case the singular and plural forms (“libro” and “libros”), the gender (masculine), and the ontology reference (the class Book). Based on this information the assistant then creates example sentences for both the singular and the plural form, in this case:

Fig. 2. Screenshot of the web interface to create a Spanish class noun for the DBpedia class Book.

– “Cántico por Leibowitz es un libro”
– “Cántico por Leibowitz y La conjura de los necios son libros”

It does so by querying the DBpedia endpoint for instances of the class Book and using their labels to fill sentence templates that are encoded in Grammatical Framework. In addition, it shows the created lemon design patterns macro, so that expert users can directly check the lexicon code that is created. If the user validates the information, the lexical entry is stored in the GitHub repository underlying the project.

In Fig. 1, the process of creating lexical entries and storing them in the repository corresponds to the sequence A1-K1-K2-K3-A2-A3, and the process of reading lexicalizations from the repository and creating example sentences corresponds to the sequence L1-L2-L3-L4-L5-K1-K2-K3-L6. Information about the tool and links to the web application can be found at https://github.com/cunger/lemon.dbpedia/blob/master/test/LemonadeTools.md.
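
The following Python sketch illustrates the idea behind this example-sentence generation: it retrieves Spanish labels of instances of the DBpedia class Book from the public SPARQL endpoint and fills simple sentence templates with them. It is an approximation for illustration only; the actual system is an R library whose sentence templates are encoded in Grammatical Framework, and the hard-coded Spanish templates, the query, and the use of the SPARQLWrapper package are assumptions of this sketch rather than part of Lemonade.

    # Illustrative sketch, not Lemonade's implementation (which is written in R and
    # uses Grammatical Framework for sentence construction).
    from SPARQLWrapper import SPARQLWrapper, JSON  # assumes SPARQLWrapper is installed

    ENDPOINT = "http://dbpedia.org/sparql"

    QUERY = """
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
      ?book a dbo:Book ;
            rdfs:label ?label .
      FILTER(LANG(?label) = "es")
    }
    LIMIT 2
    """

    def book_labels():
        """Fetch Spanish labels of two DBpedia Book instances."""
        sparql = SPARQLWrapper(ENDPOINT)
        sparql.setQuery(QUERY)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [b["label"]["value"] for b in results["results"]["bindings"]]

    def example_sentences(labels, singular="libro", plural="libros"):
        """Fill simple Spanish templates (assuming a masculine noun, hence 'un')."""
        sentences = []
        if len(labels) >= 1:
            sentences.append(f"{labels[0]} es un {singular}")
        if len(labels) >= 2:
            sentences.append(f"{labels[0]} y {labels[1]} son {plural}")
        return sentences

    if __name__ == "__main__":
        for s in example_sentences(book_labels()):
            print(s)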

References

1. Dimitris Kontokostas, Martin Brümmer, Sebastian Hellmann, Jens Lehmann, and Lazaros Ioannidis. NLP data cleansing based on linguistic ontology constraints. In Proceedings of the Extended Semantic Web Conference (ESWC) 2014, 2014.
2. J. McCrae, D. Spohr, and P. Cimiano. Linking lexical resources and ontologies on the Semantic Web with lemon. In Proceedings of the 8th Extended Semantic Web Conference (ESWC), pages 245–259. Springer, 2011.
3. J. McCrae and C. Unger. Design patterns for engineering the ontology-lexicon interface. In Paul Buitelaar and Philipp Cimiano, editors, Towards the Multilingual Semantic Web: Principles, Methods and Applications. Springer, 2014.
4. L. Prévot, C.R. Huang, N. Calzolari, A. Gangemi, A. Lenci, and A. Oltramari. Ontology and the lexicon: a multi-disciplinary perspective. In Ontology and the Lexicon: A Natural Language Processing Perspective, pages 3–24. Cambridge University Press, 2010.
5. A. Ranta. Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, 2011.
6. Christina Unger, John McCrae, Sebastian Walter, Sara Winter, and Philipp Cimiano. A lemon lexicon for DBpedia. In Proceedings of the 1st International Workshop on NLP and DBpedia, co-located with the 12th International Semantic Web Conference (ISWC 2013), October 21-25, Sydney, Australia, 2013.
7. Sebastian Walter, Christina Unger, and Philipp Cimiano. M-ATOLL: A framework for the lexicalization of ontologies in multiple languages. In The Semantic Web – ISWC 2014, volume 8796 of Lecture Notes in Computer Science, pages 472–486. Springer, 2014.
