(German) Language Processing for Lucene

Bastian Entrup
Applied and Computational Linguistics, Justus-Liebig-Universität Gießen, Germany
[email protected]

Abstract. This paper1 introduces an open-source Java package called German Language Processing for Lucene (glp4lucene). Although it was originally developed to work with German texts, it is to a large degree language independent. It aims at facilitating four language processing steps for working with non-English texts and Apache Lucene/Solr: lemmatizing words, weighting terms based on their part-of-speech, adding synonyms, and decompounding nouns, without requiring a thorough understanding of natural language processing.

1 Introduction

In recent years, Apache Lucene and Apache Solr2, the search platform based on Lucene, have gained a lot of popularity, not only in industrial applications but also for searching websites and for academic purposes. Lucene offers many interesting, expandable features, and a number of resources and publications on how to use and apply Lucene for different purposes (e.g., [12]) are available. Although Lucene offers many possibilities to incorporate natural language processing (NLP) or methods from computational linguistics, these are usually only available for English by default. German language support, for example, is basically limited to stemming or to language-independent methods, such as providing a stop-word list or a dictionary for compound splitting. The open-source package3 German Language Processing for Lucene (glp4lucene) described here aims at removing four obstacles that one might encounter when working with non-English texts in Lucene: lemmatization, synonym expansion, decompounding, and part-of-speech (POS) weighting. Despite its name, it is not only applicable to German but to other languages as well; it is basically language independent. It was developed within a Digital Humanities project in which a number of German texts were to be processed and where synonym expansion and decompounding were of special importance for the performance of the search platform.

1 The final publication is available at http://link.springer.com/chapter/10.1007/978-3-319-19581-0_35.
2 http://lucene.apache.org/, http://lucene.apache.org/solr/
3 https://sourceforge.net/projects/glpforlucene


2 Motivation

While it is easy to do very basic indexing and searching with almost no experience in Lucene/Solr, applying NLP methods requires more in-depth knowledge of Lucene. This package aims at facilitating the use of these methods.

Lemmatization, i.e., the reduction of inflected word forms to a base form, is, from a linguistic point of view, more meaningful than stemming, the reduction of a word to its stem. Language is, per se, ambiguous (cf. cases of homonymy or polysemy). Stemming further increases this ambiguity, since different, unrelated words are reduced to the same word stem. Even words that are not homographic, like German Bauer (farmer) and Bau (a homonymous word with the meanings building, construction site, and jail), are reduced to the same stem: Bau. Lemmatization, on the other hand, yields different lemmas for these words, while still reducing different inflected word forms (such as Bauers (of the farmer) and Bauern (the farmers)) to one base form.


Despite these considerations, studies have shown that stemming increases precision in German information retrieval (IR) compared to using neither stemming nor lemmatization: [3] found an increase of between 11 and 23%, while [5] found that stemming improved precision for German by 7.3% and lemmatization only by 6%. Nonetheless, lemmatizing words is a necessity for looking up synonymous words in GermaNet [4], a German counterpart of WordNet [13].

Adding synonyms is commonly expected to increase recall [11]. To be able to find all relevant documents, the search engine has to identify the concept in question even if words other than the query term were used to refer to it.

While in the early days of IR stop-word lists were commonly applied, today's state-of-the-art web search engines do not use stop words, since they are problematic, e.g., when it comes to finding song titles or proverbs4. Weighting terms instead seems to be a plausible approach: as [7] shows, nouns are the most commonly searched-for terms. Weighting terms by their POS is inspired by Jespersen's Rank Theory [6], which states that the open POS classes (nouns, adjectives, adverbs, and verbs) are more content-bearing, while the closed POS classes are, roughly speaking, more or less empty and fulfill mainly grammatical or deictic purposes. This idea has been applied to IR before [10, 9]. Unlike these proposals, only the weighting of single words is supported here: one can increase the weight of a term depending on its POS. Given a query such as Café in Paris, documents containing Café and Paris are more likely hits than documents containing in. One could thus decide to weight nouns more heavily than prepositions. This is, of course, entirely up to the user. If no weights are given, or if the method is not used at all, all terms are weighted equally.

4 Think of Shakespeare's To Be or Not to Be, where almost every token is a stop word, and one cannot simply ignore them altogether.


3 Implementation

The following three modules of glp4lucene can be used to analyze and filter the input during indexing or at search time. While GermaNet requires the input to be lemmatized, all parts of the package can freely be combined, left out, or used together with Lucene's standard analyzers or stemming.

Lemmatization: The implementation of lemmatization presented here is language independent and can be used in the same way, out of the box, for other languages as well. It only requires a MATE-tools [2] model appropriate for the given language5. This model (for German, the model described in [14] can be used) is used to assign an appropriate lemma to each token of the field it is applied to. A minimal sketch of how such a lemmatizer fits into Lucene's analysis chain follows.
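To make this concrete, below is a minimal sketch of a lemmatizing TokenFilter for Lucene 4.x. The Lemmatizer interface is a hypothetical stand-in for the package's wrapper around a loaded MATE-tools model; glp4lucene's actual class and method names may differ.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /** Replaces each token's surface form with its lemma. */
    public final class LemmaTokenFilter extends TokenFilter {

        /** Hypothetical abstraction over a loaded MATE-tools model. */
        public interface Lemmatizer {
            String lemmaOf(String wordForm);
        }

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final Lemmatizer lemmatizer;

        public LemmaTokenFilter(TokenStream input, Lemmatizer lemmatizer) {
            super(input);
            this.lemmatizer = lemmatizer;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String lemma = lemmatizer.lemmaOf(termAtt.toString());
            if (lemma != null && !lemma.isEmpty()) {
                termAtt.setEmpty().append(lemma); // overwrite the inflected form
            }
            return true;
        }
    }

A filter of this kind is simply appended to an existing analyzer's token stream, which is why it combines freely with Lucene's standard tokenizers, stop-word removal, or stemming.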


Synonym expansion: The glp4lucene package extends the existing synonym interface6 of Lucene and comes with three pre-defined implementations of this interface. To add appropriate synonyms for words in the search index or the query term, GermaNet can be used. GermaNet is a manually compiled resource that establishes different semantic and lexical relations between words, or senses. The building block of GermaNet is the synset, a set of synonymous word senses. The German word Gefängnis (prison), for example, is found in a synset together with Bau (jail). Since GermaNet is a proprietary resource, the glp4lucene package includes two other methods to assign synonyms: the first is to use a list of semantically similar words (e.g., from a distributional thesaurus) of the form focus-word synonym similarity; the second is to use a list of synonyms, compiled from whatever resource one has at hand, of the form focus-word synonym. These two methods are, of course, also language independent. Furthermore, one can use the interface to implement further methods to add synonyms. Instead of looking up each word that is encountered in the texts during indexing or searching, a map between each word form found and its synonyms is built and stored for later re-use (see the sketch after the footnotes below). This list can directly be (re-)used in Lucene's decompounding class. Splitting up compounds has been found to be very helpful for many Germanic languages, but also for others [5].

POS-based term weighting: The implementation uses the Stanford Maxent Tagger [15]. Again, if presented with the correct model7, this implementation is language independent. Besides the model, one also has to provide a list of POSs and the corresponding weights. If no weight is given, the POS is treated as neutral and no weight is stored in the index (a sketch follows below).

5 Models for French, Spanish, Chinese, English, and German are available from https://code.google.com/p/mate-tools/.
6 The interface follows the implementation found in [12], extends it by new methods, and is adapted to the newer Lucene versions 4.x. It has been tested with versions 4.6 to 4.8.1.
7 Models for English, Arabic, Chinese, French, Spanish, and German are available at http://nlp.stanford.edu/software/tagger.shtml.
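To illustrate how such a pre-built word/synonym list can be applied during analysis, the following sketch loads tab-separated focus-word/synonym pairs into Lucene 4.x's built-in SynonymMap and wraps a token stream in the stock SynonymFilter. The file format and class names here are assumptions for the sketch; glp4lucene's own interface follows [12] and may differ in its details.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.synonym.SynonymFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    /** Loads a focus-word/synonym pair list into Lucene's SynonymMap. */
    public final class PairListSynonyms {

        /** Expects one tab-separated pair per line: focus-word<TAB>synonym. */
        public static SynonymMap load(String path) throws IOException {
            SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries
            try (BufferedReader in =
                     Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] pair = line.split("\t");
                    if (pair.length < 2) continue; // skip malformed lines
                    // true = keep the original token alongside the injected synonym
                    builder.add(new CharsRef(pair[0]), new CharsRef(pair[1]), true);
                }
            }
            return builder.build();
        }

        /** Injects synonyms at the same position as the original token. */
        public static TokenStream expand(TokenStream input, SynonymMap map) {
            return new SynonymFilter(input, map, true); // true = ignore case
        }
    }

Because the map is built once and stored, repeated lookups against GermaNet or a similarity list are avoided. The same word list, converted to a CharArraySet, can presumably also serve as the dictionary for Lucene's DictionaryCompoundWordTokenFilter, the decompounding class mentioned above.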

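Lucene's payload mechanism is a natural fit for storing such per-term weights in the index; whether glp4lucene stores them exactly this way is an assumption of the following Lucene 4.x sketch, as is the simplification of treating POS assignment as a per-token lookup (the package tags running text with the Stanford tagger, where sentence context disambiguates the POS).

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    /** Attaches a user-defined weight to each token as a payload, based on its POS. */
    public final class PosWeightFilter extends TokenFilter {

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
        private final Map<String, String> posByToken; // token -> POS tag, from the tagger
        private final Map<String, Float> weightByPos; // e.g. "NN" -> 2.0f, "APPR" -> 0.5f

        public PosWeightFilter(TokenStream input,
                               Map<String, String> posByToken,
                               Map<String, Float> weightByPos) {
            super(input);
            this.posByToken = posByToken;
            this.weightByPos = weightByPos;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String pos = posByToken.get(termAtt.toString());
            Float weight = (pos == null) ? null : weightByPos.get(pos);
            if (weight != null) { // neutral POS: leave the token without a payload
                payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(weight)));
            }
            return true;
        }
    }

At query time, a payload-aware query such as PayloadTermQuery, combined with a Similarity whose scorePayload method decodes the float via PayloadHelper.decodeFloat, can then boost matches by the stored weight; terms without a payload score neutrally.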

4 Conclusion


The package is meant to assist developers without a thorough knowledge of NLP tools in building a Lucene index that applies different techniques from NLP. The source of the texts can be a database or text files in any format. The resulting index can be used in web search servers such as Apache Solr, or it can be used to speed up and facilitate database queries. Using the morphological information given in the index, as well as the lemmatization, building a linguistic corpus based on Lucene is another possible application of the software.

Using the source code, provided along with the compiled JAR file, the implementation can be adapted to one's needs. For example, it might be interesting to assign not only synonyms but also hyponyms or meronyms, as described in [8]: searching for Hund (dog) should perhaps also find texts about Dackel (dachshund). Other possibilities include using distributional thesauri8 to find related, though maybe not strictly synonymous, words and using those instead of the information provided by GermaNet. This approach can also be used to create a second index that provides the end user with alternative query terms. One can also use other resources, e.g., Wiktionaries, to generate and use lists of synonymous words. If resources for other languages are available, the implementation given in this package can be used as is not only for German texts but for other languages as well.

8 For example, for German, http://sourceforge.net/projects/jobimtext/files/data/models/de_news70M_pruned.zip/download, based on 70 million sentences from a news corpus, extracted using the system described in [1].

5 Acknowledgement

This package was created for and within the GeoBib project to facilitate searching the project's data set and will be used in the planned website. GeoBib is funded by the German Federal Ministry of Education and Research (grant no. 01UG1238A-B).

References

1. Biemann, C., Riedl, M.: Text: Now in 2D! A Framework for Lexical Expansion with Contextual Similarity. Journal of Language Modelling 1(1), 55–95 (2013)
2. Bohnet, B.: Very High Accuracy and Fast Dependency Parsing is Not a Contradiction. In: Proceedings of the 23rd International Conference on Computational Linguistics. pp. 89–97. COLING '10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010)


3. Braschler, M., Ripplinger, B.: How Effective is Stemming and Decompounding for German Text Retrieval? Information Retrieval 7(3-4), 291–316 (2004)
4. Hamp, B., Feldweg, H.: GermaNet - a Lexical-Semantic Net for German. In: Proceedings of the ACL Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. pp. 9–15 (1997)
5. Hollink, V., Kamps, J., Monz, C., de Rijke, M.: Monolingual Document Retrieval for European Languages. Information Retrieval 7(1-2), 33–52 (2004)
6. Jespersen, O.: The Philosophy of Grammar. Chicago Studies in Ethnomusicology Series, University of Chicago Press (1992)
7. Kraaij, W., Pohlmann, R.E.: Viewing Stemming as Recall Enhancement. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 40–48 (1996)
8. Leveling, J.: University of Hagen at CLEF 2003: Natural Language Access to the GIRT4 Data. In: CLEF. pp. 412–424 (2003)
9. Lioma, C., Blanco, R.: Part of Speech Based Term Weighting for Information Retrieval. In: ECIR. pp. 412–423 (2009)
10. Lioma, C., van Rijsbergen, C.K.: Part of Speech Based Term Weighting for Information Retrieval. In: Revue Française de Linguistique Appliquée. vol. 1 (2008)
11. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge, England (2008)
12. McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Manning Publications Co., Greenwich, CT, USA (2010)
13. Miller, G.A.: WordNet: A Lexical Database for English. Communications of the ACM 38, 39–41 (1995)
14. Seeker, W., Kuhn, J.: Making Ellipses Explicit in Dependency Conversion for a German Treebank. In: LREC. pp. 3132–3139 (2012)
15. Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In: Proceedings of the 2003 Conference of the NAACL on Human Language Technology. pp. 173–180. NAACL '03, Association for Computational Linguistics, Stroudsburg, PA, USA (2003)