POSTagging and Semantic Dictionary Creation for Hittite Cuneiform
Timo Homburg [email protected]
Mainz University of Applied Sciences, Germany

Introduction

Natural Language Processing and semantic information extraction for cuneiform languages is an emerging field in the Digital Humanities. The tool Cuneify [1] now allows cuneiform transliterations to be converted into a Unicode representation, and efforts to create treebanks [2], segmentation algorithms [3] and 3D recognition of cuneiform tablets [4] are on the rise. We contribute a morphological analyzer for Hittite cuneiform texts as a first step towards processing Hittite cuneiform automatically in a Natural Language Processing pipeline.

Hittite Cuneiform Morphological Analysis

We conducted a morphological analysis based on regular expressions on a corpus of 10 texts from three different Hittite epochs, provided by the University of Texas at Austin [5]. We assigned a basic classification (the POSTag) and an extended classification (person, gender, case).
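To make the idea of regular-expression-based tagging concrete, the following is a minimal sketch. The suffix rules, the feature names and the ASCII transliteration (hyphen-separated signs, `s` for `š`) are assumptions chosen for illustration and are not the rule set used for the results on this poster.

```python
import re

# Illustrative suffix rules only (an assumption for this sketch,
# not the rule set evaluated on the poster). Order matters: first match wins.
RULES = [
    (re.compile(r"zi$"), "VERB", {"person": 3, "number": "sg", "tense": "present"}),
    (re.compile(r"mi$"), "VERB", {"person": 1, "number": "sg", "tense": "present"}),
    (re.compile(r"az$"), "NOUN", {"case": "ablative"}),
    (re.compile(r"an$"), "NOUN", {"case": "accusative", "number": "sg"}),
    # The ending -as is itself ambiguous between nominative and genitive.
    (re.compile(r"as$"), "NOUN", {"case": "nominative/genitive", "number": "sg"}),
]


def normalize(token):
    """Lower-case a transliterated token and join its hyphen-separated signs."""
    return token.lower().replace("-", "")


def tag(token):
    """Return (basic POSTag, extended features) for one transliterated token."""
    word = normalize(token)
    for pattern, pos, features in RULES:
        if pattern.search(word):
            return pos, features
    return "UNKNOWN", {}


if __name__ == "__main__":
    for token in ["ku-en-zi", "at-ta-as", "nu"]:
        print(token, tag(token))
```

The first element of the returned pair corresponds to the basic classification and the feature dictionary to the extended classification. A purely suffix-based approach cannot separate word classes that share endings, which is where dictionary lookups become useful.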

Background

An example Natural Language Processing pipeline for cuneiform is depicted in Figure 1.

POSTagging Results

Experiment      Translit   Cunei
POS_Basic       74.7%      62.2%
POS_Ext         78.3%      67.1%
POS_Basic_OH    68.5%      60%
POS_Ext_OH      75.6%      68.5%
POS_Basic_MH    69.8%      60.2%
POS_Ext_MH      72.2%      61.3%
POS_Basic_NH    75.8%      61.3%
POS_Ext_NH      79.5%      70.5%

POSTagging Tool

The results serve as a test of how useful a morphological analysis can be for determining POSTags in Hittite. In our further research, we combine these morphological features with features common in Machine Learning approaches to achieve better results on average.

Figure 1: Natural Language Processing Pipeline

Cuneiform LOD Dictionary Creation and Applications

For Hittite cuneiform, our research centres on automatic POSTag assignment, preliminary results of which we present on this poster. To improve POSTagging and to work towards automated translation of cuneiform texts, we integrate publicly available dictionary resources. These can help resolve ambiguities in POSTag assignment and become the basis of more sophisticated analyses such as distant reading and Topic Modelling. In addition, we want to promote more user-friendly, Open Source tools (e.g. Input Method Engines, Flash Card Learning) to make it easier for scholars, students and researchers to work with cuneiform texts.

In Hittite, nouns and adjectives follow the same declension rules. To better distinguish these and other ambiguous cases in POSTagging, and to promote Linguistic LOD in cuneiform languages, we take existing dictionary resources for Akkadian, Hittite and Sumerian cuneiform and convert them according to the Lexicon Model for Ontologies (Lemon) standard [6] to include the contents listed below.

Dictionary Contents

• Semantic concepts from Wikidata, VerbNet and BabelNet
• POSTag including gender, tense, person, case etc.
• Morphological changes as rules or word forms and word decompositions
• Etymology information and metadata

Dictionary Format [6]

The Semantic Dictionary for Ancient Languages (https://situx.github.io/SemanticDictionary/) is published together with specific export formats following [7] for IME and Flash Card Learning.
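As a rough illustration of how such dictionary entries could resolve the noun/adjective ambiguity, consider the sketch below. The `DictEntry` structure, its field names and the two sample entries are hypothetical simplifications, not the Lemon serialization or the published dictionary's schema. In a real entry the `concept` field would hold a link to a Wikidata or BabelNet concept rather than a plain label.

```python
from dataclasses import dataclass, field


@dataclass
class DictEntry:
    """Hypothetical, simplified dictionary entry (not the Lemon schema)."""
    lemma: str
    pos: str
    concept: str  # placeholder label; a real entry would link to a concept URI
    forms: set = field(default_factory=set)


# Two illustrative entries in ASCII transliteration (s stands for š).
LEXICON = [
    DictEntry("atta", "NOUN", "father", {"attas", "attan"}),
    DictEntry("salli", "ADJ", "big", {"sallis", "sallin"}),
]

# Index every listed word form for constant-time lookup.
FORM_INDEX = {form: entry for entry in LEXICON for form in entry.forms}


def disambiguate(word, candidates):
    """Pick the POSTag confirmed by the dictionary, else keep all candidates."""
    entry = FORM_INDEX.get(word)
    if entry is not None and entry.pos in candidates:
        return entry.pos
    return "|".join(sorted(candidates))


if __name__ == "__main__":
    ambiguous = {"NOUN", "ADJ"}  # both word classes share the same endings
    for word in ["attas", "sallis", "unknownas"]:
        print(word, disambiguate(word, ambiguous))
```

Words missing from the dictionary keep their full candidate set, so a later pipeline step can still treat them as ambiguous.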

References

[1] Stephen Tinney. Cuneify. http://oracc.museum.upenn.edu/doc/tools/cuneify/, 2015.
[2] Guglielmo Inglese. Towards a Hittite treebank: basic challenges and methodological remarks. Corpus-Based Research in the Humanities (CRH), page 59, 2015.
[3] Timo Homburg and Christian Chiarcos. Word segmentation for Akkadian cuneiform. In LREC 2016, 2016.
[4] Hubert Mara, Susanne Krömker, Stefan Jakob, and Bernd Breuckmann. GigaMesh and Gilgamesh: 3D multiscale integral invariant cuneiform character extraction. In Proceedings of the 11th International Conference on Virtual Reality, Archaeology and Cultural Heritage, pages 131–138. Eurographics Association, 2010.
[5] Winfred P. Lehmann and Jonathan Slocum. Hittite online. https://lrc.la.utexas.edu/eieol/hitol, 2013.
[6] John McCrae, Dennis Spohr, and Philipp Cimiano. Linking lexical resources and ontologies on the semantic web with lemon. In Extended Semantic Web Conference, pages 245–259. Springer, 2011.
[7] Timo Homburg, Christian Chiarcos, Thomas Richter, and Dirk Wicke. Learning cuneiform the modern way. https://gams.uni-graz.at/o:dhd2015.p.55, 2015.

Figure 2: Applications: Input Method Engine (left), Flash Cards (middle), Dictionary Excerpt (right)

Machine Translation and Future Perspectives

Machine Translation

The next step in research on cuneiform languages should be to implement Machine Translation and semantic extraction algorithms for cuneiform tablets. As with other languages and texts, cuneiform language resources should be categorized and analyzed automatically using state-of-the-art distant reading methods. The accompanying word cloud figure shows a first approach in this direction, extracting keywords from geolocated cuneiform tablets.

Wordclouds (Keyword Matching)
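Keyword matching of this kind can be sketched as a frequency count per find location, which is the usual input to a word cloud. The site names, tokens and keyword list below are invented for illustration and do not come from the poster's data.

```python
from collections import Counter, defaultdict


def keyword_counts(tablets, keywords):
    """Count keyword occurrences per site.

    tablets: iterable of (site, tokens) pairs, one pair per tablet.
    keywords: set of tokens considered relevant.
    """
    counts = defaultdict(Counter)
    for site, tokens in tablets:
        counts[site].update(t for t in tokens if t in keywords)
    return counts


# Hypothetical sample data: site labels and token lists are assumptions.
TABLETS = [
    ("site_A", ["halki", "watar", "halki"]),
    ("site_B", ["watar", "nu"]),
    ("site_A", ["nu", "watar"]),
]
KEYWORDS = {"halki", "watar"}

COUNTS = keyword_counts(TABLETS, KEYWORDS)

if __name__ == "__main__":
    for site, counter in sorted(COUNTS.items()):
        print(site, counter.most_common())
```

The per-site counters can then be passed to any word cloud renderer, with the count determining the size of each keyword.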
