How to find the right path? (On the morphological disambiguation of sentence in Serbian)
Cvetana Krstev & Dusko Vitas
University of Belgrade
{krstev|vitas}@matf.bg.ac.yu
1 Introduction
On the basis of the method of lexical recognition (Silberztein, 1993), we have developed (Vitas, 2007a) a morphological electronic dictionary of contemporary Serbian (Croatian) in the so-called LADL format (Courtois, 1990). The use of such dictionaries for text and corpus analysis is supported by the Unitex¹ software system. Unlike methods based on machine learning, the lexical recognition method applied to text analysis describes the possible morphological interpretations at the sentence level of the input text very precisely and with high reliability. For an arbitrary text, less than 1% of all word occurrences remain unrecognized. The results obtained are, however, highly ambiguous, in the sense that a word from the text (a sequence of alphabetic characters between two separators) can represent several lemmas and/or several sets of grammatical categories.
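The kind of ambiguity produced by dictionary lookup can be illustrated with a minimal sketch. It is not the Unitex implementation: the toy dictionary, the `Reading` type, the function names and the category labels are all assumptions made for illustration, and only the readings of je and mi discussed in this paper are listed (the lemma and case given for the pronoun reading of mi are likewise an assumption).

```python
import re
from collections import namedtuple

# One reading = one lemma plus one set of grammatical categories.
# The category labels are illustrative, not the dictionary's real codes.
Reading = namedtuple("Reading", ["lemma", "pos", "categories"])

# Hypothetical toy dictionary covering only two forms of the example sentence.
TOY_DICTIONARY = {
    "je": [
        Reading("jesam", "V", "present, 3rd person singular"),
        Reading("ona", "PRO", "enclitic, genitive"),
        Reading("ona", "PRO", "enclitic, accusative"),
    ],
    "mi": [
        Reading("ja", "PRO", "enclitic, dative"),  # assumed lemma and case
        Reading("miti", "V", "aorist"),
    ],
}


def tokenize(text):
    """Return the maximal runs of letters in the text, lowercased."""
    return [word.lower() for word in re.findall(r"[^\W\d_]+", text)]


def recognize(text, dictionary=TOY_DICTIONARY):
    """Pair every word form with all readings the dictionary lists for it."""
    return [(form, dictionary.get(form, [])) for form in tokenize(text)]


def unrecognized_rate(analysis):
    """Share of word occurrences for which the dictionary has no reading."""
    if not analysis:
        return 0.0
    missing = sum(1 for _form, readings in analysis if not readings)
    return missing / len(analysis)


analysis = recognize("Izgledalo mi je")
for form, readings in analysis:
    print(form, len(readings))
```

In this sketch the form izgledalo is reported as unrecognized only because the toy dictionary does not contain it; a full dictionary would list its readings as well.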
2 An Example
As an example of the mentioned ambiguity we can look at the segment²:
(1) Izgledalo mi      je     da        joj      više     ni do čega    nije       stalo
    Seemed.V  me.Pron is.Aux that.CONJ she.Pron more.ADV anything.Pron didn’t.Aux care.V
    ‘It seemed to me that she cared for nothing anymore’
The graph in Figure 1, produced by the Unitex system as a result of the analysis, shows the possible interpretations of this sentence. The number of possible interpretations of the sentence corresponds to the number of paths through the graph, with each path multiplied by the number of sets of grammatical categories attached to each of its nodes. For instance, the form je can be a form of the verb jesam or of the pronoun ona, and the node je,ona has two sets of grammatical categories attached to it (the enclitic form in the genitive and in the accusative case). This means that the form je can have three possible interpretations. However, apart from some exceptional cases, only one path leading from the start node to the end node of the graph represents the correct interpretation of the results of lexical recognition.

¹ http://infolingu.univ-mlv.fr/english/
² From Rade Kuzmanović: Partija karata, NOLIT, Beograd, 1982.
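The counting argument above can be made concrete with a short sketch. It assumes a simplified representation of the sentence graph that is not the Unitex data format: states are integers numbered from left to right, and every edge carries a lemma and the number of category sets attached to it. The function name and the counts for mi are hypothetical; only the counts for je follow the description above.

```python
from collections import defaultdict


def count_paths_and_interpretations(edges, start, end):
    """Count paths and interpretations in an acyclic sentence graph.

    edges: iterable of (source, target, lemma, n_category_sets), where
    source < target for every edge (states numbered left to right).
    Each path contributes the product of the category-set counts on its edges.
    """
    paths = defaultdict(int)
    interpretations = defaultdict(int)
    paths[start] = 1
    interpretations[start] = 1
    # Visiting edges by increasing source state guarantees that every state
    # is complete before its outgoing edges are used.
    for source, target, _lemma, n_sets in sorted(edges, key=lambda e: e[0]):
        paths[target] += paths[source]
        interpretations[target] += interpretations[source] * n_sets
    return paths[end], interpretations[end]


# Fragment "mi je": the counts for mi are assumed, those for je follow the text.
fragment = [
    (0, 1, "ja", 1),      # pronoun reading of mi (assumed: one category set)
    (0, 1, "miti", 1),    # verb reading of mi (assumed: one category set)
    (1, 2, "jesam", 1),   # verb reading of je
    (1, 2, "ona", 2),     # pronoun reading of je: genitive and accusative
]

print(count_paths_and_interpretations(fragment, 0, 2))  # (4, 6)
```

A compound such as ni do čega would appear in this representation as a single edge that skips over the intermediate states, in parallel with the word-by-word edges.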
3 The Method
The topic of our paper is the investigation of various methods for the automatic detection and elimination of those paths in the graph that are not possible (the 'false ambiguity'). We will call the set of all these procedures the disambiguation procedure. We will discuss the following disambiguation methods:
a. The enhancement of the part of the e-dictionary devoted to syntactic and semantic markers, for instance the addition of grammatical features such as natural gender and number for nouns, or of semantic features such as professions.
b. The enhancement of the dictionary with compound words, including their inflection (if applicable). We will show that dictionaries of compound adverbs and prepositions are particularly useful (Krstev, 2006).
c. The stratification of the dictionary. For instance, in the example in Figure 1, one potential interpretation of the form mi is the verb miti 'to wash' in the aorist, which is rarely used in contemporary Serbian. With the stratification of the dictionary, the verb interpretation of mi is used if and only if the pronoun interpretation leads to a contradiction.
d. Regular derivation, whose primary role is to process unknown words, such as
(2) tridesetrogodišnjakinja 'thirty-three-year-old woman'
in such a way as to determine their precise grammatical status (Krstev, 2007).
e. Local grammars that enable, for instance, the recognition and tagging of some classes of named entities (Vitas, 2007b).
f. ELAG grammars (Laporte, 2001) that enable the formalization of complex conditions for the elimination of false paths in the sentence graph. These grammars are illustrated by examples that formulate some agreement conditions for number and gender in Serbian, on the basis of the problems explored in (Popović, 2000).
Finally, we will discuss how the application of the designed procedure benefits subsequent syntactic analysis, and we will present the limits of the designed procedure.
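The stratification described in item c can be sketched as a lookup over ordered dictionary layers. This is only an illustration under stated assumptions: the two-layer dictionary, the `Reading` type, the function name and the `is_consistent` predicate (a stand-in for whatever context check detects a contradiction) are hypothetical and do not reproduce the actual dictionary organisation.

```python
from collections import namedtuple

Reading = namedtuple("Reading", ["lemma", "pos"])

# Hypothetical two-layer dictionary: layer 0 holds the common readings,
# layer 1 the rare ones (here the aorist of the verb miti).
STRATA = [
    {"mi": [Reading("ja", "PRO")]},   # assumed lemma for the pronoun reading
    {"mi": [Reading("miti", "V")]},
]


def stratified_readings(form, strata, is_consistent):
    """Return the readings of the first layer that survive the context check.

    A lower layer is consulted only when every reading of the layers above
    it has been rejected, i.e. has led to a contradiction.
    """
    for layer in strata:
        surviving = [r for r in layer.get(form, []) if is_consistent(r)]
        if surviving:
            return surviving
    return []


# Context in which nothing contradicts the pronoun reading.
print(stratified_readings("mi", STRATA, lambda reading: True))
# Context in which a pronoun is ruled out, so the verb layer is consulted.
print(stratified_readings("mi", STRATA, lambda reading: reading.pos != "PRO"))
```

Ordering the layers this way keeps the rare reading available without letting it add a path to every sentence graph in which the form mi occurs.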
4 Conclusion
The described procedures enable the recognition of a certain number of “false ambiguities”, which leads to the removal of a number of paths from the graph. We are trying to answer the question of the extent to which the results of the lexical analyzer can be processed with the presented methods (which operate at the level of local grammars) in order to make them suitable as input to a hypothetical syntactic parser.
References
Courtois, B. et al. (eds.) (1990). Dictionnaires électroniques du français, Langue française 87, Paris: Larousse.
Krstev, C., Vitas, D., Savary, A. (2006). Prerequisites for a Comprehensive Dictionary of Serbian, FinTAL 2006, LNCS 4139, Springer, pp. 552-564.
Krstev, C., Vitas, D. (2007). Treatment of Numerals in Text Processing, in Vetulani, Z. (ed.) Proceedings of the 3rd Language & Technology Conference, IMPRESJA Wydawnictwa Elektroniczne S.A., Poznań, pp. 418-422.
Laporte, E. (2001). Reduction of lexical ambiguity, Lingvisticae Investigationes XXIV:1, Amsterdam/Philadelphia: Benjamins, pp. 67-103.
Popović, Lj. (2000). Bivalentni kontrolori kongruencije: problem leksikografskog opisa kongruencije gramatičkog i semantičkog slaganja, Naučni sastanak slavista u Vukove dane 29/1, MSC, Beograd, pp. 65-80.
Silberztein, M. (1993). Dictionnaires électroniques et analyse automatique de textes (le système INTEX), Paris: Masson.
Vitas, D., Krstev, C., Koeva, S. (2007a). Towards a Complex Model for Morpho-Syntactic Annotation, in Paskaleva, E. and Slavcheva, M. (eds.) Proceedings of the Workshop on a Common Natural Language Processing Paradigm for Balkan Languages, RANLP, Borovets, Bulgaria, pp. 65-71.
Vitas, D., Krstev, C., Maurel, D. (2007b). A Note on the Semantic and Morphological Properties of Proper Names in the Prolex Project, Lingvisticae Investigationes, Special issue on Named Entities: Recognition, Classification and Use, Vol. 30(XXX), No. 1, pp. 115-134.
Figure 1 The sentence graph for Izgledalo mi je da joj više ni do čega nije stalo