Towards full lexical recognition

Gordana Pavlović-Lažetić1, Duško Vitas1, and Cvetana Krstev2

1 Faculty of Mathematics, {gordana,vitas}@matf.bg.ac.yu
2 Faculty of Philology, [email protected]
University of Belgrade

Abstract. Text processing in Serbian is based on the Intex format system of electronic dictionaries. Although lexical recognition is successful for 75% to 90% of word forms (depending on the type of text), some categories of words remain unrecognized. In this paper we present two aspects of e-dictionary enhancement that provide for additional recognition of two important categories of words: named entities and words generally not recorded in traditional dictionaries. We first describe the structure and content of dictionaries of proper names, both personal and geographic, developed to recognize the corresponding classes of named entities. Then we present a set of lexical transducers expressing morphological rules governing word formation, developed for the recognition of unknown words. The resources presented significantly improve the lexical recognition process.

1 Introduction

The basic resource for processing Serbian texts is a system of morphological electronic dictionaries of Serbian in INTEX format [5], which corresponds in size to a one-volume dictionary of approximately 80,000 entries. It consists of a dictionary of simple words, DELAS (approximately 70,000 lemmas at present); a dictionary of simple word forms, DELAF (approximately 1,000,000 word forms); a dictionary of compounds (multiword expressions), DELAC (in its initial phase); and morphological transducers that model certain classes of lemmas [6]. A DELAS entry includes a morphological code that uniquely describes its inflectional class, and it may also carry syntactic and semantic codes. For example, the entry for diviti ‘to admire’,

    diviti.V552+Imperf+It+Ref

describes the verb as imperfective (Imperf), intransitive (It) and reflexive (Ref), and the entry for crven ‘red’,

    crven.A17+Col

marks the adjective with the colour feature. Although such an exhaustive classification and the size of the e-dictionaries allow 75% (for newspaper texts) to 90% (for literary texts) of word forms to be tagged and lemmatized, a significant number of words remain unrecognized. Those words fall into different categories.
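The entry layout described above (a lemma, a morphological code, and optional codes joined by +) can be unpacked with a few lines of Python. This is only a minimal sketch: the names DelasEntry and parse_delas_entry are hypothetical, and the separator between lemma and code is assumed to be the dot shown in the two examples of this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelasEntry:
    lemma: str
    pos: str               # part-of-speech letters of the code, e.g. "V"
    morph_code: str        # full inflectional class code, e.g. "V552"
    features: tuple        # optional syntactic/semantic codes, e.g. ("Imperf",)


def parse_delas_entry(entry: str) -> DelasEntry:
    """Split an entry such as 'diviti.V552+Imperf+It+Ref' into its parts.

    Assumption: lemma and codes are separated by the first dot, as in the
    examples quoted in the text.
    """
    lemma, sep, codes = entry.strip().partition(".")
    if not sep or not lemma or not codes:
        raise ValueError(f"not a DELAS-style entry: {entry!r}")
    morph_code, *features = codes.split("+")
    # The part of speech is the code without its trailing class number.
    pos = morph_code.rstrip("0123456789")
    return DelasEntry(lemma, pos, morph_code, tuple(features))


diviti = parse_delas_entry("diviti.V552+Imperf+It+Ref")
crven = parse_delas_entry("crven.A17+Col")
```

Here diviti.features is ("Imperf", "It", "Ref") and crven.pos is "A", so a query such as "all adjectives with the feature Col" reduces to a simple filter over parsed entries.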

Named entities constitute the broadest category. Apart from their presence in all the subcorpora, the significance of named entities also follows from the fact that typical queries submitted to Web search engines draw on the same vocabulary. Named entities represent encyclopedic knowledge, however, and their morphology is not described in any traditional dictionary. Our approach to the recognition of named entities is based on building dictionaries of different classes of proper names, such as personal names, toponyms, hydronyms, oronyms, and their derivatives, with morphological tags, in DELAS format. These are supplied with semantic descriptors in the way described in [3].

Another broad category of words that remain unrecognized are the so-called “unknown” words: acceptable words produced by different derivational processes that are, in general, not recorded in traditional dictionaries. Many approaches to the recognition of unknown words in Slavic languages are possible [4], [1]. In this paper we present the use of lexical transducers to express the morphological rules governing certain kinds of derivational processes [5].

2 Recognition of named entities

Proper names are expressions falling into two subclasses: pure proper names (such as personal or geographic names, e.g., toponyms, hydronyms, oronyms) and descriptive proper names and acronyms (e.g., United Nations Organization, World Health Organization, BBC) [3]. In daily newspapers, pure proper names represent more than 10% of the overall text size, while in literary texts the percentage is smaller but still significant. Some of the problems in constructing proper name dictionaries for Serbian are the following.

a) Coding problems. In writing Serbian, two alphabets are used equally, Cyrillic and Latin. In order to neutralize the choice of alphabet, and since only the Serbian-specific letters are of interest, we decided to encode Serbian-specific letters by digraphs (Figure 1). The same encoding is used for e-dictionaries as well.

    Cyrillic   Latin     Neutral
    ч          č         cy
    ћ          ć         cx
    ж          ž         zx
    ш          š         sx
    ђ          đ (dj)    dx
    њ          nj        nx
    љ          lj        lx
    џ          dž        dy

    Чачак      Čačak     Cyacyak

Fig. 1. Serbian-specific letters in Cyrillic, Latin and neutral encoding, and one city name in three encodings
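The mapping of Figure 1 can be sketched in Python as follows. The function name to_neutral is hypothetical, and two simplifications are assumed: the Latin sequences nj, lj and dž are always treated as single letters (so a genuine n + j sequence would be mis-encoded), and only the first character of a digraph decides capitalization, so all-capitals input is not handled.

```python
# Serbian Cyrillic alphabet and, position by position, its neutral encoding.
CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
NEUTRAL = ("a b v g d dx e zx z i j k l lx m n nx "
           "o p r s t cx u f h c cy dy sx").split()

# Latin letters with diacritics, and Latin digraphs that stand for one letter.
LATIN_SINGLE = {"č": "cy", "ć": "cx", "ž": "zx", "š": "sx", "đ": "dx"}
LATIN_DIGRAPHS = {"nj": "nx", "lj": "lx", "dž": "dy"}

SINGLE = {**dict(zip(CYRILLIC, NEUTRAL)), **LATIN_SINGLE}


def _match_case(code: str, first_char: str) -> str:
    """Capitalize the neutral code if the source letter was upper case."""
    return code.capitalize() if first_char.isupper() else code


def to_neutral(text: str) -> str:
    """Encode Cyrillic or Latin Serbian text in the neutral digraph encoding."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in LATIN_DIGRAPHS:          # nj, lj, dž
            out.append(_match_case(LATIN_DIGRAPHS[pair.lower()], pair[0]))
            i += 2
            continue
        ch = text[i]
        code = SINGLE.get(ch.lower())
        # Characters outside the mapping (plain Latin letters, digits,
        # punctuation) are copied unchanged.
        out.append(ch if code is None else _match_case(code, ch))
        i += 1
    return "".join(out)
```

With this sketch both to_neutral("Чачак") and to_neutral("Čačak") give "Cyacyak", the form shown in Figure 1, so a single dictionary in neutral encoding can serve texts in either alphabet.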

b) Variation problems. Although both the Cyrillic and Latin alphabets are used equally in written Serbian, the leading principle in writing foreign names is transcription, typical of Cyrillic orthography, and not transliteration, typical of Latin alphabets. Nevertheless, Serbian orthography recognizes quite a few term variations. For example, the name of the software company is spelled both Mikrosoft and Majkrosoft (transcription), but also Microsoft (transliteration).

c) Inflective and derivational properties. Both the inflectional and the derivational morphology of Serbian are very rich. Apart from their inflectional properties, proper names are also a source of derived forms, such as possessive and relational adjectives and, in some cases, adverbs. For example, in an aligned French/Serbian corpus, one single form of the proper name Bouvard (in French) corresponds to sixteen different forms in Serbian. Moreover, in some cases it is not easy to establish the inflectional properties of a proper name (the locative of the city name Merdare can be Merdaru or Merdarima). In other cases it is difficult to establish the derived form of a named entity (e.g., the name of an inhabitant of Merdare).

d) Compounds. There are many compound proper names in which each (or some) of the components are inflected. For example:

    Novi Sad, Kosovo i Metohija: both components are inflected
    Tel Aviv, Soko Grad: the first component is not inflected

The corresponding nouns and relational adjectives are, however, in some cases simple words derived from all components, for instance Novosadxanin ‘the citizen of Novi Sad’ and novosadski ‘related to Novi Sad’. In other cases simple words are derived from just one component, for instance Pribojac ‘the citizen of Priboj na Limu’.

The solution to these problems relies on building electronic dictionaries of proper names, both simple and compound, and structuring them so as to include morphosyntactic and semantic codes.

The dictionary of personal names was compiled from a list of the names of 1.7 million Belgrade inhabitants as established in 1993 [6]. It consists of two parts: DELA-FName for first names and DELA-LName for surnames. Surnames in Serbian are rather uniform. Most of them (87%) end in -icx and belong to the same inflectional class (N1), with the same derivational properties.
Neither dictionary contains possessive adjectives; they are processed by the alternative methods explained in Section 3. Entries in both dictionaries have the semantic marker PROP, entries from DELA-LName have the marker Last, and entries from DELA-FName have the marker First. These markers are important for advanced analysis, since in Serbian surnames followed by first names are never declined, while female surnames are not declined even when they follow the first name. Since a number of masculine first names change their grammatical gender in the paucal and plural, the marker MG is added to the appropriate entries. Some entries from DELAF-LName and DELAF-FName are:

    Pavlovicxem,Pavlovicx.N1+PROP+Last:m6sv
    Pavlovicxa,Pavlovicx.N1+PROP+Last:m2sv:m4sv
    Nebojsxom,Nebojsxa.N679+PROP+First+MG:m6sv

The dictionary of geographic names DELA-Top contains around 20,000 toponyms, hydronyms and oronyms corresponding to both domestic (Serbia and Montenegro) and foreign geography, and covering high-school atlas concepts. The following geographic entities have been chosen: names of countries, official languages, capital cities, administrative divisions of common importance (e.g., US states), cities with more than 10,000 inhabitants, hydronyms such as lakes, swamps and rivers, and oronyms such as mountains or volcanoes. For the proper names collected in this way, the dictionary includes, besides their nominal forms, the names of inhabitants and the corresponding relational and possessive adjectives. For example, for the city name Pariz ‘Paris’, an excerpt from DELA-Top includes the following entries:

    Pariz,N003+Top+PGgr+IsoFR               (nominal form)
    pariski,A2+PosQ+Top+PGgr+IsoFR          (relational adjective)
    Parizxanin,N003+Hum+Top+PGgr+IsoFR      (male inhabitant)
    Parizxaninov,A1+Pos+Top+PGgr+IsoFR      (the corresponding poss. adj.)
    Parizxanka,N661+Hum+Top+PGgr+IsoFRf     (female inhabitant)
    Parizxankin,A1+Pos+Top+PGgr+IsoFR       (the corresponding poss. adj.)
    parizxanski,A2+Rel+Top+PGgr+IsoFR       (the way it is done in Paris)

In writing geographical proper names, local official names of toponyms are used for domestic geography, and exonyms, basically traditional names, are predominantly used for foreign geography. Besides traditional local names that differ considerably from the originals, such as Becy (Wien), Rim (Roma), Solun (Thessaloniki) and Prag (Praha), transcription is also used for writing foreign names, with different transcription rules for proper names originating from different languages (e.g., Cyikago for Chicago, Peking for Beijing). In the current version of DELA-Top we use two sets of semantic tags: general tags with obvious meanings, such as Der, Top, Hyd, Oro, Hum, Lng, IsoCode, and specific ones such as PAut (for autonomous region), PCen (for regional center), PDgr (for parts of a city), PDrz (for country), etc.
These codes can be used not only for text search but also to express constraints in local grammars and lexical transducers. The derived forms (male and female names of inhabitants, the corresponding possessive adjectives, and the relational adjectives derived from toponyms) are characterized by the following facts.

a) Inhabitants. Names of female inhabitants fall into three groups:
1. Names ending in -ka, 94%; they all belong to the same morphological class N661. Examples are: Beogradxanka, Parizxanka.
2. Names ending in -ica, 2%; they all belong to the morphological class N651. Examples are: Nemica, Sremica.
3. Names ending in -nxa, 4%; they all belong to the morphological class N601. Examples are: Grkinxa, Polxakinxa, Francuskinxa.

b) Adjectives. Derived adjectives fall into two groups:
1. Relational adjectives, corresponding to toponyms, hydronyms and oronyms, constitute 60% of all derived adjectives. They end in -ski, -sxki, -cyki, e.g., beogradski, prasxki, becyki, and they all belong to the class A2. They are written with a lower-case first letter.
2. Possessive adjectives, corresponding to inhabitants, make up 40% of all derived adjectives. They end in -in (for the feminine gender and for masculine nouns with feminine inflection), e.g., Beogradxankin (f.), Becylijin (m.), or in -ov, -ev (for the masculine gender), e.g., Beogradxaninov (m.), Prisxtincyev (m.), and they all belong to the class A1.

The use of the DELA-LName, DELA-FName and DELA-Top dictionaries significantly improves the recognition process. It should be noted that they also add to some extent to the ambiguity of the text. Some entries are ambiguous within the newly added dictionaries, while others are ambiguous with entries in the basic e-dictionaries, for instance:

    Sofija,.N601+Hum+PROP+First      (female name)
    Sofija,.N600+Top+PGgr+IsoBG      (capital of Bulgaria)
    Vlada,.N679+Hum+PROP+First+MG    (masculine first name)
    vlada,.N600                      (government)
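The ambiguity introduced by the new dictionaries can be made concrete with a small lookup over entries in the word-form layout used above (form, lemma, codes joined by +, inflectional codes after a colon). The following Python sketch is illustrative only: the names Analysis, parse_delaf_line and analyses are hypothetical, an empty lemma is assumed to mean "same as the form", and the lower-case fallback for capitalized words (as at the start of a sentence) is an assumption of this sketch rather than a documented step of the system.

```python
from collections import defaultdict
from typing import NamedTuple


class Analysis(NamedTuple):
    form: str
    lemma: str
    code: str            # morphological class, e.g. "N601"
    features: tuple      # e.g. ("Hum", "PROP", "First")
    inflections: tuple   # e.g. ("m2sv", "m4sv"); empty if none are listed


def parse_delaf_line(line: str) -> Analysis:
    """Parse 'form,lemma.CODE+Feat+...:infl:infl' into an Analysis."""
    form, _, rest = line.strip().partition(",")
    lemma, _, tags = rest.partition(".")
    head, *inflections = tags.split(":")
    code, *features = head.split("+")
    # Assumption: an empty lemma field means the lemma equals the form.
    return Analysis(form, lemma or form, code, tuple(features), tuple(inflections))


def build_index(lines):
    index = defaultdict(list)
    for line in lines:
        entry = parse_delaf_line(line)
        index[entry.form].append(entry)
    return index


def analyses(word: str, index) -> list:
    """All analyses of a word, also trying its lower-case variant."""
    found = list(index.get(word, []))
    lowered = word.lower()
    if lowered != word:
        found.extend(index.get(lowered, []))
    return found


INDEX = build_index([
    "Sofija,.N601+Hum+PROP+First",
    "Sofija,.N600+Top+PGgr+IsoBG",
    "Vlada,.N679+Hum+PROP+First+MG",
    "vlada,.N600",
    "Pavlovicxa,Pavlovicx.N1+PROP+Last:m2sv:m4sv",
])
```

In this sketch analyses("Sofija", INDEX) returns two readings (first name and toponym), and analyses("Vlada", INDEX) returns both the first name and the common noun, which is the kind of ambiguity a later disambiguation step has to resolve.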

In order to use these new dictionaries adequately, it is important to establish the synonymy of proper names (both personal names and toponyms), so that all the proper names referring to exactly the same entity are grouped together. Synonymy arises from the use of official and unofficial (colloquial) names and of current and former names for toponyms, and also from the use of full personal names, first names or surnames alone, and nicknames. Examples of exact synonymy are:

    Jugoslavija vs YU vs Srbija i Crna Gora vs SCG vs CS (toponyms)
    Vuk vs Vuk Stefanovicx Karadyicx vs Vuk Karadyicx (personal name/nickname)

These synonymous entries can be grouped together by using appropriate finite state transducers, or by using a structure similar to Wordnet.
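As a rough illustration of the grouping idea, the synonym sets listed above can be held in a synset-like index. The sketch below uses a plain Python dictionary as a stand-in; it is neither a finite state transducer nor a Wordnet, and the group identifiers and function names are hypothetical.

```python
# Each group collects the names that refer to exactly the same entity.
# The identifiers on the left are invented for this illustration.
SYNONYM_GROUPS = {
    "toponym:SCG": ["Jugoslavija", "YU", "Srbija i Crna Gora", "SCG", "CS"],
    "person:Vuk": ["Vuk", "Vuk Stefanovicx Karadyicx", "Vuk Karadyicx"],
}


def build_synonym_index(groups):
    """Map every name variant to the set of groups it belongs to."""
    index = {}
    for group_id, names in groups.items():
        for name in names:
            index.setdefault(name, set()).add(group_id)
    return index


def same_entity(a: str, b: str, index) -> bool:
    """True if the two names share at least one synonym group."""
    return bool(index.get(a, set()) & index.get(b, set()))


SYNONYMS = build_synonym_index(SYNONYM_GROUPS)
```

A name maps to a set of groups rather than to a single group because a short form such as a bare first name may belong to several entities once more groups are added.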

3 The Recognition of Unknown Words by Lexical Transducers

After applying all the e-dictionaries, including the e-dictionaries of named entities, to the analyzed text in order to associate lemmas and grammatical categories with word forms, a number of word forms still remain unrecognized. Some of them are acceptable words that, as a rule, are recorded neither in dictionaries nor in encyclopedias. The simple solution would be to incorporate these acceptable words into the dictionary, but this inevitably leads to failure. For instance, in Serbian some adjectives are produced by prefixation with numbers, such as dvonedeljni ‘two-week’, šesnaestogodišnji ‘sixteen-year’, dvoiposobni ‘two-and-a-half-room’. If we consider the adjective petospratni ‘five-floor’, where the number pet ‘five’ can be replaced by any number between one and a hundred, then all hundred adjectives could be included in the e-dictionary. This, and similar cases, would lead to an enormous expansion of the dictionary, yet the problem of unknown words would not be solved, as the adjective dvestospratni ‘two-hundred-floor’ can also be valid.

The lexical transducers incorporated in Intex allow the expression of the morphological rules that govern word formation [5]. The input of lexical transducers is used to recognize word forms while the output is used to compute the corresponding lemma and other grammatical information. They can be quite complex and can perform the tokenization of word forms into linguistic units. These linguistic units are established on the basis of imposed constraints which are expressed in terms of recognition by e-dictionaries. Furthermore, during the recognition process the values of the recognized linguistic units can be stored into variables, which can later be used for the computation of lemmas and grammatical categories.

Fig. 2. The lexical transducer that recognizes the prefixed adjectives and adverbs

Figure 2 represents the lexical transducer that recognizes prefixed adjectives and adverbs. The tokens recognized by it are enclosed in parentheses, and they become the values of the variables associated with the corresponding opening parenthesis. Two tokens are recognized by the upper branch of this transducer. Recognition of the first token is very simple: a fixed set of word forms is recognized by invoking the subgraph brojevi ‘numbers’. Recognition of the second token is more complex. It is a sequence of letters () on which a constraint is imposed. The constraint is enclosed in angle brackets, and it states that the recognized sequence of letters, that is, the value of the variable $br, has to be an adjective (A) in the positive degree (:a). This constraint is checked against the applied e-dictionaries. Constraints can also use the syntactic and semantic features of the dictionary entries, as illustrated in the lower branch of the same transducer: for an adjective prefixed by one of the prefixes from the chosen set (the subgraph prefiks) to be recognized, the simple adjective must not be an ordinal number (the constraint -Ord).
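The behaviour of the two branches can be imitated outside Intex with a short Python sketch. It is not the Intex graph itself, only a plain re-statement of its logic under several assumptions: the prefix lists stand in for the subgraphs brojevi and prefiks and contain just a few sample items, the dictionary holds one entry taken from this paper and one invented entry (prvi, with invented codes) to exercise the -Ord constraint, and the positive degree is assumed to be signalled by an inflectional code starting with "a". Each recognized form is returned as a line in the word-form dictionary layout, with the prefix prepended to the lemma of the simple adjective.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    lemma: str
    code: str             # morphological class, e.g. "A3"
    features: frozenset   # semantic/syntactic codes, e.g. {"Ord"}
    inflections: tuple    # inflectional codes, e.g. ("aefs3g", "aefs7g")


# Tiny stand-in for the applied e-dictionaries (neutral encoding).
DELAF = {
    "godisxnxoj": [Entry("godisxnxi", "A3", frozenset(), ("aefs3g", "aefs7g"))],
    # Hypothetical entry, invented only to exercise the -Ord constraint.
    "prvi": [Entry("prvi", "A2", frozenset({"Ord"}), ("aems1g",))],
}

NUMERAL_PREFIXES = ("dvo", "peto", "sxesnaesto")   # sample of "brojevi"
OTHER_PREFIXES = ("polu", "anti")                  # sample of "prefiks"


def recognize(word, prefixes, forbid=frozenset()):
    """Return word-form dictionary lines for prefix + simple adjective."""
    lines = []
    for prefix in prefixes:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]                  # candidate second token
        for entry in DELAF.get(rest, []):
            positive = tuple(c for c in entry.inflections if c.startswith("a"))
            # Constraint: an adjective (A) in the positive degree (:a).
            if not entry.code.startswith("A") or not positive:
                continue
            # Constraint of the lower branch: excluded features such as Ord.
            if entry.features & forbid:
                continue
            feats = "".join("+" + f for f in sorted(entry.features))
            lines.append(f"{word},{prefix}{entry.lemma}.{entry.code}{feats}:"
                         + ":".join(positive))
    return lines
```

For example, recognize("sxesnaestogodisxnxoj", NUMERAL_PREFIXES) yields the single line sxesnaestogodisxnxoj,sxesnaestogodisxnxi.A3:aefs3g:aefs7g, while recognize("poluprvi", OTHER_PREFIXES, forbid={"Ord"}) yields nothing because the simple adjective is marked as an ordinal.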

The lexical transducer can also produce output. In this case the output is in the format that is usual for the dictionaries of word forms used by Intex, for example:

    godisxnxoj,godisxnxi.A3:aefs3g:aefs7g

If the adjective godišnji ‘year’ is prefixed by the number šesnaest ‘sixteen’, then the transducer recognizes the resulting word form, e.g., šesnaestogodišnjoj. We would like to attach to the recognized word form the appropriate lemma šesnaestogodišnji and other linguistic information, and to incorporate that information into the vocabulary of the text in the same manner as is done for the word forms recognized by the e-dictionaries alone. This information should be:

    sxesnaestogodisxnxoj,sxesnaestogodisxnxi.A3:aefs3g:aefs7g

This output is produced using the introduced variables ($prfb, $br, ...) and Intex special variables: for instance, the variable $2L denotes the lemma corresponding to the word form recognized by the second constraint.

Lexical transducers have also been produced to recognize negated adjectives, adverbs and nouns, possessive adjectives, and diminutive and augmentative nouns. These derived forms are recorded in traditional dictionaries, and transferred to e-dictionaries, only in some particular cases and not in general. It should be noted that the constraint is checked against all the applied dictionaries, so forms derived, for instance, from proper names are recognized as well (e.g., the possessive adjective Pavlovicxev derived from Pavlovicx, which belongs to DELA-LName). When the lemma of the word form is computed, semantic and grammatical information can be inherited from the root form, as is the case for the adjectives recognized by the transducer from Figure 2, or additional information can be added, for instance the derivational code +Pos for possessive adjectives. Such codes can be crucial for advanced analysis. For instance, all the feminine forms of a possessive adjective derived from a surname can actually denote an unmarried daughter of the family, e.g., To je peta Pavlovićeva knjiga ‘This is Pavlović’s fifth book’ vs. Pavlovićeva ima tek 13 godina ‘Pavlovićeva is but 13 years old’. Thus, a word form with these categories can act as a noun in a phrase.

There is a possibility of erroneously recognizing a word form in a text and associating a wrong lemma with it. Experiments show that in certain cases a word form is recognized as a derivative of one lemma from the e-dictionary when it is actually a form of some other lemma from the same or another dictionary. In a number of cases this adds to the ambiguity of text analysis. For instance, the form dvorane is recognized in a text by the lexical transducer from Figure 2 as a form of the adjective derived from the adjective ran ‘early’ prefixed by the number dva ‘two’, and also as a form of the noun dvorana ‘hall’, while only the second interpretation is correct:

    dvorane,ran.A17:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g
    dvorane,dvorana.N600:fs2q:fp1q:fp4q:fp5q

This problem is solved by using the priority of lexical resources. Lexical transducers of this kind should be applied with the lowest priority, that is, only to words that have not already been recognized by lexical resources of higher priority, namely the e-dictionaries. Even when the lexical transducers are applied in this way, some erroneous recognitions may still occur when the right analysis is missing due to the incompleteness of the e-dictionaries themselves. For instance, the form debarski is recognized in a text as an adjective derived from the relational adjective barski ‘like marsh’ with the prefix de-, while it is actually the relational adjective of the toponym Debar (a small town in FYR Macedonia). This error occurred only because the name of this town had not yet been included in the dictionary of toponyms.
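The priority scheme can be summarized in a few lines of Python. In this sketch the two resources are replaced by hard-coded tables: the dvorane lines are the ones quoted in this section, whereas the debarski line is a hypothetical stand-in for whatever the transducer would output, and all function names are invented for the illustration.

```python
# Higher-priority resource: the e-dictionaries (here a one-entry sample).
EDICTIONARY = {
    "dvorane": ["dvorane,dvorana.N600:fs2q:fp1q:fp4q:fp5q"],
}

# Lower-priority resource: precomputed stand-in for the transducer output.
TRANSDUCER_OUTPUT = {
    "dvorane": ["dvorane,ran.A17:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g"],
    "debarski": ["debarski,debarski.A2"],   # hypothetical output line
}


def lookup_edictionary(word):
    return EDICTIONARY.get(word, [])


def run_transducer(word):
    return TRANSDUCER_OUTPUT.get(word, [])


def analyze(word, resources):
    """Apply resources in priority order; the first one that succeeds wins."""
    for resource in resources:
        found = resource(word)
        if found:
            return found
    return []


PRIORITY = (lookup_edictionary, run_transducer)
```

With this ordering analyze("dvorane", PRIORITY) returns only the noun reading, while analyze("debarski", PRIORITY) still falls through to the transducer, because the sample e-dictionary, like the real one at the time of writing, has no entry derived from Debar.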

4 Conclusion

Although the resources described in this paper are not yet finished, the results obtained after their application are very promising. After applying the basic e-dictionaries to one newspaper text of 320KW, 29.5% of simple words remained unrecognized. Applying DELAS-Top to this text reduced the number of unrecognized words by 3%, while the application of lexical transducers decreased it by a further 19%. In a literary text of 130KW, 12.1% of words were unrecognized. The contribution of the dictionary DELAS-Top to word recognition was not significant, less than 0.5% of all unrecognized words, but the contribution of lexical resources was even higher, more than 22%. Some words still remain unrecognized due to the incompleteness of the dictionaries, both the basic and the named entity dictionaries.

These dictionaries will be further enhanced, and dictionaries of descriptive proper names, celebrities, and acronyms will be developed. Moreover, the set of semantic codes will be refined. The lexical transducers already developed will be refined, and new ones will be added to cover other derivational processes, such as relational adjectives and verbal nouns. In particular, transducers will be developed that recognize forms derived from compounds.

Although the contents of the dictionaries developed depend on Serbian, their structure and the method itself are not language dependent and can be applied to other languages, especially Slavic ones. The two aspects of enhancing the system of e-dictionaries significantly refine the lexical recognition process, strengthening the power of text processing tools and supporting a number of applications such as information retrieval, text alignment, machine translation, and information extraction.

References

1. Erjavec, T., Džeroski, S. (2004) Machine Learning of Morphosyntactic Structure: Lemmatising Unknown Slovene Words. Appl. Artificial Intelligence 18(1), 17–40
2. Krstev, C., Pavlović-Lažetić, G., Obradović, I., Vitas, D. (2004) Using Textual and Lexical Resources in Developing Serbian Wordnet, Romanian Journal for Information Science & Technology, [in print]
3. Grass, T., Maurel, D., Piton, O., Eggert, E. (2002) Description of a Multilingual Database of Proper Names, Advances in Natural Language Processing, LNAI 2389, pp. 137–140
4. Pala, K., Sedlacek, R., Veber, M. (2003) Relations between Inflectional and Derivation Patterns, Proc. of Workshop ”Morphological Processing of Slavic languages”, EACL’03, Budapest, pp. 1–8
5. Silberztein, M. D. (1993) Dictionnaires électroniques et analyse automatique de textes: Le système INTEX, Paris: Masson
6. Vitas, D., et al. (2003) An Overview of Resources and Basic Tools for Processing of Serbian Written Texts, Proc. of the Workshop on Balkan Language Resources and Tools, 1st Balkan Conference in Informatics, http://iit.demokritos.gr/skel/bci03 workshop/pages/programme.html