
TAG DISAMBIGUATION IN ITALIAN

Rodolfo Delmonte°, Emanuele Pianta*
* I.R.S.T. - Povo (Trento)
° Ca' Garzoni-Moro, San Marco 3417, Università "Ca' Foscari", 30124 - VENEZIA
Tel. 39-41-2578464/52/19
E-mail: [email protected]
Website: http://byron.cgm.unive.it

ABSTRACT

In this paper we argue in favour of syntactically based tagging by presenting data from a study of a 1,000,000-word corpus of Italian. Most papers on tagging present statistically based approaches. None of the statistically based analyses, however, produces an accuracy level comparable to the one obtained by means of linguistic rules [1]. Their data, of course, refer strictly to English, with the exception of [2, 3, 4]. As to Italian, we argue that purely statistical approaches are inefficient, basically owing to the great sparsity of tag distribution: only 50% of tags are unambiguous. In addition, the level of homography is also very high: 1.7 readings per word, compared to the 1.07 computed for English by [2] with a similar tagset. In a preliminary experiment we obtained 99.97% accuracy on the training set and 99.03% on the test set using syntactic disambiguation; accuracy derived from statistical tagging is well below 95% even on the training set.

1. Introduction

The availability of Brill's tagger [5], work being carried out at Saarbruecken, and the research done with the well-established Xerox tagger have contributed tagging results in languages other than English. In this paper we contribute data and experimental results on Italian, a "morphologically rich" language which seems to behave in a similar fashion to Swedish and French, and differently from English. We assume that tagging cannot and should not be regarded as a self-contained, self-sufficient processing module: we regard it as just the first important module in a wider and deeper text-processing system. We also assume it must be in a strict feeding relationship with a shallow syntactic parser/chunker, which is then used either for text understanding and generation/summarization or for other more complex tasks.
Since tagging cannot be regarded as an end in itself, restrictions on its output should be targeted to the goals tagging is intended for, i.e. it should obey the following criteria:

♦ Accuracy and Efficiency: it should be over 99% correct, or the error rate should be below 1%; errors are here intended as out-of-vocabulary tokens which are unknown to the Guesser and cannot be tagged as either proper names or foreign words;

♦ Robustness and Reusability: it should be generative, in order to cope with different domains/genres/corpora: in our case this means that the tagger is actually a morphological analyser with linguistic rules, a root dictionary, a list of affixes of the language, and constraints on the generation process;

♦ Linguistic Granularity: it should produce lemmata, a trivial task for morphologically poor languages like English or Chinese, but not so trivial for all remaining languages; lemmata are essential in all tasks of semantic/conceptual information retrieval;

♦ Linguistic Efficiency: in order not to require reprocessing, it should allow subcategorization information to be encoded in verbal tags, to serve further processing modules. It should incorporate a minimum of efficient and necessary semantic information in the tags that require it, in order to produce sensible tagging disambiguation: e.g. temporal nouns, common nouns, human-being nouns, proper nouns, etc.

In addition, disambiguation should be syntactically targeted and pragmatically constrained on the basis of genre/corpus type: in Italian, as in French, the word "la" is three ways ambiguous: it is a clitic pronoun, a definite article, and a common noun (the musical note A), but this latter tag is rare or specific to a certain domain, and occurs only with initial uppercase "L". Most of the published papers on the subject deal with English, owing to the availability of tagged corpora for training and testing.
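The genre- and orthography-based pruning just described can be sketched in a few lines. This is only an illustration: the tag names, the lexicon entry, and the domain check are our own assumptions, not the paper's actual tagset or implementation.

```python
# Illustrative sketch (not the authors' actual tagset): pruning the
# ambiguity class of Italian "la" with domain/orthography constraints.

AMBIGUITY_CLASSES = {
    # "la": clitic pronoun, definite article, or common noun
    # (the musical note A)
    "la": {"clitic", "article", "noun"},
}

def candidate_tags(token, domain="general"):
    """Return candidate tags for a token, discarding the rare noun
    reading of "la" unless the domain is musical and the token is
    written with initial uppercase "L"."""
    tags = set(AMBIGUITY_CLASSES.get(token.lower(), {"unknown"}))
    if token.lower() == "la" and (domain != "music" or not token[0].isupper()):
        tags.discard("noun")
    return tags
```

With this sketch, candidate_tags("la") leaves only the clitic and article readings, while candidate_tags("La", domain="music") keeps all three.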
However, English is not a good representative of European languages, in that it should be regarded as a "morphologically poor" language, whereas the remaining languages are "rich" or even "extremely rich" in morphology. This entails a number of processing differences: as can easily be surmised, the first difference is in the number of different wordforms each lemma can produce; a second important difference lies in the level of homography of each wordform. We argue in favour of a syntactically based disambiguation phase after the redundant morphological analysis for two main reasons. First, HMM-based tagging, usually adopted as the best statistical framework, has two important limitations, i.e. sparsity of data and lack of wider-context information, being conventionally based on trigrams. Second, we rely on the simple intuition that unambiguous tag sequences are strictly syntactically governed, in the sense that they must obey the grammar of the language. This is confirmed by [1], where they say: "We

believe that the best way to boost the accuracy of a tagger is to employ even more linguistic knowledge. The knowledge should in addition contain more syntactic information so that we could refer to real (syntactic) objects of the language, not just a sequence of words or parts of speech." Being language-dependent, the tagger needs to be based on an accurate analysis of corpora with as broad as possible a coverage of genre, style, and other social and communicative variables. To answer these needs we built our shallow syntactic parser on the basis of 60,000 words of manually annotated text chosen from different corpora and satisfying the above-mentioned criteria. The annotation was carried out twelve years ago for a text-to-speech system for Italian (DecTalk Italian version) with unlimited vocabulary [9, 10]. Italian has a number of peculiar linguistic features that make it more difficult to disambiguate than other languages: it may be defined "structurally underdetermined", in the sense that it allows a lot of freedom in the position of syntactic constituents. Sentences may be subjectless and start with a VP, which may contain the subject NP or its object complement. Postverbal position may be occupied by adjuncts, which can be freely interspersed between the main verb and the direct object or other nuclear complements. At constituent level, adjectives may be placed before or after the head noun they modify, with only a few exceptions. Compared with other languages, the level of homography is very high, and in addition the number of wordforms per lemma is significantly higher than in English. Written texts tend to have very long sentences where complements may be very far apart from their governing head, with a series of nested adjuncts in between.
The paper is organized as follows: Section 1.1 gives general information on the 1,000,000-word corpus of Italian we used to train our tag disambiguator; Section 2 comments on the use of probability transition tables based on unambiguous bigrams and gives further data on bigram and n-gram distribution in our corpus; Section 3 describes our Syntactic Disambiguator (SD); and Section 4 gives some accuracy measurements from a preliminary experiment on Italian.

1.1 Morphologically Rich Languages are Different

For morphologically rich languages like Italian, processes like tagging and syntactic analysis must be soundly based on linguistically generated morphological analysis. We experimented with this approach in the analysis of a corpus of approximately 1 million words: on a first run the tagger failed for approximately 5% of the total, i.e. at least one word in 20 constituted what can be labelled an unknown (out-of-vocabulary) word. In POS taggers which rely on context to induce the appropriate part of speech, guesses are based on the surrounding words. The problem, however, is to find misspelled words and tell them apart from foreign words and other classes of words. We analysed the 50,000 unknown words in 5 classes: Misspelled Words = 4,500; New Vocabulary Entries = 6,000; Foreign Words = 3,000; Proper Nouns = 15,000; Abbreviations & Acronyms = 10,000. As to types, the total number of types from the three subcorpora is 58,334, which were then merged to 36,578. From the total rank file we extracted the first 65 types, whose total frequency of occurrence amounts to 332,238 tokens. An extended hapax legomena count - types with frequency less than 4 - covers 22,421 types and approximately 33,000 tokens. The total number of lemmata is 24,666.
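The type/token and extended-hapax counts reported above can be reproduced with a short routine. This is a minimal sketch over a toy token list, assuming only what the text states: the extended hapax threshold is frequency below 4.

```python
from collections import Counter

def corpus_stats(tokens):
    """Type/token counts plus the 'extended hapax legomena' count:
    types occurring fewer than 4 times, and the tokens they cover."""
    freq = Counter(tokens)
    hapax = [t for t, c in freq.items() if c < 4]
    return {
        "tokens": len(tokens),
        "types": len(freq),
        "hapax_types": len(hapax),
        "hapax_tokens": sum(freq[t] for t in hapax),
    }

# Toy example, not the real corpus: 10 tokens, 5 types, of which
# b, c, d, e fall below the frequency-4 threshold.
stats = corpus_stats("a a a a b b c d d e".split())
```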
Non-rich lemmata constitute a large percentage of the total: lemmata with only one associated wordform amount to 17,464, i.e. 70% of the overall number of lemmata. We end up with 1,007 lemmata with more than four associated types, the great majority of which are verbs [8].

1.2 Lemmata and Wordforms

We shall now consider the lemmata/wordform ratio as an indicator of the morphological richness of the language: if in our corpora more than half of all wordforms or types uniquely identify their lemma and vice versa, we might conclude that even though Italian has a potentially rich morphology, it uses it in a poor manner. From the computation of lemmata we obtained the following data. The total number of lemmata is 24,666. Non-rich lemmata constitute a large percentage of the total: lemmata with only one associated wordform amount to 17,464, i.e. 70% of the overall number of lemmata. Here is the count for lemmata with two, three, or four associated wordforms:

Wordforms = 2   Number of lemmata = 4,354
Wordforms = 3   Number of lemmata = 962
Wordforms = 4   Number of lemmata = 879

Finally, we end up with 1,007 lemmata with more than four associated types, the great majority of which are verbs. However, when we look down the rank list starting from lemmata with 8 associated types, the number of past participles/adjectives increases until they become the majority. The rank list has the two auxiliary/copulative verbs have (avere) and be (essere) at the top, with 50 and 48 associated wordforms respectively. We may note that "avere" has 13 cliticized forms and that "essere" has 10 such forms. l(avere, 50, [abbia, abbiamo, abbiano, abbiate, avendo, avendola, avendole, avendolo, avendone, avente, aventi, aver, averci, avere, avergli, averla, averle, averlo, avermi, averne, aversi, averti, avesse, avessero, avessi, avessimo, aveste,

avete, aveva, avevamo, avevano, avevo, avrà, avranno, avrebbe, avrebbero, avrei, avremmo, avremo, avresti, avrete, avuta, avute, avuti, avuto, ebbe, ebbero, ha, hai, hanno, ho]). / have l(essere, 48, [è, era, erano, eravamo, eravate, eri, ero, essendo, essendoci, essendosi, essendovi, esser, essercene, esserci, essere, esserlo, esserne, essersi, esservi, fosse, fossero, fossi, fossimo, fu, fui, fummo, furono, sarà, saranno, sarebbe, sarebbero, sarei, saremmo, saremo, sarete, sarò, sei, sia, siamo, siano, siate, siete, sono, stata, state, stati, stativi, stato]). / be
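The lemma-richness counts above (how many lemmata have one, two, ... associated wordforms) can be computed directly from (wordform, lemma) pairs. The pairs below are a toy illustration, not corpus data.

```python
from collections import Counter, defaultdict

def lemma_richness(pairs):
    """Group (wordform, lemma) pairs by lemma and count how many
    lemmata have each number of distinct associated wordforms."""
    forms = defaultdict(set)
    for wordform, lemma in pairs:
        forms[lemma].add(wordform)
    return Counter(len(ws) for ws in forms.values())

# Toy data: three forms of "avere", one invariant noun.
richness = lemma_richness([
    ("ha", "avere"), ("hanno", "avere"), ("avere", "avere"),
    ("casa", "casa"),
])
# richness[1] counts the 'non-rich' lemmata with a single wordform
```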

3. Statistical vs Syntactic Tagging

In the rest of the paper we present our syntactic disambiguator (henceforth SD), the final module of our syntactic tagger of Italian. Input to the SD is the complete, redundant output of the morphological analyser and lemmatizer IMMORTALE (see Delmonte & Pianta [8]). IMMORTALE finds all possible legal tags for the word/token under analysis on the basis of morphological generation from a root dictionary of Italian of 60,000 entries and a dictionary of invariant words - function words, polywords, names and surnames, abbreviations, etc. - of over 12,000 entries. As noted by Brill [5], the application of stochastic techniques to automatic part-of-speech tagging is particularly appealing given the ease with which the necessary statistics can be automatically acquired and the fact that very little handcrafted knowledge need be built into the system (ibid., 152). However, both probabilistic models and Brill's algorithm need a large tagged corpus from which to derive most-likely-tag information. It is a well-known fact that in the absence of sufficient training data, sparsity in the probabilistic matrix will leave many bigrams or trigrams insufficiently characterized and prone to generating wrong hypotheses. This in turn introduces errors into the tagging prediction procedure. The training corpus must therefore be very large in order to cover all possible tag combinations adequately. Italian is a language for which no such large corpus has yet been made available to the scientific community. Lacking such an important basic resource, there are two possibilities:
• manually building it yourself;
• using some incrementally automatic learning procedure, applied recursively, in order to arrive at a 1-million-word tagged corpus (the same size as the frequently quoted Brown Corpus for English).
We have been working on such a corpus of Italian with the aim of reaching the final goal above without having to build it manually.
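One way to bootstrap transition statistics without a hand-tagged corpus, in the spirit of the unambiguous-bigram probability tables mentioned in the introduction, is to count only adjacent token pairs whose ambiguity classes each contain a single tag. This sketch assumes tokens arrive from a morphological analyser as (wordform, tag-set) pairs; the tag names and data structure are illustrative, not the paper's.

```python
from collections import Counter

def unambiguous_bigram_table(analysed):
    """Estimate P(t2 | t1) from adjacent tokens that are BOTH
    unambiguous, so no hand-tagged corpus is needed.
    `analysed` is a list of (wordform, set_of_candidate_tags)."""
    pair_counts, left_counts = Counter(), Counter()
    for (_, t1), (_, t2) in zip(analysed, analysed[1:]):
        if len(t1) == 1 and len(t2) == 1:
            a, b = next(iter(t1)), next(iter(t2))
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
    return {(a, b): c / left_counts[a] for (a, b), c in pair_counts.items()}

# Toy sentence: the ambiguous "la" contributes nothing to the table.
probs = unambiguous_bigram_table([
    ("il", {"art"}), ("cane", {"n"}), ("vede", {"v"}),
    ("la", {"art", "clitic"}), ("strada", {"n"}),
])
```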
The algorithm we present in this paper couples statistics with linguistic processing, by means of a CF grammar of Italian formalized as an RTN which filters the tagger's output. Statistical information is usefully integrated into the syntactic disambiguator in order to reduce recursion and allow for better predictions. Fully stochastic taggers, in case no large tagged corpora are available, make use of HMMs. However, HMMs show some of the disadvantages of more common Markov models: they lack perspicuity, and even though they allow biases in the form of Finite State Automata to be implemented, they are inherently incapable of capturing higher-level dependencies present in natural language and are always prone to generating wrong interpretations, i.e. accuracy never rises above 96-97%. This is of course a good statistical result, but a poor linguistic one, given the premises, i.e. the need to use tagging information for further syntactic processing. Together with Voutilainen & Tapanainen, we hold that POS tagging is essentially a syntactically based phenomenon and that by cleverly coupling stochastic and linguistic processing one should be able to remedy some, if not all, of the drawbacks usually associated with the two approaches when used in isolation.

3.1 Tagset and Ambiguity Classes

We studied our training corpus in order to ascertain what level of ambiguity was present and where, given that our corpus is made up of sub-corpora from different domains and genres. Our tagset is made up of 86 tags, subdivided as follows: 7 for punctuation; 4 for unknown out-of-vocabulary words, abbreviations, titles, dates, and numbers; 19 for verbs, including three syntactic types of subcategorization - transitive, intransitive, copulative - and tensed cliticized verbs; 8 for auxiliaries, both have and be; 42 for function (closed-class) words; 6 for nouns, including special labels for colour nouns, time nouns, factive nouns, proper nouns, and person names.
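Collapsing fine-grained noun subtypes into a distributionally equivalent superclass for the statistical table, while retaining the fine-grained tag for the syntactic phase, can be sketched as a simple mapping. The tag codes below are illustrative assumptions, not the actual tagset.

```python
# Illustrative: noun subtypes collapse to a generic "n" for the
# transition table; the fine-grained tag is kept for syntactic
# disambiguation, where the distinction matters.
ALLOTAG_MAP = {"nt": "n", "nf": "n", "np": "n", "nh": "n"}

def collapse_for_statistics(tag):
    """Map a fine-grained tag to its distributional superclass;
    tags without a superclass pass through unchanged."""
    return ALLOTAG_MAP.get(tag, tag)
```

Here collapse_for_statistics("nt") yields "n", while verb or function-word tags pass through unchanged, so the statistical table stays small without losing the semantic labels downstream.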
Twenty categories from the general tagset never occur unambiguously on their own, so they had to be converted into distributionally equivalent ones in the statistical table. The general criteria we adopted for including a tag in our tagset obey the following principles:
• a tag must be unambiguously associated with a wordform or class of wordforms;
• a tag must be motivated by unique distributional properties, i.e. it must be in complementary distribution with other similar tags: for instance, the tag for common nouns [n] has a number of "allotags" - [nt] for temporal nouns, [nf] for factive nouns, [np] for proper nouns (geographical and others), [nh] for person names - a subdivision that is important when disambiguation is called for, and for syntactic reasons;
• don't introduce new tags when there is no need to: tagsets for English usually include tags for 3rd-person verbs and plural nouns. We see no reason to introduce such morphologically based tags, which in the case of Italian or other such languages would make the tagset explode.
Below we present a series of tables that illustrate the tag distribution, particularly in relation to inherent ambiguity. We regard the ambiguity level as a parameter which is both language-specific and dependent on the linguistic

domain. Roughly speaking, it is one thing to deal with Ambiguity Classes below a certain threshold, say 4 times ambiguous, and quite another to cope with more than double that. And this is what happens in Italian: the lexicon of Italian seems to be highly ambiguous, as will be shown below. This need not apply to all human languages, of course. We will also show some data for English produced by our tagger - thus following similar tagging conventions - which have a much lower level of ambiguity. The analysis below refers to approximately half a million tokens - 450,000 tokens when punctuation is subtracted. This is half the corpus of Italian we have been working on, presented above: a lot of additional work has been done to encode polywords, abbreviations, and proper nouns in order to prevent misanalyses. As can be gathered from Table 1, the level of ambiguity amounts to less than half the tokens, with a 5% increase when punctuation tokens are subtracted from the totals.

TABLE 1. General Token/Types Data

             Total    Total   %Tot.   %Typs/  Total     %Read   %Read/
             Tokens   Types   Types   Tokens  Readings  /Token  Tok-Punc
Culture      28,000   5,277   13.13   18.84   48,595    1.73    1.61
Politics     58,230   6,430   16      11.04   104,452   1.79    1.71
School-Adm.  45,420   2,706   6.73    5.95    82,019    1.80    1.71
Finance-Bur  328,550  17,468  43.47   5.31    562,455   1.71    1.59
Science      79,893   8,298   20.65   10.38   146,715   1.83    1.74
Totals       540,093  40,179  100.00  7.44    944,236   1.74    1.67
Merge                 21,085

Column 1 tabulates Total Tokens and column 2 Total Types. We then give the percentage of types in each subcorpus relative to the total number of types: apart from the Finance corpus, which has the highest value, the remaining figures show an overwhelming superiority of Culture, which has more than double the types of School, the latter being clearly highly repetitive.
In the third column we compare total types with total tokens in each corpus and get the usual Zipf-law effect, whereby as the number of tokens increases, the increment in the number of types tends to zero: in particular we note again that the School corpus has a very low figure, comparable only to the much bigger Finance corpus. Coming now to the three remaining columns, we tabulate the number of total readings per corpus and their percentages. If we compare our data with those available for English, reported by Tapanainen & Voutilainen, the overall proportion between number of tokens and number of morphological analyses is much higher in Italian: they report 1.04-1.07 readings per token, whereas our data average 1.74 readings per word, 1.67 when punctuation is subtracted from the total count.

TABLE 1. General Tagging Data

             Total    %Tot.   Punctu-  %Tot.   Tot.       %Tot.    %Ambi./  %Amb.-
             Tokens   Tokens  ation    Tokens  Ambiguous  Ambigu.  Tokens   Punct.
Culture      28,000   5.18    3,288    11.74   12,402     5.02     44%      50.2%
Politics     58,230   10.78   4,792    8.23    29,802     12.08    51%      55.7%
School-Adm.  45,420   8.40    3,962    8.72    20,715     8.4      45%      50%
Finance-Bur  328,550  60.83   38,660   11.77   142,851    58       43%      49.3%
Science      79,893   14.80   7,655    9.58    40,877     16.57    51%      56.6%
Totals       540,093  100%    58,357   10.8    246,647             45.67%   51%

Ambiguous tags are fairly evenly distributed among the subcorpora, as can be seen in Table 2, where we tabulate ambiguity classes starting from unambiguous tokens, then cardinality 2 (twice ambiguous) tokens, ending with the 9-times-ambiguous class, which figures only once. As can be seen from Tables 1 and 2, two domains have a higher level of ambiguity than the remaining, fairly evenly behaved domains: Politics and Science.

TABLE 2a. AMBIGUITY CLASSES: TYPES

Class   1    2    3    4   5   6   7   8   9   Tot
Types   76   144  128  76  47  16  14  4   1   506

TABLE 2b. AMBIGUITY CLASSES: TOKENS

           2       %Tot  3       %Tot   4      5      6      7    8    9
Culture    7,450   26.6  2,975   10.6   734    777    174    49   107  12
Politics   19,652  33.7  6,602   11.34  1,323  1,456  445    92   126  0
S.Admin.   12,106  26.6  4,795   10.76  1,284  1,723  527    137  102  0
Finance    86,763  26.4  35,421  10.78  8,261  8,138  1,757  805  882  0
Science    25,770  32.2  9,656   12.08  1,919  2,532  434    277  289  0

As can be seen from Table 2, the number of ambiguous tags is very high, and even though it decreases dramatically from class 4 onwards, it might still constitute a serious obstacle to approaches like those adopted by advocates of rule-based or constraint-based disambiguation. Of course, what the table tells us is that the total number of occurrences of tags belonging to a given AC is very high; the class might nonetheless be represented by a very small number of types, in which case it would still be very convenient to manually encode biases for all types. Unfortunately this is not the case, as is clearly shown in the following table, Table 3. We discovered that the number of possible bigrams is very high and varies a lot from one sub-corpus to another. We also discovered that the great majority of bigrams are hapax legomena, i.e. bigrams with a frequency of occurrence equal to or below 3.

TABLE 3. BIGRAMS - TYPES

           Tot.Typ  %Tot   Hapax 1  %Tot    Hapax 2  Hapax 3
Culture    5,867    12.7   3,206    15.47   967      413
Politics   7,291    15.8   3,487    16.83   1,180    558
S.Admin.   4,956    10.7   2,006    9.68    770      446
Finance    17,885   38.7   7,192    34.71   2,755    1,449
Science    10,164   22     4,826    23.29   785      785
Total      46,163   100.0  20,717   100.00  6,457    3,651
Merge               18,407

TABLE 4. N-GRAMS DISTRIBUTION

           Tokens   Types   3      4      5       6      7      8      9    10   11   12
Culture    9,889    7,659   1,893  2,824  1,501   753    345    180    80   37   26   11
Politics   22,061   12,657  1,677  4,179  3,240   1,811  833    446    235  121  64   18
S.Adm.     17,105   9,488   1,943  3,354  2,042   1,111  543    283    118  41   24   9
Finance    119,820  40,645  6,000  6,548  13,546  7,484  3,728  1,756  822  418  172  76
Science    29,500   18,802  2,680  6,077  4,376   2,452  1,332  693    342  155  89   52
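The bigram-hapax observation behind Table 3 - that most bigram types occur 3 times or fewer - can be checked with a short counting routine. This is a generic sketch over a toy tag sequence, not the paper's code.

```python
from collections import Counter

def bigram_hapax(tags, threshold=3):
    """Count distinct tag bigrams and how many of them are hapax
    legomena in the extended sense used above, i.e. bigrams
    occurring `threshold` times or fewer."""
    counts = Counter(zip(tags, tags[1:]))
    hapax = sum(1 for c in counts.values() if c <= threshold)
    return len(counts), hapax

# Toy sequence: 4 bigram types, all at or below the default threshold.
types, hapax = bigram_hapax(["art", "n", "v", "art", "n", "adv"])
```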

REFERENCES

[1] Tapanainen P. and Voutilainen A. (1994), Tagging accurately - don't guess if you know, in Proc. ANLP '94, Stuttgart, Germany, pp. 47-52.
[2] Brants T. and Samuelsson C. (1995), Tagging the Teleman Corpus, in Proc. 10th Nordic Conference of Computational Linguistics, Helsinki, pp. 1-12.
[3] Lecomte J. (1998), Le Categoriseur Brill14-JL5 / WinBrill-0.3, INaLF/CNRS.
[4] Chanod J.P. and Tapanainen P. (1995), Tagging French - comparing a statistical and a constraint-based method, in Proc. EACL '95, pp. 149-156.
[5] Brill E. (1992), A Simple Rule-Based Part of Speech Tagger, in Proc. 3rd Conf. ANLP, Trento, pp. 152-155.
[6] Cutting D., Kupiec J., Pedersen J. and Sibun P. (1992), A practical part-of-speech tagger, in Proc. 3rd Conf. ANLP, Trento.
[7] Voutilainen A. and Tapanainen P. (1993), Ambiguity resolution in a reductionistic parser, in Proc. 6th Conference of the European Chapter of the ACL, Utrecht, pp. 394-403.
[8] Delmonte R. and Pianta E. (1996), IMMORTALE - Analizzatore Morfologico, Tagger e Lemmatizzatore per l'Italiano, in Atti V Convegno AI*IA, Napoli, pp. 19-22.
[9] Delmonte R., Mian G.A. and Tisato G. (1986), A Grammatical Component for a Text-to-Speech System, in Proc. ICASSP '86, IEEE, Tokyo, pp. 2407-2410.
[10] Delmonte R. and Dolci R. (1989), Parsing Italian with a Context-Free Recognizer, Annali di Ca' Foscari XXVIII, 1-2, pp. 123-161.
