ACIDCA-ICMI’05 – Tozeur, 5-7 novembre 2005

Morpho-lexical ambiguities in the recognition of written Arabic word-forms, evidence from the DIINAR.1 lexical resource

Ramzi Abbès (1), Joseph Dichy (2) & Mohamed Hassoun (3)

(1) École Nationale Supérieure des Sciences de l’Information et des Bibliothèques (ENSSIB), Villeurbanne Lyon, France
(2) Université Lumière-Lyon 2 & ICAR (UMR 5191-CNRS/Lyon 2), Lyon, France
(3) École Nationale Supérieure des Sciences de l’Information et des Bibliothèques (ENSSIB), Villeurbanne Lyon, France

Abstract – This paper tackles the issue of the ambiguities that are generated by the morpho-lexical system of Arabic, and not only by the fact that the standard writing of the language is ‘unvowelled’. Evidence is drawn from the Arabic monolingual DIINAR.1 lexical resource [section II]. The authors focus on morpho-lexical ambiguities, which are considered successively in vowelled and unvowelled writing. For lack of space, and in order to ensure consistency in the presentation of data, all the examples belong to verb structures. Statistical results are given for the level of ambiguity of prefix/suffix combinations and of pre-stem/post-stem combinations. Percentage results are also given for the number of stems that are found in the conjugation of verbs. The numbers of stems that are related either to a single verb, or to two verbs or more, are also analysed. Statistical results are given, in addition, for stem-verb-root relations. Evidence from vowelled and from unvowelled realisations of stems, verbs and word-forms is proposed. Root recognition is finally examined, and the results obtained by queries in the DIINAR.1 lexical resource are compared, on this point, to those of the morphological analysis of a corpus of 2 million word occurrences (al-Ḥayāt newspaper, year 1995).

I. INTRODUCTION: VOWELLED AND UNVOWELLED WRITTEN ARABIC AMBIGUITY

It is generally admitted that automatic recognition of written Arabic words, sentences and texts is rendered especially difficult by the highly homographic structure of ‘unvowelled’ writing. Standard writing in Arabic – as is also the case in other Semitic languages, such as Hebrew, Aramaic or Syriac – does not make use of the additional diacritic symbols, which were elaborated mainly for the needs of the oral reading of sacred texts (Bible, New Testament, and Koran). In Arabic, writing traditionally termed ‘unvowelled’ is bare of the secondary diacritical signs that indicate short vowels (ḥarakāt, حركات), but also of those that indicate consonant doubling (šadda, شدة), diacritical case endings (tanwīn, تنوين), etc. (the term ‘unvowelled’ is based on a metonymical semantic extension).

What is less well known is that many word-level ambiguities originate in the actual morphological system of the language, and in the way in which lexical entries (nouns, adjectives, verbs) are realised and inserted in word-forms. The morphological system of Arabic includes phonological modifications related to the contents and/or structure of roots (33 types of roots have been identified from this point of view – [16], [6]). In other words, ambiguity in the recognition of written Arabic word-forms occurs:
(a) in vowelled script, in which case it can only originate in the morpho-lexical structures of the language;
(b) in standard unvowelled script, which considerably increases the percentage of ambiguous forms generated by morpho-lexical structures.

It has now become generally acknowledged that automatic morphological analysis – i.e. analysis of word-forms considered context-free – encounters in standard Arabic writing a high number of potential analyses for a significant percentage of word-forms [12], [13], [26], [2] (see footnote 1). Exactly why and how remains an open question.

This paper endeavours to tackle the issue of the relations between morpho-lexical ambiguity and the ambiguities that are found in unvowelled Arabic texts. This implies considering the relations between points (a) and (b) above, and going beyond the current idea that word-form ambiguity in standard Arabic texts is mainly related to unvowelled writing.

Footnote 1: This could also be tested, in addition to the analysers drawing on the DIINAR.1 resource ([26], [31], [28], and [2]) which will be used here, with the equally effective morphological analyser put on the Internet by the Xerox European research centre [8], or with Tim Buckwalter’s analyser.


The results given here are founded on evidence extracted from the DIINAR.1 lexical database, which was elaborated on descriptive linguistic and lexicological grounds (see section II below). One result is based on the automatic morphological analysis of a newspaper corpus of 2 million words (from the al-Ḥayāt الحياة daily newspaper, year 1995).

II. BACKGROUND OF THE RESEARCH: WORD-FORM ANALYSIS IN THE SAMIA-DIINAR.1 METHODOLOGY AND THE ARAMORPH ANALYSER

The research presented here is based on extensive linguistic and engineering research. The software tools used are the AraMorph word-form (or morphological) analyser and the AraConc automatic concordance compiler ([4], [5], [1], [2]), both of which draw on the Arabic monolingual DIINAR.1 lexical resource (“DIctionnaire INformatisé de l’ARabe, version 1”; Arabic acronym Ma‘ālī معالي, for Mu‘jam al-‘Arabiyya l-’ālī معجم العربية الآلي) [21], [3] (in Arabic: [10]).

The methodology underlying the DIINAR.1 database and the related morphological and syntactic analysers was first outlined in 1982-83 [12], and was subsequently extended and developed in Lyon (Université Lumière-Lyon 2 and ENSSIB) within the SAMIA research project (“Synthèse et Analyse Morphosyntaxiques informatisées de l’arabe”) [13], [14], [28], [23], [15]. The analysis and generation software and the lexical resource are based on a formal representation of the written word-form in Arabic [15] and a comprehensive grammar operating at word-form level. The grammar includes two different sets of rules, devised in order to be compatible with either generation or recognition processes (hence the “S” of “Synthesis” and the “A” of “Analysis” in the acronym of the SAMIA project) [9], [23], [17]. (In English, see [14], [19] and [20].) The word-formatives grammar (WFG) associated with this representation of word-form structures includes rules and relations involving the lexical nucleus of the Arabic word-form (see footnote 2).

One of the major contributions of the SAMIA project is to have highlighted, from the early 1980s onwards, the fact that relations between the lexical nucleus and the other formatives of the word-form entail the need for morphological analysis or generation to draw on a lexical database, the entries of which are associated with morpho-syntactic specifiers (presented in [28], [15]). The set of specifiers accounting for grammar-lexis specifications operating at word-form level is both finite and exhaustive [17], [19]. Specifiers also include links between morphologically related items such as verb-deverbal(s) or singular-plural, etc.

Footnote 2: Word-forms in Arabic encompass one lexical nucleus and one only. The structure of the Arabic written word-form is extensively described in [15], [17].

Descriptive work conducted in Lyon resulted, during the 1990s, in the elaboration of the DIINAR.1 lexical resource, which was completed in collaboration with a leading Tunisian research and development centre, IRSIT (Institut de Recherche en Sciences de l’Informatique et des Télécommunications, presently IT.COM – Prs. Salem Ghazali and Abdelfattah Braham). The total number of lemma-entries in the DIINAR.1 database is currently 121,522. This includes 445 tool-words belonging to various grammatical categories (e.g. prepositions, conjunctions, etc.) and the prototype of a proper names database, of 1,384 entries. Both types of entries are associated with their own subsets of morpho-syntactic specifiers (operating at word-form level). The main parts of the db comprise, in addition to these two types:

Lexical category                                              Number of lemma-entries
Nouns, including adjectives                                   29,534
  [Broken plural forms – جموع التكسير]                        [9,565]
Verbs                                                         19,457
Deverbals (مشتقات اسمية):
  - infinitive forms (مصدر)                                   23,274
  - active participles (اسم الفاعل)                           17,904
  - passive participles (اسم المفعول)                         13,373
  - ‘analogous adjectives’ (صفة مشبهة)                        5,781
  - ‘nouns of place & time’ (اسم المكان والزمان)              10,370
  [Total number of deverbals]                                 [70,702]
Subtotal of lemma-entries                                     119,693

TABLE I. NUMBER OF LEMMA-ENTRIES BELONGING TO MAJOR LEXICAL CATEGORIES IN DIINAR.1

This lexical resource is now available to researchers and developers in the form of generated lexica presented in Excel tables, through ELRA/ELDA (The Evaluation and Language resources Distribution Agency, Paris; www.elda.org).

III. MORPHO-LEXICAL AMBIGUITY

The analysis presented here is based on correct (or corrected) Arabic occurrences. Pre- and post-stem elements are considered in vowelled realisation. Ambiguities in stems are examined successively in vowelled and unvowelled writing. Orthographic ambiguity due to writing usage, on the other hand, is left aside here and will be examined in another paper. In this section, we focus on the parts of the lexicon that generate word-level mix-ups, and state, whenever possible, the statistical data related to the level of ambiguity. Although it is not always easy to quantify phenomena of this type, especially with an automatic procedure, figures extracted from the DIINAR.1 database are given. Morpho-lexical ambiguity is connected to the two general fields of the word-form [17]:




• A first source of word-form ambiguities is related to the lexical nucleus: a number of stems share the same written form, but yield different morpho-syntactic analyses. This type of ambiguity is of course much more frequent in unvowelled writing.

• Another source concerns the ‘particles’ included in the word-form – i.e. word-formatives other than the lexical nucleus (or stem) – that feature different morpho-syntactic analyses. The phenomenon can in turn be divided into two subtypes related, respectively, to:
  - Affix compatibility – i.e. prefix-prefix, suffix-suffix and prefix-suffix compatibility.
  - Pre- and post-stem particle compatibility. This includes, in addition to prefixes and suffixes, proclitic and enclitic compatibility.

The two sources above are not only theoretical. It is obvious that word-level ambiguities are usually solved at higher levels (noun-, verb- or adjective-phrases, sentences, texts). Word-forms, for instance, combine stems with affixes in ‘minimal words’, and concatenate stem+affix units with proclitics and/or enclitics in ‘maximal words’. Nevertheless, analysis of actual examples shows that ambiguities in larger syntagmatic units (word-forms, sentences, texts) actually originate in lower-level ones (see footnote 3), which we will now consider.

A. Morphological feature ambiguity within word-form boundaries

We concentrate, in the limited frame of this work, on verbs. This should also serve the purpose of bringing better consistency to the presentation.

1) Prefixes and suffixes

The DIINAR.1 database allows the production of generated lexica of inflected nouns and verbs. Minimal word-forms – i.e. word-forms restricted to prefix-suffix and stem combinations – are generated from a matrix table based on the SAMIA descriptions and specifications. The table includes nominal suffixes and verbal affixes, and a number of empty strings. Only existing combinations are taken into account [17], [3]. Nine vowelled prefixes are combined with 61 suffixes, and yield 179 existing combinations (instead of the 549 that would result from blind combination). The level of ambiguity of prefix-suffix combinations varies according to the number of features they support.
Consider the following example: the combination of (a) stems of verbal patterns such as fā‘ala/yufā‘ilu (فَاعَلَ، يُفَاعِلُ) and (b) the Imperfective (مضارع) sequence of prefix تُـ (tu-) and suffix ـَا (-ā) supports, with the same vowelled written form, four morphological analyses, e.g. تُؤَاخِيَا tu’āḫiyā:
- 1 and 2: 2nd person masculine dual of either the Subjunctive Imperfective (مضارع منصوب) or the Jussive Imperfective (مضارع مجزوم), meaning: “you two fraternize [with]”;
- 3 and 4: 3rd person feminine dual of the same two verbal paradigms, meaning: “they (fem.) both fraternize [with]”.
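The four readings of this example can be pictured as one slot of a prefix-suffix table that holds several feature bundles. The following is only a minimal sketch under stated assumptions: the `Analysis` record, the `AFFIX_ANALYSES` table and the two helper functions are hypothetical names introduced for illustration, and they do not reproduce the actual DIINAR.1 or SAMIA data structures. Affixes are keyed in transliteration.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """One morphological reading of a prefix/suffix pair (hypothetical record)."""
    person: int
    gender: str
    number: str
    mood: str


# Hypothetical slot for prefix tu- and suffix -ā, filled with the four
# readings listed in the example above (labels follow the paper's wording).
AFFIX_ANALYSES = {
    ("tu", "ā"): [
        Analysis(2, "masculine", "dual", "subjunctive"),
        Analysis(2, "masculine", "dual", "jussive"),
        Analysis(3, "feminine", "dual", "subjunctive"),
        Analysis(3, "feminine", "dual", "jussive"),
    ],
}


def analyses_for(prefix: str, suffix: str) -> list:
    """Return every reading stored for the pair, or an empty list."""
    return AFFIX_ANALYSES.get((prefix, suffix), [])


def ambiguity_level(prefix: str, suffix: str) -> int:
    """Number of readings supported by one prefix/suffix combination."""
    return len(analyses_for(prefix, suffix))


print(ambiguity_level("tu", "ā"))
```

A pair that is absent from the table yields level 0, which in this sketch stands for a combination that does not exist.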

Footnote 3: The approach to ambiguity underlying the above analysis has been presented in [16] and [19]. For a general reference on ambiguity in Arabic, see [7].

The slots of the DIINAR.1 matrix table of vowelled prefix-suffix combinations support analyses, the number of which ranges from 10 to 1:

Number of prefix/suffix    Percentage of     Number of morphological
combinations               combinations      interpretations
1                          0.56%             10
1                          0.56%             6
1                          0.56%             5
14                         7.82%             4
11                         6.14%             3
58                         32.4%             2
93                         51.95%            1
Total: 179

TABLE II. LEVEL OF AMBIGUITY IN PREFIX AND SUFFIX COMBINATIONS

In other words, 52% of combinations support a single analysis, and 48% two morphological interpretations or more. The above levels of ambiguity are given for vowelled affixes. In standard unvowelled writing, the 9 prefixes drop to 5 written realisations and the 61 suffixes to 21 realisations. The level of ambiguity is naturally much higher.

2) Pre- and post-stem combinations: prefixes and suffixes plus proclitics and enclitics

It is to be noted, first, that many proclitic elements sharing the same graphic and/or phonetic realisation do not correspond to a single morpheme. This is expected to generate word-form ambiguities when at least two of the interpretations supported are compatible with the lexical nucleus or stem. The coordinating conjunction wa- (و), for example, is compatible with any minimal word, whereas the homophonic and homographic proclitic known as wāw al-qasam (واو القسم), the ‘oath particle’, is only compatible with a subset of nouns. A more intricate example is that of the many interpretations associated with proclitic l- (ل).

Proclitic and enclitic compatibility cannot be considered apart from prefixes and suffixes. Pre-stem/post-stem matrix tables have therefore been built, for the analysis or the generation of maximal word-forms. Existing combinations amount to 19,573 possibilities. This includes the prefix-suffix compatibility matrix presented above. The number of interpretations supported by the slots of the extended matrix table ranges as follows:

Number of pre-stem/        Percentage of     Number of morpho-syntactic
post-stem combinations     combinations      interpretations
3                          0.015%            10
4                          0.020%            7
3                          0.015%            6
4                          0.020%            5
253                        1.29%             4
465                        2.37%             3
3,193                      16.31%            2
15,648                     79.95%            1
Total: 19,573

TABLE III. LEVEL OF AMBIGUITY OF PRE-STEM AND POST-STEM COMBINATIONS

Almost 80% of the combinations support a single morpho-syntactic interpretation. The addition of clitics thus considerably reduces the level of ambiguity, in spite of the number of ‘readings’ that can be supported by unvowelled writing. This confirms hypotheses on the structure of the writing system of Arabic and the associated reading process presented in [15], as well as many empirical observations made in the course of corpus analyses.
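The shares discussed for Tables II and III follow directly from the raw counts, so they can be recomputed as a sanity check. The sketch below assumes only the counts printed in the two tables; the function name `ambiguity_profile` and the dictionary layout are illustrative choices, not part of the DIINAR.1 tooling.

```python
def ambiguity_profile(counts):
    """counts maps a number of interpretations to a number of combinations.

    Returns the total number of combinations and the percentage share
    held by each ambiguity level.
    """
    total = sum(counts.values())
    shares = {level: 100 * n / total for level, n in counts.items()}
    return total, shares


# Raw counts as printed in Table II (prefix/suffix combinations)
TABLE_II = {10: 1, 6: 1, 5: 1, 4: 14, 3: 11, 2: 58, 1: 93}
# Raw counts as printed in Table III (pre-stem/post-stem combinations)
TABLE_III = {10: 3, 7: 4, 6: 3, 5: 4, 4: 253, 3: 465, 2: 3193, 1: 15648}

# Blind combination of 9 prefixes with 61 suffixes
BLIND_COMBINATIONS = 9 * 61

for name, table in (("II", TABLE_II), ("III", TABLE_III)):
    total, shares = ambiguity_profile(table)
    ambiguous = 100 - shares[1]
    print(f"Table {name}: {total} combinations, "
          f"{shares[1]:.2f}% unambiguous, {ambiguous:.2f}% ambiguous")
```

Comparing the two printed lines makes the point of the paragraph above concrete: the share of slots with a single reading is much larger once clitics are added.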

B. Stem-level ambiguity

In the DIINAR.1 db, 19,457 verbs belong to 5,751 Arabic roots. All conjugated forms are obtained by prefix and/or suffix combinations associated with 96,076 stems. The average number of stems per verb is 4.93. For instance:

• The conjugation of the verb كَتَبَ/يَكْتُبُ kataba/yaktubu, “to write”, is performed by inflecting 4 stems, i.e.: كَتَب/كُتِب/كْتَب/كْتُب katab/kutib/ktab/ktub.

• On the other hand, the verb تَنَبَّهَ/يَتَنَبَّهُ tanabbaha/yatanabbahu, “to pay attention”, “to turn one’s mind [to]”, only requires 2 stems: تَنَبَّه/تُنُبِّه tanabbah/tunubbih.

The following table shows the correspondence between the number of stems needed for conjugation and the number of verbs:

Number of verbs    Percentage of verbs    Number of conjugation stems
14                 0.072%                 13
452                2.32%                  12
1,043              5.36%                  11
587                3.02%                  10
81                 0.42%                  9
332                1.71%                  8
1,079              5.54%                  7
1,079              5.54%                  6
5,844              30.03%                 5
2,415              12.41%                 4
3,760              19.32%                 3
2,771              14.24%                 2
Total: 19,457

TABLE IV. NUMBER OF VERBS ACCORDING TO THE NUMBER OF STEMS NEEDED FOR CONJUGATION

Considering the ambiguity issue, one can ask the question of the relation between stems and verbs: can stems be related to more than one verb? By grouping identical vowelled verbal stems, one finds that the overall number of stems can be reduced to 76,060 vowelled graphic strings, which can be described as ‘formal stems’. Let us consider the relation between these strings and the related verbs:

Number of vowelled    Percentage of       Number of
‘formal stems’        ‘formal stems’      related verbs
1                     0.001%              10
7                     0.009%              8
71                    0.09%               7
109                   0.14%               6
397                   0.52%               5
757                   0.99%               4
2,360                 3.10%               3
10,407                13.68%              2
61,951                81.45%              1
Total: 76,060

TABLE V. STEM-LEVEL AMBIGUITY IN VERBS

Over 81% of the items occupying the position of stems in the word-form relate to a single verb, which implies a strong reduction in their level of ambiguity (but not a total reduction, since they can still belong to different paradigms of that verb and, within a given paradigm, to different persons, genders and numbers).

C. Stem-level ambiguity in terms of roots

Let us go one step further. Do these verbs relate to the same root? In other words, if a given formal stem can be used in the conjugation of a number of verbs, are these verbs related to the same root? The answer is positive for over 90% of the stems, as shown in the following table:

Number of vowelled    Percentage of       Number of
‘formal stems’        ‘formal stems’      related roots
2                     0.002%              5
127                   0.17%               4
691                   0.91%               3
6,768                 8.9%                2
68,472                90.02%              1
Total: 76,060

TABLE VI. STEM-ROOT RELATIONS
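The average number of stems per verb can be approximated from the distribution in Table IV as a weighted mean. The sketch below uses only the counts printed in that table; `mean_stems_per_verb` is an illustrative helper name. The stem total it derives is close to the 96,076 stems reported in the text but is not guaranteed to coincide with it exactly, since the table is the only input used here.

```python
# Number of conjugation stems -> number of verbs, as printed in Table IV
VERBS_BY_STEM_COUNT = {
    13: 14, 12: 452, 11: 1043, 10: 587, 9: 81, 8: 332,
    7: 1079, 6: 1079, 5: 5844, 4: 2415, 3: 3760, 2: 2771,
}


def mean_stems_per_verb(distribution):
    """Return (number of verbs, number of stems, mean stems per verb)."""
    verbs = sum(distribution.values())
    stems = sum(n_stems * n_verbs for n_stems, n_verbs in distribution.items())
    return verbs, stems, stems / verbs


verbs, stems, mean = mean_stems_per_verb(VERBS_BY_STEM_COUNT)
print(f"{verbs} verbs, {stems} stems, {mean:.2f} stems per verb on average")
```

The same helper could be applied to any other distribution of this shape, for instance a future count restricted to nominal stems.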

D. An example of ambiguous stem-verb-root relations in vowelled writing

Let us, for the sake of clarity, exemplify the above relations. The stem -رَع- -ra‘- is related to the following verbs and roots:

Vowelled stem      Verbs in which the stem appears         Related roots
-رَع- -ra‘-        وَرَعَ/يَرَعُ wara‘a/yara‘u             (ورع) /w-r-‘/
-رَع- -ra‘-        رَاعَ/يَرِيعُ rā‘a/yarī‘u               (ريع) /r-y-‘/
-رَع- -ra‘-        أَرَاعَ/يُرِيعُ ’arā‘a/yurī‘u           (ريع) /r-y-‘/
-رَع- -ra‘-        رَاعَ/يَرُوعُ rā‘a/yarū‘u               (روع) /r-w-‘/
-رَع- -ra‘-        أَرَاعَ/يُرِيعُ ’arā‘a/yurī‘u           (روع) /r-w-‘/
-رَع- -ra‘-        رَعَى/يَرْعَى ra‘ā/yar‘ā                (رعي) /r-‘-y/
-رَع- -ra‘-        رَعَا/يَرْعُو ra‘ā/yar‘ū                (رعو) /r-‘-w/

TABLE VII. VERB AND ROOT AMBIGUITY RELATED TO THE STEM -RA‘-
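The stem-verb-root relations of Table VII amount to an inverted index from a formal stem to the verbs and roots it can belong to. The sketch below is a hypothetical illustration in transliteration, not the DIINAR.1 schema: the `TRIPLES` list copies the rows of Table VII, and `build_index` is a name introduced here. Verbs are stored together with their root, because the same citation form (’arā‘a/yurī‘u) occurs under two different roots.

```python
from collections import defaultdict

# (formal stem, verb citation form, root) rows of Table VII, in transliteration
TRIPLES = [
    ("ra‘", "wara‘a/yara‘u", "w-r-‘"),
    ("ra‘", "rā‘a/yarī‘u", "r-y-‘"),
    ("ra‘", "’arā‘a/yurī‘u", "r-y-‘"),
    ("ra‘", "rā‘a/yarū‘u", "r-w-‘"),
    ("ra‘", "’arā‘a/yurī‘u", "r-w-‘"),
    ("ra‘", "ra‘ā/yar‘ā", "r-‘-y"),
    ("ra‘", "ra‘ā/yar‘ū", "r-‘-w"),
]


def build_index(triples):
    """Group verbs and roots under each formal stem."""
    verbs_by_stem = defaultdict(set)
    roots_by_stem = defaultdict(set)
    for stem, verb, root in triples:
        verbs_by_stem[stem].add((verb, root))  # a verb is identified with its root
        roots_by_stem[stem].add(root)
    return verbs_by_stem, roots_by_stem


verbs_by_stem, roots_by_stem = build_index(TRIPLES)
for stem in verbs_by_stem:
    print(stem, len(verbs_by_stem[stem]), "verbs,",
          len(roots_by_stem[stem]), "roots")
```

Counting the sizes of these sets over a whole lexicon is one way to obtain distributions of the kind reported in Tables V and VI.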

E. Stem-verb ambiguity in unvowelled writing

In vowel-free writing, the overall number of stems comes down to 31,365. Approximately 70% of these unvowelled ‘formal stems’ are ambiguous. As expected, the number of verbs and of roots related to a given stem increases noticeably (over 30 verbs for 15 unvowelled stems). The number of ‘formal stems’ related to a single verb falls to 9,742. Here are the other figures:

Number of unvowelled    Percentage    Number of
‘formal stems’                        related verbs
9,742                   31.06%        1
9,265                   29.54%        2
6,106                   19.47%        3
1,502                   4.79%         4
724                     2.31%         5
913                     2.91%         6
571                     1.82%         7
435                     1.39%         8
506                     1.61%         9
270                     0.86%         10
1,195                   3.81%         11 to 20
136                     0.43%         21 to 39
Total: 31,365

TABLE VIII. UNVOWELLED STEMS TO VERBS FIGURES

Around 69% of unvowelled ‘formal stems’ are ambiguous. In about 49% of cases, a given ‘formal stem’ is related to either 2 or 3 verbs.

F. Stem-root relations in unvowelled writing

The following table gives the numbers of unvowelled ‘formal stems’ that are related to one root or more:

Number of unvowelled    Percentage    Number of
‘formal stems’                        related roots
26,407                  84.19%        1
3,939                   12.56%        2
561                     1.79%         3
290                     0.92%         4
131                     0.42%         5
36                      0.11%         6
1                       (0.003%)      7
Total: 31,365

TABLE IX. NUMBER OF STEMS RELATED TO ONE ROOT OR MORE

‘Root ambiguity’, so to speak, appears to be remarkably low. Only 15.8% of the total number of unvowelled ‘formal stems’ can be associated with 2 roots or more. This result is checked against the results of an automated corpus analysis and further commented on in § H below.

G. An example of ambiguous stem-verb-root relations in unvowelled writing

Let us, as has been done for vowelled writing in § D, exemplify stem-verb and stem-verb-root relations in unvowelled writing. Consider the fully vowelled stem -أْسَف-, -’saf-, which is used in the Imperfective paradigm (مضارع) of the verb أَسِفَ/يَأْسَفُ ’asifa/ya’safu, “to regret”, “to feel sorry”, related to the root /’-s-f/ (أسف). The following table presents the verbs, roots and vowelled stems which relate to the unvowelled ‘formal stem’ -أسف-. (Other vowelled stems often exist, but only the ones corresponding to the unvowelled form under consideration are mentioned.)

Unvowelled        Related verbs                       Related roots       Related vowelled stems
‘formal stem’
-أسف-             أَسِفَ/يَأْسَفُ ’asifa/ya’safu      (أسف) /’-s-f/       ’asif- أَسِف, ’usif- أُسِف, -’saf- أْسَف
-أسف-             أَسَفَّ/يُسِفُّ ’asaffa/yusiffu     (سفف) /s-f-f/       ’asaff- أَسَفّ, ’usiff- أُسِفّ
-أسف-             أَسْفَى/يُسْفِي ’asfā/yusfī         (سفو) /s-f-w/       ’asf- أَسْف, ’usf- أُسْف
-أسف-             أَسْفَى/يُسْفِي ’asfā/yusfī         (سفي) /s-f-y/       ’asf- أَسْف, ’usf- أُسْف

TABLE X. VERBS, ROOTS AND VOWELLED STEMS RELATED TO THE UNVOWELLED ‘FORMAL STEM’ -أسف-
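The way several vowelled stems fall together into one unvowelled ‘formal stem’ can be sketched by stripping the diacritics from the vowelled strings of Table X. This is an illustrative assumption about how unvowelling may be simulated with Unicode code points (U+064B to U+0652 cover tanwīn, the short vowels, šadda and sukūn); it is not a description of how the DIINAR.1 lexica were actually produced, and `unvowel` is a name introduced here.

```python
# Letters and diacritics given as code points to avoid encoding slips
HAMZA_ON_ALEF, SEEN, FEH = "\u0623", "\u0633", "\u0641"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"

# The seven distinct vowelled stems of Table X, keyed by transliteration
VOWELLED_STEMS = {
    "’asif": HAMZA_ON_ALEF + FATHA + SEEN + KASRA + FEH,
    "’usif": HAMZA_ON_ALEF + DAMMA + SEEN + KASRA + FEH,
    "’saf": HAMZA_ON_ALEF + SUKUN + SEEN + FATHA + FEH,
    "’asaff": HAMZA_ON_ALEF + FATHA + SEEN + FATHA + FEH + SHADDA,
    "’usiff": HAMZA_ON_ALEF + DAMMA + SEEN + KASRA + FEH + SHADDA,
    "’asf": HAMZA_ON_ALEF + FATHA + SEEN + SUKUN + FEH,
    "’usf": HAMZA_ON_ALEF + DAMMA + SEEN + SUKUN + FEH,
}


def unvowel(text):
    """Drop tanwin, short vowels, shadda and sukun (U+064B..U+0652)."""
    return "".join(ch for ch in text if not "\u064B" <= ch <= "\u0652")


# Every vowelled stem should map onto the same unvowelled formal stem
formal_stems = {unvowel(stem) for stem in VOWELLED_STEMS.values()}
print(len(VOWELLED_STEMS), "vowelled stems ->", len(formal_stems), "formal stem")
```

Grouping a full stem list by `unvowel(stem)` instead of by the vowelled string is what moves the count of formal stems from the vowelled to the unvowelled figures discussed above.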

H. Root recognition

The root remains the least ambiguous element of the word-form. We have seen that around 90% of vowelled ‘formal stems’, and 84% of unvowelled ones, are associated with a single root (§ C and F above). When it comes to minimal word-forms (corresponding to prefix/suffix + stem combinations), an overall automated examination of the contents of the DIINAR.1 lexical resource gives a similar result: 81% of minimal word-forms are related to a single root [2].

These results have been checked against a corpus of 2 million word occurrences of general-vocabulary newspaper articles belonging to various columns of the daily al-Ḥayāt (الحياة) of the year 1995. The software used for this analysis was AraMorph (first lines of section II above). The experiment confirmed the results obtained through automated examination of the contents of DIINAR.1: 79% of the occurrences in the al-Ḥayāt corpus are related to a single root each [2].

Significantly, research in cognitive psychology on the reading process of words in Hebrew [24], [25] and in Arabic [27] has shown that root recognition plays an important part in the word-form recognition process of both languages.

IV. CONCLUSION: FURTHER WORK AND SUMMARY OF RESULTS

The research presented here will need to be extended. Evidence from the DIINAR.1 lexical resource has only been presented for verbs. Nominal and adjectival stems are likely to bring forth new observations and data. Some of the statistical results will also be further analysed, as experimental work goes along. For instance, the types of verbs that support a relatively high number of stems could be further presented in the light of the classification of verb transformations given in [6]. A typology of stem-verb-root relations can, on the same basis, easily be proposed. Statistical evidence from the DIINAR.1 lexical resource will also need to be further checked against corpus frequency results.

The results presented here show that prefix/suffix combinations are more ambiguous (48%) than pre-stem/post-stem combinations (20%): maximal words are, as could have been expected, less ambiguous than minimal ones.

In verbal morphology, the average number of stems necessary for conjugation is 4.93 stems per verb. In vowelled writing, around 81% of the stems are related to a single verb, and another 13.68% to 2 verbs. In unvowelled script, the percentage of unambiguous ‘formal stems’ drops to 31%. The highest level of ambiguity is consequently to be found in unvowelled minimal word-forms (only featuring prefix/suffix and stem combinations).

Another result is that of root recognition. 81% of minimal word-forms are related to a single root (regardless of whether they are realised in vowelled or unvowelled writing). The result is confirmed by the fact that, in the DIINAR.1 lexical resource, over 84% of unvowelled stems are related to a single root. It is also substantiated by the analysis of a 2 million word corpus with the AraMorph analyser [2], the outcome of which shows that 79% of word occurrences are associated with a single root. This last point underlines the importance of roots in the recognition of Arabic word-forms, and raises the issue of the use of roots, among other morpho-lexical features, in the automated analysis of Arabic texts, it being understood, naturally, that there should be no going back to a mythical view of root-&-pattern generation of the lexicon [17], [22].

REFERENCES

[1] R. Abbès. “AraConc: un outil informatique pour le traitement des corpus de textes arabes: quantification et concordances”. In: B. Ben Mrad, ed. Colloque international de lexicographie : Dictionnairique et corpus, 19-21 juin 2004, Tunis. Association de la lexicologie arabe en Tunisie, Tunis, in press.
[2] R. Abbès. La conception et la réalisation d’un concordancier électronique pour l’arabe. Thèse de doctorat en sciences de l’information, Lyon, ENSSIB/INSA, 2004.
[3] R. Abbès, J. Dichy & M. Hassoun. “The Architecture of a Standard Arabic Lexical database: some figures, ratios and categories from the DIINAR.1 source program”. COLING'04, 20th International Conference on Computational Linguistics. Proceedings of the Workshop Computational Approaches to Arabic Script-based Languages, 28 August 2004, Geneva: 15-22.
[4] R. Abbès & M. Hassoun. “Conceptions d'un prototype de concordancier de la langue arabe: éléments de réflexion”. In: A. Fassi Fehri, ed. Generation, Systematicity and Machine Translation, Jan. 2001, Rabat: Institut d'études et de recherches pour l'arabisation (IERA), Rabat, Morocco, 2001: 23-44.
[5] R. Abbès & M. Hassoun. “AraFreq : un outil pour le calcul de fréquence de mots arabes”. In: Abdelfattah Braham. Colloque international sur le traitement automatique de l'arabe, 18-20 Avril 2002, La Manouba-Tunis: Université de La Manouba, 2002: 213-225.
[6] S. Ammar & J. Dichy. Les verbes arabes, Hatier (collection Bescherelle), Paris, 1999. Arabic version: الأفعال العربية (Al-’af‘âl al‘arabiyya), same publisher, 1999.
[7] M. Arar. 2003: مهدي عرر، ظاهرة اللبس في العربية. جدل التواصل والتفاصل، دار وائل، عمان – الأردن، 2003.
[8] K. Beesley. “Finite-State Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans in 2001”. In ACL 39th Annual Meeting. Workshop on Arabic Language Processing; Status and Prospect, Toulouse, 2001: 1-8.
[9] R. Bouché, J. Dichy & M.O. Hassoun. “Enseignement Assisté par Ordinateur de l'arabe. Simulation à l'aide d'un modèle linguistique, la morphologie”. Colloque “E.A.O. 84” (Lyon, 4-5 septembre 1984), Paris, Agence de l'informatique, 1984: 81-96; revised version in Dichy & Hassoun, eds, 1989: 42-62.
[10] A. Braham & S. Ghazali. 1998: قاعدة البيانات المعجمية العربية أو مشروع معجم العربية الآلي (معالي – DIINAR)، حصيلة وآفاق – المجلة العربية للعلوم – ع 32 – 1998: 14-23.
[11] M. Denise. Les dictionnaires électroniques (Dictionnaires électroniques et télématiques: nouveaux rapports à la langue, nouveaux accès aux savoirs). Didier, Paris, 2000.
[12] J.-P. Desclés, dir. (H. Abaab, J.-P. Desclés, J. Dichy, D.E. Kouloughli, M.S. Ziadah). Conception d'un synthétiseur et d'un analyseur morphologiques de l'arabe, en vue d'une utilisation en Enseignement assisté par Ordinateur. Rapport rédigé à la demande du Ministère des Affaires étrangères, 1983.
[13] J. Dichy. “Vers un modèle d’analyse automatique du mot graphique non-vocalisé en arabe”, 1984, in Dichy & Hassoun, eds, 1989: 92-158.
[14] J. Dichy. “The SAMIA Research Program, Year Four, Progress and Prospects”. Processing Arabic Report 2, T.C.M.O., Nijmegen University, 1987: 1-26.
[15] J. Dichy. L’Écriture dans la représentation de la langue : la lettre et le mot en arabe. Thèse d’État (en linguistique), Université Lumière-Lyon 2, 1990.
[16] J. Dichy. “Knowledge-system simulation and the computer-aided learning of Arabic verb-form synthesis and analysis”. Processing Arabic Report 6/7, T.C.M.O., Nijmegen University: 67-84, 1993: 92-95.
[17] J. Dichy. “Pour une lexicomatique de l’arabe : l’unité lexicale simple et l’inventaire fini des spécificateurs du domaine du mot”. Meta 42, Presses de l’Université de Montréal, Québec, spring 1997: 291-306.
[18] J. Dichy. “Mémoire des racines et mémoire des mots : le lexique stratifié de l’arabe”. T. Baccouche, A. Clas et S. Mejri, eds., La Mémoire des mots. Special issue of: Revue Tunisienne de Sciences Sociales, 117, 1998: 93-107.
[19] J. Dichy. “Morphosyntactic Specifiers to be associated to Arabic Lexical Entries - Methodological and Theoretical Aspects”. ACIDA’ 2000 (Monastir, Tunisia, 22-24.03.00), Corpora and Natural Language Processing vol., 2000: 55-60. www.elsnet.org/arabic2001/dichy.pdf
[20] J. Dichy. “On lemmatization in Arabic. A formal definition of the Arabic entries of multilingual lexical databases”. ACL 39th Annual Meeting. Workshop on Arabic Language Processing; Status and Prospect, Toulouse, 2001: 23-30. www.cs.cmu.edu/~alavie/Sem-MTwshp/Dichy+Farghaly_paper.pd
[21] J. Dichy, A. Braham, S. Ghazali, M. Hassoun. “La base de connaissances linguistiques DIINAR.1”. In: A. Braham. Colloque international sur le traitement automatique de l'arabe, 18-20 Avril 2002, La Manouba-Tunis: Université de La Manouba, 2002: 45-56.
[22] J. Dichy & A. Fargaly. “Roots & Patterns vs. Stems plus Grammar-Lexis Specifications: on what basis should a multilingual lexical database centred on Arabic be built?” IXth MT Summit (New-Orleans, 23-27.09.03), Proceedings of the Workshop on Machine Translation for Semitic Languages: Issues and Approaches, 2003: 1-8. www.cs.cmu.edu/~alavie/Sem-MTwshp/Dichy+Farghaly_paper.pd
[23] J. Dichy & M.O. Hassoun, eds. Simulation de modèles linguistiques et Enseignement Assisté par Ordinateur de l'arabe – Travaux SAMIA I. Conseil International de la Langue Française, Paris, 1989.
[24] R. Frost, K. Forster & A. Deutsch. “What can we learn from the morphology of Hebrew? A masked priming investigation of morphological representation”. Journal of Experimental Psychology: Learning, Memory and Cognition, 23, 1997: 829-856.
[25] R. Frost, A. Deutsch & K. Forster. “Decomposing morphologically complex words in a non linear morphology”. Journal of Experimental Psychology: Learning, Memory and Cognition, 26, 2000: 751-65.
[26] M. Ghenima. Analyse morpho-syntaxique en vue de la voyellation assistée par ordinateur des textes écrits en arabe. Doct. dissert., ENSSIB/Université Lyon 2, 1998.
[27] J. Grainger, J. Dichy, M. El-Halfaoui & M. Bamhamed. “Approche expérimentale de la reconnaissance du mot écrit en arabe”, in Jean-Pierre Jaffré, éd., Dynamiques de l’écriture : approches pluridisciplinaires, revue Faits de langue, n°22, 2003: 77-86.
[28] M. Hassoun. Conception d'un dictionnaire pour le traitement automatique de l'arabe dans différents contextes d'application. State doct. dissert., Université Lyon 1, 1987.
[29] R. Ouersighni. “A major offshoot of the DIINAR-MBC project: AraParse, a morpho-syntactic analyzer of unvowelled Arabic texts”. In ACL 39th Annual Meeting. Workshop on Arabic Language Processing: Status and Prospect, Toulouse, 2001: 66-72.
[30] J. Véronis. “Annotation automatique de corpus: panorama et état de la technique”. In: J.-M. Pierrel. Ingénierie des langues. Hermès, Paris, 2000: 111-130.
[31] R. Zaafrani. Développement d’un environnement interactif d’apprentissage avec ordinateur de l’arabe langue étrangère. Thèse de doct., ENSSIB/Université Lyon 2, 2002.