A Corpus Based Approach for the Automatic Creation of Arabic Broken Plural Dictionaries

Samhaa R. El-Beltagy¹ and Ahmed Rafea²

¹ Nile University, Center for Informatics Science, Giza, Egypt
[email protected]
² The American University in Cairo, Computer Science Department, Cairo, Egypt
[email protected]
Abstract. Research has shown that Arabic broken plurals constitute approximately 10% of the content of Arabic texts. Detecting Arabic broken plurals and mapping them to their singular forms is a task that can greatly affect the performance of information retrieval, annotation or tagging tasks, and many other text mining applications. It has been reported that the most effective way of detecting broken plurals is through the use of dictionaries. However, if the target domain is a specialized one, or one for which no such dictionaries exist, building them manually becomes a tiresome, not to mention expensive, task. This paper presents a corpus based approach for automatically building broken plural dictionaries. The approach utilizes a set of rules for mapping broken plural patterns to their candidate singular forms, and a corpus based co-occurrence statistic to determine when an entry should be added to the broken plural dictionary. Evaluation of the approach has shown that it is capable of creating dictionaries with high levels of precision and recall.
1 Introduction
Broken plurals (BP) are an important part of any Arabic text, as they form around 10% of Arabic content and 40% of encountered plurals [1]. This makes broken plurals much more common than irregular nouns in English. Unlike regular or sound plurals, which are formed simply by adding suffixes, broken plurals change the entire structure of a word's singular form in order to convert it to a plural. This structural change usually involves the addition of infixes as well as the re-ordering or deletion of letters that existed in the original word. While a language like English has irregular plurals, it does not really have an equivalent of the broken plural. Goweder et al. [2] have shown that identifying and handling broken plurals results in improved information retrieval, while El-Beltagy and Rafea [3] have shown that handling broken plurals can significantly improve the results of semantic tagging systems. Yet the identification of broken plurals has proven to be a difficult task. While broken plurals exhibit well defined and known patterns, simply relying on these patterns for their identification results in considerable errors. In [4], it was
shown that applying simple BP pattern matching on a test corpus for identifying broken plurals resulted in a precision of 13.7% and a recall of 99.7%. The extremely low precision value indicates that relying on BP pattern matching alone is of little use for the accurate identification of broken plurals. The work presented in [4] also concluded that a dictionary based approach yields the best results in identifying broken plurals. Building dictionaries, especially for specific domains, is, however, a tiresome, expensive, and time consuming process. This work proposes a framework for automatically building broken plural dictionaries from a large input corpus. Experimental results show that the proposed method is capable of achieving its task with a high level of accuracy. The rest of this paper is organized as follows: Section 2 briefly reviews related work, Section 3 provides an overview of the proposed system, Section 4 presents the experiment carried out to evaluate the work and its results, and finally Section 5 concludes this paper.
2 Related Work
Highly relevant to this work is that of Goweder et al. [4], in which they tackle the difficult problem of identifying broken plurals and reducing them to their singular forms. In their work, Goweder et al. carry out a series of experiments for detecting broken plurals using a test corpus. In all experiments, input words in the corpus are first lightly stemmed using a modified version of the aggressive Khoja stemmer [5]. In the first and least accurate experiment, all forms that fit the pattern of a broken plural were detected and analyzed to see whether or not words fitting these patterns are in fact broken plurals. Having found that this technique results in very low precision, an alternative method, which adds further restrictions to the existing patterns based on the authors' observations, was adopted. Using this method, precision increased significantly from 13.7% to 53.92%, and recall rose from 99.7% to 100%. To improve the results further, a third variation employing a machine learning approach was used to automatically add restriction rules based on a training dataset; with this approach, precision reached 75.1% and recall 95.9%. The best results, however, were obtained using a dictionary based approach. The dictionary in this instance was built semi-automatically and resulted in a precision of 81.2% and a recall of 100%. Adopting the premise that broken plural identification is important for accurate stemming, the work of El-Beltagy and Rafea [3] employed a corpus based approach for identifying broken plurals. As in [4], the authors of [3] identified several target broken plural patterns. In their work, they try to match words that fit the identified broken plural patterns to candidate singular forms, which they generate using a set of rules (one for each pattern). A match is said to be found if the candidate transformation appears in any document in their input corpus. While the underlying
premise adopted in the work of [3] is quite similar to the one proposed in this paper, it differs from it in four significant ways:
1. In [3], the initial process of identifying broken plurals and building a BP list accordingly is semi-supervised, whereas in this work it is fully automated.
2. In [3], when a match is found as described above, only the singular form is stored in what the authors call a stem list. During their stemming stage, words to be stemmed are again matched against BP patterns. If a match is found, a possible transformation is generated; if this transformation is found in the stem list, then conflation takes place. This process, however, is highly error prone, because a word matching a BP pattern may incorrectly map to an entry in the stem list. For example, if the word كلب (dog), whose broken plural is كلاب (dogs), is stored in the stem list, and the word كلوب (lantern) is encountered, كلوب will match one of the BP patterns even though it is not a broken plural, and it will be incorrectly conflated to the word كلب. Our proposed model avoids this problem by storing both the broken plural word and its singular form in a dictionary.
3. The model proposed in [3] attempts to find a match between a candidate transformation and any word in the input corpus, again an error prone process. The work presented herein proposes a set of restrictions when matching a word against its possible candidate form in order to reduce errors.
4. The set of broken plural patterns covered by this work is wider, and cases where a single BP pattern may have multiple possible transformations are handled, as detailed in Section 3.
The work presented in [6] proposes a model based on machine translation to detect broken plurals and convert them to their singular forms. In this model, words that match BP patterns are translated into English. If the resulting English word ends with an "s", or if it exists in a list of irregular English nouns, the original word is identified as a broken plural. The English term is then stemmed, and the stem is translated back into Arabic to obtain the Arabic singular form. [7] and [8] discuss the complexity of Arabic broken plurals from a linguistic perspective.
Using a corpus for transformation validation purposes is not a novel idea. In fact, Xu and Croft made use of this idea to build a stemmer and demonstrated that, for the English language, it is a very effective approach [9]. In their work, Xu and Croft define corpus based stemming as the process of automatically modifying "equivalence classes to suit the characteristics of a given text corpus", their assumption being that a stemmer that can adapt to a certain domain using the characteristics of its corpus should perform better than one that cannot. Another assumption underlying their work is that words and their stems are likely to occur in the same document or, even more specifically, in the same text window. This is similar to the assumption the proposed work makes with respect to broken plurals and their singular forms. Rather than using linguistic knowledge to generate equivalence classes, however, Xu and Croft employ an n-gram model to carry out that task, which differs from the work proposed here, where candidate singular forms are generated based on linguistic knowledge. After experimenting with Xu and Croft's approach on Arabic, Larkey et al. [10] showed that n-gram based stemming is not the most appropriate approach for a language such as Arabic.
3 Overview of the Proposed Approach
To detect broken plurals and create a dictionary in which each entry maps a broken plural term to its singular form, we follow a corpus based approach. The main difficulty in detecting Arabic broken plurals can be attributed to two factors:
1. A word that matches a BP pattern may not be a broken plural at all.
2. A single BP pattern may have more than one possible transformation pattern.
For example, the terms ملايين (millions), براميل (barrels), and اعاصير (hurricanes) all match the BP pattern فعاليل (f3Alyl), yet transforming each of these words to its singular form requires a different rule. Table 1 lists the BP patterns covered by this work, examples of each pattern, and the possible transformation patterns for each; a sketch of how candidate singular forms can be generated from these rules follows the table. The "[ه]" that appears at the end of some transformation patterns indicates that the singular form may or may not have a "ه" as its last letter. The basic premise on which this work builds is that in any given corpus, BPs and their singular forms are likely to co-occur within the documents of that corpus.
Table 1. List of BP patterns covered by this work

| #   | Pattern          | Plural Example Word  | Trans. Form | Singular Transformation | Transformation Pattern |
|-----|------------------|----------------------|-------------|-------------------------|------------------------|
| Word length = 4                                                                                            |
| P1  | فعول (f3Wl)      | حقوق (rights)        | C1          | حق (right)              | فع (f3)                |
|     |                  | قروض (loans)         | C2          | قرض (loan)              | فعل (f3l)              |
| Word length = 5                                                                                            |
| P2  | فوائل (fWA2l)    | سوائل (liquids)      | C1          | سائل (liquid)           | فائل[ه] (fA2l[h])      |
| P3  | فعائل (f3A2l)    | رهائن (hostages)     | C1          | رهينه (hostage)         | فعيله (f3ylh)          |
|     |                  | خسائر (losses)       | C2          | خساره (loss)            | فعاله (f3Alh)          |
|     |                  | بدائل (alternatives) | C3          | بديل (alternative)      | فعيل (f3yl)            |
| P4  | فعايا (f3AyA)    | خلايا (cells)        | C1          | خليه (cell)             | فعيه (f3yh)            |
| P5  | افعال (Af3Al)    | اصوات (sounds)       | C1          | صوت (sound)             | فعل (f3l)              |
| P6  | فواعل (fWA3l)    | شوارع (streets)      | C1          | شارع (street)           | فاعل (fA3l)            |
|     |                  | مواسم (seasons)      | C2          | موسم (season)           | فوعل (fW3l)            |
| P7  | فعلاء (f3lA2)    | خبراء (experts)      | C1          | خبير (expert)           | فعيل (f3yl)            |
| Word length = 6                                                                                            |
| P8  | فواعيل (fWA3yl)  | صواريخ (rockets)     | C1          | صاروخ (rocket)          | فاعول (fA3Wl)          |
| P9  | فعاليل (f3Alyl)  | ملايين (millions)    | C1          | مليون (million)         | فعيول (f3ywl)          |
|     |                  | براميل (barrels)     | C2          | برميل (barrel)          | فعليل (f3lyl)          |
|     |                  | اعاصير (hurricanes)  | C3          | اعصار (hurricane)       | فعلال (f3lAl)          |
| P10 | افعياء (Af3yA2)  | اذكياء (smart, pl.)  | C1          | ذكي (smart, sg.)        | فعي (f3y)              |
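To make the transformation rules concrete, the following is a minimal sketch (in Python) of candidate generation for pattern P9 (فعاليل); the function name and letter-position encoding are our own illustration of the rules in Table 1, not the authors' implementation:

    def p9_candidates(word: str) -> list[str]:
        """Candidate singular forms for a six-letter word matching pattern P9
        (f3Alyl), e.g. ملايين (millions), براميل (barrels), اعاصير (hurricanes).
        The third letter of any P9 match is always ا, so it is discarded."""
        f, ain, _, lam, ya, last = word   # unpack the six letters by position
        return [
            f + ain + ya + "و" + last,    # C1: ملايين -> مليون (million)
            f + ain + lam + ya + last,    # C2: براميل -> برميل (barrel)
            f + ain + lam + "ا" + last,   # C3: اعاصير -> اعصار (hurricane)
        ]

Each generated candidate is then scored against the corpus, and only candidates whose co-occurrence score clears the pattern's threshold Ω make it into the dictionary.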
The first step in building a dictionary using the proposed approach is to index all documents of the input corpus using a search engine; in our experiments, the Apache Lucene search engine was used [11]. Some preprocessing steps depend on the nature of the input documents: HTML documents, for example, require the removal of HTML tags, and the same applies to XML or SGML documents. In general, however, the only preprocessing applied to input documents is normalization and stop word removal. Most approaches that have addressed broken plural detection apply light stemming to the words of a corpus during preprocessing. We did not follow a similar approach because doing so would miss broken plurals whose leading or trailing letters, although an integral part of the word, can be mistakenly eliminated by a light stemmer. For example, the word الوان (colors), which matches broken plural pattern P5, has "ال" as its leading two letters. Since "ال" as a prefix often denotes "the", it is always removed by a light stemmer. In this example, however, it does not stand for "the" but is simply part of the word; it would thus be incorrectly removed by a light stemmer, preventing the detection of the word and of its singular form لون (color). Another example is the word قوانين (laws), which matches broken plural pattern P9, in which the trailing letters "ين" would also be mistaken for a suffix by a light stemmer and removed.
To build the dictionary, words in the corpus are scanned for matches with any of the BP patterns presented in Table 1. The approach relies on generating candidate transformations for each encountered pattern and on measuring the extent to which the original word that matched a BP pattern collocates with its candidate transformation. The collocation metric employed by this work is normalized pointwise mutual information [12], which is calculated using equation (1), where x represents the original BP word and y its candidate transformation:

npmi(x, y) = \frac{\ln\frac{p(x,y)}{p(x)\,p(y)}}{-\ln p(x,y)}          (1)
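To make the scoring step concrete, equation (1) can be estimated from document frequencies obtained from the index; a minimal sketch, assuming raw document counts are used as probability estimates (the function and its arguments are our own illustration):

    import math

    def npmi(df_x: int, df_y: int, df_xy: int, n_docs: int) -> float:
        """Equation (1): df_x and df_y are the numbers of documents containing
        the BP word x and its candidate singular form y respectively, and
        df_xy is the number of documents containing both."""
        if df_xy == 0:
            return -1.0   # the pair never co-occurs: lowest possible score
        if df_xy == n_docs:
            return 1.0    # the pair co-occurs everywhere: -ln p(x,y) would be 0
        p_x, p_y, p_xy = df_x / n_docs, df_y / n_docs, df_xy / n_docs
        return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

The score ranges from -1 (the pair never co-occurs) to 1 (the pair always co-occurs), which makes a single threshold Ω per pattern easy to interpret.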
A simplified algorithm for building the BP dictionary is as follows:

    seen = {}
    for each document di ∈ collection c
        for each term ti ∈ di and ti ∉ seen
            seen = seen ∪ ti
            if (ti.length >= 4) and (ti.length <= 6)
                for each BP pattern that ti matches
                    for each candidate transformation e of ti
                        if npmi(ti, e.term) > Ω        // Ω is a threshold
                            dictionary.add(new entry(ti, e.term))
Actual implementation is more involved. For broken plurals with transformation patterns that have a possible trailing [ه], a candidate term without the "ه" is generated first; only if this term fails to reach the threshold Ω is the "ه" appended and the search and scoring repeated. Matching each of the patterns is also subject to additional constraints, such as letters a word should not start with, should not end with, or should not have as its second letter. These are summarized in Table 2.

Table 2. Constraints placed on various patterns
| Pattern | Does not start with | Does not end with | Second letter is not |
|---------|---------------------|-------------------|----------------------|
| P1      | no constraints      | no constraints    | no constraints       |
| P2      | و, ا, ي             | ا, ي              | no constraints       |
| P3      | و, ا, ي             | ا, ي              | no constraints       |
| P4      | و, ا, ي             | no constraints    | no constraints       |
| P5      | no constraints      | و, ء, ا, ي        | ي, ت                 |
| P6      | ت, و, ا, ي          | ا, ي              | no constraints       |
| P7      | و, ا, ي             | no constraints    | no constraints       |
| P8      | no constraints      | ا, و, ي           | no constraints       |
| P9      | و                   | ا, ي              | no constraints       |
| P10     | no constraints      | no constraints    | ل                    |
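These constraints lend themselves to a simple per-pattern filter applied before candidate generation. The following is a minimal sketch under this reading of Table 2; the data layout and function name are our own illustration, not the authors' code:

    # Restrictions from Table 2 (two rows shown; empty sets mean no constraints).
    CONSTRAINTS = {
        "P5": {"not_start": set(), "not_end": set("وءاي"), "second_not": set("يت")},
        "P9": {"not_start": set("و"), "not_end": set("اي"), "second_not": set()},
    }

    def passes_constraints(word: str, pattern_id: str) -> bool:
        """Reject a pattern match if the word violates any letter restriction."""
        c = CONSTRAINTS[pattern_id]
        return (word[0] not in c["not_start"]
                and word[-1] not in c["not_end"]
                and word[1] not in c["second_not"])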
Also, when searching for any term in the corpus, whether the term is the original BP or the proposed singular form, both the term itself and the term prefixed with "ال" are entered and OR-ed in the search query. Before evaluating the system on a large dataset, it was first applied to a small collection of 39 documents. This collection is part of the larger dataset described in the next section, which was fully indexed before experimenting with the smaller subset. The purpose of this experimentation was to determine the best threshold Ω for each of the presented patterns and to detect any problems with the algorithm. One problem that emerged during this experimentation was related to the detection and conversion of pattern P1. When applying the transformation rules of this BP pattern, many nouns that happened to match the pattern, and that should have been left as is, were incorrectly mapped to closely related verbs that co-occurred with them. An example of such a faulty mapping is that of the word ذهول (amazement) to the verb ذهل (amaze). This happened at a frequency that threatened to considerably affect the precision of this pattern's mapping. To avoid such faulty mappings, the knowledge that the prefix "ال" never attaches to verbs was employed: a factor called alScore was introduced based on this observation. The alScore factor is calculated using equation (2):
alScore = \frac{\text{total number of documents in which } (ال + \text{candidate term}) \text{ appears}}{\text{total number of documents in which the candidate term appears on its own}}          (2)
For any valid mapping, the alScore is calculated. If this score falls below a threshold α, the mapping is not added to the dictionary; otherwise it is. This restriction improved the precision of this rule considerably. Since the documents used were small (average number of words = 220), collocation counts were carried out at the level of an entire document. For larger documents, however, it might be preferable to carry out the search using a proximity window.
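A minimal sketch of the alScore check follows; the doc_count helper standing in for a Lucene document-count query is hypothetical, not the authors' code:

    AL = "ال"   # the definite-article prefix, which never attaches to verbs

    def al_score(candidate: str, doc_count) -> float:
        """Equation (2): the fraction of the candidate's documents in which it
        also appears prefixed with AL. Low scores suggest a verb, not a noun."""
        total = doc_count(candidate)
        if total == 0:
            return 0.0
        return doc_count(AL + candidate) / total

    # A P1 mapping is kept only if its singular candidate behaves like a noun:
    # if al_score(singular, doc_count) >= ALPHA: add (bp_word, singular)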
4 Experiment and Results
To experiment with the proposed approach, the Arabic Newswire A Corpus [13], which consists of 383,872 articles collected from the Agence France Presse (AFP) Arabic Newswire, was used. As stated in the previous section, documents in the corpus were indexed using Lucene [11], with SGML tags removed before indexing. The evaluation metrics used were precision, recall, F-score, accuracy, and specificity (the true negative rate). Each of these metrics is defined as follows. Let:
TP = the number of correctly extracted dictionary entries
TN = the number of correctly rejected dictionary entries; these represent words that fit a broken plural pattern but were correctly rejected due to the constraints introduced by the proposed model
FP = the number of incorrectly extracted dictionary entries
FN = the number of broken plural words that match one of the target BP patterns and should have been added as entries in the dictionary, but were ignored
Then:
Precision = \frac{TP}{TP + FP}          (3)

Recall = \frac{TP}{TP + FN}          (4)

F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}          (5)

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}          (6)

Specificity = \frac{TN}{TN + FP}          (7)
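Equations (3)–(7) can be computed directly from the counts reported in Table 3; a minimal helper (our own, for illustration):

    def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
        """Equations (3)-(7), returned as percentages."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return {
            "precision": 100 * precision,
            "recall": 100 * recall,
            "f_score": 100 * 2 * precision * recall / (precision + recall),
            "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
            "specificity": 100 * tn / (tn + fp),
        }

    # e.g. the overall counts from Table 3: metrics(566, 5425, 67, 193)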
To determine the values of the above metrics, a native Arabic speaker was asked to go over all instances where a term matched a BP pattern, to indicate whether or not the term should be mapped to a singular form, and, where appropriate, to select the value of the singular form from the suggested alternatives. When in doubt, the native speaker referred to the on-line dictionary Almaany [14]. The system-generated dictionary was then compared to the manually created one. Table 3 shows the pattern transformation statistics, while Table 4 displays the overall performance of the system based on the above metrics.

Table 3. The different patterns and their matching results
| Pattern | TP  | FP | FN  | TN   | Used Ω Threshold |
|---------|-----|----|-----|------|------------------|
| P1      | 99  | 19 | 54  | 2693 | 0.1              |
| P2      | 14  | 1  | 2   | 10   | 0.1              |
| P3      | 72  | 1  | 26  | 104  | 0.001            |
| P4      | 9   | 0  | 3   | 46   | 0.001            |
| P5      | 171 | 30 | 42  | 915  | 0.15             |
| P6      | 63  | 3  | 20  | 223  | 0.15             |
| P7      | 33  | 4  | 9   | 294  | 0.1              |
| P8      | 12  | 0  | 7   | 183  | 0.2              |
| P9      | 87  | 9  | 23  | 952  | 0.15             |
| P10     | 6   | 0  | 7   | 5    | 0.1              |
| Overall | 566 | 67 | 193 | 5425 |                  |

The Ω thresholds used were set empirically using a small subset of documents, as described in the previous section. Table 3 shows that the pattern with the most word matches is P1, followed by P5 and P9; despite these patterns' high match rates, the system manages to successfully avoid making incorrect mappings. Patterns P10 and P4 appear to be the rarest of all patterns, while pattern P5 accounts for the highest number of correct entries in the dictionary.

Table 4. Overall system performance
| Pattern | Precision % | Recall % | F-score % | Accuracy % | Specificity % |
|---------|-------------|----------|-----------|------------|---------------|
| P1      | 83.9        | 64.7     | 73.1      | 97.5       | 99.3          |
| P2      | 93.3        | 87.5     | 90.3      | 88.9       | 90.9          |
| P3      | 98.6        | 73.5     | 84.2      | 86.7       | 99.1          |
| P4      | 100         | 75       | 85.7      | 94.8       | 100           |
| P5      | 85.1        | 80.3     | 82.6      | 93.8       | 96.8          |
| P6      | 95.5        | 75.9     | 84.6      | 92.6       | 98.7          |
| P7      | 89.2        | 78.6     | 83.5      | 96.2       | 98.7          |
| P8      | 100         | 63.2     | 77.4      | 96.5       | 100           |
| P9      | 90.6        | 79.1     | 84.5      | 97.0       | 99            |
| P10     | 100         | 46.2     | 63.2      | 61.1       | 100           |
| Overall | 89.5        | 74.5     | 81.3      | 95.8       | 98.8          |
From Table 4, it can be concluded that the overall precision and accuracy of the system are quite high for a fully automated dictionary builder. Recall, however, while not low, could certainly be higher. An analysis of the reasons for the lower recall revealed that the majority of BP words that should have been included in the dictionary, but were not, did not collocate with their singular forms anywhere in the corpus. Experimenting with light stemming of words that do not match any of the BP patterns may improve this recall value.
5 Conclusion
This paper presented a corpus based approach for automatically building dictionaries that map broken plurals to their singular forms. Evaluation of the system has shown that it can carry out this task with a high level of accuracy and precision. The main advantage of the proposed approach is that it can be applied to any corpus, whether domain specific or even colloquial, as colloquial Arabic broken plurals follow the same transformation patterns as the more formal Modern Standard Arabic. The generated dictionaries can be easily integrated into any Arabic stemmer or text mining application.
References

1. Goweder, A., De Roeck, A.: Assessment of a Significant Arabic Corpus. In: Proceedings of the Arabic NLP Workshop at ACL/EACL, pp. 73–79. Toulouse, France (2001)
2. Goweder, A., Poesio, M., De Roeck, A.: Broken Plural Detection for Arabic Information Retrieval. In: SIGIR'04, pp. 566–567 (2004)
3. El-Beltagy, S.R., Rafea, A.: An Accuracy Enhanced Light Stemmer for Arabic Text. ACM Transactions on Speech and Language Processing 7, 2–23 (2011)
4. Goweder, A., Poesio, M., De Roeck, A., Reynolds, J.: Identifying Broken Plurals in Unvowelised Arabic Text. In: EMNLP'04, Barcelona, Spain (2004)
5. Khoja, S., Garside, R.: Stemming Arabic Text. Computing Department, Lancaster University, Lancaster, UK (1999)
6. Goweder, A.M., Almerhag, I.A., Ennakoa, A.A.: Arabic Broken Plural Recognition Using a Machine Translation Technique. In: ACIT'2008, Hammamet, Tunisia (2008)
7. McCarthy, J.J.: A Prosodic Account of Arabic Broken Plurals (1983)
8. Kiraz, G.A.: Analysis of the Arabic Broken Plural and Diminutive. In: The 5th International Conference and Exhibition on Multi-Lingual Computing, Cambridge, UK (1996)
9. Xu, J., Croft, W.B.: Corpus-Based Stemming Using Co-occurrence of Word Variants. ACM Transactions on Information Systems 16, 61–81 (1998)
10. Larkey, L.S., Ballesteros, L., Connell, M.E.: Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis. In: Proceedings of SIGIR'02, Tampere, Finland (2002)
11. Apache: Lucene, http://lucene.apache.org/
12. Bouma, G.: Normalized (Pointwise) Mutual Information in Collocation Extraction. In: Proceedings of the Biennial GSCL Conference, pp. 31–40 (2009)
13. LDC: Arabic Newswire A Corpus, http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T55 (1994)
14. Almaany: Almaany On-line Dictionary, http://www.almaany.com/