Building a syntactic rules-based stemmer to improve ... - IEEE Xplore

Building a syntactic rules-based stemmer to improve search effectiveness for Arabic language

Walid Cherif

Abdellah Madani

Mohamed Kissi *

LIMA, Department of Computer, Chouaib Doukkali University, Faculty of Science, B.P. 20, 24000 El Jadida, Morocco. Email: [email protected]

MATIC, Department of Computer, Chouaib Doukkali University, Faculty of Science, B.P. 20, 24000 El Jadida, Morocco. Email: [email protected]

LIMA, Department of Computer, Chouaib Doukkali University, Faculty of Science, B.P. 20, 24000 El Jadida, Morocco. *Email: [email protected]

Abstract-Nowadays, the world is experiencing a huge growth in the volume of exchanged texts, which leaves some of it untapped. Text mining is the set of techniques that analyze these large masses of information, extract relations that may be unknown beforehand and provide solutions that help decision making. In this sense, stemming is a common requirement of these techniques. It consists in reducing the different grammatical forms of a word to a common base form. In what follows, we discuss these processing methods for Arabic text, show their limits and provide a new algorithm to improve them.

Keywords- text mining; light-stemming; stemming; Arabic language; automatic language processing.

I. INTRODUCTION

Arabic is a Semitic language based on the Arabic alphabet, which contains 28 letters. It is one of the six official languages of the United Nations and is the mother tongue of more than 300 million people. Its basic feature is that most of its words are built up from, and can be analyzed down to, common roots [1]. The exceptions to this rule are common nouns and particles. Arabic is a highly inflectional language, with 85% of words derived from triliteral roots. Nouns and verbs are derived from a closed set of around 10,000 roots [2]. Arabic has three genders, feminine, masculine, and neuter; and three numbers, singular, dual, and plural.

With the advent of electronic documents, massive amounts of information are generated. This increase in the volume of texts requires efficient computer tools whose task is to find and extract the relevant information in a condensed form. Arabic has become a focus of research and commercial development in this domain and needs such tools. Because of its morphological and syntactic properties, Arabic is considered a difficult language for automatic language processing [3, 4]. Arabic owes its tremendous expansion from the 7th century onward to the spread of Islam and the dissemination of the Koran. Research on the automatic processing of Arabic began in the 1970s. Early work involved lexicons and morphology. Arabic words are distinguished by their agglutinated forms and their vowelization [5], which makes the stemming process harder. This explains why the fifth most used language in the world has seen little research despite its morphological richness [6, 7].

Stemming is one of many tools used in information retrieval to combat the vocabulary mismatch problem, in which query words do not match document words. Stemming can be defined as the process of removing any affixes (prefixes, infixes, and/or suffixes) from words to reduce these words to their stems or roots. Through our research on existing stemming models, we classified the errors encountered in different texts. Then, we went back to Arabic grammatical and syntactic rules to improve these models.

This paper is organized as follows: in the first part, we define light-stemming and stemming, and compare them. In the second part, we compare existing stemming models, analyze their results and show their limits, before making improvements in our model in the last part.

II. BACKGROUND

The structure of an Arabic word is broken down into five components: the proclitic, the prefix, the root, the suffix, and the enclitic.

TABLE I. WORD DECORTICATION

Component    Proclitic    Prefix    Root       Suffix    Enclitic
Meaning      for          (they)    observe    (they)    them

In Table I, the agglutinated word ��I» is formed by the proclitic J (for), the prefix t.? and the suffix .J (which refer to the 3rd person plural), the enclitic � (them) and the root �IJ (observe). The root, with its prefix and suffix, forms the core vocabulary, possibly surrounded by extensions [8].
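The five-part decomposition above can be illustrated with a minimal Python sketch. The affix inventories, the tiny lexicon, the example word and the names `Segmentation` and `segment` are assumptions made for this illustration; they are not the lists or the implementation used in the paper. The sketch tries every combination of proclitic, prefix, suffix and enclitic, and keeps the segmentations whose remaining core is a known root.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical, deliberately tiny affix inventories (not the paper's lists).
PROCLITICS = ["", "ل", "س", "و"]
PREFIXES = ["", "ي", "ت", "ن", "أ"]
SUFFIXES = ["", "ون", "ان", "ين"]
ENCLITICS = ["", "هم", "ها", "ه"]
KNOWN_ROOTS = {"كتب", "درس"}  # hypothetical lexicon of roots


@dataclass(frozen=True)
class Segmentation:
    proclitic: str
    prefix: str
    root: str
    suffix: str
    enclitic: str


def segment(word):
    """Return every segmentation of `word` whose core is a known root."""
    results = []
    for pro, pre, suf, enc in product(PROCLITICS, PREFIXES, SUFFIXES, ENCLITICS):
        head, tail = pro + pre, suf + enc
        # The affixes must leave at least one character for the core.
        if len(head) + len(tail) >= len(word):
            continue
        if not word.startswith(head) or not word.endswith(tail):
            continue
        core = word[len(head):len(word) - len(tail)]
        if core in KNOWN_ROOTS:
            results.append(Segmentation(pro, pre, core, suf, enc))
    return results
```

With these assumed lists, `segment("سيكتبونها")` yields a single segmentation: proclitic س, prefix ي, root كتب, suffix ون and enclitic ها. A real stemmer would need much larger inventories and a way to rank competing segmentations.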

A. Light-Stemming

Light stemming consists in extracting the root from the word, that is, removing its prefixes, suffixes and extensions.

... "w....1"," (will) are automatically followed by a verb. We expected that other particles like '\}" (in) would be followed by a noun, but our tests led us to correct the type of the following word to "noun or particle":

TABLE IX. PARTICLES NEXT WORD

Example      Particle    Next word       Next word's type
Example 1    (in)        (the school)    Noun
Example 2    (in)        (all)           Particle

In Table IX, we noted that the particle '\}" can be followed by a noun or a particle. We therefore first check whether the next word is a particle; if it is not, we stem it as a noun. We classify particles into those that are followed by verbs, those that are followed by nouns, and those that can be followed by particles or nouns. An example of a particle that can be followed only by nouns is "�": in the sentence "(WI J-ol �", the word "J-oI" is automatically treated as a noun to give "J..i" (hope).

Verbs can also determine the type of the following word: only initiation verbs and incomplete verbs can be followed by another verb. A verb that is not in those two lists and has no enclitic is therefore automatically followed by a noun or a particle: the verb "c.,.>I.:.." (disappointed) in "J-ol c.,.>1.:.." (I was disappointed) cannot be followed by another verb; as the next word "J-oi" is not a particle, we conclude that it is a noun.

3) The Khoja stemmer is based on a list of roots and a list of patterns. This method can lead to two types of errors:

- The verb "::";'1" (to take), which takes the exceptional root "::':":;1" (to take) and is quoted in sentence (3), does not belong to the standard verb patterns.

- Some nouns may be treated as nonexistent verbs: the verb '\)...:..." (to take) does not take the pattern '\.lc\.9". Thus, the noun "J..I.::.." could be treated as a verb. This error interferes with the information on the types of the following words.

Our work required the manual collection of all verbs and their possible individual forms "Jw'll C::.>I.l;y" [28], with their exceptions grouped in a table according to the rules of special verbs: ...�I ,j�1 (,,\.;ll ,�I ,uJ'll) J;,....ll �I (verbs including one of the three letters I (a), J (w), '-? (y)).

4) A noun is defined either by a determinant JI (the), an addition, or an enclitic, but never by a combination of them. The word "��I" (banks), for example, can be broken down into "JI", "o!" and "�". Those two extensions cannot exist at the same time; we conclude that "�" is part of the root. We then obtain a proclitic "JI" (the) and a root "�" (bank). But with similar reasoning on the word ",-?�", we obtain two possible roots: "�" and ",-?�". Hence the need for other indicators from the sentence.

5) Many confusions occur while stemming plural nouns because of the pattern change [29]. A work on a rule-based light stemmer collected patterns for plural forms and their corresponding singular patterns [30]:

TABLE X. SINGULAR AND PLURAL PATTERNS (columns: Plural pattern, Singular pattern)

From Table X, we can reduce, for example, the word ''l.y.>.Jb.." (schools) to " IY'.J�", which gives the stem '\Y'.J�" (to study). But words matching ",,)I....!", ";;.wi" and "Jr.lj" can have different singular patterns. We then check which one belongs to our roots.

The example "� 4b l " (doctors) matches both the plural pattern "J-.-il", which gives "4b", and the plural pattern ".)W", which gives the singular roots "�1", "�j", "ylbl" or "�i", while its singular is "�" (doctor). This is due to the shadda on the "y", which disappears in the unvowelized text.

6) Some standard words, like weekdays and months of the year, are listed as general exceptions, and for nouns, countable standard nouns such as units are added in order to extend our stemming model. In the sentence "l.J:!ii':i1 r3:!A..;lb � 1 Y'1 y:i:iji", as the word "l.J:!ii':iI" (Monday) belongs to the exceptions list, we keep it as it is.
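The plural-pattern lookup described in item 5 can be illustrated with a minimal Python sketch. The pattern pairs, the small lexicon and the function names below are assumptions made for the example; they are not the contents of Table X or the authors' implementation. A word is matched against a plural template in which ف, ع and ل stand for the three radicals, each candidate singular is rebuilt from those radicals, and only candidates found in the lexicon are kept.

```python
# Template letters that stand for the three radicals of a root.
PLACEHOLDERS = ("ف", "ع", "ل")

# Hypothetical plural -> singular pattern pairs (illustrative, not Table X).
PLURAL_TO_SINGULAR = {
    "أفعال": ["فعل"],
    "فعول": ["فعل"],
    "مفاعل": ["مفعل", "مفعلة"],
}

# Hypothetical lexicon used to decide between competing candidates.
KNOWN_SINGULARS = {"قلم", "قلب", "مدرسة"}


def match_pattern(word, pattern):
    """Return the radicals of `word` if it fits `pattern`, else None."""
    if len(word) != len(pattern):
        return None
    radicals = {}
    for letter, slot in zip(word, pattern):
        if slot in PLACEHOLDERS:
            radicals[slot] = letter   # radical position: record the letter
        elif letter != slot:
            return None               # fixed letter of the template differs
    return radicals


def apply_pattern(pattern, radicals):
    """Rebuild a word by substituting radicals into a template."""
    return "".join(radicals.get(slot, slot) for slot in pattern)


def singular_candidates(word):
    """List the known singular forms reachable from a plural word."""
    candidates = []
    for plural, singulars in PLURAL_TO_SINGULAR.items():
        radicals = match_pattern(word, plural)
        if radicals is None:
            continue
        for singular in singulars:
            candidate = apply_pattern(singular, radicals)
            if candidate in KNOWN_SINGULARS and candidate not in candidates:
                candidates.append(candidate)
    return candidates
```

The lexicon check is what resolves the ambiguity discussed above: a plural such as مدارس produces two candidates under the assumed patterns, and only the one present in the lexicon is returned.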

B. Algorithm

We describe step by step how our algorithm works:

1. We start by initializing the word type to unknown, and we reset it at each punctuation mark or enclitic.

2. At this level, we verify the word type in order to apply the appropriate stemming: if it is a verb, we go directly to step 6; if it is a noun, we go to step 8; if it is a noun or particle, we go to step 7; and if it is unknown, we continue to step 3.

3. If the word exists in the general exceptions list, we take it as a stem and move to the next word. Otherwise, we continue.

4. We first try to truncate the possible combinations of particle extensions. If the root belongs to the particles list, we deduce whether the next word is a verb "\j...!", a noun or particle ";;I�1 ) 1"""'1 ", or unknown.

5. If the word has a noun form, we go directly to step 8.

6. We try to truncate the possible combinations of verb extensions. If the root belongs to our verbs list, we extract its stem. If we have removed an enclitic, or if the verb is an initiation verb or an incomplete verb, we reset the type of the next word to unknown. Otherwise, we set it to a noun or a particle.

7. When the word is either a noun or a particle, we proceed by elimination: we check whether, without its extensions, it is among the particles; otherwise, it is treated as a noun.

8. After truncating the possible noun extensions, we first check whether the noun is an exception; otherwise, we match the root against our noun patterns in order to extract the stem. If the noun starts with Jl, we verify whether, with a single J, it belongs to the list of nouns starting with the letter J, in which case we keep it.

C. Results and comparison

In this section, we compare our model to the Khoja stemmer on the six sentences seen before.

TABLE XI. EXAMPLE (1) MODELS COMPARISON

Type assigned to each word, in word order:

Sentences 1 and 2
    Khoja stemmer: Root, Stopword, Root, Stopword, Stopword, Root, Root
    Our model:     Verb, Particle, Noun, Noun, Particle, Noun, Noun

Sentences 3 to 6
    Khoja stemmer: Root, Root, Root, Root, Root, Root, Root, Stopword, Root, Root, Root, Root, Stopword, Root
    Our model:     Verb, Noun, Verb, Noun, Noun, Verb, Noun, Particle, Noun, Verb, Noun, Noun, N. Exception, G. Exception

In Table XI, we compared our results with the Khoja stemmer on the six sentences that summarize the Khoja stemmer's errors.

As "JI..;" is a noun prefix, we apply to "j,'1I..;" a noun stemming to obtain "j,j".

The particle '\)�" is followed by nouns or particles; as "J..l" (my hope) is not a particle, our model has treated it as a noun. We obtained "J..,i" (hope).

By collecting Arabic verb forms, we recognized the irregular pattern "::";':;1 ", which allowed us to obtain the stem " ,li,1 ".

The broken plural "�4bl" gives "4b", "�i", "�j", "ylbl" and "�1". Of those five singular patterns, two exist in our verbs list: "ylbi" and "�i". Both of them refer to the same root "y\..b". Even when broken plural rules are respected, the unvowelized text form leads to confusion.

The noun "�I" contains the prefix "JI" and the suffix "c!l", which cannot appear at the same time in this word. The suffix "c!l" is therefore part of the root. By removing "JI", we obtained "�" (bank).

As noun exceptions and general exceptions are listed in our model, the words "r3:!" and "l.J:!ii'lI" are automatically detected.
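As a rough illustration of the type-tracking steps listed in Section B, the following Python sketch carries an expected word type from one token to the next. Everything in it is an assumption made for the example: the lexicons are tiny placeholders, verb stemming is reduced to a lookup table, noun stemming is reduced to stripping the determinant ال, enclitic handling is omitted, and the expected type is reset to unknown after a noun. It is not the authors' implementation.

```python
from enum import Enum


class WordType(Enum):
    UNKNOWN = "unknown"
    VERB = "verb"
    NOUN = "noun"
    NOUN_OR_PARTICLE = "noun or particle"


# Hypothetical placeholder lexicons (not the paper's lists).
PUNCTUATION = {".", "،", "؛", "؟", "!"}
GENERAL_EXCEPTIONS = {"الاثنين"}                  # e.g. weekdays
PARTICLE_NEXT_TYPE = {                             # particle -> type of next word
    "سوف": WordType.VERB,
    "في": WordType.NOUN_OR_PARTICLE,
    "كل": WordType.NOUN,
}
VERB_STEMS = {"يكتب": "كتب", "كان": "كان", "خاب": "خاب"}
VERBS_ALLOWING_VERB = {"كان", "بدأ"}               # initiation / incomplete verbs


def stem_noun(word):
    """Step 8, reduced here to stripping the determinant."""
    if word.startswith("ال") and len(word) > 4:
        return word[2:]
    return word


def stem_sentence(tokens):
    """Return (stem, label) pairs while tracking the expected word type."""
    expected = WordType.UNKNOWN                    # step 1
    result = []
    for token in tokens:
        if token in PUNCTUATION:                   # step 1: reset at punctuation
            expected = WordType.UNKNOWN
            continue
        if expected is WordType.UNKNOWN and token in GENERAL_EXCEPTIONS:
            result.append((token, "exception"))    # step 3
            continue
        may_be_particle = expected in (WordType.UNKNOWN, WordType.NOUN_OR_PARTICLE)
        if may_be_particle and token in PARTICLE_NEXT_TYPE:
            result.append((token, "particle"))     # steps 4 and 7
            expected = PARTICLE_NEXT_TYPE[token]
            continue
        may_be_verb = expected in (WordType.UNKNOWN, WordType.VERB)
        if may_be_verb and token in VERB_STEMS:
            stem = VERB_STEMS[token]               # step 6
            result.append((stem, "verb"))
            if stem in VERBS_ALLOWING_VERB:
                expected = WordType.UNKNOWN
            else:
                expected = WordType.NOUN_OR_PARTICLE
            continue
        result.append((stem_noun(token), "noun"))  # steps 5 and 8
        expected = WordType.UNKNOWN                # assumption: no constraint after a noun
    return result
```

For instance, in `stem_sentence(["في", "كل", "المدرسة"])` the first particle allows a noun or a particle next, the second particle is recognized as such, and the last word is then stemmed as a noun, which mirrors the two cases of Table IX.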

V. CONCLUSION AND PERSPECTIVES

The complex nature of the Arabic language explains the lack of tools for morphological analysis. The Khoja stemmer defines the affixes and extensions to remove regardless of the type of the processed word, which leads to errors in the process. Through our research on Arabic morphology, we detect a word's type from its pattern, its form and the previous words in order to determine the proper stemming. However, some text does not give us enough information to recognize the types of words, such as the pronoun ''1)0'' (who/from) or ",J" (not/why) in this unvowelized form, whereas with vowelization the types of the following words are known. We collected a list of noun exceptions and an extended base of Arabic verbs to treat verb exceptions. To further optimize our stemming results, we distinguished the possible affix combinations and added a broken plurals procedure.

These stemming improvements will increase precision in opinion mining and text categorization, which can be considered in our future research.

REFERENCES

[1] A. Farag and A. Nürnberger, "A Web Statistics based Conflation Approach to Improve Arabic Text Retrieval," Federated Conference on Computer Science and Information Systems, Szczecin, Poland, pp. 3-9, 2011.

[2] S. Al-Fedaghi and F. Al-Anzi, "A new algorithm to generate Arabic root-pattern forms," 11th National Computer Conference, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, pp. 4-7, 1989.

[3] M. Aljlayl and O. Frieder, "Arabic Search: Improving the Retrieval Effectiveness via a Light Stemming Approach," 11th International Conference on Information and Knowledge Management (CIKM), Virginia (USA), pp. 340-347, November 2002.

[4] L. S. Larkey, L. Ballesteros and M. Connell, "Improving Stemming for Arabic Information Retrieval: Light Stemming and Cooccurrence Analysis," 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR 2002), Tampere, Finland, pp. 275-282, August 2002.

[5] N. Abdusalam, T. Seyed and S. Falk, "Stemming Arabic Conjunctions and Prepositions," 12th International Conference on String Processing and Information Retrieval, Heidelberg, pp. 206-217, 2005.

[6] C. Huot and P. Coupet, "Text mining on Arabic language: application to open source treatment," Journées sur les systèmes d'information élaborée, Île-Rousse, Session 6 - Outils et Applications, TEMIS SA, Paris, France, 2005.

[7] M. P. Lewis, "Ethnologue: languages of the world," Sixteenth edition, Dallas, Tex.: SIL International, 2009.

[8] L. Tuerlinckx, "Stemming from the non-classical Arabic," 7th International Days of Textual Data Statistical Analysis (JADT), Louvain-la-Neuve, Belgium, pp. 1069-1079, 2004.

[9] A. Al-Said, "Simplified rules of Arabic language," Ed. 3, 2006.

[10] H. Masrahia and S. Qorni, "Computing in Arabic language," 2006.

[11] A. Mountassir and H. Benbrahim, "Sentiment Analysis: Supervised classification of Arabic documents," 7th International Conference on Intelligent Systems: Theories and Applications (SITA'12), Mohammadia, Morocco, pp. 282-289, 2012.

[12] C. D. Paice, "Another stemmer," ACM SIGIR Forum, Vol. 24, No. 3, pp. 56-61, 1990.

[13] C. D. Paice, "An evaluation method for stemming algorithms," 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42-50, 1994.

[14] J. Ms. Anjali, "A Comparative Study of Stemming Algorithms," IJCTA, Vol. 2, No. 6, pp. 1930-1938, 2011.

[15] S. Khoja and R. Garside, "Stemming Arabic text," Computer Science Department, Lancaster University, Lancaster, UK, 1999.

[16] R. Al-Shalabi, G. Kanaan and H. Al-Serhan, "New Approach for Extracting Arabic Roots," International Arab Conference on Information Technology (ACIT 2003), Alexandria, Egypt, pp. 42-59, 2003.

[17] R. Al-Shalabi and M. Evens, "A computational morphology system for Arabic," Computational Approaches to Semitic Languages, COLING-ACL98, 1998.

[18] M. Sawalha and E. Atwell, "Comparative Evaluation of Arabic Language Morphological Analysers and Stemmers," COLING (Posters), pp. 107-110, 2008.

[19] D. A. Said, N. M. Wanas, N. M. Darwish and N. H. Hegazy, "A Study of Text Preprocessing Tools for Arabic Text Categorization," 2nd International Conference on Arabic Language Resources and Tools, Cairo, Egypt, pp. 230-236, April 2009.

[20] M. K. Saad and W. Ashour, "Arabic Morphological Tools for Text Mining," 6th International Conference on Electrical and Computer Systems (EECS'10), Lefke, North Cyprus, Nov 25-26, 2010.

[21] R. Duwairi, "Arabic Text Categorization," The International Arab Journal of Information Technology, Vol. 4, No. 2, pp. 125-131, 2007.

[22] S. Khoja, "APT: Arabic part of speech tagger," North American Chapter of the Association for Computational Linguistics, Carnegie Mellon University, Pittsburgh, Pennsylvania, June 2001.

[23] S. Khoja, R. Garside and G. Knowles, "A tagset for the morphosyntactic tagging of Arabic," Corpus Linguistics 2001, Lancaster University, Lancaster, UK, March 2001.

[24] H. M. Harmain, H. El Khatib and A. Lakas, "Arabic Text Mining," IADIS International Conference Applied Computing, Lisbon, Portugal, 2004.

[25] C. Aitao, "Building an Arabic Stemmer for Information Retrieval," 11th Text Retrieval Conference, Berkeley, pp. 631-639, 2003.

[26] L. S. Larkey, L. Ballesteros and M. E. Connell, "Light Stemming for Arabic Information Retrieval," University of Massachusetts, Springer, 2007.

[27] H. M. Harmanani, W. T. Keirouz and S. Raheel, "A Rule-Based Extensible Stemmer for Information Retrieval with Application to Arabic," The International Arab Journal of Information Technology, Vol. 3, No. 3, pp. 265-272, 2006.

[28] A. Dahdah, "Verbs conjugation glossary," Dahdah Encyclopedy of arabic sciences, Lebanon Librairy, Ed, 1995.

[29] A. Goweder, M. Poesio, A. De Roeck and J. Reynolds, "Identifying Broken Plurals in Unvowelised Arabic Text," Empirical Methods in Natural Language Processing, ACL, pp. 246-253, 2004.

[30] M. Ababneh, R. Al-Shalabi, G. Kanaan and A. Al-Nobani, "Building an Effective Rule-Based Light Stemmer for Arabic Language to Improve Search Effectiveness," The International Arab Journal of Information Technology, Vol. 9, No. 4, pp. 368-372, 2012.
