An efficient stemming for Arabic Text Classification

Attia Nehar
Département d'informatique, A.T. University, BP. 37G, 03000 Laghouat, Algeria
Email: [email protected]

Djelloul Ziadi
Université de Rouen, Rouen, France
Email: [email protected]

Hadda Cherroun
Département d'informatique, Laghouat, Algeria
Email: hadda [email protected]
Abstract—Using the N-gram technique without stemming is not appropriate in the context of Arabic Text Classification. To address this, we introduce a new stemming technique, which we call "approximate stemming", based on the use of Arabic patterns. These patterns are modeled using transducers, and stemming is performed without relying on any dictionary. This stemmer is then used in the context of Arabic Text Classification.

Index Terms—Arabic, classification, kernels, transducers, Arabic patterns
0.1. INTRODUCTION

Text Classification (TC) is the task of automatically sorting a set of documents into one or more categories from a predefined set [1]. Text classification techniques are used in many domains, including mail spam filtering, article indexing, Web searching, automated population of hierarchical catalogues of Web resources, and even automated essay grading.

Arabic is spoken by more than 422 million people, which makes it the fifth most widely used language in the world [2]. Arabic has three forms: Classical Arabic (CA), Modern Standard Arabic (MSA), and Dialectal Arabic (DA). CA includes classical historical, liturgical and old literature texts. MSA includes news media and formal speech. DA includes the predominantly spoken vernaculars and has no written standard. The Arabic alphabet consists of the following 28 letters (ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي) and the Hamza (ء). Three of these letters (ا، و، ي) are vowels and the rest are consonants. There is no upper or lower case for Arabic letters. Like all Semitic languages, Arabic is written from right to left. Arabic differs from other languages syntactically, morphologically and semantically. It is a Semitic language whose main characteristic feature is that most words are built up from roots by following certain known patterns and adding prefixes and suffixes.

Due to the complexity of the Arabic language, text classification is a challenging task. Many algorithms have been developed to improve the performance of Arabic TC systems [3], [4], [5], [6], [7], [8]. In general, an Arabic text classification system can be divided into three steps:
1. preprocessing step: punctuation marks, diacritics, stop words and non-letter characters are removed.
2. feature extraction: a set of features is extracted from the text; these features represent the text in the next step. For instance, Khreisat [6] used the N-gram technique to extract features from documents, while Syiam et al. [8] used stemming to extract features.
3. learning task: many supervised algorithms have been used to teach systems how to classify Arabic text documents: Support Vector Machines [5], [7], K-Nearest Neighbors [8] and many others. Most algorithms rely on distance measures over the extracted features to decide how similar two documents are.

In the second step, a feature vector is constructed; it represents the document in the third step. Many stemming approaches have been developed [9]. S. Khoja and R. Garside [10] developed a dictionary-based stemmer. It gives good performance, but the dictionary needs to be maintained and updated. The stemming algorithm of Al-Serhan et al. [11] finds the three-letter roots of Arabic words without depending on any root dictionary or pattern files. Many Arabic words have the same root but not the same meaning, and stemming two semantically different words to the same root can induce classification errors. To prevent this, light stemming is used in TC algorithms [12]. The main idea behind this technique is that many words generated from the same root have different meanings. Light-stemming algorithms make several passes over the text and attempt to locate and remove the most frequent prefixes and suffixes from each word. This strategy, however, leads to a large number of features. In the third step, many distance measures can be used to compute the distance between documents from these feature vectors.

In this paper, we introduce a new stemming technique which does not rely on any dictionary. It is based on the use of transducers, which we will also use to measure the distance between documents in our framework.

This paper is organized as follows. Section 0.2 presents, in more detail, the feature selection techniques, namely brute stemming and light stemming. Our new stemming approach, called "approximate stemming", is described in Section 0.3. Next, in Section 0.4 we summarize the framework in which our new feature selection method is used and explain the kernel similarity measure. Finally, in Section 0.5 we conclude and highlight some perspectives.
0.2. STEMMING TECHNIQUES

In the context of TC, stemming is used to reduce the dimensionality of the feature vector. It consists of transforming each Arabic word in the text into its root.
A. Brute Stemming

There are many stemming techniques used in the context of TC. They can be classified into two classes:
• Stemming using a dictionary: a dictionary of Arabic word stems is needed. Khoja's stemmer [10] is an example of this class.
• Stemming without a dictionary: stems are extracted without depending on any root or pattern files. The algorithm of Al-Serhan et al. [11] is an example of this class.

Khoja's stemmer removes the longest suffix and the longest prefix. It then matches the remaining word against verbal and noun patterns, using a dictionary, to extract the root. The stemmer makes use of several linguistic data files, such as lists of diacritic characters, punctuation characters, definite articles and stop words. This stemmer gives good performance but relies on a dictionary which needs to be maintained and updated.

The second technique, due to Al-Serhan et al. [11], finds the three-letter roots of Arabic words without depending on any root or pattern files. Word roots are extracted by assigning weights and ranks to the letters that constitute a word. Weights are real numbers in the range 0 to 5; the assignment of weights to letters was determined by a statistical study of Arabic documents. Table 1 gives the groups of letter weight assignments. The rank of a letter in a word depends on both the length of that word and on whether the word contains an odd or even number of letters. Table 2 shows the assignment of ranks to letters, where N is the number of letters in the word. Once the rank and weight of every letter in a word are determined, each letter weight is multiplied by the letter rank. The three letters with the smallest product values constitute the root. Table 3 gives an example of applying this algorithm; a sketch of this weight-and-rank procedure is given after Table 3. This algorithm, like any other brute stemming algorithm, gives the same stem for two semantically different words, which can decrease the performance of the classification system.
Table 1: Assignment of weights to letters. Specific groups of Arabic letters are assigned the weights 5, 3.5, 3 and 2, and the rest of the Arabic alphabet forms one group with a single weight.
Letter position from right | Rank if word length is even | Rank if word length is odd
1                          | N                           | N
2                          | N − 1                       | N − 1
3                          | N − 2                       | N − 2
[N/2]                      | N/2 + 1                     | [N/2]
[N/2] + 1                  | N/2 + 1 − 0.5               | [N/2] + 1 − 1.5
[N/2] + 2                  | N/2 + 2 − 0.5               | [N/2] + 2 − 1.5
[N/2] + 3                  | N/2 + 3 − 0.5               | [N/2] + 3 − 1.5
N                          | N − 0.5                     | N − 1.5

Table 2: Ranks of letters (N is the number of letters in the word).

Word: an eight-letter Arabic word.

Letter position | Weight | Rank | Product
1               | 5      | 7.5  | 37.5
2               | 0      | 6.5  | 0
3               | 0      | 5.5  | 0
4               | 5      | 4.5  | 22.5
5               | 0      | 5    | 0
6               | 2      | 6    | 12
7               | 1      | 7    | 7
8               | 5      | 8    | 40

Root: the three letters with the smallest products (the 2nd, 3rd and 5th letters).

Table 3: An example of applying the Al-Serhan et al. algorithm.
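For clarity, the weight-and-rank rule can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation; the weights and ranks of Table 3 are supplied directly, since the full letter-group assignment is given in Table 1.

```python
def extract_root(letters, weights, ranks):
    """Al-Serhan et al. rule: the three letters with the smallest
    weight * rank products constitute the root."""
    products = [w * r for w, r in zip(weights, ranks)]
    # indices of the three smallest products, kept in word order
    chosen = sorted(sorted(range(len(letters)), key=lambda i: products[i])[:3])
    return [letters[i] for i in chosen]

# Values taken from Table 3 (the letters themselves are shown by position only).
letters = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"]
weights = [5, 0, 0, 5, 0, 2, 1, 5]
ranks   = [7.5, 6.5, 5.5, 4.5, 5, 6, 7, 8]
print(extract_root(letters, weights, ranks))   # -> ['L2', 'L3', 'L5']
```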
B. Light Stemming

In Arabic, many word variants do not have similar meanings or semantics (like the two words مكتبة, which means library, and كاتب, which means writer). However, these word variants give the same root if brute stemming is used. Thus, brute stemming affects the meanings of words. Light stemming [12] aims to enhance Text Classification performance while retaining the meanings of words.

0.3. STEMMING USING TRANSDUCERS

In this section, we explain our new stemmer. First, we introduce the notion of weighted transducers. Then, we explain how to build a stemming model using transducers.

Notation: Σ represents a finite alphabet. The length of a string x ∈ Σ* over that alphabet is denoted by |x|, and the complement of a subset L ⊆ Σ* by L̄ = Σ* \ L. |x|_a denotes the number of occurrences of the symbol a in x. K denotes either the set of real numbers R, the rational numbers Q or the integers Z.

A. Weighted Transducers

Transducers are finite automata in which each transition is augmented with an output label in addition to the familiar input label. Output labels are concatenated along a path to form an output sequence, as is done with input labels. Weighted transducers are finite-state transducers in which each transition carries a weight in addition to the input and output labels. The weight of a pair of input and output strings (x, y) is obtained by summing the weights of the paths labeled with (x, y). The following definition formally defines weighted transducers [13].

Definition 1: A weighted finite-state transducer T over the semiring (K, +, ·, 0, 1) is an 8-tuple T = (Σ, ∆, Q, I, F, E, λ, ρ) where Σ is the finite input alphabet of the transducer, ∆ is the finite output alphabet, Q is a finite set of states, I ⊆ Q the set of initial states, F ⊆ Q the set of final states, E ⊆ Q × (Σ ∪ {ε}) × (∆ ∪ {ε}) × K × Q a finite set of transitions, λ : I → K the initial weight function, and ρ : F → K the final weight function mapping F to K.
For a path π in a transducer, p[π] denotes the origin state of that path and n[π] its destination state. The set of all paths from the initial states I to the final states F labeled with input string x and output string y is denoted by P(I, x, y, F). A transducer T is regulated if the output weight associated by T with any pair of input-output strings (x, y), given by

T(x, y) = ∑_{π ∈ P(I, x, y, F)} λ(p[π]) · w[π] · ρ(n[π])    (1)

is well-defined and in K. T(x, y) = 0 if P(I, x, y, F) = ∅. Fig. 0.1 shows an example of a simple transducer for an input string x and an output string y. The only possible path in this transducer is the singleton set P({0}, x, y, {4}), and T(x, y) = 0.0625.
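To make the definition concrete, the following is a minimal Python sketch (not the authors' implementation) of a weighted transducer over the real semiring. Epsilon transitions are omitted for simplicity, and the toy alphabet, states and weights are illustrative only.

```python
from collections import defaultdict

class WeightedTransducer:
    """A small weighted finite-state transducer over the real semiring (R, +, *).
    Transitions carry an input symbol, an output symbol and a weight;
    epsilon transitions are not handled in this simplified sketch."""

    def __init__(self, initial, final):
        self.initial = dict(initial)    # state -> lambda(state)
        self.final = dict(final)        # state -> rho(state)
        self.trans = defaultdict(list)  # state -> [(in_sym, out_sym, weight, next_state)]

    def add_transition(self, src, a, b, w, dst):
        self.trans[src].append((a, b, w, dst))

    def weight(self, x, y):
        """T(x, y): sum over all accepting paths labeled (x, y) of
        lambda(start) * product of transition weights * rho(end), as in (1)."""
        total = 0.0
        def walk(state, i, j, acc):
            nonlocal total
            if i == len(x) and j == len(y) and state in self.final:
                total += acc * self.final[state]
            for a, b, w, dst in self.trans[state]:
                if i < len(x) and a == x[i] and j < len(y) and b == y[j]:
                    walk(dst, i + 1, j + 1, acc * w)
        for q0, lam in self.initial.items():
            walk(q0, 0, 0, lam)
        return total

# Toy example: a 3-state transducer mapping "ab" to "ba" with weight 0.25.
T = WeightedTransducer(initial={0: 1.0}, final={2: 0.5})
T.add_transition(0, "a", "b", 1.0, 1)
T.add_transition(1, "b", "a", 0.5, 2)
print(T.weight("ab", "ba"))   # 1.0 * 1.0 * 0.5 * 0.5 = 0.25
```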
Figure 0.1: Transducer corresponding to a measure (Arabic pattern).
Regulated weighted transducers are closed under the following operations, called rational operations:

• the sum (or union) of two weighted transducers T1 and T2 is defined by:

∀(x, y) ∈ Σ* × Σ*,  (T1 ⊕ T2)(x, y) = T1(x, y) ⊕ T2(x, y)    (2)

• the product (or concatenation) of two weighted transducers T1 and T2 is defined by:

∀(x, y) ∈ Σ* × Σ*,  (T1 ⊗ T2)(x, y) = ⊕_{x = x1·x2, y = y1·y2} T1(x1, y1) ⊗ T2(x2, y2)    (3)

The composition of two weighted transducers T1 and T2 with matching input and output alphabets Σ is a weighted transducer denoted by T1 ◦ T2 when the sum

(T1 ◦ T2)(x, y) = ∑_{z ∈ Σ*} T1(x, z) · T2(z, y)    (4)

is well-defined and in K for all x, y ∈ Σ*.
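As an illustration of equation (4), the following minimal sketch (ours, not from the paper) represents a weighted transducer with finite support as a Python dictionary mapping (input string, output string) pairs to weights, and implements composition directly from the definition. The transliterated strings are hypothetical placeholders.

```python
# A weighted transducer with finite support can be represented as a dictionary
# mapping (input string, output string) pairs to weights over the real semiring.

def compose(T1, T2):
    """Composition following equation (4): (T1 o T2)(x, y) = sum_z T1(x, z) * T2(z, y)."""
    result = {}
    for (x, z1), w1 in T1.items():
        for (z2, y), w2 in T2.items():
            if z1 == z2:
                result[(x, y)] = result.get((x, y), 0.0) + w1 * w2
    return result

# Toy example mirroring the stemming use case below: T_word maps a (hypothetical,
# transliterated) word to itself with weight 1, and T_measure maps it to a candidate root.
T_word = {("maktab", "maktab"): 1.0}
T_measure = {("maktab", "ktb"): 0.5}
composed = compose(T_word, T_measure)
print(composed)                                          # {('maktab', 'ktb'): 0.5}

# Output projection: candidate roots are the output strings with non-zero weight.
print([y for (_, y), w in composed.items() if w != 0])   # ['ktb']
```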
B. Stemming by transducers

Arabic differs from other languages syntactically, morphologically and semantically. One of its main characteristic features is that most words are built up from roots by following certain fixed patterns¹ and adding prefixes and suffixes. For instance, the Arabic word مدرسة (school) is built from the three-letter root درس (to learn) using one of the measures² (see Table 4), then the suffix ة (which denotes the female gender) is added.

Table 4: Measures for words built from a three-letter root (each measure is paired with the corresponding word derived from the root درس).

We will use measures to construct a transducer which performs stemming. Fig. 0.1 shows the transducer of one measure. This transducer (denoted Tmeasure) can be used to extract the three-letter root of any Arabic word matching this measure. This is achieved by applying the composition operation (4). We consider Tword, the transducer which maps the string word to itself, i.e., the only possible path is the singleton set P({s}, word, word, {q}) (Fig. 0.2 shows the transducer associated with an Arabic word: Fig. 0.2a gives a weighted version of the transducer and Fig. 0.2b an unweighted one). The composition of two transducers is also a transducer:

(Tword ◦ Tmeasure)(word, y) = ∑_{z ∈ Σ*} Tword(word, z) · Tmeasure(z, y)

Since the only possible string matching z is z = word, we conclude that:

(Tword ◦ Tmeasure)(word, y) = Tword(word, word) · Tmeasure(word, y)

We have Tword(word, word) = 1, so:

(Tword ◦ Tmeasure)(word, y) = Tmeasure(word, y)

Figure 0.2: Transducer corresponding to a word. (a) Weighted transducer version. (b) Unweighted transducer version.

If word matches the measure, the output projection extracts the root (or stem) y associated with word.

In Arabic, there are 4 verb prefixes, 12 noun prefixes and 28 suffixes. When diacritics are considered, there are more than 3000 patterns. Since we do not consider diacritics in our approach, there are fewer than 200 patterns, and many of them are not used in Modern Standard Arabic. For example, several diacritized variants of the same pattern reduce to a single pattern once the diacritics are removed.

To construct a transducer that includes all possible measures, we adopt the following process:
1. Build the transducer of all noun prefixes;
2. Build the transducer of all noun patterns;
3. Build the transducer of all noun suffixes;
4. Concatenate the transducers obtained in steps 1, 2 and 3;
5. Build the transducer of all verb prefixes;
6. Build the transducer of all verb patterns;
7. Build the transducer of all verb suffixes;
8. Concatenate the transducers obtained in steps 5, 6 and 7;
9. Sum the two transducers obtained in steps 4 and 8.

¹ Also called measures or binyan.
² In a pattern, the letter ف denotes the first letter of the three-letter root, ع the second, and ل the third.
The first and third steps are very simple: we construct a transducer for each prefix (resp. suffix) and then take the union of these transducers. The resulting transducer represents the prefix (resp. suffix) transducer (see Fig. 0.3 and Fig. 0.4). For the second step, we build all possible noun pattern transducers; the sum of these transducers is the transducer of all noun patterns. We proceed in the same way to build the transducer of all verb patterns. The final transducer is obtained by taking the sum (union) of the transducers built in steps 4 and 8. Tables 5 and 6 show some examples of noun and verb patterns. The resulting transducer cannot be represented graphically here because of its number of states (about 400); Fig. 0.5 shows the verb measures part of this transducer. This transducer can stem any well-formed Arabic word, i.e., a word which matches some Arabic measure. In addition, it gives us semantic information about the stemmed word; this information can be used to improve the quality of the classification system.

Figure 0.3: Noun and verb prefixes. (a) Noun prefixes. (b) Verb prefixes.
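The nine-step construction can be sketched with the same finite-relation representation of transducers used earlier. This is only an illustration; the prefix, pattern and suffix entries below are hypothetical transliterated placeholders rather than the actual Arabic affix and pattern lists.

```python
def union(*transducers):
    """Sum (union) of weighted transducers: weights of identical pairs are added."""
    result = {}
    for T in transducers:
        for pair, w in T.items():
            result[pair] = result.get(pair, 0.0) + w
    return result

def concat(T1, T2):
    """Product (concatenation): labels are concatenated and weights multiplied."""
    result = {}
    for (x1, y1), w1 in T1.items():
        for (x2, y2), w2 in T2.items():
            pair = (x1 + x2, y1 + y2)
            result[pair] = result.get(pair, 0.0) + w1 * w2
    return result

# Hypothetical, transliterated placeholders for the real Arabic affix and pattern lists.
noun_prefixes = {("al", ""): 1.0, ("", ""): 1.0}   # a prefix, or no prefix
noun_patterns = {("maktab", "ktb"): 1.0}           # a single illustrative noun pattern entry
noun_suffixes = {("a", ""): 1.0, ("", ""): 1.0}    # a suffix, or no suffix
verb_prefixes = {("ya", ""): 1.0, ("", ""): 1.0}
verb_patterns = {("drus", "drs"): 1.0}             # a single illustrative verb pattern entry
verb_suffixes = {("", ""): 1.0}

noun_T = concat(concat(noun_prefixes, noun_patterns), noun_suffixes)   # steps 1-4
verb_T = concat(concat(verb_prefixes, verb_patterns), verb_suffixes)   # steps 5-8
final_T = union(noun_T, verb_T)                                        # step 9
print(final_T[("almaktaba", "ktb")])                                   # 1.0
```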
Table 5: Examples of noun patterns, grouped by length (3-letter, 4-letter, 5-letter, 6-letter and 7-letter patterns).
Table 6: Examples of verb patterns, grouped by form (3-letter, 4-letter, 3-letter + 1, 3-letter + 2, 3-letter + 3 and 4-letter + 1 patterns).

0.4. FRAMEWORK FOR ARABIC TEXT CLASSIFICATION

In the following, we explain how we use our transducer to measure the distance between documents. As mentioned above, our classification system is divided into three components:
1. preprocessing step;
2. feature extraction: our transducer is applied to each word of the document produced by step 1, which yields a document made of the concatenation of the word stems. A transducer is then built from this document and used in the next step;
3. learning task: many algorithms could be used to classify the documents. Since documents are represented by transducers, we use a rational kernel to measure the distance between documents [14], [15].
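As an illustration of this pipeline, the following sketch (ours, not the authors' code) chains a simple preprocessing step, a placeholder stemmer standing in for the transducer of Section 0.3, and a count-based n-gram kernel, which is one simple instance of the rational kernels of [14]. All function names and the regular expressions are illustrative assumptions.

```python
import re
from collections import Counter

ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")   # fathatan .. sukun

def preprocess(text):
    """Step 1: remove diacritics and keep only sequences of Arabic letters."""
    text = ARABIC_DIACRITICS.sub("", text)
    return re.findall(r"[\u0621-\u064A]+", text)

def stem(word):
    """Placeholder for the transducer-based stemmer of Section 0.3."""
    return word

def ngram_counts(stems, n=3):
    """Step 2: n-gram counts over the concatenation of the word stems."""
    s = " ".join(stems)
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_kernel(stems1, stems2, n=3):
    """Step 3: count-based n-gram kernel between two stemmed documents."""
    c1, c2 = ngram_counts(stems1, n), ngram_counts(stems2, n)
    return sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())

doc1 = [stem(w) for w in preprocess("...")]   # "..." stands for a real Arabic document
doc2 = [stem(w) for w in preprocess("...")]
print(ngram_kernel(doc1, doc2))
```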
Figure 0.4: Noun and verb suffixes
Figure 0.5: Verb patterns
0.5. CONCLUSION AND FUTURE DIRECTIONS

This paper presents a new stemming approach used in the context of Arabic text classification. It is based on the use of transducers both for stemming words and for measuring the distance between documents. First, the transducer for stemming is built by means of the Arabic patterns. Second, transducers are also used to compute distances. A thorough experimental evaluation of this stemmer in the context of Arabic Text Classification is the object of future work.

REFERENCES

[1] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, pp. 1–47, 2002.
[2] "Arabic language - Wikipedia, the free encyclopedia." [Online]. Available: http://ar.wikipedia.org/wiki/
[3] S. Al-Harbi, A. Almuhareb, A. Al-Thubaity, M. S. Khorsheed, and A. Al-Rajeh, "Automatic Arabic text classification," in Proceedings of the 9th International Conference on the Statistical Analysis of Textual Data, March 2008.
[4] R. M. Duwairi, "Arabic text categorization," Int. Arab J. Inf. Technol., vol. 4, no. 2, pp. 125–132, 2007.
[5] T. F. Gharib, M. B. Habib, and Z. T. Fayed, "Arabic text classification using support vector machines," International Journal of Computers and Their Applications, vol. 16, no. 4, pp. 192–199, December 2009.
[6] L. Khreisat, "A machine learning approach for Arabic text classification using N-gram frequency statistics," Journal of Informetrics, vol. 3, no. 1, pp. 72–77, January 2009.
[7] A. M. Mesleh, "Support vector machines based Arabic language text classification system: Feature selection comparative study," in Advances in Computer and Information Sciences and Engineering, T. Sobh, Ed. Dordrecht: Springer Netherlands, 2008, pp. 11–16.
[8] M. M. Syiam, Z. T. Fayed, and M. B. Habib, International Journal of Intelligent Computing and Information Sciences, no. 1, January.
[9] M. Y. Al-Nashashibi, D. Neagu, and A. A. Yaghi, "Stemming techniques for Arabic words: A comparative study," in 2010 2nd International Conference on Computer Technology and Development (ICCTD). IEEE, November 2010, pp. 270–276.
[10] S. Khoja and R. Garside, "Stemming Arabic text," 1999.
[11] H. Al-Serhan, R. A. Shalabi, and G. Kannan, "New approach for extracting Arabic roots," in Proceedings of the 2003 Arab Conference on Information Technology (ACIT 2003), Alexandria, Egypt, December 2003, pp. 42–59.
[12] M. Aljlayl and O. Frieder, "On Arabic search: Improving the retrieval effectiveness via a light stemming approach," in Proceedings of the ACM Eleventh Conference on Information and Knowledge Management, 2002, pp. 340–347.
[13] J. Berstel, Transductions and Context-Free Languages. Teubner Studienbücher, 1979.
[14] C. Cortes, P. Haffner, and M. Mohri, "Rational kernels: Theory and algorithms," Journal of Machine Learning Research, vol. 5, pp. 1035–1062, December 2004.
[15] C. Cortes, L. Kontorovich, and M. Mohri, "Learning languages with rational kernels," in Proceedings of the 20th Annual Conference on Learning Theory (COLT 2007), 2007, pp. 349–364.