Stemming Hausa text: using affix-stripping rules and ...

2 downloads 0 Views 504KB Size Report
derivational morphology (Schuh 2012; Newman 2000; Smirnova 1982). ..... Acknowledgments We gratefully acknowledge the support of Paul Newman, Indiana ...
Lang Resources & Evaluation DOI 10.1007/s10579-015-9311-x PROJECT NOTES

Stemming Hausa text: using affix-stripping rules and reference look-up Andrew Bimba1 · Norisma Idris1 · Norazlina Khamis1 · Nurul Fazmidar Mohd Noor1

© Springer Science+Business Media Dordrecht 2015

Abstract Stemming is a process of reducing a derivational or inflectional word to its root or stem by stripping all its affixes. It is been used in applications such as information retrieval, machine translation, and text summarization, as their preprocessing step to increase efficiency. Currently, there are a few stemming algorithms which have been developed for languages such as English, Arabic, Turkish, Malay and Amharic. Unfortunately, no algorithm has been used to stem text in Hausa, a Chadic language spoken in West Africa. To address this need, we propose stemming Hausa text using affix-stripping rules and reference lookup. We stemmed Hausa text, using 78 affix stripping rules applied in 4 steps and a reference look-up consisting of 1500 Hausa root words. The over-stemming index, under-stemming index, stemmer weight, word stemmed factor, correctly stemmed words factor and average words conflation factor were calculated to determine the effect of reference look-up on the strength and accuracy of the stemmer. It was observed that reference look-up aided in reducing both over-stemming and under-stemming errors, increased accuracy and has a tendency to reduce the strength of an affix stripping stemmer. The rationality behind the approach used is discussed and directions for future research are identified.

& Andrew Bimba [email protected]; [email protected] Norisma Idris [email protected] Norazlina Khamis [email protected] Nurul Fazmidar Mohd Noor [email protected] 1

Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia

123

A. Bimba et al.

Keywords Stemming algorithm · Hausa language · Root word · Over-stemming · Under-stemming · Stemmer accuracy

1 Introduction Since 1968, when Lovins (1968) first officially developed a stemming algorithm for English, various modifications have been performed on the algorithm. This improvement has produced other stemming methods such as Porter’s Stemmer (1980), Dawson’s Stemmer (1974), Paice/Husk’s Stemmer (1990), etc. Stemming can be defined as a computational or manual procedure which reduces all words with the same root or stem to a common form, usually by stripping each word of its derivational and inflectional affixes’ (Lovins 1968). For example, words like Connected, Connecting, Connection, and Connections will be truncated to Connect. Stemming is applied to improve the efficiency of information retrieval by providing searches with means of discovering other morphological variants of their search items (Frakes and Baeza-Yates 1992). Normally a single stem relates to several full terms, therefore storing the stem instead of the term will go a long way in reducing the size of the indexed files (Frakes and Baeza-Yates 1992). Currently, a few stemming algorithms have been developed for various languages such as Arabic (Alhanini et al. 2011), Turkish (Sever and Bitirim 2003), Malay (Darwis et al. 2012), and Amharic (Alemayehu and Willett 2002). Unfortunately, no literatures have been found at the time of writing this paper on any stemming algorithm for the Hausa language. Hausa is a Chadic language spoken in West Africa, mostly northern Nigeria and Niger Republic, having speakers in northern Ghana, Togo, Cameroun and Benin Republic. According to ‘Ethnologue’, Hausa language has approximately 25 million native speakers and about 18 million second language speakers all located in 13 different countries in Africa (Lewis 2009). This makes Hausa the 38th most spoken language in the world and consisting of the most native speakers south of the Sahara Desert (Schuh 2012; Lewis 2009; Newman 2007). To improve the efficiency of information retrieval and indexing of Hausa text, there is a need to develop a method to stem Hausa text. This paper describes our effort to stem Hausa text and investigate the suitability of our approach. Section 2 reviews the background of the Hausa language. Section 3 summarizes the stemming algorithms for a few languages. Section 4 describes the proposed method for stemming Hausa text. The evaluation and result obtained from experiments are discussed in Sect. 5. Finally, Sect. 6 concludes the paper with a suggestion for future work.

2 A brief on Hausa language grammar and morphology The first writing system of the Hausa language, called ajami, was based on the Arabic script. Presently, the Romanized orthography, called boko, is used, since its introduction by the British at the beginning of the twentieth century (Newman 2000; Jaggar 2001). The standard Hausa alphabet consists of 28 characters with 23 from

123

Stemming Hausa text: using affix-stripping rules and…

the English alphabet excluding q, v and x with the addition of ɓ, ɗ, ƙ, ‘and ƴ’ (Newman 2000). The Hausa language consists of a large amount of inflectional and derivational morphology (Schuh 2012; Newman 2000; Smirnova 1982). Like most African languages, Hausa has a variety of reduplicative processes in the nominal, adjectival, and verbal systems, but most of the morphologies, both derivational and inflectional, are related to nominal gender, number and to the verbal system to mark lexical verb classes and verbal derivational functions called “extensions” (Schuh 2012; Newman 2000; Smirnova 1982). Hausa nouns are either simple or derived. Simple nouns have their origin traced from the present form such as gona (farm), karfi (strength), while derived nouns are produced by adding prefixes, suffixes or confixes and in rare cases infixes. Example of nouns with confix (both prefix and suffix) is marowaci (greedy person); having the prefix ma and suffix ci the root word is rowa (greediness, stinginess). Example of noun with suffix is makafta (blindness) has the suffix ta and the root word makafo (blind man). Noun plurals in the Hausa language are derived by either reduplication, addition of suffix, a change in the last vowel or a complete change of the root word (Smirnova 1982; Newman 2000). Reduplication of the last syllable takes the form (b)obi, (d)odi, (f)ofi, (g)ogi, (ƙ)oki (s)oshi, (t)oti or (t)oci, (w)oyi, (y)oyi, etc. for Table 1 Some affixes in the Hausa language

Affix class

Affixes

Prefix

ma

ya-l

ba

yaya-n

ta

‘ya-r

abi-n

ya-t

wuri-n

ya-n

da-n

na

da-m Suffix

taka

oki

ci

oshi

nci

ni

wa

ki

ya

Aye

ye

aje

odi

ai

ofi

una

oyi

je

ke

u

woyi

ka

di

ntaka

shi

da

ta

nye

Infix

tta

a

Confix

ma – ci

ma-ta

123

A. Bimba et al.

example ƙofofi (doors) and ƙofa (door). Example of addition of suffix is rana (day) and ranaki (days). Example of change in last vowel is fitila (lamp) and fitilu (lamps). Nouns that have internal plurals are jirgi (plane) and jirage (planes), sariki (king) and sarakuna (kings), doki (horse) and dawakai (horses) etc. The nouns with internal plural have both have a suffix in addition to the infix. Verbs in Hausa language can either be primitive or derived. Examples of primitive verbs are ci (eat), sha (drink), ji (hear) etc. The formation of derived verbs is through the addition of a suffix to the stem, with or without modification of the stem. Examples of derivational verbs are tsorata (frighten) and tsoro (fear), ci (eat), canye or cinye (devour) etc. Infixes occur in few cases such as the infix tta which is used for littattafai (books) a plural of littafi (book) and infix ‘a’ as occurs in gurgu (cripple) and guragu (cripples) (Newman 2000). Table 1 shows some important affixes in the Hausa language.

3 Related works Kraaij and Pohlmann (1994) suggested that stemming algorithm for information retrieval is effective in languages with a complex morphology (Kraaij and Pohlmann 1994). Over the years researchers have worked in developing stemming algorithms for the English language and other languages such as Arabic, Malay, and Turkish etc. These researches are focused mostly on modifying the stemming algorithm developed by Porter (1980), because of its popularity and effectiveness in information retrieval systems. 3.1 English stemmers Lovins stemmer (1968), the first English stemmer, was developed based on keywords in material science and engineering documents. The algorithm consisted of 294 endings, 29 conditions and 35 transformation rules, where every ending is linked to any of the conditions. Stripping of a word to its stem is done in 2 stages, the stemming stage and the recoding stage (Lovins 1968). The stemming stage involves stripping the longest ending found that fulfills its corresponding condition. The ending is then modified by applying the transformation rules in the recoding stage (Lovins 1968; Smirnov 2008; Jivani 2011). Dawson (1974), extended the Lovins stemming algorithm using a broad list of 1200 suffixes, which are stored in the reversed order indexed by their length and last letter. The rule for stripping a suffix using this algorithm is when the word is not shorter than a specific number and its suffix is preceded by a specific order of characters. This makes it more complex than the Lovins algorithm although it is fast and covers more suffixes (Smirnov 2008; Jivani 2011). Porter’s stemmer was introduced in 1980 by Martin Porter (Porter 1980). This stemming algorithm is derived on the basis that there are approximately 1200 suffixes in English language, which comprises of a mix of smaller and simpler suffixes (Porter 1980). It comprises 5 steps, with a series of 60 arranged rules which uses a partial matching concept to reduce the complexity of the stemming algorithm,

123

Stemming Hausa text: using affix-stripping rules and…

although the stemmer may not necessarily produce a root word. This algorithm has resulted 6370 different entries in the vocabulary of stems and less errors than the Lovins stemmer. However, it was slower than Lovins due to its 5 steps whereas Lovins has only 2 steps (Jivani 2011). Paice/Husk stemmer (Paice 1990; Smirnov 2008; Jivani 2011) developed a stemming algorithm, which iterates through a table that consists of 120 rules that are indexed by the last letter of the suffix (Paice 1990; Jivani 2011). It has a higher tendency for over stemming than the Porter’s algorithm (Smirnov 2008; Jivani 2011). 3.2 Arabic stemmers For stemming Arabic text, 3 methods are employed; dictionary-based stemming, light-based stemming and unsupervised approach will be discussed. In the dictionary-based stemming, the root word is derived based on linguistic lexicons. It involves 3 steps; pre-processing, entity recognition and stem identification. To remove redundant spaces, sentence boundaries are identified in the pre-processing stage and the text is tokenized and passed to the morphological analyzer. In the entity recognition stage, Arabic personal names are identified. Finally, the stem is extracted based on suffixes, prefixes, clitics and stem. In the light-based methodology, 3 steps are involved; word identification, word segmentation and pattern matching. In the first step common words and non-derivational nouns are identified. In the segmentation stage the words are divided into its components; affix, stem and clitics based on some rules. Finally, in the pattern matching stage the root word is extracted by matching it with some patterns (Alhanini et al. 2011). Alhanini and Aziz offered an enhanced stemming algorithm for Arabic text (Alhanini et al. 2011). It was developed to overcome the disadvantages posed by the dictionary-based and light-based stemming. Based on their experiment with an inhouse newspaper corpus, they achieved an accuracy of 96 % as compared to the light and dictionary base which have 85 and 88 % accuracy, respectively (Alhanini et al. 2011). The unsupervised approach used by Khorsi (2012) learns from a corpus of about 14,000 traditional distinct Arabic words mapped to their stems. This approach extracts morphological segments from classical Arabic words using a set of all the distinct n-grams acquired from a plain text input mapped to the frequency. The basic concept is that a letter is irremovable from the start or end of a word if the remaining substring depends relatively on this letter. This unsupervised approach deals with the eventual imperfection in the input text such as misspellings, and achieved about 90 % true positives in stemming of classical Arabic text (Khorsi 2012). 3.3 Malay stemmers The Malay stemming algorithm has gone through various improvements since it was first introduced by Othman (Darwis et al. 2012). It is a rule-based algorithm using 121 affix stripping rules. Ahmad et al. (1996) decided to improve Othman’s algorithm which had some over-stemming and under-stemming errors as a result of

123

A. Bimba et al.

no dictionary lookup and dependency on the order of the affixes. Sembok’s algorithm included a stem dictionary lookup and some additional list of affixes. Idris and Mustapha (2001) then developed a stemmer which is similar to Sembok’s that greatly improved the recall of an automated essay grading system for Malay History text. This improvement was made by altering the order of processing the affixes and including this process in the steps of the algorithm rather than looping through a file as seen in Sembok’s algorithm (Idris and Mustapha 2001). Regardless of the techniques applied, stemming errors still existed in the earlier Malay Language Stemmers. Hence, a recent approach proposed the use of a Malay Word Register to solve the ambiguity problem and an exhaustive affix stripping approach to tackle under-stemming and over-stemming errors (Darwis et al. 2012). This approach performed well with 99.80 % accuracy. 3.4 Turkish stemmers There are a few stemming algorithms developed for Turkish language, such as Longest-Match (L-M) algorithm, an algorithm that is centered on a word search logic done on a lexicon or dictionary that contains Turkish stems and their possible alteration, and Aysin–Fazli (A–F) algorithm (Solak and Can 1994), which uses a dictionary of Turkish stems annotated with sixty-four tags indicating how to produce root words (Sever and Bitirim 2003). Currently, a more accurate stemming algorithm for the Turkish language called FindStem was developed. The algorithm consists of three components; ‘Find the Root’, ‘Morphological Analysis’, and ‘Choose the Stem’ and has a pre-processing stage that handles clitics and converts the words in a text to a smaller case (Sever and Bitirim 2003). The FindStem algorithm improved document retrieval effectiveness by about 25 % in terms of precision. 3.5 Amharic stemmers The Amharic stemmer involves the truncation of words to their root form by deletion of both prefixes and suffixes (Alemayehu and Willett 2002). This is done by some iterative procedures, which use a minimum stem length, recoding and contextsensitive rule similar to the Lovins algorithm (1968). During the affix removal process, the prefix is deleted before the suffix, ensuring the minimum stem length condition of 2 consonants. The algorithm was tested against a sample file of 1211 words and showed an accuracy of 95.90 %, 2.70 % over stemming and 1.40 % under stemming errors (Alemayehu and Willett 2002). 3.6 Discussion Based on the few stemmers reviewed, it can be seen that several approaches have been used in their implementations. These differences are seen in the complexity of the languages. For instance, English stemmers consist of some affix stripping rules, which includes transformation rules for recoding the stem, verifying the vowel and consonant combination of the word ensuring it is not shorter than a specified value. More complex languages such as Arabic used dictionary-based, light stemming and

123

Stemming Hausa text: using affix-stripping rules and…

unsupervised approach, Malay used affix rules, dictionary reference and recently the Malay Word Register to solve the problem with word ambiguity and the Turkish, which implemented the Longest-Match (L-M) algorithm, A–F algorithm and FindStem. Other languages such as Amharic used some iterative procedures that use a minimum stem length, recoding and context-sensitive rule just like the English stemmers. The proposed approach to developing the Hausa stemming algorithm is based on the affix stripping rules used in English, Malay, Amharic and Turkish stemmers. The dictionary approach is used to reduce under-stemming errors.

4 Approach to stemming Hausa text 4.1 Overview The proposed approach for stemming Hausa text is based on 78 affix stripping rules and a reference lookup. It comprises a series of steps to remove certain suffix, prefix, infix or confix based on some substitution rules and a condition which specifies the vowel-consonant sequences in the resulting stem. Similar to other affix stripping based stemmers, some of its substitution rules are determined according to the combination of letters preceding the affix. For example, words with prefix ma and suffix ci having a combination *VwVci, where V stands for vowel, then the combination will be replaced by *VwV such as in marowaci (miser, greedy man) \ rowa (greediness). Also words having a combination *Cvuci where C stands for consonant will be replaced by *CVuta, such as in mafauci (butcher) \ fauta (slaughter), marubuci (writer) \ rubuta (write). Similarly, such computations are made for other words with suffix oshi, awa, nci. 4.2 Hausa affix-rules with reference lookup The Hausa Affix-rules and reference lookup was developed based on the information obtained from the books ‘The Hausa language: a descriptive grammar’ (Smirnova 1982) and ‘The Hausa Language: An Encyclopedic Reference Grammar’ (Newman 2000). References were also made to the dictionary; ‘The Hausa Lexicographic Tradition’ (Newman and Newman 2001). The reference lookup consist of 1500 root words, which were compiled based on common words used in the Hausa language, words found in an old Hausa folktale and news articles which range from politics, religion, sports, crime and conflict. The affix-rules ensure that all common affixes, inflectional morphology such as noun plurals, common noun and verb derivational affixes are covered. 4.3 The basic algorithm The Hausa stemmer is developed based on the rule-based approach. It comprises 4 steps which are:

123

A. Bimba et al.

Step 1a: Check word against reference lookup; if word exist go to next word else stem word reduplication and concrete nouns formed from prefixed particles and other nouns that use hyphens. For example: guje-guje (race) \ guje (ran) Step 1b: Stem plural of compound nouns formed with prefix ma and suffix ai, ta, and awa. For example: madafai \ madafa (kitchen) marubuta \ marubuci (writer) Step 1c: Stem plural of compound nouns formed with prefix ma and have the suffix ta. For example: mafauta \ mafauci (butcher) mahaukata \ mahaukaci (madman) Step 1d: Stem abstract nouns formed from a verb or adjective with suffix ci and prefix ma. For example: marowaci (greediness) \ rowa (to be greedy) Stem concrete nouns formed from verb with ma prefixed and ya suffixed. For example: mashaya (dinking place) \ sha Step 2: Check word against reference lookup; if word exist go to next word else stem verbal nouns formed with the suffix wa and abstract nouns formed from simple nouns with suffixes ci (masculine) and ta (feminine) -n-ci (masculine of Kano origin) and -n-taka (feminine of Sokoto origin), Plural of compound nouns formed with particles and other plural formations. For example: rubuce-rubuce (writing) \ rubutu (write) tayasuwa (helping) \ taya (to help) rantsuwa (swearing, oath) \ rantse (to swear) tadowa (raising) \ tada (to raise) gadonci \ gadontaka (inheritance) \ gado (inheritance) fauta (slaughter) \ fawa (slaughtering) Step 3: Check word against reference lookup; if word exist go to next word else check for reduplication of the last syllable taking the forms (b)obi, oci, (d)odi, (f) ofi, (g)ogi, (l)oli, (k)oki, (o)omi, (r)ori, (s)osi, (t)oti or (t)oshi, (w)owi, (y)oyi to signify plurals and stem. For example: goribobi \ goriba (a palm fruit) batoci \ bata (small box made of skin) kofofi \ kofa (door) bindigogi \ bindiga (gun) fitiloli \ fitila (lamp)

123

Stemming Hausa text: using affix-stripping rules and…

Step 4: Check word against reference lookup; if word exist go to next word else stem some common plurals by stripping common suffix, prefix and infix. For example: Suffix: abokai \ aboki (friend) shaiduna \ shaida (witness) kafirai \ kafiri (heathen) Prefix: tarwatsa (scattering) \ watsa (to scatter) tsattsaga (separate) \ tsaga (cut, divide) Infix: littattafi \ littafi The flow chart of the algorithm is shown in Fig. 1. During each step, a reference is made to the dictionary if the word is found it exits and goes to the next word else it checks for a matching rule. If a rule matches a word, then whatever consequences tied to that rule is tested on the stem to be produced, to see if the affix is to be removed or replaced as indicated by the rule. The affix is taken out once a rule’s conditions are satisfied and the command is passed to the following step. When the rule is not satisfied, the following rule in that step is then applied, until a rule in that step is met and command is passed to the next step or after all the rules in that step has been exhausted, hence moving control to the following step. This procedure will continue to stem all the words until, the words in the given text is exhausted. The stemmer was implemented in Python Programing Language which is heavily used in industry, scientific research, and education around the world (Kuhlman 2012; Bird et al. 2009). To store 1500 Hausa root words for the reference lookup, MySQL was chosen as the database management system.

5 Performance evaluation 5.1 Stemmer evaluation Several techniques have been suggested for evaluating the strength, similarities and accuracy of stemming algorithms (Paice 1994; Frakes and Fox 2003; Sirsat et al. 2013). The methods we considered in this research do not evaluate stemmers based on their effect on retrieval performance, as this does not give understanding of the specific causes of errors. These methods are: 5.1.1 Paice’s evaluation method In evaluating English based stemmers Paice did not use the customary precision/ recall factors, but as an alternative presented over-stemming index (OI), under-

123

A. Bimba et al.

Start

Select Next Word

Yes

Is Word in Dictionary? No

Apply Rule in Step 1 to word Yes Rule Satisfied?

Yes

No

Is Word in Dictionary? No

Apply Rule in Step 2 to word Yes

Is Word in Dictionary?

Yes

Rule Satisfied? No

No

Apply Rule in Step 3 to word

Rule Satisfied?

Yes

No

Is Word in Yes Dictionary? No

Apply Rule in Step 4 to word

Yes

Are words Exhausted?

No

Fig. 1 Flow chart of Hausa stemmer

stemming index (UI) and their relationship, the stemming weight (SW). The Paice approach provides insights for stemmer optimization, unlike the traditional method which evaluates stemmers based on retrieval performance (Paice 1994). The Paice evaluation method requires a predefined list of groups of words which are morphologically and semantically related. An ideal stemmer should be able to merge all the semantically equivalent words in a group to a common stem which is not a stem in another group. A stemmer has made under-stemming errors if a stemmed group has more than one unique stem. Subsequently, a stemmer has overstemmed if a stem from a certain group also occurs in another stemmed group (Kraaij and Pohlmann 1995). The over-stemming index signifies the proportion of words, in different groups, which have been reduced to the same stem. In order to calculate the OI two parameters are required.

123

Stemming Hausa text: using affix-stripping rules and… ●

Desired Non-merge Total (DNT): Since an ideal stemmer is not expected to merge any word in a group with a word in another group, the desired non-merge total for that group DNTg is calculated by the formula: DNTg ¼ 0:5ng ðW  ng Þ where ng, the number of words in the group; W, the total number of unique words in the sample. For example, considering one group (g1) in our sample of 1723 unique words: By summing the total value of DNTg over all the groups in the sample, we obtain the global desired non-merge total GDNT. g1 = {doka (law), dokar, dokoki, dokokin, dokokinsa} W = 1723 ng = 5 DNT1 ¼ 0:5  5ð1723  5Þ



Wrongly-merged total (WMT): In situations where a similar stem occurs in more than one group, the number of wrongly merged words is calculated by the formula: X WMTg ¼ 0:5 v i ð ns  v i Þ i¼1...t

where t, number of groups that share the same stem; ns, the number of instances of that stem; vi, the number of stems for each group t. For example, considering two groups g1 and g2 with the stemmer output sg1 and sg2 respectively: g1 = {baya (back), bayan, bayar, bayansu, bayarwa} g2 = {bayanai (explanations), bayanaka, bayanan, bayananka, bayanansu, bayani, bayanin} sg1 = {baya, baya, baya, baya, bayar} sg2 = {baya, bayanaka, baya, baya, baya, bayani, bayani} v1 = 4 the number of the stem baya in sg1 v2 = 4 the number of the stem baya in sg2 ns = 8 the total number of the stem baya in sg1 and sg2 WMT1 ¼ 0:5ð4  ð8  4Þ þ 4  ð8  4ÞÞ After summing the WMTg for each group we obtain the Global Wrongly-Merged Total (GWMT). The Over-stemming Index is represented as:

123

A. Bimba et al.

OI ¼

GWMT GDNT

The under-stemming index signifies portion of the words in a group (g), which failed to be merged by the stemmer. Similar to OI, the UI is calculated based on two parameters. ●

Desired Merge Total (DMT): Since a perfect stemmer is required to merge all the words in a group to one another, the desired merge total is calculated as:   DMTg ¼ 0:5ng ng  1 For example, looking at a group g1 in our sample having 4 words: g1 = {aboki (friend), abokansa, abokan, abokin} DMT1 ¼ 0:5  4ð4  1Þ ¼ 6 The 6 pairs can be seen below as: Pair Pair Pair Pair Pair Pair

1 2 3 4 5 6

= = = = = =

{abokan, abokansa} {abokan, aboki} {abokan, abokin} {abokansa, aboki} {abokansa, abokin} {aboki, abokin}

A summation of the total value of DMTg across all the groups in the sample gives the global desired merge total GDMT. Unachieved Merge Total (UMT): After stemming some groups may contain more than one distinct stem. The failure of the stemmer to combine all the words in a group to a single stem is referred to as the unachieved merge total (UMT), represented by the formula: UMTg ¼ 0:5

X

  ui ng  ui

i¼1...s

where s, the number of distinct stems; ui, the number of instances of each stem. For example, a stemmed group sg1: sg1 = {ciki (inside), cik, ciki, ciki, ciki, ciki} ns = 6 the total number of the stems ciki and cik in sg1 u1 = 5 number of stem ciki in sg1 u2 = 1 number of stem cik in sg1 UMT1 ¼ 0:5ð5  ð6  5Þ þ 1  ð6  1ÞÞ

123

Stemming Hausa text: using affix-stripping rules and…

Summing the UMTg for each group, we get the global unachieved merge total (GUMT). The Under-stemming Index is represented as: UI ¼

GUMT GDMT

The strength of the stemmer known as the stemming weight is measured as the ratio of OI and UI. SW ¼

OI UI

A high or low value of SW indicates a strong (heavy) stemmer or a weak (light) stemmer respectively (Kraaij and Pohlmann 1995). 5.1.2 Sirsat’s evaluation method Sirsat et al. (2013), proposed a useful method to measure the strength and accuracy of a stemming algorithm. They introduced the Word Stemmed Factor (WSF) to compute the strength of a stemmer. Also, they used the Correctly Stemmed Words Factor (CSWF) and Average Words Conflation Factor (AWCF) to measure the accuracy of a stemmer. ●

Word Stemmed Factor (WSF): This is the ratio of the words stemmed by the stemmer and the total number of words in the sample. It has a minimum threshold of 50 %. The strength of the stemmer is greater if the WSF is larger. It is calculated as: WSF ¼



where WS, number of words stemmed; TW, number of words in a sample. Correctly Stemmed Words Factor (CSWF): It is a ratio of the words stemmed correctly by the stemmer and the number of words stemmed. The minimum threshold is 50 %. A stemmer has a higher accuracy if the value of CSWF is higher. CSWF is calculated as: CSWF ¼



WS  100 TW

CSW  100 WS

where CSW, number of correctly stemmed words; WS, total number of words stemmed. Average Words Conflation Factor (AWCF): This represents the average number of different words of diverse conflation group that are stemmed correctly. In computing AWCF, the number of unique words after conflation is first calculated as:

123

A. Bimba et al.

NWC ¼ S  CW where S, number of distinct stem after stemming; CW, number of correct words not stemmed. Thus, AWCF is calculated as: AWCF ¼

CSW  NWC  100 CSW

A stemmer has a higher accuracy if the value of AWCF is higher. 5.2 Result analysis To measure the performance of stemming Hausa text, Paice’s (1994) and Sirsat’s (2013) evaluation methods were applied to 436 online news articles in Hausa language. The Hausa text consisted of 32,236 words among which 3460 are unique. A total of 490 groups of morphologically and semantically related words were formed from the unique words. This amounted to 1723 words. A sample, of the groups created is shown in Table 2. Two versions of the algorithm for stemming Hausa text were evaluated. The first version (HStemV1) consisted of only affix stripping rules. While the second version (HStemV2) consisted of affix stripping rules and a reference lookup. We applied the two versions to our sample group to obtain the values of OI, UI and SW. The results are shown in Table 3. HStemV2 had a lower OI than HStemV1. This is because the reference lookup reduced the portion of words merged, from different groups of morphologically and semantically related words. For example, HStemV1 merged words from 4 different groups: ragu (reduce), rana (day), ransa (his life), runduna (company). Table 2 Sample group of related words S/N

Related words

1

alama, alamar, alamomin, alamu, alamun

2

bangare, bangaren, bangarori, bangarorin

3

gida, gidaje, gidajen, gidajensu, gidan, gidanka, gidansa gidansu

4

haifa, haifar, haife, haife shi’, haifi, haihu, haihuwa, haihuwar

5

jarirai, jariran, jariransu, jariranta, jaririya, jaririyar

6

makamai, makaman, makamanta, makamashi, makamashin, makami

7

rubuce, rubuta, rubutawa, rubutu

8

shawara, shawarar, shawararin, shawarwari

9

tsaka, tsakanin, tsakaninsa, tsakaninsu, tsaki, tsakiya, tsakiyar

10

zumunci, zumuncin, zumunta, zumuntan

123

Stemming Hausa text: using affix-stripping rules and… Table 3 Result using Paice’s Evaluation Method Stemmer

OI

HStemV1

0.39 × 10−4

HStemV2

−4

0.24 × 10

UI

SW

0.15

2.65 × 10−4

0.14

1.68 × 10−4

But, HStemV2 merged words from only 2 different groups: ragu (reduce) and ransa (his life). Similarly, the UI of HStemV2 was lower than HStemV1. This shows that a reference lookup can reduce the number of semantically equivalent words, which are unsuccessfully merged while stemming. For example, HStemV2 reduced the unsuccessfully merged words by 1, while stemming a group words as compared to HStem V1. A sample result is shown in Table 4. It was also observed from the stemming weight that HStemV1 is heavier than HStemV2. Furthermore, the strength and accuracy of the two stemming approaches were measured using Sirat’s method. The results obtained are shown in Table 5. Table 4 Sample result of under-stemming error

Group

HStemV1-result

HStemV2-result

wasa (play)

Wasa

wasa

wasan

Washi

wasa

wasanni

Washi

wasa

wasannin

Washi

wasa

wasanninta

Was an

wasan

wasansa

Washi

wasa

wasansu

Washi

wasa

Table 5 Result using Sirsat’s method Analysis of stemmers

HStemV1

HStemV2

Total words (TW)

1723

1723

Number of words stemmed (WS)

1197

1132

Words stemmed factor (WSF)

69.47 %

65.7 %

Number of distinct words after Stemming (S)

984

992

Index compression factor (ICF)

75.1 %

73.7 %

Correctly stemmed words (CSW)

971

989

Incorrectly stemmed words (ISW)

226

143

Correctly stemmed words factor (CSWF)

81.12 %

87.37 %

Correct words not stemmed (CW)

526

591

Number of distinct words after conflation (NWC)

458

401

Average words conflation factor (AWCF)

52.83 %

59.45 %

123

A. Bimba et al.

The word stemmed factors (WSF) from the two approaches were well above the 50 % threshold. This indicates the stemmers are strong and aggressive (Sirsat et al. 2013). However, the HStemV1 is more aggressive than the HStemV2. This is in agreement with the Paice’s evaluation method, which shows HStemV1 is heavier than HStemV2. Furthermore, the following were observed: ●



The word stemmed factor (WSF) value for HStemV1 (69.47 %) is higher than HStemV2 (65.7 %), but the CSWF and AWCF are lower. This is because HStemV1 lacks a lookup of root words, thereby making it transform root words to incorrect stem. HStemV1 has a higher tendency to generate under-stemming errors than HStemV2. This is seen by a higher value of NWC. This result corresponds to the outcome obtained from Paice’s evaluation method.

6 Conclusion In this study, we used two methods to stem Hausa text. The first method HStemV1 consisted of only affix stripping rules, while the second method HStemV2 had both affix stripping rules and a reference lookup. Using the Paice’s evaluation method, we observed that reference lookup aided in reducing both over-stemming and understemming errors. Comparison of HStemV1 and HStemV2 based on stemming weight indicates, a reference lookup has a tendency to reduce the strength of an affix stripping stemmer. To further measure the effect of reference lookup on stemming accuracy, we used Sirsat’s method to calculate the CSWF and AWCF. While using reference lookup, a 6.62 % improvement in correctly stemmed words was achieved. These improvements are attributed to the presence of root words in the reference lookup. Even though the reference lookup improved the accuracy of the stemmer, some understemming and over-stemming errors were consistent amongst both methods. This could be attributed to the amount of root words in the reference lookup. Further work can be carried out to determine the effect of the size of the reference lookup on stemming accuracy. Acknowledgments We gratefully acknowledge the support of Paul Newman, Indiana University USA, for the substantive comments and constructive criticisms given.

References Ahmad, F., Yusoff, M., & Sembok, T. M. T. (1996). Experiments with a stemming algorithm for Malay words. Journal of the American Society for Information Science, 47(12), 909–918. Alemayehu, N., & Willett, P. (2002). Stemming of Amharic Words for Information Retrieval. Literary and Linguistic Computing, 17(1) Alhanini, Y., Juzaiddin, M., & Aziz, A. (2011). The enhancement of arabic stemming by using light stemming and dictionary-based stemming. Journal of Software Engineering and Applications, 4, 522–526.

123

Stemming Hausa text: using affix-stripping rules and… Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. Newton: O’Reilly Media Inc. Darwis, S. A., Rukaini, A., & Idris, N. (2012). Exhaustive affix stripping and a Malay word register to solve stemming errors and ambiguity problem in Malay stemmers. Malaysian Journal of Computer Science, 25(4), 196–209. Dawson, J. (1974). Suffix removal and word conflation. Bulletin of the Association for Literary and Linguistic Computing, 2(3), 33–46. Frakes, W. B., & Baeza-Yates, R. (1992). Information retrieval: Data structures and algorithms (pp. 161– 218). Englewood Cliffs, NJ: Prentice Hall. Frakes, W. B., & Fox, C. J. (2003). Strength and similarity of affix removal stemming algorithms. ACM SIGIR Forum, 37(1), 26–30. Idris, N., & Mustapha S. M. F. D. (2001). Stemming for term conflation in Malay texts. In International conference on artificial intelligence (IC-AI. Las Vegas) (pp. 1512–1517). Jaggar, P. J. (2001). Hausa. Reading: John Benjamins Publishing. Jivani, A. G. (2011). A comparative study of stemming Algorithms. International Journal of Computer Technology and Applications, 2(6), 1930–1938. Khorsi, A. (2012). Effective unsupervised Arabic word stemming: Towards an unsupervised radicals extraction. The International Arab Journal of Information Technology, 9(6), 571–577. Kraaij, W., & Pohlmann, R. (1994). Porter’s stemming algorithm for Dutch. In L.G.M. Noordman & W. A.M. de Vroomen (Eds.), Informatiewetenschap 1994: Wetenschappelijke bijdragen aande derde STINFON conferentie (pp. 167–180). Leiden, Netherlands: Stichting Informatiewetenschap Nederland. Kraaij, W., & Pohlmann, R. (1995). Evaluation of a Dutch stemming algorithm. The New Review of Document and Text Management, 1, 25–43. Kuhlman, D. (2012). A python book: Beginning python, advanced python and python exercises. Rexx.com. Lewis, M. P. (2009). Ethnologue: Languages of the World, Sixteenth edition. [online]. http://www. ethnologue.com/. Accessed 4 Dec 2012 Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1&2), 22–31. Newman, P. (2000). The Hausa language: An encyclopedic reference grammar. New Haven: Yale University Press. Newman, P. (2007). A Hausa-English dictionary. New Haven: Yale University Press. Newman, R., & Newman, P. (2001). The Hausa lexicographic tradition. Lexikos, 11, 263–286. Paice, C. D. (1990). Another stemmer. ACM, SIGIR Forum, 24(3), 56–61. Paice, C. D. (1994). An evaluation method for stemming algorithms. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (pp. 42– 50). New York, NY: Springer-Verlag. Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130–137. Schuh, R. G. (2012). A Hausa story and Hausa verb morphology UCLA [online]. http://www.linguistics. ucla.edu/people/schuh/lx105/. Accessed 4 Dec 2012 Sever, H., & Bitirim, Y. (2003). FindStem: Analysis and evaluation of a Turkish stemming algorithm. LNCS, 2857, 238–251. Sirsat, S. R., Chavan, V., & Mahalle, H. S. (2013). Strength and accuracy analysis of affix removal stemming algorithms. International Journal of Computer Science and Information Technologies, 4 (2), 265–269. Smirnov, I. (2008). Overview of stemming algorithms. Mechanical Translation, 52. Smirnova, M. (1982). The Hausa language a descriptive grammar. London: Routledge & Keagan Paul. Solak, A., & Can, F. (1994). Effects of stemming on Turkish text retrieval. In Proceedings of the ninth international. Symposium on Computer and Information Sciences (ISCIS), pp. 49–56.

123