Spelling Correction for Kazakh

Aibek Makazhanov, Olzhas Makhambetov, Islam Sabyrgaliyev, and Zhandos Yessenbayev
Nazarbayev University Research and Innovation System, Astana, Kazakhstan
{aibek.makazhanov, omakhambetov, islam.sabyrgaliyev, zhyessenbayev}@nu.edu.kz

The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-54903-8_44

Abstract. Being an agglutinative language, Kazakh imposes certain difficulties on both the recognition of correct words and the generation of candidate corrections for misspelled words. In this paper we describe a spelling correction method for Kazakh that takes advantage of both morphological analysis and a noisy channel-based model. Our method outperforms both open source and commercial analogues in terms of overall accuracy. We also perform a comparative analysis of the spelling correction tools and point out some problems of spelling correction for agglutinative languages in general and for Kazakh in particular.

1 Introduction

Kazakh is an agglutinative language that belongs to the Turkic group. It has a complex and productive derivational and inflectional morphology. While extensive research has been conducted into the morphological analysis of agglutinative languages [1–3], little work has been done on building tools for the analysis of Kazakh. Being one of the oldest problems in NLP, with arguably the highest demand for a practical solution, automatic spelling correction is one of the basic steps in the analysis of any language. In this paper we describe a spelling correction method for the Kazakh language that takes advantage of both statistical and rule-based approaches. Note that fixing punctuation, structure, or stylistics is out of the scope of the present work.

Spelling correction can be divided into two tasks: word recognition and error correction. For languages with a fairly straightforward morphology, recognition may be reduced to a trivial dictionary lookup: if a given word is absent from the dictionary, then most likely it has been misspelled. Correction is done by generating a list of possible suggestions: usually words within some minimal edit distance of the misspelled word. For agglutinative languages such as Kazakh, even recognizing misspelled words becomes challenging, as a single root may produce hundreds of word forms.¹ It is practically infeasible to construct a dictionary with all possible word forms included: apart from being gigantic, such a dictionary would be all but impossible to verify. For the same reason the correction task becomes challenging as well.

¹ In Kazakh, for example, nominals (nouns, pronouns, participles) produce up to 157 forms, and verbs up to 840 [4], not to mention derivational suffixes that change the POS of a word upon inflection.

One possible solution, successfully applied to agglutinative languages in the past [3, 5], is to use a mixture of lexicon-based and generative approaches: keep a lexicon of roots and generate word forms from that lexicon on the fly. This method requires a generator of word forms. Following the approach presented by Oflazer and Güzey [3], we developed a word-form generator (referred to as the generator hereinafter) and extended it to implement a tool for isolated-word (i.e., context-insensitive) error correction for the Kazakh language. We implemented the generator as an FSA whose states correspond to morphemes and whose transitions correspond to morphological rules. If a given word cannot be generated, it is considered a misspelling, and for such a misspelled word our method generates a list of possible corrections. To rank this list we use a Bayesian argument that combines error and source models. For our error model we employ the noisy channel-based approach proposed by Church and Gale [6]. Our source model is built upon the theoretical aspects used for morphological disambiguation in [2].

The developed tool is evaluated in terms of overall accuracy, top-k precision, and false positive rate. For the purpose of comparison we also experiment with the Kazakh spelling dictionary (KSD) [7], an open source Kazakh spelling corrector, and the Microsoft Office 2010 (MSO) Kazakh language pack [8]. We show that although our method is more accurate than the open source and commercial analogues, the generation of candidate corrections still needs improvement in terms of pruning and better ranking of suggestion lists.

The rest of this paper is organized as follows. Subsection 1.1 briefly outlines our contribution. Section 2 reviews related work. Section 3 thoroughly describes our methodology.
Section 4 describes the experimental setup and analyzes the results. Finally, we draw conclusions and discuss future work in Sect. 5.

1.1 Our Contribution

Our contribution can be summarized in the following two statements: (i) we have built one of the first morphological disambiguators for the Kazakh language, which can be used to generate and segment word forms; (ii) based on this disambiguator, we have implemented a spelling correction tool.

2 Related Work

A large number of studies have addressed the spelling correction problem. Some early approaches were based on comparing a misspelled word to words in a lexicon and suggesting as possible corrections the ones with the minimal edit distance [9, 10]. Another popular approach, used in more recent works [6, 11], is based on applying a noisy channel model [12], which consists of a source model and a channel model. These works differ in how the authors weight the edit operations and in the context-awareness of their source models. While Church and Gale [6] utilize a word trigram model, Mays et al. [11] do not consider context. Later, Brill and Moore [13] proposed an improved

method with a more sophisticated error model where, instead of using single insertions, deletions, substitutions, and transpositions, the authors model substitutions of up to 5-letter sequences that also depend on the position in the word. An interesting method based on neural networks was proposed by Hodge and Austin [14]. The authors use the modular neural system AURA [15], where for checking/correction they employ two correlation matrix memories: one trained on patterns derived from handling typing errors by binary Hamming distance and n-gram shifting, and another trained on patterns derived from handling phonetic spelling errors. Ranking of suggested corrections is accomplished by choosing the maximum score obtained by adding the scores for Hamming distance and n-gram shifting to the score of the phonetic modules.

A classical approach to spelling correction for agglutinative languages is to use FSAs [3, 16, 17]. One of the pioneering works that uses finite state automata for spell checking was presented by Oflazer and Güzey [3]. In the proposed method candidate words are generated using two-level transducers. To optimize the recognizer, the authors prune the paths that generate substrings of candidate words which do not pass some edit distance threshold. Ordering of suggested corrections is accomplished by employing ranking techniques based on statistics of the types of typing errors. In a more recent work by Pirinen et al. [17], the authors use two weighted FSAs, one for the language model and one for the error model, and reorder corrections using POS n-gram probabilities for a given word.

One of the tools we compare our method to, the Kazakh spelling dictionary (KSD), was developed by Mussayeva [7] and is freely available in the form of add-ons to various Mozilla products [18] and as an OpenOffice extension [19]. KSD is based on Hunspell [5], an open source spelling corrector originally developed for Hungarian. To work with any language, Hunspell needs a dictionary and an affix table designed for that language. These two are the essence of KSD. The author reports 51 suffix types included in the affix table [7]. These types represent large grammatical groups such as noun case, person, and verb tense suffixes. Given a word, Hunspell searches its dictionary to determine whether it is correct. If a given word is not found, Hunspell derives possible word forms by appending suitable suffixes. Each word form consists of a root and only one appended suffix (at least with the default settings). In Kazakh, however, suffixes can be, and usually are, appended in longer chains. In the KSD affix table this issue has been partially overcome by collapsing shorter suffix chains into a single suffix.
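To illustrate, a Hunspell affix file can encode such a collapsed chain as a single suffix entry. The fragment below is hypothetical, not taken from KSD's actual tables; it attaches the plural, and the plural plus third-person possessive collapsed into one entry, to two nominal roots:

```
# hypothetical .aff fragment (Hunspell SFX entries)
SFX P Y 2
SFX P 0 лар .     # plural: бала -> балалар
SFX P 0 лары .    # plural + 3rd person possessive, collapsed into one suffix

# matching .dic fragment
2
бала/P
ата/P
```

Since Hunspell appends only one suffix per word form by default, every useful chain must be flattened into such an entry, which is why the KSD affix table can cover only a fraction of the forms a chaining generator produces.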

3 Methodology

Given an input word, the fundamental task of spelling correction is to determine whether it is correct and, if not, to offer a list of corrections. Intuitively, to solve the problem one could try to identify the root of a given word and then generate a list of all possible word forms that can be transformed into the target word using no more than some maximum number of edit operations (usually two). If there is a generated word form that has zero edit distance to the target, then the given word is considered correct. Typically, word forms are generated by appending morphemes (Kazakh has only suffixes) to roots (root-first fashion) and to each other using some sort of automaton. This is exactly how Oflazer and Güzey [3] solve the problem for Turkish. However, there is a

Algorithm 1 The procedure of generating correction suggestions

Require: WRD, MAX_ED  {the target word and the maximum edit distance threshold}
  correct ⇐ false
  if WRD ∈ Lexicon then
    correct ⇐ true  {already correct}
  end if
  suggestions ⇐ []
  chains ⇐ [""]  {we start from a single null-morpheme chain}
  while correct = false and chains ≠ ∅ do
    current_states ⇐ chains
    chains ⇐ []  {empty the morpheme chain list}
    for all c ∈ fsaGetChains(current_states) do
      if withinDistance(c, WRD, MAX_ED) = true then
        appendToList(chains, c)  {keep c if there is a suffix SFX of WRD with editDistance(c, SFX) ≤ MAX_ED}
        for all r ∈ lexGetRoots(c) do
          s ⇐ stringConcat(r, c)  {append the morpheme chain to a candidate root to get a suggestion}
          if editDistance(s, WRD) = 0 then
            correct ⇐ true
          end if
          if editDistance(s, WRD) ≤ MAX_ED then
            appendToList(suggestions, s)
          end if
        end for
      end if
    end for
  end while
  return correct, suggestions
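Algorithm 1 can be sketched in Python as follows. The FSA and lexicon here are toy English-like stand-ins (following the merci/peace/beauti example discussed below), not the actual Kazakh generator and its 18230-entry root lexicon; `fsa_get_chains` and `lex_get_roots` are simplified versions of the procedures named in the algorithm:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Toy stand-ins for the generator FSA (chain -> longer chains) and lexicon.
FSA = {"": ["ful"], "ful": ["fully"], "fully": []}
LEXICON = {"merci", "peace", "beauti"}

def fsa_get_chains(states):
    out = []
    for c in states:
        out.extend(FSA.get(c, []))
    return out

def lex_get_roots(chain):
    # A real lexicon would filter roots by POS compatibility with the chain.
    return sorted(LEXICON)

def within_distance(chain, word, max_ed):
    # True if some suffix of `word` is within max_ed edits of the chain.
    return any(edit_distance(chain, word[i:]) <= max_ed
               for i in range(len(word) + 1))

def suggest(word, max_ed=2):
    correct = word in LEXICON
    suggestions, chains = [], [""]
    while not correct and chains:
        current, chains = chains, []
        for c in fsa_get_chains(current):
            if not within_distance(c, word, max_ed):
                continue      # prune: no suffix of `word` is close to this chain
            chains.append(c)
            for r in lex_get_roots(c):
                s = r + c
                d = edit_distance(s, word)
                if d == 0:
                    correct = True
                if d <= max_ed:
                    suggestions.append(s)
    return correct, suggestions

print(suggest("mercifuly"))   # → (False, ['merciful', 'mercifully'])
```

Note that the loop terminates exactly as described: either an exact match is found, or the FSA runs out of extendable chains.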

potential drawback. There is usually more than one candidate root, especially given that a word can be misspelled. Thus, the same morpheme chain may be generated for several candidate roots, involving extra computation. In contrast, if morpheme chains are generated first (morpheme-first fashion) and the root lexicon is then searched for acceptable² entries, any given chain is generated only once. Let us consider an example for English. Suppose the following list of corrections was generated: merci + ful + ly, peace + ful + ly, beauti + ful + ly. In both the root-first and morpheme-first approaches the root lexicon would have been searched once to get the three candidate roots. However, in the root-first fashion the morpheme chain ful + ly would have been generated three times (once for each candidate root), and in the morpheme-first fashion only once. Following Oflazer and Güzey [3] we build the generator FSA, with the exception of using the morpheme-first approach. We have considered all inflectional suffixes of nominals and verbs and their appending order, i.e. transitions, described in the classic Kazakh

² Vowel harmony must be accounted for. Also, certain morphemes attach only to roots with a certain POS, which in fact narrows down the root search.

grammar. We have also considered some of the frequent derivational suffixes found in the annotated sub-corpus of the Kazakh Language Corpus (KLC) [4]. As a result, the generator consists of 298 states (allomorphs and sub-categories of 55 distinct suffix types) and 329 transitions (not counting transitions between allomorphs). Our POS-labeled root lexicon, comprising 18230 entries, was also derived from KLC. The developed FSA produces morpheme chains which in turn can be appended to roots from the lexicon to produce inflected word forms. In the setting of spelling correction, however, we need to account for misspellings and prune less probable corrections.

Algorithm 1 describes the process of generating correction suggestions for a given input word and maximum edit distance threshold. Provided that a given word is not in the root lexicon (otherwise the word is considered correct), the process starts with an empty list of suggestions and a list of morpheme chains containing a null-morpheme (an empty string) that can be appended to any other morpheme. The algorithm repeatedly invokes the procedure fsaGetChains(current_states), which, using the FSA, provides a list of chains reachable from a given list of morphemes. The process stops in either of two cases: (i) a word form identical to the input word is generated (the word is correct); (ii) fsaGetChains(current_states) returns no morpheme chains, because the list provided to the FSA consists of final states only. Pruning is done with the help of the withinDistance(c, WRD, MAX_ED) procedure, which returns true if for a given morpheme chain c the target word WRD has at least one suffix (in the sense of a sequence of trailing letters) whose edit distance to c is no larger than MAX_ED. Indeed, if the distance between a morpheme chain and a word exceeds the threshold, there is no need either to develop that chain or to search for an acceptable root.
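To give a flavor of why transitions between allomorphs matter, the sketch below picks the Kazakh plural allomorph (-лар/-лер/-дар/-дер/-тар/-тер) from the stem's final sound and vowel harmony. The sound classes are deliberately simplified and do not reproduce the generator's actual rule set:

```python
# Simplified sound classes; the real generator encodes these as FSA states.
BACK_VOWELS = set("аоұыу")
FRONT_VOWELS = set("әөүіеэ")
VOICED = set("лмнңжз")                      # take д-initial allomorphs
L_INITIAL = set("руй") | BACK_VOWELS | FRONT_VOWELS  # take л-initial allomorphs

def plural(stem):
    """Append the plural suffix, choosing the allomorph by the final
    sound (л/д/т onset) and by back/front vowel harmony (а/е vowel)."""
    last = stem[-1]
    if last in L_INITIAL:
        onset = "л"
    elif last in VOICED:
        onset = "д"
    else:
        onset = "т"                          # voiceless and stop finals
    for ch in reversed(stem):                # last vowel decides harmony
        if ch in BACK_VOWELS:
            return stem + onset + "ар"
        if ch in FRONT_VOWELS:
            return stem + onset + "ер"
    return stem + onset + "ар"               # fallback for vowelless loans

print(plural("бала"), plural("кітап"), plural("мектеп"))
```

With these rules, бала → балалар, кітап → кітаптар, and мектеп → мектептер; multiplied across 55 suffix types with case, possessive, and person chains, this is what produces the hundreds of forms per root noted in the introduction.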
To get the suggestions, we search for roots acceptable by the eligible chains. This is done by the procedure lexGetRoots(c). Roots are concatenated with morpheme chains, and the resulting strings that pass the edit distance threshold are added to the suggestion list.

Once we have a list of candidate corrections produced by the generator, we need to rank it. To this end we use a Bayesian argument that combines error and source models. For our error model we employ the noisy channel-based approach proposed by Church and Gale [6]. Our source model is built upon the theoretical aspects used for morphological disambiguation in [2]. Thus, for each suggested correction the ranker computes the conditional probability of the suggestion being correct given the word:

    P(s|w) = P(s) P(w|s) / P(w)    (1)

The denominator P(w) can be dropped, as it is common to all suggestions. Here P(s) is the source model, which denotes the probability of a suggestion having a given surface form, and P(w|s) is the error model, which denotes the likelihood of s being transformed into w. Let us start by describing the computation of the error model probability. As in [6], we compute P(w|s) with respect to the types of possible errors, in the following manner. In case of missing a letter (deletion):

    P(w|s) ≈ (deletion(s_{i-1}, s_i) + α) / (pattern(s_{i-1}, s_i) + α|V|)    (2)

where deletion(s_{i-1}, s_i) is the number of times the pair of consecutive letters (s_{i-1}, s_i) was typed as s_{i-1}, and pattern(s_{i-1}, s_i) is the count of this pair in the training set. In case of reversing consecutive letters:

    P(w|s) ≈ (reversing(s_i, s_{i+1}) + α) / (pattern(s_i, s_{i+1}) + α|V|)    (3)

where reversing(s_i, s_{i+1}) is the number of times the pair (s_i, s_{i+1}) was written in reverse order, and pattern(s_i, s_{i+1}) is the count of this pair in the training set. In case of inserting a letter:

    P(w|s) ≈ (insertion(s_{i-1}, w_i) + α) / (N(s_i) + α|W|)    (4)

where insertion(s_{i-1}, w_i) is the number of times s_{i-1} was followed immediately by w_i, and N(s_i) is the count of s_i in the training set. In case of typing a wrong letter (substitution):

    P(w|s) ≈ (substitution(w_i, s_i) + α) / (N(s_i) + α|W|)    (5)

where substitution(w_i, s_i) is the number of times s_i was written as w_i, and N(s_i) is the count of s_i in the training set.

In all cases we use Laplace smoothing, where |V| is the cardinality of the set of letter bigrams and |W| is the cardinality of the set of single letters found in the training set of our misspelling-correction pairs. The smoothing factor was empirically set to α = 0.7.

Let us now describe the computation of the source model probability P(s). We compute P(s) following the ideas discussed in [2], taking advantage of morphological disambiguation. Recall that each suggestion is produced in segmented form, i.e. as a root and (possibly) a chain of morphemes with corresponding POS information. Thus, using a training set, we can compute the probability of a morpheme chain using the chain rule. Assuming that a morpheme chain is independent of its root (it depends only on the POS of the root), we consider P(s) to be proportional to the product of the probabilities of the root and the morpheme chain:

    P(s) ∝ P(r_s) ∏_{i=0}^{n} P(m_{s_i} | m_{s_{i-1}})    (6)

where r_s is the root of suggestion s with its POS information, m_{s_i} is the i-th morpheme with its inflectional/derivational information, and n is the number of morphemes in s. All counts are derived from a training set gathered from the annotated data of KLC. Both P(r_s) and P(m_{s_i} | m_{s_{i-1}}) are smoothed using Laplace smoothing, with the smoothing factor empirically set to λ = 0.3. The rank of a suggestion s given word w is calculated as:

    R(s) = P(r_s) ∏_{i=0}^{n} P(m_{s_i} | m_{s_{i-1}}) · P(w|s)    (7)
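A sketch of how the smoothed error model (Eq. 2) and the ranking formula (Eq. 7) fit together. The counts, vocabulary size, and probability tables below are illustrative placeholders, not values from the trained models:

```python
ALPHA = 0.7          # error-model smoothing factor (as in the paper)
NUM_BIGRAMS = 900    # |V|: number of distinct letter bigrams (assumed value)

# Hypothetical counts harvested from misspelling-correction pairs.
deletion_counts = {("l", "l"): 3}    # times "ll" was typed as a single "l"
pattern_counts = {("l", "l"): 50}    # occurrences of the bigram "ll"

def p_del(prev, cur):
    """Laplace-smoothed deletion probability, Eq. (2)."""
    return ((deletion_counts.get((prev, cur), 0) + ALPHA) /
            (pattern_counts.get((prev, cur), 0) + ALPHA * NUM_BIGRAMS))

def rank(root, morphs, p_w_given_s, root_prob, bigram_prob):
    """Suggestion score R(s), Eq. (7): P(r_s) * prod P(m_i|m_{i-1}) * P(w|s).
    root_prob and bigram_prob stand in for the trained source model."""
    score = root_prob(root) * p_w_given_s
    prev = ""                          # null morpheme preceding the chain
    for m in morphs:
        score *= bigram_prob(m, prev)
        prev = m
    return score

# Toy source model over the merci/peace example: for input "mercifuly",
# "mercifully" outranks "peacefully" because its error-model term is
# far larger, even though "peace" is the more probable root.
root_p = lambda r: {"merci": 0.02, "peace": 0.05}[r]
bigram_p = lambda m, prev: {"ful": 0.3, "ly": 0.6}[m]

r1 = rank("merci", ["ful", "ly"], p_del("l", "l"), root_p, bigram_p)
r2 = rank("peace", ["ful", "ly"], 1e-6, root_p, bigram_p)
print(r1 > r2)   # True
```

In practice the products of many small probabilities are best accumulated in log space to avoid underflow on long morpheme chains.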

Table 1. Overall accuracy of spelling correction.

    Tool   Acc.,%   1 err. corr.,%   2 err. corr.,%
    Ours     83          85               69
    MSO      79          85               31
    KSD      53          55               31

Finally, we would like to note that the developed FSA can also be used for morphological segmentation. A segmenter can be implemented on the basis of Algorithm 1. The difference is that instead of collecting candidate suggestions we need to collect all segmentations (a root together with a morpheme chain) whose surface form is identical to the given input word. The collected segmentations can then be ranked using Eq. 7.
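A toy illustration of this segmentation variant, brute-forcing split points over an illustrative English-like root lexicon and chain inventory rather than driving the actual FSA:

```python
# Illustrative root lexicon and morpheme-chain inventory; the real
# segmenter would enumerate chains with the generator FSA instead.
ROOTS = {"merci", "mercy", "peace"}
CHAINS = {"", "ful", "fully"}

def segment(word):
    """Collect every (root, chain) whose surface form equals `word`."""
    return [(word[:i], word[i:])
            for i in range(1, len(word) + 1)
            if word[:i] in ROOTS and word[i:] in CHAINS]

print(segment("mercifully"))   # → [('merci', 'fully')]
```

When a word has several valid segmentations (common in agglutinative languages), ranking them with Eq. 7 is exactly the morphological disambiguation step.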

4 Experiments and Evaluation

For the experiments we gathered data from KLC [4]. During annotation, the annotators had fixed spelling errors in some of the documents. In KLC each annotated document has its earlier unlabeled version saved. Thus, by a simple comparison of the edited and original versions of the documents we collected more than 1800 error-correction pairs. Removing words with more than two errors left us with 1776 pairs. Recall that our ranker requires a training set of error types. Given the relatively small dataset of errors, we resorted to 10-fold cross-validation, using 90% of the data for training at each fold. This way we trained and tested our method on the entire dataset and report average performance. We compare our method to the Hunspell-based [5] open source Kazakh spelling correction tool KSD [7] and the Microsoft Office 2010 (MSO) Kazakh language pack [8]. As these tools do not require training, we ran them on the entire dataset without cross-validation.

We begin by comparing the overall accuracy of the tools. Table 1 shows the overall accuracies broken down by the number of errors. Accuracy is calculated as the percentage of words for which a correct fix was suggested, regardless of its position in the suggestion list. Our method outperforms the other two in both overall accuracy and the percentage of 2-error words fixed. It is interesting that KSD, which like our method generates word forms on the fly, performs much worse. This can be explained by the small affix table and noisy root lexicon that the tool uses. When we skimmed through the words that our method missed, we found that the most common reason was the absence of some derivational suffixes from our generator FSA. Thus, the accuracy of our method can be improved by incorporating new transitions into the generator and adding new roots to the lexicon.

Next, in Tab. 2 we compare the tools on another important metric, precision-at-k, calculated as the percentage of all correct suggestions that appeared in the first k positions of the ranked suggestion lists. As we can see, KSD outperforms both MSO and our method. The latter two perform almost on par for k in the range 2-7, with MSO leading by 5% at k = 1 and k = 10. For further analysis of the results we measured the average length of the suggestion lists and the lowest rank at which a correct

Table 2. Comparison by precision@k, %

     k   KSD   MSO   Ours
     1    54    42     37
     2    73    57     55
     3    85    67     67
     4    90    75     76
     5    94    79     80
     6    95    84     83
     7    97    87     85
     8    97    90     87
     9    98    93     88
    10    99    94     89

suggestion appeared. The average lengths of the suggestion lists for KSD and MSO were 4.74 and 4.89 respectively; for our method it was 99.49. Similarly, the lowest ranks for KSD, MSO, and our method were 18, 20, and 535 respectively. Low ranking of some corrections is explainable: a root, a morpheme chain, or both may be rare, hence our ranker assigns them low probability. In fact, we tried to modify our ranker by removing either the source or the error model from Eq. 1; still, the best results were the ones reported, achieved when both terms were kept. Long suggestion lists are also explainable: insufficient pruning while generating word forms.

Basically, the overall accuracy can be regarded as the recall of the method, i.e. the coverage of misspelled word forms, whereas precision-at-k is the precision of the method, i.e. the percentage of fixed words for which a correct suggestion appeared at a reasonable position in the list. It is clear that our method trades precision for recall. However, upon analysis of the words for which the most precise tool, KSD, could not find corrections, we found that those were frequent enough word forms, and our method ranked many of them at the top. We think that the precision-recall trade-off is a crucial issue in spelling correction for agglutinative languages, and it definitely needs to be studied further.

Finally, we compare the tools in terms of the false positive rate, i.e. the percentage of incorrect words recognized as correct. For our tool the rate is 4%, whereas MSO and KSD have 8% and 13% false positive rates respectively. In terms of this metric our tool turned out to be twice and thrice as accurate as MSO and KSD respectively.
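The evaluation metrics used above are straightforward to compute. A small sketch with toy data follows; the word lists and rankings are illustrative, not drawn from the actual test set, and precision@k is read here as the share of test words whose gold correction appears in the top k:

```python
def precision_at_k(results, k):
    """Share (%) of test words whose correct fix appears among the
    first k ranked suggestions. `results` pairs a gold correction
    with a tool's ranked suggestion list."""
    hits = sum(1 for gold, ranked in results if gold in ranked[:k])
    return 100.0 * hits / len(results)

def false_positive_rate(is_accepted, misspelled):
    """Share (%) of known-misspelled words a checker wrongly accepts."""
    return 100.0 * sum(map(is_accepted, misspelled)) / len(misspelled)

results = [
    ("алма", ["алма", "алмас"]),        # correct fix ranked 1st
    ("мектеп", ["мектебі", "мектеп"]),  # correct fix ranked 2nd
    ("дос", ["бас", "тас"]),            # correct fix missing entirely
]
print(precision_at_k(results, 1))   # ≈ 33.3
print(precision_at_k(results, 2))   # ≈ 66.7

lexicon = {"алма", "алмас"}
# "алмас" plays the role of a real-word error: a misspelling that happens
# to be another valid word, so a plain lexicon lookup accepts it.
print(false_positive_rate(lambda w: w in lexicon,
                          ["алмас", "мктеп", "дс", "блм"]))   # 25.0
```

Real-word errors of this kind are invisible to any isolated-word checker, which is one motivation for the context-sensitive extension mentioned in Sect. 5.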

5 Conclusion and Future Work

We have developed a spelling correction tool for the Kazakh language based on a morphological disambiguator. Our tool outperformed both open source and commercial analogues, achieving an overall accuracy of 83% in generating correct suggestions. The advantage of our method is that it can be iteratively improved by adding new rules/transitions to the disambiguator and new entries to the root lexicon. Moreover, the generator FSA, which is the core of our method, can also be used for morphological segmentation; we have discussed this possibility in this paper.

We have also reported some existing weaknesses of our method, in particular the relatively poor ranking of candidate corrections and insufficient pruning of candidate correction lists. Our future work will be directed towards solving these problems, as well as incorporating context sensitivity into our method.

References

1. Koskenniemi, K.: A general computational model for word-form recognition and production. In: Proceedings of the 10th International Conference on Computational Linguistics, Association for Computational Linguistics (1984) 178–181
2. Hakkani-Tur, D.Z., Oflazer, K., Tur, G.: Statistical morphological disambiguation for agglutinative languages. Computers and the Humanities 36(4) (2002) 381–410
3. Oflazer, K., Güzey, C.: Spelling correction in agglutinative languages. In: ANLP. (1994) 194–195
4. Makhambetov, O., Makazhanov, A., Yessenbayev, Z., Matkarimov, B., Sabyrgaliyev, I., Sharafudinov, A.: Assembling the Kazakh language corpus. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, Association for Computational Linguistics (October 2013) 1022–1031
5. Németh, L.: Hunspell open source spell checker (2011)
6. Church, K., Gale, W.: Probability scoring for spelling correction. Statistics and Computing 1(2) (1991) 93–103
7. Mussayeva, A.: Kazakh language spelling with Hunspell in OpenOffice.org. Technical report, The University of Nottingham (2008)
8. Microsoft: Microsoft Office 2010, Kazakh language pack (2010)
9. Damerau, F.J.: A technique for computer detection and correction of spelling errors. Commun. ACM 7(3) (1964) 171–176
10. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8) (February 1966) 707–710
11. Mays, E., Damerau, F., Mercer, R.: Context based spelling correction. Information Processing & Management 27(5) (1991) 517–522
12. Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27 (July 1948) 379–423
13. Brill, E., Moore, R.: An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong (2000)
14. Hodge, V.J., Austin, J.: A comparison of a novel neural spell checker and standard spell checking algorithms. Pattern Recognition 35(11) (2002) 2571–2580
15. Austin, J., Kennedy, J., Lees, K.: The advanced uncertain reasoning architecture, AURA. Technical report, University of Canterbury (1995)
16. Alegria, I., Ceberio, K., Ezeiza, N., Soroa, A., Hernández, G.: Spelling correction: from two-level morphology to open source. In: LREC, European Language Resources Association (2008)
17. Pirinen, T.A., Silfverberg, M., Lindén, K.: Improving finite-state spell-checker suggestions with part of speech n-grams (2012)
18. Mussayeva, A.: Mozilla add-ons, Kazakh spelling dictionary 1.1 (2009)
19. Mussayeva, A.: OpenOffice, Kazakh spelling dictionary (2008)