IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 7, SEPTEMBER 2012


Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions

Mohamed I. Alkanhal, Mohamed A. Al-Badrashiny, Mansour M. Alghamdi, and Abdulaziz O. Al-Qabbany

Abstract—This paper presents a stochastic approach for misspelling correction of Arabic text. In this approach, a context-based two-layer system is utilized to automatically correct misspelled words in large datasets. The first layer produces a ranked list of possible alternatives for each misspelled word using the Damerau–Levenshtein edit distance. The same layer also considers merged and split words resulting from deletion and insertion of the space character. The right alternative for each misspelled word is then selected stochastically, based on the maximum marginal probability, via A* lattice search and m-gram probability estimation. A large dataset was utilized to build and test the system. The testing results show that as the size of the training set increases, the performance improves, reaching an F1 score of 97.9% for detection and 92.3% for correction.

Index Terms—A* lattice search, Arabic language processing, space deletion errors, space insertion errors, spelling correction, statistical disambiguation, word distance.

I. INTRODUCTION

In this paper, we present a text enhancement system that automatically corrects misspelled words in Arabic text. Such a system plays an important role in the area of human language technology (HLT), where manual correction is time consuming and might create a bottleneck in HLT applications. All systems in HLT, from document understanding to speech recognition, need reliable automatic misspelling correction so that they can be trained on clean data. Although spell checkers are widely available for a number of languages, including Arabic, most of them only detect errors and propose corrections regardless of their context, which increases ambiguity and may result in incorrect suggestions for the misspelled words. In addition, available systems might not be able to detect and correct all kinds of errors. Furthermore, most of them work under constraints and assumptions, as discussed in Section II. Misspelling errors can be categorized according to their detection and correction difficulty into the following types:

Manuscript received September 18, 2011; revised February 07, 2012; accepted April 18, 2012. Date of publication May 02, 2012; date of current version June 11, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Renato De Mori. M. I. Alkanhal, M. A. Al-Badrashiny, and A. O. Al-Qabbany are with the Computer Research Institute (CRI), King Abdulaziz City for Science and Technology (KACST), Riyadh 11442, Saudi Arabia (e-mail: [email protected]; [email protected]; [email protected]). M. M. Alghamdi is with the Scientific Awareness and Publishing, National Digital Content Program, King Abdulaziz City for Science and Technology (KACST), Riyadh 11442, Saudi Arabia (e-mail: [email protected]). Digital Object Identifier 10.1109/TASL.2012.2197612

1) The first type of error: A correctly spelled word undergoes one or more errors of insertion, deletion, substitution, or transposition, resulting in a non-word (i.e., a word that does not follow the target language's morphological rules and is therefore not included in that language's dictionary). As an example, consider the Arabic word "Yaktoboha," which means "He writes it." It becomes the non-word "Yatobota" after the deletion and substitution of some letters.
2) The second type of error: One or more spaces are inserted into or deleted from correctly spelled words, resulting in a non-word (e.g., the Arabic phrase "Bareq Althahab," which means "The gold glitter," becomes the non-word "Bareqalthahab" after the space deletion).
3) The third type of error: This error is the same as the first type except that it results in another correctly spelled word (e.g., the word "Yashrab," which means "He drinks," becomes the correctly spelled word "Beshorbe," which means "By drinking").
4) The fourth type of error: This error is the same as the second type except that it results in a correctly spelled phrase of words (e.g., the Arabic word "Motatawer," which means "Advanced," becomes the correctly spelled phrase "Mot Tawar," which means "Die Developed").
5) The fifth type of error: This error is the same as the first type but with space insertions or deletions that turn the correctly spelled word into a non-word (e.g., the original Arabic word "Yaktoboha," which means "He writes it," becomes the non-word "Yatobota" by the first type of error, then "Yato Bota" by space insertion).
6) The sixth type of error: This error is the same as the fifth type but it results in correctly spelled words (e.g., the original Arabic word "Yaktoboha," which means "He writes it," becomes the non-word "Saktoboha" by the first type of error, then becomes "Sakata Beha," which means "Stopped-talking With-it," after a space insertion).

The third, fourth, and sixth types of errors are also known as semantic hidden errors, since they are correctly spelled words that cause semantic irregularities within their context [1]. In this paper, we envision automatic misspelling correction as a system composed of three main components, namely error detection, candidate generation, and best-candidate selection.



Error detection is the process responsible for detecting misspelled words, whether they are non-words or semantic hidden errors. The detection of semantic hidden errors is much more difficult than that of non-word errors. Some well-known techniques are used to detect semantic hidden errors, such as semantic distance, confusion sets, and neural networks. The semantic distance approach is based on comparing a word's semantics with those of its surrounding words [2], [3], but this technique faces another HLT problem, word sense disambiguation, which is still one of HLT's challenging problems. The confusion set approach depends on a set composed of dictionary words that occur together [2], [4]. In this approach, computational complexity might be an issue if the dictionary size is large. Another approach that has shown good results for the semantic hidden error detection problem is the neural network [2], [3]. While non-word detection is a much easier problem than semantic hidden error detection, it faces some challenges, such as detecting space insertions and deletions. There are many detection algorithms for non-word errors. In the next section, we show that the two most common types of these methods are the rule-based and dictionary-based methods. The rule-based methods depend mainly on morphological analyzers to check whether a word follows the language's morphological rules. The dictionary-based methods depend on a large, balanced, and revised training corpus to generate a dictionary that covers the most frequently used words in the target language. While the rule-based method has better coverage of possible words, the morphological analysis process is slow by itself, which may affect the system performance. Furthermore, this process might not be able to manage transliterated words (e.g., the English word "Computer" is now common in Arabic; its transliteration is a non-word from the point of view of any Arabic morphological analyzer, but the dictionary-based method considers it a correct word since it is frequently used in the training corpus). Thus, a hybrid combination of the rule-based and dictionary-based methods would be a good solution for error detection of non-words, but it might affect the system performance.

The candidate generation component is responsible for finding the most probable candidates for the misspelled words. The most commonly used method for this component is the edit distance [2], [5], [6]. The best-candidate selection component is the process through which the right candidate for each misspelled word is selected. While most published systems do not go beyond the candidate generation process, a few systems try to solve this problem; unfortunately, as will be discussed in the next section, most of them make assumptions in order to solve the complete problem of automatic misspelling detection and correction. The n-gram language model technique [7] with some selection criteria is the most commonly used method for best-candidate selection.

In this paper, we describe the architecture of our system, which addresses the above problems. It is made up of two layers. The first layer is for error detection and candidate generation, where candidates for each misspelled word are ranked using the Damerau–Levenshtein distance. Although the system presented in

this paper is able to detect and correct all the above-mentioned types of errors, for the sake of system performance we are only concerned with the first, second, fourth (the case of space insertion), fifth, and sixth (only the case of space insertion) types of errors. The system is still able to detect and correct the other types of errors with very little change, by considering all correctly spelled words as candidate words to be corrected and retrieving their alternatives as described in Section IV. In the second layer, we utilize the A* lattice search [8] and n-gram estimation to select the best candidate based on its context.

This paper is organized as follows. The next section reviews the related work. Section III describes the error detection process. In Section IV, we introduce our technique to retrieve candidates for misspelled words; we discuss how to deal with single-word errors, space deletion errors, and space insertion errors. Section V presents the misspelling correction system. In Section VI, we present the datasets used to train and test our system. Section VII demonstrates the ability of our system to detect and correct spelling errors, and in Section VII-B we theoretically analyze the output errors. Finally, Section VIII contains a summary and conclusions.
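To make the two-layer idea concrete, the sketch below scores every combination of candidates for a short context with a bigram language model and keeps the most probable sequence. It is only a conceptual simplification under assumed inputs: the real system uses an A* lattice search with m-gram probabilities, and bigram_logprob is a hypothetical scoring function, not part of the paper.

from itertools import product
import math

def best_correction(candidate_lists, bigram_logprob):
    """candidate_lists[i] holds the candidate corrections for word i (a
    single-element list if the word is already correct). Returns the word
    sequence with the highest bigram log-probability."""
    best_seq, best_score = None, -math.inf
    for seq in product(*candidate_lists):
        pairs = zip(("<s>",) + seq, seq)  # (previous word, current word) pairs
        score = sum(bigram_logprob(prev, cur) for prev, cur in pairs)
        if score > best_score:
            best_seq, best_score = seq, score
    return list(best_seq)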

II. RELATED WORK

The vast majority of research on spelling correction has been done for English [6], [9], [10], owing to the size of the English-language user market. In general, different approaches have been proposed in the literature to deal with spelling error correction. One of the main approaches is the dictionary-based mechanism, which depends heavily on a training corpus in order to detect misspelled words [11]. Among dictionary-based systems, several studies used the noisy channel model for the spelling correction problem. One of the earliest contributions in this area was made by Brill and Moore [12], who applied the noisy channel model to string-to-string edits. Using a trigram language model, they reduced the error rate by 74% compared to the weighted Damerau–Levenshtein distance approach proposed by Church and Gale [13]. However, the corpus used in their experiment seems small, since it consists of only 10 K words. Fossati and Di Eugenio [14] used word trigrams and POS trigrams for context-based spell checking. Their experiment placed some constraints on the training corpus, such as ignoring short words and the least frequent ones. The training corpus size was around 1 M words and the testing corpus contained 500 sentences, each with between 10 and 30 words. Their system exhibited a 55% hit rate (i.e., the ratio between the number of corrected errors and the total number of errors) and an 18% false positive rate. Dalkilic and Cebi [15] used an n-gram model with a back-off technique to correct spelling errors in Turkish. Their system produces a list of suggestions for each misspelled word, but without any ranking. They concluded that the obtained results were more accurate than the spell checker of Microsoft Word 2000, but they did not present any discussion of the evaluation process. Similarly, Islam and Inkpen [16] applied an n-gram model with the back-off technique to context-based misspelling correction. They used the Google Web 1T corpus, which contains n-grams of length ranging from unigrams to five-grams. In the testing stage, they used a data set

that has around 300 k words. Their experimental results showed that the combination of n-grams (e.g., 4-grams and 5-grams) achieved 88% recall and 91% precision. However, the number of candidate words was limited to ten suggestions, and their approach did not deal with split/merge errors.

For Arabic, there have been few attempts at spelling correction. Shaalan et al. [17] presented a theoretical approach that utilizes Arabic-specific rules to develop a system for Arabic spelling correction, but no experimental results were reported. Rachidi et al. [18] introduced correction and expansion techniques for multilingual search queries submitted to an Arabic search engine. The rules implemented in their work were limited and the correction was not automatic. Their results showed a 92% correction rate on 100 misspelled terms. Hassan et al. [19] developed a method based on finite state automata with a language model to select the best correction in a given context. The best accuracy they reported in their experiments was 89% on a list of 556 misspelled words. Rytting et al. [20] developed a spelling corrector for Arabic that is designed specifically for L2 learners of dialectal Arabic in the context of dictionary lookup. Zribi et al. [1] conducted an experiment on detecting and correcting semantic hidden errors in Arabic texts, assuming that there is only one error per sentence. Using a training corpus that contains around 23 k words and a testing set with 1.5 k words, the highest detection accuracy was 89.18%. In fact, corpus size is an essential aspect of dictionary-based spelling correction techniques, both for the training set and for the testing one.

One of the main contributions of our work is the way it handles complex error types, such as "word boundary infractions," which remain a significant problem that is not yet completely solved [6]. Some studies in the literature have tried to deal partially with run-on and split-word errors. For example, Schierle et al. [21] proposed a dictionary-based statistical approach to deal with words resulting from split/merge errors. Their system cannot handle multiple adjacent or separated words because it deals only with bigrams. Similarly, Kolak et al. [22] proposed a finite-state model that deals with only one split/merge error per word. In addition, Delden and Gomez [23] presented an unsupervised approach to automatically correct word boundary infractions based on word frequencies and morphological rules. They mentioned that their algorithm achieved an accuracy of around 75%. However, their experimental results showed that this approach could not handle all cases of multiple concatenated words.

III. ERROR DETECTION

The spelling error detection process is concerned with detecting a word either as a non-word or as a semantic hidden error. Non-word detection techniques include n-gram algorithms, where the letter mono-, bi-, or tri-grams of a string are used to determine whether each n-gram in an input string is likely to be valid in the language [2], [5], [24]. Other techniques use morphological analysis to check whether an input word follows the language's morphological rules [2]. The main advantage of these techniques is their wide coverage of the language. However, they might have performance issues, since morphological analyzers are slow. Furthermore, these techniques are not adaptive, since many commonly used words might be considered non-words (e.g., transliterated words and named entities). Another widely used technique is dictionary lookup, where the input word is compared with the dictionary words and is considered a non-word if it is not found in the dictionary [11]. Although this technique may suffer from incomplete coverage of the language, it guarantees that most commonly used words, including regularly used transliterated words and named entities, are covered, provided it is trained on a large, balanced, representative corpus. For semantic hidden errors, there are many other detection methods, such as semantic analysis, the co-occurrence or collocation method, the context-vector method, and latent semantic analysis [1].

The system introduced in this paper is based on the dictionary lookup technique. This is due to its wide use in the literature; moreover, it can handle the most commonly used words even if these words are outside the language vocabulary. The dictionary lookup technique can also be adapted to particular needs, for example by customizing the dictionary with new user-defined words. Furthermore, this technique is faster than the other above-mentioned techniques. Finally, its main advantage is that it is language independent, since the dictionary can be built for any language without affecting the detection process. However, named entities, which are sometimes treated as out-of-vocabulary (OOV) words, remain the main problem facing this technique and all the above-mentioned techniques. Named-entity extraction techniques [25], [26], [27] could be used to identify named entities so that they are not treated as OOV. In this paper, we are not concerned with the named-entity problem, since it is a separate problem that faces many HLT applications (e.g., text-to-speech, text diacritization, machine translation, etc.). Furthermore, if the training corpus is large enough and represents the commonly used words, it should cover the most frequently used named entities in the target language.

Our dictionary is based on a large balanced Arabic corpus with a size of 2600 K words (discussed in Section VI). The dictionary contains about 427 K unique words. The words in the dictionary are arranged in ascending order for faster comparison using the binary search algorithm [8]. Another trick is to keep all words that start with the same letter in a separate table; this speeds up the comparison, since an input word is compared only with the words that have the same first letter.
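As a rough illustration of the dictionary-lookup arrangement described above (first-letter tables plus binary search), here is a minimal Python sketch; the function and variable names are illustrative and not taken from the paper.

import bisect
from collections import defaultdict

def build_dictionary_index(words):
    """Group dictionary words by first letter; keep each group sorted for binary search."""
    index = defaultdict(list)
    for w in set(words):
        index[w[0]].append(w)
    for letter in index:
        index[letter].sort()
    return index

def is_non_word(word, index):
    """Flag a word as a non-word if it is absent from its first-letter table."""
    table = index.get(word[0], [])
    pos = bisect.bisect_left(table, word)
    return pos >= len(table) or table[pos] != word

# Toy usage; a real dictionary would hold the ~427 K unique words from the corpus.
index = build_dictionary_index(["book", "books", "back"])
print(is_non_word("bok", index))   # True  -> flagged for correction
print(is_non_word("book", index))  # False -> accepted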

IV. CANDIDATES' GENERATION PROCESS

The candidates' generation process is concerned with finding the most probable alternatives, or corrections, for the misspelled words. The most commonly used technique for candidate retrieval is the edit distance, also known as the Levenshtein distance. This technique is based on calculating the minimum number of edits (insertions, deletions, substitutions, and transpositions) required to transform a misspelled word into a valid word [2], [5], [6], [28], [29]. There are also similarity key techniques, where each word in the dictionary is transformed into a key; calculating the key of the misspelled word then points to similarly spelled words in the dictionary [2], [5], [6].


TABLE I SAMPLES OF SIMILARITY BETWEEN LETTERS

Fig. 1. Candidates generating process.

Some rule-based techniques apply algorithms that represent common spelling error patterns in the form of rules used to transform a misspelled word into a correct one [2], [5], [6]. According to Shaalan [30], neural networks are able to perform associative retrieval of incomplete input words. The above-mentioned approaches try to find possible corrections of single-word errors; i.e., multiple-word errors such as space insertions and deletions, which merge or split words, are not handled. In this paper we consider single-word errors as well as space insertion and deletion errors. Fig. 1 describes the candidate generation process. The modules "Single word errors analyzer," "Spaces deletions errors analyzer," and "Spaces insertions errors analyzer" are discussed in detail in Sections IV-A–IV-C, respectively.

A. Single-Word Errors

In this paper, the edit distance technique is used for its popularity and ease of implementation. In particular, we use a modified version of the Levenshtein distance called the Damerau–Levenshtein distance. The only difference between them is that the Damerau–Levenshtein distance counts the transposition of two letters as a single edit, while the Levenshtein distance counts it as two edits [9], [28], [29], [31]. To increase the efficiency of the Damerau–Levenshtein distance, we assign a low distance cost to letters that have shape similarity, pronunciation similarity, or keyboard proximity. We apply this variable cost to the four types of edits (insertions, deletions, substitutions, and transpositions). Table I shows examples of similar letters.

To clarify this idea, consider the correct word "Mastarah," which means "Ruler," and assume that it is mistakenly written with a different but similar-sounding letter because of the pronunciation similarity between the two letters. The normal Damerau–Levenshtein distance considers the edit distance from the misspelled form to "Mastarah" ("Ruler") and to "Mo'atarah" ("Scented") to be the same, while our variable-cost method assigns a lower distance to "Mastarah" ("Ruler") than to "Mo'atarah" ("Scented"). Although "Mo'atarah" ("Scented") is a valid candidate for the misspelled word, there is no plausible reason to mistakenly type the letter that would produce it: the two letters are far from each other on the keyboard and have completely different shapes and pronunciations; thus, it is unlikely that the word "Mo'atarah" ("Scented") would be mistyped as the misspelled form. The pseudocode for the variable-cost Damerau–Levenshtein distance is described below in Algorithm 1.

Algorithm 1 Damerau–Levenshtein distance with variable cost
Given: two strings a with length m and b with length n
Create: a table d with m + 1 rows and n + 1 columns
for i from 0 to m: d[i][0] = i end for
for j from 1 to n: d[0][j] = j end for
for i from 1 to m
    for j from 1 to n
        if a[i] = b[j] then cost = 0
        else if a[i] and b[j] are similar according to the similarity table then cost = (reduced cost)
        else cost = 1
        d[i][j] = min(d[i - 1][j] + deletion cost,
                      d[i][j - 1] + insertion cost,
                      d[i - 1][j - 1] + cost)    // substitution
        if (i > 1 and j > 1 and a[i] = b[j - 1] and a[i - 1] = b[j]) then
            d[i][j] = min(d[i][j], d[i - 2][j - 2] + transposition cost)
    end for
end for
Return d[m][n]
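For illustration, a minimal Python version of the variable-cost Damerau–Levenshtein distance sketched in Algorithm 1 is given below. The similarity pairs and the reduced cost value (0.5) are assumptions for the example, not the paper's actual settings.

def variable_cost_dl(a, b, similar_pairs, reduced_cost=0.5):
    """Damerau-Levenshtein distance with a reduced substitution cost for letters
    that are similar in shape, pronunciation, or keyboard position."""
    m, n = len(a), len(b)
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif (a[i - 1], b[j - 1]) in similar_pairs or (b[j - 1], a[i - 1]) in similar_pairs:
                sub = reduced_cost          # similar letters are cheaper to confuse
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1.0,        # deletion
                          d[i][j - 1] + 1.0,        # insertion
                          d[i - 1][j - 1] + sub)    # substitution
            # transposition of two adjacent letters counts as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[m][n]

# Example with hypothetical similarity pairs (Latin letters used for brevity):
print(variable_cost_dl("hte", "the", similar_pairs=set()))  # 1.0 (one transposition)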

For all words that are declared misspelled after the error detection process, each word is compared with all words in the dictionary using the above-described variable-cost Damerau–Levenshtein distance. The words that have the minimum distance are then considered candidates for the misspelled word. It is extremely time consuming to compare a misspelled word with the entire dictionary: given our dictionary size, 427 K edit-distance calculations would be needed for each misspelled word. Therefore, a clustering technique has been applied to the dictionary entries to reduce the number of comparisons. Better algorithms than this clustering technique may exist and could replace it; as mentioned before, the clustering step is used only to improve the system speed and does not affect the system accuracy. The proposed clustering algorithm makes use of the following observation (the triangle inequality of the distance):

d(X, Z) <= d(X, Y) + d(Y, Z)    (1)

where d is the variable-cost Damerau–Levenshtein distance and X, Y, and Z are three different words. According to (1), a new version of the dictionary is created offline. In this version, the dictionary words are combined into different clusters. Each cluster G_i has a centroid C_i and N_i members M_ij, where 1 <= j <= N_i. Algorithm 2 describes the clustering criteria.

Algorithm 2 Words clustering
Given: a dictionary D
Create: a table G of clusters
while (D is not empty)
    select a word from D as the centroid C_i of a new cluster G_i
    for each remaining word W in D
        if d(C_i, W) satisfies the clustering criterion then add W to the members of G_i
    end for
    remove C_i and all G_i members from D
end while
Return G

TABLE II: AN EXAMPLE OF A GROUP AND ITS MEMBERS

Table II shows an example of a group and its members. If we have a misspelled word X and two centroids C_1 and C_2 with distances d_1 = d(X, C_1) and d_2 = d(X, C_2), then according to (1)

d(X, M_1j) <= d_1 + d(C_1, M_1j), 1 <= j <= N_1
d(X, M_2j) <= d_2 + d(C_2, M_2j), 1 <= j <= N_2

where N_1 and N_2 are the numbers of words in the groups G_1 and G_2, respectively. This means that G_2 could also have some members whose edit distance with X is small; thus, both G_1 and G_2 should be considered. Algorithm 3 describes the candidates' retrieval criteria.

Algorithm 3 Candidates retrieval
Given: a misspelled word X and a clusters table G
Create: tables temp and candidates
// initialization of the minimum distance found so far
for each cluster G_i in G
    if d(X, C_i) improves on the minimum distance found so far then empty temp and add the members of G_i to temp
    else if d(X, C_i) is close enough to the current minimum, according to (1), then add the members of G_i to temp
end for
for each word W in temp
    if d(X, W) equals the minimum distance then add W to candidates
end for
return candidates

To show the effectiveness of this procedure, assume that the number of words in the dictionary is D, the number of created groups is G, the average number of words per group is M, and the number of groups whose members must be compared with an input misspelled word is K:

original comparisons size = D    (2)
proposed comparisons size = G + K * M    (3)
reduction = (D - (G + K * M)) / D    (4)

Our experimental results show that the measured values of G, M, and K, when substituted in (4), yield a significant reduction in the number of comparisons. To test the clustering algorithm, we used 30 k non-words and, for each word, retrieved its candidates both with the proposed procedure in Algorithm 3 and by comparing the word with all dictionary words without clustering. The two procedures retrieved the same words, but the clustering-based procedure was faster.
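The sketch below illustrates one way to exploit the triangle inequality (1) during candidate retrieval: a cluster is skipped when its centroid distance minus its radius already exceeds the best distance found so far. The pruning rule, the cluster fields, and the names are assumptions for illustration; they are not the paper's exact Algorithm 3.

def retrieve_candidates(word, clusters, dist):
    """Return dictionary words with the minimum distance to `word`, skipping whole
    clusters when the triangle inequality proves none of their members can win:
    d(word, member) >= d(word, centroid) - radius."""
    best, candidates = float("inf"), []
    # Visit clusters with the closest centroids first so pruning kicks in early.
    ordered = sorted(((dist(word, c["centroid"]), c) for c in clusters), key=lambda t: t[0])
    for d_centroid, cluster in ordered:
        if d_centroid - cluster["radius"] > best:
            continue  # every member of this cluster is provably worse than `best`
        for member in cluster["members"]:
            d = dist(word, member)
            if d < best:
                best, candidates = d, [member]
            elif d == best:
                candidates.append(member)
    return candidates

# Each cluster is assumed to be a dict whose "members" list includes the centroid:
#   {"centroid": w, "members": [...], "radius": max distance from centroid to any member}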


B. Space Deletion Errors

Fig. 2. Arabic letters-spaces search trellis for a word with n letters, where circles with a (-) sign indicate a space and circles with a (.) sign indicate no space.

This type of error occurs when the user forgets to add spaces between words. The main difficulty with this kind of error is that we do not know the number of merged words (i.e., we do not know the number of missing spaces). Assuming that the merged words involve only the second type of error described above, the direct solution is an exhaustive search that adds spaces everywhere in the word and tries to split it into correctly spelled words. However, the complexity of this method is on the order of 2^(n-1), where n is the number of letters in the merged words. For example, if the merged words contain only 16 letters, the number of trials equals 32 768. This makes the fifth type of error even more complicated. Our approach is to split these words statistically by choosing the sequence of Arabic letters with the maximum marginal probability via A* lattice search and n-gram probability estimation [7], [8], [32], [33], [34], using a 15-gram language model of Arabic letters based on the corpus introduced in Section VI. In addition to the normal Arabic letters, two extra special letters are added, (-) and (.), which indicate space and no space, respectively. For example, the phrase "Montho Fajre Altarekh," which means "Since the dawn of history," is represented in the training corpus as the sequence of its letters interleaved with (.) and (-) symbols. The average number of letters per Arabic word varies from 5 to 7 [35]; with the space and no-space letters included, the average number of letters per word varies from 9 to 13. This makes the 15-gram language model of letters roughly equivalent to a 1.5-gram model of words. The goal is to disambiguate the multiple possibilities for the space locations. Using the Arabic letters-and-spaces disambiguation trellis shown in Fig. 2, statistical disambiguation is deployed to infer the sequence of letters and spaces with the maximum-likelihood probability according to a statistical language model [8]. The best path selected by the A* search is a sequence of letters and spaces (i.e., a sequence of words), where each space indicates the end of its preceding word. Some of these words may be valid dictionary words and others may be non-words. For each non-word in the output sequence, Algorithm 3 is used to find its possible candidates. The output of this stage is therefore a sequence of valid words and non-words, together with the possible candidates for the non-words, as shown in Fig. 3.

Fig. 3. Example of the output from the "Space Deletion Errors" stage: the best separation proposed by the A* search, where all separated words are valid dictionary words except for some non-words, whose candidates are obtained by applying Algorithm 3.
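To make the splitting step concrete, here is a simplified beam-search sketch over the space/no-space decisions, assuming a scoring function score (for example, a letter n-gram language model) is available. The paper itself uses an A* lattice search with a 15-gram letter model; this beam version is only an illustration.

def split_merged_word(letters, score, beam_width=10):
    """Decide '-' (space) or '.' (no space) after each letter and keep only the
    highest-scoring partial hypotheses; a greatly simplified stand-in for the
    A* search over the letters/spaces trellis of Fig. 2."""
    beams = [[]]  # each hypothesis is a list of letters interleaved with '-' / '.'
    for i, ch in enumerate(letters):
        expanded = []
        for seq in beams:
            if i < len(letters) - 1:
                expanded.append(seq + [ch, "."])  # no space after this letter
                expanded.append(seq + [ch, "-"])  # insert a space after this letter
            else:
                expanded.append(seq + [ch])       # nothing follows the last letter
        beams = sorted(expanded, key=score, reverse=True)[:beam_width]
    best = beams[0]
    # Convert the best letter/space sequence back into a list of words.
    return "".join(" " if s == "-" else ("" if s == "." else s) for s in best).split()

# Usage (hypothetical): split_merged_word(list("bareqalthahab"), score_fn)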

C. Space Insertion Errors

This type of error occurs when one or more spaces are added inside a single word, splitting it into two or more words. This means that each phrase in the input text could consist of some split words that need to be merged together. Thus, a phrase of S words has 2^(S-1) merging possibilities. Algorithm 4 is used to find the possible merging suggestions for such a phrase, considering the different forms of this type of error (the second, fourth, fifth, and sixth types of errors mentioned in Section I).

Algorithm 4 Candidates retrieval for space insertion errors
Given: an array P of size S, where S is the number of words in an input phrase and P_i is the i-th word in the phrase.
Create: table candidates, register b, and string temp.
Define: the function binary_representation(x), which returns the binary form of the integer x, and the function concatenate(w1, w2), which merges two words together.
N = 2^(S - 1)    // the number of merging possibilities
for x from 0 to N - 1
    b = binary_representation(x)    // each bit decides whether a space is kept or removed
    for i from 1 to S
        if the corresponding bit of b indicates a merge then temp = concatenate(temp, P_i)
        else append temp to the current combination and start a new temp with P_i    // 0 means put a space
    end for
    add the resulting combination to candidates
    empty(temp)
end for
return candidates

The output candidates list after applying Algorithm 4 contains all possible combinations of the input phrase. Each combination consists of some valid words and possibly some non-words. Algorithm 3 is applied to each non-word to find its possible candidates. Fig. 4 shows an example of the output from this stage. In this example, some candidates merge several of the input words while keeping the others unchanged; in such a candidate, there are some valid words after merging and some non-words, and for those non-words Algorithm 3 is applied to find possible corrections.
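A small Python sketch of the enumeration that Algorithm 4 performs: for a phrase of S words there are 2^(S-1) ways to keep or delete each of the S-1 spaces, which can be enumerated with a binary mask. The function name is illustrative.

def merge_combinations(words):
    """Enumerate all 2**(len(words)-1) ways of keeping or deleting the spaces
    between consecutive words, as described for Algorithm 4."""
    s = len(words)
    results = []
    for mask in range(2 ** (s - 1)):
        phrase = [words[0]]
        for i in range(1, s):
            if (mask >> (i - 1)) & 1:
                phrase[-1] += words[i]      # bit = 1: merge with the previous word
            else:
                phrase.append(words[i])     # bit = 0: keep the space
        results.append(phrase)
    return results

# Example: a 3-word phrase yields 4 merging possibilities.
print(merge_combinations(["since", "the", "dawn"]))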
