Application of Document Spelling Checker for Bahasa Indonesia

ICACSIS 2011

ISBN: 978-979-1421-11-9

Aqsath Rasyid N., Mia Kamayani, Ridho Reinanda, Simon Simbolon, Moch Yusup Soleh, and Ayu Purwarianti
School of Electrical and Informatics Engineering, Bandung Institute of Technology
E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—A document spelling checker for Bahasa Indonesia is highly needed, yet no such application is currently available. Existing research on Indonesian spelling checking has not been developed into a complete document spelling checker. In this research, we compare several methods for Indonesian spelling checking, especially for word error detection, and analyze the best methods for building an Indonesian document spelling checker application. The main idea is to employ a complete word list as the reference. The Indonesian document spelling checker consists of five main components: document preprocessing, word error detection, word error correction, word candidate ranking, and user feedback. Document preprocessing turns the document into a list of unique words that is analyzed further by the spelling checker. In word error detection, binary search and hashing are used to speed up the lexicon lookup. In word error correction, a forward-reverse dictionary and a similarity measure score are employed. In candidate ranking, an HMM is used to select the best correct-word candidate. Using 13,000 words as the lexicon resource and 10 test documents, the experiments achieved 93.7% average accuracy. The remaining errors are caused by words absent from the lexicon resource and by the special repeated-word form.

I. INTRODUCTION

Computer advancement, including the advancement of document processors, has brought simplicity and ease to creating and writing documents of every kind. This development has increased the number of people capable of writing good documents and, with that, significantly increased the demand for written documents in every aspect. However, as human nature goes, making mistakes when creating a document is often unavoidable. These mistakes vary from a simple mistype to a wrong concept of grammar or a misunderstanding of the language itself. These mistakes, or word errors, can be categorized into two kinds: non-word errors and real-word errors. A non-word error is an error where the written word does not have any real meaning. A real-word error is an error where the written word is correct (it has a meaning in a lexicon resource) but is not the intended word in the sentence, thus creating a different meaning or even a grammatically incorrect sentence. Solving real-word errors requires extensive knowledge of word context and other resources to extract the sentence's context, which are not yet available for Bahasa Indonesia. Detecting and correcting word errors in documents is a challenging research problem, and it is still far from completely solved. Solving these problems can help many other fields such as text and code editing, Optical Character Recognition (OCR), Machine Translation (MT), Natural Language Processing (NLP), and many more. Our research is primarily focused on solving the non-word error problem.

II. SPELLING CHECKER

Detecting word errors can be done with a computer-aided application. The application used to detect and handle word errors is called a spelling checker. A spelling checker is an application capable of checking a document and searching for any writing errors in it, and then, if necessary, warning the writer about the mistakes and offering suggestions to fix them. The challenge in creating a spelling checker lies at the heart of the problem itself: finding the words that need to be fixed and, if possible, recommending the correct words to replace them. It sounds simple, but the implementation needs more than good luck. For non-word errors, the many ways a word can be wrongly typed, for example an extra space, an extra character, or a misspelled word, create a very large list of words to check one by one. For real-word errors, the problem lies in recognizing the grammar of each sentence; as in every natural language problem, this includes ambiguity and out-of-vocabulary (OOV) words. Another problem is that the word list used in any language keeps growing over time, which creates an out-of-vocabulary problem for a static lexicon resource.


III. RELATED WORKS

For non-word spelling checking, the most straightforward method is to use a lexicon resource of correct words and compare each word in the document against it. Although iterating over every word in a large lexicon resource for every document word is practically too expensive, this method is often used as the basis for other spell-checking methods. One method is to use a hash table to reduce lexicon access time; however, this requires a hash function that avoids collisions and remains easy to compute when the lexicon resource is very large. Another method is to use a standard lexicon resource and perform a string matching algorithm. These algorithms calculate the distance between words, and if the checked word is not in the lexicon resource, words with a relatively small distance are returned as suggestions to the writer [1][3]. This method, however, has a weakness: the overall accuracy depends heavily on the lexicon resource used and on the morphology of the language itself. Some work has tried to use a morphological generator to enrich the lexicon resource [4]. The idea is to generate every morphological possibility of the closest words in the lexicon resource; if the checked word matches the morphological rules, it is considered a valid word. Another morphological approach is to break the checked word down to its root word; if it can be broken down, it is considered a valid word [5]. These morphological approaches, however, need a solid definition of the morphological rules of the language so that they do not create bogus words.
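As an illustration of the distance-based lookup described above, the following is a minimal sketch of Levenshtein edit distance in Python; the function names and the small sample lexicon are our own illustration, not taken from the cited works.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance between two words.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def suggest(word, lexicon, max_distance=1):
    # Return lexicon words within max_distance edits of the checked word.
    return [w for w in lexicon if edit_distance(word, w) <= max_distance]

print(suggest("makn", ["makan", "main", "malas"]))  # ['makan', 'main']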

IV. OUR APPROACH

The application is implemented as a sequence of processes: the original text is cleaned by a preprocess step and then passed through error detection, error correction, candidate ranking, and a postprocess step, resulting in a candidate list. Feedback is a feature that lets the user enrich the lexicon resource with words that were detected as misspelled but are in fact correct. The components and their interaction with the main and temporary dictionaries are shown in Fig. 1.

Fig. 1. System components: Text, Preprocess, Error Detection, Error Correction, Ranking, Postprocess, and Candidate List, together with the Main Dictionary, the Temporary Dictionary, and the Feedback loop.
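To make the flow of Fig. 1 concrete, the sketch below wires a minimal version of the stages together; the stand-in correction step and the tiny lexicon are only an illustration of the pipeline shape, not the actual implementation described in the following subsections.

import re

def check_document(text, lexicon):
    # Minimal end-to-end flow mirroring Fig. 1: preprocess -> detect -> correct/rank.
    words = sorted({w for w in re.split(r"[^a-zA-Z]+", text.lower()) if w})   # unique word list
    invalid = [w for w in words if w not in lexicon]                          # dictionary look-up
    # Stand-in for correction and ranking: suggest lexicon words of similar length.
    return {w: sorted(c for c in lexicon if abs(len(c) - len(w)) <= 1) for w in invalid}

lexicon = {"saya", "makan", "nasi"}
print(check_document("Saya mkan nasi.", lexicon))
# {'mkan': ['makan', 'nasi', 'saya']}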

A. Pre-process

Preprocessing is the first phase, performed before error checking. The main task in this phase is text stripping, which breaks the text down into a word list containing both valid and invalid words. Tokenization is done on word boundaries such as white space, symbols, and numbers. In addition, duplicate words are removed so that a list of unique words is formed.
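A minimal sketch of this preprocessing step, assuming white space, digits, and punctuation are all treated as word boundaries (the regular expression is our own choice, not the paper's):

import re

def preprocess(text):
    # Split on anything that is not a letter (white space, symbols, numbers),
    # lowercase the tokens, and keep only unique words.
    tokens = re.split(r"[^a-zA-Z]+", text.lower())
    return sorted({t for t in tokens if t})

print(preprocess("Dia makan nasi, dia makan 2 kali."))
# ['dia', 'kali', 'makan', 'nasi']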

B. Error Detection

Error detection checks the validity of words in a particular language; in this application we use the dictionary look-up method. A word is valid if it is contained in the lexicon resource, which can be a corpus, a lexicon, a word list, or another form. The main process of error detection is the comparison, or string matching, between words in the text and words in the lexicon resource. Valid words are removed from the list, so that only invalid words remain; a word is detected as valid if it matches any word in the lexicon resource.

The main problem in string comparison is the search space and time. To optimize the comparison, we employ binary search and hashing. Hashing has O(1) complexity, so it is very efficient for searching. The binary search tree we use is a median search tree, a modification of a frequency-ordered binary search tree. Each node has a node value and a split value: the node value is the key that occurs most often in the subtree, while the split value divides the remaining keys between the left and right subtrees in lexical order. This makes look-up faster for frequent words.
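The fragment below sketches the dictionary look-up with a hashed set and, as an alternative, a plain binary search over a sorted word list; the median search tree described above is not reproduced here, and the small lexicon is only an illustration.

from bisect import bisect_left

def detect_errors_hash(words, lexicon_set):
    # Hash-based membership test: average O(1) per word.
    return [w for w in words if w not in lexicon_set]

def detect_errors_bsearch(words, sorted_lexicon):
    # Binary search over a lexically sorted word list: O(log n) per word.
    def found(w):
        i = bisect_left(sorted_lexicon, w)
        return i < len(sorted_lexicon) and sorted_lexicon[i] == w
    return [w for w in words if not found(w)]

lexicon = sorted(["makan", "minum", "nasi", "saya"])
print(detect_errors_hash(["saya", "mkan", "nasi"], set(lexicon)))   # ['mkan']
print(detect_errors_bsearch(["saya", "mkan", "nasi"], lexicon))     # ['mkan']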


C. Error Correction

Error correction looks up words that can be offered as solutions. A solution word is a word that is similar to the misspelled word found in the text. For error correction we use a forward-reverse dictionary.

1) Forward-Reverse Dictionary: This method is used for edit-distance-based error correction. It keeps a forward dictionary and a reverse dictionary: the forward dictionary stores each word in its normal order, while the reverse dictionary stores each word in reversed order. It is a modification of 1-edit-distance correction: if the misspelled word has length N, then a solution word has length N, N-1, or N+1. The method relies on two assumptions:
1. The misspelled word is caused by one of these mistakes: insertion, deletion, substitution, or transposition.
2. The right word is contained in both dictionaries.

In order to correct an error, we first find the error zone of the misspelled word. The error zone is found by sub-string matching between the misspelled word and words in the forward and reverse dictionaries (Fig. 2).

Fig. 2. Error zone: sub-string matching against the forward and reverse dictionaries gives a left match Sl, a right match Sr, and the error zone between them.

After the error zone is located, correction is done by adding possible words formed from the four types of typographic errors, as sketched in the code after this list:
1) Insertion error: handled by deleting characters one by one and looking the results up in the dictionary.
2) Deletion and substitution errors: handled by looking for dictionary words that begin with Sl and end with Sr, with |Sd| = |S| or |Sd| = |S| + 1, where |Sd| is the length of the dictionary word and |S| is the length of the error word.
3) Transposition error: handled by swapping adjacent character positions and looking the results up in the dictionary.
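A minimal sketch of 1-edit-distance candidate generation in this spirit, assuming a plain set as the dictionary rather than the forward-reverse structure (the alphabet constant and helper name are our own illustration):

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def candidates(word, dictionary):
    # Generate every word reachable by one insertion, deletion,
    # substitution, or transposition, and keep those found in the dictionary.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]                        # an extra character was typed
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]           # a character was left out
    return sorted(set(deletes + transposes + substitutes + inserts) & dictionary)

print(candidates("mkan", {"makan", "main", "ikan", "nasi"}))
# ['ikan', 'main', 'makan']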

2) Similarity Measure: This method is used in error correction to measure similarity values between the candidates. The similarity value is calculated using (1):

P(error | correct) = (S^2 - (N_correct - N_error)^2 - (U_correct + U_error)^2) / S^2     (1)

where
S : optimal characters,
N : number of characters,
U : unused characters,
and candidates with a P value less than or equal to 0 are excluded.

Fig. 3 illustrates how (1) is applied to measure the similarity between the correct word "MALAS" and the error word "MAAS".

Fig. 3. Calculation of the similarity value between the correct word "MALAS" and the error word "MAAS".
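The paper does not spell out how S and U are computed, so the sketch below is only one possible reading of (1): it assumes S is the length of the longest common subsequence of the two words and U counts the characters of each word left outside that subsequence.

def lcs_length(a, b):
    # Length of the longest common subsequence, used here as the count S
    # of "optimal" characters shared by the correct and the error word.
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            rows[i][j] = rows[i - 1][j - 1] + 1 if ca == cb else max(rows[i - 1][j], rows[i][j - 1])
    return rows[-1][-1]

def similarity(correct, error):
    # One reading of equation (1); S, N, and U follow the assumptions stated above.
    s = lcs_length(correct, error)
    n_c, n_e = len(correct), len(error)
    u_c, u_e = n_c - s, n_e - s                  # characters not used in the match
    return (s ** 2 - (n_c - n_e) ** 2 - (u_c + u_e) ** 2) / s ** 2

print(similarity("MALAS", "MAAS"))   # 0.875 under these assumptions

Candidates whose value is less than or equal to 0 would then be discarded, as stated in the text above.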

D. Candidate Ranking

For candidate ranking we use a Hidden Markov Model (HMM). In the HMM, the observed states are the words as they appear in the text, and the hidden states are the candidate words considered as corrections of those words. By using an HMM, the ranking considers not only the distance between the correct word and the misspelled word but also the position of the words in the sentence. The transition probabilities are obtained from a bigram model of the words, and the observation probabilities are obtained from the closeness or similarity between the candidates and the misspelled word. The model is shown in Fig. 4, with:
X : possible states (correct words)
Y : possible observations (error words)
aij : transition probabilities
bij : observation probabilities
P0 : initial state distribution

Fig. 4. HMM: hidden states X1..X5 connected by transition probabilities a12..a45, each emitting an observation Y1..Y5 with observation probabilities b11..b55, starting from the initial distribution P0.

Evaluation of the HMM measures how well the model predicts the best alternative solution among the others, given a set of observations. Decoding finds the best sequence of hidden states given the observed values; the Viterbi algorithm [2] is used for decoding. Learning is done to obtain the HMM model so that it can be used to find solutions for future input; the model is a bigram model built from documents, and it contains the values of the hidden states so that the transition probabilities can be estimated.
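A compact Viterbi decoder over this kind of model is sketched below; the tiny probability tables are invented for illustration and are not the paper's trained bigram model.

def viterbi(observations, states, start_p, trans_p, emit_p):
    # Standard Viterbi decoding: the most probable sequence of hidden states
    # (correction candidates) given the observed, possibly misspelled, words.
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), [s]) for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (best[-1][prev][0] * trans_p[prev].get(s, 0.0) * emit_p[s].get(obs, 0.0),
                 best[-1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        best.append(layer)
    return max(best[-1].values())[1]

# Toy example: rank corrections for the observed sequence "saya mkan nasi".
states = ["saya", "makan", "main", "nasi"]
start_p = {"saya": 0.7, "makan": 0.1, "main": 0.1, "nasi": 0.1}
trans_p = {"saya": {"makan": 0.6, "main": 0.3, "nasi": 0.1},
           "makan": {"nasi": 0.9, "saya": 0.1},
           "main": {"nasi": 0.2, "saya": 0.8},
           "nasi": {"saya": 1.0}}
emit_p = {"saya": {"saya": 1.0}, "makan": {"mkan": 0.8},
          "main": {"mkan": 0.2}, "nasi": {"nasi": 1.0}}
print(viterbi(["saya", "mkan", "nasi"], states, start_p, trans_p, emit_p))
# ['saya', 'makan', 'nasi']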

E. Post-process

Post-processing is done before the solution words are presented. This process aims to narrow or limit the solution words without decreasing the accuracy or correctness of the solutions. It consists of sorting or ranking, so that the word considered the best solution is placed at the front of the candidate list.

F. Feedback

Feedback aims to let the user enrich the lexicon resource manually. When the user receives the recommended words, he can mark an assumed error word as a correct word that is not yet contained in the lexicon resource; the word is then added to the temporary dictionary to be collected for further processing. Later, the temporary dictionary updates the main dictionary: once a word reaches a certain threshold of occurrences, it is added to the main dictionary. In our application, we use a threshold of 10 occurrences before a new word is included in the main dictionary. This handles the out-of-vocabulary problem, since our lexicon resource is still limited and new vocabulary keeps appearing.
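A small sketch of this feedback mechanism, assuming a simple occurrence counter for the temporary dictionary (the class name and threshold constant mirror the description above but are otherwise our own illustration):

from collections import Counter

PROMOTION_THRESHOLD = 10   # occurrences before a word enters the main dictionary

class FeedbackDictionary:
    def __init__(self, main_dictionary):
        self.main = set(main_dictionary)
        self.temporary = Counter()          # user-confirmed words awaiting promotion

    def mark_correct(self, word):
        # The user marks a flagged word as correct; promote it once it has been
        # confirmed often enough.
        self.temporary[word] += 1
        if self.temporary[word] >= PROMOTION_THRESHOLD:
            self.main.add(word)
            del self.temporary[word]

fb = FeedbackDictionary({"makan", "nasi"})
for _ in range(10):
    fb.mark_correct("diadakan")
print("diadakan" in fb.main)   # True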


V. RESULT

In this section, we describe the experiments conducted to evaluate the performance of the spelling checker.

A. Method

We use a lexicon of 13,000 words as the base lexicon resource. The experiment is conducted on a collection of 10 documents assembled from the news sections of local web sites. Besides proper words in Bahasa Indonesia, the documents also contain foreign words, named entities, and abbreviations. We run the spelling checker on these documents and measure the accuracy of the detection, where accuracy is defined as the proportion of correctly classified words to the total number of words in the document.

B. Result

The results of our experiment are shown in Table I.

TABLE I
EXPERIMENT RESULT

No.   Num Words   Errors Detected   False Detections   Accuracy
 1       256            11                 10            96.1%
 2       297            14                 14            95.3%
 3       231            12                 12            94.8%
 4       201             8                  7            96.5%
 5       374            27                 27            92.8%
 6       227            25                 23            89.9%
 7       454            30                 27            94.1%
 8       338            34                 28            91.7%
 9       363            34                 32            91.2%
10       528            27                 26            95.1%
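As a consistency check on Table I (our own arithmetic, assuming the reported accuracy counts only the falsely detected words as misclassified): for document 1, (256 - 10) / 256 = 0.961, i.e. 96.1%, and for document 6, (227 - 23) / 227 = 0.899, i.e. 89.9%, both matching the table.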

The average accuracy over the 10 documents is 93.7%.

C. Analysis

From the experiment, we observe two common cases of incorrect detection. In the first case, correct derivative words in Bahasa Indonesia are detected as spelling errors. In the second case, part of a repeated word is recognized as a single word, which turns out not to be valid in Bahasa Indonesia. The causes of these problems are:
1) Incomplete lexicon resource: the words are valid derivative words in Bahasa Indonesia but are not listed in the lexicon resource, which creates detection errors. For example, 'diadakan' is a valid word, but it is not listed in the lexicon resource used in our application.
2) Preprocessing error: some words in Bahasa Indonesia are reduplicated; for example, 'pertama-tama' should be treated as a single word, but it is processed as two separate words instead, causing a spelling error on the word 'tama'.

VI. CONCLUSIONS

This research provides a solution to the non-word error problem in two steps: detection by dictionary look-up and correction by a forward-reverse dictionary. To improve the accuracy, we use a feedback module to update the lexicon resource; with an appropriate threshold, this produces a more complete lexicon resource for the spelling checker. Using this approach, our method solves the non-word error problem with a per-document accuracy of up to 96.5% and an average of 93.7%.

REFERENCES

[1] G. E. D. Atmajaya, "Pembuatan Spelling Checker Untuk Bahasa Indonesia Dengan Java 2 Standard Edition," Teknik Informatika, Universitas Gunadarma.
[2] P. Blunsom, "Hidden Markov Models," 2004.
[3] S. Marino, "Spell Checker Bahasa Indonesia," Laporan Tugas Akhir, Program Studi Teknik Informatika, Institut Teknologi Bandung, 2009.
[4] M. Y. Soleh and A. Purwarianti, "A Non Word Error Spell Checker for Indonesian using Morphologically Analyzer and HMM," International Conference on Electrical Engineering and Informatics, Bandung, Indonesia, 17-19 July 2011.
[5] I. Aduriz et al., "A Morphological Analysis Based Method for Spelling Correction," Informatika Fakultatea, University of the Basque Country.