International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
© World Scientific Publishing Company
ERROR DETECTION AND CORRECTION BASED ON CHINESE PHONEMIC ALPHABET IN CHINESE TEXT

CHUEN-MIN HUANG
Department of Information Management, National Yunlin University of Science & Technology, University Road, Section 3, Douliou, Yunlin, Taiwan, Republic of China ([email protected])

MEI-CHEN WU
Department of Information Management, National Yunlin University of Science & Technology, University Road, Section 3, Douliou, Yunlin, Taiwan, Republic of China ([email protected])

CHING-CHE CHANG
Department of Information Management, National Yunlin University of Science & Technology, University Road, Section 3, Douliou, Yunlin, Taiwan, Republic of China ([email protected])

Received October 12, 2007; Revised January 31, 2008; Accepted February 20, 2008

Abstract. Misspellings and misconceptions resulting from similar pronunciations appear frequently in Chinese texts. Without careful double-checking, the situation worsens even with the help of a Chinese input editor. The quality of Chinese writing would be enhanced if an effective automatic error detection and correction mechanism were embedded in text editors, and the burden of manual proofreading would thereby be relieved. Until recently, research on the automatic detection and correction of errors in Chinese text has faced many challenges and has performed poorly compared with that for Western text. In view of this prominent phenomenon in Chinese writing, this study proposes a learning model based on Chinese phonemic alphabets. The experimental results demonstrate that this model is effective in finding misspellings and improves the detection and correction rates.

Keywords: Error detection of Chinese text; error correction of Chinese text; language model; Chinese phonemic alphabet.
1. Introduction

According to a survey report on the employment preferences of enterprises in 2005, the résumés that applicants submitted to enterprises contained many misspellings [1]. A misspelling refers to a correct character being replaced by a physically similar character or by one with a similar pronunciation. An earlier study indicates that misuse of Chinese phonemic alphabets has a great impact on students' misspellings [2]. Moreover, the phenomenon of misspelling does not improve with age: there is no correlation between age and the number of errors per page. The convenience of Internet transmission and the
frequent use of Chinese Zhuyin Fuhao (注音符號) input methods, often abbreviated as Zhuyin, aggravate the misspelling problem. Zhuyin is the national phonetic system of Taiwan for learning to read, write, and speak Mandarin. The system uses 37 special symbols to represent the Mandarin sounds: 21 consonants and 16 vowels. This phonemic alphabet is currently in wide use in Taiwan. Characters replaced by an auxiliary word such as ㄟ (ei) or an onomatopoeia such as ㄎㄎ (ker ker) cause no trouble for readers. However, characters replaced, in the pursuit of convenience and speed, by other characters with similar or identical pronunciation not only disturb readers but can also carry ridiculous implications. For example, '同聲勸諫' could be replaced by '同聲勸賤' because of the identical pronunciation (tong sheng quan jian). The former phrase means that a group of subordinates persuade their boss to reconsider some action, while the latter means that a group of people lure a person into dirty affairs. Newfangled language phenomena like this emerge one after another in our daily life. As a result, a new publication must be proofread several times to guarantee its quality. A report reveals that the General Administration of Press and Publication of the People's Republic of China regulates the error rate of publications and requires it to be kept within one ten-thousandth [3]. Consequently, human proofreading is considered necessary to maintain the quality of publications.

Although the detection and correction of errors in English text has been researched for many years, it has seldom been considered an interesting topic in Chinese text processing. The earliest studies, appearing in the early 1990s, focused on detection issues. Later, Chang proposed a correction mechanism, which nevertheless suffered from low detection precision [4]. This research intends to look into the problem with practical solutions. The quality of Chinese writing would be enhanced if an effective automatic error detection and correction mechanism were embedded in text editors; in that case, the burden of manual proofreading would be relieved.

This paper is organized as follows. Section 2 addresses related work on error detection and correction. Section 3 elaborates our detection system and correction mechanism with a real case. Section 4 describes the experimental data, and Section 5 presents our extensive experimental results. Finally, Section 6 summarizes our conclusions and future work.

2. Literature Review

Section 2.1 first describes the properties of the Chinese language. Related work on automatic detection and correction in text is discussed in Section 2.2. Sections 2.3 and 2.4 then introduce error types in text and unknown word detection, respectively. Finally, Section 2.5 depicts the concept of the language model.
2.1. Characteristics of Chinese language

According to morphology, languages are classified as synthetic or analytic. By the definitions in Wikipedia, a synthetic language is a language with a high morpheme-per-word ratio, while an analytic language is any language whose syntax and meaning are shaped more by the use of particles and word order than by inflection. English, for example, is a synthetic language that has been evolving toward the analytic type. Chinese is an analytic language with the properties of an isolating language, monosyllabism, and ideography [5]. A language is isolating when the vast majority of its morphemes are free morphemes and words are not marked by morphology showing their role in the sentence. A language is monosyllabic when each character is pronounced as a monosyllable; most characters in Chinese can also be meaningful words by themselves. Moreover, word forms do not change with gender, number, or case. A language is ideographic when morphemes are represented by graphical symbols rather than by letters arranged according to the phonemes of a spoken language. Chinese characters evolved from oracle bone script to bronze script, from bronze script to seal script, from seal script to clerical script, and from clerical script to regular script. Chinese has evolved from complicated structures toward simple structures, and from ideograms toward an alphabetic system representing the phonemes of the spoken language; however, the ideographic characteristic still remains in Chinese.

Grammatical units in modern Chinese grammar are classified as morphemes, words, phrases, and sentences. A morpheme is the smallest linguistic unit that has meaning or grammatical function [6]. According to their ability to combine, morphemes are classified as free morphemes, semi-free morphemes, and bound morphemes. Free morphemes can stand alone as words. Semi-free morphemes occur only in combination with other morphemes. Bound morphemes occur only as parts of words; affixes are bound morphemes. Roots can be either free or bound morphemes, and root morphemes are essential for affixation and compounding. According to the length of their syllables, morphemes are classified as monosyllabic, bisyllabic, and polysyllabic; most morphemes in modern Chinese are monosyllabic. According to their grammatical function, morphemes are classified as real morphemes and null morphemes. A null morpheme is realized by a phonologically null affix (an empty string of phonological segments) that enables a sentence to convey additional meaning.

2.2. Automatic Detection and Correction in Text

Research on the automatic detection and correction of English text has focused on three difficult problems: (1) non-word error detection, (2) isolated-word error correction, and (3) context-dependent word correction.

The two main techniques that have been explored for non-word error detection are n-gram analysis and dictionary lookup. N-gram error detection techniques work by examining each n-gram of an input string and looking it up in a precompiled table of n-gram statistics to ascertain either its existence or its frequency. A word in the input string
may be labeled a non-word when it is not in the word list or its frequency is too low. N-gram analysis has been proven useful for detection [7]. Dictionary lookup techniques work by checking each input string against a dictionary. The main techniques for gaining fast access to a dictionary include hashing, pattern-matching algorithms, dictionary-partitioning schemes, and morphological-processing techniques [4]. Since a dictionary is made up of an enormous number of words, response time is a major concern.

Work on the second problem has spanned a broader time frame, from the 1960s to the present. Isolated-word error correction is developed to suggest corrections once an error has been detected in text. It is applied in text recognition, text editing, computer-aided tutoring, computer-aided language learning, text-to-speech applications, and so on [8]. Isolated-word error correction not only utilizes n-gram analysis and dictionary lookup but also mixes in many hybrid techniques such as minimum edit distance [9], similarity keys, rule-based procedures, probability estimates, and neural nets. Context-dependent word correction utilizes natural language processing (NLP) and statistical modeling; typical NLP systems consist of a lexicon, a grammar, and a parsing procedure.

The error detection and correction processes for Chinese and English differ, because the morphology and syntax of Chinese are different from those of English and Chinese has no delimiters to separate words. An earlier study, which suffered from low precision in detection and gave no consideration to correction, uses word segmentation and scoring to represent the connection strength of a singleton combined with its right-hand and left-hand neighboring characters [9]. Zhang et al. [10] construct a confusing set based on the similarity of Wu-Bi input codes; utilizing the confusing set and a classifier based on two-word linkage, POS-tag linkage, semantic features, and linkage within a word achieves detection and correction of errors in Chinese text. Ren et al. [11] present a hybrid approach that combines a rule-based method with a probability-based method. The method utilizes the average character frequency, the character transition probability, the word-end-word-start transition probability, and the bi-gram part-of-speech transition probability to automatically check and correct errors in Chinese text. To correct suspicious words, candidates are provided by a list of commonly miswritten or mispronounced words with similar input codes and by characters that co-occur in the context. As to the error types in Chinese, Zhang et al. [12] classify errors into two categories: (a) non-word errors, including character substitution, string substitution, character insertion, and character deletion errors; and (b) real-word errors, containing word substitution, word insertion, and word deletion errors.

2.3. Error Types in Text

Non-word errors can be classified as typographic errors, cognitive errors, and phonetic errors. Damerau [13] points out that common error types consist of insertion, deletion, substitution, and transposition. According to the number of error instances, a misspelling containing only one
instance of an error is called a single-error misspelling, while one with multiple instances of errors is called a multi-error misspelling. Real-word error processing examines whether a word violates any of the natural language processing constraints [4]: (1) the lexical level, (2) the syntactic level, (3) the semantic level, (4) the discourse structure level, and (5) the pragmatic level. Non-word errors are categorized as lexical errors. Errors owing to the lack of subject-verb number agreement are classified as syntactic errors. Errors that do not necessarily violate syntactic constraints but do result in semantic anomalies, e.g., "see you in five minuets", are classified as semantic errors. Errors that break the inherent coherence relations in a text, such as an enumeration violation, are classified as discourse structure errors, e.g., "I own 'three' dogs. Their names are John and Mary." Errors due to some anomaly related to the plans of the discourse participants are categorized as pragmatic errors. The detection of non-word errors is easier than that of real-word errors, since it requires less additional syntactic and semantic analysis of the text; the correction of both kinds of errors, however, is quite difficult [14, 15]. As to the error types in Chinese, Zhang et al. [12] classify errors as non-word errors (character substitution, string substitution, character insertion, and character deletion errors) and real-word errors (word substitution, word insertion, and word deletion errors). Ren et al. [11] categorize errors as those caused by mistyping in the input process and as grammatical and semantic anomalies caused by such errors. The specific causes of mistyping in the input process are summarized as redundancy and omission, wrongly written or mispronounced characters, and lost punctuation.

2.4. Unknown Word Detection

One of the most familiar problems in processing Chinese text is the identification of words. Because there are no delimiters to separate words, the process of word identification encounters ambiguities and unknown words. Unknown words include abbreviations, proper names, derived words, compounds, and numeric-type compounds, of which compounds and proper names are the most common. If a text contains erroneous words, unknown word detection becomes even more complicated, because it relies on contextual information. There is no satisfactory algorithm for identifying both unknown words and typographical errors, although the same detection process might be shared by different types of unknown words and typographical errors. The unknown word detection problem and the dictionary-word detection problem are complementary: if all known words in an input text can be detected, then the remaining character strings will be unknown words. According to an examination of a group of testing data drawn from the Sinica corpus, 4572 out of 4632 unknown word occurrences were incorrectly segmented into sequences of shorter words, and each sequence contained at least one monosyllabic word [16]. In other words, the appearance of monosyllabic words may signal unknown words. Therefore, detecting unknown words is equivalent to distinguishing monosyllabic words from monosyllabic morphemes that are parts of unknown words.
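This complementarity can be made concrete with a short sketch. The following Python fragment is a minimal illustration, not the CKIP algorithm: it greedily matches a toy dictionary by forward maximum matching and flags the leftover monosyllabic characters, which, as noted above, are the usual symptom of unknown words or typographical errors. The dictionary and sentence are illustrative assumptions.

```python
# A toy forward-maximum-matching segmenter: multi-character dictionary words
# are consumed greedily; any character left over is a monosyllabic segment
# and therefore a candidate part of an unknown word or a typo.
DICTIONARY = {"資訊", "管理", "系統", "觀念"}   # illustrative, not CKIP's lexicon

def segment_and_flag(sentence, dictionary, max_len=4):
    i, segments = 0, []
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            chunk = sentence[i:i + length]
            if chunk in dictionary:
                segments.append((chunk, "known"))
                i += length
                break
        else:
            # leftover monosyllable: possibly part of an unknown word or error
            segments.append((sentence[i], "dubious"))
            i += 1
    return segments

# 菅 is a typo for 管, so 菅理 cannot be matched and surfaces as monosyllables
print(segment_and_flag("資訊菅理系統", DICTIONARY))
```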
2.5. Language Model

Research on NLP has lasted for half a century in an effort to overcome the language gap between humans and computers. Over that half century, one of the milestone achievements has been the statistical language model, which is applied in many systems such as speech recognition [17]. The classic task of language modeling is to predict the probability of a sentence made up of a sequence of words in a corpus. Suppose a given sentence $S$ is made of a sequence of $T$ words $W_1, W_2, \ldots, W_T$. The probability of the sentence $S$ can be written as Eq. (1):

$$P(S) = P(W_1, W_2, \ldots, W_T) = P(W_1)\,P(W_2 \mid W_1)\cdots P(W_T \mid W_1 W_2 \cdots W_{T-1}) = \prod_{i=1}^{T} P(W_i \mid W_1 W_2 \cdots W_{i-1}) \quad (1)$$

However, sentences or phrases can be arbitrarily long in the real world, and it is tough to capture all possible sentences. Therefore, the N-gram model is used as an approximation of the real underlying language, under the Markov assumption that only the last few words affect the next word, so that the probability of a sequence of words can be estimated reasonably [18, 19]. Consequently, Eq. (1) is rewritten as Eq. (2):

$$P(S) = P(W_1, W_2, \ldots, W_T) \cong \prod_{i=1}^{T} P(W_i \mid W_{i-n+1}^{i-1}) \quad (2)$$

In Eq. (2), $W_{i-n+1}^{i-1}$ denotes $W_{i-(n-1)} W_{i-(n-2)} \cdots W_{i-1}$. Most systems use the N-gram model with $N = 2$, known as the bi-gram model. The N-gram model in general uses the relative frequency as a probability estimate: the frequency of $W_{i-(n-1)} W_{i-(n-2)} \cdots W_{i-1} W_i$ in the training text is divided by the frequency of $W_{i-(n-1)} W_{i-(n-2)} \cdots W_{i-1}$ in the training text, as in Eq. (3), where $C(W_i^j)$ denotes the frequency of the string $W_i \cdots W_j$ in the training text:

$$P(W_i \mid W_{i-n+1}^{i-1}) = \frac{C(W_{i-n+1}^{i})}{C(W_{i-n+1}^{i-1})} \quad (3)$$
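As a concrete illustration of Eq. (3), the sketch below estimates bi-gram probabilities by relative frequency; the three-sentence corpus is an illustrative assumption, not the paper's news corpus.

```python
from collections import Counter

# Bi-gram maximum-likelihood estimate of Eq. (3):
#   P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
corpus = ["我 愛 台 灣", "我 愛 看 書", "他 愛 台 灣"]   # toy training text

history, bigrams = Counter(), Counter()
for sentence in corpus:
    chars = sentence.split()
    for a, b in zip(chars, chars[1:]):
        history[a] += 1          # C(w_{i-1}) counted as a bigram history
        bigrams[(a, b)] += 1     # C(w_{i-1} w_i)

def p_mle(prev, ch):
    return bigrams[(prev, ch)] / history[prev] if history[prev] else 0.0

print(p_mle("愛", "台"))          # 2 of the 3 occurrences of 愛 precede 台
```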
2.5.1. Witten-Bell Smoothing

Smoothing techniques resolve the data sparseness of the language model. Witten-Bell smoothing is widely used to enhance the N-gram model. Its key concept is to use the count of n-grams seen at least once to re-estimate the count of the unseen n-grams [20]. The N-gram model is adjusted as follows:

$$P_{\mathrm{interp}}(W_i \mid W_{i-n+1}^{i-1}) = \lambda_{W_{i-n+1}^{i-1}}\, P(W_i \mid W_{i-n+1}^{i-1}) + (1 - \lambda_{W_{i-n+1}^{i-1}})\, P_{\mathrm{interp}}(W_i \mid W_{i-n+2}^{i-1}) \quad (4)$$

Witten-Bell smoothing is defined recursively as a linear interpolation of the maximum-likelihood estimate and the lower-order (n-1)-gram model. $\lambda_{W_{i-n+1}^{i-1}}$ is a weight incorporating the n-gram with the (n-1)-gram, calculated by Eq. (5):

$$1 - \lambda_{W_{i-n+1}^{i-1}} = \frac{N_{1+}(W_{i-n+1}^{i-1}\,\bullet)}{N_{1+}(W_{i-n+1}^{i-1}\,\bullet) + \sum_{W_i} C(W_{i-n+1}^{i})} \quad (5)$$

$N_{1+}(W_{i-n+1}^{i-1}\,\bullet)$ in Eq. (5) is the number of different words (types) that occur to the right of $W_{i-n+1}^{i-1}$:

$$N_{1+}(W_{i-n+1}^{i-1}\,\bullet) = \left|\{W_i : C(W_{i-n+1}^{i-1} W_i) > 0\}\right| \quad (6)$$
Witten-Bell smoothing thus considers the number of distinct words that follow a given history. If that number is small, $\lambda_{W_{i-n+1}^{i-1}}$ increases, and the probability of the observed n-grams is estimated more precisely. The N-gram model with the Witten-Bell modification performs better than standard N-gram models.

2.5.2. Perplexity

Perplexity is one of the important metrics for evaluating n-gram models. It is defined in terms of the entropy of a statistical language model. According to information theory, the entropy is the expectation of $-\log P$ over the probability distribution $P$ [21]:

$$H_P = -\lim_{Q \to \infty} \frac{1}{Q} \sum_{W_1 W_2 \cdots W_Q} P(W_1 W_2 \cdots W_Q) \log P(W_1 W_2 \cdots W_Q) \quad (7)$$

It is not easy to implement Eq. (7). To simplify this computation we use an assumption which in statistics is called ergodicity [18]. It follows that, given a large enough value of $Q$, $H_P$ can be approximated with:

$$H_P = -\frac{1}{Q} \log P(W_1 W_2 \cdots W_Q) \quad (8)$$

Perplexity is then defined as $\mathrm{perplexity} = 2^{H_P}$, which represents the average branching factor of the n-gram model. The perplexity can be understood as follows: the lower the perplexity, the better the language model.
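To make Eqs. (4)-(8) concrete, the sketch below implements a Witten-Bell-interpolated bi-gram model over a toy corpus and evaluates the perplexity of Eq. (8). The corpus and test sentence are illustrative assumptions, and out-of-vocabulary handling is omitted.

```python
import math
from collections import Counter, defaultdict

corpus = [["我", "愛", "台", "灣"], ["我", "愛", "看", "書"], ["他", "愛", "台", "灣"]]

unigrams, bigrams, followers = Counter(), Counter(), defaultdict(set)
for chars in corpus:
    unigrams.update(chars)
    for a, b in zip(chars, chars[1:]):
        bigrams[(a, b)] += 1
        followers[a].add(b)               # used for N1+ of Eq. (6)
total = sum(unigrams.values())

def p_wb(prev, ch):
    """Eq. (4): interpolate the bigram MLE with the unigram model."""
    n1plus = len(followers[prev])                          # Eq. (6)
    hist = sum(c for (a, _), c in bigrams.items() if a == prev)
    p_uni = unigrams[ch] / total
    if hist == 0:
        return p_uni                                       # unseen history
    lam = hist / (hist + n1plus)                           # from Eq. (5)
    return lam * bigrams[(prev, ch)] / hist + (1 - lam) * p_uni

def perplexity(chars):
    """2**H_P with H_P approximated as in Eq. (8) over bigram transitions."""
    log_p = sum(math.log2(p_wb(a, b)) for a, b in zip(chars, chars[1:]))
    return 2 ** (-log_p / (len(chars) - 1))

print(round(perplexity(["我", "愛", "台", "灣"]), 3))       # lower = better modeled
```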
3. System Architecture

The system architecture of this research is divided into three main parts: error detection, error correction, and the database, as shown in Fig. 1. The database contains the language model, the lexicon, and the confusing word set. The error detection phase includes the word segmentation of the Chinese Knowledge and Information Processing (CKIP) group and the process of dubious word area formation. The error correction phase subsumes lexical analysis and optimal word extraction.
Fig. 1. System Architecture
3.1. Error detection

The error detection procedure in Fig. 2 includes two main processes: word segmentation and dubious word area formation. In the word segmentation process, our system transmits the input text to the CKIP word segmentation system, which performs unknown word extraction and returns the segmentation result. To extract unknown words from the segmentation result, the unknown word detection result is then extracted. Next, according to the extracted result, the dubious word area is formed in several steps. First, the text of the extracted result is segmented into sentences. Second, words are extracted from each sentence. Third, the tag of each word is filtered out. Finally, words with attached question marks are highlighted, and their locations in the text are noted. The following sections describe these processes in detail.

3.1.1. Word Segmentation

The word segmentation process launches unknown word extraction and extracts the unknown word detection result. It takes advantage of unknown word detection to detect misspellings in the text.

3.1.1.1. Unknown Word Extraction

The CKIP word segmentation system, which includes a 100,000-entry lexicon with POS tags, word frequencies, POS-tag bi-gram information, etc., extracts unknown words, segments text into words, and annotates POS tags. Moreover, the word segmentation process applies morphological analysis and a language model to unknown words and represents the morphology of all kinds of unknown words as a context-free grammar, to improve the extraction of unknown words that lack significant statistical characteristics. Its processing steps are as follows: (1) initial segmentation, (2) unknown word detection, (3) Chinese name extraction, (4) foreign name extraction, (5) compound extraction, (6) bottom-up
merging algorithm, and (7) re-segmentation. On completion of this processing, the unknown word list is obtained.

3.1.1.2. Extract Unknown Word Detection Result

This research adopts unknown word detection for error detection, since different types of unknown words and typographical errors may share the same detection process. During error word detection, the input text undergoes the maximum matching algorithm of the initial segmentation, which distinguishes monosyllabic words from monosyllabic morphemes by contextual information, so that misspellings in the text can be found. After word segmentation, the system extracts the unknown word detection result from the unknown word extraction result for further manipulation.
Fig. 2. Error Detection Procedures
3.1.1.3. Dubious Word Area Formation

Dubious word area formation includes sentence separation, word separation, tag filtering, and dubious word location. After unknown word detection, the misspellings in the input text have been found; after dubious word area formation, their exact locations are known, so the misspellings can be manipulated.

3.1.1.4. Sentence Separation, Word Separation, Tag Filtering and Dubious Word Location

When the text contains dubious words, it is separated so that the sentences containing dubious words can be manipulated and the dubious words presented to users. During sentence separation, the location of each sentence in the text is recognized and numbered. After sentence separation, word separation is executed to separate and number the words of each sentence containing dubious words, and the tag of each word is then filtered out. After sentence separation, word separation, and tag filtering, the dubious words in the text are marked and presented to users, as in Fig. 3. A minimal sketch of this marking step follows.
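This fragment is illustrative rather than the system's actual code: the segmentation result is hand-written in place of CKIP output, and the function-word list is an assumption. Monosyllabic segments outside the list are tagged with the question-mark notation of Fig. 3, and their positions are recorded for the correction phase.

```python
# Dubious word area formation: monosyllabic segments outside a small
# function-word list are tagged with "(?)" and their sentence/word positions
# are recorded so the correction phase can find them later.
segmented_text = [["同事", "向", "他", "貫", "輸", "觀念"]]   # assumed CKIP output
FUNCTION_WORDS = {"的", "了", "向", "他", "我", "你"}          # illustrative whitelist

def mark_dubious(sentences):
    marked, locations = [], []
    for s_idx, words in enumerate(sentences):
        row = []
        for w_idx, word in enumerate(words):
            if len(word) == 1 and word not in FUNCTION_WORDS:
                row.append(word + "(?)")
                locations.append((s_idx, w_idx))   # (sentence, word) position
            else:
                row.append(word)
        marked.append(row)
    return marked, locations

print(mark_dubious(segmented_text))
# ([['同事', '向', '他', '貫(?)', '輸(?)', '觀念']], [(0, 3), (0, 4)])
```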
Fig. 3. Error Detection
3.2. Error Correction

The procedure of error correction is illustrated in Fig. 4. During error detection, the location of each dubious word has been spotted, so the dubious word and its sentence can easily be extracted from the text. Because a dubious word may have been flagged mistakenly, probable correct words drawn from the confusing word set and the lexicon are compared with the context of the sentence, and the maximum common strings are extracted. The language model is then used to select the optimal candidate word from the maximum common strings, and the dubious word is replaced by the optimal candidate word.

3.2.1. Lexical Analysis

Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. Here it includes the extraction of dubious sentences, dubious words, and candidate words, as well as word matching. After lexical analysis, the system suggests candidate words.

3.2.1.1. Extract Dubious Sentence and Word

The error detection process reveals not only whether dubious words exist in the text but also the locations of the dubious sentences and words. When dubious words exist in the text, because their locations are recorded, the error correction process can easily extract the dubious sentence and word. For example, the input sentence in Fig. 5 contains a misspelling. It is the first sentence of the text, so its location in the text is zero (1-1=0). The sentence is then separated character by character. The misspelled character is the fifth, so its location is 4 (5-1). Given this location, the character and word can be extracted from the detected document. A small sketch of this indexing follows.
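The index arithmetic above can be expressed directly; this fragment is an illustrative sketch with a made-up sentence, not the data of Fig. 5.

```python
# Zero-based location of the dubious sentence and character: the first
# sentence has index 1-1=0, and the fifth character has index 5-1=4.
document = ["同事向他貫輸觀念"]        # assumed one-sentence document
sentence_location = 1 - 1              # first sentence -> 0
char_location = 5 - 1                  # fifth character -> 4

dubious_sentence = document[sentence_location]
dubious_char = dubious_sentence[char_location]
print(sentence_location, char_location, dubious_char)   # 0 4 貫
```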
Fig. 4. Error Correction Procedure
3.2.1.2. Extract Candidate Word

Because the dubious words are probably misspellings, candidate words are used to make the necessary corrections. The candidate words are drawn from the confusing word set, which is established on the basis of identical or similar phonetic properties, and from the lexicon provided by the Institute of Information Science of Academia Sinica. For example, the candidate words in Fig. 6 are extracted from the confusing word set and the lexicon for inspection. A small sketch of candidate generation follows.
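The sketch below is illustrative, with tiny stand-ins for both data sources: homophones of the dubious character are drawn from the confusing word set, and lexicon words containing any of them become candidate words.

```python
# Candidate generation: homophones of the dubious character (from the
# confusing word set) select lexicon words as correction candidates.
# Both data sets below are illustrative stand-ins for the real resources.
CONFUSING_SET = {"貫": {"灌", "罐", "貫"}}            # homophones of 貫
LEXICON = {"灌輸", "海關署", "罐頭", "觀念", "貫徹"}

def candidate_words(dubious_char):
    homophones = CONFUSING_SET.get(dubious_char, {dubious_char})
    return {w for w in LEXICON if any(h in w for h in homophones)}

print(candidate_words("貫"))      # {'灌輸', '罐頭', '貫徹'}
```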
Fig. 5. Extract Dubious Sentence and Word
3.2.1.3. Word Matching

Word matching examines the appropriateness of the candidate words for the context of the sentence. During word matching, the candidate words are aligned under the dubious words, and the comparing window is adjusted to the longest extent of the candidate words. After comparison, the maximum common string is extracted. For example, the sentence in Fig. 6 is undergoing word matching: because the longest word, 海關署 (Hai-Guan-Shu), does not match the words in the comparing window, it is removed from the candidate word list, and the words most suitable for the context of the sentence remain. A sketch of this matching step follows.
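The sketch below illustrates this matching step under simple assumptions: each candidate is aligned so that one of its characters covers the dubious position, and a candidate survives only if its remaining characters agree with the comparing window. The sentence and candidate list are illustrative; 海關署 fails to match, as in Fig. 6.

```python
# Word matching: keep only candidates whose characters, when one of them is
# placed on the dubious position, agree with the surrounding context.
sentence = "同事向他貫輸觀念"
pos = 4                                         # index of the dubious 貫
candidates = ["灌輸", "海關署", "貫徹"]

def fits_context(sentence, pos, candidate):
    for offset in range(len(candidate)):        # candidate char on dubious slot
        start = pos - offset
        if start < 0 or start + len(candidate) > len(sentence):
            continue
        window = sentence[start:start + len(candidate)]
        # all characters except the dubious slot must match the context
        if all(window[i] == candidate[i]
               for i in range(len(candidate)) if i != offset):
            return True
    return False

print([c for c in candidates if fits_context(sentence, pos, c)])  # ['灌輸']
```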
Fig. 6. Word Matching
3.2.2. Optimal Word Extraction

Optimal word extraction selects the most appropriate candidate word for the sentence. This research makes use of the character-based bi-gram language model with Witten-Bell smoothing. The procedure to select the optimal word is as follows: (1) align the candidate words with the dubious words; (2) adjust the window size to the longest extent of the candidate words; (3) replace the dubious word with each candidate word in turn; (4) calculate the probability of connectivity between the words in the window with Eq. (9); (5) choose the candidate word with the highest probability as the optimal word.

$$\log P(S) = \sum_{i=1}^{k} \log P_{\mathrm{interp}}(W_i \mid W_{i-n+1}^{i-1}) \quad (9)$$

After the optimal word extraction procedure, the optimal word is inserted into the sentence, and users can view the dubious word and the optimal word together. Take the sentence in Fig. 6 for example: after the extraction procedure, Fig. 7 illustrates that the optimal word is 灌輸 (Guan-Shu). After the detection and correction procedure, the result is shown in Fig. 8.
3.3. Language Model

This research adopts a character-based bi-gram language model to obtain the probability of connectivity between words; the system then chooses as the optimal word the candidate with the highest connectivity probability under the language model. Furthermore, this research adopts Witten-Bell smoothing to resolve the data-sparseness problem: with Witten-Bell smoothing, probability mass from seen events is reasonably redistributed to unseen events, so the system can choose the optimal word more accurately. A minimal sketch of this scoring follows.
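This sketch assumes a toy probability table in place of the trained Witten-Bell bi-gram model: each candidate is substituted into the sentence, the window is rescored with Eq. (9), and the highest-scoring candidate wins. All names and probabilities below are illustrative.

```python
import math

def p_bigram(prev, ch):
    # assumed toy probabilities; a real system would query the Witten-Bell
    # interpolated bi-gram model trained on the news corpus
    table = {("他", "灌"): 0.02, ("灌", "輸"): 0.30, ("輸", "觀"): 0.05,
             ("他", "貫"): 0.001, ("貫", "輸"): 0.002}
    return table.get((prev, ch), 1e-6)

def score(window):
    """log P(S) = sum of log P(W_i | W_{i-1}) over the window, as in Eq. (9)."""
    return sum(math.log(p_bigram(a, b)) for a, b in zip(window, window[1:]))

sentence, pos = "同事向他貫輸觀念", 4
candidates = ["灌輸", "貫輸"]
# substitute each candidate at the dubious position to form a scoring window
windows = {c: sentence[:pos] + c + sentence[pos + len(c):] for c in candidates}
best = max(candidates, key=lambda c: score(windows[c][pos - 1:pos + len(c) + 1]))
print(best)                                     # 灌輸
```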
Fig. 7. Optimal Word Extraction
3.4. Confusing Word Set

Confusing words are called false cognates because they sound or are written so similarly that they are often confused. The confusing word set is established according to identical or similar phonetic properties. It is employed to extract probable correct words; for example, the pronunciation of 灌 (Guan) is the same as that of 貫 (Guan).
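A minimal sketch of employing the set follows, with a tiny illustrative Zhuyin table: a dubious character retrieves all characters sharing its pronunciation as probable corrections.

```python
from collections import defaultdict

# Group characters by pronunciation so a dubious character retrieves its
# homophones as probable corrections. The table is a tiny illustrative excerpt.
PRONUNCIATION = {"灌": "ㄍㄨㄢˋ", "貫": "ㄍㄨㄢˋ", "罐": "ㄍㄨㄢˋ", "觀": "ㄍㄨㄢ"}

confusing_set = defaultdict(set)
for char, zhuyin in PRONUNCIATION.items():
    confusing_set[zhuyin].add(char)

def homophones(char):
    """Characters with the same pronunciation, excluding the character itself."""
    return confusing_set[PRONUNCIATION[char]] - {char}

print(homophones("貫"))        # {'灌', '罐'}
```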
Fig. 8. Error Correction Example
3.5. Lexicon

A lexicon is a synonym for a dictionary; it includes the lexemes used to actualize words. We use the lexicon to provide candidate words for replacing dubious words, such as 貫輸 (灌輸) (Guan-Shu) (inculcate).

4. Experiment

4.1. Training Data Set

The training data set is collected to construct the language model. In view of the convenience of data acquisition and their professional writing style, news documents were the first choice for our training data set. To obtain versatile documents, news documents were gathered from 10 well-known online news websites in Taiwan:

(1) United Daily News, http://www.udn.com
(2) Central News Agency, http://www.cna.com.tw/
(3) Broadcasting Corporation of China, http://www.bcc.com.tw/
(4) China Times, http://news.chinatimes.com/
(5) Formosa Television News, http://www.ftvn.com.tw/
(6) TVBS News, http://www.tvbs.com.tw/tvbs_page/index/
(7) Great News, http://www.gnews.com.tw
(8) Reuters, http://www.reuters.com
(9) Central Daily News, http://www.cdn.com.tw/
(10) Taiwan Daily News, http://www.taiwandaily.net/
The collected news corpus spans varied categories including politics, society, international, cross-strait, business, living, health, entertainment, leisure, and technology. The news documents were stored as plain texts with their tags removed. To ensure the quality of the news data, five preprocessing steps were applied: (1) sum the occurrences of each word in all texts; (2) examine the rarely used words based on their frequencies; (3) inspect particular codes; (4) remove or keep news data based on the occurrence of rarely used words and particular codes; (5) randomly select 50,000 news documents. After this preprocessing, the 50,000 news documents (18,450,075 words) were used to train the language model.

4.2. Confusing Word Set

Mandarin is not native to Taiwan, yet it is the national language of Taiwan's citizens and the sole official written language. In the past, the citizens of Taiwan were discouraged from writing their native languages (viz., Taiwanese, Hakka, and various aboriginal languages); only in recent decades has it become possible to teach them in schools. The confusing word set is made of words with identical or similar phonetic properties. Because of the multiple languages used in Taiwan, these languages influence each other in the evolution of pronunciation, and three prominent phenomena have arisen. First, the retroflex sounds of Mandarin are softened considerably, and the retroflex "r" ending is very rarely heard. Second, the light dentilabial disappears. Third, ㄥ (eng) is mispronounced as ㄣ (en) [22, 23]. The first phenomenon leads to the misuse of ㄓ (jhih) as ㄗ (zih), ㄔ (chih) as ㄘ (cih), and ㄕ (shih) as ㄙ (sih). The second leads to confusion of ㄈ (f-) and ㄏ (h-), and the third to confusion of ㄥ (eng) and ㄣ (en). In addition to these three phenomena, this study includes words with identical pronunciation across all tones in constructing the confusing word set [24]. The construction steps are as follows: (1) construct the phonetic table of words; (2) construct the homophone table; (3) construct the confusing word set based on the five variations of pronunciation (a minimal sketch follows). After construction, a collection of 783 records with identical or similar pronunciation is produced.
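The construction steps can be sketched as follows; the pinyin-style phonetic table and the variation list are illustrative assumptions standing in for the paper's Zhuyin tables.

```python
from collections import defaultdict

# Steps (1)-(3) in miniature: start from a phonetic table, then merge
# characters whose pronunciations differ only by the confusable distinctions
# (zh/z, ch/c, sh/s; f/h; eng/en) into one confusing-set record.
VARIATIONS = [("zh", "z"), ("ch", "c"), ("sh", "s"), ("f", "h"), ("eng", "en")]
PHONETIC_TABLE = {"輸": "shu", "蘇": "su", "福": "fu", "胡": "hu",
                  "真": "zhen", "蒸": "zheng"}            # illustrative entries

def normalize(pinyin):
    """Collapse confusable distinctions so near-homophones share one key."""
    for confusable, plain in VARIATIONS:
        pinyin = pinyin.replace(confusable, plain)
    return pinyin

records = defaultdict(set)
for char, pinyin in PHONETIC_TABLE.items():
    records[normalize(pinyin)].add(char)

print(dict(records))   # {'su': {'輸', '蘇'}, 'hu': {'福', '胡'}, 'zen': {'真', '蒸'}}
```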
4.3. Lexicon

This research adopts the lexicon generated by the Institute of Information Science of Academia Sinica in Taiwan to produce candidate words for replacing dubious words. The lexicon includes many types of words, such as common nouns, proper nouns, idioms, parlance, derivatives, variants, combinatorial words, jargon, and dialects of Proto-Mandarin.

5. Experiment Results

A list of the most frequently misused spellings was collected from the Internet as a reference to evaluate the performance of the system [25-31]. These misspellings were verified against a Chinese dictionary. In addition, one hundred and thirty-seven documents were retrieved from the Internet as our experimental base. Since commercial text editors provide a similar function, this evaluation compared the detection and correction results generated by our system with those produced by Microsoft Word. For example, when a sentence with a common misspelling is typed into Microsoft Word, a red wavy underline appears under the misused word if it is detected. This is similar to our detection handling, except that we tag the location with a question mark in parentheses to its right, as shown in Fig. 3. Both systems provide candidate words for selection when a misspelling is detected, but in Microsoft Word the correction must be made by the user, whereas our system conducts the correction automatically and attaches it next to the misspelling. Figs. 9(a) and 9(b) show the procedures of detection and correction in Microsoft Word.
Fig. 9. The Procedures of Detection and Correction in Microsoft Word: (a) detection; (b) correction
The documents with misspellings are highlighted, and candidate words are suggested. During the evaluation procedure, documents with common misspellings were fed into both Microsoft Word and our system to undergo the detection and correction processes, and the results from both systems were recorded carefully and then compared. Upon inspecting the detection results, 108 out of 137 documents were detected accurately by Microsoft Word; these 108 documents were then used for the correction task. During correction, only 38 of the 108 documents were corrected accurately, giving a detection rate of 0.788 and a correction rate of 0.352. Our system, on the other hand, detected 111 out of 137 documents accurately and corrected 103 of those 111 accurately, giving a detection rate of 0.810 and a correction rate of 0.928. The evaluation results of both systems are shown in Table 1.

Table 1. Detection and Correction Rate

                    Microsoft Word       Our system
Detection Rate      108/137 ≅ 0.788      111/137 ≅ 0.810
Correction Rate     38/108 ≅ 0.352       103/111 ≅ 0.928
6. Conclusion and Future Work

This research focuses on error detection and correction in Chinese text. We adopt a character-based bi-gram language model to obtain the probability of connectivity between words. Because misspellings usually result from homophones, we make use of unknown word detection for error detection: unknown word detection takes into consideration not only the statistics of words but also the morphemes of morphology. Traditional treatments usually focus on word statistics, so seldom-used words are easily neglected in the process; moreover, Chinese characters that frequently appear together are not necessarily meaningful words. After unknown word detection, monosyllabic words and monosyllabic morphemes can easily be distinguished, and this result is collected for misspelling recognition. The processing of phonetic alphabets, covering homophones, undistinguished tones, and the three characteristics resulting from multi-language learning, is the foundation of error correction. The experimental results reveal that our proposed model performs better than Microsoft Word in detecting and correcting misspellings. Future work includes relaxing the local constraints of the language model and loosening the dependence on the corpus. Utilizing a larger, balanced training data set to enhance the quality of the language model and to improve the coverage of morphemes is another promising task.
References

[1] H. I. Yang, "Xue shi zhong wen cheng du qi ye zhu guan yao tou," in China Times Express, Taipei, 2005.
[2] H. J. Wang, "Gao zhong zhi xue sheng zuo wen cuo bie zi yan jiu: yi Gaoxiong shi gao zhong zhi xue sheng zuo wen wei li," Master's thesis, National Kaohsiung Normal University, Kaohsiung, 2003.
[3] H. Chiang, "Tou shi: ti xiao jie fei di bu zhi shi cuo wu bai chu," in Focus on China, Beijing: BBC CHINESE.com, 2006.
[4] K. Kukich, "Techniques for automatically correcting words in text," ACM Computing Surveys, vol. 24, pp. 377-439, 1992.
[5] T. A. Chen, Ying han bi jiao yu fan yi, 8th ed. Taipei: Bookman, 2005.
[6] Y. C. Ho, Xian dai han yu yu fa xin ta, 1st ed. Taipei: The Commercial Press, 2005.
[7] L. D. Harmon, "Automatic recognition of print and script," Proceedings of the IEEE, 1972, pp. 1165-1176.
[8] R. A. Wagner, "Order-n correction for regular languages," Communications of the ACM, vol. 17, pp. 265-268, 1974.
[9] C.-H. Chang, "A new approach for automatic Chinese spelling correction," in Proceedings of Natural Language Processing: Pacific Rim Symposium '95, Seoul, Korea, 1995, pp. 278-283.
[10] L. Zhang, M. Zhou, C. Huang, and M. Lu, "Approach in automatic detection and correction of errors in Chinese text based on feature and learning," in Proceedings of the 3rd World Congress on Intelligent Control and Automation, Hefei, 2000, pp. 2744-2748.
[11] F. Ren, H. Shi, and Q. Zhou, "A hybrid approach to automatic Chinese text checking and error correction," in Proceedings of the 2001 IEEE International Conference on Systems, Man, and Cybernetics, Tucson, USA, 2001, pp. 1693-1698.
[12] L. Zhang, M. Zhou, C. Huang, and H. Pan, "Automatic detecting/correcting errors in Chinese text by an approximate word-matching algorithm," in Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 2000.
[13] F. J. Damerau, "A technique for computer detection and correction of spelling errors," Communications of the ACM, vol. 7, pp. 171-176, 1964.
[14] M. Choudhury, M. Thomas, A. Mukherjee, A. Basu, and N. Ganguly, "How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach," in Graph-Based Algorithms for Natural Language Processing, Rochester: Association for Computational Linguistics, 2007.
[15] G. Hirst and A. Budanitsky, "Correcting real-word spelling errors by restoring lexical cohesion," Natural Language Engineering, vol. 11, pp. 87-111, 2005.
[16] K. J. Chen and M. H. Bai, "Unknown word detection for Chinese by a corpus-based learning method," International Journal of Computational Linguistics and Chinese Language Processing, vol. 3, pp. 27-44, 1998.
[17] C. N. Huang and H. F. Chang, "Zi ran yu yan chu li ji shu di san ge li cheng bei," Foreign Language Teaching and Research, pp. 180-187, 2002.
[18] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
[19] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd ed. New York: McGraw-Hill, 1984.
[20] I. H. Witten and T. C. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression," IEEE Transactions on Information Theory, vol. 37, p. 1085, 1991.
[21] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423, 623-656, July and October 1948.
[22] F. F. Cao, "Instances of interaction between Taiwanese Japan and Taiwanese Mandarin in Taiwan across the span of the last one hundred years," Chinese Study, vol. 36, pp. 273-297, 2000.
[23] K. P. Hsieh, "Do young people in Taiwan really confuse zh (ㄓ), ch (ㄔ), sh (ㄕ) with z (ㄗ), c (ㄘ), s (ㄙ)?," The World of Chinese Language, vol. 90, pp. 1-7, 1998.
[24] A. J. Ssu Tu, Hao wan cuo bie zi you xi. Hong Kong: Singtao, 2005.
[25] C. Chi, You jian bie zi, 2nd ed. Taipei: Ming Jen, 1980.
[26] H. L. Tso, Cuo bie zi bian zheng. Taipei: The Commercial Press, Ltd., 1980.
[27] T. I. Chuang and S. Y. Chuang, Yi zi zhi cha. Taipei: Jian Lin, 1991.
[28] F. L. Hung, Bian zi ji jin. Kaohsiung: Fu Wen, 1997.
[29] S. P. Fan, Xiao yuan chang jian cuo bie zi shou ce. Hong Kong: Chinese Improvement Working Group, 1998.
[30] P. C. Hsieh, Bie zai xie cuo zi liao. Taipei: Business Weekly Publications, Inc., 2001.
[31] T. Ssu Ma, Cuo bie zi chu lie, 1st ed. Taipei: Business Weekly Publications, Inc., 2005.