COMPUTATIONAL METHODS IN ENGINEERING AND SCIENCE EPMESC X, Aug. 21-23, 2006, Sanya, Hainan, China ©2006 Tsinghua University Press & Springer
CSAT: A Chinese Segmentation and Tagging Module Based on the Interpolated Probabilistic Model

K. S. Leong*, F. Wong, C. W. Tang, M. C. Dong
Faculty of Science and Technology, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
Email: {ma56538, derekfw}@umac.mo

Abstract Chinese is a challenging language for natural language processing. Unlike languages such as English and Portuguese, the first step in Chinese text processing is word identification, because a Chinese sentence contains no delimiters that mark word boundaries. Chinese processing also faces many ambiguity problems, such as segmentation ambiguities, unknown words, and part-of-speech ambiguities. In this paper, CSAT, an integrated application with Chinese segmentation and tagging ability, is presented together with experimental results, and the adoption of the interpolated probabilistic tagging model for Chinese is discussed. CSAT's segmentation module follows the Chinese word rough segmentation model based on the N-shortest-paths method. In the first phase of processing, this method yields a good rough segmentation of a Chinese sentence. In the second phase, the segmented candidate sentences are tagged by the probabilistic tagger. In our experiments, the tagging accuracy is 95.20%. The experiments are based on a one-month news corpus from the People's Daily. This application is the first step of Chinese processing and will serve as a preprocessing module in a Chinese-to-Portuguese machine translation system.

Key words: N-shortest-paths method, word rough segmentation, probabilistic tagging, interpolated probabilistic tagging
INTRODUCTION

To process a Chinese sentence, several steps have to be carried out: word segmentation, part-of-speech tagging, syntactic analysis, and semantic and pragmatic analysis. In this paper we focus on the first two steps, word segmentation and part-of-speech tagging. In Chinese text processing, the essential first step is word segmentation, because Chinese, unlike languages such as English and Portuguese, has no explicit word boundaries within a sentence. Therefore, to segment a Chinese sentence and find the word boundaries within it, one needs to consult a dictionary or a knowledge base with statistics about Chinese words. Several methods have been proposed in the literature for this, such as full segmentation with a word N-gram model, maximum matching, maximum matching with a rule-based approach [2], maximum probability, and blending segmentation with tagging [3, 4]. The problems with these methods are that they may produce a large set of candidate segmented sentences for later processing, may commit to overly assertive results, or may leave some word ambiguities unresolved in the first phase. These problems usually degrade the performance of the system or the accuracy of the later processing phases. Since the result of segmentation is the input to the next phase, part-of-speech tagging, high accuracy in the segmentation phase will surely give a better tagging result. In this paper, we therefore adopt the Chinese word rough segmentation model based on the N-shortest-paths method [1] as our segmentation strategy in the first phase. The result of this phase is passed as input to the probabilistic tagging model in the second phase. With this method, we have run several experiments with different arrangements and found that the tagging accuracy reaches 95.20%.
In the following sections, we will discuss this method and the results obtained from different experiments.
WORDS ROUGH SEGMENTATION MODEL BASED ON THE N-SHORTEST-PATHS METHOD

Because of the problems in the segmentation methods mentioned above, we want a better segmentation method, one that gives good rough segmentation results for the later phases while the performance of the application is maintained or improved. The word rough segmentation model based on the N-shortest-paths method [1] is such a method. It performs a rough segmentation of Chinese sentences and obtains a certain number of roughly segmented sentences; a number of experiments then determine the best value of N, i.e. the minimum number of rough segmentations that still includes the final correct segmentation. In short, the method is an extension of the shortest-path algorithm, and its basic idea is as follows. A given Chinese sentence is first split into its atomic characters, and a directed graph is created for the sentence with the atomic characters as vertices (V1, V2, …, Vn), as in Fig. 1 (V0 represents the starting node).
Figure 1: The directed graph for word rough segmentation

After that, a knowledge base that keeps word-frequency statistics from a Chinese corpus is consulted for the probabilities of the atomic characters and of combinations of characters (words). Suppose a word w = Vi Vi+1 … Vj; then a directed edge is added from vertex Vi−1 to Vj, and the length Lw of that edge is set to the probability of w. From the graph, we see that there must be a directed edge between adjacent vertices Vi and Vi+1; if the atomic character Vi does not appear in the training corpus, the length Li of that directed edge is assigned a smoothed probability. For any w = Vi Vi+1 … Vj that does not appear in the training corpus, however, no directed edge is added from Vi−1 to Vj. Based on the statistics kept in the knowledge base, the probability of each w is used in this word rough segmentation model; it is thus based on a unigram statistical model. Let W = w1, w2, …, wm be one of the segmentation results for the Chinese sentence C. Then
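The graph construction described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `build_graph`, the edge-list representation, and the add-one-smoothed fallback for unseen single characters are our own assumptions.

```python
import math

def build_graph(sentence, word_counts, total, vocab_size):
    """Build the segmentation DAG for a sentence.

    Vertex i sits between characters i-1 and i, so vertex 0 is the start node
    V0 and vertex len(sentence) is the end. Edges carry -log of the add-one
    smoothed unigram probability, so a shortest path corresponds to a most
    probable segmentation. word_counts, total and vocab_size are assumed to
    come from a training corpus.
    """
    n = len(sentence)
    edges = {i: [] for i in range(n)}  # edges[i] = [(j, weight), ...] for i -> j
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = sentence[i:j]
            if w in word_counts:
                p = (word_counts[w] + 1) / (total + vocab_size)
                edges[i].append((j, -math.log(p)))
        # adjacent vertices must always be connected: if the single character
        # is unseen, give it the smoothed probability of a zero-count word
        if not any(j == i + 1 for j, _ in edges[i]):
            p = 1 / (total + vocab_size)
            edges[i].append((i + 1, -math.log(p)))
    return edges
```

With a toy two-character "sentence" and counts {"ab": 3, "a": 2, "b": 1}, the graph gets edges V0→V1 ("a"), V0→V2 ("ab"), and V1→V2 ("b"), matching Fig. 1.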
P(W | C) = P(W) P(C | W) / P(C)    (1)
where P(C) is the probability of the Chinese sentence to be segmented; this is a constant for all the different segmentations. And since each segmentation reassembles into the whole Chinese sentence in exactly one way, P(C | W) = 1. Therefore, the goal is to obtain the N different segmentations with the N largest probabilities P(W). Now wi is a word, and P(wi) is the probability that wi appears in the training corpus. In case wi does not appear in the training corpus, add-one smoothing is applied. Therefore P(wi) can be approximated as
P(wi) ≈ (ki + 1) / (∑_{j=0}^{m} kj + V)    (2)
where ki is the number of occurrences of wi and V is the number of word types in the training corpus. In the rough segmentation phase, it is assumed for simplicity that the context within the sentence is not considered. The words are then independent, and thus from (1) and (2),

arg max_W P(W) = arg max ∏_{i=1}^{m} P(wi) ≈ arg max ∏_{i=1}^{m} (ki + 1) / (∑_{j=0}^{m} kj + V)    (3)

For convenience, it is preferable to turn the maximization problem into a minimization problem. Then

arg max_W P(W) = arg min [−ln ∏_{i=1}^{m} P(wi)] ≈ arg min ∑_{i=1}^{m} [ln(∑_{j=0}^{m} kj + V) − ln(ki + 1)]    (4)
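The search for the N paths with the minimum values in (4) can be sketched as a single dynamic-programming sweep over the topologically ordered vertices of the segmentation graph. This is an illustrative sketch, not the algorithm of [1]: `n_shortest_paths`, the edge-list representation, and the pruning via `heapq.nsmallest` are our own assumptions.

```python
import heapq

def n_shortest_paths(edges, n_vertices, N):
    """Keep the N lowest-cost paths to every vertex of the segmentation DAG.

    edges[v] is a list of (u, weight) pairs for directed edges v -> u, where
    weights are the -log probabilities of words, so low total cost means high
    P(W). Because the graph is acyclic and its vertices are topologically
    ordered 0..n_vertices, one left-to-right sweep suffices.
    """
    best = [[] for _ in range(n_vertices + 1)]
    best[0] = [(0.0, [0])]                      # cost 0 path at the start node V0
    for v in range(n_vertices + 1):
        best[v] = heapq.nsmallest(N, best[v])   # prune to the N cheapest paths
        for u, w in edges.get(v, []):
            for cost, path in best[v]:
                best[u].append((cost + w, path + [u]))
    return best[n_vertices]                     # N best segmentations, sorted by cost
```

Each returned path is a list of vertex indices, i.e. the positions of the word boundaries of one rough segmentation.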
With the directed graph constructed and the probabilities of the different rough segmentations calculated, the N rough segmentation results with the minimum values in (4) can be obtained.
PROBABILISTIC TAGGING MODEL

In this work, we focus on applying the results obtained from the Chinese word rough segmentation model based on the N-shortest-paths method [1] to carry out the second phase of Chinese processing, part-of-speech tagging. For this phase, we apply the well-known probabilistic model known as the Hidden Markov Model [7]. To find the most likely part of speech for each word in a sentence, a probabilistic model can be formulated. Let the most likely part-of-speech sequence be T = {t1, t2, …, tn} given a particular word sequence W = {w1, w2, …, wn}. Using Bayes' rule we have

P(T | W) = P(W | T) P(T) / P(W)    (5)
where P(T) is the prior probability of the tag sequence T, P(W) is the unconditional probability of the word sequence W, and P(W | T) is the conditional probability of the word sequence W given the tag sequence T. To find the most likely tag sequence for the word sequence, we need the tag sequence T that maximizes P(T | W). Because W is the same for all candidate tag sequences, P(W) need not be considered. So (5) can be rewritten as

P(T | W) ≈ ∏_{i=1}^{n} P(ti | ti−1, ti−2, …, t1) P(wi | ti, …, t1, wi−1, …, w1)    (6)
But instead of letting wi depend on all previous words and tags, one assumes that wi depends only on ti, and similarly that the current tag ti depends only on the previous tag ti−1, on the assumption that the local context is enough. Simplifying (6) to a bigram model, we get

arg max_T P(T | W) ≈ arg max ∏_i P(ti | ti−1) P(wi | ti)    (7)
By (7), the tag sequence T that maximizes P(T | W) is taken as the final result for the word sequence W.
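A bigram tagger maximizing (7) is commonly implemented with the Viterbi algorithm. The following is a minimal sketch under our own assumptions, not the paper's implementation: the dictionary representation of `trans` and `emit`, the `"<s>"` start marker, and the 1e-12 floor for unseen events are all illustrative choices.

```python
import math

def viterbi(words, tags, trans, emit):
    """Find arg max_T prod_i P(t_i | t_{i-1}) P(w_i | t_i), as in Eq. (7).

    trans[(t_prev, t)] and emit[(t, w)] are probabilities assumed to be
    estimated from a tagged corpus; unseen events get a tiny floor value.
    Works in log space to avoid underflow on long sentences.
    """
    FLOOR = 1e-12
    score = [{}]   # score[i][t]: best log-probability of tagging words[:i+1] ending in t
    back = [{}]    # back[i][t]: previous tag on that best path
    for t in tags:
        score[0][t] = (math.log(trans.get(("<s>", t), FLOOR))
                       + math.log(emit.get((t, words[0]), FLOOR)))
        back[0][t] = None
    for i in range(1, len(words)):
        score.append({})
        back.append({})
        for t in tags:
            best, prev = max(
                (score[i - 1][tp] + math.log(trans.get((tp, t), FLOOR)), tp)
                for tp in tags)
            score[i][t] = best + math.log(emit.get((t, words[i]), FLOOR))
            back[i][t] = prev
    # follow back-pointers from the best final tag
    t = max(tags, key=lambda tag: score[-1][tag])
    seq = [t]
    for i in range(len(words) - 1, 0, -1):
        t = back[i][t]
        seq.append(t)
    return list(reversed(seq))
```

Running this on each of the N rough segmentations yields, for every candidate word sequence W, its best tag sequence T and the corresponding score.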
EXPERIMENTS

As mentioned above, we aim to pass the rough segmentation results of each Chinese sentence, obtained by the method described above, into the tagging module for part-of-speech tagging. To determine which tagged version of a given Chinese sentence should be the correct output in the final result, we use the following criterion. Let S = {W1, W2, … , Wn} be the set of N best rough segmented Chinese sentences (the segmented sentences whose P(Wi) (1