A New Statistical Approach to Personal Name Extraction
Zheng Chen
Microsoft Research Asia, 49 Zhichun Road, Haidian District, Beijing 100080, P.R. China
[email protected]

Liu Wenyin
Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Hong Kong SAR, P.R. China
[email protected]

Feng Zhang
Microsoft Research Asia, 49 Zhichun Road, Haidian District, Beijing 100080, P.R. China
[email protected]
Abstract
We propose a new statistical approach to extracting personal names from a corpus. A key point of our approach is that it can both automatically learn the characteristics of personal names from a large training corpus and make good use of human empirical knowledge (e.g., a Context-Free Grammar). Furthermore, our approach assigns confidence measures to the extracted personal names, in contrast to the traditional simple true/false determination. Another main contribution of this work is that we have applied the personal name extraction technology in a real application, a Chinese input system, and achieved approximately a 7% error rate reduction for all characters and a 30% error rate reduction for personal names.
1. Introduction
Named entity (NE) extraction, especially personal name extraction, has become increasingly important as more and more natural language processing applications are developed. However, how to extract personal names and fully utilize them remains a non-trivial problem. Admittedly, the best way to identify a personal name is by its context. Hence, the main problem becomes how to find the contexts in which personal names occur, either by manual, rule-based methods or by automatic, statistical methods. Furthermore, how to utilize the extracted personal names effectively is another challenge. In this paper, we propose a new statistical approach to extracting personal names from a corpus. The approach can both automatically learn the characteristics of personal names from a large training corpus and make good use of human empirical knowledge (e.g., Context
Free Grammar). The system extracts all possible rules for personal names using a statistical approach, avoiding the incompleteness and subjectivity of manual, rule-based methods. The application of empirical knowledge can compensate for the limitations of the statistical approach, especially when the training data are unbalanced or inadequate. Furthermore, a priori knowledge can guide the statistical learning process and thus improve the learning speed to a certain extent. Our approach also assigns confidence measures to the extracted personal names, in contrast to the traditional simple true/false determination. The confidence measures help the system improve extraction performance over several iterations. Furthermore, by employing a statistical framework, our approach can be easily applied to most popular natural language processing applications, such as speech recognition, natural language understanding, machine translation, and information retrieval. Although there are many papers on personal name extraction, little work has been done to apply it in real applications. In this paper we also apply the extracted personal names in a real application, a language input system (by speech recognition or the Chinese Pinyin input method). A dynamic cache-based personal name dictionary is proposed to improve the accuracy of the conversion system. Personal name extraction and recognition are processed in a unified approach. Compared with traditional methods, our approach achieves approximately a 7% error rate reduction for all characters, including a 30% error rate reduction for personal names. The rest of the paper is organized as follows. In Section 2, we review related work on personal name extraction. In Section 3, we give a brief introduction to the Chinese input system and the Pinyin methodology. We present our proposed statistical approach to personal name extraction in Section 4 and its
application to a language input system in Section 5, respectively. We show experimental results in Section 6 to demonstrate the effectiveness of our approach. Finally, we present concluding remarks in Section 7.
2. Related Work
Named entity (NE) extraction and recognition have recently drawn the attention of many researchers. The main reason is that NE extraction/recognition is one of the basic and important problems in information retrieval, information extraction, machine translation, natural language understanding, etc. The NE recognition task was defined in the 6th Message Understanding Conference (MUC-6) (DARPA, 1995) as the recognition of names of locations, persons, and organizations. There are two main approaches to the identification and recognition of named entities. One is the rule-based method (Kim and Woodland, 2000); the other is the corpus-based method (Bikel et al., 1997; Day & Palmer, 1997). The rule-based method is widely employed because it is intuitive and easy to understand, and much expert linguistic knowledge can be easily applied. However, it is difficult to manually write all possible rules for named entities; nonstandardized syntax and incomplete rule sets are the key barriers of this method. Hence, it has often been applied to specific applications instead of general ones. In order to cover all possible contexts for named entities, some researchers began to use corpus-based machine learning methods to extract all possible rules from large corpora. For example, Li and Wang (2000) used a statistical method to calculate the probability of a given string being a Chinese name. The corpus-based machine learning approach can achieve high accuracy and flexibility when the training data are sufficiently large. The statistical method is widely used in corpus-based learning, and it can be easily combined with other statistics-based applications, such as speech recognition, machine translation, etc. The disadvantage of the corpus-based method is that accuracy drops greatly when the training data are sparse. Furthermore, how to utilize expert knowledge is another problem for the corpus-based method. In addition, the Chinese language differs from western languages: there is no separator between Chinese words/phrases in a sentence, and the cue of initial character capitalization in western languages is not applicable to Chinese NE extraction because Chinese is a non-alphabetic language. These differences make the extraction of Chinese names more difficult than that of western names. Hence, very little work (e.g., Li and Wang, 2000) has been done on Chinese name extraction. Another interesting question is how to apply NE extraction in real applications. Some researchers have applied extracted named entities in information retrieval (Arampatzis et al., 1998; Fox, 1983). However, there is little achievement in this aspect. Thus, it remains a challenge to find real practical applications of named entities.
3. Chinese Language Model and Pinyin Input
Chinese is a non-phonetic language. Hence, there is no direct way to input Chinese using a western-style keyboard. So far, more than one thousand methods have been developed for Chinese input. Among them, Pinyin is the most popular. The basic idea of this method is that the user types a string of phonetic letters, with optional spaces, e.g.,
woshiyigezhongguoren
The system converts this string into a string of Chinese characters, in this case 我是一个中国人 (I am a Chinese).
A sentence-based Pinyin input method chooses the most probable Chinese words according to the context. In our system, a statistical language model is used to provide adequate information to predict the probabilities of hypothesized Chinese word sequences. In the conversion of Pinyin to Chinese characters, for a given Pinyin sequence P, the goal is to find the most probable Chinese character sequence H, so as to maximize Pr(H | P). Using the Bayes theorem, we have:

\hat{H} = \arg\max_H \Pr(H \mid P) = \arg\max_H \frac{\Pr(P \mid H)\,\Pr(H)}{\Pr(P)}    (1)
The problem is divided into two parts: the typing model Pr(P | H) and the language model Pr(H). Conceptually, all H's are enumerated, and the one that gives the largest Pr(H, P) is selected as the best Chinese character sequence. In practice, efficient methods such as Viterbi beam search (Lee, 1989; Lee, Soong & Paliwal, 1996) can be applied. Pr(H), which represents the Chinese language model, measures the a priori probability of a Chinese word sequence. Usually, it is determined by a statistical language model (SLM), such as a tri-gram LM. Pr(P | H), referred to as the typing model, measures the probability that a Chinese word sequence H is typed as Pinyin P (Chen & Lee, 2000).
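As an illustration of this decomposition, the following minimal sketch (not the paper's implementation; the toy probability tables and candidate list are invented for the example) scores each candidate character sequence by Pr(P | H) · Pr(H) and drops the constant Pr(P):

```python
import math

# Toy models with invented numbers; a real system trains Pr(P|H) as the
# typing model and Pr(H) as a tri-gram SLM on large corpora.
TYPING = {("我是", "woshi"): 0.95, ("我市", "woshi"): 0.95}
LM = {"我是": 0.02, "我市": 0.001}

def convert(pinyin, candidates):
    """Return the H maximizing Pr(P|H) * Pr(H); Pr(P) is constant, so dropped."""
    def log_score(h):
        return (math.log(TYPING.get((h, pinyin), 1e-12))
                + math.log(LM.get(h, 1e-12)))
    return max(candidates, key=log_score)

print(convert("woshi", ["我是", "我市"]))  # -> 我是 (higher LM probability)
```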
4. Our Proposed Approach to Personal Name Extraction
In this paper, we focus our discussion on personal name extraction, since it is the most difficult task in NE extraction. Since corpus-based machine learning technology can automatically discover the relevant rules for personal names, we adopt the corpus-based method as the core technology for the personal name extraction task in this paper. Meanwhile, in order to increase the learning speed and improve the accuracy of extraction, we also
apply a priori knowledge of personal names in the learning procedure. Thus, we propose a new statistical approach to personal name extraction, a combination of a probabilistic regular grammar and a statistical language model (SLM) (Huang et al., 1993). In this approach, we use the class-based SLM to automatically learn context information from the training corpus, avoiding the incompleteness caused by manual rule making on the one hand; and we use the regular grammar to introduce expert experience into the learning procedure, overcoming the data sparseness problem, on the other hand. Next, we explain our approach to personal name extraction in detail.
4.1 Probabilistic Regular Grammar of Personal Names
Since there is a huge number of personal names in real applications, it is impractical to collect all personal names into the lexicon. Instead, we use a probabilistic regular grammar to represent the structure of a personal name. Names in Chinese (both native Chinese names and Chinese transcriptions of foreign names) can be characterized by the regular grammar shown in Figure 1.
[Figure 1: a state diagram over the states S, Prefix, CSN, CGN, FCGN, LCGN, FFN, MFN, LFN, Title, and /S.]
Figure 1. Regular grammar for Chinese personal names.
The terms in Figure 1 are explained as follows:
CSN: Chinese surname
CGN: Chinese given name
FCGN: first character of a Chinese given name
LCGN: last character of a Chinese given name
FFN: first character of a foreign name
MFN: middle character of a foreign name
LFN: last character of a foreign name
Prefix: the prefix of a personal name
Title: the title of a personal name
S, /S: the beginning and ending markers
We briefly go through the paths in Figure 1 to explain the regular grammar and the meanings of some instances of Chinese personal names:
1. The path that starts from Prefix and walks via CSN to Title denotes a Chinese surname with an optional prefix and a title. Typical examples of this class are CSN + Title patterns such as "Dr. Wang" and "President Li", and the Prefix + CSN + Title pattern, as in "Mr. Zhao, Junior".
2. The path that walks from CSN to CGN denotes one of the formal formats of Chinese people's names. For example, 张三 (Zhang San), 李四 (Li Si), 王五 (Wang Wu), etc., are full names of Chinese people.
3. The path that walks from FFN via MFN to LFN denotes one of the formal formats of foreigners' names in Chinese. Different blocks use different valid sets of Chinese characters. For example, 布什 and 克林顿 are the Chinese transcriptions of Bush and Clinton, respectively.
Other formats of personal names are similar to the three listed above. Chinese names and the Chinese transcriptions of foreign names each follow one of these paths.
Training of the personal name structure is also based on this regular grammar. First, a large number of personal names are collected for training the probability of each regular grammar format; census data, telephone books, etc., are available resources for this purpose. Meanwhile, we can build heuristic rules to extract possible personal names from newspapers and other corpora; the extracted data must also be verified by human annotators. We then count the occurrences of each path and calculate the probability of each path in the regular grammar of personal names. The distribution of valid Chinese characters in personal names is counted at the same time.
The probability of a Chinese character sequence being a personal name can be calculated using Equation (2) or Equation (3). Equation (2) is the character-based uni-gram model, in which the characters in a Chinese name are independent. Equation (3) is the character-based bi-gram model, in which each character in a Chinese personal name depends only on its immediately preceding character.

\Pr(C_1, \ldots, C_m) = \prod_{i=1}^{m} \Pr(C_i), \quad \Pr(C_i) = \frac{Count(C_i)}{\sum_k Count(C_k)}, \; i = 1, \ldots, m    (2)

\Pr(C_1, \ldots, C_m) = \Pr(C_1) \prod_{i=2}^{m} \Pr(C_i \mid C_{i-1}), \quad \Pr(C_i \mid C_{i-1}) = \frac{Count(C_{i-1}, C_i)}{Count(C_{i-1})}, \; i = 2, \ldots, m    (3)
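To make Equation (3) concrete, here is a minimal sketch of the character bi-gram name model, using an invented three-name training list (real training data would be census lists or telephone books, and the smoothing discussed below, omitted here, would handle unseen counts):

```python
from collections import Counter

names = ["张三", "张四", "李四"]  # hypothetical training names

unigram = Counter(c for n in names for c in n)
bigram = Counter(pair for n in names for pair in zip(n, n[1:]))
total = sum(unigram.values())

def name_prob(name):
    """Character bi-gram probability of a candidate name, Eq. (3).

    This sketch assumes every character and pair was seen in training;
    unseen events would need smoothing.
    """
    p = unigram[name[0]] / total                   # Pr(C1), as in Eq. (2)
    for prev, cur in zip(name, name[1:]):
        p *= bigram[(prev, cur)] / unigram[prev]   # Pr(Ci | Ci-1)
    return p

print(name_prob("张三"))  # (2/6) * (1/2) = 1/6
```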
Certainly, a smoothing technique (Jelinek, 1997), such as the Good-Turing method, should be applied in the statistical approach to overcome the data sparseness problem.
4.2 Class-based SLM
There is no space between the characters in a Chinese sentence, so it is difficult to segment a Chinese sentence into meaningful Chinese words or phrases.
Usually, maximum length matching or a statistics-based segmentation method is applied to the segmentation task in Chinese language processing. The maximum length matching method is based on the rule that a longer Chinese word is more meaningful than a shorter one. The statistics-based method finds the Chinese word sequence with the maximum probability. Statistical language models (SLMs) are widely used to estimate the probabilities of input sentences. The most widely used statistical language models are the n-gram Markov models (Jelinek, 1997); bi-gram and tri-gram models are the most frequently used. Usually, the input sentence W is a combination of several Chinese words and can be decomposed into w_1, w_2, \ldots, w_n, where w_i can be a Chinese word or a Chinese character. Thus, the probability of a segmentation of the input sentence can be calculated by the tri-gram model, as shown in Equation (4). Since there is not enough context for the first two words, their probabilities back off to the uni-gram and bi-gram models, respectively.

\Pr(W) = \Pr(w_1) \Pr(w_2 \mid w_1) \prod_{i=3}^{n} \Pr(w_i \mid w_{i-2}, w_{i-1})    (4)

The statistics-based segmentation then finds the most probable segmentation \hat{W} among all candidate segmentations, as shown in Equation (5).

\hat{W} = \arg\max_{w_1, \ldots, w_n} \Pr(W) = \arg\max_{w_1, \ldots, w_n} \left[ \Pr(w_1) \Pr(w_2 \mid w_1) \prod_{i=3}^{n} \Pr(w_i \mid w_{i-2}, w_{i-1}) \right]    (5)
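Equation (4) translates directly into code. The sketch below takes the uni-, bi-, and tri-gram lookup functions as parameters, as stand-ins for a trained, smoothed model:

```python
from typing import Callable, Sequence

def trigram_prob(words: Sequence[str],
                 p1: Callable[[str], float],
                 p2: Callable[[str, str], float],
                 p3: Callable[[str, str, str], float]) -> float:
    """Pr(W) under Eq. (4); assumes len(words) >= 2.

    The first word uses the uni-gram, the second the bi-gram, and the
    rest the tri-gram, mirroring the back-off described in the text.
    """
    prob = p1(words[0]) * p2(words[1], words[0])
    for i in range(2, len(words)):
        prob *= p3(words[i], words[i - 2], words[i - 1])
    return prob
```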
The personal name extraction problem can also be considered a segmentation problem. As the number of personal names is very large, it is not appropriate to include all personal names in a lexicon. Instead, we build a combined model based on the probabilistic regular grammar and the class-based SLM. In the class-based SLM, a personal name is mapped to a single token <Name>, which is treated as one Chinese word. The class-based SLM estimates the probabilities of connections between the token <Name> and other Chinese words, e.g., the probability of 先生 (Mr.) following a given <Name>, Pr(Mr. | <Name>). The probabilistic regular grammar estimates the probability of a string belonging to the <Name> class. For example, two Chinese words sharing the same Pinyin sequence, one meaning "seldom" and the other the personal name Han Jian, have different probabilities of being a personal name: Pr("seldom" | <Name>) is much lower than Pr(Han Jian | <Name>). Training the class-based SLM requires annotated data. We use a parser to tag some raw data and use them to train the LM. The parser can be built using heuristic rules or a dictionary look-up method. Of course, the more data we have, the better the results will be.
After all personal names are tagged in the training corpus, a special token is used to replace all tagged personal names in the corpus. Afterwards, we can use the tagged data to train the class-based tri-gram model incorporating the class <Name>. In the tri-gram model, the special token in the lexicon represents the class <Name> just like other lexicon items. After training, we obtain the probabilities of personal names (as a class) appearing in different contexts. The Viterbi beam-search algorithm (Lee, 1989; Lee et al., 1996) is used in the decoding or recognition process to determine the correct segmentation. The probability of a segmentation of the sentence is calculated by combining the class-based SLM with the regular grammar, as shown in Equation (6):

\Pr(W) = \Pr(w_1, \ldots, w_i, \ldots, w_n) = \Pr(w_1) \cdots \Pr(w_i \mid \text{<Name>}) \Pr(\text{<Name>} \mid w_{i-2}, w_{i-1}) \cdots \Pr(w_n \mid w_{n-2}, w_{n-1})    (6)

where w_i is a personal name, a combination of several Chinese words or characters, and the other words, i.e., {w_1, \ldots, w_n} \ {w_i}, are not personal names. Pr(w_i | <Name>) is the probability that w_i is a personal name, and Pr(<Name> | w_{i-2}, w_{i-1}) is the tri-gram probability that a personal name follows the word sequence w_{i-2} w_{i-1}. Take the input sentence 今天 / 张三 / 先生 / 出席 / 会议 (Today, Mr. Zhang San will attend the meeting) as an example. With the word-based tri-gram model, the probability is written as Pr(今天) Pr(张三 | 今天) Pr(先生 | 今天, 张三) Pr(出席 | 张三, 先生) Pr(会议 | 先生, 出席). With the class-based tri-gram model, the probability of the sentence is re-written by Eq. (6) as Pr(今天) Pr(张三 | <Name>) Pr(<Name> | 今天) Pr(先生 | 今天, <Name>) Pr(出席 | <Name>, 先生) Pr(会议 | 先生, 出席).
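The substitution in Equation (6) can be sketched as follows (a schematic, not the paper's implementation: `p_name` stands for Pr(w_i | <Name>) from the regular grammar of Section 4.1, and `seq_prob` for a sentence scorer such as the Eq. (4) sketch above):

```python
NAME = "<Name>"

def class_based_prob(words, name_index, p_name, seq_prob):
    """Score a segmentation whose name_index-th word is a personal name.

    The candidate name is replaced by the <Name> token for the tri-gram
    context (Eq. 6), and its internal probability Pr(w_i | <Name>) is
    supplied by the probabilistic regular grammar.
    """
    tokens = list(words)
    candidate = tokens[name_index]
    tokens[name_index] = NAME          # e.g. 张三 -> <Name>
    return p_name(candidate) * seq_prob(tokens)
```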
4.3 Bootstrap Segmentation and Confidence Measure
Training data are the key to the success of LM training. There are two ways to prepare training data for personal name extraction. The most accurate is to have human experts tag the training corpus; however, little training data can be obtained this way because it is time-consuming work. The other is to tag the training corpus automatically. Initially, a heuristic method (e.g., maximum length matching or dictionary look-up) is used to segment and tag the training corpus coarsely. The probabilistic regular grammar and the class-based SLM are then built from these training data. After that, the training corpus is re-segmented by the Viterbi tri-gram segmentation method. Finally, the newly tagged data are used to train the new LM. This is a bootstrap method that refines the training data and the segmenter. Although much noise exists in the training corpus, the bootstrap procedure can gradually eliminate its side effects.
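The bootstrap procedure amounts to the following loop (a sketch; the tagger, trainer, and segmenter are caller-supplied stand-ins for the components described above):

```python
def bootstrap(corpus, tag_coarsely, train_models, segment, rounds=2):
    """Refine the tagged data and the segmenter iteratively (Sec. 4.3)."""
    tagged = tag_coarsely(corpus)            # heuristic initial tagging
    models = train_models(tagged)            # regular grammar + class SLM
    for _ in range(rounds):
        tagged = [segment(s, models) for s in corpus]  # Viterbi re-segmentation
        models = train_models(tagged)        # re-train on the new tags
    return models
```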
Since Viterbi segmentation outputs only one result, the most probable path, many possible segmentations and personal names are discarded. Hence, we use a forward-backward method to consider all candidate segmentations, as follows. First, the Viterbi method is used to segment each training sentence into a word sequence; this is the forward step. Then, in the backward step, all possible segmentations (the N-best paths) with their probabilities are generated using the A* algorithm (Jelinek et al., 1991). We then use all candidate paths in the training procedure to overcome the shortcoming of the Viterbi method. Furthermore, each segmentation result can be assigned a confidence measure, defined in Equation (7):

Confidence(w_i) = \frac{\text{number of N-best paths in which } w_i \text{ appears}}{\text{number of N-best paths}}    (7)
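Equation (7) is a relative frequency over the N-best list; a minimal, self-contained sketch (the three example paths are invented):

```python
def confidence(word, nbest_paths):
    """Fraction of N-best segmentation paths containing the word, Eq. (7)."""
    return sum(word in path for path in nbest_paths) / len(nbest_paths)

# Hypothetical 3-best segmentations of one sentence:
paths = [["今天", "张三", "出席", "会议"],
         ["今天", "张", "三", "出席", "会议"],
         ["今天", "张三", "出席会议"]]
print(confidence("张三", paths))  # 2/3
```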
5. Applications of Personal Name Extraction
Information extraction is usually the first step of many real applications. Similarly, we would like to apply personal name extraction and recognition in real applications. Personal names are key components of input sentences for sentence understanding. Hence, personal name extraction and recognition are widely useful in natural language understanding, information retrieval, machine translation, language input, etc. In this paper, we use language input to demonstrate the application of personal name extraction, taking Chinese Pinyin input as an example; the solution can be applied to many other similar applications. The personal names in a sentence directly influence the understanding of the sentence. Hence, they are more important than other normal words, though they occupy a small proportion of the input sentence. But in Chinese input, personal names are seldom decoded correctly because they cannot be sufficiently trained due to the lack of training data. Hence, the application of personal name extraction technology in Chinese input methods is of increasing importance. One direct solution is to use personal name extraction as a preprocessing step of the input method. First, we extract all personal names from the training corpus. Then, personal names with high frequency are added to the dictionary. Finally, a new tri-gram model is built on this new dictionary. However, this method is only theoretically appealing; in practice, we found that it cannot improve the accuracy of personal name extraction. The failure of this solution may be due to two facts. One is that only a small portion of personal names is collected into the dictionary, while most personal names in the test corpus cannot be found in the dictionary. The second is, again, the sparseness of the training data. Another solution is to apply the personal name extraction technology directly in the input process. On the one hand, possible Chinese characters are combined to form
possible Chinese personal names by the probabilistic regular grammar; on the other hand, the class-based SLM searches for the most probable combinations of Chinese characters. Unfortunately, our experiments show that the improvement is still very limited. First, although the accuracy of surname recognition is greatly improved by the personal name extraction technology, the accuracy of given name recognition is the same as before. The main reason is confusion among Chinese personal names: multiple written names can share the same pronunciation. For example, two distinct written names both pronounced Li Xiao Ming are commonly used in Chinese; in this case, even a human expert cannot judge which one is correct for conversion. Second, there are unavoidable side effects in personal name recognition, e.g., non-names mis-recognized as personal names. Hence, directly applying the personal name extraction technology is not a good way to deal with the Chinese input problem.
To overcome the shortcomings of the above two methods, we propose a new cache-based personal name recognition method. The basic assumption is that a personal name often appears more than once in an article. If the system can automatically detect and remember the personal names that the user has typed before, it will not make the same errors next time. The cache model, especially the uni-gram cache model (Kuhn, 1988), has been widely used in speech recognition to improve recognition accuracy from the user's explicit corrections or the system's pseudo corrections. Although the uni-gram word cache model works well for common words, it does not perform well for personal name recognition. Since a Chinese personal name is typically constructed from several single Chinese characters, the uni-gram cache has to update the corrected characters separately, so the learning speed becomes the bottleneck of this technique. In order to speed up the learning process, we treat the whole Chinese personal name as one token in the uni-gram cache model and adjust its probability when the user corrects the Chinese personal name. The process contains two steps. In the first step, the personal name extraction module identifies the personal names in the input sentence that has been modified by the user. In the second step, the system inserts the extracted personal names into the cache model (Jelinek et al., 1991) and re-estimates the probabilities of the related items. Suppose N_i is the personal name that the user wants to input. When N_i is mis-recognized by the original system as the word M_i (and is corrected manually), N_i should be added to the personal name cache model, and the probabilities between N_i and the context words w_i should be re-estimated to reduce recognition errors next time. From the statistical language model, we see that the mis-recognition is caused by the probabilities of N_i and M_i given the context w_i. The original probabilities are shown in Equation (8):
s_1 = \Pr(\text{<Name>} \mid w_i), \quad s_2 = \Pr(N_i \mid w_i), \quad s_3 = \Pr(M_i \mid w_i), \quad \Pr(N_i \mid \text{<Name>}) \approx s_2 / s_1    (8)

Based on the original probability distribution, s_3 > s_2, so the system chooses M_i instead of N_i. After the user's modification, we need to increase the probability of N_i given w_i and decrease the probability of M_i given w_i at the same time. The update formulas are shown in Equation (9):

\Pr_{new}(N_i \mid \text{<Name>}) = (s_3 / s_1)(1 + \alpha), \quad \alpha > 0
\Pr_{new}(N_i \mid w_i) = \Pr_{new}(N_i \mid \text{<Name>}) \Pr(\text{<Name>} \mid w_i) = s_3 (1 + \alpha)
\Pr_{new}(M_i \mid w_i) = \Pr(M_i \mid w_i) = s_3    (9)
Thus, Pr(N_i | w_i) will be larger than Pr(M_i | w_i) next time. This method can also introduce side effects into the conversion, but as can be seen from Equation (9), Pr(N_i | w_i) is restricted by the class-based probability Pr(<Name> | w_i). Hence, the probability distribution stays within a reasonable range.
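A sketch of the cache update in Equations (8) and (9), with α = 0.5 and the three probabilities chosen arbitrarily for illustration:

```python
def cache_update(s1, s2, s3, alpha=0.5):
    """Re-estimate Pr(N_i | w_i) after a user correction, Eq. (9).

    s1 = Pr(<Name> | w_i), s2 = Pr(N_i | w_i) before the update,
    s3 = Pr(M_i | w_i) for the mis-recognized word; s3 > s2 initially.
    Returns the new pair (Pr(N_i | w_i), Pr(M_i | w_i)).
    """
    p_name_given_class = (s3 / s1) * (1 + alpha)  # Pr_new(N_i | <Name>)
    p_n = p_name_given_class * s1                 # = s3 * (1 + alpha) > s3
    return p_n, s3                                # Pr(M_i | w_i) unchanged

print(cache_update(0.10, 0.02, 0.05))  # (0.075, 0.05): the name now wins
```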
6. Experiments and Discussions
6.1 Training Data & System Setup
We used the balanced Chinese corpus developed by our group at Microsoft Research Asia to train the class-based language model and the probabilistic regular grammar. The corpus contains 1.6 billion characters in seven domains. The original personal name tagging system tags the personal names in the collected corpus based on pre-defined heuristic rules. After tagging, all personal names are replaced by a special class tag. Then, Katz's back-off tri-gram model (Katz, 1987) is built on the tagged corpus. Since the collected Chinese personal names are sufficient to train a bi-gram model for personal names, Equation (3) is chosen to model the personal names, and the Good-Turing smoothing algorithm is applied to deal with the data sparseness problem. The average precision/recall of the baseline personal name tagging is not high, but we applied the bootstrap method to refine the tagging accuracy gradually. To test the effectiveness of our method, we conducted two kinds of experiments: one calculates the precision and recall of our personal name extraction method; the other applies the personal name extraction method in a language input system and measures the error rate reduction in the new system.
6.2 Precision/Recall of Personal Name Extraction
For testing the precision and recall of personal name extraction, we use data collected from some popular Chinese websites, consisting of 100 articles in six categories: Sports News, International News, Economical News, National News, Social News, and Entertainment News. It is a tedious job to calculate precision and recall over a large test corpus, which is why we use only this small collection of test data to demonstrate the performance of our system. The total number of test sentences is 5,000, and the total number of characters is about 100,000. The F-measure (van Rijsbergen, 1979) is a statistic commonly used for evaluating Information Retrieval systems. It is defined in Equation (10):

P = Precision = (number of correct tags by the NE tagger) / (number of all tags by the NE tagger)
R = Recall = (number of correct tags by the NE tagger) / (number of all tags in the test collection)
F = (P × R) / ((P + R) / 2)    (10)

Table 1. The average precision-recall for different categories of our approach.
Category | Total Name # | Extracted # | Correctly Extracted # | Precision (%) | Recall (%) | F-measure
Sports News | 199 | 169 | 163 | 96.44 | 81.91 | 88.59
National News | 96 | 88 | 80 | 90.91 | 83.33 | 86.96
International News | 47 | 31 | 25 | 80.65 | 53.19 | 64.10
Entertainment News | 143 | 95 | 80 | 84.21 | 55.94 | 67.23
Economical News | 54 | 61 | 53 | 86.89 | 98.15 | 92.17
Social News | 65 | 55 | 53 | 96.36 | 81.54 | 88.33
Total | 604 | 499 | 454 | 90.98 | 75.17 | 82.32

Table 2. The average precision-recall for different categories of the rule-based method.
Category | Total Name # | Extracted # | Correctly Extracted # | Precision (%) | Recall (%) | F-measure
Sports News | 199 | 156 | 150 | 96.15 | 75.38 | 84.51
National News | 96 | 83 | 75 | 90.36 | 78.13 | 83.80
International News | 47 | 35 | 19 | 54.29 | 40.43 | 46.34
Entertainment News | 143 | 89 | 72 | 80.90 | 50.35 | 62.07
Economical News | 54 | 60 | 50 | 83.33 | 92.59 | 87.72
Social News | 65 | 57 | 54 | 94.74 | 83.08 | 88.52
Total | 604 | 480 | 420 | 87.50 | 69.54 | 77.49
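Equation (10) in code, checked against the Sports News row of Table 1 (note that F = P·R / ((P + R)/2) is the usual harmonic mean 2PR/(P + R)):

```python
def prf(correct, extracted, total):
    """Precision, recall and F-measure as defined in Equation (10)."""
    p = correct / extracted          # precision
    r = correct / total              # recall
    return p, r, (p * r) / ((p + r) / 2)

# Sports News, Table 1: 163 correct of 169 extracted; 199 names in total.
print(prf(163, 169, 199))  # ≈ (0.9644, 0.8191, 0.8859)
```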
Table 1 shows the precision and recall for different text categories using the method proposed in Section 4; the total average precision, recall, and F-measure are also listed. Table 2 shows the results of the rule-based method, which is used as our baseline system. We found that the average F-measure increases from 77 to 82 after applying our method. Since the baseline personal name tagging system is not accurate enough, it is somewhat difficult to eliminate the side effects of the noisy data, but the statistical method still outperforms the rule-based method. Moreover, the statistical method can be easily applied in other applications, for example, speech recognition, machine translation, etc. From Table 1, we can see that the performance of our extraction method varies from category to category. For example, the precision, recall, and F-measure for International News are much lower than for the others. This is mainly because most names in International News are foreign personal names, while our extraction method is mainly developed to deal with Chinese personal names. We also find that for almost all categories, the precision is much better than the recall. The average recall is somewhat low, partially because the current version of our system does not fully utilize some valuable information in the full context, due to strict time limitations. For example, a name often appears many times at different places in a document. Since our personal name extraction method uses the context of a word to decide whether it is a name, the same word may be extracted as a name in some places but not in others. To solve this kind of problem, we propose a cache-based approach: the system remembers the extracted names and assigns each of them a confidence measure. If a name appears again, we can use this information to boost its confidence. When a name is extracted, we can also trace back to the beginning of the document to find any instance of the same word that was not extracted as a name; if such an instance is found, we can boost its confidence and re-detect it. Since this case occurs frequently, we believe that with this approach the recall will be improved significantly.
We also evaluated the effectiveness of the bootstrapping segmentation method. Since re-training is a time-consuming task, we re-trained the model only twice. The F-measure comparison is shown in Figure 2.

[Figure 2: F-measure per category (Sports, National, Intl., Entertainment, Economy, Social) for the Rule Based, Statistical + CFG (1 round), and Statistical + CFG (2 rounds) systems.]
Figure 2. The F-measure comparisons for bootstrapping segmentation.

6.3 Application in the Chinese Pinyin Input System
One important application of our personal name extraction (PNE) method is in a Chinese Pinyin input system. Our baseline Chinese Pinyin input system uses a tri-gram language model and a Viterbi beam search engine. To evaluate the effectiveness of applying our personal name extraction method in the Chinese Pinyin input method, we conducted experiments comparing the decoding performance of the baseline input system and the input system incorporating our personal name extraction method. The test corpus used in our experiment is a collection from one of the entertainment newspapers in China, containing 5M Chinese characters. The results are shown in Table 3. From Table 3, we found that purely applying personal name extraction (PNE) in the real application does not achieve good results, due to the confusion among Chinese personal names. However, after using the cache technology, the side effects are greatly reduced. Although the reduction of the total error rate is not very significant, the error rate reduction for personal names is quite good, and the total error rate reduction is also much better than when the cache model is not used. Besides, since the personal names in a sentence are very important for understanding its meaning, a 30% error rate reduction for personal names has a great, positive impact on the effectiveness of a Chinese Pinyin input method.

Table 3. Performance of personal name extraction in the Chinese Pinyin input application.
System Configuration | Total Error Rate | Personal Name Error Rate | Error Rate Reduction for Total Error Rate | Error Rate Reduction for Personal Names
Baseline | 8.96% | 48.6% | - | -
Baseline + PNE | 8.86% | 45.0% | 1.1% | 7.4%
Baseline + PNE + Dynamic Cache | 8.30% | 34.0% | 7.4% | 30%
7. Conclusions
In this paper we proposed a new statistical approach to personal name extraction, combining a statistical language model with a probabilistic regular grammar to extract personal name information. Furthermore, we applied the personal name extraction technology in a real application to reduce the recognition error rate of personal names. Compared with the baseline system, our method obtains approximately a 30% error rate reduction for personal name recognition.
Acknowledgements
We would like to thank Dr. Zheng Zhang and Dr. Wei-Ying Ma for their help in revising this paper.
References
Arampatzis, A. T., Tsoris, T., Koster, C. H. A., and van der Weide, T. P. (1998). Phrase-Based Information Retrieval. Information Processing & Management, 34, 6, 693-707.
Bikel, D., Miller, S., Schwartz, R., and Weischedel, R. (1997). NYMBLE: A High-Performance Learning Name-finder. Proceedings of the Fifth Conference on Applied Natural Language Processing, Association for Computational Linguistics (pp. 194-201).
Chen, Z. and Lee, K.-F. (2000). A New Statistical Approach to Chinese Pinyin Input. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (pp. 241-247).
DARPA. (1995). Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD, USA. Morgan Kaufmann.
Day, D. and Palmer, D. D. (1997). A Statistical Profile of the Named Entity Task. Proceedings of the Fifth ACL Conference for Applied Natural Language Processing (ANLP-97), Washington, D.C.
Fox, E. A. (1983). Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD thesis, Cornell University Department of Computer Science.
Huang, X., Belin, M., Alleva, F., and Hwang, M. (1993). Unified Stochastic Engine (USE) for Speech Recognition. Proceedings of ICASSP-93 (vol. 2, pp. 636-639).
Jelinek, F. (1997). Statistical Methods for Speech Recognition. The MIT Press, Cambridge, Massachusetts.
Jelinek, F., Merialdo, B., Roukos, S., and Strauss, M. (1991). A Dynamic Language Model for Speech Recognition. Proceedings of the DARPA Workshop on Speech and Natural Language (pp. 293-295), Pacific Grove, California. Defense Advanced Research Projects Agency, Morgan Kaufmann.
Katz, S. (1987). Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35, 3, 400-401.
Kim, J.-H. and Woodland, P. C. (2000). A Rule-based Named Entity Recognition System for Speech Recognition. Proceedings of the Sixth International Conference on Spoken Language Processing (vol. 1, pp. 528-531).
Kuhn, R. (1988). Speech Recognition and the Frequency of Recently Used Words: A Modified Markov Model for Natural Language. Proceedings of the 12th International Conference on Computational Linguistics (pp. 348-350).
Lee, K.-F. (1989). Automatic Speech Recognition. Kluwer Academic Publishers.
Lee, C.-H., Soong, F. K., and Paliwal, K. K. (1996). Automatic Speech and Speaker Recognition: Advanced Topics. Kluwer Academic Publishers.
Li, J. and Wang, X. (2000). An Effective Method on Automatic Identification of Chinese Names. High Technology Letters, 10, 2, 46-49.
van Rijsbergen, C. (1979). Information Retrieval. Butterworths, London.