Chinese Character-based Segmentation & POS-tagging and Named Entity Identification with a CRF Chunker

Xinhui Hu 1,2, Hideki Kashioka 1,2

1 National Institute of Information and Communications Technology
2 ATR Spoken Language Translation Research Laboratories, Hikaridai 2-2-2, Seika-cho, Soraku-gun, Kyoto 619-0228, Japan
{xinhui.hu, hideki.kashioka}@atr.jp

Abstract. In this paper, we propose a character-based conditional random field (CRF) chunker to identify Chinese named entity words in text. Its input comes from a character-based tagger in which segmentation and part-of-speech (POS) tagging are conducted simultaneously. The character-based tagger is trained on a corpus in which each character is tagged with both its position in the word (POC) and the POS tag of the word. The chunker is trained on an IOB2-tagged corpus, in which each character is labelled with POC, POS, and chunk tags (one of B, I, O). Four kinds of named entities, namely personal names, location names, organization names, and other proper nouns, are taken as identification targets. In experiments using the People's Daily corpus, we found that the CRF chunker obtains better results than the maximum entropy model and the support vector machine model when similar features are used. We also confirmed that bigram features for the CRF chunker are superior to unigram features, and that nearly 1% improvement in identification is obtained with the addition of POS information.

Keywords: Chinese Segmentation & POS tagging, Named Entity Identification, Character-based Model, ME, CRF

1 Introduction

Named entity (NE) identification is an important task in natural language processing applications such as question answering, information retrieval, and information extraction. Although great success has been achieved for English NE identification, performance for Chinese is not yet satisfactory. This is due to special difficulties in Chinese: there are no explicit delimiters to separate words, as white spaces do in English, and there is still no clear definition of Chinese words. Segmentation of Chinese text is therefore generally a prerequisite step for many natural language processing tasks. Compared with English, Chinese NE identification and word segmentation interact with each other in nature, so a highly accurate and stable segmentation system is important for NE identification. In recent years, great improvement has been achieved in Chinese segmentation. One noticeable technique is the introduction of character-based tagging, which is proving to be robust in the identification of new words [1]. Such a method has also been found effective for Japanese morphological analysis [2, 3].

For named entity identification, recent research has been focusing on machine learning approaches, including hidden Markov models, maximum entropy models, and support vector machine models. Zhang et al. [4] showed a tagging method similar to the character-based one, using a hidden Markov model; their method achieved 69.88% precision and 91.65% recall for recognizing the names of Chinese people in the People's Daily. Goh et al. [5] also used a character-based position tagging method with support vector machines; their method achieved 63.8% precision and 58.4% recall for general unknown words in Chinese in the People's Daily. Sun et al. [6] proposed a class-based language model for Chinese NE identification, in which each type of NE is defined as a class and the language model is composed of two sub-models, an entity model and a contextual model; segmentation and NE identification are unified into one integrated process. They achieved 81.24% precision and 81.24% recall for personal names, 86.89% precision and 78.65% recall for location names, and 75.90% precision and 47.58% recall for organization names.

In this paper, we propose a method in which the character-based approach is used for both segmentation and NE identification. Moreover, POS tagging is conducted simultaneously with segmentation and is also used for the succeeding NE identification. Our realization of Chinese NE identification is outlined in Figure 1. The whole system is a cascaded one consisting of two independent parts: one is segmentation and POS tagging (denoted as seg.&tag hereafter), and the other is a chunker for NE identification. (1) Segmentation and POS tagging are combined into one integrated step. The tagger is trained on a character-based tagged corpus in which each character is tagged with its position in the word and the POS tag of the word it belongs to. The maximum entropy (ME) model is adopted for the implementation; we refer to this model as the ME tagger hereafter. (2) The output characters from the ME tagger are chunked by a CRF model, which is also trained on a character-based tagged corpus, in which personal names, location names, organization names, and other proper nouns are labeled in the IOB2 convention.

[Figure 1. Flow of Chinese Named Entity Word Identification. The training corpus yields character-POC-POS tagged data for the ME segmentation & POS-tagger and character-IOB2 tagged data for the CRF chunker; an input sentence passes through the ME tagger and then the CRF chunker to produce the output.]

In what follows, we describe the above processes in detail. We first explain briefly the motivation for applying the character-based ME tagger to Chinese seg.&tag (Section 2). Then, we illustrate our method for chunking NEs with the CRF model (Section 3). Finally, we discuss experimental results and give conclusions (Section 4).

2 Character-based Segmentation and POS Tagging

As in other tasks of Chinese natural language processing, word segmentation is indispensable in NE identification, and its performance greatly influences the accuracy of identification. Recently, the character-based approach has been gaining ground in Chinese segmentation and Japanese morphological analysis [1, 2], and it has proven robust for unknown words. In Chinese text, the majority of characters are Hanzi (Chinese characters); these Hanzi often have well-defined meanings and can serve as independent words on their own. For this reason, we adopted the character-based approach for segmentation in NE identification, since NEs readily give rise to new words.

However, most NE identification systems utilize only segmentation information, and POS information is often ignored. We believe POS information is important, since many proper nouns have a relatively stable structure: they are often composed of several small words, and these words are common words with definite POS tags. Compared to pure word segmentation, the overall accuracy of seg.&tag risks decreasing when POS tagging is introduced, but we expect the benefit of using POS tagging to outweigh this decrease. So in this study, we utilize both segmentation and POS information in NE identification. Ng and Low [8] conducted experiments using character-based segmentation and POS-tagging, and confirmed that the all-at-once approach (the two tasks are conducted in one step) is slightly superior to the one-at-a-time approach (the two tasks are conducted independently; POS-tagging is conducted after segmentation is finished). We also conducted a few experiments on these two methods and found no major differences between them, though all-at-once gave slightly higher seg.&tag accuracy than one-at-a-time. So, in this research, we adopted the all-at-once method for segmentation and POS tagging, meaning that segmentation and POS-tagging are integrated together and conducted simultaneously.

2.1 Preparation of Training Data for Seg.&Tag: Combination of POC and POS Tagging

In the training data, each character is tagged with its position within the word (POC) and the POS tag of the word it belongs to. Each character is first classified into one of the four categories shown in Table 1. Then, the POS tag of the word to which the character belongs is combined with the POC tag, and the two are labeled together with the character itself.

Table 1. Position of Character (POC) Tags

Tag   Description
S     One-character word
B     Beginning character of a word
M     Intermediate character of a word (for words longer than two characters)
E     Ending character of a word

The following is an example of tagged data for the original annotated sentence "天安门/ns 广场/n 上/f":

天/ns_B 安/ns_M 门/ns_E 广/n_B 场/n_E 上/f_S
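To make the conversion concrete, the following is a minimal Python sketch (our illustration, not the authors' code; the function name char_tags and the list-of-(word, POS)-pairs input format are assumptions made for exposition):

    # Convert a word/POS-annotated sentence into character-level POC+POS tags.
    # Tags follow the POS_POC format of the example above (e.g. "ns_B").
    def char_tags(annotated):
        tagged = []
        for word, pos in annotated:
            if len(word) == 1:                      # one-character word: tag S
                tagged.append((word, pos + "_S"))
                continue
            tagged.append((word[0], pos + "_B"))    # beginning character
            for ch in word[1:-1]:                   # intermediate characters
                tagged.append((ch, pos + "_M"))
            tagged.append((word[-1], pos + "_E"))   # ending character
        return tagged

    print(char_tags([("天安门", "ns"), ("广场", "n"), ("上", "f")]))
    # [('天', 'ns_B'), ('安', 'ns_M'), ('门', 'ns_E'), ('广', 'n_B'), ('场', 'n_E'), ('上', 'f_S')]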

2.2 The Maximum Entropy (ME) Tagger and Features

We use a maximum entropy model (the ME tagger) to accomplish seg.&tag. Although the CRF method has proven effective for Chinese seg.&tag [7], our experiments showed that training a generic character-based Chinese tagger with 40 POS tags under a CRF is computationally too heavy, especially when the training data becomes large. Therefore we adopt the ME approach for seg.&tag, since it offers a good tradeoff between performance and the requirements on the computation system. The features used in the model are instantiations of the feature templates shown below. They are chosen within a context window of five characters as the training data is scanned.

(a) C_i, i ∈ [−2, 2]
(b) C_i C_{i+1}, i ∈ [−2, 1]
(c) C_{−1} C_0 C_1
(d) T_i, i ∈ [−2, −1]

Here, C_i represents the character at position i relative to the current character. Template (a) stands for the current character and the two previous and two following characters. Template (b) stands for combinations of two successive characters; for example, C_{−2}C_{−1} means the combination of the characters at positions −2 and −1. Template (c) is the combination of the three characters at the previous, current, and next positions. T_i means the tag of the character at position i.
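To make the templates concrete, the following is a minimal Python sketch (our illustration, not the authors' implementation; the function name, feature-string format, and padding symbol are assumptions) of instantiating templates (a)-(d) for one character position:

    # Instantiate feature templates (a)-(d) for the character at `pos`.
    # `prev_tags` holds the tags already assigned to the two preceding characters.
    def me_features(chars, pos, prev_tags):
        pad = ["<PAD>"] * 2
        c = pad + list(chars) + pad       # pad so positions -2..+2 always exist
        i = pos + 2
        feats = []
        for off in range(-2, 3):          # (a) C_i, i in [-2, 2]
            feats.append("C%d=%s" % (off, c[i + off]))
        for off in range(-2, 2):          # (b) C_i C_{i+1}, i in [-2, 1]
            feats.append("C%dC%d=%s%s" % (off, off + 1, c[i + off], c[i + off + 1]))
        feats.append("C-1C0C1=" + c[i - 1] + c[i] + c[i + 1])   # (c)
        for off in (-2, -1):              # (d) T_i, i in [-2, -1]
            feats.append("T%d=%s" % (off, prev_tags[off]))
        return feats

    # Features for the third character (门) of "天安门广场上":
    print(me_features("天安门广场上", 2, ["ns_B", "ns_M"]))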

In our experiments, we use the maximum entropy toolkit "MaxEnt" written by Zhang Le (http://homepages.inf.ed.ac.uk/s04050736/maxent_toolkit.html) for the implementation.

2.3 Evaluation of Segmentation and POS Tagging

The training data for the ME tagger originated from the People's Daily corpus. Instead of the 4 possible tags ("S", "B", "M", "E") of pure segmentation, the number of tag categories is multiplied by the total number of POS tags (here, 40) when POS tagging is performed at the same time. Data sparseness is therefore expected to occur

more easily than in pure segmentation. We conducted evaluations on ME taggers trained with different amounts of training data, ranging from 1M to 5M words. The test set contains 2355 sentences (141K words) randomly selected from the open data (also from the People's Daily corpus). Figure 2 shows the F-score values of seg.&tag. The pure segmentation F-scores (shown as seg.) were also computed for each case; they were computed from the seg.&tag results by removing the POS tags.
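For reference, the F-score here is presumably the standard balanced F-measure combining precision P and recall R:

    F = 2PR / (P + R)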

[Figure 2. Word Segmentation and POS Tagging Performance with Different Training Sizes: F-scores (%) of seg.&tag. and seg. for training corpus sizes of 1M to 5M words.]

From the results, we found that the accuracy of the ME tagger is influenced heavily by the training size, but segmentation becomes saturated as the training size grows. The difference between the F-scores of pure segmentation and seg.&tag is large when the training data is small. For example, with 1M words of training data, the difference is over 0.06, and the F-score of segmentation is only 0.921, much poorer than the best result of the 2005 SIGHAN workshop on Chinese segmentation using the People's Daily corpus (F-score = 0.945). This can be attributed to data insufficiency in the training process when 40 POS tags are added to the tagging, whereas only 4 tags are needed for pure segmentation. In order to guarantee high-quality pretreatment for NE identification, we aim to build an ME tagger with performance as close to SIGHAN 2005's best result as possible. Eventually, we chose the ME tagger trained on 5M words. As shown in Figure 2, in this case the F-score of seg.&tag is 0.932, while the F-score of pure segmentation is 0.952.

3 CRF Chunker for NE Identification

3.1 The CRF Model

We propose a character-based conditional random field (CRF) model for chunking named entities. The CRF takes the output of the character-based ME tagger as its input and outputs a chunk tag for each character. Being conditionally trained, CRFs can easily incorporate a large number of arbitrary, non-independent features while still having efficient procedures for non-greedy finite-state inference and training. This gives them advantages over generative models. CRFs can be augmented with a large number of observation features, and they have shown success in various sequence modeling tasks, such as Chinese segmentation, Japanese morphological analysis, and phrase extraction [7, 9, 10, 11]. The expressive power of the model is often increased by adding new features that are combinations of the original features, and it is reasonable to expect that more features can improve the model's performance. But it is infeasible to put arbitrarily complicated features into the model; the two limitations are training time and memory overflow. For example, the number of features amounts to L*N for unigram features and L*L*N for bigram features, where L is the number of output classes and N is the number of unique strings expanded from the given feature templates. When the number of classes is too large, such features make training inefficient or even cause the training procedure to be interrupted. We describe how we select the feature templates for the CRF in Section 3.3.

3.2 Chunking Tags and Data Preparation

For training the CRF chunker, we use the IOB2 tagging convention to label each character of the annotated corpus: "B" marks the beginning of a chunk, "I" the inside of a chunk, and "O" the outside of a chunk. In this study, we deal with only 4 kinds of named entities: personal names (denoted as "person", for words tagged "nr" in the original People's Daily corpus), location names ("location", for words tagged "ns"), organization names ("organization", for words tagged "nt"), and other proper nouns ("otherzm", for words tagged "nz"). Table 2 shows an example of such IOB2-tagged data.

Table 2. Example of IOB2 Tagged Data

Character   POC-POS   Chunk Tag
天          ns_B      B-location
安          ns_M      I-location
门          ns_E      I-location
广          n_B       O
场          n_E       O
上          f_S       O

Nested proper nouns, which are composed of several words, are regarded as single units and labeled in the same way: the first character of the first word is tagged "B", and all the other characters are tagged "I". For example, "[香港/ns 特别/a 行政区/n]nt", which appears in the People's Daily corpus, is tagged as shown in Table 3.

Table 3. Example of IOB2 Tagged Data for Nested Proper Nouns

Character   POC-POS   Chunk Tag
香          ns_B      B-organization
港          ns_E      I-organization
特          a_B       I-organization
别          a_E       I-organization
行          n_B       I-organization
政          n_M       I-organization
区          n_E       I-organization
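As an illustration of this labeling, the following is a minimal Python sketch (our own, not the authors' code; the mapping dictionary name NE_POS is assumed) of deriving character-level IOB2 chunk tags from word-level POS annotations for non-nested entities:

    # Map NE-bearing POS tags to chunk categories.
    NE_POS = {"nr": "person", "ns": "location", "nt": "organization", "nz": "otherzm"}

    # annotated: list of (word, pos) pairs; returns one chunk tag per character.
    def iob2_tags(annotated):
        tags = []
        for word, pos in annotated:
            if pos in NE_POS:
                label = NE_POS[pos]
                tags += ["B-" + label] + ["I-" + label] * (len(word) - 1)
            else:
                tags += ["O"] * len(word)
        return tags

    print(iob2_tags([("天安门", "ns"), ("广场", "n"), ("上", "f")]))
    # ['B-location', 'I-location', 'I-location', 'O', 'O', 'O']

A nested entity such as "[香港/ns 特别/a 行政区/n]nt" would additionally require treating the bracketed span as one unit, emitting "B-organization" only for its first character.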

3.3 Selection of Feature Templates

In order to investigate effective features for chunking named entities, we compare two sets of feature templates. The first is the so-called unigram features, in which only the current output observation is included; we denote this feature set as uni-feature. In this study, the templates are selected as follows:

(a) C_i, i ∈ [−2, 2]
(b) C_i C_{i+1}, i ∈ [−2, 1]
(c) C_0 C_1 C_2
(d) T_i, i ∈ [−2, 2]
(e) T_i T_{i+1}, i ∈ [−2, 1]
(f) T_i T_{i+1} T_{i+2}, i ∈ [−2, 0]

Here, C_i stands for the character in the first column of Table 2 or 3 at position i relative to the current character, and T_i stands for the content of the second column of Table 2 or 3. The second feature template set adds bigram features to the above uni-feature set; a bigram feature means the combination of the previous and current output chunk tags. We denote this set of features as big-feature. We apply these two feature sets to NE identification in order to investigate what level the CRF model can reach using only simple features, without help from any special named entity database, depending only on the generic annotated corpus. The CRF model is implemented with the package CRF++ (http://www.chasen.org/~taku/software/CRF++). The L-BFGS algorithm is selected for optimization, and a Gaussian prior (0.8) is imposed in the training process.
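As an illustration of how such feature sets can be expressed, the following is a minimal sketch of a CRF++ template file along these lines; the template IDs and the column layout (character in column 0, POC-POS tag in column 1, as in Table 2) are our assumptions, not the authors' published configuration:

    # Character unigram templates, (a): each U* line expands against every
    # output tag, giving L*N features.
    U00:%x[-2,0]
    U01:%x[-1,0]
    U02:%x[0,0]
    U03:%x[1,0]
    U04:%x[2,0]
    # Character bigrams (b) and the trigram (c)
    U05:%x[-1,0]/%x[0,0]
    U06:%x[0,0]/%x[1,0]
    U07:%x[0,0]/%x[1,0]/%x[2,0]
    # POC-POS tag templates, (d)-(f) (current position shown)
    U10:%x[0,1]
    U11:%x[-1,1]/%x[0,1]
    # The single letter B turns on output-tag bigram features (big-feature),
    # expanding templates against adjacent output-tag pairs (L*L*N features).
    B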

3.4 Other Models and Features without POS for Comparison

The support vector machine (SVM) model and the maximum entropy (ME) model have been used successfully for chunk identification and named entity identification [12, 13]. To compare the CRF with these two types of model, we use features similar to the big-feature set of Section 3.3 to train them, and we evaluate them on the same test set. These two models are denoted as svm and maxent hereafter. The SVM toolkit is YamCha (http://www.chasen.org/~taku/software/yamcha), and the ME toolkit is the same as the one used for the ME tagger in Section 2.

As mentioned before, most current NE identification utilizes only word segmentation and ignores POS information. To investigate the effectiveness of POS information, we also train a CRF chunker with the same big-feature set but with the POS tags removed from the training and test data. This model is denoted as nopos.

3.5 Training and Test Data for the CRF Chunker

One month of the People's Daily corpus, labeled with the chunk tags described above, is used for training the CRF chunker. The 2355 sentences that were used for evaluating the ME tagger are also used for evaluating the chunker. Here we define words with an occurrence frequency of less than 2 (in the 5M-word corpus used to train the ME tagger) as unknown words. The named entity words of each category contained in the test set are shown in Table 4.

Table 4. Named Entity Words in the Test Data

Category             Count (unknown)
Personal-Name        4018 (591)
Organization-Name    508 (140)
Location-Name        2624 (357)
Other-Proper-Nouns   354 (111)
Nested Proper-Noun   1051

3.6 Experiment Results

The perl script conlleval.pl from CoNLL-2000 is used to evaluate the chunked results. The test results for the different models and feature sets are shown in Figure 3. From this figure, we found that the CRF chunker is generally superior to the other two models, svm and maxent. For personal name identification, however, the difference is not so large, meaning that all of these models have a similar ability to trace the patterns of personal names, which are generally regarded as having limited variation. For location and organization names, the CRF model is shown to be more effective than svm and maxent. As for the feature selection of the CRF model, the bigram features prove important for identification: the F-score is 6% higher than with the unigram features. On the other hand, the POS information contributed nearly 1% to the identification improvement for most named entities.

[Figure 3. Results of the Named Entity Identification: F-scores (%) of the uni-feature, big-feature, nopos, svm, and maxent models for the overall task and for the B/I tags of person, location, organization, and otherzm.]

4 Conclusion and Discussion

The proposed character-based ME tagger, which combines segmentation and POS tagging into one integrated procedure, has proven effective for overall named entity identification, although it requires more training data. The character-based approach in this tagger is found to be robust for predicting low-frequency words and new words. Transliterated foreign personal names and location names are recognized with high accuracy, whereas they are often split into several word fragments by traditional word-based processing. For example, "哈立德", "切帕洛娃", and "维诺格拉多夫" are personal names that do not appear in the training corpus, yet they are still correctly tagged with POC and POS information. Regarding the errors of the ME tagger, we also found that known (in-vocabulary) words sometimes conflict with unknown or low-frequency words. For example, "的里雅斯特" is a location name appearing once in the training corpus, but it is recognized as "的/u 里雅斯特/ns". It seems that the auxiliary "的", which has a high occurrence frequency, influenced the final decision.

The character-based CRF chunker is proposed for identifying NEs in the output of the ME tagger. We verified the validity of the CRFs when using only simple features such as bigrams and unigrams, without the help of any other special knowledge of named entities. The bigram features are found to be more effective than the unigram ones, although the training cost of the bigram features is also higher: training takes 15 hours with the unigram features and 20 hours with the bigram features. The accuracy of named entity identification is greatly increased with the bigram features (the F-score increases by more than 6%). Compared with other kinds of chunkers such as svm and maxent, the CRF model is found to be superior.

For future work, we will introduce more knowledge from other databases, such as collections of personal names, organization names, and a general grammar lexicon, to enhance the features of the CRFs. Much effort will also be required to increase the POS-tagging accuracy. In addition to using training and evaluation data from the People's Daily corpus, we plan to explore this research with other corpora to establish generality and reliability, and to evaluate the method on standard data, such as the MET2 data, in order to compare with other systems.

References

1. Nianwen Xue, "Chinese Word Segmentation as Character Tagging," Computational Linguistics and Chinese Language Processing, Vol. 8, No. 1, Feb. 2003, pp. 29-48.
2. Masayuki Asahara, Yuji Matsumoto, "Japanese Unknown Word Identification by Character-based Chunking," COLING 2004, pp. 459-465.
3. Kiyotaka Uchimoto, Satoshi Sekine, Hitoshi Isahara, "The Unknown Word Problem: Analysis of Japanese Using Maximum Entropy Aided by a Dictionary," Proceedings of EMNLP 2001, pp. 91-99.
4. Hua-Ping Zhang, Qun Liu, Hao Zhang, Xue-Qi Cheng, "Automatic Recognition of Chinese Unknown Words Based on Role Tagging," Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pp. 71-77.
5. Chooi-Ling Goh, Masayuki Asahara, Yuji Matsumoto, "Chinese Word Segmentation by Classification of Characters," Computational Linguistics and Chinese Language Processing, Vol. 10, No. 3, Sep. 2005, pp. 381-396.
6. Jian Sun, Ming Zhou, Jianfeng Gao, "A Class-based Language Model Approach to Chinese Named Entity Identification," Computational Linguistics and Chinese Language Processing, Vol. 8, No. 2, Aug. 2003, pp. 1-28.
7. Fuchun Peng, Fangfang Feng, Andrew McCallum, "Chinese Segmentation and New Word Detection Using Conditional Random Fields," Proceedings of COLING 2004, pp. 562-568.
8. Hwee Tou Ng, Jin Kiat Low, "Chinese Part-of-Speech Tagging: One-at-a-Time, or All-at-Once? Word-based or Character-based?" Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pp. 277-284.
9. Andrew McCallum, Wei Li, "Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons," CoNLL 2003.
10. John Lafferty, Andrew McCallum, Fernando Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," ICML 2001.
11. Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto, "Applying Conditional Random Fields to Japanese Morphological Analysis," Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pp. 230-237.
12. Taku Kudo, Yuji Matsumoto, "Use of Support Vector Learning for Chunk Identification," Proceedings of CoNLL-2000 and LLL-2000.
13. Andrew Borthwick, "A Maximum Entropy Approach to Named Entity Recognition," Ph.D. Thesis, New York University, September 1999.