A Method and Application of Automatic Term Extraction Using Conditional Random Fields

Weijun FU
CISTR, Beijing University of Posts and Telecommunications, P.R.C., 100876
[email protected]

Lei LI
CISTR, Beijing University of Posts and Telecommunications, P.R.C., 100876
[email protected]

Abstract: This paper proposes a Conditional Random Fields (CRF) based method, and an application, of automatic term extraction, following the theory of "Information - Knowledge - Intelligence" transformation. A CRF model is created by training on segmented and tagged corpora from different fields. Using the trained model, documents in a given field are automatically tagged and the terms of the field are automatically extracted. On this basis, the method is applied in an automatic text summarization system to raise the rate of excellent summaries. The experimental results show that the method achieves relatively high recall and precision and can effectively improve the performance of the automatic summarization system.

Keywords: Automatic Term Extraction; Conditional Random Fields; Automatic Text Summarization

1. Introduction

A term denotes a special concept in a professional field. Term extraction is extremely significant for terminology research and terminology standardization, and it is an important tool in information processing technologies such as information retrieval, information extraction, machine translation and the construction of lexical knowledge bases. In the past, term extraction was done mainly by hand. With the development of the Internet and information technology, the terms of various fields have become dynamic and highly fluid: old terms gradually disappear while new terms continuously emerge, so collecting terms manually is no longer feasible. Automatic term extraction technology therefore appeared. Automatic term extraction belongs to the scope of natural language processing and is a concrete form of information extraction. Currently, there are two main approaches to automatic term extraction, the linguistics-based approach and the statistics-based approach, and both have their own advantages and

disadvantages. Linguistics-based automatic term extraction mainly uses the terms' context [1] (for example, the prefix [2]) and the internal composition of terms to extract them. This approach has high precision, but its domain restrictions are very strict and the portability of the resulting systems is poor. Statistics-based automatic term extraction mainly uses statistical models to extract terms, such as Hidden Markov Models (HMM), Maximum Entropy models (ME) and Conditional Random Fields (CRF). Statistics-based extraction has, by contrast, lower precision, but it offers better portability and requires less professional knowledge. In this paper, we design and propose a CRF-based automatic Chinese term extraction method and then apply it in an automatic text summarization system. Section 2 describes the concept of CRF and the overall frame of the method. Section 3 focuses on the design and realization of an automatic text summarization system that uses the proposed method. Finally, we conclude with the prospects and possible improvements of our automatic term extraction method.

2. The CRF-based automatic Chinese term extraction method

A machine that trains a corpus into a model and automatically extracts the terms of a given field exhibits a kind of intelligence. In our understanding, intelligence is a smart strategy: in order to solve a problem or achieve a certain aim, a machine obtains information from its environment, summarizes it into knowledge, creates a strategy using the information and knowledge, and finally takes a series of actions that solve the problem. This is the theory of "Information - Knowledge - Intelligence" transformation [6]. We designed our automatic term extraction method according to this theory: in a given field (the environment), we collect a corpus online (information acquisition), tag and train on the corpus to create a model (knowledge learning), then automatically tag the test corpus (strategy) and extract the terms (achievement of the aim).

2.1. Conditional Random Fields

2.1.1 Undirected Graphical Models

A conditional random field may be viewed as an undirected graphical model, globally conditioned on X, the random variable representing observation sequences. Formally, we define G = (V, E) to be an undirected graph such that there is a node v ∈ V corresponding to each of the random variables representing an element Y_v of Y. If each random variable Y_v obeys the Markov property with respect to G, then (Y, X) is a conditional random field. In theory the structure of graph G may be arbitrary, provided it represents the conditional independencies in the label sequences being modeled. However, when modeling sequences, the simplest and most common graph structure encountered is that in which the nodes corresponding to elements of Y form a simple first-order chain, as illustrated in Figure 1.

Figure 1. Graphical structure of a chain-structured CRF for sequences.

Lafferty et al. [8] define the probability of a particular label sequence y given observation sequence x to be a normalized product of potential functions, each of the form

exp( Σ_j λ_j t_j(y_{i−1}, y_i, x, i) + Σ_k μ_k s_k(y_i, x, i) )        (1)

where t_j(y_{i−1}, y_i, x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i−1 in the label sequence; s_k(y_i, x, i) is a state feature function of the label at position i and the observation sequence; and λ_j and μ_k are parameters to be estimated from training data.

When defining feature functions, we construct a set of real-valued features of the observation to express some characteristic of the empirical distribution of the training data that should also hold of the model distribution. Each feature function takes on the value of one of these real-valued observation features if the current state (in the case of a state function) or previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. In the remainder of this paper, notation is simplified by writing

s(y_i, x, i) = s(y_{i−1}, y_i, x, i)

and

F_j(y, x) = Σ_{i=1..n} f_j(y_{i−1}, y_i, x, i)        (2)

where each f_j(y_{i−1}, y_i, x, i) is either a state function s(y_{i−1}, y_i, x, i) or a transition function t_j(y_{i−1}, y_i, x, i); transition functions apply only between consecutive positions, so t_j(y_{i−1}, y_i, x, i) is used for i > 1 and is taken to be 0 for i = 1. This allows the probability of a label sequence y given an observation sequence x to be written as

p(y | x, λ) = (1 / Z(x)) exp( Σ_j λ_j F_j(y, x) )        (3)

where Z(x) is a normalization factor.

2.1.2 Maximum Likelihood Parameter Inference

Assuming the training data {(y^(k), x^(k))} are independently and identically distributed, the product of (3) over all training sequences, as a function of the parameters λ, is known as the likelihood, denoted by p({y^(k)} | {x^(k)}, λ). Maximum likelihood training chooses parameter values such that the logarithm of the likelihood, known as the log-likelihood, is maximized. For a CRF, the log-likelihood is given by

L(λ) = Σ_k log p(y^(k) | x^(k), λ) = Σ_k [ Σ_j λ_j F_j(y^(k), x^(k)) − log Z(x^(k)) ]

This function is concave, guaranteeing convergence to the global maximum. Differentiating the log-likelihood with respect to parameter λ_j gives

∂L(λ)/∂λ_j = E_{p̃(Y,X)}[F_j(Y, X)] − Σ_k E_{p(Y|x^(k),λ)}[F_j(Y, x^(k))]

where p̃(Y, X) is the empirical distribution of the training data and E_p[·] denotes expectation with respect to distribution p.
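To make equations (1)-(3) concrete, the following minimal Python sketch (ours, not part of the original paper) scores a toy chain CRF over the BIEO tag set used later in this paper and computes p(y | x, λ) by enumerating Z(x) exactly. The two feature functions and their weights are invented for illustration, and positions are 0-based:

from itertools import product
from math import exp

LABELS = ["b", "i", "e", "o"]  # the BIEO tag set (see section 2.3)

def score(y, x, weights, features):
    # Sum_j lambda_j * F_j(y, x), with F_j(y, x) = Sum_i f_j(y_{i-1}, y_i, x, i)
    return sum(
        weights[j] * sum(f(y[i - 1] if i > 0 else None, y[i], x, i)
                         for i in range(len(x)))
        for j, f in enumerate(features))

def prob(y, x, weights, features):
    # Equation (3): p(y | x) = exp(score) / Z(x), with Z(x) enumerated exactly
    z = sum(exp(score(list(c), x, weights, features))
            for c in product(LABELS, repeat=len(x)))
    return exp(score(y, x, weights, features)) / z

features = [
    # state feature: fires when token "网" is tagged "b"
    lambda yp, y, x, i: 1.0 if y == "b" and x[i] == "网" else 0.0,
    # transition feature: fires on b -> e; returns 0 at the first position
    lambda yp, y, x, i: 1.0 if i > 0 and yp == "b" and y == "e" else 0.0,
]
weights = [1.5, 2.0]

x = ["新泽西", "网", "队"]
print(prob(["o", "b", "e"], x, weights, features))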

2.2. The overall frame of the method

As figure 2 shows, the method is mainly made up of two modules: the CRF-training module and the extraction module. The task of the former is to create the CRF model by training on corpora, segmented and tagged, from different fields. The latter is the core of the method; its main task is to extract the terms.

Figure 2. The overall frame of the method

2.3. CRF-training module

The CRF-training module is made up of two sub-modules, pre-processing and training. The pre-processing module has two functions: segmentation and tagging. The segmentation tool was developed by our center. The tagging module uses the common "BIEO" scheme, where "B" (beginning) means the tagged word is located at the beginning of a term; "I" (in) means it is located in the interior of a term; "E" (end) means it is located at the end of a term; and "O" (out) means it is located outside any term. Each consecutive "BIE" block is then a term, in which "I" can appear more than once. For example, a segmented sentence is as follows:

新泽西 / 网 / 队 / 客 / 场 / 100 / - / 88 / 战胜 / 迈阿密 / 热 / 队 / 取得 / 东部 / 季 / 后 / 赛 / 次 / 轮 / 首 / 战 / 的 / 胜利 / 。/

The tagged result:

新泽西/o 网/b 队/e 客/b 场/e 100/o -/o 88/o 战胜/o 迈阿密/o 热/b 队/e 取得/o 东部/o 季/b 后/i 赛/e 次/b 轮/e 首/b 战/e 的/o 胜利/o 。/o

The training module has two functions: definition of the feature template, and training. The features carried by the observation at the current position alone are limited, so it is necessary to add contextual information. In this paper, we use local contextual information, as illustrated in table 1.

Table 1. The feature template of the training
N-gram           Feature Template
1-gram feature   W-2, W-1, W0, W1, W2
2-gram feature   W-2W-1, W-1W0, W0W1, W1W2
3-gram feature   W-2W-1W0, W-1W0W1, W0W1W2

W0 is the current object; W-n is the n-th object before the current one; Wn is the n-th object after it. In the training module, we used the CRF++ v0.51 toolkit to train on the segmented and tagged corpus and create the CRF model. CRF++ is an open-source toolkit in C++ that provides two tools, crf_learn and crf_test. The former trains on the corpus and creates the CRF model file named "model"; the latter, used in the next section, automatically tags text using the model file so that terms can be extracted.
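For readers unfamiliar with CRF++, a sketch of what this setup looks like in practice. The input format (one token and its tag per line, sentences separated by blank lines) and the command names crf_learn and crf_test are standard CRF++ usage; the template below is one plausible rendering of Table 1 in CRF++'s %x[row,col] macro syntax, not necessarily the authors' exact file. A training file fragment derived from the example above (remaining tokens continue in the same way):

新泽西	o
网	b
队	e
客	b
场	e

A template file realizing Table 1:

# 1-gram features: W-2 .. W2
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
# 2-gram features
U05:%x[-2,0]/%x[-1,0]
U06:%x[-1,0]/%x[0,0]
U07:%x[0,0]/%x[1,0]
U08:%x[1,0]/%x[2,0]
# 3-gram features
U09:%x[-2,0]/%x[-1,0]/%x[0,0]
U10:%x[-1,0]/%x[0,0]/%x[1,0]
U11:%x[0,0]/%x[1,0]/%x[2,0]
# bigram template over output labels
B

Training and tagging are then:

crf_learn template train.data model
crf_test -m model test.data > test.tagged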

2.4. Extraction module

We use the crf_test tool mentioned in the previous section to tag the test corpus automatically, merge each consecutive "BIE" block in the tagged result into a term, and append the label "#" to each extracted term so that the terms can be counted. An example is shown in figure 3.
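A minimal sketch of this merging step (our illustration, not the authors' code; the file name test.tagged follows the CRF++ example above). It assumes crf_test output with the token in the first column and the predicted tag in the last:

def extract_terms(tagged_lines):
    """Merge each consecutive b/i/e block into a term; 'o' tokens are skipped."""
    terms, current = [], []
    for line in tagged_lines:
        if not line.strip():           # blank line = sentence boundary
            current = []
            continue
        cols = line.split()
        token, tag = cols[0], cols[-1].lower()
        if tag == "b":
            current = [token]
        elif tag == "i" and current:
            current.append(token)
        elif tag == "e" and current:
            current.append(token)
            terms.append("".join(current) + "#")   # '#' marks an extracted term
            current = []
        else:                           # 'o', or an i/e with no open block
            current = []
    return terms

with open("test.tagged", encoding="utf-8") as f:
    found = extract_terms(f)
print(len(found), found[:5])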

Figure 3. The example of the result of extraction

2.5. Test

The test corpus was downloaded from the data resources of the SouGou Lab (http://www.sogou.com/labs/resources.html) and consists of Chinese news from Internet media websites such as http://sohu.com, http://sina.com.cn, etc. All the news was saved as TXT and classified into three categories: IT, sports and military. We randomly selected 200 texts as a training corpus, which was tagged manually, and more than 500 texts from each category as a test corpus. The result of extraction is shown in table 2.

Table 2. The result of the term extraction
Category   Texts number   S       S1      C
IT         539            6564    6323    6189
Sports     551            13501   13236   12822
Military   549            4753    4651    4414
Total      1639           24818   24210   23425

C is the number of terms that were recognized correctly; S1 is the number of terms recognized by the system; S is the number of terms that should be recognized. Precision, Recall and F-score are computed according to the following formulas:

Precision = C / S1
Recall = C / S
F-score = (2 × P × R) / (P + R)

where P is Precision and R is Recall. The evaluation of the extraction is shown in table 3.

Table 3. The evaluation of the term extraction
Category   R       P       F-score
IT         0.943   0.979   0.961
Sports     0.969   0.950   0.960
Military   0.929   0.949   0.939
Total      0.944   0.968   0.956
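To make the formulas concrete, a quick arithmetic check (ours): plugging S = 6564, S1 = 6323 and C = 6189 from the IT row of Table 2 into the formulas reproduces the IT row of Table 3.

# IT row of Table 2: S terms expected, S1 output by the system, C correct
S, S1, C = 6564, 6323, 6189
P = C / S1                      # Precision = C / S1
R = C / S                       # Recall    = C / S
F = 2 * P * R / (P + R)         # harmonic mean of P and R
print(f"R={R:.3f} P={P:.3f} F={F:.3f}")   # R=0.943 P=0.979 F=0.961, as in Table 3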

From the testing results, we can draw the following conclusions:
♦ The R, P and F-score of each category are relatively high, which shows that the performance of automatic term extraction is good and that CRF is well suited to automatic term extraction.
♦ Terms that do not appear in the training corpus are difficult to identify and extract, and they account for most of the unrecalled terms. Expanding the coverage of the training corpus can alleviate this problem.

3. Application: automatic text summarization system

Automatic summarization is a technology in which a computer analyzes and condenses a text to generate a summary. In this section, the automatic term extraction method proposed in this paper is applied in an automatic summarization system to improve the rate of excellent summaries.

3.1. The overall frame of the system

As figure 4 shows, the system is mainly made up of two modules: the pre-processor module and the summarization module.

Figure 4. The overall frame of the system

The task of the pre-processor is to segment the text, create the word list and automatically extract terms. Term extraction uses the method proposed in the previous section, which automatically tags the segmented text with the CRF model and extracts the terms. The summarization module creates the abstract. We analyze the word list created by the pre-processor, remove the stop words, create the sentence list and process it in four steps: first, calculate the frequency and weight of each word; second, compute the weight of each sentence; third, sort the sentences in descending order of weight; finally, select the sentences with the highest weights as abstract sentences and generate the abstract. In this process, the word and sentence weights and the selection of abstract sentences depend on five kinds of text characteristics: frequency, title, location, terms, and key words or phrases. Among them, we increase the weights of the terms extracted by the pre-processor module.
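A simplified sketch of this pipeline (our reconstruction; the paper does not give the actual weighting formulas, so the mean-weight sentence scoring, the term boost factor and the omission of the title, location and keyword characteristics are all simplifying assumptions):

from collections import Counter

def summarize(sentences, terms, stop_words, term_boost=2.0, n_abstract=3):
    """sentences: list of token lists; terms: set of extracted term strings."""
    # Step 1: word frequencies and weights, boosting extracted terms.
    freq = Counter(w for s in sentences for w in s if w not in stop_words)
    weight = {w: c * (term_boost if w in terms else 1.0) for w, c in freq.items()}
    # Step 2: sentence weight = mean weight of its words.
    def sentence_weight(s):
        return sum(weight.get(w, 0.0) for w in s) / max(len(s), 1)
    # Step 3: rank sentences by weight in descending order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_weight(sentences[i]), reverse=True)
    # Step 4: take the top sentences, restored to document order, as the abstract.
    return [sentences[i] for i in sorted(ranked[:n_abstract])]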

3.2. Test

The criterion of the evaluation is mainly the coverage of the abstract. We defined 5 levels of coverage:
A. Perfect: the abstract covers all the information points, with nothing missing.
B. Good: the abstract covers the vast majority of the information points, missing only secondary or irrelevant information.
C. Middle: the abstract covers the majority of the information points, missing some information points.
D. Poor: the abstract covers a minority of the information points, missing many information points.
E. Bad: the abstract hardly covers the information points and is filled with secondary or irrelevant information.
If an abstract is of level A or B, it is defined as excellent; otherwise it is non-excellent. The test corpus is the corpus used in the previous section. All of it was saved as TXT and classified into three categories: IT, sports and military. We randomly selected more than 500 texts per category as a test corpus. Two tests were run in this section: one without term extraction, shown in table 4; the other with term extraction, shown in table 5.

Table 4. The result of test without terms extraction
Category   Text number   Excellent   Non-excellent   Rate of excellent
IT         539           475         64              88.1%
Sports     551           510         41              92.6%
Military   549           426         123             77.6%
Total      1639          1411        228             86.1%


Table 5. The result of test with terms extraction
Category   Text number   Excellent   Non-excellent   Rate of excellent
IT         539           512         27              95.0%
Sports     551           532         19              96.6%
Military   549           487         62              88.7%
Total      1639          1531        108             93.4%

From table 4, we can see that the abstracts created by the system have a high excellent rate, which reflects that the performance of the automatic Chinese summarization system is good. Comparing table 4 with table 5, the rate of excellent abstracts with term extraction is higher than that without it. So this method of automatic term extraction can effectively raise the rate of excellent abstracts.

4. Conclusions

In this paper, we propose a Conditional Random Fields (CRF) based method of automatic term extraction. In this method, a CRF model is created by training on segmented and tagged corpora from different fields. Using the trained model, documents in a given field are automatically tagged and the terms of the field are automatically extracted. The test results show that the performance of automatic term extraction is good and that CRF is well suited to automatic term extraction. On this basis, the method was applied in an automatic Chinese text summarization system to raise the rate of excellent summaries. The experimental results show that the method can effectively increase the rate of excellent summaries.

In future work, we will expand the domains, carry out more evaluation, and look for new methods to identify and extract terms that do not appear in the training corpus.

Acknowledgements

This work was funded by Ministry of Education Project 108131, Chinese NSF Project 60873001 and National Supporting Project 2007BAH05B02-04.

References

[1] K. Frantzi, S. Ananiadou. The C-value/NC-value domain independent method for multi-word term extraction. Journal of Natural Language Processing, 1999, 6(3): 20-27.
[2] J. S. Justeson, S. M. Katz. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1996, 3(2): 259-289.
[3] P. Pantel, Dekang Lin. A statistical corpus-based term extractor. Canadian Conference on AI 2001, 2001: 36-46.
[4] Feng ZHANG, Yun XU, Yan HOU, et al. Chinese Term Extraction System Based on Mutual Information. Computer Application Research, 2005.
[5] Bao LIU, Guiping ZHANG, Dongfeng CAI. Technical term automatic extraction research based on statistics and rules. Computer Engineering and Applications, 2008, 44(23): 147-150.
[6] Yixin ZHONG. Principle of Information Science, 3rd edition. BUPT Press, 2002: 189.
[7] Yixin ZHONG. Comprehensive Information Based Methodology for Natural Language Understanding. Journal of Beijing University of Posts and Telecommunications, 2004, 27(4): 1-12.
[8] J. Lafferty, A. McCallum, F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.
[9] John Lafferty, Andrew McCallum, Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. 2004.
[10] Andrew McCallum. Efficiently inducing features of conditional random fields. Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI03), 2003.
[11] Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto. Applying conditional random fields to Japanese morphological analysis. Proceedings of EMNLP, 2004: 164-172.
[12] Hanna M. Wallach. Conditional Random Fields: An Introduction. 2004.
[13] Mingcai HONG, Kuo ZHANG, Jie TANG. A Chinese Part-of-speech Tagging Approach Using Conditional Random Fields. Computer Science, 2006.
[14] Hao WANG, Sanhong DENG. Comparative Study on HMM and CRF Applying in Information Extraction. Information and Technology of Modern Library, 2007, 12.
[15] Wenjing ZHANG, Yinghong LIANG. Study on the technology of term identification. 2008, 3.
