An Improved Method of Keywords Extraction Based on Short Technology Text

Jun WANG1,2, Lei LI1, Fuji REN2
1 School of Computer, Beijing University of Posts and Telecommunications, Beijing, China
2 Faculty of Engineering, The University of Tokushima, Tokushima, Japan
buptwangjun@gmail.com; [email protected]; [email protected]

Abstract: Keywords are critical resources for information management and retrieval, automatic text classification and clustering. Keywords extraction plays an important role in the process of constructing structured text. Current algorithms of keywords extraction have matured in some ways. However, segmentation errors caused by unknown words degrade the performance of Chinese keywords extraction, particularly in the field of technological text. To solve this problem, this paper proposes an improved method of keywords extraction based on the relationship among words. Experiments show that the proposed method can effectively correct the errors caused by segmentation and improve the performance of keywords extraction, and that it can also be extended to other areas.

Keywords: Short technology text; keywords extraction; unknown words; improved method

1. Introduction

Keywords are widely used in information management and retrieval, automatic text classification and clustering. Transforming weakly structured text into structured text by extracting keywords and constructing a structured knowledge base are preconditions for the efficient management and utilization of information [1]. Current extraction algorithms typically observe characteristics of words, such as frequency, position and the semantic relations of the context, and then extract keywords by statistical methods [2]. Through sustained effort, the performance of keywords extraction has been greatly improved. However, unknown words remain a weakness of Chinese keywords extraction. The first step of most Chinese keywords extraction algorithms is segmentation. Because Chinese has no explicit boundaries between words and language is inherently ambiguous, segmentation is not an easy task, especially for technological text, which contains many new words and proper nouns that the segmentation software cannot recognize. This causes many errors and affects the subsequent steps [3]. The common remedy is to add the new words to the dictionary by hand, which costs a lot of time and effort and cannot keep up with the growing number of new words [4].

978-1-4244-6899-7/10/$26.00 ©2010 IEEE

In this paper, we analyze the performance of various methods of keywords extraction and propose a multi-feature, multi-step improved method after weighing both the accuracy and the efficiency of the extraction. By examining the relation among candidate keywords, the method can efficiently correct the errors caused by segmentation and improve the performance of extraction.

This paper is organized as follows: in section 2 we analyze the strategy of keywords extraction in light of the characteristics of technological text; in section 3 we design the extraction algorithm for technological text based on that analysis; in section 4 we demonstrate the effectiveness of our method by experiments; in section 5 we summarize our contributions and outline future work.

2. The strategy of keywords extraction

2.1 The characteristics of short technological text

In this paper, the object of extraction is short technological text, which has the following characteristics:
1) Extensive coverage. About three thousand texts are analyzed, covering 20 classes and 67 subclasses that range from mathematical theory to farm machinery. The number of texts in each class is relatively small: the largest subclass contains about 50 texts and the smallest only 7.
2) New content. These texts reflect the latest methods and results of various fields, so many new words occur in them.
3) Small number of words. Since these are short texts, each contains about one thousand words.
4) Obvious domain-specific characteristics. The texts have a precise writing style with brief titles and standard wording, and no literary or florid description.

2.2 Analysis of extraction algorithm

As mentioned before, current algorithms of keywords extraction are mainly based on statistical characteristics, and the methods can be divided as follows: machine learning, constructing a word-network graph, simple statistics, using linguistic characteristics, and so on.

The machine learning method first establishes some rules and obtains a prototype of the training model, then trains the prototype with a large number of annotated texts. During training, the machine adjusts the parameters of the model according to the response and obtains the best parameters automatically. If the training set is big and wide enough, good performance can be expected. However, this method may consume great human resources in providing annotated text, and it can only deal with text that is similar to the training set. The performance degrades dramatically when facing a new field [5].

The method of constructing a word-network graph is based on small world theory. This theory holds that a text is composed of a limited number of elements (words, participles, phrases) in some non-random way. Non-random means that certain patterns of organization assemble syntactically related elements into a meaningful sentence. The purpose of a text is to express the author's intention, so all statements focus on that intention, and this gives the word-network graph the characteristics of a small world. Words with a strong clustering characteristic are likely to be keywords. This method gives full consideration to deep-level semantic information, and the words that express the main idea of the text can be found efficiently. However, it has high algorithmic complexity (about $O(n^3)$) and is difficult to use on large-scale documents [6].

The main statistics-based method is tf-idf, the earliest and most classical algorithm of keywords extraction, which is simple and efficient. Its biggest problem is that it does not consider sentence structure, semantics or other characteristics such as position and part of speech, so it cannot find low-frequency words that carry important meaning. If some characteristics are added, local linguistic characteristics for instance, the performance can be improved [7].

In summary, considering the characteristics of short technological text, we adopt a multi-feature, multi-step method. First, a comprehensive statistical algorithm is used as the basic method to find candidate keywords by evaluating the frequency, part of speech and position of words. Afterwards, an improved method corrects the segmentation errors present in the candidate keywords and produces the final result.
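For reference, the tf-idf baseline discussed above can be sketched as follows. This is a minimal illustration of the standard formulation (term frequency times the logarithm of inverse document frequency), not the exact variant used in the cited work; the function name `tf_idf` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        scores.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                       for t, c in tf.items()})
    return scores

# Hypothetical, already segmented toy documents.
docs = [["steel", "thread", "joint", "thread"], ["cattle", "breeding", "joint"]]
scores = tf_idf(docs)
```

A term that occurs in every document (here "joint") receives a score of zero, which illustrates why a purely frequency-based score can miss words that matter because of their position or part of speech.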

3. The algorithm of keywords extraction

3.1. Characteristics of keywords

Research on the grammatical characteristics of keywords shows that keywords mainly fall into the following four categories: common nouns, noun phrases, verb phrases and adjunct words [8]. Considering the precise style of technological text and its small number of adjunct words, we focus on nouns and verbs.

Words in different positions have different importance. Generally speaking, at the level of the full text, the title, abstract and conclusion are more important; at the level of the paragraph, the first sentence is more important. In technological text, these characteristics are particularly obvious [9]. The title generally indicates the object of research and therefore includes many keywords. At the beginning of an article, the author briefly introduces the main content, which also includes some keywords. We therefore pay more attention to the words in these positions.

3.2. The extracting algorithm

Based on the discussion above, the initial algorithm is as follows: pre-processing, multi-feature statistical evaluation, and output. Pre-processing includes segmentation and POS tagging. Multi-feature statistical evaluation considers the frequency, POS and position of words. For nouns and verbs, each occurrence in the title adds 40 to the weight of the word, each occurrence in the first sentence of a paragraph adds 20, and each occurrence in any other part of the text adds 10. After this processing, the words are sorted by weight and the top five are taken as keywords. The weight is computed by formula (1).

$$f_w = w_t \times 40 + w_f \times 20 + w_s \times 10, \quad w \in \{v, n\} \tag{1}$$
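Formula (1) can be sketched in code as follows, where the three counts are the occurrences of a word in the title, in the first sentences of paragraphs, and in the rest of the text. This is a minimal sketch under stated assumptions: the input is already segmented and POS-tagged, the tags 'n' and 'v' stand for nouns and verbs, and the function name `word_weights` is hypothetical.

```python
from collections import Counter

TITLE_W, FIRST_W, OTHER_W = 40, 20, 10  # position weights from formula (1)

def word_weights(title, first_sentences, other, top_k=5):
    """Each argument is a list of (word, pos) pairs; only pos 'n' or 'v' is scored."""
    weights = Counter()
    for tokens, w in ((title, TITLE_W), (first_sentences, FIRST_W), (other, OTHER_W)):
        for word, pos in tokens:
            if pos in ("n", "v"):  # assumption: tagger marks nouns 'n', verbs 'v'
                weights[word] += w
    return weights.most_common(top_k)

# Hypothetical toy input, already segmented and tagged.
title = [("steel", "n"), ("thread", "n")]
first = [("thread", "n"), ("use", "v")]
other = [("thread", "n"), ("the", "u"), ("steel", "n")]
ranked = word_weights(title, first, other)
```

In this toy input "thread" occurs once in each position and so receives 40 + 20 + 10 = 70, while "steel" receives 40 + 10 = 50.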

Here $f_w$ is the weight of word $w$, $w_t$ is the number of occurrences of the word in the title, $w_f$ is the number of occurrences in the first sentence of a paragraph, and $w_s$ is the number of occurrences in other parts of the text; $v$ denotes verb and $n$ denotes noun.

3.3. Analysis of results

We process more than 3000 texts using the method above. It gives good results when keywords are short and common. For example, in the text "钢筋滚压直螺纹连接生产技术" (The method of producing steel bars rolling thread joint), the keywords we extracted and their weights are:

方法 (method) 50
技术 (technology) 60
钢筋 (steel bars) 70
连接 (joint) 100
螺纹 (screw thread) 160

The keywords given in the text are: 钢筋 (steel bars), 螺纹 (screw thread), 滚压 (rolling), 连接 (joint). High precision is achieved.

However, for texts with new content that contain long keywords, the performance is not satisfactory. In the text "西门塔尔牛选育方法研究" (Analysis of breeding Chinese simmental), the keywords we extracted and their weights are:

中国 (Chinese) 180
改良 (improvement) 190
塔尔 (tal) 260
西门 (simmen) 290
牛 (cattle) 490

The keywords given in the text are: 中国西门塔尔牛 (Chinese simmental), 系统选育程序 (breeding procedure), 黄牛改良 (cattle improvement), 开放核心群育种法 (ONBS), 育种目标 (breeding objective).

This shows that many unknown words were split wrongly in pre-processing, which affects the final result. It should be noticed that although these words were split, their fragments still have high weights. Taking the text "西门塔尔牛选育方法研究" (Analysis of breeding Chinese simmental) as an example, the top 15 words with the highest weights are as follows:

养 (raise) 60, 发展 (develop) 70, 奶 (milk) 80, 杂交 (hybridization) 80, 肉 (beef) 90, 性能 (performance) 90, 生产 (produce) 100, 种 (seed) 110, 黄牛 (cattle) 120, 选育 (breed) 140, 研究 (research) 150, 中国 (Chinese) 180, 改良 (improvement) 190, 塔尔 (tal) 260, 西门 (simmen) 290, 牛 (cattle) 490

Among these 15 words, most fragments of the keywords are included. We select 45 texts randomly, which include 183 keywords. The top 15 words with the highest weights are chosen as candidate keywords for analysis. The result is shown in table 1.

Table 1: Analysis on the result of extraction (number of keywords: 183)
Category                       Count   Proportion
completely correct             33      18.03%
all fragments are found        77      42.07%
part of fragments are found    43      23.49%
not found                      16      8.74%
other                          14      7.65%

"Other" means that the keywords given by humans contain words that do not occur in the text. For example, in a text about the current employment situation, one of the keywords is "就业现状" (situation of current employment), but in the text we only find "就业形势" (situation of employment) and "就业情况" (employment status), not "就业现状" (situation of current employment).

From the analysis, we can see that the proportion of completely correct keywords is less than 20%, which is even worse than the general result of extracting keywords from a single document using only the tf method. It is clear that unknown words in technological texts greatly affect the performance. However, for about half of the unknown keywords, the fragments can be found among the candidate keywords. This result suggests that if we process the candidate keywords so as to find the fragments of unknown keywords and then reassemble them, the performance will be greatly improved.

3.4. Improved method for extracting keywords

The general idea of finding fragments of unknown words and reassembling them is somewhat similar to new word identification [10]. Our method, however, searches for unknown words among the candidate keywords, not in the whole text. Therefore, simpler and more effective methods can be adopted.

Current algorithms of new word identification can be divided into two categories, rule based and statistics based. The rule based method first establishes a syntactic rule base according to Chinese word formation, then detects new words by calling the rule base and filtering forbidden strings. This method has higher accuracy, but establishing a rule base that is large enough is a thorny problem [11]. The statistics based method searches for fragment strings with a high repetition rate after basic segmentation and detects new words from these fragment strings [12]. The main problem of this method is computational complexity: since it must search the whole text, the cost can be huge if maximum matching is used. For this reason, current methods usually stop at strings of four characters and deal with the patterns "character+word" and "character+character" [13]. In our method, however, the search range is limited to the candidate keywords, and analysis shows that most fragment strings are composite words of the "word+word" pattern, not the "character+word" or "character+character" pattern. We analyze the 77 keywords mentioned earlier for which all fragments were found.

Table 2: Analysis on the pattern of unknown keywords (number of words: 77)
Pattern                 Count   Proportion
word+word               54      70.12%
character+word          19      24.67%
character+character     4       5.19%

We therefore adopt a string frequency statistics method based on maximum matching. The advantage of maximum matching is that it does not limit the length of keywords and can find hidden keywords most efficiently. The advantage of string frequency statistics is that it is simple and convenient. This method does not need to consider other rules and performs well on long composite words and on new words with special word formation. For some composite words, especially those of the "word+word" pattern, it is difficult to determine whether they are words just by observing their form. By counting the frequency of strings and considering the way and frequency in which they occur in the text, we can regard a string as a new word if it always occurs as a whole [14].

This method also avoids the sub-string problem. A sub-string is a part of a parent-string; for example, "中国人民" is a sub-string of "中国人民银行". The relation among the words in the sub-string, such as the frequency of co-occurrence, is necessarily at least as strong as that in the parent-string [15]. If we only consider the strength of the relation, the sub-string is more likely to be taken as a new word. But if the sub-string always occurs inside the parent-string and not independently, taking the parent-string as the new word is more suitable. For example, if "中国人民" occurs seven times in the full text, six of them inside the string "中国人民银行", it is clear that "中国人民银行" is more likely to be the new word. So we must distinguish between a string occurring independently and a string occurring as a sub-string. Because we use maximum matching, the sub-strings that occur inside parent-strings are not counted, and the influence of sub-strings is avoided.

The details of the string frequency statistics method based on maximum matching are as follows. Search from the beginning of the text. When a candidate keyword is found, take it as the start and examine the next word; if the next word is also a candidate keyword, continue until the word examined is not a candidate keyword. The string found in this way is a potential keyword. After the whole text has been processed, take the strings with the highest frequency as keywords. This method only needs to scan the text once, with complexity $O(n)$.
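The scan described above can be sketched as a single pass that collects maximal runs of adjacent candidate keywords and counts them. This is a minimal sketch, not the authors' implementation: the function name `candidate_strings` is hypothetical, the text is assumed to be already segmented into a token list, and discarding runs that consist of a single word is an assumption made here because such runs add nothing beyond the existing candidate.

```python
from collections import Counter

def candidate_strings(tokens, candidates):
    """Count maximal runs of consecutive candidate keywords in one pass."""
    counts = Counter()
    run = []
    for tok in tokens + [None]:  # the None sentinel flushes the final run
        if tok in candidates:
            run.append(tok)
        else:
            if len(run) >= 2:  # assumption: single-word runs are not new strings
                counts["".join(run)] += 1
            run = []
    return counts

# Toy text mirroring the sub-string example: the parent-string occurs twice,
# the sub-string occurs once on its own.
tokens = ["中国", "人民", "银行", "的", "中国", "人民", "银行", "和", "中国", "人民"]
counts = candidate_strings(tokens, {"中国", "人民", "银行"})
```

Because each run is extended as far as possible before it is counted, the two occurrences of "中国人民" inside "中国人民银行" are not counted separately; only its one independent occurrence is.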

[Figure 1. Maximum matching of string. The scan advances over the word sequence W1, W2, W3, W4, ..., Wn: W1 is a candidate keyword, W2 and W3 are also candidate keywords, W4 is not a candidate keyword; output string: W1W2W3.]

3.5. The whole algorithm flow

Combining the basic algorithm and the improved algorithm, the whole flow is as follows:
1) Pre-processing: segmentation, POS tagging, elimination of stop words.
2) Searching for candidate keywords: calculate the weights of words based on frequency, POS and position, sort the words by weight, and take the top 15 words as candidate keywords.
3) Detecting unknown keywords: use the string frequency statistics method based on maximum matching to calculate the frequency of candidate strings and sort them.
4) Result output: eliminate single characters from the candidate keywords, then take the top 4 words together with the top 2 candidate strings, 6 keywords in all.

In short, we find candidate keywords by evaluating frequency, POS and position, then search among the candidate keywords for keywords that were split by mistake, and obtain the result. That is the core of our method.
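The four steps above can be combined into one small sketch. Several points are assumptions made for illustration rather than details from the paper: the input is already segmented and POS-tagged, stop words are kept in the sequence but never scored (so they break runs), runs are counted within each part of the text separately, and the name `extract_keywords` and the toy data are hypothetical.

```python
from collections import Counter

def extract_keywords(title, first_sentences, other, stop_words=frozenset(),
                     n_candidates=15, n_words=4, n_strings=2):
    """Each of title/first_sentences/other is a list of (word, pos) pairs in text order."""
    parts = (title, first_sentences, other)
    # Step 2: weight nouns and verbs by position (40/20/10), keep the top candidates.
    weights = Counter()
    for part, score in zip(parts, (40, 20, 10)):
        for word, pos in part:
            if pos in ("n", "v") and word not in stop_words:
                weights[word] += score
    candidates = [w for w, _ in weights.most_common(n_candidates)]
    cand_set = set(candidates)
    # Step 3: count maximal runs of adjacent candidates (string frequency statistics).
    strings = Counter()
    for part in parts:
        run = []
        for word, _ in part + [(None, None)]:  # sentinel flushes the last run
            if word in cand_set:
                run.append(word)
            else:
                if len(run) >= 2:
                    strings["".join(run)] += 1
                run = []
    # Step 4: drop single-character candidates, output top words plus top strings.
    words = [w for w in candidates if len(w) > 1][:n_words]
    return words + [s for s, _ in strings.most_common(n_strings)]

# Hypothetical toy text, already segmented and tagged.
title = [("西门", "n"), ("塔尔", "n"), ("牛", "n"), ("的", "u"), ("选育", "v")]
first = [("西门", "n"), ("塔尔", "n"), ("牛", "n"), ("是", "v"), ("黄牛", "n"), ("改良", "v")]
other = [("黄牛", "n"), ("改良", "v"), ("和", "c"), ("西门", "n"), ("塔尔", "n"), ("牛", "n")]
result = extract_keywords(title, first, other, stop_words={"的", "是", "和"})
```

In this toy input the fragments 西门, 塔尔 and 牛 are reassembled into the string 西门塔尔牛, and 黄牛 and 改良 into 黄牛改良; these two strings are the last two entries of the output.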

[Figure 2. Flow diagram of algorithm. Pre-processing (segmentation, POS tagging, elimination of stop words); search for candidate keywords by frequency, POS and position; candidate keywords; detecting unknown keywords; result output.]

4. Analysis of result

We process all 3000 texts using the algorithm above; the segmentation and POS tagging programs were developed by our lab. The improved method achieves good results in finding long unknown words. Taking the text "西门塔尔牛选育方法研究" mentioned before as an example, the top 5 strings we found are as follows:

中国/西门/塔尔/牛 (Chinese simmental)
黄牛/改良 (cattle improvement)
生产/性能 (production performance)
杂交/改良 (hybridization improvement)
肉/性能 (beef quality)

Most of these candidate strings accord with grammatical rules and properly reflect the main idea of the text. The performance of extraction improves when these strings are added to the keywords.

It is worth mentioning that the improved method is domain independent and shows good performance in all classes. It detects unknown words of different types and lengths, such as "复式脉冲技术碎石" (calculus lithotripsy using complex pulse technique), "CompactPCI总线工业控制计算机" (industrial control computer based on the CompactPCI bus) and "昌平区综合体育馆工程" (Changping district sports centre project). This demonstrates the effectiveness of the improved method in correcting segmentation errors and detecting unknown words.

In order to evaluate the performance of keywords extraction after the improvement, we randomly choose 150 texts, which cover all 20 classes, and evaluate the results manually. The evaluation criteria are precision and recall [16], defined by formulas (2) and (3):

$$\text{Precision} = \frac{\text{number of automatically extracted keywords that are correct}}{\text{number of automatically extracted keywords}} \tag{2}$$

$$\text{Recall} = \frac{\text{number of automatically extracted keywords that are correct}}{\text{number of keywords given manually}} \tag{3}$$

The results of different classes do not show obvious differences, so we only show the total result in table 3.

Table 3: The result of extraction
Total keywords   Automatic extraction & correct   Precision   Recall
580              241                              26.77%      41.55%

Since there is no specialized keywords extraction system for Chinese short technological texts, in order to verify the effectiveness of our method we compare the results of the tf method, the method before improvement and the improved method.

Table 4: Comparison of result
Method               Precision   Recall
tf                   10.25%      12.42%
Before improvement   16.17%      16.72%
After improvement    26.77%      41.55%

It can be seen that both precision and recall are significantly improved. Compared with the method before improvement, precision increases by 10.6% and recall by 24.7%. It should be pointed out that for keywords extraction from general text, the tf method can reach a precision of 43.4% and a recall of 45.3% [17], and the frequency and position method [18] can reach a precision of 50.33% and a recall of 53.10% [1], but on technological text the precision and recall of both methods are below 20%. This also shows the necessity of adopting the improved method.

In the analysis, we also find some problems. Because the string frequency statistics method only considers the frequency of strings without other information, it produces some rubbish strings such as "从业人员甲醛" (practitioners formaldehyde) and "上颌缺损修复重建临床" (maxillary defect repair reconstruction clinical). How to eliminate these rubbish strings is a problem to be solved in future work.

5. Conclusion

In order to solve the problem of unknown words in keywords extraction from technological text, we proposed an improved method which corrects segmentation errors by observing the relation among word fragments and computing the maximal repeated strings. It shows good performance in experiments. The main advantages of the method are that it depends only on a single document and is domain independent, with complexity $O(n)$. In future work, on the one hand we will extend our work to other fields that also contain many unknown words, such as proposals; on the other hand we will introduce some rules to improve the performance of new word detection.

Acknowledgements

This paper is supported by Project 60873001 of National Natural Science Foundation of China, Project 108131 funded by Ministry of Education, Special Fund for Fast Sharing of Science Paper (FSSP) in Net Era by CSTD, Chinese Universities Scientific Fund and Chinese Association of Artificial Intelligence.

References

[1] Suo Hong-guang, Liu Yu-shu and Cao Shu-ying, A Keyword Selection Method Based on Lexical Chains, Journal of Chinese Information Processing, 2006, (06)
[2] Xinghua Hu and Bin Wu, Automatic Keyword Extraction Using Linguistic Features, Sixth IEEE International Conference on Data Mining Workshops (ICDMW'06)
[3] Yang Jie, Ji Duo, Cai Dong-feng, Lin Xiao-qing and Bai Yu, Keyword Extraction in Multi-Document Based on Joint Weight, Journal of Chinese Information Processing, 2008, (06)
[4] Zheng Jiaheng and Lu Jiaoli, Study of An Improved Keywords Distillation Method, Computer Engineering, 2005, (18)
[5] Chunguo Wu, Maurizio Marchese and Jingqing Jiang, Machine Learning-Based Keywords Extraction for Scientific Literature, Journal of Universal Computer Science, vol. 13, no. 10 (2007), 1471-1483
[6] Zhao Peng, Cai Qing-Sheng, Wang Qing-Yi, Geng Huan-Tong, An Automatic Keyword Extraction of Chinese Document Algorithm Based on Complex Network Features, Pattern Recognition and Artificial Intelligence, 2007, (06)
[7] A. Hulth, Improved Automatic Keyword Extraction Given More Linguistic Knowledge, in Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 2003
[8] Deng Zhen, Bao Hong, Improved Keywords Extraction Method Research, Computer Engineering and Design, 2009, (20)
[9] Fuji Ren, Automatic Abstracting Important Sentences, International Journal of Information Technology and Decision Making, Vol. 4, No. 1, pp. 141-152, 2005
[10] Liao Xiantao, Review of New Words Detection
[11] Andi Wu and Zixin Jiang, Statistically-Enhanced New Word Identification in a Rule-Based Chinese System, in Proceedings of the Second Chinese Language Processing Workshop, Hong Kong, China, 2000
[12] Jian-Yun Nie, Unknown Word Detection and Segmentation of Chinese using Statistical and Heuristic Knowledge, Communications of COLIPS, 1995
[13] Cui Shiqi, Liu Qun, Meng Yao, Yu Hao and Nishino Fumihito, New Word Detection Based on Large-Scale Corpus, Journal of Computer Research and Development, 2006, (05)
[14] He Min, Gong Cai-chun, Zhang Hua-ping, Cheng Xue-qi, Method of new word identification based on large-scale corpus, Computer Engineering and Applications, 2007, (21)
[15] Li Dun, Cao Yuan-da, Wan Yue-liang, Internet-Oriented New Words Identification, Journal of Beijing University of Posts and Telecommunications, 2008, (01)
[16] Li Su-Jian, Wang Hou-Feng, Yu Shi-Wen and Xin Cheng-Sheng, Research on Maximum Entropy Model for Keyword Indexing, Chinese Journal of Computers, 2004, (09)
[17] Ma Li, Jiao Licheng, Bai Lin, Zhou Yafu, Dong Luobing, Research on a Compound Keywords Detection Method Based on Small World Model, Journal of Chinese Information Processing, 2009, (03)
[18] He Xingui, Peng Fuyang, Fuzzy Classification of Chinese Texts, Journal of Chinese Information Processing, 1999, (01)