Comparing Genetic Algorithm and Statistical Segmentation in Terminology Extraction for Topical Forum

Yu-Hui Tao, National University of Kaohsiung
[email protected]
Chih-Lung Lin and Shu-Chu Liu, National Pingtung University of Science and Technology
[email protected], [email protected]
Abstract

An automatic document summarization system for producing frequently asked questions (FAQ) for topical forums, constructed by Huang [10], can save forum managers a great deal of time in FAQ production. However, the resulting summaries still miss important information because the summarization process lacks domain terminology. This work addresses the problem by adding a domain terminology extraction module to enhance Huang's original conceptual model. Both a Genetic Algorithm (GA) and a statistical method are adopted to train the domain-terminology corpus, and an experiment is conducted to compare their performance. The results show that both methods significantly enrich the corpus used in producing the automatic FAQ summary. The recall and precision rates of both methods are also presented, and their practical implications discussed.

Keywords: FAQ Conceptual Model, Genetic Algorithm, Statistical Segmentation, Text Summarization, Topical Forum
1. INTRODUCTION

From the perspective of knowledge management, the long-established Frequently Asked Questions (FAQ) format has been a common knowledge-sharing medium in Internet newsgroups, bulletin boards, forums and communities. An FAQ is usually edited by the board manager, who digests the original articles into a question-and-answer listing for easy access by forum members. Huang [10] proposed a conceptual model for automatically transforming topical forum articles into an FAQ summary, and empirically demonstrated that the model worked acceptably by implementing an automatic FAQ prototype system. Huang's experiment implied the potential benefits of saving time and manpower in producing FAQs while illustrating the technical feasibility of such a model. One remaining issue was that domain terminologies were not extracted appropriately into the final summary, because Huang's prototype system applied typical but straightforward methods to segment keywords in each sentence to support the decisions of forming the multi-article summary. The initial screening was done by comparing the Chinese character combinations against the Chinese Electronic Dictionary (CED) developed by the Computational Linguistics Society of R.O.C. Although the CED contains 80 thousand Chinese word combinations, special domain jargon and terminology are not covered. Obviously, domain terminologies not covered in the CED cannot be extracted in the first place. Although a statistical segmentation method was applied to the words left unidentified after matching against the CED, the unidentified word pool was too small to be effective in retaining these critical domain terminologies. Take the sample sentence in the database domain on the left side of Figure 1: the common Chinese terms 資料庫 (database), 資料 (information), 概念 (concept), 欄位 (column), 關聯 (relationship) and 屬性 (attribute) on the right-hand side of Figure 1 are retained because they exist in the CED. However, uncommon terms like 資料表 (table), relationship, 一對一 (one to one), 一對多 (one to many), and 多對多 (many to many) do not exist in the CED and thus could not be extracted.
Figure 1. Keywords extracted by CED.

Since these domain terminologies are critical to clearly expressing the domain concept of the FAQ summary, their absence lowers the importance scores of the sentences containing them when those sentences are evaluated for selection into the summary. Overall, the informative level of the forum FAQ becomes lower, because the FAQ emphasizes the completeness of the important ideas and concepts of the original articles [1]. In other words, the performance of the automatic FAQ summary is significantly affected by the completeness of these domain terminologies. To address this problem, this research aims to enhance the FAQ knowledge-transforming model for topical forums by adding a domain-terminology extraction module to Huang's model. Once the domain terminology extraction
module is executed before the FAQ summarization module, as seen in the bottom left of Figure 2, the keywords not covered by the CED will be retained in the additional domain-terminology database generated by the new module, as shown in Figure 3. Accordingly, the informative level of the FAQ summary will be significantly lifted by including in the summary those sentences containing the previously missing keywords.
[Figure 2 components: Topical Forum / Forum Articles, Information Collection, Word Segmentation Module, FAQ Organization, FAQ Presentation (FAQ Summarization Module); Domain Terminology Extraction Module; Domain Terminology Corpus; Chinese Electronic Dictionary; other corpus, keywords, similarities, statistics, ...]

Figure 2. The framework of FAQ summary for topical forum.
Figure 3. Keywords extracted by CED.

Before presenting the research design of this new module, brief background information is given next.
2. BACKGROUND INFORMATION

Document summarization has been developed for a long time [5, 6], such that its algorithms, application domains and evaluation methods have demonstrated substantial results. Regarding abstraction methods, Hovy and Lin [9] classified three types: Positional Information (PI), Natural Language Processing (NLP) and Information Retrieval (IR).
PI and IR both utilize keywords in screening important sentences. Based on the number of articles involved, text summarization can be divided into Single Document Summarization (SDS) [5, 9, 11, 18] and Multiple Document Summarization (MDS) [13]. SDS focuses on removing unnecessary information such that the remaining text still concisely represents the core concepts of the original article. In addition, MDS has to deduplicate information across multiple articles during the abstracting process. Text content screening, in principle, uses the sentence as the basic unit and then moves up to the paragraph level by categorizing sentences into topically relevant groups [18]. The majority of sentence evaluation approaches use keyword frequency to determine the importance of sentences, while some use discourse [12], syntax or semantics in the analysis.

Conventional Chinese text segmentation has mainly three types of methods: the statistical method [8, 19], the heuristic rule-based method [3] and the hybrid method [17]. A statistical segmentation method applies statistical information, such as frequencies and hurdle (threshold) values derived from a large volume of articles, to decide candidate terms within a short sentence [2]. Because corpora differ across application domains, statistical information from different corpora cannot be appropriately interchanged in decision making [17]. The strength of statistical segmentation is its effectiveness, but it is limited to first-order Markov models [14] because of time complexity; in other words, only one- or two-character terms can be easily processed. The time complexity increases when trying to extend the order dimension [17]. Therefore, if the term length exceeds two, the efficiency decreases significantly and the accuracy becomes inadequate. Moreover, it is not easy to obtain a large volume of corpus, the statistical information occupies considerable disk space, and term frequencies cannot easily be shared across corpora [2].
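To make the frequency-plus-hurdle idea concrete, the following sketch (not from the original paper; the function name and toy corpus are illustrative) counts adjacent two-character pairs over a set of short sentences and keeps those that reach a hurdle value:

```python
from collections import Counter

def bigram_candidates(sentences, hurdle):
    """Count every adjacent two-character pair and keep those whose
    frequency reaches the hurdle (threshold) value."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - 1):
            counts[s[i:i + 2]] += 1
    return {term: n for term, n in counts.items() if n >= hurdle}

corpus = ["資料庫概念", "資料庫欄位", "資料表關聯"]
print(bigram_candidates(corpus, 2))  # {'資料': 3, '料庫': 2}
```

Note that this simple scheme only yields two-character candidates, which is exactly the limitation the text attributes to first-order statistical models.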
A variety of statistical models have been proposed over the years. Fan et al. [7] applied a relaxation method commonly used in graphical processing to Chinese text segmentation, and Sproat and Shih [19] calculated the probability of neighboring words to enrich the text screening. Chang et al. [3] proposed constraint satisfaction and probability optimization in Chinese text segmentation to improve term length, execution speed and accuracy. Chiang et al. [4] weighted many statistical features, such as term frequency, term length and term type, in the evaluation, achieving an accuracy as high as 99.39%.

The heuristic rule-based method is the most commonly seen segmentation method because it is intuitive and easily implemented. It removes impossible terms by empirical rules, often with the aid of a dictionary. The most typical version is the longest-matching method, whose strategy is to search from left to right by matching the current-position character to the characters
after it, seeking the longest possible term identified in the corpus. It repeats this process from the character following the most recently identified term until no further term can be identified. Take "近代人生活" as an example: starting with "近" and matching against all the terms in the CED yields two candidates, "近代" and "近代人". Because "近代人" is longer than "近代", "近代人" is the term extracted. The longest-matching method has variants such as the forward (positive) and backward (negative) longest-matching methods. The quality of segmentation is related to the size of the corpus, so frequent corpus updates are necessary. The weakness of using a dictionary in segmentation is that the result depends on the quality of the collected terms: if new terms appear in a sentence, the accuracy of segmentation decreases, and improving accuracy by adding new terms to the corpus significantly decreases efficiency [2].

Hybrid segmentation combines both statistical and heuristic rule-based segmentation [17]: a dictionary is first used to find different segment combinations, and statistical information is then used to judge which combinations are better [10]. Chinese text segmentation has also been approached through algorithms such as the Genetic Algorithm (GA) used by Chen et al. [2]. In addition to GA, other cues have been proposed to improve the accuracy of Chinese text segmentation, such as term frequency, conjunctive word probability (Markov models), other probabilistic values, and semantic meanings [3, 8, 20].

Performance evaluation of text summarization can be either extrinsic or intrinsic [15]. Extrinsic evaluation takes the summarization outcome as the input to other information systems (ISs) and judges the overall performance of the associated ISs to determine the quality of the summarization. Intrinsic evaluation judges the quality of automatic text summarization by comparing it to the result of human summarization. The most common performance indices are the recall rate and the precision rate.
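The forward longest-matching strategy described above (including the 近代人生活 example) can be sketched as follows; this is a minimal illustration, assuming a toy dictionary in place of the CED:

```python
def longest_match(sentence, dictionary):
    """Forward longest-matching segmentation: at each position, take the
    longest dictionary term starting there; fall back to one character."""
    tokens, i = [], 0
    while i < len(sentence):
        match = sentence[i]  # single character as fallback
        for j in range(len(sentence), i + 1, -1):
            if sentence[i:j] in dictionary:
                match = sentence[i:j]  # longest dictionary hit wins
                break
        tokens.append(match)
        i += len(match)
    return tokens

ced = {"近代", "近代人", "生活"}
print(longest_match("近代人生活", ced))  # ['近代人', '生活']
```

As the text notes, any term missing from the dictionary (here, anything outside the toy set) falls back to single characters, which is exactly how unseen domain terminology gets lost.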
Let C/H be the recall rate and C/M be the precision rate, as seen in Formulas 1 and 2, respectively, where M is the number of domain terminologies extracted by the system, H is the number of domain terminologies selected by the human evaluators, and C is the number of keywords selected by both the system and the human evaluators.
Recall Rate = C / H        (1)
Precision Rate = C / M     (2)
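As a concrete illustration of Formulas 1 and 2, the following sketch (with hypothetical term sets) computes both rates from the overlap C between the system's and the human evaluators' selections:

```python
def recall_precision(system_terms, human_terms):
    """Recall = C/H and Precision = C/M, where C is the overlap,
    H the human-selected set, and M the system-extracted set."""
    c = len(set(system_terms) & set(human_terms))
    recall = c / len(human_terms)
    precision = c / len(system_terms)
    return recall, precision

sys_terms = {"資料表", "一對一", "一對多", "庫系"}       # M = 4
hum_terms = {"資料表", "一對一", "一對多", "多對多", "關聯式"}  # H = 5
print(recall_precision(sys_terms, hum_terms))  # (0.6, 0.75)
```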
3. RESEARCH DESIGN

Among the three conventional text segmentation methods, both the heuristic rule-based and hybrid methods need an existing Chinese corpus to start with, which is not preferred in this work. On the other
hand, the statistical segmentation method, which works from a large-volume data set without a pre-defined corpus to generate new terminologies, is a more intuitive fit for our problem. Among the newer algorithmic approaches to Chinese text segmentation, Chen et al. [2] proposed a GA-based approach that achieved a good performance of 79% recall and 75% precision. Therefore, this work adopts both the statistical and the GA-based segmentation methods to construct the domain terminology database and conducts a comparative evaluation of the two methods.
3.1 Statistical Segmentation Method

The process for constructing the domain terminology database using the statistical segmentation method is shown in Figure 4.

[Figure 4 flow: Select training articles → Segment sentences → Remove interjections, adverbs, auxiliaries, conjunctions and prepositions → Generate 2- to (N-1)-character terms and corresponding frequencies → Remove terms found in the Chinese Electronic Dictionary → Screen domain terminology by evaluators → Statistical domain terminology database]
Figure 4. The process of statistical segmentation.

First, a large volume of training forum articles is fed into basic sentence segmentation, in which punctuation marks and the special types of terms listed in Table 1 are deleted and cutoff points between two consecutive terms are identified. Each short sentence is then processed to generate two-character, three-character, ..., (N-1)-character terms, where N is the length of the targeted sentence. The terms generated from the training articles are counted for their frequencies of appearance.

Table 1. Types of characters deleted [10].

Type            Examples
Interjections   喂, 哼
Adverbs         不, 是, 就
Auxiliaries     啊, 呀, 的
Conjunctions    而, 和, 可是
Prepositions    除了, 比如
Take the Chinese 資料庫系統 (database system), a five-character string, as an example: it generates four two-character terms ("資料", "料庫", "庫系", "系統"), three three-character terms ("資料庫", "料庫系", "庫系統") and two four-character terms ("資料庫系", "料庫系統"). These N-character terms are compared against the 80 thousand Chinese terms in the CED, and only those not matched and whose frequencies are higher than a pre-specified hurdle value are retained. These candidate terms are further examined by three evaluators, and only the terms selected by at least two evaluators are accepted as domain terminology.
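The 2- to (N-1)-character term generation and the dictionary/hurdle filtering described above can be sketched as follows (function names and the sample data are illustrative, not from the paper):

```python
from collections import Counter

def ngram_terms(sentence):
    """Generate all 2- to (N-1)-character substrings of a sentence,
    where N is the sentence length."""
    n = len(sentence)
    return [sentence[i:i + k] for k in range(2, n) for i in range(n - k + 1)]

def candidate_terms(sentences, ced, hurdle):
    """Count term frequencies over all sentences, then keep terms that
    are absent from the dictionary and reach the hurdle value."""
    counts = Counter(t for s in sentences for t in ngram_terms(s))
    return {t: c for t, c in counts.items() if t not in ced and c >= hurdle}

print(ngram_terms("資料庫系統"))
# ['資料', '料庫', '庫系', '系統', '資料庫', '料庫系', '庫系統', '資料庫系', '料庫系統']
```

The nine generated substrings match the counts in the example above (four two-character, three three-character, and two four-character terms); the candidates that survive the CED and hurdle filters then go to the human evaluators.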
3.2 Genetic Algorithm Method

The process for constructing the domain terminology database using the GA approach is shown in Figure 5.

Begin
Step 1: Segment articles into S sentences, with 2- to (N-1)-character terms identified; let sentence index i = 1
Step 3: Calculate the fitness function value
Step 4: Selection No i