JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 26, 505-525 (2010)
An Automated Term Definition Extraction System Using the Web Corpus in the Chinese Language

FANG-YIE LEU AND CHIH-CHIEH KO
Department of Computer Science
Tunghai University
Taichung, 407 Taiwan
E-mail: [email protected]

This paper proposes a system, named DefExplorer, which analyzes the type of a given Chinese term, extracts term definitions from the Web, and selects answers from noisy Web pages. DefExplorer filters out invalid data with a semantic approach. Two types of candidate sets, common and domain specific, are employed to cluster similar candidates into groups. Different approaches are also deployed to evaluate each candidate's importance, which is the key factor for selecting the best answers from the retrieved candidates. Experimental results show that DefExplorer can effectively extract term definitions from the Web, especially the definitions of out-of-vocabulary terms.

Keywords: definitions, web corpus, information extraction, Chinese language, text mining
1. INTRODUCTION

As societies and technologies advance, new words and terms are created to enrich everyday life. Building up lexicons that contain new vocabulary items covering different fields, so as to provide an up-to-date word search, is thus a challenge [1]. Because of the continuous increase in the number of new words/terms, we inevitably encounter some in reading materials that cannot be found in any dictionary. This is what we call the out-of-vocabulary problem [1, 2]. Of course, it is not feasible to wait for dictionary updates to provide the meanings of new words/terms.

The Web, currently one of the largest data resources in the world, provides users with a great amount of data and knowledge [3]. Retrieving useful knowledge, e.g., the definition of a term/word, from the Web is a meaningful line of research [4], particularly when we need to know the meaning of an unknown term. However, after a user initiates a search by inputting keywords, what he/she retrieves from the Web is very often a huge amount of related and unrelated data, including descriptions and/or statements. It is therefore not surprising that most users frequently do not know what to do next.

In Chinese, explanations of a term are usually expressed in simple declarative sentences, like "A 是 B" (namely, A is B). However, declarative sentences may be too common in format to be discriminated from non-declarative sentences, e.g., interrogative sentences such as "瑜珈是什麼 (What is yoga?)", which is also in the form of "A 是 B".

Received April 1, 2008; revised February 27 & April 22, 2009; accepted May 7, 2009. Communicated by Chin-Teng Lin.
The search results sometimes confuse users, particularly when the submitted term is an unknown one.

Definitional question answering is a new area of question answering [3, 4]. Definitional questions are often in the form of "X 是什麼? (What is X?)" or "X 是誰? (Who is X?)", such as "瑜珈是什麼? (What is yoga?)" and "張藝謀是誰? (Who is Yimou Zhang?)", where X is a question term and should be a noun. Definitional questions differ from other questions, such as "瑜珈源自哪國? (In what country did yoga originate?)", in that they may not require only one answer or one well-defined answer [4]. We define definitions of X as the conceptual facts that can be collected and described in dictionaries or encyclopedias. A definitional question usually has more than one answer: all conceptual facts that define X from different viewpoints are possible answers. Covering them is an important feature that an outstanding definitional QA system should have [3]. In addition, a well-defined definition is often short and precise.

On the other hand, the Web corpus has been proven useful in natural language processing, especially in cross-language information retrieval research [1, 2, 5]. Lu et al. [1] and Zhang et al. [2] used the Web corpus for automated term translation extraction and obtained desirable results. Ru et al. [5] proposed a unified solution for Chinese name recognition by identifying a name's component, context and structure features and analyzing statistical data of the Web corpus. They claimed that their system achieved 93% precision and 89% recall.

In this paper, we propose a term definition searching system, named Definition Explorer (DefExplorer for short), whose aims are to retrieve term definitions collected in the Chinese Web corpus [5, 6] for a given term, and to effectively screen accurate definitions from the retrieved noisy data. After receiving a term submitted by a user, DefExplorer first retrieves the corresponding pre-defined patterns, and then submits the patterns, rather than the term, to a commercial search engine to search for related results/sentences. Upon receiving the results, DefExplorer filters out inappropriate sentences using a semantic approach. After that, it clusters semantically similar sentences into groups, and then selects one sentence from each group as the group's representative. At last, the top-ranked representatives are chosen as the final results.

The rest of this paper is organized as follows. Section 2 introduces several related text mining studies that use the Web corpus. Section 3 describes how DefExplorer extracts term definitions from the Web. The experimental results are illustrated in section 4. Section 5 concludes this article and discusses our future work.
2. RELATED WORK

This paper is an extended version of the work in [7]. We enhanced the earlier work in two ways: (1) integrating DefExplorer with existing lexicons to further improve its extraction performance; (2) comparing its performance with several state-of-the-art systems.

So far, various techniques for answering definitional questions have been proposed, mainly driven by the TREC (Text Retrieval Conference) Question Answering track [3, 4]. Prager et al. [8] proposed an approach that divides a definitional question X into several factoid sub-questions. The answers to the sub-questions are then combined to form the answer to X.
The authors claimed that the system effectively increases the hit rate for X. However, it is hard to derive the factoid sub-questions in advance, since the essential facts differ between Xs even when the Xs are of the same type.

Blair-Goldensohn et al. [9] proposed a hybrid goal-driven and data-driven system, called DefScriber, which extracts definitions in three main steps: (1) document retrieval, which accepts a user-submitted question X and then retrieves related documents from the Internet; (2) predicate identification (the goal-driven aspect), in which the authors defined seven predicate sets, including Genus, Species, …, and Non-specific Definitional (NSD), i.e., types of X, where NSD is the general type covering the other six. In this step, a machine learning approach and a pattern-recognition method were respectively employed to identify NSD sentences and to extract sentence patterns from annotated data. A set of lexicosyntactic patterns, organized as a syntactic tree, is generated to model sentences, so that users can understand X on the basis of the context provided; (3) statistical analysis (the data-driven aspect), which chooses important sentences by calculating each retrieved sentence's IDF-weighted cosine distance from the definition centroid. The N sentences with the highest IDF weights are then collected as the answer set. However, the paper only showed F-measures for given questions, providing no detailed information; as estimated, the overall F-measure is about 0.5, and no other evaluation was given.

Han et al. [10] introduced a probabilistic model, which is a formal model, for definitional QA. The authors considered that, given that T is a set of answer candidates describing the topic related to X, and D is a set of answer candidates representing X's definition(s), the answer to X is the intersection of T and D. The system analyzes a question in five steps: (1) question analysis, which identifies the type of X, e.g., a "term" or "person" type; (2) document retrieval, which locates relevant documents by using the BM25 scoring function of OKAPI [11]; (3) answer candidate extraction, which extracts target-related parts of sentences by using the syntactic structures of the retrieved sentences; (4) answer candidate ranking, which adopts the proposed probabilistic model to rank answer candidates; (5) answer selection, which selects related candidates with a given threshold. According to the evaluation results of the paper, the system worked well for external definitions, i.e., online definitions from the Columbia Encyclopedia, Wikipedia, the American Heritage Dictionary of the English Language, etc., but did not perform well when retrieving definitions from the Web.

The Google search engine [12, 13] is designed based on three philosophies: using the best locally relevant results to serve users globally, keeping it simple, and no manual intervention. On receiving keywords or phrases, the search engine produces results in two steps. (1) Matching: the Google Web server dispatches network spiders to access Web pages/sites in advance, and accordingly creates indexes in its index servers for all the accessed pages/sites; in this step, it looks up the index servers to obtain the relevant Web pages. (2) Page-ranking: a Web page P is ranked with the following considerations: (a) whether keywords or phrases appear in header or bold tags, or right at the top of P; (b) whether keyword densities range between 6-10% in P's body; (c) the number of incoming links of P; (d) the importance of the pages that link to P.
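To make the link-analysis idea in considerations (c) and (d) concrete, the following is a minimal PageRank power-iteration sketch. It is not Google's actual production algorithm, and the toy link graph is invented for illustration.

```python
# Minimal PageRank sketch: a page's score depends on how many pages link to it
# and on how important those linking pages are.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                  # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]))
```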
3. EXTRACTING TERM DEFINITIONS

DefExplorer extracts definitions, or their equivalents, for Chinese terms in six phases, as shown in Fig. 1: question analysis, document retrieval, semantics selection, similarity scoring, candidate grouping (also called candidate clustering), and answer generation. The first two phases respectively retrieve a given term's corresponding patterns and submit the patterns to search for results. The third phase removes semantically inappropriate search results/sentences and identifies the key portion of each definition sentence. In the fourth and fifth phases, DefExplorer calculates the similarity between each pair of definition sentences and clusters semantically similar sentences into groups. The last phase selects the top-ranked sentences as the final results. In the following, we describe the six phases and explain why they are employed.
Fig. 1. The DefExplorer system architecture.
3.1 Question Analysis

In this study, question analysis identifies the type of a given question term X so as to classify X into one of the pre-defined domains, e.g., person, location, and organization. These domains are relatively easier to identify than others (e.g., animal and plant), and each can be conceptually and intuitively decomposed into several sub-domains. The purpose is to extract more appropriate and detailed definition candidates. Terms that cannot be classified into any pre-defined domain are treated as normal terms. Many other domains can be found in HowNet [14]; DefExplorer will collect them one by one in the future.

The method we deploy to identify the type of X is to match the prefix or suffix of X against pre-built lexicons. For example, if X ends with the character "市 (city)", it is classified into the type/domain "location." A minimal sketch of this matching is given below.
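The following is a minimal sketch of the prefix/suffix matching, with tiny illustrative lexicons standing in for the pre-built ones of Appendix A.

```python
# Sample lexicon entries only; the real system uses the full pre-built lexicons.
LOCATION_SUFFIXES = {"市", "縣", "島", "湖", "鄰"}
ORGANIZATION_SUFFIXES = {"公司", "銀行", "處", "局"}
CHINESE_SURNAMES = {"馬", "陳", "林", "張"}          # sampled from 百家姓

def classify_term(term):
    if any(term.endswith(s) for s in LOCATION_SUFFIXES):
        return "location"
    if any(term.endswith(s) for s in ORGANIZATION_SUFFIXES):
        return "organization"
    if term and term[0] in CHINESE_SURNAMES and 2 <= len(term) <= 4:
        return "person"
    return "normal"        # outliers fall back to the normal-term domain

print(classify_term("台北市"))   # location
print(classify_term("馬英九"))   # person
print(classify_term("牛市"))     # location (the mis-recognition discussed in section 3.3)
```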
The pre-built lexicon we collected for Chinese people is 百家姓 (the Hundred Family Surnames), with which we can identify a person's family name. Famous foreign people's names and popular foreign given names are also collected one by one. The lexicon gathered for organizations contains 公司 (company), 行 (bank), 處 (department), 局 (bureau), etc., which are further classified into four classes: commercial organizations, government organizations, non-profit organizations, and others. The one collected for locations includes 鄰 (neighborhood), 島 (island), 湖 (lake), etc., which are further divided into districts, natural areas and others (see Appendix A of this paper, A1 and A2). It is hard to completely collect all related lexicons and their contents, since languages are open sets [14, 15]; new terms are created almost every day. Outliers, i.e., terms that belong to one of the pre-defined domains but do not match the pre-built lexicons, are classified as normal terms. This gives us a flexible approach that works appropriately even when the current pre-built lexicons and pre-defined domains are insufficient.

After identifying the type of X, no matter whether the question term is a domain-specific or a normal term, DefExplorer produces a query set, named the common query set, which consists of common definition sentence patterns. These are pre-defined definition sentence patterns adapted to all or most question terms concerned. In fact, most term definitions in Chinese are formatted as "X 是 definition" (X is definition). For example, given a question term "馬英九", the common query set DefExplorer generates includes the common definition sentence patterns "馬英九是… (Ma, Ying-Jeou is …)", "馬英九為… (Ma, Ying-Jeou is …)" and so on. Appendix A (in A3) lists a part of the patterns.

In addition, some definition sentence patterns only appear together with specific domain terms [8, 10], e.g., "出生於 (was born in/on/at)" and "曾任 (served as)" always follow the name of a person, while "位於 (located in/at)" follows a location or an organization. On receiving a question term such as "馬英九 (Ma, Ying-Jeou)", DefExplorer generates a query set containing "馬英九曾任… (Ma, Ying-Jeou served as …)", "馬英九出生於… (Ma, Ying-Jeou was born in …)" and so on as a domain query set. Queries collected in the same domain query set aim to gather the same target information. For example, "位於 (located at)" and "坐落於 (situated at)" both request a location; we collect them as a target-location domain query set. "成立於 (established in)", which requests the establishment year, is collected in a target-establishment-year domain query set. In other words, DefExplorer generates a common query set and k domain query sets, k ≥ 0, for X.

To prepare the pre-defined common (domain) definition sentence patterns, we collected the general sentences Chinese people use to define a term (to introduce a person and a location, and to define an organization) from several newspapers and magazines, e.g., "國語日報 (Mandarin Daily News)" issued in February, 2008, and "探索人文地理雜誌 (Discovery Cultural & Geographic Monthly)" issued in 2008.
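The following sketch illustrates how a common query set and the domain query sets might be generated from the patterns of Appendix A; the pattern lists shown are abbreviated samples, not the full collections.

```python
# Abbreviated samples of the Appendix A patterns.
COMMON_PATTERNS = ["{X}是", "{X}為", "{X}是指", "{X}的定義是"]
DOMAIN_PATTERNS = {
    "person": {
        "target-birth": ["{X}生於", "{X}出生於"],
        "target-job": ["{X}曾任", "{X}任職於"],
    },
    "location": {"target-location": ["{X}位於", "{X}坐落於"]},
    "organization": {"target-establishment-year": ["{X}成立於"]},
}

def build_query_sets(term, domain):
    common = [p.format(X=term) for p in COMMON_PATTERNS]
    domains = {name: [p.format(X=term) for p in patterns]
               for name, patterns in DOMAIN_PATTERNS.get(domain, {}).items()}
    return common, domains

common, domains = build_query_sets("馬英九", "person")
print(common)    # ['馬英九是', '馬英九為', ...]
print(domains)   # {'target-birth': ['馬英九生於', '馬英九出生於'], ...}
```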
3.2 Document Retrieval

In this phase, DefExplorer submits the generated query sets to a commercial search engine, like Google or Yahoo, to retrieve related Web pages. Since these search engines often adopt partial-match policies [12], many noisy sentences/Web pages, including those in languages other than Chinese, are also retrieved. DefExplorer then compares these sentences with the common and domain definition sentence patterns. Only those matching at least one common (domain) definition sentence pattern are selected as members of the candidate set, named the common candidate set (domain candidate set). Because our sentence patterns are all in Chinese, any retrieved sentences that are not in Chinese are filtered out in this phase. In other words, the search space is the Web corpus, but only Chinese definitions are kept.

3.3 Semantics Selection

In this phase, a definition candidate in a common or domain candidate set is selected as a definition by using several selection principles. First, term definitions should be affirmative sentences [16]. Hence, we delete interrogative and exclamatory sentences, which can be detected since they end with question or exclamation marks, or contain certain exclamatory terms. Next, we analyze the morphologies of the remaining probable term definitions to see whether the structure of a retrieved sentence can be a definition or not.

According to our observation, about 75% (2720/3628) of Chinese term definitions carry summary information prior to the first comma (or prior to the first period if there is no comma before the period), since people very often attempt to convey the key meaning of a given term in this portion. Words after the comma describe further details or list other less relevant contents [9, 16]. Fig. 2 gives an example. There are two definitions, "馬英九是一位年輕的政治人物,曾任台北市長。(Ma, Ying-Jeou is a young statesman, who served as a mayor of Taipei city.)" and "馬英九是一位年輕的政治人物,是 2008 年國民黨總統候選人。(Ma, Ying-Jeou is a young statesman, who is the present-election presidential candidate of the KuoMinTang.)", that meet the patterns. Their overall similarity is not high, because the portions after the commas often describe details from different viewpoints [9, 16]; however, the similarity of the portions prior to the commas is high. Therefore, only the portion prior to the first comma, called the definition brief, is preserved.
Fig. 2. Definition brief and its tokens.
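A minimal sketch of this step follows, assuming simple punctuation checks stand in for the full morphological analysis.

```python
import re

def extract_brief(sentence):
    # Interrogative / exclamatory sentences cannot be definitions.
    if sentence.rstrip().endswith(("?", "？", "!", "！")):
        return None
    # Keep the summary portion before the first comma (or period).
    parts = re.split("[,，。]", sentence, maxsplit=1)
    brief = parts[0].strip()
    return brief or None

print(extract_brief("馬英九是一位年輕的政治人物,曾任台北市長。"))
# -> 馬英九是一位年輕的政治人物
print(extract_brief("瑜珈是什麼?"))   # -> None
```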
After that, we use a part-of-speech dictionary [17] to partition the definition brief into tokens, each of which is associated with its own part of speech. We also count the frequency of each token so that, in the next phase, we can calculate the TF-IDF (term frequency-inverse document frequency) score for each definition brief [17-19]. The TF-IDF weight/score is often used in information retrieval and text mining [18, 19]. It is a statistical measure of how important a word (in this study, a token) is to a document (a definition brief) in a collection or corpus (a candidate set). The importance is proportional to the number of times a token appears in the definition brief, but is offset by the frequency of the token in the candidate set. For details, please refer to [18].
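A minimal TF-IDF sketch over tokenized definition briefs follows; exact weighting variants differ across implementations, and this one uses the common tf · log(N/df) form.

```python
import math
from collections import Counter

def tf_idf(briefs):
    """briefs: list of token lists; returns one {token: weight} dict per brief."""
    n = len(briefs)
    doc_freq = Counter(token for brief in briefs for token in set(brief))
    weights = []
    for brief in briefs:
        counts = Counter(brief)
        weights.append({
            tok: (cnt / len(brief)) * math.log(n / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights

briefs = [["國民黨", "總統", "候選人"],
          ["中國", "國民黨", "總統", "候選人"],
          ["一個", "法律人"]]
print(tf_idf(briefs)[0])
```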
Tokens that are not included in the part-of-speech dictionary (about 5.54%) are treated as nouns, since most Chinese terms are nouns [20]. Chinese sentences usually contain words/phrases, e.g., possessive nouns, that modify the words that follow [21]. Therefore, a definition brief ending with a word other than a noun, e.g., an adjective, as in "馬英九是很瀟灑的。(Ma, Ying-Jeou is handsome.)", is deleted from the candidate set. After that, if the underlying candidate set is a location or an organization candidate set (identified in phase 1), DefExplorer further compares each definition brief with a pre-collected domain dictionary, which consists of well-known location names together with the collected suffixes of location and organization names, and which is used to determine whether a candidate is a location or an organization; i.e., the pre-collected domain dictionary = pre-built location lexicon + pre-built organization lexicon. Invalid candidates are discarded. Given the question term "牛市 (bull market)", it will be mis-recognized as a city in phase 1, but its definitions will all be filtered out of the location candidate set in this step, since its definition briefs, e.g., "(牛市是) 股票上揚之意 (bull market means the stock market goes up)", do not contain any item collected in the pre-built location lexicon. "(台北位於) 北臺灣 (Taipei is located in northern Taiwan.)" will be kept, since "台灣 (Taiwan)" can be found in the pre-collected domain dictionary.

3.4 Similarity Scoring

In this phase, we assume that if a term has a specific definition, its definitions on different Web pages should all be similar [22]. But when the submitted term is a polysemant, several of its definition briefs will be quite different. Hence, before we cluster those with high similarity together as a group, so as to merge redundant definition briefs and determine the importance of each group, we calculate the similarity score for every pair of definitions in a candidate set. We assume that if a group, before the merging of its redundant definition briefs, has more definition briefs, its importance is higher.

Before the calculation of the similarity score, the synonym problem should first be solved. We solve this problem by employing a thesaurus [22, 23]; e.g., "星國 (Singapore)" is replaced by "新加坡 (Singapore)". DefExplorer also employs an additional step to replace an area-specific term with a commonly used one; e.g., "數碼 ('digital' as used in mainland China)" is replaced by "數位 ('digital' as used in Taiwan)" in Taiwan, and vice versa. In most situations, a definition brief is short (9.3 characters on average in our examples). DefExplorer calculates the similarity score Sim(A ↔ B) for definition briefs A and B in a candidate set S by invoking the TF-IDF algorithm [18, 19].
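The paper does not spell out the exact similarity formula; a common realization is the cosine similarity of TF-IDF weight vectors, sketched below under that assumption.

```python
import math

def cosine_sim(wa, wb):
    """wa, wb: {token: tf-idf weight} dicts for two definition briefs."""
    dot = sum(w * wb.get(tok, 0.0) for tok, w in wa.items())
    na = math.sqrt(sum(w * w for w in wa.values()))
    nb = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative weight vectors (not real DefExplorer output).
a = {"國民黨": 0.2, "總統": 0.1, "候選人": 0.1}
b = {"中國": 0.3, "國民黨": 0.15, "總統": 0.1, "候選人": 0.1}
print(round(cosine_sim(a, b), 2))
```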
3.5 Candidate Grouping

To cluster the definition briefs in S into groups, when Sim(A ↔ B) is higher than a given clustering threshold θ, A and B are clustered into the same group. However, a fixed clustering threshold is inappropriate, since a popular term may generate many definition briefs; a higher θ value should then be used so that the briefs are divided into more groups, avoiding too many definition briefs being clustered into one group. A less popular term yields few definition briefs; in this case, a lower θ value is used to avoid having only a few definition briefs clustered in each group. That is, a dynamic clustering approach is used. DefExplorer multiplies the maximum similarity score among definition briefs in a candidate set by a constant ∂ (0 < ∂ < 1) to produce a critical value θcritical,
θcritical = ∂ · max{Sim(A ↔ B) | A, B ∈ S},    (1)
with which definition briefs are clustered; in this study ∂ is 0.65, as will be described in section 4.1. Nevertheless, this clustering approach sometimes yields unsatisfactory results. For example, when Sim(A ↔ B) > θcritical and Sim(B ↔ C) > θcritical, A, B and C will be clustered into the same group, even though Sim(A ↔ C) may be lower than θcritical. If Sim(A ↔ C) is close to θcritical, DefExplorer still classifies them into the same group; we call this the boundary choice. But when Sim(A ↔ C) is far lower than θcritical, we define another threshold θmin < θcritical to further cluster them into different subgroups. After clustering, we check whether the smallest Sim(A ↔ B) in each group is smaller than θmin. If it is, DefExplorer increases θcritical and re-clusters the definition briefs in this group into subgroups until the scores in each group are all higher than θmin, where θmin = 0.3, as will also be described in section 4.1. The time complexity of the checking is O(n²), where n is the number of definition briefs currently in a group. Since we assume that term definitions show up repeatedly, groups containing only one definition brief are deleted. Further, given a group G, we sum up the similarity scores between a definition brief D and the other definition briefs in G. Let SD be the sum:
SD = Σ_{Di ∈ G, Di ≠ D} Sim(D ↔ Di).    (2)
The definition brief with the highest SD is the one most similar to all the remaining definition briefs, and is selected as the representative of G.
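The following is a compact sketch of this grouping procedure, covering Eq. (1) and the representative selection of Eq. (2); the θmin subgroup refinement and re-clustering are omitted for brevity, and the toy similarity function is illustrative only.

```python
def cluster(briefs, sim, alpha=0.65):
    pairs = [(a, b) for i, a in enumerate(briefs) for b in briefs[i + 1:]]
    theta = alpha * max(sim(a, b) for a, b in pairs)       # Eq. (1)
    groups = []
    for brief in briefs:
        for group in groups:
            if any(sim(brief, member) > theta for member in group):
                group.append(brief)
                break
        else:
            groups.append([brief])
    # Representatives: the brief with the highest S_D in each group (Eq. (2));
    # singleton groups are dropped, as in section 3.5.
    reps = [max(g, key=lambda d: sum(sim(d, o) for o in g if o is not d))
            for g in groups if len(g) > 1]
    return groups, reps

def jaccard(a, b):            # toy token-overlap similarity for the demo
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

briefs = [("國民黨", "總統", "候選人"),
          ("中國", "國民黨", "總統", "候選人"),
          ("一個", "法律人")]
groups, reps = cluster(briefs, jaccard)
print(reps)   # -> [('國民黨', '總統', '候選人')]
```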
3.6 Answer Generation

In the final phase, DefExplorer generates answers by selecting definition groups from a candidate set. Generally, a general term generates more definition groups than a specific term does, and different terms often yield different numbers of groups. So, DefExplorer selects λ groups as probable answers, named answer sets, where λ ≥ 1. The answer of a specific domain candidate set may be unique, e.g., a birth date; we call such a set a single-answer candidate set. A candidate set which has more than one probable answer, e.g., one's work experience, is called a multiple-answer candidate set.

After a question term X is submitted, as shown in Fig. 3, DefExplorer sorts all groups in a candidate set into descending order of group size, and selects all groups with at least ϕ definition briefs from X's common candidate set and multiple-answer candidate sets as X's answer set A, where group size is defined as the number of definition briefs in a group, ϕ = ⌈γ · total number of definition briefs in the candidate set⌉, and γ ranges between 1% and 100% depending on the number of definition briefs retrieved. Basically, a general term is given a lower γ, whereas a non-general term is assigned a higher γ value to avoid involving too few definition briefs and thereby missing some valuable definitions. Further, DefExplorer selects the first group from each single-answer candidate set, and inserts these groups into A as a part of the final answer set.
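A minimal sketch of this group-size-based selection follows; the group sizes and the γ value are invented for illustration.

```python
import math

def select_answers(groups, gamma):
    total = sum(len(g) for g in groups)
    phi = math.ceil(gamma * total)                 # ϕ = ⌈γ · total briefs⌉
    ranked = sorted(groups, key=len, reverse=True) # descending group size
    return [g for g in ranked if len(g) >= phi]

groups = [["brief"] * n for n in (9, 4, 2, 1)]     # group sizes 9, 4, 2, 1
print([len(g) for g in select_answers(groups, gamma=0.2)])   # -> [9, 4]
```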
Fig. 3. Generating an answer set for a query term.
Appendix B lists the results of each phase of DefExplorer given the query term "馬英九 (Ma, Ying-Jeou)".

4. EXPERIMENTAL RESULTS

We implemented DefExplorer with the algorithms discussed above as a Web application. Given a question term, DefExplorer searches the Chinese Web and generates an answer set in which each answer entry is accompanied by its URLs, enabling users to visit the corresponding source pages. To perform the experiments, we first created three question sets containing a total of 544 question terms (see Appendix C), gathered from three different sources: 268 terms randomly selected from The China Times, 176 terms randomly selected from The Economic Daily, and 100 terms from the top 100 search queries of Yam.com in 2007. Each question set came from one source. We also chose three specific domains, person, location and organization, as examples; question terms belonging to these three domains were further classified into three classes, one domain being mapped to one class. Outliers, as stated above, are classified into a normal class as normal terms. The computer used is a Pentium 586 with a Quad CPU, 4GB of memory, and a 500GB hard disk; the OS is Windows XP.
4.1 Clustering Threshold

In order to assign an appropriate clustering threshold θ to DefExplorer for clustering definition briefs, we generated answer sets by giving θ different values ranging from 0.0 to 1.0 with 0.05 as the increment, and defined two parameters: Min Cohesion, the minimum similarity between any two definition briefs in a given group, and Max Coupling, the maximum similarity between two definition briefs selected from two different groups [24, 25] in a candidate set. Given two groups G and H collected in a candidate set (rather than in their answer set), and two definition briefs A and B,

Min Cohesion = min{Sim(A ↔ B) | A, B ∈ G},    (3)
Max Coupling = max{Sim(A ↔ B) | A ∈ G, B ∈ H, G ≠ H}.    (4)
Both parameters range from 0 to 1. A high Min Cohesion score, close to 1, means all definition briefs in a given group are very similar. A high Max Coupling score represents a high similarity between the definition briefs of two different groups. Fig. 4 shows an example, in which A, B, C and D are definition briefs. Since Sim(A ↔ B) ≥ θ, Sim(A ↔ C) ≥ θ and Sim(C ↔ D) ≥ θ, A, B, C and D are clustered into one group. Here, without loss of generality, we assume that in G, max{Sim(X ↔ Y) | X, Y ∈ G} = 1.
Fig. 4. An example of an unsatisfactory clustering result where Sim(A ↔ B), Sim(A ↔ C) and Sim(C ↔ D) are all larger than or equal to θ. But, Sim(B ↔ D) < θ/2 so G should be further divided, and B and D should belong to two groups.
After submitting the 408 training terms for each specific θ value, 408 answer sets were collected per value. For each answer set, we retrieved the top 5% of answer entries, i.e., λ = ⌈the top 5% of groups⌉. A total of 6,598 non-empty answer sets (about 77.0% (= 6598/(408*21))) were collected and 12,470 answer entries were gathered; i.e., 23.0% of the answer sets are empty. Each answer set has 1.89 answer entries on average, and a total of 14,763 different tokens were generated in phase 3.

For a specific θ value and its corresponding answer sets, we first recovered their corresponding candidate sets, in which a group consists of the candidate briefs, rather than its representative, generated in phase 5. Next, for each candidate set S, we recalculated Min Cohesion for the definition briefs in each group, and Max Coupling between arbitrary pairs of groups in S. After that, we calculated the average Min Cohesion (AMCohθ) over all groups and the average Max Coupling (AMCouθ) over all candidate sets:

AMCohθ = Σ_{i=1}^{k} (Σ_{j=1}^{mi} Min Cohesion(i, j) / mi) / k    (5)

AMCouθ = Σ_{i=1}^{k} (Σ_{j=1}^{C(mi,2)} Max Coupling(i, j) / C(mi,2)) / k    (6)

where mi is the number of groups in candidate set i, C(mi,2) is the number of group pairs in candidate set i, and k = 408.
Fig. 5. Clustering results on λ = 5% given different θ values.
Fig. 5 plots Min Cohesion (i.e., AMCohθ) and Max Coupling (i.e., AMCouθ) against different θ values. Theoretically, the ideal (worst) grouping is Min Cohesion = 1 (= 0) and Max Coupling = 0 (= 1), but we cannot achieve both in the experiment. According to [14], there are no two different terms whose coupling (cohesion) is definitively zero (one). So, as a compromise, we adopt the highest cohesion together with the lowest coupling. θ = 0.65 is the point where Min Cohesion is almost equal to Max Coupling; this is the most appropriate θ value for clustering definition briefs. When λ is changed from 5% to 2%, θ's range is between 0.645 and 0.65. We redid the experiment by submitting the 136 test terms; this time, θ ranges between 0.647 and 0.651 (≈ 0.65). Therefore, we used θ = θcritical = 0.65 in the following experiments. Also, from Fig. 4, we can see the distance between B and C is 2θ, implying Sim(B ↔ C) ≈ θ/2 (= 0.325, actually between 0.3225 and 0.3255), which is the lowest similarity value θmin for which the answers can still be clustered into one group. But we lower θmin to 0.3 to implement the boundary choice stated above. Sim(B ↔ D) ≤ θmin indicates that G should be further divided.

4.2 The Training and Test Phases
In the second experiment, after we submitted the 408 training terms, a total of 60,980 Web pages were extracted; i.e., on average 1,270.4 pages were retrieved per submitted term (in fact, per set of definition sentence patterns). We first selected answers given different λ values. Experimental results are shown in Table 1, in which the highest performance (inclusion rate and precision) occurs at λ = 5% (see the all-answers column), but the corresponding answer size, defined as

(Σ_{i=1}^{k} number of non-empty groups in answer set i) / k,

is too small, since many question terms' answer sizes are zero, where k is the number of question terms, k = 408. The best case occurs at λ = 2%, generating 4.48 answer entries. Therefore, we used λ = 2% as the default system configuration. Since the number of pages that should be retrieved from the Web is unknown, recall cannot be obtained, even though others [7] gave a differently defined recall. In fact, we only want to retrieve sufficient definitions rather than access all available definitions in the Web. So, in this study, we omit the parameter recall.
Table 1. Performance of candidate sets on different λs given the 408 training terms.

         Common candidate set        Domain candidate sets        All answers
λ (%)  Incl. rate  Ans.   Preci.   Incl. rate  Ans.   Preci.   Incl. rate  Ans.   Preci.
          (%)      size    (%)        (%)      size    (%)        (%)      size    (%)
  6      81.1      1.53    42.0      80.6      2.21    42.7      80.9      1.87    42.3
  5      78.5      1.65    43.4      80.1      2.30    45.3      79.2      1.96    44.3
  4      74.5      2.51    46.9      76.2      2.70    49.6      75.3      2.59    48.6
  3      69.9      3.56    52.3      71.9      3.25    53.0      70.9      3.45    52.6
  2      64.2      4.12    54.3      62.7      4.80    56.6      63.2      4.48    55.6
  1      50.8      4.98    55.9      52.6      4.27    60.3      51.1      4.61    57.6
In the experiment, we found that only 34.07% (139/408) of the collected question terms were classified into one of the three specific domains/classes; the remaining 65.93% are normal terms. The experiment of retrieving answer sets for the 408 training question terms lasted a total of 342 minutes (about 5.7 hours), including the Internet access and the employment of different λ values ranging from 1% to 6% to calculate inclusion rates, answer sizes, and precisions. In fact, the time required to process a question term is about 50.3 sec. If we submit a term online without the abovementioned overheads, the average processing time is 21.3 sec.

Table 2 shows the experimental results for the three sources involved. The worst case occurred with Yam.com's search queries. Further, DefExplorer did not work well on general terms, e.g., "成人 (adult)" and "好玩遊戲區 (interesting games)", for which accurate definitions could not be obtained. The values of the All-answers column in Table 2 are the same as those of the All-answers column in Table 1 because they were obtained by submitting the same 408 question terms, but gathered from different viewpoints.

Table 2. Performance of the three training question sets (408 training terms) on different λs by using DefExplorer only.
            China Times                Economic Daily             Yam queries                All answers
λ (%)  Incl. rate  Ans.  Preci.   Incl. rate  Ans.  Preci.   Incl. rate  Ans.  Preci.   Incl. rate  Ans.  Preci.
          (%)      size   (%)        (%)      size   (%)        (%)      size   (%)        (%)      size   (%)
  6      85.0      1.77   41.3      86.0      2.22   45.8      61.1      1.54   38.8      80.9      1.87   42.3
  5      83.6      1.83   43.4      83.3      2.33   47.7      60.0      1.63   40.9      79.2      1.96   44.3
  4      78.9      2.54   49.0      79.8      2.88   49.8      57.5      2.21   45.6      75.3      2.59   48.6
  3      75.4      3.29   51.8      76.8      3.97   55.2      48.5      2.98   50.1      70.9      3.45   52.6
  2      65.3      4.36   55.4      71.4      4.91   58.1      43.4      4.05   51.9      63.2      4.48   55.6
  1      53.0      4.50   55.8      58.9      4.98   62.8      32.3      4.25   53.3      51.1      4.61   57.6
Table 3. Performance of candidate sets on different λs given the 136 test terms.

         Common candidate set        Domain candidate sets        All answers
λ (%)  Incl. rate  Ans.   Preci.   Incl. rate  Ans.   Preci.   Incl. rate  Ans.   Preci.
          (%)      size    (%)        (%)      size    (%)        (%)      size    (%)
  6      81.1      1.63    41.0      77.8      1.71    41.3      79.9      1.66    41.2
  5      78.8      1.69    43.0      76.6      2.10    44.0      77.9      1.83    43.3
  4      74.3      1.70    47.0      72.8      2.31    48.9      73.8      2.10    47.9
  3      69.3      3.07    49.8      67.9      2.85    50.9      68.1      2.94    50.4
  2      64.5      4.10    54.0      66.8      3.82    52.9      65.0      4.00    53.2
  1      54.7      4.20    55.0      56.0      3.90    56.7      55.2      4.05    55.9
Table 4. Performance of the three test question sets (136 test terms) on different λs by using DefExplorer only.

            China Times                Economic Daily             Yam queries                All answers
λ (%)  Incl. rate  Ans.  Preci.   Incl. rate  Ans.  Preci.   Incl. rate  Ans.  Preci.   Incl. rate  Ans.  Preci.
          (%)      size   (%)        (%)      size   (%)        (%)      size   (%)        (%)      size   (%)
  6      83.6      1.49   40.2      85.5      2.06   44.7      60.2      1.43   37.5      79.9      1.66   41.2
  5      82.1      1.68   42.5      84.1      2.20   46.7      56.0      1.57   38.7      77.9      1.83   43.2
  4      78.9      1.94   48.9      77.6      2.56   49.1      53.5      1.74   43.0      73.8      2.10   47.9
  3      72.7      3.00   49.8      73.3      3.32   52.6      46.5      2.12   47.9      68.1      2.94   50.4
  2      70.4      4.22   52.8      70.1      4.21   54.9      41.6      3.05   51.4      65.0      4.00   53.2
  1      63.9      4.01   54.8      56.1      4.38   59.9      30.4      3.55   52.0      55.2      4.05   55.9
4.3 Integrating with the Existing Lexicon
Performance can be improved when definitions that cannot be extracted from the Web can instead be found in existing lexicons or encyclopedias. In this experiment, DefExplorer was integrated with an existing Chinese lexicon [26]. Fig. 6 shows the experimental results given the 136 test terms. Before the integration, only 38.2% of the question terms in the Economic Daily question set could find definitions in the existing dictionary. After the integration, 93.9% of the question terms' definitions could be obtained, either from the given dictionary or among the top 5% (λ = 5%) of answer entries (in Appendix C, those excluded are followed by a question mark). Therefore, we conclude that our approach helps solve the out-of-vocabulary problem.
Fig. 6. Inclusion rates by employing a dictionary and/or different λs.
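A minimal sketch of this lexicon-first fallback follows; the dictionary, extractor, and λ handling here are hypothetical placeholders, not DefExplorer's actual interfaces.

```python
def define(term, dictionary, web_extractor, top=0.05):
    if term in dictionary:                      # known term: trust the lexicon
        return [dictionary[term]]
    answers = web_extractor(term)               # OOV term: mine the Web corpus
    keep = max(1, round(top * len(answers)))    # keep the top-λ answer entries
    return answers[:keep]

toy_dict = {"瑜珈": "一種源自印度的身心鍛鍊方法"}
toy_extractor = lambda t: [f"{t}是… (web answer {i})" for i in range(40)]
print(define("瑜珈", toy_dict, toy_extractor))      # from the dictionary
print(define("Web 2.0", toy_dict, toy_extractor))   # top 5% of 40 = 2 entries
```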
4.4 Comparison with Other Systems
In the fourth experiment, we compared DefExplorer with three state-of-the-art systems, the Probabilistic Model [10], DefScriber [9] and Google, given the 136 test terms. Several modifications were applied to the first two systems to ensure a fair comparison. First, the two systems were originally designed to extract definitions for English terms; we modified their grammar patterns and part-of-speech dictionaries to enable them to process Chinese data. Second, the two systems were originally designed to extract definitions mainly from reliable sources, such as encyclopedias. However, we were only interested in their performance in extracting term definitions from the Web. Hence, the Web corpus was the only data source used in the experiment.
We compared the four systems by searching traditional Chinese pages and checking whether definitions appeared in the first 5% of the search results. The question set used was The Economic Daily.
Fig. 7. Inclusion rates on different λs.
Fig. 8. Precision on different λs.
Figs. 7 and 8 show that DefExplorer performed the best in both inclusion rate and precision. The answers generated by the Probabilistic Model and DefScriber contained a lot of noise which, even though in the form of definition patterns, did not constitute real definitions; e.g., "iPod 是浴缸的弟弟" (iPod is the bathtub's brother) is not a definition of iPod but was extracted from a joke [27]. The two systems performed better on their own well-defined knowledge corpora than on the Web. In fact, Han et al. [10] also showed that their model performs best when encyclopedias are the extraction targets, and worst when the target is the Web. The Web corpus contains a lot of noise, such as jokes and buzz; however, DefExplorer can effectively filter most of it out. The two figures also show that Google's inclusion rate at λ = 1% (23.9%) was higher than those of the Probabilistic Model and DefScriber, but lower than that of DefExplorer. We found that most correct answers appearing among the first search results were pages in Wikipedia. One reason may be that most pages in Wikipedia are cited often, so they have very high PageRank under Google's ranking algorithm, rather than that the algorithm is optimal.
5. CONCLUSIONS AND FUTURE RESEARCH

In this article, we propose a system, DefExplorer, which extracts term definitions from the Web, and we explain how the system filters out noisy information and selects valid definitions. We designed two types of candidate sets, common and domain, and grouped the candidates. Finally, answers from the candidate sets were evaluated and selected using different approaches, group size and TF-IDF score. The experimental results show that the system can effectively extract definitions for out-of-vocabulary terms. We also found that the two types of candidate sets both have their own advantages: common candidate sets provide overall answers, while domain candidate sets supply specific features.

In the future, we would like to study how to improve the classification rate of question terms, since in this study only 34.07% of terms could be classified into one of the domains considered. This situation can be improved in at least two ways. The first is to recover abbreviations, such as 傳媒 → 傳播媒體 (transmission media) and 事發 → 事情發生 (event occurrence). However, no Chinese abbreviation dictionary is currently available, even though there is a website, acronymfinder.com, for English abbreviations, acronyms, and initialisms [28]. The second is to increase the number of domains and to collect more effective domain definition sentence patterns for each domain. The semantic primitives collected in and analyzed by [14] provide an excellent direction for this research.

One may also ask what happens when the question term is a popular name shared by several persons with different birthdays. Currently, DefExplorer retrieves all their information mixed together as a whole, chooses the first group from the sorted groups generated by submitting the "X 生於 [date/location]" and "X 出生於 [date/location]" sub-questions as "the person's" birthday, no matter whether the groups are sorted by group size or TF-IDF score, and then selects λ groups from the candidate set as the final answer set. In fact, if one knows nothing about the specified term's (name's) background, one cannot realize that the information belongs to different people, and therefore cannot accurately classify the mixed information into several clusters. These issues constitute our future research topics.
REFERENCES

1. W. H. Lu, L. F. Chien, and H. J. Lee, "Anchor text mining for translation of web queries: A transitive translation approach," ACM Transactions on Information Systems, Vol. 22, 2004, pp. 242-269.
2. Y. Zhang and P. Vines, "Using the web for automated translation extraction in cross-language information retrieval," in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 162-169.
3. E. M. Voorhees, "Overview of the TREC 2003 question answering track," in Proceedings of the Text Retrieval Conference, 2003, pp. 54-68.
4. E. M. Voorhees, "Overview of the TREC 2004 question answering track," in Proceedings of the Text Retrieval Conference, 2004, http://trec.nist.gov/pubs/trec13/papers/QA.OVERVIEW.pdf.
5. L. Ru, Z. Tong, Y. Liu, and S. Ma, Automatic Chinese Name Recognition Based on Web Corpus Analysis, ACTA Press, 2007, http://www.actapress.com/.
6. A. Renouf, Explorations in Corpus Linguistics, Rodopi, Amsterdam & Atlanta, 1998.
7. F. Y. Leu and C. C. Ko, "An automated term definition extraction using the web corpus in Chinese language," in Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering, 2007, pp. 435-440.
8. J. Prager, J. Chu-Carroll, K. Czuba, C. Welty, A. Ittycheriah, and R. Mahindru, "IBM's PIQUANT in TREC 2003," in Proceedings of the Text Retrieval Conference, 2003, pp. 283-292.
9. S. Blair-Goldensohn, K. R. McKeown, and A. H. Schlaikjer, "A hybrid approach for QA track definitional questions," in Proceedings of the Text Retrieval Conference, 2003, pp. 185-192.
10. K. S. Han, Y. I. Song, and H. C. Rim, "Probabilistic model for definitional question answering," in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 212-219.
11. S. E. Robertson, S. Walker, and M. M. Hancock-Beaulieu, "Large test collection experiments on an operational, interactive system: Okapi at TREC," Information Processing and Management, Vol. 31, 1995, pp. 345-360.
12. D. Callan, "Get Google optimization tips from our experts," The Google Forums, http://www.akamarketing.com/google-ranking-tips.html.
13. P. Craven, "Google's PageRank explained and how to make the most of it," http://www.webworkshop.net/pagerank.html.
14. Z. D. Dong and Q. A. Dong, "HowNet," http://www.keenage.com, 1999.
15. D. A. van Leeuwen and K. P. Truong, "An open-set detection evaluation methodology applied to language and emotion recognition," in Proceedings of Interspeech, 2007, pp. 338-341.
16. J. C. Hodges, W. B. Horner, S. S. Webb, and R. K. Miller, Harbrace College Handbook: With 1998 MLA Style Manual Updates, 13th ed., Harcourt College Publishers, Fort Worth, 1998.
17. S. Yu, X. Zhu, H. Wang, and Y. Zhang, The Grammatical Knowledge-base of Contemporary Chinese − A Complete Specification, Tsinghua University Press, Beijing, 2003.
18. F. Y. Leu and K. W. Hu, "A real-time intrusion detection system using data mining technique," Journal of Systemics, Cybernetics and Informatics, Vol. 6, 2008, pp. 36-41.
19. A. Leuski, "Evaluating document clustering for interactive information retrieval," in Proceedings of the ACM Conference on Information and Knowledge Management, 2001, pp. 33-40.
20. Y. R. Chao, A Grammar of Spoken Chinese, Chinese edition, The Chinese University of Hong Kong, 1968.
21. Z. G. Zhang, Han yu yu fa chang shi, Joint Publishing, Hong Kong, 1999.
22. Y. H. Tseng, "Automatic thesaurus generation for Chinese documents," Journal of the American Society for Information Science and Technology, Vol. 53, 2002, pp. 1130-1138.
23. http://zh.wikipedia.org/wiki/Wikipedia:繁簡分歧詞表.
24. R. S. Pressman, Software Engineering: A Practitioner's Approach, 6th ed., McGraw-Hill, Singapore, 2005.
25. I. Sommerville, Software Engineering, 8th ed., Addison Wesley, 2006.
26. Ministry of Education, R.O.C., Chinese Dictionary, http://140.111.34.46/newDict/dict/.
27. http://blog.pixnet.net/cwyuni/post/849491.
28. http://www.acronymfinder.com/.
APPENDIX A PRE-BUILT LEXICONS AND DEFINITION SENTENCE PATTERNS

A1. Pre-built Lexicons for Organizations
(1) commercial organizations: 公司 (company), 銀行 (bank), 社 (society, community), 廠 (factory, plant, workshop), 館 (mansion, building), ….
(2) government organizations: 股 (section), 科 (department), 處 (department), 局 (bureau), 廳 (department), 署 (department), 部 (ministry), 院 (Yuan), 府, ….
(3) non-profit organizations: 委員會 (council, commission), 基金會 (foundation), 協會 (association), ….
(4) others: those organization suffixes that do not belong to the three classes above.

A2. Pre-built Lexicons for Locations
(1) districts: 鄰 (neighborhood), 里 (village), 村 (village), 鄉 (township), 鎮 (township), 區 (zone), 縣 (county), 市 (city), 州 (state), 省 (province), 國 (nation), 地區 (area), 道 (dau), 都 (du), ….
(2) natural areas: 島 (island), 半島 (peninsula), 洲 (continent), 洋 (ocean), 湖 (lake), 河 (river), 川 (river), ….
(3) others: those location suffixes that do not belong to the two classes above.

A3. Common Definition Sentence Patterns
(1) X 是指…; (2) X 意為…; (3) X 為…; (4) X 是…; (5) X 的定義是…; (6) X 的定義為…; (7) X 定義為…; (8) X 被定義為…; (9) X 被定義成… (excluding X 是不是…, X 是否…)

A4. Domain Definition Sentence Patterns for Person, Location and Organization
1. Person: (1) X 生於 [date/location], X 出生於 [date/location]; (2) X 卒於 [date/location], X 亡於 [date/location]; (3) X 曾任 [jobs], X 任職於 [jobs]
2. Location: (1) X 位於 [place], X 坐落於 [place]; (2) X 建於 [date]
3. Organization: (1) X 成立於 [date]; (2) X 位於 [place]

A5. Domain Dictionary
Currently, Domain Dictionary = Pre-built location lexicons + Pre-built organization lexicons. In the future, the dictionary will be expanded.
APPENDIX B EXAMPLES OF AN EXTRACTION PROCESS

The following lists the results of each phase when DefExplorer is given the query term "馬英九¹ (Ma, Ying-Jeou)".

¹ 馬英九 was elected the new president of the ROC on March 22, 2008, and has been the president of the ROC since May 20, 2008.
Question Analysis
  Queries generated: "馬英九是一個", "馬英九是一位", "馬英九是一本", …
  Output: type of question term: person

Document Retrieval
  Common query set: (1) 馬英九是; (2) 馬英九為; (3) 馬英九是指; (4) …
  Domain query set: (1) 馬英九出生於; (2) 馬英九曾任; (3) …
  Output: a total of 1,552 candidates in the candidate set, e.g.:
    馬英九是眼睛瞎了,看不到擺在眼前的事實?
    主持人說網路上都形容馬英九是「馬囧」,馬英九說,囧的原意並非困窘而是光明。
    馬英九出生於 1950 年 7 月 13 日,籍貫湖南衡山。
    …

Semantics Selection
  Output: a total of 283 definition briefs.
  Common candidate set:
    「馬囧」
    最大在野黨主席² (the chairman of the largest opposition party in Taiwan)
    國民黨總統候選人³ (the presidential candidate recommended by the KuoMinTang party)
    中國國民黨總統候選人³ (the presidential candidate recommended by the KuoMinTang party)
    台灣最大在野黨總統候選人³ (the presidential candidate recommended by the largest opposition party in Taiwan)
    一個法律人
    一個學法的人
    …
  Target-birthday domain candidate set: 出生於 1950 年 7 月 13 日
  Target-birthplace domain candidate set: 出生於香港
  …

Similarity Scoring
  國民黨/總統/候選人; 中國/國民黨/總統/候選人; 台灣/最大/在野黨/總統/候選人
    Similarity scores = 0.71, 0.55, 0.58
  一個/法律人; 一個/學法/的/人
    Similarity score = 0.65
  最大/在野黨/的/主席; 國民黨/的/希望
    Similarity score = 0.25
  …

Candidate Grouping
  Output: a total of 22 groups, e.g.:
    國民黨總統候選人 (2) => 國民黨總統候選人 / 中國國民黨總統候選人
    一個法律人 (2) => 一個法律人 / 一個學法的人
    …

Answer Generation
  Answer set:
    國民黨總統候選人
    一個法律人
    …
² This is old data on the Web, but it was true in 2005.
³ This is old data on the Web, but it was true in 2007.
APPENDIX C QUESTION TERM SETS FOR EXPERIMENTS

A total of 544 question terms were included, of which 268, 176 and 100 question terms came from The China Times, The Economic Daily and the top 100 search queries of Yam.com, respectively, in 2007. Terms marked with "*" are those that could be found in the given lexicon [26], and terms marked with "?" are those for which no correct answer was found among the answers ranked in the top 5% of DefExplorer's answer sets. The total hit rate is 78.9% (= 429/544), in which The China Times is 83.2% (= 223/268), The Economic Daily is 83.5% (= 147/176) and Yam.com is 59% (= 59/100).

C.1 China Times (268 Question Terms)
3.5G 一夜情 一國兩制* 九一一* 九份* 二氧化碳* 人大* 十六大 上網*? 大蒜* 女婿*? 小泉純一郎 中油* 中芯 中秋節* 中國大陸* 中常委* 中常會* 中選會* 互聯網 元宵* 公民* 海王星* 海協會* 海軍* 海基會* 真調會 紙尿褲* 紙鶴 草案*? 記者會
公投 公視* 公訴* 公路*? 化糞池* 反分裂法 反作用力* 反恐 反聖嬰現象 天皇* 巴勒斯坦* 文宣*? 方便麵* 火鍋* 牛肉麵? 主席* 主播* 冬瓜* 包道格 台海 四川省* 外資*? 偷渡* 曼聯* 基本法* 基督徒? 基督教* 專機*? 得票率* 情人節* 捷運*
失禁*? 奶茶* 布希* 布施* 平壤* 民調 瓦斯* 田徑* 目的*? 石家莊 立法會 交換學生 共伴效應? 印度尼西亞* 地震*? 年底*? 成都市* 成龍* 死刑* 江澤民 老鼠會* 吳郭魚* 殘奧會 湯曜明 無盟 菩提* 菸草* 華人* 華府* 華僑* 著作權*
志工* 抗戰* 投票率 李遠哲* 乳球 事發*? 咖啡* 和解*? 周杰倫 周星馳 坦言*? 季風* 季線 定位拖牌 招牌* 果凍*? 林信義 林務局 板塊* 泛民主派 泛綠 泛藍 溫室效應* 溫家寶 準備程序庭 罪犯* 聖戰* 聖嬰現象* 董事長* 解聘* 資本主義*
的士* 直升機* 直播* 矽膠 社會局* 社團*? 虱目魚* 阿里山 阿妹* 信用卡* 保守黨 俄羅斯* 姚明 後遺症* 指紋* 星球* 查處*? 毒樹果理論? 洪災* 流感 炸彈客? 相聲* 網頁* 網站* 罰球* 認同* 餃子*? 廟會* 數獨 歐尼爾* 歐盟
紅包*? 紅絲帶 紅襪隊 背包客 胡錦濤 軍售*? 迪士尼 音響*? 飛彈* 香油錢* 香港* 倒扁* 修憲* 冥王星* 原住民* 哥斯大黎加* 座談會*? 恐怖分子* 核武 核電廠* 消保官 消費者* 檢察院* 總編* 聯合國* 輿論* 鎂光燈* 隱私* 鮮奶* 鮮乳* 瀉藥*
FANG-YIE LEU AND CHIH-CHIEH KO
524
記錄片? 訊問* 貢丸* 退休金* 退輔會 酒駕 配票* 院會* 除夕* 達馬松? 馬英九 高齡化* 停車費*? 假帳?
教宗* 球證* 產學合作 眷村* 移民* 統一戰線* 軟件* 陳水扁* 凱達格蘭* 單車* 寒流* 富邦 惠普 替代役
詐欺*? 超貸*? 跆拳道* 進行式? 郵輪*? 間諜* 黃金周 傳媒*? 奧組委 奧運* 新竹米粉 新華社* 新疆* 楊振寧*
跳票* 農藥* 遊民* 電話門事件 電影*? 預算案? 歌仔戲* 漫畫家 碩士* 管委會 精神病* 綠卡* 網民 網址*
緯來 蔡依林 輪迴* 養生*? 養殖業 魯迅* 魯爾* 學歷*? 導演* 導彈* 融資*? 親民黨 選舉人票* 彌功
藍軍 覆議案 證據*? 蘇貞昌 議會* 霰* 鐵人三項* 鐵馬* 鐵鍊*? 體味*? 緋聞*? 貔貅*
C.2 Economic Daily (176 Question Terms) ADSL DVD EMBA ETF HCPV IPO LCD LED OBM OEM Oracle POS SARS TUV 認證? Web 2.0 WiMAX iPod 九寨溝 人民幣* 大地震 中東* 今季? 內線交易*? 公平會* 化肥 天災*?
毛利率 水菸 世界盃 代工* 出線*? 加息*? 功率* 包機*? 北京* 台商 台塑 台糖 市值 市場經濟 平方米 平台 民主派 民建聯 甲醇* 光纖* 兆豐金控 全球化 地鐵* 安保 尖沙嘴
成本* 收盤*? 老撾* 自由行 色域 西九龍 伺服器* 免稅額* 呎價 宏觀調控 尾牙* 快閃記憶 體 投資* 改革開放 李大維 李肇星 李顯龍 里昂證券 供應商*? 併購*? 房地產* 房委會 旺角 東帝汶 東莞*
東盟 版岩 物資*? 物價* 直選 知識產權* 矽晶泡沫玻璃?
金控 金球獎 金管會 南水北調 品牌* 政熱經冷 背書* 軍購 軍購案 面板* 朗尼 核四* 柴油* 消委會 消保會 特首 租金*? 素地
能源稅 財報 退職所得 高鐵* 國台辦 國民經濟*? 國有企業 國債* 執行長 康師傅 張榮發 採購*? 液晶* 深圳 球團 現金卡 產品線* 產能* 麥當勞* 散戶*? 普吉島 港股 游揆? 發展商 發展觀
華碩 裁員*? 跌停板* 量販店* 開發區* 匯市*? 匯豐 幹細胞 意外險*? 新加坡* 新台幣 會計師* 概念股 煤價 瑜珈 禽流感 董事會* 董建華 資本額*? 達欣 電池* 電能*? 團拜*? 賓士* 銅鑼灣
增資*? 審計署 廣告* 標準普爾 歐元* 歐聯* 潛艦* 罷工* 銷單? 據點*? 燃料*? 糖價? 蕃薯藤 選後? 鴻海 職業倫理 藍光 轉乘 雙核心 簽證* 證監會 邊際效應 黨產 顧問* 羈押*
AN AUTOMATED TERM DEFINITION EXTRACTION SYSTEM
525
C.3 Top 100 Search Queries of Yam.com (100 Question Terms) 玩命小鎮 mp3 統一發票* 好玩遊戲區? 104 貼圖 小說* 小遊戲? 史萊姆 GOOGLE 桌布*? 大樂透 韓劇 後宮電影院? 情色小說 情色文學 成人*?
王子變青蛙 租屋*? 中華電信 氣象局 自拍 AV 嘟嘟*? msn 聊天室*? 台鐵 歌詞* 中華職棒 旅行社*? hinet 蘋果日報 小蕃薯? 楓之谷
5566 色情小說*? 火車時刻表 後宮* 亞洲建築專業網? 情色貼圖? 樂透* 洪爺? 波波*? 地圖*? 汽車* 小蕃薯 中央氣象局* 台灣大哥大 手機 音樂網? ut?
寵物*? 食譜*? 音樂*? 遊戲區? 成人小說? 無名小站 1111? 中國信託 真珠美人魚 圖片*? 美食*? 火影忍者 天馬*? 氣象*? 神奇寶貝 人事行政局 東森購物?
MLB NBA 統一發票開獎號碼? 考選部? 電子地圖 色情漫畫? 綠光森林? 網頁素材? 成人資訊? 新浪網 圖庫? 104 人力銀行 惡作劇之吻 同志*? 勞保局* TT1069?
東森 樂透彩 台灣 kiss? 巴哈姆特 犬夜叉 鐵路局* 線上遊戲 尋夢園? 自由時報 心理測驗* 音樂下載? 星座* 手機王 減肥* 珍珠美人魚 國稅局*
Fang-Yie Leu (呂芳懌) received his B.S., master's and Ph.D. degrees from National Taiwan University of Science and Technology, Taiwan, in 1983, 1986 and 1991, respectively, and another master's degree from the Knowledge System Institute, USA, in 1990. His research interests include wireless communication, network security, Grid applications and Chinese natural language processing. He is currently an associate professor at Tunghai University, Taiwan, and director of the database and network security laboratory of the university. He is also a member of the IEEE Computer Society.

Chih-Chieh Ko (柯志杰) received his master's degree in Computer Science from Tunghai University, Taiwan, in 2008, and studied at Oita University, Japan, for one year as an exchange student. His research interests include Web applications, software engineering, and languages, including programming languages and Asian natural languages.