Query Term Selection Strategies for Web-based Chinese Factoid Question Answering

Hao Tang 1,3, Cheng-Wei Lee 1,2, Tian-Jian Jiang 1,2, Wen-Lian Hsu 1

1 Institute of Information Science, Academia Sinica, Taiwan, R.O.C.
2 Dept. of Computer Science, National Tsing-Hua University, Taiwan, R.O.C.
3 Dept. of Computer Science and Information Engineering, National Taiwan University, Taiwan, R.O.C.

{larry, aska, tmjiang, hsu}@iis.sinica.edu.tw

Abstract

Passage retrieval plays an important role in a Chinese factoid Question Answering (QA) system. Query term selection is the process of choosing keywords from a given question so as to make the best use of information retrieval engines. Query terms selected by humans are analyzed to measure the difficulty of the task and to evaluate machine-generated results. Three approaches, namely stop word elimination, rule-based, and machine learning-based, are studied in this paper. Eliminating stop words is the simplest; heuristic rules produced by morphologists are more complex; and Conditional Random Fields (CRF), a machine learning approach, is adopted for labeling query terms. For evaluation, two sets of metrics are proposed. Passage MRR/Coverage relies on search engine results, which directly relate to QA performance but are time-consuming to obtain and may vary over time. Our experiments show that query term precision/recall is a viable alternative. The baseline Coverage of sending raw questions to Google is about 53%, while the three approaches yield 65% for stop word elimination, 57% for the rule-based approach, and 54% for the machine learning-based approach. The MRR of sending raw questions to Google is 0.33, while the three approaches yield 0.44 for stop word elimination, 0.41 for the rule-based approach, and 0.38 for the machine learning-based approach. The results are useful not only for factoid QA systems but also as a preprocessor for search engines.

Keywords: Question Answering, Query Term Selection, Passage Retrieval.

1. Introduction

With the high level of information overload on the Internet, responding to users' questions with exact answers is becoming increasingly important. Many international question answering contests have been held at conferences and workshops, such as TREC [1], CLEF [2], and NTCIR [3]. The goal of question answering (QA) is to find answers in a corpus for a given question posed in natural language. There are many types of QA systems serving different purposes [7]. The techniques proposed in this paper are mainly for factoid QA systems, which provide named entities, such as person names, organization names, and location names, as answers. Although Chinese is the second most widely spoken language in the world, there is still a performance gap between Chinese question answering systems and systems for some other languages.

The architecture of most factoid QA systems comprises four main components: Question Processing (determine the type of a question, for example "Q_PERSON" or "Q_LOCATION"), Passage Retrieval (retrieve relevant documents that may contain answers), Answer Extraction (extract answers from the retrieved documents), and Answer Ranking (rank answers according to the question and the retrieved passages) [8]. Passage retrieval plays an important role in a factoid QA system. With the growth of the Internet, passages are no longer limited to a local corpus; they can be found on the Internet through large-scale search engines. The Web as a corpus is almost infinitely large. Integrated with search engines, a factoid QA system can answer a much wider range of questions [17]. Factoid QA systems integrated with search engines are called web-based factoid QA systems.

In order to utilize search engines, selecting keywords from a given natural language question is an important issue. For example, the question "誰是 2006 年美國的總統?(Who was the president of the USA in 2006?)" can be turned into many different search engine queries, such as "2006 美國總統" or "2006 美國 總統". Formulating a query involves several steps. First, query terms are selected from the words of the question. Then, query terms may be expanded, for example using a thesaurus. Finally, the expanded query terms are converted into the language of a specific search engine, for example using Boolean operators such as and, or, and not. The approaches in this paper focus on the first step, query term selection. Three approaches, stop word elimination, rule-based, and machine learning-based, are studied in this paper.

The simplest and most intuitive approach is to eliminate the stop words in the question to form the query. In the rule-based approach, questions are first separated into segments; rules produced by morphologists are applied to combine those segments; finally, the segments are filtered with the same stop word list as in the first approach and become the final keywords. In the machine learning-based approach, on the other hand, expert knowledge is replaced by a model trained from a sufficiently large collection of query-term-labeled questions, in the hope that useful patterns for identifying query terms are captured automatically. The machine learning-based approach has the advantage not only of saving expensive expert labor but also of higher portability to other domains. Conditional random fields (CRF) [13][18][22], an advanced machine learning approach, is adopted in this research.

Section 2 reviews related research, including QA systems and query term selection and expansion strategies. The proposed selection strategies, namely stop word elimination, rule-based, and machine learning-based, are described and compared with manual selection in Section 3. Two sets of evaluation metrics, precision/recall and MRR/Coverage, are presented in Section 4. Section 5 describes the experiments, Section 6 discusses the results and some findings, and Section 7 presents the conclusion and future work.

2. Related Works

In Western languages such as English, QA systems are well developed. Many query formulation approaches, covering not only query term selection but also query type recognition and query expansion, have been proposed and shown to be effective. Simple approaches such as using rules are popular, since they are intuitive and easy to implement. Brill et al. [12], for instance, proposed several rules to convert a question into a statement; a question like "Who is the world's richest man married to?" is converted into "The world's richest man is married to." More complex approaches, such as building probabilistic models to form a query, are also convincing. Radev et al. [10][11] proposed a sophisticated algorithm to perform query expansion and even developed a small set of query operations, such as INSERT, DELETE, and DISJUNCT.

Eastern languages such as Japanese and Chinese differ in nature from Western languages, which in turn affects the proposed approaches. Converting sentences into bigrams is an approach inherited from the IR field and has proved suitable for Eastern languages, so it is widely used [15][16][19]: documents and questions are converted into bigrams for indexing and searching, respectively. Although converting sentences into bigrams is easy to implement, the tradeoff is losing semantic information. Hence, some systems use a parser to segment a sentence and acquire semantic information [9]. A parser may be a model trained with modern machine learning approaches; however, its disadvantage is that it is time-consuming. Others instead use a POS tagger to do the segmentation, acquiring less information [14][23]. After segmentation, the processing can be further elaborated in many ways, such as filtering with a stop word list, filtering by POS, or weighting terms according to heuristic rules [24]. Although there seem to be many approaches, few details have been described and thorough experiments are lacking; much work remains to be done.

3. Query Term Selection Strategies

Three automatic strategies, along with a manual one, are described in the following four subsections. The first is to have humans select the query terms from the words of a question; three annotators are involved in this task. The second strategy is stop word elimination, in which the words of a question are filtered with a stop word list. The third strategy is rule-based: several rules constructed by domain experts are used to pick out query terms. The last strategy is machine learning-based: query terms prepared in advance by humans are used to train a model for future prediction.

3.1 Manual

In order to investigate the difficulty of query term selection and to build gold standards for machine learning, each of three human annotators (HAs) manually labels 400 questions. All HAs are accustomed to using search engines. HA1 is a programmer who is familiar with the mechanisms of search engines, while HA2 and HA3 are end users. Before annotation, the HAs are presented with the following guideline: Imagine you have to answer a question with a search engine and you are restricted to sending only one query. Query expansion is not allowed; you have to choose query terms from the words of the question. Try your best to form a query term set that is most likely to retrieve the correct web pages. For example, the query terms of the question "請問清朝時,台灣的第一任巡撫是誰?(Who was Taiwan's first governor in the Ching Dynasty?)" may be "清朝", "台灣", and "第一任巡撫".

3.2 Stop Word Elimination

Stop words in a given question are eliminated according to a manually constructed stop word list. In practice, each stop word in the question is replaced with a white space to form a query, and the query terms separated by white spaces are then used for searching. Some sample stop words are listed in Table 1. For example, "請問2001年G8會議於何地舉行? (Where was the G8 summit held in 2001?)" becomes "2001年G8會議於 舉行" after the stop words "請問" and "何地" are eliminated.

Table 1. Some sample stop words.

Chinese | English
誰 | Who
是什麼 | What
為什麼 | Why
何地 | Where
請問 | An auxiliary word for politeness
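A minimal sketch of this step might look like the following; the stop word list here is an abbreviated, hypothetical subset drawn from Table 1, while the list used in the paper is manually constructed and larger.

```python
# Minimal sketch of stop word elimination (abbreviated, hypothetical stop word list).
STOP_WORDS = ["是什麼", "為什麼", "請問", "何地", "誰"]

def build_query(question: str) -> str:
    """Replace every stop word with a space and collapse the remainder
    into whitespace-separated query terms."""
    for sw in sorted(STOP_WORDS, key=len, reverse=True):  # longest match first
        question = question.replace(sw, " ")
    # Drop question marks and normalize whitespace.
    return " ".join(question.replace("?", " ").replace("？", " ").split())

print(build_query("請問2001年G8會議於何地舉行?"))  # -> "2001年G8會議於 舉行"
```

Replacing rather than deleting stop words keeps the remaining terms separated, so the resulting white spaces act as term boundaries in the search engine query.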

3.3 Rule-Based

Some heuristic rules are used to extract keywords from a given question. A question is first segmented into words with part-of-speech (POS) tags by CKIP AutoTag [4]. Since AutoTag segments words as finely as possible, it produces short and less meaningful words, which is not suitable for QA passage retrieval. Therefore, some words need to be re-combined. The word combination rules are summarized in Table 2 and Table 3. For example, "四(Neu)" and "年(Nf)" are combined into "四年(Nf)" according to the eighth row of Table 3, and "喪(VJ)" and "假(VH)" are combined into "喪假(Na)" according to the third row of Table 3. These rules were inspired by Chinese morphologists' analysis of the segmentation results [6] and modified for our own purpose. After re-combination, the words are filtered with a list of stop words; the remaining words become the query terms for searching.

Table 2. POS combination rules for two words; at least one word must be 1-char long.

Left word | Right word | Composite
FW | Neu | FW
VH13 | Na, Nb, Nc, Nd | Na

Table 3. POS combination rules for two 1-char-long words.

Left word | Right word | Composite
Na, Nb, Nc | Na, Nb, Nc | Na
A, VH, Neu, Nes, VH13 | Na, Nb, Nc, Nd | Na
VJ | VH | Na
VC, VD | Na | Na
Nba | Nba | Nba
Dfa | VH, V_2 | VH13
Nes, Neu | Neu | Neu
Neu, Nes, FW | Nf | Nf
Neu | VH | Nf
Nep | Nf, Nd | Nf
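To illustrate the re-combination step, the following sketch applies a small, abbreviated subset of the rules in Tables 2 and 3 to (word, POS) pairs such as those produced by CKIP AutoTag. The word-length constraints and the exact iteration order are simplifications for illustration, not the paper's exact procedure.

```python
# Sketch of the word re-combination step (abbreviated rule set; length
# constraints from Tables 2 and 3 are omitted for brevity).
# Each rule: (set of left POS tags, set of right POS tags, composite POS).
RULES = [
    ({"Neu", "Nes", "FW"}, {"Nf"}, "Nf"),            # e.g. 四(Neu)+年(Nf) -> 四年(Nf)
    ({"VJ"}, {"VH"}, "Na"),                          # e.g. 喪(VJ)+假(VH) -> 喪假(Na)
    ({"Na", "Nb", "Nc"}, {"Na", "Nb", "Nc"}, "Na"),  # noun + noun -> compound noun
]

def combine(tokens):
    """Greedily merge adjacent (word, pos) pairs that match a rule."""
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        (w1, p1), (w2, p2) = out[i], out[i + 1]
        for left, right, comp in RULES:
            if p1 in left and p2 in right:
                out[i:i + 2] = [(w1 + w2, comp)]  # merge and keep checking at i
                break
        else:
            i += 1
    return out

print(combine([("四", "Neu"), ("年", "Nf"), ("前", "Ng")]))
# -> [('四年', 'Nf'), ('前', 'Ng')]
```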

3.4 Machine Learning-Based

Instead of using heuristics, a Conditional Random Field (CRF) is used to label query terms. CRFs are undirected graphical models trained to maximize a conditional probability [18]. A linear-chain CRF with parameters Λ = {λ1, λ2, …} defines a conditional probability for a state sequence y = y1…yT, given an input sequence x = x1…xT, as

P_\Lambda(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)

where Z_x is the normalization factor that makes the probabilities of all state sequences sum to one; f_k(y_{t-1}, y_t, x, t) is often a binary-valued feature function and λ_k is its weight. The feature functions can measure any aspect of a state transition, y_{t-1} → y_t, and the entire observation sequence, x, centered at the current time step, t. For example, one feature function might have the value 1 when y_{t-1} is the state B, y_t is the state I, and x_t is the character "請". We use CRF++ [5], a simple open-source tool implementing CRFs, for our experiments. As described in Section 3.1, an annotator labels the query terms of each question under the condition that the answer cannot be seen in advance and there is no search feedback. A model is then trained from the labeled terms of a question set. Features such as word segmentation points and part of speech are used for training. Features and class labels are all in IOB format, with labels Keyword-B (KB) and Keyword-I (KI), as shown in Table 4.

Table 4. An example of the labeled query terms of the question "請問 2000 年因賄選醜聞曝光而糟到國會罷免流亡海外的祕魯總統是誰?"

[Table 4 lists each character of the question with its segmentation tag (TB/TI), its POS tag, and its query term label (KB, KI, or O); for example, the first characters are labeled 請 (TB, VE, O), 問 (TI, VE, O), 2 (TB, Nd, KB), 0 (TI, Nd, KI), …. The full per-character listing could not be reliably recovered from the source and is not reproduced here.]
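As a rough illustration of how the CRF input might be prepared, the sketch below writes one question in the tab-separated column layout suggested by Table 4 (character, segmentation tag, POS tag, gold label) and then notes the standard CRF++ command-line calls. The file names and the feature template are assumptions, and the token values are only illustrative.

```python
# Sketch: write one question in the CRF++ column format suggested by Table 4.
# The token values below are illustrative; in the paper they come from CKIP
# AutoTag segmentation and the annotator-1 gold labels.
rows = [
    ("請", "TB", "VE", "O"),
    ("問", "TI", "VE", "O"),
    ("2",  "TB", "Nd", "KB"),
    ("0",  "TI", "Nd", "KI"),
]

with open("train.data", "w", encoding="utf-8") as f:
    for char, seg, pos, label in rows:
        f.write(f"{char}\t{seg}\t{pos}\t{label}\n")
    f.write("\n")  # a blank line separates questions (sequences)

# Training and prediction are then done with the CRF++ command-line tools,
# given a feature template file describing unigram/bigram features:
#   crf_learn template train.data model
#   crf_test -m model test.data > predicted.data
```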

The keywords selected by the different annotators and approaches are illustrated in Table 5.

Table 5. An example of query terms selected from the question "請問京都議定書規定幾個工業國家的二氧化碳排放量限制?"

Approach | Query terms
Annotator 1 | 京都議定書 二氧化碳
Annotator 2 | 京都議定書 幾個 工業國家 二氧化碳排放量
Annotator 3 | 京都議定書 二氧化碳排放量
Heuristics | 議定書 幾個 京都 限制 二氧化 碳 國家 排放量 規定 工業
CRF | 京都議定書 幾個工業國家 二氧化

4. Evaluation Metrics

One of the most effective ways to evaluate a module of a QA system is to put it into a real QA system. Since most QA systems are too complicated for detailed error analysis, many studies take a modular approach and evaluate individual QA stages. For instance, question processing and passage retrieval are often evaluated by question type classification accuracy and passage MRR, respectively [20][21]. In our study, in addition to the traditional passage MRR metric, we propose char/word precision/recall metrics to reduce experimentation time.

For each question, an answer set is prepared by humans for evaluation. We use Google as our experimental search engine. From the retrieved documents and the given answer set, MRR and Coverage can be calculated. Mean Reciprocal Rank (MRR) and Coverage are two classic evaluation criteria. Suppose the first retrieved document that contains one of the answers is the i-th document; the reciprocal rank of that question is then 1/i, and MRR is the mean over all questions. Coverage is defined as the number of documents containing answers over the total number of retrieved documents. Because the answers of factoid questions are mostly exact and short, directly matching the answer strings within the returned documents is sufficient. There are also cases where a document contains some answer words but the context does not imply the answer; such cases are rare, so they are ignored. However, calculating MRR and Coverage is time-consuming, and the contents of Google and other web pages change over time. Although we cache the web pages for efficiency and reproducibility, it still takes too much time. Therefore, more efficient metrics, precision and recall, are also applied for evaluation.

The char/word precision/recall metrics are calculated against the human-annotated standards. If a character appears in both the expected and the predicted keyword sets, it is counted as a true positive. If a character appears only in the predicted keyword set but not in the expected one, it is counted as a false positive. If a character appears only in the expected keyword set but not in the predicted one, it is counted as a false negative. The remaining characters, which appear in neither set, are counted as true negatives. Although characters are the basic elements of a Chinese sentence, some characters are meaningless unless they are combined with other characters. Therefore, comparing the keyword sets extracted by the rules or CRF with the sets labeled by humans at the word level is also important. True positives increase if there is an exact match; false positives increase if a keyword is in the predicted set but not in the expected set; false negatives increase if a keyword is in the expected set but not in the predicted set. True negatives are not defined and are not needed for calculating precision and recall. After counting these values, precision and recall are defined as follows:

\text{precision} = \frac{\sum_i tp_i}{\sum_i (tp_i + fp_i)}, \qquad \text{recall} = \frac{\sum_i tp_i}{\sum_i (tp_i + fn_i)}

where tp_i, fp_i, and fn_i stand for the numbers of true positives, false positives, and false negatives of the i-th question, respectively.
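A sketch of the micro-averaged character-level computation is given below; treating each question's keyword characters as a set (so repeated characters are counted once) is a simplifying assumption on our part.

```python
# Sketch of micro-averaged character-level precision/recall over all questions.
def char_prf(expected_sets, predicted_sets):
    tp = fp = fn = 0
    for expected, predicted in zip(expected_sets, predicted_sets):
        e, p = set(expected), set(predicted)
        tp += len(e & p)   # characters in both keyword sets
        fp += len(p - e)   # predicted but not expected
        fn += len(e - p)   # expected but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: expected "京都議定書 二氧化碳", predicted "議定書 二氧化"
print(char_prf(["京都議定書二氧化碳"], ["議定書二氧化"]))  # -> (1.0, 0.666...)
```

The word-level variant replaces the character sets with sets of whole keywords and requires an exact string match for a true positive.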

5. Experiment

We adopt the NTCIR5 CLQA1 Chinese QA data sets, i.e. CLQA1-CE-S1200 (D200) and CLQA1-CC-EC-T1200 (T200), for the experiments. These data sets consist of questions and answers. Answers are restricted to named entities: proper nouns, such as the names of persons, organizations, and various artifacts, and numerical expressions, such as money, size, date, etc. Fig 1 shows a sample from the T200 data set. Both T200 and D200 are constructed from a newswire corpus, so they may not contain sufficient standard answers for our web QA study. In order to apply them to the web scenario, we manually examined the returned snippets from Google and added some missing correct answers.

Question: 請問首位自費太空旅行的觀光客為誰? (Who is the first self-financed space tourist?)
Type: PERSON
Answers: 提托, 提托, 提托, 提托

Fig 1. An example of a question-answer pair from the CLQA1-CC-EC-T1200 data set.

The experiments are separated into three parts. First, raw questions, i.e. questions without any modification, are sent to Google directly as queries to form a baseline for the other experiments. Second, online experiments are conducted by sending the query terms selected by the annotators, stop word elimination, the rule-based approach, and the machine learning-based approach. The model of the machine learning-based approach is trained on the annotations of annotator 1 over question set D200, and query terms are then predicted for question set T200.

The retrieved documents are then searched to see whether they contain any of the answers specified in the data set, so that MRR and Coverage can be calculated. The results of sending the query terms selected by the three approaches, the query terms selected by the annotators, and the raw questions are shown in Table 6, Table 8, and Table 9. The last experiment is offline: one query term set is designated as the perfect query term set, the other query term sets are compared against it, and precision/recall as defined earlier are calculated. We assume the keywords labeled by annotator 1 are perfect. The results of comparing the query terms selected by the other approaches and the other annotators are shown in Table 7 and Table 10. Query terms formed by eliminating stop words are not compared, because those queries are long and are not separated by word boundaries. The purposes of the three experiments are different: the baseline and online experiments evaluate the effectiveness of the proposed approaches, while the online and offline experiments together investigate the relationship between MRR/Coverage as online metrics and precision/recall as offline metrics.
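The online evaluation could be sketched as follows. Here Coverage is computed per question (the fraction of questions with at least one answer-bearing document in the top 10), which is one reading of the definition in Section 4, and answer matching is the plain substring search described there.

```python
# Sketch of MRR / Coverage computation over top-10 retrieved documents.
# Coverage is computed per question here, which is one common reading of
# the definition in Section 4.
def mrr_and_coverage(retrieved_docs, answer_sets, k=10):
    rr_sum, covered = 0.0, 0
    for docs, answers in zip(retrieved_docs, answer_sets):
        for rank, doc in enumerate(docs[:k], start=1):
            if any(ans in doc for ans in answers):   # simple string match
                rr_sum += 1.0 / rank
                covered += 1
                break
    n = len(answer_sets)
    return rr_sum / n, covered / n

docs = [["...提托是首位自費太空旅行的觀光客...", "無關文件"]]
print(mrr_and_coverage(docs, [["提托"]]))   # -> (1.0, 1.0)
```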

6. Discussion

According to Table 6, the best MRR of the human annotators is only about 0.56, so the task is not easy even for humans. According to the experimental statistics, the average difference in reciprocal rank between annotators is 0.22; thus the effectiveness of the query terms selected by the annotators does not differ greatly. Besides, Table 7 shows that the word precision and recall of the human annotators differ considerably, while their character precision and recall are similar. For example, for "誰是美國總統?(Who is the president of the USA?)", the query terms may be "美國" and "總統", or only "美國總統"; the chosen characters are the same, but the word boundaries differ. Therefore, from Table 6 and Table 7, differences in word precision and recall do not affect the online performance.

The results of query terms obtained by simply eliminating stop words are shown in Table 8. For T200, the Coverage of stop word elimination is 0.65, while for raw questions the Coverage is only 0.53; the Coverage gap relative to the human annotators is at most 0.08. It is the simplest approach and yet also the most effective one. If we define a virtual annotator that selects the worst query terms among the three annotators, this virtual annotator can be seen as the lower bound of the annotators; its MRR and Coverage for T200 are 0.42 and 0.64, respectively. The performance of stop word elimination reaches this lower bound of the human annotators. The results of query term selection using the heuristics are also shown in Table 8. Its MRR and Coverage are only 0.41 and 0.57 for T200, slightly better than sending raw questions. The MRR improvement is especially useful for QA systems.

Answers can be found as soon as the passage containing them is reached. Table 9 shows the results of the CRF model trained on annotator 1's query terms for question set D200 and tested on question set T200. CRF also performs better than raw questions. However, the performance of both the rule-based and the machine learning-based approaches is still far behind the annotators and stop word elimination.

According to an analysis of the search results, for 13% of the questions the answers are found with the query terms of stop word elimination but not with the terms produced by the heuristic rules. By manually inspecting the query terms of both approaches, we find that fine-grained query terms are not favorable for Google; in addition, Google favors its own segmentation, which is the first reason for the poor performance of the heuristic rules.

The average number of selected terms is shown in Table 14. The three annotators select similar numbers of terms. The number of terms selected by the heuristics is much higher than the others, because the question is segmented into small word pieces and, except for stop words, all other words are chosen. The average length of terms is shown in Table 15. The lengths of the terms selected by the three annotators are close, whereas the terms selected by the heuristic rules are much shorter on average. This is not surprising, because a larger number of terms implies shorter terms. It seems that the terms produced by the heuristic rules, even after re-combination, are still too short; and since most of the terms are kept except for stop words, the character recall is naturally high. The high number of terms and their short length are due to the segmentation policy of AutoTag, which segments a sentence as finely as possible. Therefore, the second reason for the poor performance of the rule-based approach is that there are many ways of segmentation, all of which are meaningful and appropriate for daily use; there is no standard segmentation. The index of a search engine relies on its segmentation result, so the ambiguity of segmentation may hurt performance. Another reason for the poor performance of the heuristic rules is probably that the original rules were designed for Cross-Language QA (CLQA) systems and may not fit the web QA scenario. The poor performance of CRF may be due to the small number of instances available for training; however, whether this is the real reason requires further investigation.

In order to investigate the relationship between the online and offline metrics, the Kendall τ correlation is calculated. It is defined as

\tau = \frac{2P}{\frac{1}{2}\,n(n-1)} - 1

where n is the number of items, and P is the sum, over all items, of the number of items ranked after the given item by both rankings (i.e., the number of concordant pairs). Kendall τ can also be interpreted through the "distance" between two rankings, i.e., the minimum number of pairwise adjacent swaps necessary to convert one ranking into the other. From Table 11, Table 12, and Table 13, we conclude that the ranking induced by word recall can represent the rankings of both MRR and Coverage. Word recall offers higher stability, efficiency, and simplicity than MRR and Coverage. Although MRR and Coverage remain important metrics that cannot be fully replaced, word recall is preferred for analysis.
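A direct implementation of this definition for two tie-free rankings of the same items might look like the following; for rankings without ties, scipy.stats.kendalltau gives the same value.

```python
# Sketch of Kendall tau between two rankings of the same n items, following
# the definition above (P = number of item pairs ordered the same way by
# both rankings; no tie handling).
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    n = len(rank_a)
    concordant = sum(
        1 for i, j in combinations(range(n), 2)
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0
    )
    return 2.0 * concordant / (n * (n - 1) / 2.0) - 1.0

# Example: rankings of four query-term-selection methods by two metrics.
print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))   # -> 0.666...
```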

Table 6. MRR and Coverage of the top 10 documents retrieved from Google based on the keywords of the three annotators and the raw questions.

Approach | D200 MRR | D200 Coverage | T200 MRR | T200 Coverage
Raw Question | 0.31 | 0.53 | 0.33 | 0.53
Annotator 1 | 0.56 | 0.75 | 0.52 | 0.72
Annotator 2 | 0.55 | 0.75 | 0.56 | 0.73
Annotator 3 | 0.48 | 0.72 | 0.52 | 0.73

Table 7. Precision and recall of two annotators calculated based on keywords labeled by annotator 1.

Set | Level | Annotator 2 precision | Annotator 2 recall | Annotator 3 precision | Annotator 3 recall
D200 | word | 0.36 | 0.45 | 0.60 | 0.61
D200 | char | 0.80 | 0.91 | 0.87 | 0.90
T200 | word | 0.57 | 0.62 | 0.66 | 0.64
T200 | char | 0.83 | 0.96 | 0.88 | 0.90

Table 8. MRR and Coverage of the top 10 documents retrieved from Google based on the keywords produced by stop word elimination, the heuristics, and annotator 1, compared with raw questions.

Approach | D200 MRR | D200 Coverage | T200 MRR | T200 Coverage
Stop Words Eliminated | 0.46 | 0.70 | 0.46 | 0.65
Heuristics | 0.47 | 0.65 | 0.41 | 0.57
Raw Question | 0.31 | 0.53 | 0.33 | 0.53
Annotator 1 | 0.56 | 0.75 | 0.52 | 0.72

Table 9. MRR and Coverage of the top 10 documents retrieved from Google based on the keywords produced by stop word elimination, the heuristics, CRF, and annotator 1, compared with raw questions (T200).

Approach | T200 MRR | T200 Coverage
Stop Words Eliminated | 0.46 | 0.65
Heuristics | 0.41 | 0.57
CRF | 0.38 | 0.54
Raw Question | 0.33 | 0.53
Annotator 1 | 0.52 | 0.72

Table 10. Precision and recall of CRF and the heuristics calculated based on keywords labeled by annotator 1 (question set 2).

Level | CRF precision | CRF recall | Heuristics precision | Heuristics recall
word | 0.35 | 0.41 | 0.26 | 0.58
char | 0.77 | 0.68 | 0.68 | 0.96

Table 11. Kendall τ correlation between MRR/Coverage and precision/recall, calculated with the keywords of annotator 1.

Metric | word precision | word recall | char precision | char recall
MRR | 0.33 | 0.67 | 0.33 | 0.55
Coverage | 0.55 | 0.91 | 0.55 | 0.40

Table 12. Kendall τ correlation between MRR/Coverage and precision/recall, calculated with the keywords of annotator 2.

Metric | word precision | word recall | char precision | char recall
MRR | 0.60 | 0.91 | 0.55 | 0.20
Coverage | 0.55 | 0.67 | 0.33 | 0.18

Table 13. Kendall τ correlation between MRR/Coverage and precision/recall, calculated with the keywords of annotator 3.

Metric | word precision | word recall | char precision | char recall
MRR | 0.33 | 0.67 | 0.33 | 0.33
Coverage | 0.33 | 0.67 | 0.33 | 0.33

Table 14. Average number of terms selected by different approaches.

Approach | Average number of terms
CRF | 2.735
Annotator 1 | 3.135
Annotator 2 | 3.405
Annotator 3 | 3.005
Heuristics | 6.170

Table 15. Average length of terms selected by different approaches.

Approach | Average length of terms
CRF | 3.238
Annotator 1 | 3.413
Annotator 2 | 3.637
Annotator 3 | 3.626
Heuristics | 2.386

7. Conclusion

Three approaches, stop word elimination, rule-based, and machine learning-based, are proposed for query term selection. They all perform well for web-based factoid QA systems, especially stop word elimination, which reaches the lower-bound performance of the human annotators. Two categories of metrics, MRR/Coverage and precision/recall, are used for evaluation. Precision and recall can to some extent stand in for MRR and Coverage; from the Kendall τ correlation, we find that word recall reflects the ordering of both MRR and Coverage. The main contribution of this paper is improving retrieval performance by modifying input queries. In Chinese especially, there is still a lack of appropriate ways, such as query expansion with a thesaurus, to gather more information from search engines by enhancing queries. We hope that more researchers will explore this topic further.

8. Acknowledgments

This research is supported in part by the National Science Council under grant

NSC94-2752-E-001-001-PAE. We would like to thank the Chinese Knowledge and Information Processing group (CKIP) in Academia Sinica for providing us with AutoTag for Chinese word segmentation.

9. References

[1] Text REtrieval Conference (TREC), http://trec.nist.gov/
[2] Cross Language Evaluation Forum (CLEF), http://www.clef-campaign.org/
[3] NTCIR Workshop, http://research.nii.ac.jp/ntcir/
[4] CKIP AutoTag, Academia Sinica, http://ckipsvr.iis.sinica.edu.tw/
[5] CRF++: Yet Another CRF Toolkit, http://chasen.org/~taku/software/CRF++/
[6] 林厚誼, 蔣岳霖, 周世俊, "The Design and Implementation of Act e-Service Agent Based on FAQ Corpus," TAAI, 2002.
[7] Cheng-Wei Lee, Cheng-Wei Shih, Min-Yuh Day, Tzong-Han Tsai, Tian-Jian Jiang, Chia-Wei Wu, Cheng-Lung Sung, Yu-Ren Chen, Shih-Hung Wu, Wen-Lian Hsu, "Perspectives on Chinese Question Answering Systems," Proceedings of the Workshop on the Sciences of the Artificial (WSA 2005), Hualien, Taiwan, December 7-8, 2005.
[8] Cheng-Wei Lee, Cheng-Wei Shih, Min-Yuh Day, Tzong-Han Tsai, Tian-Jian Jiang, Chia-Wei Wu, Cheng-Lung Sung, Yu-Ren Chen, Shih-Hung Wu, Wen-Lian Hsu, "ASQA: Academia Sinica Question Answering System for NTCIR-5 CLQA," Proceedings of the NTCIR-5 Workshop, Tokyo, Japan, December 6-9, 2005, pp. 202-208.
[9] Chuan-Jie Lin, "A Study on Chinese Open-Domain Question Answering Systems," Ph.D. dissertation, National Taiwan University, 2004.
[10] Dragomir R. Radev, Hong Qi, Zhiping Zheng, Sasha Blair-Goldensohn, Zhu Zhang, Weiguo Fan, John Prager, "Mining the Web for Answers to Natural Language Questions," Proceedings of the Tenth International Conference on Information and Knowledge Management (ACM CIKM 2001).
[11] Dragomir Radev, Weiguo Fan, Hong Qi, Harris Wu, Amardeep Grewal, "Probabilistic Question Answering on the Web," Journal of the American Society for Information Science and Technology.
[12] Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais, Andrew Ng, "Data-Intensive Question Answering," Proceedings of the Tenth Text REtrieval Conference (TREC 2001), Gaithersburg, Maryland, November 2001.
[13] H. M. Wallach, "Conditional Random Fields: An Introduction," http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.ps
[14] I-Heng Meng, Wei-Pang Yang, "The Design and Implementation of Chinese Question and Answering System," Computational Science and Its Applications - ICCSA 2003, Pt 1, Proceedings, 2003, pp. 601-613.
[15] In-Su Kang, Seung-Hoon Na, Jong-Hyeok Lee, "Combination Approaches in Information Retrieval: Words vs. N-grams, and Query Translation vs. Document Translation," Proceedings of NTCIR-4, Tokyo, 2004.
[16] Jiangping Chen, Rowena Li, Fei Li, "Chinese Information Retrieval Using Lemur: NTCIR-5 CIR Experiments at UNT," Proceedings of the NTCIR-5 Workshop, 2005.
[17] Jimmy Lin, "The Web as a Resource for Question Answering: Perspectives and Challenges," Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Canary Islands, Spain.
[18] J. Lafferty, A. McCallum, F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proceedings of the 18th International Conference on Machine Learning, 2001.
[19] Kui-Lam Kwok, Sora Choi, Norbert Dinstl, Peter Deng, "NTCIR-5 Chinese, English, Korean Cross Language Retrieval Experiments using PIRCS," Proceedings of the Fifth NTCIR Workshop, 2005.
[20] Min-Yuh Day, Cheng-Wei Lee, Shih-Hung Wu, Chorng-Shyong Ong, Wen-Lian Hsu, "An Integrated Knowledge-based and Machine Learning Approach for Chinese Question Classification," Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2005), 2005.
[21] Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, Gregory Marton, "Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering," SIGIR 2003.
[22] Tzong-Han Tsai, Shih-Hung Wu, Wen-Lian Hsu, "Exploitation of Linguistic Features Using a CRF-Based Biomedical Named Entity Recognizer," ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics.
[23] Yutao Guo, "Chinese Question Answering with Full-Text Retrieval Re-Visited," Waterloo, 2004.
[24] Zhang Gang, Liu Ting, Zheng Shifu, Che Wanxiang, Qin Bing, Li Sheng, "Research on Open-domain Chinese Question-Answering System," 20th Annual Meeting of the Chinese Information Processing Society of China, 2001.
