Information Sciences 224 (2013) 77–87
A support vector machine-based context-ranking model for question answering

Show-Jane Yen a, Yu-Chieh Wu b, Jie-Chi Yang c, Yue-Shi Lee a,*, Chung-Jung Lee d, Jui-Jung Liu e

a Department of Computer Science and Information Engineering, Ming Chuan University, No. 5, De-Ming Rd., Gweishan District, Taoyuan 333, Taiwan, ROC
b Department of Communication and Management, Ming Chuan University, No. 250, Zhong Shan N. Rd., Taipei 111, Taiwan, ROC
c Graduate Institute of Network Learning Technology, National Central University, No. 300, Jhong-Da Rd., Jhongli City, Taoyuan County 320, Taiwan, ROC
d Department of Finance, Ming Chuan University, No. 250, Zhong Shan N. Rd., Taipei 111, Taiwan, ROC
e Department of Information and Electronic Commerce, Kai-Nan University, No. 1, Kainan Rd., Luzhu Shiang, Taoyuan 338, Taiwan, ROC
Article info
Article history: Received 25 April 2011; Received in revised form 30 September 2012; Accepted 15 October 2012; Available online 26 October 2012.
Keywords: Question answering; Information retrieval; Question classification; Passage retrieval; Support vector machines.
Abstract

Modern information technologies and Internet services suffer from the problem of selecting and managing a growing amount of textual information, to which access is often critical. Machine learning techniques have recently shown excellent performance and flexibility in many applications, such as artificial intelligence and pattern recognition. Question answering (QA) is a method of locating exact answer sentences in vast document collections. This paper presents a machine learning-based question-answering framework, which integrates a question classifier, simple document/passage retrievers, and the proposed context-ranking models. The question classifier is trained to categorize the answer type of the given question and instructs the context-ranking model to re-rank the passages retrieved by the initial retrievers. This method provides flexible features to learners, such as word forms, syntactic features, and semantic word features. The proposed context-ranking model, which is based on sequential labeling tasks, combines rich features to predict whether the input passage is relevant to the question type. We employ TREC-QA tracks and question classification benchmarks to evaluate the proposed method. The experimental results show that the question classifier achieves 85.60% accuracy without any additional semantic or syntactic taggers, and reaches 88.60% after we employ the proposed term expansion techniques and a predefined related-word set. In the TREC-10 QA task, using the gold TREC-provided relevant document set, the QA model achieves a 0.563 mean reciprocal rank (MRR) score, and a 0.342 MRR score is achieved using the simple document and passage retrieval algorithms.

© 2012 Elsevier Inc. All rights reserved.
1. Introduction

On-line information retrieval plays a vital role in knowledge acquisition research. To extract external knowledge, knowledge seekers must often look for answers in a large corpus. The goal of automatic question answering (QA) [34] is to find exact answers to the natural language questions specified by users. Automatic QA technology is intended to manage a large amount of textual data and to mine suitable knowledge from precise corpora. Researchers [4] in the field of Internet technology have also applied online learning to QA technology. For example, Feng et al. [5] developed a simple QA system
* Corresponding author. Tel.: +886 3 3507001; fax: +886 3 3593874.
E-mail addresses: [email protected] (S.-J. Yen), [email protected] (Y.-C. Wu), [email protected] (J.-C. Yang), [email protected] (Y.-S. Lee), [email protected] (C.-J. Lee), [email protected] (J.-J. Liu).
http://dx.doi.org/10.1016/j.ins.2012.10.014
capable of answering questions posted on discussion boards. Molla et al. [22] proposed a logic-form-based [29] answer extraction system capable of answering natural language questions on an aircraft flight handbook. Developing QA systems can support online learning and communication in many tutoring and knowledge management activities. Automatic QA technology has recently been extended to multimedia applications [16,36].

QA research is an open-domain and challenging research problem. The Text REtrieval Conference (TREC) [34] has held an annual question-answering competition through its QA track. In this competition, researchers test the ability of their systems to extract answers to predefined questions from a sizable document collection. Existing QA technology [1,11,15,17,20,26] involves two main steps: information retrieval (IR) and information extraction (IE). The IR components first retrieve the relevant document set after question analysis or classification. The IE process then uses the retrieved relevant documents to extract possible answers, which are often small and short, such as phrases or clauses. Developing a high-quality QA system generally requires multiple external components, such as human-made rules [26], a domain-specific knowledge base, or a thoroughly annotated ontology [11]. The successful development of machine learning algorithms in recent years has led to their inclusion in most QA systems as part of the kernel component. Machine learning-based methods [11,24,27,28,38] have been applied to various domains and languages [30] and have achieved results comparable to previous rule-based QA systems. Liu [23] addressed the difficulties of adopting machine learning methods, such as SVM [12], to improve document retrieval. To further enhance performance, a set of language- or domain-dependent processing tools, such as a syntactic parser [1,24,27], a thesaurus [11,15,26], or an ontology [3,6], is usually necessary.

This paper presents a machine learning-based QA framework built on the proposed context-ranking model. The framework consists of three steps: question classification, information retrieval, and context ranking. The first step uses clustering methods to generate pseudo labels for questions and trains classifiers on the pseudo-labeled questions. The second step adopts a simple retrieval model to retrieve document-level information related to the given question. To determine the answer, each of the retrieved documents is segmented into passage-level fragments. The final step applies a context-ranking model that uses the contextual information of proper names to re-rank passages. To demonstrate the effect of context ranking, this study evaluated the proposed method on the TREC-8, 9, and 10 QA tracks. The experimental results show that the proposed question classification method is not only fast and accurate but also useful for driving the proposed context-ranking model. The results are competitive with those of well-known QA techniques that combine a large amount of external knowledge.

2. Machine learning-based question answering framework

The goal of QA is to find exact answers in a large collection of text. This goal is quite similar to that of information retrieval (IR), which retrieves relevant "documents" or "Web pages" in response to query words. QA emphasizes precise responses, such as short phrases or sentences, to questions posed in natural language. Nevertheless, IR techniques can greatly reduce the search space.
The proposed machine learning-based QA framework is similar to traditional IR models. Fig. 1 shows the architecture of the proposed QA model, which consists of three main parts: question classification (QC), document and passage retrieval (DPR), and the context-ranking model (CRM). The question classifier is a supervised machine learning classifier trained to identify the answer type of the input question. The answer type is an indicator that determines which context-ranking model is used to re-rank the retrieved passages. DPR uses a Boolean-based document retriever and a density-based method to search for relevant passages. Limiting the size of a retrieved passage to three sentences forms the basic answer element. Li and Roth [19] showed that users prefer passages to answer phrases because a passage provides sufficient context.
Fig. 1. Machine learning-based question answering framework.
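To make the three-stage architecture of Fig. 1 concrete, the following sketch outlines the pipeline in plain Python. It is only an illustrative skeleton under assumed interfaces (the names classify, retrieve_documents, retrieve_passages, and score are hypothetical), not the authors' implementation.

# Illustrative skeleton of the three-stage QA pipeline in Fig. 1.
# All object and method names are hypothetical placeholders; the paper's actual
# components are a trained SVM question classifier, a Boolean document retriever
# with density-based passage ranking, and the context-ranking model (CRM).
def answer_question(question, qc_model, retriever, crm_models, top_n=5):
    # Stage 1: question classification predicts the answer type.
    answer_type = qc_model.classify(question)

    # Stage 2: document and passage retrieval (DPR).
    documents = retriever.retrieve_documents(question)            # e.g., top 500 documents
    passages = retriever.retrieve_passages(question, documents)   # e.g., top 50 passages

    # Stage 3: the answer type selects the corresponding CRM, which re-ranks the passages.
    crm = crm_models[answer_type]
    ranked = sorted(passages, key=crm.score, reverse=True)

    # The top five passages are returned as answers.
    return ranked[:top_n]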
In the third stage, the answer type allows the CRM to choose a suitable model for re-ranking the retrieved passages. Finally, the system chooses the top five passages as answers. Sections 2.1 and 2.2 introduce the question classification module and the DPR component. Section 3 describes the proposed context-ranking model.

2.1. Question classification

The purpose of question classification is to detect the answer type of the input question. Question classification shares the same target as conventional text categorization: a question sentence is treated as a document, and various machine learning-based methods are used for classification. Using a manually annotated set of 5500 questions (the UIUC question corpus), Li and Roth [18,19] applied the SNoW algorithm to classify the 500 TREC-10 test questions into 50 fine-grained and six coarse-grained categories. They demonstrated that incorporating additional human-constructed word lists improved the accuracy of the results (from 79% to 84.2%). Hacioglu and Ward [7] reported that, without using the word list, an SVM-based question classifier could achieve a level of performance comparable to that of SNoW (82%). Tree-kernel SVMs can achieve better coarse-grained (i.e., six-class) classification results (Zhang and Lee [40]). Alternatively, Huang et al. [8,9] reported improving the state-of-the-art classification accuracy on the TREC-10 question set by using headwords extracted from parsers together with WordNet hypernyms. Platt [26] combined powerful semantic features (derived from WordNet) and Wiki resources for question classification. Krishnan et al. [13] used conditional random fields (CRFs) [14] to provide multi-resolution features.

Lexical information is an essential feature of question classifiers because it improves the ability of the classifier to handle unknown words. Li and Roth [18,19] reported a large improvement over existing classifiers by using human-made related words. The extracted word clusters become features for the classifiers. The traditional approach encodes a training example using the bag-of-words (BOW) model. This model has a one-to-one mapping relationship but cannot find a mapping when an unseen word appears. By contrast, the proposed cluster-based model has a many-to-one relationship and maps a set of similar terms into the same group. For example, football and baseball are terms in the "ball game" cluster (group). These word clusters are then used to map each training word in the classifiers.

This study defines two query types to retrieve useful word clusters from WordNet. The first type uses category names as query words to extract hyponyms. A category name usually expresses a generic concept; for example, "baseball" and "football" can be represented by the concept "sports". All hyponyms of a category name are grouped to form a cluster. In this study, 42 queries found related words, whereas the remaining eight question class names could not be found in WordNet. The second query type increases word coverage by searching for coordinate words in WordNet. Some of the keywords may appear in various forms in the test questions, such as "class", "category", and "type". Some semantically related words, such as "January" and "February", are not indexed in the training collection. Thus, the second query type considers all important related words.
These words are mainly derived from the non-stopwords in the training data and are used to retrieve coordinate terms. For example, the word "baseball" has three coordinate term clusters (i.e., "ball game", "ball", and "baseball equipment"), and these three term groups are extracted. Figs. 2-4 list the constituents of the three clusters. To measure term strength, this study uses the χ²-statistic to evaluate the importance of each word. Unlike a Type 1 query, a Type 2 query may cover several conceptual meanings; to avoid losing information, all coordinate clusters are extracted. Similar to Type 1 queries, Type 2 queries rank the top 50 words according to their χ²-statistic values to retrieve the related clusters. Using the two-way contingency table of a term t and a class c_j, the word-goodness measure is defined as follows:
\chi^2(t, c_j) = \frac{N (AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)},
where A is the number of times that term t occurs in category c_j; B is the number of times that term t occurs outside category c_j; C is the number of times that category c_j contains terms other than t; D is the number of occurrences involving neither term t nor category c_j; and N is the total number of training documents. To measure the global importance of a word in the training set, the maximum χ² score among all categories is selected for each term. The estimation function is as follows:

\chi^2(t) = \max_{j=1}^{m} \{\chi^2(t, c_j)\},
where m denotes the number of categories. We select the top-K terms with the highest χ²-statistic scores and retrieve their associated coordinate terms in WordNet. In this paper, K = 50, and 192 term clusters are extracted. Table 1 shows the statistics of the retrieved word clusters for Type 1 and Type 2 queries.
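The χ²-based term selection above can be sketched as follows, assuming the training questions are available as tokenized word lists with class labels; the function names are illustrative rather than the authors' code.

from collections import Counter, defaultdict

def chi_square_scores(questions, labels):
    """Compute chi-square(t, c_j) from the two-way contingency table and keep, for each
    term, the maximum score over the classes in which it occurs (a simplification)."""
    N = len(questions)
    docs_with_term = Counter()                 # term -> number of questions containing it
    docs_in_class = Counter(labels)            # class -> number of questions
    term_class = defaultdict(Counter)          # term -> class -> co-occurrence count
    for words, c in zip(questions, labels):
        for t in set(words):
            docs_with_term[t] += 1
            term_class[t][c] += 1

    scores = {}
    for t, per_class in term_class.items():
        best = 0.0
        for c, A in per_class.items():         # A: questions of class c containing t
            B = docs_with_term[t] - A          # B: questions of other classes containing t
            C = docs_in_class[c] - A           # C: questions of class c without t
            D = N - A - B - C                  # D: remaining questions
            denom = (A + B) * (C + D) * (A + C) * (B + D)
            if denom:
                best = max(best, N * (A * D - B * C) ** 2 / denom)
        scores[t] = best
    return scores

def top_k_terms(scores, k=50):
    # Top-K terms (K = 50 in the paper) whose coordinate terms are then looked up in WordNet.
    return sorted(scores, key=scores.get, reverse=True)[:k]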
Cluster name: ball game. Synonyms: ballgame. Terms: baseball, baseball game, ball.
Fig. 2. An overview of the "ball game" cluster in WordNet.
Fig. 3. An overview of ‘‘ball’’ cluster in WordNet.
Fig. 4. An overview of ‘‘baseball equipment’’ cluster in WordNet.
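The two WordNet query types that produce clusters such as those in Figs. 2-4 can be reconstructed roughly as below with NLTK's WordNet interface (requires nltk and the wordnet data). The exact extraction procedure used by the authors may differ; this is only a plausible sketch.

from nltk.corpus import wordnet as wn

def type1_cluster(class_name):
    """Type 1 query: collect all hyponyms of a question-class name (e.g., 'sport')."""
    terms = set()
    for synset in wn.synsets(class_name, pos=wn.NOUN):
        for hypo in synset.closure(lambda s: s.hyponyms()):
            terms.update(l.name().replace('_', ' ') for l in hypo.lemmas())
    return terms

def type2_clusters(keyword):
    """Type 2 query: coordinate-term clusters of a high-chi-square keyword, i.e. the
    other hyponyms of each of its hypernyms (cf. the 'ball game' cluster in Fig. 2)."""
    clusters = {}
    for synset in wn.synsets(keyword, pos=wn.NOUN):
        for hyper in synset.hypernyms():
            name = hyper.lemmas()[0].name().replace('_', ' ')
            clusters[name] = {l.name().replace('_', ' ')
                              for coord in hyper.hyponyms() for l in coord.lemmas()}
    return clusters

# Example: type2_clusters('baseball') yields clusters rooted at hypernyms such as
# 'ball game', in line with Figs. 2-4.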
Table 1
Features of WordNet expansion lexical terms.

                               Type 1     Type 2
Number of queries              50         50
Number of returned groups      42         192
Max. size in each group        185        272
Min. size in each group        3          1
Avg. terms in each group       32.129     30.494
By querying WordNet, we feed the auto-derived word clusters to the question classifier. The proposed question classifier is similar to the traditional vector space model of the text classifier proposed in [23] because it uses a bag of unigrams and bigrams as base features. Fig. 5 shows the vectorized representation, in which |V1|-|V4| are the vocabulary sizes of the four feature sets; for example, |V4| is the number of coordinate term clusters. All question words are mapped to the extracted hyponyms and coordinate terms. Each training example thus contains four feature types, and the classifier predicts the question type of each test question from these four feature types. This study uses the same data set as previous work to allow a direct comparison. The classifier used in this paper is the SVM [12,37], which has been successfully applied to many classification problems, including pattern recognition and text categorization. The SVM is a binary classification algorithm, but it can be extended to a multiclass SVM by adopting the one-versus-all strategy, as sketched below.
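A minimal sketch of the four-feature-type mapping and one-versus-all classification, using scikit-learn's linear SVM as a stand-in for the SVM implementation of [12,37]. The dictionaries hyponym_cluster and coord_cluster, which map a lower-cased word to a cluster-id string (or nothing), are assumed to come from the WordNet expansion above; questions are assumed to be token lists.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def question_features(tokens, hyponym_cluster, coord_cluster):
    feats = {}
    for i, w in enumerate(tokens):
        feats['uni=' + w.lower()] = 1.0                                   # feature set 1: unigrams
        if i + 1 < len(tokens):
            feats['bi=%s_%s' % (w.lower(), tokens[i + 1].lower())] = 1.0  # feature set 2: bigrams
        if hyponym_cluster.get(w.lower()):
            feats['hypo=' + hyponym_cluster[w.lower()]] = 1.0             # feature set 3: hyponym clusters
        if coord_cluster.get(w.lower()):
            feats['coord=' + coord_cluster[w.lower()]] = 1.0              # feature set 4: coordinate clusters
    return feats

def train_question_classifier(train_questions, train_labels, hypo, coord):
    vec = DictVectorizer()
    X = vec.fit_transform(question_features(q, hypo, coord) for q in train_questions)
    clf = LinearSVC()   # trains one binary SVM per class (one-versus-rest) internally
    clf.fit(X, train_labels)
    return vec, clf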
Fig. 5. An illustration of feature mapping.
2.2. Document and passage retrieval

Searching for answers in a small data set is more efficient than searching an entire corpus. A document retriever provides a document set related to the question, whereas a passage retriever segments each document into paragraphs and ranks them according to certain evaluation metrics. Finding answers in passages is much easier than searching the entire document set. Tellex et al. [33] compared several document retrievers to evaluate the performance of different IR systems, including PRISE (using a TFIDF model) and Lucene (using a Boolean model). Their results revealed no significant difference between the TFIDF and Boolean models for the QA task. Because the TREC-QA data set is quite large (about 5 GB), the Boolean model is a more efficient option for document retrieval. For efficiency, this study employed a simple Boolean-based document retrieval model and eliminated stopwords and punctuation. The input query simply consists of the complete question words. The simple Boolean retrieval model implemented in this study extracts the top 500 related documents from the test collection by ranking each document according to the number of matches between the given question and the document. A passage is formed by three consecutive sentences, overlapping the previous passage by one sentence. An example of passage retrieval is shown below. Assume a document contains k sentences:
Doc = (S_1, S_2, S_3, \ldots, S_k).

The passages are formed by aggregating three consecutive sentences:

P_1 = \{S_1, S_2, S_3\},\ P_2 = \{S_3, S_4, S_5\},\ P_3 = \{S_5, S_6, S_7\}, \ldots

The retrieved documents are then collapsed into a set of passages. The passage retrieval module ranks these passages using density-based ranking methods [15,19]; this study adopts SiteQ's density-based ranking algorithm for further ranking. Finally, the top 50 passages are stored as answer candidates for the next stage. A sketch of the sliding-window passage construction is given below.
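A small sketch of the sliding-window passage construction just described (three sentences per passage, one overlapping sentence); the sentence segmentation is assumed to be given.

def build_passages(sentences, size=3, overlap=1):
    """Group consecutive sentences into passages of `size` sentences,
    where adjacent passages share `overlap` sentences."""
    step = size - overlap   # with size=3, overlap=1: P1={S1,S2,S3}, P2={S3,S4,S5}, ...
    return [sentences[i:i + size]
            for i in range(0, max(len(sentences) - overlap, 1), step)]

# Example:
# build_passages(['S1', 'S2', 'S3', 'S4', 'S5'])
# -> [['S1', 'S2', 'S3'], ['S3', 'S4', 'S5']]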
3. Context-ranking model

Several studies have discussed how to rank answers from retrieved passages. These approaches usually require many training examples to learn a mapping function that can rank new data. Researchers in the field of Internet technology have developed various trainable answer extraction approaches over the past few years. Ittycheriah et al. [11] presented a maximum entropy learning approach that combines abundant external knowledge, including WordNet, a dependency parser, and predefined patterns. Moschitti and Quarteroni [23-25] used a trained text classifier to filter irrelevant passages, treating the task as a text categorization problem. Suzuki et al. [32] considered all contextual words of a named-entity word as a document and applied several classification algorithms to classify these "documents". Sasaki [30] recently proposed a novel "answer phrase chunking" model for Japanese question answering. Zukerman et al. [41] adopted a parser and supervised machine learning-based methods to extract syntactic features and identify potential answer sentences.

The proposed context-ranking model (CRM) re-ranks the retrieved passages to find the answers; it loads and re-ranks the passages according to the results of the question classifier. Unlike previous machine learning-based approaches (which include parsers), the proposed CRM seeks specific syntactic patterns or features bounded in the context of keywords. This model can exploit both semantic and syntactic features for each word in the context window; for example, part-of-speech (POS) tags and named-entity words can be directly modeled by the CRM. The word position is another essential feature of this model. Fig. 6 shows the overall work flow of the CRM, and Fig. 7 illustrates how the CRM encodes the contextual features. As shown in Fig. 6, the labeled answer sentences are first derived from the TREC-provided short fragments. The named-entity taggers recognize the focus terms in each input text. If a single word or a word sequence is part of a named-entity word, the CRM extracts its contextual words as features and transforms them into a set of vectors using conventional BOW mapping.
Fig. 6. Work flow of the proposed CRM model.
Fig. 7. An illustration of a focus term and its context features.
Table 2
Token feature category list.

Feature description                       Example text
1-Digit number                            3
2-Digit number                            30
4-Digit number                            2004
Year decade                               2000s
Only digits                               1234
Number contains one slash                 3/4
Number contains two slash                 2004/8/10
Number contains money                     $199
Number contains percent                   100%
Number contains hyphen                    1-2
Number contains comma                     19,999
Number contains period                    3.141
Number contains colon                     08:00
Number contains alpha and slash           1/10th
All capital word                          SVM
Capital period (only one)                 M.
Capital periods (more than one)           I.B.M.
Alpha contains money                      US$
Alpha and periods                         Mr.
Capital word                              Taiwan
Number and alpha                          F-16
Initial capitalization                    Mr., Jason
Inner capitalization                      WordNet
All lower case                            am, is, are
Others                                    3n/4
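The token-feature categories of Table 2 can be implemented as an ordered list of regular-expression tests; the patterns below cover only an illustrative subset and are not the authors' exact rules.

import re

# Ordered (pattern, category) tests for a subset of the Table 2 token classes.
# The first matching pattern wins; the patterns are illustrative only.
TOKEN_PATTERNS = [
    (re.compile(r'^\d$'),               '1-digit number'),
    (re.compile(r'^\d{2}$'),            '2-digit number'),
    (re.compile(r'^\d{4}$'),            '4-digit number'),
    (re.compile(r'^\d{4}s$'),           'year decade'),       # e.g., 1990s
    (re.compile(r'^\d+$'),              'only digits'),
    (re.compile(r'^\d+/\d+$'),          'number contains one slash'),
    (re.compile(r'^\$\d[\d,.]*$'),      'number contains money'),
    (re.compile(r'^\d[\d.]*%$'),        'number contains percent'),
    (re.compile(r'^[A-Z]+$'),           'all capital word'),
    (re.compile(r'^[A-Z]\.$'),          'capital period (only one)'),
    (re.compile(r'^(?:[A-Z]\.){2,}$'),  'capital periods (more than one)'),
    (re.compile(r'^[A-Z][a-z]+$'),      'capital word'),
    (re.compile(r'^[a-z]+$'),           'all lower case'),
]

def token_class(token):
    for pattern, category in TOKEN_PATTERNS:
        if pattern.match(token):
            return category
    return 'others'

# token_class('1990s') -> 'year decade'; token_class('I.B.M.') -> 'capital periods (more than one)'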
The input in the test phase consists of the passages (i.e., the top-K passages) retrieved by the document retriever and the passage retriever. The same modules are then used to locate focus terms, and feature extraction converts the given passage into feature vectors. The SVM test stage re-ranks the passages according to the SVM prediction score. To obtain a probability output from the SVM, the CRM employs the sigmoid function with fixed parameters A = 2 and B = 0 [26].

The proposed CRM centers on each named-entity word in the passage and extracts context features in a fixed-size window. In this study, we employed the named-entity recognizer (NER) developed in [35] to label the named-entity class of each word. This recognizer was trained on the MUC-7 NER corpus, which specifies seven named-entity classes. Fig. 7 illustrates this idea, in which the focus term is the person name "Rostropovich". In this example, we use a ±5 window size (i.e., the previous five words and the next five words) and extract the corresponding features. The CRM is a highly flexible model that enables users to append or modify feature sets obtained from different sources. Here, we denote FSi as the feature set of word i in the context. Each FSi can include predefined or auto-tagged features obtained from other sources. Section 3.1 discusses the employed feature set, Section 3.2 presents the detailed technical concerns, and Section 3.3 compares the proposed method with previous ranking algorithms.

3.1. Feature extraction

The example representation of the CRM is similar to that of word sequence classification tasks, such as POS tagging and named-entity chunking [35], which increases the ability to identify the word type. Our method extracts the contextual features surrounding a central word (e.g., Rostropovich in Fig. 7). The CRM uses external knowledge, such as part-of-speech tags and entity chunks, to enrich the word information. The central word is defined as a named-entity word, which provides more information than other terms such as "the", "a", and similar words. The extracted features are listed below.

Lexical (word form): the word form (e.g., Rostropovich in Fig. 7).
Part of speech (POS): the part-of-speech tag of the word (e.g., NP is the POS tag of Rostropovich).
Named-entity class (NE): the named-entity tag of the word (e.g., "Rostropovich" belongs to the person name class). This study replicates the NER tagger reported in [35], which achieved an 86.40% accuracy rate on the MUC-7 corpus.
Term-match degree: the match degree indicates how well the word matches one of the question terms in stem or synonym form. We defined seven match degrees (from degree = 7 to degree = 1): named-entity match, question first noun phrase match, question term exact match, stem match, synonym match, hyponym match, and hypernym match. Each word is matched sequentially in this order, and the highest degree is selected.
Token feature: the token feature identifies specific symbols and punctuation within the given token, such as "$" or "%". Each term is assigned to a token class through the predefined word category mapping. For example, the word "1990s" matches the token category "Year decade" (Table 2).
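The sketch below illustrates how a ±5-word context window around a focus term can be turned into CRM features, and how an SVM margin can be mapped to a probability with the fixed-parameter sigmoid (A = 2, B = 0) mentioned above. The feature naming and the (word, POS tag, NE tag) token structure are assumptions made for illustration, and the sigmoid's sign convention is chosen here so that a larger margin yields a higher probability, since the paper does not spell it out.

import math

def crm_context_features(tokens, focus_index, question_terms, window=5):
    """Extract position-indexed features from the +/-5 window around a focus
    (named-entity) term; each token is assumed to be a (word, pos, ne) triple."""
    feats = {}
    for offset in range(-window, window + 1):
        j = focus_index + offset
        if 0 <= j < len(tokens):
            word, pos, ne = tokens[j]
            feats['w[%+d]=%s' % (offset, word.lower())] = 1.0   # lexical feature
            feats['pos[%+d]=%s' % (offset, pos)] = 1.0          # part-of-speech feature
            feats['ne[%+d]=%s' % (offset, ne)] = 1.0            # named-entity feature
            if word.lower() in question_terms:
                feats['qmatch[%+d]' % offset] = 1.0             # simplified term-match feature
    return feats

def svm_score_to_probability(decision_value, A=2.0, B=0.0):
    """Platt-style sigmoid [26] with the fixed parameters A = 2, B = 0; the sign
    convention is an assumption (larger margin -> higher probability)."""
    return 1.0 / (1.0 + math.exp(-A * decision_value + B))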
Table 3
Description of the TREC-QA data sets.

           Number of testing questions    # of no-answer questions    NIL
TREC-8     200                            2                           0
TREC-9     693                            11                          0
TREC-10    498                            8                           49
3.2. Training and testing

The answer type of the input question prompts the CRM to re-rank the passages. A corresponding ranking model for each answer type identifies whether the input fragment (passage) is relevant to the class. The CRM produces a new ranking list by classifying the feature vectors in the passage. For passages with multiple focus terms, the best1 score serves as the passage score. Finally, the CRM chooses the top five passages as the answer list. As in question classification, the SVM is employed as the classification algorithm. The TREC-provided judgment set [34] labels the retrieved passages as +1 (answer) or -1 (non-answer). The judgment set contains a list of answer candidates that have been manually annotated by human experts. Each passage that contains the answer information is treated as a positive example, and otherwise as a negative example. Only the TREC-8 and TREC-9 question/answer pairs are fed to the SVM for training. The CRM then ranks the TREC-10 test questions using the trained model; in this case, the training phase includes no TREC-10 examples. Two passage sources are used to determine the effect of the document and passage retrievers: the first is derived from the TREC-10 judgment set, and the second is generated using the simple document and passage retrievers.

3.3. Comparison

The proposed context-ranking model focuses on weighting the named-entity terms based on their contextual clues, such as position information, named-entity words, phrase structures, and question terms. This method differs from traditional pseudo-feedback algorithms and machine learning-based ranking approaches. Xu and Croft [39] empirically showed that using passage words improves ranking performance. Learning-to-rank techniques [21] have recently achieved great success in improving initial retrieval results. However, these approaches only identify surface words strongly related to the query. The CRM allows more flexibility in selecting the feature information of interest, such as syntactic and semantic clues and their corresponding positional relationships. One appealing property of the CRM is that it provides a flexible way of integrating features such as named-entity words, phrase chunks, query words, and contextual words.

4. Experimental results

The previous sections introduced the machine learning-based question classification and answer ranking. Clearly, question classifiers can be examined alone; however, the result of the answer ranking is equivalent to the overall QA performance. Therefore, this section evaluates the proposed CRM after the question classification process. To evaluate the proposed QA system, we applied well-known benchmark corpora to our method for comparison. Question classification adopts the UIUC corpus, which consists of 5500 training questions and 500 TREC-10 test questions. The UIUC corpus has a two-level class structure in which the first level contains six coarse-grained question classes and the second level comprises 50 fine-grained question classes. The accuracy of question classification is measured by the traditional precision/recall/F-measure and the accuracy rate (i.e., the ratio of correctly assigned questions). The trained question classifier was then included in the QA system to evaluate the overall QA performance on the test questions. This approach employs TREC-8 and TREC-9 to train the CRM and TREC-10 to evaluate the QA performance.
Table 3 lists the statistics of the TREC-QA data sets. TREC-8 contains 200 questions, two of which are excluded from evaluation. In TREC-10, 49 questions have no correct answer in the given document set; for these questions, "NIL" is considered the correct response. We mainly apply the mean reciprocal rank (MRR), the standard TREC scoring metric, in which each question is scored by the reciprocal of the rank of the highest-ranked correct answer generated by the QA system. For example, the score is 1 if the first answer is correct and 1/5 if the fifth answer is correct. The MRR score is defined as follows:
\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{rank_i},
where N is the number of test questions and rank_i is the rank of the first correct answer for the ith question. Similar to text categorization tasks, the question classification was evaluated according to the ratio of correct classifications.
1 The best SVM score is that with the highest similarity between a centroid and the class model.
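For completeness, a minimal helper that computes the MRR defined above; each entry of first_correct_ranks is assumed to be the rank of the first correct answer for a question, or None when none of the top five answers is correct.

def mean_reciprocal_rank(first_correct_ranks):
    """MRR = (1/N) * sum over questions of 1/rank_i, with 0 contributed by
    questions whose top-five answers contain no correct response."""
    n = len(first_correct_ranks)
    return sum(1.0 / r for r in first_correct_ranks if r) / n if n else 0.0

# Example: ranks [1, 5, None] -> (1 + 1/5 + 0) / 3 = 0.4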
Table 4
Question classification performance (accuracy %) with different training sizes and grain sizes.

Training size    Original model        With WordNet clusters    With WordNet clusters and related words
                 Coarse     Fine       Coarse     Fine          Coarse     Fine
1000             72.60      58.80      72.40      60.40         77.80      66.60
2000             78.00      66.20      79.40      67.60         84.60      74.00
3000             86.40      79.80      88.60      82.60         91.40      85.80
4000             87.80      81.60      88.80      82.80         91.60      86.40
5500             89.40      84.40      90.00      85.60         93.00      88.60
Table 5
Recall, precision, and F-measure of the question classifier with (left) and without (right) related words.

                 With related words                           Without related words
Question class   Recall (%)  Precision (%)  F-measure (%)     Recall (%)  Precision (%)  F-measure (%)
ABBR             66.67       100.00         80.00             77.78       100.00         87.50
DESC             98.55       90.67          94.44             100.00      84.15          91.39
ENTY             82.98       93.98          88.14             71.28       98.53          82.72
HUM              96.92       92.65          94.74             96.92       87.50          91.97
LOC              93.83       90.48          92.12             92.59       87.21          89.82
NUM              93.81       97.25          95.50             88.50       97.09          92.59
AVG.             88.79       94.17          90.82             87.84       92.41          89.33
4.1. Evaluation of question classification

The question classification evaluation depends on each input question because every question corresponds to an answer type; the classifier cannot return an "out-of-class" label even if the question is difficult to categorize. This approach assumes that the predefined answer types cover all valid questions. Thus, the question classification performance directly reflects the proportion of correctly classified test data. As discussed in Section 2.1, this study adopted traditional unigrams and bigrams to represent each question and combined them with the "auto-derived" word clusters to evaluate the improvement. Table 4 shows the experimental results of the proposed question classifier for various training sample sizes, and Table 5 lists the recall, precision, and F-measure of the question classifier with and without the related words. Combining the WordNet clusters clearly leads to higher accuracy than the original method (from 84.40% to 85.60% for fine-grained and from 89.40% to 90.00% for coarse-grained classes). The prediction performance increases further when both the WordNet clusters and the related words are combined (88.60%). With approximately 3000 training questions, the question classifier achieves more than 82% accuracy.

We also compare our results with other published studies that were tested on the same corpus. Table 6 presents a summary of the results reported in previous studies. The best previous system was reported in [26] and applies numerous predefined patterns and rules; it uses parsers and named-entity taggers to identify patterns in questions. The system with human-made rules performed better than systems without such rules. Compared to the second best system (proposed by Huang et al. [9,10]), which does not include a heavy syntactic parser, the proposed method simply uses the set automatically derived from word thesauri and related words (compiled semi-automatically). The derivation of the WordNet thesaurus is slightly different: instead of using the category title name feature, that approach expands the headword information from WordNet. The proposed method is extremely fast because it does not rely on parsers [40] and can label more than 10,000 questions per second. By contrast, using parsers and named-entity taggers [17,19] requires at least 1-5 s to parse a question. The main reason for this difference is that conventional parsers and named-entity taggers must scan the entire sentence and use Viterbi/beam search to find the best parse and tag path. Therefore, the proposed method demonstrates superior efficiency.

The third system, reported in [19], not only integrates abundant external knowledge (WordNet, a human-made word list, and a comprehensive word thesaurus (CBC)) but also uses four times as much training data as before (approximately 20,000 questions). The empirical results are less directly comparable because they are based on 1000 TREC-10 and TREC-11 questions as test data, and therefore include more sources and training materials.
Table 6
Comparison with previous methods.

Method                                                Coarse-grained (%)  Fine-grained (%)  Training size (questions)  Parser/NE tagger  Additional features        Efficiency
Our method                                            93.00               88.60             5500                       No                Related words              Fast
Our method (without related words)                    90.00               85.60             5500                       No                No                         Fast
Li and Roth [19] (testing with TREC-10 and TREC-11)   N/A                 88.05             20,000                     Yes               Related words              Slow
Letsche and Berry [17]                                91.00               84.20             5500                       Yes               No                         Slow
Zhang and Lee [40]                                    90.00               80.00             5500                       Yes               No                         Slow
Hacioglu and Ward [7]                                 N/A                 82.00             5500                       No                No                         Fast
Krishnan et al. [13]                                  N/A                 86.70             5500                       Yes               CRFs                       Slow
Huang et al. [10]                                     93.40               89.20             5500                       Yes               Modified headword rules    Slow
Platt [26]                                            N/A                 89.55             5500                       Yes               Human-made rules           Slow
Quarteroni and Moschitti [27]                         91.80               N/A               5500                       Yes               SRL tagger                 Slow
Croce et al. [2]                                      94.80               87.40             5500                       Yes               LSA or WordNet             Slow
Croce et al. [2]                                      91.20               82.20             5500                       Yes               No                         Slow
Table 7
TREC-10 results for different grain sizes.

                             Passage source                     MRR value    # of misses
CRM with 50 answer types     TREC-provided judgment set         0.563        160
                             Our document/passage retrievers    0.335        252
CRM with six answer types    TREC-provided judgment set         0.554        165
                             Our document/passage retrievers    0.320        259
CRM with one answer type     TREC-provided judgment set         0.547        165
                             Our document/passage retrievers    0.305        264
Considering the same training size and the same settings (without an additional word list or clusters), the proposed method achieves highly competitive accuracy. The authors of [19] also enlarged the training data by annotating 20,000 questions; the proposed method could likewise employ more training data to improve its results. Numerous previous studies [7,27,40] that apply a full syntactic parser have failed to reach the performance of the proposed method. Adding the word list to the classifier leads to higher accuracy (87.5% for fine-grained classes), which shows that adding "human-made" words is still more effective than using "auto-derived" word clusters. However, the main limitation of this approach is that it requires domain experts and human effort. Moschitti [2,24,27] has investigated the integration of SVMs with distinct tree representations and add-in resources, yielding remarkable coarse-grained question classification accuracy. The most recent Moschitti study achieved the highest accuracy by combining the full parse tree, WordNet, and LSA-based word similarity measurements. For the finer-grained question classes, the obtained accuracy is not as high as that of Huang's approach, which also adopts full parsers. The final row of Table 6 shows the question classification results of using the parse tree kernel only. In the fine-grained setting, the proposed method performs better than the kernel methods (88.60% vs. 87.40%). Moschitti and Quarteroni also applied the tree-kernel-based SVM to answer re-ranking [24]; they conducted experiments on a small part of TREC-QA with a different answer-ranking list. Those results are difficult to compare with the current approach because the input passages vary and the results are obtained on a subset of the proposed benchmark.

These results show that the power of word clusters improves system accuracy. One advantage of this approach is that it does not require full syntactic parsing or semantic role labeling [27]. A human-created related-word list shows even higher accuracy than the full parse tree does. Combining the proposed WordNet clusters leads to state-of-the-art prediction accuracy while maintaining efficiency (10,000 questions/s), which indicates that the proposed method is suitable for online use.
4.2. Evaluation of overall question answering

Table 7 lists the experimental results of the overall QA. This study used three grain sizes of question types (class number = 1, 6, and 50) to determine the relationship between the question classifier and the overall QA result. The inputs of the CRM are derived from two sources: the TREC-provided passage list and the real document/passage retrievers. Table 7 shows that the proposed CRM achieves greater prediction power with the finer grain (class number = 50). The TREC-provided list achieves noticeably more favorable results. Recall that we retrieved only the top 200 passages from the documents.
Table 8
A selection of results published for the TREC-10 QA data set.

TREC-10 system                                        MRR value
InsightSoft-M [31]                                    0.690
Southern Methodist U.                                 0.590
Our method (with TREC-provided relevant set)          0.563
Oracle                                                0.490
U. Southern California, ISI [8]                       0.450
MultiText, U. of Waterloo                             0.460
Microsoft Research                                    0.430
IBM [11]                                              0.360
Our method (with our document/passage retrievers)     0.342
Table 9
Question classes with rare training examples.

Class name    # of training examples in TREC-8, 9    # of testing examples in TREC-10
Definition    46                                      134
Color         1                                       10
Substance     1                                       11
Currency      0                                       6
Percent       3                                       4
More than 50% of the answers came from the top 200 passages. We attempted to enlarge the number of passages from 200 to 400, but this lowered performance, probably because noise increases with the number of passages. By contrast, all correct answers are included in the TREC-provided judgment set together with only a few negative examples. This judgment set represents topline QA performance because most non-answer strings were filtered out, leaving mostly answer-bearing passages. We believe that the proposed CRM can perform as well as state-of-the-art methods when the quality of the retrieved passage set is similar to that of the judgment set.

Table 8 presents a selection of published TREC-10 results. It is important to note that a full and fair comparison of different QA approaches is nearly impossible: most of these QA approaches are equipped with a large amount of external knowledge, rules, or additional training material lacking in the current experiment. These methods combine rich sources, such as full parsers, predefined syntactic patterns, and more training material. For example, the best QA system [31] (InsightSoft-M) includes many hand-tuned, regular expression-like patterns that are combined in an ad hoc fashion for answer selection; this approach showed that handcrafted rules are useful for improving answer selection accuracy. Table 8 also shows that the MRR value of the proposed CRM-based QA is close to that of IBM's statistical QA system. The IBM QA methodology [11] requires a human-made ontology and a set of developed training question/answer instances; by contrast, the proposed method requires fewer human resources to achieve very competitive results.

The results above show that the proposed CRM did not have sufficient training data for some small categories. Table 9 lists question classes with few training instances. There is only one training example in the categories "color" and "substance", and "currency" contains no training instance from TREC-8 and TREC-9. Hence, the CRM is unlikely to handle these questions correctly at test time. In summary, approximately 900 questions were used for training, whereas the test set contains 500 questions. For the QA model, these training materials are still insufficient to achieve higher accuracy (consider, for example, the color, substance, and currency categories). Clearly, two methods can be used to improve this model: (1) employing more effective document and passage retrievers and (2) creating additional training examples for the "small" categories.
5. Conclusion

Question-answering technology often requires extensive human-made knowledge, many external sources such as parsers, and a domain-specific knowledge base. To overcome the cumbersome demand for external sources and human-made knowledge, this study presents a machine learning-based question-answering framework that classifies the input question and re-ranks the retrieved passages through the proposed CRM model. Because human-made word lists vary over time, this study derives two query types to retrieve useful related word clusters and improve the question classification accuracy. Experimental results show that, under the same settings, the proposed question classifier outperforms methods that use syntactic parsers. The accuracy rate reaches 84.4% without and 85.6% with the "auto-derived" word clusters. For the overall QA task, the CRM model was trained on approximately 900 questions from the TREC-8 and TREC-9 sets and evaluated on the TREC-10 set. The experimental results show that the CRM performed very well given the TREC-provided judgment set (0.563 MRR score),
and reached a 0.342 MRR score using the simple document and passage retrievers. However, some of the small categories lack training materials. Thus, future research should add more training data to address this problem.

References

[1] T. Arita, M. Shishibori, J.I. Aoe, An efficient algorithm for full text retrieval for multiple keywords, Information Sciences 104 (3) (1998) 345-363.
[2] D. Croce, A. Moschitti, B. Roberto, Structured lexical similarity via convolution kernels on dependency trees, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 1034-1046.
[3] H. Cui, R. Sun, K. Li, M.Y. Kan, T.S. Chua, Question answering passage retrieval using dependency relations, in: Proceedings of the 28th ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 400-407.
[4] S. Ferrández, A. Toral, Ó. Ferrández, A. Ferrández, R. Muñoz, Exploiting Wikipedia and EuroWordNet to solve cross-lingual question answering, Information Sciences 179 (20) (2009) 3473-3488.
[5] D. Feng, E. Shaw, J. Kim, E. Hovy, An intelligent discussion-bot for answering student queries in threaded discussions, in: Proceedings of the International Conference on User Interface, 2006, pp. 171-177.
[6] B. Gils, E. Proper, P. Bommel, Th.P. van der Weide, On the quality of resources on the Web: an information retrieval perspective, Information Sciences 177 (20) (2007) 4566-4597.
[7] K. Hacioglu, W. Ward, Question classification with support vector machines and error correcting codes, in: Proceedings of the Human Language Technology Conference (HLT-NAACL), 2003, pp. 28-30.
[8] E. Hovy, U. Hermjakob, C.Y. Lin, The use of external knowledge in factoid QA, in: Proceedings of the 10th Text Retrieval Conference, 2001, pp. 644-652.
[9] Z. Huang, M. Thint, A. Celikyimaz, Investigation of question classifier in answering, in: Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2009, pp. 543-550.
[10] H. Huang, M. Thint, Z.C. Qin, Question classification using head words and their hypernyms, in: Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2008, pp. 927-936.
[11] A. Ittycheriah, M. Franz, S. Roukos, IBM's statistical question answering system TREC-10, in: Proceedings of the 10th Text Retrieval Conference, 2001, pp. 258-264.
[12] T. Joachims, A statistical learning model of text classification with support vector machines, in: Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, pp. 128-136.
[13] V. Krishnan, S. Das, S. Chakrabarti, Enhanced answer type inference from questions using sequential models, in: Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2005, pp. 315-322.
[14] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in: Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001, pp. 282-289.
[15] G.G. Lee, J.Y. Seo, S.W. Lee, H.M. Jung, B.H. Cho, C.K. Lee, B.K. Kwak, J.W. Cha, D.S. Kim, J.H. An, H.S. Kim, SiteQ: engineering high performance QA system using lexico-semantic pattern matching and shallow NLP, in: Proceedings of the 10th Text Retrieval Conference, 2001, pp. 437-446.
[16] Y.S. Lee, Y.C. Wu, J.C. Yang, BVideoQA: online bilingual question answering on videos, Journal of American Society and Information System Technology 60 (3) (2009) 1-17.
[17] T. Letsche, M. Berry, Large-scale information retrieval with latent semantic indexing, Information Sciences 100 (1-4) (1997) 105-137.
[18] X. Li, D. Roth, Learning question classifiers, in: Proceedings of the 19th International Conference on Computational Linguistics, 2002, pp. 556-562.
[19] X. Li, D. Roth, Learning question classifiers: the role of semantic information, Journal of Natural Language Engineering 12 (3) (2006) 229-249.
[20] J. Lin, D. Quan, V. Sinha, K. Bakshi, D. Huynh, B. Katz, D.R. Karger, What makes a good answer? The role of context in question answering, in: Proceedings of the IFIP TC13 International Conference on Human-Computer Interaction (INTERACT), 2003, pp. 25-32.
[21] T.Y. Liu, Learning to rank for information retrieval, Foundations and Trends in Information Retrieval 3 (3) (2009) 225-331.
[22] D. Molla, R. Schewitter, F. Rinaldi, J. Dowdall, M. Hess, ExtrAns: extracting answers from technical texts, IEEE Intelligent Systems 18 (4) (2003) 12-17.
[23] A. Moschitti, Answer filtering via text categorization in question answering systems, in: Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence, 2003, pp. 241-248.
[24] A. Moschitti, S. Quarteroni, Kernels on linguistic structures for answer extraction, in: Proceedings of the 46th Conference of the Association for Computational Linguistics, 2008, pp. 113-116.
[25] A. Moschitti, S. Quarteroni, Linguistic kernels for answer re-ranking in question answering systems, Information and Processing Management 42 (6) (2010) 825-842.
[26] J. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in: Advances in Large Margin Classifiers, 1999, pp. 61-74.
[27] S. Quarteroni, A. Moschitti, S. Manandhar, R. Basili, Advanced structural representations for question classification and answer re-ranking, in: Proceedings of the 29th European Conference on Information Retrieval (ECIR), 2007, pp. 234-245.
[28] S.K. Ray, S. Singh, B.P. Joshi, A semantic approach for question classification using WordNet and Wikipedia, Pattern Recognition Letters 31 (13) (2010) 1935-1943.
[29] V. Rus, D. Moldovan, High precision logic form transformation, International Journal on Artificial Intelligence Tools 11 (3) (2002) 437-454.
[30] Y. Sasaki, Question answering as question-biased term extraction: a new approach toward multilingual QA, in: Proceedings of the 43rd Annual Meeting of the ACL, 2005, pp. 215-222.
[31] M.M. Soubbotin, Patterns of potential answer expressions as clues to the right answers, in: Proceedings of the 10th Text Retrieval Conference, 2001, pp. 293-302.
[32] J. Suzuki, Y. Sasaki, E. Maeda, SVM answer selection for open-domain question answering, in: Proceedings of the 19th International Conference on Computational Linguistics, 2002, pp. 974-980.
[33] S. Tellex, B. Katz, J.J. Lin, A. Fernandes, G. Marton, Quantitative evaluation of passage retrieval algorithms for question answering, in: Proceedings of the 26th ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 41-47.
[34] E.M. Voorhees, Overview of the TREC 2001 question answering track, in: Proceedings of the 10th Text Retrieval Conference, 2001, pp. 42-52.
[35] Y.C. Wu, T.K. Fan, Y.S. Lee, S.J. Yen, Named entities using support vector machines, Discovery in Life Science Literature 3886 (2006) 91-103.
[36] Y.C. Wu, J.C. Yang, Toward multilingual: a passage retrieval algorithm for video question answering, IEEE Transactions on Circuits and Systems for Video Technology 18 (10) (2008) 1411-1421.
[37] Y.C. Wu, Y.S. Lee, J.C. Yang, Robust and efficient multiclass SVM models for phrase pattern recognition, Pattern Recognition 41 (9) (2008) 2874-2889.
[38] Y. Wu, R. Zhang, X. Hu, H. Kashioka, Learning unsupervised SVM classifier for answer selection in Web question answering, in: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 33-41.
[39] J. Xu, W.B. Croft, Query expansion using local and global document analysis, in: Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 4-11.
[40] D. Zhang, W.S. Lee, Question classification using support vector machines, in: Proceedings of the 26th ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 26-32.
[41] I. Zukerman, P. Kowalczyk, M. Niemann, B. Raskutti, Supervised machine learning techniques for question answering, in: Proceedings of Knowledge and Reasoning for Answering Questions (KRAQ), 2005, pp. 89-96.