

Expert Systems with Applications 40 (2013) 7060–7068


Evolutionary optimization for ranking how-to questions based on user-generated contents

John Atkinson a,*, Alejandro Figueroa b, Christian Andrade a

a Department of Computer Sciences, Faculty of Engineering, Universidad de Concepcion, Concepcion, Chile
b Yahoo! Research Latin America, Av. Blanco Encalada 2120, Santiago, Chile

This research was partially supported by FONDECYT, Chile under Grant number 1130035: "An Evolutionary Computation Approach to Natural-Language Chunking for Biological Text Mining Applications".
* Corresponding author. Tel.: +56 41 2204305. E-mail addresses: [email protected] (J. Atkinson), afi[email protected] (A. Figueroa).

0957-4174/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.eswa.2013.06.017


Keywords: Community question-answering; Question-answering systems; Concept clustering; Evolutionary computation; HPSG parsing

Abstract

In this work, a new evolutionary model is proposed for ranking answers to non-factoid (how-to) questions in community question-answering platforms. The approach combines evolutionary computation techniques and clustering methods to effectively rate the best answers from web-based user-generated contents, so as to generate new rankings of answers. Discovered clusters contain semantically related triplets representing question–answer pairs in terms of subject-verb-object, which is hypothesized to improve the ranking of candidate answers. Experiments were conducted using our evolutionary model and concept clustering operating on large-scale data extracted from Yahoo! Answers. Results show the promise of the approach for effectively discovering semantically similar questions and improving the ranking as compared to state-of-the-art methods.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Traditionally, Question Answering (QA) systems aim at automatically answering natural language questions by extracting short text fragments from documents contained in a target collection (e.g., the web). Community QA (CQA) services, on the other hand, are composed of members who contribute new questions that can be answered by other members. By making good use of this synergy, members share their knowledge so as to build a valuable massive archive (i.e., a forum) of questions and answers. This rapidly growing repository yields insights and solutions to many common questions and daily problems that people may face. This collaborative paradigm has proved attractive for providing answers when topics are hard to address by usual search engines; for instance, users may be looking for opinions or for assistance from others, which is barely feasible with conventional QA systems. Unlike traditional QA systems, CQA platforms have become popular as a great opportunity to obtain answers generated by other users, rather than lists of text snippets or documents. Typically, CQA services (e.g., Yahoo! Answers) are organized in categories, which are selected by members when submitting a question. These categories can be used for finding contents on topics of interest. Thus, this kind of platform provides a mechanism where posted questions can receive several responses from multiple users, who contribute good questions and answers, rate the answers' quality (e.g., positive/negative votes, thumbs-up/thumbs-down) and post comments.

Broadly speaking, research on automatic QA systems has been conducted under the umbrella of the QA track of the Text REtrieval Conference (TREC), in which systems are targeted at news articles and some specific and restructured classes of questions (Voorhees, 2005). Despite their success, these systems differ significantly from CQA platforms: news collections are normally clean documents, whereas CQA services are built on top of noisy user-generated content. Moreover, CQA platforms aim at finding members who can provide quick and short answers to askers, whereas traditional QA systems do not offer this social ability. Furthermore, CQA platforms rely on the voluntary involvement of their members, and have proved to be more attractive for responding to complex questions (Liu et al., 2008), especially how-to questions (e.g., "How to make beer?") that appear to catch much attention (Harper et al., 2008).

Current QA systems differ from CQA services in several ways (Blooma and Kurian, 2011): (1) QA systems started by processing single-sentence factual questions (e.g., "Where is London?"), but later the focus was shifted to interactive questions (Dang et al., 2006). CQA services, in turn, are rich in multiple-sentence questions (e.g., "I am traveling to Bellevue this summer. What are tourist attractions there?"). (2) QA systems extract answers from documents, whereas in CQA platforms, answers are generated by users.


(3) QA systems typically operate on clean and valid documents, so the answer quality is often very high, whereas in CQA services, quality is unclear as it depends on the content contributed by members. (4) CQA platforms are rich in meta-data (e.g., best answers selected by askers). (5) Responses in traditional QA systems are provided immediately, whereas in CQA services, response times depend on the availability of their members.

As a consequence, the quality of posted answers in CQA services is highly variable, ranging from excellent to insulting responses. Another major problem involves the low participation rate of members, causing a small number of users to answer a large number of questions. Every day, many questions remain unanswered or are answered with delay. For example, in Yahoo! Answers nearly 15% of all submitted questions in English go unresolved, are poorly answered, or are never satisfactorily resolved (Shtok et al., 2012). This drawback might be due to users who simply do not want to answer questions (or are not experts), or to system saturation that leaves members unaware of new questions of interest to them. In order to address these issues, informing askers that posted questions are unlikely to be resolved, offering suggestions or past answers, or forwarding questions to experts may become practical solutions, as recent research has reported that at least 78% of the best answers are reusable (Liu et al., 2008). An effective CQA platform should therefore be capable of detecting similar past questions and relevant answers, and of recommending potential answerers. However, the lexical gap between new and past questions must be narrowed. Unfortunately, the question body provides both relevant and irrelevant content, and ill-formed language and heterogeneous answer styles significantly affect the quality of answers in CQA services. Hence ranking answers to complex questions plays a key role in CQA platforms. While state-of-the-art approaches to QA deal reasonably well with factoid questions, they fail to effectively provide answers to complex questions, especially considering that manner (how-to) questions are mainstream in CQA platforms (Liu et al., 2008).

This research addresses non-factoid question answering, in particular, discovering answers to procedural (how-to) questions. To this end, a QA model combining evolutionary computation techniques and clustering methods is proposed to effectively search for the best answers from web-based user-generated contents, so as to generate new rankings of answers. Our work's main claim is that combining evolutionary optimization techniques and question–answer clustering may be effective for finding semantically similar questions and thereby improving the ranking of candidate answers. Specifically, genetically discovering semantic relationships via concept clusters that contain answers may significantly increase the baseline performance of a QA ranking.

Accordingly, this paper is organized as follows: Section 2 discusses some concepts and foundations for this work, Section 3 describes a novel evolutionary optimization model which uses clustering methods for ranking answers to how-to questions, Section 4 discusses the different experiments, evaluations and results obtained using our approach, and finally, Section 5 highlights the main conclusions of this work.

2. Related work

User opinions are an important input for CQA platforms. Many features of typical CQA services, such as the best answer to a question, depend on ratings cast by the community. For example, in a popular CQA platform such as Yahoo! Answers, members vote for the best answers to their questions and can also thumb up or down each individual answer.


In terms of designing CQA platforms for these scenarios, there are four key topics: question processing (Blooma et al., 2011; Blooma and Kurian, 2012; Cao et al., 2008; Chen et al., 2012; Harper et al., 2009; Li et al., 2008; Tamura et al., 2006, 2005; Wanga et al., 2009; Yang et al., 2011), answer ranking, user participation (Li et al., 2012; Mendes Rodrigues and Milic-Frayling, 2009; Nam et al., 2009; Pal and Konstan, 2010; Pal et al., 2011; Pal et al., 2012; Rechavi and Rafaeli, 2012), and question routing (Liu and Agichtein, 2011; Liu et al., 2005; Riahi et al., 2012). In particular, research on answer ranking focuses on two key problems: answer retrieval and user-generated answer quality.

For answer retrieval, a ranking framework can retrieve correct, well-formed answers to factoid questions (Bian et al., 2008) by combining relevance, user interaction, and community feedback information. Experiments assessed several collaborative features (i.e., number of terms in a question, overlapping terms between questions and between questions and answers, and lifetime of questions and answers), showing that features indicating whether top-ranked answers were selected as "best" answers by the asker matter more than purely textual features. Similarly, other information retrieval techniques combine a translation-based language model for the question part with a query likelihood approach for the answer part (Xue et al., 2008). Furthermore, some approaches focus on learning-to-rank models that recognize effective search queries so as to fetch answers from CQA services (Figueroa and Neumann, 2013), using clicks on Yahoo! Answers pages to discriminate open-domain search queries that are expected to be questions. Overall, question-type-specific models were observed to perform better than general-purpose ranking functions, as the selected features were effective at modeling several kinds of semantic and syntactic properties of the search queries that were part of the candidate answers.

In order to measure the quality of user-generated answers, two distinct sources of predictors for high-quality answers have been identified: social and textual features (John et al., 2011). In terms of social features, the most relevant aspects are the answerer's authority and the asker's rating of the answer (Jeon et al., 2006), whereas for textual features, the number of unique words, the length of an answer, the number of misspellings, etc., are the most common aspects. Some methods have combined both kinds of predictors for automatically selecting high-quality user-generated answers (Agichtein et al., 2008). In general, the quality of these answers is based on two properties: answer features and user expertise (Suryanto et al., 2009). Accounting for user expertise (e.g., the ability to pose good questions and produce appropriate answers) was found to perform better, as some non-best answers can also be high-quality responses.

Predictors of answer quality have also been investigated by studying responses across several CQA platforms. Some studies found that paid services yield more high-quality answers than communities relying upon specific individuals to answer questions (e.g., library reference services) (Harper et al., 2008). The question's main topic was found to have a significant effect on the number of posted responses, but only a slight impact on the quality of the answers (e.g., entertainment questions obtain many low-quality responses compared to other topics).
Furthermore, question types impact answer quality: advice (how-to) questions reap the best quality, whereas factual questions reap the poorest. These studies also showed a similar behavior in terms of effort and length; in general, advice questions appeared to catch the most and best attention. Hence the idea of re-using resolved popular questions to estimate the probability of a new question being answered by past best answers has become popular (Shtok et al., 2012), even though discriminating similar or identical questions is hard: two questions may share the same wording, yet their answers may be asker-dependent (e.g., "Do you like me?") or time-dependent (e.g., "Who won the Superbowl?").


Instead of scoring answer passages, a recent method identifies past resolved questions that are similar to a new question by using a similarity measure such as the cosine, so that candidate answers are selected only from past questions having best responses (Shtok et al., 2012). These highly rated questions are then re-ranked considering the question body as additional evidence. On the other hand, extracting potential answers benefits from a two-sided Random Forest classifier that assesses the best answer of the top-rated question as a response to the new question. The classifier is enriched with surface statistics including text length, number of question marks/figures/links/stop-words, and several IDF statistics (e.g., a small number of stop-words was observed to indicate low question readability). These statistics were extended with several assorted features such as the tf-idf weighted cosine similarity between the two question titles, the two question bodies, and the answer. Some variations have included the use of topic modeling for determining similarity between new and past questions (Shtok et al., 2012): Latent Dirichlet Allocation (LDA) topics are inferred for each category, and a topic distribution is then deduced for the new question, the past questions, and the answer. The method also benefited from a dependency parser for extracting other question features such as the WH-type and the number of verbs, adjectives and nouns. Since ambiguous queries are expected to fetch documents on diverse topics, the language model of their result set will be similar to that learned from the whole corpus; accordingly, the KL-divergence between the language model of the entire collection and that of the question's result set was used.

Taxonomies for questions and answers have also been proposed (Liu et al., 2008), in which answers were produced by generating summaries based on different categorized question types. A question taxonomy is extended by adding the social category, which involves queries seeking interaction with people (Rose and Levinson, 2004). Queries are split into two types: constant and dynamic. A constant query has a fixed set of answers, whereas a dynamic query is divided into three classes: opinion, context-dependent and open. Results using this method showed a high correlation between answers and question types, indicating that constant questions were more likely to target unique factual answers, whereas open questions get a variable list of facts/methods as answers, opinion questions get subjective answers, and social questions are related to irrelevant content.

For specific and complex question types, including how-to questions, some methods score answers by using user-generated content repositories such as Yahoo! Answers (Surdeanu et al., 2008, 2011), from which a large training corpus of positive/negative question–answer pairs can be collected. A positive pair contains a question and the best answer manually selected by the asker, whereas the remaining responses are seen as negative pairs. Rankings are automatically generated by using variants of the Perceptron and SVMRank methods, both enriched with a wide range of textual and non-textual features including the similarity between questions and answers under different representations (e.g., bag-of-words and semantic roles), translation models over semantic and syntactic dependencies, density and frequency (e.g., answer span and informativeness), and web correlation statistics.
Since these models lacked user information, the effectiveness of three categories of user profile features for scoring answers to how-to questions has been examined (Zhou et al., 2012): engagement, authority and level. In addition, having a user picture was observed to contribute significantly to the ranking task, and most top contributors were found to be good at only one or two question categories. Level features were derived from the points a user earns, actions, the number of best and non-best responses, and the answer-to-question rate. Engagement features encompass the points earned in the current week and the existence of a picture, whereas authority properties included the best answer rate, a boolean element indicating whether the user is a top contributor, and the expertise rank.

3. An evolutionary model for ranking how-to answers

By using a large web-based repository of related questions and answers such as Yahoo! Answers, a novel QA model based on evolutionary optimization and clustering techniques was designed for re-ranking answers to how-to questions. Each question is represented by a Shallow Subject-Verb-Object (SSVO) triplet, which is an entity grouping a set of semantically similar questions. In order to generate new rankings of candidate answers, clusters of similar triplets are automatically generated (i.e., groups of related questions that may share semantically similar answers). Each question of the set is then assessed by using a vector representation, producing a new ranking of answers based on a performance improvement over the original ranking for the same question. Best answers for each question are grouped so as to create a best averaged triplet which, in turn, improves the clustering quality. A specially designed Genetic Algorithm (GA) iteratively searches for the best combinations of triplets based on their fitness, evaluated with standard QA metrics. The overall architecture of our optimization model can be seen in Fig. 1.

Fig. 1. Evolutionary QA model for ranking answers to how-to questions.

3.1. Evolutionary optimization for re-ranking

Our GA-based ranking approach uses SSVO triplets representing related questions. In order to search for candidate configurations of clusters of triplets, the GA (Fan et al., 2004; Trotman, 2004; Yeh et al., 2007; Tiedemann, 2007; Verberne et al., 2010; Verberne et al., 2011; Karimzadehgan et al., 2011) starts off with an initial population of chromosomes (hypotheses) and then uses genetic operators to iteratively search for and improve further candidate solutions, as seen in Algorithm 1. In each iteration, candidate clusters are assessed in terms of a fitness function representing the quality of the grouped triplets so as to generate a new ranking of answers.

3.2. Chromosome representation by shallow subject-verb-object triplets

Each pair of extracted question–answer sentences is enriched with an HPSG tree representation. By using an HPSG parser, the triplet representation for each sentence is modeled as (actor, action, object). Unlike traditional Subject-Verb-Object (SVO) representations, our triplet approach captures shallow semantic knowledge which maps all objects to their semantic head. Good examples of this include "acid" (e.g., "sulphuric acid") and "dance" (e.g., "tahitian dance"). In addition, the approach also maps common classes of subjects into their types (e.g., for "person", pronouns such as "you" and "we" can be found). Some examples of these representations can be seen in Table 1.


Table 1. A sample SSVO representation for how-to questions.

(person, make, beer):
How do i make beer that leaves no sediment in the bottle?
How do i make beer taste better?
How do you make fake beer?
How do you make green beer?
How do you make pineapple beer?
How do you make non alcoholic beer?
How do you make beer from non alcoholic beer?

(religion, affect, life):
How does religion affects life?
How does religion affects your life? pls. give some info.. if its ok?
How did religion affect the daily lives of people during the elizabethan era?

(scientists, know, undefined):
How do scientists know how many planets there are in the universe?
How can scientists know we live in the milky way?
How did scientists knew that dogs had a black and white vision?

(dogs, get, worm):
how do dogs get worms?
how do dogs get tape worms?
How do dogs get ring worms and what is it?

(women, get, pregnant):
How do women get pregnant with multiple kids at a time?
How could women get pregnant after their husbands had vasectomies?
how do women get pregnant on period??

(colleges, calculate, gpa):
How do colleges calculate your high school gpa?
How do colleges calculate your gpa?

(animals, know, undefined):
How do animals know how to evolve camouflage?
How do animals know when a storm is approaching?

Thus, a how-to question is seen as the information required on how an actor (e.g., "person") can perform a particular action (e.g., "get rid of") on a certain object (e.g., "anxiety"). An SSVO triplet may therefore embody a question's main substance, and its answers may also respond to the other questions clustered under a semantically similar SSVO triplet. In our SSVO representation, a verb is linked to the semantic head in the HPSG tree whenever the attribute "cat" of this head takes the value V. The attribute "cat" indicates the phrase symbol of the constituent (e.g., NP) or of the word (e.g., V or CONJ). For example, the value CONJ appears in questions such as "How do you know or test how much weight can be held on a fishing rod before it snaps?". As a rule, whenever this value of "cat" is not V, the arguments of the semantic head are explored in order, aiming at finding the first element for which "cat" is equal to V. Whenever this task fails (e.g., misparsed, misspelled and mismatched questions), the verb is chosen as the first V within the sentence, excluding the tokens used by the regular expression for filtering questions. Selected verbs are forced into present form by checking whether the attribute "tense" of the verb is instantiated with "present". Furthermore, the actor/subject and the object are determined by using three types of patterns (a simplified code sketch of these rules is given below):

(1) Whenever the question begins with the "how to . . ." pattern, the actor is assumed to be a person.

(2) Patterns such as "How do/does/did/would/should/could/can PRP", where PRP denotes a personal pronoun (e.g., "we", "they" and "she") in subject form; the set of pronouns is extended with a list of common expressions used as synonyms (e.g., "anyone" and "somebody") and some abbreviations, and the actor is mapped into "person". Whenever the actor cannot be associated with a person, the span of text between the end of the matching pattern and the previously determined verb is extracted; all leading determiners and possessive pronouns are then removed, and if this actor consists of several tokens, the last token is verified to be the semantic head in the HPSG tree. For example, the question "How can my hair grow faster?" produces the actor "hair" (otherwise, the actor is "undefined").

(3) The object is determined by searching from the leftmost to the rightmost argument of the main verb, excluding the subject. Here, two elements are looked for: the base form of a word tagged with a "cat" value of N, or a word whose part-of-speech tag is VBG (e.g., "thinking" in "how to avoid thinking about girls?"). For verbs having a coordination of objects, only the base form of the semantic head of the leftmost member of the coordination is considered; for verbs attached to multiple prepositional phrases, the base form of the semantic head of the first prepositional phrase is taken; otherwise the object is set to "undefined".

Accordingly, each individual (chromosome) is coded as an array of triplets, each cell containing the identifier (ID) of the corresponding generated cluster. For example, consider the following eight SSVO triplets representing coded questions:

(1) (person, cook, salmon)
(2) (person, prepare, chicken)
(3) (person, cook, duck)
(4) (person, prepare, spaghetti)
(5) (person, prepare, pasta)
(6) (person, prepare, soup)
(7) (person, cook, duck)
(8) (person, cook, squid)
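Before turning to the chromosome encoding, the following is a minimal, parser-free sketch of how such triplets could be derived from question titles. It only approximates rules (1) and (2) above with regular expressions and uses a naive object heuristic, whereas the actual system relies on the Enju HPSG parse to find semantic heads; all helper names and word lists are illustrative.

```python
import re

# Hypothetical, simplified approximation of the SSVO extraction rules;
# the real implementation uses the HPSG (Enju) parse to find semantic heads.
PERSON_PRONOUNS = {"i", "you", "we", "they", "she", "he", "one",
                   "anyone", "somebody", "someone", "u"}

HOW_TO = re.compile(r"^how\s+to\s+(?P<rest>.+)$")
HOW_PRP = re.compile(
    r"^how\s+(?:do|does|did|would|should|could|can)\s+(?P<subj>\w+)\s+(?P<rest>.+)$")

def extract_ssvo(question: str):
    """Return an (actor, action, object) triplet for a how-to question."""
    q = question.strip().rstrip("?").lower()

    m = HOW_TO.match(q)
    if m:                       # rule (1): "how to ..." implies a person actor
        actor, rest = "person", m.group("rest")
    else:
        m = HOW_PRP.match(q)    # rule (2): "How do/does/... PRP ..."
        if not m:
            return ("undefined", "undefined", "undefined")
        subj = m.group("subj")
        actor = "person" if subj in PERSON_PRONOUNS else subj
        rest = m.group("rest")

    tokens = rest.split()
    action = tokens[0] if tokens else "undefined"   # first verb-like token
    # crude stand-in for rule (3): last non-determiner token as the object head
    obj_candidates = [t for t in tokens[1:] if t not in {"a", "an", "the", "my", "your"}]
    obj = obj_candidates[-1] if obj_candidates else "undefined"
    return (actor, action, obj)

if __name__ == "__main__":
    print(extract_ssvo("How to make beer?"))        # ('person', 'make', 'beer')
    print(extract_ssvo("How do dogs get worms?"))   # ('dogs', 'get', 'worms'), no lemmatization
```

Questions whose extracted triplets coincide (up to these heuristics) would end up grouped under the same representation, as in Table 1.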

A sample chromosome encoding for six "aggregated" clusters can be seen in Fig. 2, in which each gene denotes the cluster ID the corresponding triplet belongs to. The sample individual indicates that the 5th, 7th and 8th SSVO triplets were aggregated into one (the 5th) cluster. Finally, the initial population of such chromosomes for the GA is randomly generated so as to favor a wide exploration of the search space.

3.2.1. Genetic operators

In order for our optimization strategy to search for promising candidate clusters of triplets, three genetic operators were designed:

• Mutation: this operator provides population diversity by randomly modifying some gene of a chromosome. For our model, uniform mutation was implemented so as to improve a candidate cluster configuration as the GA goes on (Langdon, 2011). For example, consider the following cluster configuration of triplets related to "prepare food"; mutating a random gene splits the second cluster, so that

[(person, cook, chicken), (person, prepare, chicken)], [(person, cook, chicken), (person, cook, turkey)], [(person, prepare, spaghetti), (person, prepare, pasta)], [(person, prepare, lasagna)]

becomes

[(person, cook, chicken), (person, prepare, chicken)], [(person, cook, chicken)], [(person, cook, turkey)], [(person, prepare, spaghetti), (person, prepare, pasta)], [(person, prepare, lasagna)]


• Crossover: recombines two parent chromosomes at some point so as to generate new offspring (Alba and Cotta, 2005; Tiedemann, 2007; Verberne et al., 2010). For our approach, a single-point crossover operator (Florez-Revuelta, 2007) was implemented. Given two parent chromosomes of length k, recombination proceeds as follows: (1) select a random ts (s in [1, k]), where k is the number of genes (SSVO triplets); (2) copy the genes in the segment [t1, ts] from the first parent to the offspring; (3) copy the genes in the segment [ts+1, tk] from the second parent to the offspring, and if any of these copied genes has the same allele as a gene of the non-copied section (second parent), copy the respective allele from the first parent instead. An example of the crossover operator is shown in Fig. 3, in which the segment [t1, t4] of the first parent, i.e., the sequence 1-2-1-3, is directly copied to the offspring. The genes of the second parent are then examined separately: since t5 has the same allele as t2 in both parents, the allele of the second parent is used; next, t6 is part of an aggregated cluster that has an element in the inherited segment of the chromosome, hence instead of copying the number "3", the respective allele ("1") is inherited from the first parent; similarly, when crossing over t7, the number "3" is copied instead of "4".

• Selection: in order to select the best individuals to be reproduced in each generation, a tournament selection method was implemented, following this procedure: (1) a set of k individuals is randomly collected for a tournament; (2) the best individual of the tournament is selected with probability p; (3) the second best individual is selected with probability p(1 - p); (4) the procedure goes on until all individuals of the tournament have been evaluated, so that the ith best individual is selected with probability p(1 - p)^(i-1). Tournaments are conducted until the number of individuals required for the next generation is reached. (A code sketch of the encoding and of these operators is given after Algorithm 1 below.)

Fig. 2. Chromosome encoding.

Algorithm 1. GA for Generating the Best Clusters Configuration
Require: L, the ranking obtained by Lucene
Require: T, the set of triplets
  C0 <- Initialize(T)
  Fitness(C0)
  k <- 0
  i <- 0
  while i < MaxGenerations and Fitness(Ci) < k do
    Sel <- Selection(Ci, L)
    Ci+1 <- Crossover(Sel)
    Ci+1 <- Mutation(Ci+1)
    Fitness(Ci+1)
    i <- i + 1
  end while
  return Best Individual in Ci
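As an illustration of the chromosome encoding and of the operators above, the following is a minimal sketch in which a chromosome is a list of cluster IDs over the SSVO triplets. The uniform mutation, the single-point crossover with allele repair, and the tournament selection are simplified readings of the descriptions above (the tie-breaking details of Fig. 3 are only approximated), so names and details are illustrative rather than the authors' implementation.

```python
import random
from typing import List

def uniform_mutation(chrom: List[int], p_m: float, n_clusters: int) -> List[int]:
    """Reassign each gene (triplet) to a random cluster ID (1..n_clusters) with prob p_m."""
    return [random.randint(1, n_clusters) if random.random() < p_m else g
            for g in chrom]

def single_point_crossover(p1: List[int], p2: List[int]) -> List[int]:
    """Copy a prefix from parent 1; take the rest from parent 2, but whenever a gene
    copied from parent 2 carries an allele already present in the inherited prefix,
    keep parent 1's allele for that gene instead (allele repair)."""
    k = len(p1)
    s = random.randint(1, k - 1)
    child = p1[:s]
    inherited = set(child)
    for i in range(s, k):
        child.append(p1[i] if p2[i] in inherited else p2[i])
    return child

def tournament_selection(pop: List[List[int]], fitnesses: List[float],
                         size: int, p: float) -> List[int]:
    """Pick `size` random individuals; select the i-th best with prob p(1-p)^(i-1)."""
    contenders = random.sample(range(len(pop)), size)
    contenders.sort(key=lambda idx: fitnesses[idx], reverse=True)
    for rank, idx in enumerate(contenders):
        if random.random() < p * (1 - p) ** rank:
            return pop[idx]
    return pop[contenders[0]]   # fall back to the tournament winner

if __name__ == "__main__":
    # Eight triplets (loosely following the running example) grouped into six clusters.
    parent1 = [1, 2, 1, 3, 5, 4, 5, 5]
    parent2 = [1, 2, 3, 3, 4, 3, 4, 2]
    child = single_point_crossover(parent1, parent2)
    child = uniform_mutation(child, p_m=0.01, n_clusters=6)
    print(child)
```

A full GA run would iterate these operators inside the loop of Algorithm 1, re-evaluating the fitness of Section 3.2.2 at each generation.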

Fig. 3. A sample cross-over operation.

3.2.2. Fitness evaluation

Once the genetic operators have been applied to each chromosome (candidate aggregated clustering), the individual's fitness is evaluated by following a two-phase process:

(1) A Centroid Vector (CV) is constructed from the training data: the centroid vector of an aggregated cluster is built by joining all best answers derived from all the respective aggregated SSVO triplets. Best answers are chosen by a voting mechanism and by the askers. In our working example, the 5th centroid vector fuses the training data from all best answers to training questions modeled by the triplets (person, prepare, pasta), (person, prepare, soup) and (person, cook, squid).

(2) CVs are used for ranking candidate answers to a set of testing questions: candidate answers are obtained by sending the respective question to the Lucene search engine (http://lucene.apache.org).

Each candidate answer is ranked according to its CV, which is determined by finding the aggregated cluster that embodies the SSVO triplet representation of the question that the candidate answer seeks to respond to. In our example, all candidate answers to any testing question modeled by the SSVO triplet (person, cook, squid) will be evaluated by the CV trained with all best answers to training questions modeled with (person, prepare, pasta), (person, prepare, soup) and (person, cook, squid). As a result, the CV yields a ranking of candidate answers. The quality of this ranking is measured as the Reciprocal Rank (RR) of the respective best answer. Thus, the fitness of a chromosome C is computed as follows:

$$\mathrm{Fitness}(C) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \left( \frac{RR(r_i)}{RR(l_i)} - 1 \right),$$

where Q is the set of testing questions, and r_i and l_i are the rankings provided by the respective CV and by Lucene, respectively.
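A compact sketch of this fitness computation is given below. It assumes the centroid vector is simply a pooled bag of words scored with cosine similarity (an assumption, since the paper does not spell out the vector weighting), and it skips questions whose baseline RR is zero to avoid division by zero; function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    shared = set(u) & set(v)
    num = sum(u[t] * v[t] for t in shared)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def centroid_vector(best_answers: Sequence[str]) -> Dict[str, float]:
    """Build a cluster's CV by pooling the terms of all its best answers (raw counts here)."""
    return dict(Counter(t for ans in best_answers for t in ans.lower().split()))

def reciprocal_rank(candidates: List[str], best_answer: str,
                    cv: Dict[str, float]) -> float:
    """RR of the known best answer after re-ranking candidates by similarity to the CV."""
    ranked = sorted(candidates,
                    key=lambda a: cosine(centroid_vector([a]), cv), reverse=True)
    return 1.0 / (ranked.index(best_answer) + 1) if best_answer in ranked else 0.0

def fitness(rr_cv: Sequence[float], rr_lucene: Sequence[float]) -> float:
    """Fitness(C) = (1/|Q|) * sum_i (RR(r_i)/RR(l_i) - 1), skipping zero baselines."""
    ratios = [(r / l - 1.0) for r, l in zip(rr_cv, rr_lucene) if l > 0]
    return sum(ratios) / len(rr_cv) if rr_cv else 0.0

if __name__ == "__main__":
    print(fitness([0.5, 1.0, 0.25], [0.25, 0.5, 0.25]))  # > 0: better than Lucene on average
```

A positive fitness therefore indicates that, on average, the CV-based re-ranking places best answers higher than the Lucene baseline does.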

The overall optimization process finishes when the GA either reaches a given number of iterations or the average fitness of the whole population reaches a target value.

4. Experiments and results

In order to assess the effectiveness of our evolutionary model for ranking how-to answers, a web-based prototype was implemented. Experiments were then conducted in order to investigate the extent to which our combined triplet clustering and evolutionary optimization can indeed create rich underlying semantic associations between similar questions so as to generate high-quality rankings of candidate answers.



4.1. Methodology

4.1.1. Question selection and correction

In order to collect and prepare a large set of how-to questions from Yahoo! Answers, a simple regular expression such as how (to|do|did|does|can|should) was used (Surdeanu et al., 2011). This matched nearly 11 million questions posted by Yahoo! Answers members from June 2006 to August 2011. Deep linguistic pre-processing was then carried out in order to correct some common grammatical and spelling errors. These corrections also aimed at improving the outcome of the HPSG parser, and included the following (a minimal code sketch of this filtering and correction step is given below):

(1) More than 35 acronyms and expressions typically found across matched questions were normalized.

(2) Askers frequently do not capitalize named entities correctly, so this type of error was corrected by extracting unigrams and bigrams from each submitted question and then checking whether their properly capitalized version exists across the respective question body and responses. Only n-grams containing neither stop-words nor punctuation signs (e.g., question marks) were considered.

(3) Additional replacements were made by using the Jazzy spell checker, which identifies misspelled words and provides likely corrections. All corrections were made subject to the following conditions:
- the question word is neither found within the question body nor contained in the answers;
- neither the question word nor the candidate replacement is a stop-word;
- each word must be longer than three characters;
- the substitution must appear across the question body and/or the responses.

(4) Some grammar errors were also corrected via the LanguageTool API, which detects assorted grammar problems based on 38 triggering rules. These included word repetitions, article-noun disagreement (e.g., "this NNS" and "a PLURAL"), missing articles, etc. These replacements were mostly performed without checking the question body and the responses.

4.1.2. Indexing and retrieving answers

A collection of documents containing all answers to how-to questions was created by using Lucene, in which each answer is treated as an independent document. Lucene was configured for lowercasing, stop-word filtering and stemming. The top-1000 hits fetched for a question are then taken as its candidate answers. The impact of our question correction strategy was assessed on the retrieval of the best answer. For this, both the original and the corrected question were submitted to Lucene. For the original question, the best answer was retrieved in only 10.13% of the questions, whereas for the corrected query the coverage of the best answer reached 10.21%, showing that information retrieval engines place an upper bound on the overall performance of QA systems. In order to improve the coverage of the best answer retrieval task, a query expansion technique based on useful expansion words was exploited (Derczynski et al., 2008), which improved the coverage to nearly 18%, that is, ca. 2 million questions. Like other approaches (Surdeanu et al., 2011; Verberne et al., 2010), this 2-million question set was only used for building our triplet corpus and then for testing our evolutionary approach. Note that if the best answer is not in the top-1000 hits, it is impossible for a re-ranker to reach good performance.

Jazzy spell checker: http://sourceforge.net/projects/jazzy
LanguageTool: http://www.languagetool.org
Lucene: http://lucene.apache.org
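The following is a minimal sketch of the question filtering and correction step described in Section 4.1.1. It uses the stated regular expression and illustrates the spell-correction conditions with a hypothetical suggest_corrections helper standing in for the Jazzy spell checker, so the helper, stop-word list and thresholds are assumptions rather than the exact pipeline.

```python
import re
from typing import Callable, Iterable, Set

# Regular expression used to select how-to questions (Section 4.1.1).
HOW_TO_FILTER = re.compile(r"\bhow\s+(to|do|did|does|can|should)\b", re.I)

STOPWORDS: Set[str] = {"the", "a", "an", "to", "of", "and", "or", "is", "it"}

def is_how_to_question(title: str) -> bool:
    return HOW_TO_FILTER.search(title) is not None

def correct_question(title: str, body: str, answers: Iterable[str],
                     suggest_corrections: Callable[[str], Iterable[str]]) -> str:
    """Apply the spell-correction conditions of Section 4.1.1, item (3).

    `suggest_corrections` is a placeholder for a Jazzy-like spell checker that
    returns candidate replacements for a possibly misspelled word."""
    context = (body + " " + " ".join(answers)).lower()
    corrected = []
    for word in title.split():
        w = word.lower()
        replacement = word
        if w not in context and w not in STOPWORDS and len(w) > 3:
            for cand in suggest_corrections(w):
                if cand not in STOPWORDS and len(cand) > 3 and cand in context:
                    replacement = cand
                    break
        corrected.append(replacement)
    return " ".join(corrected)

if __name__ == "__main__":
    fake_speller = lambda w: ["sediment"] if w == "sedimant" else []
    q = "How do i make beer with no sedimant"
    print(is_how_to_question(q))   # True
    print(correct_question(q, "brewing question",
                           ["use a filter to remove sediment"], fake_speller))
```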


4.1.3. Corpus collection

Each extracted question–answer pair was enriched with an HPSG tree representation. In order to compute these trees, the Enju HPSG parser was used, modeling each sentence as a triplet (actor, action, object). Whenever a triplet element cannot be identified by the HPSG parser, it is replaced by the placeholder "undefined", which happens very often due to complex objects rather than complex subjects: complex objects are not given directly by their names, but rather by a description (see "how to evolve camouflage" in Table 1). Note that for multi-sentence questions (Tamura et al., 2005, 2006), such as "how does religion affects your life? pls. give some info . . . if its ok?", only the first sentence matching the question pattern is taken into account in our semantic analysis. Each question was then translated into an XML format so as to be represented as SSVO triplets. Each triplet entry contains a set of elements modeled by the following items (a sketch of one such record is given after the list):

- The corrected question.
- The SSVO triplet representation (actor, action, object).
- The best answer for the question.
- The list of answers associated with the question, ranked according to a similarity analysis provided by Lucene, containing a maximum of 1000 answers per question.
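As an illustration only (the paper does not give the exact XML schema), one corpus entry could be laid out as follows; the field names and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TripletEntry:
    """One corpus record, mirroring the items listed above (field names are illustrative)."""
    corrected_question: str
    ssvo: Tuple[str, str, str]          # (actor, action, object)
    best_answer: str
    ranked_answers: List[str] = field(default_factory=list)  # up to 1000 Lucene hits

entry = TripletEntry(
    corrected_question="How do dogs get worms?",
    ssvo=("dogs", "get", "worm"),
    best_answer="Usually by ingesting larvae or fleas carrying them.",
    ranked_answers=["Usually by ingesting larvae or fleas carrying them.",
                    "Ask your vet about deworming schedules."],
)
```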

It is important to highlight that only triplets with at least five elements were used; therefore, the corpus acquired in the first step was reduced from 2 million to 1,435,000 questions involving 54,757 different triplets. This set was split into 60% to train our GA-based optimization approach and 40% (ca. 574,000 questions) to test and validate the candidate ranking.

4.1.4. Baseline

Since our experiments need to be decoupled from the performance of the retrieval step, the retrieval task over this large test set of about 574,000 questions was used as a baseline (Ko et al., 2007; Surdeanu et al., 2011; Verberne et al., 2010). For the whole set, our baseline system achieved a Mean Reciprocal Rank (MRR) of 0.089. Table 2 groups the outcomes produced by Lucene for some sample triplets (the sketch below illustrates how such a per-triplet MRR breakdown is computed). The results also indicate a high variability from one triplet to another when scoring via a general-purpose ranking function. This is a key motivation for our evolutionary approach: to investigate which semantic units require specific models, and which ones must be combined to reach the best performance. The low MRR of some samples might be due to data sparseness, so combining them with other triplets becomes a key issue to improve the performance. In addition, the undefined triplet involves 61,358 questions, scoring an MRR of 0.1149, which is higher than that obtained on the entire testing data (0.089). This increase suggests that better models are needed to rate samples belonging to other SSVO triplets.

Enju HPSG parser: http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
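To make the baseline measurement concrete, the sketch below computes MRR per SSVO triplet group from Lucene rankings, in the spirit of Table 2; the data layout (a mapping from triplet to per-question ranks of the best answer) and the toy numbers are assumed for illustration only.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]

def mean_reciprocal_rank(best_answer_ranks: List[int]) -> float:
    """MRR from 1-based ranks of the best answer; rank 0 means it was not retrieved."""
    rr = [1.0 / r for r in best_answer_ranks if r > 0]
    return sum(rr) / len(best_answer_ranks) if best_answer_ranks else 0.0

def mrr_per_triplet(ranks_by_triplet: Dict[Triplet, List[int]]) -> Dict[Triplet, float]:
    """Baseline MRR broken down by SSVO triplet (cf. Table 2)."""
    return {t: mean_reciprocal_rank(ranks) for t, ranks in ranks_by_triplet.items()}

if __name__ == "__main__":
    toy = {
        ("person", "make", "beer"): [1, 4, 0, 20],   # ranks of the best answer per question
        ("dogs", "get", "worm"): [0, 0, 500],
    }
    for triplet, score in mrr_per_triplet(toy).items():
        print(triplet, round(score, 4))
```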


Table 2. Results obtained by our baseline.

SSVO triplet | MRR | SSVO triplet | MRR
(person, write, verse) | 1.0 | (computer, know, undefined) | 0.5368
(person, unlock, stereo) | 1.0 | (animals, know, undefined) | 0.5071
(person, teach, pony) | 0.75 | (temperature, affect, solubility) | 0.3380
(America, be, different) | 0.6679 | (IRS, know, undefined) | 0.2516
(colleges, calculate, gpa) | 0.1455 | (person, connect, dvd) | 0.2063
(person, make, gun) | 0.1356 | (religion, affect, life) | 0.0060
(person, make, beer) | 0.1278 | (women, get, pregnant) | 0.0033
(undefined, undefined, undefined) | 0.1149 | (dogs, get, worm) | 0.0024
(scientists, know, undefined) | 0.0400 | (Republicans, expect, undefined) | 0.0015

Fig. 4. Gain for different mutation rates.

4.1.5. Parameter tuning

Parameters were tuned by using the MRR gain as a quality measure (i.e., fitness) as our GA evolves. The time complexity of evaluating our fitness function thus reduces to the complexity of producing the ranking, e.g., O(n log n) · O(m), where n is the length of the CV and m is the size of the ranking; the average size of each ranking was 1000 answers. Parameter adjustment was carried out by running the GA for an average of 20 initial generations. The initial population was set to 30 individuals, and two genetic operators were implemented: a single-point crossover and a uniform mutation. Final crossover and mutation probabilities were set to 90% and 1%, respectively, based on the tuning experiments. The impact of the mutation operator was assessed using probability values (Pm) ranging from 0.01 to 0.015, as seen in Fig. 4. The adjustments did not show significant quality differences across mutation rates, which might be due to the closeness of the selected values. In order to set a suitable initial population size, a random assignment over 30 individuals was assessed, as shown in Fig. 5. Results show that the MRR gain increases with the population size; however, differences in gain between sizes 50 and 70 are not significant and, furthermore, the running time of the method for populations above 70 individuals doubles that of the 50-individual population. For the crossover operator, rates between 80% and 95% were evaluated, as seen in Fig. 6; results show a clear difference in MRR gain when crossing over with a probability (Pc) of 95%. In order for the GA to converge toward good candidate solutions, the parameters were finally set as discussed above. Furthermore, different runs of the GA were tested, as seen in Fig. 7, showing significant differences between runs which may be explained by the random and unnormalized initialization of the population. The five best final clusters of each run also showed significant differences, indicating that multiple suitable triplet associations may exist. On average, 17,968 clusters were generated, containing 2.7 triplets each. However, the standard deviations of the created cluster sizes were relatively high, ranging from 2.3 to 3.4, which may be due to differences in the distribution of topics contained in the triplets; e.g., undefined triplets cover almost 60% of the data.

Fig. 5. Gain for different population sizes.

Fig. 6. Gain for different crossover rates.

Fig. 7. Gain versus GA runs.

Experiments also indicated that after 60 generations the number of clusters converges faster, which may suggest a suitable evolution time to come up with good solutions (Fig. 8). In order to validate the obtained candidate solutions, data were structured into five cluster configurations, and the results obtained using 40% of the data can be seen in Fig. 7. The average MRR gain was 64.794, which represents a reduction of 15% as compared to the gain of 75.161 obtained from the direct GA. Experiments show that our evolutionary model using triplet clustering outperforms a widely used baseline system. Unlike current methods, our model does not analyze questions in isolation but generates concept associations across automatically created clusters, which suggests that our approach can indeed link questions to similar answers using hidden knowledge in these underlying clusters.


Fig. 8. Clusters versus no. of GA iterations.

5. Conclusions and future work

In this work, a novel evolutionary computation QA model which uses clustering methods for generating the best answers to how-to questions from user-generated contents is proposed. A new representation scheme is proposed to map question–answer pairs into shallow SVO triplets, which are provided to an evolutionary optimization method that iteratively searches for the best cluster configurations over a population of candidate solutions. Specifically, a genetic algorithm (GA) based optimization process is capable of finding and improving concept cluster configurations containing triplets so as to improve the quality of answers produced from the Yahoo! Answers repository. The quality of candidate answers and rankings is measured by combining standard RR and vectorial similarity metrics, which are computed by a fitness evaluation function for each hypothesis in the GA. A new chromosome representation for the GA and effective genetic operators were designed accordingly.

Different experiments were conducted with our optimization model so as to generate good rankings of candidate answers from a large corpus extracted from Yahoo! Answers. Results using our approach to re-rank answers to how-to questions show the promise of the work compared to a baseline. These results suggest that combining evolutionary optimization methods and triplet clustering was indeed effective for finding good rankings of answers by grouping semantically similar questions.

As future work, exploiting semantic relations provided by WordNet might become a robust way to establish potential connections among SSVO triplets, thus reducing the search space. Dimensionality reduction techniques might also be applied to improve the centroid vector ranking and, as a natural consequence, the performance of our evolutionary approach. Finally, our model might benefit from learning different ranking functions and/or distance metrics for comparing triplets or aggregations of them.

References

Agichtein, E., Castillo, C., Donato, D., Gionis, A., & Mishne, G. (2008). Finding high-quality content in social media. In Proceedings of the 2008 international conference on web search and data mining, WSDM '08 (pp. 183–194). New York, NY, USA: ACM. Alba, E., & Cotta, C. (2005). Evolutionary algorithms. Chapman and Hall/CRC, pp. 130–145. Bian, J., Liu, Y., Agichtein, E., & Zha, H. (2008). Finding the right facts in the crowd: Factoid question answering over social media. In World Wide Web conference series (pp. 467–476). Blooma, M. J., Chua, A. Y. K., & Goh, D. H.-L. (2011). Quadripartite graph-based clustering of questions. In Proceedings of the 2011 eighth international conference on information technology: New generations (pp. 591–596). Washington, DC, USA: IEEE Computer Society.


Blooma, M. J., & Kurian, J. C., 2011. Research issues in community based question answering. In Proceedings of the 2011 Pacific Asia conference on information systems (PACIS). Blooma, M. J., & Kurian, J. C. (2012). Clustering similar questions in social question answering systems. In Proceedings of the 2012 Pacific Asia conference on information systems (PACIS). Cao, Y., Duan, H., Lin, C. Y., Yu, Y., & Hon, H. W. (2008). Recommending questions using the mdl-based tree cut model. In Proceedings of the 17th international conference on World Wide Web, WWW ’08 (pp. 81–90). ACM. Chen, L., Zhang, D., & Mark, L. (2012). Understanding user intent in community question answering. In Proceedings of the 21st international conference companion on World Wide Web, WWW ’12 Companion (pp. 823–828). New York, NY, USA: ACM. Dang, H. T., Lin, J., & Kelly, D. (2006). Overview of the TREC 2006 question answering track. In Proceedings of the text retrieval conference. Derczynski, L., Wang, J., Gaizauskas, R., & Greenwood, M. (2008). A data driven approach to query expansion in question answering. In The second workshop on information retrieval for question answering (IR4QA) (pp. 34–41). Fan, W., Fox, E. A., & Wu, H. (2004). The effects of fitness functions on genetic programming-based ranking discovery for web search. Journal of the American Society for Information Science and Technology, 55(7), 628–636. 5. Figueroa, A., & Neumann, G. (2013). Learning to Rank effective paraphrases from query logs for community question answering. In AAAI 2013. Florez-Revuelta, F. (2007). Specific crossover and mutation operators for a grouping problem based on interaction data in a regional science context. In Evolutionary computation, 2007, CEC 2007, IEEE Congress (pp. 378–385). Harper, F. M., Moy, D., & Konstan, J. A. (2009). Facts or friends?: Distinguishing informational and conversational questions in social q& a sites. In Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’09 (pp. 759–768). New York, NY, USA: ACM. Harper, F. M., Raban, D., Rafaeli, S., & Konstan, J. A. (2008). Predictors of answer quality in online Q&A sites. In Proceeding of the 26th annual SIGCHI conference on Human factors in computing systems, CHI ’08 (pp. 865–874). New York, NY, USA: ACM. Jeon, J., Croft, W. B., Lee, J. H., & Park, S. (2006). A framework to predict the quality of answers with non-textual features. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’06 (pp. 228–235). New York, NY, USA: ACM. John, B. M., Chua, A. Y.-K., & Goh, D. H.-L. (2011). What makes a high-quality usergenerated answer? IEEE Internet Computing, 15(1), 66–71. Karimzadehgan, M., Li, W., Ruofei, Z., & Mao, J. (2011). A stochastic learning-to-rank algorithm and its application to contextual advertising. In The 20th international World Wide Web conference (pp. 377–386). Ko, J., Wang, J., Mitamura, T., & Nyberg, E. (2007). Language-independent probabilistic answer ranking for question answering. In Proceedings of the 45th annual meeting of the association for computational linguistics (pp. 784–791). Langdon, W. B. (2011). Elementary bit string mutation landscapes. In Proceedings of the 11th workshop proceedings on Foundations of genetic algorithms, FOGA ’11 (pp. 25–42). New York, NY, USA: ACM. URL:. Li, B., Liu, Y., Ram, A., Garcia, E., & Agichtein, E. (2008). Exploring question subjectivity prediction in community QA. 
In Proceedings of the 31st annual international ACM SIGIR . . .. Li, B., Lyu, M., & King, I. (2012). Communities of Yahoo! answers and baidu zhidao: Complementing or competing? In The 2012 international joint conference on neural networks (IJCNN) (pp. 1–8). Liu, Q., & Agichtein, E. (2011). Modeling answerer behavior in collaborative question answering systems. In Advances in information retrieval (pp. 67–79). Springer. Liu, X., Croft, W. B., & Koll, M. (2005). Finding experts in community-based question-answering services. In Proceedings of the 14th ACM international conference on Information and knowledge management, CIKM ’05 (pp. 315–316). New York, NY, USA: ACM. Liu, Y., Bian, J., & Agichtein, E. (2008). Predicting information seeker satisfaction in community question answering. In Proceedings of the 31st annual international ACM SIGIR conference, SIGIR ’08 (pp. 483–490). ACM. Liu, Y., Li, S., Cao, Y., Lin, C. -Y., Han, D., & Yu, Y. (2008). Understanding and summarizing answers in community-based question answering services. In International Conference on Computational Linguistics (pp. 497–504). Mendes Rodrigues, E., & Milic-Frayling, N. (2009). Socializing or knowledge sharing? Characterizing social intent in community question answering. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM ’09 (pp. 1127–1136). New York, NY, USA: ACM. Nam, K. K., Ackerman, M. S., & Adamic, L. A. (2009). Questions in, knowledge in? A study of naver’s question answering community. In Computer human interaction (pp. 779–788). Pal, A., Farzan, R., Konstan, J. A., & Kraut, R. E. (2011). Early detection of potential experts in question answering communities. In UMAP (pp. 231–242). Pal, A., Harper, F. M., & Konstan, J. A. (2012). Exploring question selection bias to identify experts and potential experts in community question answering. ACM Transactions on Information Systems, 30(2), 10. Pal, A., & Konstan, J. A. (2010). Expert identification in community question answering: Exploring question selection bias. In CIKM (pp. 1505–1508). Rechavi, A, & Rafaeli, S. (2012). Knowledge and social networks in Yahoo! answers. In 2013 46th Hawaii international conference on system sciences (pp. 781–789). Riahi, F., Zolaktaf, Z., Shafiei, M., & Milios, E. (2012). Finding expert users in community question answering. In Proceedings of the 21st international

conference companion on World Wide Web, WWW ’12 Companion (pp. 791–798). New York, NY, USA: ACM. Rose, D. E., & Levinson, D. (2004). Understanding user goals in web search. In Proceedings of the 13th international conference on World Wide Web, WWW ’04 (pp. 13–19). New York, NY, USA: ACM. Shtok, A., Dror, G., Maarek, Y., & Szpektor, I. (2012). Learning from the past: answering new questions with past answers. In Proceedings of the 21st international conference on World Wide Web, WWW ’12 (pp. 759–768). New York, NY, USA: ACM. Surdeanu, M., Ciaramita, M., & Zaragoza, H. (2008). Learning to rank answers on large online QA collections. In Proceedings of the 46th annual meeting for the association for computational linguistics: human language technologies (ACL-08: HLT) (pp. 719–727). Surdeanu, M., Ciaramita, M., & Zaragoza, H. (2011). Learning to rank answers to non-factoid questions from web collections. In Computational linguistics (Vol. 37, pp. 351–383). Suryanto, M. A., Lim, E. P., Sun, A., & Chiang, R. H. L. (2009). Quality-aware collaborative question answering: methods and evaluation. In Proceedings of the second ACM international conference on web search and data mining, WSDM ’09. (pp. 142–151). New York, NY, USA: ACM. Tamura, A., Takamura, H., & Okumura, M. (2005). Classification of multiplesentence questions. In IJCNLP (pp. 426–437). Tamura, A., Takamura, H., & Okumura, M. (2006). Classification of multiplesentence questions. IPSJ Journal, 47(6), 1954–1962. Tiedemann, J. (2007). A comparison of genetic algorithms for optimizing linguistically informed IR in question answering. In AI⁄IA ’07 proceedings of the

10th congress of the italian association for artificial intelligence on AI⁄IA 2007: Artificial intelligence and human-oriented computing (pp. 398–409). Trotman, A. (2004). An artificial intelligence approach to information retrieval. In SIGIR ’04 proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval (pp. 603–608). Verberne, S., Boves, L., Oostdijk, N., & Coppen, P. -A. (2010). What is not in the bag of words for why - QA? (Vol. 36, pp. 229 245). Computational Linguistics, MIT Press. Verberne, S., Halteren, H., Theijssen, D., Raaijmakers, S., & Boves, L. (2011). Learning to rank for why-question answering. Information Retrieval, 14(2), 107–132. 6. Voorhees, E. M. (2005). Overview of TREC 2005. In TREC. Wanga, K., Ming, Z., & Seng Chua, T. (2009). A syntactic tree matching approach to finding similar questions in community-based QA services. In Research and development in information retrieval (pp. 187–194). Xue, X., Jeon, J., & Croft, W. B., (2008). Retrieval models for question and answer archives. In Research and development in information retrieval (pp. 475–482). Yang, L., Bao, S., Lin, Q., Wu, X., Han, D., Su, Z., et al. (2011). Analyzing and predicting not-answered questions in community-based question answering services. In AAAI. Yeh, J. -y., Lin, J. -y., Ke, H. -r., Yang, W. -p. (2007). Learning to rank for information retrieval using genetic programming. In SIGIR 2007 workshop: Learning to rank for information retrieval. Zhou, Z.-M., Lan, M., Niu, Z.-Y., & Lu, Y. (2012). Exploiting user profile information for answer ranking in CQA. In Proceedings of the 21st international conference companion on World Wide Web, WWW ’12 Companion (pp. 767–774). New York, NY, USA: ACM.
