
Computational Intelligence, Volume 28, Number 4, 2012

CONTEXTUAL LANGUAGE MODELS FOR RANKING ANSWERS TO NATURAL LANGUAGE DEFINITION QUESTIONS

ALEJANDRO FIGUEROA¹ AND JOHN ATKINSON²

¹ Yahoo! Research Latin America, Santiago, Chile
² Department of Computer Sciences, Universidad de Concepcion, Concepcion, Chile

Question-answering systems make good use of knowledge bases (KBs, e.g., Wikipedia) for responding to definition queries. Typically, systems extract relevant facts about the question from articles across KBs and then project them onto the candidate answers. However, studies have shown that the performance of this kind of method drops sharply whenever KBs supply narrow coverage. This work describes a new approach to this problem: constructing context models for scoring candidate answers. More precisely, these are statistical n-gram language models inferred from lexicalized dependency paths extracted from Wikipedia abstracts. Unlike state-of-the-art approaches, context models are created by capturing the semantics of candidate answers (e.g., “novel,” “singer,” “coach,” and “city”). This work is extended by investigating the impact on context models of extra linguistic knowledge such as part-of-speech tagging and named-entity recognition. Results showed the effectiveness of context models as n-gram lexicalized dependency paths and promising context indicators for the presence of definitions in natural language texts.

Received 25 March 2010; Revised 5 July 2011; Accepted 5 July 2011; Published online 3 May 2012

Key words: context definition models, definition questions, feature analysis, lexicalized dependency paths, statistical language models, question answering.

1. INTRODUCTION

The continuous growth and diversification of online text information on the Web represents a continuing challenge, affecting both designers and users of information-processing systems. Assisting users in finding relevant answers from large text collections is a key task. Situated at the frontier of natural language processing (NLP) and information retrieval (IR), open-domain question answering (QA) is a promising alternative to the retrieval of full-length documents. Users of QA systems specify their information needs in the form of natural-language questions, thus eliminating any artificial constraints sometimes imposed by a particular input syntax (e.g., Boolean operators). A QA system returns brief answer strings extracted from the text collection, thus taking advantage of the fact that answers to specific questions are often concentrated in small fragments of text documents. It is up to the system to analyze the content of full-length documents and identify small, important text fragments.

In particular, definition questions (i.e., “What is ...?” and “Who is ...?”) aim at a list of relevant, normally biographical, facts (i.e., nuggets) about a specific topic or concept (a.k.a. definiendum or target). In contrast to other categories of queries, this class has become especially interesting in recent years because of the growing number of such questions actually submitted to web search engines (Rose and Levinson 2004). Note that definition QA systems produce a group of sentences as a means of providing enough context, thus ensuring the readability of the answering nuggets. For instance, consider the following sentence:

Danielle Steel is an American romantic novelist and author of mainstream dramas.

Address correspondence to John Atkinson, Department of Computer Sciences, Universidad de Concepcion, Concepcion, Chile; e-mail [email protected]
This study was carried out while the first author was at the German Centre for Artificial Intelligence (DFKI), University of Saarland, Germany.
© 2012 Wiley Periodicals, Inc.

State-of-the-art strategies would try to determine whether this sentence is part of a correct answer to the question “Who is Danielle Steel” by:
(i) identifying relevant words across all candidate answers (e.g., “American” and “novelist”);
(ii) looking for articles about “Danielle Steel” extracted from KBs, and measuring the overlapping words (e.g., “American,” “romantic,” and “author”);
(iii) identifying the occurrence of some regularities indicating candidate descriptions (e.g., “Danielle Steel is an ...”).

Overall, this kind of technique relies on the knowledge collected from KBs; consequently, its performance drops considerably whenever few answer candidates are discovered and/or the coverage across KBs is limited.

This work claims that correct answers are characterized by some regularities exhibited in their respective lexicalized dependency paths (LDPs) instead of word correlations or simple frequency counts. LDPs are sequences of words syntactically and semantically connected within a sentence. These regularities are hypothesized not to belong necessarily to a set of articles about a particular definiendum. Instead, they can be found across several definitions of the same context or type (e.g., “is → novelist → romantic”). This differentiates context models from usual strategies using KB articles about the definiendum, as context models infer regularities from numerous descriptions of the same semantic category.

In order to learn these contextual regularities, sample sentences are taken from Wikipedia abstracts. Their respective LDPs are then automatically grouped based on their semantics (e.g., “coach,” “station,” and “genus”). An n-gram language model (LM) is then built for each context on top of the regularities across its LDPs. These contextual LMs are then used for rating candidate answers in a web collection based on their respective semantics. This means that each candidate sentence is connected with one context model, which is substantially different from ranking all candidate answers with respect to a set of KB articles regarding the particular definiendum.
This paper is organized as follows: Section 2 discusses the related approaches to definitional QA, Section 3 describes our context models, Section 4 shows the results produced by applying our approach, and finally Section 5 highlights the main conclusions and further work.

2. RELATED WORK

In order to discover correct answers to definition queries, some QA systems align surface patterns with sentences in a target corpus (Figueroa 2010). Surface patterns are lexico-syntactic structures that indicate descriptive content at the sentence level. Some approaches have shown that the larger the target collection, the higher the probability of aligning these patterns with sentences in the target corpus (Joho and Sanderson 2000, 2001). These definition patterns alone do not yield high accuracy; hence, matching sentences are weighted by an experimental factor and/or some statistical metrics grounded on the frequencies of terms cooccurring with the definiendum (Joho and Sanderson 2000, 2001). Simply put, highly correlated words are very likely to express various facets of the definiendum (Cui, Kan, and Xiao 2004). There are several ways of identifying these descriptive words (Yang et al. 2003; Cui et al. 2004). For instance, they can be recognized by comparing the distribution of words within the context of the definiendum with the remaining body of the document (Yang et al. 2003).

In general, the Text REtrieval Conference (TREC) provides the necessary infrastructure to assess this kind of QA task on a large collection of documents such as the AQUAINT
corpus. For example, in TREC 2003, this technique exploited these descriptive terms for expanding web search queries (Voorhees 2003), and afterward used evidence obtained from the Internet along with these descriptive terms for ranking answer candidates (F(5)-score = 0.473). In TREC 2004, some approaches combined facts taken from the top 800 documents, retrieved from the AQUAINT corpus, with evidence extracted from several KB articles about the definiendum (Cui et al. 2004). For ranking answers, this strategy benefited from the centroid vector (Xu, Licuanan, and Weischedel 2003), which is commonly composed of words that exceed the average weight score. Typically, it is computed by using a mutual information measure (MIM), and the weight of terms appearing within KB articles is augmented. The cosine similarity to the resulting centroid vector can then be used for rating candidate sentences.

Overall, this kind of method is based on the distributional hypothesis (Harris 1954; Firth 1957): candidate answers are ranked according to the degree to which their respective terms typify the definiendum, where this degree is based on MIM. Terms collected from KBs then tend to improve the performance, as they are more likely to offer a better characterization. However, this sort of strategy does not guarantee reasonable accuracy, as its performance decreases steeply whenever it finds narrow coverage across KBs (Zhang et al. 2005; Han, Song, and Rim 2006). For this reason, some systems benefit from definition patterns for discarding some unreliable hits (Cui et al. 2004). Strictly speaking, there are two categories of patterns: hard and soft definition patterns. Hard patterns consist of well-known, manually defined lexico-syntactic rules, such as copulas and appositives (Figueroa, Atkinson, and Neumann 2009); their main drawback is that they are too rigid to cover all the different ways of conveying descriptive content.
In other words, they fail to match definition sentences due to the insertion and/or deletion of tokens such as adverbs or adjectives. On the other hand, soft patterns outperform hard patterns as they model the language variations probabilistically (Cui et al. 2004; Cui, Kan, and Chua 2007), but they have been observed to miss some definitions detected by aligning hard patterns (Cui et al. 2004). Therefore, some approaches (Cui et al. 2004) have combined the centroid vector with soft and hard patterns, producing the best run in TREC 2004 with an average F(3)-score of 0.460 (Voorhees 2004).

In TREC 2006, the best run (F(3)-score = 0.250) gathered word frequency counts from web snippets returned by a search engine (Kaisser, Scheible, and Webber 2006). Candidate sentences are then rated according to weights learnt from these frequency counts and divided by their length in non-whitespace characters. This method iteratively selects the highest scored sentence and systematically decreases the weights of its respective terms. This process repeats until no unselected candidate answer exceeds an experimental threshold.

In TREC 2007, the best run ranked candidate answers by integrating three different kinds of features (Dang, Kelly, and Lin 2007; Qiu et al. 2007):

1. Attributes learnt from four distinct LMs including one inferred from web snippets and Wikipedia articles regarding the definiendum;
2. Lexical dependency relations (i.e., “punc,” “appo,” “pcomp-n,” and subject “s”);
3. The ranking value computed by an IR engine.

The redundancy removal step checks the word overlap of the next candidate answer against all previously chosen answers (Kaisser et al. 2006). The best run, however, did not consider this step, and finished with an average F(3)-score of 0.329.

Past approaches have shown that taking into account only semantic relationships (e.g., word correlations) is insufficient for ranking candidate answers as they do not necessarily
guarantee syntactical dependency. In order to deal with this, the influence of three types of features on LMs has been studied: unigrams, bigrams, and bi-terms (Chen, Zhou, and Wang 2006). In practice, this strategy took advantage of the min-Adhoc method (Srikanth and Srihari 2002) for estimating bi-term probabilities by learning attributes from the ordered centroid representation of training sentences acquired from web snippets. Centroid words comprise the 350 most frequent stemmed cooccurring terms extracted from the snippets. These snippets are downloaded by expanding the original query with a set of terms that highly cooccur with the definiendum in sentences obtained by submitting the original query and some task-specific clues (e.g., “biography”). These keywords bias the search engine in favor of hits extracted from online encyclopedias, dictionaries, and biographical websites. The acquired words are then used for forming an ordered centroid vector by retaining their original order within the training sentences (Chen et al. 2006), from which three LMs were constructed. Overall, the three models were assessed by means of the TREC 2003 definition question set, and as a result, bi-term LMs outperformed the other two models (F(5)-score = 0.531). In light of these results, one can conclude that the flexibility and relative position of the lexical terms captured shallow information about syntactic regularities across definitions.

Broadly speaking, KBs facilitate the design of heuristics-based definition QA systems capable of discovering highly relevant facts about the definiendum. However, there are, in general, four factors that make this type of approach less attractive:

1. The performance worsens whenever KBs return limited coverage or no evidence is found at all (Zhang et al. 2005; Han et al. 2006). This is basically due to the redefinition of the problem as a “more like this” task. Thus, a narrow coverage means that this “this” is not well defined, which results in a marked drop.
2. Even when this “this” is well defined by KBs, the performance of this class of definition QA system is constrained because it fails to recognize descriptions dissimilar to the contexts yielded by the respective KB articles.
3. The data sparseness that typifies KBs prevents the exploration of more discriminative features that could be used for ranking candidate answers (Qiu et al. 2007).
4. The contexts (or senses) of the definiendum are ruled by the target collection of documents, and not by the set of KBs.

Some methods (i.e., Han et al. 2006) address the definitional QA task based on two different LMs: definition and topic. The definition LM scores candidate answers based on their evidence of being definitions, regardless of their closeness to the definiendum, whereas the topic LM rates candidate answers based on the evidence of some relevant facts about the definiendum. The definition LM was constructed from the top articles regarding arbitrary definienda coming from KBs. Using this corpus, one general LM and three domain-specific LMs were created. These models were generated from an arbitrary set of articles containing 14,904 persons, 994 organizations, and 3,639 terms. Hence, definition LMs rank candidate sentences based on the interpolation of the general model and the domain model relevant to the definiendum, whereas the topic LM linearly combines the evidence extracted from three distinct kinds of sources: top-ranked documents fetched from the target collection of documents, articles originating from a set of eight trustworthy KBs, and a group of web pages downloaded from the Internet.

Several experiments were conducted to investigate the effect of each type of model and resource. First, the weights used to combine the three external sources were modified, noting that the performance increased for the TREC 2003 question set when accounting solely for KBs. Different settings accomplished the best performance for the TREC 2004
data set: 0.25 (top-five documents), 0.60 (eight KBs), and 0.15 (web pages). This reveals that the models strongly depend on the information provided by KBs for each definiendum (Zhang et al. 2005). Thus, a significant change in the values of the parameters confirms that the performance of their system goes hand in hand with the coverage given by KBs (92%–86%). Incidentally, additional experiments adjusted the parameter that weights the contribution of the general and specific definition models (Han et al. 2006). The best configuration highlighted the relevance of the definiendum type when ranking candidate answers. One interesting finding is that the word distributions among assorted domain-specific models are markedly different from each other. It is worth pointing out that the type of definiendum is labeled by using the BBN IdentiFinder (Bikel, Schwartz, and Weischedel 1999). The definiendum is relabeled as term whenever it is not tagged as person or organization.

3. LEARNING CONTEXT LANGUAGE MODELS BASED ON LDPS

This work proposes a new approach to answering definition questions, which uses LMs and LDPs. The model combines lexico-syntactic information provided by LDPs with the semantics yielded by each context (e.g., type of definiendum). Our approach claims that LDPs are effective models to typify definitions by keeping a trade-off between lexical semantics and syntactic information. The use of this linguistic construct is also a key difference between our strategy and both the unigram LMs (Han et al. 2006) and the bi-term and bigram models (Chen et al. 2006). Unlike other approaches, context models are learnt for each context, and not for each definiendum, which vaguely resembles other specific models (Han et al. 2006). To illustrate this, consider the following type of sentence:¹

CONCEPT is a ENTITY novelist and author of ENTITY.
Roughly speaking, human readers would quickly notice that this sentence is a definition of a novelist, despite the missing concept and words. This recognition is based on the existence of two LDPs: ROOT→is→novelist and novelist→author→of. The former acts as a context indicator, signaling the semantics of the sentence or the kind of definiendum being described, whereas the latter yields content that is very likely to be found across descriptions of the context novelist. Here, highly frequent directed LDPs within a particular context are hypothesized to significantly characterize the meaning when describing an instance of the corresponding context indicator.

Overall, context models are inferred, and accordingly used, at the sentence level, whereas other methods, including Chen et al. (2006) and Han et al. (2006), extract regularities from a set of articles or snippets on the definiendum extracted from KBs, and then score each sentence based on these derived regularities. In addition, training sentences are anonymized, that is, context models do not hold any relation with the original group of definienda from which they were created. A narrow coverage has been known to dramatically affect the performance (Zhang et al. 2005; Han et al. 2006); hence, this work investigates efficient strategies that ignore articles on the definiendum.

¹ The placeholder Entity denotes a sequence of entities or adjectives that starts with a capital letter.
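
As a rough illustration of how LDPs such as ROOT→is→novelist can be read off a parse, the following Python sketch enumerates directed head-to-dependent paths in a toy dependency tree for the example sentence. The tree layout, the node numbering, and the names `WORDS`, `CHILDREN`, and `directed_paths` are assumptions made for this illustration only; they are not the output of the parser used in this work. The two-to-five-node length bound is the one adopted in Section 3.2.

```python
from typing import Dict, List, Tuple

# Hypothetical dependency tree for
# "CONCEPT is a ENTITY novelist and author of ENTITY".
# Each node is an index into WORDS; CHILDREN maps a head to its dependents.
WORDS: List[str] = ["ROOT", "CONCEPT", "is", "a", "ENTITY",
                    "novelist", "and", "author", "of", "ENTITY"]
CHILDREN: Dict[int, List[int]] = {
    0: [2],           # ROOT -> is
    2: [1, 5],        # is -> CONCEPT, novelist
    5: [3, 4, 6, 7],  # novelist -> a, ENTITY, and, author
    7: [8],           # author -> of
    8: [9],           # of -> ENTITY
}


def directed_paths(words: List[str],
                   children: Dict[int, List[int]],
                   min_len: int = 2,
                   max_len: int = 5) -> List[Tuple[str, ...]]:
    """Return every head-to-dependent path with min_len..max_len nodes."""
    paths: List[Tuple[str, ...]] = []

    def extend(node: int, prefix: Tuple[str, ...]) -> None:
        path = prefix + (words[node],)
        if len(path) >= min_len:
            paths.append(path)
        if len(path) < max_len:
            for child in children.get(node, []):
                extend(child, path)

    # Every node may start a path; direction always goes from head to dependent.
    for start in range(len(words)):
        extend(start, ())
    return paths


if __name__ == "__main__":
    for path in directed_paths(WORDS, CHILDREN):
        print("→".join(path))
```

Under these assumptions, the enumerated paths include both LDPs mentioned above, ROOT→is→novelist and novelist→author→of.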


3.1. Building a Treebank of Contextual Definitions

In our model, definition sentences are used to automatically generate a treebank of LDPs. These trees are then clustered according to their context indicators, and contextual n-gram LMs are then constructed on top of these LDPs. The following is a breakdown of the steps taken for building our contextual treebank of definitions:

1. Abstracts were taken from the January 2008 snapshot of Wikipedia. All wiki annotations were removed, and heuristics were used for removing undesirable pages (abstracts).
2. A named-entity recognizer (NER) tool² was employed to recognize named entities across all abstracts. The following classes were accordingly replaced with a placeholder (ENTITY): person, date, and location. This reduces the sparseness of the data and thus yields more reliable frequency counts. Furthermore, numbers and sequences of words that start with a capital letter (e.g., some adjectives) were mapped into the same placeholder. Any of these substitutions was applied to an “entity” that has overlapping words with the title of the article, and sequences of this placeholder are merged into a sole instance.
3. Sentences are identified by means of JavaRap.³
4. Definition surface patterns based on Figueroa et al. (2009) are applied so that only matched sentences are considered in the following steps; pronouns are acceptable values of the slot corresponding to the definiendum in the set of (hard) definition patterns.
5. Sentences are “anonymized” so that the definiendum is replaced with a placeholder (CONCEPT). This is due to the fact that some sentences do not exactly match the predefined set of patterns. For example, consider the following group of sentences:

   In 1776, he (John Edgar) was the commander of a British ship in the Great Lakes.
   From 1990 to 1998, she (Monika Griefahn) was a minister in Lower Saxony.
   Since 1998, she has been a member of the German Bundestag.
   Currently, he (Joseph Pairin Kitingan) is the Deputy Chief Minister and Minister of Rural Development of Sabah and holds the post since March 2004.

   A group of these underlined expressions, which precede the concept, was collected from the sentences matching the definition patterns. A set of templates was made out of these expressions by substituting numbers, possessive pronouns, and capitalized words with a placeholder. The first personal pronoun was seen as the end of the template. The 4,259 most frequent templates were kept, and every time any of these templates matched a training sentence, the corresponding piece of text was replaced by CONCEPT. Most of these templates involved discourse markers, phrases that temporally anchor the sentence, and cataphoras.
6. Preprocessed sentences are parsed by using a lexicalized dependency parser,⁴ and the extracted lexical trees are used for building a treebank of lexicalized definition sentences.

² nlp.stanford.edu/software/CRF-NER.shtml.
³ wing.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html.
⁴ nlp.stanford.edu/software/lex-parser.shtml.

Overall, the source treebank contains trees for 1,900,642 different sentences, from which candidate context indicators are identified. These involve words that indicate what is being


TABLE 1. Some of the Most Frequent Context Indicators Based on log10 of Their Frequencies (note: P(cs) × 10^4).

Indicator     P(cs)   Indicator     P(cs)   Indicator     P(cs)   Indicator     P(cs)   Indicator     P(cs)
born          1.503   company       1.328   character     1.294   novel         1.239   district      1.205
album         1.460   game          1.319   actor         1.288   center        1.239   leader        1.205
member        1.450   organization  1.318   city          1.285   artist        1.234   team          1.199
player        1.383   band          1.317   writer        1.273   singer        1.233   club          1.190
film          1.373   song          1.316   species       1.249   director      1.229   episode       1.190
town          1.372   author        1.316   footballer    1.245   community     1.219   title         1.184
school        1.352   term          1.314   area          1.244   program       1.215   used          1.184
village       1.350   series        1.313   book          1.243   known         1.214   officer       1.183
station       1.344   politician    1.300   genus         1.240   site          1.210   single        1.178
son           1.334   group         1.297   actress       1.239   professor     1.210   coach         1.174

defined or what sort of descriptive information is being expressed. Context indicators are recognized by walking through the dependency tree starting from the root node, and only sentences that match definition patterns and start with the placeholder CONCEPT are taken into account. Some regularities are useful for finding the corresponding context indicators. The root node itself is taken as the context indicator unless it is a word used in the surface patterns (e.g., is, was, and are), in which case the method walks down the hierarchy. In the case that the root has several children, the first child is interpreted as a context indicator. Note that the method must sometimes go down one more level in the tree, depending on the expression holding the relationship between nodes (e.g., “part/kind/sort/type/class/first of”). Furthermore, the lexical parser outputs trees that meet the projection constraint; consequently, the order of the sentence is preserved.

As a result, 45,698 distinct context indicators were obtained during parsing. Table 1 shows the most frequent indicators acquired by our method, where P(cs) is the probability (multiplied by 10^4) of finding a sentence triggered by the context indicator cs within the treebank. Candidate sentences are then automatically grouped according to the obtained context indicators. Nevertheless, definition patterns (“CONCEPT [which|that|who]”) may cause the indicator to be a verb, but in practice, it works like the nouns shown in Table 1. The following sentences illustrate this similarity:

CONCEPT which is located in Entity.
CONCEPT that was published in Entity by the Entity.
CONCEPT who won the Entity in Entity with a portrait of Entity.

Highly frequent directed LDPs within a particular context are claimed to significantly typify the meaning when describing an instance of the corresponding context indicator.
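
The tree walk used to locate a context indicator, as described earlier in this section, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the toy trees, the function name `context_indicator`, and the rule of skipping the CONCEPT placeholder when choosing the first child are illustrative choices that the text does not confirm. The pattern words and the “kind of” style expressions are the examples listed in the text.

```python
from typing import Dict, List, Optional

PATTERN_WORDS = {"is", "was", "are"}  # surface-pattern words named in the text
CATEGORY_WORDS = {"part", "kind", "sort", "type", "class", "first"}


def context_indicator(words: List[str],
                      children: Dict[int, List[int]],
                      root: int) -> Optional[str]:
    """Walk down from the root until a word that can act as an indicator."""
    node = root
    # Descend while the current node is a surface-pattern word such as "is".
    while words[node] in PATTERN_WORDS:
        # Assumption: the CONCEPT placeholder is never itself an indicator.
        candidates = [c for c in children.get(node, [])
                      if words[c] != "CONCEPT"]
        if not candidates:
            return None
        node = candidates[0]  # the first child is taken as the indicator
    # "kind of X": go one level further, through "of" to its first dependent.
    if words[node] in CATEGORY_WORDS:
        for child in children.get(node, []):
            if words[child] == "of" and children.get(child):
                return words[children[child][0]]
    return words[node]


# Hypothetical tree for "CONCEPT is a ENTITY novelist and author of ENTITY".
NOVELIST_WORDS = ["CONCEPT", "is", "a", "ENTITY", "novelist",
                  "and", "author", "of", "ENTITY"]
NOVELIST_TREE = {1: [0, 4], 4: [2, 3, 5, 6], 6: [7], 7: [8]}

# Hypothetical tree for "CONCEPT is a kind of fish".
KIND_WORDS = ["CONCEPT", "is", "a", "kind", "of", "fish"]
KIND_TREE = {1: [0, 3], 3: [2, 4], 4: [5]}

if __name__ == "__main__":
    print(context_indicator(NOVELIST_WORDS, NOVELIST_TREE, root=1))
    print(context_indicator(KIND_WORDS, KIND_TREE, root=1))
```

When the root is not a pattern word (for instance, a verb such as “located” in “CONCEPT which is located in Entity”), the sketch returns the root itself, which mirrors the verb indicators discussed above.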
This claim is strongly based on the extended distributional hypothesis (Lin and Pantel 2001): if two paths tend to occur in similar contexts, their meanings tend to be similar. In addition, the relationship between two entities within a sentence is almost exclusively concentrated in the shortest path between the two entities in the undirected version of the dependency graph (Bunescu and Mooney 2005). Thus, one entity can be interpreted as the definiendum, and the other can be any entity within the sentence. Paths linking a particular type of definiendum with a class of entity relevant to its type will therefore be highly frequent in the context (e.g., novelist→author→of→Entity). Note that using paths reduces the effect of determining the
exact category of the entity. For instance, the entity in the previous path will be a book because the linked sequence of words indeed indicates this; however, some paths can still be ambiguous (e.g., born→Entity→in).

Finally, a small random sample of sentences in the treebank (1,162 out of 1,900,642) was manually checked to estimate the number of incorrectly annotated samples (false positives). In short, only 4.73% of the selected sentences were judged to be spurious descriptions.

3.2. Learning Context Language Models

For each context, all directed paths bearing two to five nodes are extracted. Longer paths are not taken into consideration as they are likely to indicate weaker syntactic/semantic relations. Path directions are seen as essential syntactical information regarding word order when going up the dependency tree. Otherwise, undirected graphs would lead to a significant increase in the number of paths, as a path might go from any node to any other node. Some illustrative directed paths obtained from the treebank for the context indicator author are shown below:

author→awarded→for→Entity
author→chairman→former
author→co-author→of→Entity→bestseller
author→contributed→to→Entity→journal
author→founder→of→Entity→movement
characterized→for→period→the
editor→at→Entity.

From these obtained LDPs, an n-gram statistical LM is built to estimate the most relevant LDPs for each context. The probability of an LDP $\vec{dp}$ in a context $c_s$ is defined by the likelihood of the dependency links that compose the path in the context indicator $c_s$, with each link probability conditional on the last $n - 1$ linked words (out of $l$ words in the path):

$$P(\vec{dp} \mid c_s) \approx \prod_{i=1}^{l} P\left(w_i \mid c_s, w_{i-n+1}^{i-1}\right), \qquad (1)$$

where $P(w_i \mid c_s, w_{i-n+1}^{i-1})$ is the probability of word $w_i$ being linked with the previous word $w_{i-1}$ after seeing the LDP $w_{i-n+1} \ldots w_{i-1}$. In simple words, this is the likelihood that $w_i$ is a dependent node of $w_{i-1}$, and $w_{i-2}$ is the head of $w_{i-1}$, and so forth. The probabilities $P(w_i \mid c_s, w_{i-n+1}^{i-1})$ are usually computed using the maximum likelihood estimate:

$$P\left(w_i \mid c_s, w_{i-n+1}^{i-1}\right) = \frac{\mathrm{count}\left(c_s, w_{i-n+1}^{i}\right)}{\mathrm{count}\left(c_s, w_{i-n+1}^{i-1}\right)}. \qquad (2)$$

However, when using LDPs, the count $\mathrm{count}(c_s, w_{i-n+1}^{i})$ can frequently be greater than $\mathrm{count}(c_s, w_{i-n+1}^{i-1})$. For example, in the definition sentence “CONCEPT is a band formed in Entity in Entity.” the word formed is the head of two instances of “in”; hence, the denominator of $P(w_i \mid c_s, w_{i-n+1}^{i-1})$ is the number of times $w_{i-1}$ is the head of a word (after looking at $w_{i-n+1}^{i-1}$).

The obtained n-gram LM is smoothed by interpolating with shorter LDPs (Chen and Goodman 1996; Zhai and Lafferty 2004). The probability of a path is accordingly computed as shown in equation (1) by accounting for the recursive interpolated probabilities instead of raw P. Note also that $\lambda_{c_s, w_{i-n+1}^{i-1}}$ is computed for each context $c_s$ as described in Chen and Goodman (1996). A candidate sentence A is ranked according to its likelihood of being a definition as follows:

$$\mathrm{rank}(A) = P(c_s) \sum_{\forall \vec{dp} \in A} P(\vec{dp} \mid c_s). \qquad (3)$$
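
To make equations (1) to (3) concrete, the following Python sketch builds a small path LM for a single context and scores a candidate sentence with it. It is a simplified illustration, not the implementation used in this work: the class and function names are hypothetical, the interpolation uses one fixed weight `lam` rather than the per-history weights of Chen and Goodman (1996), and counts are collected from whole training paths without any special handling of overlapping paths. The denominator of equation (2) is taken as the number of times a history heads some word, as explained above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Path = Tuple[str, ...]


class ContextPathLM:
    """n-gram LM over dependency paths for one context indicator."""

    def __init__(self, n: int = 3, lam: float = 0.5) -> None:
        self.n = n
        self.lam = lam  # assumption: a single fixed interpolation weight
        self.ngram: Dict[Path, int] = defaultdict(int)  # count(history + word)
        self.heads: Dict[Path, int] = defaultdict(int)  # times history heads a word

    def train(self, paths: Iterable[Sequence[str]]) -> None:
        for path in paths:
            for i, word in enumerate(path):
                # Count every order from unigram up to n for this link.
                for order in range(1, min(self.n, i + 1) + 1):
                    history = tuple(path[i - order + 1:i])
                    self.ngram[history + (word,)] += 1
                    self.heads[history] += 1

    def word_prob(self, word: str, history: Sequence[str]) -> float:
        """Interpolated P(word | history), recursing on shorter histories."""
        history = tuple(history)
        denom = self.heads.get(history, 0)
        mle = self.ngram.get(history + (word,), 0) / denom if denom else 0.0
        if not history:
            return mle
        shorter = self.word_prob(word, history[1:])
        return self.lam * mle + (1 - self.lam) * shorter

    def path_prob(self, path: Sequence[str]) -> float:
        """Equation (1): product of link probabilities along the path."""
        prob = 1.0
        for i, word in enumerate(path):
            history = path[max(0, i - self.n + 1):i]
            prob *= self.word_prob(word, history)
        return prob


def rank(paths: Iterable[Sequence[str]],
         lm: ContextPathLM,
         context_prior: float) -> float:
    """Equation (3) over the distinct paths of one candidate sentence."""
    unique = {tuple(p) for p in paths}
    return context_prior * sum(lm.path_prob(p) for p in unique)


# Training paths for the context "author", taken from the examples above.
TRAIN_PATHS = [
    ("author", "awarded", "for", "Entity"),
    ("author", "chairman", "former"),
    ("author", "co-author", "of", "Entity", "bestseller"),
    ("author", "founder", "of", "Entity", "movement"),
]

if __name__ == "__main__":
    lm = ContextPathLM(n=3, lam=0.5)
    lm.train(TRAIN_PATHS)
    print(lm.path_prob(("author", "founder", "of", "Entity")))
    print(rank([("author", "founder", "of", "Entity")], lm, context_prior=0.01))
```

In this sketch a path seen during training receives a much higher probability than an unseen variant such as author→former→of→Entity, which only collects mass from the lower-order estimates.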

In order to avoid counting redundant LDPs, only paths ending with a leaf node are taken into account, whereas duplicate paths are discarded. In general, there are key differences between our context LMs and those used by other approaches, such as Chen et al. (2006) and Han et al. (2006). First, our context LMs take advantage of LDPs as attributes, whereas others use bi-terms, bigrams and unigrams, and unigrams only (Chen et al. 2006; Han et al. 2006). Second, our context LMs distinguish a definiendum type per candidate sentence, thus making it possible to benefit from different models accordingly, avoiding the use of external tools for detecting the definiendum type. Third, unlike some systems that created three specific models (Han et al. 2006), over 40,000 distinct models were generated for our approach. Finally, some approaches built topic LMs from web snippets or KBs (Chen et al. 2006; Han et al. 2006), whereas our context LMs are constructed exclusively from anonymized sentences extracted from Wikipedia . Unlike other data sources, the use of Wikipedia is motivated by the fact that it is a huge KB that can be assumed to be reliable. It indeed provides a huge corpus to learn from. In addition, the number of grams (n) commonly varies between one and three. However, longer n-grams were taken into consideration to reward some relevant paths that can establish relations with other entities in candidate sentences. Specifically, more than 1,200 paths of length four and five were found to get involved with more than 50 different context indicators. Some important tetragrams include: album→released→in→Entity and built→between→Entity→Entity. Longer paths were also observed to be more likely weaker relations. Nevertheless, some relevant pentagrams can still indicate a relationship between the definiendum and a pair of entities. Some of these pentagrams include: born→in→Entity →in→Entity, written→by→Entity→in→Entity, and professor →at→Entity→of→Entity. 3.3. 
Ranking Candidate Answers Our definition QA system discovers answers from web snippets. First, sentences are detected by performing truncations via JavaRap.5 Second, sentences matching definition surface patterns are interpreted as candidate answers, thus selected and forced on starting with the placeholder CONCEPT. Third, candidate answers are then parsed to get the corresponding LDPs. Last, given a set of test sentences/dependency trees (T ) extracted from the document snippets, our approach discriminates answers to definition questions by iteratively selecting sentences using the algorithm 1. The procedure first sets φ which keeps the LDP belonging to previously selected sentences (line 1). Next, context indicators for each candidate sentence are extracted so as to build an histogram indHist from a treebank T which contains subtrees ti (line 2). Because highly frequent context indicators constitute more reliable definiendum types, the method favors candidate answers based on their context indicator frequencies (line 3). Sentences matching the current context indicator are rated by computing values of equation (3) (lines 7 and 8). However, only paths dp in ti − φ are taken into consideration, while computing equation (3). Sentences are consequently ranked in congruence with their novel paths in

5 http://www.comp.nus.edu.sg/qiul/NLPTools/JavaRAP.html.

relation to the previously selected sentences, while at the same time, sentences carrying redundant information systematically decrease their ranking value. The highest scored sentence is selected after each iteration (lines 9–11), and its corresponding LDPs are added to φ (line 18). If the highest ranked sentence meets the halting conditions (line 14), the extraction task finishes. Halting conditions ensure that there are no more sentences left, or that there are no more candidate answers bearing novel and trustworthy descriptive information.

Algorithm 1. The Strategy for Answer Extraction

     1  φ = ∅;
     2  indHis = getContextIndicatorsHistogram(T);
     3  for highest to lowest frequent ι ∈ indHis do
     4      while true do
     5          nextSS = null;
     6          forall the ti ∈ T do
     7              if indHis(ti) == ι then
     8                  rank = rank(ti, φ);
     9                  if nextSS == null or rank > rank(nextSS, φ) then
    10                      nextSS = ti;
    11                  end
    12              end
    13          end
    14          if nextSS == null or rank(nextSS, φ) ≤ 0.005 then
    15              break;
    16          end
    17          print nextSS;
    18          addPaths(nextSS, φ);
    19      end
    20  end
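The selection loop of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `extract_answers`, the representation of a candidate as a (context indicator, set of paths) pair, and the `score_paths` callable standing in for equation (3) are all assumptions made for illustration. The sketch further assumes that `score_paths` returns a value at or below the threshold for an empty set of paths, which is what makes each inner loop terminate.

```python
from collections import Counter


def extract_answers(candidates, score_paths, threshold=0.005):
    """Greedy answer selection in the spirit of Algorithm 1.

    candidates  -- list of (context_indicator, frozenset_of_paths) pairs
    score_paths -- hypothetical stand-in for equation (3); it receives only
                   the paths of a candidate that are not yet covered
    """
    seen_paths = set()                                   # phi (line 1)
    histogram = Counter(ind for ind, _ in candidates)    # line 2
    output = []
    for indicator, _ in histogram.most_common():         # line 3: most frequent first
        while True:                                      # line 4
            best, best_score = None, 0.0
            for ind, paths in candidates:                # line 6
                if ind != indicator:
                    continue
                # Only novel paths (t_i minus phi) contribute to the score.
                score = score_paths(paths - seen_paths)
                if best is None or score > best_score:   # lines 9-11
                    best, best_score = paths, score
            if best is None or best_score <= threshold:  # line 14: halting condition
                break
            output.append((indicator, best, best_score)) # line 17
            seen_paths |= best                           # line 18
    return output


# Toy example: made-up path weights, not values from the paper.
WEIGHTS = {"band->formed->in->Entity": 0.4, "band->best": 0.1,
           "singer->born->in->Entity": 0.3}


def toy_score(paths):
    return sum(WEIGHTS.get(p, 0.0) for p in paths)


CANDIDATES = [
    ("band", frozenset({"band->formed->in->Entity"})),
    ("band", frozenset({"band->formed->in->Entity", "band->best"})),
    ("singer", frozenset({"singer->born->in->Entity"})),
]

if __name__ == "__main__":
    for indicator, paths, score in extract_answers(CANDIDATES, toy_score):
        print(indicator, sorted(paths), round(score, 3))
```

In the toy run, the first "band" sentence is never selected, because all of its paths are already covered once the second one has been chosen. This mirrors how redundant sentences lose their ranking value.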

Unlike other strategies which control the overlap at the word level (Hildebrandt, Katz, and Lin 2004; Chen et al. 2006; Han et al. 2006; Kaisser et al. 2006), the basic unit of this answer extractor is the LDP, that is, a group of related words. Therefore, if a word appears in two different contexts, then the extractor interprets it differently. More specifically, some approaches (Chen et al. 2006) measure the cosine similarity of a new candidate answer to each of the previously chosen answers. A threshold acted then as the referee for determining whether the new candidate answer was similar to any of these previously selected answers so as to have it incorporated into the final answer. In addition, the approach also benefited from WordNet for removing candidate answers that share the synset with any of the already selected answers. On the other hand, our extraction procedure is in the spirit of the rating strategy for the best system in TREC 2006 (Kaisser et al. 2006; Schlaefer et al. 2007). When ranking, it benefits from redundancy by removing the contribution of the redundant content. Thus, candidate sentences become less important as long as their overlap with all previously selected sentences becomes larger. 4. EXPERIMENTS AND RESULTS In order to assess our initial hypothesis, a prototype of our model was built and assessed by using 189 definition questions extracted from TREC 2003-2004-2005 tracks. Because our

model extracts answers from the Internet, these TREC data sets were only used as reference question sets. To boost the Recall of descriptive phrases that match definition patterns, a search strategy exploited for QA purposes was employed (Figueroa et al. 2009). This method submits 10 queries, each aimed at 30 web snippets, and hence, for each question, a maximum of 300 web snippets is retrieved. These snippets, including those mismatching patterns, were manually inspected to create a gold standard. Note that there was no descriptive information for 11 questions corresponding to the TREC 2005 data set. In our experiments, the MSN Search engine was used as an interface to the Web, and our prototype and the three baselines were provided with the same group of input sentences.

4.1. Evaluation Metrics

In this work, two metrics commonly used to assess QA performance were used: F-score and mean average precision (MAP), where the standard F-score (Voorhees 2003) is computed as follows:

\[ F_\beta = \frac{(\beta^2 + 1) \times P \times R}{\beta^2 \times P + R}. \]

This takes advantage of a factor β for balancing the length of the output and the amount of relevant and diverse information it carries. In early TREC tracks, β was set to 5, but as it was inclined to favor long answers, it was later decreased to 3. The precision (P) and the recall (R) metrics were computed as described in the most recent evaluation by using uniform weights for the nuggets in the gold standard obtained from web snippets (Lin and Demner-Fushman 2006). One of the disadvantages of the F-score is that it does not account for the order of the nuggets within the output. This is a key issue whenever definition QA systems output sentences, as it is also necessary to assess the ranking order, that is, to determine whether the highest positions of the ranking contain descriptive information.
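As a small illustration of the F-score formula above, the following Python sketch computes F(β) from a precision and a recall value that are assumed to be already available. The function name `f_beta` and the convention of returning zero when both inputs are zero are assumptions of this sketch. The nugget-based computation of P and R used in the TREC evaluations is not reproduced here.

```python
def f_beta(precision, recall, beta=3.0):
    """F-score with weighting factor beta (beta = 3 as in later TREC tracks)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Assumed convention: no precision and no recall gives a score of zero.
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


if __name__ == "__main__":
    # With beta = 3, recall dominates: high recall with modest precision
    # scores better than the reverse.
    print(round(f_beta(0.3, 0.6), 3))
    print(round(f_beta(0.6, 0.3), 3))
```

Note that averaged scores reported per question set are averages of per-question values, so applying this function to averaged P and R will not in general reproduce an averaged F(3)-score.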
In order to deal with this, the MAP metric (Manning, Raghavan, and Schütze 2008) was used to measure the Precision at fixed low levels of results such as MAP-1 and MAP-5 sentences. Hence, this metric is referred to as precision at k as follows:

\[ \mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \text{Precision at } k, \]

where Q is a question set (e.g., TREC 2003), and m j is the amount of ranking sentences in the output. Accordingly, m j is truncated to one or five, when computing MAP-1 and MAP-5, respectively. This metric was selected due to its ability to show how good the results are on the first positions of the ranking. Simply put, for a given question set Q, MAP-1 shows the fraction of questions that ranked a valid definition on the top. 4.2. Designing Baselines For comparison purposes, three baselines were implemented. These ignored articles on the definiendum in topic models. Because the impact of KBs on the performance is well known (Zhang et al. 2005; Han et al. 2006), robust methods were investigated to discard this kind of information. A first baseline (BASELINE I) selects answers by using the algorithm 1, and ranks candidate sentences based on their similarity to the centroid vector (Yang et al. 2003; Cui et al. 2004; Cui et al. 2004). This vector was learnt exclusively from all retrieved sentences bearing

the definiendum. For this baseline, all sentences are seen as candidates, and unlike other evaluations, it can therefore identify descriptions from sentences that mismatch the predetermined definition patterns. The second baseline (BASELINE II) is based on 1,900,642 preprocessed sentences acquired from Wikipedia abstracts. This baseline is based on bi-term LMs inferred from these training sentences (Chen et al. 2006). Sentences are then selected by using the same algorithm as for BASELINE I, but rating candidate answers based on (Chen et al. 2006). Correspondingly, bi-terms used in selected answers are added to φ, and analogously, their contribution to the next iterations is suppressed. Accordingly, the mixture weight of these models was experimentally set to 0.72 by using the expectation maximization algorithm (Dempster, Laird, and Rubin 1977), and its reference length was empirically set to 15 words. Overall, this baseline is geared toward testing the performance of the LMs (Chen et al. 2006) against our test sentences built on our training sentences. The third baseline (BASELINE III) is also incorporated into the framework provided by the algorithm. This baseline is constructed from the top of word association norms (Church and Hanks 1990). These norms were computed from the same set of 1,900,642 preprocessed sentences extracted from Wikipedia abstracts which involved pairs and triplets of ordered words. Sentences are subsequently ranked in agreement with the sum of the matching norms which are normalized by dividing them by the highest matching value. Word association norms compare the probability of an observing word w 2 followed by w 1 within a fixed window of ten words with the probabilities of observing w 1 and w 2 , independently. 
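The word association norm described above can be sketched in Python as the association ratio of Church and Hanks (1990), that is, the base-2 logarithm of P(w1, w2) / (P(w1) P(w2)) for ordered word pairs. This is only an illustrative sketch, not the code behind BASELINE III: the function name `association_norms`, the tokenized-sentence input, and the use of the total token count as the normalizer for all three probabilities are assumptions.

```python
import math
from collections import Counter


def association_norms(sentences, window=10):
    """Association ratio for ordered word pairs.

    sentences -- iterable of token lists
    window    -- span, in words, within which the second word must follow
                 the first (the first word counts as part of the span)
    """
    word_freq = Counter()
    pair_freq = Counter()
    total = 0
    for tokens in sentences:
        total += len(tokens)
        word_freq.update(tokens)
        for i, w1 in enumerate(tokens):
            # Order matters: only words that come after w1 are paired with it.
            for w2 in tokens[i + 1:i + window]:
                pair_freq[(w1, w2)] += 1
    norms = {}
    for (w1, w2), joint in pair_freq.items():
        p_joint = joint / total
        p1 = word_freq[w1] / total
        p2 = word_freq[w2] / total
        norms[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return norms


if __name__ == "__main__":
    toy = [["written", "by", "Entity"], ["written", "by", "Entity"]]
    for pair, value in sorted(association_norms(toy).items()):
        print(pair, round(value, 3))
```

Because pairs are ordered, ("written", "by") receives a value while ("by", "written") is absent from the toy output, which reflects the linear precedence that the norms are meant to capture.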
They provide us with a methodology that is the foundation for statistical description of a variety of interesting linguistic phenomena (Church and Hanks 1990), ranging from semantic relations of the professor/student type to lexico-syntactic cooccurrence constraints between verbs and prepositions (e.g., written/by). For this, BASELINE III offers a good starting point for measuring the contribution of our dependency-based context LMs. Because these three baselines do not consider context indicators, every sentence is assumed to have the same context indicator. These also provide assorted methods for deriving lexico-syntactic and semantic relations at various levels that typify descriptions of the definiendum, and use them for scoring answers to definition questions. 4.3. Ranking Candidate Answers from the Web The results achieved by measuring the three baselines and the context models for the three set of test queries are highlighted in Tables 2–5. Broadly speaking, BASELINE III outperformed the other two baselines in all sets, and BASELINE II finished with better results than the first baseline. In terms of F(3)-Score, context models surpassed BASELINE III by 5.22% and 11.90% for the TREC 2003 and 2004 data sets, respectively. Thus, the outcome was improved for 81 out of 114 questions (71.05%), whereas for 32 out of these 114 questions (28.07%), the performance decreased. In terms of Recall, the average value increased from 0.52 to 0.57 (9.6%) for the TREC 2003 data set, whereas by 6.4% for the TREC 2004 data set. In particular, definienda such as “Jennifer Capriati” and “Heaven’s Gate” resulted in significant improvements, whereas “Abercrombie and Fitch” and “Chester Nimitz” went into sudden drops. This improvement can partially be due to preference given to sentences belonging to the most frequent context indicators. 
For example, the context indicators “cult” and “religion” contain twelve and nine sentences, respectively, where four and two of them were chosen on the top of the ranking, and all of these six chosen sentences were actual definitions. This preference assisted us to select some novel answers that were not selected by using BASELINE III as some misleading candidate answers reached a higher score than these genuine definitions. Note that the order

TABLE 2. Results for TREC Question Sets.

                                TREC 2003   TREC 2004   TREC 2005
    Size                        50          64          (64)/75
    Baseline I      Recall      0.27        0.27        0.24
                    Precision   0.20        0.20        0.18
                    F(3)-Score  0.24        0.25        0.22
    Baseline II     Recall      0.45        0.40        0.38
                    Precision   0.28        0.19        0.21
                    F(3)-Score  0.40        0.34        0.33
    Baseline III    Recall      0.52        0.47        0.49
                    Precision   0.27        0.26        0.29
                    F(3)-Score  0.46        0.42        0.43
    Context Models  Recall      0.57        0.50        0.42
                    Precision   0.39        0.40        0.29
                    F(3)-Score  0.53        0.47        0.38

TABLE 3. Mean Average Precision.

                        Baseline I   Baseline II   Baseline III   Context Models
    TREC 2003   MAP-1   0.16         0.56          0.64           0.82
                MAP-5   0.21         0.57          0.64           0.82
    TREC 2004   MAP-1   0.27         0.67          0.66           0.88
                MAP-5   0.25         0.59          0.62           0.82
    TREC 2005   MAP-1   0.18         0.58          0.77           0.79
                MAP-5   0.24         0.53          0.70           0.77

of selection affects the ranking score. Interestingly, this improvement is obtained by enriching the algorithm 1 with inferences drawn from global information, namely, all candidate answers, instead of using solely the attributes of each sentences for ranking. As previously noted, context models were outperformed by using BASELINE III in 32 questions. For 26 cases, the Recall decreased in more than 10%, bringing about a significant drop

TABLE 4. Results for TREC Question Sets (Treebank Expansion + POS Tagging).

                                       TREC 2003   TREC 2004   TREC 2005
    Context Models        Recall       0.57        0.50        0.42
                          Precision    0.39        0.40        0.29
                          F(3)-Score   0.53        0.47        0.38
    Context Models II     Recall       0.46        0.46        0.42
                          Precision    0.32        0.38        0.29
                          F(3)-Score   0.43        0.44        0.38
    Context Models III    Recall       0.46        0.44        0.41
                          Precision    0.31        0.34        0.28
                          F(3)-Score   0.43        0.42        0.37
    Context Models + POS  Recall       0.56        0.47        0.48
                          Precision    0.24        0.22        0.24
                          F(3)-Score   0.48        0.41        0.42

TABLE 5. MAP (Treebank Expansion + POS Tagging).

                        Context Models   Context Models II   Context Models III   Context Models+POS
    TREC 2003   MAP-1   0.82             0.88                0.88                 0.88
                MAP-5   0.82             0.88                0.87                 0.88
    TREC 2004   MAP-1   0.88             0.92                0.94                 0.91
                MAP-5   0.82             0.88                0.87                 0.87
    TREC 2005   MAP-1   0.79             0.81                0.82                 0.73
                MAP-5   0.77             0.78                0.78                 0.71

in the F(3)-score. In order to qualify for the final response, a candidate answer must obtain a relatively high score. Candidate answers matching contexts that yield narrow coverage are then more unlikely to be included in the final output. On the other hand, BASELINE III makes use of statistics computed from the entire corpus, by reducing the data sparseness problem of some contexts with narrow coverage. Nonetheless, both strategies can be combined because the system knows the most essential and reliable context models.

Recall was observed to decrease due to the ungrammaticality of some web snippets. While the search strategy biases the output in favor of longer sentences, short and truncated sentences are still likely to be fetched (Figueroa et al. 2009). These truncations can eventually cause problems when computing the LDPs, missing some interesting nuggets. This is a key advantage when preferring surface statistics, such as word norms instead of context LMs, which incorporate NLP techniques. Overall, context models improved Recall for 50% of the TREC 2003-2004 definienda. Context models achieved higher Precision for the two data sets. In the case of the TREC 2003, the increase was 44.44%, whereas it was 53.84% for the TREC 2004 question set. In other words, context models were capable of filtering out a larger amount of sentences that did not yield descriptions, while at the same time, boosting the Recall. Thus, these pieces of information are characterized by regularities in their contextual dependency paths, in which the accuracy of pattern matching was improved. Nevertheless, some misleading descriptions were found to pose a tough, but interesting, challenge to definition QA systems. For example, consider the following two sentences incorporated into the answer of “Jean Harlow:” • Mona Leslie (Jean Harlow) is an up-and-coming broadway actress, dancer, and singer, who leads a happy-go-lucky, freewheeling lifestyle; bailed out of jail by family friend Ned. • Jean Harlow is the secretary no wife wants her husband to have in Wife versus Secretary. The first case describes a role played by “Jean Harlow” in a movie. A definition QA system would normally exploit the parentheses pattern to match descriptions that include aliases of the definiendum (e.g., “Abbreviation (organization) is a/an/the”); thus, in this case, the role is identified as “Jean Harlow” herself. Thus, the use of a particular pattern can be more suitable in one context than in others. 
Likewise, the pattern “ became . . .” captures good nuggets when dealing with context indicators such as artists and sports. However, inspecting the output showed that it was more likely to be noisy when tackling contexts such as organizations and events. In the second description, the definiendum replaces the name of the character in the movie, which is the actual concept being outlined in the phrase. Another key source of spurious answers is superlatives (Kaisser et al. 2006; Razmara and Kosseim 2007). The Internet is abundant in opinions and advertisements, which are highly likely to match superlatives, and hence to convey the mistaken impression of actual descriptions: “ was/is the best man/player/group/band in the world/NBA” and “ is the best alternative to . . . .” There are two main reasons why these misleading sentences were chosen: • Overmatched superlatives were normally included into the set of sentences belonging to the predominant context (e.g. “tenor”, “singer”, “band,” and “actor”). • Misleading sentences accomplished a relatively high score partially due to descriptive paths like band→best, band→the, band→is. These paths, nonetheless, play an important role and can be frequently found across definitions. With respect to BASELINE I, the achieved results are comparable to other studies (Zhang et al. 2005), in which the QA system did not account for online KBs. As for TREC 2005, context models finished with a lower recall and F(3)-score. Results reveal an increase of performance in 37 out of the 64 questions (57.81%), while for 24 of them (37.5%) it showed a reduction. Note that for 6 of these 24 cases, context models obtained a Recall of zero and

therefore the F(3)-score value became zero, which can eventually materialize a significant decline in the average score. In these six scenarios, few nuggets were found within the fetched snippets and they had a low frequency so that whenever context models missed any or all of them, the performance was reduced. This becomes serious for nuggets relevant to contexts that are very unlikely to be covered by the models. To measure the influence of these six cases, the average F(3)-score was compared by accounting solely for the other 58 questions, giving a value of 0.43 for context models, and 0.41 for BASELINE III. Regarding MAP scores (see Tables 3 and 5), context models effectively contribute to improve the ranking of the sentences. They did not only outperform the other three strategies, but they also reached a higher Precision in ranking, achieving a valid definition at the top 80% of the cases. The improvement of MAP can basically be due to sentences sharing more lexico-syntactic similarities with descriptive sentences within Wikipedia abstracts, causing an increase of Precision of the matching surface patterns. In addition, highest ranked answers correspond to predominant and hence more trustworthy, definiendum types. As a consequence, it becomes more likely for them to contain a genuine description. The use of relations between a group of words instead of isolated terms for ranking sentences also ensures a certain degree of grammaticality and context in the candidate answers. On the other hand, two different LDPs can yield the same descriptive information, bringing about an increase of redundancy. A good example containing issues affecting the performance for “Teapot Dome Scandal” is provided as follows: NOTES: Presents an examination of the Teapot Dome scandal that took place during the presidency of Warren G. Harding in the 1920s. Teapot Dome Scandal was a scandal that occurred during the Harding Administration. 
This article focuses on the Teapot Dome scandal, which took place during the administration of U.S. President Warren G. Harding. The Teapot Dome Scandal was a scandal under the administration of President Warren Harding which involved critical government . . . Teapot Dome Scandal cartoon The Teapot Dome Scandal was an oil reserve scandal during the 1920s. The Teapot Dome scandal became a parlor issue in the presidential election of 1924 but, as the investigation had only just started . . . . . It basically expresses the next four ideas repeatedly: “under Harding presidency,” “in 1920s,” “involved government oil fields,” and “major issue in 1924 election.” In this example, the following three paths convey the same fact about this definiendum: took→during→presidency→of→Entity, took→during→ administration→of→Entity, and under→administration→of→Entity. Only three of eight sentences would indeed be enough to cover all these aspects. Other strategies to detect redundancy can be developed by distinguishing analogous LDPs (Chiu, Poupart, and DiMarco 2007) which provides key advantages for using LDPs for answering

definition questions. Nevertheless, a TREC system can benefit from this when projecting the output into the AQUAINT corpus. In addition, two treebanks (snapshots from January 2007 and October 2008) of dependency trees were built, and hence, two extra context models were generated. The score of a candidate sentence A (equation (3)) was computed by making allowances for the average values of p(cs ) and p(dp | cs ). Accordingly, Table 4 highlights the obtained results for two extensions accounting for two and three treebanks, respectively. Overall, the performance was decreased in terms of Precision and Recall. Thus, the gradual reduction in Recall is given by the average of two or three treebanks which worsens the value of low frequent paths as they are not (significantly) present in all the treebanks. Thus, whenever they match a sentence, the sentence is less likely to score high enough to outperform the experimental threshold. Table 4 also shows highly frequent paths produce more robust estimates as they are very likely to be included in all treebanks, having a positive effect in the ranking. In all question sets, these two extensions outperformed the systems in Table 3. This increase of MAP values suggests that combining estimates from various snapshots of Wikipedia assists us to identify more relevant and plausible paths. These estimates along with the preference given by our selection algorithm brings about the improvement in the final ranking. As a consequence, additional genuine pieces of descriptive information tend to be used in the highest positions of the ranking. Likewise, the effects on context models of part-of-speech (POS) knowledge can be seen in Tables 4 and 5. These new context models were constructed from the original models, but they account for a treebank in which words with the next labels are mapped into the respective placeholder (tag): DT, CC, PRP, PRP$,CD, RB, FW, MD, PDT, PRP, RBR, RBS, SYM. 
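The tag-based abstraction listed above can be sketched in a few lines of Python. The sketch assumes that sentences are already available as (word, tag) pairs produced by some POS tagger, and that the placeholder for a word is simply its tag. The names `PLACEHOLDER_TAGS` and `abstract_tokens` are hypothetical and do not come from the original system.

```python
# Penn Treebank tags whose words are replaced by a placeholder (the tag itself).
PLACEHOLDER_TAGS = {
    "DT", "CC", "PRP", "PRP$", "CD", "RB", "FW",
    "MD", "PDT", "RBR", "RBS", "SYM",
}


def abstract_tokens(tagged_tokens):
    """Replace each word whose tag is in PLACEHOLDER_TAGS by that tag."""
    return [tag if tag in PLACEHOLDER_TAGS else word
            for word, tag in tagged_tokens]


if __name__ == "__main__":
    sentence = [("the", "DT"), ("band", "NN"), ("released", "VBD"),
                ("two", "CD"), ("albums", "NNS")]
    print(abstract_tokens(sentence))
```

Collapsing, for instance, "two albums" and "three albums" into the same token sequence lets paths that differ only in such words share their probability mass.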
In addition, some verbs, normally used for discovering definitions, are mapped into the same placeholder: is, are, was, were, become, becomes, became, had, has, and have. This aims to consolidate the probability mass of similar paths when computing context LMs. Table 5 highlights the results achieved by this strategy when compared with the original model. In general, the three extensions finished with better ranking in relation to the original context models. For our POS-based method, results indicate an increase of Precision with respect to the original system for two data sets, but a decrease in the case of the TREC 2005 question set. Unlike in the two previous question sets, abstracting some syntactic categories led some spurious sentences to rank higher. Table 4 highlights a marked drop in terms of F(3)-score for two data sets, but a substantial improvement for the TREC 2005 question set, when compared with the results of the original model. This improvement can partially be due to a boost in Recall, which suggests that the consolidation of LDPs effectively assisted us in identifying a larger number of genuine descriptive sentences. The addition of lexical knowledge also allowed for the matching of more misleading and spurious sentences, and consequently it worsened Precision, which might also explain the decrease of MAP values for this question set. In order to assess the impact of leaving entities unconsidered when building context models and thus rating candidate answers, all pairs were extracted and their respective mutual information was computed. A benchmarking task was then carried out in which all sentences (not selected by Algorithm 1) were checked to determine whether they matched any pair. If so, the corresponding sentence is appended to the end of the output. This is premised on the observation that if entities were important to recognize novel information, the overall Recall would be increased. This strategy

enlarged the answer for 19 and 35 out of the 50 and 64 TREC 2003 and 2005 question sets, respectively. A slightly different view is seen at the TREC 2004 question set in which the answer was extended for 24 out of the 64 questions, but in four cases the Recall was increased. From these results, entities were found to play no essential role in context models. Furthermore, entities are usually written in several ways (i.e., person names such as “George Bush” can be found as “George W. Bush,” “G.W. Bush,” “President Bush”). These variations make it difficult to match entities in the models with entities in the target set of sentences, and learn accurate distributions from the training data. Hence, an alias resolution step is necessary when learning entities and rating new sentences. In order to investigate the closeness between pairs of context models, their cosine similarity was computed. Each context indicator was created by measuring the MIM between each path and the relevant context. This computation involved 45,698 different context indicators and 26,490,042 different n-gram paths. Some findings regarding the relationships among context models can be seen in Table 1 and include: • A single cluster of ten contexts implying descriptions of physical places (e.g., area, city, community, district, and town). For instance, the cosine similarity drew a value of 0.28 for the pair district↔town, and 0.22 for the pair town ↔ center, but the pair district ↔ center obtained a lower value (0.159). • Descriptions of instances of specific contexts are likely to include some aspects inherited from their respective more general context(s), causing a higher similarity between the specific and general context(s). At the same time, each of the most specific contexts emphasizes radically different aspects inherited from the common general context(s). Good examples included artist ↔ painter (0.2385), artist ↔ singer (0.2045), and painter ↔ singer (0.1061). 
• The cosine similarity returned the value of 0.208 for the tuple town ↔station, while 0.17 and 0.47 for the tuples station ↔ city and town ↔ city, respectively. In simple words, the contexts town and city are highly likely to be described by using similar paths. The distinction is more accurate between a station and a city as the specific information that disambiguates a station from a city is a key element for a definition, causing an increase of their dissimilarity. Note that station is ambiguous: an army base, a bus stop, and radio station. Analogously, a strong similarity was observed between synonyms (e.g., book ↔ novel and song ↔ single) and genders (e.g., actor ↔ actress). • Some contexts corresponding to objects are strongly related to a particular type of person (i.e., album ↔ singer (0.208) and author ↔ book (0.251)). This can be due partially to descriptions of albums that usually include their singers, and descriptions of singers including information about their albums. Furthermore, some strong hyponym relationships are noted such as artist ↔ [painter, writer (0.223), singer (0.2045)], musician ↔ [composer (0.209), singer (0.308)]. Some part of relations were also indicated as very strong: character ↔ series (0.229) and band ↔ singer (0.241). • Some of the most dissimilar contexts involved: episode, single, species, genus. For instance, episodes are almost unrelated to definitions of type of persons such as professor (0.015) and painter (0.011), and also disconnected from descriptions of locations such as city (0.018) and club (0.018). Overall, this analysis questions the suitableness of the three specific models proposed by Han et al. (2006). The context indicator can establish a well-known semantic connection with some of its used terms. A simple way of detecting the most useful semantic connections involves checking synsets in WordNet that contain the context indicator, and their defined

relations to other synsets. More exactly, WordNet provides assorted semantic relations such as hypernymy and meronymy: • Overall, 13,217 hypernymy were identified including: area ↔ acreage, area ↔ arena, area ↔ land, author ↔ coiner, book ↔ production, leader ↔ imam, and organization ↔ brotherhood. • Furthermore, by using WordNet 1,165 holonym/meronym relations were discovered, which is a number substantially lower than the previous type of relation. For example, the context book is linked with the words: binding, cover, and text, while the context film with: credit, episode, scene, sequence and shot. Other contexts such as program and song also signified interesting matches such as program ↔ command and program ↔ statement as well as song ↔ chorus. These examples indicate that meronyms can be used for characterizing typical attributes of the context indicator. Take expressions such as: “infectious singalong chorus,” “command syntax similar to ed,” and “text/illustrations by Entity.” • Furthermore, WordNet labeled 448 different relations as pertainym. Some good examples are: film ↔ cinematic, poet ↔ poetic and title ↔ titular. E.g., “CONCEPT is a title held by the Entity which signifies their titular leadership over the Entity of Entity.” • What is more, WordNet discriminated 620 different antonyms that can potentially cooccur in a description. Take, for instance, the pair leader ↔ follower. More relevant, one can also envision that synsets in WordNet that match an existing relation can assist in combating data-sparseness. Put differently, one can assume that elements in each matched synset are interchangeable, analogously to the procedure exploited by (Han et al. 2006) for cushioning the effects of redundancy. 5. CONCLUSIONS AND FUTURE WORK This work proposes a novel approach to discover answers to web-based questions using context models for ranking definitions. 
These are statistical LMs derived from LDPs extracted from sentences that match definition patterns across Wikipedia abstracts. The principle behind contextual models is being capable of dealing with the strong dependence of state-of-the-art definition QA systems on articles on the definiendum across KBs. The main findings included: • Different experiments reveal that LDPs serve as promising indicators for the presence of definitions in natural language texts. This is supported by evaluations using current baselines and observations on the amount of lexico-syntactic information being incorporated which indeed increased the values of Precision and Recall. • The results also indicate that enriching context models with POS tags may boost the performance, while accounting for extra Wikipedia snapshots brings about only an improvement in the ranking order of the top answers. • Additional experiments that incorporate information about contextual entities into the context models were observed to reduce the performance. Overall, generated context models showed the underlying semantic relations between the different models. Experiments noted that some pairs of context indicators were more dissimilar than others, while other pairs share a higher similarity. Thus, popular specific models (Han et al. 2006) do not provide an optimal solution. Furthermore, results suggest

that a semantic hierarchy might be necessary to model different relationships across models. Nevertheless, building this hierarchy poses interesting challenges. As future work, similarities between contexts and WordNet synsets can be used for smoothing or grouping context models, hopefully increasing the Recall. Finally, context indicators can assist us in disambiguating some of the candidate senses of the definiendum, especially for entities regarding people and places named after individuals. In addition, other strategies to detect redundancy could be investigated by recognizing analogous LDPs (Chiu et al. 2007), which makes explicit the advantage of using LDPs for answering definition questions.

ACKNOWLEDGMENTS

This research is partially sponsored by the Universidad de Concepcion, Chile under grant number DIUC no. 210.093.015-1.0.

REFERENCES

BIKEL, D. M., R. L. SCHWARTZ, and R. M. WEISCHEDEL. 1999. An algorithm that learns what’s in a name. Machine Learning, 34(1–3):211–231.
BUNESCU, R., and R. J. MOONEY. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of HLT/EMNLP, Vancouver, BC, pp. 724–731.
CHEN, S., and J. GOODMAN. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the ACL, Santa Cruz, CA, pp. 310–318.
CHEN, Y., M. ZHOU, and S. WANG. 2006. Reranking answers for definitional QA using language modeling. In Coling/ACL-2006, pp. 1081–1088.
CHIU, A., P. POUPART, and C. DIMARCO. 2007. Generating lexical analogies using dependency relations. In Proceedings of the 2007 Joint Conference on EMNLP and Computational Natural Language Learning, Prague, Czech Republic, pp. 561–570.
CHURCH, K. W., and P. HANKS. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
CUI, H., M.-Y. KAN, and J. XIAO. 2004. A comparative study on sentence retrieval for definitional question answering. In SIGIR Workshop on Information Retrieval for Question Answering (IR4QA), Sheffield, UK, pp. 383–390.
CUI, H., M.-Y. KAN, and T.-S. CHUA. 2007. Soft pattern matching models for definitional question answering. ACM Transactions on Information Systems, 25(2):1–30.
CUI, H., K. LI, R. SUN, T.-S. CHUA, and M.-Y. KAN. 2004. National University of Singapore at the TREC 13 Question Answering Main Task. In Proceedings of TREC 2004. NIST: Gaithersburg, MD.
DANG, H. T., D. KELLY, and J. LIN. 2007. Overview of the TREC 2007 Question Answering Track. In Proceedings of TREC 2007. NIST: Gaithersburg, MD.
DEMPSTER, A. P., N. M. LAIRD, and D. B. RUBIN. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1–38.
FIGUEROA, A. 2010. Finding answers to definition questions on the web. Ph.D. Thesis, Universitaet des Saarlandes, Saarland, Germany.
FIGUEROA, A., J. ATKINSON, and G. NEUMANN. 2009. Searching for definitional answers on the web using surface patterns. IEEE Computer, 42(4):68–76.
FIRTH, J. R. 1957. A synopsis of linguistic theory 1930–1955. Studies in Linguistic Analysis (Special volume of the Philological Society). Blackwell: Oxford, UK, pp. 1–32.

HAN, K., Y. SONG, and H. RIM. 2006. Probabilistic model for definitional question answering. In Proceedings of SIGIR 2006, Seattle, WA, pp. 212–219.
HARRIS, Z. 1954. Distributional structure. Word, 10(23):146–162.
HILDEBRANDT, W., B. KATZ, and J. LIN. 2004. Answering definition questions using multiple knowledge sources. In Proceedings of HLT-NAACL, Boston, pp. 49–56.
JOHO, H., and M. SANDERSON. 2000. Retrieving descriptive phrases from large amounts of free text. In 9th ACM conference on Information and Knowledge Management, McLean, VA, pp. 180–186.
JOHO, H., and M. SANDERSON. 2001. Large scale testing of a descriptive phrase finder. In 1st HLT Conference, San Diego, CA, pp. 219–221.
KAISSER, M., S. SCHEIBLE, and B. WEBBER. 2006. Experiments at the University of Edinburgh for the TREC 2006 QA Track. In Proceedings of TREC 2006. NIST: Gaithersburg, MD.
LIN, D., and P. PANTEL. 2001. Discovery of inference rules for question answering. Natural Language Engineering, 7:343–360.
LIN, J., and D. DEMNER-FUSHMAN. 2006. Will pyramids built of nuggets topple over? In Proceedings of the main conference on HLT/NAACL, New York, pp. 383–390.
MANNING, C. D., P. RAGHAVAN, and H. SCHÜTZE. 2008. Introduction to Information Retrieval. Cambridge University Press: Cambridge, UK.
QIU, X., B. LI, C. SHEN, L. WU, X. HUANG, and Y. ZHOU. 2007. FDUQA on TREC2007 QA Track. In Proceedings of TREC 2007. NIST: Gaithersburg, MD.
RAZMARA, M., and L. KOSSEIM. 2007. A little known fact is . . . answering other questions using interest-markers. In Proceedings of the 8th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007), Mexico City, Mexico, pp. 518–529.
ROSE, D. E., and D. LEVINSON. 2004. Understanding user goals in web search. In WWW ’04: Proceedings of the 13th international conference on World Wide Web, New York, pp. 13–19.
SCHLAEFER, N., J. KO, J. BETTERIDGE, G. SAUTTER, M. PATHAK, and E. NYBERG. 2007. Semantic extensions of the Ephyra QA system for TREC 2007. In Proceedings of TREC 2007. NIST: Gaithersburg, MD.
SRIKANTH, M., and R. SRIHARI. 2002. Biterm language models for document retrieval. In Proceedings of the 2002 ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland.
VOORHEES, E. M. 2003. Evaluating answers to definition questions. In Proceedings of HLT-NAACL, Edmonton, Canada, pp. 109–111.
VOORHEES, E. M. 2004. Overview of the TREC 2004 question answering track. In Proceedings of TREC 2004. NIST: Gaithersburg, MD.
XU, J., A. LICUANAN, and R. WEISCHEDEL. 2003. TREC2003 QA at BBN: Answering definitional questions. In Proceedings of TREC 2003. NIST: Gaithersburg, MD.
YANG, H., H. CUI, M. MASLENNIKOV, L. QIU, M.-Y. KAN, and T. CHUA. 2003. QUALIFIER in TREC-12 QA main task. In Proceedings of TREC 2003. NIST: Gaithersburg, MD.
ZHAI, C., and J. LAFFERTY. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214.
ZHANG, Z., Y. ZHOU, X. HUANG, and L. WU. 2005. Answering definition questions using web knowledge bases. In Proceedings of IJCNLP 2005, Jeju Island, South Korea, pp. 498–506.
