Exploiting Wikipedia for Directional Inferential Text Similarity

Leong Chee Wee
Computer and Information Sciences, University of Delaware
[email protected]

Samer Hassan
Computer Science and Engineering, University of North Texas
[email protected]
Abstract
In natural languages, variability of semantic expression refers to the situation where the same meaning can be inferred from different words or texts. Given that many natural language processing tasks nowadays (e.g. question answering, information retrieval, document summarization) often model this variability by requiring a specific target meaning to be inferred from different text variants, it is helpful to capture text similarity in a directional manner to serve such inference needs. In this paper, we show how Wikipedia can be used as a semantic resource to build a directional inferential similarity metric between words, and subsequently, texts. Through experiments, we show that our Wikipedia-based metric performs significantly better when applied to a standard evaluation dataset, with a reduction in error rate of 16.1% over the random metric baseline.

Key Words: Wikipedia, Semantic, Directional, Similarity, Inference
1. Introduction

Text similarity measures have long been used in natural language processing and its related areas. Among the first applications of text similarity was probably the vectorial model in information retrieval, where documents in a collection are ranked in descending order of their similarity to a given query to determine the document most relevant to that query [20]. Text similarity has also been used for text classification and relevance feedback [18], in word sense disambiguation [10, 21], in extractive summarization [21], and in methods for the automatic evaluation of machine translation [15] as well as text summarization [11]. Text similarity measures are also useful in evaluating the coherence of texts [8].

Typically, the approach used for determining the similarity between two text segments relies on a naive lexical matching method to derive a similarity score, and depends on the number of lexical units that co-occur in both segments. Improved variants of this approach have looked into stemming, stop-word removal, part-of-speech tagging, longest subsequence matching, and other weighting and normalization schemes [19]. Though successful to some extent, lexical similarity methods fail to identify the semantic similarity of texts in many instances. As an example, “I own a car” and “I have a vehicle” are two text segments that are obviously related to each other, but most current text similarity metrics would fail to identify any meaningful connection between them. A number of word-to-word semantic similarity measures have been developed that are either knowledge-based [25, 9] or corpus-based [23], with applications in natural language processing tasks such as malapropism detection [1], word sense disambiguation [16], and synonym identification [23]. For text-based semantic similarity, the common approaches include the approximations obtained through query expansion in information retrieval [24], and the latent semantic analysis method [7], which derives text similarity by exploiting second-order word relations automatically collected from large corpora.

In this paper, we propose a method to augment word-to-word similarity with a directional component using Wikipedia as a semantic resource. Such a directional similarity can be used to establish weights for words in a text segment via a graph iteration algorithm. Finally, by combining the weights and the directional similarity between words, we can measure the directional inferential similarity between any two text segments.
2. Wikipedia as a Semantic Resource

Wikipedia (http://en.wikipedia.org) is currently the largest known encyclopedia, covering a large variety of topics including politics, economics, healthcare, sports, entertainment, finance, etc. Each article, which is the basic entry in Wikipedia, is essentially a concept described in detail. Hyperlinks are present in these articles that link to other articles, providing a convenient means of navigating through the encyclopedia. Additionally, the user can
perform queries that enable random access to a specific article in Wikipedia. Wikipedia adopts an open-source development model, which means that anyone can add, edit or delete information in the articles. This “freedom of contribution” has contributed to a rapid growth in both the quality and quantity of Wikipedia, as any potential mistake in the information content can be quickly discovered and corrected, while at the same time making up for any inadequacy in the content of each article. Not surprisingly, a recent study [5] has shown that Wikipedia possesses an accuracy that rivals that of Britannica.

As explained, prior methods for measuring word similarity can be broadly divided into two main approaches. Corpus-based (statistical) approaches often require the use of large corpora and record the co-occurrence of words as a means to judge their similarity, while knowledge-based approaches typically use lexical-semantic resources such as WordNet [3], which encodes word senses and the hierarchical relationships between words, to build similarity measures. Both of these methods have drawbacks. Statistical models of word similarity have in general been shown to have poor correlation with human judgment [4]; since humans have the innate ability to compare the semantic similarity of words, we consider such statistical methods inadequate on their own. Most knowledge-based methods, to our findings, do not provide effective similarity measures for words that belong to different parts of speech. For instance, WordNet contains relatively few adjectives and adverbs, and relatively few relationships between such words, compared to nouns and verbs (in WordNet 2.0, there are 114,648 nouns, 11,306 verbs, 21,436 adjectives and 4,660 adverbs). Moreover, given that building such resources requires extensive human effort, and maintenance comes at a high cost to a dedicated group of researchers, they suffer from slow integration of new words into the knowledge base: open-class words, especially proper nouns and specific terminologies, are not added to the network of lexical relationships in a timely manner.

Wikipedia circumvents these problems by providing a dynamic framework for capturing world knowledge that allows fast integration of concepts and words into a network of articles connected by links. By following any link from one Wikipedia article to another, an inter-conceptual relationship can be realized. Interestingly, current research [13] has already demonstrated that such links are useful for the task of word sense disambiguation. Another line of work [4, 22] argued that semantic relatedness measures derived from Wikipedia have a higher correlation with human judgments than measures derived from other knowledge bases such as the Open Directory Project, Roget’s Thesaurus, etc.
3. Measuring Directional Inferential Text Similarity
Our research extends an idea proposed earlier [2] in which information drawn from the similarity between component words is used to compute a similarity between two text segments. Specifically, we focus on the subtask of recognizing the directional inferential relationship between two given text segments, as such semantic inference needs arise across many natural language processing tasks (e.g. question answering, information retrieval, information extraction and summarization). Simultaneously, we seek to exploit Wikipedia as a semantic resource, in an attempt to discover the potential usefulness of such an open-source, large-scale knowledge repository to research in natural language processing.

We choose to evaluate our method in a recently proposed framework for capturing semantic inference relationships [6]. Given a text and a hypothesis, the task is to establish an entailment relationship between the two text segments, with the relationship holding only if the hypothesis can be inferred from the text. For instance, if the text is “He is snoring” and the hypothesis is “He is sleeping”, we can clearly conclude that the entailment relationship holds, as a person snoring implies that he is sleeping. Note that this relationship is directional. The problem can thus be reduced to deducing a truth-functional output from a list of instantiated parameters determining the entailment relationship (in cases where the entailment relationship cannot be confirmed, a degree of confidence must be specified, e.g. 80% that the entailment holds; in our experiments, however, we do not provide such degree specification, for the sake of simplification).

Our method adopts a stepwise approach to solving the entailment problem. Using Wikipedia as a knowledge base, we pick out the most salient words in the text. Next, we compute similarities between words in the hypothesis and words in the text using various similarity scores based on WordNet and Wikipedia, with higher weightings given to the salient words in the text. Following this, modality contexts for each text and hypothesis are established. Finally, feature vectors are extracted for each entailment pair and machine learning algorithms are used to determine their entailment relationship.
3.1 Picking out salient words in the text
We posit that a match between the salient words of the text and the hypothesis is the first step to recognizing an entailment relationship. In particular, since the inference is based on the text, concepts which appear salient in the text would necessarily be carried over to the hypothesis. As with other researchers, we devise a relatedness metric based on the tf·idf scheme [4], where the importance of a word in a given
article is proportional to the number of times it occurs in that article, called the term frequency (tf), and inversely proportional to the number of articles containing this word in Wikipedia, known as the inverse document frequency (idf). To compare the similarity of two words, w_1 and w_2, we compute the cosine similarity of their vectors, v_1 and v_2, which are weighted across the entire conceptual space in Wikipedia using tf·idf values. In other words, v_i represents a weighted vector ⟨weight_{i1}, ..., weight_{iN}⟩ for a given word w_i, with each c_j ∈ ⟨c_1, ..., c_N⟩ denoting a concept represented by an article in Wikipedia, and weight_{ij} being the tf·idf value of w_i for c_j.

Next, we further establish the notion of what we call directional inferential similarity. To our knowledge, all word similarity metrics provide a single-valued score for a pair of words w_1 and w_2 to indicate their semantic proximity. Intuitively, this is not always appropriate, as w_1 may be represented by concepts that are entirely embedded in other concepts represented by w_2. We argue that such a pair of words has an asymmetrical similarity relationship. In particular, in psycholinguistic terms, uttering w_1 may bring to mind w_2, while the appearance of w_2 without any contextual clues may not be associated at all with w_1. There are many instances to support this claim. For example, “George Bush” brings to mind “president”, but “president” may trigger other concepts such as “Washington”, “Lincoln”, “Ford”, etc., depending on contextual clues. Thus, the degree of similarity of w_1 with respect to w_2 should be separated from that of w_2 with respect to w_1. In particular,

    DSim(w_i, w_j) = (C_{ij} / C_i) · Sim(w_i, w_j)        (1)

where C_{ij} is the count of articles containing both words w_i and w_j, C_i is the count of articles containing word w_i, and Sim(w_i, w_j) is the cosine similarity of the vectors representing the two words, derived from Wikipedia. The directional weight (C_{ij} / C_i) amounts to the degree of association of w_i with respect to w_j. Table 1 shows the directional similarity for some words.

Table 1. Directional similarity scores for some words

    wi          wj          DSim(wi, wj)
    broadband   Internet    0.797
    Internet    broadband   0.032
    Ipod        apple       0.792
    apple       Ipod        0.076
    Bush        president   0.385
    president   Bush        0.072
    Microsoft   software    0.550
    software    Microsoft   0.231
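To make the computation concrete, here is a minimal sketch of Eq. (1) in Python. It assumes that the Wikipedia article counts, pairwise co-occurrence counts and tf·idf concept vectors have already been extracted; the helper names (article_counts, pair_counts, vectors) are illustrative rather than taken from the paper, and the toy numbers at the bottom are made up purely to show the asymmetry seen in Table 1.

    import math

    def cosine(v1, v2):
        """Cosine similarity of two tf-idf concept vectors given as {concept: weight} dicts."""
        shared = set(v1) & set(v2)
        dot = sum(v1[c] * v2[c] for c in shared)
        norm = math.sqrt(sum(w * w for w in v1.values())) * math.sqrt(sum(w * w for w in v2.values()))
        return dot / norm if norm else 0.0

    def dsim(wi, wj, article_counts, pair_counts, vectors):
        """Directional similarity of wi with respect to wj, as in Eq. (1)."""
        ci = article_counts.get(wi, 0)        # C_i : articles containing wi
        if ci == 0:
            return 0.0
        cij = pair_counts.get((wi, wj), 0)    # C_ij: articles containing both words
        return (cij / ci) * cosine(vectors.get(wi, {}), vectors.get(wj, {}))

    # Made-up counts and vectors, only to illustrate the asymmetry of Table 1:
    counts = {"broadband": 1000, "Internet": 50000}
    pairs = {("broadband", "Internet"): 900, ("Internet", "broadband"): 900}
    vecs = {"broadband": {"c1": 0.9, "c2": 0.1}, "Internet": {"c1": 0.8, "c2": 0.4}}
    print(dsim("broadband", "Internet", counts, pairs, vecs))   # relatively high
    print(dsim("Internet", "broadband", counts, pairs, vecs))   # much lower

Because the cosine term is symmetric, the asymmetry comes entirely from the directional weight C_{ij} / C_i.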
Using the directional inferential similarity scores as directed edges and distinct words as vertices, we obtain a graph for each text in an entailment pair. The directed edges denote the idea of “recommendation”: we say that w_1 recommends w_2 if and only if there is a directed edge from w_1 to w_2, with the weight of the recommendation being the directional similarity score. Intuitively, a specific concept recommends a general concept; hence, in our experiments, the directed edges are reversed to allow us to capture the more salient (more specific) concepts. By employing the graph iteration algorithm proposed in [14], we can compute the rank of a vertex using the following formula:

    WS(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} [ w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk} ] · WS(V_j)        (2)

where WS(V_i) is the weighted score for vertex V_i, Out(V_i) is the set of vertices that V_i points to, and In(V_i) is the set of vertices pointing to V_i. The damping factor d is usually set to a value of 0.85, which is the contribution of other vertices to the overall weighted score of V_i. Hence, any word can be assumed to have a baseline importance of 0.15 in the absence of incoming and outgoing edges. After some iterations, the weighted scores ultimately converge; we say that the graph converges when the difference in the weighted score of each vertex between successive iterations is less than a threshold (set to 0.001 in our experiments). What follows is a simple sort, in descending order, of the final weighted scores associated with each word to derive the most salient units of the text. Note that we do not construct separate graphs for each part of speech; we assume that all words in the text recommend one another synergistically, and the most important word after convergence is a conclusion based on the global knowledge drawn from the entire text. This gives us the flexibility to determine the relative contribution of each lexical unit to the “aboutness” of the text. In the example below, we provide a text with the top 5 ranked words extracted, following convergence of the graph iteration. Note that the top-ranked words may be labeled with different parts of speech.
Text: OTN profiles the Hamas (Islamic Resistance Movement), which is a Palestinian Islamic fundamentalist group.
Top 5 words: {Hamas, fundamentalist, resistance, OTN, profiles}
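For illustration, the following is a minimal sketch of the weighted graph iteration in Eq. (2); it is not the authors' implementation, and it assumes the directed edge weights (DSim scores, with edge directions already reversed as described above) are supplied as a dictionary keyed by (source, target) word pairs.

    def rank_words(words, edge_weights, d=0.85, threshold=0.001, max_iter=100):
        """Rank the distinct words of a text by the converged scores of Eq. (2).

        words        -- distinct words of the text (the graph vertices)
        edge_weights -- {(w_source, w_target): weight} for each directed edge
        """
        ws = {w: 1.0 for w in words}        # initial weighted scores
        # Total outgoing weight of each vertex, used to normalize its recommendations.
        out_sum = {w: 0.0 for w in words}
        for (src, _tgt), wt in edge_weights.items():
            out_sum[src] += wt
        for _ in range(max_iter):
            new_ws = {w: 1.0 - d for w in words}          # baseline importance (1 - d) = 0.15
            for (src, tgt), wt in edge_weights.items():
                if out_sum[src] > 0:
                    new_ws[tgt] += d * (wt / out_sum[src]) * ws[src]
            delta = max(abs(new_ws[w] - ws[w]) for w in words)
            ws = new_ws
            if delta < threshold:                         # convergence criterion from the paper
                break
        return sorted(words, key=lambda w: ws[w], reverse=True)

    # Hypothetical usage: the five top-ranked (most salient) words of a text
    # top5 = rank_words(words, edge_weights)[:5]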
3.2 Computing similarity between text and hypothesis
The next step in our approach is to determine how well the hypothesis matches up with the text in terms of similarity to the most salient units of the text.
We achieve this by using various word-to-word similarity measures proposed in the literature, chosen for their observed performance in modeling word similarity. Specifically, we use the measures of Leacock & Chodorow, Lesk, Wu & Palmer, Resnik and Lin; we provide a brief description of each below. The Leacock & Chodorow [9] similarity is defined as:

    Sim_lch = -log(length / 2D)        (3)

where length refers to the shortest path between the two concepts using node-counting, and D is the maximum depth of the taxonomy. The Lesk similarity of two concepts is defined as a function of the overlap between their corresponding definitions in a dictionary; it is based on an algorithm proposed in [10] as a solution for word sense disambiguation. The Wu & Palmer [25] similarity metric combines the depth of the two concepts in the WordNet taxonomy and the depth of their least common subsumer (LCS) into a score:

    Sim_wup = 2 · depth(LCS) / (depth(concept_1) + depth(concept_2))        (4)

The measure by Resnik [17] returns the information content (IC) of the LCS of the two concepts:

    Sim_res = IC(LCS)        (5)

where IC is defined as IC(c) = -log P(c) and P(c) is the probability of encountering an instance of concept c in a large corpus. The Lin [12] metric builds on Resnik's similarity, and further adds a normalization factor consisting of the information content of the two input concepts:

    Sim_lin = 2 · IC(LCS) / (IC(concept_1) + IC(concept_2))        (6)
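The paper does not specify an implementation for these measures; as one possible realization, the WordNet-based scores in Eqs. (3) to (6) can be computed with NLTK's WordNet interface (the Lesk gloss-overlap measure is omitted here). This sketch assumes the NLTK WordNet and information-content corpora have been downloaded.

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')   # information content from the Brown corpus

    def max_similarity(word1, word2, measure='wup'):
        """Maximum similarity over all WordNet synset pairs of the two words."""
        measures = {
            'lch': lambda a, b: a.lch_similarity(b),                # Eq. (3)
            'wup': lambda a, b: a.wup_similarity(b),                # Eq. (4)
            'res': lambda a, b: a.res_similarity(b, brown_ic),      # Eq. (5)
            'lin': lambda a, b: a.lin_similarity(b, brown_ic),      # Eq. (6)
        }
        sim = measures[measure]
        best = 0.0
        for s1 in wn.synsets(word1):
            for s2 in wn.synsets(word2):
                try:
                    score = sim(s1, s2)
                except Exception:        # e.g. synsets with different parts of speech
                    continue
                if score is not None and score > best:
                    best = score
        return best

    print(max_similarity('car', 'vehicle', 'lin'))   # high for near-synonyms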
In addition, we make use of the directional inferential word similarity derived from Wikipedia in our algorithm, which is described in detail below:

1. For each word in the hypothesis, classify it into one of five separate sets: nouns, verbs, adjectives, adverbs and cardinals. This process is repeated for the text.

2. Using one of the five word-to-word similarity measures, for each word w_H ∈ {nouns} in the hypothesis, we identify the word w_T ∈ {nouns} in the text having the highest similarity score with w_H, and multiply this score by the normalized weight of importance of w_T in the text. The same procedure is done for each word w_H ∈ {verbs}.

3. For each word w_H ∈ {adjectives} in the hypothesis, we identify the word w_T ∈ {adjectives} in the text that matches lexically with w_H, and multiply the score (1 for a match, 0 for none) by the normalized weight of importance of w_T in the text. The same procedure is done for each word w_H ∈ {adverbs} and each word w_H ∈ {cardinals}.

4. All scores obtained from steps 2 and 3 are summed and normalized to give an overall score from 0 to 1 for the entailment pair.

The whole process can be summarized by the following equation:

    DSim(H, T) = [ Σ_{pos_H} Σ_{w_k ∈ pos_H} maxSim(w_k) · weight_{kT} ] / [ Σ_{pos_H} Σ_{w_k ∈ pos_H} weight_{kT} ]        (7)

The weights used for each word are the idf values derived from the British National Corpus (BNC). We also compute another, Combined, score with a modification to step 2, where all five similarity measures are computed simultaneously for a pair of words and the maximum score is chosen. Our Wikipedia approach builds on the Combined scoring system with the following changes: (a) all weights are replaced with directional weights from Wikipedia, instead of idf values from the BNC; (b) directional inferential similarity is added to form six metrics, which are then simultaneously computed in step 2; (c) simple lexical matching scores are replaced with directional inferential similarity scores in step 3.
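As a rough sketch, the per-part-of-speech aggregation of Eq. (7) could be written as follows. The data structures (hypothesis_by_pos, text_by_pos, text_weights) and the single word_sim callback, which stands in for whichever word-to-word measure steps 2 and 3 prescribe for each class, are illustrative assumptions rather than the authors' code.

    def dsim_text(hypothesis_by_pos, text_by_pos, text_weights, word_sim):
        """Directional similarity of a hypothesis with respect to a text (Eq. 7).

        hypothesis_by_pos, text_by_pos -- {'noun': [...], 'verb': [...], ...}
        text_weights -- normalized importance weight of each text word (Section 3.1)
        word_sim     -- callable (w_h, w_t) -> similarity score in [0, 1]
        """
        numerator = 0.0
        denominator = 0.0
        for pos, h_words in hypothesis_by_pos.items():
            t_words = text_by_pos.get(pos, [])
            if not t_words:
                continue
            for w_h in h_words:
                # Best-matching text word of the same class, and its importance weight.
                w_t = max(t_words, key=lambda w: word_sim(w_h, w))
                weight = text_weights.get(w_t, 0.0)
                numerator += word_sim(w_h, w_t) * weight
                denominator += weight
        return numerator / denominator if denominator else 0.0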
3.3 Establishing modality contexts

In a dataset of text-hypothesis pairs, there inevitably exist false positives and true negatives, which lead to a falsified entailment relationship. Consider the example below:

Text: Prolonged exposure to UV-B light can cause sunburn, and there are worries that damage to the ozone layer may lead to an increase in the incidence of skin cancer.

Hypothesis: Damage to the ozone layer has led to an increase in the incidence of skin cancer.

Here, each text or hypothesis captures a proposition in a mode. Mode indicates an attitude toward what is being reported and leads to a case of either certainty or doubt over the proposition therein. We believe that such modality contexts play a crucial role in our automatic deduction of entailment using machine learning algorithms. We extract this modality context automatically using a part-of-speech tagger, and later use it as one of our features in our machine learning step.
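The paper does not detail how the modality context is derived from the tagger output, so the following is only a rough, assumed heuristic: flag a segment as expressing doubt if it contains a hedging modal auxiliary (Penn Treebank tag MD) such as “may” or “might”, and as expressing certainty otherwise. It requires NLTK's tokenizer and tagger models.

    import nltk

    def modality_context(segment):
        """Very rough modality flag: 'doubt' if a hedging modal auxiliary is present."""
        hedges = {'may', 'might', 'could', 'should', 'would', 'can'}
        for word, tag in nltk.pos_tag(nltk.word_tokenize(segment)):
            if tag == 'MD' and word.lower() in hedges:
                return 'doubt'
        return 'certainty'

    print(modality_context("Damage to the ozone layer may lead to an increase in skin cancer."))  # doubt
    print(modality_context("Damage to the ozone layer has led to an increase in skin cancer."))   # certainty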
4. Evaluation
To test the effectiveness of our Wikipedia-based approach to textual entailment, we obtain the 400 training and 800 testing entailment pairs from the PASCAL Recognizing Textual Entailment corpus (http://www.pascal-network.org/Challenges/RTE2/) for evaluation, where each pair consists of a text, a hypothesis, the natural language task in which the pair appears, and a binary classification for the entailment relationship. From the training and testing data, the set of features extracted consists of:

    ⟨Sim_NN, Sim_JJ, Sim_VB, Sim_RB, Sim_CD, Sim_AVG, Modality_T, Modality_H, task, entailment⟩

in which the average similarity and the modalities are two particularly important features underlying our approach. Seven sets of experiments are performed according to our algorithm in Equation (7). In addition, we construct a Random baseline that assigns random similarity scores to each part-of-speech category. Once we obtain the scores for both the training and testing datasets, we perform our evaluation using Weka (http://www.cs.waikato.ac.nz/ml/weka/), which contains a suite of software implementing numerous machine learning algorithms from various learning paradigms. Specifically, we iterate through all the machine learning methods to obtain the best result for each experiment. Our findings are shown in Table 2.

Table 2. Recognizing textual entailment using various similarity measures

    Metric            Acc.    Prec.   Rec.    F
    L&C               0.552   0.530   0.928   0.675
    Lesk              0.509   0.508   0.535   0.521
    W&P               0.554   0.531   0.918   0.673
    Resnik            0.540   0.526   0.818   0.640
    Lin               0.544   0.530   0.778   0.630
    Combined          0.550   0.530   0.878   0.661
    Wikipedia         0.606   0.581   0.763   0.659
    Random Baseline   0.530   0.529   0.545   0.537
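The classification itself was done in Weka; purely as an illustration of the same sweep-over-learners idea, a comparable experiment could be scripted with scikit-learn (a stand-in for Weka, not the authors' setup), using random placeholder matrices in place of the real feature vectors above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    def best_learner(X_train, y_train, X_test, y_test):
        """Train several learners and keep the one with the best test accuracy."""
        learners = {
            'logistic regression': LogisticRegression(max_iter=1000),
            'naive Bayes': GaussianNB(),
            'SVM': SVC(),
            'decision tree': DecisionTreeClassifier(),
        }
        results = {}
        for name, clf in learners.items():
            clf.fit(X_train, y_train)
            pred = clf.predict(X_test)
            prec, rec, f1, _ = precision_recall_fscore_support(y_test, pred, average='binary')
            results[name] = (accuracy_score(y_test, pred), prec, rec, f1)
        return max(results.items(), key=lambda item: item[1][0])

    # Random placeholder features standing in for the extracted feature vectors
    # (400 training pairs, 800 test pairs, 9 numeric features each):
    rng = np.random.default_rng(0)
    X_tr, y_tr = rng.random((400, 9)), rng.integers(0, 2, 400)
    X_te, y_te = rng.random((800, 9)), rng.integers(0, 2, 800)
    print(best_learner(X_tr, y_tr, X_te, y_te))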
5. Discussion and Conclusion

As observed, using Wikipedia significantly improves the likelihood of recognizing the entailment relationship between two text segments over other knowledge-based methods and the random baseline, as measured on a recognizing-textual-entailment task. The best performance is achieved using our approach that combines Wikipedia weights and similarity scores with the other knowledge-based similarity measures, yielding an overall accuracy of 60.6%, an improvement of 9.4% over the next best-performing metric, by Leacock & Chodorow. Moreover, this improvement is even more significant when compared to the random baseline, representing a 16.1% reduction in error rate. The improvement can perhaps be explained by the large database of concepts semantically connected in Wikipedia, which allows the derivation of a score between any two given words and hence extends its coverage of semantic relatedness to an “all-pairs” scenario. As expected, all the other knowledge-based methods fail to capture any kind of semantic relation between a specific terminology and its domain, such as “Bush” and “politics”, due to the absence of the former term in their semantic network.

In an interesting observation, we sometimes retrieve a self-similarity of 0 as an upper bound during the normalization process used to obtain pair-wise similarity between words. For instance, the word “hostage” gives a self-similarity of 0 under the information content measures (Sim_res and Sim_lin). This anomaly can be explained by a lack of counts in the corpus associated with such a concept: the information content of “hostage” becomes zero, and hence the metric reports the lowest common subsumer of the word with itself as having a similarity of 0. On the contrary, we do not observe such anomalies in any instance involving our Wikipedia directional similarity metric.

The experimental results confirm our intuition that selecting salient concepts from the text and measuring the semantic mapping from the hypothesis to the text using directional similarity indeed produce a better measure of their directional relatedness, and hence the improvement in recognizing their entailment relationship. Employing the graph iteration algorithm allows us to “settle” on the most important concepts, a result based on the interaction of semantic links among the words in the text as well as world knowledge harnessed from Wikipedia. The results are also a significant improvement over the simple tf·idf weighting approach based on the British National Corpus [2], which, moreover, does not incorporate semantic similarity between adjectives or adverbs into the text similarity measure.

Nevertheless, we are aware that our approach presents a surface-level analysis dealing with lexical similarity, and does not arrive at any conclusion about the semantic structure of the text or hypothesis. It is also ignorant of the order of words within a sentence: “Paul kissed Lisa” and “Lisa kissed Paul” have totally different semantic interpretations, while our approach would generate a perfect similarity score. Additionally, it does not consider complex event states (e.g. emotional states, temporal markers) or complete linguistic structures involving prepositional attachments. While these constraints remain, we believe we have proposed a relatively effective solution to the entailment problem, without the need for deep syntactic parsing or rigorous semantic structure analysis. Our work also compares favorably with the best results achieved during the second PASCAL entailment evaluation.

In conclusion, we have presented an approach that uses world knowledge from Wikipedia to capture semantic similarity between words and texts. We introduced directional similarity, in which a directional weight dictates the amount of similarity between any two words. Such a directional similarity can be used to establish weights for salient words in a text segment via a graph iteration algorithm. Subsequently, by combining the weights and the directional similarity between words, we showed that we can effectively measure the directional similarity between any two text segments. Experimental results have shown that our approach outperforms other knowledge-based methods, with a significant 16.1% reduction in error rate when applied to the same algorithm and dataset.
6. Acknowledgments

We would like to thank Rada Mihalcea from the University of North Texas for her valuable suggestions and insightful critique, which improved the quality of this paper.

References

[1] A. Budanitsky and G. Hirst. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, 2001.
[2] C. Corley and R. Mihalcea. Measuring the semantic similarity of texts. In Proceedings of the ACL 2005 Workshop on Empirical Modeling of Semantic Equivalence and Entailment, 2005.
[3] C. Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, 1998.
[4] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the International Joint Conference on Artificial Intelligence, 2007.
[5] J. Giles. Internet encyclopaedias go head to head. Nature, (438):900–901, 2005.
[6] I. Dagan, O. Glickman, and B. Magnini. The PASCAL recognising textual entailment challenge. Lecture Notes in Computer Science, 3944:177–190, 2006.
[7] T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse Processes, 25, 1998.
[8] M. Lapata and R. Barzilay. Automatic evaluation of text coherence: Models and representations. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, 2005.
[9] C. Leacock and M. Chodorow. Combining local context and WordNet sense similarity for word sense identification. In WordNet: An Electronic Lexical Database. The MIT Press, 1998.
[10] M. Lesk. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the SIGDOC Conference, 1986.
[11] C. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference, 2003.
[12] D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, 1998.
[13] R. Mihalcea. Using Wikipedia for automatic word sense disambiguation. In Proceedings of the North American Chapter of the Association for Computational Linguistics, 2007.
[14] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004.
[15] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[16] S. Patwardhan, S. Banerjee, and T. Pedersen. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, 2003.
[17] P. Resnik. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995.
[18] J. Rocchio. Relevance feedback in information retrieval. Prentice Hall, Inc., Englewood Cliffs, New Jersey, pages 143–180, 1971.
[19] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. In Readings in Information Retrieval, Morgan Kaufmann Publishers, San Francisco, CA, 1997.
[20] G. Salton and M. Lesk. Computer evaluation of indexing and text processing. Prentice Hall, Inc., Englewood Cliffs, New Jersey, pages 143–180, 1971.
[21] H. Schutze. Automatic word sense discrimination. Computational Linguistics, 24(1):97–124, 1998.
[22] M. Strube and S. P. Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the National Conference on Artificial Intelligence, 2006.
[23] P. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning (ECML), 2001.
[24] E. Voorhees. Using WordNet to disambiguate word senses for text retrieval. In Proceedings of the 16th Annual International ACM SIGIR Conference, 1993.
[25] Z. Wu and M. Palmer. Verb semantics and lexical selection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1994.