Taxonomic Semantic Indexing for Textual Case-Based Reasoning *

Juan A. Recio-Garcia¹ and Nirmalie Wiratunga²

¹ Universidad Complutense de Madrid, Spain, email: [email protected]
² Robert Gordon University, Aberdeen, United Kingdom, email: [email protected]

Abstract. Case-Based Reasoning (CBR) solves problems by reusing past problem-solving experiences maintained in a casebase. The key CBR knowledge container therefore is its casebase. However there are further containers, such as similarity, reuse and revision knowledge, that are also crucial. Automated acquisition approaches are particularly attractive for discovering knowledge for such containers. The majority of research in this area focuses on introspective algorithms that extract knowledge from within the casebase. However the rapid increase in Web applications has resulted in large volumes of user-generated experiential content. This forms a valuable source of background knowledge for CBR system development. In this paper we present a novel approach to acquiring knowledge from Web pages. The primary knowledge structure is a dynamically generated taxonomy which, once created, can be used during the retrieve and reuse stages of the CBR cycle. Importantly this taxonomy is pruned according to a clustering-based sense disambiguation heuristic that uses similarity over the solution vocabulary of cases. The algorithms presented in the paper are applied to several online FAQ systems consisting of textual problem-solving cases. The goodness of the generated taxonomies is evidenced by improved semantic comparison of text due to successful sense disambiguation, resulting in higher retrieval accuracy. Our results show significant improvements over standard text comparison alternatives.

1 Introduction

Text comparison is important in many research areas including IR, NLP, the semantic web, text mining and textual case-based reasoning (TCBR). In TCBR, as with traditional CBR, the aim is to compare a problem description with a set of past cases maintained in a casebase, with the exception that descriptions are predominantly textual. The final aim is to reuse solutions of similar cases to solve the problem at hand [22]. Clearly the ability to compare text content is vital in order to identify the set of relevant cases for solution reuse. However a key challenge with text is variability in vocabulary, which manifests as lexical ambiguities such as the polysemy and synonymy problems [19]. Usually the similarity metric used to compare cases relies on both general-purpose lexical resources (such as thesauri and dictionaries) and hand-built domain-specific knowledge structures (such as ontologies or taxonomies). Naturally domain-specific resources are more attractive since general-purpose resources lack coverage. However coding extensive knowledge for each CBR application is costly, and this makes it appealing to have tools to learn such knowledge with minimum human intervention [10].

The Web contains a rich source of experiential knowledge, and for CBR the challenge is to develop tools to filter and distill useful information to augment the CBR cycle [15]. Clearly the coverage of knowledge and domain-independence is a strength, but the risk of irrelevant content extraction is a threat. Still, the Web has successfully been used as a resource for similarity knowledge in NLP applications [20], ontology matching [12] and word sense disambiguation [3]. Generally co-occurrence statistics from retrieved documents are used to quantify the inter-relatedness of sets of keywords. In this paper we are interested in gathering a taxonomy to capture the semantic knowledge in textual cases that cannot be obtained through statistical methods alone. Our contribution is twofold: firstly we propose to guide the taxonomy generation process using a novel CBR-specific disambiguation algorithm, and secondly case comparison is improved by means of Taxonomic Semantic Indexing, a novel indexing algorithm that utilises the pruned taxonomy. While this approach retains the advantage of using the Web as background knowledge, it also provides an elegant solution to the trade-off between the use of multiple web resources with greater coverage and the risk of irrelevant content extraction.

Related work in text representation appears in Section 2. Our approach to case indexing is discussed in Section 3, followed by the taxonomy generation process in Section 4. The case-based disambiguation algorithm which we use to prune the taxonomy is presented in Section 5. Section 6 presents experimental results, followed by conclusions in Section 7.

* Supported by the Spanish Ministry of Science and Education (TIN2009-13692-C03-03) and the British Council grant (UKIERI RGU-IITM-0607-168E).

2 Related work

Variability in vocabulary and the related sparse representation problem are common to many research areas involving textual content. Solving this requires that generalised concepts are discovered to help bridge the vocabulary gap that exists between different expressions of terms with similar meaning. Research in latent analysis, such as Latent Semantic Indexing [5] and Latent Dirichlet Allocation [2], does exactly this by creating representations of the original text in a generalised concept space. A difficulty with these approaches is that they generate non-intuitive concepts that lack transparency. These concerns have to some extent been addressed by word co-occurrence techniques [23] and in related work where taxonomic structures are generated from text collections [4]. The latter embody richer semantics and are particularly well suited to textual case comparison. Still, since co-occurrence statistics are often poor (also due to sparsity), distributional distance measures are needed instead to capture higher-order co-occurrence statistics [24]. However all these techniques tend to be computationally expensive, with repetitive calculations, and lack scalability. Work in word sense disambiguation (WSD), pioneered by the classical Lesk algorithm, establishes the meaning of a piece of text by examining adjacent words [11]. Essentially a dictionary definition that maximises the overlap with adjacent terms is used to disambiguate the original text. This key idea has since been applied in conjunction with

the popular human-edited thesaurus WordNet, whereby the correct sense of a given word is determined based on the overlap with example sentences, called glosses, associated with each candidate sense [1]. However WordNet is general-purpose and lacks the coverage essential for domain-specific tasks, and so resources with greater coverage, such as Wikipedia, are more attractive. The popular bag-of-concepts (BOC) representation introduced in Explicit Semantic Analysis (ESA) treats each Wikipedia page as a concept, and each word in a case is represented by a Wikipedia concept vector [7]. Essentially the vector captures the importance of a given domain-specific word within Wikipedia pages. Since individual words are mapped onto a common concept space, a granular yet generalised representation is obtained to help resolve typical ambiguities in meaning. Although ESA's BOC approach is particularly appealing, it is arguable that techniques need not be restricted to a single web resource, and the instantiation of the concept vector need not be restricted to semantics inferred from term frequency alone. Therefore, the contribution of our work to this line of research is to explore how the BOC approach can be applied with semantically richer taxonomic structures derived from multiple heterogeneous Web resources.

Using web search to resolve the sparseness problem is not new [9]. The general idea is to retrieve documents by formulating web search queries that explicitly capture semantic relationships (such as part-of and is-a relationships) using the linguistic patterns proposed by Hearst [8]. These relationships are important because the case reuse and revision stages in CBR rely heavily on substitutional and compositional knowledge respectively. Typically the strength of a relationship is implicitly captured by the frequency of the recurring pattern in ranked documents. In the PANKOW system such relationships are used to annotate web documents [14], and in more recent work these relationships are combined to generate taxonomic structures [18]. A common advantage of all Web-related approaches is greater knowledge coverage; however, in reality, relationships mined from the web alone can be very noisy. This coverage-noise trade-off needs to be addressed, and we propose to do so by guiding the discovery process using word distributional evidence obtained from the case solution vocabulary.

3 Taxonomic Semantic Indexing

The Bag of Words (BOW) representation for text uses a word vector and the cosine angle between two vectors to quantify the similarity between any two cases, i.e. the smaller the angle the greater the similarity [17]. However, this approach to case comparison fails to capture any semantic relationships between terms. For example, the terms apple and banana, although similar in that they are both fruits, will not be matched by a metric that is simply focused on word-word comparisons, and will incorrectly result in minimum similarity. If however we were to use a taxonomy where apple and banana are sub-concepts of fruit, then the presence of a common concept refines the distance computation to reflect a greater degree of similarity [13]. This is explicitly achieved by extending the vector representation of each case to include new concepts (such as fruit) with suitable weights. Indexing cases in this manner using a “Bag of Concepts” (BOC) is referred to as Explicit Semantic Indexing, and in [7] BOC extracted from Wikipedia alone significantly outperformed BOW.
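As a concrete illustration of this failure mode, a minimal cosine-over-BOW sketch (the toy cases are illustrative, not the paper's implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse word vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two cases a human would judge related, but which share no words:
case1 = {"apple": 1.0, "pie": 1.0}
case2 = {"banana": 1.0, "bread": 1.0}
print(cosine(case1, case2))  # 0.0 -- BOW misses the fruit relationship
```

With disjoint vocabularies the dot product is zero, so the metric reports minimum similarity regardless of how related the terms are.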

Figure 1 illustrates our Taxonomic Semantic Indexing (TSI) approach where the BOC obtained from a taxonomy extends the BOW representation. Here the semantic knowledge is encapsulated in a taxonomy, T = ∪⟨h+, h−⟩, where ⟨h+, h−⟩ is a hypernym-hyponym relationship pair. A hypernym, h+, is a term whose semantic range includes the semantic range of another word, called the hyponym, h−. In our example fruit is the hypernym of apple, whilst apple is the hyponym in this relationship. T is recursively extracted from the Web and is detailed in Section 4. We formalise TSI as follows. Given a textual casebase CB, represented using a vocabulary V, and composed of a set of textual cases C = {C_1, ..., C_i, ..., C_|CB|}, each case represented using BOW(C_i) = {w_1, ..., w_j, ..., w_|V|} is extended with hypernyms obtained from T to form the extended BOC representation:

  BOC(C_i) = {w_1, ..., w_|V|, h+_1, ..., h+_|T|} | ∃⟨h+_j, w_i⟩ ∈ T

This can be summarised as follows:

  BOC(C_i) = {BOW(C_i), h+_1, ..., h+_|T|} | {∃⟨h+_j, w_i⟩ ∈ T} ∧ {w_i ∈ BOW(C_i)}

Note that every leaf of T corresponds to a w_i, whilst internal nodes may or may not. This is because the hypernyms in the taxonomy are usually extracted from the Web and are not likely to be in V. These new hypernyms, forming vocabulary V′, extend the original vocabulary V to create the extended vocabulary E = V ∪ V′. Accordingly internal nodes in our taxonomy can consist of both original terms and concepts discovered from the Web.
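Under these definitions, extending a BOW vector into a BOC vector is a simple lookup over the hypernym pairs. A minimal sketch; the toy taxonomy is illustrative, and the caf weighting factor anticipates the parameter discussed in Section 6.2:

```python
def to_boc(bow, taxonomy, caf=1.0):
    """Extend a BOW vector (term -> weight) with hypernym concepts.
    taxonomy maps hyponym -> set of hypernyms (the <h+, h-> pairs of T).
    caf is the concept activation factor; caf=1.0 copies the hyponym's
    weight onto the new concept."""
    boc = dict(bow)
    for term, weight in bow.items():
        for hypernym in taxonomy.get(term, ()):
            # Accumulate in case several hyponyms share the same hypernym
            boc[hypernym] = boc.get(hypernym, 0.0) + caf * weight
    return boc

taxonomy = {"apple": {"fruit"}, "banana": {"fruit"}}
print(to_boc({"apple": 0.7, "pie": 0.4}, taxonomy))
# {'apple': 0.7, 'pie': 0.4, 'fruit': 0.7}
```

Two cases mentioning apple and banana respectively now share the dimension fruit, so their cosine similarity becomes non-zero.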

4 Taxonomy Generation

Hypernym-hyponym relationships are ideal for taxonomy creation because they capture the is-a relation that is typically used when building ontologies. The basic Hearst extraction patterns summarise the most common expressions in English for ⟨h+, h−⟩ relationship discovery. Expressions like “X such as Y” (also “X including Y” and “X especially Y”) are used to extract the relationship “Y is a hyponym of X”. For example, given the term “food”, a search for “food such as” in the text “food such as grapes and cereal” will discover the hyponyms “grapes” and “cereal”. In reality the set of candidate hyponyms needs to be filtered so that irrelevant relationships are removed. Therefore taxonomy generation can be viewed as a 2-staged search-prune process which, when repeated on newly discovered terms, generates the taxonomy in a top-down manner [18]. Note that the search step can be performed using web search engines (or search restricted to a local document corpus). TSI, presented in the previous section, calls for a bottom-up taxonomy discovery approach, because the BOC representation is based on finding h+ from h− in V (and not h− from h+). Therefore we need to start with leaf nodes corresponding to terms in BOW(C_i) and progressively extract higher-level concepts from the Web. Hearst's patterns can still be used, albeit in an inverse manner. For example to extract h+ for the term “fish” we can use the pattern “X such as fish”, where X is our h+. We have also had to refine these inverse patterns in order to remove false positives that are common due

Fig. 1. Taxonomic Semantic Indexing

Original Hearst Patterns
  queryTerm {,} including {NP,}* {or|and} NP
  queryTerm {,} such as {NP,}* {or|and} NP
  queryTerm {,} especially {NP,}* {or|and} NP
Inverted Hearst Patterns
  ¬NP NP including queryTerm ¬NP
  ¬NP NP such as queryTerm ¬NP
  ¬NP NP especially queryTerm ¬NP

Fig. 2. Original Hearst patterns and adapted inverse patterns for hypernym extraction

to problems with compound nouns and other similar grammatical structures. Figure 2 summarises the original Hearst patterns and our modified inverse patterns. Here NP refers to the noun part-of-speech tag and the extracted terms are in bold font. With the inverse patterns (unlike with the original patterns), queryTerm is the hyponym and is tagged as NP. The negation is included to avoid compound nouns, which have a tendency to extract noisy content. For example, the query “such as car” returns the following snippet: “Help and advice is always available via the pedal cars forum on topics such as car design, component sources, ...”, where topic could be a valid hypernym for design but not for car.

Term: insomnia. Pattern/Query: “such as insomnia”
  ... Another effect is said to be, having sleeping problems such as insomnia or having nightmares. Not wanting to go to school is suggested to be an effect of bullying for many ...
  ... Stress is the main cause of illness such as insomnia, bad memory, bad circulation and many more. The need for healing is ...
  ... Acupuncture and TCM is also a very powerful means of treating emotional problems and the physical manifestations that can arise as a result such as insomnia, headache, listlessness ...
Term: acne. Pattern/Query: “such as acne”
  ... is a leader in all-natural skin and body care products for problems such as: acne, cold sores, menstrual cramps ...
  ... Do you suffer from troublesome skin or problem skin such as acne or rosacea? An American pharmacist ...
Term: breakout. Pattern/Query: “such as breakout”
  ... Casting growth instability leads to a variety of process and product problems such as breakout, undesirable metallurgical structures, surface and subsurface cracking, and ...
  ... Like the original, it is patterned after classic ball-and-paddle arcade games such as Breakout and Arkanoid ...
  ... Check back for more material and information, such as: Breakout Session Summaries; Misc. Forms/Press Releases; ...

Fig. 3. Examples of web search results in response to queries formed using inverse patterns.

In order to generate the bottom-up taxonomy a search query is formulated for every term in V based on the inverse Hearst patterns.³ We found that both the singular and plural forms of terms need to be encoded in these patterns.⁴ In the interest of efficiency only hypernyms contained in the summary snippets generated by the search engine were extracted. Finally the most frequent hypernyms in the snippet text are considered for the taxonomy.

Figure 3 presents some real examples of inverse patterns and their corresponding extraction results from a web search engine. The figure includes some noisy results (struck out) to illustrate the difficulties of the method. Here a common hypernym such as “problems” is linked with three different terms: “insomnia”, “acne” and “breakout”. Clearly the first and second terms have the semantic sense of health problems, whilst the last one is completely different and should ideally be pruned. We next present a disambiguation algorithm to prune TSI's taxonomies.

³ Search engines such as Yahoo!, Google or Bing can be queried. Bing was best, whilst Yahoo! had long response times and Google's API returns just 10 results per query.
⁴ For example, the term “biscuit” does not return any h+s with the query pattern “such as biscuit”; however, its plural form does.
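The snippet-mining step can be sketched as follows, assuming a plain regular expression in place of the POS-tagged patterns of Figure 2; the function name, snippets and mpf cut-off are illustrative:

```python
import re
from collections import Counter

def extract_hypernyms(snippets, query_term, mpf=1):
    """Extract candidate hypernyms (h+) for query_term from search-result
    snippets via an inverted Hearst pattern: '<h+> such as <query_term>'.
    A real implementation would POS-tag the snippet and apply the negated-NP
    checks of Figure 2; this regex only approximates the idea. mpf is the
    minimum pattern frequency cut-off (see Section 6.2)."""
    pattern = re.compile(r"(\w+)\s+such\s+as\s+" + re.escape(query_term),
                         re.IGNORECASE)
    counts = Counter()
    for snippet in snippets:
        for hypernym in pattern.findall(snippet):
            counts[hypernym.lower()] += 1
    return {h: c for h, c in counts.items() if c >= mpf}

# Snippets adapted from Figure 3; 'result' is the kind of noise that the
# disambiguation step later prunes.
snippets = [
    "having sleeping problems such as insomnia or having nightmares",
    "Stress is the main cause of illness such as insomnia, bad memory",
    "manifestations that can arise as a result such as insomnia, headache",
]
print(extract_hypernyms(snippets, "insomnia"))
```

Raising mpf discards hypernyms seen in too few snippets, which is the frequency filter described above.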

5 Pruning as Disambiguation

Knowledge extracted from the Web may contain relationships that are contextually irrelevant to the TCBR system. For example in a cooking domain, whilst fruit is a sensible hypernym extraction for apple, computer is not! The question is how can we detect these noisy relationships in order to prune our taxonomy? Verification patterns with a conjunction are commonly used for this purpose: “h+ such as h− and *” checks if an extracted hypernym (h+) is also a common parent to a known hyponym (h−) and other candidate hyponyms (*) [25]. Probabilistic alternatives include a test of independence using search engine hits:

  hyponymProbability(h+, h−) = hits(h+ AND h−) / (hits(h+) · hits(h−))

However in reality all such verifications require many queries to Web search engines, slowing system performance, and crucially they fail to incorporate contextual information implicit in the casebase. For example the candidate hyponyms apple and banana for the hypernym fruit are likely to be used in a similar context within a cooking casebase. In contrast, the candidate hyponym oil, incorrectly related⁵ to fruit, will have a different context to that of apple or banana.

5.1 Creating a Hyponym Context

The context of a term is captured by its co-occurrence pattern. Often, however, related words do not co-occur, due to sparsity and synonymy. Therefore co-occurrence with a separate disjoint target set, such as the solution vocabulary, is used instead [24]. Essentially, for a given hypernym its candidate hyponyms can be pruned by comparing their distributions conditioned over the set of solution words. The intuition is that hyponyms sharing the same hypernym should also be similarly distributed over the target vocabulary. Therefore the more similar the conditional probability distributions of candidate hyponym terms, the more likely it is that the extracted hypernym is correct.

5.2 Disambiguation Algorithm
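The context used in the algorithm that follows is a frequency vector over the solution vocabulary Vs. A minimal sketch, assuming whitespace tokenisation; the vocabulary and case are illustrative:

```python
def context_vector(term, case, solution_vocab):
    """phi_{w,C}: frequency vector over the solution vocabulary Vs for a
    term w, computed only when w appears in the case's problem text."""
    problem, solution = case  # a case is a <problem, solution> pair
    if term not in problem.split():
        return None  # context is undefined if w is absent from the problem
    solution_tokens = solution.split()
    return [solution_tokens.count(v) for v in solution_vocab]

vs = ["bake", "peel", "oven"]
case = ("pie with apple", "peel the apple then bake in the oven bake slowly")
print(context_vector("apple", case, vs))  # [2, 1, 1]
```

Two occurrences of a term in different cases thus yield two context vectors, whose distance reflects whether the term is used in the same sense.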

In this section we formalise the disambiguation algorithm with which we prune the taxonomy.

1. A case base CB contains a set of cases, where each case C is a problem-solution pair ⟨p, s⟩. Accordingly the case base can further be viewed as consisting of instances of problems and solutions, CB = ⟨P, S⟩, and related vocabularies. We define the problem vocabulary Vp and solution vocabulary Vs as:

   Vp = {∪ w ∈ P | relevance(w) > α}
   Vs = {∪ w ∈ S | relevance(w) > α}

   where relevance(w) measures the relevance of the term w in the corpus (usually based on TFIDF filtering) and α is a term selection parameter.

2. The context φ_{w,C} ∈ R^|Vs| of a term w in a case C is the frequency vector of the terms in Vs. Note that φ_{w,C} is only computed if w appears in the problem space (w ∈ Vp). Any term in a problem description should have a unique context; however we have used just the frequency vector to simplify the computation. Future implementations could compute φ_{w,C} as any other function that reflects the relationship of the term w with the other terms in C; one possibility is the conditional probability.

3. Given two terms x, y ∈ Vp with a common hypernym candidate h+, we obtain the relevant sets of cases RSx and RSy containing x or y in the problem description:

   RSx = {C | x ∈ P ∧ C = ⟨P, S⟩ ∈ CB}
   RSy = {C | y ∈ P ∧ C = ⟨P, S⟩ ∈ CB}

4. We generate the similarity matrix X of size |RSx| × |RSx| by computing all pair-wise similarities between members of RSx (solution parts of cases containing x in their description). This matrix represents the distances between the contexts φ_{x,Ci}, reflecting the possible senses of x:

   X_ij = distance(φ_{x,Ci}, φ_{x,Cj}) ∀ Ci, Cj ∈ RSx

   A suitable distance metric such as Euclidean, cosine or KL-divergence is used. Analogously, we compute the matrix Y of size |RSy| × |RSy|:

   Y_ij = distance(φ_{y,Ci}, φ_{y,Cj}) ∀ Ci, Cj ∈ RSy

   Analysing the similarity matrix X or Y we can infer the different senses of a term. Contexts with similar senses will form groups that can be obtained by a clustering algorithm. If the clustering results in one cluster, the term has only one sense, whilst multiple clusters suggest different senses of that term.

5. For candidate hyponym x we generate a clustering Gx consisting of independent groups {g_1^x, ..., g_n^x | ∩ g_i^x = ∅}. A group g^x consists of a set of similar contexts for the term x: {φ_{x,1}, ..., φ_{x,m}}. We also compute the analogous clustering for candidate hyponym y: Gy = {g_1^y, ..., g_m^y | ∩ g_i^y = ∅}.

6. To determine the validity of a hypernym h+ we compute the distance between every pair of groups ⟨g^x, g^y⟩. Groups with similar senses are expected to be closer. We use a distance-based heuristic over group centroids:

   Distance(g^x, g^y) = distance(φ̄^x, φ̄^y), where φ̄^x and φ̄^y are the centroids of g^x and g^y

7. If Distance is below some predefined threshold (< β) we can infer that both terms x and y are used in a similar sense and, therefore, the candidate h+ is valid; otherwise the ⟨x, h+⟩ and ⟨y, h+⟩ relationships are deleted. When hypernyms are deemed valid, we annotate the documents in g^x and g^y with the new concept h+. Moreover we could also include the Distance value in the BOC representation.

⁵ From the Web snippet “brines contain components from fruit (such as oil, sugars, polyols and phenolic compounds)”.
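Steps 6 and 7 reduce to a centroid-distance check between context groups. A minimal sketch, assuming the contexts have already been clustered (TTSAS in the paper) and using Euclidean distance; the vectors and threshold are illustrative:

```python
import math

def centroid(group):
    """Mean of a group of equal-length context vectors."""
    n = len(group)
    return [sum(v[i] for v in group) / n for i in range(len(group[0]))]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hypernym_valid(groups_x, groups_y, beta):
    """Accept the shared hypernym h+ if some pair of context groups for the
    two candidate hyponyms lies within distance beta (a shared sense)."""
    return any(euclidean(centroid(gx), centroid(gy)) < beta
               for gx in groups_x for gy in groups_y)

# Toy contexts over a 2-term solution vocabulary:
apple_groups = [[[3, 0], [2, 1]],   # 'apple' used in a cooking sense
                [[0, 5]]]           # 'apple' used in a computing sense
banana_groups = [[[3, 1], [2, 0]]]  # 'banana' only has the cooking sense
print(hypernym_valid(apple_groups, banana_groups, beta=2.0))  # True
```

The cooking-sense groups of apple and banana coincide, so a hypernym such as fruit would be accepted; the computing-sense group alone would not support it.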

Fig. 4. Disambiguation process using the solution space.

The end product is a pruned taxonomy where noisy relationships are deleted and every hypernym-hyponym relationship is associated with the list of documents in which the relationship holds. Figure 4 summarises the process using the example discussed in this section. Contexts representing the terms “apple” (meaning fruit) and “banana” lie at a short distance, whereas the group of contexts capturing the sense of “apple” meaning computer falls in a distant cluster. The relationships listed in Figure 5 are actual examples of sense disambiguation when the generated taxonomies were pruned in our experimental evaluation. Here strikeout relationships denote noisy hyponyms that were removed by the disambiguation algorithm for the sample domains.

Domain: Health
  problem → insomnia
  problem → acne
  problem → breakout
  problem → crash
Domain: Games & Recreation
  event → party
  event → celebration
  event → extinction
Domain: Recreation
  item → dish
  item → cup
  item → energy
Domain: Computers & Internet
  problem → crash
  problem → degradation
  problem → anxiety

Fig. 5. Disambiguation examples.

6 Experimental Evaluation

The aim of our evaluation is to establish the utility of TSI with and without disambiguation when compared to standard case indexing with BOW representations. Therefore a comparative study is designed to compare the following case indexing algorithms:

– BOW, the Bag-of-Words representation;
– BOC, Bag-of-Concepts with TSI where the pruned taxonomy is obtained with the disambiguation algorithm; and
– BOC no disambiguation, using TSI with an unpruned taxonomy.

These algorithms are compared on FAQ recommendation tasks. Like help-desk systems, a FAQ recommender system retrieves past FAQs in response to a new query. In the absence of user relevance judgments, system effectiveness is measured on the basis of FAQ classification accuracy using a standard k-NN algorithm (with k = 3). Since each case belongs to a predefined category we can establish classifier accuracy for each indexing algorithm. Significant differences are reported using the Wilcoxon signed rank test with 99% confidence. Individual recommender systems are built using jCOLIBRI, a popular CBR reference platform [6], with the taxonomic knowledge structure integrated with CBR's indexing knowledge container [16]. Our implementation uses (besides the standard textual CBR capabilities of jCOLIBRI) the Lucene toolkit⁶ to organize and filter the texts, the Snowball⁷ stemmer, the OpenNLP⁸ part-of-speech tagger and the MorphAdorner⁹ lemmatizer to obtain the singular and plural forms of web query terms. We have also developed several components to automatically connect to the search engine APIs, submit search queries and process retrieved results.

⁶ http://lucene.apache.org

6.1 Datasets

Several Web-based FAQ casebases were extracted from the online Yahoo!Answers¹⁰ Web site. Here thousands of cases in the form of question-answer pairs are organized into topic categories. Importantly, textual content from these FAQs can be extracted dynamically through a public Web API. Therefore, for a given FAQ category (e.g. health, sports, etc.) a set of textual cases can be extracted dynamically from Yahoo!Answers. This flexibility enabled us to extract casebases for six textual CBR systems from the Web. Essentially FAQs corresponding to 12 different categories were extracted and later grouped together to form 6 casebases, each containing cases from no more than 3 categories. These groupings were made with the intention of creating some casebases with similar categories and others with quite distinct ones. For example, corpus A contains quite similar texts, as its categories are “Consumer Electronics”, “Computers & Internet” and “Games & Recreation”. However, corpus C is composed of heterogeneous texts: “Science & Mathematics”, “Politics & Government” and “Pregnancy & Parenting”. Every casebase contained 300 cases (question-answer pairs), with an equal distribution of cases, i.e. 100 cases per category. The size of the FAQ question description vocabulary was |Vp| ≈ 2000, whereas the solution space was |Vs| ≈ 3500. In general taxonomies extracted from the Web contained 550 concepts on average, of which 25% were new concepts, i.e. the V′ vocabulary.

6.2 Experimental Setup

The relevance(w) function in relation to the disambiguation algorithm mentioned in Section 5 is implemented by way of a simple frequency heuristic, where words occurring in more than 50% of the cases are deleted. Removal of very rare words (i.e. frequency < 1% of cases) was also considered; however this had a negative impact on overall performance. We found that the FAQ vocabularies were large due to different users generating semantically similar yet lexically different content. Therefore, unlike with standard text classification, here rare words were important. The hypernym candidates obtained from search engines are filtered according to their frequency in the Web snippet text. This filtering parameter, named minimum pattern frequency (mpf), was configured experimentally, with best results achieved when mpf = 6. The clustering algorithm for the taxonomy relationship disambiguation process used a two-threshold sequential algorithmic scheme (TTSAS) [21] with the cluster merge parameter β obtained experimentally (β = 0.1).

⁷ http://snowball.tartarus.org/
⁸ http://opennlp.sourceforge.net/
⁹ http://morphadorner.northwestern.edu/
¹⁰ http://answers.yahoo.com
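The frequency heuristic for relevance(w) (delete words appearing in more than 50% of cases, deliberately keep rare words) might be sketched as follows; the tokenised toy cases are illustrative:

```python
def build_vocabulary(cases, max_df=0.5):
    """Keep words whose document frequency is at most max_df of the cases.
    Words above the cut-off (stopword-like terms in >50% of cases) are
    deleted; rare words are deliberately retained, as removing them was
    found harmful on the FAQ corpora."""
    n = len(cases)
    df = {}
    for tokens in cases:
        for w in set(tokens):  # count each word once per case
            df[w] = df.get(w, 0) + 1
    return {w for w, c in df.items() if c / n <= max_df}

cases = [["my", "pc", "crashes"], ["my", "screen", "flickers"],
         ["my", "pc", "overheats"], ["printer", "jams"]]
print(sorted(build_vocabulary(cases)))  # "my" (3/4 of cases) is dropped
```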

A BOW vector is implemented using standard TFIDF weights [17]. The BOC vector extends the BOW by explicitly inserting new concept values. This value is the concept weight, and should ideally be a function of the confidence of the learnt taxonomy relationship and the importance of the hyponym involved in the relationship. Our initial exploration with the concept weight parameter, named concept activation factor (caf), suggests that caf = 1.0 leads to surprisingly good results; this simply copies the TFIDF value of the hyponym involved in the relationship as the BOC concept weight. Although multiple levels of hypernym relationships were tested to disambiguate terms, we did not find significant differences with more than one level. This could simply be unique to our sparse corpora, where it was unlikely to find a common grandparent for two given terms. Therefore our experiments only use taxonomies generated with one level. The configuration of the parameters was obtained experimentally for every corpus, as our implementation provides a toolkit to obtain it automatically. However, it is important to note that the optimal results for each corpus usually shared similar configurations. For example the caf and mpf parameters always had the same values for all corpora, with only small differences in the configuration of the clustering algorithm.

6.3 Evaluation Results

Average accuracy results for each dataset using a 10-fold cross validation experiment are presented in Figure 6. BOC's TSI with pruned taxonomies resulted in the best performance, whilst BOC no disambiguation also improved upon BOW. BOC is significantly better than BOW, with almost a 5% increase across all FAQ domains. As expected, BOC's performance is also significantly better than BOC no disambiguation in all but corpus C. Here the disambiguation algorithm has incorrectly pruned some valid taxonomic concepts. Close examination of the trials suggests that this domain is very sensitive to the clustering parameters (mainly the β parameter), and tweaking these leads to improvements with taxonomy pruning. However these initial observations call for further study into the relationship between dataset characteristics and parameter setting in the future.
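The classification measure used above (majority vote among the k = 3 nearest cases) can be sketched as follows; cosine similarity over case vectors and the toy casebase are illustrative assumptions:

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(query, casebase, k=3):
    """Majority category among the k cases most similar to the query.
    casebase is a list of (vector, category) pairs; vectors may be BOW
    or BOC representations."""
    ranked = sorted(casebase, key=lambda c: cosine(query, c[0]), reverse=True)
    labels = [label for _, label in ranked[:k]]
    return max(set(labels), key=labels.count)

casebase = [({"pc": 1.0, "crash": 1.0}, "Computers"),
            ({"screen": 1.0, "flicker": 1.0}, "Computers"),
            ({"acne": 1.0, "skin": 1.0}, "Health"),
            ({"insomnia": 1.0, "sleep": 1.0}, "Health")]
print(knn_predict({"crash": 1.0, "screen": 1.0}, casebase))  # Computers
```

Accuracy is then the fraction of held-out cases whose predicted category matches the true one, averaged over the 10 folds.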

7 Conclusions

The idea of Taxonomic Semantic Indexing (TSI) using multiple heterogeneous Web pages is a novel contribution of this paper. Use of contextual knowledge to disambiguate and prune the extracted taxonomy presents an elegant solution to the trade-off between knowledge coverage and the risk of irrelevant content extraction from the Web. We achieve this by way of a taxonomy relationship disambiguation algorithm that exploits contextual knowledge that is implicit in the Case-Based Reasoning (CBR) system’s case solution vocabulary. TSI can be viewed as an unsupervised approach to taxonomy generation from the Web and is relevant not only to CBR but also to semantic web, NLP and other related research areas. For CBR, taxonomic knowledge can be utilised for case indexing, retrieval, reuse or even revision. In this paper we evaluate the quality of extracted taxonomic knowledge for textual case indexing using several online Textual CBR domains.

FAQ domains:

  Corpus A: Consumer Electronics, Computers & Internet, Games & Recreation
  Corpus B: Health, Pregnancy & Parenting, Beauty & Style
  Corpus C: Science & Mathematics, Politics & Government, Pregnancy & Parenting
  Corpus D: Cars & Transportation, Games & Recreation, Business & Finance
  Corpus E: Sports, Pets, Beauty & Style

Fig. 6. Experimental results

We employ the bag-of-concepts (BOC) approach to extend case representations by utilising the is-a relationships captured in the taxonomy. Results suggest significant performance improvements with the BOC representation, and the best results were obtained when taxonomies are pruned using our disambiguation algorithm. An interesting observation is that there is no obvious manner in which CBR systems using TSI-like approaches can be selective about Web resource choices beyond PageRank-type search-engine-specific rankings. In future work we plan to explore how a feedback mechanism might be built in so as to enable CBR systems to leave feedback annotations on Web resources.

References

1. Banerjee, S., Pedersen, T.: An adapted Lesk algorithm for word sense disambiguation using WordNet. In: Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics. pp. 136–145 (2002)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
3. Cilibrasi, R.L., Vitanyi, P.M.B.: The Google similarity distance. IEEE Trans. on Knowl. and Data Eng. 19(3), 370–383 (2007)
4. Cimiano, P., Hotho, A., Staab, S.: Learning concept hierarchies from text corpora using formal concept analysis. Journal of AI Research 24, 305–339 (2005)
5. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
6. Díaz-Agudo, B., González-Calero, P.A., Recio-García, J.A., Sánchez-Ruiz-Granados, A.A.: Building CBR systems with jCOLIBRI. Sci. Comput. Program. 69(1-3), 68–75 (2007)
7. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence. Hyderabad, India (2007)
8. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics. pp. 539–545. Association for Computational Linguistics, Morristown, NJ, USA (1992)
9. Keller, F., Lapata, M., Ourioupina, O.: Using the web to overcome data sparseness. In: EMNLP '02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. pp. 230–237. Association for Computational Linguistics, Morristown, NJ, USA (2002)
10. Leake, D., Powell, J.: Knowledge planning and learned personalization for web-based case adaptation. In: ECCBR '08: Proceedings of the 9th European Conference on Advances in Case-Based Reasoning. pp. 284–298. Springer, Berlin, Heidelberg (2008)
11. Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proc. of SIGDOC-86: 5th International Conference on Systems Documentation. pp. 24–26 (1986)
12. Sabou, M., d'Aquin, M., Motta, E.: Exploring the semantic web as background knowledge for ontology matching. pp. 156–190 (2008)
13. Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet::Similarity - measuring the relatedness of concepts. In: Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04) (2004)
14. Cimiano, P., Handschuh, S., Staab, S.: Towards the self-annotating web. In: WWW '04: Proceedings of the 13th International Conference on World Wide Web. pp. 462–471. ACM, New York, NY, USA (2004)
15. Plaza, E.: Semantics and experience in the future web. In: ECCBR '08: Proceedings of the 9th European Conference on Advances in Case-Based Reasoning. pp. 44–58. Springer-Verlag, Berlin, Heidelberg (2008)
16. Recio-García, J.A., Díaz-Agudo, B., González-Calero, P.A., Sánchez-Ruiz-Granados, A.: Ontology based CBR with jCOLIBRI. In: Applications and Innovations in Intelligent Systems XIV. SGAI'06. pp. 149–162. Springer (2006)
17. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
18. Sanchez, D., Moreno, A.: Bringing taxonomic structure to large digital libraries. Int. J. Metadata Semant. Ontologies 2(2), 112–122 (2007)
19. Simpson, G.B.: Lexical ambiguity and its role in models of word recognition. Psychological Bulletin 92(2), 316–340 (1984)
20. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using Wikipedia. In: AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence. pp. 1419–1424. AAAI Press (2006)
21. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, Third Edition. Academic Press (February 2006)
Pedersen, T., Patwardhan, S., Michelizzi, J.: Wordnet::similarity - measuring the relatedness of concepts. In: Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04) (2004) 14. Philipp Cimiano, S.H., Staab, S.: Towards the self-annotating web. In: WWW ’04: Proceedings of the 13th international conference on World Wide Web. pp. 462–471. ACM, New York, NY, USA (2004) 15. Plaza, E.: Semantics and experience in the future web. In: ECCBR ’08: Proceedings of the 9th European conference on Advances in Case-Based Reasoning. pp. 44–58. Springer-Verlag, Berlin, Heidelberg (2008) 16. Recio-Garc´ıa, J.A., D´ıaz-Agudo, B., Gonz´alez-Calero, P.A., S´anchez-Ruiz-Granados, A.: Ontology based cbr with jcolibri. In: Applications and Innovations in Intelligent Systems XIV. SGAI’06. pp. 149–162. Springer (2006) 17. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983) 18. Sanchez, D., Moreno, A.: Bringing taxonomic structure to large digital libraries. Int. J. Metadata Semant. Ontologies 2(2), 112–122 (2007) 19. Simpson, G.B.: Lexical ambiguity and its role in models of word recognition. Psychological Bulletin 92(2), 316–340 (1984) 20. Strube, M., Ponzetto, S.P.: Wikirelate! computing semantic relatedness using wikipedia. In: AAAI’06: proceedings of the 21st national conference on Artificial intelligence. pp. 1419– 1424. AAAI Press (2006) 21. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, Third Edition. Academic Press (February 2006)

22. Weber, R.O., Ashley, K.D., Br¨uninghaus, S.: Textual case-based reasoning. The Knowledge Engineering Review 20(03), 255–260 (2006) 23. Wiratunga, N., Lothian, R., Chakraborty, S., Koychev, I.: Propositional approach to textual case indexing. In: Proceedings of the 9th European PKDD Conf. pp. 380–391. Springer (2005) 24. Wiratunga, N., Lothian, R., Massie, S.: Unsupervised feature selection for text data. In: RothBerghofer, T., G¨oker, M.H., G¨uvenir, H.A. (eds.) ECCBR. Lecture Notes in Computer Science, vol. 4106, pp. 340–354. Springer (2006) 25. Zornitsa Kozareva, E.R., Hovy, E.: Semantic class learning from the web with hyponym pattern linkage graphs. In: Proceedings of ACL-08: HLT. pp. 1048–1056. Association for Computational Linguistics, Columbus, Ohio (June 2008)
