
AI Communications 21 (2008) 27–48 IOS Press


Pattern-based automatic taxonomy learning from the Web

David Sánchez* and Antonio Moreno
Intelligent Technologies for Advanced Knowledge Acquisition (ITAKA) Research Group, Department of Computer Science and Mathematics (DEIM), University Rovira i Virgili (URV), 43007 Tarragona, Spain

Abstract. The construction of taxonomies is considered the first step in structuring domain knowledge. Many methodologies have been developed in the past for building taxonomies from classical information repositories such as dictionaries, databases or domain texts. In recent years, however, scientists have started to consider the Web as a valuable repository of knowledge. In this paper we present a novel approach, especially adapted to the Web environment, for composing taxonomies in an automatic and unsupervised way. It uses a combination of different types of linguistic patterns for hyponymy extraction and carefully designed statistical measures to infer information relevance. The learning performance of the different linguistic patterns and statistical scores considered is carefully studied and evaluated in order to design a method that maximizes the quality of the results. Our proposal is also evaluated for several well distinguished domains, offering, in all cases, reliable taxonomies in terms of precision and recall.

Keywords: Taxonomy learning, ontologies, Web mining, knowledge acquisition

1. Introduction

In recent years, the enormous growth of the WWW has motivated many researchers [37] to start considering it as a valid repository for Information Retrieval and Knowledge Acquisition tasks. However, the Web suffers from many problems that are not typically observed in classical information repositories such as dictionaries, databases or news reports. Those sources are often quite structured in a meaningful organization or carefully selected by information engineers and, in consequence, one can assume the trustworthiness and validity of the information contained in them. In contrast, the Web raises a series of new problems such as the lack of structure, the untrustworthiness of information sources and the noise added by the visual representation, in addition to the ambiguity present in all resources written in natural language.

Despite all these shortcomings, the Web also presents characteristics that can be interesting for knowledge acquisition. Due to its huge size and heterogeneity, it has been assumed that the Web approximates the real distribution of information in humankind [39]. Some authors [16,19,34] have started to develop techniques for acquiring knowledge from the Web, especially adapted to its characteristics. Some of these works adopt an approach based on lightweight linguistic analysis to extract knowledge [30], using domain independent methods that require little supervision and scale well in such a huge repository.

Regarding the knowledge acquisition process itself, the extraction of relevant concepts for a domain and the construction of a taxonomy is considered the first logical step in structuring a domain of knowledge [3,43]. In this sense, general linguistic patterns for detecting hyponymy in a particular language (such as English) have been previously applied [34,35]. This unsupervised extraction of hyponym candidates is followed by a process of selection of the most suitable ones that usually involves statistical analyses of the co-occurrence of terms [34,35,39]. The selected candidates (the most relevant ones) are finally used to construct the taxonomy. However, there are many ways in which the described process can be performed, depending on the particular linguistic patterns employed and the way in which statistics are obtained. It can be argued that the taxonomy learning performance can be improved by carefully considering the possibilities offered by different linguistic patterns and Web-based statistics.

* Corresponding author: David Sánchez Ruenes, Department of Computer Science and Mathematics (DEIM), University Rovira i Virgili (URV), Avda. Països Catalans, 26, 43007 Tarragona, Spain. Tel.: +34 977 559681; Fax: +34 977 559710; E-mail: [email protected].

0921-7126/08/$17.00 © 2008 – IOS Press and the authors. All rights reserved



So, in this paper, we present a novel approach for constructing taxonomies from the Web that uses a combination of linguistic patterns for hyponymy detection and especially designed statistical measures adapted to the Web environment. The main contributions of this paper are:

(1) A study of how different linguistic patterns for hyponymy detection behave when extracting terms for constructing taxonomies.
(2) A study of the most appropriate statistical measures for inferring information relevance and selecting the most suitable terms for the domain.
(3) A method for combining different linguistic patterns into an integrated, domain independent, automatic and unsupervised taxonomy learning process using an incremental learning approach.
(4) A manual evaluation of the learning performance of each linguistic pattern and statistical score considered in (1) and (2), and an evaluation of the especially designed methodology for several well distinguished domains of knowledge.

The rest of the paper is organised as follows. Section 2 introduces related works and approaches developed for knowledge acquisition from texts. Section 3 presents the techniques used in our knowledge acquisition process based on linguistic patterns and statistical analyses; it also discusses the behaviour of the considered linguistic patterns and how they can be combined to improve the final results. Section 4 presents the novel automatic and unsupervised methodology for learning taxonomies from the Web and Section 5 details relevant aspects about how the learning process is performed and controlled. Section 6 evaluates the learning performance of the techniques employed and the quality of the results, discussing how our proposal improves on other possible approaches (using other combinations of patterns and statistics). The final section contains the conclusions and proposes lines of future work.

2. Related work

There are several knowledge acquisition approaches for learning structured representations (like taxonomies) depending on the type of input [3]: texts, dictionaries, knowledge bases, semi-structured data, relation schemas, etc.

Concerning the processing of text sources (as most Web documents are presented in this form), the most well-known approaches are: pattern-based extraction [18,28], where a relation is recognized when a sequence of words in the text matches a pattern; association rules [38], which have been used to discover non-taxonomic relations between concepts [3], using a concept hierarchy as background knowledge; conceptual clustering [6], where concepts are grouped according to the semantic distance between each other to make up hierarchies; ontology pruning [23], which is based on refining a general ontology using heterogeneous sources; and concept learning [47], where a given taxonomy is incrementally updated as new concepts are acquired from texts.

The common characteristics of classical knowledge acquisition methods from texts (also mentioned in [2]) are:

• As stated in the Introduction, many knowledge acquisition methods [1,4,9,33] use as learning corpus a reduced and pre-selected set of relevant documents for the covered domain. This approach sidesteps the aforementioned problems that arise when developing an unsupervised, domain-independent Web-based approach.
• Most of the knowledge acquisition methodologies [7,14,15,23,24,26] use predefined knowledge to some degree, such as training examples, previous ontologies or semantic repositories. This fact hampers the development of domain independent solutions, weakening the scalability and versatility of those systems in wide and heterogeneous environments like the Web.

On the contrary, we aim to obtain taxonomies from scratch without any previous knowledge, adapting several classical techniques for knowledge acquisition (linguistic patterns, statistical analysis, etc.) to the characteristics of the Web.

Recently, some authors have also been using the Web as a learning corpus for developing [39] or enriching knowledge structures [14], proposing techniques adapted to this particular environment. In particular, Web-based statistics have been used for ranking synonym sets [36] or checking the relevance of pattern-based extracted candidates for taxonomic relationships [35]. These statistics, in conjunction with an exhaustive utilization of Web search engines, have been used for massive Information Extraction [34]. Our approach shares, to a certain extent, the same spirit (the Web as learning corpus, queries to search engines) as those last approaches.



However, we work at a higher level of semantics, presenting a fully detailed multi-level taxonomic structure for a domain of knowledge. Other interesting characteristics which distinguish our work from previous approaches are:

• It is fully unsupervised. This is especially important due to the amount of available resources, avoiding the need for a human domain expert.
• The learning is completely automatic, allowing executions to be easily performed at any time in order to retrieve updated results. This characteristic fits very well with the dynamic nature of the Web.
• It is a domain independent solution, because no domain related assumptions are formulated and no predefined domain knowledge (previous ontologies, lexicons, thesauri, etc.) is needed. This is especially interesting when dealing with technological domains where specific and non-widely-used concepts may appear.
• Although it is unsupervised, the proposed incremental learning method allows a dynamic adaptation of the evaluated corpus as new knowledge is acquired (as a bootstrap). Moreover, the system has continuous feedback about the productivity of the learning task performed at each moment. This information is used to detect which are the most productive concepts in the taxonomy and to decide dynamically the amount of analysis that is applied to the available corpus.

3. Knowledge acquisition techniques

As mentioned in the Introduction, we base our unsupervised learning process on the use of linguistic patterns for hyponymy detection and statistical measures for inferring information distribution. In this section, we offer an overview of those techniques, studying their behaviour and possibilities when applied to taxonomy learning.

3.1. Study of linguistic patterns

In pattern-based approaches, the text is scanned for instances of distinguished lexico-syntactic patterns that indicate an interesting relationship in a particular language such as English. This technique is especially useful for detecting specialisations of concepts that represent is-a (taxonomic) relations.

3.1.1. Hearst's patterns

The most important precedent in the study of linguistic patterns is Hearst's work [28], in which she introduces a set of basic patterns for hyponymy discovery and a methodology for obtaining new patterns. This study covers domain independent regular expressions used to express lists of specialisations (summarized in Table 1). Those patterns capture the most common ways of expressing hyponyms in English. Many authors [13,27,29,42] have refined them or used them as the base for their taxonomy learning methodologies.

Table 1
Examples of linguistic patterns proposed by Hearst for discovering hyponymy relations in English natural language texts

Pattern: NP {,} including {NP,}* {or|and} NP
  Example: . . . countries including Spain or France
  Relation: hyponym(“Spain”, “countries”), hyponym(“France”, “countries”)

Pattern: Such NP as {NP,}* {(or|and)} NP
  Example: . . . such mammals as dogs, cats and whales
  Relation: hyponym(“dogs”, “mammals”), hyponym(“cats”, “mammals”), hyponym(“whales”, “mammals”)

Pattern: NP {,} such as {NP,}* {or|and} NP
  Example: . . . cancers such as breast cancer and leukaemia
  Relation: hyponym(“breast cancer”, “cancers”), hyponym(“leukaemia”, “cancers”)

Pattern: NP {,} especially {NP,}* {or|and} NP
  Example: . . . insects, especially bees and wasps
  Relation: hyponym(“bees”, “insects”), hyponym(“wasps”, “insects”)

In order to study the behaviour of the patterns defined by Hearst in extracting hyponyms, we have conducted several experiments for different domains. We queried a Web search engine using a keyword representing the domain to explore (e.g. Cancer) and each pattern's regular expression (e.g. “cancer such as”). The first N returned Web sites were processed in order to find matches of the corresponding pattern in the text and to extract candidates (noun phrases) using the pattern's regular expression and a text analyzer (OpenNLP¹).

¹ OpenNLP is a mature Java package that hosts a variety of Natural Language Processing tools performing sentence detection, tokenization, POS-tagging, chunking and parsing, allowing morphological and syntactic analysis of texts.
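To make this extraction step more concrete, the following Python sketch shows how Hearst-style patterns instantiated for a seed keyword can be matched against plain text to collect hyponym candidates. It is only an illustrative approximation of the process described above: the simplified noun-phrase expression, the helper names and the example sentence are our own assumptions, and the actual system relies on full morphological and syntactic analysis (OpenNLP) rather than on bare regular expressions.

```python
import re

# Simplified noun phrase: 1 to 4 alphabetic words. A real implementation would use
# a POS tagger/chunker (the paper uses OpenNLP) to identify nouns and adjectives.
NP = r"[A-Za-z][A-Za-z-]*(?: [A-Za-z][A-Za-z-]*){0,3}"

def hearst_patterns(keyword):
    """Return (name, compiled regex) pairs instantiating Hearst's patterns for a seed keyword."""
    k = re.escape(keyword)
    return [
        ("such_as",    re.compile(rf"{k}s?,? such as ((?:{NP},? )*(?:and |or )?{NP})", re.I)),
        ("including",  re.compile(rf"{k}s?,? including ((?:{NP},? )*(?:and |or )?{NP})", re.I)),
        ("especially", re.compile(rf"{k}s?,? especially ((?:{NP},? )*(?:and |or )?{NP})", re.I)),
        ("such_X_as",  re.compile(rf"such {k}s? as ((?:{NP},? )*(?:and |or )?{NP})", re.I)),
    ]

def extract_hearst_candidates(text, keyword):
    """Scan text for Hearst pattern matches and split the captured list into candidates."""
    candidates = set()
    for _, pattern in hearst_patterns(keyword):
        for match in pattern.finditer(text):
            for np in re.split(r",| and | or ", match.group(1)):
                np = np.strip()
                if np:
                    candidates.add(np.lower())
    return candidates

if __name__ == "__main__":
    sentence = ("Treatments are available for cancers such as breast cancer, "
                "leukaemia and lung cancer.")
    print(extract_hearst_candidates(sentence, "cancer"))
    # e.g. {'breast cancer', 'leukaemia', 'lung cancer'}
```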



Evaluating the set of those hyponym candidates, we have distinguished several situations according to the number of meaningful words (nouns and adjectives) that compose the noun phrase. For noun phrases containing only one word, we have identified the following three cases:

1. One word valid hyponyms (e.g. “cancer such as leukaemia”): those terms express correct specialisations of the meaning of the initial keyword and can be suitable for composing the domain taxonomy.

2. One word incorrect hyponyms (e.g. “cancer such as radiotherapy”; “cancers such as the following”): they represent concepts that are related in some way (but not taxonomically) to the main concept; in the worst situations, candidates may not have any kind of relationship with the domain. Those cases typically result from the fact that we are considering a very narrow context during the extraction. Analysing the whole sentence, we might realize the specific sense of this extraction (e.g. “treatments for cancer such as radiotherapy”; “different types of cancers such as the following: breast cancer, lung cancer”). However, this kind of analysis requires, in general, much more effort and semantic background than what we would expect from an unsupervised, automatic and Web-scalable methodology.

3. One word hyponyms with ellipsis (e.g. “cancer such as lung”): those terms express a specialisation by adding new terms (nouns or adjectives) to the main concept and can be suitable for composing the taxonomy for the domain. However, in this case, the ambiguity inherent to natural texts arises: in order to avoid redundancy, the writer omits the main concept. The extracted term can be a correct one if we are able to realize that it needs to be concatenated to the main concept in order to express the correct specialisation.

When dealing with noun phrases composed of two meaningful words, we can distinguish between the situations in which the word on the right side is the same as the main concept or not. For the first situation, we can identify the following two cases:

4. Two word valid hyponyms (e.g. “cancer such as breast cancer”): similarly to case #3, those terms express a specialisation by adding new words (nouns or adjectives) to the main concept, but in an explicit way, and can be suitable for composing the domain taxonomy.

5. Two word incorrect hyponyms (e.g. “cancer especially dangerous cancer”): this case is quite rare for this type of patterns and it represents a specialisation of the main concept that cannot be considered a correct subtype in a taxonomy. The most common situations are the use of general purpose adjectives to qualify the main concept.

When neither of the two words of the noun phrase is the main concept (e.g. “cancer including follicular lymphoma”), and with noun phrases composed of more than two meaningful words (e.g. “cancer including invasive breast cancer”), multiple levels of hyponym relationships are represented. In this situation, several relations of any of the mentioned cases may arise (e.g. lymphoma is a subtype of cancer and follicular lymphoma is a subtype of lymphoma; or breast cancer is a subtype of cancer and invasive breast cancer is a subtype of breast cancer). In consequence, it can be considered a composition of the mentioned cases and can be partitioned into simpler relationships that should be analyzed individually.

Finally, as Hearst's patterns typically define lists of terms, we can find cases that mix features from different identified cases (e.g. “cancers, including sarcomas, certain hematologic malignancies and breast, colon and prostate cancers”). In this situation, each noun phrase should be extracted, identified and analyzed according to its particular nature.

In addition to these cases (which can be considered “ideal”), the scenario is more complex if problems inherent to natural language are considered. The most common problematic situations are the following:

• The use of synonyms in order to avoid repetition of terms (e.g. “cancer such as colon tumours”) may add confusion to the identification of the particular hyponymy case. This situation can be corrected if we are able to detect synonyms. However, true synonyms are actually very hard to find and, in most cases, there may be subtle differences of meaning that can also be correctly considered as specialisations (e.g. carcinoma is usually incorrectly considered a synonym of cancer).


• Misspellings (e.g. “cancer such as brest cancer”) are very common in open environments like the Web. They should be treated adequately in order to minimize their effects.
• Proper names (e.g. “centers related with cancer such as National Cancer”) refer to individuals rather than to specialisations of the domain. They should be properly distinguished from common nouns in order to present a correct taxonomy.
• Polysemy (e.g. “cancer such as zodiac cancer”) is another problem derived from natural language ambiguity. It is hard to solve even in supervised approaches [40].

Summarizing, Hearst's patterns make it possible to find all possible taxonomic relationships for the specific domain (good recall), but problems of ellipsis, decontextualisation and natural language ambiguity can seriously affect the quality of the results (compromised precision). These intuitions will be confirmed by the results obtained for several well distinguished domains in Section 6.

3.1.2. Noun phrase based patterns

Another approach for detecting specialisations is the use of noun phrases (e.g. credit card) and adjective noun phrases (e.g. local tourist information office). Concretely, in the English language, the word immediately preceding a keyword frequently classifies it (expressing a semantic specialisation of its meaning) [21]. Thus, the word preceding a specific keyword can be used to obtain the taxonomic hierarchy of terms (e.g. pressure sensor is a subclass of sensor). If the process is repeated recursively, we can create deeper-level subclasses (e.g. air pressure sensor is a subclass of pressure sensor).

In this case, we have conducted several extraction experiments in a slightly different manner. The search engine is queried only with the domain of knowledge (e.g. Cancer) and the Web text is parsed to find occurrences of this term as a noun phrase, extracting hyponym candidates by syntactically analysing the immediately preceding words (nouns or adjectives).
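As a rough illustration of this second extraction strategy, the Python sketch below collects candidates formed by the word immediately preceding each occurrence of the seed keyword. The stop-word filter is a stand-in of our own for the part-of-speech check (only nouns and adjectives are kept by the real system, which uses OpenNLP), and the example snippet and function name are illustrative assumptions.

```python
import re

# Small illustrative stop-word list; the real system uses a POS tagger (OpenNLP)
# to keep only nouns and adjectives as candidate modifiers.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "or", "with", "this", "that",
              "his", "her", "its", "their", "any", "some", "no", "in", "on", "to"}

def extract_premodifier_candidates(text, keyword):
    """Return multi-word hyponym candidates formed by the word immediately
    preceding each occurrence of the keyword (e.g. 'breast cancer' for 'cancer')."""
    candidates = set()
    pattern = re.compile(rf"\b([A-Za-z][A-Za-z-]+) {re.escape(keyword)}\b", re.I)
    for match in pattern.finditer(text):
        modifier = match.group(1).lower()
        if modifier not in STOP_WORDS and modifier != keyword.lower():
            candidates.add(f"{modifier} {keyword.lower()}")
    return candidates

if __name__ == "__main__":
    snippet = ("Breast cancer and lung cancer are the most common forms, "
               "while the cancer registry tracks every case.")
    print(extract_premodifier_candidates(snippet, "cancer"))
    # e.g. {'breast cancer', 'lung cancer'}
```

Applying the same procedure recursively to each selected candidate (e.g. querying for “breast cancer”) yields the deeper-level subclasses mentioned above.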


The retrieved extraction cases are very simple (and so are the queries and extractions), as they can be reduced to the mentioned correct case #4 (e.g. breast cancer is a subtype of cancer), the incorrect case #5 (e.g. world cancer) and, more generally, the recursive case (e.g. invasive breast cancer is a subtype of breast cancer and breast cancer is a subtype of cancer). Ambiguity in the form of polysemy and misspellings may also appear in the retrieved subtypes. However, in this case, we are not able to detect all possible relationships for the domain, because only some hyponyms of the full potential set are normally expressed in this way (e.g. lymphoma is not usually expressed as “lymphoma cancer”).

Summarizing, the use of these patterns results in much simpler extractions than those obtained with Hearst's patterns. Their simplicity provides a higher robustness to decontextualisations and ellipsis (higher precision). Unlike Hearst's patterns, however, they are not able to detect all possible taxonomic relationships, but only those expressed by the concatenation of nouns and/or adjectives (lower recall). Again, these intuitions will be illustrated with results for several domains in Section 6.

3.1.3. Combining linguistic patterns

As one can conclude from the presented study (see an overview of the identified cases in Table 2), both kinds of patterns behave in a quite complementary way in relation to precision and recall. A proper combination of both may compensate for their respective weaknesses and result in an increase of the global learning performance. This is the main hypothesis of this paper. Regarding the taxonomy learning process, the following aspects should be taken into consideration:

Table 2
Types of hyponym candidate extractions (valid or incorrect) according to the type of linguistic pattern employed

Extraction case                         Example            Hearst   Noun phrase
#1. One word valid hyponyms             Leukaemia          X        –
#2. One word incorrect hyponyms         Radiotherapy       X        –
#3. One word hyponym with ellipsis      Lung               X        –
#4. Multiple word valid hyponyms        Breast cancer      X        X
#5. Multiple word incorrect hyponyms    Dangerous cancer   X        X



• Cases #1 (the correct one) and #2 (the incorrect one) are exclusively obtained through Hearst's patterns. In order to maximize the learning performance, both cases should be adequately distinguished. As case #2 is incorrectly obtained due to a non-contextualized extraction, we will try to contextualize the analysis as much as possible in order to reject these hyponymy candidates.
• Case #3 (the not so correct one, due to ellipsis) is only extracted through Hearst's patterns. However, when expressed in the correct form, with explicit inclusion of the main concept, it corresponds to a multiple word hyponym that can be easily detected with the noun phrase based pattern. In consequence, this potentially incorrect extraction can be adequately compensated using the second pattern type.
• Cases #4 (the correct one) and #5 (the incorrect one) may appear in both pattern approaches. However, they are more easily extracted, analyzed and distinguished through the noun phrase based pattern approach.

The more general situation in which several hyponym levels are present in the same noun phrase will be handled by treating each relation individually. In other words, only the most general one will be considered at each moment and the specialisations will be treated individually in new iterations of the learning process. Additional problems such as misspellings or the presence of proper names will also be treated adequately, as will be shown in Section 4.

More complex situations involving ambiguity may require additional effort to be solved. As will be introduced in Section 7, we have developed complementary techniques that can be suitable for dealing with synonym detection, even though they are not integrated in the proposed methodology at this moment. Polysemy has not been considered yet due to its enormous complexity for an unsupervised approach [40].

3.2. Statistical analysis

Once the set of hyponym candidates has been extracted in a completely unsupervised way, we need to know which of them are the most adequate for the domain in order to obtain a correct taxonomy. This process involves the analysis of the extracted terms and hyponymy relationships in order to measure the degree of relationship between the extracted terms and the domain. If we want to develop an unsupervised and domain independent learning methodology, the use of statistical measures (typically about co-occurrence of terms) for inferring the relevance of concepts and the degree of relationship between terms can be a good option [9,34].

However, statistical techniques suffer from the sparse data problem, i.e. they perform poorly when the words are relatively rare, due to the scarcity of data. Some authors [16] have demonstrated the convenience of using a large amount of text to improve the quality of classical statistical methods. Concretely, [20] and [36] address the sparse data problem by using the hugest data source available: the Web. Unfortunately, the analysis of such an enormous repository is, in most cases, computationally infeasible.

In order to tackle this problem, some authors [34,35,39] have demonstrated the convenience of using Web search engine hit counts to obtain robust statistics. One of the most important precedents can be found in [36], where several heuristics for employing the statistics provided by Web search engines are presented (Web scale statistics [34]). Specifically, they use a form of pointwise mutual information (PMI) [25] between words and phrases that is estimated from Web search engine hit counts for specifically formulated queries. The conclusion is that the degree of relationship between a pair of concepts can be measured through a combination of queries made to a Web search engine (involving those concepts and, optionally, their context). As an example, a typical score measuring the co-occurrence between an initial word (problem) and a related candidate concept (choice) is (1):

Score(choice) = hits(problem AND choice) / hits(choice).   (1)

Statistics obtained directly from queries to a Web search engine are particularly interesting because (i) they can be obtained in a very immediate and scalable way from publicly available Web search engines, avoiding the need for large-scale analyses of text, and (ii) they provide particularly robust measures of information distribution, as they are obtained from the whole Web. Based on this last premise, several authors, such as [12], have noted that the relative page counts of a Web search engine can approximate the true societal usage of words and phrases.
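To give a flavour of how such Web scale statistics can be computed in practice, the following Python sketch estimates a co-occurrence score in the spirit of Eq. (1) from search engine hit counts. The hit_count wrapper is hypothetical: each engine used in this work (Google, Yahoo, MSNSearch) exposes its own query interface, which is not reproduced here.

```python
def hit_count(query: str) -> int:
    """Hypothetical wrapper around a Web search engine that returns the estimated
    number of hits for a query (e.g. obtained through the engine's search API)."""
    raise NotImplementedError("plug in a concrete search-engine API here")

def cooccurrence_score(candidate: str, keyword: str) -> float:
    """Web-scale co-occurrence score in the spirit of Eq. (1):
    hits(candidate AND keyword) / hits(candidate)."""
    denominator = hit_count(f'"{candidate}"')
    if denominator == 0:
        return 0.0
    numerator = hit_count(f'"{candidate}" AND "{keyword}"')
    return numerator / denominator
```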

4. Taxonomy learning methodology

The most novel idea of our approach is to compose a method that maximizes the performance of the learning process by taking into consideration: (i) the behaviour of the different linguistic patterns (considering the conclusions presented in the previous section) and (ii) a set of specifically designed statistical scores to measure the relevance of extracted terms and relationships in an unsupervised way.

However, as learning without a base of knowledge is difficult, we propose an incremental approach in which several learning steps are recursively performed and enriched (bootstrapped) with relevant knowledge already acquired. Concretely, as shown in Fig. 1, the learning process starts from a user specified keyword that indicates the domain for which the taxonomy should be constructed (e.g. cancer). This term is used as a seed for the learning process. It is worth noting that the initial concept could be composed of several words (e.g. breast cancer), providing a higher degree of concreteness if desired.

At this initial stage of the analysis, only general queries using domain independent patterns can be performed against the search engine. Instead of performing a complex analysis with a large amount of those resources, only subtle and lightweight analytic procedures are executed over a reduced amount of resources in order to detect the most directly related knowledge and compose an initial taxonomy. A procedure for detecting named entities and including them as instances of the taxonomy is also performed. The output of this process is a one-level taxonomy with general concepts.

For each new concept, the taxonomic learning is recursively executed and the particular term becomes a seed for further analyses. As the learning evolves, queries are longer, the search is more contextualized, Web resources are more domain related and, in consequence, the throughput of the methodology and the quality of the results are potentially higher. The finalisation of this recursive process is controlled by the algorithm itself considering, as described in Section 5.1, the learning throughput of the already executed steps. At the end, we obtain a multi-level domain taxonomy.

Fig. 1. Overview of the proposed taxonomy learning methodology.
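The overall control flow sketched in Fig. 1 can be summarised, very schematically, in the following Python outline. It is an assumption-laden sketch rather than the authors' actual system: the three helper functions are placeholders for the steps detailed in Sections 4.1, 4.2 and 5, and the max_depth parameter merely stands in for the throughput-based stopping criterion described in Section 5.1.

```python
from typing import Dict, Set, Tuple

def extract_hearst_candidates_for(seed: str, context: Tuple[str, ...]) -> Set[str]:
    """Placeholder for the Hearst-pattern extraction of Section 4.1."""
    raise NotImplementedError

def extract_noun_phrase_candidates_for(seed: str, context: Tuple[str, ...]) -> Set[str]:
    """Placeholder for the noun-phrase extraction of Section 4.2."""
    raise NotImplementedError

def select_by_web_statistics(seed: str, candidates: Set[str],
                             context: Tuple[str, ...]) -> Set[str]:
    """Placeholder for the statistical selection (Scores B/C plus thresholds)."""
    raise NotImplementedError

def learn_taxonomy(seed: str, context: Tuple[str, ...] = (), max_depth: int = 3) -> Dict:
    """Recursive control flow of Fig. 1: extract candidates for a seed concept,
    select the relevant ones, then recurse on each selected subclass using the
    parent concept as contextual bootstrap (Section 5.2)."""
    if max_depth == 0:
        return {}
    candidates = extract_hearst_candidates_for(seed, context)
    candidates |= extract_noun_phrase_candidates_for(seed, context)
    subclasses = select_by_web_statistics(seed, candidates, context)
    return {sub: learn_taxonomy(sub, context + (seed,), max_depth - 1)
            for sub in subclasses}
```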



4.1. Hearst based extraction

Taking into consideration the conclusions presented in Section 3.1, Hearst's patterns (we use the set introduced in Table 1) are applied first, as they have a higher recall (compared to noun phrases) and their lower precision will be adequately compensated later through the use of noun phrase based patterns (for cases #3, #4 and #5).

Concretely, using each Hearst pattern (e.g. NP such as NP) and the initial keyword (e.g. cancer), we compose a query (e.g. “cancer such as”) for a Web search engine. Note that the use of “ ” in the query implies that the exact phrase is searched. Different queries for each pattern are composed using the pattern's regular expressions (i.e. using singular and plural keyword forms and optional commas). Using the same analytical procedure introduced in Section 3.1, the Web content is scanned for pattern matches in sentences and, using a morphological and syntactic analyser and the appropriate pattern regular expression (e.g. cancer such as {NP,}*{or|and} NP), candidate concepts for hyponymy (covering cases #1 to #5) are obtained. Candidates that are a single word (such as leukaemia) and those composed of a noun phrase (such as breast cancer) are distinguished. Moreover, candidates are analysed by an English stemming algorithm to detect different morphological forms of the same concept.

Next, in order to select only the most suitable candidates for the domain, we use Web scale statistics obtained directly from a Web search engine. As this approach requires composing appropriate queries for a Web search engine involving the extracted candidates and the initial keyword, the suitability of the statistical values will depend on the specific queries. In this case, we focus the process on the adequate distinction between cases #1 (correct) and #2 (incorrect), as both are exclusive to Hearst's approach. As the latter appears due to non-contextualised extractions, we will need queries that are as contextualised as possible. Derived from the score (1) presented in Section 3.2, we have designed several queries and formulated different scores.

Score_A(candidate) = hits(“candidate” AND “keyword”) / hits(“candidate”).   (2)

This is the typical way of obtaining measures of co-occurrence and of inferring the degree of relationship between terms [34–36]. However, it does not ensure that the relationship between candidate and keyword is taxonomic. It only measures whether they co-occur in the text or not and, in consequence, an incorrect extraction of case #2 may be selected.

Score_B(candidate) = hits(“candidate keyword”) / hits(“candidate”).   (3)

This second approach tries to bound the context by joining both terms with double quotes. This measure can be useful for hyponyms based on noun phrases such as cases #4 and #5 (as will be shown later), but it performs poorly for cases #1 and #2 (e.g. “breast cancer” is a correct expression but “lymphoma cancer” is redundant).

Score_C(candidate) = hits(“Hearst_pattern(keyword, candidate)”) / hits(“candidate”).   (4)

This third score uses the pattern itself as part of the query, joining it to the keyword and the candidate with double quotes (e.g. hits(“keyword such as candidate”)). This kind of query is the most concrete one and ensures that the relation between the terms should be taxonomic. However, it can be too restrictive in some situations (especially for noun phrases like cases #4 and #5, which involve many terms and result in longer queries) and, in consequence, the recall may be compromised. Moreover, for each possible pattern a different score can be computed and, potentially, different results can be obtained. Section 6 includes a detailed evaluation of how those different scores affect the final result for a particular case of study.

Considering that, at this stage, our main objective is to be able to select case #1 extractions and reject case #2 ones, we use (4) as our selection score. In order to obtain the maximum generality, a different query for each Hearst pattern is composed and executed (involving the initial keyword and the candidate), and the maximum score is selected. Once the values for all candidates have been computed, those that exceed a threshold are selected. This threshold controls the behaviour of the selection procedure. It should be restrictive enough to maximize the performance for cases #1 and #2, even at the cost of slightly compromising the quality of cases #4 and #5, which will be considered more carefully later. However, the value should be tuned considering the reduced number of hits potentially obtained by the score's numerator (which involves several words with double quotes) in comparison with the general nature of the denominator (containing only the domain). Considering those facts, we empirically recommend a threshold with an order of magnitude as low as 1E–5.

In addition, a minimum number of hits for the constructed queries is also required in order to avoid misspelled terms. As this is an absolute measure, we set a common value for the different Web search engines that have been considered (more details in Section 5). However, finer tuning can be performed by focusing the analysis on a particular search engine. As this value also depends on the length of the particular queries, it is relaxed proportionally to the number of query terms, from several dozens of hits for one word terms (a minimum that, even for rare concepts, a search engine such as MSNSearch typically ensures) to a single hit for terms with more than three words. The particular value is not as important as the order of magnitude, which scales as a function of the number of queried words.

The same process is executed for each of the remaining Hearst patterns, resulting in a list of concepts that are marked as pre-selected or pre-rejected according to the statistical analysis. This particular notation is used because, as stated in Section 3.1, some of the concepts acquired and evaluated using Hearst patterns can potentially be retrieved again using noun phrase based patterns. Due to the special characteristics presented by those last extractions (less affected by ellipsis and decontextualisations), we can re-evaluate them with more confidence.

4.2. Noun phrase based extraction

The next step is quite similar to the first one but considers patterns based on noun and adjective phrases. The search engine is queried again but only with the initial keyword in double quotes (e.g. “cancer”). Web sites are parsed and the immediately preceding word is extracted; if it is a noun or an adjective but not a stop word, the resulting noun phrase (e.g. breast cancer) is selected as a hyponym candidate.

Those new candidates are added to the set obtained in the previous step. In the case in which a candidate was already in the list, it is marked as a noun phrase (e.g. lung cancer), regardless of whether it was a noun phrase or a single word term, or whether it was pre-selected or pre-rejected in the previous step. With this mechanism, we try to solve the problems of ellipsis (case #3: e.g. “cancers such as lung”) that may appear with an approach such as Hearst's (the “lung” incorrect extraction will become the “lung cancer” correct candidate). This shows how this second pass using the noun phrase based pattern can improve the precision of the final results.

Once all resources are parsed, the newly retrieved candidates and those re-marked as noun phrases that were pre-rejected in the previous stage are evaluated again using Web scale statistics. With this mechanism, we give a second chance to the potentially incorrectly rejected candidates and improve the recall for the case #4 extractions. In this case, due to the nature of the relationship (expressed by noun phrases), Score_B is the most adequate one. It is able to contextualize the search sufficiently (in contrast to Score_A) without being too restrictive (like Score_C). As the score's numerator is much simpler (without the pattern's terms) than in the Hearst case, a higher selection threshold should be used. We recommend a value at least two orders of magnitude higher (i.e. 1E–3) and a higher minimum number of appearances, starting from several hundreds.

In this phase of the learning, a method for distinguishing between common terms – which can become subclasses of the domain's taxonomy – and individuals – which should be considered as instances – is also applied. This method is described in [11] and uses heuristics about capitalization to perform the distinction. This additional mechanism helps to improve the quality of the final set of results by distinguishing real world entities (which should populate the ontology) from domain conceptualizations (which compose the taxonomy itself).

At the end, we obtain a set of selected candidates, joining those pre-selected during the Hearst extractions and those re-marked, re-evaluated or newly retrieved and finally selected during this second stage. They become subclasses of the initial concept and are stored in an ontological way. If several morphological forms of a specific concept exist, all of them are considered and stored (as the keyword-based search engines used may return different results for each one) but they are tagged as equivalent classes.
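The following Python sketch summarises, under our own simplifying assumptions, how the two selection scores just described could be combined: Score_C (Eq. (4), maximised over the Hearst patterns) pre-selects Hearst-derived candidates, while Score_B (Eq. (3)) evaluates noun-phrase candidates. The hit_count wrapper is the same hypothetical search-engine interface sketched in Section 3.2, the pattern templates are simplified, and the threshold and minimum-hit values only reflect the recommended orders of magnitude.

```python
def hit_count(query: str) -> int:
    """Hypothetical search-engine hit-count wrapper (see the sketch in Section 3.2)."""
    raise NotImplementedError("plug in a concrete search-engine API here")

# Simplified phrase templates standing in for the Hearst patterns of Table 1.
HEARST_TEMPLATES = ["{kw} such as {cand}", "{kw} including {cand}",
                    "{kw} especially {cand}", "such {kw} as {cand}"]

def score_b(candidate: str, keyword: str) -> float:
    """Eq. (3): hits("candidate keyword") / hits("candidate")."""
    denom = hit_count(f'"{candidate}"')
    return hit_count(f'"{candidate} {keyword}"') / denom if denom else 0.0

def score_c(candidate: str, keyword: str) -> float:
    """Eq. (4), maximised over the different Hearst patterns."""
    denom = hit_count(f'"{candidate}"')
    if not denom:
        return 0.0
    best = max(hit_count('"{}"'.format(t.format(kw=keyword, cand=candidate)))
               for t in HEARST_TEMPLATES)
    return best / denom

def select_candidates(keyword, hearst_candidates, noun_phrase_modifiers,
                      threshold_c=1e-5, threshold_b=1e-3, min_hits=100):
    """Pre-select Hearst candidates with Score_C (Section 4.1) and evaluate noun-phrase
    candidates, given here by their modifier word (e.g. 'breast' for 'breast cancer'),
    with Score_B (Section 4.2)."""
    preselected = {c for c in hearst_candidates if score_c(c, keyword) >= threshold_c}
    reselected = {f"{m} {keyword}" for m in noun_phrase_modifiers
                  if hit_count(f'"{m} {keyword}"') >= min_hits      # misspelling filter
                  and score_b(m, keyword) >= threshold_b}
    return preselected | reselected
```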

5. Relevant aspects of the learning process

As mentioned in the Introduction, we intend to perform the full learning process in a completely unsupervised and automatic way. The lack of previous knowledge makes the learning a difficult task, especially considering the ambiguity inherent to natural language.



The automatic operation introduces problems of finalisation, i.e. how to decide when the algorithm should continue the analysis of resources or stop the exploration. Considering this situation, instead of trying to learn all the possible knowledge in one big step by analysing thousands of resources, it is more convenient to divide the learning process into several simpler steps that can be executed iteratively and recursively depending on how the learning process evolves. With this approach, each step can receive previously acquired knowledge as a bootstrap.

In this section, we describe several relevant considerations about how the aspects of execution flow control, finalisation, bootstrapping and efficient analysis of Web resources have been addressed in our approach in order to present a fully automatic and scalable methodology.

5.1. Adaptive corpus size

During the explanation we have mentioned that a set of Web resources is retrieved and analysed to extract candidates. However, how big should this set of Web resources be in order to obtain a set of results with good precision and recall? In previous experiments [11] we observed for different domains that the growth of the number of discovered concepts (and in consequence the recall) follows a logarithmic distribution in relation to the size of the search. This is caused in part by the redundancy of information [46] and the relevance-based sorting of Web sites made by the search engine [5]. Moreover, once a considerable number of discovered concepts has been reached, precision tends to decrease due to the growth of false candidates. As a consequence, analysing a large number of Web sites does not imply obtaining better results than with a smaller but more accurate corpus.

The ideal corpus size depends on many factors, like the domain's generality, the quality of the Web sources, the ranking policy of the search engine or the concreteness of the particular query. For example, when recursively evaluating deeper levels of taxonomic relationships, the amount of resources needed to obtain the potentially available domain subclasses tends to become, in most cases, smaller. This is because in the first levels (e.g. cancer) the spectrum of candidate concepts is generally wider than in the last ones (e.g. metastatic breast cancer), where the searched concept is much more restrictive and fewer valid results can be found.

Due to the automatic, domain independent and dynamic nature of our proposal, the corpus size cannot be set a priori. Thus, we need a mechanism that sets its size dynamically at execution time, depending on how the learning is evolving, in order to decide whether to continue evaluating more resources or not. We propose an incremental analytic methodology: the amount of Web resources analysed during each learning step is increased until the system decides that most of the knowledge for the particular query has already been acquired.

More concretely, for a particular query (i.e. each taxonomic pattern for each discovered concept), we retrieve and analyse a reduced set of Web resources (e.g. 50), extracting candidates and selecting related ones through the described statistical analyses. At the end of the process, if the percentage of selected terms from the list of extracted candidates is high, this indicates that the queried concept is particularly productive and a deeper analysis will potentially return more results. In this case, we query the search engine again with an offset to obtain an additional set of Web sites (e.g. the next 50 Web sites) and repeat the learning stage. The process is iteratively executed until the global percentage of selected terms (computed from the accumulated results of each iteration) falls below a certain threshold or no more knowledge has been acquired in the last iteration. This indicates that most of the knowledge related to the queried concept has already been acquired, because most of the last retrieved terms have been rejected. The particular threshold can be specified a priori by the user in order to tune the learning process (i.e. from general – high threshold – to exhaustive – low threshold – analysis) in a domain independent way.

Using the presented feedback mechanism throughout the full process we ensure, in addition, the correct finalization of each learning step with a dynamic adaptation of the effort dedicated to analysing each concept. Moreover, we are able to obtain results with good coverage regardless of the generality or concreteness of the specific domain. From the point of view of runtime performance, this approach provides a good learning/effort ratio, as the algorithm decides to continue with the analysis only for the apparently productive concepts, discarding the unproductive ones.
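As a rough outline of this feedback loop (not the authors' implementation), the following Python sketch enlarges the analysed corpus in fixed-size batches until the global selection ratio drops below the learning threshold or an iteration contributes nothing new. The helper functions are placeholders, and the batch size of 50 and the example threshold come from the description above.

```python
def fetch_batch(query, offset, size):
    """Placeholder: retrieve the next `size` Web results for `query` starting at `offset`."""
    raise NotImplementedError

def extract_candidates(web_sites, query):
    """Placeholder: pattern-based extraction of hyponym candidates (Sections 4.1/4.2)."""
    raise NotImplementedError

def select(candidates, query):
    """Placeholder: statistical selection of candidates (Scores B/C plus thresholds)."""
    raise NotImplementedError

def adaptive_corpus_analysis(query, batch_size=50, learning_threshold=0.6):
    """Incrementally enlarge the analysed corpus for one query until the global ratio of
    selected to extracted candidates falls below the learning threshold, or until an
    iteration yields no new knowledge (Section 5.1)."""
    selected_total, extracted_total = set(), set()
    offset = 0
    while True:
        web_sites = fetch_batch(query, offset, batch_size)
        extracted = set(extract_candidates(web_sites, query))
        newly_selected = select(extracted, query) - selected_total
        extracted_total |= extracted
        selected_total |= newly_selected
        offset += batch_size
        if not newly_selected or not extracted_total:
            break                    # nothing new: the query is exhausted
        if len(selected_total) / len(extracted_total) < learning_threshold:
            break                    # the query is no longer productive enough
    return selected_total
```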

5.2. Bootstrapping

Even though we start the taxonomy construction process from scratch, thanks to the presented incremental learning methodology a partial set of results is available after each learning step. In particular, once the first level of the taxonomy has been obtained, that knowledge base can be used in further steps as a bootstrap. In this manner we may be able to improve future searches (i.e. deeper taxonomic analyses or non-taxonomic relationships) by creating more contextualized searches and retrieving more concrete resources.

In more detail, for each acquired subclass, we repeat the same process already described but using the immediate superclass as a bootstrap of the learning process. We attach that superclass to each Web query performed (e.g. “leukaemia” AND “cancer”) for retrieving Web resources or computing statistics. In this manner, we try to specify the context in which the particular concept should be analysed. This is especially useful when the analysed subclass is polysemic or is used in several domains, because the additional knowledge used in the learning can guide it to the appropriate “sense”. As a consequence, the more knowledge is acquired, the more informed the learning process is.

Repeating this process recursively, deeper and more concrete levels of the taxonomy are obtained. The recursion finishes when no more new subclasses are selected. This point is controlled automatically and without supervision by means of the established selection threshold and the described process for controlling the finalization of the learning algorithm.

As a final step, once a multiple level taxonomy has been obtained, the hierarchy is processed in order to detect implicit relationships not directly discovered, like multiple inheritance (e.g. “metastatic breast cancer” is both a subclass of “breast cancer” and of “metastatic cancer”) or equivalences between different morphological forms of a concept discovered from different classes, using a stemming module. Redundant taxonomic links (e.g. “whale” is-a “aquatic mammal” and “mammal”; and “aquatic mammal” is-a “mammal”) are also processed, keeping the most specific relation (e.g. “whale” is-a “aquatic mammal”). In this way, we are able to return a more complete and coherent structure (some detailed examples of the kind of taxonomies that the system obtains are shown in Section 6).
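The bootstrap itself amounts to little more than query contextualisation; a minimal sketch (our own illustrative helper, not code from the paper) could look as follows.

```python
def contextualised_query(concept, superclasses):
    """Attach already-acquired superclasses to a query so that retrieved resources and
    hit counts are restricted to the intended sense of the concept, e.g.
    contextualised_query("leukaemia", ["cancer"]) -> '"leukaemia" AND "cancer"'."""
    return " AND ".join(f'"{term}"' for term in [concept, *superclasses])
```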

5.3. Efficient Web accessing

Even though our main objective is to offer the best results and not the shortest response time, there are some ways to speed up the process while maintaining the quality of the taxonomy. Due to the particular nature of our approach, much of the time is spent accessing the World Wide Web (whether querying a Web search engine or accessing a particular Web site). As the Web's response time is, in many situations, orders of magnitude higher than the time required to process the Web content, any improvement in this aspect can represent a great difference from a temporal perspective.

The first improvement is related to the Web search engine used to perform queries for obtaining Web sites or Web scale statistics. In order to avoid saturating one particular search engine, denial of service or lower performance due to introduced courtesy waits, we have implemented several interfaces with different search engines such as Google, Yahoo and MSNSearch. In this manner, we can alternate from one to another for several searches or even combine two of them for the same search. Analysing those search engines, we can conclude that Google has the best Web coverage, but its very limited access and extremely slow response times through the search API, which introduces courtesy waits of several seconds for consecutive queries, really hamper its usefulness. On the other hand, MSNSearch offers really good performance through the Web interface with no limitations (even when performing thousands of consecutive queries), at the cost of a reduced coverage, especially for the most concrete queries. Yahoo stays at an intermediate point, with slightly lower response times and better coverage than MSNSearch, but introducing access limitations. These behaviours are quite similar to those reported in an empirical study in [22]. Considering this situation, the combined use of different search engines becomes almost mandatory (e.g. Google for retrieving Web resources from which to extract candidates and MSNSearch or Yahoo for obtaining statistics from which to compute relative scores).

Due to this search engine dependence, one can wonder about the influence of the particular Web search engine used during the learning process on the final results. In our case, each queried search engine provides its own ranked list of Web resources and its own statistical measures about information distribution. We have drawn two main conclusions. On the one hand, once a significant amount of Web resources has been retrieved, the knowledge extracted using different search engines tends to be the same, due to the high redundancy of information on the Web; on the other hand, although the absolute statistical values for a specific query may be quite different (due to the particular estimation algorithm employed by each Web search engine), the final measures of relatedness between concepts obtained during the selection tend to be similar, as they are always relative measures.

The only observed difference (apart from response times) is that Google, due to its high Web coverage, is able to return a significantly larger amount of resources than other search engines for constrained queries (i.e. the deeper levels of the taxonomy), allowing us to acquire more specific knowledge.

The second point that influences the performance is the way in which the content of Web resources is accessed. More concretely, for a particular query that returns a set of Web sites that are potentially interesting to explore, we typically access each particular Web URL, download its content and start working on it. This can represent an important overhead depending on the Internet connection bandwidth, the size of the Web site and the server's response times. However, there are alternative ways of accessing partial Web content, such as the previews offered by Web search engines (typically 2 or 3 lines of text covering the queried term). In our case this can be particularly useful because our pattern based extraction of candidates only considers a short neighbourhood of the constructed query. However, those previews only cover one match of the particular query and, if several instances can be found on the same Web site, they will be omitted.

So, in order to decide the convenience of using one approach or the other for accessing Web content, we conducted a simple experiment: for several domains, we queried a Web search engine using Hearst's patterns and noun phrase based patterns as described in the previous section. Then, we evaluated the first N Web sites and counted the number of candidates that our system obtained in each case. The results (exemplified on the cancer domain) were the following:

• When using Hearst's patterns (e.g. “cancer such as”), we were able to extract 7 candidates from the first 10 Web sites, obtaining an extraction ratio of 0.7, with a maximum of 2 candidates per Web site. This low number was expected, due to the concrete nature of the pattern.
• When using the noun-phrase based pattern (i.e. “cancer”), we were able to extract 112 candidates from the first 10 Web sites, obtaining an extraction ratio of 11.2, with a maximum of 31 candidates per Web site. This was also expected, as these patterns are typically found in indexes, labels or partial classifications.

In consequence, for the first case, it is quite convenient to use Web search previews, which typically cover the maximum of 1 or 2 matches per site. This speeds things up greatly, as parsing one page of results is equivalent, in terms of learning performance, to accessing and parsing up to 50 individual Web sites. On the contrary, for the second case, we decided to access and parse the full Web sites due to the high amount of useful information that we are able to obtain.

6. Evaluation

This section has two main purposes. On the one hand, we will show the potential learning improvement in the results that the designed approach may offer in comparison with other alternatives that we have also considered, for a specific case of study. On the other hand, we will show and evaluate the results that our learning methodology is able to return for several well distinguished domains of knowledge.

In general, the evaluation of ontologies (in this case, only of their taxonomic aspect) is recognized to be an open problem [2]. The most common way of evaluating them is manually, where a human being checks the results and evaluates them according to his/her knowledge, or where results are compared against a standard composed by an expert (some examples in [35,39]). Our approach is also evaluated manually. Whenever possible, a representative human made classification for the domain is taken as the ideal model of taxonomy to be achieved (Gold Standard). The Gold Standard evaluation approach assumes that it contains all the extractable concepts from a certain corpus and only those. However, in reality, Gold Standards (i.e. existing ontologies) omit many potential concepts in the corpus and introduce concepts from other sources (such as the domain knowledge of the expert) [31]. In order to compensate for those imperfections, and in cases in which no standards are available, a concept-per-concept evaluation by a domain expert can be performed [41].

The concept-per-concept evaluation is carried out by analysing the raw list of taxonomic candidates retrieved during the corpus analysis. The domain relatedness of each concept and the validity of the taxonomic relationships are evaluated by a domain expert or using an existing taxonomy/ontology. This is then compared against the list of selected and rejected concepts defined as a function of the computed Web based statistics, computing the standard measures of recall, precision and F-measure.

Recall (5) shows how much of the existing knowledge is extracted. It is obtained by counting the number of truly taxonomically related concepts selected by the algorithm and dividing it by the total number of taxonomic terms of a reference ontology. Sometimes, especially if there does not exist a standard classification that can tell us the full set of expected terms for the domain, we can compute the Local Recall (6). This measure considers that the domain's scope is limited to the corpus of documents analysed by the learning algorithm. In our case, the domain's scope is determined by the full set of candidates retrieved from the analysed corpus of Web resources. As this composes a finite set whose correctness can be evaluated, local recall can be computed by dividing the number of correctly selected concepts by the full set of correctly retrieved entities. Despite its locality, this score can also give us a measure of how good the selection procedure is at accepting or rejecting candidates. This metric is consistent with the recall metric used in TREC conferences [17] and has been used by several authors, such as [34], for evaluating automatically obtained knowledge.

Recall = #correctly selected entities / #domain entities,   (5)

Local_Recall = #correctly selected entities / #correctly retrieved entities.   (6)

Precision (7) specifies to what extent the knowledge is extracted correctly. In this case we calculate the ratio between the correctly extracted concepts and the whole number of extracted ones. Precision can be computed by evaluating the correctness of the selected entities against a gold standard, the expert's criteria or other learning approaches.

Precision = #correctly selected entities / #total selected entities.   (7)

In addition to those individual measures, the F-Measure (8) provides the weighted harmonic mean of precision and recall, summarizing the global performance of the selection process. This eases the comparison of the learning quality of different approaches. In the same way as for the Recall, a Local F-Measure (9) can be computed considering the Local Recall instead of the global one.

F-Measure = (2 * Precision * Recall) / (Precision + Recall),   (8)

Local_F-Measure = (2 * Precision * Local_Recall) / (Precision + Local_Recall).   (9)
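For completeness, the following small Python helper (an illustrative addition with hypothetical argument names) computes the measures of Eqs. (5)–(9) from the raw counts described above.

```python
def evaluation_metrics(correctly_selected: int, total_selected: int,
                       correctly_retrieved: int, domain_entities: int) -> dict:
    """Compute the measures of Eqs. (5)-(9) from raw evaluation counts."""
    recall = correctly_selected / domain_entities if domain_entities else 0.0
    local_recall = correctly_selected / correctly_retrieved if correctly_retrieved else 0.0
    precision = correctly_selected / total_selected if total_selected else 0.0

    def f_measure(p, r):
        # Weighted harmonic mean of precision and a recall variant.
        return 2 * p * r / (p + r) if (p + r) else 0.0

    return {"recall": recall,
            "local_recall": local_recall,
            "precision": precision,
            "f_measure": f_measure(precision, recall),
            "local_f_measure": f_measure(precision, local_recall)}
```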


Note that for all of the presented evaluations and results of this section, a learning threshold of 60% and the default selection threshold guidelines introduced in Section 4 have been applied. All queries were performed against MSNSearch, as it does not impose any limitation on the allowed number of queries.

6.1. Evaluating the taxonomy learning hypotheses

Once the general evaluation procedure has been explained, we are ready to perform some tests. First, we start by checking some of the hypotheses mentioned in Sections 3 and 4 regarding how different combinations of patterns and Web scale statistical scores perform. We have used one taxonomic iteration of the Cancer domain as a case of study because, as introduced in Section 3.1, it covers all of the different extraction cases that we have identified and it is widely considered in many standard repositories. Different executions under the same conditions are performed with several implementations of the learning procedure (considering different linguistic patterns and Web scale statistics). Results are then evaluated against a Gold Standard and conclusions about the learning performance are drawn.

As Gold Standard we have used the MESH (http://www.nlm.nih.gov/mesh/) classification of neoplasms (the scientific term for referring to cancers). Concretely, MESH (Medical Subject Headings) considers different overlapping ways of classifying neoplasms. We have used the classification “Neoplasms by Site – Tree C04.588” because this hierarchy offers the widest coverage for the domain.

The concrete evaluation procedure is performed in the following way: every concept of the list of retrieved ones is queried on the MESH Browser. If the query results in one match corresponding to the C04.588 hierarchy (which indicates that it belongs, taxonomically, to the cancer domain), it is considered correct. When a concept is not found, considering the limitations presented by Gold Standard evaluations as stated above, an expert is requested to check whether the particular concept is taxonomically correct or not. For example, metastatic cancer is not considered a cancer subclass in MESH, as MESH classifies cancers by parts of the body, but it can be considered a correct subclass of cancer according to the stage of development. Those concepts (e.g. chemotherapy) which may belong to the cancer domain but are not taxonomically related are considered incorrect. When the full set of concepts has been analysed, the result is compared against the selection and rejection decisions made by the learning algorithm in order to detect correctly or incorrectly selected or rejected concepts.

formed by the developed learning algorithm in order to detect correctly or incorrectly selected or rejected concepts. As a result, we can compute precision and local recall measures (considering the list of retrieved concepts as the domain scope) as defined above. In order to compute the global recall (that considers the full domain scope), we consider the number of subclasses of the C04.588 tree (102) plus those identified as correct by the expert. The first test regards the selection of Hearst based candidates through statistical analyses. In Section 4, 3 scores where defined, being Score_A the most widely used [36] and Score_C the one selected in our approach as the best to contextualize queries and select only the most related candidates. In order to prove this hypothesis, we have run 3 one-shot taxonomic executions with the same conditions and compared the behaviour of the selection procedure using each score. Figure 2 shows the result of the evaluation of the selection procedures. We can see how there is a direct relationship between the degree of contextualization that each score brings and its precision in the selection procedure. However, the inverse relation can be observed for the local recall. Considering that the Hearst’s extraction is the first step of the learning process and that the pre-rejected terms can be re-evaluated during the noun phrase based extraction stage, we prefer to maximize the precision of this phase. In consequence, as Score_C improves the other ones in terms of F-Measure (by margins of 5–15% locally) and maximizes the precision greatly (over 30–45%), it is the most adequate for
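The per-candidate checking against MESH (with the expert as fallback) and its comparison with the algorithm's decisions can be sketched as follows. This is a minimal sketch under our reading of the procedure: gold_standard and expert_judgement are placeholders for the C04.588 lookup and the manual revision, and the four resulting counts correspond to the selected/rejected versus right/wrong cells reported later in Tables 3–5.

    from collections import Counter

    def tabulate_decisions(candidates, algorithm_selected, gold_standard, expert_judgement):
        """Count selected/rejected candidates that are taxonomically right/wrong.

        candidates:         all concepts retrieved for the evaluated (sub)class
        algorithm_selected: subset of candidates accepted by the learning algorithm
        gold_standard:      set of concepts known to be taxonomically correct
                            (e.g. the terms of the MESH C04.588 subtree)
        expert_judgement:   callable concept -> bool used for concepts that the
                            gold standard does not cover
        """
        counts = Counter()
        for concept in candidates:
            is_right = concept in gold_standard or expert_judgement(concept)
            decision = "selected" if concept in algorithm_selected else "rejected"
            counts[(decision, "right" if is_right else "wrong")] += 1
        return counts

    # Precision and local recall then follow directly from the counts:
    #   precision    = counts[("selected", "right")] / (counts[("selected", "right")] + counts[("selected", "wrong")])
    #   local recall = counts[("selected", "right")] / (counts[("selected", "right")] + counts[("rejected", "right")])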

The next step shows the convenience of combining the different linguistic patterns in the way proposed in Section 4. Several tests have been performed, considering each pattern independently (Hearst's patterns with Score_C and the noun-phrase-based patterns with Score_B, as described in Section 4) and both together; a sketch of this two-stage combination is given below. Analysing the results shown in Fig. 3, we can see that both kinds of patterns behave in a quite complementary way: Hearst's patterns in conjunction with Score_C tend to show high precision (89%) but low local recall (67%), whereas the noun-phrase-based patterns with Score_B present the inverse behaviour (79% and 92%, respectively). This is very convenient, as they compensate each other and, as shown by the F-Measure, finally provide a result that is considerably better (by margins of 12–28%) than the one obtained with a single pattern.

Considering the extraction cases presented in Section 3.1, we can observe that Hearst's approach is able to retrieve and adequately distinguish cases #1 (cancer such as leukaemia) and #2 (cancer such as radiotherapy), which can only be retrieved through Hearst's patterns, providing a good selection precision. However, recall, mainly affected by the incorrect rejection of case #4 (cancer such as breast cancer), is low due to the restrictive selection procedure, which negatively affects queries with many terms. Case #3 (cancer such as lung) is also present, slightly affecting the precision, as ellipsis is a problem when using these patterns.
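The two-stage combination evaluated in Fig. 3 can be summarized, under our reading, as the union of two filtered candidate pools, as sketched below. Here score_c and score_b stand for the Web-scale statistics of Section 4 (not reproduced in this section) and are treated as black-box callables; the threshold parameter names are ours, so this illustrates the control flow rather than the exact implementation.

    def combine_selections(hearst_candidates, np_candidates,
                           score_c, score_b,
                           score_c_threshold, score_b_threshold):
        """Union of the high-precision and high-recall selection stages.

        hearst_candidates: hyponym candidates extracted with Hearst's patterns
        np_candidates:     candidates extracted with the noun-phrase-based patterns
        score_c, score_b:  callables term -> float implementing the statistical
                           scores of Section 4 (black boxes in this sketch)
        """
        # First stage: Hearst extractions filtered with the restrictive Score_C
        # (high precision, lower local recall).
        selected = {t for t in hearst_candidates if score_c(t) >= score_c_threshold}
        # Second stage: noun-phrase-based extractions filtered with the less
        # restrictive Score_B; terms pre-rejected above can be recovered here.
        selected |= {t for t in np_candidates if score_b(t) >= score_b_threshold}
        return selected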

Fig. 2. Evaluation of the performance of each score used for the selection of candidates extracted through Hearst’s patterns.



Fig. 3. Evaluation of the performance of extraction and selection of candidates according to the specific pattern(s) employed.

Then, adding the noun-phrase-based extractions and the final selection procedure over the partial results, we can improve the global recall, thanks to the selection of case #3 extractions through the less restrictive queries based on Score_B (while maintaining a good precision), and correct the selection of case #4 extractions, as these patterns do not suffer from ellipsis. At the end of the process, we have been able to obtain a good global recall for the domain (61.8%), maintaining a good level of precision (82%). These facts are summarized in the improved F-Measure (70.5%, in contrast to 61.5% and 41.5%).

6.2. Evaluating several domains of knowledge

After discussing the potential improvement that our approach can bring to taxonomy learning, we present complete taxonomic evaluations performed over well distinguished domains. The evaluation criterion is the same as in the previous cases but, due to the enormous amount of candidates to evaluate (more than ten thousand in total), the evaluation has been applied to those classes which have at least 100 candidates (the most representative ones).

First, the cancer domain used up to this point is evaluated by analysing the multi-level taxonomy (a part of which is presented in Fig. 4). It can be considered a good test bed for both types of patterns, as it is composed of single-word terms like leukaemia and noun phrases like breast cancer in similar proportions. The evaluation procedure is the same as the one described in the previous section.

In this case, however, the expert's intervention is higher, as concrete multiple-word terms are barely covered by MESH. Note that some of the mistakes present in the structure (e.g. classes like "diagnosing cancer") are caused by the particular syntactic analyser used during the parsing of text. Considering only those classes with more than 100 subclass candidates, we have evaluated the 1st-level taxonomy (with 260 candidates for cancer specialisations) and 16 subclasses of the 2nd taxonomic level (which represent a total set of 2249 candidates). The candidates belonging to subclasses wrongly selected in the 1st taxonomic level (e.g. surgery) have been evaluated independently (e.g. surgery is an incorrect subclass of cancer, but maxillofacial surgery is a correct type of surgery). The results of the evaluation are summarized in Table 3 and Fig. 5.

Next, we have selected two extreme cases. The first is the mammal domain, shown in Fig. 6, in which single-word terms prevail (e.g. cat, cow, dog, primate, but also aquatic mammal); the second is the sensor domain, shown in Fig. 7, in which specialisations expressed by adding nouns and adjectives to the initial terms are the most common case (e.g. temperature sensor, biological sensor, pressure sensor, but also sonar). The mammal domain is, in fact, especially interesting due to the large amount of equivalent terms and redundant taxonomic cycles (e.g. whales is-a mammal, aquatic_mammal, marine_mammal and cetaceans; aquatic_mammal is-a mammal; and cetaceans is-a marine_mammal). This shows the effectiveness of the taxonomy post-processing stage introduced in Section 5.2. In both cases, the authors have performed a concept-per-concept multi-level evaluation.
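The post-processing stage of Section 5.2 is not reproduced in this section; as a generic illustration of the kind of redundancy it has to remove, the following sketch performs a plain transitive reduction, dropping direct is-a links that are already implied by a longer is-a path (e.g. whales is-a mammal when whales is-a aquatic_mammal is-a mammal is also present). This is an illustration only, not necessarily the authors' exact procedure.

    def remove_redundant_isa(edges):
        """Drop direct is-a edges that are implied by a longer is-a path.

        edges: set of (child, parent) pairs, e.g. {("whale", "mammal"),
               ("whale", "aquatic_mammal"), ("aquatic_mammal", "mammal")}
        """
        parents = {}
        for child, parent in edges:
            parents.setdefault(child, set()).add(parent)

        def reachable_indirectly(start, target):
            # Is `target` reachable from `start` without using the direct
            # start -> target edge?
            stack = [p for p in parents.get(start, ()) if p != target]
            seen = set()
            while stack:
                node = stack.pop()
                if node == target:
                    return True
                if node in seen:
                    continue
                seen.add(node)
                stack.extend(parents.get(node, ()))
            return False

        return {(c, p) for c, p in edges if not reachable_indirectly(c, p)}

For the example set in the docstring, only the whale is-a aquatic_mammal and aquatic_mammal is-a mammal links survive, since the direct whale is-a mammal link is implied by them.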



Fig. 4. Part of the multi level Cancer taxonomy with a total of 1458 classes, obtained after 21 hours by analysing 11160 Web sites and performing 41653 Web search queries. The first taxonomic level was obtained after 9 minutes.

Table 3
Taxonomic evaluation for the Cancer domain

1st taxonomic level
              Right    Wrong    Total
  Selected       73       16       89
  Rejected      159       12      171
  Total         232       28      260

2nd taxonomic level (16 classes)
              Right    Wrong    Total
  Selected      417      143      560
  Rejected     1641       48     1689
  Total        2058      191     2249

1st and 2nd taxonomic level
              Right    Wrong    Total
  Selected      490      159      649
  Rejected     1800       60     1860
  Total        2290      219     2509

Number of correctly and incorrectly selected and rejected classes. A total of 16 subclasses evaluated for the 2nd level (those with more than 100 candidates).
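As a sanity check of how these counts map onto the measures defined in (5)–(9) (assuming the Right/Wrong columns record whether each candidate is taxonomically correct), the first-level precision can be recovered directly from the Selected row:

Precision = 73 / (73 + 16) = 73/89 ≈ 0.82,

which is consistent with the 82% precision reported above for the cancer domain.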

Precision, local recall and local F-Measure have been computed in the same way as in the cancer domain but, as no Gold Standard has been used, no global recall is provided.

For the mammal domain, evaluation is quite easy, as one only has to check whether a particular concept is a mammal (e.g. dolphin, dog, cat, etc.) or a mammal category (e.g. aquatic mammal, marine mammal, etc.). We have evaluated the 1st taxonomic level (with 245 candidates) and 19 subclasses of the 2nd level, representing a total of 2493 candidates. Results are summarized in Table 4 and Fig. 8.

For the sensor domain, subclasses have been considered correct if they indicate the measured magnitude (e.g. force, speed, temperature, etc.) and/or the type of measuring transducer (e.g. optic, electrochemical, etc.). We have evaluated the 1st taxonomic level (with 262 candidates) and 12 subclasses of the 2nd level, representing a total of 1986 candidates. Results are summarized in Table 5 and Fig. 9.

The presented results show a consistent global performance, with a very similar F-Measure across the different domains: values above 80% for the first level and above 75% for the considered subclasses.



Fig. 5. Taxonomic evaluation for the Cancer domain.

Fig. 6. Part of the multi level Mammal taxonomy with a total of 957 classes, obtained after 16 hours by analysing 12747 Web sites and performing 46308 Web search queries. The first taxonomic level was obtained after 16 minutes.



Fig. 7. Part of the multi level Sensor taxonomy with a total of 868 classes, obtained after 15 hours by analysing 12591 Web sites and performing 31455 Web search queries. The first taxonomic level was obtained after 17 minutes.

Table 4
Taxonomic evaluation for the Mammal domain

1st taxonomic level
              Right    Wrong    Total
  Selected       79        5       84
  Rejected      141       20      161
  Total         220       25      245

2nd taxonomic level (19 classes)
              Right    Wrong    Total
  Selected      173       54      227
  Rejected     2207       59     2266
  Total        2380      113     2493

1st and 2nd taxonomic level
              Right    Wrong    Total
  Selected      252       59      311
  Rejected     2348       79     2427
  Total        2600      138     2738

Number of correctly and incorrectly selected and rejected classes. A total of 19 subclasses evaluated for the 2nd level (those with more than 100 candidates).

This shows that our approach performs well and robustly, independently of the taxonomic nature of the particular domain of knowledge (in which one type of pattern may be more or less suitable). One can also see from the tables that the percentage of rejected candidates is much bigger for the subclasses than for the root concept. This is expected, as concrete concepts present a much narrower scope (i.e. fewer valid subclasses). This fact slightly hampers the quality of the deeper taxonomic levels, as the system has to deal with a larger number of false candidates. However, proportionally, extraction quality is maintained at a reliable level.

Considering each domain individually, cancer offers the most consistent results, followed by sensor. Mammal, on the contrary, shows the largest divergence between the 1st taxonomic level and the rest. This domain's quality is hampered by its generality and by some problems regarding polysemy (e.g. baseball bat, hot dog, etc.). Bootstrapped information helps to minimize the problem by contextualizing queries, but the unsupervised learning may still be affected due to the lack of semantic understanding.



Fig. 8. Taxonomic evaluation for the Mammal domain.

Table 5
Taxonomic evaluation for the Sensor domain

1st taxonomic level
              Right    Wrong    Total
  Selected       75       18       93
  Rejected      159       10      169
  Total         234       28      262

2nd taxonomic level (12 classes)
              Right    Wrong    Total
  Selected      211       73      284
  Rejected     1380       60     1440
  Total        1591      133     1724

1st and 2nd taxonomic level
              Right    Wrong    Total
  Selected      286       91      377
  Rejected     1539       70     1609
  Total        1825      161     1986

Number of correctly and incorrectly selected and rejected classes. A total of 12 subclasses were evaluated for the 2nd level (those with more than 100 candidates).

7. Conclusions and future work

Our novel approach for learning taxonomies from scratch is based on a specifically designed combination of linguistic patterns and appropriate statistical scores, obtaining high-quality results for several well distinguished domains. The presented methodology is unsupervised in the sense that it does not start from any kind of predefined knowledge and, in consequence, it can be applied to domains that are not typically considered in semantic repositories (such as highly specific technological ones [11]).

We only use domain-independent linguistic patterns (which have proven their effectiveness in the past) and a set of predefined selection thresholds. These threshold values may be established a priori to tune the system's behaviour. In any case, no user or expert supervision is needed during the whole learning process. Moreover, the automatic operation, based on an incremental learning procedure with bootstrapping techniques, allows the learning process to be controlled and contextualized autonomously.

In addition to these interesting characteristics, the developed method has been designed in a way that distinguishes it from other classical ontology learning approaches, as it is fully integrated within the Web environment. As mentioned in the introduction, this environment adds new difficulties to information processing, derived from the untrustiness, size, noise and lack of structure of Web resources. However, other characteristics, such as the redundancy of information and the existence of Web search engines, may help to tackle them. Regarding the first point, redundancy allows us to infer information relevance, manage untrustiness and develop lightweight analytical approaches that are adequate and scalable for the size of the Web. In relation to the second point, Web search engines, classically conceived as an end-user interface for accessing Web resources, hide a lot of potential regarding the inference of information distribution. Highly valuable Web-scale statistics can be extracted efficiently if adequate search queries are performed. This can save us from analysing large amounts of resources and helps us to obtain scalable learning methodologies. Moreover, their lack of any semantic content makes them suitable for any domain of knowledge.



Fig. 9. Taxonomic evaluation for the Sensor domain.

This is especially interesting in technological domains that are highly dynamic and evolving.

The most important application of the results provided by our automatic and unsupervised methodology is the development of ontologies. An ontology is traditionally built entirely by hand, which requires a great effort. In this sense, automatic methodologies for knowledge acquisition such as the present proposal are a great help for knowledge engineers. Ontologies are crucial in many knowledge-intensive tasks such as Electronic Commerce, Knowledge Management, Multi-Agent Systems or the Semantic Web. Concretely, the Semantic Web relies heavily on ontologies to provide taxonomies of domain-specific terms, and there is wide agreement that a critical mass of taxonomies and ontologies is needed for the Semantic Web [44]. In this respect, the domain taxonomies obtained by our proposal are especially adequate because they have been obtained directly from the Web.

As future lines of research, some topics can be proposed:

• When dealing with natural language resources, ambiguity problems may arise. In our case, synonymy, as mentioned in Section 3.1, can represent a problem. For that reason, we developed a complementary method for dealing with synonymy [10], especially adapted to our working environment (Web resources, search engines and lack of predefined knowledge). We plan to refine that technique and integrate it into the learning methodology in order to improve the quality of the results.

• The recall of the taxonomy learning process may be improved if additional linguistic patterns for hyponymy detection are applied. Concretely, some authors [13,27,29,42] have worked on refining Hearst's patterns. However, many of the new regular expressions are subtle variations of the general patterns or represent specific forms that are rarely used. In consequence, it should be studied whether adding such concrete patterns to the taxonomic learning yields a final improvement or merely adds overhead to the learning process. In our opinion, the basic but general pattern set used up to this moment is enough to obtain good coverage (as a function of the established learning thresholds), thanks to the size and redundancy of information on the Web.

• Ways of automating the evaluation of the results will be studied. In this sense, computing semantic relatedness measures between concepts [45] against a general-purpose repository such as WordNet could be an appropriate approach for checking the suitability of the obtained relationships in an automatic and domain-independent way; a minimal sketch of this idea is given below.
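The sketch below uses NLTK's WordNet interface as a stand-in for the WordNet::Similarity package of [45]; it assumes the WordNet corpus has been downloaded, and the function name and return convention are ours rather than part of the proposal.

    # Requires: pip install nltk, plus nltk.download('wordnet') beforehand.
    from nltk.corpus import wordnet as wn

    def wordnet_check(child, parent):
        """Rough automatic check of a learned is-a pair against WordNet.

        Returns (relatedness, is_hyponym): the best Wu-Palmer similarity over
        all noun-sense combinations, and whether some sense of `child` has
        some sense of `parent` among its transitive hypernyms.
        """
        child_syns = wn.synsets(child.lower().replace(" ", "_"), pos=wn.NOUN)
        parent_syns = wn.synsets(parent.lower().replace(" ", "_"), pos=wn.NOUN)
        if not child_syns or not parent_syns:
            return None, False  # term not covered by WordNet: leave for manual revision

        relatedness = max((c.wup_similarity(p) or 0.0)
                          for c in child_syns for p in parent_syns)
        is_hyponym = any(p in c.closure(lambda s: s.hypernyms())
                         for c in child_syns for p in parent_syns)
        return relatedness, is_hyponym

    # e.g. wordnet_check("dolphin", "mammal") should report a high relatedness
    # and is_hyponym == True, while a wrong pair such as ("chemotherapy",
    # "cancer") should not be confirmed as a hyponym.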

Acknowledgements

The work has been supported by the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya and the Fons Social Europeu.

References

[1] A. Faatz and R. Steinmetz, Ontology enrichment with texts from the WWW, in: Semantic Web Mining 2nd Workshop, 13th European Conference on Machine Learning and 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD-2002), Helsinki, Finland, August 19–23, 2002.
[2] A. Gómez-Pérez, M. Fernández-López and O. Corcho, Ontological Engineering, 2nd edn, Springer-Verlag, Berlin, Germany, 2004.
[3] A. Maedche and S. Staab, Ontology learning for the semantic Web, IEEE Intelligent Systems, Special Issue on the Semantic Web 16(2) (2001), 72–79.
[4] C. Lee, J. Na and C. Khoo, Ontology learning for medical digital libraries, in: Digital Libraries: Technology and Management of Indigenous Knowledge for Global Access. 6th International Conference on Asian Digital Libraries, ICADL 2003, Kuala Lumpur, Malaysia, December 8–12, 2003, T.M.T. Sembok, H.B. Zaman, H. Chen, S. Urs and S.H. Myaeng, eds, Lecture Notes in Computer Science, Vol. 2911, Springer-Verlag, Berlin, Germany, 2003, pp. 302–305.
[5] C. Ridings and M. Shishigin, PageRank uncovered, Online Research Report, available at: http://www.voelspriet2.nl/PageRank.pdf, 2002.
[6] D. Faure and T. Poibeau, First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX, in: Proceedings of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence (ECAI'00), Berlin, Germany, August 20–25, 2000, S. Staab, A. Maedche, C. Nedellec and P. Wiemer-Hastings, eds, 2000, pp. 7–12.
[7] D.I. Moldovan and R.C. Girju, An interactive tool for the rapid development of Knowledge Bases, International Journal on Artificial Intelligence Tools 10(1–2) (2001), 65–86.
[8] D. Lonsdale, Y. Ding, D.W. Embley and A. Melby, Peppering knowledge sources with SALT: Boosting conceptual content for ontology generation, in: Proceedings of the AAAI Workshop for Semantic Web Meets Language Resources, 18th National Conference on Artificial Intelligence, Edmonton, Canada, July 28, 2002, pp. 30–36.
[9] D. Lin, Automatic retrieval and clustering of similar words, in: Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Vol. 2, Morristown, USA, Association for Computational Linguistics, 1998, pp. 768–774.
[10] D. Sánchez and A. Moreno, Automatic discovery of synonyms and lexicalizations from the web, in: Artificial Intelligence Research and Development, Vol. 131, B. López, J. Meléndez, P. Radeva and J. Vitrià, eds, IOS Press, The Netherlands, 2005, pp. 205–212.
[11] D. Sánchez and A. Moreno, A methodology for knowledge acquisition from the web, International Journal of Knowledge-Based and Intelligent Engineering Systems 10(6) (2005), 453–475.


[12] Economist, Corpus colossal: How well does the world wide web represent human language?, The Economist, January 20 (2005).
[13] E. Agichtein and L. Gravano, Snowball: Extracting relations from large plain-text collections, in: Proceedings of the 5th ACM International Conference on Digital Libraries (ACM DL00), San Antonio, TX, 2000, pp. 85–94.
[14] E. Agirre, O. Ansa, E. Hovy and D. Martinez, Enriching very large ontologies using the WWW, in: Proceedings of the Workshop on Ontology Construction, 14th European Conference on Artificial Intelligence (ECAI-00), Berlin, Germany, August 20–25, 2000.
[15] E. Alfonseca and S. Manandhar, An unsupervised method for general named entity recognition and automated concept discovery, in: Proceedings of the 1st International Conference on General WordNet, Mysore, India, January 21–25, 2002, pp. 1–9.
[16] E. Brill, J. Lin, M. Banko and S.A. Dumais, Data-intensive question answering, in: Proceedings of the Tenth Text Retrieval Conference (TREC-2001), 2001, pp. 393–400.
[17] E.M. Voorhees, Overview of the TREC 2001 question answering track, in: Proceedings of the Tenth Text Retrieval Conference (TREC-2001), 2001, pp. 42–51.
[18] E. Morin, Automatic acquisition of semantic relations between terms from technical corpora, in: Proceedings of the Fifth International Congress on Terminology and Knowledge Engineering (TKE-99), TermNet-Verlag, Vienna, 1999, pp. 268–278.
[19] F. Ciravegna, A. Dingli, D. Guthrie and Y. Wilks, Integrating information to bootstrap information extraction from Web sites, in: Proceedings of the Workshop on Information Integration on the Web, 18th International Joint Conference on Artificial Intelligence, S. Kambhampati and C.A. Knoblock, eds, Acapulco, Mexico, August 9–10, 2003, pp. 9–14.
[20] F. Keller, M. Lapata and O. Ourioupina, Using the web to overcome data sparseness, in: Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP-02), Vol. 10, Philadelphia, USA, July 6–7, 2002, Association of Computational Linguistics, 2000, pp. 230–237.
[21] G. Grefenstette, SQLET: Short query linguistic expansion techniques: Palliating one-word queries by providing intermediate structure to text, in: International Summer School on Information Extraction (SCIE'97), Lecture Notes in Artificial Intelligence, Vol. 1299, Springer-Verlag, London, 1997, pp. 97–114.
[22] J. Dujmovic and H. Bai, Evaluation and comparison of search engines using the LSP method, ComSIS 3(2) (2006), 711–722.
[23] J.U. Kietz, A. Maedche and E. Volz, A method for semiautomatic ontology acquisition from a corporate Intranet, in: Proceedings of Workshop on Ontologies and Texts, 12th International Conference on Knowledge Engineering and Knowledge Management (EKAW'00), Juan-Les-Pins, France, October, 2000, N. Aussenac-Gilles, B. Biébow and S. Szulman, eds, Lecture Notes in Artificial Intelligence, Vol. 1937, Springer-Verlag, Amsterdam, 2000, pp. 37–50.
[24] K.M. Gupta, D.W. Aha, E. Marsh and E. Maney, An architecture for engineering sublanguage WordNets, in: Proceedings of the First International Conference on Global WordNet, Mysore, India, 2002, pp. 207–216.
[25] K.W. Church, W. Gale, P. Hanks and D. Hindle, Using statistics in lexical analysis, in: Lexical Acquisition: Exploiting on-Line Resources to Build a Lexicon, U. Zernik, ed., New Jersey, USA, 1991, pp. 115–164.
[26] L. Khan and F. Luo, Ontology construction for information selection, in: Proceedings of 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'02), IEEE Computer Society, Washington, DC, USA, 2002, pp. 122–127.



[27] L.M. Iwanska, N. Mata and K. Kruger, Fully automatic acquisition of taxonomic knowledge from large corpora of texts, in: Natural Language Processing and Knowledge Processing, MIT/AAAI Press, Cambridge, 2000, pp. 335–345.
[28] M.A. Hearst, Automatic acquisition of hyponyms from large text corpora, in: Proceedings of 14th International Conference on Computational Linguistics (COLING-92), Vol. 2, Morristown, USA, Association for Computational Linguistics, 1992, pp. 539–545.
[29] M. Pasca, Acquisition of categorized named entities for Web search, in: Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM'04), ACM Press, New York, 2004, pp. 137–145.
[30] M. Pasca, Finding instance names and alternative glosses on the Web, in: Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing 2005), Mexico City, Mexico, Lecture Notes in Computer Science, Vol. 3406, Springer-Verlag, Berlin, 2005, pp. 280–292.
[31] M. Sabou, Building web service ontologies, PhD thesis, SIKS Dissertation Series, UK, 2006.
[32] M. Volk, Using the Web as corpus for linguistic research, in: Catcher of the Meaning, R. Pajusalu and T. Hennoste, eds, Publications of the Department of General Linguistics 3, University of Tartu, Estonia, 2002, pp. 1–10.
[33] N. Aussenac-Gilles, B. Biébow and S. Szulman, Corpus analysis for conceptual modelling, in: Proceeding of the Workshop on Ontologies and Text, Knowledge Engineering and Knowledge Management: Methods, Models and Tools, 12th International Conference on Knowledge Engineering and Knowledge Management (EKAW'2000), Juan-Les-Pins, France, October 2–6, 2000, pp. 13–20.
[34] O. Etzioni, M. Cafarella, D. Downey, A.M. Popescu, T. Shaked, S. Soderland, D.S. Weld and A. Yates, Unsupervised named-entity extraction from the Web: An experimental study, Artificial Intelligence 165(1) (2005), 91–134.
[35] P. Cimiano and S. Staab, Learning by googling, SIGKDD Explorations 6(2) (2004), 24–33.
[36] P.D. Turney, Mining the Web for synonyms: PMI-IR versus LSA on TOEFL, in: Proceedings of the 12th European Conference on Machine Learning (ECML-2001), Freiburg, Germany, September 3–7, 2001, L. Raedt and P. Flach, eds, Lecture Notes in Computer Science, Vol. 2167, Springer-Verlag, Germany, 2001, pp. 491–502.

[37] P. Resnik and N. Smith, The web as a parallel corpus, Computational Linguistics 29(3) (2003), 349–380.
[38] R. Agrawal, T. Imielinksi and A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, P. Buneman and S. Jajodia, eds, ACM Press, Washington, DC, 1993, pp. 207–216.
[39] R. Cilibrasi and P.M.B. Vitanyi, Automatic meaning discovery using Google, Online Report, available at: http://xxx.lanl.gov/abs/cs.CL/0412098, December, 2004.
[40] R. Mihalcea and P. Edmonds, The Senseval-3 English lexical sample task, in: Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, R. Mihalcea and P. Edmonds, eds, Barcelona, Spain, Association for Computational Linguistics, 2004, pp. 25–28.
[41] R. Navigli and P. Velardi, Learning domain ontologies from document warehouses and dedicated Web sites, Computational Linguistics 30(2) (2004), 151–179.
[42] R. Snow, D. Jurafsky and A.Y. Ng, Learning syntactic patterns for automatic hypernym discovery, Advances in Neural Information Processing Systems 17 (2004), 1297–1304.
[43] S. Lamparter, M. Ehrig and C. Tempich, Knowledge extraction from classification schemas, in: Proceedings of On the Move to Meaningful Internet Systems 2004 (CoopIS/DOA/ODBASE 2004), OTM Confederated International Conferences, Agia, Cyprus, October 25–29, 2004, Lecture Notes in Computer Science, Vol. 3290, Springer-Verlag, Berlin, 2004, pp. 618–636.
[44] T. Berners-Lee, J. Hendler and O. Lassila, The semantic web, Scientific American 284(5) (2001), 34–43.
[45] T. Pedersen, S. Patwardhan and J. Michelizzi, WordNet::Similarity – measuring the relatedness of concepts, in: Proceedings of 5th Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-04), San Jose, CA, 2004.
[46] T.B. Jansen, The effect of query complexity on Web searching results, Information Research 6(1) (2000), 87–88.
[47] U. Hahn and S. Schulz, Towards very large terminological knowledge bases: A case study from Medicine, in: Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, Lecture Notes in Computer Science, Vol. 1822, Springer-Verlag, London, 2000, pp. 176–186.
