Knowl Inf Syst (2011) 27:393–418
DOI 10.1007/s10115-010-0302-3
REGULAR PAPER

Content annotation for the semantic web: an automatic web-based approach

David Sánchez · David Isern · Miquel Millan

Received: 7 July 2008 / Revised: 26 October 2009 / Accepted: 4 May 2010 / Published online: 21 May 2010
© Springer-Verlag London Limited 2010

D. Sánchez (B) · D. Isern · M. Millan
Departament d'Enginyeria Informàtica i Matemàtiques, Intelligent Technologies for Advanced Knowledge Acquisition Research Group (ITAKA), Universitat Rovira i Virgili, Av Països Catalans, 26, 43007 Tarragona, Catalonia, Spain
e-mail: [email protected]
D. Isern e-mail: [email protected]
M. Millan e-mail: [email protected]

Abstract Semantic annotation is required to add machine-readable content to natural language text. A global initiative such as the Semantic Web directly depends on the annotation of massive amounts of textual Web resources. However, considering the amount of those resources, manual semantic annotation of their contents is neither feasible nor scalable. In this paper we introduce a methodology to partially annotate the textual content of Web resources in an automatic and unsupervised way. It uses several well-established learning techniques and heuristics to discover relevant entities in text and to associate them with classes of an input ontology by means of linguistic patterns. It also relies on the Web information distribution to assess the degree of semantic co-relation between entities and classes of the input domain ontology. Special effort has been put into minimizing the number of Web accesses required to evaluate entities, in order to ensure the scalability of the approach. A manual evaluation has been carried out to test the methodology for several domains, showing promising results.

Keywords Knowledge discovery · Semantic Web · Semantic annotation · Ontologies

1 Introduction

The Semantic Web is an evolving extension of the World Wide Web in which the semantics of information and services are defined, making it possible for Web-based tools to understand and satisfy the requests of people and machines to exploit Web content [3]. This model requires a set of knowledge structures to formalize those semantics, and a linkage between


Web content and those structures. As a result, the Semantic Web relies on two basic components: ontologies and annotations. On the one hand, ontologies are knowledge structures representing the semantics of domain concepts and their interrelations in a machine-readable way. Many ontology engineering and learning techniques have been proposed in the past to construct ontologies in manual [19] and automatic [35] ways. On the other hand, from the Semantic Web point-of-view, annotations represent a specific sort of metadata that provides references between entities appearing in resources and domain concepts modelled in an ontology. Even though the annotation paradigm supports annotating any kind of multimedia content, the work presented in this paper focuses on textual content.

Ontology-based semantic tagging of textual Web content can certainly bring many benefits to the development of intelligent Information Retrieval [23,44] and Extraction techniques [43,45], configuring a whole new set of Web-based knowledge services. Even though manual annotations of Web content are being tentatively applied in some Web environments (e.g., Wikipedia hyperlinks and categories, blog tags, Del.icio.us, etc.), they only offer a partial view of what a formal ontology-based annotation can bring (i.e., comprehension of text). Moreover, due to the enormous size of the Web, manual annotation of already existing text resources is completely unfeasible. In this sense, automatic text annotation approaches can provide valuable help in bringing semantics to Web resources in a scalable fashion [10]. However, as will be shown in the related work section, there is a lack of general and fully automatic approaches that can assist the annotation process.

This paper introduces an ontology-based annotation methodology that uses the Web itself as a background corpus to infer semantic assessments, and a set of well-established linguistic techniques to annotate textual Web resources in a completely automatic and unsupervised way. The proposed algorithm is based on the use of domain-independent rules to locate potential entities to be annotated within English-written text. Then, those candidates are analysed by means of linguistic patterns in order to discover suitable domain concepts. Finally, those concepts are matched against classes contained in an input domain ontology. The algorithm employs several statistical measures of term occurrence, collocation and semantic similarity in order to assess the suitability of the extracted entities and the proposed annotations in an unsupervised way. Two further goals of this work are to maximise the precision of the results in order to propose reliable annotations, and to minimise the number of Web accesses in order to offer a scalable approach.

The rest of the paper is organised as follows. Section 2 presents related work dealing with the semantic annotation of Web resources. Section 3 analyses the annotation problem, describes the main techniques employed to tackle it and formalises the bases of our proposal in comparison to other approaches. As a result, a three-staged annotation procedure is presented in detail in Sect. 4. Section 5 explains the evaluation, describing the criteria that have been considered, the measures that have been applied and the main results obtained for several domains. The last section presents the conclusions and several lines of future work.

2 Related work

In recent years, several attempts have been made to address the annotation of textual Web content. From the manual point-of-view, several tools have been developed to assist the user in the annotation process, such as Annotea [25], CREAM [21], NOMOS [31] or Vannotea


[39]. Those systems rely on the skills and willingness of a community of users to detect and tag entities within Web content. Considering that there are 1 trillion unique Web pages on the Web,1 it is easy to envisage the unfeasibility of manually annotating Web resources.

Recently, some authors have focused on addressing the annotation problem by automating some of its stages. As a result, tools such as Melita [12] have been developed. It relies on user-defined rules and previous annotations to suggest new annotations in text. Manually constructed rules are also used in other basic approaches to extract known patterns for annotations [2]. Another preliminary work proposing to semi-automate the annotation of Web resources is described in [24]. The authors propose the combination of patterns (e.g., addressed to extract objects such as email addresses, phone numbers, dates and prices) to tag the candidates to annotate; this set is then annotated by means of a domain conceptual model. That model represents the information of a particular domain through concepts, relationships and attributes (in an entity-relation based syntax).

Supervised systems also use extraction rules obtained from a set of pre-tagged data [7,33]. Supervised attempts are certainly difficult to apply due to the bottleneck introduced by the interaction with a domain expert and the great effort required to compile a large and representative training set. Although still supervised, other systems like KnowItAll [15] provide a higher level of automation. It uses the redundancy of the Web to perform a bootstrapped information extraction process. The user is asked to confirm the correctness of the extracted information in order to re-execute the process. SmartWeb [6] resolves the issue of not having pre-existing mark-up to learn from by using class and subclass names from a previously defined ontology. Those are used as examples to learn contexts. In this way, instances can be identified, as they present similar contexts.

Completely automatic and unsupervised systems are rare. SemTag [13] performs automated semantic tagging of large corpora based on the Seeker platform for text analysis, tagging a large number of pages with the terms included in a domain ontology named TAP. This ontology contains lexical and taxonomic information about music, movies, sports, health, and other issues, and SemTag detects the occurrence of these entities in Web pages. It disambiguates using neighbour tokens and corpus statistics, picking the best label for a token. Another interesting annotation application is presented in [29]. In this case, the authors use a reference set of elements (e.g., online collections containing structured data about cars, comics or general facts) to annotate ungrammatical sources like the texts contained in posts. First, the elements of those posts are evaluated using the TF-IDF metric. Then, the most promising tokens are matched against the reference set. In both cases, limitations may be introduced by the availability and coverage of the background knowledge (i.e., ontology or reference sets).

From the applicability point-of-view, Pankow [10] is the most promising system. It uses a range of well-studied syntactic patterns to mark up candidate phrases in Web pages without having to manually produce an initial set of marked-up Web pages, and without depending on previous knowledge. The context-driven version, C-Pankow [11], improves on the first by reducing the number of queries to the search engine.
However, the final association between text entities and a possible domain ontology is not addressed.

Summarising, according to the techniques employed in all those related works, three groups of approaches can be distinguished. The first one consists of using wrappers, which exploit the structure of Web pages to identify nuggets of information for mark-up. Wrappers

1 The Official Google Blog, http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html (last access 19/10/2009).


and rules are most useful when there are very regular patterns in the documents, such as standard tables of data. On the other hand, they require user skills to define the correct rules. The second group consists of learning patterns from pre-tagged resources, which are used to annotate new, similar documents. The drawback is that they require a large amount of previously annotated documents. A third set employs unsupervised learning techniques, like the exploitation of the distribution of certain patterns on the Web to determine the formal annotation of entities in Web pages by a principle of 'annotation by maximal syntactic evidence' [42]. Those approaches are hampered by the large number of Web accesses required to compute the Web-based statistics used to infer information distribution, and by the inherent ambiguity of using a reduced set of general patterns. Due to the lack of supervision, those algorithms tend to offer worse results than supervised approaches, even though they are more general, as no knowledge-dependent assumptions are made.

Due to the lack of fully automatic and unsupervised approaches, our work is framed in the third group. As a key point, the Web information distribution is used as the evidence to select and tag textual entities. However, compared to other approaches like C-Pankow, additional heuristics are employed to improve the precision, and a general-purpose thesaurus (WordNet) is introduced to minimise the number of Web accesses. It is important to note that both elements are used in a way that minimises their influence on the algorithm's generality. It is also worth mentioning that, unlike previous approaches [10,15], discovered entities will be associated with classes of an input domain ontology in order to offer more structured and easily interpretable annotations. In any case, up to this final stage, the algorithm's generality does not depend on the coverage of the ontology (unlike [6]).

3 Analysis of the automatic annotation problem

As stated above, the ontology-based annotation of text from a Semantic Web point-of-view [3] involves two main tasks: the detection of entities to annotate, and the tagging of each entity with the most appropriate concept (class) of the given domain ontology. In this section we analyse those tasks, review some techniques suitable to tackle them and introduce the design of our approach (which will be detailed in Sect. 4).

3.1 Detection of entities to annotate

Annotated elements typically cover real-world entities which can be considered as instances of ontological classes [3] (e.g., Barcelona is an instance of city). So, the first step of the automatic annotation process consists of discovering them within the input text. In human languages, real-world entities are commonly expressed by means of Named Entities (NEs) [11].

Supervised approaches try to detect NEs relying on a specific set of extraction rules learned from pre-tagged examples [18,40], or on predefined knowledge bases such as lexicons and gazetteers [26,30]. However, the amount of effort required to assemble large tagged sets or lexicons binds NE recognition to either a limited domain (e.g., medical imaging) or a small set of predefined, broad categories of interest (e.g., persons, countries, organizations, products). This introduces compromises in the recall [32]. Other approaches like [27] use a thesaurus to detect NEs: if a word or noun phrase is not found in the dictionary, it is assumed to be a NE. The problem is that NEs composed of common words will be discarded. Another possibility consists of exploiting the way in which NEs are expressed in languages such as English: they can be distinguished from normal noun phrases by the presence of alphanumeric terms and/or capitalized letters. Those


heuristics have been the basis for developing automatic NE detection methods [14,32]. The main problem is that basing the detection of NEs on individual observations may produce inaccurate results if no additional analyses are applied. For example, a noun phrase may be arbitrarily capitalised to stress its importance or due to its placement within the text.

Being unsupervised, domain-independent and lightweight, we will employ capitalization heuristics to discover NEs. As will be described in Sect. 4.1, a set of linguistic analyses will be performed over the text to identify noun phrases and to detect potential NEs. However, in order to improve the NE extraction precision, we will complement it with a Web-based reliability analysis. Potential NEs will be sought on the Web in order to estimate their reliability in a much wider context (i.e., several observations in heterogeneous contexts). The basic idea is to use new Web resources containing potential NEs to re-check that their appearances usually comply with the extraction rules established for a NE.

3.2 Entity annotation

The next step consists of relating NEs with their formal semantics in a given ontology (i.e., an ontological class representing a subsumer concept of which the NEs are instances). This is a difficult process because, on the one hand, NEs are unstructured and unlimited by nature. On the other hand, the semantics which can be exploited to detect those relationships remain hidden in the text from which the extraction has been performed.

Some authors [16,18] simplify this problem by using a predefined and reduced set of general annotation classes (e.g., organization, person, location, etc.) instead of a domain ontology. This decreases the degree of generality of the algorithm because domain-dependent entities are omitted (e.g., The Matrix is an instance of movie, in the cinema domain). Other authors try to learn entity-subsumer concept pairs from the text [10] instead of associating them to ontological classes. From an unsupervised point-of-view, this can be done by evaluating the degree of similarity between entities and subsumer concepts from the statistical estimation of their co-occurrence [15]. For instance, Barcelona and city would co-occur more often than Barcelona and mountain. However, the main problem of statistical analyses is data sparseness: the fact that the available data are not enough to extract reliable assessments. So, they perform poorly when the evaluated terms are relatively rare. Some authors [5] have demonstrated the convenience of using a corpus as wide as the Web in order to improve the quality of statistical methods. In fact, it has been stated that the amount and heterogeneity of information on the Web is so high that it can be assumed to approximate the real distribution of information [9]. The use of the Web as the source from which to perform statistical analyses has been very successful in information extraction tasks [16]. In fact, not only can robust statistics be computed from the analysis of the Web information distribution, but they can be obtained from the hit counts of keyword-based search engines like Google [41]. From the annotation point-of-view, those Web-scale statistics have been applied to cluster similar entities, deriving the class topology from the cluster tree [16], and to detect the most suitable subsumer concept for a certain Named Entity from a list of candidates [10].
In the next sections, we describe how Web statistics can aid in annotating NEs with regard to the classes of an input ontology and how they have been exploited in our approach.

3.2.1 Associating entities and ontological classes

Being unsupervised and domain-independent, Web-scale statistics can be used to assess the most related ontological class to which a NE should be annotated. The co-occurrence of both


the NE and each possible ontological class can be estimated from the Web by querying them in a search engine. This gives an indication of relatedness [41]. However, considering the amount of NEs to evaluate in a given document (potentially hundreds) and the number of classes contained in a domain ontology (also hundreds), the number of queries needed to check pair combinations would be overwhelming. As reported in [11], this solution, even though promising, would cause serious scalability problems.

Without exclusively relying on Web-scale statistical measures, it is also possible to compute the degree of similarity between words using a general-purpose off-line thesaurus like WordNet [17]. WordNet offers a massive lexicon, thesaurus and semantic linkage between most English words. Semantic pointers interrelate terms with predefined relationships (e.g., hyponymy, meronymy, synonymy, etc.). By counting and weighting the number of semantic links between terms [28,46], it is possible to compute similarity measures between terms (e.g., town and city are "nearer" than town and river). Off-line WordNet queries for similarity computation are extremely efficient when compared to on-line Web-based ones. However, those measures are hampered by WordNet's limited coverage of NEs. So, it is not possible to directly compute the similarity between a NE and an ontological class. In the next section we propose a way to tackle this problem in a scalable way.

3.2.2 Exploiting linguistic patterns for the discovery of annotation classes

Since WordNet-based similarity measures are appropriate from the efficiency point of view but inapplicable due to WordNet's limited coverage of NEs, other possibilities should be explored. Semantically, we need a way to go from the instance level (i.e., NEs, for example Barcelona) to the conceptual level (i.e., a subsumer concept of which the NE is an instance, for example city). NEs and subsumer concepts are related by means of taxonomic relationships. So, a mechanism is needed to discover taxonomically related concepts of a given NE in an unsupervised and domain-independent manner.

As stated in [10], three different learning paradigms can be exploited. First, some approaches rely on the document-based notion of term subsumption [38]. Second, some researchers claim that terms are semantically similar according to their shared syntactic context [4]. Both cases require a considerable amount of document and linguistic parsing. Finally, several researchers have exploited linguistic patterns. Those patterns express language regularities which can be exploited to detect predefined relationships such as is-a, part-of, and causation [21]. In those approaches, the text is scanned using the patterns' regular expressions to look for a relation of interest (e.g., in the case of taxonomic relationships, cities such as Barcelona). Pattern-based approaches represent a simple and unsupervised way of discovering the concept/class of which a NE is a subsumed term/instance. In fact, this technique has been applied to retrieve a set of concepts with which to annotate a NE [10]. Pattern-based approaches offer a relatively high precision but suffer from low recall due to the fact that explicit linguistic patterns are rare in corpora [10]. Fortunately, as stated previously, data sparseness can be minimised by exploiting the Web as the corpus from which to extract such semantic evidence [34], which is precisely our case. As will be detailed in Sect. 4.2, taxonomic linguistic patterns will be exploited in our approach to discover subsumer concepts for each NE from the Web.


3.2.3 Concept-class matching

As a result of the taxonomic pattern-based analysis, a set of subsumer concepts for a NE can be retrieved from the Web. For example, Barcelona can be a city, capital, centre, place, or hotel. In previous approaches [10], a Web-based statistical evaluation of this concept set leads to selecting the most related concept as the final annotation class (e.g., Barcelona is a city). Due to the potentially large size of the concept set, this requires a considerable amount of queries. Moreover, this kind of approach does not base the annotation on an input ontology and, in consequence, it results in semantically unstructured annotations covering heterogeneous domains which are hard to exploit. For example, NEs may be annotated with different synonyms referring to the same concept (e.g., Barcelona is a Metropolis, Madrid is a city), or several NEs could be annotated with closely related concepts (e.g., Barcelona is a city, Barri de Gracia is a neighborhood), a relation which would remain hidden.

From the knowledge engineering point-of-view, semantic annotations should be performed on the basis of a given ontology in order to offer machine-interpretable annotations [3]. So, the subsumer concepts retrieved for a NE in the previous stage can be used to directly annotate the NE if the concept appears in the input ontology as a class. Otherwise, it is necessary to assess the most appropriate ontological class according to the subsumer concepts. In this last case, as we are working at a concept rather than instance level, the set of subsumer concepts can be efficiently evaluated and ranked against ontological classes using the WordNet-based similarity measures mentioned in Sect. 3.2.1. For example, Sagrada Familia's subsumer concepts may be Cathedral, Centre and Building; after evaluating each one against each ontological class using WordNet-based similarity measures, we may find that the Church ontological class is the most similar one.

However, we identified some problems which require an additional analysis. For example, there can be situations in which several ontological classes have been discovered as subsumer concepts for a certain NE (e.g., Sagrada Familia's subsumers Building and Cathedral may both be covered in the input ontology). In that case, WordNet cannot help to choose the most suitable annotation. In other cases, such as in technological domains, WordNet's coverage of semantic pointers may present limitations, hampering the similarity assessment [36]. On the contrary, as stated in Sect. 3.2, Web-based statistical measures do not present those limitations. So, in order to tackle those problems, as will be described in Sect. 4.3, the final assessment of the most related ontological class for a NE can be based on Web-scale statistics.

In order to ensure the scalability of the approach, only the most similar (according to WordNet-based similarity measures) or directly matched (according to the ontology) ontological classes are evaluated by means of Web-scale statistics. This represents a much reduced set of concepts to evaluate (compared to the direct approaches introduced in Sect. 3.2.1), and results in a much lower number of Web queries. In addition, the final Web-based assessment can be a direct evaluation of the NE against each ontological class, resulting in a more reliable assessment. For instance, we may find that the pair Sagrada Familia and Cathedral co-occurs more often on the Web than Sagrada Familia and Building; Cathedral and Building both being classes in the input ontology, we conclude that Cathedral is the most suitable annotation. In the end, thanks to the introduction of linguistic patterns for subsumer concept discovery and of WordNet-based similarity measures to evaluate them, we can achieve the goal of annotating NEs based on robust statistics estimated from the Web while drastically reducing the number of Web queries (a fact which hampered the performance of other automatic approaches [10]).


Fig. 1 Pseudo-code of the annotation procedure

3.3 Formalization

Presenting the design of our annotation approach in a more formal manner (see the pseudo-code in Fig. 1), it receives two parameters: the Web document (WD) to be annotated and a domain ontology (DO), which contains a set of ontological classes (OC). First of all, the WD is parsed to extract a set of potential Named Entities (PNE). All of them are analysed using a Web-based statistical analysis (NE_score in Fig. 1) in order to select the most reliable ones, which compose the final set of Named Entities (NE).


The second step analyses each element included in the NE set to discover a set of subsumer concepts (SC) by means of a pattern-based Web analysis. Several taxonomic linguistic patterns (TLP) are employed. Finally, the ontological classes (OC) are used to semantically evaluate the SC items, covering direct matches between OC and SC and/or a subset of the most related OC (if no direct matches are found). This is achieved using a similarity function based on WordNet (WN_similarity in Fig. 1). After all combinations of those elements (OC_j, SC_k) are considered, a set of the most similar ontological classes (SOC) is composed. Then, a Web-based statistical assessor (Web_score in Fig. 1) is applied to select the most related class of SOC with respect to each NE, which is used as the annotation class (AC). When all elements of the NE set have been handled, the annotated document is returned.
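Since Fig. 1 itself is not reproduced here, the following minimal Python sketch restates the three-stage procedure just described. It is illustrative only: the stage helpers (candidate extraction, scoring, pattern matching) are assumed and passed in as callables; they correspond to the functions named in Fig. 1 and detailed in Sect. 4.

# Minimal sketch of the three-stage procedure formalised above (cf. Fig. 1).
# The stage implementations are passed in as callables, since the paper's
# helpers (pattern extraction, Web scoring, etc.) are detailed in Sect. 4.

NE_THRESHOLD = 0.7   # minimum rate of exact-case Web matches (Sect. 5.2.1)
SIM_THRESHOLD = 0.3  # minimum WordNet similarity to an ontology class (Sect. 5.2.2)

def annotate_document(wd, ontology_classes, patterns,
                      extract_potential_nes, ne_score,
                      extract_subsumer_concepts, direct_matching,
                      wn_similarity, web_score):
    annotations = {}
    # Stage 1: candidate NEs filtered by the Web-based reliability score
    nes = [pne for pne in extract_potential_nes(wd)
           if ne_score(pne) >= NE_THRESHOLD]
    for ne in nes:
        # Stage 2: subsumer concepts via taxonomic linguistic patterns (TLP)
        scs = extract_subsumer_concepts(ne, patterns)
        # Stage 3: direct ontology matches, else WordNet-similar classes
        socs = direct_matching(scs, ontology_classes)
        if not socs:
            socs = [oc for oc in ontology_classes
                    if any(wn_similarity(oc, sc) >= SIM_THRESHOLD for sc in scs)]
        if socs:  # final choice by Web-scale co-occurrence statistics
            annotations[ne] = max(socs, key=lambda oc: web_score(ne, oc))
    return annotations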

4 Annotation methodology

In this section, the concrete details of the proposed annotation algorithm are described and discussed following the structure of the pseudo-code introduced in Fig. 1.

4.1 First stage: extract and select named entities from text

As shown in Fig. 1, the first stage of the annotation procedure starts by parsing the Web document (WD) to analyse. At first, the document is cleaned of all HTML mark-up (analise_and_parse function). Then, the set of potential Named Entities (PNE) is detected by means of a linguistic analysis (extract_potential_NEs function). It seeks totally or partially capitalized noun phrases using a composition of text taggers.2 The first tagging procedure uses general regular expressions (see Table 1) to prioritize marking capitalized words as Proper Nouns (i.e., an indication of a NE). Then, the text is passed through two n-gram taggers, which are trained with the Brown Corpus,3 in order to perform a post-tagging of the remaining words. First, a Unigram tagger tags words by assigning them their most likely morphological category. After that, a Bigram tagger refines the tagging by considering the category of the preceding word. Tested over the pre-tagged Brown corpus, this combination of taggers achieves a precision of 93.4%.

After the text has been tagged, a grammar based on the expressions shown in Table 2 is used to detect the noun phrases which may contain the full name of a NE (e.g., Sagrada Familia). This grammar describes the structure of a NE noun phrase, which is usually composed of a central particle with one or more Proper Nouns, NNP+ (capitalized words as detected in the previous stage), preceded and/or followed by zero or more Nouns, NN|NNS* (both singular and plural). Usually, this central particle is preceded by optional determiners and/or adjectives, composing a noun phrase. Using this simple parsing, we are able to retrieve a wide set of noun phrases with one or more capitalized words, which compose the set of potential Named Entities (PNE). Examples of PNEs for the Barcelona Wikipedia article can be found in the first column of Table 5.

PNEs can be particularly numerous and noisy, including false NEs such as headers or stressed terms, because they have been retrieved from individual observations in a particular context.

2 NLTK (http://nltk.sourceforge.net). It uses the Brown tag-set (http://www.comp.leeds.ac.uk/amalgam/tagsets/brown.html).
3 http://icame.uib.no/brown/bcm.html.


Table 1 Regular expressions used to analyse the text

Regular expression | Tag | Description | Example
^[A-Z].*$ | NNP | Proper noun | Barcelona
.*ing$ | VBG | Gerund verb tense | Distinguishing
.*ed$ | VBD | Regular verb in past tense | Distinguished
.*es$ | VBZ | Verb in 3rd person singular, present tense | Distinguishes
.*'s$ | NN$ | Singular common noun, genitive | Season's
.*s$ | NNS | Plural common noun | Stadiums
.*al$ | JJ | Adjective | Global
^-?[0-9]+(\.[0-9]+)?$|[0-9]*((\.|,)[0-9]*)*$ | CD | Cardinal number | 125,000
.* | NN | Singular common noun | word

Table 2 Noun phrase detection grammar

UNINP: {<NN|NNS>* <NNP>+ <NN|NNS>*}
NOUNP: {<UNINP>? <UNINP> <UNINP>?}
NP: {<DT|DTI|DTS|DTX|PP\$>? <JJ|JJ-TL|JJR|JJT|JJS>? <NOUNP>}
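As an illustration of the tagging and chunking pipeline described above, the following NLTK-based sketch combines a regular-expression tagger built from Table 1 with Brown-trained unigram and bigram taggers, and chunks the result with a grammar following Table 2. It is an approximation rather than the authors' exact implementation: NLTK's backoff chaining is used for the tagger composition, and Brown-tagset proper-noun tags (written 'NP' in that tag-set) are harmonised to the NNP convention of the tables.

import nltk
from nltk.corpus import brown  # requires nltk.download('brown') and the tokenizer models

# Regular expressions adapted from Table 1 (catch-all NN rule last).
regex_patterns = [
    (r'^[A-Z].*$', 'NNP'),                 # capitalized word -> proper noun
    (r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'), (r'.*es$', 'VBZ'),
    (r".*'s$", 'NN$'), (r'.*s$', 'NNS'), (r'.*al$', 'JJ'),
    (r'^-?[0-9]+([.,][0-9]+)*$', 'CD'),
    (r'.*', 'NN'),                         # default: singular common noun
]

# Regexp tagger as backoff of a Unigram/Bigram chain trained on Brown.
train = brown.tagged_sents()
t0 = nltk.RegexpTagger(regex_patterns)
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)

# Chunk grammar following Table 2: a core of proper nouns optionally
# surrounded by common nouns, preceded by optional determiners/adjectives.
grammar = r"""
  UNINP: {<NN|NNS>*<NNP>+<NN|NNS>*}
  NOUNP: {<UNINP>?<UNINP><UNINP>?}
  NP:    {<DT|DTI|DTS|DTX|PP\$>?<JJ|JJ-TL|JJR|JJT|JJS>?<NOUNP>}
"""
chunker = nltk.RegexpParser(grammar)

def potential_nes(sentence):
    tagged = t2.tag(nltk.word_tokenize(sentence))
    # Brown writes proper nouns as 'NP...'; harmonise to NNP as in Table 1.
    tagged = [(w, 'NNP' if t.startswith('NP') else t) for w, t in tagged]
    tree = chunker.parse(tagged)
    return [' '.join(w for w, _ in st.leaves())
            for st in tree.subtrees() if st.label() == 'NP']

print(potential_nes("The Sagrada Familia is a large church in Barcelona"))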

Using this set as the basis for the annotation process (as approaches like [10] do) would require a lot of effort to deal with false positives. For example, for the Wikipedia article on Barcelona (more details in the evaluation section), a total of 292 possible NEs are extracted using this procedure, but fewer than 150 can be considered valid.

In order to minimise the number of elements contained in PNE, we introduce an unsupervised assessor. As stated in Sect. 3.1, this assessor uses the Web as background information to avoid drawing final conclusions from a unique observation. It uses new Web resources containing each PNE_i to re-check that its appearances comply with the extraction rules established for a NE. This increases the candidate's reliability independently of the text context, resulting in a selection score based on the number of valid and invalid observations (NE_score function in Fig. 1).

So, the Web-based assessor relies on a statistical estimation of the candidates' appearances over the Web. Concretely, each candidate is queried in a Web search engine and the resulting Web snippets (i.e., pieces of Web content covering one or several matches of the input query) are retrieved. We use snippets instead of individual Web accesses for efficiency purposes as, with a single query, we are able to retrieve up to 100 Web snippets containing one or several query matches. PNE_i is searched in this snippet set. Then, the probability of finding PNE_i matches written in their original form (i.e., appropriately capitalized letters) is computed according to the total amount of appearances (Eq. 1):

NE\_Score(PNE_i) = \frac{\#Exact\_Matchings(PNE_i)}{\#Total\_Matchings(PNE_i)}    (1)

If the resulting rate indicates that PNE_i commonly complies with the established rules (e.g., higher than 70%), it is considered a NE. The concrete threshold (NE_THRESHOLD constant in Fig. 1) is configured to maximise the precision of this selection stage, as discussed in Sect. 5.2.1. In addition, as it is quite common to find misspelled terms when analysing Web content, which might appear to be valid NEs, a minimum number of total matches is also required to filter erroneous terms. Examples of NE_Scores for several PNEs of the Barcelona Wikipedia article can be found in the second column of Table 5.
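A minimal sketch of this snippet-based check (Eq. 1) might look as follows; the snippet list is assumed to come from a search engine query for the candidate, and the minimum-matches value is an illustrative assumption.

import re

def ne_score(pne, snippets, min_matches=10):
    """Eq. 1: rate of case-exact appearances of a candidate among all
    case-insensitive appearances found in Web snippets.

    `snippets` would be the (up to 100) snippets returned by a search
    engine when querying the candidate; `min_matches` is an assumed value
    for the minimum-total-matches filter mentioned in the text."""
    text = ' '.join(snippets)
    total = len(re.findall(re.escape(pne), text, flags=re.IGNORECASE))
    if total < min_matches:
        return 0.0  # too rare: possibly a misspelling
    exact = len(re.findall(re.escape(pne), text))  # case-sensitive count
    return exact / total

# A PNE is promoted to NE when ne_score(...) >= NE_THRESHOLD (0.7 here).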


Table 3 List of Hearst patterns used to retrieve a named entity's subsumer concepts

Pattern structure | Example
CONCEPT such as (ENTITY)+ ((and | or) ENTITY)? | Cities such as Barcelona or Madrid
CONCEPT (,?) especially (ENTITY)+ ((and | or) ENTITY)? | Countries especially Spain and France
CONCEPT (,?) including (ENTITY)+ ((and | or) ENTITY)? | Capitals including London and Paris
ENTITY (,?)+ and other CONCEPT | Eiffel Tower and other monuments
ENTITY (,?)+ or other CONCEPT | Coliseum or other historical places

Table 4 Additional taxonomic patterns

Pattern structure | Example
ENTITY (,?)+ is a | are a CONCEPT | Paris is a beautiful city
ENTITY (,?)+ like other CONCEPT | Taj Mahal like other mausoleums

4.2 Second stage: retrieve subsumer concepts for each named entity

After collecting and filtering the set of most reliable named entities (NE) from the document WD, the second step aims to extract subsumer concepts (SC). This task is performed by means of linguistic patterns aimed at detecting taxonomic relationships. In that sense, the work of [22] is particularly relevant because she describes a set of text patterns and a method to acquire hyponymy relations from unrestricted text. In our case, as shown in Fig. 1, the set of Hearst patterns' regular expressions (TLP) is iteratively used to find new SCs for each NE_i. A pattern (TLP_j) is used in conjunction with each NE_i to construct a query for a Web search engine (e.g., "Barcelona and other", "especially Barcelona"). The retrieved Web resources are used as the corpus from which to extract SCs (extract_subsumer_concepts function in Fig. 1). Again, snippets are used instead of the full Web text for efficiency (retrieve_snippets function in Fig. 1). In previous research [37] we found that, when using Hearst-like patterns, it is very unlikely to retrieve more than one match for a given query from a Web page. In consequence, a snippet-based analysis, which precisely covers query matches, exhibits better performance than an expensive full-Web parsing.

The concrete set of taxonomic patterns contained in TLP is summarised in Table 3, where ENTITY is the NE_i to evaluate and CONCEPT indicates the position in which a SC should appear in the text. After some experimentation, two new patterns were added to this list, formally described in Table 4. Each pattern match is analysed in order to extract the CONCEPT part, compiling a list of SCs for each NE_i. Examples of SCs retrieved for several NEs extracted from the Barcelona Wikipedia article can be found in the third column of Table 5.
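As an illustration of this pattern-driven extraction, the sketch below encodes two of the patterns; the fetch_snippets helper is hypothetical (any snippet-returning search API would do), and the remaining patterns of Tables 3 and 4 would be written analogously.

import re

# Two of the taxonomic patterns of Tables 3 and 4, compiled per entity;
# group(1) captures the CONCEPT position.
def subsumer_regexes(ne):
    e = re.escape(ne)
    return [
        ('"%s and other"' % ne,
         re.compile(e + r'\s*,?\s+and other\s+([a-z]+(?: [a-z]+)*)')),
        ('"such as %s"' % ne,
         re.compile(r'([a-z]+(?: [a-z]+)*)\s+such as\s+(?:[^.]*?\b)?' + e)),
    ]

def extract_subsumer_concepts(ne, fetch_snippets):
    """`fetch_snippets` is a hypothetical helper that sends the quoted
    pattern query (e.g., "Barcelona and other") to a search engine and
    returns snippet strings, as described in this section."""
    scs = []
    for query, regex in subsumer_regexes(ne):
        for snippet in fetch_snippets(query):
            scs += [m.group(1).strip() for m in regex.finditer(snippet)]
    return scs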


Table 5 Examples of extracted and selected entities, subsumer concepts, similar ontological classes and annotation classes for the Barcelona Wikipedia article^a, using an ontology containing geopolitical knowledge^b

(Stage 1: extract and select NEs | Stage 2: extract SCs | Stage 3: select SOCs and annotate)

PNE | NE_Score | Examples of SCs | Examples of SOCs (Web_Score) | AC
Fundacio Antoni Tapies | 0.907 | Cultural center, Museum | Museum (6.18e-5), Community center (1.17e-6), Convention center (4.59e-6), Information center (6.71e-7) | Museum
Palau de la Musica Catalana | 0.771 | Concert hall, Great music hall, Famous concert hall, Wonderful music hall | Concert hall (0.00081) | Concert hall
Estadi Olimpic Lluis Companys | 0.967 | Spanish soccer stadium, Facilities | Stadium (8.709e-5) | Stadium
FC Barcelona | 0.785 | Spanish football club, Sports club, Spanish club, European soccer clubs, Prestigious clubs | Club (0.00088), Jazz club (0.00053), Night club (2.139e-5) | Club
Estacio del Nord | 0.764 | Modern park, Bus station nowadays, Sites | Bus station (0.000314), Park (1.459e-6), Park and ride (8.571e-6) | Bus station
Torre Agbar | 0.75 | Nice building, New building, Building | Administrative building (4.37e-5), Office building (8.828e-6) | Administrative building
Horta-Guinardo | 0.706 | Save neighborhoods | District (1.041e-8) | District
Sant Andreu | 0.805 | Municipalities, Villages, Sites | Town (2.976e-9) | Town
Liceu | 0.785 | Auditorium, Piece of theatrical archaeology, Institution, Music school | Theater (3.008e-4), Concert hall (7.95e-5) | Theater
Visigoths | 0.801 | Northern tribes, Germanic tribes, People, Barbarian tribes | Not similar enough | –
Catalunya Radio | 0.876 | Transmitter, Media, Organizations | Not similar enough | –
Bicing service | 0.742 | Not found | – | –
ft Torre | 0.888 | Not found | – | –
Feb Mar Apr | 0.823 | Calendar, Months | Not similar enough | –
Districts | 0.482 (rejected) | – | – | –
Beaches | 0.510 (rejected) | – | – | –
Neighbourhoods | 0.489 (rejected) | – | – | –
Seaport | 0.437 (rejected) | – | – | –
Cruise ships | 0.294 (rejected) | – | – | –
Cities | 0.484 (rejected) | – | – | –

Rejected entities or classes appear in italics in the original; here they are marked '(rejected)'
^a Source: http://en.wikipedia.org/wiki/Barcelona (last access 20/10/2009)
^b http://itaka2-deim.urv.cat/ontologies/location.owl (last access 20/10/2009)

4.3 Third stage: annotate named entities with ontological classes

The goal of the last step of the algorithm is to associate an annotation class (AC) included in the domain ontology to each named entity (NE_i). This step begins with the extraction of all ontological classes (OC) contained in the domain ontology (extract_ontological_classes function in Fig. 1). Then, for each NE_i, all subsumer concepts (SC) are compared against the OC elements in order to select the most similar ontological classes (SOC). When a SC is composed of several words (i.e., it is a multi-word noun phrase, like "populated city"), its words are syntactically tagged using the tools introduced in Sect. 4.1. As a result, we extract the main NN|NNP contained in the noun phrase (e.g., city). This allows a direct comparison against OCs. A stemming algorithm is also applied to both SCs and OCs in order to discover morphologically equivalent terms (e.g., city and cities). If one or several OCs are found, they are selected as similar classes (SOC) with which to annotate the current NE_i (extract_direct_matching function in Fig. 1).

Otherwise, if no direct matches are found, the most semantically similar OCs (ideally synonyms) are sought according to the SCs. All pairs (OC_j, SC_k), with j ∈ (0..|OC|) and k ∈ (0..|SC|), are evaluated by applying the similarity measure of Eq. (2), based on their proximity in the WordNet hierarchy [17] (WN_similarity function in Fig. 1):

WN\_similarity(OC_j, SC_k) = \frac{2 \times taxonomical\_depth}{|shortest\_path(OC_j, SC_k)| + 2 \times taxonomical\_depth}    (2)
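For illustration, Eq. 2 can be computed off-line with NLTK's WordNet interface as sketched below. The taxonomical depth is an assumed constant (the depth of the WordNet noun hierarchy, which is version-dependent), and the best score over all noun senses of both terms is taken.

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

TAX_DEPTH = 19  # assumed depth of the WordNet noun hierarchy (version-dependent)

def wn_similarity(oc, sc):
    """Eq. 2: shortest-path similarity scaled by the hierarchy depth.
    Returns 1.0 for identical terms, 0.0 for unknown or unconnected ones."""
    best = 0.0
    for sa in wn.synsets(oc, pos=wn.NOUN):
        for sb in wn.synsets(sc, pos=wn.NOUN):
            dist = sa.shortest_path_distance(sb)
            if dist is not None:
                best = max(best, 2 * TAX_DEPTH / (dist + 2 * TAX_DEPTH))
    return best

# e.g. wn_similarity('town', 'city') should exceed wn_similarity('town', 'river')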


Concretely, the shortest path between both terms in WordNet's is-a hierarchy (hypernymy-hyponymy relationships) is identified. The path length is then scaled by the depth of the hierarchy in which the terms reside to obtain the final measure [46]. As path length measures distance, the fraction is inverted to approximate similarity. Values range from 1 (for identical terms) to 0 (non-linked ones). OCs are filtered according to a threshold (SIM_THRESHOLD constant in Fig. 1) in order to select only the most similar ones. Those are added to the set SOC and taken as candidates for annotating the current NE_i.

At the end of this process, if no SOCs are found, the input ontology does not have a class similar enough to the current NE_i and, hence, NE_i remains unlabelled. Note also that malformed or incorrect NEs retrieved in the first stage are automatically discarded at this stage if no SOCs (or even SCs) are found. This is very convenient in order to further refine, in an implicit manner, the set of entities to annotate.

From our experiments, at the end of this stage, the total number of SOCs ranges from 4 to 8, an amount which can be handled in a scalable way by means of Web-scale statistics. So, the next step consists of selecting, from the list of SOCs, the one with which to finally annotate NE_i (select_SOC_with_maximum_Web_Score function in Fig. 1). The selection is based on the relatedness between NE_i and each SOC, assessed by means of a Web-scale statistical analysis. This offers a general and robust estimation of the information distribution at a social scale [9]. Concretely, a version of the Point-wise Mutual Information (PMI) collocation measure [8] adapted to the Web is computed. The score (Eq. 3) computes term appearance probabilities from the Web hit counts provided by a search engine when querying NE_i and each of the SOC_j:

Web\_Score(NE_i, SOC_j) = \frac{hits(SOC_j\ AND\ NE_i)}{hits(NE_i)}    (3)

As presented in [41], this score is derived from PMI (Eq. 4), which statistically assesses the relation between two words (a, b) as the conditioned probability of a and b occurring together within the text. In the Web-based score (Eq. 3), concept probabilities are approximated by the hit counts provided by a Web search engine; concretely, hits(SOC_j AND NE_i) approximates the probability that SOC_j and NE_i co-occur. Since we are looking for the maximum score among a set of SOCs, the log_2 and hits(NE_i) terms can be dropped, as they have the same value for all SOCs of a given NE_i:

PMI(a, b) = \log_2 \frac{p(ab)}{p(a)\,p(b)}    (4)
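A sketch of the resulting selection score follows; the hit-count helper is hypothetical, standing for whatever page counts the chosen search engine reports.

def web_score(ne, soc, hits):
    """Eq. 3, the Web-adapted PMI of Eq. 4 with the constant factors dropped.

    `hits` is a hypothetical helper returning the page count a search engine
    reports for a query string. Because hits(ne) is identical for every SOC
    of a given NE, it does not affect which SOC maximises the score."""
    return hits('"%s" AND "%s"' % (soc, ne)) / max(hits('"%s"' % ne), 1)

# Final annotation class: ac = max(socs, key=lambda soc: web_score(ne, soc, hits))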

Once the score is computed for all SOCs, the one with the highest value is taken as the final annotation class (AC) for the current NE_i. Examples of SOCs, associated Web_Scores and the final AC for several NEs extracted from the Barcelona Wikipedia article can be found in the fourth and fifth columns of Table 5.

The fact that this final decision is based on the information distribution at Web scale is very important, as the discovery of reliable relative frequencies of words and phrases is a major problem in applied linguistic research. As introduced in Sect. 3.2, considering the size and heterogeneity of the Web, the probabilities of Web search engine terms, conceived as the frequencies of page counts returned by the search engine divided by the number of indexed pages, approximate the current use of those terms in society [9]. So, their degree of generality as a statistical assessor is higher than that of similarities computed from a more limited corpus like WordNet. In fact, they allow NEs and SOCs to be compared directly. This is the main reason why we delegate the final selection to the Web-based assessor instead of relying completely on WordNet similarity measures.


As a final step, annotated NEs are linked to their corresponding AC in the input document (annotate_NE function in Fig. 1).

4.4 Runtime complexity

In order to justify the scalability of our approach, in this section we study the runtime complexity of the algorithm. Considering that the Web access overhead is several orders of magnitude larger (seconds) than text parsing or off-line WordNet queries, the runtime complexity of the algorithm for one document is O(|Q|), where |Q| is the number of queries issued to a Web search engine. |Q| can be split into |PNE| + |NE| × |TLP| + |NE| × |SOC|, where |PNE| is the number of potential Named Entities extracted from the Web document, |NE| is the number of Named Entities selected after the first stage, |TLP| is the number of linguistic patterns (7) employed during the subsumer concept extraction phase, and |SOC| is the number of ontological classes selected after the WordNet similarity-based assessment. We can conclude that, for each NE, the number of Web queries required to finally annotate it is around a dozen (one reliability check, seven pattern queries and, typically, 4-8 SOC queries). This contrasts with the several hundred needed by other, more direct approaches, in which |NE| × |OC| queries are needed to evaluate the Named Entities against all ontological concepts. We clearly see how, by introducing the subsumer concept discovery stage and partially relying on WordNet, we can greatly reduce the total number of queries while maintaining the final Web-based assessor.

5 Evaluation

Due to the lack of solutions to carry out automatic evaluations of annotated documents, researchers [1,11] typically focus the evaluation on the manual side. In our case, a two-step expert-based evaluation procedure has been designed. First, the quality of the extracted and selected NEs is checked against manually tagged ones. Then, the annotation classes (AC) assigned to those entities are manually evaluated according to the classes available in the input ontology. As shown previously, our proposal includes several parameters, such as the thresholds used to filter NEs and OCs. Their influence on the final results has also been evaluated individually.

5.1 Evaluation procedure

The evaluation process has been carried out over 4 sets of 5 English-written Wikipedia articles. Those have been selected due to their proliferation of NEs which, in addition, are commonly linked to their corresponding articles, easing the evaluation from the manual point-of-view. These sets include information about tourist destinations, popular movies, laptops and netbooks, and sports cars, respectively. All articles have been annotated using four ontologies, one corresponding to each article domain. The ontologies have different sizes and, consequently, different degrees of coverage and generality. For the first set, we used a domain ontology4 with 188 concepts related to spatial entities (e.g., geographical features, political divisions, places, etc.); for the second set, the ontology5 has only 17 concepts related to cinema (e.g., companies, professionals, cinematographic and TV products, etc.).

4 http://itaka2-deim.urv.cat/ontologies/space.owl (last access 10/10/2009).
5 http://itaka2-deim.urv.cat/ontologies/film.owl (last access 10/10/2009).


For the third set, the ontology6 models 45 concepts related to computers and office equipment (e.g., computer types, components, peripherals, etc.); finally, the fourth ontology7 models 133 concepts related to car manufacturing (e.g., components, brands, equipment, etc.). Those ontologies have been manually composed by independent third parties and retrieved from Swoogle.8

In order to define a relevant human evaluation baseline, two domain experts manually checked each document of the four sets of articles. In a first stage, each one was asked to extract the entities that may be suitable for annotation. The individual results were then put together and the experts were asked to reach agreement in those cases lacking consensus. The final set of extracted entities for each article is compared against the automatically extracted ones. The quality of the results is evaluated by means of precision and recall. Precision (Eq. 5) is computed as the rate between the correctly selected NEs and the total amount of NEs:

Precision\_NE = \frac{|Correct\_NEs|}{|NEs|}    (5)

Recall (Eq. 6) indicates how many of the entities to be annotated have been extracted. It is computed as the rate between the correctly selected ones and the total amount of entities detected by the human experts in the document:

Recall\_NE = \frac{|Correct\_NEs|}{|Annotable\_Entities|}    (6)

In a second stage, both experts were asked to check the suitability of the annotations proposed by the automatic procedure in relation to the classes available in each domain ontology. Annotated NEs are tagged as feasible according to the input ontology, and un-annotated ones are checked in order to assess whether a suitable annotation class is available in the ontology. As a result, precision (Eq. 7) is computed as the rate between the correctly annotated NEs and the total number of annotated NEs, and recall (Eq. 8) as the rate between the number of correctly annotated NEs and the total number of NEs that could be annotated:

Precision\_Annotation = \frac{|Correctly\_annotated\_NEs|}{|Annotated\_NEs|}    (7)

Recall\_Annotation = \frac{|Correctly\_annotated\_NEs|}{|Annotable\_NEs|}    (8)

From these measures we also calculate the F-measure as the harmonic mean of recall and precision (Eq. 9):

F\text{-}measure = \frac{2 \times Recall \times Precision}{Recall + Precision}    (9)
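For clarity, the following small sketch applies Eqs. 5-9 to the expert-agreed sets.

def evaluate(selected, correct, annotable):
    """Eqs. 5-9 for the NE-selection stage (the annotation stage is
    analogous): `selected` are the system's outputs, `correct` those the
    experts judged valid, `annotable` the expert-built reference set."""
    precision = len(correct) / len(selected) if selected else 0.0
    recall = len(correct) / len(annotable) if annotable else 0.0
    f_measure = (2 * recall * precision / (recall + precision)
                 if (recall + precision) else 0.0)
    return precision, recall, f_measure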

5.2 Evaluation of thresholds' influence

As introduced in Sect. 4, our algorithm uses two pre-defined thresholds. The first one (NE_THRESHOLD constant in Fig. 1) is used in the first stage to set the minimal percentage of exact observations of a PNE on the Web; it determines whether the PNE is selected or rejected. Through an empirical study carried out for several domains, we set a value of 0.7.

6 http://itaka2-deim.urv.cat/ontologies/laptop.owl (last access 10/10/2009).
7 http://itaka2-deim.urv.cat/ontologies/car.owl (last access 10/10/2009).
8 http://swoogle.umbc.edu/ (last access 10/10/2009).


Fig. 2 Named entity evaluation as a function of the selection threshold value

The second threshold (SIM_THRESHOLD constant in Fig. 1) is used in the third stage to select the most similar OCs for a certain NE by means of a WordNet similarity measure. In this manner, we minimize the set of classes for which the final Web score must be calculated. Empirically, we set a threshold value of 0.3. As both scores are normalized measures, values range from 0 to 1 (i.e., a threshold of 0 indicates that all candidates are selected; a threshold of 1 implies that only exact candidates will be selected).

In order to evaluate the influence of the thresholds and the adequacy of the established values, we applied the annotation process several times over the same input document and ontology, modifying the concrete threshold values and evaluating the quality of the obtained results. Threshold values have been set from 0.1 to 0.9 with increments of 0.2. In the following, the concrete results obtained for the Palma de Mallorca article are presented.

5.2.1 Named entity selection threshold

Considering the amount of Web resources and information freely published on the Web, the main aim of our annotation approach is to maximize the precision of the results in order to present a reliable solution. Recall has less importance from the Web annotation point of view, as Web information is highly redundant [5] (i.e., the same data may appear in many different forms). In general, any correct annotation is valuable considering that we start from plain text documents. So, the goal for the first step of the algorithm is to maximize the precision of the selected Named Entities. Moreover, a more compact set of NEs results in better learning performance, as fewer Web queries are required in the second stage.

Taking this into consideration, we evaluated the NEs obtained when setting different values of the NE_THRESHOLD constant. The evaluation results are shown in Fig. 2. As expected, recall and precision show an inverse behaviour. In consequence, for values between 0.1 and 0.7, the precision-recall equilibrium measured by the F-Measure is more or less stable. The absolute maximum is reached for a value of 0.1, but with a precision of just 64.6%. Considering the relaxed value of the threshold (i.e., the Web-based selection method will hardly filter any PNE), this indicates the total percentage of correct extractions obtained from the individual observations in the input document. Increasing the threshold value above 0.5 (i.e., requiring more than half of the Web observations to comply with the heuristic rules presented in Sect. 4.1), precision starts to increase, indicating that the statistical assessment helps to minimize the noise of the resulting set.


Fig. 3 Annotation evaluation as a function of the OC selection threshold value

On the other hand, too restrictive values (0.9) result in maximized precision at the cost of a very low recall. So, in order to offer a reliable set of NEs (even at the cost of a lower recall), we preferred an intermediate value of 0.7, which indicates an assessment based on a high percentage of exact matches. This provides a high precision with a reasonable recall, which is precisely the objective of our approach.

5.2.2 Subsumer concept selection threshold

SIM_THRESHOLD is used to rank and filter the list of OCs according to the SCs retrieved for a given NE, using WordNet similarity measures. Considering that this process is only applied if no direct ontological class matches are found (i.e., annotations may depend on the characteristics of the concrete ontology), the influence of this threshold value is less evident. In fact, we introduced the final Web-based assessor to minimize the influence of the WordNet bias, or of the semantic pointers' coverage issues in some domains, which may hamper the generality of the algorithm. Even though this particular threshold is not as crucial as the previous one, it has an influence on the final annotations. So, in this test, we began from the results of the previous stage (with a NE_THRESHOLD of 0.7), varying the SIM_THRESHOLD from 0.1 to 0.9. After evaluating the annotations for the different tests, we obtained the evaluation results presented in Fig. 3.

Analysing the tendencies shown in the Figure, we can conclude that the threshold value has its greatest influence on the final recall (i.e., whether each NE has or has not been annotated). On the one hand, a too permissive threshold (below 0.3) results in almost disabled filtering, presenting a large set of OCs to evaluate; in fact, in that case, the set almost tripled in size. This relaxed policy only affects the performance, not the quality of the results, as the final class is selected via the Web-based assessor, at the cost of a greater number of queries. On the other hand, a restrictive threshold (above 0.3) rejects too many concepts, resulting, in many situations, in non-annotated entities. We can see in the Figure that for values above 0.5 recall is maintained, indicating that only direct ontological matches have been finally annotated (in this case, 39.13%).

So, the ideal threshold value should be high enough to filter as many OCs as possible (minimising the final amount of Web queries required for annotation) while maintaining the


Table 6 Evaluation of named entities for the geopolitical articles

Article | |Article| (words) | |Extracted NEs| | |Selected NEs| | Recall (%) | Precision (%) | F-Measure (%)
Andorra | 2661 | 136 | 68 | 66.7 | 67.6 | 67.1
Barcelona | 7996 | 292 | 146 | 37.4 | 89.7 | 52.8
Tarragona | 1872 | 176 | 47 | 46.4 | 83 | 59.5
Reus | 1231 | 46 | 10 | 15.4 | 80 | 25.8
Palma | 3777 | 166 | 42 | 48.9 | 85.7 | 62.3
Average | 3507 | 163 | 63 | 42.9 | 81.2 | 53.5

Table 7 Evaluation of named entities for the cinema articles

Article | |Article| (words) | |Extracted NEs| | |Selected NEs| | Recall (%) | Precision (%) | F-Measure (%)
The Matrix | 5715 | 267 | 80 | 41 | 77.5 | 53.7
War Games | 2400 | 95 | 22 | 30.6 | 68.2 | 42.2
LOTR 3 | 10097 | 585 | 146 | 36.1 | 64.4 | 46.3
I, Robot | 3359 | 175 | 41 | 30.4 | 58.5 | 40
300 | 7449 | 416 | 104 | 40.2 | 75 | 52.3
Average | 5804 | 308 | 79 | 35.7 | 68.7 | 46.9

recall (i.e., at least one valid OC passes the filter for correct NEs). We found that, considering the similarity measure described in Sect. 4.3, a threshold of about 0.3 offers the best equilibrium. Regarding the precision, it follows a tendency similar to the recall: the lower the threshold (from 0.1 to 0.5), the more possibilities there are of selecting an appropriate OC to annotate; for values above 0.5, the precision obtained considering only direct ontological matches is observed (69.23%).

5.3 Evaluation of named entities

In the next battery of tests, we ran (using the thresholds indicated above) the complete annotation process for the domains and ontologies described in Sect. 5.1. In this section, the results of the evaluation of the first learning stage for each individual Web document and ontology are presented. Details about the annotated articles, statistics of the learning process and evaluation results are summarised in Tables 6, 7, 8 and 9.

Analysing the results, we can observe a relatively high precision (68-86% on average) with a lower recall (35-60% on average). These are expected results, as the NE selection threshold is set to a high value in order to increase the precision at the cost of a lower coverage. As stated above, we prefer to introduce strong constraints in the NE selection procedure and present more reliable annotations. This idea fits better the requirements of an automatic annotation tool for the Semantic Web [42].

Regarding the variability of the results from one domain to another, we have also observed a dependency between the quality of the results and the amount of English-written resources available on the Web for each PNE. This fact may hamper the performance in scarcer domains. For example, tourist destinations which are more world-wide oriented, such as


Table 8 Evaluation of named entities for the laptop articles

Article              |Article| (words)   |Extracted NEs|   |Selected NEs|   Recall (%)   Precision (%)   F-Measure (%)
Acer Aspire One           2055                178                41             61.2          73.2            66.7
Macbook Air               1463                150                44             68.7          75              71.7
Dell Inspiron Mini        1524                 96                28             55.5          71.4            62.5
Lenovo IdeaPad S10        1007                 50                13             58.8          76.9            66.6
HP 2133 Mini-Note         1229                 56                23             56.6          73.9            64.1
Average                   1455                106                30             60.2          70.1            66.3

Table 9 Evaluation of named entities for the sports car articles

Article                  |Article| (words)   |Extracted NEs|   |Selected NEs|   Recall (%)   Precision (%)   F-Measure (%)
Ford GT                       1806                134                27             52.3          85.1            64.8
Lamborghini Murcielago        2260                134                32             70            87.5            77.8
Ferrari F430                  2024                120                32             48.2          84.3            61.3
Aston Martin DB9              1069                 86                29             66.6          82.7            73.8
Mercedes CLK GTR              2042                102                29             50            93.1            65.1
Average                       1840                115                30             57.4          86.6            68.5

Table 10 Evaluation of annotations for the geopolitical articles

Article     |Annotations|   Recall (%)   Precision (%)   F-Measure (%)
Andorra           33            81.2          75.7            78.4
Barcelona         77            79.4          70.1            74.5
Tarragona         26            82.6          73.1            77.5
Reus               4            40            50              44.4
Palma             15            63.1          80              70.6
Average           26            69.3          69.8            69.1

Movie-related results are more uniform, as the evaluated items are more homogeneous (i.e., well-known American titles in all cases). Finally, the articles with information about laptops and sports cars offer the best and least variable results, as they are very homogeneous in their contents (i.e., product and vehicle specifications) and refer to inherently English terms.

5.4 Evaluation of the annotation procedure

In this section, the evaluation of the annotations of each NE is presented in Tables 10, 11, 12 and 13.


Table 11 Evaluation of annotations for the cinema articles

Article      |Annotations|   Recall (%)   Precision (%)   F-Measure (%)
The Matrix         47            76.7          70.2            73.3
War Games          12            70            58.3            63.6
LOTR 3             56            70.4          55.3            62
I, Robot           13            40            46.1            42.8
300                50            50            56              52.8
Average            30            61.4          57.2            58.9

Table 12 Evaluation of annotations for the laptop articles

Article              |Annotations|   Recall (%)   Precision (%)   F-Measure (%)
Acer Aspire One            18            61.9          72.2            66.6
Macbook Air                24            69.2          75              72
Dell Inspiron Mini         16            75            75              75
HP 2133 Mini-Note          15            76.9          66.6            71.4
Lenovo IdeaPad S10          9            72.7          88.8            80
Average                    16            71.1          75.5            73

Table 13 Evaluation of annotations for the sports car articles

Article                  |Annotations|   Recall (%)   Precision (%)   F-Measure (%)
Ford GT                        16            76.9          62.2            68.8
Lamborghini Murcielago         19            81.2          68.4            74.3
Ferrari F430                   22            88.8          72.7            80
Aston Martin DB9               18            76.9          55.5            64.5
Mercedes CLK GTR               13            75            69.2            72
Average                        18            79.7          65.6            72

For the first domain, the F-Measure is about 69% on average (with similar precision and recall), except for the scarcest domain (Reus). In that case, similarly to the previous stage, the limited amount of English-written resources for the extracted NEs results in a narrower set of subsumer concepts retrieved by means of linguistic patterns, which ultimately hampers the performance of the annotation selection procedure.

For the second domain, results are slightly worse, with an average F-Measure below 60%. We observed that there is a dependency between the final annotations and the degree of generality and coverage of the input ontology with respect to the information contained in the input document. For the first domain, the size of the ontology (188 classes) allows a better definition of domain concepts and specialisations, increasing the probability of a direct matching. In the second case (an ontology with 17 classes), entities can only be annotated with general classes, resulting in less accurate results. However, it is important to note that the input ontology is only used to map NEs to the available ontological classes and not as a base of knowledge to perform the learning process (as in approaches like [6]), which may introduce a learning dependency on predefined knowledge.

Finally, for the third and fourth domains, results present the best quality of the bunch, with an F-Measure of 72-73%. The fact that the associated ontology offers a detailed formalisation of the concepts of which the retrieved NEs are potential instances (i.e., computer and car components) helps to improve the results.

In addition to the ontology dependency, the homogeneity of the article contents with respect to the domain formalised in the ontology is also important. In the second test, for example, movie-related articles usually have contents related to the film's plot (including fictitious and ambiguous film characters) or off-topic information about the actors' biographies. This may introduce noise and result in less accurate annotations. On the contrary, in the third test, laptop articles are very homogeneous in their contents, providing detailed computer specifications. Finally, we observed that sports car articles contain a certain amount of NEs not directly related to the ontology (i.e., names of GT competitions, designers, engineers and racing pilots), which contributes to a slightly lower precision. On the other hand, recall is high, as car models or manufacturers are rarely ambiguous, resulting in appropriate matches to ontological classes.
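
The two-step mapping discussed above can be summarised with the following sketch, which first tries a direct label match against the input ontology and otherwise falls back to the WordNet-filtered candidates (reusing the hypothetical rank_candidates function from the sketch in Sect. 5.2.2); the Web-based assessor is stubbed out, and all names are illustrative:

```python
# Sketch of the NE-to-class mapping; rank_candidates is the hypothetical
# function from the Sect. 5.2.2 sketch.
def web_assessor_best(ne, candidates):
    """Placeholder: keep the top-ranked candidate instead of issuing the
    Web queries with which the actual assessor scores each candidate."""
    return candidates[0][0]

def annotate(ne, subsumer_concepts, ontology_classes):
    # Step 1: direct match between a retrieved subsumer concept (SC)
    # and an ontological class label (case-insensitive comparison).
    labels = {oc.lower(): oc for oc in ontology_classes}
    for sc in subsumer_concepts:
        if sc.lower() in labels:
            return labels[sc.lower()]
    # Step 2: no direct match; filter and rank the classes by WordNet
    # similarity and let the (stubbed) Web-based assessor decide.
    candidates = rank_candidates(subsumer_concepts, ontology_classes)
    return web_assessor_best(ne, candidates) if candidates else None
```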

6 Conclusions and further work

Considering the size of the Web, the manual annotation of already existing Web resources is a hard and arduous task. So, the design of automatic annotation solutions is fundamental to provide a base of annotated content that contributes to the success of the Semantic Web. Even though the quality of automatic solutions, like the one proposed in this paper, is still far from that obtained with manual approaches, the results and the annotation throughput are promising.

From the accuracy point of view, the proposed algorithm, using simple and domain-independent heuristics, is able to extract around 50% of the NEs that a human expert can detect, with a precision of about 70%. Ontology-based annotations also provide reliable results, with a precision of around 70%. This represents a considerable reduction in the human effort required to annotate the enormous amount of already available textual Web resources. Moreover, even though imperfect and partial, as stated in Sect. 5, any annotation that can bring semantic content to pure plain-text resources is valuable [42].

The lack of a standard evaluation procedure makes it difficult to compare the methodology with other solutions. However, several aspects differentiate our approach from previous automatic ones. Regarding the NE extraction process, our statistical assessor leads to a higher precision than other automatic approaches [11] with a similar corpus (articles about locations). In that approach (which represents an improvement over previous ones [1,20]), the NE selection constraints are more relaxed, resulting in a higher number of mistakes which hamper the final annotations. In relation to the annotations, the differences and contributions of our proposal are more evident, as other approaches typically omit the ontology-mapping stage (resulting in unstructured and impractical annotations) or use ad hoc ontologies to perform the annotation procedure [11]. However, on average, our results improve precision and recall thanks to the use of Web-based statistical analyses to finally assess the suitability of the selected class candidates.

The number of required Web queries is also considerably reduced compared to other approaches [10], thanks to the introduction of the pattern-based retrieval of subsumer concepts. This allows the use of WordNet-based similarity measures to filter and reduce the list of terms to evaluate by means of the final Web-based statistical assessor, and it improves the performance without depending on WordNet's coverage for the specific domain. However, further refinements of previous works [11] have also led to a better scalability, at the cost of omitting Web-based statistical analyses.
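
As an illustration of the pattern-based retrieval of subsumer concepts, the sketch below builds exact-phrase Web queries from the three patterns named in the future-work discussion that follows; the query format and function name are assumptions, and the noun phrase adjacent to each matched pattern in the retrieved snippets would be taken as a subsumer concept candidate:

```python
# Sketch of pattern-based query construction for subsumer retrieval;
# only the three patterns named in the text are used here.
PATTERNS = ['"{ne} is a"', '"{ne} and other"', '"{ne} or other"']

def subsumer_queries(ne):
    """One exact-phrase search query per pattern."""
    return [p.format(ne=ne) for p in PATTERNS]

print(subsumer_queries("Ferrari F430"))
# ['"Ferrari F430 is a"', '"Ferrari F430 and other"',
#  '"Ferrari F430 or other"']
```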

As future work, it is important to study how to further reduce the number of queries to Web search engines, as they represent the main bottleneck of the algorithm and introduce a dependency on external resources. During the development, we observed that some of the patterns (like "is a", "and other", "or other") retrieve more precise and useful candidates than others, as they result in less ambiguous extractions. This may lead to further research on composing a more concrete set of patterns, resulting in fewer queries.

An additional issue which should also be taken into account is the semantic ambiguity of annotations (considered by some previous approaches [11]). For example, the "Barcelona" entity extracted from a Web site could be a "city" or a "sports team" depending on the context. When using a unique ontology as input, the problem can be solved implicitly by the unambiguous definition of ontological classes. Nevertheless, in the context of the Semantic Web, annotators may have to deal with many ontologies, which will probably offer different annotation possibilities. Considering that words tend to exhibit a unique sense within a discourse [47], it could be possible to exploit the unambiguously annotated entities in the same context (Web document) to decide the most suitable annotation.

Acknowledgments This work has been partially supported by the Universitat Rovira i Virgili (2009AIRE-04), the Spanish Ministry of Science and Innovation (DAMASK project, Data mining algorithms with semantic knowledge, TIN2009-11005) and the Spanish Government (PlanE, Spanish Economy and Employment Stimulation Plan).

References

1. Alfonseca E, Manandhar S (2002) Improving an ontology refinement method with hyponymy patterns. In: 3rd international conference on language resources and evaluation, LREC 2002. Las Palmas, Spain
2. Baumgartner R, Flesca S, Gottlob G (2001) Visual web information extraction with Lixto. In: Apers PMG, Atzeni P, Ceri S, Paraboschi S, Ramamohanarao K, Snodgrass RT (eds) 27th international conference on very large data bases, VLDB 2001. Morgan Kaufmann, Roma, Italy, pp 119–128
3. Berners-Lee T, Hendler J, Lassila O (2001) The semantic web—a new form of web content that is meaningful to computers will unleash a revolution of new possibilities. Sci Am 284:34–43
4. Bisson G, Nedellec C, Cañamero D (2000) Designing clustering methods for ontology building, the Mo'K workbench. In: Staab S, Maedche A, Nedellec C, Wiemer-Hastings P (eds) ECAI workshop on ontology learning 2000. CEUR-WS, Berlin, pp 13–19
5. Brill E (2003) Processing natural language without natural language processing. In: Gelbukh A (ed) 4th international conference, CICLing 2003. Springer, Heidelberg, pp 179–185
6. Buitelaar P, Ramaka S (2005) Unsupervised ontology-based semantic tagging for knowledge markup. In: De Raedt L, Wrobel S (eds) Workshop on learning in web search at 22nd international conference on machine learning, ICML 05. ACM, Bonn, pp 26–32
7. Califf ME, Mooney RJ (2003) Bottom-up relational learning of pattern matching rules for information extraction. J Mach Learn Res 4:177–210
8. Church K, Gale W, Hanks P et al (1991) Using statistics in lexical analysis. In: Zernik U (ed) Lexical acquisition: exploiting on-line resources to build a lexicon. Erlbaum, Hillsdale, pp 115–164
9. Cilibrasi RL, Vitányi PMB (2006) The Google similarity distance. IEEE Trans Knowl Data Eng 19:370–383
10. Cimiano P, Handschuh S, Staab S (2004) Towards the self-annotating web. In: Feldman S, Uretsky M (eds) 13th international conference on world wide web. ACM, New York, pp 462–471
11. Cimiano P, Ladwig G, Staab S (2005) Gimme' the context: context-driven automatic semantic annotation with C-PANKOW. In: Ellis A, Hagino T (eds) 14th international conference on world wide web. ACM, Chiba, pp 462–471

12. Ciravegna F, Dingli A, Petrelli D et al (2002) User-system cooperation in document annotation based on information extraction. In: Gómez-Pérez A, Benjamins R (eds) 13th international conference on knowledge engineering and knowledge management. Ontologies and the semantic web, EKAW 02. Springer, Heidelberg, pp 122–137
13. Dill S, Eiron N, Gibson D et al (2003) A case for automated large-scale semantic annotation. Web Semant Sci Serv Agents World Wide Web 1:115–132
14. Etzioni O, Cafarella M, Downey D et al (2004) Web-scale information extraction in knowitall: (preliminary results). In: Feldman S, Uretsky M (eds) 13th international conference on world wide web. ACM, New York, pp 100–110
15. Etzioni O, Cafarella M, Downey D et al (2005) Unsupervised named-entity extraction from the web: an experimental study. Artif Intell 165:91–134
16. Evans R (2003) A framework for named entity recognition in the open domain. In: Nicolov N, Bontcheva K, Angelova G, Mitkov R (eds) Recent advances in natural language processing, RANLP 03. John Benjamins, Borovetz, pp 267–276
17. Fellbaum C (1998) WordNet: an electronic lexical database. MIT Press, Massachusetts, USA
18. Fleischman M, Hovy E (2002) Fine grained classification of named entities. In: Tseng S-C, Chen T-E, Liu Y-F (eds) 19th international conference on computational linguistics—vol. 1, COLING 02. Morgan Kaufmann Publishers, Taipei, pp 1–7
19. Gómez-Pérez A, Fernández-López M, Corcho O (2004) Ontological engineering with examples from the areas of knowledge management, e-Commerce and the semantic web. Springer, Berlin
20. Hahn U, Schnattinger K (1998) Towards text knowledge engineering. In: Mostow J, Rich C, Buchanan B (eds) Fifteenth national/tenth conference on artificial intelligence/innovative applications of artificial intelligence, AAAI 98/IAAI 98. AAAI, Madison, pp 524–531
21. Handschuh S, Staab S, Studer R (2003) Leveraging metadata creation for the semantic web with CREAM. In: Günter A, Kruse R, Neumann B (eds) 26th annual german conference on AI, KI 2003. Springer, Hamburg, pp 19–33
22. Hearst MA (1992) Automatic acquisition of hyponyms from large text corpora. In: Kay M (ed) 14th conference on computational linguistics–vol. 2, COLING 92. Morgan Kaufmann Publishers, Nantes, pp 539–545
23. Jung JJ (2009) Consensus-based evaluation framework for distributed information retrieval systems. Knowl Inf Syst 18:199–211
24. Kiyavitskaya N, Zeni N, Cordy JR et al (2005) Semi-automatic semantic annotations for web documents. In: Bouquet P, Tummarello G (eds) 2nd Italian semantic web workshop on semantic web applications and perspectives, SWAP 2005. CEUR-WS, Trento, Italy, pp 210–225
25. Koivunen M-R (2005) Annotea and semantic web supported collaboration (invited talk). In: Dzbor M, Takeda H, Vargas-Vera M (eds) Workshop on end user aspects of the semantic web at 2nd annual european semantic web conference, UserSWeb 05. CEUR Workshop Proceedings, Heraklion, Crete, pp 5–17
26. Krupka G, Hausman K (1998) IsoQuest, Inc: description of the NetOwl extractor system as used for MUC-7. In: 7th message understanding conference, MUC-7. Morgan Kaufman, Fairfax, Virginia, USA
27. Lamparter S, Ehrig M, Tempich C (2004) Knowledge extraction from classification schemas. In: Meersman R, Tari Z (eds) On the move to meaningful internet systems 2004: CoopIS, DOA, and ODBASE, OTM confederated international conferences, CoopIS/DOA/ODBASE 04. Springer, Cyprus, pp 618–636
28. Leacock C, Chodorow M (1998) Combining local context and WordNet similarity for word sense identification. In: Fellbaum C (ed) WordNet: an electronic lexical database. MIT Press, Massachusetts, pp 265–283
29. Michelson M, Knoblock CA (2007) An automatic approach to semantic annotation of unstructured, ungrammatical sources: a first look. In: Knoblock CA, Lopresti D, Roy S, Subramaniam LV (eds) IJCAI-2007 workshop on analytics for noisy unstructured text data. Hyderabad, India, pp 123–130
30. Mikheev A, Finch S (1997) A workbench for finding structure in texts. In: Grishman R (ed) 5th applied natural language processing conference, ANLP 1997. Association for Computational Linguistics, Washington, pp 8–16
31. Niekrasz J, Gruenstein A (2006) NOMOS: a semantic web software framework for annotation of multimodal corpora. In: 5th international conference on language resources and evaluation, LREC 06. Genoa, Italy, pp 21–27
32. Pasca M (2004) Acquisition of categorized named entities for web search. In: Grossman DA, Gravano L, Zhai C, Herzog O, Evans DA (eds) Thirteenth ACM international conference on information and knowledge management, CIKM 04. ACM, Washington, DC, USA, pp 137–145


33. Roberts A, Gaizauskas R, Hepple M et al (2007) The CLEF corpus: semantic annotation of clinical text. In: AMIA 2007 annual symposium. American Medical Informatics Association, Chicago, USA, pp 625–629
34. Rozenfeld B, Feldman R (2008) Self-supervised relation extraction from the web. Knowl Inf Syst 17:17–33
35. Sánchez D (2008) Domain ontology learning from the web. VDM Verlag, Saarbrücken, Germany
36. Sánchez D, Moreno A (2008a) Learning non-taxonomic relationships from web documents for domain ontology construction. Data Knowl Eng 64:600–623
37. Sánchez D, Moreno A (2008b) Pattern-based automatic taxonomy learning from the web. AI Commun 21:27–48
38. Sanderson M, Croft B (1999) Deriving concept hierarchies from text. In: 22nd annual international ACM SIGIR conference on research and development in information retrieval, SIGIR '99. ACM, Berkeley, USA, pp 206–213
39. Schroeter R, Hunter J, Kosovic D (2003) Vannotea—a collaborative video indexing, annotation and discussion system for broadband networks. In: Handschuh S, Koivunen M-R, Dieng-Kuntz R, Staab S (eds) Knowledge markup and semantic annotation workshop, K-CAP 03. ACM, Sanibel, Florida, pp 9–26
40. Stevenson M, Gaizauskas RJ (2000) Using corpus-derived name lists for named entity recognition. In: Niremburg S (ed) 6th applied natural language processing conference, ANLP 2000. Association for Computational Linguistics, Seattle, pp 290–295
41. Turney PD (2001) Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: De Raedt L, Flach P (eds) 12th european conference on machine learning, ECML 01. Springer, Freiburg, pp 491–502
42. Uren V, Cimiano P, Iria J et al (2006) Semantic annotation for knowledge management: requirements and a survey of the state of the art. J Web Semant 4:14–28
43. Wang P, Hu J, Zeng H-J et al (2009a) Using Wikipedia knowledge to improve text classification. Knowl Inf Syst 19:265–281
44. Wang Z, Wang Q, Wang D-W (2009b) Bayesian network based business information retrieval model. Knowl Inf Syst 20:63–69
45. Wong T-L, Lam W (2008) Learning to extract and summarize hot item features from multiple auction web sites. Knowl Inf Syst 14:143–160
46. Wu Z, Palmer MS (1994) Verb semantics and lexical selection. In: 32nd annual meeting of the association for computational linguistics (ACL). Morgan Kaufmann Publishers/ACL, Las Cruces, New Mexico, USA, pp 133–138
47. Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: 33rd annual meeting of the association for computational linguistics. Morgan Kaufmann Publishers, Cambridge, Massachusetts, USA, pp 189–196

Author Biographies

David Sánchez is a Lecturer at the University Rovira i Virgili's Computer Science and Mathematics Department. He received a PhD in Artificial Intelligence from UPC (Technical University of Catalonia) in 2008. He is a member of the ITAKA research group (Intelligent Techniques for Advanced Knowledge Acquisition). His research interests are intelligent agents, ontology learning and the Semantic Web. He has been involved in several research projects (national and European) and has published several papers and conference contributions.


David Isern is a post-doctoral researcher at the University Rovira i Virgili's Department of Computer Science and Mathematics. He is also an associate professor at the Open University of Catalonia. He received his PhD in Artificial Intelligence (2009) and an MSc (2005) from the Technical University of Catalonia. His research interests are intelligent software agents, distributed systems, user preference management and ontologies, especially applied to healthcare and information retrieval systems. He has been involved in several research projects (national and European) and has published several papers and conference contributions.

Miquel Millan is a senior software developer and IT researcher. He received an MSc (2008) from the Universitat Rovira i Virgili. His research interests are intelligent software agents and the Semantic Web. He has been involved in European research projects, and published several conference contributions.
