A Web Services-Based Annotation Application for Semantic Annotation of Highly Specialised Documents About the Field of Marketing

Mercedes Argüello Casteleiro1,∗, Mukhtar Abusa1, Maria Jesus Fernandez Prieto2, Veronica Brookes2, and Fonbeyin Henry Abanda1

1 Research Institute of the Built and Human Environment (BUHU)
2 School of Languages
The University of Salford, Salford, M5 4WT, United Kingdom
[email protected]

Abstract. The field of marketing is ever-changing. Each shift in the focus of marketing may have an impact on the terminology currently in use, and therefore compiling marketing terminology and knowledge can help marketing managers and scholars to keep track of the ongoing evolution of the field. However, processing highly specialised documents about a particular domain is a delicate and very time-consuming activity performed by domain experts and trained terminologists, and it cannot easily be delegated to automatic tools. This paper presents a Web services-based application to automate the semantic annotation and text categorisation of highly specialised documents, where domain knowledge is encoded as OWL domain ontology fragments that are used as the inputs and outputs of Web services. The approach presented outlines the use of OWL-S and the OWL's XML presentation syntax to obtain Web services that easily deal with terminological background knowledge. To validate the proposal, the research has focused on expert-to-expert documents from the marketing field. The emphasis of the research approach presented is on the end-users (marketing experts and trained terminologists), who are not computer experts and are unfamiliar with Semantic Web technologies.

Keywords: Semantic Web, Semantic Web Services, Semantic annotation, Ontologies, OWL, OWL-S, XML, Text Categorisation.

1 Introduction

There is an increasing need to effectively mine for knowledge both across the Internet and in particular repositories. From the publishing industry to the competitive intelligence business, important volumes of data from various sources have to be processed daily and analysed by professional users [1]. In general, when expert professionals manually process specialised documents, the aim is to capture the pertinent ∗

Please note that the author has joined the ESRC National Centre for e-Social Science (NCeSS) based at the University of Manchester, M13 9PL, United Kingdom.

R. Meersman and Z. Tari et al. (Eds.): OTM 2007, Part I, LNCS 4803, pp. 1135–1152, 2007. © Springer-Verlag Berlin Heidelberg 2007


knowledge contained in each selected resource, and the knowledge captured is used to annotate the document with a set of descriptors (e.g. terms from a thesaurus). The introduction of tools that automatically extract/highlight specialised terminology from textual documents, and that are designed to be interactive and usable by domain experts and trained terminologists who are not necessarily computer experts, will save domain experts and trained terminologists time and accelerate the process of semantic annotation of documents. Furthermore, making explicit the domain-specific terminology used in a particular domain can be considered "feature extraction" and may be used for text clustering and text categorisation. Annotators were first conceived as tools that could be used to alleviate the burden of manually including ontology-based annotations in Web pages [2]. Since then, many of them have evolved into more complete environments that use Information Extraction (IE) and Machine Learning (ML) techniques to propose semi-automatic annotations for Web documents. Nowadays, there are many annotation tools or environments, such as Annotea [3], the Semantic Markup Tool [4], the OntoMat Annotizer [5], SMORE [6], the SHOE Knowledge Annotator [7], and ONTO-H [8]. However, some of them are not well suited for annotators unfamiliar with concepts related to ontologies and semantic annotation in general. Moreover, most current annotation systems, like some of the ones mentioned above, are applications that run locally on the annotator's computer. The research approach presented in this paper is aligned with the annotation tool Saha [9], i.e. it tackles the problem of creating semantically rich annotations by developing an annotation system that supports the distributed creation of metadata and that can be easily used by non-experts in the field of the Semantic Web. However, the research approach presented in this paper differs from Saha [9] in the following respects:

1. The annotation application developed and presented in this paper is based on Web services [10] and considers three types of annotations: Dublin Core annotation, thesaurus annotation, and ontology-based annotation.
2. Background knowledge about specialised terminology is used to obtain a representation of documents as a selection of terms (term vectors). This is a pre-processing strategy in which documents are reduced to the terms considered "important", and text clustering and text categorisation may therefore profit from it. In fact, the Web services-based annotation application developed makes use of text categorisation to classify the content of highly specialised documents by relating parts of a document to one or several concepts of the domain ontology.
3. The research approach pays special attention to two tasks: the first is to define the services' domain ontologies in terms of OWL [11] classes, properties, and instances; the second is to create an OWL-S [12] description of the services, relating this description to the domain ontologies.
4. The research study addresses the challenge of service composition, which refers to the process of combining different Web services to provide a value-added service [13]. The approach outlines the use of OWL-S [12] and the OWL's XML presentation syntax [14] to obtain a combination of Web services that easily deals with terminological background knowledge.


This paper is organised as follows. Section 2 summarises related work. Requirements and an approach overview are given in Section 3. The details of the three different types of annotations are described in Section 4. Section 5 presents the evaluation of the Web services-based annotation application in the marketing domain. Conclusions are in Section 6.

2 Related Work

With the emergence of the Semantic Web, annotating metadata in documents and general Web resources has been the focus of many projects that have attempted to provide tools or frameworks for annotating different types of content (HTML, databases, multimedia) and with different degrees of automation – see http://annotation.semanticweb.org/. Some examples of widely used approaches to metadata annotation are the following:

1. Dublin Core [15]: an example of a lightweight ontology that is widely used to specify the characteristics of electronic documents without providing too many details about their content. It specifies a predefined set of document features such as creator, date, contributor, description, etc.
2. Thesauri and controlled vocabularies, such as MeSH [16]: terms from a thesaurus or from a controlled vocabulary can be used to provide agreed terms in specific domains and to annotate documents. Since these vocabularies are not completely formal, the annotations are normally pointers to those terms in the vocabulary [2].
3. Ontologies, such as the SWRC ontology [17]: the SWRC ontology generically models key entities in a typical research community and reflects one of the earliest attempts to put this usage of Semantic Web technologies in academia into practice. The SWRC ontology initially grew out of the activities in the KA2 project [18]. Since its initial versions it has been used and adapted in a number of different settings, most prominently for providing structured metadata for web portals, e.g. OntoWeb [19].

Dublin Core annotations are more ambiguous than annotations based on a thesaurus or controlled vocabulary, and the latter are also more ambiguous, in general, than annotations based on ontologies [2]. However, these three approaches complement each other.
To illustrate this: the most recent version of the SWRC ontology, which has been released in OWL [11] format, comprises a total of seven top-level concepts, namely Document, Event, Organization, Person, Product, Project and Topic, and includes Dublin Core elements by means of both datatype and object properties. Furthermore, [20] proposes a method for transforming thesauri into ontologies. The foundations of the current approach are in line with the annotation tool Saha [9], i.e. it tackles the problem of creating semantically rich annotations by developing an annotation system that supports the distributed creation of metadata and that can be easily used by non-experts in the field of the Semantic Web. However, the current approach also aims to facilitate the economical annotation of large document collections. To achieve this, the current approach integrates terminological resources to facilitate the incorporation of knowledge extraction techniques into the annotation environment. Initially, terminological resources can be easily included in the SWRC ontology


by means of a Document Extension ontology, mainly because a controlled vocabulary, a dictionary or a thesaurus can be naturally seen as three different types of documents. The current approach pays special attention to thesauri, which are the most interesting terminological resources, because a terminological entry in a thesaurus may contain synonyms, abbreviations, regional variants, definitions, contexts, and even term equivalents in other languages. Different fields have thesauri of their own, which can be widely used for harmonising content indexing. For example, the MeSH [16] thesaurus is used to index the biomedical literature. Taking this into account, the current approach dedicated effort to obtaining a suitable thesaurus, because, with the help of thesaurus descriptors, domain experts can relate terminological entries from a thesaurus with class names from a classification system (e.g. concepts from a simple taxonomy ontology) to make explicit how terminological entries from a thesaurus cover a certain topic. The mapping process performed (i.e. mapping terms to concepts) has several advantages: a) the size of the concept vectors representing documents is considerably smaller than the size of the term vectors produced, and b) it facilitates providing the end-user with useful classifications of documents.
The combination of Web services [10] and ontologies [21] has resulted in the emergence of a new generation of Web services called Semantic Web services [22]. The landscape created by Semantic Web services has spurred several research issues. One important challenge is service composition, which refers to the process of combining different Web services to provide a value-added service [13]. A large body of research has recently been devoted to Web service composition, and several techniques, prototypes, and standards have therefore been proposed by the research community.
However, these techniques, prototypes, and standards provide little or no support for the semantics of Web services, their messages, and their interactions [10]. The research approach presented in this paper uses OWL-S [12] to describe Web services and explores the advantages of using the existing OWL Web Ontology Language XML Presentation Syntax [14] to encode OWL [11] domain ontology fragments as XML documents that can usefully be passed between Web services and that may be needed by other components in the same workflow. To validate the proposal, the research has focused on a Web services-based application which uses terminological knowledge to automate the semantic annotation and text categorisation of highly specialised documents (i.e. expert-to-expert documents). The next section provides an overview of the Web services-based annotation approach and its requirements.

3 Requirements and Approach Overview

3.1 Requirements

In [23], seven requirements for semantic annotation systems have been formulated. A summary of how the current Web services-based annotation approach fulfils the requirements identified in [23] is the following:

1. Standard formats - The Web Ontology Language (OWL) [11] has been used for representing ontologies, and an XML syntax based on the OWL's XML presentation syntax [14] has been used to pass ontology fragments between the services.


2. User-centred/collaborative design - An easy-to-use interface that simplifies the annotation process for end-users unfamiliar with Semantic Web techniques is achieved by means of a GUI partially generated on the fly with Ajax [24], where the existing XSLT stylesheet [25] and the XML documents derived from the OWL's XML presentation syntax are interpreted by JavaScript functions that keep end-users unaware of the underlying complexities.
3. Ontology support - Protégé 3.2 beta [26] has been chosen as the ontology-design and knowledge acquisition tool: a) to build ontologies in the Web Ontology Language OWL using the Protégé-OWL plugin, and b) to create OWL-S ontologies using the OWL-S Editor [27], which is implemented as a Protégé plugin.
4. Support of heterogeneous document formats - Documents can be in Web-native formats such as HTML; MS Word format is also supported.
5. Document evolution - Documents may change, just as ontologies may change. This is even more likely in an environment where terminology plays a pivotal role and may need to be updated on a daily basis. A way to keep ontologies and annotations consistent is to consider annotations as "temporary annotations" instead of "definitive annotations"; the environment should therefore be able to generate or regenerate annotations automatically.
6. Annotation storage - The approach taken is in line with the Semantic Web model, which assumes that annotations will be stored separately from the original document.
7. Automation - The level of automation achieved (annotations are generated or regenerated automatically) enables use by end-users without computer expertise.

3.2 Approach Overview

A Web service is a set of related functionalities that can be programmatically accessed through the Web [10]. A growing number of Web services are implemented and made available internally in an enterprise or externally for other users to invoke.
These Web services can be reused and composed in order to realise larger and more complex business processes. The Web service proposals for description (WSDL [28]), invocation (SOAP [29]) and composition (WS-BPEL [30]) that are most commonly used lack a proper semantic description of services. This makes it hard to find appropriate services, because a large number of syntactically described services need to be manually interpreted to see whether they can perform the desired task. Semantically described Web services make it possible to improve the precision of the search for existing services and to automate the composition of services. Semantic Web Services (SWS) [22] take up this idea, introducing ontologies to describe, on the one hand, the concepts in the service's domain (e.g. flights and hotels, tourism, e-business) and, on the other hand, characteristics of the services themselves (e.g. control flow, data flow) and their relationships to the domain ontologies (via inputs and outputs, preconditions and effects, and so on) [27]. Two recent proposals have gained a lot of attention: 1) the American-based OWL Services (OWL-S) [12] and 2) the European-based Web Services Modelling Language (WSML) [31]. These emerging specifications overlap in some parts and are complementary in others. WSML uses its own lexical notation, while OWL-S is XML-based.


The OWL Web Ontology Language for Services (OWL-S) [12] provides developers with an expressive language to describe the properties and capabilities of Web services in such a way that the descriptions can be interpreted by a computer system in an automated manner. The current approach pays special attention to the Service Process Model, because it includes information about inputs, outputs, preconditions, and results, and describes the execution of a Web service in detail by specifying the flow of data and control between the particular methods of a Web service. The execution graph of a Service Process Model can be composed using different types of processes and control constructs. OWL-S defines three classes of processes. Atomic processes (AtomicProcess) are directly executable and contain no further sub-processes. From the point of view of the caller, atomic processes are executed in a single step, which corresponds to the invocation of a Web service method. Simple processes (SimpleProcess) are not executable. They are used to specify abstract views of concrete processes. Composite processes (CompositeProcess) are specified through composition of atomic, simple and composite processes recursively, by referring to control constructs (ControlConstruct) using the property composedOf. Control constructs define specific execution orderings on the contained processes.
The research study addresses the challenge of service composition, which refers to the process of combining different Web services to provide a value-added service [13]. The approach highlights the benefits of Semantic Web technologies in obtaining a combination of Web services that easily deals with terminological background knowledge to automate the semantic annotation and text categorisation of highly specialised documents.
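As an informal illustration of the process taxonomy just described, the sketch below models atomic and composite processes as plain Python classes. This is only an analogue (real OWL-S descriptions are OWL documents); all helper classes and process names are invented, except where they mirror OWL-S terms such as AtomicProcess, CompositeProcess, and the Sequence control construct.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class AtomicProcess:
    """Directly executable; invoked by the caller in a single step."""
    name: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)

@dataclass
class CompositeProcess:
    """Recursively composed of sub-processes under a control construct,
    analogous to OWL-S's composedOf property."""
    name: str
    control_construct: str  # e.g. "Sequence", "Split"
    components: List[Union[AtomicProcess, "CompositeProcess"]] = field(default_factory=list)

# An illustrative composite process: three atomic steps in sequence,
# each output feeding the next step's input.
annotate = CompositeProcess(
    name="SemanticAnnotation",
    control_construct="Sequence",
    components=[
        AtomicProcess("HTMLWrapper", ["Document"], ["Text-dc_annotation"]),
        AtomicProcess("TermDetector", ["Text-dc_annotation"], ["Text-Thesaurus_annotation"]),
        AtomicProcess("TextClassifier", ["Text-Thesaurus_annotation"], ["Text-Classifier_annotation"]),
    ],
)
print(annotate.control_construct, [p.name for p in annotate.components])
```

The control construct here determines the execution ordering of the contained processes, as the OWL-S Process Model prescribes.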
The current research study exposes the advantages of using the existing OWL's XML Presentation Syntax [14] to encode OWL [11] domain ontology fragments as XML documents that can usefully be passed between Web services and that may be needed by other components in the same workflow. The current approach considers two Web services:

1. Taxonomy-Thesaurus mapping service: a service that provides functionality to associate terminological entries from a thesaurus with concepts from a simple taxonomy ontology, making explicit how terminological entries from a thesaurus cover a certain topic. The Taxonomy-Thesaurus mapping performed plays a pivotal role in obtaining a text classifier that can carry out automatic text categorisation.
2. Semantic annotation service: a service that provides functionality to perform three different types of annotation of textual documents: a) Dublin Core annotation, b) thesaurus annotation, and c) ontology-based annotation, which is devoted to describing the content of documents and where thematic metadata is used to describe the semantics of the document.

Each service considers different kinds of activities. It is necessary to detail each activity and to consider whether the activity can be related to an atomic process or to a composite process that can be further refined into a combination of atomic processes. Furthermore, it is essential to decide what the inputs and outputs are for each of the considered processes. Figures 1 and 2 show the inputs and outputs for the composite


processes of the Taxonomy-Thesaurus mapping service and the Semantic annotation service respectively. The name of each input or output is specified in bold black, and a type is defined for each input and output (between brackets, in grey). The inputs' and outputs' types are classes/concepts of the ontologies that appear in figure 3.

Fig. 1. Inputs and outputs for composite processes of the Taxonomy-Thesaurus mapping service

Fig. 2. Inputs and outputs for composite processes of the Semantic annotation service
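The essence of the Taxonomy-Thesaurus mapping shown in figure 1 can be sketched as follows, assuming a toy thesaurus and invented concept names: terminological entries are mapped to taxonomy concepts, and a document's term vector collapses into a (smaller) concept vector.

```python
# Invented illustrative mapping of thesaurus entries to taxonomy concepts;
# the real mapping is built by domain experts via the mapping service.
term_to_concepts = {
    "brand equity":   ["Branding"],
    "market segment": ["MarketSegmentation"],
    "niche market":   ["MarketSegmentation", "MarketingStrategy"],
    "publicity":      ["Promotion"],
}

def concept_vector(term_vector):
    """Collapse a term vector into a concept vector over taxonomy concepts."""
    concepts = set()
    for term in term_vector:
        concepts.update(term_to_concepts.get(term, []))
    return sorted(concepts)

print(concept_vector(["brand equity", "niche market"]))
# → ['Branding', 'MarketSegmentation', 'MarketingStrategy']
```

Because several terms map to the same concept, the concept vector is typically shorter than the term vector, which is the size advantage noted in Section 2.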

The research study presented in this paper adheres to a modular ontology design. Existing methodologies and practical ontology development experiences have in common that they start from the identification of the purpose of the ontology and the need for domain knowledge acquisition [32], although they differ in their focus and


steps to be taken. In this study, the three basic stages of the knowledge engineering methodology of CommonKADS [33], coupled with a modularised ontology design, have been followed:

I. KNOWLEDGE IDENTIFICATION: in this first stage, several activities were carried out: exploring all domain information sources in order to elaborate the most complete characterisation of the application domain, and listing potential components for reuse. The following knowledge sources were identified: a) the SWRC ontology [17], which generically models key entities relevant for typical research communities and the relationships between them, b) a taxonomy of marketing topics that appears in [34], and c) several terminological resources related to the marketing field, such as [35] or [36].

II. KNOWLEDGE SPECIFICATION: in this second stage, the domain model was developed. An overview of the modular ontological design appears in figure 3.

Fig. 3. Overview of the modular ontological design


Four ontologies have been considered: 1) the SWRC ontology [17], from which several top-level concepts and relationships have been reused, 2) the Document Extension ontology, which is an extension of the SWRC ontology to include terminological resources, 3) the Marketing ontology, which can be considered an extension of the SWRC ontology to incorporate an adaptation of the taxonomy of marketing topics from [34], and 4) the Data Set ontology, which is introduced to facilitate the linkage between the inputs' and outputs' types of the Web services and the classes/concepts of the other three ontologies. Protégé 3.2 beta [26] has been chosen as the ontology-design and knowledge acquisition tool to build OWL [11] ontologies. Figure 4 shows a screenshot of Protégé 3.2 beta during the OWL ontology development that illustrates the relationship between the SWRC ontology and the Data Set ontology (dso) by means of the object property belongsTo, which is highlighted.

Fig. 4. A screenshot of Protégé 3.2 beta during the OWL ontology development

III. KNOWLEDGE REFINEMENT: in this third stage, the resulting domain model is validated by paper-based simulation, and more terms from [36] are added to the marketing thesaurus developed. It is also evaluated how each terminological entry added to the thesaurus is associated with one or more categories of a finite set of categories (a selection of concepts from the taxonomy of marketing topics).


The next section provides details about the three different types of annotations considered. The annotation process relies on a combination of Web services. To enable a Web services composition that easily deals with terminological background knowledge, the research approach relies on OWL-S [12] to describe Web services and exploits the advantages of using the existing OWL's XML presentation syntax [14] to encode OWL [11] domain ontology fragments as XML documents that can usefully be passed between Web services and that may be needed by other components in the same workflow.

4 Annotation

Services can be described as a collection of atomic or composite processes, which can be connected together in various ways, and whose data and control flow can be specified. Figure 5 shows the control flow and data flow for the three composite processes of the Semantic annotation service. The details about how to encode OWL ontology fragments as XML documents that can usefully be passed between processes in the same workflow are described below.

Fig. 5. Control flow and data flow for composite processes of the Semantic annotation service
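The control flow of figure 5 amounts to a sequence over the three composite processes, with each process consuming the previous one's XML output. A stub sketch (the function names follow the paper's process names; the bodies are placeholders, not real implementations):

```python
# Stub sketch of the Semantic annotation service's data flow. Each
# function stands for one composite process of figure 5 and returns a
# placeholder XML string in place of the real annotation document.

def html_wrapper(html: str) -> str:
    """Dublin Core annotation: produces Text-dc_annotation (XML)."""
    return "<dc/>"  # placeholder

def term_detector(dc_xml: str) -> str:
    """Thesaurus annotation: produces Text-Thesaurus_annotation (XML)."""
    return "<thesaurus/>"  # placeholder

def text_classifier(thes_xml: str) -> str:
    """Ontology-based annotation: produces Text-Classifier_annotation (XML)."""
    return "<classifier/>"  # placeholder

def semantic_annotation_service(html: str) -> str:
    # Sequence control construct: the output of each step feeds the next.
    return text_classifier(term_detector(html_wrapper(html)))

print(semantic_annotation_service("<html>...</html>"))  # → <classifier/>
```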

A Web Services-Based Annotation Application for Semantic Annotation

1145

Web services are part of a trend in XML-based distributed computing, and XML does not provide any means of talking about the semantics (meaning) of data. However, an XML document type definition can be derived from a given ontology, as pointed out in [37]. This linkage has the advantage that the XML document structure is grounded on a true semantic basis. There is more than one way to derive an XML Schema [38] from an OWL ontology that is compatible with the RDF/XML syntax. Many possible XML encodings could be imagined, but the most obvious solution is to use the existing OWL Web Ontology Language XML Presentation Syntax [14], which is the solution taken here. The owlx namespace prefix should be treated as being bound to http://www.w3.org/2003/05/owl-xml and is used for the existing OWL's XML presentation syntax. Subsection 4.3 provides an example of individual axioms (also called "facts") based on the XML presentation syntax for OWL. These facts are outputs and/or inputs for each of the three composite processes of the Semantic annotation service that appear in figure 5.

4.1 Dublin Core Annotation

Documents may be in many different formats. The current approach considers two document formats: HTML and MS Word. Depending on the document source provided, an HTML-wrapper (see figure 5) or an MS Word text converter is used. Because the World Wide Web (WWW) has become one of the most widely used information resources, the HTML-wrapper has proved more useful for automatic processing. The HTML-wrapper, which appears in figure 5, performs the task of extracting the data embedded in Web pages for further processing. The goal of the HTML-wrapper is to translate the relevant data embedded in Web pages into a structured format: the Dublin Core [15] lightweight ontology, which is widely used to specify a predefined set of document features such as title, creator, etc.
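For illustration, the fragment below serialises one such document-feature fact in the owlx namespace cited above. The element shapes are only an approximation of the OWL XML presentation syntax, and the individual, class, and property names (doc_001, Document, dc_title) are invented:

```python
import xml.etree.ElementTree as ET

# Approximate sketch of an individual axiom ("fact") in the OWL XML
# presentation syntax. The namespace URI is the one given in the text;
# the exact element shapes of the syntax may differ from this sketch.
OWLX = "http://www.w3.org/2003/05/owl-xml"
ET.register_namespace("owlx", OWLX)

def q(tag):
    """Qualified tag/attribute name in the owlx namespace."""
    return f"{{{OWLX}}}{tag}"

fact = ET.Element(q("Individual"), {q("name"): "doc_001"})
ET.SubElement(fact, q("type"), {q("name"): "Document"})
prop = ET.SubElement(fact, q("DataPropertyValue"), {q("property"): "dc_title"})
ET.SubElement(prop, q("DataValue")).text = "An invented document title"

print(ET.tostring(fact, encoding="unicode"))
```

Fragments of this shape are what the composite processes of figure 5 exchange: each process reads facts produced upstream and appends the facts it derives.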
The HTML-wrapper performs a Dublin Core annotation, creating as output an XML document (Text-dc_annotation) that contains a set of document features according to the Dublin Core lightweight ontology and that will be used as the input of the Term-Detector_based-on_Thesaurus (see figure 5).

4.2 Thesaurus Annotation

The Term-Detector_based-on_Thesaurus (see figure 5) uses simple terms and/or compound terms from thesauri of specific domains to annotate documents. The process can be seen as automatic indexing with controlled vocabularies/thesauri, and is therefore closely related to automated metadata generation. Terminological resources are increasingly available on-line, e.g. TERMIUM [39] or EURODICAUTOM [40]. In the particular case of the marketing field, many online dictionaries and glossaries of marketing terms can be found. However, not all of them appear to be comprehensive enough. To illustrate this: on the one hand, the American Marketing Association offers an on-line marketing dictionary [35] which is a good resource of information, although well-researched definitions are not always provided. On the other hand, the Faculty of Business and Economics at the Monash


University (Australia) offers an on-line marketing dictionary [36], a comprehensive glossary of marketing terms that offers well-researched definitions for most of the marketing-related terms likely to be needed on this topic, and which can easily be converted into a thesaurus where each terminological entry may contain synonyms, abbreviations, regional variants, and definitions.
Text categorisation is one of the core problems in text mining. The goal of text categorisation is to automatically assign text documents to a finite set of predefined categories. With the rapid growth of Web pages on the World Wide Web (WWW), text categorisation has become more and more important both in research and in applications. One important challenge for large-scale text categorisation is how to reduce the number of features required for building reliable text classification models. There are typically two types of algorithms to represent the feature space used in classification. One type is the so-called "feature selection" algorithms, i.e. selecting a subset of the most representative features from the original feature space. The other type is called "feature extraction", i.e. transforming the original space into a smaller feature space to reduce the dimension. The use of a thesaurus can be seen as a type of "feature selection", because only the terms which belong to a terminological entry of the thesaurus will be taken into account. Furthermore, the use of a thesaurus is expected to contribute to "feature extraction" as well, because most probably only a portion of the terms from the thesaurus will be found in the text documents of the corpus considered. The Dublin Core annotation performed by the HTML-wrapper is more ambiguous than the annotations based on a thesaurus or controlled vocabulary.
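The thesaurus-based feature selection described above reduces, in essence, to counting occurrences of the thesaurus terms in a document. A minimal sketch, with an invented toy thesaurus:

```python
import re

# Invented toy thesaurus of simple and compound terms; the real thesaurus
# is derived from the marketing dictionaries discussed in the text.
thesaurus_terms = ["market research", "brand", "niche market"]

def thesaurus_annotation(text: str) -> dict:
    """Return the term frequency tf_k of each thesaurus term found in text."""
    text = text.lower()
    tf = {}
    for term in thesaurus_terms:
        count = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        if count:
            tf[term] = count
    return tf

doc = "Market research informs brand positioning; brand value grows."
print(thesaurus_annotation(doc))  # → {'market research': 1, 'brand': 2}
```

Only terms that actually occur are kept, which is the "feature extraction" effect noted above: the document's feature vector shrinks to the subset of thesaurus terms present.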
As depicted in figure 5, the XML document (Text-dc_annotation), which is the output of the HTML-wrapper, is complemented by the Term-Detector_based-on_Thesaurus, which performs a thesaurus annotation and creates as output an XML document (Text-Thesaurus_annotation) that introduces a local (document-level) measure: term frequency. For each simple or compound term tk from the thesaurus that appears in the document dj, not only the name of the term is annotated but also its term frequency (term count) tfk, i.e. the number of times that the term tk occurs in the document dj.

4.3 Ontology-Based Annotation

The Text-Classifier_based-on_Taxonomy-Thesaurus-Mapping (see figure 5) performs the task of annotating the document with thematic metadata by relating parts of the document (previously identified simple and/or compound terms that belong to a thesaurus) to one or several concepts of the domain ontology (e.g. concepts from a taxonomy of topics). The construction of the Text-Classifier_based-on_Taxonomy-Thesaurus-Mapping involves: a) a phase of term selection, in which the most relevant terms for the classification task are defined and where simple and compound terms which belong to terminological entries of the thesaurus have been selected, and b) a phase of term weighting, in which weights for the selected terms are computed based on the explicit associations made between terminological entries from the thesaurus and concepts from a simple taxonomy ontology.


Thanks to the Taxonomy-Thesaurus mapping service described in subsection 3.2, the terminological entries of a thesaurus can be grouped and assigned to a finite set of categories (a selection of concepts from the domain ontology), and therefore the text indexing performed can be considered an instance of text categorisation. Text categorisation problems are usually multi-class, in the sense that there are usually more than two possible categories. Although in some applications there may be a very large number of categories, the current research study focuses on the case in which there is a small to moderate number of categories. It is also common for text categorisation tasks to be multi-label, meaning that the categories are not mutually exclusive, so that the same document may be relevant to more than one category. In the particular case of the marketing field, there is a high overlap among categories. Ranking categorisation has been introduced to deal with overlapping categories. In other words, for a given document dj, the existing categories are ranked according to their estimated appropriateness to dj, without taking any "hard" decision about any of them. As pointed out in [41], the inductive construction of a ranking classifier for a category ci ∈ C usually consists in the definition of a function CSVi: D → [0, 1] that, given a document dj, returns a categorisation status value for it, that is, a number between 0 and 1 which, roughly speaking, represents the evidence for the fact that dj ∈ ci. The CSVi function takes on different meanings according to the learning method used. For example, probabilistic classifiers (see [42] for a thorough discussion) view CSVi(dj) in terms of a probability. The thesaurus annotation performed by the Term-Detector_based-on_Thesaurus is, in general, more ambiguous than the annotations based on ontologies.
As depicted in figure 5, the XML document (Text-Thesaurus_annotation), which is the output of the Term-Detector_based-on_Thesaurus, is complemented by the Text-Classifier_based-on_Taxonomy-Thesaurus-Mapping, which performs an ontology-based annotation to create as output an XML document (Text-Classifier_annotation). An example of ontology-based annotation appears in figure 6, where significant facts from an XML document (Text-Classifier_annotation) have been included. The research approach presented in this paper estimates the appropriateness of a category ci to a document dj based on the weight of the category ci in the document dj, wcij, which is determined by a combination of a local (document-level) measure and a global (thesaurus-level) measure. The local (document-level) measure is the term frequency (term count) tfk, i.e. the number of times that a simple or compound term tk from the thesaurus appears in the document dj. The global (thesaurus-level) measure is the term weight wk, i.e. the value assigned to a simple or compound term tk from the thesaurus. The weighting scheme considers, on the one hand, that terms from the thesaurus that have been associated with too many categories (high overlap) should receive a low weight, while terms that have been associated with only one category should receive a high weight. On the other hand, the total number of terms from the thesaurus associated with each category should be taken into account to prevent categories with a higher number of associated terms from becoming predominant.
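One plausible realisation of this weighting scheme is sketched below. The exact formula is not given in the paper, so the choices here (wk = 1 / number of associated categories, and a per-category normalisation by the number of associated terms) are illustrative assumptions that satisfy the two stated requirements:

```python
def category_weights(term_freqs, term_to_categories):
    """Estimate wc_ij, the weight of each category c_i in a document d_j.

    Local measure: term frequency tf_k of each thesaurus term in d_j.
    Global measure (assumed form): term weight w_k = 1 / (number of
    categories the term is associated with), so highly overlapping terms
    count for little. Each category total is then divided by the number
    of thesaurus terms associated with that category, so categories with
    many associated terms do not dominate.
    """
    # Number of thesaurus terms associated with each category.
    terms_per_category = {}
    for categories in term_to_categories.values():
        for c in categories:
            terms_per_category[c] = terms_per_category.get(c, 0) + 1

    weights = {}
    for term, tf in term_freqs.items():
        w = 1.0 / len(term_to_categories[term])  # global (thesaurus-level) weight
        for c in term_to_categories[term]:
            weights[c] = weights.get(c, 0.0) + tf * w  # local x global

    return {c: total / terms_per_category[c] for c, total in weights.items()}

# Invented thesaurus-to-taxonomy associations, purely for illustration.
mapping = {"price": ["MC_Price"], "discount": ["MC_Price", "MC_Promotion"]}
print(category_weights({"price": 3, "discount": 2}, mapping))
# {'MC_Price': 2.0, 'MC_Promotion': 1.0}
```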


Fig. 6. Significant facts related to ontology-based annotation

5 Evaluation

The evaluation of the Web services-based annotation application considers: 1) the experimental results obtained over two data sets of highly specialised documents (i.e. expert-to-expert documents) that were selected from the vast marketing corpora available, and 2) the feedback obtained from end-users (marketing experts and trained terminologists) through interviews and observation-based methods.

5.1 Experimental Results

Two data sets of highly specialised marketing documents (i.e. expert-to-expert marketing documents) were considered: a) a training and validation set TV, which contains a selection of 72 documents intended to provide an overall overview of the field of marketing and which includes case studies, chapters from books, journal papers, etc.,


and b) a test set TE, which contains 60 marketing articles whose scope is wide enough to cover the spectrum of marketing's sub-disciplines. Table 1 lists the marketing categories considered, where each category is associated with a top-level marketing concept from a taxonomy of marketing topics. Table 1 shows the number of documents of the test set TE that were manually assigned to each category by marketing experts. For a given document dj, the Web services-based annotation application ranks the seven marketing categories according to their estimated appropriateness to the document dj (see subsection 4.3), and normalises those values by means of JavaScript functions before showing them to the end-user. Table 1 shows the estimated appropriateness assigned by the Web services-based annotation application to the documents that were manually assigned to each category. Based on the results obtained with the set TV, an experimental threshold was defined: only the documents with less than 72% estimated appropriateness can be considered wrongly classified under a certain category by the Web services-based annotation application, because the evaluation sessions performed with different marketing experts indicate that a) it is usual to assign more than one category to a given marketing document, and b) marketing experts do not always easily agree on the most appropriate category for a given marketing document.

Table 1. Test set TE. Documents assigned to each category manually and automatically.

Categories                    Manually assigned by    Web services-based annotation application:
                              marketing experts       estimated appropriateness
                                                      100%   Equal or more   Less than
                                                             than 72%        72%
MC_Customer_and_Marketing              8                3         2             3
MC_Market_Segmentation                 5                4         -             1
MC_Product_and_Services               15               15         -             -
MC_Price                               8                8         -             -
MC_Place                               4                2         2             -
MC_Promotion                           6                6         -             -
MC_Marketing_Research                 14                8         4             2
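The error figures discussed next (37.5% for MC_Customer_and_Marketing, 10% overall for the 60-document test set) follow directly from the counts of documents falling below the 72% threshold. A small, hypothetical helper makes the arithmetic explicit:

```python
def error_rate(manually_assigned, below_threshold):
    """Fraction of a category's manually assigned documents that the
    application placed below the 72% appropriateness threshold."""
    return below_threshold / manually_assigned

# MC_Customer_and_Marketing: 3 of its 8 documents fall below the threshold.
print(f"{error_rate(8, 3):.1%}")   # 37.5%
# Test set TE overall: 6 of 60 documents fall below the threshold.
print(f"{error_rate(60, 6):.0%}")  # 10%
```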

For the reasons mentioned above, only 10% of the marketing documents of the test set TE can be considered wrongly classified. This result, combined with the relatively high level of estimated appropriateness (equal to or more than 72%), encourages extending the current study over the vast marketing corpora available to verify whether the current approach facilitates the economical annotation of large document collections. In order to improve the categorisation capability of the Web services-based annotation application, it may be necessary to check the correctness of the selected categories and to modify or insert new terms into the thesaurus. To illustrate this: among the finite set of marketing categories considered in table 1, the category with the biggest


categorisation error (37.5%) is MC_Customer_and_Marketing. With the aim of shedding some light on the causes of the relatively high categorisation error found for the category MC_Customer_and_Marketing, different marketing experts were consulted. The consultation revealed that marketing experts strongly disagree about the selection of terminological entries from the thesaurus and their associations with the category MC_Customer_and_Marketing; therefore, a refinement of the semantics (meaning) of the category and of its associated terminological entries is needed.

5.2 End-Users' Feedback

Several evaluation sessions were performed to obtain feedback from marketing experts and trained terminologists. During those sessions, the Think-Aloud Protocol (TAP) [43] was frequently used to gain an outline of the efficacy of the Web services-based annotation application. TAP is a verbal protocol method popularly used to gather usability data during system evaluation by asking users to vocalise their thoughts, feelings and opinions concurrently while interacting with the system. The audio-recorded data reveal that comments like "incredible quick", "just at the click of a button", or "is quite right" appear frequently. These comments highlight the fact that the Web services-based annotation application has substantially reduced the time that marketing experts and trained terminologists have to invest to classify a highly specialised document about marketing (i.e. an expert-to-expert marketing document) and to extract/highlight a minimum of relevant marketing terminology from the document, from an average of hours to an average of seconds. Furthermore, the above-mentioned tasks have been simplified to just a click of a button.

6 Conclusions

The approach highlights the benefits of Semantic Web technologies for obtaining a combination of Web services that easily deal with terminological background knowledge to automate the semantic annotation and text categorisation of highly specialised documents (i.e. expert-to-expert documents). On the one hand, although OWL-S [12] is not currently ready to support the dynamic discovery, composition, and invocation of services, OWL-S facilitates defining the inputs and outputs of a service in terms of an ontology, which is a step towards enabling dynamic discovery, composition, and invocation of services without user intervention. On the other hand, the OWL Web Ontology Language XML Presentation Syntax [14] has been shown to be a good way of encoding OWL [11] domain ontology fragments as XML documents that can usefully be passed between Web services and that may be needed by other components in the same workflow. The substantial reduction in the time and effort required from marketing experts and trained terminologists to classify highly specialised marketing documents (i.e. expert-to-expert marketing documents) and extract/highlight a minimum of relevant marketing terminology from documents, together with the high accuracy of the automatic text categorisation performed by the Web services-based annotation application, encourages extending the current research study to a large collection of documents. Furthermore, from the point of view of terminologists, translators and


interpreters, as well as translator and interpreter trainers, it is paramount to have an easy-to-use environment or tool that not only provides terms already available from a controlled vocabulary or thesaurus but, more importantly, very quickly shows/highlights those terms in context (how the terms are being used in highly specialised documents).

References

1. Amardeilh, F., Laublet, P., Minel, J.L.: Document annotation and ontology population from linguistic extractions. In: Proceedings of the 3rd International Conference on Knowledge Capture, pp. 161–168 (2005)
2. Corcho, O.: Ontology based document annotation: trends and open research problems. International Journal of Metadata, Semantics and Ontologies 1, 47–57 (2006)
3. Kahan, J., Koivunen, M.R., Prud'Hommeaux, E., Swick, R.R.: Annotea: An Open RDF Infrastructure for Shared Web Annotations. In: Proceedings of the 10th International World Wide Web Conference (WWW10), Hong Kong, China (2001)
4. Kettler, B., Starz, J., Miller, W., Haglich, P.: A Template-based Markup Tool for Semantic Web Content. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729. Springer, Heidelberg (2005)
5. OntoMat-Annotizer, http://annotation.semanticweb.org/ontomat/index.html
6. Kalyanpur, A., Hendler, J., Parsia, B., Golbeck, J.: SMORE – Semantic Markup, Ontology, and RDF Editor (2005), available at: http://www.mindswap.org/papers/SMORE.pdf
7. The SHOE Knowledge Annotator, http://www.cs.umd.edu/projects/plus/SHOE/KnowledgeAnnotator.html
8. Benjamins, V.R., Contreras, J., Blázquez, M., Dodero, J.M., García, A., Navas, E., Hernández, F., Wert, C.: Cultural heritage and the semantic web. In: Bussler, C., Davies, J., Fensel, D., Studer, R. (eds.) The Semantic Web: Research and Applications, First European Semantic Web Symposium, pp. 433–444. Springer, Heidelberg (2004)
9. Saha, http://www.seco.tkk.fi/applications/saha/
10. Medjahed, B., Bouguettaya, A.: A multilevel composability model for semantic Web services. IEEE Transactions on Knowledge and Data Engineering 17(7), 954–968 (2005)
11. OWL, http://www.w3.org/2004/OWL/
12. Martin, D. et al.: Bringing Semantics to Web Services: The OWL-S Approach. In: Cardoso, J., Sheth, A.P. (eds.) SWSWPC 2004. LNCS, vol. 3387. Springer, Heidelberg (2005)
13. Tsur, S., Abiteboul, S., Agrawal, R., Dayal, U., Klein, J., Weikum, G.: Are Web Services the Next Revolution in e-Commerce? (Panel). In: Proc. Very Large Data Bases Conf., pp. 614–617 (2001)
14. Hori, M., Euzenat, J., Patel-Schneider, P.F.: OWL web ontology language XML presentation syntax. W3C Note (2003), available at: http://www.w3.org/TR/owl-xmlsyntax/
15. Dublin Core, http://dublincore.org/documents/dces/
16. Medical Subject Headings (MeSH), http://www.nlm.nih.gov/mesh/meshhome.html
17. SWRC ontology, http://ontoware.org/projects/swrc/
18. Benjamins, V.R., Fensel, D.: Community is knowledge! (KA)2. In: Proceedings of the 11th Workshop on Knowledge Acquisition, Modeling, and Management, Banff, Canada (1998)
19. OntoWeb, http://www.ontoweb.org
20. Hyvönen, E.: Semantic Web Applications in the Public Sector in Finland – Building the Basis for a National Semantic Web Infrastructure. Norwegian Semantic Days, Stavanger, Norway (2006)
21. Gruber, T.R.: A translation approach to portable ontologies. Knowledge Acquisition 5(2), 199–220 (1993)
22. McIlraith, S., Son, T.C., Zeng, H.: Semantic Web services. IEEE Intelligent Systems, Special Issue on the Semantic Web 16, 46–53 (2001)
23. Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargas-Vera, M., Motta, E., Ciravegna, F.: Semantic Annotation for Knowledge Management: Requirements and a survey of the state of the art. Journal of Web Semantics 4(1) (2006)
24. Ajax, http://adaptivepath.com/publications/essays/archives/000385.php
25. http://www.w3.org/TR/owl-xmlsyntax/owlxml2rdf.xsl
26. Protégé, http://protege.stanford.edu/
27. Elenius, D., Denker, G., Martin, D., Gilham, F., Khouri, J., Sadaati, S., Senanayake, R.: The OWL-S Editor – A Development Tool for Semantic Web Services. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005, pp. 78–92 (2005)
28. WSDL, http://www.w3.org/TR/wsdl20
29. SOAP, http://www.w3.org/TR/soap12-part0/
30. WS-BPEL, http://www.ibm.com/developerworks/library/specification/ws-bpel/
31. WSMO Working Group: D16.1v0.2 The Web Service Modeling Language WSML
32. Davies, J., Fensel, D., van Harmelen, F. (eds.): Towards the Semantic Web: Ontology-Driven Knowledge Management. John Wiley, Chichester (2002)
33. Schreiber, A., Akkermans, H., Anjewierden, A.A., Hoog, R., Shadbolt, N.R., Van de Velde, W., Wielinga, B.: Knowledge Engineering and Management: The CommonKADS Methodology. The MIT Press, Cambridge (1999)
34. Fernandez Prieto, M.J., Moroto Garcia, N.: The SALCA project: marketing terminology in Spanish and English. In: Thelen, M., Lewandowska-Tomaszczyk, B. (eds.) Translation and Meaning Part 5, pp. 231–239. Hogeschool Zuyd, Maastricht School of Translation and Interpreting, Maastricht (2001)
35. Dictionary of Marketing Terms, http://www.marketingpower.com/mg-dictionary.php
36. Monash University: marketing dictionary, http://www.buseco.monash.edu.au/mkt/dictionary/
37. Erdmann, M., Studer, R.: Ontologies as conceptual models for XML documents. In: Proceedings of KAW 1999, Banff, Canada (1999)
38. XML Schema, http://www.w3.org/XML/Schema
39. TERMIUM, http://www.termium.gc.ca/
40. EURODICAUTOM, http://iate.europa.eu/
41. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1–47 (2002)
42. Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: Nédellec, C., Rouveirol, C. (eds.) Machine Learning: ECML-98. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998)
43. Ericsson, K.A., Simon, H.A.: Protocol Analysis: Verbal Reports as Data. MIT Press, Cambridge (1984)
