Automation and evaluation of the semantic annotation of Web resources

Sahar Maâlej Dammak, Anis Jedidi, Rafik Bouaziz

Sfax University, MIRACL Laboratory
Sfax University Faculty of Economics and Management of Sfax
Sfax, Tunisia
[email protected]

Sfax University Higher Institute of Computer Science and Multimedia of Sfax Sfax, Tunisia [email protected]

Abstract- Annotating a Web page associates semantics with its content. However, given the huge number of pages managed around the world, especially since the advent of the Web, annotating them manually is impossible. In this paper, we focus on the semi-automatic annotation of Web pages: we propose an approach and a framework for the semantic annotation of Web pages, entitled “Querying Web”. Our solution is based on enriched domain ontologies and on the analysis of the RDF result produced by the “Semantic Radar” plug-in. We also present an evaluation of the automation achieved by our framework.

Keywords- Semantic Web; Semantic annotation; Domain ontologies; Querying Web

I. INTRODUCTION

The existing annotations of Web resources are not sufficient to satisfy users when they query the Web. Our goal is therefore to see how Web resources can be annotated semantically in order to improve the interrogation process in a Semantic Web environment. Web resources are very heterogeneous, both in their structure and in the language used, so the semantic annotation of Web pages is a difficult task and the annotation process needs to be automated.

Our principal contribution in this paper is a semantic annotation approach for Web resources that improves the interrogation process of the Web. We want to assist users during Web search. We are also interested in the operationalization of this approach, in order to automate the extraction of RDF annotation metadata from Web resources. We therefore propose a framework for semantic annotation, “Querying Web”, which uses the “Semantic Radar” plug-in [4] [10] to extract semantic descriptors such as FOAF, SIOC, etc. for each Web page. In addition, we use “Eclipse” as the development tool for this framework, to automate the extraction of RDF metadata based on the semantic annotation model that we proposed in [7]. We use this model to search for equivalences between the concepts of the enriched domain ontology corresponding to the user's search area and the concepts generated by the Semantic Radar plug-in.

Sfax University Faculty of Economics and Management of Sfax Sfax, Tunisia [email protected]

The paper is structured as follows. Section 2 presents the semantic annotation process of Web resources. Section 3 describes the implementation details of the framework “Querying Web”. Section 4 shows the contribution of our framework through the evaluation of its results. Finally, Section 5 presents our conclusion and some further work.

II. AN APPROACH TO SEMANTIC ANNOTATION

We propose a new approach to the semantic annotation of Web resources [6]. Our approach is based on domain ontologies that have been extended with FOAF (Friend-of-a-Friend) [5] and SIOC (Semantically-Interlinked Online Communities) [3] concepts (these concepts represent the data of the Semantic Web) in the instantiation process. Although we do not find in the literature an ontology that includes the FOAF, SIOC, DOAP and/or RDFa standards, we note that the extensions made to ontologies usually use the SIOC and FOAF standards. This approach uses the semantic structures instantiated by the Semantic Radar tool in a set of RDF files. The general steps of the proposed approach are as follows (cf. Fig. 1):

S1: After specifying the study field, the user interrogates the Web through a semantic search engine. This interrogation is based on the concepts of the domain ontology used.

S2: Then, we propose to use an appropriate method, which we intend to define, for filtering Web pages. The Web pages returned by the interrogation process pass through the filtering process, which selects the most relevant ones.

S3: Subsequently, we propose to transfer the filtered Web pages to the automatic semantic analysis performed by Semantic Radar. This analysis returns, for each page, an RDF file containing descriptors such as FOAF and SIOC. The set of RDF files extracted by Semantic Radar is the input of the annotation process.

S4: The selected and analyzed pages are then annotated automatically, by a method that we propose hereafter, to produce RDF metadata for each Web resource.

Copyright © 2013 ICITST-2013 Technically Co-Sponsored by IEEE UK/RI Computer Chapter


This method is based on the equivalence rules and on the semantic annotation model that we proposed in [7]. The new RDF resource will be linked to the original resource and released on the Web.

Semantic annotation method: The basis of our semantic annotation method is to seek equivalences between the concepts in the result of the analysis made by Semantic Radar on the Web resources and the concepts of the enriched domain ontology used. The user starts the search from the concepts of the research field, which are defined in the hierarchy of the enriched domain ontology. Based on this equivalence, we have proposed equivalence rules and a semantic annotation model [7]; in both, we defined the descriptors of the semantic annotations. The annotation result for each resource is an RDF file that enhances the RDF result generated by Semantic Radar by applying our rules and model. The dotted part in Fig. 1 shows the main parts of our semantic annotation method for Web resources.

S5: Finally, once the Web resources have been annotated, we propose a new sorting of the Web pages shown to the user, so that a more relevant response is displayed: the annotated resources appear at the top of the list.
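The re-sorting described in step S5 can be illustrated with a minimal sketch. It assumes that the set of annotated URLs is already known; the function name `sort_results` and the `example.org` URLs are hypothetical and do not come from the framework itself.

```python
def sort_results(urls, annotated_urls):
    """Return the URLs with annotated resources first.

    The original order is kept inside each group (annotated / not annotated).
    """
    annotated = set(annotated_urls)
    # sorted() is stable and False sorts before True,
    # so annotated pages move to the top of the list.
    return sorted(urls, key=lambda url: url not in annotated)


# Hypothetical result list; only the second URL appears in the paper.
results = [
    "http://example.org/page-a",
    "http://journal.webscience.org/532/",
    "http://example.org/page-b",
]
ranked = sort_results(results, {"http://journal.webscience.org/532/"})
```

A stable sort is used here so that, apart from promoting annotated pages, the ranking returned by the search engine is left unchanged.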

III. A SEMANTIC SEARCH ENGINE

We have developed a prototype, called “Querying Web”, which uses the “Semantic Radar” plug-in to extract semantic descriptors (Step 3 of our approach) and “Eclipse” as the development tool to automate the extraction of RDF metadata (Step 4 of the approach). We limit our presentation to the principal implementation parts of our semantic annotation framework, through the following steps, which we defined to automate the extraction of RDF annotation metadata from Web resources. Fig. 2 shows the interface of our semantic search engine.

A. Writing a semantic query

As shown above, Web querying is done through a new semantic search engine. The domain expert must first specify the study domain (the right part of the graphical interface, “Domain Ontologies”, in Fig. 2). Once the domain is specified, we show the user the hierarchy of the ontology to assist him/her in writing the semantic query. In addition, we ask him/her to follow a well-defined syntax when writing this query, in order to obtain a correct advanced search on the Web (cf. Fig. 2).

General syntax of search query:

“Ontology+Name”+Keyword [+OR+Keyword]*

N.B.:
+ : replaces each space between the words of “Ontology Name” and of each keyword.
[ ] : indicates that the specification is optional.
* : indicates that the specification can be repeated.

We chose to study the field “Network of Scientists”, which corresponds to the enriched domain ontology “Network of Scientists” (vivo.owl). All examples apply to this area. VIVO is an open source Semantic Web application that, when populated with researcher interests, activities and accomplishments, enables discovery of research and scholarship across disciplines [8].

B. Web Querying

After studying the options for querying the Internet through our search engine, we found that a Web service is needed to communicate with Google. The Google Custom Search API (footnote 1) is an API for retrieving and displaying results from a Custom Search Engine. With this API, we can use REST requests to obtain Web search results. It limits the number of requests to 100 per day. We can therefore use the Google Custom Search API to query the Web with Google from our Java application in Eclipse. This API replaces an older API called “Google Web Search API” (footnote 2), which has been officially deprecated since November 1, 2010.
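As an illustration of the query syntax and of the REST request, the following sketch builds a query string and a request URL. It is written in Python for brevity, whereas the framework itself is a Java application. The endpoint and the `key`, `cx` and `q` parameters are those of the Custom Search JSON API as we understand it; the function names and the `KEY`/`CX` placeholder values are hypothetical. Reading each `+` of the syntax as an encoded space is also an assumption, and no network call is made.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API (assumed; check the API documentation).
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_query(ontology_name, keywords):
    """Build a query following "Ontology+Name"+Keyword[+OR+Keyword]*."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    name = "+".join(ontology_name.split())
    terms = ["+".join(keyword.split()) for keyword in keywords]
    return '"' + name + '"+' + "+OR+".join(terms)


def build_request_url(ontology_name, keywords, api_key, engine_id):
    """Build the REST request URL without sending it."""
    # Assumption: '+' in the syntax stands for a space, so it is decoded
    # before urlencode() re-encodes the whole query.
    plain_query = build_query(ontology_name, keywords).replace("+", " ")
    params = {"key": api_key, "cx": engine_id, "q": plain_query}
    return CSE_ENDPOINT + "?" + urlencode(params)


query = build_query("Networking of Scientists", ["ConferencePaper"])
url = build_request_url("Networking of Scientists", ["ConferencePaper"], "KEY", "CX")
```

With several keywords, for example `["Librarian", "Student"]`, the optional `+OR+Keyword` part of the syntax is produced once per additional keyword.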

Figure 1. Steps of our approach

1 https://developers.google.com/custom-search/
2 https://developers.google.com/web-search/


Figure 2. Example of a first level of annotation on a Web page (parts A and B)

In addition, the Custom Search API Project and the Custom Search Engine are run in the background for each Web search, transparently to the user.

C. Definition of a search method

After the query is written, the search is conducted through a connection between the Google Custom Search API and the Custom Search Engine. As a search method, we propose, at each Web interrogation, to display the URLs returned by Google (ten links) in the table “Table of URL”, together with the number of the search list. Part A in Fig. 2 shows the URLs displayed in Table of URL after interrogation; part B shows the number of the search list. In addition, with each search we import the RDF descriptions of the Web pages returned in the list. An internal file, “RDF.txt”, is created automatically to serve as the first source of analysis. This file contains the RDF descriptors extracted by Semantic Radar, only for the pages that have a first level of semantic annotation. Fig. 2 shows that our annotation system has imported the RDF description of the resource http://journal.webscience.org/532/; this description represents a first level of annotation by Semantic Radar. Its content is stored in the internal file “RDF.txt”.

D. Generation of annotation metadata

In this step, we produce the RDF annotation metadata for each Web resource. The required inputs are ready: the enriched domain ontology (.OWL), the FOAF and SIOC ontologies (.RDF) and the file “RDF.txt”. The remaining task is to apply the equivalence rules and the semantic annotation model proposed in [7] to generate an RDF document related to the original resource on the Web (cf. Fig. 1). We have defined four equivalence rules to integrate into the resource description. The equivalence is identified in the relationship between the FOAF concepts of the FOAF ontology and the SIOC concepts of the SIOC ontology [3]. The proposed semantic annotation model is based on the annotation of the FOAF (or SIOC) concepts (stored in the internal file “RDF.txt”) by the FOAF (or SIOC) concepts of the enriched domain ontology, the domain concepts and/or the FOAF (or SIOC) concepts of the FOAF (or SIOC) ontology (cf. Fig. 3). In this model, we have defined different annotation descriptors, such as “Has Child”, “IS A”, etc.

We have implemented the automatic generation of the RDF annotation of each Web page in Java, using Eclipse. Our annotation system annotates the Web page automatically and semantically after interrogation. An RDF annotation file is automatically generated for each Web resource. This file is an enhancement of the first RDF result produced for the resource by Semantic Radar (the first level of annotation). Fig. 2 shows a first level of annotation for the Web resource http://journal.webscience.org/532/. Fig. 4 displays the proposed semantic annotation for the same resource, which is generated automatically by modifying the result of Semantic Radar. In this example, the concept “FOAF: Person” of the Semantic Radar result is annotated by the concepts of the enriched domain ontology “Network of scientists”, using the proposed semantic annotation model: Librarian, EmeritusFaculty, FacultyMember, NonFacultyAcademic, NonAcademic, EmeritusProfessor, EmeritusLibrarian, Student. In addition, this annotation is enriched by annotating the same concept with the concepts of the FOAF ontology (Agent).
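The “FOAF: Person” example can be mimicked with a small sketch. It is only an illustration under strong assumptions: the two dictionaries are hand-written excerpts standing in for vivo.owl and the FOAF ontology, the function name `annotate` is hypothetical, and attaching the domain concepts under “Has Child” and the FOAF concept under “IS A” is our reading of the example, not the definition of the model given in [7].

```python
# Hypothetical excerpts; a real run would read vivo.owl and the FOAF ontology.
ENRICHED_DOMAIN_CONCEPTS = {
    "foaf:Person": [
        "Librarian", "EmeritusFaculty", "FacultyMember", "NonFacultyAcademic",
        "NonAcademic", "EmeritusProfessor", "EmeritusLibrarian", "Student",
    ],
}
FOAF_ONTOLOGY_CONCEPTS = {"foaf:Person": ["foaf:Agent"]}


def annotate(concepts_from_semantic_radar):
    """Attach annotation descriptors to each concept that has an equivalent."""
    annotations = {}
    for concept in concepts_from_semantic_radar:
        descriptors = {
            # Assumed mapping of descriptors to sources (see lead-in).
            "Has Child": list(ENRICHED_DOMAIN_CONCEPTS.get(concept, [])),
            "IS A": list(FOAF_ONTOLOGY_CONCEPTS.get(concept, [])),
        }
        # Concepts without any equivalent are left without annotation.
        if descriptors["Has Child"] or descriptors["IS A"]:
            annotations[concept] = descriptors
    return annotations


result = annotate(["foaf:Person", "foaf:Document"])
```

In this sketch the output is a plain dictionary; the framework described here produces an RDF file instead.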

[Figure 3 (diagram). Labels: Semantic Radar; RDF file for a Web page (contains the FOAF concepts, the SIOC concepts, etc.); An equivalence search; The enriched domain ontology (the domain concepts, the FOAF concepts, the SIOC concepts); The FOAF Ontology; The SIOC Ontology; An RDF metadata for a Web resource.]

The FOAF concepts are annotated by the FOAF concepts of the enriched domain ontology, the domain concepts and/or the FOAF concepts of the FOAF ontology. The SIOC concepts are annotated by the SIOC concepts of the enriched domain ontology, the domain concepts and/or the SIOC concepts of the SIOC ontology.

Figure 3. Application of the proposed semantic annotation model

E. Attachment of the annotation metadata to their original resources

In our approach, the new RDF annotation resource has to be linked to the original resource and released on the Web. To achieve this goal, we proposed, on the one hand, to create a new annotation database in order to connect each original resource to its RDF annotation. On the other hand, the user's interrogation has to be indexed on both the Google Web server and the new annotation database, in order to return both the not-annotated Web pages and the annotated ones (registered in the annotation database). To create the annotation database, we used the Web development platform “EasyPHP 5.3.9” (footnote 3). After the semantic and automatic annotation of Web resources, each resource must be recorded with its RDF in the annotation database. We distinguish three cases:

• If the resource is not annotated, the automatic annotation is generated (cf. Fig. 4) and the resource is recorded automatically with its annotation.

• If the resource is already annotated by the enriched domain ontology, the message “Resource already annotated” appears (the resource is already stored in the database).

• If the resource is already annotated with one ontology and the current query uses another ontology, an automatic annotation is generated with the new ontology for the same resource, and the resource is recorded automatically.

3 EasyPHP 5.3.9 integrates the Apache server, the MySQL database, the PHP language and the tools PhpMyAdmin and Xdebug, which facilitate the development of Web sites or applications.
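The three recording cases can be sketched as follows. This is a simplified illustration: an in-memory dictionary stands in for the MySQL annotation database, and `record_annotation`, `generate_rdf` and the returned status strings (other than the message “Resource already annotated”) are hypothetical names.

```python
def record_annotation(database, url, ontology, generate_rdf):
    """Record the annotation of `url` for `ontology` and return a status.

    `database` maps each URL to a dict {ontology name: RDF annotation};
    it is a stand-in for the annotation database.
    """
    per_ontology = database.setdefault(url, {})
    if ontology in per_ontology:
        # Case 2: already annotated with this enriched domain ontology.
        return "Resource already annotated"
    annotated_before = bool(per_ontology)
    per_ontology[ontology] = generate_rdf(url, ontology)
    if annotated_before:
        # Case 3: annotated before, but with another ontology.
        return "annotated with a new ontology"
    # Case 1: first annotation of this resource.
    return "annotated"
```

A call such as `record_annotation(db, url, "vivo.owl", make_rdf)`, where `make_rdf` is any function returning the RDF text, covers the first case on an empty `db` and the second case when repeated.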

Figure 4. Example of a generated file of a semantic annotation for a Web resource

IV. EXPERIMENTAL STUDY

For this study, we again use the field “Network science”, which corresponds to the enriched domain ontology “Network of Scientists” (vivo.owl). We see that automating the generation of RDF annotations for Web resources is beneficial for this area: this annotation accelerates the interrogation process for the user. After the proposed annotation of Web resources, a new search returns a new sorting of the result, in which annotated resources appear at the top of the list.

We then evaluate the automation achieved by our framework “Querying Web”. To this end, we present below the percentages of annotation automation for this case study. We take the result of the query “Networking+of+Scientists”+ConferencePaper, which follows the general query syntax. This query returns ten URLs (ten not-annotated pages) in the first search list. After running the annotation process, the engine annotates three of the ten pages. We present below the statistics of the RDF annotation automation step for Web resources, using the following four percentages:

• Percentage of the pages annotated (correctly): the percentage of pages annotated automatically and correctly, relative to the Web pages returned after interrogation.

Percentage of pages annotated correctly = Number of pages annotated correctly / Number of Web pages (1)

• Percentage of the pages not-annotated: the percentage of pages that are not annotated, relative to the Web pages returned after interrogation.

Percentage of pages not-annotated = Number of pages not-annotated / Number of Web pages (2)

• Percentage of the pages that should be annotated (annotation missing): the percentage of pages that must be annotated but have not been, relative to the Web pages returned after interrogation.

Percentage of the pages that should be annotated = Number of the pages that should be annotated / Number of Web pages (3)

• Percentage of the unnecessarily annotated pages: the percentage of annotated pages that should not have been annotated, relative to the Web pages returned after interrogation.

Percentage of the unnecessarily annotated pages = Number of the pages that shall not be annotated / Number of Web pages (4)

We present these percentages for our case in the table below:

TABLE I. RESULT OF THE AUTOMATION OF THE SEMANTIC ANNOTATION

Percentage of pages annotated correctly | Percentage of pages not-annotated | Percentage of the pages that should be annotated | Percentage of the unnecessarily annotated pages
30% | 70% | 0% | 0%
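As a worked check of formulas (1) to (4), the sketch below recomputes the values of Table I from the counts of the case study: ten pages returned, three annotated correctly, seven not annotated, none missing and none annotated unnecessarily. The helper name `percentage` and the dictionary labels are ours.

```python
def percentage(count, total_pages):
    """Share of `count` among the pages returned by the query, in percent."""
    return 100.0 * count / total_pages


total_pages = 10  # pages returned in the first search list
counts = {
    "annotated correctly": 3,      # formula (1)
    "not-annotated": 7,            # formula (2)
    "should be annotated": 0,      # formula (3)
    "unnecessarily annotated": 0,  # formula (4)
}
table_1 = {label: percentage(n, total_pages) for label, n in counts.items()}
```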

To better evaluate this automation, we then used the following standard metrics: recall, precision and F-Measure [9].
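The sketch below shows one way to compute these three metrics for the case study. It rests on an assumption: to reproduce the precision of 30% reported in Table II, the “data found by the program” are taken to be the ten pages returned by the query, and the “real data” are the three pages that should be annotated (no annotation is missing). If only the three annotated pages were counted as found, the precision would be 100% instead. The function name is hypothetical.

```python
def precision_recall_f(correct, found, relevant):
    """Return (precision, recall, F-measure) as fractions between 0 and 1."""
    precision = correct / found
    recall = correct / relevant
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Case study under the assumption stated above.
p, r, f = precision_recall_f(correct=3, found=10, relevant=3)
```

Rounded to whole percentages, these values agree with the figures of Table II.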


The precision (P) is the number of correct data found by the program (i.e., the correctly annotated pages) divided by the total number of data found by the program.

The recall (R) is the number of correct data found by the program divided by the total number of real data identified manually.

The F-Measure (F) is a metric that combines the measures of precision and recall into a single value:

F = (2 * P * R) / (P + R) (5)

Table 2 shows the evaluation of the automation of the semantic annotation of Web resources by the framework “Querying Web”.

TABLE II. RESULT OF THE AUTOMATION OF THE SEMANTIC ANNOTATION BY THE METRIC STANDARDS

                    | Precision: P | Recall: R | F-Measure: F
The annotated pages | 30%          | 100%      | 46%

So, we conclude that the percentage of annotated pages obtained by our framework is about 45% (percentage of satisfaction with our framework) for this case study. We did not find in the literature studies that evaluate the number of Web pages annotated by a proposed system compared to the number of not-annotated Web pages. However, we find studies on information extraction for the annotation of texts (not Web pages) that reach an F-measure of 67.23%, as in [1]. In addition, the authors of [2] evaluated the annotation of an English corpus and obtained a result of approximately 50%. We also see in [11] the automatic semantic annotation of temporal expressions in Web pages for an e-tourism application; in this case, the F-measure obtained is 58.9%. In our work, we obtained a percentage of satisfaction with our framework (for the annotation of the entire contents of Web pages) of about 45%. We believe that this percentage is satisfactory for accelerating the querying process for the user, especially since it measures the automation of semantic annotation, a tedious task, on Web resources that are very heterogeneous.

V. CONCLUSION AND FURTHER WORK

In this paper, we have presented the steps of the approach that we propose for the semantic annotation of Web resources. This approach helps querying in a Semantic Web environment. The results of the implementation of our framework, which automates the semantic annotation, have shown the feasibility and benefits of our proposals, although this automation is a delicate and difficult task. This framework, entitled “Querying Web”, assists the user in his/her search on the Web. The field of study “Network of Scientists” has allowed us to show the importance of the semantic annotation. The evaluation of this annotation by the framework “Querying Web” helped to determine the percentage of automation of this annotation, which is about 45%. This result is satisfactory compared with those achieved in the literature. However, improvements are needed.

In future work, we will propose an extension of the query language “SPARQL” (footnote 4) in order to query the annotation metadata proposed for the Web resources (step 1 of our approach). In addition, the use of this language will accelerate the querying of these resources. We will also elaborate a solution to filter the Web pages after interrogation (step 2 of our approach). This task is in progress and will be published later. The filtering method that we are currently developing is based on two principles: allocating scores to the Web pages (annotated and indexed) and classifying these pages.

4 www.w3.org

REFERENCES

[1] A. Ben Abacha and P. Zweigenbaum, “Annotation et interrogation sémantiques de textes médicaux,” Atelier Web Sémantique Médical, Nîmes, pp. 61–70, 2010.
[2] A. Ben Abacha, P. Zweigenbaum, and A. Max, “Extraction d'information automatique en domaine médical par projection interlangue : vers un passage à l'échelle,” Traitement automatique des langues naturelles, Grenoble, 2012.
[3] U. Bojars and J. Breslin, “Sioc core ontology specification,” http://rdfs.org/sioc/spec/, 2010. (Access date: 11 October 2011).
[4] U. Bojars, A. Passant, F. Giasson, and J. Breslin, “An Architecture to Discover and Query Decentralized RDF Data,” 3rd Workshop on Scripting For The Semantic Web, Innsbruck-Austria, 2007.
[5] D. Brickley and L. Miller, “Foaf vocabulary specification,” http://xmlns.com/foaf/spec/, 2010. (Access date: 11 October 2011).
[6] S. Maâlej, “Annotation sémantique des ressources Web: État de l’art et perspectives de recherche,” Inforsid-2012, Montpellier-France, pp. 591–598, 29-31 May 2012.
[7] S. Maâlej, A. Jedidi, and R. Bouaziz, “Semantic Annotation Framework for Web resources,” Fifth International Conference on Internet Technologies & Applications, ITA, Wrexham, North Wales, UK, pp. 106–113, 10-13 September 2013.
[8] S. Mitchell, S. Chen, M. Ahmed, B. Lowe, P. Markes, N. Rejack, J. Corson-Rikert, B. He, Y. Ding, and the VIVO collaboration, “The VIVO Ontology: Enabling Networking of Scientists,” ACM WebSci'11, Koblenz, Germany, pp. 1–2, 14-17 June 2011.
[9] V. Rijsbergen, “Information Retrieval,” Butterworth, isbn 0-408-709294, 1979.
[10] SemanticR, “Semantic Radar,” https://addons.mozilla.org/en-US/firefox/addon/semantic-radar/, 2009. (Access date: 7 March 2011).
[11] S. Weiser, “Repérage et typage d'expressions temporelles pour l'annotation sémantique automatique de pages Web-Application au e-tourisme,” Doctoral thesis, University Paris Ouest Nanterre La Défense, June 2010.
