Faculty of Science
Department of Applied Computer Science

Ontology-based Crawler for the Semantic Web

Dissertation for the degree of Master in Applied Computer Science

Felix Van de Maele

Promotor: Prof. Dr. Robert Meersman
Advisor: Dr. Peter Spyns

May 2006
Abstract

Information interoperability has received increased attention with the growing popularity of the Internet, the Web and distributed computing infrastructures. During this evolution, the attention paid to semantics and ontologies as a means to achieve this interoperability has also increased. The same is happening on the Semantic Web, where ontologies are used to assign (agreed) meaning to the content of the Web. On the Semantic Web, data will inevitably be linked to many different ontologies, and information processing across ontologies is not possible without knowing the semantic mappings between them. As the resources on the Semantic Web are annotated using these ontologies, new search techniques can be applied to find specific and structured information. This evolution calls for new tools to assist the user in the discovery and extraction of specific information resources from the Semantic Web.

In this thesis, we describe an ontology-focused crawler for the Semantic Web. We argue that this crawler can exploit semantic metadata to efficiently discover and extract information resources on the Semantic Web. In our approach, we use a topic ontology to guide the crawler to the relevant information. We present DMatch-lite as the algorithm used to guide the crawler. DMatch-lite is an automated ontology matching algorithm that matches the topic ontology against the Semantic Web ontologies discovered during the crawl. It computes the similarity score between both ontologies and returns a list of candidate mappings. These candidate mappings can later facilitate the integration of the extracted ontologies. By computing the similarity between a topic ontology and Semantic Web ontologies, instead of comparing a keyword description to the text of the resources, we believe that the crawler is more efficient in the discovery and extraction of information resources.

We embed this crawler system into the DOGMA Framework. The DOGMA Studio Workbench provides the ontology engineer with a powerful framework to model the topic ontology commitment used to guide the crawler. By embedding the crawler into the DOGMA Framework, the data extracted during the crawl can also be efficiently used in other applications such as ontology integration and elicitation.
Samenvatting

With the growing popularity of the Internet and the Web, the need for information interoperability has only increased. During this evolution, the attention paid to semantics and ontologies has also risen sharply. A similar evolution can be observed on the Semantic Web, where ontologies are used to assign (agreed) meaning to the content of the Web. The data on the Semantic Web thus inevitably becomes linked to many different ontologies, and processing this information is not possible without knowing the semantic relations between the different ontologies. Because data is annotated with these ontologies, new search algorithms can be developed to find specific and structured information.

In this thesis, we describe an ontology-based crawler for the Semantic Web. We show that this crawler can exploit the semantically annotated data to discover and retrieve resources on the Semantic Web in an efficient manner. In our approach, we use a topic ontology to guide the crawler to the relevant information. We propose the DMatch-lite algorithm to exploit the semantically annotated data. DMatch-lite is an automated ontology matching algorithm that compares the source ontology with the ontologies found on the Semantic Web. It computes the similarity between both ontologies and returns a list of candidate relations between the concepts of the ontologies. We believe that our approach, with a source ontology and DMatch-lite as ontology matcher, offers a better solution than using a plain keyword description to guide the crawler to the relevant pages.

This crawler is developed within the DOGMA Framework. We use the DOGMA Studio Workbench to model the source ontology. Moreover, by using the DOGMA Framework, the data gathered during the crawl can also be used in other applications, such as the integration and elicitation of ontologies.
Acknowledgements

I would like to sincerely thank my promotor, Prof. Dr. Robert Meersman, for introducing me to the world of semantics and giving me the opportunity to work on this project. Special thanks go to Dr. Peter Spyns for providing me with the necessary guidance and support. His proofreading and critical remarks on the draft versions have contributed tremendously to this thesis. Furthermore, I would like to thank my colleagues at STARLab for proofreading various parts of this thesis and especially for creating a wonderful work environment during my time at STARLab. Finally, I would like to thank my parents and friends for their support and never-ending encouragement.
Contents

1 Introduction
  1.1 Problem Statement and Motivation
  1.2 Proposed Solution and Methodology
  1.3 Overview of this Thesis

2 Background
  2.1 Ontologies
  2.2 The DOGMA Ontology Framework
    2.2.1 The Lexon Base
    2.2.2 The Commitment Layer
    2.2.3 The Concept Definition Server
  2.3 The Semantic Web
    2.3.1 Structure of the Semantic Web
  2.4 Web Crawlers
    2.4.1 Crawler Architecture
    2.4.2 Crawling Algorithms
    2.4.3 Crawler Evaluation
  2.5 Ontology/Schema Matching
    2.5.1 The Matching Problem
    2.5.2 Matching Typology and Classification

3 Crawling the Semantic Web
  3.1 Context and Motivation
  3.2 Structure of the Semantic Web
    3.2.1 Semantic Web Documents
  3.3 The Crawler Framework
    3.3.1 Goals and Requirements
  3.4 Crawler Architecture
    3.4.1 High-level Overview
    3.4.2 Crawler Thread Overview
    3.4.3 Crawler Thread Flow Diagram

4 Relevance Computation: DMatch-lite
  4.1 Context
    4.1.1 The Matching Process
    4.1.2 DMatch-lite vs. Ontology Integration
    4.1.3 The Similarity Measures
    4.1.4 Requirements and Limitations
  4.2 DMatch-lite: Methodology
    4.2.1 Step 1: Feature Engineering
    4.2.2 Step 2: Building the Search Space
    4.2.3 Step 3: Similarity Iterations
    4.2.4 Step 4: Final Similarity Computation
  4.3 DMatch-lite: Similarity Iterations
    4.3.1 Matching Example
    4.3.2 String-based Similarity
    4.3.3 Linguistic-based Similarity
    4.3.4 Semantic Similarity
    4.3.5 Evaluation
  4.4 Conclusion

5 Crawler Implementation
  5.1 Overview
    5.1.1 The Crawler Package
    5.1.2 The Generic Package
    5.1.3 The Matching Package
  5.2 DMatch-lite Implementation
    5.2.1 Similarity Iterations Package
    5.2.2 Similarity Algorithms Package
    5.2.3 Mapping Package
    5.2.4 Relevance Sets Package
    5.2.5 Using DMatch-lite

6 Crawler Evaluation
  6.1 Introduction
  6.2 Crawler Harvest Rate
    6.2.1 Harvest Rate on Different Ontologies
    6.2.2 Harvest Rate on Ontology- and Keyword-Based Focused Crawling
  6.3 Availability of SWDs
  6.4 Crawler Performance
    6.4.1 CPU Time per Page
    6.4.2 Process Partitioning

7 Conclusions

8 Future Work

Bibliography

A Crawler Transcripts
  A.1 DMatch-lite Matching the Two Beer Ontologies
  A.2 The Crawler in Action

B KnowledgeWeb Project Ontology
List of Figures

2.1 Illustration of the two levels in DOGMA ontology. (Reproduced from [22])
2.2 Illustration of two terms (within their resp. contexts), being articulated (via the mapping ct) to their appropriate concept definition. (Inspired by [22])
2.3 The Semantic Web layers. (Picture reproduced from [89] and [52])
2.4 The flow of a basic sequential crawler.
2.5 The crawl-graph of a general crawler and a preferred crawler.
2.6 The crawl-graph of a Focused Best-First Crawler.
2.7 How a relevant page can only be reached if there exists a path of relevant pages from the root to the given page.
2.8 An example of a topic subtree from a hierarchical directory. (Reproduced from [88])
2.9 An example of the different target sets of a topic subtree.
2.10 Matching two relational schemas.
2.11 An example of the annotated websites of two computer science departments.
2.12 Matching typology, based on [79] and [85].
2.13 Semantic Intensity Spectrum. (Reproduced from [49])
2.14 Structure Awareness.
2.15 Context Awareness example using Relevance Sets. (Reproduced from [58])
3.1 The context and applications of the crawler within the Semantic Web and DOGMA.
3.2 The T-Lex tool from the DOGMA Studio Workbench.
3.3 The T-Lex Commitment Editor, where constraints can be added to the lexons.
3.4 The scope of the crawler within DOGMA.
3.5 The highest-level overview of the crawler, distinguishing the two major systems that compose the Focused Crawler.
3.6 The crawler in a possible real-world environment.
3.7 Pragmatic view on the architecture of the Crawler Thread.
3.8 The flow diagram of the crawling process.
4.1 The match operator.
4.2 Diagram of the ontology commitment in Ω-RIDL.
4.3 The DMatch-lite match operator within the Ontology Integration context.
4.4 DMatch-lite's matching process methodology.
4.5 The heap structure used in the computeFinalSimilarity algorithm.
4.6 A first beer ontology, described in RDFS.
4.7 A second beer ontology, now represented as a DOGMA Ontology Commitment.
4.8 A plot of the Soundex Normalize() function.
4.9 Matching the contexts of the two Beer concepts from both ontologies.
4.10 Matching the contexts of the two Beer concepts from both ontologies.
5.1 The 3 packages that make up our Focused Crawler.
5.2 The class diagrams of the Crawler package.
5.3 The class diagrams of the Generic package.
5.4 The class diagrams of the Matching package.
5.5 The class diagrams of the SimilarityIterations package.
5.6 The class diagrams of the SimilarityAlgorithms package.
5.7 The class diagrams of the Mapping package.
5.8 The class diagrams of the RelevanceSets package.
5.9 The sequence diagram showing the use of the DMatch-lite algorithm.
6.1 The harvest ratio of two crawls: one using the Beer ontology and the other using the KnowledgeWeb Project ontology.
6.2 The harvest ratio of two crawls: one using the Beer ontology as topic description and the other using a simple 'Beer' keyword.
6.3 The number of SWDs found during the two crawls.
6.4 The CPU time in milliseconds needed to process a single page.
6.5 The partitioning of the processing time of a single page into the several crawling-loop phases.
List of Tables

3.1 Approximate number of SWDs indexed by Google in May 2005 and May 2006.
3.2 An example of the Vector Space Model.
3.3 Variants of the term enzyme activity.
3.4 Examples of stemming terms with a Porter stemmer.
4.1 An example search space.
4.2 The overview of the similarity iterations with their similarity metrics and weights.
4.3 Some non-normalized similarity scores between the ontology terms using the Levenshtein distance.
4.4 Some non-normalized similarity scores between the ontology terms using the Soundex similarity metric.
4.5 A part of the WordNet synsets that contain the term "car".
4.6 The hyponym tree from the WordNet "Beer" synset.
4.7 Some of the resulting mapping rules after applying the WordNet Relations algorithm on them.
4.8 The Search Space Matrix after the String Similarity Iteration.
4.9 The Search Space Matrix after the Linguistic Similarity Iteration.
4.10 The Search Space Matrix after the Semantic Similarity Iteration.
6.1 The KnowledgeWeb Project ontology (in RDFS) automatically converted to Lexons.
List of Definitions

Definition 1: Ontology by Gruber
Definition 2: Ontology by Ushold and Gruninger
Definition 3: Lexon
Definition 4: The Lexon Base
Definition 5: Concept Definition Server
Definition 6: Meta-Lexon Base
Definition 7: Term Frequency / Inverse Document Frequency
Definition 8: Relevance Sets by Ehrig and Maedche
Definition 9: Semantic Web Ontology (SWO)
Definition 10: Similarity
Definition 11: Normalized Similarity
Definition 12: Concept Similarity
Definition 13: Mapping Rule
Definition 14: Linguistic Relation
Definition 15: Similarity Aggregation
Definition 16: Soundex Normalize Function
Definition 17: WordNet Relations Normalize Function
Definition 18: Jaccard Similarity
Definition 19: Context Similarity Normalize Function
Chapter 1
Introduction

Interoperability, that is, the ability to exchange and use information, has been a basic requirement in modern Information Systems environments for years [83]. One of the reasons for the increased attention to data and information interoperability is the excellent progress in interconnection afforded by the Internet, the Web and distributed computing infrastructures. This has led to easy access to a large number of independently created and managed information sources of broad variety. This evolution in the context of interoperability has been classified into three generations [83]: Generation I, which covers the period to roughly 1985; Generation II, which covers the decade through 1995; and Generation III, which covers a period, yet to be bounded, starting in 1996.

In the current generation, more and more focus is put on semantics and ontologies to arrive at the interoperability of information [46]. This is no surprise, since the difference between a general data model and an ontology lies in the fact that ontologies are shared conceptualisations, that is, agreement is necessary [48]. To be able to share, reuse and agree on information, this information must first be available and known to the systems or experts involved. Therefore, we believe that finding this relevant information, especially on the (Semantic) Web, is an important and growing problem. The possibility, and therefore also the impulse, to find, extract and use this information will only increase as more and more semantic data becomes available on the Web.
1.1 Problem Statement and Motivation
In this thesis, we focus on the problem of automated discovery and extraction of information sources on the Web; more specifically, we focus on web pages and Semantic Web Ontologies as information sources. While the general, non-guided discovery and extraction of information has been available for quite some time, we consider that the problem of extracting specific, structured and semantic information has not yet been sufficiently tackled.

Let us give a possible scenario to illustrate the problem. Say, for example, a classical library wants to make its content electronically available on the Web. Obviously, it wants to do that in an intelligent, structured manner, such that not only humans can access the content, but also other artificial systems (which we will call agents). Therefore, it is necessary to model this information using ontologies. The experts that have to model all this data want to know how other, similar organizations, such as online bookstores and other electronic libraries, have structured their data. To do that, they want a tool that, given their specification, will look on the Web and find all the information that corresponds, to some extent, to their specification.

Another practical scenario might be the following: imagine a social networking portal. One of the goals of the portal is to group specific contact information of people. Hence, the portal system needs to look on the Web to find such information. To do that, the developers can specify a kind of person ontology that models their domain of interest (the contact information of persons). Using this information, they can crawl the Web to find pages that contain a similar kind of information. For example, pages that use the FOAF project ontology (see http://www.foaf-project.org/) will be found in this search, since this ontology will be similar to the specified topic ontology. The developers also want to retrieve these ontologies so that a mapping can be created between their ontology and the ontologies found during the crawl. Using this mapping, these external ontologies can be directly embedded in their portal site [54].

The above description and example scenarios of the problem can be summarized as "the automated and ontology-guided discovery of resources on the Semantic Web within the DOGMA framework", which is the scope of this thesis.
1.2 Proposed Solution and Methodology
In this thesis, we present an Ontology-based Crawler to solve the problem of the automated and ontology-guided discovery of resources on the Semantic Web. Crawlers have been around for a long time and have proven their usefulness and success on the Web. Nonetheless, these general-purpose crawlers are not sufficient to tackle the stated problem: they crawl the Web in a blind and exhaustive manner. Since our goal is to find very specific data on the Semantic Web, this exhaustive approach will not find the requested information, considering the current size of the Web. Therefore, we propose a focused crawling process, so that the crawler is guided to the relevant information and no time is wasted on irrelevant resources. These kinds of focused crawlers are also referred to as preferential or heuristic-based crawlers [37].

The heuristic we use in our proposed solution is ontology matching. Since the goal is to find information resources on the Semantic Web, we expect that most of these resources are semantically annotated. We believe that by providing the crawler with a topic ontology commitment, we can efficiently guide the crawler to the relevant information resources.

Our proposed crawler system is embedded into the DOGMA framework. We use the DOGMA Studio Workbench to model the topic ontology commitment. This gives the ontology engineer or domain experts the necessary tools and expressiveness to create a fine-grained and detailed model of the domain of interest that is used to guide the crawler. Not only does the DOGMA framework help to model the topic ontology, the resources discovered and extracted during the crawl can also be used within the DOGMA framework in many other applications, for example, the integration (merging and aligning) of the discovered ontology resources.
1.3 Overview of this Thesis
In the following chapter, we provide a background discussion of the different research fields, technologies and techniques that are relevant for an ontology-based focused crawler. More specifically, we begin by introducing ontologies. In the next section, we describe the DOGMA framework, a database-inspired approach to ontology engineering. We discuss its double articulation principle: the decomposition of an ontology into a Lexon Base and a Commitment Layer. The DOGMA framework is also decomposed into two levels: the language level and the conceptual level. Following our description of the DOGMA framework, we provide a brief overview of the Semantic Web and its structure. As we develop a crawler for the Semantic Web, an introduction to Web Crawlers is given in the following section. We briefly discuss the architecture of a general crawler and provide a state of the art on focused crawling algorithms. We end the section on web crawlers with a methodology that can be used to evaluate these focused crawlers. In the last section of our background discussion, we give an introduction to the domain of schema and ontology matching. By giving an overview and classification of several matching techniques and algorithms discussed in the current literature, we justify the choices made in our proposed matching solution.

In chapter 3, we begin by motivating and sketching the context of our crawler within the Semantic Web and the DOGMA framework. We also provide a formal terminology to be used in the rest of the thesis. Having defined the context and a consistent terminology, we proceed by giving an in-depth description of our proposed crawler framework. First, we define the goals, the requirements and the scope of the framework. This allows us to describe and justify our crawler architecture.

The relevance computation algorithm is of crucial importance to our proposed solution and forms the basis of the crawling technique that distinguishes us from existing approaches. We describe our proposed relevance computation algorithm, DMatch-lite, in chapter 4. We start by identifying the context of the matching process and then define, in line with other literature on schema and ontology matching, our Match Operator as the central operation in our matching process. In the next section, we discuss the differences between our DMatch-lite matching process and ontology integration. The second and third parts of the chapter describe our DMatch-lite algorithm in more detail: we lay out our matching methodology and discuss the different algorithms used in DMatch-lite. We end the chapter with a brief evaluation that outlines the functioning of the algorithm in practice.

In chapter 5, we give an overview of our implementation of the crawler and matching system discussed in the previous chapters. After providing a high-level overview of the packages and classes that are part of the crawler system, we focus on the implementation of the matching algorithm. We end the chapter by depicting the functioning of our match algorithm on the implementation level with a sequence diagram.

In the last chapter, we perform a brief, empirical evaluation study of the crawler in an uncontrolled, practical environment, namely the Semantic Web. We introduce the harvest rate metric to compare the impact of using different techniques and different topic ontologies when crawling the Semantic Web. We also briefly discuss the availability of semantic metadata on the current Web and give an idea of the performance that can be expected from the crawler.
Chapter 2

Background

2.1 Ontologies
We start by giving a brief introduction to ontologies in the computer science domain. We find this introduction relevant since the notion of ontologies, and how we envision them, is key to understanding the rest of this thesis.

The concept of ontology was first used in the field of Philosophy. The term Ontology (with an uppercase 'O' this time) was later used by AI practitioners, who use ontologies mostly for building Knowledge Bases, and is now one of the foundations of the Semantic Web. Several definitions have been proposed in the literature (see [40] for more definitions). One of the most cited definitions of an ontology is given by Gruber [39]:

Definition 1 (Ontology by Gruber): An ontology is a specification of a shared conceptualization.

Another, closely related definition is given by Ushold and Gruninger in [91]:

Definition 2 (Ontology by Ushold and Gruninger): Ontologies are agreements about shared conceptualizations.

The latter definition is less strict: in this definition, ontologies and conceptualisations are kept clearly distinct. An ontology in this sense is not a specification of a conceptualisation; it is a (possibly incomplete) agreement about a conceptualisation [70].
An ontology always includes a vocabulary of representational concept labels to describe a shared domain. These concept labels are usually called terms (lexical references) and are associated with entities (non-lexical referents, i.e. the concepts) in the universe of discourse. Formal axioms are also introduced to constrain their interpretation and well-formed use. An ontology is in principle a formalisation of a shared understanding of a domain that is agreed upon by a number of agents [70]. In order for this domain knowledge to be shared amongst agents, they must have a shared understanding of the domain, and therefore agreement must exist on the topics about which to communicate. This raises the issue of ontological commitment, which Gruber [39] describes as "the agreements about the objects and relations being talked about among agents". In other words, in order to facilitate meaningful communication, an agent must commit to the semantics of the terms and relationships in the common ontology [76]. This includes axioms about properties of objects and how they are related, also called the semantic relationships of the ontology.
2.2 The DOGMA Ontology Framework
In this section, we describe the DOGMA framework (an acronym for Developing Ontology-Guided Mediation of Agents), under development at VUB STARLab. Our description of the DOGMA framework has been inspired by several previous papers, such as [68, 77, 70, 45, 25]. A formal foundation of the DOGMA framework can be found in [23, 22, 24].

The DOGMA approach to ontology engineering aims to satisfy real-world needs by developing a useful and scalable ontology engineering approach. To reach such an approach, DOGMA's philosophy is based on a double articulation: DOGMA can be seen as a representation model and framework that separates the specification of the conceptualisation (i.e. the lexical representation of concepts and their interrelationships; see the ontology Definitions 1 and 2) from its axiomatisation (i.e. the semantic constraints). Following DOGMA's double articulation principle, an ontology is decomposed into a Lexon Base and a Commitment Layer.
2.2.1 The Lexon Base

The Lexon Base is an uninterpreted, extensive and reusable pool of elementary building blocks for constructing an ontology [22]. These building blocks are called lexons, formalised in Definition 3.
Definition 3 (Lexon): A lexon is an ordered 6-tuple of the form < γ, ζ, t1, r1, r2, t2 > with γ ∈ Γ, ζ ∈ Z, t1, t2 ∈ T and r1, r2 ∈ R, where:
• T and R are sets of strings;
• t1 is called the head-term of the lexon and t2 is called the tail-term of the lexon;
• r1 is the role of the lexon and r2 is the co-role of the lexon;
• γ is the context in which the lexon holds;
• ζ is a code that refers to the natural language in which the lexon is defined.

A lexon < γ, ζ, t1, r1, r2, t2 > is a fact that might hold in a domain, expressing that within the context γ and for the natural language ζ, an object of type t1 might plausibly play the role r1 in relation to an object of type t2. Conversely, the same lexon states that within the same context γ and for the same language ζ, an object of type t2 might play the role r2 in (the same) relation to an object of type t1. This description of lexons shows that they represent plausible binary fact types (e.g., Person drives / is driven by Car).

Definition 4 (The Lexon Base): The Lexon Base Ω is a structure < T, R, Γ, Z, D, Λ > where:
• T ⊆ T is a non-empty finite set of terms that occur in the Lexon Base,
• R ⊆ R is a non-empty finite set of role names that occur in the Lexon Base,
• D is a not necessarily finite document corpus,
• Γ is a finite set of context identifiers,
• Z ⊆ Z is a non-empty finite set of natural language codes,
• Λ ⊆ Γ × Z × T × R × R × T is a finite set of 6-tuples; these 6-tuples are called lexons.

Logically, since lexons represent plausible fact types, this database of lexons can become very large. To guide an ontology engineer through this database, contexts impose a meaningful grouping of the lexons. The context of a lexon refers to the source it was extracted from; this source could be terminological resources or human domain experts. It is important to note that the Lexon Base is uninterpreted. For example, while a lexon like < γ, manager, is-a, subsumes, person > might intuitively express a specialisation relationship, the interpretation of a role/co-role label pair as being a part-of or specialisation relation is postponed to the commitment layer, where the semantic axiomatisation takes place.
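To make the shape of a lexon concrete, the sketch below models Definition 3 as a plain Java value type. It is illustrative only and assumes nothing about the actual DOGMA implementation; the class and field names are our own.

```java
// A minimal, immutable Java sketch of a lexon <γ, ζ, t1, r1, r2, t2>
// (illustrative only; not the DOGMA implementation).
public final class Lexon {
    private final String context;   // γ: the context in which the lexon holds
    private final String language;  // ζ: natural-language code, e.g. "en"
    private final String headTerm;  // t1
    private final String role;      // r1
    private final String coRole;    // r2
    private final String tailTerm;  // t2

    public Lexon(String context, String language, String headTerm,
                 String role, String coRole, String tailTerm) {
        this.context = context;
        this.language = language;
        this.headTerm = headTerm;
        this.role = role;
        this.coRole = coRole;
        this.tailTerm = tailTerm;
    }

    public String toString() {
        return "<" + context + ", " + language + ", " + headTerm + ", "
             + role + ", " + coRole + ", " + tailTerm + ">";
    }
}
```

For example, new Lexon("Traffic", "en", "Person", "drives", "is driven by", "Car") would represent the plausible binary fact type Person drives / is driven by Car.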
2.2.2 The Commitment Layer
The Commitment Layer can be reified as a separate layer, mediating between the Lexon Base and the applications that commit to the Lexon Base. Committing to the Lexon Base means selecting a meaningful set Σ of lexons from the Lexon Base that approximates well the intended conceptualisation (i.e. with respect to the application domain), and subsequently putting semantic constraints on this subset. The result (i.e., Σ plus a set of constraints), called an ontology commitment, is a logical theory that intends to model the meaning of the application domain [22]. The set of constraints in the ontology commitment is specific to an application (intended conceptualisation) using the ontology.
2.2.3 The Concept Definition Server

As we have introduced in the previous sections, a lexon is basically a lexical representation of a plausible conceptual relationship between two concepts, though there is no one-to-one mapping between a lexical representation and a concept. Therefore, a higher conceptual level is introduced [87]. As a result, not only do we have a double articulation between the Lexon Base and the Commitment Layer, we also have two different levels in the DOGMA ontology, namely the Language Level (where the Lexon Base is located) and the Conceptual Level.

On the Conceptual Level, we have the Concept Definition Server (CDS). The idea for a Concept Definition Server was first mentioned in [68] and is inspired by lexical databases such as WordNet [64]. In line with these lexical databases, a set of senses is kept for each (lexical) term (comparable to synsets in WordNet). A concept definition is unambiguously explained by a gloss (i.e. a natural language description) and a set of synonymous terms. Consequently, we identify each concept definition in the CDS with a concept identifier c ∈ C. The following definition specifies the CDS:

Definition 5 (Concept Definition Server): We define a Concept Definition Server Υ as a triple < TΥ, DΥ, concept > where:
• TΥ is a non-empty finite set of strings (terms),
• DΥ is a non-empty finite document corpus,
• concept : C → DΥ × ℘(TΥ) is an injective mapping between concept identifiers c ∈ C and concept definitions.
Further, we define conceptdef(t) = { concept(c) | concept(c) = < g, sy > ∧ t ∈ sy }, where gloss g ∈ DΥ and synset sy ⊆ TΥ.
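A minimal sketch of this conceptdef(t) lookup, assuming a simple in-memory map from concept identifiers to definitions; the class and method names are hypothetical and not part of the actual Concept Definition Server.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: each concept identifier maps to a gloss plus a synset,
// and conceptdef(t) returns every definition whose synset contains the term.
public class ConceptDefinitionServer {

    public static class ConceptDefinition {
        final String gloss;        // natural-language description
        final Set<String> synset;  // synonymous terms

        ConceptDefinition(String gloss, Set<String> synset) {
            this.gloss = gloss;
            this.synset = synset;
        }
    }

    private final Map<String, ConceptDefinition> definitions =
            new HashMap<String, ConceptDefinition>(); // concept id -> definition

    public void define(String conceptId, ConceptDefinition def) {
        definitions.put(conceptId, def);
    }

    // conceptdef(t): all senses in which the term occurs
    public Set<ConceptDefinition> conceptdef(String term) {
        Set<ConceptDefinition> senses = new HashSet<ConceptDefinition>();
        for (ConceptDefinition def : definitions.values()) {
            if (def.synset.contains(term)) {
                senses.add(def);
            }
        }
        return senses;
    }
}
```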
Going from the Language Level to the Conceptual Level corresponds to articulating lexons into meta-lexons. The articulation of a (lexical) term to a concept in the Concept Definition Server is also called the lift-up of a term. Figure 2.1 illustrates the two levels in DOGMA: on the left, the lexical level, where lexons are elicited from various contexts; on the right, the conceptual level, consisting of a concept definition server. The meaning ladder in between illustrates the articulation (or lift-up, hence the "ladder" terminology) of lexical terms into concept definitions.

[Figure 2.1: Illustration of the two levels in DOGMA ontology. (Reproduced from [22])]
Definition 6 (Meta-Lexon Base): Given a Lexon Base Ω and a total articulation mapping ct : Γ × (T ∪ R) → C, a Meta-Lexon Base MΩ,ct = { ml,ct | l ∈ Ω } can be induced.
Example: As an illustration of the defined concepts, consider Figure 2.2. The term "bank" in two different contexts can be articulated to different concept definitions in the CDS. The terms are part of some lexons residing in the Lexon Base. The knowledge engineer first queries the CDS Υ for the various concept definitions of the term: conceptdef(bank) = Sbank ⊆ DΥ × ℘(TΥ). Next, he articulates each term to the concept identifier of the appropriate concept definition:

• The term "bank" extracted from a seaport navigation document is articulated to a concept identifier c1 that corresponds to concept definition (or meaning) s1 ∈ Sbank (as illustrated on the right of Figure 2.2). A gloss and a set of synonyms (synset) are specified for s1.

• The term "bank" extracted from a financing book is, due to the different context it was extracted from, articulated to another concept identifier c2 that is associated with a concept definition s2 ∈ Sbank.

[Figure 2.2: Illustration of two terms (within their resp. contexts), being articulated (via the mapping ct) to their appropriate concept definition. (Inspired by [22])]
2.3 The Semantic Web
Introduction

The advent of the World Wide Web (WWW) has taken the availability of information to an unprecedented level. The rapid growth of the web poses new problems: anyone can easily publish a new document or add a link to a site, with no restrictions on the structure or validity of the new content. This lack of restrictions has partly been the reason for the success of the web: it has kept the expertise necessary to create content for the web low, and has thereby given rise to the sheer amount of available content. However, the overabundance of unstructured and possibly faulty information has made it difficult for the users of the web to find relevant information easily. It also poses scaling difficulties for current web crawlers and search engines.

These difficulties have given rise to the successor of the current web: the Semantic Web. The Semantic Web is a project that intends to create a universal medium for information exchange by giving meaning (semantics), in a manner understandable by machines, to the content of documents on the Web. Currently under the direction of the Web's creator, Tim Berners-Lee of the World Wide Web Consortium, the Semantic Web extends the World Wide Web through the use of standards, markup languages and related processing tools (see http://en.wikipedia.org/wiki/Semantic_web). With the Semantic Web, we not only receive more exact results when searching for information, but also know when we can integrate information from different sources, know what information to compare, and can provide all kinds of automated services in different domains, from future home appliances and digital libraries to electronic business and health services [90].
2.3.1 Structure of the Semantic Web

Currently, the World Wide Web consists primarily of documents written in HTML. This makes the Web readable for humans, but since HTML has limited ability to classify blocks of text apart from the roles they play, the Web in its current form is very hard to understand for computer agents. The purpose of the Semantic Web is to add a layer of descriptive technologies to web pages, so that they become readable by, and can be reasoned about by, computer agents. The Semantic Web principles are implemented in layers of Web technologies and standards. The layers are presented in Figure 2.3 and described as follows [52]:
[Figure 2.3: The Semantic Web layers. (Picture reproduced from [89] and [52])]

• The Unicode and Uniform Resource Identifier (URI) layers make sure that we use international character sets and provide means for identifying the objects in the Semantic Web, respectively. The most popular URIs on the World Wide Web are Uniform Resource Locators (URLs).

• The XML layer, with namespace and schema definitions, makes sure we can integrate the Semantic Web definitions with the other XML-based standards. XML provides a surface syntax for structured documents, but imposes no semantic constraints on the meaning of these documents. XML Schema is a language for restricting the structure of XML documents.

• The ontology layer supports the evolution of vocabularies as it can define relations between the different concepts. The structure is simple: knowledge is expressed as descriptive statements, stating that some relationship exists between one thing and another. The technologies to represent that structure are already in place (see http://en.wikipedia.org/wiki/Semantic_web; a small example follows after this list):

  – RDF is a simple data model for referring to objects ("resources") and how they are related. An RDF-based model can be represented in XML syntax.

  – RDF Schema is a vocabulary for describing properties and classes of RDF resources, with a semantics for generalization hierarchies of such properties and classes.

  – OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes.

  These technologies will be described in detail in the next chapter.

• A digital signature is an electronic signature that can be used to authenticate the identity of the sender of a message or the signer of a document. The Digital Signature layer ensures that the original content of the message or document is unaltered.

• The top layers, Logic, Proof and Trust, are currently being researched and simple application demonstrations are being constructed. The Logic layer enables the writing of rules, while the Proof layer executes the rules and evaluates, together with the Trust layer mechanism, whether applications should trust the given proof or not.
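As a small, hand-made illustration of the RDF and RDFS layers working together (the http://example.org/beer# vocabulary is invented for this example), the following RDF/XML fragment declares a class hierarchy and makes a statement about a resource:

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:ex="http://example.org/beer#">

  <!-- RDFS layer: declare a small class hierarchy -->
  <rdfs:Class rdf:about="http://example.org/beer#Beer"/>
  <rdfs:Class rdf:about="http://example.org/beer#TrappistBeer">
    <rdfs:subClassOf rdf:resource="http://example.org/beer#Beer"/>
  </rdfs:Class>

  <!-- RDF layer: a statement about a concrete resource -->
  <ex:TrappistBeer rdf:about="http://example.org/beer#Westmalle">
    <rdfs:label>Westmalle Tripel</rdfs:label>
  </ex:TrappistBeer>
</rdf:RDF>
```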
2.4 Web Crawlers
Introduction

The World Wide Web (or simply the Web) can be seen as a directed graph: the vertices represent the web pages and the directed edges represent the links from one page to another. A Web Crawler is a program that uses this graph structure of the Web (the web-graph) to visit web pages in an automated manner by following the edges from every page in the web-graph. This process is called crawling the web. In their infancy, these programs were also called wanderers, robots, spiders, etc.

These kinds of general-purpose crawlers will try to visit as many pages as possible; that is, they are blind and exhaustive in their approach. All edges of the web-graph are followed without much consideration for the order in which they are followed. Most of the time, a breadth-first order will be used. In contrast, crawlers can be more selective about which edges, and in what order, the edges of the web-graph should be followed. These crawlers are referred to as preferential or heuristic-based crawlers [37]. Heuristic-based crawlers built to retrieve pages within a certain topic are called topical or focused crawlers.

In this section, we start with a very brief overview of the architecture of a general crawler; a more in-depth view will be provided when we discuss the implementation of our crawler in chapters 3.4 and 5. In the second part of this chapter, we give a more in-depth view of the different crawler techniques discussed in the literature. We start with the general or naive crawling algorithms and then move on to more sophisticated techniques. Most of our attention will go to focused crawlers, as that is the main approach we use in this thesis.
2.4.1 Crawler Architecture
Almost all crawlers have a very similar workflow, as shown in Figure 2.4. The crawler is started with a number of seed URLs, which are added to the crawl frontier. What follows is referred to as the crawling loop: the crawler picks a URL from the frontier, fetches the page through HTTP, parses the fetched page and then adds the links found on the page to the frontier. There are different possible end conditions for this loop: the crawler can stop when a certain number of pages have been crawled, or when a certain crawl-depth is reached. These stop conditions are highly crawler-specific.

The seed URLs can be seen as the start nodes of the crawl-graph. From these start nodes, the web-graph is traversed by the crawling loop and, simultaneously, the crawl-graph is built. As the web-graph is not an acyclic graph, we must provide ways to detect these cycles to prevent the crawling loop from continuing endlessly. Since this flow (more specifically the crawling loop) is easily parallelised, most crawlers are implemented as multi-threaded applications. Since our implementation of the crawler is also multi-threaded, we will discuss the multi-threaded version of the general crawler workflow when we discuss our own crawler architecture in chapter 3.4.
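The crawling loop can be summarized in a few lines of Java. This is a sketch only: the fetch and link-extraction steps are stubbed out, and the FIFO frontier and visited set are simplifications of what this section and chapter 3.4 describe.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// A sketch of the basic crawling loop of Figure 2.4 (a real crawler plugs
// in an HTTP client and an HTML parser where the stubs are).
public class SequentialCrawler {

    private final Queue<String> frontier = new LinkedList<String>(); // FIFO: breadth-first
    private final Set<String> visited = new HashSet<String>();       // cycle detection

    public void crawl(List<String> seeds, int maxPages) {
        frontier.addAll(seeds);                             // initialize frontier with seed URLs
        int crawled = 0;
        while (!frontier.isEmpty() && crawled < maxPages) { // stop condition
            String url = frontier.poll();                   // pick a URL from the frontier
            if (!visited.add(url)) {
                continue;                                   // already expanded: skip (cycle)
            }
            String page = fetch(url);                       // fetch the page through HTTP
            for (String link : extractLinks(page)) {        // preprocess: extract links
                frontier.add(link);                         // grow the crawl-graph
            }
            crawled++;
        }
    }

    private String fetch(String url) { return ""; }         // stub
    private List<String> extractLinks(String page) {        // stub
        return new LinkedList<String>();
    }
}
```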
Frontier The frontier can be seen as a to-do list of the crawler. It contains all the unvisited urls of the crawler. In graph terminology, the frontier is the open list of all the unexpanded nodes of the crawl-graph In small-scale crawlers, the frontier can be stored in memory. For general crawlers, a FIFO queue data-structure can be used to store the urls. With a fifo queue, the crawler will blindly visit all urls in a breath-first fashion as shown in figure 2.5. Another possibility is to use a priority-queue to store the urls. In that case, the crawler uses a heuristic to compute a score to every page. The page with the highest score will be visited next. In that way, the crawler traverses the graph in a best-first fashion. Both possibilities are illustrated in figure 2.5. In both cases, the next URL the crawler will visit is the first URL in the frontier (the URL in position 1). The frontier can quickly get very large as pages are crawled. With an average of 7 links a page [53], the frontier will contain about 70.000 urls when 10.000 pages are crawled. Therefor, in most crawlers, the frontier is stored on disk, since the number of urls the frontier can contain in memory, is limited. In most implementations, a DBMS system is used to store these urls since it provides a fast and easily manageable access interface.
[Figure 2.4: The flow of a basic sequential crawler.]

Fetching

When the crawler receives a new URL from the frontier, it must fetch the page from the Web. To download the page, the crawler must use an HTTP client which sends HTTP requests and receives the responses. Most languages, like Java, provide high-level interfaces to fetch the page from the web.
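In Java, for instance, the fetch step needs little more than the standard java.net classes. A minimal sketch (deliberately without the timeouts, politeness rules and error handling a real crawler would need):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// Minimal page fetcher using only the standard library (illustrative only).
public class PageFetcher {
    public static String fetch(String url) throws Exception {
        URLConnection conn = new URL(url).openConnection();
        conn.setRequestProperty("User-Agent", "ExampleCrawler/0.1"); // identify ourselves
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        StringBuilder page = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            page.append(line).append('\n'); // accumulate the page body
        }
        in.close();
        return page.toString();
    }
}
```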
Preprocessing

When a page has been fetched into the crawler's local memory, some sort of preprocessing is applied to the page so that the useful information can be easily identified. This preprocessing contains several components, from simple URL extraction to advanced Natural Language Processing techniques. An overview of the basic components that most crawlers use is given below.

[Figure 2.5: The crawl-graph of a general crawler and a preferred crawler.]

URL Extraction/Filter: To add the URLs from the page to the frontier, these URLs must first be extracted from the page. A parser is used to find the anchor tags and grab the values of the associated 'href' attributes. The URLs that are extracted from the page might not always be in the right format; relative URLs, for example, should be converted to absolute URLs. It is important that the crawler maps all URLs to one single canonical form. For example, a trailing '/' should be added to or removed from URLs, etc. Not all URLs should be added to the frontier. Therefore, it is important that the crawler filters the extracted URLs so that only relevant URLs are added. The crawler can filter on the protocol of the URL; for example, the crawler might not be interested in ftp or mailto URLs. It might also filter on the extension of the web page to filter out dynamic pages.

Text Parser: The crawler can also use text parsing techniques to extract content information from the page. This content can be used to index the page or can be used in the heuristic that gives a score to a page. The parsing of the text can range from very basic to very advanced, using different Natural Language Processing techniques. We give a listing of the most interesting techniques (a small sketch of both follows the list):

stoplisting: Every page contains a large number of stopwords, such as "it", "can", "the", etc. Stoplisting is the process of removing those stopwords from the text.

stemming: The stemming process normalizes words by conflating a number of morphologically similar words to a single root form or stem. For example, "connect", "connected" and "connection" are all reduced to "connect". Implementations of the commonly used Porter stemming algorithm [75] are easily available in many programming languages.
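A sketch of these two steps in Java; the five-word stoplist and the crude suffix rule merely stand in for a full stopword list and a real Porter stemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stoplisting and (very crude) stemming of a page's text.
public class TextPreprocessor {
    private static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("it", "can", "the", "a", "of"));

    public static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue; // stoplisting: drop stopwords
            }
            terms.add(stem(token));
        }
        return terms;
    }

    // Crude suffix stripping, only to show where a Porter stemmer plugs in:
    // "connection" -> "connect", "connected" -> "connect".
    private static String stem(String token) {
        if (token.endsWith("ion") || token.endsWith("ed")) {
            return token.replaceAll("(ion|ed)$", "");
        }
        return token;
    }
}
```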
2.4.2 Crawling Algorithms
General-Purpose Crawlers

A general-purpose web crawler's basic task is to fetch a page, parse the links and repeat. It normally tries to gather as many pages as it can, to build up a web-graph that is as complete as possible. Search engines use these web-graphs to identify the most authoritative pages related to a user query or topic. That process is called topic distillation. A brief overview of different distillation techniques is given below (largely based on [15]; a more detailed overview appears in [16] and [18]):

Text similarity: The first technique uses the textual similarity between pages. Similarity has been well studied in the Information Retrieval (IR) community [81] and has been applied to the WWW environment [13]. Based on these statistics, the relevance of a page to a certain query is computed. The next algorithms use only the connectivity information of the Web (i.e., the links between pages) and not the content of pages.

Backlink: If one wanders on the Web for an indefinite time, following a random link out of each page, then different pages will be visited at different rates; popular pages with many in-links will tend to be visited more often. In other words, the importance of a page P is the number of links to P that appear on the entire web. Intuitively, a page P that is linked to by many pages is more important than one that is seldom referenced. This "citation" count metric has been used extensively to evaluate the impact of published papers.

PageRank: While the backlink metric treats every link the same, PageRank recursively defines the importance of a page P to be the weighted sum of the backlinks to P. This is the core of the PageRank algorithm, invented by Brin and Page [12]. The PageRank algorithm crawls the web and simulates a random walk on the Web graph in order to estimate the visitation rate, which is used as a score of popularity (a minimal sketch of this computation follows at the end of this overview). Given a keyword query, matching documents are ordered by this score. Note that the PageRank popularity score is precomputed independently from the query; hence Google can potentially be as fast as any search engine that purely ranks the query results based on the input query.

HITS: Hyperlink-Induced Topic Search (HITS) [51] is slightly different: it does not crawl or preprocess the Web, but depends on a search engine. A query to HITS is forwarded to a search engine such as AltaVista, which retrieves a subgraph of the Web whose nodes (pages) match the query. Pages citing or cited by these pages are also included. This expanded graph is analyzed for popular nodes using a procedure similar to Google's, the difference being that not one but two scores emerge: the measure of a page being an authority, and the measure of a page being a hub (a compilation of links to authorities). Because of the query-dependent graph construction, HITS is slower than Google. A variant of this technique, the Companion algorithm, has been used by Dean and Henzinger to find similar pages on the Web using link-based analysis alone [26]. They improve the speed by fetching the Web graph from a connectivity server which holds substantial pre-crawled portions of the Web [9].

ARC and CLEVER: HITS's graph expansion sometimes leads to topic contamination or drift. This is partly because in HITS (and Google) all edges in the graph have the same importance. By heuristic modification of edge weights, the quality of query results can be significantly improved.

In [18], the Text similarity, Backlink and PageRank techniques have been evaluated on a copy of the Stanford University website (about 250,000 pages). The results showed that PageRank is the best metric of the three.
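For illustration, here is a minimal power-iteration sketch of the PageRank idea on a tiny adjacency-list graph; the damping factor and iteration count are illustrative choices, not those of [12].

```java
// Minimal power-iteration sketch of PageRank (illustrative only).
public class PageRankSketch {
    public static double[] pageRank(int[][] outLinks, int iterations, double damping) {
        int n = outLinks.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - damping) / n); // random-jump term
            for (int p = 0; p < n; p++) {
                if (outLinks[p].length == 0) {
                    continue; // dangling node: ignored in this sketch
                }
                for (int q : outLinks[p]) {
                    next[q] += damping * rank[p] / outLinks[p].length; // share rank over out-links
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // page 0 links to 1 and 2; pages 1 and 2 link back to 0
        int[][] graph = { {1, 2}, {0}, {0} };
        double[] r = pageRank(graph, 20, 0.85);
        System.out.printf("ranks: %.3f %.3f %.3f%n", r[0], r[1], r[2]);
    }
}
```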
Focused Crawlers

Maintaining the currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the enormous growth and dynamic content of the web. One of the ideas proposed in recent years is focused crawling. A focused crawler is designed to gather only pages relevant to a certain, pre-defined set of topics, without having to explore all web pages. By definition, focused crawlers are preferred crawlers; that is, they must use some sort of heuristic to rate pages according to their relevance to the given topic. Focused crawlers have the advantage of being driven by a rich context (topics, queries, user profiles, or in our case, a topic ontology) within which to interpret pages and select the links to visit.

It is obvious that the success of the crawler depends on the quality of the heuristic used. During the crawl, the crawler should stay focused around the given topic; that is, it should give correct scores to the pages so that links on irrelevant pages are not pursued by the crawler, as illustrated in Figure 2.6. On the other hand, the heuristic must not be too "strict", so that relevant pages are still found by the crawler. This trade-off is similar to the precision and recall measures from Information Retrieval (IR). We will describe these measures in more detail in the Crawler Evaluation subsection. In this subsection, we will give an overview of different heuristics described in the current literature.
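For reference, the standard IR definitions read as follows; the harvest rate used in chapter 6 plays a precision-like role over the set of crawled pages. (This is the standard textbook formulation, not a formula from this thesis.)

```latex
\mathrm{precision} = \frac{|\text{relevant pages retrieved}|}{|\text{pages retrieved}|}
\qquad
\mathrm{recall} = \frac{|\text{relevant pages retrieved}|}{|\text{relevant pages}|}
```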
Figure 2.6: The crawl-graph of a Focused Best-First Crawler.

One of the first focused web crawlers is presented in [72]. This work describes a client-based real-time retrieval system for hypertext documents, based on depth-first search. The "school-of-fish" metaphor is used: when food (relevant information) is found, fish (search agents) reproduce and continue looking for food; in the absence of food (no relevant information) or when the water is polluted (poor bandwidth), they die. In other words, the "fish" follow more links from relevant pages, based on keyword and regular-expression matching. The authors acknowledge that this type of system can make heavy demands on the network, and propose various caching strategies to deal with this. A more aggressive version of the fish-search algorithm, the shark-search algorithm, is described in [41]. The authors of the shark-search algorithm claim that it overcomes some limitations of fish-search by analyzing the relevance of documents more precisely and, more importantly, by making a finer estimate of the relevance of neighbouring pages before they are actually fetched and analyzed. The anchor text, i.e. the text surrounding the links or link-context, and the score inherited from ancestors influence the potential scores of links. The potential score of an unvisited URL is computed as:
score(URL) = γ · inherited(URL) + (1 − γ) · neighborhood(URL)    (2.1)
where γ < 1 is a parameter, the neighborhood score captures the contextual evidence found on
the page that contains the hyperlink URL, and the inherited score is computed as:

inherited(URL) = γ · sim(q, p),      if sim(q, p) > 0
                 γ · inherited(p),   otherwise          (2.2)
where γ < 1 is again a parameter, q is the query, and p is the page from which the URL was extracted. The neighborhood score uses the anchor text and the text in the "vicinity" of the anchor in an attempt to refine the overall score of the URL by allowing for differentiation between links found within the same page. These improvements enable the crawler to save precious communication time by first fetching documents that are most probably relevant or leading to relevant documents, and by not wasting time on garden paths.

Another early experience with a focused crawler, based on a hypertext classifier, is described in [17]. It describes a prototype implementation comprising three programs integrated via a relational database: a crawler, a hypertext classifier and a distiller. The basic idea is to classify crawled pages with categories in a topic taxonomy. The relevance rating system uses a hypertext classifier to update the metadata with topic information from a large taxonomy: a user marks interesting pages while browsing, and these are then placed in a category in the taxonomy. This was bootstrapped using the Yahoo hierarchy. Relevance is not the only attribute used to evaluate a page while crawling: the popularity rating system updates metadata fields signifying the value of a page as an access point to a large number of relevant pages. The latter is based on connectivity analysis. The two mining or rating modules guide the crawler away from unnecessary exploration and focus its efforts on web regions of interest.

A typical problem for focused crawlers is to find an appropriate relevance computation function so that short-term gains are not pursued at the expense of less obvious crawl paths that ultimately yield more relevant pages. As seen in figure 2.7, a certain page will only be visited if there is a path of relevant pages from the seed of the crawl to that specific page. One solution described in the literature is tunneling: the technique is to not only prioritize links from pages according to the page's relevance score, but also to estimate the value of each individual link and prioritize the links as well.

The use of this tunneling technique is described in [7]. The paper describes techniques to automatically build online collections of topic-specific resources.
Figure 2.7: This figure shows how a relevant page can only be reached if there exists a path of relevant pages from the root to the given page.

The authors' hypothesis is that by examining the patterns of document-to-collection correlations along Web link paths, more efficient selective crawling techniques can be devised. As a start, a set of centroids is constructed from the subjects in the given topic hierarchy. Each centroid is a weighted term vector describing a topic. During the crawl, each downloaded document is tentatively classified with the nearest subject vector. For each subject, Google is queried and the centroids are constructed from the first k hits returned in the search result. The score for each document is then defined as the correlation between the document and some centroid. The correlation score for a downloaded, normalized document d is
score = argmax_c ( Σ_i d_i · c_i / √(Σ_i c_i²) )    (2.3)
where c is a centroid; the argmax chooses the cluster whose centroid maximizes the score, and d_i and c_i are the weights of term i in the document and centroid vectors. As one does not necessarily start with on-topic seeds, tunneling at the start of the crawl can help the crawler to find relevant pages. To define tunneling more precisely, the following definitions are used in [7]: a nugget is a Web document whose cosine correlation with at least one of the collection centroids is higher than some given threshold. A dud, on the other hand, is a document that does not match any of the centroids very closely. A path is the sequence of pages and links going from one nugget to the next. A small sketch of this centroid scoring follows below.
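The following is a minimal sketch of the correlation score of equation (2.3) and the nugget test, under the assumption that the document vector is already normalized; the threshold is a free parameter, not a value prescribed by [7].

    /** A sketch of centroid scoring per equation (2.3). */
    public class CentroidScorer {
        /** Correlation of a normalized document vector d with centroid c. */
        static double correlation(double[] d, double[] c) {
            double dot = 0, norm = 0;
            for (int i = 0; i < d.length; i++) {
                dot += d[i] * c[i];
                norm += c[i] * c[i];
            }
            return dot / Math.sqrt(norm);
        }

        /** Index of the centroid maximizing the score (the argmax in eq. 2.3). */
        static int classify(double[] d, double[][] centroids) {
            int best = 0;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int k = 0; k < centroids.length; k++) {
                double s = correlation(d, centroids[k]);
                if (s > bestScore) { bestScore = s; best = k; }
            }
            return best;
        }

        /** A document is a "nugget" if it matches some centroid closely enough. */
        static boolean isNugget(double[] d, double[][] centroids, double threshold) {
            for (double[] c : centroids)
                if (correlation(d, c) > threshold) return true;
            return false;
        }
    }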
To characterize the benefits of tunneling more precisely, almost 500,000 unique documents were downloaded and analyzed. The authors came to the following conclusions:
• The path from one nugget to the next can be long;
• Better parents have better children;
• Path history matters;
• The frontier grows rapidly.
A more detailed explanation of the statistical results can be found in [7].
Intelligent Crawlers

The problem of not reaching relevant pages that are surrounded by non-relevant pages (shown in figure 2.7) has also been addressed in [28]. Instead of tunneling, the authors use a context graph to address this problem. They present a focused crawling algorithm that builds a model for the context within which topically relevant pages occur on the web. This context model can capture link hierarchies within which valuable pages occur, as well as model content in documents that frequently co-occur with relevant pages. Consequently, they call their crawler a context-focused crawler. There are two phases in their algorithm. The first phase (initialization) constructs a set of context graphs and associated classifiers for each of the seed documents. The context graphs are built by querying search engines such as Google for backlinks of the seed pages. In that way, a context graph of n layers is constructed, where each node of layer n is connected to one and only one node of layer n−1. Next, a set of classifiers is constructed to assign each document to a specific layer. A modified version of TF/IDF (Term Frequency/Inverse Document Frequency [81]) is used to represent the documents. TF/IDF assesses the relevance of a term to a document.

Definition 7 (Term frequency/Inverse document frequency): Given a corpus C of strings (usually documents) over a set of terms S, we define the following measures:

∀t ∈ S, ∀s ∈ C: tf(t, s) = s#t    (term frequency)
∀t ∈ S: doc(t) = {s ∈ C | t ∈ s}
∀t ∈ S: idf(t) = log(|C| / |doc(t)|)    (inverse document frequency)

A minimal sketch of these measures is given below.
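The sketch is a direct, minimal rendering of the measures in Definition 7, assuming documents are already tokenized into term lists.

    import java.util.*;

    /** tf(t, s) counts occurrences of term t in document s (s#t), and
        idf(t) = log(|C| / |doc(t)|) where doc(t) is the set of documents
        containing t, as in Definition 7. */
    public class TfIdf {
        static int tf(String term, List<String> documentTokens) {
            int count = 0;
            for (String token : documentTokens)
                if (token.equals(term)) count++;
            return count;                       // s#t in the definition
        }

        static double idf(String term, List<List<String>> corpus) {
            int docCount = 0;                   // |doc(t)|
            for (List<String> doc : corpus)
                if (doc.contains(term)) docCount++;
            if (docCount == 0) return 0.0;      // guard for terms absent from the corpus
            return Math.log((double) corpus.size() / docCount);
        }

        static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
            return tf(term, doc) * idf(term, corpus);
        }
    }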
The second phase is the crawling phase. It uses the classifiers to guide the search, and performs online updating of the context graphs. Their test results show that the best results
are achieved when the topic is such that the target content can be reliably co-located with pages from a different category, and where common hierarchies do not exist or are not implemented uniformly across different websites. For example, in the test where they crawled for "conference announcements", the performance improvement was greater than when they crawled for the "Biking" category.
2.4.3
Crawler Evaluation
Introduction

In a general sense, the quality of a crawler may be evaluated by its ability to retrieve "good" pages. The problem lies with the definition of a "good" page: how can these good pages be recognized? In an operational environment, real users could judge the relevance of these pages by following the crawl, enabling us to determine whether the crawl was successful or not. In order to draw reasonable conclusions about a crawler's effectiveness, a large number of crawls should be conducted with a large number of users. All these crawls take a significant amount of time to complete, which makes testing crawlers with real users overly problematic. Thus, while keeping user-based evaluation results as the ideal, we explore alternative user-independent mechanisms to assess crawl performance. In general, it is important to compare the different crawling algorithms over a large number of topics and tasks. This will allow us to ascertain the statistical significance of particular benefits that we may observe across the different crawling algorithms. We present here the general evaluation framework for topical crawlers described in [88]. What follows is a brief overview of the evaluation methodology adopted from [88]; for the full description, we refer to the full text. The generic framework has three distinct dimensions. The first dimension regards the nature of the crawl task. This includes consideration of how topics are defined and how seeds and target relevant pages are identified. The second dimension deals with evaluation metrics, both for effectiveness and for efficiency analysis. The last dimension of the framework looks at topics in greater detail, by examining particular characteristics such as popularity and authoritativeness and their effect on crawler behavior.
The Nature of a Crawl

A crawl is characterized by several features. These include how the topic is defined, the mechanism by which the seed pages for starting the crawl are selected, and the location of the topic's relevant target pages relative to the seed pages. Obviously, a crawl where the
seeds link directly to many pages relevant to the topic is likely to be less challenging than one in which the seeds and targets are separated by some non-trivial link distance. These issues are discussed in this subsection.

Topics and Descriptions: A topic, such as 'sports', 'tennis' or 'U.S. Military', delineates a particular domain of discourse. As seen in various examples ([1]; [17]), topics offer a handy mechanism for evaluating crawlers, since we may examine their ability to retrieve pages that are on topic. Topics can be obtained from different sources, such as Yahoo or the Open Directory Project (ODP). A key point to note is that not all topics are equal. A topic such as '2005 US Open' is more specific than 'tennis', which is in its turn more specific than 'sports'. Therefore, topic specification also plays a critical role. Current crawlers use different methods to specify topics. One method is to use the leaf-node concepts as topics ([63]). The problem with this approach is that the selected topics may be at different levels of specificity. We can control this by deriving topics from concept nodes that are at a predefined distance (topic level) from the root of the concept hierarchy. While one cannot say that two topics at the same level in the ODP hierarchy have the same specificity, it is reasonable to assume that topic level is correlated with specificity, and it is useful to have a simple control parameter to characterize the specificity of a set of topics.
Figure 2.8: An example of a topic subtree from a hierarchical directory (reproduced from [88]).

Another, more general, approach is to build topics from subtrees of a given maximum depth whose roots are topic level links away from the root of the original concept tree. Depth,
as used here, refers to the height of a subtree. Figure 2.8 illustrates this with a topic subtree of max depth = 2 built from a concept hierarchy at topic level = 2. The topic in this example has topic level = 2 and max depth = 2. The external pages linked from nodes at a given depth are the targets for that depth. Shaded areas represent target sets corresponding to subtrees of depth between 0 and max depth, i.e., to progressively broader interpretations of the topic. A broader interpretation (lighter shade of gray) includes additional, more specific targets. The topic description can be built up from topic keywords, formed by concatenating the node labels from the root of the directory tree to the topic node. These keywords form the search criteria provided as initial input for the basic crawling algorithms. By varying the depth from 0 to max depth, it is possible to generate alternative descriptions of a given topic. These descriptions will be used to estimate the relevance of retrieved pages. Note that our goal is to develop algorithms that use a topic ontology as the topic description. Still, to be able to compare these algorithms with general algorithms that use a keyword description, we can apply the depth parameter to the ontology in the same way as we use it to form the description from the concept hierarchy.

Target Pages: Hierarchical concept-based directories are designed to assist the user by offering entry points to a set of conceptually organized Web pages. Thus, the Yahoo directory page on Newspapers leads to USA Today, the New York Times and the web sites of other news media. One may regard the resources pointed to by the external links as the topically relevant target set for the concept represented by the directory page: USA Today and the New York Times may be viewed as part of the set of target relevant pages for the concept of Newspapers.
Figure 2.9: An example of the different target sets of a topic subtree.
In the framework, parallel to topic descriptions, topic target pages are also differentiated by the depth of the topic subtree, as illustrated in figure 2.9. The first target set contains the external links from concepts at depth = 0. The second target set contains the external links from all concepts at depth = 1 and the concept at depth = 0, that is, the root of the subtree. In this example, max depth = 2, so there are three possible target sets. Thus, when the topic is described by the subtree of depth = 0, the relevant target set will consist of the external links from the root node of the topic subtree. If, for example, the subtree of depth = 1 is used, the target set will consist of all the external links from the nodes at depth = 0 and the nodes at depth = 1 of the subtree. Thus, for a single topic there are max depth + 1 sets of target pages defined, with the set at a higher depth including the sets at the lower depths (a small sketch of this construction is given at the end of this subsection).

Seed Pages: The specification of seed pages is a crucial aspect of defining the crawl task. The approach used in several papers ([17], [63], [62]) is to start the crawl from pages that are assumed to be relevant. In other words, some of the target pages are selected to form the seeds. This type of crawl task mimics the query-by-example search mode where the user provides a sample relevant page as a starting point for the crawl. These relevant seeds may also be obtained from search engines ([73]). The idea is to see whether crawlers are able to find other target pages for the topic. An alternative way of choosing seed pages allows one to define crawl tasks of increasing difficulty. If the seeds are distant from the target pages, there is less prior information available about the target pages when the crawl begins. The effort by Aggarwal et al. [2] is somewhat related in that the authors start the crawl from general points such as Amazon.com. The framework we are using ([88]) takes a general approach and provides a mechanism to control the level of difficulty of the crawl task. One may specify a distance dist = 1, 2, ... of links between seeds and targets. As dist increases, so does the challenge faced by the crawlers. The framework also provides a procedure that implements the selection of up to n seed pages for a given topic.
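Returning to the target sets, the following is a small, hypothetical sketch of their construction; the Node type is a minimal stand-in for a directory concept with children and external links, not a structure taken from [88].

    import java.util.*;

    /** Target-set construction: the target set for depth d collects the
        external links of all concept nodes at depth <= d in the topic subtree,
        so higher-depth sets include the lower-depth ones. */
    public class TargetSets {
        static class Node {
            List<Node> children = new ArrayList<>();
            List<String> externalLinks = new ArrayList<>();  // pages the concept points to
        }

        /** Target set for the subtree rooted at root, up to the given depth. */
        static Set<String> targets(Node root, int depth) {
            Set<String> result = new HashSet<>(root.externalLinks);
            if (depth > 0)
                for (Node child : root.children)
                    result.addAll(targets(child, depth - 1));
            return result;
        }
    }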
2.5
Ontology/Schema Matching
Introduction

Schema matching has become a much-debated topic on today's research agenda, especially with the advent of the Semantic Web. Its roots, however, can be found in the database
literature of the eighties. Due to the widespread importance of integration and matching, many disparate communities have tackled this problem. They have developed a wide variety of overlapping but complementary technologies and approaches. In this section, we give a typology of the different matching techniques that are used in current matching solutions. What follows is only a classification of the different techniques; a more in-depth description will follow in the next section. This should help the reader to keep an overview of the possible solutions to the matching problem and of how these are combined for best results. Many applications that solve schema/ontology matching already exist, and all of them use their own combination of techniques. By giving an overview of the possible techniques that can be used to match different schemas or ontologies, we can explain why we have opted for certain techniques, and not others, in our solution to the matching problem for our specific application.
2.5.1
The Matching Problem
The schema matching problem, at the most basic level, refers to the problem of mapping schema elements (for example, columns in a relational database schema, or concepts in an ontology) from one information repository to elements of another information repository [50]. These schema matching problems arise in classical scenarios like company mergers as well as in new scenarios like the Semantic Web. Most of the work on schema matching has been motivated by schema integration, a problem that has been investigated since the early 1980s: given a set of independently developed schemas, construct a global view [5][74]. Since these schemas are developed independently, it is likely that they will use different structures and terminology. One reason may be that the schemas come from different domains, such as a real-estate schema and a property-tax schema. However, it is also possible that two schemas from the same domain use different structures and terminology, because they are developed by people in independent real-world contexts. An example of the latter case can be seen in figure 2.10. Since the problem arose in the field of schema integration, other needs for schema and, more recently, ontology matching have come up. We list a few of these new applications to give the reader a view of the importance and the different application domains of the matching problem. The following summary is based on [34] and [85].

Data warehouses: Since the 1990s, a new variation of the schema integration problem has become popular: the integration of data sources into a data warehouse. A data warehouse is a decision-support database that is extracted from a set of data sources. The extraction process requires transforming data from the source format into the ware-
house format. As shown in [8], the match operator is useful for designing these transformations. Given a data source, the first approach is to find the elements of the imported source that are also present in the warehouse. Another approach to integrating a new data source S′ is to use an existing source-to-warehouse transformation S ⇒ W. First, the common elements of S′ and S are found, that is, the match operation. Then, S ⇒ W is applied to those common elements.

Catalog matching: Most e-Commerce applications are based on the publication of electronic catalogs which describe the goods for sale and allow the customers to select the goods they need. Current business models (business-to-business as well as business-to-consumer) require e-Commerce applications to be able to interoperate over different catalogs. Many systems require participating parties to perform very costly operations on their local catalogs to enter the system, and this is a tremendous barrier for newcomers. A second need in current e-Commerce applications that has led to a new motivation for schema matching is the need for message translation. Trading partners frequently exchange messages that describe business transactions. Usually, each trading partner uses its own message format. Part of the message translation problem is translating between message formats (EDI structured, XML formatted, etc.). Today, application designers have to manually specify how message schemas are related. The use of semi-automated schema matching techniques may reduce the amount of manual work by generating a draft mapping between the two schemas.

P2P Databases: Peer-to-Peer (P2P) information sharing has received a lot of attention in recent years, both from academia and from industry. Currently, P2P file-sharing systems are very popular and have a variety of implementations. For example, Kazaa and Morpheus have more than 450 million downloads (see www.download.com, March 2004). Most of these applications provide a simple schema to describe file contents (e.g. Napster, Kazaa) which is shared by all parties and cannot be changed by one local party. P2P information-sharing systems which use more complex structures to describe the data (e.g. database schemas, ontologies, etc.) have been widely described in the current literature but, to our knowledge, have not yet been implemented as a real-world service. In these applications, the assumptions that all parties agree on the same schema, or that all parties rely on a global schema, cannot be made. Peers come and go, import multiple schemas into the system, and need to interoperate with other nodes at runtime [34]. Therefore, in this setting, schema alignment is the main process enabling node interoperability.
Web-based systems interoperability: Interoperability is also a prerequisite for a number of Web-based systems that serve distinct applications [82]. In particular, in areas where there is diverse and heterogeneous Web-based data acquisition, there is also a need for interoperability. The application described in this thesis falls into this category. For a more detailed list of possible use cases or scenarios for ontology matching and alignment, we refer to [34] and [82].
Real-world examples

To give the reader a real-world feel for the problem, we illustrate it with two possible scenarios. The first is an illustration of the problem in a classic schema matching environment; the second is a possible scenario (or use case, if you wish) on the Semantic Web.
Colleagues
Name             Address                    Tel_nr          email
Wim Devos        Troonlaan 94, Brussels     0032 2 569847   [email protected]
Jan Vandenberge  Zomerstraat 145, Brussels  0032 2 569847   [email protected]

Friends
FirstName  SecondName  Address                    Telephone       BirthDay
Steve      Dewilde     Troonlaan 94, Brussels     0032 2 569847   23/05/1980
Valérie    Dejaegher   Zomerstraat 145, Brussels  0032 2 569847   12/08/1974

(The schema matching in the figure relates the two schemas through equivalence and generality mappings between corresponding columns.)
Figure 2.10: Matching two relational schemas.

In the first example (see figure 2.10), you are given two relational schemas. The first represents the contact list of your work colleagues; the second is your personal contact list with all your friends. You want to match these in order to merge them into one big contact list. The second example is a possible scenario on the Semantic Web. Suppose you start a new research project and you want to find out which information is available in the current literature on the subject. You want an agent to search the web sites of the Computer Science departments of certain universities that have published papers on the subject. Imagine that you want the agent to return a list of all the relevant papers, courses, technical reports, etc.
Since the layouts of the websites of the CS departments can all be very different, it is not possible to obtain a clean list with a regular keyword-based search. On the Semantic Web, however, this data should be much easier to find: all departments have annotated their website with some ontology, such as in Figure 2.11.
Figure 2.11: An example of the annotated web-sites of two computer science departments
Schema vs Ontology matching

As stated earlier, schema matching at the most basic level refers to the problem of mapping schema elements in one information repository to corresponding elements in a second information repository [50]. These schema elements can be represented by different structures which are all considered data/conceptual models: description logic terminologies, relational database schemas, XML schemas, catalogs and directories, entity-relationship models, conceptual graphs, UML diagrams, etc. Ontology matching, specifically, can be described as the process whereby two ontologies are semantically related at the conceptual level; that is, we have to find semantic correspondences [90] between the concepts in the ontologies. Previous work on (specifically) ontology matching and ontology similarity measures can be found in [10, 59, 27]. Ontologies and schemas are alike in the sense that (i) they both provide a vocabulary of terms that describes a domain of interest and (ii) they both constrain the meaning of the terms used in the vocabulary [85]. Beyond this, there are some im-
portant differences and commonalities between schema and ontology matching [85]. The key points are:

• Database schemas often do not provide explicit semantics for their data. Semantics are usually specified at design time, are frequently not part of the database specification, and consequently are not available [67]. For example, relational schemas do not provide generalization. Ontologies, in contrast, are logical systems that themselves obey some formal semantics; e.g., we can interpret ontology definitions as a set of logical axioms [85].

• Ontology data models are usually richer than schema data models: the number of primitives is higher, and they are more complex. For example, OWL [86] allows defining inverse properties, transitive properties, disjoint classes, new classes as unions or intersections of other classes, etc.

• A last observation is that schema matching is typically performed manually, using some methodology and perhaps supported by a graphical user interface. For example, DOGMA's ontology integration methodology [34] adopts for ontology integration the same methodological steps that were singled out in database schema integration [5]. Problems in the field of ontology matching, by contrast, usually require an automatic or semi-automatic solution.

In [84] the author argues for a distinction between syntactic and semantic matching, with the former being the dominant approach in most database schema matching work. A description and comparison of syntactic and semantic matching will follow in the classification of matching techniques in the next section. In real-world applications, both schemas and ontologies usually have well-defined as well as obscure labels (terms). Therefore, solutions to both problems can be mutually beneficial, and many of the schema matching techniques also apply to the ontology matching problem.
2.5.2
Matching Typology and Classification
In a first classification of matching approaches, we present the reader with a typology of the different matching techniques. A second classification is based on the observation that a common trend in semantic integration is to progress from semantically-poor to semantically-rich solutions ([49]). This classification, described in [49], ranks the different techniques along a semantic intensity spectrum.
Typology of Matching Techniques
Figure 2.12: Matching typology, based on [79] and [85].

As we described earlier, most schema and ontology matching techniques are mutually beneficial. Therefore, it should be no surprise that classifications of matching techniques that target database schema matching approaches have been used and revised for ontology-based matching systems. A well-cited classification targeting database schema matching approaches is the typology proposed by Rahm and Bernstein [79]. Their classification distinguishes between elementary (individual) matchers and combinations of matchers. Elementary matchers comprise instance-based and schema-based, element-level and structure-level, and linguistic and constraint-based matching techniques. Cardinality and auxiliary information (e.g., thesauri like WordNet [64], global schemas) are also taken into account. This typology has been used and revised by Shvaiko and Euzenat in [85]. In their revision, they introduce two synthetic classifications. The first, a Granularity/Input Interpretation classification, is based on (i) the granularity of a match, i.e., element- or structure-level, and (ii) how the techniques generally interpret the in-
put information. The second classification is the Kind of Input classification, which is concerned with the type of input considered by a particular technique. The resulting typology can be seen in Figure 2.12.

Element-level vs structure-level: Two alternatives can be distinguished for the granularity of a match: element-level and structure-level matching. This criterion was first introduced in [79]: "For each element of the first schema, element-level matching determines the matching elements in the second input schema. [...] Structure-level matching, on the other hand, refers to matching combinations of elements that appear together in a structure."

Syntactic vs external vs semantic: This criterion corresponds to the Input Interpretation classification proposed by Shvaiko and Euzenat in [85]: "The key characteristics of the syntactic techniques is that they interpret the input in function of its sole structure following some clearly stated algorithm. External are the techniques exploiting auxiliary (external) resources of a domain and common knowledge in order to interpret the input. These resources might be human input or some thesaurus expressing the relationships between terms. The key characteristic of the semantic techniques is that they use some formal semantics (e.g. model-theoretic semantics) to interpret the input and justify their results. In case of semantic based matching system, exact algorithms are complete."
Semantic Intensity Spectrum

Another interesting classification is described in [49]. It is based on the observation that a common trend for DB and AI semantic integration practitioners is to progress from semantically-poor to semantically-rich solutions. This metaphor of semantic richness is used to classify works from both communities along a semantic intensity spectrum. Along this spectrum are several interim points addressing string similarity, structure, context, extension and intension awareness as different layers of semantic intensity (see Figure 2.13). This classification is interesting for us because, in general, semantically-rich solutions are more complex than semantically-poor ones. Since we need semi-real-time and automated matching, this semantic intensity spectrum gives us a good idea of which techniques are realistic to use in our solution and which would have too high a time complexity. We will discuss the goals and requirements of our solution in section 3.3.1. We now give an overview of the different interim points on the spectrum, based on [49]. We refer to the full text ([49]) for a complete and detailed discussion.

String similarity occupies the semantically-poor end of the spectrum and compares names of elements from different semantic models.

Figure 2.13: Semantic Intensity Spectrum, reproduced from [49].
These techniques consider strings as sequences of letters in an alphabet. They are typically based on the following assumption: the more similar the strings of the concepts or nodes (the syntactic similarity), the higher their semantic similarity. Usually, the string is first normalized, and token-based distance functions are then used to map the pair of strings to a real number. Some examples of such techniques that are frequently used in matching systems are prefix, suffix, edit distance and n-gram (a small sketch of two of these measures follows after the structure-aware discussion below).

Linguistic similarity lies just a little more towards the semantically-rich side of the spectrum than pure string-based similarity. In this case, names of concepts or nodes are considered words of a natural language, so these techniques use Natural Language Processing (NLP) techniques. For instance, pronunciation and soundex are taken into account to enhance the similarity computed purely from strings. Linguistic relations between words, like synonyms and hypernyms (a word that is more generic than another word), are also considered, based on generic and/or domain-specific thesauri, e.g. WordNet [64] and Dublin Core.

Structure-aware refers to approaches that take into account the structural layout of ontologies and schema data. Going beyond matching terms (strings), structural similarity considers the entire underlying structure. Although, in some interpretations, structure-level techniques include full graph-matching algorithms, the interpretation of structure-aware techniques here is the matching of ontologies that are represented as a hierarchical, partially ordered lattice. Therefore, in pure structural matching techniques, matching is equivalent to matching the vertices of the two source graphs. Similarity between two such graphs G1 and G2 is computed by finding a subgraph of G2 that is isomorphic to G1, or vice versa.
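To make the string-based measures named earlier concrete, here is a minimal sketch of two of them, Levenshtein edit distance and n-gram overlap; the trigram size and the Dice coefficient are common choices, not ones prescribed by the text.

    import java.util.*;

    /** editDistance returns the raw Levenshtein distance between two labels;
        ngramSimilarity maps a pair of labels to a similarity in [0, 1]. */
    public class StringSimilarity {
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++) {
                    int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(d[i - 1][j - 1] + subst,
                              Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
                }
            return d[a.length()][b.length()];
        }

        static Set<String> ngrams(String s, int n) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n <= s.length(); i++) grams.add(s.substring(i, i + n));
            return grams;
        }

        /** Dice coefficient over trigram sets. */
        static double ngramSimilarity(String a, String b) {
            Set<String> ga = ngrams(a.toLowerCase(), 3), gb = ngrams(b.toLowerCase(), 3);
            if (ga.isEmpty() || gb.isEmpty()) return 0.0;
            Set<String> common = new HashSet<>(ga);
            common.retainAll(gb);
            return 2.0 * common.size() / (ga.size() + gb.size());
        }
    }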
Figure 2.14: Structure Awareness.

Context-aware: in many cases, there is a variety of relations among concepts, which makes it necessary to differentiate distinct types of connections among nodes. This gives rise to a family of matching techniques which are semantically richer than structure-aware ones. In general, two different types of context-awareness can be identified. In the simplest form, algorithms that compare nodes from two ontologies also traverse downwards several layers along the direction of the edges from the node under consideration, or upwards against the direction of the edges to the node under consideration. All the visited nodes, together with the information about the edges connecting them (taxonomic relationships like part-of, subclass-of, is-a, etc.), are evaluated as a whole to infer further mappings between nodes in the context. An example of a context-aware matching algorithm that is used in an ontology-based crawler is described in [58]. In their approach, four different relevance sets R(e) are defined in the context of the focused crawler. They distinguish between Single Rs(e), Taxonomic Rt(e), Relational Rr(e) and Total Ron(e). Each of the relevance sets defines how far the edges from the node under consideration are followed.

Definition 8 (Relevance Sets by Ehrig and Maedche):

1. Rs(e) is the simplest measure, returning nothing more than the original list of entities referred to in a document: Rs(e) = {e}, ∀e ∈ E.

2. Rt(e) uses the power of ontologies only to a certain extent. The list of related entities is defined by the entity itself, the direct super-entities and the direct sub-entities. This corresponds to a distance measure of one in the ontology graph.

3. Rr(e) uses deeper knowledge from the ontology. The list of related entities is defined like the taxonomic measure, but additional entities are appended: directly linked con-
cepts, instances and their relations, plus the ranges of the relations. This corresponds to a distance measure of two in the ontology graph.

4. Ron(e) takes the whole ontology and metadata structure as input.

An illustration of this example can be seen in Figure 2.15. The relevance sets define how far the edges from the node under consideration ("Airplane") are followed and used to compute the score of the node. Also note that in this case (as opposed to structure-awareness) the edges of the graph are labeled.
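A minimal sketch of the first two relevance sets of Definition 8, over a hypothetical in-memory entity graph (the Entity type and its fields are illustrative, not part of [58]):

    import java.util.*;

    /** Rs(e) is the entity itself; Rt(e) adds the direct super- and
        sub-entities, i.e. distance one in the taxonomy. */
    public class RelevanceSets {
        static class Entity {
            String name;
            List<Entity> superEntities = new ArrayList<>();
            List<Entity> subEntities = new ArrayList<>();
        }

        static Set<Entity> single(Entity e) {            // Rs(e)
            return Collections.singleton(e);
        }

        static Set<Entity> taxonomic(Entity e) {         // Rt(e)
            Set<Entity> related = new HashSet<>(single(e));
            related.addAll(e.superEntities);
            related.addAll(e.subEntities);
            return related;
        }
    }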
Figure 2.15: Context Awareness example using Relevance Sets, reproduced from [58].

Extension-aware: when a relatively complete set of instances can be obtained, the semantics of a schema or an ontology can be reflected through the way the instances are classified. A major assumption made by techniques belonging to this family is that instances with similar semantics may share features [57]; therefore, an understanding of such common features can contribute to an approximate understanding of the semantics. Formal Concept Analysis (FCA) [38] is a representative of these instance-aware approaches. FCA is a field of mathematics that emerged in the nineties and builds upon lattice theory
and the work of Ganter and Wille on the mathematisation of the notion of concept in the eighties. A formal context is a triple K = (O, P, S), where O is a set of objects, P is a set of attributes (or properties), and S ⊆ O × P is a relation that connects each object o with the attributes satisfied by o. The intent (the set of attributes belonging to an object) and the extent (the set of objects having these attributes) are given formal definitions in [38]. A formal concept is a pair <A, B> consisting of an extent A ⊆ O and an intent B ⊆ P, and these concepts are hierarchically ordered by inclusion of their extents. This partial order induces a complete lattice, the concept lattice of the context. FCA can be applied to semi-structured domains to assist in deriving the main structure most mapping systems work with.

Intension-aware refers to the family of techniques that establish correlations between relations among extent and intent. Such approaches are particularly useful when it is impossible or impractical to obtain a complete set of instances to reflect the semantics. A mathematical theory that goes beyond extension-awareness towards the tick marked by intension-awareness is Information Flow, proposed by Barwise and Seligman. We refer to [4] for a detailed description.

Semantic similarity: very close to the semantically-rich end lies the family of logic satisfiability approaches, which focus on logical correspondences. The idea behind techniques in this category is to reduce the matching problem to one that can be solved by resorting to logic satisfiability. Concepts in a hierarchical structure are transformed into well-formed logic formulae (wffs). These notions are also used in the DOGMA Framework: the lexons in DOGMA are all well-formed formulae.
Chapter 3
Crawling the Semantic Web
3.1
Context and motivation
As we have seen, one of the design principles of the World Wide Web, and also a large factor in its success, lies in its decentralization principle. This principle is at least as important for the Semantic Web (1). We have also seen that the key ingredients of the Semantic Web are ontologies and metadata. They describe the domain theories for the explicit representation of the semantics of metadata. Therefore, everyone should be able to design or reuse an existing ontology, define metadata according to this ontology, and publish this metadata on the web. It is important that all this semantic content can be shared and reused by humans and by artificial agents. Although currently only a small portion of the Web is annotated with metadata, we expect that over time a much larger percentage of the web will be annotated. A knowledge management system is needed to find, organize, access and maintain this new information. In our approach, DOGMA is this system (although DOGMA is focused on more than only the Semantic Web). We believe that our ontology-based crawler can play an important part in this system. A more detailed description and terminology of ontologies and knowledge bases can be found in [40]. A discussion on how to extend information retrieval systems to handle annotations in semantic web languages is described in [36]. A model for the exploitation of ontology-based knowledge bases to improve search over large document repositories is proposed in [92]. In [80], a hybrid approach to searching the Semantic Web is described: classical search techniques are combined with spread-activation techniques applied to a semantic model of a given domain.

(1) http://www.w3.org/DesignIssues/Principles.html
Figure 3.1 is a high-level representation of the context and applications of our crawler within the DOGMA system. We can distinguish four important parts: (i) the input of the crawler, (ii) the ontology commitment specified by the user to guide the crawler, (iii) the processing and storing of the input and (iv) a selection of possible applications of the processed data.
Figure 3.1: The context and applications of the crawler within the Semantic Web and DOGMA.

1. Input: The input of the crawler is the collection of resources on the Semantic Web. These include, but are not restricted to, regular web pages as well as Semantic Web Documents (we will give a more formal description of these in the next chapter). In the future, the crawler may also be able to process other kinds of (annotated) data
like images, calendars, maps, etc. that are available on the Web (see, for example, the latest initiatives of Google, Microsoft Live and Yahoo in this area).

2. Source ontology commitment: The user of the crawler can first specify the domain on which the crawler should be focused. This domain of interest can be specified and described using a tool that is also under development (by Jan Vereecken & Damien Trog) within the DOGMA framework: T-Lex [93].
Figure 3.2: The T-Lex tool from the DOGMA Studio Workbench.

The T-Lex tool is shown in figure 3.2. The left pane shows the list of contexts on the DOGMA Server; the ontology engineer can select a context and will then see the lexons contained in this context as a NORM tree (NORM stands for NORM Ontology Representation Method) in the lower pane, the T-Lex Lexon Base Browser. The lexons that should be included in the final commitment can be dragged to the upper pane. In the upper pane, the T-Lex Commitment Editor, constraints can be added to the roles, as depicted in figure 3.3. We will use the Beer ontology from both figures later to evaluate our matching algorithm in section 4.3.5.

Figure 3.3: The T-Lex Commitment Editor, where constraints can be added to the lexons.

3. Processing: While crawling, the crawler processes the discovered resources. It will
stands for NORM Ontology Representation Method.
download the resources, parse them (for example, parse the annotated data into ontology models using the Jena library), and store the parsed data into two data stores: one for the normal (text-based) HTML documents and another for the semantic data, after which the crawler has to find a new resource to process. This selection process is of key importance to the usefulness of the crawler: it must find pages that are of interest to the user and are useful for the applications of the discovered resources. The selection process is guided by the domain of interest specified by the user in step (ii). A more detailed description is given in the following sections of this chapter.

4. Applications: The discovered data can serve many uses:

Ontology mining: In this prospect, the crawling process will discover ontologies that are published on the web and are relevant to a certain given domain of interest. This process can therefore be seen as the mining of existing ontologies on the web. Not only ontologies will be discovered; regular (text-based) web pages will also be found during the crawl. In addition, for each of these web pages, the resemblance to the given domain ontology is known. This facilitates the process of mining ontologies or instance data from regular web pages.

Ontology aligning/merging: Closely related to ontology mining is the alignment or merging of ontologies. Imagine a domain expert who wants to build a kind of upper ontology given a set of ontologies of a specific domain. The crawler will find the ontologies on the web that are relevant to that domain. Not only will it find these ontologies, it will also compute the similarity between them and, even more importantly, it will provide the expert with a set of candidate mappings that can be used in the actual alignment or merging process. In this respect, the crawling process can be seen as a kind of pre-matcher to the actual alignment matching process. For related work on ontology integration (ontology
integration is the general term covering ontology aligning and merging) within the DOGMA framework, we refer to [70, 44, 68] and [47].

Web ontology browsing and indexing: The Semantic Web ontologies can be indexed and browsed to give users an idea of the semantic content available on the web relevant to the given domain of interest. A good example of the indexing and browsing of metadata on the web is Swoogle (see http://swoogle.umbc.edu/); we refer to [29] for a more detailed description.

Semantic Web searching: The indexed data can then be used by a search engine. This engine will not only match the user's query against the relevant web pages, but will also match it against the metadata discovered during the crawl. An example of such an application is QuizRDF [21]. QuizRDF combines free-text search with the capability to exploit RDF metadata in searching and browsing.

Semantic Web portals: The semantic data found during the crawling process can be used by Semantic Web portals. An example of a Semantic Web portal is OntoWeb; more information on OntoWeb can be found in [69] and [20]. For a "DOGMAtic" approach to Semantic Web portals, we refer to [45].
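As a minimal illustration of the processing in step 3 above, the following sketch reads a discovered document into an ontology model with the Jena library, as the text describes; the class name and URL are hypothetical, and this is not the crawler's actual implementation.

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    /** Reads a discovered Semantic Web Document into a Jena model before it
        is scored and stored. */
    public class SwdParser {
        static Model parse(String url) {
            Model model = ModelFactory.createDefaultModel();
            model.read(url);   // Jena fetches and parses the RDF document
            return model;
        }
    }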
3.2
Structure of the Semantic Web
Introduction

Currently, the Semantic Web of ontologies and metadata and the "normal" Web of HTML documents can be seen as two parallel web universes [29]. We describe the Semantic Web as the web consisting of Semantic Web Documents (SWDs). These are, for example, annotated web pages or, in general, documents that contain some sort of metadata described in an ontology language such as RDFS or OWL. There is no standard way for HTML documents to embed or refer to Semantic Web Documents in a meaningful way. This leads to two separate web universes: the Semantic Web Documents refer to one another in meaningful ways using relations described in the ontology languages, while the regular Web makes use of the hyperlink structure of HTML documents. Therefore, two distinct web graphs are built up while crawling: one graph between the SWDs and another describing the link structure of the HTML documents. Because of these disjoint graphs, regular crawlers are not sufficient to crawl the Semantic Web, and special Semantic Web crawlers should be developed.
Google query        Results (May 25, 2005 [29])    Results (May 05, 2006)
rdf                 5.230.000                      215.000.000
filetype:rdf rdf    246.000                        5.350.000
filetype:rdfs rdf   304                            306
filetype:daml rdf   4.360                          743
filetype:n3 rdf     2.630                          16.900
filetype:owl rdf    1.310                          32.000
Table 3.1: Approximate number of SWDs indexed by Google in May 2005 and May 2006.

Table 3.1 gives a very rough idea of the availability of Semantic Web Documents on the web and of its evolution over the last year. The results from May 25, 2005 are borrowed from [29]. The table shows that RDF documents are clearly the most common on the current web. We see that the number of RDF documents indexed by Google has increased from 246.000 to 5.350.000, an increase of approximately 2000%. Other kinds of ontology documents have not experienced such an increase: only the number of N3- and OWL-based documents has grown substantially, and the number of DAML-based documents has even decreased. Another striking number is the lack of RDFS documents, although it can be argued that most RDFS-based documents simply carry the RDF file extension.
3.2.1
Semantic Web Documents
Our view on the Semantic Web and its mutual relations is inspired by previous work of Finin et al. on the Swoogle metadata engine [29]. As stated previously, we describe Semantic Web Documents as documents using some Semantic Web language. Another restriction that applies to Semantic Web Documents is that they must be publicly available on the web for humans and agents. Similar to a document in Information Retrieval, an SWD is an atomic information-exchange object on the Semantic Web. We distinguish two kinds of SWDs: Semantic Web Ontologies (SWOs) and Semantic Web Databases (SWDBs). These correspond to what are called T-Boxes and A-Boxes in the description logic literature [3][43]. An SWD is considered a Semantic Web Ontology (SWO) when a significant proportion of the statements made in the document define new concepts and relations (e.g. new classes and properties) or extend the definitions of concepts and relations from another SWO. An SWDB, on the contrary, will not define or extend new concepts, but can introduce individuals and make assertions about them, or make assertions about individuals defined in other SWDs. Of
course, there are SWDs that fall between the two extremes: for example, a document that is intended as an ontology might define individuals that are part of the ontology. Similarly, a document that consists mostly of individuals might introduce some new relations in order to make it easier to describe the individuals. We now give a formal definition of the structures that describe an SWO and are relevant for our approach:

Definition 9 (Semantic Web Ontology (SWO)): A Semantic Web Ontology (SWO) W is a structure <CW, HC, GC, RW, HR, I> where:

• Semantic Web Concepts (SWC): CW is the set of concepts that are defined in the SWO.

• Subsumption concept hierarchy: HC is the hierarchy of binary relations that model the sub- and super-relations between the SWCs.

• Concept Gloss Set (SWG): GC is the optional set of gloss descriptions associated with each concept.

• Semantic Web Relations (SWR): RW is the set of relations (excluding the concept hierarchy relations HC) between the different concepts of the SWO.

• Subsumption relation hierarchy: HR is the hierarchy of binary relations that model the sub- and super-relations between the SWRs.

• An instance i ∈ I may be an instance of a concept c ∈ CW.
These are the structures the crawler will use for its relevance computation. They correspond with the expressiveness of RDFS:

• SWC c ∈ CW: a concept c in the ontology is an instance of rdfs:Class.

• SWR r ∈ HC: a subsumption relation r between two concepts is a binary relation corresponding to rdfs:subClassOf.

• SWG g ∈ GC: a gloss g is described by the rdfs:label property.

• SWR r ∈ RW: the relation r is an instance of rdf:Property that has an SWC c ∈ CW as its rdfs:range.

• SWR r ∈ HR: these are the binary relations corresponding to rdfs:subPropertyOf.
• The class instances I have the rdf:type property. An instance i ∈ I may be related to one or many j ∈ I through a relation r ∈ RW. These triples <i, r, j> are called RDF statements.

The fact that the crawler only uses semantics that can be described in RDFS does not mean it is unable to process SWDs that use other ontology languages like OWL or DAML. When such SWDs are encountered, the crawler will only use the semantics described above for its relevance computation and will discard the extra semantic information in the document. Considering this limitation of the crawler, it is important to note that all of the semantic data will still be parsed into ontology models and will be available after indexing, so no information is lost.
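A small sketch of how the RDFS-level structures listed above could be read out of a parsed Jena model: concepts (rdfs:Class), the subsumption hierarchy (rdfs:subClassOf) and glosses (rdfs:label). This is an illustration of the correspondence, not the crawler's actual implementation.

    import com.hp.hpl.jena.rdf.model.*;
    import com.hp.hpl.jena.vocabulary.RDF;
    import com.hp.hpl.jena.vocabulary.RDFS;

    /** Extracts the SWO structures of Definition 9 at RDFS expressiveness. */
    public class SwoExtractor {
        static void extract(Model model) {
            // Semantic Web Concepts CW: instances of rdfs:Class.
            ResIterator concepts = model.listSubjectsWithProperty(RDF.type, RDFS.Class);
            while (concepts.hasNext()) {
                Resource concept = concepts.nextResource();
                // Concept gloss g in GC, if present (rdfs:label).
                Statement gloss = concept.getProperty(RDFS.label);
                if (gloss != null) System.out.println(concept + " : " + gloss.getString());
            }
            // Subsumption hierarchy HC: all rdfs:subClassOf statements.
            StmtIterator hierarchy = model.listStatements(null, RDFS.subClassOf, (RDFNode) null);
            while (hierarchy.hasNext()) {
                Statement s = hierarchy.nextStatement();
                System.out.println(s.getSubject() + " subClassOf " + s.getObject());
            }
        }
    }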
3.3
The Crawler Framework
3.3.1
Goals and Requirements
We will describe the goals and requirements of the crawler in the same way one would describe them in a Software Requirements Specification (SRS) document. We will not append a full SRS document to this thesis, since we are of the opinion that a full SRS would not add much value to this project: the crawler is more of a research project than an end-user application. However, having a clear specification of the requirements and goals of the project is still important, since it can have a large effect on some design and implementation choices. We therefore describe here the topics relevant for this project that one would normally find in an SRS document.
Scope of the project

As we stated in the introduction, the scope of this thesis is the automated discovery of resources on the Web within the DOGMA framework. This is achieved by introducing a crawler that will crawl the web to find these resources. We use figure 3.4, already introduced in section 3.1, to describe the scope of the crawler: it specifies which responsibilities belong to the crawler and which are the responsibilities of other tools within the DOGMA framework. In the figure, the scope of the crawler is clearly marked by the red dashed line surrounding the crawler. Having defined the scope of the crawler, we now specify the expected functionalities and requirements of the crawler in more detail.
Figure 3.4: The scope of the crawler within DOGMA, connecting the Semantic Web (web pages and SWDs), the focused crawler, and DOGMA knowledge management components such as the T-Lex browser, the Lexon Base, ontology mining and ontology aligning/merging.

General Requirements

As stated in the previous section and illustrated in figure 3.4, the main requirement of the project is to discover relevant (semantic) web resources in an automated manner, using an ontology that describes the domain of interest. We therefore develop a crawler that must be able to:

(i) Crawl the Internet to discover web resources in an automated and performant manner;
(ii) Focus this crawl using a given ontology commitment, extracted from the DOGMA Server and specified by the user. This commitment describes the domain of interest of the resources the crawler should find;
(iii) Parse the discovered resources (semantic resources as well as normal web pages) and compare them with the given ontology commitment, using techniques such as Natural Language Processing and Ontology Matching;
(iv) Store and index these resources in a data store so that they can be used by other applications;
(v) Provide easy access to the indexed resources, in the form of direct database access and access via Web Services with SOAP message calls.
Functional Requirements

We have built the crawler with a research goal in mind. It was not developed for use as a real-world application with thousands of users, at least not in its current form. No graphical user interface was therefore required, and all of the configuration options of the crawler are specified using configuration files. As a consequence, the number of functional requirements is rather small. We provide a list of the most important functional requirements and detail some of them with a use-case. The user should be able to:

[FR1] Start the crawler from a command prompt, given a configuration file as argument.
[FR2] Specify, in the configuration file, the SOAP Server the crawler should use to store its indexed resources.
[FR3] Specify, in the configuration file, the Database Server the crawler should use to store its indexed resources.
[FR4] Specify the filter options:
  [4.1] Allowed protocols, e.g. http. URLs with other protocols, like ftp://, will not be pursued by the crawler.
  [4.2] Allowed filetypes, e.g. html, htm, php, rdf, rdfs, owl, .... URLs that refer to files with other filetype extensions will not be pursued.
  [4.3] Disallowed characters, e.g. # and ?. URLs containing these characters will not be pursued.
[FR5] Specify the start URL(s) from where the crawler should start its crawl. [Optional]
[FR6] In case the user did not specify any start URLs, specify a query that will be used to query Google; the first 5 results returned by Google will then be used to start the crawl process.
[FR7] Specify the Ontology Commitment file that describes the domain of interest around which the crawler should focus its crawl.

We will describe functional requirement 2 in more detail with a use-case:
SOAP Server Configuration [FR2]
Name: SOAP Server Configuration
Summary: Specify, in the configuration file, the SOAP Server the crawler should use to store its indexed resources.
Actors: The user that configures the crawler.
Assumptions: A SOAP Server is available.
Description:
1. The user opens the existing configuration file or makes a new one.
2. The user specifies the SOAP server as: SoapServer = <url to soapserver>. Example: SoapServer = http://localhost:8080/soap/servlet/rpcrouter.
3. The user specifies the name of the Crawl Frontier service as: SoapCrawlFrontier = <service name>. Example: SoapCrawlFrontier = urn:CFServer.
4. The user specifies the name of the RDF Frontier service as: SoapRDFServer = <service name>. Example: SoapRDFServer = urn:RDFServer.
5. The user saves the configuration file.
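For illustration, a complete configuration file covering FR1-FR7 might then look as follows. The three SOAP keys are the ones given in the use-case above; the remaining option names are hypothetical stand-ins, since the actual keys are defined by the implementation:

SoapServer        = http://localhost:8080/soap/servlet/rpcrouter
SoapCrawlFrontier = urn:CFServer
SoapRDFServer     = urn:RDFServer
# --- hypothetical option names for FR3-FR7 ---
DatabaseServer    = jdbc:mysql://localhost:3306/frontier
AllowedProtocols  = http
AllowedFiletypes  = html, htm, php, rdf, rdfs, owl
DisallowedChars   = #, ?
StartUrls         = http://www.example.org/
GoogleQuery       = beer ontology
CommitmentFile    = commitments/beer.xml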
3.4 Crawler Architecture
In this section we describe the general architecture of the crawler: its constituent parts, the responsibility of each of them, and how these parts work together. We will not go into any implementation details here; the implementation details, with UML diagrams, will be covered in chapter 5.
3.4.1 High-level Overview
The focused crawler we developed consists of two large, separate systems, as shown in figure 3.5. The first system is the actual focused Crawler itself. This system crawls for pages on the Web. This process is called the crawling loop: it downloads a page from the Web, preprocesses it, computes its relevance score, and starts over. As illustrated in the figure, the Crawler uses different crawling threads. Each thread executes the described crawling loop independently from the other threads, until it has no more pages to fetch.
Figure 3.5: The highest-level overview of the crawler, distinguishing the two major systems that compose the Focused Crawler.

The alert reader will notice that something is lacking if the focused crawler consisted only of the above crawler system: how is the crawler supposed to know which URLs to fetch and download, and what should it do with all the processed data? The Crawl Frontier fills this gap. The crawl frontier stores all the data parsed by the crawler. Concretely, the crawl frontier has two different data stores: the URL Frontier stores all the URLs the crawler will visit and has already visited; the second data store, the SWD Frontier, stores the discovered semantic web documents. Figure 3.6 emphasizes once more that the Crawler running the crawler threads and the Crawl Frontier are two distinct applications. They can be executed on different servers. This is important for two reasons:

1. The Crawl Frontier can be used by other applications, for example as a backend for a Semantic Web Portal, or as a source of ontologies for ontology mining and merging (we provided a more complete list of possible applications in section 3.1). It is therefore important that it can run as a stand-alone application implementing different possible access interfaces. For example, the crawler might want to communicate with the crawl frontier over a direct database connection, while the portal might want the crawl frontier to be installed as an application server and communicate with it via SOAP message calls.

2. The crawler should be extensible. By separating the crawl process from the crawl frontier, we can use different servers, each running their own crawl processes, but using the same crawl frontier. This will result in all these servers working on the
same ”crawl”, since they process the same list of URLs to crawl. Notice in the figure that each server can choose its own number of crawl threads to run, depending on the performance of that server. Again, this is possible because each crawler thread is also a separate process, using the same crawl frontier.

Figure 3.6: The crawler in a possible real-world environment. The access interface to the Crawl Frontier can be direct database access or, in our implementation, access via web services (SOAP).
3.4.2 Crawler Thread Overview
We will now provide a more in-depth view of the working of the crawler thread process. Remember that every crawler thread executes the same crawling process independently. As stated before, a crawler thread consists of three main phases: the crawler, the preprocessor and the relevance computation. These phases, the different techniques used in each phase, and how they relate to the crawl frontier and the Web, are illustrated in figure 3.7. We will describe each of them in detail below.
Figure 3.7: Pragmatic view on the architecture of the Crawler Thread. The crawler downloads a new page to memory; the preprocessor applies the Link Extractor/Filter (fetch and filter URLs), the SWD Extractor/Parser (extract or download SWDs and parse them into Jena Ontology Models) and the Text Parser (tokenizer, word filter, Porter stemmer, building a Vector Space Model and computing tf-idf scores); the relevance computation closes the loop.
The crawler phase

The crawler is the simplest part of the crawler thread process. It queries the Crawl Frontier for a new URL to crawl. After fetching the URL from the Crawl Frontier, it connects to that URL and downloads the corresponding document to memory. The document is stored in primary memory for performance reasons: the document must still be parsed, and only the parsed data is of interest to the user, so the best place to store the downloaded documents is in memory.
The preprocessing phase

After the fetched document has been stored in memory, the crawler preprocesses it: it parses all the needed information from the document, so that the relevance of the document can be computed in the next phase. We use several techniques to preprocess a document:

Link Extractor / Filter
Its first job is to extract all the URLs on a page. Not every URL is interesting for the crawler; for example, links with the ftp:// protocol, or links to images, movies or pdf documents, should be skipped. We therefore provide a filter that filters all the URLs given some filter options. These filter options can be specified in the configuration file of the crawler (see the functional requirements in section 3.3.1). Currently, we support the following filters: the protocol filter (i) will only download URLs that have a specific protocol; the filetype filter (ii) will discard any URL that has a given filetype; and the character filter (iii) will check the characters in the URL string and discard any URL that contains a character from the disallowed-characters list.

SWD Extractor / Parser
After having extracted and filtered the URLs from the document, the crawler thread tries to find semantic data that is available on the page. There are two possible ways such data can be present. In the first case, the data is embedded in the document and will be extracted from it. The other possibility is that the semantic data is contained in a separate SWD that is referenced from the current document; in that case, the separate SWD will also be downloaded to memory. Once all the available meta-data is fetched, the crawler thread tries to parse it into an Ontology Model. This model can later be queried for specific data, such as ”give all concepts” or ”give all the relations of this certain concept”. In this phase, we cannot yet decide whether the URLs should be added to the Crawl Frontier: only the URLs of relevant pages will be added, and the relevance computation is done in the next phase. We do store the Ontology Model in the Crawl Frontier, more specifically in the SWD Frontier. We still keep the model in memory, since we need it for the subsequent relevance computation, but as this ontology model can be of use in applications other than the crawler, we can already store it in this phase.

Text Parser
Since most of the documents on the Semantic Web are still not annotated (see table 3.1), no semantic data will be found in the majority of cases. We have therefore added a text parser to the crawler threads. This text parser uses different techniques that enable us to analyze the web page. We will use Natural Language
Processing (NLP) techniques that have long been established in the Information Retrieval community and have proven their success. Concretely, we will represent the documents using the Vector Space Model. In the Vector Space Model, a document is represented by a vector that denotes the relevance of a given set of terms for this document. Terms are usually natural-language words, but they can also be more general entities, such as words reduced to some linguistic base form, or an abstract concept such as ’<number>’ denoting any occurrence of a number in the text. Table 3.2 shows an example. Although the Vector Space Model has been criticized for being ad hoc, it is sufficient for our needs; for a more theoretical analysis of the Vector Space Model, we refer to [78]. We will use the tf/idf score to assess the relevance of a given term for each document; we have already defined the tf/idf score in definition 7 on page 22.

        agent   java   ...   ...
doc1    0.9     0.2    ...   ...
doc2    0.7     0.4    ...   ...
doc3    0.2     0.3    ...   ...
doc4    0.6     0.7    ...   ...

Table 3.2: An example of the Vector Space Model.

From the early days of automatic text processing and Information Retrieval, the Vector Space Model has played a very important role. It is the point of departure for many automatic text processing tasks, such as text classification, clustering, characterization and summarization, as well as information retrieval. We use the Vector Space Model here to compute relevant statistics of the document that we can use in our relevance computation algorithms. The four preprocessing steps, in their specific order, are:

1. Decoder: A general web document consists of plain text (the text that is displayed to the user visiting the page), surrounded by HTML or other kinds of tags. We first parse the full document to separate the normal text from the tags, so that only the plain text is used in the following steps.

2. Tokenizer: A document is represented as a string of characters; the tokenizing process turns it into a list of words. This is not a trivial process, though for vectorization a simple heuristic is often sufficient. We opted for a tokenizer that uses the Unicode specification to decide whether a character is a letter. All non-letter characters are assumed to be separators, so the resulting tokens
contain only letters. This results in fast tokenization that is sufficient for our purposes.

3. WordFilter: In this step, tokens that should not be considered for vectorization are filtered out. These are usually tokens that appear very often (referred to as ”stopwords”). A standard English stopword list is included. This technique can result in a large speed increase, since it can reduce the number of indexed words by up to 40%.

4. Stemmer: It often happens that terms are denoted differently (for example, by different grammatical forms) although they refer to the same term. This is usually referred to as term variation. In [60], three main kinds of term variation are distinguished: morphological, syntactic and semantic, although combinations of these are also possible. We illustrate these variations with an example in table 3.3.
Type            Subtype                     Example
Morphological   Inflection                  enzyme activities
                Derivation                  enzymatic activity
                Inflectional-Derivational   enzymatic activities
Syntactic       Insertion                   enzyme amidolytic activity
                Permutation                 activity of enzyme
                Coordination                enzyme and bactericidal activity
Semantic        -                           fermentation
Multilingual    French                      activité d’enzyme
Table 3.3: Variants of the term enzyme activity.

While these term variants are very interesting, they are too complex for use in our crawler: we have to process the documents in semi-realtime, so we need a more performant solution. Stemmers are designed to standardize morphological variations by stripping words to their base form, removing suffixes such as plural forms and affixes denoting declination or conjugation. We have opted for a Porter stemmer to strip suffixes from terms; its biggest advantage is that it is very performant. Although the rather simple Porter stemmer works well on English and romance languages, there are some known problems, namely commission errors (the same root form is found for terms that are actually different) and omission errors (no common root form is found for terms that are equivalent). A few examples of a Porter stemmer and its errors in action are illustrated in table 3.4.
Word 1         Word 2      Result    Error
matching       matches     match     none
connection     connected   connect   none
organization   organ       organ     commission
police         policy      polic     commission
europe         european    none      omission
Table 3.4: Examples of stemming terms with a Porter stemmer.

The relevance computation phase

Our relevance computation, which compares a source ontology describing the domain of interest with the metadata found on the Semantic Web, is the main feature that distinguishes our approach from other existing techniques. Since it is of such importance to this thesis, we describe it in a separate chapter, namely chapter 4.
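Before turning to the flow of the crawling process, a minimal Java sketch strings the text-preprocessing steps described above together. It is illustrative only: the decoder (step 1) is assumed to have run already, the stopword list is truncated, and the trivial suffix-stripper merely stands in for the actual Porter stemmer:

import java.util.*;

public final class TextPreprocessor {
    // Tiny excerpt of a standard English stopword list.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and", "to", "in", "is"));

    // Step 2 - tokenizer: Unicode letters form tokens, everything else separates.
    public static List<String> tokenize(String plainText) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : plainText.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    // Steps 3 and 4 - word filter and (simplified) stemming of the decoded text.
    public static List<String> preprocess(String plainText) {
        List<String> result = new ArrayList<>();
        for (String token : tokenize(plainText)) {
            if (!STOPWORDS.contains(token)) result.add(stem(token));
        }
        return result;
    }

    // Naive suffix stripping; a stand-in for the Porter stemmer used by the crawler.
    private static String stem(String word) {
        if (word.endsWith("ies") && word.length() > 4) return word.substring(0, word.length() - 3) + "y";
        if (word.endsWith("es")  && word.length() > 3) return word.substring(0, word.length() - 2);
        if (word.endsWith("s")   && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }
}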
3.4.3 Crawler Thread Flow Diagram
We will describe the general flow of the crawling process using the flow diagram in figure 3.8. In the figure we can clearly distinguish the three phases that constitute the repeating crawling process in each crawler thread. We have drawn the flow diagram next to the architecture of the crawler thread, so that the correspondence between the two can easily be seen.
Figure 3.8: The flow diagram of the crawling process. The frontier is initialized with the seed URLs; until terminated, each thread fetches a URL from the frontier, fetches the page, preprocesses it and computes its relevance; only if the page is relevant are its URLs added to the frontier.
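As a companion to the flow diagram, the loop of a single crawler thread can be sketched in Java as follows; all the collaborator types are hypothetical stand-ins for the implementation classes of chapter 5:

import java.util.List;

public class CrawlerThread implements Runnable {
    // Hypothetical collaborator types, sketched as minimal interfaces:
    public interface Page {}
    public interface PreprocessedPage { List<String> extractedUrls(); }
    public interface Fetcher { Page fetch(String url); }             // downloads into memory
    public interface Preprocessor { PreprocessedPage process(Page p); }
    public interface RelevanceComputer { double compute(PreprocessedPage p); }
    public interface Frontier {
        String nextUrl();                    // null once the frontier is exhausted
        void addUrls(List<String> urls);
    }

    private final Frontier frontier;
    private final Fetcher fetcher;
    private final Preprocessor preprocessor;
    private final RelevanceComputer relevance;
    private final double threshold;

    public CrawlerThread(Frontier f, Fetcher d, Preprocessor p,
                         RelevanceComputer r, double threshold) {
        this.frontier = f; this.fetcher = d; this.preprocessor = p;
        this.relevance = r; this.threshold = threshold;
    }

    // The crawling loop of figure 3.8, executed independently by each thread.
    @Override public void run() {
        String url;
        while ((url = frontier.nextUrl()) != null) {    // terminate when no URLs remain
            Page page = fetcher.fetch(url);             // download the page into memory
            if (page == null) continue;                 // unreachable document: skip it
            PreprocessedPage parsed = preprocessor.process(page);
            double score = relevance.compute(parsed);   // relevance computation phase
            if (score >= threshold) {                   // pursue links of relevant pages only
                frontier.addUrls(parsed.extractedUrls());
            }
        }
    }
}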
Chapter 4
Relevance computation: DMatch-lite

4.1 Context
As introduced briefly in the previous chapter, the primary goal of the relevance computation phase in the crawler loop is to provide the crawler with a relevance score between the fetched SWO and the initial Ontology Commitment. This is, in other words, a matching process between the two ontologies: the better the two ontologies match, the more relevant they are to each other and the higher the relevance score will be. The notion of relevance between two ontologies can best be visualized as the intersection of the two domains that the ontologies represent. In this chapter we sketch the context of this relevance computation and matching process. More specifically, we define the necessary measures and operators in order to fix the vocabulary used in this chapter. We also provide a broader view on the matching process with respect to other applications. We find this important because, while we see the crawler as one possible environment in which our matching process can be used, we also want to stress, as we have done before, that the matching process and its results can be applied in many other interesting cases.
4.1.1 The Matching Process
Many different solutions to the matching problem have been proposed so far, for example [56, 42, 31, 61, 6, 66, 14, 33, 50]. In this thesis, we have a different focus than most of
these other approaches: since our matching process is situated in a crawler environment, other requirements and limitations apply. Still, in line with most of the other approaches in standard schema matching, we define the Match Operator as the central operation in our matching process.
Figure 4.1: The match operator.

We have illustrated the Match Operator in figure 4.1. In accordance with what we have stated previously, the Match Operator takes two ontologies as input: one is the SWO fetched from the Semantic Web, the other is the Ontology Commitment extracted from the DOGMA Framework. The Match Operator will also use external sources to aid its matching process; one such source we use in our approach is the WordNet lexical database [64]. The output of the Match Operator is twofold: it stores the candidate mappings between the two ontologies (we will define the notion of candidate mappings later in this chapter) and it computes the relevance score between the two ontologies. We have already defined the notion of an SWO in definition 9 in section 3.2.1. We will now describe the possible representations of the ontology commitment that is extracted from the DOGMA Framework.
The Ontology Commitment in the Ω-RIDL format

A dedicated language has been designed at STARLab to describe commitments: Ω-RIDL. Its pseudo-natural syntax was based on RIDL (Reference and Idea Language), an older textual language used (amongst other things) to specify NIAM schemas (NIAM is the
predecessor of ORM, see [65, 94]). The Omega refers to the Lexon Base, see definition 4. Before Ω-RIDL was developed, commitments consisted only of (i) a selection of lexons and (ii) constraints on the selected lexons. Commitments were graphically modelled by adopting the ORM graphical notation, and stored as ORM-ML files. With the latest version of Ω-RIDL, which we call the next generation, it is ”again” possible to model a commitment in the old-fashioned way, i.e. as only a selection of lexons and constraints. We illustrate the high-level view of the XML Schema used for Ω-RIDL with the diagram in figure 4.2. For a more in-depth description of the Ω-RIDL language, we refer to [71].
Figure 4.2: Diagram of the ontology commitment in Ω-RIDL.

Because the Ω-RIDL language is still in development and there is as yet no final version that is sufficiently supported by the DOGMA framework, we will instead use a set of Lexons from the Lexon Base as input, which we describe in the following section.
The Ontology Commitment as a Set of Lexons from the Lexon Base

In our implementation, we describe the input Dogma ontology as a set of lexons, more specifically a subset of the Lexon Base. While this is formally not 100% correct, it is sufficient for a pragmatic approach. In contrast with the Ω-RIDL description of the ontology, our pragmatic solution does not define any constraints on the concepts of the ontology. This
might seem like a big loss, and within the normal use of ontologies it is, but our relevance computation algorithm compares the ontology commitment with an SWO of which we only use the semantics that are available in RDFS. Since RDFS has no real support for constraints on concepts, our pragmatic solution using a lexon set will, at this stage, not be very different from the correct solution using an ontology commitment described in Ω-RIDL.
4.1.2 DMatch-lite vs Ontology Integration
Independently of the integration strategy adopted, ontology integration is divided into several methodological steps: relating the different ontologies, finding and resolving conflicts in the representation of the same real-world concepts, and eventually merging the conformed ontologies into one global ontology [11]. For ontology integration, STARLab has proposed the same methodological steps that were singled out in database schema integration [5]:

1. preintegration
2. ontology comparison (alignment phase)
3. ontology conforming
4. ontology merging and restructuring

Our ontology matching process is far from a full ontology integration process: we have other goals and are limited by other constraints, as described before. However, we can map our ontology matching process onto part of the methodology proposed for ontology integration. More specifically, what we do in our ontology matching approach can be seen as the first step, and a part of the second step, of the ontology integration methodology. Figure 4.3 gives a more detailed view of the place of our matcher within the full integration process. As seen in figure 4.3, our matching process takes as input two ontologies, each described in a different ontology language. Our matcher outputs a set of candidate mappings, each describing the similarity between two concepts, one from each ontology, together with a found semantic relation and a relevance score between the two concepts. These candidate mappings can later be used in the alignment phase of the ontology integration methodology. In this context, our matching process can be seen as a sort of pre-matcher, corresponding to the first step (preintegration) and a part of the second step (our candidate mappings can be seen as a pre-result of the ontology comparison phase) of the ontology integration methodology.
Figure 4.3: The DMatch-lite match operator within the Ontology Integration context.
4.1.3 The Similarity Measures
To be able to make statements about the semantic affinity between concepts of different ontologies, we first need the notion of similarity. There are many ways to assess the similarity between two entities; the most common way amounts to defining a measure of this similarity. We present here the characteristics that can be asked of these measures.

Definition 10 (Similarity): Given a set O of entities, a similarity σ : O × O → R is a function from a pair of entities to a real number expressing the similarity between two objects, such that:

∀x, y ∈ O: σ(x, y) ≥ 0   (positiveness)
∀x ∈ O, ∀y, z ∈ O: σ(x, x) ≥ σ(y, z)   (maximality)
∀x, y ∈ O: σ(x, y) = σ(y, x)   (symmetry)
Definition 11 (Normalized similarity): A similarity is said to be normalized if it ranges over the unit interval of real numbers [0, 1]. A normalized version of a similarity σ will be noted σ̄.
In the remainder of the thesis, we will consider only normalized measures. We will assume that a similarity function between two entities returns a real number between 0 and
1, as seen in the following definition of the similarity between two concepts.

Definition 12 (Concept Similarity): Given the Lexon Base subset Ω from a Dogma ontology commitment and an SWO W, we define the similarity between a Dogma concept c_Ω = γ(ζ, t) ∈ C_Ω, the lift-up of a term t ∈ T, and an SWO concept c_W ∈ C_W as the function sc(c_Ω, c_W) : C_Ω × C_W → [0, 1], where sc(c_Ω, c_W) = σ̄(c_Ω, c_W).
4.1.4 Requirements and Limitations
The requirements and limitations of our matching problem differ from those of general schema and ontology matchers. First of all, given the workflow of our crawler thread, a new document will only be fetched after the crawler knows whether the current document is relevant or not; this is depicted in the workflow diagram of the crawling process in figure 3.8. Therefore, this relevance computation and matching process must be computed in semi-real time: it is not acceptable for a crawler thread to need more than a few seconds to compute the relevance of a page, as this would slow down the crawling process too much. A direct consequence of this limitation is that the matching process must run fully automatically, in contrast with most of the other standard ontology matchers, which take a semi-automated approach. These restrictions have a big influence on which matching techniques we can use: in general, the more semantically ”rich” the techniques are, the more running time they require. In the next sections we describe in detail our solution to the matching problem defined in this section. We call our solution DMatch-lite: the ”D” stands for DOGMA, and ”lite” reflects the fact that the algorithm should be viewed as a performant, automated pre-matcher. It does not have all the features and techniques of a full-blown ontology matcher for ontology integration, but it is, in return, more performant than such a matcher. It also works fully automatically, which is exactly what we need for our purposes.
4.2 DMatch-lite: Methodology
Similarly to what we have seen in the standard ontology integration process, our own matching algorithm, DMatch-lite, can also be split up into several methodological steps. Of course, in our case, these steps will be fully automated. We propose the following matching methodology, illustrated in figure 4.4.
Figure 4.4: DMatch-lite’s matching process methodology: (1) Feature Engineering; (2) Building the Search Space; (3) Similarity Iterations, consisting of (3.1) Compute Similarities, (3.2) Aggregate Similarities and (3.3) Update Search Space; (4) Compute Final Similarity.

The following sections will describe each step in more detail.
4.2.1 Step 1: Feature Engineering
The first step in the matching process is feature engineering. The purpose of this step is to transform the initial representations of the SWOs and the Dogma ontology commitment into a format that is usable for the similarity calculations. Not only do we have to deal with these two different representations, we also have to cope with the fact that the SWOs can be described in different ontology languages. DMatch-lite therefore parses these SWOs into an Ontology Model. As a consequence, all SWOs, regardless of the language in which they are described, respond to the same semantics. Currently, for DMatch-lite, these are the RDFS semantics, but by using such an ontology model, the number of supported semantics can be expanded easily. By representing these different SWOs as ontology models, we can also store them in a uniform way. We still have to tackle the problem of defining a structure through which the concepts from both ontologies can be compared. We define this structure as a commitment rule: these commitments map each concept from the Dogma ontology commitment to every concept in
the SWO. While the name ”commitment” follows the DOGMA approach, we will from now on call these mapping rules, to be consistent with the existing work on ontology integration. We provide a formal definition of this mapping rule structure in definition 13.

Definition 13 (Mapping rule): A mapping rule M between two concepts c_i ∈ C_Ω and c_j ∈ C_W is a structure ⟨m_id, γ(ζ, t_i), R, c_j, sc(c_i, c_j)⟩ where:
• m_id is a mapping-id that uniquely identifies the mapping rule,
• γ(ζ, t_i) is the lift-up of the Dogma term t_i ∈ T_i into a concept c_i ∈ C_Ω from the Dogma ontology commitment Ω,
• R is the linguistic relation between the two concept labels in the mapping rule,
• c_j ∈ C_W is a Semantic Web Concept from the SWO W,
• sc(c_i, c_j) is the similarity score we defined in definition 12.
Definition 14 (Linguistic Relation): A linguistic relation R denotes a directed, binary relation between two concept labels C1_L and C2_L and can be one of the following:
• R_synonym: Two words that can be interchanged in a context, without a significant loss of meaning, are said to be synonymous relative to that context.
• R_hyponym: A concept represented by a lexical term L0 is said to be a hyponym of the concept represented by a lexical term L1 if every L0 is-a (kind of) L1. That is, if native speakers of English accept sentences constructed from the frame ”An L0 is a (kind of) L1”. This relation is transitive.
• R_hypernym: This is the inverse relation of R_hyponym.
• R_meronym: A concept represented by a lexical term L0 is said to be a meronym of the concept represented by a lexical term L1 if every L0 is-a-part-of L1. That is, if native speakers of English accept sentences constructed from the frame ”An L0 is a part of an L1”. This relation is transitive.
• R_holonym: This is the inverse relation of R_meronym.
• R_none: If none of the above relations between two concepts represented by lexical terms L0 and L1 is applicable.
4.2.2 Step 2: Building the Search Space
Now that we have defined the features of both ontologies and the structures to compare their concepts, we need to define the structure we will use to aid us in the matching and in the relevance computation of the complete ontologies. The derivation of ontology mappings takes place in a search space of candidate mapping rules; this step builds up that search space. See table 4.1 for an example of a search space with Dogma concepts C_Ωx and Semantic Web Concepts C_Wy.

        C_W1        C_W2        C_W3        C_W4        C_W5
C_Ω1    M(Ω1,W1)    M(Ω1,W2)    M(Ω1,W3)    M(Ω1,W4)    M(Ω1,W5)
C_Ω2    M(Ω2,W1)    M(Ω2,W2)    M(Ω2,W3)    M(Ω2,W4)    M(Ω2,W5)
C_Ω3    M(Ω3,W1)    M(Ω3,W2)    M(Ω3,W3)    M(Ω3,W4)    M(Ω3,W5)
C_Ω4    M(Ω4,W1)    M(Ω4,W2)    M(Ω4,W3)    M(Ω4,W4)    M(Ω4,W5)
C_Ω5    M(Ω5,W1)    M(Ω5,W2)    M(Ω5,W3)    M(Ω5,W4)    M(Ω5,W5)
C_Ω6    M(Ω6,W1)    M(Ω6,W2)    M(Ω6,W3)    M(Ω6,W4)    M(Ω6,W5)
Table 4.1: An example search space.

Initially, the candidate mappings that compose the search space have the linguistic relation ”None” and a similarity score equal to zero. During the matching process, as we will see in the following sections, these scores are updated.
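As a small illustration, building this initial search space amounts to one nested loop; the MappingRule and LinguisticRelation types below are hypothetical stand-ins for the implementation classes:

import java.util.List;

public final class SearchSpaceBuilder {
    // Hypothetical types for the sketch.
    public enum LinguisticRelation { SYNONYM, HYPONYM, HYPERNYM, MERONYM, HOLONYM, NONE }
    public static class MappingRule {
        final Object dogmaConcept, swConcept;
        LinguisticRelation relation;
        double score;
        MappingRule(Object c1, Object c2, LinguisticRelation r, double s) {
            dogmaConcept = c1; swConcept = c2; relation = r; score = s;
        }
    }

    // Build the initial search space: every pair starts with relation None, score 0.
    public static MappingRule[][] build(List<?> dogmaConcepts, List<?> swConcepts) {
        MappingRule[][] space = new MappingRule[dogmaConcepts.size()][swConcepts.size()];
        for (int i = 0; i < dogmaConcepts.size(); i++) {
            for (int j = 0; j < swConcepts.size(); j++) {
                space[i][j] = new MappingRule(dogmaConcepts.get(i), swConcepts.get(j),
                                              LinguisticRelation.NONE, 0.0);
            }
        }
        return space;
    }
}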
4.2.3 Step 3: Similarity Iterations
The core of the DMatch-lite algorithm is formed by the actual similarity metrics we use. Several of these similarity metrics, or algorithms, are used, each for its own purpose. In this respect, DMatch-lite can be classified under the combining matchers, more precisely as a hybrid matcher [79]. We classify each of these algorithms under a specific similarity category, and each similarity category corresponds to a similarity iteration in the DMatch-lite matching methodology. Table 4.2 gives an overview of each iteration, the order in which the iterations are performed, and the similarity metrics within each iteration. Each similarity metric is also given a certain weight; this weight determines how much the metric influences the total similarity score of the mapping rule. As illustrated in table 4.2, some of the similarity algorithms are given a lower weight. We provide here a brief explanation of why we use these specific values.
• The Soundex similarity is given a weight of 0.5. As we will illustrate in more detail when we discuss the Soundex algorithm, most of the resulting score values of the Soundex algorithm lie between 0.4 and 0.6. To even these numbers out, we introduce a normalize function that penalizes these average values. Still, our testing results show that the Soundex similarity, in most cases, does not add a lot of ”value” to the similarity of both terms; therefore, a weight of 0.5 is chosen.

• The two gloss similarities have a weight of 0.5 and 0.6 respectively. We have opted for these weights because our preliminary testing results have shown that, even between similar concepts, the concept gloss description similarities are very low. Choosing a higher weight value would lower the total similarity score between the concepts too much.

Iteration               similarity metric           weight
String Similarity       Levenshtein distance        1.0
                        Soundex similarity          0.6
Linguistic Similarity   WordNet Relations           1.0
                        WordNet Glosses             0.5
Semantic Similarity     Context-aware similarity    1.0
                        Concept Gloss comparison    0.6
Table 4.2: Overview of the similarity iterations with their similarity metrics and weights.

Within each iteration, we can again distinguish three individual steps: the similarity computation, the similarity aggregation and, finally, the updating of the search space. We will now provide a more detailed description of each of these steps.
1. Similarity Computation: This is always the first stage of the iteration. In this stage, a similarity metric is selected and, using that metric, the relevance scores of the candidate mappings are computed. Then another similarity metric is selected and the process is repeated, until all available metrics in this iteration have been used.
2. Similarity Aggregation: The second stage is the aggregation of the similarity scores that were computed in the first stage. Every metric has been assigned a weight that determines how much it contributes to the final similarity score. Each metric also has a normalize function, which brings every score in line with the other scores so that they can be compared more easily. We define this similarity aggregation in definition 15.

Definition 15 (Similarity aggregation): For each similarity metric k, let w_k be the weight of metric k and sc_k(c_Ω1, c_W1) the similarity score between the two concepts using similarity metric k. The similarities are then aggregated by:

sc_agg(c_Ω1, c_W1) = ( Σ_{k=1...n} w_k × Normalize_k(sc_k(c_Ω1, c_W1)) ) / ( Σ_{k=1...n} w_k )
3. Updating the Search Space: The similarity scores obtained in this iteration are aggregated so that every candidate mapping rule has exactly one new similarity score. After that, we aggregate this new score for each candidate mapping with its previous score. Doing so automatically updates the search space with the updated mapping rules, which can then be used in the following iterations.
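A minimal Java sketch of one aggregation step (definition 15); the Metric interface, bundling a weight, a normalize function and a score computation, is a hypothetical abstraction over the metrics of table 4.2:

import java.util.List;

public final class SimilarityAggregator {
    // Hypothetical abstraction over one similarity metric of table 4.2.
    public interface Metric {
        double weight();
        double normalize(double rawScore);        // per-metric normalize function
        double score(Object cOmega, Object cW);   // sc_k between the two concepts
    }

    // Weighted aggregation of the normalized scores (definition 15).
    public static double aggregate(List<Metric> metrics, Object cOmega, Object cW) {
        double weightedSum = 0.0, weightTotal = 0.0;
        for (Metric m : metrics) {
            weightedSum += m.weight() * m.normalize(m.score(cOmega, cW));
            weightTotal += m.weight();
        }
        return weightedSum / weightTotal;          // sc_agg(c_Omega, c_W)
    }
}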
4.2.4 Step 4: Final Similarity Computation
When all the similarity scores for the mapping rules have been computed, we need to compute the final similarity score between the two ontologies. Different possibilities arise:

Find the optimal 1:1 mapping. This means that we have to find, for each Dogma concept c_Ω ∈ C_Ω, the best corresponding SWC c_W ∈ C_W. If we need to find the best possible solution, we have to investigate (#C_W)^(#C_Ω) candidate solutions. This exponential time complexity is too high
to be applied in a crawler environment: when comparing large ontologies, an enormous number of operations would have to be performed. This is why we propose a
greedy approach¹.

Compute the similarity score based on an n:1 or n:n mapping. Since our main goal is not to align or merge the given ontologies, but only to compute the (inexact) similarity between them, other techniques are possible that require far less computing time. We list two such techniques:

1. We could take, for every Dogma concept c_Ω ∈ C_Ω, the SWC c_W ∈ C_W that results in the highest similarity between those two concepts. This would result in an n:1 mapping between concepts.

2. Another possibility is to compute, for every Dogma concept c_Ω ∈ C_Ω, the average over all the SWCs c_W ∈ C_W with a similarity score higher than a given threshold.

We find, however, that these techniques do not provide an acceptable and accurate score of the similarity between both ontologies. We have therefore opted for a sub-optimal 1:1 mapping using a greedy programming approach, presented below in algorithm 1. The algorithm uses nested PriorityQueues to represent the different concepts. The main priority queue contains the terms of the Dogma concepts, together with a similarity score and another PriorityQueue. In the algorithm, this element is denoted as the array [sourceConcept, score, swcQueue].
Figure 4.5: The heap structure used in the computeFinalSimilarity algorithm; both queues are ordered by descending similarity.

For every Dogma term, a new PriorityQueue is constructed (the swcQueue we introduced above); the elements in this priority queue are arrays of the form [swConcept, similarity].
Greedy Algorithm is an algorithm that follows the problem solving meta-heuristic of making the locally
optimum choice at each stage with the hope of finding the global optimum
CHAPTER 4. RELEVANCE COMPUTATION: DMATCH-LITE
69
Algorithm 1 computeFinalSimilarity(sourceConcepts, swConcepts, searchSpace) 1: mappingQueue ← new P riorityQueue 2: searchM atrix ← searchSpace 3: for 4: i ← 0 to size(sourceConcepts) do 5:
sourceConcept ← sourceConcepts[i]
6:
score ← 0
7:
swcQueue ← new P riorityQueue
8:
for
9:
j ← 0 to size(swConcepts) do
10:
swConcept ← swConcepts[j]
11:
similarity ← searchM atrix[sourceConcept, swConcept]
12:
Add(swcQueue, [swConcept, similarity])
13:
score ← score + similarity
14:
end for
15:
Add(mappingQueue, [sourceConcept, score, swcQueue])
16: end for 17: elementsT aken ← new HashT able 18: resultScore ← 0 19: numberOf M appings ← 0 20: while !isEmpty(mappingQueue) do 21:
element ← P oll(mappingQueue)
22:
swcQueue ← element[3]
23:
swElement ← P oll(swcQueue)
24:
while contains(elementsT aken, swElement) do
25:
swElement ← P oll(swcQueue)
26:
end while
27:
numberOf M appings ← numberOf M appings + 1
28:
resultScore ← resultScore + swElement[2]
29:
Add(elementsT aken, swElement)
30: end while 31: result ←
resultScore numberOf M appings
These are the SWCs that have not yet been filtered out by the heuristics used while updating the Search Space (Iteration step 3.3). The similarity value is the similarity score between the swConcept and the sourceConcept. This heap structure is depicted in figure 4.5. As shown in the firuge, the vertical heap contains the Array elements: [sourceConcept, score, swcQueue]. The swcQueues contains the swConcepts and their concept similarity score. The score value in the vertical heap is then the sum of all the similarity values. For example, consider the top
P = scW
element CΩ1 : sc
4
+ scW3 + scW2 . Both priority queues are sorted in descending
order by respectively their score value and their concept similarity value. Now that this structure is set up, we start extracting the final mappings and computing the final similarity score. The principle is simple, the first element is popped from the queue, from this element, the first element that has not yet been used in another mapping gets
CHAPTER 4. RELEVANCE COMPUTATION: DMATCH-LITE
70
extracted. The similarity score between these 2 elements is added to the total similarity score and the next element is popped from the queue. This is done until the queue is empty. This process models a greedy approach since, for every new element popped from the queue, the similarity score will get higher. One of the problems is selecting the first element to begin with. This is why we use the priority queue: we begin with the element which has the largest score value (the score value is the sum of all the similarity scores between the Dogma concept and the other SWCs).
4.3
DMatch-lite: Similarity Iterations
In this section, we present the actual Similarity Iterations, each with their specific algorithms. The result of each of these algorithms is illustrated with an example scenario. We describe this matching scenario example in the next section.
4.3.1
Matching Example
We will apply the similarity algorithms we present in the following sections on this possible scenario. We have defined two different, but similar beer-ontologies2 . In line with the practical use of the crawler, we have described the first ontology (in figure 4.6) in the RDFS ontology language. The second ontology is a representation of a Dogma Ontology Commitment which we present in ORM. It is important to note that we have opted to draw the constraints on the lexons in this ontology. This is for the sake of completeness since a normal ontology commitment would contain these kind of constraints. This Beer Ontology Commitment can be modeled in T-Lex as we have previously depicted in figure 3.2 on page 40. In our implementation however, we take only a subset of the Lexon Base to represent a Dogma ontology. We want to stress that such a subset will never contain any constraints. We adopt this approach for practical reasons as we have explained before.
4.3.2
String-based Similarity
The string-based similarity iteration will be the first kind of similarity to be computed on each mapping rule. String-based algorithms will compute the syntactic similarity of the 2 Both
ontologies
are
based
on
http://www.purl.org/net/ontology/beer.owl
the
Beer
ontology
from
David
Aumueller:
CHAPTER 4. RELEVANCE COMPUTATION: DMATCH-LITE AlcoholPer centage
71
Ingredient
ge ta
brewedBy
Bottom Fermented Beer
Top Fermented Beer
Pilsner
Pilsner
dI n
ed
Beer
Region loc ate
n ce
ar d
er lP ho
aw
l co sA
madeFrom
ha
Award
Brewery
Organization
Figure 4.6: A first beer ontology, described in RDFS. Its concepts include Beer, Top Fermented Beer, Bottom Fermented Beer, Pilsner, Brewery, Organization, Region, Award, Ingredient and AlcoholPercentage, related through properties such as brewedBy, madeFrom, locatedIn, awarded and hasAlcoholPercentage.

There are many ways to compare strings, depending on how a string is seen (as an exact sequence of letters, an erroneous sequence of letters, a set of letters, a set of words, ...). A good comparison of various string-matching techniques is given by Cohen et al. [19]. In DMatch-lite, we use two different algorithms to compute the string similarity of the two terms that represent the concepts: the Levenshtein distance and the Soundex similarity.
Levenshtein Distance

Distance functions map a pair of strings s and t to a real number r, where a smaller value of r indicates greater similarity between s and t. The Levenshtein distance falls under the category of edit-distance algorithms, an important class of distance functions. The distance of an edit-distance function is the cost of the best sequence of edit operations needed to convert s into t. Typical edit operations are character insertion, deletion and substitution, and each operation is assigned a cost. The Levenshtein distance is the edit-distance function in which every operation has cost 1.

            Brewery      Ale          Award        Region       Ingredient
Pilsner     0.14285713   0.28571427   0.0          0.0          0.100000024
Company     0.14285713   0.0          0.14285713   0.14285713   0.100000024
Ale         0.14285713   1.0          0.19999999   0.0          0.100000024
Location    0.0          0.0          0.125        0.375        0.19999999
Brewery     1.0          0.14285713   0.28571427   0.14285713   0.3

Table 4.3: Some non-normalized similarity scores between the ontology terms, using the Levenshtein distance.
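A compact Java version of the Levenshtein distance is given below. The scores in table 4.3 are consistent with the normalization sim(s, t) = 1 − d(s, t) / max(|s|, |t|) (e.g. 1 − 6/7 ≈ 0.143 for Pilsner vs. Brewery), which the sketch therefore assumes:

public final class Levenshtein {
    // Edit distance where insertion, deletion and substitution each cost 1.
    public static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // substitution
            }
        }
        return d[s.length()][t.length()];
    }

    // Assumed normalization into [0, 1]; identical strings score 1.0.
    public static double similarity(String s, String t) {
        int maxLen = Math.max(s.length(), t.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) distance(s, t) / maxLen;
    }
}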
Figure 4.7: A second beer ontology, represented as a Dogma Ontology Commitment. Its lexons relate Beer to a Fermentation Method, an Award (has received / isGivenTo), Components (isPartOf / madeFrom), an AlcoholPercentage (has / is-of) and a Brewery (brews / brewedBy); the Brewery, a Company, is located in a Location (locatedIn / is-of); Pilsner and Ale are subtypes of Beer.

Soundex Similarity

The Soundex similarity algorithm is a coarse phonetic indexing scheme. It is mainly used in genealogy, since it allows phonetic misspellings to be evaluated easily. For example, the names John, Johne and Jon often refer, genealogically, to the same person. Soundex is a term-based evaluation method in which each term is given a phonetic code. Each such Soundex code consists of a letter and three numbers between 0 and 6. This is the procedure:

1. Take the first letter.
2. Translate the remaining characters:
   • B, F, P, V → 1
   • C, G, J, K, Q, S, X, Z → 2
   • D, T → 3
   • L → 4
   • M, N → 5
   • R → 6
3. Drop adjacent letters having the same code.
4. Drop the non-initial letters A, E, I, O, U, Y, W and H.
5. Take the first four characters, padding with zeros.

For example:
• ALEXANDRE → A4E2A536E → A4E2A536E → A42536 → A425.
• ALEKSANDER → A4E22A53E6 → A4E2A53E6 → A42536 → A425.

            Brewery     Ale          Award        Region       Ingredient
Pilsner     0.5555556   0.5555556    0.5555556    0.6666667    0.6944445
Company     0.5555556   0.44444445   0.44444445   0.5555556    0.5555556
Ale         0.5555556   1.0          0.8222223    0.6666667    0.44444445
Location    0.5555556   0.5555556    0.6666667    0.77777785   0.19999999
Brewery     1.0         0.5555556    0.6666667    0.5555556    0.5555556
Table 4.4: Some non-normalized similarity scores between the ontology terms, using the Soundex similarity metric.

As we can see in table 4.4, the similarity values returned by the Soundex metric are all very similar and do not reveal much about the two strings. That is because the Soundex metric only returns a remarkably higher score when the two terms are phonetically very similar, and a nondescript average score when they are not. That is why we propose a normalization function to bring these values more in line with the other similarity measures. We use the following function:

Definition 16 (Soundex Normalize function): If x is the non-normalized Soundex similarity value between two strings, Normalize(x) returns the normalized value:

Normalize_Soundex(x) = x^(10−10x)
Figure 4.8: A plot of the Soundex Normalize() function.
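The encoding procedure above translates into Java roughly as follows; how two codes are subsequently compared into the scores of table 4.4 is not detailed in the text, so only the encoder and the reconstructed normalize function of definition 16 are sketched:

public final class Soundex {
    // Encode a word following the five steps described in the text.
    public static String encode(String word) {
        String w = word.toUpperCase();
        StringBuilder digits = new StringBuilder();
        for (int i = 1; i < w.length(); i++) digits.append(code(w.charAt(i)));  // step 2

        StringBuilder result = new StringBuilder().append(w.charAt(0));         // step 1
        char previous = code(w.charAt(0));
        for (int i = 0; i < digits.length(); i++) {
            char c = digits.charAt(i);
            if (c != previous && c != '0') result.append(c);   // steps 3 and 4
            previous = c;
        }
        while (result.length() < 4) result.append('0');        // step 5: pad ...
        return result.substring(0, 4);                         // ... and truncate
    }

    private static char code(char c) {
        switch (c) {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K': case 'Q':
            case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';   // A, E, I, O, U, Y, W, H: dropped when non-initial
        }
    }

    // Reconstructed normalization of definition 16 (assumed reading of the formula).
    public static double normalize(double x) {
        return Math.pow(x, 10 - 10 * x);
    }
}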
4.3.3 Linguistic-based Similarity
The linguistic-based similarity iteration contains the algorithms and techniques that can be classified under the (i) element-based and (ii) external matching approaches, as defined in figure 2.12 in section 2.5.2, where we discussed the different matching techniques and their classifications. Linguistic-based techniques are also semantically ”richer” than pure string-based matching methods, as illustrated on the Semantic Intensity Spectrum in figure 2.13. We will focus on sense-based techniques using external resources, namely common-knowledge thesauri. In DMatch-lite, we use WordNet as this thesaurus; this allows us to compute the linguistic relations between terms.
WordNet

WordNet is an electronic lexical database for English, in which the various senses (possible meanings of a word or expression) of words are grouped into sets of synonyms. Such a set of synonyms is called a Synset and represents the notion of a concept in WordNet. A Synset is represented by a line in a WordNet pos.data file. A Synset contains a set of Words, each of which has a sense that names that concept (and each of which is therefore synonymous with the other words in the Synset). Synsets are linked by pointers into a network of related concepts; this is the Net in WordNet. We use these pointers to compute the linguistic relations between the synsets.
Synset: [Offset: 2853224] [POS: noun] Words: car, auto, automobile, machine, motorcar (4-wheeled motor vehicle; usually propelled by an internal combustion engine; ”he needs a car to get to work”)
Synset: [Offset: 2854760] [POS: noun] Words: car, railcar, railway car, railroad car (a wheeled vehicle adapted to the rails of railroad; ”three cars had jumped the rails”)
Synset: [Offset: 2830136] [POS: noun] Words: cable car, car (a conveyance for passengers or freight on a cable railway; ”they took a cable car to the top of the mountain”)
Synset: [Offset: 2855301] [POS: noun] Words: car, gondola (car suspended from an airship and carrying personnel and cargo and power plant)
Synset: [Offset: 2855152] [POS: noun] Words: car, elevator car (where passengers ride up and down; ”the car was on the top floor”)
Table 4.5: A part of the WordNet Synsets that contain the term ”car”.

There are several practical problems to tackle if we want to use WordNet to compute the linguistic similarity measures between two given concepts (a Semantic Web Concept and a concept from the Dogma Ontology Commitment). Both concepts are represented by their concept term, so if we want to obtain linguistic information about a concept from WordNet, we have to take into account all the WordNet synsets that contain its term. For example, given the Dogma Car concept, represented by the term ”car”: if we look up the term ”car” in WordNet in order to compute linguistic relations for this Car concept, we find several Synsets, each containing the term ”car” in a different context. Table 4.5 shows a subset of all the WordNet synsets that contain the term ”car”. Another problem is that we do not know the Part Of Speech (POS) of the term: if we want to look up a term in WordNet, we need to know the POS of the intended term. For the sake of simplicity, we assume that terms representing concepts are nouns.
WordNet Relations

There are several relations between WordNet synsets. We will only try to find the relations we defined previously in definition 14, and we limit ourselves to direct relations between synsets.

> [Synset: [Offset: 7411192] [POS: noun] Lemma: beer]
> [Synset: [Offset: 7411424] [POS: noun] Lemma: draft_beer, draught_beer]
> [Synset: [Offset: 7411517] [POS: noun] Lemma: suds]
> [Synset: [Offset: 7411959] [POS: noun] Lemma: lager, lager_beer]
> [Synset: [Offset: 7411629] [POS: noun] Lemma: Munich_beer, Munchener]
> [Synset: [Offset: 7411786] [POS: noun] Lemma: bock, bock_beer]
> [Synset: [Offset: 7412292] [POS: noun] Lemma: light_beer]
> [Synset: [Offset: 7412383] [POS: noun] Lemma: Oktoberfest, Octoberfest]
> [Synset: [Offset: 7412554] [POS: noun] Lemma: Pilsner, Pilsener]
> [Synset: [Offset: 7413564] [POS: noun] Lemma: malt, malt_liquor]
> [Synset: [Offset: 7413782] [POS: noun] Lemma: ale]
> [Synset: [Offset: 7412790] [POS: noun] Lemma: Weissbier, white_beer, wheat_beer]
> [Synset: [Offset: 7413034] [POS: noun] Lemma: Weizenbier]
> [Synset: [Offset: 7413141] [POS: noun] Lemma: Weizenbock]
> [Synset: [Offset: 7414086] [POS: noun] Lemma: bitter]
> [Synset: [Offset: 7414244] [POS: noun] Lemma: Burton]
> [Synset: [Offset: 7414322] [POS: noun] Lemma: pale_ale]
> [Synset: [Offset: 7414480] [POS: noun] Lemma: porter, porter’s_beer]
> [Synset: [Offset: 7414606] [POS: noun] Lemma: stout]
> [Synset: [Offset: 7414794] [POS: noun] Lemma: Guinness]
Table 4.6: The hyponym tree of the WordNet ”beer” Synset.

If a relation is found, we also compute the score of that relation. We define the normalized linguistic relationship score between two terms as follows:

Definition 17 (WordNet Relations Normalize function): Let x be the length of the linguistic relationship path between two terms. Normalize(x) returns the normalized value:
Normalize_WordnetRel(x) = (1/x)^(1/3)
Take for example the terms Ale and Beer from table 4.6: the length of the hypernym relation between Ale and Beer is 1, so the resulting score is 1. Take, in contrast, the terms Company and Organization: their relation distance is 4, so the normalized score is 0.62996054. Finding relations between WordNet synsets can be a computationally complex task. We have therefore taken some steps to make this process more performant:
1. We have limited the depth of the relation tree (for example the hyponym tree in table 4.6) to 5.

2. When a crawler thread is started, the relations of the WordNet synsets corresponding to the terms from the Dogma ontology are parsed into a hashtable. Given another synset, we can then look up in the generated hashtable whether there is a relation with a synset corresponding to a Dogma term, and what that relation is. This results in a relation lookup of O(1); a sketch of this precomputation is given below.

3. The fact that we only take direct synset relations into account is a direct consequence of this performance limitation. If we also wanted to find indirect relations between two synsets, then, given the generated hashtable of relations for one of the synsets, only one relation tree would have to be traversed. While this might still be acceptable performance-wise, the question remains whether synsets that have no direct relation, but only share some hypernym or hyponym, are still strongly related to each other.

In table 4.7 we show a few of the resulting candidate mappings after applying the WordNet Relations algorithm.
Table 4.7: Some of the resulting mapping rules after applying the WordNet Relations algorithm to the candidate mappings.
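The precomputed relation table of point 2 can be sketched as follows; Synset and its neighbours() accessor are hypothetical stand-ins for the WordNet API actually used:

import java.util.*;

public final class WordNetRelations {
    // Hypothetical stand-in for a WordNet synset with its direct relation pointers.
    public interface Synset {
        List<Synset> neighbours();   // e.g. direct hypernyms and hyponyms
    }

    // Breadth-first expansion (depth <= 5 in our setting) around a Dogma term's
    // synset, so that later relation lookups cost a single hash probe, i.e. O(1).
    public static Map<Synset, Integer> buildRelationTable(Synset root, int maxDepth) {
        Map<Synset, Integer> distance = new HashMap<>();
        Deque<Synset> queue = new ArrayDeque<>();
        distance.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            Synset current = queue.poll();
            int d = distance.get(current);
            if (d == maxDepth) continue;           // depth limit (point 1 above)
            for (Synset next : current.neighbours()) {
                if (!distance.containsKey(next)) {
                    distance.put(next, d + 1);
                    queue.add(next);
                }
            }
        }
        return distance;
    }

    // Normalized relation score of definition 17: (1/x)^(1/3) for path length x.
    public static double normalize(int pathLength) {
        return Math.pow(1.0 / pathLength, 1.0 / 3.0);
    }
}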
WordNet Glosses

A lot of the semantic value of concepts is contained in their gloss descriptions. Therefore, we also compare the glosses of the WordNet synsets with each other. We use the Jaccard similarity algorithm to compare two glosses. The Jaccard similarity is a token-based vector-space similarity measure, like, for example, the cosine distance and the matching-coefficient similarities. Jaccard similarity uses the word sets of the compared instances to evaluate similarity. In [19], we see that Jaccard is one of the better token-based similarity measures.
Token-based similarity measures consider two strings s and t as multisets (or bags) of words (or tokens). We can now define the Jaccard similarity between two strings as:

Definition 18 (Jaccard Similarity): Given two strings s and t, the Jaccard similarity between both strings is:

sim_Jaccard(S, T) = |S ∩ T| / |S ∪ T|

where S and T are the multisets representing, respectively, the strings s and t.
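In Java, with the token bags approximated by plain sets for brevity, the Jaccard similarity is only a few lines:

import java.util.*;

public final class Jaccard {
    // Jaccard similarity of definition 18; the token multisets are
    // approximated by sets for brevity.
    public static double similarity(Collection<String> tokensS, Collection<String> tokensT) {
        Set<String> s = new HashSet<>(tokensS);
        Set<String> t = new HashSet<>(tokensT);
        Set<String> intersection = new HashSet<>(s);
        intersection.retainAll(t);                  // S ∩ T
        Set<String> union = new HashSet<>(s);
        union.addAll(t);                            // S ∪ T
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}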
4.3.4 Semantic Similarity
The previous similarity iterations are positioned on the semantically weak side of the Semantic Intensity Spectrum (see section 2.5.2). In this iteration, we apply techniques that have greater semantic value, while still being performant enough to be used within the crawler environment. We present two algorithms to compute the semantic similarity between two concepts: the first uses the context of the concepts, and the second uses their gloss descriptions.
Context-aware Similarity
Comparing the contexts of the concepts from both ontologies is needed to arrive at a semantically richer similarity score. This context-aware similarity can, for example, solve the problem of homonyms; we have already addressed the problem of synonyms with the WordNet relations similarity score. Two identical terms are considered homonyms when they identify different concepts. For instance, the term 'bank' has a meaning in a geographical context that is different from the one in a financial context, and therefore evokes different concepts. While the terms representing these two different concepts are the same, the relations they have with other concepts will not be. We see these relations as the context of the concepts. By using these contexts to compute the similarity between the two concepts, we can establish that, while their term labels are similar, the concepts are not. We illustrate the context-aware similarity with the example in figure 4.9, which shows the contexts of the two Beer concepts from the ontologies we presented earlier in section 4.3.1. The blue arrows between the different concepts denote which concepts are
[Figure 4.9: Matching the contexts of the two Beer concepts from both ontologies. The contexts include, among others, AlcoholPercentage, Brewery, Award, Ingredient/component, Fermentation Method, Pilsner, Ale and the Bottom/Top Fermented Beer subtypes; the arrows between mapped pairs carry similarity scores from the previous iterations (values such as 1.0, 0.6663452, 0.6259002, 0.51828617 and 0.5 appear). Resulting context similarity score: 0.7829984.]
mapped to each other. The value on each of these arrows denotes the similarity score between the connected concepts; we retrieve this score from the vector of similarity scores built up during the previous similarity iterations. In algorithm 2, we present the context similarity algorithm in pseudocode. First, all concepts contained in the context of the first concept are iterated. For each of these concepts, the best match among the concepts of the second context is found; this happens in the inner for-loop, where the test on bestScore ensures that the concept with the highest similarity is chosen. We look up the similarities between the concepts in the current Search Space. All these bestScores are accumulated into the totalScore. The result of the algorithm is then the totalScore normalized by the number of mappings (see definition 19).
Algorithm 2 compareContexts(searchspace, concept1, concept2)
  searchMatrix ← searchspace
  context1 ← getContext(concept1)
  context2 ← getContext(concept2)
  totalScore ← 0
  l ← 0
  for i ← 0 to size(context1) do
    concept1 ← getConcept(context1, i)
    bestScore ← 0
    for j ← 0 to size(context2) do
      concept2 ← getConcept(context2, j)
      simScore ← searchMatrix[concept1, concept2]
      if simScore ≥ bestScore then
        bestScore ← simScore
      end if
    end for
    totalScore ← totalScore + bestScore
    l ← l + 1
  end for
  result ← Normalize_ContextSim(totalScore, l)
Definition 19 (Context Similarity Normalize function): If x is the non-normalized context similarity value between two contexts and l the number of mappings between the concepts of both contexts, Normalize(x, l) returns the normalized context similarity score:
$Normalize_{ContextSim}(x, l) = \frac{x}{l + \frac{2}{l}}$
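Algorithm 2 can be transcribed almost directly into Java. In the sketch below, the SearchSpace type and its lookup() method are hypothetical stand-ins for the search space of candidate mappings; the loop structure and the final normalization follow the pseudocode and our reading of definition 19.

import java.util.List;

// Hypothetical interface standing in for the search space of candidate mappings.
interface SearchSpace {
    double lookup(String conceptA, String conceptB);
}

class ContextSimilarity {
    // A direct transcription of algorithm 2 (compareContexts).
    static double compareContexts(SearchSpace space, List<String> context1, List<String> context2) {
        double totalScore = 0.0;
        int l = 0;
        for (String c1 : context1) {
            double bestScore = 0.0;
            for (String c2 : context2) {
                double simScore = space.lookup(c1, c2);   // score from previous iterations
                if (simScore >= bestScore) {
                    bestScore = simScore;                 // keep the best match
                }
            }
            totalScore += bestScore;
            l++;
        }
        // Definition 19: normalize the accumulated score by the number of mappings.
        return l == 0 ? 0.0 : totalScore / (l + 2.0 / l);
    }
}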
Concept glosses
Another semantically "rich" similarity measure is the comparison of the gloss descriptions of both concepts. Gloss descriptions of synonyms will yield a much higher gloss similarity than gloss descriptions of homonyms. Comparing the glosses of the concepts can therefore help to identify which concepts are equal and which merely share a term label while denoting different concepts. Again, we use the Jaccard similarity measure, already given in definition 18, to compare both glosses.
4.3.5 Evaluation
We will now provide a brief evaluation of the DMatch-lite algorithm on the beer-ontology example described before. We have two different but similar beer ontologies: the first is described in RDFS (see figure 4.6), the other is presented as a Dogma Ontology Commitment (see figure 4.7).
(The rows of the matrix are the Dogma concepts, in the order: Pilsner, Fermentation Method, Ale, Company, Award, AlcoholPercentage, Location, Beer, component, Brewery. Each line below lists one SWO concept with its similarity scores against these ten Dogma concepts.)
Pilsner: 1.0 0.13008991 0.24287185 0.064300425 0.064300425 0.18464053 0.11111112 0.24287185 0.20318931 0.15358613
Ale: 0.24287185 0.06581654 1.0 0.032921813 0.33344862 0.27928025 0.064300425 0.3326904 0.102366254 0.15358613
TopFermentedBeer: 0.18148793 0.34091404 0.07198431 0.20371176 0.14242543 0.21135925 0.14242543 0.22055043 0.28183675 0.18148793
Ingredient: 0.18808678 0.2571657 0.09542183 0.12680045 0.23611112 0.21135925 0.25058675 0.18930042 0.25180042 0.25180042
Award: 0.064300425 0.17690061 0.33344862 0.12220752 1.0 0.17389567 0.18923612 0.30144036 0.032921813 0.28968254
AlcoholPercentage: 0.18464053 0.06581654 0.27928025 0.13782984 0.17389567 1.0 0.17459454 0.10645122 0.24812394 0.10645122
Beer: 0.24287185 0.13008991 0.3326904 0.032921813 0.30144036 0.10645122 0.064300425 1.0 0.102366254 0.58295363
Organization: 0.17767008 0.30847952 0.032921813 0.116383746 0.116383746 0.10106513 0.37152782 0.064300425 0.116383746 0.116383746
BottomFermentedBeer: 0.20979533 0.19137624 0.06581654 0.20979533 0.13008991 0.26166883 0.17690061 0.27946785 0.27558482 0.27946785
Brewery: 0.15358613 0.16298464 0.15358613 0.15358613 0.28968254 0.10645122 0.064300425 0.58295363 0.102366254 1.0
Region: 0.11111112 0.19587936 0.11111112 0.15358613 0.11111112 0.13782984 0.4108154 0.2152778 0.20318931 0.15358613
Table 4.8: The Search Space Matrix after the String Similarity Iteration.
As we have described earlier, the DMatch-lite algorithm consists of several similarity iterations. We apply each of these iterations to the two ontologies and present the results after each one. We illustrate the results by showing the Search Matrix (the Search Space); in the actual implementation, the search matrix contains references to the candidate mappings, but here we show the similarity scores of each mapping. The first iteration is the string-based similarity; the search matrix after applying this iteration can be seen in table 4.8. This is the first phase, so each of the candidate mappings is given a similarity score. The second iteration is the linguistic-based similarity iteration; the results after this iteration are presented in table 4.9.
(Same layout as table 4.8: each line lists one SWO concept with its similarity scores against the ten Dogma concepts Pilsner, Fermentation Method, Ale, Company, Award, AlcoholPercentage, Location, Beer, component, Brewery, in that order.)
Pilsner: 1.0 0.0 0.121435925 0.0 0.0 0.0 0.0 0.51828617 0.10159466 0.0
Ale: 0.121435925 0.0 1.0 0.0 0.16672431 0.13964012 0.0 0.6663452 0.0 0.0
TopFermentedBeer: 0.0 0.17045702 0.0 0.10185588 0.0 0.105679624 0.0 0.11027522 0.14091837 0.0
Ingredient: 0.0 0.12858285 0.0 0.0 0.11805556 0.105679624 0.12529337 0.0 0.6259002 0.12590021
Award: 0.0 0.0 0.16672431 0.0 1.0 0.0 0.0 0.15072018 0.0 0.14484127
AlcoholPercentage: 0.0 0.0 0.13964012 0.0 0.0 0.5 0.0 0.0 0.12406197 0.0
Beer: 0.51828617 0.0 0.6663452 0.0 0.15072018 0.0 0.0 1.0 0.0 0.29147682
Organization: 0.0 0.15423976 0.0 0.37317213 0.0 0.0 0.18576391 0.0 0.0 0.0
BottomFermentedBeer: 0.10489766 0.0 0.0 0.10489766 0.0 0.13083442 0.0 0.13973393 0.13779241 0.13973393
Brewery: 0.0 0.0 0.0 0.0 0.14484127 0.0 0.0 0.29147682 0.0 1.0
Region: 0.0 0.0 0.0 0.0 0.0 0.0 0.7054077 0.1076389 0.10159466 0.0
Table 4.9: The Search Space Matrix after the Linguistic Similarity Iteration.
In this stage, we can clearly see the effect of the WordNet database on the similarity scores. For example, where the similarity score between Beer and Ale is only 0.33 after the string-based iteration, it is now 0.66, since we have found a linguistic relation between the two terms. The same applies to the terms Component and Ingredient. In contrast, the terms Beer and Brewery have a relatively high (0.58) string similarity, but after computing their linguistic similarity the score has decreased to 0.29. In this phase, we also see the heuristics used when updating the search space: all candidate mappings with a similarity score below 0.1 after this iteration are given the inactive status and their similarity score is reset to 0.0. The last iteration is the semantic similarity iteration.
(Same layout as tables 4.8 and 4.9: each line lists one SWO concept with its similarity scores against the ten Dogma concepts Pilsner, Fermentation Method, Ale, Company, Award, AlcoholPercentage, Location, Beer, component, Brewery, in that order.)
Pilsner: 0.7530477 0.0 0.16733831 0.0 0.0 0.0 0.0 0.46395135 0.16014376 0.0
Ale: 0.19201481 0.0 0.77772415 0.0 0.22220707 0.20415094 0.0 0.5737247 0.0 0.0
TopFermentedBeer: 0.0 0.28030467 0.0 0.11648339 0.0 0.24277984 0.0 0.3978879 0.26061225 0.0
Ingredient: 0.0 0.0857219 0.0 0.0 0.07870371 0.070453085 0.10451229 0.0 0.41726682 0.10481571
Award: 0.0 0.0 0.13626957 0.0 0.6917867 0.0 0.0 0.2581458 0.0 0.12168087
AlcoholPercentage: 0.0 0.0 0.09309342 0.0 0.0 0.33333334 0.0 0.0 0.08270798 0.0
Beer: 0.44892886 0.0 0.54146576 0.0 0.3253601 0.0 0.0 0.89149916 0.0 0.3687575
Organization: 0.0 0.102826506 0.0 0.24878143 0.0 0.0 0.123842604 0.0 0.0 0.0
BottomFermentedBeer: 0.38578218 0.0 0.0 0.118511245 0.0 0.25388962 0.0 0.40758383 0.25852826 0.30259603
Brewery: 0.0 0.0 0.0 0.0 0.14514032 0.0 0.0 0.29952967 0.0 0.8328141
Region: 0.0 0.0 0.0 0.0 0.0 0.0 0.4702718 0.079973646 0.08566959 0.0
Table 4.10: The Search Space Matrix after the Semantic Similarity Iteration.
What is remarkable here is that concepts that were previously given a perfect similarity score, for example the Beer concepts and the Award concepts in both ontologies, are now given a score that better reflects their conceptual similarity. After the first two iterations, the similarity score between both concepts was 1.0. This is logical, since they have exactly the same string and they are synonyms of each other; it does not mean, however, that they describe exactly the same concepts. Applying the context-aware similarity measure gives a more nuanced result: the Beer concepts have a similarity of 0.89 and the Award concepts a similarity of 0.69.
[Figure 4.10: Matching the contexts of the two Beer concepts from both ontologies, showing the mapped concept pairs with their final similarity scores and the relations between the concepts (has/is-of, has received/isGivenTo, isPartOf/madeFrom, brews/brewedBy, locatedIn). FINAL SIMILARITY SCORE: 0.5572768.]
4.4 Conclusion
In this section, we introduced and described the DMatch-lite algorithm. We sketched the context of our algorithm within our crawler system and within the general ontology-integration process. By determining this context, we were able to define the Match Operator in line with current work on ontology matching. We have seen that the Match Operator takes two ontologies as input: an SWO and a Dogma ontology commitment. Using several kinds of iterations, each with its own ordered list of similarity metrics, we compute the similarity between both ontologies, that is, the size of their intersection. The output of the Match Operator is the final similarity score together with the set of candidate mappings created during the matching process. We have also illustrated DMatch-lite using two example beer ontologies. The system was "smart" enough to relate concepts that would not have been related using simple string similarity metrics alone; for example, it related Ingredient to Component, Company to Organization and Location to Region.
Chapter 5
Crawler Implementation
Introduction
In this chapter, we give a brief overview of our implementation of the crawler and matching system discussed in the previous chapters. We first give a high-level overview of the packages and classes that are part of the system. After that, we discuss the matching part of our implementation in more detail; we chose this part of the system because it is what distinguishes our system from the countless already existing general crawlers.
5.1 Overview
We have chosen to implement our crawler in Java 1.5. This is a direct consequence of the fact that the crawler should be part of the Dogma Framework: while other programming languages can of course work together with the Dogma Framework (via RMI, SOAP, ...), the Java platform was an obvious choice. In figure 5.1, we show the three packages that make up the crawler system:
• The Crawler package contains all the specific crawler code and models the flow of the crawler process.
• The Generic package contains the classes used to download and parse the web pages. These classes are included in the Generic package since they are generic classes that can be used in many applications.
Figure 5.1: The three packages that make up our focused crawler.
• The Matching package contains all the code responsible for the matching computation used in the crawler. We enclosed this code in its own package since it can be seen as a matching API that can also be used in other systems.
5.1.1 The Crawler Package
The crawler package is the main package of the crawler system, and the Crawler class is its main class. Its main method parses the configuration file and spawns a specific number of CrawlerThread instances. Each of these CrawlerThread instances models the flow of the crawler process, which we described in detail in section 3.4 on page 48. Notice the correspondence between the architectural crawler process flow illustrated in figure 3.8 on page 56 and the private methods in the CrawlerThread class: the three big parts of the crawler process are mapped to three private methods, read(), preprocess() and computeRelevance().
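A sketch of this start-up sequence is given below. The Configuration helper is hypothetical (the real main() reads the thread count from the configuration file and also wires up the SOAP clients of figure 5.2), and we assume CrawlerThread extends java.lang.Thread, since it exposes a run() method.

public static void main(String[] args) {
    // Hypothetical helper representing the parsed configuration file.
    Configuration config = Configuration.parse(args[0]);

    // Spawn the configured number of crawler threads; each thread runs
    // the read() / preprocess() / computeRelevance() loop on its own.
    for (int i = 0; i < config.numberOfThreads(); i++) {
        CrawlerThread thread = new CrawlerThread(config.frontierClient(), config.lexonSet());
        thread.start();
    }
}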
5.1.2 The Generic Package
We have grouped all the classes that are responsible for the parsing of the web pages in the generic package. Two classes contained in the generic package are UrlReader and UrlFilter.
[Figure: UML of the crawler package — the Crawler class (main(args: String[]), stoppedCrawlerThread()), the CrawlerThread class (run() plus the private methods read(), preprocess(), computeRelevance() and addlinkstoDbase()), and the soap sub-package with SoapClient, FrontierSoapClient and RdfSoapClient.]
Figure 5.2: The class diagrams of the Crawler package.
The UrlReader class is responsible for downloading a web page. It is constructed with a URL parameter; its methods then give access to the content of the web page (as a String), the metadata of the web page and the URLs it contains. The URLs are filtered by the UrlFilter class, to which UrlReader keeps a reference.
5.1.3 The Matching Package
Since the different matching algorithms we used are the most important and interesting part of the crawler system, we will present the implementation of our DMatch-lite algorithm that contains all these different algorithms in more detail in the next section.
[Figure: UML of the generic package — the sub-packages util, processing, wordvector and reader, with the classes Util, UrlReader and UrlFilter.]
Figure 5.3: The class diagrams of the Generic package.
5.2 DMatch-lite Implementation
As we have described in the previous sections, DMatch-lite can be seen as a combination of existing and well-known matching algorithms. In this thesis, we have selected from these algorithms the ones that best suit our goal: a fast and automated ontology matcher to be used in a focused-crawler environment. We have already stressed that this matching algorithm can and should be usable in other applications as well. It is obvious that modifications to the DMatch-lite algorithm, that is, to the combination of general matching algorithms it uses, might be necessary to cope with the specific requirements of those applications. Not only should the DMatch-lite algorithm be easily adaptable to other applications, it is also important that the algorithm can be easily changed and configured within the crawler system, for example depending on what kind of information is crawled for. Since not much evaluation of the crawler has been done, or is possible at this moment (as we have described in the previous chapter), it is conceivable that some algorithms within the current version of DMatch-lite will turn out not to be the optimal choice once the crawler is sufficiently tested in a real-world environment. Considering these reasons, one of the primary goals we kept in mind while implementing the DMatch-lite algorithm was modularization: the algorithms contained within DMatch-lite should be easy to change and to adapt to other requirements. We have therefore used a number of well-known programming patterns that give us this dynamism. The code implementing the DMatch-lite algorithms corresponds to the Matching package of the crawler, illustrated in figure 5.4. The package consists of five sub-packages: the Similarity Iterations package, the Similarity Algorithms package, the Relevance Sets package, the Mapping package and the Util package. The class that uses all these packages and is contained in the main Matching package is DLMatch. In the following subsections, we present a more detailed description of all these packages and then illustrate the use of the DLMatch class with a sequence diagram.
[Figure: UML of the matching package — the sub-packages Util, SimilarityIterations, RelevanceSets, Mapping and SimilarityAlgorithms, and the DLMatch class with its ordered simiterations_ list and the methods initialize(), computeDLMatch(mtable: MappingTable): Float and computeFinalSimilarity(mtable: MappingTable).]
Figure 5.4: The class diagrams of the Matching package.
5.2.1 Similarity Iterations package
The SimilarityIterations package contains the different similarity iterations discussed in the previous chapters. As shown in figure 5.5, the three iterations of the DMatch-lite algorithm are the classes StringSimilarity, LinguisticSimilarity and SemanticSimilarity. All these classes are subclasses of the SimilarityIteration class. With this model we achieve polymorphism: the DMatch-lite algorithm can treat each iteration as a SimilarityIteration without having to know the actual subclass it represents. It is therefore very easy to add iterations to, or remove them from, the algorithm; the only requirement is that the classes representing new iterations are subclasses of the SimilarityIteration class.
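This polymorphic design can be sketched as follows. MappingTable and SimilarityAlgorithmWrapper are reduced to empty placeholder types here so the sketch is self-contained, and the subclass bodies are illustrative only.

import java.util.ArrayList;
import java.util.List;

interface MappingTable { /* see section 5.2.3 */ }
interface SimilarityAlgorithmWrapper { /* see section 5.2.2 */ }

abstract class SimilarityIteration {
    protected final List<SimilarityAlgorithmWrapper> algorithms_ =
            new ArrayList<SimilarityAlgorithmWrapper>();

    public void addAlgorithm(SimilarityAlgorithmWrapper algo) {
        algorithms_.add(algo);
    }

    // DMatch-lite only ever calls these three methods on the base type.
    public abstract void newIteration(MappingTable mtable);
    public abstract void computeSimilarity();
    public abstract void aggregateSimilarity();
}

// Adding an iteration is just adding a subclass.
class StringSimilarity extends SimilarityIteration {
    private MappingTable mtable_;
    public void newIteration(MappingTable mtable) { mtable_ = mtable; }
    public void computeSimilarity() { /* run each wrapped algorithm */ }
    public void aggregateSimilarity() { /* update the search space */ }
}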
5.2.2 Similarity Algorithms package
The Similarity Algorithms package contains all the similarity algorithms used in DMatch-lite. We have implemented the similarity algorithms following the same ideas as the similarity iterations. Each specific similarity algorithm is contained in a class; that class can be a wrapper around an existing similarity metric (in our implementation this is the case for the string similarities) or can implement the similarity algorithm itself (for example the WordNet and context-based algorithms).
[Figure: UML of the SimilarityIterations package — the SimilarityIteration base class with its ordered algorithms_ list and the methods addAlgorithm(), Initialize(), newIteration(), computeSimilarity() and aggregateSimilarity(), and its subclasses StringSimilarity, LinguisticSimilarity and SemanticSimilarity.]
Figure 5.5: The class diagrams of the SimilarityIterations package.
All these algorithm classes are subclasses of the SimilarityAlgorithmWrapper class. Using such a wrapper class, we get the same modularity as with the iteration classes. Each iteration contains a number of similarity algorithm classes, represented by the SimilarityAlgorithmWrapper class. For each of these classes, the iteration calls the SimilarityAlgorithmWrapper.computeSimilarity() method; the iteration does not need to know which similarity algorithm class is actually represented by the wrapper. The string-based similarity algorithm classes (SoundexSimilarity, JaccardSimilarity and JaroSimilarity) are wrapper classes that use an existing similarity metrics library: SimMetrics¹. The SimMetrics library is an open-source, extensible library of similarity and distance metrics, e.g. Levenshtein distance, L2 distance, cosine similarity and Jaccard similarity. It provides float-based, normalised similarity measures between string data as well as the typical non-normalised metric output. It was developed by Sam Chapman of the Natural Language Processing Group at Sheffield University². The WordNet algorithms use an existing API: the Java WordNet Library (JWNL)³.
¹ http://www.dcs.shef.ac.uk/~sam/simmetrics.html
² This work was carried out within the AKT project (http://www.aktors.org), sponsored by the UK Engineering and Physical Sciences Research Council (grant GR/N15764/01), and the Dot.Kom project, sponsored by the EU IST as part of Framework V (grant IST-2001-34038).
³ http://jwordnet.sourceforge.net/
[Figure: UML of the SimilarityAlgorithms package — the SimilarityAlgorithmWrapper base class (computeSimilarity(e: MappingElement), initialize(mtable: MappingTable), getSimilarity(), getNormalised(), getRelation(), getWeight()) with the subclasses LevenshteinSimilarity, JaroSimilarity, SoundexSimilarity, JaccardSimilarity, WordnetGlosses, WordnetRelations and ContextSimilarity.]
Figure 5.6: The class diagrams of the SimilarityAlgorithms package.
JWNL is an API for accessing WordNet-style relational dictionaries. It also provides functionality beyond data access, such as relationship discovery and morphological processing.
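As an illustration, a wrapper around one SimMetrics metric could look as follows. This sketch assumes the classic SimMetrics API, in which every metric exposes a getSimilarity(String, String) method returning a normalised float; the wrapper class itself is simplified from the real one.

import uk.ac.shef.wit.simmetrics.similaritymetrics.JaccardSimilarity;

class JaccardSimilarityWrapper {
    private final JaccardSimilarity metric_ = new JaccardSimilarity();
    private float similarity_ = 0.0f;

    // The iteration calls this without knowing which metric is wrapped.
    public void computeSimilarity(String sourceTerm, String swTerm) {
        similarity_ = metric_.getSimilarity(sourceTerm, swTerm);
    }

    public float getNormalised() {
        return similarity_;   // Jaccard scores are already in [0, 1]
    }
}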
5.2.3 Mapping Package
The mapping package contains the MappingTable interface. A specific data structure is used to represent the concepts and the candidate mappings used in the DMatch-lite algorithm; this data structure must support the operations declared by the MappingTable interface. In our crawler, we implement a DogmaMappingTable class that implements this interface. The DogmaMappingTable class contains the data structures representing the search space and the candidate mappings introduced in the previous chapters, and responds to all the methods declared in the MappingTable interface. In the DMatch-lite algorithm, the mapping table is created once as a DogmaMappingTable; the rest of the algorithm only uses the MappingTable interface. This design again promotes the adaptability and extensibility of the matching algorithm: while we use our own DogmaMappingTable (since we compare a Dogma commitment with SWOs), anyone who wants to match several SWOs, or ontologies in another representation, only needs to write a new class implementing the MappingTable interface and instantiate it as the MappingTable variable in DMatch-lite.
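Swapping in another representation then amounts to something like the following sketch, where OwlMappingTable is a hypothetical implementation for matching two SWOs directly; its method bodies are elided but would follow the interface shown in figure 5.7, and the DLMatch call uses the signature from figure 5.4.

// Hypothetical implementation of the MappingTable interface for two SWOs.
class OwlMappingTable implements MappingTable {
    // implement iterator(), getSourceConcepts(), getMappingElement(), ...
}

// Only the construction site changes; the DMatch-lite algorithm is untouched.
MappingTable mtable = new OwlMappingTable();   // instead of DogmaMappingTable
DLMatch matcher = new DLMatch();
Float score = matcher.computeDLMatch(mtable);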
[Figure: UML of the Mapping package — the MappingTable interface (extending java::lang::Iterable; iterator(), getSourceConcepts(), getMappingElement(), printMappingTable(), getSourceModel(), getSWOModel(), getSWConcept(), printMatrixLatex(), getSwtTable(), numberOfSourceTerms(), numberOfSWTerms(), resetMappingTable()), the MappingElement class, and the DogmaMappingTable implementation with its lookupmatrix_, swtTable_, sourceconcepts_ and mappingtable_ fields.]
Figure 5.7: The class diagrams of the Mapping package.
5.2.4 Relevance Sets Package
In the context similarity algorithm, we represent the context of a concept using relevance sets, which are contained in the Relevance Sets package. Again, as in the other packages, a general RelevanceSet interface declares all the operations that must be supported by the classes implementing a relevance set. In our implementation, two classes implement this interface: the DogmaRelevanceSet class, which represents the context of a Dogma concept, and the SWORelevanceSet class, which represents the context of an SWC. Each element in a relevance set is modeled by the RelSetElement class.
5.2.5 Using DMatch-lite
We now show, using the sequence diagram in figure 5.9, how the relevance score between two ontologies is computed with the DMatch-lite algorithm. As illustrated in the sequence diagram, and in correspondence with the description of our crawler's architecture in section 3.4, the relevance computation is started from within the CrawlerThread class with the computeDLMatchRelevance() method. As we have described before, the DMatch-lite algorithm uses several iterations, and each of these iterations uses different existing similarity algorithms. This process is clearly illustrated in the sequence diagram.
[Figure: UML of the RelevanceSets package — the RelevanceSet interface (getRelationSet(), getTaxonomicSet(), makeRelevanceSets(), selectCurrentRelation(), size()), its implementations SWORelevanceSets and DOGMARelevanceSets, and the RelSetElement class (getRelation(), getConcept(), setRelation(), setConcept()).]
Figure 5.8: The class diagrams of the RelevanceSets package.
As can be seen in figure 5.4 of the matching package, the DMatch-lite class has an ordered list of SimilarityIterations; each of these iterations is actually a subclass of the SimilarityIteration class (as described before). The DMatch-lite class iterates over that list and sends three messages to the class representing the SimilarityIteration: (i) newIteration() initiates the iteration, (ii) computeSimilarity() computes the total similarity of that iteration, and (iii) aggregateSimilarity() aggregates the similarity scores and uses them to update the search space. Within each iteration, a similar process is executed: each SimilarityIteration has a list of SimilarityAlgorithmWrapper classes, and to compute the relevance the iteration implements a double loop, computing the relevance for each element and for each algorithm in the iteration. It should be noted that these computation sequences correspond very well to the matching methodology we presented in section 4.2. Both the methodology and the corresponding implementation make the algorithm very adaptable and extensible, which were our initial goals for the implementation.
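The core of computeDLMatch() therefore reduces to a loop like the following sketch, using the method names from figures 5.4 and 5.5; the finalScore_ field is our assumption about where the aggregated result is kept.

public Float computeDLMatch(MappingTable mtable) {
    for (SimilarityIteration iteration : simiterations_) {
        iteration.newIteration(mtable);    // (2) initialise the iteration
        iteration.computeSimilarity();     // (3) double loop over elements and algorithms
        iteration.aggregateSimilarity();   // (8) update the search space
    }
    computeFinalSimilarity(mtable);        // aggregate into the final score
    return finalScore_;                    // assumed field holding the result
}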
[Figure: sequence diagram of the relevance computation — CrawlerThread calls DLMatch.computeDLMatch(mtable); for every SimilarityIteration in the ordered list, DLMatch sends newIteration(mtable), computeSimilarity() and aggregateSimilarity(); inside computeSimilarity(), a double loop over all MappingElements and all SimilarityAlgorithmWrappers calls computeSimilarity(mappingEl), getNormalised(), getRelation() and getWeight().]
Figure 5.9: The sequence diagram showing the use of the DMatch-lite Algorithm
Chapter 6
Crawler Evaluation
6.1 Introduction
In section 4.3.5 we illustrated and briefly evaluated our relevance computation algorithm DMatch-lite. We did this in a small and controlled environment, so that the functioning of the algorithm could be clearly visualised. In this chapter, we perform a brief, empirical evaluation of the crawler in an uncontrolled, practical environment, namely the Semantic Web. With this uncontrolled empirical evaluation, we want to point out and illustrate some interesting statistics concerning our crawler. We introduce the well-known harvest rate [1, 17], the evaluation metric generally used to evaluate focused crawlers. The harvest rate can be denoted as P(C), the percentage of crawled web pages satisfying the predicate. We present different scenarios to test the harvest rate of our crawler: we compare the harvest rate of crawls using different kinds of topic ontologies, and we compare the harvest rate of a crawl using a topic ontology (our approach) with that of a crawl focused around a simple keyword topic. In addition, we provide some statistics that give an idea of the performance that can be expected from the crawler, and we compare the computing time of the phases of the crawling loop (see section 3.4). The experimental data gives us an idea of the usefulness and applicability of the crawler on the current Semantic Web. As stated above, our main goal with this empirical evaluation is to provide some insight into the problems that arise, and still have to be tackled, when using the crawler in practice. To draw conclusions on the quality of the crawler, and more specifically of our DMatch-lite algorithm, compared to other heuristics used in focused crawlers, we should test the crawler in a controlled and transparent environment. Building and thoroughly evaluating a crawler in such an environment is not a simple task. Firstly,
a complete Dogma Ontology Commitment is needed as topic ontology. As thoroughly described in section 2.2.2, a Dogma Ontology Commitment is a set of lexons together with its constraints, modeled by the domain expert or ontology engineer, that describes the domain of interest of the topic. Currently, we lack the tools to model such a commitment (e.g., the STARLab T-Lex Browser is still under development). Secondly, sufficient RDF-annotated web pages should be available; this is not evident either since, as we show below, RDF(S) is still very rare on the current Semantic Web. To draw conclusions on the quality of our DMatch-lite algorithm compared to the approaches adopted by other focused crawlers, precision and recall measures should be applied to the various crawlers; for a detailed description of precision and recall measures for evaluating ontology matching, we refer to [32]. These measures are hard to apply in the uncontrolled and dynamic environment that is the Semantic Web. Due to time constraints, we were not able to perform such an in-depth evaluation study; we have, however, already described a methodology to perform part of such an evaluation in section 2.4.3. In the following experiments, we compare the relevance of a page against a given ontology commitment versus a simple keyword. When no semantic data is found on the page, we use the well-known TF/IDF metric to compute this similarity. While we then cannot compare the ontology commitment with an SWD, there is still a difference between using an ontology commitment and a simple keyword: with an ontology commitment, the TF/IDF weight is computed for all the concept labels in the commitment, whereas with a keyword only that single keyword is used to compute the TF/IDF score.
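This difference can be made concrete with a small sketch. It assumes a precomputed map of TF/IDF weights for the terms on the page; the commitment-based score aggregates the weights of all concept labels, whereas the keyword-based score passes a single-element list. The averaging at the end is our illustrative choice, not necessarily the exact aggregation used by the crawler.

import java.util.List;
import java.util.Map;

class TfIdfRelevance {
    // conceptLabels holds all term labels from the ontology commitment,
    // or a single entry when a plain keyword topic is used.
    static double relevance(List<String> conceptLabels, Map<String, Double> tfIdfWeights) {
        double score = 0.0;
        for (String label : conceptLabels) {
            Double weight = tfIdfWeights.get(label.toLowerCase());
            if (weight != null) {
                score += weight;
            }
        }
        return conceptLabels.isEmpty() ? 0.0 : score / conceptLabels.size();
    }
}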
6.2 Crawler Harvest Rate
Perhaps the most crucial evaluation metric in focused crawling is the rate at which relevant pages are acquired, and how effectively irrelevant pages are filtered from the crawl. As discussed in [35], a major hurdle is the problem of defining this "relevance" metric. In an operational environment, real users may judge the relevance of pages as they are crawled, allowing us to determine whether the crawl was successful. Unfortunately, meaningful experiments involving real users to assess web crawls are extremely problematic. Since we only evaluate our own crawler, it is sufficient to use the same metric and the same similarity score to denote a relevant page in all of the following scenarios. In this evaluation study, pages with a relevance score of 0.03 or higher are considered relevant to the given topic; in our own manual evaluation of example pages, we found this score to be the best threshold value. This is of course a temporary situation, as we, the evaluators, are also
the testers, the experts and the developers.
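The bookkeeping behind the harvest-rate graphs in this chapter can be summarised in a few lines. The threshold of 0.03 is the one motivated above; the class itself is a sketch rather than actual crawler code.

class HarvestRate {
    private static final double RELEVANCE_THRESHOLD = 0.03;
    private int crawledPages = 0;
    private int relevantPages = 0;

    // Called once for every page the crawler processes.
    void record(double relevanceScore) {
        crawledPages++;
        if (relevanceScore >= RELEVANCE_THRESHOLD) {
            relevantPages++;
        }
    }

    // P(C): the fraction of crawled pages satisfying the relevance predicate.
    double value() {
        return crawledPages == 0 ? 0.0 : (double) relevantPages / crawledPages;
    }
}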
6.2.1 Harvest Rate on Different Ontologies
In this first scenario, we compare the harvest rate of two different crawls. The first crawl uses the KnowledgeWeb Project ontology. Our topic ontology must, of course, be represented as a Dogma ontology commitment; since the KnowledgeWeb Project ontology is described in RDFS, we first convert it to lexons so that these lexons can be inserted in the DOGMA Server. For this purpose, we have also developed a tool that automatically converts an RDFS ontology to a set of lexons. This tool uses the Jena library and treats the RDF statements as frames¹. An in-depth description of this conversion tool falls outside the scope of this thesis. The lexons resulting from this automatic conversion are listed in table 6.1; the source ontology in RDFS can be found in appendix B. Note that, since RDFS cannot express inverse properties and the conversion is done automatically, the co-role column is empty. The second topic ontology is the Beer ontology we presented in figure 4.7 on page 72. Not only are the domains of both ontologies distinct, the KnowledgeWeb Project lexons are also clearly converted from an existing RDFS ontology: typical RDF resources are used as concepts, for example RDF:Literal is often used as a tail-term. While this might not pose any problems when comparing the lexons with an SWD (in RDFS), it might have an effect on the TF/IDF score and therefore on the harvest rate. In contrast with the KnowledgeWeb Project ontology, the Beer ontology is clearly modeled from within DOGMA. In figure 6.1, we illustrate how this affects the crawl. In the graph, we plot the harvest rate of both crawls: we have taken the average harvest rate over every fifty crawled pages and plotted this value on the graph. The trend of both harvest rates is approximated by a polynomial function.
¹ More information on the frame-like view of RDF statements can be found at http://jena.sourceforge.net/how-to/RDF-frames.html
[Figure 6.1: The harvest rate of two crawls, one using the Beer ontology and the other using the KnowledgeWeb Project ontology, plotted against the number of crawled pages (50-1000), with polynomial trend lines for both crawls.]
We can clearly distinguish the difference between the harvest rates of both ontologies. The crawl using the Beer ontology keeps a harvest rate of around 0.8 during the crawl, while the crawl using the KnowledgeWeb Project ontology keeps an average harvest rate closer to 0.4. A possible explanation might be the following: as seen above, the KnowledgeWeb ontology commitment has many general terms, e.g. 'Task', 'Literal', 'Resource', 'Activity' and 'Person'. While these terms might occur on many web pages, the rate at which they occur on any given page will probably not be very high; this results in a lower TF/IDF score and hence a lower harvest rate. In contrast to the KnowledgeWeb ontology, the Beer ontology is mostly composed of concepts that are highly related to the beer domain, for example 'Pilsner', 'Ale', 'Brewery' and 'AlcoholPercentage'. While these terms are more specific, it
is likely that the rate at which they occur on a page relevant to the beer domain will be higher than that of the general terms of the KnowledgeWeb ontology. The relevance score will therefore be higher, resulting in a higher harvest rate. Another, perhaps more likely explanation is similar to the previous one, but takes into account the seed pages of both crawls. The KnowledgeWeb ontology crawl starts from the general http://www.semanticweb.org page. The links found on this page point again to other general pages; therefore, many of the crawled pages receive a relevance score that is too low to be considered relevant to the KnowledgeWeb ontology. The Beer ontology crawl starts from http://www.allaboutbeer.com. The outgoing links from this page will most likely point to pages that are again closely related to the beer domain, which, together with the specific Beer ontology, results in highly relevant pages and hence a higher harvest rate.
head-term | role | co-role | tail-term
workpackage workload | is workload of |  | Organization
workpackage workload | is workload on workpackage |  | Workpackage
Workpackage | Workpackage number |  | Literal
Workpackage | Workpackage mailing list |  | Literal
Workpackage | has person participant |  | Person
Workpackage | Workpackage expected results |  | Resource
Workpackage | Workpackage description of work |  | Resource
Workpackage | has contractor leader |  | Organization
Workpackage | Workpackage objectives |  | Resource
Workpackage | has |  | Milestone
Workpackage | has participant with workload |  | workpackage workload
Workpackage | is made up of |  | Task
Workpackage | Workpackage title |  | Literal
Activity | Activity tasks |  | Resource
Activity | Activity number |  | Literal
Activity | Activity deliverables |  | Resource
Activity | Start date |  | Literal
Activity | Activity timeline |  | Resource
Activity | Activity name |  | Resource
Activity | Activity objectives |  | Resource
Network of Excellence | Network summary |  | Literal
Network of Excellence | Network full title |  | Literal
Network of Excellence | Network objectives |  | Resource
Network of Excellence | has associated event |  | Event
Network of Excellence | Network start date |  | Literal
Network of Excellence | Network acronym |  | Literal
Network of Excellence | Network URL |  | Resource
Network of Excellence | is developed by |  | Organization
Network of Excellence | Network end date |  | Literal
Network of Excellence | Contract Number |  | Literal
Milestone | Month |  | Literal
Milestone | Milestone description |  | Literal
Milestone | Milestone number |  | Literal
Milestone | is associated with |  | Workpackage
Task | has participant leader |  | Organization
Task | Task number |  | Literal
Task | team is formed by |  | Person
Task | Task name |  | Literal
Task | belongs to |  | Workpackage
Task | Task description |  | Resource
Table 6.1: The KnowledgeWeb Project ontology (in RDFS) automatically converted to Lexons.
6.2.2 Harvest Rate on Ontology- and Keyword-Based Focused Crawling
In this scenario, we compare the harvest rate of two crawls: the first uses the Beer ontology mentioned above and the other uses a simple 'Beer' keyword instead of an ontology. The results of both crawls are shown in figure 6.2.
[Figure: harvest rate plotted against the number of crawled pages (25-1075) for the 'Beer' keyword topic and the Beer ontology topic, each with a polynomial trend line.]
Figure 6.2: The harvest rate of two crawls: one using the Beer ontology as topic description and the other using a simple 'Beer' keyword.
As illustrated in the graph, both crawlers have a similar harvest rate at the start of the crawl. This is no surprise since, initially, they are both crawling the same pages. After a few pages, we notice more dissimilarity: the ontology-based crawler finds relevant pages that the keyword-based crawler does not, and visiting these pages results in a shift in the harvest-rate graph. Further to the right of the graph, the ontology-based crawler finds almost all of its crawled pages relevant to the topic (the harvest rate gets close to 1). During the same period, the keyword-based crawler crawls a number of sites that are not relevant to the topic (low harvest rate), and it takes a while before it recovers and again finds relevant pages. We notice the same behaviour with the ontology-based crawler, only the ontology-based crawler recovers much faster.
[Figure: the number of SWDs found, plotted against the number of crawled pages (25-1025), for the "KnowledgeWeb" crawl and the "beer" crawl.]
Figure 6.3: The number of SWDs found during the two crawls.
6.3 Availability of SWDs
During the development of our crawler and in the preliminary tests, we were confronted with the lack of Semantic Web Documents on the current Web. This empirical evaluation shows how scarce SWDs actually are. To illustrate this, we have kept track of the number of SWDs found during both crawls; figure 6.3 shows the number of SWDs the crawler has found after a certain number of crawled pages. Especially the number of SWDs found during the "KnowledgeWeb" crawl was below our expectations: one would expect a crawl starting from http://www.semanticweb.org to return more than twenty-five SWDs after having crawled over a thousand pages. This is less than one SWD per forty pages. This ratio clearly shows that thoroughly testing an ontology-based crawler on the Semantic Web is currently close to impossible, unless a special test environment is created and supervised.
6.4 Crawler Performance
In this section we provide some statistics on the performance of the crawler. Of course, a thorough performance evaluation would require more, and more detailed, tests. The crawler is run with several crawler threads, as discussed in section 3.4; in the following tests, we have used five threads.
6.4.1 CPU time per Page
In the first scenario, we monitor the average CPU time it takes to process one page. Since we use several threads, we cannot simply measure the wall-clock time it takes for a page to be processed: while a crawler thread is processing a page, the CPU also gives the other threads time to process their pages, which would result in faulty timing values for every thread. We therefore use the management interface for the thread system of the Java virtual machine to measure the CPU time of every thread separately. In this test, the crawler was run on a Mac Mini system². For every twenty-five pages, the average CPU time is plotted on the graph in figure 6.4.
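The per-thread measurement relies on the standard java.lang.management API. A minimal sketch, with processPage standing in for one pass of the crawling loop:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class CpuTimer {
    // Measures the CPU time this thread spends in one crawling-loop pass;
    // wall-clock time would be distorted by the other crawler threads.
    static long timeOnePage(Runnable processPage) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long start = threads.getCurrentThreadCpuTime();   // nanoseconds
        processPage.run();
        return (threads.getCurrentThreadCpuTime() - start) / 1000000; // milliseconds
    }
}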
[Figure: the CPU time per page in milliseconds, plotted against the number of crawled pages (25-1025).]
Figure 6.4: The CPU time in milliseconds needed to process a single page.
As depicted in the graph, the time needed to process a page does not significantly increase during the crawl; the average CPU time stays below one hundred milliseconds.
² The specifications of the Mac Mini system are: a 1.42 GHz PowerPC G4; 512 MB DDR RAM; Mac OS X 10.4.6.
We also see two peaks in the graph. These can be explained by the fact that a certain page may contain a number of links pointing to a host that is off-line: the crawler threads try to fetch the pages on such hosts and receive a connection timeout, and during these timeouts a crawler thread cannot process other pages. One reason why we use several crawler threads is to get around this problem: while one thread is waiting for a connection timeout, the other threads can still process other pages, which increases the speed of the crawler considerably. Still, when a certain page contains many links (more than the number of threads) to a 'dead' host, all of the threads end up waiting on a connection timeout. This scenario, in which all threads are waiting for a connection timeout, can be observed twice on the graph.
6.4.2 Process Partitioning
As we have seen in section 3.4, each crawler thread executes the same crawling loop for every page it processes. Each such crawling loop can be partitioned into three phases: the crawling phase, the preprocessing phase and the relevance computation phase. We have timed the average execution time of each of these phases. In figure 6.5, we illustrate how the phases relate to each other in CPU execution time.
[Figure: pie chart "Process Time Partitioning", dividing the processing time over the crawler phase, the preprocessing phase and the relevance computation phase.]
Figure 6.5: The partitioning of the processing time of a single page into the several crawling-loop phases.
As depicted in the chart, the preprocessing phase requires the most time. This is logical: the crawling phase only fetches a new URL from the crawl frontier and downloads the page to memory, while in the preprocessing phase the links and semantic meta data are extracted and a word vector is built, which, relative to the other phases, takes a lot of time. In the last phase, the relevance computation phase, the TF/IDF score
is read from the word vector and the semantic data, if available, is matched with the topic ontology. These processes are much faster than parsing the page and building the word vector. It could be argued that the relevance computation is fast mainly because almost no semantic meta data is found; our test results indicate, however, that when semantic meta data is found, the time needed to compute the relevance of a page is not significantly higher.
Chapter 7
Conclusions
On the Semantic Web, data has structure and ontologies describe the semantics of the data. The goal of this thesis was to show that, using these semantic descriptions of the data, more efficient discovery and extraction of resources is possible. To this end, we introduced an ontology-based crawler for the Semantic Web. As we have discussed in this thesis, a lot of work has already been done on the discovery of resources on the Web using different sorts of crawlers. What distinguishes our approach is that we model the domain of interest with a topic ontology instead of a keyword description; furthermore, we exploit the semantic data and structure of the Semantic Web to discover and extract resources more efficiently. We have opted for a focused crawler, as we want to discover and extract information during the crawl. To find resources relevant to a given topic, we need to model the domain of interest of that topic. In general, this is done using a simple keyword description; in our approach, we use an ontology. This ontology model is created in the DOGMA Framework under development at STARLab: the domain expert or ontology engineer uses the T-Lex tool from the DOGMA Studio Workbench to select the set of lexons that describes the domain, and then adds constraints to this set of lexons so that an ontology commitment is created. This ontology commitment is used to guide the crawler to the relevant information on the Semantic Web; guiding the crawl is what distinguishes a focused crawler from a general-purpose crawler. As we use an ontology to model the domain of interest of a crawl, and we want to discover resources on the Semantic Web, we have developed an ontology matching algorithm, DMatch-lite, which computes the relevance of an SWD against the topic ontology. We have also sketched the context of the DMatch-lite algorithm with respect to ontology integration (the merging and aligning of ontologies). Our DMatch-lite algorithm can
be seen as a fast and automated pre-matcher that returns a list of candidate mappings. These candidate mappings can later be used to aid the ontology engineer in the integration of both ontologies. As we have described in chapter 4, the DMatch-lite algorithm uses several similarity iterations, each with its own similarity algorithms, to compute the final similarity score between two ontologies. We have incorporated algorithms that compute the linguistic relations between terms and that compare the contexts and the gloss descriptions of the concepts, to arrive at a semantically rich matching algorithm. We illustrated the functioning of the algorithm with two beer ontologies in section 4.3.5; the resulting mappings returned by the algorithm were consistent with the mappings an ontology engineer would choose, and mappings between similar concepts described by different terms were found by the algorithm. While a more in-depth evaluation is still necessary, the results of our preliminary tests of the algorithm were very promising. As described in chapter 6, evaluating the crawler is a challenging problem. As was to be expected, and as confirmed by our preliminary test results, the availability of RDF-based SWDs on the web is very low. Nevertheless, our preliminary experiments indicate that, even when almost no SWDs are found, our ontology-based topic description performs better than a simple keyword description.
Chapter 8
Future Work
As already discussed in the previous chapters, some components of the crawler should be further developed. The crawler uses an ontology commitment modeled in Dogma Studio; currently we use a set of lexons to represent this topic ontology, and this should be extended to use the Ω-RIDL format in which ontology commitments are stored. Furthermore, we could increase the interaction with Dogma Studio by fully embedding the system into the Dogma Studio Workbench. For example, when modeling the topic ontology in T-Lex, the ontology engineer could ask the tool to find all similar ontologies on the web; the SWOs (in RDFS, OWL, ...) discovered during the crawl could be automatically converted into lexons and visualized in Dogma Studio, where the ontology engineer could perform the final conversion steps. Also, the candidate mappings between the topic ontology and the discovered SWOs could be listed for the ontology engineer, so that these ontologies can be integrated. To improve the discovery and extraction of resources on the Semantic Web, we propose some extensions to our current crawler system. Our DMatch-lite algorithm uses a combination of existing similarity algorithms; each algorithm has a specific weight score and normalize function. In this thesis, the weight scores and normalize functions of those similarity algorithms were chosen based on the logical idea behind them and on our preliminary testing results. As these weights and normalize functions can have a relatively high influence on the final matching results, machine learning techniques should be used to compute the optimal weights and normalize functions. The use of machine learning for ontology matching has already been discussed in the literature, for example in [30, 55]; the GLUE [31] matching system also uses several of these learning strategies. In contrast to our proposal, these approaches use machine learning to classify instances with their corresponding concepts and use this information
to compute the similarity between concepts. Our approach would be to add a layer on top of the similarity iterations that analyzes the resulting similarity scores and uses this information to learn the optimal weight and normalize function for each similarity algorithm. The support for different semantic languages should also be extended. The lack of SWDs described in RDF could be tackled by also supporting N3-based SWDs. Currently, the crawler supports all languages based on RDF (e.g. RDFS, DAML, OWL), but only the semantic expressiveness of RDFS. The crawler could therefore be extended so that DAML and OWL relations are also understood; when these languages are supported, the ontology engineer can put finer constraints on the kind of information the crawler should find.
Bibliography
[1] Charu C. Aggarwal, Fatima Al-Garawi, and Philip S. Yu, Intelligent crawling on the world wide web with arbitrary predicates, WWW '01: Proceedings of the 10th international conference on World Wide Web (New York, NY, USA), ACM Press, 2001, pp. 96-105.
[2] Charu C. Aggarwal, Fatima Al-Garawi, and Philip S. Yu, Intelligent crawling on the world wide web with arbitrary predicates, WWW '01: Proceedings of the 10th international conference on World Wide Web (New York, NY, USA) (Vincent Y. Shen, Nobuo Saito, Michael R. Lyu, and Mary Ellen Zurko, eds.), ACM Press, 2001, pp. 96-105.
[3] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.), The description logic handbook: theory, implementation, and applications, Cambridge University Press, New York, NY, USA, 2003.
[4] J. Barwise and J. Seligman, Information flow: the logic of distributed systems, Cambridge Tracts in Theoretical Computer Science 44, Cambridge University Press, 1997.
[5] C. Batini, M. Lenzerini, and S. B. Navathe, A comparative analysis of methodologies for database schema integration, ACM Comput. Surv. 18 (1986), no. 4, 323-364.
[6] S. Bergamaschi, S. Castano, and M. Vincini, Semantic integration of semistructured and structured data sources, SIGMOD Record, 1999, pp. 54-59.
[7] Donna Bergmark, Carl Lagoze, and Alex Sbityakov, Focused crawls, tunneling, and digital libraries, Research and Advanced Technology for Digital Libraries: 6th European Conference, ECDL (Rome) (M. Agosti and C. Thanos, eds.), vol. 2458, 2002, pp. 91-106.
[8] Philip A. Bernstein and Erhard Rahm, Data warehouse scenarios for model management, Lecture Notes in Computer Science, vol. 1920, Springer Berlin / Heidelberg, 2000, pp. 1-15.
[9] Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian, The connectivity server: fast access to linkage information on the web, Comput. Netw. ISDN Syst. 30 (1998), no. 1-7, 469-477.
[10] G. Bisson, Learning in FOL with a similarity measure, Proceedings of the 10th National Conference on Artificial Intelligence (AAAI), 1992.
[11] J. De Bo, P. Spyns, and R. Meersman, Towards a methodology for semi-automatic ontology aligning and merging, Technical Report 02, STAR Lab, Brussel, 2004.
[12] Sergey Brin and Lawrence Page, The anatomy of a large-scale hypertextual web search engine, WWW7: Proceedings of the seventh international conference on World Wide Web (Amsterdam, The Netherlands), Elsevier Science Publishers B. V., 1998, pp. 107-117.
[13] Budi Yuwono, Savio L. Lam, Jerry H. Ying, and Dik L. Lee, A world wide web resource discovery system, The Fourth International WWW Conference (Boston, USA), December 11-14, 1995.
[14] S. Castano, A. Ferrara, and S. Montanelli, H-Match: an algorithm for dynamically matching ontologies in peer-based systems, Proc. of the 1st Int. Workshop on Semantic Web and Databases (SWDB) at VLDB 2003 (Berlin, Germany), September 2003.
[15] S. Chakrabarti, Recent results in automatic web resource discovery, ACM Comput. Surv. 31 (1999), no. 4es, 17.
[16] Soumen Chakrabarti, Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, and Jon Kleinberg, Mining the web's link structure, Computer 32 (1999), no. 8, 60-67.
[17] Soumen Chakrabarti, Martin van den Berg, and Byron Dom, Focused crawling: a new approach to topic-specific web resource discovery, Comput. Networks 31 (1999), no. 11-16, 1623-1640.
[18] Junghoo Cho, Hector García-Molina, and Lawrence Page, Efficient crawling through URL ordering, WWW7: Proceedings of the seventh international conference on World Wide Web (Amsterdam, The Netherlands), Elsevier Science Publishers B. V., 1998, pp. 161-172.
[19] W. Cohen, P. Ravikumar, and S. Fienberg, A comparison of string distance metrics for name-matching tasks, Proceedings of the IIWeb Workshop at the IJCAI 2003 conference, 2003.
[20] D. Oberle and P. Spyns, The knowledge portal, International Handbook on Information Systems, ch. 25, Springer Verlag, 2004.
[21] John Davies and Richard Weeks, QuizRDF: search technology for the semantic web, HICSS '04: Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS'04) - Track 4 (Washington, DC, USA), IEEE Computer Society, 2004, p. 40112.
[22] Pieter De Leenheer and Aldo de Moor, Context-driven disambiguation in ontology elicitation, Context and Ontologies: Theory, Practice and Applications (Pittsburgh, USA) (P. Shvaiko and J. Euzenat, eds.), AAAI Technical Report, vol. WS-05-01, AAAI Press, July 2005, pp. 17–24.
[23] Pieter De Leenheer, Aldo de Moor, and Robert Meersman, Context dependency management in interorganizational ontology engineering, Technical Report STAR-2006-02, STARLab, Brussel, 2006.
[24] Pieter De Leenheer and Robert Meersman, Towards a formal foundation of DOGMA ontology, part I: lexon base and concept definition server, Technical Report STAR-2005-06, STARLab, 2005.
[25] Aldo de Moor, Ontology-guided meaning negotiation in communities of practice, Proceedings of the Workshop on the Design for Large-Scale Digital Communities at the 2nd International Conference on Communities and Technologies (C&T 2005) (Milano) (P. Mambrey and W. Graether, eds.), 2005, pp. 21–28.
[26] Jeffrey Dean and Monika R. Henzinger, Finding related pages in the World Wide Web, Computer Networks 31 (1999), no. 11-16, 1467–1479.
[27] R. Dieng and S. Hug, Comparison of personal ontologies represented through conceptual graphs, Proceedings of ECAI, 1998.
[28] Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori, Focused crawling using context graphs, VLDB '00: Proceedings of the 26th International Conference on Very Large Data Bases (San Francisco, CA, USA), Morgan Kaufmann Publishers Inc., 2000, pp. 527–534.
[29] Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal Doshi, and Joel Sachs, Swoogle: a search and metadata engine for the semantic web, CIKM '04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management (New York, NY, USA), ACM Press, 2004, pp. 652–659.
[30] AnHai Doan, Pedro Domingos, and Alon Y. Halevy, Reconciling schemas of disparate data sources: a machine-learning approach, SIGMOD Rec. 30 (2001), no. 2, 509–520.
[31] AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy, Learning to map between ontologies on the semantic web, WWW '02: Proceedings of the 11th International Conference on World Wide Web (New York, NY, USA), ACM Press, 2002, pp. 662–673.
[32] Marc Ehrig and Jérôme Euzenat, Relaxed precision and recall for ontology matching, Proc. K-CAP 2005 Workshop on Integrating Ontologies (Banff, CA) (Ben Ashpole, Marc Ehrig, Jérôme Euzenat, and Heiner Stuckenschmidt, eds.), 2005, pp. 25–32.
[33] Marc Ehrig and Steffen Staab, QOM - quick ontology mapping, The Semantic Web - ISWC 2004: Third International Semantic Web Conference (Sheila A. McIlraith, Dimitris Plexousakis, and Frank van Harmelen, eds.), vol. 3298, 2004.
[34] J. Euzenat, T. Le Bach, J. Barrasa, P. Bouquet, J. De Bo, R. Dieng, M. Ehrig, M. Hauswirth, M. Jarrar, R. Lara, D. Maynard, A. Napoli, G. Stamou, H. Stuckenschmidt, P. Shvaiko, S. Tessaris, S. Van Acker, and I. Zaihrayeu, State of the art on ontology alignment, Knowledge Web Deliverable #D2.2.3, 2004.
[35] F. Menczer, G. Pant, P. Srinivasan, and M. Ruiz, Evaluating topic-driven web crawlers, Proceedings of the 24th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (New Orleans, USA) (Donald H. Kraft, W. Bruce Croft, David J. Harper, and Justin Zobel, eds.), 2001.
[36] Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost, and Clay Fink, Information retrieval and the semantic web, HICSS '05: Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05) - Track 4 (Washington, DC, USA), IEEE Computer Society, 2005, p. 113.1.
[37] G. Pant, P. Srinivasan, and F. Menczer, Crawling the web, Web Dynamics: Adapting to Change in Content, Size, Topology and Use (M. Levene and A. Poulovassilis, eds.), Springer-Verlag, 2004.
[38] B. Ganter and R. Wille, Formal concept analysis: mathematical foundations, Springer, 1999.
[39] Thomas R. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition 5 (1993), no. 2, 199–220.
[40] N. Guarino and P. Giaretta, Ontologies and knowledge bases: towards a terminological clarification, Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing (Amsterdam) (N. Mars, ed.), IOS Press, 1995, pp. 25–32.
[41] Michael Hersovici, Michal Jacovi, Yoelle S. Maarek, Dan Pelleg, Menachem Shtalhaim, and Sigalit Ur, The shark-search algorithm. An application: tailored web site mapping, Comput. Netw. ISDN Syst. 30 (1998), no. 1-7, 317–326.
[42] H. H. Do and E. Rahm, COMA - a system for flexible combination of schema matching approaches, Proceedings of the 27th International Conference on Very Large Data Bases (Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass, eds.), 2001, pp. 610–621.
[43] I. Horrocks, DAML+OIL: a description logic for the semantic web, IEEE Data Engineering Bulletin 25 (2002), no. 1, 4–9.
[44] J. De Bo, A survey on existing approaches to ontology integration, FF POIROT Deliverable #D5.1.1, STAR Lab, Brussel, 2003.
[45] J. De Bo, P. Spyns, and R. Meersman, Creating a "dogmatic" multilingual ontology infrastructure to support a semantic portal, On the Move to Meaningful Internet Systems 2003: OTM 2003 Workshops (R. Meersman, Z. Tari, et al., eds.), LNCS, vol. 2889, Springer Verlag, 2003, pp. 253–266.
[46] J. Euzenat, Towards a principled approach to semantic interoperability, Proceedings of the IJCAI 2001 Workshop on Ontology and Information Sharing (A. Gómez-Pérez, M. Gruninger, and H. Stuckenschmidt, eds.), 2001, pp. 19–25.
[47] J. De Bo, R. Verlinden, and G. Zhao, Ontology alignment and merging components, FF POIROT Deliverable #D5.1.3, STAR Lab, 2004.
[48] Mustafa Jarrar and Robert Meersman, Formal ontology engineering in the DOGMA approach, Springer Verlag, LNCS, pp. 1238–1254, 2002.
[49] Yannis Kalfoglou, Bo Hu, Dave Reynolds, and Nigel Shadbolt, Capturing, representing and operationalising semantic integration, 6th month deliverable: Semantic integration technologies survey, The University of Southampton, Hewlett Packard Laboratories, Bristol, 2005.
[50] Jaewoo Kang and Jeffrey F. Naughton, On schema matching with opaque column names and data values, SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (New York, NY, USA) (Zachary Ives, Yannis Papakonstantinou, and Alon Halevy, eds.), ACM Press, 2003, pp. 205–216.
[51] Jon M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM 46 (1999), no. 5, 604–632.
[52] M. R. Koivunen and E. Miller, W3C semantic web activity, Proceedings of the Semantic Web Kick-off Seminar (Finland) (E. Hyvönen, ed.), 2001, pp. 27–44.
[53] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal, Stochastic models for the web graph, FOCS '00: Proceedings of the 41st Annual Symposium on Foundations of Computer Science (Washington, DC, USA), IEEE Computer Society, 2000, p. 57.
[54] Martin S. Lacher and Georg Groh, Facilitating the exchange of explicit knowledge through ontology mappings, Proceedings of the Fourteenth International Florida Artificial Intelligence Research Society (FLAIRS) Conference (Ingrid Russell and John F. Kolen, eds.), AAAI Press, 2001, pp. 305–309.
[55] Wen-Syan Li and Chris Clifton, Semantic integration in heterogeneous databases using neural networks, VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases (San Francisco, CA, USA) (Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, eds.), Morgan Kaufmann Publishers Inc., 1994, pp. 1–12.
[56] J. Madhavan, P. Bernstein, and E. Rahm, Generic schema matching with Cupid, Proceedings of the 27th International Conference on Very Large Data Bases (Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass, eds.), 2001, pp. 49–58.
[57] Jayant Madhavan, Philip A. Bernstein, AnHai Doan, and Alon Halevy, Corpus-based schema matching, ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE'05) (Washington, DC, USA), IEEE Computer Society, 2005, pp. 57–68.
[58] A. Maedche, M. Ehrig, S. Handschuh, L. Stojanovic, and R. Volz, Ontology-focused crawling on documents and relational metadata, Proceedings of the Eleventh International World Wide Web Conference WWW-2002 (Hawaii), 2002.
[59] Alexander Maedche and Steffen Staab, Measuring similarity between ontologies, EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web (London, UK), Springer-Verlag, 2002, pp. 251–263.
[60] Diana Maynard and Sophia Ananiadou, Identifying contextual information for multi-word term extraction, 1999.
[61] S. Melnik, H. Garcia-Molina, and E. Rahm, Similarity flooding: a versatile graph matching algorithm, Proceedings of ICDE, 2002, pp. 117–128.
[62] Filippo Menczer, Complementing search engines with online web mining agents, Decis. Support Syst. 35 (2003), no. 2, 195–212.
[63] Filippo Menczer, Gautam Pant, Padmini Srinivasan, and Miguel E. Ruiz, Evaluating topic-driven web crawlers, Research and Development in Information Retrieval, 2001, pp. 241–249.
[64] George A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (1995), no. 11, 39–41.
[65] G. Nijssen and T. A. Halpin, Conceptual schema and relational database design: a fact oriented approach, Prentice Hall of Australia, 1989.
[66] N. Noy and M. A. Musen, Anchor-PROMPT: using non-local context for semantic matching, Proceedings of the IJCAI Workshop on Ontologies and Information Sharing (A. Gómez-Pérez, M. Gruninger, H. Stuckenschmidt, and M. Uschold, eds.), 2001, pp. 63–70.
[67] Natalya F. Noy and Michel Klein, Ontology evolution: not the same as schema evolution, Knowledge and Information Systems 6 (2004), no. 4, 428–440.
[68] J. De Bo, P. Spyns, and R. Meersman, Assisting ontology integration with existing thesauri, On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE (part I) (R. Meersman, Z. Tari, et al., eds.), LNCS, vol. 3290, Springer Verlag, 2004, pp. 801–818.
[69] P. Spyns, D. Oberle, R. Volz, J. Zheng, M. Jarrar, Y. Sure, R. Studer, and R. Meersman, OntoWeb - a semantic web community portal, Proceedings of the Fourth International Conference on Practical Aspects of Knowledge Management (PAKM02) (D. Karagiannis and U. Reimer, eds.), LNAI, vol. 2569, Springer Verlag, 2002, pp. 189–200.
[70] P. Spyns and R. Meersman, A survey on ontology alignment and merging, OntoBasis Deliverable #D1.5, STAR Lab, Brussel, 2003.
[71] P. Verheyden, J. De Bo, and R. Meersman, Semantically unlocking database content through ontology-based mediation, Proceedings of the 2nd Workshop on Semantic Web and Databases (SWDB 2004) (C. Bussler, V. Tannen, and I. Fundulaki, eds.), LNCS, vol. 3372, Springer Verlag, 2005, pp. 109–126.
[72] P. De Bra, G. Houben, Y. Kornatzky, and R. Post, Information retrieval in distributed hypertexts, Proceedings of the 4th RIAO Conference (New York), 1994, pp. 481–491.
[73] Gautam Pant and Filippo Menczer, MySpiders: evolve your own intelligent web crawlers, Autonomous Agents and Multi-Agent Systems 5 (2002), no. 2, 221–229.
[74] Christine Parent and Stefano Spaccapietra, Issues and approaches of database integration, Commun. ACM 41 (1998), no. 5es, 166–178.
[75] M. F. Porter, An algorithm for suffix stripping, 1997, pp. 313–316.
[76] A. Johannes Pretorius, Lexon visualisation: visualising binary fact types in ontology bases, unpublished MSc thesis, Vrije Universiteit Brussel, Brussels, 2004.
[77] R. Meersman and M. Jarrar, An architecture and toolset for practical ontology engineering and deployment: the DOGMA approach, Technical Report 06, STAR Lab, Brussels, 2002.
[78] Vijay V. Raghavan and S. K. M. Wong, A critical analysis of vector space model for information retrieval, Journal of the American Society for Information Science 37 (1986), no. 5, 279–287.
[79] Erhard Rahm and Philip A. Bernstein, A survey of approaches to automatic schema matching, The VLDB Journal 10 (2001), no. 4, 334–350.
[80] Cristiano Rocha, Daniel Schwabe, and Marcus Poggi Aragao, A hybrid approach for searching in the semantic web, WWW '04: Proceedings of the 13th International Conference on World Wide Web (New York, NY, USA), ACM Press, 2004, pp. 374–383.
[81] G. Salton, Automatic text processing, Addison Wesley, Massachusetts, 1989.
[82] Yannis Kalfoglou, Bo Hu, Dave Reynolds, and Nigel Shadbolt, Semantic integration technologies survey, 2005.
[83] A. P. Sheth, Changing focus on interoperability in information systems: from system, syntax, structure to semantics, Interoperating Geographic Information Systems (M. F. Goodchild, M. J. Egenhofer, R. Fegeas, and C. A. Kottman, eds.), Kluwer Academic Publishers, 1999, pp. 5–30.
[84] P. Shvaiko, A classification of schema-based matching approaches, Proceedings of the ISWC'04 Workshop on Meaning, Negotiation and Coordination (MCN'04) (Hiroshima, Japan), November 2004.
[85] P. Shvaiko and J. Euzenat, A survey of schema-based matching approaches, Journal on Data Semantics IV (2005).
[86] M. K. Smith, C. Welty, and D. L. McGuinness, OWL Web Ontology Language guide, Technical report, http://www.w3.org/TR/2004/REC-owl-guide-20040210/, World Wide Web Consortium (W3C), February 10, 2004.
[87] Peter Spyns, Robert Meersman, and Mustafa Jarrar, Data modelling versus ontology engineering, SIGMOD Record Special Issue on Semantic Web, Database Management and Information Systems 31 (2002), no. 4, 12–17.
[88] P. Srinivasan, F. Menczer, and G. Pant, A general evaluation framework for topical crawlers, Inf. Retr. 8 (2005), no. 3, 417–447.
[89] STARLab, Introducing ontologies, http://starlab.vub.ac.be/teaching/introducing_ontologies.pdf.
[90] T. Berners-Lee, J. Hendler, and O. Lassila, The semantic web, Scientific American (2001), 28–37.
[91] M. Uschold and M. Gruninger, Ontologies: principles, methods and applications, The Knowledge Engineering Review, 1996.
[92] David Vallet, Miriam Fernández, and Pablo Castells, An ontology-based information retrieval model, The Semantic Web: Research and Applications: Second European Semantic Web Conference, ESWC 2005 (Crete, Greece) (Asunción Gómez-Pérez and Jérôme Euzenat, eds.), vol. 3532, 2005, p. 455.
[93] Jan Vereecken and Damien Trog, Context-driven visualization for ontology engineering, Master's thesis, Vrije Universiteit Brussel, 2006.
[94] G. Verheijen and J. Van Bekkum, NIAM: an information analysis method, Information Systems Design Methodologies: A Comparative Review (Amsterdam, North-Holland) (T. W. Olle, H. G. Sol, and A. A. Verrijn-Stuart, eds.), 1982.
Appendix A
Crawler Transcripts

This appendix presents debug transcripts of the crawler. The first transcript shows the DMatch-lite matching process on the two beer ontologies introduced earlier; the second shows the crawler performing a general crawl. For readability, we omit the [LEVEL|Thread] prefix that the logging framework prints in front of every line of the first transcript; unless indicated otherwise, all of its messages after startup originate from a single crawler thread (Thread 1).
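As an aid in reading the transcript, the matching process proceeds in three similarity iterations: string-based measures (Levenshtein and Soundex), WordNet relations, and context similarity over the relevance sets. The following sketch illustrates only this control flow; the class and method names are hypothetical, and the simple averaging at the end is an assumption rather than the exact combination rule of the implementation.

import java.util.List;
import java.util.function.BiFunction;

// Illustrative sketch of the DMatch-lite iteration pipeline traced below.
// All names are hypothetical and mirror the log output, not the real classes;
// the plain average used to combine the iterations is an assumption.
public class DLMatchSketch {

    static double stringSimilarity(String a, String b)  { return a.equalsIgnoreCase(b) ? 1.0 : 0.0; } // Levenshtein + Soundex stand-in
    static double wordnetSimilarity(String a, String b) { return 0.0; } // hypernym/hyponym path stand-in
    static double contextSimilarity(String a, String b) { return 0.0; } // relevance-set comparison stand-in

    // Runs each similarity iteration in turn ("DLMatch: New iteration:")
    // and combines the evidence into one score for the pair of terms.
    public static double match(String topicTerm, String swTerm) {
        List<BiFunction<String, String, Double>> iterations = List.of(
                DLMatchSketch::stringSimilarity,
                DLMatchSketch::wordnetSimilarity,
                DLMatchSketch::contextSimilarity);
        double score = 0.0;
        for (BiFunction<String, String, Double> iteration : iterations) {
            score += iteration.apply(topicTerm, swTerm);
        }
        return score / iterations.size();
    }
}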
A.1 DMatch-lite Matching the Two Beer Ontologies
Crawler: Connecting to DOGMA Server
Crawler: Loaded Lexons:
Crawler: Started new Thread: 1
SimilarityIteration: Initalize -> 3
SimilarityIteration: Initalize -> 1
WordnetRelations: Initializing WordnetRelations: vub.starlab.focusedcrawler.matching.Mapping.DogmaM
WordnetRelations: >> >> >> Fermentation Method
WordnetRelations: >> >> >> Location
WordnetRelations: >> >> >> Award
WordnetRelations: >> >> >> Company
WordnetRelations: >> >> >> AlcoholPercentage
WordnetRelations: >> >> >> Ale
WordnetRelations: >> >> >> component
WordnetRelations: >> >> >> Brewery
WordnetRelations: >> >> >> Beer
> [Synset: [Offset: 7411192] [POS: noun] Lemma: beer
> [Synset: [Offset: 7411424] [POS: noun] Lemma: draft_beer, draught_beer
> [Synset: [Offset: 7411517] [POS: noun] Lemma: suds
> [Synset: [Offset: 7411959] [POS: noun] Lemma: lager, lager_beer
> [Synset: [Offset: 7411629] [POS: noun] Lemma: Munich_beer, Munchener
> [Synset: [Offset: 7411786] [POS: noun] Lemma: bock, bock_beer
> [Synset: [Offset: 7412292] [POS: noun] Lemma: light_beer
> [Synset: [Offset: 7412383] [POS: noun] Lemma: Oktoberfest, Octoberfest
> [Synset: [Offset: 7412554] [POS: noun] Lemma: Pilsner, Pilsener
> [Synset: [Offset: 7413564] [POS: noun] Lemma: malt, malt_liquor
> [Synset: [Offset: 7413782] [POS: noun] Lemma: ale
> [Synset: [Offset: 7412790] [POS: noun] Lemma: Weissbier, white_beer, wheat_beer
> [Synset: [Offset: 7413034] [POS: noun] Lemma: Weizenbier
> [Synset: [Offset: 7413141] [POS: noun] Lemma: Weizenbock
> [Synset: [Offset: 7414086] [POS: noun] Lemma: bitter
> [Synset: [Offset: 7414244] [POS: noun] Lemma: Burton
> [Synset: [Offset: 7414322] [POS: noun] Lemma: pale_ale
> [Synset: [Offset: 7414480] [POS: noun] Lemma: porter, porter's_beer
> [Synset: [Offset: 7414606] [POS: noun] Lemma: stout
> [Synset: [Offset: 7414794] [POS: noun] Lemma: Guinness
WordnetRelations: >> >> >> Pilsner
SimilarityIteration: Initalize -> 1
DOGMARelevanceSets: [ DOGMA CONTEXT OF Fermentation Method ]
DOGMARelevanceSets: > Fermentation Method
DOGMARelevanceSets: > Beer - is-of
DOGMARelevanceSets: [ DOGMA CONTEXT OF Location ]
DOGMARelevanceSets: > Location
DOGMARelevanceSets: > Brewery
DOGMARelevanceSets: [ DOGMA CONTEXT OF Award ]
DOGMARelevanceSets: > Beer - isGivenTo
DOGMARelevanceSets: > Award
DOGMARelevanceSets: [ DOGMA CONTEXT OF Company ]
DOGMARelevanceSets: > Brewery - subsumes
DOGMARelevanceSets: > Company
DOGMARelevanceSets: [ DOGMA CONTEXT OF AlcoholPercentage ]
DOGMARelevanceSets: > AlcoholPercentage
DOGMARelevanceSets: > Beer - is-of
DOGMARelevanceSets: [ DOGMA CONTEXT OF Ale ]
DOGMARelevanceSets: > Ale
DOGMARelevanceSets: > Beer - is-a
DOGMARelevanceSets: [ DOGMA CONTEXT OF component ]
DOGMARelevanceSets: > Beer - isPartOf
DOGMARelevanceSets: > component
DOGMARelevanceSets: [ DOGMA CONTEXT OF Brewery ]
DOGMARelevanceSets: > Location - locatedIn
DOGMARelevanceSets: > Brewery
DOGMARelevanceSets: > Beer - brews
DOGMARelevanceSets: > Company - is-a
DOGMARelevanceSets: [ DOGMA CONTEXT OF Beer ]
DOGMARelevanceSets: > Award - hasReceived
DOGMARelevanceSets: > Pilsner - subsumes
DOGMARelevanceSets: > Beer
DOGMARelevanceSets: > Brewery - brewedBy
DOGMARelevanceSets: > AlcoholPercentage - has
DOGMARelevanceSets: > component - madeFrom
DOGMARelevanceSets: > Ale - subsumes
DOGMARelevanceSets: > Fermentation Method - has
DOGMARelevanceSets: [ DOGMA CONTEXT OF Pilsner ]
DOGMARelevanceSets: > Beer - is-a
DOGMARelevanceSets: > Pilsner
ContextSimilarity: Done making Dogma contexts
CrawlerThread: Crawling new link: http://wilma.vub.ac.be/~fvdmaele/felix_beer.rdfs
CrawlerThread: .. Reading page
UrlReader: Downloading http://wilma.vub.ac.be/~fvdmaele/felix_beer.rdfs
CrawlerThread: .. Preprocessing page
DogmaMappingTable: Adding New SWT's: 88
DogmaMappingTable: New SWT: Brewery - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: Ale - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: Pilsner - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: Organization - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: BottomFermentedBeer - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: Award - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: Region - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: AlcoholPercentage - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: TopFermentedBeer - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: Ingredient - http://starlab.vub.ac.be/Beer#
DogmaMappingTable: New SWT: Beer - http://starlab.vub.ac.be/Beer#
WordVector: Initializing the WordVector
WordVector: Processing the document
WordVector: WORDVECTOR: 0 - NaN - NaN - NaN
CrawlerThread: v1:0.0
CrawlerThread: v1:0.0
CrawlerThread: .. computing relevance of page
DLMatch: New iteration:
SimilarityIteration: Comuting Similarity !
LevenshteinSimilarity: -> 0.14285713
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.15789473
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.14285713
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.14285713
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.28571427
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.57142854
SoundexSimilarity: -> 0.8444445
LevenshteinSimilarity: -> 0.111111104
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 1.0
SoundexSimilarity: -> 1.0
LevenshteinSimilarity: -> 0.28571427
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.052631557
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 1.0
SoundexSimilarity: -> 1.0
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.19999999
SoundexSimilarity: -> 0.8222223
LevenshteinSimilarity: -> 0.17647058
SoundexSimilarity: -> 0.7666667
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.25
SoundexSimilarity: -> 0.77777785
LevenshteinSimilarity: -> 0.111111104
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.14285713
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 1.0
SoundexSimilarity: -> 1.0
LevenshteinSimilarity: -> 0.10526317
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.28571427
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.28571427
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.22222221
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.14285713
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.08333331
SoundexSimilarity: -> 0.6944445
LevenshteinSimilarity: -> 0.31578946
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.08333331
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.08333331
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.058823526
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.4166667
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.08333331
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.08333331
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.15789473
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.10526317
SoundexSimilarity: -> 0.6944445
LevenshteinSimilarity: -> 0.052631557
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.15789473
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.10526317
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.31578946
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.10526317
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.21052629
SoundexSimilarity: -> 0.73333335
LevenshteinSimilarity: -> 0.2631579
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.21052629
SoundexSimilarity: -> 0.73333335
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.10526317
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.19999999
SoundexSimilarity: -> 0.8222223
LevenshteinSimilarity: -> 0.14285713
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 1.0
SoundexSimilarity: -> 1.0
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.64444447
LevenshteinSimilarity: -> 0.125
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.19999999
SoundexSimilarity: -> 0.77777785
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.28571427
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.21052629
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.14285713
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.375
SoundexSimilarity: -> 0.77777785
LevenshteinSimilarity: -> 0.16666669
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.22222221
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.14285713
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.052631557
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.17647058
SoundexSimilarity: -> 0.7666667
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.64444447
LevenshteinSimilarity: -> 1.0
SoundexSimilarity: -> 1.0
LevenshteinSimilarity: -> 0.17647058
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.29411763
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.1875
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.2631579
SoundexSimilarity: -> 0.77777785
LevenshteinSimilarity: -> 0.0625
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.125
SoundexSimilarity: -> 0.6944445
LevenshteinSimilarity: -> 0.125
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.2352941
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.125
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.25
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.25
SoundexSimilarity: -> 0.6944445
LevenshteinSimilarity: -> 0.1875
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.100000024
SoundexSimilarity: -> 0.6944445
LevenshteinSimilarity: -> 0.21052629
SoundexSimilarity: -> 0.6944445
LevenshteinSimilarity: -> 0.100000024
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.100000024
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.19999999
SoundexSimilarity: -> 0.6666667
LevenshteinSimilarity: -> 0.2352941
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.19999999
SoundexSimilarity: -> 0.6944445
LevenshteinSimilarity: -> 0.19999999
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.3
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.3
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.28571427
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.10526317
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 0.25
SoundexSimilarity: -> 0.77777785
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.19999999
SoundexSimilarity: -> 0.77777785
LevenshteinSimilarity: -> 0.11764705
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.0
SoundexSimilarity: -> 0.5555556
LevenshteinSimilarity: -> 1.0
SoundexSimilarity: -> 1.0
LevenshteinSimilarity: -> 0.111111104
SoundexSimilarity: -> 0.44444445
LevenshteinSimilarity: -> 0.57142854
SoundexSimilarity: -> 0.8444445
SimilarityIteration: DONE !!
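The fractional scores in this iteration are consistent with the usual normalisation of the Levenshtein edit distance by the length of the longer string, sim(s1, s2) = 1 - d(s1, s2) / max(|s1|, |s2|). The sketch below computes such a score; it is an illustration under that assumption, not the crawler's actual code.

// Sketch of a normalized Levenshtein similarity, assuming the common
// normalization 1 - distance / max(length); not necessarily the exact code.
public class LevenshteinSimilaritySketch {

    // Classic dynamic-programming edit distance between two strings.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Normalizes the distance into a similarity score in [0, 1].
    static float similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1f : 1f - (float) distance(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Pilsner", "Pilsner")); // identical strings -> 1.0
        System.out.println(similarity("Beer", "Brewery"));    // partial overlap -> lower score
    }
}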
DLMatch: New iteration:
SimilarityIteration: Comuting Similarity !
WordnetRelations: Compare: Pilsner - Brewery : -1
WordnetRelations: Compare: Fermentation Method - Brewery : -1
WordnetRelations: Compare: Ale - Brewery : -1
WordnetRelations: Compare: Company - Brewery : -1
WordnetRelations: Compare: Award - Brewery : -1
WordnetRelations: Compare: Location - Brewery : -1
WordnetRelations: Compare: Beer - Brewery : -1
WordnetRelations: Compare: component - Brewery : -1
WordnetRelations: Compare: Brewery - Brewery : 1
WordnetRelations: Compare: Pilsner - Ale : -1
WordnetRelations: Compare: Fermentation Method - Ale : -1
WordnetRelations: Compare: Ale - Ale : 1
WordnetRelations: Compare: Company - Ale : -1
WordnetRelations: Compare: Award - Ale : -1
WordnetRelations: Compare: Location - Ale : -1
WordnetRelations: Compare: Beer - Ale : -1
WordnetRelations: >> WORDNET SIM REL: Beer - Ale ==> HYPONYM 1 1.0
WordnetRelations: Compare: component - Ale : -1
WordnetRelations: Compare: Brewery - Ale : -1
WordnetRelations: Compare: Pilsner - Pilsner : -1
WordnetRelations: >> WORDNET SIM REL: Pilsner - Pilsner ==> HYPERNYM 1 1.0
WordnetRelations: Compare: Fermentation Method - Pilsner : -1
WordnetRelations: Compare: Ale - Pilsner : -1
WordnetRelations: Compare: Company - Pilsner : -1
WordnetRelations: Compare: Award - Pilsner : -1
WordnetRelations: Compare: Location - Pilsner : -1
WordnetRelations: Compare: Beer - Pilsner : -1
WordnetRelations: >> WORDNET SIM REL: Beer - Pilsner ==> HYPONYM 2 0.5
WordnetRelations: Compare: component - Pilsner : -1
WordnetRelations: Compare: Brewery - Pilsner : -1
WordnetRelations: Compare: Pilsner - Organization : -1
WordnetRelations: Compare: Fermentation Method - Organization : -1
WordnetRelations: Compare: Ale - Organization : -1
WordnetRelations: Compare: Company - Organization : -1
WordnetRelations: >> WORDNET SIM REL: Company - Organization ==> HYPERNYM 4 0.25
WordnetRelations: Compare: Award - Organization : -1
WordnetRelations: Compare: Location - Organization : -1
WordnetRelations: Compare: Beer - Organization : -1
WordnetRelations: Compare: component - Organization : -1
WordnetRelations: Compare: Brewery - Organization : -1
WordnetRelations: Compare: Pilsner - Award : -1
WordnetRelations: Compare: Fermentation Method - Award : -1
WordnetRelations: Compare: Ale - Award : -1
WordnetRelations: Compare: Company - Award : -1
WordnetRelations: Compare: Award - Award : 1
WordnetRelations: Compare: Location - Award : -1
WordnetRelations: Compare: Beer - Award : -1
WordnetRelations: Compare: component - Award : -1
WordnetRelations: Compare: Brewery - Award : -1
WordnetRelations: Compare: Pilsner - Region : -1
WordnetRelations: Compare: Fermentation Method - Region : -1
WordnetRelations: Compare: Ale - Region : -1
WordnetRelations: Compare: Company - Region : -1
WordnetRelations: Compare: Award - Region : -1
WordnetRelations: Compare: Location - Region : -1
WordnetRelations: >> WORDNET SIM REL: Location - Region ==> HYPONYM 1 1.0
WordnetRelations: >> WORDNET SIM REL: Location - Region ==> HYPONYM 1 1.0
WordnetRelations: Compare: Beer - Region : -1
WordnetRelations: Compare: component - Region : -1
WordnetRelations: Compare: Brewery - Region : -1
WordnetRelations: Compare: Pilsner - Ingredient : -1
WordnetRelations: Compare: Fermentation Method - Ingredient : -1
WordnetRelations: Compare: Ale - Ingredient : -1
WordnetRelations: Compare: Company - Ingredient : -1
WordnetRelations: Compare: Award - Ingredient : -1
WordnetRelations: Compare: Location - Ingredient : -1
WordnetRelations: Compare: Beer - Ingredient : -1
WordnetRelations: Compare: component - Ingredient : 1
WordnetRelations: Compare: Brewery - Ingredient : -1
WordnetRelations: Compare: Pilsner - Beer : -1
WordnetRelations: >> WORDNET SIM REL: Pilsner - Beer ==> HYPERNYM 2 0.5
WordnetRelations: Compare: Fermentation Method - Beer : -1
WordnetRelations: Compare: Ale - Beer : -1
WordnetRelations: >> WORDNET SIM REL: Ale - Beer ==> HYPERNYM 1 1.0
WordnetRelations: Compare: Company - Beer : -1
WordnetRelations: Compare: Award - Beer : -1
WordnetRelations: Compare: Location - Beer : -1
WordnetRelations: Compare: Beer - Beer : 1
WordnetRelations: Compare: component - Beer : -1
WordnetRelations: Compare: Brewery - Beer : -1
SimilarityIteration: DONE !!
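In the SIM REL lines above, the integer after the HYPERNYM/HYPONYM label reads as the length of the WordNet path between the two terms, and the final number as its inverse, which serves as the similarity score (a path of 1 gives 1.0, of 2 gives 0.5, of 4 gives 0.25). Below is a minimal sketch of that scoring rule; the path-length lookup is a hypothetical stand-in for a real WordNet query, not the crawler's own lookup code.

// Sketch: WordNet-based score as the inverse of the hypernym/hyponym path
// length, matching the "==> HYPONYM n 1/n" lines above.
public class WordnetRelationsSketch {

    // Hypothetical stand-in: returns the number of hypernym/hyponym edges
    // between the synsets of the two terms, or -1 if no such path exists.
    static int pathLength(String a, String b) {
        if (a.equals("Ale") && b.equals("Beer")) return 1;     // direct hyponym
        if (a.equals("Pilsner") && b.equals("Beer")) return 2; // two edges away
        return -1;                                             // no relation found
    }

    // Inverse-path-length score: 1 -> 1.0, 2 -> 0.5, 4 -> 0.25, none -> 0.0.
    static double similarity(String a, String b) {
        int n = pathLength(a, b);
        return n <= 0 ? 0.0 : 1.0 / n;
    }
}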
DLMatch: New iteration:
SimilarityIteration: Comuting Similarity !
SWORelevanceSets: Making rel sets of : Brewery
SWORelevanceSets: >> CONTEXT of Brewery Brewery
SWORelevanceSets: > Region - locatedIn
ContextSimilarity: RELEVANCE: [ Award Brewery ]
ContextSimilarity: >> Beer - Brewery -> 0.29147682
ContextSimilarity: >> Award - Brewery -> 0.14484127
ContextSimilarity: [CONTEXT SIM: Award - Brewery -> 0.14543937
ContextSimilarity: RELEVANCE: [ Beer Brewery ]
ContextSimilarity: >> Award - Brewery -> 0.14484127
ContextSimilarity: >> Beer - Brewery -> 0.29147682
ContextSimilarity: >> Brewery - Brewery -> 1.0
ContextSimilarity: >> component - Region -> 0.10159466
ContextSimilarity: [CONTEXT SIM: Beer - Brewery -> 0.30758256
ContextSimilarity: RELEVANCE: [ Brewery Brewery ]
ContextSimilarity: >> Location - Region -> 0.7054077
ContextSimilarity: >> Brewery - Brewery -> 1.0
ContextSimilarity: >> Beer - Brewery -> 0.29147682
ContextSimilarity: [CONTEXT SIM: Brewery - Brewery -> 0.66562814
SWORelevanceSets: Making rel sets of : Ale
SWORelevanceSets: >> CONTEXT of Ale Ale
SWORelevanceSets: > TopFermentedBeer - superClassOf
ContextSimilarity: RELEVANCE: [ Pilsner Ale ]
ContextSimilarity: >> Beer - Ale -> 0.6663452
ContextSimilarity: >> Pilsner - Ale -> 0.121435925
ContextSimilarity: [CONTEXT SIM: Pilsner - Ale -> 0.26259372
ContextSimilarity: RELEVANCE: [ Ale Ale ]
ContextSimilarity: >> Ale - Ale -> 1.0
ContextSimilarity: >> Beer - Ale -> 0.6663452
ContextSimilarity: [CONTEXT SIM: Ale - Ale -> 0.55544835
ContextSimilarity: RELEVANCE: [ Award Ale ]
ContextSimilarity: >> Beer - Ale -> 0.6663452
ContextSimilarity: >> Award - Ale -> 0.16672431
ContextSimilarity: [CONTEXT SIM: Award - Ale -> 0.27768984
ContextSimilarity: RELEVANCE: [ AlcoholPercentage Ale ]
ContextSimilarity: >> AlcoholPercentage - Ale -> 0.13964012
ContextSimilarity: >> Beer - Ale -> 0.6663452
ContextSimilarity: [CONTEXT SIM: AlcoholPercentage - Ale -> 0.26866177
ContextSimilarity: RELEVANCE: [ Beer Ale ]
ContextSimilarity: >> Award - Ale -> 0.16672431
ContextSimilarity: >> Pilsner - Ale -> 0.121435925
ContextSimilarity: >> Beer - Ale -> 0.6663452
ContextSimilarity: >> AlcoholPercentage - Ale -> 0.13964012
ContextSimilarity: >> component - TopFermentedBeer -> 0.14091837
ContextSimilarity: >> Ale - Ale -> 1.0
ContextSimilarity: >> Fermentation Method - TopFermentedBeer -> 0.17045702
ContextSimilarity: [CONTEXT SIM: Beer - Ale -> 0.48110422
SWORelevanceSets: Making rel sets of : Pilsner
SWORelevanceSets: >> CONTEXT of Pilsner Pilsner
SWORelevanceSets: > BottomFermentedBeer - superClassOf
ContextSimilarity: RELEVANCE: [ Pilsner Pilsner ]
ContextSimilarity: >> Beer - Pilsner -> 0.51828617
ContextSimilarity: >> Pilsner - Pilsner -> 1.0
ContextSimilarity: [CONTEXT SIM: Pilsner - Pilsner -> 0.5060954
ContextSimilarity: RELEVANCE: [ Ale Pilsner ]
ContextSimilarity: >> Ale - Pilsner -> 0.121435925
ContextSimilarity: >> Beer - Pilsner -> 0.51828617
ContextSimilarity: [CONTEXT SIM: Ale - Pilsner -> 0.2132407
ContextSimilarity: RELEVANCE: [ Beer Pilsner ]
ContextSimilarity: >> Pilsner - Pilsner -> 1.0
ContextSimilarity: >> Beer - Pilsner -> 0.51828617
ContextSimilarity: >> Brewery - BottomFermentedBeer -> 0.13973393
ContextSimilarity: >> AlcoholPercentage - BottomFermentedBeer -> 0.13083442
ContextSimilarity: >> component - Pilsner -> 0.10159466
ContextSimilarity: >> component - BottomFermentedBeer -> 0.13779241
ContextSimilarity: >> Ale - Pilsner -> 0.121435925
ContextSimilarity: [CONTEXT SIM: Beer - Pilsner -> 0.40961656
ContextSimilarity: RELEVANCE: [ component Pilsner ]
ContextSimilarity: >> Beer - Pilsner -> 0.51828617
ContextSimilarity: >> component - Pilsner -> 0.10159466
ContextSimilarity: >> component - BottomFermentedBeer -> 0.13779241
ContextSimilarity: [CONTEXT SIM: component - Pilsner -> 0.21869285
SWORelevanceSets: Making rel sets of : Organization
SWORelevanceSets: >> CONTEXT of Organization Organization
ContextSimilarity: RELEVANCE: [ Fermentation Method Organization ]
ContextSimilarity: >> Fermentation Method - Organization -> 0.15423976
ContextSimilarity: [CONTEXT SIM: Fermentation Method - Organization -> 0.051413253
ContextSimilarity: RELEVANCE: [ Company Organization ]
ContextSimilarity: >> Company - Organization -> 0.37317213
ContextSimilarity: [CONTEXT SIM: Company - Organization -> 0.124390714
ContextSimilarity: RELEVANCE: [ Location Organization ]
ContextSimilarity: >> Location - Organization -> 0.18576391
ContextSimilarity: [CONTEXT SIM: Location - Organization -> 0.061921302
SWORelevanceSets: Making rel sets of : BottomFermentedBeer
SWORelevanceSets: >>>> Pilsner
SWORelevanceSets: >> CONTEXT of BottomFermentedBeer Pilsner - subClassOf
SWORelevanceSets: > BottomFermentedBeer
SWORelevanceSets: > Beer - superClassOf
ContextSimilarity: RELEVANCE: [ Pilsner BottomFermentedBeer ]
ContextSimilarity: >> Beer - Pilsner -> 0.51828617
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: >> Pilsner - Pilsner -> 1.0
ContextSimilarity: [CONTEXT SIM: Pilsner - BottomFermentedBeer -> 0.6666667
ContextSimilarity: RELEVANCE: [ Company BottomFermentedBeer ]
ContextSimilarity: >> Brewery - BottomFermentedBeer -> 0.13973393
ContextSimilarity: >> Brewery - Beer -> 0.29147682
ContextSimilarity: >> Company - BottomFermentedBeer -> 0.10489766
ContextSimilarity: [CONTEXT SIM: Company - BottomFermentedBeer -> 0.13212483
ContextSimilarity: RELEVANCE: [ AlcoholPercentage BottomFermentedBeer ]
ContextSimilarity: >> AlcoholPercentage - BottomFermentedBeer -> 0.13083442
ContextSimilarity: >> Beer - Pilsner -> 0.51828617
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: [CONTEXT SIM: AlcoholPercentage - BottomFermentedBeer -> 0.3769448
ContextSimilarity: RELEVANCE: [ Beer BottomFermentedBeer ]
ContextSimilarity: >> Award - Beer -> 0.15072018
ContextSimilarity: >> Pilsner - Pilsner -> 1.0
ContextSimilarity: >> Beer - Pilsner -> 0.51828617
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: >> Brewery - BottomFermentedBeer -> 0.13973393
ContextSimilarity: >> Brewery - Beer -> 0.29147682
ContextSimilarity: >> AlcoholPercentage - BottomFermentedBeer -> 0.13083442
ContextSimilarity: >> component - Pilsner -> 0.10159466
ContextSimilarity: >> component - BottomFermentedBeer -> 0.13779241
ContextSimilarity: >> Ale - Pilsner -> 0.121435925
ContextSimilarity: >> Ale - Beer -> 0.6663452
ContextSimilarity: [CONTEXT SIM: Beer - BottomFermentedBeer -> 0.67543375
ContextSimilarity: RELEVANCE: [ component BottomFermentedBeer ]
ContextSimilarity: >> Beer - Pilsner -> 0.51828617
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: >> component - Pilsner -> 0.10159466
ContextSimilarity: >> component - BottomFermentedBeer -> 0.13779241
ContextSimilarity: [CONTEXT SIM: component - BottomFermentedBeer -> 0.37926412
ContextSimilarity: RELEVANCE: [ Brewery BottomFermentedBeer ]
ContextSimilarity: >> Brewery - BottomFermentedBeer -> 0.13973393
ContextSimilarity: >> Brewery - Beer -> 0.29147682
ContextSimilarity: >> Beer - Pilsner -> 0.51828617
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: >> Company - BottomFermentedBeer -> 0.10489766
ContextSimilarity: [CONTEXT SIM: Brewery - BottomFermentedBeer -> 0.46545815
SWORelevanceSets: Making rel sets of : Award
SWORelevanceSets: >> CONTEXT of Award Award
ContextSimilarity: RELEVANCE: [ Ale Award ]
ContextSimilarity: >> Ale - Award -> 0.16672431
ContextSimilarity: >> Beer - Award -> 0.15072018
ContextSimilarity: [CONTEXT SIM: Ale - Award -> 0.10581484
ContextSimilarity: RELEVANCE: [ Award Award ]
ContextSimilarity: >> Beer - Award -> 0.15072018
ContextSimilarity: >> Award - Award -> 1.0
ContextSimilarity: [CONTEXT SIM: Award - Award -> 0.38357338
ContextSimilarity: RELEVANCE: [ Beer Award ]
ContextSimilarity: >> Award - Award -> 1.0
ContextSimilarity: >> Beer - Award -> 0.15072018
ContextSimilarity: >> Brewery - Award -> 0.14484127
ContextSimilarity: >> Ale - Award -> 0.16672431
ContextSimilarity: [CONTEXT SIM: Beer - Award -> 0.36557144
ContextSimilarity: RELEVANCE: [ Brewery Award ]
ContextSimilarity: >> Brewery - Award -> 0.14484127
ContextSimilarity: >> Beer - Award -> 0.15072018
ContextSimilarity: [CONTEXT SIM: Brewery - Award -> 0.09852048
SWORelevanceSets: Making rel sets of : Region
SWORelevanceSets: >> CONTEXT of Region Region
ContextSimilarity: RELEVANCE: [ Location Region ]
ContextSimilarity: >> Location - Region -> 0.7054077
ContextSimilarity: [CONTEXT SIM: Location - Region -> 0.2351359
ContextSimilarity: RELEVANCE: [ Beer Region ]
ContextSimilarity: >> Beer - Region -> 0.1076389
ContextSimilarity: >> component - Region -> 0.10159466
ContextSimilarity: [CONTEXT SIM: Beer - Region -> 0.052308388
ContextSimilarity: RELEVANCE: [ component Region ]
ContextSimilarity: >> Beer - Region -> 0.1076389
ContextSimilarity: >> component - Region -> 0.10159466
ContextSimilarity: [CONTEXT SIM: component - Region -> 0.06974452
SWORelevanceSets: Making rel sets of : AlcoholPercentage
SWORelevanceSets: >> CONTEXT of AlcoholPercentage AlcoholPercentage
ContextSimilarity: RELEVANCE: [ Ale AlcoholPercentage ]
ContextSimilarity: >> Ale - AlcoholPercentage -> 0.13964012
ContextSimilarity: [CONTEXT SIM: Ale - AlcoholPercentage -> 0.04654671
ContextSimilarity: RELEVANCE: [ AlcoholPercentage AlcoholPercentage ]
ContextSimilarity: >> AlcoholPercentage - AlcoholPercentage -> 0.5
ContextSimilarity: [CONTEXT SIM: AlcoholPercentage - AlcoholPercentage -> 0.16666667
ContextSimilarity: RELEVANCE: [ component AlcoholPercentage ]
ContextSimilarity: >> component - AlcoholPercentage -> 0.12406197
ContextSimilarity: [CONTEXT SIM: component - AlcoholPercentage -> 0.04135399
SWORelevanceSets: Making rel sets of : TopFermentedBeer
SWORelevanceSets: >>>> Ale
SWORelevanceSets: >> CONTEXT of TopFermentedBeer Beer - superClassOf
SWORelevanceSets: > TopFermentedBeer
SWORelevanceSets: > Ale - subClassOf
ContextSimilarity: RELEVANCE: [ Fermentation Method TopFermentedBeer ]
ContextSimilarity: >> Fermentation Method - TopFermentedBeer -> 0.17045702
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: [CONTEXT SIM: Fermentation Method - TopFermentedBeer -> 0.39015234
ContextSimilarity: RELEVANCE: [ Company TopFermentedBeer ]
ContextSimilarity: >> Brewery - Beer -> 0.29147682
ContextSimilarity: >> Company - TopFermentedBeer -> 0.10185588
ContextSimilarity: [CONTEXT SIM: Company - TopFermentedBeer -> 0.13111089
ContextSimilarity: RELEVANCE: [ AlcoholPercentage TopFermentedBeer ]
ContextSimilarity: >> AlcoholPercentage - TopFermentedBeer -> 0.105679624
ContextSimilarity: >> AlcoholPercentage - Ale -> 0.13964012
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: [CONTEXT SIM: AlcoholPercentage - TopFermentedBeer -> 0.37988004
ContextSimilarity: RELEVANCE: [ Beer TopFermentedBeer ]
ContextSimilarity: >> Award - Beer -> 0.15072018
ContextSimilarity: >> Award - Ale -> 0.16672431
ContextSimilarity: >> Pilsner - Beer -> 0.51828617
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: >> Brewery - Beer -> 0.29147682
ContextSimilarity: >> AlcoholPercentage - TopFermentedBeer -> 0.105679624
ContextSimilarity: >> AlcoholPercentage - Ale -> 0.13964012
ContextSimilarity: >> component - TopFermentedBeer -> 0.14091837
ContextSimilarity: >> Ale - Beer -> 0.6663452
ContextSimilarity: >> Ale - Ale -> 1.0
ContextSimilarity: >> Fermentation Method - TopFermentedBeer -> 0.17045702
ContextSimilarity: [CONTEXT SIM: Beer - TopFermentedBeer -> 0.68550056
ContextSimilarity: RELEVANCE: [ component TopFermentedBeer ]
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: >> component - TopFermentedBeer -> 0.14091837
ContextSimilarity: [CONTEXT SIM: component - TopFermentedBeer -> 0.38030612
SWORelevanceSets: Making rel sets of : Ingredient
SWORelevanceSets: >> CONTEXT of Ingredient Ingredient
ContextSimilarity: RELEVANCE: [ Fermentation Method Ingredient ]
ContextSimilarity: >> Fermentation Method - Ingredient -> 0.12858285
ContextSimilarity: [CONTEXT SIM: Fermentation Method - Ingredient -> 0.04286095
ContextSimilarity: RELEVANCE: [ Award Ingredient ]
ContextSimilarity: >> Award - Ingredient -> 0.11805556
ContextSimilarity: [CONTEXT SIM: Award - Ingredient -> 0.039351854
ContextSimilarity: RELEVANCE: [ AlcoholPercentage Ingredient ]
ContextSimilarity: >> AlcoholPercentage - Ingredient -> 0.105679624
ContextSimilarity: [CONTEXT SIM: AlcoholPercentage - Ingredient -> 0.035226543
ContextSimilarity: RELEVANCE: [ Location Ingredient ]
ContextSimilarity: >> Location - Ingredient -> 0.12529337
ContextSimilarity: >> Brewery - Ingredient -> 0.12590021
ContextSimilarity: [CONTEXT SIM: Location - Ingredient -> 0.0837312
ContextSimilarity: RELEVANCE: [ component Ingredient ]
ContextSimilarity: >> component - Ingredient -> 0.6259002
ContextSimilarity: [CONTEXT SIM: component - Ingredient -> 0.20863341
ContextSimilarity: RELEVANCE: [ Brewery Ingredient ]
ContextSimilarity: >> Location - Ingredient -> 0.12529337
ContextSimilarity: >> Brewery - Ingredient -> 0.12590021
ContextSimilarity: [CONTEXT SIM: Brewery - Ingredient -> 0.0837312
SWORelevanceSets: Making rel sets of : Beer
SWORelevanceSets: >>>> TopFermentedBeer
SWORelevanceSets: >>>> BottomFermentedBeer
SWORelevanceSets: >> CONTEXT of Beer Award - awared
SWORelevanceSets: > Ingredient - madeFrom
SWORelevanceSets: > Brewery - brewedBy
SWORelevanceSets: > TopFermentedBeer - subClassOf
SWORelevanceSets: > BottomFermentedBeer - subClassOf
SWORelevanceSets: > Beer
SWORelevanceSets: > AlcoholPercentage - hasAlocholPercentage
ContextSimilarity: RELEVANCE: [ Pilsner Beer ]
ContextSimilarity: >> Beer - Award -> 0.15072018
ContextSimilarity: >> Beer - Brewery -> 0.29147682
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: >> Pilsner - BottomFermentedBeer -> 0.10489766
ContextSimilarity: >> Pilsner - Beer -> 0.51828617
ContextSimilarity: [CONTEXT SIM: Pilsner - Beer -> 0.37957156
ContextSimilarity: RELEVANCE: [ Ale Beer ]
ContextSimilarity: >> Ale - Award -> 0.16672431
ContextSimilarity: >> Ale - Beer -> 0.6663452
ContextSimilarity: >> Beer - Award -> 0.15072018
ContextSimilarity: >> Beer - Brewery -> 0.29147682
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: [CONTEXT SIM: Ale - Beer -> 0.41658628
ContextSimilarity: RELEVANCE: [ Award Beer ]
ContextSimilarity: >> Beer - Award -> 0.15072018
ContextSimilarity: >> Beer - Brewery -> 0.29147682
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: >> Award - Award -> 1.0
ContextSimilarity: [CONTEXT SIM: Award - Beer -> 0.5
ContextSimilarity: RELEVANCE: [ Beer Beer ]
ContextSimilarity: >> Award - Award -> 1.0
ContextSimilarity: >> Pilsner - BottomFermentedBeer -> 0.10489766
ContextSimilarity: >> Pilsner - Beer -> 0.51828617
ContextSimilarity: >> Beer - Award -> 0.15072018
ContextSimilarity: >> Beer - Brewery -> 0.29147682
ContextSimilarity: >> Beer - Beer -> 1.0
ContextSimilarity: >> Brewery - Award -> 0.14484127
ContextSimilarity: >> Brewery - Brewery -> 1.0
ContextSimilarity: >> AlcoholPercentage - Ingredient -> 0.105679624
ContextSimilarity: >> AlcoholPercentage - BottomFermentedBeer -> 0.13083442
ContextSimilarity: >> AlcoholPercentage - AlcoholPercentage -> 0.5
ContextSimilarity: >> component - Ingredient -> 0.6259002
ContextSimilarity: >> Ale - Award -> 0.16672431
ContextSimilarity: >> Ale - Beer -> 0.6663452
ContextSimilarity: >> Fermentation Method - Ingredient -> 0.12858285
ContextSimilarity: >> Fermentation Method - TopFermentedBeer -> 0.17045702
ContextSimilarity: [CONTEXT SIM: Beer - Beer -> 0.7829984
ContextSimilarity: RELEVANCE: [ Brewery Beer ]
ContextSimilarity: >> Location - Ingredient -> 0.12529337
[DEBUG|Thread 1] ContextSimilarity: >> Brewery - Award -> 0.14484127
[DEBUG|Thread 1] ContextSimilarity: >> Brewery - Brewery -> 1.0
[DEBUG|Thread 1] ContextSimilarity: >> Beer - Award -> 0.15072018
[DEBUG|Thread 1] ContextSimilarity: >> Beer - Brewery -> 0.29147682
[DEBUG|Thread 1] ContextSimilarity: >> Beer - Beer -> 1.0
[DEBUG|Thread 1] ContextSimilarity: >> Company - TopFermentedBeer -> 0.10185588
[DEBUG|Thread 1] ContextSimilarity: >> Company - BottomFermentedBeer -> 0.10489766
[DEBUG|Thread 1] ContextSimilarity: [CONTEXT SIM: Brewery - Beer -> 0.4460382
[INFO |Thread 1] SimilarityIteration: DONE !!
[INFO |Thread 1] DLMatch: < Beer - 3.3722959 > ---> < Beer - 0.89149916 > < Ale - 0.5737247 > < BottomFermentedBeer - 0.40758383 > < Award - 0.2581458 > < Pilsn
[INFO |Thread 1] DLMatch: < Pilsner - 1.7797735 > ---> < Pilsner - 0.7530477 > < BottomFermentedBeer - 0.38578218 > < Beer - 0.44892886 > < Ale - 0.19201481 >
[INFO |Thread 1] DLMatch: < Award - 1.4631978 > ---> < Award - 0.6917867 > < Beer - 0.3253601 > < Ale - 0.22220707 > < Ingredient - 0.07870371 > < Brewery - 0.14
[INFO |Thread 1] DLMatch: < Ale - 1.7158911 > ---> < Ale - 0.77772415 > < Beer - 0.54146576 > < Award - 0.13626957 > < AlcoholPercentage - 0.09309342 > < Pilsn
[INFO |Thread 1] DLMatch: < Brewery - 1.7306643 > ---> < Brewery - 0.8328141 > < Beer - 0.3687575 > < Award - 0.12168087 > < Ingredient - 0.10481571 > < BottomFerm
[INFO |Thread 1] DLMatch: < Location - 0.6986267 > ---> < Region - 0.4702718 > < Ingredient - 0.10451229 > < Organization - 0.123842604 >
[INFO |Thread 1] DLMatch: < component - 1.2649287 > ---> < Ingredient - 0.41726682 > < BottomFermentedBeer - 0.25852826 > < TopFermentedBeer - 0.26061225 > < Alcohol
[INFO |Thread 1] DLMatch: < Fermentation Method - 0.4688531 > ---> < TopFermentedBeer - 0.28030467 > < Ingredient - 0.0857219 > < Organization - 0.102826506 >
[INFO |Thread 1] DLMatch: < AlcoholPercentage - 1.1046069 > ---> < AlcoholPercentage - 0.33333334 > < BottomFermentedBeer - 0.25388962 > < Ingredient - 0.070453085 > < Ale
[INFO |Thread 1] DLMatch: < Company - 0.4837761 > ---> < Organization - 0.24878143 > < TopFermentedBeer - 0.11648339 > < BottomFermentedBeer - 0.118511245 >
[INFO |Thread 1] DLMatch: TOP : < Beer - 3.3722959 > ---> < Beer - 0.89149916 > < Ale - 0.5737247 > < BottomFermentedBeer - 0.40758383 > < Award - 0.2581458 > < Pilsn
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.89149916
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.7530477
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.8328141
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.77772415
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.6917867
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.41726682
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.33333334
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.4702718
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.24878143
[INFO |Thread 1] DLMatch: >> false
[INFO |Thread 1] DLMatch: FINAL MAPPING: 0.28030467
[INFO |Thread 1] DLMatch: >> TOTAL RESULT : 0.56968296
[INFO |Thread 1] CrawlerThread: Download speed: 120 kb/s
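Note that the TOTAL RESULT above (0.56968296) equals the mean of the ten FINAL MAPPING scores. The following is a minimal sketch of that aggregation step; the class and method names are hypothetical, reflecting our reading of the transcript rather than DMatch-lite's actual code.

// Illustrative only: the TOTAL RESULT in the transcript equals the mean
// of the ten FINAL MAPPING scores retained by the matcher.
public final class MappingAggregator {

    /** Mean of the best mapping score retained per source concept. */
    public static float totalResult(float[] finalMappings) {
        float sum = 0f;
        for (float score : finalMappings) {
            sum += score;
        }
        return sum / finalMappings.length;
    }

    public static void main(String[] args) {
        float[] finalMappings = {
            0.89149916f, 0.7530477f, 0.8328141f, 0.77772415f, 0.6917867f,
            0.41726682f, 0.33333334f, 0.4702718f, 0.24878143f, 0.28030467f
        };
        // Prints approximately 0.56968296, matching the transcript above.
        System.out.println(totalResult(finalMappings));
    }
}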
A.2 The Crawler in Action
Here we present a short transcript of the crawler performing a general crawl on the Web.

[main] Connecting to DOGMA Server
[main] Started new Thread: 1
[main] Started new Thread: 2
[main] Started new Thread: 3
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/aab/services.html
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/wbf/
[Thread 1] .. DONE! -- relevance: 0.051571064 - Time : 1772ms - DL speed: 17 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/brewpubs/
[Thread 2] .. DONE! -- relevance: 0.06428243 - Time : 70ms - DL speed: 19 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/aab/beerstorefinder.html
[Thread 3] .. DONE! -- relevance: 0.06401844 - Time : 60ms - DL speed: 19 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks.html
[Thread 1] .. DONE! -- relevance: 0.05089866 - Time : 70ms - DL speed: 23 kb/s
[Thread 2] .. DONE! -- relevance: 0.05143445 - Time : 40ms - DL speed: 63 kb/s
[Thread 3] .. DONE! -- relevance: 0.060084175 - Time : 50ms - DL speed: 19 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/style/index.html
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/homebrew/index.html
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/news/index.html
[Thread 1] .. DONE! -- relevance: 0.043810796 - Time : 60ms - DL speed: 48 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/beertravel/index.html
[Thread 2] .. DONE! -- relevance: 0.039715074 - Time : 60ms - DL speed: 35 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/market/index.html
[Thread 1] .. DONE! -- relevance: 0.04152274 - Time : 40ms - DL speed: 46 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/collect/index.html
[Thread 2] .. DONE! -- relevance: 0.059028134 - Time : 30ms - DL speed: 57 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/pullupastool/markruedrich.html
[Thread 3] .. DONE! -- relevance: 0.024021532 - Time : 190ms - DL speed: 61 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/wbfraleigh/
[Thread 1] .. DONE! -- relevance: 0.042601433 - Time : 70ms - DL speed: 69 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/wbf/home.html
[Thread 2] .. DONE! -- relevance: 0.03535534 - Time : 70ms - DL speed: 90 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/aab/subscribe.html
[Thread 1] .. DONE! -- relevance: 0.05025189 - Time : 30ms - DL speed: 90 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/aab/subscribe_gift.html
[Thread 2] .. DONE! -- relevance: 0.09901475 - Time : 20ms - DL speed: 28 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/faq.html
[Thread 1] .. DONE! -- relevance: 0.05725983 - Time : 40ms - DL speed: 20 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/aab/renew.html
[Thread 2] .. DONE! -- relevance: 0.045175396 - Time : 80ms - DL speed: 33 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer2.com/subs/address_change.html
[Thread 1] .. DONE! -- relevance: 0.096225046 - Time : 10ms - DL speed: 18 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer2.com/brewpubs/states.shtml
[Thread 2] .. DONE! -- relevance: 0.21320072 - Time : 10ms - DL speed: 10 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer2.com/beerstores/submit.html
[Thread 2] .. DONE! -- relevance: 0.10050378 - Time : 10ms - DL speed: 35 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks/travel.html
[Thread 3] .. DONE! -- relevance: 0.05634362 - Time : 30ms - DL speed: 10 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks/paraphernalia.html
[Thread 1] .. DONE! -- relevance: 0.05330018 - Time : 60ms - DL speed: 27 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks/collectibles.html
[Thread 2] .. DONE! -- relevance: 0.055555556 - Time : 40ms - DL speed: 18 kb/s
[Thread 3] .. DONE! -- relevance: 0.04981355 - Time : 40ms - DL speed: 22 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks/reviews.html
[Thread 3] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks/beer_relate.html
[Thread 1] .. DONE! -- relevance: 0.053916387 - Time : 30ms - DL speed: 19 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks/breweries.html
[Thread 2] .. DONE! -- relevance: 0.0625 - Time : 40ms - DL speed: 19 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks/homebrewing.html
[Thread 3] .. DONE! -- relevance: 0.043519415 - Time : 70ms - DL speed: 23 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks/bars.html
[Thread 3] .. DONE! -- relevance: 0.060522754 - Time : 20ms - DL speed: 18 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer2.com/wsdbi/beerlinks/merchandise.html
[Thread 1] .. DONE! -- relevance: 0.047836486 - Time : 50ms - DL speed: 22 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/beertravel/bpubs/index.html
[Thread 2] .. DONE! -- relevance: 0.05504819 - Time : 40ms - DL speed: 19 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/24.5-vermont.html
[Thread 1] .. DONE! -- relevance: 0.05089866 - Time : 50ms - DL speed: 46 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/24.1-tours.html
[Thread 2] .. DONE! -- relevance: 0.034040395 - Time : 60ms - DL speed: 60 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/23.1-redletterdates.html
[Thread 1] .. DONE! -- relevance: 0.032042067 - Time : 50ms - DL speed: 64 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/22.6-september11.html
[Thread 3] .. DONE! -- relevance: 0.052999895 - Time : 40ms - DL speed: 21 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/22.5-easttowest.html
[Thread 2] .. DONE! -- relevance: 0.03209153 - Time : 70ms - DL speed: 64 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/22.4-realpeople.html
[Thread 1] .. DONE! -- relevance: 0.033690855 - Time : 90ms - DL speed: 62 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/22.3-heartoftexas.html
[Thread 3] .. DONE! -- relevance: 0.03181424 - Time : 80ms - DL speed: 62 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/22.2-brewski.html
[Thread 1] .. DONE! -- relevance: 0.031280562 - Time : 90ms - DL speed: 62 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/22.1-pennsylvania.html
[Thread 3] .. DONE! -- relevance: 0.032495633 - Time : 90ms - DL speed: 62 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/andy.html
[Thread 2] .. DONE! -- relevance: 0.03305898 - Time : 70ms - DL speed: 30 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/holidays.html
[Thread 3] .. DONE! -- relevance: 0.039133023 - Time : 60ms - DL speed: 75 kb/s
[Thread 1] .. DONE! -- relevance: 0.032241292 - Time : 90ms - DL speed: 68 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/bowling.html
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/romance.html
[Thread 1] .. DONE! -- relevance: 0.032808937 - Time : 70ms - DL speed: 31 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/beertravel/trips/historic.html
[Thread 2] .. DONE! -- relevance: 0.038691163 - Time : 50ms - DL speed: 23 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/market/wearables.html
[Thread 3] .. DONE! -- relevance: 0.033241123 - Time : 60ms - DL speed: 32 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/market/bom.html
[Thread 2] .. DONE! -- relevance: 0.058025885 - Time : 40ms - DL speed: 38 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/collect/24.6-cans.html
[Thread 3] .. DONE! -- relevance: 0.052199576 - Time : 40ms - DL speed: 42 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/collect/24.5-brewzoo.html
[Thread 1] .. DONE! -- relevance: 0.031766046 - Time : 50ms - DL speed: 62 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/collect/24.4-youradhere.html
[Thread 3] .. DONE! -- relevance: 0.042182453 - Time : 40ms - DL speed: 66 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/collect/24.3-growlers.html
[Thread 1] .. DONE! -- relevance: 0.043193422 - Time : 50ms - DL speed: 66 kb/s
[Thread 2] .. DONE! -- relevance: 0.043153185 - Time : 50ms - DL speed: 66 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/collect/24.1-beertrays.html
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/collect/24.2-lighters.html
[Thread 3] .. DONE! -- relevance: 0.0436852 - Time : 50ms - DL speed: 46 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/collect/23.6-scrape.html
[Thread 1] .. DONE! -- relevance: 0.044587795 - Time : 40ms - DL speed: 46 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/collect/23.4-glassware.html
[Thread 2] .. DONE! -- relevance: 0.04512937 - Time : 50ms - DL speed: 42 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/collect/23.3-handles.html
[Thread 3] .. DONE! -- relevance: 0.043768812 - Time : 30ms - DL speed: 63 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/collect/23.2-backbarads.html
[Thread 2] .. DONE! -- relevance: 0.044543542 - Time : 50ms - DL speed: 66 kb/s
[Thread 2] .. Crawling new link: http://www.allaboutbeer.com/collect/coasters.html
[Thread 1] .. DONE! -- relevance: 0.04275691 - Time : 40ms - DL speed: 66 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/collect/bottles.html
[Thread 3] .. DONE! -- relevance: 0.045037735 - Time : 50ms - DL speed: 69 kb/s
[Thread 3] .. Crawling new link: http://www.allaboutbeer.com/collect/labeled.html
[Thread 1] .. DONE! -- relevance: 0.04267896 - Time : 30ms - DL speed: 44 kb/s
[Thread 1] .. Crawling new link: http://www.allaboutbeer.com/collect/beerlabels.html
[Thread 2] .. DONE! -- relevance: 0.042953677 - Time : 60ms - DL speed: 44 kb/s
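Each DONE! line above reports the relevance score assigned to the fetched page, and each newly discovered link is queued for further crawling. A focused crawler typically uses such scores to decide which queued link to fetch next; the following best-first frontier is a minimal sketch of that idea, in which a link inherits the relevance of the page on which it was found. The names are illustrative and do not reflect our crawler's actual classes.

import java.util.Comparator;
import java.util.PriorityQueue;

// A minimal best-first crawl frontier: the most promising link, ordered
// by the relevance of its parent page, is fetched first.
public final class CrawlFrontier {
    public record Candidate(String url, float relevance) {}

    private final PriorityQueue<Candidate> queue =
            new PriorityQueue<>(Comparator.comparing(Candidate::relevance).reversed());

    /** Queue a newly discovered link with the relevance of its parent page. */
    public void add(String url, float parentRelevance) {
        queue.add(new Candidate(url, parentRelevance));
    }

    /** The next candidate to crawl, or null when the frontier is empty. */
    public Candidate next() {
        return queue.poll();
    }
}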
Appendix B
KnowledgeWeb Project Ontology

This is the KnowledgeWeb Project Ontology (in RDFS) that we used in Section 6. We automatically converted the ontology to lexons and used those in our evaluation of the crawler; a sketch of such a conversion follows the listing below. The documentation strings of the ontology's classes and properties are:

- Network of Excellence's summary: objectives, description of work, milestones and expected results.
- Network of Excellence's full name or title.
- A Network of Excellence is a complete programme of work to achieve defined goals and objectives within constraints of resources, time and other conditions.
- Network of Excellence means all the work as referred to in Annex I to the contract.
- Workpackage number: WP1, WP1.1, ..., WPn.
- General Network of Excellence's objectives.
- Task's number.
- Task's start month (MonthN). Relative start date for the work in a specific workpackage.
- A Milestone is a control point at which progress can be assessed. Milestones point at events, when objectives or intermediate goals are to be reached. A milestone is normally connected with the submission of deliverables. A milestone, by definition, has a duration of zero and no effort; there is no work associated with a milestone. Usually a milestone is used as a project checkpoint to validate how the project is progressing and to revalidate work.
- A summary of the tasks to carry out in the activity.
- The start date of the Network of Excellence.
- A brief description of the expected results in each milestone.
- Network of Excellence's acronym or abbreviation.
- A brief description of the work to do in the workpackage.
- Network of Excellence's web site.
- (1.1, 1.2, etc.)
- Workpackage's objectives.
- Task's name.
- The end date of the Network of Excellence.
- A summary of the deliverables to deliver in the activity.
- The month in which the activity starts.
- The total number of person-months allocated to a workpackage.
- Number of person-months per participant in a workpackage.
- A brief description of the task.
- Task's end month (MonthN). Relative end date.
- The month in which the workpackage finishes.
- Project month in which the milestone is accomplished. (MonthN)
- A Workpackage (WP) is a major subdivision of a project which leads to the completion of one of the goals, objectives or major deliverables within the project. Different workpackages can proceed in parallel within a project.
Workpackages can be further divided into Tasks. Different tasks can proceed in parallel, within a workpackage, and cover one or more reporting periods of the project. A Task should end on a definite milestone and lead to at least one deliverable.
- This term indicates the workload (that is, the number of person-months) of an organization that participates in a workpackage.
- A brief description of the milestone.
- Milestone's number or name. (MilestoneN)
- Workpackage's name or title.
- The code which identifies the project.
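As noted at the start of this appendix, we converted this ontology automatically to lexons. In DOGMA, a lexon is a quintuple <context, term1, role, co-role, term2>. The following is a minimal sketch of such a conversion for rdfs:subClassOf statements, assuming Apache Jena for parsing; the class names and the library choice are illustrative and do not describe the converter we actually used.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.vocabulary.RDFS;

// Sketch: emit one is-a lexon per rdfs:subClassOf statement in the model.
public final class RdfsToLexons {
    /** A DOGMA lexon: a quintuple (context, term1, role, co-role, term2). */
    public record Lexon(String ctx, String term1, String role,
                        String coRole, String term2) {}

    public static void convert(Model model, String ctx) {
        StmtIterator it = model.listStatements(null, RDFS.subClassOf, (RDFNode) null);
        while (it.hasNext()) {
            Statement st = it.next();
            // Subclass relations become is-a/subsumes lexons in context ctx.
            Lexon lexon = new Lexon(ctx,
                    st.getSubject().getLocalName(), "is-a", "subsumes",
                    st.getObject().asResource().getLocalName());
            System.out.println(lexon);
        }
    }
}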