TRANSFORMING ONTOLOGIES IN THE WEB ONTOLOGY LANGUAGE (OWL) TO VOCABULARIES IN THE SIMPLE KNOWLEDGE ORGANIZATION SYSTEM (SKOS)

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF MASTER OF PHILOSOPHY IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

2015

By Nor Azlinayati Abdul Manaf
School of Computer Science

Contents

Abstract
Declaration
Copyright
Acknowledgements

1 Introduction
  1.1 Research Motivation
  1.2 Knowledge organisation and representation
    1.2.1 Lexical resources
    1.2.2 Terminological resources
    1.2.3 Ontological resources
  1.3 Semantic Web Knowledge Representation Languages
  1.4 Research hypothesis and research objectives
  1.5 Research Contributions
  1.6 Thesis Outline
  1.7 Published Work

2 Framework for Semantic Web Knowledge Reuse
  2.1 The Framework for Semantic Web Knowledge Reuse
  2.2 Semantic Web knowledge representation languages
    2.2.1 Resource Description Framework (RDF)
    2.2.2 Resource Description Framework Schema (RDFS)
    2.2.3 Web Ontology Language (OWL)
      2.2.3.1 Elements of OWL ontologies
    2.2.4 Simple Knowledge Organization System (SKOS)
      2.2.4.1 Elements of SKOS
  2.3 Semantic Web Knowledge Reuse
    2.3.1 Use of SKOS vocabulary
      2.3.1.1 SKOS vocabulary as a indexing tool
      2.3.1.2 SKOS vocabulary as a searching tool
      2.3.1.3 SKOS vocabulary as a navigation tool
    2.3.2 OWL and SKOS transformation framework
      2.3.2.1 Characteristics of OWL and SKOS
    2.3.3 Transforming ontology into SKOS vocabulary
    2.3.4 Transforming knowledge organization systems into OWL ontology
  2.4 Summary

3 Characterising SKOS vocabularies on the Web
  3.1 SKOS vocabulary survey
    3.1.1 Methods of SKOS vocabulary survey
    3.1.2 Discovery component
      3.1.2.1 Dedicated collections
      3.1.2.2 Semantic Web search engines
      3.1.2.3 Web crawler
    3.1.3 Validation component
      3.1.3.1 SKOS vocabulary identification
      3.1.3.2 ‘Slips’ detection and fixing
    3.1.4 Data Extraction and Analysis component
      3.1.4.1 Metadata extraction
      3.1.4.2 Duplicate filtering
      3.1.4.3 Data analysis
  3.2 Result and observation
    3.2.1 Discovery component
    3.2.2 Validation component
      3.2.2.1 SKOS vocabulary identification
      3.2.2.2 ‘Slips’ detection and fixing
    3.2.3 Data Extraction and Analysis component
      3.2.3.1 SKOS constructs usage
      3.2.3.2 SKOS vocabulary categorization
      3.2.3.3 SKOS vocabulary structure
  3.3 Discussion
    3.3.1 Validation of SKOS vocabularies
    3.3.2 SKOS constructs usage
    3.3.3 SKOS vocabulary categorization
    3.3.4 SKOS vocabulary structure
    3.3.5 Recommendations for SKOS Vocabulary Best Practices
  3.4 Summary

4 Characterising OWL ontologies on the Web
  4.1 OWL ontology survey
    4.1.1 Methods of OWL ontology survey
  4.2 Methods and materials
    4.2.1 Discovery component
      4.2.1.1 Dedicated collections
      4.2.1.2 Semantic Web search engines
    4.2.2 Validation component
      4.2.2.1 OWL ontology identification
      4.2.2.2 SKOS vocabulary elimination
    4.2.3 Data Extraction and Analysis component
      4.2.3.1 Metadata extraction
      4.2.3.2 Duplicate filtering
      4.2.3.3 Data analysis
  4.3 Results and Observation
    4.3.1 Discovery component
    4.3.2 Validation component
    4.3.3 Data Extraction and Analysis
      4.3.3.1 OWL construct usage
      4.3.3.2 OWL ontology structure
      4.3.3.3 Measure of ontology sophistication
  4.4 Discussion
    4.4.1 OWL constructs usage
      4.4.1.1 Ontologies with subclass axioms
      4.4.1.2 Ontologies without subclass axioms
    4.4.2 OWL ontology structure
      4.4.2.1 Branching factors
      4.4.2.2 OWL complex class expressions
    4.4.3 Measure of sophistication
  4.5 Summary

5 A Survey of Identifiers and Labels in OWL Ontologies
  5.1 Introduction
  5.2 Materials and Methods
    5.2.1 Corpus Preparation
    5.2.2 Isolation of identifiers and labels
    5.2.3 Determination of whether the identifiers and labels are meaningful or meaningless
      5.2.3.1 Normalise the lexical encoding style of identifiers
      5.2.3.2 Check for meaningfulness
    5.2.4 Result recording
    5.2.5 Data analysis
  5.3 Results
  5.4 Discussion
  5.5 Related Work
  5.6 Summary

6 OWL2SKOS transformation procedures
  6.1 OWL to SKOS transformation
    6.1.1 Component 1: Basic hierarchical structure transformation
      6.1.1.1 Step 1.1: Primary entity transformation
      6.1.1.2 Guideline 1.1.1: Use SKOS concepts for OWL primary entities transformation
      6.1.1.3 Guideline 1.1.2: Group all named classes and individuals from the same ontology into one SKOS Concept Scheme
      6.1.1.4 Step 1.2: Primary hierarchical relationship transformation
      6.1.1.5 Guideline 1.2.1: Use SKOS hierarchical relationships for OWL primary hierarchical relationships transformation
      6.1.1.6 Guideline 1.2.2.: Extend the SKOS hierarchical relationships to include specialised object properties
    6.1.2 Component 2: SKOS lexical labels identification
      6.1.2.1 Guideline 2.1: Transform rdfs:label to SKOS lexical label
      6.1.2.2 Guideline 2.2: Apply selection mechanisms to identify “preferred” lexical label
      6.1.2.3 Guideline 2.3: Use fragment identifier as SKOS lexical label
    6.1.3 Component 3: Complex class/class expressions transformation
      6.1.3.1 Step 3.1: Transforming logical constructors union, intersection and complement
      6.1.3.2 Guideline 3.1.1: “Split” named classes connected through union and intersection
      6.1.3.3 Guideline 3.1.2: Transform the axiom with complement into annotation
      6.1.3.4 Step 3.2: Transforming logic-based constructors property restrictions
      6.1.3.5 Guideline 3.2.1: Use skos:related to associate the classes
  6.2 Tool implementation
  6.3 Discussion
  6.4 Summary

7 Conclusion
  7.1 Thesis Overview
  7.2 Thesis contributions
  7.3 Limitation and Future Work
    7.3.1 Characterising knowledge artefacts
    7.3.2 Transforming knowledge artefacts

Bibliography

A OWL construct usage for Object Property and Data Property axioms
B OWL construct usage for Individual, Annotation and Others axioms
C List of OWL Axiom Patterns

Word Count: 50,353


List of Tables

2.1 Key features or characteristics of OWL and SKOS languages
2.2 Criteria in choosing OWL or SKOS as representation language
3.1 Exclusion type for candidate SKOS vocabularies failed to be identified as SKOS vocabularies
3.2 Summary results
3.3 Exclusions
4.1 Sample data for T, n and p parameters to illustrate measure of ontology sophistication
4.2 Sample data for c parameter to illustrate measure of ontology sophistication
4.3 Results after applying Ontology Sophistication function and its ranking
4.4 Results after applying modified Ontology Sophistication function and its ranking
4.5 Summary results
4.6 Reasons for Exclusion
4.7 OWL construct usage for general entities and class axioms
4.8 OWL construct usage for class expression constructors
4.9 Number of ontologies with Forward Branching Factor (FBF) and Backward Branching factor (BBF).
4.10 Number of ontologies with object property relation in complex class expression
4.11 Top 20 Ontologies with highest sophistication measure
4.12 Frequency distribution of OWL ontologies according to ontology sophistication binning
4.13 Top 30 axiom patterns according to the number of ontologies
4.14 Number of ontologies according to complexity bin
5.1 Number of ontologies for different criteria surveyed. (The mean was calculated over the total ontologies for each entity type)
A.1 OWL construct usage for Object Property and Data Property axioms
B.1 OWL construct usage for Individual, Annotation and Others axioms

List of Figures

1.1 Adapted Knowledge Resources Spectrum based on [JYJRLRS09, Mcg03, Bod06]
1.2 Example of a MeSH concept for “myocardial infarction”
1.3 Example of a MeSH tree structure for “myocardial infarction” concept
1.4 Example of a SNOMED CT concept for “myocardial infarction”
1.5 Example of a SNOMED CT tree structure for “myocardial infarction” concept
2.1 Knowledge representation space
2.2 An RDF graph for statement Yati is student of Robert.
2.3 Examples of LISA Thesaurus
2.4 Thesaurus as navigation tool
2.5 Knowledge representation space
3.1 The methods of SKOS vocabulary survey
3.2 Example graphs for determining the structure of vocabulary
3.3 A snippet of SKOS vocabulary with a Type 1 slip
3.4 A snippet of SKOS vocabulary with a Type 2 slip
3.5 A snippet of SKOS vocabulary with Type 2 slips
3.6 Overall SKOS Construct Usage
3.7 Type of Constructs
3.8 Vocabulary Types
3.9 Size of vocabulary and its maximum depth of vocabulary structure thesaurus category
3.10 Size of vocabulary and its maximum depth of vocabulary structure taxonomy category
3.11 Size of vocabulary and its maximum depth of vocabulary structure glossary category
3.12 (a) Number of loose concepts, root concepts, and maximum skos:broader relations of vocabulary structure
3.13 Hierarchical and associative branching factors of vocabulary structure
3.14 Anomaly for hierarchical FBF for one of vocabularies in the Taxonomy category
4.1 The methods of OWL ontology survey
4.2 Distribution of ontologies with owl:unionOf constructs.
4.3 Distribution of ontologies with owl:allValuesFrom constructs.
4.4 Distribution of ontologies with owl:someValuesFrom
4.5 Distribution of ontologies with owl:intersectionOf constructs.
4.6 Distribution of ontologies with owl:minCardinality constructs.
4.7 Distribution of ontologies with owl:maxCardinality constructs.
4.8 Distribution of ontologies with owl:exactCardinality constructs.
4.9 Distribution of ontologies with owl:oneOf constructs.
4.10 Distribution of ontologies with owl:complementOf constructs.
4.11 Distribution of ontologies with owl:hasValue constructs.
4.12 Distribution of ontologies with owl:hasSelf constructs.
6.1 Example of an OWL ontology - Pizza OWL ontology
6.2 Result from transforming Pizza OWL ontology into SKOS vocabulary

Abstract

OWL and SKOS are two of the Semantic Web knowledge representation languages used to represent domain knowledge. OWL can express richly axiomatised, logic-based ontologies: its precise semantics allow explicit modelling and description of a domain and enable automated reasoning. SKOS is designed to represent knowledge organization systems, whose weak semantics are suited to simple retrieval and navigation tasks. The two languages represent domain knowledge at different levels and for different purposes, reflecting different problem requirements. Knowledge captured in an OWL ontology to support automated reasoning could be reused for document navigation, but it must first be represented as a SKOS vocabulary. In this thesis we argue that each application in a particular domain has a set of problem requirements that can be fulfilled by the characteristics of the knowledge representation languages. The aim of this thesis is to improve our understanding of the OWL ontology and SKOS vocabulary landscape and, through this, to understand the opportunities presented and issues encountered when transforming OWL ontologies to SKOS vocabularies. The main objective of the thesis is to understand the characteristics of OWL ontologies and SKOS vocabularies. A subsidiary objective is to explore and identify the relationships and patterns between the OWL and SKOS languages and to define principled, systematic procedures for transforming OWL ontologies to SKOS vocabularies. To achieve this aim, we proposed a conceptual framework that defines the different possible cases for transforming knowledge artefacts between the two formalisms, OWL and SKOS. We conducted a survey of SKOS vocabularies and OWL ontologies on the Web, which contributes to the understanding of the characteristics of both kinds of knowledge artefact.

One of the issues identified in transforming OWL ontologies to SKOS vocabularies concerns the labelling of concepts in SKOS vocabularies. We therefore performed a further survey of how identifiers and labels are used in OWL ontologies, which contributes to the understanding of labelling usage in OWL ontologies and of the lexical encoding styles of identifiers. We proposed a series of principled procedures for transforming OWL ontologies to SKOS vocabularies, exploiting the results of this survey to deal with the labelling issues. A prototype tool, OWL2SKOS Converter, which performs the transformation procedures, has been implemented and is available at http://owl.cs.manchester.ac.uk/owltoskos/. The work presented in this thesis should be of interest to researchers in the area of knowledge representation who wish to exploit and reuse domain knowledge held in formal representations such as OWL ontologies in applications such as indexing, searching, browsing and navigation. It could also be of interest to application developers who wish to be aware of the current practice of SKOS vocabularies.


Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.


Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any relevant Thesis restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s policy on presentation of Theses


Acknowledgements

It would not have been possible to write this thesis without the help and support of the kind and thoughtful people around me throughout the years. I would like to gratefully acknowledge my supervisors, Prof. Robert Stevens and Sean Bechhofer, for their valuable guidance, insightful supervision, constant support and trust. I would like to express my appreciation to the School of Computer Science of The University of Manchester, especially to all members of the Bio-Health Informatics Group including Matthew, Simon, Sebastian, Ignazio, Luigi, Eleni, Dmitri, Collin, and the DL Lunch members, for the constructive comments and advice that gave me the opportunity to view research from different contexts. During my research, I received very helpful support from my friends, including Azlin, Nurul, Zalina, Siti, Julia, Sabariah, Ainurul, Mariam, Athirah, Lisa, Farhana and Hartini, who shared their research and life experience as well as continuous encouragement. My special gratitude goes to my beloved husband, Benyazwar Mohmd, for his unconditional love, sacrifices, patience and understanding during the substantial time it took to complete this research, and also to my sons Afif Najmi and Zarif Najmi, who bring joy into my life. Special thanks to my parents, Hj. Abdul Manaf Ahmad and Hjh. Azizah Abdul, for their constant prayers and for motivating me to regain confidence. I particularly want to thank my brothers, sisters and in-laws for their unconditional moral support. Finally, I would like to acknowledge the financial assistance provided by the Malaysian Government Agency (MARA) Malaysia during the course of this research, and MIMOS Berhad for granting a long “time out” for me to complete my study, with special thanks to Dr Dickson Lukose, who was always there for me.


Chapter 1

Introduction

1.1 Research Motivation

This thesis is about the knowledge representation of artefacts in the Semantic Web. Knowledge can be organised and represented in various forms, ranging from simple term lists to complex ontologies. The choice of knowledge representation depends very much on the requirements of a particular problem. Some applications require the domain knowledge to be organised and represented in the form of ontologies, while other applications require domain knowledge to be organised in the form of taxonomies or thesaurus-like representations. In some cases, domain knowledge represented in a certain form for one application can be reused by another application, but needs to be represented in a different representation scheme. It will often be the case, however, that one representational form of domain knowledge will need to be transformed to another representational form, rather than being merely translated¹. Therefore, we expect that having some way of transforming from one representation scheme to another can help us reuse existing knowledge resources.

For knowledge to be used within an application, it must be delivered in some concrete representation, expressed in a knowledge representation language. There are a variety of languages for representing conceptual models, with different characteristics such as expressiveness, ease of use and computational complexity [SGB00].

¹ Translation renders the content of an artefact from one language into another language without changing the structure, characteristics and meaning of the original content. Transformation converts the content of an artefact from one language into another language with changes in form, appearance, nature, or character.


In this thesis we focus on two of the Semantic Web² knowledge representation languages, Web Ontology Language (OWL)³ and Simple Knowledge Organization System (SKOS)⁴, which are standard languages used to represent ontologies and knowledge organisation systems (KOS), respectively. The next section briefly discusses the types of knowledge organisation and representation.

1.2 Knowledge organisation and representation

Knowledge organisation and knowledge representation are two fields of study concerned with knowledge. Knowledge organisation is a field of study within Library and Information Science (LIS). It involves the description of documents, their contents, features and purpose, and the managing, arranging and organising of these elements to make them accessible to the users seeking them [And96]. It includes all types and methods of indexing, abstracting, cataloging, classification, record management and other schemes for organising information for information retrieval. According to Hodge [Hod00], Knowledge Organization System (KOS) is a general term that refers to the tools that present the organized interpretation of knowledge structures. The term knowledge organization system was coined by the Networked Knowledge Organization Systems Working Group at its initial meeting at the ACM Digital Libraries 1998 Conference in Pittsburgh, Pennsylvania.

“The term knowledge organization systems is intended to encompass all types of schemes for organizing information and promoting knowledge management. Knowledge organization systems include classification and categorization schemes that organize materials at a general level, subject headings that provide more detailed access, and authority files that control variant versions of key information such as geographic names and personal names. Knowledge organization systems also include highly structured vocabularies, such as thesauri, and less traditional schemes, such as semantic networks and ontologies. Because knowledge organization systems are mechanisms for organizing information, they are at the heart of every library, museum, and archive.” – [Hod00]

² http://www.w3.org/2001/sw/
³ http://www.w3.org/2004/OWL/
⁴ http://www.w3.org/2004/02/skos/


Knowledge representation, on the other hand, has long been considered one of the principal elements of Artificial Intelligence, and a critical part of all problem solving [New82]. It is mainly concerned with how to represent knowledge about the world using formalisms or languages in such a way that a machine can deduce or infer new conclusions from this knowledge by manipulating these descriptions. Reichgelt [Rei91] discusses two aspects of a knowledge representation language, namely the syntactic aspect and the inferential aspect. The syntactic, or notational, aspect is concerned with how information is stored in an explicit format. The inferential aspect, which is carried out by the interpreter, is concerned with making inferences from the knowledge represented.

Knowledge can be organized and represented in different forms and schemes ranging from simple to complex conceptual structures. Various types of knowledge organization and representation schemes can be viewed as a linear spectrum of information artefacts that increases in semantics, expressiveness and machine interpretability as it moves from left to right, as shown in Figure 1.1⁵.

Figure 1.1: Adapted Knowledge Resources Spectrum based on [JYJRLRS09, Mcg03, Bod06]. (The spectrum runs from weak semantics on the left to strong semantics on the right, and is grouped into Lexical Resources, Terminological Resources and Ontological Resources.)

Existing formalisms (denoted by boxes) are arranged according to their semantic expressiveness. We chose existing resources from the biomedical domain as examples of resources for each formalism. According to Bodenreider [Bod06], these formalisms can also be grouped into three categories according to the nature of the resources: 1) Lexical resources, 2) Terminological resources, and 3) Ontological resources.

⁵ This spectrum was first reported by Gruninger, Lehmann, McGuinness, Uschold, and Welty, who were panelists for an invited talk at AAAI-99, and was initially referred to as the ‘Ontology Spectrum’ [Mcg03].

1.2.1 Lexical resources

Lexical resources provide the lexical and lexico-syntactic information needed for parsing text [Bod06]. In particular, specialized lexical resources for the biomedical domain, such as the Specialist Lexicon, may include lists of gene names for molecular biology corpora, which are useful for analyzing sub-domains of biomedicine. The formalisms under this category include the Terms/Glossary formalism, the Simple Taxonomy formalism and the Thesauri formalism.

The Terms/Glossary formalism is placed at the left of the spectrum. It represents the simplest kind of information artefact, providing a list of terms and some natural language description of those terms. Humans can read the natural language descriptions, but their interpretation is ambiguous, which makes them inadequate for computer agents and not machine processable. Examples of resources for this type of formalism are the Unified Medical Language System (UMLS) Specialist Lexicon [BDAM03] and BioLexicon [PJYLRS08].

The Simple Taxonomy formalism provides hierarchical structures organised by subtype-supertype relationships, also called parent-child relationships. The UniProtKB Taxonomy is an example of a resource for this type of formalism [Con12].

The Thesauri formalism offers additional semantics in the relations between terms, which computer agents can interpret as a simple hierarchy. Examples of resources for this type of formalism are Wordnet [SR98] and Terminologia Anatomica [All09].
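To make the subtype-supertype idea concrete, the following is a minimal Python sketch of a simple taxonomy stored as a child-to-parent map. It is an illustration only: the terms, the `PARENT` table and the helper names are hypothetical and are not taken from the UniProtKB Taxonomy or from any resource discussed in this thesis.

```python
# Hypothetical toy taxonomy: each term maps to its parent (supertype).
# The terms are illustrative only and are not taken from a real resource.
PARENT = {
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
}


def ancestors(term, parent=PARENT):
    """Return the chain of supertypes of `term`, nearest first."""
    chain = []
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain


def is_subtype_of(term, candidate, parent=PARENT):
    """True if `candidate` is a direct or indirect supertype of `term`."""
    return candidate in ancestors(term, parent)


def children(term, parent=PARENT):
    """Return the direct subtypes of `term`, sorted for stable output."""
    return sorted(child for child, sup in parent.items() if sup == term)
```

Under these assumptions, `ancestors("dog")` walks up the parent-child links one step at a time, which is all the structure a simple taxonomy offers; richer relations between terms would need the additional semantics that thesauri and later formalisms provide.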

1.2.2 Terminological resources

Terminological resources typically provide lists of synonyms for the entities given in a subdomain for a given purpose [Bod06]. For example, in the biomedical domain, terminological resources contain the names of entities employed in the domain, which play an important role in entity recognition [Chu00]. Furthermore, many terminologies, such as the UMLS Metathesaurus, Gene Ontology and MeSH, have some kind of hierarchical organization that can be exploited for relation extraction purposes. This organization consists of a tree in which nodes are terms and links represent parent-to-child or more-general-to-more-specific relationships. The formalisms under this category include the Terminology formalism.

The Terminology formalism represents more specialised terms and related definitions for a particular area in the field. National Cancer Institute (NCI) terminology resources⁶, the International Classification of Diseases (ICD)⁷ and Medical Subject Headings (MeSH)⁸ are some examples of terminology formalisms in the biomedical field. Figure 1.2 shows an example of a MeSH concept for “myocardial infarction”. Figure 1.3 shows the tree structure for the MeSH concept “myocardial infarction”.

Figure 1.2: Example of a MeSH concept for “myocardial infarction”

Figure 1.3: Example of a MeSH tree structure for “myocardial infarction” concept

1.2.3

Ontological resources

Ontological resources are used to study the kinds of entities that are significant in a domain. The formalisms under this category include Semantic Network formalisms, Frames formalisms, Description Logics formalisms and other complex logic formalisms.

⁶ http://www.cancer.gov/cancertopics/cancerlibrary/terminologyresources
⁷ http://www.who.int/classifications/icd/en/
⁸ http://www.nlm.nih.gov/mesh/

The Semantic Network formalism is also considered to be one of the formalisms in the terminological resources category. The UMLS Semantic Network (UMLS SN)⁹ is an example of a biomedical resource of this type. The UMLS SN reduces the complexity of the UMLS Metathesaurus by grouping concepts according to the semantic types that have been assigned to them. Another example of a biomedical resource of this type is the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT)¹⁰. The Frames formalism offers relations between objects, together with restrictions on what classes of objects can be related to each other and how. The Foundational Model of Anatomy (FMA) is an example of a biomedical resource that uses the Frames formalism. Figure 1.4 shows an example of a SNOMED CT concept for “myocardial infarction”, and Figure 1.5 shows the tree structure for the same concept.

Figure 1.4: Example of a SNOMED CT concept for “myocardial infarction”

1.3

Semantic Web Knowledge Representation Languages

As discussed previously, the knowledge resources, whether lexical, terminological or ontological, are intended for different purposes. OWL, for instance, is a Semantic Web knowledge representation language for expressing ontologies (i.e., ontological resources). In Computer Science, the term ontology has been adopted to refer to a set of precise descriptive statements about a domain, which is described in terms of individuals, classes and properties [Hor08]. OWL is intended to express complex conceptual structures, which can be used to generate rich metadata and support inference tools. OWL has rich semantics that enable automated reasoning and allow the explicit modelling and description of a domain.

⁹ http://semanticnetwork.nlm.nih.gov/
¹⁰ http://www.ihtsdo.org/

Figure 1.5: Example of a SNOMED CT tree structure for “myocardial infarction” concept

SKOS, on the other hand, is a language designed for the representation of thesauri, classification schemes, taxonomies, subject-heading systems, controlled structured vocabularies or any other knowledge organisation systems (i.e., lexical and terminological resources). These are structures whose representation has weak semantics and that are used for simple retrieval and navigation tasks. The basic element in SKOS is the concept, which refers to a unit of thought (an idea, meaning or object) that exists in the mind as an abstract entity, independent of the terms used to label it. Each concept is given one or more labels that refer to it in natural language, through prefLabel or altLabel. In addition, concepts are semantically linked to each other through hierarchical broader (BT)/narrower (NT) and associative related (RT) relations. The SKOS data model offers a standard, low-cost migration path for transferring existing knowledge organisation systems to the Semantic Web technology context, allowing better re-usability, interoperability and sharing. SKOS and OWL will be discussed further in Chapter 3 and Chapter 4.

Even though OWL and SKOS are designed to address different problems, the fact that the SKOS data model is written in the OWL language makes the relationship between the two languages somewhat tighter than that between other languages, such as RDF and RDFS. As a consequence, a SKOS vocabulary is indeed an OWL ontology. In this research, we study the characteristics of the Web Ontology Language and the Simple Knowledge Organization System, and of artefacts written in those languages. These two languages are among the languages used to support Semantic Web knowledge representation. Furthermore, we explore the relationship between the two languages and patterns for their use together. Given that the two languages are entwined, we also aim to investigate whether this helps or hinders the process of transforming from one language to the other, or whether it makes no difference.

In this thesis, we are interested in understanding the characteristics of both OWL and SKOS knowledge artefacts. This understanding yields a set of metrics for each kind of knowledge artefact, OWL ontologies and SKOS vocabularies, and will later be used to assist the transformation of OWL knowledge artefacts into SKOS knowledge artefacts through a systematic and principled transformation process. Finally, we use the metrics to evaluate the transformation of OWL ontologies to SKOS vocabularies, that is, to determine whether the transformation produces SKOS vocabularies that resemble the existing SKOS vocabularies we gathered on the Web.

1.4

Research hypothesis and research objectives

The aim of this thesis is to improve our understanding of the OWL ontology and SKOS vocabulary landscape and, through this, to understand the opportunities presented and issues encountered when transforming OWL ontologies to SKOS vocabularies. The hypothesis of this thesis is that, by understanding the characteristics of SKOS and OWL knowledge artefacts in terms of their shape, structure and construct usage, a useful SKOS vocabulary can be produced by reusing knowledge represented in existing OWL ontologies through systematic and principled transformation procedures. The objectives, which are addressed in the remaining chapters of this thesis, are as follows:

Objective 1: Identify characteristics of SKOS vocabularies.
• To identify the frequency of use of SKOS constructs in the collected SKOS vocabularies, and which constructs are most and least frequently used.
• To categorise the SKOS vocabularies according to the constructs used in the vocabularies.
• To determine the structural characteristics of each category, such as the size, depth and branching factors of the SKOS vocabularies.

Objective 2: Identify characteristics of OWL ontologies.
• To identify the frequency of use of OWL constructs in the collected OWL ontologies, and which constructs are prevalent and prominent.
• To determine the structure of OWL ontologies in terms of the size of the ontologies, the depth and branching factors of the class hierarchy, and the levels of nesting of class expressions.
• To understand the syntactic complexity of the OWL ontologies.
• To understand the use of identifiers and labels in OWL ontologies.

Objective 3: Develop transformations from OWL to SKOS such that the resulting vocabularies have the characteristics found in Objective 1.

1.5

Research Contributions

The main aim of this thesis is to improve the understanding of the OWL ontology and SKOS vocabulary landscape. To achieve this, the research focuses on identifying the characteristics of both OWL ontologies and SKOS vocabularies as a basis for the OWL to SKOS transformation process, and makes the following contributions:

Method for identification of SKOS vocabularies. Even though the term SKOS vocabulary is commonly used, there is no formal method to recognise a knowledge artefact as a SKOS vocabulary. In this thesis, we formally define a method for SKOS vocabulary identification.

Understanding the characteristics of SKOS vocabularies on the Web. One of the major contributions of this thesis is an understanding of SKOS vocabulary characteristics in terms of construct usage, shape and structure.

Identification of common modelling slips in SKOS vocabularies. In this thesis, we identify several types of common modelling slips found in SKOS vocabularies and propose methods to fix them.

Recommendations for SKOS best practice. Based on our findings on common modelling slips, this thesis also proposes recommendations for SKOS best practice.

Understanding the characteristics of OWL ontologies on the Web. Another major contribution of this thesis is an understanding of OWL ontology characteristics in terms of construct usage, shape and structure, as well as a measure of their sophistication.

Understanding of identifiers and labels in OWL ontologies. This thesis also contributes an understanding of the usage of meaningful identifiers and labels in OWL ontologies.

Systematic procedures and tool for transforming OWL into SKOS. Another major contribution of this thesis is the set of proposed procedures, and the implemented tool, for transforming OWL ontologies into SKOS vocabularies.

1.6

Thesis Outline

The rest of the thesis is organised as follows:

Chapter 2 presents the conceptual framework of the research problem. This framework is used in the remaining chapters of the thesis as a guide in describing the work done in those chapters. The chapter also lays out the early studies of knowledge representation, then presents an analysis of efforts in transforming knowledge from one formalism to another, and identifies the gaps in the existing work.

Chapter 3 describes the characterisation of existing SKOS vocabularies. The work presented in this chapter has been published in [MBS12b, MBS12a].

Chapter 4 describes the characterisation of typical OWL ontologies.

Chapter 5 presents our survey of identifiers and labels in OWL ontologies. The work presented in this chapter has been published in [MBS10].

Chapter 6 proposes principled procedures for OWL to SKOS transformation, together with the issues encountered.

Chapter 7 concludes with a discussion of the research contributions, a review of outstanding issues and proposals for future directions of the research.

1.7

Published Work

The work in this thesis is supported by the following conference and workshop publications:

• [MBS10] Nor Azlinayati Abdul Manaf, Sean Bechhofer, and Robert Stevens. A survey of identifiers and labels in OWL ontologies. In Proceedings of the 7th International Workshop on OWL: Experiences and Directions (OWLED 2010), San Francisco, CA, United States, June 2010.

• [MBS12a] Nor Azlinayati Abdul Manaf, Sean Bechhofer, and Robert Stevens. Common modelling slips in SKOS vocabularies. In Proceedings of OWL: Experiences and Directions Workshop 2012, Heraklion, Crete, Greece, May 27-28, 2012, volume 849 of CEUR Workshop Proceedings. CEUR-WS.org, May 2012.

• [MBS12b] Nor Azlinayati Abdul Manaf, Sean Bechhofer, and Robert Stevens. The current state of SKOS vocabularies on the Web. In Proceedings of the 9th Extended Semantic Web Conference, ESWC 2012, Crete, Greece, May 27-31, 2012, pages 270-284, May 2012.

Chapter 2

Framework for Semantic Web Knowledge Reuse

This chapter introduces the background to the research in this thesis, based on the existing literature that forms its foundation. The work presented in this thesis focuses on reusing knowledge represented in Semantic Web knowledge representation languages, particularly OWL and SKOS. To facilitate the discussion, we first present our proposed framework for Semantic Web knowledge reuse. In Section 2.2 we discuss Semantic Web knowledge representation languages, with particular emphasis on OWL and SKOS. In Section 2.3 we present the notion of a problem space that requires the reuse of existing knowledge in a different format, together with existing work on knowledge reuse and possible cases of knowledge reuse between the OWL and SKOS languages. We conclude the chapter with the research gaps for the research topic and our proposed solutions to these problems.

2.1

The Framework for Semantic Web Knowledge Reuse

Before discussing the existing literature that forms the foundation of this thesis, we first present our proposed framework for Semantic Web knowledge reuse. We use this framework to structure the discussion in the rest of this chapter and of the thesis. Figure 2.5 illustrates the framework, within which we identify different aspects of the research problem. The outermost layer of the framework is the knowledge representation space. Knowledge representation (KR) is the area of Artificial Intelligence (AI) concerned with how knowledge can be represented symbolically and manipulated in an automated way by reasoning programs [BL04].

Figure 2.1: Knowledge representation space

Davis et al. [DS93] argue that the notion of “knowledge representation” can best be understood in terms of five distinct roles that it plays:

1. A KR is a surrogate, a substitute for the thing itself, used to enable an entity to determine consequences by thinking rather than acting, i.e., by reasoning about the world rather than taking action in it.
2. A KR is a set of ontological commitments, i.e., an answer to the question: In what terms should I think about the world?
3. A KR is a fragmentary theory of intelligent reasoning, expressed in terms of three components: (i) the representation’s fundamental conception of intelligent reasoning; (ii) the set of inferences the representation sanctions; and (iii) the set of inferences it recommends.
4. A KR is a medium for efficient computation, i.e., the computational environment in which thinking is accomplished. One contribution to this pragmatic efficiency is supplied by the guidance a representation provides for organizing information so as to facilitate making the recommended inferences.
5. A KR is a medium of human expression, a language in which we say things about the world.

We can view the first four roles introduced by Davis et al. as referring to the content of the representation itself, whereas the last role is more about how the knowledge is expressed using some kind of language. Therefore, within the KR space there is a layer called the knowledge representation language (KRL) space, which consists of various knowledge representation languages. William Woods defined the properties of a KRL as follows: A KR language must unambiguously represent any interpretation of a sentence (logical adequacy), have a method for translating from natural language to that representation, and must be usable for reasoning [Woo75].

In this thesis, we focus on the KRLs used to represent knowledge for use in Semantic Web applications. The two Semantic Web KRLs in which we have a particular interest are OWL and SKOS. In this framework, they are represented as two separate layers within the KRL space, called the OWL space and the SKOS space. Knowledge artefacts are the concrete representation of the knowledge represented by the KRLs. In this thesis we focus on OWL ontologies and SKOS vocabularies, the knowledge artefacts represented in the OWL and SKOS languages, respectively. However, for the knowledge artefacts to be useful, they need to be used within applications to solve problems. The problem space consists of sets of problems that the knowledge artefacts can be used to solve. For each problem, a set of requirements can be defined to determine the effectiveness of the solution to the problem.
If all the problem requirements are satisfied, then the solution can be said to be effective, even though it may not be optimal. Each representation language has its own characteristics, which determine its usability in addressing the problems. These characteristics are closely related to the problem requirements: the representational characteristics are matched against the problem requirements. Once we have determined the representation whose characteristics fulfil most (if not all) of the problem requirements, we can map the problem into the knowledge representation space of choice. In general, a problem can be mapped into more than one knowledge representation space, depending on whether the characteristics of the knowledge representation match the problem requirements. Mapping the problem into the knowledge representation language produces an artefact, which is the solution to the problem represented within the selected knowledge representation space.

2.2

Semantic Web knowledge representation languages

One of the areas that requires the representation of knowledge is the Semantic Web. As mentioned in the previous section, for knowledge to be represented concretely for use within applications, it needs to be described using some kind of logic or through the use of knowledge representation languages. In this section we focus on KR languages that are Semantic Web oriented (RDF, RDFS, OWL and SKOS). The Semantic Web Rule Language (SWRL)¹ is another KR language for the Semantic Web that can be used to express rules as well as logic, combining OWL DL or OWL Lite with a subset of the Rule Markup Language (such as Datalog)².

2.2.1

Resource Description Framework (RDF)

RDF³ is a data model used as a general method for the conceptual description or modelling of information on the Semantic Web. The RDF data model is similar to classic conceptual modelling approaches such as entity-relationship or class diagrams, as it is based upon the idea of making statements about resources (in particular web resources) in the form of subject-predicate-object expressions, known as triples in RDF terminology. The subject is described by the predicate (the property) and the object (the value that the property takes). For example, in RDF we can make statements like “Yati is student of Robert”, where Yati is the subject, “is student of” is the predicate and Robert is the object. This can be shown as a graph, as in Figure 2.2.

[Graph: Yati --student_of--> Robert]

Figure 2.2: An RDF graph for statement Yati is student of Robert.

¹ http://www.w3.org/Submission/SWRL/
² This thesis does not cover any work on the SWRL language.
³ http://www.w3.org/RDF/

RDF uses URIs [BLFM05] to identify entities (subjects, predicates or objects), so graphs can be combined over the internet. RDF is used in many resources, and the SPARQL language [PS06] offers a powerful, simple and flexible way of querying RDF graphs.
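As a minimal sketch of the triple model described in this section, an RDF graph can be treated as a set of subject-predicate-object tuples whose members are URIs. The namespace http://example.org/, the second graph and the helper match are hypothetical and are introduced only for illustration; the pattern matching shown here is a simplification and is not SPARQL.

```python
# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

# A graph is modelled as a set of (subject, predicate, object) triples.
graph_a = {(EX + "Yati", EX + "student_of", EX + "Robert")}
# A second, invented graph that happens to mention the same URI for Robert.
graph_b = {(EX + "Robert", EX + "works_at", EX + "Manchester")}

# Because entities are identified by URIs, combining graphs is a set union.
merged = graph_a | graph_b


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Who is a student of Robert?
students = [t[0] for t in match(merged, p=EX + "student_of", o=EX + "Robert")]
```

Representing the graph as a set makes merging trivial and removes duplicate statements, which mirrors the way RDF graphs from different sources can be combined.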

2.2.2

Resource Description Framework Schema (RDFS)

RDF Schema⁴ provides constructs for grouping RDF resources into classes. The rdf:type construct is used to indicate that a resource belongs to a given class, where classes are themselves declared using rdfs:Class. The rdfs:subClassOf construct is used to indicate that a class is a subclass of another class, which forms a class hierarchy. RDFS also provides constructs for creating property hierarchies through the use of rdf:Property and rdfs:subPropertyOf. RDFS is the first step towards an ontology language, as it allows the expression of general attributes of sets (classes) of entities rather than only working with concrete entities in RDF triples.
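The effect of rdfs:subClassOf on class membership can be sketched as follows. This is a simplified illustration under our own assumptions, not an RDFS reasoner: the prefixed names ex:Duck, ex:Bird, ex:Animal and ex:Dewey, and the helpers superclasses and types_of, are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Hypothetical triples: a small class hierarchy and one typed resource.
triples = {
    ("ex:Duck", SUBCLASS, "ex:Bird"),
    ("ex:Bird", SUBCLASS, "ex:Animal"),
    ("ex:Dewey", RDF_TYPE, "ex:Duck"),
}


def superclasses(cls, triples):
    """All classes reachable from cls by following rdfs:subClassOf."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS and o not in found:
                found.add(o)
                stack.append(o)
    return found


def types_of(resource, triples):
    """Asserted types of a resource plus the types implied by the hierarchy."""
    direct = {o for s, p, o in triples if s == resource and p == RDF_TYPE}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return direct | inferred
```

Calling types_of("ex:Dewey", triples) returns the asserted class together with every class above it in the hierarchy.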

2.2.3

Web Ontology Language (OWL)

Since the Web Ontology Language, OWL, became a W3C (World Wide Web Consortium)⁵ Recommendation in February 2004, it has become the most widely used ontology language, adopted by both academia and industry for defining and instantiating Web ontologies. Borrowed from philosophy, the term ontology is used to refer to a set of precise descriptive statements about the kinds of entities in a particular domain and how they are related, which is described in terms of individuals, classes and properties. The formal semantics of OWL specify how to derive the logical consequences of an ontology, i.e., facts that are not literally present in the ontology but are entailed by it. These entailments can be based on a single document or on multiple distributed documents combined using the defined OWL semantics.

2.2.3.1

Elements of OWL ontologies

This section discusses the key elements of OWL ontologies, which include basic elements such as classes, instances and class hierarchies, and some of the advanced elements such as complex classes and property restrictions [MvH04, HKR+04].

⁴ http://www.w3.org/TR/rdf-schema/
⁵ http://www.w3.org/

For each element, we discuss its function and give one or two examples to demonstrate its usage. We use the Manchester Syntax for the examples. All examples are based on the people.owl ontology⁶.

⁶ http://owl.cs.manchester.ac.uk/2009/07/sssw/people

Classes and Instances

Classes and instances are the basic elements of an OWL ontology. Instances, which are also known as individuals, represent objects in the domain of interest. Classes represent groups of individuals that have some common attributes. In the following example, we want to represent the concept duck. This concept can be represented as a class, since it denotes the set of objects comprised by the concept duck. We then want to represent a particular duck called Dewey. In this case, Dewey can be represented as an individual that belongs to (or “is an instance of”) the class Duck. This can be done as follows:

Example 1:
    Class: Duck
    Individual: Dewey
        Types: Duck

The following statement conveys the fact that Dewey (the same Dewey as in the previous example) is an instance of the class Animal. This example shows that class membership is not exclusive: an individual may belong to more than one class at the same time.

Example 2:
    Class: Animal
    Individual: Dewey
        Types: Animal

Class Hierarchies

Logically, the two classes defined in the previous examples are related: Animal is more general than Duck. This means that any individual that is a duck must also be an animal. In an OWL ontology, this knowledge about the relationship between classes has to be explicitly represented using a subclass axiom, as follows:

Example 3:
    Class: Duck
        SubClassOf: Animal

Since an OWL ontology is used to represent a domain of interest, every concept in the domain should correspond to some class. OWL defines a class known as owl:Thing, which represents the universe of discourse and is the root of the taxonomic tree. Every individual in an OWL ontology is a member of the class owl:Thing, and each user-defined class is implicitly a subclass of owl:Thing. OWL also defines the empty class, owl:Nothing. Besides the subclass relationship between classes, two or more classes may refer to the same set. In this case, OWL provides a mechanism to represent the equivalence relationship between the classes, as follows:

Example 4:
    Class: Animal
        EquivalentTo: Fauna

Example 4 states that the class Animal is equivalent to the class Fauna, which means that every instance of the class Animal is also an instance of the class Fauna, and vice versa. Two classes are considered equivalent if they contain exactly the same individuals. It is important to note that stating that Animal and Fauna are equivalent classes is the same as stating both that Animal is a subclass of Fauna and that Fauna is a subclass of Animal.

Class Disjointness

In an OWL ontology, classes are not assumed to be mutually exclusive unless this is explicitly stated. For example, consider two classes Duck and Cat, which conceptually refer to the class of all ducks and the class of all cats, respectively. To a human, an individual that is an instance of the class Duck should not be an instance of the class Cat. However, this information is part of our background knowledge and has to be explicitly stated in the system. This type of relationship between classes is called class disjointness and can be established as follows:

Example 5:
    DisjointClasses: Duck, Cat

It is important to state the disjointness between classes so that the correct information can be inferred. With the above statement added to the ontology, it can be inferred from our example that Dewey is not a Cat.

Object Properties

Besides classes and individuals, another basic element of an ontology is the property. OWL defines two types of property, known as object properties and datatype properties. In this section we discuss object properties. An object property is used to relate objects (individuals) to one another. For example,

Example 6:
    Class: Person
    ObjectProperty: hasPet
    Individual: Walt
        Types: Person
        Facts: hasPet Dewey

In Example 6, an individual, Walt, is defined as an instance of the class Person. Walt is related to Dewey through the property hasPet. Besides stating that two individuals are directly related, a property can also be used to state that two individuals are not connected, such as saying that Dewey is not Joe’s pet, as in Example 7. This type of statement is called a negative property assertion, and it states something that we know is not true.

Example 7:
    Individual: Joe
        Facts: not hasPet Dewey

Datatype Properties

As stated in the previous section, the other type of property defined in OWL is the datatype property. It is used to relate individuals to data values such as birth date, age, address and other data values. As with object properties, a datatype property can also be used to state that an individual is not connected to a data value. The following examples state that Joe’s age is 25 and that Peter’s age is not 25.

Example 8:
    Individual: Joe
        Facts: hasAge "25"^^xsd:integer

Example 9:
    Individual: Peter
        Facts: not hasAge "25"^^xsd:integer

In addition, a domain and range can be stated for datatype properties in the same way as for object properties.

Property Hierarchies

Besides class hierarchies, OWL also provides a mechanism to specify property hierarchies, which is similar to the mechanism for stating class hierarchies. This can be done as follows:

Example 10:
    ObjectProperty: hasPet
        SubPropertyOf: likes

This example states that the property hasPet is a subproperty of the property likes, meaning that whenever A is known to have a pet B, it is also known that A likes B. However, the converse does not hold: if A likes B, it does not follow that A has a pet B. The syntactic shortcut for property equivalence is similar to that for class equivalence.
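The inferences licensed by the subclass, disjointness and subproperty axioms of Examples 3, 5 and 10 can be sketched as a few simple rules over the assertions of Examples 1 and 6. This is only an illustration under simplifying assumptions and not how an OWL reasoner is implemented; the dictionary layout and the function names closure, inferred_types, cannot_be and inferred_facts are our own.

```python
# Axioms transcribed from Examples 3, 5 and 10.
SUBCLASS_OF = {"Duck": {"Animal"}}
DISJOINT = {frozenset({"Duck", "Cat"})}
SUBPROPERTY_OF = {"hasPet": {"likes"}}

# Assertions transcribed from Examples 1 and 6.
types = {"Dewey": {"Duck"}, "Walt": {"Person"}}
facts = {("Walt", "hasPet", "Dewey")}


def closure(start, parents):
    """Everything reachable from start by following the parents mapping."""
    seen, todo = set(), [start]
    while todo:
        for parent in parents.get(todo.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen


def inferred_types(individual):
    """Asserted classes plus all of their superclasses."""
    result = set(types.get(individual, ()))
    for cls in list(result):
        result |= closure(cls, SUBCLASS_OF)
    return result


def cannot_be(individual):
    """Classes the individual is excluded from by a disjointness axiom."""
    excluded = set()
    for pair in DISJOINT:
        for cls in inferred_types(individual) & pair:
            excluded |= pair - {cls}
    return excluded


def inferred_facts():
    """Asserted facts plus those implied by the property hierarchy."""
    result = set(facts)
    for s, p, o in facts:
        for super_property in closure(p, SUBPROPERTY_OF):
            result.add((s, super_property, o))
    return result
```

With these definitions, Dewey is inferred to be an Animal and excluded from Cat, and Walt is inferred to like Dewey, while nothing is inferred in the other direction from likes to hasPet.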

Domain and Range Restrictions

Properties relate individuals from a particular domain to individuals from a particular range. This additional information allows further conclusions about the individuals to be inferred. In our example, the statement Walt hasPet Dewey implies that Walt is a person and Dewey is an animal. This correspondence can be stated as follows:

Example 11:
    ObjectProperty: hasPet
        Domain: Person
        Range: Animal

With these additional axioms, given a statement that relates Mike and Fifi via the property hasPet, a reasoner would infer that Mike is a person and Fifi is an animal.

Equality and Inequality of Individuals

By default, OWL does not assume that individuals with different names are different individuals (there is no unique name assumption). Therefore, if we want to state that two individuals are distinct, or that two names refer to the same individual, this information has to be explicitly specified in the ontology. OWL provides two mechanisms: one to state that two individuals are different, and one to state that two individuals are the same, as follows:

Example 12: To state that two individuals are different
    Individual: Dewey
        DifferentFrom: Fifi

Example 13: To state that two individuals are the same
    Individual: Joe
        SameAs: John

With the statement in Example 13 added to the ontology, a reasoner could infer that any information given about the individual Joe also holds for the individual John.
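The following minimal sketch illustrates the two inferences just described: class membership derived from the domain and range of hasPet (Example 11), and facts shared between individuals declared to be the same (Example 13, using Joe’s age from Example 8). The data layout and the function names types_from_domain_and_range and with_same_as are hypothetical; a real reasoner handles many more cases.

```python
# Domain and range axioms transcribed from Example 11.
DOMAIN = {"hasPet": "Person"}
RANGE = {"hasPet": "Animal"}

# Facts from the running examples; the age is held as a plain integer here.
facts = {("Mike", "hasPet", "Fifi"), ("Joe", "hasAge", 25)}
# Equality assertion transcribed from Example 13.
same_as = {("Joe", "John")}


def types_from_domain_and_range(facts):
    """Infer class membership from the domain and range of each property."""
    inferred = {}
    for s, p, o in facts:
        if p in DOMAIN:
            inferred.setdefault(s, set()).add(DOMAIN[p])
        if p in RANGE:
            inferred.setdefault(o, set()).add(RANGE[p])
    return inferred


def with_same_as(facts, same_as):
    """Copy every fact about an individual to the individuals it equals."""
    result = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in same_as:
            for old, new in ((a, b), (b, a)):
                for s, p, o in list(result):
                    fact = (new if s == old else s, p, new if o == old else o)
                    if fact not in result:
                        result.add(fact)
                        changed = True
    return result
```

The loop in with_same_as repeats until no new fact is produced, so that equalities are applied in both directions.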

Complex Classes

Complex classes can be constructed by combining two or more atomic classes using the logical class constructors and, or and not provided by OWL, which correspond to (class) intersection, union and complement, respectively. As in set theory, the intersection of two classes consists of those individuals that are instances of both classes. In the following example, the class Kid consists of exactly those objects that are instances of both Person and Young. Therefore, it can be inferred that all instances of the class Kid are also instances of the class Person.

Example 14:
    Class: Kid
        EquivalentTo: Person and Young
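The set-theoretic reading of the and, or and not constructors can be illustrated with ordinary sets. The individuals and the finite domain are hypothetical, and taking the complement relative to a fixed, fully known domain is a closed-world simplification that OWL itself does not make.

```python
# Hypothetical finite domain and class extensions, for illustration only.
domain = {"Ann", "Bob", "Cara", "Dewey"}
person = {"Ann", "Bob", "Cara"}
young = {"Ann", "Dewey"}

kid = person & young              # Person and Young (intersection)
person_or_young = person | young  # Person or Young (union)
not_young = domain - young        # not Young (complement within the domain)
```

Since kid is built as an intersection, it is necessarily a subset of person, which corresponds to the inference that every Kid is a Person.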

CHAPTER 2. FRAMEWORK FOR SEMANTIC WEB KNOWLEDGE REUSE 37 The union of two classes contains each individual which is an instance of at least one of the classes. The following example states that each instance of the class Parent is either an instance of the class Mother or the class Father. Example 15: Class: Parent EquivalentTo: Mother or Father The complement of a class contains exactly those individuals which are not members of the class itself. The following example defines a class adult using class complement and intersection. Example 16: Class: Adult EquivalentTo: Person and not Young Property Restrictions Property restriction is a special kind of class description that describes an anonymous class (a class of all individuals that satisfy the restriction). OWL defines three types of property restrictions which are quantifier restrictions, cardinality restrictions and hasValue restrictions. Quantifier restrictions can be further divided into existential quantification and universal quantification. Existential quantification refers to a class of individuals that involve in at least one relationship via a particular property to another individuals that are instances of a certain class. Consider the following example about class PetOwner. Example 17: Class: PetOwner EquivalentTo: Person and hasPet some Animal This means that for an individual to be an instance of class PetOwner, that individual must be an instance of class Person and owns at least one individual that belongs to class Animal. In this case, we can infer that if somebody said that Jim is a pet owner, we can infer that Jim owns at least one animal even if we do not know what type of animal he has. Universal quantification refers to a class of individual for which all related individuals must be an instance of a specified class. Consider the following example about class Cat Lady.

CHAPTER 2. FRAMEWORK FOR SEMANTIC WEB KNOWLEDGE REUSE 38

Example 18: Class: CatLady EquivalentTo: Female and hasPet only Cat This means that for an individual to be an instance of class CatLady, that individual must be an instance of class Female and if she has any pets, that pets must only be instances of class Cat. In this case, for an individual to be an instance of class CatLady, she may or may not have a pet which must only be from the class Cat. In the previous example, it is possible for an individual that is Female and does not own any cat to be classified as an instance of class CatLady. However, if the intention of this class CatLady is someone who is female and has at least one pet that is a cat, and does not own any other pets except cats, the statement should be as follows: Example 19: Class: CatLady EquivalentTo: Female and hasPet only Cat and hasPet some Cat Another type of property restriction is hasValue restriction which is used to describe a set of individuals that are related to one particular individual. For example, we could define the class of Minnie’s cats: Example 20: Class: MinniesCats EquivalentTo: isPetOf value Minnie Property Cardinality Restrictions The third type of property restriction is cardinality restriction which is used to describe the class of individuals that have at least, ay most or exactly a specified number of relationships with other individuals or datatype values. Maximum cardinality restriction (max) specifies the maximum number of relationships of a specified property that an individual must participate in. Minimum cardinality restriction (min) is used to describe the minimum number of relationships of a specified property that an individual can participate in. The last type of cardinality restriction is called Qualified cardinality restriction (exactly). It is used to describe the exact number of relationships of a specified property that an individual can participate in. 
The following example uses a minimum cardinality restriction to state that Minnie has at least three cats as pets.

Example 21:
    Individual: Minnie
        Types: hasPet min 3 Cat

To state a maximum or an exact cardinality restriction instead, the keyword min in the above example is replaced with max or exactly, respectively.
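The set-based reading of the restrictions in Examples 17 to 21 can be sketched in Python. This is an illustration only, under stated assumptions: the individuals Jim, Ann, Rex, Tom, Felix and Luna, the data and all function names are introduced here and do not come from the thesis. The sketch also evaluates the restrictions against a small, fully known model, whereas an OWL reasoner works under the open-world assumption and does not assume that different names denote different individuals.

```python
from typing import Dict, Set

# Toy, fully known model (hypothetical data, not taken from the thesis).
CLASSES: Dict[str, Set[str]] = {
    "Person": {"Jim", "Minnie", "Ann"},
    "Female": {"Minnie", "Ann"},
    "Animal": {"Rex", "Tom", "Felix", "Luna"},
    "Cat": {"Tom", "Felix", "Luna"},
}
# hasPet relationships; Ann has no recorded pets.
HAS_PET: Dict[str, Set[str]] = {
    "Jim": {"Rex"},
    "Minnie": {"Tom", "Felix", "Luna"},
}


def fillers(prop: Dict[str, Set[str]], individual: str) -> Set[str]:
    """Individuals related to `individual` along `prop`."""
    return prop.get(individual, set())


def some(prop: Dict[str, Set[str]], cls: str, individual: str) -> bool:
    """Existential restriction: at least one filler belongs to `cls`."""
    return bool(fillers(prop, individual) & CLASSES[cls])


def only(prop: Dict[str, Set[str]], cls: str, individual: str) -> bool:
    """Universal restriction: every filler belongs to `cls` (true if there are none)."""
    return fillers(prop, individual) <= CLASSES[cls]


def count(prop: Dict[str, Set[str]], cls: str, individual: str) -> int:
    """Number of fillers in `cls`; min/max/exactly compare against this number."""
    return len(fillers(prop, individual) & CLASSES[cls])


def is_pet_owner(x: str) -> bool:
    """Example 17: Person and hasPet some Animal."""
    return x in CLASSES["Person"] and some(HAS_PET, "Animal", x)


def is_cat_lady_18(x: str) -> bool:
    """Example 18: Female and hasPet only Cat."""
    return x in CLASSES["Female"] and only(HAS_PET, "Cat", x)


def is_cat_lady_19(x: str) -> bool:
    """Example 19: Example 18 plus hasPet some Cat."""
    return is_cat_lady_18(x) and some(HAS_PET, "Cat", x)


print(is_pet_owner("Jim"))                  # True: Rex is an Animal
print(is_cat_lady_18("Ann"))                # True: no pets, so "only Cat" holds vacuously
print(is_cat_lady_19("Ann"))                # False: "some Cat" needs at least one cat
print(count(HAS_PET, "Cat", "Minnie") >= 3) # True: Example 21, hasPet min 3 Cat
```

Ann satisfies the definition of Example 18 but not that of Example 19, which is exactly the difference between the two definitions discussed above.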

2.2.4 Simple Knowledge Organization System (SKOS)

SKOS became a W3C Recommendation in August 2009. It is a language designed for the representation of semi-formal KOS such as thesauri, classification schemes and taxonomies. SKOS enables a knowledge organization system to be expressed as machine-readable data that can be exchanged between computer applications and published on the Web. The basic element in SKOS is a concept, which can be viewed as a unit of thought (an idea, meaning or object) that is subjective and independent of the term used to label it. Each concept can be linked to one or more lexical labels that refer to it in natural language, through prefLabel, altLabel or hiddenLabel. Concepts are also semantically linked to each other through hierarchical broader/narrower relations and associative related relations. The SKOS data model offers a standard, low-cost migration path for porting existing KOS to the Semantic Web, allowing better reusability, interoperability and sharing.

2.2.4.1 Elements of SKOS

This section discusses the key elements of SKOS vocabularies: the concept class and concept schemes, lexical labels, and semantic relations [MB09b]. For each element, we describe its function and give one or two examples of its usage. The examples are written in the Manchester Syntax and are based on the W3C SKOS Reference document (footnote 7).

7 http://www.w3.org/TR/skos-reference/

skos:Concept Class

skos:Concept is defined as a class. A SKOS concept refers to an idea or notion, a unit of thought, and is defined as an instance of the class skos:Concept. The following example states that C1 is a SKOS concept (i.e., an instance of skos:Concept).

Example 1:
    Individual: C1
        Types: Concept

Note that the specification [MB09b] makes no additional statement about the formal relationship between the class skos:Concept and the class of OWL classes or OWL properties. This decision was made to allow users to explore different design patterns for working with combinations of SKOS and OWL. The following example is therefore consistent with the SKOS data model.

Example 2:
    Individual: C1
        Types: Concept and Class

Concept Schemes

A SKOS concept scheme is a collection of one or more SKOS concepts together with the links (semantic relationships) between them. Concept schemes are useful when dealing with data that describes several different KOS, or with data from an unknown source. skos:ConceptScheme is defined as an instance of owl:Class. Three properties are associated with concept schemes, namely skos:inScheme, skos:hasTopConcept and skos:topConceptOf, all defined as instances of owl:ObjectProperty.

The property skos:hasTopConcept indicates a top-level concept in a concept scheme. The following example states that concept C1 is a top-level concept of the concept scheme CS1.

Example 3:
    Individual: CS1
        Fact: hasTopConcept C1

The property skos:topConceptOf is the inverse of the property skos:hasTopConcept.

Example 4:
    Individual: C1
        Fact: topConceptOf CS1

The property skos:inScheme is defined to be a super-property of skos:topConceptOf. It indicates that a particular SKOS concept belongs to the specified concept scheme.

Example 5:
    Individual: Cx
        Fact: inScheme CS1

Lexical Labels

Lexical labels provide human-readable, natural-language representations of SKOS concepts. SKOS defines three types of lexical label that can be associated with a SKOS concept, namely preferred labels (skos:prefLabel), alternative labels (skos:altLabel) and hidden labels (skos:hiddenLabel). The preferred label is usually the label most commonly used in natural language for a SKOS concept. Only one preferred label per language tag is allowed for a SKOS concept. If more than one term can be used to refer to the same SKOS concept, only one of them is chosen as the preferred label and the remaining terms become alternative labels. Both preferred and alternative labels are meant to convey the meaning of the concept as closely as possible.

Hidden labels are usually used to record commonly misspelled words for a particular SKOS concept. This is useful when a text-based search function is used to interact with a knowledge organization system. When a user mistakenly enters a misspelled word while searching for a concept, the query can be matched against the hidden labels so that the user still retrieves the required concept. Hidden labels are visible to the search system only, not to the user.

The following example states different types of lexical labels associated with concept C1.

Example 6:
    Individual: C1
        Fact: prefLabel "animals"@en
        Fact: altLabel "fauna"@en
        Fact: hiddenLabel "aminals"@en
        Fact: prefLabel "animaux"@fr
        Fact: altLabel "faune"@fr
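The roles of the three label types can be illustrated with a short Python sketch. It stores the labels of Example 6 as plain tuples, checks the rule of at most one preferred label per language, and matches a search string against all label types while displaying only the preferred label. The data layout and the function names are assumptions made for illustration; they are not part of SKOS or of any particular SKOS library.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (label type, text, language tag) for concept C1, following Example 6.
Label = Tuple[str, str, str]
LABELS: Dict[str, List[Label]] = {
    "C1": [
        ("prefLabel", "animals", "en"),
        ("altLabel", "fauna", "en"),
        ("hiddenLabel", "aminals", "en"),
        ("prefLabel", "animaux", "fr"),
        ("altLabel", "faune", "fr"),
    ]
}


def pref_label_violations(labels: Dict[str, List[Label]]) -> List[Tuple[str, str]]:
    """Return (concept, language) pairs that carry more than one prefLabel."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for concept, entries in labels.items():
        for kind, _text, lang in entries:
            if kind == "prefLabel":
                counts[(concept, lang)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


def search(labels: Dict[str, List[Label]], query: str, lang: str) -> List[Tuple[str, Optional[str]]]:
    """Match `query` against preferred, alternative and hidden labels.

    Each hit is returned with the preferred label in `lang`, so a hidden
    label can trigger a match without ever being shown to the user.
    """
    wanted = query.strip().lower()
    hits = []
    for concept, entries in labels.items():
        if any(text.lower() == wanted and l == lang for _kind, text, l in entries):
            display = next(
                (text for kind, text, l in entries if kind == "prefLabel" and l == lang),
                None,
            )
            hits.append((concept, display))
    return hits


print(pref_label_violations(LABELS))    # []
print(search(LABELS, "aminals", "en"))  # [('C1', 'animals')]
print(search(LABELS, "faune", "fr"))    # [('C1', 'animaux')]
```

The misspelled query "aminals" still retrieves C1, but the result is presented under its preferred label "animals".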

Semantic Relations

SKOS provides two types of semantic relation, namely hierarchical and associative.

A hierarchical relation links two concepts of which one is more general than the other. Two properties are defined for this type of relation, skos:broader and skos:narrower. The triple A skos:broader B means that concept B is a broader concept than concept A. The converse reading applies to skos:narrower, since it is the inverse property of skos:broader. Note that both skos:broader and skos:narrower are used to assert a direct hierarchical link between two SKOS concepts; thus the properties are not declared as transitive. For applications that need both direct and indirect hierarchical links between concepts, for example to improve search recall through query expansion, SKOS provides two additional properties, skos:broaderTransitive and skos:narrowerTransitive. These properties are not used to make assertions, but to infer the transitive closure of the hierarchical links (that is, to access both direct and indirect hierarchical links between concepts).

An associative relation links two concepts that are related, but not in a hierarchical manner. For this type of relation, SKOS provides the property skos:related to assert an associative link between concepts. Note that skos:related is a symmetric property and is disjoint with skos:broaderTransitive.

The following example states a direct hierarchical link between concept C1 and concept C2 (where C2 is broader than C1), and an associative link between concept C1 and concept C3.

Example 7:
    Individual: C1
        Fact: broader C2
        Fact: related C3
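The difference between asserted direct links and the inferred transitive closure can be sketched in Python. The sketch derives the skos:broaderTransitive links of a concept from direct skos:broader assertions and checks the constraint that skos:related is disjoint with skos:broaderTransitive. The concept C4, the data and the function names are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Set, Tuple

# Direct skos:broader assertions: C1 broader C2, C2 broader C4 (C4 is hypothetical).
BROADER: Dict[str, Set[str]] = {"C1": {"C2"}, "C2": {"C4"}}
# Asserted skos:related pairs; the property is symmetric, so order is irrelevant.
RELATED: List[Tuple[str, str]] = [("C1", "C3")]


def broader_transitive(broader: Dict[str, Set[str]], concept: str) -> Set[str]:
    """All concepts reachable from `concept` through one or more broader links."""
    seen: Set[str] = set()
    stack = list(broader.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in seen:  # also guards against cycles in the data
            seen.add(current)
            stack.extend(broader.get(current, ()))
    return seen


def related_conflicts(
    related: List[Tuple[str, str]], broader: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Related pairs that are also linked hierarchically, directly or indirectly."""
    return [
        (a, b)
        for a, b in related
        if b in broader_transitive(broader, a) or a in broader_transitive(broader, b)
    ]


print(sorted(broader_transitive(BROADER, "C1")))      # ['C2', 'C4']
print(related_conflicts(RELATED, BROADER))            # []
print(related_conflicts([("C4", "C1")], BROADER))     # [('C4', 'C1')]
```

Only C2 is asserted as broader than C1; C4 appears solely in the inferred closure, which is the part a query expansion component would typically consult.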

2.3 Semantic Web Knowledge Reuse

2.3.1 Use of SKOS vocabulary

The SKOS data model can be applied in different applications, for example as an indexing tool, a searching tool or a navigation tool.

2.3.1.1 SKOS vocabulary as an indexing tool

When the thesaurus was first developed, it was used mostly as an indexing tool for large technical document collections: the indexer used it as a source of indexing terms to attach to the database or catalogue records for those documents [Bro06]. For thesaurus-aided indexing, the SKOS constructs that should be considered are:

• skos:prefLabel indicates the preferred term for a particular concept. Users should select one or more skos:prefLabel values to index the text or information item.
• skos:altLabel indicates the synonyms or alternative terms that can be used to refer to a particular concept. Users may consult these labels to identify the suitable skos:prefLabel with which to index the text.

2.3.1.2 SKOS vocabulary as a searching tool

According to [Bro06], a controlled vocabulary can be used as a search tool in two different ways: i) it can be made visible to end-users as an aid to "query formulation", or ii) it can be 'embedded' in the search software and used as background knowledge to support "query expansion".

Query formulation

In the query formulation process, the indexing vocabulary is made visible to end-users or searchers through hypertext links. This enables users to view the vocabulary, to see which terms have been used in indexing, to select appropriate terms for framing a search, and to modify searches through the cross-references in the vocabulary [Bro06]. For example, in the Library and Information Science Abstracts (LISA) database, the same thesaurus that is used for indexing can be accessed by the searcher and used for searching. As shown in Figure 2.3, the hypertext links take the searcher to the full thesaurus entry, with its cross-references to synonyms and to broader, narrower and related terms (Figure 2.3(b)). Users may select any terms from the thesaurus, including synonyms and broader, narrower or related terms, to formulate their query. Moreover, options on the left-hand side of the screen support more complicated queries by incorporating the boolean operators AND and OR.

[Figure 2.3: Examples of LISA Thesaurus. (a) LISA thesaurus indexing term view for query formulation. (b) LISA thesaurus indexing term view including cross-references to synonyms, broader, narrower and related terms.]

For thesaurus-aided query formulation, the SKOS constructs that should be considered are:

• skos:prefLabel indicates the preferred term for a particular concept. Users should select one or more skos:prefLabel values to start the query.
• skos:altLabel indicates the synonyms or alternative terms that can be used to refer to a particular concept. Users may include one or more skos:altLabel values to broaden the search.
• skos:Concept represents the "element" or node that connects and relates all the information for a particular concept.
• skos:broader represents the connections between the current node and the nodes at a higher level in the hierarchy. Users may include the skos:prefLabel and/or skos:altLabel of a node that is more general in meaning than the search node to broaden the search.
• skos:narrower represents the connections between the current node and the nodes at a lower level in the hierarchy. Users may include the skos:prefLabel and/or skos:altLabel of a node that is more specific in meaning than the search node to narrow the search.
• skos:related defines the connections between the current node and other nodes that are associated with it. Users may include the skos:prefLabel and/or skos:altLabel of a node that is related in meaning to the search node to broaden the search.

Query expansion

In query expansion, the thesaurus is used to improve retrieval without being made visible to end-users. The thesaurus is embedded in the search software, and the terms being searched are matched against the controlled vocabulary. The search can be 'expanded' by broadening the topic, by adding narrower topics, or by searching for synonyms and related terms; this is usually the process behind many intelligent searching systems. For query expansion, the SKOS constructs that should be considered are:

• skos:prefLabel indicates the preferred term for a particular concept. Users should select one or more skos:prefLabel values to start the query.
• skos:altLabel indicates the synonyms or alternative terms that can be used to refer to a particular concept. Users may include one or more skos:altLabel values to broaden the search.
• skos:Concept represents the "element" or node that connects and relates all the information for a particular concept.
• skos:broader represents the connections between the current node and the nodes at a higher level in the hierarchy. Users may include the skos:prefLabel and/or skos:altLabel of a node that is more general in meaning than the search node to broaden the search.
• skos:narrower represents the connections between the current node and the nodes at a lower level in the hierarchy. Users may include the skos:prefLabel and/or skos:altLabel of a node that is more specific in meaning than the search node to narrow the search.
• skos:related defines the connections between the current node and other nodes that are associated with it. Users may include the skos:prefLabel and/or skos:altLabel of a node that is related in meaning to the search node to broaden the search.
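As a rough illustration of query expansion with an embedded vocabulary, the following Python sketch maps a search term to a concept through its preferred or alternative labels and then adds the labels of narrower and, optionally, related concepts. The vocabulary content, the dictionary layout and the function names are assumptions made for this example; a real search system would combine the expanded terms with its own ranking and matching logic.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory vocabulary: one entry per SKOS concept.
VOCAB: Dict[str, Dict[str, List[str]]] = {
    "C1": {"pref": ["animals"], "alt": ["fauna"], "narrower": ["C2"], "related": ["C3"]},
    "C2": {"pref": ["cats"], "alt": ["felines"], "narrower": [], "related": []},
    "C3": {"pref": ["zoos"], "alt": [], "narrower": [], "related": []},
}


def concept_for(term: str) -> Optional[str]:
    """Find the concept whose preferred or alternative label equals `term`."""
    wanted = term.strip().lower()
    for concept_id, entry in VOCAB.items():
        if wanted in entry["pref"] or wanted in entry["alt"]:
            return concept_id
    return None


def terms_of(concept_id: str) -> List[str]:
    """Preferred label first, then the alternative labels."""
    entry = VOCAB[concept_id]
    return entry["pref"] + entry["alt"]


def expand(term: str, include_related: bool = False) -> List[str]:
    """Expand a search term with synonyms, narrower terms and optionally related terms."""
    concept_id = concept_for(term)
    if concept_id is None:
        return [term]  # unknown term: search with it unchanged
    expanded = terms_of(concept_id)
    for narrower_id in VOCAB[concept_id]["narrower"]:
        expanded += terms_of(narrower_id)
    if include_related:
        for related_id in VOCAB[concept_id]["related"]:
            expanded += terms_of(related_id)
    return expanded


print(expand("fauna"))                        # ['animals', 'fauna', 'cats', 'felines']
print(expand("fauna", include_related=True))  # ['animals', 'fauna', 'cats', 'felines', 'zoos']
print(expand("dogs"))                         # ['dogs']
```

Adding narrower terms keeps the expanded query within the scope of the original concept, whereas related terms widen it, which is why they are optional in this sketch.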

2.3.1.3 SKOS vocabulary as a navigation tool

The SKOS data model can be used in a navigation system in several ways. One way is to navigate the SKOS document itself, that is, to navigate through the SKOS vocabulary to find the specific concepts required by users. Another possible use of SKOS in navigation is document navigation [JSB+08, BYS+08a], in which one or more SKOS vocabularies are used as additional background resources to help navigate documents. An example of this type of application is COHSE (footnote 8). In both cases, the important SKOS constructs that should be considered are:

• skos:Concept represents the "element" or navigation node, that is, each "place" to which it is possible to go, together with the portion of information that is relevant for navigation.
• skos:broader represents the connections between the current node and the nodes at a higher level in the hierarchy.
• skos:narrower represents the connections between the current node and the nodes at a lower level in the hierarchy.
• skos:related defines the connections between the current node and other nodes that are associated with it.
• *skos:ConceptScheme groups all concepts that belong to the same source or KOS.
• *skos:inScheme indicates that a particular concept belongs to a concept scheme.
• *skos:mappingRelation, which includes the SKOS mapping properties skos:closeMatch, skos:exactMatch, skos:broadMatch, skos:narrowMatch and skos:relatedMatch, states the mapping between nodes in different concept schemes. This is essential when the user is navigating one concept scheme and wants to know whether the same concept exists in other concept schemes.

The constructs marked with (*) are optional in the navigation model: if the user deals with only one concept scheme, these constructs need not be included in the vocabulary. Figure 2.4 shows the navigation model using SKOS vocabularies.
8 http://cohse.cs.manchester.ac.uk/

[Figure 2.4: Thesaurus as navigation tool. (a) Navigation model using SKOS vocabularies. (b) AAT material facet hierarchy.]

2.3.2 OWL and SKOS transformation framework

2.3.2.1 Characteristics of OWL and SKOS

OWL adopts an object-oriented model in which the domain is described in terms of individuals, classes and properties. Individuals are the basic elements of the domain, classes are sets of individuals with similar characteristics, and properties are relationships between pairs of individuals [Hor08]. SKOS, in contrast, denotes each element of the world with a concept and relates these concepts through one of three basic semantic relations, namely broader, narrower and related [IS08]. Each language has characteristics that distinguish it from the other. Table 2.1 lists a few key features or characteristics of the OWL and SKOS languages. The lists are based on an analysis of the use cases presented in the 'OWL 2 Web Ontology Language: New Features and Rationale' [GW08], 'OWL Use Cases and Requirements' [Hef04] and 'SKOS Use Cases and Requirements' [IPR07] documents.

Table 2.1: Key features or characteristics of OWL and SKOS languages

Feature: Hierarchy
    OWL:
        • subsumption hierarchical structure with formal semantics
        • through the rdfs:subClassOf relationship
    SKOS:
        • hierarchical structure in terms of semantic relationships
        • through broader term (BT) and narrower term (NT) relationships

Feature: Labeling
    OWL:
        • does not provide a standard and comprehensive labeling mechanism
        • only uses rdfs:label to label a concept
    SKOS:
        • provides a standard and comprehensive labeling specialization
        • through skos:prefLabel, skos:altLabel, skos:hiddenLabel, skos:scopeNote, etc.

Feature: Class description
    OWL:
        • provides a standard and formal method of expressing class descriptions
        • through standard and formal class constructors
    SKOS:
        • does not provide any means to express class descriptions

Feature: Correspondence/mapping links and equivalence
    OWL:
        • allows for representation of class, property and individual equivalence
    SKOS:
        • allows for correspondence/mapping links between concepts from different concept schemes

Four main features are used to compare the two languages. The first is the hierarchical structure: OWL provides a formal subsumption hierarchy with formal semantics through the rdfs:subClassOf relationship, while SKOS provides a less formal hierarchical structure in terms of the semantic relationships skos:broader (BT) and skos:narrower (NT). A formal subsumption hierarchy is necessary to represent explicitly the intentional relationships between the classes and individuals in the ontology, which is important in applications such as representing the domain knowledge of a particular field. On the other hand, the weaker semantics between entities in SKOS are appropriate for applications that require a broader meaning of a particular concept, such as browsing and navigating collections of documents.

Another key feature that exists in OWL but not in SKOS is the ability to provide class descriptions: OWL provides a standard and formal method of expressing class descriptions through a set of standard and formal class constructors. In applications that involve inference and reasoning over the knowledge, this feature is an advantage of OWL over SKOS, because SKOS does not provide any means to express such descriptions. However, in applications that do not require reasoning or inference, a SKOS vocabulary can be more suitable because it is a more 'light-weight' representation than an OWL ontology.

The next feature that distinguishes the two languages is labeling. Even though both languages provide some mechanism to display a label for each concept, SKOS provides a standard and comprehensive labeling specialization through skos:prefLabel, skos:altLabel, skos:scopeNote and a few others, whereas the only standard way to label a concept in OWL is rdfs:label. In applications that need a human-readable format, representing the knowledge as a SKOS vocabulary can therefore be more suitable.

The last feature that characterizes the two languages is the ability to provide correspondence or mapping links and equivalence relationships between concepts.
Both languages provide mechanisms to express this feature, but in different ways: OWL allows for the representation of class, property and individual equivalence, whereas SKOS allows for correspondence or mapping links between concepts from different concept schemes. The comparison between the OWL and SKOS languages in Table 2.1 helps in deciding which representation language is more suitable for a particular problem. Generally, this decision is made by matching the language characteristics to the problem requirements. General guidelines on when to use which representation language are therefore necessary for choosing the right one. Table 2.2 lists a few situations or criteria for deciding which representation language to use.

Table 2.2: Criteria in choosing OWL or SKOS as representation language

OWL:
    • When the application is driven to support automated/machine reasoning and classification from inference of the classes.
    • When there is a need to describe most features of conceptual modeling languages, such as cardinality constraints, conjunction/disjunction restrictions, etc.
SKOS:
    • When the application is driven to support automated search and retrieval systems.
    • When there is a need only to represent simple relations between the terms and to express lexical labels explicitly in the system.

The main aim of this research is to reuse existing knowledge artefacts that are represented in one format in other applications that require a different format. Figure 2.5 shows the framework of the knowledge representation language space that we propose in this research. As discussed in Section 2.1, a problem in the Problem Space can be mapped either to the OWL space or to the SKOS space, depending on the requirements of the problem. In this section, we discuss possible cases of knowledge reuse between OWL and SKOS knowledge artefacts.

Case 1: Enriching SKOS vocabulary into OWL ontology

Situation: For a problem, P, in the Problem Layer for a particular domain, the problem requirements are matched by the OWL characteristics, so the problem is mapped into the OWL space and produces an artefact in the form of an OWL ontology, AO. However, there also exists an artefact in the SKOS space, a SKOS vocabulary, AS, that has been published to address the same domain problem. The existing SKOS vocabulary, AS, can therefore be reused by transforming it into the OWL space, producing a new OWL ontology, ASO, to solve the same domain problem.

Objective: By using the existing SKOS vocabulary as a starting point and enriching the semantics between the concepts, we aim to produce an OWL ontology that can fulfill the requirements of the application.


Figure 2.5: Knowledge representation space

Application Example: A problem in the medical domain requires an intentional representation that allows logical inferencing. This requirement is best matched by OWL characteristics, so the solution should take the form of an OWL ontology that represents the domain. The MeSH (Medical Subject Headings) vocabulary (footnote 9), a controlled-vocabulary-like representation, already exists and represents knowledge from the same domain. Therefore, reusing the content of this vocabulary and transforming it into an OWL ontology can fulfill the problem requirement.

9 http://www.nlm.nih.gov/mesh/

Case 2: Extracting SKOS vocabulary from OWL ontology

Situation: A problem, P, in the Problem Layer for a particular domain can be mapped into the SKOS space, since the problem requirements are matched by SKOS characteristics, producing a SKOS vocabulary, AS. However, there also exists an artefact in the OWL space, an OWL ontology, AO, representing the knowledge for the same domain. The existing OWL ontology, AO, can therefore be reused by transforming it into the SKOS space, producing a new SKOS vocabulary, AOS, to solve the same domain problem.

Objective: By using the existing OWL ontology as a starting point and 'reducing' some semantics between the concepts, we aim to produce a SKOS vocabulary that can fulfill the requirements of the application.

Application Example: A problem in the biomedical domain requires a representation for navigation and browsing purposes, i.e., one with less strict semantic relations between the concepts. These requirements are matched by SKOS characteristics and call for a SKOS vocabulary that represents the knowledge for this domain. The Systematised Nomenclature of Medicine - Clinical Terms (SNOMED CT) (footnote 10) exists as a representation of this domain. The OWL representation of the SNOMED CT ontology does not have the appropriate structure for a simple navigation and browsing application. Therefore, reusing the content of the ontology and transforming it into a light-weight vocabulary such as a SKOS vocabulary requires less effort than building the vocabulary from the beginning.

Case 3: OWL ontology and SKOS vocabulary

Situation: A problem, P, in the Problem Layer has different problem requirements that can be matched by both OWL and SKOS characteristics. Therefore, it is possible to map the problem into both language spaces.

Objective: Deciding which space should be mapped first can help to reduce the amount of effort needed to fulfill the requirements of the system. We predict that if the problem is mapped into one space, producing one artefact, then that artefact might be transformed to fulfill the remaining problem requirements.
Application Example: For this case, further research and investigation are needed to determine whether such a problem exists and which transformation procedures can help to reduce the amount of effort needed to fulfill the requirements of the system. The case is worth defining because such problems might exist in practice, and having some idea of how to deal with them is necessary.

10 http://www.ihtsdo.org/

Case 4: OWL ontology and SKOS vocabulary co-exist in an application

Situation: A problem, P, in the Problem Layer for a particular problem domain requires both a representation as an OWL ontology and a representation as a SKOS vocabulary, because different subsets of the problem requirements are matched by the characteristics of each language. Solutions exist in both representation spaces for different aspects of the problem. In this case, no transformation is needed; instead, both artefacts may need to co-exist in an application with some alignment of the domain knowledge.

Objective: Systematic procedures are needed for using both representations at the same time in the same application.

Application Example: A problem in the biomedical domain requires both an intentional representation of the domain and a representation for navigation and browsing purposes. In this case, the domain knowledge needs to be represented both as an OWL ontology and as a SKOS vocabulary. The Open Biomedical Ontologies (OBO) exist as intentional representations of this domain; assume that there also exists a SKOS vocabulary that represents aspects of the domain different from those covered by OBO. Since both the OWL and the SKOS representations need to co-exist within the same application, but address different aspects of the domain, some alignment of the domain knowledge is required.

Based on our discussion in the previous section, we observe that different knowledge representation resources are used for different applications and are represented using different knowledge representation languages. For Semantic Web applications, these knowledge resources must be represented in a standard Semantic Web format, using either OWL or SKOS.
Considerable effort has been made to map and transform existing knowledge resources into standard Semantic Web formats in various domains such as arts, agriculture, medicine and many more [vAMMS06, vAGS06, SLL+04, SIRK08, GHH+08]. The main concern of these works is to reuse existing knowledge resources. Some of them map or transform existing knowledge resources into a standard Semantic Web format at the same level of semantic structure, like transforming a thesaurus into a SKOS vocabulary [vAMMS06, SIRK08]. Others transform knowledge resources into a standard Semantic Web format at a different level of semantic structure, like transforming WordNet (footnote 11) into an OWL ontology [vAGS06]. When knowledge resources are transformed between different levels of semantic structure, additional consideration is needed, regardless of whether the transformation goes from a semantically rich structure to a semantically weaker one or the other way round. For this reason, we are motivated to investigate systematic procedures for transforming knowledge resources from one formalism into another formalism that differs in terms of semantic structure. We are particularly interested in how SKOS vocabularies and OWL ontologies can be transformed into one another, knowing that some relationships exist between the two languages. The next two subsections discuss related work in both areas of interest: transforming knowledge resources from ontologies into SKOS vocabularies, and transforming knowledge resources from knowledge organization systems into OWL ontologies.

2.3.3 Transforming ontology into SKOS vocabulary

As discussed in the previous section, one of the areas that motivates us is how an ontology can be transformed into a SKOS vocabulary. Bechhofer et al. identified a few issues concerning the use of ontologies for dynamic linking to support Web navigation and browsing [BYS+08b]. One of these issues is that the strict subclass and superclass relationships contained in OWL ontologies do not necessarily provide good navigation compared with looser representations such as thesauri, whose broader/narrower notions can provide more useful linking. In this paper, the Conceptual Open Hypermedia Service (COHSE) system (footnote 12) is used as an application example. The authors introduced an additional layer of abstraction into the knowledge model for Web navigation purposes, using a standard lightweight knowledge representation with looser relationships, like SKOS, and moving away from the strict super/subclass relationships encapsulated in OWL.

In [BYS+08b], the authors explore how different knowledge representations may affect Web navigation, comparing formalisms that are ontologically formal and semantically rich, such as OWL ontologies, with formalisms that are ontologically informal and semantically weak, such as SKOS vocabularies. The authors also demonstrated how SKOS can be used to take advantage of the vast amount of existing ontological representations in Web navigation, especially in supporting the semantic linking of Web-based information and facilitating information retrieval. The main focus is to exploit the wealth of ontologies, thesauri and other types of knowledge organisation schemes that already exist within the biomedical domain. The authors also presented work on OBO to SKOS conversion, which involves converting an ontological representation into a SKOS representation.

In both works discussed above, the authors demonstrated that looser representations with weaker semantics, such as SKOS, support Web navigation and information retrieval applications better than the strict sub/superclass relationships of OWL. In [BYS+08b], the aim is to make existing knowledge organization system resources available in a standard representation like SKOS to support the COHSE system. The aim of our thesis, in contrast, is to make use of existing ontology-based information and to transform it into SKOS representations for applications that require simple navigation, indexing or information retrieval. The authors of [BYS+08b] focused on exploiting the knowledge resources that exist within the biomedical domain, particularly the ontological representations. The aim of our thesis, however, is to define standard procedures for OWL to SKOS transformation that are independent of the domain of interest.

11 http://wordnet.princeton.edu/
12 http://cohse.cs.manchester.ac.uk
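To make the idea of a domain-independent OWL to SKOS transformation more concrete, the following Python sketch shows a deliberately naive mapping. It is not the transformation procedure defined in this thesis: the input format, the function name and the mapping rules (each named class becomes a skos:Concept, each asserted superclass becomes a skos:broader link, each label becomes a skos:prefLabel, and classes without asserted superclasses become top concepts) are assumptions chosen for illustration, and class descriptions built from restrictions are simply dropped.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical, pre-extracted view of an ontology: named classes with a label
# and their asserted named superclasses. A real ontology would be parsed first.
ONTOLOGY: Dict[str, Dict[str, object]] = {
    "Animal": {"label": "animal", "superclasses": []},
    "Cat": {"label": "cat", "superclasses": ["Animal"]},
}


def classes_to_skos(classes: Dict[str, Dict[str, object]], scheme: str = "CS1") -> List[Triple]:
    """Naive class-to-concept mapping; class descriptions are not carried over."""
    triples: List[Triple] = []
    for name in sorted(classes):
        info = classes[name]
        triples.append((name, "rdf:type", "skos:Concept"))
        triples.append((name, "skos:inScheme", scheme))
        label = info.get("label")
        if label:
            triples.append((name, "skos:prefLabel", str(label)))
        parents = list(info.get("superclasses") or [])
        for parent in parents:
            # Assumption: asserted subsumption is read as a direct broader link.
            triples.append((name, "skos:broader", parent))
        if not parents:
            triples.append((name, "skos:topConceptOf", scheme))
    return triples


SKOS_TRIPLES = classes_to_skos(ONTOLOGY)
for triple in SKOS_TRIPLES:
    print(triple)
```

Even this small sketch shows what is lost in the direction from OWL to SKOS: the formal subsumption semantics is weakened to a broader link, and anything expressed through class constructors has no counterpart in the output.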

2.3.4 Transforming knowledge organization systems into OWL ontology

Another area that we consider important to this research is work on transforming knowledge organization systems of any type, such as thesauri, classification schemes, taxonomies and other controlled vocabularies, into some kind of ontology representation. In [SLL+ 04], a conceptual structure and transition procedure for moving from an existing knowledge organization system to a semantically rich knowledge organization system is presented. The authors chose AGROVOC, an existing thesaurus maintained by the Food and Agriculture Organization (FAO) of the United Nations, as a case study to explore the reengineering of a traditional thesaurus into a well-defined ontology. However, the reengineering process demands a great deal of user intuition in deciding the appropriate relationships between the entities being described. This is done manually and takes a lot of time, especially when new terms are found and new relationships must be defined to relate them. In our research, by contrast, we aim to minimize the need for user intervention by defining general transformation procedures that can be applied to SKOS vocabularies from different domains. We hope that by doing this we can automate, or at least semi-automate, the process of transforming a SKOS vocabulary into an OWL ontology.

Another study that derives OWL ontologies from traditional knowledge organization systems is presented in [HB07]. The authors focused on three types of knowledge organization systems: hierarchical classifications, thesauri and inconsistent taxonomies. They present a new methodology for automatically deriving RDF-S and OWL ontologies from a traditional knowledge organization scheme. However, the derived OWL ontology only represents the hierarchical structure of the transformed concepts. Two e-business categorization standards, the Standardized Material and Service Classification (eCl@ss)¹³ and the United Nations Standard Products and Services Code (UNSPSC)¹⁴, were chosen to demonstrate the usefulness of the approach. In our research, we focus on transforming SKOS vocabularies, which cover almost all types of knowledge organization systems. Furthermore, we aim for the transformed OWL ontologies to contain the hierarchical structure as well as a class description for each transformed concept.

2.4 Summary

In this chapter we presented our proposed framework for Semantic Web knowledge reuse, which is used to facilitate the discussion throughout this thesis. We also discussed Semantic Web knowledge representation languages such as XML, RDF and RDFS, with further emphasis on OWL and SKOS. We briefly discussed some characteristics of both OWL and SKOS and presented four possible cases of knowledge reuse between OWL and SKOS. Considerable effort has been made to reuse existing knowledge resources and transform them into standard Semantic Web formats. We observed that transforming knowledge structures from semantically weaker to semantically richer structures is more challenging than transforming in the opposite direction. Some of the works did not provide systematic procedures for doing so, and others require the transformation procedures to be carried out manually. Therefore, systematic and automatic or semi-automatic procedures for transforming from one formalism to the other are considered essential and beneficial.

We believe there are gaps in these works that need further research and investigation. One area of interest is to identify the characteristics of both OWL and SKOS knowledge artefacts by examining existing OWL ontologies and SKOS vocabularies. Another is to explore, based on an understanding of those characteristics, the possibility of transforming from one formalism to the other with systematic procedures. In this thesis, we choose to explore the possibility of transforming knowledge represented in OWL ontologies into SKOS vocabularies. Given this choice, a further question is whether the existing relationship between the two languages helps, hinders or has no effect on the transformation.

13 http://www.eclass-online.com/
14 http://www.unspsc.org/

Chapter 3 Characterising SKOS vocabularies on the Web

In Chapter 2, we presented the notion of the Semantic Web Knowledge Representation space, focusing on two Semantic Web knowledge representation languages, OWL and SKOS. We discussed how these two languages differ in terms of what they can and cannot represent. For example, SKOS is used to represent knowledge artefacts such as glossaries, taxonomies and thesauri, which are frequently used in indexing, searching and navigation systems. However, SKOS cannot represent knowledge artefacts that require strict semantics to be defined between the concepts being described. Because of this difference in semantic complexity, we proposed that knowledge represented in OWL ontologies be reused within SKOS applications by representing it as SKOS vocabularies through a transformation process. To define a sensible transformation, we need to understand the characteristics of both kinds of artefact.

In this chapter we describe one of the two languages, SKOS, by investigating SKOS vocabularies gathered from the Web. We present a survey of Simple Knowledge Organization System (SKOS) vocabularies on the Web. The main outcome of this chapter is an understanding of what SKOS vocabularies look like in terms of their use of SKOS constructs and their structure. This understanding is necessary when transforming OWL ontologies into SKOS vocabularies, because we expect the vocabularies produced by the transformation to look like existing SKOS vocabularies. Therefore, an understanding of the structure of existing SKOS vocabularies could be used as a guideline to recognise and evaluate the output of the transformation.

3.1 SKOS vocabulary survey

The main aim of this survey is to understand the structure of SKOS vocabularies. While research has attempted to characterise Semantic Web documents such as OWL ontologies and RDF(S) documents on the Web [GHKP12, WPH06, DF06, TV03], to the authors' knowledge there has been no attempt to characterise SKOS vocabularies. The research by Mader and Suominen, on the other hand, focuses on finding and assessing the quality of SKOS vocabularies, with tools called qSKOS and Skosify, respectively [MHI12a, MHI12b, SH12, SM14]. Our approach is similar to the work of Wang et al. [WPH06], who focused on OWL ontologies and RDFS documents, and the work of Glimm et al. [GHKP12], who focused on the usage of OWL constructs in the Linked Open Data. We are interested in understanding the characteristics of SKOS vocabularies in terms of their structure, particularly the depth and branching factors of the concept hierarchy. Amongst the objectives of this chapter are:

• To identify the frequency of use of SKOS constructs in the collected SKOS vocabularies, and which constructs are most and least frequently used.
• To categorise the SKOS vocabularies according to the constructs used in the vocabularies.
• To determine the structural characteristics of each category, such as the size, depth and branching factors of the SKOS vocabularies.

This information is useful when considering the transformation of OWL ontologies into SKOS vocabularies, where the metrics describing the structure of existing SKOS vocabularies could be used as a guideline to recognise and evaluate the output of the transformation. Additionally, we are interested in the frequency of use of SKOS constructs in SKOS vocabularies. This could suggest which frequently used SKOS constructs to adopt in the OWL2SKOS transformation procedures. The work presented in this chapter was published in [MBS12b, MBS12a].

3.1.1 Methods of SKOS vocabulary survey

Finin et al. consider a Semantic Web document (SWD) to be a document represented as an RDF graph and encoded using a syntax such as RDF/XML, N-Triples or N3 [FPS+ 04]. The discovery of SWDs has been discussed in the Semantic Web search engine literature for the Swoogle project [FPS+ 04, DPF+ 05], the Watson project [dBG+ 07b] and many others. In this literature, the authors describe the methods and architecture used by their search engines to discover SWDs.

[Pipeline diagram. Discovery: dedicated collections, SW search engines, Web crawler. Validation: SKOS vocabulary identification, ‘slips’ detection and fixing. Data Extraction & Analysis: metadata extraction, duplicate filtering, data analysis. Data stores: candidate SKOS vocabularies, SKOS vocabularies, SKOS metrics.]

Figure 3.1: The methods of SKOS vocabulary survey

Figure 3.1 shows our methods for the survey of SKOS vocabularies. We divided the whole process into three components: Discovery, Validation, and Data Extraction and Analysis.

- The Discovery component gathers candidate SKOS vocabularies from the Web using three mechanisms: (i) downloading from dedicated collections; (ii) searching the Semantic Web search engines; and (iii) deploying a Web crawler to explore promising sites.
- The Validation component validates the collected candidates, first restricting them to Semantic Web documents and then keeping only those that conform to our definition of a SKOS vocabulary. Documents that fail to be identified as SKOS vocabularies are checked for ‘slips’ in the usage of SKOS constructs, and the slips are fixed if possible.
- The Data Extraction and Analysis component extracts the data used for duplicate filtering and for further analysis of the SKOS vocabularies.

In the next section, we start with the first component of the survey by explaining the methods we employed to gather and prepare the corpus of SKOS vocabularies for the survey.

Apparatus

All experiments were performed on a 2.4 GHz Intel Core 2 Duo MacBook running Mac OS X 10.6.8, with a maximum of 3 GB of memory allocated to the Java virtual machine. Two reasoners were used: JFact¹, which is a Java version of the FaCT++ [TH06] reasoner, and Pellet [SPG+ 07]. JFact was chosen for its speed in classifying ontologies. However, for the vocabularies that JFact failed to classify because of errors such as heap space errors and user-defined datatype errors, we used Pellet, which is more robust than JFact but somewhat slower [DCtTdK11]. We used the OWL API [HB11] version 3.2.4² for handling and manipulating the vocabularies.

3.1.2 Discovery component

The Discovery component is the first component in the SKOS vocabulary survey architecture. It prepares the candidate SKOS vocabularies to be used in the survey by discovering the URLs of SKOS vocabularies on the Web. We used three mechanisms for discovering candidate SKOS vocabularies on the Web: (i) dedicated collections; (ii) Semantic Web search engines; and (iii) a Web crawler.

3.1.2.1 Dedicated collections

Our first mechanism for discovering candidate SKOS vocabularies is to collect them from dedicated collections in ontology libraries. Ontology libraries are systems that collect ontologies from different sources and facilitate the tasks of finding, exploring and using these ontologies. Ontology libraries can thus serve as a link that enables diverse users and applications to discover, evaluate, use and publish ontologies. Noy and d’Aquin [Nd11] provide a survey of the diverse landscape of ontology libraries. Examples of ontology libraries included in their survey are BioPortal [NSW+ 09], the TONES Repository [TON10] and OntologyDesignPattern.org [PSFGP09]. However, most existing ontology libraries are for OWL/RDF ontologies. For SKOS, the dedicated collections of SKOS vocabularies include those published by the SWD Working Group³. Two sources of collections of SKOS vocabularies are the SKOS Implementation Report [MB09a] and the SKOS Datasets page [SKO10]. The SKOS Implementation Report was edited by the SWD Working Group and gathers vocabulary submissions made by the SKOS community in response to a W3C call for implementations⁴. The SKOS Datasets page of the SKOS community wiki has links to published SKOS data sets added and shared by the SKOS community. We chose these collections because the vocabularies listed in them were compiled by the SKOS Working Group through its community call. Even though the collections are not representative, they contain a large number of SKOS vocabularies that are reliable and publicly available for download. We manually downloaded the vocabularies listed in these collections and stored them locally.

1 http://jfact.sourceforge.net/
2 http://owlapi.sourceforge.net/

3.1.2.2 Semantic Web search engines

The second mechanism for discovering candidate SKOS vocabularies is the use of Semantic Web search engines. Dietze and Schroeder [DS09] and Renteria-Agualimpia et al. [RALPMM+ 10] report comparisons of several existing Semantic Web search engines such as Swoogle [FPS+ 04], Watson [dBG+ 07a], Semantic Web Search Engine (SWSE) [HUD06], Falcon [CGQ08] and Sindice [Sin]. Since we are interested in analysing each individual SKOS vocabulary, we require search engines that return Semantic Web ontologies as results, which Swoogle and Watson do. For this reason, we chose the Swoogle [FPS+ 04] and Watson [dBG+ 07a] search engines for this survey. Collecting vocabularies from these sources enables us to gain some insight into the use of SKOS vocabularies in the community. We used the API provided by each search engine to gather the results of the relevant searches programmatically. For both search engines, we used skos, thesaurus, taxonomy, glossary, concept, broader, narrower and related as search terms. We considered these terms to be keywords for identifying SKOS vocabularies because some of them are used as construct names in the SKOS data model. At this point, we collected only the URLs of the vocabularies, because our analysis tools retrieve the documents from the Web given the URLs.

3 http://www.w3.org/2004/02/skos/
4 http://www.w3.org/News/2009#item35

3.1.2.3 Web crawler

The third mechanism we used to discover candidate SKOS vocabularies is a Web crawler. We used an off-the-shelf Web crawler called Blue Crab⁵, which can be configured to crawl according to user-specified settings. We used a few Web pages related to SKOS, such as the SKOS working group page⁶, as starting points for the crawler.

The Discovery component produced a collection of candidate SKOS vocabularies. We call them candidate SKOS vocabularies because we are not entirely certain that the collection contains only the Semantic Web documents required for analysis in the survey; it may also contain some “noise” introduced by the discovery methods described above.

3.1.3 Validation component

The Validation component validates each candidate SKOS vocabulary (URL) collected by the Discovery component as a Semantic Web document and then determines whether it is a SKOS vocabulary. Documents that fail to be identified as SKOS vocabularies are checked for ‘slips’, which are fixed if possible. This component consists of two modules: (i) SKOS vocabulary identification; and (ii) ‘slips’ detection and fixing.

3.1.3.1 SKOS vocabulary identification

Since several methods were used to collect candidate SKOS vocabularies, we expect the collection to contain some “noise”. By “noise” we mean “unwanted” or “irrelevant” documents that are not suitable for inclusion in the survey. We only want to keep documents that may contain semantic data or ontologies, and we eliminate any document that cannot be parsed by the OWL API [HB11]. Before we can extract the required data for our analysis, we first need to identify whether the candidate SKOS vocabularies that we collected are indeed “real” SKOS vocabularies. Since we are interested in understanding the structure of SKOS vocabularies, we require the vocabularies included in our survey to contain instances of type skos:Concept. For this survey, we used the following definition of a SKOS vocabulary.

5 http://www.limit-point.com/products/bluecrab/
6 http://www.w3.org/2004/02/skos/

Definition 1 A SKOS vocabulary is a vocabulary that at the very least contains SKOS concept(s) used directly, or SKOS constructs that indirectly imply the use of a SKOS concept, such as SKOS semantic relations.

Each candidate SKOS vocabulary collected previously was screened in the following way to identify it as a SKOS vocabulary:

1. Check for the existence of direct instances of type skos:Concept; if Yes, then accept the vocabulary as a SKOS vocabulary.
2. Check for the existence of implied instances of skos:Concept due to domain and range restrictions on SKOS relationships (for example, the subject of a skos:broader, skos:narrower or skos:related relationship is necessarily a skos:Concept); if Yes, then accept the vocabulary as a SKOS vocabulary.
3. Otherwise, do not accept this vocabulary as a SKOS vocabulary.

Consider the following vocabulary snippets written in Manchester Syntax [HPS09]. Vocabulary 1 and Vocabulary 2 are accepted as SKOS vocabularies based on the tests in Step 1 and Step 2, respectively. Vocabulary 3, in contrast, is not accepted as a SKOS vocabulary according to our definition, even though it uses SKOS constructs such as skos:prefLabel and skos:altLabel.

Vocabulary 1:
    Individual: Emotion
        Types: Concept
    Individual: Love
        Types: Concept
    Individual: Beauty
        Types: Concept

Vocabulary 2:
    Individual: Love
        Types: Thing
        Facts: broader Emotion
    Individual: Emotion
        Types: Thing

Vocabulary 3:
    Individual: Love
        Types: Thing
        Facts: prefLabel "Love", altLabel "Affection"
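The three screening steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the survey (which relied on the OWL API and a reasoner): a vocabulary is assumed to be available as a list of (subject, predicate, object) triples with full IRIs, the helper name is_skos_vocabulary and the ex: identifiers are hypothetical, and Step 2 is approximated by treating any use of a SKOS semantic relation as implying a skos:Concept.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Relations whose use implies skos:Concept instances (approximation of Step 2).
SEMANTIC_RELATIONS = {SKOS + "broader", SKOS + "narrower", SKOS + "related"}


def is_skos_vocabulary(triples):
    """Apply the screening steps to (subject, predicate, object) triples."""
    triples = list(triples)
    # Step 1: direct instances of skos:Concept.
    if any(p == RDF_TYPE and o == SKOS + "Concept" for _, p, o in triples):
        return True
    # Step 2: instances implied by the use of a SKOS semantic relation.
    if any(p in SEMANTIC_RELATIONS for _, p, _ in triples):
        return True
    # Step 3: otherwise reject.
    return False


# The three example vocabularies, written as triples (hypothetical ex: names).
OWL_THING = "http://www.w3.org/2002/07/owl#Thing"
vocabulary_1 = [
    ("ex:Emotion", RDF_TYPE, SKOS + "Concept"),
    ("ex:Love", RDF_TYPE, SKOS + "Concept"),
    ("ex:Beauty", RDF_TYPE, SKOS + "Concept"),
]
vocabulary_2 = [
    ("ex:Love", RDF_TYPE, OWL_THING),
    ("ex:Love", SKOS + "broader", "ex:Emotion"),
    ("ex:Emotion", RDF_TYPE, OWL_THING),
]
vocabulary_3 = [
    ("ex:Love", RDF_TYPE, OWL_THING),
    ("ex:Love", SKOS + "prefLabel", "Love"),
    ("ex:Love", SKOS + "altLabel", "Affection"),
]
```

Under these assumptions the first two vocabularies are accepted and the third is rejected, matching the outcome described above.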

3.1.3.2 ‘Slips’ detection and fixing

In the SKOS vocabulary identification module, we recorded the URLs of the candidate SKOS vocabularies that failed to be identified as SKOS vocabularies. These URLs are then screened, using the OWL API, for the SKOS constructs used in the vocabularies. In this screening stage, we look for candidate SKOS vocabularies that use the following SKOS constructs but were not detected in the previous stage:

• skos:broader
• skos:narrower
• skos:related
• skos:Concept
• skos:hasTopConcept
• skos:topConceptOf

Using the functionality provided by the OWL API, we then checked the entity types that the OWL API assigned to the listed SKOS constructs and recorded them for each candidate SKOS vocabulary. We also kept a list of the URLs of candidate SKOS vocabularies that were found to be inconsistent when classified with an automatic reasoner such as JFact or Pellet. For each candidate SKOS vocabulary that failed to be classified, we recorded the exception message thrown by the reasoner together with the cause of the inconsistency.

Each impaired SKOS vocabulary was manually inspected for any sign of deviation from SKOS that would account for the irregularity. We found several patterns of irregularity in the vocabulary representation and considered them to be slips made by ontology engineers when authoring the vocabularies. We classified these irregularities into several types. For each type of slip, we decided whether the error was intentional or unintentional, and whether fixing it would change the content of the vocabulary. If the error was unintentional and fixing it would not change the content of the vocabulary, we applied fixing procedures to correct the slip. All fixed vocabularies were included in the survey for further analysis.

The output of the Validation component is a “cleaned” collection of SKOS vocabularies that is ready to be analysed by the Data Extraction and Analysis component.

3.1.4 Data Extraction and Analysis component

The Data Extraction and Analysis component is the last component of the SKOS vocabulary survey architecture. This component collects the survey data for each SKOS vocabulary and analyses them against the objectives set out in Section 1.4. There are three modules in this component: (i) metadata extraction; (ii) duplicate filtering; and (iii) data analysis.

3.1.4.1 Metadata extraction

In order to achieve the objectives listed in Section 1.4, we need to extract metadata for each SKOS vocabulary in our collection. For the first objective, since we are interested in understanding the frequency of use of SKOS constructs in these vocabularies, the metadata extraction is done on the asserted version of the vocabularies. These data reflect the actual usage of SKOS constructs in those vocabularies. For each SKOS vocabulary, we count and record the number of instances of all SKOS constructs listed in the SKOS Reference [MB09b]. For the remaining objectives, such as categorising the vocabularies and identifying their structure, we collect the data from the inferred version of the vocabularies; hence the use of an automatic reasoner such as Pellet or JFact. We collect and record the following for each SKOS vocabulary:

1. Concept Scheme IRI.
2. Number of SKOS concepts.
3. Depth of each SKOS concept and maximum depth of the concept hierarchy.
4. Total number of links for the skos:broader, skos:narrower and skos:related properties.
5. Total number of loose singleton concepts (concepts that are not connected to any other concept).
6. Total number of root concepts (concepts with only skos:narrower relations and no skos:broader relation).
7. Maximum number of skos:broader relations per concept.

The recorded data are used for duplicate filtering and further analysis.
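As an illustration of items 2 to 7, the sketch below derives the structural counts from a list of skos:narrower links. It is a simplified stand-in for the reasoner-based extraction, under several assumptions: the function name extract_structure and its input format are hypothetical, the narrower links are taken to be already inferred (so each skos:broader link is the inverse of a skos:narrower link), and the depth of a concept is taken to be its shortest distance from a root, with roots at depth 0.

```python
from collections import defaultdict, deque


def extract_structure(concepts, narrower, related=()):
    """Return structural counts for one vocabulary.

    concepts: iterable of concept identifiers
    narrower: iterable of (parent, child) pairs, one per skos:narrower link
    related:  iterable of (a, b) pairs, one per skos:related link
    """
    concepts = set(concepts)
    narrower = list(narrower)
    children = defaultdict(set)
    parents = defaultdict(set)
    for parent, child in narrower:
        children[parent].add(child)
        parents[child].add(parent)
    associated = {c for pair in related for c in pair}

    # Roots have narrower concepts but no broader concept.
    roots = {c for c in concepts if children[c] and not parents[c]}
    # Loose singletons are not connected to any other concept.
    singletons = {
        c for c in concepts
        if not children[c] and not parents[c] and c not in associated
    }

    # Breadth-first traversal from the roots (assumption: shortest distance).
    depth = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)

    return {
        "size": len(concepts),
        "max_depth": max(depth.values(), default=0),
        "narrower_links": len(narrower),
        "loose_singletons": len(singletons),
        "roots": len(roots),
        "max_broader": max((len(parents[c]) for c in concepts), default=0),
    }


# Hypothetical five-concept hierarchy: a -> {b, e}, b -> {c, d}.
example = extract_structure("abcde", [("a", "b"), ("a", "e"), ("b", "c"), ("b", "d")])
```

Other definitions of depth (for example, the longest path from a root) would only change the traversal step.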

3.1.4.2 Duplicate filtering

Even though URIs are supposed to be unique identifiers in the scope of the Web, we may find local copies of an ontology at several locations (different URLs), or two ontologies that are intended to be different may declare the same URI [dBG+ 07b]. For this reason, we use the data recorded in the previous module to filter out structurally identical SKOS vocabularies. We compare the Concept Scheme IRIs to search for duplicate vocabularies. For two or more vocabularies having the same Concept Scheme IRI, we compare the recorded total number of SKOS concepts. We make pairwise comparisons between the recorded data having the same URI, taking two vocabularies at a time. There are two main concerns in filtering duplicate SKOS vocabularies: (i) removing identical SKOS vocabularies; and (ii) removing versions of similar SKOS vocabularies.

1. Identical SKOS vocabularies. This step detects SKOS vocabularies that have the same Concept Scheme IRI and are structurally identical. As mentioned before, we may have collected several copies of a SKOS vocabulary that are shared at different locations (different URLs). For the analysis, we would like to keep only one copy of these vocabularies. If two vocabularies have identical records, we then check the content of these vocabularies. This is done by making a pairwise comparison between the instances of skos:Concept in one vocabulary and those in the other. If the two vocabularies have the same instances of skos:Concept, then one copy is kept and the duplicate is removed. Otherwise, we follow the next step.

2. Versions of similar SKOS vocabularies. If the two vocabularies do not have identical records or identical instances of skos:Concept, we assume that one vocabulary is a newer version of the other. If the two vocabularies belong to the same category (Thesaurus, Taxonomy or Glossary), then we keep the latest version of the vocabulary and remove the older version. Otherwise, both vocabularies are kept. The rationale for keeping both vocabularies when they are in different categories is that we want to capture the ‘essence’ of the vocabulary for the understanding of the vocabulary structure.

At the end of this module, we should have a “clean” collection of unique SKOS vocabularies ready for data analysis.
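The pairwise decision described above can be written down as a small function. The sketch below is illustrative only: the VocabRecord type and its fields are hypothetical, and in particular the version field is an assumed stand-in for however the "latest version" is determined, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabRecord:
    url: str              # location the copy was retrieved from
    scheme_iri: str       # Concept Scheme IRI
    concepts: frozenset   # instances of skos:Concept
    category: str         # "Thesaurus", "Taxonomy", "Glossary" or "Others"
    version: int          # hypothetical ordering; higher means newer


def resolve_pair(a, b):
    """Return the records to keep after comparing two vocabularies."""
    # Different Concept Scheme IRIs: not treated as duplicates.
    if a.scheme_iri != b.scheme_iri:
        return [a, b]
    # Step 1: same concept count and same concept instances, keep one copy.
    if len(a.concepts) == len(b.concepts) and a.concepts == b.concepts:
        return [a]
    # Step 2: assumed to be versions; same category keeps only the latest.
    if a.category == b.category:
        return [max(a, b, key=lambda v: v.version)]
    # Versions in different categories are both kept.
    return [a, b]


old = VocabRecord("http://example.org/copy1", "ex:scheme", frozenset("abc"), "Taxonomy", 1)
mirror = VocabRecord("http://example.org/copy2", "ex:scheme", frozenset("abc"), "Taxonomy", 1)
new = VocabRecord("http://example.org/copy3", "ex:scheme", frozenset("abcd"), "Taxonomy", 2)
other = VocabRecord("http://example.org/copy4", "ex:scheme", frozenset("abcd"), "Thesaurus", 2)
```

Applying resolve_pair to every pair that shares a Concept Scheme IRI mirrors the two-at-a-time comparison in the text.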

3.1.4.3 Data analysis

We calculated the mode, mean, median and standard deviation of the occurrence of each construct collected in the previous process. The analysis focused on two major aspects of the vocabularies: the usage of SKOS constructs and the structure of the vocabularies. In terms of the usage of SKOS constructs, we analysed which constructs were most used in the SKOS vocabularies. For the structural analysis of the SKOS vocabularies, we introduced a SKOS metric, $M$, an 8-tuple defined as follows:

$$M = \langle S, D, L, R, \mathit{MAX}_B, F_H, B_H, F_A \rangle \qquad (3.1)$$

where $S$ is the size of the vocabulary (the number of SKOS concepts), $D$ is the maximum depth of the vocabulary structure, $L$ is the number of loose singleton concepts in the vocabulary, $R$ is the number of root nodes of the vocabulary structure, $\mathit{MAX}_B$ is the maximum number of skos:broader relations of any concept in the vocabulary, $F_H$ is the average hierarchical forward branching factor, $B_H$ is the average hierarchical backward branching factor and $F_A$ is the average associative forward branching factor.

According to [PM10], the branching factor is a measure of the number of links coming into or going out of a particular node. For a directed graph there are two types of branching factor: the forward branching factor (FBF) and the backward branching factor (BBF). The FBF is the number of arcs going out of a node, and the BBF is the number of arcs coming into a node. The FBF for hierarchical relations is calculated from the number of skos:narrower relations of a particular concept, and the BBF for hierarchical relations from the number of skos:broader relations of a particular concept. For associative relations, both FBF and BBF are calculated from the number of skos:related relations of a particular concept. Since the skos:related relation is symmetric, the FBF and BBF for the associative relations of a particular vocabulary are the same.

Since the branching factor values of the concepts in a SKOS vocabulary are non-uniform, we calculated the average FBF and average BBF for both hierarchical and associative relations. Note that we ignored loose singleton concepts when calculating the average branching factors. The average hierarchical FBF, $F_H$, the average hierarchical BBF, $B_H$, and the average associative FBF, $F_A$, are given by the following equations:

$$F_H = \frac{n}{T_n}, \qquad B_H = \frac{b}{T_b}, \qquad F_A = \frac{r}{T_r} \qquad (3.2)$$

where $n$, $b$ and $r$ are the total numbers of skos:narrower, skos:broader and skos:related relations, respectively, and $T_n$, $T_b$ and $T_r$ are the total numbers of concepts with skos:narrower, skos:broader and skos:related relations, respectively.
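The averages in Equation (3.2) amount to dividing the number of links of a property by the number of distinct concepts that have at least one such link. The following sketch shows this on a small hypothetical hierarchy. The function name average_branching, the (subject, object) input format and the convention of returning 0 when a property is not used at all (where the quotient would otherwise be undefined) are assumptions made for illustration.

```python
def average_branching(links):
    """Average number of links per concept that has at least one link.

    links: iterable of (subject, object) pairs for a single property.
    Concepts without any link of this property (including loose
    singletons) do not appear as subjects and are therefore ignored.
    """
    links = list(links)
    subjects = {subject for subject, _ in links}
    return len(links) / len(subjects) if subjects else 0.0


# Hypothetical hierarchy: a has narrower concepts b and e; b has c and d.
narrower = [("a", "b"), ("a", "e"), ("b", "c"), ("b", "d")]
broader = [(child, parent) for parent, child in narrower]  # inverse links
related = []  # no associative relations in this example

F_H = average_branching(narrower)  # n / T_n = 4 / 2
B_H = average_branching(broader)   # b / T_b = 4 / 4
F_A = average_branching(related)   # no skos:related links
```

For this hierarchy the sketch gives $F_H = 2$, $B_H = 1$ and $F_A = 0$.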

(a) Example 1    (b) Example 2

Figure 3.2: Example graphs for determining the structure of vocabulary

Figure 3.2 shows two graphs illustrating two different structures of example SKOS vocabularies. Each circle represents a SKOS concept and each directed link between two circles represents a skos:narrower relation. The SKOS metrics for Figures 3.2(a) and 3.2(b) are given by:

M      S    D    L    R    MAX_B    F_H    B_H    F_A
M_A    5    2    0    1    1        2      1      0
M_B    8    2    3    1    1        2      1      0

Note that even though $F_H$ and $B_H$ are the same for both examples, because each example has the same skos:narrower and skos:broader relations, the structures of the two examples are different. However, by looking at $S$, $L$ and $R$, we may distinguish the structure of Example 1 from that of Example 2.

We also defined some rules using the SKOS metric, $M$, to categorise the vocabularies in our corpus:

• If $D$, $F_H$, $B_H$ and $F_A$ are all greater than 0, the vocabulary is categorised as a Thesaurus.
• If $D$, $F_H$ and $B_H$ are all greater than 0 and $F_A = 0$, the vocabulary is categorised as a Taxonomy.
• If $D$, $F_H$, $B_H$ and $F_A$ are all equal to 0, the vocabulary is categorised as a Glossary.
• If the vocabulary does not belong to any of the above categories, it is categorised as Others. An example is a vocabulary that uses only associative relations and no hierarchical relations.
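The categorisation rules translate directly into a short function. The sketch below is a minimal illustration; the function name categorise and its argument order are assumptions, and the inputs are the $D$, $F_H$, $B_H$ and $F_A$ components of the SKOS metric.

```python
def categorise(D, F_H, B_H, F_A):
    """Map the structural part of the SKOS metric to a vocabulary category."""
    hierarchical = (D, F_H, B_H)
    if all(value > 0 for value in hierarchical) and F_A > 0:
        return "Thesaurus"   # hierarchical and associative relations
    if all(value > 0 for value in hierarchical) and F_A == 0:
        return "Taxonomy"    # hierarchical relations only
    if all(value == 0 for value in hierarchical) and F_A == 0:
        return "Glossary"    # no semantic relations at all
    return "Others"          # for example, associative relations only


# Both example metrics M_A and M_B have D = 2, F_H = 2, B_H = 1, F_A = 0.
category_a = categorise(2, 2, 1, 0)
```

Both example vocabularies therefore fall into the Taxonomy category, since they have hierarchical but no associative relations.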

3.2 Result and observation

This section presents the results and observations of the SKOS vocabulary survey. We present the results following the order of components in the SKOS vocabulary survey architecture.

3.2.1 Discovery component

The Discovery component, which consists of three methods for collecting candidate SKOS vocabularies from the Web, produced 6819 candidate SKOS vocabularies. With the first method of knowledge discovery, we collected 303 candidate SKOS vocabularies⁷. Of these, 123 were gathered from the SKOS Implementation Report [MB09a] and the remaining 180 from the SKOS Datasets page [SKO10]. We manually navigated the links provided to download these candidate SKOS vocabularies. Some of these links, such as that for the STW Thesaurus for Economics⁸, require one to complete a simple form before the document can be downloaded. We also found several links that only point to descriptions of research projects, such as the English Heritage vocabularies from STAR⁹, with no access to the SKOS vocabularies themselves. With the second method, we collected 4220 URIs¹⁰: 2839 candidate SKOS vocabularies from the Swoogle search engine and 1381 from the Watson search engine. With the third method, we collected 2296 URIs of candidate SKOS vocabularies¹¹.

3.2.2 Validation component

3.2.2.1 SKOS vocabulary identification

The SKOS vocabulary identification module identified 1068 vocabularies as SKOS vocabularies according to our definition of a SKOS vocabulary. The remaining 5751 candidate SKOS vocabularies, which failed to be identified as SKOS vocabularies, are summarised in Table 3.1 [MBS12b].

7 This figure is valid as at 8 December 2010
8 http://zbw.eu/stw/versions/latest/about
9 http://hypermedia.research.glam.ac.uk/kos/STAR/
10 as at 2 March 2011
11 Note that the web crawler ran for approximately three months, ending on 17 May 2011

Exclusion type                                 Number of vocabularies
RDF feeds                                      2704
Ontologies that are not SKOS vocabularies      1199
Connection refused                             593
Network is unreachable                         403
Connection timed out                           354
Parsing error                                  333
Actual SKOS Core vocabulary                    165

Table 3.1: Exclusion types for candidate SKOS vocabularies that failed to be identified as SKOS vocabularies

Of these, 2704 candidate SKOS vocabularies turned out to be RDF feeds that contain some of the keywords, such as concept and broader, that were used in searching for the vocabularies. A further 165 candidate SKOS vocabularies referred to the actual SKOS Core data model¹², while 1683 consisted of documents with parsing errors and documents that were unreachable because of connection problems, such as connection refused and connection timed out. We found that 1199 candidate SKOS vocabularies were actual OWL documents but failed to be identified as SKOS vocabularies. Further inspection revealed that almost all of these OWL documents used at least one SKOS construct, especially the SKOS labelling and documentation properties.

3.2.2.2 ‘Slips’ detection and fixing

There were 47 URLs identified in the slips detection stage, with 18 documents detected using the listed SKOS constructs, and the rest were ‘inconsistent’ ontologies. We classified the types of slips into three categories as follows:

• Type 1: Undeclared property type.
• Type 2: Mis-use of SKOS constructs.
  1. Mistyping of an individual to be an instance of both skos:ConceptScheme and skos:Concept.
  2. Incorrect use of skos:narrower property to relate a concept to a collection.
• Type 3: Use of an invalid or user-defined datatype.

12 http://www.w3.org/2004/02/skos/core.rdf


Type 1: Undeclared property type.

The entity-type information returned by the OWLAPI revealed that all SKOS properties, such as skos:broader and skos:narrower, were of type owl:AnnotationProperty. Further inspection of the vocabulary sources showed that the SKOS properties used in these vocabularies were not explicitly typed as any of the possible property types, such as owl:ObjectProperty, owl:DatatypeProperty or owl:AnnotationProperty. Note that the SKOS specification [MB09b] does not enforce explicit declarations. In fact, this type of slip is a consequence of managing and processing SKOS vocabularies with tools such as the OWLAPI, which, from an OWL 2 DL perspective, require explicit declarations of the properties used in OWL documents. An example of this type of slip is shown in Figure 3.3. 18 candidate SKOS vocabularies were classified as having this type of slip.

AnnotationProperty(http://www.w3.org/2004/02/skos/core#broader)
AnnotationProperty(http://www.w3.org/2004/02/skos/core#narrower)

Figure 3.3: A snippet of SKOS vocabulary with a Type 1 slip

There are two possible patches to fix the slip.

1. Patch 1: Addition of missing declarations. Search for all SKOS-related constructs in the vocabulary and add the missing declarations for these constructs. For example, if the properties skos:broader and skos:narrower were found in the vocabulary, we would add declarations for both of these properties to be of type owl:ObjectProperty.
2. Patch 2: Import the SKOS core vocabulary. Another possible approach to fix this problem is to import the SKOS core vocabulary13.

Applying either patch fixed the problem. We applied the Patch 1 procedure and fixed the 18 SKOS vocabularies in this category.

Type 2: Mis-use of SKOS constructs.

This type of slip was identified through the exception thrown by the reasoner when it failed to classify the vocabularies. We found 6 candidate SKOS vocabularies that were inconsistent because of a ‘mis-use’ of SKOS constructs. From our inspection of the SKOS construct usage in these vocabularies, we can divide this type of slip into two categories.

13 http://www.w3.org/2004/02/skos/core.rdf

(a) Mistyping of an individual to be an instance of both skos:ConceptScheme and skos:Concept.

The SKOS Reference [MB09b] defines the skos:ConceptScheme and skos:Concept classes as disjoint. This means that, in a SKOS vocabulary, an individual cannot be an instance of both classes at the same time without the ontology being inconsistent. Five SKOS vocabularies were found to be inconsistent because one individual had been declared as a skos:Concept while the skos:inScheme property was used to relate other SKOS concepts to this individual. Since the rdfs:range of skos:inScheme is the class skos:ConceptScheme, as defined in the SKOS Reference, this individual was indirectly defined as being of type skos:ConceptScheme through the use of the skos:inScheme property. Figure 3.4 shows a snippet of a vocabulary that illustrates this type of slip. 5 candidate SKOS vocabularies were classified as having this type of slip.

Individual: urn:cgi:clasScheme:CGI:StratigraphicRank:200811
    Types: Concept
Individual: urn:cgi:clas:CGI:StratigraphicRank:200811:lithodeme
    Types: Concept
    Facts: inScheme urn:cgi:clasScheme:CGI:StratigraphicRank:200811,
           prefLabel "Lithodeme"@en

Figure 3.4: A snippet of SKOS vocabulary with a Type 2 slip

To ‘fix’ this type of slip, we propose the following procedure:

1. Search the vocabulary for the mentioned individual X.
2. Check for the existence of axiom(s) relating other SKOS concept(s) to individual X through the skos:inScheme property, for example Y skos:inScheme X. If yes, this indicates that individual X is inferred to be of type skos:ConceptScheme.
3. Check whether there is an existing declaration of individual X as type skos:Concept. If yes, remove this declaration from the vocabulary.

Applying this ‘fixing’ procedure fixed the five SKOS vocabularies in this category.

(b) Incorrect use of the skos:narrower property to relate a concept to a collection.

The SKOS data model provides the property skos:narrower to express a hierarchical relationship between SKOS concepts. For example, the assertion A skos:narrower B means that concept A has a narrower concept B. However, we found 1 vocabulary that used the property skos:narrower to relate a concept to a collection. The classes skos:Concept and skos:Collection are defined as disjoint classes in the SKOS data model. Therefore, since the rdfs:range of the skos:narrower property (like its rdfs:domain) is skos:Concept, using skos:narrower to relate a concept to a collection violates this constraint, causing the vocabulary to be inconsistent. In the SKOS data model, the correct property for relating a member to a collection is skos:member. Further inspection of the vocabulary showed that the skos:member property was declared but never used. Figure 3.5 shows a snippet of a vocabulary with this type of slip. 1 candidate SKOS vocabulary was classified as having this type of slip.

Class: Collection
    Individuals: _:genid1
Individual: milk
    Types: Concept
    Facts: narrower genid1,
           prefLabel "milk"@

Figure 3.5: A snippet of SKOS vocabulary with Type 2 slips

To ‘fix’ this type of slip, we propose the following procedure:

1. Search the vocabulary for the mentioned individual X.
2. Check whether there is an existing declaration of individual X as type skos:Collection.
3. Check for the existence of axiom(s) relating other SKOS concept(s) to individual X through the skos:narrower property, for example Y skos:narrower X. If yes, this indicates that individual X is inferred to be of type skos:Concept.
4. Then replace the axiom Y skos:narrower X by the corresponding skos:member axiom.

Applying this ‘fixing’ procedure fixed the one SKOS vocabulary in this category.

Type 3: Use of an invalid or user-defined datatype.

This type of slip was also identified through an exception thrown by the reasoner when it failed to classify the vocabularies. The slip arises when a user-defined or invalid datatype is not recognised by the reasoner. The use of user-defined datatypes is not a problem from the SKOS point of view; rather, it is a problem in the context of OWL 2 DL. We found 23 candidate SKOS vocabularies with this type of slip: 9 vocabularies due to user-defined datatypes and 14 vocabularies due to invalid datatypes. To patch this type of slip, we did the following. For the user-defined datatype problem, we first checked whether the user-defined datatype was actually used to type data in the vocabulary. If the datatype was not in use, we excluded it from the datatype list and reclassified the vocabulary. For the invalid datatype problem, further judgement was needed to fix the problem. We fixed no vocabularies with this type of slip.

Fixing the slips resulted in 24 additional SKOS vocabularies being included in the corpus.
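The fixing procedures described above for Type 1 (Patch 1) and Type 2 slips can be sketched as simple rewrites over a set of triples. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: the vocabulary is modelled as a plain Python set of (subject, predicate, object) tuples rather than through the OWLAPI, all function names and the `ex:` individuals are hypothetical, and the direction chosen for the replacement axiom (collection skos:member concept) is an assumption based on skos:member linking a collection to its members.

```python
RDF_TYPE = "rdf:type"

# Illustrative subset only: the text names skos:broader and skos:narrower as
# properties that Patch 1 declares to be of type owl:ObjectProperty.
OBJECT_PROPERTIES = {"skos:broader", "skos:narrower"}


def add_missing_declarations(triples):
    """Type 1, Patch 1: declare each used SKOS property that has no type."""
    used = {p for (_, p, _) in triples if p in OBJECT_PROPERTIES}
    declared = {s for (s, p, _) in triples if p == RDF_TYPE}
    return triples | {(p, RDF_TYPE, "owl:ObjectProperty") for p in used - declared}


def untype_schemes_declared_as_concepts(triples):
    """Type 2(a): remove the skos:Concept typing of any skos:inScheme target."""
    schemes = {o for (_, p, o) in triples if p == "skos:inScheme"}
    return {
        (s, p, o) for (s, p, o) in triples
        if not (p == RDF_TYPE and o == "skos:Concept" and s in schemes)
    }


def replace_narrower_to_collection(triples):
    """Type 2(b): rewrite skos:narrower axioms whose object is a collection."""
    collections = {s for (s, p, o) in triples
                   if p == RDF_TYPE and o == "skos:Collection"}
    fixed = set()
    for s, p, o in triples:
        if p == "skos:narrower" and o in collections:
            # Assumed direction: the collection has the concept as a member.
            fixed.add((o, "skos:member", s))
        else:
            fixed.add((s, p, o))
    return fixed


# Hypothetical vocabulary loosely mirroring Figures 3.4 and 3.5.
vocab = {
    ("ex:scheme", RDF_TYPE, "skos:Concept"),
    ("ex:lithodeme", RDF_TYPE, "skos:Concept"),
    ("ex:lithodeme", "skos:inScheme", "ex:scheme"),
    ("ex:milk", RDF_TYPE, "skos:Concept"),
    ("ex:dairy", RDF_TYPE, "skos:Collection"),
    ("ex:milk", "skos:narrower", "ex:dairy"),
}

patched = replace_narrower_to_collection(
    untype_schemes_declared_as_concepts(add_missing_declarations(vocab))
)
```

In this toy vocabulary the three steps add a declaration for skos:narrower, drop the skos:Concept typing of the individual used as a scheme, and replace the narrower link to the collection with a member link. Type 3 slips are left out because, as noted above, they required case-by-case judgement.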

3.2.3

Data Extraction and Analysis component

Based on the extracted metadata, filtering structurally identical SKOS vocabularies resulted in the exclusion of 603 identical SKOS vocabularies and 11 older versions of SKOS vocabularies, which gave us 478 SKOS vocabularies for further analysis. The summary figures and reasons for exclusion are presented in Tables 3.2 and 3.3. The full results and analysis can be found at http://www.myexperiment.org/packs/237.

3.2.3.1

SKOS constructs usage

Figure 3.6 shows the percentage of SKOS construct usage, sorted by frequency of occurrence, and Figure 3.7 shows the percentage of SKOS construct usage grouped by construct type. For each SKOS construct, a SKOS vocabulary was counted as using the construct if the construct is used at least once in the vocabulary. At this stage, we counted only SKOS constructs from the asserted axioms in each of the SKOS vocabularies, because we wanted to examine the actual usage of the SKOS constructs in those vocabularies.

Figure 3.6: Overall SKOS Construct Usage (bar chart; y-axis: number of vocabularies; x-axis: SKOS constructs, from Concept to xl:labelRelation)

Figure 3.7: Type of Constructs

Stages                                         vocabs
Discovery component:                           6819
  - Dedicated collections                      303
  - Semantic Web search engines                4220
  - Web crawler                                2296
Validation component:                          1092
  - SKOS vocabularies identification           1068
  - Slips detection & fixing                   24
      - Type 1                                 18
      - Type 2                                 6
Data Extraction and Analysis component:        478
  - Duplicate filtering                        478

Table 3.2: Summary results

Exclusion type                                 vocabs
RDF feeds                                      2704
Ontologies that are not SKOS vocabularies      1152
Duplicate SKOS vocabularies                    614
Connection refused                             593
Network is unreachable                         403
Connection timed out                           354
Parsing error                                  333
Actual SKOS Core vocabulary                    165
Slips (Type 2 & Type 3)                        23 (14 & 9)
Total exclusion                                6341

Table 3.3: Exclusions

In Figure 3.6, we can see that, of all the SKOS constructs made available in the SKOS Recommendation [MB09b], skos:Concept, skos:prefLabel and skos:broader are the three most frequently used constructs, appearing in 95.6%, 85.6% and 69.5% of the vocabularies, respectively. Each of the remaining constructs is used in less than 50% of the vocabularies. 28 out of 35 SKOS constructs were used in less than 10% of the vocabularies. Eight SKOS constructs were not used in any of the vocabularies in our corpus.

As shown in Figure 3.7, the SKOS constructs are arranged according to their types. In the semantic properties category, the skos:broader property is used more frequently than the skos:narrower property, even though these properties are inverses of one another. Among the labelling properties, skos:prefLabel is used more often than skos:altLabel. From the SKOS-XL properties category, only skosxl:literalForm and skosxl:altLabel are used, in one of the SKOS vocabularies in our collection.

3.2.3.2

SKOS vocabulary categorization

Thesaurus;     11%  

Glossary;     27%  

Others;