Linking from Schema.org microdata to the Web of Linked Data: An empirical assessment

Alberto Nogales, Miguel-Angel Sicilia, Salvador Sánchez-Alonso, Elena Garcia-Barriocanal

PII: S0920-5489(15)00144-0
DOI: 10.1016/j.csi.2015.12.003
Reference: CSI 3085
To appear in: Computer Standards & Interfaces
Received date: 8 June 2015
Revised date: 31 October 2015
Accepted date: 17 December 2015

Please cite this article as: Alberto Nogales, Miguel-Angel Sicilia, Salvador Sánchez-Alonso, Elena Garcia-Barriocanal, Linking from Schema.org microdata to the Web of Linked Data: An empirical assessment, Computer Standards & Interfaces (2015), doi: 10.1016/j.csi.2015.12.003
Linking from Schema.org microdata to the Web of Linked Data: an empirical assessment
Alberto Nogales
Research fellow Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km. 33.6, 28871 Alcalá de Henares, Spain. Email:
[email protected]
Miguel-Angel Sicilia
Full professor Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km. 33.6, 28871 Alcalá de Henares, Spain. Email:
[email protected]
Salvador Sánchez-Alonso
Associate professor Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km. 33.6, 28871 Alcalá de Henares, Spain. Email:
[email protected]
Elena Garcia-Barriocanal
Associate professor Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km. 33.6, 28871 Alcalá de Henares, Spain. Email:
[email protected]
Abstract
The use of Linked Open Data (LOD) has grown in the last few years, and the number of available datasets is now considerably higher. Microdata is another way to make data available, aimed at making information more understandable to search engines so that they can return better results. The Schema.org vocabulary was created for the enrichment of microdata as a way to give more accurate results for user searches. As Schema.org is a kind of ontology, it has the potential to become a bridge to the Web of Linked Data. In this paper we analyze the potential of mapping Schema.org to the Web of Linked Data. Concretely, we have obtained mappings between Schema.org terms and the terms provided by the Linked Open Vocabularies (LOV) collection. In order to measure the limitations of our mappings, we have compared the results of our script with some matching tools. Finally, an analysis of the usability of interlinking Schema.org to vocabularies in LOV has been carried out. For this purpose, we present two studies in which aggregated information is added. Results show that new information has been added a substantial number of times.
Keywords Ontologies, microformats, Schema.org, Linked Data, LOV.
1. Introduction

On June 2, 2011, Bing, Google, and Yahoo! announced the joint effort Schema.org1. Schema.org ontologies are intended for the creation of microcontents targeted at improving indexing and search systems (Ronallo, 2012). It consists of a set of tags introduced by HTML52 defining a vocabulary that lets webmasters mark up Web sites with microdata. The purpose of microdata is to help search engines and other tools working with Web sites to better understand the information contained in them. This will eventually help users to make more precise searches when they are
looking for information on the Web. Mika and Potter (2012) reported some statistics about the importance of using Schema.org. The increase in the use of microdata is also shown in Muhleisen and Bizer (2012), which demonstrates its growth among the different formats for embedding structured data. The Schema.org initiative has supported the use of microdata, choosing it as its preferred syntax. According to BuiltWith3, a tool providing technology adoption, e-commerce data and usage analytics for the Internet, the usage of microdata has increased from 750,000 Web sites at the end of 2013 to 1,500,000 at the time of writing.

There are other initiatives aimed at making data and content more accessible for machine consumption, notably Linked Open Data (LOD). LOD has the objective of publishing open datasets using the Resource Description Framework (RDF)4 format and interlinking these datasets using RDF links. These datasets are published following the well-known Linked Data principles (Bizer, Heath and Berners-Lee, 2009). Linked Data is sometimes confused with microformats5, but the latter is a completely different way to extend the Web that does not use the principles of Linked Data.

One of the characteristics shared by Schema.org and LOD is that of bringing structure and vocabularies to the Web, so it appears promising to create links between them. One way to do this is to map the classes and properties from Schema.org to the principal vocabularies used in the Web of Linked Data. In LOD there are no mandatory vocabularies; it is the communities using them that make a vocabulary more or less popular at any given time. We therefore need a way to measure the popularity of a vocabulary used in LOD. The Linked Open Vocabularies (LOV)6 initiative quantifies the use of classes and properties. LOV consists of various vocabularies used in different fields. The objective of this initiative is to give access to the vocabularies, describe the relations between them and show how they are linked with the Linked Data Cloud.

In previous work (Nogales et al., 2013), we mapped the classes and properties of Schema.org to LOV. Then, using the statistics provided by the LODStats project, we measured the impact of Schema.org on LOD. Finally, we presented a use case that retrieves information from datasets and aggregates it to Web pages, enriching their information.

In this paper, we report an assessment of the potential of linking microcontent and Linked Open Data through the mapping of Schema.org to Linked Open Vocabularies. We have developed a new mapping at a semantic level using synonyms of the Schema.org terms. Results show that only one third of the vocabularies in LOV provide a class mapping between Schema.org and LOD. With regard to properties, only a small percentage can be mapped using our approach. Furthermore, in LOD it is easier to find particular values for Schema.org properties than for classes. Taking this into account, we can conclude that the reachability of properties is higher than that of classes. Having shown that there is a substantial number of mappings between Schema.org and LOD, we exploit them in two studies. The first case will use the values of the classes and properties from Schema.org embedded in Web sites to aggregate new information retrieved from a dataset.
The second case involves extending a vocabulary from LOV with properties from Schema.org, taking into account the class mappings obtained previously. For both cases we present real examples demonstrating their usage. We draw some conclusions from this, giving users a measure of the LOD data available for pages that use Schema.org vocabulary. We also present some examples of software that could take advantage of these results.

The rest of this paper is structured as follows: in the second section, we present background on related work using Schema.org and LOV. The third section describes the materials and methods used to obtain the results. The fourth section shows the results obtained and discusses them. The following section describes the potential uses of the mappings. Finally, the last section offers some conclusions and implications for future work.

2. Background

In 2011, Schema.org started to provide an official dump of its ontology in OWL7 format. As a vocabulary, it addresses multiple areas and is not domain specific, but we can differentiate two parts. First, it provides a small set of elements to describe primitive data types like numbers or text, with classes such as Boolean, Date or Number. Second, the rest of the classes and
properties are used to describe different fields like Organizations or entities related to Medicine, Media, etc. The schema can be extended by users themselves to add new vocabulary for marking up their own data. Nowadays the schema published on the Web can be found in three formats: one is represented in Microdata, the next is an experimental version in RDFa8 and the third is in OWL, which is not yet fully up to date.

One of the uses of Schema.org is improving the discoverability of data in order to obtain better results when searching. Rosati and Mayernik (2013) compare the use of RDF and Schema.org to mark up HTML Web pages in order to increase the discoverability and connectivity of resources on the Web. The authors cite three cases, concluding that only one of them is more useful for making data more visible in public search engines. This paper does not give statistics about the use of Schema.org that could be used to decide which tags are more useful. Another approach to resource description, search optimisation and resource discovery can be found in Hawksey, Barker and Campbell (2013), using Schema.org as embedded metadata. The limitation here is that it only works with open educational resources.

Schema.org has been used in previous research to enrich data. Ambiah and Lukose (2013) used Schema.org in a case study to demonstrate the use of a tool that enriches Web sites automatically. The tool presented in the paper is Schema.org Microdata Creator (ScheMicCr), which is tested in two cases: the first builds a new Web site with microdata and the second enriches existing Web sites. In this paper the microdata is extracted from a patent knowledge base designed by the authors. In Li et al (2012) an application to publish media fragments and annotations is described using vocabularies defined in Schema.org. In this paper the authors only work with media fragments, enriching them so that they can be found more easily using search engines. Khalili and Auer (2013) introduce the concept of WYSIWYM9 (What-You-See-Is-What-You-Mean), which consists of manipulating structured content directly. For the implementation it uses a tool called RDFaCE10, an interface for semantic authoring of textual content, and Schema.org vocabularies to mark up pages. The annotation in this case is made by the users, who create their own subset from Schema.org. Schema.org is also used as the vocabulary to classify a large collection of Web sites and categorize them in Krutil, Kudelka and Snásel (2012). In this paper an algorithm is implemented using the microdata tag 'itemscope' to filter Web sites with recipes and 'itemprop' to get extra information like rating or author; this paper only uses a few tags of the vocabulary. In Mynarz (2014) a tool for validating and previewing structured data tagged with Schema.org in Web pages is developed. In this case the tags are already part of the Web. Another approach to enriching Web sites with Schema.org is presented in Tort and Olivé (2014), which uses a human-computer task-oriented dialogue to design the Web site.

There are also papers in which Schema.org is mapped to other vocabularies. In Atemezing and Troncy (2012) the GeOnto11 ontology is aligned with Schema.org in order to represent and query geospatial data; this paper presents a mapping of Schema.org, but only with one of the vocabularies of LOV.
A Personalized Location Information System is presented in Viktoratos, Tsadiras and Bassiliades (2013); here a manual mapping is made between the Google Places API12 and Schema.org so that users can fetch extra information from the Web when they retrieve information about a location. In this case the mapping of Schema.org is not made with LOV. Finally, Veres and Elseth (2013) present MaDaME, a tool for annotating Web sites with Schema.org and adding semantic metadata. The latter is added when a concept that the user wants is not contained in Schema.org, importing it from WordNet13 and using SUMO14 to create a mapping when it is not available. Again, a mapping is made between Schema.org and a vocabulary from LOV, but only with one of them.

In this paper we analyze the potential of using Schema.org microdata in resources from the Web of Data, using the work of Nogales et al. (2013) as a foundation. As a link between them, we use LOV in order to provide statistics on the use of Schema.org in LOD. LOV provides users with a collection of vocabularies from several fields like library science, e-commerce or biology. It also collects information about the ontologies that represent vocabularies, detailed information about them, statistics related to LOD and graphical relations between vocabularies. LOV has been used in several previous studies. Some of these vocabularies have been analysed by Poveda, Suárez and Gómez (2011) to display the reuse of ontologies in Linked Data. This paper gives statistics about how these vocabularies are used and related to each other, not about their use in the Web of Linked
Data. A framework that helps to lift raw data sources to semantic interlinked data sources, containing a module based on LOV, is described in Scharffe et al (2012). Here LOV is used to convert raw data, not as a bridge to enrich Web sites with information from LOD. In Zamazal, Bühman and Svátek (2013) PatOMat15, a framework for transforming ontologies, is integrated with ORE16, a tool for improving knowledge bases using SPARQL endpoints and OWL ontologies. An experiment using 16 vocabularies from LOV was carried out, reducing a problematic pattern in many cases. An analysis of a set of vocabularies from LOV is made in Poveda-Villalón et al (2013) to detect good and bad practices when publishing vocabularies, so that the tools using them only need the URI, its namespace and the prefix. Again, the vocabularies contained in LOV are not used to retrieve information from LOD. In Schaible et al (2013) LOVER is presented, an approach that recommends classes and properties from LOV in order to model Linked Data datasets. In our approach the classes and properties are obtained directly from the Schema.org mappings. An alignment between the services prefix.cc and LOV is presented in Atemezing et al (2013), with the aim of managing and harmonizing vocabularies' namespaces; the first service is used for looking up and providing namespaces and the second extracts metadata from vocabularies. One difference with our paper is that we are doing mappings with Schema.org. Finally, in Kontokostas et al (2014) a methodology for creating tests to measure the quality of Linked Data datasets is developed; the approach is applied to 297 vocabularies in LOV, generating 32,293 tests. In that paper LOV is used to test the approach; it is not a main piece of it as in ours.

Our work extends previous papers in several directions. It provides a quantitative analysis of mapping Schema.org to the vocabularies in LOV. This could be used to improve users' searches, something that cannot be found in previous work. It presents two use cases that enrich data in different ways. The first case aggregates data to a particular Web page; Ambiah and Lukose (2013) do something similar with microdata tags, whereas in our case we retrieve new information from DBpedia, which is aggregated to the Web site so that it contains new information. We also present a second use case, which has not been implemented before, in which we extend a vocabulary with new properties from Schema.org. Finally, none of these papers give statistics about the use of Schema.org in the Web of Linked Data, which is one of the principal achievements of our paper.

Taking these previous papers in the same field into account, our main contribution is a mapping between Schema.org and the vocabularies contained in LOV. This mapping was first presented in Nogales et al. (2013), so here we carry out a more in-depth analysis. We have developed a new mapping that works at a semantic level. New statistics about the mappings obtained in the vocabularies are presented. We compare the mappings obtained by our script with matching tools to determine how accurate our results are. Furthermore, we present two studies in which we take advantage of our Schema.org mappings to add new information to Web sites and ontologies. The first case, which was also part of the foundation paper, uses the properties of classes from Schema.org found in some Web sites, and their particular values, to retrieve information from DBpedia and aggregate new data to those pages.
The second case, which is a new achievement of the paper, works with the class mappings generated between Schema.org and LOV to add new properties to ontologies, taking the properties from Schema.org that are not already present. Another new contribution with respect to Nogales et al. (2013) is that we give quantitative results for these cases and present some real examples applying both methods. Finally, we propose some software developments that could make use of the work presented in the paper.

3. Materials and methods

As said before, the starting point of this work is to map Schema.org terms to the vocabularies collected by the LOV project, so the first step was to download the vocabularies to work with. A total of 360 vocabularies were downloaded from the LOV portal using Speed Download17 on September 22nd, 2013. Once we had the files, we processed the RDF data. All the files from LOV are in Notation318 (.n3) format, and to work with them we decided to use RDF.rb19, a pure Ruby library for working with RDF data. The second step is to combine these mappings with the information obtained from LODStats20 (Demter et al., 2012), an approach for generating statistics from RDF datasets. This gives us a comprehensive picture of the current state of the Web of Linked Data.
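As an illustration of this processing step, the following minimal sketch parses one downloaded LOV dump and lists the classes it declares. It is written in Python with the rdflib library as a stand-in for the Ruby RDF.rb code actually used, and the file name is a placeholder.

from rdflib import Graph, RDF, RDFS, OWL

g = Graph()
g.parse("foaf.n3", format="n3")   # placeholder path to one of the downloaded LOV dumps

# collect everything declared as an OWL or RDFS class in the vocabulary
declared = set(g.subjects(RDF.type, OWL.Class)) | set(g.subjects(RDF.type, RDFS.Class))
for cls in sorted(declared):
    print(cls)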
Apart from this, we developed a semantic mapping following the same method described before: first we map the classes and, once these have been obtained, we map the properties. The difference with the first mapping is that we do not match the terms string by string; we match them through their synonyms. For that purpose we use a Python library called PyDictionary21, a package providing translations, definitions and synonyms. As we need to check that the synonyms have the same meaning, the mapping has to be semiautomatic: first we obtain the synonyms for a Schema.org term, then we search for them in LOV, and finally we check manually whether the synonym and the term share the same meaning.

Schema.org is comprised of classes and properties, which are the terms we are interested in mapping. As the properties are related to classes, we need to make the mappings in two steps. First we have to establish the relationship between the classes, creating a mapping when a class from Schema.org has the same name as a class from LOV. Once we have these mappings, we have to work with the properties of the classes to obtain the mappings of the second level. For that purpose, a mapping between properties exists when two classes have a mapping between them and also have a property with exactly the same name. For example, the class "Person" in Schema.org can only be mapped with another class called "Person" found in LOV. At the second step, a property like "familyName" must be a property of the class "Person" in Schema.org and also of the class "Person" in a vocabulary from LOV. Tables 1 and 2 indicate how the mappings are created in both cases.
Table 1. Example of a class mapping between Schema.org and LOV.

Class schema.org_iri         Class lov_iri
http://schema.org/Person     http://xmlns.com/foaf/0.1/Person
Table 2. Example of a property mapping between Schema.org and LOV.

Class schema.org_iri        Property schema.org_iri         Property lov_iri
http://schema.org/Person    http://schema.org/familyName    http://xmlns.com/foaf/0.1/familyName
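As an illustration of the semiautomatic semantic mapping described above, the sketch below uses the PyDictionary package named in the text to propose synonym-based candidate pairs. The term lists are placeholders, and every candidate pair still requires the manual check that both terms share the same meaning.

from PyDictionary import PyDictionary

dictionary = PyDictionary()
schema_classes = ["School", "Event"]                  # placeholder Schema.org class names
lov_classes = {"College", "Happening", "Person"}      # placeholder class names from one LOV vocabulary

for term in schema_classes:
    synonyms = dictionary.synonym(term) or []         # may return None when no synonym is found
    candidates = [s for s in synonyms if s.capitalize() in lov_classes]
    if candidates:
        print(term, "->", candidates)                 # candidate pairs to be verified manually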
Normally, mappings are obtained using what we call ontology matching tools. These tools are developed so that users can discover which classes and properties two ontologies have in common. In our case we developed a Ruby script using the RDF.rb library. First we compare classes by string and then, with these results, we do the same for properties. The string comparison is not case sensitive. The results obtained with our script have been combined with LODStats in order to obtain statistics on how many instances of the terms can be found in LOD. The results obtained in these steps have been compared with two ontology matching tools. The first is LogMap (Jimenez and Cuenca, 2011), an ontology matching tool that supports the same formats as the OWL API and can work with classes, properties and instances; it can be run from the command line, as SEALS packages or through a Web interface. We have also compared the results obtained by our script with the Alignment API (David et al., 2011), an API written in Java used to align ontologies. It has been used as the core of multiple ontology matching tools such as CIDER-CL (Gracia and Asooja, 2013) or ODGOMS (Kuo and Wu, 2013).
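The two-step, case-insensitive comparison performed by the script can be sketched as follows. This is a simplified Python rendering of the logic (the study's actual script is written in Ruby with RDF.rb), and the dictionaries below are illustrative placeholders.

# class name -> set of property names, for Schema.org and for one LOV vocabulary
schema_org = {"Person": {"familyName", "birthDate"}, "Event": {"duration"}}
lov_vocab = {"Person": {"familyName", "mbox"}, "Agent": {"name"}}

# step 1: class mappings (same name, ignoring case)
class_mappings = [(s, l) for s in schema_org for l in lov_vocab if s.lower() == l.lower()]

# step 2: property mappings, only inside already mapped classes
property_mappings = [(sc, sp, lc, lp)
                     for sc, lc in class_mappings
                     for sp in schema_org[sc]
                     for lp in lov_vocab[lc]
                     if sp.lower() == lp.lower()]

print(class_mappings)       # [('Person', 'Person')]
print(property_mappings)    # [('Person', 'familyName', 'Person', 'familyName')]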
4. Analysis and discussion
In this section we present an analysis of the mappings and the number of occurrences of each. We first report on the most mapped classes and properties and the impact of the Schema.org mappings on LOD.

4.1. Contrasting keyphrases
In previous sections we have described two kinds of mappings: one between classes and the other between properties. Consequently, we generated these mappings in two steps. Furthermore, we relate the mappings obtained to the information provided by LODStats. In what follows, we discuss the top mappings achieved in each step for both approaches.

The first thing we have done is to establish mappings between the classes of Schema.org and LOV using both mappings. In total, 135 different classes have been mapped, which is 25.18% of the classes in Schema.org. Compared with Nogales et al (2013), we have mapped 16 more classes. Counting the total instances obtained with both mappings, we obtained 585, which is 298 more than in the foundation paper. We compare these results with the ones obtained using LogMap and the Alignment API to assess how accurate our results are. The top 5 classes according to their occurrences can be examined in Table 3, where the first column of results belongs to our script, the second to LogMap and the last to the Alignment API. We have also obtained a histogram, shown in Figure 1, which shows the concentration of the classes mapped. This indicates whether there are a few classes with a lot of occurrences or many classes with few occurrences. This classification can be used when users need to choose which classes to tag their Web pages with. In principle, a Schema.org class with more mappings could benefit from more vocabularies when applying the use cases presented in this paper. For example, if a user has to choose between two classes that are synonymous, the one with more occurrences in LOV could be more useful.
Table 3. Comparison of class mappings between our script and two alignment tools.

Class Name    Our script    LogMap    Alignment API
Book          23            17        21
Place         22            19        17
Event         20            13        17
School        20            16        15
Comment       19            17        14
Figure 1. Histogram of the classes mapped between Schema.org and LOV.
In Table 4 we have the same information, but for the property mappings. In this case, 16 new properties have been mapped compared with the foundation paper, which means that adding the semantic level to the mapping improves it. In total, 13.55% of the Schema.org properties have been mapped. Counting the total instances obtained with both mappings, we have 913 instances, and 101 different vocabularies have been used in these mappings. As in the previous table, the results belong first to our script, second to LogMap and finally to the Alignment API. We have also obtained a histogram to measure the concentration of the property mappings, which can be found in Figure 2.
Table 4. Comparison of property mappings between our script and two alignment tools.

Class Name       Property Name    Our script    LogMap    Alignment API
Table            note             12            11        19
Event            description      9             9         17
AnimalShelter    agent            8             3         14
Winery           agent            8             4         3
Embassy          agent            8             7         11
Figure 2. Histogram of the properties mapped between Schema.org and LOV.
Based on the histograms, their shapes seem to follow a power-law distribution. A power-law distribution is used to model the relationship between two different variables or, as in this case, to model the frequency of a single variable. Using the data obtained with our script, we will check whether the occurrences of the class and property mappings vary as a power of the mappings themselves. In other words, we will test whether there are a few class or property mappings with a lot of occurrences or a lot of mappings that are not very common. The mathematical definition of a power-law distribution is as follows:

p(x) = C x^{-\alpha}

In this definition, x corresponds to a range of values, and C and \alpha are constants; in fact, C is derived as C = (\alpha - 1)\, x_{\min}^{\alpha - 1}. We have to take into account that \alpha > 1 is a requirement for the power-law form to normalize. We then estimate the values of \alpha and x_{\min} from the results obtained for the class and property mappings. We use R22, a statistical computing environment, which has a package called poweRlaw23 aimed at this particular task. After using a function that calculates these variables, the values for the two sets are \alpha = 2.03 and x_{\min} = 1 in the case of classes, and \alpha = 73.62 and x_{\min} = 8 in the case of properties. In the case of the classes, the set does not follow a power law.
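The same maximum-likelihood fit can be reproduced outside R; the sketch below uses the Python powerlaw package as a stand-in for the poweRlaw package used in the study, with a placeholder list in place of the real occurrence counts.

import powerlaw

occurrences = [23, 22, 20, 20, 19, 12, 9, 8, 8, 8, 5, 3, 2, 1, 1]   # placeholder mapping-occurrence counts
fit = powerlaw.Fit(occurrences, discrete=True)                       # maximum-likelihood power-law fit
print(fit.power_law.alpha, fit.power_law.xmin)                       # estimates of alpha and x_min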
Considering the comparison between our mappings and LogMap, we have found different results. Most of the time, the number of matchings is the same for each vocabulary; sometimes our matchings have more occurrences and sometimes fewer. Furthermore, in a few cases LogMap was not able to work with the file and returned an error. Table 5 shows this information grouped by cases, with the percentage of each. In 13 vocabularies our script obtained better results, LogMap was better in 16 cases, and most of the time, in 323 vocabularies, the number of mappings was the same. As with the top classes, this information could be used with the same objective.
Table 5. Comparison between our script and LogMap classified by cases.

Case               Number of vocabularies aligned    Percentage
Mapping script     13                                3.61 %
LogMap matching    16                                4.44 %
Equals             323                               89.72 %
File error         8                                 2.22 %
Total              360                               100 %
Analyzing the results obtained, there are some important points to be noticed. Regarding the mappings provided by LogMap, we have drawn the following conclusions. First of all, some of these matchings improve because LogMap ignores special symbols like "-"; an example of this is matching "GovernmentOrganization" with "Government-Organization". It also sometimes matches similar words like "Organization" and "Organisation", which are the same word in American and British English. We have also found a case in which LogMap seems to use synonyms, matching "School" with "College". In these cases LogMap improves the results, but there are occasions in which our matching finds more occurrences. We have often found classes matched by LogMap in which one of the classes contains or is contained by the other, e.g. "RecyclingCenter" with "Center". In our opinion, these matchings cannot be considered accurate, as the classes seem to be at different levels in the subsumption hierarchy. However, they provide an interesting perspective for looking for approximate matches in the future. This means that the number of vocabularies where LogMap finds more occurrences than our script is lower than what Table 5 indicates. Taking these special cases into account, the number of vocabularies where the number of classes is the same is 335, and those where LogMap is higher are only 4. Therefore,
the percentages change to 93.05% and 1.11% respectively. Table 6 shows the increase in the cases where the results are the same using the script and LogMap.

Table 6. Comparison between percentages taking into account the special cases.

Case               Preliminary percentages    Percentages with special cases
Mapping script     3.61 %                     3.61 %
LogMap matching    4.44 %                     1.11 %
Equals             89.72 %                    93.05 %
File error         2.22 %                     2.22 %
Total              100 %                      100 %
The same comparison has been made with the results provided by the Alignment API, which can be found in Table 7. After running the Alignment API on all the vocabularies, we realized that it is less stable than the other options: 193 of the vocabularies could not be aligned, resulting in an error during execution. In some cases the error was a null pointer exception, in others the problem was trying to load other ontologies imported by the vocabulary, and a further error was related to parsing the format of the file. Considering the rest of the results, our script was more accurate for 18 vocabularies, the Alignment API for 7, and in 142 cases the results were the same. Here we have not found special cases like those described for the class comparison with LogMap.
Table 7. Global comparison between our script and Alignment API classified by cases.

Case                      Number of vocabularies aligned    Percentage
Mapping script            18                                5 %
Alignment API matching    7                                 1.94 %
Equals                    142                               39.44 %
File error                193                               53.61 %
Total                     360                               100 %
Some statistics about the most used vocabularies in these mappings have also been compiled. The information can be seen in Table 8, which shows how many classes of each vocabulary could be mapped. In total, about one third of the vocabularies have a mapping between classes. We have also obtained a histogram, represented in Figure 3, to see the concentration.

Table 8. LOV vocabularies with more classes mapped between Schema.org and LOV.
Vocabulary                 LOV prefix    Mapping occurrences
Accommodation Ontology     acco          66
LinkedGeoData              lgdo          47
PROTON Extent module       pext          25
Audio Features Ontology    af            18
AKT Reference Ontology     akt           14
Figure 3. Histogram of the LOV vocabularies with more classes mapped between Schema.org and LOV.
The first vocabulary in Table 8 is the Accommodation Ontology24, a vocabulary for describing hotels, vacation homes, camping sites and other accommodation offers for e-commerce. The second is the LinkedGeoData ontology, a dataset with a spatial dimension whose information is collected from OpenStreetMap25, presented in Stadler et al (2012). The rest of the vocabularies are: the PROTON Extent module26, an upper-level ontology with extensions to handle Linked Open Data; the Audio Features Ontology27, which expresses common concepts to represent features of audio signals; and the AKT Reference Ontology28, which describes people, projects, publications, geographical data, etc.

Regarding the property mappings, Table 9 presents the same information and Figure 4 the corresponding histogram. Here we find new vocabularies such as the Open Graph Protocol Vocabulary29, which enables any web page to become a rich object in a social graph; OpenVocab30, a community-maintained vocabulary intended for use on the Semantic Web; BIO31, a vocabulary for describing biographical information about people, both living and dead; the Payments Ontology32, a vocabulary for representing payments, such as government expenditures, using the data cube representation; and the Basic Access Control ontology33, which defines the element of Authorization and its essential properties, as well as some classes of access such as read and write. In total, only 8.05% of the vocabularies obtained a mapping using the properties.
Table 9. LOV vocabularies with more properties mapped between Schema.org and LOV.

Vocabulary Name                   LOV abbreviation    Mapping occurrences
Open Graph Protocol Vocabulary    og                  122
OpenVocab                         ov                  122
BIO                               bio                 93
Payments ontology                 pay                 89
Basic Access Control ontology     acl                 88
Figure 4. Histogram of the LOV vocabularies with more properties mapped between Schema.org and LOV.
We have also calculated the power laws for both cases. The values for the classes are \alpha = 2.97 and x_{\min} = 6, and for the properties \alpha = 10.008 and x_{\min} = 87, so both fit a power law. The information given by these two tables could be applied when the user tagging a Web site needs to choose which vocabulary to use: if there is more than one vocabulary in the same field, it is better to use the one with more occurrences, as it has more possibilities to enrich the data when applying the first use case exposed in the paper.
4.2. Impact of the mappings in LOD
Here we present some statistics about the relevance in LOD of the classes and properties mapped previously. After obtaining the mappings, we contrasted them with the information given by LODStats, searching for the number of occurrences reported by LODStats for each class. The information about the top classes is shown in Table 10.
Table 10. Schema.org classes from the mappings with more occurrences in LOD.
Class Name      LOD Occurrences
Person          3,217,769
Organization    237,655
Event           8,235
City            4,589
Dataset         612
Further information related to Table 10 is which LOV vocabularies have more representation in these occurrences. For example, the AKT Reference Ontology vocabulary provides more occurrences than the others, over three million. Other vocabularies with large representation are the Friend of a Friend vocabulary, with approximately two million, and Semantic Web for Research Communities. Table 11 shows the same information for the properties. In this case we have searched LODStats using only the name of the properties, without considering the class matchings.
Table 11. Schema.org properties from the mappings with more occurrences in LOD.

Property Name    LOD Occurrences
name             16,656,930
description      8,784,687
height           4,718,986
width            4,718,984
gender           2,848,501
4.3. Limitations

In this section we review the limitations found while completing the experiment. First of all, let us address the limitations of LOV. As can be seen on its Web site, some of the vocabularies are not available, for two different reasons: sometimes the vocabulary has an invalid URL or there is a content negotiation problem, and sometimes the .n3 file that contains the information has never been fetched. Second, RDF.rb could not process some files. Because of this, we have not been able to work with all the vocabularies presented on the Web site.
The second mapping is semiautomatic. As we have to disambiguate the synonyms, and automatic disambiguation tools would require the terms to appear as part of a sentence, the disambiguation has been done manually by comparing definitions. To obtain an accurate result, this should be done by vocabulary curators. There are also some limitations in using LODStats: as can be seen on its Web site, 1185 datasets have errors with the dumps or with the SPARQL endpoints, so we cannot be sure that all the information provided is correct. Finally, we have found some limitations with our matching script. Some vocabularies could not be processed. Moreover, comparing it with other matching tools, we have realized several things: we do not discriminate special symbols like "-", so a term composed of two words cannot be matched when such symbols are used; we do not use synonyms, so two different words with the same meaning cannot be matched; and two words written in different varieties of English (American or British), which may differ in only one letter, are not mapped either. Finally, regarding the comparison with the results obtained with the Alignment API, we could not work with more than 50% of the files, which resulted in errors.

4.4. Usage evaluation
Lastly, we evaluate two studies to demonstrate how the different vocabularies are connected and how we can benefit from the embedded metadata and its connections with the Web of Linked Data.

The first case consists of aggregating new information to a Web site that uses Schema.org as microdata. Suppose we start from a particular page which contains Schema.org metadata. A list of domains containing such metadata is available in the Web Data Commons34 project, which extracts data from Web sites with microdata and provides statistics. This information is stored in instances using the following format: first a class with a property from Schema.org, next the domain where it is used, third the parameter with the exact value found in the Web site and, finally, the number of occurrences. We can obtain extra information using these classes and properties from Schema.org together with their particular values: we can query a dataset endpoint for all the information related to a given value. For example, we can use DBpedia, a project that allows users to extract structured information from Wikipedia, as reported in Lehmann et al (2014). The query uses that particular value and retrieves extra information related to it that is stored in DBpedia. Based on the information returned by the endpoint, we will find data that is not used in the page and could be aggregated to it. We demonstrate this with a real example later.

The second case consists of extending any of the vocabularies from LOV with properties from Schema.org. For that purpose we need a mapping between two classes that refer to the same term; for example, Event is a class in Schema.org and also in the LOV vocabulary Semantic Web Portal Ontology. As this class describes the same item, all the properties from Schema.org that are not included in the vocabulary can be used as properties in it. We give some real examples to explain this further on.

For the first case, we have been able to extract a set of information using the dataset provided by Web Data Commons. The dataset was created using Any2335 to extract structured data embedded in the web crawl provided by the Common Crawl Foundation36. This crawl is currently the biggest and most up-to-date web corpus. It is publicly accessible and stores information from about 3.5 billion pages, in which over 7.5 billion N-Quads37 were found, with a size of 102 Terabytes uncompressed. The data is available in Amazon's Simple Storage Service (S3)38, a Web Services interface that allows users to store and retrieve data. This information is freely accessible using the Amazon Elastic Compute Cloud (Amazon EC2)39 Web service, which was designed to make web-scale computing easier for developers. Since we are only interested in obtaining microdata and, in particular, Schema.org, a script is necessary to filter the information. The script can be developed with Apache Pig40, a language aimed at designing programs that analyze large datasets on a Hadoop cluster. Hadoop41 is a framework used on clusters of computers so that users can process distributed datasets.
The script has to be developed in four steps. The first step consists of reading all the quads; as the only valid values are those that use Schema.org, a filter on this schema is needed. It is important to take into account that some of the tags used by webmasters have been written incorrectly, so the script should only retrieve those using the standard Schema.org format, which is http://schema.org/Class/property. In the second step, a distinct key is created combining the class and property from Schema.org and the value it has in a particular domain. Once the script has every distinct instance of that key, it counts the number of occurrences of each key. Finally, it generates the output in a plain text file. The information for each instance contains the class and property from Schema.org, the exact value for the IRI, the domain where it has been found and the number of occurrences of that value in the domain. After running the script on February 2, 2014, a file of 380 Gigabytes uncompressed was obtained, in which we can find more than 750 million different instances of Schema.org values. In order to count how many of these instances contain useful information to be queried in DBpedia, we first filtered by IRIs with different classes and properties from Schema.org. After this step we obtained 7,783 different combinations of classes and properties, and realized that many of the values of these instances would not retrieve any information from DBpedia. For example, when the class is "Article" and the property is "articleBody", the value is a long text with a unique value that has no correspondence in DBpedia, so all values with more than 255 characters have been discarded. Other examples that will not retrieve information are numeric values, such as the "width" of a video, and values that have been encrypted. Based on these examples, we reduced the number of instances to 1,662. The final step to obtain new information is to use these Schema.org IRIs, combined with the particular values stored in our file obtained from Web Data Commons, to build the queries to DBpedia. We have developed two types of queries: the first uses only the exact value for the IRI and the second uses that value combined with the Schema.org class/property obtained from the IRI.
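A compact stand-in for these four steps is sketched below. It is a simplified Python rendering rather than the Pig script actually used, and the quad pattern and file names are illustrative assumptions.

import re
from collections import Counter
from urllib.parse import urlparse

counts = Counter()
# simplified pattern: subject, Schema.org predicate, literal value, page URL (graph)
quad = re.compile(r'<[^>]+> <(http://schema\.org/[^>]+)> "([^"]*)"[^<]* <([^>]+)> \.')

with open("microdata.nq", encoding="utf-8") as quads:                # placeholder Web Data Commons extract
    for line in quads:
        m = quad.match(line)
        if not m:
            continue                                                 # step 1: keep only Schema.org quads
        schema_iri, value, page = m.groups()
        if len(value) > 255 or value.isdigit():
            continue                                                 # drop long texts and purely numeric values
        counts[(schema_iri, value, urlparse(page).netloc)] += 1      # steps 2-3: distinct key and its count

with open("schemaorg_values.tsv", "w", encoding="utf-8") as out:     # step 4: plain-text output
    for (schema_iri, value, domain), n in counts.items():
        out.write(f"{schema_iri}\t{value}\t{domain}\t{n}\n")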
The following code corresponds to the two types of queries we have run.

(1) Query using only values:

PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT * WHERE { dbpedia:value ?predicate ?object }

(2) Query using Schema.org class and value:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE { dbpedia: rdf:schemaClass ?object . FILTER regex(?object, "", "i" ) }
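A minimal way to submit query type (1) programmatically is sketched below, using Python with the SPARQLWrapper library against the public DBpedia endpoint. The client library and the value are assumptions for illustration; the paper does not specify how the queries were executed.

from SPARQLWrapper import SPARQLWrapper, JSON

value = "Rio_de_Janeiro"                              # placeholder value taken from the Web Data Commons file
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(f"""
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    SELECT * WHERE {{ dbpedia:{value} ?predicate ?object }} LIMIT 50
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["predicate"]["value"], row["object"]["value"])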
The final step consists of running these queries for each value stored in the file extracted from Web Data Commons, in relation to the Schema.org IRIs we filtered previously. After running a process with the two types of queries, we obtained new information from DBpedia 420,324 times for the first case and 3,539,510 times for the second. Below we explain the case with a real example. In this study we use the following instance: http://schema.org/LodgingBusiness/Hotel/addressRegion, whose value is "Rio de Janeiro", and which appears in the domain http://mamangua.com. First we check whether the microdata is contained in that page. For that purpose, we have used the Google Structured Data Testing Tool42, which allows users to obtain all the microdata contained in a Web page, classifying it by elements, types (giving the metadata format used) and properties. In our example, the results show that there is an element tagged with Schema.org which pertains to the "LodgingBusiness/Hotel" class and has a property called "addressRegion". Here is the mark-up HTML contained in the web page.
Pousada e Restaurante
- Praia Grande
- Saco do Mamanguá
- Paraty
- Rio de Janeiro
- Brasil
- Tel:+55 (11) 4063-8242
- email: [email protected]
Contato