ACTIVE Deliverable template - PlanetData

2 downloads 313 Views 3MB Size Report
Sep 8, 2014 - discover new Linked Datasets, KIT as well as UMA performed large-scale ...... wordpress.com, data about op
PlanetData Network of Excellence FP7 – 257641

D4.5 PlanetData Data Sets and Vocabularies and Access Portal Coordinator: Max Schmachtenberg (UMA) With contributions from: Tobias Käfer (KIT), Christian Bizer (UMA), Andreas Harth (KIT) Quality Reviewers: 1st Giorgos Flouris 2nd Davor Orlic

Deliverable nature:

R

Dissemination level:

PU

(Confidentiality)

Contractual delivery date:

M48

Actual delivery date:

M48

Version:

1.0

Total number of pages:

49

Keywords:

Linked Data, Data Cataloging, LOD Cloud

Deliverable D4.5PlanetData

Abstract In this deliverable we report on the cataloguing activities of PlanetData. During the reporting period, the following results were achieved: In order to discover additional Linked Datasets, KIT as well as UMA performed large-scale crawls of the Linked Data Web. The Web data corpus crawled by KIT consists of 4 billion RDF triples and is offered as evaluation data for the Semantic Web Challenge 2014, the premier event for showcasing Semantic Web applications. The Web data corpus crawled by UMA was further analysed concerning the topical domains of the published data as well as the compliance of the data sources with the Linked Data best practices. The results of the analysis were published in the form of a data catalog covering 1900 datasets and classifying each dataset according to its topical domain, vocabulary usage and compliance with the best practices. The linkage relationships between the datasets were visualized in the form of an updated Linked Data Cloud diagram. A paper describing the results of the analysis was accepted for publication at the International Semantic Web Conference and will be presented there in October.

Page 1 of (49)

Executive summary In this deliverable we report on the cataloguing activities of PlanetData. During the reporting period, the following results were achieved: In order to discover additional Linked Datasets, KIT as well as UMA performed large-scale crawls of the Linked Data Web. In a first crawl, samples were drawn from a large number of datasets. Using the metrics described in deliverable 2.4, the datasets were examined to what exend they adhere to best practices emerged from the community. These include the extent of interlinking, the use of vocabularies as well as the provision of metadata. This information, along with other properties of the datasets, was used to create a dataset catalog that can serve as an entry point for data consumers when in search for datasets. The catalog is accompanied with a webpage, displaying the result of our analysis, as well as a LODCloud diagram, showing the extent and size of the Web of Linked Data at a glance. The Web data corpus crawled by UMA can be downloaded from http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/ The data catalog can be accessed at http://linkeddatacatalog.dws.informatik.uni-mannheim.de/ The linkage relationships between the datasets were visualized in the form of an updated Linked Data Cloud diagram. The diagram is found at http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ The paper [12] describing the results of the analysis was accepted for publication at the International Semantic Web Conference 2014 and will be presented there in October. The paper can be downloaded from http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/ The second crawl aimed at creating a large corpus of Linked Data triples for the Billion Triple Challenge (BTC) held at ISWC and for performing large scale data expeirments. Through multiple iterations, the crawl accumulated more than one terabyte data, holding a total of 4.4 billion triples. The KIT BTC corpus can be downloaded from http://km.aifb.kit.edu/projects/btc-2014/ In the following, we will report in more detail about both crawls as well as the results of the analysis.

Deliverable D4.5PlanetData

Document Information IST Project

FP7 - 257641

Acronym

PlanetData

Number Full Title

PlanetData

Project URL

http://www.planet-data.eu/

Document URL EU Project Officer

Leonhard Maqua

Deliverable

Number

D4.5

Title

Work Package

Number

WP4

Title

Date of Delivery

Contractual

Status

Version 0.4

Nature

prototype □ report X dissemination □

Dissemination level

public X consortium □

Authors (Partner)

University of Mannheim

M48

Actual final □

Name

Max Schmachtenberg

E-mail

[email protected]

Partner

UMA

Phone

06241

Responsible Author

Abstract (for dissemination)

M48

In this deliverable we report on the cataloguing activities of PlanetData . During the reporting period, the following results were achieved: In order to discover new Linked Datasets, KIT as well as UMA performed large-scale crawls of the Linked Data Web. The Web data corpus crawled by KIT consists of 4 billion RDF triples and is offered as evaluation data for the Semantic Web Challenge 2014, the premier event for showcasing Semantic Web applications. The Web data corpus crawled by UMA was further analysed concerning the topical domains of the published data as well as the compliance of the data sources with the Linked Data best practices. The results of the analysis were published in the form of a data catalog covering 1900 datasets and classifying each dataset according to its topical domain, vocabulary usage and compliance with the best practices. The linkage relationships between the datasets were visualized in the form of an updated Linked Data Cloud diagram. A paper describing the results of the analysis was accepted for publication at the International Semantic Web Conference 2014 and will be presented there in October. Page 3 of (49)

Keywords

Linked Data, data cataloge, LOD Cloud

Version Log Issue Date

Rev. No.

Author

Change

20.8.2014

01

Max Schmachtenberg

Initial Version

26.8.2014

02

Tobias Käfer

Added sections about KIT crawl

8.9.2014

03

Christian Bizer

Included feedback from Giorgos Flouris and updated links to LOD Cloud diagram

28.9.2014

04

Tobias Käfer

Included results of the analysis of the KIT crawl

30.9.2014

05

Max Schmachtenberg

Resolved some minor formatting issues

Deliverable D4.5PlanetData

Table of Contents Executive summary...................................................................................................................................... 2 Document Information ................................................................................................................................ 3 Table of Contents ........................................................................................................................................ 5 List of Figures............................................................................................................................................... 7 List of Tables ................................................................................................................................................ 8 1.

Introduction ......................................................................................................................................... 9

2.

Dataset Cataloging ..............................................................................................................................10 2.1 Crawl of the Linked Data Web ...........................................................................................................11 2.2 Categorization by Topical Domain .....................................................................................................12 2.3 Linked Data Metrics ..........................................................................................................................14 2.3.1 Adoption of the Linking Best Practices ........................................................................................15 2.3.1.1 Degree Distributions ............................................................................................................16 2.3.1.2 Overall Graph Structure .......................................................................................................17 2.3.1.3 Predicates Used for Linking ..................................................................................................19 2.3.2 Adoption of the Vocabulary Best Practices .................................................................................21 2.3.2.1 Usage of Well-Known Vocabularies......................................................................................22 2.3.2.2 Usage of Proprietary Vocabularies .......................................................................................23 2.3.2.3 Dereferencability of Proprietary Vocabulary Terms ..............................................................24 2.3.2.4 RDF Links to Terms from other Vocabularies ........................................................................25 2.3.3 Adoption of the Metadata Best Practices ....................................................................................26 2.3.3.1 Providing Provenance Information .......................................................................................27 2.3.3.2 Providing Licensing Information ...........................................................................................28 2.3.3.3 Providing Dataset Level Metadata .......................................................................................29 2.3.3.4 Providing Alternative Access Methods .................................................................................30 2.4 Related Work ....................................................................................................................................31 2.5 Summary of Findings .........................................................................................................................32 2.6 Dataset Catalog and State of the LOD Cloud Report...........................................................................33 2.7 Creation of an Extended Diagram ......................................................................................................34

3. Semantic Web Challenge 2014 - Big Data Track Dataset (BTC2014) .........................................................36 3.1 Crawling Method...............................................................................................................................37 3.2 Preparing the Crawler LDspider .........................................................................................................38 3.3 Running and Maintaining the Crawling ..............................................................................................39 Page 5 of (49)

3.4 Publishing the Crawl ..........................................................................................................................40 3.5 Analysis of the Crawl .........................................................................................................................42 3.5.1 Providing links to other data sets ................................................................................................43 3.5.2 Linking datasets ..........................................................................................................................45 3.5.3 License information ....................................................................................................................46 3.5.4 Use of Vocabularies ....................................................................................................................47 Bibliography ...............................................................................................................................................48

Deliverable D4.5PlanetData

List of Figures Figure 1: Distribution of the number of resources and documents per dataset contained in the crawl (log scale). ..................................................................................................................................................11 Figure 2: Indegree distibutions of different categories of datasets ..............................................................16 Figure 3: Outdegree distibutions of different categories of datasets ...........................................................16 Figure 4: Overall graph structure and categorization of the datasets by topical domain. The size oft he circles reflects their indegree ...............................................................................................................18 Figure 5: Extended LOD Cloud Diagram .......................................................................................................35

Page 7 of (49)

List of Tables Table 1: Number of datasets in each category and growth compared to 2011.............................................13 Table 2: Datasets with the highest in- and outdegrees. ...............................................................................17 Table 3: Top three linking predicates per category. The percentages are relative to number of datasets within the category which set outgoing links........................................................................................19 Table 4: Top 10 datasets regarding in- and outdegree for owl:sameAs links by category. ............................20 Table 5: Vocabularies used by more than 5% of all datasets. .......................................................................22 Table 6: Proprietary vocabularies with dereferencability per category and quota of vocabularies linking to others. .................................................................................................................................................23 Table 7: Predicates used to link terms between different vocabularies........................................................25 Table 8: Provenance vocabulary usage and license vocabulary usage by category. ......................................28 Table 9: Percentage of datasets using the VoID vocabulary and percentage of datasets offering alternative access methods. ..................................................................................................................................29 Table 10: Basic Metrics about the Components of the BTC2014 dataset .....................................................41 Table 11: Indegree and Outdegree ..............................................................................................................43 Table 12: Top 11 predicates linking into a dataset .......................................................................................43 Table 13: Top 10 predicates linking out of a data set ...................................................................................44 Table 14: Top 10 datasets used by other datasets as end of owl:sameAs links.............................................45 Table 15: Top 5 terms to indicate license information .................................................................................46 Table 16: Top 10 used vocabularies.............................................................................................................47

Deliverable D4.5PlanetData

1. Introduction This report describes two crawls performed, one used for creating a catalog of linked data, the other creating a large corpus of Linked Data for the Billion Triple Challenge 2014. The Web of Linked Data [1] [2] has grown from a dozen datasets in 2007 into a large data space containing hundreds of datasets today. To give data consumers an overview of the available datasets and help them identifying datasets suitable for their needs, we created a catalog of datasets. Unlike previous efforts, which were based on the manual entry of datasets by publishers into the catalog, we performed a crawl for linked data in April 2014, taking samples from every dataset we encountered. Using this strategy, we were able to find more than thousand different datasets, covering a wide array of topics. In deliverable 2.4, metrics were described to assess a dataset's quality and its adherence to best practices emerged from the community. They measure the linking of a dataset to others, the used vocabularies and the dereferencabiliy and interlinking of proprietary vocabularies, the provision of metadata, such as license and provenance information and the provision of access methods such as data dumps and SPARQL endpoints. Using the samples taken from the datasets, we evaluted the datasets properties, enabling users to better assess a dataset's properties, quality and usefulness. Our findings were entered into a dataset catalog, where users can browse the data and search for datasets having specific properties. Along the catalog, we provide an accompanying website, which is linked to the catalog. A paper [12] discussing these results has been accepted for the ISWC 2014 Replication, Benchmark, Data and Software Track. Based on our findings for interlinking, we created an up-to-date version of the LODCloud Diagram, giving an overview over the Web of Linked Data. Again, the diagram is tightly interlinked with the catalog, enabling a user to directly move from the diagram icon of a dataset to its catalog entry. To give a more complete picture of datasets available as Linked Data, we invited the community to report datasets that were not included in our crawl, and extended the LOD Cloud Diagram with the datasets reported by the community. The second crawl aimed at providing researchers with a large corpus of linked data, suitable for large-scale data experiments. This crawl was performed in a breath-first strategy, doing a total of 14 hops, crawling 1,1 terabyte of data containing 4.09 billion triples. In the following sections, we describe these two crawls. The following section describes the crawl used for creating the catalog of linked datasets. The results of the metrics applied will be further discussed, and we will describe the catalog, the state of the LOD cloud document as well as the created LOD Cloud diagrams themselves. Section three discusses the creation of the Billion Triple Challenge dataset. There, the crawling process as well as the amount of data crawled for the different iterations are shown.

Page 9 of (49)

2. Dataset Cataloging In this chapter, we describe the crawl for samples from a large number of Linked Data datasets for the creation of a dataset catalog. In the next section, we describe the crawling strategy used to gather the data that are the basis of our analysis. Then, we explain the categorization of the data by topical domain. In Section 2.3, we describe the adoption of best practices in the areas of linking, vocabulary usage, and metadata provision. Section 2.4 gives an overview of related work. Section 2.7 summaries our findings and in section 2.6, we describe the creation of the catalog and the accompanying website. To foster community feedback and also include datasets that were not included in the initial diagram, we published our diagram for community feedback and to let them report datasets not included in our crawl. The feedback round as well as the resulting extended LODCloud diagram can be found in Section 2.7.

Deliverable D4.5PlanetData

2.1 Crawl of the Linked Data Web To evaluate the conformance to the best practices, we have crawled a snapshot of the Linked Data Web. For this, we used LDSpider, a framework for crawling Linked Data [3]. We seeded LDSpider with 560 thousand seed URIs originating from three sources: 1. We included all URIs of example resources from datasets contained in the lod-cloud group in the datahub.io dataset catalog as well as example URIs from other datasets in the catalog marked with Linked Data related tags; 2. We included a sample of the URIs contained in the Billion Triple Challenge 2012 dataset 1; 3. We collected URIs from datasets advertised on the [email protected] mailing list since 2011. With those seeds, we performed crawls during April 2014 to retrieve entities from every dataset using a breadth-first crawling strategy. Altogether, we crawled 900,129 documents describing 8,038,396 resources. The crawled data is provided to the community 2 so that all results presented in the following can be verified. 10000000 1000000 100000 10000 1000 100 10 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608 640 672 704 736 768 800 832 864 896 928 960 992

1

Resources

Documents

Figure 1: Distribution of the number of resources and documents per dataset contained in the crawl (log scale).

For grouping the retrieved resources into datasets, we generally assume that all data originating from one pay-level domain (PLD) belongs to a single dataset. If the datahub.io catalog lists multiple datasets for a single PLD, we apply an exception to the general rule and use the dataset definitions from the catalog. Altogether, the crawled data belongs to 1014 different datasets. Figure 1 shows the distribution of the number of resources and documents per dataset contained in the crawl. Our crawler did respect crawling restrictions expressed by the data sources via robots.txt files. Altogether, we discovered 77 linked datasets which do not allow crawling and did not retrieve data from these sources. These datasets were later added to an enhanced version of the LOD Cloud Diagram (see Figure 5), to give the community a more comprehensive overview of the Web of Linked Data.

1

http://km.aifb.kit.edu/projects/btc-2012/

2

http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/iswc-rdb Page 11 of (49)

2.2 Categorization by Topical Domain Since the adoption of the Linked Data best practices might vary depending on the topical domain of the datasets, we classify the datasets into the following topical categories: media, government, publications, life sciences, geographic, cross-domain, user-generated content, and social networking. This categorization schema is the same as the one used by the 2011 State of the LOD Cloud report with the only difference that we added the category social networking as we discovered a large number of datasets providing data about people and their social ties. For datasets that are contained in the datahub.io dataset catalog, we adopt the topical categorization form the catalog. We manually assigned categories to the newly discovered datasets after inspecting them. In the following, we define the categories and refer to some prominent datasets from each category. Afterwards, we compare the overall number of datasets per category with the findings of the 2011 State of the LOD Cloud report. The media category contains datasets providing information about films, music, TV and radio programmes, as well as print media. Prominent datasets within this category are the dbtune.org music datasets, the New York Times dataset, and the BBC radio and television program datasets. The government category contains Linked Data published by federal or local governments, including a lot of statistical datasets. Prominent examples include the data.gov.uk and opendatacommunities.org datasets. The category publications holds li brary datasets, information about scientific publications and conferences, reading lists from universities, and citation databases. Well known datasets include the German National Library dataset, the L3S DBLP dataset, and the Open Library dataset. The category geographic contains datasets like geonames.org and linkedgeodata.org comprising information about geographic entities, geopolitical divisions, and points of interest. The life sciences category comprises biological and biochemical information, drug-related data, and information about species and their habitats. The cross-domain category includes general knowledge bases such as DBpedia or UMBEL, linguistic resources such as WordNet or Lexvo, as well as product data. The the category user-generated content contains data from portals that collect content generated by larger user communities. Examples include metadata about blogposts published as Linked Data by wordpress.com, data about open source software projects published by apache.org, scientific workflows published by myexperiment.org, and reviews published by goodreads.com or revyu.com. The category social networking contains people profiles as well as data describing the social ties amongst people. We include into this category individual FOAF profiles, as well as data about the interconnections amongst users of the distributed microblogging platform StatusNet. The distinction between the categories usergenerated content and social networking is that the datasets in the former category focus on the actual content while datasets in the later focus on user profiles and social ties. The 2011 State of the LOD Cloud report did not contain social networking as a separate category since the report did not count individual FOAF profiles as separate datasets and since StatusNet servers did not export Linked Data in 2011. Table 1 gives an overview of the number of datasets in each category as well as the growth per category compared to the 2011 report. A list with the exact assignments of each dataset to a category is found on the accompanying website. The numbers in brackets in the second column refer to the number of datasets that do not allow crawling. The by far largest category is social networking with 520 datasets (48% of all datasets). The second largest category is government with 199 datasets (18%), followed by publications with 138 datasets (13%). Compared to the 2011 State of the LOD Cloud report, we observe a larger number of datasets in all categories except geographic and media data. The category government shows the largest growth, followed by the categories user-generated content and life sciences. Excluding the new category social networking, the overall number of Linked Datasets has approximately doubled from 2011 (294

Deliverable D4.5PlanetData

datasets) to 2014 (571 datasets). Including the new category, we observe an overall growth of 271% (from 294 to 1091 datasets). Table 1: Number of datasets in each category and growth compared to 2011. Category

Datasets 2014

Percentage

Datasets 2011

Growth

Media

24 (-2)

2%

25

-4%

Government

199 (-16)

18%

49

306%

Publications

138 (-42)

13%

87

59%

Geographic

27 (-6)

2%

31

-13%

Life Sciences

85 (-2)

8%

41

107%

Cross-domain

47 (-6)

4%

41

15%

User-generated Content

51 (-3)

5%

20

155%

Social Networking

520 (-0)

48%

-

-

Total

1091 (-77)

294

271%

Page 13 of (49)

2.3 Linked Data Metrics In order to enable Linked Data applications to discover datasets as well as to ease the integration of data from multiple sources, Linked Data publishers should comply with a set of best practices [4]. These best practices can be grouped into three areas: 

Linking: By setting RDF links, data providers connect their datasets into a single global data graph which can be navigated by applications and enables the discovery of additional data by following RDF links.



Vocabulary Usage: The best practices advise publishers to use terms from widely-used vocabularies in order to ease the interpretation of their data. If data providers use their own vocabularies, the terms of such proprietary vocabularies should be dereferencable into their RDF schema or OWL definitions. The definitions of proprietary vocabulary terms should contain RDF links pointing at terms from widely-used vocabularies in order to ease their interpretation.



Metadata Provision: Linked Data should be as self-descriptive as possible, and thus include metadata. An important form of metadata is provenance meta-data describing the origin of datasets and enabling applications to assess their quality. The best practices also advise to provide licensing metadata and dataset-level metadata, e.g., in the form of a VoID file3. If datasets are accessible via additional access methods, such as a SPARQL endpoint or data dumps, then the VoID file should contain information about these access methods.

The adoption of the Linked Data best practices by datasets belonging to different topical domains was analyzed in the State of the LOD Cloud report [2] in 2011. The report is based on information provided by the data publishers themselves via the datahub.io Linked Data catalog4.

3

http://www.w3.org/TR/void/

4

http://datahub.io/group/lodcloud

Deliverable D4.5PlanetData

2.3.1 Adoption of the Linking Best Practices The linking best practice encourages publishers to set RDF links between datasets in order to enable the discovery of additional data and to support the integration of data from multiple sources. For analyzing the linkage between datasets, we aggregate all RDF links by dataset, meaning that we consider two datasets to be linked if there exists at least one RDF link between resources belonging to the datasets.

Page 15 of (49)

2.3.1.1 Degree Distributions In total, 56% of all datasets in our crawl set RDF links pointing to at least one other dataset. The remaining 44% are either only the target of RDF links from other datasets or are isolated. Figure 2 and Figure 3 show the distribution of the in- and outdegrees for each category. We see that the in- and outdegrees vary widely with a small number of datasets in each category being highly linked, while the majority of the datasets is only sparsely linked. Looking at the top ten datasets by in- and outdegree in Table 4, we see that datasets from categories social networking, user-generated content, and publications are among the top ten with respect to outdegree. While datasets with a high indegree are dbpedia.org (cross-domain), geonames.org (geographic), w3.org (cross-domain), reference.data.gov.uk (government) as well as several datasets from the category social networking. 160 140 120 100 80 60 40 20 0 1

10

100

1000

10000

Datenreihen1

Figure 2: Indegree distibutions of different categories of datasets 100 90 80 70 60 50 40 30 20 10 0 1

10

100

1000

Datenreihen1

Figure 3: Outdegree distibutions of different categories of datasets

10000

Deliverable D4.5PlanetData

2.3.1.2 Overall Graph Structure Analyzing the overall graph structure, we find one large weakly connected component which consists of 71.99% of all datasets. In addition, there are three small components, one consisting of three and two consisting of two datasets. Within the large weakly connected component, there exists one large strongly connected component consisting of 36.29% of all datasets. Figure 4 shows the overall graph structure using the same LOD cloud visualization as the 2011 report. The size of the circles reflects the indegree of the corresponding dataset. A zoom-able version of Figure 4 is available on the accompanying website. Note that we have aggregated all individual FOAF profiles into a single circle. Compared to the LOD cloud visualization from the 2011 report which was centered around dbpedia.org as central linking point, Figure 4 shows a much more decentralized graph with multiple highdegree nodes: The geonames.org and dbpedia.org datasets are linked by a large number of datasets belonging to different topical categories. In addition, the statistics.data.gov.uk and reference.data.gov.uk are highly linked from within the government category. In the category publications, the Library of Congress Subject Headings (LCSH) and the German National Library (DNB) datasets are highly linked. We can also see in Figure 4 that the category social networking is the most densely interlinked. Table 2: Datasets with the highest in- and outdegrees. Dataset

Category

In

Dbpedia.org

cross-domain

geonames.org

Dataset

Category

Out

207 bibsonomy.org

publications

91

geographic

141 semanlink.net

User-generated cnt.

88

w3.org

cross-domain

117 deri.org

Social netw.

70

quitter.se

social netw.

64

harth.org

Social netw.

68

status.net

social netw.

63

quitter.se

Social netw.

67

postblue.info

social netw.

56

semanticweb.org

User-generated cnt.

64

skilledtest.com

social netw.

55

skilledtests.com

Social netw.

60

reference.data.gov.uk

government

45

postblue.info

Social netw.

59

data.semanticweb.org

publications

44

status.net

Social netw.

47

fragdev.com

social netw.

41

w3.org

crossdomain

45

lexvo.org

cross-domain

37

data.semanticweb.org

publications

45

Page 17 of (49)

Figure 4: Overall graph structure and categorization of the datasets by topical domain. The size oft he circles reflects their indegree

Deliverable D4.5PlanetData

2.3.1.3 Predicates Used for Linking Table 3 displays the top three predicates that are used by RDF links within each topical domain. A first observation is that owl:sameAs is an important linking predicate within most categories, followed by rdfs:seeAlso. The most notable deviance is observed for the category social networking, where foaf:knows is the most widely used linking predicate. Due to the outstanding role of owl:sameAs as the most widely used linking predicate, we take a closer look at the datasets connected by owl:sameAs links. Searching for weakly connected components in the owl:sameAs graph, we find one large weakly connected component containing 297 (29.3%) of all datasets. Table 3: Top three linking predicates per category. The percentages are relative to number of datasets within the category which set outgoing links Category

Social networking

Publications

User-generated Content

Media

Predicate

Usage

foaf:knows

60.27%

foaf:based near

35.69%

sioc:follows

Category

Predicate

Usage

owl:sameAs

52.17%

rdfs:seeAlso

43.48%

34.34%

dct:creator

21.74%

owl:sameAs

32.20%

dct:publisher

47.57%

dct:language

25.42%

dct:spatial

30.10%

rdfs:seeAlso

23.73%

owl:sameAs

24.27%

owl:sameAs

53.13%

owl:sameAs

64.29%

rdfs:seeAlso

21.88%

skos:exactMatch

21.43%

dct:source

18.75%

skos:closeMatch

21.43%

owl:sameAs

81.25%

owl:sameAs

80.00%

rdfs:seeAlso

18.75%

rdfs:seeAlso

52.00%

foaf:based near

18.75%

dct:creator

20.00%

Life Sciences

Government

Geographic

Cross-domain

Apart from that, there are only eight further components, out of which three consist of three datasets and the remaining five consist of two datasets. Looking at strongly connected components, we find one large component consisting of 74 datasets (7.3%), one with four and six with two datasets. Table 4 shows the top ten datasets regarding in- and outdegree, this time considering only owl:sameAs links. Compared toTable 2, we observe a much smaller number of datasets from the category social networking as this category is dominated by foaf:knows links.

Page 19 of (49)

Table 4: Top 10 datasets regarding in- and outdegree for owl:sameAs links by category. Dataset

Category

In

Dataset

Category

Out

dbpedia.org

cross-domain

89

bibsonomy.org

publications

91

geonames.org

geographic

29

data.semanticweb.org

publications

31

data.semanticweb.org

publications

24

myopenlink.net

user-gen. cnt.

25

l3s.de

publications

24

dbpedia.org

crossdomain

23

semanticweb.org

user-gen. cnt.

18

semanticweb.org

User-generated cnt.

18

nytimes.com

media

11

revyu.com

User-generated cnt.

16

dbtune.org

social netw.

11

advogato.org

Social netw.

16

kit.edu

social netw.

9

el.dbpedia.org

crossdomain

13

revyu.com

user-gen. cnt.

8

nl.dbpedia.org

crossdomain

11

w3.org

crossdomain

8

harth.org

Social netw.

11

it.dbpedia.org

cross-domain

8

Deliverable D4.5PlanetData

2.3.2 Adoption of the Vocabulary Best Practices The vocabularies used to represent data and their interpretability are a key ingredient to make Linked Data semantic data. We consider a vocabulary to be used by a dataset if a term from the vocabulary appears in the predicate position of a triple from the dataset or at the object position of a rdf:type triple from the dataset.

Page 21 of (49)

2.3.2.1 Usage of Well-Known Vocabularies Table 5 lists the vocabularies that are used by more than five percent of all datasets. The vocabularies RDF, FOAF, RDFS, DCTerms, and OWL are the top vocabularies used by many datasets from across all topical categories. Compared to the 2011 report, we can state that there is a trend towards the adoption of wellknown vocabularies by more datasets. For instance, while the FOAF vocabulary was used by 27.46% of all datasets in 2011, it is used by 69.1% of all datasets in 2014. The same is true for the Dublin Core vocabulary which is used today by 56.01% of the datasets and was used by only 31.19% in 2011. The extent to which well-known vocabularies are used within the different topical categories reveals some differences. In the category social networking, there is a high quota of datasets using FOAF (85.96%), followed by the Dublin Core and the WGS84 vocabulary used by 40% and 37% of all datasets. Table 5: Vocabularies used by more than 5% of all datasets. Prefix

Occurence

Quota

Prefix

Occurence

Quota

rdf

996

98.22%

void

137

13.51%

rdfs

736

72.58%

bio

125

12.32%

foaf

701

69.13%

cube

114

11.24%

dcterm

568

56.01%

rss

99

9.76%

wol

370

36.49%

odc

86

8.48%

wgs84

254

25.05%

w3con

77

7.60%

sioc

179

17.65%

dopa

65

6.41%

admin

157

15.48%

bibo

62

6.11%

skos

143

14.11%

dcat

59

5.82%

Furthermore, the bibo ontology is used by 41.67% of the datasets belonging to this category. The vocabularies SKOS, resourcelist, which is used to create reading lists, and SIOC also find some adoption. In the category cross- domain, several vocabularies are used by 10-40% of all datasets: The dbpedia.org, georss.org, opengis.net, bibo, the prov vocabulary, the skos vocabulary, and void, showing that a wide variety of topics is covered in this category. In the category government, vocabularies for representing statistical data (cube with 61.75% and sdmx with 26.22%) are found frequently. Vocabularies for expressing metadata, like the void vocabulary, the sparql-service-description vocabulary, prov and prv are also find some use. Within the category geographic, 66.67% of all datasets use the WGS84 vocabulary for encoding geographic coordinates. Other well adopted vocabularies are skos or the geonames ontology. In the category user-generated content, many datasets use the FOAF vocabulary together with the SIOC vocabulary (50%) as well as the RSS and the admin vocabulary (both around 17%). The DOAP vocabulary is used by 12.5% of the datasets. Please note that the schema.org vocabulary promoted by Google, Yahoo and Microsoft is not listed in Table 5 as we found this vocabulary to be hardly used in the Linked Data context5. In contrast, the vocabulary is very widely used together with the Microdata syntax for annotating HTML pages [5].

5

One data source that uses the schema.org type system in addition to its own type system in order to increase interoperability is dbpedia.org.

Deliverable D4.5PlanetData

2.3.2.2 Usage of Proprietary Vocabularies Widely-used vocabularies often do not provide all terms that are needed to publish the complete content of a dataset on the Web. Thus, data providers often define proprietary terms that are used in addition to terms from widely deployed vocabularies. We have also analyzed to which extent datasets from different categories make use of proprietary vocabularies. We consider a vocabulary to be proprietary if it is used only by a single dataset. Out of the 638 different vocabularies that we encountered in our crawl, 375 vocabularies (58.77%) are proprietary according to our definition, while 263 (41.22%) are non-proprietary. In total, 234 datasets (23.08%) use proprietary vocabularies, while nearly all datasets also use nonproprietary vocabularies. These numbers show that the adoption of the best practice to use common vocabularies is improving compared to the State of the LOD Cloud report from 2011 which found 64.41% of all datasets to use proprietary terms. Table 6 further details the usage of proprietary vocabularies by topical category. The second column of the table shows the number of proprietary vocabularies used by datasets from each category. The third column contains the number of datasets in each category that use proprietary vocabularies. Table 6: Proprietary vocabularies with dereferencability per category and quota of vocabularies linking to others. Dereferencability

Different prop. vocabs. Used (% of all prop. vocab.)

# of datasets using prop. vocab. (% of all datasets)

full

partial

none

Social Networking

126 (33.6%)

81 (15.57%)

19.47%

8.8%

77.78%

20 (15.87%)

Publications

59 (15.73%)

33 (34.38%)

22.03%

8.47%

69.49%

15 (25.42%)

Government

47 (12.53%)

34 (18.58%)

21.28%

12.77%

65.96%

16 (34.04%)

Cross-domain

56 (14.93%)

17 (41.46%)

26.79%

10.71%

62.50%

14 (25.00%)

Geographic

13 (3.47%)

8 (38.10%)

15.38%

7.69%

76.92%

2 (15.38%)

Life Sciences

36 (9.60%)

27 (32.53%)

27.78%

5.56%

66.67%

4 (11.11%)

Media

12 (3.20%)

12 (54.55%)

0.00%

16.67%

83.33%

2 (16.67%)

User-gen. Cnt.

26 (6.93%)

22 (45.83%)

11.54%

11.54%

76.92%

6 (23.08%)

Total

375 (58.77%)

234 (23.08%)

19.47%

8.80%

71.73%

79 (21.07%)

Category

# of vocabs linking (quota)

Page 23 of (49)

2.3.2.3 Dereferencability of Proprietary Vocabulary Terms In order to enable applications to retrieve the definition of vocabulary terms, the URIs identifying vocabulary terms should be made dereferencable. To assess whether a vocabulary is dereferencable, we requested the definitions of all used terms from the vocabulary via HTTP GET requests. The resulting corpus of vocabulary definitions is provided for download on the accompanying website. We define the dereferencability quota of a vocabulary as the number of dereferencable terms divided by the number of all terms of the vocabulary. In total, 19.47% of all proprietary vocabularies are fully dereferencable (i.e., their quota is 1.0). On the other hand, 71.73% of all proprietary vocabularies are not dereferencable at all. The remaining 8.8% of all proprietary vocabularies are partially dereferencable, meaning that for some terms, but not for all, a definition could be retrieved. Possible causes for partial dereferencability are namespace squatting, i.e. accidentally or incidentally using terms not defined in a vocabulary, and vocabularies having changed without proper marking of old terms as deprecated. Columns 4, 5 and 6 in Table 6 show the percentage of fully, partially and not dereferencable proprietary vocabularies per topical category.

Deliverable D4.5PlanetData

2.3.2.4 RDF Links to Terms from other Vocabularies Vocabulary terms should be related to corresponding terms within other vocabularies in order to enable applications to understand as much data as possible. Table 7 contains the different predicates that are used to link terms between vocabularies together with the percentage of all vocabularies using each predicate for linking. We see that 9.87% of all vocabularies use the rdfs:range predicate to link to other vocabularies (for instance defining the range of a term to be foaf:Person). The table also shows that only a very small fraction of the vocabularies provides equivalence links to terms from other vocabularies. Table 7: Predicates used to link terms between different vocabularies. Term

% of vocabularies

Term

% of vocabularies

rdfs:range

9.87%

rdfs:seeAlso

1.60%

rdfs:subClassOf

8.80%

owl:equivalentClass

1.60%

rdfs:subPropertyOf

6.93%

owl:inverseOf

1.33%

rdfs:domain

5.60%

swivt:type

1.07%

rdfs:isDefinedBy

3.73%

owl:equivalentProperty

0.80%

Page 25 of (49)

2.3.3 Adoption of the Metadata Best Practices The Linked Data best practices propose that every dataset should provide provenance and licensing information, dataset-level metadata, and information about additional access methods.

Deliverable D4.5PlanetData

2.3.3.1 Providing Provenance Information For our evaluation, we have collected a list of vocabularies that are designed for the representation of provenance information. Information about such vocabularies came from the W3C Provenance Working Group, the LOV vocabulary catalog, as well as our own experience, adding up to a total of 26 vocabularies. Using those vocabularies, we searched for provenance information in our corpus. We followed the approach suggested in [6] and searched for triples using predicates from those vocabularies and containing a document URI as subject. As shown in Table 8, 36.69% of all datasets use some provenance vocabulary, which is a slight decrease compared to the State of the LOD Cloud report from 2011, which reports 36.63% of all datasets to provide provenance information. 29.09% of all datasets use Dublin Core Terms, 11.05% use MetaVocab, while W3C PRV and PROV are used by only 0.79% of the datasets. The provision of provenance information is widely adopted in the publications and government domains, while media and geographic datasets show less adoption. For government data, there is also a remarkable increase compared to the State of the LOD Cloud document from 2011, which reports only 20.41% for this topical domain.

Page 27 of (49)

2.3.3.2 Providing Licensing Information With the help of machine-readable licensing information, Linked Data applications can assess whether they may use data for their purpose at hand. To evaluate whether a dataset provides license information, we again followed the approach proposed in [6] and searched for triples which have the document as their subject and a predicate containing the string ’licen’. To this list, we added all predicates containing the string ’rights’ as well as the waiver vocabulary, which leads to a total of 47 terms. In total, 9.96% of all datasets provide licensing information in RDF. This number is lower than the 17.84% reported in the State of the LOD Cloud report from 2011, but still higher than the 3.4% reported in [6]. The most important predicates for indicating the license are dc/dct:license (7.39%), cc:license (2.07%) and dc/dct:rights (1.68%). As shown in the last column of Table 8, the provision of licensing information varies widely across topical domains. More than a third of all government datasets provide licensing information, while none of the geographic datasets provides licensing information. A main cause for the low overall number is the category social networking which contains 48% of all datasets and in which only 5.38% of the dataset offer licensing information. Table 8: Provenance vocabulary usage and license vocabulary usage by category. Category

Any prov vocab

Dublin Core

Admin

Prv/Prov

Any license vocab

social networking

169 (32.5%)

57.39%

57.39%

1.18%

5.38%

publications

39 (40.63%)

94.87%

5.13%

2.56%

4.17%

government

76 (41.54%)

100.00%

0.00%

1.32%

30.05%

life sciences

20 (24.10%)

100.00%

0.00%

0.5%

3.61%

cross-domain

7 (17.07%)

100.00%

14.29%

0.00%

9.76%

geographic

3 (14.29%)

100.00%

0.00%

33.34%

0.00%

user-gen. content

9 (18.75%)

88.89%

66.67%

0.00%

10.42%

media

4 (18.18%)

100.00%

0.00%

0.00%

5.41%

Total

372 (36.69%)

29.09%

11.05%

0.79%

9.96%

Deliverable D4.5PlanetData

2.3.3.3 Providing Dataset Level Metadata Dataset-level metadata can be provided using the VoID vocabulary, either as inline statements in the dataset or in a separate VoID file. In the latter case, that file has to be linked from the data via backlinks or be provided at a well- known location which is created by appending /.well-known/void to the host part of a URI. As reported in [7], the latter condition is often too strict for data providers due to missing root-level access to the servers. Thus, we follow the approach proposed in [7] of relaxing the search for VoID files at well-known locations, appending “/.well-known/void” to any portion of the URI. Table 9: Percentage of datasets using the VoID vocabulary and percentage of datasets offering alternative access methods. Category

VoID

Link

WellKnown

Inline

Alt. access

SPARQL

Dump

social netw.

5 (0.96%)

0.19%

0.77%

0.00%

4 (0.77%)

0.77%

0.19%

publications

13 (13.54%)

6.25%

3.13%

7.29%

13 (13.54%)

12.50%

4.17%

life sciences

30 (36.14%)

28.92%

2.41%

4.82%

20 (24.10%)

24.10%

15.66%

government

72 (42.08%)

2.73%

2.73%

36.61%

63 (34.43%)

31.15%

31.15%

user-gen. Cnt.

6 (11.76%)

11.76%

0.00%

0.00%

3 (6.25%)

6.25%

2.08%

geographic

6 (38.10%)

14.29%

9.52%

14.29%

5 (23.81%)

14.29%

19.05%

cross-domain

5 (12.20%)

7.32%

2.44%

4.88%

4 (9.76%)

4.88%

4.88%

media

2 (9.09%)

0.00%

0.00%

9.09%

1 (4.55%)

0.00%

4.55%

Total

149 (14.69%)

4.14%

1.28%

9.27%

113 (11.14%)

9.96%

8.19%

In general, dataset-level metadata is still rarely provided by datasets within all topical domains. Some trends towards emerging best practices and de facto standards can be observed: Dataset-level metadata is rather linked to than provided at well-known locations and the Dublin Core vocabulary is becoming the defacto standard for providing dataset-level provenance information. In total, 149 datasets (14.69%) use the VoID vocabulary. Out of these datasets, 42 (4.14%) use a backlinking mechanism. Columns 2 to 5 of Table 9 show the VoID adoption by topical category. Compared to the 2011 report, the overall percentage of datasets publishing dataset-level metadata using VoiD has decreased from 32.20% to 14.69%, with the categories government, geographic, and life sciences being exceptions in which the adoption has slightly grown. Again, the category social networking is a main cause for the low overall number.

Page 29 of (49)

2.3.3.4 Providing Alternative Access Methods According to the 2011 State of the LOD Cloud report, many datasets provide additional access methods, such as SPARQL endpoints (68.14%) and dumps (39.66%). In our analysis, the numbers are much lower as shown in columns 6 to 8 of Table 9. Apart from the government, life sciences and geographic domains, almost no information on alternative access methods are found. The deviation can be explained by the fact that we only look at those alternative access methods that can be discovered via VoID descriptions linked from the datasets or provided at well-known URLs. As reported in [7], the actual number of existing SPARQL endpoints may be higher, as many endpoints cannot be discovered from the data. This is a severe problem for automatic agents navigating the Linked Data graph, as they are not capable of discovering alternative access methods. While the numbers for alternative access methods are low, one has to keep in mind that such methods do not always make sense. For example, the large number of small FOAF files in the social networking category are mostly datasets contained in exactly one file. In these cases, it does not make sense to provide a data dump, because the file itself is a data dump. Likewise, the use of a SPARQL endpoint for a dataset consisting of only a few dozen triples would not justify the provision effort.

Deliverable D4.5PlanetData

2.4 Related Work An effort that is closely related to the work presented in this paper is the LOD-Stats project6 which has retrieved and analyzed Linked Data from the Web until February 2014 [8]. The LODStats website provides statistics about the overall number of discovered linked datasets as well as the adoption of different vocabularies. What distinguishes LODStats from the work presented in this paper is that they do not categorize datasets by topical domain and do not analyze the overall graph structure, as well as the conformance with the best practices in the areas of vocabulary dereferencability and metadata provision. Their results concerning the overall number of accessible datasets (they found 928 datasets) and the adoption of well-known vocabularies are inline with the findings of this paper. A comprehensive empirical survey of Linked Data conformance is presented by Hogan et al. [6]. Their survey is based on a large-scale Linked Data crawl from May 2010 as well as a series of smaller snapshots taken between March and November 2010. The work presented in this paper can be seen as an update of the results presented by Hogan et al. as we use a crawl from March 2014. Another major difference is that Hogan et al. do not categorize datasets by topical domain and thus can not analyze the differences in the adoption of the best practices in different domains. The article by Hogan et al. contains a detailed and comprehensive discussion of earlier work on analyzing the adoption of the Linked Data practices as well as work in the wider area of characterizing the Semantic Web/Linked Data, its link structure as well as the semantics of its content. The discussion covers related work from the time span of 2005 to 2012. For space reasons, we cannot repeat this excellent review of related work here. The general difference between the works discussed by Hogan et al. and our work is that our analysis is more up-to-date and that we distinguish the datasets by topical domain.

6

http://stats.lod2.eu/

Page 31 of (49)

2.5 Summary of Findings In this section, we revisited and updated the finding of the State of the LOD Cloud report [2] from 2011 based on a Linked Data crawl gathered in April 2014. Our analysis shows that the overall number of Linked Datasets on the Web has grown significantly since 2011. Looking only at the topical categories covered in the original report, the number of datasets has approximately doubled since 2011. Also taking the category social networking into account, the number of datasets has grown by 271%. Concerning the linkage of the datasets, our analysis shows that there is still a relatively small number of datasets that set RDF links pointing at many other datasets, while many datasets only links to a few other datasets. Compared to the 2011 LOD cloud, which was centered around dbpedia.org as central linking hub, we have discovered a more decentralized graph structure with geonames.org and dbpedia.org being linked from many datasets besides of the existence of further category-specific linking hubs. Concerning the types of RDF links that connect datasets, we have found the predicates owl:sameAs, rdfs:seeAlso and foaf:knows to be most widely used. We have observed a trend towards the adoption of well-known vocabularies by more datasets, the most prominent one being FOAF, which is used by more than two thirds of all linked datasets, independent of their respective topical domain. In parallel, the usage of proprietary vocabularies has decreased from 64.41% in 2011 to 23.08% of all datasets in 2014. While provenance information is provided for roughly a third of all datasets, only 10% of all datasets provide machine-readable licensing information. A positive exception concerning licensing information is the government domain in which licensing information is provided by 30% of all dataset. Compared to the 2011 report, the percentage of datasets providing provenance metadata is approximately the same, while the percentage of datasets providing machine-readable licensing information has dropped from 17% to 10%. The similar negative trend is also found for the percentage of datasets publishing dataset-level metadata using VoID. In 2011, 32.20% of all datasets published VoID while in 2014 only 14.69% provide such metadata. The categories government, geographic, and life sciences are exceptions to this trend and the adoption has slightly grown in these domains.

Deliverable D4.5PlanetData

2.6 Dataset Catalog and State of the LOD Cloud Report To make the information we gathered about individual datasets available to the community, we created a dataset catalog for linked datasets using CKAN, an open-source data portal platform 7. We imported the information about datasets we crawled before from datahub.io and enhanced them with the information we obtained from the analysis of the crawl. Datasets that did not exist before were created by us. The catalog is available at http://linkeddatacatalog.data.dws.informatik.uni-mannheim.de To make access to our results easier, we also created a webpage which is a up-to date version of the State of the LOD Cloud available at http://linkeddatacatalog.data.dws.informatik.uni-mannheim.de/state/ There, a repetition of the main statistics and finding above are displayed. The webpage and the catalog are strongly interlinked. From the displayed results, we created links to the catalog, were the datasets are listed that are part of a result,for example listing all datasets that use a certain vocabulary or datasets of a certain category that provide a SPARQL endpoint. We also created a clickable version of the LOD Cloud diagram, were a user just has to click on a bubble to reach the corresponding catalog entry.

7

http://ckan.org/

Page 33 of (49)

2.7 Creation of an Extended Diagram The diagram created from information obtained in the crawling process naturally misses some datasets, due to the restrictions of the crawl itself. First, no datasets were included that prohibited crawling, which as already stated summed up to 77 datasets. Secondly, we missed datasets that were not reported on datahub.io, or were not tagged as Linked Data, unless they were found in the crawling process, for example they were not being linked to. To give a more comprehensive picture, it is desirable to also include those datasets in the diagram. To enable this and foster the announcement of other datasets, we published the diagram on public mail lists giving the community the opportunity to report additional datasets, ask for changes of e.g. categorization or naming, and point at possible errors. In a two week period spanning from end of July to August, we obtained information of 164 additional datasets, most of them being added to datahub.io after our crawl or even after the publication of the diagram. After the additional first feedback round, we created a new diagram and again published it on mailing lists for additional feedback. Due to the feedback, we decided to add another category of datasets, namely linguistic ones. In the old diagram, these datasets were added to the category cross-domain, but due to the different nature of the dataset and the sufficient number of linguistic datasets, it seemed appropriate to add this additional category. Figure 5: Extended LOD Cloud Diagram displays the resulting diagram. The diagram is available online at http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/

Deliverable D4.5PlanetData

Figure 5: Extended LOD Cloud Diagram Page 35 of (49)

3. Semantic Web Challenge 2014 - Big Data Track Dataset (BTC2014) For the Big Data Track of the Semantic Web Challenge8 at the International Semantic Web Conference 20149, we prepared a data set (BTC2014) in the tradition of the Billion Triples Challenges 2009-2012 at the same conference. The Billion Triples track has been renamed Big Data track in 2013, since when the use of the Billion Triples Challenge data set was not compulsory for participation any longer. The BTC 2014 is a crawl of the Linked Data web and has been composed between February and June 2014. The seed list of the crawl has been extracted from the Linked Data registry datahub.io10 and consists of example URIs and extracts from VoID descriptions 11 found at this registry.

8

http://challenge.semanticweb.org/2014/

9

http://iswc2014.semanticweb.org/

10

http://datahub.io/

11

http://www.w3.org/TR/void/

Deliverable D4.5PlanetData

3.1 Crawling Method The crawling method has already been described in D2.7. In this deliverable, we provide a more detailed description. In the crawling of the data set, we considered Linked Data in the serialisation formats RDF/XML, N-Triples, N-Quads, Turtle, and RDFa. The crawl has been carried out in a breadth-first manner, i.e. first, the data from the seed list URIs is downloaded. Second, the URIs contained in that data are extracted and put into a new list. This list again is downloaded, and again the URIs contained therein are extracted and so forth. During the crawling, we adhered to the crawling best practices of respecting robots.txt12 (i.e., we do not download from providers that state that they do not want to be crawled) and crawling “politely” (i.e., between two requests to a given server, we waited two seconds in order not to hammer it). Polite crawling in a web context introduces the problem of PLD starvation [9]: As the distribution of the number of documents on a given PLD is power-law-ish, we see many PLDs with few documents and few PLDs with many documents. As the download of a hop proceeds, the number of PLDs that still have URIs to be downloaded gets smaller, until the crawler is idle most of the time waiting the polite time interval before it may download the next URI. Therefore, we follow [10] and stop the download of a hop if only a certain number of PLDs still have URIs. As a heuristic for this number, a figure around the number of threads has proven reasonable. This discarding of URIs biases the crawl a bit against the data of big data providers, but towards smaller data providers, thus biasing towards data from the “long tail”. With this bias, we want to provide a data set that gives a broad view on Linked Data on the Web, and does not just data from a handful of big providers. Nevertheless, in order to cover the important data of those big data providers of which we discard URIs, we set the download order of the URIs per PLD in the crawler according to the number of in-links as a metric of importance. In previous work, we found that downloading Linked Data, one can miss well-known providers due to temporary outages [10] [11]. We addressed this issue at several occasions by extracting URIs that responded with an HTTP error code in the area of 500 (server errors) and fitting them back into the URI list to be downloaded. At the same instances, and in order to counteract the discarding of URIs, we also injected those URIs that have not been visited back into the crawl.

12

http://www.robotstxt.org/ Page 37 of (49)

3.2 Preparing the Crawler LDspider The crawling effort of the BTC2014 data set was carried out using the open source Linked Data crawler LDspider13, which is developed in part at KIT. Before the crawling, we introduced new features like on-disk data structures in the crawler in order to crawl at the scale of the BTC2014, as described in D2.7. Moreover, we fixed several bugs in ldspider and nxparser14, the involved open source parser for RDF/XML and NTriples and N-Quads, which has also in part been developed at KIT. For the incorporation of RDFa, we made the necessary amendments to ldspider such that it utilises the updated RDFa parser of Sesame RIO 15 as it is in Apache any2316. We published new versions of the two tools on their corresponding pages on Google Code in order to supply the Linked Data community with the latest releases.

13

https://code.google.com/p/ldspider/

14

https://code.google.com/p/nxparser/

15

http://www.openrdf.org/

16

http://any23.apache.org/

Deliverable D4.5PlanetData

3.3 Running and Maintaining the Crawling The crawl ran on commodity servers at KIT. The method to counteract PLD starvation required steady amendments to the crawl set-up if too many URIs got discarded and thus the crawler started to run out of URIs. Therefore, we varied the number of required active PLDs between 40 and 70. Also, we extracted manually the URIs from the crawl that answered with HTTP status code 500 and injected them back into the crawl. Moreover, in the rare occasion that a server did not close the HTTP connection properly, and the involved libraries did not run into a time out, we had to manually intervene and terminate the involved crawling processes.

Page 39 of (49)

3.4 Publishing the Crawl We published the BTC2014 data set on a web page17 within the web presence of our institute. We provide 

The raw data downloaded in N-Quads format, for each crawl subdivided by crawling hop



Redirection information incorporating HTTP status codes in the 300 area and information from the HTTP header Content-Location



Access logs for the download



HTTP Header information



The list of URIs with which the crawl has been seeded



The list of downloaded URIs



Basic metrics about the crawl, as found in Table 10



On the web page alongside the crawl and notified the community by announcing the availability of the data set on the mailing list18 of the W3C’s19 Semantic Web Interest Group20.

17

http://km.aifb.kit.edu/projects/btc-2014/

18

http://lists.w3.org/Archives/Public/semantic-web/

19

http://www.w3.org/

20

http://www.w3.org/2001/sw/interest/

Deliverable D4.5PlanetData

Table 10: Basic Metrics about the Components of the BTC2014 dataset Crawl No.

Size (gzipped)

Size (unzipped)

Triple/Quad Count

01

3.7G

46G

211.918.262

02

4.0G

59G

247.881.388

03

4.4M

86M

412.327

04

19M

203M

868.219

05

7.6G

114G

501.136.390

06

0.3M

5.1M

27.623

07

5.3G

137G

563.477.028

08

0.5G

12G

40.061.653

09

7.5G

157G

565.906.367

10

2.1G

44G

164.488.613

11

0.6G

11G

40.688.530

12

28M

610M

2.371.959

13

15G

421G

1.399.507.594

14

5.3G

93G

352.012.643

Sum

52G

1.1T

4.090.758.596

Page 41 of (49)

3.5 Analysis of the Crawl On top of the raw quad/document statistics, we have analysed the crawl in terms of the more high-level metrics from [12].

Deliverable D4.5PlanetData

3.5.1 Providing links to other data sets According to the Linked Data principles, one should include links to other data in datasets. Here, we analyse the linking within the BTC dataset following the approach described in [13]: A link is a triple whose subject or object is from a different PLD than the document in which it was found. Because of the inclusion of RDFa we filter the links for those linking to other documents in the crawl. For example, if somebody links to a non-RDF homepage, this would otherwise count as a link. Yet, as from many HTML pages at least some RDFa can be extracted, the figures are still skewed, cf. the quite low average indegree. Also, we only considered authoritative statements [14] about resources as links, i.e. only in a document within a given dataset, statements are to be made about resources in that dataset. Table 11 shows the results. Table 11: Indegree and Outdegree BTC-2014

Deliverable 2.4

# Datasets with outdegree > 0 (% of datasets)

1843 (3.8%)

406 (44%)

Average outdegree (standard deviation)

24.7 (594.4)

3.1 (N/A)

# Datasets with indegree > 0 (% of datasets)

29’966 (63%)

(N/A)

Average indegree (standard deviation)

1.51 (6.34)

3.1 (N/A)

An indication for the assumption that HTML sites skew the distributions can be found in Table 12. The first predicate classifying as linking to RDF resources is in position 11 of the list of the top linking predicates into a data set. Table 12: Top 11 predicates linking into a dataset Predicate linking into a dataset

No. of links

18621

17332

7763

2361

1263

824

615

520

504

488

482

431

Page 43 of (49)

Also, the outlinks are skewed to data generated from RDFa included in HTML pages, see Table 13. Table 13: Top 10 predicates linking out of a data set Outlinking predicate

No. of links

926

321

148

99

80

51

41

37

35

31

Deliverable D4.5PlanetData

3.5.2 Linking datasets The predicate stating equivalence between two entities owl:sameAs plays an important role in entity consolidation. Also, it is on position 11 of the top out-linking predicates, and on position 6 of the top inlinking predicates. Therefore, we look closer into the use of that predicate. In Table 14, we report on the interlinkage of datasets using owl:sameAs links. We list the top 10 datasets that are used by other data sets as an end of an owl:sameAs link. Those therefore serve as hubs for interlinking data. Table 14: Top 10 datasets used by other datasets as end of owl:sameAs links Dataset

Number of datasets linking there

dbpedia.org

75

freebase.com

31

semanticweb.org

24

l3s.de

24

geonames.org

21

purl.org

16

fu-berlin.de

15

identi.ca

14

dbtune.org

13

w3.org

12

The first two are the multi-domain data sets dbpedia.org and freebase.org. Also, fu-berlin.de, purl.org and identi.ca are multi-domain. Otherwise, there are data sets in the top 10 providing information about academic scholars, conferences and publications, e.g. semanticweb.org, l3s.de. w3.org hosts vocabularies for schema-related information to the data, e.g. the ontologies for RDF, RDFS, and OWL. dbtune.org provides information on music, geonames.org information on geospatial features such as places.

Page 45 of (49)

3.5.3 License information Licensing data provides information under which terms and conditions the data may be re-used in other contexts. The inclusion of licensing information to the data adds to the self-descriptiveness of data. The provisioning of licensing information can either be machine-readable such that software agents can automatically process the data, or mere textual information may be given. As Deliverable 2.7 [13] and Deliverable 2.4 [15], we employ the heuristics of [16] for determining licensestating triples: We extract all properties from the crawl that contain the string “licen” in upper, lower or mixed case. For Deliverable 2.7, we manually removed properties from that list that obviously deal with different types of licenses, such as license plates or fishing licenses. We re-used that list when we analysed the whole crawl for this deliverable. Table 15: Top 5 terms to indicate license information License term

No. Datasets using the term

179

76

56

37

28

Any license predicate (total 81)

340

The number of data sets providing license information is very small compared to the number of data sets in the BTC 2014 data set (0.7%). This again can be accounted to the inclusion of RDFa in the crawling, which extracts RDF data from HTML pages residing on a plethora of domains. Overall, we found 340 data sets reporting licensing information. Still, this number is considerably higher than what we found for the three subsets of the part of BTC2014 available as of D2.7. If we take the use of the RDF vocabulary (see below) as proof that a data set contains richer RDF information, we get a share of 7.3% of the richer datasets providing license information which is still considerably below the figures reported in D2.4 and D2.7.

Deliverable D4.5PlanetData

3.5.4 Use of Vocabularies Data integration is eased considerably if vocabularies are re-used where possible. When data is described using the same terms, efforts for ontology alignment, entity consolidation or the development of software agents become smaller because the equivalence of the terms used for the description has not to be investigated first. For the sake of our analysis, we consider a vocabulary to be used by a dataset if there is a term from the vocabulary appearing in the dataset. The namespaces of the terms are used to tell different vocabularies apart. One namespace here equals one vocabulary. As already discussed, the dataset definition of a PLD per data set is skewed towards home pages delivering HTML from which RDFa can be derived. This is also reflected in vocabulary use. Accordingly, the most popular vocabularies when counting the number of datasets in which it has been used, are indeed the xhtml vocabulary and the open graph protocol from Facebook which appear very prominently in RDFa extracted from HTML pages. In terms of the open graph protocol, we see a 10x higher use of the new http://ogp.me/ns# namespace instead of the old one on the opengraphprotocol.org host. Manual inspection yielded that the document metadata vocabulary DCTERMS is also very common in RDFa extracted from HTML because of header fields. Nevertheless, we find the RDF and RDFS vocabularies that are used to make statements about schema information, SIOC and FOAF that are used to talk about online communities and personal relationships also in the top 10. Table 16: Top 10 used vocabularies Vocabulary namespace

Appearing in No. Datasets

43250

37749

4652

3206

3120

3046

2667

2061

1986

1060

Page 47 of (49)

Bibliography [1] C. Bizer, T. Heath and T. Berners-Lee, “Linked data-the story so far,” International Journal on Semantic Web and Information Systems, vol. 5, no. 3, p. 1‐22, 2009. [2] A. Jentzsch, R. Cyganiak and C. Bizer, State of the LOD Cloud, 2011. [3] T. Heath and C. Bizer, “Linked data: Evolving the web into a global data space,” Synthesis lectures on the semantic web: theory and technology, vol. 1, no. 1, p. 1‐136, 2011. [4] Robert Isele, Jürgen Umbrich, Chris Bizer and Andreas Harth, “LDSpider: An open-source crawling framework for the Web of Linked Data,” 2010. [5] C. Bizer, K. Eckert, R. Meusel, H. Mühleisen, M. Schuhmacher and J. Völker, “Deployment of RDFa, Microdata, and Microformats on the Web - A Quantitative Analysis,” in The Semantic Web‐ISWC 2013, Springer Berlin Heidelberg, 2013, p. 17‐32. [6] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres and S. Decker, “An empirical survey of Linked Data conformance,” J. Web Sem, vol. 14, p. 14‐44, 2012. [7] Heiko Paulheim and Sven Hertling, “Discoverability of SPARQL Endpoints in Linked Open Data,” 2013. [8] Sören Auer, Jan Demter, Michael Martin and Jens Lehmann, “LODStats - An Extensible Framework for High-Performance Dataset Analytics,” pp. 353-362. [9] A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres and S. Decker, “Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine,” Journal on Web Semantics, 2011. [10] T. Käfer, J. Umbrich, A. Hogan and A. Polleres, “Towards a Dynamic Linked Data Observatory,” in Proceedings of the Linked Data on the Web WWW2012 Workshop, 2012. [11] T. Käfer, A. Abdelrahman, J. Umbrich, P. O'Byrne and A. Hogan, “Observing Linked Data Dynamics,” in Proceedings of the 9th Extended Semantic Web Conference (ESWC), 2013. [12] Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014) - RDB Track, Riva del Garda, Italy, October 2014 (to appear).