An Empirical Study of owl:sameAs Use in Linked Data

Note: This is a preview; an updated version will be published in Web Science 2010.

Li Ding (1), Joshua Shinavier (1), Tim Finin (2) and Deborah McGuinness (1)

(1) Tetherless World Constellation, Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180, USA

(2) University of Maryland, Baltimore County, 1000 Hilltop Circle, MD 21250, USA

Introduction


Linked Data (Berners-Lee 2007) is the practice of making machine-understandable data available on the Web in the form of linked RDF descriptions. The owl:sameAs property is most commonly used to support Linked Data integration by interconnecting the resources in one dataset with equivalent resources in another. However, this emerging use of owl:sameAs in Linked Data is also known to be problematic in that it is often at odds with the official semantics of owl:sameAs as defined by the Web Ontology Language (OWL). For example, the equivalence relationship represented by owl:sameAs is often context-dependent, accurate only in the context of one application or another (Jaffri, Glaser, and Millard 2008). Therefore, a closer look at the use of owl:sameAs in practice is worthwhile for Linked Data application development.

The sheer extent of owl:sameAs use can be seen in major Linked Data sources such as DBpedia [1], Freebase [2], GeoNames [3] and the New York Times (NYT) annotated corpus [4]. Examination of the 2009 Billion Triple Challenge data set [5] further reveals some 6.5 million owl:sameAs statements. Rather than jumping into the full-scale Linked Data datasets, we conducted a pilot study using a small subset to provide a rough but quick empirical survey of the emerging usage of owl:sameAs in Linked Data.

[1] http://dbpedia.org/
[2] http://rdf.freebase.com/
[3] http://sws.geonames.org/
[4] http://data.nytimes.com/
[5] http://vmlion25.deri.ie/index.html

Experimental Settings and Results

We collected the evaluation dataset by performing a shallow crawl of the owl:sameAs network starting from a selection of seed URIs. Each crawl was done by dereferencing a seed URI and the URIs (transitively) linked from the seed URI by owl:sameAs statements. All of the seeds were drawn from the NYT dataset, which is a popular and carefully created source of linked data containing a significant number of owl:sameAs statements. Specifically, we selected 100 seed URIs for each of three distinct categories from the NYT corpus: people, locations and organizations. In what follows, we report a few interesting results.

“How large is an owl:sameAs network and where do the nodes come from?” An owl:sameAs network refers to a collection of URIs interconnected by owl:sameAs links. As shown in Figure 1, the network size is measured by the number of URIs, and the URIs are further grouped by their sources, i.e. their host names (e.g. dbpedia.org). Most URIs are contributed by a handful of sources, listed in the legend. DBpedia contributed many URIs in the owl:sameAs network because many Wikipedia “redirection” links have been captured using owl:sameAs.

[Figure 1 omitted. Categories: locations, organizations, people. Axis ticks: 0 to 60. Legend (sources): data.nytimes.com, dbpedia.org, rdf.freebase.com, sw.cyc.com, sw.opencyc.org, sws.geonames.org, www.rdfabout.com, other.]

Figure 1. The number of URIs (colored by source) in the sample owl:sameAs network per seed URI.
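The shallow crawl and the per-source grouping described above can be sketched roughly as follows. This is a minimal illustration and not the authors' crawler: `fetch_triples` is a hypothetical callable standing in for HTTP dereferencing plus RDF parsing, and the URIs in `TOY_WEB` are made up for the example.

```python
from collections import Counter, deque
from urllib.parse import urlparse

OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"


def crawl_sameas(seed, fetch_triples):
    """Breadth-first crawl of the owl:sameAs network reachable from `seed`.

    `fetch_triples(uri)` is a hypothetical callable: it should dereference
    `uri` and return a list of (subject, predicate, object) string triples,
    or an empty list when the URI cannot be dereferenced.
    """
    network = {seed}
    triples_by_uri = {}
    queue = deque([seed])
    while queue:
        uri = queue.popleft()
        triples = fetch_triples(uri)
        triples_by_uri[uri] = triples
        for s, p, o in triples:
            # Only follow owl:sameAs statements that mention the current URI.
            if p != OWL_SAMEAS or uri not in (s, o):
                continue
            other = o if s == uri else s
            if other not in network:
                network.add(other)
                queue.append(other)
    return network, triples_by_uri


def count_by_source(network):
    """Group the URIs of a network by host name, e.g. 'dbpedia.org'."""
    return Counter(urlparse(uri).netloc for uri in network)


# Made-up toy data standing in for the Web (not taken from the study).
TOY_WEB = {
    "http://data.nytimes.com/seed": [
        ("http://data.nytimes.com/seed", OWL_SAMEAS,
         "http://dbpedia.org/resource/Seed"),
    ],
    "http://dbpedia.org/resource/Seed": [
        ("http://dbpedia.org/resource/Seed", OWL_SAMEAS,
         "http://rdf.freebase.com/ns/seed"),
        ("http://dbpedia.org/resource/Seed_redirect", OWL_SAMEAS,
         "http://dbpedia.org/resource/Seed"),
    ],
}

network, triples_by_uri = crawl_sameas(
    "http://data.nytimes.com/seed", lambda uri: TOY_WEB.get(uri, [])
)
print(count_by_source(network))
```

URIs that cannot be dereferenced still count as members of the network here; they simply contribute an empty list of triples.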

“What is the distribution of triples that have been contributed for each URI in the owl:sameAs network?” 819 URIs contribute zero triples because they cannot be dereferenced, and 1639 URIs contribute only one triple (primarily from DBpedia, as a result of wiki redirection); the rest contribute more triples, as shown in Figure 2.

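[Figure 2 omitted. Horizontal axis: number of triples (ticks 1, 10, 100, 1000). Vertical axis: number of seed URIs in owl:sameAs network (ticks 1, 10, 100, 1000, 10000).]

Figure 2. The count of seed URIs contributing exactly n triples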
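The distribution reported above (how many URIs contribute exactly n triples) is a simple tally once each URI's description has been fetched. The sketch below assumes a hypothetical mapping from URI to its list of dereferenced triples; the example.org entries are made up.

```python
from collections import Counter


def triple_count_distribution(triples_by_uri):
    """Map n to the number of URIs whose description has exactly n triples."""
    return Counter(len(triples) for triples in triples_by_uri.values())


# Made-up toy descriptions: one unreachable URI, two redirect-like URIs
# with a single triple each, and one richer description.
toy_descriptions = {
    "http://example.org/unreachable": [],
    "http://example.org/redirect1": [("s", "p", "o")],
    "http://example.org/redirect2": [("s", "p", "o")],
    "http://example.org/rich": [
        ("s", "p1", "o1"), ("s", "p2", "o2"), ("s", "p3", "o3"),
    ],
}

distribution = triple_count_distribution(toy_descriptions)
for n in sorted(distribution):
    print(n, distribution[n])
```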

“Which properties were frequently used to describe the URIs in the owl:sameAs network?” We noticed that most sources used properties from their own vocabulary, as well as a few common properties such as rdfs:label, geo:lat, foaf:homepage and geonames:population. We also observed frequently used context-dependent properties such as nyt:first_use (in the NYT dataset) and cc:attributionName (in the Freebase dataset).

“How much does the content from different sources overlap?” It is, in general, hard to accurately compare the “real” difference between RDF graphs obtained from different sources. We compute a rough estimate by counting literals, based on the intuition that the literals in an RDF graph constitute the main body of informational knowledge since they can be read directly by end users. Our analysis shows that most triples from one linked data source are connected to just one of the source URIs, e.g., two Freebase URIs usually yield the same sets of triples. We also observed that the descriptions of most DBpedia URIs in the owl:sameAs network are small and free of literals. As shown in Figure 3, DBpedia and Freebase are the primary sources of literals (accounting for 83% of all literals) but they don’t have many literals in common. Further, manual analysis of the values shows potentially conflicting literal values, e.g. the postal code in GeoNames follows the five-digit ZIP code standard but DBpedia serves zip codes using extended ZIP+4 codes.

[Figure 3 omitted. Categories: locations, organizations, people. Axis ticks: 0 to 300. Legend: dbpedia only, freebase only, dbpedia & freebase, other sources.]

Figure 3. Sources of literals in RDF graph dereferenced from URIs in the owl:sameAs network of a seed URI.

Discussion

Based on our experimental results, we have identified the following issues related to the practical use of the owl:sameAs property.

One problem with owl:sameAs, as commonly used in Linked Data, is that it may conflate context-dependent descriptions provided by different data sources. For example, Li Ding’s FOAF profile at Stanford University was accurate when it was published three years ago, but some facts have changed since then. It is reasonable to assert that the URIs from his Stanford FOAF profile and from the more recent RPI profile are equivalent, in that they refer to the same person. However, the use of owl:sameAs would also lead a reasoner to infer, on integrating the two data sets, that Li holds the position of “research scientist” at Stanford, which has never been the case. Similar issues were raised about an earlier version of the NYT dataset, where the cc:license property could be wrongly propagated via the owl:sameAs network (Cyganiak 2009).

This concern extends to conflicting statements from different sources. Consider, for example, the population of Warsaw, the capital of Poland. DBpedia provides two conflicting answers, 1 (via dbonto:populationTotal) and 1,709,781 (via dbprop:populationTotal), while GeoNames provides yet another value (1,702,139). Each of these values could be true in a certain context; however, in answering a simple query involving population, we expect one simple answer rather than a set of alternatives.

We can thus suggest several components of a general strategy for integrating and fusing information from the URIs in an owl:sameAs network.

* [duplication rule] if the URIs carry identical or very similar content, then only one URI needs to be dereferenced.
* [difference rule] if the URIs carry very different content, then both need to be dereferenced.
* [context rule] descriptions about two URIs can be merged after filtering the context-dependent part which cannot be merged.
* [conflict rule] if the URI descriptions contain conflicting statements, they may be used as alternatives or be resolved using heuristics.

Conclusion

Although not very comprehensive, this brief empirical study has already revealed interesting research directions related to owl:sameAs, and further suggests an emerging operational semantics of owl:sameAs in which equivalent URIs are treated as alternatives from which we must choose, as opposed to truly indistinguishable resources whose descriptions should be merged. By selecting the right interpretation of an owl:sameAs link, we can avoid a large amount of overhead in loading and storing associated RDF descriptions. Given the URI of a resource described in Linked Data, we have the option of either dereferencing and merging all equivalent resources, based on owl:sameAs statements, or of picking and choosing alternatives based on our knowledge of the various data providers.
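The four rules from the Discussion can be read as a small fusion procedure. The sketch below is one possible interpretation under stated assumptions, not a method given in the paper: descriptions are simplified to (property, literal) pairs, similarity is a Jaccard overlap of those pairs (echoing the literal-counting estimate used in the results), the 0.9 threshold is arbitrary, and the set of context-dependent properties is just the examples named in the text. The property ex:postalCode and all values in the toy data are made up. In practice the similarity judgment would more likely come from prior knowledge of the data providers, since comparing content requires having fetched it already.

```python
from collections import defaultdict

# Assumption: context-dependent properties, taken from examples in the text.
CONTEXT_DEPENDENT = {"nyt:first_use", "cc:attributionName", "cc:license"}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuse(descriptions, similarity_threshold=0.9):
    """Fuse descriptions of URIs connected by owl:sameAs.

    `descriptions` maps each URI to a list of (property, literal) pairs.
    Returns (kept_uris, merged, conflicts).
    """
    kept = []
    for uri, pairs in descriptions.items():
        current = set(pairs)
        # Duplication rule: skip a URI whose content is near-identical
        # to one that has already been kept.
        if any(jaccard(current, set(descriptions[k])) >= similarity_threshold
               for k in kept):
            continue
        # Difference rule: sufficiently different content is kept as well.
        kept.append(uri)

    merged = defaultdict(set)
    for uri in kept:
        for prop, value in descriptions[uri]:
            # Context rule: drop the part that should not be merged.
            if prop in CONTEXT_DEPENDENT:
                continue
            merged[prop].add(value)

    # Conflict rule: properties with several values are kept as alternatives.
    conflicts = {p: vals for p, vals in merged.items() if len(vals) > 1}
    return kept, dict(merged), conflicts


# Made-up toy descriptions for one owl:sameAs network.
toy = {
    "http://data.nytimes.com/x": [
        ("rdfs:label", "Troy"), ("nyt:first_use", "2004-01-01")],
    "http://rdf.freebase.com/ns/a": [
        ("rdfs:label", "Troy"), ("ex:postalCode", "12180")],
    "http://rdf.freebase.com/ns/b": [
        ("rdfs:label", "Troy"), ("ex:postalCode", "12180")],
    "http://dbpedia.org/resource/x": [
        ("rdfs:label", "Troy"), ("ex:postalCode", "12180-3590")],
}

kept, merged, conflicts = fuse(toy)
print(kept)
print(conflicts)
```

In this toy run the second Freebase URI is dropped as a duplicate, nyt:first_use is filtered out as context-dependent, and the two postal code formats survive as alternatives for a later heuristic or a human to resolve.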

References

Berners-Lee, T. 2007. Linked data. http://www.w3.org/DesignIssues/LinkedData.html.

Jaffri, A., Glaser, H., and Millard, I. 2008. URI disambiguation in the context of Linked Data. In 1st International Workshop on Linked Data on the Web, Beijing, China.

Cyganiak, R. 2009. Linked data at the New York Times: Exciting, but buggy. Web page. http://dowhatimean.net/2009/10/linked-data-at-the-new-york-times-exciting-but-buggy (last accessed Jan 22, 2010).