Scalable Approaches for Identifiers of Dynamic ...

Scalable Approaches for Identifiers of Dynamic Data and Linked Data in an Evolving World
Jens Klump | OCE Science Leader Earth Science Informatics
Robert Huber | MARUM, University of Bremen
Lesley Wyborn | NCI, Australian National University
MINERAL RESOURCES

Introduction
• When data were still all handcrafted, datasets were small and semantically rich.
• Recent advances in technology now provide us with high-volume data streams.
• Two particular challenges are very large dynamic datasets and the use of semantic concepts for machine readability.

2 | Dynamic Data | Jens Klump

Data and the record of science
“Users want intellectual works, not digital objects” (Arms, 1995)
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.” (Buckheit & Donoho, 1995)

Data citation: Dynamic Data
From artisan scale to peta-scale data

Data citation – The classical case
Heim et al. (2005) Glob. Planet. Change
Data at doi:10.1594/GFZ.SDDB.1043

Large data volumes
[Figure: NCI Collections, data volume by collection]
Collections: Astronomy (Optical) 200 TB, Weather 340 TB, Geophysics 300 TB, CMIP5 3 PB, Atmosphere 2.4 PB, Earth Observ. Water 2 PB Ocean 1.5 PB, Marine Videos 10 TB
Legend: BOM, GA, CSIRO, ANU, Other National, International

Citing from large dynamic datasets
RDA problem statement: A data citation …
• allows us to identify and cite arbitrary views of data, from a single record to an entire data set, in a precise, machine-actionable manner
• allows us to cite and retrieve that data as it existed at a certain point in time, whether the database is static or highly dynamic
• is stable across different technologies and technological changes

Identity, granularity … and other headaches
• Data citation needs to consider the intended purpose of a dataset.
• Do we need access to subsets of the data?
• What is a suitable granularity?
• Is the dataset made of discrete objects or is it continuous?
• Should we assign identifiers to every object or use a canonical path?
• Do the data change over time? Do they change only by appending new data, or also by changing data retrospectively (e.g. reprocessing, corrections)?

What is the identity of an object?
[Figure: the same object shown at Year 1, Year 2, Year 3, … Year n, with some parts changed at each step]

Dynamic data according to DataCite
[Figure: DOIs assigned to states of a dataset along a time axis. Labels: 10.123/DOI-1, 10.123/DOI-2, 10.123/DOI-3, 10.123/DOI-4, 10.123/DOI-A]

Dynamic data according to RDA
The WG recommends solving this challenge by:
• ensuring that data is stored in a versioned and timestamped manner.
• identifying data sets by storing and assigning persistent identifiers (PIDs) to timestamped queries that can be re-executed against the timestamped data store.
However, the proposal then becomes very detailed on queries, sort order, and other features of tabulated data in relational databases. This does not scale for large dynamic data sets.
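The two recommendations above can be illustrated on a toy relational table. This is only a minimal sketch, assuming an in-memory SQLite store; the table `obs`, the helper names and the `hypothetical-pid/` scheme are invented for illustration and are not part of the RDA recommendation. Rows are never overwritten: each version carries a validity interval, so a stored query plus its timestamp can be re-executed later and return the same result.

```python
import hashlib
import sqlite3

def make_store():
    # Versioned, timestamped store: every row version has a validity interval.
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE obs (station TEXT, value REAL, "
        "valid_from TEXT NOT NULL, valid_to TEXT)"
    )
    return con

def upsert(con, station, value, now):
    # Close the current version instead of overwriting it, then append the new one.
    con.execute(
        "UPDATE obs SET valid_to = ? WHERE station = ? AND valid_to IS NULL",
        (now, station),
    )
    con.execute("INSERT INTO obs VALUES (?, ?, ?, NULL)", (station, value, now))

def run_as_of(con, where_sql, params, as_of):
    # Re-execute a stored query against the state of the data at time `as_of`.
    # ISO 8601 timestamps in one time zone compare correctly as strings.
    sql = (
        f"SELECT station, value FROM obs WHERE ({where_sql}) "
        "AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY station"
    )
    return con.execute(sql, (*params, as_of, as_of)).fetchall()

def query_pid(where_sql, params, as_of):
    # Hypothetical identifier: a hash of the query text, parameters and timestamp.
    digest = hashlib.sha256(repr((where_sql, params, as_of)).encode()).hexdigest()
    return f"hypothetical-pid/{digest[:12]}"
```

A citation would then store the query, its parameters and the timestamp under the PID. The fixed sort order in `run_as_of` hints at why the approach becomes detailed, and the per-row bookkeeping hints at why it may not scale to very large datasets.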

Versioning and releases
[Figure: "VersionNumbers" by AzaToth - w:en:Image:VersionNumbers.svg. Licensed under CC BY-SA 3.0 via Commons https://commons.wikimedia.org/wiki/File:VersionNumbers.svg#/media/File:VersionNumbers.svg]
Principles in software versioning:
• Sequence-based identifiers
• Change significance
• Designating development stage
In practice there are many more considerations (political, cultural, aesthetic, …).
Versioning of dynamic data could follow software versioning principles (release vs. nightly build).
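As a rough illustration of the three principles applied to data, the sketch below models a dataset version with a sequence-based number, a bump rule driven by change significance, and a development stage. The class name, the two stages and the label format are assumptions made for this example, not an established convention.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataVersion:
    major: int               # incremented for significant changes (e.g. reprocessing)
    minor: int               # incremented for minor changes (e.g. appended data)
    stage: str = "release"   # "release" or "nightly" in this sketch

    def bump(self, significant):
        # Change significance decides which part of the sequence advances.
        if significant:
            return DataVersion(self.major + 1, 0, self.stage)
        return DataVersion(self.major, self.minor + 1, self.stage)

    def label(self):
        # Releases carry the bare number; other stages are marked explicitly.
        base = f"{self.major}.{self.minor}"
        return base if self.stage == "release" else f"{base}-{self.stage}"
```

Under this scheme a citeable release and the continuously updated nightly state of the same dataset can share a number while remaining distinguishable by their labels.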

Dynamic data and provenance
[Figure, Nicholas Car (2015): provenance diagram. Labels: Input Data, Code, Config, Output Data (Entity); Process (Activity); Who / which system (Agent); relation "used"]
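A provenance record of this kind can be sketched as plain data. The example below is an illustration only: it borrows the entity, activity and agent vocabulary of the figure and the PROV relation name `used`, while the record layout, the function names and the idea of hashing the record into a fingerprint are assumptions made here to show how a record could single out one citeable state of a dataset.

```python
import hashlib
import json

def provenance_record(input_id, code_id, config_id, agent, started_at):
    # Describe the process (activity), what it used (entities) and who ran it (agent).
    return {
        "activity": {
            "type": "prov:Activity",
            "startedAtTime": started_at,
            "used": [input_id, code_id, config_id],
            "wasAssociatedWith": agent,
        },
        "agent": {"type": "prov:Agent", "name": agent},
    }

def fingerprint(record):
    # Canonical JSON (sorted keys, fixed separators) gives a reproducible hash.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs with the same inputs, code, configuration, agent and start time produce the same fingerprint, while changing any one of them produces a different one.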

A proposal for dynamic data DOI
“Template Handles” could be used to reference versions or subsets of dynamic data. (Huber et al., doi:10.6084/m9.figshare.1285728)
• Doe, J. (2009-2011): Dynamic Data Set Title. Version: 1.2. Responsible Data Archive. [evolving dataset]. doi:10.1001/1234@version=1.2
• Doe, J. (2009-2011): Dynamic Data Set Title. Subset: 2010-01-01 2010-12-13. Responsible Data Archive. [growing dataset]. doi:10.1001/1234@range=2010-01-01--2010-12-13
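How a resolver might split such an identifier into the base DOI and its template part can be sketched as follows. This is a minimal illustration based only on the two citation examples above; the function name and the returned structure are hypothetical, and real template handle processing in the Handle System is configured on the server side and may differ.

```python
def parse_template_handle(identifier):
    # Split "base@key=value" into the base identifier and a parameter dict.
    base, sep, template = identifier.partition("@")
    if not sep:
        return base, {}          # plain identifier without a template part
    key, eq, value = template.partition("=")
    if not eq or not key:
        raise ValueError(f"malformed template part: {template!r}")
    if key == "range":
        # Ranges use a double hyphen between start and end.
        start, dash, end = value.partition("--")
        if not dash:
            raise ValueError(f"malformed range: {value!r}")
        return base, {"range": (start, end)}
    return base, {key: value}
```

The base identifier stays registered once, while versions and subsets are addressed through the template part instead of through separately minted DOIs.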

Linked Data: Identification of concepts
When concepts change over time

Semantics
• Language is ambiguous.
• Different user communities might name concepts differently.
• Concepts evolve over time.
• Machines are not good at interpreting ambiguity.
• Semantic mediation can be used to overcome semantic barriers.

Changing classifications
Even seemingly stable classifications, like the classification of elephants by Linnaeus (1758), are subject to change.
Semantic mediation is needed to keep data interpretable over long periods of time and across large data volumes.
Rohland N, Reich D, Mallick S, Meyer M, Green RE, et al. (2010) Genomic DNA Sequences from Mastodon and Woolly Mammoth Reveal Deep Speciation of Forest and Savanna Elephants. PLoS Biol 8(12): e1000564. doi:10.1371/journal.pbio.1000564

Linked data
• Linked data is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries.
• Linked data builds upon standard Web technologies such as HTTP, RDF and URIs.
• Rather than using URIs only to serve web pages to human readers, linked data extends them to share information in a way that can be read automatically by computers.
• Linked data enables data from different sources to be connected and queried.

Linked data and content negotiation
• Concepts can be uniquely identified by HTTP URIs.
• Ideally, these linked data URIs should resolve to something.
• Combining linked data URIs and content negotiation can be used to serve different content to human and machine clients.
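The server-side decision behind content negotiation can be sketched as a small function that inspects the HTTP Accept header. This is a simplified illustration: the list of offered media types and the HTML default are assumptions for this example, and wildcards such as `*/*` are deliberately not matched, so they fall through to the default.

```python
OFFERED = ("text/html", "text/turtle", "application/ld+json")

def choose_representation(accept_header, offered=OFFERED, default="text/html"):
    # Pick the offered media type with the highest q-value in the Accept header.
    best, best_q = None, 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        q = 1.0                                  # q defaults to 1 when absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0                      # unreadable q: treat as not acceptable
        if media_type in offered and q > best_q:
            best, best_q = media_type, q
    return best or default
```

A browser sending `text/html,application/xhtml+xml;q=0.9` would receive the human-readable page, while a client sending `text/turtle` would receive RDF for the same URI.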

Linked data application
• Linked data URIs can identify:
  • Concepts
  • Relations
  • Graphs
• There are several ways to encode linked data (Turtle, RDFa, Microdata, JSON-LD).
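To make the point about encodings concrete, the sketch below writes one statement (subject, relation, object, all identified by URIs) in two of the listed encodings. It is a hand-rolled illustration using only the standard library; the function names are made up, and a real application would more likely rely on an RDF library.

```python
import json

def as_turtle(subject, predicate, obj):
    # One triple in Turtle, with all three terms written as full IRIs.
    return f"<{subject}> <{predicate}> <{obj}> ."

def as_jsonld(subject, predicate, obj):
    # The same statement in JSON-LD, using the full IRI as the property key.
    return json.dumps({"@id": subject, predicate: {"@id": obj}}, indent=2)
```

For example, `as_turtle("http://example.com/people/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.com/people/bob")` and the matching `as_jsonld` call describe the same relation between the same two URIs.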

Linked data
Linked data principles by Tim Berners-Lee (2006):
• Use URIs to name (identify) things.
• Use HTTP URIs so that these things can be looked up (interpreted, "dereferenced").
• Provide useful information about what a name identifies when it's looked up, using open standards such as RDF, SPARQL, etc.
• Refer to other things using their HTTP URI-based names when publishing data on the Web.

Link rot revisited
Berners-Lee (1998): Cool URIs don't change
“In theory, the domain name space owner owns the domain name space and therefore all URIs in it. Except insolvency, nothing prevents the domain name owner from keeping the name.”
… and then came the burst of the dot-com bubble in 2001 and many companies went insolvent. And with them went their domain names.
Even government department names are not stable.

A new role for identifiers in linked data
Linked data URIs are of the form http://example.com/people/alice
What if the base URI example.com ceases to exist? The semantic part /people/alice remains valid.
http://doi.org/10.12345/people/alice
URI-formed PIDs would be easier to maintain, because PID namespaces can easily be transferred from one host to another.
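The idea can be sketched in two steps: mint a PID-based URI that keeps the semantic path, and resolve it through a table that maps each PID prefix to its current host. The prefix `10.12345` and the resolver `doi.org` come from the example above; the function names and the `hosts` table are assumptions for illustration and do not describe how the DOI resolver is actually implemented.

```python
from urllib.parse import urlsplit, urlunsplit

def to_pid_uri(uri, prefix="10.12345", resolver="doi.org"):
    # Keep the semantic path, replace the domain with resolver + PID prefix.
    parts = urlsplit(uri)
    path = f"/{prefix}{parts.path}"
    return urlunsplit((parts.scheme, resolver, path, parts.query, parts.fragment))

def resolve(pid_uri, hosts):
    # Look up the current host for the prefix and rebuild the target URL.
    parts = urlsplit(pid_uri)
    prefix, _, local = parts.path.lstrip("/").partition("/")
    return f"http://{hosts[prefix]}/{local}"
```

If the namespace moves, only the entry in `hosts` changes, for example from `{"10.12345": "example.com"}` to `{"10.12345": "example.org"}`, while every published PID URI stays the same.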

Summary
• Identifying static objects can easily be done through persistent identifiers.
• For datasets changing with time we need other ways to point users to a stable form of the desired data.
• Timestamped snapshots of very large dynamic datasets are not feasible.
• A provenance record could be used to identify citeable forms of a data set.
• Linked data currently relies on domain names for HTTP base URIs. This may lead to “link rot revisited”. The use of HTTP PIDs should be considered instead of domain names.

Mineral Resources
Jens Klump, OCE Science Leader Earth Science Informatics
t +61 8 6436 8828
e [email protected]
w www.csiro.au

NCI/ANU
Lesley Wyborn, Adjunct Fellow
t +61 2 6125 2581
e [email protected]
w nci.org.au

MARUM/Univ. Bremen
Robert Huber, Senior Research Fellow
t +49 421 2186 5593
e [email protected]
w www.pangaea.de

http://creativecommons.org/licenses/by-nc-nd/4.0/
