Describing Research Data: A Case Study for Archaeology

Nicola Aloia¹, Christos Papatheodorou²,³, Dimitris Gavrilis³, Franca Debole¹, and Carlo Meghini¹

¹ Istituto di Scienza e Tecnologie dell’Informazione, National Research Council, Pisa, Italy
{Nicola.Aloia,Franca.Debole,Carlo.Meghini}@isti.cnr.it
² Dept. of Archives, Library Science and Museology, Ionian University, Corfu, Greece
³ Digital Curation Unit, Institute for the Management of Information Systems, ‘Athena’ Research Centre, Athens, Greece
{c.papatheodorou,d.gavrilis}@dcu.gr

Abstract. The growth of digital resources produced by research activities demands the development of e-Infrastructures in which researchers can access remote facilities, select and re-use huge volumes of data and services, run complex experimental processes and share results. Data registries aim to describe the data of e-Infrastructures uniformly, contributing to the re-usability and interoperability of big scientific data. However, the current situation requires the development of powerful resource integration mechanisms that go beyond the principles guaranteed by the data registry standards. This paper proposes a conceptual model for describing data resources and services and extends the existing specifications for the development of data registries. The model has been implemented in the context of the ARIADNE project, an EU-funded project that focuses on the integration of archaeological digital resources across Europe.

Keywords: Data registries, Research infrastructures, Interoperability, Archaeological digital resources.

1 Introduction

Extremely large scientific datasets are being generated, and identifying, locating, re-using and exploiting data is becoming both more difficult and more pressing. This data deluge affects the way research is carried out, leading to a data-oriented paradigm. Data integration functionalities and data analysis, data mining and visualization tools should support this shift in the research and scholarly communication paradigm. For this purpose, Global Research Data Infrastructures are being developed to ensure the interoperability and discoverability of scientific resources and to cope with (i) the structural (syntactic) heterogeneity of data organized in datasets that follow particular database schemas, or in collections described by different metadata schemas at collection level as well as at item level; (ii) the semantic heterogeneity caused by the diversity of vocabularies used for the description of artefacts, activities, events, temporal periods, workflows and geospatial data; and (iii) the diversity of metadata schemas, which requires their semantic integration through mappings to upper-level conceptual schemas [1,2].

This paper is inspired by the work done in the ARIADNE project¹, an EU-funded project that develops a data registry aimed at the integration of archaeological research data. Due to the huge volume of data, the main challenge is to provide an environment for establishing the degree of relatedness between data resources, in order to plan their integration. Achieving this goal requires a system based on (i) algorithms for estimating the degree of relatedness between any two data resources and (ii) a Catalog of the existing data resources for tuning and executing the algorithms. The paper presents the conceptual model of the Catalog, named the ARIADNE Catalog Data Model (ACDM), which goes beyond accessibility and re-usability requirements and extends the existing data registry standards².

R. Meersman et al. (Eds.): OTM 2014, LNCS 8841, pp. 768–775, 2014. © Springer-Verlag Berlin Heidelberg 2014

2 Background

The main processes for making data understandable and shareable are standardization and registration. ISO/IEC 11179 facilitates the acquisition, registration, reuse, interchange and sharing of data [3]. Based on this standard, significant efforts have been made to develop metadata registries [4,5,6,7]. The most recent effort is the Open Metadata Registry³, formerly named the National Science Digital Library (NSDL) Registry, which hosts vocabularies and their terms (concepts), metadata schemas and their elements, and details about the agents (persons or corporate bodies) who have added content to the registry. Its implementation is based on the W3C standard Simple Knowledge Organization System (SKOS); hence the user can search for vocabularies, terms, metadata schemas and metadata elements and retrieve their descriptions via a SPARQL endpoint.

DCAT⁴ is an RDF vocabulary, recently published by the Government Linked Data Working Group at W3C as a Recommendation, for describing datasets and catalogs on the Web in order to enable their discoverability and consumption by services. The DCAT model “is well-suited to representing government data catalogues such as Data.gov and data.gov.uk” and has been proposed as a tool for publishing datasets as Open Data [8]. Various datasets have already been published according to the DCAT specifications, and various European projects officially recommend its adoption. DCAT reuses a number of classes from other well-known vocabularies, such as foaf:Agent and skos:Concept, and defines a set of relations among them. The main classes of the model are dcat:Catalog, which represents a curated collection of metadata about datasets; dcat:Dataset, which represents a published and curated collection of data; and dcat:Distribution, which represents a specific available form of a dataset, since each dataset might be available in different formats or at different endpoints.

¹ http://www.ariadne-infrastructure.eu/
² The system is available at: http://schemas.cloud.dcu.gr/ariadne-registry/
³ http://metadataregistry.org/
⁴ http://www.w3.org/TR/vocab-dcat/
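To make the relationship between the three main DCAT classes more concrete, the following is a minimal sketch in plain Python. It is not an RDF implementation and not part of DCAT itself: the attribute names (`access_url`, `media_type`, `media_types`) and all titles and URLs are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One accessible form of a dataset (cf. dcat:Distribution)."""
    access_url: str   # hypothetical download location or endpoint
    media_type: str   # e.g. "text/csv"


@dataclass
class Dataset:
    """A published, curated collection of data (cf. dcat:Dataset)."""
    title: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Catalog:
    """A curated collection of metadata about datasets (cf. dcat:Catalog)."""
    title: str
    datasets: List[Dataset] = field(default_factory=list)

    def media_types(self) -> List[str]:
        """Return the sorted, de-duplicated media types offered in the catalog."""
        return sorted(
            {d.media_type for ds in self.datasets for d in ds.distributions}
        )


# Illustrative data only: the titles and URLs below are invented.
excavation = Dataset(
    title="Example excavation records",
    distributions=[
        Distribution("http://example.org/excavation.csv", "text/csv"),
        Distribution("http://example.org/sparql", "application/sparql-results+json"),
    ],
)
catalog = Catalog(title="Example catalog", datasets=[excavation])
print(catalog.media_types())
```

The same dataset appears once in the catalog but is reachable in two forms, which is the distinction that dcat:Distribution is meant to capture.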

3 ARIADNE Catalog Data Model

The proposed Catalog will support the browsing and querying of data resources and services, providing useful information on each of them. We estimate that its size will reach the order of thousands of resources, and we are therefore going to design tools for the exploration of this space and the discovery of archaeological resources. We plan to offer two kinds of discovery: a semantic discovery, allowing users to identify the resources that relate to a specific topic, event or spatio-temporal region; and a similarity discovery, allowing users to provide (the identifier of) a data resource as input and to obtain in return (descriptors of) the resources that are similar to the given one, ranked in decreasing degree of similarity.

In this section we describe the ARIADNE Catalog Data Model (ACDM), defined to register information about resources that are scattered amongst different collections, in inaccessible and unpublished fieldwork reports (“grey literature”) and in publications. These resources come in three different types: Data Resources, including the resources that are containers of data, such as databases and collections; Language Resources, including the resources related to the formal languages used in Data Resources, such as vocabularies, metadata schemas and mappings; and Services, including the resources offering some kind of functionality in the archaeological domain.

We built our model around the DCAT vocabulary, which we extended by adding the classes and properties needed to best describe the ARIADNE assets. Its adoption places ARIADNE in an ideal position for publishing archaeological data resources as Open Data. As illustrated in Figure 1, the central notion of the model is the class ArchaeologicalResource, which is specialized into:

• DataResource, whose instances represent the various types of data containers owned by the ARIADNE partners and lent to the project for integration. This class is created for the sole purpose of defining the domain and the range of a number of associations. It is therefore an abstract class, whose instances are inherited from its sub-classes.
• LanguageResource, having as instances vocabularies, metadata schemas, gazetteers and mappings (between language resources). As new resources of a linguistic nature are added to the Catalog (such as subject heading systems and thesauri), the corresponding classes will be added to the model as sub-classes of this class. To describe language resources we have used ISO/IEC 11179 “Specification and Standardization of Data Elements” [3].
• Service, whose instances represent the services owned by the ARIADNE partners and lent to the project for integration.

Classes with a more auxiliary role are DataFormat, whose instances represent the formats that realize metadata schemas or structure the records of datasets, and DBSchema, whose instances represent database schemas.
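The top-level specialization described above can be sketched as a small Python class hierarchy. This is only an illustration under stated assumptions: the `identifier` attribute, the `kind` method and the `group_by_type` helper are hypothetical names, ArchaeologicalResource is treated as abstract for convenience, and LanguageResource and Service are kept concrete to keep the example short. The one property taken directly from the text is that DataResource is abstract and receives its instances only through sub-classes.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ArchaeologicalResource(ABC):
    """Central notion of the model; 'identifier' is an assumed attribute."""

    def __init__(self, identifier: str) -> None:
        self.identifier = identifier

    @abstractmethod
    def kind(self) -> str:
        """Human-readable resource kind (hypothetical helper)."""


class DataResource(ArchaeologicalResource):
    """Abstract: it does not implement kind(), so it cannot be instantiated."""


class Database(DataResource):
    def kind(self) -> str:
        return "database"


class Collection(DataResource):
    def kind(self) -> str:
        return "collection"


class LanguageResource(ArchaeologicalResource):
    def kind(self) -> str:
        return "language resource"


class Service(ArchaeologicalResource):
    def kind(self) -> str:
        return "service"


def group_by_type(resources: List[ArchaeologicalResource]) -> Dict[str, List[str]]:
    """Group resource identifiers under the three top-level resource types."""
    groups: Dict[str, List[str]] = {
        "DataResource": [], "LanguageResource": [], "Service": []
    }
    for resource in resources:
        for cls in (DataResource, LanguageResource, Service):
            if isinstance(resource, cls):
                groups[cls.__name__].append(resource.identifier)
    return groups


# Invented identifiers, for illustration only.
resources = [Database("db-1"), Collection("col-1"),
             LanguageResource("vocab-1"), Service("svc-1")]
print(group_by_type(resources))
```

Because a Database is also a DataResource, any association whose domain or range is DataResource applies to it without being restated, which is the role the abstract class plays in the model.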



Fig. 1. The ACDM as a UML diagram

3.1 ArchaeologicalResource

The ArchaeologicalResource class defines the properties common to its subclasses, mostly using the terms of the DCAT vocabulary, to which it adds properties for specifying the access policy and the original identifier of the resource. The main associations having this class as domain are:

• dct:isPartOf: associates any archaeological resource in the catalog with that catalog.
• dct:publisher: associates any archaeological resource with an agent responsible for making the resource publicly available.
• dct:creator: associates any archaeological resource with an agent primarily responsible for creating the resource.
• owner: associates any archaeological resource with an agent that is the legal owner of the resource.
• legalResponsible: associates any archaeological resource with a person holding the legal responsibility for the resource.
• scientificResponsible: associates any archaeological resource with a person holding the scientific responsibility for the resource.
• technicalResponsible: associates any archaeological resource with a person holding the technical responsibility for the resource and acting as its contact person.
• ariadneSubject: associates any archaeological resource with one or more archaeological subjects defined by ARIADNE, namely: Fieldwork databases, Event/intervention databases, Sites and monuments databases, Scientific databases, Artefacts, Burials.
• dct:subject: associates any archaeological resource with a subject drawn from an existing vocabulary.

3.2 DataResource

This class specializes the class ArchaeologicalResource and has as instances archaeological resources such as databases, GISs, collections or datasets. Two important attributes of this class are dct:temporal and dct:spatial, which give the temporal and spatial coverage of each data resource. These attributes will be used for establishing the degree to which two data resources are worth integrating. The main associations having this class as domain are:

• dct:isPartOf: associates a data resource with the collections of which the data resource is part.
• dcat:distribution: associates a data resource with its distributions, i.e. the accessible forms of the resource.
• hasItemMetadataStructure: associates a data resource with the format of the metadata of the members (or items) of the data resource (e.g. the metadata of each record in a dataset, or of each item in a collection).
• hasMetadataRecord: associates a data resource with the metadata of the resource as created by the organization holding the resource (for instance, the record describing a dataset in the organization holding the dataset).

The class has the following subclasses:

Collection: This class has as instances collections in the archaeological domain. The items in a collection are data resources themselves; for instance, a collection may include a textual document, a set of images, one or more datasets and other collections. For interoperability, Collection is a sub-class of dcmitype:Collection. The main association having this class as domain is dct:hasParts, which associates a collection with the data resources that are in the collection. This association is used in the ARIADNE Catalog only for stating the membership of data resources in collections, since the Catalog does not store information on individual objects.

Database: This class has as instances databases, each defined as a set of homogeneously structured records managed through a Database Management System (such as MySQL); the Database Management System is recorded as an attribute of the class. The main association having this class as domain is hasSchema, which associates a database with the schema defining the structure of the data in the database. Such a schema is an instance of the class DBSchema.

Dataset: This class is a specialization of the classes DataResource and dcat:Dataset, and it has archaeological datasets as instances. An archaeological dataset is defined as a set of homogeneously structured records that are not managed through a Database Management System. The main association having this class as domain is hasRecordStructure, which associates a dataset with a data format defining the structure of its records. Such a format is an instance of the class DataFormat.

GIS: This class is a specialization of the class DataResource and has as instances Geographical Information Systems (GISs). The GIS technology used for each instance is modelled as an attribute of the class.

3.3 LanguageResource

This is the class of all language resources described in the Catalog for the purposes of re-use or integration within the ARIADNE community. A language resource is a resource of a linguistic nature, whether in natural language (such as a gazetteer) or in a formal language (such as a vocabulary or a metadata schema). It also includes mappings, understood as associations between expressions of two language resources, which may be of a formal (e.g., sub-class or sub-property links) or an informal (e.g., natural language rules) nature. The LanguageResource class thus has as instances vocabularies, metadata schemas, gazetteers and mappings (between language resources). The most significant subclasses of the class are:

MetadataSchema: This subclass has as instances the metadata schemas used in the archaeological domain.

Vocabulary: This subclass has as instances the vocabularies used in the archaeological domain.

Mapping: An instance of this class represents a mapping between two language resources (e.g., metadata schemas).

3.4 Service

The modelling of the services to be integrated by ARIADNE is at a preliminary stage of development. The goal is to provide the primitives for describing the services developed by the project partners for which integration or reuse can be envisaged. A preliminary survey has identified the following categories: services that make use of GIS software; services that make use of database management systems; ad hoc systems developed in-house that do not use any of the previous technologies; and composite services that use a combination of the previous categories. An ARIADNE service is therefore understood as an instance of one of the four categories of software listed above.

Another important feature of services to be considered in the context of ARIADNE is how they can be accessed. From the preliminary investigation, the following access types can be distinguished: services to be used locally, requiring installation on a specific hardware/software platform; services to be used locally, independently of any specific hardware/software platform; services to be used as web applications; and web services based on a standard protocol. For the first three access types, it is important to know whether they provide Application Programming Interfaces (APIs) and whether they are Open Source (Figure 2).
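The two classifications above lend themselves to simple enumerations. The sketch below is illustrative only: the enumeration and member names are assumptions, and the helper reads the statement about "the first three" as referring to the first three ways of accessing a service, which is an interpretation rather than something the model states explicitly.

```python
from enum import Enum


class ServiceCategory(Enum):
    """The four categories of software found in the preliminary survey."""
    GIS_BASED = "makes use of GIS software"
    DBMS_BASED = "makes use of a database management system"
    AD_HOC = "ad hoc system developed in-house"
    COMPOSITE = "combination of the previous categories"


class AccessType(Enum):
    """The four ways in which a service can be accessed."""
    LOCAL_PLATFORM_SPECIFIC = 1
    LOCAL_PLATFORM_INDEPENDENT = 2
    WEB_APPLICATION = 3
    STANDARD_WEB_SERVICE = 4


def needs_api_and_open_source_info(access: AccessType) -> bool:
    """True for the first three access types, where API availability and
    Open Source status are said to be important to record."""
    return access is not AccessType.STANDARD_WEB_SERVICE


for access in AccessType:
    print(access.name, needs_api_and_open_source_info(access))
```

Keeping category and access type as two independent enumerations reflects that the text treats them as separate features of a service: a GIS-based service could, for example, be offered either as a local installation or as a web application.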


Fig. 2. The Ariadne Services Data Model

A third feature relevant to the service description in the ARIADNE context is the kind of functionality offered by the services (e.g. map viewer, data entry system, etc.). We did not find a shared ontology to express the characteristics of services as discussed above. The best approximation that we found is the ontology adopted in DBpedia⁵ for describing software, so we defined the ARIADNEService class as a specialization of the DBpedia-Software class. The main associations having this class as domain are:

• applyTo: the DataResource to which the service can be applied.
• isInRepository: if the source code is available in a repository, the repository URI and other information, such as the credentials to access the repository, are supplied.
• hasAttachedDocuments: the documents that are attached to a service for illustration purposes.
• hasTechnicalSupport: the person responsible for technical support.
• hasAPI: if the service provides an API, a description must be supplied.
• hasComponents: the other components that a service may include.
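As a rough illustration of how the associations listed above might appear on a catalog record, the following Python sketch maps each of them to a field. The field names, the use of plain strings for persons, documents and data resources, and the `all_components` helper are assumptions made for this example; the sketch does not reproduce the DBpedia ontology or the actual ARIADNE implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ARIADNEService:
    """Hypothetical record for one service; comments name the association."""
    name: str
    functionality: str                                           # e.g. "map viewer"
    apply_to: List[str] = field(default_factory=list)            # applyTo
    repository_uri: Optional[str] = None                         # isInRepository
    attached_documents: List[str] = field(default_factory=list)  # hasAttachedDocuments
    technical_support: Optional[str] = None                      # hasTechnicalSupport
    api_description: Optional[str] = None                        # hasAPI
    components: List["ARIADNEService"] = field(default_factory=list)  # hasComponents

    def has_api(self) -> bool:
        """An API counts as present only when a description is supplied."""
        return self.api_description is not None

    def all_components(self) -> List[str]:
        """Names of all components, including nested ones, depth-first."""
        names: List[str] = []
        for component in self.components:
            names.append(component.name)
            names.extend(component.all_components())
        return names


# Invented example values, for illustration only.
renderer = ARIADNEService(name="Tile renderer", functionality="map rendering")
portal = ARIADNEService(
    name="Example map portal",
    functionality="map viewer",
    apply_to=["example-gis-1"],
    api_description="Hypothetical REST interface for querying map layers",
    components=[renderer],
)
print(portal.has_api(), portal.all_components())
```

Tying `has_api` to the presence of a description mirrors the requirement that a description must be supplied whenever a service provides an API.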

4 Conclusions

The main goal of the ARIADNE project is “to integrate the existing archaeological research data infrastructures so that researchers can use the various distributed datasets and new and powerful technologies as an integral component of the archaeological research methodology”. In order to achieve this goal, it is necessary (i) to gather information about the existing data resources and services in the archaeological domain, and (ii) to implement advanced search functionalities on this information in order to support the discovery of resources that make good candidates for integration. As a necessary step towards the realization of the first objective, we have set out to design a data model for representing archaeological resources. In the interest of understandability, usability and interoperability, the data model is an extension of the DCAT W3C Recommendation and includes classes and properties from other well-known vocabularies, such as Dublin Core, DBpedia and FOAF. As a necessary step towards the realization of the second objective, we have implemented functionality for the persistence and the population of the Catalog.

Much work lies ahead. First, we aim at making the content of the ARIADNE Catalog available as Linked Data. Most importantly, we aim at making both the Catalog and the services built around it available to the rest of the archaeological community, by deploying both on the web. This is by far the most ambitious of our goals, and it requires a thorough effort for completing the on-going implementations, validating the results, documenting the services so that users outside ARIADNE can learn them with minimum effort, and strengthening the system so that it can sustain usage at a global level.

Acknowledgment. This work has been partially funded by the European Commission under ARIADNE, funded in the theme Research infrastructures (Grant agreement no: 313193).

⁵ http://dbpedia.org/ontology/Software

References

1. Castelli, D., Manghi, P., Thanos, C.: A Vision Towards Scientific Communication Infrastructures: On Bridging the Realms of Research Digital Libraries and Scientific Data Center. International Journal on Digital Libraries 13(3-4), 155–169 (2013)
2. Thanos, C., Manegold, S., Kersten, M.: Big Data – Introduction to the Special Theme. ERCIM News (89), 10–11 (2012)
3. ISO 11179 Part 1: Framework for the Specification and Standardization of Data Elements (2004)
4. Caplan, P.: Metadata Fundamentals for All Librarians. American Library Association (2003), ISBN 9780838908471
5. DCMI Registry, http://dublincore.org/dcregistry/
6. Jeong, D., Baik, D.K., Park, S.H.: A Practical Approach: Localization-Based Global Metadata Registry for Progressive Data Integration. J. Info. Know. Mgmt. 2, 391–401 (2003)
7. Heery, R., Gardner, T., Day, M., Patel, M.: DESIRE metadata registry framework, Deliverable 3.5. DESIRE II - Development of a European Service for Information on Research and Education II (1999), http://web.archive.org/web/20080513183558/ (retrieved)
8. Goedertier, S.: DCAT application profile for data portals in Europe (2013), https://joinup.ec.europa.eu/asset/dcat_application_profile/ (retrieved)