SEAD[1]: An Integrated Infrastructure to Support Data Stewardship in Sustainability Science

Margaret Hedstrom, School of Information, University of Michigan
George Alter, Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
Inna Kouper, Data to Insight Center, Pervasive Technology Institute, Indiana University
Praveen Kumar, Civil and Environmental Engineering, University of Illinois at Urbana-Champaign
Robert H. McDonald, IU Libraries, Data to Insight Center, Pervasive Technology Institute, Indiana University
James Myers, School of Science, Rensselaer Polytechnic Institute
Beth Plale, School of Informatics and Computing, Indiana University

I. Introduction

Major research universities are grappling with how to respond to the deluge of scientific data emerging from their faculty’s research. This is critically important in light of coming changes around open access to federally funded research data.[2] Many research institutions are looking to their libraries, IT organizations, and research administration organizations to find solutions to this problem. One starting point can be found within current library-based institutional repositories. While using university resources for data storage may seem beneficial to the scientific community, the library, and the university, the general consensus is that current institutional repository implementations are not designed for storing, publishing, and providing access to research data. Often, data deposits in these repositories require manual workflows and human intervention. This type of ingest can introduce human errors that undermine long-term data curation.

Solutions for managing research data need to follow a solid data curation workflow as defined by the CCSDS OAIS Reference Model[3] and must satisfy several key requirements. First, the system must support data processes at every stage of the data life cycle, e.g., data collection, analysis, re-use, and long-term preservation. Second, depositing data must be quick and minimally intrusive on a scientist’s time. Third, data storage and ingest must be flexible enough to handle various kinds of data, i.e., collections of varied sizes, formats, and composition. Finally, tools for accessing and using data need to be consistent with the tools and processes of the scientific community. We address these requirements with a focus on sustainability science in a recently funded National Science Foundation DataNet project, Sustainable Environments – Actionable Data (SEAD).[4]
The project proposes policies and architecture to address the needs of sustainability science researchers, who study the physical, biochemical, and social interactions that affect our planet. Researchers in sustainability science come from many disciplinary communities such as hydrology, ecology, and sociology. Those communities have their own standards for data collection, description, and dissemination, yet the data must be integrated to support sustainability research now and in the future.

[1] SEAD is funded under NSF Cooperative Agreement #OCI0940824.
[2] White House OSTP Memo on Open Access for Heads of Executive Departments and Agencies - http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
[3] CCSDS OAIS Reference Model for an Open Archival Information System - http://public.ccsds.org/publications/archive/650x0m2.pdf
[4] http://sead-data.net/


SEAD addresses both the immediate needs of researchers and the long-term goals of data preservation and reuse. It does so by integrating data services into a generalizable “active and social curation” infrastructure and by developing capabilities for packaging data sets and migrating them to a federated repository infrastructure based on multiple institutional repositories (IRs). The SEAD strategy addresses the goals of improving the quality, relevance, and usefulness of data and reducing the cost of data management and preservation in the following ways:

- Moving data curation tasks earlier in the data life cycle, toward the beginning of research projects.
- Engaging domain scientists in setting priorities for the evolution of data and services.
- Providing mechanisms for facilitating and improving data discoverability, such as automatic metadata and provenance capture in diverse forms and community annotation of data.
- Enhancing long-term curation processes to leverage rich metadata and crowd-sourced social networking efforts in data curation.

We realize this strategy through a three-component architecture, shown in Figure 1 and described in the sections below. In combination, the components provide capabilities for discovering reference data, building data collections during active research, moving new data through formal curation processes to create new reference collections, and exploring the direct and data-mediated social structure of the community.

Figure 1. SEAD End-to-End Prototype.

II. SEAD Active Content Repository (ACR)

Sustainability researchers face significant challenges in managing the broad range and rapidly growing volume of data that characterize the coupled human and natural systems they study. The SEAD ACR responds to these researchers’ need for active, metadata-aware spaces for managing data during research: spaces that provide more capability for organizing data than file systems while retaining the flexibility to incorporate new data and evolve descriptive information as research progresses.

The SEAD ACR leverages a semantic content repository and rich ‘Flickr-style’ interfaces to provide data producers and consumers with shared project knowledge spaces.[5] Using an ACR as part of the research workflow, researchers can immediately upload and share arbitrary files with their project teams and incrementally add tags and structured metadata as they desire. For common file types, the ACR generates data views (e.g., zoomable images, playable videos), extracts and displays embedded metadata (e.g., EXIF information embedded in photos), and generates indexes for search. Researchers can also organize data in hierarchical collections, add provenance, and leave notes (threaded discussion). Support for new data formats and metadata vocabularies, and automated import/export to other tools, can be added through well-defined web service and plug-in interfaces. In essence, the ACR provides a private project space that delivers rich data services incrementally as information is provided: it rewards efforts to use standard formats and provide metadata with immediate feedback in the form of services that make it easier to manage data during a research project.

By using semantic web technologies, specifically RDF and the notion of URI-style global identifiers for data, vocabulary terms, and other objects of interest, an ACR-based approach can also provide value to data producers and consumers by linking to additional data/metadata managed in other systems. For example, the ACR and SEAD’s VIVO repository of author and publication information (described below) interact: as a researcher enters author (Dublin Core: creator) information for a given data set within the ACR, a query to VIVO provides type-completion capabilities, and the resulting entry, rather than being a string representing the author’s name, is a live link to their interests, affiliations, project, and publication information in VIVO. Analogous functionality allows links between data in the ACR and publications in VIVO. Beyond letting data producers and consumers use people/publication links directly to find new data sets, the links can also be exposed for further use, e.g., in generating network graphs of the community that show data-mediated interactions along with co-author and citation connections.
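The creator-linking behavior described above can be illustrated with a minimal Python sketch. It is not the ACR or VIVO implementation: the in-memory `VIVO_PEOPLE` lookup, the example URIs, and the function names are hypothetical stand-ins for a real query to VIVO. The only point it makes is that the stored value of the Dublin Core creator property is a URI that can be followed, rather than a plain name string.

```python
from typing import Dict, List, NamedTuple, Tuple

# Dublin Core "creator" term (a real vocabulary URI).
DC_CREATOR = "http://purl.org/dc/terms/creator"


class Triple(NamedTuple):
    """A single RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical stand-in for a VIVO people index: display name -> profile URI.
VIVO_PEOPLE: Dict[str, str] = {
    "Ada Example": "http://vivo.example.org/individual/n1001",
    "Adam Sample": "http://vivo.example.org/individual/n1002",
    "Bea Placeholder": "http://vivo.example.org/individual/n1003",
}


def complete_creator(prefix: str) -> List[Tuple[str, str]]:
    """Return (name, uri) pairs whose name starts with the typed prefix."""
    needle = prefix.casefold()
    return sorted(
        (name, uri)
        for name, uri in VIVO_PEOPLE.items()
        if name.casefold().startswith(needle)
    )


def creator_triple(dataset_uri: str, person_uri: str) -> Triple:
    """Record the creator as a link to a profile URI, not as a name string."""
    return Triple(dataset_uri, DC_CREATOR, person_uri)


# The researcher types "ad" and picks the first suggestion.
suggestions = complete_creator("ad")
chosen_name, chosen_uri = suggestions[0]
triple = creator_triple("http://acr.example.org/dataset/42", chosen_uri)
print(chosen_name, "->", triple)
```

Because the object of the triple is a URI, any tool that later reads the statement can resolve it to the person's profile, which is what makes the network graphs mentioned above possible.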

III. SEAD Virtual Archive (VA)

The SEAD Virtual Archive (SEAD VA) is a federation layer that sits over multiple institutional repositories and is based on archival software developed by the Johns Hopkins University Data Conservancy.[6] This layer offers the community of sustainability scientists a coherent view of their collective published data. Even if institutional repositories removed major obstacles to data submission and researchers began to submit their data, the view of the data would be fragmented: a researcher would have to search repositories one by one to find relevant data. The SEAD VA can provide a single view into the data for sustainability researchers. By leveraging inter-institutional cooperative agreements such as those developed by the Committee on Institutional Cooperation (CIC), a federation layer such as the SEAD VA can form the basis of a cross-disciplinary resource.

[5] Futrelle, J., Gaynor, J., Plutchak, J., Myers, J. D., McGrath, R. E., Bajcsy, P., Kastner, J., Kotwani, K., Lee, J. S., Marini, L., Kooper, R., McLaren, T. and Liu, Y. (2011), Semantic middleware for e-Science knowledge spaces. Concurrency and Computation: Practice and Experience, 23: 2107–2117. doi: 10.1002/cpe.1705
[6] http://dataconservancy.org/software/


Data arrive at the SEAD Virtual Archive from the ACR and VIVO. When datasets that were marked and selected for publication are uploaded, the SEAD VA checks dataset integrity and allows curators to assess whether minimal metadata requirements have been met. The VA makes sure that the data are ready to deposit via its matchmaking mechanism, a technical solution for automated deposit that reconciles the needs of institutional repositories with the needs of end users and the SEAD VA. DOIs are assigned using DataCite. The data object is indexed using Apache Solr to allow full-text search of the metadata. The data object is currently sent to the Indiana University ScholarWorks IR, the University of Illinois IDEALS repository, or an Amazon cloud repository.

As has been pointed out, sustainability science is broad, and the data are diverse. We are working with domain scientists to map this diversity into a set of flexible yet manageable categories. So far, we have been working with data sets that are diverse in their size, formats, and structure. Through exploration of existing data sets, we have identified the following canonical cases for data preservation and discovery in sustainability science that the SEAD VA must support: huge data collections (size >= 20 GB), heterogeneous collections, small complex databases, and temporal (time series) data. Much of the data in existing collections comes without proper metadata, and processing it requires manual labor. Use of an ACR is expected to significantly increase the amount of metadata available at the time of publication, but our workflow anticipates that format conversion, vocabulary mapping, and additional annotation may still be required to meet repository requirements.
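The pre-deposit steps described for the SEAD VA (integrity check, minimal-metadata check, and matchmaking) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the SEAD VA code: the required metadata fields, the use of SHA-256, the repository names `repo-a` and `repo-b`, and their size and format limits are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Assumed minimal metadata set; the paper does not enumerate the real one.
REQUIRED_FIELDS = ("title", "creator", "abstract")


@dataclass(frozen=True)
class Repository:
    """Hypothetical description of what a target repository accepts."""
    name: str
    max_bytes: int
    accepted_formats: FrozenSet[str]


@dataclass(frozen=True)
class Dataset:
    payload: bytes
    declared_sha256: str
    metadata: Dict[str, str]
    data_format: str


def integrity_ok(ds: Dataset) -> bool:
    """Recompute the checksum and compare it with the declared one."""
    return hashlib.sha256(ds.payload).hexdigest() == ds.declared_sha256


def missing_metadata(ds: Dataset) -> List[str]:
    """List required fields that are absent or empty, for curator review."""
    return [f for f in REQUIRED_FIELDS if not ds.metadata.get(f)]


def match_repositories(ds: Dataset, repos: List[Repository]) -> List[str]:
    """Return names of repositories whose constraints the dataset satisfies."""
    return [
        r.name
        for r in repos
        if len(ds.payload) <= r.max_bytes and ds.data_format in r.accepted_formats
    ]


payload = b"site,flow\nA,1.2\n"
dataset = Dataset(
    payload=payload,
    declared_sha256=hashlib.sha256(payload).hexdigest(),
    metadata={"title": "Stream flow sample", "creator": "Ada Example"},
    data_format="csv",
)
repositories = [
    Repository("repo-a", max_bytes=10, accepted_formats=frozenset({"csv"})),
    Repository("repo-b", max_bytes=1_000_000, accepted_formats=frozenset({"csv", "netcdf"})),
]

print("integrity ok:", integrity_ok(dataset))
print("missing metadata:", missing_metadata(dataset))
print("candidate repositories:", match_repositories(dataset, repositories))
```

Keeping the three checks as separate functions mirrors the description: integrity is decided automatically, missing metadata is reported to a curator rather than silently rejected, and matchmaking only narrows the set of candidate repositories.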
As a federated solution, the SEAD VA gives a unified view of a sustainability science data resource even though the data are drawn from multiple projects and stored in multiple institutional repositories. In addition to serving as a federated deposit service, the SEAD VA performs another crucial function as an aggregator of reference data. One could envision a repository supporting multiple lightweight federation services like the SEAD VA, each serving a particular scientific community. If each federation service exposes its outward interfaces via record-exposing protocols such as those developed by the Open Geospatial Consortium (OGC)[7] or DataONE[8], a network of federated services would become a scientific search portal with a rich discovery interface. Such a portal would track and index information from institutional repositories and expand their services, providing more breadth and accuracy at lower cost to the libraries.
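To make the idea of a unified view concrete, the following Python sketch searches several repository record lists at once and merges the hits. It is only a toy model: the record structure, the repository names, and the DOIs are invented for illustration, and it does not use the OGC or DataONE protocols mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    """Hypothetical metadata record exposed by a repository."""
    doi: str
    title: str
    repository: str


def federated_search(term: str, repositories: Dict[str, List[Record]]) -> List[Record]:
    """Search every repository and return one de-duplicated result list.

    A record is kept once per DOI, so a dataset mirrored in two
    repositories appears a single time in the unified view.
    """
    needle = term.casefold()
    by_doi: Dict[str, Record] = {}
    for records in repositories.values():
        for record in records:
            if needle in record.title.casefold() and record.doi not in by_doi:
                by_doi[record.doi] = record
    return sorted(by_doi.values(), key=lambda r: r.doi)


# Invented holdings for two repositories; the DOIs are placeholders.
holdings = {
    "repo-a": [
        Record("10.9999/example.1", "River delta sediment cores", "repo-a"),
        Record("10.9999/example.2", "Household water use survey", "repo-a"),
    ],
    "repo-b": [
        Record("10.9999/example.1", "River delta sediment cores", "repo-b"),
        Record("10.9999/example.3", "Sediment transport time series", "repo-b"),
    ],
}

for hit in federated_search("sediment", holdings):
    print(hit.doi, hit.title, "(" + hit.repository + ")")
```

A researcher issues one query and receives results from both repositories, instead of searching each repository one by one.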

[7] OGC Catalog Standard - http://www.opengeospatial.org/standards/specifications/catalog
[8] DataONE Investigator Toolkit - http://www.dataone.org/investigator-toolkit
[9] VIVO Researcher Profile Software Project - http://vivoweb.org

IV. SEAD Social Curation and Analytics based on VIVO

VIVO[9] is an open source application developed at Cornell University in 2003 and implemented at Indiana University in 2004. It is a semantic web application that uses the Resource Description Framework (RDF) to represent researcher profile information. Once installed, VIVO is populated with information about researchers, e.g., their names and affiliations, interests, activities, and accomplishments. VIVO has primarily been deployed at an institutional level, but it can also be deployed as a community resource, as in SEAD. VIVO instances can also be federated, leveraging the VIVO Ontology[10] and linked semantic data strategies.

To facilitate rich connections between data and people as well as powerful analytics, the SEAD VIVO repository has incorporated an extension to the VIVO ontology developed by the Australian National Data Service (ANDS). This extension, the VIVO-ANDS ontology, focuses on datasets in an attempt to let data be treated as a VIVO resource, like a person or a publication. The team from IU has enabled SEAD VIVO to ingest published research data citations created using the DataCite DOI API[11] for use with the SEAD data curation infrastructure. Once a data set is treated as a typical VIVO resource, all linking and analytical tools can be extended to the dataset. Currently, the SEAD VIVO prototype service contains information about all principal investigators associated with the NSF National Center for Earth Surface Dynamics (NCED)[12], including their publications and published data. VIVO not only provides a capability to browse these products but also generates analytical products such as co-authorship networks and, with SEAD’s extensions, offers the ability to browse from data citations to sites where data can be retrieved and the possibility of exploring data-mediated researcher interactions.
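The difference between a co-authorship network and data-mediated interactions can be shown with a small Python sketch. The definition used here is an assumption made for illustration: two researchers are linked through data when one created a dataset and the other authored a publication that cites it. The names and records are invented, and the code does not use VIVO or its ontology.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

# Invented example records.
publications: Dict[str, List[str]] = {
    "pub1": ["alice", "bob"],
    "pub2": ["alice", "carol"],
}
dataset_creators: Dict[str, List[str]] = {"ds1": ["bob"]}
dataset_citations: Dict[str, List[str]] = {"ds1": ["pub2"]}  # dataset -> citing publications


def coauthor_edges(pubs: Dict[str, List[str]]) -> Counter:
    """Count how often each pair of researchers shares a publication."""
    edges: Counter = Counter()
    for authors in pubs.values():
        for pair in combinations(sorted(set(authors)), 2):
            edges[pair] += 1
    return edges


def data_mediated_edges(
    creators: Dict[str, List[str]],
    citations: Dict[str, List[str]],
    pubs: Dict[str, List[str]],
) -> Counter:
    """Link each dataset creator to the authors of publications citing it."""
    edges: Counter = Counter()
    for dataset, citing_pubs in citations.items():
        for pub in citing_pubs:
            for producer in creators.get(dataset, []):
                for consumer in pubs.get(pub, []):
                    if producer != consumer:
                        pair: Edge = tuple(sorted((producer, consumer)))
                        edges[pair] += 1
    return edges


coauthors = coauthor_edges(publications)
via_data = data_mediated_edges(dataset_creators, dataset_citations, publications)

# Pairs connected only through data reuse, never through co-authorship.
data_only = sorted(set(via_data) - set(coauthors))
print("co-author pairs:", sorted(coauthors))
print("data-mediated pairs:", sorted(via_data))
print("visible only through data:", data_only)
```

In this toy data, bob and carol never publish together, yet carol's publication cites bob's dataset, so the pair appears only in the data-mediated network.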

V. Final Thoughts

The SEAD Prototype offers an innovative model for collaborative metadata capture and data curation that adds significant value for data producers and consumers while lowering the centralized costs incurred in maintaining and growing community reference data. SEAD provides immediate value to researchers for good curation practices, supports their needs during active research, links data products with the formal literature and social structure of the community, and makes the data and metadata generated by both primary data producers and data consumers visible to and usable by curators. In doing so, it significantly improves the overall return on investment for curation activities and, perhaps more importantly, provides more direct benefit to researchers who contribute information. Because the benefits of data curation in SEAD’s model accrue directly and immediately to researchers, we anticipate increased adoption relative to current methods as well as positive impacts on researcher productivity. SEAD has been built on open standards and open software components supported by multiple projects, with the explicit goal of achieving long-term community support and making it possible for third parties to integrate new functionality and to adopt and adapt it for new purposes within sustainability research and across the larger research community.

VI. Acknowledgements

The authors wish to thank the entire SEAD partnership for their help in establishing the SEAD Prototype and creating the functionality discussed in this paper.

[10] VIVO Ontology 1.5 - https://wiki.duraspace.org/display/VIVO/VIVO+Ontology
[11] DataCite DOI API for Data Publishing - https://mds.datacite.org/static/apidoc
[12] Initial data loading for the SEAD Prototype utilized content from the National Center for Earth-Surface Dynamics - http://www.nced.umn.edu
