Published in: Plale, B., Kouper, I., Goodwell, A., & Suriarachchi, I. (2016). Trust Threads: Minimal Provenance for Data Publishing and Reuse. In Sugimoto, C., Ekbia, H., & Mattioli, M. (Eds.), Big Data Is Not a Monolith. MIT Press.

Trust Threads: Minimal Provenance for Data Publishing and Reuse

Beth Plale, Inna Kouper, Allison Goodwell, Isuru Suriarachchi

The world contains a vast amount of digital information, and it is growing ever more rapidly. This makes it possible to do many things on an unprecedented scale: spot social trends, prevent diseases, increase fresh water supplies, accelerate innovation, and so on (Bresnick 2015; Brynjolfsson and McAfee 2011; Neale 2014; Cuttone, Lehmann, and Larsen 2014). As essential a role as science and technology innovation plays in improving natural environments and human welfare, the growing sources of data promise to unlock ever more secrets. But the rapid growth of data also makes accountability and transparency of research increasingly difficult. Data that cannot be adequately described because of its volume or velocity (speed of arrival) is not usable except within the research lab that produced it. Data that is intentionally or unintentionally inaccessible, or difficult to access and verify, cannot contribute to new forms of research.

In this chapter we show how data can carry with it thin threads of information about its lineage, "Trust Threads", that connect the data to both its past and its future, forming a provenance record. In carrying this minimal provenance, through which the lineage of an object can be traced, the data inherently becomes more trustworthy. Having this "genealogy network" in place in a robust way, as data travel in and out of repositories and through tools, is a critical element of the successful sharing, use, and reuse of data in science and technology research in the future.

Digital data's disproportionately large impact on science and technology research is a relatively recent phenomenon because digital data (especially in large quantities) is itself relatively recent. Science has evolved over the last few hundred years to consist of four distinct but related methodologies that comprise four paradigms: empirical (observation- and experiment-based), theoretical, computational, and data-exploratory (Gray 2009). Early science was largely empirical, focused on observed and measured phenomena derived from actual experience; Darwin's On the Origin of Species is a good example of carefully recorded observation. Early science also incorporated theory and experiment: scientists used mathematical laws (e.g., Maxwell's equations and the laws of thermodynamics) to model physical phenomena, and experimentalists such as Boyle performed elaborate experiments to validate their hypotheses (Shapin 1989; Crease 2008). The last few decades saw the growth of computational science, in which mathematical laws are implemented in software as a model or simulation that abstracts (i.e., simplifies) a complex, real-world phenomenon, and does so methodically so that the simulation stays congruent with the real-world phenomenon. By utilizing abstractions, numerical methods, simulations, and the ever-increasing power of computers, computational science has given society more accurate weather forecasts that can predict as transient an event as a tornado, and has improved jet propulsion through detailed modeling of fluid flow.

Most recently, data exploration has emerged as a legitimate form of science: the fourth paradigm of science (Gray 2009). Data captured by observational instruments, sensors, cameras, tweets, or cash registers is analyzed using computational tools (software) in search of trends or anomalies (see also Andrejevic and Burdon, this volume). Biology, for example, has virtually turned into an information science: genomicists study patterns in DNA sequences to identify diseases or trace heritage (see also Contreras, this volume). As Gray (2009, p. xix) points out, "The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration."

Like the three research paradigms before it, the fourth paradigm promises unprecedented advances in our understanding of society and the planet on which we live. The fourth paradigm is a response to new data sources, the exponential growth of data (Berman 2008), and the increasing scope and complexity of scientific challenges. Technology has transformed society into one that is more connected and has more channels to stay informed (see also Alaimo and Kallinikos, this volume, for a critical analysis of social data), yet grand challenge problems remain in energy, clean and abundant water, sustainable food, and a healthy population. These large-scale problems require interdisciplinary teams of natural or physical scientists, technologists, physicians, social scientists, legal teams, and policy makers to solve (see also Sholler, Bailey, and Rennecker, this volume, for a discussion of big data and teamwork in healthcare). These interdisciplinary teams bring their own data and methodologies, but because they work across disciplinary boundaries, they are increasingly interested in using data from varied sources.

In this chapter we advance the trustworthiness of scientific digital data through a lightweight technological formalism for data provenance. We focus on a critical window in the lifecycle of a data object, from the point where it is made available for publishing to the point where it is reused. This window is unique because valuable ephemeral information about the data is known only during this period and, if not captured immediately, is lost forever. We posit that digital data are more trustworthy when more can be known about where a data object came from, what processes were applied to it, and who acted to operate on it. Data provenance addresses this lineage information: information about the entities, activities, and people who have effected some type of transformation on a data object through its life (Simmhan et al. 2005). Our proposed model draws on research over the last decade in data provenance, which has defined graph representations that capture the relationships between entities, activities, and people that contribute to a data object as it exists today and into the future.

The motivating goal for our work is to increase the reuse of scientific data; however, data that cannot be determined to be trustworthy is far less likely to be reused. Data reuse is the use of data collected for one scientific or scholarly purpose by a different researcher, often for a different purpose. The data may be reused to 1) verify the original result, 2) be combined with another data set to answer similar questions, or 3) answer a different research question, for instance, when weather data is repurposed for crop forecasting or water conservation.

We motivate the chapter with a use case from sustainability science, the study of dynamic interactions between nature and society. The science of sustainability requires integration of social science, natural science, and environmental data at multiple spatial and temporal scales: data that is rich in local and location-specific observations; referenced for regional, national, and global comparability and scale; and integrated to enable end users to detect interactions among multiple phenomena. Our perspective is drawn from a five-year project funded by the National Science Foundation called Sustainable Environments Actionable Data (SEAD). SEAD¹ is responsive to the expressed needs of sustainability science researchers for long-term management of heterogeneous data, developing new software tools that help a researcher:

• Cull the right subset of data to publish from amongst the thousands of files created and used during the course of a research investigation;

• Reduce the manual metadata markup burden on a researcher by engaging a data curation specialist early in the research process, and by creating easy-to-use tools that automatically extract metadata from the subset of data to publish;

• Publish to a repository of choice.

Sustainability science is the "long tail" of social and environmental science: data collections are often the property of a researcher's lab, and data sets of local, regional, or topical significance are of limited value until they can be integrated and referenced geospatially and temporally, combined with related data and observations, and modeled consistently. While most of the individual long-tail data sets are small compared to the commonly discussed big data sets, their changeability and heterogeneity pose the same challenges as other, "bigger" data. From a process-oriented perspective on big data (Ekbia et al. 2015), sustainability science research data is equally challenging to move while maintaining integrity and trustworthiness, and therefore requires new tools and techniques for processing and preservation.

¹ http://sead-data.net/

Research Compendia: The Case of Sustainability

A researcher who studies complex physical, environmental, or social phenomena will utilize data from multiple sources because complex phenomena frequently lack a single source of data from which the researcher can glean all the needed information. It is not uncommon for a single environmental study to use physical samples, model runs, observational data, data created within the study, and data drawn from other sources, all of which contribute to a full understanding of the natural, physical, and social phenomena under investigation. From this pile of data that is analyzed, combined, and synthesized during the regular course of research, a researcher must draw out and select the images, model results, metadata, and data files that best support a research result appearing in a published paper, or that best form a cohesive dataset to share more broadly. We call the aggregated data (the data pile) that is pulled together for purposes of sharing with a wider audience a "Research Object", and the act of making it available the "publishing" of the research object.

The Research Object is first conceived at the point in the research lifecycle when a researcher is ready to share her data. At that time she makes decisions about what data—from amongst all the data sources she used, consulted, and created—she should package to make public. Should she publish the data needed to reproduce the images in the publication, as required by the journal publisher, or would her professional impact be greater if she were to publish all the data needed to rerun the simulations? What are the criteria by which she even approaches such decisions? These questions are at the heart of the matter as the researcher decides whether it is sufficient to publish enough data to support the narrow conclusions of a particular journal article, or whether she should follow a broader mandate to make the results of science more widely accessible and usable (e.g., provide the data needed so that future researchers can replicate and expand on her results).

There are sizeable implications in her answers to these questions. Amongst them, and by no means the most important, is one of size: data sufficient to reproduce the figures in a journal article is a small fraction of the size of the data needed to verify the conclusions of a study. Stodden (2014) further contextualizes the issue of size by emphasizing the role of software: when computers are involved in the research process, scientific publication must shift from the scientific article alone to the triple of the scientific paper and the software and data from which the findings were generated. This triple has been referred to as a "research compendium", and its aim is to transmit research findings that others in the field will be able to reproduce by running the software on the data. In this chapter we focus on data curation and provenance in the context of publishing and reuse of data as central to accountability and transparency in science, omitting a discussion of the software needed to reproduce results, as the issues are somewhat separate. Illustrated by means of a detailed use case from sustainability science, we identify a formalism and show how the formalism, simple as it is, when put into practice, can advance data trustworthiness. While the use case discusses publishing data in support of a single journal article, our formalism applies to all data published for consumption outside the research lab in which it was generated.

Use Case: The Flooding of the Mississippi

In May 2011, the Mississippi River, the chief river of the largest drainage system in North America, was at historic flood levels due to record rainfall and snowmelt, threatening great damage, injury, and loss of life and property as the rushing river gained momentum from large tributaries south of the Wisconsin/Minnesota border. Following the disastrous flood of 1927, federal legislation authorized several flood-control structures. One of these is the Birds Point-New Madrid (BPNM) Floodway, a leveed agricultural floodplain at the confluence of the Mississippi and Ohio Rivers near the city of Cairo, Illinois. The policy allows the 130,000-acre floodway to be intentionally flooded during extreme events through a series of levee breaches. In May 2011 the U.S. Army Corps of Engineers used blasting agents to create several artificial breaches in the BPNM levee. The impact was dramatic: parts of the agricultural floodplain were inundated for over a month.

Large floods and emergency responses to them, such as the intentional breach of the Birds Point-New Madrid Floodway, create highly variable spatial patterns of water flow and floodplain erosion and deposition. These localized areas, or "hotspots", of change expose underlying landscape vulnerabilities. This unique event is an opportunity for researchers to study and assess the causes of change and vulnerability in ways that can guide future actions.

To examine floodplain vulnerability, a group of researchers obtained high-resolution elevation map data (LiDAR data) for this region from 2005 (before) and 2011 (after), the most recent pre-flood and post-flood data. To analyze the change, they created a model showing change in elevation by subtracting the 2005 landscape elevations from the 2011 elevations. The researchers used a 2D hydraulic model, HydroSED 2D, validated with sensor data from the U.S. Geological Survey (USGS), to simulate the flow of water through the floodway (Goodwell et al. 2011). They identified woody vegetation, and obtained soil and vegetation properties from the Natural Resources Conservation Service (NRCS). They combined these datasets to identify the regions most vulnerable to erosion and deposition, and compared them with the landscape impacts observed in the LiDAR data. The study required working with images, numeric sensor data, model results, and computed values. A journal article describing the study was published in Environmental Science & Technology (Goodwell et al. 2014).

The first two authors of this chapter assisted the first author of that publication (also the third author of this chapter) in publishing their data in support of the study. The second author participated as the digital curation specialist in the effort. The researchers made the decision to publish ten georeferenced image files (GeoTIFF files) as the publishable data object. The selected set supports reproducibility of all the images in the publication. The sustainability researchers decided to limit their data publication to those ten files because some of the data they used was proprietary and they did not have permission to share it, or the data was already publicly available (such as the soil data from NRCS or the USGS sensor data). The size of the resulting dataset was another concern that limited the data sharing options.

The files selected for publishing resulted in a sizeable 3.74 GB bundle, called the "BPNM object" henceforth; the individual files range from less than a megabyte to nearly three gigabytes. Specifically, half of the set of ten files are from the 2D hydrological model run under different conditions for points in the floodplain; these files varied from 1.47 MB at the smallest to 6.59 MB at the largest. A soil map file giving the erodibility of the soil and its loss tolerance over a spatial extent is 7.34 MB in size. Airborne imaging data used to detect woody vegetation on the ground is 919 KB in size. The comparison of the 2005 and 2011 land elevation data at 1.5 m, 3 m, and 10 m resolution ranges from 72 MB to 2.84 GB in size.

As the data was intended to be deposited using the SEAD framework into the repository at the University of Illinois at Urbana-Champaign, the researchers and the curator realized that the repository, which was configured to accept scholarly papers rather than datasets, would not be able to handle so large an object. The repository software required a special manual reconfiguration to handle an object of that size. Had the authors chosen to publish the raw data in addition to the GeoTIFF files needed to reproduce the images in the publication, the publishable data bundle would have been on the order of one hundred times larger than 3.74 GB.

The Research Object and Its Role in Data Reuse

Technology can contribute to the strengthened trustworthiness of a complex bundle of data through data provenance. To show how, we focus on the window in the life of research data that begins when a researcher is ready to package and distribute research data more broadly (e.g., make it public) and culminates in access by an unrelated party for scientific reuse. We call this time window the "publish-reuse lifecycle" window. This window is crucial because it is during this time that the most that can ever be known about the data bundle is available: metadata is highly ephemeral (Gray 2009) and, if not captured early in the data's life, is lost forever. The publish-reuse lifecycle window is thus a critical time window in which to harvest ephemeral information (metadata) about data that is about to be published. Our goal is to introduce into this critical time window a new, simple, technology-oriented provenance (or lineage) formalism that takes advantage of the metadata harvesting opportunity.

Several important questions have to be answered when data first enters the publish-reuse lifecycle window. The researcher has to decide which data to cull from the larger data pile for publishing, and how much of a curator's help she wants with metadata and consistency checks. The designers of the software system that supports data publishing have to convince themselves and their community of users of the acceptability, limits, and mechanisms of versioning, and of the limits of trust and trustworthiness that can be had from a technical solution.

To address these questions in a consistent and formal manner, we conceptualize the publishable data products of research as bundles of heterogeneous but coherent content—a self-contained unit of knowledge. Drawing on our own earlier work (Plale et al. 2011) and heavily inspired by the work of De Roure and Goble (Bechhofer et al. 2013, De Roure et al. 2009), we refer to a bundle of publishable products of research as the "Research Object". The Research Object (RO) is an aggregation of resources that can be transferred, produced, and consumed by common services across organizational boundaries. The RO encapsulates digital knowledge and is a vehicle for sharing and discovering reusable research. ROs contain data objects (files), references to data objects, collections of data objects, metadata, published papers, and so on.

Building and expanding on the workflow-centric and impact-centric approaches to research objects (see, for example, Hettne et al. 2014 or Piwowar 2013), we conceptualize the Research Object as having five components, as shown in Figure 1: a unique ID, a persistent identifier that is never reassigned; agents, information about the people who have touched the object in important ways (e.g., creator, curator, and so on); states, which describe ROs in time and are discussed in more detail below; relationships, which capture links between entities within the object, such as data sets, presentations, or images, and to other ROs; and finally the content: the data and related documents. While we do not specifically include the software that produced the data results, including it would be a minor extension under our model of the RO as a "research compendium" (Stodden 2014).

Figure 1. A Research Object as a bundle of digital content, transferred and consumed via common standards and services
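To make the five components of Figure 1 concrete, the following is a minimal sketch, in Python, of the structure just described. It illustrates the model only and is not the SEAD implementation; all class and field names are our own.

```python
# A minimal sketch (not the SEAD implementation) of the five-component
# Research Object. All class and field names are illustrative.
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Dict, List, Tuple


class State(Enum):
    LO = "Live Object"
    CO = "Curation Object"
    PO = "Publishable Object"


@dataclass
class Agent:
    name: str
    role: str            # e.g., "creator" or "curator"
    affiliation: str


@dataclass
class ResearchObject:
    unique_id: str                       # persistent identifier, never reassigned
    agents: List[Agent]                  # people who touched the object
    states: List[Tuple[State, date]]     # the RO's condition over time
    relationships: Dict[str, str]        # links within the RO and to other ROs
    content: List[str] = field(default_factory=list)  # data files and documents
```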

Examples of a Research Object are many: the supporting materials and results that are described in a single published paper; a written dissertation and its data; a newly created dataset that contains raw observations and the annotations that explain them; survey data that has been exported, coded, and aggregated into charts and tables; a visualization that is based on data created by others. Each of these RO examples has common and unique actions performed by human and software agents that comprise its "behavior". The RO concept provides a general organization of research product contents and metadata so that they can be used by common standards and services of all kinds. Specifically, there is a structure into which all needed contents of an RO can be stored: files, IDs, documentation, and so on. The organizing data model of the Research Object can be served by such protocols as the OAI-ORE resource mapping protocol (Tarrant et al. 2009) or the Dataset Description model (W3C Consortium 2014). A description of the parts of the RO, with example data taken from our BPNM use case, is shown in Table 1.

Table 1. Components of the Research Object

COMPONENT: EXAMPLE FROM BPNM RO

UNIQUE ID: https://seadtest.ideals.illinois.edu/handle/123456789/3349

AGENTS: Names, affiliations, and contact information of data creators and curator

STATES: LO, CO, PO, and dates of creation and transformation

CONTENT: Original LiDAR model; Corrected LiDAR model 1; Corrected LiDAR model 2; AVIRIS data; Hydro-simulation overall; Hydro-simulation, Location 1, Condition 1; Hydro-simulation, Location 2, Condition 1; Hydro-simulation, Location 1, Condition 2; Hydro-simulation, Location 2, Condition 2; Soil survey / computed data; Descriptive/annotative information (abstract, spatial and temporal metadata)

RELATIONSHIPS (INTERNAL TO RO): Published to; HasKeywords; HasSources (NASA, USGS); IsReferencedBy (DOI 10.1021/es404760t)
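Continuing the sketch above, the entries of Table 1 map directly onto that structure. In the hypothetical instantiation below, the agent names, affiliations, and dates are placeholders; the identifier, content items, sources, and DOI come from Table 1.

```python
from datetime import date

# Placeholder agents and dates; identifier, content, sources, and DOI are from Table 1.
bpnm = ResearchObject(
    unique_id="https://seadtest.ideals.illinois.edu/handle/123456789/3349",
    agents=[Agent("Data creator", "creator", "University of Illinois"),
            Agent("Data curator", "curator", "SEAD project")],
    states=[(State.LO, date(2013, 6, 1)),    # dates invented for illustration
            (State.CO, date(2013, 9, 1)),
            (State.PO, date(2014, 2, 1))],
    relationships={"HasSources": "NASA, USGS",
                   "IsReferencedBy": "DOI 10.1021/es404760t"},
    content=["Original LiDAR model", "Corrected LiDAR model 1",
             "Corrected LiDAR model 2", "AVIRIS data",
             "Hydro-simulation overall", "Soil survey / computed data"],
)
```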

Trust Threads Model

We define a simple model for the behavior of a research object (data bundle) as it passes through the publish-reuse lifecycle window; the model captures relationships between research objects as they are derived from one another, replicated, and so forth. The model is the basis upon which software is written that implements the model in a controlled and predictable manner. The model has two parts: 1) states, which define the condition of a data bundle as it passes through the publish-reuse window; and 2) relationships, which capture the relationship between two ROs. Figure 2 provides an overview of how an RO transitions through the publish-reuse window and how it relates to its derivatives once published. The states and relationships are drawn from two enumerated sets, as follows, where the relationships are a subset of the properties defined in PROV-O:

• States = {Live Object (LO), Curation Object (CO), Publishable Object (PO)}

• Core Relationships = {wasDerivedFrom, wasRevisionOf, alternateOf}

Over time, the Trust Threads model will result in a network of links between research objects, and between an RO and itself, creating a genealogy network for published scientific data. The Trust Threads model does not apply to the relationships an RO may contain within it, such as a file and its metadata, or files belonging to a collection.

Figure 2. Behavior diagram for publish-reuse lifecycle
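Because the core relationships are a subset of PROV-O, the genealogy links between research objects can be recorded directly as RDF triples. The sketch below assumes the rdflib Python library; the BPNM handle is real, but the derived object's URI is invented for illustration.

```python
# Sketch: recording a genealogy link as a PROV-O triple with rdflib.
from rdflib import Graph, Namespace, URIRef

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("prov", PROV)

bpnm_po = URIRef("https://seadtest.ideals.illinois.edu/handle/123456789/3349")
derived_po = URIRef("https://example.org/ro/derived-study")   # hypothetical URI

# One thread in the genealogy network: the new PO was derived from BPNM.
g.add((derived_po, PROV.wasDerivedFrom, bpnm_po))
print(g.serialize(format="turtle"))
```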

A Research Object passing through the publish-reuse lifecycle window exists in one of three states: as a Live Object (LO), as a Curation Object (CO), or as a Publishable Object (PO). Data management for large research teams is rarely a priority, so we say the data of a highly active team exist in a "wild west", where organization is loose. In our example of the Mississippi flood data, the researchers work in a single shared SEAD Project Space. The project space contains tens of folders, each dedicated to a particular subset of data, e.g., the raw LiDAR data for 2005 and 2011, images of the floodplain, raw and corrected spectrometer data, related publications, and so on. This project space constitutes the Live Object.

A researcher culls from the larger set of data the objects she wants to publish more broadly. She will prune and organize the material to publish into a new directory or set of directories, or mark specific files. From the Mississippi flood project space, the researcher culled a ten-file collection (the BPNM object). This culled content is the Curation Object (CO), an object related to its LO by the "wasCulledFrom" relationship. Where the Live Object is a wild west of loosely controlled activity, the Curation Object exists in a more controlled setting, a relative "boundary waters" between the wild west of active research and the controlled setting where an object is polished for publishing. In the CO state, changes to the contents of the object are still frequent, but one or two researchers alone assume the tasks of selecting, pruning, describing, and reorganizing the object, and frequent untracked changes by others become unwelcome. Additionally, during the culling process, the researcher will engage a digital curator to examine the structure, makeup, and metadata of the research object. The digital curator, in consultation with the researcher, will enhance the content to make it more useful.

Once the researcher and the digital curator agree that the content and descriptions of the research product are ready, the researcher signals the intent to publish, whereupon the Research Object moves from its state as a Curation Object to a new state as a Publishable Object (PO). The PO is related to the CO by a "wasPublishedFrom" relation. The Publishable Object exists in a "control zone": in this state, all actions on a PO are carefully tracked, as they form the past and future lineage of a family of research objects. That is, every action on a PO results in a new instance of a PO, and the relationship established between the PO that is acted upon and the newer PO captures the type of action that occurred.

The types of changes to a PO are several. The relationship "alternateOf" exists when a duplicate of the PO is created. The relationship "wasRevisionOf" exists between two ROs when an RO undergoes a revision that does not change the researcher's intent with the object; that is, when the RO is determined to be incomplete or incorrect in some way (e.g., revisions of metadata or corrections of errors and omissions), a new version is created that is related to the earlier version through "wasRevisionOf".

For example, if the authors in the Mississippi flooding case wish to replace one hydrology model run with a newer run because the existing GeoTIFF file contained errors, the change is considered a revision. If instead the researcher decides to publish raw hydrology model results in addition to the already published GeoTIFF files, this change is too substantial a change to the existing PO, and so constitutes a new published object.

The relationship "wasDerivedFrom" is established between two POs if the latter contains a portion of the former, either directly or by reference. Suppose a biologist is carrying out an entirely different study of the post-flood regrowth of biomass in the Lower Mississippi floodplain after the 2011 flood. She would locate the existing BPNM PO and find that it contains 50 x 50 meter resolution spectral (AVIRIS) images. Such images can be used to measure trends in the abundance of vegetation and its components. The researcher downloads the images, combines them with field data, and identifies trends in post-flood biomass development in the area. The new PO published by the biodiversity researcher will have a "wasDerivedFrom" relationship to the BPNM RO.

Reuse comes about when a researcher searches for and finds a published PO in the repository to which it was deposited, and pulls a copy of it into her own research environment, where it then becomes part of another researcher's Live Object, wild west space.
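Continuing the earlier Python sketch, the behavior just described can be written as a small state machine: culling and publishing each mint a new RO linked back to its predecessor, and any action on a PO in the control zone mints a new PO. The function names, id minting, and error handling are our own illustrative choices; only the state and relationship names come from the model.

```python
from datetime import date

TRANSITIONS = {
    (State.LO, State.CO): "wasCulledFrom",     # culling the data pile
    (State.CO, State.PO): "wasPublishedFrom",  # signaling intent to publish
}

PO_ACTIONS = {"wasDerivedFrom", "wasRevisionOf", "alternateOf"}


def transition(ro: ResearchObject, new_state: State, new_id: str) -> ResearchObject:
    """Culling or publishing yields a new RO linked back to its predecessor."""
    relation = TRANSITIONS.get((ro.states[-1][0], new_state))
    if relation is None:
        raise ValueError(f"illegal transition from {ro.states[-1][0]}")
    return ResearchObject(
        unique_id=new_id,
        agents=list(ro.agents),
        states=ro.states + [(new_state, date.today())],
        relationships={relation: ro.unique_id},
        content=list(ro.content),
    )


def act_on_po(po: ResearchObject, relation: str, new_id: str) -> ResearchObject:
    """In the control zone, any action on a PO mints a new, linked PO."""
    if po.states[-1][0] is not State.PO or relation not in PO_ACTIONS:
        raise ValueError("only the three core relationships apply to a PO")
    return ResearchObject(new_id, list(po.agents),
                          [(State.PO, date.today())],
                          {relation: po.unique_id}, list(po.content))


# The biologist's reuse, using the bpnm object from the Table 1 sketch:
# her new PO records that it wasDerivedFrom the published BPNM object.
biomass_po = act_on_po(bpnm, "wasDerivedFrom",
                       "https://example.org/ro/biomass-regrowth")  # hypothetical id
```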

Trust Threads and Trustworthiness

Trust Threads is a simple model for enhancing the trustworthiness of bundled and published data. It guarantees that a data bundle will carry useful data provenance with it, and that the provenance record will not change or be modified except under controlled circumstances. Trust Threads focuses on the capture and representation of the data provenance of data product bundles (the Research Objects) as they exist in the critical publish-reuse lifecycle window, a window in time that begins with the conceptualization of what goes into the research object, follows through to its curation and publication into a repository, and ends at the object's subsequent use by researchers outside the originating scientific domain in which the data object was created. This time window is particularly crucial because it is in this window that important connections between research objects are made that, if not captured, are likely lost forever.

Trust Threads are implemented as small bits of information contained in a virtual "suitcase". The suitcase identifies the RO to which it belongs and the lineage of other ROs from which it was derived, revised, duplicated, and so on. The suitcase is locked, so that its contents can be trusted; its contents are changed only by authoritative sources, which could include other trusted repositories, for instance.
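To illustrate one way the suitcase's lock might work (a sketch of a possible mechanism, not a description of the SEAD implementation), the code below seals a small lineage record with an HMAC whose key only the authoritative repository holds; any change to the sealed record invalidates the seal.

```python
# Sketch: a tamper-evident "suitcase" sealed with a repository-held HMAC key.
import hashlib
import hmac
import json

REPOSITORY_KEY = b"held-only-by-the-trusted-repository"   # illustrative key


def seal_suitcase(record: dict) -> dict:
    """Serialize the provenance record deterministically and seal it."""
    payload = json.dumps(record, sort_keys=True).encode()
    seal = hmac.new(REPOSITORY_KEY, payload, hashlib.sha256).hexdigest()
    return {"record": record, "seal": seal}


def verify_suitcase(suitcase: dict) -> bool:
    """Recompute the seal; any edit to the record makes this return False."""
    payload = json.dumps(suitcase["record"], sort_keys=True).encode()
    expected = hmac.new(REPOSITORY_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, suitcase["seal"])


suitcase = seal_suitcase({
    "ro": "https://seadtest.ideals.illinois.edu/handle/123456789/3349",
    "lineage": [{"wasPublishedFrom": "hdl:curation-object"}],   # illustrative
})
assert verify_suitcase(suitcase)
```

A real deployment would more likely use asymmetric signatures, so that any reader could verify the seal without holding the repository's private key.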

It is the reuse of data outside its originating scientific discipline that motivates the need for Trust Threads. Data reuse outside a scientific discipline is particularly complicated because the competence, reputation, and authority of data creators are more difficult for researchers outside the field to evaluate. Yet the increasing availability of data and the growing complexity of the challenges facing society and the environment will require that researchers be able to gauge a data set's trustworthiness before they will use it. According to Donaldson and Conway (2015), the determination of the trustworthiness of an entity has a strong user-perception component. This user perception has four properties: accuracy of information; objectivity of content; validity, which includes the use of accepted practices and the verifiability of data; and stability, defined as persistence of information. Trust Threads enables developers to implement the technical "wiring" needed to capture provenance information and deposit the provenance into a locked "suitcase" so that the data's accuracy, validity, and stability can be more easily determined. Thus Trust Threads contributes to three of the four properties identified by Donaldson and Conway. For instance, if a derived subset of a published data bundle lacks objectivity, a researcher can examine the suitcase contents to trace it back to its parent. Accuracy of information is related to its perceived quality; a framework of properties for data quality (Wang and Strong 1996) includes completeness, accuracy, relevancy, reliability, accessibility, and interpretability as user-perceived properties of data quality.

With the recent explosive growth in the amount and variety of research data, and inexpensive access to large-scale compute resources, science is on the cusp of new discoveries in areas that require interdisciplinary teams and data that must be repurposed. Trust Threads can accelerate the sharing of research data across research disciplines (data reuse) through a small technological formalism that, when adopted broadly, can advance the trustworthiness, and hence the reuse, of published data.

Acknowledgements

We thank Praveen Kumar of the University of Illinois for allowing us to study his research and publishing process. We thank Margaret Hedstrom, Sandy Payette, and Jim Myers, all of the University of Michigan, for stimulating discussions about data sharing and preservation in the SEAD project. SEAD is funded by the National Science Foundation under award 0940824.

Cited References

Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, et al. "Why linked data is not enough for scientists." Future Generation Computer Systems 29, no. 2 (2013): 599-611.

Berman, Francine. "Got data? A guide to data preservation in the information age." Communications of the ACM 51, no. 12 (2008): 50-56.

Bresnick, Jennifer. "Four use cases for healthcare predictive analytics, big data." HealthITAnalytics (April 21, 2015). Available at http://healthitanalytics.com/news/four-use-cases-for-healthcare-predictive-analytics-big-data

Brynjolfsson, Erik, and Andrew McAfee. "The Big Data Boom Is the Innovation Story of Our Time." The Atlantic (2011). Available at http://www.theatlantic.com/business/archive/2011/11/the-big-data-boom-is-the-innovation-story-of-our-time/248215/

Crease, Robert P. The Great Equations: Breakthroughs in Science from Pythagoras to Heisenberg. W.W. Norton & Company (2008).

Cuttone, Andrea, Sune Lehmann, and Jakob Eg Larsen. "Inferring human mobility from sparse low accuracy mobile sensing data." In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication (UbiComp '14 Adjunct) (2014). doi:10.1145/2638728.2641283

De Roure, David, Carole Goble, and Robert Stevens. "The design and realisation of the myExperiment Virtual Research Environment for social sharing of workflows." Future Generation Computer Systems 25 (2009): 561-567.

Donaldson, Devan Ray, and Paul Conway. "User conceptions of trustworthiness for digital archival documents." Journal of the Association for Information Science and Technology (2015).

Ekbia, Hamid, Michael Mattioli, Inna Kouper, Gary Arave, Ali Ghazinejad, Timothy Bowman, Venkata Ratandeep Suri, Andrew Tsou, Scott Weingart, and Cassidy R. Sugimoto. "Big data, bigger dilemmas: A critical review." Journal of the Association for Information Science and Technology (2015).

Goodwell, Allison E., Zhenduo Zhu, Debsunder Dutta, Jonathan A. Greenberg, Praveen Kumar, Marcelo H. Garcia, Bruce L. Rhoads, et al. "Assessment of floodplain vulnerability during extreme Mississippi River flood 2011." Environmental Science & Technology 48, no. 5 (2014): 2619-2625.

Gray, Jim. "eScience: A Transformed Scientific Method." In Hey, T., Tansley, S., and Tolle, K. (Eds.), The Fourth Paradigm: Data-Intensive Scientific Discovery, pp. xix-xxxiii. Seattle: Microsoft Research (2009).

Hettne, Kristina M., Harish Dharuri, Jun Zhao, Katherine Wolstencroft, Khalid Belhajjame, Stian Soiland-Reyes, et al. "Structuring research methods and data with the research object model: Genomics workflows as a case study." Journal of Biomedical Semantics 5, no. 1 (2014). doi:10.1186/2041-1480-5-41

Neale, Christopher. "8 ways Big Data helps improve global water and food security." (October 22, 2014). Available at http://www.greenbiz.com/blog/2014/10/22/8-ways-big-data-helps-improve-global-water-and-food-security

Piwowar, Heather A. "Value all research products." Nature 493 (2013): 159. doi:10.1038/493159a

Plale, Beth, Bin Cao, Chathura Herath, and Yiming Sun. "Data provenance for preservation of digital geoscience data." Geological Society of America Special Papers 482 (2011): 125-137.

PROV-O: The PROV Ontology. W3C Recommendation, 30 April 2013. Timothy Lebo, Satya Sahoo, and Deborah McGuinness, eds. http://www.w3.org/TR/prov-o/

Shapin, Steven. "The invisible technician." American Scientist 77 (1989): 554-563. Available at http://dash.harvard.edu/handle/1/3425945

Simmhan, Yogesh L., Beth Plale, and Dennis Gannon. "A survey of data provenance in e-science." ACM SIGMOD Record 34, no. 3 (2005): 31-36.

Stodden, Victoria. "My input for the OSTP RFI on reproducibility." Victoria's Blog (September 28, 2014). Available at http://blog.stodden.net/2014/09/28/my-input-for-the-ostp-rfi-on-reproducibility/

Tarrant, David, Ben O'Steen, Tim Brody, Steve Hitchcock, Neil Jefferies, and Leslie Carr. "Using OAI-ORE to transform digital repositories into interoperable storage and services applications." Code4Lib Journal 6 (2009).

W3C Consortium. "Dataset Descriptions: HCLS Community Profile." (2014). Available at http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html

Wang, Richard Y., and Diane M. Strong. "Beyond accuracy: What data quality means to data consumers." Journal of Management Information Systems 12, no. 4 (1996): 5-33.
