Challenges on Preserving Scientific Data with Data Grids

José Barateiro
LNEC, Av. Brasil 101, 1700-066 Lisboa, Portugal
[email protected]

José Borbinha
INESC-ID, Rua Alves Redol 9, Apartado 13069, 1000-029 Lisboa, Portugal
[email protected]

Gonçalo Antunes
INESC-ID, Rua Alves Redol 9, Apartado 13069, 1000-029 Lisboa, Portugal
[email protected]

Filipe Freitas
INESC-ID, Rua Alves Redol 9, Apartado 13069, 1000-029 Lisboa, Portugal
[email protected]
ABSTRACT

The emerging context of e-Science imposes new scenarios and requirements for digital preservation. In particular, data must be stored reliably, for which redundancy is a key strategy. Managing redundancy, however, must take into account the potential failure of components. Considering that correlated failures can affect multiple components and potentially cause a complete loss of data, we propose an innovative solution to manage redundancy strategies in heterogeneous environments such as data grids. This solution comprises a simulator that can be used to evaluate redundancy strategies against preservation requirements. The simulator supports the process of designing the best architecture to be deployed and can later be used as an observer of the deployed system, supporting its monitoring and management.

Categories and Subject Descriptors

H.3.7 [Digital Libraries]: Collection; H.3.4 [Systems and Software]: Distributed systems.

General Terms

Algorithms, Measurement, Performance, Reliability.

Keywords

e-Science, Data Grids, Digital Preservation, Digital Libraries, Simulation.

1. INTRODUCTION

Artifacts stored on physical supports, such as printed materials or drawings carved in stone, can survive for centuries. These contents are testimonials of past generations and an important asset for the future. In contrast with physical information, digital materials require the continuous execution of management and preservation tasks in order to allow their future interpretation.

Usually, today's organizations make use of isolated information systems to produce, manage and exploit large amounts of heterogeneous data. When the information is managed by integrated information systems, the integration is based on processes defined to assure the interoperability¹ of a specific set of operations, without any guarantee that the integration of systems will be preserved in the future. Furthermore, associated documents, such as technical reports, may be produced and managed outside the information systems. Non-digitized documents can also contain valuable data and may likewise be associated with digital information handled by current information systems.

¹ As defined by the Institute of Electrical and Electronics Engineers (IEEE), interoperability is the ability of two or more systems or components to exchange information and to use the information that has been exchanged [10].

Digital preservation aims at ensuring interoperability in the time dimension (interoperating with the future), that is, guaranteeing that data or digital objects remain authentic and accessible to users over a long period of time, maintaining their value. Achieving this goal may require a large investment in infrastructure for data storage, management and maintenance.

Several communities, such as biology, medicine, engineering or physics, manage large amounts of scientific information. This information usually includes large datasets of structured data (e.g., data captured by sensors), physical or mathematical simulations, and highly specialized documents reporting the work and conclusions of researchers.

The above mentioned information can be represented in a wide range of formats (e.g., a researcher can use her own input and output formats to produce simulations) and involves many relations that are not expressed in the data model. Moreover, the collaborative environment of the scientific community, and the associated services and infrastructures, usually known as e-Science (or enhanced Science) [2], imposes requirements of interoperability and the respective data sharing. In a broad sense, e-Science concerns the set of techniques, services, personnel and organizations involved in collaborative and networked science. It includes technology, but also human social structures and new large-scale processes of making science. It also represents, at the same time, a need and an opportunity for a better integration between science and engineering processes. Thus, long-term preservation can be thought of as a required property for future science and engineering, assuring communication over time, so that information that is understood today can be transmitted to an unknown system in the future.
To achieve long-term digital preservation, digital objects must be stored reliably, preventing data loss. One potentially relevant strategy to achieve this goal is to combine redundant storage with heterogeneous components.

We focus particularly on a deployment scenario where the data is stored in a data grid or, eventually, in a federation of heterogeneous cooperating data grids, which from now on we will simply call a federation. These are highly relevant solutions for digital preservation, as they already store massive amounts of the data that must be preserved, such as in e-Science domains, and they provide a set of functionalities required by digital preservation systems (e.g., file management).

Redundancy is important to withstand failures of individual components in the federation environment, but it does not suffice when these components can fail in a correlated manner, e.g., due to a natural disaster or a worm outbreak that causes failures of multiple components with similar software configurations. Thus, it is necessary to minimize the likelihood of correlated failures among replicas of the same data items, which can be achieved by replicating the data in different geographic locations or by ensuring diversity in the components and software running on each node. However, designing and maintaining a system with a high level of redundancy and heterogeneity can be a complex and error-prone task, which motivates our work.

This paper proposes a solution to support the decision on an adequate replication strategy and supporting infrastructure, comprising a simulator, a redundancy manager to deploy simulated scenarios in real data grids, and an introspection mechanism to refine simulated scenarios based on the behavior of the real system.

The core of this work has been done in the scope of the research project GRITO², and the results will be further exploited in the international project SHAMAN³.

The remainder of this paper is organized as follows. First, Section 2 motivates digital preservation in e-Science, presenting a real scenario of dam safety control. Section 3 provides an analysis of the problem. In Section 4, we describe our solution for this problem. In Section 5, we present some results of simulations. Finally, in Section 6 we list our conclusions and future work.

2. MOTIVATION

This section illustrates a real e-Science scenario where scientific data concerning dam safety must be preserved. This is one of the scenarios motivating our work (especially in the scope of the project SHAMAN). According to the Portuguese Dam Safety Legislation [7], the National Laboratory of Civil Engineering (LNEC) is responsible for keeping an electronic archive of data concerning dam safety and for maintaining up-to-date knowledge about the behavior of dams. Thus, the preservation of this data is not only an option but a legal obligation.

The behavior of dams is continuously monitored by hundreds of instruments (e.g., plumblines) installed at strategic points of the dam structure. Typically, a concrete dam is monitored by hundreds to a few thousands of instruments or sensors.

Raw data, usually known as readings, is manually collected by human operators or automatically collected by sensors (automatic monitoring systems), and transformed into engineering quantities by specific algorithms. The dam safety archive includes a relational database to store, essentially, instrument properties, readings and engineering quantities. Automatic monitoring systems collect and ingest data into the database using a SOA architecture. Legacy files (binary and ASCII) comprise the old archive of readings and engineering quantities. The archive also includes CAD files of the dam project and documents, such as photographs and movies, which are usually captured in periodical inspections to catalog potential anomalies in the dam structure.

Mathematical simulations are also crucial in dam safety. A single simulation consumes a set of input files (e.g., geometry files, data files) and produces a set of tabular and graphical files, with specific formats, representing the estimated behavior of the dam. Physical tests are also performed on scaled models, evaluating a specific set of actions. The results of physical tests and mathematical simulations can then be compared with the real behavior of the dam. Note that physical models and mathematical simulations require data provided by the monitoring systems. Moreover, the community of dam safety researchers performs comparisons between simulated data, real data and documentation related to a specific dam.

We can assume that the heterogeneity and interrelation of dam safety information composes a data space [3]. Therefore, a dam safety data space requires preservation and access facilities.

A common technology to handle e-Science collaboration and data management, for scenarios like dam safety control, is grid computing, in the form of data grids and federations of data grids. Data grids such as iRODS⁴ are able to manage large digital objects and use middleware that makes file management, user management and networking protocols transparent. However, the daily increase of data captured by sensors and potential changes in the federation components (e.g., adding a data grid to the federation) require an adaptation of the current replication strategies.
² GRITO (A Grid for Preservation) is funded by FCT (Portuguese Foundation for Science and Technology) under contract GRID/GRI/81872/2006, http://grito.intraneia.com

³ SHAMAN (Sustaining Heritage Access through Multivalent Archiving) is funded under the 7th Framework Programme of the EU, under contract 216736, http://www.shaman-ip.eu

⁴ https://www.irods.org
In other words, an optimal redundancy strategy for today's collection and federation architecture may no longer be an acceptable option in the future, due to changes in components or the increasing size of the collection.

3. PROBLEM ANALYSIS

Several potential threats, such as natural disasters, can generate unavailability, data loss and resource network failures. In [1], we propose a taxonomy of digital preservation threats that can be used to model a potential list of failures within the preservation environment. In this taxonomy, threats are classified into four classes, as shown in Table 1.

Table 1 - Threats to digital preservation

Component failures: media faults, HW faults, SW faults, communication faults, network service failures
Management failures: organization failures, economic failures, media/HW obsolescence, SW obsolescence
Disasters: natural disasters, human operational errors
Attacks: external attacks, internal attacks

Component failures comprise the technical problems in the infrastructure's components, such as media faults, which occur when a storage medium totally or partially fails, causing loss of data. Hardware faults can be recoverable (e.g., a power loss) or irrecoverable (e.g., a power supply unit burning out). Software faults can cause occasional and abrupt failures of the system due to bugs in the code. Communication faults can happen in packet transmission, and network service failures can cause temporary unavailability.

Management failures are threats resulting from wrong decisions. Failures may result in the inability of an organization to operate a preservation system at the desired level due to financial limitations (economic failures), political changes (organization failures), or other unpredictable reasons.

Natural disasters can cause failures in multiple components at the same time (e.g., an earthquake). Other disasters can be caused by human operational errors, i.e., non-deliberate actions that may introduce irrecoverable errors in the system.

Attacks correspond to deliberate actions affecting the system or its components. External attacks may occur in systems connected to public networks (e.g., viruses, worms). Internal attacks might be performed by actors from inside the organization.

It is important to remark that some threats can remain unnoticed for a long period of time. For instance, a damaged disk sector can remain undetected until a data integrity validation or hard disk check is performed.

Furthermore, we cannot assume threat independence [4], since a specific threat can generate other threats (e.g., a natural disaster can produce several component failures).

In order to reduce the risk of threats to digital preservation, several techniques can be adopted. In this paper we focus only on the preservation of the bit stream, leaving aside important factors such as obsolescence. Several threats can produce bit stream corruption or even data loss. Techniques like redundancy and diversity, including diversifying the physical location, software, hardware, storage, system administration and funding, can effectively be applied to reduce the risk of data loss. However, the maintenance of efficient and effective redundancy and diversification strategies is a complex and error-prone task, especially with huge and dynamic collections, as in e-Science scenarios.

4. PROPOSED SOLUTION

We aim to provide mechanisms to support the decision of which redundancy strategies should be used in a specific digital preservation scenario, mainly to deal with component failures and disasters. However, the behavior of these strategies might be too complex to be modeled and studied analytically. Moreover, the implementation of a full prototype might not be feasible, due to the long lifetime of real digital preservation scenarios. Therefore, we propose a solution comprising the following concepts:

Simulation: We conceived a simulator, Serapeum, which can be used to evaluate redundancy strategies under different performance, reliability and failure models. These redundancy strategies might differ in design decisions such as how many replicas to create, where to place them, or when to create new replicas of the data.

Redundancy: The same code that implements the simulator can also be used as part of a management framework implementing the simulated replication strategy. This way, the strategies only have to be implemented once, both for simulation and for the real system. For that purpose we developed a driver of the simulator for the iRODS data grid (our main technological focus at the moment), which can be re-implemented for other data grids.

Introspection: Our solution also enables introspection, where we continuously measure the real behavior of the preservation system to adjust the models used to predict failures. The simulator can be used to determine, at run time, how a new replication strategy might affect the reliability and performance of the system.

4.1 Simulator

A simulation environment has been developed to enable the analysis of the efficiency of redundancy strategies in federations of data grids, where different architectures, collection management policies and replication strategies can be evaluated.

Figure 1 shows the UML component diagram of the simulator, Serapeum, which allows the simulation of the behavior of distributed redundancy systems with centralized replication decision-making. It can be used to study the properties of replication algorithms and to plan the deployment of a real system.
During a run, the replication algorithm inspects the simulated state and returns a set of operations, which are then simulated and affect the current state. The simulated operations and the consequent state changes are repeated in a cycle. Each step generates a set of statistics, which are returned when the simulation finishes.
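To make this cycle concrete, the sketch below shows one possible shape of such a simulation loop in Java. It is only an illustration of the described cycle under simple assumptions: the class and method names (SimEvent, Strategy, react) are hypothetical and do not correspond to the actual Serapeum code.

```java
import java.util.List;
import java.util.PriorityQueue;

// Illustrative simulation cycle: events are drawn from an event list, the
// strategy reacts with operations, and the operations feed back into the state.
public class SimulationCycle {

    static class SimEvent implements Comparable<SimEvent> {
        final long time;            // simulated time, e.g., in seconds
        final String description;   // e.g., "failure:resource-3" or "copy:file-42->resource-7"
        SimEvent(long time, String description) { this.time = time; this.description = description; }
        public int compareTo(SimEvent other) { return Long.compare(time, other.time); }
    }

    interface Strategy {
        // Operations are expressed as future events that change the simulated state.
        List<SimEvent> react(SimEvent event);
    }

    public static void main(String[] args) {
        PriorityQueue<SimEvent> eventList = new PriorityQueue<>();
        eventList.add(new SimEvent(1_000, "failure:resource-3"));   // seeded by the failure model

        // Toy strategy: whenever a resource fails, schedule one repair copy 600 s later.
        Strategy strategy = event -> event.description.startsWith("failure:")
                ? List.of(new SimEvent(event.time + 600, "copy:file-42->resource-7"))
                : List.of();

        long fileCopies = 0, failures = 0;
        while (!eventList.isEmpty()) {
            SimEvent event = eventList.poll();                      // advance the simulated clock
            if (event.description.startsWith("failure:")) failures++;
            if (event.description.startsWith("copy:")) fileCopies++;
            eventList.addAll(strategy.react(event));                // simulated operations affect the state
        }
        // Per-step statistics would be accumulated here and returned at the end.
        System.out.println("failures = " + failures + ", repair copies = " + fileCopies);
    }
}
```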
[Figure 1 – Component diagram of the Simulator: the Collection Model, Failure Model, Storage Model and Redundancy Strategy flow into the Simulator, which maintains a Simulated State and an Event List.]
We generate statistics regarding the number of file losses, the total bandwidth usage, and the average number of replicas over time. Other statistics can easily be added, since statistics are implemented as Java classes that can be extended.
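As an illustration of that extension point, a new statistic could look like the sketch below. The class names (Statistic, AverageReplicaCount) are hypothetical, since the actual Serapeum classes are not published in this paper.

```java
import java.util.Collection;
import java.util.List;

// Hypothetical extension point for statistics, mirroring the description above.
abstract class Statistic {
    // Called once per simulation step with the replica count of every file at that step.
    abstract void onStep(long simulatedTime, Collection<Integer> replicasPerFile);
    abstract double result();
}

class AverageReplicaCount extends Statistic {
    private long samples;
    private double sum;

    @Override
    void onStep(long simulatedTime, Collection<Integer> replicasPerFile) {
        for (int replicas : replicasPerFile) {
            sum += replicas;
            samples++;
        }
    }

    @Override
    double result() {
        return samples == 0 ? 0.0 : sum / samples;   // average number of replicas over time
    }
}

public class StatisticDemo {
    public static void main(String[] args) {
        AverageReplicaCount stat = new AverageReplicaCount();
        stat.onStep(0, List.of(3, 3, 2));
        stat.onStep(1, List.of(3, 2, 2));
        System.out.println("average replicas = " + stat.result());  // prints 2.5
    }
}
```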
4.2 Overall Solution
Our overall solution consists of three main components, as shown in Figure 2: (i) the simulator; (ii) the redundancy manager, which interacts directly with a deployed distributed storage system and implements the replication strategy on top of the storage substrate; and (iii) the watcher, which monitors the real behavior of the system's components in order to adjust the simulated models.
As input, Serapeum receives the storage model (describing the architecture of the storage infrastructure), a collection model defining the size and number of files⁵, the failure model (used to generate failures during the simulated lifetime), the redundancy strategy to be evaluated, and a list of parameters such as the simulation time and the list of statistics that should be collected.
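The fragment below sketches how these inputs might be grouped for one run, using the scenario of Section 5 for the concrete values. The record names and the assumed MTBF are illustrative; they are not part of the published Serapeum interface.

```java
import java.util.List;

// Hypothetical grouping of the simulation inputs; values follow the scenario of Section 5.
public class SimulationInputs {

    record CollectionModel(int numberOfFiles, long fileSizeBytes) {}
    record StorageModel(int resources, long capacityBytesPerResource, long linkBitsPerSecond) {}
    record FailureModel(String kind, double meanTimeBetweenFailuresHours) {}
    record Parameters(int simulatedYears, List<String> statistics) {}

    public static void main(String[] args) {
        CollectionModel collection = new CollectionModel(208, 5L << 30);            // 208 files of 5 GB
        StorageModel storage = new StorageModel(14, 500L << 30, 100_000_000L);      // 14 x 500 GB, 100 Mbit/s
        FailureModel failures = new FailureModel("independent", 8_760);             // assumed MTBF of one year
        Parameters parameters = new Parameters(50, List.of("fileLosses", "bandwidthUsage"));

        System.out.printf("%d files on %d resources, simulated for %d years%n",
                collection.numberOfFiles(), storage.resources(), parameters.simulatedYears());
        // A real run would pass these models, plus the redundancy strategy, to the simulator.
    }
}
```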
[Figure 2 – Component diagram of the overall proposed solution: the Watcher, the Simulator and the Redundancy Manager, connected by flows of performance predictions and performance measurements.]
Currently, the collection model is static, which means that it is not possible to simulate the ingestion of additional files during a simulation. The storage model characterizes each data grid in a federation, including the number of storage components and their capacity, and represents the resource network as a graph whose vertices are storage resources and whose edges are links between resources. Each link is defined by the maximum transfer speed between the connected resources (we ignore the protocol stack and latency, since they are not significant for the transmission of files of considerable size).
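A minimal sketch of such a storage graph is shown below. Only the graph structure and the size-over-bandwidth transfer rule come from the text; the class and method names are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the storage model: resources as vertices, links as edges with a maximum speed.
public class StorageGraph {

    static class Resource {
        final String id;
        final long capacityBytes;
        Resource(String id, long capacityBytes) { this.id = id; this.capacityBytes = capacityBytes; }
    }

    private final Map<String, Resource> resources = new HashMap<>();
    private final Map<String, Long> links = new HashMap<>();   // bits per second, keyed by "from->to"

    void addResource(Resource r) { resources.put(r.id, r); }

    void addLink(String from, String to, long bitsPerSecond) {
        links.put(from + "->" + to, bitsPerSecond);
        links.put(to + "->" + from, bitsPerSecond);            // links assumed symmetric here
    }

    // Transfer time ignores latency and protocol overhead, as in the simulator's model.
    double transferSeconds(String from, String to, long fileSizeBytes) {
        long bps = links.getOrDefault(from + "->" + to, 0L);
        return bps == 0 ? Double.POSITIVE_INFINITY : (fileSizeBytes * 8.0) / bps;
    }

    public static void main(String[] args) {
        StorageGraph grid = new StorageGraph();
        grid.addResource(new Resource("r1", 500L << 30));
        grid.addResource(new Resource("r2", 500L << 30));
        grid.addLink("r1", "r2", 100_000_000L);                          // 100 Mbit/s
        System.out.println(grid.transferSeconds("r1", "r2", 5L << 30));  // ~429 s for a 5 GB file
    }
}
```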
In order to apply replication strategies in real scenarios, a redundancy manager has been implemented as a service that interacts with data grids through specific drivers. The replication algorithms in this redundancy service are defined in the same way as those implemented in Serapeum, which means that an algorithm can be evaluated using Serapeum and immediately deployed in a real system.
The failure model allows the simulator to generate failures that affect resources. We consider two types of failures: (i) unavailability, representing a temporarily or permanently unreachable resource, and (ii) data loss on a specific resource. Currently, failure scenarios can be modeled by independent failures [5], attribute-based failures [6] or unexpected correlated failures.
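For the independent-failures case, a failure generator might look like the sketch below. The exponential inter-arrival time and the 5% share of data-loss failures are illustrative assumptions made for the example; the paper does not prescribe concrete distributions.

```java
import java.util.Random;

// Illustrative failure generator: independent failures with exponentially
// distributed inter-arrival times (one possible instantiation of the model).
public class IndependentFailureModel {

    enum FailureType { UNAVAILABILITY, DATA_LOSS }   // the two failure types considered

    private final Random random = new Random(42);
    private final double meanHoursBetweenFailures;

    IndependentFailureModel(double meanHoursBetweenFailures) {
        this.meanHoursBetweenFailures = meanHoursBetweenFailures;
    }

    // Hours until the next failure of a single resource (exponential distribution).
    double hoursToNextFailure() {
        return -meanHoursBetweenFailures * Math.log(1.0 - random.nextDouble());
    }

    // Assumption for the sketch: most failures are temporary unavailability, a few lose data.
    FailureType nextFailureType() {
        return random.nextDouble() < 0.05 ? FailureType.DATA_LOSS : FailureType.UNAVAILABILITY;
    }

    public static void main(String[] args) {
        IndependentFailureModel model = new IndependentFailureModel(8_760); // MTBF of one year
        for (int i = 0; i < 3; i++) {
            System.out.printf("next failure in %.0f hours (%s)%n",
                    model.hoursToNextFailure(), model.nextFailureType());
        }
    }
}
```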
The drivers enable the system to obtain information about resources and files. They also permit the execution of operations on the data grid, such as replica creation and deletion, and they are responsible for notifying the watcher when the state of the grid changes (for example, when a node becomes unavailable). This information is used by the replication algorithm to generate a set of operations, such as replica creations, which the driver then performs on the data grid.
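The driver responsibilities listed above can be captured by an interface like the following sketch. The method names are hypothetical and deliberately generic; they do not correspond to the iRODS Jargon API, which an iRODS-specific driver would wrap.

```java
import java.util.List;

// Generic driver contract matching the responsibilities described in the text.
interface DataGridDriver {
    List<String> listResources();                        // storage resources known to the grid
    List<String> listReplicas(String logicalFileName);   // resources currently holding the file
    void createReplica(String logicalFileName, String targetResource);
    void deleteReplica(String logicalFileName, String sourceResource);
    void setStateListener(GridStateListener listener);   // used to notify the watcher of changes
}

// Callback used by a driver to report state changes, such as a node becoming unavailable.
interface GridStateListener {
    void resourceUnavailable(String resource);
    void resourceAvailable(String resource);
}
```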
Finally, the redundancy strategy is represented by a set of rules, where each rule consists of a list of preconditions and a set of actions performed when the preconditions are met. Preconditions are expressed over metrics such as storage usage, available bandwidth, file size or the predicted mean time between failures [6]. Each action is defined by a type (e.g., copy file) and the targeted resources. We consider two types of strategies: (i) strategies based on a fixed number of replicas, where all files have the same number of replicas (the rules just decide which resources should hold each replica), and (ii) strategies based on a variable number of replicas, where the number of replicas is also calculated by the rules.
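The rule structure can be illustrated as below. The Metrics fields mirror the metrics mentioned in the text, while everything else (the names, the concrete threshold of three replicas) is an assumption made for the example.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of a rule: a precondition over simple metrics plus the actions to run when it holds.
public class RuleSketch {

    record Metrics(double storageUsage, double availableMbps, long fileSizeBytes,
                   double predictedMtbfHours, int currentReplicas) {}

    record Action(String type, String targetResource) {}

    record Rule(Predicate<Metrics> precondition, List<Action> actions) {}

    public static void main(String[] args) {
        // Fixed-replica-count flavour: if a file has fewer than 3 replicas, copy it somewhere else.
        Rule keepThreeReplicas = new Rule(
                m -> m.currentReplicas() < 3,
                List.of(new Action("copy file", "least-loaded-resource")));

        Metrics observed = new Metrics(0.45, 100.0, 5L << 30, 8_760.0, 2);
        if (keepThreeReplicas.precondition().test(observed)) {
            keepThreeReplicas.actions().forEach(a ->
                    System.out.println(a.type() + " -> " + a.targetResource()));
        }
    }
}
```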
Our system is integrated with the iRODS data grid. Although iRODS has mechanisms that would allow the redundancy strategies to be implemented inside iRODS (as an iRODS module), we implemented them as a separate process to facilitate the integration with other data grid technologies. We implemented a driver for the iRODS data grid using its Java API (Jargon). This driver supports the functionalities mentioned before and also includes an introspection mechanism that uses the Serapeum simulator as part of the redundancy system. With this mechanism, measurements from the environment can be used as inputs to simulations that are run periodically, using models produced according to these measurements.
Based on the simulation specification, Serapeum creates a simulated state of the system, which is the current state within the simulation time.
⁵ Since we are concerned only with the preservation of the bit stream, other information such as file formats or schemas is not relevant.
Both the simulator and the redundancy manager interact with the watcher component. The watcher provides inputs to the simulator and receives the results of the simulations. According to these results, it can generate a new replication strategy for the redundancy system, alter the physical configuration of the distributed storage system, or even introduce diversity according to the desired system goals. The redundancy system continuously produces a set of measurements that are evaluated by the watcher component and can be used to start a simulation process.
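The feedback cycle between watcher, simulator and redundancy manager can be summarized by the sketch below. All names and thresholds here are illustrative placeholders for the interactions described above, not the actual implementation.

```java
// High-level sketch of the watcher feedback cycle.
public class WatcherSketch {

    record Measurements(double observedFailuresPerYear, double usedBandwidthMbps) {}
    record SimulationResult(long predictedFileLosses) {}

    static SimulationResult simulate(Measurements m) {
        // In the real system this would run Serapeum with models fitted to the measurements.
        return new SimulationResult(m.observedFailuresPerYear() > 10 ? 3 : 0);
    }

    public static void main(String[] args) {
        Measurements fromGrid = new Measurements(12.0, 35.0);   // reported by the redundancy manager
        SimulationResult prediction = simulate(fromGrid);

        if (prediction.predictedFileLosses() > 0) {
            // The watcher reacts, e.g., by proposing a strategy with more replicas or more diversity.
            System.out.println("Predicted losses: " + prediction.predictedFileLosses()
                    + " -> propose a new replication strategy");
        }
    }
}
```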
5. RESULTS

This section shows an example of the use of the proposed solution to evaluate the behavior of distinct replication algorithms for a specific federation. The storage model is composed of 14 resources with a capacity of 500 GB each, connected in a single network with a capacity of 100 Mbit/s. The modeled collection is composed of 208 files of 5 GB (about 1 TB in total).

Figure 3 shows the number of file losses and the total bandwidth usage for a simulation running for 50 years, using five replication algorithms based on fixed and variable numbers of replicas and using different metrics. These algorithms are described in detail in [9].

[Figure 3 - File losses and bandwidth usage]

The relevance of these results can be interpreted in two ways. First, they compare the behavior of different redundancy strategies, supporting the choice of the best algorithm for this scenario. Second, these results can be used to evaluate the robustness of current solutions against future modifications. For instance, if one expects a specific collection growth per year, it is possible to test the performance of the current architecture and redundancy strategy on the future collection model.

6. CONCLUSIONS AND FUTURE WORK

Emerging e-Science scenarios produce large amounts of highly valuable data that need to be preserved and made available over the long term to support future research. A common technology to handle e-Science collaboration is grid computing and data grids. Data grids such as iRODS are able to manage large digital objects and use middleware that makes file management, user management and networking protocols transparent.

Long-term digital preservation requires that digital objects are stored reliably, preventing data loss, which is usually achieved using redundancy and diversity. However, determining the best redundancy and diversification strategy is a complex and error-prone task. Thus, in order to support this decision making, we proposed a complete solution composed of: a simulator that can be used to evaluate different redundancy strategies on well-defined system models; a redundancy manager capable of mapping redundancy strategies onto real systems based on iRODS technology; and an introspection mechanism that continuously evaluates the real environment to adapt the simulation models to reality.

Based on the results achieved by the first implementation of our solution, we intend to develop three main extensions: (i) provide models to describe the behavior of dynamic collections, as in e-Science domains; (ii) extend the collection models to include a metric of relevance (e.g., digital objects that result from a mathematical simulation can be reproduced and are consequently less relevant than a physical observation that is impossible to repeat); and (iii) integrate our solution with other relevant data grids.

7. ACKNOWLEDGMENTS

This work is partially supported by the projects GRITO (FCT, GRID/GRI/81872/2006) and SHAMAN (European Commission, ICT-216736), and by an individual grant from FCT (SFRH/BD/23405/2005) and LNEC to José Barateiro.

8. REFERENCES

[1] J. Barateiro, G. Antunes, M. Cabral, J. Borbinha, and R. Rodrigues. Using a grid for digital preservation. In International Conference on Asia-Pacific Digital Libraries, Bali, Indonesia, December 2008.

[2] S. Miles, S. C. Wong, W. Fang, P. Groth, K. P. Zauner, and L. Moreau. Provenance-based validation of e-science experiments. Web Semantics, 5(1):28–38, 2007.

[3] M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.

[4] M. Baker, M. Shah, D. S. H. Rosenthal, M. Roussopoulos, P. Maniatis, T. Giuli, and P. Bungale. A fresh look at the reliability of long-term digital storage. In EuroSys '06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 221–234, New York, NY, USA, 2006. ACM.

[5] B. Schroeder and G. A. Gibson. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? In FAST '07: Proceedings of the 5th USENIX Conference on File and Storage Technologies, Berkeley, CA, USA, 2007. USENIX Association.

[6] F. Junqueira, R. Bhagwan, A. Hevia, K. Marzullo, and G. M. Voelker. Surviving Internet catastrophes. In ATEC '05: Proceedings of the USENIX Annual Technical Conference, Berkeley, CA, USA, 2005. USENIX Association.

[7] RSB. Dam Safety Regulation, Decreto-Lei n. 344/2007, October 15th. Diário da República, Lisbon, 2007 (in Portuguese).

[8] G. J. B. Antunes. GRITO: GRID clusters for digital preservation. Master's dissertation in Computer Engineering, Instituto Superior Técnico, Lisbon, 2008.

[9] M. Cabral. GRITO: Evaluation system to support digital preservation in heterogeneous environments. Master's dissertation in Computer Engineering, Instituto Superior Técnico, Portugal, 2008.

[10] IEEE. IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries, 1990.