Using a Grid for Digital Preservation

José Barateiro¹, Gonçalo Antunes², Manuel Cabral², José Borbinha², and Rodrigo Rodrigues³

¹ LNEC - Laboratório Nacional de Engenharia Civil, Lisbon, Portugal
² INESC-ID, Information Systems Group, Lisbon, Portugal
³ Max Planck Institute for Software Systems, Kaiserslautern and Saarbrücken, Germany

[email protected], {goncalo.antunes,manuel.cabral}@tagus.ist.utl.pt,
[email protected], [email protected]
Abstract. Digital preservation aims at maintaining digital objects and data accessible over long periods of time. Data grids provide several functionalities required by digital preservation systems, especially when massive amounts of data must be preserved, as in e-Science domains. We propose the use of existing data grid solutions to build frameworks for digital preservation. In this paper we survey the main threats to digital preservation, and use them to identify a central point of failure in the metadata catalog of the iRODS data grid solution. We propose three extensions to the iRODS framework to overcome its shortcomings when used as a digital preservation system.

Keywords: Digital Libraries, Digital Preservation, Data Grids, e-Science.
1 Introduction

Physical artifacts like printed works or drawings carved in stone can survive for centuries. These stable objects are testimonials of past generations and an important asset to the future. In contrast, digital objects are unstable, requiring continuous actions to make it possible to interpret them in the future.

Interoperability is defined as the ability of two or more systems or components to exchange information and to use the information that has been exchanged [6]. Digital preservation stresses the time dimension of this interoperability, focusing on the requirement that data or digital objects must remain authentic and accessible to users and systems over a long period of time, thus maintaining their value.

Achieving this goal may require specific investments in an infrastructure for storing, maintaining, and managing data. Such costs may be prohibitive for small organizations, or organizations that do not have a steady revenue, like university libraries, research laboratories, or non-profit organizations. Project GRITO¹ tries to lower the cost of digital preservation by harnessing the spare storage of grid clusters in Portuguese universities and research institutions. To achieve this goal we propose to build a heterogeneous storage
¹ http://grito.intraneia.com
G. Buchanan, M. Masoodian, S.J. Cunningham (Eds.): ICADL 2008, LNCS 5362, pp. 225–235, 2008. © Springer-Verlag Berlin Heidelberg 2008
framework that will integrate two classes of members: (i) exclusive storage clusters, comprising systems dedicated to digital preservation, which are likely to be under the administration of the data owner; (ii) extended clusters, existing grid clusters primarily used for data processing, but whose spare disk, CPU and bandwidth can also be used to support preservation services.

Project GRITO appears in the context of the international project SHAMAN² - Sustaining Heritage Access through Multivalent ArchiviNg, whose goal is to develop integrated solutions for the long-term preservation of massive data collections, especially engineering and scientific data. Important requirements include the support for migration strategies, with a strong focus on preserving authenticity and integrity. GRITO therefore addresses the digital preservation problem from a bottom-up perspective, focusing on detailed technical issues, while SHAMAN represents a top-down perspective, addressing business and organizational models without ignoring the related technical challenges. A common ground between these two initiatives is the decision to use the iRODS³ data grid technology as a storage substrate for digital preservation.

In this paper, we analyze the main threats to digital preservation, and whether iRODS is designed to withstand them. Furthermore, we identify a central point of failure in the iRODS architecture that could undermine our preservation goals, and we propose a solution that takes advantage of the extensibility present in the design of iRODS.

The remainder of this paper is organized as follows. Section 2 explains our motivation for the use of data grid technology for digital preservation. In Section 3 we propose a taxonomy of threats to digital preservation. In Section 4 we describe the iRODS data grid, pointing out in Section 5 its possible vulnerabilities for digital preservation, according to our taxonomy.
In Section 6 we propose an extension to iRODS to avoid those vulnerabilities. Finally, in Section 7 we list the open issues of the proposed extension and conclude.
2 Motivation
The complexity of digital preservation increases with the fact that each type of digital object has its own specific requirements. For instance, the preservation of audio files requires dealing with compression and complex encodings, unlike the preservation of XML files. Several communities, like biology, medicine, geographical sciences, engineering or physics, manage large amounts of structured datasets: data captured by sensors, physical or mathematical simulations generated by large computations, and also specialized documents reporting work progress and conclusions to researchers. That information can be represented in a wide range of formats (e.g., a researcher can use a specific input and output format, and a specific program to produce simulations) and include a large number of relations that are not expressed in the data models. Moreover, the
² http://www.shaman-ip.eu
³ https://www.irods.org
collaborative environment of the scientific community, and associated services and infrastructures, usually known as e-Science (from "enhanced Science") [7], implies that interoperability and data sharing are required.

2.1 Data Grids
In recent years, there have been research efforts to define a new type of system that deals with the large-scale management, sharing and processing of data, commonly called Data Grids [5]. Data Grids offer a distributed infrastructure and services that support applications dealing with massive data blocks stored in heterogeneous distributed resources [3].

Grid computing is growing fast. Many applications of this technology exist, and Grid frameworks are already common in scientific research projects, enterprises, and other environments that require high processing power while using low-cost hardware. A possible definition of a Grid computing system [4] is one where: (i) resources are subject to decentralized control; (ii) standard, open, and general-purpose protocols and interfaces are used; and (iii) nontrivial qualities of service are delivered (e.g., combined throughput or response time).

In Data Grids, data is organized into collections or datasets, and is replicated, managed and modified using a specific management system. Information about replicas is usually organized in a replica catalog. In summary, the common characteristics of a Data Grid can be described as: (i) Massive datasets: a Data Grid allows the management of and access to enormous quantities of data, in the order of terabytes or even petabytes (e.g., scientific projects such as the Southern California Earthquake Center can generate, in a single simulation, up to 1.3 million files and 10 terabytes of data [8]); (ii) Logical namespace: the requirements for scalability imply the use of virtual names for resources, files and users; (iii) Replication: scalability and reliability require high availability and redundancy; (iv) Authorization and authentication: due to the high value and frailty of the data, authentication and authorization mechanisms must be enforced to comply with authenticity and integrity requirements.

2.2 Data Grids and Digital Preservation
Grids are built using middleware that makes fundamental aspects such as file management, user management and networking protocols completely transparent. These goals are also shared by digital preservation systems. SRB⁴ (Storage Resource Broker) is a grid technology that has been operational for more than a decade, and is used in many research projects, storing petabytes of managed data. However, SRB is a generic grid infrastructure, and any modification to the management of data needs to be hard-coded. The iRODS data grid is being developed by the same team that worked on SRB. The purpose is to create a system with an adaptive middleware that simplifies the task
⁴ http://www.sdsc.edu/srb
of modifying how data is managed, or creating new policies tailored to a particular application, while retaining the good practices and lessons learned from SRB. However, neither SRB nor iRODS address specific requirements for digital preservation, which is the problem being addressed by this paper.
3 Digital Preservation Threats
In this section we present a taxonomy of threats to digital preservation, based on several papers that point out different threats [1,2,10]. Our taxonomy is presented in Table 1. Component failures cover the technical problems in the infrastructure's components. Management failures are the consequences of wrong decisions. Finally, disasters and attacks correspond, respectively, to non-deliberate and deliberate actions affecting the system or its components.

Some threats cannot be detected immediately, remaining unnoticed for a long time. For instance, a damaged hard disk sector can remain undetected until a data integrity validation or hard disk check is performed. Moreover, we cannot assume threat independence: a natural disaster like an earthquake can trigger other threats.

Table 1. Threats to preservation systems

  Component failures:   Media faults; Hardware faults; Software faults;
                        Communication faults; Network services failures
  Management failures:  Organization failures; Economic failures;
                        Media/Hardware obsolescence; Software obsolescence
  Disasters:            Natural disasters; Human operational errors
  Attacks:              External attacks; Internal attacks
In the next sections we further divide each threat into a set of possible specific events.

3.1 Component Failures
Media faults occur when a storage medium fails partially or totally, losing data through disk crashes or "bit rot". Other hardware components can suffer hardware faults, either transient recoverable failures, like power loss, or irrecoverable failures, such as a power supply unit burning out. Similarly, software faults, usually known as bugs, can cause abrupt failures in the system. For instance, a firmware error can cause data loss in hard drives. Communication faults occur in packet transmission, including detected errors (e.g., IP packet errors) and undetected checksum errors. Other network service failures, such as DNS problems, can compromise the system's availability.
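A common defence against silent media faults such as bit rot is periodic fixity checking: record a cryptographic digest when an object is ingested and recompute it later. The following Python sketch illustrates the idea; the function names and the choice of SHA-256 are illustrative assumptions, not part of the systems discussed here.

```python
import hashlib
from pathlib import Path

def compute_digest(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, reading it in chunks so that
    very large preserved objects do not have to fit in memory."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_fixity(path: Path, recorded_digest: str) -> bool:
    """Detect silent corruption by comparing the current digest with
    the one recorded at ingest time."""
    return compute_digest(path) == recorded_digest
```

A preservation system would run such a check on a schedule, since, as noted above, a damaged sector can otherwise remain undetected indefinitely.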
3.2 Management Failures
An organization responsible for a preservation system may become unable to continue operating at the desired level due to sudden financial limitations (economic failure), political changes or any other unpredictable reason (organization failure). Moreover, failures can also occur due to incompetent management. A different kind of management failure occurs when, even if the internal components do not fail over time, they become obsolete and unable to interact with external components. Thus, unforeseen media, hardware or software obsolescence limits the system's interoperability.

3.3 Disasters
Natural disasters such as earthquakes or fires can cause failures in many components simultaneously. For example, an earthquake may destroy a data center or cause a wide-scale power failure. Accidental human operational errors might introduce irrecoverable errors; for instance, people often delete data by mistake. Additionally, humans can cause failures in other components such as hardware (accidentally disconnecting a power cable) or software (uninstalling a needed library).

3.4 Attacks
Attacks might encompass deliberate data destruction, denial of service, theft, modification of data or component destruction, motivated by criminal, political or war reasons, including fraud, revenge or malicious amusement. Systems connected to public networks are especially exposed to external attacks, such as those caused by viruses or worms. Similarly, internal attacks might be performed by internal actors (e.g., employees) with privileged access to the organization and to the physical locations of the components.
4 iRODS Overview
The iRODS system is an open-source storage solution for data grids based on a distributed client-server architecture. A database in a central repository, called iCAT, is used to maintain, among other things, the information about the nodes in the Grid, the state of data and its attributes, and information about users. A rule system is used to enforce and execute adaptive rules. This system belongs to the class of adaptive middleware systems, since it allows users to alter software functionalities without any recompilation [9]. Figure 1 shows the UML [11] deployment diagram of iRODS. Note that the iCAT database only resides in the central node. iRODS uses the storage provided by the local file system, creating a virtual file system on top of it. That virtualization creates infrastructural independence, since logical names are given to files, users and resources.
Fig. 1. iRODS deployment diagram
Management policies are mapped into rules that invoke and control operations (micro-services) on remote storage media. Rules can be used for access control, to access another grid system, etc. Middleware functions can be extended by composing new rules and policies.
5 Vulnerabilities in iRODS for Digital Preservation
iRODS presents some vulnerabilities if it is used as the basis for a digital preservation system. In particular, the iCAT stores crucial information like the location of nodes, the mapping between logical names and physical objects, information about rules, collections, data, metadata, etc. Consequently, iRODS is unable to work without the iCAT catalog, which makes it a central point of failure. An unrecoverable failure in the metadata repository can cause total data loss, even if the data stored on other nodes remains intact. Table 2 summarizes how the digital preservation threats listed in Section 3 affect the overall iRODS system if they affect the iCAT.

Table 2. Threats to digital preservation in iRODS

  Threat                         Loss
  Component failures
    Media faults                 Partial
    Hardware faults              Partial
    Software faults              Partial
    Communication faults         None
    Network services failures    None
  Management failures
    Organization failures        None
    Economic failures            None
    Media/Hardware obsolescence  None
    Software obsolescence        Total
  Disasters
    Natural disasters            Total
    Human operational errors     Partial
  Attacks
    External attacks             Total
    Internal attacks             Total
Communication faults, network service failures, organization failures and economic failures cannot directly affect the iCAT catalog. Media/hardware obsolescence can be easily avoided, as the iCAT catalog is managed by the open-source PostgreSQL database management system, which is able to run on several operating systems and hardware/software configurations. Consequently, a planned replacement or migration of any obsolete component can overcome this threat.

Media, hardware and software faults only partially affect the iCAT catalog, since efficient short-term recovery strategies (e.g., iCAT backups, redundant RAID storage, etc.) can be used to recover from these types of failures. In these scenarios, the main issue may be the identification of the failure. For instance, bit rot in an iCAT file may remain undetected for a long period of time, affecting part of the preservation system. We also consider that human operational errors can partly affect the iCAT catalog and consequently the preservation system.

Serious losses can occur from natural disasters, software obsolescence and external or internal attacks. An earthquake can destroy the centralized iCAT repository. If the iCAT backups are also destroyed, a critical loss occurs affecting all the nodes in the system, because all the metadata was stored in the central repository. Natural disasters and external and internal attacks can affect the entire system if the event corrupts or destroys the iCAT repository and the short-term recovery support. Since PostgreSQL is able to run on several hardware configurations, media/hardware obsolescence is not a critical threat for digital preservation using iRODS. However, if PostgreSQL becomes obsolete, the iRODS system becomes unable to access the metadata stored in the iCAT, which can potentially produce complete data loss.
Thus software obsolescence makes the system dependent on a specific hardware/software configuration, and consequently also fragile to hardware/software obsolescence. However, in these scenarios, the crux of the threat is software obsolescence.
6 Extending iRODS
In order to reduce the threats to the iCAT, we propose an extension to iRODS comprising three new services: (i) the iCAT Replication Service (iRep), which replicates the iCAT catalog to all the nodes of the data grid; (ii) the iCAT Recovery Service (iRec), which recovers the iCAT catalog in case of corruption or failure of the central node; and (iii) the Audit Service, a system check that compares the iCAT with its replicas and reports any discrepancy detected.

6.1 Replication and Recovery
Figure 2 presents the UML activity diagram modeling the process of replicating the iCAT to other nodes in the data grid. The metadata repository is scanned for modifications. If there are any modifications, the system evaluates a set of rules to decide whether the replication should be postponed or proceed with a full or partial export. For instance, operations on metadata elements about nodes and data
have a higher priority than operations on elements about users. Moreover, we also distinguish between types of operations: a delete operation is not as critical as an insert, because the loss of new metadata (e.g., the mapping between a logical and a physical name) may imply the loss of new data in the grid. Based on the type of the iCAT element modified and on the operation performed (delete, update or insert), we assign a priority level to the iCAT replication. Low-priority modifications cause the replication to be postponed. For high priority levels, we also evaluate the current workload of the system, which determines whether the export should be full or just partial.
Fig. 2. Activity diagram of the iCAT replication process performed by iRep
If the conditions are met, the current iCAT contents are partially (recent modifications and nearest records) or fully (all data, the schema, and the list of nodes with the replicas) exported from the repository into a set of recovery files. Then, the recovery files are replicated to all the other storage nodes registered in the data grid. Local nodes are responsible for the preservation of recovery files, which are stored in a specific area outside the control of the data grid.
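The priority logic described above can be sketched in Python as follows. The element types, numeric priorities, and workload threshold are illustrative assumptions for exposition, not the actual iRep rules.

```python
from enum import Enum

class Decision(Enum):
    POSTPONE = "postpone"
    PARTIAL_EXPORT = "partial"
    FULL_EXPORT = "full"

# Illustrative priorities: metadata about nodes and data outranks
# metadata about users; inserts outrank updates, which outrank deletes.
ELEMENT_PRIORITY = {"node": 3, "data": 3, "user": 1}
OPERATION_PRIORITY = {"insert": 3, "update": 2, "delete": 1}

def replication_decision(element: str, operation: str,
                         workload: float,
                         workload_threshold: float = 0.7) -> Decision:
    """Decide whether an iCAT change triggers replication now.

    Low-priority changes are postponed; high-priority changes are
    exported fully when the grid is idle, partially when it is busy.
    """
    priority = (ELEMENT_PRIORITY.get(element, 1)
                * OPERATION_PRIORITY.get(operation, 1))
    if priority < 4:            # low priority: postpone and batch later
        return Decision.POSTPONE
    if workload > workload_threshold:
        return Decision.PARTIAL_EXPORT  # busy grid: export recent changes only
    return Decision.FULL_EXPORT
```

Note how, under these sample weights, a delete of a data element is postponed while an insert triggers an export, matching the rationale that losing new metadata may imply losing new data.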
Fig. 3. Activity diagram of the iCAT recovery process performed by iRec
If a problem occurs and the iCAT repository becomes unavailable, a recovery process can be performed, as described in Figure 3. This process is executed by a component external to iRODS. Depending on the scenario, it may be necessary to install a new instance of iRODS (e.g., if the node crashed permanently) or just to recover the iCAT catalog. To proceed with the iCAT catalog recovery, a list of the nodes storing recovery files should be given as input (it can be retrieved from any node that survived). Then, all the nodes are asked to send their recovery files. Those that are able to do so (e.g., those that survived, in the case of a major disaster) send their recovery files to the central node, where they are compared for integrity validation and the iCAT catalog is rebuilt. When the recovery process is completed, the data grid moves into a state where all the data stored in undamaged nodes is available again.
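The step of comparing the recovery files returned by surviving nodes can be sketched as a simple majority check: the central node trusts the version that most nodes agree on. The interface below (a mapping from node name to file contents) is a hypothetical simplification of the iRec behaviour.

```python
from collections import Counter
from typing import Optional

def select_valid_recovery_file(copies: dict[str, bytes]) -> Optional[bytes]:
    """Given the recovery-file copies returned by surviving nodes
    (node name -> file contents), pick the version that a strict
    majority of nodes agree on; return None if no majority exists,
    in which case an operator would have to intervene."""
    if not copies:
        return None
    counts = Counter(copies.values())
    content, votes = counts.most_common(1)[0]
    if votes * 2 > len(copies):   # strict majority
        return content
    return None
```

In practice the comparison would operate on digests rather than full file contents, but the agreement logic is the same.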
6.2 Audit
The audit process checks the integrity of the iCAT catalog. When this process is started, the iCAT catalog remains accessible, and the list of nodes in the data grid is obtained from the current iCAT contents. The central node exports the current contents of the iCAT catalog into a set of recovery files. Then, the recovery files are submitted to all nodes, asking them to compare the submitted files with the latest version preserved locally. Thus, the iCAT check is computed locally, using the parallel processing provided by the Grid infrastructure. In case of discrepancies, the local node produces a log with the list of detected inconsistencies and sends it to the central node. Finally, the central node is responsible for notifying the administrator (e.g., by email), either confirming the success of the audit process or sending the log files. Note that the audit process does not modify either the iCAT or the recovery files stored in data grid nodes. Consequently, the system state remains unchanged by this process.

Our solution executes the replication, recovery and audit processes as iRODS modules implemented with external micro-services. We can materialize these modules as new components (replicate, recover and audit), thus extending the iRODS architecture represented in the deployment diagram of Figure 1. We decided to implement these as micro-services because: (i) it keeps the module implementation external to the iRODS core; and (ii) they can use the iRODS API to access resources and the iCAT.
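The read-only comparison each node performs during an audit can be sketched as follows. The record structure (a mapping from iCAT keys to values) is an illustrative assumption; the actual iCAT schema is relational.

```python
def audit_compare(submitted: dict, local: dict) -> list[str]:
    """Compare the iCAT snapshot submitted by the central node with the
    replica preserved locally, returning a log of discrepancies.
    Neither input is modified, mirroring the read-only audit process."""
    log = []
    for key in submitted.keys() - local.keys():
        log.append(f"missing in local replica: {key}")
    for key in local.keys() - submitted.keys():
        log.append(f"missing in submitted snapshot: {key}")
    for key in submitted.keys() & local.keys():
        if submitted[key] != local[key]:
            log.append(f"value mismatch for {key}")
    return sorted(log)
```

An empty log confirms a successful audit for that node; a non-empty log is what would be forwarded to the central node for the administrator.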
7 Conclusion and Open Issues
Digital preservation in e-Science may require the use of data grids to manage large and continuously growing amounts of data. However, current data grids do not natively support key digital preservation techniques such as auditing. iRODS is a good starting point for building a preservation solution based on data grids because of its extensibility, due to the possibility of including new micro-services and rules.

This paper presented a taxonomy of threats to digital preservation. Based on it, we identified that iRODS has a central point of failure in the metadata
catalog. Consequently, we presented an extension to the iRODS system to handle digital preservation threats to the metadata catalog.

The proposed solution still has some open issues that must be addressed. For example, the format used to export the schema and contents of the iCAT must be defined in a way that makes it appropriate for preservation. At the moment we are using XML, but a more suitable XML Schema still needs to be defined. Another open issue is to establish the policies that define when the repository should be replicated. On the one hand, for performance reasons, it is unfeasible to replicate the repository on every single change to the iCAT database. On the other hand, a long replication period (e.g., daily) may imply that important updates are lost in case of an iCAT failure. Thus, this process should be balanced with the normal operations of the grid and be user-definable through a set of rules. For instance, the administrator may be able to define the maximum number of pending non-replicated transactions, the admissible workload to perform a replication in parallel with normal operations, etc.

We are using and validating the proposed solution in the context of project GRITO, focusing on data objects from the National Digital Library⁵ and scientific data provided by the Portuguese National Laboratory of Civil Engineering⁶. The case of the scientific data will be further analyzed in the context of project SHAMAN.
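The administrator-definable replication policy mentioned above might be expressed as a small set of tunable parameters. The parameter names and default values below are hypothetical, chosen only to illustrate how the competing concerns (pending updates, grid workload, and a maximum staleness bound) could be balanced.

```python
from dataclasses import dataclass

@dataclass
class ReplicationPolicy:
    """Administrator-tunable thresholds for triggering iCAT replication."""
    max_pending_transactions: int = 100        # replicate once this many changes accumulate
    max_workload_for_replication: float = 0.7  # skip replication when the grid is busier
    max_replication_interval_s: int = 3600     # replicate at least hourly regardless

def should_replicate(policy: ReplicationPolicy, pending: int,
                     workload: float, seconds_since_last: int) -> bool:
    """A staleness deadline forces replication even under load; otherwise
    replicate only when enough changes are pending and the grid is not
    too busy to absorb the extra work."""
    if seconds_since_last >= policy.max_replication_interval_s:
        return True
    return (pending >= policy.max_pending_transactions
            and workload <= policy.max_workload_for_replication)
```

The staleness deadline bounds the window of updates that could be lost in an iCAT failure, while the workload threshold keeps replication from competing with normal grid operations.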
Acknowledgments

This work is partially supported by the projects GRITO (FCT, GRID/GRI/81872/2006) and SHAMAN (European Commission, ICT-216736), and by an individual grant from FCT (SFRH/BD/23405/2005) and LNEC to José Barateiro.
References

1. Baker, M., Keeton, K., Martin, S.: Why traditional storage systems don't help us save stuff forever. In: 1st IEEE Workshop on Hot Topics in System Dependability (June 2005)
2. Baker, M., Shah, M., Rosenthal, D.S.H., Roussopoulos, M., Maniatis, P., Giuli, T.J., Bungale, P.P.: A fresh look at the reliability of long-term digital storage. In: EuroSys, pp. 221–234 (2006)
3. Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., Tuecke, S.: The data grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications 23, 187–200 (2000)
4. Foster, I.: What is the grid? A three point checklist. GRIDToday 1(6) (July 2002)
5. Hey, T., Trefethen, A.E.: The UK e-Science core programme and the grid. In: International Conference on Computational Science (1), pp. 3–21 (2002)
6. IEEE: IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries (1990)
⁵ http://www.bn.pt
⁶ http://www.lnec.pt
7. Miles, S., Wong, S.C., Fang, W., Groth, P., Zauner, K.-P., Moreau, L.: Provenance-based validation of e-science experiments. Web Semant. 5(1), 28–38 (2007)
8. Moore, R.: Digital libraries and data intensive computing. In: China Digital Library Conference, Beijing, China (September 2004)
9. Rajasekar, A., Wan, M., Moore, R., Schroeder, W.: A prototype rule-based distributed data management system. In: HPDC Workshop on Next Generation Distributed Data Management, Paris, France (2006)
10. Rosenthal, D.S.H., Robertson, T., Lipkis, T., Reich, V., Morabito, S.: Requirements for digital preservation systems: A bottom-up approach. CoRR, abs/cs/0509018 (2005)
11. OMG: Unified Modeling Language Specification, version 1.4.2, formal/05-04-01. ISO/IEC 19501 (January 2005)