PLEDGE: PoLicy Enforcement in Data Grid Environments
Developing Scalable Data Management Infrastructure in a Data Grid-Enabled Digital Library System
Proposed Research Agenda

MacKenzie Smith, Massachusetts Institute of Technology
Reagan W. Moore, San Diego Supercomputer Center
Brian E.C. Schottlaender, University of California, San Diego

October 6, 2005
Sponsored by National Archives and Records Administration
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archives and Records Administration or the U.S. Government.
Table of Contents
I. Introduction
II. Proposed Research and Development
III. Approach
IV. Testing and Evaluation
V. Schedule and Deliverables
VI. References
I. Introduction
Development of prototype persistent archives has been funded through a collaboration of the NSF sponsored National Partnership for Advanced Computational Infrastructure and the National Archives and Records Administration (NARA). That collaboration was extended for a 12-month period from July 2004 through June 2005 to include participation from the Massachusetts Institute of Technology Libraries, and the University of California, San Diego Libraries. We propose the continuation of the project for a further 12 months to work on developing scalable policy expression and digital information life cycle management technologies for digital librarians and archivists. A federated Libraries and Archives Research Grid Environment will be implemented and tested at several institutions, including MIT and SDSC.
Elaboration

The success of the previous year’s efforts has opened the door for more institutional cooperation in the life-cycle management of digital assets residing in an independently layered and networked distributed preservation environment. Specifically: The DSpace system developed at MIT provides a simple, user-friendly front end for digital content ingest, search and discovery, content management and dissemination services. The San Diego Supercomputer Center (SDSC) Storage Resource Broker (SRB) provides a data storage management infrastructure that supports large scale storage, as well as access to replicated, distributed data across the data grid. These two technologies have been successfully integrated to demonstrate an SRB-managed and grid-enabled storage architecture for DSpace, and a number of test collections from MIT, UCSD, and NARA were ingested and made available in the combined platform.

Further development and testing of a data grid environment for libraries and archives will address policy, management, control, preservation and access issues that must be resolved in order for libraries and archives to adopt these technologies and scale them up to very large archival collections.
Goals

The project has the following goals:
• Identification of necessary policy expression and information life cycle management ontologies to support large-scale digital collection management
• Further specification and development of a more modular, scalable architecture for the DSpace digital library platform (DSpace 2.0)
• Initial development and testing of distributed, federated collections built on data grid storage and managed by DSpace data curators
• Demonstration of SRB support for a centralized mechanism to replicate and federate collections across institutional boundaries
• Demonstration of support for preservation of authenticity and integrity during exchange of documents between integrated DSpace/SRB systems
II. Proposed Research and Development
Through the “Integrating Data Management with Data Grids” project we have successfully integrated DSpace life cycle management processes for the preservation of digital objects with SRB storage and replication functionality. With this in place, the stage is set to develop further tools and technologies that will explore the data curation and life cycle management aspects of the system and build a working prototype to support the replication and federation of digital collections across libraries and archives over the data grid.

Initially, support for replication of institutional collections in the “lots of copies keep stuff safe” mode could be demonstrated with the involvement of several institutions within the U.S. and eventually beyond national boundaries. This “simple” replication provides copies of collections to insure against catastrophic loss at the parent institution. Replication and federation using data grid technology managed by SRB was a necessary prerequisite, but further investigation is needed into the business, policy, and control issues that accompany collection management. We propose incorporating such an effort within this project.

SRB also provides the means to store collections remote from the system application (in this case DSpace). In such a scenario, SRB-managed storage at SDSC, or an SRB instance at a non-SDSC location, could offer remote storage as a service. Again, working through the agreements between the cooperating institutions, and how those agreements can be monitored and enforced by the system, would form a major area for research. This topic is directly applicable to preservation facilities in which multiple institutions independently support replicas, with the intent of decreasing the risk of data loss. The management policies used to coordinate replication between independent sites require a level of procedural control beyond that provided by data grid federation. An example of such a system is the NARA research prototype persistent archive.
The impact of the proposed integrated system is expected to be in the following areas:
• Definition of requirements for life cycle management and policy expression by practicing archivists with scalable data architectures
• Prototype of a library community-managed data grid to support the replication and federation of digital collections in order to better preserve them and to offer more scalable storage options
• Education of the library and archive community on the possibilities the data grid offers the digital library community for preservation and access
The proposed research and development activities are focused on:
• Development of two (or more) ontologies involved in digital archives: digital life cycle management, and storage policy expressions.
• Demonstrated remote storage of digital collections from at least four institutions utilizing a library community data grid
• Demonstration of the integration of Archival Information Packages, with Metadata Encoding and Transmission Standard profiles, with data grid containers for storing data
• Investigation of the role SDSC might play in providing storage services for preservation of library and archival digital collections
• Exploration of the management issues that need to be addressed for distributed storage and replication to become a trusted component in the digital preservation environment.
III. Approach
III.1 Ontologies

DSpace and other digital library and digital archiving systems have identified the need for better ways of defining, expressing, and encoding digital content management information. There have been several attempts to develop “ontologies” for this type of data in the past, but none have enjoyed widespread adoption or evaluation in the field. The two main candidates for digital life cycle management ontologies are:

ABC Harmony

The ABC Harmony life cycle ontology was originally developed by researchers at Cornell University, DSTC Pty, Ltd, Brisbane, Australia, and others as part of an international effort to build infrastructure to support scalable data interoperability [9]. ABC Harmony is expressed as an RDF ontology [10] and so benefits from the flexibility and scalability of that data architecture. The goals of the Harmony project were:
• Collaborating with metadata communities to develop and refine developing metadata standards that describe multimedia components.
• Investigating a conceptual model for interoperability among community-specific metadata vocabularies. Such a conceptual model should be able to represent the complex structural and semantic relationships in multimedia resources.
• Investigating mechanisms for expressing such a conceptual model, including technologies currently under development in the W3C (XML, RDF, and their associated schema mechanisms).
• Developing mechanisms to map between community-specific vocabularies using such a conceptual model.

The outcome of the project was a flexible, extensible life cycle ontology which was adopted by DSpace.

CIDOC CRM

Another digital life cycle management ontology, the CIDOC Conceptual Reference Model, or CRM [12], has emerged from the museum and cultural artifact community. The CIDOC CRM is intended to promote a shared understanding of cultural heritage information by providing a common and extensible semantic framework that any cultural heritage information can be mapped to. It is intended to be a common language for domain experts and implementers to formulate requirements for information systems and to serve as a guide for good practice of conceptual modeling. In this way, it can provide the "semantic glue" needed to mediate between different sources of cultural heritage information, such as that published by museums, libraries and archives. CIDOC CRM is also expressed as an RDF ontology, and so provides another candidate for DSpace life cycle management.

Both of these ontologies built on the work of the now defunct CIMI Consortium (a consortium of cultural heritage institutions and organizations that ceased operations in 2003). In the last two years efforts have been ongoing to link the two ontologies and develop a common approach to digital life cycle management that the DSpace community can leverage.

XACML, SAML, and Policy Expression Ontologies

One significant unresolved issue in grid-based, federated digital libraries and archives is the need for ways of defining, expressing, and encoding the policies which affect digital collection management. Human-definable, machine-enforceable policies must be developed to define operational requirements such as how many copies of a collection are wanted, at what institutions or in what geographic regions, with what service level agreements (SLAs), and so on.
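As a rough illustration of what a human-definable, machine-enforceable policy of this kind might look like, the following Python sketch models a replication policy (number of copies, required regions, and one SLA-like term) and checks a set of replicas against it. The class names, field names, and the audit-interval term are assumptions made for illustration only; they are not drawn from XACML, SAML, DSpace, or SRB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical collection-level policy; all field names are illustrative."""
    collection: str
    min_copies: int
    required_regions: frozenset = frozenset()
    max_hours_between_audits: int = 24 * 30  # stand-in for a single SLA term


@dataclass(frozen=True)
class Replica:
    """One stored copy of the collection, as reported by some storage site."""
    institution: str
    region: str
    hours_since_audit: int


def check_policy(policy, replicas):
    """Return a list of human-readable violations (empty if compliant)."""
    violations = []
    if len(replicas) < policy.min_copies:
        violations.append(
            f"need {policy.min_copies} copies, found {len(replicas)}")
    # Regions the policy requires that hold no copy at all.
    missing = policy.required_regions - {r.region for r in replicas}
    for region in sorted(missing):
        violations.append(f"no copy in region {region}")
    # SLA-like term: every copy must have been audited recently enough.
    for r in replicas:
        if r.hours_since_audit > policy.max_hours_between_audits:
            violations.append(f"audit overdue at {r.institution}")
    return violations
```

A curator-facing tool could display the returned list directly, while an automated process could trigger additional replication whenever the list is non-empty.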
Emerging schemas like XACML [13] and SAML [14] are a start towards defining the necessary ontologies, although more work will be needed. XACML has as its charter “to define a core schema and corresponding namespace for the expression of authorization policies in XML against objects that are themselves identified in XML. There are many proprietary or application-specific access control policy languages, but this means policies cannot be shared across different applications, and provides little incentive to develop good policy composition tools. Many of the existing languages do not support distributed policies, are not extensible, or are not expressive enough to meet new requirements. XACML enables the use of arbitrary attributes in policies, role-based access control, security labels, time/date-based policies, indexable policies, 'deny' policies, and dynamic policies — all without requiring changes to the applications that use XACML." The related Security Assertion Markup Language (SAML) standard is "an XML-based framework for exchanging security information. This security information is
expressed in the form of assertions about subjects, where a subject is an entity (either human or computer) that has an identity in some security domain.” Together these standards begin to define web-enabled security mechanisms, which are a necessary but insufficient piece of the general collection policy picture.

The indecs metadata framework [15] is used extensively in the publishing community as part of the International DOI Foundation [16] and others. It was developed as a means of flexibly managing diverse metadata for intellectual property, and particularly for commercial transactions affecting that property, but in practice the framework could apply to all digital content and any sort of transaction, including digital preservation and federation operations. The indecs community says: “Any serious approach to the problem of interoperability of metadata for IP in the network environment needs to support interoperability of at least five different types:
• Across media (such as books, serials, audio, audiovisual, software, abstract works, visual material).
• Across functions (such as cataloguing, discovery, workflow and rights management).
• Across levels of metadata (from simple to complex).
• Across semantic barriers.
• Across linguistic barriers.”

This clearly applies to the digital assets found in digital libraries and archives, and so this framework may have equal applicability in our domain.

Ponder

Finally, Ponder [17] is “a language for specifying management and security policies for distributed systems”. It was developed as part of research carried out by the Policy Research Group at Imperial College, London, into the use of policy in distributed systems management. The work is no longer active, but the findings are of potential use to a grid-enabled digital library and archives community.

Further research will be conducted to clarify requirements for life cycle management and policy expressions, to identify candidate standards for those ontologies (or create initial prototypes if necessary), and to deploy the ontologies for further assessment by digital curators.
III.2 Technologies

The prototype data grid will use version 3.3 of the SRB and DSpace 2.0, currently in development. Both systems will continue to evolve, and in the design effort we will attempt to build interfaces that can remain invariant across updates of the technologies.

DSpace 2

DSpace(tm) is a freely available, open source system originally developed by Hewlett-Packard Labs and the MIT Libraries to be used by academic research institutions to capture, archive, preserve, and make available the scholarly research material produced by their faculty and researchers. The system itself is a simple, but fully-featured, digital asset management system, including a submission system that supports complex, flexible workflows, as well as support for access control and delivering complex digital content. DSpace can serve a variety of types of organizations to manage their digital assets, but it was designed and optimized for academic research institutions to manage their digital research materials.

Since its initial release in 2002 DSpace has been deployed at many research-producing institutions (see http://wiki.dspace.org/DspaceInstances for the list of currently registered sites). As the community of adopting institutions has grown and gained experience in digital collection management and preservation, the requirements for the system’s architecture have evolved, leading to a redefined DSpace 2.0 design (http://wiki.dspace.org/DspaceTwo). As part of the first year’s collaboration between MIT and UCSD, the DSpace 2.0 storage layer abstraction was implemented and SRB was integrated as an alternative, grid-enabled data storage layer. This has given DSpace adopters considerably more scalability for digital collections, as well as, in theory, the benefit of replication and federation with each other and with non-DSpace archives internationally. In practice, this ability is hindered by the lack of a policy framework on which to base replication operations.

Further development is also needed to complete the DSpace 2.0 redesign and implementation, working collaboratively with the other institutions using the platform and helping to shape its future use. In particular, improving the data architecture to support Semantic Web technology (e.g. RDF) in addition to static RDBMS tables will significantly improve the ability of the DSpace platform to support better discovery, management, preservation, and dissemination activities.
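The idea of a storage layer abstraction with an alternative, grid-enabled back end can be sketched as follows. This is a minimal Python illustration under stated assumptions: the interface name `BitstreamStore`, its two methods, and both implementations are hypothetical and do not reproduce the actual DSpace 2.0 storage API or any SRB client call. The in-memory class merely stands in for a grid-backed store.

```python
import os
from abc import ABC, abstractmethod


class BitstreamStore(ABC):
    """Hypothetical storage interface; not the actual DSpace 2.0 API."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store the bytes under the given key."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the bytes previously stored under the given key."""


class LocalFileStore(BitstreamStore):
    """Keeps each bitstream as a file under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, key)

    def put(self, key: str, data: bytes) -> None:
        with open(self._path(key), "wb") as fh:
            fh.write(data)

    def get(self, key: str) -> bytes:
        with open(self._path(key), "rb") as fh:
            return fh.read()


class InMemoryGridStore(BitstreamStore):
    """Stand-in for a grid-backed store; a real one would talk to the data grid."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]


def ingest(store: BitstreamStore, key: str, data: bytes) -> bytes:
    """Application code depends only on the interface, not on the back end."""
    store.put(key, data)
    return store.get(key)
```

Because `ingest` is written against the interface alone, switching a collection from local to grid storage in this sketch is a matter of passing a different store object.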
In order to support the sort of data management and policy expression ontologies described above, further DSpace 2.0 work is necessary. From the beginning, DSpace has implemented the ABC Harmony ontology as part of its “History system” [11], which dutifully records significant life cycle events performed on the contents and metadata of the archive. However, no consensus has been reached in the archives community about which events are really significant or worth noting as part of their management and preservation regime. Furthermore, no tools exist to help archivists query, manipulate, or view the events logged in the History system as Harmony RDF statements, so the logged events are underutilized. This project will extend the History system under the 2.0 architecture to be fully exposed to digital archivists and to provide them with useful information in support of their curation activities. Significant redesign or extension of the History system may be needed to achieve these management goals.
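To give a sense of the kind of query tool that is currently missing, the sketch below treats logged events as simple subject-predicate-object statements and retrieves the events that touched a given item. It is a toy model in plain Python: the predicate names (`type`, `actedOn`, `atTime`), the identifiers, and the sample data are invented for illustration and are not taken from the ABC Harmony vocabulary or from the DSpace History system.

```python
# Hypothetical event statements as (subject, predicate, object) triples.
HISTORY = [
    ("event:1", "type", "Ingest"),
    ("event:1", "actedOn", "item:42"),
    ("event:1", "atTime", "2005-03-01T10:00:00Z"),
    ("event:2", "type", "MetadataEdit"),
    ("event:2", "actedOn", "item:42"),
    ("event:2", "atTime", "2005-04-15T09:30:00Z"),
    ("event:3", "type", "Ingest"),
    ("event:3", "actedOn", "item:7"),
    ("event:3", "atTime", "2005-02-11T08:00:00Z"),
]


def match(triples, s=None, p=None, o=None):
    """Yield triples matching the pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


def events_for(triples, item):
    """Return (time, event type) pairs for events acting on `item`, oldest first."""
    triples = list(triples)
    found = []
    for event, _, _ in match(triples, p="actedOn", o=item):
        kind = next(match(triples, s=event, p="type"))[2]
        when = next(match(triples, s=event, p="atTime"))[2]
        found.append((when, kind))
    # ISO 8601 timestamps in the same time zone sort chronologically as strings.
    return sorted(found)
```

A production tool would more likely query an RDF store directly; the point here is only that a small pattern-matching layer is enough to turn raw statements into a per-item event timeline that an archivist can read.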
SRB 3

We will use version 3.3 of SRB for the data management system. Ten scenarios are listed in Table 1 for federation mechanisms between SRB zones (i.e. independent metadata catalogs). The scenarios describe the types of zones, and the choices for management of selected sharing and control mechanisms.

| Zone SRB | Zone organization | Zone interaction control (zones) | Consistency management (collections) | User connection point to access files (files) | Data access control setting (files) | Metadata synchronization (metadata) | Resource sharing (resources) | User-ID sharing between zones (user names) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Free Floating Zones | Peer-to-Peer | Local Admin | User-specified data publication | From home zone | User set access controls | User controlled synchronization | None | None |
| Occasional Interchange | Peer-to-Peer | Local Admin | User specified | From home zone | User set access controls | User controlled synchronization | None | Partial |
| Replicated Data Zones | Peer-to-Peer | Local Admin | User-specified replication | From home zone | User set local access controls | User controlled synchronization | Partial | Partial, user establishes own accounts |
| Resource Interaction | Peer-to-Peer | Local Admin | User-specified replication | From home zone | User set access controls | None | Partial shared resource for replication | Partial |
| User and Data Replica Zones | Peer-to-Peer | Local Admin | User-specified replication | From home zone | System set access controls | System controlled complete synchronization | Partial | Complete |
| Replicated Catalog | Peer-to-Peer | Local Admin | | From any zone | System access controls | System controlled complete synchronization | All zones share replicated resources | System managed name conflict resolution |
| Snow Flake Zones | Hierarchical | Local Admin | System managed replication in hierarchy of zones | From home zone | System set access controls | System controlled partial synchronization | None | One |
| Master-Slave Zones | Hierarchical | Super Admin | System-managed replication to slave | From home zone | System set access controls | System controlled partial synchronization | None | One |
| Archival zones | Hierarchical | Super Admin | System-managed versioning to parent zone | From home zone | System set access controls | System controlled complete synchronization | None | Complete |
| Nomadic Zones | Hierarchical | Local Admin | User-managed replication to parent zone | From home zone | User set access controls | User controlled synchronization | Partial | One |
Table 1. Comparison of federation mechanisms

The sharing and control mechanisms include:
• Zone organization – whether a hierarchy will be imposed between the zones, with the data grid managing interactions between levels of the hierarchy, instead of relying on user controlled interactions
• Zone interaction control – for hierarchical zones, the controlled interactions may require a super administrator that imposes constraints on the local zone administrators
• Consistency management – whether the user decides which files will be replicated between zones, or whether the system automatically replicates data
• User connection point to access files – which zone should be used for the connection point. Each user is defined as a triplet {user name, domain name, home zone}. The user name can be registered into a peer zone, but the “home zone” retains its original setting.
• Data access control setting – whether the user specifies the access controls, or whether the data grid automatically sets the access controls.
• Metadata synchronization – whether the user issues commands to update metadata, or whether the data grid manages the metadata update
• Resource sharing – whether a peer zone is allowed access to a zone resource for storing data
• User-ID sharing between zones – whether the user name triplet is registered into a peer zone for a single user, some users, or all users.
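The user triplet described in the list above can be modeled directly. The Python sketch below is a minimal illustration, assuming hypothetical class and method names (`GridUser`, `Zone`, `register`, `is_home_for`) that are not part of the SRB API. It shows the property stated in the text: registering a user into a peer zone does not change the user's home zone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridUser:
    """The {user name, domain name, home zone} triplet; immutable by design."""
    user_name: str
    domain_name: str
    home_zone: str


class Zone:
    """Hypothetical zone that keeps a registry of known users."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.users = set()

    def register(self, user: GridUser) -> None:
        """Register a local or peer-zone user; the triplet is stored unchanged."""
        self.users.add(user)

    def is_home_for(self, user: GridUser) -> bool:
        """True only if the user is registered here and this is the home zone."""
        return user in self.users and user.home_zone == self.name
```

Making the triplet immutable reflects the rule that the home zone "retains its original setting": a peer zone can hold a copy of the identity, but it cannot rewrite where that identity originates.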
We view the current system (zone SRB for peer-to-peer federation) as a vehicle for exploring the possible federation mechanisms that could be used with DSpace. For interoperating between independent preservation environments, the archive community will need standards for deciding how to manage access controls when registering a digital entity into a federated catalog, and for deciding how to manage metadata update consistency when modifying a digital entity in a federated catalog. The SRB version 3.0 supports peer-to-peer federation, based upon choices for whether resources, user names, files, and metadata context will be shared between the peers. If DSpace and the SRB are viewed as peers, each controlling a preservation environment, then the amount of sharing, the choice of who controls the shared data, and the choice of who controls updates to the shared metadata must be negotiated between the DSpace system and the SRB.

The fundamental zone type is Free-floating Zones (myZone). This is a set of stand-alone zones with no parent zone. The zones can be considered peers and may have very few users and resources. Free-floating Zones can be viewed as isolated systems running by themselves (like a PC) without any interaction with other zones, with one difference: these zones occasionally "talk" to each other and exchange data and collections. This is similar to exchanging files on zip drives or CDs, or to being occasional network neighbors. Such a system has a good level of autonomy and isolation with controlled data sharing. The other types of peer-to-peer federation environments can be derived from Free-floating Zones by the selection of alternate consistency and control constraints. A major design decision will be the appropriate set of federation mechanisms between independent DSpace and SRB systems.
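The statement that other peer-to-peer federation types can be derived from Free-floating Zones by changing selected constraints can be made concrete with a small configuration sketch. The Python below is illustrative only: the `FederationConfig` class and its field names are hypothetical, the cell values are transcribed from Table 1, and the user-ID sharing value for Free Floating Zones is an assumption.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class FederationConfig:
    """One row of Table 1; field names are illustrative, not SRB settings."""
    organization: str
    interaction_control: str
    consistency: str
    connection_point: str
    access_control: str
    metadata_sync: str
    resource_sharing: str
    user_id_sharing: str


FREE_FLOATING = FederationConfig(
    organization="Peer-to-Peer",
    interaction_control="Local Admin",
    consistency="User-specified data publication",
    connection_point="From home zone",
    access_control="User set access controls",
    metadata_sync="User controlled synchronization",
    resource_sharing="None",
    user_id_sharing="None",  # assumption; see lead-in
)

# Derive other peer-to-peer types by overriding only the constraints that differ.
REPLICATED_DATA = replace(
    FREE_FLOATING,
    consistency="User-specified replication",
    access_control="User set local access controls",
    resource_sharing="Partial",
    user_id_sharing="Partial, user establishes own accounts",
)

USER_AND_DATA_REPLICA = replace(
    REPLICATED_DATA,
    access_control="System set access controls",
    metadata_sync="System controlled complete synchronization",
    user_id_sharing="Complete",
)


def changed_fields(base, derived):
    """Map each differing field name to its (base, derived) pair of values."""
    return {
        f.name: (getattr(base, f.name), getattr(derived, f.name))
        for f in fields(base)
        if getattr(base, f.name) != getattr(derived, f.name)
    }
```

Calling `changed_fields(FREE_FLOATING, REPLICATED_DATA)` lists exactly the constraints that would have to be renegotiated when two peers move from occasional exchange to managed replication, which is the kind of decision a DSpace/SRB federation agreement would need to record.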
Data Grid

In researching the amount of material in need of preservation archiving at a typical research institution it is clear that the storage requirements will be very large, and that storage solutions are needed that can scale to data grid quantities and provide distributed replication strategies for backup and preservation. Completed work has integrated the SRB and DSpace systems so that digital libraries and archives using the DSpace platform may optionally specify data grid storage, via the SRB system, as the primary mode of asset storage. Collections can then be managed by DSpace in the normal way, but this is not the ideal: curators using DSpace should be able to request DSpace/SRB to store collections locally or remotely (e.g. on grid storage available from the San Diego Supercomputer Center or at another institution). They should be able to replicate collections as many times as called for by the local preservation regimen and to as many geographic regions as desired and available. They should be able to use distributed storage for very large assets (e.g. full-length digital films or very large scientific datasets). And they should be able to specify these storage choices via standard collection policies and have those be executed and verified by DSpace and SRB.

Administrative tools for establishing the service level agreements are required. The administrative tools will need to communicate with both DSpace and the underlying SRB data grid to ensure that consistent policies are implemented.

We propose to develop a practical demonstration of these options using the chosen policy expression ontology at a small number of sites with DSpace, SRB, or grid storage services. Minimally these will include the MIT Libraries, UCSD Libraries, NARA, and the SDSC, and additional sites may be solicited as appropriate (e.g. the California Digital Library’s Digital Preservation Repository, or other institutions using DSpace). The demonstration will include storage, active curation (via the chosen life cycle management ontology), and access to the collections from the source institution’s DSpace. By this demonstration we will learn more about the requirements of curators for these systems and technologies, promote the use of SRB and the data grid in the digital library and archives communities, and advance the functionality of the DSpace and SRB software.
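One simple way in which replicated storage choices might be verified is by comparing a digest recorded at ingest against the digest of each remote copy. The Python sketch below illustrates that idea under stated assumptions: the choice of SHA-256, the function names, and the site labels are all hypothetical, and the proposal itself does not specify a verification mechanism.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def verify_replicas(expected_digest: str, replicas: dict) -> dict:
    """Map each site name to True if its copy matches the recorded digest.

    `replicas` maps a site name to the bytes retrieved from that site.
    """
    return {
        site: sha256_hex(data) == expected_digest
        for site, data in replicas.items()
    }
```

For example, with the digest of an original record stored at ingest, a call such as `verify_replicas(digest, {"site-a": copy_a, "site-b": copy_b})` reports which sites still hold an intact copy, and any `False` entry would be a candidate for repair from another replica.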
IV. Testing and Evaluation
Four significant, multi-terabyte collections have been used in the first year’s testing and evaluation, including collections from the UCSD Libraries (a large collection of approximately 200,000 digital images), the MIT Libraries (a large collection of digital theses documents), and NARA (a collection of archival material selected in collaboration with NARA staff). Additional collections chosen in collaboration with NARA will be used to test the added functionality proposed for next year’s research. We will work with NARA to acquire relevant candidate collections.

The demonstrations will include tools for specifying local storage policies, for managing collections in DSpace that are stored on the data grid via SRB, and for specifying and monitoring collection events (e.g. preservation migrations). Usability will be measured using a small focus group of digital collection managers and archivists from MIT, UC, NARA, and others as appropriate. Technical scalability and robustness of the prototype systems will be tested with appropriate metrics to determine barriers to adoption of the technology beyond the project participants.

Significant continued outreach will be a goal of this year’s work, to inform the digital library and archives community about the work of the project, about the DSpace and SRB technology in general, and about the ERA research program’s goals and findings.
V. Schedule and Deliverables
The research prototype developed in year one will be further extended and deployed through a series of demonstrators and prototypes. The project will be carried out in four phases:

Phase 1 (3 months):
• Requirements from representative digital collection managers and archivists (convene focus group and devise survey)
• DSpace 2.0 development to expose current History system via prototype UI for administrators and collection managers

Phase 2 (3 months):
• Ontology review, analysis, and selection for both life cycle management and policy expression
• Develop specifications for integration with DSpace 2 and SRB 3

Phase 3 (3 months):
• DSpace development to implement selected life cycle management and policy expression ontologies, and provide prototype UIs for use by digital collection managers and archivists
• At-scale testing of these policies using the identified test collections
• SRB development to implement SLAs specified via the policy expression ontology and provide relevant data back to DSpace

Phase 4 (3 months):
• Test and evaluate tools with focus group of digital collection managers and archivists
• Iterate prototypes, perform outreach and dissemination of results
VI. References

1. URL - http://www.gridforum.org/6_DATA/persist.htm
2. R. Moore, A. Merzky, “Persistent Archive Basic Components”, Persistent Archive Research Group, Global Grid Forum; July 27, 2002.
3. R. Moore, “The San Diego Project: Persistent Objects”, Proceedings of the Workshop on XML as a Preservation Language, Urbino, Italy, October 2002.
4. R. Moore, “Common Consistency Requirements for Data Grids, Digital Libraries, and Persistent Archives”, submitted to 12th High Performance Distributed Computing conference, Seattle, Washington, June 2003. URL http://grid.lbl.gov/GPA/GGF7_Data_Consistency.Word95.pdf
5. Arcot Rajasekar, Reagan Moore, Bertram Ludäscher, Ilya Zaslavsky, “The Grid Adventures: SDSC’s Storage Resource Broker and Web Services in Digital Library Applications”, 4th Russian Conference on Digital Libraries, Dubna, Russia, October 2002.
6. R. Moore, C. Baru, “Virtualization Services for Data Grids”, book chapter in “Grid Computing: Making the Global Infrastructure a Reality”, John Wiley & Sons Ltd, 2003.
7. National Partnership for Advanced Computational Infrastructure, http://www.npaci.edu/
8. DSpace, http://www.dspace.org/
9. Carl Lagoze and Jane Hunter, “The ABC Ontology and Model”; Journal of Digital Information, Volume 2 Issue 2, Article No. 77, 2001-11-06. http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Lagoze/
10. Descriptions of the Semantic Web Activity and the RDF data architecture are available on the W3C’s website, http://www.w3.org/RDF/
11. The DSpace History system is described on the DSpace website (http://dspace.org/technology/system-docs/functional.html#history) and in Jason Kinner, Mick J. Bass, “The History Component of the DSpace Institutional Digital Repository”; Proceedings of the IS&T Archiving Conference, April 2004, p. 71-76.
12. CIDOC Conceptual Reference Model, http://cidoc.ics.forth.gr/
13. OASIS XACML website, http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xacml
14. OASIS SAML website, http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=security
15. INDECS home page, http://www.indecs.org/
16. International DOI Foundation website, http://www.doi.org/welcome.html
17. PONDER project website at Imperial College, London, http://www-dse.doc.ic.ac.uk/Research/policies/ponder.shtml