Data Grids, Digital Libraries, and Persistent Archives: An Integrated Approach to Sharing, Publishing, and Archiving Data

REAGAN W. MOORE, ARCOT RAJASEKAR, AND MICHAEL WAN, MEMBER, IEEE

Invited Paper
The integration of grid, data grid, digital library, and preservation technology has resulted in software infrastructure that is uniquely suited to the generation and management of data. Grids provide support for the organization, management, and application of processes. Data grids manage the resulting digital entities. Digital libraries provide support for the management of information associated with the digital entities. Persistent archives provide long-term preservation. We examine the synergies between these data management systems and the future evolution that is required for the generation and management of information.

Keywords—Data grids, digital libraries, persistent archives, information management.
I. INTRODUCTION

Data grids support massive data collections that are distributed across multiple institutions. Communities such as the National Institutes of Health (NIH) Biomedical Informatics Research Network [1] (16 sites, 4 million files, 6 TB of data) promote the sharing of data between NIH-funded researchers by federating access to geographically remote storage systems.
Manuscript received March 1, 2004; revised June 1, 2004. This work was supported in part by the National Science Foundation (NSF) National Partnership for Advanced Computational Infrastructure (NPACI) under Grant ACI-9619020 (National Archives and Records Administration supplement), in part by the NSF Digital Library Initiative Phase II Interlib project, in part by the NSF National Science Digital Library under Subaward S02-36645, in part by the Department of Energy Scientific Data Management project under Award DE-FC02-01ER25486 and the Particle Physics Data Grid, in part by the NSF National Virtual Observatory, in part by the NSF Grid Physics Network, and in part by the NASA Information Power Grid. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the National Archives and Records Administration, or the U.S. government. The authors are with the San Diego Supercomputer Center, San Diego, CA 92093-0505 USA (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/JPROC.2004.842761
International collaborations such as the Worldwide Universities Network [2] (five initial sites) support the sharing of data between academic institutions in the United States and the United Kingdom. National Science Foundation (NSF)-funded Information Technology Research projects such as the Southern California Earthquake Center [3] (five sites, 1.7 million files, 91 TB of data) build digital libraries of domain-specific material for publication and use by all members of the scientific discipline. The NSF National Science Education Digital Library [4] (26 million files, 3.5 TB of data) uses data grid technology to implement a persistent archive of material that has been gathered from Web crawls. The SIOExplorer project [22] (808 000 files, 2 TB of data) manages an archive of ship logs from oceanographic research vessels. All of these projects are faced with the organization of digital entities into collections, the assignment of descriptive metadata to support discovery, and the controlled access to data that are distributed across multiple sites. All of these projects use collections to provide a context for the interpretation of their digital entities. All of these systems are based upon a generic data management infrastructure, the San Diego Supercomputer Center (SDSC), San Diego, CA, Storage Resource Broker (SRB) [5]. The management of data has traditionally been supported by software systems that assume explicit control over local storage systems (file systems) or that assume local control over information records (databases). The SRB manages distributed data, enabling the creation of data grids that focus on the sharing of data, digital libraries that focus on the publication of data, and persistent archives that focus on the preservation of data. Data grid technology provides the fundamental management mechanisms for distributed data. This includes support for managing data on remote storage systems, a uniform name space for referencing the data, a catalog for managing information about the data, and mechanisms for interfacing to the preferred access method. Digital libraries can be implemented on top of data grids
through the addition of mechanisms to support collection creation, browsing, and discovery. The underlying operations include schema extension, bulk metadata load, import and export of metadata encapsulated in XML, and management of collection hierarchies. Persistent archives can be implemented on top of data grids by the addition of integrity metadata needed to assert the invariance of the deposited material. The mechanisms provided by data grids to manage access to heterogeneous data resources can also be used to manage migration from old systems to new systems, and hence manage technology evolution. The SRB is being used as the underlying infrastructure for both digital libraries and persistent archives and is a proof in practice that common infrastructure can be used for data management. Despite the success in integrating digital libraries and data grids, significant challenges remain. The issues relate to information generation and management and can be expressed as the characterization of the criteria used to federate access across multiple data management environments. A careful explanation is needed of precisely what we mean by the terms data, information, and knowledge [10]. The data grid community defines “data” to be the strings of bits that compose a digital entity. A digital entity might represent, for example, a data file, an object in an object ring buffer, a record in a database, a URL, or a binary large object in a database. Data are stored in storage repositories (file systems, archives, databases, etc.). Meaning is assigned to a digital entity by associating a semantic label. Information consists of the set of semantic labels that are assigned to strings of bits. The semantic labels can be used to assert a name for a digital entity, assert a property of a digital entity, and assert relationships that are true about a digital entity. Information is stored in information repositories (relational databases, XML databases, flat files, etc.). The combination of a semantic label and associated data is treated as metadata. Metadata are organized through specification of a schema and stored as attributes in a relational database. The digital entities that are registered into the database comprise a collection. The metadata in the collection in turn provide the context for interpreting the significance of the registered digital entities. Grids manage distributed execution of processes. The SRB data grid manages simulation results, observational data, and derived data products. Grids and data grids are complementary technologies that together enable the creation and management of data. Digital libraries organize information in collections. Persistent archives preserve the information content of collections. Persistent archives manage the evolution of all components of the hardware and software infrastructure, including the encoding syntax standards for data models. The integration of information management is one of the next steps in the evolution of grid technology. We examine how grid technology has evolved, describe the current state of the art in data grid technology, and then demonstrate the evolution required in grid technology for the characterization of information and the integration of digital library and persistent archive technology.
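To make the earlier definitions of data, information, and metadata concrete, the sketch below (a minimal illustration using Python and SQLite; the table layout and attribute names are hypothetical simplifications, not the actual MCAT schema) shows how semantic labels can be stored as attribute-value pairs mapped onto registered digital entities, so that discovery queries operate on information rather than on physical file locations.

import sqlite3

# Hypothetical, simplified catalog: digital entities are registered into a
# collection, and semantic labels are stored as attribute-value pairs keyed
# by the entity's logical name.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE digital_entity (
    logical_name TEXT PRIMARY KEY,   -- location-independent identifier
    collection   TEXT NOT NULL       -- collection that provides the context
);
CREATE TABLE metadata (
    logical_name TEXT REFERENCES digital_entity(logical_name),
    attribute    TEXT NOT NULL,      -- the semantic label
    value        TEXT NOT NULL       -- the data the label is asserted over
);
""")
con.execute("INSERT INTO digital_entity VALUES ('/nvo/2mass/img001', '2MASS')")
con.executemany("INSERT INTO metadata VALUES (?, ?, ?)",
                [("/nvo/2mass/img001", "physical_variable", "flux"),
                 ("/nvo/2mass/img001", "units", "Jy")])

# Discovery: find digital entities by semantic label, not physical location.
rows = con.execute(
    """SELECT e.logical_name FROM digital_entity e
       JOIN metadata m USING (logical_name)
       WHERE m.attribute = 'units' AND m.value = 'Jy'""").fetchall()
print(rows)  # [('/nvo/2mass/img001',)]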
Table 1 Evolution of Grid Functionality
An integrated environment for the generation, publication, sharing, and preservation of information is the next step in grid infrastructure.

II. GRID EVOLUTION

One approach to understanding the current state of grid services is to look at how grid technology has evolved over the last four years [6]. The original grid environments assumed that applications directly accessed remote data that were stored under the user’s Unix ID, that data would be pulled to the computation, that accesses could be based upon physical file names, and that the applications would access data through library calls. Generalizations now exist for each of these functions, typically implemented as naming indirection abstractions. In Table 1, the evolution path is shown for each function. The left-hand column represents the original grid approach, the middle column represents functions provided by current digital library and persistent archive technology, and the right-hand column defines the capability enabled by the new function. Each of the evolutionary steps required the specification of a new naming convention for resources, users, files, collections, and services. The naming convention made it possible for a community to create uniform labels for accessing remote data and resources. The aggregation of the naming conventions is called a virtual organization [56]. A virtual organization is created to meet the needs of a particular group, project, or institution. It is quite possible for virtual organizations to implement different naming conventions. The naming conventions are assigned by a set of criteria specific to each virtual organization. The criteria might depend upon cultural considerations (status of a person within a project), organizational considerations (site that owns a resource), or choice of infrastructure (software systems used to implement the name space). The assignment of names corresponds to the creation of a new semantic label for each entity. The creation of the semantic label is an assertion by each virtual organization that the associated criteria have been met.
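The naming indirection that underlies each of these steps can be sketched as follows (hypothetical data structures for illustration only, not SRB source code): a logical name owned by the virtual organization carries the distributed state information for each physical copy, so users and applications never need to know physical locations.

from dataclasses import dataclass, field

@dataclass
class Replica:
    # Distributed state information mapped onto the logical name.
    site: str             # storage system holding this copy
    protocol: str         # access protocol required by that site
    physical_path: str    # physical file name at the site
    size: int

@dataclass
class LogicalName:
    name: str                                   # persistent, location independent
    replicas: list = field(default_factory=list)

entry = LogicalName("/birn/subject42/scan.dat")
entry.replicas.append(Replica("sdsc-hpss", "hpss", "/hpss/birn/scan.dat", 10**9))
entry.replicas.append(Replica("jhu-disk", "posix", "/data/birn/scan.dat", 10**9))

# Any replica can satisfy an access by logical name; the grid picks one.
print(entry.replicas[0].site)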
Federation is the sharing of resources, user names, files, and metadata between grids. When grids are federated, the underlying assumptions governing the creation of the name spaces must be integrated. The name space integration is possible if the assumptions underlying the application of the naming convention are compatible. The future evolution of the grid will strongly rely upon the use of information management technologies that can express the criteria used to assign semantic labels. An additional observation is that the driving motivation for many of the grid evolutionary steps has been the need to manage the results created by services, in addition to managing the execution of the services. Digital libraries and persistent archives focus on the management of the data that result from the application of services. They define a context that includes the state information that results from all processes performed upon a digital entity and organize the digital entities into a collection. For grid technology to support end-to-end data management applications, it will need to incorporate digital library information management capabilities as well as persistent archive technology management capabilities.

III. INTEGRATING DIGITAL LIBRARIES AND DATA GRIDS: SPANNING THE INFORMATION DIVIDE

A major research issue in data grids and digital libraries is the integration of knowledge management systems with existing data and information management systems. Knowledge management is needed to support constraints that are applied in the federation of data grids and in semantic crosswalks between digital libraries. A growing number of communities [from astronomy (NVO [7]) to neuroscience (Biomedical Informatics Research Network (BIRN) [1]) to ecology (SEEK [8]) to geology (GEON [9])] are developing grids and digital portals for organizing, sharing, and archiving their scientific data. At SDSC, we have seen these and other communities specify diverse and sometimes orthogonal requirements for managing and sharing their data. The integration of constraint-based knowledge management technology with existing state-of-the-art data grids, digital libraries, and persistent archives requires equivalent support for relationship-based constraints across all three environments. The application of constraints for the integration of data grids and digital libraries will be an essential part of cyberinfrastructure. The assignment of a semantic label to a digital entity requires the application of a processing step. A set of relationships or logical rules is used to assert the application of the semantic label. An example is the naming of the fields within a binary file. For a scientific data set, one might attach the following types of semantic labels to a field:

• name of the physical variable that is represented by the file (asserted by the structural order of the fields within the data set);
• units associated with the physical variable (asserted by a choice of metrics for the field);
• data model by which the bits are organized (asserted as, say, a column-ordered or row-ordered array);
• structural mapping implied by the data model (asserted as the type of geometry and coordinate system);
• spatial mapping imposed on the data model (asserted through the number of spatial dimensions);
• procedural mapping imposed on the data model (asserted through the name of the last processing step).

Each digital entity may have multiple semantic labels that are used to characterize its meaning. Of interest is the fact that a semantic label typically represents the application of multiple relationships. The assertions behind the application of a semantic label can be used to define a context for the semantic label, essentially an information context. Knowledge is the expression of relationships between semantic labels. Relationships are typically typed as logical (“is a”; “has a”), structural (existence of a structure within the string of bits), spatial (mapping of a string of bits to a coordinate system), temporal (mapping to a point in time), procedural (mapping to process results), functional (mapping of features to evaluation algorithms), and systemic (properties that cover all members of a collection). The management of knowledge requires the ability to describe, organize, and apply relationships. Knowledge generation is closely tied to the processing of data. Each semantic label is the result of the application of a process (a set of relationships and rules) that determines whether or not the semantic label can be applied to a given digital entity. The rules and relationships can be interpreted as “constraints.” Information is created by the application of constraints appropriate for a given community. One can view the creation of derived information (new semantic labels or new data sets) from a given data collection as the application of rules and relationships. Each type of knowledge constraint can be given a name and associated with a digital entity as a semantic label. The digital library community encapsulates knowledge constraints in the curation processes that are applied when a collection is assembled. The preservation community encapsulates knowledge constraints in the archival processes that are applied when the archival collection is created [11]–[13]. The data grid community characterizes knowledge constraints as applied processes or functions that transform digital entities into derived data products [14]. In each case, the process is encapsulated as “rigidly built” software that is applied to digital entities. A major change in perspective is needed when dealing with sociological imperatives that arise from interactions between independent groups of researchers. Each group has its own set of assumptions about the set of constraints that should be applied for the creation of a specified semantic label or for a specified action to be performed to create a derived data product. Current technology provides no way to specify such relationships. A major change in data and information infrastructure is needed to associate knowledge constraints with each assertion of a semantic label. The result will be the ability to compare the intended semantic meaning between research groups when a process is applied in a data grid, digital library, or persistent archive. In practice, the requirement for management of knowledge constraints is pervasive even within the data management infrastructure itself. A simple example is the federation of digital libraries or data grids. Federations provide mechanisms to share storage resources, digital entities, user identities, and
information about the digital entities. Constraints are needed to enforce controls on interactions between the federated data management systems, both for access and for consistency. The constraints constitute relationships or rules that must be evaluated each time the shared item is accessed. For the digital library community, access constraints include digital library crosswalks that define how semantic labels within one community may be mapped to semantic labels used by another community. The preservation community associates authenticity metadata with each digital entity, which constitutes an assertion about the archival processes that have been applied. By keeping track of all of the archival processes that have been applied, assertions can be made about the lineage of a digital entity, and whether it continues to represent the original digital entity that was deposited into the preservation environment. It is possible to build a static system in which the knowledge relationships are specified in software and applied at the time of access. This is the approach used in current data grid technology. When constraints change over time, or when collections are federated, the dynamic application of changing constraints becomes essential to avoid having to rewrite software. Knowledge management technology will be viable when it is possible to change the relationship assertion behind the creation of an information label, either to apply an updated form of the relationship or to apply the relationship assumed by another group that is now viewing the data. The ability to federate data grids, digital libraries, and persistent archives strongly depends upon the ability to dynamically apply the knowledge relationships expected by each group participating in the federation. The context used to describe digital entities consists of the semantic labels (information) that are assigned to each digital entity. The context used to describe a semantic label consists of the relationships (knowledge) used to assert the application of the semantic label. Traditionally, the assertions used to apply a semantic label are characterized as relationships, organized as an ontology, and managed in a knowledge base or concept space. The information context is a generalization of a semantic label, allowing the multiple properties that are represented by the semantic label to be expressed. The integration of grids (support for application of processing steps) with digital libraries (support for managing the semantic labels assigned as a result of the processing steps) provides the simplest approach to the creation of a true information management system. The SRB provides a common data management infrastructure for integrating data grids, digital libraries, and persistent archives. The ability to characterize the relationships that underlie the assignment of a semantic label constitutes an integral part of the information management infrastructure. The ability to characterize the information context behind a semantic label is needed to build the next-generation information management systems.
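A crosswalk of the kind discussed above can be expressed as data rather than as “rigidly built” software, so that the mapping can be swapped or updated at access time when constraints change. The following minimal sketch (the label names and transforms are hypothetical) illustrates the idea.

# A crosswalk expressed as data: each entry maps a semantic label used by one
# community onto the label (and, where needed, a value transform) used by
# another. Because the mapping is data, it can be replaced without rewriting
# the software that applies it.
crosswalk = {
    "creator": ("author", lambda v: v),
    "temporal_coverage": ("observation_date", lambda v: v[:10]),  # keep YYYY-MM-DD
}

def translate(record, mapping):
    """Re-express a metadata record using the target community's labels."""
    out = {}
    for label, value in record.items():
        if label in mapping:
            target, transform = mapping[label]
            out[target] = transform(value)
        else:
            out[label] = value  # no assertion exists; pass the label through
    return out

print(translate({"creator": "SIO", "temporal_coverage": "1997-05-02T00:00"},
                crosswalk))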
Table 2 Example Projects Using the SRB Technology
IV. SRB—INTEGRATION OF DIGITAL LIBRARIES AND DATA GRIDS

SDSC has collaborated extensively with the communities listed in Table 2 on the development of data and information management technology. A generic data management system, called the Storage Resource Broker, was developed; it is used to build digital libraries for the publication of data, data grids for the sharing of data, and persistent archives for the preservation of data [5], [15], [16], [55]. Because the SRB supports the capabilities required by all of the listed projects, the SRB has become the most advanced data management system in production use in academia for the organization and management of distributed data. The SRB is used extensively within the National Partnership for Advanced Computational Infrastructure (NPACI) project [17], with over 350 TB of data stored under the management of the SRB at SDSC, comprising over 50 million files. Supported projects include computer science, scientific discipline collections, education initiatives, and international collaborations. NPACI computational science researchers use the system for data sharing (one user registered over 500 000 files into the system to build a logical name space
that he shared with his students while he was on sabbatical), and for data publication (one user registered over 0.5 TB of data as a digital library for Web-based discovery and access). Many groups are using the SRB to support replication of data onto the TeraGrid for bulk data analysis (2-Micron All Sky Survey [18], Digital Palomar Observatory Sky Survey [19]). Other groups access the SDSC archive (Joint Center for Structural Genomics beam line data [20], Alliance for Cell Signaling microarray data [21]), or build data sharing environments (Scripps Institution of Oceanography voyage logs [22], GPS sensor data archiving [23], and Long Term Ecological Reserve data grid for collection federation [24]). The projects include international collaborations that are installing data grids that span multiple countries (Worldwide Universities Network [2], the Compact Muon Solenoid high energy physics experiment [25], and the BaBar high energy physics experiment [26]). The latter project relies upon federation of data grids to meet sociological requirements on data distribution and sharing. The implementation of the SRB [16] technology for use within the NPACI data grid required the development of fundamental virtualization mechanisms [41]. A storage repository virtualization was created that defined the set of operations that can be performed on any storage system. The abstraction includes Unix file operations (create, open, close, unlink, read, write, seek, sync, stat, fstat, mkdir, rmdir, chmod, opendir, closedir, and readdir). Additional remote operations were implemented for latency management and metadata manipulation. Drivers were implemented to map from the storage repository abstraction to the protocol required by Unix file systems (Linux, AIX, Irix, Solaris, Sun OS, Mac OS X, Unicos), by Windows file systems, by archives (HPSS, Unitree, ADSM, DMF), database blobs (Oracle, DB2, Sybase, SQLServer, Postgres, Informix), object ring buffers, storage resource managers, FTP sites, GridFTP, and tape drives [42]. A data virtualization mechanism was implemented to support collections that spanned multiple storage repositories. A logical name space provides a persistent, infrastructure-independent naming convention. The logical name space is organized as a collection hierarchy, permitting the management of administrative, descriptive, and authenticity metadata for each digital entity registered into the data grid. An information repository virtualization was defined for manipulating collections that are stored in databases [43]. The abstraction consists of the operations needed to add new metadata attributes, automate SQL generation, support template-based metadata generation, support bulk metadata load, support distributed joins across databases via token-based semantic interoperability, support metadata formatting into XML or HTML files, etc. A service virtualization was defined for the set of operations that a user could initiate, equivalently the services provided by the SRB data grid [44]. From the service abstraction, it is possible to map to any preferred access mechanism, including C library calls, C++ library calls, Unix shell commands, Python shell commands, Perl shell commands, Windows browsers, Web browsers, Java, the WSDL/SOAP interface [45], [46], the Open Archives Initiative interface, etc.
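The storage repository virtualization described above can be pictured as an abstract driver interface. The sketch below (hypothetical class names and only a small subset of the operations listed earlier) shows how a Unix file system driver plugs in beneath the abstraction; an HPSS or database-blob driver would implement the same interface, so the layers above never see the difference between storage protocols.

import os
from abc import ABC, abstractmethod

class StorageDriver(ABC):
    """A small subset of the storage repository abstraction: every storage
    system (file system, archive, database blob store) implements these."""
    @abstractmethod
    def open(self, path: str, mode: str): ...
    @abstractmethod
    def read(self, handle, nbytes: int) -> bytes: ...
    @abstractmethod
    def stat(self, path: str): ...

class PosixDriver(StorageDriver):
    # Maps the abstraction directly onto Unix file operations.
    def open(self, path, mode):
        flags = os.O_RDONLY if mode == "r" else os.O_WRONLY | os.O_CREAT
        return os.open(path, flags)
    def read(self, handle, nbytes):
        return os.read(handle, nbytes)
    def stat(self, path):
        return os.stat(path)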
The result is an interoperability environment that lets researchers apply their preferred access mechanisms to any of the resources for which SRB drivers have been created [47]. The service abstraction was used to implement latency management operations (prefetch, cache, stage, stream, replicate, data aggregation in containers, metadata aggregation in XML files [48], and I/O command aggregation through remote proxy execution), and all of the operations supported by the storage and information repository virtualizations. The projects listed in Table 2 required the ability to support processing at the remote storage systems where the data were located. An interesting view of data grid technology is realized by examining the different types of remote processing operations that were required.

National Aeronautics and Space Administration (NASA) Information Power Grid—“traditional” data grid [32]. Bulk operations are used to register files into the grid. Containers are used to package (aggregate) files before loading into an archive. Transport operations are specified through logical file names.

NASA Data Management System/Global Modeling and Assimilation Office—data grid [37]. The logical name space is partitioned across multiple physical directories to improve performance. The OpenDAP access protocol [57] was ported on top of the SRB.

Department of Energy (DOE) Particle Physics Data Grid (PPDG)/BaBar high-energy physics experiment—data grid [26]. Bulk operations are used to register files, load files into the data grid, and unload files from the data grid. A bulk remove operation has been requested to complement the bulk registration operation. Staging and status operations are used to interact with a hierarchical storage manager.

National Virtual Observatory (NVO)/United States Naval Observatory-B—data grid [7]. Registration of files is coordinated with the movement of grid bricks. Data is written to a disk cache locally (grid brick). The grid brick is physically moved to a remote site, where bulk registration and bulk load are invoked on the grid brick to import the data into the data grid.

NSF/NPACI—data grid [17]. Containers are used to minimize the impact on the archive name space for large collections of small files. Remote processes are used for metadata extraction. The seek operation is used to optimize paging of data for a four-dimensional visualization rendering system. Data transfers are invoked using server-initiated parallel I/O to optimize interactions with the HPSS archive. Bulk registration, load, and unload are used for collections of small data. Results from queries on databases are aggregated into XML files for transport.

NIH/BIRN—data grid [1]. Encryption and compression of data are managed at the remote storage system as a property of the logical name space. This ensures privacy of data during transport.
Fig. 1. NVO architecture.
NSF/Real-time Observatories, Applications, and Data management Network (ROADnet)—data grid [23]. Queries are made to object ring buffers to obtain result sets.

NSF/Joint Center for Structural Genomics—data grid [20]. Parallel I/O is used to push experimental data into remote archives, with data aggregated into containers.

NVO/2-Micron All Sky Survey—digital library [18]. Five million images are aggregated into 147 000 containers for storage in an archive. An image cutout service is implemented as a remote process, executed directly on the remote storage system. A metadata extraction service is run as a remote process, with the metadata parsed from the image file headers and aggregated before transfer.

NVO/Digital Palomar Observatory Sky Survey—digital library [19]. Bulk registration is used to register the images. An image cutout service is implemented as a remote process, executed directly on the remote storage repository.

NSF/Southern California Earthquake Center—digital library [3]. Bulk registration of files is used to load simulation output files into the logical name space (1.5 million files generated in a simulation using 3000 time steps).

National Archives and Records Administration (NARA)—persistent archive [39]. Bulk registration, load, and unload are used to access digital entities from Web archives. Containers are used to aggregate files before storage in archives. Transport operations are automatically forwarded to the appropriate data grid for execution through peer-to-peer federation mechanisms.

NSF/National Science Digital Library (NSDL)—persistent archive [4]. Bulk registration, load, and unload are used to import digital entities into an archive. Web browsers are used to access and display the imported data, using HTTP.

The SRB is the underlying data management technology in each of these projects. However, each project integrates the SRB with additional systems to create the final data management system. The resulting architectures typically have components similar to those used in the NVO environment.
Fig. 1 lists the components that are used to implement the NVO architecture. The components include:

• portals that provide a user interface to the NVO services;
• a registry for publishing the existence of NVO services;
• Web-based services that implement interactive data manipulation or analysis tasks;
• workflow environments for support of processing pipelines;
• the SRB data grid for access to the storage repositories;
• grid software for distributed computation;
• catalogs and image archives of sky surveys;
• storage systems and archives.

Data grid technology is the interface between the storage systems, image archives, dataflow environments, and Web-based services. What is immediately obvious is that multiple types of interfaces must be supported. The data grid translates from the protocols used by the storage repositories to the access mechanisms used by a particular data manipulation environment. Thus the data grid serves as an interoperability mechanism. In the NVO architecture, the compute services are used for bulk operations performed on entire data collections. The data services provide interactive access. The SRB continues to be the leading data management environment. The concepts implemented and proven in the SRB are now being used by practically all other data grid implementations. These concepts include the use of a federated client-server architecture to manage interactions with heterogeneous physical resources, use of a logical name space to build global location-independent identifiers, mapping of attributes onto the logical name space to manage service state information, and use of access controls on digital entities to manage interactions with collection- or community-owned data. Explicit services developed within the SRB for replication, aggregation of data into containers, support for user-defined metadata, role-based access controls, and ticket-based authentication are now being implemented in other data grids, including the Globus toolkit [49].

V. DATA MANAGEMENT CONCEPTS

A generic approach has been pursued at SDSC to identify the fundamental distributed data management concepts.
The concepts are best illustrated in terms of data grid terminology, but they can also be readily applied to digital libraries and persistent archives. Distributed data management proceeds by the creation of logical name spaces that are used to assign global persistent identifiers to digital entities, users, resources, and applications. The logical name spaces provide a location-independent naming convention. Grid services map distributed state information to the logical names as attributes. An example is a mapping from a logical digital entity name to a physical file name to support replication. Each replica is represented by the site where it is stored, the access protocol needed to interact with the site, the creation time, the file size, etc. Grids are implemented as middleware [50], which manages the distributed state information for each service. The management of consistency constraints on the mappings that are applied to the logical name space becomes important when two independent data grids want to share data. Unless both data grids can specify the constraints that have been applied to the mappings, inconsistencies will occur in the management, characterization, and manipulation of the digital entities. An example is peer-to-peer federation of data grids. How does one impose access controls on data that have been copied into another data grid? Can one build a system in which access controls are a property of the digital entity, rather than of the storage repository? The SRB implements this concept by imposing multiple levels of constraints on the logical name spaces. Users are represented by a logical user name space managed by the SRB. Users authenticate their identity to the SRB as a distinguished name within the user name space. A mapping is imposed on the file logical name space through access controls for each digital entity, specifying an access role for each distinguished user name. The SRB imposes access constraints by storing digital entities under an SRB Unix ID. User access to data is then accomplished by authenticating the user to the SRB, checking that the user has access permission based upon the mapping that is maintained by the SRB, authenticating the SRB’s access to a remote storage system, and retrieving the digital entity through the SRB data handling system. The user interacts with the SRB, which then serves as the proxy for interacting with the remote storage systems. When data is moved to another location, the access controls remain managed by the SRB as a property of the data. The access controls do not change when data is moved. This approach works very well for building scientific data collections, for sharing data within an organization, and for publishing data on the Web.
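The four-step access path just described can be summarized in a short, self-contained sketch (all names, credentials, and mappings below are hypothetical stand-ins for SRB internals, not actual SRB code).

# Minimal sketch of the proxy-based access path: authenticate the user,
# check the role mapped onto the logical name space, then let the broker
# itself fetch the bits from the remote storage system.
USERS = {"/C=US/O=NPACI/CN=alice": "secret"}                       # logical users
ACLS = {"/scec/run7/out.dat": {"/C=US/O=NPACI/CN=alice": {"read"}}}
REPLICAS = {"/scec/run7/out.dat": ("sdsc-hpss", "/hpss/scec/out.dat")}

def srb_get(user_dn, password, logical_name):
    # 1) Authenticate the user to the broker's logical user name space.
    if USERS.get(user_dn) != password:
        raise PermissionError("authentication failed")
    # 2) Check the access role mapped onto the logical name space; the
    #    control is a property of the digital entity, not of the storage.
    if "read" not in ACLS.get(logical_name, {}).get(user_dn, set()):
        raise PermissionError("no read permission")
    # 3) The broker authenticates to the remote storage system itself, and
    # 4) retrieves the bits through its data handling system (stubbed here).
    site, physical_path = REPLICAS[logical_name]
    return f"<contents of {physical_path} at {site}>"

print(srb_get("/C=US/O=NPACI/CN=alice", "secret", "/scec/run7/out.dat"))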
Table 3 Types of Constraints for Federation of Collections
Data grids can be viewed as systems that manage and manipulate consistency constraints on mappings of distributed state information. Digital libraries add mappings to manage user-defined metadata to support discovery and browsing. Persistent archives add mappings to manage the authenticity of the deposited digital entities [51]. Attributes are added to record all operations on the data, and signatures or checksums are assigned to prove that the original bits have not been unexpectedly changed. The three types of data management systems can be viewed as defining multiple levels of aggregation semantics (constraints) upon collections of digital entities. At the same time, each level of aggregation is managed by a set of constraint relationships. We recognize the multiple levels of aggregation and associated constraints, shown in Table 3, that are needed to specify sociological requirements [52].

VI. GRID IMPLEMENTATIONS

Middleware was originally proposed as the software infrastructure that manages distributed state information resulting from distributed services [53]. A newer definition is that middleware is the software infrastructure that manages information flow between processes and distributed collections.¹ The concepts underlying this interpretation are the following.

• Computations are executed to generate data.
• Output from computation represents a quantifiable prediction that can be compared with either observations or other computation results.
• Organization of computation results into collections makes it possible to associate a context with the simulation output. The context consists of metadata attributes that are chosen by the collection creator. Each discipline can implement a separate context, which represents the set of information that will be used by researchers within the discipline. The same computational result can be stored in multiple collections with different choices for the information context. A digital entity becomes useful when a context is provided that defines how to interpret the digital entity. Without a context, digital entities are just meaningless bit strings.
• Digital entities within a collection that are never accessed are useless.
• Information and data movement (context and content) from collections to processes represents the access and use of the results from the original computation or observation.
• The end goal of computation is to facilitate the advancement of knowledge through a better understanding of how to simulate reality. The comparison of simulation output with observations is a fundamental part of knowledge generation. Data is useful when it is being moved and analyzed.

¹Based on an observation by D. Petravick that data is only relevant when it is moving ([email protected]).
Table 4 Comparison of Grid and Digital Library Approaches to Context Management

This view of data management systems as mechanisms to facilitate information flow is feasible if the underlying functionality provided by grids allows the association of state information with the output files. This raises the issue of grid software implementation. Grids focus on execution of access services. Digital libraries focus on management of the results. The driving concepts behind the two approaches are listed in Table 4. The grid approach manages the application of processes. The digital library approach manages the data and information that are created. The approaches are complementary.

VII. GRIDS AND DIGITAL LIBRARIES

Given these characterizations of knowledge generation and management, we can examine why grid technology will undergo further evolution. In Table 5, additional evolutionary steps are defined. If we examine the starred items in Table 5 in terms of the contrasting approaches between grid and digital library information management, we can predict the new functionality that will need to be implemented in grid services. For each category, we define additional grid services that will be needed for supporting digital libraries and knowledge generation systems.

Table 5 Grid Evolutionary Steps

A. Federated Name Spaces
A global name space for files is used in the European data grid to assert equivalence of digital entities across service catalogs. Separate service catalogs are used to manage the state information that results from each service. Thus, a replica location service manages the location of replicas, a community authorization service manages access controls on the replicas, and a metadata catalog service lists descriptive metadata for the replicas. Each catalog manages the results from all applications of that particular service. A global unique identifier is used to map entries in each service catalog to a particular digital entity. In the digital library community, state information is mapped onto a logical identifier that is associated with each digital entity and organized as metadata in a collection. The decision to create a collection is independent of the set of services that were applied to the digital entities. The collection asserts relationships between digital entities by annotating digital entities with metadata attributes. The digital library community constructs union catalogs to federate access across collections. Equivalent federation mechanisms are needed to federate each of the logical name spaces managed by data grids, including name spaces for files, users, and resources. Federation extends the original naming indirection mechanisms developed for grids to support access across independently assembled collections of computational results. Interoperability between different virtual organizations (which define their own logical name spaces) is managed by services that implement constraints on the sharing of the logical name spaces. Examples are the registration of files from one virtual organization into the name space of a second virtual organization, the registration of a user name from one virtual organization into a second virtual organization, the sharing of storage resources between virtual organizations, and the sharing of metadata between virtual organizations. Federation mechanisms manage the sharing constraints.
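The service-catalog model can be sketched as follows (hypothetical catalogs and identifiers): each service manages its own state, and the global unique identifier ties the per-service entries back to a single digital entity, much as a union catalog assembles one record from several collections.

# Each grid service keeps its own catalog; a global unique identifier (GUID)
# ties the entries together. All identifiers and values are illustrative.
replica_location = {"guid-0042": ["srb://sdsc/cms/run1138.dat",
                                  "gsiftp://cern/cms/run1138.dat"]}
authorization = {"guid-0042": {"/O=CMS/CN=bob": {"read"}}}
descriptive_md = {"guid-0042": {"experiment": "CMS", "run": 1138}}

def entity_view(guid):
    """Assemble one federated record for a digital entity from the
    independent per-service catalogs, union-catalog style."""
    return {
        "replicas": replica_location.get(guid, []),
        "acl": authorization.get(guid, {}),
        "metadata": descriptive_md.get(guid, {}),
    }

print(entity_view("guid-0042"))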
B. Processing Pipeline

The Grid provides workflow processing systems that specify each service that is applied to a digital entity. Control mechanisms are applied to the services to specify their completion status. A dataflow environment focuses on the digital entities and applies control mechanisms to the digital entities that are processed by the services. An example of a dataflow environment is the execution of a query on a collection, followed by the processing of the result set. The processing status of each digital entity in the result set is maintained. A workflow environment typically knows in advance the names of the digital entities and the processing steps that will be applied to each digital entity. The dataflow environment allows the digital entities to be identified as part of the dataflow and provides controls to allow looping, conditional execution, and branching based upon the results of each service. The output from the dataflow may be stored in a collection or consumed by another dataflow or a device (such as video streaming). The collection context includes the state information that is generated by the application of the services. The design of appropriate dataflow control mechanisms is an integral part of access to distributed data. Operations may be performed more efficiently at a remote storage system when the result is the movement of a smaller amount of data over the network. Processing mechanisms have been incorporated into data access by the database community. Equivalent functionality will need to be supported by the grid community to improve performance. The grid will need to support the application of processes at remote storage locations. The scheduling of dataflows as combinations of processing at compute resources and at storage resources requires the specification of the complexity of the data processing steps (the number of operations to be performed per byte of data). Processes with a small complexity are executed at the storage system. Processes with a large complexity are performed most rapidly by moving the data to a supercomputer. The decision for where to execute a process can be characterized as an execution constraint that is evaluated during the dataflow. This leads to the definition of a dataflow environment in terms of two sets of constraints.

1) The set of relationships and rules that govern the processing of the digital entity. This is equivalent to identifying the processing steps required to generate a derived data product.
2) The set of execution constraints that control where the processing will take place, and the order of execution of the processes.

Given the set of constraints, knowledge management technology can be used to describe each of the processing steps, associate the information with the derived data products, and manage the information in a collection.
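The execution constraint in item 2) can be reduced to a simple placement rule, sketched below; the threshold value is an illustrative assumption, not a measured figure.

# Choose where a processing step runs by comparing its computational
# complexity (operations per byte of data) against a threshold.
OPS_PER_BYTE_THRESHOLD = 100.0  # hypothetical crossover point

def choose_execution_site(operations, nbytes):
    complexity = operations / nbytes
    if complexity < OPS_PER_BYTE_THRESHOLD:
        # Cheap per byte: ship the process to the data (remote proxy).
        return "storage-system"
    # Expensive per byte: moving the data to a supercomputer pays off.
    return "compute-resource"

print(choose_execution_site(operations=5e8, nbytes=10**9))   # storage-system
print(choose_execution_site(operations=5e12, nbytes=10**9))  # compute-resource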
C. Consistency Management

The major component missing from grid technology is the ability to maintain consistency between content and context when multiple services are invoked. Consistency management is complicated by the desire of different data management communities to impose different constraints on the data manipulation. Constraint-based consistency management will be required to implement end-to-end applications such as persistent archives [54]. Persistent context can only be maintained if the state information that results from each grid service is consistently updated in a preservation catalog. Constraint-based consistency management is an example of the application of rules and relationships to the execution of the grid services themselves. A simple example is the specification of the order in which grid services must be applied for state information to be valid when replicating data. Before the existence of a replica is recorded, the creation of the copy must be completed. More sophisticated examples occur in the federation of data grids. Two data grids may establish criteria under which a user in the first data grid may access data in the second data grid. The second data grid may require that the user be authenticated by the first data grid on every access. The second data grid cannot apply its access controls until the first data grid has verified the identity of the user.

D. Information Flow

The challenge in managing information flow is that coordination is required between services. When result sets are manipulated instead of individual files, it may be appropriate to do processing at the remote storage location for some of the digital entities, but it may be more efficient to move another member of the result set to a compute node for processing. Information flow imposes a generality of solution that is not available with current grid technology. It requires all of the concepts that have been discussed:

• federated name spaces for operations across collections and data grids;
• mapping of state information to each digital entity;
• organization of digital entities into collections, with the collection defining the information context that will be maintained;
• consistency management mechanisms for updating the state information that results from the application of multiple services.

A fundamental change for grids is the ability to define a context that can be managed independently of grid services. Grid environments that support data management will evolve to provide the following services:

• application of consistency constraints;
• storage of the consistency constraints in knowledge repositories;
• a knowledge repository virtualization mechanism, for the management of knowledge constraints in different vendor knowledge repository products;
• knowledge virtualization, to provide a uniform naming convention for the management of consistency constraints that are stored in multiple knowledge repositories.

VIII. CONCLUSION

The integration of data grids, digital libraries, and persistent archives is forcing continued evolution of grid technology. Grids have been evolving through the addition of naming indirection mechanisms. The ability to manage information context will require further evolution of grid technology and the ability to characterize the assertions behind the application of the grid name spaces. The result will be the ability to manage the consistency of federated data collections while flowing information and data from digital libraries through grid services into preservation environments.

REFERENCES

[1] BIRN—The Biomedical Informatics Research Network [Online]. Available: http://www.nbirn.net
[2] WUN—Worldwide Universities Network [Online]. Available: http://www.wun.ac.uk/
[3] SCEC—Southern California Earthquake Center Community Digital Library [Online]. Available: http://www.sdsc.edu/SCEC/
[4] NSDL—National Science Digital Library [Online]. Available: http://www.nsdl.org/
[5] SRB—The Storage Resource Broker Web Page [Online]. Available: http://www.npaci.edu/DICE/SRB/
[6] R. Moore, “Evolution of data grid concepts,” presented at the Global Grid Forum 10 Workshop: The Future of Grid Data Environments, Berlin, Germany, 2004.
[7] NVO—National Virtual Observatory [Online]. Available: http://www.us-vo.org/
[8] SEEK—Science Environment for Ecological Knowledge [Online]. Available: http://seek.ecoinformatics.org/
[9] GEON—Geosciences Network [Online]. Available: http://www.geongrid.org
[10] R. Moore, “Preservation of data, information, and knowledge,” presented at the World Library Summit, Singapore, 2002.
[11] R. Moore, C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A. Gupta. (2000, Mar./Apr.) Collection-based persistent digital archives—Parts 1 and 2. D-Lib Mag. [Online]. Available: http://www.dlib.org/dlib/march00/moore/03moorept1.html
[12] R. Moore, “Knowledge-based persistent archives,” presented at the La Conservazione Dei Documenti Informatici Aspetti Organizzativi E Tecnici, Rome, Italy, 2000.
[13] A. Rajasekar, R. Marciano, and R. Moore, “Collection based persistent archives,” in Proc. 16th IEEE Symp. Mass Storage Systems, 1999, pp. 176–184.
[14] R. Moore, C. Baru, P. Bourne, M. Ellisman, S. Karin, A. Rajasekar, and S. Young, “Information based computing,” presented at the Workshop Research Directions for the Next Generation Internet, Washington, DC, 1997.
[15] C. Baru, R. Moore, A. Rajasekar, and M. Wan, “The SDSC storage resource broker,” presented at the CASCON’98 Conference, Toronto, ON, Canada, 1998.
[16] A. Rajasekar, M. Wan, and R. Moore, “mySRB and SRB, components of a data grid,” presented at the 11th High Performance Distributed Computing Conf., Edinburgh, U.K., 2002.
[17] NPACI Data Intensive Computing Environment Thrust Area [Online]. Available: http://www.npaci.edu/DICE/
[18] 2MASS—Two Micron All Sky Survey [Online]. Available: http://www.ipac.caltech.edu/2mass/
[19] DPOSS—Digital Palomar Sky Survey [Online]. Available: http://www.sdss.jhu.edu/~rrg/science/dposs/
[20] JCSG—Joint Center for Structural Genomics [Online]. Available: http://www.jcsg.org/
[21] AFCS—Alliance for Cell Signaling [Online]. Available: http://www.afcs.org
[22] SIO Explorer Digital Library Project to Provide Educational Material from Oceanographic Voyages in Collaboration with NSDL [Online]. Available: http://nsdl.sdsc.edu/
[23] ROADnet, California Institute for Telecommunications and Technology SensorNet [Online]. Available: http://www.calit2.net/sensornets/
[24] LTER, US Long Term Ecological Research Network [Online]. Available: http://lternet.edu/
[25] CMS—Pre-Production Challenge Data Management for the Compact Muon Solenoid [Online]. Available: http://www.gridpp.ac.uk/gridpp8/gridpp8_cms_status.ppt
[26] BaBar—B Meson Detection System [Online]. Available: http://www.slac.stanford.edu/BFROOT/
[27] CDL—California Digital Library [Online]. Available: http://www.cdlib.org/
[28] Interlib-Digital Library Initiative Phase II Project with the California Digital Library [Online]. Available: http://www-diglib.stanford.edu/
[29] Transana-Education research tool for the transcription and qualitative analysis of audio and video data [Online]. Available: http://www.transana.org/
[30] ArtStor—Andrew Mellon Initiative to Create a Collection of Art Images for Use in Art History Courses [Online].
Available: http://www.artstor.org/
[31] Digital Embryo—Collection of Images for Embryology Courses [Online]. Available: http://netlab.gmu.edu/visembryo/index.html
[32] IPG—NASA Information Power Grid [Online]. Available: http://www.ipg.nasa.gov/
[33] IVOA—International Virtual Observatory Alliance [Online]. Available: http://www.ivoa.net/
[34] United Kingdom Data Grid [Online]. Available: http://www.escience-grid.org.uk/
[35] TeraGrid—NSF sponsored project to build the world’s largest, most comprehensive, distributed infrastructure for open scientific research [Online]. Available: http://www.teragrid.org/
[36] ESIP—Federation of Earth System Information Providers [Online]. Available: http://www.esipfed.org/
[37] 12th NASA Goddard/21st IEEE Conf. Mass Storage Systems and Technologies.
[38] LDAS—NASA Land Data Assimilation System [Online]. Available: http://ldas.gsfc.nasa.gov/
[39] NARA Persistent Archives Project [Online]. Available: http://www.sdsc.edu/NARA/
[40] PAT—Persistent Archive Testbed [Online]. Available: http://www.sdsc.edu/PAT
[41] R. Moore and C. Baru, “Virtualization services for data grids,” in Grid Computing: Making the Global Infrastructure a Reality. New York: Wiley, 2003, pp. 409–433.
[42] M. Wan, A. Rajasekar, R. Moore, and P. Andrews, “A simple mass storage system for the SRB data grid,” presented at the 20th IEEE Symp. Mass Storage Systems and 11th Goddard Conf. Mass Storage Systems and Technologies, San Diego, CA, 2003.
[43] MCAT—The Metadata Catalog [Online]. Available: http://www.npaci.edu/DICE/SRB/mcat.html
[44] H. Stockinger, O. Rana, R. Moore, and A. Merzky, “Data management for grid environments,” in Proc. High Performance Computing and Networking (HPCN 2001), pp. 151–160.
[45] WSDL, Web Services Description Language [Online]. Available: http://www.w3.org/TR/wsdl
[46] SOAP, Simple Object Access Protocol [Online]. Available: http://www.w3.org/TR/SOAP/
[47] R. Moore, “Knowledge-based grids,” presented at the 18th IEEE Symp. Mass Storage Systems and 9th Goddard Conf. Mass Storage Systems and Technologies, San Diego, CA, 2001.
[48] XML—Extensible Markup Language [Online]. Available: http://www.w3.org/XML/
[49] Globus—The Globus Toolkit [Online]. Available: http://www.globus.org/toolkit/
[50] R. Moore, C. Baru, A. Rajasekar, R. Marciano, and M. Wan, “Data intensive computing,” in The Grid: Blueprint for a New Computing Infrastructure, I. Foster and C. Kesselman, Eds. San Francisco, CA: Morgan Kaufmann, 1999.
[51] B. Ludäscher, R. Marciano, and R. Moore, “Towards self-validating knowledge-based archives,” in Proc. 11th Int. Workshop Research Issues in Data Engineering: Document Management for Data Intensive Business and Scientific Applications, 2001, pp. 9–16.
[52] R. Moore, “The San Diego project: Persistent objects,” presented at the Workshop XML as a Preservation Language, Urbino, Italy, 2002.
[53] B. Aiken, B. Carpenter, I. Foster, J. Mambretti, R. Moore, J. Strassner, and B. Teitelbaum. (1998, Dec.) Terminology for describing middleware for network policy and services. [Online]. Available: http://www-fp.mcs.anl.gov/middleware98/report.html
[54] R. Moore and A. Rajasekar. (2003) Common consistency requirements for data grids, digital libraries, and persistent archives (Grid Protocol Architecture Research Group draft). Global Grid Forum 8 [Online]. Available: http://www.sdsc.edu/dice/Pubs/Moore-HPDC.doc
[55] A. Rajasekar, M. Wan, R. Moore, W. Schroeder, G. Kremenek, A. Jagatheesan, C. Cowart, S.-Y. Chen, and R. Olaschanowsky, “Storage resource broker—Managing distributed data in a grid,” J. Comput. Soc. India, vol. 33, no. 4, pp. 41–53, Oct.–Dec. 2003.
[56] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, 2nd ed. San Francisco, CA: Morgan Kaufmann, 2003.
[57] OpenDAP—The open source project for a network data access protocol [Online]. Available: http://opendap.org
Reagan W. Moore received the B.S. degree in physics from the California Institute of Technology, Pasadena, in 1967 and the Ph.D. degree in plasma physics from the University of California, San Diego, in 1978. He is Director for Data Intensive Computing Environments at the San Diego Supercomputer Center (SDSC), University of California, San Diego. He coordinates research efforts on digital libraries, data grids, and persistent archives. Notable collaborations include the National Science Foundation (NSF) National Virtual Observatory, the NSF National Science Digital Library persistent archive, the NSF Southern California Earthquake Center community digital library, the Department of Energy (DOE) Particle Physics Data Grid, the NHPRC Persistent Archive Testbed, and the NARA Prototype Persistent Archive.
Arcot Rajasekar received the Ph.D. degree from the University of Maryland, College Park, in 1989. He is the Director of the Data Grid Technologies Group at the San Diego Supercomputer Center (SDSC), University of California, San Diego. He is a key architect of the SDSC Storage Resource Broker, an intelligent data grid integrating distributed data archives, file repositories and digital collections. He has more than 50 publications in artificial intelligence, databases, and data grid systems. His research interests include data grids, digital library systems, persistent archives, and distributed data collection and metadata management.
Michael Wan received the M.S. degree in nuclear engineering from the Georgia Institute of Technology, Atlanta, in 1972. He was a Nuclear Reactor Physics Engineer at General Atomics. He has been a systems analyst/programmer at the San Diego Supercomputer Center (SDSC), University of California, San Diego, since it began in 1985 and has developed a variety of key enhancements to various operating system components. He is the Chief Architect/Designer of the Storage Resource Broker (SRB) and a Senior Software Engineer and Systems Analyst at SDSC. Collaborating with Dr. A. Rajasekar and others, he has been instrumental in the development of the SRB throughout its entire history.