Production Storage Resource Broker Data Grids

Reagan W. Moore, Sheau-Yen Chen, Wayne Schroeder, Arcot Rajasekar, Michael Wan, Arun Jagatheesan
San Diego Supercomputer Center
{moore,sheauc,schroede,sekar,mwan,arun}@sdsc.edu

2nd International Conference on e-Science and Grid Computing, Amsterdam, Netherlands, December 2006.

Abstract

International data grids are now being built that support joint management of shared collections. An emerging strategy is to build multiple independent data grids, each managed by the local institution. The data grids are then federated to enable controlled sharing of files. We examine the management issues associated with maintaining federations of production data grids, including management of access controls, coordinated sharing of name spaces, replication of data between data grids, and expansion of the data grid federation.

1. Introduction

Data grids are used to build shared collections out of files that are located at multiple sites across multiple administrative domains [1,2]. A shared collection provides persistent global name spaces for the names of the files, the curators of the shared collection, the storage resources, and even the metadata attributes associated with each file [3]. The result is an environment in which the files may be moved from site to site without having to worry about the file name changing. Access controls that are set on the files and user-defined metadata remain unchanged as well. It is possible to construct an environment in which all the properties of the shared collection are managed independently of the choice of storage system or database. This capability is called infrastructure independence, and is the essential design feature needed to manage technology evolution [4]. Data grids use infrastructure independence to ensure that data can be uniformly managed with strong access controls, even when the files are distributed across multiple types of storage systems. Production data grids have both the challenge and the benefit that they are the only interface seen by the end user. A collaborator on a shared collection does not have to worry about the different protocols used by the multiple storage devices, or differences in administrative
policies about data residency lifetime, or local site authentication. On the other hand, any problem that occurs in the environment is the responsibility of the data grid, whether it is a network outage, a disk crash, or a corrupted tape. A production data grid has to protect itself from all possible sources of data loss and provide users an environment in which the data have strong guarantees on integrity and authenticity [5]. In practice, no storage system can be trusted to reliably store data for the lifetime of a collection, which may be longer than 20 years. At a minimum, the technology used within the storage system will become obsolete, and the data will need to be migrated to more cost-effective technology. At worst, the storage system will lose or corrupt the data. Types of events that lead to data loss include media failure, systemic vendor product failure (such as bad microcode in a tape drive), operational error, natural disaster, and malicious users. Production data grids provide mechanisms to ensure data integrity, operational procedures to ensure reliable management of the collection, and consistency mechanisms to ensure that the authenticity of the data in the collection remains uncompromised. In this paper we examine the types of operational support needed to maintain a data grid based on the Storage Resource Broker technology [6]. We also examine the emergence of federations of data grids as the preferred mechanism for sharing data, and the implications that the management of federations has on site-specific production operational procedures.

2. Data Grid Management

Before a data grid federation can be effectively integrated, each of the component data grids must be reliably managed. A data grid administrator performs operational tasks to ensure the smooth operation of the data grid, usually assisted by systems analysts who maintain the storage systems, network administrators
who manage both security systems and networks, and database administrators who maintain the databases in which the data grid metadata catalog is housed. The multiple levels of hardware and software systems that must work together seamlessly are:
• Data grid federation software
• Application level client software
• Data grid servers
• Data grid metadata catalog
• Security environment
• Storage systems
• Database
• Network
A failure in any one of these systems is viewed as a failure of the data grid. A data grid must ensure the end-to-end reliability and availability of the integrated system across all types of failures. Given the multiple levels of hardware and software systems that are integrated by a data grid, it appears that management of integrity and authenticity of distributed collections is very difficult. Data grids overcome these apparent difficulties through the use of checksums, replication, synchronization, and federation [7]. The intent is to provide multiple copies of each file, assert that the copies are up to date and uncorrupted, and ensure that a copy resides in an independently administered domain on a different type of storage system. At the same time, state information must be replicated across independent databases to ensure no single point of failure. A list of the operations supported by data grids to maintain high availability while mitigating the risk of data loss is available at the Storage Resource Broker wiki: http://www.sdsc.edu/srb/. The operations include:
• Management of end-to-end validation of checksums. A checksum is created before a file is registered into a SRB collection, and the checksum is validated after storage of the file. Related operations include creation of checksums for previously registered data. (A sketch of a periodic checksum and replica check appears at the end of this section.)
• Management of replicas, versions, and backups of files. A replica is a true copy of a file. Changes to one of the replicas of a file can be synchronized to the other replicas. A version is a numbered copy of a file, identified by a unique version number. A backup is a time-stamped copy. A request to make a replica always creates another copy of the file. A request to make a backup replaces the previous backup of the file.
• Management of synchronization. This issue is more difficult than it appears, because synchronization could be done between the collection and external (non-SRB managed) storage systems, between file system buffers and disk, between disk caches and tape archives, between two SRB-managed
collections, and between two SRB data grid federations. The SRB can register an external file system directory structure into a SRB collection. The registered files can then be synchronized with other SRB collections. The ability to synchronize file system buffers is essential when dealing with small files. It is possible for a storage system to report completion after the files are written to the file system buffer and before the files are written to disk. A storage system crash will then lose the files, even though the data grid believes the data are safely stored. A similar problem occurs with storage systems that write data to a disk cache before archiving data on tape. The data grid needs to protect its integrity by forcing synchronization of the data all the way to the end storage media. The ability to synchronize replicas is essential for managing data distributed over wide-area networks. If an attempt to create a remote file fails because of a network problem or storage system outage, the data grid can defer creation of the replica until the system is available. Also, changes to a file can be made on a single replica, and then propagated to the other replicas after the updates are complete.
• Consistency checks on the integrity of the metadata catalog. This requires two checks: that a file exists for each of the entries in the metadata catalog, and that a metadata record exists in the metadata catalog for each file in a SRB vault.
• Management of slave catalogs. A standard approach is to create additional read-only metadata catalogs to ensure high availability or to improve response at a remote site. All writes are done to the master catalog to ensure consistency. The slave catalogs are synchronized at selected intervals to download changed metadata.
• Management of federations. The replication of an entire collection (including name spaces, data, and metadata) can be done onto a separately administered data grid. This capability ensures that an independent environment with separate operational procedures is managing a copy of the collection. An operational procedural error on one data grid can be recovered by transferring the lost data or metadata from the federated data grid. This capability is used in both preservation environments and digital libraries.
The data grid administrator executes the above operations to manage assertions about the shared collection. The assertions can be a statement that the integrity of all files has been verified within a specified time period, or that the required number of replicas exists for each file, or that the metadata catalog has been
synchronized with the SRB data grid vaults. If problems are found such that a desired assertion has not been met by the data grid, the data grid administrator may need help from systems analysts about storage system interactions, from network administrators about network interactions, or from database administrators about interactions between the metadata catalog and the database. Typical data grid administrator tasks include both periodic assertion testing and intermittent operational tasks:
Periodic system administration tasks:
• Manage integrity checks on data
• Manage audit trails
• Manage consistency checks on collections
• Manage synchronization of replicas
• Manage deletion of files (trash can emptying)
• Track all errors and reported data losses
• Manage upgrades to new versions of the data grid servers
• Manage upgrades to new versions of the metadata catalog
• Manage interactions with new storage system, database, and network technology
• Maintain end-to-end configuration control (database port assigned to a collection, ports open on firewalls, and network addresses)
• Manage interactions with authentication environments
Operational tasks:
• Add servers for new storage resources
• Add new users
• Respond to user questions
• Modify access controls on collections
• Restart data grid servers as needed
• Identify problems with storage systems
• Respond to installation questions
• Integrate user interfaces with data grid
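The periodic assertion tests listed above lend themselves to simple scripting. The following Python sketch is illustrative only: it does not use the SRB client API, and the catalog records (logical name, stored checksum, replica paths) are a hypothetical export of the information the metadata catalog would hold. It shows the shape of the checksum and replica-count validation a data grid administrator would schedule.

    import hashlib
    import os

    REQUIRED_REPLICAS = 2  # assumed policy value, not an SRB default


    def file_checksum(path, algorithm="md5"):
        # Stream the file so that very large files do not exhaust memory.
        digest = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()


    def validate_collection(catalog_records):
        # catalog_records: iterable of dicts with keys 'logical_name',
        # 'checksum', and 'replica_paths' (a hypothetical catalog export).
        problems = []
        for record in catalog_records:
            live = [p for p in record["replica_paths"] if os.path.exists(p)]
            if len(live) < REQUIRED_REPLICAS:
                problems.append((record["logical_name"], "missing replicas"))
            for path in live:
                if file_checksum(path) != record["checksum"]:
                    problems.append((record["logical_name"], "checksum mismatch: " + path))
        return problems

A production check would read the records from the metadata catalog and repair the reported problems with the SRB synchronization and replication commands; the sketch only reports them.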

4. Example Operational Issues

The management of a data grid is best illustrated through examples of interactions between the data grid and the underlying software and hardware systems, and between the data grid and the users. We recognize four main categories of operations: data grid error diagnosis, user support, API selection, and federation.

4.1 Data Grid Error Diagnosis

The SRB data grid is implemented as a peer-to-peer architecture in which any SRB server can communicate with any other SRB server. When a user executes a client operation, the request is sent to a default server, which interacts with a central metadata catalog to interpret the
operation and identify where the operation should be executed, and then forwards the request to the appropriate storage server. The reliability of the data grid depends on the ability of the servers to communicate successfully. Any interruption in communication between servers is seen by the user as a failure of the data grid. Thus a major task of the data grid administrator is the diagnosis and correction of server-to-server communication errors. The major tool provided by the SRB data grid to monitor operations is a system log that tracks operations performed by users and the associated task completion status. Separate system logs are kept by each server. The data grid administrator examines each log to understand whether the problem is caused by a remote storage system outage, a network interrupt, a configuration change, or a collection management error. Since the underlying hardware and software systems may be administered by independent groups, it is quite possible for a reliably functioning system to be impacted by an administrative decision at a remote site. Example problems include:
• Storage system maintenance. A storage system may be taken offline without notification, resulting in failures when users attempt to access files on that system. Although the SRB data grid supports automated failover to replicas, a collection manager may have chosen to keep only a single copy of a file. One mechanism to improve detection of remote storage system availability is to run a "hot page" which periodically pings each storage environment and reports response times and transport rates (a minimal sketch of such a monitor is given at the end of this subsection). Systems which monitor SRB server status are listed on the SRB wiki, http://www.sdsc.edu/srb/, under downloads and contributed software.
• Storage system upgrades. When the software and hardware storage systems provided by a vendor are upgraded, the SRB driver written for that system may need to be modified. The problems typically are related to the management of exceptional cases, such as the use of parallel I/O to move very large files, or bulk manipulation of a large number of small files. Both cases stress local resources and expose problems in vendor software. The SRB driver must then be modified to ensure that the particular vendor software is no longer unduly stressed.
• Network configuration. A major challenge is managing interactions with network devices such as firewalls, virtual private networks, and load levelers. Since the SRB uses multiple network ports to send parallel I/O streams, the communication will fail if the network devices are not appropriately configured. Management tasks include setting up the correct port ranges to support parallel I/O
streams through a firewall, and specifying the IP addresses from which external communications will be received for virtual private networks. When a new server is added to the environment, the network configuration may need to be modified. The SRB uses multiple communication control protocols to ensure the ability to interact gracefully with firewalls. The control protocols include serial I/O with control and data movement through the same port, server-initiated parallel I/O, client-initiated parallel I/O, client-initiated bulk metadata operations, and server-initiated bulk metadata operations. Currently, the user must execute different commands to take advantage of the different protocols. Thus an attempt to use parallel I/O to move data to a storage system behind a firewall may fail because a client-initiated parallel I/O command was attempted instead of a server-initiated parallel I/O command. Keeping track of this type of problem requires examining the system log and comparing the requested operation with the data grid configuration information.
• Database administration. A critical component of the data grid is the management of the state information that results from the data operations (file writes, audit trails, access control changes, file updates, user-defined metadata changes, replication, …). Each of these updates increases the amount of metadata that must be managed within the database that holds the metadata catalog (MCAT). A standard impact is that the system can become unresponsive, with a query to the database taking an inordinately long time to respond. The data grid administrator manages the database instance to keep the system running interactively by periodically optimizing indices on the metadata catalog tables. The database administrator supports updates to the database technology, manages the size of the table spaces, monitors the resource utilization (number of simultaneous transactions, amount of memory used, fraction of CPU used) to ensure interactive response, and manages backups of the metadata catalog. Since all state information needed to maintain the data grid resides in the database, the database backups are the most critical component of data grid administration. If the state information in the database is lost, the data grid loses the ability to identify and retrieve remote files. An advantage of having data grid state information in a central catalog is the ability to process many collection operations much more efficiently than can be done by operations on file system i-nodes. Examples are generation of lists of million-file collections and checking the amount of space used. The exception is the physical deletion
of files. The SRB provides a "trash can" into which deleted files are moved. The physical deletion does not occur until the "trash can" is emptied. This requires a separate interaction with the file system for each file, and can take a very long time when tens of thousands of files are deleted. A periodic removal of files from the "trash can" is a required operation on some data grids; otherwise users perceive either storage systems that are filled to capacity or excessive physical file removal times. Occasionally, interactions with the metadata catalog appear to fail. In this case, additional system logs can be turned on that list the complete SQL command issued by the metadata catalog interface to the database. These system logs can grow very rapidly, and thus are normally turned off. The ability to track the actual SQL makes it possible to identify problems with unusual characters in collection and file names, inconsistencies between state information in the metadata catalog and the files that actually reside in the SRB vault, and incomplete state information that was generated by an earlier version of the SRB. Each of these problems can be corrected by the data grid administrator through direct manipulation of the state information within the metadata catalog. This requires knowledge of the table structures used by the SRB MCAT, the ability to compose and issue SQL commands to the database, and an understanding of the semantic meaning of each piece of SRB metadata.
• SRB data grid administration. The SRB data grid continues to evolve to support capabilities required by the multiple projects that use the technology. The types of upgrades range from major releases that provide new fundamental capabilities (SRB version 3), to minor releases that provide new features (SRB version 3.4), to bug fixes (SRB version 3.4.2). For example, the release of support for parallel I/O was provided in SRB version 2, and support for data grid federation was provided in SRB version 3. All of the major releases, plus some of the minor releases, required changes to the SRB protocol used to support communication between SRB servers. The version of the communication protocol is specified by a letter that is associated with the release. The current protocol version is letter "f", implying that the communication protocol has changed 5 times. The state information that is managed by the SRB has also evolved. This means that new table structures have been added to the SRB MCAT catalog. An upgrade of the SRB system may require the creation of new database tables in MCAT. For instance, when support for federation
was developed, the SRB had to associate a unique zone name with each independent data grid. This meant that the files in a collection had to be identified by three labels: zone name / collection name / logical file name. This affected the clients, the servers, and the MCAT metadata catalog, and hence required a major release. When upgrades are made to new SRB versions, an attempt is made to maintain backward compatibility. Old clients are expected to be able to access the new environment. However, this is not true for SRB servers. Thus a major source of questions is which SRB versions are required when combining clients, servers, and the MCAT catalog into a functioning system. Two approaches have been followed within the SRB user community to manage upgrades to new SRB versions. The approaches are differentiated by whether they try to preserve the collection identity, or whether they try to maintain the ability for prior clients to access the new system. The first approach keeps the same network address and same port number for the MCAT catalog. All clients are upgraded simultaneously along with all servers, ensuring that the correct protocol is used between all of the system components and the correct state information is saved. The second approach keeps the same network address for the MCAT catalog, but installs a new MCAT server using a new port number. The old clients access the original MCAT server using the original port number. The new clients access the new MCAT server using the new port number. This keeps the protocol consistent between the old and new systems, but forces the research community to use a "new" port number to access the collection.
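A minimal version of the "hot page" monitor mentioned in the storage system maintenance discussion can be built from socket timing alone. The host names and ports below are placeholders; the sketch only measures whether each server port accepts a connection and how long that takes, which is the core of the availability report (the real hot page also reports transport rates, which requires actual data transfers).

    import socket
    import time

    # Placeholder (hostname, port) pairs for the SRB servers to be monitored.
    SERVERS = [("srb-mcat.example.org", 7579), ("srb-store.example.org", 5544)]


    def probe(host, port, timeout=5.0):
        # Return the connection latency in seconds, or None if the server is unreachable.
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return time.monotonic() - start
        except OSError:
            return None


    def hot_page():
        for host, port in SERVERS:
            latency = probe(host, port)
            status = "DOWN" if latency is None else "up, %.3f s" % latency
            print("%s:%d %s" % (host, port, status))


    if __name__ == "__main__":
        hot_page()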

4.2 User Support

Data grids provide a uniform access interface to multiple types of storage systems through data and trust virtualization layers. The user therefore interacts with a characterization of the data instead of the physical files. The user interacts with authentication and authorization mechanisms provided by the data grid independently of the remote storage system. These are the hardest concepts to understand about data grids. The recognition that the logical name space for files can be organized independently of the physical file names is a major conceptual hurdle. At the same time, the virtualization mechanisms enable multiple new operations not supported by a Unix file system, including logical file name management, replication, audit trails, parallel I/O, bulk operations, and
metadata manipulation. Choosing between the available operations for these new capabilities turns out to be a major user support issue. The best performing command is always desired, but this may involve understanding whether parallel or serial I/O should be used, whether a network firewall is in the communication path, whether bulk operations should be used for manipulating small files, whether remote procedures should be used to filter the desired data, and whether replication is needed to minimize the risk of data loss. In practice, serial I/O on current Gigabit/second networks is reasonable up to file sizes of 10 megabytes, bulk operations on a thousand files are always faster, and no single storage system should be trusted to reliably hold data. The issue of data reliability requires careful consideration, as the effort involved in assembling a shared collection can be substantial and take as much time as the generation of the original data. We note that as the amount of data being managed grows, both the size of individual files and the capacity of the storage systems are increasing. RAID systems are reaching a capacity point where the time to recover from a parity error is greater than the time for another parity error to appear. At the same time, the files that are stored on the RAID system must be larger to efficiently use the multiple disks. This imposes a minimum desired file size on the collection, and the need to replicate data to ensure the ability to recover from a RAID failure. The minimization of the risk of data loss now requires the use of replication across multiple types of storage systems, periodic synchronization of the replicas, periodic validation of checksums on each file, and even replication of data and metadata into an independent data grid. These mechanisms work effectively with disk-based file systems, but are difficult to apply on large tape archives. The efficient validation of large numbers of files stored on tape requires a way to aggregate files to minimize access time. SRB containers were originally envisioned as a way to aggregate small files before writing onto tape-based archives. The containers were sized to match the latency of tape access and the tape read rate. Thus a storage system that required 15 seconds to retrieve and mount a tape, and that then read the tape at 120 Megabytes/second, would use a container size of 1.8 Gbytes; otherwise the duty cycle of the tape drive would be very low (this sizing rule is sketched at the end of this subsection). Whenever more than two files were retrieved from a container, the effective performance of the storage system improved. The use of containers also minimized the number of names that the storage system had to manage. The data grid tracked the location of each file in a container, and managed caching of the container on disk to support actual read and write
operations. This off-loaded file manipulation from the tape archive. Containers are now needed on disk file systems when collection sizes are measured in the tens of millions of files. As collection sizes increase, the ability of disk file systems to respond interactively decreases. Data grids offload the management of state information normally stored in i-nodes onto the metadata catalog. The use of containers is now appearing as a requirement for improving the ability of both file-based and tape-based systems to support periodic checksum validation on very large collections. Typical problems encountered by users include setting SRB command parameters incorrectly, setting authentication environment (Grid Security Infrastructure) parameters incorrectly, and dealing with unavailable storage resources. Users require the ability to interact with the data grid administrator to:
• Manage the root level of the data grid. This includes requests to add new users, add new servers, add access permissions, recover from input errors (inappropriate blanks in a file name), manage storage quotas, and manage storage resources. Over time, some of these operations have been moved into the user space, consistent with the authentication and authorization policies. Thus curators of collections can set access permissions for co-workers, but operations that require a higher level of trust have to be done by the data grid administrator.
• Provide installation support. Standard installation scripts are provided for installing the SRB data grid software on Macs, Linux systems, and Solaris systems. Non-standard installations, such as those using a pre-existing database or a specific local system configuration, require data grid administrator support.
• Authentication environment. The SRB data grid supports three different authentication mechanisms: challenge-response, Grid Security Infrastructure (GSI) public key certificates, and tickets. The challenge-response mechanism requires a shared secret between the remote client and the SRB data grid. The shared secret is not sent over the network. Instead, a challenge is encrypted at the client and sent to the MCAT server, which validates the user identity by decrypting the challenge. The GSI environment manages public/private keys in a certificate authority. The GSI system provides standard services for validating the identity of a user. However, the GSI system continues to evolve, implying the need to support multiple versions of the services. Installing the GSI environment is a non-trivial task. The SRB ticket-based access is a generalization of anonymous FTP and provides a
way to grant access to specified files. The number of accesses and the time period during which the accesses are allowed can be controlled. Any person who has the ticket can access the file. The management of the GSI authentication environment requires expert knowledge of grid technology. This is a major impediment to the use of GSI authentication in most digital libraries and preservation environments.
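The container sizing rule described earlier in this subsection (tape mount latency multiplied by the sustained read rate) is simple arithmetic. The function below is not part of the SRB; it just reproduces the 15 second, 120 Megabytes/second example from the text.

    def container_size_bytes(mount_latency_s, read_rate_mb_per_s):
        # Size a container so that reading it takes at least as long as mounting
        # the tape, which keeps the tape drive duty cycle reasonable.
        return int(mount_latency_s * read_rate_mb_per_s * 1_000_000)


    # Example from the text: 15 s mount latency at 120 MB/s gives 1.8 Gbyte containers.
    print(container_size_bytes(15, 120))  # 1800000000 bytes, i.e. 1.8 Gbytes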

4.3 API Selection

The SRB data grid provides three fundamental access methods: C library calls to support access from applications, Unix-style shell commands for interactive access, and a Java class library for web-based access. So far, all other access methods (of which there are more than fifteen) are built on top of these three mechanisms. While this provides a way for any existing data management system to access distributed data stored in a SRB collection, the porting of the local interfaces to the SRB may require assistance. Each SRB interface was developed in response to the needs of a particular research community. This means the capabilities provided by a particular interface are unique, and address project-specific data management policies. The C library interface implements all of the SRB data management functionality. The Unix-style shell commands wrap most of the SRB functionality into S-commands that emulate similar operations provided by file systems. Thus "Sls" lists files in a SRB collection, similar to the action of the "ls" command for listing files in a file system. The Java class library extends the Sun I/O classes to support access to SRB collections. The choice of a specific access interface implies that the associated operations may be a subset of the available SRB operations. Thus the WSDL interface supports a limited set of single-file manipulation operations, the Perl load library supports some metadata operations, the inQ Windows browser supports sophisticated drag-and-drop operations from the desktop into a SRB collection, and the mySRB web browser interface supports multiple curation operations related to metadata creation. Users encounter challenges when they start using capabilities not available in traditional file systems. Two specific capabilities exemplify this:
• Shadow objects. It is possible to register a file that resides in a remote storage system into a SRB collection without physically copying the file. The user only has to set up permission for the SRB data grid account to read the file. Once the file is registered, the SRB can manage the logical file name that is assigned to the file and create a logical collection hierarchy, associate user-defined
metadata with the file, and support browsing and discovery of the file through any of the SRB interfaces. Since the file has not been physically copied into a SRB-managed storage system controlled by the SRB account ID, the user can still modify and even delete the file without informing the SRB data grid. Thus shadow objects are created to aid the ingestion of files into a data grid, and decouple data registration from data movement. Understanding the difference between the logical state information maintained about the shadow object and the results of physical operations performed outside of the SRB data grid is conceptually difficult. To minimize the possibility of inconsistent access mechanisms (local versus SRB-based access), shadow objects should be copied onto SRB storage systems under the control of the SRB data grid. This ensures that all future operations on the file will be made through the SRB data grid and that the associated state information can be automatically updated.
• User-defined metadata and metadata queries. The ability to specify metadata attributes for each collection and for each object within a collection is supported by the SRB data grid. However, it is up to each research project to define the set of metadata that is relevant to their digital holdings and data management policies. The specification of the desired metadata, the assignment of semantic meaning to each metadata element, and the specification of the allowed range of attribute values need to be a consensus formed by each community. Since the multiple SRB interfaces each provide a different subset of metadata manipulation operations, the choice of API is strongly driven by the type of browsing and querying requirements. At the same time, each community has a preferred implementation language. Thus each research community inevitably drives the extension of more sophisticated interfaces for the SRB data grid. The ability to tailor the data management environment to the interfaces and operations desired by a specific research project requires interactions with not only the SRB data grid administrator, but also the SRB developers. This process has been repeated many times, with the resulting interfaces accessible from the SRB wiki page as contributed software.
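The difference between registering a shadow object and putting a file under SRB control can be illustrated with a toy catalog model. None of this is SRB code; it only shows that registration records a logical-to-physical mapping without moving data, while a put copies the file into a vault the grid controls before recording the mapping.

    import os
    import shutil


    class ToyCatalog:
        def __init__(self, vault_dir):
            self.vault_dir = vault_dir   # storage under the data grid's control
            self.mapping = {}            # logical name -> physical path

        def register(self, logical_name, existing_path):
            # Shadow object: record the mapping only. The file stays where it is
            # and can still be modified or deleted outside the catalog's knowledge.
            self.mapping[logical_name] = existing_path

        def put(self, logical_name, source_path):
            # Copy the file into the vault, then record the mapping, so that all
            # later operations go through the catalog and its state stays current.
            os.makedirs(self.vault_dir, exist_ok=True)
            dest = os.path.join(self.vault_dir, os.path.basename(source_path))
            shutil.copy2(source_path, dest)
            self.mapping[logical_name] = dest

Copying a shadow object into the vault, as recommended above, corresponds to replacing the registered mapping with one produced by the put operation.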

4.4 Federation of Data Grids

The dominant emerging strategy for linking data grids is the construction of federated collections that are independently managed by separate administrative domains [8]. The concept of federation has enabled the creation of quite sophisticated environments using data
grids as common building components. Examples of federations include:
• Central archives. Under this management policy, independent data grids push data into a central archive. The central archive itself is a data grid that manages its own storage systems using its own metadata catalog. Each data grid assumes responsibility for deciding which files to replicate to the central archive. Each data grid independently synchronizes its selected data with the central archive.
• Central data authority. Under this management policy, all data originates from the central authority. Data is pushed to remote leaf data grids for local access and management. Each of the leaf data grids relies upon the central data authority for authentic data.
• Pull environments. This is the most popular federation environment, used to manage internationally shared collections. Each data grid decides which files will be pulled from another data grid and replicated locally. The data grids can be organized in chains, with data pulled from one data grid to the next under local administrative control.
• Deep archives. It is possible to build federations such that the location of the deep archive and the identity of the archivists in the deep archive cannot be seen from the external world [9]. This is accomplished by installing a staging data grid between the multiple data grids that can be accessed by the public and the data grid that will be the deep archive. The identity of an archivist with access privileges in the staging data grid is registered into the public data grids. This archivist then pulls data through a firewall using client-initiated parallel I/O. The identity of the archivist with access privileges in the deep archive is registered into the staging data grid. This archivist then pulls data from the staging data grid into the deep archive. Through appropriate combinations of virtual private networks and firewalls, all communication can be restricted to the paths between the staging grid and the public data grids, and between the staging grid and the deep archive. The result is the ability to automate the movement of data from the external world into the deep archive, without exposing the deep archive to unwarranted public access.
The above scenarios are dependent upon the mechanisms used by data grids to manage authentication and authorization between independent environments. The approach taken in SRB federation is to require all authentications of a user to be done by the user's home data grid. This means that the identity of each user is now a triplet: user-name / project-name / home-data-
grid. A registry is used to maintain unique names for each data grid. When two data grids are federated, they set up a trust relationship, identifying where requests for user validation will be sent. When a data grid receives a request for access by a user from a foreign data grid, the foreign data grid is contacted to provide authentication, while the access controls are asserted by the local data grid. This model is similar to the Shibboleth authentication model, in which the authentication of an individual is always done by their home institution.
An experiment was conducted to demonstrate data sharing between federated data grids at the 17th Global Grid Forum meeting in Tokyo, Japan, held on May 11, 2006. The specific goals were to federate fourteen international SRB data grids and:
- Demonstrate browsing in a remote data grid
- Demonstrate read access to a file in a remote data grid
- Issue SRB commands to list resources in a remote data grid
- Demonstrate write access to a registered user account in a remote data grid

Table 1. List of Participating Data Grids

Country        Data Grid   Data Grid Administrator
Australia      APAC        Stephen McMahon
Taiwan         ASGC        Eric Yen, Wei-Long Ueng
New Zealand    IB          Daniel Hanlon
UK             IB          Daniel Hanlon
Italy          DEISA       Giuseppe Fiameni
France         IN2P3       Jean-Yves Nief
Japan          KEK         Yoshimi Iida
Taiwan         NCHC        Hsu-Mei Chou
Chile/US       NOAO        Irene Barg
US             Purdue      Lan Zhao
UK             CCLRC       Adil Hasan, Roger Downing
Netherlands    SARA        Bart Heupers
US             TeraGrid    Sheau-Yen Chen
US             U Md        Mike Smorul

The fourteen international data grids that participated are shown in Table 1. Many of the data grids that participated in the demonstration were themselves federations of data grids. Thus the GGF interoperability testbed was really a federation of data grid federations.
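The home-zone authentication model described above, in which every user is identified by a user-name / project / home-data-grid triplet, the home grid validates the shared secret, and the local grid applies its own access controls, can be sketched as a toy model. The user and zone names echo the GGF demonstration described in the next subsection, but the code is purely illustrative: an HMAC exchange stands in for the SRB Encrypt1 challenge-response protocol.

    import hashlib
    import hmac
    import os


    class Zone:
        def __init__(self, name):
            self.name = name
            self.secrets = {}        # (user, domain) -> shared secret, known only to the home zone
            self.trusted_zones = {}  # zone name -> Zone, established when two grids federate
            self.acl = set()         # (user, domain, home zone) triplets allowed to read local data

        def register_user(self, user, domain, secret):
            self.secrets[(user, domain)] = secret

        def federate(self, other):
            self.trusted_zones[other.name] = other
            other.trusted_zones[self.name] = self

        def answer_challenge(self, user, domain, challenge):
            # Only the home zone (and the user) know the shared secret.
            return hmac.new(self.secrets[(user, domain)], challenge, hashlib.sha256).digest()

        def authenticate(self, user, domain, home_zone, challenge, response):
            # The local zone forwards validation of the response to the user's home zone.
            home = self if home_zone == self.name else self.trusted_zones[home_zone]
            expected = home.answer_challenge(user, domain, challenge)
            return hmac.compare_digest(expected, response)

        def read_allowed(self, user, domain, home_zone):
            return (user, domain, home_zone) in self.acl


    # A "ggfsdsc" user from zone SDSC-GGF accessing a remote zone named AU.
    sdsc, remote = Zone("SDSC-GGF"), Zone("AU")
    sdsc.register_user("ggfsdsc", "sdsc", b"passwdxxx")
    sdsc.federate(remote)
    remote.acl.add(("ggfsdsc", "sdsc", "SDSC-GGF"))

    challenge = os.urandom(16)                                             # issued by the remote zone
    response = hmac.new(b"passwdxxx", challenge, hashlib.sha256).digest()  # computed by the client
    assert remote.authenticate("ggfsdsc", "sdsc", "SDSC-GGF", challenge, response)
    assert remote.read_allowed("ggfsdsc", "sdsc", "SDSC-GGF")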


4.5 GGF SRB Data Grid Federation

Each data grid can think of itself as the hub of a federation. Each data grid controls the sharing of name spaces, access criteria for the local collection, and the specification of which subset of the files will be shared. Each data grid selects the set of other data grids with which it exchanges trust information and cross-registers user information. This implies that a federation of data grids can simultaneously sustain many of the federation models that are listed above. For the Global Grid Forum demonstration, a single user account was cross-registered across all the federated data grids. The user account was specified as a triplet defining the user (user_name), the project (domain_desc), and the home data grid (zone_id):
user_name: ggfsdsc
domain_desc: sdsc
zone_id: SDSC-GGF
A common authentication mechanism, based on the SRB "Encrypt1" authentication method, was used to authenticate access between data grids. This provides a challenge-response mechanism to authenticate user identity and requires the establishment of a shared secret between the user and the home data grid. All authentications of user identity were performed by the user's home data grid (zone_id). Multiple versions of the Storage Resource Broker were used within the federation. SRB version 3.4.1 was released at the end of April 2006 and included bug fixes for handling file names that contained illegal characters and a patch for inter-zone communication. SRB version 3.4.0 was also used along with the inter-zone communication patch. A data grid federation can use multiple versions of the SRB data grid technology as long as the SRB communication protocol is consistent across the SRB versions. However, it is preferable to use the same SRB version across all participating data grids. Some of the participating data grids investigated tuning the data transmission performance within the federation. This required increasing the number of parallel I/O streams, increasing the window size used for TCP/IP transmissions, and increasing the system buffer size. The optimal window size depends upon the wide-area network latency and is usually set to the number of messages that can be in flight across the network, or the round-trip latency times the network bandwidth divided by the message size. Increasing the window size means more messages can be in flight within the network before an acknowledgement of successful message reception is required. The system buffer size must be large enough to hold copies of all of the messages that have been sent, in order to allow retransmission in case of message corruption. For round-trip network latencies of
200 milliseconds and a bandwidth of 1 Gigabit/second, the system buffer size needs to be 200 megabits, or 25 Mbytes. The number of parallel I/O streams was also increased to enable full use of the network bandwidth. For example, the AU data grid in Australia optimized data transmission to the KEK data grid in Japan as follows:
1. In the "runsrb" script, they set the default number of parallel I/O channels to
MaxThread=16
SizePerThread=2
This specified up to 16 parallel I/O streams. The SRB system selected the number of parallel I/O streams by dividing the file size measured in Mbytes by the SizePerThread. Smaller files used a smaller number of parallel I/O streams, with files larger than 32 Mbytes in size using 16 parallel I/O channels.
2. They reset kernel parameters on their host based on the recommendations published by the Global Grid Forum in the tuning guide: http://www.psc.edu/networking/projects/tcptune/
The participating data grids were located in Europe, the US, and the Far East. Each data grid chose which version of the Storage Resource Broker (SRB) software to use. The steps required for a data grid to join the federation were:
- Register the zone name for their data grid with the SRB zone-name authority. This ensures that each data grid has a unique zone name.
- Federate their data grid with the hub data grid at SDSC. The hub data grid was located at SDSC and called SDSC-GGF.
The operations required to establish the data grid federation had to be run by the data grid administrator of each data grid, from their SRB data grid administrator account. The explicit steps included:
- Execute the "Stoken Zone" command to list the information about their local SRB data grid and store the result in a file. An example of the output from the "Stoken Zone" command is listed below in Figure 1.
- Send the file to each data grid that the data grid administrator wanted in their federation. At a minimum, we requested that they send the information to SDSC for federation with the SDSC-GGF data grid. The data grid administrator at SDSC then sent the corresponding information file about the SDSC-GGF data grid to the remote data grid administrator.
- Each data grid administrator then ran the zoneingest.pl perl script located in the ./MCAT directory. This script registered the zone information into the local data grid. At this point, the data grids could communicate and respond to
requests from the remote data grid. However, no user information had been cross-registered. A user from the SDSC-GGF data grid could only access public information in the remote data grid.
- To allow a user from the SDSC-GGF data grid to access controlled data within the remote data grid, the identity of the user had to be registered into the remote data grid. Two SRB commands were executed by the data grid administrator at the remote data grid to register the identity of a user called "ggfsdsc" from the SDSC-GGF data grid into the remote data grid:
Singestuser ggfsdsc passwdxxx sdsc staff '' '' '' '' ''
Szone -U SDSC-GGF ggfsdsc sdsc

The first command established a user name "ggfsdsc" within the remote data grid. The second command assigned this user to the "SDSC-GGF" data zone. After these steps, the user "ggfsdsc" from the "SDSC-GGF" data grid could log into the remote data grid and access files for which his user name had been granted read permission. The process also created a sub-directory into which the "ggfsdsc" user could store his or her own files.
- Finally, the remote data grid administrator specified a logical storage resource name where files would actually be written. Note that this means that even with the creation of an account in the remote data grid, permission still has to be given for the new user to write to a specified storage system.
After these commands were issued by the remote data grid administrator, the "ggfsdsc" account was able to browse files in the remote data grid, read files for which access permissions were enabled, and write files to the remote storage system. When the test user accessed a remote data grid, say AU, the AU data grid sent an authentication request to the home data grid at SDSC. If the authentication was validated, the AU data grid then applied access controls to the "ggfsdsc" user to decide which operations were allowed. Access controls were applied by the remote data grid on files, storage resources, and metadata. Data transfer tests were conducted between the remote data grid and the SDSC-GGF data grid. The performance was highly dependent on the number of parallel I/O streams used for the transfer and the size of the system buffer, especially for international links. Two types of data transfer tests were conducted: untuned transfers using only 4 I/O channels, and tuned transfers using either 8 or 16 I/O channels with an increased system buffer size to support a larger TCP/IP window. The results are summarized in Table 2.


Table 2. Data Grids that Participated in the GGF Storage Resource Broker Federation Demonstration

Data Grid   Country       SRB version   Demo user ggfsdsc   SRB Zone name
APAC        Australia     3.4.0-P       yes                 AU
NOAO        Chile/US      3.4.1         yes                 noao-ls-t3-z1
ChinaGrid   China         CGSP-II       (software)          -
IN2P3       France        3.4.0-P       yes                 ccin2p3
DEISA       Italy         3.4.0-P       yes                 DEISA
KEK         Japan         3.4.0-P       yes                 KEK-CRC
SARA        Netherlands   3.4.0-P       yes                 SARA
IB          New Zealand   3.4.1         yes                 aucklandZone
ASGC        Taiwan        3.4.0-P       yes                 TWGrid
NCHC        Taiwan        3.4.0-P       yes                 ecogrid
CCLRC       UK            3.4.0-P       yes                 tdmg2zone
IB          UK            3.4.1         yes                 avonZone
WunGrid     UK            3.3.1         (hardware)          SDSC-wun
Purdue      US            3.4.0-P       yes                 Purdue
Teragrid    US            3.4.1         yes                 SDSC-GGF
U Md        US            3.4.0-P       yes                 umiacs

Each federated data grid also designated a logical storage resource for the tests (StoreDemoResc_AU for APAC, noao-ls-t3-fs for NOAO, LyonFS4 for IN2P3, demo-cineca for DEISA, rsr01-ufs for KEK, SaraStore for SARA, aucklandResc for the New Zealand IB grid, and, for the remaining grids, resources including avonResc, sfs-tape, sfs-disk, uxResc1, SDSC-GGF_LRS1, ggf-test, and narasrb02-unix1). Measured transfer rates to the SDSC-GGF data grid ranged from a few MB/sec up to roughly 25 MB/sec on tuned links, while untuned links reached only 0.1 to 0.3 MB/sec.

The attempt to federate data grids succeeded for fourteen of the sixteen independent data grids. The two data grids that were not federated were waiting on either software or hardware upgrades. The ChinaGrid data grid uses software developed within China to manage distributed data collections; they planned the development of an interface that would be able to interact with the published (Jargon Java class library) SRB protocol. The WUNgrid was upgrading to new hardware systems at the time of the demonstration, and needed to complete the upgrade before participating. The communication rates that were sustained between participating grids were highly dependent on the capabilities of the networks that joined the data grids, the amount of tuning that was performed to improve TCP/IP performance, and the ability of the end storage systems to read and write data at a high rate. The IN2P3 data grid is part of the BaBar high energy physics data grid, which is used to replicate data from the Stanford Linear Accelerator to Lyon, France; they have tuned their network to sustain data transfer rates of 2 Terabytes per day. The data grids with the lowest transmission rates had done no network tuning; their rates were consistent with a TCP/IP acknowledgement after every packet transmission. The establishment of the data grid federation required substantial support by the SDSC data grid
administrator. The types of problems that were seen included:
• Opening of ports in firewalls to allow parallel I/O transfers.
• Specification of the IP addresses allowed to send messages into a local virtual private network.
• Execution of the commands from the correct data grid administrator account.
• Establishment of a default logical storage resource name onto which the "ggfsdsc" user would be allowed to write data.
• Execution of the commands to associate the "ggfsdsc" user with the SDSC data grid "SDSC-GGF".
• Upgrading the local SRB version to include the inter-zone communication patch.
• Setting of access controls for what the "ggfsdsc" user was allowed to do. For example, some of the data grids allowed the "ggfsdsc" user to list all of the users participating from their data grid, while other data grid administrators restricted the "ggfsdsc" user to files within the "ggfsdsc" user home directory.
The bulk of the support issues were related to coordination between data grid administrators to establish the links required for the data grid federation. Once the federation was created and tested, the remaining tasks were related to tuning the network, as sketched below.
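The tuning performed at this stage follows the arithmetic given earlier in the section: the system buffer (and TCP window) is sized from the bandwidth-delay product, and the number of parallel I/O streams is derived from the file size and the SizePerThread setting, capped at MaxThread. The functions below simply reproduce that arithmetic; the parameter names mirror the runsrb settings, but the code itself is an illustration, not part of the SRB.

    import math


    def buffer_size_bytes(rtt_seconds, bandwidth_bits_per_s):
        # Bandwidth-delay product: the data that can be in flight before an
        # acknowledgement arrives. 200 ms at 1 Gbit/s gives 200 megabits, i.e. 25 Mbytes.
        return int(rtt_seconds * bandwidth_bits_per_s / 8)


    def stream_count(file_size_mbytes, size_per_thread=2, max_threads=16):
        # File size in Mbytes divided by SizePerThread, never more than MaxThread,
        # so files of 32 Mbytes or larger use all 16 streams.
        return max(1, min(max_threads, math.ceil(file_size_mbytes / size_per_thread)))


    print(buffer_size_bytes(0.2, 1_000_000_000))  # 25000000 bytes, about 25 Mbytes
    print(stream_count(10))                       # 5 streams for a 10 Mbyte file
    print(stream_count(100))                      # capped at 16 streams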


Figure 1 shows the information that needs to be shared between the data grids. The critical pieces of information are the name of the data grid at SDSC (SDSC-GGF), the port_number and netprefix which define the collection address, the choice of authentication scheme that will be used to authenticate users (ENCRYPT1 is challenge-response), and the identity of the data grid administrator of the SDSC-GGF data grid (user_name and domain_desc). If any of these parameters are not set correctly, the federation process will fail.

--------------------------- RESULTS --------------
zone_id: SDSC-GGF
local_zone_flag: 1
netprefix: srb-mcat.sdsc.edu:NULL:NULL
port_number: 7579
auth_scheme: ENCRYPT1
distin_name: /C=US/O=NPACI/OU=SDSC/USERID=srb/CN=Storage Resource Broker/[email protected]
zone_status: 1
zone_create_date: 2003-08-01-00.00.00
zone_modify_date: 2006-03-27-13.58.23
zone_comments: SDSC-GGF zone created 3/27/2006
zone_contact: [email protected]
user_name: srb
domain_desc: sdsc
locn_desc: srb-mcat

Figure 1. Output from the "Stoken Zone" command
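Because a single mistyped field in the exchanged zone description is enough to make federation fail, it is convenient to sanity-check the "Stoken Zone" style output before running the ingest script. The parser below is only a sketch; it assumes the plain key/value layout shown in Figure 1 and checks that the critical fields are present and non-empty.

    REQUIRED_FIELDS = ("zone_id", "netprefix", "port_number", "auth_scheme",
                       "user_name", "domain_desc")


    def parse_zone_info(text):
        # Parse "key: value" lines (as in Figure 1) into a dictionary.
        info = {}
        for line in text.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                info[key.strip()] = value.strip()
        return info


    def missing_fields(info):
        # Return the critical fields that are absent or empty.
        return [field for field in REQUIRED_FIELDS if not info.get(field)]


    sample = """zone_id: SDSC-GGF
    netprefix: srb-mcat.sdsc.edu:NULL:NULL
    port_number: 7579
    auth_scheme: ENCRYPT1
    user_name: srb
    domain_desc: sdsc"""
    assert missing_fields(parse_zone_info(sample)) == []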

5. Rule-Oriented Data Systems

The management of data grids is hard and requires expertise in databases, networks, security, and storage systems. Such an extensive level of expertise is needed because any failure by the systems accessed by the data grid is seen as a failure of the data grid. While the SRB data grid provides tools to manage many types of remote system failure, the commands that drive the use of the tools must be executed by the data grid administrator. This requires an expert understanding of how the system is configured, what types of problems are caused by which level of the integrated architecture, and what operations should be performed periodically to protect the shared data collection. In order to simplify the management of data grids, the ability to automate the application of management policies is needed. Towards this goal, SDSC is now developing an integrated rule-oriented data system, called iRODS, to support the virtualization of
management policies [10]. The system builds upon the concepts of shared collections and infrastructure independence, and adds the ability to characterize management policies as actions on management metadata. Each management policy is mapped to the evaluation of the management metadata, and the execution of an associated set of actions that implement that policy based upon the attribute values associated with the management metadata. This approach is possible if the operations that are performed upon the remote data are well defined. Fortunately, the SRB data grid provides an excellent characterization of the operations required for building collections, for sharing data in data grids, for publishing data in digital libraries, for preserving data in persistent archives, and for managing real-time data. The current set of remote operations supported by the SRB corresponds to the standard operations that should be provided for the remote manipulation of data. The iRODS system maps from the management metadata to the state information maintained by the SRB for each data collection, and then applies the remote operations as specified by the appropriate rule. All remote operations are encapsulated as micro-services whose execution is controlled by an associated rule that is based on metadata (state information) managed by the data grid. The virtualization of management policies is accomplished by mapping from the management metadata and desired operations to the state information and remote micro-services managed by the data grid. The iRODS system uses explicitly defined rules to express the mapping. One of the features of this approach is that it is possible to characterize the management policies as rules, export a description of the rules into another rule-based data system, and apply the same management policies in the new environment. Another feature is that the result of applying any management policy can be exactly specified by listing the associated management metadata, the rules that then apply, and the data grid state information that is updated as a result of applying the rule. A simple example is a management policy that specifies a form of disaster recovery through use of replicas. The management policy metadata would include the number of replicas that are desired, the distribution of the replicas across storage systems, and the frequency with which the replicas are synchronized. For a registration command, the iRODS rule would then check the management metadata, create the correct number of replicas, and distribute them to the desired storage systems. An integrity validation rule would check each sub-collection for whether the minimal
time period had passed, and would then synchronize the replicas of the files within the sub-collection. Additional management rules can be created to validate the assessment criteria for a trusted digital repository [11,12], to manage the steps required for a data grid federation, or to manage the steps required to replicate a collection between two data grids.

6. Acknowledgements

This project was supported by the National Archives and Records Administration under NSF cooperative agreement 0523307 through a supplement to SCI 0438741, "Cyberinfrastructure: From Vision to Reality", and by the National Science Foundation grant ITR 0427196, "Constraint-based Knowledge Systems for Grids, Digital Libraries, and Persistent Archives". The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the National Archives and Records Administration, or the U.S. government. The Global Grid Forum data grid interoperability demonstration relied upon the able assistance of Stephen McMahon from the Australian Partnership for Advanced Computing; Eric Yen and Wei-Long Ueng from the Academia Sinica Grid Computing Centre; Daniel Hanlon from the Integrative Biology data grid; Giuseppe Fiameni from the Distributed European Infrastructure for Supercomputing Applications; Jean-Yves Nief from the Institut National de Physique Nucleaire et de Physique des Particules; Yoshimi Iida from the KEK National Laboratory for High Energy Physics; Hsu-Mei Chou from the Taiwan National Center for High-Performance Computing; Irene Barg from the National Optical Astronomy Observatory; Lan Zhao from the Purdue University data grid; Adil Hasan of Rutherford Appleton Laboratory and Roger Downing of Daresbury Laboratory for the CCLRC data grid; Bart Heupers from the Netherlands SARA data grid; and Mike Smorul from the University of Maryland data grid.

7. References

[1] R. Moore, M. Wan, and A. Rajasekar, "Storage Resource Broker: Generic Software Infrastructure for Managing Globally Distributed Data", Proceedings of the IEEE Conference on Globally Distributed Data, IEEE Computer Society, Piscataway, New Jersey, June 28, 2005, pp. 65-69.

[2] I. Foster and C. Kesselman, "The Grid: Blueprint for a New Computing Infrastructure," Chapter 5, "Data Intensive Computing," Morgan Kaufmann, San Francisco, 1999, pp. 105-129.

[3] R. Moore, A. Rajasekar, and M. Wan, "Storage Resource Broker Global Data Grids", Proceedings of NASA / IEEE MSST2006, Fourteenth NASA Goddard / Twenty-third IEEE Conference on Mass Storage Systems and Technologies, IEEE Computer Society, Piscataway, New Jersey, April 2006.

[4] R. Moore, "Building Preservation Environments with Data Grid Technology", American Archivist, The Society of American Archivists, Chicago, Illinois, July 2006, vol. 69, no. 1, pp. 139-158.

[5] R. Moore and R. Marciano, "Technologies for Preservation", chapter 6 in "Managing Electronic Records", edited by Julie McLeod and Catherine Hare, Facet Publishing, UK, October 2005.

[6] C. Baru, R. Moore, A. Rajasekar, and M. Wan, "The SDSC Storage Resource Broker," Proc. CASCON'98 Conference, Toronto, Canada, Nov. 30 - Dec. 3, 1998, p. 5.

[7] A. Rajasekar, M. Wan, R. Moore, and W. Schroeder, "Data Grid Federation", 2004 International Conference on Parallel and Distributed Processing Techniques and Applications - Special Session on New Trends in Distributed Data Access, Las Vegas, Nevada, June 2004.

[8] R. Moore, F. Berman, B. Schottlaender, A. Rajasekar, D. Middleton, and J. JaJa, "Chronopolis: Federated Digital Repositories Across Time and Space", Proceedings of the IEEE Conference on Globally Distributed Data, IEEE Computer Society, Piscataway, New Jersey, June 28, 2005, pp. 171-176.

[9] R. Moore, A. Rajasekar, and M. Wan, "Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data", Special Issue of the Proceedings of the IEEE on Grid Computing, IEEE Computer Society, Piscataway, New Jersey, March 2005, Vol. 93, No. 3, pp. 578-588.

[10] A. Rajasekar, M. Wan, R. Moore, and W. Schroeder, "A Prototype Rule-based Distributed Data Management System", High Performance Distributed Computing workshop on "Next Generation Distributed Data Management", Paris, France, May 2006.

[11] R. Moore and M. Smith, "Assessment of RLG Trusted Digital Repository Requirements," Joint Conference on Digital Libraries workshop on "Digital Curation & Trusted Repositories: Seeking Success", Chapel Hill, North Carolina, June 2006.

[12] RLG/NARA Audit Checklist for Certifying Digital Repositories, http://www.rlg.org/en/page.php?Page_ID=2076