A Mass storage solution for Grid environments

Jules Wolfrat, Pieter de Boer, Walter de Jong, Ron Trompert
SARA, Computing and Networking Services, The Netherlands
{pieter,walter,ron,wolfrat}@sara.nl

Abstract

In several scientific areas the use of large datasets is growing fast, e.g. data from high energy physics experiments, Earth observation, astronomical observation, and biological databases. This requires the use of mass storage systems for managing these data. In grid environments user applications are no longer tied to a particular system, but can run on a group of geographically distributed hosts. We will describe the integration and use of a mass storage system in the grid environment of the European Datagrid project, where data is transparently accessible for the end user. In conclusion we will discuss the role of network performance.

Keywords: GRID, Data Storage, HPC

1 Introduction

SARA participates in the European Datagrid project [5], a project that builds a grid infrastructure to enable the handling of large amounts of data from high energy physics experiments, Earth observation and biological databases. The challenge is to make these data accessible from all participating research institutes in a way that is transparent to the user application. SARA has a long tradition in storing large amounts of data for the users of its HPC systems. Our current mass storage system is a tape robot managed from an SGI Origin 3800 system. We will describe the use of this mass storage system in section 2.

Mass storage systems differ in their interfaces for accessing the data. For the integration of a mass storage system in a grid environment it is therefore not enough to install generic grid software; it is also necessary to develop specific software that interfaces to a particular mass storage system. We will describe how we made the mass storage system at SARA accessible for the users of the Datagrid environment. Furthermore, we will show how a user can exchange data between his application environment and the mass storage system. This will be discussed in section 3.

The participating sites of the Datagrid project are distributed all over Europe, so not only are the data access rates at the site of the mass storage system important, but also the performance of data transfers over long distances in Europe. As part of the Datagrid project several aspects of this performance, such as network throughput and the TCP protocol implementation, were investigated, and we will describe some of the results of this research.

Not all aspects of the grid environment can be discussed here; for instance, important aspects such as authentication and authorisation in a grid environment are not covered. These aspects are discussed in more detail in [9]. For details about grids, see [4] or [8].

2 Mass Storage System

For more than a decade SARA has provided long term storage of data for the users of the Dutch National Supercomputing facilities, which are operated by SARA. Data is stored on tapes in a StorageTek (STK) tape library, which currently can manage 200 TB of data, while the maximum capacity of the tape silo at our site is 1.2 PB. Access to tape data is made possible with the DMF (Data Migration Facility) software, which is available for both CRAY and SGI systems. We started its deployment with the first CRAY YMP system at SARA in 1991, and today it is in use on an SGI Origin 3800 system with 1024 CPUs.

With DMF you can control the use of a file system. You can specify that files must be migrated to tape when the free space of a file system drops below a certain threshold. After migration the data of a file is removed from disk; only the directory information of the file is preserved in the file system, so system commands like "ls" will still show the file information. In principle more than one copy to tape (or other secondary storage) can be made, with each copy resident on a different storage system; if a tape is damaged the data may still be preserved on a second or third tape copy. The choice of which files are migrated can be tuned, e.g. it can be based on the size or age of files.

An important feature of DMF is that to users all data always appears to be online, whether it has been migrated to secondary storage or not. Users can access a migrated file in the same way as a file that is still present on disk. If a user accesses data from a file migrated to tape, DMF intercepts the file access, the data of the accessed file is retrieved from tape and moved back to disk, and the user action then proceeds. The only difference a user will experience compared with ordinary file access is the longer delay in actually getting the data, because of the data transfer from tape to disk. This delay can range from seconds to minutes, depending on the size of the file and the load of tape requests on the tape subsystem. Currently the STK tape library at SARA has 6 tape drives for handling tape requests.
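To illustrate this transparency, the following minimal sketch reads a file on a DMF-managed file system exactly as it would read any other file; the path is hypothetical and the timing is only meant to show where the recall delay appears.

import time

# Hypothetical path on a DMF-managed file system; the file may have been
# migrated to tape, but it still appears as a normal file to the application.
PATH = "/dmf/scratch/result.dat"

start = time.time()
with open(PATH, "rb") as f:   # if the file is migrated, DMF recalls it here
    data = f.read()
elapsed = time.time() - start

# For a disk-resident file this is a fraction of a second; for a migrated
# file it includes the tape-to-disk recall time (seconds to minutes).
print(f"read {len(data)} bytes in {elapsed:.1f} s")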

3 Data Access in a Grid environment

3.1 Replica Management System

A characteristic of running applications in a grid environment is that the data which an application wants to use will often not be present at the site where the application will run, or, in the case that new data is produced, it must be stored at a location different from where the application runs. From a user's point of view, accessing remote data or storing data at a remote site must be as easy as working with local data in more traditional computing environments. The European Datagrid project (EDG), which started on January 1, 2001, formulated as one of its main goals the creation of a grid environment where data can be accessed almost as in a standard Unix environment. With the designed system the user can specify the required data with a globally unique filename or logical file name (LFN), and the job scheduling system of the grid environment will arrange that a copy of the physical file is directly accessible at the site where the application will be run. The LFN is unique at least in the context of the grid environment that the user operates in. The user doesn't know in advance where his application will run, so it would be impossible for him to arrange the data at the right site without the facilities that the EDG project created. We will discuss these facilities in some detail.

The main characteristics of the Replica Management System (RMS) [11] under development are 1) creation and management of replicas of physical data at different locations, 2) maintaining a metadata catalog, and 3) a replica location service. Because of the distributed nature of grid environments the performance of accessing data can be poor due to network bottlenecks or the load on a specific mass storage system. Therefore the ability to create copies (replicas) of data is implemented, and optimization of where to access data from is a key attribute of the RMS. The metadata catalog maintains, among other things, file, security, and management information. The replica location service (RLS) is responsible for linking the information about the physical location of the data with the logical file names as used in the grid environment. This information is contained in a so-called replica catalog. In addition to the mapping of logical file names to physical locations, the replica catalog also contains information about the protocols that may be used to access the data. For the user and other services the RMS behaves as a single entry point, but underneath it consists of a distributed set of services. Both replica metadata and replica location data can be replicated, in which case the RMS services do not have a single point of failure.
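The mapping performed by the replica location service can be pictured with a small sketch. This is not the EDG RLS API; the catalog contents, paths and the helper function below are illustrative only (the LFN and host names are taken from the examples later in this paper, the physical paths are invented).

# Illustrative only: a toy replica catalog mapping logical file names (LFNs)
# to the physical locations and access protocols of their replicas.
replica_catalog = {
    "71102084.lv1": [
        {"host": "teras.sara.nl", "path": "/grid/eo/71102084.lv1", "protocol": "gsiftp"},
        {"host": "grid0006.esrin.esa.int", "path": "/storage/eo/71102084.lv1", "protocol": "gsiftp"},
    ],
}

def locate(lfn):
    """Return access URLs for all known replicas of a logical file name."""
    return [f'{r["protocol"]}://{r["host"]}{r["path"]}' for r in replica_catalog.get(lfn, [])]

print(locate("71102084.lv1"))

In the real system the replica manager would also choose among these replicas based on network and storage load, as described above.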

3.2 Storage Element

In the Datagrid environment all places where data can be stored are designated as Storage Elements (SEs). An SE can be just disk space, but it can also be a mass storage system such as the one in use at SARA. Grid services like the RMS use a common API for accessing data on an SE. The protocol used for the transfer of data can differ, e.g. NFS or GridFTP [2]. One of the differences between GridFTP and standard FTP is that grid authentication services can be used with the former. Another difference is the use of multiple I/O streams to improve the throughput.

We enabled access to the SARA mass storage system as an SE by implementing grid software from the EDG project. The software had to be ported from source to the IRIX operating system environment, because the EDG binary distribution is currently based only on the RedHat Linux environment. Because the mass storage system behaves as a standard file system, no special software had to be developed to interface the GridFTP service to the mass storage system. To separate local access from access by grid users it is only necessary to create a separate file system; in this way different groups of grid users can also easily be separated. Authorized grid users (based on x509 certificates) can copy files to and from the designated file system, directly or with the use of other grid services, in both cases using GridFTP. In table 1 below we show a few entries from the log that the GridFTP server writes for every transaction that takes place to or from our mass storage system. Entries from the current day can be found at the following URL: http://grid.sara.nl/resources/gridftplog/gridftplog.html
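As a usage illustration, a sketch of how such a transfer could be initiated with the globus-url-copy client of the Globus Toolkit, driven here from Python. The local and remote paths are hypothetical, a valid grid proxy is assumed to be available, and the parallelism option may differ between client releases; this is not part of the EDG software itself.

import subprocess

# Hypothetical local file and remote path on the SE; gsiftp:// is the GridFTP URL scheme.
src = "file:///home/user/71102084.lv1"
dst = "gsiftp://teras.sara.nl/grid/eo/71102084.lv1"

# globus-url-copy is the Globus GridFTP command line client; "-p 4" requests
# 4 parallel data streams, one of the GridFTP features mentioned above.
subprocess.run(["globus-url-copy", "-p", "4", src, dst], check=True)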

Source              Destination               N bytes     Throughput (MB/s)
teras.sara.nl       grid0006.esrin.esa.int    14601763    0.7
deelteras.sara.nl   nat2.nikhef.nl            1636209     2.4
teras.sara.nl       gpp004.gridpp.rl.ac.uk    15787319    2.3

Table 1 – GridFTP log of Mass Storage System at SARA

3.3 Use case

We illustrate the operation of data access in the EDG environment with an example from an Earth Observation application [1], [12], [13].

With this application ozone profile data are produced from data from the Global Ozone Monitoring Experiment (GOME). GOME was launched on April 21, 1995 on board the second European Remote Sensing satellite (ERS-2). This instrument can measure a range of atmospheric trace constituents, with the emphasis on global ozone distributions. The GOME level 1 data are put on a Grid storage element (SE), e.g. the mass storage system at SARA, and the meta information is stored in a Grid accessible database, Spitfire. These data and metadata are used by KNMI to produce ozone profile data with the Opera software, by running it on a Computing Element (CE). The resulting profile data (level 2) are stored again on a Grid SE, and the profile metadata are stored on the Grid via the Spitfire database. So in this example a separate database, Spitfire, is used for storing the metadata of the data used.

In table 2 an example job descriptor file for a Datagrid job for this application is shown. The parameter InputData gives the logical file name (LFN) of the requested input file, and ReplicaCatalog gives the location of the replica catalog to be used for looking up the location of the physical data associated with the given LFN. In this example the user must still explicitly give the location of the replica catalog; however, the user doesn't have to know the name and location of his physical file in order to be able to use the data.

run-opera-71102084.jdl

Executable         = "/bin/csh";
StdOutput          = "opera.out";
StdError           = "opera.err";
InputSandbox       = {"/home/sdecerff/test/LFN2INFO.pl", "/home/sdecerff/test/opera.tar.gz",
                      "/home/sdecerff/test/runopera.csh", "/home/sdecerff/extractor/gdpfiles.tar",
                      "/home/sdecerff/extractor/scdegrad.108"};
OutputSandBox      = {"71102084.tar", "opera.out", "opera.err", "gomecal.in",
                      "gdp01_ex.err", "gomecal.msg", "71102084.lv1"};
InputData          = {"LF:71102084.lv1"};
ReplicaCatalog     = "ldap://gridvo.nikhef.nl:10389/lc=EarthOb WP1 Repcat,rc=EarthObReplicaCatalog,dc=eu-datagrid,dc=org";
DataAccessProtocol = "gridftp";
Arguments          = "runopera.csh";
Requirements       = other.Architecture == "intel";

Table 2 – Example of job descriptor file
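Job descriptor files like the one in Table 2 differ mainly in the orbit identifier, so they can be generated from a template. The sketch below is illustrative only and not part of the EDG tooling; the sandbox paths and replica catalog URL are copied from the example above and would have to be adjusted for a real submission.

# Illustrative only: generate a JDL file like the one in Table 2 for a given GOME orbit.
JDL_TEMPLATE = """\
Executable         = "/bin/csh";
StdOutput          = "opera.out";
StdError           = "opera.err";
InputSandbox       = {{"/home/sdecerff/test/LFN2INFO.pl", "/home/sdecerff/test/opera.tar.gz",
                      "/home/sdecerff/test/runopera.csh", "/home/sdecerff/extractor/gdpfiles.tar",
                      "/home/sdecerff/extractor/scdegrad.108"}};
OutputSandBox      = {{"{orbit}.tar", "opera.out", "opera.err", "gomecal.in",
                      "gdp01_ex.err", "gomecal.msg", "{orbit}.lv1"}};
InputData          = {{"LF:{orbit}.lv1"}};
ReplicaCatalog     = "ldap://gridvo.nikhef.nl:10389/lc=EarthOb WP1 Repcat,rc=EarthObReplicaCatalog,dc=eu-datagrid,dc=org";
DataAccessProtocol = "gridftp";
Arguments          = "runopera.csh";
Requirements       = other.Architecture == "intel";
"""

def write_jdl(orbit):
    """Write run-opera-<orbit>.jdl and return its file name."""
    name = f"run-opera-{orbit}.jdl"
    with open(name, "w") as f:
        f.write(JDL_TEMPLATE.format(orbit=orbit))
    return name

print(write_jdl("71102084"))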

4 Network performance

Network performance plays an important role in the access of data in a grid environment, because data often has to be transferred over long distances. And although the network bandwidth provided worldwide is increasing rapidly, this isn't always enough to get high performance transfers of data. With the increasing importance of the performance of large data transfers, the interest of research communities in this behavior has also increased. Examples are the DataTAG project [6], a European/US project, and the network work package of the EDG project [7].

The performance experienced for long distance transfers can be much smaller than the theoretical bandwidth of the network link [10]. In table 1 small throughput numbers are also seen; however, there this is also the result of the relatively small size (~1.5 MByte) of the transferred files. In the Datagrid project the data transfers in principle use the European research network GEANT, whose backbone capacity is 10 Gbit/s, and most institutions nowadays have 1 Gbit/s links to the European network infrastructure. Actual speeds for data transfers will in most cases be smaller than 100 Mbit/s or 10 MByte/s. With tuning of the network connections, speeds closer to the provided network bandwidth can be reached. We will discuss several of the factors that influence the network performance.

4.1 TCP buffer space

The size of the buffer space used by the TCP protocol can have a dramatic influence on network performance if it is set too low. Because TCP provides reliable transfer of data packets between the two end points of a connection, all data which isn't acknowledged by the receiving side must be buffered by the TCP layer. If a packet isn't signaled as received, the sender can resend the missing packet. Only after the sender has received a so-called ACK packet for a data packet in the buffer can this packet be removed from the buffer space. As a consequence the buffer must have enough space for storing the data packets that are not yet acknowledged. For the minimum size in bytes of the buffer space the following rule can be formulated:

Buffer size >= RTT * Bps    (1)

where RTT is the round trip time of a packet between the end points of a connection and Bps is the network speed between the end points in bytes per second. The sender can continue sending new packets as long as the buffer space isn't filled up. If the buffer space fills up before the first packet is acknowledged, the sender will stop sending, which of course decreases network performance. From equation (1) it is easy to see that the demand on buffer space increases with increasing RTTs, and also with increasing network speeds.
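A small sketch of equation (1), and of how an application could request larger per-connection buffers before opening a connection. The numbers reproduce the example given below; the socket itself is a placeholder and is never connected here.

import socket

def min_buffer_bytes(rtt_s, rate_bytes_per_s):
    """Minimum TCP buffer size in bytes from equation (1): RTT * network speed in bytes/s."""
    return rtt_s * rate_bytes_per_s

# 20 ms RTT on a 1 Gbit/s (= 125 MByte/s) link -> 2.5 MByte.
print(min_buffer_bytes(0.020, 125e6) / 1e6, "MByte")

# An application can request larger buffers for its own connection; the values
# it gets are capped by the system-wide maximum discussed below.
size = int(min_buffer_bytes(0.020, 125e6))
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)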

This explains why tuning of the buffer space is important in grid environments: the RTTs for data transfers are larger than for local transfers, and the international bandwidth capacities have grown considerably in the last years. As an example, with an RTT of 20 ms (typical for a European connection) and a theoretical bandwidth of 1 Gbit/s (= 125 MByte/s) we have a minimum buffer size requirement of 0.02 * 125 = 2.5 MByte. This value is much larger than the default values most systems use.

On Linux systems the system settings can be viewed and set with the "sysctl" command. With "sysctl -a" you get the Linux system parameter settings. The TCP buffer sizes are displayed as

net/ipv4/tcp_rmem = 4096 2097152 2097152
net/ipv4/tcp_wmem = 4096 2097152 2097152

for the receiving and sending buffer space respectively. The first number is the minimum size, the second is the default and the third is the maximum value that can be set. Each TCP connection has its own buffer, and an application can change the default setting when it sets up a TCP connection. The system wide values can be changed by the system administrator. It is important that the default value is set large enough, following equation (1).

4.2 TCP protocol

Another aspect of the TCP protocol is that if packets get lost, e.g. because of network congestion, errors on the link, or adapter load, the protocol will decrease the window of outstanding packets, the congestion window (CWND). This has the same effect as a small buffer space. Standard TCP implementations will increase this congestion window again over time. However, the increase with standard implementations is slow, especially with high RTTs and high bandwidth values. Even with a dedicated high bandwidth link the results of TCP transfers can be surprisingly bad for a long delay link (high RTT). In the research reported in [3], a dedicated link between Amsterdam and Chicago was used, a 10 Gbit/s lambda. It appeared that the standard TCP implementation isn't well suited for high bandwidth, long delay network links. Changes to the TCP protocol can considerably improve the network performance for this kind of network link. Several groups are now investigating the behavior of the TCP protocol for high bandwidth and long delay networks, and as a result experiments with new TCP implementations are forthcoming. Experiments with complete replacements of the TCP layer protocol are also being performed.
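A rough back-of-the-envelope sketch shows why this recovery is slow: in standard congestion avoidance the window grows by about one segment per round trip, so regrowing a halved window on a high bandwidth, high RTT path takes many round trips. The numbers below reuse the European example from section 4.1 and a typical segment size; this is a simplification, not a model of any particular TCP implementation.

def recovery_time_s(rtt_s, rate_bytes_per_s, mss_bytes=1460):
    """Rough time for standard TCP to regrow its congestion window after a single
    loss halves it, assuming growth of one segment per RTT (congestion avoidance)."""
    window_segments = rtt_s * rate_bytes_per_s / mss_bytes   # full window in segments
    return (window_segments / 2.0) * rtt_s                   # regrow half the window

# 20 ms RTT at 1 Gbit/s (125 MByte/s): roughly 17 seconds to recover from one loss.
print(f"{recovery_time_s(0.020, 125e6):.1f} s")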

5 Conclusions

We discussed the use of a mass storage system, connected to the Dutch National Supercomputer service, in the grid environment of the European Datagrid project. Datagrid users can use the mass storage facilities without having direct access to the system itself. Users don't even have to know that their data is located in this particular mass storage system. We were able to implement the Datagrid services without interrupting other services. And, although the Datagrid services are still not very robust, normal operations of the supercomputing services at SARA were not interrupted by problems from the Datagrid service.

Acknowledgements

We acknowledge the EU support for the work described in this paper, as part of the IST DataGrid project IST-2000-25182 of the Fifth Framework Program. We acknowledge the support of Wim Som de Cerff, KNMI, The Netherlands, for providing information.

References

[1] R.J. van der A, R.F. van Oss, A.J.M. Piters, J.P.F. Fortuin, Y.J. Meijer and H. Kelder. Ozone profile retrieval from recalibrated Global Ozone Monitoring Experiment data. J. Geophys. Res. 107, 10.1029/2001JD000696, 2002.
[2] W. Allcock, J. Bester, J. Bresnahan, A. Chervenak, L. Liming, S. Meder, and S. Tuecke. GridFTP protocol specification. GGF GridFTP Working Group Document, September 2002.
[3] Antony Antony, Johan Blom, Cees de Laat, Jason Lee, Wim Sjouw. Microscopic examination of TCP flows over transatlantic links. Accepted for publication in a special issue of FGCS on the iGrid2002 conference, Amsterdam, September 2002. Technical report, http://carol.wins.uva.nl/~delaat/techrep-2003-2-tcp.pdf
[4] Fran Berman, Anthony J.G. Hey, Geoffrey Fox, editors. Grid Computing: Making The Global Infrastructure a Reality. John Wiley, 2003. ISBN 0-470-85319-0.
[5] The European Datagrid project, http://eu-datagrid.web.cern.ch/eu-datagrid/
[6] DataTAG project, http://www.datatag.org
[7] EDG Network services, http://ccwp7.in2p3.fr
[8] Ian Foster, Carl Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.
[9] I. Foster, C. Kesselman, G. Tsudik, S. Tuecke. "A Security Architecture for Computational Grids". Proc. 5th ACM Conference on Computer and Communications Security, pp. 83-92, 1998.
[10] Jason Lee, D. Gunter, B. Tierney, W. Allcock, J. Bester, J. Bresnahan, S. Tuecke. Applied Techniques for High Bandwidth Data Transfers across Wide Area Networks. Proceedings of Computing in High Energy Physics 2001 (CHEP 2001), Beijing, China, LBNL-46269.
[11] Leanne Guy, Peter Kunszt, Erwin Laure, Heinz Stockinger, Kurt Stockinger. "Replica Management in Data Grids", http://edms.cern.ch/document/350430
[12] W.J. Som de Cerff, J. van de Vegte. "Datagrid use case – Annex to Requirements specification: EO application requirements for GRID", https://edms.cern.ch/file/346050/12/DataGrid-WP94UseCase-V12.doc
[13] W.J. Som de Cerff. "Results of running EO Applications on EDG 1.4", https://edms.cern.ch/file/377513/3/DataGrid-wp9-D93v2-1.doc
