Towards a Scalable Metacomputing Storage Service

Craig J. Patten, K.A. Hawick and J.F. Hercus
Advanced Computational Systems Cooperative Research Centre
Department of Computer Science, University of Adelaide,
Adelaide, SA 5005, Australia
Email: {cjp,khawick,james}@cs.adelaide.edu.au

Technical Report DHPC-058
December 1998

Abstract. We describe a prototypical storage service through which we are addressing some of the open storage issues in wide-area distributed high-performance computing. We discuss some of the relevant topics such as latency tolerance, hierarchical storage integration, and legacy and commercial application support. Existing storage mechanisms in high-performance computing environments are either ad hoc or focus narrowly on the simple client-server case. The storage service which we are developing as part of the DISCWorld metacomputing infrastructure will provide high-performance access to a global “cloud” of storage resources in a manner which is scalable, secure, adaptive and portable, requiring no application or operating system modifications. Our system design provides flexible, modular and user-extensible access to arbitrary storage mechanisms and on-demand data generation and transformations. We describe our current prototype’s status, some performance analysis, other related research and our future plans for the system.

1 Introduction

Wide-area distributed high-performance computing, or metacomputing [4], is becoming an increasingly active research field. Relatively recent advances in high-bandwidth wide-area network technology have released the centralised supercomputer’s hold on high-performance computing, and provided for greater integration of geographically distributed high-performance computing, communications, storage and visualisation resources. These metacomputing systems naturally present many interesting challenges in the areas of storage and I/O. Most existing systems for handling data storage and retrieval across networks were designed for different workloads and operational environments, and did not address many of the issues which are unique to wide-area distributed computing. Within the DISCWorld [10, 11] metacomputing infrastructure project, we are developing a storage system to address some of these issues.

This system, the DISCWorld Storage Service (DSS), has been designed to provide a standard file system interface to a global “cloud” of storage resources. The defining attribute of the system is its sole purpose of providing a storage service for applications in a wide-area metacomputing environment; we are not developing a general distributed file system. Most existing technology in the area of distributed storage and I/O is designed to address different problems from those encountered in metacomputing systems; this is our rationale for working towards a scalable metacomputing storage service. Issues such as separate administrative zones, large network latencies, non-uniform network topologies, highly heterogeneous systems, security and the need to support custom as well as legacy and commercial applications all present problems which cannot be suitably addressed using existing systems.

Our aim is not to produce a system which attempts to solve all of these problems. Rather, we are concentrating on providing a storage service which is decentralised and scalable; portable, requiring no application or operating system modifications or additions; capable of handling wide-area networking issues such as high-latency links; and flexible and adaptive in its use of communications and storage resources.

Our main driving applications for the system centre around Geographical Information Systems (GIS), satellite imagery archives [9] and other scientific applications which place large demands on the underlying storage infrastructure. Locally we store an up-to-date GMS5 [8] satellite imagery archive, and we have built example driver applications on top of this archive, such as the ERIC [12] satellite image browser.

In Section 2 we discuss in more detail the various issues relevant to storage and I/O in metacomputing systems. In Section 3 we describe the architecture and current prototype implementation of the DISCWorld Storage Service. Section 4 surveys related work, and in Section 5 we discuss performance issues and preliminary results. Finally, we conclude with our future plans for the DSS.

2 Metacomputing Storage

Storage and I/O in metacomputing systems present different challenges from those addressed by traditional distributed and network storage technology. The magnitudes of both the workloads and the wide-area network latencies are two of the most obvious. Existing distributed file systems are predominantly geared towards “everyday” data usage across relatively local areas, where network latency is not a bottleneck and the file size distribution is skewed much lower than in high-performance scientific computing. Existing mechanisms for remote data access in metacomputing systems, whilst generally designed for bulk data transfer over high-latency networks, are less sophisticated than their distributed file system counterparts. We discuss some existing systems in Section 4.

Besides latency, other network issues also impinge on the design of any distributed storage system operating across the wide area. An increased level of heterogeneity in network and storage technology, higher network topology complexity and greater reliability problems all present challenges rarely seen in local-area storage management. Across the wide area, computers are often reachable through multiple different networks, and sometimes possess multiple network interfaces with different performance characteristics. Wide-area high-performance network management itself presents reliability and robustness problems to be addressed.

Administrative and security issues also become more complex in wide-area environments. Computers across these networks belong to different institutions and companies, and reside in many different countries. Global user and host authentication, and differing security requirements and infrastructures, present challenges to developers of metacomputing systems and applications.

As these networks spread over wider areas and connect greater numbers of computers, the collective level of heterogeneity in the system increases, and portability becomes a more important attribute of any storage system that is to operate over the wide area. It is therefore important for a storage system to provide access to applications without requiring access to source code or custom operating system modifications, for example commercial or legacy applications where source code is unavailable. Any distributed storage system which requires source code modifications, applications to be relinked against custom libraries, or custom operating system modules or modifications, consequently renders an immense body of existing and future software and platforms incompatible.

Recently emerging in wide-area distributed computing is the concept of remotely “leasing” access to high-performance computing resources [5]. In a storage context, this raises a number of technical issues such as data security, access methods for leased storage resources, and the associated electronic commerce mechanisms and policies.

In a distributed computing system where the paradigm extends beyond the simple client-server case, bulk data transfer becomes more complex, in that the source, destination and initiator of such transfers may each comprise multiple, geographically distributed hosts. A user or application on a mobile computer with poor connectivity may wish to transparently transfer data between distributed storage resources that possess high-performance connectivity, without involving the poorly-connected host in the actual transfer. This example extends to arbitrary parallel data transfers such as gather/scatter operations and on-demand bulk data transformations between distributed storage resources. Extending the data replication model to enable intermediate proxy, caching and prefetching nodes also holds great potential for distributed storage performance. Existing distributed file systems and metacomputing remote data access mechanisms cannot provide the potential performance benefits available from such flexible and integrated use of storage resources. The storage service we describe provides a framework for addressing these limitations and has the potential to drastically improve wide-area data transfer performance.
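
To make this beyond-client-server transfer model concrete, the following C sketch shows one possible shape of a third-party transfer request; the structure and function names are illustrative assumptions only and are not part of the DSS protocol described here.

    /* Hypothetical third-party bulk transfer request: the initiator names a
     * source and a destination DSS node, neither of which need be the
     * initiator itself.  All identifiers here are assumptions. */
    struct dss_transfer_request {
        char source_node[256];   /* DSS node currently holding the data    */
        char dest_node[256];     /* DSS node that should receive the data  */
        char object_name[1024];  /* dataset path within the DSS namespace  */
        long offset;             /* first byte to transfer                 */
        long length;             /* bytes to transfer, -1 meaning "all"    */
        int  transform_id;       /* optional on-demand transformation      */
    };

    /* The initiator submits the request to either endpoint and may then
     * disconnect; the endpoints move the data over their own fast links. */
    int dss_submit_transfer(const struct dss_transfer_request *req);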

3 DISCWorld Storage Service (DSS) Architecture

The DISCWorld Storage Service (DSS) architecture is illustrated in Figure 1. Each node in the metacomputing system may run a DSS daemon that emulates a Network File System (NFS) [15, 18, 19] server, called the DISCWorld File System, DWorFS [13]. The NFS interface was chosen to allow commercial and legacy applications access to the DSS architecture without requiring any source or runtime modifications; all that is required is that the systems running the applications can NFS-mount the file system exported by DWorFS. Other systems which use user-level NFS daemons include the Cryptographic File System [1], the Semantic File System [7] and ATTIC [3].

Fig. 1. Overview of the DISCWorld Storage Service architecture. Each node executes a DSS daemon, controlling local storage resources and brokering access to those resources for NFS clients on their local network(s) and remote DSS peers. (The figure shows, within a DISCWorld node, the DWorFS NFS front end and Storage Manager, dataset modules such as GOES-9, the DSS storage module, local clients and other DSS peers.)

To emulate the file systems presented by the DWorFS layer, the DSS architecture uses dynamically-loaded modules. Each module is responsible for presenting its underlying storage architecture as if it were a real file system, regardless of its internal structure.

Consider Figure 1. The DISCWorld node on the left executes a DSS daemon, which consists of the front-end DWorFS NFS interface, the DSS Storage Manager, and the underlying dataset and storage modules. NFS requests from local clients are received at the DWorFS layer and passed down to dataset-specific modules, which process each request using whatever storage resources are available through the Storage Manager. The Storage Manager communicates with storage modules which control access to local storage resources, and to remote DSS nodes through the DSS module and protocol. Incoming requests from other DSS nodes are received by the DWorFS layer and processed similarly.

The major strength of the DSS architecture is that the dynamically-loaded modules can be of arbitrary complexity. A DSS implementation is therefore able to support arbitrary storage architectures, some of which may be virtual rather than physical in nature. It is also able to provide transparent access to distributed resources via the DWorFS layer without imposing any restrictions on the underlying networking protocols and infrastructure that may be used.
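
As a rough illustration of how such modules might plug in, the C sketch below shows a dataset-module interface expressed as a table of function pointers resolved at load time with dlopen(); the interface and symbol names are assumptions made for illustration and do not reflect the actual DSS module API.

    #include <dlfcn.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Hypothetical dataset-module interface: each module presents its
     * storage (archive, database, tape silo, ...) as a file system tree. */
    struct dss_module_ops {
        int     (*lookup)(const char *path, unsigned long *fhandle);
        ssize_t (*read)(unsigned long fhandle, void *buf, size_t len, off_t off);
        ssize_t (*write)(unsigned long fhandle, const void *buf, size_t len, off_t off);
        int     (*getattr)(unsigned long fhandle, struct stat *st);
    };

    /* Load a module shared object and obtain its operations table. */
    struct dss_module_ops *dss_load_module(const char *so_path)
    {
        void *handle = dlopen(so_path, RTLD_NOW);
        if (handle == NULL)
            return NULL;
        /* By convention (assumed here) the module exports one well-known symbol. */
        return (struct dss_module_ops *) dlsym(handle, "dss_module_ops");
    }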

3.1 Wide Area Network Support

As discussed below, NFS performance characteristics may be appropriate to the high-bandwidth, low-latency behaviour of local area networks, but degrade badly over the high-latency links of wide area networks, especially when large bulk data transfers are required. To address this issue the DSS architecture assumes that a dynamically-loaded module will be available to implement a bulk transfer protocol between DSS daemons. In Figure 1, the DSS storage module fulfils this role and is able to provide direct access from local dataset modules to the bulk transfer protocol. The bulk transfer protocol can then be used for tasks such as streaming data from a tape silo directly to DSS daemons elsewhere on a wide area network. Alternatively, it could be used to achieve efficient transfers between file systems emulated by local modules.

The design of the DSS differs from most other distributed storage systems in that the Storage Manager can be configured to take advantage of the concept of the “distance” to the data. Most systems which incorporate prefetching or caching stay within the client-server paradigm: data is prefetched from the server and cached at some level in the storage hierarchies of the server and/or client. When distributed storage systems accept data to be written to some remote host, the usual request-reply transactions occur and the data generally proceeds directly to the remote destination. The DSS will enable, for example, write requests to be buffered at some nearby fast disk array and transferred to their final destination in bulk. We are also investigating the capability of arbitrarily configuring a DSS node to use other DSS nodes for proxy caching and prefetching.
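
As a rough illustration of how a “distance-aware” Storage Manager might decide between writing directly to a remote node and staging on a nearby disk array, the following C sketch compares the per-block latency cost of the two routes; the threshold and cost model are assumptions, not the actual DSS policy.

    #include <stddef.h>

    enum write_route { ROUTE_DIRECT, ROUTE_STAGE_NEARBY };

    /* Choose where an application's write should go first.  The request/reply
     * latency is paid once per block; staging wins when the nearby array is
     * substantially "closer" than the final destination, since the staged data
     * can later be flushed in bulk with a latency-tolerant protocol. */
    enum write_route choose_write_route(double rtt_dest_s,   /* RTT to final node    */
                                        double rtt_nearby_s, /* RTT to staging array */
                                        size_t length,
                                        size_t block_size)
    {
        double blocks = (double)(length + block_size - 1) / (double)block_size;
        double direct_cost = blocks * rtt_dest_s;
        double staged_cost = blocks * rtt_nearby_s;  /* bulk flush cost hidden */
        return (staged_cost * 2.0 < direct_cost) ? ROUTE_STAGE_NEARBY : ROUTE_DIRECT;
    }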

3.2 GMS5 Satellite Imagery Access

To demonstrate the utility of this architecture we have implemented a module to interface to our GMS5 satellite image repository. Assuming that the /dss directory on a system is the local mount-point for the DSS, an application could attempt to open the file

/dss/GMS5/vis/98/02/02/05/32/250x200+950+1700.hdf

This prompts the GMS5 module to access its image repository, check for the existence of the named image and, if successful, present a file handle for the requested “file” through the DWorFS NFS layer. In the above example, a 250x200 portion of the Hierarchical Data Format (HDF) GMS5 satellite image for 2/2/1998 0532 UTC, cropped from coordinates (950,1700), is generated on demand. The application is shielded from the underlying organisation of the data, which in our archive is a collection of compressed HDF files, but which could equally be, for example, a database or a distributed persistent object store.
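
The naming convention in this example can be decoded with a straightforward parse. The sketch below illustrates, under our own assumed field names, how a module such as the GMS5 one might unpack such a request path; it is not taken from the DSS source.

    #include <stdio.h>

    /* Parsed form of a request such as
     *   /dss/GMS5/vis/98/02/02/05/32/250x200+950+1700.hdf
     * Field names are illustrative. */
    struct gms5_request {
        char band[8];                    /* e.g. "vis"                 */
        int  yy, mm, dd, hh, min;        /* image timestamp components */
        int  width, height, xoff, yoff;  /* crop geometry              */
    };

    int parse_gms5_path(const char *path, struct gms5_request *r)
    {
        int n = sscanf(path, "/dss/GMS5/%7[^/]/%d/%d/%d/%d/%d/%dx%d+%d+%d.hdf",
                       r->band, &r->yy, &r->mm, &r->dd, &r->hh, &r->min,
                       &r->width, &r->height, &r->xoff, &r->yoff);
        return (n == 10) ? 0 : -1;
    }

    int main(void)
    {
        struct gms5_request r;
        if (parse_gms5_path("/dss/GMS5/vis/98/02/02/05/32/250x200+950+1700.hdf", &r) == 0)
            printf("%s %02d/%02d/%02d %02d:%02d crop %dx%d at (%d,%d)\n",
                   r.band, r.dd, r.mm, r.yy, r.hh, r.min,
                   r.width, r.height, r.xoff, r.yoff);
        return 0;
    }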

3.3 Prototype Hierarchical File System

Robotic tape silos form the lowest level of many large-scale on-line and near-line data storage and backup systems. They provide economic storage for large data sets which cannot realistically be kept on disk. However, their integration into the storage hierarchy is far from ubiquitous or standardised, and is invariably proprietary and expensive, application- and data-specific, or reliant on human operator intervention. The DSS allows hierarchical storage to be integrated into the system through the customisable storage modules. The module for a tape silo presents the same interface as any other storage module; its internal behaviour differs to reflect the access properties, namely very high latency, of the underlying storage resource.

Our prototype hierarchical storage module, currently a simple proof of concept, provides applications with read/write access to files stored within a tape silo. The very large latencies of tape storage systems are the most important factor to be considered in developing such a module. The issues involved are similar to those for storage systems operating over a high-latency, moderate-bandwidth network, although the latencies are much larger. In our project we use a StorageTek TimberWolf 9740 tape silo containing two Redwood SD-3 drives. The average latency for access to data stored in this system is approximately 75 seconds [17]. This figure is dominated by the tape seek time, which can be reduced by using lower-capacity tapes (the figure assumes a 50GB tape). Regardless of the size of tape used, however, the latency dominates the access times for even very large (larger than 100MB) data objects.

As an example, consider an archive of large medical or scientific images stored on a tape silo. To overcome the latency of the silo, a local high-speed disk array is used as a cache, buffering writes to the store and providing space for frequently used data and for prefetching. This hierarchical storage resource can be readily managed by attaching appropriate dataset and storage modules to the DSS.
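
To sketch how such a module might behave, the following C fragment shows a cache-first read path under assumed names (CACHE_DIR, tape_recall_to()); it is an illustration of the approach rather than the prototype's actual code.

    #include <stdio.h>

    /* Hypothetical read path for a hierarchical storage module: serve the
     * request from a disk-array cache when possible, otherwise recall the
     * whole object from tape first, since a single ~75 second tape mount
     * dominates any per-block cost. */
    #define CACHE_DIR "/cache/dss"

    int tape_recall_to(const char *object, const char *dest_path); /* queue a silo recall */

    FILE *hsm_open(const char *object)
    {
        char path[1024];
        snprintf(path, sizeof path, "%s/%s", CACHE_DIR, object);

        FILE *f = fopen(path, "rb");
        if (f != NULL)
            return f;                    /* cache hit: serve from the disk array */

        if (tape_recall_to(object, path) != 0)
            return NULL;                 /* recall failed or silo unavailable    */

        return fopen(path, "rb");        /* cache now populated; serve from disk */
    }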

4 Related Work

A large body of work exists in the field of distributed file systems, and the research effort into metacomputing, and metacomputing storage mechanisms, is growing. The Andrew File System (AFS) [16] and NFS have proven highly successful for remote data access in predominantly workstation environments; however, these and other existing distributed file systems are not directly applicable in a metacomputing environment. An in-depth study of all existing technology and its relation to our work is beyond the scope of this paper, but we briefly discuss a few systems with goals similar to our own.

As part of the Globus [6] metacomputing infrastructure, Remote I/O (RIO) provides remote data access, including some collective operations, through a latency-tolerant protocol. It is accessed through the ADIO abstract I/O programming interface, and is used by systems such as a message-passing I/O implementation. Also part of Globus, Global Access to Secondary Storage (GASS) provides a set of replacement I/O calls, such as read() and write() on Unix systems, which can communicate with remote HTTP, FTP and GASS servers and perform some rudimentary caching. Applications wishing to use this functionality must be recompiled against these custom routines.

WebFS, part of the WebOS [23] system, provides access to the HTTP namespace through a standard file system interface that is transparent to user applications. However, HTTP currently has limited semantics in an I/O context and is only slightly more sophisticated than anonymous FTP. The current WebFS system is specific to the Solaris operating system, using a kernel module to implement its functionality.

WebNFS [20, 21, 22] is an extension by Sun to the standard NFSv3 protocol, providing access to NFS servers from within web-enabled applications and improving the handling of latency in the NFS protocol. By using a mechanism called Multi-Component Lookup (MCL), in which lookup requests can contain entire pathnames rather than single pathname components, WebNFS removes some of the inherent latency intolerance of the NFS protocol.

5 Performance Discussion

The Sun Network File System (NFS) is the most ubiquitous distributed file system implementation. NFS is generally implemented with a relatively small block transfer size such as 8192 bytes (8KB). Under NFS Version 2 (NFSv2) this is the specified maximum, whereas the NFS Version 3 (NFSv3) protocol specification defines no limit: clients can request a desired read and write block transfer size, which the server may choose to honour. For example, our investigations show that Digital UNIX 4.0D has maximum read/write block sizes of 8KB, while Solaris 2.6 implements a maximum (and default) read/write block size of 32KB. The Andrew File System (AFS) allows block sizes of 64KB.

An NFS read request from an application program will typically be implemented as a sequence of requests for 8KB data blocks, each incurring a latency or overhead cost as well as the bandwidth cost to transfer the data.

This is unacceptably slow for transfers across wide area networks, where the latency of a single request may be of the order of tens or hundreds of milliseconds or more. Standard NFS file/directory lookup requests, which issue only one pathname component per request to provide internal field separator independence, are also highly intolerant of latency. WebNFS addresses this issue through Multi-Component Lookup (MCL), but this has yet to become widespread functionality.

Performance tests using the Internet Protocol over available wide-area networks have indicated approximate round-trip times (RTT) of 20ms between the Australian cities of Adelaide and Canberra, 200ms across a dedicated research link between Canberra and Japan, and approximately half a second across conventional Australia-USA-Japan internet links. Consider the cost of transferring a 5MB file using conventional NFS with an 8KB transfer block size. This requires some 640 request/reply pairs, incurring a latency cost of approximately 12.8 seconds for the Adelaide-Canberra link and 128 seconds between Canberra and Japan. It is likely that as faster networks become available this will become the dominant cost, if it is not already. The dedicated research link available to us between Canberra and Japan has a bandwidth of 1Mbit/s (128KB/s), thus contributing only 40 seconds to the transfer time. Hence, the latency overhead comprises 128/(40+128), or approximately 76%, of the total. This figure will only worsen as more bandwidth becomes available, whereas the latency is fixed by speed-of-light limitations. We have a dedicated 10Mbit/s link between Adelaide and Canberra, yielding a transfer time component of 4 seconds, which is likewise smaller than the latency overhead component. At 155Mbit/s, a rate at which our link is capable of being configured, the bandwidth component is almost negligible compared to the latency overhead.

This analysis shows the critical importance of a bulk data transfer mechanism that uses larger transfer block sizes and a latency-tolerant protocol. Existing implementations of NFS are not optimised for wide-area use: the limited amount of block size tuning which NFSv3 implementations allow offers only a factor of two or four improvement, whereas at least an order of magnitude improvement is desirable. Performance gains can also be achieved by incorporating prefetching and pipelining transfer mechanisms into a wide-area bulk data transfer system. Whether and how to support prefetching and caching in NFS is left to the implementors; however, existing implementations are not aggressive enough to effect a large performance improvement for the bulk data transfers used in wide-area metacomputing, as they are geared towards different I/O workloads. Typically, when an NFS client issues a read request, the requested block and perhaps some of the following blocks are read by the server. For large transfers across wide-area networks, better performance could be achieved by using such a request as a hint to start transferring larger portions of the file to some “closer” storage resource, for example a fast disk array on a nearby LAN or a local cache disk, in anticipation of future requests.
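
The arithmetic above can be reproduced with the simple cost model below; it charges one round trip per block plus the raw bandwidth cost, using the link parameters quoted in the text, and ignores all other protocol overheads.

    #include <stdio.h>

    /* Crude transfer-time model for a request/reply protocol with a fixed
     * block size: one round trip per block plus the raw bandwidth cost. */
    static double transfer_time_s(double size_bytes, double block_bytes,
                                  double rtt_s, double bandwidth_Bps)
    {
        double round_trips = size_bytes / block_bytes;   /* 5 MB / 8 KB = 640 */
        return round_trips * rtt_s + size_bytes / bandwidth_Bps;
    }

    int main(void)
    {
        double size  = 5.0 * 1024 * 1024;   /* 5 MB file              */
        double block = 8.0 * 1024;          /* 8 KB NFS transfer size */

        /* Canberra-Japan research link: 200 ms RTT, 1 Mbit/s (~128 KB/s). */
        double t = transfer_time_s(size, block, 0.200, 128.0 * 1024);
        printf("Canberra-Japan: %.0f s total (%.0f s latency, %.0f s bandwidth)\n",
               t, (size / block) * 0.200, size / (128.0 * 1024));

        /* Adelaide-Canberra link: 20 ms RTT, 10 Mbit/s. */
        t = transfer_time_s(size, block, 0.020, 10.0e6 / 8);
        printf("Adelaide-Canberra: %.1f s total\n", t);
        return 0;
    }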

Our current implementation of the DSS is still at an early prototype stage, and we do not yet possess detailed performance results. However, we have performed some preliminary experiments using NFSv3 (8KB blocks) across the loopback interface on Digital UNIX hosts, which approximates how applications would perform when executing directly on a DISCWorld Storage Service node. On systems where the local file system read and write bandwidth was approximately 4.4MB/s, read performance through the loopback NFS path dropped to 4.0MB/s and write performance dropped to 1.5MB/s. Whilst the write performance is not especially encouraging, we believe this performance tradeoff is, at least for now, worth making for the sake of portability.

A more detailed performance evaluation of using the DWorFS NFS layer and a dynamically-loaded dataset module to provide access to a persistent object store has been undertaken by Brown [2]. The untuned, user-space DWorFS implementation used in that study suffers relative to the Solaris 2.5 NFS server, due to the block size and synchronous-write limitations of the NFSv2 protocol used by DWorFS, but in some write-heavy operations it outperforms the Solaris server when the module minimises stable-store write activity.

6 Conclusions and Future Directions

We have described a distributed storage service through which we are investigating the storage and I/O issues in metacomputing systems. The key attributes of this system are its decentralised design, the transparent interface it offers to user applications, its extensibility in dataset and storage management, and its ability to take advantage of arbitrary storage resources in a flexible fashion.

Future work consists of a complete DSS implementation and a detailed performance analysis thereof. Using the system, we also intend to investigate data access pattern analysis and prediction, as well as compression and data layout issues. Security is, of course, also an area to be addressed; however, other research efforts are focussing more intensely on this aspect of distributed storage, and we will likely draw on their results and experiences rather than concentrate on this topic ourselves.

Our initial experience and the performance results gained from implementing the upper layers of the DISCWorld Storage Service have shown that the design ideas described in this paper are feasible to implement. When we have completed the full implementation we will be able to investigate typical access behaviours, with a view to verifying our approximate predictions and design assumptions.

7 Acknowledgements

This work is being carried out as part of the Distributed High Performance Computing Infrastructure (DHPC-I) project of the Research Data Networks (RDN) Cooperative Research Centre and is managed under the On-Line Data Archives Program of the Advanced Computational Systems (ACSys) CRC. RDN and ACSys are established under the Australian Government’s CRC Program. Thanks to F.A. Vaughan for his encouragement in developing the system described here.

References

1. Matt Blaze, A Cryptographic File System for Unix, Proc. First ACM Conference on Communications and Computing Security, Fairfax, Virginia, November 1993.
2. A. L. Brown, Utilising NFS to Expose Persistent Object Store I/O, Proc. Sixth IDEA Workshop, Rutherglen, Australia, January 1998.
3. Vincent Cate and Thomas Gross, Combining the Concepts of Compression and Caching for a Two-Level Filesystem, Proc. Fourth ASPLOS, Santa Clara, April 1992, pp. 200-211.
4. C. Catlett and L. Smarr, Metacomputing, Comm. ACM, 35 (1992), pp. 44-52.
5. B. Christiansen, P. Cappello, M. F. Ionescu, M. O. Neary, K. E. Schauser and D. Wu, Javelin: Internet-Based Parallel Computing Using Java, Proc. ACM Workshop on Java for Science and Engineering Computation, June 1997.
6. I. Foster and C. Kesselman, Globus: A Metacomputing Infrastructure Toolkit, Intl. Journal of Supercomputer Applications, 11(2):115-128, 1997.
7. David K. Gifford, Pierre Jouvelot, Mark A. Sheldon and James W. O’Toole, Jr., Semantic File Systems, Proc. Thirteenth Symposium on Operating Systems Principles, 1991.
8. Japanese Meteorological Satellite Center, The GMS User’s Guide, 2nd Ed., 1989.
9. K.A. Hawick, H.A. James, K.J. Maciunas, F.A. Vaughan, A.L. Wendelborn, M. Buchhorn, M. Rezny, S.R. Taylor and M.D. Wilson, Geographic Information Systems Applications on an ATM-Based Distributed High Performance Computing System, Proc. HPCN Europe ’97, Vienna, April 1997.
10. K.A. Hawick, P.D. Coddington, D.A. Grove, J.F. Hercus, H.A. James, K.E. Kerry, J.A. Mathew, C.J. Patten, A.J. Silis and F.A. Vaughan, DISCWorld: An Environment for Service-Based Metacomputing, Future Generation Computer Systems, Special Issue on Metacomputing. Also DHPC Technical Report DHPC-042, April 1998.
11. K.A. Hawick, H.A. James, Craig J. Patten and F.A. Vaughan, DISCWorld: A Distributed High Performance Computing Environment, Proc. HPCN Europe ’98, Amsterdam, April 1998.
12. H.A. James and K.A. Hawick, A Web-based Interface for On-Demand Processing of Satellite Imagery Archives, Proc. Australian Computer Science Conference (ACSC) ’98, Perth, February 1998.
13. Craig J. Patten, F.A. Vaughan, K.A. Hawick and A.L. Brown, DWorFS: File System Support for Legacy Applications in DISCWorld, Proc. Fifth IDEA Workshop, Fremantle, February 1998.
14. Brian Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diane Lebel and David Hitz, NFS Version 3 Design and Implementation, Proc. USENIX 1994 Summer Conference, June 1994.
15. R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh and B. Lyon, Design and Implementation of the Sun Network Filesystem, Proc. USENIX 1985 Summer Conference, pp. 119-130, 1985.
16. M. Satyanarayanan, J.H. Howard, D.N. Nichols, R.N. Sidebotham, A.Z. Spector and M.J. West, The ITC Distributed File System: Principles and Design, Proc. Tenth Symposium on Operating Systems Principles, pp. 35-50, 1985.

17. StorageTek TimberWolf 9740 Specifications, URL http://www.stortek.com/StorageTek/hardware/tape/9740.
18. Sun Microsystems, NFS V2 Specification, RFC 1094, March 1989.
19. Sun Microsystems, NFS V3 Specification, RFC 1813, June 1995.
20. Sun Microsystems, WebNFS Client Specification, RFC 2054, October 1996.
21. Sun Microsystems, WebNFS Server Specification, RFC 2055, October 1996.
22. Sun Microsystems, WebNFS, April 1997.
23. Amin Vahdat, Thomas Anderson, Michael Dahlin, Eshwar Belani, David Culler, Paul Eastham and Chad Yoshikawa, WebOS: Operating System Services for Wide Area Applications, Proc. Seventh HPDC Conference, Chicago, July 1998.