A Parallel I/O Middleware to Integrate Heterogeneous Storage Resources on Grids

José M. Pérez, Félix García, Jesús Carretero, Alejandro Calderón, Javier Fernández
Computer Architecture Group, Computer Science Department, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
[email protected], [email protected]
WWW home page: http://www.arcos.inf.uc3m.es

1 Introduction

Currently there is great interest in the grid computing concept. This concept usually denotes a distributed computational infrastructure for advanced science and engineering [5]. Most of the advances in grid computing have been centered on exploiting grid computational resources from the processing point of view, leaving aside the storage point of view. Nevertheless, for many applications the access to distributed information (DataGrid) is as important as the access to processing resources, since the majority of scientific and engineering applications need to access large volumes of data in an efficient way. The efforts devoted to the access to distributed data in grids are based on the creation of new services for accessing large volumes of information and on specialized interfaces for the grid. Examples of such systems are Globus [10], Storage Resource Broker (SRB) [1], DataCutter [2], DPSS [9], and MOCHA [7]. These solutions are not suitable for the integration of the available storage data servers (NFS servers, CIFS, HTTP-WebDAV, etc.), and they force users to learn new APIs, install new servers, and modify or adapt their applications. On the other hand, the solutions developed use replication as the way to obtain high-performance access to data, bringing data near the clients that use them. This approach assumes that applications do not modify the data that they use [4]. The use of replication raises two main problems: it implies an intensive use of resources, not only in storage but also in management, and it is not appropriate for applications that modify the same set of information, as happens when resources must be shared in collaborative environments.

A way to improve I/O performance, especially in distributed systems, is parallel I/O. Most parallel file systems use special file servers, but in heterogeneous systems such as grids this is not always possible. In this paper we present a parallel I/O middleware for grids, called GridExpand, that integrates existing servers using protocols like NFS, CIFS or WebDAV, without needing complex installations, and that facilitates the development of applications by integrating heterogeneous grid resources under homogeneous and well-known interfaces like POSIX and MPI-IO. This system applies the parallel I/O techniques used in most parallel file systems to grid environments. The work presented in this paper builds on the ideas proposed in the Expand parallel file system [6], which relies on the usage of several NFS file servers in parallel.

The rest of the paper is organized as follows: Section 2 presents the design of the proposed data grid architecture, and Section 3 shows some evaluation results.

2 GridExpand Design

To solve the problems described above, the authors have defined a new data grid approach that integrates all available data storage servers and applies parallel I/O techniques to them. This architecture is called GridExpand and is based on the effective integration of existing protocols, services and solutions. The system is an evolution of the Expand parallel file system [6] [3]. The idea is to provide a parallel I/O middleware that integrates the existing heterogeneous resources from the client's point of view, without needing to create or install new servers, and using existing APIs like POSIX or MPI-IO. Integrating multiple resources and using parallel I/O techniques increases both the performance and the storage capacity of the system.

The GridExpand architecture is presented in Figure 1. This architecture allows the usage of several heterogeneous clusters to define parallel distributed partitions, where data are striped. The servers used to store the data are traditional storage servers such as NFS, CIFS or WebDAV. GridExpand uses the available protocols to communicate with the servers, without needing specialized servers. This approach offers several advantages:

1. No changes to the servers are required. All aspects of GridExpand operation are implemented on the clients. For example, for NFS we use RPC and the NFS protocol, for CIFS we use TCP/IP and the SMB protocol, and for HTTP-WebDAV we use TCP/IP and the HTTP protocol with the WebDAV Distributed Authoring Protocol.
2. The construction of the parallel I/O middleware is greatly simplified, because all operations are implemented on the client side. This approach is completely different from that used in many current parallel file systems, which implement both the client and the server side.
3. It allows parallel access both to data of different files and to data of the same file.
4. It allows the usage of servers with different architectures and operating systems.
5. It simplifies the configuration, because the protocols proposed are very familiar to users.

GridExpand combines several servers to provide a generic striped partition on which several types of file systems can be created. Figure 1 shows different partitions that can be defined in the grid: intra-site partitions, such as partition 4, and inter-site partitions, such as partition 1. A file in GridExpand consists of several subfiles, one for each server in the distributed partition. All subfiles are fully transparent to the users: GridExpand hides them, offering clients a traditional view of the files (the sketch below illustrates the offset-to-subfile mapping behind this striping). To exploit all the available storage resources, GridExpand provides several data allocation and load balancing algorithms that search for available servers and select the nodes that will store the data of a file.
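To make the striping scheme concrete, the following C sketch shows how a logical file offset could be mapped to a server and an offset inside that server's subfile under a simple round-robin block striping policy. The policy, the structure and the function names are assumptions made for this example; they are not the GridExpand code, whose actual allocation algorithms may differ.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical description of a striped partition: 'nservers' subfiles,
 * one per storage server, with data distributed in fixed-size stripe
 * units in round-robin order (assumed policy for this example). */
struct partition {
    int      nservers;     /* servers (NFS, CIFS, WebDAV, ...) in the partition */
    uint64_t stripe_size;  /* bytes stored on one server before moving to the next */
};

/* Map a logical file offset to the server holding it and to the offset
 * inside that server's subfile. */
static void map_offset(const struct partition *p, uint64_t offset,
                       int *server, uint64_t *subfile_offset)
{
    uint64_t stripe   = offset / p->stripe_size;    /* global stripe index */
    uint64_t in_block = offset % p->stripe_size;    /* position inside it  */

    *server         = (int)(stripe % p->nservers);  /* round-robin server  */
    *subfile_offset = (stripe / p->nservers) * p->stripe_size + in_block;
}

int main(void)
{
    struct partition part = { .nservers = 4, .stripe_size = 64 * 1024 };
    int server;
    uint64_t sub_off;

    map_offset(&part, 3 * 1024 * 1024, &server, &sub_off);
    printf("logical offset 3 MiB -> server %d, subfile offset %llu\n",
           server, (unsigned long long)sub_off);
    return 0;
}
```

With this kind of mapping, a client can translate one large logical request into several per-server subfile requests and issue them in parallel, which is the behavior the transparent subfile view described above relies on.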

Fig. 1. GridExpand Architecture

The access to the files is provided through standard interfaces such as POSIX or MPI-IO. To accomplish this goal, GridExpand provides an Abstract File Interface (AFI) that allows the typical interfaces (POSIX, Win32, MPI-IO) to be implemented on top of it, and supports other advanced features such as cache policies, prefetching, parallelism degree configuration and fault tolerance. The access to the servers and storage resources is provided by an Abstract I/O Adapter, similar to the ADIO [8] used to develop portable implementations of MPI-IO.
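As a rough illustration of this layering, the C sketch below models an abstract I/O adapter as a table of operations that each protocol driver (NFS, CIFS, HTTP-WebDAV) would implement on the client side, in the spirit of ADIO. The structure and function names are assumptions made for the example, not the actual GridExpand or ADIO interfaces.

```c
#include <stddef.h>
#include <sys/types.h>

/* Illustrative shape of an abstract I/O adapter: one table of operations
 * per protocol driver, implemented entirely on the client side.
 * Names are hypothetical. */
struct io_adapter {
    const char *protocol;   /* e.g. "nfs", "cifs", "webdav" */
    void   *(*open) (const char *server, const char *path, int flags);
    ssize_t (*read) (void *handle, void *buf, size_t count, off_t offset);
    ssize_t (*write)(void *handle, const void *buf, size_t count, off_t offset);
    int     (*close)(void *handle);
};

/* A POSIX-like layer built on top would select one adapter per subfile and
 * dispatch the per-server requests through it, for example: */
ssize_t subfile_pread(const struct io_adapter *a, void *handle,
                      void *buf, size_t count, off_t offset)
{
    return a->read(handle, buf, count, offset);  /* forwarded to the driver */
}
```

Keeping all protocol-specific logic behind such a table is what lets the upper layers (AFI, POSIX, MPI-IO) remain unaware of whether a given subfile lives on an NFS, CIFS or WebDAV server.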

3 Performance Evaluation

In the evaluation we have used an image-processing application that processes a set of 128 images. All images have a size of 3 MB. For each image, the application produces a new image by applying a fixed 64-pixel mask. This application has been executed on a grid configured with 8 workstations running Linux and connected through Fast Ethernet. As data server we have used an NFS server on each node. Our experiments compare the performance of a single NFS server against GridExpand with different intra-site partitions. The application can be divided into several decoupled tasks (in the tests, from 1 to 32 clients; a sketch of one such task is given below). Figure 2 shows the time needed to run the application varying the number of subtasks (clients in the figure) and different distributed partitions (from 1 to 8 I/O nodes). The figure also shows the performance obtained with NFS (NFS legend in the figure). As can be seen in Figure 2, the usage of distributed partitions on grids increases the total storage capacity. Furthermore, the usage of parallel I/O techniques yields better performance.
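For illustration only, the following C sketch shows what one of the decoupled client tasks could look like: client id of nclients processes its share of the 128 images through the GridExpand partition using ordinary POSIX calls. The mount point, file names and the apply_mask placeholder are hypothetical; the real application and its mask filter are not reproduced here.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NIMAGES    128
#define IMAGE_SIZE (3 * 1024 * 1024)   /* 3 MB per image, as in the tests */

/* Placeholder for the 64-pixel mask filter used by the application. */
static void apply_mask(unsigned char *img, size_t len) { (void)img; (void)len; }

/* Hypothetical sketch of one decoupled client task: client 'id' of 'nclients'
 * processes its share of the images through the GridExpand partition,
 * which is accessed with ordinary POSIX calls.  Paths are illustrative. */
static int process_images(int id, int nclients)
{
    unsigned char *buf = malloc(IMAGE_SIZE);
    if (!buf) return -1;

    for (int i = id; i < NIMAGES; i += nclients) {
        char in_path[128], out_path[128];
        snprintf(in_path,  sizeof in_path,  "/gridexpand/images/in%03d.raw",  i);
        snprintf(out_path, sizeof out_path, "/gridexpand/images/out%03d.raw", i);

        int in  = open(in_path,  O_RDONLY);
        int out = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) {
            if (in  >= 0) close(in);
            if (out >= 0) close(out);
            free(buf);
            return -1;
        }

        ssize_t n = read(in, buf, IMAGE_SIZE);   /* striped read across servers */
        if (n > 0) {
            apply_mask(buf, (size_t)n);
            write(out, buf, (size_t)n);
        }
        close(in);
        close(out);
    }
    free(buf);
    return 0;
}

int main(int argc, char **argv)
{
    int id       = argc > 1 ? atoi(argv[1]) : 0;
    int nclients = argc > 2 ? atoi(argv[2]) : 1;
    return process_images(id, nclients) ? EXIT_FAILURE : EXIT_SUCCESS;
}
```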

Fig. 2. Performance Results

References

1. C. Baru, R. Moore, A. Rajasekar, M. Wan. "The SDSC Storage Resource Broker". In Proceedings of the International Conference on High Energy and Nuclear Physics, Teatro Antonianum, Padova, Italy, Feb. 2002.
2. M.D. Beynon, R. Ferreira, T. Kurc, A. Sussman, J. Saltz. "DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems". In Proceedings of the 2000 Mass Storage Systems Conference, pages 119-133, College Park, MD, March 2000. IEEE Computer Society Press.
3. A. Calderon, F. Garcia, J. Carretero, J.M. Perez, J. Fernandez. "An Implementation of MPI-IO on Expand: A Parallel File System Based on NFS Servers". In 9th PVM/MPI European Users Group, Johannes Kepler University Linz, Austria, Sep. 29 - Oct. 2, 2002, pp. 306-313.
4. A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke. "The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets". Journal of Network and Computer Applications, 23:187-200, 2001.
5. I. Foster, C. Kesselman, editors. "The Grid: Blueprint for a New Computing Infrastructure". Morgan Kaufmann, 1999.
6. F. Garcia, A. Calderon, J. Carretero, J.M. Perez, J. Fernandez. "The Design of the Expand Parallel File System". Accepted for publication in the International Journal of High Performance Computing Applications, 2003.
7. M.R. Martinez, N. Roussopoulos. "MOCHA: A Self-extensible Database Middleware System for Distributed Data Sources". In Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, TX, May 2000.
8. R. Thakur, W. Gropp, E. Lusk. "An Abstract-Device Interface for Implementing Portable Parallel-I/O Interfaces". In Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation, Oct. 1996, pp. 180-187.
9. B. Tierney, J. Lee, W. Johnston, B. Crowley, M. Holding. "A Network-Aware Distributed Storage Cache for Data-Intensive Environments". In Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, pages 185-193, Redondo Beach, CA, Aug. 1999.
10. S. Vazhkudai, S. Tuecke, I. Foster. "Replica Selection in the Globus Data Grid". In Proceedings of the International Workshop on Data Models and Databases on Clusters and the Grid (DataGrid 2001), IEEE Computer Society Press, 2001.
