The MOSIX Parallel I/O System for Scalable I/O Performance

Lior Amar, Amnon Barak and Amnon Shiloh
Institute of Computer Science
The Hebrew University of Jerusalem
Jerusalem, 91904 Israel

ABSTRACT

This paper presents the MOSIX Scalable Parallel Input/Output (MOPI) system, which uses the process migration capability of MOSIX for parallel access to segments of data that are scattered among different nodes. MOSIX is a Unix-based cluster operating system that already supports preemptive process migration for load-balancing and memory ushering. MOPI supports splitting files across several nodes. It can deliver high I/O performance by migrating parallel processes to the respective nodes that hold the data, as opposed to the traditional way of bringing the file's data to the processes. The paper describes the MOSIX infrastructure for supporting parallel file operations and the functions of MOPI. It then presents the performance of MOPI for some data-intensive applications and its scalability.

KEY WORDS
Cluster computing, MOSIX, parallel I/O, scalable I/O.

1 Introduction

Cluster computing has been traditionally associated with high performance computing of massive CPU-bound applications. Clustered systems can offer other advantages for "demanding" applications, such as the ability to support large processes that span the main memory of several nodes, or cluster file systems that support parallel access to files [3, 8, 10]. Recently, there is a growing demand to run data-intensive applications that need to process data several orders of magnitude faster than any existing single computer can. This paper presents a cluster-based parallel I/O system that could be useful for such applications.

Some packages for parallel I/O in clusters are the Global File System (GFS) [10], the Parallel Virtual File System (PVFS) [8], Panda [9], DataCutter [6] and Abacus [2]. GFS is a shared file system suitable for clusters with shared storage; it enables clients to access the same shared storage while keeping the data consistent. Its main drawback is its high hardware cost, and its scalability is yet unknown. PVFS is designed as a client-server system in which files are transparently striped across the disks of multiple file servers. PVFS provides good performance, but its main drawbacks are its limited scalability (possibly due to the


use of the network for almost all I/O operations) and its lack of data consistency when several processes interact with the same file. PVFS is particularly popular in Linux clusters for running I/O-intensive parallel processes. Panda [9] is a parallel I/O library for multidimensional arrays that supports the strategy of "part-time I/O", where each node can be both a client and/or a server, similarly to the MOSIX MFS scheme. However, unlike MOSIX, which optimizes network usage by migrating processes closer to their data, Panda optimizes performance by initially selecting the I/O nodes (among the cluster nodes) that will make the cluster use minimal network resources for a given description of anticipated I/O requests from clients to servers. DataCutter [6] is a framework designed for developing data-intensive applications in distributed environments. The programming model in DataCutter, called "filter-stream programming", represents components of data-intensive applications as a set of filters. Each filter can potentially be executed on a different host across a wide-area network. Data exchange between any two filters is described via streams, which are uni-directional pipes that deliver data in fixed-size buffers. The idea of changing the placement of program components to better use the cluster resources is similar to MOSIX. However, in MOSIX there is no need to change the applications, and MOSIX incorporates automatic load balancing, whereas in DataCutter assignments are manual. Abacus [2] is a programming model and a run-time system that monitors and dynamically changes function placement for applications that manipulate large data sets. The programming model encourages programmers to compose data-intensive applications from explicitly-migratable functions, while the run-time system monitors those functions and dynamically migrates them in order to improve the application performance. This project is similar to MOSIX: both perform dynamic load balancing of processes/functions based on run-time statistics collection. The difference is that MOSIX works with generic programs, whereas Abacus requires a specific programming model.

This paper presents a new paradigm for high cluster I/O performance that combines data partitioning with dynamic work distribution. The target cluster consists of multiple, homogeneous workstations and servers (nodes) that work cooperatively, making the cluster-wide resources accessible to processes on all nodes. The cluster file

system consists of several subtrees that are placed in different nodes to allow parallel operations on different files. The key feature of our cluster I/O system is its ability to bring (migrate) the process to the file server(s), rather than the traditional way of bringing the file's data to the process. Process migration, which is already supported by MOSIX [12] for load-balancing, creates new, scalable, parallel I/O capabilities that are suitable for applications that need to process large volumes of data. We describe the MOSIX Parallel I/O (MOPI) library, which provides means for (transparent) partitioning of files across different nodes and enables parallel access to different segments of a file.

The organization of the paper is as follows: Section 2 gives a short overview of MOSIX and its features relevant to supporting parallel I/O. Section 3 presents MOPI and Section 4 presents its performance. Our conclusions are given in Section 5.

2 MOSIX Background

MOSIX is software that was specifically designed to enhance the Unix kernel with cluster computing capabilities. The core of MOSIX is a set of adaptive algorithms for load-balancing [5] and memory ushering [4] that monitor uneven resource distribution among the nodes and use preemptive process migration to automatically reassign processes among the nodes (as in an SMP) in order to continuously take advantage of the best available resources. The MOSIX algorithms are geared for maximal overall performance, overhead-free scalability and ease-of-use.

2.1 The system image model

The granularity of the work distribution in MOSIX is the Unix process. In MOSIX, each process has a unique home-node (where it was created), which is usually the login node of the user. The system image model is a computing cluster in which every process seems to run at its home-node, and all the processes of a user's session share the running environment of the home-node. Processes that migrate to a remote (away from the home) node use local (in the remote node) resources whenever possible, but continue to interact with the user's environment by forwarding environment-dependent system-calls to the home-node.

2.2 Process migration

MOSIX supports preemptive (completely transparent) process migration that can migrate almost any process, any time, to any available node. After a process is migrated, all its system-calls are intercepted by a link layer at the remote node. If a system-call is site independent, it runs on the remote node. Otherwise, the system-call is forwarded to the home-node, where it runs on behalf of the process. The above scheme is particularly useful for CPU-bound processes but inefficient for processes with intensive I/O and/or file operations, because such processes are required to communicate with their respective home-node environment for each I/O operation. Clearly, in such cases these processes would be better off not migrating. The next section describes a mechanism to overcome this last problem.

2.3 The MOSIX File System (MFS)

We implemented the MOSIX File System (MFS) [1], which provides a unified view of all files on all mounted file systems (of any type) in all the nodes of a MOSIX cluster, as if they were all within a single file system. For example, if one mounts MFS on the /mfs mount-point, then the file /mfs/1456/usr/tmp/myfile refers to the file /usr/tmp/myfile on node #1456. MFS is scalable because the number of MOSIX nodes is practically unlimited. In MFS, each node in a MOSIX cluster can simultaneously be a file server and run processes as clients, and each process can work with any mounted file system of any type. One advantage of the MFS approach is that it raises the client-server interaction to the system-call level, which provides total consistency.
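As a minimal illustration of this naming scheme, the following sketch shows a process reading a remote file through MFS using ordinary POSIX calls; the node number and path are taken from the example above, and the buffer size is arbitrary:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Under MFS, the path prefix /mfs/1456 addresses node #1456,
           so standard system-calls transparently reach the remote file. */
        int fd = open("/mfs/1456/usr/tmp/myfile", O_RDONLY);
        if (fd < 0)
            return 1;
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;   /* process buf[0..n-1] here */
        close(fd);
        return 0;
    }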


2.4 Direct File System Access (DFSA)

DFSA [1] is a re-routing mechanism that reduces the extra overhead of running I/O-oriented system-calls of a migrated process. This is done by allowing most such system-calls to be performed locally, in the process's current node. In addition to DFSA, MOSIX monitors the I/O operations of each process in order to encourage a process that performs a moderate to high volume of I/O to migrate to the node in which it does most of its I/O. One obvious advantage is that I/O-bound (and mixed I/O and CPU) processes have greater flexibility to migrate from their respective home-nodes for better load-balancing.


2.5 Bringing the process to the file

Unlike most network file systems, which bring the data from the file server to the client node over the network, the MOSIX algorithms attempt to migrate the process to the node in which the file resides. Usually, most file operations are performed by a process on a single partition. The MOSIX scheme has a significant advantage over other networked file systems, as it allows a process to migrate and use any local partition. Clearly, this eliminates the communication overhead between the process and the file server (except the cost of the process migration itself). We note that the process migration algorithms monitor and weigh the amount of I/O operations vs. the size of each process, in an attempt to optimize the decision whether to migrate the process or not.
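The exact decision rule is not given in the text, but the trade-off it describes can be sketched as follows. The function name, the cost model and both constants are our own illustrative assumptions, not the MOSIX implementation:

    #include <stdint.h>

    /* Hypothetical sketch: migrate only if the observed remote I/O
       traffic outweighs the one-time cost of moving the process image. */
    int should_migrate(uint64_t remote_io_bytes_per_sec,
                       uint64_t process_image_bytes,
                       uint64_t net_bytes_per_sec)
    {
        /* Cost of moving: ship the whole process image once.       */
        double migrate_cost_sec =
            (double)process_image_bytes / net_bytes_per_sec;
        /* Cost of staying: remote I/O keeps crossing the network,
           amortized over an assumed time window.                   */
        double horizon_sec = 10.0;   /* assumed amortization window */
        double stay_cost_sec =
            horizon_sec * ((double)remote_io_bytes_per_sec / net_bytes_per_sec);
        return stay_cost_sec > migrate_cost_sec;
    }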


3 The Design and Implementation of MOPI

This section presents the design and implementation of MOPI, a library that provides means for parallel access to files. MOPI supports partitioning large files into independent data segments that are placed in different nodes. Client applications can access such files transparently, without any knowledge of the partitioning scheme or the fact that the files are partitioned. The MOPI library requests information about a file from one or more dedicated servers, called Meta Managers (MM), which are responsible for managing large files, e.g., allocating new segments, removing segments, etc. The MM uses a general-purpose daemon (MOPID) for such service requests. Below we describe the partitioning method of files, then present the MOPI implementation.


3.1 The MOPI file structure

A MOPI file consists of two parts: a Meta Unit and Data Segments, as shown in Figure 1.

[Figure 1. The MOPI file structure: a Meta Unit (DS Size = 50MB) pointing to Data Segments DS 1-DS 12, scattered among the storage of Nodes 1-3.]

3.1.1 The Meta Unit (MU)

The MU stores the file attributes, including the number and IDs of the nodes across which the file is partitioned, the size of the partition unit (data segment) and the locations of the segments. The functions and operations of the MU are similar to the handling of i-nodes in UNIX file-systems; e.g., it can be stored in stable storage when the file is not active (not accessed by any process), and it is loaded by the MM when the file becomes active.
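The on-disk format of the MU is not specified in the paper; the struct below is a hypothetical sketch that only records the fields named above (all names and bounds are our assumptions):

    #include <stdint.h>

    #define MOPI_MAX_NODES 256   /* assumed bound, not from the paper */

    /* Hypothetical Meta Unit layout covering the fields the text names:
       the partitioning nodes, the segment size and segment locations. */
    struct meta_unit {
        uint32_t  nnodes;                    /* number of nodes used        */
        uint32_t  node_ids[MOPI_MAX_NODES];  /* IDs of those nodes          */
        uint64_t  ds_size;                   /* partition unit (1MB..~4GB)  */
        uint32_t  nsegs;                     /* number of data segments     */
        char    **seg_path;                  /* per-segment location (path) */
    };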

3.1.2 Data segments

Data is divided into Data Segments (DS). A DS is the smallest unit of data for I/O optimization, e.g., when considering a migration of the process to the data. The segment size can vary from 1MB to almost 4GB, and all the segments of the same file must be of the same size. Segments are created when the file is partitioned among the nodes. After its creation, each DS exists as an autonomous unit. When created, each segment is assigned to a node using a round-robin scheme. We note that other partitioning schemes could be implemented, e.g., a random or a space-conserving scheme, which balances the use of disk space among the nodes. Also note that, as in regular UNIX files, gaps (holes) might exist in the files; reading from a hole returns 0's.
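Since all segments of a file share one size, a byte offset determines its segment, and round-robin placement determines the node. A minimal sketch, reusing the hypothetical meta_unit above (the helper name is our assumption):

    #include <stdint.h>

    /* Map a byte offset to its data segment and, under round-robin
       placement, to the node holding that segment. */
    static uint32_t mopi_node_of_offset(const struct meta_unit *mu,
                                        uint64_t offset)
    {
        uint64_t seg = offset / mu->ds_size;      /* fixed-size segments   */
        return mu->node_ids[seg % mu->nnodes];    /* round-robin placement */
    }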

3.2 The MOPI implementation

The prototype MOPI implementation consists of three parts: a user-level library that is linked with the user application; a set of daemons for managing meta-data; and an optional set of utilities for performing basic operations on MOPI files from the shell. See Figure 2 for details.

[Figure 2. The MOPI components: Applications linked with the MPI and MOPI libraries, Service Daemons, MFS and the Native Filesystem over local Storage on each node, together with a MetaData Manager and a Garbage Collector.]

3.2.1 The MOPI interface

In order to use MOPI, the application should be modified to use the MOPI functions instead of the regular file system calls. The MOPI interface includes all the common UNIX file access and manipulation functions, such as mopi_open, mopi_close, mopi_write, mopi_read, mopi_readahead, mopi_lseek, mopi_fcntl, mopi_stat, etc. We note that mopi_readahead is an asynchronous function that returns immediately, without finishing the actual read. It can be used to overlap computation with I/O, as well as to pre-load parts of files to the main memory of nodes and then migrate processes to the respective nodes to benefit from parallel, local access to data.
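The call signatures are not given in the paper; the sketch below assumes they mirror their POSIX counterparts, and shows how mopi_readahead could overlap I/O with computation (the prototypes, file path and chunk size are our assumptions):

    #include <fcntl.h>
    #include <sys/types.h>

    /* Assumed prototypes; the text names these calls but not their
       signatures, so POSIX-like forms are our assumption. */
    int     mopi_open(const char *path, int flags);
    ssize_t mopi_read(int fd, void *buf, size_t count);
    int     mopi_readahead(int fd, off_t offset, size_t count);
    int     mopi_close(int fd);

    enum { CHUNK = 1 << 20 };            /* 1MB chunks, illustrative */

    void scan_file(const char *path)     /* e.g., a MOPI file path */
    {
        static char buf[CHUNK];
        int fd = mopi_open(path, O_RDONLY);
        mopi_readahead(fd, 0, CHUNK);               /* start async pre-load */
        for (off_t off = 0; ; off += CHUNK) {
            mopi_readahead(fd, off + CHUNK, CHUNK); /* prefetch next chunk  */
            ssize_t n = mopi_read(fd, buf, CHUNK);  /* ideally local access */
            if (n <= 0)
                break;
            /* ... compute on buf[0..n-1] while the prefetch proceeds ... */
        }
        mopi_close(fd);
    }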


3.2.2 Configuration and data access

The MOPI installation includes the allocation of disk space in one or more of the nodes (allocating a dedicated partition is recommended); editing configuration files, which define the nodes holding the data and the access path to the MOPI files; and starting a number of daemons, e.g., MM and MOPID. A client process wishing to access a MOPI file must first open the file by sending a request to an MM. The MM processes the open request and loads the MU; the process then requests segment location(s), accesses the segment(s) directly via the MFS file system, and may migrate to the node(s) holding the segment(s) if the MOSIX algorithms decide to do so.


3.3 MOPI support for MPI

MOPI supports the MPI-IO standard [13]. An application using the MPI-IO interface can run on top of MOPI without any further modifications. In particular, we extended the ROMIO [14] implementation of the MPI-IO standard, which includes support for NFS, UFS, PVFS, XFS, PFS, SFS, PIOFS and HFS. In the next section we present the performance of a benchmark that is written with MPI.
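For illustration, the fragment below is a minimal MPI-IO read in the style such an application would use; the file name and slice size are hypothetical, and MOPI's role stays transparent behind the standard interface:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "bigfile.dat",  /* hypothetical file */
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        /* Each rank reads a disjoint slice at its own offset. */
        const MPI_Offset slice = 1 << 20;             /* 1MB, illustrative */
        char *buf = malloc(slice);
        MPI_Status st;
        MPI_File_read_at(fh, rank * slice, buf, (int)slice, MPI_BYTE, &st);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }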


4 Performance of Parallel I/O with MOPI

This section presents the performance of MOPI using benchmarks that simulate heavy file system loads and parallel file system stress, and tests to determine the optimal segment size. All measurements were performed in a cluster of 60 identical workstations, each with a Pentium III 1133MHz, 512MB RAM, a 20GB 7200 RPM IDE disk and a 100 Mb/Sec Ethernet NIC, connected by a 96x96 matrix switch and running MOSIX [12] for Linux 2.4.18. All the presented results reflect the average of 5 measurements. For reference, we used the hdparm utility of Linux and measured the average read rate of the disks at 38MB/Sec. We then used the Bonnie [7] benchmark and measured the throughput of the (ext2) file system at 36.3MB/Sec for read and 41.89MB/Sec for write. We note that the write rate was higher than both of the above read rates due to caching optimization. We also used the ttcp 1.12 benchmark to measure the speed of TCP/IP between pairs of nodes, at an average rate of 11.17 MB/Sec for 8KB blocks, with less than 0.5% variation for 4KB, 16KB and 32KB blocks.

4.1 Heavy file system loads

This benchmark demonstrates the performance of DFSA with MFS [1] vs. NFS and local file access. We executed the PostMark [11] benchmark, which simulates heavy file system loads, e.g., as in a large Internet electronic mail server. First, the benchmark creates a large pool of random-size text files, then it performs a sequence of transactions, recursively, until a predefined workload is obtained. The benchmark was performed between a pair of nodes using 3 different file access methods and block sizes ranging from 1K byte to 64K bytes.

Table 1. File system access times (Sec.)

    Block    Local    MFS with DFSA    NFS
    1K       14.4     18.0             162.0
    4K       13.2     15.4             162.0
    8K       13.0     15.0             161.6
    16K      13.0     15.0             162.0
    32K      13.8     15.4             162.8
    64K      14.2     16.4             163.0

The results are presented in Table 1. The second column shows the Local times, when both the process and the files were in the same (server) node; the third column shows the MFS with DFSA times, i.e., the benchmark started in a client node, then migrated to the server node; and the fourth column shows the NFS times from a client (in its MOSIX home) node to a server node. From the results in Table 1 it follows that on average (over all block sizes) MFS with DFSA is only 16.6% slower than Local, and more than 10 times (over 900%) faster than NFS. We note that this last result motivated the development of MOPI.

4.2 Local vs. MOSIX vs. remote read

This test presents the performance of MOPI using the IOR [15] parallel file system stress benchmark, developed at LLNL. In this benchmark, all the processes access the same file using the MPI-IO interface. First, each process is allocated to a different node (using MPI). Then each process opens the file and seeks to a different segment. Then one segment is written to a local disk, followed by a read of another (remote) segment. This procedure is repeated several times (a parameter), so that in each iteration each process accesses a different segment and no two processes access the same segment concurrently (a sketch of this rotation follows the list below).

To test the performance of applications that scan (read-once, e.g., filter) large amounts of data that is already present in different nodes, we ran one iteration of the IOR benchmark (without the write phase). We measured the throughput rates (MB/Sec) of the read operation when each node held one (1GB) data segment and ran one process. The test was repeated for the following cases:

1. All-local: each process was created in the same node where its data was located and accessed it locally.


2. Forced-migration all-local: each process was created in a different node and was forced to migrate to its respective data node as soon as it opened the file.

3. MOSIX: each process was created in a different node and later migrated to its respective data node using the MOSIX automatic process distribution algorithms.

4. All-remote: each process was created in a different node and accessed all its data via the network.
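The rotated access pattern described above can be sketched as follows; this is our hypothetical fragment in the spirit of the IOR setup, not IOR's own code, and all names are illustrative:

    #include <mpi.h>

    /* In iteration k, each of nprocs processes writes its own segment,
       then reads the segment of a different process, so no two ranks
       touch the same segment concurrently. */
    void ior_like(MPI_File fh, char *buf, MPI_Offset seg_size,
                  int rank, int nprocs, int iters)
    {
        MPI_Status st;
        for (int k = 1; k <= iters; k++) {
            MPI_Offset mine   = (MPI_Offset)rank * seg_size;
            int peer          = (rank + k) % nprocs;   /* distinct per rank */
            MPI_Offset remote = (MPI_Offset)peer * seg_size;

            MPI_File_write_at(fh, mine, buf, (int)seg_size, MPI_BYTE, &st);
            MPI_File_read_at(fh, remote, buf, (int)seg_size, MPI_BYTE, &st);
        }
    }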


The results of these tests (aggregate throughput vs. number of nodes) are shown in Figure 3. Obviously, the highest rate, with a weighted average of 35.15 MB/Sec per node, was obtained by the all-local test. Note that this case represents the throughput of the best possible static assignment, in which each process accesses exactly one (local) segment. Clearly, this performance could not be sustained if such a process accessed segments in different nodes.

[Figure 3. Maximal vs. MOSIX vs. remote read rates: aggregate throughput (MB/Sec) vs. number of nodes (0-60) for the all-local, forced-migration all-local, MOSIX and all-remote tests, with the TCP/IP maximal bandwidth shown for reference.]

Just below the all-local results are the forced-migration all-local access results, which represent the theoretical maximal throughput of MOSIX with a single migration of each process. The obtained weighted average throughput was 32.68 MB/Sec per node, only 7.5% slower than the all-local throughput. The weighted average throughput of the MOSIX test was 27.39 MB/Sec, about 19% slower than the forced-migration all-local and 28% slower than the all-local throughput. We note that the lower rates of MOSIX are due to its stability algorithms, which prevent migration of short-lived processes. The lowest weighted average throughput, of only 9.18 MB/Sec per node, was obtained by the all-remote test, which represents disk striping over the network. The results obtained were 28% slower than the throughput of TCP/IP (shown for reference as a dotted line) and almost 3 times (198%) slower than the weighted average throughput of MOSIX. The respective maximal aggregate throughputs with 60 nodes were 2117 MB/Sec for the all-local test, 1945 MB/Sec for the forced-migration all-local, 1633 MB/Sec for MOSIX and 525 MB/Sec for the all-remote tests. Observe that due to the large capacity of our switch, the results of all the tests show linear speedups. Clearly, this would not be the case for the all-remote test, which may saturate a network with several smaller switches, unlike the MOSIX approach, which scales up without heavy use of the network.

4.3 Write/read with several migrations

In this test we used the IOR benchmark (both write and read) to measure the degradation of the I/O performance of MOSIX due to several process migrations. First, we created one process in each node. Then the test was conducted with 1, 3 and 6 iterations, where in each iteration each process wrote one (1GB) segment locally, then read another segment from a remote node. Note that during the read phase the MOSIX algorithms were expected to detect and migrate each process to its respective data node.

[Figure 4. I/O rates with several process migrations: aggregate throughput (MB/Sec) vs. number of nodes (0-60) for MOSIX with 1, 3 and 6 migrations, with the TCP/IP maximal bandwidth shown for reference.]

Figure 4 shows the results of this test (throughput vs. number of nodes) with 1, 3 and 6 process migrations. The respective weighted average throughputs were 24.68 MB/Sec, 23.79 MB/Sec and 22.46 MB/Sec, with maximal aggregate throughputs of 1477.64 MB/Sec, 1429.41 MB/Sec and 1335.34 MB/Sec, respectively, for 60 nodes. From these results it follows that each process migration resulted in a per-node throughput loss of about 1.8% and an aggregate throughput loss of about 1.6% for 60 nodes.

4.4 Choice of the segment size

To help users select the optimal segment size, Figure 5 presents the I/O throughput rates for 16MB-4GB segments, using the same set of tests as in Section 4.2 and a 60-node cluster. The throughput of the forced migration reached its maximal performance with a segment size of 128MB, slightly lower than the aggregate rate of parallel read from 60 disks (shown as a dotted line for reference). Below that are the MOSIX results, which show a steady increase up to a segment size of 2GB. As expected, the MOSIX results converge to the forced-migration results as the segment size is increased. Finally, just under the TCP/IP line (which reflects parallel read from a client to 60 servers) are the all-remote results, which show maximal performance for 64-128MB segments.

[Figure 5. I/O rates for different segment sizes (16MB-4096MB) on 60 nodes: aggregate throughput (MB/Sec) for the forced-migration all-local, MOSIX and all-remote tests, with the disk and TCP/IP maximal bandwidths shown for reference.]

5 Conclusions and Future Work

This paper presented a new paradigm for high cluster I/O performance that combines data partitioning with dynamic work distribution. Our scheme supports parallel access to segments of data by moving the processes to the data, to benefit from local access, as opposed to the traditional way of bringing the data to the processes. Our scheme scales up linearly and does not saturate the network. As the segment size increases, its performance asymptotically approaches that of local disk access.

The work described in this paper could be extended in several directions. First, it is possible to add support for segment replication, to allow parallel reads of the same segment by different processes in different nodes. Another option is to support memory-only segments, by which a large file is pre-loaded into the main memory of the cluster's nodes, thus allowing faster access to the data without any disk access. Another extension could be support for GRID computing, where unpredictable communication patterns justify process migration to the data node. Finally, it would be interesting to extend MOPI to support shared storage facilities, e.g., SAN or NAS, in which case process migration could be used to balance the load over the I/O channels of the nodes.


Acknowledgements

We wish to thank Danny Braniss and Assaf Spanier for their help. This research was supported in part by the Ministry of Defense and by a grant from Dr. and Mrs. Silverston, Cambridge, UK.

References

[1] L. Amar, A. Barak, A. Eizenberg and A. Shiloh. The MOSIX Scalable Cluster File Systems for Linux. http://www.MOSIX.org, July 2000.

[2] K. Amiri, D. Petrou, G.R. Ganger and G.A. Gibson. Dynamic Function Placement for Data-intensive Cluster Computing. Proc. USENIX Annual Technical Conference, San Diego, CA, June 2000.

[3] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli and R.Y. Wang. Serverless Network File Systems. ACM TOCS, 14(1), pp. 41-79, 1996.

[4] A. Barak and A. Braverman. Memory Ushering in a Scalable Computing Cluster. Journal of Microprocessors and Microsystems, 22(3-4), Aug. 1998.

[5] A. Barak, O. La'adan and A. Shiloh. Scalable Cluster Computing with MOSIX for Linux. Proc. 5th Annual Linux Expo, pp. 95-100, Atlanta, GA, 1999.

[6] M. Beynon, C. Chang, U. Catalyurek, T. Kurc, A. Sussman, H. Andrade, R. Ferreira and J. Saltz. Processing Large-Scale Multi-dimensional Data in Parallel and Distributed Environments. Parallel Computing, 28(5), pp. 827-859, 2002.

[7] Bonnie File System Benchmark. http://www.textuality.com/bonnie.

[8] P.H. Carns, W.B. Ligon III, R.B. Ross and R. Thakur. PVFS: A Parallel File System for Linux Clusters. Proc. 4th Annual Linux Conference, pp. 317-327, Atlanta, GA, 2000.

[9] Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S. Kuo and K.E. Seamons. Exploiting Local Data in Parallel Array I/O on a Practical Network of Workstations. Proc. 5th Workshop on I/O in Parallel and Distributed Systems, pp. 1-13, San Jose, CA, 1997.

[10] Global File System (GFS). http://www.sistina.com.

[11] J. Katcher. PostMark: A New File System Benchmark. http://www.netapp.com.

[12] MOSIX. http://www.MOSIX.org.

[13] P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann Publishers, 1996.

[14] R. Thakur, W. Gropp and E. Lusk. On Implementing MPI-IO Portably and with High Performance. Proc. 6th Workshop on I/O in Parallel and Distributed Systems, pp. 23-32, 1999.

[15] The I/O Stress Benchmark Codes, ior_mpiio (Interleaved or Random) benchmark. SIOP, LLNL. http://www.llnl.gov/asci/purple/benchmarks/limited/ior.
