
The MOSIX Scalable Cluster File Systems for LINUX

Lior Amar, Amnon Barak, Ariel Eizenberg and Amnon Shiloh
Computer Science, Hebrew University, Jerusalem 91904, Israel

Abstract

MOSIX is a cluster computing enhancement of Linux that supports preemptive process migration. This paper presents the MOSIX Direct File System Access (DFSA), a provision that can improve the performance of cluster file systems by migrating the process to the file, rather than the traditional way of bringing the file's data to the process. DFSA is suitable for clusters that manage a pool of shared disks among multiple machines. With DFSA, it is possible to migrate parallel processes from a client node to file servers, enabling parallel access to different files. DFSA can work with any file system that maintains cache consistency. Since no such file system is currently available for Linux, we implemented the MOSIX File-System (MFS) as a first prototype using DFSA. The paper describes DFSA and presents the performance of MFS with and without DFSA.

1 Introduction

Recent advances in cluster computing and the ability to generate and migrate parallel processes to many machines have created a need to develop scalable cluster file systems that not only support parallel access to many files but also provide cache consistency between processes that access the same file. Most traditional network file systems, such as NFS [10], AFS [1] and Coda [12], are inadequate for parallel processing because they rely on central file servers. A new generation of file systems, such as the Global File System (GFS) [8], xFS [2] and Frangipani [6], are more appropriate for clusters because these systems distribute storage, cache and control among the cluster's workstations and also provide means for parallel file access and cache consistency. The scalability of these file systems is yet unknown.

This paper presents a new paradigm for scalable cluster computing that combines cluster file systems with dynamic work distribution. The target cluster consists of multiple, homogeneous nodes (single CPU and/or SMP computers) that work cooperatively, allowing the cluster-wide resources to be available to each process and enabling any node to access any file or execute any process. The cluster file system consists of several subtrees that are placed in different nodes to allow parallel operations on different files as well as cache consistency. The key feature in the design of the cluster file system is the ability to bring (migrate) the process to the file, rather than the traditional way of bringing the file's data to the process. This capability, which is already supported by MOSIX for Linux [5] for load-balancing, creates a new cluster computing approach that provides an "easy to use" environment to improve performance and scalability. We note that most existing packages for work distribution, such as PVM [7] or MPI [11], are inadequate for our needs because they use static (fixed) allocation of processes to nodes.

The paper presents extensions to the MOSIX system to handle cluster file systems combined with dynamic work distribution for load-balancing. MOSIX is particularly efficient for distributing and executing CPU-bound processes. However, due to Linux compatibility, the MOSIX scheme for process distribution was inefficient for executing processes with a significant amount of I/O and/or file operations. To overcome these inefficiencies, MOSIX was enhanced with a provision for Direct File System Access (DFSA) for better handling of I/O-bound processes. With DFSA, most I/O-oriented system-calls of a migrated process can be performed locally, in the node where the process currently resides. The main advantage of DFSA is that it allows a process with a moderate to high volume of I/O to migrate to the node in which it performs most of its I/O and take advantage of local access to data. Other advantages include a reduction of the communication overhead, since fewer I/O operations are performed via the network, as well as greater flexibility to migrate I/O-bound (and mixed I/O and CPU-bound) processes for better load-balancing. The net result is that with the proper distribution of files among the cluster's nodes, it is possible to migrate parallel processes from a client node to file servers and enable efficient, parallel access to different files.

Correct operation of DFSA requires file consistency between processes that run on different nodes, which currently only a few file systems provide. One such file system is GFS [8], which we intend to use once it becomes fully operational. As another option, we developed a prototype file system, called the MOSIX File-System (MFS), that treats all files and directories within a MOSIX cluster as a single file system. We hope that more file systems that satisfy our requirements will become available for Linux in the future.

1.1 Contributions and organization of the paper

This paper has two main contributions. First, we show the performance improvements gained by migrating I/O-bound processes to file servers versus traditional remote paging methods. Our second contribution is a demonstration of an operational cluster file system that supports a simple, yet scalable scheme for cache consistency among any number of processes. This has been achieved with good performance for the more common (large and complex) I/O operations, e.g., I/O with large block sizes.

The paper is organized as follows: Section 2 gives a short overview of MOSIX and its relevant features. Section 3 presents DFSA and MFS. Section 4 presents the performance of MFS with and without DFSA vs. NFS. Our conclusions are given in Section 5.

2 MOSIX Background

MOSIX is a software package that enhances the Linux kernel with cluster computing capabilities. Its main features are preemptive process migration and supervising algorithms for load-balancing [4] and memory ushering [3]. MOSIX operates silently and its operations are transparent to the applications, just like in an SMP. Users can run parallel (and sequential) applications by initiating processes in one node and then allowing the system to assign and reassign the processes to the best available nodes throughout the execution of the application. In a MOSIX cluster each node can be both a server, for storing files, and at the same time a client for executing processes. The run-time environment is a cluster of cooperating machines that manage a shared pool of disks on multiple machines, such that each file can be accessed from any node. The granularity of the work distribution is the Linux process.

2.1 The system image model

In MOSIX, each process has a unique home-node (where it was created), which is usually the login node of the user. The system image model is a Computing Cluster (CC), in which every process seems to run at its home-node and all the processes of a user's session share the execution environment of the home-node. Processes that migrate to a remote (away from home) node use local (in the remote node) resources whenever possible, but continue to interact with the user's environment by forwarding environment-dependent system-calls to the home-node. For example, assume that a user launches several processes, some of which migrate away from the home-node. If the user executes "ps", it will report the status of all the processes, including processes that are executing on remote nodes. One outcome of this organization is that each process, regardless of its location, can work with any mounted file system without any special provisions.

2.2 Process migration

MOSIX supports preemptive (completely transparent) process migration, which can migrate any process, at any time, to any available node. After a process is migrated, all its system-calls are intercepted by a link layer at the remote node. If a system-call is site independent, it is executed at the remote node. Otherwise, the system-call is forwarded to a stub, called the deputy, which executes the system-call (synchronously) on behalf of the process in the home-node. After completion, the deputy returns the result(s) to the process in the remote node, which then continues its execution. This process migration policy is particularly useful for CPU-bound processes but inefficient for processes with intensive I/O and/or file operations, because such processes are required to communicate with their respective home-node environment for each I/O operation. Clearly, in such cases these processes would be better off not migrating. The next section describes a mechanism to overcome this last problem.
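To make the split between site-independent and environment-dependent system-calls concrete, below is a minimal, hypothetical C sketch of the dispatch step just described. All identifiers (handle_remote_syscall, forward_to_deputy, is_site_independent, and so on) are illustrative assumptions and are not taken from the MOSIX sources.

/* Hypothetical sketch of the remote-node system-call dispatch described
 * in Section 2.2.  All identifiers are illustrative, not MOSIX code. */
#include <stdbool.h>

struct syscall_req {
    int  nr;        /* system-call number */
    long args[6];   /* its arguments      */
};

/* Assumed helpers: run the call where the process currently is, or ship
 * it to the deputy, which runs it at home and returns the result. */
static long exec_locally(struct syscall_req *req)      { (void)req; return 0; }
static long forward_to_deputy(struct syscall_req *req) { (void)req; return 0; }

/* Site-independent calls (e.g. pure computation on the process's own
 * state) are safe to run remotely; everything else must go home. */
static bool is_site_independent(int nr)
{
    (void)nr;
    return false;   /* conservative default: forward to the deputy */
}

long handle_remote_syscall(struct syscall_req *req)
{
    if (is_site_independent(req->nr))
        return exec_locally(req);       /* executed at the remote node  */
    return forward_to_deputy(req);      /* deputy executes synchronously
                                           in the home-node environment */
}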

3 Direct File System Access (DFSA)

The Direct File System Access (DFSA) re-routing mechanism was designed to reduce the extra overhead of executing I/O-oriented system-calls of a migrated process. This was done by allowing the execution of most such system-calls locally, in the process's current node. In addition to DFSA, a new algorithm that takes I/O operations into account was added to the MOSIX process distribution (load-balancing) policy. The outcome of these provisions is that a process that performs a moderate to high volume of I/O is encouraged to migrate to the node in which it does most of its I/O. One obvious advantage is that I/O-bound (and mixed I/O and CPU-bound) processes have greater flexibility to migrate from their respective home-nodes for better load-balancing. The MOSIX scheme is distributed and scalable, as it allows many processes to simultaneously access different files that have been pre-allocated to different nodes prior to the execution of the parallel processes.

3.1 Bringing the process to the file

Unlike all existing network file systems, which bring the data from the file server to the client node over the network, the MOSIX algorithms attempt to migrate the process to the node in which the file resides. Usually most file operations are performed by a process on a single file system. The MOSIX scheme has significant advantages over other network file systems because it allows the use of a local file system. Clearly, this eliminates the communication overhead between the process and the file server (except the cost of the process migration itself). We note that the process migration algorithms monitor and weigh the amount of I/O operations vs. the size of each process, in an attempt to optimize the decision whether or not to migrate the process.

An interesting research problem is where to locate a process when it performs I/O operations on files that are located in different nodes. A naive solution is to migrate the process to the node in which most of its I/O operations are performed. This problem becomes more complicated if one also wants to balance the load, if the patterns of the I/O operations vary in time, or when several processes are involved. The adaptive process allocation policy of MOSIX attempts to resolve all of those cases.
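The weighing of I/O volume against process size can be illustrated with a short, hypothetical C sketch: migrate a process toward its files only if the expected saving from local I/O outweighs the one-time cost of moving its memory image. The structure, fields and formula below are assumptions for illustration, not the actual MOSIX policy.

/* Illustrative cost comparison for the migration decision of Section 3.1.
 * All names and the formula are assumptions, not MOSIX internals. */
#include <stdbool.h>
#include <stddef.h>

struct proc_stats {
    size_t image_bytes;        /* memory image transferred on migration   */
    size_t io_bytes_per_sec;   /* recent I/O rate against the file's node */
    double expected_runtime;   /* seconds the process is expected to run  */
};

/* net_bandwidth: achievable bytes per second between the two nodes */
bool should_migrate_to_file_node(const struct proc_stats *p,
                                 double net_bandwidth)
{
    double migration_cost = p->image_bytes / net_bandwidth;
    double io_savings     = (p->io_bytes_per_sec * p->expected_runtime)
                            / net_bandwidth;
    return io_savings > migration_cost;
}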

3.2 How DFSA works

DFSA is a software switch that determines whether to execute file-I/O system-calls of a migrated process in the current node or in its home-node. To function correctly, DFSA requires that the chosen file systems (and symbolic-links) are identically mounted on the same-named mount-points. Moreover, it is implied that the user/group-ID scheme is either identical throughout the cluster or otherwise safe enough that no access violations can occur when the same file system is accessed by users with IDs assigned from different nodes.

DFSA checks that the given file systems are mounted (with the same mount-flags) and that their type is supported by DFSA. It then redirects most system-calls to be performed locally (in the current node), while still directing some (problematic) system-calls, e.g., when sharing opened files with other processes, to the process's home-node. DFSA continuously synchronizes the file status, e.g., opening, closing, file position, etc., between a migrated process and its home-node.
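The following hypothetical C sketch illustrates the kind of checks such a switch performs before letting a file-I/O call run locally: identical mount-points and mount-flags, a supported file system type, and no problematic sharing. The structures and names are assumptions for illustration only.

/* Hypothetical sketch of the DFSA "software switch" of Section 3.2. */
#include <stdbool.h>
#include <string.h>

struct mount_info {
    const char   *mount_point;   /* e.g. "/mfs" */
    const char   *fs_type;       /* e.g. "mfs"  */
    unsigned long mount_flags;
};

struct open_file {
    const struct mount_info *local_mnt;   /* as seen on the current node */
    const struct mount_info *home_mnt;    /* as seen on the home-node    */
    bool shared_with_other_processes;     /* e.g. a shared open file     */
};

static bool dfsa_supported_type(const char *fs_type)
{
    /* Only file systems meeting the Section 3.3 requirements qualify;
     * "mfs" is used here purely as an example. */
    return fs_type != NULL && strcmp(fs_type, "mfs") == 0;
}

/* true: execute locally; false: forward to the deputy at the home-node. */
bool dfsa_run_locally(const struct open_file *f)
{
    if (f->shared_with_other_processes)
        return false;                                  /* problematic case */
    if (strcmp(f->local_mnt->mount_point, f->home_mnt->mount_point) != 0)
        return false;                                  /* not same-named   */
    if (f->local_mnt->mount_flags != f->home_mnt->mount_flags)
        return false;                                  /* flags differ     */
    return dfsa_supported_type(f->local_mnt->fs_type);
}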

3.3 DFSA requirements

DFSA can work with any file system that satisfies the following properties:

- Consistency: when a process makes a change to a file from any node, all other nodes must see that change immediately, or no later than their next access to that file. In a server/client model this usually implies either cache invalidation in the client nodes or a single virtual cache, which could be in any node (e.g., on the server). We note that a sophisticated server may possibly "lend" the cache of particular files and/or blocks to a particular node at any time. File systems with shared hardware may offer other consistency solutions.

- The time-stamps on files, and between files of the same file system, must be consistent and advancing (unless the clock is deliberately set backwards), regardless of the node from which modifications are made.

- The file system must ensure that files/directories are not cleared when unlinked, for as long as any process in the cluster still holds them open. There are several possible techniques to achieve this, e.g., some form of garbage collection.

- A few extra functions for the file system, which in Linux come under "super-block operations", e.g., "identify", to encapsulate finite identifying information about a "dentry" in a way that is sufficient to re-establish that open file/directory on another node (a sketch of such an interface follows this list).
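As an illustration of the last requirement, the following C sketch shows what an "identify"-style interface might look like: it packs enough node-independent information about an open dentry so that the same file or directory can be re-established on another node. The structure and names are assumptions, not the actual MOSIX or Linux interface.

/* Hypothetical sketch of the extra super-block operations of Section 3.3. */
#include <stdint.h>

struct dentry;                   /* opaque here; provided by the kernel */

/* Compact identifying information for an open file or directory. */
struct dfsa_file_id {
    uint32_t device_id;          /* which exported file system     */
    uint64_t inode_number;       /* object within that file system */
    uint32_t generation;         /* guards against inode reuse     */
};

/* Extra operations a DFSA-capable file system would provide. */
struct dfsa_fs_operations {
    /* Fill *id with identifying information for the dentry. */
    int (*identify)(struct dentry *d, struct dfsa_file_id *id);
    /* Re-establish the object described by *id on the receiving node. */
    struct dentry *(*reopen)(const struct dfsa_file_id *id);
};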

3.4 The MOSIX File System (MFS)

MOSIX requires file and cache consistency between processes that run on different nodes because even the same process (or a group of cooperating, related processes) can appear to operate from different nodes. To run DFSA we implemented a prototype file system, called the MOSIX File-System (MFS). MFS provides a unified view of all files on all mounted file systems (of any type) on all the nodes of a MOSIX cluster, as if they were all within a single file system. For example, if one mounts MFS on the /mfs mount-point, then the file /mfs/1456/usr/tmp/myfile refers to the file /usr/tmp/myfile on node #1456. This makes MFS both generic, since /usr/tmp may be of any file system type, and scalable, since each node in the MOSIX cluster is potentially both a server and a client.

Unlike other file systems, MFS provides cache consistency by maintaining only one cache, at the server. To implement this, the standard disk and directory caches of Linux are used only in the server and are bypassed on the clients. The main advantage of our approach is that it provides a simple, yet scalable scheme for cache consistency among any number of processes. Another advantage of the MFS approach is that it raises the client-server interaction to the system-call level, which is good for the more common, large and complex I/O operations (opening files, I/O with large block sizes). Obviously, having no cache on the client is a major drawback for I/O operations with smaller block sizes.
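Because MFS exposes remote files through ordinary path names, no special API is needed to use it; a regular program simply opens the /mfs path. The short C example below follows the path from the text above and assumes MFS is mounted on /mfs; the node number and file name are hypothetical.

/* Minimal example of reading a remote file through MFS, assuming MFS is
 * mounted on /mfs.  Node 1456 and the path follow the example above. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    /* /mfs/<node>/<path> names /usr/tmp/myfile on node #1456 */
    int fd = open("/mfs/1456/usr/tmp/myfile", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n >= 0)
        printf("read %zd bytes via MFS\n", n);
    close(fd);
    return 0;
}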

4 Performance

This section presents the performance gains of using DFSA. Since DFSA is not a file system, we evaluated it by comparing the performance of MFS with and without DFSA. For reference we also evaluated the performance of local I/O operations and that of NFS. We executed the PostMark [9] benchmark, which simulates heavy file system loads, e.g., as in a large Internet electronic mail server. PostMark creates a large pool of random-size text files and then performs a number of transactions, where each transaction consists of a pair of smaller transactions. This is performed until a predefined workload is obtained.

All executions were performed in a cluster running MOSIX for Linux 2.2.16. The benchmark was executed between a pair of identical machines (Pentium 550 MHz PCs with IDE disks) using 4 different file access methods and block sizes ranging from 64 bytes to 16KB.

Table 1: File system access times (Sec.)

                          Data transfer block size
Access method             64B      512B     1KB      2KB      4KB      8KB      16KB
Local (in the server)     102.6    102.1    100.0    102.2    100.2    100.2    101.0
MFS with DFSA             104.8    104.0    103.9    104.1    104.9    105.5    104.4
NFS                       184.3    169.1    158.0    161.3    156.0    159.5    157.5
MFS without DFSA          1711.0   382.1    277.2    202.9    153.3    136.1    124.5

The results of the benchmark are shown in Table 1, where each measurement represents the average time, in seconds, of 10 executions. The first line in the table shows the Local execution times, in which the process and the files were in the server node. The second line shows the MFS with DFSA times; in this case the benchmark started in a client node and then migrated to the server node. The third line shows the NFS times from a client (in its MOSIX home) node to a server node. The last line shows the MFS without DFSA times, in which all file operations were performed from the client node to the server node via MFS, with process migration and DFSA disabled.

From Table 1 it follows that the MFS with DFSA execution times are 1.8-4.9% slower than the Local times, with an average slowdown of 3.7% over all measured block sizes. Also, the MFS with DFSA times are 51-76% faster than the NFS times, with an average speedup of 59% over all measured block sizes. Observe that NFS is 56-80% slower than Local. As expected, the MFS without DFSA times (fourth row) are slower than the NFS times for small block sizes. However, for block sizes of 4KB and larger, MFS performs better than NFS. This makes MFS a reasonable alternative for a cluster file system that supports cache consistency.

The second benchmark shows the performance of file I/O operations between a set of client processes, each running on its own node and executing a parallel file application that communicates with a common file server. More specifically, this parallel program reads a common database of size 200MB from the main memory of a Quad-SMP file server (Xeon 550 MHz). The application was executed with different file systems, as shown in Figure 1, using a 4KB block size. For reference, the benchmark was also executed locally on the server node. The application was started simultaneously in all of the nodes (Pentium 550 MHz) and the time taken for the last process to complete was recorded.

[Figure 1: Average read time of a common file from a Quad-SMP file server. Curves: Local (in the server), MFS with DFSA, NFS, MFS without DFSA; x-axis: Number of nodes (1, 2, 4, 8, 16); y-axis: Time (Sec.), 0-300.]

The results of this benchmark are averaged over 3 different executions and are shown in Figure 1. From the figure it can be seen that the MFS with DFSA times are very close to the Local times; in fact, MFS with DFSA was on average 6% slower than Local. We note that this overhead is due to an extra software layer in MFS. From the figure it can also be seen that the NFS and the MFS without DFSA times are nearly identical, with NFS providing, on average, 4% better performance than MFS without DFSA. From the obtained results it follows that MFS with DFSA is between 6.1 and 11.4 times faster than NFS and between 6.9 and 11.8 times faster than MFS without DFSA. Clearly, from these results, MFS with DFSA provides better scalability than both NFS and MFS without DFSA.

5 Conclusions and future work

This paper presented a new paradigm for scalable cluster computing that combines cluster file systems with load-balancing. Our scheme allows a migrated process to directly access file systems from its current node, e.g., it allows an I/O-bound process to migrate to the node in which it performs most of its I/O. This means that with a proper distribution of files among the cluster's nodes, it is possible to migrate parallel processes from a client node to file servers to enable efficient, parallel access to different files. We showed that our scheme is "easy to use" and that it provides improved performance and scalability.

Correct operation of our scheme requires file consistency between processes that run on different nodes. The most suitable file system for our scheme is the Global File System (GFS) [8], which allows multiple Linux nodes to share storage over a network. In GFS, each node sees the network disks as local, and GFS itself appears as a local file system. Another attractive feature of GFS is that it exploits new interfaces, such as Fiber Channel, for using network-attached storage. One follow-up project is to adjust GFS to MOSIX and the DFSA. Similar extensions could be developed with other file systems that meet the DFSA requirements. Beyond that, it would be interesting to explore means to combine MFS with storage area networks to provide consistency and more connectivity options between a scalable cluster and storage devices.

Acknowledgments

Many thanks to Phil Joyce of Swinburne University of Technology and to Raanan Chermoni for their help. This research was supported in part by the Ministry of Defense and by a grant from Dr. and Mrs. Silverston, Cambridge, UK.

References

[1] AFS Version 3.6. http://www.transarc.com/Library/documentation/afs doc.html.
[2] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli, and R.Y. Wang. Serverless network file systems. ACM Trans. on Computer Systems, 14(1):41-79, 1996.
[3] A. Barak and A. Braverman. Memory Ushering in a Scalable Computing Cluster. Journal of Microprocessors and Microsystems, 22(3-4), Aug. 1998.
[4] A. Barak, O. La'adan, and A. Shiloh. Scalable Cluster Computing with MOSIX for LINUX. In Proc. 5th Annual Linux Expo, pages 95-100, May 1999.
[5] MOSIX Scalable Cluster Computing for Linux. http://www.mosix.org.
[6] Frangipani. http://research.compaq.com/SRC/projects/.
[7] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine. MIT Press, 1994.
[8] Global File System (GFS). http://www.globalfilesystem.org.
[9] J. Katcher. PostMark: A New File System Benchmark. http://www.netapp.com.
[10] NFS Version 3 Protocol Specification. http://www.faqs.org/rfcs/rfc1813.html.
[11] Peter Pacheco. Parallel Programming with MPI. Morgan Kaufmann Pub. Inc., 1996.
[12] Coda File System. http://www.coda.cs.cmu.edu.

