Generalized Load Sharing for Distributed Operating Systems

A. Satheesh¹ and S. Bama²

¹ Department of Information Technology, Periyar Maniammai University, Vallam, Thanjavur, India
[email protected]
² Department of Computer Science and Engineering, College of Engineering, Guindy, Anna University, Chennai-25, India
[email protected]
Abstract. In this paper we propose job migration policies that consider the effective use of global memory in addition to CPU load sharing in distributed systems. The objective is to reduce the number of page faults caused by unbalanced memory allocations for jobs among distributed nodes, thereby improving the overall performance of a distributed system. The proposed method, which uses the high performance and high throughput approaches with a remote execution strategy, performs best for both CPU-bound and memory-bound jobs in homogeneous as well as heterogeneous networks in a distributed system.
1 Introduction

Load sharing is a channel aggregation approach that permits data traffic to be dispersed over multiple channels at the same time in order to reduce network load fluctuation. A major performance objective of implementing a load sharing policy in a distributed system is to minimize the execution time of each individual job and to maximize system throughput by effectively using distributed resources such as CPUs, memory modules, and I/Os [1]. Most load sharing schemes consider mainly CPU load balancing, assuming that each computer node in the system has a sufficient amount of memory space [3], [4]. These schemes have proved to be effective in improving the overall performance of distributed systems. However, with the rapid development of CPU chips and the increasing demand for data access in applications, memory resources in a distributed system are becoming more and more expensive relative to CPU cycles. We believe that the overheads of data access and movement, such as page faults, have grown to the point where the overall performance of distributed systems would be considerably degraded unless memory resources are seriously considered in the design of load sharing policies.
2 A Structure of Queuing Model

The system uses a queuing network model to characterize load sharing in a distributed system. The performance effects of a load sharing system having both CPU-bound and memory-bound resource demands [2], [5] are compared, which indicates the performance benefits of sharing both CPU and memory resources. The model shown in Fig. 1 is designed to measure the total response time of a distributed system. For a job arriving at a node, the load sharing system may send the job directly to the local node or transfer it to another node, depending on the load status of the nodes in the system. A transferred job either executes on the destination node immediately or waits for its turn in the queue. Parameters λi, for i = 1, …, P, represent the arrival rates of jobs at each of the P nodes in the system (measured in jobs per second). In order to compare the two load sharing policies, the system assumes that all nodes have identical computing capability but different memory capacities [8].
Fig. 1. A Structure of Queuing Model
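To make the model concrete, the following minimal Python sketch (our illustration, not the authors' simulator; the transfer rule and the rate values are assumptions) mimics the structure of Fig. 1: jobs arrive at each of the P node queues at rate λi, and a placement decision either keeps a job local or transfers it to the most lightly loaded node.

import random

P = 6                      # number of nodes, as in the simulated system
arrival_rates = [0.5] * P  # lambda_i in jobs per second (hypothetical values)
queue_lengths = [0] * P    # queue_lengths[i] approximates the load of node i

def place_job(local_node):
    # Send the job to the local node, or transfer it to the most lightly
    # loaded node when the local queue is comparatively long.
    target = min(range(P), key=lambda i: queue_lengths[i])
    if queue_lengths[local_node] > queue_lengths[target] + 1:
        return target      # transfer (remote execution)
    return local_node      # execute locally

for node in range(P):
    _gap = random.expovariate(arrival_rates[node])  # exponential inter-arrival time
    queue_lengths[place_job(node)] += 1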
2.1 Load Sharing for Homogeneous Distributed System

Using this approach, a simulator has been developed for a homogeneous distributed system with 6 nodes, where each local scheduler holds all the load sharing policies, including the CPU-based, memory-based, and CPU-Memory-based policies along with their variations. The simulated system is currently configured with the following parameters [14]:
• CPU speed: 100 MIPS (million instructions per second).
• RAM size: 48 MBytes (the total memory size is 64 MBytes, including 16 MBytes for kernel and file cache usage [6]).
• Memory page size: 4 KBytes.
• Ethernet connection: 10 Mbps.
• Page fault service time: 10 ms.
• Time slice of CPU local scheduling: 10 ms.
• Context switch time: 0.1 ms.
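For concreteness, these settings can be collected into a single configuration record; the sketch below is our own encoding of the values above, not code from the simulator in [14].

# Homogeneous simulator parameters (values from Section 2.1)
HOMOGENEOUS_CONFIG = {
    "cpu_speed_mips": 100,       # million instructions per second
    "ram_mbytes": 48,            # user space: 64 MBytes total minus 16 MBytes kernel/cache
    "page_size_kbytes": 4,
    "ethernet_mbps": 10,
    "page_fault_service_ms": 10,
    "cpu_time_slice_ms": 10,
    "context_switch_ms": 0.1,
}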
The parameter values are similar to those of a Linux workstation. The CPU local scheduling uses the round-robin policy. Each job is in one of the following states: “ready”, “execution”, “paging”, “data transferring”, or “finish”. When a page fault happens during the execution of a job, the job is suspended from the CPU during the paging service, and the CPU is switched to a different job. When page faults happen in the executions of several jobs, they are served in FIFO order. The migration-related costs match the Ethernet network service times for Linux workstations, and they are also used in [9]:
• Remote execution cost: 0.1 second.
• Preemptive migration cost: 0.1 + D / B, where 0.1 is the fixed remote execution cost in seconds, D is the amount of data in bits to be transferred in the job migration, and B is the network bandwidth in Mbps (B = 10 for the Ethernet). For example, preemptively migrating a job with 4 MBytes of memory contents (32 Mbits) over the 10 Mbps Ethernet costs roughly 0.1 + 32/10 = 3.3 seconds.

2.2 System Requirements

The system has the following conditions and assumptions for evaluating the load sharing policies in the distributed system:
i. Every computing node maintains a global load index file which contains both CPU and memory load status information of the other computing nodes. The load sharing system periodically collects and distributes the load information among the computing nodes.
ii. The location policy determines which computing node is selected for a job execution. The policy used in this paper finds the most lightly loaded node in the distributed system (a sketch of this policy follows the list below).
iii. The system assumes that the memory allocation for a job is done at the arrival of the job.
iv. Similar to the assumptions in [8] and [14], the system assumes that page faults are uniformly distributed during job executions.
v. The system assumes that the memory threshold of a job is 40% of its requested memory size. The practical value of this threshold assumption has been confirmed by the studies in [8] and [13].
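A minimal sketch of the location policy under assumptions ii and v might look as follows; using the CPU queue length as the load index is our illustrative choice, since the exact load index formula is not given here.

MEM_THRESHOLD = 0.40   # a job should receive at least 40% of its requested memory (assumption v)

def select_node(job_mem_request_mb, nodes):
    # Pick the most lightly loaded node that can still grant the job its
    # memory threshold; fall back to the lightest node overall.
    # `nodes` is a list of dicts with "cpu_queue" and "free_mem_mb" keys.
    eligible = [n for n in nodes
                if n["free_mem_mb"] >= MEM_THRESHOLD * job_mem_request_mb]
    return min(eligible or nodes, key=lambda n: n["cpu_queue"])

# Example: a 10 MBytes job over three hypothetical nodes; the threshold is
# 4 MBytes, so the node with 12 MBytes free and the shortest queue is chosen.
nodes = [{"cpu_queue": 2, "free_mem_mb": 3},
         {"cpu_queue": 5, "free_mem_mb": 30},
         {"cpu_queue": 1, "free_mem_mb": 12}]
print(select_node(10, nodes))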
2.3 Experimental Results and Analysis

When a job migration is necessary, the migration can be either a remote execution, which executes jobs on remote nodes in a non-preemptive way, or a preemptive migration, which may suspend the selected jobs, move them to a remote node, and then restart them. This paper includes both options in the load sharing experiments to study their merits and limits. Using trace-driven simulation on a 6-node distributed system, the system has evaluated the performance of the following load sharing policies:
• CPU_RE: CPU-based load sharing policy with remote execution [9].
• CPU_PM: CPU-based load sharing policy with preemptive migration [9].
• MEM_RE: Memory-based load sharing policy with remote execution.
• MEM_PM: Memory-based load sharing policy with preemptive migration.
• CPU_MEM_HP_RE: CPU-Memory-based load sharing policy using the high performance computing oriented load index with remote execution.
• CPU_MEM_HP_PM: CPU-Memory-based load sharing policy using the high performance computing oriented load index with preemptive migration.
• CPU_MEM_HT_RE: CPU-Memory-based load sharing policy using the high throughput oriented load index with remote execution.
• CPU_MEM_HT_PM: CPU-Memory-based load sharing policy using the high throughput oriented load index with preemptive migration.

In addition, the system has compared the execution performance of the above policies with the execution performance without load sharing, denoted NO_LS [14]. A major timing measurement used is the mean slowdown, which is the ratio between the total wall-clock execution time of all the jobs in a trace and their total CPU execution time. Major contributions to the slowdown come from the delays of page faults, waiting time for CPU service, and the overhead of migration and remote execution.
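The mean slowdown metric is straightforward to compute from a trace; the following sketch (the job representation is our assumption) mirrors the definition above.

def mean_slowdown(jobs):
    # Mean slowdown = total wall-clock time of all jobs in a trace divided
    # by their total CPU execution time (Section 2.3).
    # Each job is a (wall_clock_seconds, cpu_seconds) pair.
    total_wall = sum(wall for wall, _ in jobs)
    total_cpu = sum(cpu for _, cpu in jobs)
    return total_wall / total_cpu

# One job delayed by paging and queueing (12 s of wall-clock for 4 s of CPU)
# and one undisturbed job: (12 + 5) / (4 + 5) ≈ 1.89
print(mean_slowdown([(12, 4), (5, 5)]))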
2.4 Load Sharing for Heterogeneous Distributed Systems

A heterogeneous-system simulator has also been developed, including the CPU-based, memory-based, and CPU-Memory-based load sharing policies using remote executions. This system is configured with the parameters listed in Table 1, whose values are similar to those of Linux and Windows workstations. The remote execution overhead matches the 10 Mbps Ethernet network service times for the Linux workstations. The system also conducted experiments on a 100 Mbps Ethernet network; however, compared with the experiments on the 10 Mbps Ethernet, only very little performance improvement was found for all policies. Although the remote execution overhead is reduced by the faster Ethernet, it remains an insignificant factor for performance degradation in these experiments; the page fault overhead is the dominant factor [7], [10]. As in the homogeneous system, the CPU local scheduling uses the round-robin policy, and each job is in one of the states “ready”, “execution”, “paging”, “data transferring”, or “finish”. When a page fault happens in the middle of a job execution, the job is suspended from the CPU during the paging service and the CPU is switched to a different job; page faults from several jobs are served in FIFO order.
Table 1. Parameters used for the simulated heterogeneous network of workstations where the CPU speed is represented by MIPS [14]
CPU speeds: 100 to 500 MIPS
Memory sizes (MS_j): 32 to 256 MBytes
Kernel and file cache usage (U_sys): 16 MBytes
User available space (RAM_j): MS_j − U_sys
Memory page size: 4 KBytes
Ethernet speed: 10 Mbps
Page fault service time: 10 ms
CPU time slice for a job: 10 ms
Context switch time: 0.1 ms
Overhead of a remote execution: 0.1 sec
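One plausible way to instantiate the heterogeneous configuration of Table 1 is sketched below; the sampling of speeds and sizes is our assumption, since the paper does not specify how values are assigned to individual nodes.

import random

KERNEL_CACHE_MB = 16   # U_sys in Table 1

def make_hetero_node():
    # Draw one node from the parameter ranges of Table 1.
    ms_j = random.choice([32, 64, 128, 256])           # memory size MS_j in MBytes
    return {
        "cpu_mips": random.choice([100, 200, 300, 400, 500]),
        "ram_mb": ms_j - KERNEL_CACHE_MB,              # RAM_j = MS_j - U_sys
        "page_size_kb": 4,
        "ethernet_mbps": 10,
        "page_fault_service_ms": 10,
        "remote_exec_overhead_s": 0.1,
    }

nodes = [make_hetero_node() for _ in range(6)]         # the simulated 6-node system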
3 Simulation Results and Discussion

3.1 A Simulator of a Homogeneous Distributed System

The system has experimentally evaluated the 8 load sharing policies plus the NO_LS policy. The upper limit of the vertical axis presenting the mean slowdown is set to 20, and the page fault rate is scaled accordingly. The mean memory demand for each trace at different page fault rates was set to 1 MBytes, 2 MBytes, 3 MBytes, and 4 MBytes. This paper first concentrates on the performance comparisons of “trace 0” in different directions, and then presents performance comparisons of all the traces.

Performance comparisons of all the traces
The performance comparisons of the load sharing policies on the other traces are consistent in principle with what has been presented for “trace 0”. Fig. 2 presents the mean slowdowns of all the traces (“trace 0”, …, “trace 7”) scheduled by different load sharing policies with a mean memory demand of 1 MBytes.

3.2 A Simulator of a Heterogeneous Distributed System

Using trace-driven simulation on the 6-node heterogeneous distributed system, the performance of the following load sharing policies has been evaluated: CPU_RE, MEM_RE, and CPU_MEM_RE. In addition, the proposed method has compared the execution performance of these policies with the execution performance without load sharing, denoted NO_LS. A major timing measurement used is the mean slowdown, the ratio between the total wall-clock execution time of all the jobs in a trace and their total CPU execution time [11, 12]. Major contributions to the slowdown come from the delays of page faults, waiting time for CPU service, and the overhead of migration and remote execution.

Performance comparisons of all the traces
The performance comparisons of the load sharing policies on the other traces are consistent in principle with what has been presented for “trace 0”.
Fig. 2. Mean slowdowns of all the 8 traces scheduled by different load sharing policies with the mean memory demand of 1 MBytes
Fig. 3 presents the mean slowdowns of all the traces (“trace 0”, …, “trace 7”) scheduled by the three load sharing policies with a mean memory demand of 8 MBytes on platforms 1, 2, 3, and 4 [8]. The page fault rate of each trace was also adjusted in order to obtain reasonably balanced slowdown heights of 20 among all the traces on the 4 platforms. The NO_LS policy has the highest slowdown values for all the traces on all the platforms, and the CPU_RE policy has the second highest. Policies MEM_RE and CPU_MEM_RE performed the best for all the traces on all 4 platforms.
Fig. 3. Mean slowdowns of “trace 0” on different heterogeneous platforms, with a mean memory size of 8 MBytes
4 Current Works and Future Research Direction

We have experimentally examined and compared CPU-based, memory-based, and CPU-Memory-based load sharing policies on both homogeneous and heterogeneous networks of distributed systems. Based on our experiments and analysis, this paper
has the following observations and conclusions. A load sharing policy considering only the CPU or only the memory resource is beneficial either to CPU-bound or to memory-bound jobs; only a load sharing policy considering both resources is beneficial to jobs of both types. Our trace-driven simulations show that CPU-Memory-based policies with a remote execution strategy are more effective than the policies with preemptive migration for memory-bound jobs, but the opposite is true for CPU-bound jobs. This result may not apply to general cases, because this paper did not consider other data access patterns and therefore could not apply any optimizations to preemptive migrations. The high performance approach is slightly more effective than the high throughput approach for both memory-bound and CPU-bound jobs. Future work can proceed in different directions, for example, implementing both the CPU_MEM_HT_PM and CPU_MEM_HP_PM load sharing policies in the openMosix or Kerrighed distributed operating systems, which is expected to improve workload performance.
References

1. Acharya, A., Setia, S.: Availability and utility of idle memory in workstation clusters. In: Proceedings of ACM SIGMETRICS Conference on Measuring and Modeling of Computer Systems, pp. 35–46 (May 1999)
2. Barak, A., Braverman, A.: Memory ushering in a scalable computing cluster. Journal of Microprocessors and Microsystems 22(3–4), 175–182 (1998)
3. Hui, C.-C., Chanson, S.T.: Improved strategies for dynamic load sharing. IEEE Concurrency 7(3), 58–67 (1999)
4. Eager, D.L., Lazowska, E.D., Zahorjan, J.: The limited performance benefits of migrating active processes for load sharing. In: Proceedings of ACM SIGMETRICS Conference on Measuring and Modeling of Computer Systems, pp. 63–72 (May 1988)
5. Glass, G., Cao, P.: Adaptive page replacement based on memory reference behavior. In: Proceedings of ACM SIGMETRICS Conference on Measuring and Modeling of Computer Systems, pp. 115–126 (May 1997)
6. Voelker, G.M., et al.: Managing server load in global memory systems. In: Proceedings of ACM SIGMETRICS Conference on Measuring and Modeling of Computer Systems, pp. 127–138 (May 1997)
7. Leung, K.C., Li, V.O.K.: Generalized Load Sharing for Packet-Switching Networks: Theory and Packet-Based Algorithm. IEEE Transactions on Parallel and Distributed Systems 17(7), 694–702 (2006)
8. Xiao, L., Zhang, X., Qu, Y.: Effective Load Sharing on Heterogeneous Networks of Workstations. In: ICDCS 2000: Proceedings of the 20th International Conference on Distributed Computing Systems, Taipei, Taiwan (April 10–13, 2000)
9. Feeley, M.J., et al.: Implementing global memory management systems. In: Proceedings of the 15th ACM Symposium on Operating System Principles, pp. 201–212 (December 1995)
10. Zhou, S.: A trace-driven simulation study of load balancing. IEEE Transactions on Software Engineering 14(9), 1327–1341 (1988)
11. Zhou, S., Wang, J., Zheng, X., Delisle, P.: Utopia: a load sharing facility for large heterogeneous distributed computing systems. Software Practice and Experience 23(2), 1305–1336 (1993)
12. Kunz, T.: The influence of different workload descriptions on a heuristic load balancing scheme. IEEE Transactions on Software Engineering 17(7), 725–730 (1991)
13. Hesselbach, X., Fabregat, R., Baran, B., Donoso, Y., Solano, F., Huerta, M.: Hashing based traffic partitioning in a multicast-multipath MPLS network model. In: Proceedings of the ACM ANC 2005 (October 10–12, 2005)
14. Zhang, X., Qu, Y., Xiao, L.: Improving distributed workload performance by sharing both CPU and memory resources. In: ICDCS 2000: Proceedings of the 20th International Conference on Distributed Computing Systems, Taipei, Taiwan (April 10–13, 2000)