Second Asia International Conference on Modelling & Simulation
A Descriptive Performance Model of a Load Balancing Single System Image

Bestoun S. Ahmed (1), Khairulmizam Samsudin (2), Abdul Rahman Ramli (3) and ShahNor Basri (4)

(1, 2, 3) Department of Computer & Communication Systems Engineering, University Putra Malaysia, 43400 Serdang, Selangor, MALAYSIA
(4) Department of Aerospace Engineering, University Putra Malaysia, 43400 Serdang, Selangor, MALAYSIA

(1) [email protected], (2) [email protected], (3) [email protected], (4) [email protected]
Abstract

Recent developments in load balancing single system image (SSI) clusters provide a simple to use, cost effective and high performance environment, which has become increasingly attractive to many users. OpenMosix is one instance of a successful open source SSI cluster. To date, there are no standard performance measurement metrics suitable for OpenMosix. A standard descriptive performance model and benchmark would allow developers and users to measure and compare the performance of different SSI cluster designs. The aim of this part of the research is to study the effect of the number of nodes on selected performance metrics under a particular network configuration, and to investigate a framework for performance evaluation and modeling. The performance evaluation and benchmark show that the performance of the cluster is affected by the probabilistic and decentralized approach used to disseminate load-balancing information. A modification to the dissemination load-balancing algorithm shows improved performance.
Keywords: Single system image, OpenMosix, MOSIX, performance modeling.
1. Introduction

A single system image (SSI) provides the illusion of a single powerful and highly available computer to the users and programmers of a cluster system. The most important advantages of SSI over other types of clusters are that code does not need to be parallelized, processes are migrated automatically (i.e. load balancing), and the system is fully aware of its hardware and system resources [1]. OpenMosix [2, 3] is one instance of an SSI system. The latest stable version of the OpenMosix patch for the 2.4.26 Linux kernel incorporates additional features such as improved load balancing, reduced kernel latencies, node auto-discovery, and a more sophisticated process migration algorithm [4]. However, there has been no effort to benchmark the OpenMosix implementation, owing to the absence of benchmark tools and frameworks for SSI operating system clusters. Existing research on performance modeling and benchmarking [5] focuses on Beowulf clusters, which follow a different approach because of their dependence on MPI [6], PVM [7] or other middleware libraries [8]. Standard benchmarks also take a different approach, since they use a simple static job assignment algorithm that completely ignores the current state of the dynamic load-balancing algorithm in an SSI-based system [9]. As there are many SSI projects under active development, including OpenSSI, Kerrighed [4] and MOSIX [10, 11], it is very important to have a standard descriptive model for benchmarking, in an effort to measure and compare the actual performance of each design. This paper presents a descriptive performance model that provides a practical benchmark framework for understanding the performance of SSI clusters. The goal is to develop strategies for understanding the performance of a load balancing SSI that could provide crucial information to SSI designers for improving these systems. The descriptive model and performance evaluation framework are shown to be effective and practical for addressing OpenMosix design issues. This work demonstrates improved performance by enhancing the load vector management of OpenMosix compared with the standard algorithm.

2. OpenMosix Architecture

OpenMosix is an open source project forked from MOSIX, and most of its design is similar to that of MOSIX. An important similarity is that a process appears to run on the node on which it was spawned, known as its Unique Home Node (UHN), even though it may have migrated elsewhere. A process uses local resources whenever possible, but often has to make system calls on its UHN. These calls are carried out by the deputy, which is the process representative at its UHN [12]. Preemptive process migration (PPM) is the main tool of the resource management algorithm. As long as the requirements for resources such as CPU are below certain threshold points, all user processes are confined to their home node. When these requirements exceed the CPU threshold levels, some processes are migrated transparently to other nodes [2, 11]. Memory management is provided in OpenMosix through a memory ushering algorithm, similar to that of MOSIX
[12]. This algorithm is activated when the free memory of a node falls below a threshold value; OpenMosix then attempts to transfer processes to other nodes that have sufficient free memory. Thus, process migration is decided not only on the basis of the processor load criterion but also takes memory usage into account [13].
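As a rough illustration of the migration decision just described, the following simplified C sketch combines the CPU load and memory ushering triggers. It is not the OpenMosix kernel code; the structure and threshold names (node_stat, load_threshold, mem_threshold) are hypothetical placeholders.

```c
/* Simplified sketch of the migration trigger described above.
 * All names and threshold values are hypothetical; the real decision
 * is made inside the OpenMosix kernel and is more involved. */
struct node_stat {
    unsigned long cpu_load;   /* current CPU load of this node      */
    unsigned long free_mem;   /* free memory on this node (KB)      */
};

static const unsigned long load_threshold = 80;    /* assumed CPU trigger      */
static const unsigned long mem_threshold  = 4096;  /* assumed memory trigger   */

/* Returns nonzero when a process on this node should be considered for
 * transparent migration away from its Unique Home Node (UHN). */
int should_migrate(const struct node_stat *local)
{
    if (local->cpu_load > load_threshold)
        return 1;             /* CPU requirements exceed the threshold      */
    if (local->free_mem < mem_threshold)
        return 1;             /* memory ushering: free memory has run low   */
    return 0;                 /* requirements are satisfied: stay on UHN    */
}
```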
3. Load Balancing in OpenMosix

As in any SSI system, the major feature of OpenMosix is load balancing. The main load balancing components of OpenMosix are information dissemination, process migration and memory migration. This work concentrates on benchmarking the information dissemination and process migration daemons. The information dissemination daemon disseminates load balancing information to the other nodes in the cluster. The daemon runs on each node and is responsible for sending and receiving load information to and from other nodes. The sending part of this daemon periodically sends load information, once per second, to two randomly selected nodes. The first node is selected from all nodes that have contacted the node "recently" with their load information. The second node is chosen from any node in the cluster [14]. The receiving part of the information dissemination daemon receives the load information and attempts to replace information in the local load vector. The standard implementation simply uses a first-in-first-out (FIFO) queue of eight entries, so the oldest information is overwritten by newly received information [10].
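A minimal C sketch of the dissemination scheme and the FIFO load vector described above is given below. It is only a model of the behaviour under assumed names (load_entry, receive_load_info, disseminate_load and send_load are all hypothetical) and does not reproduce the actual openMosix daemon code.

```c
#include <stdio.h>
#include <stdlib.h>

#define LOAD_VECTOR_SIZE 8          /* the standard load vector keeps eight entries */

/* Hypothetical representation of one received load report. */
struct load_entry {
    int  node_id;
    long load;
};

static struct load_entry load_vector[LOAD_VECTOR_SIZE];
static int next_slot;               /* FIFO write position */

/* Placeholder for the real UDP send of a load message. */
static void send_load(int target_node, long local_load)
{
    printf("send load %ld to node %d\n", local_load, target_node);
}

/* Receiving side: the oldest entry is overwritten (FIFO), regardless of
 * whether the new information is actually useful to this node. */
void receive_load_info(int node_id, long load)
{
    load_vector[next_slot].node_id = node_id;
    load_vector[next_slot].load    = load;
    next_slot = (next_slot + 1) % LOAD_VECTOR_SIZE;
}

/* Sending side: once per second, send the local load to two targets:
 * one picked among nodes heard from recently, one picked among all nodes. */
void disseminate_load(long local_load, const int *recent, int n_recent, int n_all)
{
    send_load(recent[rand() % n_recent], local_load);  /* a "recent" contact   */
    send_load(rand() % n_all, local_load);             /* any node in cluster  */
}
```

The modification evaluated later in this paper changes only the receiving side of this behaviour.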
4. Cluster Construction

The OpenMosix cluster for this work was built from commodity off-the-shelf (COTS) computers and consists of eight nodes based on an Intel Pentium 4, 1.6 GHz processor with 256 MB of main memory. All nodes in the system are homogeneous. This allows us to neglect the mismatch of processing speeds that may cause additional process coordination complexity and therefore affect the execution performance of the programs [15]. There is no graphics card on any node except the main node. The nodes are interconnected in a star topology using an Ethernet 10/100 Mbit/s switch. Each node was installed with a 2.4.26 kernel based on Fedora Core 1 GNU/Linux with gcc 3.3.2 and patched with the OpenMosix 2.4.26 patch.

5. Performance Modeling Metrics

The concepts behind building clustered systems may be straightforward, but measuring the performance of clusters is not. Several parameters could influence the performance modeling of SSI. They include processor parameters, such as the number of processors available to the cluster and the processor speed; program parameters, such as data size, number of loops and number of spawned processes; and network parameters, such as latency, bandwidth and topology [16]. In this paper, the following three performance metrics are considered: run time, speed-up and efficiency. The run time metric is simply the total time for completing a given job, without considering the number of operations performed. It is the standard performance measurement for systems running the same application. Speed-up is the ratio between the time required to solve a problem on a single node and the time taken to solve the same problem using more than one node. This metric indicates the cluster's ability to increase speed as the number of nodes increases, and is therefore suitable for measuring the scalability of the OpenMosix cluster. Speed-up is given by:

Speed-up = T(1) / T(N)    (1)

where T(1) is the run time of a particular program on a single node and T(N) is the run time on N nodes. Efficiency is defined as the percentage of speed-up divided by the number of nodes:

Efficiency = (Speed-up / N) x 100    (2)

Efficiency measures the percentage of a node's time spent in useful problem solving. A rigorous analysis of the efficiency metric can be used to identify the dominant overhead factors and measure them in practice.
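To make the two metrics concrete, the short C sketch below computes speed-up and efficiency from a set of run times according to Eq. (1) and Eq. (2). The run time values are illustrative placeholders only, not the DKG measurements reported in the next section.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical run times T(N) in seconds for N = 1..8 nodes;
     * illustrative numbers only, not the measured DKG results. */
    double t[9] = {0, 2400, 1260, 900, 720, 620, 580, 560, 545};

    for (int n = 1; n <= 8; n++) {
        double speedup    = t[1] / t[n];            /* Eq. (1) */
        double efficiency = speedup * 100.0 / n;    /* Eq. (2) */
        printf("N=%d  T=%.0fs  speed-up=%.2f  efficiency=%.1f%%\n",
               n, t[n], speedup, efficiency);
    }
    return 0;
}
```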
6. Results and Discussions

Each of the performance metrics considered has been measured using a Distributed Key Generator (DKG) [17]. The program generates 4000 RSA public and private key pairs of 1024 bits. To ensure that the load exceeds the CPU threshold of our cluster, we modified the program so that the parent distributes the computation to a specified number of child processes using fork(), rather than the default value of four. However, as the load balancing is dynamic, the run time varies from one run to another throughout the experiment due to variation in job placement. This factor is minimized by running each experiment several times and taking the mean value of each measurement. To demonstrate the importance of spawning enough child processes from the parent, we measured the peak CPU utilization during the run on a cluster of 8 nodes, without considering time. The following figures illustrate the CPU utilization for DKG running with 4 and 8 children. Fig. 1 illustrates the CPU utilization of each node in the cluster when executing DKG with 4 children. Note that the load generated by the 4-child DKG does not exceed the OpenMosix CPU threshold and therefore the processes are not migrated to the other nodes.
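The modification to DKG described above amounts to letting the parent fork a configurable number of worker processes instead of the default four. A minimal sketch of that pattern is shown below; it is not the actual DKG source, and generate_keys is a hypothetical placeholder for the per-child share of the RSA key generation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for the per-child share of the key generation work. */
static void generate_keys(int child_index, int keys_per_child)
{
    printf("child %d generating %d key pairs\n", child_index, keys_per_child);
}

int main(int argc, char *argv[])
{
    int nchildren  = (argc > 1) ? atoi(argv[1]) : 4;   /* default was 4 children */
    int total_keys = 4000;

    for (int i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid == 0) {                    /* child: do its share and exit */
            generate_keys(i, total_keys / nchildren);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)                 /* parent: reap all children */
        ;
    return 0;
}
```

Under OpenMosix, each forked child is an independent process and therefore a candidate for transparent migration, which is why the number of children determines how much load can be spread across the cluster.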
Fig. 1. CPU utilization with 4-child DKG (per-node CPU utilization, %: nodes 1-4 at 99.9%, node 5 at 55.5%, nodes 6-8 at 0.1%)

The CPU utilization of the fifth node is only about 55%, compared with 99.9% for the other utilized nodes. Therefore, we increased the number of child processes to obtain enough load to conduct the benchmarking process.

Fig. 2 illustrates the CPU utilization of each node in the cluster when executing DKG with 8 children. It is clear that the load from executing DKG with 8 children surpasses the OpenMosix CPU threshold, as all the nodes in the cluster use all of their CPU resources. Having found a suitable load for the benchmark, a run time benchmark was conducted to investigate the behavior of the system under load with a varying number of nodes, as shown in Fig. 3.

Fig. 2. CPU utilization with 8-child DKG running (per-node CPU utilization, %: all eight nodes at 99.9%)

Fig. 3. Run time of DKG on the OpenMosix cluster (run time in seconds versus number of nodes)

We note that there is a significant performance improvement between one node and two nodes (nearly 50%); however, there is no significant improvement after five nodes. We predict that there would be no significant performance gain beyond eight nodes.

To investigate the point at which performance drops off, and the scalability of the system, the speed-up and efficiency metrics are very important. Fig. 4 and Fig. 5 illustrate the speed-up and efficiency measurements of DKG on the OpenMosix cluster. It is clear that the speed-up improvement is not linear with the number of nodes, while the efficiency of DKG on the cluster starts to degrade with more than three nodes. To identify the cause of the speed-up and efficiency degradation, we investigated the effect of network traffic on the performance of OpenMosix by measuring the traffic on the home node while executing 8-child DKG in a controlled network environment. IPtraf [18] was used to measure the network traffic. Fig. 6 shows the network traffic generated on the home node and the actual inbound and outbound traffic of the network interface. Network traffic clearly increases with each additional node and, as a result, contributes to the degradation of efficiency in performing the calculation.

Fig. 4. Speed-up of DKG on the OpenMosix cluster (measured versus ideal speed-up, against the number of nodes)

Fig. 5. Efficiency of DKG on the OpenMosix cluster (measured versus ideal efficiency in %, against the number of nodes)
Fig. 6. Nodes relation with traffic (home-node traffic in kbit/s: total, inbound and outbound, against the number of nodes)

The increase in traffic affects the home node to such an extent that the performance of the cluster itself drops noticeably. Communication among nodes in the OpenMosix cluster relies on both the TCP and UDP protocols: TCP is used to migrate processes, while UDP datagrams are used solely to send load messages [19]. However, the measurements show that the amount of UDP traffic is insignificant compared with TCP. An important factor that could improve the performance of an OpenMosix cluster is the time it takes to balance the load across the cluster. As stated in Section 3, because of the probabilistic and decentralized nature of load message dissemination, OpenMosix always updates the local load vector management (LVM) information without considering whether it is advantageous to the current node. This led us to modify the existing load vector management in the kernel. Our approach is to update the load vector only when the load information from the remote node is beneficial to the existing node, that is, when Lremote < Llocal.
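A minimal sketch of this modified acceptance test, reusing the hypothetical load_vector structure from the Section 3 sketch, is shown below. The actual change was made in the openMosix kernel load-vector code, which is not reproduced here.

```c
/* Modified receive policy (sketch only, not the kernel patch): a remote
 * load report is accepted only if it is beneficial to this node, i.e.
 * the remote node is less loaded than we are (Lremote < Llocal).
 * Otherwise the local load vector is left untouched, instead of blindly
 * overwriting the oldest FIFO entry as the standard policy does.
 * load_vector, next_slot and LOAD_VECTOR_SIZE are the same hypothetical
 * definitions used in the Section 3 sketch. */
void receive_load_info_modified(int node_id, long remote_load, long local_load)
{
    if (remote_load >= local_load)
        return;                               /* not beneficial: ignore it */

    load_vector[next_slot].node_id = node_id;
    load_vector[next_slot].load    = remote_load;
    next_slot = (next_slot + 1) % LOAD_VECTOR_SIZE;
}
```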
Fig. 7 shows the run time comparison between the standard OpenMosix LVM policy and the improved LVM. There is a visible additional performance gain after five nodes; with 8 nodes, a gain of 70 seconds of run-time performance is achieved.

Fig. 7. Comparison of run time for standard and improved LVM (run time in seconds against the number of nodes)

Fig. 8 shows the speed-up of running DKG using the standard OpenMosix LVM compared with the improved LVM policy. Note that the modified LVM has a significant performance gain over the standard LVM. This improvement is due to more efficient dissemination of information around the cluster, which leads to a better load balancing strategy.

Fig. 8. Comparison of speed-up for standard and improved LVM (speed-up after modification versus ideal speed-up, against the number of nodes)

7. Conclusion

This paper has presented part of our research on a descriptive performance model of OpenMosix and on developing a benchmark framework for future development. Efficiency, speed-up and overall performance are not linearly proportional to the number of nodes. We note that the overall cluster performance is high, but that there is a significant overhead due to the sharing of the process address space and communication state. The relationship between the processor, network latency and network bandwidth plays an important role in the performance model. The results also show that speed-up and efficiency depend on the load (the number of spawned processes). Even though OpenMosix migrates processes to remote nodes transparently, a deputy is still created for each process on the main node. Therefore, cluster performance cannot be increased simply by upgrading the processor or network hardware in each node; the ability of the load-balancing algorithm to handle these processes must also be taken into consideration. The understanding gained from the benchmark points to a drawback in the OpenMosix load-vector management implementation. Modifying the standard implementation results in improved efficiency and speed-up of the overall cluster performance.
8. References

[1] Christine, M., Pascal, G., Renaud, L. and Geoffroy, V., "Toward an Efficient Single System Image Cluster Operating System", Future Generation Computer Systems, Vol. 20(2), Elsevier Science, pp. 505-521, January 2004.
[2] Moshe, B., Maya, K., Krushna, B., Asmita, J. and Snehal, M., "Introduction to OpenMosix", Linux Congress, 2003.
[3] OpenMosix website, http://openmosix.org, 2007.
[4] Renaud, L., Pascal, G., Geoffroy, V. and Christine, M., "OpenMosix, OpenSSI and Kerrighed: A Comparative Study", IEEE International Symposium on Cluster Computing and the Grid, pp. 1016-1023, 2005.
[5] Nielson, C., "A Descriptive Performance Model of Small, Low Cost, Diskless Beowulf Clusters", Master's thesis, Brigham Young University, Dec. 2003.
[6] Pacheco, P., "Parallel Programming with MPI", Morgan Kaufmann, 1996.
[7] Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V., "PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing", MIT Press, 1994.
[8] Nielson, C., "A Descriptive Performance Model of Small, Low Cost, Diskless Beowulf Clusters", Master's thesis, Brigham Young University, Dec. 2003.
[9] Keren, A. and Barak, A., "Opportunity Cost Algorithms for Reduction of I/O and Interprocess Communication Overhead in a Computing Cluster", IEEE Transactions on Parallel and Distributed Systems, Vol. 14, No. 1, 2003.
[10] Nigel, C., "Evaluation of Enigma: an OpenMOSIX Cluster for Text Mining", NII Technical Report, Jan. 2003.
[11] Rajkumar, B., "High Performance Cluster Computing", Vol. I, Prentice Hall PTR, pp. 70-71, Australia, 1999.
[12] Malik, K. and Khan, O., "Migratable Sockets in Cluster Computing", The Journal of Systems and Software, Vol. 75, pp. 171-177, 2005.
[13] Barak, A. and Braverman, A., "Memory Ushering in a Scalable Computing Cluster", in Proceedings of the IEEE Third Int. Conf. on Architecture for Parallel Processing, IEEE, 1997.
[14] Barak, A., Guday, S. and Wheeler, R.G., "The MOSIX Distributed Operating System: Load Balancing for UNIX", Lecture Notes in Computer Science, Vol. 672, Springer, 1993.
[15] Adam, K., Wong, L. and Andrzej, M., "Parallel Computing on Clusters and Enterprise Grids: Practice and Experience", Int. J. High Performance Computing and Networking, Vol. x, No. x, 2006.
[16] Rajkumar, B., "High Performance Cluster Computing", Vol. I, Prentice Hall PTR, pp. 70-71, Australia, 1999.
[17] Ying-Hung Chen, http://ying.yingternet.com/mosix.
[18] IPtraf home page, http://iptraf.seul.org.
[19] Wynne, A., "Load Balancing Experiments in OpenMosix", Master's thesis, Western Washington University, 2005.