Our GRUB Patch, http://java.thn.htu.se/~caveman/GRUB/. Portage, 2003-03-05. Portage User Guide, http://gentoo.org/doc/en/portage-user.xml. Xu, M.Q., 2001.
31-NIGHTS AROUND MIDNIGHT: COMPUTE NODE BEHAVIOR IN A REAL WORLD METAMORPHOSIC CLUSTER
Andreas Boklund, Stefan Mankefors-Christiernin, Christian Jiresjö
Department of Informatics and Mathematics, University of Trollhättan/Uddevalla
P.O. Box 957, SE-461 29 Trollhättan, Sweden
{andreas.boklund, stefan.christiernin, christian.jiresjo}@htu.se
ABSTRACT
In this article we present a first evaluation of a metamorphosic compute resource (MCR) in a real-world scenario. During the daytime the MCR compute nodes are ordinary office computers used in a student lab. Thanks to our enhancements, the computers become part of an OpenMosix-based compute resource during the night. The single most common reason for partial loss of computing power was students and/or personnel mistakenly switching off some of the machines in the evening, something easily solved by, e.g., implementing ‘wake on LAN’. We also found that less than 1 per mille of all disturbances were non-trivial in nature, which clearly indicates the strength of the architecture.
KEYWORDS
Cluster computing, evaluation, parallel, metamorphosic, performance, stability
1. INTRODUCTION
The number of personal computers in the world has far more than doubled over the last decade. A modern computer is millions of times faster than those used at the beginning of the last decade (Mueller, 2002). A substantial part of this gigantic number of computers is used for ordinary office work: e-mail, web browsing, word processing, finance, etc. This means that they are only used interactively, when there are people sitting in front of them. The ordinary workday for office personnel is eight hours, and because of this the workday for an office computer is also eight hours. Computers (unlike human beings) do not have anything against working up to twenty-four hours a day, seven days a week, three hundred sixty-five days a year, for several years in a row.
Ever since the first workstations were introduced, scientists have been working on solving the problem of low hardware utilization (Mulas, 2003)(Boklund, 2001a). The typical idle time for the average workstation, calculated over a large group of workstations, is around 80% (Overeinder, 1996). This idle time can be divided into two categories. The first consists of the processor cycles that are wasted during work hours when the user does not utilize the computer to its full potential, although he/she is hard at work (Korpela, 2001)(Lawton, 2000). The second consists of the processing power that is wasted during non-office hours (Boklund, 2003).
The purpose of the Midnight metamorphosic compute resource (MCR) is to harvest the processing power of the office computers during the non-office hours, i.e. the second category. In an ordinary 9-to-5 company this may be between 18:00 and 08:00, which leaves a 14-hour window open for computations. For more flexible workplaces or schools the times will be different.
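The length of such an overnight window is simple modular arithmetic on the clock. The following minimal Python sketch illustrates it; the function name `window_hours` is introduced here for illustration only and is not part of the Midnight software.

```python
def window_hours(start_hour: int, end_hour: int) -> int:
    """Length in whole hours of a window that may wrap past midnight.

    Hours are given on a 24-hour clock (0-23). The name and signature
    are hypothetical, chosen only to illustrate the arithmetic.
    """
    for hour in (start_hour, end_hour):
        if not 0 <= hour <= 23:
            raise ValueError("hours must be in the range 0-23")
    # The modulo handles the wrap-around from the evening into the morning.
    return (end_hour - start_hour) % 24


if __name__ == "__main__":
    # The 9-to-5 example from the text: free from 18:00 until 08:00.
    print(window_hours(18, 8))
```

For the 18:00 to 08:00 example the function returns 14, matching the 14-hour window mentioned above.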
In this paper we present the results from an empirical evaluation of the compute node behavior in a “real-world” MCR and how this affects the overall availability, stability and performance of the computational resource. To minimize measurement errors we based the evaluation on a simple but demanding application, which has proven to be stable and fault free. We also show that MCRs have the potential to become a part of the normal computational environment, both in companies and in academia.
2. TECHNICAL BACKGROUND – THE MIDNIGHT MCR
Midnight was created in early 2003 to serve as a harvester of processor cycles and as a test bed for studies on the behavior of MCRs (Boklund, 2003). A computer laboratory at the computer science department provides the hardware platform, as can be seen in Table 1. A computer lab is a more challenging environment than an ordinary office, mainly because standard office computers are not reinstalled by computer science students during labs, nor do office personnel do their best to explore the boundaries of computer security, stress test operating systems, or attempt any other possible or impossible task. Thus it provides a reasonable “worst case scenario” with regard to user behavior.
Table 1. Technical characteristics of the Midnight hardware
Node        Processor    Speed (MHz)  Chipset      Memory size (MB/MHz)       Hard drive (GB)  Network card   Network speed (Mbps)
Management  2 * XEON     2400         Intel E7505  256/DDR266 Dual-channel    2*36GB SCSI      Intel 82545EM  1000
Compute     Pentium III  600          Intel 810    256/100MHz                 30GB IDE         3com 3c920     100
Midnight is based on the Gentoo Linux distribution, version 1.4_rc2 (Gentoo, 2003), with the OpenMosix install options. The main advantage of Gentoo Linux is that it uses the Portage system (Portage, 2003) for installation and upgrades of software packages. Portage downloads the source code of the packages selected for installation and compiles it locally with all computer-specific optimizations. This procedure yields executables that are at least as fast as generic builds and, depending on the application, may give a speed and resource advantage.
The Midnight MCR is built around the Mosix Single System Image (SSI) model, although it uses the OpenMosix (OpenMosix, 2003) implementation. The advantage of the Mosix SSI approach compared to a traditional Beowulf architecture is transparent support for relocating running processes between compute nodes. This makes it possible to run processes that live longer than the online time of the compute nodes.
The single most important feature of the Midnight MCR is its customized boot loader, which makes it possible to reboot the compute nodes twice a day into two very different operating systems. At 7am the Linux system automatically reboots, and at 11pm a Windows scheduler reboots the computers again. Since no existing boot loader supported selecting a boot option based on the current time, we have modified the GRUB boot loader (Boklund, 2003)(GRUB, 2003)(Patch, 2003). The boot loader's default action is to load the Windows operating system; only when the current time is between 11:00pm and 11:59pm does it boot into Linux.
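The time-based boot rule can be summarized in a few lines. The sketch below is an illustration in Python of the decision rule only; it is not the GRUB patch itself, and the function name `select_boot_target` and the string labels are assumptions made for this example.

```python
from datetime import time


def select_boot_target(now: time) -> str:
    """Pick the operating system to boot for a given wall-clock time.

    Illustrative only: mirrors the rule described in the text, where
    Windows is the default and Linux is chosen between 23:00 and 23:59.
    """
    if now.hour == 23:
        return "linux"
    return "windows"


if __name__ == "__main__":
    # 11pm reboot triggered by the Windows scheduler: enters Linux.
    print(select_boot_target(time(23, 0)))
    # 7am reboot triggered from Linux: falls back to the Windows default.
    print(select_boot_target(time(7, 0)))
```

Under this rule a machine that is rebooted or powered on at any other time of day, for instance by a student during a lab, always comes up in Windows, so the daytime use of the lab is not affected.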
3. THE EVALUATION
We now present an empirical evaluation that was conducted on the Midnight MCR, although the results should be applicable to all MCRs that are used within the same or a similar context. The evaluation was performed during 31 consecutive nights, covering the 31 days of the month of May. During these nights the operation of Midnight was not in any way affected by the members of the research team. To simulate the most realistic situation, Midnight was (over)loaded with processor-intensive work. The application used to create the massive workload was the distributed.net client (dnet-client) (Lawton, 2000); for further information about the dnet-client see Section 4. The dnet-client was well suited to this job: it was run as 22 single-threaded processes, so that the dnet-clients made use of all 22 processors.
During the evaluation we measured the availability of the compute nodes, i.e. the number of nodes that were rebooted and participated in solving the task at hand. We also studied the reasons why some of the compute nodes were not rebooted by manually reviewing both the Linux and the Windows log files, etc. In addition to the availability measurements we also evaluated the stability of the Midnight MCR. Stability was measured in two categories:
• Did any of the compute nodes ever exit the Midnight MCR prematurely (crash)?
• Did any of the processes exit before finishing its calculation (crash)?
The third evaluation is an overview of how well the Midnight MCR uses its resources and how it compares to a dedicated compute cluster consisting of identical hardware.
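The availability measure described in this section, the share of compute nodes that rebooted and joined the computation on a given night, can be sketched as follows. This is a minimal Python illustration under our own assumptions: the function names are hypothetical, and the numbers in the usage example are placeholders, not measurements from the evaluation.

```python
from typing import Sequence


def availability(nodes_joined: int, nodes_total: int) -> float:
    """Fraction of compute nodes that took part in one night's run."""
    if nodes_total <= 0:
        raise ValueError("nodes_total must be positive")
    if not 0 <= nodes_joined <= nodes_total:
        raise ValueError("nodes_joined must be between 0 and nodes_total")
    return nodes_joined / nodes_total


def mean_availability(joined_per_night: Sequence[int], nodes_total: int) -> float:
    """Average nightly availability over the whole evaluation period."""
    if not joined_per_night:
        raise ValueError("at least one night is required")
    nightly = [availability(joined, nodes_total) for joined in joined_per_night]
    return sum(nightly) / len(nightly)


if __name__ == "__main__":
    # Placeholder counts for two nights on a hypothetical 20-node lab.
    print(mean_availability([20, 10], nodes_total=20))
```

Keeping the per-night fraction separate from the period average makes it easy to see whether a low average comes from a few bad nights or from a steady loss of nodes.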
4. TEST SOFTWARE
Since the purpose was to measure the compute node behavior of the Midnight MCR, we did not want to use complicated software with a lot of inter-process communication, I/O, etc.; we wanted to keep it simple and focus on the key issues. On the other hand, we wanted to use real-life software that has some bearing in the professional world instead of the trivial, “while (i