Technical Note DHPC-077

Java Tools and Technologies for Cluster Computing

K.A. Hawick, H.A. James, J.A. Mathew and P.D. Coddington
Distributed & High Performance Computing Group, Department of Computer Science, University of Adelaide, Australia SA 5005
Tel +61 8 8303 4519, Fax +61 8 8303 4366, Email [email protected]

30 November 1999

Abstract

The Java language and its associated libraries and environment provide a powerful and flexible platform for programming computer clusters. Java tools and technologies enable experimentation in both the management and the performance aspects of cluster systems. We discuss the current interesting problems in cluster computing, including those derived from distributed computing as well as the more performance-related ones derived from the discipline of parallel computing. We describe our experiments in building meta-computing management software in Java and message-passing high performance support software also written in pure Java. Our DISCWorld meta-computing system allows cluster applications to be embedded as services in a problem-solving environment. Our JUMP message-passing system allows users to develop conventional stand-alone Java applications that use the message-passing model of parallelism on cluster systems. JUMP applications are also compatible with DISCWorld and can be launched as component tasks in a dataflow-style task graph managed by DISCWorld. In addition we discuss the support for multiple paradigms for parallelism, such as regular domain decomposition; irregular or scattered spatial decomposition; task farming; and dynamic and adaptive mesh support, that we are building into JUMP. We believe these higher level facilities make parallel computing on clusters easier for users to exploit. We review some of the Java tools and technologies available now and becoming available and discuss the contributions they make to cluster computing. We also describe experiments with these and how they influenced our design decisions for JUMP.

Keywords: Java; cluster computing; Java tools; DISCWorld; meta-computing; parallel computing; distributed computing; JUMP.
1 Introduction
Cluster computing [7, 8] has emerged in recent years as one of the major growth areas in applied computer science. Clusters provide an economically attractive platform for applications requiring large numbers of compute cycles. In many application areas clusters are replacing the traditional high performance computing (HPC) resources. This is particularly so for research groups and small to medium enterprises for whom large scale HPC platforms are unaffordable. The traditional model of national HPC facilities is also being eroded by cluster computing as research groups find more value in a flexible cluster facility that they own and control themselves rather than having to apply for a time share on a central facility. This change in usage pattern, like so many things in computer science, is not a new phenomenon [21]. It is similar to the shift towards departmental minicomputers and away from conventional institution-wide mainframes that occurred in the late 1970s and early 1980s.

Cluster computing as a research discipline draws primarily from two research communities that existed prior to the recognition of cluster computing as a field in its own right. Many of the ideas for achieving high performance in cluster systems are drawn from the techniques and tools of parallel computing. Approaches for managing scheduling, reliability and interoperability on clusters often originated in the distributed computing community. Cluster systems therefore represent an exciting melting pot for experiments in parallel and distributed computing and are having a strong influence on the way research in these areas is done as well as in the applications arena.

The Java programming language and environment [18] is also having a great influence on computer usage patterns worldwide. Java software, with its greater propensity for portability and interoperability, is becoming the language and environment of choice for many applications. Although Java has been (rightly) criticised for its performance limitations [39], this situation is changing with the advent of research into more efficient Java Virtual Machines, use of multi-threading and other techniques to take advantage of modern computer architectures [35]. Java has proven useful as a "programming glue" for distributed and wide-area computing systems and it is now interesting to examine Java as a programming glue for managing and running applications on cluster computer systems.

Clusters and other high-performance parallel computers have traditionally been used mainly for scientific applications, which are predominantly coded in C or Fortran. However parallel computing is now commonly used for many other applications besides scientific number crunching. Many corporations are using large parallel compute servers for a variety of commercial applications, including data mining and the provision of Internet-based services. These are the kind of applications that are usually written in object-oriented languages such as C++ and (increasingly) Java. There is also a growing interest in using modern object-oriented languages, particularly Java, for implementing complex science and engineering applications. Java has the potential to be an excellent language for developing such applications, partly because of its portability, its plethora of useful class libraries, and its support for distributed and concurrent computing. Currently Java has some drawbacks, particularly its sluggish performance in numerical
computations; however, this is improving as the compiler technology matures. The Java Grande Forum [35] aims to drive improvements to the Java language, class libraries and compilers, so that Java can provide better support for implementing such large-scale (or "grande") applications on high-performance computers.

Java offers many advantages for developing large, complex scientific applications. Many such applications now require integration of multiple modules for simulating different parts of the problem, such as particles and fields, structures and fluids, regular grids and irregular meshes. Object-oriented program design allows easier development of programs using extensible and reusable components for each part of the problem. Class libraries can provide high-level interfaces that abstract over low-level implementation details, which is particularly useful for supporting parallelism. This kind of approach has been successfully used with C++, for example in the Parallel Object-Oriented Methods and Applications (POOMA) project [1], which is part of the Accelerated Strategic Computing Initiative (ASCI). POOMA provides C++ class libraries for supporting and integrating a number of scientific application areas. The libraries have mainly high-level data-parallel APIs, with implementations that use message passing.

We have experimented with a number of cluster computing systems, including: traditional Unix workstations running Solaris; ATM-connected Alpha workstations running Digital Unix [27]; Beowulf-style PC clusters running Linux with various network technologies [24]; and even iMacs running Linux [20]. In this paper we describe our experiments with using Java in distributed systems approaches to cluster management and in running parallel applications on clusters. We focus on Beowulf-style systems as a suitable target for management of high-performance cluster applications. Beowulf clusters can be said to occupy the middle ground between very loosely-coupled computing systems, such as wide-area cluster systems, and very tightly-coupled parallel computer systems.

Consider what has become a typical scenario in University departments and research groups around the world. The department probably already has a cluster of sorts consisting of a number of desktop workstations or PCs. These may well be idle for a large fraction of the day or week. The department may also have some specialised compute resources, and may in fact have a dedicated compute cluster such as a Beowulf-style system in its machine room. The model therefore is of some number of compute nodes upon which compute jobs can be run. Each node may not be available all of the time, as some may be desktops whose owners do not wish them to be loaded during working hours, for example. Some nodes may occasionally be re-booted or fail or become disconnected from the remainder. Managing these issues is non-trivial and we discuss them in section 3. The set of nodes is unlikely to be completely homogeneous. In many situations it is valuable to have some inhomogeneity in the composition of the nodes. For example, some nodes may be given more memory to handle larger jobs. We have built a number of dedicated compute clusters and, even with the intention of setting up a simple homogeneous system, it is almost impossible to specify a completely homogeneous cluster (in terms of hardware, operating system and revision).
It is therefore highly valuable to be able to exploit a software platform like Java that allows: code portability (at the bytecode level); the potential for migration of executing
code or objects; and a set of code libraries the size of which has never before been seen in the history of computer systems. There is also a very large, and still growing, base of Java developers worldwide. We discuss issues related to performance monitoring, scheduling and management of clusters in section 3. These are important issues and we have experimented with wide-area computing systems using Java for some years now, and have constructed what we call a meta-computing model. This architectural model is providing us with an experimental framework for managing both wide area and local area cluster computer systems. In our DISCWorld meta-computing system [30] we embed parallel applications as services like any other serial program. The DISCWorld architecture does not directly model or consider explicit parallelism in the usual sense, but instead deploys a dataflow-like approach to distributed problem solving environments. We have more recently addressed the parallel computing performance issues for cluster systems in a series of experiments to develop parallel computing libraries in Java. We discuss these in sections 4 and 5. A common approach to parallel computing has been through message passing. A great achievement of the 1980s parallel computing era was the emergence in the 1990s of a de facto standard interface for writing message passing programs in procedural languages such as Fortran or C. This Message Passing Interface (MPI) [19] standard has now grown and subsumed many of the important ideas in message-based parallel computing. A number of recent experimental research projects have built MPI related systems that support Java in some way. Most of these have attempted to provide Java bindings to MPI [3,10,11]. Our approach has been to embrace Java more fully and admit that the procedural style bindings are unnatural to Java programs. We have therefore experimented with a pure Java message passing infrastructure. This system, known as Java Universal Message Passing (JUMP) [26], is described in section 7. We review the important issues for message passing parallelism and discuss various options we have investigated in building our JUMP system. We believe our key contribution in JUMP is to recognise that parallel computing must progress beyond low level message passing and that higher level paradigms must be supported. Attempts have been made to capture the essential applications programming paradigms for some years. The Parallel Utilities Library (PUL) [13] project at Edinburgh was an attempt to implement some of these paradigms using the primitive message passing technology of that time. It is our present belief that this is very difficult to achieve without Object Oriented programming technology and class libraries, and in section 5 we discuss our present work in extending JUMP to support parallel programming paradigms such as regular and irregular domain decomposition; task farming and divide and conquer task management; and static and dynamic decompositions. Most importantly we believe these paradigms need to be made interoperable and this is our main goal for the JUMP system. Java offers a number of integrated features that we have experimented with in arriving at both our DISCWorld and JUMP architectures. 
Java provides interfaces for [45]: sockets; an object-oriented version of remote procedure calls [4] known as Java Remote Method Invocation (RMI); as well as facilities for code reflection and dynamic loading; multi-threading [42]; and data transmission through serialised objects. We discuss these and the advantages each offers in a cluster management framework in section 6. We also review some
recent developments in managing distributed systems with the Jini architecture [2] and in particular with the JavaSpaces [15] tuple-space based approach to parallelism. We describe some of our planned experiments with JUMP and Jini in section 8.

The operating systems now most widely used for cluster computing systems include Linux, Solaris, Digital Unix, and Windows/NT. Each of these offers various system facilities that are useful in managing clusters. We focus on Linux and describe some of the OS features that we are able to interact with through a Java management daemon running on participating nodes in a cluster. For the most part our system is not dependent on any one OS, that being one of the main reasons for using Java. However there are some features of a particular OS that are useful to incorporate into the system for measuring performance. Ideally this low-level functionality would be available in the future as a Java class library through a standard API available on any OS.

In the remainder of this paper we present: our attempt to separate out distributed computing issues from the parallel performance issues that are important for managing clusters; a review of higher level parallel programming ideas we are trying to support; a review of Java technologies that help implement such systems; a brief review of our DISCWorld meta-computing infrastructure for cluster management; our experiences building an early MPI-like system with Java (using sockets and Java RMI), which led us to a different implementation known as JUMP (using only socket communications); and a discussion of future directions.
2 Defining Cluster Computing
One of the confusing aspects of cluster computing is in defining precisely what is meant by a cluster. An IEEE Task Force for cluster computing [33] was formed recently to discuss and promote cluster computing issues more widely. The term "cluster computing" is currently used somewhat loosely and can denote a set of loosely coupled hosts that may be dedicated or may be user workstations. Questions of whether nodes are dedicated or not; which processor architecture they use; whether they are of homogeneous architecture type and operating system revision; whether they are tightly coupled in the sense of having a shared file system or not; and what interconnection technology is used are all fascinating and are discussed at length in [9]. Terms such as "constellation", "super-cluster", and "meta-cluster" are all finding use in the literature to denote particular configurations of cluster.

We will use the general term cluster, which we describe as a group of N_n distinct nodes, where each node may in fact consist of more than one processor using some local shared memory mechanism inside the node. We will assume that nodes are interconnected with some networking technology with the obvious property that node-node communication is significantly more expensive than intra-node communication. We may express N_n = N_0 + N_1(t), indicating a core number of dedicated nodes N_0 and a dynamic number N_1(t) of affiliated nodes that may reside on a distributed network. This model is in essence that of a Beowulf-style cluster that has a core set of dedicated nodes, as well as other nodes such as desktop machines that "join" the cluster
for certain periods of time. We use N_n to differentiate from the number of processors N_p in a cluster, which is often greater than the number of nodes if symmetric multi-processor (SMP) nodes are employed [24].

It is useful to consider the classes of user or jobs that will make use of a computing cluster. Bricker and co-workers describe three archetypal user types in their paper on the Condor scheduling system [5]. Of these, type 1 users are administrators or theoreticians who are amenable to having cycles stolen from them almost all the time; type 2 users are software developers who often do not need their machines' cycles during the night and at off-peak times, but who do need them during the day; and type 3 users are those who never have enough cycles. The Beowulf-style clusters we consider are most useful for satisfying the needs of type 3 users, and may in fact be constructed from a core set of nodes, as well as desktop workstations belonging to type 1 and type 2 users. This model works amicably provided nodes can join and depart the core cluster cleanly, without failure or other upset, and provided jobs do not run out of control. Our DISCWorld system is aimed at joining together these sorts of locally dynamic clusters to allow specialised problem solving, whereas our JUMP system is aimed at providing a Java-based glue for running parallel (Java) programs on individual clusters.
3 Distributed Computing and Clusters
Many of the interesting research problems in cluster computing are based on problems that arise in building more general distributed systems. In this section we discuss some of these problems and how we have tried to separate them out from parallel performance problems. Several research projects have attempted to build software systems for managing distributed computing systems [6]. A number of interesting problems arise in designing such a system. These include: naming; resource discovery; security; reliability; and the scalability issues underpinning all of these. Some of these problems are fundamental and can only be partially addressed in a practical system. In this section we discuss some of these and how we have approached them, or avoided them, in our Distributed Information Systems Control World (DISCWorld) meta-computing system [29, 30].

The naming problem is related to the resource discovery problem. Naming users, data, programs, and services is relatively straightforward on small or local systems where the resource is managed by a single authority. As multiple users start sharing resources and as the administration regime becomes larger, this problem rapidly becomes difficult to manage. Tools for managing users, disks, hosts and other system administration tasks have been developed and cope remarkably well with quite large sets of computer resources. The problem we face in managing a wide area distributed computing system is that different management authorities must interoperate and the problem scope grows very rapidly when users wish to exchange data and services and give them all unique names. The only way to proceed with such a problem is to divide the namespace up into manageable parts. In effect this is a hierarchical namespace approach whereby users or groups of users are given a designated prefix or some other way of identifying their own part of the global
namespace in which they can manage their own names.

The resource discovery problem underpins wide-area distributed computing and can be stated as the need for participating nodes in a distributed system to know about the capabilities (and existence) of other nodes in the system. This problem is difficult to solve in general, although partial solutions such as the management of the Domain Name Service (DNS) for the Internet can be usefully applied. Nodes find out about other nodes either by having a user or their systems administrator tell them (in the form of some configuration information such as a hosts file), or by discovering other nodes and services indirectly. This indirect mechanism can be implemented using a number of data distribution structures and mechanisms. In the early days of the Internet, hosts knew about other hosts from the very large host files that listed their names and Internet addresses and the routing tables that listed how to reach them via various networks. DNS supplied a more hierarchical structure to this operation so that hierarchically structured domain names can have delegated naming authorities and name services. This means less centralisation of name control and is generally much more scalable. The Internet and DNS work well for host names, which change relatively slowly, and for which there are not too many look-ups going on simultaneously.

Clearly it is unscalable in general to construct a single naming authority, although this approach may work for some limited applications, at least in the short term until the Internet user base grows beyond feasibility. For example, the provision of digital signature certificates largely relies on a central site or a small number of trusted certifying authority sites providing this service to users and user applications on the Internet. For meta-computing applications, where every program and every data product must have a unique canonical name, this model is not feasible. Similarly, it may be possible to have a single broker from which nodes may discover resources and their availability. It may also be possible to build simple hierarchies of brokers as in the CORBA trader model [43]. To an extent this approach may scale to small to medium sized systems but has significant problems as the number of managed resources and namespace sizes grow [28].

The reliability problem has many manifestations in distributed computing and can be stated as the difficulty of guaranteeing agreement among independent computers in a distributed system. The lack of guaranteed consensus means that it is not possible to rely on a distributed system to behave repeatably and predictably under a large class of failure scenarios. We have no more idea how to solve this general problem than anyone else has at the time of writing, but we can usefully consider the partial system failure scenarios that a distributed system might be expected to encounter on a regular basis and which it might be expected to recover from. It is very likely that nodes in a wide area system will indeed fail in a reasonable manner, crashing or becoming unavailable due to network failures. At a large job granularity this may simply translate into long waits for certain data products and services. We have implemented a futures mechanism for DISCWorld that allows task graphs to span distributed networks of service providers and provides some resilience against partial failure [25].
The reliability problem is perhaps a much harder one to deal with at the local cluster scale, where a whole parallel job may be lost if a single node crashes. Figure 1 shows a typical task graph that can be formulated from a user
query and annotated with DISCWorld node placement information. The figure shows a directed acyclic graph (DAG) of tasks, each one of which can be (recursively) transformed into a further DAG with execution of its tasks delegated to yet other nodes.

Figure 1: A typical task graph used in the DISCWorld meta-computing system. Tasks (shown as A, B, C, ...) are services or programs that can be run or placed to run on different nodes in the system, and edges represent dataflow. Each task or vertex is potentially decomposable into a further task graph. The task graph forms a directed acyclic graph.

DISCWorld is implemented using a software daemon that is pre-supposed to run on all participating nodes (or service providers). The daemon acts as a broker for services and coordinates jobs in the form of directed acyclic task graphs. These graphs can be used along with some novel distribution [28] and futures [25] mechanisms to place tasks in a task graph onto nodes which agree to carry them out. In this sense, DISCWorld is a software glue for managing data-flow composition of task graphs. This model is now finding specialised use in coordinating image processing tasks in other systems [41, 44]. We are currently collaborating with the Australian Defence Science and Technology Organisation (DSTO) to develop Java applications in this area.
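The task-graph structure just described can be made concrete with a very small data structure. The following is a present-day Java sketch of such a DAG node, assuming only that each task names a service, records the tasks it consumes data from, and may be expanded recursively; the class and method names are our own illustration and are not DISCWorld code.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of a dataflow task-graph node of the kind shown in figure 1.
// Each node names a service, lists the tasks whose outputs it consumes (the
// incoming dataflow edges), and may optionally expand into a nested task graph.
public class TaskNode {
    private final String serviceName;                        // e.g. "A", "B", "C" in figure 1
    private final List<TaskNode> inputs = new ArrayList<TaskNode>();
    private List<TaskNode> expansion = null;                 // optional recursive decomposition

    public TaskNode(String serviceName) { this.serviceName = serviceName; }

    public void addInput(TaskNode producer) { inputs.add(producer); }

    public void expandInto(List<TaskNode> subGraph) { this.expansion = subGraph; }

    // A task becomes runnable once all tasks it depends on have completed.
    public boolean isReady(Set<TaskNode> completed) { return completed.containsAll(inputs); }

    public String getServiceName() { return serviceName; }
}

A scheduler placing such tasks onto nodes would repeatedly pick ready tasks, run or delegate them, and add them to the completed set.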
Figure 2 shows the component architecture of the DISCWorld daemon. The portal acts as a gateway to incoming queries or user requests for data products. A "waiter" thread is spawned by the "maitre'd" for each request; the waiter decomposes the query into a task graph according to the services that daemon knows about (a minimal sketch of this thread-per-request pattern is given below). Service knowledge comes from pre-configured service information or from information the daemon has assembled through gossiping with other daemons it knows about (and nodes they know about).

Figure 2: DISCWorld daemon (DWd) objects and threads. The DWd runs on each host participating in a DISCWorld. Its Portal filters all communications (with clients and other DWds) for security; the Maitre'D spawns a separate Waiter thread for each request; the Chef spawns multiple Cook threads; the QuarterMaster controls policies; a console can attach via the Manager; and the Store contains data, products, codes and configuration information.

This gossip activity is carried out by a separate thread in the daemon at off-peak times and is controlled by the "quartermaster" component of the daemon [34]. If the data product already exists in the store local to the daemon it can simply return that result to the user in the form of a reference or an actual data transfer. However, if the data product does not exist, and the daemon has a known recipe for constructing it, then a "cook" thread is spawned by the "chef" to construct that product, saving it in the store for future use if so ordered by management policy.

At present we have experimented with a number of services in applications including image processing and geospatial data processing. For the most part these services are written in pure Java or are Java wrappers to native services on specialist nodes. DISCWorld does not have any parallelism support directly. We have recently designed and implemented a supplemental daemon for DISCWorld, known as JUMPd, which can interoperate with the primary DISCWorld daemon to coordinate parallel computing operations on suitable resources such as Beowulf-style clusters. The JUMP system and the JUMPd daemon are described in detail in section 7. One of the goals for JUMP was to separate out those difficult and hard-to-solve distributed computing problems from the performance and local management aspects of running applications on dedicated clusters. JUMP relies on DISCWorld to solve the distributed computing problems and JUMP can be viewed by DISCWorld as a service which will be invoked by "cook" threads just like a normal DISCWorld service. This is illustrated in figure 3.

Figure 3: The DISCWorld daemon as an interface to the JUMP daemon. A heavyweight DISCWorld daemon (DWd) running on host A handles the distributed computing aspects of task graph decomposition; via a daemon-daemon protocol its Chef thread hands work to a lightweight JUMP daemon (JUMPd) running on host B, which handles parallel computing message passing among JUMPd instances running on the cluster nodes of a parallel resource such as a Beowulf cluster.

This approach allows us to make progress on utilising cluster computing systems in a controlled and managed manner without trying to solve (and resolve) all the distributed computing problems.
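As promised above, here is a minimal sketch of the thread-per-request pattern used by the maitre'd and waiter components. The class name, port number and one-line protocol are purely illustrative assumptions; the real DISCWorld daemon is considerably more elaborate.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch only: a portal-style listener that spawns one "waiter" thread per request.
public class MaitreD {
    public static void main(String[] args) throws IOException {
        ServerSocket portal = new ServerSocket(4000);        // port chosen arbitrarily
        while (true) {
            final Socket request = portal.accept();          // one incoming client query
            new Thread(new Runnable() {
                public void run() { serve(request); }        // the "waiter" for this request
            }).start();
        }
    }

    private static void serve(Socket request) {
        // A real waiter would decompose the query into a task graph here.
        try {
            request.getOutputStream().write("ACK\n".getBytes());
            request.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}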
9
DISCWorld daemon DWd User Query
heavyweight DISCWorld daemon handles distributed computing aspects of task graph decomposition
running on Host A
[rest not shown]
Chef Thread
lightweight JUMP daemon handles parallel computing message passing
Daemon - Daemon protocol
JUMPd running on Host B
JUMPd running on
JUMPd running on
= cluster node 0
JUMPd running on cluster node 3
cluster node 1
cluster node 2
JUMPd running on
JUMPd running on
cluster node 4
cluster node 5
Parallel resource eg Beowulf Cluster
Figure 3: DISCWorld Daemon as an interface to the JUMP daemon resolve) all the distributed computing problems. We have not yet addressed the global naming and resource discovery problems. We make the assumption that these are at least separable problems and that given a hierarchical name space local DISCWorld users will have control over their own data product and service names and will name them as they please. Resource discovery is supported to an extent in that there is provision for DISCWorld nodes to gossip with one another at off-peak times, exchanging information about locally available data and services, and maintaining a bounded set of tables of this information. Tables can be bounded according to some management policy set by the resource owner and could be something simple like keeping only the N most recently used or discovered data/service items. We have addressed security issues in DISCWorld through the mechanism in the “portal” component to attach and verify all transactions against digital signatures or certificates. Whether this facility is used between trusted nodes can be controlled by some policy settings. We have investigated scheduling in DISCWorld at some length [34]. Dynamic scheduling on distributed and parallel systems is considerably easier if appropriate characteristics for each task in a task graph are known and if cur-
loads on each node are available. The Linux operating system provides a number of useful pseudo-files in its /proc filesystem. We have built Java interrogation utilities into our systems that can exploit this information to aid in scheduling choices. For example, /proc/loadavg provides information on the current and recent loads of a particular platform and this can be used to maintain a "league table" of loaded nodes (a minimal sketch of such a utility is given at the end of this section).

We believe the DISCWorld model is scalable in the sense that not all nodes need know about one another, and need not accumulate knowledge of the existence and capabilities of all other nodes. A sparse web of knowledge can be envisaged, where all nodes might be reachable by one another if the interconnection knowledge they each retain is above some percolation threshold. We envisage DISCWorld being used on a job granularity scale such that it is not critical if human beings have to intervene to configure particular nodes to know about each other manually. This is how humans operate with a network of contacts and partial lists of incomplete information. This approach is not, however, suitable for managing parallel computing jobs on a more tightly clustered system such as a Beowulf-style cluster. Our JUMP system is aimed at meeting this need, in a way that can interoperate with DISCWorld if required.
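The load interrogation utilities mentioned above can be illustrated with a short fragment that reads the Linux /proc/loadavg pseudo-file. This is an illustrative reimplementation rather than the actual DISCWorld code, and it assumes a Linux host.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch of a load interrogation utility: read /proc/loadavg and return the
// one-minute load average, which a scheduler could use to rank nodes.
public class LoadAverage {
    public static double oneMinuteLoad() throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("/proc/loadavg"))) {
            // /proc/loadavg looks like: "0.42 0.36 0.30 1/123 4567"
            String[] fields = in.readLine().trim().split("\\s+");
            return Double.parseDouble(fields[0]);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("1-minute load: " + oneMinuteLoad());
    }
}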
4 Parallel Programming on Clusters
One of the most flexible and powerful programming models for parallel systems has proved to be the message passing model. A number of attempts have been made to implement a suitable message passing interface for Java. That this is difficult should not be surprising, as there was considerable debate about how a message passing interface that was designed for procedural language bindings such as C and Fortran could be extended even to C++. Several laudable but not entirely satisfactory attempts were made to support C++, and it is now widely recognised that C++ is really only supported in the same way C is. There is no neat object-oriented interface that is natural to C++. Unfortunately a similar situation arises with Java. We do not have a solution to this problem and have decided that, as we wish to proceed in developing support for multi-paradigm parallelism, we cannot wait for the MPI forum to decide upon a solution. Consequently we are basing our current work on JUMP on a small subset of the MPI core functions, in the expectation that the exact syntax for these is not critical and can be adapted later.

Our first attempt at building message passing programs in pure Java involved construction of a very simple Java support infrastructure that used a combination of the Java Remote Method Invocation (RMI) facility and raw sockets. Our use of both RMI and sockets was based on the expectation that it would be easier to write elegant and correct code to establish communications between participating nodes using the higher level RMI mechanisms, but that raw sockets would be needed to achieve acceptable communications performance. As we have noted earlier, our aim in JUMP is to address parallel performance issues primarily and not to try to solve too many of the more general distributed computing problems.

Java RMI implementations are currently very slow compared to other communications mechanisms such as sockets and RPC, due to a number of problems
such as the overheads of interpreted bytecode, inefficiencies in object serialization, and lack of support for asynchronous communication [32, 36–38]. Several projects such as Manta [38], JavaParty [36] and HORB [32] have developed faster versions of RMI that are much more effective for high-performance distributed and parallel computing applications. However this performance improvement is mainly achieved by changing or extending the RMI protocols and interfaces, thereby creating what are essentially non-standard drop-in replacements for RMI. We have opted to use standard RMI and sockets rather than these faster but non-standard RMI implementations, in the expectation that the performance of standard RMI will improve in the future based on the optimizations pioneered by these projects.

Of the more than one hundred MPI functions, our initial implementation supported only the classic eight, namely:

• init() and finalize(), which initialise and terminate MPI communications in a user program;
• rank() and size(), which allow a program running under the SPMD model to determine which processor identifier it has and how many peer processors are configured to be running in the same SPMD session;
• send() and recv(), which carry out blocking communication sends and receives;
• isend() and irecv(), which carry out immediate (or non-blocking) sends and receives.

These eight can be readily extended by considering simple data reduction and broadcast collective communications operations. Once the infrastructure is in place for these operations it is relatively straightforward to implement the remaining MPI functions. A sketch of how these eight operations might be expressed as a Java interface is given below.

The architecture of our initial Java MPI system was based on a combination of Java RMI and sockets technology. On each host that is to participate in an MPI computation, it is necessary to run the Java rmiregistry and an MPI daemon program, MPI Daemon. This can either be done manually or it is possible to configure machines to start these programs automatically when the machine boots, in both UNIX and Windows NT environments. We also developed a system utility to allow this daemon to be started automatically on MacOS.

A user who wishes to launch an MPI job first executes the MPI Run Java program, providing as arguments the following information: the program (class file) to be executed, the number of nodes to use for the computation, and optionally a file listing the nodes to be used for the computation. If the user does not specify which nodes are to be used, nodes are selected according to a default allocation policy which can be configured by the cluster owner or systems administrator. The MPI Run program connects to an MPI daemon using RMI and makes a method call to launch an MPI job. This daemon does not necessarily have to reside on the same host as the one where the user launches the job. Hence a user who wishes to make use of our MPI system is not required to have an MPI daemon executing on the local machine, so jobs can be launched from outside the cluster.
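For illustration, the eight core operations listed above can be rendered as a small Java interface. The signatures below are our own sketch, not a published API (in particular, finalize() is renamed to avoid clashing with Object.finalize()), and message payloads are assumed to be Serializable.

import java.io.Serializable;

// Illustrative Java rendering of the eight core message-passing operations
// described above. Signatures are our own sketch, not a standard API.
public interface MessagePassing {
    void init(String[] args);                        // join the SPMD session
    void finalizeMP();                               // leave the session ("finalize" clashes with Object.finalize)
    int rank();                                      // this process's identifier, 0..size()-1
    int size();                                      // number of peer processes in the session
    void send(int dest, Serializable msg);           // blocking send
    Serializable recv(int source);                   // blocking receive
    Request isend(int dest, Serializable msg);       // immediate (non-blocking) send
    Request irecv(int source);                       // immediate (non-blocking) receive

    // Handle for an outstanding non-blocking operation.
    interface Request {
        Serializable waitFor();                      // block until complete; returns received object, or null for sends
        boolean test();                              // poll for completion
    }
}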
The user creates an MPI program by writing a class that implements a known interface called MPI Program. This interface has an execute() method, which contains the program code to be executed by the daemon. Once a daemon receives a job, it contacts daemons on other nodes as required, using RMI, to start tasks on those nodes. These requests are made based on the hosts that the user has requested or the default task allocation policy. The initiating daemon starts the task with a rank of 0. The architecture is effectively a "peer to peer" system with daemons acting as clients to each other in executing the tasks that comprise a complete MPI job. The daemon that launches a job creates a unique Job ID that is used to tag all tasks that comprise a particular job. The daemon is implemented as a multi-threaded program and each new task runs as a separate thread in the daemon.

In accordance with the MPI specification, programs that use the API must call the init() method before making use of MPI communications functionality. On calling the init() method, each task daemon makes an RMI method call to the initiating daemon to get a list of all tasks and corresponding hosts for a particular job. Once this information has been obtained, subsequent communication occurs through the use of sockets. Thus RMI is used as a convenient means of bootstrapping the communication process. MPI allows for data to be sent before the recipient is ready to receive it. This is implemented by launching a thread in the daemon program to listen for and accept incoming connections. These connections are stored internally until a matching receive has been posted.

As well as supporting send() and recv() methods with the arguments normally used in MPI, the methods have been overloaded to accept fewer arguments, since information about data types and array sizes is accessible at runtime in Java through the Reflection API. This is an example of where the syntax of MPI is unnatural in Java. In our system we also make provision to send and receive general objects rather than forcing the user to convert these to byte arrays.

Our implementation assumes that class files of the program being executed are present on the classpath of all daemons. This would be true where the hosts participating in an MPI job have a shared file system. One way to remove this limitation would be to make use of the Java RMI class loader to distribute the required class files. A problem with this approach is that the RMI class loader obtains class files from a HTTP (or web) server and hence a simple HTTP server would need to be incorporated into the client. An alternative approach is to develop a custom classloader using sockets, similar to that used by our Java Program Code Server [40]. This project built a database of Java byte code that could be searched by various meta-data or criteria to compose task graphs of component tasks, each implemented as an independent Java class. It was convenient to store the Java classes in a database structure as large binary objects and assemble them together using our own classloader. This classloading mechanism is our chosen approach for our JUMP system.

Our prototype implementation of the MPI system included a static method in the MPI class, output(), that enables any task to write output to the standard output stream of the client program. It is possible to make use of this functionality to redirect the standard input and output of all tasks to the client program that initiated the job.
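As noted above, the overloaded send() and recv() variants can omit explicit type and length arguments because that information is recoverable at runtime through reflection. The fragment below illustrates the idea in isolation; the class and method names are hypothetical, not the actual system's API.

import java.lang.reflect.Array;

// Sketch of how an overloaded send() can recover the type and length
// information that C/Fortran MPI bindings require as explicit arguments.
public class MessageInfo {
    public static void describe(Object payload) {
        Class<?> c = payload.getClass();
        if (c.isArray()) {
            System.out.println("array of " + c.getComponentType().getName()
                               + ", length " + Array.getLength(payload));
        } else {
            System.out.println("single object of type " + c.getName());
        }
    }

    public static void main(String[] args) {
        describe(new double[128]);      // "array of double, length 128"
        describe(Integer.valueOf(42));  // "single object of type java.lang.Integer"
    }
}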
We explore this output-redirection idea further in section 5.

Java has a number of features that enable an MPI system to be implemented. It has built-in and easy-to-use support for multi-threading, as well as good support for networking functionality. For example, having created a class to support blocking variants of send() and recv(), non-blocking variants are simply implemented by executing the corresponding blocking operation in a separate thread (a minimal sketch of this appears at the end of this section). We discuss Java features further in section 6.

Some problems arose from our initial implementation which caused us to reconsider how best to exploit features of the Java environment to better meet user needs. Some questions and goals which we identified include:

• How can we achieve user transparency as far as possible and, in particular, how can we avoid the need for code distribution and/or common file systems?
• Where should configuration information be supplied – in the traditional hostfile or table, for example?
• How can the integrity of the Single Program Multiple Data (SPMD) parallel programming model best be preserved, and where do command line arguments propagate through to the user method?
• Should communications go through the daemon or be initiated directly from the user method?
• How can thread and socket objects be reused as much as possible?
• If a user program is itself multi-threaded, does it use the same set of sockets and the same JUMP framework?
• Can the object-oriented structure hide much of the implementation detail so that a user class can simply extend the JUMP base class?
• Can a configuration and monitoring tool be integrated with JUMP?

These questions led us to design a better message passing support system, as described in section 7. A key driving issue for us was the desire to support not just simple message passing, but multiple data decompositions or parallel programming paradigms within a single user program.
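The sketch promised above shows how a non-blocking receive can be derived from a blocking one by running the blocking call in its own thread. The BlockingComms interface and the class names are hypothetical stand-ins for the real messaging layer, not the system's actual API.

// Sketch of deriving a non-blocking receive from a blocking one by running
// the blocking call in its own thread.
public class AsyncReceive {
    public interface BlockingComms {
        Object recv(int source);                     // blocks until a message arrives
    }

    public static final class Request {
        private Object result;
        private boolean done = false;

        synchronized void complete(Object r) { result = r; done = true; notifyAll(); }

        public synchronized Object waitFor() throws InterruptedException {
            while (!done) wait();
            return result;
        }
    }

    // Immediate (non-blocking) receive: start the blocking recv in a thread
    // and hand back a Request the caller can wait on later.
    public static Request irecv(final BlockingComms comms, final int source) {
        final Request req = new Request();
        new Thread(new Runnable() {
            public void run() { req.complete(comms.recv(source)); }
        }).start();
        return req;
    }
}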
5 Support for Multiple Paradigms for Parallelism

Experience in writing parallel programs for large applications led us to the desire to construct a layered library of parallel programming paradigms or data decomposition models. Work by Clarke and co-workers at the Edinburgh Parallel Computing Centre in the late 1980s and early 1990s led to a prototype set of libraries known as PUL (Parallel Utility Libraries) [13]. PUL was designed to run on top of the CHIMP message passing library [12], which was a primitive (although efficient) precursor to modern MPI libraries. The approach was to consider the skeletal framework often used by parallel programmers to achieve a particular data decomposition. Much of this work was driven by experience in porting numerical simulation programs [31] to the parallel computing systems of the time, and PUL was an attempt to minimise the software engineering effort involved. PUL and CHIMP were based on C and Fortran bindings to library code written itself in C. One major limitation of
the library structure was in constructing application codes where more than one parallel decomposition is needed. One of the reasons for using CHIMP at the time was that it allowed messages to be tagged or grouped properly. This is important when multiple libraries are used together in one program and message spaces should not interfere with one another. MPI communicators now provide support for this. We are able to encapsulate much of the workings of this inside overloaded JUMP method calls.

A famous example is that of a weather prediction code, such as that run by the UK Meteorological Office to provide prediction services for many international customers [46]. In this code, the weather data fields of temperature, pressure and wind velocity are modelled as regular meshes of data mapped onto the surface of the Earth. The fluid dynamics aspects of such a code can be parallelised for the atmosphere using a simple regular domain decomposition, with processors computing updates on their own blocks of data based on the values at surrounding points. Unfortunately this simplistic decomposition is not appropriate for the ocean calculations, where the ocean depth is not constant and therefore a neat regular domain mapping is not ideal. In fact a scattered spatial decomposition, where parcels of ocean data points are heuristically allocated to processors, is often found to be more efficient. This application also requires Fourier transforms to be carried out at certain latitudes, and this requires yet another data decomposition. Finally, real measured data from weather ships, satellites, ground observation stations and other sources is interpolated onto the atmospheric and ocean meshes, and since this data is measured at arbitrary spatial locations the load balancing of this operation requires yet another data and task decomposition strategy [23].

It is very appealing to consider a single application that could exploit parallel resources and link together these data and task decompositions all at once. We believe a multi-threaded Java program using a framework like our JUMP system will be able to do this efficiently. Our present implementation of JUMP is described in section 7. It can carry out the simple message passing support operations that parallel programmers have come to expect from systems like PVM and MPI. It is structured, however, to allow us to run several threads within the user program so that these threads share data and can interact with the same instance of the JUMP messaging system through different support library components for different data decompositions. This strategy is important to support multiple-paradigm cases such as the weather application we describe above and others such as: fundamental physics and chemistry applications where a mix of simple data decomposition and linear algebra operations is required; petrochemical simulation applications where a mix of regular and irregular data decomposition strategies is optimal; and computer aided and structural design simulations where adaptive meshes are used and the parallel data decomposition is irregular and changes during the course of the simulation.

PUL provided experimental library code for: regular domain decomposition in multiple dimensions; irregular domain decomposition using scattered spatial approaches; tree decompositions; and adaptive mesh decompositions. A number of general purpose but higher level operations emerged from the work at Edinburgh.
These included frameworks for managing task farming programs; support of simple parallel I/O; and debug information support.
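To make the task-farming paradigm just mentioned concrete, the following is a present-day Java sketch using the standard java.util.concurrent utilities: a farmer submits work items and gathers results as workers complete them. It is an illustration of the paradigm only, not PUL or JUMP library code.

import java.util.concurrent.*;

// Sketch of the task-farming paradigm: a farmer queues work items and a
// fixed pool of workers drains the queue, with results gathered as they finish.
public class TaskFarm {
    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        CompletionService<Integer> farm = new ExecutorCompletionService<>(workers);

        int tasks = 20;
        for (int i = 0; i < tasks; i++) {
            final int work = i;
            farm.submit(new Callable<Integer>() {
                public Integer call() { return work * work; }   // stand-in for real work
            });
        }

        int total = 0;
        for (int i = 0; i < tasks; i++)
            total += farm.take().get();      // gather results as they complete

        workers.shutdown();
        System.out.println("sum of squares 0..19 = " + total);  // 2470
    }
}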
Some of these ideas are finding their way into the new MPI-2 interface specification and some are relatively straightforward to support. Parallel I/O is an important area where no ideal solution has yet emerged for a standardised approach to loading parallel decomposed data from multiple disks or other sources. We believe JUMP will supply us with a valuable platform for further experimentation in this area. We also expect to be able to support all the primitive operations pioneered by PUL in JUMP, but in an interoperable fashion so that a single program can exploit them. These higher level capabilities or multiple paradigms as we call them are much easier to support using object-oriented class libraries with an inheritable structure like that of Java. We are presently developing a unified library of these components. Adaptive mesh decomposition is still an active area of research in parallel computing. Algorithms including recursive bisection are used to decompose meshes and allocate mesh computation responsibilities to processors. We are experimenting with adaptive mesh decomposition using techniques including object migration. We are also investigating how object migration support can best be incorporated into JUMP. Java provides the necessary tools for this, but some care is needed in using the Java reflection and classloader mechanisms to allow object check-pointing and relocation. This is a particularly attractive facility to support long lived programs such as long running parallel applications.
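As a concrete illustration of the regular domain decomposition paradigm discussed in this section, the following fragment computes the contiguous block of a one-dimensional domain owned by each process rank. It is our own illustration, not PUL or JUMP library code.

// Sketch of a regular one-dimensional block decomposition: given a global
// domain of n points and p processes, each rank owns a contiguous block,
// with any remainder points assigned to the lowest ranks.
public class BlockDecomposition {
    // First global index owned by 'rank'.
    public static int lower(int n, int p, int rank) {
        int base = n / p, extra = n % p;
        return rank * base + Math.min(rank, extra);
    }

    // Number of points owned by 'rank'.
    public static int count(int n, int p, int rank) {
        return n / p + (rank < n % p ? 1 : 0);
    }

    public static void main(String[] args) {
        int n = 10, p = 4;
        for (int r = 0; r < p; r++)
            System.out.println("rank " + r + ": [" + lower(n, p, r) + ", "
                               + (lower(n, p, r) + count(n, p, r) - 1) + "]");
        // rank 0: [0, 2]  rank 1: [3, 5]  rank 2: [6, 7]  rank 3: [8, 9]
    }
}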
6 Review of Java Tools and Technologies
One of the powerful features of the Java language and its development kit is the collection of libraries for various mechanisms that are integrated into the system. Of particular use to us in the construction of messaging and cluster management software are: the threads support; the sockets library; Remote Method Invocation (RMI); and object serialisation. We know from previous work that certain operations, such as thread creation, object creation, socket creation and in particular RMI calls, are very expensive in time. Object creation overheads vary significantly between platforms. We measured between 0.5µs and 20µs on various platforms [39]. This is relatively small compared with typical costs for thread swapping, which are as much as forty times higher. This is important for our JUMP design, in which we have tried to minimise the number of system threads active for message buffering purposes.

In establishing our message passing prototype, which used a mix of RMI and raw sockets, we measured the practical latency overheads and bandwidths that arise in cluster systems of interest to us. Table 1 shows a breakdown of the stages in a simple send/receive between two JVMs running on various platforms. We have been primarily interested in Beowulf-style clusters and we made our measurements on Pentium-II Beowulf cluster nodes, where each node had dual PII processors running at 450MHz and 250MBytes of memory, and nodes were interconnected using a 100Mbit/s Intel 510T switch. We also examined combinations of Sun Enterprise E250 dual-processor and E450 quad-processor nodes, and DEC Alpha workstations. The PII systems run Linux, the Sun systems run Solaris and the Alphas run Digital Unix. All these experiments used Java Development Kit (JDK) 1.2. Variances in the data are based on the spreads across approximately ten separate measurements in each case.

As can be seen from Table 1, send and receive latencies for socket communications are of the order of 1 ms. This is tolerable, and is within roughly a factor of two of the communications overheads we find using native code. We have found that switch latencies are of the order of 100µs on our Beowulf systems and that parallel program performance is inevitably dominated by the overheads of messages getting through the kernel or, in this experiment, out of the Java Virtual Machine. The bandwidths we can achieve are again somewhat less than we find using native code but are not too disappointing given our previous experiences with RMI. The table shows the various platforms with either 100Mbit/s Ethernet, 10Mbit/s Ethernet or an internal memory bus between two processors in a dual or quad system. The fractions of theoretically achievable bandwidth are not too surprising.

Table 2 shows the corresponding breakdown of RMI calls on the same systems. It is not entirely clear what the underlying RMI communications protocol is, but we have noted that a significantly larger number of messages are involved in RMI communications. We believe Sun have chosen to do this for flexibility, to allow the various services such as the Jini family of services to be constructed on top of RMI. This does, however, rule out RMI as a high-performance communications package. We therefore use raw sockets entirely in our JUMP system. Latencies for RMI communications are approximately twenty to forty times higher than those for sockets. It is therefore clear that RMI is not suitable for high performance computer communications.
Interconnect   Hosts         Socket creation (ms)   Send latency (ms)   Send bandwidth (Mbit/s)   Receive latency (ms)   Receive bandwidth (Mbit/s)
Mem bus        PII-self      12.5±1.9                1.8±0.4             189±64.7                   1±0.                    240.8±6.2
Mem bus        E250-self     35.3±15.0               5.6±1.4             205.5±43.9                 1.3±0.5                 222.5±48.8
100Mbps        PII-PII       10.1±0.3                1±0.                72.8±15.9                  1±0.                    86±0.
100Mbps        E250-E450     25.6±0.7                3.2±0.5             26.5±0.9                   1.6±0.5                 32.2±3.3
100Mbps        E450-E450     25.6±0.7                3.2±0.5             26.6±0.9                   1.3±0.5                 38.2±0.8
10Mbps         Alpha-Alpha   84.5±15.9               8.3±5.3             6±0.                       4±1.3                   5±0.

Table 1: Socket measurements.
Interconnect   Hostnames     RMI name lookup (ms)   RMI latency (ms)   RMI bandwidth (Mbit/s)
Mem bus        PII-self      659.8±43.3             18.5±13.9          71.7±1.9
Mem bus        E250-E250     580.8±62.9             30.3±18.8          54.9±10.5
100Mbps        PIIa-PIIa     714.6±63.6             24.1±21.0          38.9±11.3
100Mbps        PII-PII       704.5±119.4            19.5±14.5          65.4±2.0
100Mbps        E250-E450     508.1±24.9             17.5±3.5           33.7±1.7
100Mbps        E450-E450     519.1±33.8             17.6±0.8           34.1±1.5
10Mbps         Alpha-Alpha   1473.2±124.9           64.4±23.3          5.8±0.2

Table 2: RMI measurements.

We have analysed Java RMI performance in greater detail elsewhere [37]. The RMI and CORBA technologies are more important for meta-computing systems, where, as we have observed, latency overheads are less important as they are small compared to the completion times of the target tasks being farmed out as a DISCWorld task graph. We carried out some preliminary experiments with a ping-pong like communications example using Java Jini and JavaSpaces. Communications costs were similar to those for RMI, from which we conclude that this technology will be useful if (and only if) Java RMI performance can be improved [32, 36–38]. We also note that present implementations of JavaSpaces are rather memory hungry and this too will be a significant limitation to its use at present.
7 The JUMP Architecture
Our JUMP system is implemented in pure Java code and uses sockets for its communications. We do not use RMI at all in JUMP in view of its high overheads and our experience with our early prototype. In this section we describe our implementation through an example and a sequence of diagrams showing the events during initialisation and running of a JUMP program. JUMP builds on our past experiences in building multi-threaded daemons
in Java for software management. We also utilise our experience in classloading, gained in constructing code servers: databases of dynamically invocable Java byte code [40].

JUMP programs can be run in a similar fashion to conventional parallel programs written in MPI or PVM. We are still experimenting with the configuration of such programs [22] and are trying to determine where configuration information should best be located. At present, we utilise a flat hostfile of participating node names in a fashion similar to PVM. An alternative way of running JUMP programs is for a DISCWorld daemon to launch the program directly. In this case configuration information will be associated with the Java byte code as meta-data in the DISCWorld "store". JUMP daemons are relatively lightweight compared to our DISCWorld daemon. JUMPd is intended only to bootstrap the communications framework and most of the message passing methods are in fact contained in the JUMP base class which user classes extend.

Figure 4 gives an overview of the JUMP architecture. This shows a user class extending the JUMP base class, which initiates communications through its local JUMP daemon. A network of socket connections is established between participating daemons on an as-needed basis. These can be reused to minimise socket instantiation overheads. Once a socket is established for that program between two nodes, no other will be created in our present implementation.

The architecture of the low-level part of JUMP is based loosely around that of PVM [17], although our interface bindings draw from those of MPI. The Parallel Virtual Machine (PVM) software was designed to support concurrent computing on networks of heterogeneous computers, and historically has had more of a focus on distributed computing issues than MPI. Ferrari and co-workers have developed a Java version of PVM known as JPVM [14], although as far as we are aware this does not support the multi-paradigm parallelism features we discuss in section 5, nor does it address the meta-computing issues we discuss in section 3.

A JUMPd daemon is executed on each node to be used in the distributed computation, and the daemon brokers all communications to and from the user program (classes). The user program extends a JUMP base class, which, like the more typical MPI or PVM libraries, provides the primitives that make communication possible. This is shown in figure 4. It is convenient to encapsulate communications and interrogation primitives in the JUMP base class.

Figure 5 shows code for the trivial ping-pong example. The class extends the JUMP base class, and thus inherits all the facilities it needs to instantiate a message passing framework instance. A reference to this framework can be obtained using an interrogation method, so that multi-threaded codes can share access to a single communications framework. Our intention has been to make the use of JUMP as natural to Java coding style as possible. The PingPong class provides a constructor method that passes arguments through to the base class constructor. The method parseJumpArgs() is provided to strip out any JUMP arguments (those prefixed with "-jump") to facilitate passing these from the calling main() method as supplied by the user. The JUMP base class is declared as abstract, meaning it must be extended in order to be instantiable. The user must supply, at the very least, one method with the signature public void jumpRun(String []args). This
19
[Figure 4 diagram: a user class on host A extends the JUMP base class, whose utilities communicate with the local lightweight JUMPd daemon to arrange sockets for communications; peer JUMPd daemons run on cluster nodes 0-5 of the parallel resource (e.g. a Beowulf cluster) and coordinate communications amongst themselves.]
Figure 4: JUMP daemon as a message coordinator for parallel programs.

This method is invoked after the JUMP environment has been initialized for the user. The user does not have to be concerned with initialization of the JUMP environment; this is done automatically by the JUMP base class. The steps involved in the initialization of the system are shown in figure 6. Figure 6 i) shows the state of the system when the user executes the main() method of their program. An initialization routine in the JUMP base class contacts the local JUMP daemon (figure 6 ii), which runs in its own JVM and therefore may not share the same classpath as the user program. As shown in figure 6 iii), the local JUMP daemon sends initialization requests to an appropriate number of remote machines (the addresses of which are selected either by the user or by the daemon). When each remote JUMP daemon has successfully instantiated a copy of the user program, the daemons send details of the port on which the program is listening back to the master program, which then distributes a complete host table (figure 6 iv). Java encapsulation and overloading allow us to provide default communicator tags that partition the message space appropriately, so that library layers do not collide with each other's messages.
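The following sketch (our own illustrative names and constants, not the actual JUMP API) shows how method overloading and a reserved tag range might keep library-internal traffic out of the user's portion of the message space. A higher-level decomposition library built on top of JUMP would call the explicit-tag form, while application code keeps the simpler signatures used in figure 5.

    // Illustrative sketch only: overloading with default communicator tags.
    public class TaggedComms {

        // Tags below USER_TAG_BASE are reserved for internal library traffic.
        public static final int LIBRARY_TAG   = 0;
        public static final int USER_TAG_BASE = 1024;

        // User-level send: the default tag keeps ordinary messages in the
        // user partition of the tag space.
        public void send(int destRank, Object message) {
            send(destRank, message, USER_TAG_BASE);
        }

        // Library layers call the explicit form with a reserved tag, so their
        // bookkeeping messages can never match a plain user-level recv().
        public void send(int destRank, Object message, int tag) {
            // serialize the message, prepend the tag, hand it to the daemon
        }

        public Object recv(int sourceRank) {
            return recv(sourceRank, USER_TAG_BASE);
        }

        public Object recv(int sourceRank, int tag) {
            return null; // block until a message with a matching tag arrives
        }
    }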
import adelaide.dhpc.Jump.*;
import java.util.*;

public class PingPong extends Jump {

    Date sendTime = null;

    public PingPong(String args[]) { super(args); }
    public PingPong() {}

    public void jumpRun(String args[]) {
        System.out.println("PingPong::jumpRun() started");
        System.err.println("My rank is " + getRank());
        if (getRank() == 0) {
            // Rank 0 sends the ping and times the round trip.
            sendTime = new Date();
            send(1, new Integer(1));
            Integer a = (Integer) recv(1);
            System.err.println("Send/Receive of Integer took: "
                + ((new Date()).getTime() - sendTime.getTime()) + "ms");
        } else if (getRank() == 1) {
            // Rank 1 echoes the ping back as the pong.
            Integer a = (Integer) recv(0);
            System.err.println("Received ping: " + a);
            send(0, a);
            System.err.println("Sent pong.");
        }
    }

    public static void main(String[] args) {
        PingPong pp = new PingPong(args);
        pp.jumpRun(pp.parseJumpArgs(args));
    }
}
Figure 5: Classic ping-pong example illustrating extension of the JUMP base class.

We are still debating the best syntactic form for the user to supply such communicator tags explicitly when needed.

In order for JUMP to function on a truly federated cluster of compute nodes, we cannot assume the presence of a shared file system. For this reason we use a custom classloader [40, 45] to transfer the necessary code to the remote machine. The mechanism for remote classloading is shown in figure 7. When the user program is instantiated (through its main() method), a copy of the program's byte code is serialized (figure 7 i). The serialized code is sent, together with the run-time arguments, in an initialisation request to the local JUMP daemon (figure 7 ii). The message is then distributed to each of the remote JUMP daemons that have been chosen to participate in the computation (figure 7 iii). If the class representing the user's program references any other (non-core-Java) classes not in the remote daemon's classpath, then a ClassNotFoundException will be raised when the daemon tries to instantiate a new instance of the user program. Our classloader traps this exception (figure 7 iv) and sends a request for the class byte code to the master user program (figure 7 v). The master user program replies with a serialized copy of any requested class (figure 7 vi). Steps v) and vi) are repeated as many times as necessary to resolve all classes on which the user program depends. Once all classes are resolved, the slave copy of the user program is instantiated and the port on which the slave is listening is returned through the local JUMP daemon to the master program.
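A minimal sketch of such a classloader is shown below; the class name and the fetchByteCodeFromMaster() helper are hypothetical stand-ins for the socket request/reply exchange with the master user program described above.

    import java.util.Hashtable;

    // Illustrative sketch of a remote classloader (not the actual JUMP code).
    public class RemoteClassLoader extends ClassLoader {

        // Cache of byte code already fetched from the master (figure 7 vi).
        private final Hashtable cache = new Hashtable();

        // Called by the JVM when a referenced class cannot be found locally,
        // i.e. where a ClassNotFoundException would otherwise be raised.
        protected Class findClass(String name) throws ClassNotFoundException {
            byte[] code = (byte[]) cache.get(name);
            if (code == null) {
                code = fetchByteCodeFromMaster(name);   // steps v) and vi)
                cache.put(name, code);
            }
            return defineClass(name, code, 0, code.length);
        }

        // Hypothetical helper: send a request for the named class to the
        // master user program and block until the byte code is returned.
        private byte[] fetchByteCodeFromMaster(String name)
                throws ClassNotFoundException {
            throw new ClassNotFoundException(name); // placeholder in this sketch
        }
    }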
[Figure 6 diagram: four panels i)-iv) showing the master user code and local daemon in JVMs on host 1, sockets opened into the distributed system, remote daemons and slave user code in JVMs on hosts 2 and 3, partial host tables returned to the master and the updated host table distributed.]
Figure 6: Initialization of the JUMP environment. i) The user executes their program. ii) A socket is created that contacts the local daemon. iii) The local JUMP daemon selects the appropriate number of remote hosts and sends their JUMP daemons an initialization request; they respond with the host IP/port on which the slave program listens. iv) When all the remote daemons have responded favourably to the initialization request, the complete host table is distributed from the master JUMP daemon.
[Figure 7 diagram: panels i)-vii) showing the master user code opening a socket and sending its own class byte code, the number of remote copies and the arguments to the local daemon; the local daemon forwarding the byte code to the remote daemons; a remote daemon attempting to instantiate the user class; the custom ClassLoader trapping the ClassNotFoundException and requesting byte code from the master instance; the master sending back the byte code for the requested classes; and the slave instance of the user code being created once all necessary classes are present.]
Figure 7: JUMP distributed classloading mechanism. i) The user program loads a serialized copy of its class byte code in preparation for distribution. ii) The serialized byte code is sent to the local JUMP daemon. iii) The byte code is sent with an initialisation request and run-time parameters to the remote JUMP daemons. iv) Each remote JUMP daemon tries to instantiate an instance of the user program on its machine. v) If the user program uses other non-core-Java classes, a request is sent to the master instance of the user program. vi) The requested classes are sent to the remote daemon, where they are cached. Steps v) and vi) are repeated as required. vii) The slave instance is instantiated and its jumpRun() method is prepared with the run-time arguments.
To allow for the possibility of multiple independent user programs running on the same physical machine, messages between user classes and the daemons, and between the daemons themselves, are tagged with the source and destination host IP and the port on which the program is listening. Except in the case of the JVM crashing or the machine running out of memory, user programs are thus protected from interference from other programs running on the same machine.
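Purely as an illustration (the class and field names are our assumptions, not the JUMP wire format), an envelope carrying this addressing information might look as follows.

    import java.io.Serializable;
    import java.net.InetAddress;

    // Illustrative sketch of a message envelope. Source and destination are
    // identified by host IP and the port on which the user program listens,
    // so several independent JUMP programs can share one physical machine.
    public class JumpMessage implements Serializable {

        public InetAddress sourceHost;
        public int sourcePort;       // port the sending program listens on
        public InetAddress destHost;
        public int destPort;         // port the receiving program listens on
        public int tag;              // communicator tag (see section 5)
        public Serializable payload; // the user's serialized message object

        public JumpMessage(InetAddress sourceHost, int sourcePort,
                           InetAddress destHost, int destPort,
                           int tag, Serializable payload) {
            this.sourceHost = sourceHost;
            this.sourcePort = sourcePort;
            this.destHost   = destHost;
            this.destPort   = destPort;
            this.tag        = tag;
            this.payload    = payload;
        }
    }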
[Figure 8 diagram: the JUMP user class (extending the JUMP base class) and JUMPd daemon on the launching host, and a JUMPd daemon with its node table on a slave host or node, each running in a separate JVM; dotted outlines denote separate hosts, dashed outlines separate Java VMs, and solid outlines one or more instantiated running objects, with communications flowing between the daemons and their node tables.]
Figure 8: Relationships between a user class extending the JUMP base class and the JUMPd daemon, each running inside a separate Java Virtual Machine. All communications are brokered through the JUMPd.

As shown in figure 8, brokering all communications through the JUMPd provides a single point of entry to and from a machine for JUMP messages. It also allows for the caching of serialized byte code representing distributed objects. Since re-creation of frequently-used sockets is expensive in terms of processing cycles, we are experimenting with the caching and re-use of socket connections. Sockets are allocated using a factory-pattern-style object [16].
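The sketch below, under hypothetical names, shows the general shape of such a caching factory; it is not the actual JUMP implementation. Connections are keyed by host and port and handed back to callers rather than being re-created for every message.

    import java.io.IOException;
    import java.net.Socket;
    import java.util.Hashtable;

    // Illustrative sketch of a caching socket factory.
    public class SocketFactory {

        private final Hashtable openSockets = new Hashtable();

        // Return a cached socket for this destination if one is already open,
        // otherwise create one and cache it for re-use.
        public synchronized Socket getSocket(String host, int port)
                throws IOException {
            String key = host + ":" + port;
            Socket s = (Socket) openSockets.get(key);
            if (s == null) {
                s = new Socket(host, port);
                openSockets.put(key, s);
            }
            return s;
        }

        // Close and forget all cached connections, e.g. at program shutdown.
        public synchronized void closeAll() throws IOException {
            java.util.Enumeration e = openSockets.elements();
            while (e.hasMoreElements()) {
                ((Socket) e.nextElement()).close();
            }
            openSockets.clear();
        }
    }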
We have measured a number of simple applications on JUMP (including simple domain decomposition, task farming and scattered spatial decomposition) and are encouraged by preliminary results. Our use of sockets rather than the full RMI protocol allows us to achieve good performance, as recorded in section 6, although as observed there it is not quite as high as the bandwidth figures possible between C programs running on the test platforms we used. It is to be hoped, and indeed expected, that further effort from Sun and other Java Virtual Machine vendors, together with better compilers and operating system support, will lead to JVMs achieving a high fraction of the theoretical bandwidths. We believe the convenience and power of our JUMP system already outweighs this modest performance hit, and that JUMP will in the near future be able to compete favourably with native-coded systems. The wall-clock time of a multi-paradigm application is still likely to be improved by our system, which can exploit the additional parallelism available in complex application codes.

JUMP uses an approach to message communicator tags and grouping mechanisms similar to that specified by MPI, which we believe was pioneered in the Edinburgh CHIMP system [12]. As we discuss in section 5, this is important to allow library layers of more sophisticated communications to operate with internal messaging that does not impact on the explicit messages sent by user program code.

We are experimenting with the tuple space approach offered by JavaSpaces. The tuple space model, in which messages are tagged with complex predicate objects, is an attractive mechanism for building higher-level systems. However, a limitation of that system is its lack of support for "spaces of spaces": a single node would have to host the entire communications space, which we believe is not suitable for a high performance communications system. We are investigating ways to enable scalable tuple spaces. We have compared this with the performance that would arise if all communications in our system were brokered through a single host, which typically presents too much message congestion for practical applications.

JUMP daemons maintain a table of the nodes they have communicated with and for which they hold sockets. We are presently investigating how to choose a semantics for the behaviour when a node becomes temporarily unavailable. While we cannot recover a parallel program from a single node crashing, we may be able to recover from a network glitch by re-establishing a socket connection through an alternative route. Of particular interest is support for dynamic cluster configurations as discussed in section 2.

Object migration can be achieved in Java by serialising a running object (assuming it can be stopped, or its thread suspended, in a controlled manner). The serialised object state and its class can be migrated to the classloader space of another node that is able to host it and to reawaken it. We are experimenting with a proxy object approach to reactivating the serialised object and having it re-establish its communications connections. This appears feasible in a parallel computing context, where relatively few connections and callbacks are required, and it provides the basic "join and depart" mechanism for dynamic cluster size support. The scheduling software required to sensibly relocate parts of running programs is not, however, trivial. At present our experiments in this area are limited to support facilities for a system console controlled by human scheduling decisions.
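Returning to the serialisation step itself, a minimal sketch (plain java.io serialisation with hypothetical names, and none of the proxy or scheduling machinery mentioned above) might look as follows. The real system would ship the bytes over a daemon socket and resolve the class on the destination node via the remote classloader described earlier.

    import java.io.*;

    // Illustrative sketch of object migration by serialisation.
    public class Migrator {

        // Freeze a stopped (or suspended) object into a byte array.
        public static byte[] freeze(Serializable obj) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(obj);
            out.close();
            return bytes.toByteArray();
        }

        // Reawaken the object on the destination node from the received bytes.
        // In the real system this would run under the classloader that holds
        // the migrated class, and a proxy would re-establish its connections.
        public static Object thaw(byte[] state)
                throws IOException, ClassNotFoundException {
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(state));
            return in.readObject();
        }
    }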
8
Future Directions and Conclusions
We have described our experiments in both the distributed and the parallel computing aspects of cluster systems using Java. We find Java and its associated development environment to be a powerful tool for building integrated and interoperating cluster management software. We note that present Java systems do not yet match the performance of native code, and that there is some overhead in using Java Virtual Machines for high performance software; however, this situation is improving with better compilers and JVMs. Nevertheless, we believe that this performance penalty can be traded off against the advantages, for large and complex codes, of using multiple parallel decomposition paradigms simultaneously. This is hard to achieve without a platform like Java. Furthermore, we believe the performance of Java has improved significantly in recent years and will continue to do so. Our research prototype systems should be well placed to reap the rewards of these likely performance improvements.

We have developed the basic algorithmic support structure using sockets and threads in Java, but note that RMI may have some promise if a more efficient underpinning communications protocol is used. Sun's Jini and JavaSpaces systems offer scope for a different implementation of our JUMP system. We intend to construct this, as it will provide a mechanism to build a richer and more complex system, but we expect the performance to be limited by Jini and JavaSpaces' use of RMI as the underlying communications technology. Our DISCWorld system uses RMI where the high overhead is less significant compared to the individual task durations.

We have implemented simple message-passing facilities using various Java technologies, including raw sockets and Remote Method Invocation. We have identified the needs and architecture for a system that allows more elaborate parallel programming models to be used, and have separated the distributed computing issues from those related to achieving parallel performance. We have implemented our JUMP system so that it can be invoked by a normal user program and also as a distributed service embedded in a meta-computing environment like our DISCWorld system. We discussed some of our ongoing work and expected future directions for the JUMP system in section 5.

In summary, we believe there is scope for further work on integrated support for other parallel decomposition strategies and parallel file access in our JUMP system. Our system is entirely portable apart from low-level systems-monitoring facilities that are operating-system dependent. JUMP has been successfully deployed on our Beowulf systems running Linux. We expect to use it on other cluster systems providing DISCWorld services and hope to exploit similar performance hooks in other operating systems.
References

[1] Advanced Computing Laboratory, Los Alamos National Laboratory. Parallel Object-Oriented Methods and Applications (POOMA). Available at http://www.acl.lanl.gov/pooma/
[2] Ken Arnold, Bryan O'Sullivan, Robert W. Scheifler, Jim Waldo, and Ann Wollrath. The Jini Specification. The Jini Technology Series. Addison Wesley Longman, June 1999. ISBN 0-201-61634-3.
[3] Mark Baker, Bryan Carpenter, Geoffrey Fox, Sung Hoon Ko, and Xinying Li. mpiJava: A Java MPI interface. Available from http://www.npac.syr.edu/projects/pcrc/papers/mpiJava/mpiJava/mpiJava.html, February 1999.
[4] Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure calls. ACM Trans. Computer Systems, 2(1):39–59, February 1984.
[5] Allan Bricker, Michael Litzkow, and Miron Livny. Condor technical summary. Available from http://www.cs.wisc.edu/condor/, January 1992.
[6] BROADCAST Working Group, ESPRIT Working Group 22455. Proc. Second Open Workshop on Basic Research on Advanced Distributed Computing: From Algorithms to Systems. Cambridge, 21-23 July 1999.
[7] Rajkumar Buyya, editor. High Performance Cluster Computing, volume 1: Architectures and Systems. Prentice-Hall, 1999. ISBN 0-13-13784-7.
[8] Rajkumar Buyya, editor. High Performance Cluster Computing, volume 2: Programming and Applications. Prentice-Hall, 1999. ISBN 0-13-13785-5.
[9] Rajkumar Buyya, Mark Baker, Ken Hawick and Heath James, editors. Proc. IEEE International Workshop on Cluster Computing, December 1999.
[10] Kivanc Dincer. Ubiquitous message passing interface implementation in Java: JMPI. In Proc. 13th Int. Parallel Processing Symp. and 10th Symp. on Parallel and Distributed Processing. Institute of Electrical and Electronics Engineers, 1998.
[11] Dynamic Object Group. DOGMA home page. Available at http://ccc.cs.byu.edu/DOGMA.
[12] Edinburgh Parallel Computing Centre. CHIMP concepts, June 1991.
[13] Edinburgh Parallel Computing Centre. PUL-RD prototype user guide, 1996.
[14] Adam J. Ferrari. JPVM: Network parallel computing in Java. Technical Report CS-97-29, Department of Computer Science, University of Virginia, December 1997.
[15] Eric Freeman, Susanne Hupfer, and Ken Arnold. JavaSpaces Principles, Patterns, and Practice. The Jini Technology Series. Addison Wesley Longman, June 1999. ISBN 0-201-30955-6.
[16] Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional Computing Series, 1995. ISBN 0-201-63361-2.
[17] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine, A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.
[18] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. The Java Series. Addison Wesley Longman, 1996. ISBN 0-201-63451-1.
[19] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994. ISBN 0-262-57104-8.
[20] D. A. Grove, P. D. Coddington, K. A. Hawick, and F. A. Vaughan. Cluster computing with iMacs and Power Macintoshes. In Proc. of Parallel and Distributed Computing Systems (PDCS'99), March 1999.
[21] K. A. Hawick. Trends in High Performance Computing. In Proc. of the Fourth Workshop of Integrated Data Environments Australia (IDEA), May 1997.
[22] K. A. Hawick. The Configuration Problem in Parallel and Distributed Systems. Technical Report DHPC-076, Department of Computer Science, The University of Adelaide, November 1999.
[23] K. A. Hawick, R. S. Bell, A. Dickinson, P. D. Surry and B. J. N. Wylie. Parallelisation of the Unified Model Data Assimilation Scheme. In Proc. Fifth ECMWF Workshop on Use of Parallel Processors in Meteorology, November 1992.
[24] K. A. Hawick, D. A. Grove, P. D. Coddington, and M. A. Buntine. Commodity cluster computing for computational chemistry. Technical Report DHPC-073, Department of Computer Science, The University of Adelaide, November 1999.
[25] K. A. Hawick and H. A. James. Data Futures in Meta-computing Systems. Technical Report DHPC-075, Department of Computer Science, The University of Adelaide, November 1999.
[26] K. A. Hawick and H. A. James. A Java-Based Parallel Programming Support Environment. Technical Report DHPC-0xx, Department of Computer Science, The University of Adelaide, November 1999.
[27] K. A. Hawick, H. A. James, K. J. Maciunas, F. A. Vaughan, A. L. Wendelborn, M. Buchhorn, M. Rezny, S. R. Taylor, and M. D. Wilson. An ATM-Based Distributed High Performance Computing System. In Proceedings HPCN'97, Vienna, Austria, August 1997. IEEE Computer Society Press.
[28] K. A. Hawick, H. A. James, and J. A. Mathew. Remote Data Access in Distributed Object-Oriented Middleware. To appear in Parallel and Distributed Computing Practices, 1999.
[29] K. A. Hawick, H. A. James, C. J. Patten, and F. A. Vaughan. DISCWorld: A Distributed High Performance Computing Environment. In Proc. High Performance Computing and Networks (HPCN) Europe '98, April 1998.
[30] K. A. Hawick, H. A. James, A. J. Silis, D. A. Grove, K. E. Kerry, J. A. Mathew, P. D. Coddington, C. J. Patten, J. F. Hercus, and F. A. Vaughan. DISCWorld: An Environment for Service-Based Metacomputing. Future Generation Computer Systems (FGCS), 15:623–635, 1999.
[31] K. A. Hawick and D. J. Wallace. High Performance Computing for Numerical Applications. In Proc. of Workshop on Computational Mechanics in UK, Association for Computational Mechanics in Engineering, January 1993. (Keynote address).
[32] S. Hirano, Y. Yasu and H. Igarashi. Performance Evaluation of Popular Distributed Object Technologies for Java. In Proc. ACM Workshop on Java for High-Performance Network Computing, February 1998.
[33] IEEE Computer Society Task Force on Cluster Computing. See http://www.dgs.monash.edu.au/rajkumar/tfcc/index.html
[34] Heath A. James. Scheduling in Meta-computing Systems. PhD Thesis, Department of Computer Science, The University of Adelaide, July 1999.
[35] Java Grande Forum. Java Grande Forum home page. Available at http://www.npac.syr.edu/projects/javaforcse/javagrande/.
[36] C. Nester, M. Philippsen and B. Haumacher. A More Efficient RMI for Java. In Proc. ACM Java Grande Conference, June 1999.
[37] K. E. Kerry Falkner, P. D. Coddington and M. J. Oudshoorn. Implementing Asynchronous Remote Method Invocation in Java. In Proc. Parallel and Real-Time Systems (PART'99), Melbourne, December 1999.
[38] J. Maassen, R. van Nieuwpoort, R. Veldema, H. E. Bal and A. Plaat. An Efficient Implementation of Java's Remote Method Invocation. In Proc. ACM Symposium on Principles and Practice of Parallel Programming, May 1999.
[39] J. A. Mathew, P. D. Coddington, and K. A. Hawick. Analysis and Development of Java Grande Benchmarks. In Proc. of the ACM 1999 Java Grande Conference, April 1999.
[40] J. A. Mathew, A. J. Silis, and K. A. Hawick. Inter Server Transport of Java Byte code in a Meta-computing Environment. In Proc. TOOLS Pacific (Tools 28) - Technology of Object-Oriented Languages and Systems, 1998.
[41] National Imagery and Mapping Agency (NIMA). Geospatial and Imagery Exploitation Services (GIXS) specification, version 2.0, June 1999.
[42] Scott Oaks and Henry Wong. Java Threads. Nutshell Handbook. O'Reilly & Associates, Inc., 1st edition, 1997. ISBN 1-56592-216-6.
[43] Object Management Group. CORBA/IIOP 2.2 Specification, July 1998. Available at http://www.omg.org/corba/cichpter.html.
[44] Sun Microsystems. Java Advanced Imaging API. Available from http://java.sun.com/products/java-media/jai/, November 1998.
[45] Sun Microsystems. The Java Language Guides. Available at http://java.sun.com/products/jdk/1.2/docs/guide/.
[46] UK Meteorological Office. Unified Model Weather and Climate Simulation Code. Contact http://www.meto.gov.uk