PARDIS: CORBA-based Architecture for Application-Level Parallel Distributed Computation

Katarzyna Keahey and Dennis Gannon
kksiazek, [email protected]
Department of Computer Science, Indiana University, 215 Lindley Hall, Bloomington, IN 47405

Abstract
Modern technology provides the infrastructure necessary to develop distributed applications capable of using the power of multiple supercomputing resources and exploiting their diversity. The performance potential offered by distributed supercomputing is enormous, but it is hard to realize due to the complexity of programming in such environments. In this paper we introduce PARDIS, a system designed to overcome this challenge, based on ideas underlying the Common Object Request Broker Architecture (CORBA), a successful industry standard. PARDIS is a distributed environment in which objects representing data-parallel computations, called SPMD objects, as well as non-parallel objects present in parallel programs, can interact with each other across platforms and software systems. Each of these objects represents a small encapsulated application and can be used as a building block in the construction of powerful distributed metaapplications. The objects interact through interfaces specified in the Interface Definition Language (IDL), which allows the programmer to integrate within one metaapplication components implemented using different software systems. Further, support for non-blocking interactions between objects allows PARDIS to build concurrent distributed scenarios.
Copyright 1997 ACM. To be published in the Proceedings of SC97 (Supercomputing '97), November, 1997. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

1 Introduction

High-speed networks make possible the development of distributed applications combining the computation of several components. Programmers of these applications can draw on the
computational power of multiple supercomputing resources and exploit the heterogeneity of diverse architectures and software systems to create products with unique capabilities. However, the performance potential that these metacomputing environments [CS92, FK97] offer is seldom realized due to the difficulty of combining different transport mechanisms, software packages and heterogeneous resources in one system. In this paper we describe PARDIS, a system which employs the ideas of the Common Object Request Broker Architecture (CORBA) [OMG95], namely interoperability through metalanguage interfaces, to implement application-level interaction of heterogeneous, PARallel components in a DIStributed environment. Addressing interoperability at the level of applications allows the programmer to build metaapplications from independently developed and tested components. This approach allows for a high level of component reusability and does not require existing components to be reimplemented. Further, it allows PARDIS to take advantage of application-level information such as the distribution of data structures in a data-parallel program [KG97].

PARDIS builds on CORBA in that it allows the programmer to construct metaapplications without concern for component location, heterogeneity of component resources, or data translation and marshaling in communication between them. However, PARDIS extends the CORBA object model by introducing SPMD objects representing data-parallel computations. These objects are implemented as a collaboration of computing threads capable of directly interacting with the PARDIS Object Request Broker (ORB), the entity responsible for brokering requests between clients and servers. This capability ensures request delivery to all the computing threads of a parallel application and allows the ORB to transfer distributed arguments directly (if possible in parallel) between the corresponding threads of client and server. The PARDIS ORB interacts with parallel applications through a run-time system interface implemented in terms of the run-time system underlying the application's software package, which extends the ORB into the communication domain of the parallel server. In order to accommodate the largest possible set of applications, the required run-time system functionality in the current implementation is very basic. We will show how PARDIS can be used to directly interface parallel packages based on different run-time system approaches. Specifically, we have developed interfaces to the POOMA library [ABC+95] and the High Performance C++ Parallel Standard Template Library (HPC++ PSTL) [GBJ+ar]. In summary, this paper makes the following contributions:
- We describe the architecture and programming abstractions of a system allowing the programmer to build efficient distributed metaapplications involving parallel components.
- We present examples of how this system can be used to build basic metaapplications and demonstrate that its abstractions allow the programmer to construct efficient distributed applications with relative ease.
- We report our experiments in building interfaces to two existing parallel software packages built on top of different run-time systems: HPC++ PSTL and the POOMA library.
The rest of this paper is organized as follows. In section 2 we introduce the general architectural components of PARDIS and describe how they interact. Section 3 gives an overview of the programming support provided to use these components, and section 4 gives examples of their use. In section 5 we sketch related work in this area, and section 6 outlines our conclusions and plans for future work.
2 Overview of PARDIS
2.1 Object and Invocation Model
PARDIS provides interoperability between objects in a distributed environment. An object is defined as an encapsulated entity capable of performing certain services. These services can be requested by clients. An entity called the Object Request Broker (ORB) delivers requests from clients to servers, and also identifies, locates and activates objects. PARDIS distinguishes two kinds of objects: SPMD objects and single objects. SPMD objects represent parallel applications which roughly adhere to the Single Program Multiple Data (SPMD) style of computation. They are associated with a set of one or more computing threads, each of which may be executing in a distinct address space, and are capable of performing services if and only if a request for the service is accepted by all the computing threads. Single objects are associated with only one computing thread and will service requests accepted by that one thread. A similar distinction applies to clients. SPMD clients are associated with a set of one or more computing threads and represent parallel applications wishing to act as one entity in interactions with objects, whereas single clients are associated with only one computing thread. In order to make an invocation on an object, each thread associated with an SPMD client has to issue a request. PARDIS guarantees that the sequence of invocations is preserved for single and SPMD clients. We use a slightly extended version of the CORBA Interface Definition Language (IDL) to represent object specifications. The difference between SPMD and single objects is important because it affects the invocation model, but it is implementational rather than conceptual. Therefore interfaces to both SPMD and single objects can be defined in the same way as IDL interfaces, the only difference being that SPMD objects can use distributed argument types in operation definitions.
2.2 Overview of the Architecture
As a software system PARDIS consists of an IDL compiler, communication libraries, object repository databases and agents responsible for locating and activating objects. As depicted in figure 1, the main components of a PARDIS system are:
- Parallel Servers: programs which provide the implementation of single or SPMD objects. A parallel server is a set of one or more computing threads determined either at the time of server startup or by its activation parameters. It is assumed that these threads use some communication medium other than the PARDIS communication libraries, for example a message-passing library, for communication specific to the program. The current implementation assumes that the threads are associated with a distributed memory model.
- Parallel Clients: programs composed of one or more computing threads, capable of requesting services from parallel objects, either by acting as one entity (SPMD clients) or as separate entities (single clients). As in the case of servers, we assume that the computing threads can communicate through a medium other than the PARDIS communication libraries.
- Object Request Broker: an entity responsible for managing requests between the client and the server. In order to properly process requests, the ORB may need to communicate with the run-time system underlying the parallel server or client.

In order to make the implementation of an object available to clients, the programmer needs to provide the implementation of the object in terms of a chosen package, and define its interface in IDL. The IDL compiler translates the specifications of objects into "stub" code containing calls to the ORB. Linked to the object's implementation, the stub code enables the ORB to invoke operations on the object. The client can use the generated stubs to make requests on the server.

[Figure 1 diagram: the IDL specification is processed by the compiler into client and server stubs; the client's and server's applications and packages communicate through these stubs and the PARDIS ORB, with the shaded client-side and server-side run-time system (RTS) interfaces connecting the ORB to each application.]
Figure 1: Interaction of the main components of PARDIS: the shaded areas in the picture denote the PARDIS run-time system interface.

The run-time system interface through which the ORB communicates with clients and servers comprises communication primitives and data marshaling calls specific to a given system. The functional requirements are restricted to a very small subset of basic message passing primitives. In order to avoid conflicts, we also require a way to distinguish between PARDIS messages and messages pertaining to the computation in user code (for example through a set of reserved message tags). So far we have implemented this interface in MPI [For95], the Tulip [BG96] run-time system and the communication abstraction of the POOMA library [ABC+95], which allows PARDIS to interact with the object-oriented packages built on top of those systems. Restricting the assumptions about the run-time system to a small set limits
the functionality of distributed argument structures, but allows us to provide interoperability with many parallel applications. As the system develops, we plan to formulate an alternative interface based on more flexible one-sided run-time systems. Apart from the main components described above, PARDIS contains the following facilities:
- Object and Implementation Repositories: databases which define a naming domain for interacting objects. On activation, every object registers with an object repository, which is searched when the client requests a connection to a specific object. Each repository is associated with a unique namespace; configuring clients and servers to work with different repositories allows the programmer to split the namespace for interacting objects. In the case of non-persistent servers, the programmer can use the register facility to register the object, together with information on how it should be activated, with the Implementation Repository.
- Activating Facilities: since establishing a connection with an object can involve starting up the server which provides its implementation, PARDIS provides activating agents. Such agents usually need to reside on the host of the server; in order to limit the interference between the activating agent and the server, the programmer can configure the system to work in an activating or a non-activating mode.
The current version of PARDIS uses NexusLite, the single threaded implementation of Nexus [FKOT94], for network transport. PARDIS has been tested with parallel clients and servers running on SGI multi-processor architectures, IBM SP/2, and over networks of Sun workstations.
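To make the run-time system interface requirements described above more concrete, the following is a minimal sketch of the kind of abstraction PARDIS expects from the underlying package; the class and member names are ours and purely illustrative, not the actual PARDIS interface.

//C++ (illustrative sketch only; all names are assumptions)
#include <cstddef>

class RTSInterface {
public:
    virtual ~RTSInterface() {}
    // Identity of the calling computing thread within the parallel application.
    virtual int num_threads() const = 0;
    virtual int my_thread() const = 0;
    // Basic point-to-point message passing; 'tag' is drawn from a range of
    // reserved tags so that PARDIS messages do not collide with user-code messages.
    virtual void send(int dest, int tag, const void* buf, std::size_t len) = 0;
    virtual void recv(int src, int tag, void* buf, std::size_t len) = 0;
};

An implementation of this interface in terms of MPI, Tulip or the POOMA communication abstraction is what extends the ORB into the communication domain of the parallel application.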
3 Programming Support for Parallel Distributed Computation

This section will discuss the programmer's view of PARDIS. We will outline how objects are created and how they interact with each other. Finally, we will explain how PARDIS interfaces with other systems. Examples using the concepts discussed below can be found in section 4.
3.1 Object Creation and Binding
Objects are created by the server programs. SPMD and single objects can share the resources of the same parallel server (see section 4.2 for an example). On object instantiation the programmer has to specify whether the object is created as a single object or an SPMD object; the instantiation of an SPMD object is collective with respect to all the computing threads of the server. Further, only objects which do not operate on distributed arguments can be created as single objects. In order to interact with objects, the client has to establish a binding between the object proxy generated by the compiler and the implementation of the object. Clients can bind to objects in two ways: either collectively by using spmd_bind, which represents the parallel client to the ORB as one entity, or by calling bind, which creates one binding per thread. All operations on a proxy instantiated using spmd_bind must be invoked collectively and can use distributed arguments. To support single-client invocations of operations using distributed arguments, PARDIS generates two stubs for every such operation: one with a distributed mapping to support SPMD invocations and another with corresponding non-distributed arguments to support single invocations.
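As an illustration, the two binding styles could be used as follows; this is a sketch based on the interface of section 4.1, the object and host names are placeholders, and the per-thread _bind form is named by analogy with _spmd_bind.

//C++ (illustrative sketch; names are placeholders)
// Collective binding: called by every thread of the parallel client, which is then
// represented to the ORB as one entity; operations on this proxy must be invoked
// collectively and may use distributed arguments.
direct_var spmd_proxy = direct::_spmd_bind("direct_solver", SOME_HOST);

// Per-thread binding: each calling thread obtains its own binding and acts as an
// independent single client; only the non-distributed stubs apply to this proxy.
direct_var single_proxy = direct::_bind("direct_solver", SOME_HOST);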
3.2 Distributed Arguments
In order to make full use of interaction with SPMD objects, the programmer needs to be able to define and manipulate argument data structures distributed over the address spaces of the computing threads of an SPMD object. PARDIS currently provides one such structure, a generalization of the CORBA sequence, called a distributed sequence. An IDL definition:

//IDL
typedef dsequence<double, 1024, BLOCK, SINGLE> dist_seq_double;
represents a bounded distributed sequence of 1024 elements of type double, uniformly blockwise distributed on the client's side, but concentrated on one processor on the server's side. The last two arguments in the definition of a sequence are optional. See section 4 for details on how to use distributed sequences. Based on this definition, the IDL compiler generates a C++ class which behaves like a one-dimensional array with variable length and distribution. If no distribution has been specified, it can be set by the programmer using a distribution template, which describes in what proportions the elements of a sequence should be distributed among the processors. Using different distribution templates the programmer can also redistribute the sequence. The server can set the distribution of any of the "in" arguments to its operations prior to object registration; the client can set the distribution of the expected "out" arguments before making an invocation. Knowledge of the distribution allows the ORB to efficiently transfer arguments between the client and server [KG97]. The sequence class overloads operator[] to provide access to its elements with location transparency. We would like to stress, however, that the main purpose of a distributed sequence is to be used as a container for argument data, not to provide its management. To that end, the distributed sequence supports no-ownership constructors and provides access to owned data, which allows the programmer to easily build efficient conversions between the distributed sequence and the data structures particular to his or her package.
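The following sketch illustrates how a distributed sequence might be filled from data already owned by the application and how its distribution can be set. The member and helper names are illustrative assumptions; the text above only fixes the concepts (no-ownership construction, distribution templates, operator[]), not the exact signatures.

//C++ (illustrative sketch; member and helper names are assumptions)
// 'vector' is the C++ class generated from the dsequence definition in section 4.1.
double* local_data = my_package_block();            // data owned by the parallel package
unsigned long local_len = my_package_block_size();

// No-ownership constructor: wrap the existing storage as this thread's piece of the
// sequence without copying; the sequence will not deallocate it.
vector v(local_len, local_data);

// A distribution template describes in what proportions the elements are divided
// among the computing threads; setting a new template redistributes the sequence.
DistributionTemplate dt = block_template(PARDIS::num_threads());
v.set_distribution(dt);

v[17] = 3.14;   // operator[] provides location-transparent element access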
3.3 Interaction
In order to support concurrency, PARDIS allows both blocking and non-blocking invocations on the operations of the server. For each operation, the IDL compiler generates two stubs; an invocation through the blocking stub returns only after it has been fully processed by the server, whereas an invocation through the non-blocking stub returns immediately after the request has been sent, yielding futures of its "out" arguments and return value. A future represents the result of a service which may not yet be available. Trying to read a future before the result it represents is returned, that is, before the future becomes resolved, will cause the program to block until the result is delivered. Alternatively, the programmer may poll on a future to check whether it has been resolved. A non-blocking invocation can involve multiple futures of distributed as well as non-distributed arguments; they will all be resolved at the same time, when the server completes its computation. The PARDIS C++ mapping for futures draws on an analogous abstraction implemented in ABC++ [OEPW96]. The use of futures is demonstrated in sections 4.1 and 4.2.

On the server's side, after all objects have been created, the programmer usually passes control to PARDIS by calling POA::impl_is_ready(). This call causes the server to poll for requests from clients and does not return; the server will remain in the polling loop until it is deactivated. Since the programmer may want to additionally poll for requests during processing, PARDIS allows the server to invoke POA::process_requests() at any time during its computation. This call returns, allowing the server to proceed with the interrupted computation (see section 4.2 for an example). Both invocations must be collective with respect to all processing threads of the server.
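The two server-side calls can be combined as in the following sketch: a server that interleaves its own computation with request processing and finally enters the request loop for good. The computation itself is a placeholder; only the two POA calls are part of the PARDIS interface.

//C++ (illustrative sketch; the computation is a placeholder)
// Executed collectively by all computing threads of the parallel server.
for (int step = 0; step < num_steps; ++step) {
    compute_one_step();          // the server's own parallel computation
    POA::process_requests();     // service any pending client requests, then return
}
POA::impl_is_ready();            // enter the polling loop; does not return until deactivation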
3.4 Interoperability with Parallel Packages
Many different object-oriented systems and libraries have been developed to capture abstractions associated with parallel programming. Although interoperability through distributed sequences gives the programmer a way of interfacing any package to develop distributed parallel systems, it still requires invocation arguments to be PARDIS structures rather than structures expressed in the idiom of the package. This limits the "programmer convenience" factor of our system. To address this problem we explored the possibility of expressing PARDIS IDL definitions directly in terms of data structures native to concrete packages: the distributed vector in HPC++ PSTL [GBJ+ar] and the field in the POOMA library [ABC+95]. In order to use this mapping the programmer needs to annotate the IDL definitions with pragma statements directing the compiler to generate stubs marshaling the data into the existing structures rather than generating code for their PARDIS representations. An example of this interaction is shown in section 4.3. This direct interoperability requires customizing the compiler for any given package. However, in our experience the interfacing code is relatively straightforward, which may make it possible to generate the necessary compiler routines from the user's directions. A similar technique has been employed in the new implementation of the Sage++ library [BBG+94]. Our experiences with providing customized mappings to different systems have revealed that although PARDIS can easily marshal complex data structures, the interfaced packages themselves are frequently unable to handle dynamically-sized, nested data structures. The marshaling code generated by PARDIS is of course available to the package. Packages based on different run-time systems can interoperate only in distributed mode; for literature related to interoperability on the same server see [KBJ+96].
4 Examples

In this section we present three examples to demonstrate that PARDIS has the potential for easy development of efficient distributed scenarios. In the first two examples we present performance data obtained on a hardware configuration of a 4-node SGI Onyx R4400 (HOST 1) and a 10-node SGI PC R8000 (HOST 2) connected by a dedicated 155 Mb/s ATM link. In the third example we used HOST 2 and up to 8 nodes of an IBM SP/2; in this case the hosts were communicating via Ethernet.
4.1 Concurrent Execution of Data-Parallel Components
In this section, we present a simple example as a basic introduction to programming with PARDIS, and at the same time highlight three of its features: non-blocking invocations, programming without concern for object locality, and ease of manipulation of dynamically-sized data structures. We consider a scenario in which the same system of linear equations is solved by a direct method and an iterative method; the returned solutions are then compared to calculate the agreement between the two methods. Similar interactions occur in parameter studies for physical simulations and in algorithm development. We ran this application both in single-server and distributed-servers mode and obtained substantial speedup by putting the slower application on a faster remote resource. The results are shown in figure 2. We implemented both solvers as servers successively invoked by a parallel client which then compares the resulting vectors. The interfaces for the two solvers are specified below:

//IDL
typedef sequence<double> row;
typedef dsequence<row> matrix;
typedef dsequence<double> vector;

interface direct {
    void solve(in matrix A, in vector B, out vector X);
};

interface iterative {
    void solve(in double tol, in matrix A, in vector B, out vector X);
};
Based on these definitions the compiler will generate stub code allowing the programmer to implement the client program as follows:

//C++
00: direct_var d_solver = direct::_spmd_bind("direct_solver", HOST_1);
01: iterative_var i_solver = iterative::_spmd_bind("itrt_solver", HOST_2);
02: matrix A(N);
03: vector B(N);
04: initialize_system(A, B);
05: PARDIS::future<vector> X1;
06: vector_var X1_real, X2_real;
07: double tolerance = 0.000001;
08: i_solver->solve_nb(tolerance, A, B, X1);
09: d_solver->solve(A, B, X2_real);
10: X1_real = X1;
11: double difference = compute_difference(X1_real, X2_real);
First, a parallel client has to establish a binding between compiler-generated proxies and an actual object implementation by invoking spmd_bind from all the threads. We will assume that at least one of the servers is located on the same host as the client, HOST_1. The binding operation returns a managed pointer to the object proxy. Managed pointers (denoted T_var for type T) provide memory management for the represented pointers. In this case a managed pointer ensures that the proxy, and with it the binding, will be destroyed when the pointer goes out of scope. In the case of pointers to distributed sequences, such as X1_real, the programmer can also use the pointer to set the distribution of the result vector; in this example the distribution defaults to BLOCK.

[Figure 2 plot: execution time (in seconds) versus problem size (200-1200) for the direct method on HOST 1, the iterative method on HOST 2, the two solvers on different servers, and both solvers on the same server (HOST 1).]

Figure 2: Performance of single-server and distributed computation. The total execution time of the distributed computation is t = t_o + max{t_i, t_d}, where t_i and t_d are the computation times of the solvers, and t_o is the communication overhead.

The client can now invoke operations on the objects. First, note the two different invocation styles: blocking (on d_solver) and non-blocking (on i_solver). The non-blocking invocation returns a future of its result. Trying to access a future whose value has not yet been provided will cause the program to block until the future becomes resolved; otherwise the future will return a value of the underlying type (line 10). Note that this assignment takes place between two managed pointers which are implemented as handles to the data; this makes distributed future instantiation computationally inexpensive. Above (in line 8),
we use a non-blocking invocation to send a request to a remote server; while the request is being processed, the program performs its own computation and only then tries to access the result. Both invocations must be made collectively and both use as their arguments sequences distributed over the address spaces of the client. Secondly, note that switching between using one host for both computations (during initial development, or until fast remote resources become available) requires only a change of the host name in line 01. By using inheritance and virtual function overriding in the stub code, PARDIS ensures that an invocation on a local object becomes a direct call to the object, bypassing the network transport. Thirdly, note that matrix is a distributed sequence type composed of dynamically-sized elements. Up to now the use of dynamically-sized elements in programs based on a distributed memory model has been limited, as it required the programmer to hand-code special marshaling routines in order to move the elements between address spaces. The IDL compiler generates those marshaling routines automatically; the same routines can be used for marshaling elements for network transport as well as transport within the communication domain of the application, thus relieving the programmer of the necessity to write additional, error-prone code.
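On the server's side, the generated stubs deliver the distributed arguments to every computing thread; a possible shape of the direct solver's implementation is sketched below. The implementation-class convention and the accessors for locally owned data are assumptions; only the access-to-owned-data facility itself is described in section 3.2.

//C++ (illustrative sketch; class and accessor names are assumptions)
void direct_impl::solve(const matrix& A, const vector& B, vector& X) {
    // Each thread of the SPMD object sees the locally owned pieces of the
    // distributed arguments, transferred by the ORB (in parallel, if possible).
    unsigned long n_local = B.local_length();   // assumed accessor to owned data
    const double* b_local = B.local_data();     // assumed accessor to owned data
    // ... pass the local blocks to the native solver of the chosen package ...
    // ... write this thread's piece of the solution into X before returning ...
}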
4.2 Parallel Interaction: SPMD and Single Objects on Parallel Servers
This example presents a simplified model of an application in which an SPMD object produces many different results of interest to different clients. These results can be accessed through single objects located on the same server as the SPMD object. By distributing them across the computing threads of the server, the programmer can enable parallel interaction with the server. Similar scenarios are used in genetic programming and database searches.

[Figure 3 diagram: several clients interacting with a parallel server that hosts the DNA database object and the single list-server objects, e.g. the substring list server and the transpose list server.]
Figure 3: Parallel interaction: the object containing the DNA database is activated by an SPMD request processed by a parallel computation on 4 nodes; the clients can also interact with single objects "owned" by threads associated with different nodes.

We implemented a server containing a DNA database, which is searched in parallel for sequences which either contain a certain substring themselves, or whose edit distance derivatives contain the substring. Periodically during the search, partial results are collected in five lists: one containing sequences matching the substring exactly, and one for each of their four edit distance derivatives (transposition, deletion, substitution, addition). At this time the server can make the lists accessible to the clients by calling POA::process_requests(); the clients can then process the lists further, each according to different criteria. Figure 3 illustrates this interaction. The interfaces to the list servers are specified below:

//IDL
typedef sequence<string> dna_list;

interface list_server {
    void match(in string s, out dna_list l);
};

interface dna_db {
    status search(in string s);
};
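On the server's side, the periodic call to POA::process_requests() described above could be placed inside the search loop roughly as follows; the search and list-update helpers are placeholders, and only the POA call is part of the PARDIS interface.

//C++ (illustrative sketch; helpers and the returned value are placeholders)
status dna_db_impl::search(const char* s) {
    while (!search_finished()) {
        search_next_chunk(s);      // each thread searches its part of the database
        update_result_lists();     // refresh the five single list-server objects
        POA::process_requests();   // collective: let clients query the lists, then resume
    }
    return SEARCH_DONE;            // placeholder for the IDL 'status' value
}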
For simplicity we assume that all the queries to the database come from one client issuing non-blocking requests. The code of the client could be implemented as shown below (note the use of the resolved method on futures to test whether the non-blocking request for search has completed):

//C++
future<status> stat;
future<dna_list> list_1, list_2;
stat = dna_database->search_nb("ABCD");
while (!(stat.resolved())) {
    substring_list_srv->match_nb("DDD", list_1);
    transpose_list_srv->match_nb("AAA", list_2);
    // invoke match on other servers and process obtained results
}
// final processing
substring_list_srv->match_nb("DDD", list_1);
...
Figure 4 shows execution time from the client's perspective under two different distributions of single objects on the parallel server. In the centralized distribution scheme, all list servers are associated with one computing thread. This scheme models what would happen if only one computing thread of the SPMD object were visible to the ORB. In the second scheme, the list server objects are distributed to balance the client's requests. Since different list servers take different amounts of time to process the client's queries, the speedup of processing is a function of how the queries are balanced. We did not attempt to provide the best balance; our results reflect a randomly chosen distribution of servers.
[Figure 4 plots: (left) execution time (in seconds) versus the number of processors of the server application, for distributed and centralized list server objects; (right) the difference in execution time (in seconds) between the centralized and distributed schemes versus the number of processors.]
Figure 4: The graph on the left shows a timing comparison of the same search under different distribution schemes of 5 single objects; the graph on the right emphasizes the difference in time. The total time spent in single-object queries was the same in both cases (30 seconds). Note that the parallel server was attempting to balance single objects by number, not by weight, hence the redistribution when going from 2 to 3 processors resulted in a diminished difference.
4.3 Pipelining: a Simple Scenario Using POOMA and HPC++ PSTL
In section 3.4 we outlined our experiments with generating direct mappings for POOMA and HPC++ PSTL; now we will give a simple example of constructing distributed applications using those mappings. We will also demonstrate how to use PARDIS to set up a simple pipelining scenario; this kind of interaction appears in many distributed scientific applications [NBB+96, THC+96]. Consider a metaapplication consisting of two distributed units: an application computing a simplified simulation of 2-D diffusion based on a 9-point stencil operation, and an application which computes the magnitude gradient of the diffusion field in order to identify the areas of most intensive change. The diffusion operation is executed in a series of time-steps; at every n-th time-step, the diffusion component pipelines the field values to the gradient component and continues with its computation. Further, both the diffusion and the gradient unit pipeline the results of every completed time-step to a visualizing server as part of their computation. In our simulation we will use a diffusion implementation which is part of the POOMA application suite [ABC+95], a gradient program implemented in HPC++ PSTL, and a simple program for viewing the result. In this example, the visualizer and the HPC++ program are servers described by the following IDL interfaces:

//IDL
const long N = 128;
#pragma HPC++:vector
#pragma POOMA:field
typedef dsequence<double, N*N> field;

interface visualizer {
    void show(in field myfield);
};

interface field_operations {
    void gradient(in field myfield);
};
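The structure of the diffusion client's time-step loop described above could look roughly as follows; the stepping and conversion helpers are placeholders, and the proxies are assumed to have been obtained with _spmd_bind as in section 4.1.

//C++ (illustrative sketch; helpers and proxy names are assumptions)
for (int step = 1; step <= num_steps; ++step) {
    advance_diffusion_one_step();                  // POOMA 9-point stencil update
    field current = current_field_as_sequence();   // field values for this time-step
    viz->show_nb(current);                         // pipeline every step to the visualizer
    if (step % n == 0)
        field_ops->gradient_nb(current);           // every n-th step, to the gradient unit
}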
The diffusion unit is a parallel client which will repeatedly request the show and gradient services, but is not a server itself, and therefore no interface specification for diffusion is required. Note that in this example a two-dimensional array is represented as a vector in row-major order, rather than as a vector of sequences as in section 4.1. Also, in this example the distribution of the sequence is fixed rather than left to the programmer's choice.

Based on these specifications the PARDIS compiler generates stub code for three different systems. When the compiler is invoked with the "-pooma" option, the POOMA:field pragma causes it to generate stub code marshaling the distributed sequence into a POOMA field. Similarly, the "-hpcxx" option will cause it to generate stub code suitable for the PSTL distributed vector; an invocation with no options will generate the standard C++ stubs used with the visualizer. In this way, components implemented in different systems can interoperate within one metaapplication without requiring the programmer to explicitly translate one set of abstractions into another. The extensions after the colon in the pragma statements can be used to associate the distributed sequence with different data structures of a given package.

We have run this application with its components distributed over three machines. The POOMA diffusion component was executing on a 10-node SGI PC, and so was the sequential process visualizing its output. The gradient component was executing on up to 8 nodes of an IBM SP/2; its visualizing process was running on an SGI Indy workstation. The machines were communicating via an Ethernet connection. Figure 5 summarizes the results.

The experiments show that although distributing the application brought some advantages in terms of execution time, these advantages did not scale very well. Our experience indicates that this is due to two reasons. First, although the invocations of both the gradient and show operations were non-blocking, they were not "oneway" [OMG95], and as the time to send the data began to approach the execution time of this relatively lightweight application, the advantages of parallel execution began to disappear. Second, whenever the computation time of the gradient was greater than or comparable to the interval at which the requests were made, the pipeline became congested and the client application was obliged to wait for the server to finish the computation of the previous request. Although a more stable network configuration would be required to clearly separate these influences, it seems probable that their effects could at least partially be offset by delegating the process of receiving and sending data to separate threads in a multi-threaded implementation of PARDIS.
[Figure 5 plot: total execution time (in seconds) versus the number of processors, showing the overall time of the metaapplication together with the times of the diffusion (SGI PC) and gradient (SP2) components.]
Figure 5: The graph shows the overall performance of the metaapplication from the client's perspective compared to the performance of its components. In each case shown, the number of processors of the diffusion application matched the number of processors of the gradient computation. The input was a 128 x 128 grid; the application was executed over 100 time-steps, with the gradient computation requested every 5th time-step. The values shown are averages over a series of measurements taken at different times.
5 Related Work

As high-performance distributed computation became more feasible and more widespread, the challenges involved in constructing distributed metasystems received increasing attention. During the I-WAY [DFP+96] networking experiment many researchers presented systems and applications capable of exploiting the heterogeneity of geographically distributed resources, proving that metacomputing is no longer a thing of the future. This section outlines a number of current approaches to the problem of distributed supercomputing.

Metacomputing environments and metasystems such as Globus [FK97], Legion [GW96] and WWVM [DF96] provide tools for application development and management in distributed environments. Services provided by these environments include resource configuration and management, security, fault-tolerance and debugging. AppLeS [BWF+96] provides tools for efficient scheduling of distributed supercomputing applications. While PARDIS comprises a minimal set of services necessary to run in a distributed environment, it does not aspire to the degree of sophistication offered by these systems; our focus is on developing and evaluating abstractions allowing the programmer to easily compose heterogeneous software units in order to construct parallel distributed applications.

Another interesting approach to distributed supercomputing is offered by multi-method run-time systems and communication libraries [FGKT96, vRBF+95]. These systems integrate diverse transport mechanisms under the same interface, thus allowing the programmer to treat a set of supercomputers as one virtual metacomputer. Although this approach allows the programmer to take advantage of multiple resources, it also makes it necessary to program in terms of the interface offered by a given run-time system. This may require many
existing applications to be reimplemented and confines the programmer to an interface which may not be suitable for his or her application. By addressing interoperability at a higher level, our approach makes it possible to combine components which are developed independently, possibly using different run-time systems.

The success of CORBA has also stimulated research investigating its performance over high-speed networks [GS97] and its suitability for real-time applications [SGHP97]. This research concentrates on optimizing the performance of the architectural components of the standard and, although not related directly to issues in parallel distributed programming, provides many interesting insights relating to them.
6 Conclusions and Future Work

This paper describes PARDIS, an environment allowing the programmer to construct metaapplications from parallel and non-parallel objects residing on different parallel servers. Each object, whether parallel or not, constitutes an independently developed, encapsulated application and, once implemented, can be used as a building block in many different metaapplications. Using an Interface Definition Language to represent these objects to each other allows us to combine within one metaapplication components implemented using different parallel packages, libraries and languages.

We have presented examples of implementing basic distributed scenarios in this environment. These examples show that PARDIS contains support for abstractions which allow the programmer to develop, with relative ease, efficient distributed applications composed of multiple parallel and non-parallel components. In particular, we have shown that it can support distributed concurrent execution of parallel components and allows parallel interaction with objects distributed over the resources of a parallel server.

Our future work on PARDIS will focus on several areas. Distributed sequences are only a first attempt at describing distributed data; more flexibility is needed to make distributed arguments fully usable. We also plan to introduce a run-time system interface targeting one-sided run-time systems and to streamline compiler construction, which would facilitate designing mappings for many diverse systems. Further, the applications described in this paper offer several suggestions for experimenting with the implementation of PARDIS. Our most immediate experiments will deal with using communication threads (in addition to the computing threads) as sending and receiving processes between parallel applications. This might alleviate problems such as pipeline congestion and facilitate asynchronous interaction, but on the other hand it will contend for resources with the parallel application. Viewing a parallel application as a component of a distributed system, rather than a stand-alone unit, also introduces other intriguing questions about performance tradeoffs. In our future work we plan to address these questions as we try to identify cost-effective interaction patterns for distributed supercomputing.
Acknowledgments

The authors would like to express their gratitude to the members of the POOMA team at Los Alamos National Laboratory for explanations and help in designing the mapping; particular thanks are due to Bill Humphery for advice on setting up the applications. Elizabeth Johnson at Indiana University helped with similar explanations concerning HPC++ PSTL. We would also like to thank Rob Henderson and Bruce Shei at Indiana University for help with configuring the network, reserving the machines, and many an interesting discussion.
References
[ABC+95] S. Atlas, S. Banerjee, J. C. Cummings, P. J. Hinker, M. Srikant, J. V. W. Reynders, and M. Tholburn, POOMA: A High Performance Distributed Simulation Environment for Scientific Applications, Supercomputing '95 Proceedings, December 1995.

[BBG+94] F. Bodin, P. Beckman, D. Gannon, J. Gotwals, S. Narayana, S. Srinivas, and B. Winnicka, Sage++: An object-oriented toolkit and class library for building Fortran and C++ restructuring tools, Proceedings of the Second Annual Object-Oriented Numerics Conference (OON-SKI) (Sunriver, Oregon), 1994, pp. 122-138.

[BG96] P. Beckman and D. Gannon, Tulip: A Portable Run-Time System for Object-Parallel Systems, Proceedings of the 10th International Parallel Processing Symposium, April 1996, pp. 532-536.

[BWF+96] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao, Application-Level Scheduling on Distributed Heterogeneous Networks, Supercomputing '96 Proceedings, November 1996.

[CS92] C. Catlett and L. Smarr, Metacomputing, Communications of the ACM 35 (1992), no. 6, 45-52.

[DF96] K. Dincer and G. C. Fox, Building a World-Wide Virtual Machine Based on Web and HPCC Technologies, Supercomputing '96 Proceedings, November 1996.

[DFP+96] T. DeFanti, I. Foster, M. Papka, R. Stevens, and T. Kuhfuss, Overview of the I-Way: Wide-Area Visual Supercomputing, The International Journal of Supercomputer Applications and High Performance Computing 10 (1996), no. 2, 123-131.

[FGKT96] I. Foster, J. Geisler, C. Kesselman, and S. Tuecke, Multimethod Communication for High-Performance Metacomputing Applications, Supercomputing '96 Proceedings, November 1996.

[FK97] I. Foster and C. Kesselman, Globus: A metacomputing infrastructure toolkit, The International Journal of Supercomputer Applications and High Performance Computing 11 (1997), no. 2, 115-128.

[FKOT94] I. Foster, C. Kesselman, R. Olson, and S. Tuecke, Nexus: An Interoperability Layer for Parallel and Distributed Computer Systems, Technical Memorandum ANL/MCS-TM-189, 1994.

[For95] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, June 1995.

[GBJ+ar] D. Gannon, P. Beckman, E. Johnson, T. Green, and M. Levine, HPC++ and the HPC++Lib Toolkit, Languages, Compilation Techniques and Run Time Systems (Recent Advances and Future Perspectives), 1997 (to appear).

[GS97] A. Gokhale and D. Schmidt, Evaluating CORBA Latency and Scalability Over High-Speed ATM Networks, Proceedings of the 17th International Conference on Distributed Systems, May 1997.

[GW96] A. S. Grimshaw and W. A. Wulf, Legion - A View From 50,000 Feet, Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computation, August 1996.

[KBJ+96] L. V. Kale, M. Bhandarkar, N. Jagathesan, S. Krishnan, and J. Yelon, Converse: An Interoperable Framework for Parallel Programming, Proceedings of the 10th International Parallel Processing Symposium, April 1996.

[KG97] K. Keahey and D. Gannon, PARDIS: A Parallel Approach to CORBA, Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computation, August 1997.

[NBB+96] M. L. Norman, P. Beckman, G. L. Bryan, J. Dubinski, D. Gannon, L. Hernquist, K. Keahey, J. P. Ostriker, J. Shalf, J. Welling, and S. Yang, Galaxies Collide on the I-WAY: An Example of Heterogeneous Wide-Area Collaborative Supercomputing, The International Journal of Supercomputer Applications and High Performance Computing 10 (1996), no. 2, 132-144.

[OEPW96] W. G. O'Farrell, F. Ch. Eigler, S. D. Pullara, and G. V. Wilson, Parallel Programming Using C++, ch. ABC++, MIT Press, 1996.

[OMG95] OMG, The Common Object Request Broker: Architecture and Specification, Revision 2.0, OMG Document, June 1995.

[SGHP97] D. Schmidt, A. Gokhale, T. Harrison, and G. Parulkar, A High-performance Endsystem Architecture for Real-time CORBA, IEEE Communications Magazine 14 (1997), no. 2.

[THC+96] V. E. Taylor, M. Huang, T. Canfield, R. Stevens, D. Reed, and S. Lamm, Performance Modeling of Interactive, Immersive Virtual Environments for Finite Element Simulations, The International Journal of Supercomputer Applications and High Performance Computing 10 (1996), no. 2, 145-156.

[vRBF+95] R. van Renesse, K. P. Birman, R. Friedman, M. Hayden, and D. A. Karr, A Framework for Protocol Composition in Horus, Proceedings of Principles of Distributed Computing, 1995.