POEMS: A Parallel Object-oriented Environment

© British Computer Society 2002

Wei Jie, Wentong Cai and Stephen J. Turner

School of Computer Engineering, Nanyang Technological University, Singapore 639798
Email: [email protected]

POEMS is a Parallel Object-oriented Environment for Multi-computer Systems. To support dynamic load balancing, its runtime execution model is based on object replication. Method invocation in POEMS is asynchronous and threads are created to execute methods. Inter-object, intra-object and intra-method parallelism are all supported. Programs in POEMS are written using two classes of objects, i.e. parallel object replication (POR) and parallel object collection (POC) classes, which support programming in the MPMD and SPMD styles, respectively. This paper focuses on the object models and programming facilities of POEMS and presents some preliminary performance studies. The major features and execution models of POR and POC classes are described in detail, and some typical applications are presented to illustrate the usage of these two classes. The implementation issues of a POEMS prototype runtime system are also discussed.

Received 23 August 2001; revised 14 March 2002

The Computer Journal, Vol. 45, No. 5, 2002

1. INTRODUCTION

The object-oriented paradigm offers a number of advantages for parallel and distributed applications compared with conventional parallel computing models [1]. The concept of tasks in parallel programming is elegantly represented by cooperating objects that may be physically distributed at different locations. Programs may be written at a higher level of abstraction: for example, objects may be accessed in a uniform way and message sending may be represented as method invocation. Object-oriented programming allows the programmer to encapsulate low-level mechanisms and manage the increased complexity of parallel programming. It also provides modularity and code reuse, which can be even more important for parallel programs because of their higher development cost.

Two major research issues in the construction of a parallel object-oriented system are programmability and performance [2]. Some concurrent and parallel object-oriented systems have been designed and implemented, but our analysis of these existing systems has shown some limitations. For example, many of these systems support either single program multiple data (SPMD) or multiple program multiple data (MPMD) types of programming, but not both, and they usually do not provide mechanisms to support dynamic load balancing, a major factor affecting performance. To address the issues of flexibility of programming and dynamic load balancing, we have proposed a new parallel object-oriented system, POEMS (Parallel Object-oriented Environment for Multi-computer Systems), which:

•   offers a new parallel object model based on object replication to improve the performance of parallel object-oriented applications by balancing the processing load on a parallel or distributed system; and
•   provides a highly structured and easy-to-use programming facility to enable a high degree of parallelism and to support both SPMD and MPMD programming paradigms.
Method invocation in POEMS is asynchronous and threads are created to execute methods. POEMS supports two object models: parallel object replication (POR) and parallel object collection (POC). These two object models are designed for programming in the MPMD and SPMD styles, respectively. In our other papers [3, 4, 5], a basic model of object replication and collection was described and some dynamic load-balancing strategies based on the model were studied. In this paper, we focus on the programming aspects of POEMS. First, an overview of related work is given in Section 2. The POR and POC object models are described in Section 3 and results from our simulation study to investigate performance are given in Section 4. The POEMS programming interface that enables programmers to utilize the POR and POC models is presented in Section 5. Examples that demonstrate the flexibility of the POEMS programming interface are given in Section 6.

Implementation issues of the POEMS runtime system are discussed in Section 7. Finally, Section 8 concludes the paper with the direction of our future work.

2. RELATED WORK

In this section, several representative parallel and distributed object-oriented systems are presented and compared with POEMS. The systems discussed in this section have influenced the design of POEMS to some extent. A more complete survey of parallel programming languages and models can be found in [6] and a review of programming environments and tools for Grid computing can be found in [7].

Charm++ [8, 9] is a concurrent object-oriented system with a clear separation between sequential and parallel objects. The basic unit of parallel computation in Charm++ is the chare, which is similar to a process. Charm++ also supports a novel replicated type of object called the branched chare, which can be used for data-parallel programming. However, introducing many new language constructs has the disadvantage that it can make programming quite complicated. The POEMS object model is based on object replication, but its programming interface hides the complexity of the underlying system. It introduces very few new keywords and constructs. Instead, it provides two classes of objects to support the SPMD and MPMD programming paradigms, making it easier for users to manage their parallel programs. With respect to load balancing, Charm++ incorporates a message-driven scheduling strategy that is essential for system performance and latency tolerance. It also provides the object array, a language construct which allows the programmer to manage object migration. Charm++ provides some simple strategies for migrating load on a workstation cluster, but support for automatic dynamic load balancing by the runtime system is not addressed. In POEMS, users do not need to consider the issue of load balancing, as the runtime system will automatically balance the system load to achieve high performance.
The distributed collection model proposed in pC++ [10, 11] is designed for the programmer to build distributed data structures. However, the POC model is different from the distributed collection model of pC++ in that a part of a partitioned data structure, as well as the handling methods, is encapsulated in a partition object of a parallel collection. Furthermore, in pC++, each data partition is bound to a certain physical processor and remains immovable at runtime, so the issue of load balancing is not considered. The POC model provides support for dynamic load balancing using partition objects as basic load-balancing units. Partition objects can migrate to other nodes in accordance with the load-balancing policy, so that better performance can be expected.

Other related work includes Mentat [12] and Arjuna [13]. Mentat is an MPMD-based parallel object-oriented language. Concurrent objects in Mentat are specified separately from sequential objects. This gives the programmer control over parallelism. The programmer makes computation


granularity and partitioning decisions using high-level class definitions, while the compiler and runtime system manage communication, synchronization and scheduling. However, Mentat does not provide an explicit programming model and language constructs for data-parallel applications. Arjuna provides programmers with object-level concurrency. It also introduces a mechanism of object replication; however, this mechanism is adopted only for fault tolerance.

For applications in the computational sciences, research effort has also been devoted to developing object-oriented frameworks that encapsulate parallelism, improve portability and increase programmability. Examples of systems based on such frameworks are POOMA [14] and CHAOS++ [15]. In these systems, a library of classes representing common abstractions in a certain application domain is usually provided. An object created using a class in the library can then represent a resource, e.g. a linear system solver, that is typically accessed by remote clients to get some work done. Although systems like POOMA and CHAOS++ simplify programming, they can only be used for specific domains of applications.

High-Performance C++ (HPC++) [16] consists of a C++ library and a set of tools to support a standard model for portable parallel C++ programming. The development of HPC++ capitalizes on the lessons learnt from previous efforts on developing both parallel object-oriented programming languages (e.g. Charm++, pC++ and Mentat) and parallel object-oriented libraries (e.g. POOMA and CHAOS++). HPC++ puts together the most useful abstractions and mechanisms and creates a single cohesive development environment suitable for most high-performance applications. Open HPC++ [17] is a recent extension to HPC++, which provides application-level tools and mechanisms to satisfy application requirements of adaptive resource utilization and remote-access capabilities.
Unlike POEMS, HPC++ uses the CORBA model, which is discussed below, for the explicit addressing of remote objects.

There are also a number of popular distributed object systems, such as CORBA [18, 19], DCOM [20] and Java RMI [21]. However, these systems are more concerned with providing an architecture and associated communication protocols for distributed objects in a heterogeneous environment. In contrast, the focus of POEMS is to provide a more flexible programming paradigm, in which the location of objects is transparent to the user and both SPMD and MPMD programming styles are supported. POEMS is intended for high-performance parallel computing applications, exploiting both inter-object and intra-object parallelism and providing a solution to dynamic load-balancing issues.

3. POEMS OBJECT MODEL

To support both task parallelism and data parallelism, there are two types of objects in the POEMS object model, i.e. POR and POC. In the following sections, the features

FIGURE 1. The primary-shadow replica model.

and execution models of POR and POC objects will be described in detail. Then, the data consistency issues raised in the model construction and the load-balancing support provided by the model will be addressed.

3.1. POR object model

The POR object model is used to support task parallelism. It is based on the mechanism of object replication. The purpose of object replication is to improve the performance of load balancing and to reduce the object migration overhead.

3.1.1. Object replication scheme

The object replication scheme adopted in the POR object model can be categorized as a primary-backup policy [22] (see Figure 1). For any POR object, its replicas are designated as the primary replica and the shadow replicas. Only the primary replica interacts with client objects, whereas the shadow replicas stand by and only become active when load migration takes effect. In addition, at any time in our object replication scheme, only the primary replica keeps the latest data of the object. However, both the primary and shadow replicas keep copies of all methods.

The information on the locations of the primary replicas is called the object configuration. In Figure 2, the primary replica of object 1, denoted as O1, resides on node 2, while its shadow replicas O1′ are at nodes 1 and 3. For object 2 and object 3, their primary replicas, marked as O2 and O3 respectively, are mapped on node 3, while nodes 1 and 2 keep their shadow replicas O2′ and O3′. Hence, the object configuration in this case can be expressed as

{{}, {O1}, {O2, O3}}. Note that a shadow replica has the same object id as its primary replica.

3.1.2. Method invocation and synchronization

Unlike some object models, e.g. Actor [23] and ABCL [24], that use the object as the unit of parallelism by assigning a process (or thread) to each object, a POR object is only a passive placeholder for data and methods. There is no process (or thread) associated with it. Instead, the POR object model moves the concept of a process (thread) from the object level down to the method level. The introduction of method threading enables multiple threads to execute within an object and thus provides efficient support for both intra-object parallelism and inter-object parallelism:



•   The basic idea of method threading is to allow a method of an object to be associated with a thread. An object can then have multiple threads performing its services. Method threading thus results in parallelism inside an object, i.e. intra-object parallelism.
•   Since different threads run independently inside different objects, this presents a scenario of concurrent execution of different objects in the system, i.e. inter-object parallelism.
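The idea of method threading can be sketched in ordinary thread code. The following is a minimal, hypothetical Python analogue (POEMS itself is not a Python system, and the `PORObject`, `invoke` and `scale` names are invented for the example): the object is a passive container, and each method invocation spawns its own thread.

```python
import threading

class PORObject:
    """Hypothetical sketch of a passive POR object: it owns no thread of
    its own; each method invocation runs on a freshly created thread."""

    def __init__(self, data):
        self.data = data
        self._threads = []

    def invoke(self, method, *args):
        # Each invocation spawns a thread -> intra-object parallelism.
        t = threading.Thread(target=method, args=(self, *args))
        t.start()
        self._threads.append(t)
        return t

    def join_all(self):
        for t in self._threads:
            t.join()

def scale(obj, factor, results, idx):
    results[idx] = [x * factor for x in obj.data]

# Two methods of the same object run concurrently (intra-object
# parallelism); other objects could do the same at the same time
# (inter-object parallelism).
obj = PORObject([1, 2, 3])
results = [None, None]
obj.invoke(scale, 10, results, 0)
obj.invoke(scale, 100, results, 1)
obj.join_all()
print(results)  # [[10, 20, 30], [100, 200, 300]]
```

The sketch deliberately omits the data consistency control that Section 3.3 adds: here both invocations only read the shared data.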

Since multiple threads execute at the same time, they may access the same data concurrently. To avoid conflicting data accesses, mechanisms are needed for data consistency and synchronization control. (The solution to this issue is discussed in Section 3.3.) For POR objects, all interactions between objects, whether or not they reside on the same node, are carried out through message-driven method invocation.


FIGURE 2. Object replication scheme.

FIGURE 3. Method invocation hierarchy.

When an object (the client object) wants to obtain a service from another object (the servicing object), it sends a method invocation request message to that object. After the servicing object receives the message, it creates a thread to execute the servicing method. Therefore, in the POR object model, all method executions are initiated in response to messages being received; in other words, the message is the driving force of method invocation. Asynchronous messages are adopted for the interaction between the client object and the servicing object; therefore, method invocation in the POR object model is asynchronous. After generating an invocation request, the calling method can continue its execution in parallel with the invoked method. When it needs the results of the method invocation, the calling method has to wait until the required results are available. The dynamic invocation of methods can be represented hierarchically, as shown in Figure 3. This figure shows that the methods at the upper levels (client methods) invoke the methods at the lower levels (servicing methods) asynchronously and all the method executions can proceed concurrently.

Since methods are invoked asynchronously, a synchronization point may be required for a thread to wait for the completion of the methods it has invoked. The synchronization mechanism defined in the POR object model can be described as follows, with semantics similar to those used in Cilk [25]. Methods invoked (i.e. servicing methods) prior to a synchronization point (denoted as sync) can be executed concurrently with the calling method (i.e. the client method), but only when all invoked methods have returned can the calling method proceed past the synchronization point. For example, Figure 4 shows the execution flow of a calling method. Before the calling method can call method m3, it is suspended at a synchronization point, waiting for the termination of the invoked methods m1 and m2.

3.2. POC object model

The POC object is used to support data parallelism. It is based on data partitioning and a new model of parallel object collection. The idea of data partitioning comes from pC++ [10], which has an object-based model for programmers to build distributed data structures.
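The sync semantics just described can be approximated with futures. The sketch below is a hypothetical Python analogue of the Figure 4 flow (the method names m1, m2 and m3 follow the figure; blocking on the futures plays the role of sync): m1 and m2 run concurrently with the caller, and m3 starts only after both have returned.

```python
from concurrent.futures import ThreadPoolExecutor

# Asynchronous invocation of m1 and m2; the caller continues until it
# reaches the sync point, where it waits for both results before m3.
def m1(): return 1
def m2(): return 2
def m3(a, b): return a + b

with ThreadPoolExecutor() as pool:
    f1 = pool.submit(m1)   # asynchronous invocation
    f2 = pool.submit(m2)   # caller keeps running meanwhile
    # ... the caller may do other work here ...
    a, b = f1.result(), f2.result()   # sync: wait for all invoked methods
    total = m3(a, b)                  # only now may m3 be called
print(total)  # 3
```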

FIGURE 4. Method synchronization.

Besides supporting inter-object parallelism and intra-object parallelism, the POC object model encapsulates parallelism into each method, and each method may have multiple active threads. Each thread executes the same method on a different partition of the data, at possibly different nodes. After the execution completes, the partial results are gathered from each node so that the final result is obtained. This leads to intra-method parallelism. Intra-method parallelism integrates method threading with data parallelism, so both control and data parallelism are exploited at a finer granularity.

3.2.1. Parallel object collection

Each POC object implicitly holds a parallel object collection (see Figure 5), which consists of a set of partition objects. All partition objects in the collection share the method code of the POC object, but each contains its own partition of the data declared in the corresponding POC object. Like the POR object model, the POC object model is also based on object replication. However, there is no distinction between primary and shadow replicas; all partition objects play an equal role. Correspondingly, the information on the locations of the partition objects of a POC object is called the partition object configuration. In Figure 5, partition objects P1, P2 and P3 of a POC object reside on node 1, node 2 and node 3, respectively. Hence the partition object configuration for that POC object can be expressed as {{P1}, {P2}, {P3}}.

Upon the initialization of a POC object, the method code is replicated on all the nodes. The data are automatically partitioned according to a certain partitioning scheme and distributed to the partition objects (e.g. the block and cyclic partitioning schemes discussed in Section 5). Method threading and message-driven method invocation, introduced under the POR object model, are also adopted in the POC object model. So, similar to the POR object model, the POC object model also supports both inter- and intra-object parallelism. Methods of different objects, as well as methods of the same object, can be executed in parallel. However, unlike the case of the POR object, an invocation of a method of a POC object leads to the creation of multiple threads to execute the invoked method. Essentially, a method of a POC object is an SPMD program. When it is invoked, each partition object spawns a thread to execute the code of the method and process the partition of the data allocated to it.
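The SPMD execution of a POC method, with a barrier separating parallel steps (as described for partition object synchronization below), can be sketched as follows. This is a hypothetical Python illustration with invented names (`poc_method`, the data sizes and the cyclic partitioning are all chosen for the example):

```python
import threading

# One thread per partition object runs the same SPMD method body on
# its own slice of the data; a barrier ends each parallel step, and
# the partial results are gathered to form the final result.
data = list(range(8))
n_partitions = 4
partitions = [data[i::n_partitions] for i in range(n_partitions)]  # cyclic scheme
partials = [0] * n_partitions
barrier = threading.Barrier(n_partitions)

def poc_method(rank):
    # Step 1: each partition object processes its own data partition.
    partials[rank] = sum(x * x for x in partitions[rank])
    barrier.wait()   # no thread proceeds until all reach the barrier
    # Step 2 could exchange data between partition objects here.

threads = [threading.Thread(target=poc_method, args=(r,))
           for r in range(n_partitions)]
for t in threads: t.start()
for t in threads: t.join()   # the client's sync: wait for all spawned threads
print(sum(partials))  # 140  (= 0^2 + 1^2 + ... + 7^2)
```

Joining the threads at the end corresponds to the client method waiting at its sync point; the `barrier.wait()` inside the method corresponds to the barrier between parallel steps.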

3.2.2. Partition object synchronization

In the POC object model, a barrier synchronization mechanism, similar to the one adopted in BSP [26], is used to synchronize the parallel execution of the multiple threads of the same POC method. Under this mechanism, when a thread reaches a barrier, it has to wait until the others reach the barrier; only when all threads have reached the barrier can any thread go ahead. Under this scheme, the computation in a method of a POC object forms a sequence of parallel steps.

Figure 6 illustrates the basic execution model of a POC object. It shows that when a method of a POC object is invoked by a client method, multiple threads are created to execute the method in parallel. The synchronization between the executions of the different threads is realized using barriers. Between neighboring barriers, partition objects may communicate with each other to exchange data. Moreover, when the client method needs the results from the invoked POC method, it has to wait at the synchronization point for the termination of all the threads that were spawned to execute the method. Figure 6 also illustrates the difference between the two synchronization mechanisms adopted in our object model, i.e. sync and barrier. sync is used to synchronize the concurrent execution of the client and servicing methods, whereas barrier is used to synchronize the parallel execution of the different threads running the same method of a POC object.

3.3. Data consistency issues

Intra-object parallelism introduces more complicated concurrency and synchronization control problems, since multiple threads within an object may concurrently access shared data. Some existing systems (for example [27]) provide data-level concurrency control mechanisms, whereby locks are associated with data. Although this approach introduces a higher degree of intra-object parallelism, it comes at the cost of increased programming complexity.
In addition, in a distributed environment, the management of these low-level locks becomes very complicated and inevitably introduces a large amount of overhead. Some other systems (for example [28, 29]) use object-level locks. This approach implies that, at any point of execution, at most one thread is allowed to execute one of the methods that share data with other methods. In effect, this scheme serializes concurrent

FIGURE 5. Parallel object collection model.

FIGURE 6. Execution model of a POC object.

accesses to shared data and thus reduces the degree of intra-object parallelism. To be more flexible, the POEMS object model adopts the access guard concurrency control scheme [30, 31]. An access guard consists of a pre-condition and an access mode. It is a method-level concurrency control mechanism for controlling access to shared data. Method execution may only proceed if the pre-condition of the method evaluates to true and the guard is satisfied. An access guard can have three access modes: exclusive, atomic and observe. Exclusive access means that an object is not available to other threads

until the current guard evaluation yields false or the current method execution terminates. Atomic access means that the object is available to other threads except when the current thread is accessing the shared data. Observe access means that other threads can observe the object but cannot modify it. An access guard can be attached to any method that has a data access conflict with other methods, so whether or not a method can be executed is determined by the guard specified. Corresponding constructs for guards are provided at the level of the programming interface; the details are described in Section 5.1.
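The observe and atomic modes can be approximated in ordinary thread code with a reader-writer discipline. The sketch below is a hypothetical Python analogue (POEMS expresses guards as declarative constructs attached to methods, not as library calls; the `Guarded` class and its methods are invented for the example): an observe-guarded `read()` admits concurrent observers, while an atomic-guarded `write()` excludes all other accessors only while it touches the shared data.

```python
import threading

class Guarded:
    """Sketch of observe vs. atomic access modes via a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False
        self.value = 0

    def read(self):                        # observe-guarded method
        with self._cond:
            while self._writing:           # guard: no writer active
                self._cond.wait()
            self._readers += 1
        v = self.value                     # concurrent observers allowed here
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()
        return v

    def write(self, v):                    # atomic-guarded method
        with self._cond:
            while self._writing or self._readers:   # guard: sole accessor
                self._cond.wait()
            self._writing = True
        self.value = v                     # exclusive access to shared data
        with self._cond:
            self._writing = False
            self._cond.notify_all()

g = Guarded()
g.write(42)
print(g.read())  # 42
```

This mirrors the readers and writers example discussed below: many threads may run `read()` at once, but `write()` proceeds only when no other thread is accessing the data.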


Compared to a data-level lock, the access guard is a high-level language construct and is thus easier to use. Compared to an object-level lock, the access guard presents a more flexible and finer-grained concurrency control scheme. Using this scheme, objects no longer serve as the unit of concurrency control: methods can be executed in parallel even if they share data. Since the specification of guards is expressed separately from the functional implementation of methods, minimal work is imposed on programmers and programs are easy to maintain.

For example, consider the implementation of the famous readers and writers problem [32], which models concurrent access to a database. It is acceptable to have multiple processes reading the database at the same time; however, if one process is writing to the database, no other process may have access to it. Suppose methods write() and read() implement the actions of writing and reading the database, respectively. Under the object-level lock scheme, since the lock is associated with the object (in this case, the database), methods cannot be executed simultaneously, so multiple processes cannot read the database at the same time. However, using access guards, we can specify write() with an atomic access guard and read() with an observe access guard. Different processes can then execute the read() method concurrently to read the database, but while a write() method is modifying the database, no other method is able to access the database, not even a read() method.

4. SUPPORT FOR LOAD-BALANCING

Our object model is different from other object models in its support for load balancing. Dynamic load balancing can be achieved in distributed or parallel object-oriented systems by object migration. However, migrating an object may incur a large overhead. So, in our model, dynamic object migration is supported by object replication.
As method code is duplicated on each replica, only the data need to be migrated, which reduces the cost considerably. In this section, we discuss the load-balancing support provided in our object model and show the performance advantages of our model using simulation.

4.1. Load-balancing strategies

Both the POR and POC object models support load balancing. When a POR object is created, the node on which its primary replica should be placed must be determined. During execution, the mapping of primary replicas can also be changed dynamically. Assume that a method is executed at the primary replica. Dynamic load balancing can then be achieved by dynamically changing the mapping of the primary replicas (that is, the primary replica configuration): the primary replica and a shadow replica can be switched dynamically at runtime. Since the shadow replica keeps the object methods, switching replicas only involves migration of the object data.

Dynamic load balancing for POR objects can also be achieved by dynamically deciding the node at which a

method may be executed. Since the method code is available at all the replicas, a method can be executed at a shadow replica as long as the required object data are made available. The adoption of access guards for concurrency control in our object models makes the concurrent execution of methods on the same object possible and thus provides more opportunities to perform dynamic load balancing. Methods that have no access conflict can be executed concurrently on both the primary and shadow replicas, depending on the load of the system.

When a POC object is created, its constituent partition objects can be initially mapped to the nodes according to certain strategies. For example, partition objects can be placed randomly on the nodes of the system, or mapped to those nodes with a lighter load. During execution, the partition objects of a POC object can be dynamically migrated to other nodes according to the runtime load situation in the system. Thus, as a basic unit of load migration, partition objects provide effective and flexible support for dynamic load balancing.

4.2. Threshold-based dynamic load-balancing mechanism

Typically, a dynamic load-balancing mechanism consists of four components that define the policies to be used: the Transfer Policy, Selection Policy, Location Policy and Information Policy [33]. The dynamic load-balancing mechanism proposed for our object model is based on a threshold approach [34]. The transfer policy used in our mechanism can be categorized as a sender-initiated policy: when the load of a node exceeds the sending load threshold, Ts, the transfer policy tries to transfer load to lightly loaded nodes. The unit of load transfer in our system is an object. Load balancing is achieved by executing the load-balancing strategies, according to the type of object, described in the last section.
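The sender-initiated transfer decision, together with the location policy's receiving threshold Tr, can be sketched as follows. This is a hypothetical Python illustration (the node names, the load table and the threshold values are invented for the example): a node whose load exceeds Ts picks the least-loaded node below Tr, and stays local if none exists.

```python
# Sketch of the threshold-based transfer/location decision:
# transfer only when the local load exceeds Ts; migrate to the
# least-loaded node whose load is below Tr (Tr <= Ts), else stay put.
def pick_destination(local_node, loads, Ts, Tr):
    if loads[local_node] <= Ts:
        return local_node                 # not overloaded: no transfer
    candidates = {n: l for n, l in loads.items() if l < Tr}
    if not candidates:
        return local_node                 # no lightly loaded node: stay local
    return min(candidates, key=candidates.get)

loads = {"n1": 9, "n2": 3, "n3": 5}       # hypothetical local load table
print(pick_destination("n1", loads, Ts=8, Tr=4))  # n2 (lowest load below Tr)
print(pick_destination("n3", loads, Ts=8, Tr=4))  # n3 (below Ts: no transfer)
```

Keeping Tr below Ts, as the location policy described below requires, is what prevents objects from thrashing back and forth in a heavily loaded system.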
In our simulation studies (described in the following sections), the ‘dynamic switching of a primary replica’ was adopted as the selection policy for the POR objects. The location policy consists of looking for those nodes whose load is less than the receiving load threshold, Tr , Tr ≤ Ts . Of these nodes, the one with the lowest load is chosen as the destination for the object migration. If no such node is found, method invocation is serviced on the current node and object migration does not take place. Note that to respond to a method invocation, a thread will be created to execute the method at the node where the primary replica resides. The reason for using a receiving load threshold is to avoid thrashing in a heavily loaded system. Each node in our system also needs to keep the load information of other nodes of the system. In order to update the local load table at each node, load information is piggybacked on each message (that is, method invocation requests and replies). Each node checks its local load table every fixed time interval (referred to as the load probing interval, Tp ). If there are some nodes whose last load updating Vol. 45, No. 5, 2002

POEMS

547

TABLE 1. POR program generation parameters. Parameter Hierarchy depth of program Number of method invocations occurring in a method Duration of each phase of the execution of a method Object data size Parameter data size for a method invocation Return data size for a method invocation

Distribution

Mean

Deviation

Constant Normal Normal Normal Normal Normal

6 3 100 ms 1000 bytes 100 bytes 100 bytes

– 2 50 ms 100 bytes 100 bytes 100 bytes

TABLE 2. POC program generation parameters. Parameter

Distribution

Mean

Deviation

POC Object

Size of data partitioned Number of partition objects

Normal Normal

10 kbyte 10

1 kbyte 2

POC Method

Number of barrier synchronizations Execution duration of each step Number of communications in each step

Normal Normal Normal

6 100 ms 2

3 50 ms 1

time falls outside a time window (referred to as the load update window, W), probing messages will be sent to these nodes to collect their up-to-date load information. Thus, the information kept in the local load table at each node can always be assumed to be up-to-date.

4.3. Simulation environment and parameters

A simulation system has been constructed to evaluate the characteristics of our dynamic load-balancing strategy. To explore the parallelism supported by our object model, the ideal platform on which to implement the model is a cluster of SMPs, where each node is a symmetric multiprocessor. The simulation therefore concentrates on a homogeneous distributed system; that is, a collection of identical nodes, each consisting of an equal number of CPUs. The nodes are assumed to be connected by a high-speed local area network. In order to simulate the execution of various parallel object-oriented programs, parameterized synthetic parallel programs are automatically generated as the input of the simulation. The typical events occurring during the execution of a thread, including method invocation and synchronization, object creation and deletion, and thread execution and termination, are generated and arranged in a sequence according to the logic of the program. In addition, probability distributions are used to control the total number of each kind of event to be executed by a thread. Table 1 summarizes the main parameters used for POR program generation, the corresponding probability distribution and the default mean and deviation. An example of the event graph of a parallel object-oriented program is shown in Figure 3. The vertical direction of the graph corresponds to the depth of method invocation, while the horizontal direction determines the number of invocations. The tree-structured graph also illustrates both inter-object parallelism and intra-object parallelism.

Table 2 summarizes the main parameters for POC program generation. A POC method may consist of a number of steps. There is a barrier synchronization at the end of each step. Communications between the threads executing the same method on different partition objects, e.g. for fetching data from other partition objects, may also be generated. Thus, by adjusting these parameters, a variety of programs with different characteristics can be created. Overhead costs are inevitably involved in our dynamic load-balancing strategy; for example, the cost of load probing, the cost of making load-balancing decisions and the cost of data transfer. CPU time, network bandwidth and latency are consumed in performing these tasks. The cost assumptions for some parameters are listed in Table 3. The values used in our experiments are borrowed from the literature [26, 35, 36, 37]. All the simulation experiments conducted use the values listed in this table.

TABLE 3. Simulation overhead cost assumptions.

Parameter                         Value
Manager object processing time    6.0 ms
CPU context switching time        0.6 ms
Load computing time               3.0 ms
Object creation time              1.7 ms
Object deletion time              0.4 ms
Thread creation time              1.3 ms
Thread termination time           0.2 ms
Network latency time              2.4 ms
Network bandwidth                 10 Mb/s

4.4. Simulation experiments and results

Since decreasing the execution time of a parallel object-oriented program is the primary objective of load balancing,

The Computer Journal, Vol. 45, No. 5, 2002
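The generator itself is not shown here; the parameter-sampling step behind the distributions in Tables 1 and 2 can be sketched in plain Java as follows (the class and method names are illustrative, and clamping negative draws to zero is an assumption, not something stated in the paper):

```java
import java.util.Random;

// Draws POC program parameters from normal distributions, using the
// Table 2 defaults: barrier count ~ N(6, 3), step duration ~ N(100 ms, 50 ms),
// communications per step ~ N(2, 1). Negative draws are clamped to zero.
class PocProgramGenerator {
    private final Random rng;

    PocProgramGenerator(long seed) { rng = new Random(seed); }

    // One normally distributed, non-negative integer sample.
    int sampleCount(double mean, double dev) {
        return (int) Math.max(0, Math.round(mean + dev * rng.nextGaussian()));
    }

    // {number of steps, duration of each step in ms, communications per step}
    int[] sampleProgram() {
        int steps  = sampleCount(6, 3);     // barrier synchronizations
        int stepMs = sampleCount(100, 50);  // execution duration of each step
        int comms  = sampleCount(2, 1);     // communications in each step
        return new int[] {steps, stepMs, comms};
    }
}
```

Seeding the generator makes a simulation run reproducible, which is what allows the same synthetic programs to be re-executed later on the real runtime system.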

FIGURE 7. Average program execution time for different load-balancing algorithms. (Normalized program execution time vs. number of nodes, 0 to 16, for the NoLB, RANDOM, THRESHOLD and NoCOST strategies.)

FIGURE 8. Effect of program parameters on load-balancing mechanism. (Normalized program execution time vs. number of nodes for three program groups: mean method invocation num 2 with mean partition object num 8; 3 with 16; and 4 with 32.)

to measure and compare the effectiveness of different load-balancing strategies, the total execution time of the program being simulated is chosen as the performance index in our experiments. Figure 7 compares the performance of the threshold-based load-balancing mechanism (referred to as THRESHOLD) with the following strategies:

• NoLB: there is no dynamic load balancing of objects apart from the initial random placement of newly created objects.
• RANDOM: when an object is eligible for transfer, the destination is randomly selected.




• NoCOST: an (unrealistic) ideal case, in which a threshold policy is used, but assuming that the cost of collecting information and transferring object data is zero.

The simulation parameters used for the experiments are Ts = 6, Tr = 3, Tp = 5 s and W = 1 s. Each point shown in the graph is obtained by averaging the results of 10 simulation runs with program generation parameters taken from the distributions given in Tables 1 and 2. The results are normalized with respect to the execution time when no load balancing is used. Because of this normalization, for

each data point only a small standard deviation of the results from different simulation runs is observed (in the range of 0.015 to 0.025).

In the second experiment, we explore how the load-balancing mechanism is affected by different program parameters. As in the first experiment, the input programs are generated with the parameters in Tables 1 and 2. By varying the mean values of the parameters number of method invocations occurring in a POR method and number of partition objects in a POC object, we obtain three groups of programs with different characteristics. Each group of generated programs is executed on a different number of nodes using the simulation parameters Tp = 3 s and W = 1 s. The average program execution time is shown in Figure 8, where each point in the figure is normalized against the case when no load balancing is adopted. The experimental results show that the program parameters we varied have minimal impact on the performance of the load-balancing mechanism.

5. POEMS PROGRAMMING INTERFACE

On top of the POEMS object models, a programming interface is designed to facilitate the development of parallel object-oriented applications on distributed systems. In addition to hiding the complexity of the underlying object models, the POEMS programming interface faithfully follows the semantics of the models. The objectives of the programming interface are expressiveness and ease of use. Instead of designing a completely new language, the POEMS programming interface is an extension of the Java programming language [28], enhanced with parallelism features. To make minimal changes to the Java language, it provides a library of pre-defined classes and specifies very few new keywords and constructs. The POEMS programming interface is based on two classes of objects, i.e. POR and POC. They are proposed to support the POR object model and the POC object model, respectively, at the language level.
The features of these two classes will be described in the following sections.

5.1. POR class definition

The POR class is used to support task-parallel programming. All POR classes are derived from a base class named Replication. The syntax of the POR class takes a similar form to that of a Java class (see Figure 9). The main features of the POR class include the following.

Method invocation. Method invocation on a POR object is syntactically the same as for Java objects. Semantically, there are two major differences. Firstly, because POR objects have disjoint address spaces and may reside on different nodes, parameter passing in method invocation always uses call-by-value semantics. Thus, when references are used as arguments, the designated object will be sent to the invoked method. Secondly, as described in the earlier sections, method invocation in POEMS is asynchronous. Threads are created to execute methods. To compensate

class POR-class-name extends Replication {
    < constructor >
    < data & methods >
}

FIGURE 9. POR class definition.

class POC-class-name extends Collection {
    < constructor >
    < data & methods >
}

FIGURE 10. POC class definition.

for asynchronous method invocation, a synchronization mechanism sync is introduced in the POR object model. Correspondingly, a pre-defined method, sync(), is provided in the Replication class for creating a synchronization point. The semantics of this method is that any method invoked prior to sync() can be executed concurrently with the calling method, but only when all of them have returned can the calling method proceed from the synchronization point. Programmers are responsible for managing the synchronization among different methods using the sync() method. For example, for the method

void fun( ) {
    x.m1( );
    y.m2( );
    sync( );
    x.m3( );
}

the execution flow is just as shown in Figure 4.

Data consistency. As stated in Section 3.3, the access guard concurrency control scheme is adopted to solve the problem of concurrent data accesses between methods inside a POR object. Under this scheme, synchronization between methods is fulfilled through guards that are attached to those methods. Method execution may proceed only if its pre-condition evaluates to true and the corresponding guard is satisfied. The specification of an access guard is put into square brackets and placed prior to the definition of a method as shown below:

[access guard] < method definition >

The specification of an access guard can be expressed as:

access guard ::= [pre-condition] access-mode variable-list

Three modes of access guard are supported, i.e. exclusive, atomic and observe (the definitions of these modes were explained in Section 3.3). The variable-list contains the data that require the corresponding access restriction. The usage of access guards will be illustrated in the parallel quick sort example presented in Section 6.1.
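POEMS enforces access guards in its runtime; outside POEMS, the effect of an [exclusive sp] guard on two stack methods can be approximated in plain Java with a lock dedicated to the guarded data (a sketch; the class and field names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Stand-in for an [exclusive sp] access guard: both stack operations
// acquire the same lock, so at most one thread at a time can be executing
// either of them, while methods without the guard run freely.
class GuardedStack {
    private final Deque<int[]> sp = new ArrayDeque<>();       // guarded data
    private final ReentrantLock spGuard = new ReentrantLock();

    void insertStack(int low, int high) {      // [exclusive sp]
        spGuard.lock();
        try { sp.push(new int[] {low, high}); }
        finally { spGuard.unlock(); }
    }

    int[] deleteStack() {                      // [exclusive sp]
        spGuard.lock();
        try { return sp.isEmpty() ? null : sp.pop(); }
        finally { spGuard.unlock(); }
    }
}
```

The guard is tied to the data (sp), not to a single method, which is why two different methods here contend on one lock.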

POC-class-name (data-structure, data-partition-scheme, num-of-partition-objects) {
    < Initialize data partitioning scheme >
    < Initialize number of partition objects >
    < Partition data structure according to the partitioning scheme
      and distribute data partitions to the partition objects >
}

FIGURE 11. POC class constructor definition.

5.2. POC class definition

The POC class is used to support data-parallel programming. All POC classes are derived from a base class named Collection. Similarly, the syntax of the POC class takes the same form as that of a Java class (see Figure 10). Methods defined in a POC class are invoked in SPMD mode. A set of threads, one for each partition object, will be started to execute the invoked method. Different threads will operate on different partitions of the data. Each POC class must provide a constructor function. In general, the definition of a constructor takes the form shown in Figure 11. In the constructor of a POC class, the following must be initialized: the data structure, the data partitioning scheme and the number of partition objects. The data structure is the structure of the data to be partitioned. Each POC object is restricted to contain only one such data structure. The data partitioning scheme describes how the data will be partitioned. To create an instance of a POC class, programmers are required to explicitly call the constructor. When the constructor is executed, the data structure will be automatically partitioned among the specified number of partition objects according to the partitioning scheme. The most commonly used schemes for data partitioning include P BLOCK, P CYCLIC and P WHOLE. The P BLOCK scheme means that each partition will get a consecutive block of data elements. For instance, in the example of parallel prefix sum (which will be discussed in Section 6.2), the integer series is partitioned by the P BLOCK scheme. As Figure 12 shows, each partition object gets a consecutive block of elements of the integer series. If P CYCLIC is used, each partition will get data in a round-robin fashion. Figure 13 shows the data partitioning result if the same integer series is partitioned using the P CYCLIC scheme. The P WHOLE scheme means that the data are not partitioned and they are distributed to the partition objects as a whole.
These data partitioning schemes can be used in combination for multi-dimensional data structures, as shown in the matrix multiplication example (see Section 6.3).
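For illustration, the element-to-partition mapping induced by the two schemes can be written down directly (plain Java; the helper names are ours, not part of the POEMS interface):

```java
// Owner of element i when n elements are split among p partition objects.
class PartitionSchemes {
    // P_BLOCK-style: consecutive blocks of ceil(n/p) elements each.
    static int blockOwner(int i, int n, int p) {
        int blockSize = (n + p - 1) / p;   // ceiling division
        return i / blockSize;
    }

    // P_CYCLIC-style: elements dealt round-robin across partitions.
    static int cyclicOwner(int i, int p) {
        return i % p;
    }
}
```

For the nine-element series of Figures 12 and 13 with three partitions, blockOwner assigns indices 0-2, 3-5 and 6-8 to partitions 0, 1 and 2, while cyclicOwner deals them out one at a time.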

FIGURE 12. The P BLOCK partitioning scheme. (The series 1-9 is split into consecutive blocks: partition object 1 gets 1, 2, 3; partition object 2 gets 4, 5, 6; partition object 3 gets 7, 8, 9.)

FIGURE 13. The P CYCLIC partitioning scheme. (The series 1-9 is dealt round-robin: partition object 1 gets 1, 4, 7; partition object 2 gets 2, 5, 8; partition object 3 gets 3, 6, 9.)

5.3. Programming facilities for POC classes

Some programming facilities are provided to facilitate data-parallel application development. These facilities mainly include the pre-defined methods and the language constructs that support data alignment.

5.3.1. Pre-defined methods in the POC class

A number of methods that are fundamental to the runtime system are pre-defined in the Collection class. These methods can be divided into the following categories according to their functionality.

Initialization

• set partition scheme(). This is used to specify the data partitioning scheme of a given data structure (e.g. set partition scheme(a, P BLOCK) specifies that data structure a is block-partitioned).
• set num partition(). With an integer parameter, this method specifies the number of partition objects (e.g. set num partition(n) sets the number of partition objects to n).
• get this partition(). This is used to get the local data partition (e.g. to get the partition of a assigned to the created partition object, get this partition(a) is used).

Synchronization

• barrier(). This is a barrier synchronization primitive used to synchronize the parallel execution of multiple threads when a method of a POC object is invoked (as explained in Section 3.2.2).
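Outside POEMS, the same barrier pattern is available in standard Java as java.util.concurrent.CyclicBarrier; a minimal sketch with n worker threads (the class and method names are ours):

```java
import java.util.concurrent.CyclicBarrier;

// Each worker finishes a local step, then waits at a common barrier
// before any of them proceeds, mirroring barrier() in a POC method.
class BarrierDemo {
    static int runWorkers(int n) {
        CyclicBarrier barrier = new CyclicBarrier(n);
        int[] done = new int[1];
        Thread[] ts = new Thread[n];
        for (int i = 0; i < n; i++) {
            ts[i] = new Thread(() -> {
                try {
                    synchronized (done) { done[0]++; }  // local step
                    barrier.await();                    // wait for all workers
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            ts[i].start();
        }
        try { for (Thread t : ts) t.join(); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return done[0];
    }
}
```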


Information

• this partition(). Partition objects are indexed. Without any parameter, this method returns the index of the current partition object. With an integer parameter, it returns the index of the partition object that has a specified offset relative to the current partition object. For example, this partition(n) returns the index of the nth partition to the right of the current partition.
• num partition(). This method returns the total number of partition objects in a POC object.
• partition size(). This method returns the total number of data elements in a data partition.
• partition size in dim(). This returns the number of data elements in a specified dimension of a data partition.

Communication

• fetch element(). This method takes a variable number of parameters, but the first must be the index of a partition object. The rest are the local indexes of the data element in the specified partition object, one for each of the dimensions. For example, fetch element(p, i, j) gets the value of the data element with index (i, j) from partition p.

5.3.2. Data alignment support in the POC class

At runtime, communication may occur between the partition objects of different POC objects. If partition objects that communicate heavily reside on different nodes, the communication cost will be very high. To reduce this cost, a language construct is provided to support POC object alignment. Syntactically, alignment is marked by the keyword alignwith, followed by the aligning POC object and an alignment scheme. As the following statement shows, the alignment construct is put right after the instantiation of the aligned POC object. The aligning object refers to the object used for aligning and the aligned object refers to the object being aligned:

POC object1 = new A POC Class(. . . ) alignwith POC object2 < alignment scheme >;

Alignment is performed at the creation time of a POC object. In the above statement, when POC object1 is created, each of its partition objects will be placed on the same node as its partner partition object held by the earlier created POC object POC object2, according to the scheme specified by alignment scheme. The alignment scheme is a mapping pattern between the partition objects of two different POC objects. The schemes currently allowed are A BLOCK and A CYCLIC. The A BLOCK scheme means that, for a newly created POC object, a consecutive block of its partition objects will align with a partner partition object held by the other POC object. If the A CYCLIC scheme is used, partition objects are aligned in a round-robin fashion. When no alignment scheme is explicitly specified, the A BLOCK scheme is assumed by default.
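The partner mapping induced by the two alignment schemes can be sketched as follows (plain Java; the helper names are ours, and the grouping policy for A BLOCK is an assumption consistent with the matrix example below):

```java
// Partner partition (in the aligning object with nOld partitions) chosen
// for partition i of a newly created object with nNew partitions.
class AlignmentSchemes {
    // A_BLOCK-style: consecutive groups of new partitions share a partner.
    static int blockPartner(int i, int nNew, int nOld) {
        int groupSize = (nNew + nOld - 1) / nOld;  // ceiling division
        return i / groupSize;
    }

    // A_CYCLIC-style: partners are assigned round-robin.
    static int cyclicPartner(int i, int nOld) {
        return i % nOld;
    }
}
```

With nNew = 9 partition objects aligned to nOld = 3 (e.g. a 3 x 3 partitioned matrix C aligned with row-partitioned A), blockPartner maps each row of three C partitions to the same A partition.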

FIGURE 14. Data alignment between matrix A and matrix C.

Generally, the number of partition objects held by the aligned object should be no less than that of the aligning object; otherwise, the A BLOCK and A CYCLIC schemes are effectively indistinguishable. To illustrate the usage of data alignment between POC objects, consider the matrix multiplication example (discussed in detail in Section 6.3). As shown in Figure 14, the product matrix C is aligned with the multiplicand matrix A under the A BLOCK scheme using the following statement:

Matrix C = new Matrix (. . . ) alignwith A;

The partition objects in each row of matrix C will be mapped to the same node as the partition objects of the corresponding row of matrix A. Thus, when the dot product is performed, the communications between corresponding partition objects of matrix A and matrix C can be carried out locally.

6. PROGRAMMING EXAMPLES

Three examples, parallel quick sort, parallel prefix sum and matrix multiplication, are presented in this section to illustrate how POR and POC classes are used.

6.1. Parallel quick sort

This example shows the solution of parallel quick sort [38] using a POR class. The procedure can be described as follows (see Figure 15). The elements to be sorted are stored in an array and a stack is used to store the index-pairs of sub-arrays that are still unsorted. Initially the stack contains the index-pair of the entire array. First, a thread attempts to get an index-pair from the stack. Then, it partitions the array elements into two sub-arrays based on a selected median value. The two sub-arrays contain elements either less than or greater than the median value. After partitioning, the thread pushes the index-pairs of the two sub-arrays back onto the stack and spawns two new threads to repeat the above procedure. The array is sorted when the stack becomes empty. The pseudo-code is listed in Figure 16. Parallel quick sort is an example of the divide-and-conquer approach.
Once an array is partitioned, the two unsorted sub-arrays form independent problems that can be solved simultaneously. In this example, multiple threads are spawned as a consequence of asynchronous method invocation. In addition, according to the dynamic load-balancing policy employed, these threads can be executed


FIGURE 15. Parallel quick sort. (A shared stack holds the index-pairs of unsorted sub-arrays. Step 1: a thread fetches an index-pair from the stack; step 2: it partitions the corresponding sub-array around a median value; step 3: it sends the index-pairs of the partitioned sub-arrays back to the stack.)

public class ParaQSort extends Replication {
    private IndexPair[ ] sp;    // stack of index-pairs of unsorted sub-arrays

    public ParaQSort(int[ ] A) { . . . }    // constructor

    [exclusive sp] private IndexPair deleteStack( );    // pop bounds of sub-array from stack

    [exclusive sp] private void insertStack(int low, int high);    // push bounds of sub-array to stack

    private void qsort(int median, IndexPair bounds);    // partition a sub-array

    public void sort( ) {    // quick sort
        if (the stack is empty) return;    // sorting complete
        else bounds = deleteStack( );    // get an index-pair from the stack
        sync( );    // synchronization
        qsort(median value, bounds);    // partition the sub-array
        sync( );
        // put the index-pairs of the partitioned sub-arrays back onto the stack
        if (bounds.low != median index − 1) insertStack(bounds.low, median index − 1);
        if (bounds.high != median index + 1) insertStack(median index + 1, bounds.high);
        sync( );
        // spawn two more threads to perform sorting
        sort( );
        sort( );
    }    // end sort method

    public static void main(String[ ] args) {
        int A[. . . ];    // integer array to be sorted
        ParaQSort S = new ParaQSort(A);
        S.sort( );
    }    // end main method
}    // end ParaQSort class

FIGURE 16. Class definition for parallel quick sort.

on either the primary replica or a shadow replica of the ParaQSort object. The code also shows that both the deleteStack() and insertStack() methods are prefixed by an access guard in their definitions, since these two methods require mutual exclusion on the stack pointer sp. The exclusive access guard indicates that, at any point in time, at most one thread can be executing either of these two methods. Other methods that do not have an access guard can run concurrently with the above two methods.
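For comparison, the same divide-and-conquer structure can be expressed with ordinary Java threads standing in for the asynchronous sort() invocations (a simplified sketch: the POEMS runtime, replicas, shared stack and access guards are elided, and a Lomuto partition replaces the median-based one):

```java
// Divide-and-conquer quick sort: after partitioning, the two sub-arrays
// are sorted by two concurrently running threads, and join() plays the
// role of sync() in the POEMS version.
class ThreadedQuickSort {
    static void sort(int[] a, int low, int high) {
        if (low >= high) return;
        int pivot = a[high], i = low - 1;              // Lomuto partition
        for (int j = low; j < high; j++)
            if (a[j] < pivot) { int t = a[++i]; a[i] = a[j]; a[j] = t; }
        int t = a[i + 1]; a[i + 1] = a[high]; a[high] = t;
        int mid = i + 1;
        Thread left  = new Thread(() -> sort(a, low, mid - 1));
        Thread right = new Thread(() -> sort(a, mid + 1, high));
        left.start(); right.start();
        try { left.join(); right.join(); }             // cf. sync()
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Spawning a thread per partition is deliberate here to mirror the paper's execution model; a production version would bound the parallelism, e.g. with a thread pool.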

FIGURE 17. Parallel prefix sum. (The series 1-9 is block-partitioned into (1 2 3), (4 5 6), (7 8 9); sequential prefix sums inside each partition give (1 3 6), (4 9 15), (7 15 24); two parallel steps across partitions then yield the final result (1 3 6 10 15 21 28 36 45).)

6.2. Parallel prefix sum

This example illustrates the application of the POC class to parallel prefix sum [38]. The problem is to calculate the prefix sum of an integer series: given x0, x1, . . . , xn−1, it computes the partial sums si = x0 + x1 + · · · + xi for i = 0, 1, . . . , n − 1. The computation procedure can be divided into two steps (see Figure 17). First, a sequential prefix sum is carried out inside each data partition. Then a parallel prefix sum is conducted among the different data partitions. The pseudo-code of the class definition is shown in Figure 18. Here, the POC class is used. The array that stores the integer series is partitioned by the P BLOCK scheme in the PrefixSum class, which extends the pre-defined Collection class. An instance of the PrefixSum class will have multiple partition objects. When the method sum is invoked, each partition object will spawn a thread to execute the method and process the data partition allocated to it. During execution, some partition objects need to exchange data with other partitions and a barrier is used in the method to synchronize the parallel execution of these threads. To balance the workload in the system, partition objects can be migrated to other nodes at runtime depending on the dynamic load-balancing algorithm used.

6.3. Matrix multiplication

A matrix multiplication algorithm is presented in this example. Consider two matrices A and B of size N × N. The problem is to calculate the product of A and B and store the result in matrix C. Figure 19 illustrates the partitioning scheme for these three matrices. Matrix A is partitioned in blocks of rows. Assuming that matrix A is partitioned into n parts, the first data partition will get the first N/n rows, the second data partition will get the next N/n rows and so on. Matrix B is partitioned in blocks of columns and matrix C is partitioned in blocks of both rows and columns.
The multiplication is accomplished by computing the dot product between the row-block partition of matrix A and the column-block partition of matrix B. The class definition for the matrix is listed in Figure 20 and the main function is given in Figure 21. Matrices A, B and C are three instances of the Matrix class, which extends the Collection class. When the multiply method


of matrix C is invoked, n^2 threads will be spawned to perform the dot product simultaneously. Since the three matrices are three different instances of the Matrix class, communication between partition objects of these matrices is inevitable during the computation. This is implemented using the pre-defined method fetch element of the Collection class. Note that both POC and POR classes are used in this example. In Figure 21, a POR class is defined to print the result. Data elements of array C are distributed when MC (an instance of the Matrix class) is created, and they are implicitly collected when an instance of PrintMatrix is created using the same array. The pre-defined method sync() is used to make sure that the data elements of matrix C are only collected after all the threads executing the multiply method have completed.

7. IMPLEMENTATION ISSUES

The strategy for the implementation of the POEMS programming interface is to pre-process a POEMS program into a proper Java program using a translator. The resulting Java program is then compiled together with the pre-defined classes and the POEMS runtime library. The Java interpreter in turn executes the compiled program. A prototype runtime system has been implemented on a network of Sun workstations using the Java programming language. Its implementation is based on the CORBA [18] architecture.

7.1. Manager Object

In the prototype implementation, a special module called the Manager Object is designed to manage the execution of POEMS applications. Manager Objects symmetrically reside on every node. During execution, all Manager Objects play the same role. Figure 22 illustrates Manager Objects on two nodes. The major tasks of the Manager Object include managing local objects, servicing the communication between objects, monitoring the system load situation and making load-balancing decisions. The Manager Object consists of a set of function components (see Figure 23), as follows.





• Load Monitor (LM). The LM is responsible for monitoring the load situation. The LMs of different nodes exchange their load information with each other according to the requirements of the dynamic load-balancing algorithms. The collected load information is forwarded to the Load Balancer.
• Load Balancer (LB). According to the load information provided by the LM, the LB makes decisions to balance the system load using the proposed load-balancing algorithms.
• Execution Manager (EM). The EM manages the execution of applications. In other words, the events occurring in the execution of applications, such as method execution, invocation and synchronization, are all handled by the EM.


public class PrefixSum extends Collection {
    private int[ ] p;

    // constructor
    public PrefixSum(int[ ] a, scheme s, int num) {
        set partition scheme(a, s);    // data partitioning scheme
        set num partition(num);    // number of partitions
        p = get this partition(a);    // get current data partition
    }

    public void sum( ) {
        // perform sequential prefix sum inside the local partition
        for (int i = 1; i < partition size( ); i++)
            p[i] = p[i] + p[i − 1];
        barrier( );    // barrier synchronization
        // perform parallel prefix sum across partitions
        for (int i = 0; i < log2(num partition( )); i++) {
            left index = this partition(−2^i);    // partition at offset −2^i
            if (left index >= 0) {
                // get the last element from that partition
                temp = fetch element(left index, partition size( ) − 1);
                barrier( );    // barrier synchronization
                for (int j = 0; j < partition size( ); j++)
                    p[j] = p[j] + temp;
                barrier( );    // barrier synchronization
            }
        }
    }    // end sum method

    public static void main(String[ ] args) {
        int a[. . . ];    // the integer array
        int size = . . . ;
        PrefixSum C = new PrefixSum(a, P BLOCK, size);
        C.sum( );
    }    // end main method
}    // end PrefixSum class

FIGURE 18. Class definition for parallel prefix sum.
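A sequential plain-Java rendering of the same two-phase algorithm may clarify what the partition objects compute together (a sketch; the carry propagation is serialized here, whereas POEMS performs it in parallel steps across partitions):

```java
// Two-phase prefix sum over block partitions: phase 1 computes a prefix
// sum inside each block; phase 2 adds, to every element of block b, the
// total of all elements in blocks 0..b-1.
class BlockPrefixSum {
    static void prefixSum(int[] a, int blocks) {
        int n = a.length, blockSize = (n + blocks - 1) / blocks;
        for (int b = 0; b < blocks; b++) {                     // phase 1: local
            int start = b * blockSize, end = Math.min(start + blockSize, n);
            for (int i = start + 1; i < end; i++) a[i] += a[i - 1];
        }
        for (int b = 1; b < blocks; b++) {                     // phase 2: carry
            int start = b * blockSize, end = Math.min(start + blockSize, n);
            int carry = a[start - 1];    // running total of blocks 0..b-1
            for (int i = start; i < end; i++) a[i] += carry;
        }
    }
}
```

On the series 1-9 of Figure 17 with three blocks, phase 1 produces (1 3 6), (4 9 15), (7 15 24) and phase 2 yields (1 3 6 10 15 21 28 36 45).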

FIGURE 19. Data partitioning scheme for matrix multiplication. (Matrix A is partitioned in row blocks, matrix B in column blocks and matrix C in blocks of both rows and columns.)



• Object Manager (OM). The OM keeps track of the object configuration and maintains the object configuration table at the node.

The EM is the most important component in the Manager Object. It is composed of two modules: the guard evaluator and the thread creator (see Figure 24). The guard evaluator controls the servicing of method invocation requests according to the concurrency control scheme discussed in Section 3.3. When a method invocation request is authorized by the guard evaluator, the thread creator starts a new thread to execute the method. In order to effectively manage method invocations, the following queues are handled by the EM.

• Invocation Request Queue: the method invocation requests waiting to be serviced are put into this queue.
• Pending Request Queue: the method invocation requests that cannot be serviced immediately because of conflicts in the access guards are put in this queue.


public class Matrix extends Collection {
    private int[ ][ ] c;

    // constructor
    public Matrix(int[ ][ ] M, scheme row, scheme column, int num) {
        set partition scheme(M, row, column);    // data partitioning scheme
        set num partition(num);    // number of partitions
        c = get this partition(M);    // get current data partition
    }

    public void multiply(Matrix A, Matrix B) {
        // determine partition index of matrix A and B required for the dot-product
        A index = this partition( ) / sqrt(num partition( ));
        B index = this partition( ) mod sqrt(num partition( ));
        // perform the dot-product for the corresponding partitions of matrix A and B
        for (int i = 0; i < partition size in dim(0); i++)
            for (int j = 0; j < partition size in dim(1); j++) {
                c[i][j] = 0;
                for (int k = 0; k < N; k++)
                    c[i][j] = c[i][j] + A.fetch element(A index, i, k) * B.fetch element(B index, k, j);
            }
    }    // end multiply method
}    // end Matrix class

FIGURE 20. Class definition for matrix multiplication.

public class PrintMatrix extends Replication {
    private int[ ][ ] M;

    public PrintMatrix(int[ ][ ] m) { M = m; }    // constructor
    public void print( );

    public static void main(String[ ] args) {
        int N = . . . ;
        int size = . . . ;
        int A[N,N], B[N,N], C[N,N];
        . . .
        // create Matrix instances
        Matrix MA = new Matrix(A, P BLOCK, P WHOLE, size);
        Matrix MB = new Matrix(B, P WHOLE, P BLOCK, size);
        Matrix MC = new Matrix(C, P BLOCK, P BLOCK, size*size) alignwith MA;
        . . .
        MC.multiply(MA, MB);    // perform matrix multiplication
        sync( );    // synchronization
        . . .
        PrintMatrix PM = new PrintMatrix(C);    // create PrintMatrix instance
        PM.print( );    // print out result
    }    // end main method
}    // end PrintMatrix class

FIGURE 21. Main program for matrix multiplication.



This queue has a higher priority than the Invocation Request Queue.
• Invocation Reply Queue: the results returned by servicing methods are put into this queue.

In addition, in order to keep track of the execution status of applications, the EM maintains the following tables.

• Thread Execution Table. Once a thread is created to execute a method, the EM creates an entry in the thread execution table for that thread.
• Method Invocation Table. The method invocation table traces the execution status of the method(s) invoked by the current thread.

When the current thread reaches a synchronization point, the EM checks the method invocation table to find out whether


FIGURE 22. Manager Objects.

FIGURE 23. Structure of the Manager Object.

all the invoked methods prior to this synchronization point have completed. Only when all invoked methods have finished can the current thread proceed.



7.2. Runtime support for the load-balancing strategies

The load-balancing strategies discussed in Section 4 are implemented in the prototype system. In this section, runtime support for the ‘dynamic switching of a primary replica’ is discussed. The switching of a primary replica involves the migration of object data from the primary replica site to one of its shadow replica sites. In POEMS, only object data need to be transferred, since each node of the system already keeps a copy of the methods. The switching of a primary replica is accomplished through the following steps (see Figure 25).

• The LB of the old primary replica site sends a message to the EM if some object is eligible for primary replica switching.







• The EM receives the message and stops servicing new or queued invocation requests for that object. Meanwhile, an object creation message is directed to the OM of the new primary replica site, together with the object data. The invocation request for that object is also piggybacked and is put into the invocation request queue of the new primary replica site after arrival.
• A new primary replica is created on the new site. The new location of the primary replica is immediately broadcast to the other nodes of the system. In addition, a switching acknowledgment is sent back to the old primary replica site.
• A thread is created on the new primary replica site to execute the servicing method. After method execution is finished, the execution results are returned to the client site if required.
• The old primary replica is deallocated when the OM of the old primary replica site gets the switching acknowledgment.


FIGURE 24. Processing of method invocation.

FIGURE 25. Support for primary replica switching.



FIGURE 26. Communication under the CORBA architecture.


FIGURE 27. Performance improvement in simulation and runtime experiments. (Performance improvement (%) in eight experiments, for the simulation and the runtime system.)

Before the message regarding the primary replica location change reaches every node of the system, some invocation requests might have already been sent to the old site of the primary replica. In this case, the invocation requests need to be forwarded to the new site of the primary replica.

7.3. Implementation on the CORBA architecture

The components of the Manager Object, i.e. the EM, LM, LB and OM, are implemented as objects. At system start-up, these component objects are created first on every node of the system and they are bound to the ORBs (Object Request Brokers). All the interactions between the component objects of different nodes are carried out through the ORBs. Between ORBs, communication proceeds by means of a shared protocol, the Internet Inter-ORB Protocol (IIOP). Figure 26 illustrates the communication between component objects of the Manager Objects residing on two nodes. The interfaces for the component objects of the Manager Object (these interfaces mainly fulfill the functions of the component objects) are defined using the language-neutral Interface Definition Language (IDL). IDL is used instead of the Java language because the IDL-to-Java compiler automatically maps from IDL to the Java version of the

interfaces, as well as generating the infrastructure code for connecting to the ORB.

The Manager Object (more precisely, the EM) manages the execution of applications. When an application is executed, all the requests generated, such as method execution, invocation and synchronization requests, are forwarded to the Manager Object. A Manager Object may communicate with other Manager Objects if necessary.

Object migration is implemented through the mechanism of object serialization: the object to be migrated is first serialized and then deserialized at the destination node. In addition, when a method invocation request is serviced, the servicing object and method are not known until runtime. To invoke the correct method, we use the reflection mechanism: with the reflection API, a method of an object can be invoked even if the method is not known until runtime.

7.4. Simulation results validation

To validate the simulation results, the same randomly generated programs used in the simulation were executed again under the runtime system. The same set of load-balancing parameters was used in both the simulation and the runtime system experiments (that is, Ts = 0.8, Tr = 0.4, Tp = 3 and W = 1 s). Eight randomly generated programs were executed, and their execution times with and without load balancing were collected. Figure 27 presents the performance improvement achieved in the simulation and runtime system for these eight programs.

The correlation coefficient is used to evaluate the association between the performance improvements achieved in the simulation system and the prototype runtime system. The calculated correlation coefficient is approximately 0.96, which means that the performance improvements achieved in the prototype system and the simulation system match very closely and that the simulation results for the load-balancing algorithm are convincing.
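For illustration, the validation statistic is Pearson's correlation coefficient over the eight paired improvement figures. A minimal sketch follows; the numbers below are made-up placeholders, not the measurements from the experiments:

```java
// Hedged sketch of the validation step: Pearson's correlation coefficient
// computed from running sums over paired simulation/runtime improvements.
public class CorrelationSketch {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        // r = cov(x, y) / (sd(x) * sd(y)), up to a common factor of n
        return (sxy - sx * sy / n)
                / Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
    }

    public static void main(String[] args) {
        double[] simulation = {12, 18, 25, 9, 21, 15, 27, 11};   // placeholder %
        double[] runtime    = {11, 17, 26, 8, 20, 16, 25, 12};   // placeholder %
        System.out.printf("r = %.2f%n", pearson(simulation, runtime));
    }
}
```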

8. CONCLUSIONS AND FUTURE WORK

As an integrated environment, POEMS includes the underlying object model, the programming interface and the runtime system. The major objectives of the POEMS system are high system performance and flexibility of programming. Method invocation in POEMS is asynchronous, and threads are created to execute methods. Methods of different objects, as well as methods of the same object, can be executed concurrently in POEMS. In order to support both task and data parallelism, POEMS supports two object models: POR and POC. To take advantage of the underlying object models, an API has been designed. In this paper, the major aspects of the API have been described, namely the Replication (POR) and Collection (POC) classes and their programming facilities. Using the POR and POC classes, users can write programs in both MPMD and SPMD styles, as demonstrated by the three examples described in this paper.

Building a high-performance runtime environment is one of the challenges that the designer of a parallel object-oriented system must face. The environment will attract users if it helps in solving the complexities introduced by parallel and distributed object-oriented computing. We claim that the presence of a transparent load-balancing mechanism is essential: programmers should not need to worry about the mapping of applications onto nodes, and the load in the system should be uniformly distributed by the load-balancing mechanism. Dynamic load balancing for the POEMS object model is based on object replication. Threshold policies are applied in dynamic load balancing, and the simulation experiments show the effectiveness and efficiency of the mechanism for our parallel object-oriented system: a performance improvement is achieved compared with the case where no load balancing is used.
In addition, a prediction model, which monitors the runtime behavior of an invoked method and estimates the execution time for subsequent invocations, was also developed to help the load-balancing strategies make wiser decisions [3]. This paper has also described and discussed the design and implementation of a prototype runtime system. The


development of the prototype runtime system serves to demonstrate the feasibility of POEMS and to validate the performance results obtained in the simulation. At present, some components of the prototype runtime system are still under development.

The POEMS prototype runtime system includes two class libraries. One is for the user and includes the pre-defined POEMS object classes (i.e. the Replication and Collection classes). The other is for the runtime system and includes system classes (e.g. the Manager class). A POEMS translator is being developed, which creates system objects and translates the new keywords and constructs in a POEMS program into method invocations on the system objects. The resulting program can then be compiled and executed using the Java compiler and interpreter.

Currently, the runtime system only supports the POR class. However, in general the runtime support for the POR class is also applicable to the POC class, except for a few differences:

• the Manager Object needs to keep an additional configuration table to record the nodes to which the partition objects of each POC object are mapped;
• when a method invocation is executed on a POC object, the EM needs to check the configuration table to send the invocation request to all the nodes where the partition objects are mapped.
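The configuration table and the fan-out of an invocation to the partition nodes can be sketched as follows; this is an illustrative reconstruction, and the names (ConfigTableSketch, invokeOnPoc) are hypothetical, not part of the prototype:

```java
import java.util.*;

// Hypothetical sketch of the configuration table kept for POC objects:
// each POC object maps to the nodes holding its partition objects, and
// the EM fans an invocation request out to all of them.
public class ConfigTableSketch {
    static final Map<String, List<Integer>> configTable = new HashMap<>();
    static final List<String> outbox = new ArrayList<>();   // stand-in for real sends

    // EM-side dispatch for a POC object: consult the table, send to each node.
    static void invokeOnPoc(String pocObject, String method) {
        for (int node : configTable.getOrDefault(pocObject, List.of())) {
            outbox.add(method + " -> node " + node);
        }
    }

    public static void main(String[] args) {
        configTable.put("gridA", List.of(0, 1, 3));  // partitions mapped to nodes 0, 1, 3
        invokeOnPoc("gridA", "update");
        System.out.println(outbox);                  // [update -> node 0, update -> node 1, update -> node 3]
    }
}
```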

The load-balancing strategies for a POR object, i.e. dynamic switching of a primary replica and dynamic changing of the method execution site, have been implemented in the prototype runtime system. Both primary replica switching and partition object migration involve moving objects from one node to another; therefore, dynamic load-balancing support for the POC class can be implemented in a similar way to primary replica switching.
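Moving an object between nodes relies on the serialization-plus-reflection mechanisms described in Section 7.3. A minimal, self-contained sketch of the two mechanisms follows; the Counter class is a made-up example object, not part of POEMS:

```java
import java.io.*;
import java.lang.reflect.Method;

// Hedged sketch of the runtime mechanisms from Section 7.3: object
// migration via Java serialization, and method dispatch via reflection.
class Counter implements Serializable {
    private int value;
    public int increment() { return ++value; }
}

public class MigrationSketch {
    // Serialize the object into a byte array (the migration message).
    static byte[] toBytes(Object obj) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(obj);
        }
        return buf.toByteArray();
    }

    // Deserialize at the destination node.
    static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Counter original = new Counter();
        original.increment();                          // state becomes 1 before the move

        Object migrated = fromBytes(toBytes(original));

        // The servicing method is only known at runtime, so look it up reflectively.
        Method m = migrated.getClass().getMethod("increment");
        System.out.println(m.invoke(migrated));        // prints 2: state survived the move
    }
}
```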

REFERENCES

[1] Bahsoun, J. P., Briot, H. P. and Caromel, D. (1994) How could object-oriented concepts and parallelism cohabit. In Proc. IEEE Int. Conf. on Computer Languages, Toulouse, France, May 16–19, pp. 195–199. IEEE Computer Society Press.
[2] Krishnan, S. (1996) Automating Runtime Optimizations for Parallel Object-oriented Programming. PhD Thesis, University of Illinois at Urbana-Champaign, USA.
[3] Jie, W., Cai, W. and Turner, S. J. (1999) Dynamic load balancing in a replication based parallel object-oriented system. In Proc. Int. Symp. on Future Software Technology (ISFST'99), Nanjing, China, October 27–29, pp. 385–390. Software Engineers Association, Tokyo, Japan.
[4] Jie, W., Cai, W. and Turner, S. J. (2001) Dynamic load balancing using prediction in a parallel object-oriented system. In Proc. 15th Ann. Int. Parallel & Distributed Processing Symp. (IPDPS), San Francisco, CA, April 23–27. IEEE Computer Society Press.
[5] Jie, W., Cai, W. and Turner, S. J. (2001) Dynamic load balancing in a data parallel object-oriented system. In Proc. 2001 Int. Conf. on Parallel and Distributed Systems


(ICPADS'2001), KyongJu City, Korea, June 26–29, pp. 279–286. IEEE Press.
[6] Skillicorn, D. B. and Talia, D. (1998) Models and languages for parallel computation. ACM Comput. Surveys, 30, 123–169.
[7] Foster, I. and Kesselman, C. (1999) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, CA.
[8] Brunner, R. K. and Kale, L. V. (1999) Adapting to load on workstation clusters. In Proc. 7th Symp. on the Frontiers of Massively Parallel Computation, Annapolis, MD, USA, February 21–25, pp. 106–112. IEEE Computer Society Press.
[9] Kale, L. V. and Krishnan, S. (1993) Charm++: A Portable Concurrent Object-oriented System Based on C++. Technical Report UIUCDCS-R-93-1796, University of Illinois, USA.
[10] Gannon, D. and Lee, J. K. (1991) Object-oriented parallelism: pC++ ideas and experiments. In Proc. Japan Society for Parallel Processing, Kobe, Japan, May 14–16, pp. 13–23. Japan Society for Parallel Processing.
[11] Lee, J. K. and Gannon, D. (1991) Object-oriented parallel programming experiments and results. In Proc. Conf. on Supercomputing, Albuquerque, NM, November 18–22, pp. 273–282. IEEE Computer Society Press.
[12] Grimshaw, A. S. (1993) Easy-to-use object-oriented parallel processing with Mentat. IEEE Computer, 26, 39–51.
[13] Shrivastava, S. K. and McCue, D. L. (1994) Structuring fault-tolerant object systems for modularity in a distributed environment. IEEE Trans. Parallel Distrib. Syst., 5, 421–432.
[14] Haney, S. and Crotinger, J. (1999) How templates enable high-performance scientific computing in C++. IEEE Comput. Sci. Eng., 1, 66–72.
[15] Saltz, J., Ponnusammy, R., Sharma, S. et al. (1995) CHAOS Runtime Library. Technical Report CS-TR-3437, Department of Computer Science, University of Maryland, USA.
[16] Gannon, D., Beckman, P., Johnson, E. and Green, T. (1997) HPC++ and the HPC++ lib toolkit. In Compilation Issues on Distributed Memory Systems. Springer, Berlin.
[17] Diwan, S. M. (1999) Open HPC++: An Open Programming Environment for High-performance Distributed Applications. PhD Thesis, Department of Computer Science, Indiana University, USA.
[18] Object Management Group (1998) CORBA/IIOP 2.3.1 Specification. OMG, Framingham, MA.
[19] Vinoski, S. (1997) Integrating diverse applications within distributed heterogeneous environments. IEEE Commun. Mag., 14, 46–55.
[20] Redmond III, F. E. (1997) DCOM: Microsoft Distributed Component Object Model. IDG Books Worldwide, Microsoft Press, Foster City, CA.
[21] Grosso, W. (2001) Java RMI. O'Reilly, Cambridge, MA.
[22] Beedubail, G. and Pooch, U. (1996) Naming Consistencies in Object-oriented Replicated Systems. Technical Report TR96-007, Department of Computer Science, Texas A&M University, USA.
[23] Kafura, D., Mukherji, M. and Lavender, G. (1993) ACT++: a class library for concurrent programming in C++ using actors. J. Object-oriented Program., 6(6), 47–55.
[24] Yonezawa, A. (1990) ABCL: An Object-Oriented Concurrent System. MIT Press, USA.
[25] Blumofe, R. D., Frigo, M. and Joerg, C. F. (1996) Dag-consistent distributed shared memory. In Proc. 10th Int. Parallel Processing Symp., Honolulu, Hawaii, USA, April 15–19, pp. 132–141. IEEE Computer Society Press.
[26] Hill, J. M. D., McColl, W. and Stefanescu, D. C. (1997) BSPlib: The BSP Programming Library. Technical Report TR-29-97, Programming Research Group, Oxford University, UK.
[27] Corradi, A. and Leonardi, L. (1990) Parallelism in object-oriented programming language. In Proc. Int. Conf. on Computer Languages, New Orleans, LA, March 12–15, pp. 271–280. IEEE Computer Society Press.
[28] Arnold, K. and Gosling, J. (1996) The Java Programming Language. Addison-Wesley, Boston, MA.
[29] Chandra, R., Gupta, A. and Hennessy, J. L. (1994) COOL: an object-based language for parallel programming. IEEE Computer, 27, 13–26.
[30] Nugroho, L. E. and Sajeev, A. S. M. (1999) Java4P: Java with high-level concurrency constructs. In Proc. 4th Int. Symp. on Parallel Architectures, Algorithms, and Networks (I-SPAN'99), Perth, Australia, June 23–25, pp. 328–333. IEEE Computer Society Press.
[31] Sajeev, A. S. M. and Schmidt, H. W. (1996) Integrating concurrency and object-orientation using boolean, access and path guards. In Proc. 3rd Int. Conf. on High Performance Computing, Trivandrum, India, December 19–22, pp. 68–72. IEEE Press.
[32] Courtois, P. J., Heymans, F. and Parnas, D. L. (1971) Concurrent control with readers and writers. Commun. ACM, 10, 667–668.
[33] El-Rewini, H., Lewis, E. and Ali, H. (1994) Task Scheduling in Parallel and Distributed Systems. Prentice-Hall, Upper Saddle River, NJ.
[34] Zhou, S. (1988) A trace-driven simulation study of dynamic load balancing. IEEE Trans. Softw. Eng., 14, 1327–1341.
[35] Caughey, S. J., Parrington, G. D. and Shrivastava, S. K. (1993) SHADOWS: a flexible support system for objects in distributed systems. In Proc. 3rd Int. Workshop on Object Orientation in Operating Systems, Ashville, NC, December, pp. 73–81. IEEE Computer Society Press.
[36] Powell, M. L., Kleiman, S. R. and Barton, S. (1991) SunOS 5.0 Multithread Architecture. Sun Microsystems Inc., USA.
[37] Zaki, N. J., Li, W. and Parthasarathy, S. (1995) Customized Dynamic Load Balancing for a Network of Workstations. Technical Report 602, Department of Computer Science, University of Rochester, USA.
[38] Quinn, M. J. (1994) Parallel Computing: Theory and Practice. McGraw-Hill, New York.