Abstractions for Adaptive Data Parallelism

P. Roe

School of Computing Science Queensland University of Technology Brisbane, Queensland, 4001 [email protected]

Abstract

This paper describes how a class of data parallel programs (SPMD) may be expressed using reusable adaptive abstractions. The abstractions support adaptive use of a network of workstations for parallel computing. Although data parallelism is the paradigm considered, the programs are expressed using message passing. The main contribution of this paper is to demonstrate how adaptive parallelism may be realised using abstractions, and to examine its costs and benefits.

1 Introduction

In the aggregate, networks of workstations represent a huge unused computing resource. Workstations are, on average, largely unused: in any one minute during peak time, 60% of all machines are on average idle [3]. This paper investigates how networks of idle workstations can be adaptively used for parallel computation. The utilisation of a dynamic set of workstations requires adaptive parallelism [5]. Adaptive parallelism entails the automatic and transparent adaptation of parallel programs to the available workstations; this contrasts with conventional parallel computing, which assumes a fixed set of processing elements. Adaptive parallelism must make highly efficient use of available resources in order to be successful. This paper describes how a class of data parallel programs (SPMD) may be expressed using reusable adaptive abstractions. The abstractions support adaptive use of a network of workstations. The main contribution of this paper is to demonstrate how adaptive parallelism may be realised using abstractions, and to examine its costs and benefits. Although data parallelism is the paradigm considered, the programs are expressed using message passing: in particular, the Message Passing Interface (MPI) [7].

2 The data parallel paradigm

The particular data parallel (SPMD) paradigm considered in this paper is the following: a data structure is partitioned across a set of processors; each processor repeatedly performs some computation on local data, then communicates:

    while cond(localstate) do
        compute(localstate)
        communicate(localstate)

The communication of one processor may involve all processors or a subset of them. This paradigm encompasses many parallel algorithms, including the following in which we are particularly interested: genetic algorithms, simulated annealing and many artificial neural network training and rule extraction algorithms. Conventionally a program is allocated a static set of processors and its data is statically partitioned across these. Since we are considering a processor set which will change dynamically, we need an adaptive solution.
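The compute/communicate loop above can be rendered as a minimal single-process sketch. The names (cond, compute, communicate) mirror the pseudocode; the computation here (repeated halving toward a threshold) is a toy stand-in, and the communicate step is a stub since only one process runs. All names are illustrative, not the paper's API.

```python
def cond(state):
    # Continue while the local value is above a threshold.
    return state["value"] > 1.0

def compute(state):
    # Local computation on this processor's partition of the data.
    state["value"] /= 2.0
    state["iterations"] += 1

def communicate(state):
    # Stub: in a real SPMD program this would exchange data with the
    # other processors (e.g. via MPI collective or point-to-point calls).
    pass

def run(state):
    # The SPMD loop from the paper: compute, then communicate, while cond holds.
    while cond(state):
        compute(state)
        communicate(state)
    return state

state = run({"value": 100.0, "iterations": 0})
print(state["iterations"])  # halvings needed to bring 100.0 to <= 1.0
```

In a real deployment each process would run this same loop on its own partition, which is exactly what makes the paradigm "single program, multiple data".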

3 Adaptive data parallelism

An adaptive solution must adapt to the processor set (set of workstations) contracting or expanding. In the extreme there may be no processors on which to run a program. We make the following simplification: we assume that one designated processor, called the home processor, always runs the program, and that this processor performs all I/O and loads the initial problem data. Parallel computations must periodically communicate with each other to determine if the current processor set should change. This communication must occur often enough to allow processors to be released for interactive users. However, since this communication represents overhead for the parallel computation, there is a trade-off involved. When the processor set expands or contracts, data needs to be redistributed accordingly; see Figure 1. In this figure an array of 60 elements is distributed across a set of workstations for processing; periodically data must be redistributed to take account of changes in the workstations' status. We will often use the terminology asleep and awake to

mean that a workstation is currently utilised (indicated by the figure in the figure!) and available for use by an adaptive program (idle), respectively. Process data may be either copied or partitioned. Typically the bulk of data will be partitioned across workstations, but some control data, such as loop indexes and results, will be copied.

[Figure 1: Example of program adaptation. An array of 60 data elements is repeatedly redistributed across workstations A, B and C over time: first all 60 elements (0...59) on one idle workstation, then 20 elements each (0...19, 20..39, 40...59) when all three are idle, then 30 elements each (0...29, 30...59) when one becomes busy.]

The revised algorithm to support adaptive computation is:

    (* initialise processes *)
    init(localstate)
    while cond(localstate) do
        (* adapt to the current set *)
        (* of unused workstations *)
        adapt(localstate)
        compute(localstate)
        communicate(localstate)
    (* clean up sleeping processes *)
    terminate(localstate)

Note, there should be only one call to adapt in the program. We assume that processes are started on all processors which may participate in the computation. These processes communicate details to one another (init). The local state contains: the local partition of data, any shared data and details of the current set of active processors. The adapt routine communicates activity details between processors; this in turn may result in data redistribution if the processor set has changed. After redistribution any processes which should sleep (i.e. be returned to the workstation user) will do so; upon awakening they will re-enter the same routine (adapt) and will participate in the next data redistribution. Notice that adapt causes process synchronisation. The terminate routine kills any sleeping processes when the computation has finished.

General direct communication between tasks is possible (unlike in Piranha, see Section 7). However, since the processor set changes, communication can no longer use absolute processor addressing. Instead data addressing must be used; this includes operations such as: collective communication (e.g. reduction, broadcast); send, receive, put and get with data addressing (e.g. send to the processor with element A[i]); and monotonic variables (such as search bounds). The next section describes details of data redistribution.
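Data addressing of the kind described above (send to the processor with element A[i]) needs a way to map an element index to its current owner. A minimal sketch, assuming a contiguous block distribution of the array over the currently active processors (the function name and the exact distribution rule are assumptions, not the paper's API):

```python
def owner_of(i, n_elements, active):
    """Return the active processor owning element i when n_elements are
    block-distributed, as evenly as possible, over `active` (an ordered
    list of processor ids)."""
    p = len(active)
    base, extra = divmod(n_elements, p)
    # The first `extra` processors hold base+1 elements; the rest hold base.
    boundary = extra * (base + 1)
    if i < boundary:
        idx = i // (base + 1)
    else:
        idx = extra + (i - boundary) // base
    return active[idx]

# 60 elements over three active workstations -> 20 each, as in Figure 1.
print(owner_of(0, 60, ["A", "B", "C"]))   # -> A
print(owner_of(25, 60, ["A", "B", "C"]))  # -> B
print(owner_of(59, 60, ["A", "B", "C"]))  # -> C
```

Because the mapping is computed from the current active set rather than from fixed ranks, a send addressed by element index remains correct after the processor set changes and data is redistributed.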

4 Redistributing data

The data in an adaptive parallel computation must be redistributed when the processor set changes. The goal is to keep data evenly partitioned across the set of unused workstations so as to balance the load. When adapt is called, four separate cases can be identified:

1. there is no change to the processor set,
2. the processor set changes but the number of active processors remains the same,
3. the processor set contracts, or
4. the processor set expands.

The first two cases are straightforward; the latter two require re-partitioning data across processors. As previously mentioned, the algorithm's data may be either copied or partitioned; this is determined by the programmer. Copying data is straightforward and can be performed automatically; the programmer only needs to select the data. Partitioning is usually implicit in parallel programs, although data parallel languages such as High Performance Fortran are an exception to this. Partitioning data is data structure dependent; therefore our solution is to provide the programmer with a library of data structures and partitioning/distribution routines. At present two data structure abstractions which support partitioning have been implemented: bags and one-dimensional arrays. Bags are unordered collections of elements; they support collective communication operations and monotonic variables, but not element addressing. One-dimensional arrays support all the operations of bags plus element addressing (they are ordered). Bags could be implemented using one-dimensional arrays; however, their lack of ordering permits more efficient data redistribution routines to be utilised. This is because array elements must be distributed so that they are contiguous on each processor.

A greedy algorithm is used for redistributing bag elements, which is optimal. A similar algorithm is used for arrays; however, this is sub-optimal: a more efficient algorithm is currently under investigation. The bag and array data structures are organised in a class-like hierarchy, cf. the Smalltalk collection classes. Thus all bag operations are also array operations, but not vice versa. The intention is that further data structures can be added to this hierarchy.
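The greedy idea for bags can be sketched as follows: because bag elements are unordered, any surplus element may go to any processor with a deficit, so pairing surpluses with deficits greedily moves the minimum number of elements. This is an illustrative reconstruction, not the paper's implementation; the function name and data shapes are assumptions.

```python
def greedy_moves(counts, target):
    """counts: {proc: current element count}; target: {proc: desired count}.
    Returns a list of (src, dst, n) element transfers."""
    surplus = [(p, counts[p] - target[p]) for p in counts if counts[p] > target[p]]
    deficit = [(p, target[p] - counts[p]) for p in counts if counts[p] < target[p]]
    moves = []
    while surplus and deficit:
        (s, s_n), (d, d_n) = surplus[-1], deficit[-1]
        n = min(s_n, d_n)            # ship as much as this pair allows
        moves.append((s, d, n))
        surplus[-1] = (s, s_n - n)
        deficit[-1] = (d, d_n - n)
        if surplus[-1][1] == 0:
            surplus.pop()
        if deficit[-1][1] == 0:
            deficit.pop()
    return moves

# Processor set contracts from {A, B, C} to {A, C}: B's 20 elements must
# move, and A and C should end up with 30 each (cf. Figure 1).
moves = greedy_moves({"A": 20, "B": 20, "C": 20}, {"A": 30, "B": 0, "C": 30})
print(moves)
```

For arrays the same pairing is not sufficient, since the contiguity constraint may force elements that are already in place to shift along to keep each processor's block contiguous; this is why the array version is sub-optimal.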

5 The implementation

The implementation was written using Oberon, an object-oriented language in the Pascal family, with an interface to an MPI (v1.1) system. In particular, a local Oberon compiler (based on the Gardens Point series of compilers) and the ANL/MSU MPI implementation (mpich) were used. A type was declared to represent the local state of an adaptive computation; this included information such as: the current active processor set, the local data partition and shared data. Oberon's object-oriented features were used to simplify the design. For example, any data would do for shared data so long as it supported send and receive methods. Likewise the array type inherited the bag type's data, and some methods, as well as adding some fields of its own, such as the identities of the processors with elements preceding and succeeding those of this processor. The redistribute method was defined differently for bags and arrays. The adapt routine used dynamic dispatch to call the appropriate routine. Thus the system could be extended with additional data types, with their own redistribute methods, and adapt would work with these without modification.

5.1 Adaptation

A simple algorithm to perform adaptation is described below. The home processor acts as a master controller for adaptation: newly awoken workstations inform it of their new state. Each processor records the set of tasks which are currently awake.¹ A call to adapt on the home processor performs the following:

    (* get new status of awake w/stations *)
    for all p in awake
        receive newawake[p] from p
    (* any sleeping w/stations awoken? *)
    for all p in not awake
        asynchreceive newawake[p] from p
    (* send all awake w/stations new details *)
    for all p in (awake' + awake)
        (* newly awoken also need old info *)
        if not (p in awake) then
            send shareddata to p
            send awake to p
        send newawake to p
    (* processor set changed => redistribute *)
    if newawake ≠ awake then
        redist(awake, newawake)
    awake := newawake

¹ MPI communicators and the MPI_COMM_CREATE function could be used for this, providing the latter does not require the entire group of processors to participate, some of which may be asleep. MPI v1.0 does require this.

The other processors perform:

    (* send status to home w/station *)
    send cont/sleep to home
    (* receive new details *)
    receive newawake from home
    if sleep then
        redist(awake, newawake)
        sleep (* until awoken by e.g. a signal *)
        (* inform home processor woken up *)
        send awoken to home
        (* receive new details from home *)
        receive shareddata from home
        receive awake from home
        receive newawake from home
    (* processor set changed => redistribute *)
    if newawake ≠ awake then
        redist(awake, newawake)
    awake := newawake
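The status-gathering step of this protocol can be shown message-free: the home processor folds each worker's reported status into the new awake set, and redistribution is triggered only if the set changed. A toy rendering under assumed names; real code would use MPI sends and receives as sketched above.

```python
def adapt(awake, status):
    """awake: current set of active processors.
    status: {proc: 'cont' | 'sleep' | 'awoken'}, as reported to the home
    processor. Returns (new_awake, redistribution_needed)."""
    new_awake = set(awake)
    for p, s in status.items():
        if s == "sleep":
            # Processor returned to its interactive user.
            new_awake.discard(p)
        elif s == "awoken":
            # Previously sleeping processor has become idle again.
            new_awake.add(p)
        # 'cont' leaves membership unchanged.
    return new_awake, new_awake != set(awake)

# B's user returns, D becomes idle, C continues: the set changes, so
# the data must be redistributed before the next compute step.
new_awake, redist = adapt({"home", "B", "C"},
                          {"B": "sleep", "C": "cont", "D": "awoken"})
print(sorted(new_awake), redist)
```

Note how the home processor never leaves the set, matching the paper's simplification that the home processor always runs the program.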

Thus if there is no change to the processor set, a simple fan-in fan-out communication is all that is required; this can be further optimised depending on network topology and the number of machines involved. By its very nature, reception of the `wake-up call' must be asynchronous. A workstation `wakes up' when its load falls below some level, i.e. it becomes idle.

6 Some experiments and results

Using the adaptive abstractions, several `toy' programs were written and many experiments performed. To test the bag data structure, a genetic algorithm was written for solving the timetabling problem. This maintains a population (bag) of candidate timetables which are repeatedly mated and mutated. The population is partitioned across processors and every iteration the best timetable from each sub-population is distributed to all sub-populations. The implementation loosely followed that described in [1]. The array adaptive data structure was tested using a Jacobi relaxation algorithm, which iteratively computes over a grid. Each grid element is updated with the average of the previous values surrounding the element. The grid was partitioned into groups of contiguous rows, which were distributed over processors.

The test environment was a set of four Sun SPARCstation 4s, each with 32MB of memory. The communications network was a standard 10Mb/s Ethernet. All tests were performed when the system (machines and Ethernet) was lightly loaded. The following sequential and non-adaptive parallel performance figures were obtained for the genetic algorithm (with a population of 200, 60 subjects, 200 students and 30 time slots):

    Sequential  Parallel  Speedup
    223s        65s       3.4

and for the relaxation algorithm (with a 200x200 grid and 4000 iterations):

    Sequential  Parallel  Speedup
    394s        120s      3.3

To test the adaptive versions, the set of available processors was artificially varied. This allowed greater control over the experiments than if real machine loads were used. (To ensure machines were not loaded, experiments were performed at night.) The first experiments randomly let processors sleep (be unavailable to the parallel programs) for 0 to 5 seconds, with a probability of 0.3 of sleeping each iteration. (It must be admitted that these figures are rather arbitrary; we are currently collecting statistics on workstation availability.) The genetic algorithm, with the same parameters as previously, resulted in the following:

    No adapt  Avg proc active  Time  Speedup
    100       3                154s  1.5

"No adapt" is the number of adaptations performed and "Avg proc active" is the average number of processors which were active during the computation. The speed-up was calculated based on this latter figure. The results for the relaxation algorithm, again with the same data as previously, are:

    No adapt  Avg proc active  Time  Speedup
    200       2.7              182s  2.2

Reasonable speed-ups resulted in both cases; the genetic algorithm does not perform so well due to a rather inefficient representation of timetables, which slows down data redistribution. In both cases minimal changes were needed to rewrite the programs using the adaptive abstractions.
To investigate the overheads of adaptation further, the experiments were rerun with the adaptation probability set to zero, so that no data redistribution occurred. This is a good measure of how costly the repeated bookkeeping for adaptation is. The results for the genetic algorithm are:

    No adapt  Avg proc active  Time  Speedup
    100       4                75s   3.0

and for the relaxation algorithm:

    No adapt  Avg proc active  Time  Speedup
    200       4                120s  3.3

These results are almost the same as those for the non-adaptive versions. Further investigation into the costs of adaptation reveals that the cost of performing an adaptation check, with no redistribution, is 6ms for the 4 workstations concerned. Given that such a check probably needs to be performed every second, this is only a 0.6% overhead on runtime. This cost is related to the communications latency (since the messages sent are only small) and the degree of synchronisation required by the algorithm each iteration. To measure the performance of redistribution we wrote a program which did nothing other than repeatedly call the adapt routine. We ran the program with different sized data sets, and with adaptation parameters which caused a redistribution every iteration. Each iteration the system flipped from all processors being active to only the home processor being active, necessitating complete redistribution. The results for the bag and array types were similar, so we only show the array results here. The data consisted of an array of integers (32-bit words). The time represents the time taken to perform a single complete redistribution between all four processors:

    Data size   Time
    1000        0.013s
    10,000      0.041s
    100,000     0.30s
    1,000,000   3.0s

As we would expect, for large data sets the cost of redistribution is proportional to the data size, and is quite expensive. Thus, in general, the cost of redistribution is related to the bandwidth of the communications network, assuming data sets are large. Finally, it should be noted that all results were obtained on a lightly loaded Ethernet; this is not very realistic. Today's users are easily capable of saturating an Ethernet using multimedia and web-based applications!
Thus although an adaptive parallel computation will not try to utilise a busy workstation, a busy workstation may affect the computation by making heavy use of the Ethernet which the computation is also using. Modern high-performance communication networks, such as ATM, may alleviate this problem.

7 Related work

There are many projects investigating parallel computing across networks of workstations. However, only a few of these have explicitly addressed adaptive parallelism, including: Data Parallel C [10], CHAOS [6], Piranha [5], Application Data Movement [11] and DOME [4]. Data Parallel C is a data parallel programming language. Its great advantage is that load balancing is achieved automatically, by using virtual processors (processes). Its disadvantage is that it relies on a programming paradigm (SIMD) which is more restrictive than the one used here. The more recent CHAOS project also addresses adaptive parallelism in a data parallel setting, using High Performance Fortran. As with all data parallel languages, less work is needed by the programmer to express (adaptive) parallel programs, but the language is more restrictive in terms of the programs which can be expressed. The Piranha system is based on the Linda coordination framework. It supports a master-worker paradigm of adaptive parallel computation. For true master-worker computation only a small amount of work is required of the programmer to support adaptive computation. However, for data parallel applications with many inter-task dependencies, as described here, it is necessary to express a different form of computation on top of the master-worker one [8]. In this case we believe the system described here is simpler to program. Application Data Movement (ADM) is close to our work. It uses message passing and explicitly expressed adaptive computation and data redistribution. However, ADM uses a centralised server to control all aspects of adaptation and communication. Our model is more decentralised and hence performs less communication. Also, their model of adaptation is rather complex. We have concentrated on developing reusable adaptive abstractions. The DOME project is similar to ADM; it uses heavyweight checkpointing to control redistribution and thus is unsuited to frequent redistribution. However, an advantage of using checkpointing is that it is fault tolerant.
More distantly related work includes Condor [9] and Nimrod [2]. Effectively both of these exploit extremely coarse-grain parallelism with little or no inter-process communication; they are adaptive but do not support parallel programming.

8 Further work

There are many ways in which the abstractions and their implementation could be extended and improved. At present the abstractions only support a single distributed data structure. A simple extension would allow multiple data structures to be registered for redistribution. A more difficult restriction is that the current system only allows a single call to adapt in the whole program. The reason for this is to allow a sleeping process to restart at the correct invocation of adapt (there is only one). A simple solution, adopted by the CHAOS project [6], is to always run a skeleton process on every processor, even if the processor is currently busy. When a processor is busy it will not contain any data, so only a little computation will be required. A more radical alternative for homogeneous machines is to support some sort of remote fork, which really copies some state, e.g. the stack, from the home processor to the newly awoken machine.

It is also desirable to eliminate the need for a home processor which always runs the program. In the case that no processors are available, the computation state could be dumped to a file, or the computation just suspended. Any processor could assume the role of home processor as required. Our model of I/O is a very restrictive one; however, devising a more general scheme seems particularly difficult. Further research is required in this area.

A useful optimisation would be to combine the algorithm's communication with the administrative communication required for adaptation, for example by piggy-backing adaptation information on the back of ordinary communications. It is straightforward to use a real-time clock to control the frequency at which adapt is called; this can even be encapsulated within adapt. In general the cost of periodically synchronising processors to exchange processor-set status information is not great. However, it is necessary to periodically force workstations to synchronise in adapt. Otherwise there is the danger that an application will not, of its own accord, synchronise workstations soon enough (or even at all) for adaptation to be transparent to workstation users².
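The clock-controlled scheme mentioned above can be sketched as follows: adapt is called every iteration, but only performs a real adaptation check if at least a given interval has elapsed since the last one. A minimal sketch under assumed names; the interval of one second echoes the overhead estimate in Section 6.

```python
import time

class Adapter:
    def __init__(self, interval):
        self.interval = interval      # minimum seconds between checks
        self.last = float("-inf")     # time of the last real check
        self.checks = 0

    def maybe_adapt(self, now=None):
        """Call every iteration; does an adaptation check at most once per
        `interval` seconds. Returns True if a check was performed."""
        if now is None:
            now = time.monotonic()
        if now - self.last < self.interval:
            return False
        self.last = now
        self.checks += 1
        # ...here a real implementation would exchange processor-set
        # status and redistribute data if the set has changed...
        return True

# Simulated clock: 10 iterations, 0.3s apart, with a 1s interval.
a = Adapter(1.0)
results = [a.maybe_adapt(now=0.3 * i) for i in range(10)]
print(a.checks)  # checks occur at t = 0.0, 1.2 and 2.4 -> 3 checks
```

This throttling keeps the bookkeeping overhead bounded regardless of how fast the application iterates, though as the paper notes it cannot help an application that fails to call adapt at all for long stretches.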

9 Conclusions

An implementation of adaptive parallel abstractions for a network of workstations connected by an Ethernet has been written using Oberon and MPI. This allows a small, but useful, class of algorithms to be written to run adaptively on a network of workstations. A genetic algorithm and a simple Jacobi relaxation algorithm have been written to utilise the bag and array adaptive abstractions, respectively. Performance results show that the abstractions are effective.

Acknowledgements

I would like to thank the anonymous referees and Clemens Szyperski for their comments on a draft of this paper.

² Acceptance of the system by workstation users is an important issue which is not addressed in this paper.

References

[1] D. Abramson and J. Abela. A parallel genetic algorithm for solving the school timetabling problem. Technical Report 79, School of Computing and Information Technology, Griffith University, November 1993.

[2] D. Abramson et al. Nimrod: A tool for performing parametrised simulations using distributed workstations. In Proc. 4th IEEE Symposium on High Performance Distributed Computing, August 1995.

[3] T. E. Anderson, D. E. Culler and D. A. Patterson et al. A case for NOW (Networks of Workstations). IEEE Micro, February 1995.

[4] J. Arabe, A. Beguelin, B. Lowekamp, E. Seligman, M. Starkey and P. Stephan. Dome: Parallel programming in a heterogeneous multi-user environment. In Proceedings of the International Parallel Processing Symposium, 1996.

[5] N. Carriero, E. Freeman and D. Gelernter. Adaptive parallelism and Piranha. IEEE Micro, pages 40-49, January 1995.

[6] G. Edjlali, G. Agrawal, A. Sussman, J. Humphries and J. Saltz. Compiler and runtime support for programming in adaptive parallel environments. Technical Report CS-TR-3510, University of Maryland, 1995.

[7] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Technical report, University of Tennessee, 1995.

[8] D. L. Kaminsky. Adaptive Parallelism with Piranha. Ph.D. thesis, Yale University, 1994.

[9] M. J. Litzkow, M. Livny and M. W. Mutka. Condor: A hunter of idle workstations. Technical Report CS-TR-730, Computer Science Dept, University of Wisconsin-Madison, 1987.

[10] N. Nedeljovic and M. J. Quinn. Data-parallel programming on a network of heterogeneous workstations. Concurrency: Practice and Experience, 5(4), June 1993.

[11] R. Prouty, S. Otto and J. Walpole. Adaptive execution of data parallel computations on networks of workstations. Technical Report CSE-94-012, Department of Computer Science and Engineering, Oregon Graduate Institute of Science and Technology, March 1994.