A Flexible Library for Dependable Master-Worker Parallel Programs

Mieke Leeman, Marc Leeman, Vincenzo De Florio, Geert Deconinck
ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium
[email protected]
Abstract
We describe a robust library for parallel and dependable master-worker programs (raftnet). From a software client's viewpoint, flexibility, hardware independence and ease of use are the main features offered by this library, while its users may take advantage of a framework that optimises the available redundant resources so as to reach dependability, high performance and load balancing at the same time. A wide variety of applications can be targeted without major rewriting of the existing sequential code. This paper describes the implementation of this library and provides evidence for our claims by describing two case studies from vision and multimedia applications.
1 Introduction
In time-consuming programs, gaining speedup is a major aim in optimising applications. This is often achieved by code and algorithmic optimisations. In cases where further speedup is desirable, parallelisation in a Multiple Instruction Streams, Multiple Data Streams (MIMD) structure can be a solution. However, parallelising software requires quite an effort. Details of parallelisation libraries are not known to most of today's programmers. Furthermore, issues like deadlocks and race conditions are extremely important, but not easy to catch. We pursued easier parallelisation of applications by setting up a robust parallelisation library (raftnet [17]). The objective of this library is to provide the functionality of a revised master-worker model. The library can be called by any application that requires the offered speedup and dependability; the user-programmer only has to write a small set of predefined functions. The main aims were performance, dependability, modularity, flexibility and maintainability. The interaction between this master-worker library and the user application is such that this parallelisation scheme can be integrated in a variety of existing programs without major rewriting of the serial application. The time to parallelise existing applications is considerably reduced and the user does not need to know details of the master-worker library. The distribution on demand of the blocks to the workers ensures dynamic load balancing. Furthermore, it follows that temporary failures in the workers do not affect the result of the system, only the termination time.

The library is written in C and uses the Message Passing Interface (MPI) 1.2 [18, 8, 9]. It is used and tested with MPICH [19], but it can run on top of other MPI implementations. The fact that no MPI 2.0 calls are invoked enlarges the set of possible MPI implementations that can be used. This ensures the hardware independence of the system.

The next section describes the implementation of the master-worker library: the parallelisation algorithm, the transparency, the delay and failure semantics, the handling of deadlocks, and the initialisation and finalisation. Section 3 elaborates on the possibilities of this parallelisation scheme and describes the results of two applications integrated with the master-worker library. Section 4 explains some extensions that are currently in development and section 5 touches on some related projects and research.
2 Master-Worker Algorithm

2.1 Introduction
The algorithm is based on the traditional synchronous master-worker model. In this model, the master processor – or farmer – feeds a data block to each slave processor – called worker – which performs the necessary computations on it. The master recollects the processed blocks by polling each worker for the finished data blocks. This model suffers from some shortcomings. Quite an amount of work is performed serially by the master; therefore, this node can become a bottleneck in the system. A delay in one of the workers slows down the rate of processing for the whole system. Furthermore, this scheme offers only static load balancing, which requires that the relative workload of each block is known beforehand. These pitfalls formed the starting point of a modified master-worker system.

Figure 1. Dependable master-worker model
2.2 The modified master-worker system

In the modified master-worker system (figure 1) [6], the master is detached from the workers via a dispatcher, which is responsible for communication with the workers. A collector receives all processed blocks from the workers, post-processes and merges them, and sends the result back to the master. All nodes perform predefined tasks:

Master The master receives its input from an external source, e.g. a frame grabber. It splits the input units and sends the array of input blocks to the dispatcher. The master receives the processed output units from the collector. These results can be saved.

Dispatcher The dispatcher receives an array of input blocks from the master. An input block is sent to a worker on demand.

Workers The workers perform some computations on the input blocks, and send the resulting output block to the collector.

Collector The collector receives all output blocks from the workers, merges them and redirects the resulting output unit to the master.

The master node controls startup and termination of the system. Upon startup, the first input unit is split up. A NEW RUN message informs the dispatcher that the input blocks are available. From that moment onwards, the dispatcher can send these input blocks to workers. Likewise, a STOP message from the master indicates that all nodes should exit the system.

In contrast to the traditional master-worker model, the workers control their own work pace. After initialisation, the workerid (i) is sent to the dispatcher. The dispatcher chooses the input block for sending, depending on status values (denoting the freshness grades) of these blocks [6]. Priority is given to non-processed and less fresh blocks. Whenever the collector receives a processed block, the blockid (k) is redirected towards the dispatcher. This enables the dispatcher to keep track of these processed blocks in the status values. The freshness grade denotes the number of times the block has already been sent to a worker for processing.

Various forms of parallelisation and optimisation of the algorithm speed up the system. First of all, the workers process the blocks in parallel, as is the case for the traditional master-worker model. Furthermore, control tasks are divided over several processes: the master splits, the dispatcher communicates with the workers and the collector merges the output blocks. Hence, more control tasks are performed in parallel. A further enhancement is splitting the next input unit in advance. By the time the system begins processing a new input unit, all new blocks are already put at the dispatcher's disposal. This is similar to 2-stage pipelining and is particularly effective when the duration of processing is approximately equal to the duration of the input phase. In what follows, several implementation concepts are explained.
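The worker's side of this pull protocol fits in a short loop. The following is a minimal sketch, not raftnet code: the message tags, the fixed maximum block size and the process_block function are assumptions made for illustration.

#include <mpi.h>
#include <stdlib.h>

#define TAG_REQUEST 1         /* worker -> dispatcher: "give me work"  */
#define TAG_STOP    3         /* dispatcher -> worker: exit the system */
#define TAG_RESULT  4         /* worker -> collector: processed block  */
#define MAX_BLOCK   (1 << 20)

extern void process_block(unsigned char *buf, int nbytes); /* user code */

void worker_loop(int workerid, int dispatcher, int collector)
{
    unsigned char *buf = malloc(MAX_BLOCK);
    MPI_Status st;

    for (;;) {
        /* Announce availability: the worker controls its own pace. */
        MPI_Send(&workerid, 1, MPI_INT, dispatcher, TAG_REQUEST,
                 MPI_COMM_WORLD);
        MPI_Recv(buf, MAX_BLOCK, MPI_BYTE, dispatcher, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP)
            break;                        /* master requested shutdown */

        int nbytes;
        MPI_Get_count(&st, MPI_BYTE, &nbytes);
        process_block(buf, nbytes);       /* the user's computation */

        MPI_Send(buf, nbytes, MPI_BYTE, collector, TAG_RESULT,
                 MPI_COMM_WORLD);         /* hand the result over */
    }
    free(buf);
}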
2.3 Transparent access to parallelisation
The implemented modified master-worker system acts as a black box. Firstly, knowledge about the algorithm details is not required of the user application programmer. This allows a fast integration of the master-worker library in existing serial applications. Secondly, data structures in the user application can take any form; they need not be known beforehand by the master-worker algorithm. Both aspects contribute to the flexibility of the system. The resulting master-worker library can be integrated in a wide variety of programs, on the condition that a limited set of user application functions is written. Both ideas are explained in this section.

The modified master-worker algorithm is implemented in the master-worker library. The application that is to be parallelised resides in the user application library. A main process calls the master-worker library with a restricted set of arguments. Most of these arguments are pointers to functions in the user library for splitting, merging, post-processing the output data or for handling data structures. Communication between the master-worker library and the user application library is done solely via these functions. Writing these functions forms the main effort for the user-programmer when integrating the master-worker library with his/her user application library. This setup ensures that the master-worker algorithm can be used for a wide variety of programs. Furthermore, the detailed implementation of the master-worker algorithm remains a black box for the user.

Since the user application is a black box for the master-worker library, the library is only concerned with handling data structures and is not interested in the contents of the data, which are determined by the user application. Often, the program flow can be regarded as a series of data structure conversions. To support a variety of applications, the master-worker library keeps track of various data structures (figure 2). DW_STRUCT, WC_STRUCT and CM_STRUCT denote the input blocks, the processed blocks and the post-processed output data; the prefixes are the first letters of source and destination. Conversion between these structures is performed in the user functions for processing and merging. MC_MERGED_STRUCT is a structure containing supporting information for merging the data.

Figure 2. Data structures and data structure conversions in the master-worker library (DW_STRUCT: master/dispatcher to workers; WC_STRUCT: workers to collector; CM_STRUCT: collector to master; MC_MERGED_STRUCT: master to collector)

Fields in data structures are sent one at a time. Details about these structures are beforehand only known by the user library. On the other hand, the master-worker library needs to know these details for sending and receiving the data. Two functions written by the user application programmer, which are passed on via the master-worker interface, serve as a communication channel between the user application library and the master-worker library for these structure details (figure 3).

• The first function – describe – receives as input from the master-worker library information such as the index of the structure (one of the structures in figure 2) and the field index. The describe function in the user library returns information such as the number of entries in that field (in case it is an array) and the data type. For the receiving process, the returned information allows the master-worker library to allocate memory for the receive buffer. For the sending process, the data itself is also returned to the master-worker library, which transmits it immediately.
• After actually receiving this data, the receiving process fills in the user's data structure with the received field. The user function fillstruct receives via its arguments the structure and field index, the structure itself and the data. It returns the structure with the data field filled in.

Via this scheme, the master-worker library does not need to be aware of the data structure details – these are returned and filled in by the user application's describe and fillstruct functions. The user programmer remains completely in control of the data structures that are used. A couple of examples that are used in this text are available on the SourceForge CVS [17].
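To make the division of labour concrete, the sketch below shows what user-written describe and fillstruct callbacks could look like for a simple block structure. The signatures and the dw_struct_t type are hypothetical; the actual raftnet interface differs in its details.

#include <mpi.h>

/* Hypothetical example of a user's input-block structure (DW_STRUCT). */
typedef struct {
    int            blockid;
    int            npixels;
    unsigned char *pixels;
} dw_struct_t;

/* describe: tell the library how to (de)serialise one field. On the
 * sending side *data points at the field contents to transmit; on the
 * receiving side only *count and *type are used, to size the buffer. */
void describe(int struct_index, int field_index, void *s,
              int *count, MPI_Datatype *type, void **data)
{
    dw_struct_t *b = (dw_struct_t *)s;
    (void)struct_index;   /* one structure only in this sketch */
    switch (field_index) {
    case 0: *count = 1;          *type = MPI_INT;           *data = &b->blockid; break;
    case 1: *count = 1;          *type = MPI_INT;           *data = &b->npixels; break;
    case 2: *count = b->npixels; *type = MPI_UNSIGNED_CHAR; *data = b->pixels;   break;
    }
}

/* fillstruct: store one received field into the user's structure. */
void fillstruct(int struct_index, int field_index, void *s, void *recv_buf)
{
    dw_struct_t *b = (dw_struct_t *)s;
    (void)struct_index;
    switch (field_index) {
    case 0: b->blockid = *(int *)recv_buf;          break;
    case 1: b->npixels = *(int *)recv_buf;          break;
    case 2: b->pixels  = (unsigned char *)recv_buf; break; /* keep the buffer */
    }
}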
2.4 Failure semantics
Due to the characteristics of the underlying algorithm [6], our system compensates omission, performance and state-transition failures [5] of its workers:

Omission failures occur when an agreed reply to a request is missing; the request appears to be ignored. This is compensated by redistributing a request when its "freshness" allows it.

Performance failures occur when the service is supplied, though too late with respect to some real-time interval possibly agreed upon in the specifications. This class of failures can be compensated by adopting a number of workers large enough to mask, e.g., crashing or late-processing workers.

State-transition failures are also covered, as the state of the system is never lost because of a failing worker: indeed, by construction, each state transition is atomic – a block is marked as processed only when its completion is explicitly acknowledged. Half-finished blocks are simply discarded.
Figure 3. Process for sending and receiving one field of a data structure. On the sending side, the master-worker library (MW) calls the user library's (UL) describe function, which returns count, datatype and the data, and the data is sent. On the receiving side, describe returns count and datatype, the master-worker library allocates a receive buffer of count * sizeof(datatype), receives the data into it and passes it to the user's fillstruct function, which returns the structure with the data filled in.
2.4.1 Delays and errors during processing
When a worker is delayed in processing a block, the same block can meanwhile be assigned to another worker. This is a result of the freshness grade of each block, kept by the dispatcher and denoting the priority with which blocks are assigned to workers. When the collector has notified the dispatcher that a certain block is finished, the status value of that block is set to DISABLED. When a block is assigned to a worker, the status value is increased by one. Choosing the block with the lowest non-DISABLED status value results in re-assigning a block when a worker has been delayed and all other blocks of that input unit are processed or delayed in their turn. Hence, a failing or delayed worker only causes some performance degradation; the final result is not affected.

MPICH requires that each process exits only after calling MPI_Finalize, otherwise the whole program terminates erroneously. For this reason, user application functions should not let the process exit the program. Returning error codes to the master-worker library avoids this problem. When e.g. a worker fails to process a block, it discards it and demands another one. The discarded block will eventually be assigned to another worker¹. By the policy of returning error codes from the user application functions for splitting, processing, merging and other functionality, the system is terminated correctly and, if necessary, further processing can take place in the main function. This requirement does not hold for all MPI implementations: in the IBM Parallel Environment [13], for example, only the first node is required to call MPI_Finalize. In these cases, other worker failures are supported as well.

¹ This approach does not deal with a data fault triggered by a certain block: all workers would fail when attempting to process that work unit. This case is covered when the user makes use of the N-version programming approach [2] for the user function, which can reduce the probability of correlated failures.
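The dispatcher's block-selection policy described in this section can be condensed into a few lines. This is a minimal sketch: the plain status array and the DISABLED sentinel are an assumed representation, not raftnet's actual bookkeeping.

#define DISABLED (-1)   /* assumed sentinel for acknowledged blocks */

/* Pick the block with the lowest non-DISABLED freshness grade.
 * Returns -1 when every block of the input unit is processed. */
int select_block(const int *status, int nblocks)
{
    int best = -1;
    for (int k = 0; k < nblocks; k++) {
        if (status[k] == DISABLED)
            continue;                             /* already finished */
        if (best < 0 || status[k] < status[best])
            best = k;                             /* least fresh wins */
    }
    return best;
}

On assignment, status[best] is incremented; when the collector reports blockid k as finished, status[k] is set to DISABLED.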
2.4.2 Memory allocation, delays and failures
As already mentioned, raftnet is built on top of MPI, and many MPI implementations exist, each with its own memory allocation strategy. As for the buffers within the master-worker library: for each receive (each field in each structure), the describe function returns the number of elements and the data type to the master-worker library. The master-worker library allocates memory and returns the pointer to that allocated buffer with the data. This pointer is used in the fillstruct function to fill the structure. This memory allocation strategy is used because it is not always known beforehand how large e.g. the output of a worker will be, or how large each block will be after splitting by the master. But most importantly, this approach simplifies the code that depends on the user application.

A memory allocation failure can be due either to not enough available memory on that node – possibly resulting from memory leaks – or to other memory-greedy programs running. In the former case, memory allocation fails permanently and contribution to the system eventually ends. In the latter, the worker might be able to continue after the other program has finished or has reached a further stage of processing. When receiving data, the receiving process can fail to allocate memory for the receive buffer. In this case, the process sleeps a certain time slice, polling for available memory regularly. In case the failure is temporary (e.g. another memory-greedy program running on the same node), this allows the process to continue its program cycle. For permanent failures, the process stops contributing after trying to allocate memory a certain number of times.
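The retry behaviour described above amounts to something like the following sketch; the retry count and sleep interval are illustrative values, not raftnet's.

#include <stdlib.h>
#include <unistd.h>

#define ALLOC_RETRIES 10   /* illustrative bound */
#define ALLOC_SLEEP_S 5    /* illustrative pause */

/* Allocate a receive buffer, tolerating temporary memory shortage. */
void *alloc_with_retry(size_t nbytes)
{
    for (int tries = 0; tries < ALLOC_RETRIES; tries++) {
        void *p = malloc(nbytes);
        if (p != NULL)
            return p;          /* shortage was temporary (or none) */
        sleep(ALLOC_SLEEP_S);  /* poll again after a time slice */
    }
    return NULL;               /* permanent failure: stop contributing */
}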
2.4.3 Processes sending to failed workers
The algorithm handles failures of workers if error codes are returned to the master-worker library. Workers can fail to contribute or be delayed for quite some time. However, processes that send to these workers are not aware of such events. Therefore, these events are handled by using appropriate send operations. When using synchronous messages for sending a bunch of data, a sort of 'handshake' is required between the sender and the receiver. When the send operation is a blocking one and the receiving process is a worker that has failed in the meantime, the whole system will block. For this reason, synchronous blocking messages for sending towards workers are avoided. In the MPI standard [18], no explicit asynchronous send operation is available [8, 9]. The standard send operation can be asynchronous, but this typically depends on the MPI implementation and – in the case of MPICH – also on the length of the message that is to be transmitted. Therefore, non-blocking send operations are invoked. This is combined with standard-mode messages, in order to exploit optimisations of buffering in the MPI implementation.
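In MPI terms this comes down to posting a standard-mode, non-blocking send, for example as in the sketch below; the tag and the completion policy are illustrative, not raftnet's.

#include <mpi.h>

/* Send a block toward a worker without ever blocking on a handshake.
 * Returns 1 if the send completed immediately, 0 if it is still
 * pending (the request is then detached and completes, or not, in
 * the background). */
int send_block_to_worker(void *block, int nbytes, int worker,
                         MPI_Comm comm)
{
    MPI_Request req;
    MPI_Status  st;
    int done = 0;

    MPI_Isend(block, nbytes, MPI_BYTE, worker, 0, comm, &req);
    MPI_Test(&req, &done, &st);      /* non-blocking completion check */
    if (!done)
        MPI_Request_free(&req);      /* never wait on a failed worker */
    return done;
}

A real implementation additionally has to keep the buffer alive as long as the detached send could still be in progress.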
2.5 Deadlocks and Race Conditions
MPICH requires all processes to call MPI_Finalize for the system to be able to exit correctly. Blocked processes prevent this and hence are avoided in the master-worker system. Several measures are taken in this respect. Firstly, both the dispatcher and the collector resume receiving messages even after all blocks have been processed and take appropriate action. This avoids workers blocking when trying to send a message to either process. Secondly, before a worker sends its workerid to the dispatcher or a processed block to the collector, it checks for pending messages from the dispatcher. Such messages either require it to finalise or notify it that another input unit is being processed in the meantime. In this way, unnecessary send operations are not invoked.

With this scheme, race conditions are still possible. These could be avoided using non-blocking messages and cancelling them after a certain time slice. However, the version of MPICH that has been used does not allow cancelling send operations. Therefore, workers send an acknowledgement message after receiving a STOP from the dispatcher. Only after all acknowledgements are received is a STOP message transmitted to the collector. In this way, termination of all processes is guaranteed and the whole system exits.
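The worker's check for pending dispatcher messages maps onto MPI_Iprobe, roughly as follows; the sketch only shows the non-blocking test, the reaction to the probed message is handled elsewhere.

#include <mpi.h>

/* Returns 1 if the dispatcher has a message waiting (e.g. a STOP or a
 * notification that another input unit is being processed), so that
 * the worker can skip a now-pointless send. */
int dispatcher_message_pending(int dispatcher, MPI_Comm comm)
{
    int flag = 0;
    MPI_Status st;

    MPI_Iprobe(dispatcher, MPI_ANY_TAG, comm, &flag, &st);
    return flag;
}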
2.6 Initialisation and Finalisation

The master-worker library and the MPI environment are initialised and finalised with two separate calls to the master-worker library, masterworker_init and masterworker_final. The masterworker interface itself can be invoked one or more times in between. One master-worker run can e.g. use the output of a former master-worker run as input. Also, other process functions or other user functions can be used in these separate master-worker runs.

All nodes can be initialised and finalised via user application functions that are passed to the master-worker library via the masterworker interface. These functions can be used for setting up or freeing global variables in these processes once. Such global variables can be used during the program flow of that node, avoiding re-initialisation and re-finalisation after each process cycle of that node. If such initialisation is not necessary, empty functions can be used.

Not all variables that are needed for processing are available at all nodes. If a set of variables is needed by the collector for merging the data, these can be transmitted in the MC_MERGED_STRUCT structure. Variables needed for processing a block, but changing according to the blockid (k), should be sent to the workers via the DW_STRUCT. Variables that do not change according to the blockid should, however, not be transmitted with each block. These can be initialised as global variables before calling the masterworker_init function. All processes will possess these variables in this case.
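A main program using the library then has the following shape. Only the names masterworker_init and masterworker_final appear in the text; the prototypes and the masterworker call below are assumptions made for illustration.

/* Assumed prototypes; the real raftnet header defines the interface. */
extern void masterworker_init(int *argc, char ***argv);
extern void masterworker();      /* argument list (pointers to the user
                                    functions) omitted in this sketch */
extern void masterworker_final(void);

int main(int argc, char **argv)
{
    masterworker_init(&argc, &argv);  /* MPI + library initialisation */

    masterworker();   /* one run; may be invoked again, e.g. feeding
                         the output of this run into the next one */

    masterworker_final();             /* MPI + library finalisation */
    return 0;
}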
2.7 Speeding up

Possible time-consuming operations – like saving results, splitting input units, or merging and post-processing the output – are performed by separate threads. In this way, temporary delays are smoothed out over one cycle of the master-worker system for splitting, processing and post-processing a data unit.

The same reasoning has been applied to the master and the dispatcher. These consist of separate threads on the same node. Sending a set of input blocks from the master to the dispatcher then amounts to only exchanging pointers. This avoids the communication cost of sending input blocks from the master to the dispatcher or output units from the dispatcher to the master.
Figure 4. Capture and processing setup: the capture client (A) writes the input to a high-speed disk, which is shared over the network with the master, distributor, workers and collector.
3 Application and System Description

3.1 Parallelisation Choice
Not all algorithms are good candidates for parallelisation with raftnet. Parallelising an application over different processors offers real added value once code optimisations have been done on the algorithm; as such, code optimisation and parallelisation are complementary. The following rules of thumb were used in the decision to apply the raftnet library to the algorithms in the following paragraphs:

Algorithm locality Splitting up the input data for parallel execution is most efficient when operations have no or little data dependency, or when these operations are very local if there is a positional data dependency (e.g. pixels in an image grid).

Input and output When data is split up, this can affect the final result. Either the result is erroneous (mostly because of the absence of locality in the algorithm), the output has noise added to it, or the merged output is correct (identical to serial execution). The last case often occurs because the parallel execution uses a split policy that was already inherently present in the serial execution (e.g. audio MP3 encoding). In the second case, it has to be evaluated whether the noise is significant for the final outcome and, if it is, whether it can be filtered out in an efficient way.

In the following, two case studies that fit in a global 3D reconstruction optimisation project will be discussed: feature selection via corner detection, and Delaunay triangulation on the reconstructed points. The testing environment consisted of a 400 MHz PA-RISC 85000 processor with 1.5 MB cache, running the HP-UX 11.0 operating system. MPICH 1.2.2 has been used, which implements the MPI 1.2 standard.

3.2 Corner Detection

The corner detection algorithm [12] forms the basis of a 3D reconstruction algorithm [23, 24]. The features selected in this phase are matched (i.e. the corresponding corners in several images) and used for reconstruction of the third dimension, which was lost in the projection to 2D while capturing.

During the tests, the setup in figure 4 was used. Machine A captures, via a bt878-based card [1], the data onto a local disk [21, 25]. In turn, this disk is shared via NFS [20] with the rest of the network, in particular the parallel master-worker cluster. The images are captured in PAL VideoCD format, 352x288, 25 fps (frames per second), to pgm² frames.

The initial version in C, based on the C++ development code, was not even close to being able to process the images in soft real time: only 7 fps were processed³. In a following step, the algorithm was optimised by including faster and more memory-aware data types, by reducing intermediate buffers and by loop transformations [3, 16, 15]. Due to these changes, this algorithm got a speedup of 30%, which still put it about two times slower than the minimal soft real time requirement (10-12 fps). Since conventional code optimisation techniques were exhausted, a larger cycle budget was assigned to the task within the available environment by adding a parallelisation layer.

Figure 5 shows two possible parallelisation schemes. One possibility is to split up the image. In doing this, edge effects have to be accounted for, so 7% of the pixel values need to be replicated. In the test setup, the second option was preferred: the frames are handled in a FIFO by the master and are sent as a whole to the workers. This also has the advantage that no splitting is needed at the dispatcher's side, nor is there a need for merging the results (coordinate data) at the collector's side. For the integration of the master-worker algorithm, only 125 extra lines of code had to be written to use the required functionality of raftnet. Tests on the implemented system show that soft real time feature detection is possible with 4 to 5 workers (depending on the network and worker load).
² For more info on the format, see http://www.wotsit.org/download.asp?f=pgm
³ AMD 650, GNU/Linux 2.4.19-pre4, gcc 2.95.4, -O3
Figure 5. Image processing parallelisation schemes
3.3 Delaunay
With triangulation, an object can be visualised as a 3D image, using a set of 3D points. A Delaunay triangulation in dimension

Of the original (overlapping) blocks' border, only two thirds is retained. Every face that does not lie entirely within this range is deleted from the set of faces. The result of eliminating these faces can be seen on the right of figure 6. With only a minor change in the parallelised application for overlap and for eliminating border faces, the result is very similar to that of the serial implementation.

This setup has been tested for a varying number of workers and a varying number of blocks. The largest speedup we achieved was a factor of 10: the serial program ran for 22.66 seconds, while the parallelised version needed only 2.28 seconds. This was obtained with only three workers and 9 blocks. With a moderate programming effort for parallelising the application and a small number of workers, the speedup is quite high. This is a result both of the reduced algorithm complexity when using smaller input blocks and of parallelising the application. The larger the images and the higher the computational requirements, the higher the speedup of the parallelised application will be. Using a larger number of blocks not only reduces the computational requirements, but also balances the work load in the system. There is, however, a trade-off with the overhead of merging and eliminating border faces afterwards. A good splitting algorithm is an important task of the application programmer.
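The elimination of border faces described above can be sketched as a simple filter. The face representation and the retained 2D range below are assumptions made for illustration, not the application's actual data layout.

typedef struct { double x, y; } point_t;
typedef struct { point_t v[3]; } face_t;    /* assumed triangle face */

/* A face survives only if all of its vertices lie in the retained
 * part of the (overlapping) block. */
static int face_inside(const face_t *f, double xmin, double xmax,
                       double ymin, double ymax)
{
    for (int i = 0; i < 3; i++)
        if (f->v[i].x < xmin || f->v[i].x > xmax ||
            f->v[i].y < ymin || f->v[i].y > ymax)
            return 0;
    return 1;
}

/* Compact the face array in place; returns the new face count. */
int filter_border_faces(face_t *faces, int nfaces,
                        double xmin, double xmax,
                        double ymin, double ymax)
{
    int kept = 0;
    for (int i = 0; i < nfaces; i++)
        if (face_inside(&faces[i], xmin, xmax, ymin, ymax))
            faces[kept++] = faces[i];
    return kept;
}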