Dependable Parallel Computing with Agents

Sophie Chabridon and Erol Gelenbe
EHEI, UFR de Mathématiques et Informatique, Université René Descartes
45 rue des Saints-Pères, 75006 Paris, France
email:
[email protected]
Abstract
We discuss a novel technique for improving the dependability of parallel programs executing on a MIMD shared-memory architecture. The idea is to empower certain tasks of each application program to carry out failure detection, and to reschedule the execution of those tasks which are considered to have failed. A failure is taken to be any event which prevents a processor from performing the task assigned to it: this can be the actual stoppage of the processor due to a fault, or the preemption of the processor by a high-priority task which is not part of the application program considered and which impedes the progress of one of its tasks. The technique we propose is based on a task graph representation of the parallel program, in which communication between tasks is deliberately confined to the end of each task's execution, and which captures the precedence constraints between tasks. We propose and evaluate several algorithms which detect failures and restart failed tasks based on knowledge of this graph structure. A discrete-event simulator is used to evaluate the performance, under the effect of failures and with our detection and restart algorithms, of two specific parallel applications: matrix multiplication and the Fast Fourier Transform. For the second application, we measure the overhead due to the detection algorithm and the total execution time (including failure detection and recovery) obtained for different processor failure rates.

Key words: Dependability - Discrete-event simulation - Parallel computing - Performance evaluation - Software-based failure detection and restart.
1 Introduction

The dependability of computer systems is very important to designers and users of computers. With the recent advances of VLSI technology and of parallel computing making it possible to build machines with hundreds of processors, this concern becomes even more crucial. A multiprocessor machine is more prone to failures than a sequential one, especially if it is constructed from off-the-shelf processing units. The complexity of a parallel computing architecture introduces more opportunities for both hardware and software failures. However, the richness of parallel and distributed systems also provides opportunities for novel techniques for enhancing system dependability. The purpose of this paper is to propose and evaluate a novel method which enhances the dependability of parallel application programs running on MIMD architectures.
In recent research, Kedem et al. [22, 23, 24, 25, 26, 27, 28] showed that it is in principle possible for a parallel program to monitor its own execution in order to detect whether some of the processors on which it is executing have failed. Their approach allows the automatic rescheduling of tasks so that they are reassigned to processors which are in operating condition. They also analyze the computational complexity of such algorithms, under the assumption that all live processors participate in the failure detection and recovery algorithm, and assuming that a small common failsafe memory is available for task restart.

Our research is directed at the design of pragmatic failure detection and restart algorithms. These algorithms are meant to be self-contained in a parallel program which is designed for execution on a MIMD architecture. We will therefore describe the algorithms we propose and then illustrate these ideas through the simulation of a realistic application, the parallel FFT (Fast Fourier Transform).

In this work, we designate by "failure" any event which prevents a processor from performing the task that was assigned to it. This can be a hard failure, where the processor fails by stopping and does not perform any further actions; fail-stop processors have been formally presented and justified in [33]. This model also takes into account some soft failures, where a processor is preempted by the operating system and assigned to another task. Moreover, we consider that processors that have failed, or that are being used for some task other than the one initially assigned to them, can restart after a down period and thus can be allocated again to the computation which is of interest to us. Such models of failure have already given rise to much work [2, 6, 22, 23, 25, 26, 39].

Our failure detection and restart algorithms consider parallel programs which are structured in a task-graph representation. Such parallel programs are constrained to be a set of sequential tasks, where each task communicates with other tasks of the same program only at the end of its execution. Furthermore the tasks are organized so that there is no cyclic information passing. Thus the task graphs we consider describe sequential executions within the nodes of the graph, with precedence constraints between tasks, and an acyclic graph structure. The arcs of the graph describe both the transfer of information and the transfer of control. Such graphs have been widely used to model parallel programs in various settings (see for instance [7, 8, 16, 20, 39]). They have also been used to represent transactions in databases when these transactions are parallelizable [3]. The simplicity of this representation allows complex parallel computations, which may be composed of many hundreds of interdependent tasks, to be described efficiently. As we will also see via the examples discussed in this paper, such systems of interdependent tasks with an acyclic graph structure are a faithful representation of many important parallelized numerical algorithms.

The paper is organized as follows. In Section 2, we present the target architecture and the failure detection and recovery algorithms, and we illustrate the acyclic task graph structure via the rather common example of parallel matrix multiplication. Section 3 describes how these failure detection and recovery techniques can be applied to the parallelized Fast Fourier Transform algorithm.
We then discuss in some detail how the failure detection and recovery algorithm introduces overhead even when failures do not occur. Using a series of simulation experiments we also illustrate the effect of the processor failure rate on the effective average execution time of the application being considered. Finally, conclusions are drawn and future research objectives are stated in Section 4.
2 System architecture and failure recovery algorithms

We consider a MIMD architecture subject to failures. These failures manifest themselves to an application program via the stoppage of some processor. The following assumptions are made about this architecture and about the application programs which run on it.

The MIMD architecture is composed of M identical processing units. Communication and transfer of work between any two processors is possible via a network. This network is fully reliable, i.e. it is assumed that it never fails. The processors share a very large common memory which is accessible by each processor at the instruction level.

Any application program running on this architecture is composed of a set of (possibly interdependent) tasks. This interdependence is represented by a directed task graph which shows the precedence relationship between tasks. At any given instant of time, each processor of the machine is assigned an active task of the application. An active task is one which is ready for execution, or which is executing. Necessarily, an active task is one all of whose predecessor tasks have successfully completed.

When a processor fails, the task which it may be running will stop. We will say, with some abuse of language, that the task has "failed" to mean that the processor to which it was assigned has failed. On the other hand, the common memory is fail-safe, and is always accessible. If a processor fails and the task running on it is stopped, the failure is detected via the task stoppage, using the fact that in a shared-memory machine the states of all the active tasks and execution threads are globally accessible. Thus we are assuming that the state of the processors is not a reliable way of determining whether a processor has failed or not. However the stoppage of a task (which, for instance, may enter a "trap" state as soon as a processor failure occurs) is a reliable way of knowing that a failure has occurred.

Any application program is represented by a set of K tasks {1, ..., K}, and by an acyclic task graph G. Acyclicity is used to represent computations which are guaranteed to terminate as long as there is no failure. It is understood that processors are homogeneous, so that the same task may be executed on any one of the processors in the same amount of time. However each task will in general have a different execution time, which depends on the amount of work (number of instructions) it executes.

A task i is enabled if all of its predecessors are finished and have informed it of this. In that case its status variable V(i) passes from the BLOCKED value to the READY value. When the task is allocated to some processor, V(i) is set to ACTIVE. We suppose that any task i which terminates its normal computational activity sets V(i) to the value FINISHED. We then require that it inform all of its successors (in the precedence graph). This communication will in general take some random (but finite) time of average value Z. Note that the communication is fully reliable, as is the common memory.
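As an informal illustration, the status protocol just described can be sketched in a few lines of Python. This is not code from the paper: the class and method names are ours, and a real implementation would keep these structures in the fail-safe shared memory.

    from enum import Enum

    class State(Enum):
        BLOCKED = 0    # at least one predecessor has not reported completion
        READY = 1      # all predecessors FINISHED; waiting for a processor
        ACTIVE = 2     # assigned to a processor and (presumably) executing
        FINISHED = 3   # normal completion; successors have been informed

    class Task:
        def __init__(self, tid):
            self.tid = tid
            self.state = State.BLOCKED
            self.preds = []    # predecessors in the acyclic task graph G
            self.succs = []    # successors in G

        def finish(self):
            """Normal termination: set V(i) = FINISHED, then inform each
            successor, which becomes READY once all its predecessors are done."""
            self.state = State.FINISHED
            for s in self.succs:
                if all(p.state == State.FINISHED for p in s.preds):
                    s.state = State.READY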
2.1 Failure detection and recovery algorithms
We will discuss several failure detection and recovery algorithms in this section. A first algorithm will be presented (Algorithm 1), and then several other algorithms which are essentially refinements of the basic scheme will be discussed. The basic idea in all of these algorithms is that some, or all, of the tasks in the application are empowered with the failure detection and recovery procedure. Any task which is allowed to play this additional role (additional in the sense that failure detection and recovery is carried out in addition to the ordinary computational function of the task) is said to be an agent.

Algorithm 1. In this basic algorithm the tasks which have no successor (the "leaves" of the task graph) carry out failure detection and recovery; they are the only agents in the application. This activity is carried out in the following manner. After the leaf task i sets V(i) to FINISHED, it first examines the other leaves of the task graph. For each leaf seen not to be in the BLOCKED state, the agent examines its predecessors until all tasks are accounted for as BLOCKED, READY, FINISHED, or ACTIVE but unfinished. This last case covers tasks which may have been stopped due to processor failure. The agent will examine the corresponding processors and reinitialize all such tasks by stopping their current execution and rescheduling them on another processor. This algorithm is efficient when failures are rare, but it delays the detection of failures until at least one leaf completes execution. Indeed, it may happen that no leaf is ever executed due to preceding failures, so that the application stops as a whole. To remedy this deficiency, it is necessary to select intermediate tasks that run the detection algorithm in addition to the leaves. The following algorithms differ in the way the selection of agents is performed and in the manner in which they follow the edges of the task graph.

Algorithm 2. Here, we select agents from a certain number of "ranks", where the rank of a task is the length of the longest path from itself back to a source task (one without predecessors). Each task having one of the selected ranks is an agent. An agent begins by examining its sibling tasks for failure, i.e. those for which it is neither a predecessor nor a successor, and then proceeds to smaller ranks as the need arises.

Algorithm 3. When an agent finds a task of the same rank which has completed its execution, it proceeds to test in turn the successors of this completed task. The intention is to detect failures as soon as possible and to favor the rapid and successful completion of the whole program. Agents are still selected by choosing some of the ranks. Thus while Algorithms 1 and 2 carry out look-back recovery, Algorithm 3 does look-back followed by look-forward.

Algorithm 4. Algorithm 4 first selects the leaves to be agents. Then it selects other tasks to be agents at random, each with probability α. Note that when α = 1, all the tasks are agents. Algorithm 4 only uses look-back for detection and recovery.

Algorithm 5. This algorithm acts in the same way as Algorithm 3 (it uses look-back and look-forward) but, in addition to the leaves, it selects agent tasks at random with probability α.
Several variants of these algorithms can also be considered. For instance, detection can be improved by forcing the leaf task which is the first to complete the detection algorithm to perform the detection again after some delay. This leaf task can be kept alive as long as the program is not completed as a whole. Furthermore, if a program is not completed after some timeout t_max it may be restarted as a whole; this may be necessary when failure rates are very high.
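To make the look-back mechanism of Algorithm 1 concrete, the sketch below (our own Python, continuing the Task class above) walks back from the non-BLOCKED leaves. Here stopped() and reschedule() are hypothetical hooks standing for, respectively, the test of a task's trap state and the scheduler call that restarts a task on another processor.

    def lookback_detect(leaves, stopped, reschedule):
        """Sketch of Algorithm 1: run by a leaf agent after it has set its
        own state to FINISHED. Walks backwards through the predecessors of
        every non-BLOCKED leaf, restarting tasks whose processor has failed."""
        visited = set()
        stack = [leaf for leaf in leaves if leaf.state != State.BLOCKED]
        while stack:
            t = stack.pop()
            if t.tid in visited:
                continue
            visited.add(t.tid)
            # ACTIVE but stopped: the processor running t has failed, so
            # stop the current execution and reschedule t elsewhere.
            if t.state == State.ACTIVE and stopped(t):
                reschedule(t)
            stack.extend(t.preds)   # continue the look-back traversal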
2.2 A task-graph example: matrix multiplication
As an example of the construction of task graphs corresponding to a parallel computation we consider matrix multiplication. Parallel matrix multiplication is very widely used and easily implemented on a MIMD machine. Consider the matrix product C = A × B, where A, B and C are square matrices of size n by n. The n^2 terms c_ij of C are computed using the formula:

$$c_{ij} = \sum_{k=0}^{n-1} a_{ik}\, b_{kj}$$
Clearly the total number of operations necessary to compute the matrix C is n^2(2n - 1). At this point, we construct the task graph corresponding to this computation. For the sake of illustration, we present two different graphs in Figure 1 for n = 4.

In Graph 1, the computation of one term c_ij requires only one task, which performs n multiplications and (n - 1) additions. If each operation has an execution time of t, each task will then require (2n - 1)t time units.

In Graph 2, c_ij is computed in two parts. First the n products a_ik b_kj are computed for k = 0, ..., n - 1. Then, the n terms previously obtained are added in a tree fashion requiring log_2 n steps. This means that one term c_ij is obtained using 2n - 1 tasks, so that computing all the n^2 terms of the matrix C requires n^2(2n - 1) tasks. In the first part, each task performs one multiplication, giving an execution time of t. In the second part, each task performs one addition and also has an execution time of t.

Consequently Graph 2 will result in lower execution times than Graph 1 if a large number of processors is available. To quantify the number of processors necessary, we need to evaluate the degree of parallelism of each task graph, which is the maximum number of tasks present in the same generation of the graph. For Graph 1, there are n^2 tasks that can be computed in parallel, using n^2 processors in only (2n - 1)t time units. For Graph 2, the first generation is the one with the largest number of tasks, which is n^3. If n^3 processors are available, then the total execution time of the second graph is (log_2 n + 1)t time units. So the first graph has an execution time of O(n) and needs no more than n^2 processors, while the second has an execution time of O(log_2 n) if n^3 processors are available. However, since both task graphs involve the same number n^2(2n - 1) of operations, if only n^2 processors are available the second task graph will have the same execution time of (2n - 1)t units as the first graph.
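As an illustration of how Graph 2 could be generated mechanically, the following sketch (again our own code, reusing the Task class of Section 2, and assuming n is a power of two as in the n = 4 example) creates, for each entry c_ij, n product tasks followed by a binary tree of addition tasks:

    def build_graph2(n):
        """Build Graph 2 for C = A x B with n a power of two: each entry
        c_ij gets n product tasks a_ik*b_kj feeding a log2(n)-level tree
        of addition tasks, i.e. 2n - 1 tasks per entry, n^2(2n - 1) in all."""
        counter = [0]
        def new_task(preds):
            t = Task(counter[0]); counter[0] += 1
            t.preds = list(preds)
            for p in preds:
                p.succs.append(t)
            return t

        sinks = []
        for i in range(n):
            for j in range(n):
                # first generation: one task per elementary product a_ik*b_kj
                level = [new_task([]) for k in range(n)]
                # following generations: pairwise additions, log2(n) levels
                while len(level) > 1:
                    level = [new_task(level[k:k + 2])
                             for k in range(0, len(level), 2)]
                sinks.append(level[0])   # the task that produces c_ij
        return sinks                     # the n^2 leaves of the task graph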
Figure 1. Two different task graphs for parallel matrix multiplication with n = 4. Graph 1 computes each term C(i,j) with a single task, giving n^2 tasks in total; Graph 2 computes the elementary products A(i,k)*B(k,j) in a first generation of tasks and then sums them through a tree of addition tasks, giving n^2(2n - 1) tasks in total.
2.3 Simulating processor failures and task recovery
In the simulation experiments which we present in this paper, the task graphs used correspond to real application programs. However various parameters are chosen as modeling assumptions for the sake of the simulation experiments. In particular, each time a task is tested by an agent, we assume that a fixed time w is added to the execution time of the detecting task. Similarly, when a task is restarted, we assume that a fixed time C is added to the subsequent execution time of the task. Thus C and w represent the overhead of the detection and restart work.

For the purpose of the simulation, we have chosen specific random processes and random times for various other parameters. Processor failures occur according to a Poisson process with rate λ. Once a processor fails it remains unavailable for some time having an exponential distribution of average value F, after which it is again available. Clearly λ and F are among the important parameters of the system. If F is very large, we are dealing essentially with permanent failures.

In addition to the statistical nature of the failures, we have to specify the manner in which failures may interact with detection. In particular we will consider two alternatives concerning the occurrence of failures while the failure detection task is being run.

Failure Assumption A (FAA). Under this assumption no failure may occur at the processor running an agent during failure detection and recovery. This is realistic only if
failure detection and recovery are run in some protected mode or on processors which are particularly reliable.

Failure Assumption B (FAB). Here agents may also fail while they run failure detection and recovery. This assumption corresponds to the case of hardware failures, where a processor, and consequently the task it is running, can be stopped by a failure at any time. This assumption is the more realistic of the two.
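The failure model used in the simulations can be sketched as follows; this is our own reconstruction of the stated assumptions (per-processor Poisson failures of rate λ, exponentially distributed repair times of mean F), not the simulator itself.

    import random

    def failure_intervals(lam, F, horizon):
        """Generate the down-time intervals of one processor over [0, horizon]:
        up-times are exponential with rate lam (so failures form a Poisson
        process while the processor is up), and down-times are exponential
        with mean F. Returns a list of (failure_time, repair_time) pairs."""
        t, intervals = 0.0, []
        while True:
            t += random.expovariate(lam)        # next failure instant
            if t >= horizon:
                return intervals
            down = random.expovariate(1.0 / F)  # repair delay, mean F
            intervals.append((t, t + down))
            t += down                           # processor available again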
3 An example: dependable execution of the parallel FFT algorithm

In order to proceed further in the understanding of how our proposed failure detection and recovery algorithms work, and the manner in which they affect a practical parallel application, we consider the discrete Fourier transform, which is widely used in a variety of application areas ranging from statistics to signal processing. Much effort has been spent by researchers to find efficient implementations of this algorithm, especially on parallel machines. In particular, the Fast Fourier Transform algorithm [9] is an efficient version which we consider in parallel form in this section. We first describe the FFT algorithm and its parallelization, as well as the corresponding task graph structure. Then we conduct simulations of the execution of the FFT task graph on a set of failing parallel processors with our failure detection and recovery algorithms.

For a given complex vector X, its discrete Fourier transform on a finite number N of input samples is given by:
$$y(k) = \sum_{i=0}^{N-1} x(i)\, \omega_N^{ik}, \qquad k = 0, 1, \dots, N-1$$
where $\omega_N = e^{j(2\pi/N)}$ is the Nth root of unity. Clearly the computation of all the components of the complex vector Y requires N^2 complex multiplications and additions. It is possible to rewrite the above equation as:
$$y(k) = y_{even}(k) + \omega_N^k\, y_{odd}(k), \qquad 0 \le k \le N/2 - 1$$
$$y(k + N/2) = y_{even}(k) - \omega_N^k\, y_{odd}(k), \qquad 0 \le k \le N/2 - 1$$
where Y_even is the Fourier transform of the even points of X, and Y_odd is the Fourier transform of the odd points of X. The Fast Fourier Transform (FFT) [9] uses the two previous recursive equations. It computes the discrete Fourier transform of a vector using (1/2) N log_2 N operations instead of N^2. We show in Figure 2 an example of a task graph for the FFT algorithm for N = 16, with N log_2 N = 64 tasks. The connections between the tasks represent the flow of data during the computation. Initially, a source task receives the component of X whose index is the bit-reversal permutation of the task number decremented by 1. For example, task 2 (2 - 1 = 1 being coded 0001 on log_2 N = 4 bits, whose bit reversal is 1000 = 8) will receive the component x(8). Then each task performs one complex multiplication and one complex addition. If each task has an execution time t, the total execution time of the parallel FFT algorithm is (log_2 N) t time units with at least N processors.
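The bit-reversal rule for assigning inputs to source tasks is easy to state in code; the small sketch below (ours, not the paper's) reproduces the example just given, where task 2 receives x(8) for N = 16.

    def bit_reverse(i, bits):
        """Reverse the low-order `bits` bits of the integer i."""
        r = 0
        for _ in range(bits):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r

    # Source task k of the FFT graph receives x(bit_reverse(k - 1, log2 N)).
    # For N = 16: task 2 -> index bit_reverse(1, 4) = 8, i.e. component x(8).
    assert bit_reverse(2 - 1, 4) == 8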
We have run simulations for the case N = 256, generating a graph with 2048 tasks, and we have used M = 256 processors. Each task has an execution time t corresponding to one complex multiplication and one complex addition; we have taken 3.24E-04 seconds for the average duration of a complex multiplication and 1.04E-04 seconds for the average duration of a complex addition, which are realistic figures for current workstation technology. The detection step duration w is taken equal to 6.5E-05, which corresponds to the average time necessary to test the state of a task. The restarting step duration C is taken equal to 1.0E-04; this is a relatively small value, which implies that a processor's registers can be loaded very rapidly with the status of a task.

For all the simulations presented, we have computed confidence intervals at the 95% level. The intervals thus obtained range from 0% to 10% of the average program execution times for all the simulations. They are not shown directly on the figures in order not to clutter the information being presented.

We easily see that when no detection algorithm is running, and when failures cannot occur, the total execution time of the graph with N = 256 is (log_2 N) t = (log_2 256) (3.24E-04 + 1.04E-04) = 3.424E-03 seconds. However, when there are no failures but a detection algorithm is running, this generates an unavoidable overhead. In all the subsequent figures, the symbol D is used to denote the Detection algorithm concerned by a particular curve. For Algorithms 2, 3, 4 and 5, in addition to the leaves, tasks are selected to be agents using a probabilistic assignment. Therefore we have run simulations with four values of the probability α that an arbitrary task is an agent: 0.25, 0.5, 0.75 and 1. Without failures, the smallest execution time (1.99E-02) is obtained with Algorithm 1, since it has only N = 256 tasks (the leaves) running the detection, while the other algorithms also have intermediate tasks running the detection algorithm. The largest execution time (3.78E-01) is encountered when we use Algorithm 5 with α = 0.25, giving an average detection overhead of 4.096E-02.
Figure 2. Task graph for the FFT computation with N = 16: the inputs x0, x8, x4, x12, x2, x10, x6, x14, x1, x9, x5, x13, x3, x11, x7, x15 enter the 16 source tasks in bit-reversed order, and the outputs y0 through y15 leave the 16 sink tasks in natural order.
In order to analyze the behavior of the detection algorithms, we have evaluated the FFT execution time with each detection algorithm for three values of the failure rate λ: 0.001, 0.005 and 0.01. We first consider failure assumption A (FAA), under which a task running the detection algorithm cannot fail until it has completed the detection. For the maximum program duration allowed, t_max, we have taken the value 0.5 for λ = 0.001, 1 for λ = 0.005 and 1.5 for λ = 0.01.
Figure 3. Comparison of the detection algorithms: average execution time of the parallel FFT versus the failure rate λ (0.001 to 0.01), with one curve for each detection algorithm D = 1, ..., 5.
Figure 4. Execution times with and without communication overhead: average execution time of the parallel FFT versus the failure rate λ, for Z = 0 and Z = 4.28E-04.
We show in Figure 3 the average execution times for the parallel FFT algorithm for each of the failure detection and recovery techniques which we consider, as a function of the failure rate. For Algorithms 2 to 5, only the value of α which gave the shortest execution time is shown for each value of the failure rate. For Algorithms 2, 3 and 4, α = 0.25 always provided the smallest execution time for any value of λ. For Algorithm 5, α = 0.75 appeared to be the best choice for any value of λ. Algorithm 2 gives the smallest execution times for any value of λ, even though we observe that it results in average job execution times which are very close to those of Algorithm 3 for low values of λ (e.g. λ = 0.001). Algorithm 5 gives the largest execution times.

So far, we have assumed in all of the simulation results presented above that tasks communicate "instantaneously" with other tasks, i.e. that the communication delay between tasks is negligible, so that Z = 0. We now explicitly address the effect of communication overhead via a series of simulation runs carried out under the same conditions as before, with Z = 0 and also with Z = 4.28E-04, as shown in Figure 4. We present the average execution times obtained with Algorithm 2 and α = 0.25, with and without communication overhead. We see that the additional slow-down introduced by communication overhead increases substantially with the failure rate because, as expected, when tasks are restarted all their communication activities have to be repeated. For λ = 0.01 we observe an increase in total effective execution time of more than 30%. Note again that, as previously, all simulations produce confidence intervals of less than 10% at a confidence level better than 95%.

Finally, let us consider Failure Assumption B, which allows tasks to fail as they are carrying out failure detection and recovery. In Figure 5 we provide two values of the average execution time for each detection algorithm with λ = 0.001; the first value was obtained under FAA and the second under FAB. As expected, in each case, the average execution time corresponding to FAB is larger than the result corresponding to FAA. However the relative difference varies both with the number of agents (as represented by the parameter α) and the time it takes to detect failures.
              α = 0.25    α = 0.5     α = 0.75    α = 1
    D=1  FAA  9.835E-02   (leaves only; α does not apply)
         FAB  2.07E-01
    D=2  FAA  7.426E-02   1.069E-01   1.116E-01   1.739E-01
         FAB  1.359E-01   1.706E-01   1.689E-01   1.739E-01
    D=3  FAA  9.88E-02    1.398E-01   1.911E-01   2.506E-01
         FAB  1.759E-01   1.895E-01   2.128E-01   2.506E-01
    D=4  FAA  7.153E-02   1.034E-01   3.125E-01   3.632E-01
         FAB  3.91E-01    3.042E-01   2.158E-01   3.634E-01
    D=5  FAA  4.393E-01   5.566E-01   3.862E-01   5.978E-01
         FAB  5.044E-01   6.506E-01   6.924E-01   5.978E-01

Figure 5. Average execution times under FAA (first row of each pair) and FAB (second row) with λ = 0.001.
4 Conclusions

In this paper, we have studied the behavior of parallel programs running on a MIMD machine subject to failures. We have introduced and discussed a set of algorithms which are meant to make an application program self-monitoring, in that it is able to detect failures or stoppages of some of its processors, and then to reschedule the tasks which have been stopped by those failures. This approach introduces additional overhead, even when no failures occur, but has the major advantage of making parallel programs relatively impervious to failures in the processing environment. These techniques do assume, however, that failures are limited to processors and that there is a failsafe shared memory and interconnection network.

The method we have proposed to transform a parallel program written for an ideal machine with no failures so that it runs correctly on an unreliable machine is totally user-transparent. It is thus the work of the compiler to transform any parallel program into a resilient version able to run on a failure-prone multiprocessor architecture. This software-based dependability strategy does not require any particular hardware mechanism, and allows on-line failure detection and dynamic reallocation of failed tasks.

In order to illustrate and evaluate this approach, we have considered performance measures such as the total effective execution time in the presence of failures, and the failure detection overhead. Simulations have been conducted in which various parameters, including the failure rate, have been varied. These simulation results demonstrate the feasibility of our approach, and indicate that the overhead can be maintained at a modest level. We are now pursuing work to adapt these algorithms to fully distributed systems with no shared memory.
References

[1] F. Baccelli and A. Makowski, Queueing models for systems with synchronization constraints, Proc. IEEE, 77 (1), 1989.
[2] M. Banâtre, A. Gefflaut, P. Joubert, P. Lee, C. Morin, An architecture for tolerating processor failures in shared-memory multiprocessors, INRIA Report No. 1965, p.1-35, 1993.
[3] P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
[4] B. Bhargava and S.-R. Lian, Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems - An Optimistic Approach, Proc. 7th IEEE Symposium on Reliable Distributed Systems, 1988.
[5] W. Cellary, E. Gelenbe, T. Morzy, Concurrency Control in Distributed Databases, Elsevier North-Holland, Amsterdam and New York, 1988.
[6] S. Chabridon and E. Gelenbe, Dependable execution of distributed programs, Proc. Massively Parallel Processing Conference '94 (North-Holland Elsevier), Delft, June 21-23, 1994.
[7] W. Chu and K. Leung, Module replication and assignment for real-time distributed processing systems, Proc. IEEE, p.547-562, 1987.
[8] W. Chu, C. Sit, and K. Leung, Task response time for real-time systems with resource contentions, IEEE Trans. on Software Engineering, 17 (10), p.1076-1092, 1991.
[9] J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, Vol. 19, p.297-301, 1965.
[10] F. Cristian and F. Jahanian, A Timestamp-Based Checkpointing Protocol for Long-Lived Distributed Computations, Proc. 10th IEEE Symposium on Reliable Distributed Systems, 1991.
[11] E.N. Elnozahy, D.B. Johnson, and W. Zwaenepoel, The Performance of Consistent Checkpointing, Proc. 11th IEEE Symposium on Reliable Distributed Systems, 1992.
[12] E.N. Elnozahy and W. Zwaenepoel, Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit, IEEE Transactions on Computers, 41 (5), May 1992.
[13] E. Gelenbe, A model of roll-back recovery with multiple checkpoints, Proc. ACM-IEEE 2nd International Symposium on Software Engineering, October 1976, p.251-255.
[14] E. Gelenbe, On the Optimum Check-Point Interval, J. ACM, 26, 1979, p.259-270.
[15] E. Gelenbe, Temps d'exécution asymptotique d'un programme parallèle, Comptes-Rendus Acad. Sci. Paris (Proc. French National Academy of Science), 309 (I), p.399-402, 1989.
[16] E. Gelenbe, Multiprocessor Performance, John Wiley & Sons, New York, 1989.
[17] E. Gelenbe and D. Derochette, Performance of roll-back recovery systems under intermittent failures, Comm. ACM, 21 (6), June 1978, p.493-499.
[18] E. Gelenbe and Z. Liu, Performance analysis approximations for parallel processing of concurrent tasks, in M. Cosnard (ed.), Parallel Computation, North-Holland Pub. Co., 1988.
[19] E. Gelenbe and I. Mitrani, Modeling the Execution of Block Structured Processes with Hardware and Software Failures, in G. Iazeolla, P. Courtois, and A. Hordijk (eds.), Mathematical Computer Performance and Reliability, North-Holland Pub. Co., 1983.
[20] E. Gelenbe, R. Nelson, T. Philips and A. Tantawi, Asymptotic processing time of a model of parallel computation, Proc. National Computer Conference (U.S.A.), Las Vegas, p.127-138, Nov. 1986.
[21] T.T-Y. Juang and S. Venkatesan, Crash Recovery With Little Overhead (Preliminary Version), Proc. 11th IEEE International Conference on Distributed Computing Systems, 1991.
[22] P.C. Kanellakis, A.A. Shvartsman, Efficient parallel algorithms can be made robust, Distributed Computing, p.201-217, 1992.
[23] P.C. Kanellakis, A.A. Shvartsman, J.F. Buss, P.L. Ragde, Parallel algorithms with processor failures and delays, Brown University Tech. Rep. No. CS-91-54.
[24] Z. Kedem and K. Palem, Transformations for the Automatic Derivation of Resilient Parallel Programs, Proc. 1992 IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, p.16-25, 1992.
[25] Z.M. Kedem, K.V. Palem, M.O. Rabin, A. Raghunathan, Efficient program transformation for resilient parallel computing via randomization, ACM Symp. on the Theory of Computing, p.306-317, 1992.
[26] Z.M. Kedem, K.V. Palem, A. Raghunathan, P.G. Spirakis, Combining tentative and definite executions for very fast dependable parallel computing, ACM Symp. on the Theory of Computing, p.381-390, 1991.
[27] Z.M. Kedem, K.V. Palem, P.G. Spirakis, Efficient robust parallel computations (extended abstract), ACM Symp. on the Theory of Computing, p.138-148, 1991.
[28] Z. Kedem, K. Palem, A. Raghunathan, and P. Spirakis, Resilient Parallel Computing on Unreliable Parallel Machines, in A. Gibbons and P. Spirakis (eds.), Lectures on Parallel Computation, Cambridge University Press, p.145-172, 1993.
[29] R. Koo and S. Toueg, Checkpointing and Rollback Recovery for Distributed Systems, IEEE Transactions on Software Engineering, SE-13 (1), January 1987.
[30] P-J. Leu and B. Bhargava, Concurrent Robust Checkpointing and Recovery in Distributed Systems, Proc. 4th IEEE International Conference on Data Engineering, 1988.
[31] N. Pekergin and J. Vincent, Stochastic bounds on parallel program execution times, IEEE Trans. on Software Engineering, 17 (10), p.105-113, 1991.
[32] R.A. Sahner and K.S. Trivedi, Performance and reliability using directed acyclic graphs, IEEE Trans. on Software Engineering, 13 (10), p.1105-1114, 1987.
[33] R.D. Schlichting, F.B. Schneider, Fail-stop processors: An approach to designing fault-tolerant computing systems, ACM Trans. on Computer Systems, Vol. 1, No. 3, p.222-238, 1983.
[34] A.P. Sistla and J.L. Welch, Efficient distributed recovery using message logging, Proc. 8th ACM Symposium on Principles of Distributed Computing, 1989.
[35] R.E. Strom, D.F. Bacon, and S.A. Yemini, Volatile Logging in n-Fault-Tolerant Distributed Systems, Proc. 18th IEEE Symposium on Fault Tolerant Computing, 1988.
[36] R.E. Strom and S.A. Yemini, Optimistic Recovery in Distributed Systems, ACM Transactions on Computer Systems, 3 (3), August 1985.
[37] A. Thomasian and P. Bay, Analytic queueing network models for parallel processing of task systems, IEEE Trans. on Computers, 35 (12), p.1045-1054, 1986.
[38] Z. Tong, R.Y. Kain, and W.T. Tsai, A Low Overhead Checkpointing and Rollback Recovery Scheme for Distributed Recovery, Proc. 8th IEEE Symposium on Reliable Distributed Systems, 1989.
[39] J.N. Tsitsiklis, C.H. Papadimitriou, P. Humblet, The performance of a precedence based queueing discipline, J. ACM, Vol. 33, No. 3, p.593-602, July 1986.