Eighth IEEE International Symposium on Cluster Computing and the Grid

Reliability-aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Nichamon Naksinehaboon1, Yudan Liu1, Chokchai (Box) Leangsuksun1, Raja Nassar1, Mihaela Paun1, Stephen L. Scott2

1 College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA
2 Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
{nna003, yli010, box, nassar, mpaun}@latech.edu, [email protected]

Abstract: For a full checkpoint on a large-scale HPC system, huge memory contexts must potentially be transferred through the network and saved to reliable storage. As such, the time taken to checkpoint becomes a critical issue which directly impacts the total execution time. Therefore, incremental checkpointing, as a less intrusive method to reduce the waste time, has been gaining significant attention in the HPC community. In this paper, we build a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints. Moreover, a method to find the number of those incremental checkpoints is given. Furthermore, most of the comparison results between the incremental checkpoint model and the full checkpoint model [19] on the same failure data set show that the total waste time in the incremental checkpoint model is significantly smaller than the waste time in the full checkpoint model.

I. INTRODUCTION

Generally, checkpoint/restart on a large-scale distributed system is much more challenging than checkpoint/restart on a single system. The difference stems from the multiplicity and coordination of state saving and from the complexity of application recovery. In addition, larger systems are more exposed to potential failures. For example, the analysis of the ASC White error log from Lawrence Livermore National Laboratory (LLNL) shows that the mean time between failures (MTBF) of a single node may be several thousand hours, but for the whole system with 512 nodes the MTBF can be reduced to only 20 hours [19]. This phenomenon suggests that applications running on a large-scale system need more checkpoint placements to reduce lost computational time. Also, for checkpointing on a large-scale system, huge memory contexts must potentially be transferred through the network and saved; therefore the checkpoint overhead becomes a critical issue which directly impacts the application execution time and storage requirement.

In recent years, research in reducing checkpoint overhead has gained significant attention in the high performance computing community [2][5][6][7][9][11][13]. One solution is the incremental checkpoint scheme [6][7][9][11][13]. It focuses on reducing the checkpoint overhead by saving only necessary application states or only modified states. In the incremental checkpoint scheme the first checkpoint is typically a full checkpoint. After that, one checks to determine which pages have changed since the last checkpoint and saves only those pages. The restart technique for incremental checkpoint requires the system first to restore the states as captured in the previous full checkpoint, and then to apply all the incremental checkpoints before the recovery can be completed. Thus, the restart mechanism of an incremental checkpoint is more complex than that of a full checkpoint.

The rest of this paper is organized as follows. Section 2 introduces related work on techniques for reducing checkpoint overhead. The behaviour of the incremental checkpoint/restart model is described in Section 3. In Section 4, the number of sequential incremental checkpoints is introduced, including the estimation method. The mathematical solution of the model is proposed in Section 5. Finally, our evaluation results and conclusions are presented in Sections 6 and 7.

1 Research supported by the Department of Energy Grant no: DE-FG02-05ER25659.
2 Research supported by the Mathematics, Information and Computational Sciences Office, Office of Advanced Scientific Computing Research, Office of Science, U. S. Department of Energy, under contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

978-0-7695-3156-4/08 $25.00 © 2008 IEEE DOI 10.1109/CCGRID.2008.109

II. RELATED WORK

One of the recent checkpoint overhead reduction techniques is cooperative checkpointing [2]. The main concept of cooperative checkpointing is to schedule the basic checkpoint placements following the traditional fixed interval checkpoint model (Young's model) [8]. However, in order to reduce the checkpoint cost, the technique skips some scheduled checkpoints according to the risk of system failure. The performance of cooperative checkpointing depends on the accuracy of risk estimation.
Nevertheless, an accurate failure prediction or risk estimation is a challenging problem [10][14]. Other efforts [6][7][11][13] have focused on incremental checkpoint algorithms, e.g. how to efficiently implement incremental checkpoint in system level or user level. However,


not much attention has been given to an optimal incremental checkpoint model. Because the checkpoint overhead and restart mechanism of incremental checkpoint are different from full checkpoint, we describe the incremental checkpoint model, derive our solution, and compare it to the full checkpoint counterpart in the next sections.

III. BEHAVIOURS OF INCREMENTAL CHECKPOINT/RESTART MODEL

The behaviour of the incremental checkpoint/restart model is illustrated in Figure 1. The incremental checkpoint model consists of two types of checkpoints (full checkpoints and incremental checkpoints). The meaning of each parameter in the incremental checkpoint/restart model is listed in Table 1.

Figure 1 Behaviour of Incremental checkpoint/restart model.

TABLE 1 PARAMETERS IN INCREMENTAL CHECKPOINT/RESTART MODEL.

Parameters      Definitions
Tc1, …, Tcn     Checkpoint intervals
OF              Full checkpoint overhead
OI              Incremental checkpoint overhead
Tb              Rollback cost
Tr              Recovery time from an incremental checkpoint
Trf             Recovery time from a full checkpoint
δ               Additional recovery cost per incremental checkpoint
m               Number of incremental checkpoints between two consecutive full checkpoints
μ               Incremental checkpoint overhead ratio, μ = OI / OF
ωi              The cycle between failure (i-1) and failure i, where i = 1, 2, 3, …

For the clear and convenient discussion, we list the following assumptions in our model.

Assumptions

1. An application can be interrupted by a series of unexpected failures Y, where Y could be a Poisson process following a probability density function f(t), and the distribution function F(t).

2. Each checkpoint interval may vary in length.

3. The full checkpoint overhead OF and the incremental checkpoint overhead OI should be cost functions of n, where n is the number of nodes. For simplicity, we assume that OF and OI are constants. The incremental checkpoint overhead is a proportion of the full checkpoint overhead, OI = μ · OF, where μ is the incremental checkpoint overhead ratio, 0 < μ < 1.

4. We assume that the recovery cost of each incremental checkpoint is a constant denoted by δ. Let Trf be the recovery time from a full checkpoint, and Tr be the total recovery time from an incremental checkpoint. Each incremental checkpoint adds cost to the restart phase. If the application is recovered from an incremental checkpoint, say the mth incremental checkpoint after the last full checkpoint, then the total recovery cost from an incremental checkpoint Tr is Trf + m · δ.

5. The first checkpoint in an application is a full checkpoint. After an application is recovered from failure, the first checkpoint is a full checkpoint as well. After m consecutive incremental checkpoints, a full checkpoint may be performed if the overall cost reaches a breakeven point between incremental and full checkpoint. We will determine the value of m in the next section.

IV. CONSECUTIVE INCREMENTAL CHECKPOINT NUMBER m

In our incremental checkpoint model, the first checkpoint is a full checkpoint, which saves the entire data section and the stack of the application. The full checkpoint is then followed by a sequence of incremental checkpoints, each of which saves only the address spaces that have changed since the previous checkpoint. The recovery cost is determined by the number of incremental checkpoints. After m incremental checkpoints are performed, either another incremental checkpoint or a full checkpoint can be performed. A full checkpoint is chosen if performing a full checkpoint costs less than the additional recovery cost incurred by another incremental checkpoint. This is what we call a breakeven point. The main idea is to balance the savings in checkpoint overhead offered by incremental checkpoints against the added recovery complexity that the incremental model introduces.
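The mechanism described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the page dictionary, the function names (`full_checkpoint`, `incremental_checkpoint`, `restore`, `recovery_cost`), and the assumption that the set of pages is fixed are all hypothetical choices made for illustration. It shows why restart must replay every incremental checkpoint since the last full one, which is what makes the recovery cost grow as Trf + m · δ (Assumption 4).

```python
def full_checkpoint(pages):
    """Save a copy of every page (the whole application state)."""
    return dict(pages)


def incremental_checkpoint(pages, last_saved):
    """Save only the pages whose contents differ from the last saved state.

    Assumes pages are never removed, only modified (illustrative simplification).
    """
    return {addr: data for addr, data in pages.items()
            if last_saved.get(addr) != data}


def restore(full, increments):
    """Rebuild the state: start from the full checkpoint, then apply
    each incremental checkpoint in the order it was taken."""
    state = dict(full)
    for inc in increments:
        state.update(inc)
    return state


def recovery_cost(t_rf, delta, m):
    """Total recovery time Tr = Trf + m * delta from Assumption 4."""
    return t_rf + m * delta


if __name__ == "__main__":
    pages = {0: "a", 1: "b", 2: "c"}
    full = full_checkpoint(pages)

    pages[1] = "B"                                  # application modifies one page
    inc1 = incremental_checkpoint(pages, full)

    saved = restore(full, [inc1])                   # state covered by checkpoints so far
    pages[2] = "C"
    inc2 = incremental_checkpoint(pages, saved)

    print(restore(full, [inc1, inc2]) == pages)
    print(recovery_cost(t_rf=10.0, delta=1.0, m=2))
```

Each incremental checkpoint is small, but every one of them must be applied at restart, so the saving in checkpoint overhead is paid for with a longer recovery.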


We denote the number of incremental checkpoints between two consecutive full checkpoints as m. The value of m depends on the next checkpoint type, either incremental or full checkpoint. As discussed earlier, the incremental checkpoint aims to reduce the checkpoint overhead. On the other hand, the recovery cost will increase as the number of subsequent incremental checkpoints (m) increases. This is because the application reconstruction phase requires information from each and every incremental checkpoint since the last full checkpoint. From the model description of the incremental checkpoint in Section 3, we assume that the first checkpoint is a full checkpoint, followed by a sequence of incremental checkpoints. Let us assume that the number of sequential incremental checkpoint placements is m. The key finding of

the optimal incremental checkpoint model is how to derive m so that the overall cost (including recovery and rollback time) remains minimal when a failure occurs.

Figure 2 Sequential incremental checkpoint scenario.

Therefore, our purpose of this checkpoint model study is to find an incremental checkpoint placement solution which will minimize the total waste time due to rollback cost and checkpoint overhead. We follow this idea to find m by comparing the waste time in two possible cases. In the first case, as shown in Figure 2 (2a), m continuous incremental checkpoints are followed by a full checkpoint. Alternatively, as shown in Figure 2 (2b), after placing m continuous incremental checkpoints, we continue to perform the (m+1)th incremental checkpoint. In each case, we consider the probability of failure. Details are discussed in the following section.

Case (a): After placing m continuous incremental checkpoints, a full checkpoint is performed next as shown in Figure 2a. We assume that $P_I$ is the probability that a failure will occur after the second full checkpoint and before the next incremental checkpoint. Hence, $1 - P_I$ is the probability that failure will not occur in that period. If no failure occurs during this period, the overall cost $C_{a1}$ is

$$C_{a1} = (O_F + m O_I) + O_F$$

Alternatively, if the failure occurs, the cost $C_{a2}$ is

$$C_{a2} = (O_F + m O_I) + O_F + T_{rf}$$

Therefore the expected cost is

$$C_a = (1 - P_I)(2 O_F + m O_I) + P_I (2 O_F + m O_I + T_{rf}) \tag{1}$$

Case (b): After reaching m consecutive incremental checkpoints, another incremental checkpoint is performed as shown in Figure 2b. We consider that the probability of the failure events is approximately the same as in case (a), $P_I$. When no failure occurs, the cost $C_{b1}$ is

$$C_{b1} = (O_F + m O_I) + O_I$$

Alternatively, if a failure happens, the cost $C_{b2}$ is

$$C_{b2} = (O_F + m O_I) + O_I + T_{rf} + (m + 1)\delta$$

Therefore, the expected cost in case (b) is

$$C_b = (1 - P_I)\,[O_F + (m + 1) O_I] + P_I\,[O_F + (m + 1) O_I + T_{rf} + (m + 1)\delta] \tag{2}$$

In order to minimize the waste time in the model, the solution of m must satisfy the following condition. If $C_b \ge C_a$, it means that the cost of case (b) is larger than the cost of case (a). Thus, we will choose case (a) and perform a full checkpoint after m sequential incremental checkpoints. Therefore, we obtain

$$m \ge \frac{O_F - O_I}{P_I \cdot \delta} - 1 \tag{3}$$

Inequality (3) suggests that if $m \ge \frac{O_F - O_I}{P_I \cdot \delta} - 1$, the cost in case (b) will be greater than the cost in case (a). Thus, we take m as

$$m = \left\lceil \frac{O_F - O_I}{P_I \cdot \delta} - 1 \right\rceil, \tag{4}$$

where $\lceil \cdot \rceil$ is the ceiling function. Because $O_I = \mu O_F$ (Assumption 3) in Section 3, we substitute $O_I$ in Equation (4) and obtain

$$m = \left\lceil \frac{(1 - \mu) O_F}{P_I \cdot \delta} - 1 \right\rceil \tag{5}$$

V. MATHEMATICAL SOLUTION OF THE MODEL

We consider that the checkpoint procedure is a renewal process [18]. Therefore, whenever a failure occurs, the new cycle starts. We follow the renewal reward theory to derive the optimal incremental checkpoint/restart model. Let the sequence of discrete checkpoint placements be $0 = t_0 < t_1 < \ldots < t_n$, and $n(t)$ be the checkpoint frequency function. Then

$$\int_{t_i}^{t_{i+1}} n(\tau)\,d\tau = 1, \quad i = 0, 1, 2, \ldots \tag{6}$$

In Figure 1, the total number of checkpoints in cycle $\omega_1$ is

$$N(\omega_1) = \int_0^{\omega_1} n(\tau)\,d\tau = n_F + n_I, \tag{7}$$

where $n_F$ is the number of full checkpoints in cycle $\omega_1$, and $n_I$ is the number of incremental checkpoints in the same cycle $\omega_1$, and $n_I = m \cdot n_F$.

If $W_1$ is the waste time due to the checkpoint in cycle $\omega_1$, then from [18],

$$\begin{aligned}
W_1 &= \frac{1}{m+1} \cdot O_F \int_0^{\omega_1} n(\tau)\,d\tau + \frac{m}{m+1} \cdot \mu O_F \int_0^{\omega_1} n(\tau)\,d\tau + T_b + T_r \\
    &= \frac{1 + \mu m}{m+1} \cdot O_F \int_0^{\omega_1} n(\tau)\,d\tau + T_b + T_r.
\end{aligned} \tag{8}$$

We suppose that the system can be successfully recovered from the last checkpoint, and the rollback cost $T_b$ can be estimated by $\frac{k}{n(\omega_1)}$, $(0 < k < 1)$ [18][19], where $n(\omega_1)$ is the checkpoint frequency at time $\omega_1$, and k can be evaluated by the same method as in [18][19]. In assumption 4, the recovery cost is $T_r = T_{rf} + m \cdot \delta$, where $T_{rf}$ is the recovery cost from a full checkpoint. Therefore, we substitute $T_r$ in Equation (8) and obtain:

$$W_1 = \frac{1 + \mu m}{m+1} \cdot O_F \int_0^{\omega_1} n(\tau)\,d\tau + \frac{k}{n(\omega_1)} + T_{rf} + m\delta \tag{9}$$

By following the stochastic renewal reward process theory, the overall checkpoint/restart process can be described as

$$A(t) = \sum_{i=1}^{N(t)} W_i.$$

From the basic limit theorem of renewal reward processes, we obtain

$$\lim_{t \to \infty} \frac{E\left(\sum_{i=1}^{N(t)} W_i\right)}{t} = \frac{E(W_1)}{E(\omega)}$$

The left hand side of the above equation represents the total average reward (in this case the waste time), and it is a function of the average reward in the first cycle, $E(W_1)$. In the checkpoint/restart model, this theorem states that minimizing the overall waste time is equivalent to minimizing waste time in cycle $\omega_1$. Let $f(t)$ be the probability density function of the system failure, such that the probability for the system failure within $[t, t + \Delta t]$ is $f(t) \cdot \Delta t$. The expected waste time during a cycle in the checkpoint process $E(W_1)$ is

$$E(W_1) = \int_0^{\infty} \left[ \left( \int_0^{t} \frac{1 + \mu m}{m+1}\, O_F\, n(\tau)\,d\tau \right) + \frac{k}{n(t)} + T_{rf} + m\delta \right] \cdot f(t)\,dt. \tag{10}$$

We are now looking for the solution of the overall checkpoint frequency $n^*(t)$ to minimize Equation (10). Let $x(t) = \int_0^t n(\tau)\,d\tau$. From Equation (10) we obtain:

$$E(W_1) = \int_0^{\infty} \left[ \frac{1 + \mu m}{m+1} \cdot O_F \cdot x(t) + \frac{k}{x'(t)} + T_{rf} + m\delta \right] \cdot f(t)\,dt, \tag{11}$$

Let the function under the integral in right side of Equation (11) be $\Phi(x, x', t)$. Then

$$\Phi(x, x', t) = \left[ \frac{1 + \mu m}{m+1} \cdot O_F \cdot x(t) + \frac{k}{x'(t)} + T_{rf} + m\delta \right] \cdot f(t). \tag{12}$$

Based on the theorem of calculus of variations, if the integral in Equation (11) has a minimum value, $\Phi(x, x', t)$ in Equation (11) must satisfy the Euler-Lagrange equation in Equation (13).

$$\frac{\partial \Phi}{\partial x} - \frac{d}{dt} \cdot \frac{\partial \Phi}{\partial x'} = 0 \tag{13}$$

Solving Equation (13), we obtain the solution for the incremental checkpoint frequency in Equation (14). The detailed derivation of Equation (14) can be found in [18].

$$n^*(t) = \sqrt{\frac{(m+1)k}{(1 + \mu m) O_F} \cdot \frac{f(t)}{1 - F(t)}} \tag{14}$$
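The paper defers the derivation of Equation (14) to [18]. As a hedged sketch of how the step from Equation (13) to Equation (14) can go, the lines below assume a natural boundary condition at infinity, namely that $k f(t) / x'(t)^2 \to 0$ as $t \to \infty$; this condition is an assumption made here and is not stated in this paper. The shorthand $a = \frac{1 + \mu m}{m + 1} O_F$ is introduced only for brevity.

```latex
\begin{align*}
\frac{\partial \Phi}{\partial x} &= a\, f(t), \qquad
\frac{\partial \Phi}{\partial x'} = -\frac{k\, f(t)}{x'(t)^2}, \\
% substitute both partial derivatives into the Euler-Lagrange equation (13)
a\, f(t) + \frac{d}{dt}\!\left[\frac{k\, f(t)}{x'(t)^2}\right] &= 0, \\
% integrate from t to infinity, using the assumed boundary condition
\frac{k\, f(t)}{x'(t)^2} &= a \int_t^{\infty} f(s)\, ds = a\,\bigl(1 - F(t)\bigr), \\
n^*(t) = x'(t) &= \sqrt{\frac{k}{a} \cdot \frac{f(t)}{1 - F(t)}}
  = \sqrt{\frac{(m+1)\,k}{(1 + \mu m)\,O_F} \cdot \frac{f(t)}{1 - F(t)}} .
\end{align*}
```

The square root is also what dimensional consistency requires: both factors under the radical have units of 1/time, while $n^*(t)$ is a frequency.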

VI. MODEL ANALYSIS AND VALIDATION

In Section 5, we obtained the general solution for our incremental checkpoint model. Equation (14) gives a checkpoint frequency function which is derived from a probability distribution function of the system time between failures (TBF). In previous work [19], the failure distribution could be obtained from the system failure analysis. For the purpose of the incremental checkpoint/restart study and evaluation, we validate our model results only when the system failure follows the exponential distribution. This assumption simplifies our validation and clearly demonstrates the difference between the full checkpoint solution and its incremental counterpart. However, we plan to use these results as guidance to further our study with other distributions for the system failures, such as time-varying ones.

A. Model in Exponential Distribution

For the time between failures (TBF) that follows an exponential distribution, we substitute $f(t) = \lambda e^{-\lambda t}$ and $1 - F(t) = e^{-\lambda t}$, $t \ge 0$, $\lambda > 0$, in Equation (14). The optimal model for an exponential distribution can be written as

$$n^*(t) = \sqrt{\frac{(m+1)k}{(1 + \mu m) O_F} \cdot \lambda} \tag{15}$$

The optimal checkpoint interval, $I$, is a fixed value.

$$I = \frac{1}{n^*(t)} = \sqrt{\frac{(1 + \mu m) O_F}{k (m+1)} \cdot \frac{1}{\lambda}} \tag{16}$$

In Equation (16), the full checkpoint overhead $O_F$ and incremental checkpoint overhead ratio $\mu$ are given as constants. The exponential failure rate fitted from the failure data set is $\lambda$. The number of incremental checkpoints m can be obtained by Equation (5) in Section 4, and we can obtain the recovery cost per incremental checkpoint, $\delta$, from an experiment. Therefore, the failure probability $P_I$ can be analytically derived as follows. We suppose that the checkpoint placements are at $t_0, t_1, t_2, \ldots, t_n, t_{n+1}$, where $t_0 = 0$ (when the application starts), and $t_n$ is a time stamp of the nth checkpoint. In the exponential distribution case, the checkpoint interval does not change over time as described in Equation (16),

$$t_0 = 0,\ t_1 = I,\ t_2 = 2I,\ \ldots,\ t_n = nI$$

From Equation (5), the probability $P_I$ of failure during the $(t_n, t_{n+1})$ interval is

$$P_I = P[T < t_{n+1} \mid T > t_n] = \frac{F(t_{n+1}) - F(t_n)}{1 - F(t_n)}. \tag{17}$$

For the exponential distribution, the CDF is $F(t) = 1 - e^{-\lambda t}$, and we have

$$P_I = \frac{\left(1 - e^{-\lambda t_{n+1}}\right) - \left(1 - e^{-\lambda t_n}\right)}{1 - \left(1 - e^{-\lambda t_n}\right)} = 1 - e^{-\lambda (t_{n+1} - t_n)}$$

Since the checkpoint interval is constant, we have

$$P_I = 1 - e^{-\lambda t_1}. \tag{18}$$

From Equations (5), (16), and (18), we can find the value of m based on the following algorithm.

Algorithm to find m:
Step 1. Initialize $O_F$, $\mu$, $\delta$, and $\lambda$. Let m = 1.
Step 2. Calculate checkpoint interval $I$ by Equation (16).
Step 3. Calculate $P_I$ by Equation (18).
Step 4. IF $m < \frac{(1 - \mu) O_F}{P_I \cdot \delta} - 1$ THEN m = m + 1, and go to Step 2, ELSE go to Step 5.
Step 5. Set m = m - 1.
Step 6. End.

B. Model Analysis

In this section, we study the improvement of the incremental checkpoint model over the full checkpoint model by comparing the waste time of the incremental checkpoint model with that of the full checkpoint model [19]. Our simulations are based on the actual failure data of the LLNL ASC White system. In comparing the full and incremental checkpoint models, we observe that the mean time between failures (MTBF) of the LLNL ASC White system during the one-year period from June 2003 to May 2004 is around 26 hours. We also obtain the best fitted distribution to the data from June 2003 to May 2004, which is the Weibull distribution with a shape parameter of 0.50944 and a scale parameter of 20.584. Using this Weibull distribution, we determine the checkpoint sequences for the full checkpoint model, and using the exponential distribution with a mean of 26 hours (the MTBF), we determine the checkpoint sequences for the incremental checkpoint model. We then use these checkpoint sequences to run simulations on the data from June 2004 to August 2004. Our purpose is to compare the waste times of both models by varying the completion time of the application.

When the incremental checkpoint recovery cost is as large as the full checkpoint recovery cost, we observe from Figure 3 and Figure 4 that the waste times of the incremental checkpoint model are less than the waste times of the full checkpoint model. Moreover, the incremental checkpoint model gives better results than the full checkpoint model, except for the cases where the incremental checkpoint overhead approaches the full checkpoint overhead, as in Figure 4. One reason for this situation may be that we use the exponential distribution in the incremental checkpoint model, which is not the best fitted distribution of the observed failure data. Therefore, we might conclude that if the incremental checkpoint overhead is almost as large as the full checkpoint overhead, we should use the full checkpoint model to schedule checkpoints.
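The algorithm to find m in Section VI.A (Steps 1 to 6), together with Equations (1), (2), (16), and (18), can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: all function names are hypothetical, the square-root form of Equation (16) is used, the value k = 0.5 is an assumption (the paper only requires 0 < k < 1 and evaluates k as in [18][19]), and the sample parameters are taken from the Figure 3 caption (full checkpoint overhead 2 minutes, incremental overhead 6 seconds, recovery cost per incremental checkpoint 10 seconds) with the 26-hour MTBF, all converted to seconds.

```python
import math


def checkpoint_interval(m, o_f, mu, k, lam):
    """Equation (16): fixed optimal interval under exponential failures."""
    return math.sqrt((1 + mu * m) * o_f / (k * (m + 1) * lam))


def failure_probability(interval, lam):
    """Equation (18): probability of a failure within one interval."""
    return 1.0 - math.exp(-lam * interval)


def breakeven_threshold(m, o_f, mu, delta, k, lam):
    """Right-hand side of the Step 4 test: (1 - mu) * O_F / (P_I * delta) - 1."""
    p_i = failure_probability(checkpoint_interval(m, o_f, mu, k, lam), lam)
    return (1 - mu) * o_f / (p_i * delta) - 1


def expected_cost_full_next(m, o_f, mu, t_rf, p_i):
    """Equation (1): expected cost when a full checkpoint follows m increments."""
    o_i = mu * o_f
    return (1 - p_i) * (2 * o_f + m * o_i) + p_i * (2 * o_f + m * o_i + t_rf)


def expected_cost_incremental_next(m, o_f, mu, t_rf, delta, p_i):
    """Equation (2): expected cost when an (m+1)th incremental checkpoint follows."""
    base = o_f + (m + 1) * mu * o_f
    return (1 - p_i) * base + p_i * (base + t_rf + (m + 1) * delta)


def find_m(o_f, mu, delta, k, lam, max_iter=1_000_000):
    """Steps 1 to 6 of the algorithm; max_iter is a safety guard added here."""
    m = 1                                              # Step 1
    for _ in range(max_iter):
        # Steps 2 and 3 happen inside breakeven_threshold.
        if m < breakeven_threshold(m, o_f, mu, delta, k, lam):   # Step 4
            m += 1
        else:
            return m - 1                               # Step 5
    raise RuntimeError("no breakeven point found within max_iter iterations")


if __name__ == "__main__":
    O_F = 120.0                  # full checkpoint overhead, seconds
    MU = 6.0 / 120.0             # O_I / O_F
    DELTA = 10.0                 # recovery cost per incremental checkpoint, seconds
    LAM = 1.0 / (26 * 3600)      # failure rate from a 26-hour MTBF, per second
    K = 0.5                      # assumed value of k

    m = find_m(O_F, MU, DELTA, K, LAM)
    interval = checkpoint_interval(m, O_F, MU, K, LAM)
    print("m =", m, "interval (s) =", round(interval, 1))
```

The two cost functions are included so that the breakeven condition can be checked numerically: for a fixed $P_I$, the cost of Equation (2) exceeds that of Equation (1) exactly when m satisfies Inequality (3).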

Figure 3 Waste times of the full checkpoint model and the incremental checkpoint model of different job completion times with full checkpoint overhead of 2 minutes, full checkpoint recovery cost of 10 seconds, incremental checkpoint overhead of 6 seconds, and incremental checkpoint recovery cost of 1 second, 5 seconds, and 10 seconds


Figure 4 Waste times of the full checkpoint model and the incremental checkpoint model of different job completion times with full checkpoint overhead of 2 minutes, full checkpoint recovery cost of 10 seconds, incremental checkpoint overhead of 1 minute, and incremental checkpoint recovery cost of 1 second, 5 seconds, and 10 seconds

VII. CONCLUSION

In this paper, we have discussed the incremental checkpoint/restart model in HPC systems. This work is an extension of the full checkpoint/restart model that is discussed in [18][19]. The incremental checkpoint aims to further reduce checkpoint overhead, especially for a large-scale distributed system where the full checkpoint overhead may be significant. We have built the model for the incremental checkpoint/restart solution and have derived the number of consecutive incremental checkpoints, m, between two consecutive full checkpoints. This derived m is the number of consecutive incremental checkpoints that yields the breakeven cost. Additionally, we provide guidance as to how one can perform additional checkpoints, without paying much penalty, to gain better benefits when a failure occurs. We have also analysed the sensitivity of the incremental checkpoint model with respect to the incremental checkpoint overhead and recovery cost. Moreover, we present the comparison between the incremental checkpoint model and the full checkpoint model. Our results suggest that the incremental checkpoint recovery cost does not affect the performance of the incremental checkpoint model. For an incremental checkpoint overhead smaller than the full checkpoint overhead, the waste time in the incremental checkpoint model is significantly smaller than the waste time in the full checkpoint model. Hence, the incremental checkpoint model is preferred. However, when the incremental checkpoint overhead is as large as the full checkpoint overhead, the full checkpoint model performs better than the incremental checkpoint model.

References

[1] A. C. Palaniswamy and P. A. Wilsey, "An analytical comparison of periodic checkpointing and incremental state saving". In Proc. of the Seventh Workshop on Parallel and Distributed Simulation (PADS '93), San Diego, California, United States, May 16-19, 1993, R. Bagrodia and D. Jefferson, Eds. ACM Press, New York, NY, pp. 127-134.
[2] A. J. Oliner, L. Rudolph, R. K. Sahoo, "Cooperative checkpointing: a robust approach to large-scale systems reliability". In Proc. of the 20th Annual International Conference on Supercomputing (ICS), Cairns, Australia, June 2006, pp. 14-23.
[3] A. Tikotekar, C. Leangsuksun, S. L. Scott, "On the survivability of standard MPI applications". In Proc. of 7th LCI International Conference on Linux Clusters: The HPC Revolution, 2006.
[4] J. Heo, Y. Cho, G. Jeon, H. Kimm, "The overhead model of word-level and page-level incremental checkpointing". Proc. of the 2006 ACM Symposium on Applied Computing, April 23-27, 2006, Dijon, France, pp. 1493-1294.
[5] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Transparent checkpointing under UNIX". In Proceedings of the USENIX Winter 1995 Technical Conference, pp. 213-223.
[6] J. S. Plank, J. Xu, and R. H. Netzer, "Compressed differences: an algorithm for fast incremental checkpointing". Technical Report CS95-302, University of Tennessee at Knoxville, 1995.
[7] J. C. Sancho, F. Petrini, G. Johnson, E. Frachtenberg, "On the feasibility of incremental checkpointing for scientific computing". Parallel and Distributed Processing Symposium, 2004. Proc. 18th International, pp. 26-30.
[8] J. W. Young, "A first-order approximation to the optimum checkpoint interval". Communications of the ACM 17, 9 (Sept. 1974), pp. 530-531.
[9] K. Li, J. F. Naughton, and J. S. Plank, "Low-latency, concurrent checkpointing for parallel programs". IEEE Transactions on Parallel and Distributed Systems, vol. 5, Aug. 1994.
[10] R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang, "Failure data analysis of a large-scale heterogeneous server environment". In Proc. of DSN'04, 2004.
[11] N. H. Vaidya, "Impact of checkpoint latency on overhead ratio of a checkpointing scheme". IEEE Transactions on Computers, vol. 46, no. 8, Aug. 1997, pp. 942-947.
[12] O. Bolza, "Lectures on the calculus of variations", Third Edition, Chelsea Publishing Company, New York, 1973.
[13] R. Gioiosa, J. C. Sancho, S. Jiang, F. Petrini, "Transparent, incremental checkpointing at kernel level: a foundation for fault tolerance for parallel computers". Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, pp. 9, 12-18.
[14] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, "Critical event prediction for proactive management in large-scale computer clusters". International Conference on Knowledge Discovery and Data Mining, 2003, pp. 426-435, ISBN: 158113-737-0.
[15] S. Agarwal, R. Garg, M. S. Gupta, J. Moreira, "Adaptive incremental checkpointing on massively parallel systems". In Proc. of 18th Annual ACM International Conference of Supercomputing (ICS'04), June 26 - July 1, 2004, pp. 277-286.
[16] S. M. Ross, "Stochastic Processes", Wiley, 2nd edition (January 1995), ISBN-10: 0471120626.
[17] S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI checkpoint/restart framework: system-initiated checkpoint". The 2003 Los Alamos Computer Science Institute Symposium, Santa Fe, NM, October 2003.
[18] Y. Liu, "Reliability-Aware Optimal Checkpoint/Restart Model in High Performance Computing". PhD thesis, Louisiana Tech University, Ruston, LA, USA, May 2007.
[19] Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun, S. Scott, "A Reliability-aware Approach for an Optimal Checkpoint/Restart Model in HPC Environments". Refereed proceedings of the IEEE Cluster Conference, Austin, Texas, 2007, pp. 452-457.
