Tuning of the Checkpointing and Communication Library for Optimistic Simulation on Myrinet Based NOWs
Francesco Quaglia, Andrea Santoro, and Bruno Ciciani
Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza", Via Salaria 113, 00198, Roma, Italy
Abstract

Recently a Checkpointing and Communication Library (CCL) for optimistic simulation on Myrinet based Networks of Workstations (NOWs) has been presented. CCL offloads checkpoint operations from the CPU by charging them to a programmable DMA engine on the Myrinet network card. CCL also includes functionalities for freezing the simulation application on demand, which can be used for data consistency maintenance (for example when a state buffer needs to be accessed for further modifications while a DMA based checkpoint operation involving it is still in progress). Programming the DMA to perform a checkpoint operation by transferring large data blocks in a single burst allows the latency of any checkpoint operation to be kept low. This reduces the probability for application freezing to really occur. On the other hand, transferring large data blocks in a single burst might cause negative interference on communication, since that DMA (and other circuitry) cannot be used for communication functionalities until the currently executed data transfer is completed. In this paper we present a detailed identification of the effects of the burst length, from which we outline a set of relevant phenomena to take into account in order to determine a well suited compile time value for the burst length itself. We also report measures quantifying these phenomena for the case of a PC cluster. Actually, the data indicate that communication functionalities do not suffer from the use of non-minimal burst lengths for checkpoint operations, thus pointing out how, if well tuned, CCL provides highly effective, CPU offloaded, checkpointing functionalities.

0-7695-1315-8/01 $10.00 © 2001 IEEE

1 Introduction

Optimistic parallel discrete event simulators are based on checkpointing and rollback recovery techniques to ensure causally consistent execution of simulation events at each Logical Process (LP) [6]. Since checkpointing might be a time consuming operation, the use of efficient checkpointing mechanisms is mandatory to guarantee adequate performance. Traditionally, the reduction of the checkpointing overhead has been pursued through checkpointing strategies based on infrequent or incremental saving of the LP state vector (see for example [1, 2, 4, 7, 10, 12, 13, 14, 15, 16]). These solutions pay the price of an increase in the expected rollback latency, since there exists the possibility that a state to be recovered is not directly available. In that case it must be reconstructed during the rollback phase. The "best suited" tradeoff is typically achieved through adequate tuning of the parameter(s) proper of the checkpointing strategy.

A completely different approach to the reduction of the checkpointing overhead has been recently proposed in [11] for the case of optimistic parallel simulation on Myrinet based Networks of Workstations (NOWs). Specifically, the work in [11] presents a Checkpointing and Communication Library (CCL) that exploits the data transfer potentiality offered by DMA engines on Myrinet network cards to support not only communication but also checkpoint operations. This library includes functionalities for activating DMA based checkpoint operations and also functionalities to suspend on demand the execution of the simulation program in order to wait, if needed, for the completion of a pending DMA based checkpoint operation. The latter functionalities, which we refer to as "resynchronization" functionalities, are activated to avoid data inconsistency whenever the state vector currently being transferred (checkpointed) through DMA needs to be accessed by the LP for further modifications of some state variables, or even when a new checkpoint operation must be issued while the last issued one is not yet completed. Combined use of DMA based checkpointing and resynchronization leads to a so called "semi-asynchronous" execution mode of checkpoint operations.

Although preliminary performance results [11] have shown that the semi-asynchronous mode supported by CCL exhibits the potential for strong acceleration of the parallel simulation execution, an investigation to determine "well suited" parameter tuning is mandatory to guarantee that the checkpointing and communication functionalities provided by CCL work properly at the same time. Specifically, any DMA based checkpoint operation can be split into portions, each of which takes care of the transfer of up to a maximum amount of bytes of the LP state vector into the checkpoint buffer. Such an amount, which we will refer to as the burst associated with checkpoint operations, can be selected at compile time of CCL. The use of long bursts allows the latency of any checkpoint operation to be kept low, thus reducing the probability for resynchronization to really occur. In other words, long bursts are likely to yield a high degree of concurrency between checkpointing and other simulation specific operations carried out by the CPU. On the other hand, long bursts increase the risk that the DMA engine and other circuitry on board of the network card remain busy for non-minimal periods, thus preventing the possibility to use them for communication functionalities. This
potentially increases the communication latency, with consequent risk of an increase in the amount of rollback [3]. Also, as we will show, the burst length might affect the communication overhead at the application level. Therefore, well suited tuning of the burst length is mandatory to avoid incurring excessive overhead and rollback thrashing, and to get, at the same time, advantages from concurrency in the execution of checkpointing and other simulation specific operations. In this paper we present a detailed identification of the effects of the burst length, from which we outline a set of relevant phenomena to take into account in order to determine a well suited compile time value for the burst length itself. We also report measures quantifying these phenomena for a cluster of PCs (Pentium II 300 MHz) running LINUX (kernel version 2.0.32). Actually, the data indicate that communication functionalities do not significantly suffer from the use of non-minimal burst lengths for checkpoint operations. This is an indication that, if well tuned, CCL provides highly effective, CPU offloaded, checkpointing functionalities paying no significant price from the point of view of communication. The remainder of this paper is structured as follows. In Section 2 we report a description of CCL. In Section 3 we identify the effects of the burst length on checkpointing and communication functionalities. Quantification of these effects for the PC cluster is reported in Section 4. Finally, for completeness of the analysis, we report in Section 5 a set of results related to the execution speed of optimistic parallel simulation of a classical benchmark, demonstrating the real gain achievable in the case of well suited tuning of CCL and also the real problems one can incur in case of unsuited tuning.
Figure 1. High Level Structure of the M2M-PCI32C Card (a); Data Transfer for Checkpointing (b).
2 CCL Overview

CCL has been designed for the M2M-PCI32C Myrinet card, based on the LANai 4 chip [8], whose high level structure is schematized in Figure 1.a. This chip is a programmable communication device consisting of: (A) An internal bus, namely LBUS (Local BUS), clocked at twice the chip-clock speed. (B) A programmable processor connected to the LBUS, which we will refer to as LANai processor. (C) A RAM bank of 1 Mbyte (LANai internal memory), connected to the LBUS, which is used for storing both data and the control program run by the LANai processor; this memory can be mapped into the memory address space of the host. (D) A packet interface between the Myrinet switch and the LANai chip, accessible by the LANai processor. (E) Three DMA engines used respectively for: (i) packet-interface/internal-memory transfer operations (Receive DMA), (ii) internal-memory/packet-interface transfer operations (Send DMA), and (iii) internal-memory/host-memory transfer (or vice-versa) operations (EBUS DMA, namely External Bus DMA). The LBUS cycles are assigned based on the following priorities (highest to lowest): host and EBUS DMA, Receive DMA, Send DMA, LANai processor. Finally, we note that the LANai processor cannot access host memory directly. Nonetheless, the control program run by the processor can program the EBUS DMA to perform data transfers to/from that memory. Communication functionalities provided by CCL have been implemented to fully exploit the potential offered by the hardware components on the M2M-PCI32C card. This is the reason why our focus is not on the performance of communication functionalities as such but on the whole performance provided by CCL when both communication and checkpointing functionalities are activated concurrently. In other words, we are mainly interested in the global effects of the activation of checkpointing functionalities. Getting back to the implementation of communication functionalities, in CCL messages incoming from the network are temporarily buffered into the LANai internal memory (data transfer between the packet interface and the internal memory takes place through the Receive DMA) and then transferred into the receive queue, located onto host memory, through the EBUS DMA (see the directed dashed line in Figure 1.a). This is a choice common to fast messaging layers for Myrinet (see for example [9]). Once transferred into the receive queue, any message is received by the application program very efficiently by simply performing a memcpy() operation of the message content into a proper buffer in the application address space (1). Also, we have adopted a classical optimization called "block-DMA" to transfer incoming messages from the LANai internal memory to the host memory. It allows incoming messages stored in contiguous message slots of the LANai internal memory to be transferred using a single DMA operation. Following the common design choice, any send operation issued by the application involves copying the message content directly into the LANai internal memory. This is also referred to as "zero-copy" send. Then the message is transferred onto the network through the Send DMA. The responsibility to program the three DMA engines anytime there is the need for supporting a given data transfer operation pertains to the control program run by the LANai processor, which has the following basic structure:
1.  while (1) {
2.    if (message needs to be sent) activate_send_DMA();
3.    if (message needs to be received) activate_receive_DMA();
4.    if (Send DMA completed) complete_send();
5.    if (Receive DMA completed) complete_receive();
6.    if (EBUS DMA not busy) {
7.      if (block-DMA not needed AND checkpoint burst needed) ckpt_burst();
8.      if (block-DMA needed AND checkpoint burst not active) block_DMA();
9.    }
10. }
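As an illustration (not the actual LANai firmware), the scheduling policy of the EBUS-related part of the loop above, namely lines 6-9, might be modeled in plain C as follows; the flag variables and the two helper functions are hypothetical stand-ins for hardware and queue state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags modeling the conditions polled by the control
   program; in the real firmware these would reflect hardware state. */
static bool ebus_busy = false;
static bool block_dma_needed = false;
static bool ckpt_burst_needed = false;
static int bursts_done = 0, blocks_done = 0;

/* Stand-ins for programming the EBUS DMA: both make the engine busy. */
static void ckpt_burst(void) { ebus_busy = true; bursts_done++; }
static void block_DMA(void)  { ebus_busy = true; blocks_done++; }

/* One polling iteration over lines 6-9: a checkpoint burst is started
   only when no block-DMA is needed (block-DMA has activation priority),
   and an already started transfer is never preempted. */
static void poll_ebus(void) {
    if (!ebus_busy) {
        if (!block_dma_needed && ckpt_burst_needed) ckpt_burst();
        if (block_dma_needed && !ebus_busy) block_DMA();
    }
}
```

When both a block-DMA and a checkpoint burst are pending, `poll_ebus()` activates the block-DMA, mirroring the priority discussed in the text.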
The structure of the control program points out two main features: (i) Message transfer over the network is full-duplex (both Send and Receive DMAs can be contemporaneously active). (ii) Data transfer operations associated with
(1) EBUS DMA transfer of messages into host memory was not yet implemented in the version of CCL presented in [11], since the receive queue was located into the LANai internal memory.
checkpointing functionalities are activated only in case no block-DMA operation is currently required to transfer messages into the receive queue located onto host memory (see line 7). Point (ii) indicates that any block-DMA operation has priority over data transfer operations associated with checkpointing. However, once activated, such a data transfer is not preempted by any block-DMA operation. With respect to the latter feature, we underline that preemption on any in progress EBUS DMA data transfer cannot even be implemented since, according to hardware specifications [8], it may cause problems to the PCI protocol. Any semi-asynchronous checkpoint operation involves data transfer from the LP current state buffer (located onto host memory) to the stack of the checkpointed states of the LP (also located onto host memory). As shown by the directed dashed lines in Figure 1.b, the data transfer operation is charged to the EBUS DMA, which uses the LANai internal memory as a temporary buffer (2). In other words, any checkpoint operation issued at the application level means requesting the LANai processor to program the EBUS DMA for the data transfer. Actually, any checkpoint operation is split by the control program into a sequence of EBUS DMA data transfer operations, each of which is responsible for the transfer of up to a maximum amount of bytes of the state vector, called burst, determined at compile time of CCL. Any burst requires two distinct activations of the ckpt_burst() procedure in line 7 of the control program. The first activation transfers the data from the current state buffer into the LANai internal memory (intermediate buffering), the second one transfers the data from the LANai internal memory into a checkpoint buffer in the stack of the checkpointed state vectors of the LP.
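The burst-splitting arithmetic just described can be sketched as follows; this is an illustrative model (not CCL code): the function name, the buffer size, and the use of `memcpy()` in place of DMA programming are all assumptions.

```c
#include <assert.h>
#include <string.h>

#define LANAI_BUF 4096  /* hypothetical intermediate buffer size */

/* Illustrative model: a checkpoint of `size` bytes is split into
   bursts of at most `burst` bytes; each burst needs two EBUS DMA
   activations (host -> LANai internal memory, LANai -> host). */
static int checkpoint(const char *state, char *ckpt_buf,
                      size_t size, size_t burst, int *activations) {
    char lanai_mem[LANAI_BUF];   /* stands in for LANai internal memory */
    if (burst == 0 || burst > LANAI_BUF) return -1;
    *activations = 0;
    for (size_t off = 0; off < size; off += burst) {
        size_t n = (size - off < burst) ? size - off : burst;
        memcpy(lanai_mem, state + off, n);     /* 1st activation */
        memcpy(ckpt_buf + off, lanai_mem, n);  /* 2nd activation */
        *activations += 2;
    }
    return 0;
}
```

For a 1000-byte state vector and a 256-byte burst, the model performs four bursts, hence eight activations of the transfer step.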
The following API is provided for usage of semi-asynchronous checkpointing at the simulation application level: (i) semi_asynch_ckpt(int LP_id, time_type simulation_clock), where LP_id is the identifier of the LP whose state vector needs to be checkpointed, and simulation_clock is the value of the current simulation time seen by that LP; (ii) ckpt_wait(), which supports the previously mentioned resynchronization functionality. Invocation of this function suspends the execution of the simulation application while a semi-asynchronous checkpoint operation, if any, is still in progress. If there is no pending checkpoint operation, ckpt_wait() returns immediately.
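A typical usage pattern of this API at the application level might look as follows. The two CCL primitives are replaced here by stub stand-ins (the real ones program the EBUS DMA), and `lp_step()` with its state update is purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>

typedef double time_type;

/* Stub stand-ins for the CCL primitives; they model only the
   pending/complete protocol, not the actual DMA programming. */
static bool ckpt_pending = false;
static void semi_asynch_ckpt(int LP_id, time_type simulation_clock) {
    (void)LP_id; (void)simulation_clock;
    ckpt_pending = true;            /* DMA transfer now in progress */
}
static void ckpt_wait(void) {
    ckpt_pending = false;           /* would spin until DMA completion */
}

/* Hypothetical LP step: checkpoint proceeds concurrently with CPU
   work, and resynchronization occurs only before the state vector
   is touched again. */
static void lp_step(int LP_id, time_type clock, double *state) {
    semi_asynch_ckpt(LP_id, clock); /* checkpoint runs in background */
    /* ... event processing that does not touch *state ... */
    ckpt_wait();                    /* resynchronize before write access */
    *state += 1.0;                  /* now safe to modify the state */
}
```

The point of the pattern is that `ckpt_wait()` costs nothing when the DMA has already completed, so well tuned bursts make the freeze unlikely to actually occur.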
3 Performance Issues Identification

In this section we identify how the burst length associated with checkpointing functionalities might impact the effectiveness of both communication functionalities and checkpointing functionalities themselves. As pointed out in the Introduction, the identification we make determines a set of relevant phenomena that must be taken into account to determine a well suited compile time tuning of the burst length. A practical use of this identification will be presented in Section 4, where we quantify these phenomena for a specific cluster environment.
(2) Temporary buffering is needed since, as already mentioned, the EBUS DMA does not support host-memory to host-memory data transfer directly. It only supports host-memory to internal-memory transfer or vice versa.

3.1 Burst Length vs Block-DMA Activation Delay

The EBUS DMA is used by CCL to support two distinct operations in different time intervals. Specifically, it is used for block-DMA transfer of incoming messages into the receive queue (see line 8 of the control program) and also for data transfer operations (from/to the host memory) associated with checkpointing (see line 7 of the control program). As pointed out in Section 2, the latter operation has lower priority as compared to block-DMA data transfer, however no preemption is exercised. This means that messages incoming from the network must be maintained into the LANai internal memory until any in progress EBUS DMA based data transfer operation (from/to the host memory) associated with checkpointing is completed. The burst length determines the amount of bytes to be managed by the transfer operation, therefore it affects the possible increase in the delay for the activation of block-DMA operations to transfer messages into the receive queue. Very long bursts might cause intolerable delay in the message delivery at the simulation application level, which, as shown in [3], might have strong negative impact on performance due to a possible increase in the amount of rollback.

3.2 Burst Length vs Zero-Copy Send Latency

EBUS DMA based data transfer operations associated with checkpointing might interfere also with send operations issued by the simulation application. Specifically, send operations are based on the zero-copy approach, which requires access to the PCI bridge to perform the copy of the message content from the simulation application address space into the LANai internal memory. Very long bursts might interfere negatively with the access latency to the PCI bridge, and thus with the latency of the zero-copy send. As a consequence, there is the risk that: (i) the simulation application suffers from an increase in the overhead due to the send operation, and (ii) the message transfer delay increases due to the increase in the latency of copying the message content into the LANai memory, with potential negative impact on the amount of rollback.

3.3 Burst Length vs Checkpointing Latency

The burst length impacts the completion time, i.e. the latency, of any EBUS DMA based checkpoint operation in both direct and indirect ways, each of which is discussed below.

Direct Effects of the Burst Length. Anytime an EBUS DMA based data transfer operation must be activated, the control program run by the LANai processor becomes aware of this by "polling" exercised within its main loop. This means that, anytime a new activation is required, it will be delayed until the corresponding polling operation takes place. Also, any transfer operation requires an EBUS DMA setup phase. Therefore, completion of any checkpoint operation through few EBUS DMA based data transfer activations (i.e. few calls to the function ckpt_burst()) mitigates the impact of both polling and setup delays. To achieve this, the use of very long bursts is recommended.

Indirect Effects of the Burst Length. Indirect effects are related to the scheduling sequence of block-DMA operations and EBUS DMA data transfer operations associated with checkpointing. The scheduling sequence is determined by the control program run by the LANai processor in the
way that any EBUS DMA data transfer operation associated with checkpointing has lower priority as compared to block-DMA. The activation of each EBUS DMA based data transfer operation associated with checkpointing experiences a given scheduling delay determined by the decisions taken by the control program, which favors block-DMA operations. Therefore, completion of a single checkpoint operation through many EBUS DMA based data transfer activations (i.e. many invocations of the function ckpt_burst()) determines a non-minimal checkpointing latency due to the fact that each activation adds its scheduling delay to the completion time of the checkpoint operation. To mitigate the effects of scheduling delay, the use of very long bursts is recommended.
Overall, the longer the burst, the lower the expected completion time of the checkpoint operation (due to mitigation of both direct and indirect effects), thus the lower the probability of application freezing due to resynchronization. In other words, with very long bursts we get a decrease of the probability that the last checkpoint operation issued by the application through an invocation of the function semi_asynch_ckpt() is not yet completed when a subsequent invocation of ckpt_wait() is issued.

3.4 Some Hints on the LBUS Contention

When the EBUS DMA is active for data transfer operations associated with checkpointing, it originates traffic on both the LBUS, namely the Local BUS on the LANai chip, and the internal bus of the host. This is because both the host main memory and the LANai internal memory are involved in the data transfer operation. As shown by the preliminary performance study of CCL in [11], we expect that the traffic on the host internal bus produces in practice no interference with CPU activities due to the cache memory. Instead, the effects of the traffic on the LBUS and possible variation of these effects vs the burst length must be considered. We discuss these issues below.

Traffic Effects. As mentioned in Section 2, the LBUS is clocked at twice the chip-clock speed, therefore it supports at most two memory cycles for every clock cycle. Given that the EBUS DMA has higher LBUS access priority, as compared to Send and Receive DMAs and to the LANai processor, when a memory cycle is destined to the EBUS DMA for data transfer operations associated with checkpointing, any other operation associated with communication functionalities and requiring access to the LBUS might be delayed. As an example, suppose both the Send and the Receive DMAs are active at the same time and suppose the EBUS DMA is activated to execute data transfer operations associated with checkpointing. In this case the Send DMA is penalized whenever both the EBUS and the Receive DMAs are granted memory access in the same clock cycle. The same arguments apply for the case of the LANai processor, since it has the lowest LBUS access priority. Therefore, activation of EBUS DMA based checkpointing functionalities might cause delay in the execution of instructions associated with the control program run by the LANai processor.

Variation of the Effects vs the Burst Length. The real traffic on the LBUS due to data transfer operations associated with checkpointing depends primarily on the frequency of semi-asynchronous checkpoint operations issued by the application and on the size of the state vectors of the LPs. Therefore, the burst length has in practice no direct effect on the LBUS contention, since the real traffic on the LBUS due to any single checkpoint operation does not change as a function of the burst length. Anyway, the burst length might exhibit some indirect effects. Specifically, as discussed in Section 3.3, the burst length determines the latency of any checkpoint operation, therefore it determines the probability of application freezing due to resynchronization. If the resynchronization frequency increases due to the use of very short bursts, then the real frequency of semi-asynchronous checkpoint operations issued by the application is expected to decrease. As a consequence, contention on the LBUS is expected to decrease as well.

3.5 Issues Summary

By the arguments in the previous sections, we summarize below well suited indications for the selection of the burst length, expected to alleviate each of the previously pointed out problems:
Block-DMA Activation Delay: short burst
Zero-Copy Send Latency: short burst
Checkpointing Latency: long burst
LBUS Contention: no direct effect
By these indications we argue that, in general, a short burst should be selected, but not so short as to incur unacceptable checkpointing latency.
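The tradeoff summarized above can be phrased as a selection rule: pick the shortest candidate burst whose estimated checkpoint latency is still acceptable. The sketch below is a purely illustrative heuristic (not part of CCL); the latency model and all numeric constants are assumptions, not measured values.

```c
#include <assert.h>

/* Assumed latency model: each burst pays a fixed activation overhead
   (polling + EBUS DMA setup) twice (host -> LANai, LANai -> host),
   plus a per-byte transfer cost over 2 * state_size bytes moved. */
static double ckpt_latency_us(int state_size, int burst,
                              double setup_us, double per_byte_us) {
    int nbursts = (state_size + burst - 1) / burst;   /* ceil(X / Y) */
    return nbursts * 2.0 * setup_us + 2.0 * state_size * per_byte_us;
}

/* Pick the shortest candidate burst (candidates sorted short -> long)
   whose estimated latency stays within the tolerance; the constants
   1.0 us setup and 0.004 us/byte are illustrative assumptions. */
static int choose_burst(int state_size, const int *candidates, int n,
                        double max_latency_us) {
    for (int i = 0; i < n; i++)
        if (ckpt_latency_us(state_size, candidates[i], 1.0, 0.004)
                <= max_latency_us)
            return candidates[i];
    return candidates[n - 1];   /* fall back to the longest burst */
}
```

With a 4-Kbyte state vector and a 50-microsecond tolerance, the rule skips the 256-byte burst (too many activation overheads) and settles on 512 bytes, the shortest acceptable candidate under the assumed constants.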
4 Experimental Analysis for a PC Cluster
By the discussion in Section 3, there are some indirect effects of the burst length on communication and checkpointing functionalities. These effects are not simple to capture since they are strongly related to dynamics associated with the execution of the specific overlying simulation application. As an example, indirect effects on the checkpointing latency (see Section 3.3), due to the lower priority of EBUS DMA based data transfer operations associated with checkpointing as compared to block-DMA, are strongly related to the frequency of message arrival at the machine, which, in turn, is related to simulation execution dynamics. Also, the indirect effects of the burst length on the LBUS contention (see Section 3.4) arise only in case the burst length variation has strong impact on resynchronization, and this depends on proper dynamics of the simulation application execution, such as the length of the time interval between the invocation of the function semi_asynch_ckpt() and the successive invocation of the function ckpt_wait(). Therefore, identifying indirect effects of the burst length for an arbitrary simulation is unrealistic in practice. However, from a pragmatic viewpoint, an analysis based on effects that are not indirect can anyhow provide hints for a well suited tuning of the burst length to be used at compile time of CCL. In this section we present such an analysis for the case of a cluster of PCs Pentium II 300 MHz running LINUX (kernel version 2.0.32), equipped with 128 Mbytes RAM, 512 Kbytes second level cache and M2M-PCI32C Myrinet cards. Note that the analysis methodology we employ (e.g. software and system conditions) is general and can be used whatever the real hardware/software architecture for which a compile time well suited tuning of CCL must be performed.
4.1 Burst Length vs Block-DMA Activation Delay
To determine the effects of the burst length on the activation delay of block-DMA operations, we have measured the time needed to perform an EBUS DMA based data transfer operation while varying the size of the data block involved in the operation. Actually, the transfer latency for a given data block size is a measure of the worst case delay in the activation of block-DMA when the burst length is equal to that data block size. This is because the worst case occurs just when a block-DMA transfer is delayed for the whole time interval associated with an EBUS DMA based data transfer operation due to checkpointing. The measures we report are related to the case of both transfer from the LANai internal memory to the host memory and vice versa (recall that EBUS DMA based checkpointing requires data transfer in both directions to be performed in interleaved mode). Furthermore, the reported latency values include not only the net time for transferring the data but also the time needed to program the EBUS DMA for the transfer operation.

Plots in Figure 2 show that the data transfer operation is completed within the same latency for both transfer directions (from/to host memory). Furthermore, they indicate that the data transfer time is bounded by about 2 microseconds for data blocks up to 128 bytes, and is bounded by 5 and 8 microseconds for data blocks up to 512 and 1024 bytes, respectively. Also, the plots indicate that the transfer delay increases linearly vs the data block size (the logarithmic scale does not outline the linear behavior, but, at the same time, helps in providing plots for a very large interval of non-equidistant values of the data block size).

Figure 2. EBUS DMA Data Transfer Latency vs Data Block Size (Average of 1000 Samples).

4.2 Burst Length vs Zero-Copy Send Latency

To evaluate the effects of the burst length on the latency of zero-copy sends issued by the application, we have used a control program run by the LANai processor which has been derived as a modification of the control program reported in Section 2. It is structured as follows:

1. while (1) {
2.   if (EBUS DMA not busy) ckpt_burst();
3. }

In other words, the only responsibility of the modified control program is to check whether the EBUS DMA is not busy and, in the positive case, to execute the function ckpt_burst() in order to activate the EBUS DMA for an operation that simulates a data transfer associated with checkpointing. The sequence of activations of the EBUS DMA is such that a block of bytes is first copied from the host memory into the LANai internal memory and then is copied back from the LANai internal memory into the host memory. In practice, the execution of this control program simulates a case in which data transfer operations associated with checkpointing are continuously executed. Therefore, it simulates a "worst case scenario" in which access to the PCI bridge is continuously required for EBUS DMA based data transfer operations associated with checkpointing.

Fixed this scenario, we have measured the latency required for a zero-copy send operation. Specifically, we have measured the latency required by the host to copy the message content into the LANai internal memory and to set the data structures whose values indicate that a new message needs to be sent. We have measured that latency for the case of three different message sizes, namely 32, 64 and 128 bytes (3). The independent parameter in the analysis is the data block size transferred through a single EBUS DMA activation, namely the burst length characterizing the simulated data transfer operation associated with checkpointing. The results are plotted in Figure 3 (dashed lines indicate reference latency values measured for the case of zero-copy sends executed with no active EBUS DMA based data transfer operation). They point out that, for data block sizes up to 1 Kbyte, the zero-copy send latency shows no relevant increase, as compared to the latency measured for the case of no active EBUS DMA based data transfer, especially for the case of 32 and 64 bytes message size. On the other hand, data block sizes larger than 1 Kbyte might cause intolerable increase in the latency of the zero-copy send operation.
(3) Investigation for message size up to 128 bytes is representative since optimistic parallel discrete event simulation typically requires transfer of small size messages.

Figure 3. Zero-Copy Send Latency vs Data Block Size (Average of 1000 Samples).

4.3 Burst Length vs Checkpointing Latency

As discussed in Section 3, the burst length might impact the latency of any checkpoint operation in both a direct and an indirect way. The direct way is related to polling and EBUS DMA setup delays, while the indirect way is related to the lower priority of EBUS DMA based data transfer operations associated with checkpointing as compared to block-DMA. As already mentioned, the pragmatic approach leads us to focus the attention only on direct effects. To study the direct effects we have measured the latency for checkpointing a state vector of X bytes, using burst length equal to Y bytes, running the control program reported in Section 2 under the situation in which no operation associated with communication functionalities is ever activated (therefore indirect effects of the burst length are actually avoided at all). In the analysis we have varied X from 256 bytes to 10 Kbytes and we have used different values for Y ranging between 256 bytes and 5 Kbytes (4).
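The direct effects just described scale with the number of ckpt_burst() activations, which, by the two-activations-per-burst scheme of Section 2, is 2 * ceil(X / Y). A one-line sketch of this count (our arithmetic, not CCL code):

```c
#include <assert.h>

/* Number of ckpt_burst() activations needed to checkpoint a state
   vector of x bytes with burst length y: two activations per burst
   (host -> LANai intermediate buffer, then LANai -> host). */
static int ckpt_activations(int x, int y) {
    return 2 * ((x + y - 1) / y);   /* 2 * ceil(x / y) */
}
```

For example, a 10-Kbyte state vector pays 80 activations (and their polling and setup delays) with 256-byte bursts, against only 4 with 5-Kbyte bursts, which is why long bursts mitigate the direct effects.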
Figure 4. Checkpointing Latency for Different Burst Lengths (256 bytes, 1 Kbyte, 5 Kbytes) vs State Vector Size in Kbytes (Average of 1000 Samples).

The plots in Figure 4 show that, as expected, the checkpointing latency decreases while the burst length increases. Therefore polling and EBUS DMA setup effects tend to disappear with the increase in the burst length. Anyway, the checkpointing latency for any given state vector size X does not decrease linearly vs the burst length Y. As an example, very strong latency reduction (up to 50%) is obtained changing Y from 256 to 512 bytes. Instead, when Y is changed from 512 bytes to 1 Kbyte, only a 25% additional reduction of the latency is noted. By this behavior we argue that, beyond a given threshold, long or very long burst lengths are likely to not originate very different checkpointing latencies. Therefore, to bound the checkpointing latency, the important thing is to avoid the use of minimal burst lengths (e.g. 256-512 bytes).

By the above results, a burst length of 1 Kbyte appears a well suited value to be selected at compile time of CCL in order to ensure adequate performance for both checkpointing and communication functionalities for the specific cluster environment considered in the analysis. The results reported in the following section confirm this deduction.

5 Results for a Simulation Benchmark
In this section we report performance results for simulations of a classical parameterized synthetic benchmark, executed on 4 machines of the cluster environment considered in the analysis in Section 4. We will show that choosing burst lengths different from 1 Kbyte may sometimes produce very strong negative effects on the execution speed of the simulation. We will also show how, in some circumstances, burst lengths different from 1 Kbyte produce final performance similar to (or even slightly better than) that achieved with a burst length of 1 Kbyte. However, the data indicate that, in the latter circumstances, communication functionalities slightly begin to suffer from the activation of checkpointing functionalities. Although for this specific benchmark such suffering has no determinant effect on performance, it could become a real problem in case of different simulation settings. This supports the effectiveness of the pragmatic approach proposed in Section 4 for the selection of a compile time well suited burst length. The experiments have been performed by using the CCL based optimistic simulation engine presented in [11], whose main loop is structured as follows (for the sake of simplicity, GVT calculation and "fossil collection" for memory recovery are not reported):

1. pending_LP = no_LP;
2. while (not end) {
3.