A Bug Reproducing Method for the Debugging of Component-Based Parallel Discrete Event Simulation*

Zhu Feng and Yao Yiping

School of Computer Science, National University of Defense Technology, Changsha, P.R. China
{zhufeng,ypyao}@nudt.edu.cn
http://yhsim.nudt.edu.cn

Abstract. Component debugging is critically important for diagnosing program failures. In component-based Parallel Discrete Event Simulation (PDES), a bug may not be reproducible because the order of event processing can differ between simulation runs, which makes debugging components a great challenge. To solve this problem, this paper proposes a bug reproducing method based on a checkpoint/restart mechanism, which schedules a simultaneous interaction event after each event is processed to request the checkpoint operation. Moreover, a module called CheckpointRestartMgr (CRM) has been designed to provide the checkpoint/restart service for simulation objects. The simultaneous interaction event and the CRM reduce the cost of cooperation between Logical Processes when a checkpoint operation is performed. The use of our method in a debug framework for component-based PDES demonstrates that it is feasible.

Keywords: Component debug, PDES, checkpoint/restart mechanism, CRM.

1 Introduction

Complex system simulations, such as large-scale ecological simulation, computational biology simulation and complicated war simulation, are usually composed of a large number of entities [1]. Each entity is composed of one or more components, and intricate interactions exist among these components. It is therefore a great challenge for domain experts to build simulation applications from these entities. Component-based PDES supports distributed development and hierarchical composition of models, which provides an efficient way to build large-scale complex system simulations. However, the current major PDES platforms based on event scheduling, such as SPEEDES (Synchronous Parallel Environment for Emulation and Discrete-Event Simulation) [2, 3] and GTW (Georgia Tech Time Warp) [4, 5], do not provide a debugging module to support the development of components. Therefore, research on the debugging of component-based PDES is becoming an increasingly important issue.

* This work has been funded by the National Science Foundation of China (No. 61170048).

J.-H. Kim et al. (Eds.): AsiaSim2011, PICT 4, pp. 456–465, 2012. © Springer Japan 2012


In component-based PDES, simulation objects, each composed of one or several components, execute in parallel on different processors. This parallel paradigm makes the behavior of the running components nondeterministic, and the nondeterminism of a parallel program adds to the complexity of the debugging process. Certain bugs manifest themselves only under a specific ordering of events and may not show up in a rerun of the program because the events are ordered differently in the rerun [6]. Therefore, how to reproduce bugs is a very challenging issue in the debugging of component-based PDES. With a checkpoint/restart mechanism, parallel debugging returns the developer to an intermediate state closer to the bug [7]. The checkpoint/restart mechanism provides bug reproducing capability as well as other benefits for component-based PDES [8]. Component developers can save hours or days of debugging time by checkpointing and restarting the parallel debugging session at intermediate points in the debugging cycle [9].

Therefore, this paper proposes a checkpoint/restart based approach triggered by event processing. In other words, after processing an event, a SimObj schedules a simultaneous interaction event to request a checkpoint operation. A simulation object called CheckpointRestartMgr (CRM) is constructed on each processor; it provides the checkpoint/restart service for the simulation program by processing these interaction events. The checkpoint image is then saved to disk and used later when the program is resumed. An implementation based on YH-SUPE (YinHe Simulation Utilities for Parallel Environment) [10] is used to investigate the applicability of the proposed method in a realistic system environment. The use of our method in the event-driven debug framework for component-based PDES demonstrates that the method is feasible.
The remainder of this paper is structured as follows. Section 2 analyzes the motivation of our work. Section 3 explains in detail our method for reproducing bugs in running components. Section 4 describes our implementation based on YH-SUPE. Finally, Section 5 concludes the paper and indicates future work.

2 Motivation

In the execution of a parallel program, a bug can manifest itself because of an unusual ordering of events. The bug may not recur if the experiment is repeated, because the processors may alter the original ordering of events [11]. In such cases it would be helpful to be able to reproduce the bug deterministically. In a PDES program, SimObjs (simulation objects) are distributed to several processors for parallel execution. A SimObj processes events at different speeds in different runs, which alters the order of event processing. Consider a scenario in which a PDES program is composed of two SimObjs, SimObj1 and SimObj2, which are distributed to two processors for parallel execution. SimObj1, composed of a component named ComponentX, processes four events: event1, event2, event3 and event4, with time stamps 100, 130, 160 and 190. SimObj2, composed of a component named ComponentY, processes four events: eventA, eventB, eventC and eventD, with time stamps 100, 160, 210 and 240. The event-scheduling process in the two components is shown in Figure 1.

Fig. 1. The event-scheduling process in the two components: (1) Processor2 processes the events faster than Processor1; (2) Processor1 processes the events faster than Processor2.
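To make the two-run scenario of Fig. 1 concrete, the following is a minimal C++ sketch of how the same set of events can fail or succeed depending only on the order in which ComponentY happens to process them. The `ComponentY` struct, its `stateUpdatedByC` flag, and the functions `runFails`, `firstRunFails` and `secondRunFails` are hypothetical stand-ins introduced for illustration; they are not part of YH-SUPE, and the real components may depend on each other in a different way.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for ComponentY: eventD is assumed to be valid only
// after eventC has updated the component state.
struct ComponentY {
    bool stateUpdatedByC = false;
    bool failed = false;

    void process(const std::string& event) {
        if (event == "eventC") {
            stateUpdatedByC = true;
        } else if (event == "eventD" && !stateUpdatedByC) {
            failed = true;  // the bug: eventD ran on a stale state
        }
    }
};

// Replays one run; `order` is the sequence in which ComponentY happened to
// process its events in that run. Returns true if the bug showed up.
bool runFails(const std::vector<std::string>& order) {
    ComponentY y;
    for (const std::string& e : order) {
        y.process(e);
    }
    return y.failed;
}

// Run 1: Processor2 is faster, so eventD is processed before ComponentX has
// scheduled eventC.
bool firstRunFails() {
    return runFails({"eventA", "eventB", "eventD", "eventC"});
}

// Run 2: Processor1 is faster, so eventC has arrived before eventD is processed.
bool secondRunFails() {
    return runFails({"eventA", "eventB", "eventC", "eventD"});
}
```

Calling `firstRunFails()` reports the bug while `secondRunFails()` does not, although both runs process exactly the same four events.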

In the first run, suppose that Processor2 processes events faster than Processor1. It is then possible that, after eventB has been processed in ComponentY, ComponentX has not yet scheduled eventC. ComponentY then processes eventD, and the program behaves erroneously. In the second run, suppose that Processor1 processes events faster than Processor2. Before ComponentY processes eventD, ComponentX has processed event4 and scheduled eventC. In that case eventD is processed correctly, because the state of ComponentY has been changed by processing eventC, and the bug does not appear.

The nondeterminism of the event processing order is a serious problem for component debugging. In PDES with an optimistic time management algorithm, the erroneous behavior of the program can be rectified through a rollback mechanism that uses incremental state saving [12, 13]. However, incremental state saving becomes more expensive than copy state saving if the state of the SimObjs is modified by each event [14]. It also reduces the performance of the program, because when the simulation program performs a rollback, the state modified by each event is restored one event after another until the rollback point is reached. Moreover, the rollback mechanism does not always maintain the consistency of a parallel simulation, especially for a program with erroneous behavior.

The checkpoint/restart mechanism makes debugging easier because it enables cyclic debugging: the order of execution is repeatable. Furthermore, the checkpoint/restart mechanism recovers the state of the program from the latest checkpoint, which avoids starting from scratch. However, in a traditional periodic checkpoint/restart mechanism the checkpoint operation is requested periodically, which raises the problem of how to choose the period.
This is especially difficult in PDES, where the time cost of event processing usually differs from event to event. If the chosen period is short, checkpoint operations are requested more often, which increases the memory cost of storing checkpoint files. If the period is long, the event processing order between the checkpoint and the error position must be recorded; otherwise, a bug that appears in the running components will be hard to reproduce.

In summary, bug reproduction needs to be solved in the debugging of component-based PDES, but current methods do not support it efficiently. We therefore propose another method for component debugging to promote the development of component-based PDES.
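The trade-off in choosing a checkpoint period can be illustrated with a small C++ sketch that counts how many already-processed events lie between the last checkpoint and a failing event, that is, how many events would need their order recorded. The checkpoint policy used here (a checkpoint is taken at the first event boundary at which at least `period` time units have elapsed since the previous one) and both function names are assumptions made for illustration only; the paper does not prescribe them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of already-processed events that lie between the last periodic
// checkpoint and the failing event (index `failing` in `costs`).
// Assumed policy (not from the paper): a checkpoint is taken at the first
// event boundary at which at least `period` time units have elapsed since
// the previous checkpoint.
std::size_t eventsSincePeriodicCheckpoint(const std::vector<long>& costs,
                                          long period, std::size_t failing) {
    assert(period > 0 && failing <= costs.size());
    long elapsed = 0;
    std::size_t uncovered = 0;
    for (std::size_t i = 0; i < failing; ++i) {
        elapsed += costs[i];
        ++uncovered;
        if (elapsed >= period) {  // a periodic checkpoint is taken here
            elapsed = 0;
            uncovered = 0;
        }
    }
    return uncovered;
}

// Event-driven policy: a checkpoint is requested after every event, so no
// processed event is left uncovered before the failing one.
std::size_t eventsSinceEventDrivenCheckpoint(std::size_t /*failing*/) {
    return 0;
}
```

With event costs `{5, 1, 1, 1, 20, 2}` and a period of 6, a failure in the fifth event (index 4) leaves two processed events after the last periodic checkpoint, whereas the event-driven policy leaves none.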

3 Checkpoint/Restart Based Bug Reproducing Method

YH-SUPE is a common simulation environment for PDES that provides services for parallel simulation applications, such as time management, memory management, a persistent mechanism and event scheduling strategies. In YH-SUPE, the simulation system can be viewed as a collection of Logical Processes that interact in some fashion, and each Logical Process is assigned to a different processor. In the following sections we therefore refer to a Logical Process as a processor. The checkpoint/restart technique saves the state of the simulation application through the persistent mechanism, and the saved state is the starting point for recovering the program. Using the checkpoint/restart technique to reproduce a bug raises two challenges:

(1) How can the state of a Logical Process be stored so as to reduce the time cost?
(2) How can the order of event processing between the checkpoint and the error position be guaranteed to be the same as in the previous run?

3.1 Persistent Mechanism in YH-SUPE

YH-SUPE provides a persistent mechanism that supports checkpoint/restart with persistent memory and persistent pointers [10]. Our method uses the persistent mechanism to store the states and the events to be processed in a checkpoint file. The persistent memory tracks persistent memory allocation and reclamation, and a persistent pointer maintains the address of a piece of persistent memory. YH-SUPE provides a persistent database that stores the persistent memory and persistent pointers. Figure 2 shows the structure of the persistent database.

Fig. 2. The structure of the persistent database
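One common way to keep pointers valid when a memory block is rebuilt at a different address is to store offsets rather than raw addresses. The C++ sketch below illustrates that general idea only; the classes `PersistentBlock` and `PersistentPtr` and the function `restoredValueAfterRestart` are hypothetical names chosen for this example and are not the YH-SUPE API, whose internal representation is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical persistent memory block: all checkpointed state lives here.
class PersistentBlock {
public:
    explicit PersistentBlock(std::size_t bytes) : data_(bytes, 0) {}
    explicit PersistentBlock(const std::vector<unsigned char>& image)
        : data_(image) {}

    unsigned char* base() { return data_.data(); }
    std::size_t size() const { return data_.size(); }
    // Copy state saving: the checkpoint image is a copy of the whole block.
    std::vector<unsigned char> image() const { return data_; }

private:
    std::vector<unsigned char> data_;
};

// Hypothetical persistent pointer: it remembers an offset into the block, so
// it can be re-bound to a block that was rebuilt at a different address.
template <typename T>
class PersistentPtr {
public:
    PersistentPtr(PersistentBlock& block, std::size_t offset)
        : block_(&block), offset_(offset) {
        assert(offset + sizeof(T) <= block.size());
    }
    void rebind(PersistentBlock& block) { block_ = &block; }
    T load() const {
        T value;
        std::memcpy(&value, block_->base() + offset_, sizeof(T));
        return value;
    }
    void store(const T& value) {
        std::memcpy(block_->base() + offset_, &value, sizeof(T));
    }

private:
    PersistentBlock* block_;
    std::size_t offset_;
};

// Takes a checkpoint, changes the state, then restarts from the checkpoint.
int restoredValueAfterRestart() {
    PersistentBlock original(64);
    PersistentPtr<int> counter(original, 16);
    counter.store(42);
    std::vector<unsigned char> checkpoint = original.image();
    counter.store(7);                     // state changes after the checkpoint
    PersistentBlock rebuilt(checkpoint);  // block rebuilt at a new address
    counter.rebind(rebuilt);              // pointer adjusted to the new block
    return counter.load();
}
```

After the restart the pointer reads the value that was current at checkpoint time (42 in this example), not the later value.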


When a checkpoint operation is requested, the persistent mechanism packages the states of the SimObjs and their events to be processed into a buffer using the copy state saving approach, and then compresses the buffer into a checkpoint file on disk. When the program restarts from the checkpoint file, the memory block is reconstructed through the persistent mechanism, and all persistent pointers are adjusted to the new address of the memory block.

3.2 Event-Driven Checkpoint/Restart Method

A PDES program advances time by processing events, so erroneous program behavior mostly appears in event-processing functions. Our method therefore triggers a checkpoint operation by event processing. In other words, after processing an event, the SimObj schedules a simultaneous interaction event to request a checkpoint operation. To guarantee that this simultaneous interaction event is processed before other simultaneous events, it must be given a higher priority. In order to maintain the consistency of the global checkpoint and reduce the impact of the checkpoint operation on execution time, a simulation object called CheckpointRestartMgr (CRM) is constructed on each processor. The CRM provides the checkpoint/restart service for the simulation application, such as requesting the checkpoint interaction, storing the state of simulation objects and the events to be processed, and recovering the state of SimObjs and the event queue. The CRMs on the different processors cooperate by communicating through the YH-SUPE engine. Figure 3 shows the operation flow of a checkpoint.

Fig. 3. The operation flow of requesting checkpoint

1. SimObj_1 schedules a simultaneous interaction event to request a checkpoint operation after processing an event.
2. The YH-SUPE engine receives the request for the saving interaction and sends it to all of the CRMs on the other processors.
3. Each CRM notifies the simulation object manager SimObjMgr, which dynamically creates simulation objects and manages the subscription of interaction events and event handlers on the same processor.
4. Each SimObjMgr notifies all of the SimObjs on the same processor.
5. Each SimObj calls back the CRM function that saves the persistent state and the event messages whose time stamps are greater than the checkpoint time to a parameter set. This parameter set is then compressed into a buffer, which is written to a checkpoint file on disk.

When the simulation application is restarted to reproduce an error in the running components, the user enters a command to restore the states of the SimObjs packed in the latest checkpoint file on each processor. During a restart from a checkpoint, the routine on each processor is as follows. First, the user enters ProgramName –restart on the command line to restart from the checkpoint. Second, each CRM searches for the latest checkpoint file before the error position in the running components and unpacks it to reconstruct the local buffer. Third, the parameter set of each SimObj is recreated according to the local buffer. Fourth, the state of each SimObj and the events to be processed are reconstructed from the persistent data in the parameter set. Last, the CRM sends a ReconstructComplete message to the YH-SUPE engine to resume the program. Figure 4 shows the workflow of restarting from a checkpoint under YH-SUPE.

Fig. 4. The workflow of restarting checkpoint under YH-SUPE
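The save and restart flows of this section can be approximated by the following self-contained C++ sketch. It packs the state of each SimObj together with the pending events whose time stamps are greater than the checkpoint time, and later rebuilds the SimObjs from that image. The `Event` and `SimObj` structs, the text-based image format, and the functions `saveCheckpoint`, `restoreCheckpoint` and `roundTripExample` are all assumptions made for illustration; YH-SUPE uses its own parameter sets, compression and persistent mechanism, none of which is reproduced here. Names are assumed to contain no whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct Event {
    double time;
    std::string name;
};

// Hypothetical SimObj: one integer of state plus its pending events.
struct SimObj {
    std::string name;
    int state = 0;
    std::vector<Event> pending;
};

// Packs the state of every SimObj and the events whose time stamps are
// greater than checkpointTime into a text image (stand-in for the
// compressed checkpoint file).
std::string saveCheckpoint(const std::vector<SimObj>& objs,
                           double checkpointTime) {
    std::ostringstream out;
    out << objs.size() << '\n';
    for (const SimObj& obj : objs) {
        std::vector<Event> later;
        for (const Event& e : obj.pending) {
            if (e.time > checkpointTime) later.push_back(e);
        }
        out << obj.name << ' ' << obj.state << ' ' << later.size() << '\n';
        for (const Event& e : later) out << e.time << ' ' << e.name << '\n';
    }
    return out.str();
}

// Rebuilds the state and event queue of every SimObj from the image.
std::vector<SimObj> restoreCheckpoint(const std::string& image) {
    std::istringstream in(image);
    std::size_t count = 0;
    in >> count;
    std::vector<SimObj> objs(count);
    for (SimObj& obj : objs) {
        std::size_t events = 0;
        in >> obj.name >> obj.state >> events;
        obj.pending.resize(events);
        for (Event& e : obj.pending) in >> e.time >> e.name;
    }
    return objs;
}

// Checkpoint requested right after eventB (time stamp 160) was processed.
std::vector<SimObj> roundTripExample() {
    SimObj y;
    y.name = "SimObj2";
    y.state = 3;
    y.pending = {{160.0, "eventB"}, {210.0, "eventC"}, {240.0, "eventD"}};
    return restoreCheckpoint(saveCheckpoint({y}, 160.0));
}
```

In `roundTripExample`, the restored SimObj2 keeps its state and the two events later than the checkpoint time (eventC and eventD), so a restart resumes with exactly the event queue that existed when the checkpoint was taken.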

3.3 Analysis

Our method applies the persistent mechanism, which uses the copy state saving approach to store the state of SimObjs. For simulations that contain a large number of events, each of which takes little time to process, making a copy of the state of all SimObjs before each event may consume large amounts of time and memory, so our method is less effective for them. For complex computational simulations that contain fewer events, however, the event-processing functions often need more time, and copy state saving can perform block moves to save and restore state. Copy state saving therefore probably becomes less expensive than incremental state saving, because most of the SimObj state is modified by each event.

Furthermore, our method reproduces the bug in the running components instead of avoiding errors through the rollback mechanism. It helps users explore how event-scheduling time causes the erroneous program behavior, and this exploration helps to reduce the number of rollbacks.

Compared with the traditional periodic checkpoint technique, the event-driven checkpoint operation is requested fewer times because fewer events are to be processed. Requesting the checkpoint operation after event processing therefore does not increase the memory cost of the checkpoint file. Because copy state saving is used, the time needed to restart from a checkpoint file in our method is approximately the same as with periodic checkpointing.

In the event-driven checkpoint/restart method, after processing an event, a SimObj schedules a simultaneous interaction event to request a checkpoint operation. Because of its higher priority, the CRM processes this interaction event and stores the state of the Logical Process almost immediately. An obvious benefit is that the order of events between the checkpoint and the error position in the running components need not be stored, because there is only one event between them. In addition, Logical Processes are not suspended when a checkpoint is requested, so the cost of process switching is reduced.
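The cost argument for copy state saving versus incremental state saving can be sketched in C++ as follows. The cost model is a deliberate simplification and an assumption of this example: one word per state element for a copy, and two words (index and old value) per write for an incremental log. The names `CopySaver`, `IncrementalSaver` and `copyIsCheaperWhenEventTouchesAllState` are hypothetical and are not taken from YH-SUPE or from the cited work.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using State = std::vector<int>;

// Copy state saving: one block copy of the whole state.
struct CopySaver {
    State snapshot;
    void save(const State& s) { snapshot = s; }
    void restore(State& s) const { s = snapshot; }
    std::size_t savedWords() const { return snapshot.size(); }
};

// Incremental state saving: log (index, old value) for every write and undo
// the log in reverse order on restore.
struct IncrementalSaver {
    std::vector<std::pair<std::size_t, int>> log;
    void write(State& s, std::size_t index, int value) {
        log.emplace_back(index, s[index]);
        s[index] = value;
    }
    void restore(State& s) {
        for (auto it = log.rbegin(); it != log.rend(); ++it) {
            s[it->first] = it->second;
        }
        log.clear();
    }
    std::size_t savedWords() const { return 2 * log.size(); }
};

// An event that overwrites every element of an 8-word state: the copy stores
// 8 words, the incremental log stores 16.
bool copyIsCheaperWhenEventTouchesAllState() {
    State s(8, 1);
    const State before = s;
    CopySaver copy;
    copy.save(s);
    IncrementalSaver inc;
    for (std::size_t i = 0; i < s.size(); ++i) inc.write(s, i, 9);
    const bool cheaper = copy.savedWords() < inc.savedWords();
    inc.restore(s);  // both approaches must bring back the same state
    return cheaper && s == before;
}
```

Under this simplified model the balance flips when an event touches only a small part of the state: a single write costs the incremental log two words, while the copy still costs the full state size.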

4 Implementation

YH-SUPE provides interaction events, which support communication between SimObjs through a subscription approach. The macro DEFINE_INTERACTION(ClassName, MethodName) defines an interaction event and introduces functions that support the register/unregister and subscribe/unsubscribe operations for it. ClassName specifies the name of the interaction class, and MethodName specifies the name of the interaction event processing function in this class. The macro SCHEDULE_INTERACTION(const SimTime simTime, const char* interactionName, ParameterSet &parameterSet) schedules an interaction event. simTime specifies the simulation time, interactionName specifies the interaction name, and parameterSet is a container that stores the data passed as parameters of the interaction event.

In our method the CRM, as a SimObj, defines the event-processing function Checkpoing_Requested for the checkpoint operation. Checkpoing_Requested saves the state and the events to be processed of each SimObj through the persistent mechanism in YH-SUPE. Before the components are assembled into a simulation application, the macro SCHEDULE_INTERACTION is inserted at the end of each event-processing function to schedule the simultaneous interaction event. Events are processed in time stamp order; therefore, as soon as a component event has been processed, the interaction event with the same time stamp is processed, that is, Checkpoing_Requested is called to save the state and event queue of each SimObj.

The CRM notifies each SimObjMgr to request the checkpoint operation through the following code. The handle of each SimObjMgr is fetched through the function getSimObjMgrId. The macro SCHEDULE_CheckpointSimObjMgr schedules an interaction event for each SimObjMgr. This interaction event has the same time stamp as the event to be processed. In order to maintain the causal order, the priority of this interaction event is set before it is scheduled.

    for (int i = 0; i < numSimObjMgrs; i++) {
        obj = getSimObjMgrId(i);
        // set priority for simultaneous interaction event
        checkpointTime.SetPriority(0x7ffffffe, 0x7ffffffe);
        SCHEDULE_CheckpointSimObjMgr(checkpointTime, obj);
    }

The CRM generates the checkpoint file through the following code. The parameter set of each SimObj is initialized from SimObjDataList, and a loop then fetches the state and the events to be processed into a buffer, which is compressed into the checkpoint file.

    ParameterSet *ps = SimObjDataList->GetTopElement();
    while (ps) {
        SimObjDataList->PopTop();
        // save the state and events to checkpoint file:
        //   fetch the state and events to a parameter set
        //   compress the parameter set to checkpoint file
        ps = SimObjDataList->GetTopElement();
    }

During a restart from a checkpoint, the CRM recovers the state and event queue of each SimObj through the following code. First the CRM reads the content of the latest checkpoint file before the error position into a buffer, and then calls InitFromBuffer to construct the local parameter set, which is used to construct the parameter set for each SimObj. The state and event queue of each SimObj are then recovered from it.
    // read the content of checkpoint to a buffer
    ParameterSet *localParameter = InitFromBuffer(buffer);
    Bool status = localParameter->GetSimObj(name, buffer);
    while (status) {
        // reconstruct the parameter set of SimObjs
        ParameterSet *simObjParameter = Generate_SimObj(buffer);
        // recover the state of SimObj
        // recover the event queue of SimObj
        status = localParameter->GetNextSimObj(name, buffer);
        delete simObjParameter;
    }


The implementation based on YH-SUPE was carried out to investigate the applicability of our method. We have used it in the event-driven debug framework for EDEVS-based components (details can be found in [15]). The method works better for complex computational systems that contain fewer events and whose event-processing functions often need more time.
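The instrumentation described in this section, in which a checkpoint request is scheduled at the end of every event-processing function and given a high priority so that it runs before other simultaneous events, can be sketched in standard C++ without the YH-SUPE macros. Everything in the sketch is a hypothetical stand-in: the `Event` struct, the `Later` comparator, `processWithCheckpoint`, `run` and `exampleOrder` are illustration names, and the assumption that a larger priority value is served first is ours; the constant 0x7ffffffe is simply the value that appears in the SetPriority call in this section, whose exact semantics in YH-SUPE are not specified here.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

struct Event {
    double time;
    int priority;  // assumption: larger value is served first at equal time
    std::string name;
};

// Orders the queue by time stamp first, then by priority for simultaneous
// events, so that top() is the earliest, highest-priority event.
struct Later {
    bool operator()(const Event& a, const Event& b) const {
        if (a.time != b.time) return a.time > b.time;
        return a.priority < b.priority;
    }
};

using EventQueue = std::priority_queue<Event, std::vector<Event>, Later>;

const int kCheckpointPriority = 0x7ffffffe;

// Runs an event handler and then schedules a checkpoint request with the same
// time stamp, mirroring a scheduling call placed at the end of the handler.
void processWithCheckpoint(EventQueue& queue, const Event& event,
                           const std::function<void(const Event&)>& handler) {
    handler(event);
    queue.push(Event{event.time, kCheckpointPriority, "CheckpointRequest"});
}

// Drains the queue and returns the order in which events were processed.
std::vector<std::string> run(EventQueue queue) {
    std::vector<std::string> order;
    while (!queue.empty()) {
        Event e = queue.top();
        queue.pop();
        order.push_back(e.name);
        if (e.name != "CheckpointRequest") {
            processWithCheckpoint(queue, e, [](const Event&) {});
        }
    }
    return order;
}

// Two simultaneous events at time 100 and one later event at time 130.
std::vector<std::string> exampleOrder() {
    EventQueue queue;
    queue.push(Event{100.0, 2, "event1"});
    queue.push(Event{100.0, 1, "eventA"});
    queue.push(Event{130.0, 0, "event2"});
    return run(queue);
}
```

In `exampleOrder`, the checkpoint request scheduled after event1 is processed before the simultaneous eventA, so each checkpoint covers exactly one processed event.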

5 Conclusion and Future Work

Component-based PDES supports distributed development and hierarchical composition of models, which provides an efficient way to build large-scale complex system simulations. However, the nondeterminism of a parallel simulation program composed of one or several components makes bug reproduction an increasingly important issue. In this paper, we present a bug reproducing approach based on a checkpoint/restart mechanism, in which a module called CheckpointRestartMgr is constructed on each processor to provide the checkpoint/restart service. Our method increases neither the memory cost of the checkpoint file nor the time needed to restart from it, and there is no need to store the order of event processing between the checkpoint and the error position. The method works better for complex computational simulations that contain fewer events and whose event-processing functions often need more time. As future work, we plan to study fault location using the bug reproducing technique proposed in this paper. Another interesting line of investigation would be the use of the checkpoint/restart mechanism to implement fault tolerance.

Acknowledgment. I would like to thank Liu Gang, a member of the High Performance Simulation Group, for his research on the EDEVS component based YH-SUPE debug technology.

References

1. Li, B.H.: Some Focusing Points in Development of Modern Modeling and Simulation Technology. In: Baik, D.-K. (ed.) AsiaSim 2004. LNCS (LNAI), vol. 3398, pp. 12–22. Springer, Heidelberg (2005)
2. Jeff, S.: SPEEDES: Synchronous Parallel Environment for Emulation and Discrete Event Simulation. In: Proceedings of Advances in Parallel and Distributed Simulation, pp. 95–103 (1991)
3. Chris, B., Robert, M., Jeff, S., Jennifer, W.: SPEEDES: A Brief Overview. In: Proceedings of SPIE, Enabling Technologies for Simulation Science V, pp. 190–201 (2001)
4. Fujimoto, R.M., Das, S., Panesar, K.: Georgia Tech Time Warp programmer’s manual. Technical report, College of Computing, Georgia Institute of Technology, Atlanta, GA (July 1994)
5. Perumalla, K.S., Fujimoto, R.M.: GTW++ An Object-oriented Interface in C++ to the Georgia Tech Time Warp System. Georgia Institute of Technology (1996)
6. May, J., Berman, F.: Designing a Parallel Debugger for Portability. In: Proceedings of the International Conference on Parallel Processing, pp. 909–914 (1994)
7. Wang, Y.-M., Huang, Y., Vo, K.-P., Chung, P.-Y., Kintala, C.: Checkpoint and Its Application. In: IEEE Fault-Tolerant Computing Symp., pp. 22–31 (June 1995)
8. Bouguerra, M.-S., Gautier, T., Trystram, D., Vincent, J.-M.: A Flexible Checkpoint/Restart Model in Distributed Systems. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2009, Part I. LNCS, vol. 6067, pp. 206–215. Springer, Heidelberg (2010)
9. Hursey, J., January, C., O’Connor, M., Hargrove, P.H., Lecomber, D., Squyres, J.M., Lumsdaine, A.: Checkpoint/Restart-Enabled Parallel Debugging. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds.) EuroMPI 2010. LNCS, vol. 6305, pp. 219–228. Springer, Heidelberg (2010)
10. Yao, Y.-p., Zhang, Y.-x.: Solution for Analytic Simulation Based on Parallel Processing. Journal of System Simulation 20(24), 6617–6621 (2008)
11. Netzer, M.: Optimal tracing and replay for debugging message-passing parallel programs. In: Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pp. 502–511 (1992)
12. Avril, H., Tropper, C.: On Rolling Back and Checkpointing in Time Warp. IEEE Transactions on Parallel and Distributed Systems 12(11), 1105–1120 (2001)
13. Steinman, J.S.: Incremental State Saving in SPEEDES Using C++. In: Proceeding of the 1993 Winter Simulation Conference, pp. 687–696 (1993)
14. Fujimoto, R.M.: Parallel and Distributed Simulation Systems. John Wiley & Sons, Inc. (2000)
15. Liu, G., Yao, Y.: Event-Driven Debug Framework for EDEVS based Components. In: 2011 International Symposium on Computer Science and Society (ISCCS 2011), pp. 411–414 (2011)