RecPlay: A Fully Integrated Practical Record/Replay System

MICHIEL RONSSE and KOEN DE BOSSCHERE
Universiteit Gent
This article presents a practical solution for the cyclic debugging of nondeterministic parallel programs. The solution consists of a combination of record/replay with automatic on-the-fly data race detection. This combination enables us to limit the record phase to the more efficient recording of the synchronization operations, while deferring the time-consuming data race detection to the replay phase. As the record phase is highly efficient, there is no need to switch it off, thereby eliminating the possibility of Heisenbugs because tracing can be left on all the time. This article describes an implementation of the tools needed to support RecPlay.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming—Parallel programming; D.2.5 [Software Engineering]: Testing and Debugging—Debugging aids; Monitors; Tracing; D.4.1 [Operating Systems]: Process Management—Concurrency; Deadlocks; Multiprocessing/multiprogramming; Mutual exclusion; Synchronization

General Terms: Algorithms, Experimentation, Reliability

Additional Key Words and Phrases: Binary code modification, multithreaded programming, race detection
1. INTRODUCTION
Cyclic debugging assumes that a program execution can be faithfully reexecuted any number of times. Therefore, nondeterministic parallel programs are hard to debug by means of cyclic debugging because subsequent executions with identical input are not guaranteed to have the same behavior. Known sources of nondeterminism are certain system calls, such as random() and gettimeofday(), interrupts, traps, signals, noninitialized variables, dangling pointers, and finally—for parallel programs—unsynchronized accesses to shared memory.
Michiel Ronsse is supported by a grant from the Flemish Institute for the Promotion of the Scientific-Technological Research in the Industry (IWT). Koen De Bosschere is a research associate with the Fund for Scientific Research—Flanders. Authors’ address: Department of Electronics and Information Systems, Universiteit Gent, Sint-Pietersnieuwstraat 41, Ghent, B-9000, Belgium; email:
[email protected].
There exist effective and fairly efficient ways to remove the sources of nondeterminism that occur in sequential programs: the output of nondeterministic system calls can be traced; the nondeterminism caused by interrupts, traps, and signals can be dealt with by using the technique of Interrupt Replay [Audenaert and Levrouw 1994]; noninitialized variables and dangling pointers can be detected by packages like Purify1 or Insight.2 The nondeterminism that is caused by unsynchronized accesses to shared memory in parallel programs (the so-called race conditions) is, however, much harder to deal with during debugging. In this article, we present a combination of techniques that allows standard debuggers for sequential programs to be used to debug parallel programs. This is done by tracing a program execution (record) and by using this information to guide a faithful reexecution (replay). A faithful replay can be guaranteed if and only if the trace contains sufficient information about all nondeterministic choices that were made during the original execution (minimally the outcome of all the race conditions). This suffices to create an identical reexecution, the race conditions included. Unfortunately, this approach causes a huge overhead, severely slows down the execution, and produces huge trace files. An alternative approach, which we advocate in this article, is to record an execution as if it did not contain data races, and to check for the occurrence of data races during replay. As has been shown [Choi and Min 1991], replay is then guaranteed to be correct up to the race frontier, i.e., the point in the execution of each thread where a race event is about to take place. Running the race detector during the original execution would be more convenient, but is not feasible due to the huge overhead caused by on-the-fly race detection. This overhead is however harmless during replay. Hence, we can safely run the race detector as a watchdog during replay, without changing the behavior of the execution. As soon as the first data race occurs, the user is notified, and the replayed execution stops. The user can force3 RecPlay to continue running after the first data race and to report all data races in the execution. However, as the first (detected) data race can make the remainder of the program nondeterministic, a correct reexecution is no longer guaranteed by RecPlay. RecPlay is thus a kind of weak record/replay system in that it can only correctly replay programs that are free of data races. It is, however, correct in that it will automatically detect when a data race occurs, and stop the execution. Hence, it will never produce an incorrect replay. This approach is not only more efficient, but also more useful, because we believe that a tool that detects data races is more useful than a tool that simply replays them. Our system guarantees that a particular program execution is free of data races, whereas a system that replays data races cannot guarantee this.
1 http://www.rational.com
2 http://www.parasoft.com
3 Using an environment variable.
In order to be practical, this approach requires a distinction between two abstraction levels: the synchronization level and the shared data access level. In the record phase, only the synchronization operations are traced. During race detection, the data access operations are also traced, and this information is used to automatically detect data races. Tracing data accesses is two orders of magnitude slower than tracing synchronization operations, but this is not a problem during replay. This distinction between the two abstraction levels has two major advantages:
—The trace files only contain information about the synchronization events, and neither about the (many) race conditions that were needed to implement them, nor about the shared data accesses. They will therefore be much smaller than full data access traces. Furthermore, since synchronization operations are in some way related (more than the underlying race conditions), their entropy is lower, and information about them can be compressed more effectively.
—The race detector does not have to take into account the race conditions that were contributed by the synchronization operations. It can therefore find data races in O(p²) time and O(p) space on a system with p processors.
Our implementation of the automatic data race detection algorithm slows down the replayed program about 36 times, but it can run unsupervised. This is comparable to the operation and slowdown reported for tools like Purify or Insight. In comparison with Eraser [Savage et al. 1997], RecPlay can only detect data races that appear in a particular execution. Eraser is more general in that it also detects data races that do not appear in one execution, but are not excluded by the synchronization operations of the program, by checking the locking discipline for shared-memory accesses. Eraser is, however, restricted to the use of mutexes as synchronization operations, and it is too conservative: it also returns false positives. RecPlay is indeed limited to the data races that are apparent from one execution, but we believe it is more useful: it is guaranteed to find the first apparent data race in an execution; it will never report false positives; and, very important for the practical applicability, it supports the full set of Solaris thread synchronization primitives. We believe that RecPlay is the only practical and effective debugging/race detection system currently available for thread-based programs. To the best of our knowledge, the combination of techniques we propose in this article has not been implemented before. The problems with making a practical tool were numerous, and the implementation was nontrivial (especially the efficient recording of the synchronization events, and the collection of the data references). This article tries to summarize the solutions we used to make RecPlay an effective and efficient tool for debugging nondeterministic parallel programs. It describes the results of a two-person-year implementation effort. The implementation described in this article runs on all Solaris versions.
During the test phase of RecPlay, we have found data races in several programs we tested (including the SPLASH-2 suite [Woo et al. 1995] and the Athapascan system [Cavalheiro and Doreille 1996]). All of the races that were reported were genuine data races that had stayed undetected until now. In the next section, we start with some definitions, followed by a discussion of tracing the synchronization races (Section 3) and detecting the data races (Section 4). These sections are followed by an evaluation section containing performance data. The article is concluded with an overview of related work and future work.
2. DEFINITIONS
In this section, we define the terminology that will be used in the sequel. A race condition is defined as two unsynchronized accesses to the same shared location, where at least one access modifies it. In this article, an update of a memory location with the same value also causes a race condition, although it can never change the behavior of the program (the memory location is updated, but it is not changed). We distinguish two types of race conditions: race conditions that are used to make a program intentionally nondeterministic (synchronization races), and race conditions that were not intended by the programmer (data races). We need synchronization races to allow for competition between threads to enter a critical section, to lock a semaphore, or to implement load balancing. Removing synchronization races makes a program completely deterministic. Therefore, in this article, we do not consider synchronization races a programming error, but a functional and useful characteristic of a parallel program. Data races are not intended by the programmer and are mostly the result of improper synchronization. By changing the synchronization, data races can (and should) always be removed. It is important to notice that the distinction between a data race and a synchronization race is actually a pure matter of abstraction. At the implementation level of the synchronization operations, a synchronization race is caused by a genuine data race (e.g., spin locks, polling, etc.) on a synchronization variable. This makes it difficult to make a clear distinction between synchronization races and data races at the machine instruction level. Fortunately, all synchronization operations are located in dynamically loadable synchronization libraries, and by switching off the race detector in dynamically loadable libraries, all detected races must be genuine data races. The same mechanism can be used to debug user-defined synchronization libraries. By statically linking the implementation of the newly defined synchronization operations, and by dynamically linking the code of the low-level synchronization operations, the former will be traced, while the latter are not.
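To make the distinction concrete, consider the following small example (not taken from the article; it assumes the Solaris threads API used throughout the article). The first pair of threads updates a shared counter without synchronization, a data race that RecPlay would report; the second pair performs the same update under a mutex, so the only remaining races are the intentional synchronization races inside the lock implementation, which are filtered out because they occur in the dynamically loaded thread library.

/* Illustrative sketch only, not code from the article (Solaris threads API assumed). */
#include <thread.h>
#include <synch.h>
#include <stdio.h>

int counter = 0;                 /* shared variable */
mutex_t m = DEFAULTMUTEX;

void *racy(void *arg)            /* two instances form a data race */
{
    counter++;                   /* unsynchronized read-modify-write */
    return NULL;
}

void *locked(void *arg)          /* only synchronization races remain */
{
    mutex_lock(&m);
    counter++;
    mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    thread_t t1, t2;

    /* Data race: two unsynchronized accesses, at least one a write. */
    thr_create(NULL, 0, racy, NULL, 0, &t1);
    thr_create(NULL, 0, racy, NULL, 0, &t2);
    thr_join(t1, NULL, NULL);
    thr_join(t2, NULL, NULL);

    /* No data race: the accesses are ordered by the mutex; the races
       inside mutex_lock() are synchronization races and are not reported. */
    thr_create(NULL, 0, locked, NULL, 0, &t1);
    thr_create(NULL, 0, locked, NULL, 0, &t2);
    thr_join(t1, NULL, NULL);
    thr_join(t2, NULL, NULL);

    printf("counter = %d\n", counter);
    return 0;
}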
If the programmer creates his or her own synchronization operations, we have to make a distinction between feasible data races and apparent data races. The difference between them can be intuitively understood from the following example:

thread1                       thread2
result = x;                   while (done == FALSE);
done = TRUE;                  y = result;
Technically speaking, there are no explicit synchronization operations in this fragment, and therefore a naive race detector will detect two races: one on result and one on done. However, after taking the semantics of the program fragment into account, there is only one data race that can actually occur, namely on done, because the data race on result cannot occur, provided that done was set to FALSE before the two threads were created.4 We say that this program fragment contains two apparent data races, but only one data race is feasible. A good race detector must not detect data races that are not feasible. Other notable examples of nonfeasible data races are the nonsynchronized accesses to shared variables during program initialization.

3. DEALING WITH SYNCHRONIZATION RACES: ROLT
Since synchronization races and data races are at the lowest level caused by the same mechanism (an unsynchronized access to a shared location), tracing the order of all shared-memory operations is sufficient to capture both synchronization races and data races, and thus to allow for a correct replay of a program execution. However, this is not feasible for at least three reasons: (1) the resulting traces quickly become unreasonably large, (2) the overhead of collecting every memory access normally slows down the program by two orders of magnitude, causing a serious probe effect and in some cases Heisenbugs (Heisenbugs are bugs whose symptoms disappear due to the intrusion caused by observing them), and (3) it is difficult to distinguish the feasible from the apparent data races (at the implementation level, they are all similar). Fortunately, record/replay can be implemented at different levels of abstraction. In order to deterministically replay a parallel program, it is sufficient to replay the order of synchronization events instead of the (many) synchronization races that were used to implement them. The overhead for tracing synchronization events will be much smaller in time and in space [LeBlanc and Mellor-Crummey 1987; Leu et al. 1991; Netzer
4 This is true if the computer has a memory model that guarantees at least processor consistency. Modern processors with weaker memory models require some kind of serialization instruction (e.g., an STBAR instruction on a SPARC) between the two store operations performed by the first thread to make this work correctly. This example only serves to explain the concept.
1993] and may be small enough to circumvent Heisenbugs and to limit the probe effect. In previous work, we have developed ROLT (Reconstruction of Lamport Timestamps), an ordering-based record/replay method in which only the partial order of synchronization operations is traced by means of (scalar) Lamport clocks. The method [Levrouw et al. 1994b] has the advantage that it produces small trace files and that it is less intrusive than other existing methods [Netzer 1993]. Moreover, the method allows for the use of a simple compression scheme [Ronsse et al. 1995] which can further reduce the trace files. The execution traces consist of a sequence of timestamps, one trace per thread. In order to reduce the amount of information stored, the timestamp increments are stored instead of the actual timestamps. This information can be further reduced by splitting the increments into a deterministic increment (the most common case: +1), and a nondeterministic increment (an update due to a synchronization variable that has a higher logical clock value). In principle, storing the nondeterministic increments per processor is sufficient to allow a correct replay, as the deterministic increments can be recomputed. In Levrouw et al. [1994a] it is proven that logging only the synchronization operations for which the nondeterministic increments are nonzero (1) still allows a correct replay, (2) automatically eliminates the trivial transitive orderings, and (3) allows for an efficient compression scheme. To get a faithful replay, it is sufficient to stall each synchronization operation until all synchronization operations with a smaller timestamp have been executed.

4. DEALING WITH DATA RACES
As pointed out in the introduction, data races are generally considered programming errors, and we will not try to correctly replay data races. Instead, we will run a data race detection algorithm during the reexecution of a program. As soon as a data race occurs, the race detector will notify the user. As the data races will be detected during replay—when execution overhead is less of a problem—an on-the-fly detection technique is used to find the first data race. In theory, there is no need to collect data after the first data race has occurred, as replay cannot be guaranteed to be correct anymore. For that reason, post mortem race detection is not applicable, besides the fact that it is not feasible due to the size of the trace files for real applications. Online data race detection consists of three phases:
(1) Collecting memory reference information per sequential block between two successive synchronization operations (called segments). This yields two sets of memory references per segment: W(i) are the locations that were written, and R(i) are the locations that were read in segment i.
Fig. 1. JiTI uses a clone area (d ... d + a) for code accesses and the original process for data accesses (0 ... a). Load and store operations in the clone area are replaced with branches, using an intermediate trampoline, to the instrumentation routine.
(2) Detecting conflicting memory references in concurrent segments. There will be a data race between segment i and segment j if either (R(i) ∪ W(i)) ∩ W(j) ≠ ∅ or (R(j) ∪ W(j)) ∩ W(i) ≠ ∅ is true. This phase will yield the two segments that contain the racing instructions and the memory location involved in the race. It does not return the offending instructions.
(3) Identifying the instructions that caused the data race, given the two segments and the memory location involved.

4.1 Collecting Memory References (Tracing)
Collecting memory reference information (the R and W sets) per segment is done by instrumenting the memory access instructions. Every shared-memory location has one bit in the R and W sets. A load instruction sets a bit in R, and a store or exchange instruction sets the corresponding bit in W. To be able to distinguish the feasible data races from the apparent data races, the data race detection tool should be able to distinguish the memory operations used by a synchronization operation from the other memory operations. As standard synchronization operations are located in a particular library, it is fairly straightforward to prevent them from being traced. Users that develop their own synchronization operations (on top of the low-level Solaris primitives) should place these functions in a dynamically loadable library to prevent the tracing of their memory operations. In order to collect the memory references on a SPARC machine, we use a special technique called JiTI, or Just-in-Time Instrumentation (Figure 1). The technique consists of making a clone (between d and d + a) of the original program (between 0 and a) just after loading it into memory and before it has been started. By changing the start address of the program to main + d, the execution will use the instrumented code of the clone on the data of the original program. All load and store operations in the clone are replaced by a jump (using BA,a, an annulled Branch Always) to an
140
•
M. Ronsse and K. De Bosschere
instrumentation routine.5 As a branch instruction does not save the return address, we use trampolines (comparable to the springboards used by Tamches and Miller [1999]) to jump to the actual instrumentation routine. These trampolines (one for each instrumented load or store operation) consist of four instructions that switch to a new register window, store the return address in a fresh register, and jump to the actual instrumentation routine. The instrumentation routine calculates the address of the original instruction (this is d less than the return address saved by the trampoline), calculates the address used by the load or store operation, executes the original instruction, and then jumps back to the clone. JiTI also changes the Procedure Linkage Table of the process in order to force the application to use instrumented versions of the Solaris synchronization functions. The use of this cloning technique has several advantages over classical instrumentation techniques: (1) the size of the instrumented code (the clone) does not change, and hence relative addresses remain valid; there is thus no problem with relative branch scopes, etc.; (2) code in data and data in code are no longer problems: code in data will be instrumented in the clone; data in the code will remain untouched in the original copy; (3) self-modifying code can be dealt with elegantly: a bit pattern that is stored in memory can be written directly in the original program, and can be written in its instrumented form in the clone (if it represents an instruction that has to be instrumented). The implementation of cloning and instrumenting is however not simple and has required a deep insight into the inner operations of the Solaris operating system and the SPARC processor, the interaction with the memory management, etc. Nevertheless, we succeeded in integrating the data race detector in a dynamic library that runs in user mode: hence, it neither needs the intervention of a system manager to be installed, nor extra privileges to be executed. It is a property of an execution, and not of the program that is being debugged.

4.2 Detecting Data Races
In our race detection tool, we use a classical logical vector clock [Fidge 1991; Mattern 1989] to detect concurrent segments (see Figure 2 (left-hand side)). Updating vector clocks is time consuming, but this is not an issue during replay. Vector clocks do have the interesting property that they are strongly consistent, which means that two vector clocks that are not ordered must belong to concurrent segments. This gives us an easy way to detect concurrent segments. The problem with nonscalar clocks, namely the
5 Delay instructions are handled by replacing the preceding jump instruction with a procedure call.
Fig. 2. Updating vector clocks and detecting data races: (left) vector clocks to detect concurrent segments, (right) matrix clocks to discard W and R sets. The right-hand side of the figure shows two lines and a shaded area. The lower line is a representation of the progress of the execution (the points up to where the processes have progressed). The upper line is the visible past for the processes. All segments above that line can safely be discarded. The shaded area (delimited by the two lines) shows the “current” segments.
fact that their size varies with the number of threads (which is generally unknown before the program is run), is easily solved during replay, as this information can easily be extracted from the trace file before starting the replay. All instructions can be assigned a vector clock, but as vector clocks can only be changed by synchronization instructions, all instructions between two successive synchronization operations will have the same vector clock. These sequences of instructions with the same vector clock are called segments. The synchronization instructions themselves are associated with the newly computed vector clock. Segments x and y can be executed in parallel if and only if their vector clocks are not ordered (p_x is the processor on which segment x was executed):

x ∥ y ⇔ (VC_x[p_x] > VC_y[p_x] and VC_x[p_y] < VC_y[p_y]) or (VC_x[p_x] < VC_y[p_x] and VC_x[p_y] > VC_y[p_y])
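As an illustration (this code is not part of the article), the concurrency test above and the race check from the beginning of Section 4 can be sketched as follows; the bit-set representation of R and W follows Section 4.1, but the sizes and the segment layout are assumptions.

/* Illustrative sketch only.  A segment carries the vector clock assigned to it
   and bit sets R and W (one bit per monitored shared word). */
#include <stdint.h>

#define NTHREADS 8
#define NWORDS   (1 << 16)                  /* shared words monitored (assumed) */

typedef struct {
    int      proc;                          /* thread that executed the segment */
    unsigned vc[NTHREADS];                  /* vector clock of the segment      */
    uint64_t R[NWORDS / 64];                /* locations read                   */
    uint64_t W[NWORDS / 64];                /* locations written                */
} segment_t;

/* x and y are concurrent iff their vector clocks are unordered; for segments
   only the two "own" components have to be compared, as in the formula above. */
static int concurrent(const segment_t *x, const segment_t *y)
{
    return (x->vc[x->proc] > y->vc[x->proc] && x->vc[y->proc] < y->vc[y->proc]) ||
           (x->vc[x->proc] < y->vc[x->proc] && x->vc[y->proc] > y->vc[y->proc]);
}

/* Data race iff (R(x) union W(x)) intersects W(y), or symmetrically for x. */
static int races(const segment_t *x, const segment_t *y)
{
    for (int i = 0; i < NWORDS / 64; i++)
        if (((x->R[i] | x->W[i]) & y->W[i]) || ((y->R[i] | y->W[i]) & x->W[i]))
            return 1;
    return 0;
}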
This is possible thanks to the strong consistency property of vector clocks. For instance, c ∥ h and c ∦ i in Figure 2. Once the vector clocks have been assigned to the segments, we can determine a segment execution order that is consistent with the vector clocks. Of all segments that are ready to be executed (at most one per thread), the segment with the smallest vector clock is selected. If there is no unique smallest vector clock (or if the vector clocks are incomparable), a random choice is made. During the execution of segment i, memory access information is gathered in the sets W(i) and R(i). After the execution of a segment, its vector clock is compared with the vector clocks of all previously executed segments, and the race condition is verified on the W(j) and R(j)
Fig. 3. Logical clock update anomaly: (left) without snooping, (right) with snooping.
sets of the concurrent segments j. If a data race is discovered, this fact is communicated to the user. This approach keeps on adding W(i) and R(i) sets to memory. It is therefore important to remove them as soon as it is clear that there are no more segments that will need them for future comparison. A segment can be removed if all other processes have progressed beyond the point in logical time at which the segment terminated. After that point in time, no more segments can be started that could possibly cause a data race with it. Logical matrix clocks [Raynal and Singhal 1996] are traditionally used to discard information in a distributed environment because the componentwise minimum of the vector columns of a logical matrix clock returns the maximum number of events per process that can be discarded (this is one more than the number of segments; see Figure 2 (right-hand side)) [Wuu and Bernstein 1984; Sarin and Lynch 1987]. However, in practice we can discard more information than is indicated by logical matrix clocks. A drawback of all types of logical clocks is that they capture causality, which is one of the weakest forms of event ordering. Causality is a minimal relation that is part of any execution of a program. In a particular execution, events will be ordered most of the time, even if they do not have a causal relationship. Classical logical clocks are not able to capture this kind of additional execution-specific ordering because the events that are not causally related do not interact, and hence there is no way of communicating a clock value. This property of logical clocks is inconvenient, e.g., in applications that make use of stream-based parallelism where processes communicate information in a pipelined fashion, and the sending process does not know how far the receiving process has progressed, as is illustrated in Figure 3 (left-hand side): the vector clock of the second thread is never sent to the first thread. Our data race detection method uses a novel technique called clock snooping [De Bosschere and Ronsse 1997]. Clock snooping allows us to extract more event ordering information from a particular program execution than is possible with the classical clock update algorithms. The technique is based on the ability of a process to explicitly request the logical clock from another process (at any time) (see Figure 3 (right-hand side)). This introduces no overhead: as the clocks are stored in shared
Fig. 4. Discarding information with logical (left) and snooped (right) matrix clocks.
memory, each thread can read the clocks of the other threads. There is no need to protect the clocks with a mutex; vectors that are being read while they are updated are still correct for our purposes, provided the clock values per thread are updated atomically (which is the case for RecPlay). Instead of maintaining matrix clocks, a snooped matrix clock is built each time a segment ends, using the latest vector clocks VC_p of all the processors. The componentwise minimum MinVC of the vector clocks VC_p, p ∈ {1, ..., n}, is computed. All segments s with VC_s[p] < MinVC[p] for a certain processor p can be discarded, as they will not be needed in the future anymore. Clock snooping is a cheap operation in parallel programs on shared-memory architectures. In Figure 4, we show the information we obtain with the two brands of matrix clocks. It is clear that a snooped clock allows us to discard more segments than a classical logical clock, while staying correct with respect to the information concerning the segments to discard. Figure 5 compares the behavior of logical and snooped matrix clocks (simulated results). It is clear that snooped clocks perform better.

4.3 Reference Identification (Finding the Data Races)
Once a data race is found using the scheme described above, we know the address of the location and the segments containing the offending instructions, but not the instructions themselves. Detecting these instructions requires another (deterministic) replay of the program. This operation is however completely automatic and incurs no large overhead. The replayed execution runs at full speed up to the segments that contain the data race. From then on, the instrumentation is switched on, and the offending instruction is searched for. We are always guaranteed to find the first data race on a particular variable.

5. PERFORMANCE EVALUATION
The RecPlay system has been implemented for Sun multiprocessors running Solaris 2.6. The implementation uses the dynamic linking and loading facilities of the operating system. Four dynamically loadable libraries have been created: one for recording the order of the synchronization operations
Fig. 5. Number of "current" segments with logical and snooped matrix clocks.
using ROLT, one for replaying the synchronization operations, one for detecting data races during replay, and one for identifying the offending references while replaying a previously recorded run. Using an environment variable, the Solaris program loader is forced to load one of these libraries each time it loads a user program. The libraries contain the necessary routines to perform initialization before the actual program starts. Neither the user program nor the thread library needs to be modified. The RecPlay library inserts itself automatically between the user program and the thread library. This is because we want to keep the RecPlay operation (record, replay, race detection, reference identification) a feature of an execution, rather than a feature of a program. A sample debugging session with RecPlay follows:

$ cc -o demo demo.c -lthread -g
$ demo
The result of this nondeterministic program is: 15
$ demo
The result of this nondeterministic program is: 13
$ setenv LD_PRELOAD /usr/local/lib/librecord.so
$ demo
The result of this nondeterministic program is: 16
$ setenv LD_PRELOAD /usr/local/lib/libdetect.so
$ demo
Data race detected between thread 6 (store) and thread 5 (load)
at address 0x20e10 (data location: i)
$ setenv LD_PRELOAD /usr/local/lib/librefid.so
$ demo
Data race detected between
- thread 6: 0x00010a30: STF %f3,[%l0+0]; i /= 3;
- thread 5: 0x000109e4: LDF [%l0+0],%f3; i *= 2;
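The article does not show the interposition code itself. Purely as an illustration of the mechanism (the names, the single global clock, and the in-memory trace are simplifying assumptions; the real record library implements ROLT as described in Section 3), a preloaded library could wrap a Solaris synchronization primitive as follows.

/* Illustrative sketch, not the actual RecPlay code.  A library loaded via
   LD_PRELOAD overrides a Solaris synchronization primitive, forwards the call
   to the real routine in the thread library with dlsym(RTLD_NEXT, ...), and
   attaches a logical timestamp to the operation.  The global scalar clock and
   the in-memory trace are not thread-safe; they only illustrate the mechanism. */
#include <dlfcn.h>
#include <synch.h>

static unsigned long lamport_clock;          /* simplified scalar clock     */
static unsigned long trace[1 << 20];         /* simplified in-memory trace  */
static unsigned long trace_len;

int mutex_lock(mutex_t *mp)
{
    static int (*real_lock)(mutex_t *);

    if (real_lock == NULL)                   /* locate the real primitive   */
        real_lock = (int (*)(mutex_t *))dlsym(RTLD_NEXT, "mutex_lock");

    int result = real_lock(mp);              /* perform the real operation  */

    trace[trace_len++] = ++lamport_clock;    /* timestamp the acquisition   */
    return result;
}

Compiled into a shared library and listed in LD_PRELOAD, such a wrapper is picked up by the dynamic linker exactly as in the sample session above.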
Table I. Basic Performance of RecPlay (all times in seconds)

Program        Normal     Record               Replay               Replay+Detect
               Runtime    Runtime  Slowdown    Runtime  Slowdown    Runtime   Slowdown
cholesky         8.67       8.88    1.024       18.90     2.18        721.4     83.2
fft              8.76       8.83    1.008        9.61     1.10         72.8      8.3
LU               6.36       6.40    1.006        8.48     1.33        144.5     22.7
radix            6.03       6.20    1.028       13.37     2.22        182.8     30.3
ocean            4.96       5.06    1.020       11.75     2.37        107.7     21.7
raytrace         9.89      10.19    1.030       41.54     4.20        675.9     68.3
water-Nsq.       9.46       9.71    1.026       11.94     1.26        321.5     34.0
water-spat.      8.12       8.33    1.026        9.52     1.17        258.8     31.9
MLEM             7.88       8.05    1.022       10.44     1.32        209.4     26.5
Average                             1.021                 1.91                  36.3
worst-case       6.36       8.01    1.259       42.59     6.70       1888.3    296.9
All detected data races are guaranteed to be feasible data races, provided that only the Solaris synchronization operations are used. Data races detected in user programs that also use their own synchronization operations can also be apparent data races that are not feasible. However, at least one of the apparent data races must be a feasible data race [Netzer and Miller 1992] that is used to synchronize the other memory references (and hence is a synchronization race). By replacing user-defined synchronization operations by Solaris synchronization operations, or by putting the new synchronization operations in a separate library that is flagged as a synchronization library, we can prevent RecPlay from detecting such synchronization races during subsequent runs. These synchronization races are not detected, as the load and store operations of the synchronization libraries are not traced as long as they are dynamically loaded. Tables I and II give an idea of the overhead caused by RecPlay during record, replay, and race detection for programs from the SPLASH-2 benchmark suite. We have also added MLEM, an image reconstruction algorithm, and worst-case, a program full of synchronization operations that we have regularly used to test the limits of the RecPlay system. We have never included worst-case in the averages because it is a nontypical program; we have only added it to the tables as a reference. From Table I it follows that the average overhead during the record phase is limited to 2.1%, which is small enough to keep recording switched on all the time. This is a way to completely eliminate Heisenbugs: every time a bug manifests itself, we will be able to find it with the trace it has generated. A complete replay mechanism also needs to replay interrupts and system calls. Replaying the interrupts at the right moment [Audenaert and Levrouw 1994] and forcing the same system call results (using contents-based replay) will introduce an additional but fairly small overhead. The overhead can be kept small if one limits the tracing to the input that cannot be reproduced during the reexecution, e.g., the result of gettimeofday(), keyboard and network input, etc. Data read from files need not be traced as long as one makes a copy of these files.
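The article only outlines contents-based replay. As an illustration (the RECPLAY_MODE variable, the log file name, and the error handling are assumptions, not part of RecPlay), a wrapper in the same preloaded library could record the result of gettimeofday() during the original run and feed the logged values back during replay.

/* Illustration of contents-based replay for one nondeterministic call; all
   names are assumptions.  During recording the real result is logged; during
   replay the logged value is returned instead of querying the clock.  The
   definition matches the Solaris prototype (second argument is void *). */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int gettimeofday(struct timeval *tv, void *tz)
{
    static FILE *log;
    static int replaying = -1;

    if (replaying < 0) {                              /* one-time setup */
        const char *mode = getenv("RECPLAY_MODE");
        replaying = (mode != NULL && strcmp(mode, "replay") == 0);
        log = fopen("timeofday.log", replaying ? "r" : "w");
    }

    if (replaying) {                                  /* replay: reuse the trace */
        long sec, usec;
        if (fscanf(log, "%ld %ld", &sec, &usec) != 2)
            return -1;
        tv->tv_sec = sec;
        tv->tv_usec = usec;
        return 0;
    }

    /* record: call the real routine and log its result */
    static int (*real)(struct timeval *, void *);
    if (real == NULL)
        real = (int (*)(struct timeval *, void *))dlsym(RTLD_NEXT, "gettimeofday");
    int result = real(tv, tz);
    fprintf(log, "%ld %ld\n", (long)tv->tv_sec, (long)tv->tv_usec);
    return result;
}

During replay, calls to gettimeofday() then return exactly the recorded values, so the reexecution does not diverge on this source of nondeterminism.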
Table II. Performance of the Data Race Detector (as "worst-case" performs only synchronization operations, all segments are empty)

                 Segments with Memory Operations              Memory Accesses
Program        Created    Max. Stored      Compared       Number         Mem. Acc./s
cholesky        13,983    1,915 (13.7%)     968,154    121,316,077        168,168
fft                181       37 (20.5%)       2,347     26,463,490        363,509
LU               1,285       42 (3.3%)       18,891     56,068,996        388,020
radix              303       36 (11.9%)       4,601     60,138,828        328,987
ocean           14,150       47 (0.3%)      272,037     29,559,125        274,458
raytrace        97,598       62 (0.1%)      337,743     48,711,612         72,069
water-Nsq.         637       48 (7.5%)        7,717     80,262,966        249,652
water-spat.        639       45 (7.0%)        7,962     80,726,645        311,927
MLEM               179        7 (4.0%)        1,768     33,082,878        157,989
Average                       (7.6%)
worst-case           0        0 (0.0%)            0              0              0
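As a worked example (not from the article) of how the last column of Table II relates to the runtimes of Table I, the access rate is the number of instrumented memory accesses divided by the replay+detect runtime:

\[
\frac{26{,}463{,}490~\text{accesses}}{72.8~\text{s}} \approx 363{,}509~\text{acc./s (fft)},
\qquad
\frac{121{,}316{,}077~\text{accesses}}{721.4~\text{s}} \approx 168{,}168~\text{acc./s (cholesky)}.
\]

The fft rate is close to the roughly 350,000 instrumented memory operations per second quoted below, while cholesky stays well under it, as explained in the discussion.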
The average overhead for replay is 91%, which may seem high, but is acceptable during debugging. The automatic race detection is however very slow: it slows down the program execution about 36 times. Fortunately, it can run unsupervised, so it can run overnight. A similar overhead has also been observed with tools that use comparable techniques, such as Purify or Insight. If one forces RecPlay to test the complete execution of a program containing a data race, the total execution time is the same as if there were no data race. The cause of this huge overhead is explained in Table II, where the number of segments and memory accesses is shown. Apparently, for most of the test programs, the execution of the instrumented memory operations is the critical speed factor. For these benchmarks, this is about 350,000 memory operations per second. Cholesky does not reach this maximum, as it is slowed down by the huge number of comparisons, while raytrace is slowed down by the replay mechanism. With regard to the number of segments stored, we see that the matrix clock algorithm indeed succeeds in substantially reducing the number of segments that have to be stored. This is important to limit the memory consumption of the race detector. Table III shows why the overhead of recording an execution is limited to about 2%. This is because ROLT effectively succeeds in greatly reducing the number of synchronization operations that have to be stored in the trace file. Combined with the efficient compression algorithm, the disk bandwidth needed to store the trace never exceeds 5KB per second, which is very low on modern machines. On average, we need only a little more than three bits per traced synchronization operation of the original execution.

6. RELATED WORK
Although several people have worked on either record/replay methods or data race detection tools, we are not aware of any tool that combines them and offers an implementation for a general-purpose multiprocessor with a commodity Unix operating system.
Table III. Efficiency of the ROLT Mechanism

                 Sync. Op. Traced        Sync. Op. Stored      Trace          Bandwidth
Program         Number     Number/s          Number           Size (b)    Bytes/s   Bits/Op.
cholesky         13,857     1,560.5         472 (3.42%)         1,132       127.5      0.65
fft                 177        20.0          21 (11.86%)          162        18.3      7.32
LU                1,275       199.2          96 (7.53%)           312        48.8      1.96
radix               273        44.0          22 (8.06%)           160        25.8      4.69
ocean            22,981     4,541.7       3,148 (13.70%)        6,458     1,276.3      2.25
raytrace        150,960    14,814.5      20,084 (13.30%)       41,416     4,064.4      2.19
water-Nsq.          631        64.9         108 (17.12%)          336        34.6      4.26
water-spat.         625        75.0         106 (19.96%)          332        39.9      4.25
MLEM                948       117.8         551 (58.12%)        1,224       152.0     10.33
Average                     1,927.9             (17.00%)                    167.1      3.36
worst-case    1,600,915     199,865       7,872 (0.49%)        38,642     4,824.2      0.19
In the past, other replay mechanisms have been proposed for shared-memory computers. Instant Replay [LeBlanc and Mellor-Crummey 1987] is targeted at coarse-grained operations and traces all these operations. It does not use any technique to reduce the size of the trace files nor to limit the perturbation introduced. A prototype implementation for the BBN Butterfly is described. Netzer [1993] introduced an optimization technique based on vector clocks. It uses techniques comparable to ROLT to reduce the size of the trace files. However, no implementation was ever proposed, and its use of vector clocks during recording introduces more overhead—in time and in space—than the use of scalar clocks during recording. Although much theoretical work has been done in the field of data race detection [Adve et al. 1991; Audenaert and Levrouw 1995; Netzer and Miller 1991; Schonberg 1989], few implementations for general systems have been proposed. Tools proposed in the past had limited capabilities: they were targeted at programs using one semaphore [Lu et al. 1993], programs using only post/wait synchronization [Netzer and Miller 1990], or programs with nested fork-join parallelism [Audenaert and Levrouw 1995; Mellor-Crummey 1991]. The tool that comes closest to our data race detection mechanism, apart from Beranek [1992] for a proprietary system, is an on-the-fly data race detection mechanism for the CVM (Concurrent Virtual Machine) DSM system [Perkovic and Keleher 1996]. The tool is limited in that it only instruments the memory references to distributed shared data (about 1% of all references). The tool does not instrument library functions and is unable to perform reference identification: it will return the variable that was involved in a data race, but not the instructions that are responsible for the references. Race Frontier [Choi and Min 1991] describes a technique similar to the one proposed in this article (replaying up to the first data race). Choi and Min prove that it is possible to replay up to the first data race, and they describe how one can replay up to the race frontier. A problem they do not solve is how to efficiently find the race frontier. RecPlay effectively solves the problem of finding the race frontier, but goes beyond this: it also finds the data race event.
Fig. 6. Two possible executions of the same program: both happens-before-based tools and Eraser detect the race in (b) while only Eraser detects the race in (a).
Most of the previous work, and our RecPlay tool, is based on Lamport's so-called happens-before relation. This relation is a partial order on all synchronization events in a particular parallel execution. If two threads access the same variable using operations that are not ordered by the happens-before relation and one of them modifies the variable, a data race occurs. Therefore, by checking the ordering of all events and monitoring all memory accesses, data races can be detected for one particular program execution. Replay mechanisms based on the scheduling order of the different threads can be used for uniprocessor systems. Indeed, by imposing the same scheduling order during replay, an equivalent execution is constructed [Holloman 1989; Russinovich and Cogswell 1996]. This scheme can be extended to multiprocessor systems by also tracing the memory operations executed between two successive scheduling operations. Choi and Srinivasan [1998] describe such an implementation for Java. As a typical execution of a Java program has a small number of schedule operations (no time slicing is used, and therefore scheduling is only performed at predefined points such as monitorenter calls), they succeed in producing very small trace files, albeit at the cost of a large overhead (17–88%). Another approach is taken by a more recent race detector: Eraser [Savage et al. 1997]. It goes slightly beyond work based on the happens-before relation. Eraser checks that a locking discipline is used to access shared variables: for each variable it keeps a list of locks that were held while accessing the variable. Each time a variable is accessed, the list attached to the variable is intersected with the list of locks currently held, and the intersection is attached to the variable. If this list becomes empty, the locking discipline is violated, meaning that a data race occurred (this refinement is sketched in the code fragment at the end of this section). In a sense, Eraser does for the synchronization operations what Purify and Insight do for the memory allocation and memory accesses. By checking the locking
Table IV. The (Simplified) sema_post and sema_wait of Solaris

int sema_post(sema_t *sp) {          int sema_wait(sema_t *sp) {
    lock(sp->lock);                      lock(sp->lock);
    sp->count++;                         while (sp->count == 0) {
    unlock(sp->lock);                        unlock(sp->lock);
}                                            lock(sp->lock);
                                         }
                                         sp->count--;
                                         unlock(sp->lock);
                                     }
discipline, it can also detect races that are not apparent from a particular execution; e.g., Eraser will detect the data races in both executions (of the same program) shown in Figure 6, while tools based on the happens-before relation can only detect the data race in execution (b). Therefore, given a program with n possible inputs and m_n possible executions for input n, the latter tools require the checking of m_1 + m_2 + ··· + m_n executions, while Eraser requires the checking of only n executions. However, Eraser detects many false data races, as it is not based on the happens-before relation and has no timing information whatsoever. For instance, in theory there is no need to synchronize accesses to shared variables before multiple threads are created. The happens-before relation deals in a natural way with the fact that threads cannot execute code "before" they have been created, but Eraser needs special heuristics to support these kinds of unlocked accesses. The support for initialization makes Eraser dependent on the scheduling order and therefore requires the checking of all possible executions for each possible input. The most important problem with Eraser is, however, that its practical applicability is limited in that it can only process mutex synchronization operations and in that the tool fails when other synchronization primitives are built on top of these lock operations. For instance, Table IV shows (simplified) how the sema_post and sema_wait operations are built on top of the mutex operations in Solaris. Figure 7 shows a possible execution of the code fragment

thread1                       thread2
y = y+1;                      sema_wait(sp);
sema_post(sp);                y = y+1;
Although fragment a can be repeated any number of times, c will always be executed after b. Hence, there is never a data race on the variable y. However, Eraser will flag these accesses as part of a data race, while tools based on the happens-before relation have no problem with this kind of synchronization. These problems with Eraser make us believe that methods based on the happens-before relation (like RecPlay) are better. Contrary to Eraser, RecPlay can only detect feasible races that are apparent from a particular program run, but it is more general in that it knows how to deal with all common synchronization operations. Furthermore, it is more reliable because it never reports false data races.
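For comparison, a minimal sketch of the lockset refinement that the text attributes to Eraser; the representation of lock sets as bit masks and all names are illustrative assumptions, not Eraser's actual code.

/* Lock sets as bit masks (at most 64 locks).  C(v) starts as the set of all
   locks and is intersected with the locks held on every access to v; an empty
   C(v) signals a violation of the locking discipline. */
#include <stdint.h>
#include <stdio.h>

#define ALL_LOCKS UINT64_MAX        /* initial value of C(v) */

typedef struct {
    uint64_t candidate_locks;       /* C(v): locks that protected v so far */
} shadow_word_t;

/* Called on every access to the variable described by v, with locks_held
   the set of locks currently held by the accessing thread. */
static void lockset_check(shadow_word_t *v, uint64_t locks_held, const char *name)
{
    v->candidate_locks &= locks_held;                /* refine C(v) */
    if (v->candidate_locks == 0)                     /* no common lock left */
        printf("locking discipline violated for %s\n", name);
}

Because this check ignores ordering, the sema_post/sema_wait pattern of Figure 7 empties C(y) and is flagged, which is exactly the false positive discussed above.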
Fig. 7. This kind of synchronization is not detected by Eraser, and as such a data race on y is incorrectly flagged.
7. FUTURE WORK
The RecPlay system is now fully functional for Solaris. The ROLT method can also be applied to distributed applications [Ronsse and Zwaenepoel 1997]. For parallel programs, we also plan to look at less common synchronization primitives such as barriers, monitors, rendezvous, etc. The efficiency of the race detector can be improved by no longer tracing memory accesses to local data. This will however require a serious program analysis to determine which memory accesses are strictly local. As an alternative, by giving the user the possibility to manually restrict the memory access tracing to particular functions, or to a particular block of memory, we can also expect a significant improvement in speed, at the cost of losing the guarantee that all data races will be found and at the risk that the replay might fail. From the engineering point of view, we would like to port the Solaris implementation to other operating systems, such as Windows NT. Contents-based replay of nondeterministic system calls could be added to correctly replay random generators, timing routines, etc. Finally, especially for systems programming, it would be nice to be able to record and replay traps and interrupts.

8. CONCLUSIONS
In this article, we have presented RecPlay, a practical and effective tool for debugging parallel programs with classical debuggers. To this end, we implemented a highly efficient two-level record/replay system that traces the synchronization operations and uses this trace to replay the execution. During replay, a race detection algorithm is run to notify the programmer
when a race occurs. For debugging, any sequential debugging tool can be used on the parallel program. Our tools work on running processes and are therefore completely independent of any compiler or programming language. They do not require recompilation or relinking. We believe that RecPlay is currently the most advanced (cyclic debugging is enabled, and data races are detected) and practical (the instrumentation is added without recompilation or relinking) tool available for debugging parallel programs. A β-version of the system is available for evaluation at URL http://sunmp.elis.rug.ac.be/recplay/.
ACKNOWLEDGMENTS
The authors would like to thank Jan Van Campenhout for his support and advice. They are also indebted to Koen Audenaert and Mark Christiaens for the many discussions on the implementation and for proofreading, and to the anonymous TOCS reviewers for their valuable comments.

REFERENCES
ADVE, S. V., HILL, M. D., MILLER, B. P., AND NETZER, R. H. B. 1991. Detecting data races on weak memory systems. SIGARCH Comput. Arch. News 19, 3 (May 1991), 234–243.
AUDENAERT, K. AND LEVROUW, L. 1994. Interrupt replay: A debugging method for parallel programs with interrupts. Microprocess. Microsyst. 18, 10, 601–612.
AUDENAERT, K. AND LEVROUW, L. 1995. Space efficient data race detection for parallel programs with series-parallel task graphs. In Proceedings of the 3rd Euromicro Workshop on Parallel and Distributed Processing (San Remo, Italy, Jan.). IEEE Computer Society Press, Los Alamitos, CA, 508–515.
BERANEK, A. 1992. Data race detection based on execution replay for parallel applications. In Proceedings of the Conference on Parallel Processing (CONPAR '92, Lyon, France, Sept.). 109–114.
CAVALHEIRO, G. AND DOREILLE, M. 1996. Athapascan: A C++ library for parallel programming. In Stratagem '96 (Sophia Antipolis, France, June). INRIA, Rennes, France.
CHOI, J.-D. AND MIN, S. L. 1991. Race frontier: Reproducing data races in parallel-program debugging. In Proceedings of the 3rd ACM Symposium on Principles and Practice of Parallel Programming (SIGPLAN '91, July). ACM, New York, NY, 145–154.
CHOI, J.-D. AND SRINIVASAN, H. 1998. Deterministic replay of Java multithreaded applications. In Proceedings of the 2nd ACM Symposium on Parallel and Distributed Tools (SIGMETRICS '98, Welches, OR, Aug.). ACM, New York, NY, 48–59.
DE BOSSCHERE, K. AND RONSSE, M. 1997. Clock snooping and its application in on-the-fly data race detection. In Proceedings of the 1997 IEEE International Symposium on Parallel Algorithms and Networks (I-SPAN '97, Taipei, Dec.). IEEE Computer Society Press, Los Alamitos, CA, 324–330.
FIDGE, C. 1991. Logical time in distributed computing systems. IEEE Comput. 24, 8 (Aug. 1991), 28–33.
HOLLOMAN, E. D. 1989. Design and implementation of a replay debugger for parallel programs on Unix-based systems. Master's Thesis. Computer Science Department, NC State, Raleigh, NC.
LEBLANC, T. J. AND MELLOR-CRUMMEY, J. M. 1987. Debugging parallel programs with instant replay. IEEE Trans. Comput. C-36, 4 (Apr. 1987), 471–482.
LEU, E., SCHIPER, A., AND ZRAMDINI, A. 1991. Efficient execution replay technique for distributed memory architectures. In Proceedings of the 2nd European Conference on Distributed Memory Computing (EDMCC2, Munich, Germany, Apr. 22–24, 1991), A. Bode, Ed. Lecture Notes in Computer Science, vol. 487. Springer-Verlag, New York, NY, 315–324.
LEVROUW, L. J., AUDENAERT, K. M., AND VAN CAMPENHOUT, J. M. 1994a. Execution replay with compact logs for shared-memory programs. In Proceedings of Applications in Parallel and Distributed Computing, IFIP Transactions A-44: Computer Science and Technology. Elsevier North-Holland, Inc., Amsterdam, The Netherlands, 125–134.
LEVROUW, L. J., AUDENAERT, K. M., AND VAN CAMPENHOUT, J. M. 1994b. A new trace and replay system for shared memory programs based on Lamport clocks. In Proceedings of the 2nd Euromicro Workshop on Parallel and Distributed Processing (Jan.). IEEE Computer Society Press, Los Alamitos, CA, 471–478.
LU, H., KLEIN, P., AND NETZER, R. 1993. Detecting race conditions in parallel programs that use one semaphore. Tech. Rep., Brown University, Providence, RI.
MATTERN, F. 1989. Virtual time and global states of distributed systems. In Proceedings of the International Workshop on Parallel and Distributed Algorithms (Gers, France, Oct. 3–6), M. Cosnard, Y. Robert, P. Quinton, and M. Raynal, Eds. North-Holland Publishing Co., Amsterdam, The Netherlands, 215–226.
MELLOR-CRUMMEY, J. 1991. On-the-fly detection of data races for programs with nested fork-join parallelism. In Proceedings of the 1991 Conference on Supercomputing (Albuquerque, New Mexico, Nov. 18–22, 1991), J. L. Martin, Ed. ACM Press, New York, NY, 24–33.
NETZER, R. H. B. 1993. Optimal tracing and replay for debugging shared-memory parallel programs. SIGPLAN Not. 28, 12 (Dec. 1993), 1–11.
NETZER, R. H. B. AND MILLER, B. P. 1990. On the complexity of event ordering for shared-memory parallel program executions. In Proceedings of the International Conference on Parallel Processing (Aug.). 93–97.
NETZER, R. AND MILLER, B. 1991. Improving the accuracy of data race detection. In Proceedings of the 1991 Conference on Principles and Practice of Parallel Programming (Apr.).
NETZER, R. H. B. AND MILLER, B. P. 1992. What are race conditions? Some issues and formalizations. ACM Lett. Program. Lang. Syst. 1, 1 (Mar. 1992), 74–88.
PERKOVIC, D. AND KELEHER, P. J. 1996. Online data-race detection via coherency guarantees. ACM SIGOPS Oper. Syst. Rev. 30, Winter, 47–57.
RAYNAL, M. AND SINGHAL, M. 1996. Logical clocks: Capturing causality in distributed systems. IEEE Computer, 49–56.
RONSSE, M. AND ZWAENEPOEL, W. 1997. Execution replay for TreadMarks. In Proceedings of the 5th Euromicro Workshop on Parallel and Distributed Processing. 343–350.
RONSSE, M., LEVROUW, L., AND BASTIAENS, K. 1995. Efficient coding of execution-traces of parallel programs. In Proceedings of the ProRISC/IEEE Benelux Workshop on Circuits, Systems and Signal Processing (Mar.), J. P. Veen, Ed. 251–258.
RUSSINOVICH, M. AND COGSWELL, B. 1996. Replay for concurrent non-deterministic shared-memory applications. SIGPLAN Not. 31, 5, 258–266.
SARIN, S. K. AND LYNCH, N. A. 1987. Discarding obsolete information in a replicated database system. IEEE Trans. Softw. Eng. 13, 1 (Jan. 1987), 39–47.
SAVAGE, S., BURROWS, M., NELSON, G., SOBALVARRO, P., AND ANDERSON, T. 1997. Eraser: A dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst. 15, 4, 391–411.
SCHONBERG, D. 1989. On-the-fly detection of access anomalies. SIGPLAN Not. 24, 7 (July 1989), 285–297.
TAMCHES, A. AND MILLER, B. P. 1999. Fine-grained dynamic instrumentation of commodity operating system kernels. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (New Orleans, LA, Feb.). 117–130.
WOO, S. C., OHARA, M., TORRIE, E., SINGH, J. P., AND GUPTA, A. 1995. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95, Santa Margherita Ligure, Italy, June 22–24), D. A. Patterson, Ed. ACM Press, New York, NY, 24–36.
WUU, G. AND BERNSTEIN, A. 1984. Efficient solutions to the replicated log and dictionary problems. In Proceedings of the 3rd ACM Symposium on Principles of Distributed Computing (New York, NY). ACM Press, New York, NY, 233–242.

Received April 1998; revised March 1999; accepted June 1999