Techniques for Cache and Memory Simulation ... - Semantic Scholar

3 downloads 0 Views 220KB Size Report
Duke University. Durham, NC 27706 internet: holliday@cs.duke.edu. December ...... 10] M. K. Vernon, E. D. Lazowska, and J. Zahorjan, \An accurate and e cient ...
Techniques for Cache and Memory Simulation Using Address Reference Traces  Mark A. Holliday Department of Computer Science Duke University Durham, NC 27706 internet: [email protected] December 1990 

This work was supported in part by the National Science Foundation (Grants CCR-8721781 and CCR-8821809).

1

Abstract

Simulation using address reference traces is one of the primary methods for the performance evaluation of the memory hierarchy of computer systems. In this paper we survey the techniques used in such a simulation. In both the uniprocessor and shared-memory multiprocessor cases, the issues can be divided into trace collection, trace storage, and trace usage. Trace collection can employ several hardware or software methods. Common concerns are that the collection method capture all of the address references of interest, that the execution overhead of the collection method is not excessive, and that the trace is of adequate length. The increasing size of caches heightens the adequate length concern. Trace storage is of concern because of the large size of traces. Techniques for trace compression and trace reduction have been developed. Trace usage is of concern because of the length of a simulation. Under some circumstances it is possible to evaluate multiple cache sizes in a single pass of the trace. For multiprocessor traces it is also possible to simulate the trace in parallel to achieve speedup. In the multiprocessor case, the global trace problem arises because environment-dependent address changes prevent the adjustment of traces collected in one environment from re ecting a di erent environment. A relatively new technique, inline simulation, attempts to avoid a number of the problems associated with traditional trace-driven simulation.

Index Terms address reference traces, trace-driven simulation, survey, inclusion property, trace reduction, one-pass simulation, parallel traces, global trace problem, inline simulation.

2

1 Introduction The design of the memory subsystem has a major e ect on the performance of a computer system. Consequently, evaluation of the e ect of alternative memory design parameters under expected workloads is of substantial interest. Three main approaches to such an evaluation exist. The rst, instruction-level simulation, requires a simulator that is detailed enough to execute the instructions in workload programs. The second is analytical models and the third is trace-driven simulation. The rst and second approaches form two extremes. An instruction-level simulator can very accurately predict the performance of a memory design for a given workload. The disadvantage, however, is the length of time required to develop such a simulator. An analytical model, in contrast, need only be implemented once and can be evaluated almost instantaneously. The disadvantage is that such a model can only characterize the workload with a limited set of parameters and those parameters might not capture all of the system interactions. Trace-driven simulation compromises between these two extremes. In trace-driven simulation a trace of the memory address references generated by a program execution is collected and stored. That trace is then input to a simulator of the memory subsystem or a portion of the memory subsystem. The memory subsystem simulator is much simpler to construct than an instructionlevel simulator since it does not require implementation of the instruction-set architecture [1]. On the other hand, the relevant interactions between the design parameters are captured and the workload is characterized realistically. As a result most research papers studying memory hierarchy performance have used trace-driven simulation [1]. However, as noted by Smith [2, 1] trace-driven simulation has drawbacks: (1) it is dicult to include in the trace operating system references, the e ects of input/output, and the e ects of context switches and interrupts, and (2) even traces normally considered to be long (say a million to three million references) capture only a small slice of time and that slice need not be from the most representative portion of the workload. Because of the importance of trace-driven simulation for the evaluation of caches and memories, this paper surveys the current techniques. There are important related subjects which we do not attempt to cover; namely, the vast literature on the performance results found using tracedriven simulation and the analytical models mentioned above. The book by Przybylski [3] and the bibliography by Smith [4] are good starting points for the former. An important class of the latter models predicts cache miss ratios in either the transient [5, 6] or steady-state cases [7, 8]. Another class applies Mean Value Analysis techniques [9] to study the overhead due to cache coherence protocols in di erent multiprocessor architectures [10, 11, 12, 13].

3

Main Memory

Cache

Block 0 Block 1 Block 2

Frame 0 Frame k

Frame 1 Frame k+1

Frame k-1 Frame 2k-1

Set 0 Set 1

Frame (n-1)k

Frame (n-1)k+1

Frame nk-1

Set n-1

Block n-1 Block n

Block mn-1

Figure 1: Structure of a set-associative cache.

1.1 Cache Terminology The early work describing the techniques surveyed here [14] was couched in the terminology of main memory with pages and fault times. More recently, however, the work has been overwhelmingly discussed using the terminology of caches. Consequently, we will use cache terminology which we now introduce [15]. The techniques remain, however, also relevant for main memory (and also le systems and disk caches). The three basic parameters of a uniprocessor processor cache are block (also called line) size, set associativity, and number of sets. A block is the storage unit within the cache. The higher-order bits of a memory address identify the block containing that address. The cache can be viewed as consisting of some number of frames with each frame being the same size which is the size of one block. The frames of a cache are partitioned into sets. A particular block can be placed in only one set, but it can be placed in any frame of that set. As shown in Figure 1, the set associativity of a cache is the number of frames in one set. If the number of frames in one set is k, then the set associativity is called k-way set associativity. One-way set associativity is also called direct-mapped. N-way set associativity where N is the number of frames in the cache (and thus there is only one set in the cache) is also called fully-associative. The set mapping function determines in which set a block belongs. The most common set mapping function is the block number modulo the number of sets. This function is called bit selection since the set number is the number given by the low-order bits of the block address. Bit selection implies that the number of sets is a power of two. A number of other features are involved in characterizing a cache con guration. These features include the replacement algorithm, the presence of prefetching, write-through versus write-back (on a write to a block in the cache, 4

write-through writes the block to main memory, while write-back writes the block to main memory only when the block is replaced in the cache), multiple levels in the cache, and subblocks. We will comment on the basic three parameters (block size, set associativity, and number of sets), the replacement algorithm, multi-level caches, and write backs though discussions exist about the implications of the others on simulation techniques [16]. Other levels of the memory hierarchy raise additional issues (such as the e ect of le deletions on disk caches [16]), that will not be pursued here. Caches in multiprocessors raise the issue of cache coherence. If multiple caches contain copies of the same block, steps must be taken to insure that the di erent copies are kept consistent. Two main classes of cache coherence protocols are write-invalidate and write-update. In a writeinvalidate protocol, if multiple copies of a block exist and one processor writes to its copy, the other copies are invalidated. In a write-update protocol, if multiple copies of a block exist and one processor writes to its copy, the other copies are updated.

1.2 Outline Our survey divides into three main sections: the uniprocessor case, the multiprocessor case, and inline simulation. In the next section we discuss the uniprocessor case and in particular, the issues related to collection, storage, and use of traces. The multiprocessor case appears in section three. New issues arise with respect to collection, storage, and use. The multiprocessor case also introduces what we refer to as the global trace problem; that is, feedback from environment changes can change the address contents of the trace. Section four addresses inline simulation. Inline simulation can be used as a method of trace collection or as an alternative to trace-driven simulation in that a separate trace never appears. We conclude in section ve.

2 Uniprocessor Uniprocessor trace-driven simulation is the most traditional form of trace-driven simulation. Issues arise at each of the three stages: trace collection, storage, and use.

2.1 Collection The most common collection approach traces the execution of a real application program and collects the entire trace. The other approaches, trace sampling and synthetic traces, we discuss at the end of this subsection. The execution of the application program can be done on an instruction-level simulator or on an actual computer system. In the case of an instruction-level simulator (such as MILS, the Mips Instruction Level Simulator [17]) the simulator interpretively executes the program. The trace of 5

addresses generated is saved and used later in a trace-driven simulator. In the case of measurements of the execution on an actual computer system, the methods can be divided into hardware monitors and software monitors [18]. Hardware monitors provide the advantage of being able to collect all addresses on the interface being monitored, so addresses from multiple processes, operating system code, and input/output can usually be saved. The time overhead tends to be less than that with software monitors. On the other hand, it might not be possible to attach a hardware monitor to the desired interface (for example, probes on a system bus would not detect the address references that hit in a cache on the single chip of a microprocessor). Also, a hardware monitor may not be available or the information needed to use it properly may be proprietary. An example using hardware monitors appears in a 1986 study [19, 20] of the translation lookaside bu er and cache of a computer based on the National Semiconductor 32016 microprocessor. The monitor consisted of a logic analyzer, a personal computer, and a communications link to a minicomputer. The logic analyzer was connected to the system bus and captured all bus activity. The software executing on the personal computer initiated analyzer acquisition of a 4096 32-bit sample of bus activity. When the analyzer completed acquisition a small amount of extra logic on the CPU board caused the processor and real-time clock to be in state HOLD until the trace section was transferred to the minicomputer. A new acquisition would be initiated at the end of the transfer. Time overhead in hardware monitors that collect address traces mainly occurs because the bu er holding the addresses lls rapidly requiring the system to be halted until the bu er can be emptied. Closely related to hardware monitors that collect address traces are hardware monitors that collect traces of other types of events or that count the number of occurrences of certain events (such as cache misses) [21, 22]. An interesting middle ground between hardware monitors and software monitors is microcode modi cation such as in ATUM [23]. Its advantages include that address traces of all references (operating system as well as all application processes) can be collected [24], but without specialpurpose hardware. Also, the time overhead compares to that required by hardware monitors. The disadvantages include that not all processors have microcode or microcode that can be modi ed and, for even those that do, such modi cations may require proprietary information. The basic idea in address trace software monitors is to generate a trap at the start of each instruction. The trap reaches a user-level process which saves the address of the instruction and determines the associated data references by interpretive execution of the instruction. This requires hardware support for trapping before each instruction, which is commonly provided to aid debugging. For example, the T-bit facility in the VAX1 processor architecture [25] supports this and can be accessed using the ptrace system call of the UNIX2 operating system. The ease of 1 2

VAX is a trademark of Digital Equipment Corporation. UNIX is a trademark of Bell Laboratories.

6

instruction-trap based trace collection has made it the most common method of collecting uniprocessor traces. There are, however, disadvantages. It has the largest time overhead of any of the collection methods. A slowdown of over 100,000 has been reported [26]. It also only captures the references of a single application process. Kernel references and the e ects of multiprogramming escape capture. Failure to include such references can cause substantial overestimation of cache hit ratios [21]. In trace collection, in addition to the issue of how the trace is collected, we must consider the issue of how long of a trace to collect. Stone [27] presents an interesting analysis concerning the proper length of a trace. Assuming a miss ratio of one percent, 16-byte blocks, and four-way set associativity, to obtain only four misses per set, a trace of 1.6 million references is needed for a 128 kilobyte cache, and a trace of 102 million references is needed for a 2 megabyte cache after initialization. Added to this length is the trace length to discard because it re ects the initialization transient. Some researchers \warm" the cache by discarding the rst xed number of references. How much of the transient they remove by this depends on the cache structure. Stone [27] recommends that a simulation only be started on each distinct set when the set is fully initialized; that is all of its initial contents have been purged (Laha, Patel, and Iyer call this priming the sets [28]). The issue of an adequate trace length has achieved increasing importance since the cache and main memory sizes of interest have increased dramatically. For example, second-level caches on the order of 4 to 16 megabytes are of current interest [29]. Complete traces must be extremely long in order to simulate the behavior of such caches. A trace which is too short can give misleading results. For example, in a study by Borg, Kessler, and Wall [29] the average cycles per instruction (CPI) over the last 10 million instructions is computed for the execution of 1.5 billion instructions. The average CPI varied from 1.7 to 6.8, so a typical trace of 10 million instructions could have resulted in an estimate anywhere in this range. An alternative to complete address trace collection that has become increasingly attractive is trace sampling [28, 30]. Laha, Patel, and Iyer [28] suggest an approach assuming the cache is too small for data to be retained across context switches. In this approach the sample size chosen is of the same order as the typical interval between context switches. Samples are collected starting just after a context switch. Over a wide range of LISP programs 35 samples always accurately predicted the mean miss ratio and the miss ratio distribution. The authors modify this approach for the case in which the cache is large enough for a signi cant amount of data to be retained across context switches. More recently [30] it has been argued that the connection with context switches is not essential. Instead, taking 30 to 40 samples at random points in the trace with the sample length being a function of the cache size suces. A nal alternative uses synthetic traces generated from a program behavior model (such as 7

the Least Recently Used (LRU) stack model [31]), instead of from the execution of an application program. Some [1] have argued that program behavior models do not adequately capture the behavior of real programs. Others [32] have argued that the ability of synthetic traces to capture a wide range of possible behaviors (such as di erent degrees of locality in the LRU stack model) still makes them a useful workload. A large literature exists on program behavior models [8, 33, 34, 35].

2.2 Inclusion and Related Concepts Many of the techniques for trace storage and use are based on a set of results initiated by the work of Mattson, et al. [14]. In this subsection we introduce the basic concepts involved. A replacement algorithm is said to have the inclusion property if after any series of references, the contents of any size cache is a superset of any smaller size cache. Surprisingly, inclusion does not hold for all replacement algorithms. FIFO (First In, First Out), for example, does not have the inclusion property. Consider the series of block references 1, 2, 3, 1, and 4. At the end of this sequence, a two-block FIFO cache contains the blocks 1 and 4. A three-block FIFO cache, at the end, contains the blocks 2, 3, and 4. Mattson showed that, assuming no prefetching, a xed block size, and the same set mapping function, inclusion holds between caches of di erent sizes for a class of replacement algorithms called stack algorithms. These replacement algorithms include LRU, Random, and Optimum. Having the same set mapping function implies having the same number of sets. Consequently, since the block size is xed, a larger cache can only be larger due to the existence of more blocks per set. Since the blocks of a set can only be replaced by one another, each set of a set-associative cache operates as an independent fully-associative cache. In the fully-associative cache that represents a set, Mattson showed that each stack algorithm has a \priority function" which, independent of cache size, imposes a total ordering on all the blocks at a given time. This ordering can be viewed as a stack of the blocks. For LRU this ordering is based on the time of last reference. Hill [36, 37] addressed the question of when inclusion holds between two caches when di erent set mapping functions are used assuming the LRU replacement algorithm. This is often important. For example, if we want to compare two direct-mapped caches of di erent sizes, for xed block size, the number of sets must di er. Consequently, they have di erent set mapping functions. Hill de nes set re nement and inclusion. Let A = n denote that a cache has n-way associativity. Let F = f denote that a cache has set mapping function f . \A set-mapping function f2 re nes set-mapping function f1 if f2 (x) = f2 (y ) implies f1 (x) = f1 (y ), for all blocks x and y." \Cache C2(A = n2; F = f2) is said to include an alternative cache C1(A = n1; F = f1) if for any block x after any series of references, x is resident in C1 implies x is resident in C2." Hill shows that \cache C2(A = n2 ; F = f2 ) includes cache C1(A = n1 ; F = f1 ) if and only if f2 (x) = f2 (y ) implies f1 (x) = f1 (y ) (set re nement) and n2  n1 (non-decreasing associativity)". Thus, we have 8

a characterization of inclusion in terms of set re nement. This result shows that some pairs of caches have the inclusion property even though they do not have the same set mapping function (a condition for the Mattson result). Hill also shows that inclusion is a partial order over a set of caches. Consequently, if inclusion holds between each pair of caches in a series, by the transitivity of a partial order, it also holds between all caches in the series. Finally, Hill presents a characterization of when set re nement holds for a special class of setmapping functions. Let h(x) be a hash function whose image is the set of all block numbers, \rem" be the remainder operator, and s be the number of sets in the cache. Hill shows that \set re nement holds for set-mapping functions of the form h(x) rem s if and only if s1 divides s2 ." This class of set mapping functions is signi cant because it contains the important practical case of bit selection. i

2.3 Storage Once the trace is collected, it has to be stored until it is used. This can present a problem because of the large storage demands. There are three approaches. In the rst, as implemented in abstract execution [38], the original trace is not a full trace, but only contains a subset of the addresses which upon trace use can be combined with static analysis to generate a complete trace. This approach, and abstract execution in particular, are discussed in the inline simulation section. In the second, trace compression decreases the size of the trace in storage, but returns it to the original size upon use. In the third, trace reduction decreases the size of the trace for both storage and use. Trace compression is exact in that after uncompression the trace returns to the original trace. A standard sequential data compression algorithm such as the Lempel-Ziv scheme [39] used in the UNIX compress program is an option. An attractive alternative is Mache [40]. In Mache each address has a label (for example, instruction fetch, data read, or data write) and a cache of the last address from the trace for each label type is kept. For each (label, address) pair, the di erence between this address and the last address for this label is computed. If the di erence is below some threshold, then (label, di erence) is emitted, otherwise (label, address) is emitted. The emitted di erence or address is then passed to a Lempel-Ziv compression algorithm. The result is substantially smaller then merely applying Lempel-Ziv to the original trace. The key is that the cache di erencing exposes patterns in the address references that are not available in the original trace. Trace reduction (which reduces the trace size for both storage and use) can be exact or approximate. For exact trace reduction techniques in all the cases we assume LRU replacement and initially assume that all simulations that will use the trace will have the same block size. Only the number of misses is of interest. The most general formulation for this case appears in Hill [37]. Construct a direct-mapped cache C0 such that all caches of interest C re ne C0. Delete all referi

9

ences that hit in C0 and record them as hits in all caches. The misses in C0 form the reduced trace. For all the C the number of misses observed when simulating with the full trace equals the number of misses observed when simulating with the reduced trace. Earlier restrictions of this result are seen in Smith's trace deletion technique and Puzak's cache lter [41]. Puzak's cache lter restricts C0 to a direct-mapped cache. As noted by Wang and Baer for write-back caches write-back trac as well as the miss trac is an important factor in cache-induced trac. They extend Puzak's cache lter to create a reduced trace that has exactly the same number of misses and write-backs as the original trace. They accomplish this by keeping references that are rst time writes as well as references that cause misses. Wang and Baer [42] also consider reduced traces that allow multiple block sizes. They produced universal reduced traces by collecting the superset of misses that occur in at least one cache lter for a number of di erent block sizes. Since most misses to a cache lter with one block size tend to be misses in another lter with a di erent block size, the increase in size over single block size reduced traces is modest. Approximate trace reduction techniques can be used as an alternative or as a complement to the exact methods (when used as a complement the approximate method is applied to the exact reduced trace). Approximate techniques are generally forms of sampling. Sampling methods used in trace collection [28, 43] can also be used to reduce a preexisting trace. Puzak [41] suggested an alternative based on the fact that references across sets tend to be highly correlated [44]: select only a small number of the sets. Experience with reasonable data indicates that retaining only 10 percent of the sets is adequate. Blocking [45] serves as another alternative by taking the misses from Puzak's cache lter and passing them through a block lter. The block lter considers each window of w consecutive references and sends out a reference from each spatial locality contained within the window. i

2.4 Use From a technique viewpoint, the most noteworthy aspect of the use of uniprocessor traces is onepass simulation. One-pass simulation means that in a single pass through the trace the number of misses and number of write backs (for write-back caches) can be computed for more than one cache con guration. The discussion can be couched in terms of a pair of caches since the transitivity of inclusion (since inclusion is a partial order) extends this case to the case of an arbitrary number of caches. Mattson and his colleagues did the original work in this area. Their algorithm, stack simulation, determines the hit ratios of all caches sizes as long as the replacement algorithm is a stack algorithm (thus, the inclusion property holds) and the number of sets is xed. A stack simulation of caches C (A = k; F = f ) for k = 1 to n uses a stack of n levels for each set and an array of n distance counters [36]. The blocks are placed in the stack in order of descending priority. On each reference 10

Reference :6 Time t-1

Reference :2 Time t-1 Time t

Time t

5

6

5

2

3

5

3

5

6

3

6

3

1

1

6

Stack Hit

Stack Miss

Figure 2: The two cases of stack simulation assuming LRU replacement.

x, stack simulation performs three steps: nd, metric, and update. The nd step locates the block in the stack at distance k, if found, or 1 if not found. The metric step increments the k-th distance counter and N where N is the number of references. The update step updates the stack to re ect the contents of all levels after the reference to x. When a block is found, the stack level containing

the block is called its stack distance. This stack level is the minimum cache size for which the block is resident. By inclusion the reference is a hit for this cache size and all larger cache sizes. The most common case of stack simulation occurs when the replacement algorithm is LRU. As shown by Figure 2, with LRU replacement, the stack contains the blocks in order of last reference. Stack simulation requires not only that inclusion holds between the pair of caches but also that both caches have the same number of sets (note that there is a stack per set) and the same block size. Thus, set associativity is the only variable among the three basic parameters. As mentioned earlier, Hill showed that inclusion holds in some cases between pairs of caches that do not have the same number of sets. In particular, inclusion holds between pairs of direct-mapped caches with the same block size but di erent numbers of total blocks. He [36, 37] developed a one-pass simulation technique called forest simulation for this direct-mapped case under the assumption of LRU replacement. Between many pairs of set-associative caches inclusion does not hold, however a technique called all-associativity simulation has been developed that allows rapid simulation of alternative caches assuming the same block size, no prefetching, and LRU replacement [14, 46, 47, 36] even without inclusion. For write-back caches the number of write-backs (the number of times a dirty block is replaced) as well as the number of misses is important. Counting write-backs is nontrivial within stack 11

simulation. Initially consider a fully-associative cache. When a block is chosen for replacement, it is written back only if it is dirty. However, whether it is dirty can depend on the size of the cache. For smaller caches the block may have been replaced earlier and then refetched causing the current resident copy to be clean. For larger blocks the block was not replaced earlier and is still dirty. Thompson and Smith [16] showed that stack simulation can be extended quite simply to account for the number of write backs for each cache size assuming an arbitrary stack replacement algorithm. They attach a dirty level, dl, to each block in the stack. A block is dirty for caches larger than or equal to dl blocks and is clean for caches smaller than dl blocks. In the case of LRU replacement Mattson et al. [14] extended basic stack simulation which has a stack per set to a simulation where a single stack is used for all sets. Wang and Baer [42] adapted the Thompson and Smith technique for counting write backs to use a single stack for all sets in the case of LRU replacement. The single dirty level per block has to be replaced with a vector of dirty levels where each vector element corresponds to a particular number of sets in the cache.

3 Multiprocessor Simulation using address reference traces is also important in shared memory multiprocessors. Some of the issues concerning uniprocessor trace collection, storage, and use continue to occur in the multiprocessor case, but become worse because the volume of trace information is proportional to the number of processors. New issues also arise. The presence of more than one processor introduces new architectural issues, such as cache coherence protocols, that have encouraged specialized tracedriven simulation techniques. Multiple processors also introduce a fundamental problem, called the global trace problem, that brings into question the decoupling of trace from environment changes that is basic to trace-driven simulation.

3.1 Collection As with uniprocessors, the dominant trace collection approach measures of the execution of the workload on an actual computer system. The range of uniprocessor measurement techniques also occurs here. Hardware probes can be attached to accessible interfaces (commonly a system bus) and addresses on the bus can be saved in a bu er. The bu er quickly lls and so execution must be interrupted periodically to allow the bu er contents to be saved on secondary storage. Mink et al. [48] discuss the use of hardware probes in multiprocessors for event trace collection (with addresses as a type of event). The ATUM-approach of microcode modi cation has been extended to multiprocessors [49]. Again this approach is limited because many processors are not microprogrammed and for even those that are, modi cation may require proprietary information. The widely-used uniprocessor approach of trapping at the start of each instruction has also been 12

extended to the multiprocessor environment [26]. The trap approach, though easiest to implement, is also the slowest to generate the trace. Lacy found that for a twelve processor system the average tracing rate was only 300,000 instructions per process per day.

3.2 Storage Many of the uniprocessor trace compression and approximate trace reduction techniques can be directly applied in the multiprocessor case. Extending exact trace reduction to the multiprocessor case, on the other hand, is nontrivial. Work has been done in this direction by Wang and Baer [42]. They assume that each processor's private cache is of the same size, that the relative order of reference streams from each processor remain the same across di erent simulations, and that an invalidation protocol is used (though they claim a larger class of coherence protocols can be allowed). They reduce the original full trace by simulating small caches (acting as cache lters) under the chosen cache coherence protocol and record only references that are misses or writes on clean blocks. They prove that the resulting trace produces the same number of misses, write-backs, and invalidations as the original trace assuming the same coherence protocol and block size are used. Multiple-level caches are of increasing importance in both uniprocessors and multiprocessors [50, 51, 52]. From the technique perspective the most interesting point is the multi-level inclusion (MLI) property. The MLI property occurs when the blocks within a parent cache are a superset of all the blocks within all of its children caches. The cache coherence protocol can be simpli ed when the MLI property is maintained. Note that this is a di erent use of the term inclusion property than de ned in subsection 2.2. Baer and Wang [53] give necessary and sucient conditions on when the MLI property holds for both fully associative and set associative caches. As noted by Baer and Wang [53], knowing that the MLI property is being maintained can be used to reduce the trace input to a parent cache (since any reference to a block in a child cache must be a hit).

3.3 Use The lengthy time required to run trace-driven simulations of multiprocessor cache coherence protocols has encouraged the development of special methods to speed up the simulations. Wang and Baer [42] have discussed the extension of their one-pass simulation technique to the case of multiprocessors with set-associative, write-back caches. The extensions are designed to handle shared reads and shared writes. A shared read for both write-invalidate and write-update protocols can be handled by an adjustment of the requested block's dirty level. For write-update protocols, this is also true for shared writes. For write-invalidate protocols, however, a shared write causes a deletion that leaves a hole in the stack. Keeping track of these holes requires the introduction of one or more markers. An alternative to doing a sequential simulation of the multiprocessor trace is to do the simulation 13

in parallel. Wang and Baer note that the same set in all the caches can be simulated independently from other sets. Their reduced multiprocessor trace can be split into N traces where N is the number of sets in each cache. Each processor can simulate the trace for a given set. Lin, Baer, and Lazowska [54, 55] noted an alternative means of exploiting parallelism. Each processor is assigned to simulate all the sets of a single cache. The multiprocessor trace is split into the input trace for each cache. If each input trace contains no shared references, then each processor can simulate its cache independently with linear speedup. Shared references, however, require concurrency control as shown by this example \Suppose two consecutive references e1 and e2 are private to cache j with time-stamps t1 and t2 , t1 < t2. Once cache j has been simulated up to event e1 , event e2 cannot be simulated unless it is known that no shared reference originating at some other cache occurs during the time period (t1 ; t2) which a ects the state of cache j ." [55]. Lin, Baer, and Lazowska note that if we assume that the relative ordering of shared references by di erent processors is independent of the system's con guration, then simulation need not be serialized. They argue that this assumption is reasonable when any system con guration changes are done in a homogeneous manner (such as, all cache sizes increasing by the same amount). All shared references in each trace are inserted into the other traces according to the xed relative ordering. This transformation of conditional events into unconditional events is similar to an approach for general distributed simulations proposed by Chandy and Misra [56]. The inserted references along with the shared references by this processor are called interaction points. Consequently, each processor during its simulation can execute independently until it reaches the next interaction point. That interaction point forms a barrier at which all of the processors check each other's cache status and take the appropriate action for the given cache coherence protocol. Lin, Baer, and Lazowska [54] refer to this algorithm as their basic simulation algorithm. They then noted [55] that the semantics of speci c cache coherence protocols can be used to tailor the simulation algorithm to increase its parallelism. For several important cache coherence protocols synchronization is not needed at many of the interaction points. In fact, for one well-known protocol, the Berkeley coherence protocol, synchronization is not needed at any of the interaction points.

3.4 Global Trace Problem Call the trace of a single process's references, a process trace. Call the interleaving of process traces that occurs in a particular environment, the global trace (see Figure 3). The problem of ensuring that process traces collected in one environment can be used to generate a correct global trace for a di erent environment has been called the global trace problem [57, 58]. Lin, Baer, and Lazowska [54, 55] assume in this terminology that not only do environment changes not change the individual process traces, but environment changes also do not e ect the interleaving of process 14

Process Program 1

Trace-Driven Simulator Process Trace 1

Process Program 2 Process Trace 2 Global Trace

Process Program N Process Trace N

Environment Parameters

Figure 3: The global trace problem. traces that form the global trace. If we are willing to return to sequential simulation of the multiprocessor trace, then it is straightforward to allow for environment changes to change the interleaving of process traces. This occurs because we make the change in the relative ordering during the simulation. The serious problem, however, is that sequential simulation might not be adequate since environment changes may change the address sequences of the process traces, themselves. (Strictly, speaking even a uniprocessor trace does change based on the environment [1]. For example, a change in the width of the data path to memory changes the number of memory references. However, the uniprocessor environment changes that can e ect the trace are not usually the design parameters under study or they are changes for which an adjustment can be made.) Consequently, it is not possible, in general, to use traditional single address sequence traces in trace-driven simulation of a shared memory multiprocessor environment with assurance that a correct global trace is generated. Holliday and Ellis [58] propose an extension to traditional traces, called intrinsic traces, which the simulator can use to generate a correct global trace. An intrinsic trace consists of an address

ow graph of address basic blocks. The process trace of a particular execution causes the process to follow a path through its address ow graph. The address basic blocks contain all the references that 15

are not environment-dependent. The references that are environment-dependent are called address change points(ACPs) and can be classi ed within the framework of a standard intermediate code language [59]. When the simulator reaches an ACP, the path expression associated with the ACP is used to determine the next address reference to issue. A path expression serves as a concise notation for mapping between possible global trace pre xes (the portion of the global trace that has been generated so far during the simulation) and possible next addresses. This mapping is determined prior to the simulation. The diculty of establishing this mapping makes a fundamental distinction among ACPs. For some ACPs this is straightforward; for other ACPs, program reexecution seems to be needed. The complexity of this approach makes an instruction-level (or system) simulator more attractive since such a simulator avoids the global trace problem entirely, perhaps at little additional cost.

4 Inline Simulation An interesting convergence of events has made traditional trace-driven simulation increasingly problematic. Traditionally the sizes of the storage units, especially caches, of interest have been small enough (say, 8 kilobytes) that uniprocessor traces of typical lengths (say, one million to three million references) have been viewed as adequately long. Stone's analysis [27] discussed earlier suggests that for the cache sizes of current interest much longer traces are needed even under liberal constraints (at least four misses per set in the cache). The work of Borg, Keller, and Wall [29] suggests that traces longer than those suggested by Stone may be needed. For the cache sizes of interest to them (up to 16 megabytes) in some cases traces of over 1 billion references were required for the performance metric of processor cycles per instruction to stabilize. Parallel systems also aggravate the problem in two ways. First, whatever we view as an adequate length uniprocessor trace must be multiplied by the number of processors in the system. Second, the global trace problem remains. These problems have caused a substantial interest in an alternative to traditional trace-driven simulation, called inline simulation. The basic idea is to modify the program's code by inserting tracing code. The program's code is executed on the host machine. When a tracing statement is reached, that statement determines the address of a reference in the original program and stores that reference in a bu er. Determining that address can be nontrivial since the inserted code has shifted the original addresses. The trace events can be analyzed on-the- y or collected as a traditional trace for later use in a simulator. The initial implementations of this approach were not designed for collecting address traces but instead collected timing information. For example the BB basic block pro ler of Weinberger [60] analyzes the assembly language output of the compiler for a uniprocessor program. At the start 16

of each basic block instructions are inserted to increment a counter by the estimated time of the basic block. The estimated time of a basic block can be done statically from instruction counts and types. As noted by Covington, et.al [61] this approach can be extended to trace events as well as the control ow. They introduced the term execution-driven simulation for this extension. In the Rice Parallel Processing Testbed they used it, for example, for tracing messagepassing on a hypercube. Stunkel and Fuchs [62] noted that execution-driven simulation can be adapted to address tracing and did so in a package named TRAPEDS (TRAce Producing Execution Driven Simulation). A number of uses of inline simulation for addresses or other events have occurred [63, 64, 65, 66, 67, 68, 69, 70]. We mention two recent uniprocessor implementations [29, 38] and two recent multiprocessor implementations [62, 71]. Borg, Kessler, and Wall [29] trace the address references of multiple processes including operating system references for a uniprocessor. The linker inserts the code to ensure that library routines have their code modi ed. Their system has 128 megabytes of main memory with 32 megabytes allocated to the trace bu er. The traced programs are slowed by the tracing which if the context switch interval were unchanged, would inaccurately suggest how many original instructions are executed between switches. Consequently, the context switch interval is lengthened proportionally. When the bu er is full, the trace can be saved to tape, but usually the data is run through a cache analysis program. Larus's abstract execution (AE) package is also used for uniprocessors, but the avor is quite di erent. His goal is to collect an address reference trace of a single application process and to minimize the trace's size. This is in contrast to Borg, Kessler, and Wall where the focus is on capturing references of a multiprogramming workload including operating system references and onthe- y analysis was primarily used instead of trace storage. The key in AE is noting that many of the addresses in a trace can be statically determined by the compiler by using techniques from compiler optimization theory. During compilation the compiler generates an executable le that has been instrumented with inline code for references that can only be determined during execution. The compiler also generates a schema le to encode the information that can be determined statically. The instrumented program is then run to generate a \signi cant events" trace le. When the trace is to be used, the signi cant events le is input to a program created from the schema le which outputs the full trace. TRAPEDS [62] is implemented on an Intel iPSC/2 hypercube multiprocessor. In TRAPEDS the code is modi ed at the assembly language level. During this modi cation static information about each instruction is saved in the le aux le.s. During the execution of a basic block, for each dynamic portion of each address, instructions are inserted to save the values at runtime in a reserved area of global memory. At the start of each basic block a call to the routine, X bb perf, is inserted that uses the saved dynamic information and the saved static information to determine 17

Source Program

Regular Compiler Unmodified Object Module

Modified Compiler

Modified Object Module Inline Library Routines

Static Information

Optimization Information

Trace Regeneration

Full Trace

Linker

Modified Executable

Simulator

Execute; on-the-fly

Execute; trace

Results

Partial Trace

Results

Figure 4: One possible inline simulation data ow. the addresses referenced for the previously executed basic block and to call the cache simulation routine. Thus, TRAPEDS uses on-the- y analysis and the bu er need only be large enough to store the largest number of dynamic addresses generated by any basic block. MPtrace [71] is implemented on a bus-based Sequent multiprocessor. In contrast to TRAPEDS, the full trace is stored for later use. Much like in AE, compiler analysis is used to avoid tracing addresses that can be determined statically. One such technique saves only the rst address of each super block instead of saving the rst address of each basic block. A superblock contains a single entry point, but may have multiple exit points. Since MPtrace is for multiprocessors and is designed to store the traces, bu er management and the I/O system become a major concern. Multiple bu ers are used with independent writer processes that, concurrently with program execution, 18

empty bu ers. Despite the use of multiple bu ers, bu er over ow does occur. In this case, all threads are stopped as soon as possible instead of stopping only the thread without a bu er. Though this does extend the overall execution time, it reduces the distortion to the interthread timing. These implementations suggest the data ow shown in Figure 4. Variations, of course, are possible. For example, the insertion of tracing code has been done at the intermediate statement level [38], the assembly language level [62, 71], and the object code level [29]. The main concern with inline simulation is the distortion of runtime behavior due to the inserted code. Though little has been done yet to quantitatively evaluate this distortion, this concern has motivated minimizing the time and space overhead. In the Borg, Kessler, and Wall implementation the slowdown due to the inserted code was 8 to 12 times; the slowdown due to both the inserted code and running the analysis program periodically is about 100 times the untraced execution speed for a complex cache structure. AE's time overhead, not including the writing to a le, is 1.3 to 1.8. Recall AE and MPtrace do not do on-the- y analysis. The time overhead for TRAPEDS for both the inserted code and the on-the- y analysis is roughly 30 times the uninstrumented time. The MPtrace time overhead for only the inserted code is 1.6 to 2.3. The MPtrace time overhead including the time required to save the trace on secondary storage is 10 times. With respect to space overhead, there are several di erent possible numbers to consider. One is the size of only the inserted code for determining and saving addresses. Second, is the size of the inserted code as well as the on-the- y analysis routines. Third, is the size of bu er space. Fourth, for implementations that save traces, is the size of trace le. The importance of minimizing the space overhead varies. For example, reducing the size of inserted code appears useful as it minimizes distortion to runtime behavior, but there is no reason not to use as large a bu er space as possible for the given main memory size. Borg, Kessler, and Wall mention the size of their bu er space, 32 megabytes, but not the size of the inserted code or analysis routines. For AE the main concern is the size of the signi cant events trace le. This le is 10-40 times smaller than a full trace for the set of programs considered. This size reduction illustrates the e ectiveness of using compiler optimization techniques to reduce the number of events saved. In TRAPEDS the inserted code is roughly four times the size of the original code. With the analysis routines added in, the space overhead is about 10 times. Bu er space is not signi cant in TRAPEDS since the bu er need only be proportional in size to the largest basic block. In MPtrace the space overhead of inserted code (in terms of number of instructions) is about 20 times without code optimizations and about 4 times with code optimizations. The bu er space MPtrace subjected to experimentation was up to 400 bu ers and a bu er size of up to 64K bytes. 19

5 Summary Simulation using address reference traces has proved to be an important approach for processor cache and main memory design. We have surveyed techniques that have been used for the collection, storage, and use of such traces in both the uniprocessor and multiprocessor cases. Though synthetic traces generated from program behavior models have at times been useful, the most common source of traces is the execution of a set of application programs chosen to be representative of the expected workload. In the uniprocessor context, a wide range of techniques have been used for trace collection: instruction level simulation, hardware measurement using probes, microcode modi cation, instruction trapping, and sampling. For trace storage standard data compression techniques can be augmented with techniques (such as Mache) that take advantage of address reference locality. Alternatively the number of addresses in the trace can be reduced either approximately (using various forms of sampling) or exactly. Exact trace reduction techniques build upon the cache lter of Puzak. With respect to uniprocessor trace use, the most noteworthy technique is one-pass simulation. Assuming a xed set-associativity and xed block size, algorithms exist for computing the hit ratios (and number of write backs in the case of write-back caches) for a number of di erent cache sizes in a single pass through the trace. The multiprocessor context aggravates some uniprocessor problems (overhead of trace collection and storage) and introduces some new problems (the global trace problem). All of the uniprocessor trace collection techniques can be extended to the parallel case. The overhead, however, is often proportional to the number of processors. With respect to storage, extending trace compression and approximate trace reduction appears straightforward. Extending exact reduction is also possible. With respect to usage, extending one-pass simulation is nontrivial, but can be done in some cases. Using a parallel system during the evaluation of a trace has been considered as a means of speedup. Simulation of multiprocessor traces introduces the global trace problem. Changes in the environment being studied can e ect race conditions between processes to cause the addresses in the process traces to change. This prevents a trace collected in one environment from being used directly when evaluating another environment. Traditional traces can be augmented into intrinsic traces which to a more-or-less extent can be used in trace-driven simulation. The increasing size of the caches of interest, the large execution overhead of the most common trace collection method (instruction trapping), the increased collection and storage overhead due to multiple processors in the multiprocessor case, and the global trace problem, have all increased interest in a new approach called inline simulation. Inline simulation can be used as an alternative means of trace collection or as an alternative to the decoupling of collection and use which is central to traditional trace-driven simulation. In inline simulation code is inserted into the application 20

program so as to generate the addresses as the program executes. The trace could be stored and used later in a standard simulator. Alternatively, during execution the current section of the trace can be input to a cache simulator and then discarded. The de nition of the \current section" can vary from the last basic block to when a bu er lls up. Inline simulation produces a relatively low execution overhead and no secondary storage costs if on-the- y analysis is used. Since some overhead still exists, concern about the global trace problem persists.

Acknowledgements The comments by the anonymous reviewers were quite helpful.

References [1] A. Smith, \Cache evaluation and the impact of workload choice," in Proceedings of the 12th Annual International Symposium on Computer Architecture, (Boston, MA), June 1985. [2] A. Smith, \Cache memories," ACM Computing Surveys, vol. 14, pp. 473{530, September 1982. [3] S. Przybylski, Cache and Memory Hierarchy Design: A Performance-Directed Approach. San Mateo, CA: Morgan Kaufmann, 1990. [4] A. Smith, \Bibliography and readings on cpu cache memories and related topics," Computer Architecture News, vol. 14, pp. 22{42, January 1986. [5] W. Stecker, \Transient behavior of cache memories," ACM Transactions on Computer Systems, vol. 1, pp. 281{293, November 1983. [6] D. Thiebaut and H. Stone, \Footprints in the cache," ACM Transactions on Computer Systems, vol. 5, pp. 305{329, November 1987. [7] A. Agarwal, M. Horowitz, and J. Hennessy, \An analytical cache model," ACM Transactions on Computer Systems, vol. 7, pp. 184{215, May 1989. [8] D. Thiebaut, \On the fractal dimension of computer programs and its application to the prediction of the cache miss ratio," IEEE Transactions on Computers, vol. 38, pp. 1012{1026, July 1989. [9] E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Englewood Cli s, NJ: PrenticeHall, 1984. [10] M. K. Vernon, E. D. Lazowska, and J. Zahorjan, \An accurate and ecient performance analysis technique for multiprocessor snooping cache-consistency protocols," in Proceedings of the 15th Annual International Symposium on Computer Architecture, pp. 308{317, May 1988. [11] S. Owicki and A. Agarwal, \Evaluating the performance of software cache coherence," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, (Boston, MA), pp. 230{242, April 1989. 21

[12] R. Jog, P. Vitale, and J. Callister, \Performance evaluation of a commercial cache-coherent shared memory multiprocessor," in Proceedings of the 1990 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, (Boulder, CO), pp. 173{182, May 1990. [13] J. Torrellas, J. Hennessy, and T. Weil, \Analysis of critical architectural and programming parameters in a hierarchical shared memory multiprocessor," in Proceedings of the 1990 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, (Boulder, CO), pp. 163{172, May 1990. [14] R. Mattson, J. Gecsei, D. Slutz, and I. Traiger, \Evaluation techniques for storage hierarchies," IBM Systems Journal, vol. 9, pp. 78{117, 1970. [15] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufman, 1990. [16] J. Thompson and A. Smith, \Ecient (stack) algorithms for analysis of write-back and sector memories," ACM Transactions on Computer Systems, vol. 7, pp. 78{116, February 1989. [17] C. Chu, \MILS: Mips instruction level simulator." unpublished report, September 1985. [18] D. Ferrari, G. Serazzi, and A. Zeigner, Measurement and Tuning of Computer Systems. Englewood Cli s, NJ: Prentice-Hall, 1983. [19] C. Alexander, W. Keshlear, and F. Briggs, \Translation bu er performance in a UNIX environment," Computer Architecture News, vol. 13, pp. 2{14, December 1985. [20] C. Alexander, W. Keshlear, F. Cooper, and F. Briggs, \Cache memory performance in a UNIX environment," Computer Architecture News, vol. 14, pp. 14{70, June 1986. [21] D. Clark, \Cache performance in the VAX-11/780," ACM Transactions on Computer Systems, vol. 1, pp. 24{37, February 1983. [22] D. Clark and J. Emer, \Performance of the VAX-11/780 translation bu er: Simulation and measurement," ACM Transactions on Computer Systems, vol. 3, pp. 31{62, February 1985. [23] A. Agarwal, R. Sites, and M. Horowitz, \ATUM: A new technique for capturing address traces using microcode," in Proceedings of the 13th Annual International Symposium on Computer Architecture, pp. 119{129, June 1986. [24] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, \An evaluation of directory schemes for cache coherence," in Proceedings of the 15th Annual International Symposium on Computer Architecture, (Honolulu, HI), pp. 280{289, May 1988. [25] Digital Equipment Corporation, Bedford, MA, VAX Architecture Handbook, 1986. [26] F. Lacy, \An address trace generator for trace-driven simulation of shared memory multiprocessors," Tech. Rep. UCB/CSD 88/407, Computer Science Division (EECS), University of California at Berkeley, Berkeley, CA, March 1988. [27] H. Stone, High-Performance Computer Architecture, Second Edition. Reading, MA: AddisonWesley, 1990. [28] S. Laha, J. Patel, and R. Iyer, \Accurate low-cost methods for performance evaluation of cache memory systems," IEEE Transactions on Computers, vol. 37, pp. 1325{1336, November 1988. 22

[29] A. Borg, R. Kessler, and D. Wall, "Generation and analysis of very long address traces," in Proceedings of the 17th Annual International Symposium on Computer Architecture, (Seattle, WA), pp. 270–281, May 1990.
[30] J. Patel, "How to simulate 100 billion address references cheaply?," in ISCA '90 Workshop on Processor Tracing Methodologies, (Seattle, WA), May 1990.
[31] J. Spirn, Program Behavior: Models and Measurements. New York, NY: Elsevier North-Holland, 1977.
[32] J. Archibald and J.-L. Baer, "Cache coherence protocols: Evaluation using a multiprocessor simulation model," ACM Transactions on Computer Systems, vol. 4, pp. 273–298, November 1986.
[33] J. Murphy and R. Bunt, "Characterising program behavior with phases and transitions," in Proceedings of the 1988 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pp. 226–234, May 1988.
[34] H. Haikala, "ARMA models of program behaviour," in Proceedings of Performance '86 and ACM Sigmetrics 1986 Joint Conference on Computer Performance Modeling, Measurement and Evaluation, pp. 170–179, May 1986.
[35] M. Kobayashi and M. H. MacDougall, "The stack growth function: Cache line reference models," IEEE Transactions on Computers, vol. C-38, pp. 798–804, June 1989.
[36] M. Hill, Aspects of Cache Memory and Instruction Buffer Performance. PhD thesis, Univ. of California, Computer Science Division, Berkeley, CA, 1987.
[37] M. Hill and A. Smith, "Evaluating associativity in CPU caches," IEEE Transactions on Computers, vol. 38, pp. 1612–1630, December 1989.
[38] J. R. Larus, "Abstract execution: A technique for efficiently tracing programs," Software: Practice and Experience, to appear.
[39] T. Welch, "A technique for high performance data compression," IEEE Computer, vol. 17, pp. 8–19, June 1984.
[40] A. Samples, "Mache: No-loss trace compaction," in Proceedings of the 1989 ACM Sigmetrics and Performance '89 International Conference on Measurement and Modeling of Computer Systems, pp. 89–97, May 1989.
[41] T. Puzak, Cache-Memory Design. PhD thesis, Univ. of Mass., ECE Dept., Amherst, MA, 1985.
[42] W.-H. Wang and J.-L. Baer, "Efficient trace-driven simulation methods for cache performance analysis," in Proceedings of the 1990 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, (Boulder, CO), pp. 27–36, May 1990.
[43] A. Smith, "Two methods for the efficient analysis of memory address trace data," IEEE Transactions on Software Engineering, vol. SE-3, January 1977.
[44] J. Voldman et al., "Fractal nature of software-cache interaction," IBM Journal of Research and Development, vol. 27, pp. 164–170, March 1983.

[45] A. Agarwal and M. Huffman, "Blocking: Exploiting spatial locality for trace compaction," in Proceedings of the 1990 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, (Boulder, CO), pp. 48–57, May 1990.
[46] D. Slutz and I. Traiger, "Evaluation techniques for cache memory hierarchies," Tech. Rep. RJ 1045 (17547), IBM Research Division, May 1972.
[47] I. Traiger and D. Slutz, "One-pass techniques for the evaluation of memory hierarchies," Tech. Rep. RJ 892 (17563), IBM Research Division, July 1971.
[48] A. Mink, R. Carpenter, G. Nacht, and J. Roberts, "Multiprocessor performance-measurement instrumentation," IEEE Computer, pp. 63–75, September 1990.
[49] A. Agarwal and A. Gupta, "Memory-reference characteristics of multiprocessor applications under Mach," in Proceedings of the 1988 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, (Santa Fe, NM), pp. 215–225, May 1988.
[50] A. W. Wilson, Jr., "Hierarchical cache/bus architecture for shared memory multiprocessors," in Proceedings of the 14th Annual International Symposium on Computer Architecture, pp. 244–252, June 1987.
[51] S. Przybylski, M. Horowitz, and J. Hennessy, "Characteristics of performance-optimal multi-level cache hierarchies," in Proceedings of the 16th Annual International Symposium on Computer Architecture, (Jerusalem, Israel), pp. 114–121, May 1989.
[52] H. Bugge, E. Kristiansen, and B. Bakka, "Trace-driven simulations for a two-level cache design in open bus systems," in Proceedings of the 17th Annual International Symposium on Computer Architecture, (Seattle, WA), pp. 250–259, May 1990.
[53] J.-L. Baer and W.-H. Wang, "On the inclusion property for multi-level cache hierarchies," in Proceedings of the 15th Annual International Symposium on Computer Architecture, (Honolulu, HI), pp. 73–80, May 1988.
[54] Y. Lin, J.-L. Baer, and E. Lazowska, "Parallel trace-driven simulation of multiprocessor cache performance: Algorithms and analysis," tech. rep., Department of Computer Science, University of Washington, Seattle, WA, 1988.
[55] Y. Lin, J.-L. Baer, and E. Lazowska, "Tailoring a parallel trace-driven simulation technique to specific multiprocessor cache coherence protocols," tech. rep., Department of Computer Science, University of Washington, Seattle, WA, 1988.
[56] K. Chandy and J. Misra, "Conditional knowledge as a basis for distributed simulation," Tech. Rep. TR-87-5251, Computer Sciences Department, University of Texas at Austin, 1987.
[57] M. Holliday and C. Ellis, "An example of correct global trace generation," in Scalable Shared Memory Multiprocessors (M. Dubois and S. Thakkar, eds.), Kluwer Academic, 1991. Also CS-1990-19, Dept. of Computer Science, Duke Univ., Durham, NC, 1990.
[58] M. Holliday and C. Ellis, "Accuracy of memory reference traces of parallel computations in trace-driven simulation," IEEE Transactions on Parallel and Distributed Systems, to appear. Also Technical Report CS-1990-8, Dept. of Computer Science, Duke University, Durham, NC.

[59] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools. Reading, MA: Addison-Wesley, 1986.
[60] P. Weinberger, "Cheap dynamic instruction counting," Bell System Technical Journal, vol. 63, pp. 1815–1826, October 1984.
[61] R. Covington, S. Madala, V. Mehta, J. Jump, and J. Sinclair, "The Rice parallel processing testbed," in Proceedings of the 1988 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, (Santa Fe, NM), pp. 4–11, May 1988.
[62] C. Stunkel and W. Fuchs, "TRAPEDS: Producing traces for multicomputers via execution-driven simulation," in Proceedings of the 1989 ACM Sigmetrics and Performance '89 International Conference on Measurement and Modeling of Computer Systems, (Berkeley, CA), pp. 70–78, May 1989.
[63] T. Axelrod, P. Dubois, and P. Eltgroth, "A simulator for MIMD performance prediction: Application to the S-1 MkIIa multiprocessor," Parallel Computing, vol. 1, pp. 237–274, 1984.
[64] M. Dubois, F. Briggs, I. Patil, and M. Balakrishan, "Trace-driven simulations of parallel and distributed algorithms in multiprocessors," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 909–916, August 1986.
[65] W. Williams and Bobrowicz, "Speedup predictions for large scientific parallel programs on Cray X-MP-like architectures," in Proceedings of the 1985 International Conference on Parallel Processing, pp. 541–543, August 1985.
[66] R. Fujimoto, "Simon: A simulator of multicomputer networks," Tech. Rep. UCB/CSD 83/140, Computer Science Division (EECS), Univ. of California at Berkeley, Berkeley, CA, September 1983.
[67] P.-C. Yew, "Trace generation facilities in CHIEF," in ISCA '90 Workshop on Processor Tracing Methodologies, (Seattle, WA), May 1990.
[68] A. Agarwal, "Multiprocessor address tracing: the agony and the ecstasy," in ISCA '90 Workshop on Processor Tracing Methodologies, (Seattle, WA), May 1990.
[69] S. Goldschmidt and H. Davis, "Tango introduction and tutorial," Tech. Rep. CSL-TR-90-410, Computer Systems Laboratory, Stanford University, Palo Alto, CA, January 1990.
[70] C. Erickson and M. Azimi, "A technique for generating architecture-independent MP traces," in ISCA '90 Workshop on Processor Tracing Methodologies, (Seattle, WA), May 1990.
[71] S. Eggers, D. Keppel, E. Koldinger, and H. Levy, "Techniques for efficient inline tracing on a shared-memory multiprocessor," in Proceedings of the 1990 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, (Boulder, CO), pp. 37–47, May 1990.

