A Simulation Tool for Evaluating Shared Memory Systems

Jie Tao (1), Martin Schulz (2), and Wolfgang Karl (1)

(1) Lehrstuhl für Rechnertechnik und Rechnerorganisation, Institut für Informatik, Technische Universität München, Boltzmannstr. 3, 85748 Garching, Germany
(2) School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA
Abstract

This paper presents an execution-driven simulator called SIMT, which models the parallel execution of applications on multiprocessor systems with global memory abstractions. Based on Augmint, a simulation toolkit for Intel architectures, SIMT focuses on the memory hierarchy and contains facilities for designing and evaluating the memory system, especially cache coherence schemes and data allocation policies. In addition, it models a hardware monitor capable of collecting comprehensive performance data in the form of memory access histograms. SIMT allows the user to study the impact of various hardware and software techniques on performance. It is an easy-to-use and comprehensive tool that helps system designers and architecture developers improve existing infrastructures and enables the exploration of novel approaches concerning applications and the memory system.
1 Introduction

The performance of the memory system is one of the most important factors that significantly influence the efficiency of shared memory programs running on multiprocessors. This situation is especially critical on NUMA (Non-Uniform Memory Access) systems, where the latency of remote memory accesses can be one or two orders of magnitude higher than that of local memory accesses. In order to study the memory system and the memory access behavior of applications on parallel machines with shared or distributed shared memory, a multiprocessor simulator called SIMT (SIMulation Tool) has been developed. SIMT models the parallel execution of shared memory applications on the target machines, i.e., SMPs and NUMA architectures. Based on Augmint, an event-driven multiprocessor simulator, SIMT focuses on memory performance research and mainly contains facilities for simulating the memory system.
This includes a cache simulator, which models multilevel cache hierarchies with several cache coherence protocols; a distributed memory simulator, which models the shared memory and a spectrum of memory allocation policies; and a network mechanism modeling inter-node communication. In addition, SIMT uses an independent monitor simulator module to collect performance data. This monitor simulator models an existing hardware monitoring device capable of tracing memory transactions and gathering information about memory references. More specifically, the monitor simulator can be selectively inserted to monitor bus systems at any level of the memory hierarchy, allowing the acquisition of comprehensive monitoring information both about a specific hardware module and about the complete memory system.

Based on the monitor simulator, SIMT provides performance data in the form of memory access histograms that record memory transactions throughout the execution process. A memory access histogram contains detailed information about the number of memory accesses across the whole virtual address space, thereby allowing an understanding of a program's runtime data layout. This enables the user to perform a range of different optimizations. Performance information supplied by SIMT has been used to evaluate a large range of system tuning opportunities and design tradeoffs, including cache coherence protocols, different data allocation schemes in terms of data locality, the sensitivity of application performance to the remote access latency, and the performance improvement achieved by locality optimizations. For the latter, for example, simulated runs of several optimized numerical kernels on a 16-node NUMA system show a performance gain of up to 26.4%.

The remainder of this paper is organized as follows. Section 2 contrasts SIMT with a few similar simulation systems. Section 3 briefly describes the simulated architecture models, followed by a detailed description of SIMT's functions and its data collection mechanisms in Sections 4 and 5, respectively. In Section 6, SIMT is validated against actual systems. First experimental results achieved with SIMT for performance evaluation are discussed in Section 7. The paper concludes with some final remarks in Section 8.
2 Related Work

Simulation is a well-established technique for studying computer hardware and predicting system performance. Over the last years, many simulation systems with the goal of providing a general tool for such studies have been developed. Prominent systems include full system simulation tools, like SimOS [5] and SIMICS [9], and SimpleScalar [1], an infrastructure for simulation and architectural modeling. These larger multiprocessor simulators model the computer hardware in high detail, allowing complete parallel architectures to be evaluated. SIMT, however, aims at supporting the study of potentials for memory performance improvement. For reasons of efficiency and simulation speed, it does not model the concrete hardware in detail. It rather focuses on the simulation of the memory hierarchy and the collection of complete information about caches, memories, and inter-node communication. This goal is more similar to RSIM [11], MICA [7], and the simulation environment at Texas A&M University [2].

RSIM [11], the Rice Simulator for ILP Multiprocessors, simulates shared memory multiprocessors built from processors that aggressively exploit instruction-level parallelism (ILP). It supports a two-level data cache hierarchy and hardware directory-based cache-coherent NUMA shared memory systems. As it emphasizes accuracy, RSIM is slower than simulators that do not include a detailed processor model.

MICA [7] is a memory and network simulation environment for cache-based PC platforms. It is built on top of Limes [16], a multiprocessor simulator for Intel architectures. MICA extends Limes with the simulation of L2 caches and uses it to create a trace file containing all memory references. Based on this trace file, MICA simulates the network transfer. However, MICA's cache simulator is quite simple and does not enable the modeling of various caches.

Bhuyan et al. [2] use simulation to study the impact of CC-NUMA memory management policies in combination with interconnection network switch designs on application performance. The simulation environment is built on top of PROTEUS [3], an execution-driven simulator for shared memory machines. In order to evaluate memory and network, PROTEUS was extended with various data distribution schemes and network switches.

In comparison with these systems, SIMT is more flexible in terms of memory hierarchy simulation. The cache simulator of SIMT models not only an arbitrary number of cache levels but also various cache consistency protocols, ranging from sequential consistency to relaxed schemes based on the concept of Release Consistency suitable for NCC-NUMA systems [13]. In addition, no trace file is generated within SIMT. Rather, memory references are directly delivered to the target architecture simulator, avoiding a slowdown of the simulated execution caused by I/O activities. Compared with the system by Bhuyan et al., SIMT has been used to analyze the runtime data layout of applications and the potential for data locality improvements, in addition to studying data allocation policies. This helps users to optimize their applications towards better performance. Hence, SIMT is not only a multiprocessor simulator for architecture studies but also a general tool for programmers to understand their applications.
3 Architecture Models

SIMT models multiprocessor systems with a global shared memory abstraction. A simple model of the architectures that can be simulated is given in Figure 1. They comprise a set of nodes connected via interconnection networks or busses. Each node includes a processor, a memory hierarchy consisting of multilevel caches and the local memory, and other system facilities including the network interface and busses.

[Figure 1: applications running on nodes that each contain a processor, L1 and L2 caches, and local memory; the nodes are connected by an interconnection network or bus and together form a shared (virtual) memory.]

Figure 1. The simulated architecture.

The physical memory is distributed among the processor nodes, but is accessible through a global address space. This can then be used to create a global virtual memory abstraction across all nodes. The result is a shared memory environment with NUMA characteristics. Shared data is partitioned at page granularity (4096 bytes) and distributed among the nodes depending on a data distribution policy. Any memory access which is not a cache hit will be satisfied from either the local or a remote memory according to the chosen data distribution. The latency of local and remote memory accesses can be specified by the user according to the properties of the target architectures. As an extreme case, the physical memory can also be centrally organized with equal access time for all memories. In this case, SIMT models SMPs with UMA (Uniform Memory Access) characteristics.
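To make the simulated machine model concrete, the following C sketch summarizes the parameters described above in a small configuration structure. It is illustrative only: the type and field names are assumptions rather than SIMT's actual interface, and the example values are taken from the NUMA configuration used later in Section 7.

```c
/* Hypothetical sketch of the simulated machine parameters; not SIMT's real interface. */
#include <stddef.h>

#define PAGE_SIZE 4096              /* shared data is partitioned at page granularity */

typedef struct {
    int    num_nodes;               /* nodes connected by the interconnection network/bus */
    int    cache_levels;            /* multilevel caches per node (e.g., L1 and L2) */
    size_t cache_size[4];           /* capacity of each cache level in bytes */
    int    local_latency;           /* cycles for an access to the local memory */
    int    remote_latency;          /* cycles for an access to a remote memory */
} sim_config;

/* With local_latency == remote_latency the model degenerates to a UMA/SMP system. */
static const sim_config example_numa = {
    .num_nodes      = 16,
    .cache_levels   = 2,
    .cache_size     = { 32 * 1024, 512 * 1024 },
    .local_latency  = 100,
    .remote_latency = 2000,
};
```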
4 SIMT Functions

For the simulation of NUMA architectures, mechanisms are needed to provide a global shared virtual memory, generate and schedule events, and model the required data transfers. To simplify this task, Augmint [10], a fast execution-driven multiprocessor simulation toolkit for Intel x86 architectures, has been deployed as the kernel of SIMT. We use Augmint's memory reference generator to drive the simulation and provide a new backend to model the target architectures: NUMA machines and SMPs.

4.1 Augmint: the Base of SIMT

To our knowledge, there are few simulators supporting distributed shared memory architectures on top of PC platforms built from Intel x86 computers. Two of them are suitable for our goal: Limes [16] and Augmint [10]. Limes provides cache and bus modules to simulate bus-based target architectures, while Augmint does not provide architectural components like caches. Augmint, on the other hand, provides scheduling routines that simplify the task of developing a target architecture simulator. Hence, Augmint is used in SIMT as the base of its memory reference generator.

Augmint consists of a simulation infrastructure, which manages the scheduling of events, and a collection of architectural models, which represent the system under study. It is based on MINT [14], a frontend for the efficient simulation of shared memory multiprocessors. It runs applications from the SPLASH-2 suite [15] and any application written in C/C++ together with m4 macros in a fashion similar to the SPLASH and SPLASH-2 applications. It supports a thread-based programming model with a shared address space and a private stack space for each thread.

4.2 SIMT Infrastructure

Augmint itself contains only a frontend which is capable of generating specific events like memory references and synchronization events. The functions to handle these events, however, have to be defined by a backend, the actual target architecture simulator. Such a new backend has been added to Augmint to implement SIMT. As SIMT is intended primarily for research in the area of memory system design and performance, its backend comprises mainly four components: a cache simulator, a shared memory simulator, a monitor simulator, and a network modeling mechanism. For the sake of efficiency, the current prototype of SIMT uses a simplified processor model rather than simulating the actual operations within the CPU. In this model, each instruction of the target machine is combined with an estimated execution time. This is done in the form of simulated cycles and specified as an input file to Augmint.

The simulation structure of SIMT is shown in Figure 2. As described, SIMT uses Augmint as its frontend. Memory references delivered by this frontend are handed to the SIMT backend as events and are first filtered by the cache simulator. Those references which are not satisfied within any of the caches are processed by the memory simulator as accesses to the main memory. Depending on the data allocation policy, these accesses can be either local or remote. For a local access, a task is issued and scheduled to be completed in a constant time corresponding to the simulated CPU. For a non-local memory reference, a network request is generated, which is inserted into a priority queue sorted according to the timestamp. This request is handled when it has the lowest timestamp in the queue, indicating that all remote memory accesses issued before it have been completed.

[Figure 2: Augmint delivers memory references to the simulated cache hierarchy (L1 ... Ln); hits are recorded by the monitoring units (M1 ... Mn) attached to the individual levels, while misses are passed to the memory simulator and resolved in the local (Ml) or a remote (Mr) memory; the monitored events feed the memory access histogram.]

Figure 2. SIMT simulation structure.

For each level of the memory hierarchy a monitor simulator can be attached and used to probe the connecting busses. Each monitor then records the probed events and preprocesses them in a low-level fashion, as done in the hardware modules. Subsequently, low-level software is used to generate independent memory access histograms of the accesses occurring at the corresponding locations. These access histograms are then combined into a final result, which shows the complete memory access information and serves as the basis for evaluating the whole memory system.
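The backend event flow described above can be pictured with a short C sketch. It is a minimal illustration under assumed interfaces (the type and function names below are hypothetical, not SIMT's actual code): a reference is first filtered by the cache simulator, a miss is classified as local or remote according to the allocation policy, and remote requests are queued by timestamp.

```c
/* Minimal, assumption-based sketch of the backend event flow; illustrative only. */
typedef struct {
    unsigned long addr;      /* virtual address of the reference */
    int           node;      /* issuing processor node */
    unsigned long timestamp; /* simulated cycle at which the reference is issued */
} mem_event;

extern int  cache_lookup(int node, unsigned long addr);      /* nonzero = hit in some cache level */
extern int  home_node(unsigned long addr);                   /* node chosen by the allocation policy */
extern void schedule_local(const mem_event *e, int latency); /* complete after a constant delay */
extern void enqueue_remote(const mem_event *e);              /* priority queue ordered by timestamp */

void handle_reference(const mem_event *e, int local_latency)
{
    if (cache_lookup(e->node, e->addr))
        return;                              /* satisfied within the cache hierarchy */
    if (home_node(e->addr) == e->node)
        schedule_local(e, local_latency);    /* local main memory access */
    else
        enqueue_remote(e);                   /* handled when it has the lowest timestamp */
}
```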
4.3 Simulation of Caches

Caches are critical components in NUMA systems since they hide a large percentage of remote memory accesses and their associated long delays. They can be seen as filters keeping traffic away from the main memory and the cross-system network. It is therefore important to accurately model caches and their impact on the overall memory traffic.

SIMT models a multilevel cache hierarchy and allows each cache to be organized as either write-through or write-back, depending on the target architecture. All relevant parameters, including cache size, cache line size, and associativity (from direct mapped to fully associative), can be specified by the user.

In addition to simulating the caches themselves, the cache simulator models a set of cache coherence protocols. This includes the hardware-based MESI protocol, several relaxed consistency schemes, and an optimal, false-sharing-free model using a perfect oracle to predict required sharing. The latter is used to evaluate the efficiency of cache coherence protocols. The MESI scheme is traditionally deployed on existing multiprocessor architectures in order to maintain an easy and straightforward programming model. However, the hardware complexity and the system traffic induced by keeping all caches coherent are rather expensive. SIMT therefore enables the investigation of relaxed consistency models which do not require global update operations, but rely on full cache invalidations and write buffer flushes [13]. In the future, additional, more selective protocols will be added to further tune the coherence protocol and to find the right tradeoff between coherency requirements and hardware complexity.
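As an illustration of the configurable cache model, the following C sketch shows a set-associative lookup driven by the three user-specified parameters (cache size, line size, associativity). It is a simplified, assumption-based sketch: SIMT's real cache simulator additionally handles write policies, multiple levels, and the coherence protocols discussed above; random replacement is used here only for brevity.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    unsigned long tag;
    bool          valid;
} cache_line;

typedef struct {
    size_t      line_size;  /* cache line size in bytes */
    unsigned    assoc;      /* associativity (1 = direct mapped) */
    size_t      num_sets;   /* derived: size / (line_size * assoc) */
    cache_line *lines;      /* num_sets * assoc entries */
} cache;

cache *cache_new(size_t size, size_t line_size, unsigned assoc)
{
    cache *c     = malloc(sizeof *c);
    c->line_size = line_size;
    c->assoc     = assoc;
    c->num_sets  = size / (line_size * assoc);
    c->lines     = calloc(c->num_sets * assoc, sizeof *c->lines);
    return c;
}

/* Returns true on a hit; on a miss the line is installed (random replacement). */
bool cache_access(cache *c, unsigned long addr)
{
    size_t        set = (addr / c->line_size) % c->num_sets;
    unsigned long tag = addr / (c->line_size * c->num_sets);
    cache_line   *way = &c->lines[set * c->assoc];

    for (unsigned i = 0; i < c->assoc; i++)
        if (way[i].valid && way[i].tag == tag)
            return true;
    way[rand() % c->assoc] = (cache_line){ .tag = tag, .valid = true };
    return false;
}
```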
4.4 Simulation of the DSM

The DSM (Distributed Shared Memory) simulator of SIMT models the management of the distributed memory and a spectrum of data distribution policies. For this, SIMT uses a single address space. Each process is able to access the shared data in this space, but with a different access latency depending on the location of the data. The data distribution is chosen by the allocation policy. Currently, SIMT models the following policies (a code sketch of the page-to-node mappings follows the list):

Round-robin: Pages of data are cyclically allocated across all nodes at a user-selected granularity (usually page size), i.e., page i is placed on node i mod N.

Block: Shared data is divided into blocks depending on the number of processors, and each block is placed on one processor.

Full: All data is allocated on one node, which is specified by the user.

First-touch: Each page is allocated on the processor which first accesses the page.

Skew-mapping: Pages, which are allocated to memories using the Round-robin policy, are skewed linearly in the left-to-right direction. A page i is therefore allocated to memory (i + ⌊i/N⌋ + 1) mod N, where N is the number of processors in the system.

Prime-mapping: Pages are first allocated to processors using the Round-robin policy, assuming there were P virtual processors, where N is the number of processors and P is a prime greater than or equal to N. After this first step, the pages which are allocated to the virtual processors are then additionally allocated to the real processors, again using the Round-robin policy.

Explicit placement: The user can explicitly specify the data placement within the source code. Data can be partitioned at page, block, or word granularity depending on the target architectures.
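The static mappings above can be summarized in a small C helper. This is a sketch of one possible reading of the policies, not SIMT's implementation: First-touch and explicit placement need runtime state and are omitted, and the two-step Prime-mapping scheme is reduced here to a simple composition of its two Round-robin steps.

```c
/* Sketch of the static page-to-node mappings described above (illustrative only). */
enum policy { ROUND_ROBIN, BLOCK, FULL, SKEW_MAPPING, PRIME_MAPPING };

/* Smallest prime >= n, needed by the Prime-mapping policy. */
static int prime_geq(int n)
{
    for (int p = (n < 2) ? 2 : n; ; p++) {
        int is_prime = 1;
        for (int d = 2; d * d <= p; d++)
            if (p % d == 0) { is_prime = 0; break; }
        if (is_prime) return p;
    }
}

/* page: index of the page in the shared region, n: number of nodes,
   pages: total number of shared pages, home: node used by the Full policy. */
int page_to_node(enum policy pol, long page, int n, long pages, int home)
{
    switch (pol) {
    case ROUND_ROBIN:   return (int)(page % n);
    case BLOCK:         return (int)(page / ((pages + n - 1) / n));  /* contiguous blocks */
    case FULL:          return home;
    case SKEW_MAPPING:  return (int)((page + page / n + 1) % n);     /* (i + floor(i/N) + 1) mod N */
    case PRIME_MAPPING: return (int)((page % prime_geq(n)) % n);     /* simplified two-step round-robin */
    }
    return 0;
}
```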
4.5 Network Modeling

SIMT simulates NUMA machines connected via an interconnection network. In the current prototype of SIMT, the actual interconnection technologies are modeled by their corresponding transfer latencies. To simplify the modeling process, two constants are used to represent the respective latencies of local and remote memory accesses. The former models the delay of local main memory accesses, while the latter models the overhead and transmission delay for accessing remote memories. Both values can be specified by the user depending on the properties of the target architectures.
5 Data Collection Using a Monitor Simulator

Like most simulators, SIMT provides extensive statistics about the execution of an application during the simulation process. This includes the elapsed simulation time and the simulated processor cycles, the number of total memory references, and the number of hits and misses for each cache in the system. This allows the user to study the impact of data locality optimizations and of any novel techniques to improve performance. However, often a deeper insight into the performance of the target application is required. Traditional simulation tools acquire such performance data by inserting annotations or instrumentation instructions into the simulator. The collected data is likely to be incomplete since it is infeasible to provide full instrumentation. Besides this, performance data is usually only provided at the end of the simulation process, rendering any on-line performance analysis impossible.
In order to overcome these problems, SIMT uses an independent component to collect accurate, comprehensive performance data. This is a monitoring facility that models an existing hardware monitor [6] developed for tracing the inter-node communications on SCI (Scalable Coherent Interface), an IEEE-standardized [8] interconnection technology with high bandwidth, low latency, and a global physical address space [4]. This hardware monitor is designed to snoop a local bus, carrying address information, over which inter-node transactions are transferred. It then extracts information about each transaction, including access source, destination, address, and transaction type. This information is first stored in the monitor's registers, preprocessed, and then delivered to a ring buffer. From the ring buffer, the data can be accessed by users and system software. Within the processing phase of the monitor, the information is condensed and transformed into access histograms representing the access distribution across the whole address range within the monitored time span. This enables an easy evaluation of the target component and the detection of access hot spots.

By modeling this monitor, SIMT can be, and is, used to evaluate the design and effectiveness of this novel hardware monitor, while at the same time using its concepts in a more general way for a more comprehensive evaluation of the simulated application. The latter is achieved by extending its concept to the various busses and links between the CPU core and the memory [12].

The monitor simulator comprises a packet generator, an event filter, a counter array, and a ring buffer. The packet generator creates packets for the events delivered by the monitored bus or link. Packets are data structures holding the information contained in memory transactions, like the access address and the transaction type. These packets are then checked by the event filter in order to capture only those events of interest, which are further recorded in the counter array, indexed by their address. As the main component of the monitor simulator, this array consists of counters for storing performance information. The counters are organized in a cache-like manner. Whenever all counters are filled or one counter is about to overflow, a counter is evicted from the monitor and stored in the ring buffer, a large user-defined memory space. The freed counter is then reclaimed for the further monitoring process.

The monitor simulator is designed and implemented as an independent component and can selectively be inserted into any level of the memory hierarchy. In addition, monitoring data is stored in a specific physical memory space and periodically refreshed, allowing on-line analysis by performance tools. Besides this, the monitor simulator can stop and restart the monitoring during the simulated execution of programs. This enables the acquisition of per-phase information which can be used to optimize applications with dynamically changing access patterns.
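The counter array and ring buffer just described can be pictured with the following C sketch. Sizes, names, and the direct-mapped indexing are assumptions chosen for brevity; the text above only states that the counters are organized in a cache-like manner and are spilled to a user-defined ring buffer when the array is full or a counter is about to overflow.

```c
/* Assumption-based sketch of the monitor's counter array and ring buffer. */
#include <stdint.h>

#define NUM_COUNTERS 256
#define RING_SIZE    4096

typedef struct { uint64_t addr; uint32_t count; int valid; } counter;
typedef struct { counter entries[RING_SIZE]; int head; } ring_buffer;

static counter     counters[NUM_COUNTERS];   /* organized in a cache-like manner */
static ring_buffer ring;                     /* stands in for the user-defined memory space */

static void evict(int idx)
{
    ring.entries[ring.head] = counters[idx];         /* spill the counter to the ring buffer */
    ring.head = (ring.head + 1) % RING_SIZE;
    counters[idx].valid = 0;                         /* reclaim the counter */
}

/* Record one monitored memory transaction after it passed the event filter. */
void monitor_record(uint64_t addr)
{
    int idx = (int)(addr % NUM_COUNTERS);            /* cache-like indexing by address */
    if (counters[idx].valid && counters[idx].addr != addr)
        evict(idx);                                  /* conflict: evict the old counter */
    if (!counters[idx].valid)
        counters[idx] = (counter){ .addr = addr, .count = 0, .valid = 1 };
    if (++counters[idx].count == UINT32_MAX)         /* counter about to overflow */
        evict(idx);
}
```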
6 Validation

SIMT is a tool for evaluating shared memory systems. In order to examine its accuracy, we have compared the simulated execution of several applications with the execution on the actual hardware being modeled. The simulated codes have been run on a few uniprocessors with different hardware configurations as well as on a sample NUMA-like cluster.

Parameter             | Machine 1             | Machine 2
Processor             | Pentium II (Klamath)  | Pentium II (Deschutes)
Processor Clock Speed | 266 MHz               | 448 MHz
Data Cache            | 16 KB                 | 128 KB
Secondary Cache       | 512 KB                | 512 KB
Associativity         | 4-way L1, 4-way L2    | 4-way L1, 4-way L2
Memory                | 128 MB, 100 ns        | 256 MB, 150 ns

Table 1. Hardware configuration of the tested machines.
6.1 Initial Performance Studies

First, SIMT is compared with the execution on uniprocessors. The hardware used in this study consists of two Intel PCs, both using a Pentium II as their compute processor, but with different system configurations. Several parameters of both machines are shown in Table 1.

All tested applications are chosen from the SPLASH-2 [15] benchmark suite. We run the applications using both their default working set sizes (2^10 data points for FFT, a 128×128 matrix for LU, 131072 keys for RADIX, and 343 molecules for WATER) and a greater working set (2^14 data points, a 256×256 matrix, 262144 keys, and 512 molecules, respectively) to explore how well SIMT predicts the execution time of applications with varying working set sizes.

The result is summarized in Figure 3. The x-axis is broken down into the simulated target machines for each application. The y-axis gives the execution time relative to the real hardware. Values below one indicate that simulation predicts a faster execution than is actually achieved on the real hardware, while values above one show the reverse effect. It can be observed that for both machines and for all applications SIMT can roughly estimate the performance, especially for the greater working sets. The behavior of the applications, however, varies significantly. While a few programs show a faster run under simulation, most others present a slowdown. Nevertheless, the results reported by SIMT are within 20% in all cases.
[Figure 3: bar chart of the execution time under simulation relative to the real hardware for FFT, LU, RADIX, and WATER on Machine 1 and Machine 2, with one bar for the default and one for the greater working set; the y-axis ranges from 0.0 to 1.5.]

Figure 3. SIMT vs. actual hardware.

6.2 Relative Performance

In a second set of experiments, we evaluate how well SIMT is suited to measuring effects other than absolute performance by examining how well SIMT predicts speedup. For this purpose, the benchmark applications are both executed on an SCI-based cluster and simulated under the same configurations. The cluster is composed of six SMPs, each with two Pentium II CPUs, a 32 KB L1 cache, and a 512 KB L2 cache. All benchmark applications use their default working set sizes. Experimental results are shown in Figure 4.

For each application, Figure 4 shows two curves presenting the speedup achieved on SIMT and on the shared memory cluster interconnected by SCI using various numbers of processors. Speedup is obtained by dividing the sequential execution time by the parallel one. It can be seen that the behavior of the applications differs: FFT shows only a slight speedup, while WATER and LU present good performance only on small systems. However, independent of the actual performance, the difference between the measured speedup and the speedup predicted by SIMT is negligible.

Figure 4. Speedup: SIMT against cluster.

7 Sample Uses of SIMT

SIMT has been used to evaluate system hardware and software. In this section, we show two such studies in combination with four applications from the SPLASH-2 benchmark suite [15] to illustrate the capabilities of SIMT. To balance the simulation time, the input data sets are specially chosen: 512 molecules for WATER, 262144 keys for RADIX, 2^16 complex numbers for FFT, and a 256×256 matrix for LU. The machine we model for the following experiments is a cluster of PCs with shared memory. Each processor node employs a four-way L1 cache of 32 KBytes with an access time of 1 cycle and a four-way L2 cache of 512 KBytes with an access time of 10 cycles. The access latency of the main memory is set to 100 cycles, while the delay of a remote memory access is chosen as 2000 cycles, representing architectures with NUMA characteristics typically found in loosely coupled PC clusters.

7.1 Data Allocation Policies

First, we study the characteristics of the different data distribution schemes. Figure 5 shows the proportion of remote memory accesses to the total number of memory accesses for the four benchmark applications under the different policies. It can be observed that First-touch behaves better than any of the others. The reason for this is that First-touch allocates data on the node which accesses it first, and it is very likely that this node accesses the data again later, since data first touched by a node is usually also handled by this node. Besides First-touch, the behavior of the other allocation schemes varies from application to application. For example, the Round-robin scheme performs quite well for RADIX with 28 processors and for LU with 8 processors. This indicates that in specific cases some allocation schemes can introduce a better data distribution.
[Figure 5: four panels (WATER, FFT, RADIX, and LU) plotting the ratio of remote memory accesses to total memory accesses (y-axis) against the number of processors (x-axis, 5 to 30) for the first-touch, block, full, round-robin, skew-mapping, and prime-mapping policies.]
Figure 5. Difference between the data distribution schemes.
7.2 Locality Optimization

From Figure 5 we can also observe that for all allocation schemes a significant number of remote memory accesses remains and that the data locality on larger systems is poorer. This shows that locality optimization in such systems, especially in larger ones, is mandatory. In order to observe how manual locality improvements affect performance, we have analyzed the run-time data layout of all codes using the memory access histograms gathered by the monitoring module within SIMT. We have found that for all applications the data was not properly allocated throughout the execution, leading to this large number of remote accesses.

Based on this analysis, we have optimized the source codes with respect to the data distribution. Using special allocation macros provided by SIMT, incorrectly allocated pages are placed on the nodes which perform the most accesses to them. Figure 6 shows the simulated execution time (in millions of cycles) for two transparent, unoptimized runs and an optimized execution on a 16-node system. It can be seen that the optimized version performs much better than the transparent executions. While First-touch introduces an average improvement of 5.3% over Round-robin, the optimized version achieves a performance gain of up to 26.4%.

Figure 6. Impact of static optimizations.

8 Conclusion

In this paper, we present SIMT, a simulation tool for NUMA architectures. The main feature of SIMT is its ability to model various memory systems and to provide comprehensive performance data about them. SIMT comprises a frontend, which simulates the parallel execution of programs, and a backend, which simulates the target architecture. The frontend is a memory reference generator based on Augmint. It captures events, like memory references and synchronization primitives, and delivers them to the backend. The backend consists of functionality representing the handling of these events on real hardware. As SIMT is intended primarily for research in the area of memory system design, the main components of the backend are a cache simulator modeling an arbitrary number of cache levels and several cache consistency protocols, a shared memory simulator modeling the distributed shared memory with a spectrum of data distribution policies, a monitor simulator modeling a hardware monitor, and a network mechanism modeling the data transfer across processor nodes.

For a verification of the observed results, the simulated execution on single processors as well as on clusters has been compared with an actual execution on real hardware. We found that to a large degree SIMT is capable of predicting the performance, lying within 20% of the real execution time. Performance trends and important metrics like cache counters, however, are simulated to a much higher degree of accuracy, enabling well-founded conclusions about the behavior of the simulated target architectures.

SIMT can be used for a large range of research targets, including the evaluation of computer hardware, the analysis of the parallel behavior of applications, the study of data allocation policies and cache coherence schemes, and the optimization of programs with respect to data locality. A few sample uses of SIMT have been demonstrated, showing the capabilities and efficiency of this simulation tool for both system design and application optimization research.
References
[1] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35(2):59-67, 2002.

[2] L. N. Bhuyan, R. Iyer, H. Wang, and A. Kumar. Impact of CC-NUMA Memory Management Policies on the Application Performance of Multistage Network. IEEE Transactions on Parallel and Distributed Systems, 11(3):230-251, March 2000.

[3] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl. PROTEUS: A High-Performance Parallel-Architecture Simulator. Technical Report MIT/LCS/TR-516, Massachusetts Institute of Technology, September 1991.

[4] H. Hellwagner and A. Reinefeld, editors. SCI: Scalable Coherent Interface: Architecture and Software for High-Performance Computer Clusters, volume 1734 of Lecture Notes in Computer Science, chapter 3. Springer-Verlag, 1999.

[5] S. A. Herrod. Using Complete Machine Simulation to Understand Computer System Behavior. PhD thesis, Stanford University, February 1998.

[6] R. Hockauf, W. Karl, M. Leberecht, M. Oberhuber, and M. Wagner. Exploiting Spatial and Temporal Locality of Accesses: A New Hardware-Based Monitoring Approach for DSM Systems. In Proceedings of Euro-Par'98 Parallel Processing / 4th International Euro-Par Conference, Southampton, UK, volume 1470 of Lecture Notes in Computer Science, pages 206-215, September 1998.

[7] H. C. Hsiao and C. T. King. MICA: A Memory and Interconnect Simulation Environment for Cache-based Architectures. In Proceedings of the 33rd IEEE Annual Simulation Symposium (SS 2000), pages 317-325, April 2000.

[8] IEEE Computer Society. IEEE Standard for the Scalable Coherent Interface (SCI). IEEE Std 1596-1992, IEEE, 345 East 47th Street, New York, NY 10017-2394, USA, August 1993.

[9] P. S. Magnusson and B. Werner. Efficient Memory Simulation in SimICS. In Proceedings of the 8th Annual Simulation Symposium, Phoenix, Arizona, USA, April 1995.

[10] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures. In Proceedings of the 1996 International Conference on Computer Design, October 1996.

[11] V. S. Pai, P. Ranganathan, and S. V. Adve. RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors. In Proceedings of the Third Workshop on Computer Architecture Education, February 1997.

[12] M. Schulz, J. Tao, J. Jeitner, and W. Karl. A Proposal for a New Hardware Cache Monitoring Architecture. In ACM SIGPLAN Workshop on Memory System Performance (MSP 2002), Berlin, Germany, June 2002.

[13] M. Schulz, J. Tao, and W. Karl. Improving the Scalability of Shared Memory Systems through Relaxed Consistency. In Proceedings of the Second Workshop on Caching, Coherence, and Consistency (WC3'02), New York, USA, June 2002.

[14] J. E. Veenstra and R. Fowler. MINT Tutorial and User Manual. Technical Report 452, University of Rochester, June 1993.

[15] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36, June 1995.

[16] Limes: An Execution-Driven Multiprocessor Simulation Tool for PC Platforms. http://galeb.etf.bg.ac.yu/davor/limes/.