Accelerating Full-System Simulation through Characterizing and Predicting Operating System Performance ∗ Seongbeom Kim, Fang Liu, Yan Solihin NC State University {skim16,fliu3,solihin}@ece.ncsu.edu
Ravi Iyer, Li Zhao Intel Corporation {ravishankar.iyer,li.zhao}@intel.com
William Cohen Red Hat [email protected]
Abstract
The ongoing trend of increasing computer hardware and software complexity has resulted in increased complexity and overheads of cycle-accurate processor system simulation, especially in full-system simulation which not only simulates user applications, but also the Operating System (OS) and system libraries. This paper seeks to address how to accelerate full-system simulation through studying, characterizing, and predicting the performance behavior of OS services. Through studying the performance behavior of OS services, we found that each OS service exhibits multiple but limited behavior points that are repeated frequently. OS services also exhibit application-specific performance behavior and largely irregular patterns of occurrences. We exploit these observations to speed up full system simulation. A simulation run is divided into two non-overlapping periods: a learning period in which the performance behavior of instances of an OS service is characterized and recorded, and a prediction period in which detailed simulation is replaced with a much faster emulation mode to obtain the behavior signature of an instance of an OS service, and its performance is predicted from the signature and records of the service's past performance behavior. Statistically-rigorous algorithms are used to determine when to switch between learning and prediction periods. We test our simulation acceleration method with a set of OS-intensive applications and a recent version of Linux OS running on top of a detailed processor and memory hierarchy model implemented on Simics, a popular full-system simulator. On average, the method needs the learning periods to cover only 11% of OS service invocations in order to produce highly accurate performance estimates. This leads to an estimated simulation speedup of 4.9×, with an average performance prediction error of only 3.2%, and a worst case error of 4.2%.
1 Introduction
The spectacular progress in computing technology in the last few decades has been enabled by transistor scaling, which increases the number of transistors that can be placed on a single chip. This increasing integration results in increasingly complex hardware, while the extra performance enables increasingly complex software. As a result, cycle-accurate processor system simulation has also become increasingly complex and incurs more overheads. Two popular types of cycle-accurate simulation models include full system simulation, which simulates the entire software stack including the application programs and the Operating System (OS), and application-only simulation, which simulates only the application programs running directly on the simulated processor system.
∗ This work is partially supported by NSF Awards CNS-0406306 and CCF-0347425, and gifts from Intel.
Application-only simulators, such as SimpleScalar [7], are often much faster than full system simulators, such as Simics [19], because they do not simulate the OS and system library layers. However, for many application classes such as web servers, system tools, network processing, and transaction processing, full system simulations are necessary for accurate performance estimates, because the applications use OS services heavily. To illustrate the necessity of full system simulation for such applications, in Figure 1 we show the L2 cache misses, execution time, and IPC estimates obtained from full system simulation normalized to those obtained from simulating only the application programs. OS-intensive applications are shown on the left set of bars, which include web server applications (ab-rand and ab-seq), Unix tools (find-od and du), and a network benchmarking tool (iperf). The right set of bars shows some applications from SPEC2000 (gzip, vpr, art, and swim). The figure shows that while the number of L2 cache misses in SPEC2000 applications is almost the same in both simulations, in OS-intensive applications the number of L2 cache misses obtained by full-system simulation can be as high as 405× compared to that obtained by application-only simulation. The execution time estimates for OS-intensive applications obtained by application-only simulation are also highly inaccurate. The full system simulation's execution time estimates are up to 126× higher than those obtained through application-only simulations. Finally, since OS code often exhibits very different instruction throughput (measured as instructions per cycle or IPC), the estimated combined IPC is also very different in the two simulations. In computer architecture evaluation, designers/researchers often compare the performance estimates of one design versus another. Hence, the absolute performance estimates of each design may matter less compared to the relative difference in performance estimates of different designs. Relative differences in performance estimates allow designers to extract performance trends, and weed out design changes that do not improve performance over the base design. We note that even for obtaining relative performance estimates, a full system simulation is often necessary. We illustrate this through Figure 2, which shows the speedup ratio obtained when increasing the L2 cache size from 512KB to 1MB. Each application shows two bars: the first bar shows the speedup ratio obtained by simulating only the applications (App Only) for both 512KB and 1MB L2 caches, while the second bar shows the speedup ratio obtained by simulating both the applications and the OS (App+OS) for both 512KB and 1MB L2 caches. The figure shows that while the speedup ratios for both simulations are similar for SPEC2000 applications, they are very different for OS-intensive applications. In fact, had one relied only on application-only simulations, he/she would have wrongly concluded that the performance benefits of 1MB L2 cache over 512KB L2 cache for the shown applications were negligible. However, the full system simulation clearly leads to a different conclusion that using a 1MB L2 cache gives significantly better performance than using a 512KB L2 cache (up to 2.03× speedup for iperf).
(Figure 1 panels: Normalized L2$ Misses (log2 scale), Normalized Execution Time (log2 scale), and Normalized IPC, for ab-rand, ab-seq, du, find-od, iperf, gzip, vpr, swim, and art.)
Figure 1. The L2 cache misses, execution time, and IPC of simulating application programs and OS, normalized to those obtained from simulating just the applications. The benchmarks and evaluation parameters are described in more detail in Section 5.
Figure 2. Speedup ratios obtained using a 1MB L2 cache over using a 512KB L2 cache, using application-only simulations versus using full-system simulations.
Overall, both Figure 1 and Figure 2 point out that a full system simulation is necessary to produce quality (absolute and relative) performance estimates, especially for OS-intensive applications. For these applications, a large fraction of instructions executed are from the OS rather than from the applications themselves (e.g. 67% – 99% in the applications we tested). Consequently, while a full system simulation is necessary for accurate performance estimates, it is also much slower compared to application-only simulation. To tackle this accuracy and speed tradeoff of full system simulation, this paper seeks to provide a scheme that significantly accelerates full-system simulation. We believe one requirement for such a scheme is that it must be able to separate the performance of OS services from the application, and from each other. This separation enables deriving not just the execution time of the total execution, but also more meaningful statistics such as the fraction of execution due to OS services, the relative contribution of different OS services to execution time, and their performance characteristics. Our scheme is based on the observation that OS services such as system calls and interrupt handling often exhibit performance behavior repetitiveness. This repetitiveness is understandable because each type of OS service is designed to provide one specific functionality. However, unlike a phase which shows a single behavior point, we found that most OS services exhibit multiple behavior points, and the patterns of their occurrence are quite irregular depending on the applications and environments. This observation is due to the fact that the behavior of an OS service is not only determined by
the parameters passed by the application, but also by the state of the service handler itself and by the environment. Based on the study of OS service performance characteristics, we propose a statistically-rigorous learning technique that allows us to capture most OS service behavior points by sampling and characterizing only a small fraction of an OS service's occurrences. We apply this learning strategy to accelerate full-system simulation. First, using statistical analysis, the learning strategy chooses when to take samples and how many samples are taken in order to ensure that all important behavior points of an OS service are captured and characterized. The characterization includes the number of instructions executed, IPC, and various cache miss rates. These characteristics are then clustered and each cluster is indexed by a signature. Once learning is complete, future occurrences of the OS service are no longer fully simulated. Rather, each is profiled to obtain its signature. The signature is then searched for a match against one of the clusters that were collected during learning, and its performance characteristics are predicted from the matching cluster. Since signature profiling can be performed in an emulation mode, i.e. without involving processor and memory hierarchy timing models, most occurrences of OS services can be fast-forwarded in emulation mode and the full-system simulation can be greatly accelerated. Overall, the contributions of this paper are: performance characterization of OS services, a statistically-rigorous learning and re-learning strategy that captures OS service behavior through sampling only a fraction of its occurrences, and performance prediction techniques for accelerating full-system simulation. We test our simulation acceleration method with a set of OS-intensive applications and a recent version of Linux OS (kernel version 2.6.13 on Fedora Core4 distribution), running on top of a detailed processor and memory hierarchy model implemented on Simics, a popular full-system simulator. On average, the method only requires fully simulating 11% of OS service invocations, providing 89% prediction coverage, in order to produce highly accurate performance estimates. This leads to an estimated simulation speedup of 4.9×, with an average performance prediction error of only 3.2%, and a worst case error of 4.2%. The rest of the paper is organized as follows. Section 2 describes related work. Section 3 presents our contribution on characterizing the performance behavior of OS services, while Section 4 presents our learning strategy and performance prediction schemes. Section 5 and Section 6 present the evaluation environment, validation, and evaluation. Finally, Section 7 concludes the paper.
2 Related Work
There have been many studies that try to accelerate cycle-accurate processor simulations [8, 9, 10, 12, 13, 15, 16, 18, 22, 23, 24, 25, 26, 27, 28, 29]. In general, the acceleration techniques
can be categorized into reduced input set, statistical simulation, and sampling-based simulation. An example of reduced input set simulation is MinneSPEC [11], which provides carefully designed input sets that allow shorter simulations while still preserving the performance characteristics of larger input sets. Statistical simulation has also been proposed, for example in [20, 21]. The main idea is to collect statistical profiles from a single detailed simulation of a program and then use the information to generate a synthetic instruction trace that is much shorter than the original program. In sampling-based simulations, only samples of the dynamic instructions are simulated. One very simple sampling method was to skip the initialization phase of a program and simulate the next few billion instructions. More statistically rigorous sampling methods have also been proposed. One popular class of such methods is phase-based sampling (e.g. SimPoint), in which a program execution is divided into intervals which exhibit uniform performance behavior called phases. Since the program performance behavior is uniform within a single phase, a subset of instructions from that phase can be selected for simulation to represent the entire phase. Phases can be constructed from fixed-length intervals [8, 15, 22, 23, 25, 26], or from intervals that correspond to natural code boundaries such as subroutines and loop structures [12, 13, 17, 18]. Another type of sampling-based simulation approach relies on random or systematic selection of samples (e.g. SMARTS), rather than relying on phases [10, 28, 29]. In these studies, detailed simulation is only performed on samples, and it is fast forwarded between samples. The size and quantity of samples are selected in such a way as to satisfy a target degree of confidence. While the method proposed in this paper uses a sampling-based approach, there are important differences between our method and other sampling-based simulation acceleration approaches [8, 10, 12, 13, 15, 18, 22, 23, 25, 26, 28, 29]. First, note that one requirement for accelerating the simulation of OS-intensive applications is that it must be able to separate the performance of OS services from the application, and from each other. This separation enables deriving not just the execution time of the total execution, but also more meaningful statistics such as the fraction of execution due to OS services, the relative contribution of different OS services to execution time, and their performance characteristics. While other sampling-based methods can be extended and applied to full-system simulation, they do not have the ability to separate the performance characteristics of OS services from one another and from the application. In contrast, our scheme is designed to provide such separation, by defining the samples at system call boundaries, and by keeping track of the different types of OS services invoked by the system calls. In addition, most existing sampling-based simulation acceleration approaches are designed for accelerating application-only simulation. The performance behavior of OS services is very different from that of application programs.
While an application program's performance is largely influenced only by its own code characteristics, the performance of an OS service is affected by four components: its own code characteristics, the parameters it receives through interacting with the application program, the state it maintains as a result of its previous invocations, and external factors such as asynchronous events like timer and I/O interrupts, or the load of the system at the time the application runs. Consequently, acceleration approaches that require offline analysis, or samples that are determined in one run [8, 10, 12, 13, 15, 18, 22, 23, 25, 26, 28, 29] but are applied in another run, cannot take into account the variation of OS performance across different runs, and hence are less applicable for accelerating OS-intensive applications. In contrast, in our method
samples are collected, recorded, and used for prediction during live runs. Finally, we would like to point out that our method is orthogonal to application-only techniques, because it can be combined with them by letting them handle only the application code execution, while our method handles acceleration of the simulation of OS services. It is also complementary: by decoupling the simulation of application code and OS services at the boundary of user/kernel mode switches, application-only techniques can be made more accurate, since the performance of application execution is then determined only by its own code and machine characteristics.
3 Characterizing the Performance of OS Services
This section describes this paper’s contribution on characterizing the performance patterns of OS services and discusses the implications for designing full-system simulation acceleration techniques. First, we define an OS service as a specific type of system call or interrupt handling in the privileged kernel mode. We also define an OS service interval as a contiguous group of dynamic instructions starting from when a system call or interrupt causes a mode switch to the kernel until right before it returns to the user mode. These definitions imply that all instructions executed in the non-privileged mode are considered as parts of the application code. While some instructions executed in non-privileged mode help interface with the respective OS service such as setting up OS stack and saving contexts, we found these instructions to be relatively few compared to the OS service code executed in the privileged mode. We also treat system libraries that are primarily implemented in the user mode such as memory allocation functions as a part of the application. However, we note that our technique can be easily extended to cover them. OS services can be synchronous if they are directly or indirectly invoked by the application or asynchronous if they are invoked due to factors external to the application. One example of a synchronous OS service is sys read() which is invoked when an application calls a C library function fread(). Another example includes exceptions that occur due to executing instructions of the application, such as floating-point exceptions, page faults, and segmentation faults. Examples of asynchronous OS services include interrupts due to I/O availability, timer interrupts, DMA transfers, etc. Our characterization of OS services is performed on every mode switch, so it includes both synchronous and asynchronous OS services. Each OS service interval defines a natural execution boundary between the application and OS services. On each mode switch, the information of the type of OS service invoked is easily available. For example, when sys enter occurs, the type of service can be read from a register, such as the register EAX in x86 architecture. We note that in Linux 2.6.x, there are more than 200 types of OS services defined in its system call table. Some of the OS services may overlap with others, such as when a system call handling invokes another system call of a different type. While it is possible to identify such an overlap and break an interval into smaller nonoverlapping ones, we choose a simpler approach that determines the type of an OS service based on the event that initially causes the transition to the privileged mode. All other OS services triggered afterward are considered as extensions of the initial OS service. To characterize the performance of OS services, we collect the statistics of OS services in terms of execution time and Instruction Per Cycle (IPC) in a full-system simulation and show them in Figure 3. Each bar in the figure shows the average number of cycles and its range, defined as the average plus minus its standard deviation. The white bars are for ab-rand benchmark while the gray bars
(OS services shown in Figure 3: sys_gettimeofday, sys_close, sys_fcntl64, sys_open, sys_ipc, sys_read, sys_poll, sys_socketcall, sys_stat64, sys_writev, sys_write, Int_121, Int_239, and Int_49.)
Figure 3. The average and range (= average ± standard deviation) of the number of simulated cycles (a) and IPC (b) for different OS services, for ab-rand and ab-seq benchmarks (Section 5). Only OS services that are invoked more than once are shown. The IPC is computed as the number of x86 instructions committed per cycle.
are for ab-seq benchmark. The figure shows that on average, each OS service involves a few thousands to a few tens of thousands of instructions, while its IPC is typically quite low and ranges from 0.09 to 0.47. This observation implies that fixed-length intervals should not be used to record and characterize OS services because they would fail to correspond to the OS service boundaries. The figure also points out that the OS service intervals we choose from the natural boundary of mode switches are already very fine-grain. Comparing the different OS services, the figure shows that different OS services show different behavior points, characterized by unique average execution time and IPC. This observation implies that it is necessary to characterize each OS service type separately. Another observation is that the average behavior is also quite different from one benchmark to another, indicating that the interaction of the applications and OS affect the performance of OS services. This implies that prediction of OS service performance cannot be based on offline analysis, and instead should be based on dynamic profiling. Finally, we note that the behavior variation range is very high for most OS services. This indicates that each OS service shows not one but multiple behavior points that highly differ from one another. Next, we focus on sys read, one of the most frequently invoked OS services, and plot its behavior in terms of execution time over all occurrences in Figure 4. The figure shows that despite being designed to perform a very specific functionality of reading from a file buffer, sys read’s execution time behavior is not uniform across invocations. The variation is very high from one invocation to the next, ranging from about 2,000 cycles to 50,000 cycles. Much of the variation is the result of the OS service handler that goes through multiple execution paths, where different paths are chosen according to input parameters from the application program, as well as current state of the sys read handler. For example, sys read may take a shorter path if data to be read has already existed on a buffer. Otherwise, it may go through a different execution path to request data transfer from the disk, or trigger page faults if a buffer needs to be allocated, etc. At this point, it may be tempting to further divide up an OS service interval into multiple intervals so that each smaller interval has uniform behavior. However, we note that our OS service interval is already very fine-grain, so going to smaller intervals (hundreds to a few thousands of instructions) would introduce complexities in defining boundaries between intervals and tracking them, as well as inaccuracy in characterizing an interval’s performance because its performance would depend much more on vari-
ous processor pipeline and cache states, which can only be tracked through time-consuming processor and cache simulation models. Another observation is that it appears that there is only a limited number of distinct behavior points that are repeated. This suggests that characterizing a small fraction of all invocations may be sufficient to capture all or most behavior points. However, the selection of samples would greatly affect the ability to capture all behavior points through characterizing only the samples. The next question we need to deal with is whether each behavior point, characterized by a unique execution time, can be identified with a simple signature. This signature would be needed after sufficient samples have been characterized and the learning period is completed. The signature of each OS service will be used to predict its performance behavior point. If faster simulation is to be achieved, a signature must be captured without detailed simulation, otherwise signature profiling itself would be very time consuming. More specifically, we require that a signature should be captured in emulation mode without involving functional/timing processor and memory hierarchy models. Therefore, processor or cache statistics such as branch misprediction rate, IPC, and cache miss rates cannot be used as a signature. So we turn to the number of dynamic instructions as a candidate for a signature. Figure 5 plots the number of occurrences of sys read OS service based on their number of instructions and cycles. For the figure, we divide up the number of instructions into bins of 1000 instructions, and divide up the execution time into bins of 4000 cycles. The area of a bubble is proportional to the number of occurrences of each non-empty instruction bin. The figure corroborates the observation in Figure 4 that there are multiple (but few) behavior points that are repeated often, shown by the absence of bubbles at most instruction and cycle bins. In addition, the figure clearly shows that for most cases, given an instruction bin, the number of cycles executed is clustered within a relatively narrow range, shown as few large bubbles rather than many small ones. Hence, the number of instructions is a promising metric on which signatures that identify distinct performance behavior points can be built. We also note that other metrics such as the mix of instructions, branch history, or Basic Block Vector [25] may also serve as good bases for constructing signatures. However, since instruction-based signatures already give a high prediction accuracy, we leave this exploration for future work.
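As a hedged illustration of the binning just described (1000-instruction and 4000-cycle bins), the short sketch below builds the occurrence counts behind a bubble plot such as Figure 5; the function name and the sample data are ours, not taken from the paper.

    from collections import Counter

    def bubble_histogram(samples, instr_bin=1000, cycle_bin=4000):
        """samples: (num_instructions, num_cycles) pairs for one OS service.
        Returns occurrence counts per (instruction bin, cycle bin)."""
        bins = Counter()
        for instr, cycles in samples:
            bins[(instr // instr_bin, cycles // cycle_bin)] += 1
        return bins

    # Hypothetical invocations clustered around two behavior points.
    data = [(2100, 9000), (2200, 9500), (2050, 8800), (9800, 41000), (9900, 40500)]
    print(bubble_histogram(data))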
Figure 4. The execution time of sys read system call at different invocations for benchmarks ab-rand (a), and ab-seq (b).
Figure 5. The bubble histogram of different behavior points of sys read for ab-rand (a), and ab-seq (b). The area of a bubble is proportional to the number of occurrences that falls into the instruction and cycle bins located at the bubble’s center.
4 Acceleration of Full-System Simulation
This section describes our full-system acceleration method in details, starting with assumptions, definition of performance behavior signatures, learning and re-learning mechanisms, and prediction mechanisms.
4.1 Assumptions
We assume that the full system simulator is able to dynamically switch between detailed simulation mode and fast emulation mode. In addition, we assume that kernel-mode pre-emptions of OS services are rare, and contention for locks and semaphores in OS services is relatively steady. We note that our web server benchmarks have multiple concurrent threads and may already incur some lock/semaphore contention and OS service pre-emptions. Since our OS service boundary is delimited by user/kernel mode switches, our learning captures the OS lock/semaphore contention level dynamically. If the contention level changes, a new OS service signature would result and trigger re-learning that captures the new contention level. As long as the contention level does not change abruptly too often, our mechanism would capture it. We leave the case in which there are frequent abrupt changes in contention level and frequent kernel pre-emptions as future work.
4.2 Deriving Performance Behavior Signatures
Section 3 has shown that different instances of an OS service exhibit different behavior points with patterns that are specific to the
application, but each behavior point can be identified by its number of instructions. To construct an appropriate signature, we would like to distinguish instances from distinct performance behavior points, while grouping together instances from a single performance behavior point. A single performance behavior point likely manifests in instances with similar numbers of instructions, and hence a good signature must be able to group small variations in the number of instructions into a single cluster. One option is to use fixed-sized instruction bins as clusters. The size of the bins must be carefully selected such that they are not so big that multiple behavior points are clustered into a single bin, yet not so small that a single behavior point is fragmented into several bins. Bins that are too small would result in longer learning to fill in all the bins, and frequent signature mismatches which reduce prediction coverage. In practice, we found that fixed-sized bins result in bins that are often too large for OS service instances with a low number of instructions, but at the same time too small for instances with a high number of instructions. As a result, we use scaled clusters, whose sizes scale proportionally to the number of instructions of the bin. A scaled cluster has two components: a centroid which represents the center of the cluster, and a range which specifies the minimum and maximum number of instructions that are considered to fall in the cluster. We compute the centroid of a cluster as the arithmetic mean of the instruction counts of all instances that belong to the cluster, and the range as the centroid ± 5%. An instance is considered
to match the cluster when its number of instructions falls within the range of a cluster. Since clusters’ ranges may overlap, an instance may match several clusters, and we choose the cluster with the closest centroid as the best match for that instance. If a new instance is added to a cluster, the cluster’s centroid and range are updated accordingly. To measure the ability of clusters to identify distinct performance behavior points, we measure a cluster’s coefficient of variation (CV), a metric for evaluating uniformity of clusters in phase analysis [14]. CV of a cluster is computed as the standard deviation divided by the average. Figure 6 compares the CVs of the execution time (a) and IPC (b) of several benchmarks. For each benchmark, the first bar is the CVs when no clustering is performed, i.e. an OS service is treated as a single big cluster, while the second bar shows CVs when the performance points are grouped using scaled clusters. The CVs are averaged across all OS services that occur during simulation. The figure shows that the variations of execution time and IPC are significantly reduced when scaled clusters are used. On average, the coefficient of variation for the execution time drops by 4.7× from 0.72 to 0.15, whereas because IPC variation is small to begin with, the coefficient of variation drops from 0.13 to 0.08. Overall, this demonstrates that scaled clusters are able to group OS service instances with similar behavior.
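To make the clustering concrete, the following is a minimal sketch (our own illustration, not the simulator's actual code) of how scaled clusters could be maintained; the class and function names are hypothetical, while the ±5% range and closest-centroid matching follow the description above.

    # Sketch: scaled clusters keyed by instruction count, each with a centroid
    # and a +/-5% matching range, as described in Section 4.2.
    class ScaledCluster:
        def __init__(self, instr_count, perf):
            self.counts = [instr_count]         # instruction counts of member instances
            self.centroid = float(instr_count)  # arithmetic mean of member counts
            self.perf = dict(perf)              # recorded metrics, e.g. {'cycles': ..., 'l2_misses': ...}

        def in_range(self, instr_count):
            return 0.95 * self.centroid <= instr_count <= 1.05 * self.centroid

        def add(self, instr_count):
            self.counts.append(instr_count)
            self.centroid = sum(self.counts) / len(self.counts)

    def match_cluster(plt_entries, instr_count):
        """Return the cluster whose range contains instr_count, breaking ties by
        the closest centroid; return None if the instance is an outlier."""
        candidates = [c for c in plt_entries if c.in_range(instr_count)]
        if not candidates:
            return None
        return min(candidates, key=lambda c: abs(c.centroid - instr_count))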
4.3 Initial Learning Mechanism

The goal of the learning periods is to capture most performance behavior points of each OS service. During an initial learning period, the performance characteristics of each instance of an OS service, such as the number of cycles and cache miss rates, are recorded in a Performance Lookup Table (PLT). Each OS service has its own PLT, and a PLT entry contains a scaled cluster and various performance metrics (e.g. cycles and miss rates). During learning, if an OS service instance cannot match existing clusters in the PLT or if the PLT is empty, a new cluster has been found, and a new PLT entry is created. However, if it matches a cluster in the PLT, it is added to the cluster and the cluster's statistics are updated.

One central challenge is that the learning periods should be short enough to provide a large prediction coverage (the fraction of invocations that are predicted), but not so short that we sacrifice prediction accuracy (due to recording too few behavior points and subsequent signature mismatches). We define the learning window as the number of contiguous invocations/instances of an OS service that are fully simulated and characterized. Each OS service has its own learning window that can start or end at different times compared to other services. Since the initial learning window size must be determined a priori, we have to choose it statically.

To determine a good initial learning window statically, we rely on statistical analysis. First, we denote the probability of occurrence (PO) of a cluster x as p_x. p_x can be computed as the number of occurrences of cluster x divided by the total invocations of the OS service. Suppose now that we would like to confidently capture the cluster x in the initial learning window only if its probability of occurrence is not too small, since if x rarely occurs, chances are we do not need to predict it accurately since its contribution to the overall performance is small. Therefore, we define a minimum probability of occurrence threshold denoted by p_min. Our goal is to find the learning window N such that we would capture all clusters whose probability of occurrence is equal to or larger than p_min, with a certain degree of confidence. To model this problem, we first assume that all the occurrences are independent and identically distributed, which is a common assumption referred to as i.i.d. in statistical analysis. This assumption implies that for each occurrence of a particular OS service, the probability of that occurrence to exhibit cluster x's behavior is p_x, and the probability that it exhibits behavior of other clusters is 1 − p_x. Hence, given the learning window of N, the probability that cluster x occurs k times is denoted by a binomial probability [5] of:

P_{occur}(N, k, x) = \binom{N}{k} p_x^k (1 - p_x)^{N-k}    (1)

In order to capture x during learning, x must occur at least once during N occurrences, and such a probability is denoted and computed as:

P_{occur}(N, k \ge 1, x) = \sum_{k=1}^{N} \binom{N}{k} p_x^k (1 - p_x)^{N-k}    (2)

We require that the probability that, during a learning window N, cluster x appears at least once should be higher than a degree of confidence (DoC). Hence, with p_min as the minimum probability of occurrence, our goal is to find N such that

\sum_{k=1}^{N} \binom{N}{k} p_{min}^k (1 - p_{min})^{N-k} \ge DoC    (3)

Figure 7 plots the relationship between p_min and N for 95% and 99% degrees of confidence. From experiments, we found that the prediction accuracy remains high when we set p_min to 3%, i.e. we attempt to detect all clusters with a probability of occurrence of 3% or higher during the initial learning period. The figure shows that at 3%, the learning window needs to be 100 for a 95% degree of confidence, and a little over 150 for a 99% degree of confidence. If an OS service occurs more than one thousand times, which is the case for a few of them in our benchmarks, an initial learning window of 100 gives us more than 90% prediction coverage.
Figure 7. The initial learning window (number of required trials) needed in order to capture all clusters of which their probabilities of occurrences are equal to or higher than the minimum probability of occurrence, given 95% or 99% degrees of confidence.
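The curve in Figure 7 follows directly from Equation 3. The sketch below (our own illustration, not the authors' tool) searches for the smallest learning window N for a given p_min and degree of confidence; the reported values are consistent with the numbers quoted above.

    from math import comb

    def min_learning_window(p_min, doc, n_max=1000):
        """Smallest N such that a cluster with probability of occurrence p_min
        appears at least once during N invocations with confidence >= doc
        (Equation 3; equivalently 1 - (1 - p_min)**N >= doc)."""
        for n in range(1, n_max + 1):
            p_at_least_once = sum(comb(n, k) * p_min**k * (1 - p_min)**(n - k)
                                  for k in range(1, n + 1))
            if p_at_least_once >= doc:
                return n
        return None

    print(min_learning_window(0.03, 0.95))  # roughly 100, as quoted for 95% confidence
    print(min_learning_window(0.03, 0.99))  # a little over 150 for 99% confidence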
4.4 Re-Learning Mechanisms
In Section 4.3, we have derived the initial learning window size required to capture most clusters of an OS service with a certain degree of confidence, under an i.i.d. assumption. However, the i.i.d. assumption is not always valid in practice, for two reasons. First, the identically-distributed assumption is not accurate for the first few
Figure 6. Coefficient of variation (CV) of execution time (a) and IPC (b), when all points from each OS service are considered as a single cluster (first bar for each benchmark) versus when they are grouped into scaled clusters (second bar).
times an OS service is invoked, due to initialization efforts. For example, the first invocation of an OS service may allocate and initialize a buffer needed for subsequent invocations, so it performs much more work than subsequent invocations. In addition, due to cold cache effects, the number of cache misses and execution time are not indicative of those of future invocations. We deal with this problem by not starting the learning period of each OS service until it has occurred five times, at which time we empirically estimate that the initialization effects are completed. This delayed initial learning boosts accuracy without reducing the prediction coverage by a noticeable amount. Secondly, the assumption of independent occurrence of clusters is also not accurate in cases where the behavior points occur in a non-random fashion. If these non-random patterns repeat in a small number of invocations, it is not a concern because chances are they would occur during initial learning and be fully captured. What the initial learning cannot handle well is when certain behavior points with high probability of occurrence do not occur at all during the initial learning window, such as the case for sys read in the ab-seq benchmark (Figure 4b). To handle this, even during prediction periods, we need to watch out for an unusual number of signature mismatches, which may indicate new clusters that have not been captured during the initial learning. We define an outlier to be an OS service occurrence whose signature does not match any entry in the current PLT. We design and evaluate four different strategies for dealing with outliers. In the first approach (Best-Match strategy), when an outlier is observed, we select the cluster from the PLT whose signature is the closest to the outlier, and use the cluster's performance characteristics to predict the performance of the outlier. Best-Match does not invoke any re-learning, so it provides the highest coverage but sacrifices the ability to learn new clusters. An alternative strategy is Eager re-learning, which considers a signature mismatch as a sign that a new cluster has just appeared. It first finds the best match in the PLT to predict the performance of the current outlier, then it immediately triggers re-learning. The re-learning window is the same as the initial learning window size, i.e. 100 instances for a 95% degree of confidence. The Eager scheme provides the capability of capturing new behavior clusters and results in good accuracy. However, there are some behavior clusters that overall occur very infrequently and do not need to be predicted accurately. Each time one of these clusters occurs, re-learning is triggered unnecessarily. As a result, prediction coverage may be reduced significantly compared to the Best-Match strategy.
A compromise strategy between Best-Match and Eager would be a Delayed strategy in which, when an outlier cluster is found, we wait until the cluster appears a few more times before triggering re-learning. In experiments, we wait until an outlier cluster occurs four times before we trigger re-learning. The fourth strategy is Statistical re-learning, which tries to strike a balance between the accuracy of Eager and the coverage of Best-Match through statistical criteria. With the Statistical strategy, instead of triggering re-learning when a single outlier occurs, we delay re-learning until we are confident that the outlier is from a cluster y which has a probability of occurrence higher than the minimum occurrence probability p_min (3% in our case). To achieve that, when an outlier is first observed, a special entry marked as an outlier cluster entry is created in the PLT. Note that unlike regular PLT entries, outlier cluster entries do not have performance numbers associated with them because they have not been captured through full-system simulation. When another outlier occurs, we try to match it with any outlier cluster entries, and if we find a match, we increment a match counter of the entry. An outlier cluster entry also maintains a list of estimated probabilities of occurrence (EPOs). Upon the match, we add a new EPO to the list, computed by dividing the number of times the outlier cluster y has occurred within a moving window of the last W invocations of the OS service by W (W = 100 in our experiments). Each subsequent match produces its own EPO with a unique moving window, thereby creating a list of EPOs. For an OS service, each EPO in the EPO list for an outlier cluster y is a single estimate of the true probability of occurrence of cluster y. We do not want to trigger re-learning until we are sure that the outlier cluster y has a probability of occurrence larger than p_min. Hence, we need a reliable way to test the hypothesis that the true probability of occurrence of y is higher than p_min. Let us denote the true probability of occurrence of y as µ_y, and the EPOs on the list as p_y^1, p_y^2, p_y^3, ..., p_y^m, assuming there are m EPOs that have been collected for cluster y. Each p_y^j is a point estimate of µ_y. From the EPOs of an outlier cluster y, we can compute the average and standard deviation of the EPOs as:
\bar{p}_y = \frac{1}{m} \sum_{j=1}^{m} p_y^j    (4)

S_{p_y} = \sqrt{ \frac{1}{m-1} \sum_{j=1}^{m} \left( p_y^j - \bar{p}_y \right)^2 }    (5)
Let us assume that p_y^1, p_y^2, p_y^3, ..., p_y^m are independent random variables that are normally distributed with unknown true mean and variance. Consequently, we would need to use the Student's t-distribution [6] to estimate how far \bar{p}_y is from the true mean of cluster y (i.e. µ_y). If we choose a degree of confidence of 1 − α, then:

P\left( \frac{\mu_y - \bar{p}_y}{S_{p_y} / \sqrt{m}} < t_{(m-1,\alpha)} \right) = 1 - \alpha    (6)

\iff P\left( \mu_y < \bar{p}_y + t_{(m-1,\alpha)} \frac{S_{p_y}}{\sqrt{m}} \right) = 1 - \alpha    (7)

where t_{(m-1,\alpha)} is obtained from the Student's t-distribution table. Plugging in a 95% confidence interval (α = 5%), µ_y is upper-bounded by B_y, where:

B_y = \bar{p}_y + t_{(m-1,0.05)} \frac{S_{p_y}}{\sqrt{m}}    (8)
Equation 8 provides a clue as to whether the true probability of occurrence of cluster y is 3% or higher. If B_y < 3%, then we are at least 95% confident that cluster y has a true probability of occurrence lower than 3% and thus re-learning should not be triggered. However, if B_y ≥ 3%, then we are less than 95% confident that cluster y has a true probability of occurrence lower than 3%, and in this case we conservatively trigger re-learning. One caveat is that B_y is only statistically meaningful if we have several EPOs. Hence, for the Statistical strategy, we wait until we have four EPOs before we decide whether to trigger re-learning. Consequently, the earliest time re-learning is triggered is when an outlier cluster has occurred four times. When re-learning is triggered, we clear out all outlier cluster entries. Another caveat is that the degree of freedom m − 1 is only approximated. This is because several EPOs may overlap and thus may not be independent samples. It is likely that the degree of freedom is overestimated, leading to re-learning being triggered more frequently than it should be.
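The decision rule of Equations 4-8 can be summarized in a few lines; the sketch below is our own illustration (it uses scipy for the t-distribution lookup and assumes, as described above, that at least four EPOs have been collected before deciding).

    from math import sqrt
    from statistics import mean, stdev
    from scipy.stats import t  # Student's t-distribution table lookup

    def should_relearn(epos, p_min=0.03, alpha=0.05):
        """Decide whether an outlier cluster warrants re-learning (Equations 4-8).
        epos: estimated probabilities of occurrence, one per moving window."""
        m = len(epos)
        if m < 4:                    # wait until at least four EPOs are available
            return False
        p_bar = mean(epos)           # Equation 4
        s_p = stdev(epos)            # Equation 5
        b_y = p_bar + t.ppf(1 - alpha, df=m - 1) * s_p / sqrt(m)  # Equation 8
        # If we cannot rule out a true probability of occurrence >= p_min,
        # conservatively trigger re-learning.
        return b_y >= p_min

    print(should_relearn([0.01, 0.02, 0.05, 0.04]))  # True: B_y exceeds 3%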
4.5 Performance Prediction Mechanism
In the prediction periods, each OS service invocation is fast forwarded in an emulation mode, and its signature (the number of instructions), obtained at the end of the invocation, is used to search for a matching cluster in the PLT. When a matching cluster is found, the performance characteristics of that invocation, such as the number of cycles and cache misses, are predicted from those of the cluster. If no matching cluster is found, its performance is predicted by the cluster with the closest centroid, and re-learning may be triggered (Section 4.4). Through characterizing the performance of OS services, we can predict the number of cache misses suffered by each instance of the OS service. These cache misses already include the influence of the application program on the OS performance due to cache space contention and data sharing. However, the application program is also influenced by the OS services because parts of its working set may be replaced by the OS services. To account for this effect, if the predicted number of misses suffered by the OS service is M, then at the end of the OS service invocation, we replace M cache lines that belong to the application program. To determine the victim lines, we assume that the cache pollution due to OS services is uniformly distributed across cache sets. Hence, we iteratively select a set (using a uniformly-distributed random-number generator) from which a victim line is selected, preferring first an invalid cache line, then the valid least-recently used line, and only then more recently used lines.
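The victim selection just described could be approximated as in the following sketch; the cache data structure here is a simplified stand-in (our own, not Simics' g-cache-ooo internals), with each set ordered from LRU to MRU and lines tagged as application-owned or not.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Line:
        valid: bool = False
        owned_by_app: bool = False

    @dataclass
    class Cache:
        num_sets: int
        assoc: int
        sets: list = field(default_factory=list)
        def __post_init__(self):
            self.sets = [[Line() for _ in range(self.assoc)]
                         for _ in range(self.num_sets)]

    def pollute_cache(cache, predicted_misses):
        """Evict predicted_misses application-owned lines, one per randomly chosen
        set, to model OS-induced cache pollution; lines are ordered LRU -> MRU."""
        for _ in range(predicted_misses):
            lines = cache.sets[random.randrange(cache.num_sets)]
            # Prefer an invalid line, then the least-recently used application line.
            victim = next((l for l in lines if not l.valid), None) or \
                     next((l for l in lines if l.owned_by_app), None)
            if victim is not None:
                victim.valid = False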
5 Evaluation Environment
5.1 Simulation Environment
To evaluate our scheme, we use Simics [19], a full-system simulator that can boot and run unmodified workloads and OS. The OS that we choose is RedHat Linux's Fedora Core4 [2] on an x86 machine model (dredd model). After the installation, the kernel is upgraded to Linux kernel version 2.6.13.2, downloaded from [4], to reflect recent changes in various kernel components. We use the x86 Instruction Set Architecture for both the OS and the applications. We use a cycle-accurate execution-driven timing simulation by attaching processor and cache models to the Simics functional emulator. The processor model simulates an out-of-order superscalar architecture and is based on Simics' sample-micro-arch-x86 module. The cache model supports multiple outstanding transactions and detailed timing modeling and is based on Simics' g-cache-ooo module. Over the base processor and cache models, we add statistics collection and other instrumentation code. For the processor model, we choose parameters similar to those of Intel Pentium4 systems. The processor has a 4GHz frequency, 4-wide out-of-order issue, and can retire up to three x86 instructions per cycle. The pipeline can hold 126 in-flight instructions, with a branch misprediction penalty of 10 cycles. The L1 instruction cache is a write-back 2-way associative cache, with a size of 16KB. The L1 data cache is a write-back 4-way associative cache with a size of 16KB and a hit latency of 2 cycles. The L2 cache is unified with 8-way associativity, 1MB size, and 8-cycle hit latency. All caches use an LRU replacement policy and have a 64B block size. The memory access latency is 300 cycles, equivalent to 75ns on a 4GHz processor. The memory bus is a split-transaction bus that is 8 bytes wide and runs at an 800MHz frequency, for a total peak bandwidth of 6.4 GB/sec.
5.2 Benchmarks
We choose benchmarks from several categories. The first category is a Web Server. Upon request from web clients, a web server may spawn and dispatch threads that retrieve the requested file and send it to the clients. We choose Apache [1] as the web server. Since the behavior of Apache would heavily depend on the patterns of requests from the clients, the selection of the client workload is quite important. For example, the ab workload simulates client requests to a single page, hence it lacks diversity. To increase diversity, we modify the ab workload to generate two new workloads. One workload (ab-rand) generates random requests to multiple pages. Among pages with different sizes, a random number is generated to pick a page. Then, eight concurrent requests are sent to the page. This process is repeated until the specified number of requests has been sent. ab-rand represents the worst case in terms of request predictability and is closer to a real web client workload. Another workload (ab-seq) generates sequentially changing requests to multiple pages. It consecutively sends an equal fraction of requests to a page (eight at a time) before switching requests to the next page. This step is repeated until it sends all requests. The order of requested pages is sorted so that the size of pages keeps increasing. Note that while ab-seq is less realistic compared to ab-rand, we choose it in order to stress our re-learning schemes, because the change of patterns represents the worst case for our initial learning mechanism. For evaluation, the first 300 HTTP requests are skipped and the next 300 HTTP requests for ab-rand and 700 HTTP requests for ab-seq are simulated (including learning and prediction periods). Finally, for both workloads, there are eight different text files on the HTTP server requested by the client, whose sizes range from 104KB to 1.4MB.
The second category of benchmarks is Unix Tools. One benchmark is find, a Unix tool to search the directory tree rooted at each given file name by evaluating the given expression. For each file that is found, it may invoke other programs. In our case, od is executed for each file to dump the content of the file in octal format. The shell command we use is 'find /usr -type f -exec od {} \;', which searches '/usr' and subdirectories in the Linux file system. Another benchmark we use is du, which is another Unix tool that summarizes the disk usage of each file in a directory or its subdirectories. The shell command we use is 'du -h /usr'. For both workloads, we skip 300 million instructions and simulate 600 million instructions. The final category of benchmarks is Network Benchmarking Tools. We choose iperf [3], which measures maximum TCP bandwidth, allowing the tuning of various parameters and UDP characteristics. iperf reports bandwidth, delay jitter, and datagram loss. We change iperf to print the number of socket writes at the client side. The first 4096 socket writes are skipped and the next 4096 socket writes to the server are simulated. For the SPEC2000 results in Figures 1 and 2, we use reference input sets, skip the first 2 billion instructions, and simulate the next 1 billion instructions.

6 Evaluation
6.1 Performance Estimation Accuracy

Figure 8 compares the execution time and IPC obtained through a full-system simulation (App+OS bars), our accelerated full-system simulation (App+OS Pred bars), and application-only simulation (App-Only bars). The numbers are normalized to the full-system simulation case. For our accelerated full-system simulation, we use the Statistical re-learning strategy and a learning window size of 100. As expected, application-only simulation highly underestimates the total execution time, while our prediction closely approximates the execution time obtained through full-system simulation, with average and worst-case absolute errors of 3.2% and 4.2% (in du). A similar observation applies to the IPC. For our scheme, the average and worst-case absolute errors are 3.2% and 4.1% (in du), respectively. For application-only simulation, the average and worst-case errors are 12.5% and 39.8% (in ab-rand), respectively. One exception is du, where the errors for application-only simulation and for our scheme are almost the same. This is because the IPC of OS services is coincidentally quite similar to the IPC of the application program. Overall, our prediction scheme approximates the performance numbers obtained by full-system simulation quite closely.

Figure 8. The execution time (a) and IPC (b) predicted by our scheme versus those obtained by full-system simulation and application-only simulation, normalized to the numbers obtained by the full-system simulation.

Figure 9 further compares our prediction scheme with full-system simulation in terms of the L1 instruction cache miss rates (first two bars), L1 data cache miss rates (second two bars), and L2 cache miss rates (last two bars). The miss rates are absolute numbers and are not normalized. In general, the predicted and fully-simulated numbers are quite close: the difference in the miss rates is 1% or less, except for the L2 miss rates in find-od, which have a difference of 1.4%. The figure also shows that overall, our prediction is slightly less accurate for L2 cache miss rates compared to the L1 cache miss rates. This might be because the cold cache period for the L2 cache lasts longer than for the L1 instruction/data caches. To find out, we further delay the start of the initial learning from 5 to 25 invocations for find-od, and the L2 miss rate error drops to 1.2%.

Figure 9. The miss rates for L1 instruction cache, L1 data cache, and L2 unified cache obtained through full-system simulation versus predicted by our accelerated full-system simulation scheme.

Figure 10 repeats the experiment from Figure 2 that measures the speedup ratio obtained when the L2 cache size is increased from 512KB to 1MB, using application-only simulation, full-system simulation, and our accelerated full-system simulation (App+OS Pred bars). The figure shows that unlike application-only simulation, our scheme quite accurately captures the speedups from a larger cache.

Figure 10. Speedup ratios obtained using a 1MB L2 cache over using a 512KB L2 cache, using application-only simulations, full-system simulations, and our accelerated full-system simulation.
6.2 Comparison of Re-learning Strategies
Figure 11 compares the performance of the four re-learning strategies described in Section 4.4 in terms of coverage and accuracy. Coverage is defined as the percentage of OS service invocations that are predicted (i.e. full-system simulation is skipped), while prediction error is defined as the absolute value of the difference between the numbers obtained through full-system simulation and through prediction, divided by the number obtained through full-system simulation. The figure shows that, as expected, Best-Match has the highest coverage at 93% because it relies only on the initial learning window without triggering re-learning. However, the initial learning window cannot capture all important behavior clusters, resulting in high prediction errors, with an average absolute error of 9.6% and a worst case of 29%. On the contrary, Eager triggers a re-learning effort every time it sees an outlier cluster. As a result, it significantly improves accuracy (average absolute error of 1.5% and worst case of 3.0%) at the expense of much lower coverage (74%). Finally, recall that in Section 4.4 we derive the Statistical and Delayed re-learning strategies to reduce the number of re-learning instances without sacrificing much accuracy. The figure shows that we have largely succeeded. Both Statistical and Delayed are almost as accurate as Eager (3.2%, 2.7% vs. 1.5% average absolute errors), with coverages similar to Best-Match (89%, 88% vs. 93% on average). Overall, both Statistical and Delayed provide the ability to adapt to dynamic behavior and capture newly occurring behavior clusters without sacrificing much prediction coverage.

Figure 11. The coverage (a) and the accuracy (b) of different re-learning strategies.
6.3 Sensitivity Study
We also evaluate the prediction accuracy of our scheme on different L2 cache sizes. Figure 12 shows the absolute error of the number of cycles with 1MB, 2MB, and 4MB L2 cache sizes. The figure shows that our prediction scheme remains accurate across different L2 cache sizes, with the average error slightly declining for larger L2 caches.
Figure 12. The absolute prediction error for execution time, with L2 cache sizes varying from 1MB to 4MB, and fixed associativity (8-way).
6.4 Estimated Speedup
With our scheme, a full-system simulation is accelerated by fast forwarding the simulation during predicted OS services. Hence, the simulation speedup depends on (1) the prediction coverage, and (2) the relative speed of the emulation mode used during fast forwarding versus the detailed simulation mode. Unfortunately, Simics currently does not support dynamically switching between detailed simulation and emulation, so we cannot measure the speedups directly (we see no fundamental reason why Simics could not be made dynamically switchable). To estimate the simulation speedup, we measure the speed of different levels of simulation detail in Simics: an in-order processor without caches, an in-order processor with caches, an out-of-order processor without caches, and an out-of-order processor with caches. Table 1 shows the slowdown ratios of these modes relative to the fastest, inorder-nocache, mode. To estimate the simulation speedups, we assume that ooo-cache is used for the full-system simulation while inorder-nocache is used for fast forwarding during the prediction periods; this is probably slower than necessary, since we only need to extract the number of instructions during fast forwarding.

Table 1. The slowdown ratios of various modes of simulation in Simics compared to the fastest mode: an in-order processor without caches.

Mode             Slowdown
inorder-cache    3×
ooo-nocache      64×
ooo-cache        133×

Let $N$ denote the total number of instructions (from both the OS and the applications) and $X$ denote the number of instructions executed in the prediction periods. Let $T_{full}$ denote the simulation time per instruction in the ooo-cache mode, and $T_{profile}$ the simulation time per instruction in the inorder-nocache mode. The simulation speedup can then be computed as:

$speedup = \frac{N \times T_{full}}{X \times T_{profile} + (N - X) \times T_{full}}$   (9)

$\phantom{speedup} = \frac{N}{\frac{X}{133} + (N - X)}$   (10)

Table 2. Estimated simulation speedup ratios.

Benchmark   Speedup Ratio
ab-rand     2.8×
ab-seq      3.1×
du          7.1×
find-od     2.9×
iperf       15.6×
gmean       4.9×
Table 2 shows the speedups for each benchmark estimated using Equation 10, along with their geometric mean. The speedup ratios are substantial, ranging from 2.8× to 15.6×, with a geometric mean of 4.9×.
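As an illustration, the sketch below evaluates Equation 10 for a given instruction count and prediction coverage, using the 133× ooo-cache slowdown from Table 1. The function name and the example instruction counts are assumptions made for this example, not measurements from the paper.

```python
# Minimal sketch of the speedup estimate from Equations (9)-(10), assuming the
# 133x ooo-cache vs. inorder-nocache slowdown from Table 1. The function name
# and the example instruction counts are illustrative assumptions.

OOO_CACHE_SLOWDOWN = 133.0   # T_full / T_profile, from Table 1

def estimated_speedup(total_insns, predicted_insns, slowdown=OOO_CACHE_SLOWDOWN):
    """Speedup of mixed-mode simulation over fully detailed (ooo-cache) simulation.

    total_insns     -- N: instructions from both the OS and the applications
    predicted_insns -- X: instructions fast-forwarded during prediction periods
    """
    n, x = float(total_insns), float(predicted_insns)
    return n / (x / slowdown + (n - x))

# Example: if 80% of all instructions fall in prediction periods,
# the estimated speedup is about 4.9x.
print(round(estimated_speedup(1_000_000, 800_000), 1))   # 4.9
```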
7 Conclusions
In this paper, we have shown that full-system simulation is often necessary in computer architecture evaluation, and we have presented a technique to greatly accelerate it. Since OS performance is determined by both internal and external factors, and varies from one run to another depending on the system load, we propose a fully online algorithm that learns the behavior of OS services; once learning is completed, detailed simulation of OS services is replaced by much faster emulation and performance prediction. Statistical analysis is used to guide when to switch between learning and prediction. We test our acceleration scheme on a set of OS-intensive applications and a Linux 2.6.13 kernel, running on processor and memory hierarchy timing models implemented on Simics. On average, the method needs the learning periods to cover only 11% of OS service invocations in order to produce highly accurate performance estimates. This leads to an estimated simulation speedup of 4.9×, with an average performance prediction error of only 3.2% and a worst-case error of 4.2%.