Characterizing Microarchitecture Soft Error Vulnerability Phase Behavior

Xin Fu, James Poe, Tao Li, José A. B. Fortes
Department of ECE, University of Florida

Abstract

Computer systems increasingly depend on exploiting program dynamic behavior to optimize performance, power and reliability. Prior studies have shown that program execution exhibits phase behavior in both the performance and power domains. Reliability-oriented program phase behavior, however, remains largely unexplored. As semiconductor transient faults (soft errors) emerge as a critical challenge to reliable system design, characterizing program phase behavior from a reliability perspective is crucial in order to apply dynamic fault-tolerant mechanisms and to optimize performance/reliability trade-offs. In this paper, we compute run-time program vulnerability to soft errors on four microarchitecture structures (i.e., instruction window, reorder buffer, function units and wakeup table) in a high-performance out-of-order execution superscalar processor. Experimental results on the SPEC2000 benchmarks show a considerable amount of time-varying behavior in the reliability measurements. Our study shows that a single performance metric, such as IPC, cache miss rate or branch misprediction rate, is not a good indicator of program vulnerability. The vulnerabilities of the studied microarchitecture structures are then correlated with program code structure and run-time events to identify vulnerability phase behavior. We observed that both program code structure and run-time events appear promising in classifying program reliability phase behavior. Overall, performance counter based schemes achieved an average Coefficient of Variation (COV) of 3.5%, 4.5%, 4.3% and 5.7% on the instruction queue, reorder buffer, function units and wakeup table, while basic block vectors offer COVs of 4.9%, 5.8%, 5.4% and 6% on the four studied microarchitecture structures respectively. We found that, in general, tracking performance metrics performs better than tracking control flow in identifying the reliability phase behavior of applications. To our knowledge, this paper is the first to characterize program reliability phase behavior at the microarchitecture level.

1. Introduction

As computer architecture becomes more complex, its efficiency increasingly depends on the capability of tuning hardware and software resources at run-time for different program behavior. Previous research [1, 2, 3, 4, 5] shows that program execution characteristics, such as performance and power features, exhibit time-varying behavior called phases. Various techniques [6, 7, 8, 9, 10, 11, 12, 13, 14] have been proposed to capture program phases within which workloads show homogeneous characteristics of interest. In the past, program phases have been exploited in both the performance and power domains. As semiconductor transient faults (soft errors) emerge as a critical threat to reliable system design, understanding and characterizing program dynamic behavior from the perspective of dependability becomes important. The rapidly increasing soft error rates and the advent of billion-transistor chips suggest that it is infeasible to protect every transistor from soft error strikes, making workload-dependent dynamic fault-tolerant techniques attractive. For example, when long-latency instructions (e.g., loads that miss in the caches, floating point operations) are encountered in the pipeline, a microprocessor can selectively squash instructions that sit in the instruction queue to reduce the likelihood of faults due to soft error strikes on that structure [15]. These techniques can be efficiently applied to optimize performance/reliability trade-offs only if the program-related reliability phases and the critical execution regions required for fault tolerance can be identified.

Prior work [2, 4, 6, 7, 16, 17] showed that program phases can be used to effectively evaluate and optimize performance and power. Similarly, if programs exhibit reliability-oriented phases and these phases can be accurately and efficiently identified, we can exploit workload characteristics in the design, optimization and evaluation of dependable computer architecture. However, to our knowledge, this issue has not been addressed by the community so far.

This paper focuses on characterizing and classifying program reliability phase behavior on state-of-the-art processor microarchitecture. In earlier work, we developed a microarchitecture-level software vulnerability analysis framework, Sim-SODA [18]. Employing the Sim-SODA framework and the SPEC2000 benchmarks, we compute run-time program vulnerability to soft errors on four microarchitecture structures (i.e., instruction queue, reorder buffer, function units and wakeup table). Experimental results show that the studied microarchitecture components manifest significant time-varying behavior in their run-time reliability characteristics. To identify appropriate reliability estimators, we used various performance metrics and examined their correlations with vulnerability factors at the microarchitecture level. We found that, in general, a simple performance metric is insufficient to predict program vulnerability behavior. We then explored code-structure based and run-time event based phase analysis techniques and compared their reliability phase prediction abilities. This work makes the following contributions:

• We quantitatively analyzed the run-time soft error vulnerability of four microarchitecture structures which play an important role in determining the reliability of a high-performance microprocessor with dynamic scheduling and out-of-order execution. We found that all studied microarchitecture structures manifest significant variability in their run-time reliability characteristics.
• We observed a fuzzy correlation between hardware reliability and any single performance metric (such as IPC, branch prediction rate and cache miss rate) across the simulated workloads. This suggests that a simple performance metric is not a good indicator of program reliability behavior.
• We investigated reliability-oriented phase classification using both code-structure based and run-time event based schemes. We compared the effectiveness of the different techniques and explored cost-effective reliability phase classification methods.
• We further explored the use of hybrid information to guide reliability phase selection. Finally, we performed a sensitivity analysis to understand the applicability of the different techniques on machines with different processor and hardware structure configurations.

The rest of this paper is organized as follows. Section 2 describes the method used to compute program reliability and details the techniques used to model the vulnerability of the studied microarchitecture structures. Section 3 presents our experimental setup, including the architectural-level reliability simulator Sim-SODA, the simulated machine configuration and the studied benchmarks. Section 4 profiles the time-varying behavior of microarchitecture vulnerabilities and their correlation with commonly used performance metrics. Section 5 explores control flow and run-time event based approaches and evaluates their abilities in classifying reliability-oriented phases. Section 6 summarizes the paper and outlines our future work.

2. Microarchitecture Level Soft Error Vulnerability Analysis

At the microarchitecture level, program vulnerability to soft errors can be modeled using several methodologies. For example, Li and Adve [19] estimate reliability using a probabilistic model of the error generation and propagation process in a processor. Statistical fault injection has also been used in several studies [20, 21, 22] to evaluate architectural reliability. In this work, we estimate the reliability of processor microarchitecture structures using the Architectural Vulnerability Factor (AVF) computing methods introduced in [23, 24]. This section provides a brief background on AVF and describes how we calculated the AVF of the studied microarchitecture structures.

The AVF of a hardware structure is the probability that a transient fault in that structure causes erroneous output of a program. The overall error rate of a hardware structure is determined by two factors: the device raw error rate, mainly determined by circuit design and processing technology, and the AVF. Since the hardware raw error rate normally does not vary with code execution, the AVF can be used as a good reliability estimator to quantify how vulnerable the hardware is to soft errors at different program execution stages. To compute the AVF, one needs to distinguish the bits that affect the final system output from those that do not. The subset of processor state bits required for architecturally correct execution (ACE) are called ACE bits [23]. The AVF of a hardware structure in a given cycle is the percentage of ACE bits that the structure holds, and the AVF of a hardware structure during program execution is the average of this per-cycle AVF over time. In practice, identifying un-ACE bits, the processor state bits that do not affect correct program execution, is much easier. Examples of locations that often contain un-ACE bits include NOPs, idle or invalid states, uncommitted instructions, and dynamically dead instructions and data. An instruction is considered dynamically dead if its result is not used by any other instruction and therefore will not affect the final output of the program. A cycle-accurate, execution-driven simulator can be used to identify un-ACE bits and to track their residency cycles in hardware structures. To compute the AVF of the entire processor, the AVFs of all hardware components are added together after being weighted by the number of bits in each structure. In this paper, we focus on characterizing the AVF phase behavior of individual microarchitecture structures, since we believe that a component-based reliability analysis is more suitable for the design and optimization of reliability-aware architecture. For example, the hardware components that yield high AVFs can be identified and protected, either at the design stage or at run-time. In the following subsections, we describe the methodologies used to compute the AVFs of the four studied microarchitecture structures: instruction window, reorder buffer, function units and wakeup table. We chose these components due to 1) their significance in reliable program execution and 2) the heterogeneity of their AVF computation.
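In equation form (a restatement of the definitions above; $B$ is the number of bits in the structure, $C$ the number of simulated cycles, and $B_i$, $\mathrm{AVF}_i$ the size and AVF of component $i$):

\[
\mathrm{AVF}_{\text{structure}} \;=\; \frac{\sum_{c=1}^{C} (\text{ACE bits resident in cycle } c)}{B \times C},
\qquad
\mathrm{AVF}_{\text{processor}} \;=\; \frac{\sum_{i} B_i \cdot \mathrm{AVF}_i}{\sum_{i} B_i}
\]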

2.1. Instruction Window

Modern high-performance processors use the instruction window to support dynamic scheduling and out-of-order execution. To effectively exploit instruction-level parallelism, the instruction window buffers a significant amount of program run-time state, making it vulnerable to soft error strikes. For example, Mukherjee et al. [23] reported that the instruction window AVF ranges between 14% and 47% on slices of the SPEC 2000 benchmarks. To model the AVF of the instruction window, we classify each bit in the instruction window as an ACE bit or an un-ACE bit. For example, all bits stored in idle or invalid entries are un-ACE bits. All bits except the opcodes of NOPs and prefetch instructions are also un-ACE bits. In our study, we identified the following sources that contribute un-ACE bits in the instruction window: uncommitted instructions, NOPs, prefetch instructions, dynamically dead instructions, trivial instructions, and idle entries and states. Identifying uncommitted instructions, NOPs, prefetch instructions and idle entries using a performance model is straightforward. However, tracking dynamically dead instructions is more computationally involved. We implement a post-graduation instruction analysis window, as proposed in [23], to detect the future use of an instruction's result and thereby identify dynamically dead instructions. We set the analysis window to 40,000 instructions, which covers most future uses of an instruction's result [18]. Instructions which have no consumer in the following 40,000 instructions are conservatively marked as unknown instructions.

In [23], Mukherjee et al. identified logical masking instructions (e.g., compare instructions prior to a branch, bitwise logical operations and 32-bit operations in a 64-bit architecture) as a source of un-ACE bits. An operand and its bits are logically masked, and can be attributed to un-ACE bits, if the operand does not influence the result of an instruction's execution. In this study, we identified additional logical masking bits. We found that the bits used to encode the specifiers of source registers which hold logically masked values are un-ACE bits: while a corrupted register specifier may cause the processor to fetch the wrong data from a different register, the computation result will not be altered, because of the logical masking effect. Additionally, we extend logical masking instructions to trivial instructions [25]. Trivial instructions are those computations whose output can be determined without performing the computation, so they cover all the un-ACE bits that logical masking instructions can identify. We classified the trivial instructions into three categories [18] and identified the corresponding un-ACE bits. Note that trivial instructions can also be dynamically dead instructions. If an instruction is both a trivial instruction and a dynamically dead instruction, we attribute it to the dynamically dead instructions, because more un-ACE bits can be derived from that type of instruction.
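To make the analysis-window mechanism concrete, the following is a minimal sketch (not the Sim-SODA implementation; the instruction representation and register naming are simplifying assumptions) of how a 40,000-instruction post-graduation window can flag results that are never consumed:

```python
from collections import deque

WINDOW = 40_000  # post-graduation analysis window size, as in [23]

class DeadResultTracker:
    """Flags committed instructions whose destination register is never
    read within the next WINDOW committed instructions."""

    def __init__(self):
        self.pending = deque()   # (seq, dest_reg, state) in commit order
        self.last_writer = {}    # reg -> mutable state of its latest producer
        self.seq = 0

    def commit(self, src_regs, dest_reg):
        """Process one committed instruction; returns sequence numbers of
        instructions that aged out of the window with no consumer."""
        # A read of a register marks its most recent producer as consumed.
        for r in src_regs:
            if r in self.last_writer:
                self.last_writer[r][0] = True
        state = [False]          # consumed? (shared, mutable flag)
        if dest_reg is not None:
            self.last_writer[dest_reg] = state
        self.pending.append((self.seq, dest_reg, state))
        self.seq += 1
        # Instructions older than WINDOW with no consumer are conservatively
        # treated as "unknown" (candidate dynamically dead), per the text.
        unknown = []
        while self.pending and self.seq - self.pending[0][0] > WINDOW:
            old_seq, old_dest, old_state = self.pending.popleft()
            if old_dest is not None and not old_state[0]:
                unknown.append(old_seq)
        return unknown
```

This sketch catches only results with no direct consumer; a fuller analysis along the lines of [23] would also have to follow chains whose only consumers are themselves dead (transitively dead instructions).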

2.2. Reorder Buffer

In an out-of-order execution microprocessor, program execution states are buffered in the reorder buffer (ROB) until instructions retire from the pipeline. To effectively exploit ILP, modern processors use a large ROB, implying that its AVF can greatly affect the AVF of the entire chip. The bits in each ROB entry are allocated to an in-flight instruction; an ROB entry includes the instruction number, the destination register specifier, and operand and control bits. The sources of un-ACE bits that we identified for the ROB are the same as those for the instruction window. Similarly, we used a performance model augmented with an instruction analysis window and trivial instruction classification to identify these scenarios in the ROB [18].

2.3. Function Unit

In this study, we modeled the 4 integer function units used by an Alpha 21264 processor, unless otherwise specified [26]. Similar to [23], we assume each function unit consists of about 50% control latches and 50% data-path latches, and that the data path within it is 64 bits wide. We applied trivial instruction analysis to each function unit to further attribute un-ACE bits to different instructions. The semantics of a trivial instruction imply that one input value to the function unit can be un-ACE. There are other instructions that produce only 32-bit outputs; in that case, the upper 32 bits of the output data path become idle. We observed that function unit ACE bits are mainly contributed by the operands of ACE instructions (instructions which affect the program's final output) as well as the outputs of those instructions [18]. Because the majority of instructions have two input operands and one output operand, the input bits contribute more ACE bits in the function units than the output bits do.
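As an illustration of this attribution, the sketch below (a simplification with assumed operation categories, not Sim-SODA code) counts un-ACE data-path bits for a single operation flowing through a 64-bit function unit:

```python
DATAPATH_BITS = 64

def unace_datapath_bits(is_trivial, produces_32bit_result, num_inputs=2):
    """Rough per-operation un-ACE bit count in the function unit data path.

    is_trivial: one input operand does not influence the result
                (e.g., x * 0, x + 0), so its 64 bits are un-ACE.
    produces_32bit_result: the upper 32 bits of the 64-bit output are idle.
    """
    unace = 0
    if is_trivial:
        unace += DATAPATH_BITS          # the masked input operand
    if produces_32bit_result:
        unace += DATAPATH_BITS // 2     # idle upper half of the output
    total = (num_inputs + 1) * DATAPATH_BITS  # inputs plus one output
    return unace, total

# Example: a 32-bit add of x + 0 leaves 96 of 192 data-path bits un-ACE.
print(unace_datapath_bits(is_trivial=True, produces_32bit_result=True))
```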

2.4. Wakeup Table

When an instruction completes its execution, its destination register ID is broadcast to all of the instructions in the window to inform dependent instructions of the availability of the result. Each entry compares the broadcast register ID with its own source register IDs. If there is a match, the source operand is latched and the dependent instruction may become ready to execute. This register ID broadcast and associative comparison is called instruction wake-up. A soft error that produces an incorrect match between a broadcast physical register ID and a corrupted tag may cause instructions waiting for that operand to be issued prematurely. Conversely, a single-bit error in the tag array that turns what should have been a hit into a mismatch can prevent ready instructions from being issued, causing a deadlock in instruction issue. Therefore, the wake-up table is vulnerable to soft error strikes.

We developed the following method to compute the AVF of the wake-up table. When a new instruction is allocated in the instruction window, the wake-up table records the renamed physical register IDs of the instructions on which this instruction depends. Each wake-up table entry has two fields to hold the renamed register IDs for the two source operands of an instruction. A field in the wake-up table entry becomes invalid once the source operand is ready. The operations on the wake-up table are "fill", "read" and "invalidate"; therefore, its non-overlapping lifetime can be partitioned into fill-to-read, read-to-read and invalidate-to-fill periods. Note that there is no read-to-invalidate component, because the last read between fill and invalidate causes a match between the stored register ID and the broadcast register ID, and once there is a match, the field in the wake-up table becomes invalid immediately. In other words, the lifetime of the read-to-invalidate component in the wake-up table is always zero. Therefore, we combine the fill-to-read and read-to-read components together, and attribute the invalidate-to-fill component as un-ACE.
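A minimal sketch of this lifetime accounting follows (the cycle timestamps are assumed inputs; in Sim-SODA they would come from cycle-level simulation):

```python
def wakeup_field_ace_cycles(fill, reads, next_fill):
    """Vulnerable (ACE) cycles for one wake-up table field.

    fill:      cycle the field is written with a source register ID
    reads:     sorted cycles of wake-up broadcasts that match this field;
               the last read also invalidates the field
    next_fill: cycle the entry is next filled (end of this lifetime)

    Fill-to-read and read-to-read periods are ACE; the invalidate-to-fill
    period is un-ACE (read-to-invalidate is always zero, per the text).
    """
    if not reads:
        return 0                    # never read: the field held no ACE bits
    ace = reads[-1] - fill          # fill-to-read plus read-to-read spans
    unace = next_fill - reads[-1]   # invalidate-to-fill span
    assert ace >= 0 and unace >= 0
    return ace

# Example: filled at cycle 10, read at cycles 14 and 20, refilled at 35
# -> 10 ACE cycles (vulnerable) and 15 un-ACE cycles.
print(wakeup_field_ace_cycles(10, [14, 20], 35))
```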

3. Experimental Methodology

This section describes our experimental framework, simulated machine architecture and benchmarks.

3.1. The Sim-SODA Reliability Estimation Framework

The models described in Section 2 can be integrated into a range of architectural simulators to provide reliability estimates. We have developed Sim-SODA [18], a unified simulation framework that models software reliability on microprocessor-based systems. To implement these AVF models in a unified, timing-accurate framework, we instrumented the Sim-Alpha architectural simulator, which has been validated against the Alpha 21264 processor [27, 28]. The Sim-SODA framework also includes microarchitecture-level reliability models for the cache, TLB, and load/store queue. Through cycle-level simulation, both microarchitecture and architecture states are classified into ACE/un-ACE bits, and their residency and resource usage counts are generated. This information is then used to estimate the reliability of the various hardware structures.

To characterize AVF phase behavior, we break a program's execution into continuous, non-overlapping intervals. An interval is a contiguous portion of the instructions in a program. The Sim-SODA framework outputs a set of AVFs for the important structures, as well as a set of performance counters, at the end of every simulation interval, and resets them to zero at the beginning of the next interval. We collected the microarchitecture structures' AVFs and program performance counters for each interval.
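The emit-and-reset interval bookkeeping can be pictured as follows (a self-contained toy sketch; Sim-SODA derives the per-cycle ACE-bit fractions from simulation and sizes intervals in committed instructions rather than cycles):

```python
INTERVAL_CYCLES = 100_000  # illustrative interval size

def interval_avfs(per_cycle_ace_fraction):
    """Collapse a stream of per-cycle ACE-bit fractions for one structure
    into one AVF value per non-overlapping interval (emit, then reset)."""
    bucket, avfs = [], []
    for frac in per_cycle_ace_fraction:
        bucket.append(frac)
        if len(bucket) == INTERVAL_CYCLES:
            avfs.append(sum(bucket) / len(bucket))  # this interval's AVF
            bucket = []  # reset stats at the start of the next interval
    return avfs

# Example: two intervals' worth of synthetic occupancy data.
import random
stream = [random.random() for _ in range(2 * INTERVAL_CYCLES)]
print(interval_avfs(stream))  # two per-interval AVF values
```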

3.2. Simulated Machine Configuration

We simulated an Alpha-21264-like 4-way dynamically scheduled microprocessor with a two-level cache hierarchy. The baseline microarchitecture model is detailed in Table 1.

3.3. Simulated Workloads

For our reliability-oriented phase characterization experiments, we used 10 SPEC2K integer benchmarks and obtained AVF and performance characteristics using Sim-SODA. Obtaining accurate timing information is critical for AVF estimation. We did not include the SPEC2K FP benchmarks because Sim-Alpha does not model floating point pipeline execution accurately: it shows 21.5% error across the set of floating point benchmarks, compared with 6.6% error across the set of SPEC integer benchmarks [27]. To perform this study within a reasonable amount of time, we used the MinneSPEC [29] reduced input datasets. However, bzip2 and gzip still require a significant amount of time to finish simulation; therefore, we could not include their results in this paper. In order to collect enough interval information for accurate AVF phase behavior classification, we chose each benchmark's interval size based on the total number of dynamic instructions executed. For instance, gcc and parser have 100,000 instructions in each interval, while perlbmk has an interval size of 200,000 instructions.

Table 1. Simulated Machine Configuration

Parameter                       Configuration
Pipeline depth                  7
Integer ALUs/multipliers        4/4
Integer ALU/multiplier latency  1/7
Fetch/Slot/Map/Issue/Commit     4/4/4/4/11 instructions per cycle
Issue queue size                20
Reorder buffer size             80
Register file size              80
Load/store queue size           32
Branch predictor                Hybrid, 4K global + 2-level 1K local + 4K choice
Return address stack            32-entry
Branch mispredict penalty       7 cycles
L1 instruction cache            64KB, 2-way, 64B line, 1-cycle latency
L1 data cache                   64KB, 2-way, 64B line, 3-cycle latency
L2 cache                        2048KB, direct mapped, 64B line, 7-cycle latency
TLB size                        128-entry ITLB/128-entry DTLB, fully associative
MSHR entries                    8 per cache
Prefetch MSHR                   2 per cache
Victim buffer                   8-entry, 1-cycle hit latency

4. Characterizing Run-time Microarchitecture Vulnerability to Soft Errors

This section profiles run-time microarchitecture structure reliability characteristics. Section 4.1 illustrates and quantifies the time-varying behavior of microarchitecture vulnerability to soft errors. Section 4.2 correlates program reliability statistics with simple performance metrics.

4.1. Time Varying Behavior of Microarchitecture Vulnerability

Figure 1. Run-time Instruction Window AVF Profile on the Benchmarks gcc and parser (instruction window AVF (%) vs. execution time)

Figure 1 shows the run-time instruction window AVF profile on the benchmarks gcc and parser. Each point in the plots represents the AVF of an interval with a size of 100,000 instructions. (We do not show the AVF profiles of all 10 benchmarks because of space limitations.) As can be seen, during program execution, the instruction window's vulnerability shows significant variation between intervals, and some intervals also show similar AVF behavior regardless of temporal adjacency. Table 2 lists AVF statistics in terms of the mean and the Coefficient of Variation (COV) for all studied structures and all simulated benchmarks. The COV has been widely used in the evaluation of phase analysis techniques. It is the standard deviation divided by the mean; in other words, the COV measures the standard deviation as a percentage of the mean. Table 2 shows that although both the instruction window and the ROB are used to support out-of-order execution, they show significant discrepancy in their AVFs: the ROB's AVF is lower than the instruction window's AVF on a majority of the studied benchmarks. This can be explained as follows. The Alpha 21264 processor has separate integer and floating point instruction windows [26]. The integer instruction window has 20 entries, while the ROB holds all types of instructions and has 80 entries. Due to the lack of floating point operations, the fraction of idle bits in the ROB is much higher than that in the instruction window. Table 2 also shows that the COVs of the run-time AVFs on mcf, gap, and vortex are much higher than those on the remaining benchmarks. This indicates that these programs contain complex vulnerability phases which may be challenging for accurate phase classification.

Table 2. Variation in Run-time Microarchitecture Vulnerability (AVF mean in %, COV in %)

Benchmark   Inst. Window AVF    Reorder Buffer AVF   Function Unit AVF   Wakeup Table AVF
            MEAN    COV         MEAN    COV          MEAN    COV         MEAN    COV
mcf         23      86%         23      28%          17      100%        11      28%
eon         37      7%          17      7%           31      10%         9       17%
gcc         27      11%         10      15%          14      10%         5       12%
gap         24      96%         40      43%          5       121%        9       45%
parser      32      16%         13      13%          19      5%          6       20%
perlbmk     29      9%          12      47%          16      8%          8       13%
twolf       40      13%         15      14%          18      8%          14      20%
vortex      26      20%         13      54%          17      23%         3       24%
vpr         40      9%          15      20%          19      10%         12      14%
crafty      29      25%         10      37%          15      17%         7       33%
AVG         31      29%         17      28%          17      31%         8       23%
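Concretely, for each structure and benchmark, Table 2 reports the mean $\mu$ of the per-interval AVF and

\[
\mathrm{COV} = \frac{\sigma}{\mu},
\]

where $\sigma$ is the standard deviation of the per-interval AVF.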

4.2. How Correlated is the AVF to Performance Metrics?

If a hardware structure's AVF showed a strong correlation with a simple performance metric, its AVF could be predicted using that performance metric across different benchmarks. To determine how correlated vulnerability is to performance metrics, we calculated the correlation coefficients between each run-time hardware structure's AVF and each of several simple performance metrics, using statistics gathered from each interval during program execution. We chose widely used performance metrics such as IPC, branch prediction rate, L1 data and instruction cache miss rates, and L2 cache miss rate. Figure 2 shows that there are only fuzzy correlations between the AVFs and the different performance metrics. For example, for the instruction window, IPC shows strong correlation with the AVF on the benchmarks perlbmk, vortex, mcf and crafty, while yielding near-zero correlation coefficients on twolf and eon. Intuitively, high IPC reduces the residency time of ACE bits in a microarchitecture structure, resulting in a reduction of the AVF. On the other hand, the high ILP (usually manifested as high IPC) in a program can cause the microprocessor to aggressively bring more instructions into the pipeline, increasing the total number of ACE bits in a microarchitecture structure. Interestingly, we observed that the same performance metric (e.g., IPC) can exhibit both positive and negative correlations with the AVFs of different structures on the same benchmark. For example, on perlbmk, the correlation coefficient between IPC and the ROB's AVF is -0.98, while the correlation coefficients between IPC and the other structures' AVFs are all around 0.99. As mentioned in Section 2, ACE bit residency time plays an important role in determining the AVF, and for the same instruction, the residency time in different structures can vary significantly. For example, when a processor executes a low-ILP code segment, few instructions can graduate. This can increase the ROB's AVF if those instructions contain a significant number of ACE bits. Note that low IPC does not necessarily mean a long residency time of instructions in the instruction window, the function units and the wakeup table: because of out-of-order execution and in-order commitment, the instructions may have completed long before their final graduation from the ROB. Overall, the results shown in Figure 2 suggest that we simply cannot use one performance metric to indicate hardware reliability behavior.
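A per-interval correlation of this kind can be computed directly from the collected traces; for example (a numpy-based sketch with made-up numbers; the real inputs are the per-interval logs described in Section 3.1):

```python
import numpy as np

# One value per 100,000-instruction interval (illustrative numbers only).
iq_avf = np.array([34.0, 28.5, 41.2, 22.7, 39.9, 30.1])
ipc    = np.array([1.10, 1.45, 0.82, 1.63, 0.90, 1.30])

# Pearson correlation coefficient between the structure's AVF and IPC.
r = np.corrcoef(iq_avf, ipc)[0, 1]
print(f"corr(IQ AVF, IPC) = {r:+.2f}")  # near -1 here: higher IPC, lower AVF
```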

Figure 2. The Correlations between AVFs and Simple Performance Metrics (IPC, branch prediction rate, L1 data/instruction cache miss rates, L2 cache miss rate): (a) Instruction Window, (b) Reorder Buffer, (c) Function Unit, (d) Wakeup Table

5. Program Reliability Phase Classification

Various phase detection techniques have been proposed in the past. In [2, 7, 11, 30, 32], phases are classified by examining the program's control flow behavior via features such as basic blocks, working sets, the program counter, call graphs and control branch counters. This allows phase tracking to be independent of the underlying architecture configuration. In [9, 10], performance characteristics of the applications are used to determine phases. These methods detect phase changes without using knowledge of the program's code structure; however, the identified phases may not be consistent across different architectures. There are also previous studies comparing and evaluating different phase detection techniques. In [8], Dhodapkar and Smith showed that basic block vectors perform better than instruction working sets and control branch counters in tracking performance-oriented phases. The studies reported in [11, 13] revealed the correlations between an application's phase behavior and its code signatures. Isci and Martonosi recently conducted a study [10] in which they compared different phase classification techniques in tracking run-time power-related phases. Although the above methods have been used in performance and power characterization, their effectiveness in classifying program reliability-oriented phases remains unknown. In this section, we examine program vulnerability phase detection using two popular techniques: basic block vector based and performance event counter based schemes. A complete comparison of all phase analysis techniques exceeds the scope of this paper. In Section 5.1, we compare the effectiveness of the different techniques and explore cost-effective reliability phase classification methods. In Section 5.2, we explore the use of hybrid information to guide reliability phase selection. In Section 5.3, we perform a sensitivity analysis to understand the applicability of the different techniques on machines with different processor configurations. To our knowledge, there has been no prior effort investigating program reliability phase detection.

5.1. Basic Block Vectors vs. Performance Monitoring Counters

A basic block is a single-entry, single-exit section of code with no internal control flow. Sequences of basic blocks can therefore be used to represent the control flow behavior of applications. In [2], Sherwood and Calder proposed the use of Basic Block Vectors (BBV) as a metric to capture a program's phase behavior. They first determine all of the unique basic blocks in a program, create a BBV that records the number of times each unique basic block is executed in an interval, and then weight each basic block by its size so that instructions are represented equally regardless of the size of the basic block. In this paper, we quantify the effectiveness of basic block vectors in capturing AVF phase behavior across several different microarchitecture structures. In [9], Isci and Martonosi showed that hardware performance monitoring counters (PMC) can be exploited to efficiently capture program power behavior. To examine an approach that uses performance characteristics in AVF phase analysis, we collected a set of performance monitoring counters to build a PMC vector for each execution interval. We then used the PMC vectors as a metric to classify AVF phases. In our experiments, we instrumented the Sim-SODA framework with a set of 18 performance counters and dumped the statistics of these counters for each execution interval. We then ran an exhaustive search on the 18 counters to determine the 15 counters (shown as PMC-15 in Table 3) that were able to characterize the AVF phases most accurately. We chose the size of the counter vector to be 15 because a BBV is commonly reduced to 15 dimensions after random projection [2]. To investigate whether the AVF phases are predictable using a small set of counters, we further reduced the PMC vector to 5 dimensions (shown as PMC-5 in Table 3). The 5 counters were chosen based on their importance in performance characterization and their generality and availability across different hardware platforms. The BBVs of each benchmark were generated using the SimPoint tool set [2], and the PMC vectors were dumped for every interval during Sim-SODA simulation. We then ran the k-means clustering algorithm [31] to compare the similarity of all the intervals and to group them into phases. Table 4 lists the number of phases identified by the k-means clustering algorithm. In this study, we also examined the use of hybrid information that combines both BBV and PMC vectors to characterize reliability phases; more detailed information about the hybrid schemes can be found in Section 5.2. As shown in Table 4, the BBV generates a slightly larger number of phases than PMC-15, and the PMC-15 scheme detects more phases than PMC-5 on a majority of the studied benchmarks.

Table 3. Events Used by the PMC-15 and PMC-5 Schemes

Number of Loads; Number of Stores; Number of Branches; Total Instructions; Total Cycles; Correctly Predicted Branches; Pipeline Flushes due to Branch Misprediction; Pipeline Flushes due to Other Exceptions; Victim Buffer Misses; Integer Register File Reads; Integer Register File Writes; L1 Data Cache Misses; L1 Instruction Cache Misses; L2 Cache Misses; Data TLB Misses; Squashed Instructions; Idle Entries in IQ; Idle Entries in ROB. (The PMC-15 scheme uses 15 of these 18 events; the PMC-5 scheme uses a 5-event subset.)

Table 4. Number of Phases Identified by the K-means Clustering Algorithm

Benchmark   BBV   PMC-15   PMC-5   Hybrid(10+5)   Hybrid(15+5)   BBV-20
crafty      35    28       26      33             32             34
eon         37    25       28      26             34             29
gap         28    25       23      24             22             24
gcc         30    26       25      29             28             33
mcf         25    27       22      22             24             25
parser      33    26       22      25             30             25
perlbmk     11    21       20      11             11             12
twolf       26    29       28      26             25             25
vortex      31    27       22      26             27             25
vpr         29    26       22      20             29             30
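A compact version of this classification flow might look as follows (per-interval feature vectors clustered with k-means; scikit-learn is used here purely for illustration, whereas the paper uses the clustering algorithm of [31]):

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_phases(vectors, k):
    """Group per-interval feature vectors (BBV or PMC rows) into k phases."""
    X = np.asarray(vectors, dtype=float)
    # Normalize each vector so intervals are compared by relative behavior.
    X = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Example: 6 intervals described by 5 counters (PMC-5-style), 2 phases.
pmc = np.random.default_rng(0).random((6, 5))
print(classify_phases(pmc, k=2))  # one phase label per interval
```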

We use the COV to compare the different phase classification methods. After classifying a program's intervals into phases, we examined each phase and computed the average AVF of all of the intervals within the phase. We then calculated the standard deviation of the AVF for each phase and divided the standard deviation by the average to get the COV. We calculate an overall COV metric across all phases by taking the COV of each phase, weighting it by the percentage of execution that the phase accounts for, and summing the weighted COVs. Better phase classification exhibits a lower COV; in the extreme case, if all of the intervals in the same phase have exactly the same AVF, the COV is zero.

Figure 3 shows that both the BBV and PMC methods can capture program reliability phase behavior on all of the studied hardware structures. Compared to the baseline COV without phase classification, the BBV-based techniques achieve significantly higher accuracy in identifying reliability phases. On average, the BBV-based scheme reduces the COVs of the AVF by 6x, 5x, 6x, and 4x on the instruction queue, ROB, function unit and wakeup table respectively. Figure 3 further shows that the PMC-15 scheme can achieve even higher accuracy in characterizing program reliability behavior. For example, PMC-15 yields lower COVs than BBV on 8 out of the 10 studied benchmarks for all of the examined structures. On the benchmarks eon and vpr, BBV performs better than PMC-15 because of the weak correlations between performance metrics and vulnerability shown in Figure 2. Overall, the PMC-15 scheme leads to an average COV of 3.5%, 4.5%, 4.3% and 5.7% on the instruction queue, reorder buffer, function unit and wakeup table, while BBVs achieve COVs of 4.9%, 5.8%, 5.4% and 6% on the four studied microarchitecture structures respectively. As we expected, Figure 3 shows that the COV of the AVF for PMC-5 is higher than that for PMC-15 in most cases. This is because PMC-15 provides additional information that can improve the accuracy of phase clustering. The only exception is vortex, shown in Figure 3 (a). From Figure 2 (a), one can see that on the benchmark vortex, several performance metrics (e.g., IPC, L1 data cache miss rate and L2 cache miss rate) already exhibit strong correlation with the instruction window's AVF. When more performance metrics are included, it is possible that additional noise is also introduced into the clustering. However, this situation does not happen frequently: we observed only one exception among all the cases that we analyzed. On average, reducing the PMC dimensionality from 15 to 5 increases the COV by 1.3%, 0.8%, 1.5% and 2.1% on the 4 studied microarchitecture structures respectively. In practice, gathering information for 5 counters is much easier than collecting statistics for 15 counters. This suggests that the PMC-5 scheme can be used as a good approximation to the more expensive PMC-15 scheme. Compared with the BBV technique, the PMC-5 scheme yields better AVF prediction for the ROB but exhibits worse accuracy for the wakeup table. For the instruction window and function unit, both schemes show similar performance in AVF classification. In summary, our experiments show that both BBV and PMC phase analysis have significant benefit in characterizing reliability behavior. We found that, in general, PMC-15 phase analysis performs better than the BBV-based approach.
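In symbols, with phase $k$ covering a fraction $w_k$ of execution and having per-interval AVF mean $\mu_k$ and standard deviation $\sigma_k$, the overall metric described above is:

\[
\mathrm{COV}_{\text{overall}} \;=\; \sum_{k} w_k \,\frac{\sigma_k}{\mu_k}
\]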

5.2. The Effectiveness of Hybrid Schemes

We also explored two hybrid phase detection approaches that combine the information used by the BBV and PMC schemes. In our first experiment, we used random projection to obtain a BBV with 10 dimensions and then concatenated it with the PMC-5 vector of the same program execution interval to form a hybrid vector, i.e., Hybrid(10+5). In our second experiment, we appended the PMC-5 vector to the original 15-dimension BBV to form a 20-dimension hybrid vector, i.e., Hybrid(15+5). We then invoked the k-means clustering algorithm on the two hybrid vectors. We compared the effectiveness of the hybrid schemes with the default BBV and with a BBV of 20 dimensions (BBV-20). Figure 4 shows the COVs of the Hybrid(10+5), BBV, Hybrid(15+5) and BBV-20 schemes. As can be seen, simply concatenating BBV and PMC information does not show significant benefit in reliability phase detection. In fact, Hybrid(10+5) shows slightly worse performance than BBV on a majority of benchmarks, and a similar observation holds when Hybrid(15+5) and BBV-20 are compared. One possible reason is that when the PMC vector and the BBV are merged together, their information can interfere with each other, and such interference can reduce the quality of the AVF phase classification. Although we only examine two combinations of PMC and BBV in this paper, it may be possible to gain benefit from other combinations that take advantage of both schemes while avoiding this interference. Further exploration of effective hybrid phase detection schemes is one aspect of our future work. The construction of the hybrid vectors is sketched below.
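A minimal sketch of the Hybrid(10+5) construction (the random projection matrix and the vector shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_10_plus_5(bbv, pmc5, proj=None):
    """Randomly project a full BBV down to 10 dimensions, then append the
    5-element PMC vector from the same interval -> 15-dim hybrid vector."""
    if proj is None:
        # The projection matrix must stay fixed across all intervals
        # of a program so that interval vectors remain comparable.
        proj = rng.standard_normal((bbv.shape[0], 10))
    bbv10 = bbv @ proj
    return np.concatenate([bbv10, pmc5])

# Example: a 4000-basic-block BBV and a PMC-5 vector for one interval.
v = hybrid_10_plus_5(rng.random(4000), rng.random(5))
print(v.shape)  # (15,)
```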

Figure 3. COVs Yielded by Different Phase Classification Schemes (PMC-15, PMC-5, BBV): (a) Instruction Window, (b) Reorder Buffer, (c) Function Unit, (d) Wakeup Table

Figure 4. COVs Yielded by Hybrid Schemes (Hybrid(10+5), BBV, Hybrid(15+5), BBV-20): (a) Instruction Window, (b) Reorder Buffer, (c) Function Unit, (d) Wakeup Table

5.3. Sensitivity Analysis

Phase analysis techniques which use architecture-dependent characteristics (e.g., PMC) may be sensitive to the machine configuration. To evaluate the robustness of the studied phase classification techniques in the context of AVF phase classification, we examine the applicability of the PMC, BBV and hybrid techniques for detecting program reliability phases across different architectures. As mentioned in Section 3, our baseline model is a 4-way issue Alpha-21264-like superscalar microprocessor. To generate different machine configurations, we varied the bandwidth of each pipeline stage, modified the latency, size and associativity of the L2 cache, and resized the branch predictor, load/store queue and register files. We created two additional configurations to model 8-issue and 16-issue processors. The 8-issue machine has 120 physical registers, a 20KB hybrid branch predictor, a 64-entry load/store queue, and an 8MB 8-way set-associative L2 cache. We also increased the branch misprediction penalty to 14 cycles on the 8-issue processor. We used a similar approach to further scale up the resources of the 16-issue machine. Using these new configurations, we ran the Sim-SODA framework on all of the benchmarks. The collected data were then processed using the k-means clustering algorithm to obtain the phase classification results (shown in Figure 5).

Figure 5 shows that the COVs of reliability phase classification on the 8-issue architecture are very close to those on the 16-issue architecture, regardless of the phase characterization technique. This indicates that the various phase classification schemes we examined in Sections 5.1 and 5.2 show similar performance in reliability phase detection across different architecture configurations. Interestingly, we observed that the COV on the 8- and 16-issue machines is higher than that on the 4-issue machine. Intuitively, as the processor issue width increases, programs show larger variation in different characteristics because of the reduced resource constraints, and the increased standard deviation can result in an increase in COV. However, if the configuration of the hardware structures (i.e., instruction window, ROB, function unit and wakeup table) is fixed, the impact of the increased issue width is limited by a threshold: the structures' AVFs will not change significantly once the issue width surpasses that threshold. This explains the noticeable increase in COV when the machine configuration is varied from 4-issue to 8-issue and the diminishing variation in COV when the issue width is further increased to 16.

Figure 5. The COVs of Reliability Phase Classification on the 8-issue and 16-issue Processors

Table 5 shows the AVF phase classification error on the different machine configurations. Due to space limitations, we report the average statistics over the 10 studied benchmarks. Compared with the COV, the AVF errors change only slightly across the different configurations. This is because positive and negative errors may cancel each other, resulting in an overall reduction in that metric. The results shown in Table 5 indicate that the various techniques are still able to track the aggregate AVF behavior on the more aggressive 8- and 16-issue machines despite the increased COV in phase classification.

Table 5. AVF Phase Classification Error on Different Machine Configurations

               Instruction Window            Reorder Buffer
               4 issue  8 issue  16 issue    4 issue  8 issue  16 issue
PMC-5          0.70%    1.59%    1.86%       0.71%    2.24%    2.20%
PMC-15         2.06%    2.32%    1.04%       1.45%    1.20%    1.19%
BBV            3.21%    3.19%    3.09%       1.75%    2.83%    2.71%
BBV-20         2.74%    1.57%    1.79%       1.24%    1.26%    2.07%
Hybrid(10+5)   1.28%    1.66%    1.71%       0.98%    1.61%    2.10%
Hybrid(15+5)   1.48%    1.16%    1.39%       1.95%    1.35%    1.64%

               Function Unit                 Wakeup Table
               4 issue  8 issue  16 issue    4 issue  8 issue  16 issue
PMC-5          1.04%    1.38%    0.98%       1.69%    3.24%    2.59%
PMC-15         3.85%    2.50%    1.08%       3.72%    2.27%    2.40%
BBV            3.67%    3.63%    2.47%       1.91%    4.07%    2.24%
BBV-20         3.69%    2.40%    1.86%       1.50%    3.03%    3.51%
Hybrid(10+5)   0.83%    1.29%    0.90%       1.12%    1.96%    1.91%
Hybrid(15+5)   1.68%    0.88%    1.03%       1.81%    2.19%    1.99%

6. Conclusions and Future Work

Phase analysis is becoming increasingly important for optimizing the efficiency of next-generation computer systems. As semiconductor transient faults become an increasing threat to reliable software execution, it is imperative to design workload-dependent fault-tolerant mechanisms for future microprocessors, which will contain billions of transistors. Observing reliability-oriented program phase behavior is the first step toward dynamic fault-tolerant systems that can be tuned to meet workload reliability requirements. In this work, we studied the run-time reliability characteristics of four important microarchitecture structures. We investigated the applicability and effectiveness of different phase identification methods in classifying program reliability-oriented phases. We found that both code-structure based and run-time event based phase analysis techniques perform well in detecting reliability-oriented phases. In general, the performance counter scheme outperforms the control-flow based scheme on a majority of benchmarks. We also found that program reliability phases can be cost-effectively identified using five performance counters that are generally available across different hardware platforms. This suggests that low-complexity, on-line reliability phase detection and prediction hardware can be built on a small set of performance counters. Although we used AVF analysis as the method to quantify microarchitecture reliability, we believe the observations we made in this paper are independent of the reliability characterization method used. Our future work includes phase characterization using statistical fault injection. Alternative clustering methods (e.g., regression trees) will also be used to analyze reliability phases. Additionally, we will explore on-line reliability phase prediction techniques that enable dynamic fault-tolerant system design; for example, if a structure's AVF rises above a threshold, a dynamic fault-tolerance technique can be triggered to avoid significant performance degradation caused by soft errors. Extensions to the Sim-SODA simulator infrastructure and the creation of additional modules are also topics of future research.

Acknowledgment

This research is partially supported by NASA award no. NCC 2-1363 and by Microsoft Research Trustworthy Computing award no. 14707. James Poe is supported by a National Science Foundation Graduate Research Fellowship.

References

[1] T. Sherwood, E. Perelman and B. Calder, Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2001.
[2] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, Automatically Characterizing Large Scale Program Behavior, In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[3] E. Duesterwald, C. Cascaval and S. Dwarkadas, Characterizing and Predicting Program Behavior and Its Variability, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2003.
[4] X. Shen, Y. Zhong and C. Ding, Locality Phase Prediction, In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2004.
[5] C. Isci and M. Martonosi, Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data, In Proceedings of the International Symposium on Microarchitecture, 2003.
[6] T. Sherwood, S. Sair and B. Calder, Phase Tracking and Prediction, In Proceedings of the International Symposium on Computer Architecture, 2003.
[7] A. Dhodapkar and J. Smith, Managing Multi-Configurable Hardware via Dynamic Working Set Analysis, In Proceedings of the International Symposium on Computer Architecture, 2002.
[8] A. Dhodapkar and J. Smith, Comparing Program Phase Detection Techniques, In Proceedings of the International Symposium on Microarchitecture, 2003.
[9] C. Isci and M. Martonosi, Identifying Program Power Phase Behavior using Power Vectors, In Proceedings of the International Workshop on Workload Characterization, 2003.
[10] C. Isci and M. Martonosi, Phase Characterization for Power: Evaluating Control-Flow-Based and Event-Counter-Based Techniques, In Proceedings of the International Symposium on High-Performance Computer Architecture, 2006.
[11] M. Annavaram, R. Rakvic, M. Polito, J.-Y. Bouguet, R. Hankins and B. Davies, The Fuzzy Correlation between Code and Performance Predictability, In Proceedings of the International Symposium on Microarchitecture, 2004.
[12] J. Lau, S. Schoenmackers and B. Calder, Structures for Phase Classification, In Proceedings of the International Symposium on Performance Analysis of Systems and Software, 2004.
[13] J. Lau, J. Sampson, E. Perelman, G. Hamerly and B. Calder, The Strong Correlation between Code Signatures and Performance, In Proceedings of the International Symposium on Performance Analysis of Systems and Software, 2005.
[14] J. Lau, S. Schoenmackers and B. Calder, Transition Phase Classification and Prediction, In Proceedings of the International Symposium on High Performance Computer Architecture, 2005.
[15] C. Weaver, J. Emer, S. S. Mukherjee and S. K. Reinhardt, Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor, In Proceedings of the International Symposium on Computer Architecture, 2004.
[16] M. Huang, J. Renau and J. Torrellas, Positional Adaptation of Processors: Application to Energy Reduction, In Proceedings of the International Symposium on Computer Architecture, 2003.
[17] A. Iyer and D. Marculescu, Power Aware Microarchitecture Resource Scaling, In Proceedings of the Design Automation and Test in Europe Conference, 2001.
[18] X. Fu, T. Li and J. Fortes, Sim-SODA: A Framework for Microarchitecture Reliability Analysis, In Proceedings of the Workshop on Modeling, Benchmarking and Simulation (held in conjunction with the International Symposium on Computer Architecture), 2006.
[19] X. D. Li, S. V. Adve, P. Bose and J. A. Rivers, SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors, In Proceedings of the International Conference on Dependable Systems and Networks, 2005.
[20] S. Kim and A. K. Somani, Soft Error Sensitivity Characterization of Microprocessor Dependability Enhancement Strategy, In Proceedings of the International Conference on Dependable Systems and Networks, 2002.
[21] N. J. Wang, J. Quek, T. M. Rafacz and S. J. Patel, Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline, In Proceedings of the International Conference on Dependable Systems and Networks, 2004.
[22] E. W. Czeck and D. Siewiorek, Effects of Transient Gate-level Faults on Program Behavior, In Proceedings of the International Symposium on Fault-Tolerant Computing, 1990.
[23] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt and T. Austin, A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor, In Proceedings of the International Symposium on Microarchitecture, 2003.
[24] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas and R. Rangan, Computing Architectural Vulnerability Factors for Address-Based Structures, In Proceedings of the International Symposium on Computer Architecture, 2005.
[25] J. J. Yi and D. J. Lilja, Improving Processor Performance by Simplifying and Bypassing Trivial Computations, In Proceedings of the IEEE International Conference on Computer Design, 2002.
[26] R. E. Kessler, E. J. McLellan and D. A. Webb, The Alpha 21264 Microprocessor Architecture, In Proceedings of the International Conference on Computer Design, 1998.
[27] R. Desikan, D. Burger, S. W. Keckler and T. Austin, Sim-alpha: A Validated, Execution-Driven Alpha 21264 Simulator, Technical Report TR-01-23, Dept. of Computer Sciences, University of Texas at Austin, 2001.
[28] R. Desikan, D. Burger and S. W. Keckler, Measuring Experimental Error in Microprocessor Simulation, In Proceedings of the International Symposium on Computer Architecture, 2001.
[29] A. J. KleinOsowski and D. J. Lilja, MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research, Computer Architecture Letters, Vol. 1, June 2002.
[30] W. Liu and M. Huang, EXPERT: Expedited Simulation Exploiting Program Behavior Repetition, In Proceedings of the International Conference on Supercomputing, 2004.
[31] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[32] R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu and S. Dwarkadas, Memory Hierarchy Reconfiguration for Energy and Performance in General-Purpose Architectures, In Proceedings of the International Symposium on Microarchitecture, 2000.
