SBST for On-Line Detection of Hard Faults in Multiprocessor Applications Under Energy Constraints

A. Merentitis¹, D. Margaris¹, N. Kranitis¹, A. Paschalis¹, and D. Gizopoulos²
¹ Department of Informatics & Telecommunications, University of Athens, Greece
² Department of Informatics, University of Piraeus, Greece
{amer, nkran, paschali}@di.uoa.gr,
[email protected]

Abstract- Software-Based Self-Test (SBST) has emerged as an effective method for on-line testing of processors integrated in non-safety-critical systems. However, especially for multi-core processors, the notion of dependability encompasses not only high quality on-line tests with minimum performance overhead but also methods for preventing the generation of excessive power and heat that exacerbate silicon aging mechanisms and can cause long-term reliability problems. In this paper, we initially extend the capabilities of a multiprocessor simulator in order to evaluate the overhead in the execution of the useful application load in terms of both performance and energy consumption. We utilize the derived power evaluation framework to assess the overhead of SBST implemented as a test thread in a multiprocessor environment. A range of typical processor configurations is considered. The application load consists of some representative SPEC benchmarks, and various scenarios for the execution of the test thread are studied (sporadic or continuous execution). Finally, we apply in a multiprocessor context an energy optimization methodology that was originally proposed to increase battery life for battery-powered devices. The methodology significantly reduces the energy and performance overhead without affecting the test coverage of the SBST routines.

Keywords- software-based self-testing; hard faults; multiprocessors; on-line test; low energy optimization

I. INTRODUCTION
New types of defects appearing in deep submicron technologies require at-speed testing in order to achieve high test quality. Moreover, many types of faults are increasingly difficult to detect during manufacturing testing, because voltage stress and power limitations during burn-in can render the test ineffective. If such faults escape, they are likely to cause hard failures during the useful lifetime of the system. In order to tolerate faults encountered during operation, a reliable system requires mechanisms for detection, recovery and repair. Given the existence of low-cost mechanisms for system recovery and repair in contemporary chip multiprocessors (CMPs), the remaining major challenge is the development of low-cost defect detection techniques [1].

In the multimillion gate SoC era, design and test engineers, apart from the usual challenges, also face signal integrity problems, as well as serious power consumption and overheating issues, especially when the circuit has to be placed in special test modes [2]. For on-line testing, which aims at detecting and/or correcting operational faults, hardware-based test methods usually require special scheduling in order to avoid overheating that can cause circuit failures and long-term reliability problems (e.g. by accelerating silicon aging and wear-out phenomena like electromigration, Negative Bias Temperature Instability and Time Dependent Dielectric Breakdown). These problems are exacerbated for multi-core processors, where heat dissipation is a real concern and temperature-related failures are more likely.

Reliability in multiprocessor environments is a very “hot” topic and therefore various works have addressed different aspects of the problem. However, the majority of these works address only the problem of intermittent faults and soft errors (e.g. [3]-[6]). These techniques are based on duplication of instructions from the actual application, which may or may not be executed on the same hardware, and are therefore well suited for detecting short-lived transient faults but not long-lived intermittent faults (e.g. related to voltage drops or temperature issues) [7]. Even more so, they are not efficient for the detection of hard faults: because they are based on actual application instructions, they tend to test the same logic repeatedly, while large portions of the processor functionality remain untested.

On-line test methodologies that are able to address the problem of hard faults in functional units are relatively few (e.g. [1], [8], [9]) and most of them consider only the performance overhead. An interesting non-intrusive approach was proposed in [7], targeting the concurrent on-line test of hard faults in simultaneous multi-threaded (SMT) processors using a test thread.
Issues related to fault coverage, generation of high quality vectors, and employing the assistance of hardware via test points and signature registers were not addressed.

Software-Based Self-Test has been proposed [10]-[20] as a low-cost solution for testing of processors integrated in non-safety-critical applications that can be used either as an alternative or a supplement to other test methods. It uses existing processor resources for test pattern generation and application, with no hardware or frequency overhead for the design. Moreover, it can be used for flexible and efficient on-line testing because, unlike most hardware solutions, it allows a dynamic trade-off between reliability and performance overhead [1]. Finally, the fact that SBST is performed in normal mode using the processor Instruction Set Architecture (ISA) alleviates the problem of excessive toggle activity that is beyond the specification of the circuit and can cause immediate circuit failures. However, long-term reliability problems also need to be addressed by energy optimization of the SBST routines.

The main contributions of this paper are summarized as follows:
• We assess the performance and energy overhead of an SBST test thread strategy using a novel power evaluation framework that combines the capabilities of different tools.
• The SBST test thread we use is not pseudorandom as in [7] but deterministic, consisting of routines with proven test capabilities.
• We evaluate the performance and energy overheads in a multi-core processor that executes SPEC benchmarks.
• We apply the low energy optimization methodology of [21] in a multi-core context in order to reduce the energy overhead.

The rest of the paper is organized as follows. Some preliminaries are briefly discussed in Section II. Section III highlights the simulation environment and the flow we have used for generating performance overhead and energy metrics. Experimental results are provided in Section IV. Finally, Section V concludes the paper.

II.
ENERGY OPTIMIZATION IN THE TEST THREAD STRATEGY
The idea of a test thread for the detection of hard faults was introduced in [7]. The main assumptions of this approach are as follows: the primary thread is executed normally in a simultaneous multithreading environment. When enough resources are available and the overall system load allows it, a secondary or test thread is executed to detect hard faults in the underlying hardware. In fact, previous works show that even in a system that executes several threads there are often plenty of free resources [22]. As the hardware that executes the primary thread is not always the same (e.g. different core in a multi-core system or different ALU in a superscalar system), the test thread is also executed on different hardware, allowing the detection of faults in a broad scope.

The test thread can be implemented through a variety of hardware, software or hybrid techniques and its execution can be scheduled by the operating system or by the hardware itself if the overall resource utilization is low. It can be executed either sporadically (e.g. before checkpoints) or continuously. If a pure software approach based on SBST is used, as is the case here, the test thread strategy is entirely non-intrusive, since it does not require any hardware modifications. However, the impact on performance and energy needs to be assessed.

A brief theoretical analysis of the parameters that contribute to power consumption is required in order to set the scope of the problem. Power consumption in CMOS circuits can be either static or dynamic. Leakage current drawn continuously from the power supply causes static power dissipation. Dynamic dissipation occurs during output switching due to the short-circuit current and charging/discharging of load capacitance. The importance of static power consumption increases as dimensions scale down; however, current CMOS technologies are dominated by dynamic power consumption. For a single node the latter can be approximated by the following formula:
$P = C_L \cdot S \cdot V_{dd}^{2} \cdot f_{CLK}$,

where $C_L$ is the equivalent load capacitance, $V_{dd}$ is the power supply voltage, $S$ is the number of node switches and $f_{CLK}$ is the operating frequency. Additionally, energy consumption for a period $T$ equals $E = P_{av} \cdot T$ with $T = N \cdot \tau$, where $P_{av}$ is the average power consumption over the period $T$, $N$ is the total number of execution cycles and $\tau$ is the clock period.

It is apparent from the previous formulas that, for a given circuit and technology, energy consumption can be reduced if the test program has a small cycle count and low average power consumption. However, these two factors cannot always be optimized simultaneously. A systematic methodology for low energy optimization of SBST routines was proposed in [21] to reduce the energy consumption for periodic non-concurrent testing in battery-powered processors. Another important advantage of that methodology is that it can be applied independently, before the application of other optimization methods, such as scheduling algorithms that exploit thread-level parallelism (TLP) to speed up the execution of self-test routines (e.g. [23]) or algorithms for the scheduling of on-line self-test tasks in hard real-time systems (e.g. [24]).

In this paper, after studying the impact of the test thread strategy in terms of performance and energy, we proceed to assess the effectiveness of the methodology proposed in [21] in a multi-core, concurrent test scenario, implemented as a test thread. The test thread comprises deterministic high-RTL code [15] for functional units like the MAC, register file and pipeline logic [17], deterministic code generated by constrained ATPG according to the methodology of [14] for the ALU and MAC adder, as well as verification-based functional code for control-oriented components as introduced in [15] and extended in [20].

III.
SIMULATION ENVIRONMENT

The power evaluation framework utilized in this paper comprises a combination of tools from the test technology and computer architecture technical areas. More specifically, the SBST routines are selected from a test routine library that contains routines already validated in terms of test effectiveness using commercial fault simulation tools. However, the most important part of the multi-core power evaluation framework is the extended CMP-SIM simulator. CMP-SIM [25] is an architectural-level simulator that extends SimpleScalar with the capability for multi-core/multithread simulation. It offers a multi-core microarchitectural environment with a detailed cycle-accurate model for the key pipeline structures and it implements a cache coherence protocol similar to MESI. In order to be able to generate energy metrics in a multi-core context, we extended the original publicly available version of CMP-SIM by incorporating the energy libraries and functions from a dedicated power simulator, Wattch.

The Wattch framework [26] is an architectural-level simulator for estimating power consumption. It can accurately model four main categories of processor elements (array structures, memories, combinational logic and wires, clock networks). Experimental validation of the generated reports for numerous cases (including commercial processors like Alpha 21264 and MIPS R10000) has shown that the estimated power consumption is within 10-13% of the values reported by tools operating on the post-layout netlist. The power models incorporated by Wattch are general and accurate [27], meaning that the results are valid for a broad range of processors. These properties make Wattch an ideal tool for evaluating the SBST test thread strategies from an energy point of view.

The extension of CMP-SIM with the libraries and functions of Wattch involved the modification of about 20% of the approximately 10,000 lines of C code that comprise the original version of the simulator.
This involved the addition of complex structures for maintaining the energy metrics for every component and pipeline stage in every core, as well as numerous modifications for preserving the integrity of the simulation and the validity of the results in multithread simulation. The flexibility offered by the produced configurable architectural-level multi-core simulator allows us to cover a broad design space and generate metrics for a wide range of processor models. Thus, it is possible to evaluate the performance and energy overhead of the test thread across different configurations and under various scenarios of processor load. In order to evaluate the impact of the SBST thread in terms of both performance and energy, we have used the flow that is depicted in Figure 1. Initially the considered processor specification is used to set the basic parameters of the simulator configuration (number of cores, number of functional units per core, cache sizes, cache and memory latency, etc), as well as to select the appropriate SBST routines from an existing library of self-test routines (e.g. [15], [16], [17]). Specifically, the considered routines include deterministic High-RTL routines for functional modules that are characterized by regularity, constrained ATPG routines for non-regular functional modules and verification-based routines for control oriented modules.
Following this step, the selected routines are combined in a test thread, which is compiled by the appropriate cross-compiler and executed together with the normal application to generate the performance and energy overheads introduced by SBST. As a next step, in order to evaluate the impact of the energy optimization methodology of [21] in multi-core processors, the methodology is applied and the same steps are repeated for the optimized routines.
Figure 1: The multiprocessor power evaluation framework
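The bookkeeping and comparison steps of this flow can be sketched in a few lines of Python. The snippet is only an illustration under stated assumptions: `EnergyLedger`, `RunMetrics`, `overhead_percent` and `compare` are hypothetical names that do not come from CMP-SIM or Wattch, and the numbers in the usage note are made up. It keeps a per-core, per-component energy tally, in the spirit of the structures added to the simulator, and derives delay and energy overhead as the relative increase of the run with the test thread over the baseline run.

```python
from collections import defaultdict
from dataclasses import dataclass


class EnergyLedger:
    """Hypothetical per-core, per-component energy tally (in joules)."""

    def __init__(self):
        self._joules = defaultdict(float)

    def add(self, core, component, power_watts, cycles, clock_period_s):
        # E = P_av * T with T = N * tau (relations of Section II).
        self._joules[(core, component)] += power_watts * cycles * clock_period_s

    def core_total(self, core):
        return sum(e for (c, _), e in self._joules.items() if c == core)

    def total(self):
        return sum(self._joules.values())


@dataclass
class RunMetrics:
    cycles: int
    energy_j: float


def overhead_percent(baseline, with_test):
    """Relative increase, in percent, of the run that includes the test thread."""
    return 100.0 * (with_test - baseline) / baseline


def compare(baseline: RunMetrics, with_test: RunMetrics):
    """Return (delay overhead %, energy overhead %) for two simulation runs."""
    return (
        overhead_percent(baseline.cycles, with_test.cycles),
        overhead_percent(baseline.energy_j, with_test.energy_j),
    )
```

For example, with illustrative values, `compare(RunMetrics(1_000_000, 0.200), RunMetrics(1_020_000, 0.206))` gives a delay overhead of 2% and an energy overhead of about 3%.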
In this paper two different configurations, covering a wide range of processor benchmarks, are considered. Detailed characteristics for the simulated processor configurations are presented in Table I.

TABLE I. SIMULATED PROCESSOR CONFIGURATIONS

Parameter                Simple       Advanced
Branch Miss Penalty      5 cycles     8 cycles
Decode Width             4            8
Issue Width              4            8
Commit Width             4            8
L1 Data Cache            16 KB        32 KB
L1 Instruction Cache     16 KB        32 KB
L2 Unified Cache         1024 KB      4096 KB
L1 D-Cache Latency       2 cycles     4 cycles
L1 I-Cache Latency       2 cycles     4 cycles
L2 Cache Latency         12 cycles    18 cycles
Integer ALUs             2            4
Floating Point ALUs      1            2
L/S Queue Size           16           32
Register Unit Size       32           64

IV. EXPERIMENTAL RESULTS
In this section the overhead of the test thread in terms of performance and energy consumption is evaluated for the considered processor configurations, under various scenarios. Specifically, simulations are performed for the case of a dual-core and quad-core processor that only executes the useful application, as well as for the case that the same processor concurrently with the useful application executes a test thread. Finally, a scenario of sporadic
execution of the test thread is considered (i.e. before every checkpoint), as well as a scenario of a test thread that runs continuously in the available cores (e.g. [1], [7]). For all these cases two processor configurations are used (Table I).

The first step for evaluating the overhead of the test thread strategy is to consider the case that only the useful application load is executed (either in a dual-core or quad-core execution environment). Table II presents Instructions Per Cycle (IPC), number of cycles and energy consumption for some representative SPEC2000 benchmarks, assuming that checkpoints are taken every 10 million instructions. The selected benchmarks cover a wide range of computer operations, including scientific simulations, CAD tools, desktop applications and databases. For every benchmark we fast-forward the first 10 million instructions (initialization part) and then execute up to the next checkpoint normally. Table III presents the same results as Table II for the advanced processor configuration.

TABLE II. CYCLES AND ENERGY OF THE APPLICATION LOAD FOR THE SIMPLE PROCESSOR CONFIGURATION

Cores    SPEC 2000 Benchmark   IPC     Cycles       Energy (mJ)
2-Core   Apsi                  2.13    4,687,577    230.6
2-Core   Fma3d                 1.41    7,071,555    235.7
2-Core   Mcf                   2.07    4,821,497    285.5
2-Core   Parser                1.26    7,902,626    323.8
2-Core   Vortex                1.50    6,680,683    253.6
4-Core   Apsi                  2.67    3,750,039    235.1
4-Core   Fma3d                 2.16    4,632,756    233.9
4-Core   Mcf                   2.20    4,553,620    287.4
4-Core   Parser                1.59    6,272,347    323.9
4-Core   Vortex                2.13    4,692,960    265.5

TABLE III. CYCLES AND ENERGY OF THE APPLICATION LOAD FOR THE ADVANCED PROCESSOR CONFIGURATION

Cores    SPEC 2000 Benchmark   IPC     Cycles       Energy (mJ)
2-Core   Apsi                  2.67    3,750,088    405.8
2-Core   Fma3d                 1.92    5,207,359    306.6
2-Core   Mcf                   2.20    4,553,645    416.3
2-Core   Parser                1.20    8,302,716    488.9
2-Core   Vortex                1.88    5,316,419    295.3
4-Core   Apsi                  3.00    3,333,406    408.7
4-Core   Fma3d                 2.48    4,039,791    305.8
4-Core   Mcf                   2.33    4,285,763    417.5
4-Core   Parser                1.53    6,519,926    491.2
4-Core   Vortex                2.91    3,430,864    311.9

The results that are presented in Tables II and III lead to some interesting outcomes. The first of these outcomes concerns the performance and energy consumption of the processor when only the useful application load is executed. The simulations show that for most of the benchmarks moving to a quad-core processor provides an average speed-up while energy consumption remains approximately the same (a notable increase is exhibited only for the Vortex benchmark). On the other hand, going from the simple to the advanced processor model offers performance gains that are comparable to the previous case. Specifically, depending on the nature of the benchmarks they can be higher or lower; for the Parser benchmark in particular, the longer cache miss penalties (reflected in the IPC) even result in slower execution in the advanced model. However, the energy requirements are increased significantly for all benchmarks.

After evaluating the requirements of the useful application load, we proceed to assess the impact of the test thread. Initially we consider the case of sporadic execution before every checkpoint. In this scenario the overhead of the test thread is mainly determined by its size in cycles. Since deterministic SBST is used for most of the components [20], the cycle count for every execution of the test thread is small (less than 100,000 cycles for all the different configurations), thus we expect the overhead to be relatively small as well. The overhead for the simple and advanced configurations is presented in Tables IV and V, respectively.

TABLE IV. DELAY AND ENERGY OVERHEAD FOR THE SIMPLE PROCESSOR CONFIGURATION – SPORADIC EXECUTION

Cores    SPEC 2000 Benchmark   IPC     % Delay Overhead   % Energy Overhead
2-Core   Apsi                  2.13    1.53%              2.37%
2-Core   Fma3d                 1.42    1.01%              1.85%
2-Core   Mcf                   2.08    1.36%              1.92%
2-Core   Parser                1.27    0.91%              1.18%
2-Core   Vortex                1.50    1.24%              1.50%
4-Core   Apsi                  2.68    1.07%              2.39%
4-Core   Fma3d                 2.17    0.84%              1.85%
4-Core   Mcf                   2.21    0.92%              1.90%
4-Core   Parser                1.61    0.65%              1.16%
4-Core   Vortex                2.15    0.73%              1.37%

TABLE V. DELAY AND ENERGY OVERHEAD FOR THE ADVANCED PROCESSOR CONFIGURATION – SPORADIC EXECUTION

Cores    SPEC 2000 Benchmark   IPC     % Delay Overhead   % Energy Overhead
2-Core   Apsi                  2.67    1.39%              2.15%
2-Core   Fma3d                 1.93    0.92%              1.16%
2-Core   Mcf                   2.20    1.21%              1.48%
2-Core   Parser                1.21    0.83%              1.02%
2-Core   Vortex                1.89    1.08%              1.21%
4-Core   Apsi                  3.02    0.77%              2.12%
4-Core   Fma3d                 2.50    0.59%              1.19%
4-Core   Mcf                   2.35    0.83%              1.44%
4-Core   Parser                1.55    0.48%              1.02%
4-Core   Vortex                2.94    0.51%              1.09%
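As a rough plausibility check on the sporadic-execution figures, one can bound the delay overhead by assuming that every cycle of the test thread is added serially to the application. This is a pessimistic assumption made here for illustration, not a model from the paper, since the thread largely runs on otherwise idle resources. The sketch uses the 2-core cycle counts of the simple configuration (Table II), the corresponding reported delay overheads (Table IV) and the stated upper limit of 100,000 cycles per test thread execution; `serial_bound_percent` is a hypothetical helper name.

```python
# Baseline cycle counts (Table II, simple configuration, 2-core rows).
BASELINE_CYCLES = {
    "Apsi": 4_687_577,
    "Fma3d": 7_071_555,
    "Mcf": 4_821_497,
    "Parser": 7_902_626,
    "Vortex": 6_680_683,
}

# Reported delay overhead in percent (Table IV, 2-core rows).
REPORTED_DELAY = {
    "Apsi": 1.53,
    "Fma3d": 1.01,
    "Mcf": 1.36,
    "Parser": 0.91,
    "Vortex": 1.24,
}

TEST_THREAD_CYCLES = 100_000  # upper limit quoted in the text


def serial_bound_percent(baseline_cycles, test_cycles=TEST_THREAD_CYCLES):
    """Delay overhead (%) if all test cycles were appended to the application."""
    return 100.0 * test_cycles / baseline_cycles


for name, cycles in BASELINE_CYCLES.items():
    bound = serial_bound_percent(cycles)
    print(f"{name:7s} bound {bound:.2f}%  reported {REPORTED_DELAY[name]:.2f}%")
```

Under this assumption the bound ranges from about 1.3% (Parser) to about 2.1% (Apsi), and each reported overhead stays below its bound, which is consistent with the test thread overlapping with the application.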
When we introduce the test thread in the form of sporadic testing before every checkpoint, the overall delay overhead is always below 2% (it could be higher if the test thread consisted of more routines) and it is reduced further for the quad-core processor model. It should also be noted that, between the dual-core model for the advanced configuration and the quad-core model for the simple configuration, the latter has the smaller delay overhead, since the test thread is executed independently in the available core. Finally, energy consumption is affected more significantly than delay because, in order to compensate for the extra load (and increase IPC), all processor configurations more often trigger hardware that was previously in “sleep” mode (e.g. due to clock gating). Therefore, the extra load is only partially translated to delay, while energy consumption is impacted directly.

A different scenario that is also very interesting from a reliability point of view is to continuously execute the test thread, as long as resources are available. In this scenario the impact of the test thread in terms of delay and energy consumption is determined not by the number of cycles (since it executes continuously) but by the properties of the routines that comprise it. An assessment of the overhead for this scenario is presented in Tables VI and VII.

TABLE VI.
DELAY AND ENERGY OVERHEAD FOR THE SIMPLE PROCESSOR CONFIGURATION – CONTINUOUS EXECUTION

Cores    SPEC 2000 Benchmark   IPC     % Delay Overhead   % Energy Overhead
2-Core   Apsi                  2.72    26.72%             63.6%
2-Core   Fma3d                 1.92    19.01%             61.2%
2-Core   Mcf                   2.72    23.36%             65.3%
2-Core   Parser                1.75    16.91%             62.0%
2-Core   Vortex                2.01    20.24%             69.9%
4-Core   Apsi                  3.59    20.09%             61.3%
4-Core   Fma3d                 2.95    18.34%             61.7%
4-Core   Mcf                   3.01    17.85%             65.7%
4-Core   Parser                2.27    13.26%             61.9%
4-Core   Vortex                2.97    15.82%             58.1%

TABLE VII. DELAY AND ENERGY OVERHEAD FOR THE ADVANCED PROCESSOR CONFIGURATION – CONTINUOUS EXECUTION

Cores    SPEC 2000 Benchmark   IPC     % Delay Overhead   % Energy Overhead
2-Core   Apsi                  3.50    23.08%             44.0%
2-Core   Fma3d                 2.62    18.44%             77.0%
2-Core   Mcf                   2.93    21.21%             77.3%
2-Core   Parser                1.67    16.59%             76.5%
2-Core   Vortex                2.53    20.06%             62.7%
4-Core   Apsi                  4.12    17.62%             43.3%
4-Core   Fma3d                 3.42    16.91%             77.4%
4-Core   Mcf                   3.24    16.35%             77.0%
4-Core   Parser                2.19    12.87%             75.9%
4-Core   Vortex                4.09    15.13%             66.8%

It is clear that in the case of continuous execution the overhead is increased, especially in terms of energy consumption. Moreover, it is interesting to note that while the energy overhead is approximately uniform across different benchmarks for the simple configuration, it varies significantly for the advanced configuration, due to the increased delays associated with misses in the speculative performance aiding mechanisms. However, high energy consumption for a significant amount of time can cause overheating, which is a major concern for long-term silicon reliability because it accelerates silicon aging and wear-out mechanisms. Thus, apart from the main application, the test thread should also be optimized to avoid such problems, especially in the case of continuous execution. In this direction we apply the energy optimization methodology proposed in [21], in order to study its effectiveness in a multi-core environment. The methodology of [21] consists of the following steps:
• Energy-aware loop synthesis, deployed to minimize the byte count of the routine and to better exploit caches,
• Loop transformations for loop-based routines,
• Instruction substitution, which replaces instructions with equivalent but more energy-efficient ones,
• Modified Register Name Adjustment (RNA) to minimize bit toggles on the address decoders and buses.

Application of the energy optimization methodology on the routines of [20], which cover the complete range of SBST techniques (deterministic high-RTL, constrained ATPG and verification-based), shows that the test coverage, measured on the gate-level netlists of two processor benchmarks [21], is reduced by less than 0.5%. Results on a per-module basis are also provided in [21] and indicate that deterministic routines (high-RTL or constrained ATPG) are not affected, while verification-based routines exhibit some degradation in test coverage because strict equivalence cannot be achieved. The effect of the methodology in terms of test coverage for different SBST techniques is derived from the nature of the optimization steps and thus is similar for all routines [21]. The gains in terms of energy consumption are presented in Table VIII and Table IX.

TABLE VIII. DELAY AND ENERGY OVERHEAD FOR THE SIMPLE PROCESSOR CONFIGURATION – OPTIMIZED TEST THREAD

Cores    SPEC 2000 Benchmark   IPC     % Delay Overhead   % Energy Overhead
2-Core   Apsi                  2.86    20.28%             44.1%
2-Core   Fma3d                 2.00    14.39%             41.7%
2-Core   Mcf                   2.82    18.61%             48.9%
2-Core   Parser                1.81    12.75%             45.2%
2-Core   Vortex                2.10    15.18%             49.6%
4-Core   Apsi                  3.72    15.67%             41.2%
4-Core   Fma3d                 3.05    14.19%             42.3%
4-Core   Mcf                   3.13    13.20%             49.1%
4-Core   Parser                2.33    10.34%             41.8%
4-Core   Vortex                3.07    12.18%             39.7%

TABLE IX. DELAY AND ENERGY OVERHEAD FOR THE ADVANCED PROCESSOR CONFIGURATION – OPTIMIZED TEST THREAD

Cores    SPEC 2000 Benchmark   IPC     % Delay Overhead   % Energy Overhead
2-Core   Apsi                  3.66    17.54%             31.2%
2-Core   Fma3d                 2.72    14.16%             51.9%
2-Core   Mcf                   3.04    16.71%             55.7%
2-Core   Parser                1.73    12.48%             56.4%
2-Core   Vortex                2.64    15.03%             39.6%
4-Core   Apsi                  4.28    13.21%             35.8%
4-Core   Fma3d                 3.54    12.85%             52.3%
4-Core   Mcf                   3.34    12.77%             55.1%
4-Core   Parser                2.26    9.69%              54.4%
4-Core   Vortex                4.21    11.80%             42.5%
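The size of the improvement can be checked directly from the reported energy overheads. The sketch below averages the ten energy-overhead entries of each table (all benchmarks, 2-core and 4-core) and computes the relative reduction as one minus the ratio of the means. This particular averaging is an assumption made here for illustration, since the text does not state how its average was formed, and `relative_reduction` is a hypothetical helper name.

```python
from statistics import mean

# Energy overhead in percent, continuous execution of the original test thread.
CONTINUOUS = {
    "simple": [63.6, 61.2, 65.3, 62.0, 69.9, 61.3, 61.7, 65.7, 61.9, 58.1],    # Table VI
    "advanced": [44.0, 77.0, 77.3, 76.5, 62.7, 43.3, 77.4, 77.0, 75.9, 66.8],  # Table VII
}

# Energy overhead in percent, continuous execution of the optimized test thread.
OPTIMIZED = {
    "simple": [44.1, 41.7, 48.9, 45.2, 49.6, 41.2, 42.3, 49.1, 41.8, 39.7],    # Table VIII
    "advanced": [31.2, 51.9, 55.7, 56.4, 39.6, 35.8, 52.3, 55.1, 54.4, 42.5],  # Table IX
}


def relative_reduction(before, after):
    """Fraction of the mean overhead removed by the optimization."""
    return 1.0 - mean(after) / mean(before)


for config in CONTINUOUS:
    r = relative_reduction(CONTINUOUS[config], OPTIMIZED[config])
    print(f"{config}: mean energy overhead reduced by {100 * r:.1f}%")
```

With this averaging both configurations come out at roughly 30%, in line with the "approximately one third" quoted in the text.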
Juxtaposing the results of Table VIII with Table VI, as well as the results of Table IX with Table VII, shows that both the performance and the energy overhead are reduced significantly (especially for the latter, on average approximately one third of the overhead is removed).

V. CONCLUSIONS

In this paper, we extended the capabilities of a multiprocessor simulator in order to evaluate the impact of an SBST test thread strategy on the execution of the useful application load in terms of both performance and energy. The application load consisted of some representative SPEC benchmarks, and various scenarios of sporadic or continuous execution of the test thread were evaluated. A broad range of processor configurations was studied. Finally, we applied for the first time in a multiprocessor context an energy optimization methodology that was originally proposed to increase battery life for on-line test of battery-powered processors. Simulation results indicate that the methodology significantly reduces the energy and performance overhead in the multiprocessor application.
REFERENCES

[1] K. Constantinides, O. Mutlu, T. Austin, V. Bertacco, “Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation”, International Symposium on Microarchitecture (MICRO), 2007, pp. 97-108.
[2] M. Nicolaidis, Y. Zorian, “On-line Testing for VLSI – A Compendium of Approaches”, Journal of Electronic Testing: Theory and Applications (JETTA), vol. 12, no. 1-2, 1998, pp. 7-20.
[3] S. S. Mukherjee, M. Kontz, S. K. Reinhardt, “Detailed design and evaluation of redundant multithreading alternatives”, Annual International Symposium on Computer Architecture (ISCA), 30(2):99-110, 2002.
[4] F. Rashid, K. K. Saluja, P. Ramanathan, “Fault tolerance through re-execution in multiscalar architecture”, International Conference on Dependable Systems and Networks, 2000, pp. 482-491.
[5] E. Rotenberg, “AR-SMT: A microarchitectural approach to fault tolerance in microprocessors”, International Symposium on Fault-Tolerant Computing (FTCS), 1999, pp. 84-91.
[6] T. N. Vijaykumar, I. Pomeranz, K. Cheng, “Transient fault recovery using simultaneous multithreading”, International Symposium on Computer Architecture (ISCA), 30(2):87-98, 2002.
[7] E. F. Weglarz, K. K. Saluja, T. M. Mak, “Testing of Hard Faults in Simultaneous Multithreaded Processors”, International On-Line Testing Symposium, July 12-14, 2004, p. 95.
[8] Y. Li, S. Makar, S. Mitra, “CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns”, Design, Automation and Test in Europe (DATE), March 2008, pp. 885-890.
[9] O. Khan, S. Kundu, “A Self-Adaptive System Architecture to Address Transistor Aging”, Design, Automation and Test in Europe (DATE), 2009.
[10] J. Shen, J. A. Abraham, “Synthesis of Native Mode Self-Test Programs”, Journal of Electronic Testing: Theory and Applications (JETTA), vol. 13, no. 2, October 1998, pp. 137-148.
[11] I. Bayraktaroglu, J. Hunt, D. Watkins, “Cache Resident Functional Microprocessor Testing: Avoiding High Speed IO Issues”, International Test Conference (ITC), 2006, paper 27.2.
[12] F. Corno, G. Cumani, M. Sonza Reorda, G. Squillero, “Fully Automatic Test Program Generation for Microprocessor Cores”, Design, Automation and Test in Europe (DATE), 2003, pp. 1006-1011.
[13] E. Sanchez, M. Sonza Reorda, G. Squillero, “On the Transformation of Manufacturing Test Sets into On-Line Test Sets for Microprocessors”, International Symposium on Defect and Fault Tolerance in VLSI Systems, 2005, pp. 494-502.
[14] L. Chen, S. Ravi, A. Raghunathan, S. Dey, “A Scalable Software-Based Self-Testing Methodology for Programmable Processors”, Design Automation Conference (DAC), 2003, pp. 548-553.
[15] N. Kranitis, A. Paschalis, D. Gizopoulos, G. Xenoulis, “Software-Based Self-Testing of Embedded Processors”, IEEE Transactions on Computers, vol. 54, no. 4, pp. 461-475, April 2005.
[16] A. Paschalis, D. Gizopoulos, “Effective software-based self-test strategies for on-line periodic testing of embedded processors”, IEEE Transactions on CAD, vol. 24, no. 1, pp. 88-99, Jan. 2005.
[17] N. Kranitis, A. Merentitis, N. Laoutaris, G. Theodorou, A. Paschalis, D. Gizopoulos, C. Halatsis, “Optimal periodic testing of intermittent faults in embedded pipeline processor applications”, Design, Automation and Test in Europe (DATE), 2006, pp. 65-71.
[18] C.H.P. Wen, L.C. Wang, K.T. Cheng, W.T. Liu, J.J. Chen, “Simulation-based target test generation techniques for improving the robustness of a software-based-self-test methodology”, International Test Conference (ITC), 2006, pp. 936-945.
[19] A. Apostolakis, D. Gizopoulos, M. Psarakis, A. Paschalis, “Software-Based Self-Testing of Symmetric Shared-Memory Multiprocessors”, IEEE Transactions on Computers, vol. 58, no. 12, pp. 1682-1694, July 2009.
[20] N. Kranitis, A. Merentitis, G. Theodorou, A. Paschalis, D. Gizopoulos, “Hybrid-SBST Methodology for Efficient Testing of Processor Cores”, IEEE Design & Test of Computers, vol. 25, no. 1, pp. 64-75, Jan-Feb 2008.
[21] A. Merentitis, N. Kranitis, A. Paschalis, D. Gizopoulos, “Low Energy On-Line SBST of Embedded Processors”, International Test Conference (ITC), 2008, paper 12.1.
[22] D. M. Tullsen, S. J. Eggers, H. M. Levy, “Simultaneous multithreading: maximizing on-chip parallelism”, International Symposium on Computer Architecture (ISCA), 1995, pp. 392-403.
[23] A. Apostolakis, M. Psarakis, D. Gizopoulos, A. Paschalis, I. Parulkar, “Exploiting Thread-Level Parallelism in Functional Self-Testing of CMT Processors”, European Test Symposium (ETS), May 2009.
[24] D. Gizopoulos, “Online Periodic Self-Test Scheduling for Real-Time Processor-Based Systems Dependability Enhancement”, IEEE Transactions on Dependable and Secure Computing, vol. 6, no. 2, pp. 152-158, April 2009.
[25] S. Baldawa, R. Sangireddy, “CMP-SIM: An Environment for Simulating Chip Multiprocessor (CMP) Architectures”, University of Texas at Dallas, October 2006.
[26] D. Brooks, V. Tiwari, M. Martonosi, “Wattch: a framework for architectural level power analysis and optimizations”, International Symposium on High-Performance Computer Architecture (HPCA), 2000, pp. 83-94.
[27] D. Brooks, P. Bose, M. Martonosi, “Power-performance simulation: design and validation strategies”, ACM SIGMETRICS, vol. 31, pp. 13-18, March 2004.