Evaluating the Error Resilience of GPGPU Applications

Bo Fang, Jiesheng Wei, Karthik Pattabiraman, Matei Ripeanu
Department of Electrical and Computer Engineering, University of British Columbia
{bof, jwei, karthikp, matei}@ece.ubc.ca

I. INTRODUCTION

Over the past years, GPUs (Graphics Processing Units) have gained wide adoption as accelerators for general-purpose computing. A number of studies [1, 2] have shown that significant performance gains can be achieved by deploying GPUs in traditional high-performance computing (HPC) systems that host demanding scientific applications. However, the reliability implications of using GPUs are unclear. GPUs were originally designed to support applications that are intrinsically fault-tolerant, for example image rendering, where a few wrong pixels are not noticeable to the human eye. As GPUs are used to support a wider class of applications, such as DNA sequencing, graph analysis, and linear algebra, which are not error-resilient, it becomes critical to understand the behavior of these applications in the presence of hardware faults. This is especially important as hardware faults become increasingly common in commodity systems due to the effects of technology scaling and manufacturing variations [4].

Over the past years, GPU manufacturers have improved GPUs' reliability. For instance, starting with Fermi, NVIDIA GPUs use Error Correcting Codes (ECC) to protect register files, DRAM, caches, and on-chip memory from transient faults. However, hardware faults can occur anywhere in the computation or control data paths and can propagate to registers and/or memory. Such faults would not be detected by ECC in registers or memory, as the correct ECC would be calculated on the faulty data.

Previous efforts to study and improve the reliability of GPGPU applications exist. Yim et al. [3] propose software-based fault-detection tools that duplicate data at the programming-language level (loop code and non-loop code). Sheaffer et al. [5] propose a hardware redundancy scheme to selectively replicate critical parts of the GPU pipeline. However, both approaches treat GPGPU applications uniformly, without taking into account the characteristics of individual applications. As a result, they may incur significant inefficiencies, mainly due to false positives (i.e., detecting errors that do not matter to the application).

Our study evaluates resilience from the GPGPU application's perspective by focusing on faults occurring in the GPU's computation components. We investigate the fundamental reliability characteristics of CUDA-based applications as a first step towards application-specific error detection, and we employ fault injection to study the behavior of GPU applications under faults. The paper makes the following contributions:

• It designs and implements a fault injector that leverages run-time information of GPGPU applications running on real GPU hardware.
• It presents a preliminary evaluation of the error resilience of GPGPU applications under faults.

II. METHODOLOGY

We use the traditional approach for this area: we inject faults into the GPU computation path and evaluate the application's behavior after the fault is activated. To this end, we designed and implemented a fault injector with the following design goals in mind:

1. The fault injector should have visibility into the runtime information of the executed instruction stream. This ensures that fault injection correctly simulates hardware faults that propagate to the runtime execution of the program.
2. The fault injector should interfere minimally with the executed application. This guarantees that the fault injector itself does not affect the way hardware faults propagate, and that the system accurately emulates the consequences of a real fault.
3. The fault injector should inject faults uniformly across the dynamic instructions of the application. This reflects the uniformity of actual hardware faults over the runtime execution of the program.

We achieve these goals by building a fault injector on top of the CUDA GPU debugging tool, cuda-gdb. The fault injector comprises two main phases. First, we profile the application to obtain the run-time information of its threads (goal 1). Second, we randomly choose one of the executed instructions for fault injection. The fault injector injects a fault only when the chosen instruction is executed (goal 3); this is realized by setting a conditional breakpoint before running the application. When the application hits this breakpoint, a fault is injected (goal 2). Only one fault is injected per run, as hardware faults are relatively rare events.

The faults we consider are transient faults in computation, for example faults in the ALU, faults in the load-store unit while it computes memory addresses, and faults in instruction issue and commit. We do not consider faults in caches, memory, and register files, as we assume these are protected by ECC. We use the single-bit-flip model to simulate transient faults. Since GPUs support vector instructions, given an operand consisting of one regular register or up to four scalar registers forming a vector, we randomly choose a register from this set and flip a random bit in it. A sketch of this injection step appears below.
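The paper does not include the injector's code, but a minimal sketch of how one injection run could be driven through cuda-gdb is given below. The breakpoint granularity (a kernel entry rather than an arbitrary dynamic instruction), the $R<k> register names, the register-write syntax, and the helper names (build_gdb_script, run_one_injection, the profile dictionary) are assumptions made for illustration, not the authors' exact implementation.

```python
# Hypothetical driver sketch: one fault-injection run built around cuda-gdb.
# All names and the exact cuda-gdb register syntax are illustrative assumptions.
import random
import subprocess
import tempfile

def build_gdb_script(kernel, hit_count, reg_index, bit):
    """Emit a cuda-gdb command file that injects a single bit flip.

    The breakpoint's ignore count emulates "inject only when the randomly
    chosen dynamic instance executes"; the register-write expression is an
    assumed syntax, shown only to illustrate the single-bit-flip model.
    """
    return "\n".join([
        f"break {kernel}",                        # stop inside the target kernel
        f"ignore 1 {hit_count}",                  # skip until the chosen dynamic hit
        "run",
        f"print $R{reg_index}",                   # read the chosen (scalar) register
        f"print $R{reg_index} = $R{reg_index} ^ {1 << bit}",  # flip one random bit
        "delete breakpoints",                     # only one fault per run
        "continue",                               # let the faulty value propagate
        "quit",
    ]) + "\n"

def run_one_injection(app_binary, kernel, profile):
    """Perform one run: pick a random dynamic site, then drive cuda-gdb."""
    # 'profile' is assumed to come from the profiling phase (goal 1) and to
    # record how often the kernel breakpoint is hit and the operand width.
    hit_count = random.randrange(profile["dynamic_hits"])   # goal 3: uniform choice
    reg_index = random.randrange(profile["vector_width"])   # 1 regular or up to 4 scalar regs
    bit = random.randrange(32)                               # single-bit-flip model

    with tempfile.NamedTemporaryFile("w", suffix=".gdb", delete=False) as f:
        f.write(build_gdb_script(kernel, hit_count, reg_index, bit))
        script = f.name

    # cuda-gdb runs the application to completion so the outcome of the run
    # (benign / SDC / crash / hang) can be classified afterwards.
    return subprocess.run(["cuda-gdb", "--batch", "-x", script, app_binary],
                          capture_output=True, text=True, timeout=600)
```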

III. EXPERIMENTAL EVALUATION

A. Experimental Setup

We perform the experiments on an NVIDIA Tesla C2075 graphics card with CUDA toolkit 4.1. We use five benchmarks: AES encryption (AES) [6], matrix multiplication (MAT) [7], MUMmerGPU (MUM) [8], Breadth First Search (BFS) [9], and LIBOR Monte Carlo (LIB) [10]. We run each benchmark 2,500 times on average with the same input to obtain a sufficient number of activated faults. Overall, we obtain approximately 1,500 runs with activated faults (i.e., runs in which the faulty value is used by the program) for each benchmark; activation rates vary from 30% to 60% across benchmarks. Only activated faults are considered in our results. We categorize each fault based on the application's behavior as Benign, Silent Data Corruption (SDC, i.e., incorrect output), Crash, or Hang; a classification sketch is shown below.
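As a concrete illustration of this categorization, the sketch below classifies a single run by comparing it against a fault-free (golden) run on the same input. The time budget, exit-code convention, and helper names are assumptions for illustration, not the authors' actual harness.

```python
# Hypothetical outcome classification for one fault-injection run.
# Thresholds, exit-code handling, and the comparison scheme are assumptions.
import subprocess
from enum import Enum

class Outcome(Enum):
    BENIGN = "benign"   # fault activated, but output matches the golden run
    SDC = "sdc"         # silent data corruption: wrong output, no error raised
    CRASH = "crash"     # abnormal termination (e.g., invalid memory access)
    HANG = "hang"       # run exceeded the time budget

def classify(cmd, golden_output, time_budget_s=600):
    """Run the benchmark once and map its behavior to an outcome category."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=time_budget_s)
    except subprocess.TimeoutExpired:
        return Outcome.HANG
    if result.returncode != 0:          # e.g., killed by a CUDA error or signal
        return Outcome.CRASH
    if result.stdout != golden_output:  # same input, so output should be identical
        return Outcome.SDC
    return Outcome.BENIGN
```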


B. Experimental Results

Figure 1 presents an overview of the experimental results for all five benchmarks. Of the three failure outcomes, crashes dominate across the five benchmarks, accounting for between 18% and 50% of activated faults. Silent data corruption (SDC) is the second most frequent failure outcome, observed in 8% to 40% of activated faults depending on the benchmark. This is higher than for CPU applications, where SDCs occur at a 1%-9% rate in similar experiments [11, 12]. One reason for the high number of SDCs could be that the high degree of parallelism of GPGPU applications lowers the complexity of each individual thread, which potentially decreases the probability that a fault is masked by the application's behavior. In addition, the SDC rate varies more widely across benchmarks than it does for CPU applications, which suggests the need for application-specific error detection.

Beyond studying SDCs, understanding crashes in depth is valuable for analyzing the characteristics of the benchmarks. Figure 2 presents the breakdown of crashes for the benchmarks. Crashes can be categorized by their root causes as stack overflow, out-of-bound memory access, and misaligned memory access. Memory-related crashes are the only cause of crashes across all benchmarks, and among them, out-of-bound memory accesses dominate. A sketch of how these root causes can be recovered from crash reports follows.
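For illustration only, the crash categories of Figure 2 could be recovered from the exception text reported for each crashed run, along the lines of the sketch below. The matched substrings are assumptions, since the exact messages printed by cuda-gdb and the CUDA runtime vary across versions.

```python
# Hypothetical post-processing of a crashed run's cuda-gdb / runtime output.
# The matched substrings are assumptions; real exception text may differ.
def crash_root_cause(report: str) -> str:
    """Map a crash report to one of the root-cause categories of Figure 2."""
    text = report.lower()
    if "stack overflow" in text:
        return "stack overflow"
    if "misaligned" in text:
        return "misaligned"
    if "out-of-range" in text or "illegal address" in text:
        # Distinguish the memory space if the report names one (assumed wording).
        if "shared" in text or "local" in text:
            return "local/shared out-of-bound"
        return "global out-of-bound"
    return "unclassified"
```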

[Figure omitted: stacked bars of outcome percentage (Benign, SDC, Hang, Crash) for each benchmark program (AES, MAT, MUM, BFS, LIB).]

Figure 1. Overview of outcomes for each benchmark

[Figure omitted: percentage of crash root causes (stack overflow, global out-of-bound, misaligned, local/shared out-of-bound) for each benchmark program (AES, MAT, MUM, BFS, LIB).]

Figure 2. Crash breakdown for the five benchmarks

IV. CONCLUSION AND FUTURE WORK

We present a preliminary evaluation of the error resilience of GPGPU applications. We find that, compared to CPUs, these platforms exhibit a higher rate of silent data corruption, a major concern since these errors are not flagged at runtime and often remain latent. We also find that out-of-bound memory accesses are the leading cause of crashes. In future work, we will first focus on techniques to reduce the frequency of silent data corruption, as this is critical to most HPC applications.

REFERENCES

[1] Zhe Fan et al. GPU Cluster for High Performance Computing. In Proceedings of SC '04, IEEE Computer Society, Washington, DC, USA, 2004.
[2] D. Luebke. CUDA: Scalable parallel programming for high-performance scientific computing. In 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2008.
[3] Keun Soo Yim et al. Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU. In 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2011.
[4] C. Constantinescu et al. Trends and challenges in VLSI circuit reliability. IEEE Micro, 2003.
[5] Jeremy W. Sheaffer et al. A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors. In Proceedings of GH '07, 2007.
[6] S. A. Manavski. CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In ICSPC 2007: Proc. of IEEE Int'l Conf. on Signal Processing and Communication, pages 65-68, 2007.
[7] NVIDIA. CUDA C Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
[8] M. Schatz et al. High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics, 8(1):474, 2007.
[9] P. Harish. Accelerating Large Graph Algorithms on the GPU Using CUDA. In HiPC, 2007.
[10] M. Giles and S. Xiaoke. Notes on using the NVIDIA 8800 GTX graphics card. http://people.maths.ox.ac.uk/~gilesm/hpc/
[11] E. Czeck and D. Siewiorek. Effects of transient gate-level faults on program behavior. In Proceedings of the 20th International Symposium on Fault-Tolerant Computing, 1990.
[12] Jiesheng Wei et al. Comparing the Effects of Intermittent and Transient Hardware Faults on Programs. In Workshop on Dependable and Secure Nano-Systems (WDSN), 2011.
