Hybrid Quick Error Detection: Validation and Debug of SoCs through High-Level Synthesis

Keith Campbell, David Lin, Leon He, Liwei Yang, Swathi Gurumani, Kyle Rupnow, Subhasish Mitra, Deming Chen
Abstract—Validation and debug challenges of system-on-chips (SoCs) are getting increasingly difficult. As we reach the limits of Dennard scaling, efforts to improve system performance and energy efficiency have resulted in the integration of a wide variety of complex hardware accelerators in SoCs. Hence, it is essential to address the validation and debug of hardware accelerators. High-level synthesis (HLS) is a promising technique to rapidly create customized hardware accelerators. In this paper, we present the Hybrid Quick Error Detection (H-QED) approach, which overcomes validation and debug challenges for hardware accelerators by leveraging HLS techniques in both the pre-silicon and post-silicon stages. H-QED improves error detection latencies (time elapsed from when a bug is activated to when it manifests as an observable failure) by 2-5 orders of magnitude, with 1-cycle latencies in pre-silicon scenarios, and improves bug coverage 3-fold compared to traditional validation techniques. H-QED also uncovered previously unknown bugs in the CHStone benchmark suite, which is widely used by the HLS community. H-QED incurs an 8% accelerator area overhead with negligible silicon performance impact, and we also introduce techniques to minimize any possible intrusiveness introduced by H-QED.

Index Terms—logic bug, electrical bug, detection latency, validation, debug, pre-silicon, post-silicon, high-level synthesis, automation, simulation, testing, software, hardware, co-simulation, hybrid tracing, hybrid hashing, system modeling

K. Campbell, L. He, and D. Chen are with the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. D. Lin and S. Mitra are with the Department of Electrical Engineering at Stanford University. L. Yang, S. Gurumani, and K. Rupnow are with Inspirit IoT Inc.

I. INTRODUCTION

During post-silicon validation and debug (PSV), manufactured ICs are tested in real system environments in order to detect and fix design flaws (bugs). Bugs can be broadly classified as: 1) electrical bugs, which are caused by interactions between a design and its electrical state (including timing errors); and 2) logic bugs, which are caused by design errors. Logic bugs can also be caught during pre-silicon validation and debug, where simulations are performed on models of the circuit in a simulated system environment and emulations are performed on mappings of the circuit to an FPGA-based emulation environment. This paper focuses on both logic and electrical bugs, and on both pre-silicon and post-silicon validation.

Logic and electrical bugs are difficult to find and isolate because of their elusive behavior. Examples of this include: 1) the bug is not activated because test inputs fail to fully exercise all components of the design; 2) the bug activates, but its effects are masked so that no failure is observed, although the bug would be unmasked with different inputs; 3) the bug activates, but there is a very long (e.g., millions of cycles) error detection latency (time from activation to manifestation as an observable failure), making isolation very difficult; and 4) the bug's activation condition and/or effects are non-deterministic, meaning that they are influenced by unexpected confounding factors such as the toolchain used and the simulation or testing environment, making reliable reproduction very difficult.

Validation procedures are performed in both the pre-silicon and post-silicon stages to attempt to catch these bugs. Pre-silicon validation has the advantage of detecting and isolating bugs before design fabrication, where they are much less expensive to fix. On the other hand, traditional RTL-level pre-silicon simulation is unable to detect electrical bugs, and such simulations execute at speeds several orders of magnitude slower than real time, limiting the achievable test coverage. Post-silicon validation has the advantage of real-time execution in a real system environment, enabling long execution times to activate more bugs with real test scenarios that go beyond the limits of simulation models, for electrical circuit behavior in particular. On the other hand, design-for-test logic has area cost constraints, limiting the amount of internal state information that can be extracted from a design for bug isolation, causing many masked bugs to go undetected and making it difficult to isolate bugs with long error detection latencies.

FPGA emulation can be used in pre-silicon testing and has the advantage of execution speeds on the same order of magnitude as post-silicon testing. Emulation, however, does not accurately model the electrical behavior of the fabricated circuit, making it practically useful only for logic-level validation. Emulation also has the disadvantage of area and bandwidth constraints for design-for-test instrumentation.

An ideal solution to the validation problem would work in both pre- and post-silicon scenarios to detect all bugs at the instant they activate and point to the source location of each activation, allowing the hardware designer to focus on fixing the bugs rather than reproducing and isolating them. In this paper, we advance towards this ideal solution with the Hybrid Quick Error Detection (H-QED) technique to overcome validation challenges for non-programmable hardware accelerators in SoCs. H-QED is inspired by the QED technique [1], [2], [3], [4] for detecting bugs in programmable microprocessors. H-QED builds on advances in high-level synthesis (HLS) [5], [6], [7] to overcome this challenge by automatically embedding small hardware structures inside hardware accelerators. H-QED simultaneously improves error detection latencies and coverage of logic and electrical bugs. By combining H-QED with QED, we provide a systematic solution for the validation of SoCs
consisting of processor cores, uncore components, software-programmable accelerators, and hardware accelerators.

An emerging method for designing hardware accelerators is to specify them using high-level languages (e.g., C/C++, SystemC, or domain-specific languages) and automatically translate the high-level specifications into RTL designs using HLS tools. Many EDA companies (Xilinx, Intel/Altera, Cadence, Synopsys, Calypto, Maxeler) have commercial HLS tools on the market. HLS can reduce code size by up to 10× and improve simulation speed by 1000× (C vs. RTL code) [8], [9]. Also, high-level synthesis has a rich control data flow graph (CDFG) input to work with, enabling intelligent decisions about which variables to observe and when to observe them for debugging that would be much harder to extract from RTL-level input.

Since pre-silicon validation and post-silicon validation have different constraints for inserting instruments into a design, we created different variations of H-QED for each scenario. In this paper, we present both adaptations of H-QED: a pre-silicon tailored adaptation called hybrid tracing (Section II) and a post-silicon tailored adaptation called hybrid hashing (Section III). A preliminary version of hybrid hashing was presented in [10] and an early version of hybrid tracing was introduced in [11]. With the hybrid tracing and hybrid hashing variations, we demonstrate the effectiveness and practicality of H-QED by showing that: 1) H-QED enables 2-3 orders of magnitude improvement in error detection latencies for both electrical bugs and logic bugs vs. end-result-checks (comparing with known correct outputs); 2) H-QED uncovered four previously unknown logic bugs in the widely-used CHStone HLS benchmark suite [12]; 3) H-QED does not require any failure reproduction or low-level simulation (e.g., RTL or netlist) to detect bugs (although simulation can aid in bug localization); 4) H-QED allows accelerators to operate in "native" mode (similar to normal system operation) and has a minimal intrusiveness impact (it detects the bugs detected by traditional validation techniques); 5) hybrid tracing error detection latency is one cycle or less; 6) hybrid tracing pinpoints where in the source code the bug activates and provides a strong hint for possible bug fixes; 7) hybrid hashing improves electrical bug (timing error) coverage by up to 3× compared to PSV techniques using end-result-checks; and 8) hybrid hashing incurs an 8% accelerator area overhead and negligible performance costs.

The rest of this article is organized as follows: Section II discusses our pre-silicon hybrid tracing variation of H-QED, Section III presents our post-silicon hybrid hashing variation, Section IV covers our experimental results, Section V provides an overview of related work, and we conclude in Section VI.

Fig. 1. Using hybrid tracing for early pre-silicon integration testing

TABLE I. Methods for catching different kinds of logic bugs

                        unactivated          masked           unmasked
    deterministic       coverage analysis    unit testing     debug tools
    non-deterministic   coverage analysis    hybrid tracing   hybrid tracing

II. PRE-SILICON: HYBRID TRACING

We call our pre-silicon variation of H-QED hybrid tracing since it involves fine-grained, uncompressed tracing of variable values. Hybrid tracing can be used both for module-level verification of HLS-produced RTL and for integration testing, i.e., verification of multiple RTL modules and software on a CPU integrated into a system. Although module-level testing is important, integration testing invariably detects additional bugs that went undetected due to insufficient module-level test vectors, or bugs that relate to integration (e.g., a module works perfectly with the expected number of input data items, but another module sends the wrong amount). In both cases, the goal of hybrid tracing is to detect logic bugs, as RTL-level simulation models only logic, not electrical behavior. Unlike traditional functional testing, the goal of hybrid tracing is to capture the point of bug activation at a fine temporal and spatial granularity by observing not only module outputs but also internal variable values. This fine-granularity tracing is the key to dramatically improving debugging productivity.

As illustrated in Fig. 1, hybrid tracing enables hardware designers to isolate logic bugs by swapping between C/C++ reference implementations and RTL implementations produced with HLS. Thus, designers can validate complex designs piecemeal, selecting one module at a time to integrate with the rest of the system for verification. Note that the designer of a target module only needs high-level C/C++ models of the system it interfaces with, which need not be synthesizable, enabling early-stage integration testing for the parts that are synthesizable. Our framework compares the series of module outputs and internal variable values for discrepancies between the software model and the RTL implementation. When validation reveals a problem due to non-deterministic behavior, our code instrumentation, trace comparison, and back-tracing steps (discussed in Section II-B) provide the hardware designer with the C/C++ locations where the discrepancies occur.

A. Comparison to Software Debugging

As mentioned in Section I, logic bugs have many ways to elude detection. Fortunately, hybrid tracing leverages HLS, which brings software debugging tools to bear on the problem. Table I broadly classifies logic bugs by their behavior into three categories: unactivated, masked, and unmasked. Each of these bug classes can be further divided into deterministic and non-deterministic subcategories. (Non-deterministic means either the activation or masking condition is non-deterministic.)
Fig. 2. Our hybrid tracing framework
Unactivated bugs are caused by gaps in coverage (e.g., the buggy line of code was not executed or a condition was not met), which are best addressed by software coverage analysis tools. Such tools will point out these gaps, allowing the hardware designer to modify the design or test vectors to eliminate those gaps and activate bugs hiding inside. A deterministic, activated bug is reliably reproducible by definition and can be isolated with existing software debugging tools. Such tools can also help with deterministic, masked bugs by increasing observability. Software practices also encourage unit testing to help detect such masked bugs in the first place.

Software debugging techniques are much less useful for non-deterministic, activated bugs. By definition, such bugs are likely to behave differently in a software testing environment when compared to an RTL simulation environment. For example, the bug may cause a failure in RTL simulation, but the high-level simulation produces correct output, rendering software debugging alone unhelpful. Without any aid to track this bug down, the hardware designer has little choice but to attempt to find the bug in the RTL waveform by tracing backwards in execution from the observed failure to the root cause. After this painstaking process, the designer has another difficult problem to solve: determining the source-code meaning of the buggy RTL variable he identified. This can be very non-trivial with the complex software and HLS transformations involved in translating a high-level language to RTL. Hybrid tracing is designed to address both of these difficult problems, making isolation of the most difficult bugs automated and fast.

B. Hybrid Tracing Framework

Our hybrid tracing implementation is illustrated in Fig. 2. The input to the framework is a C++ module targeted for debugging and written with a synthesizable subset of C++ supported by the HLS tool. Additional non-synthesizable modules (not shown, to simplify the illustration) representing the system environment, such as those in Fig. 1, can be integrated into the hardware simulation through co-simulation and into the software simulation environment through linking. There are two branches of the framework: a hardware RTL-level simulation branch and a "reference" software branch.
Both branches have integrated instrumentation passes (which add variable tracing instructions) to enable greater observability of internal source-level variables. In the hardware branch, the instrumentation is integrated into the HLS engine after scheduling to minimize intrusiveness. The HLS engine produces SystemVerilog as output, which is then translated to a cycle-accurate SystemC module using Verilator [13], [14]. The software branch performs software compilation using the LLVM framework [15] and contains a custom instrumentation pass designed to reproduce the trace output produced by the hardware simulation, given some scheduling and address mapping information from the hardware pass. The output of the two branches is two variable trace sequences for comparison; when mismatches are found, information on which variable(s) caused the discrepancy can be used to identify the C/C++ source code involved with the bug. We now discuss each component of our framework in detail in the following subsections.

1) Hardware Simulation: The hardware simulation is a cycle-accurate RTL simulation of the hardware module in a test environment that can include high-level implementations of modules it interfaces with. This starts with HLS of the LLVM Intermediate Representation (LLVM-IR) for the hardware module, using an in-house HLS engine that is based on LegUp [16]. We insert our hardware instrumentation pass after scheduling and optimization, but before binding. The pass takes an optimized, scheduled CDFG as input and adds trace annotations on all variables that have a software counterpart.

In our previous prototype implementation in [11], the instrumentation pass was inserted pre-scheduling. This meant that the trace calls needed to be scheduled and their dependencies considered, in some cases resulting in deferred scheduling of trace calls to maintain ordering. This can increase register pressure artificially, change the synthesis results, and cause multi-cycle error detection latencies. Furthermore, trace calls create false dependencies that can block or complicate HLS optimizations. To improve the hybrid tracing implementation in [11], we split the instrumentation pass into complementary hardware and software passes, as shown in Fig. 2, and integrated the hardware instrumentation pass into our HLS engine. Adding the trace calls pre-binding makes them mere debugging annotations on signals that the HLS engine has decided are "real" signals (i.e., not redundant or dead operations) that must be bound to a physical resource. The annotations simply follow their operations and variables to the physical functional units and registers that they are bound to, producing the appropriate output in the state the variable is generated. During binding, the annotations are handled separately from the binding of "real" hardware. In other words, the addition of these debugging annotations does not affect the synthesizable binding solution generated by the HLS engine. Furthermore, the annotations can easily be removed or ignored for the purpose of synthesis.

We illustrate our hardware instrumentation pass with an example in Listings 1-5 and Tables II-V. Listing 1 shows an example input C program with a global variable to be mapped to a memory and a function which becomes a hardware module. Listing 2 shows the same program lowered to LLVM-IR, and Listing 3 shows the scheduled, optimized hardware IR (internal scheduled CDFG and memory address space map) before binding.
Listing 1. Input C++ code (foo.cpp)

    int bar[4];
    int foo(int x, int index) {
      int y = bar[index];
      bar[index] = x + y;
      return x * y;
    }

Listing 2. LLVM-IR (simplified for clarity)

    global [4 x i32] bar
    i32 foo(i32 x, i32 index) {
      i32* addr = getelementptr(bar[index])
      i32 y = load addr
      i32 tmp1 = add x, y
      store tmp1 → addr
      i32 tmp2 = mul x, y
      ret tmp2
    }

Listing 3. Scheduled operations (custom IR)

    0x1000  global [4 x i32] bar
    i32 foo(i32 x, i32 index) {
      [0]   i32 addr = add 0x1000, index
      [0-1] i32 y = load addr
      [1]   i32 tmp1 = add x, y
      [1-2] store tmp1 → addr
      [1-2] i32 tmp2 = mul x, y
      [2]   ret tmp2
    }

Listing 4. Hardware trace operations inserted

    i32 foo(i32 x, i32 index) {
      ...
      [0] trace(0, x)
      [0] trace(1, index)
      [0] trace(2, addr)
      [1] trace(3, y)
      [1] trace(4, tmp1)
      [2] trace(5, tmp2)
      [2] ret tmp2
    }

Listing 5. Software trace operations inserted

    i32 foo(i32 x, i32 index) {
      ...
      trace(0, x)
      trace(1, index)
      trace(2, addr_convert(addr))
      trace(3, y)
      trace(4, tmp1)
      trace(5, tmp2)
      ret tmp2
    }

TABLE II. Hardware Address Map

    Memory   Address   Depth   Width
    bar      0x1000    4       32

TABLE III. Trace Schedule

    Func.   Block   Traced Variables
    foo     entry   x:0, index:1, addr:2, y:3, tmp1:4, tmp2:5

TABLE IV. Debugging Information

    id   func:var    file:line:col
    0    foo:x       foo.cpp:2:14
    1    foo:index   foo.cpp:2:21
    2    foo:addr    foo.cpp:3:16
    3    foo:y       foo.cpp:3:9
    4    foo:tmp1    foo.cpp:4:20
    5    foo:tmp2    foo.cpp:5:14

TABLE V. Address Translation Table

    Variable   SW addr   HW addr
    bar[0]     0xa7010   0x1000
    bar[1]     0xa7014   0x1001
    bar[2]     0xa7018   0x1002
    bar[3]     0xa701c   0x1003
The memory is annotated with its base address on the left, and each instruction is annotated with the cycles in which it is scheduled for execution, where the result becomes available in the final cycle.

Listing 4 shows the additional trace instructions inserted by our hardware instrumentation pass as well as their scheduled states. Table II shows the hardware address map determined by our HLS engine and passed to the software instrumentation pass (Section II-B2). Each row of the address map indicates an LLVM variable, its base hardware address, and the corresponding memory block depth and width. Table III is the trace schedule passed from our hardware instrumentation pass to the software instrumentation pass. The trace schedule has a row for each basic block in each function and indicates the LLVM-IR variables traced in each block, in the order they are traced, as well as a unique integer ID for each variable.

The challenge in creating this trace schedule is mapping hardware IR variables to LLVM-IR variables. Not all LLVM-IR variables have a corresponding hardware IR counterpart, as some CDFG nodes may be optimized away by HLS transformations (e.g., a global array reference becomes a constant address in hardware IR due to the static address space mapping, which can lead to further constant propagation optimizations). Similarly, not all hardware IR variables have an LLVM-IR counterpart. The LLVM-IR getelementptr operation can involve a number of additions and multiplications, generating multiple hardware IR variables that represent intermediate computations and do not correspond to the final getelementptr result.

Our solution to this problem is to propagate debugging annotations in our hardware IR. During the initial lowering of LLVM-IR to our hardware IR, we annotate hardware IR nodes with references to the corresponding LLVM-IR variables. We then preserve these LLVM-IR variable references across hardware IR transforms like scheduling and optimization where feasible. If a variable is lowered to a constant, we propagate the variable annotation to the constant, as this enables the compile-time computation of the constant value to be checked. Our hardware instrumentation pass can then scan the hardware IR to find all nodes with debugging annotations, generate trace instructions for them, and use the annotations to produce the corresponding LLVM-IR instruction in the trace schedule.

Once the trace annotations are inserted, our HLS engine performs binding and finishes with RTL generation, during which our HLS engine lowers each "trace" instruction instance to a SystemVerilog "$fwrite" call that prints the corresponding variable ID and value to a file.1 The hardware simulation process then proceeds with the following steps:

a) Verilator: Send the resulting SystemVerilog RTL code through Verilator [13], [14]. Verilator translates the RTL code to an equivalent cycle-accurate SystemC representation that a standard C++ compiler can compile and run.2

b) Clang+LLVM: Compile the SystemC version of the hardware module. The hardware module can optionally be linked with untimed high-level C/C++ versions of software modules it interfaces with, with some additional glue code to connect the SystemC interfaces to the untimed C/C++ function calls.

c) CPU Execution: Run the simulation. This results in untimed execution of the software portions of the design and a cycle-accurate RTL simulation of the hardware module. The instrumented hardware module dumps RTL execution traces to a file to be sent to the comparison step (Section II-B3) together with the software-only reference trace.

1 These calls can easily be ignored or removed for the purpose of synthesis after debugging is complete.
2 An alternative to SystemC translation is RTL/C++ co-simulation. We find that SystemC RTL simulation is faster [11].
2) Reference Simulation: The purpose of the reference simulation is to produce a "gold" reference trace that the hardware simulation should reproduce exactly under bug-free conditions. This process starts with an LLVM-IR pass that inserts the instrumentation that generates this trace. The software instrumentation pass needs two pieces of information from the hardware pass to generate a matching trace (Section II-B1): a trace schedule and a hardware address map. The trace schedule provides the ordering of the trace calls as determined by the HLS scheduler, enabling the software pass to generate code that produces the trace output in the same order. The hardware address map enables the software pass to generate code that reproduces hardware address values by translating software addresses to hardware addresses. The hardware pass also generates a debugging information table that provides a source location for each LLVM-IR variable that is traced, which enables the backtracing process (Section II-B3) to provide source-level meaning to the hardware designer. This debugging information is generated from the debugging metadata provided by the LLVM compiler infrastructure.

We continue our example to illustrate this process. Listing 5 shows the software trace operations the software instrumentation pass adds to the LLVM-IR in Listing 2, and Table IV shows the outputted debugging information file. The pass adds calls to two library functions that we implement and link the LLVM module against: "trace" and "addr_convert". The trace function is semantically equivalent to the hardware variant, taking a variable ID and value and writing that variable ID and value pair to a file with an "fprintf" call. The addr_convert call is inserted for all address variables traced; it takes a software address as input and outputs a translated hardware address. This translation process is based on a static address translation table generated at runtime, which we will discuss shortly.

We then run the instrumented LLVM-IR through the LLVM backend to generate machine code that runs on the host CPU, linking with our library that implements the "trace" and "addr_convert" functions. Since the trace function generates output, the software compiler cannot change the relative ordering of the trace calls, ensuring that the software trace order will match the hardware trace order under bug-free conditions. We then run the simulation, which executes the software model of the module and generates a trace through the inserted "trace" calls. To enable the "addr_convert" calls, we generate, at the start of execution, a translation table for all variable elements mapped to hardware memory blocks by using the hardware address map and the dynamic software addresses of those variables. Table V shows the address translation table for our example based on the hardware address map (Table II). We map such variables to static memory, ensuring that their addresses are fixed at runtime. Once the table is initialized, each software-to-hardware address conversion is a simple lookup.3

3 To translate pointers to the "end" of an array (i.e., one past the last element, used as an upper pointer bound in a loop), we first pass the LLVM-IR module through a transform that adds an extra dummy element to each array.
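The following is a minimal sketch of what these two runtime library functions could look like. The trace file format, the container used for the translation table, and the exact signatures are assumptions for illustration; only the behavior described above (an fprintf of the ID/value pair, and a table lookup from software to hardware addresses) is taken from the text.

```cpp
// Illustrative sketch of the "trace" and "addr_convert" runtime library.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

static std::FILE* trace_file = nullptr;

// Write one "<id> <value>" pair per line, mirroring the hardware trace output.
extern "C" void trace(uint32_t id, uint64_t value) {
  if (!trace_file) trace_file = std::fopen("reference.trace", "w");
  std::fprintf(trace_file, "%u %llx\n", id, (unsigned long long)value);
}

// Filled once at program start from the hardware address map and the (static)
// software addresses of every element mapped to a hardware memory block (Table V).
static std::unordered_map<uintptr_t, uint64_t> addr_table;

extern "C" uint64_t addr_convert(const void* sw_ptr) {
  auto it = addr_table.find(reinterpret_cast<uintptr_t>(sw_ptr));
  return it != addr_table.end() ? it->second : 0;  // simple lookup per conversion
}
```

In the instrumented LLVM-IR, a traced address value is passed through addr_convert before trace, as in Listing 5.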
TABLE VI. Bug Detection Example

    Source Code     Reference Trace    Hardware Trace
    int x;          cond: 0            cond: 0
    if (cond) {     -- skip body --    -- skip body --
      x = 1;        x: 7461            x: -24905
    }               y: 7               y: 7
    z = x + y;      x + y: 7468        x + y: -24898
3) Trace Comparison and Debugging: We now compare the trace files from the reference simulation (Section II-B2) and the hardware simulation (Section II-B1). Under bug-free conditions, the traces will be identical, since we ensure that the trace call ordering of the reference simulation matches the hardware schedule and we translate software addresses to hardware addresses in the reference simulation. Thus any discrepancy indicates a bug. (See Section II-D for an example bug and how our process detects it.) For the discrepancies observed, we look up the corresponding variable IDs in the debug info file generated by the software pass (Section II-B2, Table IV) to identify the variable names and source locations for the mismatched variables. For the first variable ID with a discrepancy, we report the variable name, source location, and the pair of mismatched values observed to the hardware designer.

C. Extensions

A variation of hybrid tracing can be used as a hardware simulation breakpoint trigger. This can be useful if a bug is only activated in the generated hardware. In this variation, the reference simulation trace is generated first. In the hardware simulation, the trace function is implemented by reading a variable (ID, value) pair from the reference trace and checking whether it matches the (ID, value) parameters the function receives. If there is a mismatch, the function calls the Verilog "$stop" function (or similar) to suspend the simulation and enable the test engineer to examine the simulated hardware state right at the point the bug first activates.

Hybrid tracing can also be used in FPGA emulation for small designs where off-chip bandwidth or on-chip memory is sufficient to handle the trace data. In this variation, the trace instructions are mapped to hardware structures that buffer the trace data and store it in memories, either on-chip or off-chip.
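As a concrete illustration of the breakpoint-trigger variant above, the trace function in the hardware simulation can be swapped for a checking version along the following lines. This is a sketch only: it assumes the simple "<id> <value>" per-line trace format used earlier, and it uses sc_stop() as the stop mechanism in the Verilated SystemC simulation; the RTL-level equivalent would be a $stop call.

```cpp
// Sketch of a checking trace function used as a simulation breakpoint trigger.
#include <cstdint>
#include <cstdio>
#include <systemc.h>

static std::FILE* ref_trace = nullptr;

void trace(uint32_t id, uint64_t value) {
  if (!ref_trace) ref_trace = std::fopen("reference.trace", "r");
  unsigned ref_id = 0; unsigned long long ref_value = 0;
  bool ok = std::fscanf(ref_trace, "%u %llx", &ref_id, &ref_value) == 2 &&
            ref_id == id && ref_value == value;
  if (!ok) {
    std::fprintf(stderr, "trace mismatch at id %u: hw=0x%llx ref=0x%llx\n",
                 id, (unsigned long long)value, ref_value);
    sc_stop();  // suspend simulation at the first point the bug activates
  }
}
```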
Fig. 3. Hybrid hashing instrumented accelerators inside an SoC. (a) SoC-level view, and (b) block diagram of an instrumented accelerator showing the accelerator and the signature generator.
D. Bug Example

How does hybrid tracing detect bugs? We have found that a reader's intuition often leads to the conclusion that source-code bugs cannot be caught because the same buggy code is fed to both the hardware and software simulations, and thus the two simulations will produce identical traces. While this intuition is correct for deterministic bugs that always behave the same way, it fails for non-deterministic bugs, whose hard-to-predict behavior depends on many confounding factors. (The reader may want to refer back to Section I for insights about different kinds of bug behavior and to Section II-A for how hybrid tracing fits in with other debugging techniques.)

To drive this point home, we provide a simple example of a non-deterministic source-code bug in Table VI. The bug in the source code is that "x" is used uninitialized, and thus its value is non-deterministic, i.e., it is toolchain and environment dependent. The reference simulation will likely use some garbage value from the stack. The hardware simulation could use the register's initial power-on state.4 It is unlikely that the hardware and software values for "x" are identical, and thus hybrid tracing pinpoints the location where the bug activates. Note that while this bug is a simple example for explanatory purposes, initialization bugs can have complex activation conditions (e.g., large buffers that are partially initialized). More importantly, there are many other types of non-deterministic bugs that cause different hardware and software behavior, such as undefined memory accesses, timing-dependent bugs, and hardware-specific protocol violations. In our experiments with all of the known bugs in the CHStone high-level synthesis benchmark suite [12] in Section IV-D, we find that hybrid tracing is able to detect many different kinds of logic bugs, including some previously unknown bugs as well as bugs that a suite of existing static software analysis and dynamic software bug detection techniques are unable to detect.

4 X simulation would detect this bug for a memory element that is never touched since the start of simulation, but will not by default model elements that have been previously used but are no longer allocated (i.e., locations that hold garbage values for dead variables). Additional annotations by the hardware designer (e.g., explicit assignments to 'X') are required to correctly model these cases (which are more common when the hardware has been running for a long time, exactly the kind of scenario we want to target).

III. POST-SILICON: HYBRID HASHING

As mentioned in Section I, area and bandwidth costs are the key constraints when adding instruments for post-silicon validation of accelerators. We call our post-silicon variation of H-QED hybrid hashing since we reduce variable traces to a running hash value used to generate a low-bandwidth trickle of "signature" bits. The primary goal of hybrid hashing is to detect electrical bugs. Hybrid hashing can also detect most (if not all) of the non-deterministic logic bugs that hybrid tracing can, but we expect hybrid tracing to catch most of those bugs pre-silicon. While both our pre-silicon and post-silicon H-QED solutions can be integrated into an SoC design, we pay special attention to integration post-silicon because of the limited testing flexibility of physical hardware.

Fig. 3 shows an SoC-level view of our hybrid hashing enabled accelerators. The SoC typically consists of processor core(s), accelerator(s), and uncore components.
Fig. 4. Our hybrid hashing framework
The inputs and outputs of the accelerators are supplied by the processor cores. During PSV, the accelerators generate hardware signatures that are saved in dedicated on-chip memories (Fig. 3a). These signatures are later compared to a reference set of signatures to detect bugs, using a software instrumentation pass similar to that of hybrid tracing, but modified to reproduce the running hash function and signature generation functionality of the hardware. Section III-A discusses the hybrid hashing process in detail, and Section III-B details a proposed methodology for integrating hybrid hashing into a post-silicon validation run.

A. Hybrid Hashing Framework

Our hybrid hashing implementation is illustrated in Fig. 4. The input to the framework is a high-level design of a hardware accelerator. Similar to hybrid tracing, the framework has a hardware branch and a software branch. The hardware branch has an instrumentation pass integrated into the HLS engine after scheduling. The software branch also involves a complementary instrumentation pass that takes as input a probe schedule and a hardware address map from our hardware pass to ensure that it will produce the same signature stream as the hardware under bug-free conditions.5

Unlike hybrid tracing, our hybrid hashing framework is area and bandwidth cost constrained. Our cost reduction strategies are: 1) reducing the initial raw signal probing bandwidth by only tracing key "non-temporary" variables; 2) using a hybrid multiplexor and XOR tree reduction logic to drastically reduce the number of probe bits to a small number with minimum area cost; 3) using an LFSR to compute a running hash of this reduced signature; and 4) outputting a single-bit checksum computed from the LFSR state every n cycles (configurable).

As Fig. 3b shows, our hybrid hashing framework produces an RTL implementation with integrated hashing logic, which generates a sequence of signature bits during a PSV run. Care must be taken to ensure that the instrumentation does not cause excessive intrusiveness, e.g., by stalling the accelerator or by interfering with its input and output data traffic. Intrusiveness can prevent activation of bugs inside the accelerator during PSV.

5 We use the term "probe" instead of "trace" in post-silicon validation to avoid confusing our high-level technique with the many trace-buffer based approaches that perform cycle-granularity recording of RTL-level signals. See Section V-B for a comparison of hybrid hashing and trace-buffer techniques.
In an effort to minimize intrusiveness, we store hardware signatures in a dedicated on-chip memory with dedicated communication channels, as shown in Fig. 3a.6 The costs associated with this storage are reported in Section IV.

6 If routing congestion is an issue, each accelerator could have its own dedicated, local signature memory, which would also enable multiple simultaneous signature captures. If storage cost is an issue, it may be possible to minimize signature storage costs (while controlling intrusiveness) by streaming hardware signatures to off-chip memory using existing debugging ports, such as JTAG.

As mentioned earlier, the primary goal of hybrid hashing is to detect electrical bugs. Hybrid tracing will detect most logic bugs, while electrical bugs are likely to escape pre-silicon validation due to the difficulty of accurately modeling and predicting the electrical-level behavior of a complex design at a useful speed. Indeed, even relatively fast RTL-level simulation, which does not model electrical bugs, can have simulation speeds several orders of magnitude slower than real time for complex designs. With limited electrical bug detection capabilities pre-silicon, post-silicon validation becomes the last line of defense against such bugs escaping to end-user deployment. While hybrid tracing of LLVM-IR variable-equivalent hardware signals is sufficient to detect any non-deterministic logic bug that activates with essentially no error detection latency (see Section IV-D for a demonstration of this), electrical bugs activated post-silicon can affect almost any hardware structure in the hardware accelerator, including the state register. To ensure that electrical bugs are caught quickly and do not make it to the accelerator outputs undetected, we add additional instrumentation to the state register as well as the accelerator's input and output ports. The intuition here is that if we check all outputs and periodically check all "non-temporary" bits of the state, then electrical bugs will have no place to hide. We now discuss these additions in detail.

1) Hardware Execution: The hardware execution is an in-situ test of the fabricated accelerator with embedded hybrid hashing instruments, run through an existing post-silicon validation testing harness. Similar to the hybrid tracing process, we start with LLVM-IR for the hardware accelerator and add instruments after scheduling and optimization, but before binding. To reduce the initial raw probing bandwidth, we only probe variables that are non-temporary. Looking at the FSM states in which each variable is live, we define a non-temporary variable as one that crosses more than one state transition, at least one of which is a basic block boundary (i.e., the variable is live in more than one basic block). Our scheduler prefers to schedule the probe for each variable in its last use state (the last state in which it is accessed). The intuition here is to observe the variable in the last cycle of its lifetime to catch all potential electrical-bug-induced mutations that could have occurred in earlier cycles, before the value goes into a functional unit where the mutation could be masked. Note that this contrasts with hybrid tracing instruments, which target logic bugs by observing variable values at the start of their lifetime, right after the value is generated by some operation, since a logic bug causing a variable value mutation is unlikely.

Another goal, however, is to minimize the number of probe ports carrying these signals through multiplexing. To allocate a minimum number of register probe ports, we use an algorithm
that attempts to create a feasible probe schedule using a single probe port. We attempt to reschedule probes for variables with the same use state to predecessor states (where the variable is still live) to produce a feasible schedule. If scheduling fails, we attempt to schedule again with an additional probe port and repeat until scheduling succeeds. After instrumentation, the resulting hardware IR is passed through binding to produce a set of shared probe ports for the accelerator's CDFG variables. In the RTL generation stage, appropriate multiplexors are produced for those probe ports. We also add dedicated probe ports for each accelerator input, output, and state machine. As discussed earlier, these additional ports enable electrical bug detection. Each probe port outputs a probed value when active, and zero otherwise, to avoid contaminating the generated signatures with garbage values.

Our HLS engine generates additional RTL for the signature generation logic which, as shown in Fig. 5, involves a bitwise XOR reduction of the probe port signals to compress them to a small number of bits with minimum area cost.7 The bits are then fed to an LFSR, which computes a running hash of the reducer output that captures all probed signal history, including the cycle timing of those signals. Every n cycles (configurable), we output a one-bit signature from the LFSR.

Assuming that the LFSR has a sufficient number of state bits for the probability of aliasing inside the LFSR to be negligible, we can compute the expected time from an error being captured in the LFSR state (meaning the state becomes different from the error-free value) to the error being captured in a signature bit as follows. After an error is captured in the LFSR state, the average time until the next signature bit is outputted is n/2. The probability of aliasing in each signature bit is 1/2, and each alias occurrence costs n cycles of latency. Thus the expected cycle delay from LFSR error capture to the first signature bit error capture is:
E[\text{sig EDL}] = \sum_{i=0}^{\infty} \frac{1}{2^{i+1}} \left( \frac{n}{2} + i\,n \right) = \frac{3}{2}\,n \qquad (1)
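For completeness, the two sums in (1) are standard geometric series, \sum_{i=0}^{\infty} 2^{-(i+1)} = 1 and \sum_{i=0}^{\infty} i\,2^{-(i+1)} = 1, so

\sum_{i=0}^{\infty} \frac{1}{2^{i+1}} \cdot \frac{n}{2} = \frac{n}{2}, \qquad \sum_{i=0}^{\infty} \frac{1}{2^{i+1}} \cdot i\,n = n, \qquad E[\text{sig EDL}] = \frac{n}{2} + n = \frac{3}{2}\,n.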
Note that the delay distribution decays exponentially, so delays several times this average are unlikely.

We illustrate our hardware instrumentation and PSV execution with an example in Listings 6-7, Table VII, and Fig. 5. Listing 6 provides an example scheduled hardware IR similar to Listing 3 for hybrid tracing. Table VII is the probe schedule. Note that with one probe port allocated, both "x" and "y" could not be scheduled in their last use cycle, cycle 1. Thus we rescheduled the probe of "x" to cycle 0. ("z_ptr", "z", "b_ptr", and "b" are probed through dedicated memory port probes.) The probe schedule is similar to the pre-silicon trace schedule in Table III, but there are a few important differences. One is that each basic block in the schedule is broken into cycles, which correspond to states in the FSM that controls the accelerator. Cycle granularity is needed because the downstream LFSR is sensitive to the cycle in which the probed values are provided.

7 The probe port multiplexors and XOR reducers can be pipelined as needed with some number p of additional pipeline registers to meet timing, resulting in p cycles of additional real-time error detection latency. If the signature generation logic is also run p cycles behind to match the change in the XOR reducer output timing, this delay will not affect the signatures generated. Thus offline comparison with reference signatures will have the same result, resulting in an effective error detection latency overhead of zero.
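The probe-port allocation loop described earlier in this subsection can be summarized by the following sketch. The data structures and the exact rescheduling heuristic are assumptions for illustration (the in-house HLS IR is not shown in the paper); only the overall strategy of preferring the last-use state, falling back to earlier live states, and adding a port when no feasible schedule exists follows the text.

```cpp
// Illustrative sketch of greedy probe-port scheduling (not the authors' exact code).
#include <map>
#include <optional>
#include <vector>

struct ProbeReq {
  int var_id;
  std::vector<int> live_states;   // FSM states where the variable is live, in order;
};                                // the last entry is its preferred (last-use) state

// Try to place every probe using at most `ports` probes per state.
static std::optional<std::map<int, std::vector<int>>>   // state -> probed var_ids
try_schedule(const std::vector<ProbeReq>& reqs, int ports) {
  std::map<int, std::vector<int>> plan;
  for (const auto& r : reqs) {
    bool placed = false;
    // Prefer the last-use state, then walk back through predecessor live states.
    for (auto it = r.live_states.rbegin(); it != r.live_states.rend(); ++it) {
      if ((int)plan[*it].size() < ports) { plan[*it].push_back(r.var_id); placed = true; break; }
    }
    if (!placed) return std::nullopt;        // infeasible with this many ports
  }
  return plan;
}

std::map<int, std::vector<int>> schedule_probes(const std::vector<ProbeReq>& reqs) {
  for (int ports = 1;; ++ports)              // add a probe port and retry until feasible
    if (auto plan = try_schedule(reqs, ports)) return *plan;
}
```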
Listing 6. Scheduled operations (custom IR)

    global [100 x i32] Z
    global [100 x i32] B

    void bar(i32 z_ptr, i32 b_ptr) {
      [0-1] i32 z = load z_ptr
      [1]   i32 a = add x, y
      [1-2] i32 b = mul a, z
      [2-3] store b → b_ptr
    }

Listing 7. Software trace operations added

    void bar(i32 z_ptr, i32 b_ptr) {
      ...
      software_lfsr(1 ⊕ addr_convert(z_ptr) ⊕ z ⊕ x)
      software_lfsr(2 ⊕ y)
      software_lfsr(3 ⊕ addr_convert(b_ptr) ⊕ b)
    }

TABLE VII. Probe Schedule

    Func.   Block   Cycle   Const   Probed vars
    bar     bb1     0       1       z_ptr, z, x
                    1       2       y
                    2       3       b_ptr, b

Fig. 5. Instrumented accelerator with signature generation. Probe wires are labeled in red with the basic block and cycle(s) in which they are probed.
Another difference is that there is a fixed constant provided in addition to the probed variables. This constant is a lumped XOR sum of all of the values that are fixed in that cycle; in this example the only constant is the FSM's state encoding for that cycle. Finally, no variable IDs are tracked, as variable-value associations are lost in the hashing of all of the probed values.

Fig. 5 shows the resulting hardware. Each probe port has a multiplexer associated with it that drives the port to logic 0 when it is not probed. The select signals of the mux are derived from the corresponding states annotated in Fig. 5.

2) Reference Simulation: As with hybrid tracing, the purpose of the reference simulation is to reproduce the signatures produced by the instrumented hardware under bug-free conditions. This process is similar to the hybrid tracing variation, with some changes to the software instrumentation pass. As with hybrid tracing, the software pass takes a hardware address map and probe schedule as input. Instead of reproducing a trace, the software pass is now tasked with reproducing the hardware's signature sequence, which involves implementing the XOR reduction and LFSR in software. We design our hardware to be software-implementation friendly, so the XOR reduction is simply a software (bitwise) XOR of all of the variables, while the LFSR is a small series of bit shifts and XOR operations. The software instrumentation for our example is shown in Listing 7. The LFSR function also mimics exactly the signature output interval of the hardware LFSR, enabling the software to generate signatures that match the hardware.
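A minimal sketch of this software-side signature reproduction is shown below. The 16-bit width and the one-bit output every n cycles follow the text (and the experimental setup in Section IV); the feedback taps, the way the reduced probe word is folded into the state, and which state bit is emitted are illustrative assumptions, since in practice the software must simply mirror whatever the hardware LFSR does bit for bit.

```cpp
// Sketch of the software LFSR used to reproduce hardware signatures.
#include <cstdint>
#include <vector>

static uint16_t lfsr_state  = 0xACE1;   // non-zero seed (must match the hardware)
static unsigned cycle_count = 0;
static const unsigned kInterval = 100;  // signature output interval n
static std::vector<int> signatures;     // reference signature bit stream

void software_lfsr(uint64_t reduced_probe_word) {
  // The bitwise XOR reduction of the probed values has already been done by the
  // caller (see Listing 7); fold the result into the 16-bit LFSR state.
  uint16_t folded = 0;
  for (int s = 0; s < 64; s += 16) folded ^= uint16_t(reduced_probe_word >> s);
  lfsr_state ^= folded;

  // One Fibonacci LFSR step (taps chosen for illustration only).
  uint16_t fb = ((lfsr_state >> 0) ^ (lfsr_state >> 2) ^
                 (lfsr_state >> 3) ^ (lfsr_state >> 5)) & 1u;
  lfsr_state = uint16_t((lfsr_state >> 1) | (fb << 15));

  if (++cycle_count % kInterval == 0)    // emit a one-bit checksum every n cycles
    signatures.push_back(lfsr_state & 1);
}
```

Each call corresponds to one probed cycle, e.g., software_lfsr(1 ⊕ addr_convert(z_ptr) ⊕ z ⊕ x) for cycle 0 of Listing 7.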
B. Integration into PSV Testing

As mentioned at the start of this section, the limited testing flexibility of physical hardware compared to pre-silicon simulation necessitates a careful consideration of how an
accelerator will be tested as an integral part of an SoC during a PSV run. To demonstrate the practicality of our approach, we describe a proposed testing procedure. During PSV, the hardware accelerator (and SoC) operates in native mode to activate bugs, and a sequence of hardware signatures is generated, stored in on-chip memory, and collected at the end of the run. Next, the software version is executed on a processor and generates a separate sequence of signatures. Bugs may or may not be activated during this execution, and it can be totally decoupled from the PSV run. The two signature sequences are compared; mismatches indicate bugs.

To ensure that the hardware signatures match the software signatures (under bug-free conditions), we must ensure that the software version and the hardware accelerator receive the same inputs. This can be accomplished in several ways, including:

1) After a test is executed during a PSV run, the SoC is configured so that the hardware accelerator is disabled and the software version is swapped in. Next, the same test is executed to generate software signatures. Note that this is different from failure reproduction because we do not require bugs to be reproduced during the second run.

2) After a test is executed during a PSV run, the same test is run again with the SoC (and the test) configured to capture (and store) accelerator inputs at pre-defined memory locations. Using these captured inputs, the software version is then executed, either on the embedded processor core of the SoC being validated or on some other processor, to generate software signatures. Again, we do not require bugs to be reproduced after the first PSV run.

C. Extensions

In a similar vein to the hybrid tracing simulation trigger discussed in Section II-C, hybrid hashing can be used as a trigger to stop hardware execution during a post-silicon validation run when a bug is detected. This variation of hybrid hashing involves first generating the reference signature sequence for an accelerator and storing it in the accelerator's on-chip signature memory. Instead of writing to the signature memory, the signature generator reads from the signature memory and performs real-time comparison of the reference and generated signatures. If a mismatch is found, a trigger is asserted which stops all hardware execution on the SoC and enables the validation engineer to examine the chip's state (e.g., by reading out scan chains). Trace buffers can also be used in conjunction with such a trigger to provide information about past state (up to and including bug activation, if the error detection latency does not exceed the trace buffer capacity).

Hybrid hashing can also be used in FPGA emulation for larger designs where off-chip bandwidth or on-chip memory is
insufficient for trace data. This emulation would proceed much the same way as a post-silicon validation run (Section III-B).

IV. EXPERIMENTAL RESULTS

To show the effectiveness and practicality of H-QED, we ran simulation and FPGA emulation experiments to collect data for cycle, flip-flop, and simulation overhead for hybrid tracing; area and clock period overheads for hybrid hashing; and error detection latencies and coverage estimates for logic and electrical bugs. We used all 12 benchmarks from the CHStone [12] and 15 benchmarks from the PolyBench [17] benchmark suites. For our hybrid hashing experiments, we used a 16-bit LFSR with a 1-bit output. We fixed the signature output interval of each benchmark at 100 cycles or the interval that would result in a 5% signature storage area cost, whichever interval is larger. At the end of benchmark execution, we dump the full contents of both the hardware and software LFSRs into the signature stream to ensure that any late LFSR mismatches are detected.
A. Hybrid Tracing Intrusiveness
TABLE VIII. Overhead due to intrusiveness (cycles / flip-flops, %)

    Bench        [11]             HT         Bench         [11]            HT
    adpcm        24.91 / 0.00     0 / 0      gsm           0.31 / 0.00     0 / 0
    aes          246.88 / 0.02    0 / 0      jpeg          25.63 / 0.00    0 / 0
    atax         0.00 / 0.00      0 / 0      matrix        0.00 / 0.00     0 / 0
    bicg         0.00 / 0.00      0 / 0      matrix4x4     28.20 / 0.00    0 / 0
    blowfish     29.16 / 0.00     0 / 0      mips          0.00 / 0.00     0 / 0
    dfadd        11.05 / 2.51     0 / 0      motion        0.00 / 0.00     0 / 0
    dfdiv        0.00 / 0.00      0 / 0      mvt           26.48 / 0.00    0 / 0
    dfmul        5.65 / 0.00      0 / 0      reg-detect    22.96 / 0.08    0 / 0
    dfsin        7.55 / 1.47      0 / 0      sha           18.28 / 0.00    0 / 0
    doitgen      0.00 / 0.00      0 / 0      symm          0.00 / 0.00     0 / 0
    floyd-warshall 17.87 / 0.00   0 / 0      syr2k         0.00 / 0.00     0 / 0
    gemm         0.00 / 0.00      0 / 0      syrk          0.00 / 0.00     0 / 0
    gemver       5.80 / 0.00      0 / 0      trmm          0.00 / 0.00     0 / 0
    gesummv      0.00 / 0.00      0 / 0
To determine the intrusiveness of hybrid tracing, we performed HLS with and without hybrid tracing and measured the number of flip-flops and benchmark execution cycles for each benchmark. Table VIII shows the hybrid tracing overhead relative to the baseline values, as well as the overhead for the LLVM-level instrumentation insertion approach of [11]. We observe significant overheads for that approach, as high as a 247% cycle overhead and a 2.5% flip-flop overhead.

The LLVM-level instrumentation of [11] creates significant cycle overhead because it constrains the scheduler to maintain trace call order. This constraint can cause trace calls to be scheduled one or more cycles after what would normally be the end of a variable's lifetime. This extends variable lifetimes, which can result in flip-flop overhead due to increased register pressure. Our hybrid tracing approach, in contrast, has no intrusiveness because instrumentation is inserted after scheduling, so the instrumentation cannot interfere with the schedule by design. Furthermore, we show that QoR is unaffected by instrumentation, and thus the instrumentation can be ignored or removed to produce synthesis input equivalent to the same design without instrumentation.

Fig. 6. Hybrid tracing instrumented RTL simulation and C++ simulation time normalized to uninstrumented RTL simulation.
B. Hybrid Tracing Simulation Time Costs
Fig. 6. Hybrid tracing instrumented RTL simulation and C++ simulation time, normalized to uninstrumented RTL simulation.

To determine the simulation performance impact of hybrid tracing, we measured the SystemC RTL simulation time of untraced RTL code for our 27 benchmarks and compared it with the combined time of software simulation and SystemC RTL simulation with tracing enabled. The results are shown in Fig. 6. We observe a mean RTL simulation time of 1.79x and an additional reference (software) simulation time of 1.18x, both relative to untraced RTL simulation, for a total mean overhead of 2.97x. Reducing these overheads may be possible by engineering faster variants of the trace functions, in particular by using binary-mode I/O operations for writing and comparing traces (i.e., avoiding the formatting overhead incurred for human readability of trace files).
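To make the overhead-reduction idea above concrete, a sketch of a binary-mode trace write path is given below; the record layout and function names are assumptions for illustration, not the actual trace format used by our instrumentation.

    #include <cstdint>
    #include <cstdio>

    // Illustrative trace record: which HLS-level value was produced and the
    // bits that were observed.  The field layout is an assumption for this
    // sketch only.
    struct TraceRecord {
        uint32_t value_id;  // identifier assigned to the traced value
        uint64_t data;      // observed value
    };

    // Human-readable trace write: one formatted text line per record.
    void write_trace_text(std::FILE* f, const TraceRecord& r) {
        std::fprintf(f, "%u %llu\n", r.value_id,
                     static_cast<unsigned long long>(r.data));
    }

    // Binary-mode trace write: raw record bytes, avoiding formatting overhead.
    void write_trace_binary(std::FILE* f, const TraceRecord& r) {
        std::fwrite(&r, sizeof r, 1, f);
    }

    // Binary-mode comparison can likewise compare raw records directly
    // instead of parsing and re-formatting text.
    bool records_equal(const TraceRecord& a, const TraceRecord& b) {
        return a.value_id == b.value_id && a.data == b.data;
    }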
C. Hybrid Hashing Area and Delay Costs

To determine the area and delay costs of hybrid hashing, we performed HLS for each accelerator with and without hybrid hashing. We then performed logic synthesis using Synopsys Design Compiler 2013-12.sp1 with a 45nm ARM standard cell library, targeting maximum clock frequency. The area and clock period overheads for each accelerator core are shown in Fig. 7. The results show a mean accelerator-level area cost of 8.3% with no mean clock period overhead.

(Per-benchmark hybrid hashing data: signature bits and end-of-run LFSR dump bits, signature memory SRAM area, baseline and experimental area with SRAM, chip-level area, and clock period; the accelerator-level signature generation area, signature storage area, and clock period overheads are summarized in Fig. 7.)
Fig. 7. Hybrid hashing area and performance overheads
D. Logic Bug Effectiveness
To evaluate the effectiveness of H-QED in detecting logic bugs, we considered all 21 known real bugs in the current and past versions of CHStone [12], 7 synthetic bugs injected into CHStone benchmarks, 2 bugs in previous versions of our HLS engine itself, and 3 bugs in synthesizable C code for a matrix-multiply kernel generated by FCUDA [18]. We attempted to detect these bugs with both hybrid tracing and hybrid hashing. Hybrid hashing is designed primarily to detect electrical bugs, but we evaluate it here against logic bugs for the sake of completeness. As references for comparison, we compare the output of the benchmark with a known correct output (the end result check) and use several software-based static and dynamic bug detection tools.8 In hardware simulation, we initialized registers and memories to random values, a common technique for enhancing bug detection. The results of our experiments are enumerated in Table IX, with the columns defined in Table X. To find real bugs, we exhaustively searched the version changes of CHStone for bug fixes. H-QED also found some previously unknown bugs in the then-current version of CHStone, 1.10, prompting the release of 1.11, in which H-QED found a newly introduced bug. Previously unknown bugs are printed in bold in Table IX. For each bug found, we isolated it by fixing all of the other bugs in the last CHStone version containing that bug, creating bug benchmarks with one bug each. All real CHStone bugs were confirmed with the CHStone authors. We also created synthetic interface bugs in some CHStone benchmarks by violating benchmark input assumptions (e.g., that the number of inputs is even); a sketch of this kind of bug appears after the observations below. We reproduced the HLS engine bugs by modifying our HLS engine to emulate the original buggy behavior. We isolated each FCUDA output bug into a bug benchmark as we did for the real CHStone bugs. We make the following observations on these results: 1) Of the 21 real CHStone bugs, 16 are non-deterministic, providing evidence that the "difficult" bugs that escape into releases/production tend to be non-deterministic; 2) 6 of these bugs are not activated, and the Clang coverage analysis tool detects 5 of those cases;9 3) No tool dominates the others in bug detection; each has unique strengths, and using a combination of tools is the best way to detect and localize bugs; 4) Compiler optimizations complicate bug detection by making non-deterministic bugs deterministic (e.g., by statically evaluating undefined behavior and eliminating it); 5) In most cases, the different tools agree on the bug location, although in a few cases compiler optimizations make it difficult to map instructions back to source code locations, resulting in some localization accuracy loss; 6) In 12 out of 19 cases where hybrid tracing reported a buggy line, the line had at least one variable in common with the bug patch line (or was the patch line), indicating a strong hint for a fix; 7) Hybrid tracing error detection latency is 1 cycle or less, average hybrid hashing error detection latency is 83 cycles, and end result check latency can be thousands of cycles; 8) We observe negative hybrid tracing error detection latency for two OOB bugs, meaning that hybrid tracing detects activation conditions the reported number of cycles before bug activation. In both of these cases, the hardware version of the benchmark computed an out-of-bounds address one or more cycles before loading from that address. In the software version, the corresponding address overflowed beyond the translation table for the variable it was intended to point to, resulting in the software-to-hardware translation failing before the undefined memory access even occurred.
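As an illustration of the kind of synthetic interface bug described above, the sketch below violates an assumed even-input-count precondition; the code is hypothetical and not taken from any CHStone benchmark.

    #include <cstddef>

    // Hypothetical synthetic interface bug: the kernel assumes an even number
    // of inputs and consumes them in pairs, so an odd-length input causes an
    // out-of-bounds read on the final iteration.
    int sum_pairs(const int* in, std::size_t n) {  // precondition (violated): n is even
        int acc = 0;
        for (std::size_t i = 0; i < n; i += 2) {
            acc += in[i] + in[i + 1];  // reads in[n] when n is odd
        }
        return acc;
    }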
8 We also ran our bug benchmarks through the Clang static analyzer [19], which is a source-code analysis tool for finding bugs in C, C++, and Objective-C programs. We ran the built-in static analyzer in Clang 3.9 using the "scan-build" wrapper tool. The Clang static analyzer failed to identify any of our bugs.
9 At getbits.c:155, a variable range condition that is never met is required for bug activation.
Coverage (%) per benchmark; injected timing errors are classified as masked (undetected or detected) and unmasked (unactivated, undetected, or detected), for H-QED and the end result check.
Fig. 8. Timing error detection coverage
E. Electrical Bug Effectiveness

To evaluate the effectiveness of H-QED for detecting electrical bugs, we injected timing errors into each of our benchmark designs. We start by running each benchmark through HLS with hybrid hashing, feeding the output RTL code to Design Compiler, and compiling for timing optimization. To identify timing error activations, we use an approach similar to the "ground truth" method in [24]: for each flip-flop in the logic netlist, we add a duplicate flip-flop connected to the same "D" input, but with an additional half-cycle delay on that input. The duplicate flip-flop's "Q" output is left unconnected, as it is used only to trigger reports of timing violations (by a timing simulator) while the original flip-flops maintain the error-free execution of the benchmark. We ran timing simulations with the modified netlist and compiled the reported timing violations into a set of (flip-flop, cycle) pairs, referred to as "injection candidates." We selected a random subset of these candidates of size n (we set n = 500) for our error injection experiments. Starting again from the original netlist, we applied another netlist transform, which inserts an XOR gate at the "D" input of each flip-flop appearing in the selected injection candidates. We added control logic to each XOR gate, enabling error injection at a specific cycle. We mapped the transformed netlist to an FPGA (Altera Stratix III) for emulation and performed n full execution runs for each benchmark, injecting one error from the selected injection candidates during each run (a bit flip at the input of the given flip-flop at the given cycle). Timing error coverage (the number of errors detected divided by the number of errors injected) is presented in Fig. 8, including both masked errors (errors that do not propagate to accelerator outputs) and unmasked errors (errors that propagate to the primary outputs).
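The sketch below gives a cycle-level behavioral model of this XOR-based injection mechanism; it illustrates the effect of the transform and is not the actual gate-level netlist modification.

    #include <cstdint>

    // Behavioral model of an injection-capable flip-flop: the D input is
    // XORed with an enable that pulses for exactly one cycle, flipping the
    // captured bit at the selected (flip-flop, cycle) injection candidate.
    struct InjectableFlipFlop {
        bool     q = false;      // flip-flop state
        uint64_t inject_cycle;   // cycle at which to flip the captured bit
        bool     armed = true;   // inject at most one error per run

        void clock(bool d, uint64_t cycle) {
            bool flip = armed && (cycle == inject_cycle);
            if (flip) {
                armed = false;
            }
            q = d ^ flip;        // XOR gate inserted at the D input
        }
    };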
TABLE IX
EVALUATION OF H-QED AND SOFTWARE TOOLS AGAINST LOGIC BUGS

(For each bug, the table reports the benchmark and version(s), bug patch line(s), bug type, non-determinism, activation, Clang coverage, the non-deterministic activation line, the lines reported by Cppcheck, Valgrind, the Clang sanitizers, and hybrid tracing, whether the reported line shares variables with the patch line, and the hybrid tracing, hybrid hashing, and end result check detection latencies; column definitions are given in Table X. The bugs evaluated are:)
Real CHStone bugs (bug patch line, type): adpcm.c:689 (MLU); lpc.c:87, lpc.c:150, lpc.c:157 (OOB); decode.c:204, decode.c:205, decode.c:209 (*++); marker.c:0 (INIT, note a); mips.c:172-173 (USE); mips.c:102, mips.c:103, mips.c:105 (INIT); mips.c:91 (OOB); mpeg2.c:225 (OOB); getbits.c:113 (SHFT); motion.c:155, motion.c:160, motion.c:166 (SHFT); getbits.c:134, getbits.c:144, getbits.c:155 (SHFT).
Synthetic bugs (bug patch line): adpcm.c:778, aes.c:83, bf_pi.h:77, gsm.c:32, huffman.c:86, mpeg2.c:354, sha.h:52.
HLS engine bugs: two bugs in the HLS engine itself.
FCUDA-generated matrix-multiply bugs (bug patch lines): mm.c:52,54; mm.c:157,161; mm.c:165-172.
a Bug is the absence of a header file "#include" directive.
b Execution was terminated with a segmentation fault reported at this line.
c add.c:68 is at the top of the reported trace, followed by lpc.c:59.
Fig. 9. Overall timing error coverage as a function of error detection latency (H-QED vs. end result check).
Note that the unmasked timing error detection coverage is 100% with hybrid hashing (i.e., we detect all unmasked errors). The overall error detection latency distribution is shown in Fig. 9. We observed a mean timing error detection coverage of 85.8% for hybrid hashing, compared to 55.8% for the end result check, a 3.1x improvement (i.e., reduction) in undetected timing errors. We also observed a mean error detection latency of 705 cycles for hybrid hashing, compared to 124,490 cycles for the end result check, a 176x improvement (i.e., reduction) in error detection latency.
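As a quick check, both improvement factors follow directly from the reported means (coverage converted to the undetected fraction):

    \[
      \frac{1 - 0.558}{1 - 0.858} = \frac{0.442}{0.142} \approx 3.1,
      \qquad
      \frac{124{,}490}{705} \approx 176.6 .
    \]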
V. RELATED WORK

The inspiration for H-QED is QED [1], [2], [3], [4], which is a software technique for the validation of programmable microprocessors. In general, validation techniques that target processors (e.g., [25], [26], and others) are inadequate for bugs inside accelerators. Given a high-level specification and a design produced by HLS (referred to as an implementation), there is a large class of techniques that check whether the implementation is equivalent to the high-level specification, often using formal methods [27], [28], [29]. The goal is to detect bugs in the implementation caused by the HLS tool. However, formal equivalence checking techniques are limited in their capacity to handle HLS transformations. This limitation is further compounded by the large state space of HLS implementations. In contrast, H-QED is a dynamic technique that integrates into HLS to follow instructions through HLS transforms and to generate a software reference implementation. H-QED can be run in pre-silicon simulation at RTL simulation speeds (with acceptable overhead) or during post-silicon validation at full hardware speed.
TABLE X
COLUMNS OF TABLE IX

Benchmark: Version(s)
  The benchmark and version(s) the bug benchmark was based on. For the real CHStone bugs, the version is the last version of the benchmark containing the bug.
Bug Patch Line(s)
  Indicates what line(s) of code were modified to fix the bug: bug fixes from the version history for real bugs, and the line modified to inject the bug for synthetic bugs.
Bug Type
  Indicates the bug root cause as defined in Table XI.
Nondet.?
  Indicates whether the bug is non-deterministic, meaning that the bug behavior is not well defined by standard C [20] semantics (or only affects the hardware in the case of the two HLS engine bugs). A "C only" value means that the bug is non-deterministic but becomes deterministic after compiler optimizations.
Act.?
  Indicates if the bug activates during benchmark execution.
Clang Cov. [21]
  Indicates if the non-deterministic activation line is covered according to the Clang coverage analysis tool (Clang 3.9 "Source-based Code Coverage" reference methodology), which provides source-based dynamic code coverage. If the bug is deterministic, this column indicates if the bug patch line(s) are covered.
Nondet. Activation Line
  Indicates the line of code, determined by inspection, where non-deterministic behavior first occurs.
Cppcheck [22]
  The first line flagged as a warning or error by the Cppcheck (1.76.1, with warnings enabled) static analysis tool. An entry of "-" means that a technique failed to detect the bug.
Valgrind [23]
  The first line indicated by the Valgrind binary instrumentation framework (Valgrind 3.11.0 with GCC 4.9.2, using the "Valgrind Quick Start Guide" reference methodology and invoking the "Memcheck" memory error detection tool and the "SGCheck" stack and global array overrun detector tool). Dynamic techniques are unable to detect unactivated bugs, so we indicate "N/A" for those cases.
Clang San.
  The first line reported by a suite of Clang (3.9.0) "sanitizer" dynamic instrumentation tools: AddressSanitizer, which detects memory errors; MemorySanitizer, which detects uninitialized reads; and UndefinedBehaviorSanitizer, which detects undefined behavior.
HT Line
  Indicates the buggy line as reported by our hybrid tracing framework. Hybrid tracing is also not applicable to deterministic bugs, so we indicate "N/A" for those cases.
Com. Vars
  Indicates if there are variables in common between the HT Line and the Bug Patch Line(s), indicating a strong hint for a potential bug fix.
HT Lat., HH Lat., ER Lat.
  Indicate the error detection latency for hybrid tracing, hybrid hashing, and the end result check, respectively, measured in cycles from non-deterministic bug activation to detection. For deterministic bugs, a "yes" entry indicates that the ERC is able to detect the bug. An ERC entry of "hang" means that the benchmark execution fails to terminate.
TABLE XI
BUG TYPES

MLU   Manual loop unrolling omits one iteration
OOB   Out-of-bounds array access
*++   Wrongly assuming dereference (*) has higher precedence than post-increment (++)
INIT  Read of uninitialized variable
USE   Unintended sign extension
SHFT  Bit shift by out-of-bounds amount
ZERO  Variable initialized to zero instead of nonzero initializer
BUF   Copying from the wrong half of a split buffer
INF   Infinite loop due to erroneous loop termination condition
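As an illustration of the "*++" bug type, the hypothetical snippet below increments the pointer instead of the pointed-to value; it is not taken from any benchmark.

    // Hypothetical example of the *++ bug type: the programmer intends to
    // increment each element, but *p++ dereferences the pointer and then
    // advances it, so the elements are never modified.
    void increment_all(int* p, int n) {
        for (int i = 0; i < n; ++i) {
            *p++;               // bug: value is only read; the pointer advances
            // intended:  (*p)++; ++p;
        }
    }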
A. Hybrid Tracing

Prior works such as [30], [31] perform source-level transforms to create external ports for selected signals to improve observability. This approach requires manual source code annotation and interferes with compiler optimizations, creating intrusiveness. A hardware-software run-time trace comparison technique is proposed in [32], [33] to provide automated HW/SW discrepancy detection. Both techniques map software variables and hardware components through LLVM variables to detect discrepancies. Again, these techniques are intrusive, as they insert additional error detection operations that change the schedule of the hardware design. In contrast, hybrid tracing instrumentation is integrated into HLS to eliminate intrusiveness, creating an RTL design with debugging annotations that can easily be removed before synthesis.

B. Hybrid Hashing

Hybrid hashing may appear to be similar to tracing techniques used in PSV (e.g., using trace buffers or system memory [34], [35], [36], [37]). XOR trees and signature registers for low-bandwidth tracing are known techniques [38] and have been used for selective data capture based on discrepancies from RTL simulation [39], [40]. That said, there are important differences between these techniques and hybrid hashing: 1) hybrid hashing automatically and systematically collects signatures, unlike tracing techniques that require designer input to determine which signals to trace and when to trace them; 2) hybrid hashing does not require extensive low-level (e.g., RTL) simulation; 3) hybrid hashing does not require multiple hardware executions and/or failure reproduction; 4) hybrid hashing does not require designer-crafted assertions; and 5) hybrid hashing enables very short error detection latencies and high bug coverage, unlike tracing techniques that become ineffective for difficult bugs with long error detection latencies.

Hybrid hashing is also distinct from fault-tolerant computing techniques for processors (e.g., watchdog processors, DIVA, multi-threading, and signature techniques for duplex systems [41], [42], [43], [44], [45], [46]). Many of these techniques only check the register values defined by the Instruction Set Architecture (ISA). In contrast, hybrid hashing is effective for arbitrary hardware accelerators created using HLS and automatically identifies the signals to check in the resulting designs. Unlike time redundancy and cycle stealing techniques for enhancing the reliability of designs created using HLS [47], [48], [49], hybrid hashing exploits unique aspects of the PSV environment (where generating software signatures after a PSV run is acceptable, in contrast to reliability techniques that focus on quick error recovery) to minimize area/performance costs and intrusiveness.

VI. CONCLUSION

H-QED utilizes HLS principles to quickly detect bugs in hardware accelerators in SoCs. Our results demonstrate the effectiveness and practicality of H-QED: up to 5 orders of magnitude improvement in error detection latency, with 1-cycle latencies in pre-silicon scenarios, up to a 3-fold improvement
in coverage, a pre-silicon RTL simulation overhead of 79%, and an 8.3% post-silicon accelerator-level area overhead with negligible performance cost. Furthermore, H-QED discovered previously unknown bugs in the widely used CHStone HLS benchmark suite. Through hybrid hardware/software traces and signatures, H-QED minimizes intrusiveness during pre- and post-silicon validation. Thus, the combination of QED and H-QED provides a systematic approach to the validation of complex SoCs consisting of processor cores, uncore components, programmable accelerators, and hardware accelerators. Future directions related to H-QED include: 1) use of H-QED for a wide variety of high-level descriptions beyond C and C++ (e.g., various domain-specific languages); 2) use of H-QED for programmable accelerators; and 3) integration of H-QED with formal analysis tools for automatic debug.

REFERENCES

[1] T. Hong et al., "QED: Quick error detection tests for effective post-silicon validation," in ITC, 2010, pp. 1-10.
[2] D. Lin et al., "Quick detection of difficult bugs for effective post-silicon validation," in DAC, 2012, pp. 561-566.
[3] ——, "Effective post-silicon validation of system-on-chips using quick error detection," IEEE Trans. CAD, vol. 33, no. 10, pp. 1573-1590, Oct. 2014.
[4] ——, "Quick error detection tests with fast runtimes for effective post-silicon validation and debug," in DATE, 2015, pp. 1168-1173.
[5] G. Martin and G. Smith, "High-level synthesis: Past, present, and future," IEEE Design and Test of Computers, vol. 26, no. 4, pp. 18-25, Jul. 2009.
[6] K. Rupnow et al., "A study of high-level synthesis: Promises and challenges," in IEEE Intl. Conf. on ASIC, 2011, pp. 1102-1105.
[7] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment," IEEE Trans. CAD, vol. 30, no. 4, pp. 473-491, Apr. 2011.
[8] K. Wakabayashi and T. Okamoto, "C-based SoC design flow and EDA tools: An ASIC and system vendor perspective," IEEE Trans. CAD, vol. 19, no. 12, pp. 1507-1522, Dec. 2000.
[9] K. Wakabayashi, "C-based behavioral synthesis and verification analysis on industrial design examples," in ASP-DAC, 2004, pp. 344-348.
[10] K. Campbell et al., "Hybrid quick error detection (H-QED): Accelerator validation and debug using high-level synthesis principles," in DAC, 2015, pp. 53:1-53:6.
[11] ——, "Debugging and verifying SoC designs through effective cross-layer hardware-software co-simulation," in DAC, 2016, pp. 7:1-7:6.
[12] Y. Hara et al., "Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis," Journal of Information Processing, vol. 17, pp. 242-254, 2009.
[13] W. Snyder, "Verilator and SystemPerl," presented at NASCUG/DAC, Jun. 2004.
[14] ——, "Verilator: Open simulation - growing up," presented at DVClub Bristol, Jan. 2013.
[15] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis and transformation," in CGO, 2004, pp. 75-86.
[16] A. Canis et al., "LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems," ACM Trans. Embedded Computing Systems, vol. 13, no. 2, pp. 24:1-24:27, Sep. 2013.
[17] L.-N. Pouchet and T. Yuki, "PolyBench/C 3.2." [Online]. Available: http://www.cse.ohio-state.edu/~pouchet/software/polybench/
[18] A. Papakonstantinou et al., "Efficient compilation of CUDA kernels for high-performance computing on FPGAs," ACM Trans. Embed. Comput. Syst. (App.-Specific Processors), vol. 13, pp. 25:1-25:26, Sep. 2013.
[19] "Clang static analyzer." [Online]. Available: http://clang-analyzer.llvm.org
[20] ISO/IEC 9899:201x - Programming languages - C, International Organization for Standardization Std., Dec. 2011.
[21] "Clang 3.9 documentation." [Online]. Available: http://llvm.org/releases/3.9.0/tools/clang/docs/
[22] "Cppcheck: A tool for static C/C++ code analysis." [Online]. Available: http://cppcheck.sourceforge.net
[23] "Valgrind." [Online]. Available: http://valgrind.org
[24] M. Gao et al., "On error modeling of electrical bugs for post-silicon timing validation," in ASP-DAC, 2012, pp. 701-706.
[25] A. Adir et al., "Threadmill: A post-silicon exerciser for multi-threaded processors," in DAC, 2011, pp. 860-865.
[26] I. Wagner and V. Bertacco, "Reversi: Post-silicon validation system for modern microprocessors," in ICCD, 2008, pp. 307-314.
[27] X. Feng and A. J. Hu, "Early cutpoint insertion for high-level software vs. RTL formal combinational equivalence verification," in DAC, 2006, pp. 1063-1068.
[28] M. Fujita, "Equivalence checking between behavioral and RTL descriptions with virtual controllers and datapaths," ACM Trans. Design Automation Electronic Systems, vol. 10, no. 4, pp. 610-626, Oct. 2005.
[29] A. Mathur et al., "Functional equivalence verification tools in high-level synthesis flows," IEEE Design and Test of Computers, vol. 26, no. 4, pp. 88-95, Jul. 2009.
[30] J. S. Monson and B. Hutchings, "New approaches for in-system debug of behaviorally-synthesized FPGA circuits," in FPL, 2014.
[31] J. S. Monson and B. L. Hutchings, "Using source-level transformations to improve high-level synthesis debug and validation on FPGAs," in FPGA, 2015, pp. 5-8.
[32] N. Calagar et al., "Source-level debugging for FPGA high-level synthesis," in FPL, 2014.
[33] L. Yang et al., "JIT trace-based verification for high-level synthesis," in FPT, 2015, pp. 228-231.
[34] M. Abramovici, "In-system silicon validation and debug," IEEE Design and Test of Computers, vol. 25, no. 3, pp. 216-223, May 2008.
[35] ARM, "CoreSight debug and trace." [Online]. Available: http://www.arm.com/products/system-ip/coresight
[36] S. B. Park et al., "Post-silicon bug localization in processors using instruction footprint recording and analysis (IFRA)," IEEE Trans. CAD, vol. 28, no. 10, pp. 1545-1558, Oct. 2009.
[37] S.-B. Park et al., "BLoG: Post-silicon bug localization in processors using bug localization graph," in DAC, 2010, pp. 368-373.
[38] E. Anis and N. Nicolici, "Low cost debug architecture using lossy compression for silicon debug," in DATE, 2007, pp. 1-6.
[39] J.-S. Yang and N. A. Touba, "Improved trace buffer observation via selective data capture using 2-D compaction for post-silicon debug," IEEE Trans. VLSI Systems, vol. 21, no. 2, pp. 320-328, 2013.
[40] S. Deutsch and K. Chakrabarty, "Massive signal tracing using on-chip DRAM for in-system silicon debug," in ITC, 2014, pp. 1-10.
[41] T. Austin, "DIVA: A reliable substrate for deep submicron microarchitecture design," in MICRO, 1999, pp. 196-207.
[42] D. J. Lu, "Watchdog processors and structural integrity checking," IEEE Trans. Computers, vol. 31, no. 7, pp. 681-685, Jul. 1982.
[43] A. Mahmood and E. J. McCluskey, "Concurrent error detection using watchdog processors - a survey," IEEE Trans. Computers, vol. 37, no. 2, pp. 160-174, Feb. 1988.
[44] N. R. Saxena et al., "Online testing in adaptive and configurable systems," IEEE Design and Test of Computers, vol. 17, no. 1, pp. 29-41, Jan.-Mar. 2000.
[45] J. C. Smolens et al., "Fingerprinting: Bounding soft-error detection latency and bandwidth," in ASPLOS, 2004, pp. 224-234.
[46] E. S. Sogomonyan et al., "Early error detection in system-on-chip for fault-tolerance and at-speed debugging," in VTS, 2001, pp. 184-189.
[47] R. Karri and A. Orailoglu, "High-level synthesis of fault-secure microarchitectures," in DAC, 1993, pp. 429-433.
[48] S. Mitra et al., "Fault escapes in duplex systems," in VTS, 2000, pp. 453-458.
[49] N. R. Saxena and E. J. McCluskey, "Dependable adaptive computing systems," in IEEE Systems, Man, and Cybernetics Conf., 1998, pp. 2172-2177.
Keith Campbell received a B.S. in Electrical Engineering from the Illinois Institute of Technology, Chicago, IL, USA in 2008, and an M.S. and a Ph.D. in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign, IL, USA in 2015 and 2017, respectively. He worked as a research assistant under Prof. Deming Chen in the Coordinated Science Laboratory at the University of Illinois. His research focuses on the intersection of high-level synthesis and circuit validation and reliability. Other research interests include compiler design, high-level programming language design, and alternative computer architectures.
Leon He received his Bachelor of Science Degree in Computer Engineering from the University of Illinois Urbana-Champaign in 2016 and is currently a software engineer at the Johns Hopkins University Applied Physics Laboratory. His current fields of work include high performance computing and machine learning.
Deming Chen (SM'11) received the B.S. degree in computer science from the University of Pittsburgh, Pittsburgh, PA, USA, in 1995, and the M.S. and Ph.D. degrees in computer science from the University of California at Los Angeles, Los Angeles, CA, USA, in 2001 and 2005, respectively. He is currently a Professor with the ECE Department, University of Illinois at Urbana-Champaign, Urbana, IL, USA. His current research interests include high-level synthesis, reconfigurable computing, hardware/software co-design, hardware security, and computational genomics. Dr. Chen was a recipient of various awards, including six best paper awards, the First Place Winner award of the DAC'17 International Hardware Design Contest, the ACM SIGDA Outstanding New Faculty Award, and IBM Faculty Awards. He is (or has been) an Associate Editor for several IEEE and ACM transactions. He is a Donald Biggar Willett Faculty Scholar of the College of Engineering.
Liwei Yang received the Ph.D. degree from the School of Computer Engineering, Nanyang Technological University, Singapore, in 2017. He received the B.S. and M.S. degrees from the University of Electronic Science and Technology of China, Chengdu, China, in 2005 and 2008, respectively. He is a Senior Software Engineer with Inspirit Pte. Ltd., Singapore, and an Adjunct Senior Research Engineer with the Advanced Digital Sciences Center of Illinois at Singapore Pte. Ltd., Singapore. His current research interests include high-level synthesis, compiler techniques, and reconfigurable computing.
Kyle Rupnow (M'00-SM'16) received the B.S. degree in computer engineering and mathematics and the M.S. and Ph.D. degrees in electrical engineering from the University of Wisconsin-Madison, Madison, WI, USA, in 2003, 2006, and 2010, respectively. He is currently the CTO of Inspirit-IoT, a design automation startup affiliated with the University of Illinois at Urbana-Champaign. His current research interests include high-level synthesis, reconfigurable computing, and systems management of compute resources. Dr. Rupnow has served on the Program Committees of Field Programmable Gate Arrays, Field-Programmable Custom Computing Machines, Field-Programmable Technology, Field Programmable Logic and Applications, and ReConfig. He is an Associate Editor of ACM Transactions on Reconfigurable Technology and Systems, and has served as a Reviewer for the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN, ACM Transactions on Design Automation of Electronic Systems, and the IEEE TRANSACTIONS ON VLSI SYSTEMS.
David Lin received the B.S. degree in electrical engineering from the California Institute of Technology, Pasadena, CA in 2009 and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, in 2011 and 2015 respectively. He was a Research Assistant under Prof. S. Mitra in the Robust Systems Group at Stanford University. His research interests include post-silicon validation, verification, debug, and computer architecture.
Subhasish Mitra is Professor of Electrical Engineering and of Computer Science at Stanford University, where he directs the Stanford Robust Systems Group and co-leads the Computation focus area of the Stanford SystemX Alliance. He is also a faculty member of the Stanford Neurosciences Institute. Prof. Mitra holds the Carnot Chair of Excellence in Nanosystems at CEA-LETI in Grenoble, France. Before joining the Stanford faculty, he was a Principal Engineer at Intel Corporation. Prof. Mitra's research interests range broadly across robust computing, nanosystems, VLSI design, validation, test and electronic design automation, and neurosciences. He, jointly with his students and collaborators, demonstrated the first carbon nanotube computer and the first three-dimensional nanosystem with computation immersed in data storage. These demonstrations received widespread recognition (the cover of NATURE; a Research Highlight to the United States Congress by the National Science Foundation; and highlights as an "important, scientific breakthrough" by the BBC, Economist, EE Times, IEEE Spectrum, MIT Technology Review, National Public Radio, New York Times, Scientific American, Time, Wall Street Journal, Washington Post, and numerous others worldwide). His earlier work on X-Compact test compression has been key to cost-effective manufacturing and high-quality testing of almost all electronic systems. X-Compact and its derivatives have been implemented in widely used commercial Electronic Design Automation tools. Prof. Mitra's honors include the ACM SIGDA/IEEE CEDA A. Richard Newton Technical Impact Award in Electronic Design Automation (a test of time honor), the Semiconductor Research Corporation's Technical Excellence Award, the Intel Achievement Award (Intel's highest corporate honor), and the Presidential Early Career Award for Scientists and Engineers from the White House (the highest United States honor for early-career outstanding scientists and engineers). He and his students published several award-winning papers at major venues: the ACM/IEEE Design Automation Conference, IEEE International Solid-State Circuits Conference, IEEE International Test Conference, IEEE Transactions on CAD, IEEE VLSI Test Symposium, and the Symposium on VLSI Technology. At Stanford, he has been honored several times by graduating seniors "for being important to them during their time at Stanford." Prof. Mitra served on the Defense Advanced Research Projects Agency's (DARPA) Information Science and Technology Board as an invited member. He is a Fellow of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE).
Swathi T. Gurumani received the B.E. degree in electronics and communications from the University of Madras, Chennai, India, and the M.S. and Ph.D. degrees in computer engineering from the University of Alabama in Huntsville, Huntsville, AL, USA in 2003 and 2007, respectively. He is currently the Vice President of Engineering at Inspirit IoT Inc., a UIUC-based accelerated machine learning startup. Prior to this, he was a Principal Research Engineer with the Advanced Digital Sciences Center (ADSC) of Illinois at Singapore Pvt. Ltd., Singapore, and a Senior Research Affiliate with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, USA. His current research interests include high-level synthesis, reconfigurable computing, and hardware/software co-design.