Using Source-Level Transformations to Improve High-Level Synthesis Debug and Validation on FPGAs

Joshua Monson

Brad Hutchings

Brigham Young University 459 Clyde Building Provo, UT 84602

Brigham Young University 459 Clyde Building Provo, UT 84602

[email protected]

[email protected]

ABSTRACT

This paper proposes a method for extending source-level visibility into the RTL of an HLS-generated design using automated source-level transformations. Using our method, source-level visibility can be extended into co-simulation, in-system simulation, and hardware execution for any HLS tool that provides the ability to infer top-level ports. Experimental results show the feasibility of our method in situations where visibility needs to be added without modifying the timing, latency, or throughput of the design.

Categories and Subject Descriptors
B.m [Hardware]: Miscellaneous

Keywords
FPGA, High-Level Synthesis, HLS, Debugging, Simulation

1. INTRODUCTION

One of the many reasons HLS tools increase productivity is that developers are able to quickly test and verify the functionality of their designs in software. While software-based verification and debugging is indispensable for HLS, it cannot simulate the errors that can occur due to concurrent behavior. Diagnosing such errors in RTL simulation is difficult because the developer is generally unfamiliar with the structure of the generated RTL. While some HLS tools support source-level debug during RTL simulation, others do not. No current commercial HLS tools of which we are aware support in-system debugging.

In this paper, we propose an automated method for extending source-level visibility into HLS-generated cores during simulation and in-system execution. Since our method is based on source-level transformations, it could, with tool-specific implementations, support almost any HLS tool. For tools that do not already support source-level debug, this removes much of the manual effort required to instrument these designs for debug. But even more importantly, we show how our techniques can be used to debug real problems in simulation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. FPGA’15, February 22–24, 2015, Monterey, California, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-3315-3/15/02 ...$15.00. http://dx.doi.org/10.1145/2684746.2689087 .

2. PREVIOUS WORK

There have been several academic efforts to add support for source-level debugging of HLS-generated cores. The earliest academic effort was Hemmert’s source-level debugger for the JHDL-based Sea Cucumber Synthesizing Compiler [6]. Hemmert’s source-level debugger could be used both in simulation and during hardware execution. Hemmert’s in-system debugger leveraged JHDL’s support for clock-controlled inspection of hardware registers. Calagar’s Inspect Debugger [2] mirrors Hemmert’s debugger in many respects. However, Inspect targets the LegUp HLS tool and supports automated discrepancy detection. Goeders’ work also targeted LegUp [4]. However, Goeders’ work focused on efficiently capturing a hardware trace during in-system execution. The research of Curreri et al. [3] and Hammouda et al. [1] has focused on the automatic generation of assertion checkers from source-level C-based assertions.

There have also been commercial efforts to support debugging. The most notable of these efforts is the source-level debugger provided by CyberWorkBench [10]. CyberWorkBench is a full system development platform and supports source-level debugging all the way down to RTL system-wide simulation. Impulse C also has RTL simulation-based source-level debugging support [7].

Listing 1: EOP Examples

 1  // original
 2  int mult(int a, int b) {
 3    return (a * b);
 4  }
 5  // eop from pointer
 6  int mult(int a, int b, int *eop0) {
 7    return *eop0 = (a * b);
 8  }
 9  // eop from global
10  int eop0;
11  int mult(int a, int b) {
12  #pragma HLS interface port=eop
13    return eop0 = (a * b);
14  }
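As a quick software-only check, the three variants of Listing 1 can be compiled side by side as ordinary C. The sketch below is illustrative: the function-name suffixes and the helper variants_agree are assumptions added here (C has no overloading), and the HLS pragma is omitted because it only matters to the synthesis tool.

```c
/* Renamed copies of the three variants in Listing 1. The suffixes are
   hypothetical and exist only so all three can live in one file. */
int mult_original(int a, int b) {
    return (a * b);
}

int mult_eop_pointer(int a, int b, int *eop0) {
    /* The assignment expression yields the value it stores. */
    return *eop0 = (a * b);
}

int eop0_global; /* stands in for the global-variable EOP */

int mult_eop_global(int a, int b) {
    return eop0_global = (a * b);
}

/* Returns 1 if both instrumented variants return the same value as the
   original and each EOP holds the value of the target expression. */
int variants_agree(int a, int b) {
    int eop0 = 0;
    int expected = mult_original(a, b);
    int via_pointer = mult_eop_pointer(a, b, &eop0);
    int via_global = mult_eop_global(a, b);
    return via_pointer == expected && eop0 == expected &&
           via_global == expected && eop0_global == expected;
}
```

Calling variants_agree(3, 4) exercises all three functions; it only demonstrates functional equivalence in software and says nothing about how an HLS tool schedules the added write.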

3. EVENT OBSERVABILITY PORTS

In their work, Monson and Hutchings [9] propose the use of Event Observability Ports (EOP) to provide a structure for source-to-hardware correspondence which could be added to either the RTL code (for simulation) or synthesis netlist (for in-system debugging). In practice, an EOP is a top-level RTL port that corresponds to the output of a specific expression or statement in the original source. The EOP contains two signals, an event signal and a data signal. The event signal is asserted when the source-level expression to which the EOP corresponds is evaluated within the generated HLS design. When the event signal is asserted, the data signal holds the result of the expression. EOPs provide the developer with a simple construct, with a guaranteed source-level meaning, that can be connected to and monitored. Because EOPs have a guaranteed source-level meaning, they relieve the developer of the task of becoming familiar with the structure of the generated hardware.
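The event/data pairing can be pictured with a small software model of a simulation trace. This is a sketch under stated assumptions: the type EopSample and the function eop_collect are hypothetical names introduced here, and a real monitor would sample RTL signals rather than a C array.

```c
#include <stddef.h>

/* Software model of one EOP sampled once per clock cycle.
   Names are illustrative, not taken from any tool. */
typedef struct {
    int event; /* 1 in cycles where the source expression was evaluated */
    int data;  /* result of the expression; meaningful only if event == 1 */
} EopSample;

/* Scan a per-cycle trace and copy out the data value seen on each event.
   Returns how many values were stored in out (at most out_cap). */
size_t eop_collect(const EopSample *trace, size_t cycles,
                   int *out, size_t out_cap) {
    size_t n = 0;
    for (size_t c = 0; c < cycles && n < out_cap; c++) {
        if (trace[c].event) {
            out[n++] = trace[c].data;
        }
    }
    return n;
}
```

The data field is ignored whenever event is 0, which mirrors the rule that the data signal only carries a source-level meaning while the event signal is asserted.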

4. EOP INSERTION METHODS

Listing 1 demonstrates two approaches showing how a target expression ((a * b); line 3) can be instrumented with an EOP. Instrumenting the source code with an EOP requires that we add a new top-level port and then perform a write to it. Our first example (Listing 1, lines 5-8) uses the insertion of a pointer parameter on the function declaration to add the top-level port (line 6). The type of this pointer should match the return value of the target expression (in this case int). The write to the EOP is added by inserting an assignment statement that assigns the value of the target expression to the dereferenced EOP pointer (line 7). The second example (Listing 1, lines 10-14) uses a global variable and a Vivado HLS pragma to create the top-level port representing the EOP. The assignment operator is added in the same manner as before to initiate the write to the port.

In both examples, Vivado HLS synthesizes a top-level port for the EOP. This port contains a 32-bit signal for the data (value of the target expression) and a 1-bit valid signal. The 32-bit data signal on this port corresponds exactly to the data signal needed for the EOP. The valid signal is only nearly analogous to the event signal of the EOP. Technically, the event signal should be asserted during the same clock cycle that the target expression completes its execution. In general, this should be the case, but it is possible that the HLS tool could schedule the write during a later clock cycle.

There are drawbacks to increasing inner-HLS-core visibility by adding EOPs to the source code. First of all, manually modifying source code can be error-prone and time-consuming. Second, modifying the source can alter the way that the HLS tool schedules and optimizes the circuit. The next two sections will discuss our solution to these problems.

5. SOURCE-LEVEL TRANSFORMATIONS

The current generation of HLS tools struggles to efficiently synthesize arbitrary C code. Often, this requires developers to make substantial modifications to their code before it will synthesize at all. This becomes a tedious process as the developer must determine how to modify the code without changing its behavior. Source-level transformations are a common solution to this problem [11]. The goal of source-level transformations is to increase the usefulness and productivity of HLS tools by removing the need for the programmer to make the necessary changes manually. Additionally, the source-level transformation allows the user to keep their original source code intact.

In this paper, we utilized the Rose compiler infrastructure [8]. Rose uses a standard compiler frontend to transform the source code into an Abstract Syntax Tree (AST). Rose provides an API that allows the user to both analyze and modify the AST. When the transformations have been completed, the AST undergoes an “unparsing” process in which the modified AST is written out again as source code. The use of a source-to-source compiler framework allows us to leverage the AST to demonstrate that we can automatically insert EOPs without affecting the original behavior of the circuit.

Figure 1: Demonstration of EOP write insertion into Abstract Syntax Tree. Inserted nodes are represented with dotted lines. (Node labels: Parent “return”; Write To EOP “=”; EOP; Target Expression “X” with operands A and B.)

Figure 1 shows how we modified the AST to insert the EOP write. The tree nodes with solid outlines are nodes from the original subtree in the first example (Listing 1). The assignment operator and EOP variable reference node (dashed outlines) are the nodes that were inserted to implement the EOP write. In general, the AST is structured so that a sub-tree is resolved before its parent node is resolved. In Figure 1, this would mean that the multiply operation would be resolved before the return operation. Therefore, as long as the value of the target expression is passed to the parent node, we can insert an EOP write without modifying the original behavior of the program. This is indeed accomplished since the assignment operator returns the value of its right-hand-side (RHS) operand.

6. DEBUGGING WITH EOPS

There are three points at which we may wish to use EOPs to accelerate our debugging efforts. They are co-simulation, in-system simulation, and in-system hardware execution. In co-simulation, the HLS tool generates an RTL test bench that matches the software test bench. The primary use of co-simulation is to prove that the generated RTL has the same functional behavior as the original C. Generally, the problems found during co-simulation are expected software/simulation mismatches which are explained in the HLS tool’s user guide. On occasion, bugs in HLS tools are discovered as well. Since we are not concerned with interface timing, EOPs can be used liberally during co-simulation to provide software-like visibility to the developer.

Next, in-system simulation is used to ensure that the HLS core interacts properly with the other cores in the system. During in-system simulation (and hardware execution), bugs may be sensitive to the timing, throughput, or latency of the core. Therefore any transformation made should avoid modifying these characteristics of the core. The best way to accomplish this is to add EOPs to expressions that are not likely to be “optimized away”.
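The EOP write insertion that Figure 1 depicts for the source-level transformation can be mimicked on a toy expression tree. The sketch below is not Rose code: the node kinds, make_node, eval, and insert_eop_write are all hypothetical names chosen for illustration, and the evaluator merely stands in for the program's behavior.

```c
#include <stdlib.h>

/* Toy expression tree. Node kinds are illustrative and are not Rose's
   AST node classes. */
typedef enum { NODE_CONST, NODE_MUL, NODE_EOP_ASSIGN, NODE_RETURN } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;        /* used by NODE_CONST only */
    struct Node *lhs; /* first (or only) operand */
    struct Node *rhs; /* second operand of NODE_MUL, otherwise NULL */
} Node;

Node *make_node(NodeKind kind, int value, Node *lhs, Node *rhs) {
    Node *n = malloc(sizeof *n);
    if (n == NULL) abort();
    n->kind = kind;
    n->value = value;
    n->lhs = lhs;
    n->rhs = rhs;
    return n;
}

/* Evaluate bottom-up: every sub-tree is resolved before its parent.
   *eop receives the value of any instrumented expression. */
int eval(const Node *n, int *eop) {
    switch (n->kind) {
    case NODE_CONST:
        return n->value;
    case NODE_MUL:
        return eval(n->lhs, eop) * eval(n->rhs, eop);
    case NODE_EOP_ASSIGN:
        /* The assignment yields the value it stored, so the parent
           sees exactly what the target expression produced. */
        return *eop = eval(n->lhs, eop);
    case NODE_RETURN:
        return eval(n->lhs, eop);
    }
    return 0;
}

/* Splice an EOP write between parent and its first operand. */
void insert_eop_write(Node *parent) {
    parent->lhs = make_node(NODE_EOP_ASSIGN, 0, parent->lhs, NULL);
}

void free_tree(Node *n) {
    if (n == NULL) return;
    free_tree(n->lhs);
    free_tree(n->rhs);
    free(n);
}
```

Building return(6 * 7), evaluating it, calling insert_eop_write on the return node, and evaluating again should give the same result both times, with the EOP variable set only after the splice.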

Debugging is most difficult during in-system hardware execution. Since hardware execution operates much faster than software or simulation, it may encounter bugs that occur less frequently. Additionally, the core may encounter inputs or other situations the developer did not expect. We believe that carefully targeted EOP insertion can minimize hardware overhead and provide useful visibility.

Figure 2: LUT usage compared to baseline uninstrumented design.

Figure 3: Minimum clock period compared to uninstrumented design.

Figure 4: Co-simulation latency compared to uninstrumented design.

(Figures 2-4 are bar charts over the benchmarks adpcm, aes, blowfish, dfadd, dfdiv, dfmul, dfsin, and sha, with one series per experiment: add/sub, array read, array write, multiply, function call, all assignment.)

Benchmark   add/sub   array read   array write   multiply   function call   all assignment
adpcm          25          7           22           17           26              174
aes             1        176          226            0            3              324
blowfish        0         29           16            0            0               99
dfadd           7          1            0            0           15               73
dfdiv           8          1            0            4           12               77
dfmul           4          1            0            4           11               65
dfsin          17          1            0            4           33              138
sha             0          7           15            0            0               70

Figure 5: This table shows the number of EOPs added in each experiment.

7. EXPERIMENT AND RESULTS

One potential approach to instrument the circuit without affecting the timing, latency, or throughput is to target source code structures that are mapped to the HLS core in predictable ways. To test the feasibility of this approach we used our source-to-source translator to instrument expressions whose results were likely to survive HLS optimization. In particular, in several benchmarks we instrumented the assignment operations whose final result was an array read or write, add/subtract, multiply, or function call. Additionally, as a first-order evaluation of the effect on latency, we co-simulated each instrumented design using Vivado HLS’s co-simulation. Then we executed the implementation process using Vivado HLS’s export functionality.

Several benchmarks from the CHStone [5] HLS benchmarking suite were evaluated against a baseline design. The benchmarks were modified just enough to get the benchmark to synthesize and pass co-simulation in Vivado HLS. No directives were added to unroll or pipeline loops. To provide some additional context, we also ran an experiment that instrumented all of the assignment operations in the circuit. The one exception was assignments to pointers, since Vivado HLS does not support writing pointer values to output ports. We used Vivado HLS (2014.1) to generate the instrumented RTL.

The table in Figure 5 shows the number of EOPs that were inserted in each experiment. In the majority of the experiments, there were fewer than 30 opportunities to add EOPs. As shown in Figures 2, 3, and 4, these were the experiments that had almost no effect on simulation latency, LUT usage, or minimum clock period. This provides some evidence that if a small number of EOPs are carefully selected we can instrument the design (at the source level) with EOPs with little or no effect. In general, in-system debugging during hardware execution is just that: a trace of a small number of carefully selected signals.

It is important to note that there is some possible overlap between the array write experiment and the other experiments (see Figure 5). This is because array writes are left-hand values while the operations targeted in the other experiments are right-hand values. Therefore, any time an array element is assigned a value there may be overlap. This occurs frequently in the AES benchmark.

We also note the significant drop in co-simulation latency for the adpcm benchmark in the “all assignment” and “multiply” experiments. These same experiments also exhibited a drop in DSP usage. Upon further investigation we found that the insertion of EOPs had prevented a few functions from being inlined. This change appears to have allowed Vivado HLS’s scheduler to find a lower latency schedule.
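One way to picture how assignments fall into the experiment categories, and why the array write experiment can overlap with the others, is a small selection predicate. This is only a guess at the shape of such a filter: the enums, the Assignment struct, and should_instrument are hypothetical, and applying the pointer exclusion to every experiment (rather than only to the all-assignment one) is an assumption made here.

```c
/* Kind of operation that produces the assigned value (right-hand side). */
typedef enum {
    RHS_ADD_SUB, RHS_ARRAY_READ, RHS_MULTIPLY, RHS_FUNCTION_CALL, RHS_OTHER
} RhsKind;

/* Experiments named after the categories used in the evaluation. */
typedef enum {
    EXP_ADD_SUB, EXP_ARRAY_READ, EXP_ARRAY_WRITE,
    EXP_MULTIPLY, EXP_FUNCTION_CALL, EXP_ALL_ASSIGNMENT
} Experiment;

/* Illustrative summary of one assignment statement. */
typedef struct {
    RhsKind rhs;              /* what the right-hand side computes */
    int lhs_is_array_element; /* 1 for statements like a[i] = ... */
    int assigns_pointer;      /* 1 if the assigned value is a pointer */
} Assignment;

/* Returns 1 if the assignment would receive an EOP in the experiment. */
int should_instrument(Experiment exp, Assignment a) {
    if (a.assigns_pointer) {
        return 0; /* assumed: pointer values are never written to a port */
    }
    switch (exp) {
    case EXP_ALL_ASSIGNMENT: return 1;
    case EXP_ARRAY_WRITE:    return a.lhs_is_array_element; /* left-hand side */
    case EXP_ADD_SUB:        return a.rhs == RHS_ADD_SUB;
    case EXP_ARRAY_READ:     return a.rhs == RHS_ARRAY_READ;
    case EXP_MULTIPLY:       return a.rhs == RHS_MULTIPLY;
    case EXP_FUNCTION_CALL:  return a.rhs == RHS_FUNCTION_CALL;
    }
    return 0;
}
```

Under this model a statement such as a[i] = b + c is selected by both the add/sub and the array write experiments, because one test looks at the right-hand side and the other at the left-hand side.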

8. CASE STUDIES

In this section, we provide examples using EOPs to solve real debug problems in RTL simulation and in-system simulation. Both examples, in Vivado HLS (2014.1), will execute correctly during normal execution and incorrectly during simulation. Vivado HLS does not provide any automated debugging support for either simulation or in-system execution. Therefore, our EOPs are important debug tools during co-simulation and in-system simulation.

Our first example, shown in Listing 2, is an HLS core that accepts and accumulates a variable-length data stream. In this core, the length of the stream is determined by reading the first word (line 3). Let us assume that the system this core has been placed in is off by one cycle and length is assigned the wrong value. At this point, two results are possible: 1) length is assigned a value that is less than the number of entries in the stream or 2) length is assigned a value that is higher than the remaining entries in the stream. In the first case, the read of stream[i] (line 5) will consume the part of the stream that was specified by length and then return the erroneous sum. The early return of the HLS core may cause the system to lock while it waits for the HLS core to consume the remainder of the stream. In the second case, it is likely that the HLS core would stall as it waits for streaming values that might never come. In this case, a source-to-source transformation could be employed to add an EOP to capture the value of length (line 3). This would allow the developer to check whether the expected value for length was actually assigned to length and adjust the control of the streaming interface accordingly.

9. CONCLUSION

In this paper, we have shown the feasibility of using source-to-source transformations to increase the source-level visibility of an HLS design during simulation. We have shown that, for several CHStone benchmarks, if the target expressions are carefully selected we can instrument between 0 and 30 expressions at the source level, all while having little or no effect on the timing, throughput, latency, or resource usage of the design. We have also shown that in co-simulation all the assignment operations can be instrumented to provide software-like visibility for HLS tools that do not provide a source-level debugger for simulation. We have shown how these techniques can be used to find real bugs during simulation. In the future we plan to demonstrate these instrumentation techniques in executing hardware.

10. ACKNOWLEDGMENTS

This work was supported by the I/UCRC Program of the National Science Foundation under Grant No. 1265957.

Listing 2: Example of misaligned streaming header read.

1  int misaligned(int stream[1000]) {
2  #pragma HLS INTERFACE ap_fifo ...
3    int length = stream[0], i, sum=0;
4    for (i=1; i
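The EOP on length suggested in the first case study might look roughly as follows in plain C. This is a hypothetical sketch: the function name misaligned_eop, the pointer parameter eop_length, the loop bound, and the loop body are all assumptions made for illustration, and the ap_fifo pragma is left out so the code compiles as ordinary C.

```c
/* Hypothetical instrumented variant of the streaming accumulator.
   The EOP is added as a pointer parameter, following the first
   approach of Listing 1. */
int misaligned_eop(const int stream[], int *eop_length) {
    /* EOP write: expose the header word that was taken as the length. */
    int length = (*eop_length = stream[0]);
    int i, sum = 0;
    /* Assumed loop: accumulate the `length` words after the header. */
    for (i = 1; i <= length; i++) {
        sum += stream[i];
    }
    return sum;
}
```

A test bench, or a monitor attached to the synthesized port, could then compare the captured value against the length the sender intended and flag a misaligned header as soon as the two differ.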
