Designing High-Quality Hardware on a Development Effort Budget: A Study of the Current State of High-Level Synthesis

Zelei Sun∗, Keith Campbell∗, Wei Zuo∗, Kyle Rupnow†, Swathi Gurumani†, Frederic Doucet‡, Deming Chen∗
∗ {zsun19,kacampb2,weizuo,dchen}@illinois.edu
† {k.rupnow,swathi.g}@adsc.com.sg
Abstract—High-level synthesis (HLS) promises high-quality hardware with minimal development effort. In this paper, we evaluate the current state-of-the-art in HLS and design techniques based on software references and architecture references. We present a software reference study developing a JPEG encoder from pre-existing software, and an architecture reference study developing an AES block encryption module from scratch in SystemC and SystemVerilog based on a desired architecture. Additionally, we develop micro-benchmarks to demonstrate best practices in C coding styles that produce high-quality hardware with minimal development effort. Finally, we suggest language, tool, and methodology improvements to advance the current state-of-the-art in HLS.
I. INTRODUCTION

High-level synthesis (HLS) seeks to create a bridge from the high-productivity software paradigm to the hardware world. The challenge in bridging this gap is that software and hardware approaches can be incompatible in both supported syntax and performance portability. Figure 1 illustrates some of the key differences between hardware design and software programming. In both “worlds,” developers wish to write simple, straightforward code that is easy to read and reuse, and easily adaptable with high performance given any other constraints.

[Fig. 1 diagram: hardware-only goals (low-energy, low-area, gate oriented) vs. software-only goals (ISA oriented, flexible), with shared goals (fast execution, readable).]
Fig. 1. Hardware platform vs. Software platform goals. Some goals are aligned while others are not.
Although both software and hardware desire performance, the means of achieving high performance are not always aligned. In software, the primitive is a machine instruction, so an optimal mapping to the instruction set with small code size and data size for good cache behavior is important. Conversely, in hardware, the primitives are gates and functional units, so the designer is more concerned with expressing parallelism and independence than with mapping to an ISA or the behavior of code and data caches. For example, a programmer targeting a 32-bit ISA may pack four 8-bit items into a 32-bit word to process them in parallel (e.g. SSE in x86). However, a hardware designer will prefer to write code with 8-bit words to minimize gates, and express the parallelism explicitly. In addition to performance, hardware design must also consider the area and power/energy of produced designs.¹ To achieve high-quality hardware using high-level synthesis, there are two primary approaches:

¹ Although software optimization for reduced energy has been studied, the means for reducing energy in software and hardware are not necessarily aligned.
‡ [email protected]
• Software Reference – Start with software, and modify it to be synthesizable and hardware friendly. Iteratively refine the software until QoR goals are met.
• Architecture Reference – Start with a block-level architecture, and write code from scratch to specify and implement that architecture. Refine the architecture and specification until QoR goals are met.
The best development approach depends on the quality, availability, and applicability of reference software and hardware specifications. Reference software that is organized for efficient pipelining and operation-level optimization of computation tasks falls in the center of Figure 1; if the software is instead large, complex, and well-optimized for specific CPUs, it may be better to start from scratch.
HLS has been popular in academic studies, with a large number of works focusing on improving optimization [1]–[5] or applying HLS to various applications (e.g., [6], [7]). Nevertheless, the evaluation of current state-of-the-art HLS techniques is not equally well studied. One work [8] evaluates different optimization techniques provided by an HLS tool and compares the QoR results with handwritten RTL. That work, however, focuses on features provided by the HLS tool, not on factors such as input language quality or best practices in adopting HLS for hardware design.
In this paper, we evaluate the current status of HLS techniques through two case studies: a software approach for JPEG encoder design (Section II), and a hardware approach for AES block encryption (Section III). Following these studies, we use micro-benchmarks to demonstrate best practices in C-style hardware specification coding in Section IV. We conclude in Section V with some advice for tool developers, researchers, and language designers for improving the predictability, usability, and productivity of high-level synthesis.
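The word-width contrast from the introduction (packing four 8-bit items into a 32-bit word for a CPU versus explicit 8-bit parallelism for hardware) can be sketched in plain C++. This is an illustrative example of our own; the function names and the SWAR masking scheme are not taken from any reference code.

```cpp
#include <cstdint>

// Software view: pack four 8-bit pixels into one 32-bit word and add a
// constant to all four lanes at once (SWAR-style, as a CPU programmer might).
// Splitting even/odd bytes into 16-bit lanes keeps carries from crossing.
uint32_t add_packed(uint32_t px4, uint8_t k) {
    uint32_t lo = (px4 & 0x00FF00FFu) + (0x00010001u * k);
    uint32_t hi = ((px4 >> 8) & 0x00FF00FFu) + (0x00010001u * k);
    return (lo & 0x00FF00FFu) | ((hi & 0x00FF00FFu) << 8);
}

// Hardware view: keep each pixel 8 bits wide and express the parallelism
// explicitly; an HLS tool can map this loop to four independent 8-bit adders.
void add_unpacked(uint8_t px[4], uint8_t k) {
    for (int i = 0; i < 4; ++i)  // fully unrollable: no loop-carried dependence
        px[i] = (uint8_t)(px[i] + k);
}
```

Both functions compute the same four byte-wise sums; the packed version minimizes instruction count on a 32-bit ISA, while the unpacked version minimizes gate count and exposes parallelism to a synthesis tool.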
II. SOFTWARE APPROACH CASE STUDY: JPEG

The software approach involves starting with an existing software program, modifying the code to make it synthesizable, analyzing and refining the code to perform better intra- and inter-function optimizations, and using HLS tools to generate the final result. Due to the sequential nature of software programs, the initial design we generate usually does not inherently exploit the parallelism available in hardware. We will now use a JPEG encoder program as a tutorial to show the steps for generating a good hardware design from a given software program and how the idea of “finding parallelism” guides our approach to the design. A JPEG encoder [9] is a classic application in hardware accelerator design. We start with existing reference software for a JPEG encoder, described in Fig. 2. We will use this design to perform design space exploration and evaluate design quality and effort.

A. Define Design Top

When starting with reference software, the first step is to determine the functions that will be mapped to hardware and the interfaces to the hardware. Due to software synthesizability and resource limitations, in general only part of the software program may be synthesized to hardware, and control-intensive tasks are maintained
Fig. 2. On the left side is the whole flow of the JPEG encoder; on the right side is the inner logic of the arithmetic coding block. Initially, the five blocks run sequentially. After the modification, FIFOs connect the different modules.
on the software side. Some basic characteristics of an ideal top function are as follows: HLS works better on compute-intensive tasks than on control-intensive tasks, since any parallelism in compute-intensive functions can be better exploited in hardware. Also, any function that is not compute-intensive but is called frequently enough to contribute significantly to the overall time is a good candidate for acceleration. Our JPEG encoder has the structure shown on the left side of Fig. 2. It has three functions. “SaveJPEGFile” reads an uncompressed bitmap image from the file system, writes the headers and tables to the JPEG file, and then calls “Main encoder” to partition the image data and change the color representation format. “Main encoder” then calls “process DU” repeatedly for each 8 × 8 pixel block. Inside the “process DU” function, we run the Discrete Cosine Transformation, quantization, and arithmetic coding, and write the bitstream out to the file. Our plan is to use a hardware module to perform the acceleration, and run the other tasks on a CPU as a software program. We chose “process DU” as the top module for the following reasons:
1) It deals with a small, fixed-size input (8 × 8 pixel blocks), which means the interface of our hardware is simple and clean.
2) This function is called a large number of times (3 times for each 8 × 8 pixel block) and generates the stream of bits directly, so we can make full use of the hardware.
3) This function is computation-intensive, as DCT and quantization involve many additions and multiplications.
On the software side, we divide the image into 8 × 8 blocks, convert the RGB representation of the bitmap format into YCbCr format, and feed the 8 × 8 block of each component (one of Y, Cb, Cr) to the hardware side; our accelerator then sends the data back to the software. Our design goal is to achieve the highest throughput for this design.
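To make the hardware/software boundary concrete, the sketch below shows what a top-function interface of this shape might look like. All names, parameter types, and the stub body are our own illustration, not the actual JPEG encoder source; the point is that the input is small and fixed-size, there is no dynamic memory, and all I/O flows through the parameter list.

```cpp
#include <cstdint>

// Hypothetical interface for a "process_DU"-style hardware top function.
struct DUResult {
    int16_t dc_out;  // DC predictor to feed the next call
    int     nbits;   // number of bits written to out_bits
};

DUResult process_DU(const int16_t block[8][8],  // one 8x8 YCbCr component block
                    int16_t last_DC,            // DC predictor from prior block
                    uint8_t out_bits[64]) {     // encoded bitstream chunk
    // Toy stub body: real code would run DCT, quantization, and arithmetic
    // coding here; we only forward the DC value so the example is executable.
    DUResult r;
    r.dc_out = block[0][0];
    r.nbits  = 0;
    (void)last_DC;
    (void)out_bits;
    return r;
}
```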
We used the 65nm technology node for the design, and the clock period is set to 2.5ns. All the data we collected are reported by the HLS tools.

B. Preparation

Before starting the optimization of the design, we have some preparation to perform. We first need to modify the code to make it synthesizable, and then change the floating-point data types into fixed-point data types (unless we indeed want to perform floating-point operations) to save area.
Unsynthesizable code: Most HLS tools support C/C++ dialects as the input high-level software programming language. However, some restrictions apply to the code, as not all software concepts have a hardware equivalent. The following are generally not supported by HLS tools:
Dynamic allocation is used frequently in software programming. However, dynamically creating and destroying hardware makes no
sense; dynamic allocation is not synthesizable. The Standard Template Library (STL) in C++ is thus mostly unsynthesizable by HLS tools due to its use of dynamic allocation. Finally, data structures that involve dynamically allocated memory, such as linked lists, need to be avoided.
System calls assume the existence of an operating system, which has no hardware equivalent, so they are not synthesizable. The most common system calls are standard I/O and file I/O calls. Some HLS tools will automatically ignore debugging print calls while synthesizing, but others will report an error. All input and output, whether from files or standard input, must instead be passed into functions through parameters so that the tools know clearly what kind of I/O interface to implement.
Fixed-point data types: In industrial hardware designs, fixed-point data types are widely used and are the main data types for representing real numbers. Most HLS tools support floating-point hardware, but they deal with it differently. Some of them by default convert the float/double data types into a single fixed-point data type with a specific bit width, which wastes area, as different variables may have different precision requirements. This automation may also cause problems: overly conservative transformations may result in extra bits being used for each variable, wasting area, while overly aggressive transformations may introduce potential overflow problems as well as loss of precision. Other HLS tools use the floating-point arithmetic cores in the technology library, which might not be the resources we want to use. Most commercial HLS tools provide their own fixed-point data types, which offer flexible bit lengths and many functions for bitwise operations, logic operations, and bit-slice accesses.
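The idea behind such fixed-point types can be sketched in plain C++ without a vendor library (vendor types such as SystemC's sc_fixed are far richer; the Q4.12 format and function names below are our own illustrative choices):

```cpp
#include <cstdint>
#include <cmath>

// A minimal sketch of manual float-to-fixed conversion: Q4.12 format,
// i.e. 4 integer bits and 12 fractional bits in a 16-bit container.
const int FRAC_BITS = 12;

int16_t to_fixed(double x)   { return (int16_t)std::lround(x * (1 << FRAC_BITS)); }
double  to_double(int16_t q) { return (double)q / (1 << FRAC_BITS); }

// Fixed-point multiply: widen to 32 bits, then shift back. In hardware this
// becomes a 16x16 integer multiplier instead of a full floating-point unit.
int16_t fx_mul(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> FRAC_BITS);
}
```

Choosing the integer/fraction split per variable is exactly the bitwidth-minimization opportunity that a single tool-chosen fixed-point format would miss.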
We recommend using these data types instead of standard C/C++ data types, as they enable bitwidth minimization while maintaining acceptable accuracy.

C. Function Partitioning

Most HLS tools recommend that the user split a large and complex function into smaller ones to explicitly show the data flow between functions, reduce the complexity of optimization, and enable possible pipelining between those functions. We consider this one of the most important approaches to optimizing the design: while partitioning the functions, we are actually restructuring the micro-architecture of the design. Function partitioning is a powerful technique for data-streaming designs. By using the FIFO class provided by the HLS tools, we can buffer the internal data between functions and alleviate the effect of the dependences between functions. In this way, we set up a function-level pipeline, which greatly improves throughput. We take the arithmetic coding part of the JPEG encoder as an example. The original arithmetic coding part works like the middle diagram of Fig. 2. The function itself involves a lot of tasks. It needs
TABLE I
JPEG ENCODER EXPERIMENTAL RESULTS (65 NM)

Arithmetic coding report
Optimizations      Latency (cycles)   Throughput (outputs/cycle)   Area (µm2)
No optimizations   863                0.074                        39.7k
Final result       12                 10.7                         88.8k

DCT design space exploration
Optimizations      Latency (cycles)   Throughput (px/cycle)        Area (µm2)
No optimizations   334                0.189                        77.1k
Single function    19                 2.91                         90.5k
Split functions    21                 5.33                         190.9k

Entire design result
Optimizations      Latency (cycles)   Throughput (px/cycle)        Area (µm2)
No optimizations   6,983              0.009                        108.4k
w/ optimizations   33                 5.82                         432.7k
to perform a run-length search to generate a length-value pair, then perform Huffman coding, and call the “writebit” function to generate the final bit stream. Each time, we must stall the run-length coding task, generate the stream of bits, and then return to the current work. Since different functions are implemented as different modules in hardware, all of these function calls are treated as I/O accesses, which are limited by resources and bandwidth, resulting in increased latency. These three tasks can run in parallel if we buffer the intermediate data using FIFOs. By observation, we find that in each round we generate a data pair of the non-zero value and the number of zeroes encountered. Because these data pairs are the basic elements to be processed later, we can use FIFOs to connect the run-length search task and the “writebits” function, separate them, and modify our design to organize the I/O output. With the FIFOs in the interface, we manually pipelined the subfunctions and greatly improved the parallelism. The first part of Table I shows the difference between the initial result and the optimized result.

D. Design Space Exploration

In this section, we use several techniques to generate an optimized solution for the DCT function in the block and explore the tradeoff between area and throughput during HLS design. The Discrete Cosine Transformation (DCT) block is implemented using the AAN algorithm (the Arai, Agui, and Nakajima fast DCT algorithm) [10], and consists of two “for” loops: row DCT and column DCT. In each loop iteration, we load one row or one column of the 8 × 8 block and process the 8 elements simultaneously. To optimize “for” loops, we have the following options: loop pipelining, loop unrolling, and array mapping [8]. However, after loop unrolling, the performance is not much improved. The bottleneck is in the memory access.
By default, the array is implemented as one memory block, which means that we can read only one element of the array per cycle. This greatly slows down the iteration, as we need 8 cycles to get all the data, and it also greatly increases the initiation interval if we want to use loop pipelining. The best way to solve this problem is to map the whole array to registers so that we can access all the data at the same time for both the row DCT and the column DCT. After that, we apply loop pipelining to further improve the throughput.
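The memory bottleneck and the register-mapping fix can be sketched in plain C++ (the function names and data are our own illustration; in a real HLS flow the second version would be produced by a complete-partition directive rather than hand-written code):

```cpp
#include <cstdint>

// With the 8x8 block kept in one memory, each iteration of the row loop
// issues 8 reads through a single memory port, i.e. 8 sequential cycles:
int32_t row_sum_one_memory(const int16_t mem[64], int r) {
    int32_t s = 0;
    for (int c = 0; c < 8; ++c)
        s += mem[r * 8 + c];  // 1 port -> 8 sequential reads per row
    return s;
}

// Mapping the whole array to registers makes every element individually
// addressable, so all 8 operands of a row (or a column) are available in the
// same cycle and the loop body can be fully unrolled:
int32_t row_sum_registers(const int16_t regs[8][8], int r) {
    return regs[r][0] + regs[r][1] + regs[r][2] + regs[r][3]
         + regs[r][4] + regs[r][5] + regs[r][6] + regs[r][7];
}
```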
However, if we perform function splitting to pipeline the two different stages, we can increase the throughput considerably. In addition, since each function then deals with only one for loop, we can use the memory banking technique to map the array to 8 memories, allowing us to access 8 variables at the same time. This is not achievable in the single-function version: because the row DCT and the column DCT access the array in different orders, there we have to map the array to registers. The tradeoff is that we use more area for the FIFOs that connect the two functions. The result is shown in the second part of Table I. The single-function version is the one where we map the arrays to registers, pipeline the loop, and keep the row DCT and column DCT in the same function. The throughput is further improved when we apply memory banking to the arrays in each function and also perform loop pipelining.

E. Discussion

We discussed converting a software program into a hardware-friendly design, covering the general patterns of such a design flow: define the design top, make the code synthesizable, restructure at the function level, optimize inside functions, and perform design space exploration along the way. The extra cost of this software-modification-based design approach lies in the effort of making the code synthesizable and restructuring it. The key idea across the whole design is to increase the parallelism in the code. This can be done in the following ways:
1) Function splitting with FIFOs as intermediate interfaces breaks the dependences between blocks and is a powerful technique to increase throughput.
2) Loop unrolling and pipelining can reduce latency and improve throughput but are restricted by memory access ports as well as the area budget.
3) Array partitioning helps eliminate resource constraints and provides more opportunity for parallelism.
As Table I shows, by applying these techniques, we decreased the latency of our JPEG encoder design by more than 200X and increased the throughput by more than 600X.

III. ARCHITECTURE REFERENCE CASE STUDY: AES

The architecture reference approach starts with an intended hardware architecture and develops specifications from scratch to implement that architecture. In this case study, we design an AES block encryption core; we compare a manual design approach using SystemVerilog to an HLS-based implementation using SystemC. In Section III-B, we compare the languages in more detail, but first we specify our desired system architecture.

A. AES Architecture

AES is an industry-standard symmetric block cipher algorithm that takes a 128-bit plaintext “block” and a key as input and outputs a 128-bit ciphertext “block” [13]. The key can be 128, 192, or 256 bits wide; each of the three key widths corresponds to one of the three possible operation modes of AES. In this case study, our goal is to create a custom hardware module that performs 128/192/256-bit AES encryption with the following requirements:
• Clock period: 2ns
• Throughput: 1 encryption operation / cycle
• Registered outputs
We have the following prioritized QoR goals:
1) Minimum area
2) Minimum cycle latency
[Fig. 3 diagram: (a) the n-round cipher datapath (Sub Bytes, Shift Rows, Mix Columns, Add Round Key per round, consuming subkeys 0 through n); (b) 128-bit key expansion; (c) 192-bit key expansion.]
Fig. 3. AES function diagrams (256-bit key expansion omitted). Shift = shift rows, Mix = mix columns, Add = add round key. rot = rotate word, sub = sub word, ⊕cn = add round constant n, ⊕ = bitwise xor. In (a) wires are 128 bits wide; in (b,c) wires are 32 bits wide.
Given the throughput goal of 1 operation / cycle, our architecture must use a feed-forward topology with pipelining to meet the clock period constraint. Decomposing the AES encryption algorithm into sub-functions, we find two main sub-functions: key expansion and cipher, as shown in Fig. 4. We show a block diagram for Cipher in Fig. 3(a) and for the 128-bit and 192-bit versions of Key Expansion in Figs. 3(b) and 3(c). The cipher function has a different number of rounds of operation depending on the number of bits in the key: {10, 12, 14} rounds for the {128, 192, 256}-bit keys, respectively. Thus, the hardware architecture implements a 14-round cipher, with a multiplexor on the input of the final round that selects from the outputs of rounds {9, 11, 13}.

[Fig. 4 diagram: the AES Encrypt module hierarchy, from 128-bit functions (Cipher, Key Expansion, Add Round Key, Sub Bytes, Mix Columns, Shift Rows) through 32-bit functions (rot word, xor word, sub word, mix word) down to 8-bit functions (s-box, affine transform, GF inverse, GF dot product, GF multiply, GF normalize).]
Fig. 4. AES Encrypt design hierarchy. Edge labels indicate how many submodules the parent module instantiates. GF = Galois Field.
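For concreteness, one leaf function of this hierarchy, Shift Rows, can be written out directly from the AES standard: row i of the 4 × 4 byte state is rotated left by i positions. This sketch is our own (it uses a row-major state layout for readability; the actual state layout in either of our implementations may differ).

```cpp
#include <cstdint>

// AES Shift Rows on a 4x4 byte state: row r is rotated left by r bytes.
// Row 0 is unchanged; rows 1-3 rotate by 1, 2, and 3 positions.
void shift_rows(uint8_t s[4][4]) {
    for (int r = 1; r < 4; ++r) {
        uint8_t tmp[4];
        for (int c = 0; c < 4; ++c) tmp[c] = s[r][(c + r) % 4];
        for (int c = 0; c < 4; ++c) s[r][c] = tmp[c];
    }
}
```

In hardware this function is pure wiring (a fixed byte permutation), which is why it costs so few lines and essentially no gates.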
Based on this desired general architecture, we implement these sections in both a manual design flow and an HLS-based design flow. Next, we compare SystemVerilog and SystemC, our two input languages for the manual and HLS-based design flows.

B. Comparing SystemVerilog and SystemC

We divide input languages for hardware design into two general classes:
1) Purpose-built hardware description languages (e.g. Verilog, VHDL, SystemVerilog).
2) Software programming languages adapted for hardware description (e.g. C/C++, SystemC).
Software programming languages often take the form of a C variant with libraries, language extensions, and annotations (e.g. pragmas). The C code provides a behavioral description of the desired hardware, while libraries, pragmas, and directives describe the hardware
architecture that implements that behavior. For this case study, we compare SystemVerilog [11] (Class 1) to SystemC [12] (Class 2). SystemVerilog extends Verilog with a powerful type system, metaprogramming, and assertion semantics for verification. SystemVerilog is well supported by existing logic synthesis tool chains, like Verilog and VHDL. SystemC extends C++ with hardware interface descriptions, cycle-accurate and cycle-approximate semantics, and arbitrary bit-width data types. Our SystemC HLS tool is a state-of-the-art tool with proprietary extensions for “for loop” and module hierarchy architecture directives. SystemC is supported by a growing number of synthesis toolchains and is the most popular of the existing C-based, synthesizable hardware specification standards.
Table II summarizes the similarities and differences between SystemC and SystemVerilog for key features we found important in developing our implementation of AES. SystemC is a C++ class library that extends the language with features to describe hardware behavior. Compared to SystemVerilog, the design description is at a higher abstraction level with more flexibility in modeling: SystemC can model cycle-accurate, timing-approximate, and untimed (e.g. plain C/C++) behavior. The cycle-accurate coding style is widely adopted in industrial design flows: designers control the I/O behavior to control interface protocols, while allowing the HLS tool to schedule and synthesize the design under changing design goals. This higher abstraction level has benefits in mapping between technology nodes and differing design goals, as well as a faster learning curve for developers. High-abstraction modeling using SystemC enables relatively smooth transfer to different process nodes, i.e., the same piece of SystemC code can be used as the input of HLS tools to generate different hardware design solutions according to the technology.
On the other hand, switching technology nodes usually leads to different hardware designs and hence requires a massive amount of redesign effort in the traditional RTL design flow.
Both languages have sequential and concurrent semantics. SystemVerilog is concurrent between modules, with sequential semantics available in “always” blocks, functions, and tasks. SystemC, due to its C origins, has mostly sequential semantics, with concurrency available between modules and the clocked threads within them. Metaprogramming allows the hardware designer to write code that generates code to be compiled. SystemVerilog does this through “generate” blocks with “genvars,” special variables that are used only for metaprogramming and never become hardware. Both languages have template functions and modules, allowing the programmer to generate multiple variations of logic functions from a single implementation. Both languages have cycle-exact semantics. SystemVerilog specifies cycle-synchronized operations with clock-edge-triggered “always ff” blocks. SystemC has a “wait” function to define cycle boundaries.
TABLE II
COMPARISON OF SYSTEMVERILOG AND SYSTEMC AS INPUT HARDWARE DESCRIPTION LANGUAGES FOR TYPICAL SYNTHESIS TOOLS

Feature                                              SystemVerilog                              SystemC
Interface specification                              modules                                    module classes
Submodules                                           module instantiation                       function + tool submodule directive
Inlined functions                                    function or tool ungroup directive         function + tool inline directive
Sequential semantics                                 in “always” blocks, functions, tasks       default
Concurrent semantics                                 in modules                                 between clock threads
Metaprogramming                                      generate blocks, parameterized modules     templated functions
Loop unrolling                                       default                                    tool unroll directive
Module pipelining                                    registers + tool retime directive (II=1)   loop + tool pipeline directive (any II)
Cycle exact semantics                                yes                                        yes
Cycle approximate semantics                          no                                         through tool directives
Array assignment, slicing, concatenation, reshaping  yes                                        no
Arbitrary bit-width data types                       native                                     through libraries
Bitwise slicing, concatenation, reduction            native                                     through libraries
Explicit don’t care                                  yes                                        limited
Composite types and typedefs                         yes                                        yes
A key differentiator of SystemC is its toolchain support for cycle-approximate semantics: specifying a region of code that can have variable latency. This gives the tool freedom to choose the latency that best optimizes for the QoR goals given by the designer. A distinctive advantage of SystemVerilog is its strong support for arrays as a first-class datatype. In particular, SystemVerilog supports array assignment, slicing, and concatenation that work the same way as integer operations. In SystemC, the designer must write functions with loops to perform array copies and slices, as the core C language does not support such features. Another useful feature in SystemVerilog is array reshape casting: reinterpreting a variable as one with different dimensions but the same total number of bits. Both languages have arbitrary bit-width datatypes. SystemVerilog is designed for variable bit widths, while SystemC offers them through arbitrary bit-width libraries. SystemVerilog supports bit slicing, concatenation, and reduction natively, while SystemC offers them only for its arbitrary bit-width data types. Supporting “don’t care” is useful because it allows the designer to explicitly indicate where a variable is undefined. This not only helps catch bugs during simulation, but also gives the synthesis engine additional freedom to assign “don’t care” outputs in the way that best optimizes for the QoR goals given by the designer. While SystemVerilog has explicit “don’t care” literals (i.e. an “x”) for its primary “logic” datatype, SystemC is limited, as its primary “sc int” and “sc uint” datatypes have two-state bits that cannot be assigned “x.”

C. Designing an AES Encrypt Module

Based on our desired architecture, we implement the AES encryption using both SystemVerilog and SystemC.
In our initial implementation, we wrote a templated function (parameterized module) that we instantiated three times, once for each mode, the idea being that the synthesis tool would share resources among the three variations (two are shown in Figs. 3(b) and 3(c)). Although both SystemC and SystemVerilog [14] toolchains have automatic resource sharing
optimizations, we found that limitations in our toolchains prevented automatic sharing in both cases. We noticed that the resource accounting for over 80% of the area cost of Key Expansion was “sub word” instances: 31 instances were allocated when only 13 were needed if shared optimally. Thus, we could greatly improve QoR if we could find a way to instantiate (call) only 13 instances and share them. In SystemVerilog, we instantiated an array of 13 instances and shared the array with three-way multiplexers on the inputs, using a function in an “always comb” block to specify the other key expansion hardware that connects the “sub word” instances for each of the three modes. This is challenging to implement correctly: because the different modes use the “sub word” function at different frequencies, sharing variables for subkeys resulted in (false) combinational loops not supported by our synthesis toolchain. We found that we needed separate subkey variables for each of the three modes and a multiplexer at the subkey output to select among them. In addition, we found that we needed to split the function into two parts called by two separate “always comb” blocks: one function to handle the inputs to the “sub word” function, and one to handle the output and generate the next subkey(s) for an iteration. Without this split, our toolchain did not recognize the concurrency between the single “always comb” block and the array of “sub word” instances. To share the “sub word” instances in SystemC, we did not have the option to specify concurrency with an “always comb” block equivalent, because the synthesizable SystemC subset contains only processes that are clock-edge sensitive [15] (i.e. each submodule has to have registered outputs). Thus, we were restricted to sequential semantics.
Our sequential solution was to manually align the three key expansion processes so that the “sub word” instances lined up, resulting in a main loop with 13 iterations that calls three variations of an input generation function in each iteration, a common instance of “sub word,” and three variations of an output handling function.

D. Synthesis Optimizations

Having implemented a good architecture for AES, the next step is to work with the synthesis tool to find a good synthesis strategy for both designs. Observing that the “s-box” consumes about 80% of the area cost of the design, we focused our effort on this subdesign. In our SystemVerilog toolchain, we performed bottom-up synthesis by synthesizing the s-box with maximum effort (the time required was still less than a minute due to the small submodule size) and then synthesizing the rest of the design with the s-boxes fixed (essentially, they became black boxes after initial synthesis). We performed a similar procedure in our SystemC toolchain by making the “s-box” design a custom resource characterized by calling the logic synthesis back-end. Our initial approach was to ask both of our synthesis toolchains for the minimum-area version of the s-box with no delay constraint, then explore alternate synthesis optimizations that improve performance at marginal area cost. Having obtained optimal s-box implementations, the final step was to insert pipeline flip-flops to add cycle latency as needed to meet the clock period constraint. Here, the cycle-approximate semantics of SystemC enabled the toolchain to do this work for us through the classic resource scheduling process. For the SystemVerilog design, at each output we created an array of pipeline flip-flops with depth n and configured our toolchain to perform retiming for all flip-flops except the last stage. By tweaking n, we found the minimum cycle latency for our design that satisfies our clock period constraint.
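The variable-depth pipeline used for this exploration can be modeled in software (an illustrative sketch of ours, with a deque standing in for the flip-flop chain; in the actual SystemVerilog flow the chain is a register array that the tool retimes):

```cpp
#include <deque>
#include <cstdint>

// Software model of an n-deep pipeline register chain at a module output.
// Each tick() models one clock edge: a new value enters the chain and the
// oldest value emerges, so an input appears at the output n cycles later.
struct Pipeline {
    std::deque<uint32_t> regs;
    explicit Pipeline(int depth) : regs(depth, 0) {}
    uint32_t tick(uint32_t in) {
        regs.push_back(in);            // capture new value on the clock edge
        uint32_t out = regs.front();   // oldest value reaches the output
        regs.pop_front();
        return out;
    }
};
```

Sweeping the depth parameter and re-running synthesis with retiming enabled is what lets the designer find the minimum latency that still meets the clock period.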
E. Results

Table III compares the effort involved in implementing the SystemC and SystemVerilog versions of AES Encrypt, using executable lines of code (ELOC) as an objective metric. We define an executable line as a line of code that is not empty and contains more than just a delimiter for the end of a code block. As ELOC is only a rough approximation of development effort and code maintainability, only large, order-of-magnitude differences are meaningful. Our comparison shows that the SystemVerilog and SystemC development efforts for our AES case study are similar. Note that we wrote our SystemVerilog code at roughly the same abstraction level as our SystemC code; both are mostly straightforward translations of the AES standard [13]. The unique overhead in SystemC was the result of verbose interface specifications for the design and testbench modules, as well as extra SystemC functions to copy arrays/matrices and pack/unpack them to/from bit vectors. We observed some unique overhead in SystemVerilog from implementing pipeline flip-flops. Of course, this is just a single case study; results for other hardware designs may vary.

TABLE III: EXECUTABLE LINES OF CODE (ELOC) COMPARISONS.

Function             SV    SC
AES Encrypt          21    30
Key Expansion        58    61
Cipher               26    26
Add Round Key         5     4
Sub Bytes             2     4
Shift Rows            6     6
Mix Columns           2     4
rot word              5     4
xor word              5     4
sub word              2     4
mix word              9     9
s-box                 2     2
affine transform      9    10
GF inverse           19    19
GF dot product        7     6
GF multiply           5     6
GF reduce             6     7
array copy            -    32
global typedefs      11    12
global constants      4     3
other                 -     3
Design Total        204   256
testbench            59   154
synthesis script     41    31
Total               304   441
Table IV compares the QoR of our two AES designs when synthesized to gates. We used a 45 nm technology library and ran our toolchains on a 16-core AMD Opteron 6272 CPU in a system with 256 GB of DDR3 DRAM. We used an academic version of a commercial, state-of-the-art SystemC-to-RTL synthesis tool. The key difference between the two synthesis runs is the control we had over the synthesis strategy. In the SystemVerilog design, we were able to focus the tool's optimization effort on the s-box design and then work from the bottom up. In the SystemC design, we had control over how the tool characterized the s-box, but the tool performed top-down logic synthesis with no subdesign receiving special treatment.

TABLE IV: QOR AND SYNTHESIS RUNTIME COMPARISONS.

Metric                   SV       SC       ratio (SC/SV)
Min Clock Period (ns)    2.029    2.000    0.99
Latency (cycles)         4        5        1.20
s-box instances          276      276      1.00
s-box area (µm²)         163.7k   230.8k   1.41
Other area (µm²)         67.0k    85.7k    1.28
Total area (µm²)         230.7k   316.5k   1.37
RTL gen. time (min)      -        14.87    -
Total Runtime (min)      51.78    24.77    0.48
For our AES design, we found that both modern HDL-style and modern C-style input languages mostly had the power needed for us to be productive. In addition, we found that the hardware design, architectural decisions, toolchain quality, and synthesis strategies were at least as important as the choice of input language, if not more so. Our study uncovered a number of places where the current state-of-the-art can be improved. We found that SystemC has a performance gap with manual design in SystemVerilog, a lower-level specification language. In particular, key features such as arrays as a first-class datatype, separation between metaprogramming constructs/variables and code for synthesis, fine-grained concurrency, and interface specifications could all be improved to give access to low-level optimizations, better automate optimization, or both. Conversely, SystemVerilog could be improved by adopting cycle-approximate semantics to enable designers to insert variable-depth pipeline flip-flops for flexible design space exploration. Furthermore, in both SystemC and SystemVerilog, lightweight constructs that give resource-sharing hints to the synthesis toolchain may be helpful. Additionally, the designer could help characterization tools converge on an optimal design faster with higher-order constraints (e.g., a linear cost function of area and delay) and a larger variety of optimization strategies (e.g., "delay recovery" with fixed area cost). Finally, high-level scheduling algorithms could be improved by borrowing some of the techniques from logic synthesis retiming.

IV. HLS BEST PRACTICES

In this section, we build some micro-benchmarks in SystemC to illustrate general best-practice design principles for creating efficient HLS designs. We will show that both hardware and software knowledge are essential to use HLS tools effectively.

A. Using templates in HLS

As discussed in Section III-B, a template is a C++ feature that allows multiple variations of a function to be automatically instantiated from a single implementation.
In high-level synthesis, this can be used to instantiate multiple functional units with similar functionality but different variable sizes or data types, without code duplication. This is illustrated in Fig. 5 for a benchmarking case study. Given a function description as shown in Fig. 5(a), the input is stored in a 4 × 4 matrix, while only its upper-triangular portion is processed in this block. In particular, the matrix is processed row by row, i.e., at the ith outer iteration, the ith row of the matrix is sent to a processing function named CORE, which is shown in Fig. 5(c).
void top(int in[4][4], int out[4][4]) {
  int row_out[4] = {};
  for (int i=0; i
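The listing above is truncated in this copy. As a self-contained sketch of the Fig. 5 template pattern, the following fills in the structure under our own assumptions; CORE's body is a hypothetical placeholder, not the paper's actual computation:

```cpp
// Sketch of the Fig. 5 pattern: CORE is templated on the number of
// elements to process, so the HLS tool can instantiate a differently
// sized functional unit for each row of the upper triangle.
template <int N>
void CORE(const int* in, int* out) {
  for (int j = 0; j < N; ++j)
    out[j] = in[j] + 1;  // hypothetical per-element computation
}

void top(int in[4][4], int out[4][4]) {
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      out[i][j] = 0;                 // lower triangle left as zero
  CORE<4>(&in[0][0], &out[0][0]);    // row 0: columns 0..3
  CORE<3>(&in[1][1], &out[1][1]);    // row 1: columns 1..3
  CORE<2>(&in[2][2], &out[2][2]);    // row 2: columns 2..3
  CORE<1>(&in[3][3], &out[3][3]);    // row 3: column 3
}
```

Each CORE<N> instantiation becomes a separate functional unit sized to its row length, which is the code-reuse benefit the section describes.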