Behavioral-Level IP Integration in High-Level Synthesis

Liwei Yang∗, Swathi Gurumani†, Deming Chen‡, Kyle Rupnow†
∗School of Computer Engineering, Nanyang Technological University
†Advanced Digital Sciences Center, Singapore
‡University of Illinois at Urbana-Champaign
Email: [email protected], {swathi.g, k.rupnow}@adsc.com.sg, [email protected]

Abstract—High-level synthesis (HLS) quality improvements have led to its increased adoption in hardware design. In the design flow, IP reuse is critical for achieving quality of results, yet current HLS tools allow only a small set of tool-provided IPs to be integrated during HLS. General IP integration is then handled as an additional step, either manually or using other system-level tools. Performing post-HLS integration of IPs requires a clear separation of IPs from HLS-generated cores, demanding significant partitioning effort. In contrast, behavioral-level IP integration during HLS can simplify the design flow while still supporting HLS-based optimization and design space exploration. In this paper, we develop a general IP integration framework for HLS that supports fixed- and variable-latency IPs without requiring application partitioning. Using this framework, which allows user-specified function/instruction-to-IP mapping, we demonstrate integration of both synthesizable and non-synthesizable IPs.

Index Terms—IP Integration, Behavioral-Level Integration, High-Level Synthesis

I. INTRODUCTION

High-level synthesis (HLS) tools improve designer productivity by allowing design and verification in high-level languages with automatic translation to hardware, leading to wider adoption in complex applications. In such applications, there may be functions without HLS-appropriate implementations as well as functions with existing, well-optimized register transfer level (RTL) IP implementations. Mathematics, I/O, and debug libraries are heavily used in software; to synthesize the design and verify functional correctness, the user must provide synthesizable implementations of these functions or eliminate calls to them before any hardware output can be produced. Pre-existing IPs for important sub-functions allow reuse of well-known, high-quality implementations, which improves performance and productivity [1].

Prior HLS tools integrate a small fixed set of IPs, but do not target generalized integration of third-party IPs. Without generalized IP integration, HLS users must provide HLS-compatible C/C++ function implementations. Although [2] proposed a method to translate RTL IP blocks into C++ code, it re-synthesizes the C++ code instead of using the RTL IP, and can only be used with simple RTL that can be translated to C++. Rather than integrating IPs at the behavioral level, prior HLS tools require the user to instantiate RTL IPs and HLS-generated cores and connect them together as an additional step outside the HLS flow [3]–[6].

This post-HLS system-level IP integration requires substantial additional design effort: the user must divide the application into partitions that do not use IPs, perform HLS on each partition independently, and then manually connect the partitions (and the IPs) together. Furthermore, all library functions must be eliminated or re-implemented. Other IP integration methods such as [7] treat each IP core as an independent accelerator, but depend on a CPU-based platform, whereas HLS-based designs typically synthesize datapath and control without a processor core.

In comparison, behavioral-level IP integration directly instantiates IP cores during HLS and generates the complete custom hardware without partitioning the application into functions with and without existing IPs, and without re-implementing non-synthesizable functions. The HLS-compatible application may itself be one component in a larger system hierarchy, and will then need integration with the other components at that hierarchical level. However, given a system-level source for custom hardware, behavioral-level IP integration can instantiate the provided IPs for the corresponding functions and generate a system-level output. Thus, IP profile data can be used during the allocation, binding, and scheduling processes to improve design space exploration and final design quality.

In this paper, we develop a standard framework for users to specify a mapping between C/C++ functions (or instructions) and IPs. This generalizes IP integration support, extending the small, fixed set of supported IPs in existing tools to any function or instruction. Our framework supports both fixed-latency IPs and variable-latency IPs with a simple ready-valid handshaking protocol. This paper contributes to HLS and behavioral-level IP integration with:
• The first generalized IP integration supporting fixed- and variable-latency IPs, and both synthesizable and non-synthesizable IPs.
• A flexible intermediate representation that tracks data, control, and resource dependencies for IPs.
• User-specified function/instruction-to-IP mapping and behavioral-level or system-level integration.

II. RELATED WORK

Pre-existing verified third-party IPs are important for fast and efficient hardware development [1], [8].

There have been prior EDA tools specifically for IP integration, such as CoWare N2C (now part of Synopsys) [9] and Cadence Virtual Component Codesign (VCC) [10], predominantly for prototyping and co-simulation at the system level, but they do not generate functional RTL implementations for the IPs. Leading FPGA vendors also provide system-level IP integration solutions: Altera SOPC Builder [11] and Xilinx Vivado IP Integrator [12] help the user create interconnect interfaces and manually connect components, but do not perform automated integration of IPs during HLS. Moreover, the user must still manually partition the C code to support system-level IP integration. In this work, we implement behavioral-level IP integration that allows instantiation of hardware RTL IPs, including third-party IPs, during the HLS process.

III. THE VAST HLS FRAMEWORK

We extend an existing LLVM-based HLS framework, VAST [13], with behavioral-level IP integration as shown in Fig. 1. Entirely new features are highlighted in orange, and partially modified steps are in blue. We present the sequence of steps in the flow (Fig. 1).

1) Parsing and HLS-independent optimization: The framework uses Clang as the frontend to parse the C/C++ source code and perform HLS-independent compiler optimizations.

Fig. 1: VAST HLS Flow. [Figure: C/C++ Specification → Frontend → Generic Optimizations → LLVM IR → VAST IR Building → VAST IR → Allocation → Scheduling → Binding → RTL Code Generation → RTL Specification & Timing Constraints → Simulation / Hardware Synthesis. A user-provided IP Specification feeds VAST IR Building; the resulting IP List drives the IP Generator, which produces the IPs and scripts for simulation/synthesis.]

2) VAST IR Building: From the LLVM IR produced by Clang, the framework builds the VAST IR, which represents hardware-level information to support optimizations such as dead-code elimination, as well as timing estimation, scheduling, and binding. To effectively partition the optimization and analysis of instructions, the VAST IR consists of four layers that explicitly represent the design at the behavior, control-flow, data-flow, and hardware architecture levels. The behavior level is inherited from the LLVM IR and contains LLVM instructions. The control-flow level contains a state transition graph (STG) that represents control state transitions.

The data-flow level contains directed acyclic graphs (DAGs) that represent computations and data dependencies, where each node in a DAG is an operator. The hardware architecture level represents physical resources in the hardware architecture such as registers, functional units, memories, and I/O ports. Each layer therefore represents different types of information and supports different types of optimization passes. We modify IR building to include user-specified IP mappings and characterization of IP area and latency. We discuss IP specification and integration in more detail in Section IV.

3) Allocation/Scheduling/Binding: The VAST IR is then processed through the typical HLS steps, i.e., allocation, scheduling, and binding, to construct an IP-oriented hardware implementation. Necessary resources are allocated to accommodate the VAST IR in static single assignment (SSA) form. A modified version of the system-of-difference-constraints (SDC) [14] scheduler is used to support inter-basic-block parallelism [15], multi-cycle analysis and chaining [13], and global code motion [15]. A structural-hashing-based binding is applied to eliminate redundant structures in the design. For integrated IPs with variable latencies or external interfaces, such as sqrt() or printf() in Section V, a single IP instantiation is shared by all instructions/functions using that IP. These shared IPs, as well as shared memories and I/O, must be serialized to guarantee correctness; thus, we perform binding before scheduling.

4) RTL Code Generation: VAST HLS generates RTL in Verilog together with an RTL testbench and scripts for simulation and synthesis. To assist in testbench generation and automated instantiation of external IPs, we modify code generation to dump information about the IPs. For Altera and Xilinx IPs, we also use this information to automate running Altera's or Xilinx's IP-core generation tools.

IV. IP INTEGRATION IN VAST HLS

Our implementation of IP integration has two main components: an extensible function/instruction-to-IP mapping that allows users to specify candidates to be implemented by IPs and the corresponding IP implementations, and extensions to the existing VAST IR to represent those candidates in the HLS flow. In this work, we focus on IPs with scalar inputs and outputs (I/O), and leave IPs with vector I/O for future work.

A. Extensible Function/Instruction to IP Mapping

The first step in IP integration is a user-provided specification of the functions and instructions that should be mapped to IPs. Each function/instruction that will be mapped to an IP must provide behavioral-level information such as the name of the function/instruction and the corresponding signature, including the return type and the number and types of arguments/operands. The mapping also provides characterization information on latency, area, and externality, and a wrapper module netlist that conforms to our fixed- or variable-latency interface for internally instantiated IPs, i.e., IPs instantiated within the HLS-generated top module for the application.
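As a concrete illustration, consider mapping the C library call sqrtf() to a fixed-latency floating-point square-root core. The sketch below is hypothetical: the paper does not publish the mapping-script syntax, so the mapping metadata appears only as comments, and fp_sqrt_32 is a placeholder for a vendor or third-party core with a fixed 14-cycle pipeline. The wrapper conforms to the fixed-latency interface summarized in Table I below.

    // Hypothetical mapping entry (illustrative; the actual script syntax may differ):
    //   candidate : float sqrtf(float)  ->  IP "fp_sqrt_32"
    //   latency   : 14 cycles (fixed)       area : characterization data
    //   external  : no (instantiated inside the HLS-generated top module)
    module sqrtf_ip_wrap (
      input  wire        clk,   // all IPs use registered inputs and outputs
      input  wire        rstN,  // optional register reset
      input  wire [31:0] in0,   // IEEE-754 single-precision operand
      output wire [31:0] out0   // result, valid a fixed 14 cycles later
    );
      // Placeholder core; substitute a generated or third-party IP.
      fp_sqrt_32 u_core (
        .clk    (clk),
        .rst_n  (rstN),
        .a      (in0),
        .result (out0)
      );
    endmodule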

TABLE I: IP Interfaces

Port  | Fixed/Variable | Description
clk   | Both           | Clock – all IPs must use registered inputs and outputs
rstN  | Both           | Module reset – IPs can optionally reset registers
start | Variable       | Start computation – the top module asserts it to tell the IP to register inputs and begin computation
fin   | Variable       | Finish computation – the top module waits for the IP to assert fin, signifying outputs can be read
inX   | Both           | 0 or more input ports, with widths specified in the instruction/function signature
outX  | Both           | 0 or 1 output port, with width specified in the instruction/function signature

Table I lists the interfaces for both fixed- and variable-latency IPs. Fixed-latency IPs follow a simple clocked input-output interface; variable-latency IPs use start-finish handshaking signals. Internal IPs may use either the fixed- or the variable-latency interface; external interfaces always use the variable-latency handshaking interface. The wrapper netlist can contain any module that correctly follows these interfaces (Table I), including more complex non-synthesizable functions. Although our two interfaces are simple, they support a wide range of IPs: many IPs can be designed or used as fixed-latency (worst-case) IPs, and IPs that use common bus interfaces such as AXI, Avalon, or Wishbone can often be instantiated in a wrapper that follows the handshaking protocol. For example, Xilinx IPs compatible with the AXI4 interface have a 'tvalid' signal for every data input and output channel; 'start' can assert the input 'tvalid' to initiate operation, and the output 'tvalid' can assert 'fin' to indicate completion.
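To make the AXI mapping concrete, the sketch below wraps a hypothetical AXI4-Stream-style square-root core behind the variable-latency interface; axis_fp_sqrt and its port names are assumptions for illustration, not a specific vendor IP.

    // Variable-latency wrapper sketch around an AXI4-Stream-style core.
    // The core is assumed to register its outputs and to accept one
    // operand in each cycle where s_axis_tvalid is high.
    module axis_sqrt_wrap (
      input  wire        clk,
      input  wire        rstN,
      input  wire        start, // asserted by the top module to begin computation
      output wire        fin,   // asserted when out0 holds a valid result
      input  wire [31:0] in0,
      output wire [31:0] out0
    );
      // Hypothetical AXI4-Stream core; substitute a generated vendor IP.
      axis_fp_sqrt u_core (
        .aclk          (clk),
        .aresetn       (rstN),
        .s_axis_tvalid (start), // 'start' drives the input tvalid
        .s_axis_tdata  (in0),
        .m_axis_tvalid (fin),   // the output tvalid asserts 'fin'
        .m_axis_tdata  (out0)
      );
    endmodule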

B. VAST IR Extensions

In order to support third-party IPs, we make modifications to each level of the VAST IR. At the behavior level, we use the function/instruction-to-IP mapping from the user script to detect candidates that will be implemented by IP blocks. Once a candidate is detected, it is tagged for proper handling in the other three levels.

Fig. 2: Waiting State Machine for Variable-Latency IPs. [Figure: from Entry, the waiting FSM asserts start and moves from Idle to Waiting Return; when the variable-latency IP module raises fin, the FSM proceeds to Exit.]

Fig. 2: Waiting State Machine for Variable-Latency IPs At the control-flow level, IPs for fixed-latency instructions are treated identical to elementary instructions such as additions or multiplications. For each variable-latency IP, there is a waiting state machine as shown in Fig. 2. The core FSM asserts the IP’s start signal, and then waits to receive a return. After the IP completes computation, it asserts the fin signal, which is received by the waiting FSM. Because the same IP may be used multiple times, there are disambiguation flags to denote the correct next state in the core FSM. At the data-flow level, IPs for fixed-latency candidates are represented as computations in SSA-form and can be

At the data-flow level, IPs for fixed-latency candidates are represented as computations in SSA form and can be multiply instantiated and optionally shared according to the scheduling result. Variable-latency IPs may not take part in optimizations such as instruction chaining or parallelization across multiple instantiations, where a fixed upper bound on latency is required. Consequently, at the hardware architecture level, fixed-latency candidates become one or more instantiations of the IP block, while variable-latency instructions become a single instantiation shared by all users.

C. Testbench Generation and Validation

In addition to generating IP instantiations, our HLS engine generates a list of required IPs to support automated generation of the IP blocks and the testbench. We develop scripts that automate the use of Altera's and Xilinx's IP generation tools (users can provide appropriate commands for other tools, or copies of the IP blocks), and generate a testbench instantiating the HLS-generated design, the external IP cores, and the test logic. The testbench is self-verifying based on the return value of the behavioral-level application.

V. EXPERIMENT

We now demonstrate the capabilities of our generalized IP integration by supporting instructions/functions that are not part of the Verilog syntax, using non-synthesizable IPs for debug and verification, and supporting variable-latency IPs.

A. General Fixed-Latency IP Integration

We first demonstrate general IP integration by utilizing the user-specified mapping to instantiate appropriate IP blocks for floating-point instructions and library functions. We target both Altera and Xilinx platforms, automatically integrating platform-specific floating-point IPs into seven benchmarks from the Rodinia suite [16]. These benchmarks are large, complex floating-point applications that include functions such as exponential, square root, and logarithm. Using the OpenMP version of each application [16], we modify them for HLS compatibility and introduce self-verifying logic compatible with the automatically generated testbench. For each application, we perform high-level synthesis and automatically generate testbenches and simulation scripts. We validate results with functional simulation and then perform logic synthesis.

TABLE II: Results of Latency and Fmax

Altera Platform - Stratix IV EP4SGX230K
Benchmark       | Latency in cycles (No-Opts / HLS-Opts) | Fmax in MHz (No-Opts / HLS-Opts)
backprop_kernel | 160088 / 157198   | 147.32 / 140.59
nn              | 787369 / 671886   | 218.25 / 138.77
hotspot         | 200 / 200         | 135.83 / 141.38
lavaMD          | 771089 / 741389   | 163.72 / 157.88
lud             | 3891360 / 3576951 | 146.31 / 147.86
srad            | 18506 / 17165     | 139.45 / 140.08
cfd             | 22950 / 23093     | 138.50 / 138.70

Xilinx Platform - Virtex 7 XC7V2000T
backprop_kernel | 245740 / 242849   | 39.712 / 30.819
nn              | 971530 / 738851   | 284.900 / 167.280
hotspot         | 420 / 420         | 284.091 / 276.778
lavaMD          | 1502974 / 1473293 | 68.418 / 88.433
lud             | 5197784 / 4872040 | 48.964 / 47.703
srad            | 33068 / 32429     | 316.957 / 288.434
cfd             | 35768 / 35673     | 267.094 / 270.416

TABLE III: Results of Resource Usage

Altera Platform - Stratix IV EP4SGX230K (No-Opts / HLS-Opts)
Benchmark       | ALUTs           | Registers         | BRAM in bits      | DSP 18-bit
backprop_kernel | 34,497 / 34,235 | 33,773 / 33,235   | 157,248 / 157,248 | 328 / 328
nn              | 3,382 / 3,168   | 3,920 / 3,677     | 684,864 / 684,864 | 8 / 8
hotspot         | 46,907 / 46,925 | 42,252 / 42,252   | 728 / 728         | 616 / 616
lavaMD          | 38,191 / 39,853 | 38,114 / 32,280   | 88,064 / 88,064   | 166 / 166
lud             | 4,636 / 4,475   | 4,620 / 4,438     | 266,752 / 266,752 | 28 / 28
srad            | 25,570 / 25,542 | 25,010 / 24,655   | 42,164 / 42,164   | 306 / 306
cfd             | 92,390 / 92,405 | 107,956 / 107,956 | 68,768 / 68,768   | 780 / 780

Xilinx Platform - Virtex 7 XC7V2000T (No-Opts / HLS-Opts)
Benchmark       | Slice LUTs            | Registers         | RAMB36 in bits        | DSP48E1
backprop_kernel | 1,054,223 / 1,063,001 | 195,611 / 194,876 | 184,320 / 184,320     | 215 / 215
nn              | 2,542 / 2,448         | 4,233 / 3,996     | 1,198,080 / 1,198,080 | 10 / 10
hotspot         | 60,097 / 60,096       | 111,457 / 111,457 | 0 / 0                 | 139 / 139
lavaMD          | 491,595 / 488,975     | 129,049 / 123,651 | 129,024 / 129,024     | 196 / 196
lud             | 987,250 / 978,146     | 141,251 / 141,058 | 147,456 / 147,456     | 16 / 16
srad            | 22,307 / 20,609       | 40,634 / 40,423   | 221,184 / 221,184     | 146 / 146
cfd             | 67,440 / 67,188       | 134,158 / 133,387 | 92,160 / 92,160       | 448 / 442

Due to Xilinx's encrypted IPs, we use the Vivado simulator for simulations of Xilinx designs, and ModelSim 10.1d for simulations of Altera designs. To demonstrate compatibility with HLS optimizations, we synthesize each application with no additional HLS optimizations (No-Opts) and with the multi-cycle path analysis [13] and global code motion [15] VAST HLS optimizations (HLS-Opts). With our current one-to-one binding algorithm, fixed-latency IP blocks are not shared, so there may be multiple instantiations running in parallel. Performance and area data for all benchmarks on the Altera and Xilinx platforms are shown in Tables II and III. These results demonstrate correct operation of the basic features of our IP integration technique when targeting any FPGA platform with appropriate user-specified IPs; the area differences between the two configurations are not contributed by the integrated IPs. Although this experiment demonstrates IP integration with floating-point IPs, any IP with scalar I/O can be integrated using this technique.

B. Integration of Non-synthesizable and Variable-latency IPs

Several benchmarks use math library functions that have no exact equivalent hardware implementations. For the backprop_kernel and srad benchmarks, we use CPU implementations of functions such as srand() and rand() through the SystemVerilog DPI interface, which eases verification of the HLS-generated design by consuming a random number sequence identical to that of the software implementation. We also directly support printf() and assert() by using the $display() and $assert() tasks in Verilog without linking to DPI. We integrate all non-synthesizable functions at the system level, so the generated core remains synthesizable even when non-synthesizable IPs are integrated for functional verification.

Although fixed-latency IPs allow multiple instantiations, many large, complex variable-latency IPs require only a single instantiation. Variable-latency IPs must contain internal control to implement the simple start/fin protocol; this may be as simple as a shift register or as complex as a state machine that also manages internal execution states. To evaluate the area overhead of supporting variable-latency IPs, we create a variable-latency alternative to the fixed-latency square-root IP in the nn benchmark. Comparing the two designs, the area overhead is 73 ALUTs and 43 registers, less than 2% of the application, demonstrating that the integration of variable-latency IPs is efficient.
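For illustration, the sketch below combines the two ideas of this subsection: a non-synthesizable, variable-latency "IP" that returns the C library's rand() sequence through DPI, so simulation and the software model consume identical values. The DPI import name c_rand is an assumption; a one-line C shim around rand() is assumed on the C side.

    // Simulation-only sketch: a variable-latency "IP" backed by C rand().
    module rand_dpi_ip (
      input  wire        clk,
      input  wire        rstN,
      input  wire        start,
      output reg         fin,
      output reg  [31:0] out0
    );
      // Assumed C shim: int c_rand(void) { return rand(); }
      import "DPI-C" function int c_rand();

      int r;
      always @(posedge clk or negedge rstN) begin
        if (!rstN) begin
          fin  <= 1'b0;
          out0 <= '0;
        end else begin
          fin <= 1'b0;
          if (start) begin
            r    = c_rand();                 // same sequence as the C model
            $display("rand_dpi_ip: %0d", r); // printf()-style trace via $display
            out0 <= r;
            fin  <= 1'b1;                    // result is valid the next cycle
          end
        end
      end
    endmodule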


VI. CONCLUSIONS

We demonstrated a generalized IP integration framework embedded in an HLS flow. Using two simple but effective interfaces for fixed- and variable-latency IPs, together with support for user-defined mappings between instructions/functions and instantiated IP blocks, we support a wide variety of IP blocks, both as internal instantiations and as system-level integrated IP blocks.

VII. ACKNOWLEDGMENT

This study is supported in part by the research grant for the Human-Centered Cyber-physical Systems Programme at the Advanced Digital Sciences Center from Singapore's Agency for Science, Technology and Research (A*STAR).

REFERENCES

[1] R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet, G. Lemieux, P. Pande, C. Grecu, and A. Ivanov, "System-on-chip: Reuse and integration," Proceedings of the IEEE, vol. 94, no. 6, pp. 1050–1069, 2006.
[2] N. Bombieri, H.-Y. Liu, F. Fummi, and L. Carloni, "A method to abstract RTL IP blocks into C++ code and enable high-level synthesis," in DAC, May 2013, pp. 1–9.
[3] Vivado HLS, Xilinx Inc., www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.
[4] Synphony High-Level Synthesis Solution, Synopsys, Inc., 2012, http://www.synopsys.com/.
[5] C-to-Silicon Compiler, Cadence, Inc., 2012, http://www.cadence.com/.
[6] Calypto, "Catapult C synthesis." [Online]. Available: http://www.calypto.com/catapult_c_synthesis.php
[7] H. Nikolov, T. Stefanov, and E. F. Deprettere, "Automated integration of dedicated hardwired IP cores in heterogeneous MPSoCs designed with ESPAM," EURASIP J. Embedded Syst., vol. 2008, 2008.
[8] P. Coussy, A. Baganne, and E. Martin, "A design methodology for IP integration," in ISCAS, vol. 4, 2002, pp. 711–714.
[9] D. Verkest, K. Van Rompaey, I. Bolsens, and H. De Man, "CoWare—a design environment for heterogeneous hardware/software systems," in Readings in Hardware/Software Co-design, 2002, pp. 412–426.
[10] F. Schirrmeister and A. Sangiovanni-Vincentelli, "Virtual component codesign - applying function architecture co-design to automotive applications," in IVEC, 2001, pp. 221–226.
[11] SOPC Builder, Altera Inc. [Online]. Available: http://www.altera.com/literature/ug/ug_sopc_builder.pdf
[12] Vivado IP Integrator, Xilinx Inc. [Online]. Available: http://www.xilinx.com/products/design-tools/vivado/integration.html
[13] H. Zheng, S. Gurumani, L. Yang, D. Chen, and K. Rupnow, "High-level synthesis with behavioral level multi-cycle path analysis," in FPL, 2013.
[14] J. Cong and Z. Zhang, "An efficient and versatile scheduling algorithm based on SDC formulation," in DAC, 2006, pp. 433–438.
[15] H. Zheng, Q. Liu, J. Li, D. Chen, and Z. Wang, "A gradual scheduling framework for problem size reduction and cross basic block parallelism exploitation in high-level synthesis," in ASP-DAC, Jan. 2013.
[16] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in IISWC, 2009, pp. 44–54.
