
HIGH-LEVEL SYNTHESIS WITH BEHAVIORAL LEVEL MULTI-CYCLE PATH ANALYSIS

Hongbin Zheng (1), Swathi T. Gurumani (1), Liwei Yang (2), Deming Chen (3), Kyle Rupnow (1, 2)

(1) Advanced Digital Sciences Center
(2) Nanyang Technological University
(3) University of Illinois at Urbana-Champaign

ABSTRACT

High-level synthesis (HLS) tools generate register transfer level (RTL) hardware descriptions through a process of resource allocation, scheduling and binding. Intuitively, RTL quality influences the logic synthesis quality. Specifically, the achievable clock rate, area, and latency in clock cycles are determined by the RTL description. However, not all paths should receive equal logic synthesis effort: multi-cycle paths represent an opportunity to spend logic synthesis effort elsewhere to achieve better design quality. In this paper, we perform multi-cycle optimisation on chained functional operations. We couple HLS and logic synthesis synergistically so that multi-cycle paths can be identified and optimised coherently across both behavioral and logic levels. In addition, we perform multi-cycle path analysis at the behavioral level efficiently. We prove that our technique examines all reachable circuit states and finds multi-cycle paths while accounting for control flow and guarding conditions, which improves the flexibility and power of the technique. Compared to LegUp, we achieve an average 55% execution time improvement, 29% area improvement, and 68% time-area product improvement when targeting FPGA architectures.

1. INTRODUCTION

The increase in size and complexity of FPGA designs has made design time (and cost) at register transfer level (RTL) increasingly challenging. In contrast, high-level synthesis (HLS) tools offer improved design time with high-level languages, reduced susceptibility to error, improved test and simulation abilities, and easier design space exploration; there are numerous academic [1–10] and commercial [11–19] examples. Generally, HLS tools follow three main steps: resource allocation, scheduling, and binding. Constraints such as latency, throughput, area and power consumption guide decisions in these three steps. HLS is followed by logic synthesis with tools from companies such as Xilinx, Altera, Synopsys or Cadence. Intuitively, the quality of the RTL-to-implementation mapping depends on RTL quality, and thus on HLS quality. Logic synthesis maps RTL elements to underlying circuit structures, with the achievable clock period determined by the critical path delay among all register-to-register paths (and I/O paths). The delay of register-to-register paths is determined in part by the RTL description. However, not all paths should affect the clock period: some paths are multi-cycle paths that are allowed multiple clock cycles to propagate from inputs to outputs.

Fig. 1: An example of a multi-cycle path in the netlist

Figure 1 shows an example of multi-cycle paths; data register R2 is enabled for (S0 | (S2 & C)), and R1 is enabled for S3. However, S2 transits to S3 for (S2 & ¬C); therefore, there is no case where R2 is enabled and control also transits from S2 to S3. Hence, there are 3 cycles available for the new value of R2 to propagate through the combinational paths to R1. Let us assume the critical path of the circuit is in the combinational paths between R2 and R1, and has a delay of d. Identifying the paths between R2 and R1 as multi-cycle (3 cycles) paths allows the circuit to run at a clock period of d/3, which is 3× faster than the case that simply assumes all paths are single-cycle paths.

In this paper, we discuss multi-cycle paths that are prevalent in most designs. Some designs that are deeply pipelined and achieve an initiation interval (II) of one do not have any multi-cycle paths. However, an II of one is not commonly achievable due to dependence recurrences (RecII) and resource constraints (ResII) [20]. If the design cannot inherently achieve an II of one, inserting pipeline registers to break combinational paths leads to wasted registers and forces the synthesis tool to optimise smaller combinational blocks. It should be noted that the technique presented in this paper is also compatible with pipelining techniques.

It is important to identify multi-cycle paths to focus logic synthesis effort on challenging paths, and to expose optimisation opportunities for larger combinational blocks. This tight inter-relation between HLS and logic synthesis demonstrates the importance of generating efficient RTL through accurate estimation of circuit delays and identification of multi-cycle paths to correctly focus logic synthesis effort. Prior techniques for multi-cycle path analysis either used exhaustive enumeration with exponential time complexity, or heuristic approaches that do not identify all paths. In this paper, we develop a novel methodology that allows analysis of all reachable control and data states without the exponential time complexity of the prior exhaustive enumeration approaches. With our identification of multi-cycle paths, we expose larger combinational blocks to logic synthesis, and thus improve logic synthesis optimisation opportunity for improved execution time, area, and area-time product. Our technique is implemented in VAST, an LLVM [21] based HLS framework, and is compatible with LegUp [1]. This work contributes to HLS and multi-cycle path analysis by:

1. Generation of a state transition graph from scheduler output and behavioral level control information to examine all reachable circuit states through representation of equivalent behaving groups for general multi-cycle path analysis.

2. Multi-cycle optimisation on chained functional operations to synergistically work with logic synthesis to achieve better design quality.

3. A proof that our technique examines all reachable circuit states through symbolic representation of behaviorally equivalent groups of data states.

4. A proof that our multi-cycle path analysis always produces the minimum shortest path distance between states, considering reachability and compatibility of register assignments and state transitions.

The rest of this paper is organised as follows: Section 2 discusses related work in multi-cycle path optimisation. Section 3 describes the sequence of tools and analysis steps in our platform, and Section 4 presents our experimental setup and discusses results.

2. MULTI-CYCLE PATHS IN HLS

Multi-cycle path analysis is well-recognised as an important means of detecting circuit features and optimising circuit implementations [3, 22–29], but the number of possible circuit states is exponential in the number of registers (control + data), so exhaustive enumeration quickly becomes intractable [22, 23]. To avoid this state-space explosion, several heuristic approaches have been explored [24, 25], but without exhaustive enumeration, these approaches are proven to be inaccurate [26]. In contrast, our technique explicitly examines all reachable circuit states by generating a state transition graph, annotating it with SSA-form [30] behavioral level information, and using a symbolic representation of state-transition equivalent groups of data states.

High-level synthesis tools commonly pass behavioral level timing information to logic synthesis [3, 27–29], with multi-cycle information typically generated after logic synthesis as a refinement step. Cong calculates multi-cycle constraints based on placement information [27], and Choi enables multi-cycle interconnect using floorplan information [28]. Bluespec allows multi-cycle atomics based on pipelined atomic operations, but does not work on multi-cycle propagation paths [29]. In this work, instead of multi-cycle paths in interconnect, we focus on multi-cycle optimisation on chained functional operations. We observe that passing multi-cycle paths from HLS to logic synthesis improves HLS quality and logic synthesis optimisation opportunity, thus achieving significant area and latency reduction. In addition, our multi-cycle analysis with SSA-form control-data flow graph (CDFG) annotations considers all reachable circuit states efficiently through symbolic representation of equivalent data state groups, as we prove in Section 3.2.

3. HIGH-LEVEL SYNTHESIS WITH BEHAVIORAL LEVEL MULTI-CYCLE PATH ANALYSIS

Multi-cycle optimisation remains important for designs in which pipelining with an II of one is not possible. In these cases, maintaining multi-cycle paths consumes fewer register resources (lower area), incurs no performance penalty (initiation interval and cycle time are not affected), and increases optimisation opportunity (larger combinational blocks). We implement our multi-cycle optimisation in VAST, an HLS framework based on LLVM [21]. Figure 2 shows the sequence of operations in VAST, which is typical of HLS tools: it takes a two-level CDFG as input and generates an RTL description. A two-level CDFG [31] is a directed graph of basic blocks (BB), where each BB is a directed acyclic graph of operations.

Fig. 2: VAST HLS Flow (CDFG, Timing Estimation, SDC Scheduling, Binding, MCP Analysis, Code Generation; outputs: RTL and constraints in a Tcl file)

Firstly, we apply bit-level optimisations [32] to perform dead computation elimination and strength reduction at bit-level on the CDFG. We then estimate the required number of cycles for sequences (chains) of operations by simply accumulating the critical delay of each operation in the chain. Then, we perform scheduling to automatically generate intervening control states for multi-cycle chains of operations. After that, we apply our behavioral level multi-cycle path analysis to calculate the available cycles for the chains, where the available cycles may be larger than the required cycles. Finally, we generate the RTL description and multi-cycle constraints and send them to logic synthesis. Our current flow is implemented as a feed-forward synthesis flow; feedback and optimisation based on placement and floorplanning information [27, 28] may provide additional benefit, but is left for future work. We will now present our framework in detail.
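The timing-estimation step described above can be illustrated with a minimal sketch. The text only states that the critical delays of the operations in a chain are accumulated; rounding the accumulated delay up to whole clock periods, the function name `required_cycles`, and the delay values are assumptions made for illustration, not details of VAST.

```python
import math


def required_cycles(op_delays_ns, clock_period_ns):
    """Estimate the cycles a chain needs.

    Accumulates the per-operation critical delays, then rounds up to
    whole clock periods (assumed rule), with a minimum of one cycle.
    """
    total_delay_ns = sum(op_delays_ns)
    return max(1, math.ceil(total_delay_ns / clock_period_ns))


# Hypothetical chain of three operations (delays in ns), 5 ns clock target.
chain_delays_ns = [2.1, 3.4, 4.0]
cycles_needed = required_cycles(chain_delays_ns, 5.0)
print(cycles_needed)
```

A chain whose estimate exceeds one cycle is the kind of chain for which the scheduler later inserts intervening control states.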

3.1. Scheduling

Using the timing estimation and clock period constraints, we schedule the source and sink of the operation chains into clock cycles with our prior implementation [33] of the system of difference constraints (SDC) scheduling algorithm [31]. The scheduler produces appropriate intervening control states for the chains if they have multi-cycle delays.
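To make the idea of a system of difference constraints concrete, the sketch below solves constraints of the form s[v] - s[u] >= c by repeated relaxation (a Bellman-Ford style longest-path computation). This is a simplified stand-in and not the SDC scheduler of [31] or the implementation in [33], which formulate scheduling as an optimisation problem; the function name and the example constraints are hypothetical.

```python
def solve_difference_constraints(num_vars, constraints):
    """Find the earliest non-negative schedule for difference constraints.

    constraints: list of (u, v, c) meaning s[v] - s[u] >= c.
    Returns a list of start cycles, or None if the system is infeasible.
    """
    start = [0] * num_vars
    for _ in range(num_vars):
        changed = False
        for u, v, c in constraints:
            if start[u] + c > start[v]:
                start[v] = start[u] + c
                changed = True
        if not changed:
            return start
    return None  # still changing after num_vars passes: contradictory cycle


# Hypothetical chain: op0 -> op1 needs 1 cycle, op1 -> op2 is a chain
# estimated at 2 cycles, op2 -> op3 needs 1 cycle.
schedule = solve_difference_constraints(4, [(0, 1, 1), (1, 2, 2), (2, 3, 1)])
print(schedule)
```

In this toy schedule the source and sink of the two-cycle chain land two control states apart, which corresponds to the intervening control state the scheduler is said to produce.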

3.2. Multi-cycle Path Analysis

The scheduler produces a cycle-accurate behavioral model of the design, which provides the available cycles for each chain. Because we analyse every chain of operations in the design for the number of available cycles, we automatically identify multi-cycle paths as the set of paths for which the number of available cycles is greater than one. Then, for each multi-cycle path, we generate appropriate constraints for logic synthesis so that the large combinational block can be optimised under appropriate timing constraints without requiring additional pipeline registers.

To perform the multi-cycle analysis, we perform three high-level tasks: (1) construct a datapath representation of combinational nodes in the circuit to identify combinational paths, (2) construct a state transition graph (STG) and annotate it with control flow and use-define chain [30] information, and (3) analyse the STG to find the multi-cycle paths and determine appropriate constraints. It is important that our technique for construction of the STG represents all reachable circuit states yet does not require exponential growth in the number of nodes or the number of edges between nodes. Therefore, we first present the logical connection between the control STG in HLS and the circuit STG employed in prior work on multi-cycle path analysis. Then, we prove that our STG represents all reachable circuit states. Finally, we present the analysis technique, which is based on reaching-definition analysis and the all-pairs shortest path algorithm. Importantly, we prove that our multi-cycle path analysis technique considers both the control flow and guarding conditions (register assignment enables), and thus considers all reachable states.

Fig. 3: (a) Possible States, (b) STG consisting of Control States, (c) STG consisting of GESs

3.2.1. Circuit States and Control-States

An STG is defined as $G = \langle S_0, V_S, E_S \rangle$, where $S_0$ is the initial control state, $V_S$ is the set of reachable control states, and $E_S$ is the set of state transitions. Before constructing the STG, we formally define circuit states, control states, and reachable states as in [26]. Given a circuit with $N$ bits of register state $(r_0, ..., r_{N-1})$ including both control and data registers, we represent a circuit state $S$ as a bit-vector $\vec{b} = (b_0, ..., b_{N-1})$, where $b_i$ is the value of register $r_i$ in state $S$, and the set of possible states $P$ has $2^N$ states. Among the $N$ bits, we divide the registers into $N_C$ control registers and $N_D$ data registers ($N = N_C + N_D$), and typically $N_C \ll N_D$. The states on the control (and data) registers are also referred to as control states (and data states). Any combination of control and data state can be represented by a bit-vector $\vec{b}$.

Given a CDFG of the circuit and an initial circuit state $S_0$, we say a circuit state $S$ is a reachable state if and only if the circuit can reach state $S$ from $S_0$ through a sequence of state transitions. We denote the set of reachable states by $R$, and $R$ is a subset of all possible states ($P$) because not all combinations of control and data state can actually occur. We can thus also consider reachable control (and data) states $R_C$ and $R_D$, where $R$ is a subset of $R_C \times R_D$.

An example of the circuit states for the circuit in Figure 1 is shown in Figure 3a. From Figure 1 we know that the circuit contains 4 control registers, S0, S1, S2, S3, and 2 data registers, R1, R2. According to the initial state of the control registers, (1000), we know $R_C$ contains 4 elements: (1000), (0100), (0010), (0001). The state of R1 depends on the primary inputs, and the state of R2 depends on the combinational logic between R1 and R2. Thus, we can make no assumption about the state of R1 or R2, and must include all possible data states in $P_D$. Then, the set of all reachable states of the circuit (S0 to S3, R1 and R2) is a subset of $R_C \times P_D$. Meanwhile, T1 and T2 are the sets of data state transitions on R2, and T3 is the set of state transitions from circuit state set $\{0010\} \times P_D$ to circuit state set $\{0001\} \times P_D$.
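The sizes of these sets can be worked through numerically. The sketch below assumes a one-hot control encoding as in the Figure 1 example (four control registers, four reachable control states); the 8-bit widths chosen for R1 and R2 and the function name are assumptions for illustration only.

```python
def state_space_sizes(num_control_regs, data_reg_widths):
    """Return (|P|, |R_C|, |R_C x P_D|) for a one-hot controlled circuit."""
    n_d = sum(data_reg_widths)                   # N_D: total data bits
    possible = 2 ** (num_control_regs + n_d)     # |P| = 2^N, N = N_C + N_D
    reachable_control = num_control_regs         # one-hot: one state per register
    bound = reachable_control * 2 ** n_d         # |R_C x P_D| bounds |R|
    return possible, reachable_control, bound


# Figure 1 style circuit: S0..S3 plus two data registers, assumed 8 bits each.
possible, reachable_control, bound = state_space_sizes(4, [8, 8])
print(possible, reachable_control, bound)
```

Even in this small case the data bits dominate the count, which is why enumerating circuit states directly scales poorly while the number of control states stays small.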

The SDC scheduler produces an STG as its output [31], which we annotate with control-flow information from the CDFG to represent control decision information and precisely capture the reachable control states $R_C$. Leveraging the precise control state information, we divide the circuit states into control-equivalent state sets (CES), and capture state transitions on data registers by register assignments.

Definition 1. Given a control state $S_C = (c_0, ..., c_{N_C-1})$, the control-equivalent state set $E(S_C)$ is the set of all reachable circuit states such that the control state is equal to $S_C$.

Definition 2. For an $n$-bit data register $R$, a register assignment $A$ that assigns $R$ is a set of circuit state transitions that may change one or more bits in $R$ (e.g., T1 in Figure 3a). $A$ is described by a triple $(S_C(A), R(A), FI(A))$, where $S_C(A)$ is a control state and all source circuit states of $A$ are in $E(S_C(A))$; $R(A)$ is the target register of $A$; and $FI(A)$ is the input set of $A$, which contains two elements: $G(A)$, the guarding condition (enable) from a combinational node in $A$, and $I(A)$, the data input from primary inputs or a combinational dataflow node.

The example control STG with register assignments of the circuit in Figure 1 is shown in Figure 3b. There are four control states $CS_0, CS_1, CS_2, CS_3$, corresponding to the four states in $R_C$, with control-equivalent states $E(CS_2)$ and $E(CS_3)$. There are also three register assignments $A_0, A_1, A_2$ that represent the assignment to R2 when S0 is 1 (corresponding to T1 in Figure 3a), the assignment to R2 when S2 and C are both 1 (corresponding to T2 in Figure 3a), and the assignment to R1 when S3 is 1, respectively. Specifically, $FI(A_2)$ contains $G(A_2)$ and $I(A_2)$, where $G(A_2)$ and $I(A_2)$ correspond to G and I in Figure 1, respectively. We also annotate register assignments with control-flow information to provide information about the guarding condition.
During control transitions, the guarding condition indicates whether a register assignment is enabled or disabled. When the design is decomposed into basic blocks, the guarding condition is also the control-path activation condition [31]. Thus, register assignments with the same parent BB always share the same guarding condition, and we can determine when a register assignment occurs by tracking control flow in our STG representation without tracking the exact values of data registers.

Definition 3. Given a basic block $BB$, which implies the guarding condition $G$, we divide $E(S_C)$ into guarding condition equivalent state sets (GES), $GE(S_C, BB, G)$, where every circuit state $S$ within $GE(S_C, BB, G)$ implies that $G$ resolves to TRUE.

The GESs of the circuit in Figure 1 are shown in Figure 3c. In the example, we divide $E(CS_2)$ into three GESs: $GE(CS_2, BB1, 1)$, $GE(CS_2, BB1', C)$ and $GE(CS_2, BB2, \overline{C})$, where $BB1'$ is the next execution iteration of $BB1$ in the loop, and $C$ and $\overline{C}$ are the branching conditions of the control-flow edges $(BB1, BB1')$ and $(BB1, BB2)$, respectively. Also, register assignment $A_1$ is enabled only in the transitions starting from the states in $GE(CS_2, BB1', C)$, and control state transition $(CS_2, CS_3)$ can only start from $GE(CS_2, BB2, \overline{C})$. Given that $C$ and $\overline{C}$ are the branching conditions for control flow diverging from the same source, we know they are mutually exclusive and $GE(CS_2, BB1', C)$ is disjoint from $GE(CS_2, BB2, \overline{C})$. Thus, R2 remains stable during transition $(CS_2, CS_3)$. Identifying this fact enables us to conclude that the combinational paths between R1 and R2 in Figure 1 are multi-cycle paths.

It is important to note that we do not need to evaluate the value of guarding conditions in order to analyse the control flow information and the compatibility between guarding conditions. In addition, the number of GESs within a CES is bounded by the number of guarding conditions, which is bounded by the number of basic blocks. Thus, dividing circuit states into GESs remains feasible and allows multi-cycle path analysis to consider guarding conditions without evaluating all data states. Following these definitions, we build the guarding condition aware STG (GA-STG) for our multi-cycle analysis with the following steps:

1. Build the data-flow dependencies between register assignments based on SSA-form use-def chains [30].

2. For each control state $S_C$, construct the BB set $BB(S_C)$ that includes the parent BB of the SSA definitions which correspond to the register assignments in $S_C$.

3. For each control state $S_C$, split it into GESs which are identified by $(S_C, bb)$, where $bb \in BB(S_C)$.

4. Add edges between GESs to represent either state transitions between control states or compatibility between guarding conditions present in the same control state.

5. Perform register allocation and binding to assign logical variables to physical registers. This enables the GA-STG to capture data state transitions.
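Steps 2 and 3 of the construction above can be sketched as follows. The input format (a list of assignments tagged with their control state and parent basic block), the function name `split_into_ges`, and the sample data are assumptions for illustration; the sketch only covers control states that carry at least one register assignment.

```python
from collections import defaultdict


def split_into_ges(assignments):
    """Split control states into GES nodes identified by (S_C, bb).

    assignments: iterable of (name, control_state, parent_bb) tuples.
    """
    bb_sets = defaultdict(set)  # step 2: BB(S_C) for each control state
    for _name, control_state, parent_bb in assignments:
        bb_sets[control_state].add(parent_bb)
    # Step 3: one GES node per basic block present in each control state.
    return sorted(
        (control_state, bb)
        for control_state, bbs in bb_sets.items()
        for bb in bbs
    )


# Hypothetical assignments loosely following the Figure 1 discussion.
ges_nodes = split_into_ges([
    ("A0", "CS0", "BB0"),
    ("A1", "CS2", "BB1'"),
    ("A2", "CS3", "BB2"),
    ("A3", "CS2", "BB2"),
])
print(ges_nodes)
```

Because each control state contributes at most one node per basic block, the node count stays within the product of control states and basic blocks, in line with the bound stated in the text.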
For step 4, it is important to note that the SDC scheduler ensures that an edge never changes both the control state ($S_C$) and the BB ($bb$) at the same time. If there is an edge where $bb$ remains the same but $S_C$ changes, the edge is annotated as a weight-1 edge because it represents a control state transition that must take one cycle. If $bb$ changes and $S_C$ remains the same, the edge represents the intersection between GESs rather than a control state transition, and thus is a weight-0 edge, which ensures correct traversal of the GA-STG. Based on this edge weighting scheme and the properties from [31], given a register assignment $A$ assigned to state $S = GE(S_C, BB, G)$, if the shortest path between $S$ and $S'$ is $\geq 1$, then $R(A)$ is modified by $A$ during the transition sequence. Additionally, if $S$ is the only state which assigns to $R(A)$ in the sequence between state $S$ and another state $S'$, then $R(A)$ will remain stable after $A$.

Theorem 1. Given a state transition graph $G = \langle S_0, V_S, E_S \rangle$ with annotations for register assignments and guarding conditions, $G$ represents all reachable circuit states from the initial state $S_0$ as well as the transitions between these states.

Proof. By construction, the state transition graph explicitly represents each control state as a vertex in $V_S$, and the edges represent all reachable state transitions. By Definition 1, all reachable data states are part of the CES, which are grouped into multiple GESs by Definition 3 based on guarding conditions, which are the control-path activation conditions of basic blocks in the CDFG. Transitions that may change any bit of a register are register assignments that are annotated to control states in the graph. Thus, all reachable circuit states and the transitions are represented.

This demonstrates that the GA-STG represents all reachable circuit states given an initial state and annotated control flow information. We previously demonstrated that this construction technique is bounded by the number of control states and BBs rather than the number of possible circuit states. Thus, we are able to efficiently examine all reachable data states symbolically through traversals of the GA-STG. In addition, the GA-STG also represents the datapath between register assignments, similar to an FSMD [34], to support multi-cycle path analysis. We now analyse the GA-STG for multi-cycle paths.

3.2.2. Use-define Chain Identification

In this step, we identify all possible use-define assignment pairs using the information from the SSA-form CDFG. The properties for finding these use-define pairs follow directly from the SSA format [30].
For each register assignment $A$ in the GA-STG, and each combinational node $i$ in its input set $FI(A)$, we identify the combinational cone rooted on $i$. We then identify the "define" assignments for each source register $R$ of the combinational cone as $D_i(R, use)$. It is the set of all register assignments that can propagate through $i$ to the "use" register assignment. The SSA reachability property ensures that a "define" assignment will stay stable until the transition propagates to "use" through $i$. The SSA dominance property also provides that all paths from the initial state to the state in which a "use" assignment occurs will include a "define" assignment that produces a transition to propagate to the use.

3.2.3. Available Cycles Calculation

We calculate the minimal number of cycles available for an assignment $def$ to propagate across a combinational path to another assignment $use$, for all use-define register assignment pairs. We denote the number of cycles as $k(def, use)$, which is the number of cycles of the transition sequence from $S_C(def)$ to $S_C(use)$ with the minimal number of transitions, i.e. the shortest path distance from $S_C(def)$ to $S_C(use)$ in the GA-STG (denoted by $\hat{d}(S_C(def), S_C(use))$). Formally:

$$k(def, use) = \hat{d}(S_C(def), S_C(use)) \tag{1}$$

Now, we have $k(def, use)$ computed for every use-define pair. However, there may be multiple register assignments that assign registers $R(use)$ and $R(def)$ with different shortest paths. Thus, we must compute the final number of cycles for a combinational path as the minimum number of cycles over all use-define pairs and register assignments. Formally, given a combinational path with $R_{src}$ as source and $R_{snk}$ as sink, let us assume that combinational node $i$ in the path is in the input set of the register assignments that assign $R_{snk}$:

$$\forall A_i \ \text{s.t.}\ \big(R(A_i) \equiv R_{snk} \wedge i \in FI(A_i)\big),\ \forall A_{def} \in D_i(R_{src}, A_i): \quad K = \min\big(k(A_{def}, A_i)\big) \tag{2}$$

where $A_i$ denotes a register assignment that modifies $R_{snk}$ and whose input set contains $i$, $A_{def}$ denotes a register assignment that modifies $R_{src}$ and is used by $A_i$ in a use-define chain, and $K$ is the minimal shortest-path distance over all pairs of $A_i$ and $A_{def}$. Thus, we annotate the tuple $(R_{src}, R_{snk}, K)$ on each node in the combinational path. However, a node may be on the path between multiple register assignments, and therefore each node may initially have multiple tuple constraints; we resolve this in Section 3.3.

3.2.4. Correctness and Time Complexity

For each combinational path, we generate multi-cycle constraints based on the value of $K$ calculated by Equation 2. To ensure correctness of the multi-cycle constraints, we must prove Theorem 2. Otherwise, the static timing analysis may produce an incorrect timing report based on incorrect multi-cycle constraints.

Theorem 2. Given a combinational path $P$, and using the same terminology as Equation 2, $K$ is the minimal number of stable cycles for $R_{src}$ on $P$.

Proof. Let us assume that $R_{src}$ is stable for fewer than $K$ cycles before a register assignment latches the value of $i$ into $R_{snk}$. This implies that there is a register assignment $A_M$ that produces a transition on $R_{src}$ and propagates to $R_{snk}$ through $i$ with a sequence of state transitions in $M$ cycles, where $M < K$. However, this contradicts Equation 2, in which $K$ is the minimal distance between the use-define pairs of register assignments that are identified by $P$. Thus, $A_M$ does not exist, and $R_{src}$ must be stable for at least $K$ cycles on $P$.
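The available-cycles computation of Equations 1 and 2 can be sketched on a small GA-STG. The graph below is a hypothetical topology loosely modelled on the Figure 1 example and is not taken from the paper's figures: node names, the placement of basic-block boundaries, and the loop-back edge are assumptions. Edges that change only the control state have weight 1, and edges that change only the basic block have weight 0. The shortest paths are computed with the Floyd-Warshall algorithm as one possible all-pairs method.

```python
import itertools

INF = float("inf")


def all_pairs_shortest_paths(nodes, edges):
    """Floyd-Warshall over a weighted directed graph given as {(u, v): w}."""
    dist = {(u, v): (0 if u == v else INF) for u in nodes for v in nodes}
    for (u, v), weight in edges.items():
        dist[u, v] = min(dist[u, v], weight)
    # itertools.product varies its first element slowest, so k is outermost.
    for k, i, j in itertools.product(nodes, repeat=3):
        if dist[i, k] + dist[k, j] < dist[i, j]:
            dist[i, j] = dist[i, k] + dist[k, j]
    return dist


def available_cycles(dist, def_nodes, use_nodes):
    """K: minimum shortest-path distance over all define/use node pairs."""
    return min(dist[d, u] for d in def_nodes for u in use_nodes)


# Hypothetical GA-STG; BB1' denotes BB1 in its next loop iteration.
nodes = ["CS0/BB0", "CS1/BB0", "CS1/BB1", "CS2/BB1",
         "CS2/BB1'", "CS2/BB2", "CS3/BB2"]
edges = {
    ("CS0/BB0", "CS1/BB0"): 1,   # control state changes: one cycle
    ("CS1/BB0", "CS1/BB1"): 0,   # only the basic block changes
    ("CS1/BB1", "CS2/BB1"): 1,
    ("CS2/BB1", "CS2/BB1'"): 0,  # branch taken when C holds
    ("CS2/BB1", "CS2/BB2"): 0,   # branch taken when C does not hold
    ("CS2/BB1'", "CS1/BB1"): 1,  # loop back into the next iteration
    ("CS2/BB2", "CS3/BB2"): 1,
}
dist = all_pairs_shortest_paths(nodes, edges)

# R2 is defined at CS0 (A0) and at CS2 under C (A1); R1 uses it at CS3 (A2).
K = available_cycles(dist, ["CS0/BB0", "CS2/BB1'"], ["CS3/BB2"])
print(K)
```

In this toy graph both definitions of R2 are three cycles away from the use, matching the three available cycles described for Figure 1. There is no weight-0 path from the node guarded by C to the node guarded by its complement, which reflects the disjointness of the two GESs.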

Finally, we discuss the complexity of our approach; for a circuit with $C$ control states and $2^{N_D}$ possible data states, we show that our multi-cycle path analysis is able to symbolically examine the entire $C \times 2^{N_D}$ reachable state space without exponential complexity. In Section 3.2.1, we constructed the GA-STG and demonstrated that the number of nodes in the GA-STG has complexity on the order of the number of control states and basic blocks. In the worst case, the GA-STG would need $C$ control states with $N_{BB}$ GES states within each control state ($C \times N_{BB}$ GA-STG states). Thus, in the worst case there are $(C \times N_{BB})^2$ edges between all GA-STG states, which does not require $O(2^{N_D})$ states or edges. During use-define chain identification, we use a depth-first search with time complexity $O(N_O)$, where $N_O$ is the number of combinational nodes in the circuit dataflow representation. Then, we use a modified reaching definition algorithm with worst-case complexity of $O(C \times N_{BB} \times N_A)$, with $N_A$ as the number of register assignments. To compute the available cycles for each path, we first use the all-pairs shortest path algorithm on the GA-STG, whose worst-case complexity is cubic in the number of nodes, and therefore $O((C \times N_{BB})^3)$. Then, all $k(def, use)$ computations query the results of the all-pairs shortest path, with an upper bound of $N_A^2$ queries. Finally, to annotate the $K$ constraints on all nodes, the complexity is $O(N_A^2 \times N_O)$.

3.3. Constraints and Code Generation

During the annotation process, there may be multiple constraints on a combinational node. To ensure correct generation of timing constraints, we only generate the constraint with the smallest cycle count [35]. These constraints are produced as a Tcl file for input into the logic synthesis tool, together with the corresponding RTL. The generated constraints guide logic synthesis effort to attain timing closure, and improve the quality of the final logic synthesis output.

4. EXPERIMENTS

We implement our technique in VAST, a LegUp compatible HLS framework, as shown in Figure 4. To ensure fair comparison of non-HLS compilation optimisations, we generate the optimised bytecode using the LegUp front-end and send that identical bytecode to both VAST and the LegUp 3.0 release. We then synthesise the RTL using Quartus II 12.0. We use the CHStone [36] benchmarks plus the dataflow benchmarks from [37].

Fig. 4: Experimental Flow (Benchmarks (C), LegUp Frontend, LLVM IR, HLS by LegUp and HLS by VAST, Logic Synthesis and P&R, Experimental Results)

We evaluate each design in execution time and area. For execution time, we compute the product of the clock cycles reported by ModelSim and the clock period reported by Quartus' post-place-and-route timing report. For area, we measure logic elements and 9×9 multipliers; to create a single, combined area metric, we implement a 9×9 multiplier in logic elements and then use that LE count as a scaling factor between LEs and multipliers. Thus the area metric is $Area = \#LE + 115 \times \#Mult$. We also directly compare registers, the primary source of area savings between our technique and LegUp. Finally, we also compare time-area product.

For both VAST and LegUp, we vary the clock constraint from 5ns to 20ns targeting Altera Cyclone II (EP2C70896C6) and Cyclone IV (EP4CE75F29C6). We present the best VAST result normalised to the best LegUp result in terms of time-area product in Figure 5. Figure 5a shows that VAST reduces the number of clock cycles by 32% and 36% in Cyclone II and IV, respectively, because multi-cycle chaining improves operation grouping and reduces quantization error. This tightens the scheduling constraints, allowing the scheduler to generate a tight schedule, and reduces the design's cycle latency. In addition, the bit-level optimisations implemented in VAST prevent the delay estimator from including the delay of computations that will be eliminated by Quartus, and thereby tighten the scheduling constraints. Finally, our SDC scheduling implementation, which efficiently schedules the multi-cycle chains, also contributes to the reduction in cycles.

The multi-cycle analysis also enables Quartus to generate a high quality hardware implementation because the generated constraints correctly relax timing constraints on multi-cycle paths and focus optimisations on challenging paths. As a result, VAST achieves 36% and 26% improvement in clock period in Cyclone II and IV, respectively (Figure 5b). This improvement in clock period coupled with the reduction in the number of clock cycles enables VAST to improve the hardware execution time by 57% and 53% in Cyclone II and IV, respectively (Figure 5c). VAST also reduces register usage by 79% in both Cyclone II and IV (Figure 5d), because it need not break long combinational paths into single-cycle segments.
Again, the combination of multi-cycle constraints and larger combinational blocks provides Quartus with better optimisation opportunities and allows it to generate a better hardware implementation. As a result, logic element usage is reduced by 30% on both Cyclone II and IV (Figure 5e). However, area may be higher than with LegUp due to reduced sharing of functional units on multi-cycle paths, as observed for ADPCM and GSM in Figure 5e and Figure 5f. VAST uses more multipliers than LegUp in ADPCM and GSM because it cannot share multipliers on multi-cycle paths. On the other hand, VAST uses no multipliers in U5ML, while LegUp uses 78. This is because all multiplications in U5ML are multiplications by a constant, and embedding them completely into a combinational block without sharing enables Quartus to specialize them into LUTs. Nevertheless, the benefit of multi-cycle optimisation outweighs the cost of lower sharing; VAST reduces total design area by 29% on both Cyclone II and IV (Figure 5g).

Combined, VAST achieves 69% and 67% better area-time product on Cyclone II and IV, respectively, due to the combination of reduced execution time and reduced area. This demonstrates the importance of multi-cycle path analysis to overall synthesis quality, especially when the analysis examines all reachable states exactly rather than some heuristic subset of paths. In terms of HLS execution time, VAST has a geometric mean runtime of 507ms compared to LegUp's 827ms. It should be noted that LegUp generates a number of report files, which could be the reason for its comparatively higher runtime.

Fig. 5: VAST Best Design Normalised to LegUp Best Design (Smaller is better). Each panel plots VAST/LegUp per benchmark (CHStone, Dataflows) for Cyclone II and Cyclone IV: (a) Number of Cycles; (b) Clock Period; (c) Hardware Execution Time; (d) Register Use; (e) Logic Elements Use; (f) Multiplier Use; (g) Combined Area Metric; (h) Area-Time Product. All panels are normalised to LegUp.
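The headline percentages in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that execution time is cycles times clock period, that area-time product is execution time times area, and that the reported averages compose multiplicatively. The last assumption holds exactly for geometric means, but the text does not say which mean is used for these figures. The function names are illustrative only.

```python
def area_metric(logic_elements: int, multipliers: int) -> int:
    """Combined area metric from the text: Area = #LE + 115 * #Mult."""
    return logic_elements + 115 * multipliers


def combine(*reductions_pct: float) -> float:
    """Combine independent percentage reductions multiplicatively.

    A reduction of r% leaves a ratio of (1 - r/100); the ratios multiply,
    and the result is converted back to a percentage reduction.
    """
    ratio = 1.0
    for r in reductions_pct:
        ratio *= 1.0 - r / 100.0
    return (1.0 - ratio) * 100.0


# Percentages as reported in the text (already rounded by the authors).
REPORTED = {
    "Cyclone II": {"cycles": 32, "period": 36, "exec": 57, "area": 29, "atp": 69},
    "Cyclone IV": {"cycles": 36, "period": 26, "exec": 53, "area": 29, "atp": 67},
}

if __name__ == "__main__":
    for device, r in REPORTED.items():
        exec_est = combine(r["cycles"], r["period"])  # cycles x clock period
        atp_est = combine(r["exec"], r["area"])       # execution time x area
        print(f"{device}: execution time {exec_est:.1f}% (reported {r['exec']}%), "
              f"area-time {atp_est:.1f}% (reported {r['atp']}%)")

    # Weight of the 78 multipliers LegUp uses in U5ML under the area metric.
    print("U5ML multiplier area:", area_metric(0, 78))
```

Under these assumptions the estimates come out at about 56.5% and 52.6% for execution time and 69.5% and 66.6% for area-time product. Each is within one percentage point of the reported value, which is the level of agreement one would expect when the inputs are themselves rounded.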

5. CONCLUSION

We have presented a novel methodology for multi-cycle path analysis and proven that it performs exact analysis of all reachable circuit states without requiring exponential time to examine all possible circuit states. We implemented these techniques in VAST and demonstrated an average 55% execution time improvement, 29% area improvement, and 68% time-area product improvement compared to LegUp.

6. ACKNOWLEDGEMENT

This work was supported by A*STAR under the HSSP grant.

7. REFERENCES

[1] A. Canis, J. Choi, et al., “Legup: high-level synthesis for fpga-based processor/accelerator systems,” in FPGA, 2011, pp. 33–36.
[2] P. Coussy, C. Chavet, et al., “Gaut: A high-level synthesis tool for dsp applications,” High-Level Synthesis: from Algorithm to Digital Circuit, p. 147, 2008.
[3] D. Chen, J. Cong, et al., “xpilot: A platform-based behavioral synthesis system,” SRC TechCon, vol. 5, 2005.
[4] Z. Guo, B. Buyukkurt, and W. Najjar, “Optimized generation of datapath from c codes for fpgas,” in DATE, 2005, pp. 112–117.
[5] S. Gupta, N. Dutt, R. Gupta, and A. Nicolau, “Spark: a high-level synthesis framework for applying parallelizing compiler transformations,” in ICVD, 2003, pp. 461–466.
[6] A. Papakonstantinou, K. Gururaj, et al., “Fcuda: Enabling efficient compilation of cuda kernels onto fpgas,” in SASP, 2009, pp. 35–42.
[7] D. Greaves and S. Singh, “Kiwi: Synthesis of fpga circuits from parallel programs,” in FCCM, 2008, pp. 3–12.
[8] P. Bjesse, K. Claessen, M. Sheeran, and S. Singh, “Lava: hardware design in haskell,” in ACM SIGPLAN Notices, vol. 34, no. 1, 1998, pp. 174–184.
[9] C. Baaij, M. Kooijman, J. Kuper, W. Boeijink, and M. Gerards, “Cλash: Structural descriptions of synchronous hardware using haskell,” DSD, pp. 714–721, 2010.
[10] M. Lin, I. Lebedev, and J. Wawrzynek, “Openrcl: low-power high-performance computing with reconfigurable devices,” in Field Programmable Logic and Applications (FPL), 2010, pp. 458–463.
[11] R. Nikhil, “Bluespec system verilog: efficient, correct rtl from high level specifications,” in MEMOCODE, 2004, pp. 69–70.
[12] LabVIEW FPGA IP Builder, National Instruments, http://sine.ni.com/nips/cds/print/p/lang/en/nid/210573/.
[13] Impulse Accelerated Technologies, Inc., http://www.impulseaccelerated.com.
[14] Catapult C Synthesis, CALYPTO, 2012, http://www.calypto.com/catapult_c_synthesis.php.
[15] Cynthesizer, Forte Design Systems, 2012, http://www.forteds.com/products/cynthesizer.asp.
[16] T. Czajkowski, U. Aydonat, et al., “From opencl to high-performance hardware on fpgas,” in FPL, 2012, pp. 531–534.
[17] J. Auerbach, D. F. Bacon, P. Cheng, and R. Rabbah, “Lime: a java-compatible and synthesizable language for heterogeneous architectures,” in OOPSLA, 2010, pp. 89–108.

[18] B. Bond, K. Hammil, L. Litchev, and S. Singh, “Fpga circuit synthesis of accelerator data-parallel programs,” in Field-Programmable Custom Computing Machines (FCCM), 2010, pp. 167–170.
[19] Z. Zhang, Y. Fan, et al., “Autopilot: A platform-based esl synthesis system,” High-Level Synthesis: From Algorithm to Digital Circuit, pp. 99–112, 2008.
[20] M. Lam, “Software pipelining: an effective scheduling technique for vliw machines,” in Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation, ser. PLDI ’88. New York, NY, USA: ACM, 1988, pp. 318–328.
[21] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in CGO, 2004, pp. 75–86.
[22] P. Ashar, S. Dey, and S. Malik, “Exploiting multicycle false paths in the performance optimization of sequential logic circuits,” IEEE Trans. CAD, vol. 14, no. 9, pp. 1067–1075, 1995.
[23] A. Saldanha, H. Harkness, P. McGeer, R. Brayton, and A. Sangiovanni-Vincentelli, “Performance optimization using exact sensitization,” in DAC, 1994, pp. 425–429.
[24] K. Nakamura, K. Takagi, S. Kimura, and K. Watanabe, “Waiting false path analysis of sequential logic circuits for performance optimization,” in ICCAD, 1998, pp. 392–395.
[25] H. Higuchi and Y. Matsunaga, “Enhancing the performance of multicycle path analysis in an industrial setting,” in ASP-DAC, 2004, pp. 192–197.
[26] V. D’silva and D. Kroening, “Fixed points for multi-cycle path detection,” in DATE, 2009, pp. 1710–1715.
[27] J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and synthesis for on-chip multicycle communication,” IEEE Trans. CAD, vol. 23, no. 4, pp. 550–564, 2004.
[28] J. Jeon, D. Kim, D. Shin, and K. Choi, “High-level synthesis under multi-cycle interconnect delay,” in ASP-DAC, 2001, pp. 662–667.
[29] M. Karczmarek and Arvind, “Synthesis from multi-cycle atomic actions as a solution to the timing closure problem,” in ICCAD, 2008, pp. 24–31.
[30] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, “Efficiently computing static single assignment form and the control dependence graph,” ACM TOPLAS, vol. 13, no. 4, pp. 451–490, 1991.
[31] J. Cong and Z. Zhang, “An efficient and versatile scheduling algorithm based on sdc formulation,” in DAC, 2006, pp. 433–438.
[32] J. Zhang, Z. Zhang, et al., “Bit-level optimization for high-level synthesis and fpga-based acceleration,” in FPGA, 2010, pp. 59–68.
[33] H. Zheng, Q. Liu, J. Li, D. Chen, and Z. Wang, “A gradual scheduling framework for problem size reduction and cross basic block parallelism exploitation in high-level synthesis,” in ASP-DAC, 2013, pp. 780–786.
[34] D. D. Gajski, N. D. Dutt, A. C.-H. Wu, and S. Y.-L. Lin, High-level synthesis: introduction to chip and system design. Norwell, MA, USA: Kluwer Academic Publishers, 1992.
[35] L. Cheng, D. Chen, M. D. Wong, M. Hutton, and J. Govig, “Timing constraint-driven technology mapping for fpgas considering false paths and multi-clock domains,” in ICCAD, 2007, pp. 370–375.
[36] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, “Chstone: A benchmark program suite for practical c-based high-level synthesis,” in ISCAS, 2008, pp. 1192–1195.
[37] M. Srivastava and M. Potkonjak, “Optimum and heuristic transformation techniques for simultaneous optimization of latency and throughput,” IEEE TVLSI, vol. 3, no. 1, pp. 2–19, 1995.