9D-2

A Gradual Scheduling Framework for Problem Size Reduction and Cross Basic Block Parallelism Exploitation in High-level Synthesis

Hongbin Zheng, Qingrui Liu, Junyi Li, Dihu Chen, Zixin Wang∗

School of Physics and Engineering, Sun Yat-sen University, Guangzhou, P.R. China 510275
e-mail: [email protected]

Abstract— In High-level Synthesis, scheduling has a critical impact on the quality of the hardware implementation. However, the schedules of different operations actually have unequal impacts on the Quality of Result. Based on this fact, we propose a novel scheduling framework, which is able to schedule the operations separately according to their significance to the Quality of Result, to avoid wasting computational effort on noncritical operations. Furthermore, the proposed framework supports global code motion, which helps to improve the speed performance of the hardware implementation by distributing the execution time of operations across the boundaries of their parent BBs.

I. INTRODUCTION

Being able to increase the productivity of integrated circuit (IC) designers by automatically translating high-level design specifications to hardware descriptions and applying various high-level optimizations, High-level Synthesis (HLS) has become increasingly popular in recent years. In a high-level synthesis flow, scheduling assigns each operation in the Control-Data Flow-Graph (CDFG) to a particular control step (c-step), so as to transform the untimed design specification into a timed specification. Because of its importance to the HLS flow, many algorithms [1–10] have been proposed to solve the scheduling problem efficiently. All these algorithms try to schedule all operations in the CDFG with the same computational effort to achieve their objectives, e.g. minimizing the latency of the design while maximizing opportunities for sharing the functional units (FUs) among operations. As HLS pursues higher Quality of Result (QoR), the scheduling algorithms become more time-consuming. For example, [10] schedules the whole CDFG iteratively. On the other hand, the schedules of different kinds of operations actually make unequal contributions to different metrics of the hardware implementation, especially in terms of resource usage, which means that distributing the computational effort equally over all operations may waste it. For example, the impact of the schedule of low-cost operations on resource usage is weakened on modern platforms, because the multiplexers introduced to perform FU sharing may cost more resources than the shared FU itself, making the FU sharing entirely unprofitable. Hence, the low-cost operations do not need to be scheduled as carefully as the high-cost operations to expose the sharing opportunities of their underlying FUs. Similarly, when aggressive bit-level optimization [11] is applied before scheduling, there will be many zero-cost operations such as subword extraction and bit concatenation. Scheduling such operations carefully to expose FU sharing opportunities wastes even more computational effort.

To tackle the problem above, we propose a novel scheduling framework, which schedules the operations separately according to the contribution of their schedules to QoR. Specifically, the proposed framework first schedules the critical operations (as shown in Fig. 1(c)), whose schedule has a critical impact on both the speed performance and the resource usage of the hardware implementation; it then schedules the rest of the operations, i.e. the noncritical operations, whose schedule has less impact on the resource usage of the hardware implementation¹, under the constraints of the already-scheduled operations (as shown in Fig. 1(d) and Fig. 1(e)). By scheduling the operations in this way, our framework can employ an aggressive yet time-consuming scheduling algorithm (e.g. [10]) for the critical operations while employing a simpler scheduling algorithm (e.g. the As Late As Possible (ALAP) scheduling algorithm) for the noncritical operations. In addition, our framework can distribute the execution time of the noncritical operations across basic block (BB) boundaries implicitly (as shown in Fig. 1(e)), as if we were performing global code motion [5, 6].

∗ Corresponding author, e-mail: [email protected]
¹ Note that their delays may contribute significantly to the total latency of the hardware implementation; however, their schedules do not have a critical impact on QoR.

978-1-4673-3030-5/13/$31.00 ©2013 IEEE

In summary, the novelties of this paper are as follows:

1. A scheduling framework which is able to schedule the operations with different scheduling algorithms according to their contribution to the QoR, e.g. scheduling the critical operations with an aggressive yet time-consuming algorithm while scheduling the noncritical operations with a simpler algorithm.

2. A scheduling technique which is able to distribute the execution time of a specific kind of operations across the boundaries of their parent BBs.

The rest of this paper is organized as follows: Section II reviews previous work. Section III presents the details of the gradual scheduling framework and Section IV illustrates the


[Figure 1: five CDFG diagrams (a)–(e); the legend distinguishes critical operations, noncritical operations, branch operations, and basic blocks, with operations scheduled into c-steps S1–S3.]

Fig. 1. The Gradual Scheduling Example. (a) An example of the original CDFG. (b) The refined CDFG corresponding to the CDFG in (a); note that the dependence edge from operation 1 to operation 10 is abstracted from the chain including operations 3 and 4. (c) Schedule of the refined-CDFG in (b) into 3 c-steps, i.e. S1, S2 and S3. (d) A partial-scheduled CDFG derived from the scheduled refined-CDFG in (c), where operations 2, 3, 4 and the phi node need to be scheduled under the constraints of the other operations. (e) While scheduling the partial-scheduled CDFG, some operations (e.g. operations 3 and 4) may need to be moved across BB boundaries, because their execution is distributed across BBs in (c).

technique to perform implicit global code motion. Section V presents the experimental results. Finally, Section VI concludes the paper with an outlook on future work.

II. RELATED WORK

Earlier scheduling algorithms in HLS can be classified into two major categories: Data-Flow-based (DF-based) and Control-Flow-based (CF-based). These algorithms exploit the parallelism within the design specification from different points of view. The DF-based scheduling algorithms [1, 2] mainly focus on exploiting parallelism among straight-line sequences of operations, i.e. BBs, but they cannot exploit parallelism beyond the level of a single basic block. The CF-based scheduling algorithms [3, 4] mostly concentrate on scheduling mutually exclusive operations to the same c-step to exploit parallelism more globally than the DF-based algorithms; but, because the number of paths in [3] and the size of the condition vector in [4] can grow exponentially with the number of BBs in the CDFG, applying these algorithms to large design specifications is impractical. Later algorithms introduce transformations like global code motion [5, 6] and hyper-block formation [7] to HLS to further exploit parallelism before or during scheduling. They still limit the execution time of an operation within the boundary of its parent BB, like the earlier algorithms [1–4]. This limitation means they cannot distribute the execution time of an operation across BB boundaries. In the more recently developed System of Difference Constraints (SDC) based scheduling algorithms [8–10], the authors claim that global code motion can be easily supported by SDC scheduling, but the important details, e.g. the technique to preserve inter-BB constraints after control-dependency relaxation, are missing. Finally, all existing algorithms try to schedule all operations in the CDFG with the same computational effort. In contrast, our framework decomposes the scheduling process into two stages and schedules the CDFG gradually, stage by stage, to avoid wasting computational effort on noncritical operations.

[Figure 2: the general flow of our framework — CDFG → Refining → Refined-CDFG → Scheduling → Annotating Scheduling Result → Partial-scheduled-CDFG → Scheduling → Scheduled-CDFG.]

Fig. 2. The general flow of our framework.
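The two-stage flow of Fig. 2 can be sketched in Python. This is a minimal illustration under assumptions, not the framework's implementation: the critical stage below is a trivial unit-delay ASAP pass standing in for an aggressive SDC-based scheduler, and the graph representation is invented for the sketch.

```python
from collections import defaultdict

def topo_order(nodes, edges):
    """Topological order of a DAG via Kahn's algorithm."""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    order = [n for n in nodes if indeg[n] == 0]
    for n in order:                     # the list grows while we iterate
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                order.append(m)
    return order, succ

def gradual_schedule(nodes, edges, critical):
    """Stage 1: fix the schedule of the critical operations (ASAP stand-in
    for an aggressive scheduler).  Stage 2: ALAP the noncritical operations
    under the constraints imposed by the already-scheduled critical ones."""
    order, succ = topo_order(nodes, edges)
    asap = {n: 0 for n in nodes}
    for u in order:                     # unit-delay longest path from sources
        for v in succ[u]:
            asap[v] = max(asap[v], asap[u] + 1)
    sched = {n: asap[n] for n in critical}
    for u in reversed(order):           # noncritical ops: as late as possible
        if u not in critical:
            sched[u] = min((sched[v] - 1 for v in succ[u]), default=asap[u])
    return sched
```

Here stage 2 pulls each noncritical operation as late as the schedule of its successors allows; in the paper the second stage is itself a scheduling pass over the partial-scheduled CDFG.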

III. THE GRADUAL SCHEDULING FRAMEWORK

In this section, we first state the so-called "gradual scheduling" problem formally, and then show how to solve it. An example of the framework's general flow is shown in Fig. 1, and an overview of the flow is shown in Fig. 2.

A. Problem Statement

Given: 1) A Control-Data Flow-Graph, which is a directed graph G = ⟨V, E⟩, where V = Vc ∪ Vnc is the set of vertices representing operations. V can be divided into two sets of vertices: Vc, in which the schedule of an operation has a critical impact on QoR, and Vnc, in which the schedule of an operation has a less critical impact on QoR. At the same time, E is the set of edges representing data or control dependencies. 2) A set of scheduling constraints C, which may contain delay or resource constraints, etc.

Goal: Apply scheduling algorithms to the operations in Vc and Vnc separately under the constraints in C.
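The split of V into Vc and Vnc is an input to the framework. As a concrete illustration, a predicate mirroring the classifications later used in Section V ("Chained", "S16M16", "All") might look as follows; the Op record and the kind names are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    kind: str          # e.g. "load", "store", "br", "mul", "add", "cmp", "shl"
    bits: int = 32

def classify(ops, scheme="chained"):
    """Partition operations into (Vc, Vnc) under a user-chosen scheme."""
    def is_critical(op):
        if op.kind in ("load", "store", "br"):
            return True                      # critical under every scheme
        if scheme == "s16m16":               # wide multiplications/shifts
            return op.kind in ("mul", "shl") and op.bits > 16
        if scheme == "all":
            return op.kind in ("mul", "add", "cmp", "shl")
        return False                         # "chained": nothing else
    vc = {op for op in ops if is_critical(op)}
    return vc, set(ops) - vc
```

In the paper the classification is supplied by the user; automating it is named as future work in Section VI.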


B. Refining the Control-Data Flow-Graph

In the proposed framework, the key to scheduling the critical operations alone is to build a refined-CDFG for the critical operations. The definition of the refined-CDFG is given as follows: A refined-CDFG for critical operations Vc is a directed weighted graph Ĝ(Vc) = ⟨Vc, Ê, W⟩, where 1) Vc is the set of vertices representing the critical operations to be scheduled; 2) Ê is a set of edges in which an edge (u, v) ∈ Ê exists if and only if there is a path from u to v in the original CDFG that does not contain any critical operations other than u and v; 3) for each (u, v) ∈ Ê, there is an associated weight W(u, v) representing the minimal allowed time interval between the schedules of u and v. Before applying the refining algorithm, we assume the following condition holds for the input CDFG:

Cond. 1: ∀(u, v) ∈ E, u ∈ Vnc ⟹ u's parent BB dominates v's parent BB in the control flow.

For CDFGs that have edges violating Cond. 1, we can insert a "copy-to-register" operation to break the corresponding edges and treat those copy operations as critical operations.

Given a CDFG G = ⟨V, E⟩ and the critical operations Vc ⊆ V, the algorithm to build the corresponding refined-CDFG Ĝ(Vc) = ⟨Vc, Ê, W⟩ is shown in Algorithm 1. Since the vertices of Ĝ(Vc), i.e. Vc, are already known, the algorithm only needs to build the edges Ê and calculate the associated weights. First of all, we remove the backward edges from G (Line 1) to obtain a Directed Acyclic Graph (DAG). According to Cond. 1, removing the back edges does not affect the refined-CDFG. Once the DAG is obtained, we apply the modified longest-path algorithm (Line 10 to Line 18) to the DAG to obtain the biggest delay matrix M. In M, a zero value of the element M[v', v] indicates that (v', v) is not a valid edge in Ê, while a non-zero value of M[v', v] implies v' ∈ Vc (ensured by the conditions in Line 13 and Line 14), and (v', v) ∈ Ê if and only if v ∈ Vc. In this case the value of M[v', v] is the associated weight of the edge (v', v). After we set the elements in M for the backward edges to the correct values (Line 4), we build Ê and annotate the corresponding weights according to M (Line 5 to Line 8).

Algorithm 1: Refining the CDFG
  Input: The CDFG G = ⟨V, E⟩, critical operations Vc ⊆ V, latency constraints L
  Output: The refined-CDFG Ĝ(Vc) = ⟨Vc, Ê, W⟩
  1  E' ← remove backward edges from E;
  2  matrix M whose elements are initialized to 0;
  3  modifiedDAGLongestPath(M, E');
  4  add the delays of the backward edges in E to M;
  5  foreach v ∈ Vc do
  6      foreach v' ∈ Vc with M[v', v] ≠ 0 do
  7          Ê ← {(v', v)} ∪ Ê;
  8          W(v', v) ← M[v', v];
  9  return ⟨Vc, Ê, W⟩;
  10 function modifiedDAGLongestPath(M, E') begin
  11     foreach vertex v ∈ V in topological order do
  12         foreach edge (v', v) ∈ E' do
  13             if v' ∉ Vc then
  14                 foreach t ∈ Vc with M[t, v'] ≠ 0 do
  15                     l ← M[t, v'] + L[v', v];
  16                     M[t, v] ← max(M[t, v], l);
  17             else
  18                 M[v', v] ← max(M[v', v], L[v', v]);

B.1 Implementation Consideration and Complexity

In order to reduce the time and space complexity of Algorithm 1, we implement the biggest delay matrix M with adjacency lists, because G is usually sparse. By doing this, the loops in Line 5 and Line 14 only enumerate the nonzero elements instead of enumerating the whole Vc and checking whether each value is nonzero. The worst-case space complexity of the algorithm is |V|², and the upper bound on its time complexity is |V|³.

C. Scheduling the Refined-CDFG and the Rest of the CDFG

Once the refined-CDFG is built, our framework is able to schedule the critical operations in the refined-CDFG while ignoring the noncritical operations. Because the refined-CDFG is smaller than the original CDFG, but has a more critical impact on QoR than the rest of the CDFG, an aggressive yet time-consuming scheduling algorithm can be employed to schedule the critical operations and achieve better QoR. After the critical operations in the refined-CDFG are scheduled, the remaining noncritical operations can be scheduled under the constraints of the scheduled critical operations. In this stage, a simpler scheduling algorithm can be employed, because the schedules of the noncritical operations are less critical to the QoR.

IV. IMPLICIT GLOBAL CODE MOTION

In this section, we introduce implicit global code motion, which is able to distribute the execution time of noncritical operations across BB boundaries while we are scheduling the critical operations in the refined-CDFG. The code motion process can be divided into 3 steps:

1. Relax the control dependencies in the original CDFG and build the corresponding refined-CDFG,
2. Perform code motion implicitly while scheduling the refined-CDFG, and
3. Insert the necessary inter-BB delay operations to fix the delay constraints for cross-BB chains.

A. Control Dependencies Relaxation

In this step, we remove the control dependencies selectively so that a noncritical operation can be scheduled outside its parent BB. The dependencies to be relaxed include:

[Figure 3: four CDFG diagrams (a)–(d), with edges annotated with their delays in ns.]

Fig. 3. Control Dependencies Relaxing Examples.



[Figure 4: two CFG diagrams (a) and (b) showing the paths BB1 → BB2 → BB3 and BB1 → BB3, with a delay operation inserted in (b).]

Fig. 4. Preserving cross-BB delay constraints.

1. All control dependencies from the entry-node of the parent basic block to the noncritical operation.
2. All control dependencies from the noncritical operation to the exit-node of its parent basic block.

After the dependencies listed above are relaxed, we can distribute the execution time of the noncritical operations to other basic blocks. Fig. 3 shows an illustrating example: Fig. 3(a) shows the original CDFG, where operations 1 and 6 are critical operations and operations 2 and 5 are noncritical operations; the dashed edges represent control dependencies, while the solid edges represent other dependencies, and every edge is annotated with its delay in nanoseconds (ns). Fig. 3(b) shows the refined-CDFG corresponding to Fig. 3(a), where the delays of the refined edges (1, 3) and (4, 6) include the delays of the original control dependency edges (2, 3) and (5, 6). Such constraints force operations 2 and 5 to execute within their parent basic block. Fig. 3(c) shows the CDFG after the corresponding control dependencies are removed. Fig. 3(d) shows the refined-CDFG corresponding to Fig. 3(c); here the delays of the refined dependency edges (1, 3) and (4, 6) do not include the delays of the original control dependency edges (2, 3) and (5, 6). This allows us to distribute the execution time of operations 2 and 5 to the time frames of other basic blocks.

B. Preserving Delay Constraints for Cross-BB Dependencies

One of the important problems for implicit global code motion is to preserve the delay constraints for cross-BB dependencies in the refined-CDFG. The problem is illustrated in Fig. 4: In Fig. 4(a), there is a cross-BB dependency between operations 1 and 4, which requires operation 4 to be scheduled at least 15 ns after operation 1. At the same time, there are 2 paths between the parent BBs of operations 1 and 4: BB1 → BB2 → BB3 and BB1 → BB3. However, the actual path from BB1 to BB3 taken at run-time cannot be captured statically by the scheduling algorithms.
Furthermore, applying the inter-BB dependency constraints of the SDC scheduling algorithm, i.e. constraint (2) in [8], leads the scheduling algorithm to consider the longest path, i.e. BB1 → BB2 → BB3. As a result, the constraint between operations 1 and 4 is only preserved on the path BB1 → BB2 → BB3, but not on other paths, which have a smaller latency. Fig. 4(b) demonstrates a possible solution: after scheduling, we insert the necessary delay operations between BBs to preserve the cross-BB delay constraints on the non-longest paths. The formal description of this problem is given below:

Given: A set of delay constraints derived from the cross-BB dependencies in the refined-CDFG, whose sources and sinks are connected by more than one possible path in the CFG.

Goal: Preserve these constraints on all possible paths.

It is straightforward that we can preserve a cross-BB delay constraint on all paths by preserving it on the shortest path. Hence, we rewrite the cross-BB constraint as follows:

    d(u, tu) + d(tu, ev) + d(ev, v) ≥ W(u, v)    (1)

where tu denotes the terminator (or super sink) of u's parent BB, ev denotes the entry (or super source) of v's parent BB, and the function d(x, y) denotes the minimal distance (in number of cycles) from x's schedule to y's schedule. For a cross-BB constraint whose source and sink are u and v respectively, the inequality decomposes the distance from u to v into three parts: d(u, tu), d(tu, ev) and d(ev, v). Specifically, d(u, tu) and d(ev, v) are the distance from u to the end of its parent BB and the distance from the start of v's parent BB to v, respectively; d(tu, ev) is the shortest-path distance between the parent BBs of u and v in the non-linear CFG. In order to preserve inequality (1), we insert delay operations between BBs to increase d(tu, ev) where necessary after the refined-CDFG is scheduled. Please also note that, because the source BB always dominates the sink BB in a cross-BB constraint according to Cond. 1, calculating the single-source shortest path from the entry of the CDFG (also the root of the dominator tree) is already enough to answer all d(tu, ev) queries during delay insertion. Specifically, d(tu, ev) can be simply calculated by:

    d(tu, ev) = d'(bbv) − d'(bbu) − latency(bbu)    (2)
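Because the source BB dominates the sink BB, one shortest-path computation from the CDFG entry answers every d(tu, ev) query. A Python sketch, assuming the CFG is given as successor lists with per-BB latencies (all names here are illustrative, not the framework's API):

```python
import heapq

def espd(cfg_succ, latency, entry):
    """ESPD d'(bb): shortest-path distance (in cycles) from the CDFG entry,
    where traversing a BB costs that BB's latency (Dijkstra)."""
    dist = {entry: 0}
    pq = [(0, entry)]
    while pq:
        d, bb = heapq.heappop(pq)
        if d > dist.get(bb, float("inf")):
            continue                     # stale queue entry
        for nxt in cfg_succ.get(bb, ()):
            nd = d + latency[bb]         # leaving bb costs its latency
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(pq, (nd, nxt))
    return dist

def cross_bb_distance(bb_u, bb_v, dist, latency):
    # Equation (2): d(tu, ev) = d'(bb_v) - d'(bb_u) - latency(bb_u)
    return dist[bb_v] - dist[bb_u] - latency[bb_u]
```

Since the CFG is a DAG once back edges are removed, a topological-order relaxation would serve equally well; Dijkstra is used here only for brevity.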

where d'(bbv) and d'(bbu) are the Shortest-Path Distances from the CDFG's Entry (ESPD) to the parent BBs of v and u, respectively; latency(bbu) is the latency of u's parent BB, which equals the distance from the start to the end of u's parent BB, i.e. latency(bbu) = d(eu, tu).

The delay insertion algorithm is shown in Algorithm 2. At the beginning of the algorithm, an array A is allocated to store the ESPD of each BB. After that, from Line 2 to Line 18, the algorithm visits all BBs in topological order and fixes the ESPD of each BB by inserting delay operations between the BB's predecessors and the BB, so as to preserve inequality (1). The ESPD is calculated by the following inequality, which is the combination of inequality (1) and equation (2):

    d'(bb) ≥ max{ W(u, v) + de(u) − de(v) + d'(bbu) : (u, v) ∈ Ê, v ∈ bb }    (3)

where de(w) is short for d(ew, w); it denotes the number of cycles from the start of w's parent BB to w. In the first part of the loop body, i.e. Lines 4 to 9, the expected ESPD (De), which preserves inequality (3), is calculated according to the scheduling results and the previously calculated ESPDs of the current BB's ancestors in the dominator tree. In the second part, i.e. Lines 11 to 17, the algorithm calculates the actual ESPD through each predecessor of the current BB. If the newly calculated actual ESPD through a specific predecessor is smaller than the expected ESPD, a delay operation is inserted between that predecessor and the current BB to increase the actual ESPD. At last, the ESPD value of the BB that preserves inequality (1) is written to array A in Line 18.
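The fix-up loop of Algorithm 2 can be sketched in Python. For brevity the expected ESPD De of each BB (Lines 4–9 of Algorithm 2) is assumed to be precomputed from the cross-BB constraints and passed in; the names are illustrative, and the final-ESPD bookkeeping is slightly simplified relative to the pseudocode.

```python
def insert_delays(bbs_topo, preds, latency, expected_espd):
    """Visit BBs in topological order; pad CFG edges whose actual ESPD falls
    short of the expected ESPD De, recording the final ESPD per BB in A."""
    A, inserted = {}, {}                 # inserted[(pred, bb)] -> delay cycles
    for bb in bbs_topo:
        de = expected_espd.get(bb, 0)
        da, padded = float("inf"), False
        for p in preds.get(bb, ()):
            through_p = A[p] + latency[p]
            if through_p < de:           # this path is too short: pad the edge
                inserted[(p, bb)] = de - through_p
                padded = True
            else:
                da = min(da, through_p)
        # a padded edge realizes exactly De; otherwise the shortest raw path
        A[bb] = de if (padded or da == float("inf")) else da
    return A, inserted
```

On the Fig. 4 situation (BB1 → BB2 → BB3 plus the shortcut BB1 → BB3), only the shortcut edge gets padded, which corresponds to the delay block shown in Fig. 4(b).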

Algorithm 2: Inserting Delay Operations
  Input: The refined-CDFG Ĝ(Vc) = ⟨Vc, Ê, W⟩ and Ĝ's schedule S
  Output: The modified refined-CDFG Ĝ
  1  allocate array A;
  2  foreach BB bb ∈ Ĝ in topological order do
  3      De ← 0;
  4      foreach vertex v ∈ bb in topological order do
  5          foreach forward-edge (u, v) ∈ Ê do
  6              bb' ← u's parent BB;
  7              if bb' ≠ bb then
  8                  De' ← W(u, v) + de(u) − de(v) + A[bb'];
  9                  De ← max(De, De');
  10     Da ← ∞;
  11     foreach forward-CFG-edge (bb', bb) do
  12         Da(bb') ← A[bb'] + latency(bb');
  13         if Da(bb') < De then
  14             N ← De − Da(bb');
  15             add N cycles of delay from bb' to bb;
  16         else
  17             Da ← min(Da, Da(bb'));
  18     A[bb] ← max(Da, De);

C. Exploiting Cross-BB Parallelism

Please note that the operations in the cross-BB chain corresponding to a cross-BB dependency (u, v) ∈ Ê, e.g. operations 2 and 5 in Fig. 3(a), whose execution times fall into the time frame between tu and ev, execute in parallel with the operations in the BBs between u's parent BB and v's parent BB. Hence the so-called "cross basic block parallelism" is exploited.

V. EXPERIMENTAL RESULTS

Our scheduling framework has been implemented as a part of the Shang HLS framework [12], which is based on LLVM [13]. The scheduling algorithm employed to schedule both the refined-CDFG (R-CDFG) and the partial-scheduled-CDFG (P-CDFG) is based on a system of difference constraints, the same as [8]. In addition, to support combining multi-cycling and chaining on chains consisting of noncritical operations, the scheduling algorithm requires cycle time constraints (i.e. constraint (8) in [8]) between critical operations, which are calculated in exactly the same way as in the refining algorithm, to construct multi-cycle combinational paths, even when it is scheduling the original CDFG. We perform 2 experiments to demonstrate the abilities of the proposed scheduling framework to reduce the size of the scheduling problem as well as to exploit cross-BB parallelism. All the experiments are performed on the CHStone HLS benchmark [14], which covers various application domains, such as arithmetic, media processing, security and microprocessors.

A. Problem Size Reduction

We perform experiments to compare the problem size of scheduling the original CDFG with those of the R-CDFG and the P-CDFG. Because the size of the R-CDFG depends on the critical/noncritical operation classification provided by the user, our experiments cover different classifications: 1) Chained, which treats loads/stores and branches as critical operations; 2) S16M16, which treats loads/stores, branches, and multiplications/shifts wider than 16 bits as critical operations; and 3) All, which treats all loads/stores, branches, multiplications, additions, comparisons and shifts as critical operations. The experimental results are given in Table I; the size is measured as the product of the number of variables and the number of constraints in the SDC. In the table, the R-CDFG columns give the ratio of the R-CDFG's size to the original CDFG's size; similarly, the P-CDFG columns give the ratio of the P-CDFG's size to the original CDFG's size. The last row of the table contains the geometric mean of each column.

From the experimental results, we found that both the number of variables and the number of constraints in the R-CDFG increase as more operations are treated as critical. For example, the numbers of variables and constraints in All are bigger than in S16M16 (24%×16% vs. 13%×9%), and the comparison between the classifications S16M16 and Chained gives a similar result. However, even when the critical operation set covers all loads/stores, multiplications, additions, comparisons and shifts, the size of the R-CDFG is only 3.8% of the original CDFG, on average. Since the time to find the optimal solution of an SDC increases with the number of variables and constraints in the model, we believe such problem size reduction can speed up the scheduling process, especially for advanced algorithms like [10], which employs SDC scheduling iteratively. Similarly, it is easy to understand that the size of the P-CDFG decreases as more operations are treated as critical operations.
We also found that the sum of the R-CDFG's size and the P-CDFG's size is smaller than the size of the original CDFG; this is because part of the complexity of solving the scheduling problem is eliminated by the CDFG refining algorithm. In this case, the complexity is actually transferred to the CDFG refining algorithm, since the algorithm itself also takes time.


[Figure 5: scatter plot of the refining algorithm's run time (s, 0–0.05) against normalized CDFG size (0–1) for the Chained, S16M16 and All classifications.]

Fig. 5. The runtime of the refining algorithm.

Meanwhile, the runtime of the refining algorithm is acceptable in our experiments, although its time-complexity upper bound is |V|³. From Fig. 5, which gives the runtime of the algorithm in the experiments, we found that the longest runtime is only 0.05 seconds. We also found that the more operations are treated as critical, the more time is required to refine the same CDFG, because the execution flow is more likely to fall between Lines 13 and 16 of Algorithm 1 to forward the dependencies of noncritical operations to critical operations. At last, even under the same critical/noncritical classification, the refining time may differ significantly among CDFGs of similar size. These observations indicate that the actual runtime of the refining algorithm depends on the critical/noncritical classification and on the connectivity of the CDFG, but it is generally not a problem during the HLS process.

TABLE II: COMPARISON ON LATENCY IN # CYCLES

Design   | IGCM    | w/o IGCM | Ratio | #Op/#BB
---------|---------|----------|-------|--------
motion   | 2170    | 2196     | 98.8% | 14.4
dfadd    | 516     | 802      | 64.3% | 8.5
dfmul    | 190     | 272      | 69.9% | 9.7
dfdiv    | 1903    | 2041     | 93.2% | 9.5
sin      | n/a     | 62686    | n/a   | 40.9
sha      | n/a     | 264542   | n/a   | 42.5
aes      | 11003   | 11821    | 93.1% | 48.5
adpcm    | 52248   | 56122    | 93.1% | 40.0
blowfish | 405036  | 412062   | 98.3% | 35.6
gsm      | 11087   | 11516    | 96.3% | 15.2
mips     | 6253    | 10185    | 61.4% | 7.0
jpeg     | 1262480 | 1413191  | 89.3% | 14.1
geomean  | 9809    | 11605    | 84.5% | 18.8

B. Implicit Global Code Motion

A comparison of the latency, in number of cycles, between the hardware implementations with and without implicit global code motion is shown in Table II. In the table, the latencies of the implementations optimized by IGCM are given in the IGCM column, while the latencies of the implementations not optimized by IGCM are given in the w/o IGCM column. The Ratio column gives the ratio of the optimized implementation's latency to the unoptimized implementation's latency, and the average number of operations per BB is given in the last column. The last row of the table contains the geometric mean of each column. In the experiments, the implementations of designs sin and sha deliver wrong results due to a bug in the HLS framework, and hence the related data are excluded from the geometric means.

From Table II we found that for designs which have fewer than 10 operations per BB on average, i.e. dfadd, dfmul and mips, IGCM is able to reduce the latency by at least 30% compared with the latency of the unoptimized implementation. For other designs, which have more operations per BB on average, the effect of IGCM is not significant. These results indicate that IGCM reduces the latency of designs that are limited by control dependencies between BBs, by relaxing the control dependencies to exploit the inter-BB parallelism. On the other hand, for designs which already have rich intra-BB parallelism, IGCM cannot further reduce the latency because the control dependencies are not the speed-performance bottleneck. One exception is the design dfdiv, whose average number of operations per BB is almost the same as dfmul's, yet the effect of IGCM on it is not significant; this is because the latency of its cross-BB chains cannot be hidden by the execution time of operations in other BBs. Specifically, we found that 8 extra delay cycles are inserted to fix the cross-BB latency constraints for dfdiv, while dfmul requires only 4 extra delay cycles, half as many as dfdiv. At last, IGCM is able to reduce the latency by 16%, on average.

VI. CONCLUSION

In this paper, we presented a scheduling framework which is able to schedule the operations in a CDFG with different algorithms according to the impact of their schedules on QoR. With our framework, we can avoid wasting computational effort by employing an aggressive yet time-consuming algorithm to schedule the critical operations, while employing a simpler algorithm to schedule the noncritical operations. Furthermore, the scheduling framework supports implicit global code motion, which is able to distribute the execution time of the noncritical operations across BB boundaries. In the future, we will try to develop an automatic approach to classify critical/noncritical operations at compile time.

VII. ACKNOWLEDGMENT

This work was supported by the Strategic Emerging Industry Key Technology Special Project of Guangdong Province (20111680142011912004) and the Provincial Ministry Research Project of Key Science and Technology (2011A090200037).


TABLE I: COMPARISON ON THE PROBLEM SIZE IN TERMS OF #VARIABLE × #CONSTRAINT IN SDC SCHEDULING

Design   | CDFG       | Chained R-CDFG | Chained P-CDFG  | S16M16 R-CDFG | S16M16 P-CDFG   | All R-CDFG     | All P-CDFG
---------|------------|----------------|-----------------|---------------|-----------------|----------------|----------------
motion   | 1760×5637  | 19%×14%        | 81%×86%         | 21%×15%       | 79%×85%         | 32%×25%        | 68%×75%
dfadd    | 653×2538   | 7%×4%          | 93%×96%         | 9%×5%         | 91%×95%         | 15%×8%         | 85%×92%
dfmul    | 515×2112   | 5%×3%          | 95%×97%         | 7%×4%         | 93%×96%         | 14%×8%         | 86%×92%
dfdiv    | 628×2597   | 8%×4%          | 92%×96%         | 10%×6%        | 90%×94%         | 18%×10%        | 82%×90%
sin      | 1800×7183  | 9%×5%          | 91%×95%         | 11%×6%        | 89%×94%         | 17%×9%         | 83%×91%
sha      | 553×1732   | 16%×11%        | 84%×89%         | 16%×11%       | 84%×89%         | 32%×22%        | 68%×78%
aes      | 3154×11476 | 15%×12%        | 85%×88%         | 15%×12%       | 85%×88%         | 29%×25%        | 71%×75%
adpcm    | 2322×9686  | 9%×6%          | 91%×94%         | 11%×9%        | 89%×91%         | 29%×20%        | 71%×80%
blowfish | 998×14268  | 15%×12%        | 85%×88%         | 15%×12%       | 85%×88%         | 30%×20%        | 70%×80%
gsm      | 2578×9192  | 8%×5%          | 92%×95%         | 10%×7%        | 90%×93%         | 22%×12%        | 78%×88%
mips     | 591×2431   | 27%×18%        | 73%×82%         | 28%×19%       | 72%×81%         | 34%×23%        | 66%×77%
jpeg     | 3978×14809 | 17%×12%        | 83%×88%         | 19%×13%       | 81%×87%         | 34%×26%        | 66%×74%
geomean  | 1262×5319  | 12%×8% (0.9%)  | 87%×91% (78.8%) | 13%×9% (1.2%) | 86%×90% (77.0%) | 24%×16% (3.8%) | 74%×82% (61.1%)

REFERENCES

[1] P.G. Paulin and J.P. Knight. Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 8(6):661–679, June 1989.
[2] A.C. Parker, J. Pizarro, and M. Mlinar. MAHA: A program for datapath synthesis. In Proceedings of the 23rd Design Automation Conference, pages 461–466, June 1986.
[3] R. Camposano. Path-based scheduling for synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 10(1):85–93, January 1991.
[4] K. Wakabayashi and H. Tanaka. Global scheduling independent of control dependencies based on condition vectors. In Proceedings of the 29th ACM/IEEE Design Automation Conference, pages 112–115. IEEE Computer Society Press, 1992.
[5] G. Lakshminarayana, A. Raghunathan, and N.K. Jha. Incorporating speculative execution into scheduling of control-flow-intensive designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(3):308–324, March 2000.
[6] S. Gupta, N. Savoiu, N. Dutt, R. Gupta, and A. Nicolau. Using global code motions to improve the quality of results for high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(2):302–312, February 2004.
[7] J.L. Tripp, M.B. Gokhale, and K.D. Peterson. Trident: From high-level language to hardware circuitry. Computer, 40(3):28–37, March 2007.
[8] J. Cong and Zhiru Zhang. An efficient and versatile scheduling algorithm based on SDC formulation. In Proceedings of the 43rd ACM/IEEE Design Automation Conference, pages 433–438, 2006.
[9] Jason Cong, Bin Liu, and Zhiru Zhang. Scheduling with soft constraints. In International Conference on Computer-Aided Design, pages 47–54, 2009.
[10] Jason Cong, Bin Liu, and Junjuan Xu. Coordinated resource optimization in behavioral synthesis. In Design, Automation and Test in Europe, pages 1267–1272, 2010.
[11] Jiyu Zhang, Zhiru Zhang, Sheng Zhou, Mingxing Tan, Xianhua Liu, Xu Cheng, and Jason Cong. Bit-level optimization for high-level synthesis and FPGA-based acceleration. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '10, pages 59–68, New York, NY, USA, 2010. ACM.
[12] Hongbin Zheng, Qingrui Liu, Junyi Li, Dihu Chen, and Zixin Wang. An open source high-level synthesis framework with cross-level optimizations.
[13] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In IEEE/ACM International Symposium on Code Generation and Optimization, page 75, 2004.
[14] Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, and Hiroaki Takada. Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis. Journal of Information Processing, 17:242–254, 2009.
