The DT-Model: High-Level Synthesis Using Data Transfers

Shantanu Tarafdar and Miriam Leeser
Department of Electrical and Computer Engineering, Northeastern University
(Submitted to the 1998 Design Automation Conference)

Abstract

We present a new model for formulating the classic high-level synthesis (HLS) subproblems: scheduling, allocation, and binding. The model is unique in its use of data transfers as the basic entity in synthesis. A data transfer represents the movement of one instance of data and contains the operation sourcing the data and all the operations using it. Our model compels the storage architecture of the design to be optimized concurrently with the execution unit. We have built a high-level synthesis system, Midas, based on our data-transfer model. Midas generates designs with smaller storage and data-transfer requirements than other HLS systems.

1 Introduction

High-throughput, memory-intensive applications, like those found in multimedia and video processing, are frequently implemented as application-specific integrated circuits (ASICs). The high memory bandwidth these applications require to operate in real time is available only on a custom integrated circuit. With the dramatic reduction in feature size in present-day fabrication technology, the complexity of applications mapped to ASICs has increased as well. Managing this complexity manually is nearly impossible, which has spurred the development of high-level synthesis (HLS) systems for ASIC design. The input to these systems is a description of the behavior of the application; the output is a datapath and controller architecture for the ASIC implementation. However, existing high-level synthesis tools treat memory and data subsystem synthesis as a secondary stage, giving priority to the synthesis of the datapath's execution unit.

If the execution unit is synthesized with no heed to where the data it uses is stored or how it is moved, the implied storage unit may be very large and inefficient. This is most noticeable when high-throughput, memory-intensive applications are being synthesized. Applications in this class require large amounts of data to move and be processed in reasonably short amounts of time. It is therefore crucial for synthesis algorithms to include optimization criteria for data storage and data motion. Existing systems tend to decouple this from the core synthesis steps: scheduling, allocation, and binding. In this paper, we propose a new model for formulating the classic high-level synthesis subproblems of scheduling, allocation, and binding. The model differs from convention by using data transfers as its primary entity in synthesis. A data transfer (DT) is a cluster of operations that share the same data: the operation sourcing the data and all those using it. All the operations of a DT are scheduled in one control step (cstep) of the schedule; when this cannot be done, the DT must be partitioned into several DTs. Partitioning implies an increase in storage unit size and bandwidth. Since scheduling and binding cannot be solved unless DT partitioning is performed concurrently, a synthesis system that uses our model is forced to handle data storage and motion in the core synthesis steps. Midas is an HLS system we have developed based on the DT-model. Experiments show that architectures synthesized by Midas have smaller storage unit requirements than those from HLS systems that do not use the DT-model.

2 Motion estimation: A case study of custom design

The inspiration for our new model for HLS, and for Midas, came from characteristics exhibited by custom architectures for motion estimation. Motion estimation is a simple video application that has high throughput and is memory-intensive. It estimates the motion of a group of pixels in a video sequence. There are many ways to perform motion estimation, of which one is block-matching. Each image of a video sequence is divided into sub-images. A motion vector is computed for each sub-image by scanning a search neighborhood in the previous frame of the sequence for a sub-image that best matches the first one. The displacement between the two is the motion vector for the sub-image from the current frame. Yang et al. [1] described a custom architecture for a block-matching motion vector estimation application where the sub-image size was 16 × 16 pixels. Figure 1 shows a schematic for this design.

Figure 1: Schematic diagram of motion estimation chip (Fig. 3 of [1])

This is a pipelined architecture with an input bandwidth of 3 bytes/cycle. It illustrates some interesting characteristics of custom architecture design for high-throughput, memory-intensive applications. A high-level synthesis system tuned for these applications should try to emulate them:

1. Data transfers take place between processing elements spaced close together. A high-level synthesis system can generate a binding like this only if it takes layout information into account during synthesis.

2. Bytes p and p' are broadcast to several processing elements at each time step. The schedule is organized so that several of the operations that use these bytes of data execute simultaneously.

3. Broadcasts to many processing elements happen over long buses, thereby amortizing the area cost of a long, wide bus over multiple uses of a single piece of data. Data that is not broadcast tends to be communicated over short buses. The net effect is better utilization of the available internal data-transfer bandwidth: the part of the ASIC that is not occupied by processing elements and through which data may flow.

Our model aims at providing a framework that guides high-level synthesis algorithms toward architectures exhibiting these characteristics. There are synthesis approaches that focus on array- and multi-dimensional-loop-based designs like the motion estimation example above, and the designs they generate tend to have the characteristics above [2, 3]. There have also been studies in background memory synthesis for similar application classes [4, 5, 6]. However, all these approaches require that the behavior be described in loops and arrays, and that the indices to the arrays be linear expressions. Our model allows us to synthesize designs with the same characteristics even if the input is not specified using loops and arrays.

3 Current approaches in high-level synthesis

Storage and data-transfer subsystem synthesis can be divided into allocation and binding problems. The first set includes storage size and bandwidth allocation, memory mapping, and bus allocation. The second includes the assignment of variables to storage locations, storage I/O operations to memory ports, data transfers to buses (or interconnect), and operand data to terminals of functional units. The solutions to these subproblems determine where data is stored and how it is routed. Results show that in larger, more complex behavioral designs, a large fraction of the ASIC area is occupied by storage and interconnect [7, 8]. The solutions to the allocation and binding problems mentioned above greatly influence this fraction of the ASIC area. These solutions, in turn, are influenced by the solutions to the subproblems that generate the execution unit of the architecture: functional unit allocation and operation scheduling and binding. A high-level synthesis system should integrate both sets of subproblems. However, to reduce computational complexity, most high-level synthesis systems decouple them. Older HLS systems synthesized the execution unit first and then the storage and data-transfer subsystem, without considering their interdependence [9, 10, 11]. This could prematurely prune the search space for storage and data-transfer subsystem synthesis. More recent HLS systems do consider the effects of operation scheduling and binding on the storage and data-transfer subsystem [12, 3, 13]. Most use an estimate of the minimum numbers of registers, ports, and buses needed

in any clock cycle. They make no explicit effort to broadcast data that is used multiple times and have no explicit mechanism ensuring data is used soon after it is produced; they rely on minimizing the estimates to imply this. Most of these systems have a post-processing step to convert the storage size, port, and bus requirements into a set of register files and interconnect [14, 15]. However, relegating this to a post-processing stage after scheduling precludes certain optimizations; data duplication, regeneration, and preemptive routing [8] are a few examples. Our data-transfer model allows us to handle all these issues concurrently with operation scheduling and binding, with very little additional overhead. The model's target storage unit is a foreground memory structure made up of spatially distributed, multi-ported register files. Most scheduling and binding algorithms in the HLS literature can be directly mapped into the data-transfer model with minor modifications. For those that we have implemented, we have observed designs with smaller storage sizes, storage bandwidths, and bus counts than those from their operation-based counterparts. The importance of including data-transfer optimizations in the scheduling and binding phases of high-level synthesis has been stressed by other researchers [8, 16, 17]. Our model is unique in the way it treats data transfers in the context of high-level synthesis for ASIC design.

4 The data-transfer model

A data transfer corresponds to the motion of a single piece of data from its source to its destinations. Expressed in terms of the elements of a dataflow graph (DFG), a data transfer (DT) contains the operation sourcing the data, all the destination operations, and all the precedence constraints between each source-destination pair. Figure 2 illustrates a DT and its relation to the DFG that contains it. Every DFG gives rise to many DTs: each node with outgoing edges corresponds to the source operation of a distinct DT. Figure 3 shows all the DTs extracted from a DFG.
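The extraction rule above can be sketched in a few lines. This is a minimal illustration, not code from Midas; the adjacency-list DFG encoding and the node names are our own assumptions.

```python
# Sketch: extracting data transfers (DTs) from a dataflow graph (DFG).
# Each node with outgoing edges sources exactly one DT, which groups
# that node with every operation consuming its output.

def extract_dts(dfg):
    """dfg: dict mapping each operation to the list of operations that
    consume its result. Returns one DT per data-producing node."""
    dts = []
    for source, dests in dfg.items():
        if dests:  # only nodes with outgoing edges source a DT
            dts.append({"source": source, "dests": list(dests)})
    return dts

# A tiny DFG: a feeds b and c; b feeds c; c produces the final output.
dfg = {"a": ["b", "c"], "b": ["c"], "c": []}
for dt in extract_dts(dfg):
    print(dt["source"], "->", dt["dests"])
```

Note that a node with two consumers, like `a` above, yields a single DT with two destinations, matching the paper's notion that one piece of data may be broadcast to several operations.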

Figure 2: Definition of a data-transfer (DT)

Figure 3: DTs extracted from a dataflow graph
4.1 DT-cost: The physical significance of a data transfer

Each DT can be assigned a cost with a spatial component and a temporal one. In the absence of placement information during high-level synthesis, the spatial cost is the width of the DT. If placement information exists, it is the product of the width and the length of the interconnect network needed to route the data. The temporal cost of a DT is the length of time the interconnect must be reserved for this data transfer. The product of the spatial and temporal costs indicates how "expensive" the DT is in terms of utilization of ASIC area. This is the DT-cost. Figure 4 is a snapshot of a control step in the operation of an ASIC. The data transfers are indicated

on the floorplan of the ASIC by arrows. Wider arrows are transfers with larger wordlengths, so the areas of the arrows indicate how much each transfer costs. This is modelled by our notion of DT-cost.

Figure 4: DT cost
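The definition above amounts to a simple product; the sketch below makes the two cases (with and without placement information) explicit. Parameter names and units are illustrative, not from the paper.

```python
# Sketch of the DT-cost measure: spatial cost times temporal cost.
# Without placement the spatial cost is just the data width; with
# placement it is width times the routed interconnect length.

def dt_cost(width_bits, csteps_reserved, route_length=None):
    spatial = width_bits if route_length is None else width_bits * route_length
    temporal = csteps_reserved
    return spatial * temporal  # the "area-time" the transfer occupies

# A 16-bit transfer held for one cstep, routed over 3 length units:
print(dt_cost(16, 1, route_length=3))  # 48
```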

4.2 Scheduling and binding a DT

Scheduling all the operations of a DT in a single control step eliminates the need to store the data in a register. In our model, we postulate that a DT (meaning all its operations) must be scheduled at a single control step. This is possible if operation chaining and sufficient resources are available; otherwise, the data transfer must be partitioned, as we shall see later. Binding a DT involves binding all its elements, including all the operations and the data itself. Thus binding a DT merges the operation-binding and data-binding subproblems of conventional models for HLS. Operations in a DT can include storage reads and storage writes; these must be bound to register files and ports. Finally, if a floorplanner is available, DT placement can be done. This means placing all the functional units the operations of the DT are bound to, as well as routing the data over the buses or interconnect on the ASIC.

4.3 Partitioning a DT

A DT needs to be partitioned when it is impossible to schedule all of its operations at the same cstep. The operations of the original DT are distributed among a set of DTs forming a partition of the operations. The DT containing the sourcing operation is called the source DT of the partition; the rest contain destination nodes only. The source operations of these destination DTs are storage read operations, and the source DT gains an extra destination operation, a storage write operation. Figure 5 illustrates DT partitioning.

Figure 5: DT partitioning

4.4 Effects of partitioning

Since few, if any, behavioral designs result in purely combinational architectures, DT partitioning is needed; only a combinational design, where all operations can be scheduled in one cstep, can escape it. Partitioning a DT has three implications: (1) a variable needs to be held in storage, (2) a variable needs to be written to storage, and (3) a variable needs to be read, possibly several times, from storage. The first implication impacts the size of storage, since the variable represented by the original DT now needs a location in a storage unit where it will be stored. The second and third affect the bandwidth of the storage unit, since they call for the availability of write and read ports. Careful partitioning can result in a good storage unit architecture. One way to perform partitioning is to minimize the number of data transfers we partition: by partitioning a DT only when all its operations cannot otherwise be scheduled in the same cstep, we minimize the number of variables being sent to storage. When we do partition, minimizing the size of the partition minimizes the number of times the variable is accessed in storage. The net result is a reduction in storage size and bandwidth on average.

Scheduling, binding, and partitioning are interdependent. By considering the partitioning implications of proposed scheduling and binding actions, we indirectly consider the consequences of those actions for the architecture of the storage unit.
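The three implications above translate into simple storage-unit accounting: partitioning one DT into k pieces stores one variable, performs one write (from the source DT), and performs one read per destination DT. The sketch below is purely illustrative bookkeeping under that reading, not Midas code.

```python
# Sketch: the storage-unit impact of partitioning one DT into k DTs,
# per the three implications above. One source DT writes the variable;
# each of the k - 1 destination DTs reads it back.

def partition_impact(num_partition_dts):
    k = num_partition_dts
    return {
        "extra_storage_locations": 1,  # the variable now needs a location
        "storage_writes": 1,           # the source DT writes it once
        "storage_reads": k - 1,        # each destination DT reads it
    }

# Partitioning a DT into three pieces: one write, two reads.
print(partition_impact(3))
```

This also shows why minimizing the size of each partition reduces storage bandwidth: fewer destination DTs means fewer reads of the stored variable.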

4.5 When is partitioning needed?

Partitioning is required when scheduling all the operations of a data transfer in the same control step would result in an inconsistency. Three conditions leading to inconsistency are described below.

C-Step period violation In Figure 6, the path through the source addition (4 ns) and the destination multiplication (7 ns) is 11 ns. This is longer than the 10 ns cstep period, so the DT must be partitioned into two as shown.

Figure 6: DT partitioning due to control step period violations

Resource allocation violation In Figure 7, the DT would require four adders if all its operations were to be scheduled in the same control step, but the given resource allocation for adders is two. The DT must be partitioned as shown for scheduling to be viable.

Local temporal inconsistency In Figure 8, we have a chain of three data transfers. Suppose DT1 has been scheduled at cstep 1 and DT3 at cstep 10. Since n2 is part of DT1 and n3 is part of DT3, n2 is scheduled at cstep 1 and n3 at cstep 10. However, these two nodes are also operations of DT2, yet they are not scheduled in the same cstep. DT2 is therefore inconsistent and must be partitioned into DT2a and DT2b for the system to regain scheduling consistency.

Figure 7: DT partitioning due to resource allocation violations

Figure 8: DT partitioning due to local temporal inconsistency
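The three conditions can be read as predicates over a candidate DT. The sketch below checks them in order; the dictionary encoding of a DT and all field names are our own assumptions for illustration, not Midas internals.

```python
# Sketch of the three partitioning triggers described above.

def needs_partitioning(dt, period_ns, allocation, fixed_csteps):
    """dt: the DT's source delay/type/name plus its destination ops.
    allocation: functional-unit type -> allocated count.
    fixed_csteps: op name -> cstep, for ops already pinned by
    neighbouring DTs."""
    # 1. C-step period violation: a chained source->destination path
    #    is longer than the cstep period.
    if any(dt["source_delay"] + d["delay"] > period_ns for d in dt["dests"]):
        return True
    # 2. Resource allocation violation: the DT demands more functional
    #    units of some type than were allocated.
    demand = {}
    for t in [dt["source_type"]] + [d["type"] for d in dt["dests"]]:
        demand[t] = demand.get(t, 0) + 1
    if any(n > allocation.get(t, 0) for t, n in demand.items()):
        return True
    # 3. Local temporal inconsistency: neighbouring DTs have already
    #    pinned two of this DT's operations to different csteps.
    names = [dt["source_name"]] + [d["name"] for d in dt["dests"]]
    pinned = {fixed_csteps[n] for n in names if n in fixed_csteps}
    return len(pinned) > 1

# Figure 6's situation: a 4 ns source addition chained into a 7 ns
# multiplication against a 10 ns cstep period.
dt = {"source_delay": 4, "source_type": "add", "source_name": "a1",
      "dests": [{"delay": 7, "type": "mul", "name": "m1"}]}
print(needs_partitioning(dt, 10, {"add": 1, "mul": 1}, {}))  # True
```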

4.6 Scheduling constraints

We have defined DTs and formulated scheduling, binding, and partitioning in terms of them. Just as in conventional models for HLS, there are scheduling constraints between DTs. These dictate temporal relationships between pairs of DTs in the schedule of our synthesized architecture. There are two types of scheduling constraints between DTs in our model: precedence constraints and concurrency constraints.

Precedence constraints In the box to the left in Figure 9, DT1 and DT2 share an operation: the source operation of DT2 is one of the destinations of DT1. Therefore, DT1 must be scheduled before DT2, and there is a precedence constraint between the two DTs, as indicated by the directed edge between them in the figure. In this form of precedence constraint, both DTs must still be scheduled in the same cstep; within the cstep, DT1 is scheduled before DT2. There is a second form of precedence constraint, which occurs between DTs of a DT partition. There is a precedence constraint between the source DT of the partition and each other DT. However,

the constraint also dictates that there be at least one cstep boundary between the two DTs concerned. This is shown in the box to the right in Figure 9.

Figure 9: Precedence constraints between DTs

Concurrency constraints In Figure 10, DT1 and DT2 share a destination operation. If all the operations in both DTs must be scheduled in the same cstep, then DT1 and DT2 must be scheduled at the same cstep. There is a concurrency constraint between the two DTs, indicated by the dotted, undirected edge between them.

Figure 10: Concurrency constraints between DTs

4.7 The DT flow graph

The scheduling constraints allow us to build a counterpart of the DFG of conventional HLS models. We define a data-transfer flow graph (DTFG) in which nodes represent DTs, directed edges represent precedence constraints between DTs, and undirected edges represent concurrency constraints between them. Figure 11 shows a DTFG. Note that the connected components formed by the undirected edges are sets of DTs that must be scheduled in the same cstep. The DTFG is the intermediate form that a DT-model-based HLS system uses while performing synthesis.

Figure 11: A DT flow graph
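One plausible encoding of the DTFG, with the connected-component property made explicit, is sketched below. The class layout and method names are illustrative assumptions, not Midas internals.

```python
# Sketch of a DT flow graph (DTFG): nodes are DTs, directed edges are
# precedence constraints, undirected edges are concurrency constraints.
# A connected component over the undirected edges must share a cstep.

class DTFG:
    def __init__(self):
        self.precedence = {}   # dt -> set of successor DTs
        self.concurrency = {}  # dt -> set of DTs sharing its cstep

    def add_precedence(self, before, after):
        self.precedence.setdefault(before, set()).add(after)

    def add_concurrency(self, a, b):
        self.concurrency.setdefault(a, set()).add(b)
        self.concurrency.setdefault(b, set()).add(a)

    def same_cstep_group(self, dt):
        # Connected component of dt under the undirected edges.
        group, frontier = {dt}, [dt]
        while frontier:
            for n in self.concurrency.get(frontier.pop(), ()):
                if n not in group:
                    group.add(n)
                    frontier.append(n)
        return group

g = DTFG()
g.add_precedence("DT1", "DT2")
g.add_concurrency("DT2", "DT3")
g.add_concurrency("DT3", "DT5")
print(sorted(g.same_cstep_group("DT2")))  # ['DT2', 'DT3', 'DT5']
```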

4.8 Benefits of the DT-model formulation

The DT-model has several interesting benefits. It raises the status of the data-related subproblems of high-level synthesis: since DTs need to be consistent before they can be scheduled and bound, since DT partitioning directly impacts the storage unit, and since DT scheduling and binding force data and operation scheduling and binding to be performed simultaneously, the data issues of high-level synthesis can no longer be ignored or treated separately. An HLS system based on the DT-model is forced to consider the impact of scheduling and binding on the storage unit architecture. The cost of using the DT-model is the additional subproblem, DT-partitioning. Depending on its implementation, the computational overhead can be minimal.

5 Midas: An HLS system based on the DT-model

Midas is a deterministic, constructive high-level synthesis system based on the DT-model. It synthesizes a register-transfer level (RTL) architecture from the behavioral description, optimizing the area of the resulting ASIC given a constraint on the maximum execution time of the behavior. Inputs to Midas are a behavioral description of the application, a technology library of combinational components and parametric register file models, a clock cycle period, and a maximum execution time for the behavior. Midas is driven by a set of interwoven heuristics and cost measures. The top-level flowchart of Midas is shown in Figure 12.

The main synthesis loop proceeds as follows: translate the DFG to a DTFG; update the design measures (timeframes, DT costs, probability of partitioning, ASIC area, resource utilization); partition inconsistent DTs, minimizing partition size and maximizing the partitioned DTs' timeframes; evaluate the cost of each (DT, cstep) scheduling action and commit the minimum-cost one; bind the DT using a branch-and-bound heuristic; place the bound DT using an incremental floorplanner; repeat until all DTs are scheduled, then output the design.

Figure 12: Flowchart of Midas

5.1 Design measures

Time frames The timeframe of a data transfer is the set of csteps in the schedule in which the data transfer may be scheduled. The timeframe is defined by an "as soon as possible" (ASAP) cstep and an "as late as possible" (ALAP) cstep, and is computed in two phases. First, the timeframes of the operations in the dataflow graph are computed. Then, for each data transfer, the intersection of the timeframes of its constituent operations is computed; this is the native timeframe. It is then refined by considering DTFG precedence constraints. Midas tries to maximize the lengths of the timeframes of data transfers while it schedules. Doing this avoids premature pruning of the scheduling search space.
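The first phase, the native timeframe, is just an interval intersection. The sketch below assumes the operation ASAP/ALAP windows are given as inputs rather than computed from a real DFG.

```python
# Sketch of the native-timeframe computation: the intersection of the
# ASAP/ALAP windows of a DT's constituent operations.

def native_timeframe(op_windows):
    """op_windows: list of (asap, alap) cstep pairs for the DT's
    operations. Returns their intersection, or None if it is empty
    (an empty intersection means the DT is inconsistent and must be
    partitioned)."""
    asap = max(w[0] for w in op_windows)
    alap = min(w[1] for w in op_windows)
    return (asap, alap) if asap <= alap else None

# Three operations with windows (1,4), (2,6), and (1,5): the DT may
# only be scheduled in csteps 2..4.
print(native_timeframe([(1, 4), (2, 6), (1, 5)]))  # (2, 4)
```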

Probability of partitioning a data transfer A data transfer can have parent and child data transfers, as defined by precedence constraints; it also has sibling data transfers, defined by concurrency constraints. We define a probability of partitioning for each data transfer. Scheduling a data transfer at a cstep requires its parents, children, and siblings to be scheduled either at the same cstep or, in the case of the second form of precedence constraint (between DTs of a DT-partition), at an adjacent cstep. The number of csteps in the intersection of the timeframe of the data transfer and those of its parents, children, and siblings (appropriately normalized) gives us a measure of how likely it is that the data transfer can be scheduled without requiring partitioning. Midas tries to minimize the probability of partitioning of data transfers.

Utilization of resources Midas maintains a set of distribution graphs (DGs) representing the utilization of different types of resources. A DG is maintained for every functional unit and is used to estimate the area of the execution unit. Another is maintained for the cumulative DT-cost in a cstep; this estimates the area of the buses or interconnect network. Finally, for each register file, one DG is maintained for its size, one for its read ports, and one for its write ports; the three combined allow us to estimate the area of the register file. Midas tries to minimize the sum of these area estimates.

ASIC area estimates The floorplanner builds up a floorplan incrementally and can supply an estimate of the area that the partial design takes up. Midas tries to minimize the increase in this number at every iteration through the main synthesis loop.

5.2 Solving the high-level synthesis subproblems

Midas performs data-transfer partitioning, scheduling, binding, and placement each time through the synthesis loop. Midas partitions only those data transfers that are inconsistent. It uses a modified left-edge algorithm to form clusters of operations that can be scheduled at the same cstep. It minimizes the number of such clusters as the first optimization criterion and maximizes the intersection of the timeframes of operations in each cluster as the second. The resulting data-transfer partition is as small as possible, and the individual data transfers still have a high degree of schedulability. Scheduling is cost driven. In Midas, the cost of a candidate scheduling action is a weighted sum of the increases in the floorplanner's estimate of the area of the partial design and in the projected ASIC area. If this does not resolve the choice between two candidate scheduling actions, the number of DT-partitions implied and the probability of having to partition at least one more DT beyond those are used, in that order. The scheduling action that leads to the minimum cost is committed. Midas binds the scheduled data transfer using a branch-and-bound search mechanism. The monotonically increasing cost measure used to guide the search is the increase in the floorplanner's estimate of ASIC area as a result of each individual binding action. These can be operation binding, variable binding, and bus, port, and terminal assignment. The branch-and-bound search combines binding and resource allocation by allowing an operation or variable to be bound either to an existing resource or register file or to a newly added one. Finally, the newly bound data transfer is passed to the floorplanner for placement. The floorplanner works in two modes. In incremental mode, it returns area-increase estimates in O(log n) time, where n is the number of components in the floorplan. When the data transfer is scheduled and bound and finally sent for placement, the floorplanner reverts to its full complexity and completely recomputes the floorplan of the partial design. This prevents estimation errors from propagating over synthesis loop iterations.

6 Results

In order to test the performance of the DT-model in a meaningful way, we used a scaled-down version of Midas; a detailed treatment of Midas will be the subject of another paper. We tested our model's performance on existing criteria for judging storage and data-transfer subsystems: the numbers of registers, storage read ports, storage write ports, and buses in the synthesized design. To show these effects clearly, we disabled floorplanning in Midas, which is not central to the DT-model. The cost of a candidate scheduling action was simplified from the increase in ASIC area to the increase in the minimum number of a single specified resource type. As a basis for comparison, we constructed a high-level synthesis system based on the conventional operation-based model. The system has an identical iterative scheduling loop and uses identical cost functions, as long as the cost functions are not specific to the DT-model. In effect, the only difference between the two systems is that one attempts to minimize DT-partitioning and the other does not. We performed runs of both systems on several inputs, including the 5th-order elliptic filter and a motion estimator. These two examples are reasonably large, having 58 and 61 operations respectively. For each input, we varied the allowed schedule length from 30 control steps down to the minimum possible. For each input and schedule length pair, we performed runs optimizing separately for the number of one type of resource: functional units, storage size, storage read ports, storage write ports, and buses. To evaluate the DT-model, we disabled chaining and forced each operation to take one full control step to execute. We extracted the functional unit utilization, the storage size, the storage bandwidth, the number of buses, and the number of data transfers of each synthesized design.

Table 1 displays the storage and data subsystem requirements of architectures synthesized by Midas under different inputs and optimization goals. The results show that Midas's performance is very close to that of the operation-based HLS system with regard to the resource being optimized for. In addition, Midas returns better storage and data-transfer subsystem designs.

Table 1: Experimental results: Midas versus operation-based HLS system

In Figure 13, we see the results of runs on the elliptic filter benchmark. As can be seen, Midas consistently generates architectures with lower numbers of registers, storage read and write ports, and buses; the number of data transfers in the resulting designs is smaller too. In Figure 14, we see the results of runs on the motion estimation example, which show the same trends. There are a few cases where Midas's performance is a little worse than that of our operation-based HLS system, which is not unexpected: Midas operates with heuristics, as does the operation-based system, and is expected to perform better only on average. Midas generated architectures with smaller storage and data subsystems, and we observed this trend in all of our experiments.

7 Conclusions

We have presented a data-transfer model for formulating the classic subproblems of high-level synthesis. The model introduces an additional subproblem, DT-partitioning, and compels the core high-level synthesis algorithms (scheduling, allocation, and binding) to consider their effects on the storage size, storage bandwidth, and bus size of the synthesized design. Our experiments have demonstrated that Midas, an HLS system based on the data-transfer model, generates smaller storage and data-transfer subsystems than an HLS system that does not use the data-transfer model but is otherwise identical.

Acknowledgments We would like to thank Sun Microsystems and the Rational Software Corporation, whose generous donations made this work possible. The work was funded in part by NSF grant CCR-9696196.

References

[1] K.-M. Yang, M.-T. Sun, and L. Wu, "A Family of VLSI Designs for the Motion Compensation Block-Matching Algorithm," IEEE Transactions on Circuits and Systems, vol. 36, pp. 1317-1325, October 1989.

[2] F. Catthoor, M. van Swaaij, J. Rosseel, and H. De Man, "Array Design Methodologies for Real-Time Signal Processing in the CATHEDRAL-IV Synthesis Environment," Algorithms and Parallel VLSI Architectures II: Proceedings of the International Workshop, pp. 211-221, 1992.

[3] J. L. van Meerbergen, P. E. R. Lippens, W. F. J. Verhaegh, and A. van der Werf, "Phideo: High Level Synthesis for High Throughput Applications," Journal of VLSI Signal Processing, vol. 9, pp. 89-104, May 1995.

[4] F. Balasa, F. Catthoor, and H. J. De Man, "Practical Solutions for Counting Scalars and Dependences in ATOMIUM, A Memory Management System for Multidimensional Signal Processing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 16, pp. 133-145, February 1997.

[5] D. J. Kolson, A. Nicolau, and N. Dutt, "Elimination of Redundant Memory Traffic in High-Level Synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, pp. 1354-1364, November 1996.

[6] I. Verbauwhede, C. Scheers, and J. Rabaey, "Memory Estimation for High Level Synthesis," Proceedings of the 31st Design Automation Conference, pp. 143-148, 1994.

[7] K. Danckaert, F. Catthoor, and H. De Man, "System level memory optimization for hardware-software co-design," Proceedings of the Fifth International Workshop on Hardware/Software Codesign, pp. 55-59, 1997.

[8] D. Lanneer, M. Cornero, G. Goossens, and H. De Man, "Data Routing: A Paradigm for Efficient Data-Path Synthesis and Code Generation," Proceedings of the 7th International Symposium on High Level Synthesis, pp. 17-22, 1994.

[9] C.-Y. Wang and K. K. Parhi, "High-Level DSP Synthesis Using Concurrent Transformations, Scheduling, and Allocation," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 274-295, March 1995.

[10] A. C. Parker, J. T. Pizarro, and M. Mlinar, "MAHA: A Program for Datapath Synthesis," 23rd IEEE Design Automation Conference, pp. 461-466, 1986.

[11] B. S. Haroun and M. I. Elmasry, "Architectural Synthesis for DSP Silicon Compilers," IEEE Transactions on Computer-Aided Design, vol. 8, pp. 431-447, April 1989.

[12] J. Rabaey and M. Potkonjak, "Estimating implementation bounds for real time DSP application specific circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, pp. 669-683, June 1994.

[13] S. Amellal and B. Kaminska, "Functional Synthesis of Digital Systems with TASS," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, pp. 537-552, May 1994.

[14] P. K. Jha and N. D. Dutt, "Library Mapping for Memories," Proceedings of the European Design and Test Conference, pp. 288-292, 1997.

[15] T. Kim and C. L. Liu, "A new approach to the multiport memory allocation problem in data path synthesis," Integration, the VLSI Journal, vol. 19, pp. 133-160, November 1995.

[16] H. D. Lee and S.-Y. Hwang, "A scheduling algorithm for multiport memory minimization in datapath synthesis," Proceedings of ASP-DAC '95/CHDL '95/VLSI '95, pp. 93-100, 1995.

[17] J. Hoogerbrugge and H. Corporaal, "Transport-triggering vs. operation triggering," Lecture Notes in Computer Science 768, Compiler Construction, pp. 435-449, 1994.

Figure 13: Results of run on elliptic filter (adders minimized)

Figure 14: Results of run on motion estimator (subtractors minimized)
