Fast and Adaptive Data-flow and Data-transfer Scheduling for Large Design Space Exploration

Y. Le Moullec, J-Ph. Diguet, D. Heller and J-L. Philippe
LESTER, UBS Research Center, 56325 Lorient, France ([email protected])

ABSTRACT

The integration opportunities offered by technological and methodological advances make it possible to create heterogeneous systems that provide high levels of parallelism on a single chip, such as FPGAs hosting a processor core and co-processors. However, in order to exploit this parallelism optimally, the architectural exploration step, which is currently underdeveloped, must be improved. Moreover, input/output management is now a key issue for heterogeneous System-On-Chip (SOC) design. The aim of the "Design Trotter" framework is to explore the design space at the system level by quickly computing resources/delay trade-off curves that provide the functional blocks of an application with alternative architectures. In this paper we present one particular point of this framework, namely a scheduling algorithm. The method relies on a generic and evolutive architecture that is iteratively constructed during the exploration phase.

Keywords: system on chip, load balancing, parallelism, scheduling, data transfers.

1. INTRODUCTION

Due to hard constraints in terms of power and performance, the design of embedded systems on chip must aim at a drastic reduction of the power/area ratio and an increase of the Gops/area ratio, while using relatively low clock frequencies. Such performance can be obtained if the intrinsic parallelism of the application is correctly exploited. Moreover, future re-configurable circuits will benefit from large densities; this evolution makes possible the intensive use of numerous processing units within functional blocks dedicated to the application. This paper focuses on the scheduling problem within functional blocks as the first step of a design space exploration tool for complex applications. The target architecture, depicted in figure 1, is heterogeneous and includes a processor with dedicated accelerators or co-processors implemented in a re-configurable chip. As demonstrated in [13], a largely parallel hardware implementation is power efficient compared to an implementation on a general-purpose processor. Regarding the methodology, the issue is to produce power/performance-efficient architectures within a very short time window.

[Figure 1: Heterogeneous architecture model — a processor with local memories, accelerators, co-processors and an I/O interface to the main memories.]

On the one hand, simulation and design automation tools are now mature enough to speed up the design flow once the architecture is correctly specified. On the other hand, large efforts are still necessary to develop design exploration tools that guide the designer towards an optimized architecture. In practice, the designer faces complex applications that include many functional blocks (filter, quantifier, DCT, motion estimation, ...), each of which can be designed with numerous different architectures (cf. figure 1). An efficient design space exploration requires a large architecture database, i.e., resources/delay trade-off curves for each functional block. This database is then used during system-level synthesis, which can be defined as a kind of scheduling problem based on coarse-grain operations with undecided but bounded latencies. In this study we deal with the computation of these trade-off curves, which is based on a fast scheduling algorithm. However, to be efficient, the scheduling policy must be adapted firstly to the function characteristics, which can be data-flow, data-transfer or control oriented, and secondly to the local memory size, which influences the I/O transfer rate. Existing scheduling algorithms have predefined application domains; in this paper we propose a unified method to explore architectural solutions whatever the function orientation. The rest of the paper is organized as follows: section 2 presents related scheduling work and section 3 discusses the algorithm principles. One possibility for increasing parallelism is to unfold loops; this point is detailed in section 4. Experimental results are presented in section 5 and we conclude in section 6.

2. RELATED WORK

At the system level, power and area estimators usually rely on allocation and scheduling of more or less accurately defined resources. These algorithms are usually dedicated to specific application domains. For instance, list-scheduling (LS) [1] and force-directed scheduling (FDS) [9] target processing-oriented applications described with data-flow graphs. A second class of algorithms is dedicated to control-oriented applications; the aim of these algorithms, like path-based scheduling [11], improved in [8], is to optimize the finite-state machine. A third algorithm family has been defined to optimize the data-access scheduling, namely the memory bandwidth, which is a key issue for data-transfer dominated applications. In [12], multi-dimensional data accesses are ordered before memory allocation; processing operations are scheduled under data-transfer constraints, after memory and data allocation, at the very end of the flow. Conversely, in [3] data-transfers are scheduled, at a scalar level, after the scheduling of processing operations. An interesting method, proposed in [10], finds a trade-off between data-flow and control-flow approaches. The heterogeneous nature of functional blocks within complex applications precludes the use of a single scheduling method. Thus, memory and processing aspects are combined in the algorithm we present hereafter. Due to space limitations, control-flow scheduling will be described in a subsequent publication.

3. DESIGN SPACE EXPLORATION ALGORITHM

3.1 Design Flow

While the use of static (i.e., predefined) libraries can be sufficient for a high-level synthesis tool that uses atomic components such as multipliers and adders, things are different for complex applications (filters, transforms, quantification, etc.) made of several functional blocks which can differ greatly in terms of size, consumption and cost characteristics. These functional blocks are complex enough to justify a dynamic architectural exploration. Therefore, our method is based on dynamic trade-off curves which represent different architectural alternatives (or configurations) for different time constraints. The estimation strategy for system synthesis has to be hierarchical in order to explore, in a reasonable time, intra- and inter-component architectural alternatives. The architectural alternatives for each functional block are then used as a database for the architectural design of the application, as explained in [5]. Based on these ideas, the design flow depicted in figure 2 has been devised. The intra-function estimation process is responsible for exploring the parallelism at the instruction level. The idea is to estimate resource and bandwidth costs for several time constraints (expressed in numbers of cycles). The goal is to provide the designer and the design flow with a range of architectural alternatives that offer different managements of the data-flow and memory intrinsic parallelism. This process is based on a time-constrained scheduler that minimizes the amount of resources as well as the bandwidth, i.e., the data-transfers between local memories (registers and caches) and slower main memories. The trade-off curves resulting from this first phase are then combined during the inter-function estimation in order to distribute the resource and memory-bandwidth budgets over the whole application. In this paper we detail the intra-function estimation process.

3.2 Models and Definitions

The input of the design flow is a 'C' description of the application. The application is decomposed into functions which are parsed into Hierarchical Control and Data Flow Graphs (HCDFG) [6].

[Figure 2: The design flow — tasks and threads, possibly on top of a real-time OS, go through control & data-flow transformations; functions F1–F6 are explored at the intra-function level (resources/bandwidth versus allocated cycles) and then at the inter-function level.]

In these graphs, nodes represent either data transfers (read and write), processing operations (e.g., addition, mac, etc.) or complex functions previously estimated. The edges between the nodes represent data or control dependencies. Nodes are characterized by several attributes such as data type (i.e., data format), memory locality (i.e., hierarchy), etc. In order to start the estimation processes it is necessary to have information about the types of resources available in the generic architecture model. The designer must define a set of rules, named "UAR" (User Abstract Rules). The processing part is characterized by the types of available resources (ALU, MAC, etc.) and the operations they can perform; a number of cycles is associated with every type of operator. For the memory part, the user defines the number of levels (1..N) of the hierarchy; for each level he defines the size and the access time. We intend to add some existing memory size estimators [4] to guide these choices. In what follows we have restricted the hierarchy to two levels: a global memory with an access time of 2 cycles and a local memory (in fact a register file) with an access time of 1 cycle (a register-register transfer taking one cycle). We distinguish several types of memory nodes:

1. input/output nodes (N1)
2. temporary data, produced by computations (N2)
3. re-usable data, i.e., re-used input nodes (N3)
4. accumulator data (N4)

N1 data are always global and N4 data are always local; N2 and N3 data can initially be local (stored in the register file) but they can be moved to the global memory if the requirements exceed the local memory size. A minimal sketch of such a rule set is given below.
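To make the UAR concrete, here is a minimal sketch of how such a rule set could be written down. Since the framework itself is implemented in Java (cf. section 5) the sketch uses Java, but the class and field names (UserAbstractRules, MemoryLevel, ...) are illustrative assumptions, not the actual Design Trotter API. It encodes the two-level hierarchy used in the rest of the paper.

```java
import java.util.Map;

// Illustrative sketch of a UAR (User Abstract Rules) definition; all names
// are hypothetical, not Design Trotter's actual interface.
public class UserAbstractRules {
    // Processing part: number of cycles per operator type (cf. section 3.2).
    final Map<String, Integer> operatorLatency = Map.of(
        "ALU", 1,   // additions, subtractions, logic operations
        "MAC", 1    // multiply-accumulate
    );

    // Memory part: one entry per hierarchy level (size in words, access time in cycles).
    record MemoryLevel(String name, int size, int accessTime) {}
    final MemoryLevel[] hierarchy = {
        new MemoryLevel("register file", 16, 1),  // local level, 1-cycle access
        new MemoryLevel("global memory", 4096, 2) // global level, 2-cycle access
    };
}
```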

In order to guide the function analysis we have defined a metric called DRM (Data Reuse Metric). This metric takes into account the local memory size, which has to be estimated. Given the abstraction level of the method, accurate temporal estimations are not essential; in fact, the relative distribution of cycles is sufficient to compare several solutions. Moreover, as we want to obtain estimates quickly, methods such as clique coloring have been discarded. We use the average data lifetime to estimate the quantity of data alive in each cycle, from which the minimum memory size can be derived. The minimum and maximum data-life of a data item $d$ are defined as follows:

$$MinDL_d = ASAP_{d_n} - ASAP_{d_1} + 1$$
$$MaxDL_d = ALAP_{d_n} - ASAP_{d_1} + 1$$

where $ASAP$ and $ALAP$ are respectively the earliest and latest scheduling dates, and $d_1$ and $d_n$ respectively the earliest and the latest read access to data $d$ for a given time constraint. The average data-life of data $d$ is then given by:

$$AvDL_d = \frac{1}{2}\left(MinDL_d + MaxDL_d\right)$$

Finally, the number of data alive per cycle is given by:

$$LMS = \frac{1}{T} \sum_d AvDL_d$$

where $T$ is the number of cycles allocated to the estimated function. The number of local transfers turning into global transfers because of a too small local memory is given by:

$$TLG = \begin{cases} (LMS - UM) \cdot T & \text{if } LMS > UM \\ 0 & \text{otherwise} \end{cases}$$

where $UM$ is the local memory size (defined by the user). The DRM metric gives the global/local accesses ratio. A local access which produces a memory conflict (local memory full) involves a global read and a global write; thus the number of extra global accesses is $\alpha \cdot TLG$, with $\alpha$ ranging from 0 to 2 according to the architecture: $\alpha$ tends toward 0 for a superscalar architecture, and toward 2 for an architecture that includes a main memory, no cache, and a local memory made of a register file (the model considered in the example). Finally, the DRM metric is obtained as follows:

$$DRM = \frac{N1 + \alpha \cdot TLG}{N1 + N2 + N3 + N4}$$
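As an illustration, the following sketch computes $AvDL$, $LMS$, $TLG$ and $DRM$ exactly as defined above; the method names and data layout are assumptions made for this example only. For the two-level model considered in the paper (register file, no cache) one would call it with $\alpha = 2$.

```java
// Minimal sketch of the DRM computation described above. Data structures and
// method names are assumptions for illustration, not Design Trotter code.
public class DrmMetric {
    /** Average data-life of one data item d, from its earliest/latest read accesses. */
    static double avDL(int asapD1, int asapDn, int alapDn) {
        double minDL = asapDn - asapD1 + 1;   // MinDL_d
        double maxDL = alapDn - asapD1 + 1;   // MaxDL_d
        return 0.5 * (minDL + maxDL);         // AvDL_d
    }

    /** DRM = (N1 + alpha*TLG) / (N1 + N2 + N3 + N4). */
    static double drm(double[] avDLs, int T, int UM, double alpha,
                      int n1, int n2, int n3, int n4) {
        double lms = 0;                       // LMS = (1/T) * sum_d AvDL_d
        for (double a : avDLs) lms += a;
        lms /= T;
        // Local transfers that overflow into global ones when LMS exceeds UM.
        double tlg = lms > UM ? (lms - UM) * T : 0;
        return (n1 + alpha * tlg) / (n1 + n2 + n3 + n4);
    }
}
```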

3.3 Algorithm Principles

With the information contained in the graph and the designer rules (UAR), the goal of the method is to minimize the quantity of resources and the bandwidth requirement for several time constraints. The scheduling principle is a time-constrained list-scheduling heuristic where the number of resources of type $i$ allocated at the beginning is given by the lower bound:

$$Nb\_resources_{type\,i} = \left\lceil \frac{Nb\_operations_{type\,i}}{T} \right\rceil$$

The method includes three sub-algorithms: processing first, memory first and mixed. Dealing with processing nodes or memory accesses first reduces to the minimum the resources used for the first type treated (which is the most critical one) and imposes constraints for scheduling the other type. The processing first and memory first algorithms are used respectively for processing- and memory-oriented functions. For each cycle the average number of resources (resp. buses) is computed and the processing (resp. memory) nodes are scheduled. Then the ASAP and ALAP dates of the memory (resp. processing) nodes are updated according to the scheduling dates of the processing (resp. memory) nodes. Finally, the memory (resp. processing) nodes are scheduled, using the same average computation technique. If "free" cycles remain after the processing (resp. memory) scheduling, the "draw pile" technique (cf. 3.4.2) is used if necessary. However, traditional architectural synthesis approaches based on the memory/processing dissociation are not adapted to all cases: sometimes processing and memory scheduling must be considered simultaneously, usually when memory transfers and processing are balanced. The third algorithm, named "mixed", is used when it is not possible to define the function orientation precisely. In that case, both memory and processing nodes are handled with the same priority. As with the two other algorithms, the average resource number is computed for every cycle and nodes are scheduled according to their priority (which is a function of the nodes' ASAP and ALAP dates as well as their mobility). In order to minimize the quantity of necessary resources and to favor reuse, several techniques, described in the next section, have been employed.
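As an illustration of how the function orientation could drive the choice among the three sub-algorithms, here is a hypothetical dispatch based on the DRM value; the thresholds are invented for this sketch, the paper does not specify numerical bounds.

```java
// Hypothetical selection of the sub-algorithm from the DRM value; the 0.7/0.3
// thresholds are assumptions for illustration only.
public class SchedulerDispatch {
    enum Algorithm { MEMORY_FIRST, PROCESSING_FIRST, MIXED }

    static Algorithm select(double drm) {
        if (drm > 0.7) return Algorithm.MEMORY_FIRST;     // memory-oriented function
        if (drm < 0.3) return Algorithm.PROCESSING_FIRST; // processing-oriented function
        return Algorithm.MIXED;                           // no clear orientation
    }
}
```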

3.4 Improving Resource Reuse and Distribution

3.4.1 Online average resource computation (OARC)

One of the drawbacks of list-scheduling is its linear handling of the scheduling (which also makes it fast, its complexity being O(n)). If, because of data dependencies, the quantity of resources allocated at the beginning of the scheduling process is not sufficient, new resources are allocated during scheduling, but they may be under-exploited. To minimize this over-allocation, we propose to compute the average number of resources for every cycle while scheduling (complexity O(1)) in order to "smooth" the trade-off curves. The average number of resources of type $i$ is computed as follows:

$$Nb\_average\_resources_{type\,i} = \frac{Nb\_remaining\_nodes\_using\_resource_{type\,i}}{Nb\_remaining\_cycles}$$

The example in figure 3 shows how the OARC method saves one memory bus: in the first case (A), the average number of buses computed is 2 and, at the end of the scheduling, two extra under-used buses must be added. When scheduling with the OARC technique (B), three buses are allocated and they are used more efficiently. The sketch below illustrates the two allocation formulas.
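The following sketch implements the two allocation formulas involved: the initial lower bound of section 3.3 and the O(1) OARC update. The method names are illustrative, and rounding up to an integer resource count is our assumption; the paper only gives the ratios.

```java
// Sketch of the initial lower bound and the OARC per-cycle update.
public class Oarc {
    /** Initial allocation: ceil(#operations of type i / time constraint T). */
    static int initialAllocation(int nbOperations, int T) {
        return (nbOperations + T - 1) / T;   // integer ceiling
    }

    /** OARC: average resource need, recomputed in O(1) at every cycle. */
    static int averageResources(int remainingNodes, int remainingCycles) {
        // Rounded up so the remaining nodes still fit in the remaining cycles.
        return (remainingNodes + remainingCycles - 1) / remainingCycles;
    }
}
```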

3.4.2 The "draw pile"

This technique is applied for the processing first and memory first algorithms. Once processing or memory nodes have been scheduled, some unused cycles may remain (i.e., the execution time is shorter than the time constraint); these cycles, located at the end of the schedule, are collected in a "draw pile". While scheduling, respectively, memory accesses or processing nodes, it is possible to "pick" free cycles from the pile in order to shift the nodes for which the number of resources is not sufficient. Once the processing or memory accesses schedule has been obtained it cannot be modified; the only possible variation is to shift all the scheduled nodes to the "right", taking benefit from the liberty offered by the "draw pile". The example in figure 4 depicts how this technique works: the memory first algorithm is used with a time constraint of 9 cycles. Firstly, the memory accesses are scheduled (A); then processing nodes are scheduled and two cycles (7, 8) are not used (B). When the "draw pile" technique is applied (C), it is possible to shift the memory accesses schedule to the right, which enables the insertion of some processing nodes; only one processing resource is then needed instead of two, as sketched below.

[Figure 3: The OARC technique — (A) scheduling without the OARC technique, (B) scheduling with the OARC technique.]

[Figure 4: The draw pile technique — (A) memory accesses scheduled first, (B) processing nodes scheduled with cycles 7 and 8 unused, (C) the shifted schedule obtained with the draw pile.]
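Below is a minimal sketch of the draw-pile bookkeeping described above, assuming the pile is simply a stack of the trailing free cycles; all names are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the draw pile: free cycles left at the end of the first schedule
// are stacked, and picking one allows shifting the whole first schedule one
// cycle to the right (it may never be reordered).
public class DrawPile {
    private final Deque<Integer> freeCycles = new ArrayDeque<>();
    private int shift = 0;   // how far the first schedule has been moved right

    /** Collect the unused cycles between the end of the schedule and T. */
    void collect(int scheduleEnd, int timeConstraint) {
        for (int c = scheduleEnd; c < timeConstraint; c++) freeCycles.push(c);
    }

    /** Pick a free cycle to make room for a node lacking resources. */
    boolean pick() {
        if (freeCycles.isEmpty()) return false;  // no slack left
        freeCycles.pop();
        shift++;
        return true;
    }

    int currentShift() { return shift; }
}
```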

4. LOOPS ISSUES

Unfolding loops increases the potential parallelism and thus reduces the critical path of the function, which can be helpful to schedule a function successfully under a given time constraint. However, unfolding a loop is limited by data dependencies. We distinguish two types of dependencies: 1) functional dependencies, which are linked to the recursivity of the data transformation (for instance when the computation of A[i] needs the result of A[i-4]), and 2) structural dependencies, which are simply the result of the writing style. The first type cannot be avoided unless the algorithm is modified using functional transformations; the latter can be broken by renaming variables, i.e., by using structural transformations. For instance, a loop body of length $N$ containing

$$y(i) = y(i) + a(i) \cdot x(k - i)$$

can be re-written, as sketched below, as:

$$y_1(i) = y_1(i) + a(i) \cdot x(k - i)$$
$$y_2(i) = y_2(i) + a(i) \cdot x(k - i)$$
$$y(i) = y_1(i) + y_2(i)$$

where the lengths of $y_1$ and $y_2$ equal $N/2$. In the general case, a loop of size $N$, unfolded $L$ times, is transformed into an $L$-branch tree and its critical path is given by:

$$CP(L) = \left\lceil \frac{N}{L} \right\rceil + \left\lceil \log_2 L \right\rceil$$
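The structural transformation above can be sketched as follows for a 2-way unfolding (L = 2); the arrays and bounds are illustrative, and the renamed accumulators y1/y2 correspond to the two independent dependence chains.

```java
// Sketch of the structural transformation: a 2-way unfolded accumulation with
// renamed partial sums y1/y2, which breaks the single serial dependence on y.
public class Unfold2 {
    /** Computes y = sum_i a(i)*x(k-i); assumes k >= n-1 so indices stay in range. */
    static double dot(double[] a, double[] x, int k, int n) {
        double y1 = 0, y2 = 0;               // renamed accumulators
        for (int i = 0; i + 1 < n; i += 2) { // two independent chains per iteration
            y1 += a[i]     * x[k - i];
            y2 += a[i + 1] * x[k - (i + 1)];
        }
        double y = y1 + y2;                  // final reduction: y = y1 + y2
        if ((n & 1) == 1) y += a[n - 1] * x[k - (n - 1)]; // odd-length tail
        return y;
    }
}
```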

When it comes to functional dependencies, things are more complex. Research usually focuses on the unfolding factor $\alpha$ necessary to obtain the optimal rate, i.e., the unfolding factor which maximizes the throughput. In [2], it is shown that this $\alpha$ can be computed from $T_{max}$, $\Delta_{cr}$, $T_{cr}$ and their greatest common divisors (PGCD), where $T_{max}$ is the latency of the slowest operator, $\Delta_{cr}$ the number of delays and $T_{cr}$ the accumulated latency of the critical cycle. Our problem is different: we have to find the minimum unfolding factor that allows a loop to be scheduled with respect to a time constraint $T$. Therefore, we have re-formulated the problem as follows: if a loop requires $T''$ cycles to be scheduled without unfolding, we must find the unfolding factor needed to gain $G = T'' - T$ cycles. We can show that the number of cycles obtained after unfolding the loop with a factor $\alpha$ is:

$$T = \frac{T'' + (\alpha - 1)\, d_{min}}{\alpha} \qquad \text{where} \qquad d_{min} = \max_{cycles} \frac{T_{cr}}{\Delta_{cr}}$$

Finally, the unfolding factor $\alpha$ needed to gain $G$ cycles is given by:

$$\alpha = \frac{1}{1 - \dfrac{G}{T'' - d_{min}}}$$
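A small sketch of this re-formulation is given below, assuming integer cycle counts and rounding the factor up to the next integer; the function names are ours, not the tool's.

```java
// Sketch of the re-formulated unfolding computation: given the unconstrained
// cycle count T'' (tNoUnfold), the gain G = T'' - T and d_min, derive the
// minimal unfolding factor alpha, then the resulting cycle count.
public class UnfoldFactor {
    /** alpha = 1 / (1 - G/(T'' - dMin)); valid when 0 <= gain < tNoUnfold - dMin. */
    static int alpha(int tNoUnfold, int gain, int dMin) {
        double a = 1.0 / (1.0 - (double) gain / (tNoUnfold - dMin));
        return (int) Math.ceil(a);           // round up to an integer factor
    }

    /** Cycle count after unfolding: T = (T'' + (alpha - 1)*dMin) / alpha. */
    static int cyclesAfterUnfolding(int tNoUnfold, int alpha, int dMin) {
        return (int) Math.ceil((tNoUnfold + (alpha - 1.0) * dMin) / alpha);
    }
}
```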


The main points of our loop management policy are: i) loop unfolding is included in the trade-off curves computation as a solution to meet resource lower bounds; ii) structural and functional dependencies are both considered, whereas they are usually handled by frameworks dedicated to different application domains; and iii) the unfolding factor is computed with the aim of meeting time constraints rather than finding the optimal rate.

5. EXPERIMENTAL RESULTS

Our framework has been implemented in a Java tool named 'Design Trotter'. Among other capabilities, this tool computes the trade-off curves from a 'C' description of the application.

In order to illustrate the scheduling method discussed above, we have experimented with two functions. The first one is Lee's algorithm for the DCT computation [7]; the second one is an adaptive filter, described in Alg. 1.

    y = 0
    for i = 0 to 1023 do
        y = h(i) * e(k-i) + y
    end for
    s(k) = y
    adapt = 2 * (yt - y)
    h(0) = h(0) + adapt * xt
    for i = 0 to 1023 do
        h(i) = h(i) + adapt * e(i)
    end for
    @e(k) = @e(k) + 1 modulo 1024

    Alg. 1: Adaptive filter example

For the DCT example we have applied the three algorithms for several time constraints and memory sizes. The results are presented in figure 5. From the results we observe that when the DRM metric is high, the function is memory oriented; for instance, with a memory size of 8 and a tight time constraint, only the memory first algorithm obtains valid schedules, thanks to its priority choice. For smaller values of DRM, the function is processing oriented; processing regularity can lead to memory access regularity, so the processing first algorithm obtains better results than the mixed algorithm. Finally, when neither the time nor the memory size constraint is tight, the mixed algorithm takes advantage of its flexibility, i.e., there is no benefit in handling either memory or processing nodes before the other ones. The filter example has been used to test loop unfolding. Without unfolding, 10250 cycles are necessary to perform the algorithm (10 + (5120 * 2)) with 1 ALU, 1 MAC, 1 modulo operator and 2 buses. If the function must be performed in fewer than 3500 cycles, it is necessary to unfold the loops (which require 5120 cycles each) with α = 3; the new loop critical path is then CP(3) = ⌈5120/3⌉ + ⌈log2(3)⌉ = 1709. The whole function then needs 3428 cycles (1709 * 2 + 10), and 4 ALUs, 3 MACs, 1 modulo operator and 6 buses are necessary. If the time constraint is even smaller, e.g., if the function must be performed in fewer than 2600 cycles, it is necessary to unfold the loops with α = 4; the new loop critical path is CP(4) = 5120/4 + log2(4) = 1282, and the function then needs 2574 cycles, 7 ALUs, 4 MACs, 1 modulo operator and 8 buses. Thanks to the characterization of the functions it is possible to apply the most appropriate scheduling technique to each functional block, according to criteria such as the local memory size. The DCT example, often seen as a processing-oriented function, can in fact become memory oriented if the local memory is too small; indeed, this leads to more data-transfers between the local and global memories (swapping). If we compare our method to traditional processing-oriented list and force-directed scheduling methods, two cases must be considered. If the function is processing oriented, we obtain equivalent results (and faster than with force-directed scheduling) thanks to the trade-off curve computations. In the other cases (memory oriented or mixed), our method enables the optimization of memory transfers, which is often ignored by other techniques. A symmetric conclusion can be drawn for memory-oriented list and force-directed scheduling algorithms.

6. CONCLUSION

[Figure 5: Lee's DCT example — trade-off curves and DRM values for several memory sizes and time constraints.]

In order to select the most suitable architecture(s) when designing embedded systems, it is necessary to explore large solution spaces. Our tool, Design Trotter, relies on dynamic trade-off curves which are computed by the scheduling algorithm presented in this paper. Based on an HCDFG representation of the application, and following a set of rules (UAR) that parameterizes a generic heterogeneous architecture, our method selects, according to the function orientation, the most appropriate scheduling algorithm (processing first, memory first or mixed). Moreover, several techniques have been implemented in order to explore the parallelism and to favor resource reuse: the OARC, the draw pile and loop unfolding. The applicability of the method has been demonstrated with two examples, which illustrate how the designer can explore the design space by referring to the trade-off curves computed by the scheduling algorithm. The results obtained are encouraging and suggest interesting perspectives. Future work includes extending the method to programmable processors and taking power consumption into account.

7. REFERENCES

[1] A.C. Parker, J.T. Pizarro, and M. Mlinar. MAHA: A program for datapath synthesis. In 23rd Design Automation Conf., June 1986.
[2] D-J. Wang and Y.H. Hu. Rate optimal scheduling of recursive DSP algorithms by unfolding. TCS, 41:672–675, October 1994.
[3] D. Chillet. Méthodologie de conception architecturale des mémoires pour circuits dédiés au traitement du signal. PhD thesis, Université de Rennes I, France, January 1997.
[4] F. Balasa, F. Catthoor, and H. De Man. Background memory area estimation for multi-dimensional signal processing systems. IEEE Trans. on VLSI Systems, 3(2), June 1995.
[5] H. Thomas, J-Ph. Diguet, and J-L. Philippe. A methodology for an application profiling at a system level. In IEEE Work. on Signal Processing Systems (SiPS), Taiwan, October 1999.
[6] J-Ph. Diguet, G. Gogniat, P. Danielo, M. Auguin, and J-L. Philippe. The SPF model. In Forum on Design Language (FDL), Tübingen, Germany, September 2000.
[7] B.G. Lee. A new algorithm to compute the discrete cosine transform. IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-32(6):1243, 1984.
[8] M. Rahmouni and A.A. Jerraya. Formulation and evaluation of scheduling techniques for control flow graphs. In EURO-DAC, Brighton, UK, 1995.
[9] P.G. Paulin and J.P. Knight. Force directed scheduling in automatic data path synthesis. In 24th ACM/IEEE Design Automation Conf., pages 195–202, 1987.
[10] R.A. Bergamaschi, S. Raje, I. Nair, and L. Trevillyan. Control-flow versus data-flow-based scheduling: combining both approaches in an adaptive scheduling system. IEEE Trans. on VLSI Systems, 5(1):82–100, March 1997.
[11] R. Camposano. Path-based scheduling for synthesis. IEEE Trans. on Computer-Aided Design, 10(2), January 1991.
[12] S. Wuytack, F. Catthoor, G. De Jong, B. Lin, and H. De Man. Flow graph balancing for minimizing the required memory bandwidth. In 9th IEEE/ACM Int. Symp. on System Synthesis, pages 127–132, La Jolla, USA, November 1996.
[13] W.R. Davis, N. Zhang, K. Camera, F. Chen, D. Markovic, N. Chan, B. Nikolic, and R.W. Brodersen. A design environment for high throughput, low power, dedicated signal processing systems. In IEEE Custom Integrated Circuits Conference (CICC 2001), San Diego, CA, May 2001.
