
Efficient Variable Partitioning and Scheduling for DSP Processors with Multiple Memory Modules

Qingfeng Zhuge, Student Member, IEEE, Edwin H.-M. Sha, Member, IEEE, Bin Xiao, Student Member, IEEE, and Chantana Chantrapornchai

Abstract— Multiple on-chip memory modules are attractive to many high-performance DSP applications. This architectural feature supports higher memory bandwidth by allowing multiple data memory accesses to be executed in parallel. However, making effective use of multiple memory modules remains difficult. The performance gain on this kind of architecture strongly depends on variable partitioning and scheduling techniques. In this paper, we propose a graph model, the Variable Independence Graph (VIG), and algorithms to tackle the variable partitioning problem. Our results show that the VIG is more effective than the interference graph for solving the variable partitioning problem. We then present a scheduling algorithm, Rotation Scheduling with Variable Repartitioning (RSVR), to efficiently improve schedule lengths on a multiple memory module architecture. This algorithm adjusts the variable partition during scheduling and generates a compact schedule based on retiming and software pipelining. The experimental results show that the average improvement in schedule length is 44.8% when using RSVR with the VIG. We also propose a design space exploration algorithm using RSVR to find the minimum number of memory modules and functional units satisfying a schedule length requirement. The algorithm produces more feasible solutions with an equal or smaller number of functional units compared to the method using the interference graph. Index Terms— Retiming, scheduling, variable partitioning, DSP processor

I. INTRODUCTION

It is well known that high-performance DSP applications require strict real-time processing. However, the growing speed gap between CPU and memory has become a bottleneck in designing such real-time systems. To close this speed gap, embedded systems need to utilize on-chip memories. Since multiple memory modules can be exploited, variable partitioning and scheduling techniques become among the most important approaches to performance optimization [1]–[5]. To improve the overall performance, many digital signal processors employ the Harvard architecture, which provides simultaneous accesses to separate on-chip memory modules for instructions and data [6]–[9]. Some DSP processors are further equipped with dual data-memory banks that are accessible in parallel, such as the Motorola 56000 [6] and the Gepard Core DSPs [9], [10]. Since data can be partitioned and allocated to separate data-memory banks and accessed simultaneously, the dual data-memory bank architecture offers potentially higher memory bandwidth and thus improves system performance. This architectural feature is very attractive for high-performance DSP applications. In fact, many DSP routines, such as FIR filters, require the convolution of two data arrays as a kernel operation. Processors with dual memory banks can achieve higher memory bandwidth for this kind of application. However, many existing high-level language compilers cannot exploit the advantage of dual data-memory banks effectively.

Consider the data flow graph in Figure 1. The triangles in the graph represent memory operations (load/store), while the circles represent ALU operations. The edges represent dependencies between ALU or memory operations. On edges going from or to memory nodes, we put labels to indicate the array variables transferred between memory and computation nodes. For example, the edge between node 1 and node 2 with label C shows that node 2 requires a memory access to array C. Figure 2 shows how different schedules, as well as different variable partitions, can affect the schedule length. If we assign array variables A and C to the same memory module and B to the other, as shown in Figure 2(a), the schedule length is 7 cycles. However, if A and B are assigned to the same partition and C is assigned to the other, as in Figure 2(b), the schedule length is reduced by one cycle.

Manuscript received May 6, 2002; revised January 8, 2003; accepted April 28, 2003. This work was partially supported by the TI University Program, NSF EIA-0103709, and Texas ARP 009741-0028-2001, USA, and by NECTEC, NT-B-06-4D-16-509, Thailand. Qingfeng Zhuge, Edwin H.-M. Sha, and Bin Xiao are with the Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA (e-mail: {qfzhuge, edsha, bxiao}@utdallas.edu). Chantana Chantrapornchai is with the Department of Mathematics, Silpakorn University, Nakorn Pathom 73000, Thailand (e-mail: [email protected]).

Fig. 1. An example of a data flow graph.

Previous work [10], [11] tries to solve the variable partitioning problem on dual memory banks by using an interference graph. For most benchmarks used in our experiments, variable partitioning based on the interference graph model gives longer schedule lengths. One limitation of the interference graph model is that it can only be applied to a directed acyclic graph (DAG), so parallelism across the loop body from different iterations is not explored. A second problem of the interference graph is that it does not incorporate sufficient information for a schedule to exploit the potential parallel memory accesses. Other dual-bank variable partitioning techniques in previous work are restricted to specific architectures such as Motorola DSP processors [12]. The variable partitioning technique for multiple memory banks presented in [13] aims to find a mapping of array variables with minimal page misses. All the above techniques consider variable partitioning only as a separate phase before scheduling. Therefore, the quality of these techniques depends on the assumed scheduling algorithm, such as list scheduling. However, some widely used optimization techniques, such as software pipelining [14], change the dependencies in a loop. As a result, a predefined partition may not work well in practice.

(a)
CS  A0  A1  M0  M1
0   −   −   6   5
1   7   −   1   0
2   2   −   8   −
3   −   −   10  −
4   11  −   9   −
5   3   12  −   −
6   −   −   −   4

Partition 1 = {A, C}, Partition 2 = {B}

(b)
CS  A0  A1  M0  M1
0   −   −   5   6
1   7   −   0   1
2   2   −   8   10
3   11  −   9   −
4   3   12  −   −
5   −   −   4   −

Partition 1 = {A, B}, Partition 2 = {C}

Fig. 2. (a) Schedule of Figure 1 with partitions {A, C} and {B}. (b) Schedule of Figure 1 with partitions {A, B} and {C}. CS is the control step; A0 and A1 are ALU functional units; M0 and M1 are memory modules.
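To make the bank-conflict constraint behind Figure 2 explicit, the Python sketch below checks whether a candidate schedule is legal under a given two-bank partition: two memory operations may share a control step only if their variables reside in different banks. This is our own illustration with hypothetical names, not the partitioning algorithm of this paper.

def bank_of(var, partition):
    # partition is a pair of variable sets, e.g. ({'A', 'C'}, {'B'})
    return 0 if var in partition[0] else 1

def conflict_free(schedule, accessed_var, partition):
    # schedule: control step -> list of memory-operation nodes issued there
    # accessed_var: memory node -> variable it loads or stores
    for step, mem_ops in schedule.items():
        banks = [bank_of(accessed_var[v], partition) for v in mem_ops]
        if len(banks) != len(set(banks)):
            return False  # two accesses hit the same bank in one step
    return True

Under the partition ({'A', 'C'}, {'B'}), for instance, any pair of accesses to A and C must be serialized, which is why the schedule in Figure 2(a) needs seven cycles while that in Figure 2(b) needs only six.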

In this paper, we address the variable partitioning and scheduling problem on multiple memory module architectures to maximize the benefit of this architectural feature. We first present a new graph model, called the Variable Independence Graph (VIG), for the variable partitioning problem. Unlike the previous graph model, the Interference Graph (IG) [10], [11], our graph model can be used to analyze a cyclic data flow graph. We then provide a variable partitioning algorithm that attempts to maximize parallel memory accesses. Second, we develop a novel scheduling algorithm called Rotation Scheduling with Variable Repartitioning (RSVR), which considers software pipelining and variable partitioning simultaneously to generate compact schedules on multiple memory module architectures [14]–[17]. This algorithm adjusts the variable partition during the software-pipelining process so that parallel memory accesses can be exploited to produce a compact schedule. Finally, the RSVR algorithm is used in our design space exploration approach to achieve a feasible schedule with the minimum number of functional units and memory modules. Our experimental results show consistent improvement in schedule length. The improvement obtained by using RSVR based on the VIG is up to 52.9% compared to the approach using the interference graph model. The average improvement is

44.8%. Using our design exploration scheme, we can produce a better design solution with fewer functional units and memory modules. Also, we can produce a feasible design solution in many cases where an approach using the IG cannot.

This paper is organized as follows: Section II presents definitions and models used in this paper. In Section III, we present the construction of the VIG. Section IV provides a variable partitioning algorithm based on the VIG. Then a new scheduling algorithm, Rotation Scheduling with Variable Repartitioning, is presented in Section V. In Section VI, a design exploration framework using our algorithm is proposed for finding the minimum memory module and functional unit configuration satisfying a schedule length requirement. An example and experimental results are presented in Section VII. Finally, Section VIII concludes the paper.

II. DEFINITIONS AND GRAPH MODELS

In this section, we present graph models related to the variable partitioning problem. We also introduce important definitions used in our algorithms. Before we formally present the graph model for the variable partitioning problem, let us first introduce the Data Flow Graph (DFG) model. Many multimedia and DSP applications can be represented by data flow graphs. In this paper, we use the data flow graph as the description of a given program.

Definition 2.1: A Data Flow Graph (DFG) G = (V, E, var, d, t) is a node-weighted and edge-weighted directed graph, where V is the set of computation nodes, E ⊆ V × V is the edge set that defines the precedence relations over the nodes in V, var(e) represents the variable accessed by an edge e ∈ E, d(e) represents the number of delays of an edge e, and t(v) represents the computation time of a node v.

The nodes in the data flow graph can be either ALU operations or memory operations. As shown in Figure 1, the ALU operations are represented by circles and the memory operations are represented by triangles. In particular, an edge from a memory operation node to an ALU operation node represents a load, while an edge from an ALU operation node to a memory operation node represents a store. An edge label var(e) indicates the variable that is accessed by a memory operation through edge e. Note that edges that do not involve a memory operation do not have a label. Function d maps from E to a set of non-negative integers, representing the number of delays between two nodes.

Loops can be modeled by cyclic DFGs. An iteration is the execution of each node in V exactly once. Inter-iteration dependencies are represented by edges with delays. In particular, an edge e = (u, v) with delay count d(e) ≠ 0 means that the computation of node v at iteration j requires data produced by node u at iteration j − d(e). The dependencies within the same iteration are represented by edges with zero delay. The delay count of a path is the summation of the delay counts of all the edges along the path. Note that if we remove all edges with non-zero delays in a DFG, we obtain a directed acyclic graph (DAG). A static schedule must obey the precedence relations defined by the DAG part of a DFG.
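To make the DFG model concrete, the following minimal Python sketch shows one way to encode Definition 2.1 and to extract the DAG part that a static schedule must obey. This is our illustration under assumed names (DFG, add_edge, dag_edges), not the authors' implementation.

class DFG:
    # A sketch of the DFG of Definition 2.1: nodes are ALU or memory
    # operations; edges carry an optional variable label and a delay count.
    def __init__(self):
        self.nodes = set()   # V
        self.edges = []      # E: (u, v, var, delay) tuples
        self.time = {}       # t(v); this paper assumes t(v) = 1

    def add_node(self, v, t=1):
        self.nodes.add(v)
        self.time[v] = t

    def add_edge(self, u, v, var=None, delay=0):
        # var labels only edges that touch a memory operation; delay > 0
        # marks an inter-iteration dependency: node v at iteration j uses
        # data produced by node u at iteration j - delay.
        self.edges.append((u, v, var, delay))

    def dag_edges(self):
        # Removing all edges with non-zero delay yields the DAG part.
        return [(u, v) for (u, v, _, d) in self.edges if d == 0]

For instance, the load of array C feeding node 2 in Figure 1 would be entered as g.add_edge(1, 2, var='C').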


For a cyclic data flow graph, the delay count of any cycle needs to be positive; otherwise, the corresponding program is illegal. Function t maps from V to a set of positive integers, representing the computation time of each operation node. In this paper, we assume that every node in V takes one time unit to execute.

Next, we define a novel graph model that is constructed from a DFG. The objective of using this model is to expose all parallel memory accesses in a DFG so that variables can be properly partitioned. The nodes of the graph represent variables, and the edges represent the potential parallelism existing among the memory accesses for these variables. We call this graph model the Variable Independence Graph (VIG).

Definition 2.2: A Variable Independence Graph (VIG) G_v = (V_v, E_v, w) is an undirected weighted graph, where V_v is a set of nodes representing variables, and E_v is a set of edges connecting nodes in V_v whose memory operations can potentially be executed in parallel. Function w maps from E_v to a set of real values representing the priority of partitioning the nodes u and v of an edge (u, v) into different memory modules.

In the following, we introduce some important concepts that will be used in constructing a complete Variable Independence Graph. During the graph construction, we are particularly interested in memory operation pairs that help us identify parallel memory accesses. We call them "independent pairs".

Definition 2.3: Given a DFG G = (V, E, var, d, t), if nodes u and v are not reachable from each other through any zero-delay path in G, then nodes u and v are an independent pair.

For example, nodes 0 and 1 in Figure 1 are an independent pair. We define the Access Set as the set of memory operation nodes that access a particular variable.

Definition 2.4: Given a DFG G = (V, E, var, d, t), the Access Set of a variable A is ACCESS(A) = {v ∈ V | there exists an edge e = (u, v) or e = (v, u) in E with var(e) = A}.

Definition 2.5: Given a DFG G = (V, E, var, d, t), the accessed variable of a memory operation node v ∈ V is VAR(v) = A, where there exists an edge e = (u, v) or e = (v, u) in E with var(e) = A.

For instance, in Figure 1, ACCESS(A) = {8}, ACCESS(B) = {0, 3, 5}, ACCESS(C) = {1, 6, 10}, and VAR(3) = B.

The mobility property of a node in a schedule is very important for improving the precision of the graph construction, as we will discuss in detail in the next section. Given a DFG G = (V, E, var, d, t), a Mobility Window [18] of a node v ∈ V, denoted by MW(v) in this paper, is the set of control steps of a static schedule in which node v can be placed. The first control step at which the node can be scheduled is determined by As Soon As Possible (ASAP) scheduling, and the last control step at which node v can be scheduled is determined by As Late As Possible (ALAP) scheduling with the longest path as a time constraint. The mobility window gives the earliest and the latest positions at which a node can be scheduled. Note that the overlap of the mobility windows of two nodes indicates the possibility that the nodes could be scheduled in the same control step.

In Figure 1, the critical path consists of six nodes. The mobility windows of nodes 8 and 10 overlap only at control step 3. Therefore, the parallel operation of nodes 8 and 10 can occur only once, namely in control step 3, among all four possible arrangements of these two nodes in the schedule. In the following, we define the priority function of an edge in a VIG based on the mobility windows of two parallel memory accesses. We use the cardinality of the mobility window overlap to denote the possible occurrences of parallel operations, and the product of the cardinalities of the two mobility windows to denote all arrangements of the two nodes in a schedule.

Definition 2.6: Given a VIG G_v = (V_v, E_v, w), the priority of an edge (A, B) ∈ E_v is defined as

w(A, B) = Σ_{u ∈ ACCESS(A)} Σ_{v ∈ ACCESS(B)} |MW(u) ∩ MW(v)| / (|MW(u)| × |MW(v)|).
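A possible computation of mobility windows and the resulting edge priorities is sketched below, assuming unit node times and ASAP/ALAP over the zero-delay (DAG) edges. The helper names and the summation form of w follow our reading of Definition 2.6 and are not a published implementation.

def topological_order(succ, nodes):
    # Kahn's algorithm over the zero-delay (DAG) edges.
    indeg = {v: 0 for v in nodes}
    for v in nodes:
        for s in succ.get(v, ()):
            indeg[s] += 1
    order = [v for v in nodes if indeg[v] == 0]
    for v in order:                 # order grows as nodes become ready
        for s in succ.get(v, ()):
            indeg[s] -= 1
            if indeg[s] == 0:
                order.append(s)
    return order

def asap_levels(succ, nodes):
    # Longest-path level of each node, assuming unit computation times.
    level = {v: 0 for v in nodes}
    for v in topological_order(succ, nodes):
        for s in succ.get(v, ()):
            level[s] = max(level[s], level[v] + 1)
    return level

def mobility_windows(succ, pred, nodes):
    early = asap_levels(succ, nodes)    # ASAP step of each node
    length = max(early.values())        # longest-path time constraint
    late = asap_levels(pred, nodes)     # distance from each node to a sink
    return {v: set(range(early[v], length - late[v] + 1)) for v in nodes}

def edge_priority(access_a, access_b, mw):
    # Each node pair contributes its chance of running in parallel:
    # window overlap out of all placements of the two nodes.
    return sum(len(mw[u] & mw[v]) / (len(mw[u]) * len(mw[v]))
               for u in access_a for v in access_b)

For nodes 8 and 10 in Figure 1, the windows overlap in a single control step out of four possible arrangements, so this pair contributes 1/4 = 0.25 to w(A, C).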
