A Graph-Theoretic Approach for Register File

A Graph-Theoretic Approach for Register File based Synthesis1 C.P. Ravikumar Department of Electrical Engineering Indian Institute of Technology Hauz Khas, New Delhi 110016, INDIA Email : [email protected]

R. Aggarwal Computer Science Department University of Minnesota Minneapolis, MN 55454, USA Email : [email protected]

C. Sharma Indian Institute of Management Lucknow, INDIA

Abstract

With the increasing use of register files as storage elements in integrated circuits, the problem of assigning data variables to ports of register files has assumed significance. The assignment involves simultaneous optimization of several cost functions, namely, the number of register files, the number of registers and access ports per register file, and the interconnect both internal and external to memories. In this paper, interconnect refers to multiplexers, busses, and tristate switches. The objective of this paper is to describe graph-theoretic optimization algorithms for the assignment problem. The allocation system described in this paper (SOUPS) accepts a scheduled data flow graph as input and performs (i) assignment of variables to a minimal number of registers, (ii) assignment of registers to a minimal number of register files, (iii) assignment of registers to ports of the register files using minimal interconnect within the register files, and (iv) assignment of ports of the register files to terminals of functional modules using minimal interconnect outside the register files. We describe experimental results on several benchmark problems from the literature, with substantial improvements.

1 Introduction

Register files are increasingly being used as storage elements in modern VLSI systems [1, 4, 3, 6]. The motivation for preferring register files to scratch registers is manifold. Floorplanning, placement, and routing become easier due to the reduction in the number of modules. There is a significant reduction in interconnect (both wires and multiplexers) when registers are grouped into register files. This can lead to a significant saving in chip area. Several authors have considered the problem of optimal utilization of register files in data path synthesis [1, 4, 3, 6]. This problem is usually solved in two steps. In the first step, the life times of data variables are analyzed and a minimum number of registers are allocated to the variables using Interval Graph Coloring. In the second step, the registers are grouped into register files based on interconnect considerations. The existing approaches handle the grouping of variables and the assignment of registers to register files as two independent activities, leading to preferential optimization which does not address all the objective functions concurrently. Our approach is to address both issues simultaneously, with the intention of optimizing the number of registers, the number of register files, and the interconnect.

Balakrishnan et al. [3] introduced the problem of synthesizing register file-based data paths and formulated register grouping and interconnect minimization as integer linear programming problems. In this early work, results were not presented for any large-size example. The method of [3] is not practical for large examples since the number of variables involved in the integer linear program is extremely large. Chen [4] presented a backtracking algorithm for assignment of registers to register files, with the objective of minimizing the number of register files. The author does not perform grouping of registers, and the derivation of the datapath architecture is not automatic. Ahmad and Chen [1] proposed a technique where interconnect minimization is achieved at the cost of increasing the number of registers. However, an increase in registers will also result in an increase in the number of switches internal to register file modules. Kim and Liu [6] presented a novel technique of first minimizing the interconnect and then grouping the registers. The disadvantage of this method is that it results in a larger number of busses and registers, both of which lead to more interconnect. Our work attempts to address the shortcomings of the previous work. The paper is organized as follows. The next section discusses the allocation problem. Section 3 describes the graph-theoretic formulation of the register allocation and register file assignment problems. Experimental results are discussed in Section 4 and concluding remarks are given in Section 5.

1 Research was conducted when the authors were at Indian Institute of Technology, Delhi, India.

10th International Conference on VLSI Design, January 1997

2

files thus giving us a complete architecture. The best of the p architectures generated by the p optimization processes is selected on the basis of cost considerations. 3.1 3.1.1

Terminal Binding

The scheduled data flow graph is analyzed to determine the number of modules of each type used in each time step T_i. This analysis is used in determining the maximum number of modules of each type required to construct the data path. After module allocation, we perform module-to-operation binding using well-known procedures found in the literature. We next perform terminal binding, where we decide for each variable v the set of terminals of the functional units to which v is to be connected. Our terminal binding procedure is geared towards minimizing the external interconnections. Let f be a functional module and V_f be the set of variables associated with module f. We construct a graph G_f = (V_f, E_f) to reflect the usage of module terminals by the variables in V_f. An edge e = (i, j) is included in E_f if and only if the variables i, j ∈ V_f are assigned to the module f during the same time step. A minimal node-coloring of graph G_f is found. Let V_c be the chromatic set for color c among the colors used in the coloring, i.e. the set of vertices painted using color c. We sort the colors on the size |V_c| of the chromatic sets. The largest chromatic set is assigned to terminal 1, and the second largest set is assigned to terminal 2. The variables which belong to the remaining chromatic sets are assigned to both terminals 1 and 2.
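The terminal binding step for a single two-input module can be illustrated with a short Python sketch. It is not the SOUPS implementation: the function name `bind_terminals`, the input format (one operand pair per time step), and the greedy largest-degree-first coloring are assumptions made for illustration, and the greedy coloring is a heuristic that is not guaranteed to be minimal.

```python
from collections import defaultdict

def bind_terminals(uses):
    """uses: list of (time_step, var_a, var_b) operand pairs of ONE module.
    Returns {variable: set of terminals}, terminals numbered 1 and 2.
    Hypothetical helper, not part of SOUPS."""
    # Conflict graph G_f: two variables used by the module in the same
    # time step are joined by an edge.
    adj = defaultdict(set)
    for _, a, b in uses:
        adj[a].add(b)
        adj[b].add(a)
    # Greedy node coloring, highest degree first (a heuristic stand-in
    # for the minimal coloring mentioned in the text).
    color = {}
    for v in sorted(adj, key=lambda v: (-len(adj[v]), v)):
        taken = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj)) if c not in taken)
    # Chromatic sets, sorted by decreasing size.
    classes = defaultdict(list)
    for v, c in color.items():
        classes[c].append(v)
    ranked = sorted(classes.values(), key=lambda vs: (-len(vs), sorted(vs)))
    # Largest set -> terminal 1, second largest -> terminal 2,
    # all remaining sets -> both terminals.
    binding = {}
    for rank, members in enumerate(ranked):
        for v in members:
            binding[v] = {1} if rank == 0 else {2} if rank == 1 else {1, 2}
    return binding
```

A variable bound to both terminals needs a connection to each input of the module, which is why the two largest chromatic sets are given the dedicated terminals.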

Problem Formulation

Automatic synthesis of data paths from behavioral descriptions has received much attention in the recent past [5] due to the interest in ASIC implementations of digital signal processing algorithms. The input to a data path synthesis system is an intermediate form of behavioral description known as a data flow graph (DFG). Synthesis proceeds in several well-defined phases, such as scheduling, resource allocation, and binding. Resource allocation consists of determining the number of functional modules and registers in the data path. Operation-to-module binding assigns every operation to an appropriate module; variable-to-register binding assigns each variable to a register. The objective of the binding step is to minimize the interconnect. Our register allocation and register file assignment procedure SOUPS (for Simultaneous Optimization Using Parallel Synthesis) assumes as input a scheduled data flow graph. We also assume that each operation in the DFG is binary and commutative. The next section describes the graph-theoretic algorithms used in each stage of SOUPS.

3

Sequential Phase

3.1.2 Register Allocation

We construct a graph G = (V, E) where the node set V corresponds to the set of variables in the DFG. An edge e = (i, j) is included between nodes i and j if the corresponding variables have disjoint life intervals. We color an edge (i, j) red if there exists at least one terminal t such that both the variables i and j are assigned to t; we color the edge (i, j) blue otherwise. In particular, it is advantageous to merge two variables i and j which share a red edge into the same register. The register allocation problem can now be formulated as a graph coloring problem. We use the notation N(c) to describe the set of all nodes to which the same color c has been assigned. The register allocation algorithm (see Figure 1) must assign colors to the nodes of the graph G such that (1) nodes i and j are assigned different colors if there is no edge between i and j, (2) minimize the

Sequential and Parallel Algorithms

Our algorithm works in two phases, namely, a sequential phase and a parallel phase. Register allocation is solved sequentially since there is insufficient gain in speedup from parallelizing this step. In the parallel phase, we explore the search space for the problem of register file allocation using a number of processes that run concurrently on a network of workstations. Each of the spawned processes allocates registers to a minimal number of register files and determines the interconnections both internal as well as external to the register


time tag. We refer to the problem as imperfect graph coloring since, unlike the conventional graph coloring problem [7], it permits two nodes connected by an edge to be assigned the same color.

number of node colors, and (3) for each resulting color c, maximize the number of red edges in the subgraph induced by N(c).

procedure AllocateRegisters(G)
begin
  Phase 1:
  repeat
    Create a graph G' = (V, E') where E' consists only of red edges in G;
    Find a maximum sized matching M in G';
    for each edge e = (u, v) ∈ M do
      collapse (u, v) into a "supernode";
    Let G_new = (V_new, E_new) be a graph where V_new consists of the supernodes and the uncollapsed vertices of V;
    Add an edge e = (u, v) to E_new if and only if the vertices corresponding to the nodes u and v in the original graph G form a clique in G;
    Color the edge e ∈ E_new blue if all edges connecting nodes in u and v are blue; otherwise color the edge red;
    G = G_new;
  until G contains only blue edges;
  Phase 2:
  Find the complement graph G_c of G;
  Find an optimal node-coloring of G_c;
  The chromatic number of G_c is the number of registers;
  The chromatic set of color c is the set of variables implemented using register c;
end

Figure 1: Register Allocation Procedure

We have modified the well-known backtracking algorithm [7] for imperfect graph coloring. The backtracking algorithm considers the nodes of the graph in a certain order v_1, v_2, ..., v_n for assigning colors. With each node v_i, a coloring set is maintained, which is the set of colors which could be assigned to v_i. Initially, the coloring set of v_i consists of colors 1, 2, ..., l_i, where l_i = min(i, d_i + 1). Here, d_i denotes the degree of node v_i. In a forward pass, the coloring algorithm assigns color 1 to node v_1; the coloring sets of all the nodes are then examined and color 1 is removed from all the neighboring nodes of node v_1. This procedure is repeated for each node v_i, 1 < i ≤ n. If q is the maximum number of colors used in the forward pass, the backtracking algorithm tries to color the graph using q - 1 colors by altering the coloring decisions. To achieve imperfect graph coloring, we modify the initial computation of coloring sets of nodes. The coloring set of node v_i is set to {1, 2, ..., l_i}, where l_i = max(min(i - k + 1, d_i - k + 2), 1). We also modify the way in which the backtracking algorithm updates the coloring sets of nodes. After assigning a color c to some node v_i, c is dropped from the coloring set of a neighboring node v_j iff v_j forms a clique of size k with a subset U of nodes such that all the edges of the clique have the same time tag and all the nodes in U are assigned to the color c.
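The imperfect coloring rule used for register file assignment can be illustrated with a small Python sketch. It rests on one reading of the rule: a clique whose edges all carry the same time tag is a set of registers that are all accessed in that time step, so a coloring is legal when no register file has more than k of its registers accessed in any single step. The function `assign_register_files`, its input format, and the first-fit greedy strategy are assumptions for illustration. The greedy pass stands in for the modified backtracking algorithm and does not guarantee the fewest register files.

```python
def assign_register_files(accesses, k=2):
    """accesses: {time_step: set of registers accessed in that step}.
    Returns {register: register file index}. Hypothetical helper:
    a first-fit greedy sketch, not the backtracking search of the paper."""
    registers = sorted({r for regs in accesses.values() for r in regs})
    steps_of = {r: [t for t, regs in accesses.items() if r in regs]
                for r in registers}
    files = []       # files[f][t] = registers of file f accessed in step t
    placement = {}
    # Place the most frequently accessed registers first.
    for r in sorted(registers, key=lambda r: (-len(steps_of[r]), r)):
        for f, load in enumerate(files):
            # Legal if file f still has a free port in every step r is used.
            if all(load.get(t, 0) < k for t in steps_of[r]):
                break
        else:
            files.append({})          # no existing file fits: open a new one
            f = len(files) - 1
        for t in steps_of[r]:
            files[f][t] = files[f].get(t, 0) + 1
        placement[r] = f
    return placement
```

With k = 2, three registers accessed in the same step force at least two register files, while registers that are never accessed together can share a file freely.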

3.2 Parallel Phase

3.2.1 Register File Assignment

Following register allocation, we arrive at an assignment of registers to register files. During this assignment, our objective is to find the fewest number of register file modules. We assume that each register file module has k or fewer read/write ports, where k ≥ 2. The user has the option of specifying parameter k. Although we begin with the assumption that all register file ports are of read/write variety, we perform a postprocessing step (see next subsection) where we determine if a port can be implemented as a read only or a write only port. A modified form of graph node coloring, called imperfect graph coloring, is used during the assignment. We construct a graph G_r = (V_r, E_r), where V_r is the set of registers found during the allocation procedure. An edge e = (i, j) is added to E_r if and only if there exists a time step T during which registers i and j are accessed concurrently. Note that G_r may have multiple edges between a pair of nodes. We associate the time step T as a tag with edge (i, j). The problem of register file assignment can now be posed as one of assigning the fewest colors to the nodes of G_r such that the following coloring rule is followed. Do not assign the same color to l > k nodes of V_r if the subgraph induced by the l nodes forms a clique in G_r such that all the edges of the clique have the same

3.2.2 Post Processing

A post-processing step is required to further optimize the allocation of register files and the assignment of variables to the register files. During the assignment step, we have assumed that all the register file ports are of read/write type. Following the assignment, we analyze the set of registers R_m assigned to each register file module m. Let r_i (w_i) denote the number of registers which are assigned to m and are read (written to) during time step i. Then the number of read/write, read only, and write only ports in m are given by the following equations, where r_max denotes max(r_i) over all time steps i and w_max is similarly defined.

\[ RW = \min(r_{max}, w_{max}) \tag{1} \]
\[ R = r_{max} - RW \tag{2} \]
\[ W = w_{max} - RW \tag{3} \]
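As a worked illustration of equations (1) to (3), the following Python sketch computes the port mix of one register file from its per-step read and write counts. The function name `port_counts` and the list-based input are assumptions made for this example.

```python
def port_counts(reads, writes):
    """reads[i] / writes[i]: number of registers of one register file that
    are read / written in time step i. Hypothetical helper."""
    r_max = max(reads, default=0)   # peak number of simultaneous reads
    w_max = max(writes, default=0)  # peak number of simultaneous writes
    rw = min(r_max, w_max)          # equation (1)
    return {"RW": rw, "R": r_max - rw, "W": w_max - rw}  # equations (2), (3)
```

For example, a register file with at most three reads and at most one write in any step gets one read/write port, two read only ports, and no write only port.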

3.2.3 Port Assignment

We now determine the internal architecture of the register file modules. The internal architecture is specified by the way in which the registers are connected to the ports through read and write switches. The port assignment step is modeled as weighted clique partitioning problems on two graphs G_r = (V_r, E_r) and G_w = (V_w, E_w) constructed for each register file module M, where V_r = V_w is the set of registers assigned to M. An edge e = (i, j) is included in E_r (E_w) if registers i and j can be assigned the same port, i.e. the registers i and j are not read (updated) during the same time step. The clique partitioning algorithm is identical for both G_r and G_w, and we shall discuss the algorithm for G_r. For each edge e = (i, j) ∈ E_r, we assign a weight w_ij which represents the number of terminals which are shared by registers i and j. Note that a set of registers which form a clique in G_r can be assigned to the same read port. We therefore look for the minimum number of cliques which cover all the nodes of G_r. During clique partitioning, we prefer generating cliques with large internal weight, since that would effectively reduce the interconnect from the port to the functional modules. When the number of the resulting cliques is equal to k, we assign each clique to a separate port. If the number of cliques in the partition is larger than k, then the registers of clique c, c > k, will have to share register file ports with other registers, requiring internal register file switches. Our minimum weighted clique partition procedure is illustrated in Figure 2.

3.2.4 Cost Calculation

Once an architecture has been evolved by each processor, we evaluate the architecture for its cost in terms of the number of hardware components required and the number of interconnections. We divide the cost into two parts, namely, (i) the cost of the internal architecture and (ii) the cost of the external architecture. Thus the cost function can be written as:

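\[
\mathit{Cost} = a \times \mathit{Icost} + b \times \mathit{Ecost} \tag{4}
\]

\[
\begin{aligned}
\mathit{Icost} ={}& 2x \times \mathit{Num}_{R\ switches} + x \times \mathit{Num}_{W\ switches} \\
&+ y_r \times \mathit{Num}_{R\ ports} + y_w \times \mathit{Num}_{W\ ports} + y_{rw} \times \mathit{Num}_{R/W\ ports}
\end{aligned}
\tag{5}
\]

\[
\begin{aligned}
\mathit{Ecost} ={}& \sum_{i=1}^{k} \{ x \times 4 \times (n_i - 1) \} + v_r \times \mathit{Num}_{R\ buses} + v_w \times \mathit{Num}_{W\ buses} \\
&+ v_{rw} \times \mathit{Num}_{R/W\ buses} + z \times \mathit{Num}_{buffers} + w \times \mathit{Num}_{wires}
\end{aligned}
\tag{6}
\]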

Here, a and b are two constants specified by the user, and are referred to as weight coefficients for the internal and external architectures respectively. In the absence of a concrete cost function, we quantify the cost of a component in terms of layout area [2], which is directly measurable in terms of the number of transistors required. Here n_i is the number of inputs to multiplexer i, there being k multiplexers. Note that a read switch requires two transistors, a write switch only a single transistor, and a multiplexer with n_i inputs requires 4(n_i - 1) transistors. We accept all the coefficients as inputs from the user. Based on this cost function, we find the cost of each architecture and report the least costly architecture as the final solution.
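The cost model of equations (4) to (6) can be evaluated with a few lines of Python. The sketch below is an illustration under stated assumptions: the function names, the dictionary of coefficients, and the key names (`x` for the per-transistor cost; `y_r`, `y_w`, `y_rw` for port costs; `v_r`, `v_w`, `v_rw` for bus costs; `z` and `w` for buffer and wire costs; `a` and `b` for the weight coefficients) follow the symbols of the cost equations as best they can be read, and the numbers in the usage example are made up.

```python
def internal_cost(c, read_switches, write_switches, r_ports, w_ports, rw_ports):
    """Equation (5): a read switch costs two transistors, a write switch one."""
    return (2 * c["x"] * read_switches + c["x"] * write_switches
            + c["y_r"] * r_ports + c["y_w"] * w_ports + c["y_rw"] * rw_ports)

def external_cost(c, mux_inputs, r_buses, w_buses, rw_buses, buffers, wires):
    """Equation (6): mux_inputs lists n_i, the input count of multiplexer i."""
    mux = sum(c["x"] * 4 * (n - 1) for n in mux_inputs)
    return (mux + c["v_r"] * r_buses + c["v_w"] * w_buses
            + c["v_rw"] * rw_buses + c["z"] * buffers + c["w"] * wires)

def total_cost(c, icost, ecost):
    """Equation (4): weighted sum of internal and external cost."""
    return c["a"] * icost + c["b"] * ecost

# Illustrative numbers only, with every coefficient set to 1.
coeff = dict.fromkeys(
    ["a", "b", "x", "y_r", "y_w", "y_rw", "v_r", "v_w", "v_rw", "z", "w"], 1)
icost = internal_cost(coeff, read_switches=10, write_switches=7,
                      r_ports=1, w_ports=0, rw_ports=3)
ecost = external_cost(coeff, mux_inputs=[3, 2], r_buses=1, w_buses=1,
                      rw_buses=0, buffers=3, wires=5)
print(total_cost(coeff, icost, ecost))
```

Each of the p architectures produced in the parallel phase would be scored this way, and the one with the lowest total kept.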

4 Results

In this section, we compare our results with the benchmark problems from Chen [4], Ahmad & Chen [1], and Kim & Liu [6] and show that in every case SOUPS shows substantial improvements.

4.1 Fifth order Elliptic Wave Filter

This is a benchmark from the 1988 High Level Synthesis Workshop and has been used by many authors to study the performance of their synthesis programs. We generated a Force-Directed Schedule (see Figure 3) for the filter and derived an allocation as well as the RTL equations from this schedule. The register transfer level equations are given in [4].

procedure PortAssignment(G_r)
begin
  repeat
    Find a maximum weighted matching M in G_r;
    for each edge e = (i, j) ∈ M do
      collapse i, j into a super vertex i.j;
    Construct a graph G'_r = (V'_r, E'_r) such that V'_r is the set of all super vertices and vertices in G_r which were not collapsed;
    Include e' = (i', j') ∈ E'_r if the vertices in G_r which correspond to i' and j' form a clique in G_r;
    G_r ← G'_r;
  until E'_r = ∅;
  The vertices in G_r represent cliques;
  for each clique i in the partition compute weight W_i;
  sort the cliques in ascending order of weights;
  for each of the k largest weighted cliques i do
    assign port i to the registers which correspond to i;
  for each remaining clique j do
    for each register r in clique j do
      assign r to each register file port;
end

Figure 2: Port Assignment Algorithm
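The port assignment procedure of Figure 2 can be approximated in Python as follows. This is a sketch under assumptions, not the authors' code: `partition_into_ports`, `compatible`, and `weight` are hypothetical names, ports are numbered from 0, and the maximum weighted matching of each round is replaced by a simple greedy matching (heaviest mergeable pair first), so the resulting partition may differ from the one an exact matching would produce.

```python
from itertools import combinations

def partition_into_ports(registers, compatible, weight, k):
    """compatible(a, b): True if a and b are never read in the same step.
    weight(a, b): number of terminals shared by registers a and b.
    Returns {register: set of port indices}. Hypothetical helper."""
    cliques = [frozenset([r]) for r in sorted(registers)]

    def mergeable(p, q):   # the union of two cliques must again be a clique
        return all(compatible(a, b) for a in p for b in q)

    def gain(p, q):        # weight gained by merging p and q
        return sum(weight(a, b) for a in p for b in q)

    while True:
        pairs = [(gain(p, q), i, j)
                 for (i, p), (j, q) in combinations(enumerate(cliques), 2)
                 if mergeable(p, q)]
        if not pairs:      # no edges left between super vertices
            break
        # Greedy matching: heaviest pair first, each clique merged at most once.
        used, merged = set(), []
        for _, i, j in sorted(pairs, key=lambda t: (-t[0], t[1], t[2])):
            if i not in used and j not in used:
                used.update((i, j))
                merged.append(cliques[i] | cliques[j])
        cliques = merged + [c for i, c in enumerate(cliques) if i not in used]

    def internal_weight(c):
        return sum(weight(a, b) for a, b in combinations(sorted(c), 2))

    # The k heaviest cliques get a dedicated port each; registers in the
    # remaining cliques are connected to every port.
    ranked = sorted(cliques, key=lambda c: (-internal_weight(c), sorted(c)))
    all_ports = set(range(min(k, len(ranked))))
    ports = {}
    for idx, clique in enumerate(ranked):
        for r in clique:
            ports[r] = {idx} if idx < k else set(all_ports)
    return ports
```

Registers that end up connected to more than one port are the ones that need the internal register file switches counted in the cost function.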


Figure 3: Dataflow Graph for the fifth-order wave filter

Table 1: Datapath allocation results of Wave Filter; a comparison between PDM & SOUPS

Hardware                    PDM    SOUPS
Registers                   42     21
Dual-Port Register Files    5      3
Mux-inputs                  37     14
Tri-state Buffers           N/A    11
ROMs                        1      0
Read Switches               -      21
Write Switches              -      19

The architecture evolved by SOUPS for the example is shown in Figure 4. Table 1 compares the datapath allocation results of our approach with Chen's approach (PDM). A "-" entry in the table means that the relevant data is not available.

Figure 4: Best Architecture for the fifth-order wave filter evolved by SOUPS

4.2 Ahmad & Chen Example

Ahmad and Chen [1] have adopted this problem from Ref. [9]. The RTL equations are shown in Figure 5. Numbers in the brackets represent the register numbers to which the respective variable is mapped. The architecture shown in Figure 6 was found by SOUPS. Table 2 shows the comparison of our architecture with those given by Ahmad & Chen, Wilson et al. [9], and Sutarwala et al. [8].

Figure 5: Ahmad and Chen's Example

Figure 6: Architecture evolved by SOUPS for Ahmad-Chen Example

4.3 Kim and Liu Example

This example has been adopted from the work of Kim and Liu [6]. RTL equations for the example are given in Figure 7. Numbers in the brackets represent the register numbers to which the respective variable is mapped. The complete internal as well as external hardware allocation for the architecture found by SOUPS is shown in Figure 8. Kim and Liu use 21 registers and 9 busses, but the complete hardware allocation is not given in their paper, whereas we used only 14 registers and 7 busses.


Table 2: Datapath allocation results of Ahmad-Chen Example

Hardware:  Registers, Register Files, Mux-inputs, Tri-state Buffers, Read Switches, Write Switches, R Ports, W Ports, R/W Ports
MAP:       15  2  2  4  18  10  2  0  2
GREGMAP:   3  11  N/A  -  1  0  5
Wilson:    2  8  N/A  -  4  2  0
SOUPS:     10  2  5  3  10  7  1  0  3

c-step 1: V18(7) = V1(1) + V2(2), V19(8) = V3(3) + V4(4), V20(12) = V5(10) + V6(5)
c-step 2: V22(9) = ... + V3(3), V24(3) = V11(9) + V1(1)
c-step 3: V29(6) = V19(8) + V20(12), V37(14) = V18(7) + V20(12), V27(11) = V2(2) + V6(5), V28(5) = V18(7) + V19(8)
c-step 4: V30(4) = V20(12) + V21(1), V31(8) = V21(1) + V22(9), V32(2) = V5(10) + V4(4)
c-step 5: V35(10) = V27(11) + ..., V36(11) = V28(5) + V29(6)
c-step 6: ..., V43(11) = ... + V28(5)
c-step 7: V45(7) = V29(6) + V37(14), V46(14) = ..., V44(12) = V18(7) + V28(5)
c-step 8: V47(10) = V38(1) + V31(8), ... = V32(2) + V24(3)

Figure 7: RTL equations for the Liu's example

Figure 8: Architecture for the Liu's example evolved by SOUPS

5 Conclusions

We have reported efficient algorithms for register allocation and assignment of registers to register file modules in a datapath. We have used graph theoretic formulations to model the various subproblems that arise in the framework of the above two problems. Our algorithms are geared towards minimizing interconnections as well as hardware allocation. We have compared our results with those published in the literature and have shown that our algorithms outperform the published results in all the cases. The run times of our algorithms on Sun SPARC workstations were on the order of a few seconds on all the examples on which we tested SOUPS.

References

[1] I. Ahmad and C.Y. Roger Chen. Post-process for data path synthesis using multiport memories. In Proceedings of IC-DAC, pages 276-279, 1991.

[2] L. Avra. Allocation and assignment in high-level synthesis for self-testable data paths. In Proceedings of International Test Conference (IEEE), pages 463-472, 1991.

[3] M. Balakrishnan et al. Allocation of multiport memories in data path synthesis. IEEE Transactions on CAD, 7(4):536-540, April 1988.

[4] C.-In H. Chen. Using PDM on multiport memory allocation in data-path. International Journal of VLSI Design, 1(3):217-232, 1994.

[5] D. Gajski, N. Dutt, A. Wu, and S. Lin. High Level Synthesis - Introduction to Chip and System Design. Kluwer Academic Publishers, 1992.

[6] T. Kim and C.L. Liu. Utilization of multiport memories in data path synthesis. In Proceedings of 30th DAC, pages 298-302, 1993.

[7] M.M. Syslo, N. Deo, and J. Kowalik. Discrete Optimization Algorithms with Pascal Programs. Prentice Hall, Englewood Cliffs, NJ, 1983.

[8] S. Sutarwala et al. Gregmap: A design automation tool for interconnection minimization. In Canadian Conference on VLSI, Halifax, Nova Scotia, pages 362-371, 1989.

[9] T.C. Wilson et al. Optimal allocation of multiport memories in datapath synthesis. In 32nd Midwest Symposium on Circuits and Systems, pages 1070-1073, 1989.
