
Simultaneous Scheduling and Allocation in High-Level Synthesis Using a Genetic Algorithm

Kenji Ohmori, Peter W. Eklund and Jem Daalder

Abstract

This paper describes a genetic algorithm (GA) applied to high-level hardware synthesis, which generates a register transfer level circuit from a behavioral specification of a hardware system. High-level synthesis consists of two combinatorial optimization problems: scheduling and allocation. It is also a multiple objective problem. Our GA solves the scheduling and allocation problems simultaneously, so that it can avoid falling into local minima and can cope with multiple objectives by giving different weights to the fitness functions. The paper introduces several new ideas. One is a straightforward method for obtaining offspring efficiently, using GA operations together with a mechanism that repairs faulty genes as offspring are created. Efficient creation is crucial for an optimization problem with severe restrictions, since most offspring violate the environment. When a gene violates the environment, it is replaced by the corresponding gene of the other parent or by one of the available genes. Another idea is to differentiate circuits that have the same fitness to the environment according to their closeness to the assumed optimal circuit. This differentiation strongly counteracts premature convergence, in which the search wanders around a good solution for many generations. Using these ideas, our GA has obtained improved solutions for benchmark problems compared with other high-level synthesis optimization techniques.

Keywords Evolutionary Computing, Genetic Algorithm, Crossover, Mutation, Selection, High-Level Synthesis, Scheduling, Allocation, VLSI design.

Mailing Address: Kenji Ohmori Department of Industrial and Systems Engineering, Hosei University, 3-7-2 Kajinocho, Koganei-shi, Tokyo, 184-8584 Japan Email: [email protected]. Tel: +81-423-87-6349 Fax: +81-423-87-6126

September 30, 1998

DRAFT


I. Introduction

High-level synthesis, which generates a register transfer level circuit from a behavioral description, is one of the most sophisticated problems in VLSI design. High-level synthesis consists of two combinatorial optimization problems. The first is a scheduling problem that assigns to each node of a control data flow graph (CDFG), one of the ways to represent a behavioral description, the control step in which the operation of the node is performed. The other is an allocation problem that assigns resources such as an arithmetic logic unit (ALU), a register and a bus to each node of the CDFG. High-level synthesis was studied extensively in the 1980s, as summarized in [1], and a number of algorithms were invented. However, it took several years of effort to find an algorithm that gave an optimal circuit for a rather simple CDFG representing a single differential equation. As the scheduling and allocation problems are highly correlated, solving the high-level synthesis task with two separate, sequentially executed programs does not always give an optimal solution. Nevertheless, since the task is necessarily complex, most algorithms reported in [1] tackled the high-level synthesis problem by dividing it into two separate sequential problems, with the solution to the first used as a guide to the solution of the second. Among these algorithms, the HAL system [2], in which scheduling is performed first and allocation is carried out afterwards, succeeded in finding an optimal solution for a single differential equation. However, the HAL system has difficulty in finding an optimal solution for other benchmark problems.
High-level synthesis is a multiple objective problem [3], [4] that minimizes several resources at a time: the number of control steps used to execute a cycle of the CDFG, the number of ALUs (or multipliers) allocated to nodes of the CDFG, the number of registers used to store output values of the ALUs, and the number of buses used to transmit the output values stored in registers to ALUs. Since it is not always possible to minimize all objectives, a priority may be set for each. If a very fast circuit is required, the number of control steps should be minimized. Alternatively, if a low power consumption circuit is required, the number of ALUs, registers and buses should be minimized. However, most of the algorithms invented in the mid 1980s cannot tolerate priority changes. For example, HAL minimizes the number of required ALUs prior to the number of required registers.

A genetic algorithm (GA) [5], [6], [7], [8] is considered to give flexibility to multiple objective problems, since each problem is solved by giving different priorities to fitness or objective functions in order to meet the given requirements. Several papers [9], [10], [11] study high-level synthesis using GAs. However, none have studied it from the viewpoint of multiple objective problems, where scheduling and allocation are solved simultaneously. Some papers [9], [11] solve only scheduling. Another paper [10] solves scheduling first using a GA and allocation later by traditional methods, and so still has the local minimum problem exhibited by traditional approaches. Another paper [12] employs simulated evolution, which is closely related to GAs. In simulated evolution, the select operation evaluates a solution, probabilistically removes elements with high cost, and passes the remaining elements, called a partial solution, to the generate operation. The generate operation then searches for a new solution from the partial solution by progressively adding the missing elements. For the fifth-order elliptic wave filter, one of the benchmarks for high-level synthesis, it succeeded in obtaining better solutions than the algorithms of [2], [13]. However, it is noteworthy that [12] commented that solving scheduling and allocation sequentially carries the possibility of falling into a local minimum. In the following year [10], better solutions were found using a GA for the scheduling step.

This paper describes a GA that handles scheduling and allocation simultaneously, so as to realize different VLSI circuits from the same CDFG to meet different design objectives. A chromosome in our GA is longer than those of other GA approaches [9], [10], [11], since it includes genes related to the allocation step. This leads to difficulties in obtaining an optimal solution.
Surrounding an optimal solution, there are many different circuits with the same fitness to the design objectives. Offspring are apt to be generated in these surrounding regions of the solution space, with few chances of leading to the optimal solution. In our paper, several methods are introduced that differentiate these circuits according to their closeness to the assumed optimal solution. These methods lead to better solutions than previously developed algorithms [2], [10], [11], [12], [13].

As high-level synthesis imposes severe restrictions on circuit design, offspring obtained by crossing two parent chromosomes rarely meet the design rules. A conventional GA therefore needs significant computation to obtain sound offspring. In this paper, offspring that violate design rules are repaired during crossover, as reported in [14].

II. Recombination

A. A genetic algorithm

High-level synthesis transforms a system's behavioral description into a register transfer level circuit. A behavioral description is represented by a control data flow graph (CDFG) consisting of nodes and the arcs connecting them. A node obtains its inputs from the preceding nodes, executes its operation and sends its results to the succeeding nodes. A CDFG for the fast discrete cosine transform (FDCT) is shown in Figure 1. A synthesis system has to replace each node with an operational unit that carries out its operation, a register that stores its results and multiplexers that distribute its inputs. The system also assigns to each node a control step that decides when its operation is carried out.

[Figure 1: graph of 42 operation nodes labeled by operator and index, from +00 to -41.]

Fig. 1. A control data flow graph for the fast discrete cosine transform.

In a genetic algorithm, a set of register transfer level circuits, which usually use hardware resources wastefully, is provided in the first step. This set is called the first generation. The second generation is constructed by creating offspring from two parents selected from the first generation, adding the offspring to that generation and discarding the worst circuit from it. This process is repeated until a satisfactory circuit is obtained, as shown in Figure 2. In general, a descendant generation fits the objectives at least as closely as any antecedent generation. (According to schema theory, GAs can locate good solutions in a vast space by exponentially accumulating good solution components in successive generations.)

[Figure 2 flowchart: Create the first generation. Select two parents. Obtain offspring by crossover and mutation. Create the next generation by selecting members from the current generation members and the offspring. Is a satisfactory solution obtained? Yes.]

Fig. 2. A genetic algorithm.
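The loop of Figure 2 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the callables `create`, `fitness` and `breed` and the stopping threshold `good_enough` are hypothetical names, lower fitness is taken to be better, and parents are drawn uniformly here rather than with the fitness-biased selection the paper uses.

```python
import random

def steady_state_ga(create, fitness, breed, pop_size, max_offspring,
                    good_enough, rng=None):
    """Steady-state loop: each offspring replaces the worst member.

    All arguments are hypothetical, for illustration only:
    create(rng) builds a random individual, fitness(x) returns a cost
    (lower is better), breed(mother, father, rng) applies crossover
    and mutation.
    """
    rng = rng or random.Random(0)
    population = [create(rng) for _ in range(pop_size)]   # first generation
    for _ in range(max_offspring):
        population.sort(key=fitness)                       # best first
        if fitness(population[0]) <= good_enough:          # satisfactory?
            break
        mother, father = rng.sample(population, 2)         # select two parents
        population.append(breed(mother, father, rng))      # add the offspring
        population.remove(max(population, key=fitness))    # discard the worst
    return min(population, key=fitness)
```

Because only the worst member is ever discarded, the best fitness in the population cannot get worse from one generation to the next, which is the property stated above.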

Offspring are created using three genetic operations: selection, crossover and mutation. A circuit that fits the objectives more closely is selected as a parent with a higher probability. Two circuits are selected from the current generation as parents, and they produce their offspring using crossover and mutation. Figure 3 is an example of a CDFG for a very simple system. Figure 4 shows two register transfer level circuits realizing the CDFG in Figure 3. When these circuits are selected as parents, they produce offspring that inherit some parts from the mother circuit and the rest from the father circuit. An example is shown in Figure 5. Subsequently, mutation may change some parts of the offspring so that they differ from those of the parents. An example of this process is shown in Figure 6.

[Figure 3: graph with nodes +0, +1, x0 and +2.]

Fig. 3. A CDFG consisting of three addition nodes and one multiplication node.

Fig. 4. Two examples of register transfer level circuits (mother and father) for the CDFG in Figure 3.

It is notable that offspring do not always fit the environment. In high-level synthesis, offspring are very likely to violate design rules, since severe restrictions are imposed on circuit design. The following is a list of important design rules:
1) a node that receives its input from another node should be scheduled for a later control step than the sending node;
2) the same operational unit cannot be allocated to nodes of the same control step;
3) the same register cannot be allocated to nodes with the same control step;
4) a multiplexer can receive only one input at each control step.
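Design rules 1) to 3) can be checked mechanically. The sketch below is an illustration, not the authors' implementation: the `Gene` record, the `violations` function and the edge-list encoding of the CDFG are assumed names, every operation is treated as occupying a single control step, and rule 4) on multiplexers is left out.

```python
from collections import namedtuple

# Hypothetical gene layout for illustration: one record per CDFG node.
Gene = namedtuple("Gene", "step unit register")

def violations(chromosome, edges):
    """Return a list of (rule, nodes) pairs for design rules 1 to 3.

    chromosome: dict node -> Gene; edges: list of (sender, receiver) arcs.
    Single-step operations are assumed and rule 4 is omitted.
    """
    found = []
    for sender, receiver in edges:                        # rule 1: precedence
        if chromosome[receiver].step <= chromosome[sender].step:
            found.append((1, (sender, receiver)))
    nodes = sorted(chromosome)
    for index, a in enumerate(nodes):
        for b in nodes[index + 1:]:
            if chromosome[a].step != chromosome[b].step:
                continue                                  # different steps never clash
            if chromosome[a].unit == chromosome[b].unit:            # rule 2
                found.append((2, (a, b)))
            if chromosome[a].register == chromosome[b].register:    # rule 3
                found.append((3, (a, b)))
    return found
```

For example, two addition nodes placed in the same control step with the same ALU are reported as a rule 2 violation, which is the kind of conflict the repair mechanism has to resolve.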

Fig. 5. An offspring circuit obtained by crossover.

The offspring in Figure 5 violates a design rule, as the same ALU is allocated to the nodes of control step 1. There are two methods of handling the environmental violation of offspring. One is to select new parents and produce other offspring from them, repeating this process until offspring that meet the environment are created. This method has the disadvantage of taking a lot of computation time. The second method is to repair the violating subschema [14]. This method is applied in our GA.

Fig. 6. An offspring circuit obtained by mutation.

B. Chromosome representation of circuits

In a GA, an entity is represented as a chromosome consisting of genes. In our GA, the chromosome representing a register transfer level circuit consists of genes for the control steps and hardware resources allocated to nodes, i.e.,

$\mathit{circuit}_k = g^k_{00}\, g^k_{01}\, g^k_{02}\, g^k_{03} \cdots g^k_{n0}\, g^k_{n1}\, g^k_{n2}\, g^k_{n3},$

where node $i$ in chromosome $k$ is scheduled to control step $g^k_{i0}$ with operational unit $g^k_{i1}$, register $g^k_{i2}$ and input condition $g^k_{i3}$ (showing which input is given to which input port of the operational unit). In a chromosome, the nodes are placed in such a way that a node sending outputs is placed earlier than the nodes receiving those outputs. The order of genes is, of course, the same in every chromosome. This order helps avoid backtracking when a faulty gene is repaired, a process described later.
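As an illustration of this layout, a chromosome can be stored as a flat list with four genes per node. The helper names `encode` and `gene` below are hypothetical, and the example values are those of the father circuit listed in Table I.

```python
GENES_PER_NODE = 4  # control step, operational unit, register, input condition

def encode(node_order, assignment):
    """Flatten per-node 4-tuples into one gene list, in sender-first node order."""
    flat = []
    for node in node_order:
        flat.extend(assignment[node])
    return flat

def gene(flat, i, j):
    """Return g_{ij}: field j (0..3) of the i-th node in the chromosome."""
    return flat[GENES_PER_NODE * i + j]

# Father circuit of Table I, nodes ordered so that senders come first.
order = ["+0", "+1", "x0", "+2"]
father = {
    "+0": (1, "alu0", "reg0", "straight"),
    "+1": (0, "alu0", "reg1", "straight"),
    "x0": (2, "mul0", "reg0", "cross"),
    "+2": (4, "alu0", "reg1", "cross"),
}
flat = encode(order, father)
```

With this encoding, `gene(flat, 2, 0)` is the control step of the third node (x0) and `gene(flat, 3, 1)` is the operational unit of the fourth node (+2).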

C. The first generation

In the initial stage, a set of chromosomes, each of which represents a register transfer level circuit for the given CDFG, is created as the first generation. In creating a chromosome, the range of available control steps for a node can be prepared using as soon as possible (ASAP) and as late as possible (ALAP) scheduling. ASAP and ALAP scheduling allocate each node to its earliest and latest possible control step, respectively. For each node, a control step is selected from the available control steps, i.e., the steps between the earliest and latest steps. An example is shown in Figure 7. The left-hand side is the ASAP scheduling and the right-hand side is the ALAP scheduling for the CDFG in Figure 3. From this figure, we know that the available control steps for node +2 are 3 and 4.

Fig. 7. ASAP and ALAP scheduling results for the behavioral description given in Figure 3.
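The ASAP and ALAP ranges can be computed with one forward and one backward pass over the graph. The sketch below is illustrative only. The arc list for the CDFG of Figure 3 (+0 feeds x0, while x0 and +1 feed +2) and the two-step latency of the multiplication are assumptions, chosen because they reproduce the ranges quoted in the text; the function name `asap_alap` is likewise hypothetical.

```python
def asap_alap(order, edges, latency, last_step):
    """Return {node: (earliest, latest)} start steps for a DAG.

    order: nodes listed sender-first; edges: (sender, receiver) pairs;
    latency: control steps each node occupies; last_step: final control step.
    """
    earliest = {n: 0 for n in order}
    for n in order:                                    # forward pass (ASAP)
        for s, r in edges:
            if s == n:
                earliest[r] = max(earliest[r], earliest[n] + latency[n])
    latest = {n: last_step - latency[n] + 1 for n in order}
    for n in reversed(order):                          # backward pass (ALAP)
        for s, r in edges:
            if r == n:
                latest[s] = min(latest[s], latest[n] - latency[s])
    return {n: (earliest[n], latest[n]) for n in order}

# Assumed structure of the Figure 3 CDFG (not given explicitly in the text).
order = ["+0", "+1", "x0", "+2"]
edges = [("+0", "x0"), ("x0", "+2"), ("+1", "+2")]
latency = {"+0": 1, "+1": 1, "x0": 2, "+2": 1}
ranges = asap_alap(order, edges, latency, last_step=4)
```

Under these assumptions the ranges for +0, x0 and +2 come out as 0 to 1, 1 to 2 and 3 to 4, in agreement with the values used in this section.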

A chromosome of the first generation is created by following the placement of the nodes, in which a sending node is placed earlier than its receiving nodes. Each node is assigned a control step selected from its available control steps. However, this assignment modifies the available control steps of the nodes that receive its result. In our system, each node has an available control step list, in which the start and the end of the available control steps are stored. When a control step is assigned to a node, the start control steps of its receiving nodes are changed to one control step later than the assigned control step of the sending node (unless they are already later than the step assigned to the sending node).

For example, consider generating a circuit for the CDFG in Figure 3. Assume that the genes for nodes +0, +1, x0 and +2 are placed in this order in the chromosome, so that control steps and hardware resources are assigned to +0, +1, x0 and +2 in this order. First, a control step selected from 0 to 1 is assigned to +0. Assume control step 1 is assigned to +0. This assignment affects the available control steps of x0, which are initially from 1 to 2. To avoid a wrong assignment for x0, its available control step list is modified: the start control step of x0 is changed to 2, since the initial start control step, 1, is not later than the control step assigned to +0. After an arithmetic logic unit, a register and an input direction have been assigned to +0, the procedure goes on to assign a control step to +1. This assignment does not modify the available control steps of x0. When a control step is assigned to x0, control step 2 is assigned without choice. It should be noted that assignments are carried out in a straightforward way, without changing any previous assignments. This ensures that receiving nodes are placed later than their sending nodes and that each assignment is made from the available control steps.

A node is allocated hardware resources together with a control step. When the first generation is created, hardware resources are selected from a rather large range of available resources, so that different resources are easily allocated to different nodes of the same control step. After the hardware resources have been allocated, the genes are relabeled so that different labels are not attached to the same circuits. Using available control step lists, we can obtain a chromosome such as the one shown in Table I, which represents the father circuit in Figure 4.

TABLE I
The chromosome for the father circuit.

               +0        +1        x0      +2
control step   1         0         2       4
unit           alu0      alu0      mul0    alu0
register       reg0      reg1      reg0    reg1
input          straight  straight  cross   cross
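The straightforward assignment described above can be sketched as follows. This is a hedged illustration: `random_schedule` and its arguments are hypothetical names, the arc list and the two-step multiplier latency for the Figure 3 CDFG are assumptions, and the start of a receiver's list is pushed by the sender's latency, which reduces to the "one control step later" rule of the text for single-step operations.

```python
import random

def random_schedule(order, edges, ranges, latency, rng):
    """Assign a control step to each node in sender-first order.

    ranges: {node: (start, end)} available control steps, e.g. from ASAP/ALAP.
    After a node is fixed, the start of each receiver's list is pushed past it,
    so no earlier assignment ever has to be revised.
    """
    available = {n: list(r) for n, r in ranges.items()}   # mutable [start, end]
    steps = {}
    for node in order:
        start, end = available[node]
        steps[node] = rng.randint(start, end)
        for sender, receiver in edges:
            if sender == node:
                pushed = steps[node] + latency[node]
                available[receiver][0] = max(available[receiver][0], pushed)
    return steps

# Assumed Figure 3 structure and ranges (see the ASAP/ALAP discussion).
order = ["+0", "+1", "x0", "+2"]
edges = [("+0", "x0"), ("x0", "+2"), ("+1", "+2")]
latency = {"+0": 1, "+1": 1, "x0": 2, "+2": 1}
ranges = {"+0": (0, 1), "+1": (0, 3), "x0": (1, 2), "+2": (3, 4)}
steps = random_schedule(order, edges, ranges, latency, random.Random(0))
```

Whenever +0 happens to receive control step 1, the list of x0 shrinks to the single step 2, as in the worked example.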

D. Offspring creation

A new generation is created from the current generation by crossover and mutation, using available control step lists. Two parents are selected from the current generation, with preference given to circuits that fit the objectives more closely. Using a crossover operation, some genes from the mother and the remaining genes from the father are copied to the chromosome of the new offspring. It may happen that the selected genes do not meet the design rules described above, so that the offspring violates the design constraints.

In a crossover operation, the chromosome of the new offspring is created by copying mother or father genes, selected randomly, step by step along the gene placement. Throughout this operation, however, the gene for a node that sends output must be allocated to an earlier control step than those of the nodes receiving its output. Assume that node A sends its output to node B, and that the control step for A in the mother is later than the one for B in the father. Also assume that the control step for A in the offspring circuit has already been copied from the mother circuit. Now it is time to create the gene for the control step of B by copying it from the mother or the father. An available control step list for each node is also maintained while offspring are created. In this situation, the available control step list for B in the offspring shows that its start control step is at least one control step later than the control step assigned to A in the offspring, i.e., the control step assigned to A in the mother. If the father gene is selected for the offspring, it violates the design rules, since B is allocated to an earlier control step than A while A sends its output to B. When this violation occurs, the gene for the offspring is replaced by that of the other parent. In this case, the gene of the mother is copied.

The repair of a gene is explained more clearly by the following example. Consider obtaining an offspring that inherits from the mother in Figure 8 and the father in Figure 4. Assume that the genes for nodes +0 and +1 have already been copied from the father. The only available control step for node x0 of the offspring is then control step 2. Assume that the control step for node x0 is selected from the mother. As x0 is assigned to control step 1 in the mother, this step is not included in the available control steps of the offspring. Therefore, it is abandoned and the father's control step is selected instead.
In most cases, this repair works well, since the genes of the sending and receiving nodes then come from the same parent. In some cases, however, the sending node has been mutated and unluckily assigned to a later control step than its original control step. In this case, the gene from the same parent may not meet the environment either. If it does not, another control step is selected for this node from the available control steps. The offspring then has a later control step than the parents.

Fig. 8. Another register transfer level circuit for the behavioral description in Figure 3.
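The repair of a control step gene during crossover can be written as a three-stage fallback. The following is a minimal sketch under assumed names (`crossover_step` and its parameters are not taken from the paper): try a randomly chosen parent, then the other parent, then any available control step.

```python
import random

def crossover_step(mother_step, father_step, start, end, rng):
    """Pick a control step for one node of the offspring, repairing if needed.

    start, end: the node's available control steps in the offspring so far.
    """
    first, second = rng.sample([mother_step, father_step], 2)  # random parent first
    for candidate in (first, second):
        if start <= candidate <= end:
            return candidate          # the chosen parent, or the other one, fits
    return rng.randint(start, end)    # neither fits: take any available step
```

In the example above, x0 has control step 1 in the mother and 2 in the father, and only step 2 is available, so `crossover_step(1, 2, 2, 2, rng)` yields 2 whichever parent is tried first.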

It must be noted that any violation and the subsequent selection of an alternative cause the node to be allocated to a later control step. Therefore, crossover has a tendency to drive nodes into later control steps. (There may exist earlier control steps that could be assigned than the alternative.) This disadvantage is compensated for by mutation as well as by a fitness function. Immediately after a gene is copied from one of the parents to the offspring, mutation is performed on this gene with some assigned probability. In this event, mutation reassigns another control step to the node. In the first trial, the mutation intentionally tries to allocate the node to an earlier control step by limiting the range of available control steps to those from the start control step of this node to the control step assigned to the node by crossover. This forces the node to be allocated to an earlier control step than that of the parent. If this is not possible, a control step is selected from all the available control steps and assigned to the node.
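The earlier-first mutation can be sketched as below. This reflects one reading of the description, namely that the first trial picks a strictly earlier step whenever one exists; the function name `mutate_step` and the `rate` parameter are assumptions for illustration.

```python
import random

def mutate_step(current, start, end, rate, rng):
    """Possibly reassign a control step, trying earlier steps first.

    current: step copied by crossover; start, end: available control steps.
    """
    if rng.random() >= rate:
        return current                            # no mutation this time
    if start < current:
        return rng.randint(start, current - 1)    # first trial: strictly earlier
    return rng.randint(start, end)                # otherwise: any available step
```

With `rate` set to the mutation rate, a node copied at its start control step can only stay or move later, while any other node is pulled towards earlier steps, which counteracts the drift caused by gene repair.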

E. Hardware resources

Hardware resources of a new offspring are allocated in the same way as control steps. The mother's or father's genes for hardware resources are copied to the chromosome of the new offspring, again selected randomly. If the selected gene violates the environment, the gene of the other parent is copied. If neither gene meets the environment, a new gene is selected from the available hardware resources, i.e., those used in the parent circuits. When no suitable resource is included in this selection, a new resource is added and allocated to the node.

In Figure 5, the two nodes +0 and +1, assigned to control step 1, happen to inherit the same arithmetic logic unit, alu 0, from different parents. This causes a violation of resource allocation. The gene of +1 from the mother circuit is therefore abandoned, and the gene of the father is copied. However, it is again the same arithmetic logic unit, alu 0, so this choice is also rejected. One of the arithmetic logic units used in the parents that does not cause any violation of the environment is then selected. In this case, alu 1 is selected from the mother circuit.

Mutation is also carried out in the same way as for scheduling: a resource is selected from the available hardware resources.

F. The recombination procedure

The following is the formal procedure for offspring creation.

1) Two parents are selected from the current generation, with preference given to circuits that fit the objectives more closely. Let the parent chromosomes be $\mathit{circuit}_i$ and $\mathit{circuit}_j$ and the offspring chromosome be $\mathit{circuit}_k$:

$\mathit{circuit}_i = g^i_{00}\, g^i_{01}\, g^i_{02}\, g^i_{03} \cdots g^i_{n0}\, g^i_{n1}\, g^i_{n2}\, g^i_{n3},$
$\mathit{circuit}_j = g^j_{00}\, g^j_{01}\, g^j_{02}\, g^j_{03} \cdots g^j_{n0}\, g^j_{n1}\, g^j_{n2}\, g^j_{n3},$
$\mathit{circuit}_k = g^k_{00}\, g^k_{01}\, g^k_{02}\, g^k_{03} \cdots g^k_{n0}\, g^k_{n1}\, g^k_{n2}\, g^k_{n3},$

where $g^i_{p0}$, $g^j_{p0}$ and $g^k_{p0}$ are for control steps, $g^i_{p1}$, $g^j_{p1}$ and $g^k_{p1}$ are for operational units, $g^i_{p2}$, $g^j_{p2}$ and $g^k_{p2}$ are for registers, $g^i_{p3}$, $g^j_{p3}$ and $g^k_{p3}$ are for input directions, and $p = 0, 1, \ldots, n$.

2) The first gene $g^k_{00}$ is produced as follows: i) crossover: one of the parent genes $g^i_{00}$ and $g^j_{00}$ is selected randomly and copied to $g^k_{00}$; ii) mutation: the copied gene $g^k_{00}$ is possibly changed to another gene which satisfies the design rules.

3) The same processes are performed to produce $g^k_{01}$, $g^k_{02}$ and $g^k_{03}$.

4) The next gene $g^k_{10}$ is produced as follows: i) crossover: one of the parent genes $g^i_{10}$ and $g^j_{10}$ is selected randomly. If the selected gene meets the environment, it is copied to $g^k_{10}$. If not, the gene of the other parent is copied unless it too violates the design rules. When neither gene satisfies the design rules, another gene that meets the environment is selected from the available control step list; ii) mutation: the copied gene $g^k_{10}$ is possibly changed to another gene that meets the design rules. (It is selected from the available control steps, initially from the limited earlier ones, later from the entirety.)

5) The same processes are performed to produce $g^k_{11}$, $g^k_{12}$ and $g^k_{13}$.

6) This process is repeated until $g^k_{n3}$ is obtained.

7) The genes are relabeled so that different labels are not attached to the same circuits.
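As one concrete instance of the crossover part of step 4), applied to an operational unit gene as described in Section E, the fallback order can be sketched as follows. The function `choose_unit`, the integer numbering of units and the `busy` set are assumptions made for illustration.

```python
import random

def choose_unit(mother_unit, father_unit, busy, parent_units, rng):
    """Pick an operational unit for one node, following the repair order.

    busy: set of units already used in this control step of the offspring;
    parent_units: units of this kind used in either parent circuit.
    Returns a parent's unit if free, else another free known unit, else a new one.
    """
    first, second = rng.sample([mother_unit, father_unit], 2)
    for candidate in (first, second):
        if candidate not in busy:
            return candidate
    free = [u for u in parent_units if u not in busy]
    if free:
        return rng.choice(free)
    return max(list(parent_units) + list(busy)) + 1   # add a new resource
```

For the conflict of Figure 5, both parents offer alu 0 while alu 0 is already busy in control step 1, so `choose_unit(0, 0, {0}, [0, 1], rng)` falls through to alu 1.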

III. Fitness functions

Each chromosome is measured by the extent to which it meets the objectives. This measurement is carried out by fitness functions. In our system, fitness functions are provided to measure the performance of both scheduling and allocation, the main components of high-level synthesis. Each fitness function takes a smaller value when a chromosome fits its objective more closely. The objective of scheduling is to minimize the overall computation time. The objective of allocation is to minimize the hardware cost. However, these objectives are often contradictory. The design of a faster system requires minimization of computation time, but this leads to an increase in the number of operational units. Conversely, the design of an economical system requires minimization of hardware resources. In our GA, this problem is solved by preparing several objective functions, some of which represent the minimization of computation time while others represent the minimization of the number of hardware resources. Weights for these objective functions are set according to the design strategy.

The fitness function for scheduling is very simple. It counts the total number of control steps, which represents how fast the circuit is.

There are several fitness functions for allocation. Some of these are simple. The numbers of multipliers, arithmetic logic units, registers and multiplexers are each computed by their own fitness functions. The maximum number of fan-outs from a register and the total number of fan-ins to the multiplexers are also computed by such fitness functions.

The remaining fitness functions are more sophisticated. They measure the performance of allocation, but this is strongly correlated with scheduling. The first function evaluates how evenly nodes are distributed among control steps. If the number of nodes allocated to a control step exceeds the average number of nodes per control step, the square of the excess is added to the value of this fitness function. In Figure 9, a circuit is designed using four or fewer control steps. It consists of eight addition nodes, so the circuit requires at least two ALUs. Of course, whether the circuit can be designed with two ALUs depends on the behavioral description. However, our GA assumes that the circuit is designed with two ALUs, which forces two addition nodes into each control step, and penalizes the excess of nodes over the average number of nodes per control step. In the allocation result on the left-hand side of Figure 9, the number of nodes allocated to control step 2 exceeds the average by one. Thus, the value of the fitness function is one. This forces a node from control step 2 to 3.

Fig. 9. Two allocation results for eight addition nodes.

In the allocation result on the right-hand side of Figure 9, the value of the fitness function is four, since the number of nodes allocated to control step 2 exceeds the average by two. If an economical circuit is being designed, this forces two nodes from control step 2 to 3. On the other hand, if a faster circuit is being designed, this movement leads to an increase in the fitness value for speed, and may therefore be blocked. In this case, it forces one node from control step 2 to 0 or 1.

This fitness function can also differentiate circuits with the same performance and cost according to their closeness to a better solution. The scheduling result in Figure 10 requires four control steps and three ALUs, as does the one on the left-hand side of Figure 9. However, the scheduling result on the left-hand side of Figure 9 is closer to a better solution than the one in Figure 10, since the former needs to move only one node from control step 2 to 3 in order to obtain a better circuit, while the latter requires the movement of two nodes. This is reflected in the values of the fitness function: the former gives one as the fitness value, while the latter yields two.

Fig. 10. Another allocation result for eight addition nodes.
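The first of these fitness functions is easy to state in code. In the sketch below the function name `evenness_penalty` is hypothetical, and the per-step node counts used in the usage note are assumed distributions chosen to reproduce the values quoted for Figures 9 and 10; the exact placements in the figures are not recoverable from the text.

```python
def evenness_penalty(nodes_per_step):
    """Sum of squared excesses of each step's node count over the average."""
    average = sum(nodes_per_step) / len(nodes_per_step)
    return sum((count - average) ** 2
               for count in nodes_per_step if count > average)
```

For eight nodes over four control steps the average is two, so counts of (2, 2, 3, 1) give a penalty of one, (2, 2, 4, 0) give four, and (3, 3, 1, 1) give two, which matches the ordering of the three schedules discussed above.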

The second function takes account of the number of blank control steps, where less nodes than average are allocated, with heavier weighting for a later control step. This tness function is provided to compensate a disadvantage of our system, that is, nodes are forced to later control steps in gene repairs at crossover operations. The tness function prefers a node allocated in an earlier control step. The third tness function is for non-pipelined multipliers that need two control steps to perform their function. When more nodes are allocated from an even (or odd) number of control steps, a lower value is given since this avoids overlaps in multipliers. For example, two scheduling plans for the nodes of multiplication are shown in Figure 11. Both plans require three multipliers. However, the left scheduling plan (only one node is allocated to an odd numbered control step) is closer to a scheduling plan requiring two multipliers than the right since it is achieved by moving node x3 from control step 1 to 0. IV. Results

Our GA has been applied to fast discrete cosine transform (FDCT) consisting of 42 nodes and including the possibility of large scale parallel computation as shown in Figure 1. As described before, high level synthesis is a multiple objective problem, where the DRAFT

September 30, 1998

17 0

x0 x3

1 2

0

x1

3 4 5

x0 x3

1 x5

2

x1 x5

3 x2 x4

4

x2 x4

5

Fig. 11. Two allocation results for six multipliers.

optimal solution depends on each design consideration. In this example, we set priorities for minimizing the number of hardware resources in the following order: control steps, arithmetic logic units, multipliers, registers, multiplexers, fan-outs from an operational unit, and fan-ins to multiplexers. The rst experiment is carried out to obtain the most economical circuit within 11 control steps.(As the number of control steps is increasing, the number of necessary hardware resources decreases. The upper bound gives the most economical circuit unless circuits with the same number of hardware resources are designed with less control steps. In this case, 11 control steps gave the most economical circuit.) The mutation rate is set to 0 2 in all the experiments, being decided after comparing the results of several trials. Figure 12 shows the transition of the tness value of the best circuit in each generation. When the control step is 11, the optimal number of arithmetic logic units and multipliers are known to be three and four, respectively. This optimal solution is obtained around the 70,000th generation. In our GA, two parents are selected from the current generation on the condition that a better circuit is likely to be more often selected as a parent than a worse one. (The best circuit is selected 1.4 times as often as the worst. The rest are selected linearly to this preference.) Their ospring is inserted into the generation and the worst circuit is eliminated from the generation. The remaining circuits form the next generation. The optimal solution is, therefore, obtained when about 70,000 ospring are generated. Then, circuits with less registers, multipliers, fan-ins and fan-outs in this order are gradually obtained as more generations are generated. After 270,000 ospring productions, :

September 30, 1998

DRAFT

18

TABLE II Resource allocation results for 11 control steps of the fast discrete cosine transform.

steps

0 1 2 3 4 5 6 7 8

alu

mul

0 0 6 1 11 20 29 25 22 38

1 2 3 5 2 13 4 7 12 10 21 26 28 23 24 27 41

40

39

9 10

3 8 8 14 14 30 30 32 32

register

4

5

6

9 9 15 15 31 31 33 33

19 19 16 16 37 37 34 34

18 18 17 17 36 36 35 35

0 0 0 0 18 18 18 25 25 25 25 25

1 5

2

3 4 5 6 7 8 3 2 3 6 13 1 2 3 4 7 8 9 12 11 4 10 7 8 21 4 7 20 14 15 26 4 16 28 17 20 14 15 4 28 24 20 27 30 37 36 22 28 24 31 22 24 38 32 35 34 22 33 24 38 40 39 22 24 38

9 10

19 19 19 23 23 23 23 23

29

41 41 41

the population converges, i.e., new offspring that are better than the worst member of the generation are almost never generated. The population size in this experiment is 800. (Several different population sizes were tested; this size gave the best result.) The allocation results are shown in Table II. A circuit with three arithmetic logic units, four multipliers, 11 registers, nine multiplexers, five fan-outs and 56 fan-ins has been designed by our GA. Table II shows that alu 0 is allocated to node +00 at control step 0, to -06 at control step 1 and to +01 at control step 2. At control step 0, four multiplication nodes are performed simultaneously using mul 3, 4, 5 and 6, assigned to x32, x33, x34 and x35, respectively. Register 0 is assigned to +00 from control step 0 to 2, to x18 from control

Fig. 12. Transition of the fitness value of the best circuit in each generation for the fast discrete cosine transform. (Plot of 'ga_out58.dat'; vertical axis ticks run from 0 to 110, horizontal axis ticks from 0 to 300,000.)

step 3 to 5 and to +25 from control step 6 to 10. Other experiments have been carried out for various ranges of control steps. The results are shown in Table III. These allocation results were obtained in around 200,000, 570,000, 495,000 and 995,000 offspring productions (generations) for 8, 13, 26 and 34 control steps, respectively. Each result is also optimal with respect to the number of operational units. Finally, we applied our GA to the fifth-order elliptic wave filter (FEWF), whose behavioral description is shown in Figure 13. The behavioral description has 34 nodes, of which eight are multiplications and the rest are additions. Unlike FDCT, FEWF has feedback loops. From the viewpoint of designing circuits, there are no differences between them except that feedback values are stored in registers and are fed to the circuit when a new loop iteration is performed. The results are shown in Table IV. The first two experiments were carried out using pipelined multipliers. A pipelined multiplier takes two control steps to perform one multiplication. However, it can accept inputs at every control step by overlapping its operations. An arithmetic logic unit needs one control step to compute one addition, subtraction or comparison. For 17 control steps, the circuit designed by our GA consists of three arithmetic logic units, two pipelined multipliers and ten registers. The circuit is obtained in around 200,000 offspring productions (generations).
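The parent selection and replacement rule described earlier in this section (best circuit chosen 1.4 times as often as the worst, one offspring inserted and the worst member removed per production) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the names `rank_weights`, `steady_state_step`, `cost` and `make_offspring` are hypothetical, the cost is assumed to be "lower is better", parents are assumed to be drawn with replacement, and crossover, mutation and gene repair are represented by a placeholder function.

```python
import random


def rank_weights(n, best_to_worst=1.4):
    """Selection weights for a population sorted best-first.

    The worst member gets weight 1.0, the best gets `best_to_worst`,
    and the members in between are interpolated linearly.
    """
    if n < 1:
        raise ValueError("population must not be empty")
    if n == 1:
        return [1.0]
    step = (best_to_worst - 1.0) / (n - 1)
    return [best_to_worst - i * step for i in range(n)]


def steady_state_step(population, cost, make_offspring, rng):
    """Produce one offspring and return the next generation.

    `cost` maps a circuit to a number (lower is better, an assumption).
    `make_offspring(a, b, rng)` is a placeholder for crossover, mutation
    and gene repair, which are not reproduced here.
    """
    ranked = sorted(population, key=cost)  # best first
    weights = rank_weights(len(ranked))
    # Assumption: parents are drawn with replacement, so the same circuit
    # may occasionally be picked twice.
    parent_a, parent_b = rng.choices(ranked, weights=weights, k=2)
    child = make_offspring(parent_a, parent_b, rng)
    ranked.append(child)
    ranked.sort(key=cost)
    return ranked[:-1]  # eliminate the worst circuit


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in: a "circuit" is just an integer that is its own cost.
    population = [rng.randint(0, 100) for _ in range(8)]
    for _ in range(200):
        population = steady_state_step(
            population,
            cost=lambda c: c,
            make_offspring=lambda a, b, r: max(0, min(a, b) + r.randint(-3, 3)),
            rng=rng,
        )
    print(population)
```

Because exactly one member is replaced per call, one call corresponds to one "offspring production" in the counts reported in the text, and the population size stays constant.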


TABLE III Allocation results for the fast discrete cosine transform.

steps  alu  mul  reg  mpx  fout  fin
    8    4    8   12   12     4   50
   10    4    5   11    8     5   44
   11    3    4   11    9     5   56
   13    2    4   12    6     6   43
   14    2    3   12    6     6   52
   18    2    2   10    6     6   45
   26    1    2   12    4    10   42
   34    1    1   12    3    11   39
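The priority ordering used in these experiments (bound the number of control steps, then minimize arithmetic logic units, multipliers, registers, multiplexers, fan-outs and fan-ins in that order) can be illustrated with a lexicographic comparison over the rows of Table III. This is a minimal sketch under stated assumptions: the paper combines weighted fitness functions, whereas the tuple comparison below is a simplified stand-in, and the names `Circuit`, `resource_key` and `most_economical` are hypothetical.

```python
from typing import NamedTuple


class Circuit(NamedTuple):
    steps: int
    alu: int
    mul: int
    reg: int
    mpx: int
    fout: int
    fin: int


# Rows of Table III (fast discrete cosine transform).
TABLE_III = [
    Circuit(8, 4, 8, 12, 12, 4, 50),
    Circuit(10, 4, 5, 11, 8, 5, 44),
    Circuit(11, 3, 4, 11, 9, 5, 56),
    Circuit(13, 2, 4, 12, 6, 6, 43),
    Circuit(14, 2, 3, 12, 6, 6, 52),
    Circuit(18, 2, 2, 10, 6, 6, 45),
    Circuit(26, 1, 2, 12, 4, 10, 42),
    Circuit(34, 1, 1, 12, 3, 11, 39),
]


def resource_key(c):
    """Resources in priority order; Python compares tuples lexicographically."""
    return (c.alu, c.mul, c.reg, c.mpx, c.fout, c.fin)


def most_economical(circuits, max_steps):
    """Pick the cheapest circuit among those within the control-step bound."""
    feasible = [c for c in circuits if c.steps <= max_steps]
    if not feasible:
        raise ValueError("no circuit satisfies the control-step bound")
    return min(feasible, key=resource_key)


if __name__ == "__main__":
    print(most_economical(TABLE_III, max_steps=11))
```

With a bound of 11 control steps the function selects the 11-step row, which agrees with the remark in the text that 11 control steps gave the most economical circuit for that bound.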

Fig. 13. Behavioral description for the fifth-order elliptic wave filter. (Data-flow graph not reproduced. Its 34 nodes are numbered 00 to 33: the multiplication nodes are x04, x07, x10, x15, x20, x23, x27 and x32, and the remaining 26 nodes are additions.)

TABLE IV Allocation results for the fifth-order elliptic wave filter.

system  steps  alu  mul  reg
GA         17    3   2p   10
GSA        17    3   2p   11
SE         17    3   2p   11
HAL        17    3   2p   12
GA         19    2   1p    9
GSA        19    2   1p   11
SE         19    2   1p   11
HAL        19    2   1p   12
GA         19    2    2    9
GSA        19    2    2   10
SE         19    2    2   10
HAL        19    2    2   12
GA         21    2    1    9
SE         21    2    1   11
HAL        21    2    1   12
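Table IV mixes experiments with pipelined and non-pipelined multipliers (assumed here to be the entries with and without the "p" suffix). The difference between the two unit types can be sketched by counting how many multipliers a given set of multiplication start steps would occupy. This is an illustrative sketch only: `units_needed` and `start_steps` are hypothetical names, the latency of two control steps is taken from the text, and data dependencies between operations are ignored.

```python
def units_needed(start_steps, latency=2, pipelined=False):
    """Minimum number of multipliers for multiplications starting at `start_steps`.

    A non-pipelined multiplier is blocked for `latency` control steps per
    multiplication. A pipelined multiplier also needs `latency` steps to
    finish, but it can accept a new input at every control step, so for
    issue purposes it is blocked for one step only.
    """
    busy = 1 if pipelined else latency
    load = {}
    for start in start_steps:
        for step in range(start, start + busy):
            load[step] = load.get(step, 0) + 1
    # The peak number of simultaneously blocked units is the number required.
    return max(load.values(), default=0)


if __name__ == "__main__":
    starts = [0, 1, 2, 3]  # one multiplication issued at each control step
    print(units_needed(starts, pipelined=True))
    print(units_needed(starts, pipelined=False))
```

For back-to-back multiplications, a single pipelined unit suffices where two non-pipelined units are needed, which is the kind of trade-off visible between the pipelined and non-pipelined rows for the same number of control steps.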

The allocation results for 19 control steps are shown in Table V. The circuit consists of two arithmetic logic units, one pipelined multiplier, nine registers, six multiplexers, at most five fan-outs from an operational unit and 31 fan-ins for the multiplexers. The last two experiments were carried out using non-pipelined multipliers, which accept inputs and compute the product in two control steps and cannot accept new inputs until the computation is completed. For 19 control steps, our system designed a circuit with two arithmetic logic units, two multipliers and nine registers. The circuit was obtained in around 80,000 offspring productions (generations). Compared with the results of HAL [2], simulated evolution (SE) [12] and GSA [10], our results demonstrate an improvement. Among the four systems, there are no differences in


the numbers of arithmetic logic units and multipliers. However, our system is superior in terms of the number of registers. (As SE stores the input value in a register and the other systems do not, the number of registers in SE should be decremented by one when it is compared with the other systems.) The main reason for this difference is that our GA solves scheduling and allocation simultaneously. It also avoids falling into local minima more successfully than the other systems do. Systems that solve scheduling and allocation sequentially have the advantage of obtaining results in a shorter computation time, but they sacrifice reliability, an important factor for circuit design.

V. Conclusions

A new GA that solves scheduling and allocation simultaneously in high-level synthesis has been successfully developed and applied to the design of large-scale circuits replete with parallel computation. Our GA creates the chromosome of a new offspring in a straightforward way, in which a faulty gene is repaired using the other parent's gene or an available gene. This method is effective where offspring created by the conventional crossover and mutation operations rarely satisfy the environment imposed by severe design restrictions. Our GA succeeds in obtaining optimal solutions with respect to the number of arithmetic logic units and multipliers for various ranges of control steps, and it obtains better results than other systems in terms of the number of registers. Along with this, the fitness functions, which differentiate similar circuits with the same performance and cost, also play important roles in obtaining optimal solutions. It is also shown that our GA can solve multiple objective problems by giving a different weight to each fitness function. In comparing our GA with other systems, the number of control steps, which affects the processing speed of a designed system, is bounded to some range in each experiment. Then, economical circuits, where the numbers of arithmetic logic units, multipliers, registers, fan-outs and fan-ins are minimized in this order, are sought within the range of control steps. For various ranges of control steps, our GA succeeds in obtaining more economical circuits than other systems. If we change the order of minimization by giving a different weight to each fitness function, different results can be obtained in order to satisfy various design criteria. However, the procedure does depend on initial conditions to obtain an optimal solution.


In the worst case an optimal solution will be found only 10% of the time. In most cases, optimal solutions were obtained in around 400,000 generations. This process takes 5 to 10 minutes using a personal computer with a 120 MHz Pentium processor.

Acknowledgments

The work described in this paper is supported by NEC. The authors gratefully acknowledge Dr. M.J.M. Heijligers for providing a behavioral description of the fast discrete cosine transform and papers published by his group, and Dr. Sadiq M. Sait for providing his research papers.

References

[1] R.A. Walker, "The status of high-level synthesis," IEEE Design and Test of Computers, Winter, pp. 42-43, 1994.
[2] P.G. Paulin and J.P. Knight, "Force-directed scheduling for the behavioral synthesis of ASIC's," IEEE Trans. on CAD, Vol. 8, No. 6, pp. 661-679, 1989.
[3] K. Ohmori, "High-Level Synthesis Using Genetic Algorithms," IEEE International Conference on Evolutionary Computing, pp. 209-214, 1995.
[4] J. Daalder, P.W. Eklund and K. Ohmori, "High-Level Synthesis Optimisation Using Genetic Algorithms," Fourth Pacific Rim International Conference on Artificial Intelligence, Springer-Verlag LNAI 1114, pp. 276-287, 1996.
[5] J.H. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI, 1975.
[6] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[7] L. Davis, Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.
[8] J.R. Koza, Genetic Programming: On Programming Computers by means of Natural Selection and Genetics, The MIT Press, Cambridge, MA, 1992.
[9] R.S. Martin and J.P. Knight, "Genetic Algorithms for Optimization of Integrated Circuits Synthesis," Proc. of the Fifth International Conference on Genetic Algorithms, Morgan Kaufmann, pp. 432-438, 1993.
[10] S. Ali, S.M. Sait and M.S.T. Benten, "GSA: Scheduling and Allocation using Genetic Algorithm," Proc. of European Design Automation Conference, IEEE Computer Society, pp. 84-89, 1994.
[11] M.J.M. Heijligers and J.A.G. Jess, "High-Level Synthesis Scheduling and Allocation using Genetic Algorithms based on Constructive Topological Scheduling Techniques," IEEE International Conference on Evolutionary Computing, pp. 56-61, 1995.
[12] T.A. Ly and J.T. Mowchenko, "Applying simulated evolution to high level synthesis," IEEE Trans. on CAD, Vol. 12, No. 3, pp. 389-409, 1993.
[13] B.M. Pangrle, "Splicer: A Heuristic Approach to Connectivity Binding," 25th Design Automation Conference, pp. 536-541, 1988.
[14] D. Whitley, T. Starkweather and D. Fuquay, "Scheduling problems and traveling salesmen: The genetic edge recombination operator," Proc. of the Third International Conference for Genetic Algorithms, Morgan Kaufmann, pp. 133-139, 1989.


TABLE V Allocation results for the fifth-order elliptic wave filter.

steps

0 1 2 3

alu

0 1 0 30 1 2 3

4 5 6 7 8 9 10 11 12 13 14 15 16 17 18


5 6 18 19 8 13 14 9 31

21 22

24 25 26

16 17 11 12 28 33 29

mul

2

4 20

7 23

15 10 27 32

register

0 1 2 3 4 5 6 7 8 0 17 19 16 30 33 28 0 1 19 16 30 33 28 0 1 2 16 30 33 28 0 1 3 16 30 33 28 0 1 3 16 30 33 28 0 1 3 4 16 30 33 28 0 1 5 3 20 16 30 33 28 0 6 5 3 21 16 30 33 28 0 18 5 21 16 22 33 28 0 7 5 19 21 16 33 28 0 8 5 19 21 16 23 33 28 0 8 13 19 21 16 24 33 28 0 8 14 19 25 16 24 33 28 9 8 14 19 26 16 24 33 28 15 8 14 19 26 16 24 31 28 10 8 14 19 26 16 24 31 28 11 8 17 19 26 16 24 27 28 12 32 17 19 26 16 24 28 12 32 17 19 29 16 33 28
