Petri Net Representation for Parallel Loop Scheduling Using a Genetic Algorithm

M.R. O'Neill, V.H. Allan, N. Flann, and H. Chen
Department of Computer Science, Utah State University
Logan, Utah 84322-4205
[email protected]
Phone: (801) 797-2022   Fax: (801) 797-3265

Abstract

Scheduling is an important combinatorial optimization problem that has many practical applications. This paper introduces an application of GAs to scheduling iterative, or repeating, operations. Iterative schedules occur in many tasks, including manufacturing, production control, and compiling iterative programs. A Petri net representation of schedules is introduced that enables the GA to search the space of legal schedules by incorporating resource and dependency constraints. Each genome represents the space of possible delay allocations to arcs of the Petri net. The cost function simulates the Petri net to identify the iterative schedule, which is then evaluated. This paper includes a study of the approach applied to scheduling machine instructions for a parallel computer architecture given iterative programs. Results indicate that the GA is robust and produces more efficient schedules than a traditional heuristic method when applied to a diversity of iterative programs.

Key Words: software pipelining, instruction-level parallelism, Petri nets, cyclic scheduling, genetic algorithm, combinatorial optimization.


1 Introduction

Scheduling iterative operations is a challenge for traditional approaches to scheduling using genetic algorithms because the actual schedule to be found is of an indeterminate size. Each iterative schedule consists of a prelude, which sets up the resources appropriately, a loop section, which is iterated until some condition is satisfied, and a postlude, where the resources are "cleaned up." The sizes of the prelude, loop section, and postlude cannot be directly determined from the problem definition. For a given problem, many legal schedules can exist that have different sizes for all three components. Because GAs traditionally represent candidate solutions as fixed-size chromosomes, a representation must be identified that can incorporate the needed flexibility in schedule size, while still incorporating the constraints of a traditional schedule, such as limited resources and precedence relations over operations. This paper introduces a new approach to this problem that employs a Petri net to represent the schedule. Petri nets are general-purpose graphs that can represent cyclic or non-cyclic processes. Because a Petri net is a fixed-size structure, with a fixed number of parameters that control its behavior, it can be represented by a fixed-size chromosome. A Petri net consists of directed arcs, which represent operations of the schedule, and nodes, which represent transitions between operations. The arrangement of arcs directly represents the precedence relations over operations. If operation Oi must be performed before operation

Oj, then the arc representing Oi must precede Oj in all paths through the graph. To produce a schedule, which is a time-ordered sequence of operations, the Petri net must be simulated. It is here that resource constraints are imposed. Tokens, representing resources, are propagated through the net, with each operator being applied when appropriate tokens arrive at the "in" nodes of the operator's arc. The simulation process ensures that at each time step, no more than the available resources are consumed. Petri nets can represent a traditional schedule using a directed graph with designated start and end nodes. However, Petri nets can also easily represent iterative schedules by using a cyclic directed graph, where the cycle represents those operators that must be repeated. To produce the prelude, loop section, and postlude components of the schedule, the Petri net is simulated, then the resulting sequence of operators is searched to identify the loop section by looking for a minimally sized repeated subsequence. It can be shown that if such a subsequence is identified, the Petri net will repeat that subsequence until the loop termination conditions are met. The prelude is then the operator sequence that precedes the repeating subsequence, and the postlude is the operator sequence that succeeds it. Petri nets enable genetic algorithm approaches to be applied to scheduling iterative tasks. In addition to providing a fixed-size representation, Petri nets ensure that all legal constraints are incorporated into the space of schedules searched by the GA. However, a disadvantage is the computational expense of the resulting cost function. To evaluate the quality of a schedule, the Petri net must be simulated and the iterative schedule identified through a search process. Once identified, the quality of a schedule is given by a weighted sum of the lengths of the prelude, loop section, and postlude (as sketched below). The loop section is given the most weight, since it is repeated many times when executed. To demonstrate the usefulness of this approach, the domain of scheduling iterative code for a parallel computer architecture has been selected. This task is known as software pipelining. This domain is of practical importance because iterative operations dominate runtime in real-world scientific and business programs. To significantly speed up these programs, parallel compilers must make the best use of parallel computer architectures by producing the most efficient schedules of iterative code. The rest of the paper is organized as follows. First, the task of software pipelining is introduced and a brief review of previous heuristic non-GA approaches is included. Second, the GA approach is described, including details of the chromosome encoding of Petri net parameters. Third, an empirical study, in which a standard test set of loop programs is scheduled, is described. The paper concludes with a brief section on future work and a summary.
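To make the cost computation concrete, the following Python sketch finds the smallest immediately repeating subsequence in a simulated operator trace and scores the resulting schedule. The function names and the specific weight values are illustrative assumptions; the paper states only that the loop section receives the most weight.

```python
# Sketch: locate the loop section in a simulated operator trace and score
# the resulting schedule. The weights are illustrative placeholders.

def find_loop_section(trace):
    """Find the smallest immediately repeated subsequence (the loop body).

    `trace` is a list of parallel instructions (each a list of operations)
    produced by simulating the Petri net. Returns (prelude, body) or None
    if no repeat has appeared yet.
    """
    n = len(trace)
    for start in range(n):                       # earliest possible loop start
        for length in range(1, (n - start) // 2 + 1):
            first = trace[start:start + length]
            second = trace[start + length:start + 2 * length]
            if first == second:                  # minimal repeat found
                return trace[:start], first
    return None

def schedule_cost(prelude, body, postlude, w_pre=1, w_body=10, w_post=1):
    # Weighted sum of component lengths; the body dominates because it
    # is executed many times at run time.
    return w_pre * len(prelude) + w_body * len(body) + w_post * len(postlude)
```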

2 Software Pipelining

Software pipelining is a loop optimization technique in which the body of the loop is reformed so that an iteration of the loop starts executing before previous iterations of the loop have finished. It is an extremely useful technique for architectures which support instruction-level parallelism [RF93].

Figure 1: (a) Original loop body (O1: A = A+2; O2: B = A-t; O3: C = B+4; O4: D = B+6; O5: E = D+F; O6: F = E+D). (b) Data dependence graph of the loop body. (c) Software-pipelined loop body. (d) Early schedule.

Software pipelining takes advantage of parallelism between iterations of a loop by scheduling operations such that satisfying dependencies is the foremost concern. Dependencies which span one or more iterations are termed loop carried. One can create a dependency graph in which the nodes represent operations and the arcs represent a must-follow relationship. An arc a → b is annotated with a min time, which is the time that must elapse between the execution of the first operation and the execution of the second. In the graphical representation, it is common to let one node represent the same operation from all iterations. Since an operation of a loop behaves similarly in all iterations, this is a reasonable notation. However, a dependence between a and b of the same iteration must be distinguished from a dependence between a and b of different iterations. Thus, in addition to being annotated with min time, each dependency is annotated with the dif, which is the difference in the iteration numbers from which the operations come. Figure 1(b) shows the data dependence graph for the loop body of Figure 1(a). Arcs are annotated with (dif, min) pairs. The nodes in the data dependence graph represent operations from the loop body while the arcs represent the data dependence relationships between the nodes. The loop is scheduled by placing each operation of each iteration at a particular point in time. An operation can be scheduled at time t only if all operations on which it is dependent are completed before t. An operation from the second iteration will not necessarily wait until all operations from the first iteration finish executing unless dependencies require it. Figure 1(c) shows a schedule satisfying dependence constraints for the first three iterations. The vertical dashed lines separate the different iterations of the original loop body, while each row from top to bottom represents successive time steps. Each row in the schedule is a parallel instruction, and all constituent operations are performed simultaneously. If the operations of every iteration had to be scheduled independently, scheduling time as well as schedule space would be prohibitive. However, if the schedule can be made to repeat, these disadvantages are eliminated.

Notice that for this example, the operations shown in the box do indeed repeat. The code before the repeated section is termed the prelude, while the code after the repeated section is termed the postlude. In fact, we can think of the repeated section as forming a loop. This new loop body, together with the prelude and postlude, is termed a software pipeline. The initiation interval is defined as the time for an iteration of the parallel loop to execute and is bounded from below by the maximum min/dif ratio on any cycle. As the new loop body may contain more than one copy of each operation, the term effective initiation interval is used to denote the average number of time units it takes to complete a full iteration of the original loop. The effective initiation interval of the pipeline is the length of the new loop body divided by the number of copies of each operation which are present. In Figure 1(b) the effective initiation interval is 2/1 = 2. Since the initiation interval determines how fast the iterations of the new loop body can be executed, minimizing the initiation interval increases speedup. If this loop were executed 10 times, the pipelined version takes 23 parallel instructions: the prelude takes 4 cycles, the new loop body of length 2 must be executed 8 times (the equivalent of 2 iterations is completed in the prelude and postlude) for a total of 16 time steps, and the postlude requires 3 cycles. If each iteration is executed sequentially (5 time cycles per iteration), 50 instructions are required to execute the whole loop. Software pipelining thus represents a speedup of 2.2 over the local parallelization (see the sketch below). Formally, the software pipelining problem is as follows. Given a set of operations which are performed in a loop, the available resources such as processors and memory, and the dependence information which controls proper scheduling, produce a schedule consisting of three distinct parts: prelude, new loop body, and postlude. Each part of the schedule is an ordered list of sets of operations such that all operations of a set are performed simultaneously and are completed before operations in subsequent sets. The execution time of the new loop (prelude, multiple executions of the new loop body, and postlude) should be less than the execution time of the original loop. All dependency and resource constraints must be satisfied by the new schedule.
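The arithmetic in this example is easy to check mechanically. The short Python sketch below reproduces the numbers above; the variable names are ours, and the constants come directly from the example (10 iterations, prelude of 4 cycles, loop body of length 2 run 8 times, postlude of 3 cycles, 5 cycles per sequential iteration).

```python
# Verify the running example from Figure 1.
iterations = 10
prelude, body_len, postlude = 4, 2, 3
body_runs = iterations - 2        # 2 iterations are absorbed by prelude/postlude

pipelined = prelude + body_len * body_runs + postlude    # 4 + 16 + 3 = 23
sequential = iterations * 5                              # 5 cycles per iteration = 50

print(pipelined, sequential, round(sequential / pipelined, 1))   # 23 50 2.2
```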

2.1 Previous work

There have been a number of software pipelining algorithms proposed. Ebcioglu and Nakatani's EPS [EN90] is a powerful technique for extraction of instruction-level parallelism from loops with unpredictable branches. Both Rau [RST92] and Lam [Lam88] propose algorithms which produce a regular schedule but cannot achieve fractional initiation intervals. Vegdahl's algorithm [Veg92] uses dynamic programming and is an exhaustive technique in that it generates all the possible schedules and searches for the pipeline with the shortest initiation interval. Another interesting technique is Gao's Petri net model [GWN91a]. His model cannot handle loop-carried dependencies which span more than one iteration and produces an inferior schedule in several cases. Rajagopalan and Allan's use of the Petri net concept produces competitive, elegantly generated schedules [RA94] and is a key component of this research. None of these approaches employs GAs or other general-purpose search-based approaches.

Figure 2: (a) DDG. (b) Strongly connected DDG. (c) Petri net at time 0. (d)-(f) Petri net at successive time steps.

Rather, they rely on powerful heuristics to produce a schedule. While these strong methods produce schedules quickly, they necessarily produce mostly sub-optimal schedules (if they produce a schedule at all). Genetic algorithms promise to produce near-optimal schedules for a diversity of problem types because they are naturally robust and effective at searching large combinatorial spaces. However, because identifying the optimal schedule is known to be NP-complete, there can be no guarantee of optimality. There have been many applications of GAs to scheduling problems, although none have dealt with iterative schedules. Bruns [Br93] points out the great advantage of incorporating legal constraints, such as precedence and resource constraints, into the representation searched by GAs. In [Kid93] a task allocation problem for parallel software is considered. Here, domain-dependent heuristics are employed to direct the search to high-quality schedules.

3 Approach

The first stage of the approach is to traverse the given code and produce a dependency graph, as in Figure 1(b). This is easily done using traditional compiler technology. The next stage is to produce a Petri net model that represents the cyclic dependences of a loop. A Petri net model is like a data dependence graph with the current scheduling status embedded. Each operation is represented by a transition. Places show the current scheduling status. Each pair of arcs between two transitions represents a dependence between the two transitions. When the place (between the transitions) contains a token, it is a signal that the first operation has executed, and the second is ready to execute. A Petri net is a bipartite graph. The two types of nodes are places P and transitions T. Arcs connect transitions and places. Figure 2(a) shows the DDG from the previous example, while Figure 2(b) shows the DDG which has been made strongly connected. Figure 2(c) shows the corresponding Petri net. The transitions are represented by horizontal bars while places are represented by circles. An initial mapping M associates with each place p a number of tokens M(p) (shown as black dots) such that M(p) ≥ 0. A place p is said to be marked if M(p) > 0. Associated with each transition t is a set of input places Si(t) and a set of output places So(t). The set Si(t) consists of all places p such that there is an arc from p to t in the Petri net. Similarly, So(t) consists of all places p such that there is an arc from t to p in the Petri net. To simulate a Petri net, the tokens are propagated through the graph according to rules illustrated in Figure 2. A transition t is ready to fire if for all p belonging to Si(t), M(p) ≥ 1. Initially, the place corresponding to an arc with dif > 0 contains dif tokens. When a transition fires, the number of tokens at each input place is decremented by one while the number of tokens at each output place is incremented by one. In Figure 2(d), transition 1 has fired. All transitions fire according to the earliest firing rule; that is, they fire as soon as all their input places have tokens. In Figure 2(e), transition 1 has fired again and transition 2 has fired, creating tokens on each of its output arcs. Transition 5 still cannot fire, as both of its input arcs must have tokens. In Figure 2(f), transitions 1, 2, 3, and 4 have fired. The firing of transitions corresponds to the scheduling of operations. When the same places are marked exactly as in a previous state, a repeating pattern has occurred. In the table below, time 6 and time 8 have identical states, thus delimiting the pipeline. Steps 0 through 5 represent the prelude. The Petri net algorithm produces the following schedule, in which the marked places are shown. A token count greater than one is shown in parentheses.

Time   Marked           Schedule
0      A(3) H           1
1      A(2) B H         1 2
2      A B C E H        1 2 3 4
3      B C D E F H      2 3 4 5
4      C D E F(2) G     3 4 6
5      A D(2) F(2) H    1 5
6      B D F(2) G       [2 6
7      A C D E F H      1 3 4 5]
8      B D F(2) G
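A minimal simulator for these firing rules might look like the following Python sketch. The data layout (places keyed by arbitrary hashable ids, transitions with explicit input and output place lists) and all names are our own illustrative choices; min-time delays and the full resource model are omitted for brevity. Repetition is detected, as in the table above, when a marking recurs; the tie-break used for conflicts is explained in the next paragraph.

```python
# Sketch of the earliest-firing simulation described above.
from collections import Counter

def simulate(transitions, in_places, out_places, marking, max_steps=50):
    """Fire enabled transitions step by step until a marking repeats.

    transitions: list of transition ids.
    in_places / out_places: dicts mapping transition id -> list of place ids.
    marking: Counter mapping place id -> initial token count (dif tokens).
    Returns (trace, loop_start, loop_end), where trace[t] lists the
    transitions fired at time t; loop bounds are None if no repeat is seen.
    """
    fired = Counter()                      # for the fewest-fired tie-break
    seen, trace = {}, []
    for step in range(max_steps):
        state = frozenset((p, n) for p, n in marking.items() if n > 0)
        if state in seen:                  # same marking as before: pipeline found
            return trace, seen[state], step
        seen[state] = step
        # Earliest firing rule: every transition enabled at the start of the
        # step fires now; conflicts over shared tokens are resolved in favor
        # of the transition that has fired the fewest times.
        enabled = sorted((t for t in transitions
                          if all(marking[p] >= 1 for p in in_places[t])),
                         key=lambda t: fired[t])
        produced, instr = Counter(), []
        for t in enabled:
            if all(marking[p] >= 1 for p in in_places[t]):   # tokens left?
                for p in in_places[t]:
                    marking[p] -= 1
                produced.update(out_places[t])
                fired[t] += 1
                instr.append(t)
        marking.update(produced)           # new tokens become visible next step
        trace.append(instr)
    return trace, None, None
```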

A conflict exists when either of two transitions can fire but not both; i.e., once one transition consumes the tokens, the other transition has insufficient tokens to fire. Since Petri nets can be used to model both conflict and concurrency, they are extremely powerful. During simulation of the Petri net to produce a schedule, two heuristic decisions shape the result: the method of resolving conflicts and the earliest firing rule. In our model, conflicts are resolved in favor of the transition which has fired the fewest times. The earliest firing rule states that when a transition can fire, it does fire. This makes sense in that it eliminates future conflicts.

Figure 3: Modifying a DDG with a chromosome. (a) Original DDG. (b) Chromosome: bits 1-3, 4-6, 7-9, and 10-12 are increment fields for arcs (1,1), (1,2), (2,3), and (3,4); bits 13-16 are dif-placement flags; the example chromosome is 001 000 001 000 1110. (c) DDG after modification by the chromosome.

3.1 Encoding of Petri net parameters for GA

To control the behavior of a Petri net during simulation, each arc has an associated min time, which enables the simulation to delay the passing of tokens through the transitions. This has the effect of changing the earliest firing rule by letting the GA decide how long, if at all, to delay operations once data dependencies are satisfied. Any changes in min times do not affect the legality of the schedule. However, min time changes do affect the efficiency of the schedule. The chromosome specifies which operations will be delayed and for how long. To evaluate a chromosome, the Petri net is simulated and the resulting schedule is evaluated based on effective loop length. During simulation, for each arc, the chromosome determines whether there will be any additional delay along that arc and, if so, whether the first or the last arc in the newly formed path will contain the dif. To ensure the original delay constraints are never violated, the delay times in the chromosome are used to increment the existing min times.

Table 1: Result of processing the example in Figure 3. Left: resource conflicts. Middle: schedule for the original Petri net of Figure 3(a). Right: schedule for the Petri net of Figure 3(b).

Resource conflicts:
Operation   r1   r2   r3
1           X    X
2                X    X
3           X
4                     X

Schedule for the original Petri net of Figure 3(a):
State   Schedule
1       1
2       2
3       [3
4       4
5       1 2]

Schedule for the Petri net of Figure 3(b):
State   Schedule
1       1
2       2
3       1
4       [3 2
5       4 1]

Currently, this increment field is assigned 3 bits in the chromosome. Thus, it is possible to add as much as 7 to the total min value on any single arc. Because the solution of not incrementing min times is often a good solution to the problem, the population is biased to always include the all-0's chromosome. See Figure 3 for an example chromosome and its interpretation. The resource conflicts indicated in Table 1 are not shown in the DDG, but are incorporated into the Petri net model. Figure 3(a) shows the original DDG. Figure 3(b) shows an example chromosome. Bits 1-3 map to the increment on arc 1 → 1, bits 4-6 map to arc 1 → 2, bits 7-9 map to arc 2 → 3, and bits 10-12 map to arc 3 → 4. The last 4 bits are the binary variables that specify the allocation of difs. Since the increment of the first arc is a one, the binary variable for arc 1 → 1 is examined. The 1 at position 13 indicates the dif associated with this arc will be placed on the last arc of the path (from 1 to 1) instead of the first. Field 3 represents an increment of 1 on arc 2 → 3. The binary variable assigned to this arc, a 1 (bit 15), has no effect since the dif along this arc is zero. Fields 2 and 4 are zero, so no more changes are made. Figure 3(c) shows the modified DDG. Dummy nodes n5 and n6 are added. The schedule produced by the original Petri net when resource conflicts are present is shown in the middle of Table 1. For this example, the minimal length of the new loop body is 2. This can be seen from the fact that two copies of every resource are required to execute all operations, so a loop body containing all operations must be at least two instructions long. The Petri net algorithm, however, finds a schedule of length 3. The schedule that results from the modified DDG in Figure 3(c) is shown on the right of Table 1. By delaying operation 3 for one cycle, a schedule of length 2 is produced.
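The bit-level decoding just described can be stated compactly in code. The sketch below is an illustrative reading of the layout in Figure 3(b): three increment bits per arc followed by one dif-placement bit per arc. The arc ordering and helper names are our own assumptions.

```python
# Sketch of decoding the chromosome of Figure 3(b).

ARCS = [(1, 1), (1, 2), (2, 3), (3, 4)]          # fixed arc ordering (assumed)

def decode(bits, arcs=ARCS):
    """bits: string like '0010000010001110'. Returns, per arc, the min-time
    increment (0..7) and whether the arc's dif moves to the last arc of the
    newly formed path (only meaningful when the increment is nonzero)."""
    n = len(arcs)
    genes = {}
    for i, arc in enumerate(arcs):
        inc = int(bits[3 * i:3 * i + 3], 2)       # 3-bit increment field
        dif_on_last = bits[3 * n + i] == '1'      # 1-bit dif-placement flag
        genes[arc] = (inc, dif_on_last)
    return genes

# The chromosome from Figure 3(b): 001 000 001 000 1110
print(decode('0010000010001110'))
# {(1, 1): (1, True), (1, 2): (0, True), (2, 3): (1, True), (3, 4): (0, False)}
```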

4 Empirical Study

The GA was applied to a set of DDGs derived from a diversity of iterative code fragments. Each example has different resource constraints. These examples were taken from [Jon91] and a random DDG generator. The GA was compared to two heuristic methods: the Petri net method using the all-0's chromosome (with no additional delays) and Lam's algorithm [Lam88]. The Petri net algorithm is compared to Lam's algorithm on 84 randomly chosen examples.

Table 2: Summary statistics for the Petri net method versus Lam's algorithm

Measurement            Lam      Petri
Number of Cases        84       84
Percent Better         13       32
Mean                   4.286    3.830
Median                 4.000    4.000
Standard Deviation     1.860    1.181
Minimum                2.000    1.500
Maximum                13.000   6.900
25 Percent Quartile    3.000    3.000
75 Percent Quartile    5.000    5.000
Confidence Better      0.000    99.96

Table 2 shows the summary statistics for these examples. Notice that the Petri net approach is better than Lam's algorithm with 99.96 percent confidence. In 32 percent of the cases, the original Petri net algorithm is better than Lam; in 13 percent of the cases, Lam is better than the original Petri net algorithm. Since the GA does at least as well as the original Petri net algorithm, it is these cases that interest us. Our experiments focus on these cases to see if using the GA can improve on the simple Petri net algorithm. Table 3 shows the schedule lengths produced by Lam, the original Petri net algorithm, and the GA for 18 test cases. Cases Ex1 through Ex16 represent examples in which Lam is better than the original Petri net algorithm. The additional two cases are taken from the following experiment: for 19 additional cases in which the original Petri net algorithm is at least as good as Lam, the GA was analyzed. The GA showed improvement in only 2 of the 19 cases. The schedule length shown for the GA represents the schedule length produced by running 15 trials and averaging the results. The column labeled ImproveGA contains an "X" if the GA is statistically better than the original Petri net algorithm. This is determined by comparing the set of 15 GA observations to the schedule length produced by the original Petri net algorithm. Let LP_a be the schedule length produced by the original Petri net for example a. Let μ_a be the mean of the 15 observations for example a. When the hypothesis μ_a ≥ LP_a can be rejected with a t-test, PetriGA is statistically better than the original Petri net algorithm for this example. The GA is statistically better than the original Petri net algorithm for 14 examples. There is no statistical difference in 4 cases.
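This test can be reproduced with a standard one-sample, one-sided t-test, as in the SciPy sketch below; the observation values shown are made-up placeholders, not data from the paper.

```python
# One-sided, one-sample t-test as described above: reject the hypothesis
# that the GA's mean schedule length is >= the plain Petri net's length.
from scipy import stats

lp_a = 3.00                                       # original Petri net length
ga_observations = [2.0] * 12 + [2.5, 2.0, 2.5]    # 15 GA trials (illustrative)

t_stat, p_two_sided = stats.ttest_1samp(ga_observations, lp_a)
# Convert the two-sided p-value to a one-sided one for the "less than" test.
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2

# The GA is judged statistically better when the one-sided p-value is small.
print(t_stat, p_one_sided)
```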

5 Summary

This paper has introduced a new representation that enables GA approaches to iterative scheduling: the Petri net. The GA encodes alternative legal schedules by assigning additional min times to the arcs of the Petri net. By searching the space of additional min times, the GA can find near-optimal schedules of a better quality than direct heuristic techniques.

Table 3: Schedule lengths produced by the GA in 18 interesting loops

Example       Nodes   Lam    OP     PetriGA   ImproveGA
Ex1           4       2.00   3.00   2.00      X
Ex2           4       3.00   4.00   3.00      X
Ex3           4       2.00   3.00   2.33      X
Ex4           4       2.00   3.00   2.00      X
Ex5           4       3.00   4.00   3.00      X
Ex6           4       2.00   3.00   2.33      X
Ex7           5       2.00   2.50   2.00      X
Ex8           5       3.00   4.00   4.00
Ex9           5       3.00   4.00   4.00
Ex10          5       2.00   2.50   2.00      X
Ex11          15      4.00   4.33   4.04      X
Ex12          15      5.00   6.00   5.98
Ex13          15      4.00   4.33   4.02      X
Ex14          15      5.00   6.00   5.98
Ex15          19      5.00   5.67   5.29      X
Ex16          19      5.00   5.67   5.32      X
Ex17          9       3.00   2.50   2.03      X
Ex18          11      4.00   3.50   3.00      X
Ave. Length           3.43   3.95   3.45

This method does suffer from one drawback: the computational cost of evaluating the schedule. This can be reduced through the use of efficient subsequence matching algorithms and of bounding techniques that terminate evaluation when the cost exceeds some factor of the best so far. An interesting avenue of future research is to exploit massively parallel implementations of GAs to do efficient compilation of parallel code. While this paper has focused on a rather narrow problem, that of scheduling loops for a parallel computer architecture, the method is general and could be applied to other iterative scheduling tasks in manufacturing. This could be especially useful for mass-production problems where a fixed number of machines must be scheduled to produce a large number of similar products. This is the focus of future work.

References

[AN88a] A. Aiken and A. Nicolau. A Development Environment for Horizontal Microcode. IEEE Transactions on Software Engineering, 14(5):584-594, May 1988.

[AN88b] A. Aiken and A. Nicolau. Optimal Loop Parallelization. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 308-317, Atlanta, GA, June 1988.

[Br93] Ralf Bruns. Direct chromosome representation and advanced genetic operators for production scheduling. In Proceedings of the Fifth International Conference on Genetic Algorithms, pages 352-359.

[CHMR91] James P. Cohoon, Shailesh U. Hegde, Worthy N. Martin, and Dana S. Richards. Distributed Genetic Algorithms for the Floorplan Design Problem. IEEE Transactions on Computer-Aided Design, 10(4):483-491, 1991.

[EN90] K. Ebcioglu and T. Nakatani. A New Compilation Technique for Parallelizing Loops with Unpredictable Branches on a VLIW Architecture. In D. Gelernter, editor, Languages and Compilers for Parallel Computing, pages 213-229. MIT Press, Cambridge, MA, 1990.

[Gol89] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Company, Inc., Menlo Park, CA, 1989.

[GWN91a] G.R. Gao, W-B. Wong, and Q. Ning. A Timed Petri-Net Model for Fine-Grain Loop Scheduling. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 204-218, June 26-28, 1991.

[GWN91b] G.R. Gao, W-B. Wong, and Q. Ning. A Timed Petri-Net Model for Fine-Grain Loop Scheduling. ACAPS Technical Memo 18, School of Computer Science, McGill University, Montreal, Canada, H3A 2A7, January 1991.

[Jon91] R.B. Jones. Constrained Software Pipelining. Master's thesis, Department of Computer Science, Utah State University, Logan, UT, September 1991.

[Kid93] Michelle Kidwell. Using genetic algorithms to schedule distributed tasks on a bus-based system. In Proceedings of the Fifth International Conference on Genetic Algorithms, pages 368-374.

[KCDGV83] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, May 1983.

[Lam88] M.S. Lam. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318-328, Atlanta, GA, June 1988.

[RA94] M. Rajagopalan and V. H. Allan. Specification of Software Pipelining using Petri Nets. International Journal on Parallel Processing, 3(22):279-307, 1994.

[Raj93] M. Rajagopalan. A New Model for Software Pipelining Using Petri Nets. Master's thesis, Department of Computer Science, Utah State University, Logan, UT, July 1993.

[RF93] B. R. Rau and J. A. Fisher. Instruction-Level Parallel Processing: History, Overview, and Perspective. The Journal of Supercomputing, 7:9-50, 1993.

[RST92] B. R. Rau, M. S. Schlansker, and P.P. Tirumalai. Code Generation Schema for Modulo Scheduled Loops. In Proceedings of Micro-25, The 25th Annual International Symposium on Microarchitecture, December 1992.

[Veg92] S. Vegdahl. A Dynamic-Programming Technique for Compacting Loops. In Proceedings of Micro-25, The 25th Annual International Symposium on Microarchitecture, December 1992.

[Zak89] A. M. Zaky. Efficient Static Scheduling of Loops on Synchronous Multiprocessors. PhD thesis, Department of Computer and Information Science, Ohio State University, Columbus, OH, 1989.
