A Heuristic Model for Task Allocation in Heterogeneous Distributed Computing Systems

A. Abdelmageed Elsadek

B. Earl Wells

Electrical and Computer Engineering
The University of Alabama in Huntsville
Huntsville, AL 35899 U.S.A.
Phone: (205) 890-6047  Fax: (205) 890-6803
[email protected]  [email protected]

Keywords: Heterogeneous Computing, Distributed Processing, Task Allocation, Simulated Annealing

Abstract

In heterogeneous distributed computing systems, partitioning of the application software into modules and the proper allocation of these modules among dissimilar processors are important factors which determine the efficient utilization of resources. This paper presents a new heuristic model, the HMLM/SA, which performs static allocation of such program modules in a heterogeneous distributed computing system in a manner that is designed to minimize the application program's parallel execution time. The new methodology augments the Maximally Linked Module concept by using stochastic techniques and by adding constructs which take into account the limited and uneven distribution of hardware resources often associated with heterogeneous systems. The execution time of the resulting HMLM/SA algorithm and the quality of the allocations produced are shown to be superior to that of the base HMLM algorithm, pure simulated annealing, and the randomized algorithm when they were applied to randomly-generated systems and synthetic structures which were derived from real-world problems.


1 Introduction

Although improvements in device technology have allowed computer speed to increase by several orders of magnitude in recent decades, the computing capability demanded by a number of real-world applications has increased at an even faster pace. The required processing power for these applications cannot be achieved with a single processor. One approach to this problem is to use Distributed Computing Systems (DCSs) that concurrently process an application program by employing multiple processors. In DCSs, homogeneous or heterogeneous, processors are connected together through a communication network. Distributed computing provides the capability to utilize remote computing resources and allows for increased levels of flexibility, reliability, and modularity. If DCSs are properly designed and planned, they can provide a more economical and reliable approach than that of centralized processing systems.

Task allocation is the process of partitioning a set of programming modules into a number of processing groups where each group executes on a separate processing element. The manner in which this partitioning is done determines to a very large extent the efficiency of a given application when it is executed on distributed computing systems. If this step is not performed properly, an increase in the number of processors may actually result in a decrease in total system throughput. This degradation is caused by what is commonly called the "saturation effect", which occurs due to heavy communication traffic induced by data transfers between tasks that reside on separate processors. Interprocessor traffic is always the most costly and the least reliable factor in a loosely coupled distributed system.


The general allocation problem has been shown to be NP-complete. To derive a simple and fast algorithm, it is essential to make use of heuristics that provide a near-optimal solution in a reasonable amount of time. Several approaches to the problem of task assignment in DCSs have been developed [17]. Unfortunately, most of these methods deal with homogeneous systems, and the complexity of these algorithms increases rapidly as the problem size increases. For a given allocation methodology, the interprocessor communication (IPC) time and the potential for parallelism are two conflicting requirements [8]: maximizing the parallelism requires distributing the subtasks or program modules over different processors, but minimizing the IPC time requires that all the modules be assigned to the same processor.

In this paper, we examine static task allocation in a heterogeneous computing system which provides a variety of architectural capabilities, orchestrated to perform on application problems whose tasks have diverse execution requirements. Static allocation techniques can be applied to the large set of real-world applications that can be formulated in a manner which allows for deterministic execution. Some advantages of these techniques over dynamic ones, which determine the module assignment during run time, are that static techniques have no run-time overhead and they can be designed using very complex algorithmic mechanisms which fully utilize the known properties of a given application. Problems which map well to this paradigm include those where the program modules execute in a periodic and continuous manner for long periods of time and those application programs which are run many times using different sets of data. In both cases, the execution time saved merits the extra time spent performing the allocation in an off-line manner. While it is true that some applications cannot be formulated in a deterministic manner, the initial assignment of modules to processors is often a critical component of any dynamic allocation/process migration technique. In this case a static allocation technique (such as the one presented here) can serve as the off-line portion of a more comprehensive allocation strategy.

The static task allocation heuristic presented here attempts to assign the modules to the processors as evenly as possible, with such load balancing being carried out in a single phase. This differs from most of the heuristics, such as the one in [2], which allocate the modules to processors during the first phase and then perform the load balancing in the second phase by reallocating the modules. Our complete scheme does not require further redistribution of program modules and therefore has superior time complexity.

This paper is composed of nine sections. Section 2 defines the specific task assignment problem which is being addressed in this research and the mathematical notation that will be used throughout the manuscript. Section 3 describes the base HMLM task allocation methodology for heterogeneous distributed computing systems. In this section, a comprehensive example problem is presented, after which the technique is modified to take into account the limited memory resources that are common in truly heterogeneous environments. Section 4 describes how the base HMLM heuristic can be augmented by applying simulated annealing to create the HMLM/SA technique. To evaluate this new methodology, it is useful to compare it with other well-known task allocation techniques. This is accomplished in Section 5, where the reference algorithms, simulated annealing and the randomized algorithm, are introduced. The augmented methodology, HMLM/SA, is then extensively evaluated in Section 6 by applying it to a number of randomly-generated software systems, assuming hardware configurations which possess varying degrees of system resources which, in the most general case, are distributed unevenly among the processing nodes. In Section 7, similar analysis is applied to task systems which were derived from three real-world problems: the simulation of a space shuttle main rocket engine, a six-degree-of-freedom robot simulation, and the simulation of launch trajectories of the proposed national launch system. The results of the empirical analysis of these two sections show that the HMLM/SA is superior in most respects to the base HMLM algorithm and to the reference algorithms. The addition of exclusion, affinity, and antiaffinity relationships is presented as future research in Section 8, and some general conclusions are discussed in Section 9.

2 Task assignment problem

The specific problem being addressed is as follows: Given application software that consists of M communicating modules (tasks), m1, m2, ..., mM, and a heterogeneous distributed computing system with N processors, p1, p2, ..., pN, where it is assumed that M >> N, assign (allocate) each of the M modules to one of the N processors in such a manner that the IPC time is minimized and the processing load is balanced. This is accomplished in this research by maximizing the energy function which represents the projected speedup. The following notation will be used throughout this text.

2.1 Actual execution time

The task assignment vector A is defined to be A: M → P, where A(i) = j if module mi is assigned to processor pj, 1 ≤ i ≤ M, 1 ≤ j ≤ N. For a certain assignment, the task set (TSj) of a processor j can now be defined as the set of tasks allocated to that processor [9,10]:

    TSj = { i | A(i) = j },    j = 1, ..., N                                (1)

The communication set (CSj) of a processor j is the set of edges on the program graph that go between the given processor and another processor, for a given assignment A:

    CSj = { (x, y) | A(x) = j ∧ A(y) ≠ j },    j = 1, ..., N                (2)
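To make this notation concrete, the following minimal Python sketch represents an assignment A as a list and derives the task and communication sets of Equations 1 and 2. The sketch is illustrative only and is not from the original paper; the toy data and 0-based indexing are assumptions.

```python
# Minimal sketch of the Section 2 notation (illustrative assumption:
# modules and processors are 0-indexed rather than 1-indexed).

def task_set(A, j):
    """TS_j: indices of the modules assigned to processor j (Equation 1)."""
    return {i for i, p in enumerate(A) if p == j}

def communication_set(A, edges, j):
    """CS_j: program-graph edges crossing processor j's boundary (Equation 2)."""
    return {(x, y) for (x, y) in edges if A[x] == j and A[y] != j}

# Usage with a toy 4-module, 2-processor system (hypothetical data):
A = [0, 0, 1, 1]                       # A(i) = processor of module i
edges = [(0, 1), (1, 2), (2, 3)]       # undirected program-graph edges
print(task_set(A, 0))                  # {0, 1}
print(communication_set(A, edges, 0))  # {(1, 2)}
```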

The communication load of a processor j is the sum of the values of the edges in its communication set, that is, Σ_{(x,y)∈CSj} Cx,y.

In a similar manner, the E matrix is defined to be the M×N matrix whose elements represent the execution times of the communicating modules when they are processed on each node of the targeted DCS. Each element, ei,j, depends upon the amount of computation to be performed by the module mi as well as on the specific attributes of the processor pj. The overall Actual Execution Time (AET) of a given assignment A is then

    AET(A) = Σ_{1≤i≤M} e_{i,A(i)}                                           (3)

and the per-processor Actual Execution Time for processor pj is defined to be

    AET(A)j = Σ_{1≤i≤M, i∈TSj} e_{i,A(i)}                                   (4)

2.2 Communication time

In this paper, the M×M matrix C is used to represent the communication time (cost) between modules in a program, where Ci,j = g > 0 if module mi communicates with module mj for some cost g when A(i) ≠ A(j). That is, any two modules that communicate during their execution incur a penalty if they are executed on different processors; otherwise Ci,j = 0. The overall Interprocessor Communication time (IPC) of a given assignment A can be expressed by

    IPC(A) = Σ_{1≤i≤M} Σ_{i+1≤j≤M, A(i)≠A(j)} C_{i,j}                       (5)

and the per-processor Interprocessor Communication time is given by

    IPC(A)j = Σ_{1≤i≤M} Σ_{i+1≤k≤M, A(i)=j≠A(k)} C_{i,k}                    (6)
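The following sketch evaluates Equations 3 through 6 for a given assignment. It is illustrative only, assuming E is stored as an M×N nested list and C as a symmetric M×M nested list with a zero diagonal.

```python
# Illustrative evaluation of Equations 3-6 (0-indexed; assumes C is
# symmetric with C[i][i] == 0, as implied by the paper's definitions).

def aet(A, E):
    """Overall Actual Execution Time, Equation 3."""
    return sum(E[i][A[i]] for i in range(len(A)))

def aet_per_proc(A, E, j):
    """Per-processor Actual Execution Time, Equation 4."""
    return sum(E[i][A[i]] for i in range(len(A)) if A[i] == j)

def ipc(A, C):
    """Overall Interprocessor Communication time, Equation 5."""
    M = len(A)
    return sum(C[i][k] for i in range(M) for k in range(i + 1, M)
               if A[i] != A[k])

def ipc_per_proc(A, C, j):
    """Per-processor IPC time, Equation 6."""
    M = len(A)
    return sum(C[i][k] for i in range(M) for k in range(i + 1, M)
               if A[i] == j != A[k])
```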

The IPC times, which arise due to the data exchanged between modules residing on different processors, are derived from an appraisal of the characteristics of the application software and the distributed hardware. They may be specified explicitly by the programmer, deduced automatically by the compiler, acquired from the operating system, or defined by dynamic profiling of previous executions of the application. In this paper, we presume that the data about the IPC times are somehow available and that all the times (execution and IPC) are expressible in some common unit of measurement.

2.3 Parallel execution time

The Parallel Execution Time (PET) is a function of the amount of computation to be performed by each processor and the communication time. This function is defined by considering the processor with the heaviest aggregate computation and communication load. The degree to which communication latency will be hidden by overlapping computation with communication depends on such factors as the hardware attributes of the DCS and the module scheduling methodology that is employed. With this in mind, the parallel execution time for a given assignment A is defined conservatively (assuming that computation cannot be overlapped with communication) as shown below:

    PET(A) = max_{1≤j≤N} { AET(A)j + IPC(A)j }                              (7)
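Equation 7 then combines the two per-processor quantities; continuing the illustrative sketch above:

```python
def pet(A, E, C, N):
    """Parallel Execution Time, Equation 7: the bottleneck processor's
    aggregate computation plus communication load."""
    return max(aet_per_proc(A, E, j) + ipc_per_proc(A, C, j)
               for j in range(N))
```

The heterogeneous "speedup" used as the objective function in Section 3.2 is then simply `aet(A, E) / pet(A, E, C, N)`.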

2.4 Memory constraints

The vector B is defined to be a vector whose elements bi denote the amount of memory needed to process module mi, and Q is a vector whose elements qj represent the maximum capacity of the local memory associated with processor pj. In cases where memory is constrained, the following inequality must hold at each processing node in the DCS:

    Σ_{1≤i≤M, i∈TSj} bi ≤ qj                                                (8)
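A minimal feasibility check for Equation 8, under the same illustrative assumptions (the names B and Q mirror the paper's vectors):

```python
def memory_feasible(A, B, Q):
    """True if every processor's assigned modules fit within its local
    memory capacity (Equation 8)."""
    used = [0] * len(Q)
    for i, j in enumerate(A):
        used[j] += B[i]
    return all(used[j] <= Q[j] for j in range(len(Q)))
```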

2.5 Relative processor speeds

In this paper, we assume that the data transfer rate between the processors is constant and the time attributed to communication delays is proportional to the total amount of data exchanged between the processors. This is true in many real-world systems such as shared-path networks (when there is no contention) and in multipath networks that employ wormhole routing [11]. Also, we consider a heterogeneous network that is composed of a set of dissimilar processing elements, so the execution time of module mi on processor pj depends on the work to be performed by module i and on the attributes of processor j, such as its clock rate, instruction set, existence of floating-point hardware, cache memory, version of operating system that is being run, etc.

The algorithm presented here requires that the parameters which reflect the overall relative speed of each processor be calculated at the time the E matrix is created. In a truly heterogeneous environment, such parameters cannot be easily determined, since the distinct hardware attributes associated with the individual nodes result in each program module having a wide range of execution times (depending upon which node the program module is assigned to), and the situation can exist where a certain number of program modules simply cannot be made to execute on one or more of the processing nodes of the system. A method which allows the calculation of a vector S, where sj indicates the relative speed of processor j, is therefore incorporated into the algorithm. This method is based upon the techniques presented in [12], where the entries of the S vector are determined in a normalized manner while allowing for the omission of entries in the task execution time matrix, E, which correspond to situations in which a program module is not allowed to execute on a given processing node.

The method used to calculate the S vector is based upon the calculation of the so-called "weighted column handicap" of the execution time matrix E. This handicap is then used to derive the relative speeds of the processing nodes as described below. Using the execution time matrix, E, the row mean, βi, and row variance, σi², are calculated in the manner shown below:

    βi = (1/N) Σ_{1≤j≤N} e_{i,j}                                            (9)

    σi² = (1/N) Σ_{1≤j≤N} (e_{i,j} − βi)²                                   (10)

Then, the normalized execution time matrix, Ē, is defined as being the M×N matrix whose elements ē_{i,j} for 1 ≤ i ≤ M and 1 ≤ j ≤ N are determined as follows:

    ē_{i,j} = (e_{i,j} − βi) / σi                                           (11)

The weighted handicap, αj, for each processor j, 1 ≤ j ≤ N, is then expressed in terms of the normalized execution time matrix as follows:

    αj = (1/M) Σ_{1≤i≤M} ē_{i,j}                                            (12)

The handicap is then adjusted by adding the absolute value of the minimum handicap plus a constant (in our case we used 1 as the constant, which in effect normalizes the fastest processor's speed to a value of one) to all the handicaps to get the modified handicap, α̂j:

    α̂j = αj + | min_{1≤i≤N} {αi} | + 1                                      (13)

The relative speed of each of the N processors is inversely proportional to its modified handicap, as shown in Equation 14:

    sj = 1 / α̂j                                                             (14)
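A compact numpy sketch of Equations 9 through 14 (illustrative; it assumes a fully populated E matrix, whereas the method of [12] also tolerates omitted entries):

```python
import numpy as np

def relative_speeds(E):
    """Relative processor speeds from Equations 9-14.
    E: M x N array of execution times.  Assumes every row has nonzero
    variance (a module whose time is identical on all processors would
    make Equation 11 divide by zero)."""
    beta = E.mean(axis=1, keepdims=True)        # row means, Eq. 9
    sigma = E.std(axis=1, keepdims=True)        # population std, Eq. 10
    E_norm = (E - beta) / sigma                 # normalized matrix, Eq. 11
    alpha = E_norm.mean(axis=0)                 # weighted handicaps, Eq. 12
    alpha_hat = alpha + abs(alpha.min()) + 1.0  # modified handicaps, Eq. 13
    return 1.0 / alpha_hat                      # relative speeds, Eq. 14
```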

It should be noted that a higher sj does not mean that a given module's execution time will be lower on a particular processor. Of course, if the network is homogeneous, then there is no difference in the amount of time that it takes for any particular task to be executed on any processor, which will be reflected by the fact that sj = 1 for all processors.

2.6 Average load

In distributed systems, a program is divided into modules and each of these modules is assigned to one of the processors. Such a program can be modeled as an undirected graph consisting of a set of nodes, where each node represents a module, connected together by a set of edges that represent the data dependencies between the modules. For each node there are N values representing the execution time required for the corresponding module to be processed on each of the N processors; these values are in the matrix E. Each edge is labeled with a value that represents the communication time needed to exchange the data when the modules reside on different processors. The total workload W is the summation of the maximum module execution times on the different processors, as shown in Equation 15:

    W = Σ_{1≤i≤M} max_{1≤j≤N} {e_{i,j}}                                     (15)

The average load on a processor, Lavg(pj), depends upon its relative speed. The system is considered to be balanced if the load on each processor is equal to the processor average load within a given (small percentage) tolerance:

    Lavg(pj) = W · sj / Σ_{1≤i≤N} si,    j = 1, ..., N                      (16)
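Equations 15 and 16 are then straightforward to evaluate; an illustrative sketch:

```python
def average_loads(E, S):
    """Per-processor average load targets, Equations 15-16.
    E: M x N nested list of execution times; S: relative speed vector."""
    W = sum(max(row) for row in E)     # total workload, Eq. 15
    total_speed = sum(S)
    return [W * s / total_speed for s in S]
```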

3 Heterogeneous maximally-linked modules algorithm (HMLM)

In many existing heuristic methods [10], a search is made for a pair of adjacent modules with the maximum communication time between them. These modules are then assigned to the same processor to minimize the interprocessor communication time. In our heuristic, we always try to choose a module, mk, which is considered to be maximally linked from the perspective of the processor on which the assignment is to be made. A module is defined in [1] to be maximally linked (a maximally linked module, MLM) if it has the largest aggregate intermodule communication time of any of the modules that are adjacent to one or more of the modules already assigned to a processor pj. In the previous work [1] the MLMs were determined once at the beginning of the allocation process. In the algorithm described in this manuscript, the MLM is recomputed for each step of the allocation process in the manner shown below. For a given point in the allocation process, the MLM for processor pj is given by

    MLMj = mk,  where k ∉ TSj and Σ_{1≤l≤M} C_{k,l} = max_{k′∉TSj} { Σ_{1≤l≤M} C_{k′,l} }    (17)

The major rationale here is simple: since a maximally linked module has the maximum amount of communication with its neighboring modules, in most cases it is advantageous to form clusters around such modules.
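An illustrative selection routine for Equation 17 (a sketch, not the authors' implementation; following the prose definition, candidates are restricted to unassigned modules adjacent to the processor's task set):

```python
def mlm(C, assigned, TS_j):
    """Illustrative MLM selection (Equation 17): among unassigned modules
    adjacent to processor j's task set, pick the one with the largest
    aggregate intermodule communication time.  Returns None when no
    candidate exists."""
    M = len(C)
    candidates = [k for k in range(M)
                  if k not in assigned
                  and any(C[k][l] > 0 for l in TS_j)]
    if not candidates:
        return None
    return max(candidates, key=lambda k: sum(C[k]))
```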

3.1 Base algorithm

To determine the underloaded processors, the algorithm calculates the processing loss associated with each processor. The processing loss is the difference between the processor average load Lavg(pj) and the actual load on that processor, which is the sum of the execution times of the modules assigned to the processor, i.e., in its task set (TS). The HMLM algorithm is presented below:

Step 1. Average Load Calculation. In this step, the average load is assigned to each processor,

    Lavg(pj) = W · sj / Σ_{1≤i≤N} si,    j = 1, ..., N.

This step executes in O(N²) time.

Step 2. Maximally Linked Module Determination. Then for all M modules, the communication time as outlined in Equation 17 is calculated and is used to order the modules in a monotonically decreasing manner. This has a worst-case time complexity of O(M²).

Step 3. Initial Module Assignment. An initial assignment of the maximally linked modules is performed in this step, where the first N modules that have the highest communication times are assigned to the set of processing elements in a one-to-one manner. In a heterogeneous situation, each of these modules is allocated (starting at the top of the sorted list with the module that has the highest communication time and proceeding downward) to its "best" available processing element. The "best" available processing element is defined to be the one that has the lowest execution time for a particular module as expressed in the E matrix. This step executes in O(N) time.

Step 4. Processing Loss and IPC Calculation. In this step the processing loss,

    lj = Lavg(pj) − Lj,    j = 1, ..., N,                                   (18)

and the IPC(A)j time as defined in Equation 6 are calculated for each processing element, where Lj is the actual load on processor j at the current point in the allocation process:

    Lj = Σ_{i∈TSj} e_{i,j}                                                  (19)

This is an O(MN) time operation.

Step 5. Module Assignment. This step starts with the most underloaded processor (minimum Lj, with priority being given to the processors that have the highest relative speed), and allocates module k to processor j if the condition Lj + ek,j ≤ (1 + δ)Lavg(pj) holds, where module k is MLMj as defined in Equation 17. (A good initial selection of δ has been found in this research to be δ ≈ 0.2; the value of δ can then be increased as necessary to cover orphan modules.) If this condition does not hold, the module with the next highest communication time for processor j is selected and the condition is checked again. In the cases where there are no other candidate modules which are adjacent to the modules already assigned to processor j, the processor with the next lowest Lj which has not previously been considered in this iteration is selected and Step 5 is repeated. This step also has a worst-case time complexity of O(MN).

Step 6. Algorithm Continuation. The algorithm then continues with Step 4 until all modules which can be assigned are assigned to the set of processing elements. Thus the loop which is composed of Steps 4 and 5 executes in O(M²N) time.

Step 7. Orphan Module Allocation. In some cases, the algorithm is unable to assign all modules because it is impossible to satisfy the above Lavg(pj) and implied adjacency constraints. To complete the allocation process, this step systematically examines the set of orphan modules and attempts to assign each module to its "best" processor (i.e., the processor with minimum ei,j) without violating the Lj + ei,j ≤ (1 + δ)Lavg(pj) constraint. In cases where all the processors violate this constraint, the algorithm relaxes this condition and the assignment is made to the processor with the minimum ei,j. This step has a worst-case time complexity of O(M).

Step 8. Algorithm Termination. The algorithm terminates when all the orphan modules, if any, are assigned.

The HMLM algorithm can be classified as a polynomial-time greedy algorithm which employs localized optimization techniques. Its overall time complexity is of the order O(N² + M² + N + M²N + M). Since the number of modules is assumed to be much greater than the number of processors, M >> N, this simplifies to order O(M²) (treating the number of processors as a constant), which makes the base algorithm very attractive.
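Putting the pieces together, the following condensed sketch shows the control flow of Steps 1 through 8, reusing the average_loads and mlm helpers sketched earlier. It is a simplification under stated assumptions (a single fixed δ, no memory constraints, and orphans assigned directly to their best processors) rather than the authors' implementation:

```python
def hmlm(E, C, S, delta=0.2):
    """Condensed, illustrative sketch of the base HMLM algorithm.
    E: M x N execution times, C: M x M communication times,
    S: relative speeds.  Returns the assignment vector A."""
    M, N = len(E), len(E[0])
    L_avg = average_loads(E, S)                                      # Step 1
    order = sorted(range(M), key=lambda k: sum(C[k]), reverse=True)  # Step 2
    A, load, free = [None] * M, [0.0] * N, set(range(N))
    for k in order[:N]:                                              # Step 3
        j = min(free, key=lambda p: E[k][p])       # "best" free processor
        A[k], load[j] = j, load[j] + E[k][j]
        free.discard(j)
    assigned = set(order[:N])
    while len(assigned) < M:                                         # Steps 4-6
        progress = False
        for j in sorted(range(N), key=lambda p: load[p]):  # most underloaded
            k = mlm(C, assigned, {m for m in assigned if A[m] == j})
            if k is not None and load[j] + E[k][j] <= (1 + delta) * L_avg[j]:
                A[k], load[j] = j, load[j] + E[k][j]                 # Step 5
                assigned.add(k)
                progress = True
                break
        if not progress:
            break                            # remaining modules are orphans
    for k in range(M):                                               # Step 7
        if A[k] is None:
            j = min(range(N), key=lambda p: E[k][p])
            A[k], load[j] = j, load[j] + E[k][j]
    return A                                                         # Step 8
```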

3.2 The objective function

An important figure-of-merit that is traditionally used in homogeneous domains to determine the effectiveness of an allocation is the relative speedup, which is defined as the ratio of the sequential (one-processor) execution time to the parallel (N-processor) execution time. A natural generalization of this figure-of-merit to the heterogeneous domain can be expressed by the following energy function (U):

    U = AET / PET                                                           (20)

which represents the heterogeneous "speedup" of the allocation. This expression will also be used as the objective function in the simulated annealing portion of the overall algorithm, which will be introduced in the next section.

3.3 Example


To illustrate the base algorithm, a typical program graph has been constructed and is shown in Figure 1, with the execution time matrix, E, being shown in Table 1 for a three-processor DCS. Using Table 1, the row mean, βi, is found to be equal to [146.7, 81.3, 143.7, 168.7, 133.3, 164.3, 77.3, 116.7, 143.3]t and the square root of the row variance, σi, is [26.95, 49.5, 48.6, 32.8, 77.98, 97.3, 81.3, 82.6, 93.6]t. The normalized execution time matrix for this example can then be determined as shown in Table 2. From this table, the handicap, αj, for each processor j is then computed and found to be [0.277, −0.253, −0.024]. To normalize these results, a positive constant of one is added to the absolute value of the most negative handicap (i.e., −0.253) to get the scaling parameter (i.e., 1.253). This parameter is then added to the handicaps to obtain the modified handicap, α̂j, which equals [1.53, 1, 1.229] and which in turn is the inverse of the processor relative speeds S = [0.65, 1, 0.81].
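As an illustrative numerical check (not part of the original paper), a few lines of numpy reproduce the example's S vector from the Table 1 data:

```python
import numpy as np

# Execution time matrix E from Table 1 (modules m1-m9 on p1-p3).
E = np.array([[174, 156, 110], [ 95,  15, 134], [196,  79, 156],
              [148, 215, 143], [ 44, 234, 122], [241, 225,  27],
              [ 12,  28, 192], [215,  13, 122], [211,  11, 208]],
             dtype=float)

beta = E.mean(axis=1)                        # row means, Eq. 9
sigma = E.std(axis=1)                        # Eq. 10 (population variance)
alpha = ((E - beta[:, None]) / sigma[:, None]).mean(axis=0)  # Eqs. 11-12
alpha_hat = alpha + abs(alpha.min()) + 1.0   # Eq. 13
print(np.round(1.0 / alpha_hat, 2))          # Eq. 14 -> [0.65 1.   0.81]
```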

[Figure 1. Example program graph: nine modules (m1-m9) joined by weighted communication edges and partitioned among three processors (P1, P2, P3).]

Table 1: Module execution time matrix E

module number    p1    p2    p3
m1              174   156   110
m2               95    15   134
m3              196    79   156
m4              148   215   143
m5               44   234   122
m6              241   225    27
m7               12    28   192
m8              215    13   122
m9              211    11   208

Table 2: Normalized module execution times on the 3-processor system, Ē

module number      p1       p2       p3
m1               1.014    0.346   -1.361
m2               0.276   -1.339    1.063
m3               1.078   -1.332    0.254
m4              -0.630    1.411   -0.782
m5              -1.146    1.291   -0.145
m6               0.788    0.623   -1.411
m7              -0.803   -0.606    1.410
m8               1.191   -1.256    0.065
m9               0.723   -1.414    0.691

Table 3: Module allocation and IPC times for the 3-processor system
(actual loads and aggregate IPC times are shown after each of the three assignment phases)

Processor   Module Name/Selection Order   Actual Load      Aggregate IPC Time
p1          m9/3, m6/6, m7/9              211, 452, 464    27, 18, 15
p2          m8/1, m5/4, m4/7               13, 247, 462    31, 25, 16
p3          m1/2, m3/5, m2/8              110, 266, 400    29, 27, 15

Module allocation is begun by initially assigning the N most maximally linked modules, which are m8, m1, and m9, to their best processors, as defined in Steps 2 and 3 of the HMLM algorithm. In this case, module m8 is assigned to p2, module m1 to p3, and module m9 to p1, which is the remaining unallocated processor. Then, the algorithm proceeds as described in Steps 4, 5, and 6. It should be noted that after each assignment of a module to a processor, the aggregate IPC time associated with that processor usually decreases. Table 3 shows the resulting allocation, indicating the order in which each module is assigned to a processor along with the Lj and IPC times for each phase of the allocation process. From this table and Table 1, it can be seen that the AET is 1326 time units and the PET is 479 time units, which implies that the projected "speedup" of this allocation is U = 2.77, which is very promising for a three-processor system.

3.4 Algorithm modification for memory constrained implementations

The base algorithm is easily modifiable to take into account system resource constraints such as the available memory size for each processing element. In this case, the modules' memory requirement vector, B, is initialized to values which indicate the maximum amount of memory that is needed during the execution of each module. To accomplish this, the following substeps should be added to Step 1 of the base algorithm:

1.1 Initialize the amount of local memory, qj, associated with each processing element, pj.
1.2 Initialize the amount of memory, bi, required by each module, mi.

and Steps 3, 4 and 5 should be replaced by the following:

Step 4′. Processing Loss and IPC Calculation. In this step the processing loss, the IPC(A)j time as defined in Equation 6, and the remaining memory for each processing element are calculated.

Step 5′. Module Assignment. This step starts with the most underloaded processor (minimum Lj, with priority being given to the processors with the highest relative speed), and allocates mi to pj if the condition Lj + ei,j ≤ (1 + δ)Lavg(pj) holds and the memory that is available is sufficient, where mi is the maximally communicating module that has yet to be allocated. Otherwise, the processor is selected which has the next lowest Lj and which has not previously been considered in this iteration, and Step 5′ is repeated.


Step 7′. Orphan Module Allocation. As in the other cases, this step is modified to reflect the limited memory constraint by incorporating Equation 8 at all points in the allocation process. It should be noted, however, that memory constraints can never be relaxed. Therefore, it is possible to create systems for which no feasible allocations exist. Care should be taken to avoid this situation by employing sufficient hardware resources (as discussed in Section 6).

4 Augmented HMLM algorithm with simulated annealing

Stochastic methodologies can be combined with the base HMLM algorithm to improve the quality of the allocations that are produced. One such methodology is simulated annealing (SA), which performs heuristic hill climbing to traverse a problem space in a manner which is resistant to stopping prematurely at local critical points that are less optimal than the global one(s). The heuristic is based upon the idea of iteratively making small changes to the search space and evaluating these changes through the use of a properly specified objective function [13,14].

Figure 2 shows how simulated annealing can be combined with the HMLM algorithm. The process begins with the selection of appropriate values of Tfreezing, λ, and Tinitial. Then the first allocation is made using the HMLM algorithm and the energy function, which represents the projected speedup, is calculated. During this process, the initial module assignment for the first N modules, as defined in Step 3 of the algorithm, is stored for future use. Simulated annealing is then applied by iteratively altering this initial assignment and applying the remaining portions (Steps 4-8) of the HMLM heuristic. The state-space movement generation strategy which is employed is to randomly swap out (in a pair-wise manner) a single element from the initial module assignment list and replace it with an element that is not in that list. This strategy allows the search to progress from the current state to any one of the

    (N choose 1) · (M−N choose 1) = N(M−N)

neighboring states. At this point, a new energy function (the projected speedup) is calculated, and the changes are accepted whenever they are better than the ones accepted previously; in the case where the changes are worse than before, they are accepted with a certain probability according to Equation 21. The probability function used is based upon the difference between the previously accepted and the new energy functions, and upon a variable called T, which is analogous to the current temperature associated with the physical process of the annealing of molten metal. In both cases, the effectiveness of the end result depends upon the manner in which this variable, T, is varied. In general, T is initialized to the value Tinitial and is then decreased in a standard manner through the use of an exponential cooling schedule, Ti = λTi−1, where λ < 1 but is very close to one, until it reaches the freezing temperature, Tfreezing. As the process proceeds, the probability of accepting inferior solutions also decreases at an exponential rate. The process finally stops when the variable T reaches the freezing temperature, Tfreezing.
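A minimal sketch of this annealing loop (illustrative; it assumes the energy function U is maximized, Boltzmann-style acceptance per Equation 21, Inum iterations per temperature step, and a constant K as in Figure 2; the evaluate and neighbor callbacks stand in for the HMLM re-allocation and the pair-wise switch):

```python
import math
import random

def anneal(initial_seed, evaluate, neighbor,
           T_initial=1.0, T_freezing=0.01, lam=0.99, Inum=50, K=1.0):
    """Generic SA loop matching Figure 2: maximize the energy function
    (projected speedup).  evaluate(seed) -> U; neighbor(seed) performs
    the random pair-wise switch of the initial assignment list."""
    current, U_prev = initial_seed, evaluate(initial_seed)
    best, U_best = current, U_prev
    T = T_initial
    while T > T_freezing:
        for _ in range(Inum):
            candidate = neighbor(current)
            U_new = evaluate(candidate)
            # Accept improvements outright; accept inferior moves with
            # probability P = exp((U_new - U_prev) / (K * T))   (Eq. 21)
            if U_new >= U_prev or \
               random.random() < math.exp((U_new - U_prev) / (K * T)):
                current, U_prev = candidate, U_new
                if U_new > U_best:
                    best, U_best = candidate, U_new
        T *= lam        # exponential cooling schedule, T_i = lambda * T_i-1
    return best, U_best
```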

[Figure 2. The HMLM/SA algorithm: initialize the data and enter Tinitial, Tfreezing, and λ; set T = Tinitial; execute the HMLM allocation heuristic; then repeatedly perform random pair-wise switching, re-execute the HMLM allocation heuristic, and apply simulated annealing to maximize the energy function using the probability function

    P = e^((Un − Up)/KT)                                                    (21)

until Inum iterations are reached; then set T = Tλ and repeat until T reaches Tfreezing.]