Three Search Strategies for Architecture Synthesis and Partitioning of Real-Time Systems Jakob Axelsson Email:
[email protected]
Abstract

This report studies the problem of automatically selecting a suitable system architecture for implementing a real-time application. Given a library of hardware components, it is shown how an architecture can be synthesized with the goal of fulfilling the real-time constraints stated in the system's specification. In case the selected architecture contains several processing units, the specification is partitioned by assigning tasks to processing units. We investigate the use of three meta-heuristic search algorithms to solve the problem: genetic algorithms, simulated annealing, and tabu search, and we describe in detail how each of them can be adapted to the architecture synthesis problem. Their relative merits are discussed at length, as is the importance of scheduling to the solution quality. This work has been supported by the Swedish National Board for Industrial and Technical Development (NUTEK).
IDA Technical Report 1996 LiTH-IDA-R-96-32 ISSN-0281-4250
Department of Computer and Information Science, Linköping University, S-581 83 Linköping, Sweden
Contents

1 Introduction
2 Related work
3 Problem description
  3.1 Behavioural specification
  3.2 Component library
  3.3 Component models
  3.4 Virtual prototypes
  3.5 Fixed-priority scheduling
  3.6 Primitive transformations
  3.7 Quality criterion
4 Search algorithms
  4.1 Simulated annealing
  4.2 Genetic algorithms
  4.3 Tabu search
  4.4 Comparison
5 Applying the algorithms
  5.1 Compound transformations
  5.2 Application example and component library
  5.3 Simulated annealing
  5.4 Genetic algorithms
  5.5 Tabu search
6 Experimental results
  6.1 Using deadline-monotonic scheduling
  6.2 Using optimal scheduling
7 Discussion
  7.1 Partitioning quality
  7.2 Why is architecture 13 so popular?
  7.3 Implementation effort
8 Conclusions
References
1 Introduction

When a real-time system is constructed, one of the most important design goals is to verify that certain timing constraints are fulfilled. These timing constraints are imposed by the environment in which the system is embedded, and they typically place limits (expressed as deadlines) on the system's response time to certain events taking place in the environment. The quality of the system, i.e. its cost and its ability to meet the deadlines, depends to a large degree on certain early implementation decisions. The designer has to choose a suitable hardware architecture, and decide how the resources of that architecture are to be shared between the system's tasks, which includes partitioning the functionality onto the components of the hardware architecture. The real-time research community has dealt extensively with the problem of determining the quality of a completed implementation, but it has not really provided much support for the early design phases, when a high-quality system-level design is sought. This work aims at improving upon the current practice of early system-level design.

Since real-time systems are usually application-specific digital systems, they can be carefully tuned to give an optimal trade-off between cost and performance. Often, it might be advantageous to use a combination of application-specific integrated circuits (ASICs) and software running on standard microprocessors. The design of such heterogeneous architectures has been studied within the field of hardware/software codesign. In this report, we study the automatic synthesis of a suitable architecture for a given real-time system's specification, and the partitioning of the system's tasks on the processors and ASICs of that architecture. Since the number of possible implementations is enormous, it is extremely time-consuming to find the optimal solution, and therefore we have to resort to meta-heuristic search techniques. Three well-known such techniques have previously been used on similar problems, with varying success: genetic algorithms, simulated annealing, and tabu search. To determine which of these algorithms is the most suitable for our problem, we have conducted a comparative study involving all three of them.

The report is organized as follows: The next section describes some related work, and Section 3 defines the exact problem that we have studied and proposes a design environment for solving it. Section 4 introduces the three search techniques and Section 5 shows how they were adapted to the architecture synthesis problem. Section 6 reports the experimental results obtained, and Section 7 tries to explain them. Finally, in Section 8 the conclusions are summarized, and some indications of future work are given.
2 Related work

The synthesis of system hardware architectures has been studied by several researchers before. Prakash and Parker [17] describe an algorithm for selecting processors to be used in a heterogeneous multiprocessor system, and for partitioning. It assumes that all processor nodes contain a local memory, and that the different nodes are connected through direct point-to-point channels. The algorithm is based on mixed integer-linear programming, and it does trade-offs
between cost and performance. However, it does not consider real-time constraints.

Buchenrieder and Pyttel [3] use knowledge-based techniques for selecting components from a component library, and for deciding on how to connect them. The input to the system is not a specification of the system's behaviour and performance. Instead, the designer decides interactively on things like which processor to use, and what clock frequency the system should have. The design goal is unclear, but presumably it does not include satisfaction of real-time constraints. There are also some articles about synthesis of more regular architectures, but that is beyond the scope of this report.

In hardware/software codesign, most researchers concentrate on how to partition a specification on a fixed architecture, in order to reach a maximal speedup, but some work has also been done on partitioning with respect to timing constraints, especially by Wolf's group (see e.g. [10]).

Some of the search algorithms have previously been applied to the partitioning problem. Peng and Kuchcinski [16] use simulated annealing for partitioning a graph representing a design during high-level synthesis of a circuit. The goal is to reduce the amount of communication and synchronization between the parts, thus making them suitable for implementation in different modules. Majhi et al. [13] compare the use of a genetic algorithm and simulated annealing for partitioning a circuit description to produce multichip modules, and they claim that the genetic algorithm performed better for equal execution times. Eles et al. [5] study how to use simulated annealing and tabu search for partitioning a graph into two subgraphs, representing hardware and software parts of the implementation, respectively. The optimization goal is described as a weighted sum of certain criteria, such as communication between the subparts and amount of parallelism, and the algorithm also takes into account constraints such as hardware size. However, it does not deal with real-time constraints. The two optimization techniques are compared, and it is concluded that tabu search is vastly superior to simulated annealing for this problem. Ernst et al. [6] also do hardware/software partitioning using simulated annealing, but on a finer level than the previous reference. The target architecture consists of a processor and an ASIC which acts as a coprocessor. With this architecture, the most important criterion for partitioning is communication. The design goal is to achieve a maximum average speedup under given cost constraints, so real-time issues are not considered. Finally, Tindell et al. [18] use simulated annealing for assigning tasks of a real-time system to processors in a distributed architecture. The goal is to allocate the tasks in such a way that their deadlines are met, while not exceeding the capacity of the distribution network.

The research described in this report differs from some or all of the above references in the following ways:
It considers real-time applications, with the explicit goal of producing an implementation which meets all deadlines at a minimal cost.
It allows both processors and ASICs in the architecture, and is therefore suitable for hardware/software codesign.
[Diagram omitted; its elements are: behavioural specification, component library, transformation, VP, search algorithm, quality criterion, and implementation.]
Figure 1: Overview of the design environment.
It uses meta-heuristic search algorithms for concurrent architecture selection and partitioning.

It compares the relative merits of genetic algorithms, simulated annealing, and tabu search, for this problem.
3 Problem description

Before going into the different algorithms, a more detailed description of the actual problem is needed. As mentioned above, the goal of this work is to find methods for producing an implementation of a real-time system. This implementation should guarantee the timing constraints stated in the specification, and it should do so at a minimal hardware cost.

The process of finding this implementation starts with a detailed behavioural specification that defines what the system is going to do. This specification is necessary to be able to analyze the consequences of different design alternatives on response time and hardware cost. The hardware components that can be used in the architecture are given in a component library, which also contains models for estimating characteristics of the components. During the design, the current status of the implementation is captured using a virtual prototype, and the design proceeds by applying iterative transformations to it, in order to find an implementation which gives a maximal value to a certain quality criterion. Figure 1 shows a codesign environment based on virtual prototyping. In the remainder of this section, the different parts of the environment will be presented in more detail.
3.1 Behavioural specification
Formally, we can define the specification of the desired behaviour as follows:
Definition 1 (Behavioural specification) A behavioural specification is a three-tuple S = (B, D, T), where:
B = {τ1, ..., τm} is a behaviour consisting of a set of tasks;

D : B → ℝ is a function specifying a deadline for each task;

T : B → ℝ is a function specifying the minimum time between two invocations of each task (which equals the period for periodic tasks).
Based on the behavioural specification, a set of components is selected from a component library, and these are connected to each other, sometimes using buses. Then it is decided how to share the hardware resources between the tasks, both physically, through partitioning of the tasks, and temporally, through scheduling.
3.2 Component library
The components of the architecture are selected from a component library, which describes their characteristics. The components can be divided into four groups:
Microprocessors are sequential components used to execute the behaviour. They are modelled as having two memory ports, one for accessing data and the other for reading instructions.

ASICs and other custom hardware devices, such as FPGAs, are parallel components that both execute the behaviour and store local data. They are modelled as having one port, which is connected to a memory storing data shared with other components.

Memories are the main storage components.

Caches are used to accelerate retrieval from a memory, by allowing faster access to a portion of the data in the memory. They are connected to another cache or a memory from which the data is to be fetched if it was not found in the cache.

The components which can be used to perform the actual execution of the behaviour, i.e. the processors and ASICs, will be referred to as processing units.
3.3 Component models
For each of the component classes in the library, a set of characteristics is defined which can be used to build cost and performance estimation models for the architecture. For hard real-time systems, where guaranteed response times are needed, the performance models should be tuned to give good estimations of the bounds on execution time that will be derived by the final verification procedure (which might of course be quite different from the actual worst-case execution time). In this study we have used rather simple models. The microprocessor execution time and memory usage were estimated using techniques described in [1], and the ASIC models were based on [15]. The caches were modelled by a single number indicating the hit ratio; this is a simple model, which can be sufficient for early analysis, provided that the hit ratio is tuned to match a typical outcome of the final analysis method. Better, but much more complex, models exist, e.g. the one developed by Mueller and Harmon [14], which is able
to determine the instruction cache behaviour of over 80% of the instructions. Another model which takes caches into account is the one by Lim et al. [12], that calculates worst case execution times on RISC processors.
3.4 Virtual prototypes
During the search for a suitable implementation, the algorithms will have to keep track of various designs, and analyze them to determine their quality. The data structure used to store the necessary information describing the current state of implementation is called a virtual prototype, since it serves the traditional purposes of a prototype, without having to exist in the material world. Formally, a virtual prototype is defined as follows:
Definition 2 (Virtual prototype) A virtual prototype is a four-tuple VP = (B, A, p, ≺p) where:

B = {τ1, ..., τm} is a behaviour consisting of a set of tasks;

A = {c1, ..., cn} is an architecture consisting of a set of components;

p : B → P is a partitioning assigning tasks to processing units of the architecture (P ⊆ A are the processing units in A);

≺p ⊆ B × B is a (total) order which assigns priorities to the tasks, and thereby defines a fixed-priority schedule.
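To make the two definitions concrete, the following sketch (purely illustrative; the class and field names are ours, not the report's) encodes a behavioural specification and a virtual prototype as plain data structures:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Task:
        name: str
        deadline: float           # D(tau)
        min_interarrival: float   # T(tau); equals the period for a periodic task

    @dataclass(frozen=True)
    class Component:
        name: str
        kind: str                 # "processor", "asic", "memory", or "cache"

    @dataclass
    class VirtualPrototype:
        tasks: list               # B, shared with the behavioural specification
        components: list          # A, the architecture
        partitioning: dict        # p: maps each Task to a processing unit
        priority_order: list      # total order on tasks; earlier = higher priority

        def processing_units(self):
            # P: the subset of A on which tasks may execute
            return [c for c in self.components if c.kind in ("processor", "asic")]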
A fundamental idea of this work is that the implementation should be synthesized from the information provided in the behavioural specification, without significant additions to the latter during the development. Therefore, we let the B in the specification and in the VP be one and the same, and leave out the question of verifying the correctness of the implemented behaviour with respect to the specification (this is a valid approach if the synthesis process can be shown to be semantics-preserving). Instead, we concentrate on showing that the non-functional requirements, i.e. the deadlines D, are met by the implementation.

In the applications we are currently studying, there is some data common to all the tasks. This makes it necessary to provide the tasks with a shared storage that can be reached from all processors and ASICs. Thus the architectures can be restricted to always contain a shared data memory connected to all processing units via a bus. Further, it is difficult to have cache memories for data in such architectures, because the protocols for coherence control get complicated and difficult to analyze (see [19, 20] for an overview of this area), and consequently we prohibit the use of data caches. But it is not necessary to have the same restrictions on processor instruction memories; instead we allow these to be separate for a particular processor, and to include caching hierarchies.

Buses and other forms of interconnection are implicit in the architecture, and thus not included in the set of components. Input/output ports are assumed to be memory-mapped, and are therefore also not catered for.
3.5 Fixed-priority scheduling
Since the behaviour consists of several parallel tasks, conflicts over hardware resources between these tasks must be resolved by a scheduler. We assume that the scheduler is based on fixed priorities, i.e. each task is assigned a priority, and conflicts are settled by granting a resource to the task with the highest priority first. Thus the tasks are ordered by a priority order ≺p, which can be any total order on the set of tasks. But since the selection of an order affects the response time of the tasks, some orders are better than others, and it has been shown before how an optimal order can be derived [2]. The complexity of this procedure is O(|B|²), and the result might be different for different VPs, which means that a new schedule must be synthesized for every VP.

An alternative is to use one and the same schedule for all VPs, and then it is a good choice to use the deadline-monotonic priority order, defined as τ ≺p τ' ⟺ D(τ) < D(τ'). It is known to be optimal for single-processor systems, and believed to be near-optimal for other architectures too. Since the order only depends on the specification, it will be the same for all VPs, and can thus be calculated once and for all. We will see later in Section 6 how the choice of scheduling strategy affects the algorithms.
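As an illustration of the two strategies, the sketch below first derives the deadline-monotonic order by sorting on deadlines, and then shows an Audsley-style assignment that finds an optimal order with O(|B|²) schedulability tests. It reuses the Task sketch above; the test is_schedulable_at_lowest stands for whatever response-time analysis is available and is our placeholder, not the method of [2].

    def deadline_monotonic_order(tasks):
        # Shorter deadline => higher priority; depends only on the specification.
        return sorted(tasks, key=lambda t: t.deadline)

    def optimal_priority_order(tasks, is_schedulable_at_lowest):
        # Audsley-style assignment: repeatedly find a task that can take the
        # lowest remaining priority level, given that all other unassigned
        # tasks run at higher priorities.
        unassigned = list(tasks)
        lowest_first = []
        while unassigned:
            for t in unassigned:
                others = [u for u in unassigned if u is not t]
                if is_schedulable_at_lowest(t, others):
                    lowest_first.append(t)
                    unassigned.remove(t)
                    break
            else:
                raise ValueError("no feasible fixed-priority order exists")
        return list(reversed(lowest_first))   # highest priority first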
3.6 Primitive transformations
As will be seen, the search algorithms work by moving step by step through the search space. An alternative way of viewing this is that they iteratively modify a certain configuration, in this case the VPs. This is done by using a set of transformations which map one VP onto another, while respecting certain consistency conditions. A sequence of such transformations can be seen as a path through the search space, and by giving a complete set of operations, an arbitrary state in the space can be reached from any other, which is a necessary condition for the success of the optimization. Such a complete set of transformations is:
Move task. A task is moved from one processing unit to another.

Add component. A component is added to the architecture and its ports (in case it is not a memory) are connected to memories or caches in the architecture.

Remove component. A component is removed from the architecture. A precondition is that no other component is connected to it (in case it is a memory or a cache) and that no task is allocated on it (if it is a processing unit).

Reconnect component. One of the ports of some component is reconnected to a different cache or memory in the architecture. Only the instruction memory port of a processor, or a cache port, may be reconnected. If the reconnected component is a cache, its port may not be connected to another cache which is already (directly or indirectly) connected to it.

Replace component. A component is replaced by one from the same group (i.e. a processor by a processor, a cache by a cache, etc.). The structure of the architecture is not affected.
By restricting ourselves to such a limited set of operations, it becomes easy to verify the correctness of the search algorithms. However, they might not always be the most effective ones in terms of search speed, and it will be seen later how several of them can be combined to form compound operations.
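As a small example of how such a transformation can be kept consistency-preserving, the sketch below implements the move-task operation on the illustrative VirtualPrototype structure from Section 3.4; the check shown (the target must be a processing unit of the architecture) is our reading of the rule, not code from the report.

    def move_task(vp, task, target_unit):
        # Primitive transformation "move task": reassign one task to another
        # processing unit; the architecture itself is left unchanged.
        if target_unit not in vp.processing_units():
            raise ValueError("target must be a processing unit of the architecture")
        new_partitioning = dict(vp.partitioning)
        new_partitioning[task] = target_unit
        return VirtualPrototype(vp.tasks, vp.components,
                                new_partitioning, vp.priority_order)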
3.7 Quality criterion
Most optimization algorithms require a function that assigns to each solution in the search space a numerical value indicating its quality, and the optimization goal is to find the solution which has a maximal (or minimal) value for that function. For our problem, it is desirable that the quality function possesses the following characteristics:
A solution which meets all timing and implementation constraints has higher quality than one which does not.

For two otherwise equivalent solutions, the one with the lower cost has higher quality, where the estimated cost of a VP is given by the function cost(VP).
In addition, to guide the search, it is necessary that for two solutions which do not meet the timing constraints, the one which is closest to doing so is better than the other. To achieve this, the concept of minimal required speedup (m.r.s.) [2] is useful. The m.r.s. of a VP is a number indicating how much faster the VP would have to be in order to meet the deadlines. With an m.r.s. ≤ 1, no speedup is needed, thus all the deadlines are met by the VP; if the m.r.s. is > 1, then some deadlines are not met. In a similar way, we can quantify the fulfilment of hardware constraints, by dividing e.g. the actual ASIC area used by the available amount. This ratio will be called the feasibility of that ASIC, and the VP is said to be feasible if all components have a feasibility ≤ 1. We are now ready to define a quality function q : VP → ℝ which fulfils the above requirements:
    q(VP) = 2 + b/cost(VP)                          if m(VP) ≤ 1
    q(VP) = e^(−(c·cost(VP))^k) + 1/m(VP)           otherwise

where b, c, and k are constants, for which we used the values b = 500, c = 0.0009 and k = 31 (the appropriate values might vary slightly from one application to another, but these are the ones used in the experiments described in this report), and:

    m(VP) = max{mrs(VP), feasibility(VP)}

Figure 2 shows the quality function graphically. Notice the "step" in the function. The higher level represents the region where all constraints are met, and since these are the only solutions that are really interesting, that area has higher quality values than any point outside it.

When the VPs are manipulated using the transformations in the previous subsection, the calculation of the quality function for a new VP can be made much more rapid if some of the data used in this calculation is stored together with the VP. Then only the parts (usually quite small) that are really affected by the change need to be recalculated.
[Surface plot: quality (0 to 4) against cost (0 to 2000) and constraints (0 to 3).]
Figure 2: The quality function.

On the other hand, it becomes more complicated to make sure that the stored data remains consistent with the current configuration, and this is another strong argument for using a limited set of well-defined transformations. By assuring that they do the necessary recalculations as a part of the transformation, we guarantee that the quality function gets the same value as it would, had all the data been updated.
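A direct transcription of the quality function, using the constants quoted above; the exact shape of the infeasible branch follows our reading of the formula, so treat this as a sketch rather than the reference implementation:

    import math

    B_CONST, C_CONST, K_CONST = 500.0, 0.0009, 31.0

    def quality(cost, m):
        # m is the larger of the minimal required speedup and the worst
        # component feasibility; m <= 1 means all constraints are met.
        if m <= 1.0:
            return 2.0 + B_CONST / cost
        return math.exp(-(C_CONST * cost) ** K_CONST) + 1.0 / m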
4 Search algorithms

Since the search space which is defined by our problem is enormous, it is impossible to find the global optimum in realistic situations, and it is therefore necessary to use optimization algorithms which do not guarantee the delivery of the optimal solution, but which in practice have often been shown to find very good ones. Three such algorithms are genetic algorithms, simulated annealing, and tabu search, which are all inspired by phenomena in nature. They can be seen as meta-heuristics that are designed to prevent the search from getting trapped in a local maximum. In the next subsections we will describe each one of them, and also compare their merits and ways of functioning.
4.1 Simulated annealing
The simulated annealing (SA) algorithm [11] is inspired by techniques found in statistical mechanics, where it is of interest to investigate the arrangement of large amounts of atoms in fluid and solid matter. To study a certain state of a material, an annealing process is used, where the matter is first melted, and then slowly cooled in a controlled way to obtain a certain arrangement of the atoms.
[Plot: probability of acceptance (0 to 1) against iteration number (0 to 3,000), with one curve for each quality difference −0.01, −0.1, −0.2, −0.5, −1.0, and −2.0.]
Figure 3: The probability of accepting moves to states with lower quality as a function of the number of iterations (with parameters T0 = 10, temperature update factor 0.97, and n = 10).

When the temperature is high, atoms can occasionally move to states with higher energy, but then, as the temperature drops, the probability of such moves is reduced. In the optimization procedure, the energy of the state corresponds to its quality function value, and the temperature becomes a control parameter which is reduced during the execution of the algorithm. Starting from a random initial solution, a neighbour of the current solution is randomly chosen in each iteration, and in case this solution is better than the current one, the algorithm accepts the move to that solution. If it is worse, then the move is still taken with a probability determined by the difference in quality and the current temperature. During early iterations, there is a high probability of accepting worse solutions, but as the temperature drops, acceptance becomes less and less likely. In this way, the algorithm behaves much like a random walk during the earliest stages, while it performs almost a hill-climbing as the temperature drops towards the freezing point. Figure 3 illustrates this. On the y-axis is the probability that a move will be accepted, and on the x-axis the iteration number. The curves show how this probability evolves during the execution, for some negative quality moves.

The control parameters of the algorithm are: the initial temperature T0 > 0; the number of iterations n ≥ 1 for which the temperature remains constant between two changes; and the temperature update factor α ∈ (0, 1), by which the current temperature is multiplied to obtain the new one. Algorithm 1 shows the SA algorithm.
    proc SA(T0, α, n) is
      var current, new, Δq, T;
    begin
      T := T0;
      current := initial();
      repeat
        for i := 1 to n do
          new := neighbour(current);
          Δq := q(current) − q(new);
          if Δq ≤ 0 ∨ random() < e^(−Δq/T) then
            current := new
          end if
        end for;
        T := α · T
      until stopCondition
    end SA
Algorithm 1: Simulated annealing algorithm.
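The acceptance behaviour plotted in Figure 3 can be reproduced with a few lines: under the geometric cooling schedule of Algorithm 1 the temperature at iteration i is T0 · α^(i div n), and a move that lowers the quality by Δq is accepted with probability e^(−Δq/T). A small sketch:

    import math

    def acceptance_probability(delta_q, iteration, t0=10.0, alpha=0.97, n=10):
        # Probability of accepting a move that worsens the quality by delta_q (> 0)
        # at the given iteration, with the parameter values used in Figure 3.
        temperature = t0 * alpha ** (iteration // n)
        return math.exp(-delta_q / temperature)

With these parameters, a move with Δq = 0.5 is accepted with probability about 0.95 at iteration 0, but the probability has dropped below 0.1 after roughly 1,300 iterations.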
4.2 Genetic algorithms
The genetic algorithm (GA) [8, 9] is an algorithm inspired by how changes in the chromosomes are made in nature to adapt a species to changes in the environment. Like its natural counterpart, the GA is defined using the concepts of mating (crossover), mutation, and selection based on survival of the fittest. The algorithm keeps a population of individuals, which represent a subset of the search space. The initial population is generated randomly, and then in each iteration some of the individuals are replaced by new ones. These are created by selecting pairs of individuals from the previous population, where the probability of being selected increases proportionally with the individual's quality function value. The two selected individuals are cut in two at a random point, and then one part of them is swapped between the individuals to create two new ones. This mimics the crossover of chromosomes in nature. Finally, some random changes are made to the new individuals, which corresponds to a mutation of the genes in a chromosome. Some of the highest-ranking individuals might "survive" from one generation to the next. Algorithm 2 shows the GA. The control parameters are: the size n > 0 of the population, the number of individuals s ≥ 0 that survive from one generation to the next, and the probabilities of crossover pc ∈ [0, 1] and mutation pm ∈ [0, 1].

The simple selection scheme, where the probability of selecting a particular individual equals its quality divided by the population's total quality, does not work well in all situations. For instance, if there is one individual which has a much higher quality value than the others, this one will soon dominate the whole population; and in the opposite case, where all individuals have approximately the same value, the procedure will degenerate to a random search. The solution to this is called scaling, where a scaling function is applied to the quality values before selection. This function is usually defined so that it maps the average quality value to itself, and the maximal quality of the population to the average multiplied by a constant c (known as the scale factor), which typically takes a value of about 2.0.
    proc GA(n, s, pc, pm) is
      var pop, pop';
    begin
      pop := initialPopulation();
      repeat
        pop' := pop;
        pop := copySurvivors(s, pop');
        for i := 1 to (n − s)/2 do
          mum := select(q, pop');
          dad := select(q, pop');
          if random() < pc then
            (child, child') := crossover(mum, dad)
          else
            (child, child') := (mum, dad)
          end if;
          pop := pop ∪ {mutate(child, pm), mutate(child', pm)}
        end for
      until stopCondition
    end GA
Algorithm 2: Genetic algorithm.

The scaled quality function can be defined as:

    qs(x) = ((c·qavg − qavg) / (qmax − qavg)) · (q(x) − qavg) + qavg

(Note that the sum of scaled quality values over the population equals the sum of unscaled quality values.)

In the original GA, the individuals were represented by chromosomes made up of fixed-length bit-strings. This is convenient since it gives very straightforward definitions of the crossover and mutation operations, but it does not give sufficient flexibility when the search space contains points which vary in "size". As an example, when choosing the architecture of a real-time system, one might need to consider architectures ranging from a single ASIC to ten processors with a large number of buses and memories. To handle situations with such discrepancies within the search space, a variant of the GA has been developed, called evolution programs. The main difference is that the individuals can be represented by arbitrary data structures (such as VPs, in our case), which removes the need for inventing a complex and often redundant bit-string coding of the search space. On the negative side, it then becomes necessary to come up with problem-specific definitions of the genetic operators. As will be seen below in Section 5, this is non-trivial.
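The scaling defined above is a one-line linear map: by construction it sends the average quality to itself and the maximum to c times the average, which is why the population total (and hence the selection denominator) is unchanged. A sketch, assuming qmax > qavg:

    def scaled_quality(q_x, q_avg, q_max, c=2.0):
        # Linear scaling: q_avg maps to itself, q_max maps to c * q_avg.
        slope = (c * q_avg - q_avg) / (q_max - q_avg)
        return slope * (q_x - q_avg) + q_avg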
4.3 Tabu search
The tabu search (TS) algorithm [7] is an artificial intelligence inspired technique, in which the concept of memory is used to make the search more efficient. While traversing the search space, a short-term memory is used to prevent a recent step from being nullified by taking the opposite step, and thereby going back to a part of the search space which was recently visited. The short-term memory is implemented as a tabu list, which is a list of predicates p(x, y) indicating whether a move from x to y is disallowed, or tabu. (In the original formulation of tabu search, there were two proposed implementations of the tabu list. The first one was to store the recently visited solutions, but this can be very space consuming for many problems. The other suggestion was to keep track of the moves made lately, and simply disallow the inverse move, but this is rather inflexible. By instead storing a list of predicates, the user is granted full freedom to define what it means for a move to be "tabu", and how much information it is feasible to store for this particular problem. In a powerful programming language, the implementation of such a list of predicates is of course straightforward.)
    proc TS(k, n) is
      var current, best, new, N, i, T;
    begin
      current := initial();
      best := current;
      T := [ ];
      repeat
        N := neighbours(current);
        new := ⊥; i := 0;
        while i < k do
          let x = head(N) in
            if q(best) < q(x) ∨ ∀p ∈ T : ¬p(current, x) then
              i := i + 1;
              new := max_q{new, x}
            end if
          end let;
          N := tail(N)
        end while;
        T := take(n, tabu(current, new) : T);
        current := new;
        best := max_q{best, current}
      until stopCondition
    end TS

Algorithm 3: Tabu search algorithm.

The tabu list has a fixed length n, which is also the number of iterations a tabu predicate remains in effect. (In the full TS, there is also a long-term memory, which keeps track of what moves have been successful earlier in the run, and what parts of the architecture have remained unchanged for a long time. This information may then be used to apply successful moves more frequently (intensification), and to change parts that have not been changed recently (diversification). However, the long-term memory component of the algorithm has not been used in this work.)

In some situations it might be sensible to override a tabu, and allow the move. This is determined by an aspiration criterion, and we have used a very simple one: a suggested solution which breaks a tabu is accepted if and only if it is better than the best solution found so far.

In each iteration, a part of the neighbourhood of the current solution is examined, and the best element in that set (which does not break the tabu, of course) is selected as the next element. It should be noted that the number of
4.4 Comparison
13
elements that are examined is restricted to k, and therefore the neighbourhood list must be constructed using heuristics which consciously try to find the elements in the entire neighbourhood that are most likely to lead to the optimum, and order them in a list according to this criterion. There are no random factors involved in the selection of the move. After the selection is made, the tabu list is updated by adding a suitable predicate. Algorithm 3 shows the tabu search algorithm.

In the algorithm, the following notation for sequences is used: [ ] denotes the empty sequence; x : xs the addition of one element x at the start of a sequence xs; head(xs) the first element of a sequence xs; tail(xs) the sequence obtained by removing the first element from xs; and take(n, xs) the n first elements of a sequence xs (or the entire sequence, if it has fewer than n elements). ⊥ denotes an undefined element, and it is assumed that q(⊥) = −∞.
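One way to realise the list of predicates is with closures: after a move, a predicate describing its reversal is pushed onto the list, the list is truncated to length n, and a candidate move is tabu if any stored predicate matches it. This is only a sketch of the idea, built on the illustrative VirtualPrototype structure from Section 3.4; the report's actual predicates are defined per transformation, as described in Section 5.5.

    def tabu_after_move(task):
        # "When a task is moved, a subsequent transformation is tabu if it
        # tries to move the task again" (Section 5.5).
        def predicate(current_vp, candidate_vp):
            return candidate_vp.partitioning[task] != current_vp.partitioning[task]
        return predicate

    def is_tabu(tabu_list, current_vp, candidate_vp):
        return any(p(current_vp, candidate_vp) for p in tabu_list)

    def record(tabu_list, predicate, n):
        # Keep only the n most recent predicates, as in Algorithm 3.
        return ([predicate] + tabu_list)[:n]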
4.4 Comparison
The search strategies can be classified according to several aspects, and such a classification clearly reveals their different ways of traversing the search space:

Deterministic vs. random. TS is totally deterministic, so the choice of the next solution depends only on the rules given in the program. Therefore TS will always give identical results when started from the same initial solution. GA and SA, on the other hand, determine the next state(s) randomly, and will thus give different results for different runs, even if the parameters are identical. A consequence of this is that these algorithms find the optimum in some runs and fail in others.

Sequential vs. concurrent. TS and SA traverse the search space by moving from one point to another. This is in contrast with the GA, which concurrently examines different parts of the search space by keeping track of a set of points.

Stepping vs. leaping. TS and SA always move in small steps, by selecting a neighbour of the current solution. The GA, on the other hand, moves by combining two points, ending up in between them (in some sense). If the two points are distant, this corresponds to making a long leap in the search space.

When it comes to implementing the algorithms, SA and GA generally need less problem-specific information than TS, since they compensate for this lack of information by making random selections. On the other hand, the difference is in practice not always so big, since the performance of SA and GA can also be improved considerably by including some knowledge about the application in e.g. the definition of the neighbourhood (SA) or the crossover (GA).

The GA requires a larger implementation "cost" in terms of memory than the others. This is due to the fact that it always keeps a set of solutions (which can sometimes include hundreds of elements) rather than a single one. It might also be slower than the others when calculating the quality function, because the algorithm generates solutions which are distant from the previous ones. By only making small changes, like SA and TS do, the quality function can usually be calculated much more rapidly by starting the calculation from the previous point's stored data and adjusting for the changes.
5 Applying the algorithms

Before the general algorithms of the previous section can be used for the concrete application of architecture synthesis, two steps must be taken: first, the functions which implement the problem-specific parts of the algorithms must be defined (e.g. the mutation operation in GA, or the neighbourhood function in SA); then, the parameters that control the search must be assigned values which render the algorithms as efficient as possible (e.g. the probability of mutation in GA, or the temperature update in SA). The approach we adopted to the latter problem was to run a limited series of experiments with the different algorithms, using a set of alternative parameter values, and then simply choose for the subsequent trials the setting which performed best. For the former problem, the solutions for each of the algorithms will be described in the following subsections.

There are, however, certain things that are common to all of them. The function which creates the initial solution(s) was for instance identical. It randomly selected between one and five processing units from the component library, and connected their ports to a randomly selected memory. The tasks were randomly partitioned on the processing units. We earlier claimed that the TS algorithm was deterministic, as opposed to the others, but this no longer holds when a random starting point is used. Nevertheless, we think this is an appropriate approach, because the purpose of the experiments was to find out what progress the different methods can achieve, and it would be difficult to judge the results if some were started from random solutions and others from deterministically selected ones. But as a consequence, the TS algorithm also has to be run repeatedly for the same parameters to determine its performance from an average starting point.

The goal of the experiment was to find out how good solutions the different algorithms can find, and for this comparison to be fair, they must be allocated approximately the same execution time. However, the implementations used were optimized for clarity rather than speed, and therefore we decided not to limit the elapsed CPU time of the execution, but rather to count the number of evaluations of the quality function, which is usually by far the most time-consuming part of the program. The limit on the number of evaluated VPs was set to 3,000.
5.1 Compound transformations
To avoid the architecture getting cluttered with a lot of components that contribute to the cost without improving the fulfilment of the timing constraints (i.e. processing units with no tasks assigned to them, or memories and caches that are not used by any processing unit), the VPs were "cleaned" after each iteration, by removing all unused components. Unfortunately, this operation renders some of the primitive transformations introduced in Section 3.6 pointless. For instance, adding a new processing unit without simultaneously moving some tasks to it would just mean that it immediately gets removed by the cleaning operation. Therefore we introduce three compound transformations:
Split processing unit. A processing unit is complemented with another one, randomly selected from the component library. The data memory port
(and instruction memory port if the new component is a processor) is connected to the shared data memory in the architecture. The tasks assigned to the original processing unit are distributed evenly between the two processing units, and it is a precondition that there are at least two such tasks.

Merge processing units. All the tasks of a processing unit are moved to another one, and the processing unit consequently gets removed during the cleaning that follows.

Add storage. A cache or memory is added, and some of the connections of the architecture are updated so that the new component becomes used, i.e. it is a combined add/reconnect transformation.

These transformations, together with the primitive ones for moving tasks and reconnecting components, are the ones which are used by the algorithms. The transformations that remove components are used implicitly by the cleaning. If a processing unit ends up with no tasks allocated to it, it will be removed automatically, and if the connections are changed so that a memory is no longer reachable from any processing unit, it is also removed.
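A sketch of the cleaning step, again on the illustrative structure from Section 3.4; since that sketch does not model connections, only the removal of empty processing units is shown, and the removal of unreachable memories and caches is indicated in a comment.

    def clean(vp):
        # Remove processing units that have no tasks assigned to them.
        used_units = set(vp.partitioning.values())
        kept = [c for c in vp.components
                if c.kind not in ("processor", "asic") or c in used_units]
        # A full implementation would also drop memories and caches that are
        # no longer reachable from any remaining processing unit.
        return VirtualPrototype(vp.tasks, kept, dict(vp.partitioning), vp.priority_order)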
5.2 Application example and component library
The application we used to try out the algorithms was a network packet switch, described in [4]. The switch has 8 input and 4 output channels, which transmit variable-size packets bit-serially. The transmission rate is 20,000 baud on the input channels and 40,000 on the output channels. The switch is not allowed to lose any packets, and this requirement is translated into timing constraints. The behaviour is expressed using 12 tasks, one for each port. The input tasks read the packets and buffer them, while the output tasks take data from the buffers and write it to the output ports. A closer description of the problem formulation we have used can be found in [1].

The component library contained 10 components: 3 microprocessors resembling the Motorola 68000 and 68020, and the Intel 80386; 3 ASICs capable of holding 400,000, 600,000, and 800,000 area units (approximately transistors); 3 caches with 70, 75, and 80 percent cache hit probability; and 1 memory. Each of these components had different costs depending on their performance, and by combining them, a very wide range of sensible architectures can be constructed. Table 1 shows some of the best architectures known, and the cost of the best and worst partitionings on each architecture. For several of the architectures, the difference between the best and worst partitioning is a mere 1.4 cost units, and this corresponds to moving one output task from the ASIC to the processor, and moving one input task in the opposite direction. The small difference indicates that these solutions are very close to breaking some constraints. The cost of the ASICs and the memory was defined to also increase slightly with the amount of area and storage actually used. This explains why architectures 5 and 6 in the table have almost identical cost, even though the latter contains one more component than the former; it is because the ASIC area actually used is much smaller.
(It should be pointed out that the component models were not very accurate; thus it is possible that, with better models, the quality order between the solutions might be different.)
     Cost            Description
 1   752.7-755.6     68000 + instr. memory + ASIC (800k)
 2   762.4           68020 + instr. memory + ASIC (600k)
 3   764.2-765.6     68000 + cache (70%) + ASIC (800k)
 4   767.6-805.6     68000 + cache (70%) + instr. memory + ASIC (800k)
 5   775.3           68020 + cache (70%) + ASIC (600k)
 6   775.8-815.3     68020 + cache (70%) + instr. memory + ASIC (600k)
 7   787.2-825.3     68020 + 2 cache levels (70%) + ASIC (600k)
 8   795.3-865.6     68020 + cache (70%) + ASIC (800k)
 9   804.2-805.6     68000 + cache (75%) + ASIC (800k)
10   815.6           68020 + ASIC (800k)
11   820.4-855.6     2 x 68000 + instr. memory + ASIC (800k)
12   834.2-835.6     68000 + cache (80%) + ASIC (800k)
13   841.1           ASIC (400k) + ASIC (600k)
14   861.1           ASIC (400k) + ASIC (800k)
15   881.1           ASIC (600k) + ASIC (600k)

Table 1: Some of the best architectures for the packet switch. All architectures also contain a shared data memory. The cost is given as a range which covers all the acceptable partitionings, when several such exist.
5.3 Simulated annealing
To apply the SA algorithm, the function for randomly selecting a point in the neighbourhood has to be defined. We took the simplest possible approach: construct the list of all neighbours and select a random element from it. (It may be computationally expensive to construct the entire neighbourhood for many implementations. In our case, the implementation was programmed in a "lazy" functional language, which means that only the VP in the neighbourhood list which gets selected is actually constructed. The others are just represented by unevaluated expressions.) A VP is defined to be in the neighbourhood if and only if exactly one transformation is needed to construct it from the current VP.

The parameter setting trials led to the adoption of the following parameter values: T0 = 10, temperature update factor 0.97, and n = 10 (see also Figure 3 above). It should be mentioned that the algorithm's performance was quite robust to changes in these values, i.e. the performance was roughly the same for many different settings.
5.4 Genetic algorithms
For the GA, the functions for selection, crossover, and mutation have to be defined. The selection was done by randomly picking an individual, where the probability of a certain individual getting selected was based on its scaled quality function value (with scale factor c = 2.0) divided by the sum of the qualities of all individuals. The crossover between two VPs was performed in the following way (see Figure 4):

From each of the two architectures, a random number of processing units were selected for exchange. The number had to be equal for the two parents, since pairs of components were swapped.
Figure 4: A crossover between two virtual prototypes. The upper two VPs are the parents, where the dotted rectangles indicate the parts that are exchanged. Since no component is connected to Mem2 after the swap, it is removed by the cleaning.
For the processors selected, the r caches closest to the processor in the instruction memory hierarchy were picked to accompany it, for a random number r. If two processors were exchanged, r was equal for both of them.
Those tasks that were allocated in the selected processing units in both parents simply followed the processing unit when it was moved.
Those tasks that were not allocated in a selected processing unit stayed where they were.
The tasks which were allocated in a processing unit which was moved, but were not allocated in any of the selected processing units in the other parent, were placed in the processing unit which replaced the one where they were allocated before.
Notice how this definition mimics the crossover operation for bit-strings, where an equal number of genes (bits) are swapped between the two chromosomes. Here, we just have to be more careful to make sure that the new VPs are reasonable.

The mutation operation proceeded according to the following steps:
For every processing unit, the unit was either split, merged, or replaced by an equivalent processing unit. Each of these operations was selected with a probability of pm, but only one of them was selected; thus the probability of each processing unit remaining unchanged was 1 − 3·pm. The other argument involved in the split, merge, or replace operations was randomly selected among all possible alternatives.

Every task was moved to another (randomly selected) processing unit with the probability pm for each move.

Finally, for each processor, the instruction memory hierarchy was rebuilt by inserting a cache; reconnecting to bypass an existing cache (and thereby removing it); or replacing a cache by another one from the component library. Each of these transformations was performed with a probability pm.
The parameter settings for the GA were empirically decided to be: n = 50, s = 30, pc = 0.9, and pm = 0.05.
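The per-processing-unit part of the mutation can be sketched as below: each unit is split, merged, or replaced with probability pm each, and left unchanged with probability 1 − 3·pm. The helpers split_unit, merge_unit, and replace_unit stand for the compound transformations of Section 5.1 and are assumed rather than defined here.

    import random

    def mutate_units(vp, p_m, split_unit, merge_unit, replace_unit):
        # Visit the processing units of the original VP; with p_m = 0.05 each
        # unit has a 15% chance of being changed at all.
        for unit in list(vp.processing_units()):
            r = random.random()
            if r < p_m:
                vp = split_unit(vp, unit)
            elif r < 2 * p_m:
                vp = merge_unit(vp, unit)
            elif r < 3 * p_m:
                vp = replace_unit(vp, unit)
            # otherwise the unit is left unchanged
        return vp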
5.5 Tabu search
To use the TS algorithm, the neighbourhood function has to be defined, and as was mentioned above, it is important for the efficiency of the algorithm that this is done with some care, since only a restricted portion of the neighbourhood is examined. The approach taken for architecture synthesis can be divided into two steps:

1. When the deadlines are not being met, the algorithm concentrates on achieving this by expanding the architecture's capacity. This is done by identifying bottlenecks, and trying to remove them. If a memory is overloaded, the load is reduced by introducing instruction caches or separate instruction memories. If a processor is overloaded, it can be replaced with another one with higher capacity, tasks can be moved to other processing units, or it can be split to introduce a brand new processing unit. The same actions can be taken if too many tasks are assigned to an ASIC.

2. Once an architecture has been found which fulfils the deadlines, the procedure shifts to giving priority to reducing the cost of the architecture. This is done by moving tasks from ASICs to processors, removing caches, replacing components with cheaper ones, and merging processing units and memories. Priority is given to these operations in order of their expected cost gains.

To fill out the neighbourhood in the case where a lot of the prime options are tabu moves, the measures taken during the first phase are also allowed during the second, and vice versa, but with lower priority. This allows the algorithm to escape from a local optimum by e.g. introducing a new processing unit even though the deadlines are already met.

The tabu predicates for the different transformations basically try to prevent the effects of the transformations from being reversed. As an example, when a task is moved, a subsequent transformation is tabu if it tries to move the task again.
On the other hand, a tabu should not be too restrictive, and since a transformation like merging or splitting processing units also changes the partitioning, the tabu does not apply if a task is moved as a result of an architecture change. Deciding what the tabu predicate should be for different transformations requires a deep understanding of both the application area and the tabu search algorithm itself, and probably has a large impact on the algorithm's efficiency.

Suitable values for the TS control parameters were determined to be: neighbourhood size k = 10 and tabu list length n = 10. (In our experiments, TS was much more sensitive to the parameter values than were SA and GA.)
6 Experimental results

After determining the most appropriate parameter settings for the three algorithms, two sets of experiments were done with each algorithm. In the first one, the deadline-monotonic schedule was employed, which gives superior execution speed, and in the second the optimal priority order was used (see Section 3.5 above for a description of the scheduling methods). Each set of experiments included 15 runs for each algorithm to obtain sufficient statistical data.
6.1 Using deadline-monotonic scheduling
Using the deadline-monotonic priority order, the results of the trials were as follows:
Tabu search performed best. It found architecture 1 in Table 1 once, architecture 3 twice, architecture 5 four times, architecture 7 once, and architecture 13 seven times. The average cost of the solution was 804.2. The partitionings were not always the optimal ones for the architecture, but good enough to meet the timing constraints. The final solution was found after 584 iterations on average.
Simulated annealing was the second best. It also found architecture 1 once, as well as architecture 3. Architecture 10 was found three times, architecture 12 once, and architecture 13 nine times. The average cost was 824.9. The final solution was found after an average of 1,995 iterations.
The genetic algorithm was the most consistent of the three: it found architecture 13 every time! (With one other parameter setting it managed to find architecture 10 once, but that setting was not used further since it produced worse results most of the time.) An average of 12 generations (about 380 VPs calculated) was needed to find the final solution.
It should be noted that in each and every one of the 45 runs, an acceptable solution was found. The difference was instead in how cheap the solutions were, and how quickly they were found. Figure 5 shows the average progress of the algorithms, when using the deadline-monotonic priority order. On the vertical axis are the best quality function values found up to the iteration given on the horizontal axis. The quality value 2.5 corresponds to a VP which meets all constraints and has a cost of 1,000. The optimum has a quality of 2.664.
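(These two values are consistent with the quality function of Section 3.7: 2 + 500/1000 = 2.5 for a feasible VP of cost 1,000, and 2 + 500/752.7 ≈ 2.664 for the cheapest architecture in Table 1.)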
[Plot: best quality value found (2.50 to 2.66) against iteration number (0 to 3,000) for TS, GA, and SA.]
Figure 5: Average progress of the algorithms when using deadline-monotonic scheduling.
6.2 Using optimal scheduling

When the optimal priority order was used instead, the following results were obtained:
Tabu search was extremely successful: it found the global optimum (i.e. the optimal partitioning on the optimal architecture) 13 out of 15 times! It also found the optimal partitioning for architecture 2 once and architecture 13 once. For the runs when the global optimum was found, the execution time varied from 53 to 2,161 iterations, with an average of 736. (The time to find the optimal architecture was at most 1,105 iterations, with an average of 681.)
Simulated annealing also gave better results when using the optimal scheduling. It found architecture 2 once, architecture 3 six times, architectures 4, 6, 7, 8, and 10 once each, and architecture 13 three times. The average cost was 792.6, and it needed an average of 1,901 iterations to find the best solution.
The genetic algorithm did not improve significantly when using the optimal priority order. It found architecture 10 once, architecture 13 eleven times, architecture 14 twice, and architecture 15 once. The GA had great difficulties in improving on these architectures; the best result was found after, on average, only 26 generations (about 800 VPs tried), with a maximum of 70 generations (2,120 VPs). On several occasions, the algorithm was not able to improve at all on the initial (randomly generated) solutions.
[Plot: best quality value found (2.50 to 2.66) against iteration number (0 to 3,000) for TS, GA, and SA.]
Figure 6: Average progress of the algorithms when using optimal scheduling. Figure 6 shows the average progress of the algorithms when using the optimal priority order.
7 Discussion

From the above results, it is quite clear that TS performed best and GA worst. In this section, we will look into the results obtained in some detail, and also try to give some explanations. We will also briefly touch upon an issue not related to the performance of the algorithms, but which is still an important factor when selecting one of them: how difficult they were to implement.
7.1 Partitioning quality
The synthesis algorithms we developed do two things simultaneously: they generate architectures and they partition the behaviour on those architectures. So a relevant question is: how good are they at partitioning? This question can be answered by comparing, for each of the generated solutions, how good the selected partitioning is compared to the optimal partitioning on that architecture. Of course, for many of the architectures, only one partitioning, or a set of equivalent partitionings, is possible. But for others, where there are more resources available, a large number of partitionings might be feasible. In those situations, we wish to obtain the one which gives the lowest cost, i.e. which uses as little ASIC area as possible.

In the material obtained from the experiments described above, a lot of the architectures found were among those where all feasible partitionings had equal cost. In particular, for the GA, all the architectures were in this group. Because the material is scarce, it is not possible to draw any decisive conclusions, but we
note that when using TS and SA with deadline-monotonic scheduling, in most cases the partitionings were non-optimal. When using optimal scheduling with TS, the partitionings were always optimal, and with SA they were better than with deadline-monotonic scheduling. Since the quality of the partitioning is important for the success of the search, this comparison gives a clear indication of how good the algorithms are. If they fail to produce good partitionings, they will reject many architectures that are feasible with the optimal partitioning but not otherwise. A reasonable conclusion is that TS with optimal scheduling can be assumed to produce partitionings very close to the optimum.
7.2 Why is architecture 13 so popular?
One might wonder what makes architecture 13 such a common solution, or perhaps rather what makes it so hard to move beyond. The reason is that architecture 13 has only one time-shared component (the memory), whereas all the better architectures in Table 1 also contain a processor. This means that as long as the partitioning is feasible (i.e. the capacity of the ASICs is not exceeded), the partitioning does not affect the quality function. Therefore, a lot of partitionings are equivalent for this architecture. Also, the deadline-monotonic schedule turns out to be very close to optimal for this architecture, independently of the partitioning. (The situation is similar to that of a single-processor architecture, where there is also only a single time-shared resource.)

For other architectures, like the better ones which contain both microprocessors and ASICs, the scheduling and partitioning effects become more intricate. Only a very small number of partitionings are feasible and meet the deadlines for these architectures, and when tasks are moved more or less randomly, there is a large risk that the VP will become infeasible, thus giving a very small quality value. Further, many of the better architectures might only meet the deadlines when used with an optimal schedule. By using a fixed schedule, like the deadline-monotonic one, the algorithms risk discarding cheaper architectures, even if they did find a good partitioning for them. This explains the better results obtained in the second round of experiments.

Another difficulty with architecture 13 is that once it has been found, several "negative" transformations have to be made to restructure it into one of the better architectures. This explains why the algorithms so easily got stuck there. Apparently, tabu search was better at handling this local optimum. From this discussion, and from the results obtained, it should be clear that the optimal scheduling approach is to be preferred, even though it makes the algorithms considerably slower (in the current implementation, the difference is about a factor of 2, but this could surely be reduced).
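The flatness of the quality landscape described above can also be checked directly. The sketch below is again our illustration, reusing the hypothetical helpers from the previous sketch plus an assumed `quality` function: it counts the distinct quality values over the feasible partitionings of one architecture, and a count of one means that no move between feasible partitionings changes the quality, which is exactly the plateau that traps the search on architecture 13.

```python
from itertools import product

def quality_plateau_size(tasks, units, is_feasible, quality):
    """Count the distinct quality values over all feasible partitionings of a
    single architecture; a result of 1 indicates a completely flat landscape."""
    values = set()
    for assignment in product(units, repeat=len(tasks)):
        partitioning = dict(zip(tasks, assignment))
        if is_feasible(partitioning):
            values.add(quality(partitioning))
    return len(values)
```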
7.3 Implementation effort
Important for the choice of algorithm is not only its performance, but also its ease of implementation and application. As mentioned above, SA and GA are generally regarded as requiring less implementation effort than, e.g., TS. However, in our experience, where we used a very simple TS and a complex GA, the relation between their implementation efforts was different. In fact, SA and TS required about the same amount of work, whereas the GA needed much more. Despite this extra work on the GA, it was clearly not enough, since the algorithm's performance did not match that of the others. This is not to say that GAs are less useful than the others, but for problems like the one above, where a standard bit-string coding of the chromosomes is impossible, it can be very difficult to define efficient genetic operations for use in an evolution program.

A further benefit of TS is that it does not really require a quality function which maps VPs to numbers. Instead, it is enough to have an order relation on VPs, and it is therefore not necessary to invent an (often artificial) quality function, which can have effects on the search that are difficult to anticipate and requires a lot of experiments to adjust. Further, there is less need for experiments to determine suitable parameter settings for TS (assuming that a deterministic choice of starting point is made).
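To illustrate the last point, here is a minimal sketch (under our own assumptions, not the report's code) of a tabu-search step driven purely by an order relation `better(a, b)` on virtual prototypes; `neighbours` and `move_of` are hypothetical helpers that enumerate one-transformation neighbours and describe the transformation used.

```python
def tabu_search(start_vp, neighbours, better, move_of, tabu_len=7, max_iter=500):
    """Tabu search that needs only an order relation better(a, b) on VPs.

    neighbours(vp) -> iterable of VPs reachable by one primitive transformation
    better(a, b)   -> True if VP a is strictly preferred to VP b
    move_of(a, b)  -> hashable description of the transformation turning a into b
    """
    current = best = start_vp
    tabu = []                                   # short-term memory of recent moves
    for _ in range(max_iter):
        admissible = [n for n in neighbours(current)
                      if move_of(current, n) not in tabu
                      or better(n, best)]       # aspiration criterion
        if not admissible:
            break
        chosen = admissible[0]
        for cand in admissible[1:]:             # best admissible neighbour by the order relation
            if better(cand, chosen):
                chosen = cand
        tabu.append(move_of(current, chosen))
        if len(tabu) > tabu_len:
            tabu.pop(0)
        current = chosen
        if better(current, best):
            best = current
    return best
```

Note that the comparison `better` is the only place where solution quality enters; no numeric objective is ever computed.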
8 Conclusions

In this report, we have compared the use of three meta-heuristics for selecting architectures and partitionings of real-time systems. The results show that although all three can be used to find relatively good implementations, the tabu search algorithm stands out as being more efficient, both in execution time and solution quality. Simulated annealing also performed quite well, but our implementation of the genetic algorithm proved to have great difficulty escaping from local optima.

A somewhat surprising result was that the choice of priority order for the fixed-priority task scheduler had a large impact on the result. This was unexpected, since it is sometimes advocated that the simple deadline-monotonic priority order is sufficiently close to optimal in most situations, and that there is thus no need to go through the trouble of deriving the optimal priority order. As our results clearly show, using the optimal priority order can be the difference between success and failure.

Based on the experiments presented in this report, TS seems to be the best choice for further work in this field, especially since its capabilities have been exhausted to a far lesser degree than those of the other two algorithms. We will therefore continue developing the TS-based architecture synthesis method, and some of the improvements we see as possible are:
- The algorithm often needs a relatively long time to find a feasible solution which meets the timing constraints when started from a single random solution. This problem can be attacked in two ways. The simplest is to take a set of random solutions and choose the best one as the starting point; this is similar to generating the initial population for the GA, which often contained some very good implementations. The other possibility is to include a special heuristic for choosing the starting point. (A minimal multi-start sketch is given after this list.)
- The full TS algorithm uses a long-term memory in addition to the short-term memory realized by the tabu list. The role of the long-term memory is to remember things that have been common to previously found good solutions. These elements can then be applied more frequently to intensify the search. For our example, this could include observations about which components have been frequent in good solutions, or which tasks have been allocated together. Diversification is a complementary notion to intensification: during diversification the algorithm tries to avoid solutions that have previously been examined, and a TS algorithm often alternates between intensification and diversification. (A sketch of such a frequency memory also follows after this list.)
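For the first improvement, a minimal multi-start sketch could look as follows (our illustration; `random_vp` and `better` are assumed helpers for generating a random virtual prototype and comparing two VPs):

```python
def best_of_random_starts(random_vp, better, n_starts=30):
    """Draw n_starts random VPs and return the best one as the TS starting point."""
    best = random_vp()
    for _ in range(n_starts - 1):
        candidate = random_vp()
        if better(candidate, best):
            best = candidate
    return best
```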
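For the second improvement, a frequency-based long-term memory could be sketched as below (again an illustration under our own assumptions: VPs are taken to expose an `allocation` mapping from tasks to components, which is not how the report's data structures are defined):

```python
from collections import Counter

class LongTermMemory:
    """Frequency-based long-term memory for tabu search."""

    def __init__(self):
        self.freq = Counter()   # (task, component) -> occurrences in good solutions

    def record(self, vp):
        """Remember which allocations occurred in a good solution."""
        for task, component in vp.allocation.items():   # assumed VP attribute
            self.freq[(task, component)] += 1

    def bias(self, vp, intensify=True):
        """Evaluation bias: reward frequent allocations during intensification,
        penalise them during diversification."""
        total = sum(self.freq[(t, c)] for t, c in vp.allocation.items())
        return total if intensify else -total
```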
It is our belief that the inclusion of these improvements would make the tabu search based architecture synthesis algorithm even stronger. Other improvements can be made to the models used for estimating cost, m.r.s., and feasibility. These currently use rather simple models of the architecture components, and better precision is probably needed when applying the techniques to real-life examples. However, since the algorithms make trade-offs between cost and performance based only on the mentioned metrics, without regard to how these are calculated, we can expect them to function equally well with other models. Of course, more complex models will require a longer execution time for each iteration of the algorithms. Also, although many applications are suitable for shared-memory architectures like those used in this study, there are others that require more general architectures, perhaps including distribution networks and less regular connection schemes. We expect to include such possibilities in subsequent versions of the tabu search synthesis algorithm.