Mapping and Scheduling for Architecture Exploration of Networking SoCs

Thomas Wild, Jürgen Foag, Nuria Pazos
Institute for Integrated Circuits, TU Munich, D-80290 München, Germany
[email protected]

Winthir Brunnbauer
Infineon Technologies NA Corp., System Architecture Group Communications, San Jose, CA 95112, USA
[email protected]

Abstract

This paper describes two different approaches to optimizing the performance of SoC architectures in the architecture exploration phase. Both solve the problem of mapping and scheduling a task graph onto a target architecture under special consideration of on-chip communication. A constructive algorithm is presented that extends previous work by taking potential future data transfers into account. The second approach is a recursive procedure based on local search techniques in a specially defined neighborhood of the critical path; simulated annealing and tabu search are used as search algorithms. Both approaches find solutions with better performance than established methodologies. The recursive technique yields better results than the constructive approach but is limited to small and mid-sized problems, whereas the constructive algorithm does not suffer from this limitation.

1. Introduction

System-on-Chip (SoC) solutions allow the realization of very complex functions with high performance and flexibility requirements. These properties can be achieved by mixed HW/SW implementations that combine specific accelerator modules with embedded processors. In order to find the right mixture in a fast and efficient manner, different architectural alternatives have to be compared at a level of abstraction much higher than that of today's design tools. CAD vendors as well as academia have spent much effort on providing tools that support designers in the early phases of a design project (to name just a few: [2], [3], [4] and [5]). In industrial practice, this task is often based on the individual experience of the designer. However, considering the increasing complexity and the multitude of parameters, a more formal procedure is needed. In many application areas, such mixed HW/SW solutions are required to fulfill steadily increasing demands. Networking is one such area, where high throughput and, at the same time, adaptability to evolving standards are needed.

Packet processing is control dominated: which parts of the function have to be executed depends on the protocol stacks contained in the packets. The following investigations have been made with this application area in mind. In the context of SoC design, it has been shown that the communication architecture has a strong influence on the achievable performance ([1]). In traditional design flows (e.g. [2]), however, the consideration of communication aspects is mostly separated from the architecture definition phase, and the effects of concurrent use of communication resources, e.g. contention on buses leading to delayed data transfers, are disregarded in the mapping decision. The simple example in figure 1 shows the influence of this effect.

Figure 1: Influence of bus contention (execution-time table for tasks 1-4 on resources A, B and C; transfer-time table for the transfers 1->2, 1->3, 2->4 and 3->4 with times 2, 5, 2 and 3; Gantt charts of mapping alternatives 1 and 2 on resources A, B, C and the shared bus)

There are two alternatives for the implementation of task 2: it may either be executed on a special accelerator block B or implemented on processor A. Although the sum of the transfer times and the execution time on the accelerator is less than the duration of task 2 on A, the second solution (keeping task 2 on processor A) is faster: contention on the bus leads to delayed transfers and therefore to reduced performance.


This paper concentrates on the mapping and scheduling of a functionality onto an architectural template, taking into account in particular the communication between the different blocks of the architecture, which work fully in parallel. We treat this as a graph partitioning problem that is solved in two very different ways: First, we enhance a constructive method by looking ahead and considering potential data transfers in order to avoid unnecessary delays. Second, we present two local-search-based approaches that recursively approach the optimum solution, using a specific definition of a neighborhood that considers dynamic effects on the architecture. The paper is organized as follows: In section 2, a short overview of previous approaches in the area of HW/SW partitioning and scheduling is given. Section 3 describes the problem and the assumptions that are made. Section 4 contains the description of both the constructive and the recursive algorithms. Results of experiments with the algorithms are presented in section 5.

2. Related Work

Optimizing the mapping and scheduling of a task graph is known to be an NP-complete problem ([8]). There are only a few simple cases where algorithms find optimum solutions in polynomial time. Consequently, for practically relevant situations, many heuristic approaches have been published that try to find good solutions in a reasonable amount of time. They differ in many ways: in the optimization target, application area, granularity level, target architecture and other constraints. A direct comparison is therefore very difficult.

The problem of scheduling has been in the scope of researchers in the area of parallel processing for a long time, and many proposals are known for multiprocessor architectures; a broad overview is contained in [8]. The assumptions from multiprocessor environments concerning the number, type and connectivity of the architecture's resources are not valid in the context of SoC architectures. Usually, fewer embedded processors are available, and third-party IP modules or tailor-made HW accelerators have to be integrated that can implement only a very limited subset of the tasks. Moreover, the communication architecture, including the memory structure, is often more restricted.

Graph-based approaches can be divided into constructive and recursive methods: the solution is either generated in one pass through the graph, mapping and scheduling the tasks one after the other, or solutions are tested iteratively and improved in an optimization loop. Constructive algorithms have been described in [6] and [7]. In both cases, the information available at each step, such as the selected resource and the ready times of data and resources, is taken into account to generate a list schedule. With this information, the node-resource combination with the best execution time is chosen and scheduled. Constructive approaches cannot consider mapping decisions which will take place at a later point in time during the scheduling process, since only limited information is available. Nevertheless, this process is able to generate mappings very rapidly.

Local search techniques have been applied by several researchers. In [11], the authors use a simulated annealing (SA) approach to optimize the partitioning of a task graph on a processor-coprocessor architecture in order to get the maximum speedup from the usage of the coprocessor. The studies described in [12] encompass the investigation of tabu search (TS) in addition to SA. The results showed that these algorithms are capable of delivering high-quality solutions, with TS being clearly superior to SA; the target was to find solutions that reduce communication overhead in a simple single-processor environment. However, it turned out that the run times of the algorithms are quite critical. SA and TS were both considered suitable for architecture synthesis in [13], with advantages for TS with respect to search speed and tunability. However, only very small graphs were considered, and only an improvement concerning inter-task communication was envisaged. A further approach applying local search techniques is contained in [9]. The random local search algorithm named FAST (first published in [9] and modified in [10]) makes a limited number of moves within two coupled loops to accelerate the search process. Then, a random node is picked from the critical path and used as the basis for the next search process. This outer loop is executed a fixed number of times. The FAST algorithm is taken as a benchmark for the search algorithms presented in the following sections.

3. Problem Description

The main goal of the ongoing work is to find methodologies for architecture definition and exploration that are well adapted to mixed HW/SW solutions. In contrast to distributing tasks onto a set of identical processors with a general interconnection structure, different solutions with diverse HW components and various properties, connected by a shared communication architecture, have to be compared to each other. A basic task in this context is to avoid the negative effects of resource conflicts while mapping tasks onto different resources. In general, two coupled mapping/scheduling problems have to be solved: the mapping of computation tasks leads to the necessity of transfers, and each transfer can be interpreted as a new communication process that also has to be mapped and scheduled on the communication architecture. The following assumptions are made for the time being (a sketch of the resulting problem model follows the list):

• The function to be implemented is specified as a graph.
• The architecture consists of blocks with different characteristics, connected by a common bus (see figure 1).
• Communication between tasks mapped on the same resource is disregarded.
• Estimation values for the execution times of the nodes on the diverse architectural blocks are available.
• Control dependencies coming from the networking application area are not yet covered.
• The only optimization goal is a minimum schedule duration; implementation cost is not yet covered.
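To make these assumptions concrete, the problem instance can be captured by a few simple data structures. The following Python sketch is purely illustrative and not part of the original work; all identifiers (Task, Transfer, Problem, exec_times) are hypothetical.

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass(frozen=True)
    class Task:
        """A node of the process graph (granularity: one protocol-specific function)."""
        name: str
        # Estimated execution time per architectural block; None means the block
        # cannot implement this task.
        exec_times: Dict[str, Optional[float]] = field(default_factory=dict)

    @dataclass(frozen=True)
    class Transfer:
        """A data dependency; it becomes a bus transfer only if source and target
        tasks are mapped onto different blocks."""
        src: str
        dst: str
        time: float  # estimated transfer duration on the shared bus

    @dataclass
    class Problem:
        resources: List[str]       # computation blocks, e.g. ["A", "B", "C"]
        bus: str                   # the single shared communication resource
        tasks: Dict[str, Task]
        transfers: List[Transfer]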


4. Algorithms

First, aspects common to the two approaches are described. Both start from a description of the functionality as a process graph. The granularity of the graph is assumed to be at the level of function calls, which represent certain protocol-specific functions within the targeted packet processing application. In order to capture the effects of internal communication, transfers are only possible when the source nodes have finished execution. Therefore, the memory organization has to be taken into account when choosing the granules.

4.1. Constructive Algorithms

Reference Constructive Algorithm (ReCA). The constructive algorithm, derived from [7], utilizes so-called static and dynamic urgencies to determine the mapping and scheduling. Mappings that have been performed cannot be revised later on. Therefore, constructive approaches can deliver slightly worse mapping results than recursive approaches. Nevertheless, they are able to create mappings very fast, since all decisions are made only once. In a first step, a static urgency (SU) is computed for each node as a preliminary node priority, corresponding to its temporal distance to the final node. Then, a dynamic urgency (DU) is used to decide on the actual mapping of nodes to resources, considering the current situation of the system. It has to be evaluated for each possible combination of ready node and respective resource and is calculated as follows:

DU = SU - max(ready_task, resource_available) - WCET

Here, the WCET (worst-case execution time) term is the median of the worst-case execution times of one node over all possible resources. The middle term is the greater of the earliest point in time at which the node can start execution and the time at which the desired resource becomes available; thereby, the actual occupancy of the resources is taken into account. The combination with the greatest DU identifies the node that has to be executed soonest (because of its SU) together with the most appropriate resource in terms of availability and execution time. The algorithm of [7] is extended by treating HW blocks as regular computational units in order to get balanced solutions for SoC architectures. Moreover, since the capacity of the shared medium can easily be exceeded, a variable amount of transferred data is included in order to capture realistic traffic patterns. This defines our reference algorithm ReCA for linear mapping/scheduling, which is referred to in the following comparisons.
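To illustrate how the DU rule can drive a list scheduler, the following sketch selects, among all ready node/resource combinations, the one with maximum dynamic urgency. It reuses the hypothetical Task structure from the sketch in section 3 and is our own illustration, not the implementation of [7] or of ReCA.

    import statistics

    def median_wcet(task, resources):
        """Median of the task's execution times over all blocks that can implement it."""
        times = [task.exec_times[r] for r in resources
                 if task.exec_times.get(r) is not None]
        return statistics.median(times)

    def pick_mapping(ready, su, data_ready, res_free, resources):
        """Return the (task, resource, start_time) combination with maximum DU.

        ready:      tasks whose predecessors have finished (Task objects)
        su:         task name -> static urgency (temporal distance to the final node)
        data_ready: task name -> earliest time at which its input data is available
        res_free:   resource  -> time at which the resource becomes free
        """
        best = None
        for t in ready:
            for r in resources:
                if t.exec_times.get(r) is None:   # task cannot run on this block
                    continue
                start = max(data_ready[t.name], res_free[r])
                du = su[t.name] - start - median_wcet(t, resources)
                if best is None or du > best[0]:
                    best = (du, t, r, start)
        _, t, r, start = best
        return t, r, start

The selected task would then be scheduled at the returned start time, the resource's free time updated, and the procedure repeated until all nodes are placed.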

Enhanced Constructive Algorithm (ECA). A closer look at ReCA reveals some undesirable effects. Although obviously faster resources are chosen, the overall performance may be impaired by blocking transfers on the shared medium; the algorithm would, for example, end up with the suboptimal solution for the example in figure 1. If, alternatively, lower-performing resources are selected, unfavorable transfers can possibly be avoided. If the potential subsequent transfers of a task are also taken into account in the calculation of the critical DU values, this degradation of performance can be prevented. This modification is implemented in the ECA and leads to the choice of the better solution. However, it causes difficulties in the determination of the required transfers, since it would imply deciding on the resources of subsequent nodes before they are actually mapped. As a simplification, only mandatory transfers are therefore considered. As a further step, not only the transfers but also the succeeding tasks could be included in the calculation. However, this does not look very promising, since the effort of looking ahead at potential future mappings, which may turn out differently in the following steps, is immense.
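One possible way to read the ECA modification is as an additional penalty term in the DU calculation. The sketch below is an assumption about how this could be expressed, not the authors' exact formulation, and the argument names are hypothetical.

    def du_with_lookahead(su, data_ready_time, resource_free_time, wcet,
                          mandatory_transfer_time):
        """DU variant illustrating the ECA idea: transfers that the considered mapping
        makes unavoidable are charged as an additional penalty. All arguments are
        plain numbers; mandatory_transfer_time is the summed duration of the outgoing
        transfers that are mandatory regardless of how the successors will be mapped."""
        start = max(data_ready_time, resource_free_time)
        return su - start - wcet - mandatory_transfer_time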

4.2. Recursive Algorithms

Constructive approaches suffer from their limited ability to take future effects on the architecture into account when scheduling earlier tasks. Therefore, a recursive methodology has been studied that allows more balanced mapping decisions, leading to improved schedules. This approach is based on local search, which recursively checks possible solutions in the neighborhood of the current solution. As a first step, two local search algorithms that are applied in many areas have been used to check the viability of this approach.

The general framework is depicted in figure 2. Starting from the graph representation of the application, a first mapping is performed, e.g. by choosing the resource with minimum execution time for each node. Then, the loop is executed until the stop criterion is fulfilled. First, based on the current mapping, the graph is extended by introducing communication nodes for the transfers between tasks that are mapped on different resources. This procedure allows the treatment of communication in the same way as computation tasks.

Figure 2: Recursive mapping/scheduling procedure (define granularity and graph, select architecture template and resources; map processes/transfers onto the target architecture; introduce communication nodes; neighborhood/move analysis for optimization; schedule estimation using execution, transfer and wait times; cost function including implementation costs; acceptance decision and stop criterion)
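The control flow of figure 2 can be summarized in the following skeleton. It is a minimal sketch under the assumptions of the data model sketched in section 3; all callables (initial_mapping, propose_move, cost_of, accept, stop) and the state dictionary are hypothetical placeholders to be supplied by the concrete search strategy.

    def explore(problem, initial_mapping, propose_move, cost_of, accept, stop):
        """Skeleton of the recursive mapping/scheduling loop of figure 2 (illustrative).

        cost_of(problem, mapping) is expected to extend the graph with communication
        nodes, build a simplified list schedule and return its duration (optionally
        plus implementation costs); accept() and stop() encapsulate the search
        strategy, e.g. simulated annealing or tabu search; propose_move() picks a
        solution from the (CP) neighborhood of the current mapping.
        """
        mapping = initial_mapping(problem)        # e.g. fastest feasible resource per node
        cost = cost_of(problem, mapping)
        best_mapping, best_cost = mapping, cost
        state = {"iteration": 0}                  # search-strategy bookkeeping
        while not stop(state):
            candidate = propose_move(problem, mapping, state)
            cand_cost = cost_of(problem, candidate)
            if accept(cand_cost, cost, state):    # acceptance decision of figure 2
                mapping, cost = candidate, cand_cost
                if cost < best_cost:
                    best_mapping, best_cost = dict(mapping), cost
            state["iteration"] += 1
        return best_mapping, best_cost

A stop criterion could, for instance, be a fixed number of iterations or, in the SA case, a frozen temperature.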


The extended graph and the corresponding mapping are used to construct a schedule with a straightforward list scheduling method. For each resource, a list of ready tasks is maintained, ordered according to the maximum path length to the end node. The task with the highest priority on each resource is scheduled, and the lists are updated until all nodes are scheduled. This step relies on the mapping of tasks to resources that is determined in the mapping loop.

In the mapping loop, a local search for an optimum solution is performed. The quality of a solution is evaluated according to a cost function that may encompass a performance-related as well as an implementation-related part. In the current version, the optimization is done purely with respect to performance.

Neighborhood. Local search is based on the definition of a neighborhood N(s) for a possible solution s. It is a subset of all possible solutions that has a certain relation to the current solution. The basic concept is to recursively select the next solution from the neighborhood of the current solution and thus approach the minimum cost value without searching the complete solution space. It is desirable to limit the search range by defining a tight neighborhood and at the same time direct the search towards the minimum. By accepting solutions with higher cost than the best value found so far, escaping from local minima is possible.

As the optimization is done exclusively with respect to performance, the concept of the critical path (CP) through the graph is used to define the neighborhood. The CP of a scheduled graph is defined as the sequence of nodes from the input to the exit node that determines the schedule duration D(CP). Changes of the execution time of any node n_i ∈ CP (a CP node, CPN) lead to changes of D(CP). (Strictly, this holds only if there is a single CP or if the duration of n_i is increased. For the method described below, multiple critical paths do not cause problems, as changes in one of the CPs usually lead to an unambiguous new CP.) Additional delays arise for those CPNs n_i that are ready to run but have to wait because some other node is active on the same resource. Bus contention is a typical example, but the same may also occur on computation resources. In the following, these times are denoted as wait times T_Wait,i. This leads to

D(CP) = Σ_{n_i ∈ CP} D(n_i) + Σ_{n_i ∈ CP} T_Wait,i
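The list scheduling step and the wait times T_Wait,i can be illustrated as follows. This is a simplified sketch and not the authors' implementation; it operates on the extended graph in which transfers appear as nodes mapped onto the bus, and all identifiers are hypothetical.

    from collections import defaultdict

    def list_schedule(durations, mapping, edges):
        """Greedy list scheduler for the extended graph (communication nodes included).

        durations: node -> execution/transfer time on its mapped resource
        mapping:   node -> resource (computation block or the bus)
        edges:     (pred, succ) pairs forming a DAG
        Returns start, finish and wait times per node and the schedule duration.
        """
        preds, succs = defaultdict(set), defaultdict(set)
        for u, v in edges:
            preds[v].add(u)
            succs[u].add(v)

        prio = {}                        # priority: longest path to an exit node
        def longest(n):
            if n not in prio:
                prio[n] = durations[n] + max((longest(s) for s in succs[n]), default=0.0)
            return prio[n]
        for n in durations:
            longest(n)

        res_free = defaultdict(float)    # resource -> time at which it becomes free
        data_ready = {n: 0.0 for n in durations}
        start, finish, wait = {}, {}, {}
        done = set()
        while len(done) < len(durations):
            ready = [n for n in durations if n not in done and preds[n] <= done]
            # On every resource, schedule the ready node with the highest priority.
            for res in {mapping[n] for n in ready}:
                n = max((m for m in ready if mapping[m] == res), key=lambda m: prio[m])
                start[n] = max(data_ready[n], res_free[res])
                wait[n] = start[n] - data_ready[n]   # delay caused by a busy resource
                finish[n] = start[n] + durations[n]
                res_free[res] = finish[n]
                done.add(n)
                for s in succs[n]:
                    data_ready[s] = max(data_ready[s], finish[n])
        return start, finish, wait, max(finish.values())

With the start and wait times produced by such a scheduler, D(CP) as defined above can be recovered as the sum of the node durations and wait times along the critical path.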

One main idea for reducing the schedule duration is therefore not only to increase parallelism (i.e. reduce the length of the CP), to use faster resources or to avoid transfers (i.e. reduce individual D(n_i) or make them zero) on the CP, but also to reduce the T_Wait,i. This idea is introduced into the definition of the neighborhood for selecting improved solutions: the neighborhood of a solution consists not only of the CPNs, but also of nodes with delaying influence. The CP neighborhood can be defined by a combination of the following elements:

a) computation CPNs with/without T_Wait
b) communication CPNs with/without T_Wait
c) transfers preceding communication CPNs with T_Wait
d) the parent nodes of communication vertices of type c)
e) tasks preceding computation vertices from the CP with T_Wait

In general, possible moves to shorten D(CP) encompass the remapping of single tasks (especially of CP nodes), the reduction of delaying influence on CP nodes, or the remapping of the source and target node of a transfer onto a common resource in order to eliminate the transfer. In figure 3, an example of an enhanced CP neighborhood (ECP) is shown; the nodes are classified according to the above list. The potential of the extension of the critical path can be seen at transfer 25: if transfer 22 (case c) can be avoided, the delayed CPN 25 may start earlier, possibly leading to a reduced total schedule. The same holds for node 5 (case d), if it can be mapped onto a resource with shorter execution time.

Figure 3: Enhanced CP neighborhood example (graph with communication nodes and CP nodes, scheduled on resources A, B, C and the bus, with wait times T_Wait marked; node classes: a) 7-10, b) 21, 25, 28, c) 15, 22, d) 4, 5, e) 12)

The consideration of nodes corresponding to types c) to e) extends the critical path by nodes that have a delaying impact on CPNs. These nodes are therefore also candidates for moves that may improve the schedule duration. The different options mentioned above can be chosen selectively to define the neighborhood for both the simulated annealing and the tabu search optimization approach.
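As an illustration of how such a neighborhood could be generated from a schedule, the following sketch collects the candidate nodes of classes a) to e) and proposes a single-task remapping move. It is one interpretation of the classification above, not the authors' implementation, and all identifiers are hypothetical; it builds on the schedule and Task structures sketched earlier.

    import random
    from collections import defaultdict

    def ecp_candidates(cp_nodes, start, finish, wait, mapping, edges, is_comm):
        """Candidate nodes of the enhanced CP neighborhood (classes a-e).

        A CPN with wait time > 0 was blocked by the node that occupied its resource
        immediately before it started; that blocker is a class c) transfer or a
        class e) task, and the graph parents of a blocking transfer form class d).
        """
        parents = defaultdict(set)
        for u, v in edges:
            parents[v].add(u)
        cand = set(cp_nodes)                       # classes a) and b)
        for n in cp_nodes:
            if wait.get(n, 0.0) <= 0.0:
                continue
            blockers = [m for m in mapping
                        if m != n and mapping[m] == mapping[n] and finish[m] == start[n]]
            for b in blockers:
                cand.add(b)                        # class c) or e)
                if is_comm(b):
                    cand.update(parents[b])        # class d)
        return cand

    def remap_move(mapping, candidates, tasks, resources, rng=random):
        """Propose a neighbor by remapping one candidate task to another feasible block."""
        movable = [n for n in candidates if n in tasks]   # transfers follow implicitly
        n = rng.choice(movable)
        options = [r for r in resources
                   if tasks[n].exec_times.get(r) is not None and r != mapping[n]]
        new_mapping = dict(mapping)
        if options:
            new_mapping[n] = rng.choice(options)
        return new_mapping

Restricting the remapping to ECP candidates corresponds to the SA-ECP variant described below, whereas drawing the task from all nodes corresponds to SA-RN.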


Simulated Annealing. Simulated annealing ([14]) constructs a solution in analogy to a physical cooling process, in which a material is heated and slowly cooled down in order to reach good material properties corresponding to minimum energy. SA recursively takes feasible solutions from the neighborhood and calculates the difference to the cost of the current solution. In case of an improvement, the solution is accepted and taken as the current solution for the next step. If the tested neighboring solution has a higher cost value, its acceptance is determined by a random acceptance function that depends on the temperature, which controls the cooling process, and on the difference of the cost values. The probability of accepting a worse solution falls with decreasing temperature and increasing cost difference. This allows leaving local minima and nevertheless approaching a near-optimum solution. Two SA variants are studied: the choice of a move from the enhanced CP neighborhood (SA-ECP) and, as an alternative, the selection of a move that changes exactly one task mapping, referred to as SA-RN in the following. In this study, a geometric reduction of the temperature is used as the cooling schedule, i.e. the temperature is reduced by multiplication with a positive factor smaller than one.
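A standard Metropolis-style acceptance function with geometric cooling, as described above, could be plugged into the search loop sketched in section 4.2 as the accept argument; the parameter names and default values below are hypothetical.

    import math
    import random

    def make_sa_acceptor(t_start=1.0, alpha=0.95, rng=random):
        """SA acceptance with geometric cooling (T <- alpha * T, 0 < alpha < 1)."""
        def accept(cand_cost, cur_cost, state):
            temp = state.setdefault("temperature", t_start)
            delta = cand_cost - cur_cost
            # Worse solutions are accepted with a probability that falls with
            # decreasing temperature and increasing cost difference.
            ok = delta <= 0 or rng.random() < math.exp(-delta / temp)
            state["temperature"] = alpha * temp    # geometric temperature reduction
            return ok
        return accept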
