Online Resource Management in a Multiprocessor with a Network-on-Chip Orlando Moreira

High Tech Campus, 31/3.13 5600 JA Eindhoven, The Netherlands

Jacob Jan-David Mol

Dpt. of Computer Science Delft University of Technology PO Box 9000, 2600 GA Delft

[email protected] [email protected]

ABSTRACT

We propose an online resource allocation solution for multiprocessor systems-on-chip that execute several real-time, streaming media jobs simultaneously. The system consists of up to 24 processors connected by an Æthereal [7] Network-on-Chip (NoC) of 4 to 12 routers. A job is a set of processing tasks connected by FIFO channels. Each job can be independently started or stopped by the user. Each job is annotated with resource budgets per computation task and communication channel, computed at compile time. When a job is requested to start, resources that meet the required budgets have to be found. Because allocation is done online, it must use low-complexity algorithms. We do the allocation in two steps. First, tasks are assigned to virtual tiles (VTs), while trying to minimise the total number of VTs and the total bandwidth used. In the second step, these VTs are mapped to real tiles, and network bandwidth allocation and routing are performed simultaneously. We show with simulations that introducing randomisation in the processing order yields a significant improvement in the percentage of mapping successes. In combination, these techniques allow 95% of the processor resources to be allocated while handling a large number of job arrivals and departures.

General Terms Multiprocessors-on-Chip, Networks-on-Chip, Real-Time Systems

1. INTRODUCTION

Temporal behavior is critical in streaming media applications. In order to deliver high-quality output, these applications have tight throughput constraints. Many emerging streaming media applications consist of a set of different jobs (a job-set). Each job may be independently activated or stopped by an external source, such as the user. Because of this, many combinations of instances of these jobs may be executing at any time during device operation. We call these combinations job-mixes.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’07 March 11-15, 2007, Seoul, Korea Copyright 2007 ACM 1-59593-480-4 /07/0003 ...$5.00.

Marco Bekooij

High Tech Campus, 31 5600 JA Eindhoven, The Netherlands

[email protected]

As an example, consider a Digital Car Radio. In this application the user can request at any moment radio baseband processing, or decoding of one of several audio formats. Several streams can be present at a time, because independent sound output must be provided to the front and back seats, and different streams may be mixed (such as when listening to music while receiving a phone call). Moreover, further sound processing may be applied, such as equalization or echo cancellation. Each stream has independent input and/or output data rates. Each job activation is independent of the others and made upon user request. In other words, the number of use-cases of the system is potentially very high.

Multi-Processor Systems-on-Chip (MPSoCs) based on Networks-on-Chip (NoCs) are emerging as a preferred architecture for consumer media devices. Such architectures fit the requirements of media applications in terms of cost, power efficiency and flexibility. The NoC infrastructure also makes expansion of the architecture and inter-chip communication easier. In order to allow maximum flexibility at the lowest cost, jobs must share computation, storage and communication resources. This poses a difficult problem for real-time systems: as satisfaction of the temporal constraints depends on resource availability, resource sharing can make the temporal behavior of each job depend on the behavior of all other jobs in the system, which is difficult to predict if the job-mix is dynamic.

Our approach to this problem is to give jobs a degree of independence from the rest of the system by using a strong resource management policy. The resource management policy should ensure: a) admission control, meaning that a job is only allowed to start if the system can allocate upon request the resource budget it requires to meet its timing requirements; and b) guaranteed resource provisions, meaning that the access of a running job to its allocated resources cannot be denied by any other job.
Resource budgets are computed offline such that jobs meet their timing constraints. For hard real-time jobs, we perform deep temporal analysis to determine these budgets. For soft real-time jobs, a combination of temporal analysis and simulation may be used. The enforcement of resource budgets requires the hardware platform to provide predictable arbitration schemes, i.e. arbitration schemes that allow us to tightly bound the time that a request takes to be served. We designed a template for scalable multiprocessor architectures that fits these requirements. We use global resource allocation to implement admission control and local schedulers to guarantee resource provision. The Æthereal NoC fits well in this architecture template, as it allows for resource reservation by providing connections with guaranteed minimum throughput and maximum latency. In Section 2, we explain how this paper relates to other work.


In Section 2, we also explain the link with our previous work [17] in further detail. In Section 3, we describe the hardware, software and scheduling mechanisms. In Section 4, we show how resources can be allocated within the system, followed by Section 5, which describes heuristics to find a feasible allocation. The paper concludes with tests and their results in Section 6 and overall conclusions in Section 7.

2. RELATED WORK

An extensive survey of traditional techniques for scheduling and resource allocation in embedded multiprocessors can be found in [20]. It describes techniques ranging from fully static to fully dynamic scheduling. However, it does not consider the case in which jobs arrive and leave at run-time, as happens in our scenario. The same holds for techniques that compute task assignments at design-time, such as [18]. CPA [15] does consider that jobs may start and stop at any time. Each job-mix has its own schedule, which is calculated at compile time and stored in a look-up table. This approach is not without problems: the number of potential job combinations is exponential, and switching from one configuration to another could mean a non-trivial processor migration of running tasks. Moreover, it assumes that all tasks are known at compile time. The literature on task allocation of periodic tasks is extensive, and covers many combinations of constraints. Previous approaches either do not consider any network [1, 4, 6, 14, 21], consider only a bus topology [2, 17], require tasks to be preemptible [6, 21] or migratable between processors [6], or require a solution to be computed at compile time [15, 18]. Hansson et al. [9] consider the network, but do an off-line computation of the network slot tables, provided the assignment of tasks to processors is given. Our approach does not share these restrictions. We do task assignment using global system knowledge. Task-to-processor assignment is done at job start time. Task scheduling is done locally on the processor, and it can use any of a number of predictable scheduling mechanisms such as Round-Robin or TDMA. In [17], a resource allocator mechanism is proposed for MPSoCs in which a homogeneous set of processors is connected by a single bus with Time Division Multiple Access (TDMA) arbitration, and processors provide a predictable scheduling mechanism.
In that work, tasks are allocated to processors using variants of the First-Fit (FF) bin-packing algorithm. Algorithms are introduced that minimise the bandwidth usage over the bus. That hardware model can only support a few processors due to the use of a simple bus. In this paper, we extend the hardware model to incorporate more processors. Instead of a bus, the processors are now connected by an Æthereal [6, 7] network of 4 to 12 routers. By adding the network, the system changes in two important aspects. First, the resource allocator must find routes through the network and reserve time slots per link per router. Secondly, the network cannot be modelled as a single resource, as bottlenecks between two tiles are formed depending on which routes are allocated. Special care has to be taken in choosing which processors are used for which tasks, even though the set of processors itself is homogeneous.

3. SYSTEM

In this section we will describe some relevant aspects of the type of system we consider. We will describe the hardware and software models, as well as the scheduling mechanisms used in the processors.

3.1 Hardware

The hardware under consideration consists of a homogeneous set of processors, connected through a network of routers. The processors and the network are placed on a single chip, forming an MPSoC. In our work, we consider systems with at most 24 processors and 12 routers. Each processor resides on its own processing tile (PT), which consists of an ARM processor, local memory (MEM), a communication assist (CA) and a network interface (NI). A scheme of this template is shown in Figure 1. The CA offers a fixed number of time slices in which it arbitrates memory accesses between the ARM and NI. The NI offers a fixed number of input/output FIFO queues for accessing the channels allocated over the network.

Figure 1: A processing tile.

Our multiprocessor system is built on top of an Æthereal [6, 7] NoC. Æthereal can use contention-free routing, based on TDMA switching, and provides links with guaranteed throughput and latency. Every router is connected to other routers and/or to a number of NIs. Each PT has one single NI. Due to physical constraints, the routers can have at most 8 bidirectional connections. Every connection consists of a fixed number of links, and each link offers a fixed amount of bandwidth. Data are transmitted by the routers in three-word packets. All routers and tiles send and receive data packets synchronously, as if operating under a global clock. If data arrives on an input at tick $t$, it is sent through the router to the scheduled output link and arrives at the other end of that link at tick $t+1$. Per time slice, a packet from each input link is sent to an output link, according to a slot reservation table that specifies per time slice which input link is connected to which output link. Two processes on different PTs can communicate unidirectionally by establishing a channel. This requires a send and a receive buffer to be allocated in the data memories of their respective PTs, as well as time slices in the CA and queues in the NI on both sides. In the network, a communication channel must be created by finding a path across routers from source PT to destination PT. The routers do not buffer their input packets. Therefore, when a packet arrives at an input link of a router, it is sent at the end of the cycle to the output link specified for that time slot in the slot reservation table.
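The contention-free routing scheme above can be illustrated with a small sketch. This is our own illustration, not the Æthereal implementation: the slot-table layout and all names (`forward`, `slot_table`, the link labels) are assumptions made for the example.

```python
# Sketch of contention-free TDMA routing (our illustration, not Aethereal code).
# Each router has a slot table: slot_table[router][s] maps an input link to
# the output link reserved for it in slot s. A packet entering a router at
# slot s leaves on that output and reaches the next hop at slot (s + 1) % S.

S = 4  # number of slots in the TDMA wheel (assumed value)

def forward(slot_table, hops, start_slot):
    """Follow a packet along a list of (router, input_link) hops and
    return the slot at which it arrives at the destination NI."""
    slot = start_slot
    for router, inp in hops:
        out = slot_table[router][slot % S].get(inp)
        if out is None:
            raise ValueError("no reservation for this input at this slot")
        slot += 1  # each router hop adds a 1-slot delay
    return slot % S

# Two routers, each with a per-slot {input: output} reservation:
slot_table = {
    "R0": [{"ni0": "toR1"}, {}, {}, {}],
    "R1": [{}, {"fromR0": "ni3"}, {}, {}],
}
# A packet injected at slot 0 traverses R0 then R1 and arrives at slot 2.
arrival = forward(slot_table, [("R0", "ni0"), ("R1", "fromR0")], 0)
```

Note how the reservation in R1 must sit exactly one slot after the one in R0; this is the coupling that the slot-allocation step of the resource allocator has to respect.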

3.2 Application

The application is partitioned into a number of stream-processing jobs. An unpredictable source, such as a user, can ask to start or stop a job instance at any moment. Because the user can typically request more job instances than the hardware can run simultaneously, there is a resource allocation problem. If the system can find sufficient resources, the job is started. Otherwise, the user is informed of the lack of resources, and the system either refuses to start the job, or offers to start it with reduced resource requirements,
resulting in a reduced quality of service (if the job supports some form of graceful degradation).

The jobs are modelled as Multirate Data Flow (MRDF) graphs [3, 13, 20]. An MRDF graph is a directed graph $G = (V, E)$. The nodes represent computation tasks and are commonly referred to as actors, and the edges represent FIFO communication channels. Each edge is annotated with the number of packets, or tokens, that are produced or consumed when one of its endpoint actors is executed, as well as with the maximum size of these tokens. The model allows initial tokens to be placed on the channels, representing an initial state. An actor in an MRDF graph can execute (or fire) if the annotated numbers of tokens are present in the input channels and if there is enough space in the output channels to store the annotated numbers of tokens. Note that these conditions can allow several firings of the same actor to be active simultaneously.

The graph representation is used to analyse the temporal behavior of jobs that have to guarantee a certain end-to-end throughput. Given a minimum throughput requirement, each iteration of a task in the MRDF graph must execute within a limited time interval after its input tokens are available. We call this interval of time the relative deadline of the task. Through analysis techniques [3, 17, 8, 19], it is possible, for a given MRDF graph annotated with worst-case execution times for all actors and expanded with channel models [19], and for a given required minimum throughput, to choose feasible relative deadlines for the tasks and to compute minimal buffer sizes [8]. Budget determination falls outside the scope of this paper. Briefly, relative deadlines and channel characteristics are chosen by computing per actor a time interval in which it must finish its execution per iteration, such that the most critical cycle in the graph still allows for the desired throughput. The budget of a task in terms of computation resources is represented as the number of cycles per period of the TDMA wheel it needs to meet its relative deadline. Besides this, a task also consumes tile memory for both program and data.
Channels require bandwidth and latency, as well as buffering space.
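The MRDF firing rule described above can be made concrete with a small sketch. The data layout and names below are ours, chosen for illustration; they are not taken from the paper.

```python
# Illustrative sketch of the MRDF firing rule (data layout is ours, not the
# paper's). Each channel dict records its endpoints, production/consumption
# rates, current token count and buffer capacity.

def can_fire(actor, channels):
    """True iff every input channel holds enough tokens and every output
    channel has room for the produced tokens."""
    for ch in channels:
        if ch["dst"] == actor and ch["tokens"] < ch["rate_dst"]:
            return False  # not enough input tokens
        if ch["src"] == actor and ch["tokens"] + ch["rate_src"] > ch["capacity"]:
            return False  # not enough output buffer space
    return True

def fire(actor, channels):
    """Consume and produce tokens for one firing of `actor`."""
    assert can_fire(actor, channels)
    for ch in channels:
        if ch["dst"] == actor:
            ch["tokens"] -= ch["rate_dst"]
        if ch["src"] == actor:
            ch["tokens"] += ch["rate_src"]

# a produces 2 tokens per firing; b consumes 3; the FIFO holds at most 4.
chs = [{"src": "a", "dst": "b", "rate_src": 2, "rate_dst": 3,
        "tokens": 1, "capacity": 4}]
fire("a", chs)   # 1 + 2 = 3 tokens buffered
```

After the firing, `b` is enabled (3 tokens available) while `a` is blocked by back-pressure, since another firing would overflow the 4-token buffer.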

3.3 Scheduling

Each processor runs a TDMA scheduler, and has a (possibly empty) set of tasks assigned to it. The computation resource budget of each task is given as a number of computation cycles that must be granted to the task for every turn of the TDMA wheel. Other local scheduling mechanisms can be used, as long as they provide a tight upper bound on the availability of resources for each task.
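Under such a TDMA scheduler, the per-processor admission test reduces to a budget sum. A minimal sketch, with an assumed wheel period:

```python
# Sketch of per-processor TDMA budget accounting (our illustration).
WHEEL = 100_000  # cycles per TDMA wheel turn (assumed value)

def can_admit(budgets, new_budget, wheel=WHEEL):
    """A new task fits on this processor iff the already allocated cycle
    budgets plus its own do not exceed the wheel period."""
    return sum(budgets) + new_budget <= wheel

budgets = [40_000, 30_000]
ok_small = can_admit(budgets, 25_000)  # 95% of the wheel: fits
ok_big = can_admit(budgets, 35_000)    # would exceed the period
```

Because each task is guaranteed its slice on every wheel turn, this sum is also a tight bound on resource availability, which is exactly the property the local scheduler must provide.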

4. RESOURCE ALLOCATION

If a request to start a job arrives, sufficient resources need to be found and allocated for its actors and channels. The actors are to be allocated to PTs, and routes need to be allocated in the network for the channels between actors on different PTs. The problem is to find such an allocation, while trying to keep as many free resources as possible for jobs that may be started later. Also, it is a concern that resources may start fragmenting after many job starts and stops. We will now describe the constraints placed on the allocation of a job.

4.1 PT Resources

A PT provides computation, memory and communication resources. This can be represented as a resource provision vector. If $p$ is a PT, then the resources that $p$ offers to the components of a job graph can be represented by means of a vector

$P(p) = (cpu(p),\ mem(p),\ ni(p),\ ca(p),\ bw_{in}(p),\ bw_{out}(p))$

where $cpu(p)$ is the number of CPU cycles per TDMA period of the processor; $mem(p)$ is the amount of memory the PT offers; $ni(p)$ is the number of NI queues available, from which one needs to be allocated for each FIFO channel entering or leaving this PT; $ca(p)$ is the amount of CA bandwidth available for network connections; and $bw_{in}(p)$ and $bw_{out}(p)$ are the upper bounds on the amount of, respectively, incoming and outgoing bandwidth.

Actors and channels are resource-consuming entities. An actor represents computation, and thus requires CPU cycles as well as memory space, for both state and temporary variables (which we consider together). For a task graph $J = (V, E)$, where $V$ is the set of tasks and $E$ is the set of channels, the consumption of an actor $a$ can be modeled as a vector

$U(a) = (cpu(a),\ mem(a),\ 0,\ 0,\ 0,\ 0)$

where $cpu(a)$ and $mem(a)$ are, respectively, the CPU cycles per TDMA period and the amount of memory that the task needs.

Channel requirements depend on the mapping of the channel's source and sink actors. If both actors are mapped to the same PT, the channel is implemented by a FIFO buffer in the memory of that PT and we say that the channel is internally mapped. On the other hand, if source and sink are mapped to different PTs, bandwidth in the NoC has to be allocated, as well as memory at both extremities, outgoing/incoming bandwidth, NI queues and CA bandwidth. We say that such a channel is externally mapped. The resource consumption of an internally mapped channel $e$ is

$U_{int}(e) = (0,\ mem(e),\ 0,\ 0,\ 0,\ 0)$

while an externally mapped channel consumes resources at both the source and the sink PT, represented respectively by

$U_{src}(e) = (0,\ mem_s(e),\ 1,\ ca_s(e),\ 0,\ bw(e))$
$U_{snk}(e) = (0,\ mem_k(e),\ 1,\ ca_k(e),\ bw(e),\ 0)$

where $bw(e)$ is the bandwidth required by the channel, $mem_s(e)$ and $mem_k(e)$ are, respectively, the memory required to store the source and sink endpoints, and $ca_s(e)$ and $ca_k(e)$ represent the CA bandwidth required.

These four resource requirement vectors can be reduced to two by the following transformation. Let $\hat{U}(a)$ be the amount of resources needed to host actor $a$ if no other actors are mapped to that PT, i.e.,

$\hat{U}(a) = U(a) + \sum_{e \in E:\, src(e) = a} U_{src}(e) + \sum_{e \in E:\, dst(e) = a} U_{snk}(e)$

The resources that are saved when both endpoints of a channel are mapped to the same PT are modelled by

$S(e) = U_{src}(e) + U_{snk}(e) - U_{int}(e)$

Then, a set of actors $A$ can be allocated on PT $p$ iff:

$\sum_{a \in A} \hat{U}(a) \ - \sum_{e \in E:\, src(e) \in A \wedge dst(e) \in A} S(e) \ \le\ P(p)$    (1)
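The allocation test of Eq. (1) is component-wise vector arithmetic, so it can be sketched in a few lines. The component order and all numeric values below are our own reconstruction for illustration.

```python
# Sketch of the allocation test of Eq. (1). The vector component order
# (cpu, mem, ni, ca, bw_in, bw_out) follows our reconstruction above; the
# numbers are made up for illustration.

def vadd(u, v): return tuple(x + y for x, y in zip(u, v))
def vsub(u, v): return tuple(x - y for x, y in zip(u, v))
def vleq(u, v): return all(x <= y for x, y in zip(u, v))

def u_hat(actor, U, edges, U_src, U_snk):
    """Resources the actor needs if no other actor shares its PT."""
    total = U[actor]
    for e in edges:
        if e["src"] == actor:
            total = vadd(total, U_src[e["id"]])
        if e["dst"] == actor:
            total = vadd(total, U_snk[e["id"]])
    return total

def fits(A, P, U, edges, U_src, U_snk, U_int):
    """Eq. (1): the sum of u_hat over A, minus the savings S(e) of every
    channel with both endpoints in A, must not exceed the provision P."""
    need = (0,) * 6
    for a in A:
        need = vadd(need, u_hat(a, U, edges, U_src, U_snk))
    for e in edges:
        if e["src"] in A and e["dst"] in A:
            s = vsub(vadd(U_src[e["id"]], U_snk[e["id"]]), U_int[e["id"]])
            need = vsub(need, s)  # channel becomes internally mapped
    return vleq(need, P)

U = {"x": (10, 5, 0, 0, 0, 0), "y": (20, 5, 0, 0, 0, 0)}
edges = [{"id": 0, "src": "x", "dst": "y"}]
U_src = {0: (0, 2, 1, 1, 0, 3)}
U_snk = {0: (0, 2, 1, 1, 3, 0)}
U_int = {0: (0, 3, 0, 0, 0, 0)}
P = (32, 16, 4, 4, 8, 8)
ok = fits(["x", "y"], P, U, edges, U_src, U_snk, U_int)
```

Mapping both endpoints to the same PT frees the NI queues and bandwidth of the channel, which is exactly the saving term subtracted above.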

4.2 Network Resources The route through which the packets of a FIFO channel flow through the network should be allocated, and is fixed for the lifetime of the channel it supports. Such a route through the network consists of a path through the routers used and slot allocations for the slot reservation tables of each one of the routers. In previous work [9] path finding and slot reservation in NoC resource allocation are decoupled: first a path is found in a directed graph model of the network where nodes are routers and edges are links between routers, and given such a path, an attempt is made to do the slot allocation. If the slot allocation fails, then a new path is search for.


In our work we use a network graph model that includes the time slots. Each router is represented by $S$ nodes, where $S$ is the number of slots in the time wheel of the router. Each node therefore represents a specific time slot of a given router, and it has incoming and outgoing edges to every network interface and router to which its router is connected. If an edge is between two routers, it connects slot vertex $s$ of the source to slot vertex $(s+1) \bmod S$ of the sink, which reflects the 1-slot delay introduced by each router. If the edge is between a router and a tile, each such edge connects slot vertex $s$ of the router to the single vertex representing the tile. When the source and sink of an MRDF channel are mapped to different tiles, the channel requires a route through the network to be allocated between these two tiles. In the network graph model, this corresponds to finding a set of paths between the two NIs of those tiles, which together provide enough bandwidth to support the channel. In this way, finding a path through the network and a slot allocation become a single problem. The trade-off is, of course, graph size: if the original network has $R$ routers and $L$ links, the corresponding graph in the traditional model has $R$ nodes and $L$ edges, whereas our network graph model has $S \cdot R$ nodes and $S \cdot L$ edges. For a whole job, more than one route will typically be allocated through the network. Because in every time slice a router can route to each output link only a single input link, these paths are not allowed to collide. Finding several non-colliding paths in the network graph model amounts to solving the Directed Edge-Disjoint Paths problem, which is NP-complete even in many restricted cases [12]. Instead of solving it directly by considering all paths at once, we approximate a solution by finding routes one at a time using shortest-path search.
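The slot-expanded graph can be built mechanically from the topology. The construction below follows the description in the text; the function and node names are ours.

```python
# Sketch of the slot-expanded network graph (following the construction in
# the text; names are ours). Each router r becomes S nodes (r, s); a
# router-to-router link r1->r2 yields edges ((r1, s), (r2, (s + 1) % S));
# tiles stay single nodes connected to every slot node of their router.

def expand(router_links, tile_links, S):
    """Return the adjacency of the slot-expanded graph as {node: set}."""
    g = {}
    def edge(u, v):
        g.setdefault(u, set()).add(v)
        g.setdefault(v, set())
    for (r1, r2) in router_links:            # 1-slot delay per router hop
        for s in range(S):
            edge((r1, s), (r2, (s + 1) % S))
    for (t, r) in tile_links:                # tile <-> its router, both ways
        for s in range(S):
            edge(t, (r, s))
            edge((r, s), t)
    return g

g = expand([("R0", "R1"), ("R1", "R0")],
           [("T0", "R0"), ("T1", "R1")], S=4)
router_nodes = [n for n in g if isinstance(n, tuple)]
```

A path from `T0` to `T1` in this graph fixes both the physical route and the slot in which each router forwards the packet, which is why path finding and slot allocation collapse into one search.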
In theory, this greedy algorithm is guaranteed to allocate at least a $1/O(\sqrt{|L|})$ fraction of $c^*$, where $c^*$ is the number of channels allocated in an optimal solution [11]. In practice, however, it usually performs much better.
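The greedy one-route-at-a-time scheme can be sketched as follows. This is a simplified illustration of the idea (BFS on an unweighted adjacency dict, one path per demand); the names are ours, and a real allocator would route $bw(e)$ paths per channel.

```python
# Sketch of greedy edge-disjoint routing: route each channel over a
# shortest path (BFS), then remove the used edges so later routes cannot
# collide with it.

from collections import deque

def shortest_path(g, src, dst):
    """BFS shortest path in adjacency dict g, or None if unreachable."""
    prev, q = {src: None}, deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in g.get(u, ()):
            if v not in prev:
                prev[v] = u
                q.append(v)
    return None

def route_all(g, demands):
    """Greedily route (src, dst) demands over edge-disjoint paths."""
    routed = []
    for src, dst in demands:
        p = shortest_path(g, src, dst)
        if p is None:
            return None  # this ordering fails; the caller may retry another
        for u, v in zip(p, p[1:]):
            g[u].discard(v)  # claim the edge for this channel
        routed.append(p)
    return routed

g = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": set()}
routes = route_all(g, [("A", "D"), ("A", "D")])  # two disjoint routes exist
```

Because each routed channel consumes its edges, the order in which demands are processed matters, which is one reason randomising the processing order (Section 5.2) helps.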

5. MAPPING JOBS

The complete problem of mapping a single job can be formulated in this way:

PROBLEM 1. Single Job Resource Allocation: Given a job digraph $J = (V, E)$ and a network topology digraph $N = (T \cup R, L)$ of processing tiles $T$, router nodes $R$ and network links $L$. Also given a weight $b(e) > 0$ for each edge $e$ in $E$, representing the amount of bandwidth capacity required by channel $e$. The topology $N$ is restricted by the fact that each vertex in $T$ is connected to exactly one router in $R$. Besides that, all elements $t$ of $T$ have a valuation $P(t)$, the resource provision of $t$; all elements $a$ of $V$ have a valuation $\hat{U}(a)$, the resource usage of $a$; and all elements $e$ of $E$ have a valuation $S(e)$, the resource savings for internal mapping of $e$. Does there exist an injective mapping $M$ of actors to tiles $T$ such that there exist edge-disjoint paths through the network (through $R$ nodes and $L$ links) to accommodate the edges between the actors in the mapping, knowing that each edge $e$ requires $b(e)$ paths from tile $M(src(e))$ to tile $M(dst(e))$, where $src(e)$ is the source actor of $e$ and $dst(e)$ its destination actor, while guaranteeing for every tile $t$ that

$\sum_{a \in A_t} \hat{U}(a) \ - \sum_{e \in E:\, src(e) \in A_t \wedge dst(e) \in A_t} S(e) \ \le\ P(t)$    (2)

where $A_t = \{a \in V \mid M(a) = t\}$ is the set of actors mapped to tile $t$?

This problem can be proven to be NP-complete [16]. We must also take into consideration that the solution cannot be algorithmically complex, because the resource manager must run online. The problem resembles the Vector Bin-Packing (VBP) problem: view the PTs as bins and the actors as items; the resources provided by the PTs and the resources required by the actors become the bin capacities and item sizes, respectively. However, there are non-trivial differences between our problem and VBP. In VBP, the items to be packed have a constant size, while in our resource allocation problem the resources required by the endpoints of a channel depend on whether or not they are mapped to the same PT. This could be disregarded, but that would over-dimension the problem. Also, VBP takes into account neither bandwidth usage nor the available routes in the network. In [17] it has been shown that, under these circumstances, the low-complexity First-Fit (FF) and First-Fit Decreasing (FFD) algorithms perform well regarding the number of PTs needed, and that bandwidth usage can be optimised by clustering the most heavily communicating actors. To extend these results to a system with a network, we use a two-step approach. An FF-based algorithm is used to map the actors onto virtual tiles (VTs), which are assumed to be connected through a bus. Virtual tiles provide the same amount of resources as a PT would. Then, each of these VTs is mapped to a real PT, and routes through the network are allocated using shortest-path search. In the remainder of this section we describe the techniques that we combine in our proposed solution. First, we discuss the different techniques employed for clustering actors. Next, we argue why it is advantageous to shuffle the input of the FF algorithm. In Section 5.3, the different placement algorithms for the virtual tiles are discussed. Finally, we show how to map the VTs onto the PTs in Section 5.4.

5.1 Clustering Strategies

Clustering can be applied before and during packing [17]. If applied before, a set of actors is replaced by a single larger actor, thus forcing the set to be placed on the same PT. We use a greedy heuristic, which orders the channels by decreasing bandwidth requirements, and replaces each pair of endpoints by a single large actor if the resulting actor fits on an empty PT. This is repeated until a predefined percentage of the number of channels in the job is contracted, or until no more channels can be contracted. This can lead to an increased number of required PTs if the clustering is done too aggressively. If clustering is applied during packing, it works together with the packing heuristic: when the packing heuristic packs an actor to a PT, it then tries to pack adjacent actors to the same PT before trying to pack other actors. The order in which the adjacent actors are considered is the same as their order in the input. This modified First-Fit packing algorithm is called First-Fit with Clustering (FFC). FFC has been shown in [17] to perform significantly better than other FF variants for virtual tile placement.

5.2 Shuffled Input

Both FFC and the pre-clustering strategy behave in a deterministic manner and without backtracking. This gives these algorithms a single shot at finding an acceptable solution. Incorporating backtracking is non-trivial: the problem to be solved is NP-complete, so a solution that uses full backtracking can take an exponential amount of time. Instead of backtracking, we use randomisation to generate multiple distinct solution candidates. The First-Fit algorithm considers and places the actors one at a time, which allows a randomisation of the input to yield different mappings. An unsuccessful mapping of the actors onto a set of bins using one ordering of the actors can thus occasionally be corrected by considering a different ordering of the actors. If the First-Fit algorithm clusters during packing, it still just starts to map the 'next' actor when it has filled a PT, which keeps it sensitive to the original ordering of the items.
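The shuffled retry loop around First-Fit can be sketched as follows. This is a one-dimensional simplification (scalar sizes instead of resource vectors, no clustering); all names and numbers are our own.

```python
# Sketch of First-Fit with shuffled retries (our illustration): each attempt
# packs the items in a different random order; several attempts give several
# distinct solution candidates without backtracking.

import random

def first_fit(sizes, capacity):
    """Pack sizes into bins of `capacity`; each item goes into the first
    bin it fits in. Returns the bins, or None if an item cannot fit at all."""
    bins = []
    for s in sizes:
        for b in bins:
            if sum(b) + s <= capacity:
                b.append(s)
                break
        else:
            if s > capacity:
                return None
            bins.append([s])
    return bins

def shuffled_first_fit(sizes, capacity, max_bins, attempts=20, seed=0):
    rng = random.Random(seed)
    order = list(sizes)
    for _ in range(attempts):
        bins = first_fit(order, capacity)
        if bins is not None and len(bins) <= max_bins:
            return bins
        rng.shuffle(order)  # try a different ordering next time
    return None

items = [6, 6, 6, 4, 4, 4]
bins = shuffled_first_fit(items, capacity=10, max_bins=3)
```

In the full allocator the same loop wraps FFC and the VT-to-PT placement: a failed attempt simply triggers another shuffled ordering.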

5.3 Virtual Tile Placement

The First-Fit packing algorithm packs actors into the PTs, but considers all empty PTs to be equal and uses them in the order they are presented. However, due to the presence of the network and the bottlenecks that can be formed between two PTs, it is not efficient to use just any subset of the empty PTs. It is favourable to map both endpoints of a channel as close to each other as possible. If they do not fit on the same PT, they should preferably be placed on PTs close to each other, for instance on two PTs connected to the same router. We accomplish this by employing a two-step system using virtual tiles (VTs). The packing algorithm maps actors to VTs, as if the VTs were interconnected through a bus. Then, in the second step, each VT is assigned to a real PT, and routes through the network are allocated. We compare three methods to do this assignment:

On-line placement. As soon as the packing algorithm needs a new, empty tile, a PT is selected. Because it is unknown which actors are going to be assigned to this new PT, there is little information on which PT to select. As a heuristic, the PT closest to those already used is chosen, using the sum of distances to the already used PTs as the distance metric.

Semi on-line placement. The packing algorithm fills the VTs one by one. Once filled, the content of a VT does not change, so the VT can be mapped to a PT immediately. This allows a full VT $v$ to be placed near those PTs which contain actors connected to those in $v$. The number of links required in the network to allocate the channels to $v$ is used as the distance metric.

Off-line placement. All actors are packed into a set of VTs first. This set of VTs is then mapped to the PTs, minimising the number of links used.

The described placement methods assume an empty system. If the system is not empty, it is preferable to fill up partially filled PTs first to avoid fragmentation of free space. To cope with a non-empty system, the placement methods let the packing algorithm fill up the partially used tiles first. The locations of, and the space limitations for, the VTs representing these tiles are thus known beforehand.

5.4 Bisection According to Kernighan-Lin

The off-line placement algorithm maps the VTs to PTs such that the number of links used is minimised. As will be shown, this problem contains NP-complete sub-problems, so a heuristic is needed. For our purposes, the network typically consists of 4 to 12 routers, each of which is attached to several PTs. The heuristic we use tries to map the VTs to PTs such that most of the required links are mapped over a single router. To accomplish this, the placement algorithm selects a router and counts the number of empty PTs around it. The algorithm then selects the same number of VTs, minimising the bandwidth required between the selected and unselected VTs. These selected VTs are then mapped to the empty PTs. Once such a cut is found and the selected VTs are mapped to PTs, a new router is selected. For example, take Figure 2, which shows a router surrounded by empty PTs as well as a set of unmapped VTs. Because the router is connected to three PTs, three VTs are cut off; the cut is made such that the number of channels to the rest is minimised. These three VTs are then mapped on the PTs around the router in any order. This process is repeated until the placement algorithm runs out of VTs (or of empty PTs, in which case the algorithm fails to find a valid placement). The order in which the routers are selected is predefined; we acknowledge this as a point of future research.

Figure 2: A router connected to three PTs (left) and four VTs connected by channels (right).

The off-line placement algorithm is thus required to find a subset of VTs of fixed size, with a minimal cut to the rest of the VTs. This problem contains the Minimal Bisection problem, which asks to divide a graph into two equal pieces with a minimal cut and which is NP-complete [5]. In our case, two partitions of unequal size are needed, say of sizes $k$ and $n-k$ for a graph of $n$ vertices. To accommodate this, we add a clique of $n-2k$ vertices to the graph. The vertices of this clique are not connected to any vertex outside the clique, and the edges of the clique have infinite weight. When Minimal Bisection is applied, this clique ends up entirely in one of the partitions; any other solution yields a cut of infinite weight. The rest of the vertices thus form a $k$ versus $n-k$ split.

A well-known heuristic for Minimal Bisection is the algorithm designed by Kernighan and Lin (KL) [10]. It starts with any bisection and iteratively improves on it as follows:

1. Create any bisection of $V$ into $V_1$ and $V_2$ with $|V_1| = |V_2|$.

2. Set all vertices to 'unlocked'.

3. Let $P$ be a list of pairs of vertices, and let $G$ be a list of integers. Set $P$ and $G$ equal to the empty list.

4. For every pair of unlocked vertices $a \in V_1$, $b \in V_2$, calculate $g_{ab}$, the reduction of the weight of the edges in the cut were $a$ and $b$ to be swapped.

5. Find a pair $a \in V_1$, $b \in V_2$ for which $g_{ab}$ is maximal. Append $(a, b)$ to $P$ and $g_{ab}$ to $G$. Swap and lock $a$ and $b$. The value $\sum_{i=1}^{|G|} G_i$ now denotes the decrease in cut size so far.

6. Repeat the previous two steps until all vertices are locked.

7. Find $j = \arg\max_m \sum_{i=1}^{m} G_i$, which represents the moment at which the cut size was lowest.

8. Swap back all pairs in $P$ after index $j$.

9. Repeat steps 2 to 8 while there is an improvement (i.e. while $\sum_{i=1}^{j} G_i > 0$).

A further modification of the KL algorithm is needed to account for the fact that we cannot swap VTs that are already mapped to PTs. Simply ignoring these tiles would mean ignoring their channels to VTs that still need to be mapped. Instead, they are taken into account by locking the already mapped VTs in step 2.

6. EXPERIMENTS AND RESULTS

In this section we present a set of experiments to evaluate the performance of several combinations of the presented algorithms, along three different angles of variation. First, we compare the performance of the simplest variants on several topologies. Secondly, we measure the difference in performance between the on-line, semi on-line and off-line VT placement algorithms. Thirdly, we evaluate the gain in performance when the ordering of the input actors is shuffled in order to obtain several solution candidates. Finally, the best of these alternative algorithms is tested on a system with a high load.

We tested the topologies shown in Table 1; these are compared first. In a ring topology, routers are connected to their neighboring routers in a one-dimensional grid. In a mesh topology, routers are connected to their north, south, east and west neighboring routers in a two-dimensional grid. The rest of the tests focus on the ring-4 topology.

Topology   Size    Links/router  Tiles  Routers  Links
ring-3     3       10            24     3        54
ring-4     4       8             24     4        56
ring-6     6       6             24     6        60
ring-12    12      4             24     12       72
mesh-6     2 by 3  6             22     6        58

Table 1: The topologies considered.

As a job set, we used synthetically generated jobs with properties similar to those of real applications, such as radio baseband processing, audio decoding and audio post-processing. These graphs have at most 100 actors, and contain a number of channels equal to the number of actors. We divided them into four test sets depending on actor and channel bandwidth requirements. For lack of space, we only show the results for coarse-grained actors (where each actor requires about 4–14% of the resources available per PT) and heavy edges (requiring 2–7% of the PT resources and of the bandwidth between routers). The results for the other classes are similar or better; the complete results can be found in [16]. All tests are performed for varying rates of clustering, from contracting 0% to 60% of the channels. Trying to cluster more than 60% of the channels failed in all cases.
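Clustering contracts channels of the job graph, merging the two endpoint actors of a contracted channel into one larger actor whose budget is the sum of the two. A rough illustrative sketch follows; the data representation is hypothetical, and the selection criterion (heaviest channels first) is an assumption that may differ from the clustering heuristic actually used:

```python
def contract_channels(actors, channels, fraction):
    # actors: {name: resource demand}; channels: [(src, dst, bandwidth)]
    parent = {a: a for a in actors}

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Contract the requested fraction of channels, heaviest first (assumption).
    for src, dst, _bw in sorted(channels, key=lambda c: -c[2])[
            :int(fraction * len(channels))]:
        ra, rb = find(src), find(dst)
        if ra != rb:
            parent[rb] = ra           # merge the two endpoint clusters

    # Each merged actor needs the summed budget of its cluster.
    merged = {}
    for a, demand in actors.items():
        merged[find(a)] = merged.get(find(a), 0) + demand
    # Channels between different clusters survive; internal ones disappear.
    remaining = [(find(s), find(d), bw) for s, d, bw in channels
                 if find(s) != find(d)]
    return merged, remaining
```

Contracting a channel removes its (router) bandwidth demand entirely, which is why clustering reduces network load, at the price of larger actors that are harder to fit on a single PT.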

6.1 Effect of Topology

For each of the topologies of Table 1, the FFC algorithm was used to try to map a job instance of 100 actors onto an empty system, using the on-line virtual placement method. This was tried for 100 instances, each mapped onto an empty system. The percentage of jobs that could successfully be mapped is plotted against the clustering percentage in Figure 3. We observe several facts from this figure. First, clustering is necessary to be able to map any of the tested jobs onto any of the tested topologies. Secondly, the larger rings (ring-6 and ring-12) give bad performance (50% success rate or lower), while the performance for the other topologies is similar (up to 70% success rate). This could be due to the fact that in ring-3, ring-4 and mesh-6 the average distance between two PTs is low; hence, less of the congestion that would hinder a successful mapping is likely to occur. Which topology is optimal depends heavily on the cost function. In our case, routers contain an all-to-all interconnection grid, and thus have a cost quadratic in the number of links they are connected to. Each link and router also bears costs, further complicating the judgement. For the rest of this paper, we chose the ring-4 topology as a trade-off between the number of routers and the number of links of each router, balancing the costs.

Figure 3: Successes per topology.

6.2 Virtual Tile Placement

The best performance observed across all topologies was only a 70% success rate for finding a mapping for a large job on an empty system. This percentage increases if a different form of VT placement is used. In Figure 4, the success rate is shown for the same job set on a ring-4 topology, using the different VT placement methods. The difference between the on-line and semi on-line placement methods is marginal. This can be explained by the little knowledge available to, and used by, both algorithms when asked to place a VT. The on-line algorithm has no knowledge of the contents of the tile, and thus has little information on which to base optimisations other than choosing a location close to the tiles already mapped. The semi on-line algorithm knows which actors are mapped to the VT it has to place, but does not know the location of all the actors which have channels to it. As can be seen in Figure 4, this extra knowledge does not increase the performance significantly. The off-line VT placement algorithm, however, does prove to be a substantial improvement over the on-line one. The success percentage is raised across all clustering percentages, reaching 90%. Even though the performance is better, clustering is still required to reach the higher chances of success. The difference between the placement algorithms is visible not only in the percentage of successful mappings, but also in the number of links used by successfully mapped jobs. The percentage of links used in the network, relative to the number of links the basic on-line placement algorithm uses, is shown in Figure 5. The spikes at low clustering percentages occur because there are relatively few successful mappings over which the average is taken. Again, the on-line and semi on-line algorithms differ little in link usage.
The off-line algorithm attains a 15% reduction in the number of links used, which is substantial considering that many of the optimised routes are reduced from length 3 (PT–router–router–PT) to length 2 (PT–router–PT), a gain of only 33% per route. This translates directly into savings in power consumption and into more free resources left for other jobs.

6.3 Using Shuffled Input

Another way to increase the chances of success is to try to map several random permutations of the ordering of the actors. This results in a favourable increase, as shown in Figure 6. The figure shows the percentage of successes when 1 to 4 random permutations are tried; with 4 permutations, the success rate rises to over 95%.
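The shuffled-input strategy amounts to a simple retry loop around the allocator. A minimal sketch, where `try_map` is a stand-in for the (hypothetical) mapping routine that returns a mapping or `None` on failure:

```python
import random

def map_with_retries(job_actors, try_map, attempts=4, seed=None):
    # Re-attempt the mapping with random permutations of the actor order.
    rng = random.Random(seed)
    order = list(job_actors)
    for attempt in range(attempts):
        if attempt > 0:              # first attempt uses the original order
            rng.shuffle(order)
        mapping = try_map(order)
        if mapping is not None:
            return mapping
    return None                      # all permutations failed
```

Because the allocator is a low-complexity greedy heuristic, trying a handful of permutations multiplies its cost only by a small constant while markedly raising the success rate.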

Figure 4: Success vs placement algorithm.

Figure 5: Links used vs clustering.

Figure 6: Success vs number of attempts.

6.4 Stress Test

The previous tests all assumed an initially empty system. However, our solution should also behave well under heavy load, i.e., when there are barely enough resources available in the system at the moment a request to start a job arrives. We tested the performance under heavy load by taking the ring-4 topology and a large set of jobs, each consisting of 10 actors. For each job $j$, the vector $r_j$ summing the processor cycles and memory requirements of its actors and channels is calculated. Let $a$ be the vector of processor cycles and memory requirements still available in the system; $a$ is updated whenever a job is started or stopped. The system starts without any jobs running. A job $j$ which is not already running is selected at random, and is requested to start iff $r_j \le (1 - \sigma_{slack}) \cdot a$ (component-wise), where $\sigma_{slack} \ge 0$ is an amount of slack. If the inequality does not hold, running jobs are stopped at random until the inequality does hold, after which the job is requested to start. The resource allocator receives this job start request and tries to map the job onto the unused resources. Its success or failure to do so is noted, and the next job is randomly selected. The allocator tries four times in total, and employs the off-line VT placement algorithm. This is repeated for 10,000 iterations for several values of $\sigma_{slack}$, with the percentage of successful mappings as the metric.

The result of this test is shown for several clustering percentages in Figure 7. At $\sigma_{slack} = 0$, there is no slack, implying that the resource allocator has to be capable of starting a job no matter how much the available resources are scattered. This is not always possible, so we cannot expect the resource allocator to always succeed. As more slack is added, the resource allocator has more redundant free resources it can use, increasing the success rate dramatically. With only 5% slack, all requested jobs could be started if no clustering is used.

Another observation in Figure 7 is that clustering actually has a negative impact on performance. The reason for this is that if clustering is applied, larger actors are created, which require larger portions of free space on a PT. This requires the available resources to be concentrated on fewer tiles, which is less likely to happen. Because clustering reduces bandwidth usage (and thus power usage) and increases the feasibility of mapping bigger jobs, we did not want to discard it. Instead, we devised a fallback mechanism, in which the job is first clustered and an attempt is made to map it. If this fails, we retry to map the job without pre-clustering. Any remaining attempts with shuffled input are done with the unclustered version. This allows us to profit from the positive effects of clustering when possible, while also profiting from the higher success rate of trying to map fine-grained, unclustered versions of the jobs. The results are shown in Figure 8.

Figure 7: Success vs slack.

Figure 8: Success with unclustering.

Note that the slack factor introduced only stretches the computing cycles and memory requirements; it is assumed that bandwidth is not a scarce resource. If it is, clustering does help to reduce the bandwidth usage, but cannot prevent the allocator from starting to fail. The results shown in this paper are for large actors and heavy edges; the results for lighter versions of either or both can be found in [16]. All algorithms perform better when the actors or edges are smaller: the finer the granularity of the actors, the easier it is to find allocations for them on the PTs.
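The stress-test protocol and the clustered-then-unclustered fallback can be sketched as follows. The function names and `try_map*` callbacks are assumptions, and the resource model is scalarised for brevity (the experiment above uses a vector of cycles and memory, compared component-wise):

```python
import random

def stress_test(jobs, capacity, try_map, slack, iters=10000, seed=0):
    # jobs: {name: resource demand}; try_map(job) -> bool (allocator attempt).
    # Pick a random non-running job, stop random jobs until it fits within
    # (1 - slack) of the free resources, then ask the allocator to start it.
    rng = random.Random(seed)
    running, successes = set(), 0
    for _ in range(iters):
        job = rng.choice([j for j in jobs if j not in running])
        free = capacity - sum(jobs[j] for j in running)
        while jobs[job] > (1 - slack) * free:     # stop running jobs at random
            victim = rng.choice(sorted(running))
            running.discard(victim)
            free += jobs[victim]
        if try_map(job):                          # allocator attempt(s)
            running.add(job)
            successes += 1
    return 100.0 * successes / iters              # success percentage

def try_map_with_fallback(job, try_map_clustered, try_map_unclustered,
                          attempts=4):
    # Fallback mechanism: one clustered attempt first; all remaining
    # (shuffled) attempts use the unclustered version of the job.
    if try_map_clustered(job):
        return True
    return any(try_map_unclustered(job) for _ in range(attempts - 1))
```

The sketch assumes every job fits in an empty system, mirroring the experiment: with positive slack, the stopping loop always terminates once enough jobs have been evicted.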

7. CONCLUSIONS

In this paper, we have proposed a set of heuristics to map real-time streaming jobs onto a homogeneous multiprocessor system containing a network of 4 to 12 routers. We proposed a model that represents the hardware constraints and the resource requirements of jobs. Within this model, it is possible to reason about temporal guarantees, which allows us to map jobs onto the system such that temporal constraints are met. We showed how to model the NoC in such a way that time-multiplexing can be taken into account, by representing each router as a number of communication nodes, one per time slice. The problem of finding a path through the network is then merged with the problem of finding a valid slot allocation table per router. The joint problem can be seen as a disjoint paths problem. In order to map a job, its actors and channels are packed onto the tiles using a First-Fit strategy, after which routes through the network are allocated. We showed it is beneficial to use a two-step approach: first, we map the actors onto virtual tiles (VTs), and then we map the VTs onto the actual tiles. This raised the chance of finding a feasible mapping, and reduced bandwidth usage. We also showed that the chances of finding a feasible mapping can be further increased by trying several random permutations of the ordering of the input. Our stress test shows how our resource manager performs under high load. One interesting remark is that although clustering is essential to map some jobs, and greatly decreases bandwidth usage, it has an adverse effect on allocation success once the system is close to full. Because of this, we devised a mechanism by which two versions of the job are tried out in the mapping: first a clustered version, and, if this fails, an unclustered version. This allowed for a substantial increase in the percentage of successes during the stress test.
When operating under high load, the described approach is capable of allocating at least 95% of the resources available on the processing tiles, assuming that bandwidth does not become a very scarce resource.

8. REFERENCES

[1] S.K. Baruah and S. Funk. "Task Assignment in Heterogeneous Multiprocessor Platforms," University of Georgia, 2003.
[2] J.E. Beck and D.P. Siewiorek. "Modeling Multicomputer Task Allocation as a Vector Packing Problem," Proceedings of the ISSS, 1996.
[3] M. Bekooij, O. Moreira, P. Poplavko, B. Mesman, J. van Meerbergen, M. Duranton, and L. Steffens. "Predictable Embedded Multiprocessor System Design," Proceedings on CD of Philips Conference on DSP, 2003.
[4] A. Burchard, J. Liebeherr, Y. Oh, and S.H. Son. "Assigning Real-Time Tasks to Homogeneous Multiprocessor Systems," IEEE Transactions on Computers, 1995.
[5] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman and Co., San Francisco, 1979.
[6] K. Goossens et al. "Networks on Silicon: Combining Best-Effort and Guaranteed Services," Proceedings of the Design, Automation and Test Conference, 2002.
[7] K. Goossens et al. "Guaranteeing the Quality of Services in Networks on Chip," in Networks on Chip, A. Jantsch and H. Tenhunen (eds.), Kluwer, 2003.
[8] R. Govindarajan, G.R. Gao, and P. Desai. "Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks," Journal of VLSI Signal Processing, Kluwer, 2002.
[9] A. Hansson, K. Goossens, and A. Radulescu. "A Unified Approach to Constraint Mapping and Routing on Network-on-Chip Architectures," Proceedings of the ISSS, 2005.
[10] B.W. Kernighan and S. Lin. "An Efficient Heuristic Procedure for Partitioning Graphs," The Bell System Technical Journal, 1970.
[11] S.G. Kolliopoulos and C. Stein. "Approximating Disjoint-Path Problems Using Greedy Algorithms and Packing Integer Programs," Proceedings of the IPCO, 1998.
[12] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms, Springer-Verlag, Berlin Heidelberg, 2002.
[13] E.A. Lee and D.G. Messerschmitt. "Synchronous Data Flow," Proceedings of the IEEE, 1987.
[14] C.L. Liu and J.W. Layland. "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment," Journal of the ACM, 1973.
[15] G. Martin and H. Chang. Winning the SoC Revolution, Kluwer Academic Publishers, 2003.
[16] J.J.D. Mol. "Resource Allocation for Streaming Applications in Multiprocessors," Delft University of Technology, 2004.
[17] O. Moreira, J.J.D. Mol, M. Bekooij, and J.L. van Meerbergen. "Multiprocessor Resource Allocation for Hard-Real-Time Streaming with a Dynamic Job-Mix," Proceedings of the RTAS, 2005.
[18] D.T. Peng, K.G. Shin, and T.F. Abdelzaher. "Assignment and Scheduling of Communicating Periodic Tasks in Distributed Real-Time Systems," IEEE Transactions on Software Engineering, 1997.
[19] P. Poplavko et al. "Task-level Timing Models for Guaranteed Performance in Multiprocessor Networks-on-Chip," Proceedings of CASES, 2003.
[20] S. Sriram and S. Bhattacharyya. Embedded Multiprocessors, Marcel Dekker, 2000.
[21] O.U.P. Zapata and P. Mejia-Alvarez. "Analysis of Real-Time Multiprocessors Scheduling Algorithms," Proceedings of the RTSS, 2003.

