Automated Task Allocation for Network Processors William Plishker, Kaushik Ravindran, Niraj Shah, Kurt Keutzer University of California, Berkeley {plishker, kaushikr, niraj, keutzer}@eecs.berkeley.edu

ABSTRACT
Network processors have great potential to combine high performance with increased flexibility. These multiprocessor systems consist of programmable elements, dedicated logic, and specialized memory and interconnection networks. However, the architectural complexity of these systems makes programming difficult. For network processors to succeed, programmers must be able to productively implement high performance applications on them. Ideally, designers describe applications in a domain specific language (DSL). DSLs expedite the development process by providing component libraries, communication and computation semantics, visualization tools, and test suites for an application domain. An integral aspect of mapping applications described in a DSL to network processors is allocating computational tasks to processing elements. We formulate this task allocation problem for a popular network processor, the Intel IXP1200. Our method is computationally efficient and, on two representative applications (IPv4 forwarding and DiffServ), produces results within 5% of the aggregate egress bandwidths achieved by hand-tuned implementations.

1. INTRODUCTION
As the number of ASIC design starts continues to decline, programmable application specific solutions such as application specific instruction processors (ASIPs) are moving to fill the void. Examples of this trend are widespread, including the rise of graphics processing units, digital signal processors, media processors, and network processors. Instead of being confined to a single application as ASICs are, the programmability of ASIPs enables them to support a set of applications. However, this flexibility often comes at the cost of performance. By exploiting properties of the targeted domains, architects have employed a variety of techniques to recapture any ground lost to custom solutions. For network processors, these include special purpose hardware, multiple multithreaded processing elements, distributed heterogeneous memories, and special purpose peripherals. However, since programming environments for these architectures have not advanced as rapidly, software developers face a daunting challenge in efficiently programming these devices. Ideally, application developers would program in a domain specific language (DSL) which facilitates natural application capture. DSLs enable higher productivity by providing component libraries, communication and computation semantics, visualization tools, and test suites tailored to a given application domain. For networking, a popular DSL is Click (Kohler 2000), an actor oriented language customized for packet processing and packet flow visualization. Unfortunately, the divide between DSLs and their implementation platform is wide. A path to traverse this chasm for high performance systems is described in Figure 1. A designer describes the application in a DSL, while ideally not being concerned with the final implementation platform. This abstract representation of the application is optimized to produce a task graph suited for high performance systems.
The task graph is made up of computational elements that correspond to the target architecture. Performance and resource usage parameters of these elements along with a model of the


target architecture are used by the mapping stage. This stage includes the allocation of computational tasks to processing elements, the layout of data in memory, the assignment of communication to interconnect, and the scheduling of tasks. Based on the derived mapping, the task graph is divided into the individual programs to be run on the processing elements. A compiler specific to each processor finishes the flow to implementation by producing executables from the individual programs. Since prior stages use approximations of task properties, a profile of the resulting executable feeds back to the high level optimization stage and the mapping stage. Consequently, feedback helps guide both the DSL to task graph transformation and the mapping of the task graph. After a sufficient number of iterations, the final result of the entire flow is a program optimized for performance within resource constraints.

Figure 1. DSL to ASIP

Based on our experience with network processors, this work focuses on the most influential piece of this flow: the assignment of computational tasks to processing elements. While some automated solutions exist for solving this problem generically (e.g. dynamic partitioning in operating systems, offline scheduling algorithms), for high performance embedded systems the most common solution is simply a manual partitioning of the design across threads and processors. Partitioning may be done either informally, by the designer's intuition and experience, or more methodically, by architecting the design to be flexible and then changing the task allocation based on profiling feedback. In either case it is a time-consuming and challenging problem, largely due to a huge and irregular design space further exacerbated by resource constraints. This work automates the task allocation process for high performance multiprocessor architectures. We use simplified models of the application and the architecture and leverage recent advances in 0-1 integer linear programming (ILP) to solve the task allocation problem efficiently. To demonstrate our approach, we map the data plane of an IPv4 router and a Differentiated Services interior node onto the Intel IXP1200, a common network processor. For both examples, the runtime of our approach is less than one second, with the resulting implementations performing within 5% of the aggregate data rate of hand partitioned designs. The remainder of this paper is organized as follows: Section 2 gives some background on prior work and an overview of Click and the IXP1200. The problem formulation is described in Section 3. In Section 4, we present the results of our approach for two network applications. Section 5 concludes and summarizes this work and Section 6 describes future work.

2. BACKGROUND
In this section we review prior approaches to solving similar problems and also overview Click and the Intel IXP1200 architecture, which are used as demonstration vehicles for these ideas.


2.1. Related work
Task allocation has been considered in a variety of contexts and handled with various approaches. Behavioral synthesis deals with the assignment of operations to hardware and the usage of storage and communication paths (Devadas 1989). While analogous to the problem faced here, these approaches are best suited for synthesizing datapath elements for small computational kernels. Further, the constraint exploration is usually based on probabilistic algorithms, which are slow or suboptimal for large examples. Approximation algorithms have been extensively studied to solve these problems for general multiprocessor models (Chekuri 1998)(Shachnai 2000). However, such generalized solutions are not suitable for modern embedded architectures since they fail to consider practical resource constraints. In particular, the approximation schemes do not take into account thread and storage limitations, which are critical factors that affect the quality of the mapping to heterogeneous ASIP architectures. This simplification substantially limits the design space that would be explored for some embedded systems. Therefore, existing approximation algorithms are not appropriate to solve the task allocation and scheduling problems for these multiprocessors. We utilize the framework of ILP to solve our variant of the mapping problem. The use of ILP in high-level synthesis is not new. Hwang, et al. (Hwang 1991) presented an ILP model for resource-constrained scheduling and developed techniques to reduce the complexity of the constraint system. Extending this concept, mixed integer linear programming (MILP) based task allocation schemes for heterogeneous multiprocessor platforms have been advanced (Bender 1996). These formulations determine a mapping of application tasks to hardware resources that optimizes a trade-off function between execution time, processor and communication costs.
The advantage of ILP is its natural flexibility to express diverse constraints and its potential to compute optimal solutions with reference to the problem model. However, ILP approaches are often considered impractical, since most ILP solvers suffer from large run times even for simple problem instances. To counter this issue, we use a modern 0-1 ILP solver with improved search heuristics and additionally introduce special constraints to restrict the search space. Thus we exploit the flexible framework of ILP to generate optimal solutions to the mapping problem within fractions of a second. The work that comes closest to our problem of mapping to hardware multithreaded architectures was published by Srinivasan et al. (Srinivasan 2003). The authors consider the scheduling problem for the Intel IXP1200 and present a theoretical framework in order to provide service guarantees to applications. However, they do not consider practical resource constraints of the target architecture, nor do they test their methodology with real network applications. In contrast, our approach provides an efficient solution to the mapping problem, explicitly taking into account resource constraints of the multiprocessor.

2.2. Click
Click is an application environment designed for describing networking applications. It is based on a set of simple principles tailored for the networking community. Applications in Click are built by composing computational tasks, or elements, which correspond to common networking operations like classification, route table lookup, and header verification. Elements have input and output ports that define communication with other elements. Connections between ports represent packet flow between elements. In Click, there are two types of communication between ports: push and pull. Push communication is initiated by the source element and effectively models the arrival of packets into the system. Pull communication is initiated by the sink and often models space available in hardware resources for transmitting packets. Click designs are often composed of paths of push elements and paths of pull elements. Push paths and pull paths connect through special elements that have differently-typed input and output ports. The Queue element, for example, has a push input but a pull output, while the Unqueue element has a pull input but a push output. Figure 2 shows a Click diagram of a simple 2 port packet forwarder. The boxes represent elements. The small triangles and rectangles within elements represent input and output ports, respectively. Black ports are push ports, while white ports are pull ports. The arrows between ports represent packet flow.

FromDevice(0)

FromDevice(1)

Lookup IPRoute

ToDevice(0) ToDevice(1)

Figure 2. Click Diagram for a 2 Port Forwarder


The FromDevice(x) element abstracts an ingress port. As a packet enters the system, it is received by FromDevice(x), which then passes the packet to LookupIPRoute. LookupIPRoute contains a table that maps destination IP addresses to output ports. After LookupIPRoute determines the packet's output port, the packet is enqueued. When there is free capacity in the MAC port buffer, ToDevice(x) dequeues a packet and sends it to the appropriate MAC port. Click is implemented on Linux and uses C++ classes to define element behavior. Element communication (push and pull) is implemented with virtual function calls to neighboring elements. The sources of push paths and the sinks of pull paths are called schedulable elements. Firing the schedulable element of a path will execute all elements on that path in sequence. Therefore, to execute a Click description, a task scheduler is synthesized to fire all schedulable elements. A natural extension of this is to execute Click on a multiprocessor architecture to take advantage of the inherent parallelism in processing packet flows. A multithreaded version of Click targets a Linux implementation (Chen 2001). Two pertinent conclusions can be drawn from this work. First, significant concurrency may be gleaned from Click designs in which the application designer has made no special effort to express it. For many applications, packets may be processed independently by separate threads. Second, a Click configuration may easily be altered to express additional concurrency without changing the application's functionality. Both of these conclusions motivate why the Click abstraction is a good basis for a network processor programming environment.
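The push/pull semantics described above can be sketched in a few lines of Python. This is an illustrative model only, not the Click C++ implementation; the element names follow Figure 2, and the two-entry route table is hypothetical:

```python
# Minimal sketch of Click-style push/pull semantics (illustrative only).
# Push flows from a source toward a Queue; pull is initiated by the sink.

class Queue:
    """Push input, pull output: the bridge between push and pull paths."""
    def __init__(self):
        self._pkts = []
    def push(self, pkt):
        self._pkts.append(pkt)
    def pull(self):
        return self._pkts.pop(0) if self._pkts else None

class LookupIPRoute:
    """Maps a destination address to an output queue (hypothetical table)."""
    def __init__(self, route_table, out_queues):
        self.route_table, self.out_queues = route_table, out_queues
    def push(self, pkt):
        port = self.route_table[pkt["dst"]]
        self.out_queues[port].push(pkt)

class ToDevice:
    """Schedulable sink of a pull path: drains its queue when fired."""
    def __init__(self, queue):
        self.queue, self.sent = queue, []
    def fire(self):
        pkt = self.queue.pull()
        if pkt is not None:
            self.sent.append(pkt)

# Wire up the 2-port forwarder of Figure 2 (addresses are made up).
q0, q1 = Queue(), Queue()
lookup = LookupIPRoute({"10.0.0.1": 0, "10.0.1.1": 1}, [q0, q1])
tx0, tx1 = ToDevice(q0), ToDevice(q1)

# FromDevice(x) would push arriving packets; we emulate one arrival,
# then the synthesized task scheduler fires the schedulable element.
lookup.push({"dst": "10.0.1.1", "payload": b"..."})
tx1.fire()
```

The key structural point the sketch captures is that a push chain (FromDevice through LookupIPRoute to the Queue) and a pull chain (Queue to ToDevice) meet only at the Queue, which is what makes the chains independently schedulable tasks in the formulation of Section 3.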

2.3. Intel IXP1200 architecture
The IXP1200 (Intel December 2001) is one of Intel's first network processors based on their Internet Exchange Architecture. It has six identical RISC processors, called microengines, plus a StrongARM processor, as shown in Figure 3. The StrongARM is used mostly to handle control and management plane operations. The microengines are geared for data plane processing, and each has hardware support for four threads that share an instruction store. This is enabled by a hardware thread scheduler that permits fast context swapping. The memory architecture is divided into several regions: large off-chip SDRAM, faster external SRAM, internal scratchpad, and local register files for each microengine. Each region is under the direct control of the user and there is no hardware support for caching data from slower memory into smaller, faster memory (except for the small cache accessible only to the StrongARM). Currently there are several programming environments that may be used to design applications for the IXP1200, including NP-Click (Shah 2003), Teja-C (Teja Technologies 2003), Intel's ACE framework (Intel June 2001), and Intel's Microengine C (Intel 2002). We chose NP-Click as our implementation environment due to its modularity and ease of use. Further, it handles many of the hurdles of using a DSL to arrive at an implementation.

Figure 3. Intel IXP1200 microarchitecture

3. PROBLEM FORMULATION
In this section we formulate the task allocation problem in mapping Click applications to IXP1200-like architectures. We approach the task allocation problem in three steps: (1) construct a simplified model that captures only those salient application parameters and resource constraints most likely to influence the quality of the final solution; (2) encode the constraint system as a 0-1 ILP formulation; (3) solve the optimization problem using an efficient solver to determine an optimal feasible configuration.


We elaborate on these steps in the following sections.

3.1. Model
We view the multiprocessor as a symmetric shared memory architecture consisting of a memory hierarchy in which each region has uniform access time from any processing element (PE). This obviates the need to explicitly account for memory and communication metrics in the model. We incorporate instruction store limits per PE as a resource constraint. In our experience, applications implemented on the IXP1200 are often instruction store bound, which severely complicates the load balancing process. Furthermore, the tradeoff between instruction store and execution cycles of individual tasks is critical to determining optimal performance. In our application model, we first partition the Click description into computational tasks. Tasks are defined as push or pull chains with one schedulable element. Task boundaries are defined by queues. This ensures that a task can be invoked by one and only one entry point. Tasks are grouped into classes; tasks within a class are functionally identical. Tasks within a class can have different implementations, which may differ in their execution cycles, number of instructions, etc. In a multithreaded system, two tasks on the same PE may share part of the instruction store if they are of the same class and implementation. Such instructions are called shareable instructions. Conversely, a task that contains state may have an implementation in which instructions directly address specific state variables. These instructions may not be shared. However, some multithreaded architectures allow for context relative addressing, which enables a logical separation of a single memory resource. As a result, instructions which use context relative addressing to reference state variables directly may be shared across threads, but not within one. Such instructions are called quasi-shareable.
While direct references are faster, a designer may implement a stateful task with shareable instead of quasi-shareable instructions by indirectly addressing its state variables. Tasks written with shareable instructions incur additional execution time, but allow a developer to trade off execution cycles against total instruction memory. Further, we assume that the application consists of independently executing tasks. The queues in the data plane of a typical network application decouple tasks from execution dependencies. In other words, while there is a graph that represents dataflow through the application, the individual tasks may be considered as executing independently. Currently, we assume all tasks have the same periodicity, but our formulation could easily be extended to accommodate multiple execution rates. The utilization of a PE by a task is measured by the number of execution cycles it consumes (execution time less long latency events). Our goal is to allocate tasks onto PEs with the objective of minimizing the makespan: the maximum over all PEs of the total execution cycles of the tasks assigned to it. We acknowledge that these assumptions sacrifice some accuracy, but they were carefully chosen to maximize the exploration of critical parts of the design space while still keeping the problem tractable.
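The instruction-store sharing rules above (shareable instructions counted once per class and implementation on a PE; quasi-shareable instructions replicated once per group of N hardware threads) can be made concrete with a small helper. This is our own illustrative sketch of the accounting, not code from the paper:

```python
import math

def pe_instruction_store(assigned, N):
    """Instruction store consumed on one PE under the sharing rules.

    assigned maps (task_class, impl) -> (task_count, shareable, quasi_shareable).
    Shareable instructions are counted once per class/implementation present;
    quasi-shareable instructions need one copy per group of N tasks, since at
    most N tasks (one per hardware-thread context) can share them.
    Illustrative model only.
    """
    total = 0
    for (cnt, s, q) in assigned.values():
        if cnt > 0:
            total += s + q * math.ceil(cnt / N)
    return total

# Two tasks of the same class and implementation on a 4-thread PE share
# both their shareable and quasi-shareable code (hypothetical numbers):
store_2 = pe_instruction_store({("Transmit", 2): (2, 5, 285)}, N=4)
# A fifth task exceeds the thread count, so the quasi-shareable part
# must be duplicated once:
store_5 = pe_instruction_store({("Transmit", 2): (5, 5, 285)}, N=4)
```

With the hypothetical numbers above, two tasks cost 5 + 285 = 290 instructions, while five tasks cost 5 + 2·285 = 575, reflecting the extra copy needed once more than N tasks occupy the PE.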

3.2. Problem
We attempt to solve the following resource-constrained optimization problem: given a set of independent tasks and a set of PEs, find a feasible implementation for each task and a mapping of tasks to PEs so that the makespan is minimized. Formally, we have a collection of task classes $Y$, where each class $y \in Y$ consists of a set of tasks $T(y) = \{t_{y_1}, t_{y_2}, \ldots\}$ and a set of implementations $M(y) = \{m_{y_1}, m_{y_2}, \ldots\}$. An individual task in $T(y)$ can operate in any one implementation from $M(y)$. Each $m_{y_i} \in M(y)$ is characterized by a tuple $(e_{y,m_{y_i}}, s_{y,m_{y_i}}, q_{y,m_{y_i}})$ denoting the number of execution cycles, the number of shareable instructions, and the number of quasi-shareable instructions, respectively. All tasks of the same class and implementation may share a part of the program instructions in a PE, denoted by $s_{y,m_{y_i}}$. The quasi-shareable instructions, denoted by $q_{y,m_{y_i}}$, can be shared by at most $N$ tasks of the same class and implementation, where $N$ is the number of hardware threads in a PE. Once more than $N$ tasks are assigned to a PE, extra instructions must be used to accommodate additional state registers. By choosing the appropriate implementation for each task, the optimization problem attempts to minimize the makespan while satisfying constraints on the instruction store. The set $P$ consists of the available PEs in our system. An instruction store limit $S_{limit}$ is enforced for each PE. The parameter $E_{cycle}$ denotes the bound on the makespan. We encode the resource-constrained decision problem as a 0-1 ILP; the variables in our constraint system are the following:


(1) $x_{y,t,m,p}$: a 0-1 variable which indicates whether task $t \in T(y)$ belonging to task class $y \in Y$ with implementation $m \in M(y)$ is assigned to PE $p \in P$. In a feasible configuration, a variable with value 1 denotes a selection of implementation and PE for each task.

(2) $a_{y,m,p,k}$: a 0-1 variable which indicates whether $k$ or more tasks from class $y \in Y$ with implementation $m \in M(y)$ are assigned to PE $p \in P$, where $k \in K(y) = \{1,\ 1+N,\ 1+2N,\ \ldots,\ 1 + N\lceil(|T(y)|-N)/N\rceil\}$. This counts the number of tasks from class $y$ and implementation $m$ in PE $p$ in increments of $N$.

The constraints in our system are the following:

(1) $\sum_{p \in P} \sum_{m \in M(y)} x_{y,t,m,p} = 1 \quad \forall y \in Y,\ \forall t \in T(y)$

(2) $\sum_{y \in Y} \sum_{t \in T(y)} \sum_{m \in M(y)} e_{y,m} \cdot x_{y,t,m,p} \le E_{cycle} \quad \forall p \in P$

(3) $a_{y,m,p,k} = 1 \iff \sum_{t \in T(y)} x_{y,t,m,p} \ge k \quad \forall p \in P,\ \forall y \in Y,\ \forall m \in M(y),\ \forall k \in K(y)$

(4) $\sum_{y \in Y} \sum_{m \in M(y)} s_{y,m} \cdot a_{y,m,p,1} \;+\; \sum_{y \in Y} \sum_{m \in M(y)} \sum_{k \in K(y)} q_{y,m} \cdot a_{y,m,p,k} \le S_{limit} \quad \forall p \in P$

Constraint (1) is the exclusionary constraint that specifies that each task must be executed in exactly one implementation and assigned to exactly one PE. The total execution time of all tasks in their selected implementations on each PE must be at most $E_{cycle}$, which is ensured by constraint (2). Constraint (3) determines the values of the $a_{y,m,p,k}$ variables, which are then used in (4) to stipulate a bound on the combined instruction store of all tasks assigned to a PE, accounting separately for the shareable and quasi-shareable parts. The total number of variables in our constraint system is of order $O(\max_{y \in Y}|T(y)| \cdot \max_{y \in Y}|M(y)| \cdot |P| \cdot |Y|)$. The number of constraints is linear in the number of variables.
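For a complete candidate assignment, the constraint system can be checked directly. The following sketch is our own illustration of the feasibility conditions, not the authors' solver; it evaluates constraints (2) and (4) for an assignment in which constraint (1) holds by construction:

```python
import math
from collections import defaultdict

def feasible(assignment, impls, N, E_cycle, S_limit):
    """Check constraints (2) and (4) for a complete assignment.

    assignment: list of (task_class, impl, pe) triples, one per task, so
                each task has exactly one implementation and PE (constraint 1).
    impls: maps (task_class, impl) -> (exec_cycles, shareable, quasi_shareable).
    Illustrative check only; the paper encodes these as 0-1 ILP constraints.
    """
    cycles = defaultdict(int)          # per-PE execution cycles
    counts = defaultdict(int)          # tasks per (pe, class, impl)
    for (y, m, p) in assignment:
        cycles[p] += impls[(y, m)][0]
        counts[(p, y, m)] += 1
    if any(c > E_cycle for c in cycles.values()):
        return False                   # violates constraint (2)
    store = defaultdict(int)           # per-PE instruction store
    for (p, y, m), cnt in counts.items():
        _, s, q = impls[(y, m)]
        # Shareable code counted once; quasi-shareable once per N tasks,
        # mirroring the a_{y,m,p,k} accounting in constraint (4).
        store[p] += s + q * math.ceil(cnt / N)
    return all(st <= S_limit for st in store.values())

# Toy instance (hypothetical numbers): two tasks of one class, one impl.
impls = {("T", 1): (160, 348, 0)}
asn = [("T", 1, 0), ("T", 1, 0)]
```

With both tasks on PE 0, the PE consumes 320 cycles and 348 instructions, so the assignment is feasible for $E_{cycle} = 400$ but not for $E_{cycle} = 300$.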

3.3. Solver
The search strategy is to perform a binary search on $E_{cycle}$ to find the optimum possible execution time and a corresponding implementation and PE assignment for each task. We note that the decision problem formulated above contains the basic bin-packing problem as a special case and hence is NP-complete. Though encoding the problem as an ILP allows us the flexibility to specify varied constraints, solving such problems in the general case is inefficient for reasonably sized instances. However, we take advantage of recent advancements in search algorithms and heuristics for solving 0-1 ILP formulations to efficiently compute solutions that are optimal with respect to our problem model. We use GALENA (Chai 2003), a fast pseudo-Boolean SAT solver, to solve the constraint system. Additionally, we can introduce specialized constraints to prune the solution space and markedly speed up the ILP search procedure. For instance, tasks from the same task class and implementation are identical (since they are characterized by the same number of instructions and execution cycles), and hence lead to symmetric configurations. Accordingly, introducing symmetry-breaking constraints eliminates all redundant configurations that are distinguished only by a permutation of tasks from the same class. To generate these constraints, a total ordering is introduced over the tasks $(T(y), \le)$ and implementations $(M(y), \le)$ in each task class $y \in Y$, and over the PEs $(P, \le)$ in the system. Intuitively, let $\rho(t_{y_i})$ and $\omega(t_{y_i})$ denote the selected PE and implementation, respectively, for some task $t_{y_i} \in T(y)$. To break symmetry, we enforce the constraint that if $t_{y_i}$ and $t_{y_j}$ are two tasks in the same task class $y \in Y$ with $t_{y_i} \le t_{y_j}$, then $\rho(t_{y_i}) \le \rho(t_{y_j})$, and $\rho(t_{y_i}) = \rho(t_{y_j}) \Rightarrow \omega(t_{y_i}) \le \omega(t_{y_j})$. In this way, we avoid symmetric configurations that arise due to a permutation of tasks. This directs the ILP search towards useful configurations and achieves significant run time speedups. Similarly, we can exploit symmetries that arise because the PEs in the system are non-distinct. We may also include restrictions on the placement or pairing of certain tasks to expedite the search. The ILP framework is sufficiently general to accommodate various other user-specified constraints based on knowledge of the problem instance. Thus, the combination of a modern 0-1 ILP solver with specialized constraints to restrict the search space provides a powerful mechanism to solve the task allocation problem efficiently.
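The overall search strategy can be sketched as a binary search on $E_{cycle}$ wrapped around a feasibility oracle. In the paper the oracle is the 0-1 ILP solver (GALENA); in this illustrative sketch we substitute brute-force enumeration, which is viable only for tiny instances:

```python
import itertools
import math
from collections import defaultdict

def min_makespan(tasks, impls_of, cost, pes, N, S_limit):
    """Binary search on E_cycle over an exhaustive feasibility oracle.

    tasks: list of task-class names, one entry per task instance
    impls_of: class -> list of implementation ids
    cost: (class, impl) -> (exec_cycles, shareable, quasi_shareable)
    Assumes some mapping is feasible at the cycle upper bound.
    """
    def feasible(E):
        # Enumerate every (impl, PE) choice per task -- brute force stands
        # in for the ILP solve; symmetry breaking would prune this space.
        choices = [[(y, m, p) for m in impls_of[y] for p in pes] for y in tasks]
        for asn in itertools.product(*choices):
            cyc, cnt = defaultdict(int), defaultdict(int)
            for (y, m, p) in asn:
                cyc[p] += cost[(y, m)][0]
                cnt[(p, y, m)] += 1
            if any(c > E for c in cyc.values()):
                continue               # constraint (2) violated
            store = defaultdict(int)
            for (p, y, m), k in cnt.items():
                _, s, q = cost[(y, m)]
                store[p] += s + q * math.ceil(k / N)
            if all(st <= S_limit for st in store.values()):
                return True            # constraint (4) satisfied
        return False

    lo = 0
    hi = sum(cost[(y, impls_of[y][0])][0] for y in tasks)
    while lo < hi:                     # smallest E_cycle with a feasible mapping
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For example, two identical 100-cycle tasks on two PEs yield an optimum makespan of 100 (one task per PE), which the search finds after a handful of oracle calls.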

3.4. Example
We demonstrate our resource-constrained optimization framework through a simple example. Consider the problem of allocating the DiffServ application, as characterized by Table 2, onto the Intel IXP1200 microarchitecture (Figure 3). There are 6 PEs, each with an instruction store limit of 2048. First consider a naïve first fit approximation (Figure 4, left), which greedily allocates tasks to the least filled bins in order of their execution times. First, the 4 DSBlock tasks are allocated to separate PEs. However, the constraint on the instruction store then prevents the Receive and Transmit tasks from sharing a PE with any DSBlock task. The Receive and Transmit tasks are partitioned between the remaining 2 PEs, resulting in a sub-optimal makespan of 790 execution cycles. This is a consequence of the first greedy step of allocating the DSBlock tasks to separate PEs to reduce execution time; the configuration shown in Figure 4 (left) is the best this heuristic can achieve. We then solve the same problem instance through our ILP-based optimization framework. The resulting configuration, displayed in Figure 4 (right), achieves the optimum makespan of 640 cycles. The solution exploits the advantage of sharing instructions between two tasks of the same class: two DSBlock tasks are allocated to the same PE, better utilizing the instruction store. This capability of multithreaded architectures is not easily captured in a greedy selection heuristic. Even though this example is relatively simple, it illustrates the need for a rigorous method to optimally solve the task allocation problem. As the number of tasks increases and there is greater diversity in the task classes, the advantages of a fast and robust ILP optimization framework become more evident.

[Figure 4 shows the two task assignments as bar charts of per-PE execution cycles: the first fit configuration reaches a makespan of 790 cycles, while the ILP configuration reaches 640 cycles.]

Figure 4. Diffserv task assignment: first fit approach (left) and the present ILP framework (right)
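Using the DiffServ execution cycles from Table 2, the makespans of the two configurations in Figure 4 can be verified in a few lines. This is our own verification sketch; the placement of the Lookup tasks alongside the DSBlock tasks in the first fit configuration is inferred from the instruction-store argument in the text (Lookup's 218 instructions fit beside DSBlock's 1800 within the 2048 limit, while Receive and Transmit do not):

```python
# Execution cycles per task class, from Table 2 of the paper.
CYCLES = {"Receive": 99, "Lookup": 134, "DSBlock": 320, "Transmit": 296}

def makespan(config):
    """Makespan = maximum over PEs of the summed execution cycles."""
    return max(sum(CYCLES[t] for t in pe) for pe in config)

# First fit (Figure 4, left): DSBlocks on separate PEs (joined by the
# Lookup tasks, the only class whose instructions still fit), with the
# Receive and Transmit tasks split across the remaining 2 PEs.
first_fit = [["DSBlock", "Lookup"]] * 4 + \
            [["Receive", "Receive", "Transmit", "Transmit"]] * 2

# ILP result (Figure 4, right): DSBlocks paired to share instructions,
# freeing 4 PEs for one Receive, Lookup, and Transmit each.
ilp = [["DSBlock", "DSBlock"]] * 2 + \
      [["Receive", "Lookup", "Transmit"]] * 4
```

The first fit configuration is limited by the two PEs carrying 2 Receive and 2 Transmit tasks (99 + 99 + 296 + 296 = 790 cycles), while the ILP configuration is limited by the paired DSBlocks (320 + 320 = 640 cycles), matching the makespans reported in the example.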

4. RESULTS
To demonstrate the validity of the model, we used the IXP1200 and two common and representative network applications: Internet Protocol packet forwarding (IPv4) and a Differentiated Services (DiffServ) interior node. IPv4 packet forwarding (Baker 1995) is a common kernel of many network processor applications. The major features of this benchmark are: (1) receiving the packet, (2) checking its validity, (3) determining the egress port of the packet by a longest prefix match route table lookup, (4) decrementing the time-to-live (TTL), and finally (5) transmitting the packet on the appropriate port. We use aggregate egress bandwidth as a proxy for the performance of IPv4 forwarding. The Differentiated Services architecture (Blake 1998) is a method of facilitating end-to-end quality of service (QoS) over an existing IP network.


In contrast to other QoS methodologies, it is a provisioned model, not a signaled one. This implies network resources are provisioned for broad categories of traffic instead of employing signaling mechanisms to temporarily reserve network resources per flow. Interior nodes apply different per hop behaviors (PHBs) to various classes of traffic. The classes of PHBs recommended by the IETF include: Best Effort (no guarantees of packet loss, latency, or jitter), Assured Forwarding (4 classes of traffic, each with varying degrees of packet loss, latency, and jitter), and Expedited Forwarding (low packet loss, latency, and jitter).

Table 1. IPv4 Forwarding Task Characteristics

Class              | Number of Tasks | Execution Cycles | Shareable | Quasi-shareable
Receive            | 16              | 337              | 801       | 0
Transmit (Impl 1)  | 16              | 160              | 348       | 0
Transmit (Impl 2)  | 16              | 140              | 5         | 285
Table 2. DiffServ Task Characteristics

Class    | Number of Tasks | Execution Cycles | Shareable Instructions
Receive  | 4               | 99               | 462
Lookup   | 4               | 134              | 218
DSBlock  | 4               | 320              | 1800
Transmit | 4               | 296              | 985
In both applications, a developer supplied the tasks to be used, and each class and implementation was profiled in a cycle accurate simulation environment. To obtain average execution cycles per task, the application was tested with worst case input traffic. An instance of each task class and implementation was run on a PE by itself in a functionally correct configuration so that it could be profiled with the appropriate traffic. We note that different configurations may cause a task to perform differently, but those effects have not yet been substantial: we have seen at most a 10% change in execution cycles consumed between tasks compiled alone and in the presence of other tasks. For shareable and quasi-shareable instructions, the application was compiled with varying task configurations. We implemented a 16 port Fast Ethernet (16x100Mbps) IPv4 router consisting of 16 Receive class and 16 Transmit class tasks with the characteristics shown in Table 1. A Transmit task has state and can be written with either shareable or quasi-shareable instructions to reference its shared variables. For this application, we targeted a version of the IXP1200 for which the bound on the instruction store was 1024 instructions. The result is optimal with respect to the model: the instruction store constraints preclude the two task classes from coexisting on any PE. The problem then degenerates into bin packing, the only wrinkle being that Transmit class tasks may exist in either implementation. Implementation 2 is preferred for all Transmit tasks since it is the faster of the two and fits within the instruction store limits. The resulting configuration, shown in Table 3, is exactly the same partition that was arrived at by hand tuning. Our DiffServ application supported 4 Fast Ethernet ports (4x100Mbps) targeting the 2K instruction store version of the IXP1200. The corresponding task properties are presented in Table 2.
Since there are only 4 tasks in each class and the IXP1200 has 4 hardware threads per PE, the quasi-shareable instructions are omitted and all instruction store used is represented by the shareable component. After testing with various mixes of traffic flows, we found the egress bandwidth of each packet class in the automatically generated design to be within 2% of the hand-tuned design, except for Best Effort, which occasionally transmitted at only one-third the data rate of the hand-tuned design. This is because there is a strict priority scheduler between Best Effort traffic and all other traffic, and the latency of Transmit tasks is higher due to the simplifications inherent in the model. When the Transmit tasks service the various flows and fail to keep up with ingress, Best


Effort is the first to suffer. Overall the automated partition is within 5% of the hand-tuned aggregate bandwidth for all data points, and was generated in less than a second while the hand-tuned design took days to arrive at.

Table 3. Final Partitioning

Application       PE1          PE2          PE3         PE4         PE5                   PE6
IPv4 forwarding   4 Receive    4 Receive    4 Receive   4 Receive   8 Transmit (Impl 2)   8 Transmit (Impl 2)
(hand & auto)
DiffServ (hand)   4 Receive    4 Lookup     2 DSBlock   2 DSBlock   2 Transmit            2 Transmit
DiffServ (auto)   1 Receive    1 Receive    2 DSBlock   2 DSBlock   1 Receive             1 Receive
                  1 Lookup     1 Lookup                             1 Lookup              1 Lookup
                  1 Transmit   1 Transmit                           1 Transmit            1 Transmit

The principal reason the hand-tuned design performs better than the automatically generated one is that the designer's internal model accounts for more aspects of the mapping problem than the one proposed here. Empirically, the major difference between the two models can be accounted for by the execution cycles consumed by polling. In our setup, tasks poll to determine whether a packet is ready to be processed, so execution cycles are consumed by these polling loops. The execution cycles shown in Table 1 and Table 2 do not include this polling since it depends greatly on the final configuration; however, it can impact design decisions.

Consider the DiffServ allocation problem once the 4 DSBlocks are assigned to 2 PEs by themselves (a decision that both approaches agree on). We are left with a sub-problem of allocating 4 Receive, 4 Lookup, and 4 Transmit tasks to the 4 remaining PEs. Based on the model presented, since there is ample instruction store in the sub-problem, the optimal result is to put one task of each class on each PE. But since the tasks with shorter execution times (Receive and Lookup) spend execution cycles in their polling loops, a significant number of additional execution cycles are consumed: a fact not considered by the model. A knowledgeable developer would note that pairing an execution-heavy Transmit task with only one other execution-heavy task (one additional polling loop) might be better than placing it with the two execution-light tasks (two additional polling loops). In this configuration, the Transmit tasks are paired together and Receive and Lookup are given their own PEs. Trying this configuration confirms that the two polling loops introduced by Receive and Lookup cancel out any apparent savings from their being execution-light.
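The polling trade-off described above can be seen in a back-of-the-envelope model that charges each co-resident task a fixed polling cost. All cycle figures below are invented, not the Table 2 measurements, so only the structure of the comparison carries over.

```python
# Toy model of the polling effect on the bottleneck Transmit PE.
# All cycle figures are hypothetical, NOT the profiled Table 2 values.

POLL = 150  # hypothetical cycles one polling loop costs its PE
exec_cycles = {"Receive": 300, "Lookup": 400, "Transmit": 1200}

def pe_load(resident_tasks):
    """Execution cycles plus one polling loop per co-resident task."""
    return sum(exec_cycles[t] + POLL for t in resident_tasks)

mixed  = pe_load(["Receive", "Lookup", "Transmit"])  # model-optimal PE
paired = pe_load(["Transmit", "Transmit"])           # hand-tuned PE

# Pairing Transmits wins exactly when one polling loop costs more than
# the execution gap between Transmit and the two light tasks:
flip_point = (exec_cycles["Transmit"]
              - exec_cycles["Receive"] - exec_cycles["Lookup"])
```

The sign of `POLL - flip_point` decides which configuration loads the Transmit PE less; the paper's measurements, not this toy arithmetic, settle the question for the real system.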

5. SUMMARY AND CONCLUSIONS
As the functionality expected from single chip, high performance, multiprocessor systems continues to increase, so will the difficulty of distributing that functionality. Ideally, programmers would capture their application in a DSL and automation would take it to implementation. As a first step, we formulate the mapping problem for one instance of such resource constrained embedded systems. By encoding the mapping problem as a 0-1 ILP, we allow for flexibility and extensibility while still utilizing a high performance back end. Problems are solved in less than a second with results that are optimal with respect to the model. The model has proved itself on two representative network applications, producing results within 5% of the aggregate bandwidth of hand-balanced designs. While more applications need to be tested, we feel that this approach is robust and fast enough to be used as a tool by designers when developing software for these systems. It is one piece in an overall design process that will enable designers to explore different task implementations and identify optimal mappings. This in turn expedites the overall application design flow for network processors that utilize this class of architecture.
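The 0-1 encoding mentioned above assigns a binary variable per task-PE pair. As a minimal illustration, the sketch below enumerates those assignments for a toy instance and minimizes the bottleneck PE load; a pseudo-boolean solver, as cited in the references, would replace the exhaustive loop in practice, and the task cycle counts are invented.

```python
# Toy version of the 0-1 formulation: x[t][p] = 1 iff task t runs on
# PE p, represented here as a tuple giving each task's PE index.
# Task names and cycle counts are hypothetical.
from itertools import product

tasks = {"A": 500, "B": 400, "C": 300, "D": 200}  # execution cycles
NUM_PES = 2

best_load, best_assign = None, None
for assign in product(range(NUM_PES), repeat=len(tasks)):
    loads = [0] * NUM_PES
    for (name, cycles), pe in zip(tasks.items(), assign):
        loads[pe] += cycles
    # Objective: minimize the load of the bottleneck PE.
    if best_load is None or max(loads) < best_load:
        best_load, best_assign = max(loads), assign

# best_load == 700, e.g. {A, D} on one PE and {B, C} on the other.
```

Real instances add instruction store and implementation-choice constraints as further linear inequalities over the same binary variables, which is why the formulation extends cleanly.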

6. FUTURE WORK
Network processors continue to provide more programmable features with every generation. The IXP2800 (Intel 2004), besides increasing the number of PEs (to 16) and the number of threads per PE (to 8), adds new architectural features such as next neighbor registers for tightly coupled producer-consumer relationships between PEs, local memory for each microengine, and more communication options. While these new features provide more opportunities to create high performance applications, they make the overall programming challenge more difficult. It is no longer possible to assume the processors are symmetric with respect to memory, so the task allocation problem is entangled with the assignment of data to memory. For instance, if next neighbor registers can accommodate the link between producing and consuming tasks,


the system enjoys a significant performance benefit because communication through globally scoped memory is avoided. Pursuant to these challenges, we will focus on the other aspects of the general mapping problem: data layout, communication assignment, and scheduling. The increase in size and the addition of architectural features necessitate a significant extension of the framework presented here. Fortunately, the ILP framework is flexible enough to accommodate these features. However, with the additional complexity, solving the problem directly may be infeasible. Once this barrier is reached, we plan to divide up the problem, utilize designer guidance, and develop heuristics. Hierarchy, or division into reasonable sub-problems, is a classic technique for managing complexity. Programmer guidance or application input can be incorporated as constraints, requiring no change to the solver or framework. Lastly, a preliminary mapping of the architecture and application graphs earlier in the design cycle should allow extensive pruning of the design space for faster exploration. It is our belief that as systems incur additional resource constraints, processors, and tasks, the advantages of an automated approach to the mapping problem will become more apparent.
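A constraint like the next neighbor link can be folded into the same search as an extra cost term. The sketch below is a hypothetical illustration: task names, edge costs, and the adjacency rule are invented, and co-locating tasks on one PE is ignored to keep it small.

```python
# Hypothetical cost model: a producer-consumer edge is cheap only when
# the consumer sits on the next-numbered PE (next-neighbor registers);
# otherwise it pays a global-memory penalty. All figures illustrative.
from itertools import product

NUM_PES = 4
tasks = ["rx", "proc", "tx"]
edges = [("rx", "proc"), ("proc", "tx")]  # producer -> consumer pairs
NN_COST, MEM_COST = 10, 100               # register link vs. memory

def comm_cost(assign):
    placement = dict(zip(tasks, assign))
    total = 0
    for prod, cons in edges:
        if placement[cons] == placement[prod] + 1:
            total += NN_COST   # link fits in next-neighbor registers
        else:
            total += MEM_COST  # falls back to globally scoped memory
    return total

best = min(product(range(NUM_PES), repeat=len(tasks)), key=comm_cost)
# A pipeline across consecutive PEs, e.g. (0, 1, 2), minimizes the cost.
```

In an ILP setting, this cost becomes a linear term over the same binary placement variables, which is why the existing framework can absorb such architectural features.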

7. REFERENCES
F. Baker, "Requirements for IP Version 4 Routers," RFC 1812, Network Working Group, June 1995.
A. Bender, "MILP Based Task Mapping for Heterogeneous Multiprocessor Systems," Proceedings of EDAC, 1996, pp. 283-288.
S. Blake et al., "An Architecture for Differentiated Services," RFC 2475, Internet Engineering Task Force (IETF), December 1998.
D. Chai and A. Kuehlmann, "A Fast Pseudo-Boolean Constraint Solver," Proceedings of DAC, 2003, pp. 830-835.
C. Chekuri, "Approximation Algorithms for Scheduling Problems," Technical Report CS-TR-98-1611, Computer Science Department, Stanford University, August 1998.
B. Chen and R. Morris, "Flexible Control of Parallelism in a Multiprocessor PC Router," Proceedings of the 2001 USENIX Annual Technical Conference (USENIX '01), USENIX Association, 2001, pp. 333-346.
S. Devadas and A. R. Newton, "Algorithms for Hardware Allocation in Datapath Synthesis," IEEE Transactions on CAD, vol. 8, no. 7, pp. 768-781, July 1989.
C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu, "A Formal Approach to the Scheduling Problem in High Level Synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 10, no. 4, April 1991, pp. 464-475.
Intel Corp., "Intel IXA SDK ACE Programming Framework Developer's Guide," June 2001.
Intel Corp., "Intel IXP1200 Network Processor," Product Datasheet, December 2001.
Intel Corp., "Intel Microengine C Compiler Language Support Reference Manual," March 2002.
Intel Corp., "Intel IXP2800 Network Processor," Hardware Reference Manual, May 2004.
E. Kohler et al., "The Click Modular Router," ACM Transactions on Computer Systems, vol. 18, no. 3, Aug. 2000, pp. 263-297.
H. Shachnai and T. Tamir, "Polynomial Time Approximation Schemes for Class-Constrained Packing Problems," Proceedings of the Workshop on Approximation Algorithms, 2000, pp. 238-249.
N. Shah, W. Plishker, and K. Keutzer, "Programming Models for Network Processors," Network Processor Design: Issues and Practices, vol. 2, Dec. 2003, pp. 181-202.
A. Srinivasan et al., "Multiprocessor Scheduling in Processor-based Router Platforms: Issues and Ideas," Network Processor Design: Issues and Practices, vol. 2, November 2003.
Teja Technologies, Inc., "Teja C: A C-based Programming Language for Multiprocessor Architectures," to appear in Network Processor Design: Issues and Practices, vol. 2, November 2003.
