Evaluating the Network Processor Architecture for Application-Awareness

TKS LakshmiPriya, V. Hari Prasad, D. Kannan, L. Karthik Singaram, G. Madhan, R. Meenakshi Sundaram, R.M. Prasad and Ranjani Parthasarathi
Department of Computer Science, College of Engineering, Guindy, Anna University, Chennai, India.
[email protected], [email protected], kannan21ceg@gmail.com, singaramlks@yahoo.com, madhan_gg@yahoo.co.in, meenakshi.r.sundaram@gmail.com, rmprasad@gmail.com, [email protected]

Abstract-The introduction of the deep packet inspection technique has enabled the provision of 'application-specific' QoS. Application-aware processing is a natural extension of this 'higher layer header examination' technique. The introduction of multicore, programmable network processors has largely contributed to this extension. In addition to the network-specific services inherently supported by the NP architecture, application-aware services also find a place in the fast path. In this paper, we examine the feasibility and suitability of porting application-aware operations onto network processors. We have selected a popular application of content delivery - both text and media - to demonstrate the process of building application awareness in the network, and to critically examine the scalability of the concept. We also propose a generic evaluation methodology for application-awareness on NPs and adopt the same to evaluate the Intel IXP network processor. One of the highlights of this study is the design and evaluation of XML processing on the Intel IXP architecture. The resource utilization for the two case studies indicates that different demanding services can be easily accommodated on the NP, and that the NPs have 'headroom' to support multiple services.

I. INTRODUCTION

Application-Aware Networks (AANs) denote a paradigm shift from network-aware applications to application-aware processing. One of the prime motivations for AANs is that they relieve end-systems from additional computational processing, thereby paving the way for client-specific processing and the development of lighter client systems. The advent of powerful network elements has provoked the shifting of application-awareness into the network. It is characterized by the introduction of Deep Packet Inspection (DPI), Deep Content Inspection (DCI) and, recently, content processing in the network, and is opening new avenues of research that primarily focus on reducing surplus traffic over the network.

DPI, wherein layer 4-7 headers are checked for classifying flows, has long been adopted in many routers to provide QoS guarantees. An extension of this concept, DCI, which involves checking of the application-layer contents as well [1], [2], is being widely adopted at the servers [25], in the security domain, etc. In [25] the authors have made use of connection and application-level information, in kernel-based mechanisms at web servers, to prevent overloading. Security services such as virus signature identification using pattern-matching operations have been implemented at routers [34], switches [13], and firewall proxies, using FPGAs [3], [4], ASICs [5], and NPs [13], [14] for hardware acceleration. Of these, NPs have the inherent advantages of flexibility, programmability, and the architectural resources to support diverse networking services [16] in a dynamic manner.

The use of NPs as emerging network building blocks is mainly due to their inherent characteristics: multi-core nature, re-programmability, hierarchical memory, and an Instruction Set Architecture that supports network-processing tasks. The NP architecture provides hardware support for high-speed (data-plane) and low-speed (control-plane) operations, in particular for hiding memory latency. These hardware facilities may be exploited to optimize the high-speed packet processing operations. For example, choice of appropriate memory based on access time and data size, overlapping memory accesses, and defining data sizes in terms of memory word-length are a few optimization techniques to hide memory latency. While such features have motivated the use of NPs for security services such as cryptographic operations [6] and QoS services such as traffic analysis [7], NPs have also been used for unconventional applications like accelerating database operators [17]. In terms of application-layer processing, an NP-based software architecture has been developed to support application-specific messages [15], and a comprehensive suite of resource management mechanisms for AANs has been addressed [11].

Turning to evaluation, we find that considerable work, including benchmarking tools, has been carried out for evaluating network-processing operations on NPs [8], [9], [10]. However, evaluation of application-awareness in NPs is still in its infancy. NPs, with their immense compute power and high-speed processing capability, have made it feasible to incorporate application-layer processing along the fast path, in addition to the regular networking operations. It has been shown [23] that NPs, employed as attached processors, are capable of performing event-action processing. In a comparison with General Purpose Processors [24], it has been shown that the IXP-based implementation of an application-specific service outperforms that of a GPP, in terms of both latency and throughput. We identify a number of other dimensions along which NPs' application-awareness may be evaluated, namely: scope, feasibility, scalability, overheads, performance, resource requirements, and resource utilization. In terms of scope, the significance and advantages of porting an application onto an NP may be evaluated. The ability to arrive at an NP-specific design for an application views the problem from the feasibility dimension. Horizontal or vertical scalability issues form another dimension. The overheads due to the introduction of app-awareness may be identified, and their effects on the normal operation may be evaluated. The performance dimension may look at evaluating the performance of the chosen application on an NP. Yet another dimension is in terms of evaluating resource requirements and resource utilization for applications that have contrasting workload characteristics.

The very seed of app-awareness is the existence of immense power in the network. The fact that NPs 'can' accommodate app-layer services, and that this area is still in its infancy, has motivated us to work in this direction. It can be argued that, since current routers are already quite overloaded with routing alone, unpredictable and heavy network traffic conditions would overload NP-based routers as well. We would like to emphasize that the evolving NP technology, and the multiple roles that NPs can take up (namely, full-fledged processors for router implementation, attached processors, or embedded processors), indicate the amount of computation that can be pushed into NPs. This has motivated us to probe in the direction of resource utilization for app-aware services and the effectiveness of including the app-aware feature on an NP along with its normal networking operations. We intend to port app-layer services of varying workloads onto NPs and explore the resource usage in each case.

Thus, in this paper, we critically examine the suitability of NPs for application processing in the network. First, we propose a generic methodology for evaluating app-aware operations on NPs. Second, we choose two case studies of a popular web application that possess extreme characteristics and map them onto the NP architecture. We then evaluate the application-awareness of NPs in terms of its scope, feasibility and scalability.

The rest of the paper is organized as follows: Section II briefs the IXP Network Processor Architecture. Section III gives our methodology for evaluating application-aware processing in the network using NPs. In Sections IV and V we describe two case studies and their design and implementation on IXP2400. In Section VI, we discuss the results obtained, in terms of scope, feasibility and scalability. Section VII concludes the paper.

1-4244-0614-5/07/$20.00 ©2007 IEEE.

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on February 12, 2010 at 18:41 from IEEE Xplore. Restrictions apply.

II. IXP2400 OVERVIEW

IXP2400 belongs to the Intel family of integrated network processors and comprises a single XScale core processor, eight microengines, standard memory interfaces, and high-speed bus interfaces. It is targeted at networking applications that require a high degree of flexibility, programmability, scalability, and performance. The IXP architecture offers numerous hardware features which provide the user a highly concurrent packet-processing model, while keeping the programming model simple. The microengines are fully programmable, custom processors, implemented specifically for networking applications, and are especially well suited to high-speed data manipulation and movement.

The key architectural features include: multi-processing - multiple network packets can be processed in parallel; distributed data storage architecture - data can be positioned close to where it is needed for faster access; hardware multi-threading - each microengine can process multiple packets with minimal context-switching overhead; active memory optimizations - multiple memory requests can be executed with low overheads and high memory bandwidth; multi-level concurrency - multiple packets can be processed simultaneously with interleaved memory and compute cycles; and block data movement - efficient movement of large amounts of data. All instructions execute in a single cycle, including instructions that perform network-processing operations such as hashing, CRC calculation, and extracting a part of a memory word. The instruction set architecture (ISA) also provides support for synchronization primitives and atomic operations.

III. PROPOSED EVALUATION METHODOLOGY

In this section we present a generic seven-stage methodology to evaluate the application-awareness of the network processor architecture.

(i) Identifying an application: An application whose services are to be moved into the network is chosen.
The motivation or goal behind such a movement must be well-defined, such as to improve specific performance metrics, to alleviate a known problem, to scale an existing system, to extend existing functionality, or to accommodate a new feature/service.

(ii) Identifying the point of implementation: The point in the network where the application-layer service is to be implemented is decided. Typical places may be the client edge, server edge, or router. This decision depends on factors such as the overall goal, administrative issues, performance issues, and criteria for evaluation.

(iii) Identifying a suitable NP platform: If the evaluation is for a known/given NP, then this stage may be skipped; otherwise, the particular NP (architecture / vendor / version / programming environment) on which the evaluation is to be made is chosen. If the focus is on evaluating the awareness to a specific application, then the platform chosen depends on the features supported by the NP and the evaluation criteria. One other option is to perform the evaluation across many


[Figure: a client C sends a request, which an NP-based router forwards to the content server S; the server's response is transcoded by the NP and delivered to the client. Legend - S: Content Server; C: Wired/Wireless/Mobile Client; NP: NP-based Router.]

Fig. 1. Operational environment for the case study

NPs, for a given application.

(iv) Identifying the application-layer services: The specific services of the application which are to be moved into the network to achieve the goal are identified. It is these services that are mapped onto the NP.

(v) Mapping the application-layer service(s) to NP operations: Each app-layer service, identified in stage iv, is partitioned into tasks and/or sub-tasks that can be represented as network-processing operations. These tasks / sub-tasks are then mapped onto the NP architecture. The partitioning of the app-layer services may require certain adaptations in the operations or in the approach adopted, because not all application-layer operations inherently map to 'network processing' operations. Appropriate computational and memory resources are assigned for each of the tasks, and the data pipeline design is made. The pipeline design plays a crucial role, specifically when performance is the evaluation criterion.

(vi) Implementation and testing: The system is implemented and tested, and results are tabulated. The choice of the algorithm, the instructions, and the test cases must be made with care. NP-specific features and programming aids must be well exploited.

(vii) Evaluation: The test results and other issues are considered for evaluation. The evaluation may involve an iterative approach.

IV. OUR CASE STUDY

The generic methodology proposed in Section III has been adopted to evaluate the application-awareness of the Intel IXP2400 Network Processor [18]. The goal is to explore the porting of app-layer services of contrasting features. This is presented below.

A. The application: Our choice of an application is web content access, a popular, traffic-heavy network application. We propose to move certain content access services into the network, which reduces network traffic and facilitates heterogeneous clients such as handhelds, laptops and PCs, to ultimately improve performance.

B. Platform for evaluation: Intel IXP2400 Network Processor.

C. Choice of application-layer services: The application-layer services chosen for our evaluation are content adaptation for both media and text content. The multimedia content


adaptation is for a media multicast environment, while text content adaptation is for a specific client group at the client edge. Content types with contrasting characteristics (multimedia: large size, loss tolerant, quality sensitive; text: smaller size, loss sensitive, frequent access) have been intentionally chosen for adaptation, in order to explore how the resources are utilized in each case, which is the focus of our evaluation. In addition, we have chosen XML processing, an application-layer service, as a part of media adaptation, specifically because we intend to evaluate a service which the IXP2400 does not explicitly support - XML parsing. The details of the media adaptation service and the text adaptation service follow.

Design of Media Adaptation Service: The media multicast scenario consists of multicast routers and multicast groups of heterogeneous clients, forming a tree topology. All video files are encoded using a hierarchical encoding scheme to facilitate faster transcoding, and, in order to support codec-independent adaptation, certain media content-specific annotations are provided from the server end. We use XML to describe the media content. The XML-based metadata describing the structure of the video, followed by the raw video bit stream, constitutes a parse-able unit of a video, defined as the 'unit'. Each such unit is transmitted to the multicast clients in IP packet(s). The level of media adaptation done at a router depends on the maximum capabilities of the client group under it and the congestion information along the downstream, thus avoiding unnecessary traffic down the multicast tree.
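The notion of a parse-able 'unit' - XML metadata describing the layer structure, followed by the raw bit stream - and layer-dropping at a router can be sketched as follows. This is a simplified model: the element names, attribute names and layer sizes are illustrative, not those of the gBSD schema used in the paper.

```python
# Simplified model of a parse-able media 'unit': XML metadata describing
# hierarchically encoded layers, followed by the raw video bit stream.
# Element names, attribute names and sizes here are illustrative only.
import xml.etree.ElementTree as ET

def adapt_unit(metadata_xml, payload, layer_sizes, max_layers):
    """Drop enhancement layers beyond what the client group can handle."""
    root = ET.fromstring(metadata_xml)
    layers = root.findall("layer")
    for dropped in layers[max_layers:]:   # keep base + allowed enhancements
        root.remove(dropped)
    kept_bytes = sum(layer_sizes[:max_layers])
    # Rewrite the metadata so downstream routers see the adapted structure.
    return ET.tostring(root), payload[:kept_bytes]

meta = b"<unit><layer id='0'/><layer id='1'/><layer id='2'/></unit>"
new_meta, new_payload = adapt_unit(meta, b"B" * 300, [100, 100, 100], 2)
```

Because the modified metadata travels with the truncated bit stream, the next mrouter down the tree can repeat the same operation without knowledge of the codec.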
The entire service has been designed along the lines of established protocols and standards [22]: Internet Group Management Protocol (IGMP) for maintaining the multicast groups corresponding to the media streaming sessions; Resource Reservation Protocol (RSVP) for clients to provide their capability information; Real-time Transport Protocol (RTP) for transmitting the media content; the RTP control protocol (RTCP) for feedback-based congestion determination; Global BitStream Definition (gBSD) [26]; and the standardized Extensible Markup Language (XML) for media content description.

Design of Text Adaptation Service: The scenario consists of NPs housed at an edge node, such as a proxy or a base station controller, which interprets the URL (HTTP) requests and responses passing through it. Page re-authoring techniques are employed at the NP, and the adapted pages are delivered to the clients - wired or wireless. In addition, the entire web page is cached at the proxy and serviced during subsequent requests from the client, thereby alleviating network traffic and improving response time at the client.

D. Strategic points of implementation: The operational environment for the case study is shown in Fig. 1. Here, when clients request the content servers for web content, NP-based network intermediaries interpret the requests and forward them to the content servers. Similarly, responses from the content servers are interpreted; the content is transcoded according to the requesting client's capabilities and delivered to the clients. Among the network intermediaries,


the NP-based mrouters (multicast routers) that span the entire network have been chosen to perform media content adaptation, while the NP-based edge routers, being close to the clients, perform the text content adaptation service. It is to be noted that, here, the media adaptation service is deployed at more than one point in the network, while text adaptation is confined to a single point.

E. Mapping the content adaptation services to NP-specific operations: Each of the adaptation services is partitioned into tasks such that each task can be represented as one or more network-processing operations. These operations are then mapped onto the NP architecture for implementation. This process, as applied to each of the adaptation services, is described below.

Media Adaptation Service: Each mrouter involved in this service processes the join/leave commands in IGMP packets, the 'capability propagation' in RSVP packets, the Receiver Report (RR) in RTCP packets, and the 'video units' in the RTP packets. Of these, processing IGMP, RSVP and RTCP packets involves table processing based on interpreting the respective commands. Such operations are typical of any network-processing environment. Since the major work lies in processing the RTP packets, this task is partitioned into four subtasks: 1) packet re-assembly, in order to extract an entire 'video unit'; 2) XML parsing, i.e., checking the validity of the metadata from the 'video unit' and parsing it; 3) metadata processing, i.e., extracting the video descriptions, and transcoding, i.e., identifying the hierarchical layers of the video and selectively dropping layers; and 4) segmentation, i.e., generating video packet(s) containing the modified metadata and the adapted video. Of these, subtasks 2, 3 and 4 involve XML processing, which is a significantly complex task for the NP.
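The four RTP subtasks can be viewed as a simple chain; the sketch below composes them with placeholder functions standing in for the real Parser, Transcoder and Segmentation modules (the function signatures and the 1400-byte MTU are our illustrative assumptions, not from the paper).

```python
# The four RTP-processing subtasks chained together: reassembly, XML
# parsing/validation, metadata processing + transcoding, and segmentation.
# The functions passed in are placeholders for the real modules.
def process_rtp(fragments, parse, transcode, segment_size=1400):
    unit = b"".join(fragments)            # 1) packet re-assembly
    meta, video = parse(unit)             # 2) XML parsing / validation
    meta, video = transcode(meta, video)  # 3) metadata processing + transcode
    blob = meta + video                   # 4) segmentation into packets
    return [blob[i:i + segment_size] for i in range(0, len(blob), segment_size)]

pkts = process_rtp([b"<u/>", b"\x00" * 3000],
                   parse=lambda u: (u[:4], u[4:]),       # toy split
                   transcode=lambda m, v: (m, v[:1500])) # toy layer drop
```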
Checking whether the metadata conforms to the given video XML grammar (a recursive operation) involves string recognition, for which NPs do not inherently provide any architectural support. Since NPs have a non-stack-based architecture, and implementing a recursive grammar requires explicit stack-handling operations, we have implemented customized stack-processing operations on the IXP2400.

Text Adaptation Service: The NP employed for this service performs caching and page re-authoring, in addition to processing the URL requests and responses. The URL Response processing consists of five stages: (i) extracting the web page from the URL response, (ii) caching the page, (iii) transcoding the page, (iv) generating a URL response with the transcoded page, and (v) delivering it to the client. The URL request processing consists of (i) checking if a cached page exists and (ii) servicing the request (i.e., URL Response generation) during a cache hit. If a cache miss occurs, a URL Request is generated to the server. The proposed transformation techniques are: (a) Outlining transform [19]: this technique reduces a paragraph with a heading into a single hyperlink, with the link text being a part of the heading. (b) First-line transform [19]: this technique reduces a paragraph without a heading into a single hyperlink
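Replacing recursion by a hand-managed stack, as done in the customized stack operations above, can be illustrated with a minimal tag-nesting check. This is a sketch only: the real microcode operates on memory words against the video XML grammar, whereas here a hypothetical tag set is validated in Python.

```python
# Tag-nesting check using an explicit stack instead of recursion,
# in the spirit of the customized stack handling on the IXP2400.
import re

def well_nested(xml_text):
    stack = []
    for m in re.finditer(r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>", xml_text):
        closing, name, selfclose = m.group(1), m.group(2), m.group(3)
        if selfclose:                 # <frame/> opens and closes itself
            continue
        if closing:                   # </tag>: must match the top of stack
            if not stack or stack.pop() != name:
                return False
        else:                         # <tag>: push onto the explicit stack
            stack.append(name)
    return not stack                  # everything opened must be closed

ok = well_nested("<unit><layer><frame/></layer></unit>")
bad = well_nested("<unit><layer></unit></layer>")
```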

with the link text being a part of the first few letters of the paragraph text. (c) Repetitive-structure elision transform [20]: this technique reduces one or more sequences of similar structures, like tables and lists, into a single hyperlink. These transformation techniques are based on building a page code tree. The process of generating this tree involves scanning the HTML code, reducing it into tokens, obtaining a node for each token, and ultimately building the page code tree. The pruned version of this tree corresponds to a transcoded web page or a sub-page. During URL Response processing, stage i represents 'packet reassembly', stage ii can be mapped as table processing, and stages iv and v are packet-generation operations. The transformation stage (stage iii) is further partitioned into three subtasks: tokenizer, tree generator and sub-page generator.

In this section, the first five stages of the evaluation methodology have been elaborated. The following section illustrates the NP-based design and implementation of the two application-layer services for content delivery.

V. DESIGN FOR IXP2400
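The outlining and first-line transforms both reduce a block to a single hyperlink; a toy version over (heading, paragraph) pairs is shown below. The node representation, anchor format and 20-character prefix are our illustrative choices, not the paper's page-code-tree format.

```python
# Toy version of two page re-authoring transforms: a (heading, paragraph)
# block becomes one hyperlink whose text comes from the heading (outlining),
# and a heading-less paragraph becomes a link made from its first few
# characters (first-line transform).
def outline(blocks, prefix_len=20):
    links = []
    for i, (heading, paragraph) in enumerate(blocks):
        text = heading if heading else paragraph[:prefix_len]
        links.append('<a href="#sub%d">%s</a>' % (i, text))
    return links

links = outline([("News", "a long story about ..."),
                 (None, "An unheaded paragraph of text")])
```

Each hyperlink would point at a sub-page generated later from the pruned page code tree, so the client downloads the full paragraph only on demand.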

Since established techniques for media and text transcoding, as mentioned in the previous section, have been chosen, designing for the IXP2400 NP involves identifying the various pipeline stages of the dataflow. The generic pipeline stages used in our case study are Ingress and Classifier at the input stage, and Packet Generator and Egress at the output stage. (i) Ingress: The media switch fabric interface of IXP2400 stores layer-2 'mpackets' in the receive buffers. The ingress module picks these mpackets from the buffers, assembles them into layer-3 (i.e., IP) packets using a state machine, and validates them. (ii) Classifier: This module adopts a content inspection mechanism to classify the packets into different flows for appropriate processing. (iii) Packet Generator: This module is responsible for generating layer-3 packets. (iv) Egress: It validates the packets and sends them into the network through the media switch fabric interface, after splitting them into mpackets.

The design of the two application-layer services for IXP2400 makes effective use of architectural features such as: scratch rings, to exploit hardware-supported communication between the pipeline modules; the Hash Unit, to store various 'data tables' so that retrieval time is linear; and the Content Addressable Memory, to store standard information like predefined port numbers for the protocols. The details of the design and the implementation of each of the two adaptation services for IXP2400 are given below:
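The generic Ingress -> Classifier -> ... -> Egress dataflow can be sketched in software, with the scratch rings between stages modelled as bounded FIFO queues. This is a functional model only; real scratch rings are hardware FIFOs, and the classifier function here is a deliberately trivial stand-in.

```python
# Functional model of the generic pipeline stages, with the scratch ring
# between microengines modelled as a bounded FIFO queue.
from collections import deque

def run_pipeline(mpackets, classify):
    ring = deque(maxlen=128)          # ingress -> classifier 'scratch ring'
    for frames in mpackets:           # Ingress: reassemble mpackets into
        ring.append(b"".join(frames)) # layer-3 packets
    flows = {}                        # Classifier: split packets into flows
    while ring:
        pkt = ring.popleft()
        flows.setdefault(classify(pkt), []).append(pkt)
    return flows                      # handed on to generator / egress

flows = run_pipeline([[b"GET ", b"/a"], [b"\x80video"]],
                     lambda p: "media" if p[0] == 0x80 else "text")
```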

A. Media adaptation - Design for IXP2400

The pipeline design of the media adaptation service for IXP2400, as shown in Fig. 2, constitutes the following three stages, apart from the common ones mentioned above. i) Multicast Request Processing: This module receives the IGMP and RSVP packets from the classifier module for


[Figure: the media adaptation pipeline on IXP2400 - classifier, request handler, packet reassembly, Parser and Transcoder modules assigned to microengines, with the multicast, capability and Trans Rate tables in SRAM, packet data in DRAM, and scratch rings connecting the stages through to Egress.]

Fig. 2. Data pipeline for Media adaptation on IXP2400

[Figure: the text transcoding pipeline on IXP2400 - URL request and response processing, Tokenizer, Tree Generator, Subpage Generator and Response Generator modules assigned to microengines, with the HTML code tree and web page cache in memory and queues leading to Egress.]

Fig. 3. Data pipeline for Text transcoding on IXP2400

processing join/leave commands and capability commands. Accordingly, the multicast table is populated with multicast client information, and the capability table is populated with the frame rate and resolution information.

ii) Feedback Processing: The feedback information in the RR reports of the RTCP packets is extracted to populate the Trans Rate Table with information regarding the network state along the path to the client.

iii) Media Processing: This module deals with processing the media (i.e., RTP) packets and transcoding the video based on client capability and congestion information. The Content Adaptation module performs selective frame dropping and RTP reassembly. It uses the Parser module for XML validation, the Transcoder module for identifying the video layers for dropping, and the Packet Segmentation module for updating the metadata and packetizing the video data. The Transcoder module operates by referring to the Trans Rate table and the Capability table, and follows conventional approaches, such as interleaved dropping of frames and preferring enhancement layers over base layers, to ensure reduced jitter and better quality.

Implementation on IXP2400: In order to ensure a balanced pipeline design, based on observations made during implementation, the monolithic media adaptation module has been split across three microengines. This is discussed in Section VI using Table III. Since the IXP2400 ISA has no support for string or character processing, for the modules that involve XML processing, the tokenizer has been implemented as an auxiliary function, namely getToken(), using a DFA (Deterministic Finite Automaton). Further, since XML allows tags of the same level to be ordered in different ways, our implementation supports 'unordered tagging'. The number of live registers required for media processing is beyond the number offered by the IXP2400 microengines.

B. Text adaptation - Design for IXP2400

The text transcoding implementation, when compared to media transcoding, requires fewer resources. The design consists of the following service-specific modules, apart from ingress, egress, classifier and packet generator: (i) Packet Assembler, (ii) Transcoder and (iii) Response Generator for URL Response processing, and (iv) Request Handler (along with the Response Generator) for URL Request processing. The data pipeline of the functional modules, with the associated memory resources and microengine units, is shown in Fig. 3. The pipeline has deliberately not been balanced, in order to study the effects of an unbalanced pipeline.

(i) Packet Assembler: The IP packets are assembled (Packet Data Buffer) and the application data (i.e., the web content) is extracted.

(ii) Transcoder: The application data (i.e., the HTML code) is transcoded using the following sub-modules. (a) Tokenizer: the HTML page is parsed and tokenized using a state diagram. (b) Tree generator: the tokens are used to generate the page code tree, and the root node address of the page code tree and its associated URL are stored in the Web Page Cache Table. (c) Sub-page generator: the cached page code tree is pruned and used to generate the transcoded web page.

(iii) Response Generator: This module generates the URL response packets, using the Page Generator to generate the HTML page for the client and the Packet Generator to construct the IP packet for the generated page.

(iv) Request Handler: The IP packets containing the URL Requests originating from the clients are handled here. The URL is first extracted from the packet and enqueued in the Request Queue for processing by the URL Handler.
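The getToken()-style tokenizer described above can be illustrated with a small DFA over HTML input. This is a sketch of the idea only: the actual routine is microcode operating on memory words, whereas this version walks a Python string and uses our own two-state naming.

```python
# Sketch of a getToken()-style DFA that splits HTML into TAG and TEXT
# tokens, in the spirit of the tokenizer implemented in microcode.
def get_tokens(html):
    tokens, state, buf = [], "TEXT", []
    for ch in html:
        if state == "TEXT":
            if ch == "<":                         # DFA transition TEXT -> TAG
                if buf:
                    tokens.append(("TEXT", "".join(buf)))
                buf, state = ["<"], "TAG"
            else:
                buf.append(ch)
        else:                                     # state == "TAG"
            buf.append(ch)
            if ch == ">":                         # DFA transition TAG -> TEXT
                tokens.append(("TAG", "".join(buf)))
                buf, state = [], "TEXT"
    if buf and state == "TEXT":                   # trailing text token
        tokens.append(("TEXT", "".join(buf)))
    return tokens

toks = get_tokens("<p>hi</p>")
```

Each emitted token would become one node of the page code tree built by the tree generator.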


Implementation on IXP2400: Depth-first search techniques have been adopted for page code tree generation. Memory for the different data structures has been allocated appropriately. DRAM is characterized by larger size and high latency; hence the Packet Data Buffer, the Page Code Buffer, the Web Page Cache and the nodes of the page code tree are stored in DRAM. The Request Enqueuer stores the requested URL temporarily in SRAM, since URLs are long and are frequently referenced. Scratch memory has been used to implement circular queues for inter-thread communication. Inter-process communication is achieved by employing queues implemented as scratch rings and by using next-neighbor registers.

VI. EVALUATION OF NP FOR APPLICATION-AWARENESS

Each of these services has been built upon the standard IP forwarder application of a typical router, available in the IXP2400 development environment. All the services have been coded in microcode, the assembly language for the IXP2400. Simulations have been carried out on the Intel Developer Workbench. The Intel IXP2400 network processor that we have used for evaluation is housed on the Radisys ENP2611 board. This board is connected to the PCI slot of a host processor and has a separate Ethernet interface and three fiber-optic gigabit interfaces. Communication between the board and the host system is established through the serial communications port. The UcLo library shipped with the IXA SDK is used to write the driver program for loading the microengine code into the microstore, and for developing a management interface on the XScale core, which controls the microengines. After verifying the simulation, the code has been ported onto the IXP2400 mounted on the ENP2611 board, for which modules have been developed using the ENP2611 SDK that comes with the board. Based on the experiments carried out with this set-up, we evaluate the NP for application-aware processing along three dimensions: feasibility, scalability and scope.

A. Feasibility

We ascertain the feasibility of performing application-aware operations on the NP by (i) highlighting the techniques / tasks adopted to map these operations onto NPs, and (ii) indicating the availability of resources on the NP to house these operations.

(i) Mapping an application-aware service onto the NP: This involves identifying tasks in the application-aware operations, representing them as network-processing functions, determining the resource requirements (CPU, memory, synchronization), and assigning processors and threads. Proper mapping leads to a suitable data pipeline design. The modules described in the design of each example in Section V have been obtained in this manner. We find that a mapping was

TABLE I
MICROENGINE ASSIGNMENT AND UTILIZATION

Microengine        Media Adaptation        Text Adaptation          Standard IP
(Cluster#:MEid,    Service                 Service                  Forwarder
8 threads each)
0:0                Ingress      39.37%     Ingress          42.74%  Ingress    35.61%
0:1                Classifier   24.23%     Classifier       44.74%  Forwarder  34.81%
0:2                Media Proc0  43.04%     Request Handler  45.19%  Not used
0:3                Media Proc1  43.20%     Packet Assembler 68.00%  Not used
1:0                Media Proc2  47.27%     Transcoder       72.20%  Not used
1:1                Transcoder   48.63%     Page Generator    9.17%  Not used
1:2                Parser       25.63%     Packet Generator 41.46%  Not used
1:3                Egress       30.85%     Egress           68.09%  Egress     40.67%

feasible, and a data pipeline has been obtained for each case (Figs. 2, 3). Even media transcoding, a resource-hungry operation, has been mapped onto network-processing operations by making suitable adaptations: metadata representation of the content, and transcoding that eliminates certain finer details of the media object rather than converting it into another form. Further, an efficient mapping requires the design of a balanced pipeline. The pipeline designed for the media adaptation service is a well-balanced one, as discussed in Section V. Suitable architectural features of the NP have been incorporated in the design and implementation stages: for instance, the use of Scratch memory, which supports atomic read-modify-write operations, for queue maintenance; and the use of next-neighbor registers, which form a dedicated data path between successive microengines, for conflict-free inter-process communication.

(ii) Accommodating the application-aware service on an IXP2400-based router: Table I shows the assignment and utilization of the various functional units of each service on the microengines of the NP. The first column indicates the Microengine Cluster# and the Microengine ID (cluster#:MEid). The next three columns indicate the network-processing operation assigned to the microengine (as described in the design) and the percentage utilization of that microengine (as obtained from the simulator) for the three services, namely Media Adaptation (column two), Text Adaptation (column three) and the standard IP forwarder (column four). From the table it is clear that the standard IP forwarding application (i.e., the fourth column) has five free microengines (i.e., not used). These microengines have been used to provide the additional services. The utilization of most of the modules is less than 50% (i.e., 35.61% for ingress, 34.81%

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on February 12, 2010 at 18:41 from IEEE Xplore. Restrictions apply.

TABLE II
EXECUTION CYCLES TO PROCESS A GIVEN PACKET (AT 1 GB/S)

Text Adaptation:
  Functional Module     Execution Cycles
  Classifier                   400
  Packet Assembler            1400
  Request Handler            14200
  Transcoder                 73900
  Page Generator              3400
  Packet Generator           17400

Media Adaptation:
  Functional Module                                Execution Cycles
  Classifier                                              600
  Multimedia Packet Processing
    (Transcoder, Parser and Media Proc 0, 1, 2)        340000

TABLE III
MEMORY USAGE

  Memory                  Media Adaptation   Text Adaptation   Standard IP
                          Service            Service           Forwarder
  DRAM (Longwords)        1310720            1426152           1310720
  SRAM (Longwords)        135168             864               1024
  Scratch (Longwords)     5120               1280              2048
  Receive Buff (Bytes)    965536             65536             65536
  Transmit Buff (Bytes)   965536             65536             65536

Fig. 4. Efficiency Vs Line rate at different clock speeds (800 MHz and 1200 MHz) for Text Adaptation Service (URL Request only)

for Forwarder and 40.67% for Egress) and hence indicates 'headroom' available for more computation, if memory and processor tasks are well interleaved. The unused microengines and threads have been used in developing the Media Adaptation service (second column) and the Text Adaptation service (third column), and the details are indicated in Table I.

Table II gives the module-wise packet processing time, indicating the degree of the pipeline's balance. In the Text Adaptation Service, the Transcoder module consumes more time than the other modules, while in the Media Adaptation Service the media-processing module consumes most of the time spent by a packet in the system.

Table III shows the amount of memory used. A major chunk of the DRAM space (1310720 longwords) has been utilized to store the incoming and yet-to-be-processed packets. The Text Adaptation Service uses extra space for table maintenance, including the cache (1426152 longwords). On the other hand, tables for the Media Adaptation service are stored in SRAM, as seen from the figures for SRAM usage. Out of the twelve scratch rings of up to 1024 bytes each available in the IXP2400, the Media Adaptation service makes use of eight (4 x 1024 B and 4 x 256 B), and the Text Adaptation service makes use of ten (10 x 128 B).

Thus, feasibility in terms of mapping the app-layer services onto the NP architecture and in terms of resource utilization has been established, and we find that many modules with more computation can be accommodated.

Fig. 5. Efficiency Vs Line rate at different clock speeds for Text Adaptation Service (URL Request and Response)

B. Scalability

We discuss scalability in terms of the efficiency of the system at varying line rates and clock speeds (Fig. 4, 5 and 6). The graphs show the efficiency of the system plotted against the line rate. The experiments were conducted for two different processor clock speeds: 800 and 1200 MHz. Fig. 4 has been obtained by considering URL request traffic alone in the Text Adaptation application, while Fig. 5 takes both the URL request and the response traffic, thus activating all stages of the system. It is expected that the efficiency increases up to 1 Gbps, the line rate for which the IXP2400 has been designed, and beyond this the efficiency should drop.

The variation of efficiency for the different services in the graphs of Fig. 4, 5 and 6 can be attributed to the balance or imbalance of the design pipeline. In the text adaptation service (Fig. 4 and 5), the pipeline is unbalanced, leading to drastic variation of the efficiency for small variations of line speed. In the case of the media adaptation service (Fig. 6), however, the efficiency is almost the same for varying line rates below 1 Gbps.


Fig. 6. Efficiency Vs Line rate at different clock speeds for Media Adaptation Service

This is expected, since the IXP2400 architecture is aimed at a line rate of 1 Gbps. It is evident that the pipeline design plays a crucial role in determining the scalability of the system.

C. Extensibility

We discuss extensibility in terms of the ability to accommodate more services on the NP. In the case of the text adaptation service, an entire microengine is free to accommodate other services. The media adaptation service, on the other hand, occupies all the microengines available in the NP; however, the NP may be used as an attached NP, or in conjunction with other NPs, to implement the media content adaptation functionality at higher line rates. In both services, the classifier may be augmented with the classification requirements of other applications as well. In terms of memory requirements, it has been shown that free scratch rings are available in each of the services and may be utilized. The DRAM memory used to store the packets is constant for a given target line rate and need not be increased when augmenting the NP with additional services; since this component makes up a large part of the memory requirement, sharing the DRAM across services is a significant advantage. Thus it is possible to incorporate additional functionality by proper design and use of the remaining resources.

D. Scope and Observations

We make the following inferences from the above study.

(i) Implementing complex application-aware operations on NPs is not necessarily difficult. By complex, we refer to the effort required to translate application-aware operations into network-processing operations such as classification, queue management, table lookup, header processing, packet handling, hashing, and scheduling. NPs architecturally support network-processing operations for on-the-fly processing, and once the application-aware operations are represented as network-processing operations, they fit well on an NP. Further, the assignment of processors and threads must balance the pipeline so as to ensure on-the-fly processing, as observed in the media adaptation service; the text adaptation service has shown that performance is affected if the data pipeline is not balanced. Secondly, we have shown that even media transcoding, a resource-hungry operation, can be mapped onto network-processing operations, of course with certain adaptations - metadata description at the server and capability propagation along the multicast tree.

(ii) More than one application-aware service can be deployed on a given NP. NPs, unlike ASICs, need not be 'single-application-specific'. Our media adaptation example shows that an IXP2400 NP that performs normal IP forwarding has enough resources to accommodate media adaptation and XML processing. The text adaptation service, on the other hand, has a very light load and free resources to accommodate another service. As a continuation of this work, we have incorporated a security service into the text adaptation service, which we have not presented here.

(iii) Dynamic adaptation to app-layer policy or network conditions. Any change in the application-level policy or decision rules can be reflected dynamically in the application-aware operations implemented on the NP. Similarly, dynamic adaptation to network conditions can be achieved. The dynamic adaptation may be incorporated by (a) on-the-fly updating of tables (e.g., policy tables) to reflect the changed condition, or (b) on-the-fly updating of the code that is running on the NP, thereby changing the service itself, as shown in one of our earlier works [12]. The first approach has been demonstrated in the media adaptation service, where the transcoding-rate table is dynamically updated based on the receiver reports of RTCP packets.

This gives an idea of the scope, in terms of the significance and advantages, of porting an application onto an NP.

VII. CONCLUSION

In this paper, we have examined the shift from application-related network processing to application processing in the network. We have further examined how NPs can support and fuel this shift by evaluating the design and implementation of a popular application, namely content delivery. We have illustrated with a case study that heavy-duty application-layer services can be mapped onto the NP, at the cost of high resource consumption. Further, in order to house such services on the NP architecture, proper mapping and the design of a balanced pipeline are essential. Even app-layer services that the NP architecture does not explicitly support can be mapped. Thus, our experiments using the IXP2400 show that the NP architecture, if exploited well, can support full-fledged application-layer services along with the networking services, making AANs a reality.

It is to be noted that this is just the beginning of the immense possibilities of using NPs in AANs. Our evaluation of the app-awareness of NPs in terms of scope, feasibility and extensibility has led to a few noteworthy observations that we foresee will open new avenues for research and experimentation.

REFERENCES

[1] Zia Iqbal, Deep Content Inspection: Beyond Deep Packet Inspection, White Paper, Barbewire Technologies, Nov. 2003.
[2] Israel L'Heureux, Beyond Deep Packet Inspection: Application Front Ends for the Webified Data Center, White Paper, Redline Networks, Apr. 2004.
[3] John W. Lockwood, James Moscola, Matthew King, David Reddick and Tim Brooks, Internet Worm and Virus Protection in Dynamically Reconfigurable Hardware, Military and Aerospace Programmable Logic Device (MAPLD), 2003.
[4] Young H. Cho, Shiva Navab, and William H. Mangione-Smith, Specialized Hardware for Deep Network Packet Filtering, International Conference on Field Programmable Logic and Applications (FPL), Montpellier, France, Sep. 2002.
[5] Young H. Cho and William H. Mangione-Smith, A Pattern Matching Coprocessor for Network Security, Proceedings of the 42nd Annual Conference on Design Automation, 2005.
[6] Haiyong Xie, Li Zhou, Laxmi Bhuyan, An Architectural Analysis of Cryptographic Applications for Network Processors, IEEE First Workshop on Network Processors, 2002.
[7] Tao He, Jiang Liu, Shijin Kong, Zhichun Li, Generic Network Traffic Capture Platform Building on Network Processor, IEEE INFOCOM, 2005.
[8] R. Ramaswamy and T. Wolf, PacketBench: A Tool for Workload Characterization of Network Processing, IEEE International Workshop on Workload Characterization (WWC-6), pp. 42-50, Oct. 2003.
[9] G. Memik, W. H. Mangione-Smith, W. Hu, NetBench: A Benchmarking Suite for Network Processors, International Conference on Computer-Aided Design (ICCAD), 2001.
[10] B. K. Lee and L. K. John, NpBench: A Benchmark Suite for Control Plane and Data Plane Applications for Network Processors, 21st International Conference on Computer Design, 2003.
[11] Darwin Project: http://www.cs.cmu.edu/~darwin/
[12] Sharmila R., LakshmiPriya M. V. and Ranjani Parthasarathi, An Active Framework for a WLAN Access Point Using Intel's IXP1200 Network Processor, HiPC 2004, Bangalore.
[13] Radware Security Switches: www.radware.com
[14] Young-Ho Kim and Jeong-Nyeo Kim, Design of Firewall in Router Using Network Processor, 7th International Conference on Advanced Communication Technology (ICACT), 2005.
[15] Ada Gavrilovska, Kenneth Mackenzie, Karsten Schwan, and Austen McDonald, Stream Handlers: Application-Specific Message Services on Attached Network Processors, 10th Symposium on High Performance Interconnects, 2002.
[16] Robert Haas, Clark Jeffries, Lukas Kencl, Andreas Kind, Bernard Metzler, Roman Pletka, Marcel Waldvogel, Laurent Frelechoux, and Patrick Droz, Creating Advanced Functions on Network Processors: Experience and Perspectives, IBM Zurich Research Laboratory, IEEE Network, July-Aug. 2003.
[17] Brian T. Gold, Anastassia Ailamaki, Larry Huston, Babak Falsafi, Accelerating Database Operators Using a Network Processor, Proceedings of the 1st International Workshop on Data Management on New Hardware, 2005.
[18] IXP2400 Hardware Reference Manual, Intel Corporation, June 2001.
[19] Yonghyun Hwang, Eunkyong Seo, and Jihong Kim, WebAlchemist: A Structure-Aware Web Transcoding System for Mobile Devices, ACM Mobile Search Workshop, May 2002, Honolulu, Hawaii, USA.
[20] Yonghyun Hwang, Jihong Kim, and Eunkyong Seo, Structure-Aware Web Transcoding for Mobile Devices, IEEE Internet Computing, 2003.
[21] Intel IXP1200 Network Processor: www.intel.com/design/network/products/npfamily/
[22] Request for Comments: http://www.ietf.org/rfc.html
[23] A. Gavrilovska, S. Kumar, K. Schwan, The Execution of Event-Action Rules on Programmable Network Processors, OASIS 2004 (held with ASPLOS-XI), Oct. 2004.
[24] Srikanth Sundaragopalan, Ada Gavrilovska, Sanjay Kumar, Karsten Schwan, An Approach Towards Enabling Intelligent Networking Services for Distributed Multimedia Applications, IMMCN 2005.
[25] Thiemo Voigt, Renu Tewari, Douglas Freimuth, and Ashish Mehra, Kernel Mechanisms for Service Differentiation in Overloaded Web Servers, USENIX Annual Technical Conference, Boston, MA, USA, June 2001.
[26] Gabriel Panis, Andreas Hutter, Jörg Heuer, Hermann Hellwagner, Harald Kosch, Christian Timmerer, Sylvain Devillers, Myriam Amielh, Bitstream Syntax Description: A Tool for Multimedia Resource Adaptation within MPEG-21, Signal Processing: Image Communication, 2003.
