The Journal of Systems and Software 80 (2007) 972–983 www.elsevier.com/locate/jss
Adaptive network QoS in layer-3/layer-2 networks as a middleware service for mission-critical applications
Balakrishnan Dasarathy *, Shrirang Gadgil, Ravi Vaidyanathan, Arnie Neidhardt, Brian Coan, Kirthika Parmeswaran, Allen McIntosh, Frederick Porter
Applied Research, Telcordia Technologies, One Telcordia Drive, Piscataway, NJ 08854, United States
Available online 13 November 2006
Abstract
We present adaptive network Quality of Service (QoS) technology that provides delay bounds and capacity guarantees for traffic belonging to mission-critical tasks. Our technology uses a Bandwidth Broker to provide admission control and leverages the differentiated aggregated traffic treatment provided by today’s high-end COTS layer-3/2 switches. The technology adapts to changes in network resources, work load and mission requirements, using two components that are a particular focus of this paper: Fault Monitor and Performance Monitor. Our technology is being developed and applied in a CORBA-based multi-layer resource management framework. © 2006 Elsevier Inc. All rights reserved.
Keywords: Bandwidth brokering; Admission control; Layer-3/layer-2 networks; Real-time middleware
1. Introduction A new generation of distributed real-time and embedded (DRE) middleware is needed to address the performance needs of mission-critical military applications. Current capabilities are largely limited to fixed static allocation of resources in support of predefined mission capabilities. A static allocation strategy limits the ability of a military application to adapt to conditions that vary from the original system design. Dynamic resource management systems can adapt to changes in mission requirements, workload distributions, and available resources, including resource reduction caused by fault conditions. We focus on dynamic resource management for network resources. We are integrating and validating our adaptive
A preliminary version of this paper appeared in RTAS 2005. See Dasarathy et al. (2005). This work is supported by DARPA Contract NBCH-C-03-0132; approved for Public Release, Distribution Unlimited.
* Corresponding author. Tel.: +1 732 6992430; fax: +1 732 3367015. E-mail addresses: [email protected] (B. Dasarathy), [email protected] (B. Coan).
doi:10.1016/j.jss.2006.09.030
network Quality of Service (QoS) solution as part of a Multi-Layer Resource Management (MLRM) framework (Lardieri et al., 2006) being created by the DARPA Adaptive and Reflective Middleware Systems (ARMS) program using CORBA middleware and component technology. The purpose of the MLRM architecture is to push middleware technologies beyond current commercial capabilities, especially in their ability to detect mission-impacting events and adapt in a timely manner. It is being applied to shipboard computing. The goals of our adaptive network QoS solution are to guarantee a required minimal level of QoS for mission-critical traffic, provide a simple way to express QoS needs, detect and adapt to adverse events, and optimize overall network use. They are achieved with a Bandwidth Broker (BB) that provides admission control of application packet flows into various traffic classes. Admission control ensures that a flow of a given class has enough available capacity. For a delay-sensitive flow, having enough capacity means that the flow complies with an off-line computed occupancy bound on each link on its path. This compliance check ensures an upper bound on delay for this flow and previously admitted flows. The solution leverages widely available mechanisms
that support layer-3 DiffServ (Differentiated Services) and layer-2 CoS (Class of Service) features in commercial routers and switches for enforcement. This paper is organized as follows. Section 2 provides an overview of the MLRM middleware framework and explains how our QoS technology fits in. Section 3 describes the networks of interest. In Section 4, we describe the Bandwidth Broker and our overall QoS architecture. Section 5 describes the two QoS feedback mechanisms, Fault Monitor and Performance Monitor. In Section 6, we explain how the Bandwidth Broker performs policy-driven mode changes. Section 7 presents our experimental results. Section 8 compares and contrasts our work with the work reported in the literature. Section 9 concludes the paper.
2. MLRM middleware framework The ARMS MLRM (Lardieri et al., 2006) is a framework for multi-layer resource management whereby complex resource allocation and scheduling can be handled in a divide-and-conquer manner. The goal of the MLRM framework at its highest layer is to maximize mission coverage. The framework supports the incorporation of different algorithms at different layers in a plug-and-play manner using the CORBA component and middleware technology, specifically the CIAO (Wang et al., 2003; http://www.cs.wustl.edu/~schmidt/CIAO-intro.html) implementation for C++ built over a C++ real-time ORB, TAO (http://www.cs.wustl.edu/~schmidt/TAO.html), and OpenCCM (http://openccm.objectweb.org/) with JacORB (http://www.jacorb.org/), a Java ORB, supporting development in Java. One may use a utility function to formulate an optimization problem in a particular layer. For example, a utility function may penalize heavily if the timeliness of a mission-critical function cannot be met. The MLRM framework supports multiple QoS dimensions, such as survivability, timeliness (hard and soft), security, and efficient resource utilization. A key assumption in MLRM is that
the level of service in one QoS dimension can be coordinated with and/or traded off against the levels of service in other dimensions. The goal of the framework, furthermore, is to enable rapid deployment of applications, monitoring of QoS in different dimensions, and rapid redeployment of applications if their QoS is being violated. MLRM is a federated resource management middleware service and it is a layer in the software architecture sandwiched between the network/operating system and the application layer. The MLRM resource management hierarchy, as shown in Fig. 1, comprises three layers: Services Layer, Resource Pool Layer and Physical Resource Layer. Each layer has allocation, scheduling, management or configuration functions as well as feedback functions. • Services Layer: The Services Layer receives explicit resource management requests from applications along with command and policy inputs. Two key allocation, scheduling, management or configuration components at this layer are Infrastructure Allocator (IA) and Operational String Manager Global (OSM Global). The IA component provides coarse-grained global resource allocation. It assigns applications or operational strings to resource pools taking into account their inter-pool communication needs using the Bandwidth Broker. An operational string, commonly known as a task in real-time computing, is a sequence of applications that interact to provide a service satisfying certain QoS requirements. A pool is a collection of resources often determined by factors such as physical proximity and type (e.g., processors in a data center). The OSM Global component coordinates deployment of operational strings across resource pools. • (Resource) Pool Layer: The Pool Manager (PM) uses multiple Resource Allocators (RAs) to assign applications to computing nodes, taking into account their intra-pool communication needs.
The OSM Pool Agent component is responsible for managing operational substrings (assigned to a pool), monitoring and controlling the applications within each operational substring.
Fig. 1. MLRM middleware framework.
• Physical Resources Layer: The Physical Resources Layer deals with the specific instances of resources in the system. Each Node Provisioner (NP) handles management and provisioning of an individual resource, specifically a host resource to run applications allocated to the host. It configures OS process priorities and scheduling classes of deployed components across a variety of operating systems (e.g., Linux and VxWorks). • Resource Status Service (RSS): RSS operates across all these layers and provides continuous feedback on the status of non-network resources toward determining how well the QoS concerns are being met by applications and operational strings. RSS is classified into Global RSS, Pool RSS and Node RSS based on the layer at which it operates. At the simplest Node level, RSS consists of processor and process failure detectors. At higher layers, RSS consists of condition monitors and detectors to monitor for and detect QoS violations. A violation can be set to be triggered in the broader context of mission importance and policy directives. The Bandwidth Broker (BB) component is the main entity responsible for managing network QoS. The BB component is shown in the Resource Pool Layer, as the network is collectively viewed as a resource pool of network elements. Its services are used by the RA and IA components for managing the network resources in a coordinated fashion within a pool (e.g., data center) and across pools (e.g., between data centers), respectively. The Bandwidth Broker invokes Flow Provisioner to provision and configure routers/switches at the Physical Resource Layer. The Bandwidth Broker also interacts with two network feedback components at the pool layer, Network Fault Monitor and Network Performance Monitor. Finally, the BB component also provides feedback to the PM and OSM Global components on network-related QoS problems. This paper is about these network QoS components. 
The forward control flow among the MLRM components is as follows: After the allocation decision, the IA component invokes the OSM Global which then invokes the OSM Pool Agent. The OSM Pool Agent component coordinates allocation of resources within a pool of host resources through the PM component. The PM component then invokes the NP for provisioning node resources (e.g., setting operating system process priority). The IA and PM (through RA) components invoke the BB to determine the availability of, reserve and provision network resources across pools and within a pool, respectively. The RA component also performs a schedulability analysis for timing compliance of critical operational strings at the pool level. Based on the timing analysis results, a pool-level reallocation may need to be coordinated by the PM component. The PM component, in turn, may have to defer to the IA component for a reallocation decision across pools. In Fig. 1, we show status propagation using two-way arrows as it could be done using a synchronous request-reply or
an event propagation paradigm. The coordination among the various MLRM components is described in more detail in Lardieri et al. (2006). The paper also illustrates the MLRM framework dynamism using load and mission change scenarios. We also refer the reader to that paper for details on an empirical evaluation of the MLRM architecture.
3. Network architecture The network architecture illustrated in Fig. 2 consists of four pools, each served by three access (edge) switches. The access switches within a pool are fully meshed and across pools are partially meshed. The rich connectivity among the switches (numbering in the tens) is to enable continued operation in several catastrophic situations at an acceptable cost. The illustrative architecture shown is a specialized case of a robust wireline, enterprise network architecture. Increasingly, there is only one (IP) network in enterprises that carries all the traffic – data, voice, and video. The network carries both point-to-point traffic (e.g., synchronous RPC, voice over IP calls) and multipoint traffic (e.g., pub/sub, announcements, broadcast video). Our network QoS component design and implementation is generic for these layers and is capable of handling a variety of enterprise network configurations, including all layer-3 and all layer-2 with layer-3 awareness at the edges. Typically, high-end gigabit Ethernet switches such as the Cisco 6500 switches we used can operate either at layer-3 or at layer-2 based on configuration settings. Our network QoS design is also generic with respect to the topology it can support. However, our experimental studies are done using the cluster architecture shown in Fig. 2.
4. Network QoS components Fig. 3 illustrates our network QoS component architecture. In Fig. 3, we make use of CORBA Component Model (CCM) notation to describe the interactions among the components. The four major components of the QoS management architecture are: (1) Bandwidth Broker, (2) Flow Provisioner, (3) (Network) Performance Monitor and (4) (Network) Fault Monitor. Our network QoS components provide adaptive admission control that ensures there are adequate network resources to match the needs of admitted flows. The Fault Monitor and Performance Monitor are the two feedback mechanisms in support of this adaptive behavior.
4.1. Bandwidth Broker The functions provided by the Bandwidth Broker to other MLRM components are:
Fig. 2. Network of interest.
Fig. 3. QoS components and their interactions.
• Flow Admission: Reserve, commit, modify, and delete flows in support of allocation and scheduling for use by the PM and IA components. • Queries: Provide information about bandwidth availability in different classes among pairs of pools and subnets in support of coarse-level allocation of processes to processors for use by the IA component. • Events: Provide notification of high-level QoS-affecting events (e.g., the Bandwidth Broker’s inability to meet QoS of a previously admitted flow because of a network
fault, repeated deadline violations on a flow, inability to provision a switch for desired QoS) for use by the OSM Global and PM components. • Bandwidth Allocation Policy Changes: Adapt existing and future bandwidth reservations in support of mission mode changes for use by the IA component. (See Section 6.) The Bandwidth Broker leverages DiffServ (Blake et al., 1998; Nichols et al., 1998) in layer-3 and CoS mechanisms
in layer-2 network elements to provide end-to-end QoS guarantees. Transitions between DiffServ and CoS are transparent to end users. CoS mechanisms provide functionality at layer-2 similar to what DiffServ mechanisms provide at layer-3. Layer-2 CoS support is somewhat restrictive, however. Layer-2 supports a 3-bit Class of Service (CoS) marking, or eight classes, as opposed to the 6-bit Differentiated Services Code Point (DSCP) with potentially 64 different classes. Moreover, CoS has limited support mechanisms for scheduling and buffer management. The DiffServ and CoS features are typically implemented in software and in ASIC (Application-Specific Integrated Circuit) hardware, respectively. They both provide aggregated traffic treatment throughout the network and per-flow treatment at the network ingress. DiffServ and CoS features by themselves are insufficient to guarantee end-to-end network QoS, because the traffic presented to the network must be made to conform to the network capacity. We need admission control that ensures there are adequate network resources to match the needs of admitted flows. 4.1.1. Path discovery in layer-3/layer-2 networks To do its admission control job, the Bandwidth Broker needs to be aware of the path that will be traversed by each flow, track how much bandwidth is being committed on each link for each traffic class, and estimate whether the traffic demands of new flows can be accommodated. Our bandwidth tracking is for both capacity assurance and deadline assurance. Capacity assurance makes sure that there is enough bandwidth on a link. Deadline assurance ensures that the occupancy of all of the deadline-sensitive flows traversing a link is kept low enough so that the worst-case burst, and thus the delay bound on the link, will hold. (See Section 4.1.3.) As tracking bandwidth on links is an important aspect of the Bandwidth Broker, path discovery is a major functional component of the Bandwidth Broker.
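To make this bookkeeping concrete, the following is a minimal sketch of per-link admission control: a flow is admitted only if the committed bandwidth of its class, on every link of its discovered path, stays within the off-line computed occupancy bound. This is an illustration under assumed names, not the Bandwidth Broker’s actual API.

```python
class LinkState:
    """Per-link state tracked for admission control (illustrative)."""

    def __init__(self, capacity_bps, occupancy_bound):
        self.capacity_bps = capacity_bps        # service rate C of the link
        self.occupancy_bound = occupancy_bound  # class -> off-line computed bound a_k
        self.committed_bps = {}                 # class -> bandwidth already admitted

    def can_admit(self, traffic_class, rate_bps):
        used = self.committed_bps.get(traffic_class, 0.0)
        bound = self.occupancy_bound[traffic_class] * self.capacity_bps
        return used + rate_bps <= bound

    def commit(self, traffic_class, rate_bps):
        self.committed_bps[traffic_class] = (
            self.committed_bps.get(traffic_class, 0.0) + rate_bps)


def admit_flow(path_links, traffic_class, rate_bps):
    """Admit a flow only if every link on its discovered path complies."""
    if all(l.can_admit(traffic_class, rate_bps) for l in path_links):
        for l in path_links:
            l.commit(traffic_class, rate_bps)
        return True
    return False
```

Note that rejection leaves all links untouched: the check is made on every link before any bandwidth is committed, so a partially reserved path never occurs.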
Path discovery finds out which network links are used by a flow of traffic between two hosts. 4.1.1.1. Layer-3 path discovery. There are two approaches to layer-3 path discovery. Active techniques introduce packets into the network and generally use Internet Control Message Protocol (ICMP) mechanisms to determine the path taken by the introduced packets. Passive techniques typically rely on the monitoring of layer-3 routing tables. Active Layer-3 Path Discovery (traceroute): Traceroute, a widely used active path discovery technique in layer-3, relies on the Time-to-Live (TTL) field in the IP header and ICMP error messages to track the hop-by-hop IP path between source and destination. Essentially, ICMP or UDP packets with TTL values of 1, 2, . . . are sent from the source to the destination. The IP TTL field is decremented at each IP hop. When the TTL value reaches 0, the IP router at that hop originates an ICMP Time Exceeded error message (ICMP type 11) to the source. The traceroute program reconstructs the hop-by-hop IP path from these ICMP error messages.
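The TTL mechanism can be illustrated with a small in-process simulation (no packets are sent; a real traceroute needs raw sockets and elevated privileges). Each hop decrements the TTL; the hop where it reaches 0 answers with Time Exceeded (type 11), and the destination answers a UDP probe with Port Unreachable (type 3):

```python
def probe(routers, dest, ttl):
    """Simulate one probe: return (responder, icmp_type)."""
    for hop in routers:
        ttl -= 1
        if ttl == 0:
            return hop, 11        # ICMP Time Exceeded at this router
    return dest, 3                # destination reached: Port Unreachable


def traceroute(routers, dest, max_ttl=30):
    """Reconstruct the hop-by-hop path from the ICMP responses."""
    discovered = []
    for ttl in range(1, max_ttl + 1):
        responder, icmp_type = probe(routers, dest, ttl)
        discovered.append(responder)
        if icmp_type == 3:        # destination answered; path is complete
            break
    return discovered
```

For a path of routers r1, r2, r3 to host-b, probes with TTL 1 through 4 elicit answers from r1, r2, r3 and finally host-b, reproducing the path in order.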
Passive Layer-3 Path Discovery: Passive techniques rely on the monitoring of layer-3 routing tables. One of the most effective ways to achieve this is to passively participate as a peer in the layer-3 link state routing protocol (e.g., OSPF or IS-IS). The Bandwidth Broker can peer with link state routing protocols and receive and reconstruct the link state topology of the network in real-time, just like any other link state router in the network. The peering arrangement enables the Bandwidth Broker to obtain routing information as quickly as network routers. The peering arrangement does not, however, mean that the Bandwidth Broker offers any routes or plays any role in packet forwarding. This technique eliminates the need to do traceroute. Equal-Cost Multi-Path (ECMP): Layer-3 path discovery mechanisms (both active and passive) work well in cases where a single best path is available between a source and destination. When there is more than one equal-cost path from a source to a destination, COTS routers support a feature known as Equal-Cost Multi-Path (ECMP). With ECMP, routers balance the traffic load on multiple equal-cost paths between two points. Typically, this type of load balancing is done in a way that keeps the packets of the same flow together (in order to minimize packet re-ordering). In general, with passive layer-3 path discovery, it is difficult to predict the specific path that a flow will use, since this depends largely on the vendor’s ECMP implementation. Further, depending on the vendor’s ECMP implementation, active techniques such as traceroute may or may not be able to discover the specific best cost path used for a specific flow. For instance, ICMP packets may be treated differently by the ECMP implementation than, say, TCP flows. When layer-3 path discovery is able to predict accurately the specific path used, admission control decisions are accurate.
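One conservative treatment of ECMP is to charge a flow against every link that lies on any equal-cost shortest path between its endpoints. A sketch under a hop-count metric (illustrative, not the Bandwidth Broker’s actual algorithm):

```python
from collections import deque

def all_shortest_path_links(graph, src, dst):
    """Return the set of links (u, v) lying on any shortest path from
    src to dst.  graph maps a node to its set of neighbours (undirected)."""
    # BFS distances from the source.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    # Walk back from dst: edge (u, v) is on a shortest path iff
    # dist[u] + 1 == dist[v] and v itself lies on a shortest path to dst.
    links, seen, frontier = set(), {dst}, [dst]
    while frontier:
        nxt = []
        for v in frontier:
            for u in graph[v]:
                if u in dist and dist[u] + 1 == dist[v]:
                    links.add((u, v))
                    if u not in seen:
                        seen.add(u)
                        nxt.append(u)
        frontier = nxt
    return links
```

Reserving on this union of links over-commits bandwidth relative to the single path actually used, which is exactly the safe direction for admission control when the vendor’s ECMP hash cannot be predicted.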
When path discovery is unable to predict which of the shortest paths will be used, a conservative approach that accounts for flows along every possible equal-cost path should be employed. 4.1.1.2. Layer-2 path discovery. Layer-2 switches do not run routing protocols; instead, they communicate using the Spanning Tree Protocol (STP). Layer-2 network segments are broadcast domains. In a layer-2 network topology, redundant connectivity (or a loop) could lead to broadcast storms, where broadcast packets are repeatedly forwarded around the loop. STP eliminates such loops by marking certain layer-2 ports as non-forwarding. Thus, one of the keys to layer-2 path discovery is to discover the state of the spanning tree, i.e., which ports are active or blocked. The Bandwidth Broker uses SNMP MIBs to track spanning tree state and VLAN membership. It uses this information to compute layer-2 paths within a single VLAN (Decker et al., 1993). 4.1.1.3. Hybrid network path discovery. A two-pass approach to path discovery in a hybrid layer-3/2 network is used. In the first pass, the end-to-end layer-3 path is
discovered. Each switch in the layer-3 path is the gateway switch from one VLAN to the next VLAN. In the second pass, the layer-2 path between each of the layer-3 segments discovered in the first pass is computed. 4.1.2. Layer-3 and layer-2 QoS treatment The Bandwidth Broker realizes QoS treatment in layers 2 and 3 for traffic flows that are admitted in the network. Our implementation of QoS treatment uses policing and marking of traffic at the network edge. Policing and marking is done at the granularity of each flow. In the core of the network, scheduling and buffer management is performed at the granularity of each traffic class. Policing and Marking: These functions are performed at the edge of the network. The Bandwidth Broker uses Access Control Lists (ACLs) available on COTS network elements to classify traffic into flows that can be policed. Typically, the classification is based on the TCP/IP five-tuple: ⟨source address, destination address, protocol, source port, destination port⟩. However, in the MLRM architecture, the port numbers are generally not known when allocation decisions are made. In our present scheme, the Bandwidth Broker returns a DSCP marking whenever a flow reservation is made, based on traffic type or QoS requested for the flow. The sending application or the ORB middleware on behalf of the application then needs to mark the packet with the DSCP marking returned. Each class of traffic is then policed by vendor mechanisms such as Committed Access Rate (CAR). In the policing process, an aggregate rate or bandwidth is ensured for the flow through a combination of rate and allowable burst parameters. The Bandwidth Broker can be configured to either drop packets that exceed the rate profile or re-mark those packets to best effort treatment. Actual provisioning of individual network elements for policing and marking is done by the Flow Provisioner under the direction of the Bandwidth Broker.
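Rate-plus-burst policing of this kind is typically realized with a token bucket (the scheme behind mechanisms such as CAR). The following minimal sketch, with illustrative parameter names, polices a flow and either drops out-of-profile packets or re-marks them to best effort (DSCP 0):

```python
class Policer:
    """Token-bucket policer sketch: rate in bits/s, burst in bytes."""

    def __init__(self, rate_bps, burst_bytes, remark=False):
        self.rate = rate_bps / 8.0    # token fill rate, bytes per second
        self.burst = burst_bytes      # bucket depth (allowable burst)
        self.tokens = burst_bytes     # bucket starts full
        self.last = 0.0               # time of the previous packet
        self.remark = remark          # re-mark instead of drop

    def police(self, size_bytes, now, dscp):
        """Return the packet's outgoing DSCP, or None if dropped."""
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return dscp                       # in profile: keep the marking
        return 0 if self.remark else None     # out of profile
```

A flow that momentarily exceeds its rate can draw on accumulated tokens up to the burst size, after which packets are dropped (or demoted to best effort) until tokens refill.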
Scheduling and Buffer Management: Scheduling mechanisms vary significantly across COTS vendor implementations, from simple round-robin mechanisms, to strict priority queuing and sophisticated mechanisms such as weighted fair queuing. The Bandwidth Broker uses available vendor scheduling mechanisms. High-priority mission-critical traffic is accorded strict priority treatment, within bounds (dictated by admission control policies), while other traffic classes can share link bandwidth in accordance with configurable weights or percentages. Typically at layer-3, vendor products isolate traffic classes by using separate queues for each class. Where sophisticated buffer management schemes such as Weighted Random Early Detection (WRED) are available, they are applied to TCP/SCTP traffic. In the absence of such schemes, tail drop is the only option. Transport of QoS markings: In layer-3 network segments, the DSCP marking is visible in the IP header and scheduling and buffer management decisions can be based on this information. The layer-3/layer-2 integrated
switches, even when they are configured as layer-2 switches, can typically process DSCP codepoints. When traffic traverses layer-2 network segments, DSCP markings are translated to corresponding layer-2 CoS values. At most eight distinct classes of service can be identified on the layer-2 segments. In our implementation experience, we have seldom found the need for more than eight classes of service – however, if the deployment requires it, then multiple layer-3 classes may map onto the same layer-2 class. Typically, layer-2 network segments have increased bandwidth and forwarding capacity in contrast to layer-3 network segments, making such a deployment workable. In layer-2, multiple classes of traffic may be forced to share the same queue. In such a situation it is difficult to maintain isolation between traffic belonging to different classes. For each queue, one or more configurable drop thresholds may be available. The drop thresholds indicate the percent queue utilization at which frames are discarded from the queue. Multiple drop thresholds can be associated with a single queue. For instance, a drop threshold of 40% can be assigned to CoS 3 in queue 1, and a drop threshold of 90% can be assigned to CoS 4 in queue 1. If traffic marked with CoS 3 arrives for queue 1 when it is, say, 50% full, that traffic will be discarded. However, traffic with CoS 4 will be accepted and enqueued. 4.1.3. Delay-bound support in the Bandwidth Broker The Bandwidth Broker admission decision for a flow is not based solely on requested capacity or bandwidth on each link traversed by the flow, but it is also based on delay bounds requested for the flow. The delay bounds for new flows need to be guaranteed without damaging the delay guarantees for previously admitted flows and without redoing the expensive job of readmitting every previously admitted flow. We have developed computational techniques to provide both deterministic and statistical delay-bound guarantees.
Delay guarantees raise the level of abstraction of the Bandwidth Broker to the higher layer MLRM components and enable these components to provide better end-to-end mission guarantees. The basic framework we have developed is capable of dealing with any number of priority classes and, within a priority class, any number of weighted fair queuing subclasses. These guarantees are based on relatively expensive computations of occupancy or utilization bounds for various classes of traffic, performed only at the time of network configuration/reconfiguration, and relatively inexpensive checking for a violation of these bounds at the time of admission of a new flow. We provide an overview of the calculations involved. For deterministic bounds, during off-line (i.e., at the time of configuration/reconfiguration) computation of an occupancy bound a_k for each traffic class k, the input data items required are: • service rate C on links; • maximum packet or frame size M;
• diameter h of the network (longest path length in the network); • propagation delay D on the longest path; • burst times T_s for various service subclasses s; • WFQ weights w_s within each priority class k; • target delay d_s for each service class/subclass s. From these values, elementary off-line calculations directly yield the occupancy bounds that need to be satisfied to ensure that target delays are met. To provide this assurance even for the worst case consistent with the information that will be employed during the on-line admission decisions (i.e., basically just that the occupancy bounds will be respected), the off-line calculations need to account for the fact that a flow’s burstiness effectively grows with each hop at which queuing can occur. In more detail, if the first packet of a burst in the flow’s traffic encounters a queuing delay d at one link (presumably from bursts in other flows competing for that link), and if the later packets of the flow are served immediately after the first packet (presumably because the queue-growing bursts in the traffic of all the competing flows ended just as the first packet of the first flow arrived and encountered the big queue), then, at the following link, the flow’s effective burst size will have increased by q_f d, where q_f is the flow’s rate parameter. The implication is especially serious for a link l whose class-k occupancy limit a_k is practically filled by class-k flows traversing long, h-hop paths ending with link l as the hth hop (so Σ q_f = a_k C, where C is the service rate of link l, and the sum is over these flows).
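The off-line computation that this argument leads to proceeds from the highest-priority class down. A sketch of its shape, keeping only the dominant hop-count term derived in this subsection and an illustrative safety factor (the actual calculations include the other contributions as well):

```python
def occupancy_bounds(num_classes, diameter_h, safety=0.9):
    """Per-class occupancy bounds, highest priority first.

    Each bound a_k is kept strictly below (1 - a_>k) / (h - 1), where
    a_>k is the aggregate bound of all higher-priority classes, by an
    illustrative safety factor standing in for the other contributions.
    """
    bounds = []
    a_higher = 0.0   # aggregate occupancy bound of higher-priority classes
    for _ in range(num_classes):
        a_k = safety * (1.0 - a_higher) / (diameter_h - 1)
        bounds.append(a_k)
        a_higher += a_k
    return bounds
```

For a network of diameter 5 with two delay-guaranteed classes, this yields bounds of 0.225 and about 0.174: even moderate diameters force quite low occupancy limits, as the text observes.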
Specifically, the implication is that if these flows all encountered a queuing delay of d on each of their earlier hops, then, ignoring the original burstiness of these flows as they entered the network, just the corresponding aggregate increase in effective burstiness arriving at link l is (h − 1) a_k C d, corresponding to a contribution of (h − 1) a_k d / (1 − a_{>k}) to the queuing delay for class-k traffic at link l, where a_{>k} is the aggregate occupancy bound for classes of higher priority than k (and the factor (1 − a_{>k}) reflects the fact that class-k traffic, in effect, can count on link l only for the fraction (1 − a_{>k}) of its capacity). Of course, there are other contributions, and the actual calculations do take these other contributions into account, but the consideration of this one contribution alone already leads to a severe constraint on the occupancy bound a_k. Specifically, this contribution alone had better be smaller than d, if the class-k queuing delay at link l is supposed to be bounded by d, and this requirement is equivalent to requiring that a_k be strictly smaller than (1 − a_{>k})/(h − 1), which is a rather low value if the diameter h is even moderately high. Of course, the actual calculations pick an even smaller value for a_k to account for the other contributions, but this first contribution typically has the dominant effect. In any case, the various class-isolating occupancy bounds for the different service classes are calculated starting from the highest priority class. During the on-line admission control calculations, the only calculation required is to determine whether the
admission of the new flow would violate the occupancy bound of the flow’s traffic class/subclass on any of the links the flow would traverse, so there is no need to readmit already admitted flows. The statistical bound calculations require, in addition to the input values listed above, the tolerance values for the probabilities of violating delay targets. Many simplifying assumptions are made in our statistical calculations, of which some are pessimistic and others optimistic. One optimistic assumption is that the burstiness of a flow’s traffic as it arrives at one link on its route is the same as the burstiness at other links along the route. So, in contrast to the deterministic case discussed above, the dominant contribution to the depression of occupancy bounds for classes requiring worst-case guarantees is absent in our calculations for classes requiring only statistical guarantees, and this absence raises the corresponding occupancy bounds appreciably. Moreover, with the tolerance values as zero (the most constraining case), the occupancy bound calculations are especially simple and that is what we are implementing in the Bandwidth Broker. The deterministic and statistical delay bounds are currently being incorporated in the Bandwidth Broker admission control process for the highest priority class (e.g., the DiffServ EF class) and the second highest priority class (the DiffServ AF Class; in the AF class, there can be two or more weighted fair queuing subclasses). 4.2. Flow Provisioner The Flow Provisioner translates technology-independent configuration directives generated by the Bandwidth Broker into vendor-specific router and switch commands to classify, mark, and police packets belonging to a flow. The Flow Provisioner component enables the enforcement of the Bandwidth Broker admission control primitives on the network elements. On Cisco devices, enforcement is primarily by platform-specific variants of Cisco’s Committed Access Rate (CAR). 
CAR implements a variant of a token-bucket scheme that allows individual flows to be policed to a specific rate with a specific burst size. The subset of flows to which CAR rules apply is specified by Access Control Lists (ACLs). ACLs allow matching of an individual flow (or group of flows) using the IP five-tuple (source and destination addresses, source and destination ports, and protocol) as well as additional fields such as the DiffServ codepoint (DSCP). We have implemented Flow Provisioners for layer-3 Cisco IOS routers (e.g., the Cisco 3600, 7200, and 7500), layer-2/3 Catalyst and IOS switches (e.g., the Cisco 6500 and 4507), and layer-3 Linux routers to demonstrate the viability of this QoS architecture for a variety of network topologies and equipment. 5. Feedback mechanisms Two QoS feedback mechanisms to the Bandwidth Broker are implemented by the Performance Monitor and Fault Monitor components. Both these components
provide ongoing feedback as asynchronous events. They also support synchronous queries. The Performance Monitor provides information on the current performance status of flows and traffic classes. The Fault Monitor provides information on the up/down status of links and switches. 5.1. Performance Monitor The performance monitoring features we support can be classified as follows: • Delay measurement: Delay measurement determines how well critical flows are meeting their timing constraints, specifically their end-to-end latency (delay) metric. • Detection of overflow: Detection of overflow of traffic for an admitted flow can identify a mission-critical task that requires additional capacity.
5.1.1. Delay measurement We employ an active probe technique to measure delay (Alberi et al., 2003). The infrastructure can be easily extended to measure jitter and packet loss. The measurement/probe infrastructure, as illustrated in Fig. 4, has three main components: the performance data management component (consisting of Performance Monitor Servant, Probe Sink and Probe Control), the Probe Platform that manages setting up of a probe (measurement job) and the probes that run on hosts (shown as Probe A and Probe B). A few of the highlights of this Performance Monitoring component are:
• The measurement infrastructure is able to measure delay experienced by specific traffic flows or delay between a pair of hosts for one or more traffic classes. Averaging window sizes can be specified. The interface supports both synchronous requests to query current delay and asynchronous events for violations of thresholds or to provide periodic updates on delay. • The analysis and management of performance data is separated from the probes that do raw data collection. The measurement data is analyzed and managed by the Performance Monitor Servant and stored in an HSQLDB (http://hsqldb.org/) in-core database. • Probe job configurations are stored in a persistent medium using MySQL (http://dev.mysql.com/). Probe jobs can be recovered in case of probe platform or probe host failures. • In setting up a probe job, one can vary packet size, gap/time between packets, the number of packets in a packet train, and periodicity. • The probe job packet train generation is done at the (Linux) kernel level to control or minimize the time-related vagaries in generating packets. • Clocks are synchronized between the two measurement (Linux) hosts with GPS using a non-network interface between the hosts to achieve delay measurement accuracy in the microsecond range.
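The probe idea above can be sketched in user space as follows. The actual probes are kernel-level and GPS-synchronized; the packet format and function names here are our own illustration.

```python
import socket
import struct
import time

PKT_HDR = "!Id"  # sequence number + sender timestamp (seconds)

def send_train(dst, port, n_pkts=10, pkt_size=256, gap_s=0.001):
    """Send a train of timestamped UDP probe packets to a probe sink."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in range(n_pkts):
        payload = struct.pack(PKT_HDR, seq, time.time())
        payload += b"\x00" * (pkt_size - len(payload))  # pad to pkt_size
        sock.sendto(payload, (dst, port))
        time.sleep(gap_s)  # inter-packet gap within the train
    sock.close()

def one_way_delay(payload, recv_time):
    """Recover (sequence, delay) from a received probe packet.

    The delay is meaningful only if sender and receiver clocks are
    synchronized (e.g., via GPS, as in our measurement hosts)."""
    seq, sent = struct.unpack(PKT_HDR, payload[:struct.calcsize(PKT_HDR)])
    return seq, recv_time - sent
```

A real deployment would vary `pkt_size`, `gap_s`, train length, and periodicity per probe job, and would mark the probe packets with the DSCP of the traffic class being measured.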
5.1.2. Detection of overflow The Bandwidth Broker’s admission control does not allow overload in the network. Excess offered loads are detected at the ingress of the network and policed. The policing function provided by an ingress network element
Fig. 4. Delay measurement component.
either drops or marks down all excess packets. Policing is orchestrated by our Flow Provisioner, which sets up the policing attributes for flows at the ingress of the network. The monitoring program uses the policing functions provided by an ingress network element to determine whether the rate at which packets are dropped or marked down exceeds a threshold during several consecutive intervals. 5.2. Fault Monitor A key feature of a resource management system in a dynamic battlespace environment is the ability to detect and react to network faults. We illustrate the problem we are trying to address using the network shown in Fig. 5. If the link between switches A and B goes down, then a flow Y between A and B may be rerouted through switch C (shown in dashed lines). Similarly, a flow Z between E and A that originally used the links EB and BA may now use links ED and DA (shown in dotted lines). However, links AC, CB, ED, and DA may now be oversubscribed, jeopardizing the QoS guarantees for Y and Z as well as for the flows that had been using these links prior to the fault. Our techniques include both reactive and proactive analyses. In the reactive mode, when and only when a network fault is detected, we recompute the layer-3 and/or layer-2 topology and the paths of the individual flows. The proactive mode instead precomputes network paths for various failure conditions, enabling faster response. We currently restrict the proactive analysis to single-mode faults (a single link or switch failure); on the occurrence of any single-mode failure, a simple lookup yields the new network path information. The goal of the Fault Monitor is not to perform root cause analysis or to enable fixing the fault, but to restore QoS. If the QoS of a previously admitted flow cannot be guaranteed, the Fault Monitor will raise a fault exception event to the Bandwidth Broker. 
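The proactive mode's lookup table can be sketched as follows, using a hop-count shortest path as a stand-in for the actual routing computation; the topology in the test mirrors the Fig. 5 example, and the function names are our own.

```python
from collections import deque

def bfs_path(adj, src, dst, dead_link=None):
    """Shortest path by hop count in an undirected graph, treating
    dead_link (a pair of node names) as failed in both directions."""
    dead = {dead_link, tuple(reversed(dead_link))} if dead_link else set()
    prev, q = {src: None}, deque([src])
    while q:
        u = q.popleft()
        if u == dst:                      # reconstruct path back to src
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if (u, v) not in dead and v not in prev:
                prev[v] = u
                q.append(v)
    return None                           # dst unreachable after the failure

def precompute_failover(adj, links, flows):
    """For each single link failure, precompute every flow's new path.

    flows: {flow_id: (src, dst)} -> table[link][flow_id] = new path."""
    return {l: {fid: bfs_path(adj, s, d, dead_link=l)
                for fid, (s, d) in flows.items()}
            for l in links}
```

On a failure event, the Fault Monitor would then do a single dictionary lookup instead of recomputing paths, which is the point of the proactive analysis.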
The Bandwidth Broker, in turn, will raise a higher-level event to other MLRM components, specifically OSM Global and PM. The three functional aspects of the Fault Monitor component are as follows:
• Fault detection: We use SNMP traps to detect link failures (and links coming back into service). A switch failure is detected when SNMP trap notifications for all links to the switch are received by the adjacent switches. • Impact analysis: For each admitted flow, the impact analysis determines whether the flow has changed its path, using the path discovery algorithms. If the path of a flow has changed, that flow has been impacted by the failure and is a candidate for readmission. • QoS restoration: Our design can support different algorithms satisfying different utility functions or optimality criteria. The first step, regardless of the algorithm used, is to temporarily remove the affected flows. In the current implementation, the affected flows are then readmitted one at a time, from the highest priority to the lowest and, within the same priority, flows requesting less bandwidth first. Here, we are trying to readmit the maximum number of higher priority flows whose paths have changed. We may substitute an algorithm that admits more flows. We can also employ a preemption algorithm: for instance, if the admission of a flow would lead to capacity violations on a link, the process preempts a lower priority flow that uses this link. The preempted flow then becomes a candidate for readmission. The QoS restoration functions described above restore QoS based on the bandwidth required. When there is a network fault, the diameter of the network may have increased, causing the occupancy bounds for the various service classes to decrease. If the occupancy bounds have changed, the QoS of flows whose paths have not changed can also degrade. To honor the delay bounds, readmission therefore has to be carried out on all previously admitted flows. The readmission of an affected flow will now be based on the newer occupancy bound for the flow's class on each link of the new path to be traversed by the flow. 
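The current readmission policy can be sketched as follows; the flow representation and the `admit` callback are hypothetical stand-ins for the Bandwidth Broker's per-link admission check.

```python
def readmission_order(flows):
    """Order affected flows for readmission: highest priority first;
    within the same priority, smaller bandwidth request first."""
    return sorted(flows, key=lambda f: (-f["priority"], f["bw"]))

def restore_qos(affected, admit):
    """Readmit affected flows one at a time via the admit() callback,
    which would consult per-link occupancy bounds in the real system."""
    readmitted, rejected = [], []
    for flow in readmission_order(affected):
        (readmitted if admit(flow) else rejected).append(flow)
    return readmitted, rejected
```

Swapping in a different utility function (e.g., one maximizing the number of admitted flows, or one that preempts lower-priority flows) amounts to replacing `readmission_order` and `admit`.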
If a flow is preempted, the new occupancy value should be used in readmitting the preempted flow as well. Finally, when the fault disappears, the Fault Monitor tracks the paths and the resources used and restores guarantees to the original level, carrying out the entire process of impact analysis and restoration described above. 6. Support for mission mode changes
Fig. 5. Fault and its impact on QoS.
In addition to adapting to faults and overload, the Bandwidth Broker supports dynamically changing among modes. A mode is a major operational situation such as normal, alert, and battle mode in a military environment. Our work in support of mode changes deals with global policy changes affecting the entire network, including changes in the fraction of the bandwidth allocated to various traffic classes. The bandwidth policy change implementation involves sending reconfiguration instructions to every switch to change its QoS parameters such as queue size, number of scheduling slots allocated, and packet drop
B. Dasarathy et al. / The Journal of Systems and Software 80 (2007) 972–983
rules for every traffic class. Moreover, a policy change invariably results in a reduction of the bandwidth allocated to one or more traffic classes, so the QoS of flows already admitted in those classes might no longer be guaranteed. Identifying the affected flows and readmitting them are similar in spirit to the impact analysis and QoS restoration in response to network faults, but the details are somewhat different. A flow is affected in this case if, on some link in its path, the total bandwidth allocated to the flow's class exceeds the class's share of the link capacity and/or the current occupancy value of the flow's class exceeds the corresponding class threshold. The affected flows are sorted primarily on priority, from the lowest priority to the highest, and within each priority on bandwidth in descending order. We keep deleting flows from the affected list, starting with the lowest priority and highest bandwidth requirement, until no link remains on which the total bandwidth allocated for the class exceeds the class's link capacity and/or the current occupancy value exceeds its corresponding threshold. The utility function used here minimizes the number of higher priority flows that have the potential of being denied their QoS. When a flow is deleted, the bandwidth used or the current occupancy value on every link used by the flow is adjusted down. 7. Experimentation and validation To demonstrate that the Bandwidth Broker does indeed improve both performance and predictability, we report on a simple experiment using a testbed of Cisco 3600 series routers. The results are summarized in Table 1. The configuration consists of two routers, each serving a host, connected by a link of 10 Mbits/s capacity. 
We transfer a 10-Mbyte (80-Mbit) file over the network using FTP. The first row of Table 1 corresponds to file transfers done as best effort. The second row corresponds to file transfers done using a High Reliability traffic class (an Assured Forwarding (AF) class in DiffServ) policed at the rate of 2 Mbits/s. The columns correspond to contention traffic of 0, 1, 2, and 3 Mbits/s.

Table 1
Illustration of Bandwidth Broker improving both performance and predictability

                                                Contention traffic
                                                None     1 Mbits/s   2 Mbits/s   3 Mbits/s
Flow not admitted by the Bandwidth Broker       11.4 s   70 s        218 s       >5 min
Flow admitted by the Bandwidth Broker
  at 2 Mbits/s in an AF class                   30.2 s   30.5 s      30.3 s      30.3 s
981
As can be seen in Row 1, with no QoS treatment (best effort), the FTP transfer time (elapsed wall time for the transfer) gets worse as the contention traffic increases, from 11.4 to 70 to 218 to more than 300 s. Basically, the file transfer flow and the contention traffic get the same, equally bad treatment. When the FTP transfer uses the AF class, as can be seen in Row 2, the performance of the FTP transfer stays essentially the same as the contention traffic increases, i.e., the transfer time is about 30.5 s. With the policing rate for the flow at 2 Mbits/s, it should have taken at least 40 s to transfer an 80-Mbit file, yet the transfers took only about 30.5 s. This discrepancy is explained by the nominal burst size allowance. In our policing configuration, we have instructed the router to drop packets, rather than mark them down to best effort, when the rate exceeds 2 Mbits/s; this is consistent with how TCP traffic behaves, since TCP backs off in response to loss. In the experiments, only 30% of capacity (3 Mbits/s) was allocated to this AF class. If there were a request to the Bandwidth Broker to admit another flow to the same AF class at a rate greater than 1 Mbits/s, the Bandwidth Broker would reject the request, ensuring that the already admitted 2-Mbits/s FTP traffic in the AF class gets the right QoS treatment. Indeed, in one of the extended experiments we turned off the admission control of the Bandwidth Broker, allowing flows in the AF class to exceed the class capacity on some links; as expected, packets were dropped in all the competing AF flows using those links, violating their QoS requirements. The network overload "event" generation capabilities (see Section 5.1.2) have been demonstrated for a key "gate" test in the ARMS Phase I program. (Gate tests are instituted by the program to measure progress.) 
The gate test showed that our technology is applicable in dynamic resource management and increases mission survivability. Finally, the Bandwidth Broker consistently processes add and delete reservation requests in under 100 ms. 8. Related work The two main technologies for providing differentiated treatment of traffic are DiffServ/CoS and IntServ; the Bandwidth Broker makes use of DiffServ/CoS. In IntServ, every router on the path of a requested flow decides whether or not to admit the flow with a given QoS requirement. Each router in the network keeps the status of all flows that it has admitted as well as the remaining available (uncommitted) bandwidth on its links. Some drawbacks of IntServ are that (1) it requires per-flow state at each router, which can be an issue from a scalability perspective; (2) it makes its admission decisions based on local information rather than some adaptive, network-wide policy; and (3) it is applicable only to layer-3 IP networks. Our network QoS solution has none of these drawbacks (see http://www.cisco.com/en/US/tech/tk543/tk766/technologies_white_paper09186a00800a3e2f.shtml).
982
B. Dasarathy et al. / The Journal of Systems and Software 80 (2007) 972–983
Our delay-bound work for DiffServ/CoS networks with deterministic guarantees is closely related to Le Boudec and Thiran (2004). Our mathematical formulation both for deterministic and statistical bounds is broader than Le Boudec and Thiran (2004), and Wang et al. (2001) in that any number of priority classes and any number of weighted fair queuing classes within a priority class can be handled and thus our admission control can support delay guarantees for any DiffServ/CoS classes of traffic. Telcordia has successfully applied Bandwidth Broker technologies to other Government projects and toward commercial offerings (Kim and Sebuktekin, 2002; Chadha et al., 2003). None of these endeavors, however, deals with layer-2 QoS, let alone unified management of QoS across multi-layers. None of these works, moreover, is reflexive and adaptive with fault and performance monitoring as part of the QoS framework. Furthermore, integration into middleware and integration into an end-to-end resource management framework have not been the focus of these efforts. Proactive fault impact analysis and QoS restoration in a faulty network as explored in our research, to our knowledge, has not been explored by other researchers. 9. Concluding remarks Our network QoS components provide a unified QoS solution that guarantees network performance for mission-critical applications in complex wireline, layer-3/2 network topologies. Our implementation is flexible and unique in the mix of guarantees it can provide – deterministic delay guarantees, statistical delay guarantees, and capacity (bandwidth) assurance – to various mission tasks. Our ability to detect and respond appropriately to faults, changing mission mode, and changing needs of high priority and time-critical applications is essential to providing an end-to-end adaptive allocation and scheduling service for mission-critical systems. Our adaptive behavior is supported by continual performance monitoring at the network level. 
Much of the component functionality described in this paper is in place; we are implementing the rest. Experimentation and validation are ongoing. Acknowledgments The design and architecture of ARMS is a collaborative effort by many institutions. Our thanks to our ARMS colleagues from BBN, Boeing, Carnegie-Mellon University, Lockheed-Martin, Johns Hopkins University Applied Physics Laboratory, Ohio University, PrismTech, Raytheon, SRC and Vanderbilt University. References Alberi, J.L., McIntosh, A., Pucci, M., Raleigh, T., 2003. On achieving greater accuracy and effectiveness in network measurements. In: NYMAN 2003, New York.
Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., Weiss, W., 1998. An architecture for differentiated services. IETF RFC 2475. Chadha, R. et al., 2003. PECAN: policy-enabled configuration across networks. In: IEEE 4th International Workshop on Policies for Distributed Systems and Networks. Dasarathy, B., Gadgil, S., Vaidyanathan, R., Parmeswaran, K., Coan, B., Conarty, M., Bhanot, V., 2005. Network QoS assurance in a multi-layer adaptive resource management scheme for mission-critical applications using the CORBA middleware framework. In: Proceedings of RTAS 2005, pp. 246–255. Decker, E., McCloghrie, K., Langille, P., Rijsinghani, A., 1993. Definitions of managed objects for source routing bridges. IETF RFC 1525. Kim, B., Sebuktekin, I., 2002. An integrated IP QoS architecture – performance. In: MILCOM 2002, Anaheim, CA. Lardieri, P., Balasubramanian, J., Schmidt, D.C., Thaker, G., Gokhale, A., Damiano, T., 2006. A multi-layered resource management framework for dynamic resource management in enterprise DRE systems. Elsevier Journal of Systems and Software, this (Special) Issue on Dynamic Resource Management in Distributed Real-Time Systems, edited by Charles Cavanaugh, Frank Drews, Lonnie Welch. Le Boudec, J.-Y., Thiran, P., 2004. Network Calculus: A Theory of Deterministic Queuing Systems for the Internet, Chapter 2. Online version of the book, Springer-Verlag, LNCS, vol. 2050. Nichols, K., Blake, S., Baker, F., Black, D., 1998. Definition of the differentiated services field (DS field) in the IPv4 and IPv6 headers. IETF RFC 2474. Wang, N., Schmidt, D.C., Gokhale, A., Rodrigues, C., Natarajan, B., Loyall, J.P., Schantz, R.E., Gill, C.D., 2003. QoS-enabled middleware. In: Mahmoud, Qusay (Ed.), Middleware for Communications. Wiley and Sons, pp. 131–162. Wang, S., Xuan, D., Bettati, R., Zhao, W., 2001. Differentiated services with statistical QoS guarantees in static-priority scheduling networks. 
In: Proceedings of the IEEE Real-Time Systems Symposium, London, UK. Balakrishnan ''Das'' Dasarathy is a Chief Scientist at Telcordia Applied Research. He has over 25 years of experience in software research and development (R&D) and in software R&D management. His current areas of interest include middleware, real-time systems and network QoS. He received his Ph.D. in Computer and Information Science from the Ohio State University. Shrirang (''Shree'') Gadgil is currently a Senior Research Scientist at Telcordia Applied Research. He has nearly 15 years of experience in network software development and system software. His current areas of interest include policy-based network management systems, and network QoS and traffic engineering. He received his M.S. in Computer Science from Columbia University. Ravi Vaidyanathan is currently a Senior Scientist at Telcordia Applied Research and has been with Telcordia since 1999. His accomplishments at Telcordia include development of a Border Gateway Protocol toolkit, development of a QoS assurance architecture for wireless 802.11b networks, design of a policy framework for traffic engineering and QoS provisioning in IP/MPLS networks, and development of simulation models of ad hoc wireless networks. He received his M.S. in Electrical Engineering from the University of Maryland. Arnie Neidhardt is currently a Senior Research Scientist at Telcordia Applied Research. He has been with Telcordia Technologies since 1984. His areas of interest include network management, performance analysis and traffic modeling, and he has published extensively in these areas. He received his B.S. from Purdue University, and his M.A. and Ph.D. from the University of Wisconsin, all in mathematics. Brian Coan is currently Director of the Distributed Computing Research Group at Telcordia Applied Research. He has been affiliated with Telcordia
Technologies (and previously with Bell Laboratories) continuously since 1978. His current work concentrates on providing resilient networking and information services in adverse environments, possibly caused by cyber attacks, for the U.S. Army FCS program. He has degrees in Computer Science from Princeton (B.S.E.), Stanford (M.S.), and MIT (Ph.D.). Kirthika Parmeswaran is currently a Research Scientist at Telcordia Applied Research. Her current areas of interest include policy management and security for wired and wireless tactical networks, real-time middleware and network QoS. She has a B.E in Computer Engineering from Pune Institute of Computer Technology (PICT), India and M.S in Computer Science from Washington University in St. Louis.
Allen McIntosh is currently a Senior Scientist, Telcordia Applied Research, and has been with Telcordia since 1987. His research interests include statistical computing, linear models and large datasets. He received his Ph.D. in Statistics from the University of Toronto. Frederick (Rick) Porter is currently a Senior Scientist in Telcordia Applied Research. He has been with Telcordia Technologies since 1984. His current areas of interest include middleware and defense against cyber attacks. He has degrees in Electrical Engineering from Cornell (B.S.) and Stanford (M.S.).