Applying a Scalable CORBA Event Service to Large-scale Distributed Interactive Simulations

Carlos O’Ryan, Douglas C. Schmidt, Mayur Desphande
{coryan, schmidt, [email protected]}
Department of Electrical and Computer Engineering
University of California, Irvine
Irvine, CA 92697

J. Russell Noseworthy
[email protected]
Object Sciences Corp.
Alexandria, VA 22312
Abstract

Next-generation distributed interactive simulations have stringent quality of service (QoS) requirements for throughput, latency, and scalability, as well as requirements for a flexible communication infrastructure to reduce software lifecycle costs. The CORBA Event Service provides a flexible model for asynchronous communication among distributed and collocated objects. However, the standard CORBA Event Service specification lacks important features and QoS optimizations required by distributed interactive simulation systems. This paper makes five contributions to the design, implementation, and performance measurement of distributed interactive simulation systems. First, it describes how the CORBA Event Service can be implemented to support key QoS features. Second, it illustrates how to extend the CORBA Event Service so that it is better suited for distributed interactive simulations. Third, it describes how to develop efficient event dispatching and scheduling mechanisms that can sustain high throughput. Fourth, it describes how to use multicast protocols to reduce network traffic transparently and to improve system scalability. Finally, it illustrates how an Event Service framework can be strategized to support configurations that facilitate high throughput, predictable bounded latency, or some combination of each.

Keywords: Scalable CORBA event systems, object-oriented communication frameworks.

Figure 1: Architecture of a Distributed Interactive Simulation Application
1 Introduction
Overview of distributed interactive simulations: Interactive simulations are useful tools for training personnel to operate equipment or to experience situations that are too expensive, impractical, or dangerous to execute in the real world. The advent of high-speed LANs and WANs has enabled the development of distributed interactive simulations, where participants are dispersed geographically. For example, military units stationed around the world can participate in joint training exercises, with human-in-the-loop airplane and tank simulators. Internet gaming is another form of distributed interactive simulation. In both examples, heterogeneous LAN-based computer systems can be interconnected by high-speed WANs, as depicted in Figure 1.

The QoS requirements on the software that supports distributed interactive simulations are quite demanding. They combine aspects of distributed real-time computing with the need for low-latency, high-throughput, multi-sender/multi-receiver communication over a wide range of autonomous and interconnected networks. Meeting these challenges requires new software infrastructures, such as the one described in this paper.

Distributed interactive simulation systems, such as DIS [1], have historically been based on Publish/Subscribe patterns [2]. Participants in the simulation declare the data that they supply and consume. To exchange this data, distributed interactive simulation systems require an efficient and scalable communication infrastructure. Typically, each participant in these event-driven systems consumes and supplies only a subset of the possible events in the system. By nature, however, these systems can vary dynamically, e.g., consumers and suppliers can join and leave at arbitrary times. Likewise, the set of events published or subscribed to can also vary during the lifetime of the simulation.

It is not uncommon for large-scale simulations, such as synthetic theater of war training (STOW) activities, to be composed of hundreds or thousands of suppliers and consumers that generate enormous quantities of events in real-time. Thus, simulation communication infrastructures must scale up to handle large event volumes, while simultaneously conserving network resources by minimizing the number of duplicated events sent to separate consumers. In addition, the system must avoid wasteful computation. For instance, it should avoid sending events to consumers that are not interested and should quickly reject such events if they are received anyway. Moreover, communication infrastructures must be flexible enough to cope with different simulation styles that require different optimization points, such as reduced latency, improved throughput, low network utilization, and reliable or best-effort delivery.
Towards a middleware-based solution: Given sufficient time and effort, it is possible to achieve the specific requirements of distributed interactive simulation applications by developing these systems entirely from scratch. In practice, however, the environment in which these systems are developed places increasingly stringent constraints on the time and effort expended on developing software. Moreover, the increasing scarcity of qualified software professionals exacerbates the risk of companies failing to complete mission-critical projects. Therefore, unless the scope of software development required for each project can be constrained substantially, it is rarely realistic to develop complex simulation systems from scratch.

For these reasons, it is necessary that distributed interactive simulation systems be assembled largely from reusable middleware. Middleware is software that resides between applications and the underlying operating systems, protocol stacks, and hardware in complex real-time systems to enable or simplify how these components are connected [3]. When middleware is commonly available for acquisition or purchase, it becomes commercial-off-the-shelf (COTS). Employing COTS middleware shields software developers from low-level, tedious, and error-prone details, such as socket-level programming [4]. Moreover, it provides a consistent set of higher-level abstractions [5, 6] for developing adaptive systems. In addition, it amortizes software lifecycle costs by leveraging previous design and development expertise and reifying key design patterns [7] into reusable frameworks and components.

COTS middleware has achieved substantial success in certain domains, such as avionics mission computing [8] and business applications. There is a widespread belief in the distributed interactive simulation community, however, that the efficiency, scalability, and predictability of COTS middleware, such as CORBA [9], is not suitable for next-generation large-scale simulation applications. Thus, if it can be demonstrated that the overhead of COTS middleware implementations can be removed, the resulting benefits make COTS middleware a compelling choice as the communication infrastructure for large-scale simulation systems.

Our previous research has examined many dimensions of high-performance and real-time CORBA ORB endsystem design, including static [10] and dynamic [5] scheduling, event processing [8], I/O subsystem [11] and pluggable protocol [12] integration, synchronous [13] and asynchronous [14] ORB Core architectures, systematic benchmarking of multiple ORBs [15], patterns for ORB extensibility [7], and ORB performance [16]. This paper extends our previous work [8] on real-time extensions to the CORBA Event Service to show how this service can support the QoS requirements of large-scale distributed interactive simulations by using IP multicast to federate multiple Event Channels and conserve network resources. In addition, we describe the design of a flexible Event Service framework that allows developers to select the implementation strategies that are most appropriate for their application domain.
The remainder of this paper is organized as follows: Section 2 discusses the optimizations and extensions we added to the standard CORBA Event Service to support large-scale distributed interactive simulation applications; Section 3 describes how this new version of the CORBA Event Service has been used in the implementation of the High Level Architecture (HLA) Run-Time Infrastructure (RTI); Section 4 presents the results of several benchmarks performed on our implementation under different workloads and identifies the primary sources of overhead; and Section 5 presents concluding remarks.
2 Overview of TAO’s Real-time Event Service

CORBA is a distributed object computing middleware specification [17] being standardized by the Object Management Group (OMG). CORBA is designed to support the development of flexible and reusable service components and distributed applications by (1) separating interfaces from (potentially remote) object implementations and (2) automating many common network programming tasks, such as object registration, location, and activation; request demultiplexing; framing and error-handling; parameter marshaling and demarshaling; and operation dispatching. Figure 2 illustrates the primary components in the OMG Reference Model architecture. The ACE ORB (TAO) [10] is a freely available, open-source implementation of CORBA.

Figure 2: OMG Reference Model Architecture

Many distributed applications exchange asynchronous requests using event-based execution models [18, 19, 20]. To support these common use-cases, the OMG defined a CORBA Event Service component in the CORBA Object Services (COS) layer, as shown in Figure 2. The COS specification [21] presents architectural models and interfaces that factor out common object services, such as persistence [22], security [23], transactions [24], fault tolerance [25], and concurrency [26].

2.1 Overcoming Limitations with the CORBA Event Service

The standard COS Event Service specification lacks several important features required by large-scale distributed interactive simulations. Chief among these missing features are centralized event filtering, efficient and predictable event dispatching, efficient use of network and computational resources, periodic event processing, and event correlations.¹ To resolve these limitations we have developed a Real-time Event Service (RT Event Service) as part of the TAO project [10]. TAO’s RT Event Service extends the COS Event Service specification to satisfy the quality of service (QoS) needs of real-time applications in many domains, such as distributed interactive simulations, avionics, telecommunications, and process control. The following discussion summarizes the features missing in the COS Event Service and outlines how TAO’s Real-time Event Service supports them.

¹ Correlation allows an Event Channel to wait for a conjunction of events before sending it to consumer(s).

2.1.1 Support for Centralized Event Filtering

In a large-scale distributed interactive simulation, not all consumers are interested in all events generated by all suppliers. Although it is possible to let each application perform its own filtering, this solution wastes network and computing resources. Ideally, therefore, the Event Service should send an event to a particular consumer only if the consumer has explicitly subscribed for it. Care must be taken, however, to ensure that the subscription process used to support filtering does not itself cause undue burden on distributed system resources.

It is possible to implement filtering using standard COS event channels [21]. For instance, channels can be chained to create an event filtering graph that consumers use to receive a subset of the total events in the system. However, filter graphs defined using standard COS Event Channels increase the number of hops a message must travel between suppliers and consumers. This increased traversal overhead may be unacceptable for applications with low latency requirements. Likewise, it hampers system scalability because additional processing is required to dispatch each event. To alleviate these scalability problems, therefore, TAO’s RT Event Service provides filtering and correlation mechanisms that allow consumers to specify logical OR and AND event dependencies. When the designated conditions are met, the event channel will dispatch all events that satisfy each consumer’s dependencies.

2.1.2 Efficient and Predictable Event Dispatching

To improve scalability and take advantage of advanced hardware, it may be desirable to have multiple threads sending events to their final consumers. TAO’s RT Event Service can be configured with an application-provided strategy to assign the threads that will dispatch events. The standard distribution also includes a number of selectable strategies to cover the more common cases. The same component can be used to achieve greater predictability and enforce scheduling decisions at run-time, since an event can be assigned to a thread with the appropriate OS priority.
2.1.3 Efficient Use of Network and Computational Resources

TAO’s RT Event Service can be configured to minimize network traffic by using multicast protocols to avoid duplicate network traffic and by building federations of Event Services that share filtering information to minimize or eliminate the transmission of unwanted events to remote entities.
2.2 TAO’s RT Event Service Architecture
TAO’s RT Event Service is implemented using the Mediator pattern [27]. The heart of the RT Event Service is the Event Channel, shown in Figure 3. The features of TAO’s Event Channel are defined by an Event Channel IDL interface and implemented by a C++ class of the same name. This class also plays a mediator role by serving as a “location broker” so the rest of the Event Channel components can find each other.

When a ProxyPushConsumer receives an event from an application, it iterates over the set of ProxyPushSuppliers that represent the potential consumers interested in that event (Section 2.3.2 describes how this set is determined). Each ProxyPushSupplier checks to see if the event is relevant for its consumer. This check is performed by the filter hierarchy described in Section 2.3.1. If a consumer is interested in the event, a Dispatching Strategy selects the thread that will dispatch the event to the consumer. Section 2.3.6 discusses various tradeoffs to consider when selecting the dispatching thread strategy.

For real-time applications that require periodic event processing, the Event Service can contain an optional Timer Module. Section 2.3.14 outlines several strategies for generating timer events. Each strategy possesses different predictability and performance characteristics and different resource requirements.

Figure 3: RT Event Service Architecture

2.3 Design and Implementation Challenges

Section 2.2 outlined the core components of the CORBA Event Service that are defined by IDL interfaces. Below, we describe how we have systematically applied key design patterns, such as Builder, Command, Composite, and Strategy from the GoF book [27], Strategized Locking [28], and Service Configurator [29], to tackle the design and implementation challenges posed by TAO’s RT Event Service. Since these patterns are applicable to related systems, we document how we applied and composed these patterns in TAO to achieve our performance and scalability goals.

2.3.1 Implementing an Extensible and Efficient Filtering Framework

Context: TAO’s real-time Event Service provides several filtering primitives, e.g., a consumer can accept only events of a given type or from some particular source [8]. Not all applications require all filtering mechanisms provided by the RT Event Service, however. For example, many distributed interactive simulations do not require correlation. Likewise, some Event Service applications do not require filtering at all. Moreover, consumers often compose several filtering criteria into their subscription list, e.g., they request to receive any event from a given list or a single notification when all the events in a list are received.

Problem: The Event Channel should support the addition of new filtering primitives flexibly and efficiently. For instance, an Event Channel should allow new filtering primitives, such as receiving a single notification when a set of events is received in a particular order, or accepting any event whose type matches a designated bitmask.

Solution → use the Composite pattern [27]: This pattern allows clients to treat individual objects and compositions of objects uniformly. Usually, the composition forms a tree structure, which in our case is a filter composition tree. New filtering primitives can be implemented as leaves in the composition tree. These primitives provide applications with substantial expressive power, e.g., they can create complex filter hierarchies using disjunction and conjunction composites. To control the creation of the concrete filters, we use the Builder pattern [27], which separates the construction of a complex object from its representation. In our case, we build the filter hierarchy from the subscription IDL structures. Ultimately, we plan to use the complete Trader Constraint Language [30] as the filtering language. One important consequence of using the Builder pattern is that this change will not affect the overall architecture of TAO’s Event Service framework.
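For concreteness, the following standalone C++ sketch shows one way such a filter composition tree can be organized, with a (type, source) leaf filter and a disjunction composite. The names are illustrative only and do not correspond to TAO’s actual filter classes; a conjunction composite and a Builder that walks the subscription structures would follow the same shape.

    #include <memory>
    #include <vector>

    // Illustrative event header: a (type, source) pair, as in a COS-style event.
    struct EventHeader { int type; int source; };

    // Base class of the filter composition tree (Composite pattern).
    class Filter {
    public:
      virtual ~Filter() = default;
      virtual bool matches(const EventHeader& h) const = 0;
    };

    // Leaf: accepts events of one type from one source (0 acts as a wildcard here).
    class TypeSourceFilter : public Filter {
    public:
      TypeSourceFilter(int type, int source) : type_(type), source_(source) {}
      bool matches(const EventHeader& h) const override {
        return (type_ == 0 || h.type == type_) && (source_ == 0 || h.source == source_);
      }
    private:
      int type_, source_;
    };

    // Composite: logical OR (disjunction) of its children.
    class DisjunctionFilter : public Filter {
    public:
      void add(std::unique_ptr<Filter> f) { children_.push_back(std::move(f)); }
      bool matches(const EventHeader& h) const override {
        for (const auto& c : children_)
          if (c->matches(h)) return true;
        return false;
      }
    private:
      std::vector<std::unique_ptr<Filter>> children_;
    };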
2.3.2 Improving Consumer Filtering Scalability

Context: In some distributed interactive simulation applications, only a small percentage of consumers are interested in a particular event. Therefore, an Event Channel implementation that queried each ProxyPushSupplier to check if an event is of interest to any of its clients would scale very poorly as the number of events increases.

Problem: Reduce the time required to dispatch an event by reducing the set of consumers tested.

Solution → pre-compute the set of consumers for each ProxyPushConsumer object: We can use the types of events generated by a supplier to determine which consumers may be interested in the events generated by that supplier. Likewise, we use the filter hierarchy in each ProxyPushSupplier to estimate whether the corresponding consumer is willing to receive at least one of the events published by the supplier. If the consumer is not interested in any of the events, we can remove it from the set, thereby improving overall system performance.
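The following simplified sketch illustrates the pre-computation: when a supplier declares its published event types, the channel records only those ProxyPushSuppliers whose filters can accept at least one of them. The data structures and names are hypothetical simplifications of the scheme described above, not TAO’s implementation.

    #include <set>
    #include <vector>

    struct ProxyPushSupplier {
      std::set<int> accepted_types;                  // types the consumer's filter can accept
      bool may_accept(int type) const { return accepted_types.count(type) != 0; }
    };

    struct ProxyPushConsumer {
      std::set<int> published_types;                 // types the supplier declared it publishes
      std::vector<ProxyPushSupplier*> consumer_set;  // pre-computed dispatch set
    };

    // Rebuild the per-supplier consumer set whenever subscriptions or publications change.
    void rebuild_consumer_set(ProxyPushConsumer& pc,
                              std::vector<ProxyPushSupplier*>& all_proxies) {
      pc.consumer_set.clear();
      for (ProxyPushSupplier* proxy : all_proxies) {
        for (int type : pc.published_types) {
          if (proxy->may_accept(type)) {             // keep proxies interested in >= 1 type
            pc.consumer_set.push_back(proxy);
            break;
          }
        }
      }
    }
    // Dispatching an event then iterates over pc.consumer_set only,
    // instead of over every proxy in the channel.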
2.3.3 Reducing Memory Footprint

Context: In other distributed interactive simulation applications, a large percentage of consumers may be interested in the events generated by each supplier. In such cases, it is counterproductive to use the pre-computation optimization described above. Instead, it may be more efficient to use a single global consumer set, which reduces the memory footprint and minimizes the time required to update the consumer set.

Problem: The application developer should be able to control the algorithm used to build consumer sets.

Solution → use the Strategy pattern [27]: In this pattern, a family of algorithms is represented by classes that share a common base class. Clients access these algorithms via the base class, thereby enabling them to select different algorithms without requiring changes to the client. We use this pattern in TAO’s RT Event Service framework to encapsulate the exact algorithm used to control the number of consumer sets, as well as how these sets are updated. Note that the variations mentioned thus far are not exhaustive, e.g., we can store a separate consumer set for each event type. The framework implemented in TAO’s RT Event Service can support this use case as well.

2.3.4 Supporting Re-entrant Calls while Dispatching Events

Context: To dispatch an event to multiple consumers, an Event Channel must iterate over its set of ProxyPushSupplier objects. In some concurrency models, such as the single-threaded or reactive dispatching strategies described in Section 2.3.6, the same thread that iterates over a consumer set executes the upcall to consumers. Consumers are then allowed to push new events, add or remove consumers and suppliers, as well as call back into the Event Channel and its internal components.

Problem: The Event Channel should support re-entrant calls during event dispatching, regardless of the concurrency model that is being used. However, many iterator implementations become invalidated when their data structure is modified [31]. Thus, the ProxyPushSupplier set cannot be changed while a thread is iterating over it. Simply locking the set is inappropriate because the application will either deadlock if the upcall changes the set or invalidate iterators if recursive locks are used. Another inappropriate alternative is to copy the ProxyPushSupplier set before starting the iteration. Although this strategy works for small sets, it performs poorly for large-scale distributed interactive simulation applications.

Solution → apply lazy evaluation to delay certain operations: TAO’s Event Channel tracks the number of threads iterating over each set of ProxyPushSupplier objects. Before performing changes that would invalidate other iterators, it checks to ensure no concurrent iterations are occurring. If iterations are in progress, the operation is stored as a Command object [27]. When no threads are iterating over the set, all delayed command operations are executed sequentially. To avoid starving a delayed operation indefinitely, limits can be placed on the number of iterations started after a pending modification occurs. After this limit is reached, all new threads must wait on a lock until the modification completes.
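A minimal sketch of the delayed-operation idea: a busy count guards the proxy set, and modifications requested during an iteration are stored as Command objects and replayed afterwards. The class below is illustrative and omits the locking and starvation limit a real channel needs; it is not TAO’s implementation.

    #include <functional>
    #include <list>
    #include <vector>

    class ProxySet {
    public:
      // Run `op` now if nobody is iterating, otherwise queue it as a Command object.
      void modify(std::function<void()> op) {
        if (iterating_ > 0) delayed_.push_back(std::move(op));
        else op();
      }

      template <typename Visitor>
      void for_each(Visitor visit) {
        ++iterating_;                      // mark the set as busy
        for (int proxy : proxies_) visit(proxy);
        if (--iterating_ == 0) flush();    // last iterator out replays delayed commands
      }

      void insert(int proxy) { modify([this, proxy] { proxies_.push_back(proxy); }); }

    private:
      void flush() {
        for (auto& op : delayed_) op();
        delayed_.clear();
      }
      int iterating_ = 0;
      std::vector<int> proxies_;
      std::list<std::function<void()>> delayed_;
    };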
2.3.5 Reducing Synchronization Overhead

Context: Excessive synchronization overhead can be a significant bottleneck when it occurs in the critical path of a concurrent system.

Problem: The lazy evaluation solution described in Section 2.3.4 is functionally correct. However, it increases synchronization overhead along the critical path of the event filtering and dispatching algorithms. In particular, Event Channels may choose to decouple (1) threads that iterate over the ProxyPushSupplier sets from (2) threads that perform consumer upcalls. This decoupling (1) yields more predictable behavior in hard real-time systems, (2) allows Event Channels to re-order the events, e.g., in order to perform dynamic scheduling [5], and (3) isolates event suppliers from the execution time of consumer upcalls.²

² Although this design may increase context switching overhead, many applications can tolerate this if Event Channels already use separate threads to perform upcalls.

Solution → use the Strategy pattern [27]: TAO’s Event Channel uses this pattern to strategize the dispatching algorithm and minimize overhead in applications that do not require complex concurrency and re-entrancy support. For complex use-cases, TAO’s Event Channel uses a special lock that updates the state in the set to indicate that a thread is performing an iteration. When this lock is released, any operations delayed while the lock was held are executed.

2.3.6 Selecting the Thread to Dispatch an Event

Context: Once an Event Channel has determined that a particular event should be dispatched to a consumer, it must decide which thread will perform the dispatching. As shown in Figure 4, there are several alternatives. Using the same thread that received the event is efficient, e.g., it reduces context switching, synchronization, and data copying overhead [16]. However, this design may expose the Event Channel to misbehaving consumers. Moreover, to avoid priority inversions in real-time systems, events must be dispatched by a thread at the appropriate priority. Likewise, highly scalable systems may want to use a pool of threads to dispatch events, thereby leveraging advanced hardware and overlapping I/O and computation.

Figure 4: Dispatching Strategies Supported in TAO’s Event Channel

Problem: An Event Channel must provide a flexible infrastructure to determine which thread dispatches an event to a particular consumer.

Solution → use the Strategy pattern [27]: This pattern can be applied to encapsulate the algorithm used to choose the dispatching thread. The selected dispatching strategy is responsible for performing any data copies that may be necessary to pass the event to a separate thread. The current implementation of TAO’s Event Channel exploits several optimizations, such as reference counting, in the TAO ORB to reduce those data copies. In applications with stringent real-time requirements, the dispatching strategy collaborates with TAO’s Scheduling Service [10, 5] to determine the appropriate queue (and thread) to process the event. When the same thread is used for reception and dispatching, the strategy collaborates with the ProxyPushSupplier to minimize locking overhead, as described in Section 2.3.5.
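The following simplified sketch shows the shape of such a strategy hierarchy: a reactive strategy runs the upcall in the calling thread, while a queue-based strategy hands it to a worker thread. The classes are illustrative stand-ins, not TAO’s actual dispatching strategies, and priority handling is omitted.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>

    using Upcall = std::function<void()>;   // e.g., [=]{ consumer->push(event); }

    class DispatchingStrategy {
    public:
      virtual ~DispatchingStrategy() = default;
      virtual void dispatch(Upcall upcall) = 0;
    };

    // Reactive: the thread that received the event also performs the consumer upcall.
    class ReactiveDispatching : public DispatchingStrategy {
    public:
      void dispatch(Upcall upcall) override { upcall(); }
    };

    // Queue-based: upcalls are executed by a dedicated worker thread, isolating
    // suppliers from the execution time (and misbehavior) of consumers.
    class ThreadedDispatching : public DispatchingStrategy {
    public:
      ThreadedDispatching() : worker_([this] { run(); }) {}
      ~ThreadedDispatching() {
        { std::lock_guard<std::mutex> g(lock_); done_ = true; }
        cv_.notify_one();
        worker_.join();
      }
      void dispatch(Upcall upcall) override {
        { std::lock_guard<std::mutex> g(lock_); queue_.push(std::move(upcall)); }
        cv_.notify_one();
      }
    private:
      void run() {
        std::unique_lock<std::mutex> g(lock_);
        while (!done_ || !queue_.empty()) {
          cv_.wait(g, [this] { return done_ || !queue_.empty(); });
          while (!queue_.empty()) {
            Upcall u = std::move(queue_.front());
            queue_.pop();
            g.unlock();
            u();                      // consumer upcall runs outside the lock
            g.lock();
          }
        }
      }
      std::mutex lock_;
      std::condition_variable cv_;
      std::queue<Upcall> queue_;
      bool done_ = false;
      std::thread worker_;
    };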
2.3.7 Configuring Event Channel Strategies Consistently

Context: To adapt to various use-cases, TAO’s Event Channel provides myriad strategies that can be configured by application developers. Often, the choice of one strategy affects other strategies. For example, if the Event Channel’s dispatching strategy always uses a separate thread to process the event, there is no risk of having re-entrant calls from consumers modifying the ProxyPushSupplier sets. Thus, a simpler strategy to manipulate those sets can be used.
Problem: Selecting a suitable combination of strategies can impose an undue burden on the developer and yield inefficient or semantically incompatible strategy configurations. Ideally, developers should be able to select from a set of configurations whose strategies have been pre-approved to achieve certain goals, such as minimizing latency, avoiding priority inversion, or improving system scalability.
Solution → use the Abstract Factory pattern [27] to control the creation of all the objects in the Event Channel: In this pattern a single interface creates families of related or dependent objects. We use it to provide a single point to select all the Event Channel strategies and to avoid incompatible choices. Concrete implementations of this Abstract Factory ensure that strategies and components are semantically compatible and collaborate correctly.
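A compact sketch of the idea: one factory object decides every strategy, so only compatible combinations can be produced. The concrete strategy and factory names below are hypothetical placeholders rather than TAO’s factory interface.

    #include <memory>

    // Strategy interfaces (bodies elided for brevity).
    class DispatchingStrategy { public: virtual ~DispatchingStrategy() = default; };
    class ProxySetStrategy    { public: virtual ~ProxySetStrategy() = default; };

    class ReactiveDispatching : public DispatchingStrategy {};
    class ThreadedDispatching : public DispatchingStrategy {};
    class LockedProxySet      : public ProxySetStrategy {};   // supports re-entrant upcalls
    class SimpleProxySet      : public ProxySetStrategy {};   // safe only with threaded dispatch

    // Abstract Factory: one object decides every strategy, keeping the choices compatible.
    class EventChannelFactory {
    public:
      virtual ~EventChannelFactory() = default;
      virtual std::unique_ptr<DispatchingStrategy> make_dispatching() = 0;
      virtual std::unique_ptr<ProxySetStrategy>    make_proxy_set() = 0;
    };

    // Low-latency configuration: reactive dispatch, therefore re-entrant-safe proxy sets.
    class LowLatencyFactory : public EventChannelFactory {
    public:
      std::unique_ptr<DispatchingStrategy> make_dispatching() override {
        return std::make_unique<ReactiveDispatching>();
      }
      std::unique_ptr<ProxySetStrategy> make_proxy_set() override {
        return std::make_unique<LockedProxySet>();
      }
    };

    // High-throughput configuration: threaded dispatch allows the simpler proxy set.
    class ThroughputFactory : public EventChannelFactory {
    public:
      std::unique_ptr<DispatchingStrategy> make_dispatching() override {
        return std::make_unique<ThreadedDispatching>();
      }
      std::unique_ptr<ProxySetStrategy> make_proxy_set() override {
        return std::make_unique<SimpleProxySet>();
      }
    };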
Figure 5: A Centralized Configuration of the TAO RT Event Service

2.3.8 Supporting Rapid Testing and Run-time Changes in the Configuration

Context: Some applications may be used in multiple environments, with different Event Channel strategies configured for each environment. During application development and testing, it may be necessary to evaluate multiple configurations to ensure that the application works in all of them or to identify the most efficient/scalable configurations.

Problem: If the Event Channel is statically configured, it is hard to evaluate various combinations without time-consuming recompiling and relinking.

Solution → use the Service Configurator pattern [29]: This pattern allows applications to dynamically or statically configure service implementations. We use ACE’s implementation of this pattern to dynamically load Abstract Factories that create various Event Channel configurations. Our implementation includes a default Abstract Factory that uses the scripting features of the ACE Service Configurator framework. By using this default, developers or end-users can modify the Event Channel configuration at initialization time by simply changing entries in a configuration file.

2.3.9 Exploiting Locality in Supplier/Consumer Pairs

Context: TAO’s Event Channels can be accessed transparently across distribution boundaries because they are based on CORBA. Many applications want to be shielded from distribution aspects, while simultaneously achieving high performance.

Problem: There are use-cases where distribution transparency may not yield the most effective configuration. For example, Figure 5 illustrates a scenario where most or all consumers for common events reside in the same process, host, or network as the supplier. Thus, sending an event to a remote Event Channel, only to have it sent back to the same process immediately, wastes network resources and increases latency unnecessarily. Likewise, there may be multiple remote consumers expecting the same event. Ideally, bandwidth should be conserved in this case by sending a single message across the network to all those remote consumers.

Solution → federate Event Channels: Figure 6 illustrates the use of Event Channel Gateways to federate Event Channels. Each Gateway is a CORBA object that connects to the local Event Channel as a supplier and to the remote Event Channel as a consumer. To reduce network traffic, the Gateway must subscribe only to events that are of interest to at least one local consumer. Suppliers and consumers connect directly to their local Event Channel. This design reduces average latency for all the consumers in the system because consumers and suppliers exhibit locality of reference, i.e., most consumers for any event are in the same domain as the supplier generating the event. Moreover, if multiple remote consumers are interested in the same event, only one message is sent to the remote Gateway, thereby minimizing network utilization. A straightforward and portable way to implement a Gateway is to use IIOP to receive a single event from a remote Event Channel and propagate it, through the local Event Channel, to multiple consumers. This design conserves network resources and increases latency only in an uncommon case, i.e., where local consumers receive events directly through the local Event Channel.

Figure 6: A Federated Event Channel Configuration

2.3.10 Updating the Gateway Subscriptions and Publications

Context: In a dynamic environment, subscriptions change constantly.

Problem: To use network resources efficiently, the Event Channel Gateways described in Section 2.3.9 must avoid subscribing to all events in a remote Event Channel. Otherwise, the locality of reference benefits of Event Channel federation are lost.

Solution → use the Observer pattern [27]: In this pattern, whenever one object changes state, all its dependents are notified. When this pattern is applied to TAO’s RT Event Service framework, we propagate the changes in the subscription and publication lists to all interested consumers and suppliers. The implementation of observables, i.e., the Event Channels, can be strategized. Thus, if applications know a priori that there will be no observers at run-time, they can configure the Event Channel to disable this feature, thereby eliminating the overhead required to update the (empty) observer list.

2.3.11 Further Improving Network Utilization

Context: In distributed interactive simulations, it is common for an event to be dispatched to multiple hosts in the same network.

Problem: Network bandwidth is often a scarce resource for large-scale simulations, particularly when they are run over WANs. As the number of nodes increases, sending the same event multiple times across a network does not scale.

Solution → use a multicast or broadcast protocol: TAO’s Event Channel can be configured to use UDP to multicast events. As with the Gateways described in Section 2.3.10, a special consumer can subscribe to all the events generated by local suppliers, as shown in Figure 7. This consumer uses multicast to send events to selected channels in the network. On each receiver, a designated supplier re-publishes all events that are of interest to local consumers. This supplier receives remote multicast traffic, converts it into an event, and forwards the event to its local consumers via an Event Channel. For both consumers and suppliers, the Observer interface described in Section 2.3.10 is used to modify the subscriptions and publications of multicast Gateways dynamically.

Figure 7: Using Multicast in Federated Event Channels

2.3.12 Exploiting Hardware- and Kernel-level Filtering

Context: If different types of events can be partitioned onto different multicast groups, consumer hosts receive only a subset of the multicast traffic. In large-scale distributed interactive simulations it may be necessary to disseminate events over several multicast groups. This design avoids unnecessary interrupts and processing by network interfaces and OS kernels when receiving multicast packets containing unwanted information.

Problem: The Event Channel must select the multicast group used for each type of event in a globally consistent way. However, the mapping between events and multicast groups may be different for each application. Applications can use different mechanisms to achieve that goal. For instance, some use pre-established mappings between their event types and the multicast groups, whereas others use a centralized service to maintain this mapping. Moreover, applications that require highly scalable fault tolerance may choose to distribute the mapping service across the network. An Event Channel must be able to satisfy all these scenarios without imposing any one strategy.

Solution → use a user-supplied callback object: Application developers can implement an address server, which is a CORBA object that Event Channel Gateways query to redirect events to the appropriate multicast group. On the receiver, the Gateways consult this service to decide which multicast groups to subscribe to, based upon the current set of subscriptions in the local Event Channel. Advanced operating systems and network adapters can use this information to process only multicast traffic that is relevant to them. To avoid single points of failure and to improve scalability, application developers can replicate address servers across the network. If developers use a static mapping between events and multicast groups, there is no need to communicate state between address servers. Conversely, if mappings change dynamically, applications must implement mechanisms to propagate these changes to all address servers. One solution is to use the Event Service itself to propagate this information.
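A sketch of the address-server callback, reduced to a local C++ interface (in TAO this would be a CORBA object); the mapping policy, group addresses, and ports shown below are made-up examples.

    #include <cstdint>
    #include <string>

    struct MulticastAddress {
      std::string group;     // e.g. "224.9.9.1" (example address only)
      std::uint16_t port;
    };

    class AddressServer {
    public:
      virtual ~AddressServer() = default;
      // Globally consistent mapping from an event type to its multicast group.
      virtual MulticastAddress group_for(int event_type) const = 0;
    };

    // Static mapping: no state needs to be exchanged between replicated servers.
    class StaticAddressServer : public AddressServer {
    public:
      explicit StaticAddressServer(std::uint16_t base_port) : base_port_(base_port) {}
      MulticastAddress group_for(int event_type) const override {
        // Partition event types across a small range of groups (illustrative policy).
        int bucket = event_type % 8;
        return { "224.9.9." + std::to_string(1 + bucket),
                 static_cast<std::uint16_t>(base_port_ + bucket) };
      }
    private:
      std::uint16_t base_port_;
    };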
2.3.13 Breaking Event Cycles in Event Channel Federations

Context: In a complex distributed interactive simulation, the same event can be important for both local and remote consumers. For instance, a local supplier can generate tank position events; if both a local and a remote consumer are interested in the event, the Gateways could continuously send the event between two federated Event Channels.

Problem: Consumers for a particular event can be present in multiple channels in the federation. In this case, Gateways will propagate events between the peers of the federation indefinitely due to cycles in the event flow graph. One approach would be to add addressing information to each event and enhance the routing logic in each Event Channel. However, this design would complicate the Gateway architecture for simpler use-cases and require additional communication among the peers.

Solution → use a time-to-live (TTL) counter: This counter is stored in each event and decremented each time the event passes through a Gateway. If the TTL counter reaches zero, the event is deallocated and not forwarded. Usually event channel federations are fully connected, i.e., all event channels have a Gateway to each of their peers. Thus, setting the TTL counter to 1 eliminates all cycles because no event traverses more than one Gateway link. In more complex distributed configurations, however, the TTL can be set to a higher number, though events may loop before being discarded. To further improve performance, the TAO event channel code has been optimized to reduce data copying: only the event header requires a copy to change the TTL counter; the payload, which usually contains most of the data, is not touched.

2.3.14 Providing Predictable and Efficient Periodic Events

Context: Real-time applications require an Event Channel to generate events at particular times in the future. For instance, applications can use these events to detect missed deadlines in non-critical processing or to support hardware that requires watchdog timers to identify faulty equipment. In addition, some applications require periodic events to initiate periodic tasks and to detect that the periodic tasks complete before their deadlines.

Hard real-time applications may assign different priorities to their timer events. To avoid priority inversions, therefore, events should be generated and dispatched by threads at the appropriate priorities. Soft real-time or best-effort applications often impose no such strict requirements on timer priorities and thus are better served by simpler strategies that conserve memory and CPU resources. Other applications require no timers at all, and obviously single-threaded applications cannot use this technique to generate periodic events.

Problem: Implement predictable periodic events for hard real-time applications without undue overhead for applications with lower predictability requirements.

Solution → use the Strategy pattern [27] to dynamically select the mechanisms used to generate timeout events: In TAO’s Event Channel, the ConsumerFilterBuilder creates special filter objects that adapt the timer module used to generate timeouts to consumers that expect the IDL structures used to represent events.
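One simple timer strategy, sketched below, is a dedicated thread that fires a timeout callback at a fixed period; hard real-time configurations would instead generate timeouts from threads at the appropriate priorities or from OS timers. The class is an illustration, not TAO’s Timer Module.

    #include <chrono>
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <thread>

    class PeriodicTimer {
    public:
      PeriodicTimer(std::chrono::milliseconds period, std::function<void()> on_timeout)
        : period_(period), on_timeout_(std::move(on_timeout)),
          thread_([this] { run(); }) {}

      ~PeriodicTimer() {
        { std::lock_guard<std::mutex> g(lock_); stop_ = true; }
        cv_.notify_one();
        thread_.join();
      }

    private:
      void run() {
        std::unique_lock<std::mutex> g(lock_);
        auto next = std::chrono::steady_clock::now() + period_;
        while (!stop_) {
          if (cv_.wait_until(g, next, [this] { return stop_; })) break;
          g.unlock();
          on_timeout_();                 // e.g., push a timeout event into the channel
          g.lock();
          next += period_;               // fixed-rate schedule, no drift accumulation
        }
      }

      std::chrono::milliseconds period_;
      std::function<void()> on_timeout_;
      std::mutex lock_;
      std::condition_variable cv_;
      bool stop_ = false;
      std::thread thread_;
    };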
3 Applying TAO’s Real-time Event Service to DMSO’s HLA RTI-NG

The U.S. Defense Modeling and Simulation Office (DMSO) developed a standard for distributed simulation middleware known as the High Level Architecture (HLA). Every implementation of this standard is referred to as a Run-time Infrastructure (RTI). Early RTIs demonstrated the viability of the specification, so DMSO commissioned the development of a next-generation RTI, the RTI-NG. The architecture of the HLA RTI-NG is centered around TAO’s Real-time Event Service. Fundamentally, the HLA specifies publish/subscribe distributed middleware, so this is an excellent foundation. Those aspects of the HLA not well-suited to publish/subscribe (mostly aspects involving the control of data dissemination) use the RMI provided by TAO’s implementation of CORBA. These RMI aspects of the RTI-NG will not be considered further in the remainder of this discussion.

In HLA parlance, a group of participants cooperating in a distributed simulation is a federation.³ A federate is the term describing a participant in a federation. A given process that belongs to a federation may contain one or more federates. Every federate belongs to only a single federation; however, the federates in a multi-federate process need not all belong to the same federation, as shown in Figure 8. Federates not in the same federation cannot communicate using the HLA.⁴

³ Strictly speaking, federation execution is the name given to the group of participants while they are actually running, to draw a distinction between a running simulation and a simulation that is not running (e.g., one being planned).

⁴ It is possible to bridge communication between HLA federates. However, such bridging is not standardized by the HLA specification.

Figure 8: Federations are composed of disjoint sets of federates. A single process may have more than one federate, possibly joined to different federations.

Since there is no need to support communication between federates in different federations, every RTI-NG process contains a separate instance of the data dissemination mechanisms for each of the federations with which the process must communicate. This data dissemination mechanism primarily consists of three instances of TAO’s Real-time Event Channel, as well as several gateways as described in Section 2.3.9 and a single multicast gateway as described in Section 2.3.11.

The HLA specifies four combinations of data transport and ordering: reliable/receive-ordered, reliable/time-stamp-ordered, best-effort/receive-ordered, and best-effort/time-stamp-ordered. The multiple Event Channels and gateways are used to efficiently support these four combinations of transport and ordering. Two of the Event Channels, as well as the gateways, are used to support the reliable transport (both receive- and time-stamp-ordered). The third Event Channel and the multicast gateway are used to support the best-effort transport.

For the reliable transport, one Event Channel is used to handle the outbound (reliable) data. The other Event Channel and the gateways are used to handle the inbound data. Consider any two processes that contain federates in the same federation, process A and process B. The outbound event channel in A is connected to a gateway in process B that supplies events to process B’s inbound Event Channel. This configuration is depicted in Figure 9.
Figure 9: Outbound reliable Event Channels connect to remote gateways which, in turn, connect to co-located inbound reliable Event Channels. Outbound multicast data is transmitted directly to remote multicast gateways which, in turn, supply data to co-located inbound multicast Event Channels.

As was discussed in Section 2.3.10, Event Channel subscriptions and publications can change quite frequently. Using separate Event Channels for the inbound and outbound data allows publications (subscriptions) to change concurrently with the receipt (transmission) of events.

In order to support the reliable and time-stamp-ordered data delivery required by the HLA, the RTI-NG uses a distributed snapshot algorithm that requires every reliable event that is sent and received to be accounted for. To accomplish this, TAO’s ProxyPushSupplier was extended to count every reliable event that is transmitted to a remote gateway. Likewise, TAO’s Event Channel gateway was extended to count every reliable event that is received. This information is then provided to the distributed snapshot algorithm.

For simplicity, best-effort data is transmitted via IP multicast directly, i.e., without the use of TAO’s Event Channel. However, best-effort data is received using TAO’s multicast Event Channel gateway. Each multicast gateway passes the data it receives to an inbound multicast Event Channel. Once again, this approach minimizes the impact of publication and subscription changes on not only the best-effort data, but the reliable data as well.

In addition to the four combinations of data transport and ordering, the HLA specifies that the programmers developing a federation have the ability to logically segment the data exchanged amongst federates to reduce the amount of unwanted network traffic. Ideally, the implementation of the user-defined segmentation would be perfect and each subscriber would receive exactly the data in which it is interested. In practice, the implementation of the user-defined segmentation is (almost) never perfect and unwanted data must be filtered on the receive side.

The RTI-NG has various static mappings between these user-defined segmentations and TAO suppliers. Each supplier supplies only one type of event. On the send side, the RTI-NG maps a given user-defined segmentation onto a set of TAO suppliers. In the case of reliable transport, these suppliers push events onto the reliable outbound event channel. The outbound reliable Event Channel uses the consumer filtering scalability improvements described in Section 2.3.2. In the case of best-effort transport, there is but one specialized supplier. This supplier collaborates with an address server (as described in Section 2.3.11) to transmit events to the multicast group dictated by the statically determined mapping.

On the receive side, the RTI-NG instantiates one consumer for every federate in the process. The inbound Event Channels (both reliable and multicast) deliver their events to these consumers. Thus the inbound Event Channels serve to dispatch an event to a potentially large number of co-located consumers. Finally, it should be noted that events exchanged by the RTI-NG do not carry CORBA Anys, as specified in the OMG Event Service Specification, but instead carry CORBA octet sequences. This is an important optimization that was made to eliminate the potentially tremendous overhead involved in sending and receiving CORBA Anys.
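To make the event-accounting extension concrete, the following standalone sketch wraps a supplier-side push path with a counter. It is a schematic illustration under our own naming (Event, PushFn, and CountingPushSupplier are hypothetical), not the actual RTI-NG or TAO code.

    #include <atomic>
    #include <cstdint>
    #include <functional>

    struct Event { /* header and octet-sequence payload elided */ };

    using PushFn = std::function<void(const Event&)>;

    // Decorator that counts every reliable event pushed toward a remote gateway,
    // so the total can be provided to the distributed snapshot algorithm.
    class CountingPushSupplier {
    public:
      explicit CountingPushSupplier(PushFn next) : next_(std::move(next)) {}

      void push(const Event& e) {
        sent_.fetch_add(1, std::memory_order_relaxed);
        next_(e);                       // forward to the real transport / gateway
      }

      std::uint64_t sent() const { return sent_.load(std::memory_order_relaxed); }

    private:
      PushFn next_;
      std::atomic<std::uint64_t> sent_{0};
    };
    // A matching counter on the receiving gateway tallies events received; both
    // totals are handed to the snapshot algorithm.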
4 Empirical Performance Evaluation
We evaluate our design decisions using the series of experiments described below. Unless otherwise noted, all the experiments were performed on two identically configured nodes. Each was a Pentium III running at 866 MHz with 512 MB of RAM, running TimeSys Linux/RT v2.2.14. The tests are based on TAO version 1.1.19, compiled using gcc 2.95.2. Both nodes were connected via a 100BaseT Ethernet hub; other than the test in question, the network carried no significant traffic, nor was any other processing taking place on these hosts.

In our first experiment we establish the baseline latency for the Event Service. To this effect, a supplier and a consumer object are placed on the same host, and the time to deliver an event through a remote event channel is measured (see Figure 10). Placing both the consumer and the supplier on the same host obviates the need for a distributed clock and achieves higher precision in the measurements.

Figure 10: Experimental setup to measure the Event Service Latency

The frequency distribution for this experiment is shown in Figure 11 and summarized in the following table (all times in μsecs):

    Average   Minimum   Maximum   Jitter
    780       762       5490      34.53

Figure 11: Measuring Latency in a Simple Event Service Configuration

A better representation of these results is obtained using the ogive function for the measured latency, that is, the percentage of the samples that are smaller than or equal to a given latency value. Using this technique we can easily determine that over 80% of the samples are not more than one standard deviation over the average, and that 95% of the samples are under 800 μsecs. In fact, in this experiment 99.9% are under 1 millisecond.

Figure 12: Latency Experiments Represented as an Ogive Function

In the second set of experiments we use the Event Service architecture described in Section 2.3.9; in this experiment we measure the roundtrip latency observed when an event is sent and returned via the Event Service federation, as shown in Figure 13.

Figure 13: Measuring Latency in a Simple Event Service Configuration

Ideally, the Event Service should add as little overhead as possible over the underlying network, as well as avoid any extra jitter. To put these results in perspective, we also performed latency experiments over (1) the ORB using callback calls, (2) the ORB using twoway calls, and (3) raw TCP/IP. In these experiments we obtained the following results (all times in μsecs):

    Transport       Average   Minimum   Maximum   Jitter
    TCP/IP          475       473       2223      17.32
    ORB             513       511       2970      29.79
    Event Service   780       762       5490      34.53
    Federation      722       710       9372      34.64

Figure 14 compares these latency results for the different transports.

Figure 14: Compared Latency Results for Different Transports
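The ogive curves in Figures 12 and 14 are simply empirical cumulative distributions of the measured latency samples. The following self-contained snippet (with placeholder sample values) shows the computation used to read off statements such as "95% of the samples are under 800 μsecs".

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Percentage of samples that are <= the given latency (the "ogive" value).
    double ogive(std::vector<double> samples, double latency_usecs) {
      std::sort(samples.begin(), samples.end());
      auto it = std::upper_bound(samples.begin(), samples.end(), latency_usecs);
      return 100.0 * static_cast<double>(it - samples.begin()) /
             static_cast<double>(samples.size());
    }

    int main() {
      // Placeholder samples; the real experiments use thousands of measured round-trips.
      std::vector<double> samples = {762, 770, 775, 781, 790, 805, 830, 5490};
      const double cuts[] = {780.0, 800.0, 1000.0};
      for (double cut : cuts)
        std::printf("%.0f usecs -> %.1f%% of samples\n", cut, ogive(samples, cut));
    }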
5 Concluding Remarks

6 Acknowledgments

A OS and ORB Performance Analysis

Evaluating the performance of a complex software system is a challenging task. The problem becomes even more challenging if the software system is built over some combination of middleware and a real-time OS. To help our readers interpret the results in the previous sections, we have measured the performance of TAO over different operating systems, while keeping the hardware platform constant. Such experiments isolate the overhead due to the ORB in an OS- and hardware-independent way.

A.1 Overview of the Hardware and Software Configuration

A.1.1 Hardware Overview

All of the tests in this section were run on an Intel Pentium III at 866 MHz; the system had 512 MB of RAM and a 256 KB cache. We focused primarily on a single-CPU hardware configuration to factor out differences in network interface driver support and to isolate the effects of OS design and implementation on the end-to-end performance of ORB middleware and applications.

A.1.2 Operating Systems Overview

We ran the ORB/OS benchmarks described in this paper on three real-time operating systems, VxWorks 5.4, QNX-RTP 6.0, and Linux/RT 2.2.14, and two general-purpose operating systems with real-time scheduling classes, Windows NT 4.0 Workstation with SP5 and Debian GNU/Linux (kernel version 2.2.14). On all platforms, we used the TCP/IP stack supplied with the operating system. A brief overview of each OS follows:

VxWorks: VxWorks is a real-time OS that supports multithreading and interrupt handling. By default, the VxWorks thread scheduler uses a priority-based first-in first-out (FIFO) preemptive scheduling algorithm, though it can be configured to support round-robin scheduling. In addition, VxWorks provides semaphores that implement a priority inheritance protocol.

QNX: QNX is a real-time, multi-threaded, micro-kernel OS that is POSIX-compatible. Unique to QNX is the usage of message passing as a fundamental means of inter-process communication (IPC). QNX supports preemptive, priority-based FIFO and round-robin scheduling. QNX also implements the priority inheritance protocol for mutexes.

Linux: Linux is a general-purpose, preemptive, multithreaded implementation of SVR4 UNIX, BSD UNIX, and POSIX. It supports POSIX real-time process and thread scheduling. The thread implementation utilizes processes created by a special clone version of fork. This design simplifies the Linux kernel, though it limits scalability because kernel process resources are used for each application thread.

Linux/RT: TimeSys Linux/RT adds a resource kernel (RK) to the core Linux kernel. This is intended to deliver real-time capabilities from Linux, such as fixed-priority scheduling with priority inheritance and better-resolution timers. Linux/RT is also binary compatible with Linux. Thus, it is possible to have Linux and Linux/RT on the same hardware very easily. Booting with the Linux/RT kernel starts the Linux/RT OS; the rest of the system (file systems, C libraries, compiler, command-line tools, etc.) is just a vanilla Linux distribution.

Windows NT: Microsoft Windows NT is a general-purpose, preemptive, multi-threading OS designed to provide fast interactive response. Windows NT uses a round-robin scheduling algorithm that attempts to share the CPU fairly among all ready threads of the same priority. Windows NT defines a high-priority thread class called REALTIME_PRIORITY_CLASS. Threads in this class are scheduled before most other threads, which are usually in the NORMAL_PRIORITY_CLASS. Windows NT is not designed as a deterministic real-time OS, however. In particular, its internal queuing is performed in FIFO order and priority inheritance is not supported for mutexes or semaphores. Moreover, there is no way to prevent hardware interrupts and OS interrupt handlers from preempting application threads.

A.1.3 Benchmarking Metric Description

The remainder of this section describes the results of the following benchmarking metrics we developed to evaluate the performance and predictability of VxWorks, QNX, Windows NT, Linux/RT, and Linux running TAO:

ORB/OS operation throughput: This test provides an indication of the maximum operation throughput that applications can expect. It measures end-to-end two-way response when the client sends a request immediately after receiving the response to the previous request. This test and its results are presented in Section A.2.

ORB/OS latency and jitter: This test measures how high-priority tasks are affected, both in terms of latency of operations and jitter, when the number of lower-priority tasks is increased. In the ideal case, higher-priority tasks should not be affected at all, while the latency of lower-priority tasks should increase as their numbers increase. For real-time systems it is also imperative that the jitter of high-priority tasks remain as constant as possible when the number of lower-priority tasks is varied. This test and its results are presented in Section A.3.

Context switch overhead: These tests measure general OS context switch overhead. High context switch overhead can significantly degrade application responsiveness and determinism. These tests and their results are presented in Section A.4.

Memory footprint: This test measures the size of the static ACE and TAO libraries on the various operating systems. As already mentioned, ACE and TAO were compiled statically on all the platforms. The results are presented in Section A.5.
A.2 Measuring ORB/OS Operation Throughput

A.2.1 Terminology synopsis

Operation throughput is the maximum rate at which operations can be performed. We measure the throughput of both two-way (request/response) and one-way (request without response) operations from client to server. This test indicates the overhead imposed by the ORB and OS on each operation.

A.2.2 Overview of the operation throughput metric

Our throughput test, called IDL_Cubit, uses a single-threaded client that issues an IDL operation at the fastest possible rate. The server performs the operation, which is to cube each parameter in the request. For two-way operations, the client thread waits for the response and checks that it is correct. Interprocess communication is performed via the network loopback interface because the client and server process run on the same machine.

The time required for cubing the argument on the server is small but non-zero. The client performs the same operation and compares it with the two-way operation result. The cubing operation itself is not intended to be a representative workload. However, many real-time and embedded applications do rely on a large volume of small messages that each require a small amount of processing. Therefore, the IDL_Cubit benchmark is useful for evaluating ORB/OS overhead by measuring operation throughput.

We measure throughput for one-way and two-way operations using a variety of IDL data types, including void, short, long, and sequence types. The one-way operation measurement eliminates the server reply overhead. The void data type instructs the server not to perform any processing other than that necessary to prepare and send the response, i.e., it does not cube its input parameters. The sequence data types exercise TAO's marshaling/de-marshaling engine.

A.2.3 Results of the operation throughput measurements

The throughput measurements are shown in Figure 15. Linux (along with Linux/RT) exhibits the best operation throughput for all the data types tested, with around 10,000 operations/sec for simple data types. The one-way throughput on Linux, however, was significantly higher and almost double that on all of the other platforms. We do not have an explanation for this, though we note that Linux threads are heavier weight than on other systems.

QNX and VxWorks results: QNX and VxWorks offer consistently good performance for simple types such as void, short, and long, with QNX's throughput being around 6,000 calls/sec and VxWorks a little lower at 5,500 calls/sec.

Windows NT results: Windows NT performed the worst of all the OSes compared, except on the one-way test, where it was better than QNX and VxWorks but still worse than Linux and Linux/RT.

A.2.4 Result synopsis

Operation throughput provides a measure of the overhead imposed by the ORB/OS. The IDL_Cubit test directly measures throughput for a variety of operation types and data types. Our measurements show that end-to-end performance depends dramatically on data type and the operating system.
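Concretely, the throughput metric amounts to timing a tight loop of two-way invocations and dividing the call count by the elapsed time. The sketch below shows that measurement structure, with a local cube() stub standing in for the remote CORBA call; the iteration count is arbitrary.

    #include <chrono>
    #include <cstdio>

    // Local stand-in for the remote two-way operation exercised by the benchmark.
    static long cube(long x) { return x * x * x; }

    int main() {
      const int iterations = 100000;
      volatile long sink = 0;                      // keep the calls from being optimized away

      auto start = std::chrono::steady_clock::now();
      for (int i = 0; i < iterations; ++i)
        sink = sink + cube(i % 100);               // in the real test this is a CORBA call
      auto stop = std::chrono::steady_clock::now();

      double secs = std::chrono::duration<double>(stop - start).count();
      std::printf("%d calls in %.3f s -> %.0f calls/sec\n",
                  iterations, secs, iterations / secs);
      return sink == -1;                           // prevent dead-code elimination
    }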
Figure 15: Operation Throughput

Figure 16: ORB Endsystem Latency and Jitter Test Configuration
A.3 Measuring ORB/OS Latency and Jitter

A.3.1 Terminology synopsis

ORB end-to-end latency is defined as the average delay seen by a client thread from the time it sends a request to the time it completely receives the response from a server thread. Jitter is the variance of the latency for a series of requests. Increased latency directly impairs the ability to meet deadlines, whereas jitter makes real-time scheduling more difficult.

A.3.2 Overview of latency and jitter metrics

We computed the latency and jitter incurred by various clients and servers using the configurations shown in Figure 16.

Server Configuration As shown in Figure 16, our test bed server consists of one servant, S0, with the highest real-time priority, P0, and servants S1 ... Sn that have lower thread priorities, each with a different real-time priority P1 ... Pn. Each thread processes requests that are sent to its servant by client threads in the other process. Each client thread communicates with a servant thread that has an identical priority, i.e., a client A with thread priority PA communicates with a servant A that has thread priority PA.
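The benchmark's actual thread setup is part of TAO's test suite; the following POSIX sketch only illustrates how a set of threads with descending SCHED_FIFO priorities (P0 > P1 > ... > Pn) could be created. The service_requests() body and the priority values are placeholders, not taken from the benchmark code.

#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Placeholder thread body: in the benchmark this would process the CORBA
// requests sent to the servant owned by this thread.
void *service_requests (void *arg)
{
  int priority = *static_cast<int *> (arg);
  std::printf ("thread running at SCHED_FIFO priority %d\n", priority);
  return nullptr;
}

// Spawn n + 1 threads with descending real-time priorities P0 > P1 > ... > Pn.
// Creating SCHED_FIFO threads typically requires appropriate OS privileges.
void spawn_prioritized_threads (int n)
{
  static int priorities[64];
  const int top = sched_get_priority_max (SCHED_FIFO);

  for (int i = 0; i <= n && i < 64; ++i)
    {
      priorities[i] = top - i;   // illustrative mapping of P0 ... Pn

      pthread_attr_t attr;
      pthread_attr_init (&attr);
      pthread_attr_setinheritsched (&attr, PTHREAD_EXPLICIT_SCHED);
      pthread_attr_setschedpolicy (&attr, SCHED_FIFO);

      sched_param param;
      param.sched_priority = priorities[i];
      pthread_attr_setschedparam (&attr, &param);

      pthread_t tid;
      pthread_create (&tid, &attr, service_requests, &priorities[i]);
      pthread_attr_destroy (&attr);   // threads left running for the test
    }
}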
Client Configuration Figure 16 shows how the benchmarking test uses clients C0 ... Cn. The highest priority client, C0, runs at the default OS real-time priority P0 and invokes operations at 20 Hz, i.e., it invokes 20 CORBA two-way calls per second. The remaining clients, C1 ... Cn, have different lower OS thread priorities P1 ... Pn and invoke operations at 10 Hz, i.e., they invoke 10 CORBA two-way calls per second.
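A minimal sketch of such a rate-based client loop is shown below, assuming a placeholder invoke_twoway() callback for the CORBA two-way call; it paces invocations at rate_hz calls per second and records each call's latency. This is generic C++ rather than the benchmark's actual driver.

#include <chrono>
#include <thread>
#include <vector>

// Invoke two-way requests at rate_hz calls per second and record the
// per-call latency in microseconds. invoke_twoway is a placeholder for
// the actual CORBA two-way operation on the servant.
std::vector<double> run_client (int rate_hz, int total_calls,
                                void (*invoke_twoway) ())
{
  using clock = std::chrono::steady_clock;
  std::vector<double> latencies;
  latencies.reserve (total_calls);

  const auto period = std::chrono::microseconds (1000000 / rate_hz);
  auto next_deadline = clock::now ();

  for (int i = 0; i < total_calls; ++i)
    {
      auto start = clock::now ();
      invoke_twoway ();                 // send request and wait for response
      auto end = clock::now ();

      latencies.push_back (
        std::chrono::duration<double, std::micro> (end - start).count ());

      next_deadline += period;          // pace the calls at rate_hz
      std::this_thread::sleep_until (next_deadline);
    }
  return latencies;
}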
All client threads have priorities matching those of their corresponding servant threads. In each call, the client sends a value of type CORBA::Octet to the servant. The servant cubes the number and returns it to the client, which checks that the returned value is correct. When the test program creates the client threads, these threads block on a barrier lock so that no client begins until the others are created and ready to run. When all client threads are ready to begin sending requests, the main thread unblocks them. These threads execute in an order determined by the real-time thread dispatcher.

Each low-priority client thread invokes 4,000 CORBA two-way requests at its prescribed rate. The high-priority client thread makes CORBA requests as long as there are low-priority clients issuing requests. Thus, high-priority client operations run for the duration of the test. In an ideal ORB endsystem, the latency for the low-priority clients should rise gradually as the number of low-priority client threads increases. This behavior is expected because the low-priority clients compete for OS and network resources as the load increases. The high-priority client, however, should remain constant or show only a minor increase in latency. In general, a significant amount of jitter complicates the computation of realistic worst-case execution times, which makes it hard to create a feasible real-time schedule.
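Given the per-call latencies collected by each client, average latency and jitter can then be summarized. The sketch below reports jitter as the standard deviation of the samples (the square root of the variance defined above); it is a generic helper, not the benchmark's exact statistics code.

#include <cmath>
#include <vector>

struct LatencyStats
{
  double mean;    // average two-way latency (microseconds)
  double jitter;  // standard deviation of the latency samples
};

LatencyStats compute_stats (const std::vector<double> &samples)
{
  if (samples.empty ())
    return LatencyStats{ 0.0, 0.0 };

  double sum = 0.0;
  for (double s : samples)
    sum += s;
  const double mean = sum / samples.size ();

  double var = 0.0;
  for (double s : samples)
    var += (s - mean) * (s - mean);
  var /= samples.size ();

  return LatencyStats{ mean, std::sqrt (var) };
}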
A.3.3 Results of Latency and Jitter Metrics

The average two-way response time incurred by the high-priority clients is shown in Figure 17. The jitter results are shown in Figure 18.

Figure 17: TAO's Latency for High Priority Clients (average latency in microseconds versus the number of low-priority clients, for NT, Linux/RT, Linux, and QNX)

Figure 18: TAO's Jitter for High Priority Clients (jitter in microseconds versus the number of low-priority clients, for NT, Linux/RT, Linux, and QNX)

QNX results QNX had the best results for high-priority client jitter (Figure 18). It remained essentially constant as the number of low-priority clients increased. Its low-priority client latency, however, was only third best. High-priority latency was quite consistent, showing no major change as the number of low-priority clients increased, although it was higher than that of both Linux/RT and Linux. Overall, QNX produced consistent results, i.e., low-priority client jitter and latency rose nearly linearly with the number of low-priority clients. (Unfortunately, QNX is pre-configured to support a very small number of socket handles (32), so the QNX results in this paper are limited to only 9 simultaneous clients. We plan to rebuild the QNX release and re-run the tests with a larger number of client threads.)

Linux/RT (TimeSys 1.0) results Linux/RT had better results than QNX for high-priority client jitter with few client threads (up to 5). With 10 clients, however, its jitter was much greater than that of QNX, and with a larger number of clients (15 and above) its jitter rose almost linearly with the number of low-priority clients. Linux/RT had lower high-priority latency than QNX, but higher than vanilla Linux. As the number of clients increased (15 and above), however, Linux/RT had the lowest high-priority latency. Low-priority client jitter and latency rose almost linearly up to 15 client threads, though there was a marked increase in low-priority jitter from 15 to 20 client threads.

Linux (2.2.14) results Linux did surprisingly well on high-priority latency for a small number of low-priority client threads. It had the lowest latency among all the operating systems tested for up to 10 clients, better than Linux/RT and QNX. As the number of clients increased (above 15), however, it fell behind Linux/RT, though it was still better than Win-NT. Thus, Linux performs well with a low number of threads. For low-priority clients, Linux incurred high client jitter and latency when the number of threads was high (above 15), although its latency and jitter were lower than those of Linux/RT for up to 25 low-priority clients.

Windows NT results The high-priority client latency for Win-NT increased linearly with the number of client threads and was the worst of the OS platforms tested. For high-priority jitter, Win-NT was also the worst; the pattern was erratic and unpredictable. An interesting observation is that both jitter (for low- and high-priority clients) and low-priority latency fell from 15 to 20 client threads. Low-priority jitter was better than that of QNX (up to 10 clients), but it was the worst of all the platforms as the number of client threads increased. Overall, Win-NT produced poor results in all cases. In addition, the unpredictability of the results indicates that Win-NT could be unsuitable for applications requiring predictable QoS guarantees.
A.3.4 Result synopsis

In general, low latency and jitter are necessary for real-time operating systems to bound application execution times. General-purpose operating systems show erratic behavior, particularly under higher load. For example, Win-NT and Linux exhibit higher latency for high-priority clients, and Win-NT had almost three times the jitter of Linux/RT with 25 clients. In contrast, real-time operating systems are more predictable: high-priority jitter for QNX was almost constant, and for Linux/RT jitter rose only slightly with an increase in load.

A.4 Measuring ORB/OS Context Switching Overhead

A.4.1 Terminology synopsis

A context switch involves the suspension of one thread and the immediate resumption of another thread. The time between suspension and resumption is the context switching overhead, which indicates the efficiency of the OS thread dispatcher. From the point of view of applications and ORB middleware, context switch overhead should be minimized because it directly reduces the effective use of CPU resources. There are two types of context switching, voluntary and involuntary, which are defined as follows:

Voluntary context switch This occurs when a thread voluntarily yields the processor before its time slice completes, most commonly because it blocks awaiting a resource to become available.

Involuntary context switch This occurs when a higher priority thread becomes runnable or when the current thread's time quantum expires.

A.4.2 Overview of context switching overhead metrics

We measured OS context switching overhead using three metrics. The first is the Suspend-Resume test, which measures two different times:

1. The time to resume a blocked high-priority thread, which does nothing other than block again immediately when it is resumed. A low-priority thread resumes the high-priority thread, so the elapsed time includes two context switches, one thread suspend, and one thread resume.

2. The time to suspend and resume a low-priority thread that does nothing, which involves no context switching. This time is subtracted from the one described above, and the result is divided by two to yield the context switch time.

POSIX threads do not support a suspend/resume thread interface. Therefore, the Suspend-Resume test is not applicable to OS platforms, such as QNX and Linux, that only support POSIX threads.

The second context switching metric is the Yield test. It runs two threads at the same priority; each thread iteratively calls its system function to immediately yield the CPU.

The third context switching metric is the Synchronized Suspend-Resume test. This test contains two threads, one of higher priority than the other, and measures two different times:

1. The time for the high-priority thread to block on a mutex held by the low-priority thread. Just prior to releasing the mutex, the low-priority thread reads the high-resolution clock (tick counter). Immediately after acquiring the mutex, the high-priority thread also reads the high-resolution clock. The time between the two clock reads includes a mutex release, a context switch, and a mutex acquire. The low-priority thread uses a semaphore to suspend each iteration of the high-priority thread, which prevents the high-priority thread from simply acquiring and releasing the mutex ad infinitum; the timed portions of the test do not include the semaphore operation overhead.

2. The time to acquire and release a mutex in a single thread, without context switching. This time is subtracted from the one described above to yield the context switch time.

We used multiple context switch metrics because not all OS platforms support each approach. Moreover, some operating systems show anomalous results with certain metrics. For example, Win-NT performs very poorly on the Suspend-Resume and Yield tests. Below, we describe the results from the tests that measure OS context switching overhead.
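As an illustration of the mutex-based timing in the Synchronized Suspend-Resume test, the sketch below performs a single iteration using standard C++ threads and std::chrono in place of the platform's high-resolution tick counter. It omits the real-time priority assignment and the semaphore that gates iterations in the actual test, so it is only an approximation of the measurement described above.

#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

using clock_type = std::chrono::steady_clock;

int main ()
{
  std::mutex m;
  clock_type::time_point release_time, acquire_time;

  m.lock ();                              // the "low-priority" thread holds the mutex

  std::thread high ([&] {
    m.lock ();                            // blocks until the holder releases it
    acquire_time = clock_type::now ();    // read the clock right after acquiring
    m.unlock ();
  });

  // Give the other thread time to block on the mutex (crude, for illustration).
  std::this_thread::sleep_for (std::chrono::milliseconds (10));

  release_time = clock_type::now ();      // read the clock just before releasing
  m.unlock ();
  high.join ();

  // Includes a mutex release, a context switch, and a mutex acquire.
  double raw = std::chrono::duration<double, std::micro> (
      acquire_time - release_time).count ();

  // Baseline: acquire and release an uncontended mutex in a single thread.
  std::mutex m2;
  auto t0 = clock_type::now ();
  m2.lock ();
  m2.unlock ();
  double baseline = std::chrono::duration<double, std::micro> (
      clock_type::now () - t0).count ();

  std::printf ("approx. context switch time: %f usec\n", raw - baseline);
  return 0;
}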
Table 1: Context Switching Latency in Microseconds (jitter is shown in parentheses)

Operating System   Suspend/Resume Test   Yield Test        Synch Test
VxWorks            0.586 (0.025)         N/A               0.821 (0.019)
QNX                N/A                   0.470 (0.007)     0.861 (0.003)
Linux/RT           N/A                   0.645 (0.007)     2.88  (0.015)
Linux              N/A                   0.548 (0.006)     2.559 (0.011)
Windows-NT         989.559 (0.836)       932.815 (0.562)   1.667 (0.009)
A.4.3 Results of OS context switch overhead metrics

The results in Table 1 show that QNX and VxWorks are the top performers, though only one test, the Synchronized Suspend-Resume test, is common to both. In that test QNX was only slightly slower, but its jitter was considerably lower than that of VxWorks, making it more predictable. On the Yield test, QNX was the best. The other real-time OS, Linux/RT, was around three times slower than QNX or VxWorks on the Synchronized Suspend-Resume test, and it was also slower than QNX and vanilla Linux on the Yield test. Linux/RT showed a consistent trend of having a higher context switch time than Linux, possibly due to the additional preemption points in its kernel; its jitter was also higher than that of vanilla Linux. Win-NT had anomalous results on the Suspend-Resume and Yield tests, being thousands of times slower than any other OS. This could be due to a bug in the scheduler or because the scheduler takes excessive time to compute which thread to run next. Win-NT did, however, perform better than Linux or Linux/RT on the Synchronized Suspend-Resume test. These results demonstrate that it is difficult to measure context switching overhead reliably, which is why multiple measures of context switch time are useful.
Figure 19: Memory footprint in kilobytes (sizes of the static ACE and TAO libraries on each platform)
A.5 Measuring ORB Footprint

A.5.1 Overview of footprint metric

We measured the sizes of the ACE (libACE) and TAO (libTAO) libraries using the 'size' command. Since the libraries were compiled as static libraries, they give a measure of the memory overhead that applications would incur if they included all the features of ACE and TAO.

A.5.2 Results of footprint metrics

Figure 19 shows the memory footprint measured on each of the platforms, in kilobytes. These results show that Win-NT has the smallest memory footprint, followed by VxWorks, while QNX has the largest. This indicates that the NT compiler is very good at producing compact code. For VxWorks the optimization level was only -O, whereas for QNX, Linux, and Linux/RT it was -O3; the aggressive inlining that occurs at the higher optimization level may be responsible for the larger memory footprints.

References

[1] I. C. Society, ed., Standard for Distributed Interactive Simulation – Communication Services and Profiles. 345 E. 47th St, New York, NY 10017, USA: IEEE Computer Society, 1995.
[2] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, and M. Stal, Pattern-Oriented Software Architecture – A System of Patterns. Wiley and Sons, 1996.
[3] A. Gokhale and D. C. Schmidt, "Techniques for Optimizing CORBA Middleware for Distributed Embedded Systems," in Proceedings of INFOCOM '99, Mar. 1999.
[4] D. C. Schmidt, T. H. Harrison, and E. Al-Shaer, "Object-Oriented Components for High-speed Network Programming," in Proceedings of the 1st Conference on Object-Oriented Technologies and Systems, (Monterey, CA), USENIX, June 1995.
[5] C. D. Gill, D. L. Levine, and D. C. Schmidt, "The Design and Performance of a Real-Time CORBA Scheduling Service," Real-Time Systems, The International Journal of Time-Critical Computing Systems, special issue on Real-Time Middleware, vol. 20, March 2001.
[6] J. A. Zinky, D. E. Bakken, and R. Schantz, "Architectural Support for Quality of Service for CORBA Objects," Theory and Practice of Object Systems, vol. 3, no. 1, pp. 1–20, 1997.
[7] D. C. Schmidt and C. Cleeland, "Applying Patterns to Develop Extensible ORB Middleware," IEEE Communications Magazine, vol. 37, April 1999.
[8] T. H. Harrison, D. L. Levine, and D. C. Schmidt, "The Design and Performance of a Real-time CORBA Event Service," in Proceedings of OOPSLA '97, (Atlanta, GA), pp. 184–199, ACM, October 1997.
[9] Object Management Group, The Common Object Request Broker: Architecture and Specification, 2.3 ed., June 1999.
[10] D. C. Schmidt, D. L. Levine, and S. Mungee, "The Design and Performance of Real-Time Object Request Brokers," Computer Communications, vol. 21, pp. 294–324, Apr. 1998.
[11] F. Kuhns, D. C. Schmidt, and D. L. Levine, "The Design and Performance of a Real-time I/O Subsystem," in Proceedings of the 5th IEEE Real-Time Technology and Applications Symposium, (Vancouver, British Columbia, Canada), pp. 154–163, IEEE, June 1999.
[12] F. Kuhns, C. O'Ryan, D. C. Schmidt, and J. Parsons, "The Design and Performance of a Pluggable Protocols Framework for Object Request Broker Middleware," in Proceedings of the IFIP 6th International Workshop on Protocols For High-Speed Networks (PfHSN '99), (Salem, MA), IFIP, August 1999.
[13] D. C. Schmidt, S. Mungee, S. Flores-Gaitan, and A. Gokhale, "Software Architectures for Reducing Priority Inversion and Non-determinism in Real-time Object Request Brokers," Journal of Real-time Systems, special issue on Real-time Computing in the Age of the Web and the Internet, vol. 21, no. 2, 2001.
[14] A. B. Arulanthu, C. O'Ryan, D. C. Schmidt, M. Kircher, and J. Parsons, "The Design and Performance of a Scalable ORB Architecture for CORBA Asynchronous Messaging," in Proceedings of the Middleware 2000 Conference, ACM/IFIP, Apr. 2000.
[15] A. Gokhale and D. C. Schmidt, "Measuring the Performance of Communication Middleware on High-Speed Networks," in Proceedings of SIGCOMM '96, (Stanford, CA), pp. 306–317, ACM, August 1996.
[16] I. Pyarali, C. O'Ryan, D. C. Schmidt, N. Wang, V. Kachroo, and A. Gokhale, "Applying Optimization Patterns to the Design of Real-time ORBs," in Proceedings of the 5th Conference on Object-Oriented Technologies and Systems, (San Diego, CA), USENIX, May 1999.
[17] Object Management Group, The Common Object Request Broker: Architecture and Specification, 2.4 ed., Oct. 2000.
[18] R. Rajkumar, M. Gagliardi, and L. Sha, "The Real-Time Publisher/Subscriber Inter-Process Communication Model for Distributed Real-Time Systems: Design and Implementation," in First IEEE Real-Time Technology and Applications Symposium, May 1995.
[19] C. Ma and J. Bacon, "COBEA: A CORBA-Based Event Architecture," in Proceedings of the 4th Conference on Object-Oriented Technologies and Systems, USENIX, Apr. 1998.
[20] Y. Aahlad, B. Martin, M. Marathe, and C. Lee, "Asynchronous Notification Among Distributed Objects," in Proceedings of the 2nd Conference on Object-Oriented Technologies and Systems, (Toronto, Canada), USENIX, June 1996.
[21] Object Management Group, CORBAServices: Common Object Services Specification, Revised Edition, 95-3-31 ed., Mar. 1995.
[22] Object Management Group, Persistent State Service 2.0 Specification, OMG Document orbos/99-07-07 ed., July 1999.
[23] Object Management Group, Security Service Specification, OMG Document ptc/98-12-03 ed., December 1998.
[24] Object Management Group, Transaction Services Specification, OMG Document formal/97-12-17 ed., December 1997.
[25] Object Management Group, Fault Tolerance CORBA Using Entity Redundancy RFP, OMG Document orbos/98-04-01 ed., April 1998.
[26] Object Management Group, Concurrency Services Specification, OMG Document formal/97-12-14 ed., December 1997.
[27] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Reading, Massachusetts: Addison-Wesley, 1995.
[28] D. C. Schmidt, "Strategized Locking, Thread-safe Interface, and Scoped Locking: Patterns and Idioms for Simplifying Multi-threaded C++ Components," C++ Report, vol. 11, Sept. 1999.
[29] P. Jain and D. C. Schmidt, "Dynamically Configuring Communication Services with the Service Configurator Pattern," C++ Report, vol. 9, June 1997.
[30] Object Management Group, Trading Object Service Specification, 1.0 ed., Mar. 1997.
[31] T. Kofler, "Robust iterators for ET++," Structured Programming, vol. 14, no. 2, pp. 62–85, 1993.