RTSJ manuscript No. (will be inserted by the editor)
Fault Tolerant Approaches for Distributed Real-time and Embedded Systems

Paul Rubel1, Matthew Gillen1, Joseph Loyall1, Aniruddha Gokhale2, Jaiganesh Balasubramanian2, Aaron Paulos3, Priya Narasimhan3, Richard Schantz1

1 BBN Technologies, Cambridge, MA 02138, USA
2 Dept. of EECS, Vanderbilt University, Nashville, TN 37235, USA
3 Dept. of ECE, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Received: date / Revised version: date
Abstract

Fault tolerance is a crucial design consideration for mission-critical distributed real-time and embedded (DRE) systems, which combine the real-time characteristics of embedded platforms with the dynamic characteristics of distributed platforms. DRE systems present unique challenges for traditional fault tolerance approaches because of their scale, heterogeneity, real-time requirements, and other characteristics. Most previous R&D efforts in fault tolerance middleware have focused on client-server object systems, whereas DRE systems are more and more frequently being based on component-oriented architectures, which support more complex interaction patterns. This paper describes our current applied R&D efforts to develop fault tolerance technology for the characteristics of DRE systems. The paper makes three contributions to the state of the art in fault tolerance. First, we describe three enhanced fault tolerance techniques that support the needs of DRE systems: a transparent approach to mixed-mode communication, auto-configuration of dynamic systems, and duplicate management for peer-to-peer interactions. Second, we describe an integrated fault tolerance capability for a real-world component-based DRE system that uses off-the-shelf fault tolerance middleware integrated with our enhanced fault tolerance techniques. Finally, we present experimental results showing that our integrated fault tolerance capability meets the DRE system's real-time performance requirements in both the failure-recovery and the fault-free case, and that our enhanced fault tolerance techniques exhibit better, more predictable performance than traditional techniques.

Keywords: Fault tolerance, Mission-critical Middleware

This work was supported by the Defense Advanced Research Projects Agency (DARPA) under contract NBCHC030119.

Contact Author: [email protected]
1 Introduction
A growing class of systems is that of Distributed Real-time and Embedded (DRE) systems, which combine the strict real-time characteristics of embedded platforms with the dynamic, unpredictable characteristics of distributed platforms. As these DRE systems move more and more to the heart
of critical domains, such as defense, aerospace, telecommunications, and healthcare, fault tolerance becomes a critical requirement that must coexist with their real-time performance requirements. However, DRE systems present challenges for many traditional fault tolerance approaches because of the following characteristics:

– DRE systems typically consist of many elements that are independently developed and that have different fault tolerance requirements. This means that any fault tolerance approach must support mixed-mode fault tolerance (i.e., the coexistence of different strategies) and the coexistence of fault tolerance infrastructure (e.g., group communication) and non-fault tolerance infrastructure (e.g., TCP/IP).

– DRE applications are increasingly component-oriented, with calling patterns beyond pure client-server and database transactions, the two calling patterns on which many previous fault tolerance strategies have concentrated. Components frequently use peer-to-peer interactions, in which any component can be a client, a server, or simultaneously a client and a server; event-based interactions, in which a reply is not expected and the destination may or may not be known; and callbacks, which can be invoked in the middle of round-trip calls. Fault tolerance for DRE systems must support components and these calling patterns.

– DRE systems have stringent real-time requirements. These requirements mean several things for fault tolerance solutions. First, any fault tolerance strategy must itself meet real-time requirements with respect to recovery and availability of the individual elements being made fault tolerant. Second, the overhead imposed by any specific fault tolerance strategy on real-time elements must be weighed as part of the selection of a fault tolerance strategy for those elements. Third, the fault tolerance infrastructure must impose no unnecessary extra overhead on elements. For example, elements that are not replicated should not have extra overhead imposed by the fault tolerance infrastructure, if at all possible.

– DRE applications are frequently long-lived and deployed in highly dynamic environments. Fault tolerance solutions should not count on static configurations and should be evolvable at runtime to handle new elements.

– DRE systems frequently require fault tolerance across multiple tiers. Traditional fault tolerance has concentrated on replicating a single tier, i.e., an unreplicated client calling a replicated server that does not make any calls itself before returning a reply to the client. DRE systems typically have more complicated calling patterns, in which a call to a replicated server can be just the beginning of a chain of calls to replicated or non-replicated elements. Fault tolerance support for these multiple tiers of replication can be challenging.

This paper describes several advances we have made in the state of the art in fault tolerance that address the above challenges for component-based DRE systems. The first of these is a new approach to communicating with replicas that
1. Supports the coexistence of non-replicated and replicated elements without imposing fault tolerance infrastructure everywhere.
2. Supports the seamless coexistence of group communication and non-group communication for DRE systems with varying FT requirements.
3. Introduces no extra elements that need to be made fault tolerant.
4. Introduces no extra overhead on non-replicated elements that only communicate with other non-replicated elements.
5. Is transparently inserted into DRE applications, at either the system call or communication middleware layer.
Second, we describe an approach to self-configuration of replica communication, which enables replicas, non-replicas, and groups to discover one another automatically as the number of elements and their fault tolerance requirements change dynamically.

Finally, we describe an approach to duplicate management that supports replicated clients and replicated servers. Suppressing duplicates is a critical aspect of active and semi-active replication strategies. The peer-to-peer and multi-tiered semantics of DRE applications include replicated clients in addition to replicated servers. We describe an approach to duplicate management that enables the handling of duplicates on both the client and server side.

A second contribution of this paper is that we demonstrate these advances in the context of an integrated fault tolerance capability for a real-world DRE system with strict real-time and fault tolerance requirements,
a multi-layered resource manager (MLRM) for shipboard computing systems. The fault tolerance we developed for this context utilizes off-the-shelf fault tolerance and component middleware with the above enhancements; supports a mixture of fault tolerance strategies and large numbers of interoperating elements, with varying degrees of fault tolerance; and has the characteristics mentioned above (multi-tiered, peer-to-peer, and non-round-trip semantics). We then evaluate the ability of the replicated MLRM to meet its real-time and fault tolerance requirements and present an analysis of the performance overhead of our fault tolerance approach.

The rest of the paper is organized as follows: Section 2 describes in more detail some of the challenges of providing fault tolerance in DRE systems and how existing fault tolerance research and approaches address these challenges. Section 3 describes the set of enhancements to fault tolerance that we have developed. Section 4 describes the application of these approaches to the MLRM application for shipboard systems, trading off its fault tolerance and real-time requirements. Section 5 describes experimental results. Finally, Section 6 provides concluding remarks and a discussion of future research.
2 Challenges in Providing Fault Tolerance in DRE Systems
Fault tolerance is an important characteristic to build into mission-critical and robust systems, prompting many years of research into fault tolerance services, strategies, and techniques. DRE systems provide unique challenges
to applying much of this existing work because of their scale, real-time requirements, dynamic configurations, and peer-to-peer and multi-tiered calling semantics. This section describes three particular sets of challenges in applying existing fault tolerance solutions to the needs of DRE systems with the above characteristics:

– Communicating with replicas in large-scale, mixed-mode systems
– Handling dynamic system configurations
– Handling peer-to-peer communications and replicated clients and servers

These are in addition to the general challenges of making critical functionality available throughout the system, within its real-time constraints.
2.1 Communication with groups of replicas

DRE systems provide unique challenges for replica communication due to their scale, varying fault tolerance needs, and stringent real-time requirements. Most solutions to fault tolerance via replication make use of group communication systems (GCS), which ensure consistency between replicas for inter-replica interactions and between replicas and the non-replicated clients or servers they interact with. Various fault tolerance solutions have taken different approaches to using GCS within fault tolerant applications.

Pervasive GCS. Some approaches [1] use GCS for communication throughout the entire system. This approach can provide strict guarantees and ensure that applications interacting with replicas always do so in the correct manner, regardless of their intentions. It can, however, limit the scalability of the resulting systems and add the extra overhead of group communication to elements that do not need it. In very large DRE systems, non-replica communication can be the more common case, and using GCS everywhere can severely impact performance (as we show in Section 5).
Communication during component deployment. Another reason why mixed-mode communication is necessary arises from the way components are deployed. The MLRM system that we describe in Section 4 is built using Lightweight CORBA Component Model (CCM) [2] components. The deployment of components, both when new applications are deployed and when additional replicas are needed (e.g., to replace a replica that has failed), is done using a CORBA-based deployment framework, the Deployment and Configuration Engine (DAnCE) [3]. We need to support CCM-based replicas entering the system at any time. Since the messages related to the deployment of a CCM-based replica are of concern only to the new replica, we need a way to communicate with individual replicas during the bootstrapping phase, without the deployment messages going to existing replicas (which were previously deployed). Thus, replicating components requires the coexistence of non-group communication (during the deployment of a new replicated component) and group communication (once all replicas have been fully deployed).
Gateways. Other systems [4] make use of gateways on the client side that change interactions into GCS messages. This approach enables the coexistence of GCS and non-GCS communications, limits group communication to communication between replicas, and provides the option to use non-GCS communication paths where necessary. It is therefore useful in applications that need group communication for replicas but cannot afford to use group communication everywhere, especially in applications with relatively small numbers of replicated elements compared to non-replicated elements. The gateway approach does come with tradeoffs, however. First, it is less transparent than the pure GCS approach because the gateway itself has a reference that has to be explicitly called. Second, while gateways are a possible source of optimization, providing a way to interact with elements using an efficient transport, they are also a source of potential non-determinism that needs to be managed, since they can introduce state divergence and non-recoverable differences if they are used without a complete understanding of the consequences of their actions. Finally, since gateways introduce extra processes, they can introduce extra overhead (messages need to traverse extra process boundaries before reaching their final destination) and extra elements that need to be made fault tolerant to avoid single points of failure.
Other projects [5] take a hybrid approach where GCS is only used to communicate between replicas and not to get messages to the replicas. This places the gateway functionality on the server side of a client-server interaction, which limits the interactions between replicated clients and replicated servers but has implications for replicating both clients and servers at the same time. It introduces the possibility that lost messages may need to be dealt with at the application level, as they cannot use the guarantees provided by the GCS to send to multiple recipients at the same time.
ORB-provided transports. Some service-based approaches, such as DOORS [6], completely remove GCS from the fault-tolerance infrastructure and use ORB-provided transports instead. By limiting the replication schemes to passive ones, they can address some fault-tolerance concerns without dealing with issues such as duplicate detection and suppression, at the potential cost of complications at the application or ORB layer.
Ad hoc approaches. At the opposite end of the communication spectrum from the all-GCS solution implemented through intercepted middleware is a completely application-level solution, with custom fault-tolerance code added alongside the application code. This places the burden of fault tolerance squarely on the developer and offers the greatest flexibility and efficiency, but also the greatest chance of a one-off, non-reusable solution. Traditionally, many solutions of this type have been built, and they will continue to be built as long as general-purpose fault tolerance infrastructure services remain elusive.
2.2 Configuring FT Solutions

A recurring problem with using GCS in dynamic systems like DRE systems is keeping track of available resources as applications and their supporting infrastructure come and go over the life of a large DRE system. Many existing fault tolerance solutions make use of static configuration files or environment variables [1,4,7]. The DRE systems that we are discussing are highly dynamic, with elements and replicated groups that can come and go and that need to make runtime decisions about issues such as fault tolerance strategy, level of replication, and replica placement. Static configuration strategies lack the flexibility needed to handle these runtime dynamics. Greater flexibility is available in some agent-based systems [8], but for more typical non-agent infrastructures, adding additional FT elements to a running system is not commonly supported.
2.3 Replicated Clients and Servers and Peer-to-Peer Interactions

Support for replicated servers is ubiquitous in fault tolerance replication solutions, whereas support for replicated clients is not as common. The FT-CORBA standard [9], upon which many object-based fault tolerance solutions are based, concentrates on single-tier replication semantics, in which an unreplicated client calls a replicated server, which then returns a reply to the client without making additional calls. Multi-tiered or peer-to-peer invocations are possible, but the standard does not provide sufficient guarantees or infrastructure to ensure that failures, especially on the client side, during
these invocations can be recovered from. A similar situation exists in some service-based approaches [6,10], where peer-to-peer interactions are possible but developers must take care to use the functionality in a safe manner.

In contrast, many applications exhibit more peer-to-peer communication patterns, in which components can be clients, servers, or even both simultaneously. Newer CORBA infrastructure, the CORBA Component Model [2], exhibits peer-to-peer semantics more naturally than it does client-server semantics, since any component can simultaneously be a client (through receptacles that call other components), a server (through facets that export interfaces to be called), and even an event source or sink. Many emerging DRE systems are developed based on component models and exhibit peer-to-peer or application-string (a chain of processes through which data is asynchronously pushed from source to sink) calling structures. Because of this, fault tolerance strategies and solutions based on strict client-server, single-tier replication are of limited applicability.

Multi-tiered peer-to-peer systems require additional functionality above and beyond simply supporting replicated clients and servers. Their interaction patterns are more complex, and handling failures requires additional protocols and interactions to ensure that the effects of a failure do not propagate beyond the near neighbors of the faulty application or system. Research into supporting fault tolerance in multi-tiered applications is still ongoing. Some of the most promising recent work has concentrated on two-tier replication, with the specific topology of a non-replicated client, replicated server, and replicated database [11]. General, unrestricted calling patterns, such as asynchronous calls, nested client-server calls, and even callbacks (where clients also act as servers and can have messages arrive via the callback mechanism while replies from sequential request-reply messages are pending), present tremendous challenges for fault tolerance solutions. This is partially due to the need for fault tolerance to maintain message ordering, reliable delivery, and state consistency, which is harder to do with asynchronous, multi-threaded, and unconstrained calling patterns. It is also due to the fact that the semantics of such calling patterns in the face of replication are more difficult to define.
2.4 An Integrated Fault Tolerant Solution for DRE Systems
Even when solutions to the challenges above are created, they provide only part of the capabilities needed for a complete, integrated, system-wide fault tolerance property. Other aspects of fault tolerance, such as replication strategy, fault detection, recovery, and replica reconstitution, need to be provided as part of the overall system fault tolerance property. Furthermore, these have to be tailored to the particular needs of the system and of the functionality being made fault tolerant, e.g.:

– How available a particular piece of system functionality needs to be, which partially drives the choice of a fault tolerance strategy. Active replication strategies [12] can provide fault masking and constant availability, but use more resources in the fault-free case, while passive replication strategies [13] introduce more recovery time in the face of a failure, but use fewer resources in the fault-free case.

– The real-time requirements for a particular piece of functionality. Fault tolerance will introduce extra latency and overhead. Part of the challenge of providing fault tolerance is to decide where the real-time requirements of the system can tolerate the extra overhead: after a failure, when recovering functionality, or spread through all interactions to avoid extra recovery time when a failure occurs.

– The characteristics of the application, which affect the fault tolerance strategies that are applicable. Characteristics like the presence of non-determinism in a component and the amount and type of state can affect the choice of fault tolerance strategy and the effort required to integrate and apply it.
3 Fault Tolerance Solutions to the Challenges for DRE Systems
In this section, we describe three new fault tolerance advances that we have developed, each of which addresses one of the challenges described in Section 2. First, we describe a Replica Communicator (RC) that enables the seamless and transparent coexistence of group communication and non-group communication while providing guarantees essential for consistent replicas. Next, we describe a self-configuration layer for the RC that enables dynamic auto-discovery of new applications and replicas. Finally, we describe an approach to and implementation of duplicate message management for both the client- and server-side message handling code, in order to deal with peer-to-peer interactions where an application can take on both client and server roles at the same time. After describing each of these solutions, Section 4 describes how we integrated them with other off-the-shelf fault tolerance software to create an integrated, tailored fault tolerance capability for a specific DRE system.
3.1 The Replica Communicator
In order to provide the GCS underpinnings necessary for maintaining consistent replicas, while at the same time limiting unnecessary resource utilization and not disturbing the delicate tuning necessary for real-time applications, we needed a way to limit the use of group communication to those places in which it was absolutely necessary. Analysis of various replication schemes shows that GCS communication is necessary only when interacting with a replica. That is, only replicas and those components that interact with them need the guarantees provided by group communication. Other applications can use TCP without having to accept the consequences of using group communication, whose benefits are not needed in their case.

There are several advantages to limiting the use of GCS to only those places in which it is needed. First, GCS introduces a certain amount of extra latency overhead and message traffic that is not needed in the non-replica case and, in fact, can jeopardize real-time requirements. Second, many off-the-shelf GCS packages, such as Spread [14], have built-in limits on their scalability and simply do not scale to the size of the DRE applications that we are targeting (with thousands of distributed processes). Finally, as described in Section 1, many of the components of our targeted DRE systems are developed independently. Since the non-replicated case is the prevalent one (most components are not replicated), retrofitting these components onto GCS, with the subsequent testing and verification, would be a tremendous extra effort for no perceived benefit.

Therefore, we developed a new capability, called a Replica Communicator, with the following features:

– The RC supports the seamless coexistence of mixed-mode communications, i.e., group communication and non-group communication (specifically, in our prototype, TCP).
– It introduces no new elements into the system.
– It can be implemented in a manner transparent to applications.

The performance benefits of the RC approach can be seen in Section 5, where a single client interacts with a single server over TCP and GCS. In the case where TCP suffices, it can be used to provide significantly faster performance with lower resource utilization.

The RC can be seen as the introduction of a new role in an application, along with the corresponding code and functionality to support it. That is, the application now has three communication patterns:

1. Replicas that only talk to other replicas, which use GCS
2. Non-replicas that only talk to other non-replicas, which use TCP
3. Non-replicas that talk to both non-replicas and replicas, and use an RC to route the communication over the proper protocol.
Fig. 1 Generalized pattern of the Replica Communicator
An abstract view of the RC is illustrated in Fig. 1. Its basic functionality consists of the following pieces:

– Interception of client calls
– A lookup table to hold references to replicas and to non-replicas
– A decision branch that determines whether a call is destined for a non-replica or a replica and treats it accordingly
– A means to send a message to all replicas, e.g., multicast, a loop over all replica references, or GCS
– A default behavior, treating a message by default as one of the branches
– A configuration interface to add references to new servers or to new replicas (to an existing group), or to remove a replica (if it has failed)

Documented as the above pattern, the RC can be realized with multiple implementations. Application-specific implementations can be made in the application logic itself, using aspect-oriented programming or component assembly to insert the RC transparently into the path of client calls, or it can be realized using standardized insertion points, such as library interpositioning to hook into the system at the system-call level or the CORBA-standard Extensible Transport Framework (ETF) [15]. Transparent insertion is important so that applications do not need to be modified for use in an FT situation. This lack of application modification fits with the expectations of users: a component that does not need to be replicated should need only very limited changes, if any, in order to interact in an FT situation.

The RC functionality resides in the same process space as the application. In this way, the RC improves over traditional gateway approaches, because it introduces no extra elements into the system (that would need to be made fault tolerant), limiting the extra overhead associated with integrating fault tolerance. Notice that the RC does not need to be made fault tolerant, since it is a role of a non-replicated client.

We have realized a prototype of the RC pattern in the system described in Section 4 by implementing it using the MEAD framework [1] and its system call interception layer, as illustrated in Figure 2. Client IIOP calls are intercepted by MEAD, which is added to applications at execution time through dynamic loading of libraries. The RC code maintains a lookup table associating IP addresses and port numbers with the appropriate transport, and a group name if GCS is to be used. The default is TCP; if there is no
entry in the lookup table, the destination is assumed to be a non-replicated entity. By keying on the IP and port to distinguish transports, we can use applications and references without requiring that either be modified for use in our system. For replicated entities, the RC sends the message using the Spread GCS [14], which provides totally-ordered reliable multicasting. For replies, the RC remembers the transport used for the call, and returns the reply in the same manner.
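To make the dispatch concrete, the following C++ sketch shows the kind of lookup-and-dispatch logic just described. It is illustrative only: the type and function names (Endpoint, GroupEntry, send_via_gcs, send_via_tcp) are ours, not the actual MEAD or RC interfaces, and the transport calls are stubbed.

```cpp
// Illustrative sketch of the RC lookup-and-dispatch logic described above.
// All names and signatures here are hypothetical, not the actual MEAD/RC API.
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

struct Endpoint {                 // destinations are keyed on IP address and port
  std::string ip;
  std::uint16_t port;
  bool operator<(const Endpoint& o) const {
    return std::tie(ip, port) < std::tie(o.ip, o.port);
  }
};

struct GroupEntry {               // stored only for GCS-reachable destinations
  std::string group_name;         // e.g., a Spread group for the replica set
  bool is_replica;                // used by the self-configuration logic (Sect. 3.2)
};

class ReplicaCommunicator {
public:
  // Configuration interface: add or update the group entry for an endpoint.
  void register_group(const Endpoint& ep, const GroupEntry& entry) { table_[ep] = entry; }

  // Intercepted client call: replicated destinations go over GCS, everything
  // else falls through to plain TCP (the default behavior).
  void send(const Endpoint& dest, const std::vector<char>& msg) {
    auto it = table_.find(dest);
    if (it != table_.end())
      send_via_gcs(it->second.group_name, msg);   // totally ordered multicast
    else
      send_via_tcp(dest, msg);                    // non-replicated destination
  }

private:
  std::map<Endpoint, GroupEntry> table_;          // lookup table keyed on (IP, port)

  void send_via_gcs(const std::string& group, const std::vector<char>& msg) {
    (void)group; (void)msg;  // in the prototype this is a Spread multicast; stubbed here
  }
  void send_via_tcp(const Endpoint& dest, const std::vector<char>& msg) {
    (void)dest; (void)msg;   // plain TCP send to dest.ip:dest.port; stubbed here
  }
};
```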
Fig. 2 The Replica Communicator instantiated at the system call layer
The Replica Communicator was crucial for resolving the problem outlined in Section 2.1, namely that the deployment infrastructure for CCM-based replicas needed a way to communicate with exactly one replica during bootstrapping. We used the RC with our CCM-based active and passive replicas to allow the DAnCE framework to bootstrap a replica while not disturbing the existing replicas.
3.2 A Self-Configuring Replica Communicator
Populating the table distinguishing GCS and TCP endpoints shown in Figure 2 can be done in multiple ways. One way is to set all the values statically at application start-up time using configuration files. However, this leads to static configurations in which groups are defined a priori, and supporting dynamic groups and configurations becomes difficult, if not impossible, and error prone. To better support the dynamic characteristics of DRE systems and to simplify configuration and replica component deployment, we developed a self-configuring capability for the RC.

When a GCS-using element (i.e., a replica or a non-replica RC) is started, it joins a special group used solely for distributing reference information. The new element then announces itself to the other members of the system (shown by the arrows labeled 1 in Figure 3), which add an entry for the new element to their lookup tables. An existing member, in a manner similar to passive replication, responds to this notification with a complete list of system elements in the form of an RC lookup table, shown by the arrow labeled 2. This allows the new element to interact with any existing element using the appropriate communication protocol. The new element blocks until the start-up information is received, to ensure that the necessary information is available when a connection needs to be established (i.e., when the element makes a call). When an element using the RC pattern attempts to initiate a connection, the call might be one that needs to use GCS or one that should not use GCS. Since GCS-using elements always register and are blocked at start-up until they have finished registering, the RC has all the information it needs to initiate the correct connection. If there is no entry for a given endpoint, it means that TCP should be used for that connection. Elements that exclusively use non-GCS communications do not need to register themselves, since TCP is the default.
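The announce/reply exchange can be pictured with the following C++ sketch. It assumes a Spread-style group API behind stubbed helpers; the group name, message handlers, and "designated responder" rule are illustrative rather than the actual protocol.

```cpp
// Minimal sketch of the announce/reply exchange in Fig. 3, assuming a
// Spread-style group API behind the stubbed helpers below.
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

struct RcEntry { std::string group_name; bool is_replica; };
using RcTable = std::map<std::string, RcEntry>;   // keyed by "ip:port"

class SelfConfiguringRc {
public:
  // Run when a GCS-using element (replica or non-replica RC) starts up.
  void bootstrap(const std::string& my_endpoint, const RcEntry& my_entry) {
    join_group("RC_CONFIG");                      // group used solely for references
    multicast_announce(my_endpoint, my_entry);    // step 1: announce ourselves
    std::unique_lock<std::mutex> lk(mu_);
    // Block until an existing member replies with its lookup table (step 2), so
    // the table is populated before this element initiates any connection.
    table_received_.wait(lk, [this] { return have_table_; });
  }

  // Run on existing members when an announcement arrives.
  void on_announce(const std::string& endpoint, const RcEntry& entry) {
    std::lock_guard<std::mutex> lk(mu_);
    table_[endpoint] = entry;
    if (is_designated_responder())                // e.g., the oldest member replies
      multicast_table_snapshot(table_);
  }

  // Run on the new element when a table snapshot arrives.
  void on_table_snapshot(const RcTable& snapshot) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      table_.insert(snapshot.begin(), snapshot.end());
      have_table_ = true;
    }
    table_received_.notify_all();
  }

private:
  RcTable table_;
  std::mutex mu_;
  std::condition_variable table_received_;
  bool have_table_ = false;

  // GCS plumbing elided; these would be thin wrappers over the group system.
  void join_group(const std::string&) {}
  void multicast_announce(const std::string&, const RcEntry&) {}
  void multicast_table_snapshot(const RcTable&) {}
  bool is_designated_responder() const { return true; }
};
```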
Fig. 3 Steps in updating a new RC reference
One complexity that does not affect users, but that needs to be taken into account when developing the self-configuring RC, is that the relationships between elements are not necessarily transitive. Simply because RC1 interacts with replica R via GCS and R also interacts with RC2 via GCS does not mean that RC1 should use GCS to interact with RC2. There is little benefit to using GCS where TCP is acceptable; in contrast, using GCS where it is not needed impacts our real-time performance negatively. In the case of manual configuration this is handled by having a configuration specific to each application. However, in our automated solution it is necessary to do more than note that a given endpoint can be contacted via a given GCS group name. We also need to distinguish the circumstances where GCS is necessary from those where it is not. We accomplish this by adding an extra piece of information that notes whether a reference refers to a replica. This is sufficient to disambiguate the cases and allow replicas (which are aware of their replicated status) and RCs to use the proper transport when interacting with applications whose transports are discovered automatically. Given that interacting with a replica or being a replica are the only two cases where GCS is necessary, an RC knows to use GCS when it is interacting with a replica (and TCP elsewhere), and replicas always use GCS.
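Under those assumptions, the transport choice reduces to a simple rule; the snippet below is a sketch using the illustrative RcEntry type from the previous sketch, not actual implementation code.

```cpp
#include <string>

struct RcEntry { std::string group_name; bool is_replica; };

// GCS is used only when the caller is itself a replica or the callee is one.
bool use_gcs(bool caller_is_replica, const RcEntry* callee_entry) {
  if (caller_is_replica)
    return true;                                   // replicas always use GCS
  // Otherwise use GCS only when the callee is a replica; a missing entry, or an
  // entry for another non-replica GCS user, defaults to TCP.
  return callee_entry != nullptr && callee_entry->is_replica;
}
```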
3.3 Client- and Server-Side Duplicate Management

One step towards a solution for replication in multi-tiered systems is the ability for each side of an interaction to perform both client and server roles at the same time. This is essential to our support of components and allows nested calls to be made without locking up an entire tier while waiting for a response (an approach that can guarantee consistency, but is very limiting). Supporting these dual roles has two aspects.

The first is the ability to send and receive in a non-blocking way. When an application sends out a request, the underlying infrastructure cannot block solely waiting for a reply, as another request may need to be serviced before the first request returns. Blocking semantics are still simulated for the higher-level application.

The second is the ability to distinguish and suppress duplicate messages from both replicated clients and replicated servers. If this suppression is not performed, the applications, which do not know they are replicated, can be confused by receiving the same message multiple times. Solutions that require idempotent operations [16] also solve this problem and can work in multi-tiered situations, but are not satisfactory for use in general DRE systems.
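One way to realize the first aspect, simulated blocking over a non-blocking transport, is sketched below in C++: the caller parks on a future keyed by its request ID while the receive path keeps servicing other messages. The structure and names are ours for illustration and do not reflect the actual MEAD internals.

```cpp
// Sketch only: simulate a blocking call for the application while the layer
// underneath keeps servicing other traffic.
#include <cstdint>
#include <future>
#include <map>
#include <mutex>
#include <vector>

using Message = std::vector<char>;

class RequestBroker {
public:
  // Client path: send the request, then block only this caller on its reply.
  Message call(std::uint64_t request_id, const Message& request) {
    std::future<Message> reply;
    {
      std::lock_guard<std::mutex> lk(mu_);
      reply = pending_[request_id].get_future();
    }
    transmit(request);            // hand off to the transport; does not wait
    return reply.get();           // other requests keep being serviced meanwhile
  }

  // Receive path: complete the caller waiting on this request ID, if any.
  void on_reply(std::uint64_t request_id, const Message& reply) {
    std::lock_guard<std::mutex> lk(mu_);
    auto it = pending_.find(request_id);
    if (it != pending_.end()) {
      it->second.set_value(reply);
      pending_.erase(it);
    }
  }

private:
  void transmit(const Message&) {}   // GCS/TCP send elided
  std::mutex mu_;
  std::map<std::uint64_t, std::promise<Message>> pending_;
};
```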
Duplicate management is integral to applications that act as both clients and servers, whether they form the middle tier of a multi-tiered system or register for and then receive callbacks. One key characteristic necessary to support duplicate management is that messages need to be globally distinguishable, both within an interaction and between multiple interactions. Within a series of interactions, protocols often have sequence numbers to ensure ordered delivery. In the case of CORBA, IIOP makes use of request IDs, and we used these in our solution. Sequence numbers and request IDs can be used to distinguish messages within a single stream from one another. However, when multiple different clients interact with the same group of servers, we also need to differentiate messages based on their source as well as their request ID. Figure 4 illustrates this point: multiple senders send messages with sequence number 1, but the duplicates are suppressed based on both the sequence number and the sender. This allows different clients, such as C in Figure 4, to independently (and correctly) generate the same request ID and not have their messages confused with those from other sources, such as A. Knowing the message ID as well as the sender of the message allows for globally unique message identification and enables duplicate management. An important aspect here is that when a new replica is integrated with existing replicas, it is essential that the message-ID portion of the existing replicas' state is transferred to the new replica. Without this, a new replica could have all of its (non-duplicate) messages dropped incorrectly. We ensure this is not the case.
Fig. 4 Duplicate management during peer-to-peer interactions
Our solution enables duplicate management in the highly dynamic situations typical of DRE and component-based software. Requests and replies are dealt with simultaneously and are unaffected by failures that could reset application-level sequence numbers. We replace the ORB-supplied request ID with a unique and consistent value for each request or reply and distinguish messages upon receipt using both the ID and the sending group. This allows replicas to come and go without introducing any extra messages at the application layer.
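The bookkeeping this implies can be sketched as follows; the class and method names are illustrative, and the real logic resides in the MEAD interception layer rather than in application code.

```cpp
// Sketch of the duplicate-management bookkeeping described above (cf. Fig. 4).
#include <cstdint>
#include <set>
#include <string>
#include <utility>

class DuplicateManager {
public:
  struct State {
    std::set<std::pair<std::string, std::uint64_t>> seen;  // (sender group, request ID)
    std::uint64_t next_id = 0;                             // next outgoing request ID
  };

  // Incoming side: accept a message only the first time its (sender, ID) pair
  // is seen; the copies sent by the sender's other replicas are suppressed.
  bool accept(const std::string& sender_group, std::uint64_t request_id) {
    return st_.seen.insert({sender_group, request_id}).second;
  }

  // Outgoing side: every replica of a group must stamp the same ID on the same
  // logical request, so the counter is replicated state, not a local value.
  std::uint64_t next_request_id() { return st_.next_id++; }

  // The whole structure is transferred to a newly integrated replica so that
  // its messages are not mistaken for duplicates and it does not wrongly drop
  // fresh messages from its peers.
  State snapshot() const { return st_; }
  void restore(const State& s) { st_ = s; }

private:
  State st_;   // a production version would garbage-collect acknowledged IDs
};
```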
4 Implementation of an Integrated FT Capability for a Mission-Critical DRE System
As part of a case study we performed on providing fault tolerance in a real-world DRE system with stringent real-time requirements [17], we implemented a fault tolerance architecture integrating the techniques described in Section 3 with other off-the-shelf fault tolerance software, to make a pre-existing software base called the Multi-Layer Resource Manager (MLRM) fault tolerant. MLRM is a critical piece of system functionality because it deploys mission-critical applications and enables them to continue functioning after failures by redeploying them. When an operator wants to start an application, MLRM is informed about the details of the application, and MLRM picks a physical node on which to run the application. MLRM is a hierarchical system that has layers corresponding to physical divisions. There are three distinct layers of MLRM:
– Node: Provides services (such as Start Application or Stop Application) for a specific node. One of this layer's tasks is to start execution of applications. There are many (hundreds to thousands of) nodes in the system.
– Pool: Provides services and an abstraction layer for physically clustered groups of nodes called Pools. One of this layer's responsibilities is to designate nodes for applications to run on. The experiments we ran consisted of a system with 2-3 pools.
– Infrastructure: Provides the control interface to operators and coordinates the Pool-layer services. One of this layer’s tasks is to designate pools for applications to run in.
The MLRM is implemented using a number of base technologies including CIAO [18], which is a C++ implementation of the Lightweight CORBA Component Model (CCM) [2]; OpenCCM [19], which is a Java implementation of CCM; real-time CORBA ORBs, such as TAO [20]; and Java ORBs, such as JacORB [21].
The MLRM takes care of the fault tolerance of individual applications, by restarting them if they fail (on a different node if the cause of the failure is a node failure). However, our design requirements state that pool failures (entire clusters of nodes) are a possibility. Since the infrastructure-level components are hosted on nodes within one of the pools, the failure of an arbitrary pool could lead to the failure of the infrastructure-level components, rendering the entire system unusable. Therefore, our fault-tolerance focus is on the availability of the infrastructure-level MLRM components and recovery from catastrophic pool failures. The goal was to have an instance of the infrastructure-level components in every pool, so that the failure of any pool would not bring down the infrastructure layer MLRM functionality.
Our fault tolerance solution was influenced by the system requirements and characteristics as follows:
Fig. 5 Target system architecture: the infrastructure layer (EC, IA, BBDB, BB, GRSS) and two pools (Pool 1 and Pool 2), each containing PA, PRSS, RA, and NA components
– The infrastructure layer functionality, since it is critical functionality, needs to be nearly continuously available, motivating fault masking or very rapid recovery.
– The MLRM has to deploy many hundreds or thousands of application components, and there are hundreds or thousands of node-level service components, most of which do not need to be fault tolerant to the same degree. Yet some of these need to communicate with the infrastructure layer components. Existing off-the-shelf fault tolerance software used GCS systems for group management and reliable, ordered multicasting, and provided consistency guarantees that we needed. However, it did not handle mixed-mode communications, and rehosting the entire system over GCS was out of the question from scalability and performance points of view. This motivated using the RC approach described in Section 3.
– The infrastructure components are implemented as CCM components, many of which are both clients and servers, with multiple tiers needing replication. Because of this, we needed both client-side and server-side duplicate management, as described in Section 3.
– The infrastructure components differed in their amount of state and their use of non-determinism. Despite our need for continuous availability, some components simply could not be actively replicated. We had to use mixed-mode replication, with a mixture of active and passive schemes applied where they met the requirements and matched the component characteristics.
The relevant MLRM components and their communication paths are shown in Figure 5. Replicated components are shown as boxes, non-replicated components as ellipses. Communication paths that needed to go over GCS are shown using dotted lines, and regular TCP connections are shown using solid lines.

The Infrastructure Allocator (IA) component makes the top-level application deployment decisions. This component has significant state, but is largely deterministic. We determined that active replication was most appropriate for this component. Active replication provides fault masking, so that there is always a replica available for processing messages, and allows us to avoid state transfers, which in the case of the IA (which had significant state, but infrequent message traffic) reduced the impact of fault tolerance on system performance.

While active replication would seem the appropriate choice wherever possible, the characteristics of other MLRM components made it infeasible. The Global Resource Status Service (GRSS) (which monitors resource usage and system health) and Bandwidth Broker (BB) (which manages bandwidth allocation) elements were written in Java and used JacORB as their CORBA implementation. The fact that JacORB is inherently multi-threaded, together with the internal use of timers by the application logic, means that the GRSS and BB elements are non-deterministic. The GRSS has a very small amount of state, and the BB element is stateless (it uses a backend database to store all state). We determined that passive replication was the best choice for the GRSS and BB elements, as long as we could implement the passive recovery within our real-time requirements (Section 5 shows that we did). The performance trade-off is favorable: with passive replication, overhead on ordinary message traffic is lower, but periodic state transfers are necessary. The state being transferred for these two elements was much smaller (when compared to the state of the IA), but the GRSS receives frequent messages reporting on the health of nodes, processes, and pools.

We used our previously developed MEAD fault tolerance framework [1], extended with our new Replica Communicator and client-side duplicate
suppression, and the Spread [14] group communication system, to implement our active and passive fault tolerance. Spread is a group communication system that provides group membership management and totally ordered, reliable multicast of messages to group members. We used Spread's group membership features to detect failures of replicas.

MEAD is a fault tolerance framework built to support CORBA client-server applications with single-tier replication. As such, it was not designed to handle component-based systems like the MLRM with multi-tier replication, and it presented some of the challenges documented in Section 2. MEAD had several features that provided suitable building blocks for the integrated fault tolerance that we were developing. First, MEAD provides a system-call interception layer that enables its fault tolerance to be transparently inserted into applications. Second, MEAD provides support for state transfer, a necessary feature. Finally, MEAD provides server-side duplicate suppression.

While Spread and MEAD provided good starting points, we had to extend, enhance, and supplement them to provide the integrated solution that we were seeking, in the following ways:
– While Spread provided group communication that could be used mostly "off-the-shelf", its default parameters exhibited non-real-time (very long) detection and recovery times. We tuned its parameters to get the real-time performance that we needed, as described in [17].
– MEAD's interception layer rerouted everything over GCS (Spread), and its implementation was not conducive to mixed-mode communications. At the same time, Spread had a built-in limit on the number of nodes that it could handle (due to the need to deploy a Spread daemon on each node), which was much smaller than the MLRM target. We implemented an instantiation of the Replica Communicator pattern, as described in Section 3, at the MEAD interception layer, enabling Spread to be used exclusively between replicas, the choice of TCP or Spread for non-replicated components talking to replicas (through the RC), and TCP only between non-replicated components.
– MEAD and Spread relied on static configuration files to set up groups. We made this more dynamic using our self-configuring approach described in Section 3.
– We augmented MEAD's server-side duplicate suppression with the implementation of client-side duplicate suppression described in Section 3. This, along with careful matching of the fault tolerance strategy with the application characteristics, provided the support we needed for MLRM multi-tiered replication.
The implementations of replica communication, self-configuration, and client-side duplicate suppression that we created for this application are being rolled back into the next release of the MEAD framework, strengthening its support for component architectures, non-client-server calling patterns, dynamic configurations, and mixed-mode communications.
While the GRSS element used a standard passive replication scheme that broadcast the primary's state on every state change, the BB element used a custom passive scheme. The BB element kept all its state within a MySQL database. We used MySQL's clustering mechanisms (with some customizations by our collaborators from Telcordia) to achieve an actively replicated database. Since the BB element itself had no state, and the MySQL backend was replicating itself using the built-in clustering mechanisms, we were able to use an optimized passive scheme that did not transfer state from the primaries to the secondaries.

None of the pool-level components are replicated, but several must communicate (either as clients or servers) with infrastructure-level components, and therefore must use the Replica Communicator pattern (i.e., PA-1, RA-1, and PRSS-1). Note also that the NA components are not replicated, and use regular TCP connections to communicate with the RA component.
5 Evaluation of the Integrated FT Solution
We measured the performance of our fault tolerance solution both in terms of meeting the real-time recovery requirements and in terms of the impact of the solution on the fault-free performance of the system.

The first metric we collect is failure-recovery time (i.e., the down-time during a failure). We executed two types of failure scenarios to determine the failure-recovery time, described in Sections 5.1 and 5.2, and compare the measured times to our real-time system recovery requirement. These scenarios included the full MLRM, deploying 150 applications on up to 9 physical machines divided into 3 pools.

The second metric we collect is the fault-free overhead (i.e., the extra latency during normal operation introduced by the fault-tolerance software). We compare the "raw TCP" performance of a simple client-server configuration against the same client-server configuration using our fault-tolerance software. These experiments are described in Section 5.3.
5.1 Single Pool Failure Scenario
For our first scenario, we considered a single pool failure, in which a whole pool fails instantaneously, for instance due to a major power failure or the destruction of a hosting facility. We simulated this failure by creating a network partition in such a way that packets sent to hosts in the failed pool would be dropped at a router. We believe that this is an accurate simulation of the failure, and it has the advantage that we are able to determine the exact time the failure occurred (as opposed to trying to obtain a time-stamp for physically pulling a plug).

The experiment evaluated two things: (1) that the mission-critical functionality (i.e., the infrastructure layer MLRM) could survive the failure, and (2) the speed of recovery, with the goal being to recover from the failure in under a second. The recovery times for the actively replicated and passively replicated elements were derived from instrumentation inserted in the fault tolerance software. The recovery time for the database replication was derived from an external program that made constant (as fast as possible) queries on the databases. When the database fail-over occurs, there is a slight increase in the time it takes to do a query, since the database essentially blocks until it determines that one of the replicated instances is gone (and never coming back).
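The external measurement program can be pictured as a simple probe loop like the sketch below, in which issue_query() is a stand-in for the real MySQL client call; the loop reports the worst per-query latency, which approximates the fail-over time.

```cpp
// Illustrative probe loop for the database fail-over measurement: issue queries
// as fast as possible and record the longest per-query latency, which spikes
// while the cluster decides that a failed instance is gone.
#include <chrono>
#include <iostream>

bool issue_query() { return true; }   // replace with a query against the replicated DB

int main() {
  using clock = std::chrono::steady_clock;
  std::chrono::milliseconds worst{0};
  for (int i = 0; i < 100000; ++i) {
    const auto start = clock::now();
    issue_query();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - start);
    if (elapsed > worst) worst = elapsed;
  }
  std::cout << "worst-case query latency (approximate fail-over time): "
            << worst.count() << " ms\n";
  return 0;
}
```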
As illustrated in Figure 6, the MLRM recovered all its functionality well within our real-time requirement. The MLRM elements made fault tolerant using MEAD, Spread, and our enhancements recovered on average in under 130 ms, with a worst-case recovery in less than 140 ms. The database, made fault tolerant using MySQL's clustering techniques, recovered on average within 140 ms, with a worst-case recovery time under 170 ms.
Fig. 6 Experimental Results for Scenario A: time from failure to recovery (ms), minimum, average, and maximum, for the BBDB, GRSS, IA, and BB-frontend elements
5.2 Two Cascading Failures Scenario
Our second scenario introduced cascading failures. The point of this scenario was to show that the fault-tolerant solution was robust. For this scenario we used three pools. We induced a failure on one pool and then, before the recovery was complete, induced a failure on another pool. We evaluated the same things as in the single-pool failure scenario, except that the "recovery time" is based on the time of the first induced failure. As with the single-pool failure's results, the mission-critical MLRM functionality survived the cascading failures. Also, as expected, the recovery times (from the time of the first failure) are about twice those of the single failure, but still well within our real-time requirement, as shown in Figure 7. This is because we induced the second (cascading) fault at the worst possible time with respect to the recovery from the first failure.

Fig. 7 Experimental Results for Scenario B: time from failure to recovery (ms), minimum, average, and maximum, for the BBDB, GRSS, IA, and BB-frontend elements
5.3 Overhead of the Fault Tolerant Software
To determine the fault-free overhead, we performed a series of tests that measured the overhead of the C++/TAO and Java/JacORB versions of our fault-tolerance solution. These tests did not involve the MLRM system, but instead used a simple client-server configuration whose results should generalize to any architecture. Our goal was to compare the latency of using CORBA with raw TCP against the latency of using CORBA with our fault-tolerant middleware (using the Spread GCS and MEAD with our duplicate detection and RC enhancements). Since we were not attempting to measure CORBA marshaling cost, we chose the simplest variable-sized data structure that CORBA provides: a sequence of octets. The client would send an octet sequence of a particular size, and the server would respond with a single byte. To avoid the inaccuracies associated with comparing timestamps from different machines, the round-trip time was measured on the client side.

Fig. 8 Latency of Transport Mechanisms: round-trip time for a 1-byte reply (milliseconds) versus forward packet size (bytes) for TAO-TCP, TAO-FTM, JacORB-TCP, and JacORB-FTM

The results shown in Figure 8 show that our fault-tolerance middleware adds a factor of two to the latency compared to CORBA over TCP. The single-server latency does not tell the whole story: if we did not need replicated servers, then we would not use anything but regular TCP (the whole point of the Replica Communicator). So we also ran the same tests with an actively replicated server. Adding a replicated server using our fault-tolerant middleware version was trivial. To implement the replicated server in the TCP version, we constructed a simple sequential invocation scheme in which, to make a single logical call on 2 replicated servers, the client would make an invocation on server instance 1 and then, after that call returned, make the same invocation on server instance 2. The round-trip time for the TCP case is the sum of the round-trip times for both invocations. We implemented a similar setup for a 3-replica configuration. The results are shown in Figures 9 and 10. In the two-replica case, the results show that the fault tolerance software using GCS performs nearly as well as TCP, introducing very little extra latency for its total-order and consensus capabilities. In the three-replica case, the fault tolerance with GCS performs better than raw TCP.
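The TCP-based comparison point can be sketched as follows; ServerStub and invoke() are stand-ins for the actual CORBA stubs (octet sequence out, single byte back) used in the benchmark.

```cpp
// Sketch of the plain-TCP baseline for a replicated call: one logical request
// is issued to each server replica in turn, and the logical round-trip time is
// the sum of the individual round trips.
#include <chrono>
#include <vector>

struct ServerStub { /* one per replica, connected over plain TCP */ };
void invoke(ServerStub&, const std::vector<unsigned char>&) {}   // stand-in for the call

std::chrono::microseconds replicated_call_tcp(std::vector<ServerStub>& replicas,
                                              const std::vector<unsigned char>& payload) {
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  for (auto& r : replicas)        // sequential invocation: replica 1, then replica 2, ...
    invoke(r, payload);
  return std::chrono::duration_cast<std::chrono::microseconds>(clock::now() - start);
}
```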
5.4 Analysis of Experimental Evaluation
Our results indicate that our FT solution enables fast and bounded recovery in both the single-failure and the cascading-failure scenarios.
Fig. 9 Latency of Transport Mechanisms with 2 replicas: round-trip time for a 1-byte reply (milliseconds) versus forward packet size (bytes) for TAO-TCP and TAO-FTM
Fig. 10 Latency of Transport Mechanisms with 3 replicas: round-trip time for a 1-byte reply (milliseconds) versus forward packet size (bytes) for TAO-TCP and TAO-FTM
The fault-free overhead experiments highlight the importance of the Replica Communicator pattern. For non-replicated configurations (i.e., one client, one server), the latency overhead of using our fault tolerance middleware is a factor of two over plain TCP. If every component in the system were required to use our fault tolerance middleware, the cumulative effect
would be detrimental to many of the real-time applications in our system. This supports our claim that only components that really need to use the fault tolerance infrastructure (i.e., replicas and components that communicate with replicas) should use it, and that using the Replica Communicator to limit the fault tolerance infrastructure to where it is needed provides a benefit to real-time systems. On the other hand, the 2- and 3-replica versions of the same latency experiments show that our fault tolerance middleware performs much more favorably when replication is required. Our fault tolerance middleware performs nearly as well as a TCP-based active replication scheme in the 2-replica case, and as well as or better than it in the 3-replica case, while also providing the extra benefits of totally ordered messages, consistency among replicas, and group management.
6 Conclusions
This paper has described advances we have made in software support for fault tolerance for distributed, real-time embedded systems. Our approach, which proved very successful in this project, was to utilize off-the-shelf fault tolerance software where it was applicable to our needs, customize it where necessary, and develop new capabilities where none existed. In developing new capabilities, we strove to further the following goals:

1. They advance the state of the art of fault tolerance by extending the concepts and defining general principles and approaches.
2. They are consistent with existing fault tolerance approaches and work with existing fault tolerance code bases, while not being limited to those specific code bases.
The three techniques that we presented in this paper – the Replica Communicator, self-configuration for replica communication, and client- and server-side duplicate management – extend existing fault tolerance techniques to make them suitable for component-oriented DRE applications. Yet, they are complementary to, and consistent with, other existing fault tolerance services. To illustrate this, we have instantiated them in the context of two existing off-the-shelf fault tolerance software packages, the MEAD framework and Spread group communication system, and applied them to a real-world DRE example application. Our experiments show that these solutions provide suitable real-time performance in both failure recovery and fault-free cases.
While these capabilities are necessary for fault tolerant DRE systems, they just scratch the surface of some of the difficult challenges in such systems. For example, the semantics of arbitrary nested calls in a multi-tiered replicated architecture are still an open research problem that requires further investigation. Likewise, composing individual fault tolerance elements into an integrated solution (as we did in this project) is a difficult problem in general. In future work, we will continue to explore these and other issues.
References
1. Narasimhan, P., Dumitras, T., Paulos, A., Pertet, S., Reverte, C., Slember, J., Srivastava, D.: MEAD: Support for Real-time Fault-Tolerant CORBA. Concurrency and Computation: Practice and Experience (2005)
2. Object Management Group: Lightweight CCM RFP. realtime/02-11-27 edn. (2002)
3. Deng, G., Balasubramanian, J., Otte, W., Schmidt, D.C., Gokhale, A.: DAnCE: A QoS-enabled Component Deployment and Configuration Engine. In: Proceedings of the 3rd Working Conference on Component Deployment, Grenoble, France (2005)
4. Cukier, M., Ren, J., Sabnis, C., Sanders, W., Bakken, D., Berman, M., Karr, D., Schantz, R.: AQuA: An Adaptive Architecture that Provides Dependable Distributed Objects. In: IEEE Symposium on Reliable and Distributed Systems (SRDS), West Lafayette, IN (1998) 245–253
5. Reiser, H.P., Kapitza, R., Domaschka, J., Hauck, F.J.: Fault-Tolerant Replication Based on Fragmented Objects. In: Proceedings of the 6th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems (DAIS 2006) (2006) 256–271
6. Gill, C.D., Levine, D.L., Schmidt, D.C.: Towards Real-time Adaptive QoS Management in Middleware for Embedded Computing Systems. In: Proceedings of the Fourth Annual Workshop on High Performance Embedded Computing, Lexington, MA, Lincoln Laboratory, Massachusetts Institute of Technology (2000)
7. Narasimhan, P., Moser, L.E., Melliar-Smith, P.M.: Eternal: A Component-Based Framework for Transparent Fault-Tolerant CORBA. Softw. Pract. Exper. 32(8) (2002) 771–788
8. Marin, O., Bertier, M., Sens, P.: DARX: A Framework for the Fault-Tolerant Support of Agent Software. In: Proceedings of the 14th IEEE International Symposium on Software Reliability Engineering (ISSRE 2003) (2003) 406–417
9. Object Management Group: Fault Tolerant CORBA Specification. OMG Document orbos/99-12-08 edn. (1999)
10. Baldoni, R., Marchetti, C., Virgillito, A.: Design of an Interoperable FT-CORBA Compliant Infrastructure. In: Proceedings of the 4th European Research Seminar on Advances in Distributed Systems (ERSADS'01) (2001)
11. Kemme, B., Patino-Martinez, M., Jimenez-Peris, R., Salas, J.: Exactly-Once Interaction in a Multi-Tier Architecture. In: VLDB Workshop on Design, Implementation, and Deployment of Database Replication (2005)
12. Schneider, F.B.: Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys 22(4) (1990) 299–319
13. Budhiraja, N., Marzullo, K., Schneider, F., Toueg, S.: The Primary-Backup Approach. In: Mullender, S.J. (ed.) Distributed Systems, 2nd edn., Chapter 8. ACM Press Frontier Series (1993)
14. Amir, Y., Stanton, J.: The Spread Wide Area Group Communication System. Technical Report CNDS 98-4, Center for Networking and Distributed Systems, Johns Hopkins University (1998)
15. Object Management Group: Extensible Transport Framework for Real-time CORBA, Request for Proposal. OMG Document orbos/2000-09-12 edn. (2000)
16. Dekel, E., Goft, G.: ITRA: Inter-Tier Relationship Architecture for End-to-End QoS. Journal of Supercomputing 28(1) (2004) 43–70
17. Rubel, P., Loyall, J., Schantz, R., Gillen, M.: Fault Tolerance in a Multi-layered DRE System: A Case Study. Journal of Computers (JCP) 1(6) (2006)
18. Wang, N., Schmidt, D.C., Gokhale, A., Rodrigues, C., Natarajan, B., Loyall, J.P., Schantz, R.E., Gill, C.D.: QoS-enabled Middleware. In: Mahmoud, Q. (ed.) Middleware for Communications. Wiley and Sons, New York (2003) 131–162
19. Universite des Sciences et Technologies de Lille, France: The OpenCCM Platform. corbaweb.lifl.fr/OpenCCM/ (2003)
20. Schmidt, D.C., Natarajan, B., Gokhale, A., Wang, N., Gill, C.: TAO: A Pattern-Oriented Object Request Broker for Distributed Real-time and Embedded Systems. IEEE Distributed Systems Online 3(2) (2002)
21. Brose, G.: JacORB: Implementation and Design of a Java ORB. In: Proceedings of DAIS'97, IFIP WG 6.1 International Working Conference on Distributed Applications and Interoperable Systems (1997) 143–154