Timeliness in Auto-Adaptive Distributed Systems

Partha Pal, Rick Schantz, Joseph Loyall
BBN Technologies, Cambridge, MA 02138
{ppal, rschantz, jloyall}@bbn.com

Abstract

Designers of auto-adaptive systems must devise a way to engage the right response at the right time. In order to bring auto-adaptive capabilities to mainstream distributed systems, it must be ascertained that the adaptation architecture is capable of mounting appropriate adaptive responses in a timely manner. In this paper we present an approach to designing adaptation control frameworks that facilitate such time-appropriate adaptations. We have used this approach in designing various adaptation control architectures using the QuO (Quality Objects) adaptive middleware framework.

1. Introduction

Recent advances in distributed systems technology are giving rise to a new breed of sophisticated systems that are flexible in their configuration and agile in their behavior. They monitor and respond to changes in environmental conditions by altering their configuration or their behavior in an effort to provide the best quality of service possible under the circumstances or to defend against malicious attacks. For the most part, the objective is to mount dynamic actions autonomously and to give operators better awareness of the system so that they can apply appropriate additional manual intervention. These new types of distributed systems and their constituent applications are often described as being “auto-adaptive”, indicating the capability ingrained in them to adapt themselves. The scope or range of dynamic modification in an auto-adaptive distributed system can go beyond its runtime operation and behavior. For instance, the way a system is built or configured [1][2] can also be adaptive. However, the present paper is focused largely on adaptation of runtime operation and behavior.

Two major factors motivating our research into adaptive behavior in distributed systems are: 1) Quality of Service (QoS): How the QoS offered by a system can be maintained and managed as conditions in the operating environment change, and as the availability and quality of resources that the system depends on deviate from expected levels. 2) Survival and Defense: How to respond to malicious manipulations of the environment by attackers and how to keep operating through attacks, providing some level of useful service even if it is degraded.

In either case, there is a common timeliness criterion that affects the usefulness of adaptive behavior. If the system adapts too late, the adaptation may not be useful and may even be detrimental. Typical examples in the context of survivability involve defensive reactions such as blocking a port (e.g., the source of attack packets), killing a process (e.g., a maliciously started process), or removing a file (e.g., a Trojan horse). There is little value in blocking the source after a poison-pill (an active attack packet that leaves malicious side effects, as opposed to a passive attack packet, as in scanning) has been admitted into a system, killing a rogue process after it has wiped out parts of the file system, or removing a Trojan horse after it has been executed. Another example, concerning QoS, involves a dynamic mission-planning scenario where the pilot of an aircraft in flight would rather have images of the new flight path sooner than later, even if image quality has to be sacrificed in the more timely version. In this case, delay in obtaining the right image may be directly related to the air asset getting closer to danger. So the system needs to decide on the quality of images at runtime, based on available resources and timeliness requirements.

There is another factor that is intimately related to timely adaptation: determining the correctness or appropriateness of the adaptive response. It may be possible to mount a response quickly, but a hastily taken decision may sometimes be wrong. For instance, one common tendency in defensive reaction is to shut down the suspected process or host. However, a quick shutdown is not always the desirable response to every suspected event, and can lead to self-denial of service. It should also be noted that the same adaptive response may not be appropriate for every occurrence of the same trigger. The auto-adaptive system may in fact need some time to determine an appropriate adaptive response.

The above discussion points to a deeper issue underlying every auto-adaptive system: how to engage in the right adaptation at the right time. In this paper we describe an approach that we took to facilitate time-appropriate adaptations in auto-adaptive distributed systems built using the QuO [3] adaptive middleware framework. The rest of the paper is organized as follows. Section 2 describes the techniques we used. Section 3 presents several examples where these techniques were used. Section 4 describes remaining issues and next steps. Section 5 concludes the paper.

2. Addressing Timeliness and Correctness in QuO

2.1 QuO Background

The distributed object computing (DOC) paradigm, as exemplified by CORBA or Java RMI, is the most advanced, mature, and flexible context available today for the development of large-scale, network-based systems. DOC middleware effectively hides many of the complexities of distributed computing, exposing only the functional interfaces of components. However, many critical applications have strict security, dependability, and real-time performance requirements and must control or react to how services are delivered, not just to what services are delivered. DOC middleware falls short in providing support for these requirements, and hence critical applications are often programmed around the distributed object infrastructure, effectively gaining little or no advantage from the middleware. The problem gets worse when an application is distributed over a WAN, which is inherently more dynamic, unpredictable, and unreliable than a LAN. QuO, a DOC framework, has been developed in an attempt to address these issues, and is being used to build auto-adaptive distributed systems.

The QuO functional path, illustrated in Figure 1, is a superset of the CORBA functional path. The operating regions and service requirements of the application are encoded in contracts, which describe the possible states the system might be in and the actions to take when the state changes. QuO inserts delegates in the CORBA functional path. The delegates project the same interfaces as the stub (client-side delegate) and the skeleton (server-side delegate), but support adaptive behavior upon method call and return. System condition objects provide interfaces to system resources, mechanisms, and managers. They are used to capture the states of particular resources, mechanisms, or managers that are required by the contracts and to control them as directed by the contracts.

Figure 1: The QuO Remote Method Call Model

The delegates, contracts, and system condition objects provide the basic building blocks of QuO’s adaptation control and management. Using these, it is possible to add adaptive behavior in the functional path (in-band adaptation, triggered by invocation of a DOC functional path) or to build dedicated adaptation control objects that manipulate the system independent of any DOC functional path (out-of-band adaptation). In addition to adding application-level adaptive behavior (i.e., between two application objects), in-band adaptation can be used to build dedicated adaptation control objects as well (i.e., by making system objects, special-purpose tools or mechanisms, or other adaptation control objects interact with each other). Both types of adaptation are shown in Figure 1 as red arrows. QuO provides a suite of Quality Description Languages (QDL), code generators, and a runtime kernel for supporting the adaptation control objects [4] [5] [6].
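To make the roles of these building blocks concrete, the following is a minimal Python sketch of in-band adaptation. It is not QuO code and does not use the QDL languages or QuO's generated delegates; the class names (`SystemCondition`, `Contract`, `ImageServerDelegate`), the two regions, and the bandwidth threshold are all hypothetical and chosen only for illustration. The sketch assumes a server-side delegate that consults a contract on each call and degrades the reply when the contract is in a "degraded" region.

```python
class SystemCondition:
    """Hypothetical system condition object: holds one measured value
    (here, available bandwidth in kbit/s)."""

    def __init__(self, value: float):
        self.value = value


class Contract:
    """Hypothetical contract: maps system-condition values to a named region."""

    def __init__(self, bandwidth: SystemCondition, threshold: float):
        self.bandwidth = bandwidth
        self.threshold = threshold

    def region(self) -> str:
        if self.bandwidth.value >= self.threshold:
            return "normal"
        return "degraded"


class ImageServer:
    """Stands in for the application object that serves images."""

    def get_image(self, name: str) -> bytes:
        return b"x" * 1000  # placeholder for a raw image


class ImageServerDelegate:
    """Projects the same interface as ImageServer, but adapts on return."""

    def __init__(self, target: ImageServer, contract: Contract):
        self.target = target
        self.contract = contract

    def get_image(self, name: str) -> bytes:
        data = self.target.get_image(name)
        if self.contract.region() == "degraded":
            # Crude stand-in for compression or scaling: send a quarter of the bytes.
            return data[: len(data) // 4]
        return data


if __name__ == "__main__":
    bandwidth = SystemCondition(500.0)
    delegate = ImageServerDelegate(ImageServer(), Contract(bandwidth, threshold=200.0))
    print(len(delegate.get_image("runway")))
    bandwidth.value = 50.0  # the measured condition changes at runtime
    print(len(delegate.get_image("runway")))
```

The caller uses the delegate exactly as it would use the server object; the adaptation is triggered by the call itself, which is what distinguishes in-band from out-of-band adaptation in the text above.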

2.2 The QuO Approach to Time-Appropriate Adaptation

We adopted a three-prong strategy to facilitate the right response at the right time. First, the system’s adaptive response is structured in a decentralized manner organized in multiple layers. This layered and decentralized adaptation control architecture serves as the foundation for designing a range of adaptive responses with varying scope, and also allows coordination among the individual adaptation control entities in the architecture. Second, we distinguish between adaptive responses that deal with the “symptoms” or perceived effects and adaptive responses aimed at diagnosing and addressing the cause. Symptom-treating responses are mounted quickly, allowing the system to continue operating while we gain time for diagnosing the problem and deciding on the appropriate response. Third, we design the adaptive responses that need to be mounted quickly (without much diagnosis or coordination) in such a way that they can be rendered harmless or nullified by future adaptive responses. This gives the more coordinated adaptations mounted at upper layers a chance to rectify potential missteps at lower layers.

In some sense, the layered and decentralized adaptation control architecture is an example of “divide and conquer”. Instead of designing the adaptation control as a single centralized object, we implement it as a collection of distributed autonomous objects. The layering implies that these objects are organized in a hierarchy like the one depicted in Figure 2.

The lowest layer consists of objects (in the DOC sense) that deliver the adaptation based on local information. These are represented as the circles in Figure 2. These objects encapsulate or interface with tools and mechanisms such as intrusion detection systems, firewalls, schedulers, or network management systems. Typically, such an object interjects out-of-band adaptation into the tool or mechanism it encapsulates or interfaces with. In many cases, the adaptive response is to report an observation up a layer or to execute a command from the upper layer. In some cases, the object may perform both, and may have simple logic to react to observed symptoms.

The intermediate layer consists of individual objects (denoted as the octagons in Figure 2) that coordinate a small number of lower layer objects, one of which is typically a dedicated observer. The key thing to note here is the inter-object coordination on a fairly localized scale (such as within a single host). In Figure 2, the triangles represent the scope of coordination at this layer.

The topmost layer (denoted in Figure 2 by the parallelograms) consists of objects that operate based on system-wide information. Such objects may be dedicated adaptation control objects (e.g., managing a host) or they may be integrated with application objects (i.e., application-level adaptation performed in-band). These objects may control the layer below, and may also coordinate among themselves. This allows the higher level objects in the system (e.g., application objects like a display server), for the purpose of mounting a useful adaptive response, to interface with
• tools or mechanisms (e.g., a compression filter that takes raw images from the server, runs a compression transform, and pumps the compressed image out to the network) that are part of the application functionality (serving images), as well as
• tools (like firewalls) that are not part of the application functionality.

Figure 2: Layered organization of system-wide adaptation

This layered organization provides the foundation for implementing the second and third parts of our strategy. Not all adaptive responses that are of interest in a distributed system are of equal scope and severity. For instance, one response may restore a deleted file and another may shut down a host. Clearly, the host shutdown has a more severe and more global impact than the file restoration, and while the latter (file restoration) can be mounted as a rapid response, it would not be a good idea to execute a host shutdown without coordinating with other parts of the system. At design time (a manual activity at this stage), it is possible to identify which of the desired adaptive responses can be mounted quickly and which require more coordination. It is our strategy to make the responses that have largely local impact be governed by the encapsulating adaptation control object itself or by adaptation control objects within the same layer. This allows lightweight and local responses like file restoration to be triggered quickly and locally. On the other hand, heavyweight responses with greater impact (such as host shutdown) are governed by upper layer objects that trigger the action only after wider coordination.

In most cases, the adaptive responses that can be governed at the lowest layer (i.e., without much system-wide coordination) respond directly to symptoms that were observed locally, such as a file being deleted or an attack signature being detected at the network interface of the host. Therefore, at design time, when we know what symptoms or undesirable effects are of concern to the system at hand, we first explore adaptive responses that can provide a quick patch or countermeasure for these symptoms. We then explore what other actions need to happen as part of desirable adaptations in the system and, as explained earlier, many of these actions can only be triggered after coordination among various parts of the system. Collectively, the actions thus identified provide a specification of the lowest layer adaptation control objects, which are built to encapsulate these actions. The coordination required to trigger the more heavyweight actions provides a specification for adaptation control objects at the next layer(s); how deep the hierarchy is depends on the complexity of implementing the required coordination. Major responsibilities of adaptation control objects at these upper layers include monitoring and control of lower layer objects, reporting to upper layers, and coordinating with other objects in their own layer.

Monitoring and reporting across the layers make it possible to determine the effectiveness of lower level responses over time, and to reverse the adaptive steps if required. This is important because rapid-response reactions, mounted quickly with incomplete information, may not be the best approach to the problem when viewed from the perspective of the entire system. Therefore, it is our strategy to design the low-level rapid-response adaptations in a way that allows them to be reversed or nullified. Typically, this means encapsulating additional behavior in the adaptation control objects (such as unblocking the port to reverse the blocking action) that was not initially included in the list of desirable adaptive responses. We believe that such a layered and decentralized architecture is essential for any auto-adaptive system.
The adaptation control framework is itself a distributed application, and its construction benefits from the use of advanced distributed middleware. We have used the QuO description languages, code generation tools and runtime kernel to create multiple instances of this adaptation control architecture for different kinds of auto-adaptive applications. Some of these are described next.
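The interplay between the three prongs can be illustrated with a small Python sketch. This is not the QuO implementation; all class names, the port numbers, and in particular the quorum rule used by the top layer are assumptions made for illustration. The lowest layer offers a reversible action (block and unblock), the intermediate layer treats a locally observed symptom immediately and reports it upward, and the top layer later either escalates or nullifies the rapid response using system-wide information.

```python
class PortBlocker:
    """Lowest layer: wraps a (simulated) firewall and offers a reversible action."""

    def __init__(self):
        self.blocked = set()

    def block(self, port: int) -> None:
        self.blocked.add(port)

    def unblock(self, port: int) -> None:
        # The reverse action exists so that an upper layer can undo a hasty block.
        self.blocked.discard(port)


class HostLoop:
    """Intermediate layer: pairs a local observation with a quick local action."""

    def __init__(self, host: str, blocker: PortBlocker):
        self.host = host
        self.blocker = blocker
        self.reports = set()  # observations awaiting review by the upper layer

    def on_suspicious_traffic(self, port: int) -> None:
        self.blocker.block(port)  # treat the symptom right away
        self.reports.add(port)    # report upward so the cause can be diagnosed


class SystemCoordinator:
    """Top layer: acts on system-wide information, so it decides more slowly."""

    def __init__(self, loops, quorum: int):
        self.loops = loops
        self.quorum = quorum  # assumed rule: escalate only if enough hosts agree

    def review(self, port: int) -> str:
        reporting = [loop for loop in self.loops if port in loop.reports]
        if len(reporting) >= self.quorum:
            return "escalate"  # e.g., consider a heavier, coordinated response
        for loop in reporting:
            loop.blocker.unblock(port)  # isolated report: nullify the rapid response
            loop.reports.discard(port)
        return "reversed"


if __name__ == "__main__":
    loops = [HostLoop("host%d" % i, PortBlocker()) for i in range(3)]
    coordinator = SystemCoordinator(loops, quorum=2)
    loops[0].on_suspicious_traffic(8080)
    print(coordinator.review(8080), loops[0].blocker.blocked)
```

The point of the design is that the lower layer never has to wait for the coordinator before acting, and the coordinator never has to live with a lower-layer decision it disagrees with.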

3. Examples

In this section we briefly describe how our strategy of organizing adaptive responses was used in various auto-adaptive distributed systems.

In the APOD [7] project we first made use of the distinction between local rapid response and a more coordinated type of adaptation in a two-layer adaptation control architecture. The lowest layer consisted of sensors (tools and mechanisms that observe) and actuators (tools and mechanisms that carry out actions) with localized scope and capability, instrumented with QuO out-of-band adaptation. This enabled the instrumented tools and mechanisms to interact with the rest of the adaptation control architecture using adaptive behavior such as reporting a suspected host, reporting a corrupt file, restoring a file, blocking a host, periodically restoring the ARP cache, or changing the IP address and port number of key application services. The upper layer consisted of rapid-response tactics that paired local observations (by a sensor) with suitable local actions (by an actuator), such as restoring an uncorrupted version of a file or blocking the suspect host. Such tactics were implemented as individual adaptation control objects in this layer. Some of these objects were designed to reverse (when needed) the actions of the lower layer adaptation control objects, for example by unblocking a host. Some of these objects were also designed to coordinate with their counterparts on other hosts in the system. If symptoms persisted, these objects would coordinate among hosts to decide whether more drastic measures, such as quarantining a single host or a set of hosts, were necessary. In addition, some objects in this layer were responsible for migrating the necessary components away from the quarantined host(s). Restoring files or the ARP cache are examples of treating the symptoms, while intelligent blocking, periodic blocking and unblocking, or quarantining attempt to address their cause.

In ITUA [8] we enhanced the two layers from APOD and added a corruption-tolerant third layer. Thus the ITUA adaptation control architecture consists of the instrumented sensors and actuators at the bottom layer; rapid-response loops, which are enhancements of the tactic implementation from APOD, at the middle layer; and ITUA managers at the top layer. There are multiple loops in a host, and one ITUA manager per host in this architecture. The ITUA managers are dedicated adaptation control objects that implement sophisticated consensus-based coordination algorithms capable of tolerating corrupt behavior of one or more managers. Among the various adaptive decisions the managers can make are when to kill a replica (a copy of an application process), when and where to start a replica, and when to isolate a host. Note that these actions have significant impact on the ongoing operation of the system, which is why they need system-wide coordination. In APOD, the upper of the two layers was responsible for controlling sensors and actuators as well as for some system-wide coordination. In ITUA, these are separated to add more fidelity and control: the loops control sensors and actuators, while the managers perform host-wide (i.e., among the various loops in the host) and system-wide (i.e., among the managers on different hosts) coordination. This separation is also useful in keeping the more expensive consensus algorithms of the managers from impacting local adaptive responses that need quicker engagement.

In DPASA [9] we are designing and evaluating adaptive defense for a military system under development. The adaptation control architecture is much more sophisticated than the layered organization described above, and can be described as being multilayered along multiple dimensions. The DPASA defense strategy is to integrate protection, detection, and reaction, and therefore the adaptation control also needs to interface with various protection mechanisms embedded in the OS (e.g., domain enforcement) or in the network (e.g., hardware interface cards) as well as with intrusion detection mechanisms (both host and network based). As in ITUA, adaptive responses here also utilize redundancy at various system layers (replicas, redundant hosts, redundant paths, redundant protocols, etc.), and also take steps to deal with corruption of key adaptation control elements.

The PCES Open Experimental Platform (OEP) [10] provides a platform for implementing our three-prong strategy in a context different from adaptive security and cyber-defense. This project is exploring the use of auto-adaptive capabilities in an OEP that simulates the missions of collections of UAVs (Unmanned Aerial Vehicles), including surveillance, i.e., distribution of imagery data to Command and Control (C2) nodes. Local adaptive behaviors include CPU management, bandwidth management, data filtering, compression, tiling, and scaling. These are engaged in response to local (single path) conditions.
A second layer of contracts coordinates these local behaviors by determining what the bottleneck is (e.g., which links, which CPUs), the mission state of the UAV (e.g., surveillance or target tracking, which determines what tradeoffs we can make), and the interactions with other nodes. Based on this information, the coordination layer then determines what adaptation (for example, tiling or compression) is necessary at which part of the system (for example, in the UAV, the control station, or the C2 node). We are currently designing a policy layer that can encapsulate the tradeoffs of the overarching multi-vehicle mission and determine adaptive policy for the coordination layer.
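The decision made by such a coordination layer can be pictured as a mapping from (bottleneck, mission state) to (adaptation, location). The Python sketch below shows only the shape of that mapping. The specific rules in it are invented for illustration and are not the rules used in the PCES OEP; the function name, the `Mission` enum, and the string labels are likewise hypothetical.

```python
from enum import Enum
from typing import Tuple


class Mission(Enum):
    SURVEILLANCE = "surveillance"
    TARGET_TRACKING = "target tracking"


def choose_adaptation(bottleneck: str, mission: Mission) -> Tuple[str, str]:
    """Return (adaptation, location) for the observed bottleneck.

    Every rule below is an illustrative assumption, not a rule from the paper.
    """
    if bottleneck == "link":
        # Assumed tradeoff: tracking favors timely frames over detail,
        # so shrink the data at the source; otherwise compress at the source.
        if mission is Mission.TARGET_TRACKING:
            return ("scaling", "UAV")
        return ("compression", "UAV")
    if bottleneck == "cpu":
        # Assumed rule: relieve a CPU bottleneck by processing image tiles
        # at the control station.
        return ("tiling", "control station")
    return ("none", "")


if __name__ == "__main__":
    print(choose_adaptation("link", Mission.SURVEILLANCE))
```

A policy layer of the kind mentioned above would sit on top of such a function and replace or re-weight its rules as the multi-vehicle mission changes.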

4. Ongoing Investigation

Other related issues we are continuing to investigate include:

• Effectiveness: One key question for auto-adaptive systems is their practical utility under actual conditions in a real system. In our context this translates to whether adaptive responses are helpful in managing the QoS of distributed systems or in defending against cyber-attacks. So far, we have established through technology demonstrations [11] and red team experiments [12] that it is feasible to use adaptive responses beneficially in both these situations. A second phase of that evaluation now focuses on questions such as: How effective was the adaptation? Can we adapt quickly enough? Can we react faster than the attacker? We are using experiments to gain further understanding of these issues [13] so that we can enhance our three-prong strategy accordingly.

• Completeness and interference: Issues such as what the best response is, whether a mounted response is sufficient (completeness), whether a given response affects (interferes with) other aspects of the system, and how to cover all the possible cases are also under investigation. Initially, we used ad hoc and offline analysis techniques to identify the right or acceptable responses, and to decide which of these can be mounted locally and which require coordination. These techniques provided only limited exposure to analysis of potential conflicts, dependencies, and interference issues. In the MOBIES project [14] we have started an investigation of model-based techniques to provide automated tools that help in organizing and designing appropriate and more nearly optimal adaptive strategies.

• Self protection: Sophisticated adaptive behavior in auto-adaptive distributed systems invariably requires integration of a diverse set of entities (tools, application objects, system objects, resource managers, and mechanisms) and coordination among the adaptation control objects. This requires a considerable amount of distributed communication. Degradation, loss, or corruption in the communication can impact the overall adaptive behavior of the system. In ITUA and DPASA, we are exploring how the adaptation control architecture itself can, by design, be made more robust against such threats.

5. Conclusion

In this paper we first describe an issue that all auto-adaptive distributed systems must deal with: facilitating timely engagement of appropriate adaptive responses. We then present the multi-layered, decentralized approach we took to address this issue in our work in the area of auto-adaptive distributed systems. Various instantiations of our approach are then presented. We show how the three-prong strategy (having a range of adaptive responses, treating symptoms first and following with a more coordinated response, and designing low-level, symptom-treating adaptive responses that can either be reversed or become irrelevant in the next step) was used in different projects. We also point out some of the related issues that we are continuing to investigate. We believe that the layered and decentralized adaptation control framework has general applicability both in terms of what the auto-adaptive capability is trying to achieve (e.g., survival, QoS management) and in the underlying enabling technology (e.g., QuO).

6. Acknowledgement

This research was funded in part by DARPA under contracts No. F30602-99-C-0188 and No. F30602-00-C-0172. The authors wish to thank the sponsors for their support, and the rest of the QuO group for their technical contributions.

7. References

[1] F. Ogel, B. Folliot, and I. Piumarta, “On Reflexive and Dynamically Adaptable Environments for Distributed Computing”, Proc. of the DARES 2003 Workshop, IEEE, Providence, RI, May 2003, pp. 112-117.
[2] E. Wohlstadter, S. Jackson, and P. Devanbu, “DADO: Enhancing Middleware to Support Cross-Cutting Features in Distributed, Heterogeneous Systems”, Proc. International Conference on Software Engineering, Portland, Oregon, May 2003.
[3] J. Zinky, D. Bakken, and R. Schantz, “Architectural Support for quality of service for CORBA objects”, Theory and Practice of Object Systems, Vol. 1, No. 3, pp. 55-73, Apr. 1997.
[4] J. P. Loyall, R. E. Schantz, J. A. Zinky, and D. E. Bakken, “Specifying and measuring quality of service in distributed object systems”, Proc. 1st IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 98), April 1998.
[5] J. P. Loyall, D. E. Bakken, R. E. Schantz, J. A. Zinky, D. Karr, R. Vanegas, and K. R. Anderson, “QuO aspect languages and their runtime integration”, Proc. 4th Workshop on Languages, Compilers and Runtime Systems for Scalable Components, May 1998.
[6] R. Vanegas, J. Zinky, J. Loyall, D. Karr, R. Schantz, and D. Bakken, “QuO's Runtime Support for Quality of Service in Distributed Objects”, Proc. IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing (Middleware'98), pp. 15-18, The Lake District, England, September 1998.
[7] APOD project web page: http://apod.bbn.com
[8] ITUA project web page: http://itua.bbn.com
[9] DPASA web page: http://dpasa.bbn.com
[10] PCES UAV OEP web page: http://www.distsystems.bbn.com/projects/AIRES/UAV
[11] News Brief about the WSOA flight demonstration, Aviation Week and Space Technology, January 13, 2003, p. 394.
[12] B. Nelson, W. Farrel, M. Atighetchi, J. Clem, L. Sudin, M. Shephard, and K. Theriault, “APOD Experiment 1 and 2 Final Reports”, Technical Memorandum Nos. 1311 and 1326, prepared for DARPA by BBN Technologies LLC, 2002.
[13] BBN Technologies, “PCES UAV Phase I Experiments Summary Report”, http://www.distsystems.bbn.com/projects/AIRES/UAV/experimentation.
[14] MOBIES project web page: http://www.dist-systems.bbn.com/projects/MoBIES
