A Survey for Self-Healing Architectures and Algorithms
Ibrahim Al-oqily Computer Science Department Hashemite University Zarqa 13115, Jordan
[email protected]
Saad Bani-Mohammad Computer Science Department Al al-Bayt University Mafraq 25113, Jordan
[email protected]
Bassam Subaih Computer Engineering Department Faculty of Engineering Technology Al-Balqa' Applied University Amman 15008, Jordan
[email protected]
Jawdat Jamil Alshaer Computer Information Systems Al-Balqa' Applied University Salt 19117, Jordan
[email protected]
Abstract—Service Specific Overlay Networks (SSONs) have recently attracted great interest and have been extensively investigated in the context of multimedia delivery over the Internet. They are virtual networks constructed on top of the underlying network, and they have been proposed to provide and improve services that traditional networks do not offer to end users. The increased complexity and heterogeneity of these networks, together with ever-changing network conditions and the different types of faults that may occur, make their control and management by human administrators more difficult. The self-healing concept was therefore introduced to handle these changes and to assure highly reliable and dependable network performance. Self-healing aims at ensuring that the service continues to work regardless of defects that might occur in the network. This paper surveys the literature in the area of self-healing overlay networks and presents their basic concepts, requirements, and architectures. In addition, we present a proposed self-healing architecture for multimedia delivery services. Our proposed solution is oriented toward discovering new approaches for monitoring, diagnosing, and recovering services, thus achieving self-healing.
Keywords: self-healing; service specific overlay network; overlay network; quality of service; autonomic computing.
978-1-4673-1591-3/12/$31.00 ©2012 IEEE
1. INTRODUCTION
A service specific overlay network (SSON) [1] is an overlay network built on top of the physical network. It is designed to provide end-to-end quality of service guarantees in the Internet and to facilitate the creation and deployment of value-added functionality for a service without needing support from the underlying network. It consists of Media Ports (MPs), Media Servers (MSs), and Media Clients (MCs). MPs are intermediate nodes that perform service-specific data forwarding and control functions in order to enable correct delivery of the media to the end user as required. The MS is the provider of the requested service, and the MC is the node that requests the service. However, designing services to meet users' specific requirements implies that a huge number of services will exist in the network, so managing them is not an easy task.
Management complexity can be tackled by using the self-management concept introduced by IBM [2]. IBM introduced this concept through Autonomic Computing (AC) to enable systems to manage themselves according to administrative objectives. The term autonomic is inspired by the autonomic nervous system in human biology. AC simplifies the design and development of systems that can adapt themselves to changes in their environment in order to meet requirements of performance, fault tolerance, reliability, and security with minimum human intervention. The result is lower management cost and a reduction in the time and skills required to perform management tasks. Hence IT professionals can focus on improving their overall services rather than on managing them.
AC divides self-management into four functional areas [3, 4]: 1) self-configuration, where an autonomic system should be able to configure its components automatically to adapt them to varying conditions; 2) self-healing, where an autonomic system should be able to detect, diagnose, and repair problems resulting from failures in hardware and software; 3) self-optimization, where an autonomic system should be able to monitor its operations and seek ways to improve them in order to ensure optimal functioning; and 4) self-protection, where an autonomic system should be able to detect, identify, and protect its resources from malevolent attacks and cascading failures. In this paper, self-healing systems for overlay networks are surveyed. Self-healing is the property that allows a system to perceive that it is not working properly and, without human intervention, to make the adjustments needed to restore the services affected by a failure in a manner that is seamless to the end systems.
Defects may occur because overlay nodes join or leave the network, overlay links become congested, or routing information changes. This is in addition to the rapid evolution of overlay network technologies and the variety of proposed schemes that are based on self-healing. As the number of mobile users increases, the demand for self-healing overlay services will also increase. This paper serves to capture a snapshot of current design trends and techniques in self-healing overlay networks. The goal is not to compare one solution with another, but to identify the common design goals and put them in context. In this paper we also propose a self-healing mechanism for services that are built to deliver media and designed to meet users' particular requirements. Our proposed solution is oriented toward discovering new approaches for monitoring, diagnosing, and recovering services, thus achieving self-healing.
The rest of this paper is organized as follows. Section 2 discusses self-healing architecture models and requirements. Section 3 reviews the proposed self-healing approaches, while Section 4 briefly introduces our proposed overlay self-healing architecture. Section 5 concludes the paper.
2. SELF-HEALING ARCHITECTURE MODELS AND REQUIREMENTS
The architecture and quality requirements of self-healing systems are proposed in [5], which presents a framework that can facilitate both the design and the maintenance of self-healing systems. It identifies the quality requirements for any self-healing system, divided into traditional quality attributes and autonomic-specific attributes. Traditional attributes include reliability and maintainability. Reliability means that the self-healing system should be fault-tolerant, that is, the specified service has to be delivered in spite of the presence of faults, and robust, that is, able to use a suitable recovery technique to restore its normal operation. Maintainability means that the system architecture must be scalable and flexible so that self-managing systems can be modified without breaking them. The autonomic-specific attributes include requirements such as: 1) support for detecting exceptional system behavior, i.e., the ability to monitor and recognize deviations in behavior with respect to quality of service; 2) support for failure diagnosis, i.e., the ability to find the source of failures; and 3) support for testing of correct behavior, i.e., the ability to test and verify that autonomic elements are working correctly.
The authors in [6] have identified a set of requirements that an architectural style for self-healing systems should satisfy: 1) adaptability, which means that the structure or the interactions of a system built in that style should be easy to modify; 2) dynamicity, which means that the system should be able to adapt to changes during execution; 3) robustness, which means that the style must be able to respond effectively to exceptional conditions such as internal failures and external malicious attacks; and 4) awareness, which means that the style should be able to check system performance and identify any violation of the expected performance.
Garlan and Schmerl [7] have proposed a self-healing architecture model for system monitoring, fault detection, and execution of repair actions. Their model concentrates on using a number of external components to monitor the run-time behaviour of the system and on determining whether the system is functioning normally by comparing the monitored values with the properties of an architectural model. A constraint violation triggers an adaptation process when the system deviates from the expected ranges, and the particular repair to apply is chosen on the basis of architectural styles.
Another architecture-based approach to self-healing systems is proposed in [8]. Here, self-healing systems are created on top of a software architecture that uses software components and connectors for repair, so that changes and repairs to a running software system are made at the architecture level. The work presents tools and methods for implementing architecture-based self-healing systems. This approach focuses on describing and executing architectural changes after the system is deployed, so it does not require pre-specified repair operations.
To ensure that the overlay network continues to achieve its goals and objectives and that the service continues to work regardless of any fault that might occur in the network, the self-healing system should satisfy the following requirements: 1) The system should provide support for monitoring its performance and recognizing anomalies with respect to the monitored performance parameters. 2) The system should have the ability to locate the source of a failure. 3) The system should be able to respond to varying conditions and have flexible mechanisms to deal with violations. 4) The system should execute the appropriate mechanisms to bring itself back to its normal state of operation.
3. SELF-HEALING APPROACHES
3.1. Overlay Networks
An overlay network is an application layer network implemented on top of a physical network. The nodes of the overlay are end-hosts of the underlying network.
Their function is to receive and forward packets in an application-specific way, and each host in the overlay is connected to other hosts by logical connections. Many overlay networks can be built on top of the same physical network. Overlay networks are exposed to different types of failures: network nodes may fail, links may become congested, routing information may change over time, and users may join or leave the network dynamically.
Some approaches aim to provide a generic repair mechanism for overlay networks. For instance, the authors in [9] propose a self-repair mechanism to detect and bypass failures of overlay network nodes. The proposed scheme is formulated as an algorithm that operates in three main stages and depends on the existence of the following three major services [10]: 1) a distributed backup service, which is used to restore the state of an overlay node in another overlay node in case that node fails. This state is represented by two basic elements, Accessinfo records and Nodestate records. An Accessinfo record may be used to represent a node ID or a link to a neighboring node in the overlay. A Nodestate record represents any other state that an overlay node is interested in having restored when the node is recovered; 2) a failure detection service, which is used to monitor and detect node failures and to inform the recovery service about them; and 3) a recovery service, which is used to execute the appropriate repair on the failed section of the overlay. It is assumed that an instance of each of these sub-services exists in every overlay node, and that the services can be implemented in a form that allows one overlay to monitor the nodes of another overlay network.
A node enters the first stage of the algorithm when the failure detection service instance informs it that one of its neighbors has failed. The node then tries to discover whether other nodes have failed as well, relying on the information provided by the distributed backup service and the failure detection service. This procedure determines the failed nodes neighboring that node (the “failed section”) and the living nodes bordering this failed section (the “border nodes”). In the next stage a repair coordinator is selected according to specific parameters. Finally, the node that has been chosen as repair coordinator executes the repair using one of two strategies: i) node restoration, by selecting a suitable alternative node on which to restore the state of the failed node; or ii) structured adaptation, by adapting the overlay so that it performs the same functions without the failed nodes. To let overlay nodes interact with their generic self-repair service, the authors suggest using an API (Application Programming Interface) [11]. All management and guidance operations are executed through this API.
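The three stages described above can be illustrated with a minimal Python sketch. It is not the algorithm of [9] itself: the neighbor map stands in for the Accessinfo records held by the backup service, the `alive` map stands in for the failure detection service, and the lowest-ID rule for choosing the coordinator and the `spares` list are assumptions made only for this example.

```python
from collections import deque


def find_failed_section(neighbors, alive, first_failed):
    """Stage 1: walk outward from a failed neighbor through failed nodes only.

    Returns the failed section and the live border nodes around it.
    """
    failed, border = set(), set()
    queue = deque([first_failed])
    while queue:
        node = queue.popleft()
        if node in failed:
            continue
        failed.add(node)
        for peer in neighbors[node]:
            if alive[peer]:
                border.add(peer)      # live node touching the failed section
            elif peer not in failed:
                queue.append(peer)    # another failed node to explore
    return failed, border


def select_coordinator(border):
    """Stage 2: pick one border node (lowest ID is an assumed rule)."""
    return min(border)


def plan_repair(failed, spares):
    """Stage 3: restore onto a spare node if one is left, otherwise adapt."""
    spares = list(spares)
    plan = {}
    for node in sorted(failed):
        plan[node] = ("restore", spares.pop(0)) if spares else ("adapt", None)
    return plan


# Hypothetical five-node chain overlay in which nodes b and c have failed.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
alive = {"a": True, "b": False, "c": False, "d": True, "e": True}

failed, border = find_failed_section(neighbors, alive, "b")
coordinator = select_coordinator(border)
plan = plan_repair(failed, spares=["s1"])
```

In this example the failed section is {b, c}, the border nodes are a and d, and with a single spare node one failed node is restored while the other is handled by adapting the overlay.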
A self-repair mechanism is also built into Chord [12], a distributed hash table (DHT) overlay network designed to be scalable and resilient to node failures. It assigns keys to data items and organizes the nodes into a graph that maps each data key to a node. Each node maintains a list of the successor nodes that follow it in a ring structure, and this list is continuously updated. Each node also maintains a pointer to its immediate successor; if that successor fails, the node links to the next live node in the list instead. Furthermore, Chord redundantly stores data on multiple nodes in case a node fails.
Overcast [13] is a tree-based overlay multicast system that employs a self-repair mechanism to maintain its tree structure when one of its nodes fails. Overcast organizes its nodes into a distribution tree rooted at the source, and the tree can efficiently adapt itself to changing network conditions such as congestion and node failures. Each node maintains a list of its ancestors. If its current parent fails, the node loses communication with the network; to restore it, the node uses the list to locate a surviving ancestor and rejoins the tree under that alternative parent.
Another generic repair approach for overlay networks is proposed in [14]. It was developed to ensure and enhance the routing capabilities between overlay nodes in the presence of underlying network failures. The proposed approach is able to reorganize itself automatically and to reconfigure the virtual links between the nodes dynamically, without any human intervention for its maintenance. The authors assume that the network consists of a set of groups, where each group is a set of fully virtually connected nodes. The groups are organized into a chain in such a way that two consecutive groups in the chain have a common node. To obtain efficient and flexible routing, the density of the groups must be high, where the density is the minimum number of physical network element failures that will isolate a node from the other nodes. Achieving this requires reorganizing the groups by splitting high-density groups and letting their members join the low-density groups to increase their density.
Self-Initiated and Self-Maintained Overlay Networks (SIMON) [15] were proposed to provide additional enhancement features such as multipath routing. A SIMON consists of a set of domains, where each domain has two servers: a Local Group Server (LGS), which maintains a list of its local domain members, and a Domain Virtual Network Service (DVNS), which translates a SIMON name into an IP address. In addition, a Global Group Server (GGS) is used to manage the work of the local group servers in each domain. To maintain connectivity between the member nodes and their server, the LGS sends a probe message to its members at a pre-specified interval. If a member node does not receive a probe message from its LGS because of a server failure or a link failure, it executes the following steps. First, every node sets a timeout timer to a predefined value, tries to declare itself as an LGS (some member nodes have the ability to act as a server), and sends a probe message to other member nodes. Any node that receives the probe message stops its timer and responds to the message by reporting to the new LGS and forwarding the probe message to its neighbours.
The node that declares itself as an LGS also informs the GGS of that declaration. If no other new LGS is reachable by the GGS, the GGS accepts the request. Otherwise, the GGS responds to the request by sending the IP address of the existing LGS. If the timer expires and no new LGS has been declared, the member node contacts the DVNS to find an appropriate LGS.
Moreover, there are several self-healing techniques that are used in other fields, such as self-healing wireless sensor networks, mobile ad hoc networks, and web services. All of these algorithms assume specific parameters drawn from their targeted environments, which renders them unsuitable for our proposed environment, because each SSON consists of an MC, an MS, and a set of MPs, and each SSON is responsible for providing a specific service that should meet the user requirements.
4. PROPOSED ARCHITECTURE
In this section we briefly present our proposed self-healing architecture. It is based on our previous work [16], in which we proposed an autonomic overlay network architecture. The proposed self-healing architecture targets applications that involve multimedia delivery services. Our architecture is oriented to service specific overlay network (SSON) interaction between the media client that requests the service and the media server that provides it. The other main components are the media ports (MPs), which allow multimedia data to be processed along the end-to-end media delivery path and which are the components most prone to failures. An SSON must be efficient and deliver the best possible service to the end user. Because the SSON is dynamic, the behavior of each component needs continuous monitoring.
As shown in Figure 1, the proposed self-healing architecture mainly consists of: 1) a system monitoring unit, which is responsible for detecting failures that have an impact on network and service performance; 2) a fault management module, which is responsible for evaluating the data provided by the system monitoring unit and consists of a Quality of Service (QoS) analysis unit, a node failure analysis unit, and a node joining analysis unit; and 3) a recovery unit, which selects and executes a set of actions that bring the system back to the normal state.
Figure 1. Overlay services self-healing system architecture (blocks: System Monitoring; Fault Management, comprising QoS Analysis and Node Failure/Join Analysis; System Recovery)
The system monitoring unit continuously monitors the behaviour of each component in the network and detects any event that has a negative effect on the service (presence of faulty nodes, node joining process, etc.). It is also used for collecting and storing the values of relevant QoS parameters such as end-to-end delay, throughput, and delay jitter. We assume that the system monitoring unit is handled by the Service Specific Overlay Network-Autonomic Manager (SSON-AM), which is the node responsible for managing the overlay service. The system monitoring unit alerts the fault management module in the following cases:
1. When it detects QoS degradation in the overlay service.
2. When it detects faulty nodes.
3. When it detects a node joining.
The first function of this unit is to detect failed nodes. There are two kinds of node failure. The first is selective departure, in which the node informs the SSON-AM before leaving the network; the departure is reported to the fault management unit, which takes the necessary actions to correct the situation. The second is sudden node failure resulting from a node malfunction or a system failure. For this case, we suggest the periodic transmission of keep-alive packets (a heartbeat message mechanism) to indicate the aliveness of each node in the overlay service. The SSON-AM periodically sends an “are you alive” message to the nodes participating in the service. If it does not receive a reply (i.e., an ACK message) from a node after a certain number of consecutive probe messages, it considers the node to have failed.
The next function of the system monitoring unit is to detect violations of QoS constraints. QoS is an important feature when dealing with interactive real-time services. The unit needs to compare the monitored QoS parameters against the expected performance, detect possible QoS degradation, and then have network resources adjusted accordingly to preserve the delivered QoS. The last function of the system monitoring unit is to detect newly joining nodes. The services offered by Media Ports (MPs) are dynamic and change over time. A node that joins the network is directly useful only within its locality, where it can therefore be detected.
Any deviation detected by the system monitoring unit is reported to the fault management module, which evaluates and analyzes the provided data. It reacts to the alert sent by the system monitoring unit by performing the required evaluation and determining whether the system requires a change. It consists of a QoS analysis unit, a node failure analysis unit, and a node joining analysis unit. The QoS analysis unit analyzes the QoS parameters; if there is any violation, the fault management module instructs the recovery unit to take a suitable action. The node failure analysis unit verifies whether the failed node detected by the system monitoring unit is part of another SSON; in both cases the recovery unit is notified of the deviation so that it can execute the proper action. Finally, the node joining analysis unit analyzes the node joining process and determines whether the new node is participating in a service and can improve its performance. The recovery unit is triggered by the fault management module.
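As an illustration of the monitoring behaviour described in this section, the following Python sketch shows how consecutive missed heartbeat replies and QoS threshold checks could be turned into alerts for the fault management module. It is only a sketch under stated assumptions: the class and method names, the threshold of three missed probes, and the QoS limit values are hypothetical and are not taken from the proposed architecture.

```python
from dataclasses import dataclass


@dataclass
class QosLimits:
    """Expected performance of the service (hypothetical units and fields)."""
    max_delay_ms: float
    min_throughput_kbps: float
    max_jitter_ms: float


class SystemMonitor:
    """Sketch of the SSON-AM monitoring role: heartbeats and QoS checks."""

    def __init__(self, limits, max_missed=3):
        self.limits = limits
        self.max_missed = max_missed  # assumed "certain number" of probes
        self.missed = {}              # node id -> consecutive missed ACKs

    def on_probe_result(self, node, acked):
        """Record one "are you alive" round; return an alert or None."""
        count = 0 if acked else self.missed.get(node, 0) + 1
        self.missed[node] = count
        if count == self.max_missed:  # alert once, when the limit is reached
            return ("node_failure", node)
        return None

    def on_qos_sample(self, delay_ms, throughput_kbps, jitter_ms):
        """Compare monitored values with the limits; return an alert or None."""
        violated = []
        if delay_ms > self.limits.max_delay_ms:
            violated.append("delay")
        if throughput_kbps < self.limits.min_throughput_kbps:
            violated.append("throughput")
        if jitter_ms > self.limits.max_jitter_ms:
            violated.append("jitter")
        return ("qos_degradation", violated) if violated else None


monitor = SystemMonitor(QosLimits(150.0, 512.0, 30.0))

# One ACK in the middle resets the counter, so only the last probe alerts.
probe_alerts = [monitor.on_probe_result("mp1", acked)
                for acked in (False, False, True, False, False, False)]

qos_alert = monitor.on_qos_sample(delay_ms=200.0, throughput_kbps=600.0,
                                  jitter_ms=10.0)
qos_ok = monitor.on_qos_sample(delay_ms=100.0, throughput_kbps=600.0,
                               jitter_ms=10.0)
```

Counting only consecutive misses means that a single lost probe does not mark a node as failed, which matches the description of the heartbeat mechanism. Detection of joining nodes is left out because the text does not specify how it is performed.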
Once triggered, the recovery unit provides and adapts the appropriate mechanisms that reflect the required change, and it controls the execution of a set of actions (a repair plan) stored in a policy repository in order to recover from a failure or to prevent QoS degradation. For example, in the case of QoS degradation, the recovery unit executes the necessary mechanism to bypass or recover from the problem by finding an alternative overlay path that meets the requirements of the requested service. Moreover, this unit adapts the appropriate mechanism that can bring the service back to a consistent state after a fault has been detected. Therefore, services must replace the failed node with a new overlay node. To this end, the system requires each node participating in an overlay service to back up its state information in another node. This allows the SSON-AM to restore the backup data into the node that replaces the faulty one.
5. CONCLUSION
Management of overlay networks is becoming more and more difficult, as these networks are exposed to different kinds of faults such as nodes joining and leaving, network congestion, and node and link failures. A self-healing ability is therefore required, which aims at helping systems control themselves autonomously. It is essential to ensure network coverage and continued network functionality. This paper reviewed different design trends for self-healing overlay networks and proposed a self-healing system architecture for overlay networks that are designed specifically for applications that involve multimedia delivery services and that must meet individual users' requirements. In the proposed system, the monitoring, diagnosis, and recovery of overlay services are achieved seamlessly without user intervention. Detailed design factors and the implementation of the proposed system are planned as future work.
7. REFERENCES
[1] S. Schmid, F. Hartung, and M. Kampmann, “SMART: Intelligent multimedia routing and adaptation based on service specific overlay networks,” in Proceedings of Eurescom Summit 2005, Heidelberg, Germany, pp. 69-77, 2005.
[2] IBM Corporation, “An architecture blueprint for autonomic computing,” Autonomic Computing White Paper, Fourth Edition, June 2006.
[3] J. O. Kephart and D. M. Chess, “The vision of autonomic computing,” IEEE Computer Mag., vol. 36, no. 1, pp. 41-50, Jan. 2003.
[4] M. Parashar and S. Hariri, “Autonomic computing: an overview,” in Proceedings of the Unconventional Programming Paradigms, Springer, vol. 3566, pp. 247-259, 2005.
[5] S. Neti and H. Muller, “Quality criteria and an analysis framework for self-healing systems,” in Proceedings of the IEEE International Workshop on Software Engineering for Adaptive and Self-Managing Systems, 2007.
[6] M. Mikic-Rakic, N. Mehta, and N. Medvidovic, “Architecture style requirements for self-healing systems,” in Proceedings of the ACM First Workshop on Self-Healing Systems (WOSS 2002), pp. 49-54, 2002.
[7] D. Garlan and B. Schmerl, “Model-based adaptation for self-healing systems,” in Proceedings of the ACM SIGSOFT Workshop on Self-Healing Systems, pp. 27-32, 2002.
[8] E. M. Dashofy, A. van der Hoek, and R. N. Taylor, “Towards architecture-based self-healing systems,” in Proceedings of the ACM First Workshop on Self-Healing Systems (WOSS 2002).
[9] B. Porter, F. Taiani, and G. Coulson, “Generalized repair for overlay networks,” in Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems, Leeds, UK, 2006.
[10] B. Porter, G. Coulson, and D. Hughes, “Intelligent dependability services for overlay networks,” in Proceedings of Distributed Applications and Interoperable Systems 2006 (DAIS’06), vol. 4025 of LNCS, Bologna, pp. 199-212.
[11] B. Porter, G. Coulson, and F. Taiani, “A generic self-repair approach for overlays,” in Proceedings of the Internet System, OTM 2006 Workshop, pp. 1490-1499.
[12] I. Stoica, R. Morris, D. Karger, M. Frans Kaashoek, and H. Balakrishnan, “Chord: a scalable peer-to-peer lookup service for internet applications,” in Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, ACM, 2001.
[13] J. Jannotti, D. K. Gifford, K. L. Johnson, M. Frans Kaashoek, and J. O’Toole, “Overcast: reliable multicasting with an overlay network,” in Proceedings of the 4th Symposium on Operating Systems Design and Implementation, pp. 197-212, 2000.
[14] L. Baud and B. Bellot, “ROSA: a step towards a global virtual network,” in Proceedings of the IEEE Ultra Modern Telecommunications & Workshops, 2009.
[15] M. Elaoud, A. McAuley, G. Kim, and J. Chennikara-Varghese, “Self-Initiated and Self-Maintained Overlay Networks (SIMON) for enhancing military network capabilities,” in Proceedings of the IEEE Military Communications Conference, vol. 2, pp. 1147-1151, 2005.
[16] I. Al-Oqily, A. Karmouch, and R. Glitho, “An architecture for multimedia delivery over service specific overlay networks,” in Proceedings of the Wireless Sensor and Actor Networks II, IFIP International Federation for Information Processing, vol. 264, pp. 97-112, Springer, 2008.